# Grammar Scoring Engine from Speech Audio

This project builds a machine learning model that automatically scores spoken language samples for grammar quality. The dataset consists of audio files labeled with **MOS Likert Grammar Scores (0 to 5)**.

🧠 **Objective:** Extract acoustic features from the audio files (MFCCs and chroma) and train a regression model to predict grammar scores. Predictions are evaluated using **Pearson Correlation**.

---


In [44]:
import librosa
import numpy as np

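def extract_features(file_path, n_mfcc=13):
    """Return a fixed-length vector of MFCC and chroma statistics for one audio file."""
    # sr=None keeps the file's native sampling rate instead of resampling
    y, sr = librosa.load(file_path, sr=None)

    # MFCCs have shape (n_mfcc, n_frames); pool over frames with mean and std
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_mean = np.mean(mfcc.T, axis=0)
    mfcc_std = np.std(mfcc.T, axis=0)

    # Chroma has 12 pitch-class bins per frame; pool the same way
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_mean = np.mean(chroma.T, axis=0)
    chroma_std = np.std(chroma.T, axis=0)

    # 13 + 13 + 12 + 12 = 50 values with the default n_mfcc
    return np.concatenate([mfcc_mean, mfcc_std, chroma_mean, chroma_std])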


## Importing Required Libraries
We use libraries like `librosa` for audio feature extraction, `pandas` for data handling, and `scikit-learn`/`xgboost` for model training.


## Loading Dataset
We read the training CSV, which lists each audio filename with its grammar score. The filenames locate the audio for feature extraction, and the scores serve as regression targets.


In [45]:
import os
import pandas as pd
from tqdm import tqdm

# Absolute paths on the author's machine; point these at your local copy of the dataset
train_csv_path = r"C:\Users\pc\OneDrive\Desktop\project\Grammar Scoring Engine for spoken data samples\shl-intern-hiring-assessment\dataset\train.csv"
train_audio_dir = r"C:\Users\pc\OneDrive\Desktop\project\Grammar Scoring Engine for spoken data samples\shl-intern-hiring-assessment\dataset\audios_train"

train_df = pd.read_csv(train_csv_path)

features = []
labels = []

for file, label in tqdm(zip(train_df['filename'], train_df['label']), total=len(train_df)):
    file_path = os.path.join(train_audio_dir, file)
    try:
        feats = extract_features(file_path)
        features.append(feats)
        labels.append(label)
    except Exception as e:
        print(f"Failed to process {file}: {e}")

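# Stack the per-file vectors into a (n_samples, n_features) matrix
# (numpy is already imported as np in the feature-extraction cell)
X = np.array(features)
y = np.array(labels)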

print("Features shape:", X.shape)
print("Labels shape:", y.shape)


100%|████████████████████████████████████| 444/444 [04:22<00:00,  1.69it/s]

Features shape: (444, 50)
Labels shape: (444,)





## Feature Preparation

We extract features for all training and test audio files and stack them into NumPy arrays for model training; each file yields one 50-dimensional row.


## Audio Feature Extraction

We extract the following features for each audio file:
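- **MFCC Mean & Std**: per-coefficient mean and standard deviation of 13 MFCCs, which summarize the spectral envelope (timbre)
- **Chroma Mean & Std**: per-bin mean and standard deviation of the 12 chroma bins, which summarize energy by pitch class

This combination gives the model timbral and pitch-class information to use as acoustic proxies for grammar quality.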
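Mean and standard deviation pooling is what turns recordings of different lengths into vectors of the same size. The sketch below illustrates this with synthetic matrices standing in for librosa output; the helper name `pool_frames` and the random data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def pool_frames(frame_matrix):
    """Collapse a (n_coeffs, n_frames) matrix into per-coefficient mean and std."""
    return np.concatenate([frame_matrix.mean(axis=1), frame_matrix.std(axis=1)])

rng = np.random.default_rng(0)

# Synthetic stand-ins for librosa output: 13 MFCC rows and 12 chroma rows.
# The two "clips" differ only in their number of frames.
short_clip = [rng.normal(size=(13, 80)), rng.normal(size=(12, 80))]
long_clip = [rng.normal(size=(13, 400)), rng.normal(size=(12, 400))]

short_vec = np.concatenate([pool_frames(m) for m in short_clip])
long_vec = np.concatenate([pool_frames(m) for m in long_clip])

print(short_vec.shape, long_vec.shape)  # (50,) (50,)
```

Both clips produce 50 values regardless of duration, which is why the feature matrix has shape `(444, 50)`. The trade-off is that pooling discards the ordering of frames.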


In [46]:
features = []
labels = []

for file, label in tqdm(zip(train_df['filename'], train_df['label']), total=len(train_df)):
    file_path = os.path.join(train_audio_dir, file)
    try:
        feats = extract_features(file_path)  
        features.append(feats)
        labels.append(label)
    except Exception as e:
        print(f"Failed to process {file}: {e}")

X = np.array(features)
y = np.array(labels)

print("Feature shape:", X.shape)


100%|████████████████████████████████████| 444/444 [04:21<00:00,  1.70it/s]

Feature shape: (444, 50)
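Later cells use `X_test` and `valid_filenames`, which none of the cells shown here define. A minimal sketch of how they might be built follows. The helper name `build_feature_matrix` is hypothetical, and the extractor is passed in as an argument so the sketch runs without audio files; a stand-in extractor is used for the demonstration.

```python
import os
import numpy as np

def build_feature_matrix(filenames, audio_dir, extract_fn):
    """Return (feature matrix, kept filenames), skipping files that fail to load."""
    rows, kept = [], []
    for name in filenames:
        try:
            rows.append(extract_fn(os.path.join(audio_dir, name)))
            kept.append(name)  # keep names aligned with the rows of the matrix
        except Exception as exc:
            print(f"Failed to process {name}: {exc}")
    return np.array(rows), kept

def fake_extractor(path):
    """Stand-in for extract_features: fails on one file, returns zeros otherwise."""
    if path.endswith("broken.wav"):
        raise ValueError("unreadable file")
    return np.zeros(50)

X_demo, kept_demo = build_feature_matrix(
    ["a.wav", "broken.wav", "b.wav"], "audio_dir", fake_extractor
)
print(X_demo.shape, kept_demo)  # (2, 50) ['a.wav', 'b.wav']
```

In the notebook this could be called as `X_test, valid_filenames = build_feature_matrix(test_df['filename'], test_audio_dir, extract_features)`, where `test_df` and `test_audio_dir` are assumed names for the test CSV and its audio folder. Returning the kept filenames alongside the matrix keeps predictions aligned with files when some fail to load.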





## Model Training

We train a **Random Forest Regressor** and an **XGBoost Regressor**, then average their predictions to form a simple equal-weight ensemble.
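The averaging step can be sketched on its own, as below. The helper name `average_predictions` and the toy numbers are assumptions for illustration; they are not model output. Averaging tends to help when the two models make partly different errors, though that is not guaranteed on any given dataset.

```python
import numpy as np

def average_predictions(*prediction_arrays):
    """Element-wise mean of equally shaped prediction arrays."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in prediction_arrays])
    return stacked.mean(axis=0)

# Toy predictions from two hypothetical models for three samples
rf_preds = np.array([3.0, 4.0, 2.0])
xgb_preds = np.array([2.0, 5.0, 3.0])

ensemble = average_predictions(rf_preds, xgb_preds)
print(ensemble)  # [2.5 4.5 2.5]
```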


In [47]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train shape:", X_train.shape)
print("Validation shape:", X_val.shape)


Train shape: (355, 50)
Validation shape: (89, 50)


In [48]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)


In [49]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=200, random_state=42)
xgb_model.fit(X_train, y_train)


## Evaluation Metrics

We use **MAE (Mean Absolute Error)** and **RMSE (Root Mean Squared Error)** for local evaluation. The leaderboard evaluates **Pearson Correlation**, which we aim to improve through feature engineering and ensembling.
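Because the leaderboard metric is Pearson correlation, it is worth computing it locally as well. The sketch below uses `np.corrcoef` on toy arrays; the helper name `pearson_r` and the numbers are assumptions for illustration, not project data.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

# Toy labels and predictions: the predictions are a linear compression
# of the labels toward the middle of the scale (y_pred = 0.5 * y_true + 1.5).
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.0, 2.5, 3.0, 3.5, 4.0])

r = pearson_r(y_true, y_pred)
mae = float(np.mean(np.abs(y_true - y_pred)))
print(f"Pearson r: {r:.4f}, MAE: {mae:.4f}")  # Pearson r: 1.0000, MAE: 0.6000
```

The toy case shows that the two kinds of metric can disagree: predictions that are a positive linear transform of the labels score a perfect correlation while still carrying absolute error. In the notebook, `pearson_r(y_val, val_preds_ensemble)` would give the validation correlation; `scipy.stats.pearsonr` is an alternative if SciPy is available.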


In [50]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

val_preds_rf = rf_model.predict(X_val)
val_preds_xgb = xgb_model.predict(X_val)

val_preds_ensemble = (val_preds_rf + val_preds_xgb) / 2

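mae = mean_absolute_error(y_val, val_preds_ensemble)
# The `squared` argument is deprecated in newer scikit-learn releases,
# so take the square root of the MSE explicitly.
rmse = np.sqrt(mean_squared_error(y_val, val_preds_ensemble))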

print(f"Mean Absolute Error: {mae:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")


Mean Absolute Error: 0.7470
Root Mean Squared Error: 0.9309


In [51]:
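# X_test must hold test-set features produced by extract_features;
# it is not defined in any cell shown here.
test_preds_rf = rf_model.predict(X_test)
test_preds_xgb = xgb_model.predict(X_test)

# Same equal-weight average used on the validation set
final_test_preds = (test_preds_rf + test_preds_xgb) / 2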


## Submission File

We create a CSV file with predictions for the test set. This file is uploaded for leaderboard evaluation.


In [52]:
# Create DataFrame for predictions
submission_df = pd.DataFrame({
    'filename': valid_filenames,
    'mos': final_test_preds
})

# Save to CSV
submission_df.to_csv("submission.csv", index=False)
print("✅ Submission file 'submission.csv' created!")


✅ Submission file 'submission.csv' created!


In [54]:
# Create a DataFrame with predictions
submission_df = pd.DataFrame({
    'filename': valid_filenames,
    'label': final_test_preds
})

# Sanity check: view the first few rows
submission_df.head()


|   | filename | label |
|---|----------------|----------|
| 0 | audio_706.wav | 2.846406 |
| 1 | audio_800.wav | 2.683649 |
| 2 | audio_68.wav | 3.926985 |
| 3 | audio_1267.wav | 3.534893 |
| 4 | audio_683.wav | 3.295017 |


In [55]:
# Save to CSV (no index!)
submission_df.to_csv('final_submission.csv', index=False)


In [61]:
import os
print(os.getcwd())


C:\Users\pc
