# Grammar Scoring Engine Project

This project is a **Grammar Scoring Engine** that processes audio files, transcribes them to text, and evaluates their grammar. It uses both machine learning and deep learning methods to score the quality of grammar in audio transcriptions. Below is a summary of the steps and methodology:

---

## Steps in the Pipeline

1. **Audio Pre-Processing**:
   - Audio files are preprocessed with the `librosa` library: each file is resampled to 16 kHz, leading and trailing silence (more than 25 dB below the peak) is trimmed, and the volume is peak-normalized.

2. **Transcription**:
   - Preprocessed audio files are transcribed into text using OpenAI's `Whisper` model.

3. **Combining with Scores**:
   - Transcriptions are joined to the scores in `train.csv` by filename to create the training dataset.

4. **Cleaning the Transcriptions**:
   - The transcription text is cleaned using regular expressions to remove filler words, non-ASCII characters, and unwanted spaces.

5. **Grammar Checking**:
   - Grammar mistakes in the cleaned transcriptions are counted using the `language_tool_python` library.

6. **Feature Engineering**:
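   - Additional features are generated using `spaCy`: **part-of-speech diversity** (the number of distinct part-of-speech tags), **sentence length** (the number of alphabetic tokens in the transcript), and **stopword ratio** (the fraction of tokens that are stop words).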

7. **Model Training**:
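   - Two base models are trained:
     - **Random Forest Regressor**: Trained on the four engineered features (grammar-issue count, part-of-speech diversity, sentence length, and stopword ratio).
     - **DistilBERT + Ridge**: A pre-trained DistilBERT model (not fine-tuned) produces a `[CLS]` embedding for each transcript, and a Ridge regression model is trained on those embeddings.

8. **Hybrid Ensembling**:
   - A **meta-model** (Ridge regression) is trained on the engineered features plus the predictions of both base models. For the training rows, those predictions come from 5-fold cross-validation, so each row is predicted by base models that were not fitted on it.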
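The idea behind the cross-validated predictions in step 8 can be sketched without any ML library. The snippet below is a minimal illustration, not the notebook's code: `kfold_indices` and `out_of_fold_predictions` are hypothetical helpers, the "model" is a stand-in that predicts the mean of its training targets, the scores are made up, and the folds are contiguous rather than shuffled as in the notebook's `KFold(shuffle=True)`.

```python
from statistics import mean


def kfold_indices(n_samples, n_splits):
    """Split range(n_samples) into n_splits contiguous validation folds."""
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(y, n_splits=5):
    """Predict each target with a stand-in model fitted only on the other folds."""
    oof = [None] * len(y)
    for val_idx in kfold_indices(len(y), n_splits):
        held_out = set(val_idx)
        train_targets = [t for i, t in enumerate(y) if i not in held_out]
        prediction = mean(train_targets)   # "fit" on the training folds
        for i in val_idx:                  # "predict" on the held-out fold
            oof[i] = prediction
    return oof


# Made-up scores, for illustration only.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 3.0, 2.0, 4.0, 5.0, 1.0]
oof_preds = out_of_fold_predictions(scores, n_splits=5)
print(oof_preds)
```

Because every entry of `oof_preds` comes from a model that never saw that row, a meta-model trained on these values sees base-model behaviour closer to what it will see on unseen data. With 355 training rows and 5 folds, each validation fold holds 71 rows and each training part 284.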

---

## Evaluation Metrics (Meta-Model with K-Fold on RF + BERT)

The meta-model was evaluated on a held-out 20% split of the training data (89 of 444 samples) using the following metrics:
- **Mean Absolute Error (MAE)**: `0.7096`
- **Mean Squared Error (MSE)**: `0.7845`
- **Root Mean Squared Error (RMSE)**: `0.8857`
- **Pearson Correlation**: `0.6699`
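As a reference for how these four numbers relate, here is a minimal pure-Python sketch of the metrics. The helper name `regression_metrics` and the sample values are illustrative assumptions; the notebook itself uses `sklearn.metrics` and `scipy.stats.pearsonr`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and Pearson correlation for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)                      # RMSE is the square root of MSE

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(var_t * var_p)
    return {"mae": mae, "mse": mse, "rmse": rmse, "pearson": pearson}


# Made-up values, for illustration only.
y_true = [2.0, 3.0, 4.0, 5.0]
y_pred = [2.5, 2.5, 4.5, 4.5]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The reported values are internally consistent: the square root of the MSE `0.7845` rounds to the RMSE `0.8857`.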

---

## Conclusion

This project combines audio preprocessing, transcription, grammar checking, and machine learning into a grammar scoring engine. On the held-out split, the stacked hybrid model reaches a Pearson correlation of `0.6699` between predicted and true scores, a moderate correlation that improves on Random Forest alone (`0.3556`) and on DistilBERT embeddings with Ridge (`0.6192`).

# Training

## Audio Pre-Processing



In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import librosa    #librosa is used for audio preprocessing
import numpy as np
import os
import soundfile as sf  # for saving audio
from tqdm import tqdm
import pandas as pd

input_path = '/content/drive/MyDrive/Dataset/audios/train'
output_path = '/content/drive/MyDrive/Grammar Scoring Engine/Pre-Processed Data'

os.makedirs(output_path, exist_ok=True)

# iterate over the audio files
for filename in tqdm(os.listdir(input_path), desc = "Preprocessing Audio Files", unit = "file"):
    if filename.endswith('.wav'):  # only process .wav files
        file_path = os.path.join(input_path, filename)

        # load and resample to 16 kHz
        y, sr = librosa.load(file_path, sr=16000)

        # trim leading/trailing silence (more than 25 dB below the peak)
        y, _ = librosa.effects.trim(y, top_db=25)

        # peak-normalise the volume; silent or empty clips are left as-is to avoid dividing by zero
        peak = np.max(np.abs(y)) if y.size else 0.0
        if peak > 0:
            y = y / peak

        # save preprocessed audio
        output_file_path = os.path.join(output_path, filename)
        sf.write(output_file_path, y, sr)

print('\n Successfully Pre-Processed the Audio files')


Preprocessing Audio Files: 100%|██████████| 444/444 [01:08<00:00,  6.49file/s]


 Successfully Pre-Processed the Audio files
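Peak normalization divides every sample by the largest absolute sample value, which is undefined for an all-zero (silent) clip. The sketch below shows the operation on plain Python lists so it runs without audio libraries; `peak_normalize` is a hypothetical helper, not part of `librosa`, and the sample values are made up.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value is 1.0; leave silence unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)   # silent or empty clip: nothing to scale
    return [s / peak for s in samples]


print(peak_normalize([0.1, -0.5, 0.25]))   # loudest sample becomes -1.0
print(peak_normalize([0.0, 0.0]))          # silence is returned unchanged
```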





In [None]:
!python3 -m pip install openai-whisper

## Transcribing using Whisper

In [None]:
import os
import whisper   # OpenAI's Whisper library, used for audio transcription
from tqdm import tqdm
import torch

input_path = '/content/drive/MyDrive/Grammar Scoring Engine/Pre-Processed Data'
output_path = '/content/drive/MyDrive/Grammar Scoring Engine/Transcripts'

if not os.path.exists(output_path):
    os.makedirs(output_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device) #load the whisper model ie: 'base'

#iterating through the files
for filename in tqdm(os.listdir(input_path), desc="Transcribing Audio Files", unit="file"): # wrapping the path for progress record
    if filename.endswith('.wav'):
        file_path = os.path.join(input_path, filename)

        result = model.transcribe(file_path) #transcribing the audio to text
        transcript = result['text']

        base = os.path.splitext(filename)[0]
        with open(os.path.join(output_path, f"{base}.txt"), "w") as f:
            f.write(transcript) #write the transcribed text to respective .txt file
print('\n Successfully Transcribed the Audio Files')


100%|████████████████████████████████████████| 139M/139M [00:00<00:00, 184MiB/s]
Transcribing Audio Files: 100%|██████████| 444/444 [36:52<00:00,  4.98s/file]


 Successfully Transcribed the Audio Files
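Transcribing 444 files took about 37 minutes in this run, so an interrupted Colab session is costly to repeat. One possible way to make the loop resumable is to skip audio files that already have a transcript. The helper below is a hypothetical sketch (`pending_audio_files` is not part of Whisper) and assumes the same naming scheme as the cell above: `<name>.wav` is transcribed to `<name>.txt`.

```python
from pathlib import Path


def pending_audio_files(audio_dir, transcript_dir):
    """Return the .wav files in audio_dir that have no matching .txt in transcript_dir."""
    audio_dir, transcript_dir = Path(audio_dir), Path(transcript_dir)
    done = {p.stem for p in transcript_dir.glob("*.txt")}
    return sorted(p for p in audio_dir.glob("*.wav") if p.stem not in done)


# Possible usage in the transcription loop (paths as defined in the cell above):
# for audio_file in pending_audio_files(input_path, output_path):
#     result = model.transcribe(str(audio_file))
```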





## Combining Transcripts with train.csv

In [4]:
import pandas as pd
from tqdm import tqdm
import os

csv_path = '/content/drive/MyDrive/Dataset/train.csv'
transcript_path = '/content/drive/MyDrive/Grammar Scoring Engine/Training/Transcripts'
output_csv = '/content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv'

# load CSV
df_train = pd.read_csv(csv_path)

# add a new column for transcripts
transcripts = []

# loop through each filename in the CSV and find its transcript
for filename in tqdm(df_train['filename'], desc="Adding Transcripts", unit="file"):
    base = os.path.splitext(filename)[0]
    transcript_file = os.path.join(transcript_path, f"{base}.txt")

    try:  # read the transcript file
        with open(transcript_file, 'r') as f:
            transcripts.append(f.read().strip())  # strip leading/trailing whitespace
    except FileNotFoundError:
        transcripts.append(None)  # no transcript for this file
        print(f'Transcript not found: {transcript_file}')

# add the column to DataFrame
df_train['transcription'] = transcripts

# save the updated CSV
df_train.to_csv(output_csv, index=False)

print(f" \n Successfully saved combined CSV to: {output_csv}")


Adding Transcripts: 100%|██████████| 444/444 [02:28<00:00,  2.99file/s]


 
 Successfully saved combined CSV to: /content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv
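When a transcript is missing, it helps to know which file it was rather than only that something was missing. The following is a small sketch of the same lookup that also returns the list of unmatched filenames; `load_transcripts` is a hypothetical helper, and it assumes transcripts are UTF-8 text files named after the audio file's stem.

```python
from pathlib import Path


def load_transcripts(filenames, transcript_dir):
    """Return (transcripts, missing): one transcript (or None) per filename,
    plus the filenames for which no transcript file was found."""
    transcripts, missing = [], []
    for filename in filenames:
        path = Path(transcript_dir) / f"{Path(filename).stem}.txt"
        if path.is_file():
            transcripts.append(path.read_text(encoding="utf-8").strip())
        else:
            transcripts.append(None)
            missing.append(filename)
    return transcripts, missing


# Possible usage with the DataFrame from the cell above:
# df_train['transcription'], missing = load_transcripts(df_train['filename'], transcript_path)
```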


## Cleaning Transcription

In [5]:
import re     # for text cleaning using patterns (regular expressions)
import unicodedata     # to remove accents from letters

stopwords = ['uh', 'um', 'you know','i mean', 'well', 'exactly', 'yess', 'er']

def clean_text(text):
    if not isinstance(text, str) or text.isnumeric():
        return ""  # non-string values (e.g. NaN) and purely numeric strings become empty

    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # NFKD normalization splits accented letters into base letter + accent;
    # encoding to ASCII with 'ignore' then drops the accents and any other non-ASCII characters

    # remove filler words (the `stopwords` list above)
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    text = re.sub(pattern, '', text)

    # remove unwanted characters
    text = re.sub(r"[^a-z0-9.,?!'\" \n]", '', text)

    # clean up spacing and punctuation
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace into a single space
    text = re.sub(r'\s([.,?!])', r'\1', text)    # drop the space before punctuation
    text = re.sub(r'([.,?!])([^\s])', r'\1 \2', text)    # ensure a space after punctuation

    return text.strip()

df_train['cleaned_transcription'] = df_train['transcription'].apply(clean_text) #cleaned transcription
df_train.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv", index=False)
print(" \n Cleaning complete and saved to combined_scores.csv")


 
 Cleaning complete and saved to combined_scores.csv
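To make the effect of each regular expression concrete, the snippet below repeats the `clean_text` logic from the cell above and applies it to a few made-up sentences. The sample inputs are illustrative assumptions, chosen to show both the intended behaviour and two side effects worth knowing about.

```python
import re
import unicodedata

stopwords = ['uh', 'um', 'you know', 'i mean', 'well', 'exactly', 'yess', 'er']


def clean_text(text):
    if not isinstance(text, str) or text.isnumeric():
        return ""
    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    text = re.sub(pattern, '', text)                      # remove filler words
    text = re.sub(r"[^a-z0-9.,?!'\" \n]", '', text)       # remove unwanted characters
    text = re.sub(r'\s+', ' ', text)                      # collapse whitespace
    text = re.sub(r'\s([.,?!])', r'\1', text)             # no space before punctuation
    text = re.sub(r'([.,?!])([^\s])', r'\1 \2', text)     # one space after punctuation
    return text.strip()


# Intended behaviour: accent removed, filler dropped, spacing repaired.
print(clean_text("I went to the caf\u00e9 , uh yesterday.It was nice!"))
# Side effect 1: the last rule also splits decimal numbers.
print(clean_text("It costs 3.5 dollars"))
# Side effect 2: "well" is removed even when it carries meaning.
print(clean_text("She sings well"))
```

The first call yields `i went to the cafe, yesterday. it was nice!`, the second `it costs 3. 5 dollars`, and the third `she sings`. Whether the two side effects matter depends on how often such cases occur in the transcripts.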


## Checking Grammar (language_tool_python)

In [None]:
!pip install language_tool_python

In [None]:
!sudo apt-get install openjdk-17-jdk

In [8]:
import language_tool_python    #language_tool_python is used to check the grammar of a text input.
from tqdm import tqdm
import pandas as pd

df_train = pd.read_csv('/content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv')
tool = language_tool_python.LanguageTool('en-US')

def check_grammar(text):      # returns the number of issues LanguageTool flags in the text
    if isinstance(text, str):
        matches = tool.check(text)
        return len(matches)
    else:
        return 0

tqdm.pandas()   # registers Series.progress_apply so the loop shows a progress bar
df_train['matches'] = df_train['cleaned_transcription'].progress_apply(check_grammar)    # add a 'matches' column (issue count per transcript)
df_train.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv", index=False) # save with new features
print("\n Grammar checking complete and saved to combined_scores.csv")

Downloading LanguageTool latest: 100%|██████████| 252M/252M [00:15<00:00, 15.9MB/s]
INFO:language_tool_python.download_lt:Unzipping /tmp/tmp_8qbo7cx.zip to /root/.cache/language_tool_python.
INFO:language_tool_python.download_lt:Downloaded https://internal1.languagetool.org/snapshots/LanguageTool-latest-snapshot.zip to /root/.cache/language_tool_python.



 Grammar checking complete and saved to combined_scores.csv


In [None]:
!pip install spacy

## More Features (Parts_of_speech, Sentence_len)

In [9]:
import spacy   # spaCy provides linguistic annotations such as part-of-speech tags and stop-word flags

# load spaCy model
nlp = spacy.load("en_core_web_sm")

# feature extraction function
def extract_spacy_features(text):
    if isinstance(text, str):
        doc = nlp(text)
        pos_diversity = len(set(token.pos_ for token in doc))  # number of distinct part-of-speech tags in the transcript
        sentence_length = len([token for token in doc if token.is_alpha])    # number of alphabetic tokens (words) in the transcript
        stopword_ratio = sum(token.is_stop for token in doc) / (len(doc) or 1)   # fraction of tokens that are stop words
        return pd.Series([pos_diversity, sentence_length, stopword_ratio])
    else:
        return pd.Series([0, 0, 0.0])

# add the new columns to df_train
df_train[['pos_diversity', 'sentence_length', 'stopword_ratio']] = df_train['cleaned_transcription'].apply(extract_spacy_features)

# save with new features
df_train.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Training/combined_scores.csv", index=False)

print(" \n spaCy features added and saved to combined_scores.csv")


 
 spaCy features added and saved to combined_scores.csv
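The three features are simple counts over spaCy's token annotations. The sketch below computes them for one hand-tagged sentence so the arithmetic is visible without loading a spaCy model. The `Token` tuple and `features_from_tokens` are hypothetical stand-ins, and the tags are assigned by hand, so a real spaCy pipeline may label some tokens differently.

```python
from typing import NamedTuple


class Token(NamedTuple):
    text: str
    pos: str        # coarse part-of-speech tag, as in spaCy's token.pos_
    is_stop: bool   # stop-word flag, as in spaCy's token.is_stop


def features_from_tokens(tokens):
    """Compute (pos_diversity, sentence_length, stopword_ratio) for a token list."""
    pos_diversity = len({t.pos for t in tokens})
    sentence_length = sum(t.text.isalpha() for t in tokens)
    stopword_ratio = sum(t.is_stop for t in tokens) / (len(tokens) or 1)
    return pos_diversity, sentence_length, stopword_ratio


# Hand-tagged example sentence: "the cat sat on the mat."
tokens = [
    Token("the", "DET", True), Token("cat", "NOUN", False),
    Token("sat", "VERB", False), Token("on", "ADP", True),
    Token("the", "DET", True), Token("mat", "NOUN", False),
    Token(".", "PUNCT", False),
]
print(features_from_tokens(tokens))
```

For this sentence there are 5 distinct tags, 6 alphabetic tokens, and 3 stop words out of 7 tokens. Note that `sentence_length` counts words across the whole transcript, not per sentence.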


# **Hybrid Method** #
*RandomForest and DistilBERT*

## Training the Model (Random Forest)

In [10]:
#importing all the necessary libraries that are used for model training
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor    # Random Forest regressor, used as the first base model
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import pearsonr
import numpy as np

df_train.head()

# defining the feature columns and the target
x = df_train[['matches', 'pos_diversity', 'sentence_length', 'stopword_ratio']]
y = df_train['label']

# splitting into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# applying the model
model_rf = RandomForestRegressor(random_state=42)
model_rf.fit(x_train, y_train)

#predicting the output
rf_pred = model_rf.predict(x_test)

# calculating error
mae = mean_absolute_error(y_test, rf_pred)
mse = mean_squared_error(y_test, rf_pred)
rmse = np.sqrt(mse)
pearson_correlation, _ = pearsonr(y_test, rf_pred)

print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_correlation:.4f}")

MAE: 0.8886
MSE: 1.2307
RMSE: 1.1094
Pearson Correlation: 0.3556


## Training the Model (DistilBERT)

In [11]:
from sklearn.linear_model import Ridge
from transformers import DistilBertModel, DistilBertTokenizer
import torch
from tqdm import tqdm

# defining the feature columns and the target (preparing the features and label)
x = df_train[['matches', 'pos_diversity', 'sentence_length', 'stopword_ratio']]
y = df_train['label']

# splitting into train and test sets, keeping the row indices to look up the texts
x_train, x_test, y_train, y_test, train_indices, test_indices = train_test_split(
    x, y, np.arange(len(df_train)), test_size=0.2, random_state=42
)

print(f"Training set size: {len(x_train)}")
print(f"Test set size: {len(x_test)}")

# this gets original text for BERT embeddings
texts = df_train['transcription'].tolist()
texts_train = [texts[i] for i in train_indices]
texts_test = [texts[i] for i in test_indices]

# loading pre-trained DistilBERT model and tokenizer
print("Loading DistilBERT model...")
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')


# get sentence embeddings ([CLS] token) from DistilBERT
def get_embeddings(texts, batch_size=8):
    all_embeddings = []

    # replace empty or NaN texts with a placeholder so every row gets an embedding
    cleaned_texts = [
        text if isinstance(text, str) and len(text.strip()) > 0 else "empty text"
        for text in texts
    ]

    print(f"Processing {len(cleaned_texts)} texts...")

    for i in range(0, len(cleaned_texts), batch_size):
        batch_texts = cleaned_texts[i:i+batch_size]

        # simple progress log (with batch_size=8 this fires at every multiple of 40)
        if i % 20 == 0:
            print(f"Processing batch starting at position {i}")

        # tokenize
        encoded_input = tokenizer(batch_texts, padding=True, truncation=True,
                                 max_length=128, return_tensors='pt')

        # get model output
        with torch.no_grad():
            model_output = model(**encoded_input)

        # use CLS token embeddings (first token)
        embeddings = model_output.last_hidden_state[:, 0, :].numpy()
        all_embeddings.append(embeddings)

    return np.vstack(all_embeddings)

# get embeddings for the train and test sets
print("Generating DistilBERT embeddings for training set...")
x_train_bert = get_embeddings(texts_train)
print("Generating DistilBERT embeddings for test set...")
x_test_bert = get_embeddings(texts_test)

print(f"BERT training embeddings shape: {x_train_bert.shape}")
print(f"BERT testing embeddings shape: {x_test_bert.shape}")

# training the model on DistilBERT embeddings
print("Training Ridge regression model on BERT embeddings...")
bert_model = Ridge(alpha=1.0)
bert_model.fit(x_train_bert, y_train)

# making predictions
bert_pred = bert_model.predict(x_test_bert)

# evaluating the predictions against the true labels
mae = mean_absolute_error(y_test, bert_pred)
mse = mean_squared_error(y_test, bert_pred)
rmse = np.sqrt(mse)
pearson_correlation, _ = pearsonr(y_test, bert_pred)

print("\nModel Evaluation Results:")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_correlation:.4f}")

Training set size: 355
Test set size: 89
Loading DistilBERT model...


Generating DistilBERT embeddings for training set...
Processing 355 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing batch starting at position 80
Processing batch starting at position 120
Processing batch starting at position 160
Processing batch starting at position 200
Processing batch starting at position 240
Processing batch starting at position 280
Processing batch starting at position 320
Generating DistilBERT embeddings for test set...
Processing 89 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing batch starting at position 80
BERT training embeddings shape: (355, 768)
BERT testing embeddings shape: (89, 768)
Training Ridge regression model on BERT embeddings...

Model Evaluation Results:
MAE: 0.7431
MSE: 0.8980
RMSE: 0.9476
Pearson Correlation: 0.6192
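The progress log above jumps in steps of 40 even though the check is `i % 20 == 0`. That follows from the batch size: `i` only takes multiples of 8, and a multiple of 8 is divisible by 20 only when it is a multiple of 40. The short sketch below reproduces the logged positions; `logged_positions` is a hypothetical helper written for this illustration.

```python
def logged_positions(n_texts, batch_size=8, log_every=20):
    """Batch start positions at which the progress message is printed."""
    return [i for i in range(0, n_texts, batch_size) if i % log_every == 0]


print(logged_positions(355))   # training split
print(logged_positions(89))    # held-out split
```

For 355 texts this gives the positions 0 through 320 in steps of 40, and for 89 texts it gives 0, 40, and 80, matching the output above.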


## Hybrid Ensembling (The Meta-Model)

In [12]:
# average of RF and BERT predictions
hybrid_pred_avg = (rf_pred + bert_pred) / 2

# evaluating the average of RF and BERT predictions
mae = mean_absolute_error(y_test, hybrid_pred_avg)
mse = mean_squared_error(y_test, hybrid_pred_avg)
rmse = np.sqrt(mse)
pearson_correlation, _ = pearsonr(y_test, hybrid_pred_avg)

print("\nHybrid Model (Average):")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_correlation:.4f}")

# weighted average of RF and BERT predictions (giving more importance to BERT 0.7 and less to RF 0.3)
hybrid_pred_weighted = 0.3 * rf_pred + 0.7 * bert_pred

# evaluating the weighted average of RF and BERT predictions
mae = mean_absolute_error(y_test, hybrid_pred_weighted)
mse = mean_squared_error(y_test, hybrid_pred_weighted)
rmse = np.sqrt(mse)
pearson_correlation, _ = pearsonr(y_test, hybrid_pred_weighted)

print("\nHybrid Model (Weighted Average):")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_correlation:.4f}")


from sklearn.model_selection import KFold

# preparing original features (used for both RF and meta model)
X_train_features = x_train
X_test_features = x_test

# arrays to store predictions from RF and BERT using 5-fold cross-validation
cv_rf_preds = np.zeros(len(X_train_features))
cv_bert_preds = np.zeros(len(X_train_features))

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

print("\nStarting stacking process...")
for train_idx, val_idx in kf.split(X_train_features):
    # splitting features and labels
    X_train_cv = X_train_features.iloc[train_idx]
    X_val_cv = X_train_features.iloc[val_idx]
    y_train_cv = y_train.iloc[train_idx]
    y_val_cv = y_train.iloc[val_idx]

    # get texts for BERT (using training set indices)
    texts_train_cv = [texts_train[i] for i in train_idx]
    texts_val_cv = [texts_train[i] for i in val_idx]

    # train Random Forest and predict on validation
    rf_cv = RandomForestRegressor(random_state=42)
    rf_cv.fit(X_train_cv, y_train_cv)
    cv_rf_preds[val_idx] = rf_cv.predict(X_val_cv)

    # train Ridge on BERT embeddings and predict
    X_train_bert_cv = get_embeddings(texts_train_cv)
    X_val_bert_cv = get_embeddings(texts_val_cv)

    bert_cv = Ridge(alpha=1.0)
    bert_cv.fit(X_train_bert_cv, y_train_cv)
    cv_bert_preds[val_idx] = bert_cv.predict(X_val_bert_cv)

# combine original features with RF and BERT predictions for meta-model
meta_train_features = np.column_stack([
    X_train_features.values,
    cv_rf_preds,
    cv_bert_preds
])

# train the final meta-model (Ridge regression)
meta_model = Ridge(alpha=0.5)
meta_model.fit(meta_train_features, y_train)

# prepare test features for final prediction
meta_test_features = np.column_stack([
    X_test_features.values,
    rf_pred,
    bert_pred
])

# final prediction from the stacked model
hybrid_pred_stacked = meta_model.predict(meta_test_features)

print("Stacked model prediction completed.")

# evaluating the stacked hybrid model (Ridge meta-model)
mae = mean_absolute_error(y_test, hybrid_pred_stacked)
mse = mean_squared_error(y_test, hybrid_pred_stacked)
rmse = np.sqrt(mse)
pearson_correlation, _ = pearsonr(y_test, hybrid_pred_stacked)

print("\nHybrid Model (Kfold on RF + BERT):")
print(f"MAE: {mae:.4f}")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"Pearson Correlation: {pearson_correlation:.4f}")



Hybrid Model (Average):
MAE: 0.7316
MSE: 0.8300
RMSE: 0.9110
Pearson Correlation: 0.6500

Hybrid Model (Weighted Average):
MAE: 0.7083
MSE: 0.8009
RMSE: 0.8949
Pearson Correlation: 0.6519

Starting stacking process...
Processing 284 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing batch starting at position 80
Processing batch starting at position 120
Processing batch starting at position 160
Processing batch starting at position 200
Processing batch starting at position 240
Processing batch starting at position 280
Processing 71 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing 284 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing batch starting at position 80
Processing batch starting at position 120
Processing batch starting at position 160
Processing batch starting at position 200
Processing batch starting at position 240
Process
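The weighted average in this cell uses fixed weights of 0.3 and 0.7. If the weights were instead tuned, one common precaution is to pick them on a separate validation split rather than on the split used for the final report. The sketch below shows that selection on made-up numbers; `blend`, `mean_abs_error`, and `best_bert_weight` are hypothetical helpers, and this is an illustration under those assumptions rather than the procedure used in the notebook.

```python
def blend(rf_preds, bert_preds, bert_weight):
    """Weighted average of two prediction lists; RF gets weight 1 - bert_weight."""
    return [(1 - bert_weight) * r + bert_weight * b
            for r, b in zip(rf_preds, bert_preds)]


def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def best_bert_weight(y_val, rf_val, bert_val, candidates):
    """Pick the candidate weight with the lowest MAE on the validation data."""
    return min(candidates,
               key=lambda w: mean_abs_error(y_val, blend(rf_val, bert_val, w)))


# Made-up validation data, for illustration only.
y_val = [3.0, 4.0]
rf_val = [2.0, 5.0]
bert_val = [3.0, 4.0]
print(best_bert_weight(y_val, rf_val, bert_val, [0.0, 0.5, 1.0]))
```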

# Testing


## Loading and Pre-Processing Test Dataset

In [None]:
import librosa    #librosa is used for audio preprocessing
import numpy as np
import os
import soundfile as sf  # for saving audio
from tqdm import tqdm
import pandas as pd

input_path = '/content/drive/MyDrive/Dataset/audios/test'
output_path = '/content/drive/MyDrive/Grammar Scoring Engine/Testing/Pre-Processing'

os.makedirs(output_path, exist_ok=True)

# iterate over the audio files
for filename in tqdm(os.listdir(input_path), desc = "Preprocessing Audio Files", unit = "file"):
    if filename.endswith('.wav'):  # only process .wav files
        file_path = os.path.join(input_path, filename)

        # load and resample to 16 kHz
        y, sr = librosa.load(file_path, sr=16000)

        # trim leading/trailing silence (more than 25 dB below the peak)
        y, _ = librosa.effects.trim(y, top_db=25)

        # peak-normalise the volume; silent or empty clips are left as-is to avoid dividing by zero
        peak = np.max(np.abs(y)) if y.size else 0.0
        if peak > 0:
            y = y / peak

        # save preprocessed audio
        output_file_path = os.path.join(output_path, filename)
        sf.write(output_file_path, y, sr)

print('\n Successfully Pre-Processed the Audio files')


Preprocessing Audio Files: 100%|██████████| 204/204 [00:38<00:00,  5.32file/s]


 Successfully Pre-Processed the Audio files





## Transcribing Test Audio Dataset

In [None]:
import os
import whisper   # OpenAI's Whisper library, used for audio transcription
from tqdm import tqdm
import torch

input_path = '/content/drive/MyDrive/Grammar Scoring Engine/Testing/Pre-Processing'
output_path = '/content/drive/MyDrive/Grammar Scoring Engine/Testing/Transcription'

if not os.path.exists(output_path):
    os.makedirs(output_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device) #load the whisper model ie: 'base'

#iterating through the files
for filename in tqdm(os.listdir(input_path), desc="Transcribing Audio Files", unit="file"): # wrapping the path for progress record
    if filename.endswith('.wav'):
        file_path = os.path.join(input_path, filename)

        result = model.transcribe(file_path) #transcribing the audio to text
        transcript = result['text']

        base = os.path.splitext(filename)[0]
        with open(os.path.join(output_path, f"{base}.txt"), "w") as f:
            f.write(transcript) #write the transcribed text to respective .txt file
print('\n Successfully Transcribed the Audio Files')


Transcribing Audio Files: 100%|██████████| 204/204 [13:40<00:00,  4.02s/file]


 Successfully Transcribed the Audio Files





## Making CSV file for Transcripts

In [13]:
import pandas as pd
from tqdm import tqdm
import os

csv_path = '/content/drive/MyDrive/Dataset/test.csv'
transcript_path = '/content/drive/MyDrive/Grammar Scoring Engine/Testing/Transcription'
output_csv = '/content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv'

# load CSV
df = pd.read_csv(csv_path)

# add a new column for transcripts
transcripts = []

# loop through each filename in the CSV and find its transcript
for filename in tqdm(df['filename'], desc="Adding Transcripts", unit="file"):
    base = os.path.splitext(filename)[0]
    transcript_file = os.path.join(transcript_path, f"{base}.txt")

    try:  # read the transcript file
        with open(transcript_file, 'r') as f:
            transcripts.append(f.read().strip())  # strip leading/trailing whitespace
    except FileNotFoundError:
        transcripts.append(None)  # no transcript for this file
        print(f'Transcript not found: {transcript_file}')

# add the column to DataFrame
df['transcription'] = transcripts

# save the updated CSV
df.to_csv(output_csv, index=False)

print(f" \n Successfully saved combined CSV to: {output_csv}")


Adding Transcripts: 100%|██████████| 204/204 [01:26<00:00,  2.35file/s]


 
 Successfully saved combined CSV to: /content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv


## Cleaning the Transcript and Adding more Features

In [14]:
import re     # for text cleaning using patterns (regular expressions)
import unicodedata     # to remove accents from letters

stopwords = ['uh', 'um', 'you know','i mean', 'well', 'exactly', 'yess', 'er']

def clean_text(text):
    if not isinstance(text, str) or text.isnumeric():
        return ""  # non-string values (e.g. NaN) and purely numeric strings become empty

    text = text.lower()
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    # NFKD normalization splits accented letters into base letter + accent;
    # encoding to ASCII with 'ignore' then drops the accents and any other non-ASCII characters

    # remove filler words (the `stopwords` list above)
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    text = re.sub(pattern, '', text)

    # remove unwanted characters
    text = re.sub(r"[^a-z0-9.,?!'\" \n]", '', text)

    # clean up spacing and punctuation
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace into a single space
    text = re.sub(r'\s([.,?!])', r'\1', text)    # drop the space before punctuation
    text = re.sub(r'([.,?!])([^\s])', r'\1 \2', text)    # ensure a space after punctuation

    return text.strip()

df['cleaned_transcription'] = df['transcription'].apply(clean_text) #cleaned transcription
df.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv", index=False)
print(" \n Cleaning complete and saved to combined_scores.csv")



 
 Cleaning complete and saved to combined_scores.csv


In [15]:
import language_tool_python    #language_tool_python is used to check the grammar of a text input.
from tqdm import tqdm
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv')
tool = language_tool_python.LanguageTool('en-US')

def check_grammar(text):      # returns the number of issues LanguageTool flags in the text
    if isinstance(text, str):
        matches = tool.check(text)
        return len(matches)
    else:
        return 0

tqdm.pandas()   # registers Series.progress_apply so the loop shows a progress bar
df['matches'] = df['cleaned_transcription'].progress_apply(check_grammar)    # add a 'matches' column (issue count per transcript)
df.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv", index=False) # save with new features
print("\n Grammar checking complete and saved to combined_scores.csv")


 Grammar checking complete and saved to combined_scores.csv


In [16]:
import spacy   # spaCy provides linguistic annotations such as part-of-speech tags and stop-word flags

# load spaCy model
nlp = spacy.load("en_core_web_sm")

# feature extraction function
def extract_spacy_features(text):
    if isinstance(text, str):
        doc = nlp(text)
        pos_diversity = len(set(token.pos_ for token in doc))  # number of distinct part-of-speech tags in the transcript
        sentence_length = len([token for token in doc if token.is_alpha])    # number of alphabetic tokens (words) in the transcript
        stopword_ratio = sum(token.is_stop for token in doc) / (len(doc) or 1)   # fraction of tokens that are stop words
        return pd.Series([pos_diversity, sentence_length, stopword_ratio])
    else:
        return pd.Series([0, 0, 0.0])

# add the new columns to df
df[['pos_diversity', 'sentence_length', 'stopword_ratio']] = df['cleaned_transcription'].apply(extract_spacy_features)

# save with new features
df.to_csv("/content/drive/MyDrive/Grammar Scoring Engine/Testing/combined_scores.csv", index=False)

print(" \n spaCy features added and saved to combined_scores.csv")

 
 spaCy features added and saved to combined_scores.csv


## Predicting the Output

In [20]:
# defining the test features
x_test_features = df[['matches', 'pos_diversity', 'sentence_length', 'stopword_ratio']]

# prediction using random forest
pred_rf = model_rf.predict(x_test_features)

# generate embeddings for test dataset
test_texts = df['transcription'].tolist()
print("Generating DistilBERT embeddings for test dataset...")
x_test_bert_full = get_embeddings(test_texts)

# predict using the trained Ridge regression model
print("Making predictions on test dataset...")
pred_bert = bert_model.predict(x_test_bert_full)

# prepare meta features by combining original features with RF and BERT predictions
meta_features = np.column_stack([
    x_test_features.values,
    pred_rf,
    pred_bert
])

# prediction using meta model
pred_meta = meta_model.predict(meta_features)

# assign predicted labels
df['label'] = pred_meta

# create submission DataFrame with filename and predicted label
final_submission = df[['filename', 'label']]

# save to CSV
final_submission.to_csv('/content/drive/MyDrive/Grammar Scoring Engine/Testing/final_submission.csv', index=False)
print("Final Submission saved to /content/drive/MyDrive/Grammar Scoring Engine/Testing/final_submission.csv")

# display the first few rows
final_submission.head()

Generating DistilBERT embeddings for test dataset...
Processing 204 texts...
Processing batch starting at position 0
Processing batch starting at position 40
Processing batch starting at position 80
Processing batch starting at position 120
Processing batch starting at position 160
Processing batch starting at position 200
Making predictions on test dataset...
Final Submission saved to /content/drive/MyDrive/Grammar Scoring Engine/Testing/final_submission.csv


|   | filename | label |
|---|---|---|
| 0 | audio_804.wav | 2.987819 |
| 1 | audio_1028.wav | 3.590164 |
| 2 | audio_865.wav | 4.232104 |
| 3 | audio_774.wav | 4.28814 |
| 4 | audio_1138.wav | 3.229035 |
