
# NLP Preprocessing & Text Vectorization with Model Comparison

This notebook is part of my Infosys Springboard internship.

## Task Overview

- Use a dataset with around **10,000 speech transcripts**.
- Perform basic NLP preprocessing:
  - text cleaning (lowercase, remove numbers, punctuation, extra spaces)
  - tokenization
  - stopword removal
  - stemming
  - lemmatization
  - simple n-gram and frequency analysis
- Apply vectorization methods:
  - CountVectorizer
  - TF-IDF
- Train and compare:
  - Logistic Regression
  - Multinomial Naive Bayes
- Do basic hyperparameter tuning using GridSearchCV.

The dataset used here is a simple **10k dummy LibriSpeech-style transcript CSV** with two columns:
- `text`  → transcript sentence
- `label` → class (0–4)


## Theory – Simple Definitions

**Hyperparameter Tuning**  
Choosing the settings that are fixed before training (such as `C`, `alpha`, or the learning rate) so that the model performs best on held-out data.

**Optimization**  
The method used to update model weights so that the loss goes down. Examples: Gradient Descent, SGD, Adam.
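
To make the idea concrete, here is a minimal sketch of plain gradient descent on a toy one-parameter loss. The function name `gradient_descent`, the learning rate, and the toy loss are my own illustrative choices and are not used anywhere else in this notebook.

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w_star, 4))
```

Each step moves the weight a little in the direction that lowers the loss, so `w_star` ends up very close to 3.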

**Loss Function**  
A function that measures how wrong the model’s predictions are. The model tries to minimize it.
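
As an example, cross-entropy (log loss) is the loss commonly minimized by logistic regression. The sketch below computes it for one sample; the probability vectors are made-up values chosen only for illustration.

```python
import math

def cross_entropy(probs, true_class):
    """Negative log of the probability the model assigned to the true class."""
    return -math.log(probs[true_class])

# Hypothetical predicted probabilities for a single sample
confident_right = cross_entropy([0.05, 0.90, 0.05], true_class=1)
uniform_guess = cross_entropy([0.2] * 5, true_class=3)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], true_class=1)

print(confident_right, uniform_guess, confident_wrong)
```

The loss is small when the true class gets a high probability and grows quickly when the model is confidently wrong. A uniform guess over five classes costs `log(5)`.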

**Evaluation Metrics**  
Measures used to check how good a model is. Examples: Accuracy, Precision, Recall, F1-score.
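
The sketch below shows how these metrics can be computed by hand for one class treated as "positive". The helper `precision_recall_f1` and the tiny label lists are hypothetical; in the notebook itself, `accuracy_score` and `classification_report` from scikit-learn do this work.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(accuracy, precision, recall, f1)
```

When a class is never predicted, `tp + fp` is zero and precision is undefined; this sketch returns 0.0 in that case.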

**Confusion Matrix**  
A table showing how many predictions were correct or incorrect for each class (Actual vs Predicted).
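
A small sketch of how such a table can be built, with rows for the actual class and columns for the predicted class. The function and the label lists are illustrative assumptions; scikit-learn offers `sklearn.metrics.confusion_matrix` for real use.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]

# Made-up labels for illustration only
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions, and every off-diagonal cell is one kind of mistake (for example, actual class 0 predicted as class 1).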


In [7]:
import re
import numpy as np
import pandas as pd

from collections import Counter

# NLP
import nltk
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sklearn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
import pandas as pd
from google.colab import files

print("Upload librispeech_dummy_10k.csv")
uploaded = files.upload()

csv_name = list(uploaded.keys())[0]
print("Loaded file:", csv_name)

# Read the file that was just uploaded instead of a hard-coded file name
df = pd.read_csv(csv_name)
df.head()



Upload librispeech_dummy_10k.csv


Saving librispeech_dummy_10k.csv to librispeech_dummy_10k (3).csv
Loaded file: librispeech_dummy_10k (3).csv


Unnamed: 0,text,label
0,"This is sample transcript number 1, containing...",1
1,"This is sample transcript number 2, containing...",2
2,"This is sample transcript number 3, containing...",3
3,"This is sample transcript number 4, containing...",4
4,"This is sample transcript number 5, containing...",0


In [8]:
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # convert to lowercase
    text = text.lower()
    # remove numbers
    text = re.sub(r"\d+", " ", text)
    # remove punctuation and keep only letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize(text: str):
    return nltk.word_tokenize(text)

def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop_words]

def apply_stemming(tokens):
    return [stemmer.stem(t) for t in tokens]

def apply_lemmatization(tokens):
    return [lemmatizer.lemmatize(t) for t in tokens]
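
To see what each cleaning step does, the sketch below repeats the same four regex steps as `clean_text` above but keeps the intermediate strings. The helper name `clean_steps` and the sample sentence are hypothetical and only for illustration.

```python
import re

def clean_steps(text):
    """Return the text after each cleaning step, mirroring clean_text."""
    lowered = text.lower()
    no_digits = re.sub(r"\d+", " ", lowered)
    letters_only = re.sub(r"[^a-z\s]", " ", no_digits)
    collapsed = re.sub(r"\s+", " ", letters_only).strip()
    return [lowered, no_digits, letters_only, collapsed]

for step in clean_steps("Chapter 12: The Speech-Like   Text!"):
    print(repr(step))
# The last line printed is 'chapter the speech like text'
```

Note that the `[^a-z\s]` pattern also replaces any non-ASCII letters (for example accented characters) with spaces, which is fine for this English-only dataset.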


In [10]:
import nltk
# word_tokenize in recent NLTK releases also needs the "punkt_tab" resource
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [11]:
# Make a copy just to be safe
df = df.copy()

# Step 1: clean text
df["clean"] = df["text"].astype(str).apply(clean_text)

# Step 2: tokenize
df["tokens"] = df["clean"].apply(tokenize)

# Step 3: remove stopwords
df["no_stop"] = df["tokens"].apply(remove_stopwords)

# Step 4: stemming & lemmatization
df["stemmed"] = df["no_stop"].apply(apply_stemming)
df["lemmatized"] = df["no_stop"].apply(apply_lemmatization)

df.head()


Unnamed: 0,text,label,clean,tokens,no_stop,stemmed,lemmatized
0,"This is sample transcript number 1, containing...",1,this is sample transcript number containing sp...,"[this, is, sample, transcript, number, contain...","[sample, transcript, number, containing, speec...","[sampl, transcript, number, contain, speech, l...","[sample, transcript, number, containing, speec..."
1,"This is sample transcript number 2, containing...",2,this is sample transcript number containing sp...,"[this, is, sample, transcript, number, contain...","[sample, transcript, number, containing, speec...","[sampl, transcript, number, contain, speech, l...","[sample, transcript, number, containing, speec..."
2,"This is sample transcript number 3, containing...",3,this is sample transcript number containing sp...,"[this, is, sample, transcript, number, contain...","[sample, transcript, number, containing, speec...","[sampl, transcript, number, contain, speech, l...","[sample, transcript, number, containing, speec..."
3,"This is sample transcript number 4, containing...",4,this is sample transcript number containing sp...,"[this, is, sample, transcript, number, contain...","[sample, transcript, number, containing, speec...","[sampl, transcript, number, contain, speech, l...","[sample, transcript, number, containing, speec..."
4,"This is sample transcript number 5, containing...",0,this is sample transcript number containing sp...,"[this, is, sample, transcript, number, contain...","[sample, transcript, number, containing, speec...","[sampl, transcript, number, contain, speech, l...","[sample, transcript, number, containing, speec..."


In [12]:
for i in range(3):
    print("==== Sample", i, "====")
    print("Original text:")
    print(df.loc[i, "text"])
    print("\nCleaned text:")
    print(df.loc[i, "clean"])
    print("\nLemmatized tokens:")
    print(df.loc[i, "lemmatized"])
    print("\n-------------------\n")


==== Sample 0 ====
Original text:
This is sample transcript number 1, containing speech-like text for NLP preprocessing tasks.

Cleaned text:
this is sample transcript number containing speech like text for nlp preprocessing tasks

Lemmatized tokens:
['sample', 'transcript', 'number', 'containing', 'speech', 'like', 'text', 'nlp', 'preprocessing', 'task']

-------------------

==== Sample 1 ====
Original text:
This is sample transcript number 2, containing speech-like text for NLP preprocessing tasks.

Cleaned text:
this is sample transcript number containing speech like text for nlp preprocessing tasks

Lemmatized tokens:
['sample', 'transcript', 'number', 'containing', 'speech', 'like', 'text', 'nlp', 'preprocessing', 'task']

-------------------

==== Sample 2 ====
Original text:
This is sample transcript number 3, containing speech-like text for NLP preprocessing tasks.

Cleaned text:
this is sample transcript number containing speech like text for nlp preprocessing tasks

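Lemmatized tokens:
['sample', 'transcript', 'number', 'containing', 'speech', 'like', 'text', 'nlp', 'preprocessing', 'task']

-------------------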

In [13]:
# Join lemmatized tokens back into text for vectorizers
df["processed_text"] = df["lemmatized"].apply(lambda toks: " ".join(toks))

# Word frequency (unigrams)
all_tokens = [t for doc in df["lemmatized"] for t in doc]
word_freq = Counter(all_tokens).most_common(20)
print("Top 20 most common words:")
for w, c in word_freq:
    print(f"{w}: {c}")


Top 20 most common words:
sample: 10000
transcript: 10000
number: 10000
containing: 10000
speech: 10000
like: 10000
text: 10000
nlp: 10000
preprocessing: 10000
task: 10000
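
The task overview also mentions n-gram analysis. A minimal sketch of counting bigrams is shown below, using the lemmatized tokens of one transcript as printed earlier. The helper `ngrams` is my own; NLTK has a similar `nltk.ngrams` utility.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["sample", "transcript", "number", "containing", "speech",
          "like", "text", "nlp", "preprocessing", "task"]

bigram_counts = Counter(ngrams(tokens, 2))
for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

To run this on the whole dataset, the same helper could be applied to every row, for example `Counter(bg for doc in df["lemmatized"] for bg in ngrams(doc, 2))`.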


In [14]:
# Features (X) and labels (y)
X_text = df["processed_text"]
y = df["label"]

# CountVectorizer
count_vec = CountVectorizer(min_df=3)
X_count = count_vec.fit_transform(X_text)

# TF-IDF
tfidf_vec = TfidfVectorizer(min_df=3)
X_tfidf = tfidf_vec.fit_transform(X_text)

print("CountVectorizer shape:", X_count.shape)
print("TF-IDF shape:", X_tfidf.shape)


CountVectorizer shape: (10000, 10)
TF-IDF shape: (10000, 10)
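
CountVectorizer stores raw word counts, while TF-IDF scales each count by how rare the word is across documents. The sketch below is a hand-written approximation of TF-IDF, assuming scikit-learn's default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization). It works on already tokenized documents, and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = [["speech", "text"], ["speech", "audio"], ["speech", "speech", "label"]]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first document, `text` gets a higher weight than `speech` because `speech` appears in every document and so carries less information.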


In [16]:
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_count, y, test_size=0.2, random_state=42)
Xt_train, Xt_test, yt_train, yt_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

print("Train/Test split done.")


Train/Test split done.


In [17]:
nb = MultinomialNB()

param_grid_nb = {
    "alpha": [0.1, 0.5, 1.0]
}

grid_nb = GridSearchCV(nb, param_grid_nb, cv=3, n_jobs=-1, verbose=0)
grid_nb.fit(Xc_train, yc_train)

print("Best params for NB + CountVectorizer:", grid_nb.best_params_)

y_pred_nb = grid_nb.predict(Xc_test)
print("Accuracy (NB + CountVectorizer):", accuracy_score(yc_test, y_pred_nb))
print("\nClassification report (NB + CountVectorizer):\n")
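# zero_division=0 reports 0.00 for classes that were never predicted, without warnings
print(classification_report(yc_test, y_pred_nb, zero_division=0))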


Best params for NB + CountVectorizer: {'alpha': 0.1}
Accuracy (NB + CountVectorizer): 0.189

Classification report (NB + CountVectorizer):

              precision    recall  f1-score   support

           0       0.19      1.00      0.32       378
           1       0.00      0.00      0.00       415
           2       0.00      0.00      0.00       378
           3       0.00      0.00      0.00       407
           4       0.00      0.00      0.00       422

    accuracy                           0.19      2000
   macro avg       0.04      0.20      0.06      2000
weighted avg       0.04      0.19      0.06      2000





In [18]:
lr = LogisticRegression(max_iter=500)

param_grid_lr = {
    "C": [0.1, 1.0, 10.0]
}

grid_lr = GridSearchCV(lr, param_grid_lr, cv=3, n_jobs=-1, verbose=0)
grid_lr.fit(Xt_train, yt_train)

print("Best params for LR + TF-IDF:", grid_lr.best_params_)

y_pred_lr = grid_lr.predict(Xt_test)
print("Accuracy (LR + TF-IDF):", accuracy_score(yt_test, y_pred_lr))
print("\nClassification report (LR + TF-IDF):\n")
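# zero_division=0 reports 0.00 for classes that were never predicted, without warnings
print(classification_report(yt_test, y_pred_lr, zero_division=0))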


Best params for LR + TF-IDF: {'C': 0.1}
Accuracy (LR + TF-IDF): 0.189

Classification report (LR + TF-IDF):

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       378
           1       0.00      0.00      0.00       415
           2       0.19      1.00      0.32       378
           3       0.00      0.00      0.00       407
           4       0.00      0.00      0.00       422

    accuracy                           0.19      2000
   macro avg       0.04      0.20      0.06      2000
weighted avg       0.04      0.19      0.06      2000
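
Both reports show recall 1.00 for a single class and 0.00 for the others, which is the pattern of a model that predicts one class for every test sample. As a quick sanity check, the sketch below computes the accuracy of such a constant prediction from the `support` column of the reports. The variable names are my own; the counts are copied from the output above.

```python
# Test-set support per class, copied from the classification reports
support = {0: 378, 1: 415, 2: 378, 3: 407, 4: 422}
total = sum(support.values())

# Accuracy obtained by always predicting one fixed label
constant_accuracy = {label: count / total for label, count in support.items()}
best_label = max(constant_accuracy, key=constant_accuracy.get)

for label, acc in constant_accuracy.items():
    print(label, acc)
print("Best constant label:", best_label)
```

Always predicting class 0 or class 2 gives exactly 378 / 2000 = 0.189, which matches the accuracy of both models, so neither model does better than a constant guess on this data.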





## Conclusion

In this notebook, I:

- Used a **10k-sample dummy LibriSpeech-style transcript dataset** (CSV with `text` and `label`).
- Performed basic NLP preprocessing:
  - lowercasing
  - removing numbers and punctuation
  - removing extra spaces
  - tokenization
  - stopword removal
  - stemming and lemmatization
- Did a simple **word frequency analysis** on the lemmatized tokens.
- Converted the cleaned text into numeric features using:
  - **CountVectorizer**
  - **TF-IDF Vectorizer**
- Trained and evaluated two models:
  - **Multinomial Naive Bayes** using CountVectorizer
  - **Logistic Regression** using TF-IDF
- Used **GridSearchCV** to tune basic hyperparameters (`alpha` for NB, `C` for LR).

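Both models reached an accuracy of 0.189 on the test set, which is about chance level for five classes. This is expected with this dummy dataset: after cleaning, every transcript becomes the same ten tokens (the transcript number, the only part that differs between rows, is removed), so the features carry no information about the label.

This completes the task:  
**"NLP Preprocessing & Vectorization on Speech Transcripts with model comparison and basic hyperparameter tuning."**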
