<a href="https://colab.research.google.com/github/shraddha-an/nlp/blob/main/nlp_cs_tc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Case Study: NLP models for Text Classification**

## **Introduction**

In this project, I compare 3 NLP models for classifying Stack Overflow questions into 3 categories according to their quality.

This case study outlines 3 techniques for text classification:

1.   Training word embeddings from scratch
2.   Pretrained GloVe Word Embeddings
3.   Pretrained Word2Vec Embeddings


**Dataset**: [Stack Overflow Questions](https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate)


## **1) Data Preparation**

In [1]:
# Importing libraries
# Data Manipulation/ Handling
import pandas as pd, numpy as np

# Visualization
import seaborn as sb, matplotlib.pyplot as plt

# NLP libraries
import re
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess



In [None]:
# Importing training & testing datasets
dataset = pd.read_csv('train.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})
ds = pd.read_csv('valid.csv')[['Body', 'Y']].rename(columns = {'Body': 'question', 'Y': 'category'})

## **2) NLP Preprocessing**

In [None]:
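# Removing symbols, stopwords, punctuation
stop_words = set(stopwords.words('english'))

# Raw string so the backslash escapes reach the regex engine unchanged
symbols = re.compile(pattern = r'[/<>(){}\[\]\|@,;]')
# URL/HTML fragments to strip; 'https' is listed before 'http' so the longer match is removed first
tags = ['href', 'https', 'http', 'www']

def text_clean(s: str) -> str:
    """
    Removes unwanted symbols, punctuation and stop words from a given string.
    """
    s = symbols.sub(' ', s)
    for tag in tags:
        s = s.replace(tag, ' ')
    # simple_preprocess lowercases, tokenizes and (with deacc = True) strips accents
    cleaned_text = ' '.join(word for word in simple_preprocess(s, deacc = True) if word not in stop_words)
    return cleaned_text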

# Applying the function on the questions column
dataset.iloc[:, 0] = dataset.iloc[:, 0].apply(text_clean)
ds.iloc[:, 0] = ds.iloc[:, 0].apply(text_clean)

# Train & Test subsets
X_train, y_train = dataset.iloc[:, 0].values, dataset.iloc[:, 1].values.reshape(-1, 1)
X_test, y_test = ds.iloc[:, 0].values, ds.iloc[:, 1].values.reshape(-1, 1)
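
To make the cleaning step concrete, here is a minimal standard-library sketch of what `text_clean` does to one question. It is an approximation, not the notebook's function: `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stop words, and the `re.findall` call is only a rough stand-in for gensim's `simple_preprocess` (lowercasing, keeping tokens of 2 to 15 characters).

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "a", "is", "in", "to", "how", "do", "i"}

SYMBOLS = re.compile(r"[/<>(){}\[\]|@,;]")
TAGS = ["href", "https", "http", "www"]

def text_clean_sketch(s: str) -> str:
    """Approximates text_clean: strip symbols and URL fragments, drop stop words."""
    s = SYMBOLS.sub(" ", s)
    for tag in TAGS:
        s = s.replace(tag, " ")
    # Rough stand-in for simple_preprocess: lowercase alphabetic tokens,
    # keeping only those between 2 and 15 characters long.
    tokens = re.findall(r"[a-z]+", s.lower())
    return " ".join(t for t in tokens if 2 <= len(t) <= 15 and t not in STOP_WORDS)

example = "<p>How do I parse JSON in Python?</p>"
print(text_clean_sketch(example))  # parse json python
```

The HTML tag characters become spaces, the leftover single-letter token `p` is dropped for being too short, and the stop words disappear, leaving only the content-bearing words.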


## **3) Categorical Encoding**

In [None]:
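# One Hot Encoding the Categories Column
from sklearn.preprocessing import OneHotEncoder as ohe
from sklearn.compose import ColumnTransformer

# sparse_threshold = 0 forces a dense NumPy array, which is what Keras expects for the targets
ct = ColumnTransformer(transformers = [('one_hot_encoder', ohe(categories = 'auto'), [0])],
                       remainder = 'passthrough',
                       sparse_threshold = 0)

# Fit the encoder on the training labels only, then reuse it for the test labels
y_train = ct.fit_transform(y_train)
y_test = ct.transform(y_test)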
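
As an illustration of what the encoder produces, the sketch below one-hot encodes a handful of labels in plain Python. The label strings are assumed for the example and may not match the dataset exactly; the column order follows the sorted category names, which is how `OneHotEncoder` orders columns when `categories = 'auto'`.

```python
# Assumed label strings, for illustration only
labels = ["HQ", "LQ_EDIT", "LQ_CLOSE", "HQ"]

# Columns are ordered by sorted category name
categories = sorted(set(labels))
column_of = {category: j for j, category in enumerate(categories)}

# One row per label, with a single 1.0 in the column of its category
one_hot = [
    [1.0 if column_of[label] == j else 0.0 for j in range(len(categories))]
    for label in labels
]

print(categories)
for label, row in zip(labels, one_hot):
    print(label, row)
```

Each row sums to 1, which is the target format that a softmax output layer trained with `categorical_crossentropy` expects.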

## **4) Tokenization**

In [None]:
# Vectorizing our text corpus of questions
# Setting some parameters
vocab_size = 2000
sequence_length = 6700

# Tokenization with keras
from keras.preprocessing.text import Tokenizer

tk = Tokenizer(num_words = vocab_size)
tk.fit_on_texts(X_train)

X_train = tk.texts_to_sequences(X_train)
X_test = tk.texts_to_sequences(X_test)

# Padding all questions with zeros
from keras.preprocessing.sequence import pad_sequences

X_train_seq = pad_sequences(X_train, maxlen = sequence_length, padding = 'post')
X_test_seq = pad_sequences(X_test, maxlen = sequence_length, padding = 'post')
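
The tokenization and padding steps can be sketched in plain Python to show what the integer sequences look like. This is a simplified stand-in for the Keras utilities, not their implementation: the helper names are hypothetical, words are split on whitespace only, and the example corpus is made up. It mirrors two behaviours worth knowing: with `num_words = n`, only words whose index is below `n` are kept, and with the default `truncating = 'pre'`, sequences longer than `maxlen` keep their last `maxlen` tokens.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices 1, 2, 3, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words with index >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_post(sequences, maxlen):
    """Pad with zeros at the end; overly long sequences keep their last maxlen tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append(seq + [0] * (maxlen - len(seq)))
    return padded

corpus = ["python list sort", "python dict", "sort python list quickly"]
word_index = fit_word_index(corpus)
sequences = to_sequences(corpus, word_index, num_words = 4)

print(word_index)
print(pad_post(sequences, maxlen = 4))  # [[1, 2, 3, 0], [1, 0, 0, 0], [3, 1, 2, 0]]
```

Index 0 is reserved for padding, so it never collides with a real word. With `sequence_length = 6700`, most questions will consist largely of trailing zeros.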


## **5) Embedding Model**


In [None]:
# Building & Training the model
from keras.models import Sequential
from keras.layers import Embedding, Dense, Flatten

model = Sequential()
# Each of the vocab_size word indices is mapped to a trainable 5-dimensional vector
model.add(Embedding(input_dim = vocab_size, output_dim = 5, input_length = sequence_length))
model.add(Flatten())

# One output unit per question category
model.add(Dense(units = 3, activation = 'softmax'))

# Compiling the model (once, with the Adam optimizer)
model.compile(loss = 'categorical_crossentropy',
              optimizer = 'adam',
              metrics = ['accuracy'])

model.summary()

# Training the model
history = model.fit(X_train_seq, y_train, epochs = 20, batch_size = 512, verbose = 1)
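
The parameter counts reported by `model.summary()` can be checked by hand. The short sketch below recomputes them from the hyperparameters used above; the variable names `embedding_dim` and `n_classes` are introduced here for readability and do not appear in the notebook.

```python
vocab_size = 2000        # rows in the embedding matrix
sequence_length = 6700   # tokens per padded question
embedding_dim = 5        # output_dim of the Embedding layer
n_classes = 3            # units in the final Dense layer

# Embedding: one embedding_dim-sized vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# Flatten has no parameters; it reshapes (sequence_length, embedding_dim) into one vector
flattened_size = sequence_length * embedding_dim

# Dense: one weight per (input, unit) pair plus one bias per unit
dense_params = flattened_size * n_classes + n_classes

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
# 10000 33500 100503 110503
```

Most of the weights sit in the Dense layer because the flattened input grows linearly with `sequence_length`, so a shorter padding length would shrink the model considerably.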


## **6) Evaluating Performance**

In [None]:
# Evaluating model performance on test set
loss, accuracy = model.evaluate(X_test_seq, y_test, verbose = 1)
print("\nAccuracy: {}\nLoss: {}".format(accuracy, loss))

# Save the model
#model.save('mdl_80.h5')
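
For reference, the two numbers returned by `model.evaluate` can be reproduced from one-hot targets and predicted probabilities. The sketch below is a simplified plain-Python version of categorical accuracy and categorical cross-entropy, using made-up predictions; the clipping constant `eps` is an assumption, and Keras applies its own clipping internally.

```python
import math

def categorical_accuracy(y_true, y_prob):
    """Fraction of rows where the most probable class matches the true class."""
    def argmax(row):
        return max(range(len(row)), key = row.__getitem__)
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def categorical_crossentropy(y_true, y_prob, eps = 1e-7):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total -= sum(ti * math.log(max(pi, eps)) for ti, pi in zip(t, p))
    return total / len(y_true)

# Made-up targets and predictions for three questions
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2]]

print("Accuracy:", categorical_accuracy(y_true, y_prob))
print("Loss:", categorical_crossentropy(y_true, y_prob))
```

Here two of the three predictions are correct, and the third contributes most of the loss because the true class received a probability of only 0.2.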

In [None]:
# Plotting Accuracy & Loss over epochs
sb.set_style('darkgrid')

# 1) Accuracy
plt.plot(history.history['accuracy'], label = 'training', color = '#5499C7')
plt.legend(shadow = True, loc = 'lower right')
plt.title('Accuracy Plot over Epochs')
plt.show()

# 2) Loss
plt.plot(history.history['loss'], label = 'training loss', color = 'purple')
plt.legend(shadow = True, loc = 'upper right')
plt.title('Loss Plot over Epochs')
plt.show()