# NLP - Classification and Sentiment Analysis of Reddit Posts

## Part 3: Baseline Classification Models and Zero Shot Classification

Part 1: Web API Data Collection <br>
Part 2: Exploratory Data Analysis <br>
Part 3: Baseline Classification Models and Zero Shot Classification <br>
Part 4: PyCaret Classification Models <br>
Part 5: Sentiment Analysis <br>

---

In part 3, baseline classification model is explored using different preprocessing. The purpose was to decide on the preprocessing steps that produce the best model performance. Zero shot classification was also experimented.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

#Text processing
from nltk import word_tokenize, RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

#Sklearn classification models
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,get_scorer, accuracy_score, precision_score, recall_score, f1_score

#Zero shot classification
from transformers import pipeline

#Mlflow
import mlflow

#Switch off warning
import warnings
warnings.filterwarnings("ignore")

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

2022-10-12 16:59:09.966957: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Import Cleaned Data

In [3]:
data = pd.read_csv('Cleaned Datasets/cleaned_dataset.csv')
print(data.shape)
data.head()

(36057, 4)


Unnamed: 0,text,subreddit,lemmatised,stemmed
0,frauenrztin mnchen zentrum schwangerenvorsorge,google,frauenrztin mnchen zentrum schwangerenvorsorge,frauenrztin mnchen zentrum schwangerenvorsorg
1,petition to save stadia i do not have high exp...,google,petition save stadium high expectation i not d...,petit save stadia high expect i not delusion i...
2,google stadia will be shutting down in january...,google,stadium shutting january purchase will be refu...,stadia shut januari purchas will be refund tec...
3,google will not let you create an email withou...,google,let create email without phone number,let creat email without phone number
4,recover android backup with different resoluti...,google,recover android backup different resolution pa...,recov android backup differ resolut pattern


In [4]:
data.dropna(axis=0,inplace=True)
data.shape

(36006, 4)

## Train Test Split

In [5]:
X = data.drop(columns='subreddit')
y = data['subreddit'].map({'google':0,'apple':1})

In [6]:
y.value_counts(normalize=True)

1    0.526607
0    0.473393
Name: subreddit, dtype: float64

The two classes are fairly balanced. Hence, we can use accuracy score subsequently to evaluate model performance.

In [7]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,stratify=y)
print(f'train:{X_train.shape}')
print(f'test:{X_test.shape}')

train:(27004, 3)
test:(9002, 3)


In [8]:
X_train_stem = X_train['stemmed']
X_test_stem = X_test['stemmed']
X_train_lemmatised = X_train['lemmatised']
X_test_lemmatised = X_test['lemmatised']

Two sets of X_train and X_test were prepared, one was stemmed text, one was lemmatised text. Model performance using these two text datasets will be compared later.

## Baseline Model

### Test Baseline Model with Different Data Preprocessing

Using Logistic Regression as the estimator for baseline model, four combinatons of data preprocessing will be tested:
- Lemmatising + Count Vectorizer
- Lemmatising + Term Frequency-Inverse Document Frequency (TF-IDF)
- Stemming + Count Vectorizer
- Stemming + Term Frequency-Inverse Document Frequency (TF-IDF)

The goal is to decide on the preprocessing steps that give the best model performance.

In [170]:
# Pipeline for CountVectorizer
pipe_cvec = Pipeline([
    ('cvec',CountVectorizer(max_features=5000)),
    ('lr',LogisticRegression(max_iter=1000))
])

# Pipeline for TF-IDF
pipe_tfidf = Pipeline([
    ('tvec',TfidfVectorizer(max_features=5000)),
    ('lr', LogisticRegression(max_iter=1000))
])

# Function to train model and return four classification metrics
def model_train(pipeline,X_train,X_test):
    model = pipeline.fit(X_train, y_train)
    for metric in ['accuracy','precision','recall','f1']:
        print(f'Test {metric}:{get_scorer(metric)(pipeline, X_test, y_test)}')

In [176]:
# Use mlflow to track experiment results
mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("subreddit-posts")
mlflow.autolog()

2022/10/11 23:02:39 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2022/10/11 23:02:39 INFO mlflow.tracking.fluent: Autologging successfully enabled for tensorflow.


In [55]:
with mlflow.start_run():
    model_train(pipe_cvec,X_train_lemmatised,X_test_lemmatised)

Test accuracy:0.8687597071222543
Test precision:0.8907175773535221
Test recall:0.8556070826306914
Test f1:0.8728093753359853


In [56]:
model_train(pipe_tfidf,X_train_lemmatised,X_test_lemmatised)

2022/10/11 00:48:18 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f50976f1f8804e418c8fb68bc0a7f517', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Test accuracy:0.8751941424450854
Test precision:0.9025583982202448
Test recall:0.855185497470489
Test f1:0.8782335750622362


In [57]:
model_train(pipe_cvec,X_train_stem,X_test_stem)

2022/10/11 00:48:25 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'c80f2a1539b04bd693689f5215d4f10f', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Test accuracy:0.8708675393831817
Test precision:0.8891304347826087
Test recall:0.862141652613828
Test f1:0.8754280821917809


In [58]:
model_train(pipe_tfidf,X_train_stem,X_test_stem)

2022/10/11 00:48:33 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'f97276fd1fd74f068a98180282d82474', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Test accuracy:0.8747503882848902
Test precision:0.9010428222764588
Test recall:0.8560286677908938
Test f1:0.8779591395524807


**Insights from Different Preprocessing** <br>
|           Estimator          | Preprocessing                  |   Test Accuracy  |  Test Precision  |    Test Recall   |   Test F1 Score  |
|:----------------------------:|--------------------------------|:----------------:|:----------------:|:----------------:|:----------------:|
| Logistic Regression          | Lemmatising + Count Vectoriser | 0.8687597071     | 0.8907175774     | 0.8556070826     | 0.8728093753     |
| **Logistic Regression**      | **Lemmatising + TF-IDF**       | **0.8751941424** | **0.9025583982** | **0.8551854975** | **0.8782335751** |
| Logistic Regression          | Stemming + Count Vectoriser    | 0.8708675394     | 0.8891304348     | 0.8621416526     | 0.8754280822     |
| Logistic Regression          | Stemming + TF-IDF              | 0.8747503883     | 0.9010428223     | 0.8560286678     | 0.8779591396     |

<br>
The combination of lemmatising and TF-IDF produced the best model performance, with highest overall accuracy and f1 score. This means that the model has good accuracy, precision and recall. Hence, this combination will be used for subsequent modelling.

### Baseline Model Hyperparameter Tuning

In [173]:
pipe_params = {
    'tvec__max_features': [1000, 2000, 3000, 4000, 5000],
    'tvec__ngram_range': [(1,1), (1,2)]
}

gs = GridSearchCV(pipe_tfidf,
                  param_grid = pipe_params,
                  cv=5, 
                  n_jobs=-1)

gs.fit(X_train_lemmatised, y_train)

2022/10/11 23:01:49 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'ec4b270b225f482f9d2a433fe9896ccf', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2022/10/11 23:02:37 INFO mlflow.sklearn.utils: Logging the 5 best runs, 5 runs will be omitted.


In [174]:
gs.best_params_

{'tvec__max_features': 5000, 'tvec__ngram_range': (1, 1)}

In [175]:
print(f'Train: {gs.score(X_train_lemmatised, y_train)}')
print(f'Test accuracy: {gs.score(X_test_lemmatised, y_test)}')

Train: 0.9027366863905325
Test accuracy: 0.8751941424450854


Best performing model is the same as the baseline model (maximum features = 5000, unigram).

## Random Forest

Tree based algorithm, in particular, random forest is explored as an alternative model, because random forest uses bagging and random subspace method to reduce variance in basic tree based model.

### Random Forest with Default Parameter

Maximum features are reduced to 2000 for random forest since random forest is more computationally expensive compared to logistic regression.

In [9]:
# Text preprocessing
tfidf = TfidfVectorizer(max_features=2000)
X_train_lemma_tfidf = tfidf.fit_transform(X_train_lemmatised)
X_train_lemma_tfidf_df = pd.DataFrame(X_train_lemma_tfidf.toarray(),columns=tfidf.get_feature_names())

X_test_lemma_tfidf = tfidf.fit_transform(X_test_lemmatised)
X_test_lemma_tfidf_df = pd.DataFrame(X_test_lemma_tfidf.toarray(),columns=tfidf.get_feature_names())

In [157]:
# Model training
rf = RandomForestClassifier(class_weight='balanced')
rf.fit(X_train_lemma_tfidf,y_train)
print(f'Train score: {rf.score(X_train_lemma_tfidf_df,y_train)}')
print(f'Test score:{rf.score(X_test_lemma_tfidf_df,y_test)}')

2022/10/11 01:48:39 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '41c53213337e470f991a00b8c6dcc5a9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


Train score: 0.9795118343195266
Test score:0.6015087641446638


### Random Forest with Hyperparameter Tuning

In [179]:
rf_param_grid = {
    'max_depth': range(6,13,2),
    'n_estimators': [100,200]
}

gs = GridSearchCV(rf,
                  param_grid=rf_param_grid,
                  n_jobs=-1)

gs.fit(X_train_lemma_tfidf_df,y_train)

2022/10/11 23:14:45 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '0be0dc19a1c24adc9c5b5c2dd2375ee9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow
2022/10/11 23:25:23 INFO mlflow.sklearn.utils: Logging the 5 best runs, 3 runs will be omitted.


In [180]:
print(gs.best_params_)

{'max_depth': 12, 'n_estimators': 200}


In [182]:
y_pred = gs.predict(X_test_lemma_tfidf_df)
print(f'Train score: {gs.score(X_train_lemma_tfidf_df,y_train)}')

for metric in [accuracy_score, precision_score, recall_score, f1_score]:
    metric_score = metric(y_test,y_pred)
    print(f'{metric.__name__}:{metric_score}')

Train score: 0.8115014792899409
accuracy_score:0.58597736853783
precision_score:0.7203832752613241
recall_score:0.34865092748735244
f1_score:0.4698863636363636


In [61]:
# Switch off mlflow autologging
mlflow.autolog(disable=True)

**Summary of Random Forest Model** <br>

|           Estimator          | Preprocessing                  |   Test Accuracy  |  Test Precision  |    Test Recall   |   Test F1 Score  |
|:----------------------------:|--------------------------------|:----------------:|:----------------:|:----------------:|:----------------:|
| Logistic Regression          | Lemmatising + Count Vectoriser | 0.8687597071     | 0.8907175774     | 0.8556070826     | 0.8728093753     |
| Logistic Regression          | Lemmatising + TF-IDF           | 0.8751941424     | 0.9025583982     | 0.8551854975     | 0.8782335751     |
| Logistic Regression          | Stemming + Count Vectoriser    | 0.8708675394     | 0.8891304348     | 0.8621416526     | 0.8754280822     |
| Logistic Regression          | Stemming + TF-IDF              | 0.8747503883     | 0.9010428223     | 0.8560286678     | 0.8779591396     |
| **Random Forest**            | **Lemmatising + TF-IDF**       | **0.5859773685** | **0.7203832753** | **0.3486509275** | **0.4698863636** |
- Random forest had poorer accuracy score compared to logistic regression.
- Random forest also had issue of overfitting despite reducing maximum depth to address overfitting. This is evident from the much lower accuracy score on test set.

## Zero Shot Classification

Next, zero shot classification with pre-trained model is tested on the current dataset.<br>
<br>
The model selected for zero shot classification was [BART model](https://huggingface.co/facebook/bart-large-mnli), which was trained on the Multi-Genre Natural Language Inference (MultiNLI) English corpus, a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

In [163]:
# Prepare test dataset
test_text = pd.concat([X_test['text'],y_test],axis=1)

In [164]:
# Import model
classifier = pipeline('zero-shot-classification',model='facebook/bart-large-mnli')
labels = ['apple','google']

In [165]:
# Store results of zero shot classification in 'classifier_results'
tqdm.pandas() # Initialize tqdm to activate progress bar
test_text['classifier_result'] = test_text.progress_apply(lambda row:classifier(row['text'],labels),axis=1)

100%|█████████████████████████████████████| 9014/9014 [3:47:15<00:00,  1.51s/it]


In [166]:
# Tease out y_pred from results dictionary
test_text['y_pred'] = test_text.progress_apply(lambda row:row['classifier_result']['labels'][0],axis=1)
test_text['y_pred'] = test_text['y_pred'].map({'google':0,'apple':1})

100%|████████████████████████████████████| 9014/9014 [00:00<00:00, 26544.08it/s]


In [167]:
# Test model performance
for metric in [accuracy_score, precision_score, recall_score, f1_score]:
    metric_score = metric(test_text['subreddit'],test_text['y_pred'])
    print(f'{metric.__name__}:{metric_score}')

accuracy_score:0.7164410916352341
precision_score:0.9358565737051793
recall_score:0.49515177065767285
f1_score:0.6476426799007444


**Summary of Findings**

|           Estimator          | Preprocessing                  |   Test Accuracy  |  Test Precision  |    Test Recall   |   Test F1 Score  |
|:----------------------------:|--------------------------------|:----------------:|:----------------:|:----------------:|:----------------:|
| Logistic Regression          | Lemmatising + Count Vectoriser | 0.8687597071     | 0.8907175774     | 0.8556070826     | 0.8728093753     |
| Logistic Regression          | Lemmatising + TF-IDF           | 0.8751941424     | 0.9025583982     | 0.8551854975     | 0.8782335751     |
| Logistic Regression          | Stemming + Count Vectoriser    | 0.8708675394     | 0.8891304348     | 0.8621416526     | 0.8754280822     |
| Logistic Regression          | Stemming + TF-IDF              | 0.8747503883     | 0.9010428223     | 0.8560286678     | 0.8779591396     |
| Random Forest                | Lemmatising + TF-IDF           | 0.5859773685     | 0.7203832753     | 0.3486509275     | 0.4698863636     |
| **Zero Shot Classification** | **-**                          | **0.7164410916** | **0.9358565737** | **0.4951517707** | **0.6476426799** |
<br>
- Overall, zero shot classification did not outperform baseline logistic regression model. The accuracy score was significantly lower. This was expected since the corpus the model was trained on may not necessarily be relevant for the current text data.
- Notably, zero shot classification has very high precision score. This means low type 2 error, and the model is less likely to misclassify posts from Apple subrreddit as Google.
- In contrast, recall score is poor. This means high type 1 error, and the model has tendency to misclassify posts from Google subreddits as Apple.

## Export Data for PyCaret

- Data was exported for PyCaret analysis, since PyCaret would be run using a different environment.
- Features in the datasets were capped at 2000 instead of 5000 in the baseline model, since PyCaret is computationally expensive when it searched over 16 different classification models.

In [10]:
train_data = pd.concat([X_train_lemma_tfidf_df,y_train.reset_index(drop=True)],axis=1)
train_data.to_csv('Cleaned Datasets/train_data.csv',index=False)
print(train_data.shape)
train_data.head()

(27004, 2001)


Unnamed: 0,aasp,ability,able,about,absolutely,acabou,acc,accept,access,accessory,...,yet,you,your,yous,youtube,yt,zcarsales,zero,zoom,subreddit
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [11]:
test_data = pd.concat([X_test_lemma_tfidf_df,y_test.reset_index(drop=True)], axis=1)
test_data.to_csv('Cleaned Datasets/test_data.csv',index=False)
print(test_data.shape)
test_data.head()

(9002, 2001)


Unnamed: 0,aasp,ability,able,abortion,about,absolutely,ac,acc,accept,access,...,yet,yo,you,young,your,yous,youtube,zcarsales,zoom,subreddit
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.214146,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.416624,0.0,0.0,1


## Summary of Baseline Classification Models and One Shot Classification Models

|         Estimator        | Preprocessing                  |   Test Accuracy  |  Test Precision  |    Test Recall   |   Test F1 Score  |
|:------------------------:|--------------------------------|:----------------:|:----------------:|:----------------:|:----------------:|
| Logistic Regression      | Lemmatising + Count Vectoriser | 0.8687597071     | 0.8907175774     | 0.8556070826     | 0.8728093753     |
| **Logistic Regression**  | **Lemmatising + TF-IDF**       | **0.8751941424** | **0.9025583982** | **0.8551854975** | **0.8782335751** |
| Logistic Regression      | Stemming + Count Vectoriser    | 0.8708675394     | 0.8891304348     | 0.8621416526     | 0.8754280822     |
| Logistic Regression      | Stemming + TF-IDF              | 0.8747503883     | 0.9010428223     | 0.8560286678     | 0.8779591396     |
| Random Forest            | Lemmatising + TF-IDF           | 0.5859773685     | 0.7203832753     | 0.3486509275     | 0.4698863636     |
| Zero Shot Classification | -                              | 0.7164410916     | 0.9358565737     | 0.4951517707     | 0.6476426799     |
- The text preprocessing that produced the best model performance was lemmatising and Term Frequency-Inverse Document Frequency.
- Logistic regression, random forest and one shot classification were tested on the text data. The best performing model was using logistic regression, with an accuracy score of 0.88.