## 4. Preprocessing & Modeling

In [1]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [2]:
# load collected data
subreddit_content_vf = pd.read_csv('../dataset/subreddit_content_final.csv')

subreddit_content_vf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1829 entries, 0 to 1828
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   subreddit     1829 non-null   int64  
 1   post_content  1801 non-null   object 
 2   upvote_ratio  1829 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 43.0+ KB


Missing values are observed in the 'post_content' column. These are likely the result of the earlier data cleaning and EDA steps, which removed:
- Words identified as having no contextual relation to the respective subreddit topics
- English stop words
- Words that appear only once or twice in the posts

The rows with missing values are dropped rather than imputed, since imputation would introduce artificially created content tailored to each subreddit topic, which may bias the subsequent model training.
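A minimal sketch of this step on a toy frame may help. The rows are illustrative and not taken from the dataset, and it assumes that a post emptied during cleaning is written to CSV as an empty field, which `pd.read_csv` parses as NaN by default.

```python
import io
import pandas as pd

# Toy CSV: the second row has an empty post_content field, as it would
# if every word in the post had been removed during cleaning.
csv_text = (
    "subreddit,post_content,upvote_ratio\n"
    "0,anova test sample,0.9\n"
    "1,,0.8\n"
    "1,llm training,1.0\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# The empty field is read back as NaN.
n_missing = toy['post_content'].isna().sum()

# Drop only the rows whose text is missing.
toy_clean = toy.dropna(subset=['post_content'])
print(n_missing, len(toy_clean))
```

Passing `subset=['post_content']` limits the check to the text column, which is equivalent to a plain `dropna()` here because the other columns have no missing values.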

In [3]:
# drop rows with missing values
subreddit_content_vf = subreddit_content_vf.dropna()

# verify non-null count
subreddit_content_vf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1801 entries, 0 to 1828
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   subreddit     1801 non-null   int64  
 1   post_content  1801 non-null   object 
 2   upvote_ratio  1801 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 56.3+ KB


### a. Preprocessing

#### i. Dataset split

In [4]:
# create features (X) and target (y)
X = subreddit_content_vf['post_content']
y = subreddit_content_vf['subreddit']

In [5]:
# train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 42)
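As a rough illustration of what `stratify = y` does in the split above, the sketch below uses hypothetical labels with a 60/40 class balance and checks that both splits keep that ratio. The arrays `X_toy` and `y_toy` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 posts of class 0 and 40 posts of class 1.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# The share of class 1 is the same in the train and test splits.
print(len(y_tr), len(y_te), y_tr.mean(), y_te.mean())
```

Without stratification, the class share in a small test set can drift from that of the full data, which makes scores harder to compare against the baseline.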

#### ii. Conversion of text data into matrix representation

In [6]:
# instantiate
tfidf = TfidfVectorizer()

In [7]:
# fit on the training set only, then apply the same vocabulary to the test set
X_train_tfid = tfidf.fit_transform(X_train)
X_test_tfid = tfidf.transform(X_test)

In [8]:
# get feature names
feature_names = tfidf.get_feature_names_out()

### b. Modeling

#### i. Baseline model

Before fitting more advanced models, a baseline model is established as a reference point for comparing their performance. If an advanced model cannot outperform the baseline, it is not learning anything meaningful from the data. The baseline here is a naive approach that always predicts the majority class in the data. It serves purely as a comparison benchmark and is not meant for making accurate predictions.

In [9]:
y.value_counts(normalize = True)

0    0.536369
1    0.463631
Name: subreddit, dtype: float64

Here, class 0 (statistics) is the majority class and the baseline accuracy is about 0.54 (rounded to 2 decimal places).
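The same baseline can be expressed as a model object. This is a minimal sketch using scikit-learn's `DummyClassifier` on hypothetical labels with roughly the class balance reported above. It is not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 54 posts of class 0 and 46 posts of class 1.
y_toy = np.array([0] * 54 + [1] * 46)
X_toy = np.zeros((100, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class.
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```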

#### ii. Logistic regression model

In [10]:
# instantiate
logreg = LogisticRegression(penalty = "l1", solver = 'liblinear')

# define params
logreg_params = {'C': np.linspace(1,100,20)}

# perform grid search with cross-validation
logreg_gs = GridSearchCV(logreg, param_grid = logreg_params, cv = 5)
logreg_gs.fit(X_train_tfid, y_train)

# identify best params
logreg_best = logreg_gs.best_params_

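# instantiate new model with best params
# note: best_params_ holds only C, so penalty and solver fall back to the
# scikit-learn defaults here instead of the 'l1' / 'liblinear' settings used
# in the grid search; logreg_gs.best_estimator_ would retain those settings
logreg_new = LogisticRegression(**logreg_best)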

# fit model using best params
logreg_new.fit(X_train_tfid, y_train)

# make predictions
y_pred_logreg = logreg_new.predict(X_test_tfid)

print('Best Params:', logreg_best)

Best Params: {'C': 21.842105263157894}
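One detail of the cell above is worth illustrating: `best_params_` contains only the parameters that were searched, so a model rebuilt from it uses scikit-learn's defaults for everything else. The sketch below shows this on synthetic data. The data and variable names are made up, and only `solver` is checked, but the same holds for any other fixed constructor argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to make the sketch runnable.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

gs = GridSearchCV(LogisticRegression(solver='liblinear'),
                  param_grid={'C': [0.1, 1, 10]}, cv=5)
gs.fit(X_toy, y_toy)

# Rebuilding from best_params_ passes only the searched parameter (C).
rebuilt = LogisticRegression(**gs.best_params_)

# best_estimator_ is the refit model and keeps the fixed settings as well.
print(list(gs.best_params_), rebuilt.solver, gs.best_estimator_.solver)
```

Using `best_estimator_` directly (available because `refit=True` is the default) avoids having to repeat the fixed arguments by hand.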


In [11]:
# evaluate model
print('Train Score:', logreg_new.score(X_train_tfid, y_train))
print('Test Score:', logreg_new.score(X_test_tfid, y_test))

Train Score: 1.0
Test Score: 0.9645232815964523


In [12]:
# calculate specificity
# save confusion matrix values
logreg_tn, logreg_fp, logreg_fn, logreg_tp = confusion_matrix(y_test, y_pred_logreg).ravel()

logreg_spec = logreg_tn / (logreg_tn + logreg_fp)

print('Specificity:', logreg_spec)

Specificity: 0.9586776859504132


In [13]:
# retrieve respective feature names and coefficients
feature_name_coefs = dict(zip(feature_names, logreg_new.coef_[0]))

# sort dictionary values in descending order
sorted_feat_name_coefs_descending = dict(sorted(feature_name_coefs.items(), key = lambda item: item[1], reverse = True))

# print sorted dictionary
for key, value in sorted_feat_name_coefs_descending.items():
    print(f'{key}: {value}')

llm: 6.449037636556689
ai: 5.788796076665126
ive: 5.27352776348982
ml: 5.177704577497379
image: 5.12002004374264
thanks: 5.0824051404601
help: 4.713484123189027
dont: 4.693264140095504
machine: 4.605782922654693
training: 4.1745923981371105
model: 4.039366722461685
chatgpt: 3.9223866042739624
trying: 3.684487664237419
network: 3.6429441712534425
learning: 3.613828222658071
task: 3.573895683486724
say: 3.5703485268626918
train: 3.4329612858708325
feature: 3.3935164749469005
however: 3.3851219542749678
set: 3.2949098912419146
project: 3.193208166529537
input: 3.1586558434304246
neural: 2.9756021924758103
github: 2.958863320070211
self: 2.8304510120365767
llama: 2.771279004763487
code: 2.740395750568077
text: 2.7093952080140684
blog: 2.701111529880404
token: 2.6971110231409328
demo: 2.687397219586784
deep: 2.6691104093339737
application: 2.6365979055355373
classification: 2.591634206729222
polynomial: 2.521358287296935
transformer: 2.5083737303006055
2023: 2.44960512106388
cost: 2.4314232

andor: 0.24070745941326033
vicuna: 0.24064205378072995
resulting: 0.24052622848914537
end: 0.24005985439950636
engine: 0.23995623668164986
substantially: 0.23924756602565656
matched: 0.23912685394292077
hpo: 0.2389992287733647
becoming: 0.2388557432702139
accept: 0.2385279787830012
vanilla: 0.23811030306161826
even: 0.23794076872313843
officially: 0.23766971690827166
gpt4all: 0.23740890337405118
including: 0.23731446642359572
10x: 0.23724179130491932
professional: 0.23721528637267975
gui: 0.23699656816387243
submix: 0.23685510345870056
gaming: 0.23615813315439568
planning: 0.23609157687620863
grammar: 0.23606068886043266
aigenerated: 0.23601642405787454
maybe: 0.23590311744036616
gigabyte: 0.2356549166968582
inform: 0.23551955941362387
old: 0.23544632387548237
indexing: 0.23543697809057115
15k: 0.23539983762811417
caption: 0.2353960185301936
transparent: 0.23499031242861715
preprint: 0.23477809957665355
highquality: 0.23353857310099999
12295: 0.23353773300205657
139: 0.2335377330020565

inspection: -0.05521665597775702
scatterplot: -0.05543198934356784
diastolic: -0.05543764044642233
elementary: -0.05559100766851736
onesided: -0.056059816486080105
hero: -0.05640758558973626
verify: -0.05659961744334001
differently: -0.05710219282869852
net: -0.05770655266205389
mse: -0.05788464151865209
tangent: -0.05788464151865209
imputation: -0.05789274398034998
obtained: -0.05793999891922323
absolutely: -0.05803540614068685
naturally: -0.05813574419034956
incorrectly: -0.058377419282653206
message: -0.0590375014047182
replicates: -0.059211336557136916
prospective: -0.05943933972599237
deviance: -0.059516292770404315
adequate: -0.05953616667937787
lagging: -0.05958250164903334
taught: -0.059674011958187265
long: -0.05969899470385813
supplementary: -0.05996643728112567
necessarily: -0.06060280392704846
unaware: -0.060605198126008925
simplicity: -0.061834315423285574
identified: -0.06230372036041485
83: -0.06262534399475223
experienced: -0.0627270064542494
simplified: -0.062919310112

unit: -0.6906930782957097
eda: -0.6907547254588645
15: -0.6909675689967859
adding: -0.6917411200927219
installation: -0.6929217821729476
rlang: -0.6929217821729476
sense: -0.6932248898676011
define: -0.6932639598612226
social: -0.6940727895782612
quarterly: -0.6942028413309478
apologize: -0.6943384655103402
percentage: -0.694503418479097
said: -0.6945535538514753
last: -0.6948632754887177
computed: -0.6970136936351483
sorry: -0.6992887303072084
beginner: -0.6999938423536659
height: -0.7006845751441796
necessary: -0.7007256712272933
drug: -0.7010930935719752
zero: -0.703469022796507
coworker: -0.705025436975363
reasonable: -0.707149400122184
mixedeffect: -0.7099676481019016
listing: -0.710307348494219
claim: -0.7117852450574854
seasonal: -0.7119934376095091
questionnaire: -0.7129170568423474
insignificant: -0.7132203924666433
operation: -0.713667305241143
chi2: -0.7138737830914467
lowest: -0.7149668435120345
stage: -0.7159206040467172
boat: -0.7160205853478597
decide: -0.716296550076488

The output above lists the entries of 'sorted_feat_name_coefs_descending' in descending order, beginning with the most positive coefficient. Since class 1 is machine learning and class 0 is statistics, a more positive coefficient pushes a post towards the machine learning topic, while a more negative coefficient pushes it towards the statistics topic. Words such as 'llm', 'ai', 'ml', 'model', 'machine', 'training', 'chatgpt' and 'neural' are among those with highly positive coefficients and are contextually related to machine learning. Words such as 'statistic', 'statistical', 'variable', 'stats', 'test', 'sample', 'variance' and 'interval' are among those with highly negative coefficients and are contextually related to statistics.

##### Model summary and performance successes/downfalls:

Logistic regression is used here for binary classification. It models the relationship between the target variable (i.e., the subreddit topic) and the independent variables (i.e., the features derived from the text data) by estimating class probabilities with a logistic function. The model performs well: the gap between the train score (1.0) and the test score (0.96) is small. The test score is the accuracy, so 0.96 indicates a high proportion of correct classifications across both the statistics and machine learning topics. Specificity (the ratio of true negatives to the sum of true negatives and false positives) is also 0.96, meaning that about 96% of statistics posts (class 0) are correctly identified. The baseline model predicts every instance as negative (class 0) and reaches an accuracy of only 0.54, so the logistic regression model outperforms it by a wide margin (0.96 versus 0.54).
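The sketch below works through specificity and accuracy on hypothetical predictions, for both a model and the majority-class baseline. The arrays and the `specificity` helper are made up for illustration. Because the baseline always predicts class 0, it attains a specificity of 1.0 trivially, so accuracy is the like-for-like metric for the baseline comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical test labels and predictions (0 = statistics, 1 = machine learning).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_model = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])
y_base = np.zeros_like(y_true)  # majority-class baseline: always predict 0

def specificity(y_true, y_pred):
    # labels=[0, 1] fixes the 2x2 layout even if a class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)

model_spec = specificity(y_true, y_model)
base_spec = specificity(y_true, y_base)
model_acc = accuracy_score(y_true, y_model)
base_acc = accuracy_score(y_true, y_base)
print(model_spec, base_spec, model_acc, base_acc)
```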

It is a simple and interpretable algorithm: the coefficients can be reviewed alongside their corresponding words to understand which words relate to each subreddit topic and how important they are relative to one another. Its low computational complexity also makes it efficient for high-dimensional data such as this, where each distinct word in the text becomes its own feature after vectorization.

However, logistic regression assumes a linear relationship between the features and the log-odds of the target variable, so it may not work well when the relationship is highly non-linear. In addition, machine learning and statistics share the same objective, which is to learn from data and provide insights, and differ mainly in how they achieve it. The words used to describe each topic may therefore overlap to some extent.

#### iii. Random forest model

In [14]:
# instantiate
rf = RandomForestClassifier()

# define params
rf_params = {'n_estimators': [3, 5, 10, 20, 25, 30, 50, 100, 150],
             'max_depth': [None, 1, 2, 3, 4, 5]}

# perform grid search with cross-validation
rf_gs = GridSearchCV(rf, param_grid = rf_params, cv = 5)
rf_gs.fit(X_train_tfid, y_train)

# identify best params
rf_best = rf_gs.best_params_

# instantiate new model with best params
rf_new = RandomForestClassifier(**rf_best)

# fit model using best params
rf_new.fit(X_train_tfid, y_train)

# make predictions
y_pred_rf = rf_new.predict(X_test_tfid)

print('Best params:', rf_best)

Best params: {'max_depth': None, 'n_estimators': 150}


In [15]:
# evaluate model
print('Train Score:', rf_new.score(X_train_tfid, y_train))
print('Test Score:', rf_new.score(X_test_tfid, y_test))

Train Score: 1.0
Test Score: 0.9645232815964523


In [16]:
# calculate specificity
# save confusion matrix values
rf_tn, rf_fp, rf_fn, rf_tp = confusion_matrix(y_test, y_pred_rf).ravel()

rf_spec = rf_tn / (rf_tn + rf_fp)

print('Specificity:', rf_spec)

Specificity: 0.9504132231404959


In [17]:
# retrieve feature importance scores
feat_importance = rf_new.feature_importances_

# get the indices of the top 30 features based on scores
top_indices = np.argsort(feat_importance)[::-1][:30]

for i in top_indices:
    feature_name = feature_names[i]
    print('Feature: %s, Score: %.5f' % (feature_name, feat_importance[i]))

Feature: one, Score: 0.02752
Feature: statistic, Score: 0.02586
Feature: use, Score: 0.02245
Feature: way, Score: 0.02073
Feature: model, Score: 0.01862
Feature: llm, Score: 0.01833
Feature: variable, Score: 0.01741
Feature: also, Score: 0.01629
Feature: using, Score: 0.01575
Feature: make, Score: 0.01399
Feature: ai, Score: 0.01218
Feature: ive, Score: 0.01134
Feature: dont, Score: 0.01114
Feature: much, Score: 0.01077
Feature: training, Score: 0.01077
Feature: anyone, Score: 0.01074
Feature: sample, Score: 0.01002
Feature: however, Score: 0.00975
Feature: image, Score: 0.00839
Feature: thanks, Score: 0.00838
Feature: stats, Score: 0.00825
Feature: train, Score: 0.00821
Feature: help, Score: 0.00753
Feature: used, Score: 0.00752
Feature: test, Score: 0.00745
Feature: regression, Score: 0.00727
Feature: ml, Score: 0.00716
Feature: statistical, Score: 0.00684
Feature: text, Score: 0.00676
Feature: machine, Score: 0.00675


The output above presents the top 30 features ranked by feature importance score, which provides a means of identifying important predictors. Keywords such as 'statistic', 'model', 'llm', 'ai', 'variable', 'training' and 'regression' appear in the list, which broadly agrees with the words identified earlier from the Logistic Regression model.

##### Model summary and performance successes/downfalls:

The Random Forest model is an ensemble learning method that combines multiple decision trees to make predictions. Like the Logistic Regression model, it performs well, with a small gap between the train score (1.0) and the test score (0.96). The test accuracy of 0.96 again indicates a high proportion of correct classifications across both the statistics and machine learning topics, and far exceeds the baseline accuracy of 0.54. The specificity of 0.95 shows that about 95% of statistics posts (class 0) are correctly identified.

In contrast to the Logistic Regression model, the Random Forest model can capture complex non-linear relationships between the target variable (i.e., the subreddit topic) and the independent variables (i.e., the features derived from the text data). It also supports variable selection through the feature importance scores (as illustrated in the earlier section), and averaging over multiple decision trees reduces overfitting relative to a single tree.

However, the Random Forest model is less interpretable than the Logistic Regression model: feature importance scores are unsigned, so they do not show which subreddit topic a word points to. Using multiple decision trees also adds computational complexity, which may result in a higher compute cost.
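A short sketch on synthetic data illustrates the interpretability point. Unlike signed logistic regression coefficients, random forest importances are non-negative and normalised to sum to 1, so they rank features without indicating a class direction. The data and names below are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to make the sketch runnable.
X_toy, y_toy = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=0)

rf_toy = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)
imp = rf_toy.feature_importances_

# Indices of the three most important features, highest first.
top3 = np.argsort(imp)[::-1][:3]
print(top3, imp[top3], imp.sum())
```

Setting `random_state` on the forest, as done here, also makes the importance ranking reproducible between runs.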

### c. Selection of production model

The following table summarises the outcome of the Logistic Regression and Random Forest models (rounded to 2 decimal places).

|Model|Train Score|Test Score|Specificity|
|:---|:---|:---|:---|
|Logistic Regression|1.00|0.96|0.96|
|Random Forest|1.00|0.96|0.95|

Both models perform comparably, and both far outperform the baseline model, which indicates that they have learnt substantially from the data and can classify new posts relatively well. Since their performance is comparable, the Logistic Regression model is selected as the production model for its lower computational complexity (and deployment cost) and higher interpretability.