### Modelling

In [3]:
 #!conda install -c conda-forge imbalanced-learn --yes

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,  roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from nltk.corpus import stopwords
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings('ignore')




In [5]:
combined = pd.read_csv('datasets/combined.csv')
combined.shape

(59129, 12)

In [6]:
# select X and y columns we need
df =  combined[['Subreddit', 'preprocessed_words']]

# add label for classification
df['is_amd'] = df['Subreddit'].apply(lambda x: 1 if x == "AMD" else 0)
df = df.drop(columns = 'Subreddit')
df = df.rename(columns={'preprocessed_words':'text'})

In [7]:
df.head()

Unnamed: 0,text,is_amd
0,keeps its price from the vram it has a is a ta...,0
1,it s not you just a combo of the game running ...,0
2,starfield is cpu heavy which one do you have a...,0
3,the game is just awfully optimized i just upgr...,0
4,i tried using fsr but i did nt see any noticib...,0


In [8]:
# Specify Stopwords
custom_stopwords = [ "subreddit", "reddit"]  # remove these words as it is not meaningful for our analysis
stopwords_list = list(set(stopwords.words('english') + custom_stopwords))

For this notebook on Modelling, I will only be considering the "text" column of the scraped dataset, and this has been pre-processed in notebook 2. This ensures that our model can be properly trained on the content of the subreddit posts.


##### Baseline 
We always begin with creating a baseline model.

In [9]:
# Baseline model
X = df['text']
y = df['is_amd']
y.value_counts(normalize=True)

is_amd
1    0.634545
0    0.365455
Name: proportion, dtype: float64

#### Model Preparation


Steps I took for this section:
1. Train Test Split
2. Instantiating Vectorizers and Models
3. Creating a User Define Function* with Scikit-learn's Pipeline tool that will help calculate the relevant classification metrics from each model (Metrics include Accuracy, Specificity and F1_Score)
4. Evaluate best model
5. Tune Hyper-parameters of best model



In [10]:
# Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)
X_train = X_train.values.astype('U')
X_test = X_test.values.astype('U')
print(X_train.shape)
print(X_test.shape)

(47303,)
(11826,)


In [11]:
cvec = CountVectorizer(stop_words=stopwords_list)
cvec.fit(X_train)
X_train = cvec.transform(X_train) #transform the corpus

In [12]:
print(cvec.get_feature_names_out())
print(X_train.shape)

['aa' 'aaa' 'aaaaa' ... 'zx' 'zz' 'zzx']
(47303, 32124)


In [13]:
# Transform test
X_test = cvec.transform(X_test)

##### 1. Train Test Split

In [14]:
# Redefine train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)
X_train = X_train.values.astype('U')
X_test = X_test.values.astype('U')

From the Baseline model, we can see moderate imbalance in our dataset (65%-35%). Hence, we can use the SMOTE technique to correct this.

In [16]:
X_train.shape

(47303,)

In [15]:
##Now we can create synthetic data for our training set

sm = SMOTE()
Xsm_train, ysm_train = sm.fit_resample(X_train, y_train)

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/main/win-64/current_repodata.json HTTP/1.1" 304 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/r/win-64/current_repodata.json HTTP/1.1" 304 0
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/main/noarch/current_repodata.json HTTP/1.1" 304 0
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): repo.anaconda.com:443
DEBUG:urllib3.connectionpool:https://repo.anaconda.com:443 "GET /pkgs/r/noarch/current_repodata.json HTTP/1.1" 304 0
DEBUG:urll

ValueError: Expected 2D array, got 1D array instead:
array=['in most cases ai is if else combo i m working with it ai is a magic word for sale'
 'ha i did the same thing on my asus board i was like how the fuck do you disable this thing lol'
 'man amd did an amd again and copied nvidia rtx ti shenanigans at least the rxxt is much more obvious choice over the rtx ti gb and obviously gb honestly would pay bucks more for rtx for the raytracing'
 ... 'deleted'
 'third party sellers on amazon are scalping it for usd and apparently it s still selling at that price'
 'i m pretty sure you will be disappointed because after today s todd interview i am it s clear as day they wo nt do any real improvements'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

##### 2. Instantiation of Vectorizers and Models

I will be exploring different Classification algorithms and using both Count Vectorizer or Term Frequency-Inverse Document Frequency (TFIDF) transformers:
- Count Vectorizer: Takes every word as a token, and uses it as a feature.
- TFIFD: accounts for frequency of a word in a given document and the frequency between documents. Word importance increases proportionally to the number of times it appears in a document, but is offset by frequency of word in entire corpus.


In [None]:
# Instantiate Vectorizers
vectorizers = {'cvec': CountVectorizer(stop_words=stopwords_list),
               'tvec': TfidfVectorizer(stop_words=stopwords_list)
}

In [None]:
# Instiantiate models
models = {'nb': MultinomialNB(),
          'log_reg': LogisticRegression(max_iter=500, random_state=123),
          'rf': RandomForestClassifier(random_state=123),
          'knn': KNeighborsClassifier()}

##### 3. User Define Function - inputs required are vectorizer and model

In [None]:

df_results = []

def clf_model(vec, mod, cv_num, X_train, y_train, vec_params={},  grid_search=False):   # option to include Grid Search
    
    results = {}
    
    pipe = Pipeline([
            (vec, vectorizers[vec]),
            (mod, models[mod])
            ])
    
    if grid_search:
        if mod == 'rf':
            rs = RandomizedSearchCV(pipe, param_distributions = {**vec_params}, cv=cv_num, verbose=1, n_jobs=-1) 
            rs.fit(X_train, y_train)
            pipe = rs

        else:
            gs = GridSearchCV(pipe, param_grid = {**vec_params}, cv=cv_num, verbose=1, n_jobs=-1)
            gs.fit(X_train, y_train)
            pipe = gs
        
    else:
        pipe.fit(X_train, y_train)
    
    # Get predictions
    preds = pipe.predict(X_test)

    # Confusion Matrix
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    cm = confusion_matrix(y_test, preds)
    tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
    acc = (tp + tn)/ (tp+tn+fp+fn)
    spec = tn / (tn + fp)
    
    # Retrieve metrics
    results['Model'] = mod
    results['Vectorizer'] = vec
    results['Train Score'] = pipe.score(X_train, y_train)
    results['Test Score'] = pipe.score(X_test, y_test)
    results['Accuracy'] = acc
    results['Specificity'] = spec
    results['f_score'] = f1_score(y_test, preds)
    results['ROC_AUC'] = roc_auc_score(y_test, preds)

    
    
    if grid_search:
        tuning_results.append(results)
        print(f"--- Best Parameters for {mod},{vec} ---")
        display(pipe.best_params_)
        
        
    else:
        df_results.append(results)
    
    print(f"--- METRICS for {mod},{vec} ---")
    display(results)
    
    return pipe

In [None]:
# Multinomial Naive Bayes
cvec_nb = clf_model('cvec', 'nb', 5)
tvec_nb = clf_model('tvec', 'nb', 5)

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.7876878844893559,
 'Test Score': 0.7323693556570269,
 'Accuracy': 0.7323693556570269,
 'Specificity': 0.5826006478482184,
 'f_score': 0.7951588893922723,
 'ROC_AUC': 0.7006153559070516}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.7469082299219922,
 'Test Score': 0.7006595636732623,
 'Accuracy': 0.7006595636732623,
 'Specificity': 0.23739009717723275,
 'f_score': 0.8039867109634551,
 'ROC_AUC': 0.6024370528530087}

In [None]:
# Logistic Regression
cvec_lr = clf_model('cvec', 'log_reg', 5)
tvec_lr = clf_model('tvec', 'log_reg', 5)


--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.8479166226243579,
 'Test Score': 0.736935565702689,
 'Accuracy': 0.736935565702689,
 'Specificity': 0.5367885238315595,
 'f_score': 0.8043519275517262,
 'ROC_AUC': 0.6945003386748416}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.7980677758281716,
 'Test Score': 0.7448841535599526,
 'Accuracy': 0.7448841535599526,
 'Specificity': 0.501850994909764,
 'f_score': 0.814873903172363,
 'ROC_AUC': 0.6933562010796155}

In [None]:
# Logistic Regression using SMOTE data
cvec_lr = clf_model('cvec', 'log_reg', 5, Xsm_train, ysm_train)
tvec_lr = clf_model('tvec', 'log_reg', 5, Xsm_train, ysm_train)


--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.8479166226243579,
 'Test Score': 0.736935565702689,
 'Accuracy': 0.736935565702689,
 'Specificity': 0.5367885238315595,
 'f_score': 0.8043519275517262,
 'ROC_AUC': 0.6945003386748416}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.7980677758281716,
 'Test Score': 0.7448841535599526,
 'Accuracy': 0.7448841535599526,
 'Specificity': 0.501850994909764,
 'f_score': 0.814873903172363,
 'ROC_AUC': 0.6933562010796155}

In [None]:
# Random Forest
cvec_rf = clf_model('cvec', 'rf', 3) # put 3-fold cross validation for random forest to reduce runtime
tvec_rf = clf_model('tvec', 'rf', 3)

Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "c:\Users\tiffa\Documents\DSIF-SG-11\.conda\Lib\site-packages\IPython\core\interactiveshell.py", line 3526, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\tiffa\AppData\Local\Temp\ipykernel_95988\699685553.py", line 2, in <module>
    cvec_rf = clf_model('cvec', 'rf', 3) # put 3-fold cross validation for random forest to reduce runtime
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tiffa\AppData\Local\Temp\ipykernel_95988\3848433570.py", line 24, in clf_model
    pipe.fit(X_train, y_train)
  File "c:\Users\tiffa\Documents\DSIF-SG-11\.conda\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
  File "c:\Users\tiffa\Documents\DSIF-SG-11\.conda\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    pipeline.
        ^^^^^^
  File "c:\Users\tiffa\Documents\DSIF-SG-11\.conda\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
  File "c:\Users\tiffa\Documents\DSIF-SG-11\.conda\Lib\site-p

In [None]:
# KNN
cvec_knn = clf_model('cvec', 'knn', 5)
tvec_knn = clf_model('tvec', 'knn', 5)
pd.DataFrame(df_results)

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7651945965372176,
 'Test Score': 0.6431591408760359,
 'Accuracy': 0.6431591408760359,
 'Specificity': 0.4449329014345211,
 'f_score': 0.7292441935069934,
 'ROC_AUC': 0.6011311628707786}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6845020400397438,
 'Test Score': 0.6421444275325554,
 'Accuracy': 0.6421444275325554,
 'Specificity': 0.07681628875520592,
 'f_score': 0.7743655363616976,
 'ROC_AUC': 0.5222834109021233}

Unnamed: 0,Model,Vectorizer,Train Score,Test Score,Accuracy,Specificity,f_score,ROC_AUC
0,nb,cvec,0.787688,0.732369,0.732369,0.582601,0.795159,0.700615
1,nb,tvec,0.746908,0.70066,0.70066,0.23739,0.803987,0.602437
2,rf,cvec,0.990529,0.728902,0.728902,0.499769,0.801191,0.680321
3,rf,tvec,0.990445,0.728057,0.728057,0.442619,0.806382,0.667538
4,knn,cvec,0.765195,0.643159,0.643159,0.444933,0.729244,0.601131
5,knn,tvec,0.684502,0.642144,0.642144,0.076816,0.774366,0.522283


#### 2.2 Hyperparameter Tuning of Models

To train a robust machine learning model, we need to select the correct combination of hyperparameters.
Recall that in the user defined function created earlier, if "Gridsearch" = True, the function will perform a gridsearch to find the optimal hyperparameters that will give the best score. GridsearchCV searches all combinations of paramters in for a model that will give the best peformance score. It is not practically feasible to run a GridSearchCV on all models due to the complexity and hence time taken. Hence, I will select only the best 3 models to perform hyperparameter tuning.

From the above results table, seems like the Multinomial Naive Bayes Model, Logistic Regression and Random Forest perform the best.

- Multinomial Naive Bayes: Based on Bayes's theorem - the assumption that each feature (in our case, each word) is independent of each other.
- Logistic Regression:
- Random Forest: Consists of n number of decision trees that act as an ensemble. Each decision tree makes a class prediction and the class with the most votes becomes the model's prediction.    

In [None]:
tuning_results = []

# Vectorizer Parameters

cvec_params = {
    'cvec__max_features': [None],
    'cvec__min_df':[3, 4, 5],
    'cvec__max_df': [0.2, 0.3, 0.4],
    'cvec__stop_words': [stopwords_list],
    'cvec__ngram_range':[(1,1), (1,2)]
}


tvec_params = {
    'tvec__max_features': [None],
    'tvec__min_df':[3, 4, 5],
    'tvec__max_df': [0.2, 0.3, 0.4],
    'tvec__stop_words': [stopwords_list],
    'tvec__ngram_range':[(1,1), (1,2)]
}

rf_pipe_cvec_params = {
    'cvec__max_features': [100],
    'cvec__max_df': [0.2, 0.3],
    'cvec__min_df': [1, 2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [50,75,100]}


rf_pipe_tvec_params = {
    'tvec__max_features': [100],
    'tvec__max_df': [0.2, 0.3],
    'tvec__min_df': [1, 2, 3],
    'tvec__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [50,75,100]}




In [None]:
# Tune for Logistic Regression
cvec_lr_gs = clf_model('cvec', 'log_reg', 3, vec_params=cvec_params, grid_search=True)


Fitting 5 folds for each of 18 candidates, totalling 90 fits
--- Best Parameters for log_reg,cvec ---


{'cvec__max_df': 0.3,
 'cvec__max_features': None,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'cvec__stop_words': ['of',
  'o',
  'if',
  'most',
  'as',
  'into',
  'yourselves',
  'any',
  "weren't",
  'ours',
  'itself',
  'that',
  'our',
  'doesn',
  "aren't",
  'd',
  'having',
  'out',
  "shan't",
  'both',
  "you're",
  'other',
  'here',
  "shouldn't",
  'all',
  'she',
  'aren',
  'off',
  'herself',
  'yourself',
  'should',
  'll',
  'were',
  "needn't",
  'how',
  'a',
  'about',
  'hasn',
  "haven't",
  'whom',
  "she's",
  'nor',
  'wasn',
  'ma',
  'now',
  'by',
  'them',
  "don't",
  'y',
  'what',
  "doesn't",
  'can',
  'they',
  'being',
  'not',
  'been',
  'no',
  'which',
  'than',
  "won't",
  'mightn',
  'its',
  'during',
  'between',
  'why',
  'because',
  'shouldn',
  'before',
  'or',
  'the',
  "mightn't",
  'against',
  'don',
  "couldn't",
  'hadn',
  'very',
  'does',
  'your',
  'have',
  'but',
  'down',
  'couldn',
  'subreddit',
  'on',
  

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.9349724118977655,
 'Test Score': 0.7404870624048706,
 'Accuracy': 0.7404870624048706,
 'Specificity': 0.5552984729291994,
 'f_score': 0.8055502756130014,
 'ROC_AUC': 0.7012233302812308}

Fitting 5 folds for each of 18 candidates, totalling 90 fits
--- Best Parameters for log_reg,tvec ---


{'tvec__max_df': 0.2,
 'tvec__max_features': None,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': ['of',
  'o',
  'if',
  'most',
  'as',
  'into',
  'yourselves',
  'any',
  "weren't",
  'ours',
  'itself',
  'that',
  'our',
  'doesn',
  "aren't",
  'd',
  'having',
  'out',
  "shan't",
  'both',
  "you're",
  'other',
  'here',
  "shouldn't",
  'all',
  'she',
  'aren',
  'off',
  'herself',
  'yourself',
  'should',
  'll',
  'were',
  "needn't",
  'how',
  'a',
  'about',
  'hasn',
  "haven't",
  'whom',
  "she's",
  'nor',
  'wasn',
  'ma',
  'now',
  'by',
  'them',
  "don't",
  'y',
  'what',
  "doesn't",
  'can',
  'they',
  'being',
  'not',
  'been',
  'no',
  'which',
  'than',
  "won't",
  'mightn',
  'its',
  'during',
  'between',
  'why',
  'because',
  'shouldn',
  'before',
  'or',
  'the',
  "mightn't",
  'against',
  'don',
  "couldn't",
  'hadn',
  'very',
  'does',
  'your',
  'have',
  'but',
  'down',
  'couldn',
  'subreddit',
  'on',
  

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8311523582013826,
 'Test Score': 0.7434466429900219,
 'Accuracy': 0.7434466429900219,
 'Specificity': 0.49074502545118,
 'f_score': 0.8147288715192965,
 'ROC_AUC': 0.6898687813823065}

Fitting 3 folds for each of 18 candidates, totalling 54 fits


KeyboardInterrupt: 

In [None]:
pd.DataFrame(tuning_results)

Unnamed: 0,Model,Vectorizer,Train Score,Test Score,Accuracy,Specificity,f_score,ROC_AUC
0,log_reg,cvec,0.934972,0.740487,0.740487,0.555298,0.80555,0.701223
1,log_reg,tvec,0.831152,0.743447,0.743447,0.490745,0.814729,0.689869


In [None]:
# Tune for Random Forest using RandomizedSearchCV to improve model runtime
cvec_rf_gs = clf_model('cvec', 'rf', 3, vec_params=rf_pipe_cvec_params, grid_search=True)


Fitting 3 folds for each of 10 candidates, totalling 30 fits


KeyboardInterrupt: 

In [None]:
tvec_rf_gs = clf_model('tvec', 'rf', 3, vec_params=rf_pipe_tvec_params, grid_search=True)


NameError: name 'results' is not defined

In [None]:
pd.DataFrame(tuning_results)

In [None]:
lr_tvec_results = {
    'Model': "Logistic Regression",
    'Vectorizer': "TVEC",
    'Train Score': tvec_lr_gs.score(X_train, y_train),
    'Test Score': tvec_lr_gs.score(X_test, y_test),
    'Accuracy': acc,
    'Specificity': spec,
    'F1 Score': f1_score
}

results_df.append(lr_tvec_results)
pd.DataFrame(results_df)


Unnamed: 0,Model,Vectorizer,Train Score,Test Score,Accuracy,Specificity,F1 Score
0,Logistic Regression,CVEC,0.680676,0.681549,0.670726,0.217955,
1,Logistic Regression,TVEC,0.672114,0.670726,0.670726,0.217955,0.787568


#### 3.2 Random Forest using *TFIFD Vectorizer*