### III: Modelling & Tuning


In [1]:
 #!conda install -c conda-forge imbalanced-learn --yes

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, accuracy_score, precision_score, recall_score, f1_score
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must be imported before HalvingGridSearchCV)
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')


In [3]:
combined = pd.read_csv('datasets/combined.csv')

combined.shape

(60778, 12)

In [4]:
combined.head()

Unnamed: 0.1,Unnamed: 0,id,date,title,selftext,n_comments,author,comment,Subreddit,sentences,words,preprocessed_words
0,0,16d4gg0,2023-09-08 08:04:04,New to pc building,What’s better rtx 3090 for 700$-I’ve seen thes...,1.0,43VerLoner,3090 keeps its price from the VRAM it has. A 4...,Nvidia,"['3090 keeps its price from the VRAM it has.',...","['3090', 'keeps', 'its', 'price', 'from', 'the...",keeps its price from the vram it has a is a ta...
1,1,16d46z5,2023-09-08 07:47:35,Experiences,I personally play on a typical 1080ti system(...,7.0,SEALEJ2001,"It's not you, just a combo of the game running...",Nvidia,"[""It's not you, just a combo of the game runni...","['It', ""'s"", 'not', 'you', ',', 'just', 'a', '...",it s not you just a combo of the game running ...
2,2,16d46z5,2023-09-08 07:47:35,Experiences,I personally play on a typical 1080ti system(...,7.0,SEALEJ2001,"Starfield is CPU heavy, which one do you have?...",Nvidia,"['Starfield is CPU heavy, which one do you hav...","['Starfield', 'is', 'CPU', 'heavy', ',', 'whic...",starfield is cpu heavy which one do you have a...
3,3,16d46z5,2023-09-08 07:47:35,Experiences,I personally play on a typical 1080ti system(...,7.0,SEALEJ2001,The game is just awfully optimized. I just upg...,Nvidia,"['The game is just awfully optimized.', 'I jus...","['The', 'game', 'is', 'just', 'awfully', 'opti...",the game is just awfully optimized i just upgr...
4,4,16d46z5,2023-09-08 07:47:35,Experiences,I personally play on a typical 1080ti system(...,7.0,SEALEJ2001,I tried using fsr but I didn't see any noticib...,Nvidia,"[""I tried using fsr but I didn't see any notic...","['I', 'tried', 'using', 'fsr', 'but', 'I', 'di...",i tried using fsr but i did nt see any noticib...


In [5]:
# select X and y columns we need
df = combined[['Subreddit', 'preprocessed_words']].copy()  # copy so the new column is not set on a slice of `combined`

# add label for classification: 1 = AMD, 0 = Nvidia
df['is_amd'] = df['Subreddit'].apply(lambda x: 1 if x == "AMD" else 0)
df = df.drop(columns='Subreddit')
df = df.rename(columns={'preprocessed_words': 'text'})

In [6]:
df.head()

Unnamed: 0,text,is_amd
0,keeps its price from the vram it has a is a ta...,0
1,it s not you just a combo of the game running ...,0
2,starfield is cpu heavy which one do you have a...,0
3,the game is just awfully optimized i just upgr...,0
4,i tried using fsr but i did nt see any noticib...,0


In [7]:
# Specify stopwords
custom_stopwords = ["subreddit", "reddit"]  # remove these words as they are not meaningful for our analysis
stopwords_list = list(set(stopwords.words('english') + custom_stopwords))

In [8]:
# Let's also import the top common words of both subreddits from Notebook 2
common_words = pd.read_csv('datasets/common_words.csv', index_col=[0])
common_words_30 = common_words.head(20)  # note: despite the variable name, this keeps the top 20 rows

# Create an additional stopwords list that also includes the top 20 common words
stopwords_list_with_common = list(set(stopwords.words('english') + custom_stopwords + common_words_30['0'].values.tolist()))

In [9]:
stopwords_list_with_common

['after',
 'why',
 "weren't",
 "shan't",
 'good',
 'amd',
 'nvidia',
 'who',
 'being',
 'because',
 'won',
 'was',
 'at',
 'both',
 'such',
 'own',
 'fps',
 'it',
 'hers',
 'under',
 'when',
 'having',
 'shouldn',
 'subreddit',
 'xt',
 'hasn',
 'wasn',
 'did',
 'should',
 'fsr',
 'for',
 'those',
 'were',
 'a',
 'no',
 "you're",
 'what',
 'game',
 'weren',
 'its',
 'didn',
 'would',
 'only',
 'out',
 "don't",
 'on',
 'their',
 'this',
 'or',
 'very',
 'they',
 'her',
 "hasn't",
 'o',
 'can',
 "mustn't",
 'hadn',
 'the',
 'we',
 'from',
 'doesn',
 'your',
 "haven't",
 'over',
 'gpu',
 'against',
 "aren't",
 "wasn't",
 "it's",
 'up',
 "doesn't",
 'to',
 'nor',
 'not',
 'once',
 'which',
 'below',
 'am',
 'how',
 'other',
 'shan',
 'one',
 "you'll",
 't',
 'each',
 'so',
 'y',
 "hadn't",
 'isn',
 'ma',
 'mustn',
 'and',
 'ain',
 'be',
 'd',
 'than',
 'as',
 'needn',
 "that'll",
 "should've",
 "you've",
 "she's",
 'ourselves',
 're',
 'wouldn',
 'between',
 'do',
 'of',
 'll',
 'him',
 'he

In this modelling notebook, I will only consider the "text" column of the scraped dataset, which was pre-processed in Notebook 2. This ensures that the model is trained only on the content of the subreddit posts.


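##### Baseline
We begin by establishing a baseline. A model that always predicts the majority class is correct exactly as often as that class occurs, so the class proportions below give the accuracy that any useful classifier must beat.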

In [15]:
# Baseline model
X = df['text']
y = df['is_amd']
y.value_counts(normalize=True)

is_amd
1    0.629734
0    0.370266
Name: proportion, dtype: float64
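
About 63% of the rows belong to the AMD class, so always predicting AMD would score roughly 0.63 accuracy. The sketch below shows the same idea with scikit-learn's `DummyClassifier` on a hypothetical toy label vector (`X_toy` and `y_toy` are illustrative stand-ins, not the notebook's data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the same class balance as the notebook's target
y_toy = np.array([1] * 63 + [0] * 37)
# The dummy model ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Accuracy of the majority-class baseline equals the majority-class share
baseline_accuracy = baseline.score(X_toy, y_toy)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any model evaluated later should be compared against this figure rather than against 50%.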

#### Model Preparation


Steps taken in this section:
1. Train-test split
2. Instantiate vectorizers and models
3. Create a user-defined function, built on scikit-learn's Pipeline, that computes the relevant classification metrics for each model (accuracy, specificity, F1 score and ROC AUC)
4. Evaluate the best model
5. Tune the hyperparameters of the best model



In [13]:
# Train Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)
X_train = X_train.values.astype('U')
X_test = X_test.values.astype('U')
print(X_train.shape)
print(X_test.shape)

(48622,)
(12156,)


In [14]:
cvec = CountVectorizer(stop_words=stopwords_list)
cvec.fit(X_train)
X_train = cvec.transform(X_train) #transform the corpus

In [15]:
print(cvec.get_feature_names_out())
print(X_train.shape)

['aa' 'aaa' 'aaaaa' ... 'zz' 'zzx' 'zzzzz']
(48622, 32966)


In [16]:
# Transform test
X_test = cvec.transform(X_test)

##### 1. Train Test Split

In [16]:
# Redefine train and test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)
X_train = X_train.values.astype('U')
X_test = X_test.values.astype('U')

In [18]:
X_train

array(['must be something weird with nanite i m not sure to be honest maybe they are potato quality because they wanted consoles at fps and they didn t update them for pc i m just guessing though',
       'deleted', 'the switch port doesnt have fsr', ...,
       'no x either but can just look at the x for it i suppose since it s pretty much the same in gaming mostly',
       'yeah it was weird what s even weirder is that none of the dx s advertised performance advantages ever materialised in regular games i guess that game developers decided to use the freed resources to make their games even less optimized',
       'same here got mine in on friday to take the place of my and it s such a massive improvement i just wish fsr could remove jaggies and shinies as well as dlss does'],
      dtype='<U6613')

##### 2. Instantiation of Vectorizers and Models

2.1. I will be exploring different classification algorithms:
- Multinomial Naive Bayes: Applies Bayes' theorem with the "naive" assumption that each feature (in our case, each word) is independent of the others given the class.
- Logistic Regression: A linear model that passes a weighted sum of the features through the logistic (sigmoid) function to estimate the probability of the positive class.
- Random Forest: Consists of n decision trees that act as an ensemble. Each decision tree makes a class prediction, and the class with the most votes becomes the model's prediction.
- K-Nearest Neighbors Classifier: Assigns each document the majority class among its k closest training documents in feature space.

2.2. I will also be evaluating the models using both the Count Vectorizer and the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer:
- Count Vectorizer: Takes every word as a token and uses its count in each document as a feature.
- TF-IDF: Accounts for both the frequency of a word in a given document and how many documents contain it. A word's importance increases with the number of times it appears in a document, but is offset by how common the word is across the entire corpus.

In [10]:
# Instantiate models
models = {'nb': MultinomialNB(),
          'log_reg': LogisticRegression(max_iter=500, random_state=123),
          'rf': RandomForestClassifier(random_state=123),
          'knn': KNeighborsClassifier()}

##### 3. User-Defined Function - inputs required are vectorizer and model

3.1 N-gram range option: explores whether different n-gram ranges give better results. I included this based on the EDA in Notebook 2, which suggested that some bigrams and trigrams might be good predictors of whether a post comes from the AMD or the Nvidia subreddit. <br>

3.2 List of stopwords: explores whether the standard NLTK English stopword list, or that list plus the most common words shared by both subreddits, produces better results. Removing the shared common words retains the words that are more specific to each subreddit. This might help the model capture the distinctive language and topics of each community, hence improving model performance.

3.3 The function records the classification results of each run (Accuracy, Specificity, F1 score, ROC AUC); these are later combined into a dataframe.
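
As a reference for how these metrics relate to the confusion matrix, the sketch below computes them by hand on hypothetical labels (`y_true` and `y_pred` are invented for illustration) and checks the F1 value against scikit-learn's `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels: 1 = AMD, 0 = Nvidia
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
specificity = tn / (tn + fp)                 # share of actual negatives predicted as negative
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, specificity, f1)
```

Specificity is reported alongside F1 because, with AMD as the majority class, a model can reach a high F1 while still misclassifying many Nvidia posts.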

In [17]:
df_results = []

def clf_model(vec, mod, stopwords, ngram_range, cv_num, stopword_label):
    """Fit a vectorizer + classifier pipeline and record its metrics on the test set.

    cv_num is reserved for an optional grid search and is not used yet.
    """
    # Vectorizer parameters shared by both vectorizer types
    vec_params = {'ngram_range': ngram_range, 'stop_words': stopwords}
    results = {}

    # Instantiate the vectorizer
    if vec == 'cvec':
        vectorizer = CountVectorizer(**vec_params)
    elif vec == 'tvec':
        vectorizer = TfidfVectorizer(**vec_params)
    else:
        raise ValueError("Invalid 'vec' parameter. Supported values are 'cvec' and 'tvec'.")

    # The pipeline vectorizes the raw text before passing it to the model
    pipe = Pipeline([
            (vec, vectorizer),
            (mod, models[mod])
            ])

    pipe.fit(X_train, y_train)

    # Get predictions
    preds = pipe.predict(X_test)

    # Confusion matrix (rows = actual class, columns = predicted class)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    spec = tn / (tn + fp)   # true negative rate: share of Nvidia posts correctly identified

    # Retrieve metrics
    results['Model'] = mod
    results['Vectorizer'] = vec
    results['Train Score'] = pipe.score(X_train, y_train)
    results['Test Score'] = pipe.score(X_test, y_test)
    results['Accuracy'] = acc
    results['Specificity'] = spec
    results['f_score'] = f1_score(y_test, preds)
    # Note: ROC_AUC here is computed from hard class predictions, not predicted probabilities
    results['ROC_AUC'] = roc_auc_score(y_test, preds)
    results['Ngram Range'] = ngram_range
    results['Stopword List'] = stopword_label

    df_results.append(results)

    print(f"--- METRICS for {mod},{vec} ---")
    display(results)
    
    

In [18]:
# Logistic Regression

clf_model('cvec', 'log_reg', stopwords_list_with_common, (1,1), 3,"Stop Words List with Common Words")
clf_model('cvec', 'log_reg', stopwords_list, (1,1), 10 ,"Stop Words List with English Words")
clf_model('cvec', 'log_reg', stopwords_list_with_common, (1,2), 10,"Stop Words List with Common Words")
clf_model('cvec', 'log_reg', stopwords_list, (1,2), 10 ,"Stop Words List with English Words")
clf_model('cvec', 'log_reg', stopwords_list_with_common, (1,3), 10,"Stop Words List with Common Words")
clf_model('cvec', 'log_reg', stopwords_list, (1,3), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'log_reg', stopwords_list_with_common, (1,1), 10,"Stop Words List with Common Words")
clf_model('tvec', 'log_reg', stopwords_list, (1,1), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'log_reg', stopwords_list_with_common, (1,2), 10,"Stop Words List with Common Words")
clf_model('tvec', 'log_reg', stopwords_list, (1,2), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'log_reg', stopwords_list_with_common, (1,3), 10,"Stop Words List with Common Words")
clf_model('tvec', 'log_reg', stopwords_list, (1,3), 10 ,"Stop Words List with English Words")


--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.8411418699354202,
 'Test Score': 0.7221125370187562,
 'Accuracy': 0.7221125370187562,
 'Specificity': 0.5136636303043768,
 'f_score': 0.7928877988963827,
 'ROC_AUC': 0.6791701561058134,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.8489778289663116,
 'Test Score': 0.7411977624218493,
 'Accuracy': 0.7411977624218493,
 'Specificity': 0.5509886691846256,
 'f_score': 0.8058743675182032,
 'ROC_AUC': 0.7020129498764408,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.9760396528320513,
 'Test Score': 0.7392234287594603,
 'Accuracy': 0.7392234287594603,
 'Specificity': 0.5314374583425905,
 'f_score': 0.8062110282430616,
 'ROC_AUC': 0.6964176187859262,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.9793714779318005,
 'Test Score': 0.7535373478117802,
 'Accuracy': 0.7535373478117802,
 'Specificity': 0.5660964230171073,
 'f_score': 0.8152897657213317,
 'ROC_AUC': 0.7149228032786384,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.9790424087861462,
 'Test Score': 0.7360974004606778,
 'Accuracy': 0.7360974004606778,
 'Specificity': 0.5132192846034215,
 'f_score': 0.8053870419801019,
 'ROC_AUC': 0.6901824705185624,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,cvec ---


{'Model': 'log_reg',
 'Vectorizer': 'cvec',
 'Train Score': 0.9830323721772037,
 'Test Score': 0.7544422507403751,
 'Accuracy': 0.7544422507403751,
 'Specificity': 0.552766051988447,
 'f_score': 0.8174423582655496,
 'ROC_AUC': 0.71289510959971,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.7893957467812924,
 'Test Score': 0.7321487331359,
 'Accuracy': 0.7321487331359,
 'Specificity': 0.48655854254610087,
 'f_score': 0.804749340369393,
 'ROC_AUC': 0.681554908111718,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8019003743161531,
 'Test Score': 0.7492596248766041,
 'Accuracy': 0.7492596248766041,
 'Specificity': 0.5263274827816041,
 'f_score': 0.8155633547137843,
 'ROC_AUC': 0.7033335650354787,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8539344329727284,
 'Test Score': 0.7319842053307009,
 'Accuracy': 0.7319842053307009,
 'Specificity': 0.4652299489002444,
 'f_score': 0.8068303094983993,
 'ROC_AUC': 0.6770303892117159,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8627987330837892,
 'Test Score': 0.7524679170779862,
 'Accuracy': 0.7524679170779862,
 'Specificity': 0.5065540990890913,
 'f_score': 0.8202831033864898,
 'ROC_AUC': 0.7018074218502283,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8632512031590638,
 'Test Score': 0.730174399473511,
 'Accuracy': 0.730174399473511,
 'Specificity': 0.4723394801155299,
 'f_score': 0.8045292014302741,
 'ROC_AUC': 0.6770580483529969,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for log_reg,tvec ---


{'Model': 'log_reg',
 'Vectorizer': 'tvec',
 'Train Score': 0.8734523466743449,
 'Test Score': 0.7476143468246134,
 'Accuracy': 0.7476143468246134,
 'Specificity': 0.505443234836703,
 'f_score': 0.816221396909069,
 'ROC_AUC': 0.6977248832576722,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

In [19]:
# Multinomial Naive Bayes
clf_model('cvec', 'nb', stopwords_list_with_common, (1,1), 10,"Stop Words List with Common Words")
clf_model('cvec', 'nb', stopwords_list, (1,1), 10 ,"Stop Words List with English Words")
clf_model('cvec', 'nb', stopwords_list_with_common, (1,2), 10,"Stop Words List with Common Words")
clf_model('cvec', 'nb', stopwords_list, (1,2), 10 ,"Stop Words List with English Words")
clf_model('cvec', 'nb', stopwords_list_with_common, (1,3), 10,"Stop Words List with Common Words")
clf_model('cvec', 'nb', stopwords_list, (1,3), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'nb', stopwords_list_with_common, (1,1), 10,"Stop Words List with Common Words")
clf_model('tvec', 'nb', stopwords_list, (1,1), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'nb', stopwords_list_with_common, (1,2), 10,"Stop Words List with Common Words")
clf_model('tvec', 'nb', stopwords_list, (1,2), 10 ,"Stop Words List with English Words")
clf_model('tvec', 'nb', stopwords_list_with_common, (1,3), 10,"Stop Words List with Common Words")
clf_model('tvec', 'nb', stopwords_list, (1,3), 10 ,"Stop Words List with English Words")

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.7873596314425568,
 'Test Score': 0.7274596906877262,
 'Accuracy': 0.7274596906877262,
 'Specificity': 0.5734281270828705,
 'f_score': 0.7908063395845174,
 'ROC_AUC': 0.6957277800665822,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.7899099173213772,
 'Test Score': 0.7362619282658769,
 'Accuracy': 0.7362619282658769,
 'Specificity': 0.6049766718506998,
 'f_score': 0.7952745849297572,
 'ROC_AUC': 0.7092159649260031,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.9510098309407264,
 'Test Score': 0.730174399473511,
 'Accuracy': 0.730174399473511,
 'Specificity': 0.39457898244834483,
 'f_score': 0.8123569794050344,
 'ROC_AUC': 0.6610386747643422,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.9473695035169265,
 'Test Score': 0.7431720960842383,
 'Accuracy': 0.7431720960842383,
 'Specificity': 0.44123528104865584,
 'f_score': 0.8186781275409456,
 'ROC_AUC': 0.6809703511709642,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.9677306569042821,
 'Test Score': 0.7205495228693649,
 'Accuracy': 0.7205495228693649,
 'Specificity': 0.32948233725838705,
 'f_score': 0.8107415454899994,
 'ROC_AUC': 0.6399861065782464,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,cvec ---


{'Model': 'nb',
 'Vectorizer': 'cvec',
 'Train Score': 0.9698490395294311,
 'Test Score': 0.729104968739717,
 'Accuracy': 0.729104968739717,
 'Specificity': 0.3556987336147523,
 'f_score': 0.8151765168097883,
 'ROC_AUC': 0.6521798697466314,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.7480975690016864,
 'Test Score': 0.6994076999012833,
 'Accuracy': 0.6994076999012833,
 'Specificity': 0.251721839591202,
 'f_score': 0.801326663766855,
 'ROC_AUC': 0.607180318881166,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.7526839702192423,
 'Test Score': 0.7042612701546561,
 'Accuracy': 0.7042612701546561,
 'Specificity': 0.2646078649189069,
 'f_score': 0.8039269157349331,
 'ROC_AUC': 0.6136886483314326,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.7654353996133437,
 'Test Score': 0.660167818361303,
 'Accuracy': 0.660167818361303,
 'Specificity': 0.09264607864918907,
 'f_score': 0.7864785238021398,
 'ROC_AUC': 0.5432531503631315,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.7652091645757064,
 'Test Score': 0.6645278051990786,
 'Accuracy': 0.6645278051990786,
 'Specificity': 0.10619862252832704,
 'f_score': 0.7884635335615727,
 'ROC_AUC': 0.5495068880113876,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.8152688083583562,
 'Test Score': 0.6549029285949326,
 'Accuracy': 0.6549029285949326,
 'Specificity': 0.07598311486336369,
 'f_score': 0.7841300879946483,
 'ROC_AUC': 0.535640153120774,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for nb,tvec ---


{'Model': 'nb',
 'Vectorizer': 'tvec',
 'Train Score': 0.8101476697791123,
 'Test Score': 0.6576999012833169,
 'Accuracy': 0.6576999012833169,
 'Specificity': 0.08353699177960454,
 'f_score': 0.7855044074436827,
 'ROC_AUC': 0.5394170915788943,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

In [20]:
# Random Forest Model
clf_model('cvec', 'rf', stopwords_list_with_common, (1,1), 3,"Stop Words List with Common Words")
clf_model('cvec', 'rf', stopwords_list, (1,1), 3 ,"Stop Words List with English Words")
clf_model('cvec', 'rf', stopwords_list_with_common, (1,2), 3,"Stop Words List with Common Words")
clf_model('cvec', 'rf', stopwords_list, (1,2), 3 ,"Stop Words List with English Words")


--- METRICS for rf,cvec ---


{'Model': 'rf',
 'Vectorizer': 'cvec',
 'Train Score': 0.9886059808317222,
 'Test Score': 0.7185751892069759,
 'Accuracy': 0.7185751892069759,
 'Specificity': 0.5158853588091535,
 'f_score': 0.7894380501015573,
 'ROC_AUC': 0.6768192306782541,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for rf,cvec ---


{'Model': 'rf',
 'Vectorizer': 'cvec',
 'Train Score': 0.9904981284192341,
 'Test Score': 0.7348634419216847,
 'Accuracy': 0.7348634419216847,
 'Specificity': 0.5214396800710953,
 'f_score': 0.8034156755108265,
 'ROC_AUC': 0.6908961953588657,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for rf,cvec ---


{'Model': 'rf',
 'Vectorizer': 'cvec',
 'Train Score': 0.9886882481181358,
 'Test Score': 0.7123231326094109,
 'Accuracy': 0.7123231326094109,
 'Specificity': 0.4196845145523217,
 'f_score': 0.7947408581323002,
 'ROC_AUC': 0.652036901299675,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for rf,cvec ---


{'Model': 'rf',
 'Vectorizer': 'cvec',
 'Train Score': 0.9905803957056476,
 'Test Score': 0.7335472194800922,
 'Accuracy': 0.7335472194800922,
 'Specificity': 0.44590091090868694,
 'f_score': 0.8101295503839615,
 'ROC_AUC': 0.6742894495758327,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

In [22]:
# KNN
clf_model('cvec', 'knn', stopwords_list_with_common, (1,1), 3,"Stop Words List with Common Words")
clf_model('cvec', 'knn', stopwords_list, (1,1), 3 ,"Stop Words List with English Words")
clf_model('cvec', 'knn', stopwords_list_with_common, (1,2), 3,"Stop Words List with Common Words")
clf_model('cvec', 'knn', stopwords_list, (1,2), 3 ,"Stop Words List with English Words")
clf_model('cvec', 'knn', stopwords_list_with_common, (1,3), 3,"Stop Words List with Common Words")
clf_model('cvec', 'knn', stopwords_list, (1,3), 3 ,"Stop Words List with English Words")
clf_model('tvec', 'knn', stopwords_list_with_common, (1,1), 3,"Stop Words List with Common Words")
clf_model('tvec', 'knn', stopwords_list, (1,1), 3 ,"Stop Words List with English Words")
clf_model('tvec', 'knn', stopwords_list_with_common, (1,2), 3,"Stop Words List with Common Words")
clf_model('tvec', 'knn', stopwords_list, (1,2), 3 ,"Stop Words List with English Words")
clf_model('tvec', 'knn', stopwords_list_with_common, (1,3), 3,"Stop Words List with Common Words")
clf_model('tvec', 'knn', stopwords_list, (1,3), 3 ,"Stop Words List with English Words")


--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7567562008967135,
 'Test Score': 0.6115498519249754,
 'Accuracy': 0.6115498519249754,
 'Specificity': 0.4630082203954677,
 'f_score': 0.693814031902477,
 'ROC_AUC': 0.5809489175132139,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7674303813088725,
 'Test Score': 0.638203356367226,
 'Accuracy': 0.638203356367226,
 'Specificity': 0.46167518329260165,
 'f_score': 0.7209036679781698,
 'ROC_AUC': 0.6018369384784367,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.747645098926412,
 'Test Score': 0.607683448502797,
 'Accuracy': 0.607683448502797,
 'Specificity': 0.4463452566096423,
 'f_score': 0.692818035426731,
 'ROC_AUC': 0.5744463056398963,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7602525605692896,
 'Test Score': 0.6284139519578809,
 'Accuracy': 0.6284139519578809,
 'Specificity': 0.4456787380582093,
 'f_score': 0.7138059937907876,
 'ROC_AUC': 0.5907688268997774,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7419069556990663,
 'Test Score': 0.5966600855544587,
 'Accuracy': 0.5966600855544587,
 'Specificity': 0.4452343923572539,
 'f_score': 0.6816440490877217,
 'ROC_AUC': 0.5654650080662821,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,cvec ---


{'Model': 'knn',
 'Vectorizer': 'cvec',
 'Train Score': 0.7531775739377237,
 'Test Score': 0.6280026324448832,
 'Accuracy': 0.6280026324448832,
 'Specificity': 0.43034881137524994,
 'f_score': 0.7158833877858759,
 'ROC_AUC': 0.5872841378888007,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6689153058286372,
 'Test Score': 0.644290885159592,
 'Accuracy': 0.644290885159592,
 'Specificity': 0.09753388135969784,
 'f_score': 0.7737310308738881,
 'ROC_AUC': 0.5316539426393525,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6847517584632471,
 'Test Score': 0.6412471207634091,
 'Accuracy': 0.6412471207634091,
 'Specificity': 0.08842479449011331,
 'f_score': 0.7723309840772644,
 'ROC_AUC': 0.5273606663502166,
 'Ngram Range': (1, 1),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6379622393155362,
 'Test Score': 0.6336788417242514,
 'Accuracy': 0.6336788417242514,
 'Specificity': 0.026438569206842923,
 'f_score': 0.7730492839304828,
 'ROC_AUC': 0.5085817927680198,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.644502488585414,
 'Test Score': 0.6325271470878578,
 'Accuracy': 0.6325271470878578,
 'Specificity': 0.018440346589646744,
 'f_score': 0.7730067584734996,
 'ROC_AUC': 0.506019650760532,
 'Ngram Range': (1, 2),
 'Stopword List': 'Stop Words List with English Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6357204557607667,
 'Test Score': 0.6322803553800592,
 'Accuracy': 0.6322803553800592,
 'Specificity': 0.02066207509442346,
 'f_score': 0.7725885225885226,
 'ROC_AUC': 0.506281396789537,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with Common Words'}

--- METRICS for knn,tvec ---


{'Model': 'knn',
 'Vectorizer': 'tvec',
 'Train Score': 0.6401423224054954,
 'Test Score': 0.631951299769661,
 'Accuracy': 0.631951299769661,
 'Specificity': 0.015329926682959343,
 'f_score': 0.7728934010152285,
 'ROC_AUC': 0.5049216583120871,
 'Ngram Range': (1, 3),
 'Stopword List': 'Stop Words List with English Words'}

In [24]:
# Build a dataframe of results and sort it by F1 score
df_results = pd.DataFrame(df_results)
df_results_sorted = df_results.sort_values(by='f_score', ascending=False)
df_results_sorted

Unnamed: 0,Model,Vectorizer,Train Score,Test Score,Accuracy,Specificity,f_score,ROC_AUC,Ngram Range,Stopword List
9,log_reg,tvec,0.862799,0.752468,0.752468,0.506554,0.820283,0.701807,"(1, 2)",Stop Words List with English Words
15,nb,cvec,0.94737,0.743172,0.743172,0.441235,0.818678,0.68097,"(1, 2)",Stop Words List with English Words
5,log_reg,cvec,0.983032,0.754442,0.754442,0.552766,0.817442,0.712895,"(1, 3)",Stop Words List with English Words
11,log_reg,tvec,0.873452,0.747614,0.747614,0.505443,0.816221,0.697725,"(1, 3)",Stop Words List with English Words
7,log_reg,tvec,0.8019,0.74926,0.74926,0.526327,0.815563,0.703334,"(1, 1)",Stop Words List with English Words
3,log_reg,cvec,0.979371,0.753537,0.753537,0.566096,0.81529,0.714923,"(1, 2)",Stop Words List with English Words
17,nb,cvec,0.969849,0.729105,0.729105,0.355699,0.815177,0.65218,"(1, 3)",Stop Words List with English Words
14,nb,cvec,0.95101,0.730174,0.730174,0.394579,0.812357,0.661039,"(1, 2)",Stop Words List with Common Words
16,nb,cvec,0.967731,0.72055,0.72055,0.329482,0.810742,0.639986,"(1, 3)",Stop Words List with Common Words
27,rf,cvec,0.99058,0.733547,0.733547,0.445901,0.81013,0.674289,"(1, 2)",Stop Words List with English Words


#### 2.2 Hyperparameter Tuning of Models

To train a robust machine learning model, we need to select a good combination of hyperparameters. Each vectorizer and classification model has its own set of hyperparameters, and tuning every combination of model and vectorizer is impractical because of the time it would take. Hence, I will select the best models from Section 2.1 and tune only those.
GridSearchCV exhaustively evaluates every combination of parameters in the grid and returns the one with the best cross-validated score. HalvingGridSearchCV instead starts all candidates on a small share of the data and keeps only the best-scoring ones for each subsequent, larger round. I have opted for HalvingGridSearchCV where run time is a concern, such as for the Random Forest model, as it is more time and computationally efficient.
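
The following is a minimal sketch of successive halving on a pipeline, assuming a hypothetical toy corpus and a deliberately tiny grid (`X_toy`, `y_toy` and `param_grid` are illustrative and are not the grids used in this notebook). The step names `cvec` and `nb` follow the naming convention of the pipelines above.

```python
import numpy as np
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (must precede the next import)
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 100 documents per class with class-specific vocabulary
amd_docs = [f"radeon driver update {i} fsr works well" for i in range(100)]
nvidia_docs = [f"geforce driver update {i} dlss works well" for i in range(100)]
X_toy = np.array(amd_docs + nvidia_docs)
y_toy = np.array([1] * 100 + [0] * 100)

pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.5, 1.0],
}

# Each round keeps roughly the best 1/factor of the candidates and gives them more samples
search = HalvingGridSearchCV(pipe, param_grid, cv=2, factor=3,
                             scoring="f1", random_state=42)
search.fit(X_toy, y_toy)

print("Candidates per round:", search.n_candidates_)
print("Samples per round:", search.n_resources_)
print("Best parameters:", search.best_params_)
```

Swapping `HalvingGridSearchCV` for `GridSearchCV` (and dropping `factor` and `random_state`) would evaluate every candidate on the full training data instead.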

From the above results table in 2.1, we can see that the best performing models based on F1 score* are:
- Multinomial Naive Bayes Model (CVEC, bigram)
- Logistic Regression (CVEC and TVEC, bigrams and trigrams) 
- Random Forest perform (CVEC and TVEC)

*F1 score was selected as the evaluation metric because it balances precision (which penalises false positives) against recall (which penalises false negatives). <br>
I will proceed to tune these models' hyperparameters to find the best combination. From there, we can determine which model performs best at this classification task.
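As a small worked example of how F1 combines precision and recall, the sketch below uses hypothetical toy labels (not the subreddit data) and checks the manual formula against `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (hypothetical): 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.3f} recall={recall:.3f} f1={f1_manual:.3f}')
print('matches sklearn:', abs(f1_manual - f1_score(y_true, y_pred)) < 1e-12)
```

Because it is a harmonic mean, F1 is pulled towards the lower of the two values, so a model cannot score well by maximising only precision or only recall.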



First, to make the results easy to generate and compare, we will create a function that runs HalvingGridSearchCV for a given vectorizer, model and parameter grid.

In [19]:
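# Initialize parameter grids
# Note: for max_df/min_df a float is a proportion of documents and an int is an absolute count,
# so max_df=1.0 (keep all terms) is used rather than max_df=1 (terms found in at most one document).

tuning_results = []

# CVEC and Logistic Regression
# 'l1' and 'elasticnet' need a compatible solver (e.g. 'saga'); unless models['log_reg'] sets one,
# those candidates fail to fit with the default 'lbfgs' solver and are scored as NaN.
cvec_logr_params = {'cvec__max_features': [5000, 6000, 7000, None],
                    'cvec__max_df': [0.5, 0.75, 0.9, 1.0],
                    'cvec__min_df': [1, 2, 3],
                    'cvec__ngram_range': [(1,2), (1,3)],   # from earlier results table, bigrams and trigrams perform better
                    'log_reg__penalty': ['l1', 'l2', 'elasticnet'],
                    'log_reg__C': [1, 0.5, 0.1] }

# TF-IDF and Logistic Regression

tvec_logr_params = {'tvec__max_features': [5000, 6000, 7000, None],
                    'tvec__max_df': [0.5, 0.75, 0.9, 1.0],
                    'tvec__min_df': [1, 2, 3],
                    'tvec__ngram_range': [(1,2), (1,3)],   # from earlier results table, bigrams and trigrams perform better
                    'log_reg__penalty': ['l1', 'l2', 'elasticnet'],
                    'log_reg__C': [1, 0.5, 0.1] }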



# CVEC and Random Forest

rf_cvec_params = {
    'cvec__max_features': [5000, 6000, 7000, None],
    'cvec__max_df': [0.5, 0.75, 0.9, 1.0],   # 1.0 = keep all terms (int 1 would mean "at most one document")
    'cvec__min_df': [1, 2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [50, 75, 100]}

# TF-IDF and Random Forest

rf_tvec_params = {
    'tvec__max_features': [5000, 6000, 7000, None],
    'tvec__max_df': [0.5, 0.75, 0.9, 1.0],
    'tvec__min_df': [1, 2, 3],
    'tvec__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [50, 75, 100]}

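# TF-IDF and NB

nb_tvec_params = {
    'tvec__max_features': [5000, 6000, 7000, None],
    'tvec__max_df': [0.5, 0.75, 0.9, 1.0],
    'tvec__min_df': [1, 2, 3],
    'tvec__ngram_range': [(1,1), (1,2)],
    'nb__alpha': np.linspace(0, 1, 5),   # includes alpha=0, i.e. no smoothing
    'nb__fit_prior': [True, False]
    }

# CVEC and NB
# NOTE: float min_df values are proportions of documents. Most of the min_df/max_df pairs below
# have min_df > max_df, so those candidates fail to fit and are scored as NaN.
# Integer counts such as [1, 2, 3] were probably intended for min_df.

nb_cvec_params = {
    'cvec__max_features': [5000, 6000, 7000, None],
    'cvec__min_df': [0.5, 0.75, 0.9, 1],
    'cvec__max_df': [0.3, 0.4, 0.5, 0.6, 0.7],
    'cvec__ngram_range': [(1,1), (1,2)],
    'nb__alpha': np.linspace(0, 1, 5),   # includes alpha=0, i.e. no smoothing
    'nb__fit_prior': [True, False]
    }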
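The `min_df` and `max_df` entries in the grids above are interpreted differently depending on their type: a float is a proportion of documents, while an integer is an absolute document count. The sketch below illustrates this on a tiny hypothetical corpus (`docs` is made up for the example) and shows that a `min_df` larger than `max_df` makes the vectorizer fail to fit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical corpus, for illustration only
docs = [
    'gpu driver update',
    'gpu driver crash',
    'gpu price drop',
    'cpu price drop',
]

# Float: proportion of documents. max_df=1.0 keeps every term.
vocab_float = set(CountVectorizer(max_df=1.0).fit(docs).vocabulary_)

# Integer: absolute document count. max_df=1 keeps only terms found in a single document.
vocab_int = set(CountVectorizer(max_df=1).fit(docs).vocabulary_)

print(sorted(vocab_float))
print(sorted(vocab_int))

# A min_df threshold above the max_df threshold is rejected at fit time
try:
    CountVectorizer(min_df=0.9, max_df=0.5).fit(docs)
    invalid_combo_raised = False
except ValueError as err:
    invalid_combo_raised = True
    print('ValueError:', err)
```

Inside a grid search such a failing candidate does not stop the run by default. It is recorded with a NaN score, which is easy to miss when warnings are suppressed.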




In [29]:

df_tuning_results = []

def perform_grid_search(X_train, y_train, vec, mod, param_grid, cv=10):
    """Tune a vectorizer + model pipeline with HalvingGridSearchCV and record test-set metrics.

    Expects `stopwords_list`, `models`, `X_test` and `y_test` to be defined in earlier cells.
    """
    if vec == 'cvec':
        vectorizer = CountVectorizer(stop_words=stopwords_list)
    elif vec == 'tvec':
        vectorizer = TfidfVectorizer(stop_words=stopwords_list)
    else:
        raise ValueError("Invalid 'vec' parameter. Supported values are 'cvec' and 'tvec'.")

    pipe = Pipeline([
        (vec, vectorizer),
        (mod, models[mod])
    ])

    # Perform HalvingGridSearch
    # (no `scoring` is given, so candidates are ranked by the estimator's default score, accuracy;
    #  pass scoring='f1' to rank them by F1 instead)
    gs = HalvingGridSearchCV(pipe, param_grid=param_grid, cv=cv, n_jobs=-1, verbose=1)
    gs.fit(X_train, y_train)

    # Get the best model and best parameters
    best_estimator = gs.best_estimator_
    best_params = gs.best_params_

    # Predict on test set
    preds = best_estimator.predict(X_test)

    # Confusion matrix
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)
    spec = tn / (tn + fp)

    # Calculate evaluation metrics
    train_score = best_estimator.score(X_train, y_train)
    test_score = best_estimator.score(X_test, y_test)
    f_score = f1_score(y_test, preds)
    # ROC AUC is computed here from hard class predictions, not from predicted probabilities
    roc_auc = roc_auc_score(y_test, preds)

    # Append the results to df_tuning_results list
    result_dict = {
        'Vectorizer': vec,
        'Model': mod,
        'Best_Estimator': best_estimator,
        'Best_Params': best_params,
        'Train Score': train_score,
        'Test Score': test_score,
        'Accuracy': acc,
        'Specificity': spec,
        'f_score': f_score,
        'ROC_AUC': roc_auc,
    }

    df_tuning_results.append(result_dict)


In [30]:

# Logistic Regression
perform_grid_search(X_train, y_train, 'cvec', 'log_reg', cvec_logr_params)
perform_grid_search(X_train, y_train, 'tvec', 'log_reg', tvec_logr_params)



n_iterations: 7
n_required_iterations: 7
n_possible_iterations: 7
min_resources_: 66
max_resources_: 48622
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 864
n_resources: 66
Fitting 10 folds for each of 864 candidates, totalling 8640 fits
----------
iter: 1
n_candidates: 288
n_resources: 198
Fitting 10 folds for each of 288 candidates, totalling 2880 fits
----------
iter: 2
n_candidates: 96
n_resources: 594
Fitting 10 folds for each of 96 candidates, totalling 960 fits
----------
iter: 3
n_candidates: 32
n_resources: 1782
Fitting 10 folds for each of 32 candidates, totalling 320 fits
----------
iter: 4
n_candidates: 11
n_resources: 5346
Fitting 10 folds for each of 11 candidates, totalling 110 fits
----------
iter: 5
n_candidates: 4
n_resources: 16038
Fitting 10 folds for each of 4 candidates, totalling 40 fits
----------
iter: 6
n_candidates: 2
n_resources: 48114
Fitting 10 folds for each of 2 candidates, totalling 20 fits
n_iterations: 7
n_required_iteration
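The candidate and resource counts in the first search log above follow from the halving rule with `factor: 3`: each iteration keeps roughly the best third of the candidates and triples the number of training samples. The sketch below reproduces that schedule with a hypothetical helper, `halving_schedule`. It is a simplified illustration, not scikit-learn's internal implementation, whose iteration count can differ in edge cases.

```python
import math

def halving_schedule(n_candidates, min_resources, max_resources, factor=3):
    """Return a list of (n_candidates, n_resources) pairs, one per halving iteration."""
    schedule = []
    resources = min_resources
    while resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates < factor:
            # Too few candidates left for another elimination round
            break
        n_candidates = math.ceil(n_candidates / factor)  # keep the best 1/factor
        resources *= factor                              # give survivors more samples
    return schedule

# Values taken from the log: 864 candidates, min_resources_=66, max_resources_=48622
schedule = halving_schedule(864, 66, 48622)
for i, (cands, res) in enumerate(schedule):
    print(f'iter {i}: n_candidates={cands}, n_resources={res}')
```

Only the last iteration uses close to the full training set, which is where the time saving over an exhaustive grid search comes from.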

In [34]:
# Random Forest

perform_grid_search(X_train, y_train, 'cvec', 'rf', rf_cvec_params)
perform_grid_search(X_train, y_train, 'tvec', 'rf', rf_tvec_params)


n_iterations: 5
n_required_iterations: 6
n_possible_iterations: 5
min_resources_: 200
max_resources_: 48622
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 288
n_resources: 200
Fitting 10 folds for each of 288 candidates, totalling 2880 fits
----------
iter: 1
n_candidates: 96
n_resources: 600
Fitting 10 folds for each of 96 candidates, totalling 960 fits
----------
iter: 2
n_candidates: 32
n_resources: 1800
Fitting 10 folds for each of 32 candidates, totalling 320 fits
----------
iter: 3
n_candidates: 11
n_resources: 5400
Fitting 10 folds for each of 11 candidates, totalling 110 fits
----------
iter: 4
n_candidates: 4
n_resources: 16200
Fitting 10 folds for each of 4 candidates, totalling 40 fits
n_iterations: 5
n_required_iterations: 6
n_possible_iterations: 5
min_resources_: 200
max_resources_: 48622
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 288
n_resources: 200
Fitting 10 folds for each of 288 candidates, totalling 2880 fits
-

In [32]:
# Multinomial Naive Bayes

perform_grid_search(X_train, y_train, 'cvec', 'nb', nb_cvec_params)
perform_grid_search(X_train, y_train, 'tvec', 'nb', nb_tvec_params)

n_iterations: 7
n_required_iterations: 7
n_possible_iterations: 7
min_resources_: 66
max_resources_: 48622
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 1600
n_resources: 66
Fitting 10 folds for each of 1600 candidates, totalling 16000 fits
----------
iter: 1
n_candidates: 534
n_resources: 198
Fitting 10 folds for each of 534 candidates, totalling 5340 fits
----------
iter: 2
n_candidates: 178
n_resources: 594
Fitting 10 folds for each of 178 candidates, totalling 1780 fits
----------
iter: 3
n_candidates: 60
n_resources: 1782
Fitting 10 folds for each of 60 candidates, totalling 600 fits
----------
iter: 4
n_candidates: 20
n_resources: 5346
Fitting 10 folds for each of 20 candidates, totalling 200 fits
----------
iter: 5
n_candidates: 7
n_resources: 16038
Fitting 10 folds for each of 7 candidates, totalling 70 fits
----------
iter: 6
n_candidates: 3
n_resources: 48114
Fitting 10 folds for each of 3 candidates, totalling 30 fits
n_iterations: 7
n_required_ite

In [35]:
pd.DataFrame(df_tuning_results)

Unnamed: 0,Vectorizer,Model,Best_Estimator,Best_Params,Train Score,Test Score,Accuracy,Specificity,f_score,ROC_AUC
0,cvec,log_reg,"(CountVectorizer(max_df=0.5, ngram_range=(1, 3...","{'cvec__max_df': 0.5, 'cvec__max_features': No...",0.975711,0.75362,0.75362,0.538991,0.818099,0.709404
1,tvec,log_reg,"(TfidfVectorizer(max_df=0.75, max_features=700...","{'log_reg__C': 1, 'log_reg__penalty': 'l2', 't...",0.787606,0.746956,0.746956,0.531437,0.813032,0.702557
2,cvec,nb,"(CountVectorizer(max_df=0.7, ngram_range=(1, 2...","{'cvec__max_df': 0.7, 'cvec__max_features': No...",0.959545,0.756992,0.756992,0.564097,0.81855,0.717254
3,tvec,nb,"(TfidfVectorizer(max_df=0.75, min_df=3,\n ...","{'nb__alpha': 0.75, 'nb__fit_prior': True, 'tv...",0.769384,0.729434,0.729434,0.401466,0.811075,0.66187
4,cvec,rf,"(CountVectorizer(max_df=0.9, min_df=3, ngram_r...","{'cvec__max_df': 0.9, 'cvec__max_features': No...",0.988688,0.739552,0.739552,0.510553,0.808701,0.692377
5,tvec,rf,"(TfidfVectorizer(max_df=0.75, max_features=600...","{'rf__n_estimators': 100, 'tvec__max_df': 0.75...",0.985809,0.732149,0.732149,0.492335,0.804139,0.682745


#### Choosing Best Model and final tuning using GridSearchCV

After tuning the hyperparameters using HalvingGridSearchCV, the **Naive Bayes model with the CountVectorizer** transformer achieved the best F1 score of **0.819**. It also had the best ROC_AUC score of 0.717 and an accuracy of 76%, meaning the model classifies 76% of the test data correctly.
I will now tune this model one last time using GridSearchCV to obtain the best parameters and see if the accuracy, ROC-AUC and F1 scores can be improved.

In [19]:
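# Re-run best model

# Setting up pipeline with NB and CountVec

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

# NOTE: float min_df values are proportions of documents; candidates where min_df exceeds
# max_df (e.g. min_df=0.9 with max_df=0.6) fail to fit and are scored as NaN.
nb_cvec_params = {
    'cvec__max_features': [None],
    'cvec__min_df': [0.5, 0.75, 0.9, 1],
    'cvec__max_df': [0.6, 0.7, 0.8, 0.9],
    'cvec__ngram_range': [(1,1), (1,2)],
    'nb__alpha': np.linspace(0, 1, 5),   # includes alpha=0, i.e. no smoothing
    'nb__fit_prior': [True, False]
    }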



In [20]:
# Instantiate GridSearchCV
gs = GridSearchCV(pipe, nb_cvec_params, cv=5, verbose=True, n_jobs=-1)

# Fit GridSearch to training data
gs.fit(X_train, y_train)

# Finding the Best Hyperparameter Values
gs.best_params_

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


##### Classification Report

##### Confusion Matrix
The confusion matrix sorts predictions into true positives, true negatives, false positives and false negatives, which makes it easy to compare the number of Type I errors (false positives) and Type II errors (false negatives).
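As a quick illustration of how the four cells are read off and turned into metrics, the sketch below uses hypothetical toy labels (not the model's actual predictions). For binary labels, `confusion_matrix(...).ravel()` returns the cells in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 0 = negative class, 1 = positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
specificity = tn / (tn + fp)   # share of true negatives correctly identified
sensitivity = tp / (tp + fn)   # recall: share of true positives correctly identified

print(f'Type I errors (FP): {fp}, Type II errors (FN): {fn}')
print(f'accuracy={accuracy:.2f} specificity={specificity:.2f} sensitivity={sensitivity:.2f}')
```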

In [28]:
# Plot confusion matrix on test data
from sklearn.metrics import ConfusionMatrixDisplay

# Assumption: nb_cvec is the best pipeline found by the GridSearchCV cell above
nb_cvec = gs.best_estimator_

# Predict on test set
preds = nb_cvec.predict(X_test)

# Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
acc = (tp + tn) / (tp + tn + fp + fn)
spec = tn / (tn + fp)

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay.from_estimator(nb_cvec, X_test, y_test, cmap='Blues', ax=ax, values_format='d')

#### AUC ROC Curve
The ROC curve plots the true positive rate against the false positive rate across all classification thresholds, and the AUC summarises it as a single measure of separability. It tells us how capable the model is of distinguishing between the two classes, in our case whether a comment comes from the AMD or the Nvidia subreddit.
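The sketch below shows how the curve and the AUC are typically obtained from predicted probabilities. It uses synthetic data and a logistic regression as stand-ins (assumptions for illustration, not the tuned Naive Bayes pipeline). It also computes the AUC from hard class predictions for comparison, since that variant reduces the curve to a single threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the subreddit comments)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test sample
proba = clf.predict_proba(X_te)[:, 1]

# One (fpr, tpr) point per threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_te, proba)

auc_from_scores = roc_auc_score(y_te, proba)               # uses the full ranking of scores
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))   # single operating point only

print(f'AUC from probabilities: {auc_from_scores:.3f}')
print(f'AUC from hard labels:   {auc_from_labels:.3f}')
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.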

#### Coefficients of Features