# YouTube popularity predictor (Part 2): text frequency-based models

In the previous notebook, we used natural language processing (NLP) to explore the YouTube video dataset and looked for correlations between the language features in the video titles and descriptions and the video popularity. We represented popularity as a binary categorical variable: class 1 for videos with over 50k views and class 0 for videos with under 50k views. We saw that the frequency of the tokens in the byte-pair encoded text had predictive value for this classification. In this notebook, we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')
pd.options.display.float_format = '{:.6e}'.format


In [2]:
videos = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660e+00,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203e+00,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313e+00,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942e+00,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320e+00,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026e+00,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000e+00,0
31659,Gegee gegee,22,28 March 2024,,1,4.771213e-01,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364e+00,0


Let's look at the video categories:

In [3]:
videos.groupby('video_category').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,label,label,label,label,label,label,label,label
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
video_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,306.0,42.2451,23.89558,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.4084967,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.47486,22.77203,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.4134078,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.38367,22.98805,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.5591837,0.4975013,0.0,0.0,1.0,1.0,1.0
15,41.0,31.43902,23.29383,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.4634146,0.5048545,0.0,0.0,0.0,1.0,1.0
17,487.0,51.03491,19.40761,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2422998,0.4289153,0.0,0.0,0.0,0.0,1.0
19,111.0,38.67568,22.12451,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.3063063,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.15257,18.76016,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.4610282,0.4988927,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.82164,22.61416,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.2364946,0.4249657,0.0,0.0,0.0,0.0,1.0
23,220.0,27.2,24.87863,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.6545455,0.4766007,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.52605,22.92586,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.4999274,0.0,0.0,1.0,1.0,1.0


We see that category 30 only has a single member, so we will drop it.

In [4]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [5]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [6]:
videos.reset_index(drop=True, inplace=True)

Let's look at the distribution of video view counts:

In [7]:
videos[['months','video_view_count','label']].groupby('label').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,19168.0,40.84401,21.00774,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.56135,21.26779,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.6834096,4.699005,5.037442,5.433327,5.977578,8.588679


We can see that the classes are reasonably, though not exactly, balanced: about 61% of the videos fall in class 0 and about 39% in class 1. The imbalance arises because the classification is based on a round milestone of 50k views. Exactly balancing the data would require a threshold that is far less striking.

We'll select a test set based on an 80/20 train/test split which we will then use for all future model building and validation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
# Select rows by index label; after reset_index the labels coincide with the positions.
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]
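
Note that the split above is stratified on `video_category` rather than on `label`, so it is the category proportions that are preserved in both parts. The following is a minimal sketch of that behaviour on hypothetical toy data (the arrays `X` and `category` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples in two categories of 70 and 30 members.
X = np.arange(100).reshape(-1, 1)
category = np.array([0] * 70 + [1] * 30)

X_tr, X_te, cat_tr, cat_te = train_test_split(
    X, category, test_size=0.2, stratify=category, random_state=0
)

# The 70/30 category ratio is kept in both the train and the test part.
train_counts = np.bincount(cat_tr).tolist()
test_counts = np.bincount(cat_te).tolist()
print(len(X_tr), len(X_te), train_counts, test_counts)
```

If the class balance itself had to be preserved exactly, one could stratify on the label instead; that would be a different design choice from the one made here.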

In [9]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984e+00,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063e+00,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656e+00,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437e+00,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467e+00,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467e+00,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270e+00,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919e+00,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255e+00,1


In [10]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [11]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We will convert all text to lower case to make the vocabulary size smaller.
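
As a rough illustration of what a BPE trainer does, the sketch below performs a single merge step on a hypothetical three-word corpus: it counts adjacent symbol pairs and fuses the most frequent pair into one symbol. The helper names `most_frequent_pair` and `merge_pair` are made up for this example; the tokenizers library used in this notebook implements the full algorithm far more efficiently.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Hypothetical toy corpus, with each word split into characters.
words = [list("science"), list("scientist"), list("silence")]
pair = most_frequent_pair(words)   # the pair ('e', 'n') occurs in all three words
words = merge_pair(words, pair)
print(pair, words[0])
```

Repeating this step until the vocabulary reaches its target size yields the subword units that are then mapped to integer token IDs.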

We first have to set all NA values to empty strings:

In [12]:
train = train.fillna('')

In [13]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}

In [14]:
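%%time
import os

# Make sure the output directory exists before saving the tokenizers.
os.makedirs("tokenizers", exist_ok=True)
BPE_tokenizers_uncased = {}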

for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

CPU times: user 21.6 s, sys: 6.75 s, total: 28.3 s
Wall time: 7.05 s


In [15]:
from transformers import PreTrainedTokenizerFast

tokenizers_trained_uncased = {}

for field in training_data_uncased:
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

In [16]:
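def tokenize(text, field, cased=True):
    # Only uncased tokenizers have been trained so far.
    if cased:
        raise NotImplementedError("only uncased tokenizers are available")
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=True):
    if cased:
        raise NotImplementedError("only uncased tokenizers are available")
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])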


In [17]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|██████████| 25328/25328 [00:01<00:00, 19324.23it/s]
processing rows: 100%|██████████| 25328/25328 [00:01<00:00, 13081.49it/s]
processing rows: 100%|██████████| 25328/25328 [00:14<00:00, 1720.93it/s]


In [18]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [19]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenized text. The exploratory data analysis (EDA) in the previous notebook suggests that a text-frequency-based analysis could be a powerful tool for language-based prediction. We can use TfidfVectorizer() from scikit-learn, which efficiently counts the tokens in a text and generates a vector consisting of a numerical description of the token frequencies. Rather than simply counting the token frequency in the individual samples (the *term frequency*), however, TfidfVectorizer also incorporates the frequencies of the tokens in the entire training corpus (the *document frequency*). The term frequency of each token $i$ is multiplied by an inverse document frequency weight, which in its textbook form is IDF = $\log(\frac{N_{\text{samples}}}{N_{\text{samples containing }i}})$ and describes the specificity of the token to the sample. By default (smooth_idf=True), TfidfVectorizer uses the smoothed variant IDF = $\log(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}) + 1$, which avoids division by zero and keeps a nonzero weight for tokens that appear in every sample.
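
To make the weighting concrete, here is a small worked example in plain Python on a hypothetical corpus of three pre-tokenized documents. It compares the textbook IDF, $\log(N/N_i)$, with the smoothed form that scikit-learn applies when smooth_idf=True (its default), $\log(\frac{1+N}{1+N_i}) + 1$. The function names are illustrative only.

```python
import math

# Hypothetical toy corpus of pre-tokenized documents (token IDs as strings).
docs = [["12", "7", "12"], ["7", "30"], ["7", "12"]]
n_docs = len(docs)

def document_frequency(token):
    """Number of documents that contain the token at least once."""
    return sum(token in doc for doc in docs)

def idf_plain(token):
    """Textbook inverse document frequency."""
    return math.log(n_docs / document_frequency(token))

def idf_smoothed(token):
    """Smoothed form used by scikit-learn when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + document_frequency(token))) + 1

for token in ["7", "12", "30"]:
    print(token, round(idf_plain(token), 3), round(idf_smoothed(token), 3))
```

The token "7" appears in every document, so its plain IDF is zero, while the rarest token "30" receives the largest weight under both definitions.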

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: we can discard tokens that appear in fewer than min_df or more than max_df documents, which allows for dimensionality reduction. Each threshold can be given as an absolute number of documents (integer) or as a proportion of the documents (float).
* use_idf: this allows the incorporation of the IDF factor into the vector representation of the text: without it, the text is represented as a set of numbers corresponding to the frequency of each token or n-gram appearing in the text, with a normalisation factor. With use_idf, this frequency is multiplied by a factor (IDF) that down-weights tokens that appear in a large number of documents.
* norm: with 'l1', the vector of input features is normalised so that the sum of the absolute values of the features is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency tf with $1 + \log(\text{tf})$, so that repeated occurrences of a token contribute less than linearly.

We will introduce a function that fits a separate vectoriser to each of the channel names, video titles and descriptions, vectorises them individually and then stacks the resulting feature matrices side by side. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [20]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [21]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_vectorized = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_vectorized[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    # Stack the per-field matrices column-wise, optionally preceded by the one-hot category.
    blocks = [X_vectorized['channel_title'], X_vectorized['video_title'], X_vectorized['video_description']]
    if video_category_encoder is not None:
        blocks.insert(0, video_category_encoder.transform(train[['video_category']]))
    X_train = hstack(blocks, format='csr')
    return X_train, vectorizers
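
The stacking step can be illustrated in isolation. The sketch below uses hypothetical inputs (three samples, a category column and two small sparse blocks standing in for the vectoriser outputs) and combines them column-wise while keeping everything sparse.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Hypothetical inputs: three samples, a category column and two sparse text blocks.
categories = np.array([[22], [27], [22]])
X_title = csr_matrix(np.array([[0.6, 0.8, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]))
X_description = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]))

encoder = OneHotEncoder()
X_category = encoder.fit_transform(categories)  # sparse, one column per category

# Column-wise stack; format="csr" keeps the result sparse and row-indexable.
X = hstack([X_category, X_title, X_description], format="csr")
print(X.shape)
print(X[:, :2].toarray())
```

Keeping the matrix sparse matters here because the real feature matrices have up to millions of columns, almost all of which are zero for any given video.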

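Let's look at the number of features for each n-gram range. For each range, the three lines of output correspond to the channel title, video title and video description vectorisers, in that order: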

In [22]:
for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
    _,_ = get_features(ngram_range=ngram_range)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 features in the (1, 5) ngram range.

## Multinomial naive Bayes

We see that for the higher n-gram ranges, we have millions of features, which is orders of magnitude larger than the training sample size of 25,328.
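
To put numbers on this comparison, the short calculation below sums the per-field feature counts printed by the vectorisers (channel title, video title, description) for the smallest and largest n-gram ranges and divides by the training-set size shown in the progress bars.

```python
# Feature counts per field, copied from the vectoriser output in this notebook.
n_features = {
    (1, 1): [12424, 24974, 26901],
    (1, 5): [47958, 810580, 8696668],
}
n_train = 25328  # number of training rows reported by the progress bars

totals = {}
for ngram_range, counts in n_features.items():
    totals[ngram_range] = sum(counts)
    ratio = totals[ngram_range] / n_train
    print(ngram_range, totals[ngram_range], round(ratio, 1))
```

Even with single tokens only, there are already more features than training samples; for the (1, 5) range the ratio is in the hundreds.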

We'll start our exploration of classical machine learning approaches with the multinomial naive Bayes model, which is known to perform well for text classification tasks with the tf-idf approach despite the large vocabularies. It has two main advantages: for the number of features we are considering, it is comparatively fast, and it requires tuning of only one hyperparameter, the Laplace (additive) smoothing parameter $\alpha$, which can be fixed by cross-validation to minimise overfitting.
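
As a small check of what $\alpha$ does, the sketch below fits MultinomialNB on a hypothetical count matrix and compares its learned feature log-probabilities with the additive-smoothing estimate $(N_{ci} + \alpha)/(N_c + \alpha n)$, where $N_{ci}$ is the total weight of feature $i$ in class $c$, $N_c$ is the total weight in class $c$ and $n$ is the number of features. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 features, binary labels.
X = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 3], [0, 0, 2]])
y = np.array([0, 0, 1, 1])
alpha = 0.5

model = MultinomialNB(alpha=alpha).fit(X, y)

# Manual estimate: (feature total in class + alpha) /
# (overall total in class + alpha * number of features).
counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])
theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])

matches = np.allclose(np.log(theta), model.feature_log_prob_)
print(matches)
```

A larger $\alpha$ pulls all feature probabilities towards a uniform distribution, which is why it acts as a regulariser when most n-grams are seen only a handful of times.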

We'll vary the n-gram range from (1,1) (only single tokens) to (1,5), as well as the vectoriser settings (use_idf = [True, False], norm = ['l1', 'l2'] and sublinear_tf = [True, False]), and use Bayesian hyperparameter tuning with the Optuna library to optimise the value of $\alpha$.

In [23]:
from sklearn.metrics import *

import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.simplefilter("ignore", UndefinedMetricWarning)

In [24]:
from sklearn.model_selection import cross_validate, KFold
import optuna

max_trials=100

def objective(trial, X_train, y_train, estimator, get_params, scoring):
    np.random.seed(524)
    params = get_params(trial=trial)
    model = estimator(**params)
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=0)
    return np.mean(scores['test_score'])

def report_optuna_results(X_train, y_train, estimator, get_params, scoring):
    sampler = optuna.samplers.TPESampler(seed=524)
    study = optuna.create_study(sampler=sampler, direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
    return study.best_params

In [25]:
def report_tuned_models(X_trains, y_train, params_fixed, estimator, get_params, scoring_tune, scoring_report):
    results_list = []
    for n in range(len(X_trains)):
        X_train = X_trains[n]

        best = report_optuna_results(X_train, y_train, estimator, get_params, scoring_tune)
        model = estimator(**get_params(best=best))
        scores = cross_validate(model, X_train, y_train, scoring=scoring_report, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=5)

        cv_results = {}
        for param in params_fixed:
            cv_results[param] = params_fixed[param][n]
        cv_results['mean_fit_time'] = np.mean(scores['fit_time'])
        for score in scoring_report:
            cv_results[score] = f'{np.min(scores["test_"+score]):.5f}/{np.mean(scores["test_"+score]):.5f}/{np.max(scores["test_"+score]):.5f}'
        for param in best:
            cv_results[param] = best[param]
        results_list.append(cv_results)
        print(pd.DataFrame(results_list))
    return results_list

In [26]:
from sklearn.naive_bayes import MultinomialNB

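def get_params_mnB(trial=None, best=None):
    # Either sample alpha from the search space (tuning) or reuse a tuned value.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    elif best is not None:
        alpha = best['alpha']
    else:
        raise ValueError("either trial or best must be provided")
    return {'alpha': alpha}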
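
The search space above spans ten orders of magnitude in $\alpha$, sampled on a log scale. As a simplified stand-in for the Optuna search (a fixed log-spaced grid rather than the TPE sampler, on hypothetical random data), the sketch below shows the same idea of choosing $\alpha$ by 5-fold cross-validated accuracy:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(524)

# Hypothetical toy data: 200 samples, 30 count features, labels tied to feature 0.
X = rng.poisson(1.0, size=(200, 30))
y = (X[:, 0] > 1).astype(int)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
alphas = np.logspace(-9, 1, 11)  # 1e-9, 1e-8, ..., 1e+1

# Mean cross-validated accuracy for each candidate alpha.
mean_scores = [
    cross_val_score(MultinomialNB(alpha=a), X, y, scoring="accuracy", cv=cv).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(mean_scores))]
print(best_alpha, max(mean_scores))
```

A sampler such as TPE spends more of its trial budget near promising values than a fixed grid does, which becomes useful when each trial is expensive.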

In [27]:
X_trains = []
params_fixed = {'vectorizer_type': [], 'norm': [], 'ngram_range': []}

for use_idf in [False,True]:
    for norm in ['l1','l2']:
        for sublinear_tf in [False,True]:
            for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
                # Label the weighting scheme: TF, log(TF), TF-IDF or log(TF)-IDF.
                vectorizer_type = ('log(TF)' if sublinear_tf else 'TF') + ('-IDF' if use_idf else '')
                params_fixed['vectorizer_type'].append(vectorizer_type)
                params_fixed['norm'].append(norm)
                params_fixed['ngram_range'].append(ngram_range)

                X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
                X_trains.append(X_train)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 features in the (1, 5) ngram range.
...

In [28]:
%%time
mnB_tune_ngrams = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))


[I 2024-05-15 15:48:54,424] A new study created in memory with name: no-name-6dea9289-dbcb-4b6f-8c15-9b5c64ca64bb
  pid = os.fork()
[I 2024-05-15 15:48:57,255] Trial 0 finished with value: 0.7788218458110103 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7788218458110103.
[I 2024-05-15 15:48:58,151] Trial 1 finished with value: 0.773847047209802 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7788218458110103.
[I 2024-05-15 15:48:58,393] Trial 2 finished with value: 0.7718334295298115 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7788218458110103.
[I 2024-05-15 15:48:58,624] Trial 3 finished with value: 0.7225995341258469 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7788218458110103.
[I 2024-05-15 15:48:58,863] Trial 4 finished with value: 0.7783874846108368 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7788218458110103.

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  


[I 2024-05-15 15:49:22,084] Trial 0 finished with value: 0.7967861230766712 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-15 15:49:22,656] Trial 1 finished with value: 0.7942594202723458 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-15 15:49:23,112] Trial 2 finished with value: 0.7928381026910722 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-15 15:49:23,577] Trial 3 finished with value: 0.7132422993777302 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-15 15:49:24,137] Trial 4 finished with value: 0.7944962233951134 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-15 15:49:24,690] Trial 5 finished with value: 0.7945357100683612 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  


[I 2024-05-15 15:50:19,658] Trial 0 finished with value: 0.7970624050782387 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-15 15:50:20,716] Trial 1 finished with value: 0.7946937191169358 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-15 15:50:21,756] Trial 2 finished with value: 0.7944567523107615 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-15 15:50:22,679] Trial 3 finished with value: 0.7181774554167321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-15 15:50:23,720] Trial 4 finished with value: 0.7924429553584686 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-15 15:50:24,797] Trial 5 finished with value: 0.7925219209105163 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  
2  0.86774/0.87366/0.88516 6.921634e-03  


[I 2024-05-15 15:52:05,946] Trial 0 finished with value: 0.797733678523451 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-15 15:52:07,515] Trial 1 finished with value: 0.7896793714868962 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-15 15:52:09,098] Trial 2 finished with value: 0.7913771503420399 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-15 15:52:10,704] Trial 3 finished with value: 0.7191644741534158 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-15 15:52:12,297] Trial 4 finished with value: 0.7875472080482352 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-15 15:52:13,891] Trial 5 finished with value: 0.7873103113920923 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  
2  0.86774/0.87366/0.88516 6.921634e-03  
3  0.86564/0.87216/0.88416 4.012247e-03 

[I 2024-05-15 15:54:44,994] Trial 0 finished with value: 0.7996287972114583 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-15 15:54:47,086] Trial 1 finished with value: 0.7836780129146208 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-15 15:54:49,219] Trial 2 finished with value: 0.7879816705762319 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-15 15:54:51,398] Trial 3 finished with value: 0.7190459907503287 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-15 15:54:53,443] Trial 4 finished with value: 0.7835201597550049 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-15 15:54:55,450] Trial 5 finished with value: 0.7835201675494529 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   
4  0.77510/0.78496/0.79442  0.68753/0.70184/0.72782  0.73422/0.74100/0.75862   

                   roc_auc        alpha  
0  0.846

[I 2024-05-15 15:58:16,914] Trial 0 finished with value: 0.7787823591377625 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-15 15:58:17,137] Trial 1 finished with value: 0.7731363416524776 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-15 15:58:17,350] Trial 2 finished with value: 0.7704121119485379 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-15 15:58:17,563] Trial 3 finished with value: 0.7219283775973536 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-15 15:58:17,775] Trial 4 finished with value: 0.7772030247134663 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-15 15:58:17,988] Trial 5 finished with value: 0.7772030169190184 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   
4  0.77510/0.78496/0.79442  0.68753/0.70184/0.72782  

[I 2024-05-15 15:58:39,390] Trial 0 finished with value: 0.7965097553361764 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-15 15:58:39,935] Trial 1 finished with value: 0.7948122259033668 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-15 15:58:40,490] Trial 2 finished with value: 0.7929565705052634 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-15 15:58:41,045] Trial 3 finished with value: 0.7130449283670748 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-15 15:58:41,610] Trial 4 finished with value: 0.7942198322712748 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-15 15:58:42,186] Trial 5 finished with value: 0.7941803455980271 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.

[I 2024-05-15 15:59:36,795] Trial 0 finished with value: 0.7977730872522193 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-15 15:59:37,870] Trial 1 finished with value: 0.794535733451705 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-15 15:59:39,013] Trial 2 finished with value: 0.7946541155269689 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-15 15:59:40,180] Trial 3 finished with value: 0.7179800298449411 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-15 15:59:41,220] Trial 4 finished with value: 0.7933905263941442 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-15 15:59:42,256] Trial 5 finished with value: 0.7933510475153444 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73

[I 2024-05-15 16:01:23,798] Trial 0 finished with value: 0.7978915238886188 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-15 16:01:25,500] Trial 1 finished with value: 0.7900741992471343 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-15 16:01:27,054] Trial 2 finished with value: 0.7910612491616098 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-15 16:01:28,528] Trial 3 finished with value: 0.7190460219281204 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-15 16:01:30,080] Trial 4 finished with value: 0.7875076278416121 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-15 16:01:31,637] Trial 5 finished with value: 0.7872312678955653 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   
8         log(TF)   l1      (1, 4)   9.841421e-01  0.79807/0.80764/0.81761   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.7354

[I 2024-05-15 16:04:03,504] Trial 0 finished with value: 0.7989181150374776 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-15 16:04:05,625] Trial 1 finished with value: 0.7835595840726692 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-15 16:04:07,738] Trial 2 finished with value: 0.7879816627817839 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-15 16:04:09,836] Trial 3 finished with value: 0.7192434007332238 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-15 16:04:11,936] Trial 4 finished with value: 0.7833621662953262 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-15 16:04:14,080] Trial 5 finished with value: 0.7830857907603834 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   
8         log(TF)   l1      (1, 4)   9.841421e-01  0.79807/0.80764/0.81761   
9         log(TF)   l1      (1, 5)   1.426892e+00  0.79747/0.80697/0.81820   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/

[I 2024-05-15 16:07:33,789] Trial 0 finished with value: 0.7785452130592858 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-15 16:07:34,011] Trial 1 finished with value: 0.7729781767149442 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-15 16:07:34,227] Trial 2 finished with value: 0.7700170503548618 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-15 16:07:34,455] Trial 3 finished with value: 0.7787427321644519 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7787427321644519.
[I 2024-05-15 16:07:34,678] Trial 4 finished with value: 0.7765711599970225 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7787427321644519.
[I 2024-05-15 16:07:34,891] Trial 5 finished with value: 0.7765316733237747 and parameters: {'alpha': 0.00039799342667825053}. Best 

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1               TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2               TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3               TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4               TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5          log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6          log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7          log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   
8          log(TF)   l1      (1, 4)   9.841421e-01  0.79807/0.80764/0.81761   
9          log(TF)   l1      (1, 5)   1.426892e+00  0.79747/0.80697/0.81820   
10              TF   l2      (1, 1)   4.919233e-02  0.78168/0.78834/0.80063   

                  precision                   recal

[I 2024-05-15 16:07:57,171] Trial 0 finished with value: 0.7980100462639456 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-15 16:07:57,733] Trial 1 finished with value: 0.7973783686142524 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-15 16:07:58,340] Trial 2 finished with value: 0.7956806754980359 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-15 16:07:58,911] Trial 3 finished with value: 0.754461382212836 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-15 16:07:59,524] Trial 4 finished with value: 0.7980101086195293 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.7980101086195293.
[I 2024-05-15 16:08:00,108] Trial 5 finished with value: 0.7979311430674816 and parameters: {'alpha': 0.00039799342667825053}. Best i

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1               TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2               TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3               TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4               TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5          log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6          log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7          log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   
8          log(TF)   l1      (1, 4)   9.841421e-01  0.79807/0.80764/0.81761   
9          log(TF)   l1      (1, 5)   1.426892e+00  0.79747/0.80697/0.81820   
10              TF   l2      (1, 1)   4.919233e-02  0.78168/0.78834/0.80063   
11              TF   l2      (1, 2)   2.844517e-01  

[I 2024-05-15 16:08:54,695] Trial 0 finished with value: 0.7933907290497906 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7933907290497906.
[I 2024-05-15 16:08:55,732] Trial 1 finished with value: 0.7976943555336099 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7976943555336099.
[I 2024-05-15 16:08:56,759] Trial 2 finished with value: 0.7977733132912095 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-15 16:08:57,779] Trial 3 finished with value: 0.7621207367779856 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-15 16:08:58,704] Trial 4 finished with value: 0.7952463376812063 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-15 16:08:59,640] Trial 5 finished with value: 0.7951673877180545 and parameters: {'alpha': 0.00039799342667825053}. Best 

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   6.865258e-02  0.77813/0.78388/0.79017   
1               TF   l1      (1, 2)   2.769546e-01  0.79688/0.80520/0.81484   
2               TF   l1      (1, 3)   5.978314e-01  0.80004/0.80753/0.81840   
3               TF   l1      (1, 4)   9.664436e-01  0.79767/0.80689/0.81859   
4               TF   l1      (1, 5)   1.364744e+00  0.79807/0.80650/0.81761   
5          log(TF)   l1      (1, 1)   4.737167e-02  0.77852/0.78388/0.78918   
6          log(TF)   l1      (1, 2)   2.712305e-01  0.79668/0.80599/0.81563   
7          log(TF)   l1      (1, 3)   6.126305e-01  0.80024/0.80930/0.82057   
8          log(TF)   l1      (1, 4)   9.841421e-01  0.79807/0.80764/0.81761   
9          log(TF)   l1      (1, 5)   1.426892e+00  0.79747/0.80697/0.81820   
10              TF   l2      (1, 1)   4.919233e-02  0.78168/0.78834/0.80063   
11              TF   l2      (1, 2)   2.844517e-01  

[I 2024-05-15 16:10:40,378] Trial 0 finished with value: 0.7842704143411607 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7842704143411607.
[I 2024-05-15 16:10:41,907] Trial 1 finished with value: 0.7923246901999238 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7923246901999238.
[I 2024-05-15 16:10:43,440] Trial 2 finished with value: 0.7948516424265831 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-15 16:10:44,991] Trial 3 finished with value: 0.7676877263556396 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-15 16:10:46,573] Trial 4 finished with value: 0.7869947064006838 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-15 16:10:48,202] Trial 5 finished with value: 0.7872315874679308 and parameters: {'alpha': 0.00039799342667825053}. Best 

KeyboardInterrupt: 

In [29]:
mnB_tune_ngrams = pd.DataFrame(mnB_tune_ngrams)
display(mnB_tune_ngrams.style.hide())

NameError: name 'mnB_tune_ngrams' is not defined

We can see that the (1,3) n-gram range consistently outperformed the lower ranges during cross-validation, but no improvement was seen for the (1,4) and (1,5) ranges. Let's look at the dependence on the other hyperparameters:
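
To give a feel for why wider n-gram ranges become expensive, here is a minimal sketch that counts the vocabulary size for each range on a toy corpus. The three titles are made up for illustration, and the sketch uses scikit-learn's default word tokenisation rather than the byte-pair encoding used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical titles, not taken from the dataset)
docs = [
    "science for kids broken bones",
    "master of science in mechanical engineering",
    "science quiz for kids",
]

# Vocabulary size for each n-gram range from (1, 1) up to (1, 5)
vocab_sizes = {}
for max_n in range(1, 6):
    vec = CountVectorizer(ngram_range=(1, max_n))
    vec.fit(docs)
    vocab_sizes[(1, max_n)] = len(vec.vocabulary_)

for ngram_range, size in vocab_sizes.items():
    print(ngram_range, size)
```

The vocabulary can only grow as the upper limit increases, which is consistent with the rising `mean_fit_time` in the tables above.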

In [None]:
display(mnB_tune_ngrams[mnB_tune_ngrams['ngram_range']==(1,3)].style.hide())

We find that L$^2$ normalisation performs better than L$^1$, but there is no improvement from including the IDF factor or using log(TF) instead of TF.
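
The four vectoriser types and two norms compared above correspond to the `use_idf`, `sublinear_tf` and `norm` arguments of `TfidfVectorizer`. The sketch below, on a made-up toy corpus with default word tokenisation, prints the weights of the first document under each combination so the differences are visible.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); the vocabulary is sorted as kids, quiz, science
docs = ["science science science kids", "science quiz", "kids quiz quiz"]

variants = {
    "TF":          dict(use_idf=False, sublinear_tf=False),
    "log(TF)":     dict(use_idf=False, sublinear_tf=True),
    "TF-IDF":      dict(use_idf=True,  sublinear_tf=False),
    "log(TF)-IDF": dict(use_idf=True,  sublinear_tf=True),
}

rows = {}
for name, kwargs in variants.items():
    for norm in ("l1", "l2"):
        X = TfidfVectorizer(norm=norm, **kwargs).fit_transform(docs)
        rows[(name, norm)] = X.toarray()[0]  # weights of the first document
        print(f"{name:12s} {norm}: {np.round(rows[(name, norm)], 3)}")
```

With `norm='l1'` each row sums to one, while with `norm='l2'` each row has unit Euclidean length; `sublinear_tf=True` replaces a raw count `tf` by `1 + log(tf)`, which damps repeated terms.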

## Including the video category

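Next we incorporate the video category as an additional feature by passing `video_category_encoder` to `get_features`. Based on the results above, we fix the n-gram range at (1,3) and use L$^2$ normalisation, and compare the four vectoriser types.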
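
As a rough illustration of how a categorical column can sit alongside text features, the sketch below one-hot encodes a category id and stacks it next to a sparse text matrix. This is an assumption about the general technique, not a description of what `get_features` does internally; the titles and category ids are made up.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: three titles and their category ids
titles = ["science for kids", "master of science", "science quiz"]
categories = np.array([[24], [27], [27]])

# Sparse text features, L2-normalised per row
text_features = TfidfVectorizer(norm="l2").fit_transform(titles)

# One indicator column per category; unseen categories encode as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
category_features = encoder.fit_transform(categories)

# Concatenate column-wise into a single sparse feature matrix
X = hstack([text_features, category_features]).tocsr()
print(X.shape)
```

All entries stay non-negative, which matters because `MultinomialNB` requires non-negative features.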

In [None]:
%%time

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

ngram_range = (1,3)
# Map (use_idf, sublinear_tf) to the label reported in the results tables
vectorizer_labels = {(False, False): 'TF', (False, True): 'log(TF)', (True, False): 'TF-IDF', (True, True): 'log(TF)-IDF'}

for sublinear_tf in [False,True]:
    for use_idf in [False,True]:
        params_fixed['vectorizer_type'].append(vectorizer_labels[(use_idf, sublinear_tf)])
        params_fixed['ngram_range'].append(ngram_range)

        X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        vectorizers.append(vectorizer)

In [None]:
mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

In [None]:
mnB_tune_category = pd.DataFrame(mnB_tune_category)
mnB_tune_category.style.hide()

In [None]:
mnB_tune_category.to_csv('mnB_tuned.csv', index=False, sep=',', encoding='utf-8')

## Dimensionality reduction

Our best-performing models use the (1,3) n-gram range, which requires over 2 million features. We will now try to reduce the number of features by setting minimum and maximum document-frequency thresholds (`min_df` and `max_df`), which drop tokens from the vocabulary that are either too rare or too common. I'll show results for log(TF)-IDF with L$^2$ norm.
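
To illustrate what the document-frequency filter does, here is a small sketch on a made-up corpus with default word tokenisation. An integer `min_df` is an absolute document count, while a float `max_df` is a fraction of the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); "science" appears in every document
docs = [
    "science for kids",
    "science quiz for kids",
    "science of broken bones",
    "science lecture",
]

# No filtering: every 1- to 3-gram enters the vocabulary
full = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

# Keep only terms found in at least 2 documents and in at most 90% of them
filtered = TfidfVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.9).fit(docs)

print(len(full.vocabulary_), "->", len(filtered.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Here `max_df=0.9` removes the ubiquitous term "science", and `min_df=2` removes every n-gram that occurs in a single document only.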

In [None]:
%%time

X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:
    for max_df in [1.0, 0.9, 0.8, 0.7]:
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

In [None]:
pd.DataFrame(mnB_tune_dim_reduction).style.hide()

We see that as the vocabulary size is decreased, the cross-validation scores rapidly degrade.

Taken together, these results show that predicting whether a video will be popular from its text metadata does not follow the common wisdom for text classification tasks. It is a fundamentally different challenge from, for example, determining whether a text message or email is spam. In our case both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether or not a viewer likes a certain YouTube video or channel, and whether they share it on social media and so contribute to its virality, is primarily a subjective determination. This makes the classification problem significantly more difficult, which is reflected in the low cross-validation metrics we have seen so far.

## Further classification models

Now that we understand the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF modalities and the normalisation), we are ready to build some more models. Having already considered a Bayesian model, we can explore three linear methods:

* Support vector machine
* Logistic regression
* Perceptron

To avoid overfitting, we will employ regularisation via a combination of L$^1$ and L$^2$ penalty terms, known as *elastic net*. This introduces two hyperparameters, the overall regularisation strength `alpha` and the mixing parameter `l1_ratio` between the L$^1$ and L$^2$ penalties, which we will again tune using Bayesian optimisation. We will implement all three linear models with `SGDClassifier` from scikit-learn, which fits regularised linear models by stochastic gradient descent; the `loss` argument selects the model (`hinge` for the support vector machine, `log_loss` for logistic regression and `perceptron` for the perceptron). A random state will be set for reproducibility.

In [None]:
from sklearn.linear_model import SGDClassifier
params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

# Map (use_idf, sublinear_tf) to the label reported in the results tables
vectorizer_labels = {(False, False): 'TF', (False, True): 'log(TF)', (True, False): 'TF-IDF', (True, True): 'log(TF)-IDF'}

for ngram_range in [(1,3)]:
    for sublinear_tf in [False,True]:
        for use_idf in [False,True]:
            params_fixed['vectorizer_type'].append(vectorizer_labels[(use_idf, sublinear_tf)])
            params_fixed['ngram_range'].append(ngram_range)

            X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
            X_trains.append(X_train)
            vectorizers.append(vectorizer)

### Support vector machine

In [None]:
%%time

def get_params_SVM(trial=None, best=None):
    # Either sample hyperparameters from a tuning trial or reuse the best ones found
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError("Either trial or best must be provided")
    return {'random_state': 524, 'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

SVM_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

In [None]:
SVM_tune = pd.DataFrame(SVM_tune)
SVM_tune.style.hide()

In [None]:
SVM_tune.to_csv('SVM_tuned.csv', index=False, sep=',', encoding='utf-8')

### Logistic regression

In [None]:
%%time

def get_params_log_reg(trial=None, best=None):
    # Either sample hyperparameters from a tuning trial or reuse the best ones found
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError("Either trial or best must be provided")
    return {'random_state': 524, 'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

In [None]:
log_reg_tune = pd.DataFrame(log_reg_tune)
log_reg_tune.style.hide()

In [None]:
log_reg_tune.to_csv('log_reg_tune.csv', index=False, sep=',', encoding='utf-8')

### Perceptron

In [None]:
%%time

def get_params_perceptron(trial=None, best=None):
    # Either sample hyperparameters from a tuning trial or reuse the best ones found
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError("Either trial or best must be provided")
    return {'random_state': 524, 'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

In [None]:
perceptron_tune = pd.DataFrame(perceptron_tune)
perceptron_tune.style.hide()

In [None]:
perceptron_tune.to_csv('perceptron_tune.csv', index=False, sep=',', encoding='utf-8')

## Training the final models

Now that we've obtained the optimal hyperparameters, we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [None]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}

for n in range(len(X_trains)):
    # The tuning results are DataFrames, so row n is selected with .loc rather than [n];
    # the SGD models reuse the random_state from tuning for reproducibility
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune_category.loc[n, 'alpha']))
    svm_clfs.append(SGDClassifier(random_state=524, loss='hinge', penalty='elasticnet', alpha=SVM_tune.loc[n, 'alpha'], l1_ratio=SVM_tune.loc[n, 'l1_ratio']))
    logreg_clfs.append(SGDClassifier(random_state=524, loss='log_loss', penalty='elasticnet', alpha=log_reg_tune.loc[n, 'alpha'], l1_ratio=log_reg_tune.loc[n, 'l1_ratio']))
    perceptron_clfs.append(SGDClassifier(random_state=524, loss='perceptron', penalty='elasticnet', alpha=perceptron_tune.loc[n, 'alpha'], l1_ratio=perceptron_tune.loc[n, 'l1_ratio']))

    for model in [mnB_clfs[-1],svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    models[f"models/mnB_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = mnB_clfs[-1]
    models[f"models/svm_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = svm_clfs[-1]
    models[f"models/logreg_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = logreg_clfs[-1]
    models[f"models/perceptron_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = perceptron_clfs[-1]

In [None]:
import os
import joblib

# joblib.dump does not create missing directories
os.makedirs('models', exist_ok=True)

for model_name in models:
    joblib.dump(models[model_name], model_name+'.joblib')

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')

In [None]:
import os
os.makedirs('vectorizers', exist_ok=True)  # joblib.dump does not create missing directories

for n in range(len(vectorizers)):
    joblib.dump(vectorizers[n]['channel_title'], f"vectorizers/channel_title_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}.joblib")
    joblib.dump(vectorizers[n]['video_title'], f"vectorizers/video_title_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}.joblib")
    joblib.dump(vectorizers[n]['video_description'], f"vectorizers/video_description_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}.joblib")

## Probability calibration

Based on the cross-validation scores, the models are quite far from being accurate. We would like to model the probability $P(y\in \mathcal{C} \mid P)$ that a data point $y$ belongs to class $\mathcal{C}$ given the prediction $P$ of each model, which is not the same as the probability reported by the model itself. (In some cases there are no reported probabilities at all: `SGDClassifier` with the `hinge` or `perceptron` loss does not provide `predict_proba`.) We can do this using a probability calibrator, which treats the predictions of each model as a feature that is then used to model the true probability. This requires validation data, so we'll again use a five-fold cross-validation split.
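
As a small sketch of the idea, the code below calibrates a hinge-loss `SGDClassifier`, which has no `predict_proba` of its own, on synthetic data. The dataset is a placeholder, and the default sigmoid calibration method of `CalibratedClassifierCV` is assumed.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic binary classification data (stand-in for the text features)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# A linear SVM trained by SGD only provides decision scores
base = SGDClassifier(loss="hinge", random_state=524)

# For each fold, a clone is fitted on the training part and a calibrator
# is fitted on the held-out part
calibrated = CalibratedClassifierCV(base, cv=KFold(n_splits=5, shuffle=True, random_state=42))
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(np.round(proba, 3))
```

After calibration the wrapper exposes `predict_proba`, and each row of the output sums to one.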

In [None]:
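from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import KFold

calibrated_clfs = {}

# Calibrate each model on the feature matrix it was trained on (X_trains[n]),
# rather than on X_train, which only holds the last matrix built above
for n in range(len(X_trains)):
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    for prefix in ['mnB', 'svm', 'logreg', 'perceptron']:
        model_name = f"models/{prefix}_{suffix}"
        calibrated_clfs[model_name] = CalibratedClassifierCV(models[model_name], cv = KFold(n_splits=5, random_state=42, shuffle=True))
        calibrated_clfs[model_name].fit(X_trains[n], y_train)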

We'll save the calibrated models for evaluation in the next notebook:

In [None]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], model_name+'_calibrated.joblib')

## Stacking

Now that we have our sixteen models, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a meta-classifier gathers the predictions of the individual models and uses them as features to produce a final prediction. The meta-classifier is trained on cross-validated predictions of the base models, and we need to choose a model for it. We will compare two choices: logistic regression and Gaussian naive Bayes. Note that `StackingClassifier` refits all of the base models on the single feature matrix it is given, here `X_train`, which is the last matrix built above (log(TF)-IDF).

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

In [None]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)

In [None]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')

We've built a total of 34 classical ML models: sixteen base models from four classification approaches (one Bayesian and three linear) combined with four text vectorisation methods (TF, log(TF), TF-IDF and log(TF)-IDF), their sixteen probability-calibrated counterparts, and two stacking classifiers. Probability calibration and stacking are intended to further improve model performance. In the next notebook we'll compare the performance of these models on the test data.