# YouTube popularity predictor (Part 2): text frequency-based models

In the previous notebook, we used natural language processing (NLP) to explore the YouTube video dataset and hunted for possible correlations between the language features in the video titles and descriptions and the video popularity, which we associated with a binary categorical variable corresponding to a video having obtained over 100k views (class 1) or under 100k views (class 0). We did indeed see that the frequency of the tokens in the byte-pair encoded text had predictive value for classification. In this notebook we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')

In [2]:
videos = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000,0
31659,Gegee gegee,22,28 March 2024,,1,0.477121,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364,0


In [3]:
videos.groupby('video_category').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,label,label,label,label,label,label,label,label
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
video_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,306.0,42.245098,23.89558,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.408497,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.47486,22.772031,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.413408,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.383673,22.988052,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.559184,0.497501,0.0,0.0,1.0,1.0,1.0
15,41.0,31.439024,23.293828,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.463415,0.504854,0.0,0.0,0.0,1.0,1.0
17,487.0,51.034908,19.407606,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2423,0.428915,0.0,0.0,0.0,0.0,1.0
19,111.0,38.675676,22.124508,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.306306,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.15257,18.760156,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.461028,0.498893,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.821643,22.614164,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.236495,0.424966,0.0,0.0,0.0,0.0,1.0
23,220.0,27.2,24.878628,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.654545,0.476601,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.526049,22.925857,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.499927,0.0,0.0,1.0,1.0,1.0


In [4]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [5]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [6]:
videos.reset_index(drop=True, inplace=True)

We can see that there are videos with titles and descriptions containing non-Latin characters. We won't filter the videos by language or alphabet, so the non-Latin characters will become part of the features. Let's look at the distribution of video view counts:

In [7]:
videos[['months','video_view_count','label']].groupby('label').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,19168.0,40.844011,21.007736,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.561354,21.267795,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.68341,4.699005,5.037442,5.433327,5.977578,8.588679


We can see that the classes are only roughly balanced: about 61% of the videos are in class 0 and about 39% are in class 1. They aren't exactly balanced because the classification is based on a milestone of 100k views. Balancing the data exactly would require a discrimination threshold that is far less striking.
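As a quick check on the balance, the class shares can be computed directly from the counts in the table above. This is only a sketch using those two numbers; the variable names are illustrative. The larger share is also the accuracy of a trivial classifier that always predicts class 0, which is a useful floor for the models that follow.

```python
# Class counts taken from the describe() table above.
class_counts = {0: 19168, 1: 12493}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"class {label}: {share:.1%}")

# Accuracy of always predicting the majority class.
majority_baseline = max(class_shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```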

We'll select a test set based on an 80/20 train/test split which we will then use for all future model building and validation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
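# The index was reset above, so the labels in X_test.index and X_train.index select the same rows via .loc.
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]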
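Note that the split above stratifies on the video category rather than on the label, so it is the category shares that are preserved between the two sets. A minimal sketch on toy data (the category values and their 60/30/10 shares are made up for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows in three hypothetical categories with shares 60/30/10.
categories = np.array([22] * 60 + [27] * 30 + [24] * 10)
row_ids = np.arange(len(categories))

train_ids, test_ids = train_test_split(
    row_ids, test_size=0.2, stratify=categories, random_state=524
)

def shares(ids):
    """Fraction of the selected rows that fall into each category."""
    values, counts = np.unique(categories[ids], return_counts=True)
    return {int(v): float(c) / len(ids) for v, c in zip(values, counts)}

print("train:", shares(train_ids))
print("test: ", shares(test_ids))
```

Both sets keep the same category mix as the full toy data, which is what `stratify` is for.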

In [9]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255,1


In [10]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [11]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We can first set all the NA values to empty strings:

In [12]:
train = train.fillna('')

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}
#training_data_cased = {field: train[field].tolist() for field in ['channel_title', 'video_title', 'video_description']}
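To make the idea behind byte-pair encoding concrete, here is a toy, pure-Python sketch of the merge-learning loop. It is not the implementation used by the `tokenizers` library (no special tokens, no pre-tokenizer, no vocabulary size limit); the function name `learn_bpe_merges` and the example words are assumptions for illustration only.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy version)."""
    # Represent every word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["science", "science", "scientist", "silence"], num_merges=3)
print(merges)
print(list(corpus))
```

Each merge adds one new symbol to the vocabulary, so frequent character sequences end up as single tokens while rare words stay split into smaller pieces.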

In [19]:
%%time
BPE_tokenizers_uncased = {}
#BPE_tokenizers_cased = {}
# The save paths below assume that the tokenizers/ directory exists.
os.makedirs("tokenizers", exist_ok=True)
for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")
#    BPE_tokenizers_cased[field] = build_tokenizer(training_data_cased[field], save=f"tokenizers/BPE_tokenizer_{field}_cased.json")

CPU times: user 1min 46s, sys: 416 ms, total: 1min 46s
Wall time: 1min 47s


In [20]:
from transformers import PreTrainedTokenizerFast

#tokenizers_trained_cased = {}
tokenizers_trained_uncased = {}

for field in training_data_uncased:
#    tokenizers_trained_cased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_cased.json")
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

In [21]:
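def tokenize(text, field, cased=True):
    # Only the uncased tokenizers are trained in this notebook, so fail loudly
    # instead of silently returning None for the cased case.
    if cased:
        raise NotImplementedError("cased tokenizers are not trained in this notebook")
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=True):
    if cased:
        raise NotImplementedError("cased tokenizers are not trained in this notebook")
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])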


In [22]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|██████████| 25328/25328 [00:01<00:00, 14613.50it/s]
processing rows: 100%|██████████| 25328/25328 [00:02<00:00, 9495.16it/s]
processing rows: 100%|██████████| 25328/25328 [00:28<00:00, 881.21it/s]


In [23]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [24]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenized text. The exploratory data analysis (EDA) in the previous notebook suggests that a text-frequency-based analysis could be a powerful tool for language-based prediction. We can use TfidfVectorizer() from scikit-learn, which efficiently counts the tokens in a text and generates a vector consisting of a numerical description of the token frequencies. Rather than simply counting the token frequency in the individual samples (the *term frequency*), however, TfidfVectorizer also incorporates the frequencies of the tokens in the entire training corpus (the *document frequency*). The term frequency of each token $i$ is multiplied by an inverse document frequency weight, which in its textbook form is $\text{IDF}_i = \log\left(\frac{N_{\text{samples}}}{N_{\text{samples containing }i}}\right)$ and describes the specificity of the token to the sample. By default, TfidfVectorizer uses the smoothed variant $\text{IDF}_i = \log\left(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}\right) + 1$, so that tokens appearing in every sample are not zeroed out entirely.
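A short worked example makes the weighting concrete. The sketch below compares the textbook weight $\log(N/\text{df})$ with the smoothed form $\log\frac{1+N}{1+\text{df}} + 1$ that scikit-learn documents as its default (`smooth_idf=True`). The tokens and document frequencies are made up for illustration.

```python
import math

n_samples = 4
# Hypothetical document frequencies: the number of samples containing each token.
document_frequency = {"science": 4, "physics": 2, "strangelets": 1}

idf_textbook = {}
idf_smoothed = {}
for token, df in document_frequency.items():
    idf_textbook[token] = math.log(n_samples / df)
    idf_smoothed[token] = math.log((1 + n_samples) / (1 + df)) + 1
    print(f"{token:12s} df={df}  textbook={idf_textbook[token]:.3f}  smoothed={idf_smoothed[token]:.3f}")
```

A token that appears in every sample gets a textbook weight of zero but a smoothed weight of one, while the rarest token gets the largest weight under both definitions.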

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: we can filter the tokens by the minimum and maximum number of documents in which the token must appear (an integer is read as an absolute count, a float as a fraction of the documents), which allows for dimensionality reduction.
* use_idf: this allows the incorporation of the IDF factor into the vector representation of the text: without it, the text is represented as a set of numbers corresponding to the frequency of each token or n-gram appearing in the text, with a normalisation factor. With use_idf, this frequency is multiplied by a factor (IDF) that is small for tokens appearing in a large number of documents, which suppresses them.
* norm: with 'l1', the vector of input features is normalised so that the sum of the absolute values of the features is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency tf with 1 + log(tf) rather than using the term frequencies themselves.

We will introduce a function that fits a separate vectoriser to each of the channel names, video titles and descriptions, vectorises them individually and then stacks the resulting feature matrices. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [25]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [26]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_trains = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_trains[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    # Optionally prepend the one-hot encoded video category to the text features.
    if video_category_encoder is not None:
        X_category = video_category_encoder.transform(train[['video_category']]).toarray()
        X_train = hstack([X_category, X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    else:
        X_train = hstack([X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    return X_train, vectorizers
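The stacking step can be sketched on its own. In this toy example the text features are a hand-written stand-in for a vectoriser output, and the category values are hypothetical. Since OneHotEncoder returns a sparse matrix by default, the blocks can be stacked without converting anything to a dense array.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Four samples with hypothetical category ids.
categories = np.array([[22], [27], [22], [24]])
encoder = OneHotEncoder()  # produces a sparse matrix by default
X_category = encoder.fit_transform(categories)

# Stand-in for a vectoriser output: 4 samples, 3 text features.
X_text = csr_matrix(np.array([[0.6, 0.8, 0.0],
                              [0.0, 1.0, 0.0],
                              [1.0, 0.0, 0.0],
                              [0.0, 0.6, 0.8]]))

# Column-wise stacking: category columns first, then the text features.
X = hstack([X_category, X_text]).tocsr()
print(encoder.categories_[0])
print(X.shape)
print(X.toarray())
```

Converting the result to CSR format makes row slicing efficient, which is what cross-validation does when it splits the data into folds.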

Let's look at the number of features for each n-gram range:

In [27]:
for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
    _,_ = get_features(ngram_range=ngram_range)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

## Multinomial naive Bayes

We see that for the higher n-gram ranges we have millions of features (close to ten million in total for the (1,5) range), which is orders of magnitude larger than the training sample size of 25,328.

As a baseline for exploring different approaches I'll use multinomial naive Bayes, which is known to perform well for text classification tasks with the tf-idf approach despite the large vocabularies. It has two main advantages: for the number of features we are considering, it is comparatively fast, and it requires tuning of only one hyperparameter, the Laplace (additive) smoothing parameter alpha, which can be fixed by cross-validation to minimise overfitting.
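To see what the smoothing parameter does, recall that multinomial naive Bayes estimates the likelihood of token $t$ in class $c$ as $(N_{tc} + \alpha) / (N_c + \alpha V)$, where $N_{tc}$ is the total count (or tf-idf weight) of the token in the class, $N_c$ is the sum over all tokens and $V$ is the vocabulary size. The sketch below evaluates this for a made-up three-token vocabulary; the counts and the function name are illustrative only.

```python
def smoothed_likelihoods(token_counts, alpha):
    """Additively smoothed estimate of P(token | class)."""
    vocab_size = len(token_counts)
    total = sum(token_counts.values())
    return {token: (count + alpha) / (total + alpha * vocab_size)
            for token, count in token_counts.items()}

# Hypothetical token counts within one class; "viral" never occurs in it.
counts_in_class = {"shorts": 3, "lecture": 1, "viral": 0}

likelihoods = {}
for alpha in (1e-9, 1.0, 10.0):
    likelihoods[alpha] = smoothed_likelihoods(counts_in_class, alpha)
    print(alpha, {t: round(p, 4) for t, p in likelihoods[alpha].items()})
```

A very small alpha leaves unseen tokens with a likelihood close to zero, so a single unseen token can dominate the prediction, while a large alpha pulls all likelihoods towards the uniform value 1/V.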

We'll scan the n-gram ranges (1,1), (1,2), and (1,3) and the vectoriser settings use_idf = [True, False], norm = ['l1', 'l2'], and sublinear_tf = [True, False], tuning the smoothing parameter for each combination with Optuna.

In [28]:
from sklearn.metrics import *

import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.simplefilter("ignore", UndefinedMetricWarning)

In [31]:
from sklearn.model_selection import cross_validate, KFold
import optuna

max_trials=100

def objective(trial, X_train, y_train, estimator, get_params, scoring):
    np.random.seed(524)
    params = get_params(trial)
    model = estimator(**params)
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=0)
    return np.mean(scores['test_score'])

def report_optuna_results(X_train, y_train, estimator, get_params, scoring):
    sampler = optuna.samplers.TPESampler(seed=524)
    study = optuna.create_study(sampler=sampler, direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
    return study.best_params

In [32]:
def report_tuned_models(X_trains, y_train, params_fixed, estimator, get_params, scoring_tune, scoring_report):
    results_list = []
    for n in range(len(X_trains)):
        X_train = X_trains[n]

        best = report_optuna_results(X_train, y_train, estimator, get_params, scoring_tune)
        model = estimator(**best)

        scores = cross_validate(model, X_train, y_train, scoring=scoring_report, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=5)

        cv_results = {}
        for param in params_fixed:
            cv_results[param] = params_fixed[param][n]
        cv_results['mean_fit_time'] = np.mean(scores['fit_time'])
        for score in scoring_report:
            cv_results[score] = f'{np.min(scores["test_"+score]):.5f}/{np.mean(scores["test_"+score]):.5f}/{np.max(scores["test_"+score]):.5f}'
        for param in best:
            cv_results[param] = best[param]
        results_list.append(cv_results)
        print(pd.DataFrame(results_list))
    return results_list
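The two functions above read different keys from the output of cross_validate: with a single scoring string the scores are stored under `test_score`, whereas with a tuple of metric names each metric gets its own `test_<name>` entry. A minimal sketch on random toy data (the features and labels carry no meaning) illustrates this:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(524)
X = rng.integers(0, 5, size=(60, 8))  # toy non-negative count features
y = np.array([0, 1] * 30)             # toy labels

cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One metric: the per-fold scores are stored under 'test_score'.
single = cross_validate(MultinomialNB(alpha=1.0), X, y, scoring="accuracy", cv=cv)

# Several metrics: one 'test_<name>' entry per metric.
multi = cross_validate(MultinomialNB(alpha=1.0), X, y, scoring=("accuracy", "f1"), cv=cv)

print(sorted(single))
print(sorted(multi))
```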

In [33]:
from sklearn.naive_bayes import MultinomialNB

def get_params_mnB(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    return {'alpha': alpha}

In [34]:
'''
X_trains = []
params_fixed = {'vectorizer_type': [], 'norm': [], 'ngram_range': []}

for use_idf in [False,True]:
    for norm in ['l1','l2']:
        for sublinear_tf in [False,True]:
            for ngram_range in [(1,3)]:
                if use_idf == False and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF')
                elif use_idf == False and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)')
                elif use_idf == True and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF-IDF')
                elif use_idf == True and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)-IDF')
                params_fixed['norm'].append(norm)
                params_fixed['ngram_range'].append(ngram_range)

                X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
                X_trains.append(X_train)
'''

In [35]:
'''
%%time
mnB_tune_ngrams = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))
'''

In [36]:
'''
mnB_tune_ngrams = pd.DataFrame(mnB_tune_ngrams)
display(mnB_tune_ngrams.style.hide())
'''


We can see that the (1,3) n-gram range consistently outperformed the lower ranges. Let's look at the dependence on the other hyperparameters:

In [37]:
#display(mnB_tune_ngrams[mnB_tune_ngrams['ngram_range']==(1,3)].style.hide())

Interestingly, L$^2$ normalisation performs better than L$^1$, and among the models using L$^1$ normalisation, the ones that used log(TF) performed slightly better. Incorporating the IDF factor made no difference.

Let's now look at including the video category:

## Including the video category

In [38]:
'''
%%time

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []

ngram_range = (1,3)
for sublinear_tf in [False,True]:
    for use_idf in [False,True]:
        if use_idf == False and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF')
        elif use_idf == False and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)')
        elif use_idf == True and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF-IDF')
        elif use_idf == True and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)-IDF')

        params_fixed['ngram_range'].append(ngram_range)

        X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)

mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))
'''

In [39]:
#pd.DataFrame(mnB_tune_category).style.hide()

## Dimensionality reduction

Our best performing models use the (1,3) n-gram range, which requires about 3.5 million features. We will now look at reducing the number of features by setting a minimum and maximum document frequency filter that drops tokens from the vocabulary that are either too rare or too common. I'll show results for log(TF)-IDF with L$^2$ norm.
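The document frequency filter itself is simple to sketch in plain Python. The function below is an illustration rather than scikit-learn's implementation: it takes `min_df` as an absolute document count and `max_df` as a fraction of the documents, matching the way the two parameters are used in this notebook. The toy documents are made up.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies between min_df and max_df * n_docs."""
    n_docs = len(tokenized_docs)
    # Count each token at most once per document.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    max_count = max_df * n_docs
    return sorted(token for token, df in doc_freq.items() if min_df <= df <= max_count)

docs = [["science", "shorts"],
        ["science", "physics"],
        ["science", "physics", "viral"],
        ["science"]]

full_vocabulary = filter_vocabulary(docs)
reduced_vocabulary = filter_vocabulary(docs, min_df=2, max_df=0.9)
print(full_vocabulary)
print(reduced_vocabulary)
```

Here "science" is dropped for being too common and "shorts" and "viral" for being too rare, leaving a single feature.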

In [40]:
'''
%%time

X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:
    for max_df in [1.0, 0.9, 0.8, 0.7]:
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))
'''


In [41]:
#pd.DataFrame(mnB_tune_dim_reduction).style.hide()

We see that as the vocabulary size is decreased, the cross-validation scores rapidly degrade.

All of these results show that predicting whether a video will be popular from its text metadata is a machine-learning problem that runs counter to the common wisdom in text classification tasks, where pruning very rare and very common terms and weighting by IDF usually help. This is a fundamentally different challenge to, for example, determining whether a text message or email is spam. In our case, both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether or not a viewer likes a certain YouTube video or channel, and whether they share it on social media to contribute to its virality, is primarily a subjective determination, which makes the classification problem significantly more difficult, and this is reflected in the low cross-validation metrics we have seen so far.

## Further classification models

Now that we have understood the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF modalities and the normalisation), we are ready to build some more models. In addition to multinomial naive Bayes, I will use four other classification models:

* K nearest neighbours
* Support vector machine
* Logistic regression
* Perceptron

These models all have their own hyperparameters, which we'll tune using an Optuna hyperparameter search with five-fold cross-validation, as we did earlier for Ridge regression. For the support vector machine, logistic regression and the perceptron algorithm we'll use stochastic gradient descent to speed up training. We'll also extend the n-gram range and increase the number of trials to 250, since we have seen Optuna reach the best hyperparameters close to the previous maximum of 100 trials.

In [42]:
#max_trials = 250

In [43]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier
params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

for ngram_range in [(1,3)]:
#for ngram_range in [(1,3),(1,4),(1,5)]:
    for sublinear_tf in [False,True]:
        for use_idf in [False,True]:
            # Name the weighting scheme selected by the two vectoriser flags.
            if not use_idf and not sublinear_tf:
                params_fixed['vectorizer_type'].append('TF')
            elif not use_idf and sublinear_tf:
                params_fixed['vectorizer_type'].append('log(TF)')
            elif use_idf and not sublinear_tf:
                params_fixed['vectorizer_type'].append('TF-IDF')
            else:
                params_fixed['vectorizer_type'].append('log(TF)-IDF')

            params_fixed['ngram_range'].append(ngram_range)

            X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
            X_trains.append(X_train)
            vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
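The four weighting schemes built in the loop above differ only in the `use_idf` and `sublinear_tf` flags. The following sketch, on a hypothetical three-document corpus, shows how each scheme changes the L2-normalised weight of a frequent term. It calls `TfidfVectorizer` directly rather than the notebook's `get_features` helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "science" is frequent and appears in every document.
docs = [
    "science science science kids",
    "science quiz",
    "science kids quiz quiz",
]

schemes = {
    "TF": dict(use_idf=False, sublinear_tf=False),
    "log(TF)": dict(use_idf=False, sublinear_tf=True),
    "TF-IDF": dict(use_idf=True, sublinear_tf=False),
    "log(TF)-IDF": dict(use_idf=True, sublinear_tf=True),
}

weights = {}
for name, flags in schemes.items():
    vec = TfidfVectorizer(norm="l2", **flags)
    X = vec.fit_transform(docs)
    col = vec.vocabulary_["science"]
    # Weight of "science" in the first document under this scheme.
    weights[name] = X[0, col]

for name, w in weights.items():
    print(f"{name:12s} {w:.4f}")
```

Both the logarithmic term frequency and the IDF factor reduce the relative weight of a term that is repeated often or appears in every document.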


### Multinomial naive Bayes

In [44]:
%%time

def get_params_mnB(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    return {'alpha': alpha}

mnB_tune = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-13 15:36:39,068] A new study created in memory with name: no-name-f785088c-72cd-4c32-b438-6f7d1a5b9e80
  pid = os.fork()
[I 2024-05-13 15:36:47,882] Trial 0 finished with value: 0.7973783764087002 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7973783764087002.
[I 2024-05-13 15:36:51,688] Trial 1 finished with value: 0.799470983024082 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 15:36:57,452] Trial 2 finished with value: 0.7991551675825793 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 15:37:01,822] Trial 3 finished with value: 0.7537901399454154 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 15:37:06,329] Trial 4 finished with value: 0.7977731885800425 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.799470983024082.
[I 20

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       1.360242  0.80991/0.81842/0.82965   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  


[I 2024-05-13 15:44:45,471] Trial 0 finished with value: 0.7960361023239535 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7960361023239535.
[I 2024-05-13 15:44:49,385] Trial 1 finished with value: 0.7985233886050628 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7985233886050628.
[I 2024-05-13 15:44:55,289] Trial 2 finished with value: 0.7986024243071418 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 15:44:59,906] Trial 3 finished with value: 0.7892450492589623 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 15:45:03,922] Trial 4 finished with value: 0.7967862166100466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 15:45:09,930] Trial 5 finished with value: 0.7968256954888464 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       1.360242  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)       0.945496  0.80754/0.82012/0.83083   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  
1  0.87564/0.88196/0.89345  0.145222  


[I 2024-05-13 15:52:37,477] Trial 0 finished with value: 0.8043667381287636 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 15:52:41,360] Trial 1 finished with value: 0.8037744224411509 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 15:52:45,238] Trial 2 finished with value: 0.8028269215555067 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 15:52:50,955] Trial 3 finished with value: 0.766503274252717 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 15:52:54,947] Trial 4 finished with value: 0.8047219700934827 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8047219700934827.
[I 2024-05-13 15:52:59,210] Trial 5 finished with value: 0.804643012335883 and parameters: {'alpha': 0.00039799342667825053}. Best is

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       1.360242  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)       0.945496  0.80754/0.82012/0.83083   
2         log(TF)      (1, 3)       1.615309  0.81030/0.82115/0.83281   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   
2  0.77851/0.79233/0.80597  0.73189/0.74083/0.75789  0.75553/0.76569/0.78119   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  
1  0.87564/0.88196/0.89345  0.145222  
2  0.87689/0.88348/0.89532  0.090669  


[I 2024-05-13 16:00:25,558] Trial 0 finished with value: 0.798878721897605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798878721897605.
[I 2024-05-13 16:00:29,695] Trial 1 finished with value: 0.8001026762626713 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 16:00:35,654] Trial 2 finished with value: 0.7999052584853283 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 16:00:39,883] Trial 3 finished with value: 0.7911796312368737 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 16:00:43,783] Trial 4 finished with value: 0.8002210193656957 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8002210193656957.
[I 2024-05-13 16:00:49,763] Trial 5 finished with value: 0.8001815404868957 and parameters: {'alpha': 0.00039799342667825053}. Best is

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       1.360242  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)       0.945496  0.80754/0.82012/0.83083   
2         log(TF)      (1, 3)       1.615309  0.81030/0.82115/0.83281   
3     log(TF)-IDF      (1, 3)       1.315254  0.80853/0.82099/0.83281   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   
2  0.77851/0.79233/0.80597  0.73189/0.74083/0.75789  0.75553/0.76569/0.78119   
3  0.77996/0.79514/0.80794  0.72203/0.73584/0.75489  0.75204/0.76431/0.78051   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  
1  0.87564/0.88196/0.89345  0.145222  
2  0.87689/0.88348/0.89532  0.090669  
3  0.87435/0.88119/0.89339  0.166633  
CPU times: user 6min 17s, sys: 1min 3s

[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.2s finished


In [45]:
pd.DataFrame(mnB_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",1.360242,0.80991/0.81842/0.82965,0.76245/0.77930/0.79236,0.73829/0.75305/0.76892,0.75729/0.76590/0.78046,0.88022/0.88591/0.89647,0.065719
TF-IDF,"(1, 3)",0.945496,0.80754/0.82012/0.83083,0.77413/0.79016/0.80138,0.72942/0.74083/0.75840,0.75273/0.76467/0.77929,0.87564/0.88196/0.89345,0.145222
log(TF),"(1, 3)",1.615309,0.81030/0.82115/0.83281,0.77851/0.79233/0.80597,0.73189/0.74083/0.75789,0.75553/0.76569/0.78119,0.87689/0.88348/0.89532,0.090669
log(TF)-IDF,"(1, 3)",1.315254,0.80853/0.82099/0.83281,0.77996/0.79514/0.80794,0.72203/0.73584/0.75489,0.75204/0.76431/0.78051,0.87435/0.88119/0.89339,0.166633


We can observe that the cross-validation metrics improve across the board as the n-gram range is increased from (1,1) to (1,3), but including longer n-grams does not seem to lead to a significant improvement in the accuracy, F1 score or area under the ROC curve.
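For reference, the `alpha` column in the table above is the additive smoothing parameter of multinomial naive Bayes. As I understand scikit-learn's formulation, the likelihood of feature $i$ in class $c$ is estimated as below; the symbols $N_{ci}$, $N_c$ and $n$ are notation introduced here for illustration and do not appear elsewhere in the notebook.

$$\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}$$

Here $N_{ci}$ is the total weight of feature $i$ over the training samples of class $c$, $N_c = \sum_i N_{ci}$, and $n$ is the number of features. A larger $\alpha$ pulls the estimates towards a uniform distribution over features, which matters when most n-grams occur only a handful of times.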

### Support vector machine

In [46]:
%%time

def get_params_SVM(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    return {'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

SVM_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-13 16:08:28,727] A new study created in memory with name: no-name-b1ded412-c20b-4a77-aa1f-5adfa940ce94
[I 2024-05-13 16:08:57,336] Trial 0 finished with value: 0.712610481427974 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.712610481427974.
[I 2024-05-13 16:10:10,096] Trial 1 finished with value: 0.8144341951784323 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8144341951784323.
[I 2024-05-13 16:10:41,964] Trial 2 finished with value: 0.7563563372174367 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8144341951784323.
[I 2024-05-13 16:12:10,395] Trial 3 finished with value: 0.8173955787552968 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8173955787552968.
[I 2024-05-13 16:12:27,440] Trial 4 finished with value: 0.6150

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.298808  0.81405/0.82462/0.83636   

                 precision                   recall                       f1  \
0  0.77457/0.79888/0.81742  0.72154/0.74303/0.76140  0.75981/0.76971/0.78562   

     alpha  l1_ratio  
0  0.00006  0.068926  


[I 2024-05-13 17:44:10,423] Trial 0 finished with value: 0.6482154182754083 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6482154182754083.
[I 2024-05-13 17:45:07,558] Trial 1 finished with value: 0.8178693876564784 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8178693876564784.
[I 2024-05-13 17:45:38,667] Trial 2 finished with value: 0.7269420704937666 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8178693876564784.
[I 2024-05-13 17:46:58,015] Trial 3 finished with value: 0.8179481037861921 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8179481037861921.
[I 2024-05-13 17:47:16,289] Trial 4 finished with value: 0.6064831645770401 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.298808  0.81405/0.82462/0.83636   
1          TF-IDF      (1, 3)       7.309383  0.81366/0.82383/0.83597   

                 precision                   recall                       f1  \
0  0.77457/0.79888/0.81742  0.72154/0.74303/0.76140  0.75981/0.76971/0.78562   
1  0.79803/0.81227/0.83219  0.70922/0.71978/0.73619  0.75275/0.76321/0.77822   

      alpha  l1_ratio  
0  0.000060  0.068926  
1  0.000056  0.053256  


[I 2024-05-13 19:07:11,547] Trial 0 finished with value: 0.7157296869866625 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7157296869866625.
[I 2024-05-13 19:08:33,031] Trial 1 finished with value: 0.814513277647199 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.814513277647199.
[I 2024-05-13 19:09:05,110] Trial 2 finished with value: 0.7557639981464803 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.814513277647199.
[I 2024-05-13 19:10:25,444] Trial 3 finished with value: 0.8194484025084092 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8194484025084092.
[I 2024-05-13 19:10:42,186] Trial 4 finished with value: 0.6077073215977526 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 with

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.298808  0.81405/0.82462/0.83636   
1          TF-IDF      (1, 3)       7.309383  0.81366/0.82383/0.83597   
2         log(TF)      (1, 3)       7.626954  0.82136/0.82813/0.83972   

                 precision                   recall                       f1  \
0  0.77457/0.79888/0.81742  0.72154/0.74303/0.76140  0.75981/0.76971/0.78562   
1  0.79803/0.81227/0.83219  0.70922/0.71978/0.73619  0.75275/0.76321/0.77822   
2  0.79115/0.80777/0.81904  0.73090/0.74057/0.76692  0.75988/0.77263/0.79029   

      alpha  l1_ratio  
0  0.000060  0.068926  
1  0.000056  0.053256  
2  0.000082  0.001323  


[I 2024-05-13 20:36:16,311] Trial 0 finished with value: 0.6344756304636643 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6344756304636643.
[I 2024-05-13 20:37:21,070] Trial 1 finished with value: 0.8192513588645672 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8192513588645672.
[I 2024-05-13 20:37:47,753] Trial 2 finished with value: 0.7118204673628927 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8192513588645672.
[I 2024-05-13 20:39:03,589] Trial 3 finished with value: 0.8189746559628113 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8192513588645672.
[I 2024-05-13 20:39:20,765] Trial 4 finished with value: 0.6069175257772137 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.298808  0.81405/0.82462/0.83636   
1          TF-IDF      (1, 3)       7.309383  0.81366/0.82383/0.83597   
2         log(TF)      (1, 3)       7.626954  0.82136/0.82813/0.83972   
3     log(TF)-IDF      (1, 3)       7.166856  0.80517/0.81475/0.82945   

                 precision                   recall                       f1  \
0  0.77457/0.79888/0.81742  0.72154/0.74303/0.76140  0.75981/0.76971/0.78562   
1  0.79803/0.81227/0.83219  0.70922/0.71978/0.73619  0.75275/0.76321/0.77822   
2  0.79115/0.80777/0.81904  0.73090/0.74057/0.76692  0.75988/0.77263/0.79029   
3  0.75000/0.78235/0.81504  0.66338/0.73731/0.77153  0.73143/0.75801/0.77823   

          alpha  l1_ratio  
0  5.979656e-05  0.068926  
1  5.579122e-05  0.053256  
2  8.208496e-05  0.001323  
3  4.336836e-07  0.961855  
CPU times: user 9min 33s, sys: 1min 39s, total: 11min 13s
Wall time: 5h 47min 26s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   19.7s finished


In [47]:
pd.DataFrame(SVM_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",7.298808,0.81405/0.82462/0.83636,0.77457/0.79888/0.81742,0.72154/0.74303/0.76140,0.75981/0.76971/0.78562,6e-05,0.068926
TF-IDF,"(1, 3)",7.309383,0.81366/0.82383/0.83597,0.79803/0.81227/0.83219,0.70922/0.71978/0.73619,0.75275/0.76321/0.77822,5.6e-05,0.053256
log(TF),"(1, 3)",7.626954,0.82136/0.82813/0.83972,0.79115/0.80777/0.81904,0.73090/0.74057/0.76692,0.75988/0.77263/0.79029,8.2e-05,0.001323
log(TF)-IDF,"(1, 3)",7.166856,0.80517/0.81475/0.82945,0.75000/0.78235/0.81504,0.66338/0.73731/0.77153,0.73143/0.75801/0.77823,0.0,0.961855
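To interpret the `alpha` and `l1_ratio` columns, it may help to write out the objective that `SGDClassifier` minimises with an elastic-net penalty. The form below follows my reading of the scikit-learn user guide; $r$ denotes `l1_ratio`, and $w$, $b$, $x_i$, $y_i$ and $n$ are notation introduced here.

$$\min_{w,\,b}\; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\, w^\top x_i + b\big) + \alpha\left[\frac{1-r}{2}\,\lVert w\rVert_2^2 + r\,\lVert w\rVert_1\right]$$

For the support vector machine, $L$ is the hinge loss $L(y, f) = \max(0,\, 1 - y f)$ with labels $y \in \{-1, +1\}$. With $r$ close to 0 the penalty is almost purely L2, which is where the search ended up for the TF, TF-IDF and log(TF) features. The log(TF)-IDF row instead settled on $r \approx 0.96$ with a much smaller $\alpha \approx 4.3\times 10^{-7}$, that is, a weak but nearly pure L1 penalty.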


### Logistic regression

In [48]:
%%time

def get_params_log_reg(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    return {'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-13 21:55:54,950] A new study created in memory with name: no-name-74dce7b8-8880-4413-8d79-8e73a0312fc0
[I 2024-05-13 21:56:11,666] Trial 0 finished with value: 0.6917245332976867 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6917245332976867.
[I 2024-05-13 21:57:35,590] Trial 1 finished with value: 0.8131707697290145 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8131707697290145.
[I 2024-05-13 21:57:54,404] Trial 2 finished with value: 0.7299036021651418 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8131707697290145.
[I 2024-05-13 21:58:29,969] Trial 3 finished with value: 0.8035371126792674 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8131707697290145.
[I 2024-05-13 21:58:46,126] Trial 4 finished with value: 0.64

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.451695  0.80932/0.81862/0.82689   

                 precision                   recall                       f1  \
0  0.77529/0.77984/0.79379  0.70681/0.75307/0.78697  0.74778/0.76580/0.78168   

                   roc_auc     alpha  l1_ratio  
0  0.88447/0.89003/0.89960  0.000012  0.072557  


[I 2024-05-13 23:00:42,129] Trial 0 finished with value: 0.6475836081200999 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6475836081200999.
[I 2024-05-13 23:01:49,010] Trial 1 finished with value: 0.8169613578551862 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8169613578551862.
[I 2024-05-13 23:02:08,306] Trial 2 finished with value: 0.7059379585327575 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8169613578551862.
[I 2024-05-13 23:02:24,701] Trial 3 finished with value: 0.8066957269667243 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8169613578551862.
[I 2024-05-13 23:02:42,959] Trial 4 finished with value: 0.6056541237111392 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.451695  0.80932/0.81862/0.82689   
1          TF-IDF      (1, 3)       4.930783  0.80754/0.81467/0.82807   

                 precision                   recall                       f1  \
0  0.77529/0.77984/0.79379  0.70681/0.75307/0.78697  0.74778/0.76580/0.78168   
1  0.77489/0.80497/0.83215  0.68361/0.70067/0.72478  0.73965/0.74891/0.76376   

                   roc_auc     alpha  l1_ratio  
0  0.88447/0.89003/0.89960  0.000012  0.072557  
1  0.87562/0.88595/0.89696  0.000005  0.033032  


[I 2024-05-13 23:57:56,367] Trial 0 finished with value: 0.69318536873 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.69318536873.
[I 2024-05-13 23:59:08,951] Trial 1 finished with value: 0.8172772278578246 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8172772278578246.
[I 2024-05-13 23:59:27,978] Trial 2 finished with value: 0.7338911014295407 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8172772278578246.
[I 2024-05-14 00:00:04,448] Trial 3 finished with value: 0.8095383777181675 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8172772278578246.
[I 2024-05-14 00:00:22,204] Trial 4 finished with value: 0.6142609012174538 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with value:

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.451695  0.80932/0.81862/0.82689   
1          TF-IDF      (1, 3)       4.930783  0.80754/0.81467/0.82807   
2         log(TF)      (1, 3)       6.871266  0.81248/0.81957/0.82866   

                 precision                   recall                       f1  \
0  0.77529/0.77984/0.79379  0.70681/0.75307/0.78697  0.74778/0.76580/0.78168   
1  0.77489/0.80497/0.83215  0.68361/0.70067/0.72478  0.73965/0.74891/0.76376   
2  0.76564/0.78714/0.81399  0.70527/0.74475/0.78496  0.75574/0.76488/0.78300   

                   roc_auc     alpha  l1_ratio  
0  0.88447/0.89003/0.89960  0.000012  0.072557  
1  0.87562/0.88595/0.89696  0.000005  0.033032  
2  0.88682/0.89119/0.90040  0.000008  0.063208  


[I 2024-05-14 00:58:28,079] Trial 0 finished with value: 0.6283561470329071 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6283561470329071.
[I 2024-05-14 00:59:23,575] Trial 1 finished with value: 0.8222124150746183 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8222124150746183.
[I 2024-05-14 00:59:42,586] Trial 2 finished with value: 0.7003710936662706 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8222124150746183.
[I 2024-05-14 01:00:00,502] Trial 3 finished with value: 0.8036159301368041 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8222124150746183.
[I 2024-05-14 01:00:16,472] Trial 4 finished with value: 0.605575119186852 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 wi

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       7.451695  0.80932/0.81862/0.82689   
1          TF-IDF      (1, 3)       4.930783  0.80754/0.81467/0.82807   
2         log(TF)      (1, 3)       6.871266  0.81248/0.81957/0.82866   
3     log(TF)-IDF      (1, 3)       5.529133  0.80833/0.81641/0.82313   

                 precision                   recall                       f1  \
0  0.77529/0.77984/0.79379  0.70681/0.75307/0.78697  0.74778/0.76580/0.78168   
1  0.77489/0.80497/0.83215  0.68361/0.70067/0.72478  0.73965/0.74891/0.76376   
2  0.76564/0.78714/0.81399  0.70527/0.74475/0.78496  0.75574/0.76488/0.78300   
3  0.77790/0.79716/0.84181  0.67218/0.71896/0.74664  0.74749/0.75534/0.77002   

                   roc_auc     alpha  l1_ratio  
0  0.88447/0.89003/0.89960  0.000012  0.072557  
1  0.87562/0.88595/0.89696  0.000005  0.033032  
2  0.88682/0.89119/0.90040  0.000008  0.063208  
3  0.88312/0.88799/0.89360  0.000002 

[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   15.3s finished


In [49]:
pd.DataFrame(log_reg_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha,l1_ratio
TF,"(1, 3)",7.451695,0.80932/0.81862/0.82689,0.77529/0.77984/0.79379,0.70681/0.75307/0.78697,0.74778/0.76580/0.78168,0.88447/0.89003/0.89960,1.2e-05,0.072557
TF-IDF,"(1, 3)",4.930783,0.80754/0.81467/0.82807,0.77489/0.80497/0.83215,0.68361/0.70067/0.72478,0.73965/0.74891/0.76376,0.87562/0.88595/0.89696,5e-06,0.033032
log(TF),"(1, 3)",6.871266,0.81248/0.81957/0.82866,0.76564/0.78714/0.81399,0.70527/0.74475/0.78496,0.75574/0.76488/0.78300,0.88682/0.89119/0.90040,8e-06,0.063208
log(TF)-IDF,"(1, 3)",5.529133,0.80833/0.81641/0.82313,0.77790/0.79716/0.84181,0.67218/0.71896/0.74664,0.74749/0.75534/0.77002,0.88312/0.88799/0.89360,2e-06,0.019979


### Perceptron

In [50]:
%%time

def get_params_perceptron(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    return {'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-14 01:55:08,740] A new study created in memory with name: no-name-17c4674b-4b67-40a2-8618-695d7eb59e63
[I 2024-05-14 01:55:23,148] Trial 0 finished with value: 0.614024994456199 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.614024994456199.
[I 2024-05-14 01:56:35,116] Trial 1 finished with value: 0.8149474595750702 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8149474595750702.
[I 2024-05-14 01:56:50,862] Trial 2 finished with value: 0.6376338472342765 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8149474595750702.
[I 2024-05-14 01:57:20,488] Trial 3 finished with value: 0.7273784972226434 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8149474595750702.
[I 2024-05-14 01:57:35,734] Trial 4 finished with value: 0.6320

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)      10.398158  0.78879/0.80563/0.81919   

                 precision                   recall                       f1  \
0  0.71668/0.77509/0.83365  0.65605/0.72151/0.78036  0.73426/0.74527/0.76059   

          alpha  l1_ratio  
0  6.795851e-08  0.909917  


[I 2024-05-14 03:39:22,279] Trial 0 finished with value: 0.5944041086093964 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5944041086093964.
[I 2024-05-14 03:40:27,197] Trial 1 finished with value: 0.8172772200633766 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8172772200633766.
[I 2024-05-14 03:40:42,385] Trial 2 finished with value: 0.6170283589296508 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8172772200633766.
[I 2024-05-14 03:41:09,056] Trial 3 finished with value: 0.7313256991912092 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8172772200633766.
[I 2024-05-14 03:41:23,690] Trial 4 finished with value: 0.5496689425155568 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

KeyboardInterrupt: 

In [51]:
pd.DataFrame(perceptron_tune).style.hide()

NameError: name 'perceptron_tune' is not defined

## Training the final models

Now that we've obtained the optimal hyperparameters, we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [None]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}

for n in range(len(X_trains)):
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune[n]['alpha']))
    svm_clfs.append(SGDClassifier(loss='hinge', penalty='elasticnet', alpha=SVM_tune[n]['alpha'], l1_ratio=SVM_tune[n]['l1_ratio']))
    logreg_clfs.append(SGDClassifier(loss='log_loss', penalty='elasticnet', alpha=log_reg_tune[n]['alpha'], l1_ratio=log_reg_tune[n]['l1_ratio']))
    perceptron_clfs.append(SGDClassifier(loss='perceptron', penalty='elasticnet', alpha=perceptron_tune[n]['alpha'], l1_ratio=perceptron_tune[n]['l1_ratio']))

    for model in [mnB_clfs[-1],svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    # params_fixed is a dict of lists, so index the list for run n.
    # Keys omit the 'models/' directory, which is prepended when saving.
    tag = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    models[f"mnB_{tag}"] = mnB_clfs[-1]
    models[f"svm_{tag}"] = svm_clfs[-1]
    models[f"logreg_{tag}"] = logreg_clfs[-1]
    models[f"perceptron_{tag}"] = perceptron_clfs[-1]

In [None]:
import os
import joblib

# Make sure the output directories exist before writing to them.
os.makedirs('models/vectorizers', exist_ok=True)

for model_name in models:
    joblib.dump(models[model_name], 'models/'+model_name+'.joblib')

for n in range(len(vectorizers)):
    joblib.dump(vectorizers[n], f"models/vectorizers/{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}")

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')
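Since the saved files are reloaded in the next notebook, a quick round-trip check can be useful. The sketch below uses a throwaway `MultinomialNB` fitted on random data and a temporary directory, both of which are assumptions for illustration rather than the notebook's actual models or paths.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_demo = rng.random((50, 8))            # non-negative features, as MultinomialNB expects
y_demo = rng.integers(0, 2, size=50)    # random binary labels

clf = MultinomialNB(alpha=0.1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "mnB_demo.joblib")
    joblib.dump(clf, path)              # serialise the fitted estimator
    restored = joblib.load(path)        # load it back from disk

# The restored estimator should reproduce the original predictions exactly.
same_predictions = np.array_equal(clf.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```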

## Probability calibration

The cross-validation scores show that the models are still quite far from being accurate. We would like to model the probability $P(y\in \mathcal{C}|P)$ that a data point $y$ belongs to class $\mathcal{C}$ given the prediction $P$ of each model, which is not the same as the probability the model itself reports. (Some models report no probabilities at all.) We can do this using a probability calibrator, which treats the prediction of each model as a feature that is then used to model the true probability. This requires validation data, so we'll again use a five-fold cross-validation split.
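As a minimal sketch of the idea, the example below wraps a hinge-loss `SGDClassifier`, which only produces decision scores, in a `CalibratedClassifierCV` so that it returns class probabilities. The synthetic data and default hyperparameters are assumptions for illustration; the calibrator's default sigmoid method is used.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for the text features (assumption: not the YouTube data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)

# A linear SVM trained by SGD: it outputs decision scores, not probabilities.
svm = SGDClassifier(loss="hinge", random_state=0)

# For each of the five splits, the model is fitted on four folds and the
# calibrator is fitted on the model's scores for the held-out fold.
calibrated = CalibratedClassifierCV(svm, cv=KFold(n_splits=5, shuffle=True, random_state=42))
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])   # one row per sample, one column per class
row_sums = proba.sum(axis=1)
print(proba)
```

Each row of `proba` sums to one, so the second column can be read as the calibrated probability of class 1.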

In [None]:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import KFold

calibrated_clfs = {}

# Calibrate each model using five shuffled cross-validation folds.
for model_name in models:
    calibrated_clfs[model_name] = CalibratedClassifierCV(models[model_name], cv=KFold(n_splits=5, random_state=42, shuffle=True))
    calibrated_clfs[model_name].fit(X_train, y_train)

We'll save the calibrated models for evaluation in the next notebook:

In [None]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], 'models/'+model_name+'_calibrated.joblib')

## Stacking

Now that we have our four models, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a meta-classifier gathers the predictions of the individual models and uses them as features to produce a final prediction. We will need to train the meta-classifier with cross-validation and select a model for it. We will compare two choices: logistic regression and Gaussian naive Bayes.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
models

In [None]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

In [None]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)

In [None]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')