# YouTube popularity predictor (Part 2): text frequency-based models

In the previous notebook, we used natural language processing (NLP) to explore the YouTube video dataset and hunted for possible correlations between the language features in the video titles and descriptions and the video popularity, which we associated with a binary categorical variable corresponding to a video having obtained over 50k views (class 1) or under 50k views (class 0). We did indeed see that the frequency of the tokens in the byte-pair encoded text had predictive value for classification. In this notebook we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')

In [2]:
videos = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000,0
31659,Gegee gegee,22,28 March 2024,,1,0.477121,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364,0


In [3]:
videos.groupby('video_category').describe()

video_category,months count,months mean,months std,months min,months 25%,months 50%,months 75%,months max,video_view_count count,video_view_count mean,...,video_view_count 75%,video_view_count max,label count,label mean,label std,label min,label 25%,label 50%,label 75%,label max
1,306.0,42.245098,23.89558,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.408497,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.47486,22.772031,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.413408,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.383673,22.988052,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.559184,0.497501,0.0,0.0,1.0,1.0,1.0
15,41.0,31.439024,23.293828,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.463415,0.504854,0.0,0.0,0.0,1.0,1.0
17,487.0,51.034908,19.407606,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2423,0.428915,0.0,0.0,0.0,0.0,1.0
19,111.0,38.675676,22.124508,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.306306,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.15257,18.760156,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.461028,0.498893,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.821643,22.614164,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.236495,0.424966,0.0,0.0,0.0,0.0,1.0
23,220.0,27.2,24.878628,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.654545,0.476601,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.526049,22.925857,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.499927,0.0,0.0,1.0,1.0,1.0

Category 30 contains only a single video, which we inspect and then drop: a category with one member cannot be divided between the train and test sets by the category-stratified split used below.


In [4]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [5]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [6]:
videos.reset_index(drop=True, inplace=True)

Let's look at the distribution of video view counts:

In [7]:
videos[['months','video_view_count','label']].groupby('label').describe()

label,months count,months mean,months std,months min,months 25%,months 50%,months 75%,months max,video_view_count count,video_view_count mean,video_view_count std,video_view_count min,video_view_count 25%,video_view_count 50%,video_view_count 75%,video_view_count max
0,19168.0,40.844011,21.007736,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.561354,21.267795,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.68341,4.699005,5.037442,5.433327,5.977578,8.588679


We can see that the classes are not exactly balanced: roughly 61% of the videos (19,168) fall in class 0 and 39% (12,493) in class 1. This is because the classification is based on the round milestone of 50k views; exactly balancing the data would require a threshold that is far less striking.
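As a quick check on the threshold, the largest class 0 value in the table above (4.69897) matches $\log_{10}(50000)$, which suggests that `video_view_count` stores the base-10 logarithm of the raw view count. This is inferred from the numbers rather than stated in the dataset, so the sketch below should be read under that assumption; `label_from_log_views` is a hypothetical helper and not part of the notebook.

```python
import math

THRESHOLD_VIEWS = 50_000
LOG_THRESHOLD = math.log10(THRESHOLD_VIEWS)  # about 4.69897

def label_from_log_views(log_views):
    """Return 1 if the (assumed) log10 view count exceeds the 50k milestone, else 0."""
    return int(log_views > LOG_THRESHOLD)

# Median values of each class, taken from the table above
print(round(LOG_THRESHOLD, 5))
print(label_from_log_views(5.433327), label_from_log_views(3.633519))
```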

We'll select a test set based on an 80/20 train/test split which we will then use for all future model building and validation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
# X_train/X_test keep the index labels of videos, so select the matching rows by label
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]

In [9]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255,1


In [10]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [11]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We will convert all text to lower case to make the vocabulary size smaller.
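To give a feel for what byte-pair encoding does, here is a minimal standard-library sketch of the merge-learning loop: start from single characters and repeatedly merge the most frequent adjacent pair. It is a toy illustration only, not the Hugging Face `tokenizers` implementation used in this notebook: it splits on whitespace only and ignores special tokens, and the function names (`learn_bpe`, `merge_pair`, `most_frequent_pair`) are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(texts, num_merges):
    """Learn up to `num_merges` merges from lower-cased, whitespace-split texts."""
    words = dict(Counter(tuple(w) for t in texts for w in t.lower().split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

toy_titles = ["Science fun", "science experiment", "fun science facts"]
merges, vocab = learn_bpe(toy_titles, num_merges=6)
print(merges)
print(vocab)
```

Frequent words such as "science" are gradually merged into longer units, while rare words stay split into smaller pieces, which is what keeps the vocabulary size bounded.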

We first have to set all NA values to empty strings:

In [12]:
train = train.fillna('')

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}


In [14]:
%%time
BPE_tokenizers_uncased = {}

for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")










CPU times: user 50.6 s, sys: 208 ms, total: 50.8 s
Wall time: 50.8 s


In [15]:
from transformers import PreTrainedTokenizerFast

tokenizers_trained_uncased = {}

for field in training_data_uncased:
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

In [16]:
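def tokenize(text, field, cased=True):
    # Only uncased tokenizers are trained in this notebook
    if cased:
        raise NotImplementedError("only uncased tokenizers are available; pass cased=False")
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=True):
    if cased:
        raise NotImplementedError("only uncased tokenizers are available; pass cased=False")
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])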


In [17]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|█████████████████| 25328/25328 [00:00<00:00, 48254.78it/s]
processing rows: 100%|█████████████████| 25328/25328 [00:01<00:00, 22289.60it/s]
processing rows: 100%|██████████████████| 25328/25328 [00:09<00:00, 2649.55it/s]


In [18]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [19]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenized text. The EDA in the previous notebook suggests that a text-frequency-based analysis could be a powerful tool for language-based prediction. We can use TfidfVectorizer() from scikit-learn, which counts the tokens in each text and generates a vector of numerical token-frequency features. Rather than relying only on the token frequency within an individual sample (the *term frequency*), however, TfidfVectorizer also incorporates how many samples in the training corpus contain each token (the *document frequency*). The term frequency of each token $i$ is multiplied by an inverse document frequency weight, which in its basic form is IDF = $\log(\frac{N_{\text{samples}}}{N_{\text{samples containing }i}})$ and describes the specificity of the token to the sample. With its default settings (smooth_idf=True), scikit-learn uses the smoothed variant IDF = $\log(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}) + 1$.
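To make the weighting concrete, the sketch below computes the textbook IDF, $\log(N/\mathrm{df})$, alongside the smoothed variant $\log((1+N)/(1+\mathrm{df})) + 1$ that scikit-learn documents for `smooth_idf=True`. The toy documents are made-up lists of token ids in the same string format as the tokenized columns, and the helper names are hypothetical.

```python
import math

def document_frequency(docs):
    """Number of documents in which each token appears at least once."""
    df = {}
    for doc in docs:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    return df

def idf_plain(n_docs, df):
    """Textbook inverse document frequency."""
    return math.log(n_docs / df)

def idf_smooth(n_docs, df):
    """Smoothed variant: acts as if one extra document contained every token."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Hypothetical pre-tokenized documents
docs = [["1815", "7", "2062"], ["1815", "3945"], ["1815", "7"]]
df = document_frequency(docs)
for token in sorted(df):
    print(token, df[token],
          round(idf_plain(len(docs), df[token]), 4),
          round(idf_smooth(len(docs), df[token]), 4))
```

A token that appears in every document gets a plain IDF of zero, so it drops out of the features entirely, whereas the smoothed form keeps it with weight 1.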

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: tokens (or n-grams) are dropped from the vocabulary if they appear in fewer than min_df or more than max_df documents. An integer is read as a number of documents and a float as a fraction of the corpus. This allows for dimensionality reduction.
* use_idf: this incorporates the IDF factor into the vector representation of the text. Without it, the text is represented by the frequency of each token or n-gram appearing in the text, up to a normalisation factor. With use_idf, each frequency is multiplied by the IDF weight, which down-weights tokens that appear in a large number of documents.
* norm: with 'l1', each feature vector is normalised so that the sum of the absolute values of its entries is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency tf with 1 + log(tf), so that repeated occurrences of a token contribute less than linearly.
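The effect of ngram_range, sublinear_tf and norm on a single sample can be sketched with the standard library alone. This is a hand-rolled illustration of the ideas in the list above, without the IDF factor, and is not scikit-learn's implementation; `ngrams` and `tf_vector` are hypothetical helper names.

```python
import math
from collections import Counter

def ngrams(tokens, ngram_range=(1, 1)):
    """All n-grams of consecutive tokens with sizes between the two bounds (inclusive)."""
    lo, hi = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]

def tf_vector(tokens, ngram_range=(1, 1), sublinear_tf=False, norm="l2"):
    """Term-frequency features for one sample, as a dict from n-gram to weight."""
    counts = Counter(ngrams(tokens, ngram_range))
    weights = {g: (1 + math.log(c) if sublinear_tf else float(c))
               for g, c in counts.items()}
    if norm == "l1":
        scale = sum(abs(w) for w in weights.values())
    elif norm == "l2":
        scale = math.sqrt(sum(w * w for w in weights.values()))
    else:
        scale = 1.0  # no normalisation
    return {g: w / scale for g, w in weights.items()} if scale else weights

# Hypothetical tokenized title
tokens = ["3683", "7242", "7", "3945", "7", "1815"]
print(ngrams(tokens, (1, 2)))
print(tf_vector(tokens, norm="l1"))
print(tf_vector(tokens, sublinear_tf=True, norm="l2"))
```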

We will introduce a function that fits a separate vectoriser to each of the channel names, video titles and descriptions, and then stacks the three resulting feature matrices side by side. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [20]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [21]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_trains = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_trains[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    if video_category_encoder is not None:
        # OneHotEncoder returns a sparse matrix by default, so it can be stacked directly
        X_category = video_category_encoder.transform(train[['video_category']])
        X_train = hstack([X_category, X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    else:
        X_train = hstack([X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    return X_train, vectorizers

We find that L$^2$ normalisation performs better than L$^1$, but there is no improvement from including the IDF factor or using log(TF) instead of TF.

## Including the video category

Next we can incorporate the video category.

In [37]:
%%time

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

ngram_range = (1,3)
for sublinear_tf in [False,True]:
    for use_idf in [False,True]:
        if use_idf == False and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF')
        elif use_idf == False and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)')
        elif use_idf == True and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF-IDF')
        elif use_idf == True and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)-IDF')

        params_fixed['ngram_range'].append(ngram_range)

        X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
CPU times: user 2min 39s, sys: 3.13 s, total: 2min 42s
Wall time: 2min 41s


In [None]:
mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-14 11:16:59,626] A new study created in memory with name: no-name-6f2c17f6-83a8-43b9-a2c4-bfb733b0e029


[I 2024-05-14 11:17:01,477] Trial 0 finished with value: 0.7973783764087002 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7973783764087002.
[I 2024-05-14 11:17:02,946] Trial 1 finished with value: 0.799470983024082 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-14 11:17:04,379] Trial 2 finished with value: 0.7991551675825793 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-14 11:17:05,794] Trial 3 finished with value: 0.7537901399454154 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-14 11:17:07,213] Trial 4 finished with value: 0.7977731885800425 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-14 11:17:08,629] Trial 5 finished with value: 0.7976942230279949 and parameters: {'alpha': 0.00039799342667825053}. Best is tr



[I 2024-05-14 11:19:27,996] Trial 0 finished with value: 0.7960361023239535 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7960361023239535.
[I 2024-05-14 11:19:29,420] Trial 1 finished with value: 0.7985233886050628 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7985233886050628.
[I 2024-05-14 11:19:30,830] Trial 2 finished with value: 0.7986024243071418 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-14 11:19:32,240] Trial 3 finished with value: 0.7892450492589623 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-14 11:19:33,648] Trial 4 finished with value: 0.7967862166100466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-14 11:19:35,055] Trial 5 finished with value: 0.7968256954888464 and parameters: {'alpha': 0.00039799342667825053}. Best 



[I 2024-05-14 11:21:52,624] Trial 0 finished with value: 0.8043667381287636 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-14 11:21:54,014] Trial 1 finished with value: 0.8037744224411509 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-14 11:21:55,481] Trial 2 finished with value: 0.8028269215555067 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-14 11:21:56,899] Trial 3 finished with value: 0.766503274252717 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-14 11:21:58,292] Trial 4 finished with value: 0.8047219700934827 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8047219700934827.
[I 2024-05-14 11:21:59,693] Trial 5 finished with value: 0.804643012335883 and parameters: {'alpha': 0.00039799342667825053}. Best is



[I 2024-05-14 11:24:21,011] Trial 0 finished with value: 0.798878721897605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798878721897605.
[I 2024-05-14 11:24:23,322] Trial 1 finished with value: 0.8001026762626713 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-14 11:24:25,608] Trial 2 finished with value: 0.7999052584853283 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-14 11:24:27,966] Trial 3 finished with value: 0.7911796312368737 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-14 11:24:30,308] Trial 4 finished with value: 0.8002210193656957 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8002210193656957.
[I 2024-05-14 11:24:32,701] Trial 5 finished with value: 0.8001815404868957 and parameters: {'alpha': 0.00039799342667825053}. Best is

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       0.535699  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)       0.541765  0.80754/0.82012/0.83083   
2         log(TF)      (1, 3)       0.763978  0.81030/0.82115/0.83281   
3     log(TF)-IDF      (1, 3)       0.505469  0.80853/0.82099/0.83281   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   
2  0.77851/0.79233/0.80597  0.73189/0.74083/0.75789  0.75553/0.76569/0.78119   
3  0.77996/0.79514/0.80794  0.72203/0.73584/0.75489  0.75204/0.76431/0.78051   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  
1  0.87564/0.88196/0.89345  0.145222  
2  0.87689/0.88348/0.89532  0.090669  
3  0.87435/0.88119/0.89339  0.166633  
CPU times: user 6min 23s, sys: 28.6 s,

[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.9s finished


In [35]:
pd.DataFrame(mnB_tune_category).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",0.63804,0.80991/0.81842/0.82965,0.76245/0.77930/0.79236,0.73829/0.75305/0.76892,0.75729/0.76590/0.78046,0.88022/0.88591/0.89647,0.065719
TF-IDF,"(1, 3)",0.643662,0.80754/0.82012/0.83083,0.77413/0.79016/0.80138,0.72942/0.74083/0.75840,0.75273/0.76467/0.77929,0.87564/0.88196/0.89345,0.145222
log(TF),"(1, 3)",0.643582,0.81030/0.82115/0.83281,0.77851/0.79233/0.80597,0.73189/0.74083/0.75789,0.75553/0.76569/0.78119,0.87689/0.88348/0.89532,0.090669
log(TF)-IDF,"(1, 3)",0.705773,0.80853/0.82099/0.83281,0.77996/0.79514/0.80794,0.72203/0.73584/0.75489,0.75204/0.76431/0.78051,0.87435/0.88119/0.89339,0.166633


## Dimensionality reduction

Our best performing models use the (1,3) n-gram range, which requires about 3.5 million features across the three text fields (39,389 + 360,589 + 3,108,791). We will now look at reducing the number of features by setting minimum and maximum document frequency filters that drop tokens from the vocabulary that are either too rare or too common. I'll show results for log(TF)-IDF with L$^2$ norm, as set by use_idf=True and sublinear_tf=True in the cell below.
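The document frequency filter itself is simple to sketch. The stdlib example below mirrors the convention described for min_df and max_df (integers as absolute document counts, floats as fractions of the corpus); it is an illustration on made-up tokens rather than scikit-learn's code, and `filter_vocabulary` is a hypothetical name.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Count each token once per document
    df = Counter(token for doc in docs for token in set(doc))
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs
    return {t for t, c in df.items() if min_count <= c <= max_count}

# Hypothetical tokenized documents: 'a' is in all four, 'c' and 'd' in one each
docs = [["a", "b"], ["a", "c"], ["a", "b", "d"], ["a"]]
print(sorted(filter_vocabulary(docs)))                         # no filtering
print(sorted(filter_vocabulary(docs, min_df=2)))               # drop rare tokens
print(sorted(filter_vocabulary(docs, min_df=2, max_df=0.9)))   # also drop ubiquitous tokens
```

Raising min_df removes the long tail of rare n-grams, which is where most of the features in the (1,3) range come from, while lowering max_df removes tokens that occur in nearly every document.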

In [None]:
%%time

X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:
    for max_df in [1.0, 0.9, 0.8, 0.7]:
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315920 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 2104 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 9402 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 151356 features in the (

[I 2024-05-14 11:36:52,356] A new study created in memory with name: no-name-bbe25eae-b885-4ea9-bae2-c50d9bb7a3ff


Fit tfidf vectorizer with 545 features in the (1, 3) ngram range.


[I 2024-05-14 11:36:54,845] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:36:56,138] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:36:56,840] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:36:57,549] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:36:58,236] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:36:58,951] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342


[I 2024-05-14 11:38:08,635] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:38:09,347] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:38:10,068] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:38:10,752] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:38:11,476] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:38:12,187] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i


[I 2024-05-14 11:39:21,722] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:39:22,445] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:39:23,157] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:39:23,858] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:39:24,572] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:39:25,270] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i


[I 2024-05-14 11:40:35,871] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:40:36,579] Trial 1 finished with value: 0.8070512161482254 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:40:37,276] Trial 2 finished with value: 0.8045638441281889 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:40:38,006] Trial 3 finished with value: 0.7877841904433054 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:40:38,772] Trial 4 finished with value: 0.8092621736610794 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-14 11:40:39,509] Trial 5 finished with value: 0.8093016603343273 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.145634  0.80513/0.81116/0.81998   
1       5    0.9       0.138543  0.80513/0.81116/0.81998   
2       5    0.8       0.132506  0.80513/0.81116/0.81998   
3       5    0.7       0.132934  0.80474/0.81104/0.82017   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   

                   roc_auc     alpha  
0  0.86727/0.87335/0.88590  0.004751  
1  0.86727/0.87335/0.88590  0.004751  
2  0.86727/0.87335/0.88590  0.004751  
3  0.86727/0.87335/0.88590  0.004581  


[I 2024-05-14 11:41:48,791] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:41:49,376] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:41:49,953] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:41:50,534] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:41:51,103] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:41:51,684] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-14 11:42:47,017] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:42:47,592] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:42:48,156] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:42:48,719] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:42:49,274] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:42:49,839] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.145634  0.80513/0.81116/0.81998   
1       5    0.9       0.138543  0.80513/0.81116/0.81998   
2       5    0.8       0.132506  0.80513/0.81116/0.81998   
3       5    0.7       0.132934  0.80474/0.81104/0.82017   
4      10    1.0       0.079082  0.79171/0.79615/0.80340   
5      10    0.9       0.076884  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   
5  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   

                  

[I 2024-05-14 11:43:45,085] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:43:45,648] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:43:46,214] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:43:46,778] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:43:47,346] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:43:47,908] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-14 11:44:42,958] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:44:43,529] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:44:44,092] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:44:44,654] Trial 3 finished with value: 0.7798087086587353 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:44:45,215] Trial 4 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-14 11:44:45,776] Trial 5 finished with value: 0.7950093552861361 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-14 11:45:40,698] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-14 11:45:41,185] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:45:41,683] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:45:42,163] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:45:42,643] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:45:43,125] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-14 11:46:29,682] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-14 11:46:30,161] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:46:30,652] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:46:31,140] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:46:31,619] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:46:32,095] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-14 11:47:18,454] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-14 11:47:18,932] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:47:19,424] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:47:19,896] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:47:20,385] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-14 11:47:20,863] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-14 11:48:07,374] Trial 0 finished with value: 0.783046631453949 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.783046631453949.
[I 2024-05-14 11:48:07,857] Trial 1 finished with value: 0.7834413968586037 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-14 11:48:08,336] Trial 2 finished with value: 0.7830860323882696 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-14 11:48:08,814] Trial 3 finished with value: 0.775189523950195 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-14 11:48:09,292] Trial 4 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-14 11:48:09,768] Trial 5 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-14 11:48:56,137] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:48:56,540] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:48:56,939] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:48:57,338] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:48:57,728] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:48:58,116] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:49:36,052] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:49:36,443] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:49:36,832] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:49:37,223] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:49:37,613] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:49:38,004] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:50:16,049] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:50:16,438] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-14 11:50:16,816] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:50:17,206] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:50:17,596] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-14 11:50:17,986] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:50:56,052] Trial 0 finished with value: 0.7599886434893561 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-14 11:50:56,447] Trial 1 finished with value: 0.7598701912640606 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-14 11:50:56,839] Trial 2 finished with value: 0.7599096857317564 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-14 11:50:57,229] Trial 3 finished with value: 0.7585278626181784 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-14 11:50:57,618] Trial 4 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-14 11:50:58,008] Trial 5 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:51:35,826] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:51:36,160] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:51:36,491] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:51:36,821] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:51:37,153] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:51:37,488] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-14 11:52:10,291] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:10,623] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:10,964] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:11,298] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:11,639] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:11,970] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-14 11:52:44,771] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:45,116] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:45,449] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:45,780] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:46,111] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-14 11:52:46,454] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-14 11:53:19,186] Trial 0 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-14 11:53:19,518] Trial 1 finished with value: 0.7442352379976218 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-14 11:53:19,849] Trial 2 finished with value: 0.7442352379976218 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-14 11:53:20,178] Trial 3 finished with value: 0.7437219813954321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-14 11:53:20,522] Trial 4 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-14 11:53:20,864] Trial 5 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:53:53,571] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:53:53,860] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:53:54,147] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:53:54,436] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:53:54,725] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:53:55,018] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:54:23,416] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:23,705] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:23,992] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:24,282] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:24,573] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:24,863] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:54:53,373] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:53,660] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:53,948] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:54,237] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:54,527] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-14 11:54:54,815] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:55:23,306] Trial 0 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-14 11:55:23,597] Trial 1 finished with value: 0.7325880879790517 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-14 11:55:23,885] Trial 2 finished with value: 0.7325880879790517 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-14 11:55:24,172] Trial 3 finished with value: 0.7322722569486529 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-14 11:55:24,460] Trial 4 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-14 11:55:24,750] Trial 5 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:55:53,081] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:55:53,330] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:55:53,574] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:55:53,823] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:55:54,070] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:55:54,325] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:56:18,196] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:18,441] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:18,686] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:18,929] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:56:19,174] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:56:19,421] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:56:43,274] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:43,521] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:43,766] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-14 11:56:44,012] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:56:44,255] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-14 11:56:44,502] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:57:08,609] Trial 0 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-14 11:57:08,854] Trial 1 finished with value: 0.7031347866601141 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-14 11:57:09,097] Trial 2 finished with value: 0.7031347866601141 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-14 11:57:09,355] Trial 3 finished with value: 0.7034110842505775 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-14 11:57:09,614] Trial 4 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-14 11:57:09,862] Trial 5 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:57:33,998] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:34,227] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:34,455] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:34,683] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:57:34,900] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:57:35,114] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:57:56,269] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:56,484] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:56,698] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:57:56,913] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:57:57,127] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:57:57,342] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:58:18,522] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:58:18,737] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:58:18,951] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-14 11:58:19,168] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:58:19,383] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-14 11:58:19,597] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-14 11:58:40,741] Trial 0 finished with value: 0.6800379667559 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-14 11:58:40,955] Trial 1 finished with value: 0.6800379667559 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-14 11:58:41,169] Trial 2 finished with value: 0.6800379667559 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-14 11:58:41,382] Trial 3 finished with value: 0.6797220655754699 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-14 11:58:41,594] Trial 4 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-14 11:58:41,812] Trial 5 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.68003

In [None]:
pd.DataFrame(mnB_tune_dim_reduction).style.hide()

min_df,max_df,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
5,1.0,0.145634,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.9,0.138543,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.8,0.132506,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.7,0.132934,0.80474/0.81104/0.82017,0.76636/0.77784/0.78799,0.71858/0.72917/0.74336,0.74307/0.75270/0.76502,0.86727/0.87335/0.88590,0.004581
10,1.0,0.079082,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.9,0.076884,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.8,0.092904,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.7,0.078404,0.79076/0.79608/0.80340,0.75719/0.76481/0.77370,0.68408/0.69764/0.70777,0.72245/0.72963/0.73927,0.85350/0.86081/0.87329,0.004618
20,1.0,0.066544,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06
20,0.9,0.066022,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06


We see that as min_df is raised and the vocabulary shrinks, the cross-validation scores degrade rapidly, whereas lowering max_df has almost no effect.
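
To make the link between min_df and vocabulary size concrete, here is a minimal sketch on a toy corpus (the sentences and the helper name `vocab_size` are invented for illustration and are not part of the YouTube dataset or the notebook's own helpers). It counts how many terms survive as min_df is raised.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not the YouTube data).
docs = [
    "science for kids broken bones",
    "science lecture on broken symmetry",
    "kids science experiment",
    "master of science in engineering",
]

def vocab_size(min_df):
    """Number of terms kept when terms appearing in fewer than `min_df` documents are dropped."""
    vectorizer = CountVectorizer(min_df=min_df)
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# Vocabulary size for increasingly strict document-frequency cut-offs.
sizes = {min_df: vocab_size(min_df) for min_df in (1, 2, 3)}
print(sizes)
```

Rare terms dominate the vocabulary even in this tiny example, which is why a modest increase in min_df removes so many features.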

All of these results suggest that predicting whether a video will be popular from its text metadata does not follow the usual rules of thumb for text classification. This is a fundamentally different challenge from, for example, determining whether a text message or email is spam. In our case, both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether or not a viewer likes a certain YouTube video or channel, and whether they share it on social media and contribute to its virality, is primarily a subjective judgement. This makes the classification problem significantly more difficult, which is reflected in the low cross-validation metrics we have seen so far.

## Further classification models

Now that we have understood the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF weighting schemes and the normalisation), we are ready to build some more models. Having already considered a Bayesian model, we can explore three linear methods:

* Support vector machine
* Logistic regression
* Perceptron
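
When fitted by stochastic gradient descent, these three methods differ only in the loss applied to the margin $y\,f(x)$, with labels $y \in \{-1, +1\}$ and linear score $f(x)$. The sketch below writes out the standard textbook forms of the three losses; the function names are illustrative and do not come from scikit-learn.

```python
import numpy as np

def hinge_loss(margin):
    """Linear SVM: the loss is zero once the margin y*f(x) reaches 1."""
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic regression: log(1 + exp(-margin)), smooth and never exactly zero."""
    return np.logaddexp(0.0, -margin)

def perceptron_loss(margin):
    """Perceptron: the loss is zero for any correctly classified point."""
    return np.maximum(0.0, -margin)

# Compare the losses for a misclassified point, a boundary point and two correct points.
margins = np.array([-1.0, 0.0, 0.5, 2.0])
for name, loss in [("hinge", hinge_loss), ("log", log_loss), ("perceptron", perceptron_loss)]:
    print(name, loss(margins))
```

The hinge loss still penalises correctly classified points that lie within the margin, whereas the perceptron loss ignores them.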

To avoid overfitting, we will employ regularisation via a combination of $L^1$ and $L^2$ penalty terms, known as *elastic net*. This introduces two hyperparameters, the overall penalty strength alpha and the mixing ratio l1_ratio between the two penalties, which we will again tune using Bayesian optimisation. We will implement the linear models using SGDClassifier from scikit-learn, which fits regularised linear models by stochastic gradient descent. Because the algorithm is randomised, a random state will be set for reproducibility.
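
According to the scikit-learn documentation for its SGD estimators, the elastic net penalty added to the loss has the form $\alpha\left[\frac{1-\rho}{2}\lVert w\rVert_2^2 + \rho\lVert w\rVert_1\right]$, where $\rho$ is l1_ratio. As a minimal sketch of how the mixing ratio behaves, the example below fits SGDClassifier on synthetic data (an assumption for illustration; the helper `fit_elasticnet` and the chosen alpha are not from the notebook) and counts the coefficients that are exactly zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for the text features (illustrative only).
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

def fit_elasticnet(l1_ratio, alpha=1e-3):
    """Fit a hinge-loss SGDClassifier with an elastic net penalty."""
    clf = SGDClassifier(loss='hinge', penalty='elasticnet', alpha=alpha,
                        l1_ratio=l1_ratio, random_state=524)
    return clf.fit(X, y)

# l1_ratio=0 is a pure L2 penalty, l1_ratio=1 a pure L1 penalty.
for l1_ratio in (0.0, 0.5, 1.0):
    clf = fit_elasticnet(l1_ratio)
    n_zero = int(np.sum(clf.coef_ == 0))
    print(f"l1_ratio={l1_ratio}: {n_zero} of {clf.coef_.size} coefficients are exactly zero")
```

A larger l1_ratio typically drives more coefficients to zero, which matters when the feature space contains millions of n-grams.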

In [None]:
from sklearn.linear_model import SGDClassifier
params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

for ngram_range in [(1,3)]:
    for sublinear_tf in [False,True]:
        for use_idf in [False,True]:
            # Label the vectoriser according to its (sublinear_tf, use_idf) combination.
            if not use_idf and not sublinear_tf:
                params_fixed['vectorizer_type'].append('TF')
            elif not use_idf and sublinear_tf:
                params_fixed['vectorizer_type'].append('log(TF)')
            elif use_idf and not sublinear_tf:
                params_fixed['vectorizer_type'].append('TF-IDF')
            else:
                params_fixed['vectorizer_type'].append('log(TF)-IDF')

            params_fixed['ngram_range'].append(ngram_range)

            X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
            X_trains.append(X_train)
            vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.


### Support vector machine

In [None]:
%%time

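def get_params_SVM(trial=None, best=None):
    # During tuning, sample the hyperparameters from the trial;
    # afterwards, read them back from the best parameters found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('Pass either trial or best.')
    return {'random_state': 524, 'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}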

SVM_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-14 04:47:05,663] A new study created in memory with name: no-name-8e20f773-6de4-4249-9774-52b63583d5fe
[I 2024-05-14 04:47:12,543] Trial 0 finished with value: 0.7120182436848408 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7120182436848408.
[I 2024-05-14 04:47:27,518] Trial 1 finished with value: 0.8138421445020498 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8138421445020498.
[I 2024-05-14 04:47:34,271] Trial 2 finished with value: 0.7548956342907384 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8138421445020498.
[I 2024-05-14 04:47:51,867] Trial 3 finished with value: 0.8170005951061 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8170005951061.
[I 2024-05-14 04:47:55,527] Trial 4 finished with


[I 2024-05-14 05:07:44,139] Trial 0 finished with value: 0.647504814045907 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.647504814045907.
[I 2024-05-14 05:07:59,291] Trial 1 finished with value: 0.8164478206528708 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8164478206528708.
[I 2024-05-14 05:08:06,280] Trial 2 finished with value: 0.7255998977368432 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8164478206528708.
[I 2024-05-14 05:08:23,330] Trial 3 finished with value: 0.8196063492014003 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8196063492014003.
[I 2024-05-14 05:08:27,661] Trial 4 finished with value: 0.6062462601264493 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 wit


[I 2024-05-14 05:26:50,778] Trial 0 finished with value: 0.7132815756008837 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7132815756008837.
[I 2024-05-14 05:27:07,337] Trial 1 finished with value: 0.81557907486918 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.81557907486918.
[I 2024-05-14 05:27:14,247] Trial 2 finished with value: 0.7553692561251695 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.81557907486918.
[I 2024-05-14 05:27:31,844] Trial 3 finished with value: 0.8180270927215835 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8180270927215835.
[I 2024-05-14 05:27:35,546] Trial 4 finished with value: 0.6069964367681258 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 with va


[I 2024-05-14 05:47:28,260] Trial 0 finished with value: 0.6347126128587346 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6347126128587346.
[I 2024-05-14 05:47:41,372] Trial 1 finished with value: 0.819172214040217 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-14 05:47:47,641] Trial 2 finished with value: 0.7123737094829982 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-14 05:48:03,498] Trial 3 finished with value: 0.8188561024096925 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-14 05:48:07,701] Trial 4 finished with value: 0.6062462601264493 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)      13.005414  0.81642/0.82581/0.83794   
1          TF-IDF      (1, 3)      12.118849  0.81560/0.82454/0.83675   
2         log(TF)      (1, 3)      12.537444  0.81935/0.82462/0.83656   
3     log(TF)-IDF      (1, 3)       9.852694  0.81721/0.82691/0.83952   

                 precision                   recall                       f1  \
0  0.78575/0.80054/0.81458  0.73047/0.74355/0.76190  0.75710/0.77099/0.78736   
1  0.80220/0.81192/0.82589  0.70084/0.72261/0.74185  0.75278/0.76462/0.78162   
2  0.79346/0.80683/0.82149  0.71547/0.73020/0.74737  0.75245/0.76656/0.78268   
3  0.79318/0.80974/0.82437  0.71562/0.73369/0.75288  0.76023/0.76978/0.78701   

      alpha  l1_ratio  
0  0.000042  0.074738  
1  0.000042  0.175647  
2  0.000037  0.249480  
3  0.000068  0.003152  
CPU times: user 4min 25s, sys: 34.7 s, total: 5min
Wall time: 1h 17min 53s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   11.2s finished


In [None]:
pd.DataFrame(SVM_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",13.005414,0.81642/0.82581/0.83794,0.78575/0.80054/0.81458,0.73047/0.74355/0.76190,0.75710/0.77099/0.78736,4.2e-05,0.074738
TF-IDF,"(1, 3)",12.118849,0.81560/0.82454/0.83675,0.80220/0.81192/0.82589,0.70084/0.72261/0.74185,0.75278/0.76462/0.78162,4.2e-05,0.175647
log(TF),"(1, 3)",12.537444,0.81935/0.82462/0.83656,0.79346/0.80683/0.82149,0.71547/0.73020/0.74737,0.75245/0.76656/0.78268,3.7e-05,0.24948
log(TF)-IDF,"(1, 3)",9.852694,0.81721/0.82691/0.83952,0.79318/0.80974/0.82437,0.71562/0.73369/0.75288,0.76023/0.76978/0.78701,6.8e-05,0.003152


### Logistic regression

In [None]:
%%time

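def get_params_log_reg(trial=None, best=None):
    # During tuning, sample the hyperparameters from the trial;
    # afterwards, read them back from the best parameters found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('Pass either trial or best.')
    return {'random_state': 524, 'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}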

log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-14 06:06:34,369] A new study created in memory with name: no-name-00a494cd-f5f3-4d76-964a-b075898717e8
[I 2024-05-14 06:06:38,569] Trial 0 finished with value: 0.6905005867270685 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6905005867270685.
[I 2024-05-14 06:06:54,920] Trial 1 finished with value: 0.8146713646402531 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8146713646402531.
[I 2024-05-14 06:06:59,433] Trial 2 finished with value: 0.729627117507928 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8146713646402531.
[I 2024-05-14 06:07:07,687] Trial 3 finished with value: 0.8077617580221432 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8146713646402531.
[I 2024-05-14 06:07:11,303] Trial 4 finished with value: 0.646


[I 2024-05-14 06:21:41,746] Trial 0 finished with value: 0.6489656338893244 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6489656338893244.
[I 2024-05-14 06:21:55,529] Trial 1 finished with value: 0.819646108680326 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.819646108680326.
[I 2024-05-14 06:22:00,446] Trial 2 finished with value: 0.7054643600816702 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.819646108680326.
[I 2024-05-14 06:22:04,214] Trial 3 finished with value: 0.8054721077629194 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.819646108680326.
[I 2024-05-14 06:22:07,890] Trial 4 finished with value: 0.6061673023688496 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with 


[I 2024-05-14 06:36:18,126] Trial 0 finished with value: 0.6920404344781169 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6920404344781169.
[I 2024-05-14 06:36:33,953] Trial 1 finished with value: 0.8161322234559101 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8161322234559101.
[I 2024-05-14 06:36:38,547] Trial 2 finished with value: 0.7335751223046312 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8161322234559101.
[I 2024-05-14 06:36:45,550] Trial 3 finished with value: 0.8102097992578905 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8161322234559101.
[I 2024-05-14 06:36:49,247] Trial 4 finished with value: 0.6102338373353277 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w


[I 2024-05-14 06:49:15,027] Trial 0 finished with value: 0.6295009877514147 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6295009877514147.
[I 2024-05-14 06:49:29,702] Trial 1 finished with value: 0.8186193226702688 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8186193226702688.
[I 2024-05-14 06:49:35,104] Trial 2 finished with value: 0.7019110817173819 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8186193226702688.
[I 2024-05-14 06:49:38,795] Trial 3 finished with value: 0.8067353071733473 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8186193226702688.
[I 2024-05-14 06:49:42,435] Trial 4 finished with value: 0.6054961614292523 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       5.177744  0.81362/0.82079/0.83162   
1          TF-IDF      (1, 3)       4.311730  0.81544/0.82221/0.83379   
2         log(TF)      (1, 3)       4.256745  0.81800/0.82427/0.83458   
3     log(TF)-IDF      (1, 3)       4.554346  0.81520/0.82509/0.83597   

                 precision                   recall                       f1  \
0  0.77549/0.79471/0.81582  0.70823/0.73630/0.76356  0.74678/0.76410/0.77710   
1  0.78369/0.80046/0.81077  0.71443/0.73188/0.75439  0.75279/0.76450/0.78141   
2  0.79760/0.80708/0.81560  0.70150/0.72875/0.76108  0.75021/0.76570/0.78109   
3  0.78272/0.80596/0.82155  0.71288/0.73355/0.74614  0.75860/0.76790/0.78160   

                   roc_auc     alpha  l1_ratio  
0  0.88667/0.89355/0.90262  0.000003  0.006446  
1  0.88743/0.89243/0.90113  0.000002  0.314917  
2  0.89100/0.89644/0.90670  0.000019  0.007205  
3  0.89142/0.89659/0.90561  0.000002 

[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.4s finished


In [None]:
pd.DataFrame(log_reg_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha,l1_ratio
TF,"(1, 3)",5.177744,0.81362/0.82079/0.83162,0.77549/0.79471/0.81582,0.70823/0.73630/0.76356,0.74678/0.76410/0.77710,0.88667/0.89355/0.90262,3e-06,0.006446
TF-IDF,"(1, 3)",4.31173,0.81544/0.82221/0.83379,0.78369/0.80046/0.81077,0.71443/0.73188/0.75439,0.75279/0.76450/0.78141,0.88743/0.89243/0.90113,2e-06,0.314917
log(TF),"(1, 3)",4.256745,0.81800/0.82427/0.83458,0.79760/0.80708/0.81560,0.70150/0.72875/0.76108,0.75021/0.76570/0.78109,0.89100/0.89644/0.90670,1.9e-05,0.007205
log(TF)-IDF,"(1, 3)",4.554346,0.81520/0.82509/0.83597,0.78272/0.80596/0.82155,0.71288/0.73355/0.74614,0.75860/0.76790/0.78160,0.89142/0.89659/0.90561,2e-06,0.090095


### Perceptron

In [None]:
%%time

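def get_params_perceptron(trial=None, best=None):
    # During tuning, sample the hyperparameters from the trial;
    # afterwards, read them back from the best parameters found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('Pass either trial or best.')
    return {'random_state': 524, 'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}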

perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-14 07:02:27,819] A new study created in memory with name: no-name-0c4a5da5-0842-4de1-94f6-4c0da3d7ce3d
[I 2024-05-14 07:02:31,240] Trial 0 finished with value: 0.6257101735862527 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6257101735862527.
[I 2024-05-14 07:02:47,139] Trial 1 finished with value: 0.8138026734176979 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-14 07:02:50,703] Trial 2 finished with value: 0.6460426067907569 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-14 07:02:57,755] Trial 3 finished with value: 0.7338138272726954 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-14 07:03:01,119] Trial 4 finished with value: 0.57


[I 2024-05-14 07:23:34,224] Trial 0 finished with value: 0.5549243022702498 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5549243022702498.
[I 2024-05-14 07:23:47,621] Trial 1 finished with value: 0.8171979972945472 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-14 07:23:50,979] Trial 2 finished with value: 0.5918403509995795 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-14 07:23:56,331] Trial 3 finished with value: 0.7363391192819443 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-14 07:23:59,621] Trial 4 finished with value: 0.5531869899751708 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w


[I 2024-05-14 07:40:55,380] Trial 0 finished with value: 0.5930998558416855 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5930998558416855.
[I 2024-05-14 07:41:13,623] Trial 1 finished with value: 0.8170402532572023 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-14 07:41:17,134] Trial 2 finished with value: 0.6524803375307735 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-14 07:41:23,608] Trial 3 finished with value: 0.7547770417653801 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-14 07:41:26,946] Trial 4 finished with value: 0.6169461586817094 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w


[I 2024-05-14 08:00:15,317] Trial 0 finished with value: 0.6328956335112935 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6328956335112935.
[I 2024-05-14 08:00:27,455] Trial 1 finished with value: 0.8194088924518177 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-14 08:00:30,784] Trial 2 finished with value: 0.572573488978066 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-14 08:00:36,350] Trial 3 finished with value: 0.7503941613349395 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-14 08:00:39,736] Trial 4 finished with value: 0.5285520994540379 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 wi

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)      10.178176  0.80596/0.81637/0.82945   
1          TF-IDF      (1, 3)       8.851195  0.80951/0.81894/0.83143   
2         log(TF)      (1, 3)       9.799220  0.81030/0.81976/0.83241   
3     log(TF)-IDF      (1, 3)       9.066259  0.81169/0.82229/0.83518   

                 precision                   recall                       f1  \
0  0.78098/0.79122/0.80391  0.70681/0.72605/0.75739  0.74448/0.75713/0.77766   
1  0.78966/0.80185/0.82105  0.69529/0.71849/0.73928  0.73948/0.75780/0.77359   
2  0.77210/0.79302/0.81940  0.71495/0.73534/0.75361  0.74763/0.76293/0.77593   
3  0.78571/0.80540/0.82843  0.70926/0.72505/0.73484  0.75268/0.76301/0.77798   

          alpha  l1_ratio  
0  3.938711e-07  0.914810  
1  1.824530e-07  0.957513  
2  1.150331e-07  0.810002  
3  1.824397e-07  0.944276  
CPU times: user 4min 22s, sys: 33.8 s, total: 4min 55s
Wall time: 1h 14min 48s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   11.2s finished


In [None]:
pd.DataFrame(perceptron_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",10.178176,0.80596/0.81637/0.82945,0.78098/0.79122/0.80391,0.70681/0.72605/0.75739,0.74448/0.75713/0.77766,3.938711e-07,0.91481
TF-IDF,"(1, 3)",8.851195,0.80951/0.81894/0.83143,0.78966/0.80185/0.82105,0.69529/0.71849/0.73928,0.73948/0.75780/0.77359,1.824530e-07,0.957513
log(TF),"(1, 3)",9.79922,0.81030/0.81976/0.83241,0.77210/0.79302/0.81940,0.71495/0.73534/0.75361,0.74763/0.76293/0.77593,1.150331e-07,0.810002
log(TF)-IDF,"(1, 3)",9.066259,0.81169/0.82229/0.83518,0.78571/0.80540/0.82843,0.70926/0.72505/0.73484,0.75268/0.76301/0.77798,1.824397e-07,0.944276


## Training the final models

Now that we've obtained the optimal hyperparameters we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [None]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}

for n in range(len(X_trains)):
    # Rebuild each classifier with its tuned hyperparameters. The SGD models reuse
    # the random_state from tuning so that the final fits are reproducible.
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune_category[n]['alpha']))
    svm_clfs.append(SGDClassifier(random_state=524, loss='hinge', penalty='elasticnet', alpha=SVM_tune[n]['alpha'], l1_ratio=SVM_tune[n]['l1_ratio']))
    logreg_clfs.append(SGDClassifier(random_state=524, loss='log_loss', penalty='elasticnet', alpha=log_reg_tune[n]['alpha'], l1_ratio=log_reg_tune[n]['l1_ratio']))
    perceptron_clfs.append(SGDClassifier(random_state=524, loss='perceptron', penalty='elasticnet', alpha=perceptron_tune[n]['alpha'], l1_ratio=perceptron_tune[n]['l1_ratio']))

    # Fit every model on the feature matrix of the matching vectoriser.
    for model in [mnB_clfs[-1], svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    # Key each model by its output path, e.g. "models/svm_TF_(1, 3)".
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    models[f"models/mnB_{suffix}"] = mnB_clfs[-1]
    models[f"models/svm_{suffix}"] = svm_clfs[-1]
    models[f"models/logreg_{suffix}"] = logreg_clfs[-1]
    models[f"models/perceptron_{suffix}"] = perceptron_clfs[-1]

In [None]:
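import os
import joblib

# joblib.dump does not create missing directories, so make sure the target exists.
os.makedirs('models', exist_ok=True)

for model_name, model in models.items():
    joblib.dump(model, model_name + '.joblib')

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')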

['models/video_category_encoder.joblib']

In [40]:
import os
os.makedirs('vectorizers', exist_ok=True)  # joblib.dump needs the directory to exist

for n in range(len(vectorizers)):
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    # Save one vectoriser per text field, e.g. "vectorizers/video_title_TF_(1, 3).joblib".
    for field in ('channel_title', 'video_title', 'video_description'):
        joblib.dump(vectorizers[n][field], f"vectorizers/{field}_{suffix}.joblib")
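
For completeness, a saved estimator can be restored with joblib.load. The sketch below round-trips a small classifier through a temporary directory; the synthetic data and the file name are assumptions for illustration and are unrelated to the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Small synthetic problem standing in for the real features (illustrative only).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = SGDClassifier(random_state=524).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_model.joblib')  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(clf.predict(X), restored.predict(X)))
print(same_predictions)
```

Each model must be loaded together with the vectorisers of the same weighting scheme, since the features it was trained on depend on them.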

## Probability calibration

We can see from the cross-validation scores that the models are quite far from being perfectly accurate. We would like to model the probability $P(y\in \mathcal{C} \mid s)$ that a data point $y$ belongs to class $\mathcal{C}$ given a model's output $s$, which is not the same as the probability reported by the model itself. (Some models, such as the SVM and the perceptron, report no probabilities at all.) We can do this using a probability calibrator, which treats the output of each model as a feature that is then used to model the true probability. This requires validation data, so we'll again use a five-fold cross-validation split.

In [None]:
from sklearn.calibration import CalibratedClassifierCV

calibrated_clfs = {}

for model_name in models:
    calibrated_clfs[model_name] = CalibratedClassifierCV(models[model_name], cv = KFold(n_splits=5, random_state=42, shuffle=True))
    calibrated_clfs[model_name].fit(X_train, y_train)
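
To make the effect of calibration concrete, here is a minimal sketch on synthetic data (the `make_classification` data and the names `X_demo`, `y_demo`, `base_clf` are illustrative assumptions, not the YouTube features). It wraps a perceptron, which exposes only a decision score, and shows that the calibrated wrapper provides `predict_proba`:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the notebook's X_train / y_train)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# A perceptron offers decision_function but no predict_proba
base_clf = Perceptron(random_state=42)

# Same pattern as the cell above: five shuffled folds, default sigmoid calibrator
calibrated = CalibratedClassifierCV(
    base_clf,
    method="sigmoid",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
calibrated.fit(X_demo, y_demo)

# One (classifier, calibrator) pair is kept per fold and their outputs are averaged
n_pairs = len(calibrated.calibrated_classifiers_)

# Calibrated class probabilities for the first five samples
proba = calibrated.predict_proba(X_demo[:5])
print(n_pairs, proba.round(3))
```

With the default `ensemble=True`, the wrapper refits the base model on each training fold and fits the calibrator on the held-out fold, so the calibrator never sees scores for data the model was trained on.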

We'll save the calibrated models for evaluation in the next notebook:

In [None]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], model_name+'_calibrated.joblib')
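
As a quick sanity check on this save step, the sketch below round-trips a small model through `joblib` and confirms the reloaded copy predicts identically. The temporary directory, the file name `demo_calibrated.joblib` and the synthetic data are assumptions made for illustration only. Note that `joblib.dump` does not create missing directories, so a `models/` folder has to exist before the cells above are run.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small stand-in model (assumption: any fitted estimator behaves the same way)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "models")
    os.makedirs(model_dir, exist_ok=True)  # create the target folder first

    path = os.path.join(model_dir, "demo_calibrated.joblib")  # hypothetical file name
    joblib.dump(clf, path)
    reloaded = joblib.load(path)

# The reloaded estimator should reproduce the original predictions exactly
same_predictions = np.array_equal(clf.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```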

## Stacking

Now that we have our sixteen models, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a single meta-classifier gathers the predictions of the individual models and uses them as features to produce a final prediction. The meta-classifier is trained on out-of-fold predictions obtained by cross-validation, and we have to choose which model to use for it. We will compare two choices: logistic regression and Gaussian naive Bayes.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
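
The truncated warning above does not say which `LogisticRegression` failed to converge (a refitted base model or the final estimator). If it is the final estimator, one possible remedy, following the warning's own suggestions, is to scale its inputs and raise `max_iter`. The sketch below shows this on synthetic data; the value `max_iter=1000` and the names `X_demo`, `y_demo` are assumptions, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stacked base-model outputs (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

max_iter = 1000  # illustrative value; the scikit-learn default is 100

# Standardise the features, then fit logistic regression with a larger budget
final_estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
final_estimator.fit(X_demo, y_demo)

# n_iter_ records how many solver iterations were actually used
n_iter_used = int(final_estimator[-1].n_iter_[0])
converged = n_iter_used < max_iter
print(n_iter_used, converged)
```

A pipeline like this can be passed as `final_estimator` to `StackingClassifier` in place of the bare `LogisticRegression()`.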


In [None]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)
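
To show roughly what `StackingClassifier` does internally, the sketch below performs the same steps by hand on synthetic data. The two base models, the use of `decision_function` for both, and the names `meta_train` and `meta_test` are illustrative assumptions; the real class picks a suitable output method per model automatically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the notebook's features (assumption)
X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

base_models = {
    "logreg": LogisticRegression(max_iter=1000),
    "perceptron": Perceptron(random_state=42),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Step 1: out-of-fold scores, so each training row is scored by a model
# that never saw it during fitting
meta_train = np.column_stack([
    cross_val_predict(model, X_tr, y_tr, cv=cv, method="decision_function")
    for model in base_models.values()
])

# Step 2: the meta-classifier learns from those scores
meta_clf = GaussianNB().fit(meta_train, y_tr)

# Step 3: refit each base model on all training data and stack its test scores
meta_test = np.column_stack([
    model.fit(X_tr, y_tr).decision_function(X_te)
    for model in base_models.values()
])
stacked_pred = meta_clf.predict(meta_test)
print(meta_train.shape, stacked_pred[:10])
```

Each column of `meta_train` corresponds to one base model, so with the sixteen models above the meta-classifier would see sixteen features per video.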

In [None]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')

['models/stacking_gnb.joblib']

We've built a total of 34 classical ML models: sixteen base models that combine four classification approaches (Bayesian and linear) with four text vectorisation methods (TF, log(TF), TF-IDF and log(TF)-IDF), a probability-calibrated version of each, and two stacking classifiers. Calibration and stacking are intended to improve on the base models; in the next notebook we'll compare the performance of all of these models on the test data.