# YouTube popularity predictor (Part 2): text frequency-based models

In the previous notebook, we used natural language processing (NLP) to explore the YouTube video dataset and looked for correlations between the language features of the video titles and descriptions and the video popularity. We encoded popularity as a binary label: class 1 for videos with over 100k views and class 0 for videos with under 100k views. We found that the frequencies of the tokens in the byte-pair encoded text had predictive value for this classification. In this notebook we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')

In [2]:
videos = pd.read_csv('YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000,0
31659,Gegee gegee,22,28 March 2024,,1,0.477121,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364,0


In [3]:
videos.groupby('video_category').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,label,label,label,label,label,label,label,label
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
video_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,306.0,42.222222,23.894775,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.408497,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.469274,22.774244,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.413408,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.383673,22.988052,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.559184,0.497501,0.0,0.0,1.0,1.0,1.0
15,41.0,31.414634,23.295896,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.463415,0.504854,0.0,0.0,0.0,1.0,1.0
17,487.0,51.026694,19.407725,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2423,0.428915,0.0,0.0,0.0,0.0,1.0
19,111.0,38.666667,22.117181,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.306306,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.140962,18.759672,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.461028,0.498893,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.805694,22.613711,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.236495,0.424966,0.0,0.0,0.0,0.0,1.0
23,220.0,27.186364,24.877541,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.654545,0.476601,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.514472,22.926467,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.499927,0.0,0.0,1.0,1.0,1.0


In [4]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [5]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [6]:
videos.reset_index(drop=True, inplace=True)

We can see that there are videos with titles and descriptions containing non-Latin characters. We won't filter the videos by language or alphabet, so the non-Latin characters will become part of the features. Let's look at the distribution of video view counts:

In [7]:
videos[['months','video_view_count','label']].groupby('label').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,19168.0,40.830238,21.008292,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.549668,21.267577,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.68341,4.699005,5.037442,5.433327,5.977578,8.588679


The classes are split roughly 60/40 (19,168 videos in class 0 and 12,493 in class 1). They aren't exactly balanced because the classification is based on a milestone of 100k views. Choosing the threshold that exactly balances the data would give a far less striking cut-off.

We'll hold out a test set with an 80/20 train/test split, stratified by video category, and use the same split for all subsequent model building and validation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
# X_test.index and X_train.index hold index labels, so select rows with .loc
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]

In [9]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255,1


In [10]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [11]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We can first set all the NA values to empty strings:

In [12]:
train = train.fillna('')
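
For intuition about what byte-pair encoding learns, here is a minimal pure-Python sketch of the merge loop on a toy corpus. The helper names `most_frequent_pair` and `merge_pair` and the toy words are illustrative assumptions; this is not the implementation used by the `tokenizers` library, which adds pre-tokenisation, special tokens and many optimisations.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (hypothetical): word -> frequency, each word split into characters.
toy_words = {tuple("science"): 5, tuple("scientist"): 2, tuple("fiction"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(toy_words)
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(merges)           # the learned merge rules, in order
print(list(toy_words))  # the words re-segmented with the merged symbols
```

Each merge adds one new symbol to the vocabulary, so frequent character sequences end up as single tokens while rare words remain split into smaller pieces.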

In [13]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}
#training_data_cased = {field: train[field].tolist() for field in ['channel_title', 'video_title', 'video_description']}



In [14]:
%%time
BPE_tokenizers_uncased = {}
#BPE_tokenizers_cased = {}
for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")
#    BPE_tokenizers_cased[field] = build_tokenizer(training_data_cased[field], save=f"tokenizers/BPE_tokenizer_{field}_cased.json")

CPU times: user 39.1 s, sys: 143 ms, total: 39.3 s
Wall time: 39.3 s


In [15]:
from transformers import PreTrainedTokenizerFast

#tokenizers_trained_cased = {}
tokenizers_trained_uncased = {}

for field in training_data_uncased:
#    tokenizers_trained_cased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_cased.json")
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

In [16]:
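def tokenize(text, field, cased=False):
    # Only the uncased tokenizers are trained in this notebook, so fail loudly
    # for cased=True instead of silently returning None.
    if cased:
        raise NotImplementedError("cased tokenizers are not trained in this notebook")
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=False):
    if cased:
        raise NotImplementedError("cased tokenizers are not trained in this notebook")
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])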

In [17]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|█████████████████████████████████████████████████████████████████████████████████| 25328/25328 [00:00<00:00, 58445.75it/s]
processing rows: 100%|█████████████████████████████████████████████████████████████████████████████████| 25328/25328 [00:00<00:00, 31879.51it/s]
processing rows: 100%|██████████████████████████████████████████████████████████████████████████████████| 25328/25328 [00:08<00:00, 2940.36it/s]


In [18]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [19]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenized text. The exploratory data analysis (EDA) in the previous notebook suggests that a text-frequency based analysis could be a powerful tool for language-based prediction. We can use TfidfVectorizer() from scikit-learn, which counts the tokens in each text and generates a vector of weighted token frequencies. Rather than using only the token counts within an individual sample (the *term frequency*), TfidfVectorizer also accounts for how many samples in the entire training corpus contain each token (the *document frequency*). With the default settings (use_idf=True, smooth_idf=True), the term frequency of each token $i$ is multiplied by a weight $\text{IDF}_i = \ln\left(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}\right) + 1$, which is larger for tokens that appear in fewer samples and so describes the specificity of the token to the sample.

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: we can discard tokens that appear in fewer than min_df or in more than max_df documents (each given either as an absolute count or as a fraction of the documents), which allows for dimensionality reduction.
* use_idf: this allows the incorporation of the IDF factor into the vector representation of the text: without it, the text is represented as a set of numbers corresponding to the frequency of each token or n-gram appearing in the text, with a normalisation factor. With use_idf, this frequency is multiplied by a factor (the IDF) that down-weights tokens that appear in a large number of documents.
* norm: with 'l1', the vector of input features is normalised so that the sum of the features is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency $tf$ with $1 + \log(tf)$, so that repeated occurrences of a token within a sample count less than linearly.
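
To make the weighting concrete, the following is a small hand computation of TF-IDF for pre-tokenised documents, written in plain Python. It is a sketch that assumes scikit-learn's documented default of smoothed IDF, $\ln\left(\frac{1+N}{1+df}\right) + 1$, followed by 'l2' normalisation; the function name `tfidf_l2` and the toy token ids are hypothetical and only for illustration.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Smoothed TF-IDF with L2 normalisation for pre-tokenised documents."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return idf, vectors

# Toy documents (hypothetical token ids stored as strings, as in this notebook).
toy_docs = [["1815", "7", "2062"], ["1815", "30", "1815"], ["7", "1897"]]
idf, vectors = tfidf_l2(toy_docs)

print(idf["1815"], idf["2062"])  # the token found in two documents gets the smaller IDF
print(vectors[1])
```

Tokens that occur in only one document receive the largest weight, and after 'l2' normalisation the squared weights of every non-empty document sum to one.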

We will introduce a function that fits a separate vectoriser to each of the channel names, video titles and video descriptions, and then stacks the three resulting feature matrices side by side. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [20]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [21]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_trains = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_trains[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    if video_category_encoder is not None:
        X_category = video_category_encoder.transform(train[['video_category']]).toarray()
        X_train = hstack([X_category, X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    else:
        X_train = hstack([X_trains['channel_title'], X_trains['video_title'], X_trains['video_description']])
    return X_train, vectorizers

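Let's look at the number of features for each n-gram range. For each range, the three lines of output correspond to the channel title, video title and video description vectorisers, in that order: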

In [22]:
for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
    _,_ = get_features(ngram_range=ngram_range)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

## Multinomial naive Bayes

We see that for the higher n-gram ranges we have millions of features (about 8.7 million for the video descriptions alone with the (1,5) range), which is up to two orders of magnitude larger than the training sample size of 25,328 videos.
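
The growth of the vocabulary with the n-gram range can be reproduced on a toy example. The sketch below counts the distinct n-grams in a handful of token lists; `count_ngrams` and the toy documents are illustrative assumptions and are not part of the pipeline above.

```python
def count_ngrams(docs, ngram_range):
    """Count the distinct n-grams of consecutive tokens for n in [lo, hi]."""
    lo, hi = ngram_range
    vocab = set()
    for tokens in docs:
        for n in range(lo, hi + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(tuple(tokens[i:i + n]))
    return len(vocab)

# Toy tokenised documents (hypothetical).
toy_docs = [["a", "b", "c", "d"], ["a", "b", "d"], ["c", "d", "a", "b"]]
sizes = {r: count_ngrams(toy_docs, r) for r in [(1, 1), (1, 2), (1, 3)]}
print(sizes)
```

Every additional n adds all the new token sequences of that length, so the feature count keeps growing with the range while the number of samples stays fixed.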

As a baseline for exploring different approaches we'll use multinomial naive Bayes, which is known to perform well for text classification tasks with the tf-idf approach despite the large vocabularies. It has two main advantages: for the number of features we are considering it is comparatively fast, and it requires tuning only one hyperparameter, the additive (Laplace) smoothing parameter alpha, which can be fixed by cross-validation to minimise overfitting.
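
As a rough illustration of where the smoothing parameter enters, here is a minimal pure-Python sketch of multinomial naive Bayes with additive smoothing. It is not scikit-learn's MultinomialNB implementation; the functions `fit_mnb` and `predict_mnb` and the toy data are assumptions made for this example.

```python
import math

def fit_mnb(X, y, alpha):
    """Return log class priors and smoothed per-class log feature likelihoods."""
    n_features = len(X[0])
    log_prior, log_lik = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]  # feature totals within class c
        denom = sum(totals) + alpha * n_features
        # alpha > 0 keeps features unseen in class c from getting zero probability.
        log_lik[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_lik

def predict_mnb(x, log_prior, log_lik):
    """Pick the class with the largest log posterior (up to a constant)."""
    scores = {c: log_prior[c] + sum(v * w for v, w in zip(x, log_lik[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy count features (hypothetical): rows are samples, columns are tokens.
X_toy = [[2, 1, 0], [1, 2, 0], [0, 1, 2], [0, 0, 3]]
y_toy = [1, 1, 0, 0]
log_prior, log_lik = fit_mnb(X_toy, y_toy, alpha=0.1)

pred_a = predict_mnb([1, 1, 0], log_prior, log_lik)
pred_b = predict_mnb([0, 0, 2], log_prior, log_lik)
print(pred_a, pred_b)
```

A larger alpha pulls all the likelihoods towards a uniform distribution, which matters a great deal when most n-grams occur in only a handful of training samples.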

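We'll vary the vectoriser settings use_idf = [True, False], norm = ['l1', 'l2'] and sublinear_tf = [True, False]. For each combination, the smoothing parameter alpha is tuned with Optuna rather than with an exhaustive grid search, and the tuned model is scored with 5-fold cross-validation. The cells below show this for the n-gram range (1,3); the ranges (1,1) and (1,2) can be covered by extending the ngram_range list in the innermost loop.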

In [23]:
from sklearn.metrics import *

import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.simplefilter("ignore", UndefinedMetricWarning)

In [24]:
from sklearn.model_selection import cross_validate, KFold
import optuna

max_trials=100

def objective(trial, X_train, y_train, estimator, get_params, scoring):
    np.random.seed(524)
    params = get_params(trial)
    model = estimator(**params)
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=0)
    return np.mean(scores['test_score'])

def report_optuna_results(X_train, y_train, estimator, get_params, scoring):
    sampler = optuna.samplers.TPESampler(seed=524)
    study = optuna.create_study(sampler=sampler, direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
    return study.best_params

In [25]:
def report_tuned_models(X_trains, y_train, params_fixed, estimator, get_params, scoring_tune, scoring_report):
    results_list = []
    for n in range(len(X_trains)):
        X_train = X_trains[n]

        best = report_optuna_results(X_train, y_train, estimator, get_params, scoring_tune)
        model = estimator(**best)

        scores = cross_validate(model, X_train, y_train, scoring=scoring_report, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=5)

        cv_results = {}
        for param in params_fixed:
            cv_results[param] = params_fixed[param][n]
        cv_results['mean_fit_time'] = np.mean(scores['fit_time'])
        for score in scoring_report:
            cv_results[score] = f'{np.min(scores["test_"+score]):.5f}/{np.mean(scores["test_"+score]):.5f}/{np.max(scores["test_"+score]):.5f}'
        for param in best:
            cv_results[param] = best[param]
        results_list.append(cv_results)
        print(pd.DataFrame(results_list))
    return results_list

In [26]:
from sklearn.naive_bayes import MultinomialNB

def get_params_mnB(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    return {'alpha': alpha}
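
The `log=True` argument means that alpha is searched on a logarithmic scale, so each decade between 1e-9 and 1e+1 is treated as equally important. The sketch below imitates such a search space with plain random sampling; it only illustrates the log scaling and is not how Optuna's TPE sampler proposes values, which adapts to earlier trials. The helper `sample_log_uniform` is a hypothetical name.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(524)
alphas = [sample_log_uniform(1e-9, 1e1, rng) for _ in range(10_000)]

# The geometric midpoint of the range is 1e-4, so about half the draws fall below it.
share_below_midpoint = sum(a < 1e-4 for a in alphas) / len(alphas)
print(min(alphas), max(alphas), share_below_midpoint)
```

With a linear scale, almost every draw would land between 1 and 10, and the small values of alpha that perform well in the trials below would essentially never be tried.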

In [27]:
X_trains = []
params_fixed = {'vectorizer_type': [], 'norm': [], 'ngram_range': []}

for use_idf in [False,True]:
    for norm in ['l1','l2']:
        for sublinear_tf in [False,True]:
            for ngram_range in [(1,3)]:
                if use_idf == False and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF')
                elif use_idf == False and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)')
                elif use_idf == True and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF-IDF')
                elif use_idf == True and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)-IDF')
                params_fixed['norm'].append(norm)
                params_fixed['ngram_range'].append(ngram_range)

                X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
                X_trains.append(X_train)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 f

In [28]:
%%time
mnB_tune_ngrams = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-13 03:15:59,253] A new study created in memory with name: no-name-2a802492-c75e-4c7a-a0b7-c9118f9d96c0
[I 2024-05-13 03:16:01,466] Trial 0 finished with value: 0.7970624050782387 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-13 03:16:03,062] Trial 1 finished with value: 0.7946937191169358 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-13 03:16:04,298] Trial 2 finished with value: 0.7944567523107615 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-13 03:16:05,514] Trial 3 finished with value: 0.7181774554167321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-13 03:16:06,746] Trial 4 finished with value: 0.7924429553584686 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-13 03:1

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 3)       0.701705  0.80004/0.80753/0.81840   

                 precision                   recall                       f1  \
0  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   

                   roc_auc     alpha  
0  0.86774/0.87366/0.88516  0.006922  


[I 2024-05-13 03:17:47,776] Trial 0 finished with value: 0.7977730872522193 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-13 03:17:48,822] Trial 1 finished with value: 0.794535733451705 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-13 03:17:49,870] Trial 2 finished with value: 0.7946541155269689 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-13 03:17:50,897] Trial 3 finished with value: 0.7179800298449411 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-13 03:17:51,918] Trial 4 finished with value: 0.7933905263941442 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-13 03:17:52,956] Trial 5 finished with value: 0.7933510475153444 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 3)       0.701705  0.80004/0.80753/0.81840   
1         log(TF)   l1      (1, 3)       0.697260  0.80024/0.80930/0.82057   

                 precision                   recall                       f1  \
0  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
1  0.77949/0.78914/0.79773  0.69197/0.70503/0.73333  0.73646/0.74464/0.76297   

                   roc_auc     alpha  
0  0.86774/0.87366/0.88516  0.006922  
1  0.86793/0.87406/0.88573  0.006647  


[I 2024-05-13 03:19:32,185] Trial 0 finished with value: 0.7933907290497906 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7933907290497906.
[I 2024-05-13 03:19:33,183] Trial 1 finished with value: 0.7976943555336099 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7976943555336099.
[I 2024-05-13 03:19:34,197] Trial 2 finished with value: 0.7977733132912095 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-13 03:19:35,212] Trial 3 finished with value: 0.7621207367779856 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-13 03:19:36,238] Trial 4 finished with value: 0.7952463376812063 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-13 03:19:37,227] Trial 5 finished with value: 0.7951673877180545 and parameters: {'alpha': 0.00039799342667825053}. Best 


[I 2024-05-13 03:21:16,104] Trial 0 finished with value: 0.8015240172272888 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-13 03:21:17,135] Trial 1 finished with value: 0.8012081238413066 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-13 03:21:18,167] Trial 2 finished with value: 0.8007738405856124 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-13 03:21:19,199] Trial 3 finished with value: 0.7741626833790023 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-13 03:21:20,219] Trial 4 finished with value: 0.801129150494811 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-13 03:21:21,241] Trial 5 finished with value: 0.8012870815989063 and parameters: {'alpha': 0.00039799342667825053}. Best i


[I 2024-05-13 03:22:59,968] Trial 0 finished with value: 0.7888499954597341 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7888499954597341.
[I 2024-05-13 03:23:00,993] Trial 1 finished with value: 0.7888503384154433 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7888503384154433.
[I 2024-05-13 03:23:02,025] Trial 2 finished with value: 0.7905085682417557 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-13 03:23:03,038] Trial 3 finished with value: 0.7334173548839427 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-13 03:23:04,123] Trial 4 finished with value: 0.7846255449780567 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-13 03:23:05,163] Trial 5 finished with value: 0.7847439894089041 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 03:24:43,620] Trial 0 finished with value: 0.7892448310144202 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7892448310144202.
[I 2024-05-13 03:24:44,623] Trial 1 finished with value: 0.7891661694458421 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7892448310144202.
[I 2024-05-13 03:24:45,644] Trial 2 finished with value: 0.7908638937398502 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-13 03:24:46,677] Trial 3 finished with value: 0.7332989104530951 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-13 03:24:47,676] Trial 4 finished with value: 0.7848228926053683 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-13 03:24:48,729] Trial 5 finished with value: 0.7848228848109204 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 03:26:27,567] Trial 0 finished with value: 0.7916536505881496 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7916536505881496.
[I 2024-05-13 03:26:28,573] Trial 1 finished with value: 0.7959965143228827 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7959965143228827.
[I 2024-05-13 03:26:29,604] Trial 2 finished with value: 0.7965494134872788 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7965494134872788.
[I 2024-05-13 03:26:30,631] Trial 3 finished with value: 0.7979704816462185 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7979704816462185.
[I 2024-05-13 03:26:31,664] Trial 4 finished with value: 0.7941410771693216 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7979704816462185.
[I 2024-05-13 03:26:32,676] Trial 5 finished with value: 0.7941410771693216 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 03:28:10,726] Trial 0 finished with value: 0.7956807300591715 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7956807300591715.
[I 2024-05-13 03:28:11,731] Trial 1 finished with value: 0.797694324355818 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.797694324355818.
[I 2024-05-13 03:28:12,756] Trial 2 finished with value: 0.7978916797775776 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7978916797775776.
[I 2024-05-13 03:28:13,810] Trial 3 finished with value: 0.7992733158244051 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7992733158244051.
[I 2024-05-13 03:28:14,895] Trial 4 finished with value: 0.7959174864152514 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7992733158244051.
[I 2024-05-13 03:28:15,965] Trial 5 finished with value: 0.796035915257203 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.0s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.0s finished


In [29]:
mnB_tune_ngrams = pd.DataFrame(mnB_tune_ngrams)
display(mnB_tune_ngrams.style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 3)",0.701705,0.80004/0.80753/0.81840,0.78280/0.79023/0.79700,0.68112/0.69725/0.72431,0.73452/0.74075/0.75853,0.86774/0.87366/0.88516,0.006922
log(TF),l1,"(1, 3)",0.69726,0.80024/0.80930/0.82057,0.77949/0.78914/0.79773,0.69197/0.70503/0.73333,0.73646/0.74464/0.76297,0.86793/0.87406/0.88573,0.006647
TF,l2,"(1, 3)",0.707674,0.80932/0.81657/0.82925,0.75895/0.77732/0.79064,0.73780/0.74983/0.77043,0.75241/0.76328/0.78040,0.87760/0.88359/0.89375,0.076254
log(TF),l2,"(1, 3)",0.709934,0.80991/0.81878/0.82846,0.77439/0.78607/0.79538,0.73041/0.74283/0.75990,0.75700/0.76380/0.77724,0.87444/0.88138/0.89292,0.102185
TF-IDF,l1,"(1, 3)",0.672739,0.79688/0.80709/0.81879,0.77648/0.78908/0.79781,0.68260/0.69755/0.72331,0.73126/0.74043/0.75868,0.86652/0.87292/0.88448,0.010333
log(TF)-IDF,l1,"(1, 3)",0.69332,0.79668/0.80764/0.81899,0.77036/0.78300/0.79251,0.69837/0.70893/0.73484,0.73371/0.74406/0.76176,0.86649/0.87316/0.88484,0.009311
TF-IDF,l2,"(1, 3)",0.682989,0.81011/0.81870/0.82767,0.77565/0.78651/0.79403,0.72992/0.74184/0.75940,0.75683/0.76348/0.77633,0.87300/0.87943/0.89050,0.182905
log(TF)-IDF,l2,"(1, 3)",0.686522,0.80892/0.81976/0.83024,0.77184/0.78555/0.79604,0.73583/0.74713/0.76491,0.75629/0.76582/0.78016,0.87204/0.87916/0.89094,0.183701


We can see that the (1,3) n-gram range consistently outperformed the lower ranges. Let's look at the dependence on the other hyperparameters:

In [30]:
display(mnB_tune_ngrams[mnB_tune_ngrams['ngram_range']==(1,3)].style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 3)",0.701705,0.80004/0.80753/0.81840,0.78280/0.79023/0.79700,0.68112/0.69725/0.72431,0.73452/0.74075/0.75853,0.86774/0.87366/0.88516,0.006922
log(TF),l1,"(1, 3)",0.69726,0.80024/0.80930/0.82057,0.77949/0.78914/0.79773,0.69197/0.70503/0.73333,0.73646/0.74464/0.76297,0.86793/0.87406/0.88573,0.006647
TF,l2,"(1, 3)",0.707674,0.80932/0.81657/0.82925,0.75895/0.77732/0.79064,0.73780/0.74983/0.77043,0.75241/0.76328/0.78040,0.87760/0.88359/0.89375,0.076254
log(TF),l2,"(1, 3)",0.709934,0.80991/0.81878/0.82846,0.77439/0.78607/0.79538,0.73041/0.74283/0.75990,0.75700/0.76380/0.77724,0.87444/0.88138/0.89292,0.102185
TF-IDF,l1,"(1, 3)",0.672739,0.79688/0.80709/0.81879,0.77648/0.78908/0.79781,0.68260/0.69755/0.72331,0.73126/0.74043/0.75868,0.86652/0.87292/0.88448,0.010333
log(TF)-IDF,l1,"(1, 3)",0.69332,0.79668/0.80764/0.81899,0.77036/0.78300/0.79251,0.69837/0.70893/0.73484,0.73371/0.74406/0.76176,0.86649/0.87316/0.88484,0.009311
TF-IDF,l2,"(1, 3)",0.682989,0.81011/0.81870/0.82767,0.77565/0.78651/0.79403,0.72992/0.74184/0.75940,0.75683/0.76348/0.77633,0.87300/0.87943/0.89050,0.182905
log(TF)-IDF,l2,"(1, 3)",0.686522,0.80892/0.81976/0.83024,0.77184/0.78555/0.79604,0.73583/0.74713/0.76491,0.75629/0.76582/0.78016,0.87204/0.87916/0.89094,0.183701


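Interestingly, L$^2$ normalisation performs better than L$^1$ for every vectorizer type, raising the middle accuracy value from roughly 0.807-0.809 to 0.817-0.820. Replacing TF with log(TF) gives a slight further improvement (at most about 0.002 in accuracy), while incorporating the IDF factor makes little difference: the change is at most about 0.002 in either direction, well within the spread reported for each configuration.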
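
To make the four vectorizer types and the two norms concrete, here is a minimal sketch on a toy corpus (the corpus and the variable names are illustrative assumptions, not the notebook's data or its `get_features` helper). It uses only the `use_idf`, `sublinear_tf` and `norm` arguments of scikit-learn's `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): "science" occurs twice in the first document.
toy_corpus = [
    "science for kids science experiments",
    "master of science in biology",
    "science news for kids",
]

# Same naming scheme as the vectorizer_type column in the tables above.
variants = {
    "TF": dict(use_idf=False, sublinear_tf=False),
    "log(TF)": dict(use_idf=False, sublinear_tf=True),
    "TF-IDF": dict(use_idf=True, sublinear_tf=False),
    "log(TF)-IDF": dict(use_idf=True, sublinear_tf=True),
}

# Effect of sublinear_tf alone (no IDF, no normalisation):
# a raw count tf becomes 1 + ln(tf).
raw_vec = TfidfVectorizer(use_idf=False, norm=None)
log_vec = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)
X_raw = raw_vec.fit_transform(toy_corpus)
X_log = log_vec.fit_transform(toy_corpus)
raw_weight = float(X_raw[0, raw_vec.vocabulary_["science"]])
log_weight = float(X_log[0, log_vec.vocabulary_["science"]])

# Effect of the norm: each row is rescaled so that either the absolute
# values (l1) or the squares (l2) of its entries sum to one.
row_norms = {}
for name, flags in variants.items():
    for norm in ("l1", "l2"):
        vec = TfidfVectorizer(ngram_range=(1, 3), norm=norm, **flags)
        X = vec.fit_transform(toy_corpus)
        if norm == "l1":
            totals = np.asarray(abs(X).sum(axis=1)).ravel()
        else:
            totals = np.asarray(X.multiply(X).sum(axis=1)).ravel()
        row_norms[(name, norm)] = totals

print(raw_weight, log_weight)
```

With L$^1$ normalisation each row behaves like a distribution over n-grams, whereas L$^2$ normalisation leaves rows with entries that sum to more than one; this may be related to the larger smoothing parameter `alpha` selected for the L$^2$ models in the tables above, although the notebook does not test that directly.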

## Including the video category

Let's now look at including the video category as an additional feature.

In [31]:
%%time

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []

ngram_range = (1,3)
for sublinear_tf in [False,True]:
    for use_idf in [False,True]:
        if use_idf == False and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF')
        elif use_idf == False and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)')
        elif use_idf == True and sublinear_tf == False:
            params_fixed['vectorizer_type'].append('TF-IDF')
        elif use_idf == True and sublinear_tf == True:
            params_fixed['vectorizer_type'].append('log(TF)-IDF')

        params_fixed['ngram_range'].append(ngram_range)

        X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)

mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.


[I 2024-05-13 03:31:15,981] A new study created in memory with name: no-name-93f03c9a-5d01-4fc3-b3ce-007e1a379b4a
[I 2024-05-13 03:31:17,703] Trial 0 finished with value: 0.7973783764087002 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7973783764087002.
[I 2024-05-13 03:31:19,337] Trial 1 finished with value: 0.799470983024082 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 03:31:21,086] Trial 2 finished with value: 0.7991551675825793 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 03:31:22,799] Trial 3 finished with value: 0.7537901399454154 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 03:31:24,275] Trial 4 finished with value: 0.7977731885800425 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 03:31:25,


[I 2024-05-13 03:33:53,956] Trial 0 finished with value: 0.7960361023239535 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7960361023239535.
[I 2024-05-13 03:33:55,550] Trial 1 finished with value: 0.7985233886050628 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7985233886050628.
[I 2024-05-13 03:33:57,118] Trial 2 finished with value: 0.7986024243071418 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 03:33:58,690] Trial 3 finished with value: 0.7892450492589623 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 03:34:00,200] Trial 4 finished with value: 0.7967862166100466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 03:34:01,724] Trial 5 finished with value: 0.7968256954888464 and parameters: {'alpha': 0.00039799342667825053}. Best 


[I 2024-05-13 03:36:30,512] Trial 0 finished with value: 0.8043667381287636 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 03:36:32,021] Trial 1 finished with value: 0.8037744224411509 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 03:36:33,563] Trial 2 finished with value: 0.8028269215555067 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 03:36:35,097] Trial 3 finished with value: 0.766503274252717 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 03:36:36,656] Trial 4 finished with value: 0.8047219700934827 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8047219700934827.
[I 2024-05-13 03:36:38,155] Trial 5 finished with value: 0.804643012335883 and parameters: {'alpha': 0.00039799342667825053}. Best is


[I 2024-05-13 03:39:06,532] Trial 0 finished with value: 0.798878721897605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798878721897605.
[I 2024-05-13 03:39:08,056] Trial 1 finished with value: 0.8001026762626713 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 03:39:09,592] Trial 2 finished with value: 0.7999052584853283 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 03:39:11,129] Trial 3 finished with value: 0.7911796312368737 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 03:39:12,648] Trial 4 finished with value: 0.8002210193656957 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8002210193656957.
[I 2024-05-13 03:39:14,190] Trial 5 finished with value: 0.8001815404868957 and parameters: {'alpha': 0.00039799342667825053}. Best is

CPU times: user 4min 40s, sys: 33 s, t

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.1s remaining:    1.6s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.1s finished


In [32]:
pd.DataFrame(mnB_tune_category).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",0.718354,0.80991/0.81842/0.82965,0.76245/0.77930/0.79236,0.73829/0.75305/0.76892,0.75729/0.76590/0.78046,0.88022/0.88591/0.89647,0.065719
TF-IDF,"(1, 3)",0.745115,0.80754/0.82012/0.83083,0.77413/0.79016/0.80138,0.72942/0.74083/0.75840,0.75273/0.76467/0.77929,0.87564/0.88196/0.89345,0.145222
log(TF),"(1, 3)",0.779326,0.81030/0.82115/0.83281,0.77851/0.79233/0.80597,0.73189/0.74083/0.75789,0.75553/0.76569/0.78119,0.87689/0.88348/0.89532,0.090669
log(TF)-IDF,"(1, 3)",0.735653,0.80853/0.82099/0.83281,0.77996/0.79514/0.80794,0.72203/0.73584/0.75489,0.75204/0.76431/0.78051,0.87435/0.88119/0.89339,0.166633
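
How the category enters the feature matrix is handled inside `get_features` via `video_category_encoder`, which is not shown in this section. One common way to do it, sketched here purely as an assumption, is to one-hot encode the category and stack the resulting columns next to the text features with `scipy.sparse.hstack`. The titles, category ids and labels below are toy values:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): four titles, their category ids and class labels.
titles = [
    "master of science in biology",
    "faculty showcase exercise science",
    "science for kids broken bones",
    "phone photography review",
]
categories = np.array([[27], [27], [24], [22]])
labels = np.array([0, 0, 1, 0])

# Text features: log(TF) with L2 norm over the (1, 3) n-gram range.
text_vectorizer = TfidfVectorizer(
    ngram_range=(1, 3), use_idf=False, sublinear_tf=True, norm="l2"
)
X_text = text_vectorizer.fit_transform(titles)

# Category features: one indicator column per category seen in training;
# unseen categories at prediction time map to an all-zero row.
category_encoder = OneHotEncoder(handle_unknown="ignore")
X_category = category_encoder.fit_transform(categories)

# Stack the two sparse blocks side by side. All entries stay non-negative,
# which MultinomialNB requires.
X_combined = hstack([X_text, X_category]).tocsr()

model = MultinomialNB(alpha=0.1)  # alpha chosen arbitrarily for the sketch
model.fit(X_combined, labels)
predictions = model.predict(X_combined)
```

Because the indicator columns are not rescaled, a category entry of 1 can outweigh any single L$^2$-normalised text entry; whether the notebook's helper weights the category differently is not visible here.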


## Dimensionality reduction

Our best performing models use the (1,3) n-gram range, which requires over 3 million features (3,108,791 for the largest vectorizer fitted above). We will now look at reducing the number of features by setting minimum and maximum document frequency thresholds, which drop n-grams from the vocabulary that are either too rare or too common. The results shown are for log(TF)-IDF with L$^2$ norm, i.e. `use_idf=True` and `sublinear_tf=True`.
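
As a small illustration of how the two thresholds act, the sketch below fits `TfidfVectorizer` on a toy corpus (an assumption, not the notebook's data) with and without `min_df` and `max_df`. An integer `min_df` is an absolute document count, while a float `max_df` is a fraction of the documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption): "science" appears in every document.
toy_docs = [
    "science for kids",
    "science experiments for kids",
    "science news today",
    "science of cooking",
]

# No filtering: every unigram, bigram and trigram enters the vocabulary.
full = TfidfVectorizer(ngram_range=(1, 3)).fit(toy_docs)

# Keep only n-grams found in at least 2 documents (min_df=2)
# and in at most 90% of the documents (max_df=0.9).
pruned = TfidfVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.9).fit(toy_docs)

full_vocabulary = sorted(full.vocabulary_)
pruned_vocabulary = sorted(pruned.vocabulary_)

print(len(full_vocabulary), "n-grams without filtering")
print(pruned_vocabulary)  # ['for', 'for kids', 'kids']
```

Here `max_df` removes only "science", while `min_df` removes every n-gram that occurs in a single document, which is most of the bigrams and trigrams. This matches the pattern in the log of the next cell, where raising `min_df` shrinks the vocabulary sharply and lowering `max_df` to 0.7 changes the feature count by one.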

In [33]:
%%time

X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:
    for max_df in [1.0, 0.9, 0.8, 0.7]:
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315920 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 2104 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 9402 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 151356 features in the (

[I 2024-05-13 03:46:57,172] A new study created in memory with name: no-name-c0faec5a-cb2b-407a-acd6-d1d1c72e02a7
[I 2024-05-13 03:46:58,675] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:46:59,715] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:47:00,429] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:47:01,134] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:47:01,827] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:47


[I 2024-05-13 03:48:11,032] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:48:11,732] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:48:12,493] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:48:13,192] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:48:13,904] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:48:14,615] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i


[I 2024-05-13 03:49:23,200] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:49:23,894] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:49:24,601] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:49:25,314] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:49:26,017] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:49:26,741] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i


[I 2024-05-13 03:50:36,060] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:50:36,749] Trial 1 finished with value: 0.8070512161482254 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:50:37,484] Trial 2 finished with value: 0.8045638441281889 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:50:38,197] Trial 3 finished with value: 0.7877841904433054 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:50:38,897] Trial 4 finished with value: 0.8092621736610794 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-13 03:50:39,639] Trial 5 finished with value: 0.8093016603343273 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   

                   roc_auc     alpha  
0  0.86727/0.87335/0.88590  0.004751  
1  0.86727/0.87335/0.88590  0.004751  
2  0.86727/0.87335/0.88590  0.004751  
3  0.86727/0.87335/0.88590  0.004581  


[I 2024-05-13 03:51:48,504] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:51:49,096] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:51:49,669] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:51:50,223] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:51:50,806] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:51:51,375] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   

                   roc_auc     alpha  
0  0.86727/0.87335/0.88590  0.004751  
1  0.86727/0.87335/0.88590  0.004751  
2  0.86727/0.87335/0.88590  0.004751  
3 

[I 2024-05-13 03:52:46,886] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:52:47,436] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:52:48,006] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:52:48,550] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:52:49,112] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:52:49,678] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   
5      10    0.9       0.118945  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   
5  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   

                  

[I 2024-05-13 03:53:44,410] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:53:44,975] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:53:45,566] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:53:46,179] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:53:46,832] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:53:47,405] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   
5      10    0.9       0.118945  0.79171/0.79615/0.80340   
6      10    0.8       0.109025  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   
5  0.75927/0.76583/0.77490  0.68260/0.6

[I 2024-05-13 03:54:42,360] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:54:42,919] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:54:43,471] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:54:44,025] Trial 3 finished with value: 0.7798087086587353 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:54:44,593] Trial 4 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-13 03:54:45,150] Trial 5 finished with value: 0.7950093552861361 and parameters: {'alpha': 0.00039799342667825053}. Best is

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   
5      10    0.9       0.118945  0.79171/0.79615/0.80340   
6      10    0.8       0.109025  0.79171/0.79615/0.80340   
7      10    0.7       0.134741  0.79076/0.79608/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.7221

[I 2024-05-13 03:55:39,303] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-13 03:55:39,804] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:55:40,280] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:55:40,772] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:55:41,264] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:55:41,773] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   
5      10    0.9       0.118945  0.79171/0.79615/0.80340   
6      10    0.8       0.109025  0.79171/0.79615/0.80340   
7      10    0.7       0.134741  0.79076/0.79608/0.80340   
8      20    1.0       0.096581  0.78030/0.78360/0.78875   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   

[I 2024-05-13 03:56:29,110] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-13 03:56:29,569] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:56:30,060] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:56:30,517] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:56:31,011] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:56:31,512] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0       0.198030  0.80513/0.81116/0.81998   
1       5    0.9       0.196452  0.80513/0.81116/0.81998   
2       5    0.8       0.209306  0.80513/0.81116/0.81998   
3       5    0.7       0.211692  0.80474/0.81104/0.82017   
4      10    1.0       0.123231  0.79171/0.79615/0.80340   
5      10    0.9       0.118945  0.79171/0.79615/0.80340   
6      10    0.8       0.109025  0.79171/0.79615/0.80340   
7      10    0.7       0.134741  0.79076/0.79608/0.80340   
8      20    1.0       0.096581  0.78030/0.78360/0.78875   
9      20    0.9       0.095543  0.78030/0.78360/0.78875   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/

[I 2024-05-13 03:57:18,183] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-13 03:57:18,564] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:57:19,027] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:57:19,483] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:57:19,966] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-13 03:57:20,461] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2   0.76661/0.77804/0.78

[I 2024-05-13 03:58:06,786] Trial 0 finished with value: 0.783046631453949 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.783046631453949.
[I 2024-05-13 03:58:07,270] Trial 1 finished with value: 0.7834413968586037 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-13 03:58:07,748] Trial 2 finished with value: 0.7830860323882696 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-13 03:58:08,241] Trial 3 finished with value: 0.775189523950195 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-13 03:58:08,704] Trial 4 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-13 03:58:09,174] Trial 5 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00039799342667825053}. Best is 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   
11      20    0.7       0.097255  0.78030/0.78348/0.78875   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1   0.76661/0.77804/0.78788  0.71858/0.72927

[I 2024-05-13 03:58:55,911] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 03:58:56,322] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 03:58:56,731] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:58:57,169] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:58:57,583] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:58:57,989] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   
11      20    0.7       0.097255  0.78030/0.78348/0.78875   
12      50    1.0       0.065234  0.75464/0.76015/0.77280   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.

[I 2024-05-13 03:59:36,148] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 03:59:36,550] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 03:59:36,929] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:59:37,302] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:59:37,718] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 03:59:38,116] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   
11      20    0.7       0.097255  0.78030/0.78348/0.78875   
12      50    1.0       0.065234  0.75464/0.76015/0.77280   
13      50    0.9       0.076649  0.75464/0.76015/0.77280   

                  precision                   recall                       f1  \
0  

[I 2024-05-13 04:00:16,429] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 04:00:16,869] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-13 04:00:17,258] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 04:00:17,674] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 04:00:18,123] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-13 04:00:18,518] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   
11      20    0.7       0.097255  0.78030/0.78348/0.78875   
12      50    1.0       0.065234  0.75464/0.76015/0.77280   
13      50    0.9       0.076649  0.75464/0.76015/0.77280   
14      50    0.8       0.069889  0.75464/0.76015/0.77280   

                  preci

[I 2024-05-13 04:00:56,712] Trial 0 finished with value: 0.7599886434893561 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-13 04:00:57,115] Trial 1 finished with value: 0.7598701912640606 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-13 04:00:57,490] Trial 2 finished with value: 0.7599096857317564 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-13 04:00:57,910] Trial 3 finished with value: 0.7585278626181784 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-13 04:00:58,337] Trial 4 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-13 04:00:58,734] Trial 5 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0       0.198030  0.80513/0.81116/0.81998   
1        5    0.9       0.196452  0.80513/0.81116/0.81998   
2        5    0.8       0.209306  0.80513/0.81116/0.81998   
3        5    0.7       0.211692  0.80474/0.81104/0.82017   
4       10    1.0       0.123231  0.79171/0.79615/0.80340   
5       10    0.9       0.118945  0.79171/0.79615/0.80340   
6       10    0.8       0.109025  0.79171/0.79615/0.80340   
7       10    0.7       0.134741  0.79076/0.79608/0.80340   
8       20    1.0       0.096581  0.78030/0.78360/0.78875   
9       20    0.9       0.095543  0.78030/0.78360/0.78875   
10      20    0.8       0.097880  0.78030/0.78360/0.78875   
11      20    0.7       0.097255  0.78030/0.78348/0.78875   
12      50    1.0       0.065234  0.75464/0.76015/0.77280   
13      50    0.9       0.076649  0.75464/0.76015/0.77280   
14      50    0.8       0.069889  0.75464/0.76015/0.77280   
15      50    0.7       

[I 2024-05-13 04:01:36,282] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:01:36,626] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:01:36,992] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:01:37,353] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:01:37,705] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:01:37,952] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-13 04:02:11,353] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:11,706] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:12,042] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:12,396] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:12,765] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:13,113] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-13 04:02:46,724] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:47,077] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:47,421] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:47,767] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:48,123] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-13 04:02:48,458] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-13 04:03:21,652] Trial 0 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-13 04:03:22,034] Trial 1 finished with value: 0.7442352379976218 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-13 04:03:22,397] Trial 2 finished with value: 0.7442352379976218 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-13 04:03:22,783] Trial 3 finished with value: 0.7437219813954321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-13 04:03:23,121] Trial 4 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-13 04:03:23,342] Trial 5 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:03:56,390] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:03:56,693] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:03:57,014] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:03:57,320] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:03:57,635] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:03:57,945] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:04:27,375] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:27,673] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:27,960] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:28,262] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:28,570] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:28,865] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:04:57,186] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:57,359] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:57,699] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:58,001] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:58,337] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-13 04:04:58,525] Trial 6 finished with value: 0.7327065557932431 and parameters: {'alpha': 2.6074972019493715e-05}. Best

[I 2024-05-13 04:05:27,114] Trial 0 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-13 04:05:27,399] Trial 1 finished with value: 0.7325880879790517 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-13 04:05:27,689] Trial 2 finished with value: 0.7325880879790517 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-13 04:05:27,974] Trial 3 finished with value: 0.7322722569486529 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-13 04:05:28,280] Trial 4 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-13 04:05:28,582] Trial 5 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:05:57,235] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:05:57,529] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:05:57,779] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:05:58,040] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:05:58,287] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:05:58,540] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:06:22,468] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:06:22,728] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:06:22,880] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:23,138] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:23,294] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:23,550] Trial 6 finished with value: 0.7031742733333619 and parameters: {'alpha': 2.6074972019493715e-05}. Best

[I 2024-05-13 04:06:46,092] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:06:46,340] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-13 04:06:46,615] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:46,859] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:47,111] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-13 04:06:47,383] Trial 6 finished with value: 0.7031742733333619 and parameters: {'alpha': 2.6074972019493715e-05}. Best

[I 2024-05-13 04:07:11,762] Trial 0 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-13 04:07:12,046] Trial 1 finished with value: 0.7031347866601141 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-13 04:07:12,299] Trial 2 finished with value: 0.7031347866601141 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-13 04:07:12,564] Trial 3 finished with value: 0.7034110842505775 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-13 04:07:12,801] Trial 4 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-13 04:07:13,050] Trial 5 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:07:35,759] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:35,994] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:36,222] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:36,435] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:07:36,671] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:07:36,903] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:07:55,095] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:55,210] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:55,413] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:07:55,632] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:07:55,856] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:07:56,105] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:08:15,426] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:08:15,678] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:08:15,801] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-13 04:08:16,043] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:08:16,157] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-13 04:08:16,362] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:08:36,407] Trial 0 finished with value: 0.6800379667559 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-13 04:08:36,517] Trial 1 finished with value: 0.6800379667559 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-13 04:08:36,621] Trial 2 finished with value: 0.6800379667559 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-13 04:08:36,829] Trial 3 finished with value: 0.6797220655754699 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-13 04:08:37,051] Trial 4 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-13 04:08:37,329] Trial 5 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.68003

In [34]:
pd.DataFrame(mnB_tune_dim_reduction).style.hide()

min_df,max_df,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
5,1.0,0.19803,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.9,0.196452,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.8,0.209306,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.7,0.211692,0.80474/0.81104/0.82017,0.76636/0.77784/0.78799,0.71858/0.72917/0.74336,0.74307/0.75270/0.76502,0.86727/0.87335/0.88590,0.004581
10,1.0,0.123231,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.9,0.118945,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.8,0.109025,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.7,0.134741,0.79076/0.79608/0.80340,0.75719/0.76481/0.77370,0.68408/0.69764/0.70777,0.72245/0.72963/0.73927,0.85350/0.86081/0.87329,0.004618
20,1.0,0.096581,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06
20,0.9,0.095543,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06


We see that the cross-validation scores degrade rapidly as the vocabulary size is decreased by raising min_df, whereas lowering max_df has almost no effect.
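To make the link between min_df and vocabulary size concrete, here is a minimal sketch on a toy corpus. The documents and the `vocabulary_size` helper are hypothetical and are not part of the notebook's pipeline; only the `min_df` parameter of `TfidfVectorizer` is taken from the analysis above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the YouTube dataset)
docs = [
    "science for kids",
    "science experiments for kids",
    "master of science in biology",
    "biology lecture",
]

def vocabulary_size(min_df):
    """Count the terms kept when terms appearing in fewer than min_df documents are dropped."""
    vectorizer = TfidfVectorizer(min_df=min_df)
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# Vocabulary size for increasing document-frequency thresholds
sizes = {min_df: vocabulary_size(min_df) for min_df in (1, 2, 3)}
print(sizes)  # {1: 9, 2: 4, 3: 1}
```

Every increase in min_df discards the terms that occur in too few documents, so the rare terms are the first to go. The table above suggests that, for this dataset, those rare terms carry useful signal.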

All of these results show that predicting whether a video will be popular from its text metadata is a machine-learning problem that defies the common wisdom of text classification. It is a fundamentally different challenge from, for example, determining whether a text message or email is spam. In our case, both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether a viewer likes a certain YouTube video or channel, and whether they share it on social media and so contribute to its virality, is primarily a subjective determination. This makes the classification problem significantly more difficult, which is reflected in the low cross-validation metrics we have seen so far.
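As a reminder of what the four weighting modalities (TF, log(TF), TF-IDF and log(TF)-IDF) actually compute, the following sketch prints the raw weights of two terms in a toy corpus. The corpus and the `term_weights` helper are hypothetical. Normalisation is switched off here (`norm=None`) purely so that the unnormalised weights are visible, which differs from the L2 normalisation used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "science" occurs in both documents, "kids" in one
docs = ["science science science kids", "science biology"]

def term_weights(use_idf, sublinear_tf):
    """Return the weights of 'science' and 'kids' in the first document."""
    vectorizer = TfidfVectorizer(use_idf=use_idf, sublinear_tf=sublinear_tf, norm=None)
    X = vectorizer.fit_transform(docs)
    vocab = vectorizer.vocabulary_
    return float(X[0, vocab["science"]]), float(X[0, vocab["kids"]])

# (use_idf, sublinear_tf) flags for each weighting scheme
schemes = {
    "TF": (False, False),
    "log(TF)": (False, True),
    "TF-IDF": (True, False),
    "log(TF)-IDF": (True, True),
}

weights = {name: term_weights(*flags) for name, flags in schemes.items()}
for name, (w_science, w_kids) in weights.items():
    print(f"{name:12s} science={w_science:.3f} kids={w_kids:.3f}")
```

With scikit-learn's default smoothed IDF, a term that appears in every document (here "science") gets an IDF factor of exactly 1, so IDF only rescales the rarer terms upwards. The sublinear option replaces a raw count tf by 1 + ln(tf), which dampens repeated terms.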

## Further classification models

Now that we have understood the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF modalities and the normalisation), we are ready to build some more models. In addition to multinomial naive Bayes, we will use four other classification models:

* K nearest neighbours
* Support vector machine
* Logistic regression
* Perceptron

These models all have their own hyperparameters, which we'll tune using a grid search with five-fold cross-validation, as we did earlier for Ridge regression. For the support vector machine, logistic regression and the perceptron algorithm, we'll use stochastic gradient descent to speed up training. We'll also extend the n-gram range and increase the number of trials to 250, since we have seen Optuna find the best hyperparameters close to the previous maximum of 100 trials.

In [35]:
max_trials = 250

In [36]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import SGDClassifier

# Human-readable label for each (sublinear_tf, use_idf) combination
vectorizer_labels = {
    (False, False): 'TF',
    (True, False): 'log(TF)',
    (False, True): 'TF-IDF',
    (True, True): 'log(TF)-IDF',
}

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

# Build one feature matrix per (n-gram range, weighting scheme) combination
for ngram_range in [(1,3),(1,4),(1,5)]:
    for sublinear_tf in [False,True]:
        for use_idf in [False,True]:
            params_fixed['vectorizer_type'].append(vectorizer_labels[(sublinear_tf, use_idf)])
            params_fixed['ngram_range'].append(ngram_range)

            X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
            X_trains.append(X_train)
            vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 f

### Multinomial naive Bayes

In [37]:
%%time

def get_params_mnB(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    return {'alpha': alpha}
    
mnB_tune = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-13 04:15:38,402] A new study created in memory with name: no-name-87e80218-1adb-4452-ab62-a210244e039a
[I 2024-05-13 04:15:42,901] Trial 0 finished with value: 0.7973783764087002 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7973783764087002.
[I 2024-05-13 04:15:45,092] Trial 1 finished with value: 0.799470983024082 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 04:15:46,856] Trial 2 finished with value: 0.7991551675825793 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 04:15:48,748] Trial 3 finished with value: 0.7537901399454154 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 04:15:50,499] Trial 4 finished with value: 0.7977731885800425 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-13 04:15:52,

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)        0.76653  0.80991/0.81842/0.82965   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   

                   roc_auc     alpha  
0  0.88022/0.88591/0.89647  0.065719  


[I 2024-05-13 04:22:21,057] Trial 0 finished with value: 0.7960361023239535 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7960361023239535.
[I 2024-05-13 04:22:22,629] Trial 1 finished with value: 0.7985233886050628 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7985233886050628.
[I 2024-05-13 04:22:24,176] Trial 2 finished with value: 0.7986024243071418 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 04:22:25,741] Trial 3 finished with value: 0.7892450492589623 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 04:22:27,280] Trial 4 finished with value: 0.7967862166100466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-13 04:22:28,849] Trial 5 finished with value: 0.7968256954888464 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 04:28:54,600] Trial 0 finished with value: 0.8043667381287636 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 04:28:56,157] Trial 1 finished with value: 0.8037744224411509 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 04:28:57,685] Trial 2 finished with value: 0.8028269215555067 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 04:28:59,219] Trial 3 finished with value: 0.766503274252717 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-13 04:29:00,789] Trial 4 finished with value: 0.8047219700934827 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8047219700934827.
[I 2024-05-13 04:29:02,309] Trial 5 finished with value: 0.804643012335883 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-13 04:35:26,662] Trial 0 finished with value: 0.798878721897605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798878721897605.
[I 2024-05-13 04:35:28,300] Trial 1 finished with value: 0.8001026762626713 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 04:35:29,838] Trial 2 finished with value: 0.7999052584853283 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 04:35:31,374] Trial 3 finished with value: 0.7911796312368737 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-13 04:35:32,904] Trial 4 finished with value: 0.8002210193656957 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8002210193656957.
[I 2024-05-13 04:35:34,469] Trial 5 finished with value: 0.8001815404868957 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-13 04:42:00,494] Trial 0 finished with value: 0.7883370428410139 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7883370428410139.
[I 2024-05-13 04:42:02,802] Trial 1 finished with value: 0.7955621843005009 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7955621843005009.
[I 2024-05-13 04:42:05,301] Trial 2 finished with value: 0.7971021099960287 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7971021099960287.
[I 2024-05-13 04:42:07,702] Trial 3 finished with value: 0.7580541316614762 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7971021099960287.
[I 2024-05-13 04:42:09,937] Trial 4 finished with value: 0.790390170577596 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7971021099960287.
[I 2024-05-13 04:42:12,247] Trial 5 finished with value: 0.790390178372044 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-13 04:51:48,279] Trial 0 finished with value: 0.7870342476350671 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7870342476350671.
[I 2024-05-13 04:51:50,684] Trial 1 finished with value: 0.7946148393038154 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7946148393038154.
[I 2024-05-13 04:51:52,985] Trial 2 finished with value: 0.7954045026187396 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7954045026187396.
[I 2024-05-13 04:51:55,295] Trial 3 finished with value: 0.7915744901749036 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7954045026187396.
[I 2024-05-13 04:51:57,597] Trial 4 finished with value: 0.7895216040662076 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7954045026187396.
[I 2024-05-13 04:51:59,973] Trial 5 finished with value: 0.789482101804064 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-13 05:01:33,055] Trial 0 finished with value: 0.7974179254375315 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7974179254375315.
[I 2024-05-13 05:01:35,444] Trial 1 finished with value: 0.8008133896144438 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8008133896144438.
[I 2024-05-13 05:01:37,707] Trial 2 finished with value: 0.8008134519700272 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.8008134519700272.
[I 2024-05-13 05:01:40,041] Trial 3 finished with value: 0.769108965992434 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.8008134519700272.
[I 2024-05-13 05:01:42,405] Trial 4 finished with value: 0.7987208141768536 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.8008134519700272.
[I 2024-05-13 05:01:44,675] Trial 5 finished with value: 0.7987997797289011 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-13 05:11:17,059] Trial 0 finished with value: 0.7905480237372118 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7905480237372118.
[I 2024-05-13 05:11:19,390] Trial 1 finished with value: 0.7963914901776314 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7963914901776314.
[I 2024-05-13 05:11:21,692] Trial 2 finished with value: 0.7973786258310342 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7973786258310342.
[I 2024-05-13 05:11:24,066] Trial 3 finished with value: 0.7925615634727227 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7973786258310342.
[I 2024-05-13 05:11:26,372] Trial 4 finished with value: 0.7926011982404814 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7973786258310342.
[I 2024-05-13 05:11:28,669] Trial 5 finished with value: 0.7926406771192812 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 05:21:03,507] Trial 0 finished with value: 0.7802433426645866 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7802433426645866.
[I 2024-05-13 05:21:06,642] Trial 1 finished with value: 0.7912587916501198 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7912587916501198.
[I 2024-05-13 05:21:09,787] Trial 2 finished with value: 0.7934698504907969 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7934698504907969.
[I 2024-05-13 05:21:12,893] Trial 3 finished with value: 0.7610942235736063 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7934698504907969.
[I 2024-05-13 05:21:16,029] Trial 4 finished with value: 0.7834413812697079 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7934698504907969.
[I 2024-05-13 05:21:19,199] Trial 5 finished with value: 0.7835203468217554 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 05:34:12,256] Trial 0 finished with value: 0.7784666138462911 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7784666138462911.
[I 2024-05-13 05:34:15,472] Trial 1 finished with value: 0.7897981822567967 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7897981822567967.
[I 2024-05-13 05:34:18,575] Trial 2 finished with value: 0.7927592306723996 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7927592306723996.
[I 2024-05-13 05:34:21,688] Trial 3 finished with value: 0.7922851333766444 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7927592306723996.
[I 2024-05-13 05:34:24,893] Trial 4 finished with value: 0.7829676425185577 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7927592306723996.
[I 2024-05-13 05:34:28,095] Trial 5 finished with value: 0.7830860869494052 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 05:47:21,029] Trial 0 finished with value: 0.7897189984602069 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7897189984602069.
[I 2024-05-13 05:47:24,171] Trial 1 finished with value: 0.7973390534188592 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7973390534188592.
[I 2024-05-13 05:47:27,332] Trial 2 finished with value: 0.7974180969153861 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7974180969153861.
[I 2024-05-13 05:47:30,471] Trial 3 finished with value: 0.7710435791481369 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7974180969153861.
[I 2024-05-13 05:47:33,546] Trial 4 finished with value: 0.7928381416633118 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7974180969153861.
[I 2024-05-13 05:47:36,660] Trial 5 finished with value: 0.7928776127476638 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-13 06:00:27,195] Trial 0 finished with value: 0.782849088965439 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.782849088965439.
[I 2024-05-13 06:00:30,322] Trial 1 finished with value: 0.7925223651940486 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7925223651940486.
[I 2024-05-13 06:00:33,601] Trial 2 finished with value: 0.794417476087608 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.794417476087608.
[I 2024-05-13 06:00:36,732] Trial 3 finished with value: 0.7930748044860165 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.794417476087608.
[I 2024-05-13 06:00:39,846] Trial 4 finished with value: 0.7854549755663542 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.794417476087608.
[I 2024-05-13 06:00:42,998] Trial 5 finished with value: 0.7853365311355069 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

   vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0               TF      (1, 3)       0.766530  0.80991/0.81842/0.82965   
1           TF-IDF      (1, 3)       0.756433  0.80754/0.82048/0.83103   
2          log(TF)      (1, 3)       0.736549  0.81070/0.82134/0.83340   
3      log(TF)-IDF      (1, 3)       0.749131  0.80853/0.82099/0.83281   
4               TF      (1, 4)       1.152033  0.80853/0.81945/0.83182   
5           TF-IDF      (1, 4)       1.198326  0.80734/0.82115/0.83202   
6          log(TF)      (1, 4)       1.147449  0.81070/0.82202/0.83379   
7      log(TF)-IDF      (1, 4)       1.153544  0.80892/0.82130/0.83143   
8               TF      (1, 5)       1.650669  0.80872/0.82048/0.83261   
9           TF-IDF      (1, 5)       1.645833  0.80794/0.82091/0.83083   
10         log(TF)      (1, 5)       1.627948  0.80971/0.82194/0.83261   
11     log(TF)-IDF      (1, 5)       1.779980  0.80774/0.82123/0.83103   

                  precision          

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.5s remaining:    3.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.5s finished


In [38]:
pd.DataFrame(mnB_tune).style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",0.76653,0.80991/0.81842/0.82965,0.76245/0.77930/0.79236,0.73829/0.75305/0.76892,0.75729/0.76590/0.78046,0.88022/0.88591/0.89647,0.065719
TF-IDF,"(1, 3)",0.756433,0.80754/0.82048/0.83103,0.77614/0.79262/0.80463,0.72400/0.73815/0.75639,0.75185/0.76437/0.77904,0.87557/0.88186/0.89336,0.150922
log(TF),"(1, 3)",0.736549,0.81070/0.82134/0.83340,0.77778/0.79137/0.80531,0.73337/0.74302/0.76090,0.75666/0.76641/0.78247,0.87695/0.88353/0.89537,0.088882
log(TF)-IDF,"(1, 3)",0.749131,0.80853/0.82099/0.83281,0.77996/0.79514/0.80794,0.72203/0.73584/0.75489,0.75204/0.76431/0.78051,0.87435/0.88119/0.89339,0.166633
TF,"(1, 4)",1.152033,0.80853/0.81945/0.83182,0.76834/0.78192/0.79627,0.74174/0.75217/0.76992,0.75640/0.76672/0.78287,0.87817/0.88469/0.89580,0.061751
TF-IDF,"(1, 4)",1.198326,0.80734/0.82115/0.83202,0.78015/0.79536/0.80556,0.72162/0.73606/0.75589,0.74974/0.76452/0.77993,0.87345/0.88042/0.89237,0.144887
log(TF),"(1, 4)",1.147449,0.81070/0.82202/0.83379,0.77450/0.78966/0.80294,0.73534/0.74826/0.76591,0.75838/0.76835/0.78399,0.87526/0.88229/0.89466,0.077774
log(TF)-IDF,"(1, 4)",1.153544,0.80892/0.82130/0.83143,0.77353/0.78804/0.79636,0.73829/0.74833/0.76842,0.75556/0.76764/0.78214,0.87239/0.88005/0.89264,0.128225
TF,"(1, 5)",1.650669,0.80872/0.82048/0.83261,0.77772/0.79200/0.80652,0.72548/0.73919/0.75639,0.75325/0.76463/0.78065,0.87612/0.88308/0.89451,0.067782
TF-IDF,"(1, 5)",1.645833,0.80794/0.82091/0.83083,0.78110/0.79568/0.80363,0.72211/0.73475/0.75489,0.75045/0.76397/0.77850,0.87174/0.87906/0.89118,0.137489


The cross-validation metrics improve across the board as the n-gram range is widened from (1,1) to (1,3). Extending it further to (1,4) or (1,5) does not lead to a significant improvement in accuracy, F1 score or area under the ROC curve, while `mean_fit_time` roughly doubles (from about 0.75 at (1,3) to about 1.65 at (1,5)).
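One way to see what widening the n-gram range does is to count how many features the vectorizer produces. The following is a minimal sketch on a made-up three-title corpus. The names `toy_titles` and `vocabulary_size` are illustrative only, and the sketch assumes scikit-learn's default word tokenizer rather than the byte-pair encoded tokens used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the YouTube dataset
toy_titles = [
    "science for kids broken bones",
    "master of science in mechanical engineering",
    "science quiz for kids part 2",
]

def vocabulary_size(ngram_range):
    """Number of distinct n-gram features found for a given ngram_range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(toy_titles)
    return len(vectorizer.vocabulary_)

# Compare (1, 1) up to (1, 5), the ranges discussed in this notebook
sizes = {n: vocabulary_size((1, n)) for n in range(1, 6)}
for n, size in sizes.items():
    print(f"ngram_range=(1, {n}): {size} features")
```

Each extra n-gram order adds features that tend to occur in fewer documents, which is plausibly why the longer ranges cost more fit time without a matching gain in the metrics.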

### Support vector machine
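
In scikit-learn, `SGDClassifier` with `loss='hinge'` fits a linear support vector machine by stochastic gradient descent, and `penalty='elasticnet'` mixes L1 and L2 regularization through `l1_ratio`. The sketch below illustrates these two hyperparameters on synthetic data. It is not the `report_tuned_models` helper used in this notebook: the small grid search, the `make_classification` data and the name `mean_cv_accuracy` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the vectorized text features (assumption)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

def mean_cv_accuracy(alpha, l1_ratio):
    """Mean 5-fold accuracy of a linear SVM with an elastic-net penalty."""
    model = SGDClassifier(loss='hinge', penalty='elasticnet',
                          alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    return cross_val_score(model, X, y, scoring='accuracy', cv=cv).mean()

# Coarse grid: alpha on a log scale, l1_ratio from pure L2 (0) to pure L1 (1)
candidates = [(float(alpha), l1_ratio)
              for alpha in np.logspace(-6, 0, 4)
              for l1_ratio in (0.0, 0.5, 1.0)]
scores = {params: mean_cv_accuracy(*params) for params in candidates}
best_params = max(scores, key=scores.get)
print("best (alpha, l1_ratio):", best_params)
print("mean CV accuracy:", round(scores[best_params], 3))
```

A sampler such as Optuna's replaces the fixed grid with adaptively chosen trials, but the quantity being maximized is the same cross-validated score.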

In [39]:
%%time

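def get_params_SVM(trial):
    # Search space: regularization strength (log scale) and elastic-net mixing ratio
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    # SGDClassifier with hinge loss fits a linear SVM
    return {'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

# Tune on accuracy; report accuracy, precision, recall and f1
SVM_tune = report_tuned_models(
    X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM,
    'accuracy', ('accuracy', 'precision', 'recall', 'f1'),
)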

[I 2024-05-13 06:13:56,367] A new study created in memory with name: no-name-0a4bcb7f-58f1-4a02-ab57-43cda158bfd3
[I 2024-05-13 06:14:04,023] Trial 0 finished with value: 0.7121366803212403 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7121366803212403.
[I 2024-05-13 06:14:28,612] Trial 1 finished with value: 0.8124997379116881 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8124997379116881.
[I 2024-05-13 06:14:38,656] Trial 2 finished with value: 0.7558035939419991 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8124997379116881.
[I 2024-05-13 06:15:02,275] Trial 3 finished with value: 0.8170006418727876 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8170006418727876.
[I 2024-05-13 06:15:07,739] Trial 4 finished with value: 0.61

[I 2024-05-13 07:30:40,704] Trial 0 finished with value: 0.649557926193593 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.649557926193593.
[I 2024-05-13 07:31:02,280] Trial 1 finished with value: 0.8186983973445875 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8186983973445875.
[I 2024-05-13 07:31:11,685] Trial 2 finished with value: 0.7269029111873322 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8186983973445875.
[I 2024-05-13 07:31:34,707] Trial 3 finished with value: 0.8202381359733648 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8202381359733648.
[I 2024-05-13 07:31:39,868] Trial 4 finished with value: 0.6062463770431684 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 wit

[I 2024-05-13 08:38:18,048] Trial 0 finished with value: 0.7134001369484502 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7134001369484502.
[I 2024-05-13 08:38:40,056] Trial 1 finished with value: 0.8149869696316617 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8149869696316617.
[I 2024-05-13 08:38:50,185] Trial 2 finished with value: 0.7569879992782342 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8149869696316617.
[I 2024-05-13 08:39:13,046] Trial 3 finished with value: 0.8186193460536126 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8186193460536126.
[I 2024-05-13 08:39:17,940] Trial 4 finished with value: 0.6077862715609045 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 w

[I 2024-05-13 09:52:29,048] Trial 0 finished with value: 0.6351468883199807 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6351468883199807.
[I 2024-05-13 09:52:48,491] Trial 1 finished with value: 0.8186194006147481 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8186194006147481.
[I 2024-05-13 09:52:57,478] Trial 2 finished with value: 0.7157687839375134 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8186194006147481.
[I 2024-05-13 09:53:21,702] Trial 3 finished with value: 0.8196456877801375 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8196456877801375.
[I 2024-05-13 09:53:27,123] Trial 4 finished with value: 0.6064043237361595 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       6.198499  0.82037/0.82612/0.83735   
1          TF-IDF      (1, 3)       5.289060  0.81405/0.82399/0.83735   
2         log(TF)      (1, 3)       5.998339  0.82017/0.82877/0.84070   
3     log(TF)-IDF      (1, 3)       4.965492  0.81583/0.82628/0.83774   

                 precision                   recall                       f1  \
0  0.79776/0.80639/0.81644  0.71464/0.73592/0.77343  0.75498/0.76936/0.78926   
1  0.79395/0.81123/0.82739  0.70780/0.72185/0.75338  0.75276/0.76383/0.78486   
2  0.79234/0.80563/0.81765  0.72944/0.74582/0.76642  0.76196/0.77454/0.79120   
3  0.79334/0.80972/0.82208  0.70823/0.73174/0.75739  0.76009/0.76863/0.78616   

      alpha  l1_ratio  
0  0.000084  0.003553  
1  0.000081  0.032311  
2  0.000072  0.019779  
3  0.000082  0.031673  


[I 2024-05-13 11:02:25,492] Trial 0 finished with value: 0.7056223535413488 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7056223535413488.
[I 2024-05-13 11:02:58,072] Trial 1 finished with value: 0.8179089288908618 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8179089288908618.
[I 2024-05-13 11:03:12,834] Trial 2 finished with value: 0.7524477021772621 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8179089288908618.
[I 2024-05-13 11:03:45,375] Trial 3 finished with value: 0.8168820025807417 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8179089288908618.
[I 2024-05-13 11:03:52,391] Trial 4 finished with value: 0.6134716432138223 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)       6.198499  0.82037/0.82612/0.83735   
1          TF-IDF      (1, 3)       5.289060  0.81405/0.82399/0.83735   
2         log(TF)      (1, 3)       5.998339  0.82017/0.82877/0.84070   
3     log(TF)-IDF      (1, 3)       4.965492  0.81583/0.82628/0.83774   
4              TF      (1, 4)      10.160944  0.79609/0.80591/0.81188   

                 precision                   recall                       f1  \
0  0.79776/0.80639/0.81644  0.71464/0.73592/0.77343  0.75498/0.76936/0.78926   
1  0.79395/0.81123/0.82739  0.70780/0.72185/0.75338  0.75276/0.76383/0.78486   
2  0.79234/0.80563/0.81765  0.72944/0.74582/0.76642  0.76196/0.77454/0.79120   
3  0.79334/0.80972/0.82208  0.70823/0.73174/0.75739  0.76009/0.76863/0.78616   
4  0.73593/0.76844/0.81149  0.65028/0.73323/0.80737  0.72200/0.74785/0.77000   

          alpha  l1_ratio  
0  8.374330e-05  0.003553  
1  8.058851e-05  0.03231

[I 2024-05-13 12:58:21,666] Trial 0 finished with value: 0.640871684290563 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.640871684290563.
[I 2024-05-13 12:58:48,523] Trial 1 finished with value: 0.8160135296027287 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8160135296027287.
[I 2024-05-13 12:59:02,266] Trial 2 finished with value: 0.715374119860682 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8160135296027287.
[I 2024-05-13 12:59:32,730] Trial 3 finished with value: 0.8188955890829404 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8188955890829404.
[I 2024-05-13 12:59:40,899] Trial 4 finished with value: 0.6066806758877584 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 with

[CV] END  accuracy: (test=0.822) f1: (test=0.771) precision: (test=0.787) recall: (test=0.756) roc_auc: (test=0.888) total time=   1.0s
[CV] END  accuracy: (test=0.821) f1: (test=0.761) precision: (test=0.775) recall: (test=0.748) roc_auc: (test=0.882) total time=   1.5s
[CV] END  accuracy: (test=0.821) f1: (test=0.762) precision: (test=0.816) recall: (test=0.715) total time=   6.1s
[CV] END  accuracy: (test=0.822) f1: (test=0.761) precision: (test=0.822) recall: (test=0.708) total time=   4.8s
[CV] END  accuracy: (test=0.809) f1: (test=0.770) precision: (test=0.736) recall: (test=0.807) total time=  10.6s
[CV] END  accuracy: (test=0.810) f1: (test=0.759) precision: (test=0.769) recall: (test=0.750) roc_auc: (test=0.880) total time=   1.0s
[CV] END  accuracy: (test=0.820) f1: (test=0.758) precision: (test=0.779) recall: (test=0.737) roc_auc: (test=0.881) total time=   1.0s
[CV] END  accuracy: (test=0.819) f1: (test=0.764) precision: (test=0.797) recall: (test=0.733) roc_auc: (test=0.87

[W 2024-05-13 13:44:44,669] Trial 120 failed with parameters: {'alpha': 9.056701743868603e-07, 'l1_ratio': 0.700969407316893} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "/home/tommy/.venv/lib/python3.12/site-packages/optuna/study/_optimize.py", line 196, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/tmp/ipykernel_17743/2721935248.py", line 16, in <lambda>
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ipykernel_17743/2721935248.py", line 10, in objective
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=42, shuffle=True), n_jobs=-1, verbose=0)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

KeyboardInterrupt: 

In [40]:
pd.DataFrame(SVM_tune).style.hide()

NameError: name 'SVM_tune' is not defined

### Logistic regression

In [None]:
%%time

def get_params_log_reg(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    return {'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}
    
log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

In [None]:
pd.DataFrame(log_reg_tune).style.hide()

### Perceptron

In [None]:
%%time

def get_params_perceptron(trial):
    alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    return {'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}
    
perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

In [None]:
pd.DataFrame(perceptron_tune).style.hide()

## Training the final models

Now that we've obtained the tuned hyperparameters, we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [None]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}

for n in range(len(X_trains)):
    # One classifier of each type per feature matrix, using its tuned hyperparameters.
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune[n]['alpha']))
    svm_clfs.append(SGDClassifier(loss='hinge', penalty='elasticnet', alpha=SVM_tune[n]['alpha'], l1_ratio=SVM_tune[n]['l1_ratio']))
    logreg_clfs.append(SGDClassifier(loss='log_loss', penalty='elasticnet', alpha=log_reg_tune[n]['alpha'], l1_ratio=log_reg_tune[n]['l1_ratio']))
    perceptron_clfs.append(SGDClassifier(loss='perceptron', penalty='elasticnet', alpha=perceptron_tune[n]['alpha'], l1_ratio=perceptron_tune[n]['l1_ratio']))

    # Fit each new classifier on the full training data for this vectorizer setting.
    for model in [mnB_clfs[-1], svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    # Key by model type and vectorizer setting; the 'models/' directory prefix is added when saving.
    suffix = f"{params_fixed[n]['vectorizer_type']}_{params_fixed[n]['ngram_range']}"
    models[f"mnB_{suffix}"] = mnB_clfs[-1]
    models[f"svm_{suffix}"] = svm_clfs[-1]
    models[f"logreg_{suffix}"] = logreg_clfs[-1]
    models[f"perceptron_{suffix}"] = perceptron_clfs[-1]

In [None]:
import joblib

for model_name in models:
    joblib.dump(models[model_name], 'models/'+model_name+'.joblib')

for n in range(len(vectorizers)):
    joblib.dump(vectorizers[n], f"models/vectorizers/{params_fixed[n]['vectorizer_type']}_{params_fixed[n]['ngram_range']}")

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')
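
The saved artifacts can later be restored with `joblib.load`. The following is a minimal round-trip sketch on a toy classifier; the file name `mnB_toy.joblib` and the toy data are hypothetical and only serve to show that a restored model predicts the same as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny count-like matrix standing in for a vectorized training set (assumption).
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_toy = np.array([1, 0, 1, 0])

clf = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)

# Write the fitted model to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mnB_toy.joblib')  # hypothetical file name
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should predict exactly as the original does.
same_predictions = np.array_equal(clf.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Files written by `joblib.dump` are generally only safe to load with the same scikit-learn version that produced them, so it is worth keeping the environment fixed between notebooks.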

## Probability calibration

The cross-validation scores show that the models are far from perfectly accurate (accuracies of roughly 0.8 in the outputs above). We would therefore like to model the probability $P(y\in \mathcal{C}\mid P)$ that a data point $y$ belongs to class $\mathcal{C}$ given a model's prediction $P$, which is not the same as the probability the model itself reports. (Some models, such as those trained with the hinge or perceptron loss, report no probabilities at all.) We can do this with a probability calibrator, which treats each model's prediction as a feature and fits a mapping from it to the true class probability. Fitting this mapping requires held-out data, so we'll again use a five-fold cross-validation split.
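
To make the idea concrete, here is a minimal sketch on synthetic data, not on the YouTube features. It assumes the default sigmoid calibration method of `CalibratedClassifierCV`; the names `X_toy` and `y_toy` are placeholders. A hinge-loss `SGDClassifier` has no `predict_proba` of its own, but the calibrated wrapper provides one.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic binary classification problem (assumption, for illustration only).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# A hinge-loss SGD classifier (linear SVM) only outputs decision scores.
svm = SGDClassifier(loss='hinge', random_state=0)

# The calibrator refits the classifier on each training fold and learns a
# mapping from its scores to probabilities on the corresponding held-out fold.
calibrated = CalibratedClassifierCV(svm, cv=KFold(n_splits=5, shuffle=True, random_state=42))
calibrated.fit(X_toy, y_toy)

# Calibrated class probabilities for the first five samples (one row per sample).
proba = calibrated.predict_proba(X_toy[:5])
print(proba)
```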

In [None]:
from sklearn.calibration import CalibratedClassifierCV

calibrated_clfs = {}

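for i, model_name in enumerate(models):
    # Four models were added to `models` per feature matrix, in order, so i // 4
    # recovers the index of the feature matrix each model was trained on.
    n = i // 4
    calibrated_clfs[model_name] = CalibratedClassifierCV(models[model_name], cv=KFold(n_splits=5, random_state=42, shuffle=True))
    calibrated_clfs[model_name].fit(X_trains[n], y_train)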

We'll save the calibrated models for evaluation in the next notebook:

In [None]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], 'models/'+model_name+'_calibrated.joblib')

## Stacking

Now that we have our four types of model, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a meta-classifier gathers the predictions of the individual models, treats these predictions as features, and converts them into a final prediction. We will need to train the meta-classifier with cross-validation and select a model for it. We will compare two choices: logistic regression and Gaussian naive Bayes.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
models

In [None]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

In [None]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)

In [None]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')