# YouTube popularity predictor (Part 2): text frequency-based models

In the [previous notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/youtube_predictor/notebooks/EDA.ipynb), we used natural language processing (NLP) to explore the YouTube video dataset and looked for correlations between the language features of the video titles and descriptions and the video popularity. We encoded popularity as a binary categorical variable: class 1 for videos with over 50k views and class 0 for videos with under 50k views. We found that the frequency of the tokens in the byte-pair encoded text has predictive value for this classification. In this notebook we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')
pd.options.display.float_format = '{:.6e}'.format


In [None]:
videos = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660e+00,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203e+00,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313e+00,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942e+00,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320e+00,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026e+00,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000e+00,0
31659,Gegee gegee,22,28 March 2024,,1,4.771213e-01,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364e+00,0


Let's look at the video categories:

In [None]:
videos.groupby('video_category').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,label,label,label,label,label,label,label,label
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
video_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,306.0,42.2451,23.89558,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.4084967,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.47486,22.77203,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.4134078,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.38367,22.98805,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.5591837,0.4975013,0.0,0.0,1.0,1.0,1.0
15,41.0,31.43902,23.29383,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.4634146,0.5048545,0.0,0.0,0.0,1.0,1.0
17,487.0,51.03491,19.40761,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2422998,0.4289153,0.0,0.0,0.0,0.0,1.0
19,111.0,38.67568,22.12451,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.3063063,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.15257,18.76016,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.4610282,0.4988927,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.82164,22.61416,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.2364946,0.4249657,0.0,0.0,0.0,0.0,1.0
23,220.0,27.2,24.87863,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.6545455,0.4766007,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.52605,22.92586,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.4999274,0.0,0.0,1.0,1.0,1.0


We see that category 30 only has a single member, so we will drop it.

In [None]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [None]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [None]:
videos.reset_index(drop=True, inplace=True)

Let's look at the distribution of video view counts:

In [None]:
videos[['months','video_view_count','label']].groupby('label').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,19168.0,40.84401,21.00774,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.56135,21.26779,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.6834096,4.699005,5.037442,5.433327,5.977578,8.588679


We can see that the classes are split roughly 60/40 (19,168 videos in class 0 and 12,493 in class 1). They aren't exactly balanced because the classification is based on the round milestone of 50k views: balancing the classes exactly would require a far less natural threshold.
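
As a quick sanity check on the class balance, the counts in the table above can be turned into class fractions and a majority-class baseline, which is the accuracy a model would reach by always predicting class 0. This is only a small arithmetic sketch; the variable names are illustrative and not part of the original notebook.

```python
# Class counts read off the `count` column of the table above.
n_class0 = 19168  # videos under 50k views
n_class1 = 12493  # videos over 50k views
n_total = n_class0 + n_class1

frac_class0 = n_class0 / n_total
frac_class1 = n_class1 / n_total

# A classifier that always predicts the majority class scores this accuracy.
majority_baseline = max(frac_class0, frac_class1)

print(f"class 0: {frac_class0:.3f}, class 1: {frac_class1:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model built later should be judged against this baseline rather than against 50%.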

We'll select a test set based on an 80/20 train/test split which we will then use for all future model building and validation.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
# X_test and X_train carry index labels, so select the rows by label with .loc
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]

In [None]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984e+00,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063e+00,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656e+00,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437e+00,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467e+00,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467e+00,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270e+00,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919e+00,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255e+00,1


In [None]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [None]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We will convert all text to lower case to make the vocabulary size smaller.
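
To make the idea behind byte-pair encoding concrete, here is a minimal sketch of its merge step in plain Python. It is a simplification under stated assumptions (a toy corpus, words split into characters, and hypothetical helper names `most_frequent_pair` and `merge_pair`), not the algorithm as implemented in the tokenizers library used below.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus (hypothetical), lower-cased and split into characters.
toy_words = [list(w) for w in "science science scientist physics".split()]

for _ in range(3):  # three merge steps
    pair = most_frequent_pair(toy_words)
    if pair is None:
        break
    toy_words = [merge_pair(w, pair) for w in toy_words]

print(toy_words)
```

Each merge adds one new symbol to the vocabulary, so frequent character sequences end up as single tokens while rare words stay split into smaller pieces.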

We first have to set all NA values to empty strings:

In [None]:
train = train.fillna('')

In [None]:
import os

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    # Train a BPE tokenizer on an iterable of texts, pre-tokenized with Whitespace().
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        # Make sure the target directory exists before saving.
        os.makedirs(os.path.dirname(save) or '.', exist_ok=True)
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}

In [None]:
%%time
BPE_tokenizers_uncased = {}

for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")

CPU times: user 14.2 s, sys: 5.95 s, total: 20.1 s
Wall time: 10.6 s


In [None]:
from transformers import PreTrainedTokenizerFast

tokenizers_trained_uncased = {}

for field in training_data_uncased:
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")


In [None]:
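def tokenize(text, field, cased=True):
    # Only uncased tokenizers are trained in this notebook.
    if cased:
        raise NotImplementedError("only uncased tokenizers are available")
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=True):
    if cased:
        raise NotImplementedError("only uncased tokenizers are available")
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])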


In [None]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|█████████████████| 25328/25328 [00:00<00:00, 32853.17it/s]
processing rows: 100%|█████████████████| 25328/25328 [00:01<00:00, 21845.45it/s]
processing rows: 100%|██████████████████| 25328/25328 [00:10<00:00, 2376.54it/s]


In [None]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [None]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenised text. To perform a frequency analysis on the tokenised data we can use TfidfVectorizer() from scikit-learn, which counts the tokens in each text and generates a vector describing the token frequencies. Rather than simply using the token counts in the individual samples (the *term frequency*), however, TfidfVectorizer by default also incorporates how many samples in the training corpus contain each token (the *document frequency*). The term frequency of each token $i$ is multiplied by an inverse document frequency weight, which in its textbook form is IDF = $\log(\frac{N_{\text{samples}}}{N_{\text{samples containing }i}})$ and which describes the specificity of the token to the sample. With its default setting smooth_idf=True, scikit-learn uses the smoothed variant IDF = $\ln(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}) + 1$.
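
As a rough illustration of how the IDF weight behaves, the sketch below computes it by hand for a toy corpus. The corpus and token ids are hypothetical. Both the textbook form $\log(N/\text{df})$ and the smoothed form that scikit-learn documents for smooth_idf=True are shown, so the two can be compared.

```python
import math

# Toy tokenised corpus (hypothetical token ids, stored as strings as in the notebook).
docs = [["12", "7", "7"], ["12", "30"], ["12", "7", "41"], ["55"]]
n_docs = len(docs)

# Document frequency: number of documents containing each token at least once.
df = {}
for doc in docs:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

# Textbook IDF: log(N / df).
idf_plain = {tok: math.log(n_docs / d) for tok, d in df.items()}

# Smoothed variant documented for scikit-learn's smooth_idf=True:
# ln((1 + N) / (1 + df)) + 1.
idf_smooth = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

for tok in sorted(df, key=df.get, reverse=True):
    print(tok, df[tok], round(idf_plain[tok], 3), round(idf_smooth[tok], 3))
```

In both forms a token that appears in most documents receives a smaller weight than one that appears in only a single document.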

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: we can discard tokens that appear in fewer than min_df or more than max_df documents (each given either as an absolute count or as a fraction of the documents), which allows for dimensionality reduction.
* use_idf: this allows the incorporation of the IDF factor into the vector representation of the text. Without it, the text is represented as a set of numbers corresponding to the frequency of each token or n-gram appearing in the text, with a normalisation factor. With the default option 'use_idf=True', this frequency is multiplied by the IDF weight, which suppresses tokens that appear in a large number of documents.
* norm: with 'l1', the vector of input features is normalised so that the sum of the absolute values of the features is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency tf with 1 + log(tf), so that repeated occurrences of a token count less than proportionally.
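
The effect of ngram_range, norm and sublinear_tf can be sketched in a few lines of plain Python. This is an illustration only: the helper names `ngrams` and `normalise` and the token ids are hypothetical, and the IDF factor is left out for brevity.

```python
import math
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def normalise(vec, norm):
    """Scale a dict of feature weights to unit l1 or l2 norm."""
    if norm == "l1":
        scale = sum(abs(v) for v in vec.values())
    else:  # "l2"
        scale = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / scale for k, v in vec.items()} if scale else dict(vec)

tokens = ["1815", "7", "1815", "2062"]  # hypothetical token ids
counts = Counter(ngrams(tokens, 1, 2))  # corresponds to ngram_range=(1, 2)

tf_l1 = normalise(counts, "l1")  # feature values sum to one
tf_l2 = normalise(counts, "l2")  # squared feature values sum to one

# sublinear_tf replaces a raw count tf with 1 + log(tf).
sublinear = {k: 1 + math.log(v) for k, v in counts.items()}

print(counts)
```

Because every extra value of n adds a new set of n-grams, the number of distinct features grows quickly with the upper end of ngram_range.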

We will introduce a function that trains a separate vectoriser for each of the channel names, video titles and descriptions, vectorises each field individually and then stacks the results into a single feature matrix. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [None]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [None]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_vectorized = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_vectorized[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    if video_category_encoder is not None:
        X_category = video_category_encoder.transform(train[['video_category']]).toarray()
        X_train = hstack([X_category, X_vectorized['channel_title'], X_vectorized['video_title'], X_vectorized['video_description']])
    else:
        X_train = hstack([X_vectorized['channel_title'], X_vectorized['video_title'], X_vectorized['video_description']])
    return X_train, vectorizers

Let's look at the number of features for each n-gram range:

In [None]:
for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
    _,_ = get_features(ngram_range=ngram_range)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

## Multinomial naive Bayes

We see that for the higher n-gram ranges we have millions of features (up to about 8.7 million for the video descriptions alone), which is orders of magnitude larger than the training sample size of 25,328.

We'll start our exploration of classical machine learning approaches with the multinomial naive Bayes model, which is known to perform well for text classification tasks with the tf-idf approach despite the large vocabularies. It has two main advantages: for the number of features we are considering it is comparatively fast, and it requires tuning of only one hyperparameter, the Laplace smoothing parameter $\alpha$, which can be fixed by cross-validation to minimise overfitting.
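
To show where the smoothing parameter $\alpha$ enters, here is a minimal sketch of multinomial naive Bayes in plain Python. It assumes small dense count vectors and uses hypothetical helper names (`fit_mnb`, `predict_mnb`); it is meant to illustrate the model, not to reproduce scikit-learn's MultinomialNB in every detail.

```python
import math

def fit_mnb(X, y, alpha):
    """Fit multinomial naive Bayes on dense count rows X with labels y."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_theta = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Per-feature totals for this class, with additive smoothing alpha.
        totals = [sum(r[j] for r in rows) + alpha for j in range(n_features)]
        norm = sum(totals)
        log_theta[c] = [math.log(t / norm) for t in totals]
    return log_prior, log_theta

def predict_mnb(x, log_prior, log_theta):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {c: log_prior[c] + sum(xi * lt for xi, lt in zip(x, log_theta[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# Toy data (hypothetical): three features, two classes.
X_toy = [[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]]
y_toy = [1, 1, 0, 0]
log_prior, log_theta = fit_mnb(X_toy, y_toy, alpha=0.1)
pred = predict_mnb([1, 0, 0], log_prior, log_theta)
print(pred)
```

Without smoothing ($\alpha = 0$), a feature never seen in a class would have zero probability and its logarithm would be undefined, which is why $\alpha$ must be positive and is worth tuning.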

We'll vary the n-gram range from (1,1) (only single tokens) to (1,5), as well as the vectoriser settings (use_idf = [True, False], norm = ['l1', 'l2'] and sublinear_tf = [True, False]), and use Bayesian hyperparameter tuning with the Optuna library to optimise the value of $\alpha$.
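
Since $\alpha$ can plausibly range over many orders of magnitude, it makes sense to search it on a logarithmic scale. The sketch below illustrates log-uniform sampling with plain random draws over the range 1e-9 to 1e+1 used in the tuning code below. It is a stand-in for intuition only, not Optuna's TPE sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(524)
alphas = [sample_log_uniform(rng, 1e-9, 1e1) for _ in range(1000)]

# On a log scale each decade receives roughly the same share of samples,
# so tiny values of alpha are explored as thoroughly as large ones.
share_below_1e_minus_4 = sum(a < 1e-4 for a in alphas) / len(alphas)
print(f"share of samples below 1e-4: {share_below_1e_minus_4:.2f}")
```

A uniform draw on the linear scale would instead place almost all samples between 1 and 10 and essentially never try small values of $\alpha$.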

In [None]:
from sklearn.metrics import *

import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.simplefilter("ignore", UndefinedMetricWarning)

In [None]:
from sklearn.model_selection import cross_validate, KFold
import optuna

max_trials=100

def objective(trial, X_train, y_train, estimator, get_params, scoring):
    np.random.seed(524)
    params = get_params(trial=trial)
    model = estimator(**params)
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=524, shuffle=True), n_jobs=-1, verbose=0)
    return np.mean(scores['test_score'])

def report_optuna_results(X_train, y_train, estimator, get_params, scoring):
    sampler = optuna.samplers.TPESampler(seed=524)
    study = optuna.create_study(sampler=sampler, direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
    return study.best_params

In [None]:
def report_tuned_models(X_trains, y_train, params_fixed, estimator, get_params, scoring_tune, scoring_report):
    results_list = []
    for n in range(len(X_trains)):
        X_train = X_trains[n]

        best = report_optuna_results(X_train, y_train, estimator, get_params, scoring_tune)
        model = estimator(**get_params(best=best))
        scores = cross_validate(model, X_train, y_train, scoring=scoring_report, cv=KFold(n_splits=5, random_state=524, shuffle=True), n_jobs=-1, verbose=5)

        cv_results = {}
        for param in params_fixed:
            cv_results[param] = params_fixed[param][n]
        cv_results['mean_fit_time'] = np.mean(scores['fit_time'])
        for score in scoring_report:
            cv_results[score] = f'{np.min(scores["test_"+score]):.5f}/{np.mean(scores["test_"+score]):.5f}/{np.max(scores["test_"+score]):.5f}'
        for param in best:
            cv_results[param] = best[param]
        results_list.append(cv_results)
        print(pd.DataFrame(results_list))
    return results_list

In [None]:
from sklearn.naive_bayes import MultinomialNB

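def get_params_mnB(trial=None, best=None):
    # Sample alpha from the trial during tuning, or reuse the best value afterwards.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    elif best is not None:
        alpha = best['alpha']
    else:
        raise ValueError("either trial or best must be provided")
    return {'alpha': alpha}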

In [None]:
X_trains = []
params_fixed = {'vectorizer_type': [], 'norm': [], 'ngram_range': []}

for use_idf in [False,True]:
    for norm in ['l1','l2']:
        for sublinear_tf in [False,True]:
            for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
                if use_idf == False and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF')
                elif use_idf == False and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)')
                elif use_idf == True and sublinear_tf == False:
                    params_fixed['vectorizer_type'].append('TF-IDF')
                elif use_idf == True and sublinear_tf == True:
                    params_fixed['vectorizer_type'].append('log(TF)-IDF')
                params_fixed['norm'].append(norm)
                params_fixed['ngram_range'].append(ngram_range)

                X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
                X_trains.append(X_train)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

In [None]:
%%time
mnB_tune_ngrams = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))


[I 2024-05-17 01:21:42,753] A new study created in memory with name: no-name-17b88b79-c74d-42ef-8507-bc8adbee7689
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   

                   roc_auc        alpha  
0  0.84936/0.85339/0.85823 1.053041e-01  


[I 2024-05-17 01:22:10,441] Trial 0 finished with value: 0.7974573653440917 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7974573653440917.
[I 2024-05-17 01:22:11,209] Trial 1 finished with value: 0.7944172110763781 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7974573653440917.
[I 2024-05-17 01:22:11,856] Trial 2 finished with value: 0.7930748590471521 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7974573653440917.
[I 2024-05-17 01:22:12,414] Trial 3 finished with value: 0.7121362516266039 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7974573653440917.
[I 2024-05-17 01:22:13,041] Trial 4 finished with value: 0.7960754876693783 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7974573653440917.
[I 2024-05-17 01:22:13,660] Trial 5 finished with value: 0.7960754720804825 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   

                   roc_auc        alpha  
0  0.84936/0.85339/0.85823 1.053041e-01  
1  0.87037/0.87390/0.87919 1.216302e-02  


[I 2024-05-17 01:23:12,912] Trial 0 finished with value: 0.7986418486248061 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7986418486248061.
[I 2024-05-17 01:23:14,306] Trial 1 finished with value: 0.7931935295169896 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7986418486248061.
[I 2024-05-17 01:23:15,551] Trial 2 finished with value: 0.7952071316080842 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7986418486248061.
[I 2024-05-17 01:23:16,760] Trial 3 finished with value: 0.7161239769299931 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7986418486248061.
[I 2024-05-17 01:23:17,935] Trial 4 finished with value: 0.7938250200999326 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7986418486248061.
[I 2024-05-17 01:23:19,084] Trial 5 finished with value: 0.7939039778575323 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73771/0.74568/0.75391   

                   roc_auc        alpha  
0  0.84936/0.85339/0.85823 1.053041e-01  
1  0.87037/0.87390/0.87919 1.216302e-02  
2  0.87071/0.87427/0.87885 5.705727e-03  


[I 2024-05-17 01:25:13,176] Trial 0 finished with value: 0.7990761240860522 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7990761240860522.
[I 2024-05-17 01:25:15,191] Trial 1 finished with value: 0.7906664369902675 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7990761240860522.
[I 2024-05-17 01:25:17,128] Trial 2 finished with value: 0.7921274439004352 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7990761240860522.
[I 2024-05-17 01:25:18,982] Trial 3 finished with value: 0.7167951958140696 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7990761240860522.
[I 2024-05-17 01:25:20,938] Trial 4 finished with value: 0.7878631559953531 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7990761240860522.
[I 2024-05-17 01:25:22,785] Trial 5 finished with value: 0.7878236615276573 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73771/0.74568/0.75391   
3  0.76442/0.77697/0.78845  0.70068/0.71629/0.72929  0.73986/0.74531/0.75169   

                   roc_auc        alpha  
0  0.84936/0.85339/0.85823 1.053041e-01  
1  0.87037/0.87390/0.87919 1.216302e-02  
2  0.87071/0.87427/0.87885 5.705727e-03  
3  0.86923/0.87283/0.87697 3.806735e-03 

[I 2024-05-17 01:28:21,690] Trial 0 finished with value: 0.7998657328398409 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7998657328398409.
[I 2024-05-17 01:28:24,373] Trial 1 finished with value: 0.7854152316763245 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7998657328398409.
[I 2024-05-17 01:28:26,901] Trial 2 finished with value: 0.7886529206381003 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7998657328398409.
[I 2024-05-17 01:28:29,365] Trial 3 finished with value: 0.7176638324754894 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7998657328398409.
[I 2024-05-17 01:28:31,804] Trial 4 finished with value: 0.7844283064730162 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7998657328398409.
[I 2024-05-17 01:28:34,198] Trial 5 finished with value: 0.7844282908841202 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73771/0.74568/0.75391   
3  0.76442/0.77697/0.78845  0.70068/0.71629/0.72929  0.73986/0.74531/0.75169   
4  0.76065/0.77307/0.78686  0.70505/0.71978/0.73333  0.73971/0.74539/0.75194   

                   roc_auc        alpha  
0  0.849

[I 2024-05-17 01:32:22,379] Trial 0 finished with value: 0.780045995037275 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.780045995037275.
[I 2024-05-17 01:32:22,635] Trial 1 finished with value: 0.7751502087548019 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.780045995037275.
[I 2024-05-17 01:32:22,893] Trial 2 finished with value: 0.7718337568966248 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.780045995037275.
[I 2024-05-17 01:32:23,135] Trial 3 finished with value: 0.7209012953982749 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.780045995037275.
[I 2024-05-17 01:32:23,395] Trial 4 finished with value: 0.7781903708169634 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.780045995037275.
[I 2024-05-17 01:32:23,630] Trial 5 finished with value: 0.7783877730054105 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5         log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73771/0.74568/0.75391   
3  0.76442/0.77697/0.78845  0.70068/0.71629/0.72929  0.73986/0.74531/0.75169   
4  0.76065/0.77307/0.78686  0.70505/0.71978/0.73333  

[I 2024-05-17 01:32:49,136] Trial 0 finished with value: 0.7974179098486356 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7974179098486356.
[I 2024-05-17 01:32:49,755] Trial 1 finished with value: 0.7950488497538319 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7974179098486356.
[I 2024-05-17 01:32:50,375] Trial 2 finished with value: 0.7932722222633596 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7974179098486356.
[I 2024-05-17 01:32:50,986] Trial 3 finished with value: 0.7122546960574513 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7974179098486356.
[I 2024-05-17 01:32:51,584] Trial 4 finished with value: 0.7967071964968633 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7974179098486356.
[I 2024-05-17 01:32:52,199] Trial 5 finished with value: 0.7967466441978714 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5         log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6         log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73771/0.74568/0.75391   
3  0.76442/0.77697/0.78845  0.70068/0.71629/0.72929  0.

[I 2024-05-17 01:33:52,451] Trial 0 finished with value: 0.7986418252414623 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7986418252414623.
[I 2024-05-17 01:33:53,647] Trial 1 finished with value: 0.7942199803657856 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7986418252414623.
[I 2024-05-17 01:33:54,797] Trial 2 finished with value: 0.794930701512006 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7986418252414623.
[I 2024-05-17 01:33:55,960] Trial 3 finished with value: 0.7164003134926961 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7986418252414623.
[I 2024-05-17 01:33:57,176] Trial 4 finished with value: 0.7939039856519802 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7986418252414623.
[I 2024-05-17 01:33:58,401] Trial 5 finished with value: 0.7939039856519802 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5         log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6         log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   
7         log(TF)   l1      (1, 3)   1.047504e+00  0.80178/0.80843/0.81405   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.73348/0.74077/0.74803   
2  0.76557/0.77942/0.79150  0.69631/0.71490/0.73030  0.73

[I 2024-05-17 01:35:54,010] Trial 0 finished with value: 0.7995892871548665 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7995892871548665.
[I 2024-05-17 01:35:55,910] Trial 1 finished with value: 0.7905480081483159 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7995892871548665.
[I 2024-05-17 01:35:57,946] Trial 2 finished with value: 0.7922853126489471 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7995892871548665.
[I 2024-05-17 01:35:59,903] Trial 3 finished with value: 0.7171110424333642 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7995892871548665.
[I 2024-05-17 01:36:02,363] Trial 4 finished with value: 0.7882185048767912 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7995892871548665.
[I 2024-05-17 01:36:04,203] Trial 5 finished with value: 0.788257991550039 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5         log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6         log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   
7         log(TF)   l1      (1, 3)   1.047504e+00  0.80178/0.80843/0.81405   
8         log(TF)   l1      (1, 4)   1.131295e+00  0.80197/0.80760/0.81405   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/0.70783/0.71577   
1  0.76786/0.78069/0.79388  0.68805/0.70487/0.71818  0.7334

[I 2024-05-17 01:38:48,525] Trial 0 finished with value: 0.7993524060876197 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7993524060876197.
[I 2024-05-17 01:38:50,604] Trial 1 finished with value: 0.7847835072599436 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7993524060876197.
[I 2024-05-17 01:38:52,693] Trial 2 finished with value: 0.7894030427186411 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7993524060876197.
[I 2024-05-17 01:38:54,748] Trial 3 finished with value: 0.7175848669234417 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7993524060876197.
[I 2024-05-17 01:38:56,805] Trial 4 finished with value: 0.7838361310854666 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7993524060876197.
[I 2024-05-17 01:38:58,811] Trial 5 finished with value: 0.7837571421500751 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1              TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2              TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3              TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4              TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5         log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6         log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   
7         log(TF)   l1      (1, 3)   1.047504e+00  0.80178/0.80843/0.81405   
8         log(TF)   l1      (1, 4)   1.131295e+00  0.80197/0.80760/0.81405   
9         log(TF)   l1      (1, 5)   1.508509e+00  0.80000/0.80614/0.81129   

                 precision                   recall                       f1  \
0  0.75491/0.77100/0.78688  0.64334/0.65436/0.66700  0.69706/

[I 2024-05-17 01:42:14,187] Trial 0 finished with value: 0.7817042248635875 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7817042248635875.
[I 2024-05-17 01:42:14,410] Trial 1 finished with value: 0.7754660319907527 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7817042248635875.
[I 2024-05-17 01:42:14,620] Trial 2 finished with value: 0.7715968056793464 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7817042248635875.
[I 2024-05-17 01:42:14,842] Trial 3 finished with value: 0.7777558303444873 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7817042248635875.
[I 2024-05-17 01:42:15,054] Trial 4 finished with value: 0.7798090282311007 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7817042248635875.
[I 2024-05-17 01:42:15,257] Trial 5 finished with value: 0.7798090204366528 and parameters: {'alpha': 0.00039799342667825053}. Best 

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1               TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2               TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3               TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4               TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5          log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6          log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   
7          log(TF)   l1      (1, 3)   1.047504e+00  0.80178/0.80843/0.81405   
8          log(TF)   l1      (1, 4)   1.131295e+00  0.80197/0.80760/0.81405   
9          log(TF)   l1      (1, 5)   1.508509e+00  0.80000/0.80614/0.81129   
10              TF   l2      (1, 1)   4.749570e-02  0.78148/0.79003/0.79826   

                  precision                   recal

[I 2024-05-17 01:42:35,850] Trial 0 finished with value: 0.7999448854586388 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7999448854586388.
[I 2024-05-17 01:42:36,378] Trial 1 finished with value: 0.7991947088169626 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7999448854586388.
[I 2024-05-17 01:42:36,954] Trial 2 finished with value: 0.79650981769176 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7999448854586388.
[I 2024-05-17 01:42:37,489] Trial 3 finished with value: 0.7535924649512905 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7999448854586388.
[I 2024-05-17 01:42:38,089] Trial 4 finished with value: 0.7988393521410764 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7999448854586388.
[I 2024-05-17 01:42:38,618] Trial 5 finished with value: 0.7987998732622765 and parameters: {'alpha': 0.00039799342667825053}. Best is

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   7.021747e-02  0.78144/0.78692/0.79510   
1               TF   l1      (1, 2)   3.398689e-01  0.79743/0.80539/0.81188   
2               TF   l1      (1, 3)   7.434023e-01  0.79882/0.80764/0.81362   
3               TF   l1      (1, 4)   1.245196e+00  0.80059/0.80689/0.81208   
4               TF   l1      (1, 5)   1.707485e+00  0.79901/0.80602/0.81208   
5          log(TF)   l1      (1, 1)   4.340715e-02  0.78065/0.78715/0.79570   
6          log(TF)   l1      (1, 2)   3.365528e-01  0.79882/0.80559/0.81109   
7          log(TF)   l1      (1, 3)   1.047504e+00  0.80178/0.80843/0.81405   
8          log(TF)   l1      (1, 4)   1.131295e+00  0.80197/0.80760/0.81405   
9          log(TF)   l1      (1, 5)   1.508509e+00  0.80000/0.80614/0.81129   
10              TF   l2      (1, 1)   4.749570e-02  0.78148/0.79003/0.79826   
11              TF   l2      (1, 2)   3.189680e-01  

[I 2024-05-17 01:43:32,507] Trial 0 finished with value: 0.793904001240876 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.793904001240876.
[I 2024-05-17 01:43:33,545] Trial 1 finished with value: 0.7991945763113476 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7991945763113476.
[I 2024-05-17 01:43:34,567] Trial 2 finished with value: 0.7986813742702935 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7991945763113476.
[I 2024-05-17 01:43:35,591] Trial 3 finished with value: 0.7609360898138646 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7991945763113476.
[I 2024-05-17 01:43:36,604] Trial 4 finished with value: 0.7960360087905783 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7991945763113476.
[I 2024-05-17 01:43:37,623] Trial 5 finished with value: 0.7961544766047697 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 01:45:22,938] Trial 0 finished with value: 0.7853363752465481 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7853363752465481.
[I 2024-05-17 01:45:24,688] Trial 1 finished with value: 0.7941407887747478 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7941407887747478.
[I 2024-05-17 01:45:26,355] Trial 2 finished with value: 0.7970624596393743 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7970624596393743.
[I 2024-05-17 01:45:28,059] Trial 3 finished with value: 0.7661872873333595 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7970624596393743.
[I 2024-05-17 01:45:29,764] Trial 4 finished with value: 0.7887713884522916 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7970624596393743.
[I 2024-05-17 01:45:31,457] Trial 5 finished with value: 0.7886529362269962 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 01:48:18,521] Trial 0 finished with value: 0.7790192479994575 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7790192479994575.
[I 2024-05-17 01:48:20,795] Trial 1 finished with value: 0.7892055859690584 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7892055859690584.
[I 2024-05-17 01:48:23,098] Trial 2 finished with value: 0.7944961376561861 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7944961376561861.
[I 2024-05-17 01:48:25,385] Trial 3 finished with value: 0.7703724615918834 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7944961376561861.
[I 2024-05-17 01:48:27,782] Trial 4 finished with value: 0.7830069031528153 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7944961376561861.
[I 2024-05-17 01:48:30,168] Trial 5 finished with value: 0.7832043053412624 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 01:52:08,453] Trial 0 finished with value: 0.7848627378232212 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7848627378232212.
[I 2024-05-17 01:52:08,666] Trial 1 finished with value: 0.7783877028553791 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7848627378232212.
[I 2024-05-17 01:52:08,882] Trial 2 finished with value: 0.7744395733475088 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7848627378232212.
[I 2024-05-17 01:52:09,096] Trial 3 finished with value: 0.7847442076534464 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7848627378232212.
[I 2024-05-17 01:52:09,307] Trial 4 finished with value: 0.7826122624593277 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7848627378232212.
[I 2024-05-17 01:52:09,516] Trial 5 finished with value: 0.7826122624593277 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 01:52:30,645] Trial 0 finished with value: 0.804998532695176 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.804998532695176.
[I 2024-05-17 01:52:31,219] Trial 1 finished with value: 0.8011291271114672 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.804998532695176.
[I 2024-05-17 01:52:31,773] Trial 2 finished with value: 0.7986417083247431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.804998532695176.
[I 2024-05-17 01:52:32,347] Trial 3 finished with value: 0.7676876328222644 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.804998532695176.
[I 2024-05-17 01:52:32,875] Trial 4 finished with value: 0.8040114749862525 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.804998532695176.
[I 2024-05-17 01:52:33,402] Trial 5 finished with value: 0.8040114749862525 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

[I 2024-05-17 01:53:26,504] Trial 0 finished with value: 0.8018004161455753 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8018004161455753.
[I 2024-05-17 01:53:27,633] Trial 1 finished with value: 0.8031033126793453 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8031033126793453.
[I 2024-05-17 01:53:28,649] Trial 2 finished with value: 0.8010108307751306 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8031033126793453.
[I 2024-05-17 01:53:29,679] Trial 3 finished with value: 0.7738468523486036 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8031033126793453.
[I 2024-05-17 01:53:30,712] Trial 4 finished with value: 0.8027085005080032 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.8031033126793453.
[I 2024-05-17 01:53:31,755] Trial 5 finished with value: 0.8027874738544988 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 01:55:22,216] Trial 0 finished with value: 0.7966282621226075 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7966282621226075.
[I 2024-05-17 01:55:23,727] Trial 1 finished with value: 0.799865717250945 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799865717250945.
[I 2024-05-17 01:55:25,239] Trial 2 finished with value: 0.800892277222012 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.800892277222012.
[I 2024-05-17 01:55:26,772] Trial 3 finished with value: 0.7782294054122308 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.800892277222012.
[I 2024-05-17 01:55:28,275] Trial 4 finished with value: 0.7979311274785857 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.800892277222012.
[I 2024-05-17 01:55:29,805] Trial 5 finished with value: 0.7980100930306333 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

[I 2024-05-17 01:57:58,773] Trial 0 finished with value: 0.7886923917224522 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7886923917224522.
[I 2024-05-17 01:58:00,924] Trial 1 finished with value: 0.797378275080877 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.797378275080877.
[I 2024-05-17 01:58:02,979] Trial 2 finished with value: 0.7992734093577802 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7992734093577802.
[I 2024-05-17 01:58:05,037] Trial 3 finished with value: 0.7795718197970404 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7992734093577802.
[I 2024-05-17 01:58:07,143] Trial 4 finished with value: 0.7922062691524202 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7992734093577802.
[I 2024-05-17 01:58:09,190] Trial 5 finished with value: 0.7922852347044678 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:01:27,487] Trial 0 finished with value: 0.7780324864795557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7780324864795557.
[I 2024-05-17 02:01:27,698] Trial 1 finished with value: 0.7719521701496806 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7780324864795557.
[I 2024-05-17 02:01:27,902] Trial 2 finished with value: 0.7679644760240831 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7780324864795557.
[I 2024-05-17 02:01:28,113] Trial 3 finished with value: 0.7455384151315176 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7780324864795557.
[I 2024-05-17 02:01:28,332] Trial 4 finished with value: 0.7770848608827446 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7780324864795557.
[I 2024-05-17 02:01:28,546] Trial 5 finished with value: 0.7769269297786494 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:01:49,797] Trial 0 finished with value: 0.7924036245741797 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7924036245741797.
[I 2024-05-17 02:01:50,348] Trial 1 finished with value: 0.7919692867573498 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7924036245741797.
[I 2024-05-17 02:01:50,870] Trial 2 finished with value: 0.7902715624633417 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7924036245741797.
[I 2024-05-17 02:01:51,611] Trial 3 finished with value: 0.733851490045126 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7924036245741797.
[I 2024-05-17 02:01:52,152] Trial 4 finished with value: 0.7928379000354259 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.7928379000354259.
[I 2024-05-17 02:01:52,687] Trial 5 finished with value: 0.7928773789142257 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-17 02:02:45,484] Trial 0 finished with value: 0.790192620294638 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.790192620294638.
[I 2024-05-17 02:02:46,509] Trial 1 finished with value: 0.7893241629055209 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.790192620294638.
[I 2024-05-17 02:02:47,570] Trial 2 finished with value: 0.7917326083457493 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7917326083457493.
[I 2024-05-17 02:02:48,618] Trial 3 finished with value: 0.7332987467696885 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7917326083457493.
[I 2024-05-17 02:02:49,763] Trial 4 finished with value: 0.7863234173665756 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7917326083457493.
[I 2024-05-17 02:02:50,767] Trial 5 finished with value: 0.7862839384877758 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-17 02:04:28,701] Trial 0 finished with value: 0.7881000448570479 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7881000448570479.
[I 2024-05-17 02:04:30,215] Trial 1 finished with value: 0.7849020140463747 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7881000448570479.
[I 2024-05-17 02:04:31,740] Trial 2 finished with value: 0.7881396562414628 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7881396562414628.
[I 2024-05-17 02:04:33,384] Trial 3 finished with value: 0.7323512302951485 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7881396562414628.
[I 2024-05-17 02:04:34,921] Trial 4 finished with value: 0.7796114155925593 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7881396562414628.
[I 2024-05-17 02:04:36,468] Trial 5 finished with value: 0.7796114155925593 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:07:06,019] Trial 0 finished with value: 0.7896793013368647 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7896793013368647.
[I 2024-05-17 02:07:08,099] Trial 1 finished with value: 0.7780716068137504 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7896793013368647.
[I 2024-05-17 02:07:10,125] Trial 2 finished with value: 0.7842703597800251 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7896793013368647.
[I 2024-05-17 02:07:12,135] Trial 3 finished with value: 0.7319169236561105 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7896793013368647.
[I 2024-05-17 02:07:14,186] Trial 4 finished with value: 0.7736103220315137 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7896793013368647.
[I 2024-05-17 02:07:16,209] Trial 5 finished with value: 0.7735313486850182 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:10:32,736] Trial 0 finished with value: 0.777992984217412 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.777992984217412.
[I 2024-05-17 02:10:32,939] Trial 1 finished with value: 0.7717546744278583 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.777992984217412.
[I 2024-05-17 02:10:33,146] Trial 2 finished with value: 0.7672932883177983 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.777992984217412.
[I 2024-05-17 02:10:33,378] Trial 3 finished with value: 0.7445513652170422 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.777992984217412.
[I 2024-05-17 02:10:33,580] Trial 4 finished with value: 0.7760977953793733 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.777992984217412.
[I 2024-05-17 02:10:33,792] Trial 5 finished with value: 0.7760977875849253 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

[I 2024-05-17 02:10:55,559] Trial 0 finished with value: 0.793193256711312 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.793193256711312.
[I 2024-05-17 02:10:56,067] Trial 1 finished with value: 0.7915349723238638 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.793193256711312.
[I 2024-05-17 02:10:56,651] Trial 2 finished with value: 0.7904689880351327 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.793193256711312.
[I 2024-05-17 02:10:57,196] Trial 3 finished with value: 0.733614601183431 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.793193256711312.
[I 2024-05-17 02:10:57,736] Trial 4 finished with value: 0.7926799533424347 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.793193256711312.
[I 2024-05-17 02:10:58,255] Trial 5 finished with value: 0.792640466669187 and parameters: {'alpha': 0.00039799342667825053}. Best is trial

[I 2024-05-17 02:11:51,577] Trial 0 finished with value: 0.7900346969849907 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7900346969849907.
[I 2024-05-17 02:11:52,651] Trial 1 finished with value: 0.7898768594142707 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7900346969849907.
[I 2024-05-17 02:11:53,664] Trial 2 finished with value: 0.7914956961007106 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7914956961007106.
[I 2024-05-17 02:11:54,673] Trial 3 finished with value: 0.7329828689726022 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7914956961007106.
[I 2024-05-17 02:11:55,679] Trial 4 finished with value: 0.7866788052202536 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7914956961007106.
[I 2024-05-17 02:11:56,699] Trial 5 finished with value: 0.7867577629778533 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:13:39,047] Trial 0 finished with value: 0.7876263528725854 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7876263528725854.
[I 2024-05-17 02:13:40,621] Trial 1 finished with value: 0.7848625351675749 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7876263528725854.
[I 2024-05-17 02:13:42,141] Trial 2 finished with value: 0.7883765762809494 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7883765762809494.
[I 2024-05-17 02:13:43,706] Trial 3 finished with value: 0.7323117358274527 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7883765762809494.
[I 2024-05-17 02:13:45,215] Trial 4 finished with value: 0.7794930257228473 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7883765762809494.
[I 2024-05-17 02:13:46,764] Trial 5 finished with value: 0.7794140523763518 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:16:14,868] Trial 0 finished with value: 0.7893240537832498 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7893240537832498.
[I 2024-05-17 02:16:17,153] Trial 1 finished with value: 0.7775188167716254 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7893240537832498.
[I 2024-05-17 02:16:19,360] Trial 2 finished with value: 0.783836068729883 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7893240537832498.
[I 2024-05-17 02:16:21,468] Trial 3 finished with value: 0.7317589847575674 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7893240537832498.
[I 2024-05-17 02:16:23,508] Trial 4 finished with value: 0.7735709288916411 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7893240537832498.
[I 2024-05-17 02:16:25,585] Trial 5 finished with value: 0.7734129821986501 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-17 02:19:41,711] Trial 0 finished with value: 0.7785063109696332 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7785063109696332.
[I 2024-05-17 02:19:41,918] Trial 1 finished with value: 0.7708072904589331 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7785063109696332.
[I 2024-05-17 02:19:42,138] Trial 2 finished with value: 0.7676881940225158 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7785063109696332.
[I 2024-05-17 02:19:42,349] Trial 3 finished with value: 0.789166169445842 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.789166169445842.
[I 2024-05-17 02:19:42,559] Trial 4 finished with value: 0.7760978499405089 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.789166169445842.
[I 2024-05-17 02:19:42,767] Trial 5 finished with value: 0.7760978499405089 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-17 02:20:03,951] Trial 0 finished with value: 0.7986418953914937 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7986418953914937.
[I 2024-05-17 02:20:04,483] Trial 1 finished with value: 0.7962334031845776 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7986418953914937.
[I 2024-05-17 02:20:05,023] Trial 2 finished with value: 0.7947331200512562 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7986418953914937.
[I 2024-05-17 02:20:05,531] Trial 3 finished with value: 0.7913770100419769 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7986418953914937.
[I 2024-05-17 02:20:06,095] Trial 4 finished with value: 0.7989183410764678 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.7989183410764678.
[I 2024-05-17 02:20:06,667] Trial 5 finished with value: 0.7989183332820199 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:20:59,935] Trial 0 finished with value: 0.7937854788655493 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7937854788655493.
[I 2024-05-17 02:21:00,962] Trial 1 finished with value: 0.7973783764087002 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7973783764087002.
[I 2024-05-17 02:21:02,033] Trial 2 finished with value: 0.7972600099223321 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7973783764087002.
[I 2024-05-17 02:21:03,080] Trial 3 finished with value: 0.796154429838082 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7973783764087002.
[I 2024-05-17 02:21:04,057] Trial 4 finished with value: 0.795917556565283 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7973783764087002.
[I 2024-05-17 02:21:05,089] Trial 5 finished with value: 0.7958385910132353 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:22:44,291] Trial 0 finished with value: 0.7852967950399251 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7852967950399251.
[I 2024-05-17 02:22:45,799] Trial 1 finished with value: 0.7936669798735663 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7936669798735663.
[I 2024-05-17 02:22:47,281] Trial 2 finished with value: 0.7963123141754895 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7963123141754895.
[I 2024-05-17 02:22:48,794] Trial 3 finished with value: 0.7989575393551419 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7989575393551419.
[I 2024-05-17 02:22:50,289] Trial 4 finished with value: 0.7880210559216565 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7989575393551419.
[I 2024-05-17 02:22:51,788] Trial 5 finished with value: 0.7881789948201996 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:25:18,292] Trial 0 finished with value: 0.7780321591127424 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7780321591127424.
[I 2024-05-17 02:25:20,420] Trial 1 finished with value: 0.7900346190405113 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7900346190405113.
[I 2024-05-17 02:25:22,445] Trial 2 finished with value: 0.7936275009947663 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7936275009947663.
[I 2024-05-17 02:25:24,509] Trial 3 finished with value: 0.7995891000881162 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7995891000881162.
[I 2024-05-17 02:25:26,496] Trial 4 finished with value: 0.7823751319697466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7995891000881162.
[I 2024-05-17 02:25:28,509] Trial 5 finished with value: 0.7824540897273463 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:28:41,495] Trial 0 finished with value: 0.7799669203629562 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7799669203629562.
[I 2024-05-17 02:28:41,713] Trial 1 finished with value: 0.7720706613472157 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7799669203629562.
[I 2024-05-17 02:28:41,939] Trial 2 finished with value: 0.769583219177148 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7799669203629562.
[I 2024-05-17 02:28:42,167] Trial 3 finished with value: 0.7922457558256679 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7922457558256679.
[I 2024-05-17 02:28:42,375] Trial 4 finished with value: 0.7772822085100561 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7922457558256679.
[I 2024-05-17 02:28:42,593] Trial 5 finished with value: 0.7772427218368084 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-17 02:29:04,144] Trial 0 finished with value: 0.8018794830254462 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8018794830254462.
[I 2024-05-17 02:29:04,667] Trial 1 finished with value: 0.7986417940636704 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8018794830254462.
[I 2024-05-17 02:29:05,220] Trial 2 finished with value: 0.7958384507131725 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8018794830254462.
[I 2024-05-17 02:29:05,768] Trial 3 finished with value: 0.7950488419593839 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8018794830254462.
[I 2024-05-17 02:29:06,291] Trial 4 finished with value: 0.8008528529043477 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8018794830254462.
[I 2024-05-17 02:29:06,839] Trial 5 finished with value: 0.8008528606987957 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 02:30:00,197] Trial 0 finished with value: 0.7972994576233402 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7972994576233402.
[I 2024-05-17 02:30:01,203] Trial 1 finished with value: 0.7991155795815083 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7991155795815083.
[I 2024-05-17 02:30:02,232] Trial 2 finished with value: 0.7991551052269956 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7991551052269956.
[I 2024-05-17 02:30:03,264] Trial 3 finished with value: 0.7985232560994479 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7991551052269956.
[I 2024-05-17 02:30:04,308] Trial 4 finished with value: 0.799233985040116 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.799233985040116.
[I 2024-05-17 02:30:05,312] Trial 5 finished with value: 0.7992339772456681 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:31:43,634] Trial 0 finished with value: 0.7891266515948026 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7891266515948026.
[I 2024-05-17 02:31:45,144] Trial 1 finished with value: 0.7956409627857981 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7956409627857981.
[I 2024-05-17 02:31:46,663] Trial 2 finished with value: 0.797891609627546 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.797891609627546.
[I 2024-05-17 02:31:48,155] Trial 3 finished with value: 0.8005762903026545 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8005762903026545.
[I 2024-05-17 02:31:49,681] Trial 4 finished with value: 0.7922061756190448 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8005762903026545.
[I 2024-05-17 02:31:51,169] Trial 5 finished with value: 0.7922851411710925 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:34:17,369] Trial 0 finished with value: 0.7821382353136037 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7821382353136037.
[I 2024-05-17 02:34:19,387] Trial 1 finished with value: 0.7930352320738414 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7930352320738414.
[I 2024-05-17 02:34:21,414] Trial 2 finished with value: 0.7957989016843412 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7957989016843412.
[I 2024-05-17 02:34:23,520] Trial 3 finished with value: 0.8016422200302502 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8016422200302502.
[I 2024-05-17 02:34:25,521] Trial 4 finished with value: 0.7854942595839558 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8016422200302502.
[I 2024-05-17 02:34:27,614] Trial 5 finished with value: 0.7854152940319082 and parameters: {'alpha': 0.00039799342667825053}. Best 

In [None]:
mnB_tune_ngrams = pd.DataFrame(mnB_tune_ngrams)
display(mnB_tune_ngrams.style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 1)",0.070217,0.78144/0.78692/0.79510,0.75491/0.77100/0.78688,0.64334/0.65436/0.66700,0.69706/0.70783/0.71577,0.84936/0.85339/0.85823,0.105304
TF,l1,"(1, 2)",0.339869,0.79743/0.80539/0.81188,0.76786/0.78069/0.79388,0.68805/0.70487/0.71818,0.73348/0.74077/0.74803,0.87037/0.87390/0.87919,0.012163
TF,l1,"(1, 3)",0.743402,0.79882/0.80764/0.81362,0.76557/0.77942/0.79150,0.69631/0.71490/0.73030,0.73771/0.74568/0.75391,0.87071/0.87427/0.87885,0.005706
TF,l1,"(1, 4)",1.245196,0.80059/0.80689/0.81208,0.76442/0.77697/0.78845,0.70068/0.71629/0.72929,0.73986/0.74531/0.75169,0.86923/0.87283/0.87697,0.003807
TF,l1,"(1, 5)",1.707485,0.79901/0.80602/0.81208,0.76065/0.77307/0.78686,0.70505/0.71978/0.73333,0.73971/0.74539/0.75194,0.86758/0.87125/0.87503,0.002765
log(TF),l1,"(1, 1)",0.043407,0.78065/0.78715/0.79570,0.75701/0.77264/0.78900,0.63994/0.65267/0.66700,0.69766/0.70752/0.71605,0.85007/0.85372/0.85846,0.10566
log(TF),l1,"(1, 2)",0.336553,0.79882/0.80559/0.81109,0.76598/0.77847/0.78818,0.69534/0.70904/0.71970,0.73317/0.74208/0.74803,0.87070/0.87427/0.87954,0.011537
log(TF),l1,"(1, 3)",1.047504,0.80178/0.80843/0.81405,0.76226/0.77688/0.78898,0.70845/0.72187/0.73384,0.74128/0.74830/0.75481,0.87109/0.87467/0.87922,0.005385
log(TF),l1,"(1, 4)",1.131295,0.80197/0.80760/0.81405,0.76252/0.77461/0.78803,0.71040/0.72277/0.73384,0.74194/0.74772/0.75289,0.86953/0.87317/0.87726,0.003647
log(TF),l1,"(1, 5)",1.508509,0.80000/0.80614/0.81129,0.76413/0.77406/0.78607,0.70554/0.71847/0.73232,0.74000/0.74515/0.75130,0.86784/0.87152/0.87524,0.002855


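We can see that the (1,3) n-gram range consistently outperformed the shorter ranges during cross-validation: for TF with L$^1$ norm, accuracy rose from about 0.787 at (1,1) to about 0.808 at (1,3). Extending the range to (1,4) and (1,5) gave no further improvement and only increased the fit time. Let's fix the range at (1,3) and look at the dependence on the other hyperparameters: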

In [None]:
display(mnB_tune_ngrams[mnB_tune_ngrams['ngram_range']==(1,3)].style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 3)",0.743402,0.79882/0.80764/0.81362,0.76557/0.77942/0.79150,0.69631/0.71490/0.73030,0.73771/0.74568/0.75391,0.87071/0.87427/0.87885,0.005706
log(TF),l1,"(1, 3)",1.047504,0.80178/0.80843/0.81405,0.76226/0.77688/0.78898,0.70845/0.72187/0.73384,0.74128/0.74830/0.75481,0.87109/0.87467/0.87922,0.005385
TF,l2,"(1, 3)",0.690687,0.80932/0.81586/0.82254,0.75050/0.76952/0.78202,0.74879/0.76149/0.76935,0.75507/0.76540/0.77107,0.87997/0.88502/0.89157,0.067011
log(TF),l2,"(1, 3)",0.705759,0.81287/0.81973/0.82511,0.76082/0.78289/0.79759,0.74101/0.75167/0.75836,0.75692/0.76687/0.77165,0.87795/0.88278/0.88985,0.095365
TF-IDF,l1,"(1, 3)",0.675048,0.79941/0.80666/0.81382,0.77280/0.78460/0.79427,0.68902/0.70307/0.71869,0.73624/0.74152/0.75112,0.86977/0.87353/0.87793,0.009836
log(TF)-IDF,l1,"(1, 3)",0.690587,0.79921/0.80701/0.81422,0.77534/0.78576/0.79506,0.68805/0.70247/0.71818,0.73578/0.74171/0.75139,0.86994/0.87368/0.87804,0.009997
TF-IDF,l2,"(1, 3)",0.697649,0.81445/0.81854/0.82409,0.77070/0.78928/0.79968,0.72546/0.73698/0.74438,0.75546/0.76214/0.76755,0.87583/0.88024/0.88767,0.197622
log(TF)-IDF,l2,"(1, 3)",0.694104,0.81623/0.82099/0.82527,0.76617/0.78660/0.79989,0.73810/0.74989/0.75736,0.76085/0.76770/0.77102,0.87584/0.88039/0.88762,0.182763


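We find that L$^2$ normalisation performs better than L$^1$ (accuracy of roughly 0.816 to 0.821 compared with roughly 0.807 to 0.808). There is no clear improvement from including the IDF factor or from using log(TF) instead of TF: under L$^2$ normalisation the four weightings differ by about 0.005 in accuracy, which is smaller than the spread between the lowest and highest values reported for each configuration.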
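
To make the hyperparameters compared above concrete, here is a minimal sketch on a toy corpus. It assumes the features are built with scikit-learn's `TfidfVectorizer` (imported at the top of this notebook); the corpus and the `vectorise` helper are made up for illustration and are not part of the notebook's `get_features` function.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), standing in for the video titles/descriptions.
docs = [
    "science for kids science experiments",
    "master of science in engineering",
    "science quiz for kids",
]

def vectorise(norm, use_idf, sublinear_tf):
    """Fit a (1,3)-gram vectorizer with the given weighting options."""
    vec = TfidfVectorizer(ngram_range=(1, 3), norm=norm,
                          use_idf=use_idf, sublinear_tf=sublinear_tf)
    return vec, vec.fit_transform(docs)

# L1 norm: the entries of each row sum to 1.
_, X_l1 = vectorise('l1', use_idf=False, sublinear_tf=False)
l1_row_sums = np.asarray(X_l1.sum(axis=1)).ravel()

# L2 norm: each row has unit Euclidean length.
_, X_l2 = vectorise('l2', use_idf=False, sublinear_tf=False)
l2_row_norms = np.sqrt(np.asarray(X_l2.multiply(X_l2).sum(axis=1)).ravel())

# TF vs log(TF): sublinear_tf replaces a raw count tf by 1 + log(tf).
vec_raw, X_raw = vectorise(None, use_idf=False, sublinear_tf=False)
_, X_log = vectorise(None, use_idf=False, sublinear_tf=True)
idx = vec_raw.vocabulary_['science']   # 'science' occurs twice in docs[0]
raw_count = X_raw[0, idx]
log_count = X_log[0, idx]

print("L1 row sums:  ", l1_row_sums)
print("L2 row norms: ", l2_row_norms)
print("TF vs log(TF) weight for 'science' in docs[0]:", raw_count, log_count)
```

Setting `use_idf=True` additionally multiplies each column by its inverse document frequency before the rows are normalised, which gives the TF-IDF and log(TF)-IDF variants in the table.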

## Dimensionality reduction

Our best performing models use the (1,3) n-gram range, which requires over 2 million features. We will now look at reducing the number of features by setting minimum and maximum document frequency filters, which drop tokens from the vocabulary that are either too rare or too common. I'll show results for log(TF)-IDF with L$^2$ norm.
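
As a rough illustration of how document frequency filtering shrinks the vocabulary, here is a small sketch on a toy corpus. It assumes the filters behave like the `min_df` and `max_df` arguments of scikit-learn's `TfidfVectorizer`; the corpus and the `vocabulary_size` helper are hypothetical and only serve the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical); 'science' appears in every document.
docs = [
    "science for kids",
    "science quiz for kids",
    "master of science in engineering",
    "science of broken bones",
]

def vocabulary_size(min_df=1, max_df=1.0):
    """Fit a (1,3)-gram vectorizer and return (number of n-grams kept, sorted n-grams)."""
    vec = TfidfVectorizer(ngram_range=(1, 3), min_df=min_df, max_df=max_df)
    vec.fit(docs)
    return len(vec.vocabulary_), sorted(vec.vocabulary_)

# No filtering: every unigram, bigram and trigram is kept.
n_all, _ = vocabulary_size()

# An integer min_df drops n-grams that occur in fewer than that many documents.
n_min2, kept_min2 = vocabulary_size(min_df=2)

# A float max_df drops n-grams that occur in more than that fraction of documents.
n_band, kept_band = vocabulary_size(min_df=2, max_df=0.9)

print(n_all, n_min2, n_band)
print(kept_min2)
print(kept_band)
```

Most of the reduction comes from `min_df`, because the majority of longer n-grams occur in only one document; `max_df` only removes the few n-grams that appear almost everywhere.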

In [None]:
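%%time

# Build one training matrix per (min_df, max_df) pair.
# NOTE: ngram_range is not set in this cell; it keeps whatever value an earlier
# cell last assigned (the log below reports the (1, 5) range).
X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:  # minimum document frequency
    for max_df in [1.0, 0.9, 0.8, 0.7]:  # maximum document frequency (fraction of documents)
        # use_idf=True with sublinear_tf=True is the log(TF)-IDF weighting
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

# Tune alpha for each feature matrix, selecting on accuracy and reporting the other metrics
mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))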

Fit tfidf vectorizer with 5126 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 28144 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 586326 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 5126 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 28144 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 586326 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 5126 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 28144 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 586326 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 5126 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 28144 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 586325 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 2375 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 11320 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 264187 features in the 

[I 2024-05-17 02:49:44,441] A new study created in memory with name: no-name-c67bd3a5-edd1-44c2-9f60-f293c64383eb
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[I 2024-05-17 02:51:39,414] Trial 0 finished with value: 0.8096968310502746 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:51:40,510] Trial 1 finished with value: 0.8088677901843738 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:51:41,619] Trial 2 finished with value: 0.8059855280485158 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:51:42,721] Trial 3 finished with value: 0.7814275765229669 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:51:43,812] Trial 4 finished with value: 0.809973237763009 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.809973237763009.
[I 2024-05-17 02:51:44,897] Trial 5 finished with value: 0.8099337666786571 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:53:31,840] Trial 0 finished with value: 0.8096968310502746 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:53:32,923] Trial 1 finished with value: 0.8088677901843738 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:53:34,044] Trial 2 finished with value: 0.8059855280485158 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:53:35,134] Trial 3 finished with value: 0.7814275765229669 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096968310502746.
[I 2024-05-17 02:53:36,219] Trial 4 finished with value: 0.809973237763009 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.809973237763009.
[I 2024-05-17 02:53:37,323] Trial 5 finished with value: 0.8099337666786571 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-17 02:55:23,860] Trial 0 finished with value: 0.8096573521714747 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096573521714747.
[I 2024-05-17 02:55:24,958] Trial 1 finished with value: 0.8088677901843738 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096573521714747.
[I 2024-05-17 02:55:26,081] Trial 2 finished with value: 0.8060250069273156 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096573521714747.
[I 2024-05-17 02:55:27,180] Trial 3 finished with value: 0.7813091398865674 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096573521714747.
[I 2024-05-17 02:55:28,271] Trial 4 finished with value: 0.8099337588842092 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8099337588842092.
[I 2024-05-17 02:55:29,384] Trial 5 finished with value: 0.8100522189039525 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   

                   roc_auc        alpha  
0  0.86466/0.87098/0.87777 3.223657e-04  
1  0.86466/0.87098/0.87777 3.223657e-04  
2  0.86466/0.87098/0.87777 3.223657e-04  
3  0.86501/0.87173/0.87913 7.680699e-03  


[I 2024-05-17 02:57:16,007] Trial 0 finished with value: 0.795878101069827 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:57:16,828] Trial 1 finished with value: 0.7943778257309535 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:57:17,630] Trial 2 finished with value: 0.7930355204684152 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:57:18,433] Trial 3 finished with value: 0.7709648474295274 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:57:19,244] Trial 4 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:57:20,030] Trial 5 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00039799342667825053}. Best is trial

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   
4  0.75763/0.77902/0.79189  0.65452/0.67469/0.69396  0.71452/0.72290/0.73293   

                   roc_auc        alpha  
0  0.86466/0.87098/0.87777 3.223657e-04  
1  0.86466/0.87098/0.87777 3.223657e-04  
2  0.86466/0.87098/0.87777 3.223

[I 2024-05-17 02:58:38,444] Trial 0 finished with value: 0.795878101069827 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:58:39,267] Trial 1 finished with value: 0.7943778257309535 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:58:40,060] Trial 2 finished with value: 0.7930355204684152 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:58:40,868] Trial 3 finished with value: 0.7709648474295274 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:58:41,668] Trial 4 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 02:58:42,464] Trial 5 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00039799342667825053}. Best is trial

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5      10    0.9   2.179671e-01  0.78973/0.79604/0.80375   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   
4  0.75763/0.77902/0.79189  0.65452/0.67469/0.69396  0.71452/0.72290/0.73293   
5  0.75763/0.77902/0.79189  0.65452/0.67469/0.69396  0.71452/0.72290/0.73293   

                  

[I 2024-05-17 03:00:00,616] Trial 0 finished with value: 0.795878101069827 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 03:00:01,442] Trial 1 finished with value: 0.7943778257309535 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 03:00:02,258] Trial 2 finished with value: 0.7930355204684152 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 03:00:03,066] Trial 3 finished with value: 0.7709648474295274 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 03:00:03,882] Trial 4 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.795878101069827.
[I 2024-05-17 03:00:04,683] Trial 5 finished with value: 0.795285832148902 and parameters: {'alpha': 0.00039799342667825053}. Best is trial

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5      10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6      10    0.8   2.189385e-01  0.78973/0.79604/0.80375   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   
4  0.75763/0.77902/0.79189  0.65452/0.67469/0.69396  0.71452/0.72290/0.73293   
5  0.75763/0.77902/0.79189  0.65452/0.6

[I 2024-05-17 03:01:22,884] Trial 0 finished with value: 0.7959965377062265 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7959965377062265.
[I 2024-05-17 03:01:23,693] Trial 1 finished with value: 0.7942988679733539 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7959965377062265.
[I 2024-05-17 03:01:24,530] Trial 2 finished with value: 0.7929170760375677 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7959965377062265.
[I 2024-05-17 03:01:25,350] Trial 3 finished with value: 0.7712017518801183 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7959965377062265.
[I 2024-05-17 03:01:26,161] Trial 4 finished with value: 0.7951673877180545 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7959965377062265.
[I 2024-05-17 03:01:26,974] Trial 5 finished with value: 0.7952068665968544 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5      10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6      10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7      10    0.7   2.181380e-01  0.78934/0.79600/0.80375   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   
4  0.75763/0.77902/0.79189  0.65452/0.67469/0.69396  0.7145

[I 2024-05-17 03:02:44,826] Trial 0 finished with value: 0.7802434361979619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:02:45,451] Trial 1 finished with value: 0.7798486396155154 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:02:46,056] Trial 2 finished with value: 0.7800854972994187 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:02:46,667] Trial 3 finished with value: 0.7671746256424087 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:02:47,263] Trial 4 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:02:47,868] Trial 5 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5      10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6      10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7      10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8      20    1.0   1.146873e-01  0.77083/0.78056/0.79072   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/0.80315  0.69388/0.71169/0.72741  0.73784/0.74739/0.75384   

[I 2024-05-17 03:03:46,693] Trial 0 finished with value: 0.7802434361979619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:03:47,308] Trial 1 finished with value: 0.7798486396155154 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:03:47,909] Trial 2 finished with value: 0.7800854972994187 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:03:48,537] Trial 3 finished with value: 0.7671746256424087 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:03:49,142] Trial 4 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:03:49,761] Trial 5 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1       5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2       5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3       5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4      10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5      10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6      10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7      10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8      20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9      20    0.9   1.250141e-01  0.77083/0.78056/0.79072   

                 precision                   recall                       f1  \
0  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2  0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
3  0.76126/0.78730/

[I 2024-05-17 03:04:48,158] Trial 0 finished with value: 0.7802434361979619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:04:48,776] Trial 1 finished with value: 0.7798486396155154 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:04:49,371] Trial 2 finished with value: 0.7800854972994187 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:04:49,973] Trial 3 finished with value: 0.7671746256424087 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:04:50,569] Trial 4 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7802434361979619.
[I 2024-05-17 03:04:51,173] Trial 5 finished with value: 0.7801644784403622 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   

                  precision                   recall                       f1  \
0   0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1   0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
2   0.75769/0.78147/0.80

[I 2024-05-17 03:05:49,190] Trial 0 finished with value: 0.7802039417302662 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7802039417302662.
[I 2024-05-17 03:05:49,805] Trial 1 finished with value: 0.779967076251915 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7802039417302662.
[I 2024-05-17 03:05:50,410] Trial 2 finished with value: 0.7802039339358181 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7802039417302662.
[I 2024-05-17 03:05:51,097] Trial 3 finished with value: 0.7671746412313045 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7802039417302662.
[I 2024-05-17 03:05:51,706] Trial 4 finished with value: 0.7803223939555615 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.7803223939555615.
[I 2024-05-17 03:05:52,323] Trial 5 finished with value: 0.7803223939555615 and parameters: {'alpha': 0.00039799342667825053}. Best i

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   

                  precision                   recall                       f1  \
0   0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.74978/0.75595   
1   0.75769/0.78147/0.80111  0.70068/0.72093

[I 2024-05-17 03:06:50,673] Trial 0 finished with value: 0.7578176714944178 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:06:51,153] Trial 1 finished with value: 0.7577782159989617 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:06:51,634] Trial 2 finished with value: 0.7576202848948667 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:06:52,117] Trial 3 finished with value: 0.7556066204481885 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:06:52,599] Trial 4 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:06:53,081] Trial 5 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   
12      50    1.0   9.002075e-02  0.74615/0.75794/0.76624   

                  precision                   recall                       f1  \
0   0.75769/0.78147/0.80111  0.70068/0.72093/0.73190  0.74311/0.

[I 2024-05-17 03:07:38,536] Trial 0 finished with value: 0.7578176714944178 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:07:39,030] Trial 1 finished with value: 0.7577782159989617 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:07:39,492] Trial 2 finished with value: 0.7576202848948667 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:07:40,015] Trial 3 finished with value: 0.7556066204481885 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:07:40,471] Trial 4 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:07:40,927] Trial 5 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   
12      50    1.0   9.002075e-02  0.74615/0.75794/0.76624   
13      50    0.9   8.203011e-02  0.74615/0.75794/0.76624   

                  precision                   recall                       f1  \
0  

[I 2024-05-17 03:08:26,657] Trial 0 finished with value: 0.7578176714944178 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:08:27,126] Trial 1 finished with value: 0.7577782159989617 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:08:27,582] Trial 2 finished with value: 0.7576202848948667 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:08:28,079] Trial 3 finished with value: 0.7556066204481885 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:08:28,551] Trial 4 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7578176714944178.
[I 2024-05-17 03:08:29,010] Trial 5 finished with value: 0.7577782004100658 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   
12      50    1.0   9.002075e-02  0.74615/0.75794/0.76624   
13      50    0.9   8.203011e-02  0.74615/0.75794/0.76624   
14      50    0.8   8.774781e-02  0.74615/0.75794/0.76624   

                  preci

[I 2024-05-17 03:09:14,474] Trial 0 finished with value: 0.7579755948040651 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7579755948040651.
[I 2024-05-17 03:09:14,932] Trial 1 finished with value: 0.7580150892717608 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7580150892717608.
[I 2024-05-17 03:09:15,388] Trial 2 finished with value: 0.7578571581676655 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7580150892717608.
[I 2024-05-17 03:09:15,853] Trial 3 finished with value: 0.7554487049329891 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7580150892717608.
[I 2024-05-17 03:09:16,429] Trial 4 finished with value: 0.7580150736828649 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7580150892717608.
[I 2024-05-17 03:09:16,905] Trial 5 finished with value: 0.7580150736828649 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   
12      50    1.0   9.002075e-02  0.74615/0.75794/0.76624   
13      50    0.9   8.203011e-02  0.74615/0.75794/0.76624   
14      50    0.8   8.774781e-02  0.74615/0.75794/0.76624   
15      50    0.7   8.83

[I 2024-05-17 03:10:02,511] Trial 0 finished with value: 0.7392611564856237 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:02,887] Trial 1 finished with value: 0.7392216698123759 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:03,272] Trial 2 finished with value: 0.7392216698123759 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:03,645] Trial 3 finished with value: 0.7395771511994291 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:10:04,029] Trial 4 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:10:04,396] Trial 5 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   3.914842e-01  0.80497/0.81021/0.81579   
1        5    0.9   3.812266e-01  0.80497/0.81021/0.81579   
2        5    0.8   3.951408e-01  0.80497/0.81021/0.81579   
3        5    0.7   3.747277e-01  0.80320/0.81025/0.81639   
4       10    1.0   2.201627e-01  0.78973/0.79604/0.80375   
5       10    0.9   2.179671e-01  0.78973/0.79604/0.80375   
6       10    0.8   2.189385e-01  0.78973/0.79604/0.80375   
7       10    0.7   2.181380e-01  0.78934/0.79600/0.80375   
8       20    1.0   1.146873e-01  0.77083/0.78056/0.79072   
9       20    0.9   1.250141e-01  0.77083/0.78056/0.79072   
10      20    0.8   1.201116e-01  0.77083/0.78056/0.79072   
11      20    0.7   1.217405e-01  0.77063/0.78056/0.79072   
12      50    1.0   9.002075e-02  0.74615/0.75794/0.76624   
13      50    0.9   8.203011e-02  0.74615/0.75794/0.76624   
14      50    0.8   8.774781e-02  0.74615/0.75794/0.76624   
15      50    0.7   8.83

[I 2024-05-17 03:10:41,638] Trial 0 finished with value: 0.7392611564856237 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:42,051] Trial 1 finished with value: 0.7392216698123759 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:42,429] Trial 2 finished with value: 0.7392216698123759 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:10:42,805] Trial 3 finished with value: 0.7395771511994291 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:10:43,182] Trial 4 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:10:43,701] Trial 5 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:11:20,281] Trial 0 finished with value: 0.7392611564856237 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:11:20,666] Trial 1 finished with value: 0.7392216698123759 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:11:21,041] Trial 2 finished with value: 0.7392216698123759 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7392611564856237.
[I 2024-05-17 03:11:21,432] Trial 3 finished with value: 0.7395771511994291 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:11:21,828] Trial 4 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7395771511994291.
[I 2024-05-17 03:11:22,207] Trial 5 finished with value: 0.7392216698123759 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:11:59,578] Trial 0 finished with value: 0.7393401064487755 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7393401064487755.
[I 2024-05-17 03:11:59,956] Trial 1 finished with value: 0.7393795931220233 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7393795931220233.
[I 2024-05-17 03:12:00,331] Trial 2 finished with value: 0.7393795931220233 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7393795931220233.
[I 2024-05-17 03:12:00,705] Trial 3 finished with value: 0.7397745689767722 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7397745689767722.
[I 2024-05-17 03:12:01,085] Trial 4 finished with value: 0.7393401064487755 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7397745689767722.
[I 2024-05-17 03:12:01,473] Trial 5 finished with value: 0.7393401064487755 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:12:38,423] Trial 0 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:12:38,733] Trial 1 finished with value: 0.7297853214177009 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:12:39,106] Trial 2 finished with value: 0.7297853214177009 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:12:39,427] Trial 3 finished with value: 0.7284429849773708 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:12:39,750] Trial 4 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:12:40,186] Trial 5 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:13:10,673] Trial 0 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:10,986] Trial 1 finished with value: 0.7297853214177009 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:11,311] Trial 2 finished with value: 0.7297853214177009 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:11,631] Trial 3 finished with value: 0.7284429849773708 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:11,940] Trial 4 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:12,271] Trial 5 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:13:43,720] Trial 0 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:44,020] Trial 1 finished with value: 0.7297853214177009 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:44,343] Trial 2 finished with value: 0.7297853214177009 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:44,662] Trial 3 finished with value: 0.7284429849773708 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:44,994] Trial 4 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7297853214177009.
[I 2024-05-17 03:13:45,324] Trial 5 finished with value: 0.7297853214177009 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:14:16,561] Trial 0 finished with value: 0.7295879114348058 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7295879114348058.
[I 2024-05-17 03:14:16,876] Trial 1 finished with value: 0.7295879114348058 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7295879114348058.
[I 2024-05-17 03:14:17,205] Trial 2 finished with value: 0.7295879114348058 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7295879114348058.
[I 2024-05-17 03:14:17,529] Trial 3 finished with value: 0.7286404105491617 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7295879114348058.
[I 2024-05-17 03:14:17,867] Trial 4 finished with value: 0.7295879114348058 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7295879114348058.
[I 2024-05-17 03:14:18,186] Trial 5 finished with value: 0.7295879114348058 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:14:49,631] Trial 0 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:14:49,900] Trial 1 finished with value: 0.6983180438741681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:14:50,154] Trial 2 finished with value: 0.6983180438741681 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:14:50,440] Trial 3 finished with value: 0.6990682438991882 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:14:50,713] Trial 4 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:14:50,990] Trial 5 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:15:16,696] Trial 0 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:16,962] Trial 1 finished with value: 0.6983180438741681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:17,225] Trial 2 finished with value: 0.6983180438741681 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:17,489] Trial 3 finished with value: 0.6990682438991882 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:15:17,751] Trial 4 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:15:18,023] Trial 5 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:15:42,601] Trial 0 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:42,853] Trial 1 finished with value: 0.6983180438741681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:43,121] Trial 2 finished with value: 0.6983180438741681 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6983180438741681.
[I 2024-05-17 03:15:43,382] Trial 3 finished with value: 0.6990682438991882 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:15:43,656] Trial 4 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6990682438991882.
[I 2024-05-17 03:15:43,926] Trial 5 finished with value: 0.6983180438741681 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:16:09,195] Trial 0 finished with value: 0.6981601361534165 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6981601361534165.
[I 2024-05-17 03:16:09,465] Trial 1 finished with value: 0.6981601361534165 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6981601361534165.
[I 2024-05-17 03:16:09,740] Trial 2 finished with value: 0.6981601361534165 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6981601361534165.
[I 2024-05-17 03:16:10,044] Trial 3 finished with value: 0.6987129106066459 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6987129106066459.
[I 2024-05-17 03:16:10,318] Trial 4 finished with value: 0.6981601361534165 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6987129106066459.
[I 2024-05-17 03:16:10,586] Trial 5 finished with value: 0.6981601361534165 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:16:36,122] Trial 0 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:36,366] Trial 1 finished with value: 0.6734840363860417 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:36,509] Trial 2 finished with value: 0.6734840363860417 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:36,728] Trial 3 finished with value: 0.6733655529829546 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:36,974] Trial 4 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:37,204] Trial 5 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:16:59,213] Trial 0 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:59,445] Trial 1 finished with value: 0.6734840363860417 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:59,680] Trial 2 finished with value: 0.6734840363860417 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:16:59,935] Trial 3 finished with value: 0.6733655529829546 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:00,077] Trial 4 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:00,196] Trial 5 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:17:19,668] Trial 0 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:19,896] Trial 1 finished with value: 0.6734840363860417 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:20,125] Trial 2 finished with value: 0.6734840363860417 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:20,356] Trial 3 finished with value: 0.6733655529829546 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:20,594] Trial 4 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6734840363860417.
[I 2024-05-17 03:17:20,831] Trial 5 finished with value: 0.6734840363860417 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-17 03:17:43,533] Trial 0 finished with value: 0.6729312697272605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6729312697272605.
[I 2024-05-17 03:17:43,764] Trial 1 finished with value: 0.6729312697272605 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6729312697272605.
[I 2024-05-17 03:17:43,995] Trial 2 finished with value: 0.6729312697272605 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6729312697272605.
[I 2024-05-17 03:17:44,230] Trial 3 finished with value: 0.6730102508682041 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6730102508682041.
[I 2024-05-17 03:17:44,462] Trial 4 finished with value: 0.6729312697272605 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6730102508682041.
[I 2024-05-17 03:17:44,714] Trial 5 finished with value: 0.6729312697272605 and parameters: {'alpha': 0.00039799342667825053}. Best 

In [None]:
pd.DataFrame(mnB_tune_dim_reduction).style.hide()

min_df,max_df,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
5,1.0,0.391484,0.80497/0.81021/0.81579,0.75769/0.78147/0.80111,0.70068/0.72093/0.73190,0.74311/0.74978/0.75595,0.86466/0.87098/0.87777,0.000322
5,0.9,0.381227,0.80497/0.81021/0.81579,0.75769/0.78147/0.80111,0.70068/0.72093/0.73190,0.74311/0.74978/0.75595,0.86466/0.87098/0.87777,0.000322
5,0.8,0.395141,0.80497/0.81021/0.81579,0.75769/0.78147/0.80111,0.70068/0.72093/0.73190,0.74311/0.74978/0.75595,0.86466/0.87098/0.87777,0.000322
5,0.7,0.374728,0.80320/0.81025/0.81639,0.76126/0.78730/0.80315,0.69388/0.71169/0.72741,0.73784/0.74739/0.75384,0.86501/0.87173/0.87913,0.007681
10,1.0,0.220163,0.78973/0.79604/0.80375,0.75763/0.77902/0.79189,0.65452/0.67469/0.69396,0.71452/0.72290/0.73293,0.85083/0.85762/0.86321,0.003365
10,0.9,0.217967,0.78973/0.79604/0.80375,0.75763/0.77902/0.79189,0.65452/0.67469/0.69396,0.71452/0.72290/0.73293,0.85083/0.85762/0.86321,0.003365
10,0.8,0.218939,0.78973/0.79604/0.80375,0.75763/0.77902/0.79189,0.65452/0.67469/0.69396,0.71452/0.72290/0.73293,0.85083/0.85762/0.86321,0.003365
10,0.7,0.218138,0.78934/0.79600/0.80375,0.75993/0.77943/0.79199,0.65306/0.67390/0.69246,0.71456/0.72263/0.73173,0.85076/0.85762/0.86315,0.005446
20,1.0,0.114687,0.77083/0.78056/0.79072,0.72999/0.75708/0.77214,0.63557/0.65371/0.67599,0.68596/0.70145/0.71357,0.83365/0.84327/0.84972,0.008981
20,0.9,0.125014,0.77083/0.78056/0.79072,0.72999/0.75708/0.77214,0.63557/0.65371/0.67599,0.68596/0.70145/0.71357,0.83365/0.84327/0.84972,0.008981


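As min_df increases and the vocabulary shrinks, the cross-validation scores degrade rapidly: the mean accuracy falls from about 0.81 at min_df = 5 to about 0.76 at min_df = 50, while changing max_df has almost no effect.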
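To make the link between min_df and vocabulary size concrete, here is a minimal sketch on a made-up corpus. The titles in `toy_titles` are invented for illustration, and the sketch uses TfidfVectorizer's default word tokenisation rather than the byte-pair encoding used in this notebook, so only the qualitative trend carries over.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not taken from the YouTube dataset).
toy_titles = [
    "science for kids broken bones",
    "science quiz for kids",
    "master of science in biology",
    "biology lecture part one",
]

# Raising min_df drops every term that appears in fewer than min_df documents,
# so the fitted vocabulary can only get smaller.
vocab_sizes = {}
for min_df in [1, 2, 3]:
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), min_df=min_df)
    vectorizer.fit(toy_titles)
    vocab_sizes[min_df] = len(vectorizer.vocabulary_)

print(vocab_sizes)
```

On the real data the same pruning removes the rare n-grams, which the table above suggests carry useful signal.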

Taken together, these results suggest that predicting whether a video will be popular from its text metadata runs against the usual rules of thumb for text classification. It is a fundamentally different challenge from, for example, deciding whether a text message or email is spam. In our case both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether a viewer likes a particular YouTube video or channel, and whether they share it on social media and contribute to its virality, is primarily a subjective determination. This makes the classification problem significantly more difficult, which is reflected in the low cross-validation metrics we have seen so far.
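For reference, the four weighting schemes compared in this notebook correspond to two switches on TfidfVectorizer: use_idf and sublinear_tf. The sketch below shows the weight each scheme assigns to a repeated term in a made-up document. It sets norm=None so the raw weights are visible, whereas the notebook's features use norm='l2'; the documents and the `weights` dictionary are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus; "science" appears three times in the first one.
toy_docs = ["science science science kids", "kids quiz"]

weights = {}
for sublinear_tf in [False, True]:
    for use_idf in [False, True]:
        # sublinear_tf replaces the raw count tf with 1 + log(tf);
        # use_idf multiplies the result by the inverse document frequency.
        name = 'log(TF)' if sublinear_tf else 'TF'
        if use_idf:
            name += '-IDF'
        vectorizer = TfidfVectorizer(use_idf=use_idf, sublinear_tf=sublinear_tf, norm=None)
        X = vectorizer.fit_transform(toy_docs)
        column = vectorizer.vocabulary_['science']
        weights[name] = X[0, column]

for name, weight in weights.items():
    print(f"{name:12s} weight of 'science' in document 0: {weight:.4f}")
```

The IDF factor rescales each column by a constant that depends only on the term's document frequency, which is one plausible reason why switching it on changes the scores so little here.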

## Including the video category

Next we incorporate the video category as an additional feature alongside the text.
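One common way to combine a categorical feature with sparse text features is to one-hot encode the category and stack it next to the text matrix. The sketch below illustrates that idea on made-up data. It is not the notebook's get_features implementation, and the names `toy_titles`, `toy_categories`, `text_vectorizer` and `category_encoder` are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical titles and category IDs (one category per video).
toy_titles = [
    "science for kids broken bones",
    "science quiz for kids",
    "master of science in biology",
    "biology lecture part one",
]
toy_categories = np.array([[24], [27], [27], [27]])

# Sparse text features (plain TF with L2 normalisation).
text_vectorizer = TfidfVectorizer(use_idf=False, norm='l2')
X_text = text_vectorizer.fit_transform(toy_titles)

# One-hot encode the category; unseen categories at prediction time become all zeros.
category_encoder = OneHotEncoder(handle_unknown='ignore')
X_category = category_encoder.fit_transform(toy_categories)

# Place the category columns to the right of the text columns.
X_combined = hstack([X_text, X_category]).tocsr()
print(X_text.shape, X_category.shape, X_combined.shape)
```

The one-hot columns are non-negative, so a matrix built this way stays compatible with MultinomialNB.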

In [None]:
%%time

# Rebuild the (1, 3) n-gram features for each term-weighting scheme, now passing the video category encoder.
params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

ngram_range = (1, 3)
for sublinear_tf in [False, True]:
    for use_idf in [False, True]:
        # sublinear_tf switches TF to log(TF); use_idf appends the IDF factor.
        vectorizer_type = 'log(TF)' if sublinear_tf else 'TF'
        if use_idf:
            vectorizer_type += '-IDF'
        params_fixed['vectorizer_type'].append(vectorizer_type)
        params_fixed['ngram_range'].append(ngram_range)

        X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
CPU times: user 1min 17s, sys: 1.44 s, total: 1min 18s
Wall time: 1min 18s


In [None]:
mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-17 03:19:24,125] A new study created in memory with name: no-name-ac008e53-fa3a-4f01-ab5d-6836252f397d
[I 2024-05-17 03:19:25,782] Trial 0 finished with value: 0.7985234665495421 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7985234665495421.
[I 2024-05-17 03:19:27,453] Trial 1 finished with value: 0.8007738639689563 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8007738639689563.
[I 2024-05-17 03:19:29,115] Trial 2 finished with value: 0.8005370296683969 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8007738639689563.
[I 2024-05-17 03:19:30,795] Trial 3 finished with value: 0.7531974813020937 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8007738639689563.
[I 2024-05-17 03:19:32,469] Trial 4 finished with value: 0.7996289531004169 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.8007738639689563.
[I 2024-05-17 03:1

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.171835e-01  0.81445/0.81874/0.82488   

                 precision                   recall                       f1  \
0  0.76842/0.78453/0.79735  0.73226/0.74549/0.75455  0.75648/0.76442/0.77110   

                   roc_auc        alpha  
0  0.88250/0.88667/0.89259 7.125200e-02  


[I 2024-05-17 03:22:01,421] Trial 0 finished with value: 0.7980101475917689 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7980101475917689.
[I 2024-05-17 03:22:02,970] Trial 1 finished with value: 0.7996288205948021 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7996288205948021.
[I 2024-05-17 03:22:04,508] Trial 2 finished with value: 0.7988787842531886 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7996288205948021.
[I 2024-05-17 03:22:06,083] Trial 3 finished with value: 0.7887711390299577 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7996288205948021.
[I 2024-05-17 03:22:07,637] Trial 4 finished with value: 0.8003790362087182 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8003790362087182.
[I 2024-05-17 03:22:09,150] Trial 5 finished with value: 0.8002605839834228 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.171835e-01  0.81445/0.81874/0.82488   
1          TF-IDF      (1, 3)   7.547930e-01  0.81718/0.82099/0.82665   

                 precision                   recall                       f1  \
0  0.76842/0.78453/0.79735  0.73226/0.74549/0.75455  0.75648/0.76442/0.77110   
1  0.77409/0.78917/0.80171  0.73081/0.74561/0.75337  0.76186/0.76668/0.77242   

                   roc_auc        alpha  
0  0.88250/0.88667/0.89259 7.125200e-02  
1  0.87917/0.88331/0.88959 1.388971e-01  


[I 2024-05-17 03:24:35,588] Trial 0 finished with value: 0.8043667615121073 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667615121073.
[I 2024-05-17 03:24:37,096] Trial 1 finished with value: 0.8049984703395923 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8049984703395923.
[I 2024-05-17 03:24:38,602] Trial 2 finished with value: 0.802866478378786 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8049984703395923.
[I 2024-05-17 03:24:40,132] Trial 3 finished with value: 0.7648448573596541 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8049984703395923.
[I 2024-05-17 03:24:41,659] Trial 4 finished with value: 0.8049589680774487 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.8049984703395923.
[I 2024-05-17 03:24:43,193] Trial 5 finished with value: 0.8050379258350484 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.171835e-01  0.81445/0.81874/0.82488   
1          TF-IDF      (1, 3)   7.547930e-01  0.81718/0.82099/0.82665   
2         log(TF)      (1, 3)   7.289278e-01  0.81899/0.82308/0.82827   

                 precision                   recall                       f1  \
0  0.76842/0.78453/0.79735  0.73226/0.74549/0.75455  0.75648/0.76442/0.77110   
1  0.77409/0.78917/0.80171  0.73081/0.74561/0.75337  0.76186/0.76668/0.77242   
2  0.77578/0.79283/0.80738  0.73324/0.74689/0.75237  0.76188/0.76909/0.77438   

                   roc_auc        alpha  
0  0.88250/0.88667/0.89259 7.125200e-02  
1  0.87917/0.88331/0.88959 1.388971e-01  
2  0.88084/0.88475/0.89105 8.823479e-02  


[I 2024-05-17 03:27:09,863] Trial 0 finished with value: 0.8021952439058134 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8021952439058134.
[I 2024-05-17 03:27:11,390] Trial 1 finished with value: 0.8016029593959926 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8021952439058134.
[I 2024-05-17 03:27:12,938] Trial 2 finished with value: 0.8010897183826987 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8021952439058134.
[I 2024-05-17 03:27:14,435] Trial 3 finished with value: 0.7908636053452764 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8021952439058134.
[I 2024-05-17 03:27:15,994] Trial 4 finished with value: 0.8023136649533171 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8023136649533171.
[I 2024-05-17 03:27:17,517] Trial 5 finished with value: 0.8025110671417643 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.171835e-01  0.81445/0.81874/0.82488   
1          TF-IDF      (1, 3)   7.547930e-01  0.81718/0.82099/0.82665   
2         log(TF)      (1, 3)   7.289278e-01  0.81899/0.82308/0.82827   
3     log(TF)-IDF      (1, 3)   7.193127e-01  0.81955/0.82312/0.82942   

                 precision                   recall                       f1  \
0  0.76842/0.78453/0.79735  0.73226/0.74549/0.75455  0.75648/0.76442/0.77110   
1  0.77409/0.78917/0.80171  0.73081/0.74561/0.75337  0.76186/0.76668/0.77242   
2  0.77578/0.79283/0.80738  0.73324/0.74689/0.75237  0.76188/0.76909/0.77438   
3  0.77991/0.79650/0.80919  0.72741/0.74120/0.74798  0.76200/0.76777/0.77418   

                   roc_auc        alpha  
0  0.88250/0.88667/0.89259 7.125200e-02  
1  0.87917/0.88331/0.88959 1.388971e-01  
2  0.88084/0.88475/0.89105 8.823479e-02  
3  0.87812/0.88254/0.88884 1.609698e-01  


In [None]:
mnB_tune_category = pd.DataFrame(mnB_tune_category)
mnB_tune_category.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",0.717184,0.81445/0.81874/0.82488,0.76842/0.78453/0.79735,0.73226/0.74549/0.75455,0.75648/0.76442/0.77110,0.88250/0.88667/0.89259,0.071252
TF-IDF,"(1, 3)",0.754793,0.81718/0.82099/0.82665,0.77409/0.78917/0.80171,0.73081/0.74561/0.75337,0.76186/0.76668/0.77242,0.87917/0.88331/0.88959,0.138897
log(TF),"(1, 3)",0.728928,0.81899/0.82308/0.82827,0.77578/0.79283/0.80738,0.73324/0.74689/0.75237,0.76188/0.76909/0.77438,0.88084/0.88475/0.89105,0.088235
log(TF)-IDF,"(1, 3)",0.719313,0.81955/0.82312/0.82942,0.77991/0.79650/0.80919,0.72741/0.74120/0.74798,0.76200/0.76777/0.77418,0.87812/0.88254/0.88884,0.16097


In [None]:
mnB_tune_category.to_csv('mnB_tuned.csv', index=False, sep=',', encoding='utf-8')

## Further classification models

Now that we understand the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF weighting schemes and the normalisation), we are ready to build some more models. Having already considered a naive Bayes model, we can now explore three linear methods:

* Support vector machine
* Logistic regression
* Perceptron

To avoid overfitting, we will regularise the models with a combination of $L^1$ and $L^2$ penalty terms, known as the *elastic net*. This introduces two hyperparameters, the overall regularisation strength alpha and the mixing parameter l1_ratio, which we will again tune using Bayesian optimisation. We will implement the linear models with SGDClassifier from scikit-learn, which fits regularised linear models by stochastic gradient descent. Because the algorithm is randomised, we set a random state for reproducibility.

In [None]:
from sklearn.linear_model import SGDClassifier

### Support vector machine

In [None]:
%%time

def get_params_SVM(trial=None, best=None):
    # Either sample hyperparameters from a tuning trial or reuse the best ones found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('Either trial or best must be provided.')
    # loss='hinge' makes SGDClassifier fit a linear support vector machine.
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

SVM_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-17 03:29:44,228] A new study created in memory with name: no-name-696f2205-d783-49f3-b73c-06698d330074
[I 2024-05-17 03:29:51,569] Trial 0 finished with value: 0.7114655159982993 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7114655159982993.
[I 2024-05-17 03:30:16,584] Trial 1 finished with value: 0.8126578015213983 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8126578015213983.
[I 2024-05-17 03:30:25,335] Trial 2 finished with value: 0.7552115276767206 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8126578015213983.
[I 2024-05-17 03:30:47,192] Trial 3 finished with value: 0.8160532189316229 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8160532189316229.
[I 2024-05-17 03:30:51,910] Trial 4 finished with value: 0.61

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.777125e+01  0.81362/0.82269/0.82725   

                 precision                   recall                       f1  \
0  0.78500/0.79504/0.80674  0.72741/0.74214/0.75816  0.76028/0.76758/0.77135   

         alpha     l1_ratio  
0 2.997875e-05 1.511824e-01  


[I 2024-05-17 04:01:06,904] Trial 0 finished with value: 0.649873359707147 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.649873359707147.
[I 2024-05-17 04:01:26,494] Trial 1 finished with value: 0.8177115033190707 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8177115033190707.
[I 2024-05-17 04:01:34,770] Trial 2 finished with value: 0.7259156274394186 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8177115033190707.
[I 2024-05-17 04:01:55,378] Trial 3 finished with value: 0.8202381437678128 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8202381437678128.
[I 2024-05-17 04:02:00,078] Trial 4 finished with value: 0.6064036612080849 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 wit

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.777125e+01  0.81362/0.82269/0.82725   
1          TF-IDF      (1, 3)   1.370461e+01  0.81777/0.82596/0.83179   

                 precision                   recall                       f1  \
0  0.78500/0.79504/0.80674  0.72741/0.74214/0.75816  0.76028/0.76758/0.77135   
1  0.81074/0.81630/0.82299  0.71173/0.72120/0.73140  0.75898/0.76578/0.77244   

         alpha     l1_ratio  
0 2.997875e-05 1.511824e-01  
1 1.054487e-04 4.495600e-03  


[I 2024-05-17 04:23:55,840] Trial 0 finished with value: 0.71371585106213 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.71371585106213.
[I 2024-05-17 04:24:15,277] Trial 1 finished with value: 0.8149872034650999 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8149872034650999.
[I 2024-05-17 04:24:24,034] Trial 2 finished with value: 0.7563959486018514 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8149872034650999.
[I 2024-05-17 04:24:44,781] Trial 3 finished with value: 0.8191721127123939 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8191721127123939.
[I 2024-05-17 04:24:49,414] Trial 4 finished with value: 0.6077460989762383 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 with 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.777125e+01  0.81362/0.82269/0.82725   
1          TF-IDF      (1, 3)   1.370461e+01  0.81777/0.82596/0.83179   
2         log(TF)      (1, 3)   1.646708e+01  0.82073/0.82758/0.83416   

                 precision                   recall                       f1  \
0  0.78500/0.79504/0.80674  0.72741/0.74214/0.75816  0.76028/0.76758/0.77135   
1  0.81074/0.81630/0.82299  0.71173/0.72120/0.73140  0.75898/0.76578/0.77244   
2  0.79640/0.80775/0.82213  0.72692/0.73906/0.75187  0.76441/0.77178/0.77976   

         alpha     l1_ratio  
0 2.997875e-05 1.511824e-01  
1 1.054487e-04 4.495600e-03  
2 7.839625e-05 2.289851e-03  


[I 2024-05-17 04:49:01,069] Trial 0 finished with value: 0.634159511038692 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.634159511038692.
[I 2024-05-17 04:49:19,602] Trial 1 finished with value: 0.8216597575381079 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8216597575381079.
[I 2024-05-17 04:49:27,622] Trial 2 finished with value: 0.7124125570115151 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8216597575381079.
[I 2024-05-17 04:49:48,031] Trial 3 finished with value: 0.8223307737665383 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8223307737665383.
[I 2024-05-17 04:49:52,589] Trial 4 finished with value: 0.6062062668140856 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 wit

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.777125e+01  0.81362/0.82269/0.82725   
1          TF-IDF      (1, 3)   1.370461e+01  0.81777/0.82596/0.83179   
2         log(TF)      (1, 3)   1.646708e+01  0.82073/0.82758/0.83416   
3     log(TF)-IDF      (1, 3)   1.723592e+01  0.81658/0.82509/0.82922   

                 precision                   recall                       f1  \
0  0.78500/0.79504/0.80674  0.72741/0.74214/0.75816  0.76028/0.76758/0.77135   
1  0.81074/0.81630/0.82299  0.71173/0.72120/0.73140  0.75898/0.76578/0.77244   
2  0.79640/0.80775/0.82213  0.72692/0.73906/0.75187  0.76441/0.77178/0.77976   
3  0.80022/0.80816/0.81849  0.71137/0.73022/0.75051  0.75914/0.76708/0.77456   

         alpha     l1_ratio  
0 2.997875e-05 1.511824e-01  
1 1.054487e-04 4.495600e-03  
2 7.839625e-05 2.289851e-03  
3 2.465948e-05 3.341608e-01  
CPU times: user 3min 59s, sys: 43.9 s, total: 4min 43s
Wall time: 1h 43min 38s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   17.9s finished


In [None]:
SVM_tune = pd.DataFrame(SVM_tune)
SVM_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",17.771247,0.81362/0.82269/0.82725,0.78500/0.79504/0.80674,0.72741/0.74214/0.75816,0.76028/0.76758/0.77135,3e-05,0.151182
TF-IDF,"(1, 3)",13.70461,0.81777/0.82596/0.83179,0.81074/0.81630/0.82299,0.71173/0.72120/0.73140,0.75898/0.76578/0.77244,0.000105,0.004496
log(TF),"(1, 3)",16.467076,0.82073/0.82758/0.83416,0.79640/0.80775/0.82213,0.72692/0.73906/0.75187,0.76441/0.77178/0.77976,7.8e-05,0.00229
log(TF)-IDF,"(1, 3)",17.235919,0.81658/0.82509/0.82922,0.80022/0.80816/0.81849,0.71137/0.73022/0.75051,0.75914/0.76708/0.77456,2.5e-05,0.334161


In [None]:
SVM_tune.to_csv('SVM_tuned.csv', index=False, sep=',', encoding='utf-8')

### Logistic regression

In [None]:
%%time

def get_params_log_reg(trial=None, best=None):
    # Sample hyperparameters from an Optuna trial, or reuse a previously found best set.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('either trial or best must be provided')
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-17 05:13:23,081] A new study created in memory with name: no-name-e3aea245-4d63-4550-9cbb-5299b93122b7
[I 2024-05-17 05:13:27,968] Trial 0 finished with value: 0.6922377665165327 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6922377665165327.
[I 2024-05-17 05:13:50,860] Trial 1 finished with value: 0.8125394116516864 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8125394116516864.
[I 2024-05-17 05:13:57,073] Trial 2 finished with value: 0.7279295568973265 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8125394116516864.
[I 2024-05-17 05:14:07,173] Trial 3 finished with value: 0.8074459114028485 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8125394116516864.
[I 2024-05-17 05:14:12,034] Trial 4 finished with value: 0.64

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   9.126447e+00  0.81619/0.82269/0.82705   

                 precision                   recall                       f1  \
0  0.79611/0.80855/0.83912  0.69161/0.72223/0.74039  0.75244/0.76258/0.76887   

                   roc_auc        alpha     l1_ratio  
0  0.89225/0.89403/0.89737 1.269873e-05 1.301251e-01  


[I 2024-05-17 05:32:20,389] Trial 0 finished with value: 0.6506630541998629 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6506630541998629.
[I 2024-05-17 05:32:39,614] Trial 1 finished with value: 0.817435260289743 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.817435260289743.
[I 2024-05-17 05:32:45,347] Trial 2 finished with value: 0.7036085410001601 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.817435260289743.
[I 2024-05-17 05:32:49,959] Trial 3 finished with value: 0.7995498940149941 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.817435260289743.
[I 2024-05-17 05:32:54,625] Trial 4 finished with value: 0.6054560745835135 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   9.126447e+00  0.81619/0.82269/0.82705   
1          TF-IDF      (1, 3)   5.083005e+00  0.81362/0.82284/0.82902   

                 precision                   recall                       f1  \
0  0.79611/0.80855/0.83912  0.69161/0.72223/0.74039  0.75244/0.76258/0.76887   
1  0.78219/0.80303/0.81327  0.71842/0.73041/0.74388  0.75857/0.76489/0.77218   

                   roc_auc        alpha     l1_ratio  
0  0.89225/0.89403/0.89737 1.269873e-05 1.301251e-01  
1  0.89423/0.89653/0.89945 1.149440e-05 1.180009e-01  


[I 2024-05-17 05:48:48,595] Trial 0 finished with value: 0.691724509914343 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.691724509914343.
[I 2024-05-17 05:49:11,681] Trial 1 finished with value: 0.8167640024334266 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8167640024334266.
[I 2024-05-17 05:49:17,653] Trial 2 finished with value: 0.7336937304188853 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8167640024334266.
[I 2024-05-17 05:49:26,659] Trial 3 finished with value: 0.8081963530557548 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8167640024334266.
[I 2024-05-17 05:49:31,367] Trial 4 finished with value: 0.6120890484498986 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 wit

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   9.126447e+00  0.81619/0.82269/0.82705   
1          TF-IDF      (1, 3)   5.083005e+00  0.81362/0.82284/0.82902   
2         log(TF)      (1, 3)   7.029333e+00  0.82053/0.82565/0.83044   

                 precision                   recall                       f1  \
0  0.79611/0.80855/0.83912  0.69161/0.72223/0.74039  0.75244/0.76258/0.76887   
1  0.78219/0.80303/0.81327  0.71842/0.73041/0.74388  0.75857/0.76489/0.77218   
2  0.80601/0.81224/0.82306  0.70867/0.72587/0.73640  0.75654/0.76658/0.77050   

                   roc_auc        alpha     l1_ratio  
0  0.89225/0.89403/0.89737 1.269873e-05 1.301251e-01  
1  0.89423/0.89653/0.89945 1.149440e-05 1.180009e-01  
2  0.89393/0.89751/0.90002 2.254095e-05 4.493247e-03  


[I 2024-05-17 06:06:03,342] Trial 0 finished with value: 0.6295796025533053 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6295796025533053.
[I 2024-05-17 06:06:20,426] Trial 1 finished with value: 0.82008065694725 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.82008065694725.
[I 2024-05-17 06:06:26,396] Trial 2 finished with value: 0.6957905616250488 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.82008065694725.
[I 2024-05-17 06:06:31,066] Trial 3 finished with value: 0.8022346604290298 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.82008065694725.
[I 2024-05-17 06:06:35,780] Trial 4 finished with value: 0.6054955690512092 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with valu

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   9.126447e+00  0.81619/0.82269/0.82705   
1          TF-IDF      (1, 3)   5.083005e+00  0.81362/0.82284/0.82902   
2         log(TF)      (1, 3)   7.029333e+00  0.82053/0.82565/0.83044   
3     log(TF)-IDF      (1, 3)   5.293550e+00  0.81619/0.82509/0.83024   

                 precision                   recall                       f1  \
0  0.79611/0.80855/0.83912  0.69161/0.72223/0.74039  0.75244/0.76258/0.76887   
1  0.78219/0.80303/0.81327  0.71842/0.73041/0.74388  0.75857/0.76489/0.77218   
2  0.80601/0.81224/0.82306  0.70867/0.72587/0.73640  0.75654/0.76658/0.77050   
3  0.78450/0.80510/0.82076  0.72626/0.73511/0.75408  0.76579/0.76833/0.77091   

                   roc_auc        alpha     l1_ratio  
0  0.89225/0.89403/0.89737 1.269873e-05 1.301251e-01  
1  0.89423/0.89653/0.89945 1.149440e-05 1.180009e-01  
2  0.89393/0.89751/0.90002 2.254095e-05 4.493247e-03  
3  0.89329/0.

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    5.4s remaining:    8.2s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.5s finished


In [None]:
log_reg_tune = pd.DataFrame(log_reg_tune)
log_reg_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha,l1_ratio
TF,"(1, 3)",9.126447,0.81619/0.82269/0.82705,0.79611/0.80855/0.83912,0.69161/0.72223/0.74039,0.75244/0.76258/0.76887,0.89225/0.89403/0.89737,1.3e-05,0.130125
TF-IDF,"(1, 3)",5.083005,0.81362/0.82284/0.82902,0.78219/0.80303/0.81327,0.71842/0.73041/0.74388,0.75857/0.76489/0.77218,0.89423/0.89653/0.89945,1.1e-05,0.118001
log(TF),"(1, 3)",7.029333,0.82053/0.82565/0.83044,0.80601/0.81224/0.82306,0.70867/0.72587/0.73640,0.75654/0.76658/0.77050,0.89393/0.89751/0.90002,2.3e-05,0.004493
log(TF)-IDF,"(1, 3)",5.29355,0.81619/0.82509/0.83024,0.78450/0.80510/0.82076,0.72626/0.73511/0.75408,0.76579/0.76833/0.77091,0.89329/0.89565/0.89752,7e-06,0.315833


In [None]:
log_reg_tune.to_csv('log_reg_tuned.csv', index=False, sep=',', encoding='utf-8')

### Perceptron

In [None]:
%%time

def get_params_perceptron(trial=None, best=None):
    # Sample hyperparameters from an Optuna trial, or reuse a previously found best set.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('either trial or best must be provided')
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-17 06:20:45,724] A new study created in memory with name: no-name-cfaa7c87-7ee9-45a7-949e-7ffd44465e21
[I 2024-05-17 06:20:50,082] Trial 0 finished with value: 0.6014700328808786 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6014700328808786.
[I 2024-05-17 06:21:14,148] Trial 1 finished with value: 0.809341645852243 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.809341645852243.
[I 2024-05-17 06:21:18,748] Trial 2 finished with value: 0.6458458125692488 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.809341645852243.
[I 2024-05-17 06:21:27,896] Trial 3 finished with value: 0.7489747767767542 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.809341645852243.
[I 2024-05-17 06:21:32,140] Trial 4 finished with value: 0.567793

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   2.071026e+01  0.81046/0.81440/0.81876   

                 precision                   recall                       f1  \
0  0.74901/0.78319/0.81602  0.69061/0.73435/0.77194  0.74810/0.75723/0.76255   

         alpha     l1_ratio  
0 1.231747e-07 8.874225e-01  


[I 2024-05-17 06:51:57,123] Trial 0 finished with value: 0.6047867107780457 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6047867107780457.
[I 2024-05-17 06:52:15,706] Trial 1 finished with value: 0.8160534683539569 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8160534683539569.
[I 2024-05-17 06:52:19,946] Trial 2 finished with value: 0.6278426098305916 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8160534683539569.
[I 2024-05-17 06:52:26,799] Trial 3 finished with value: 0.744591958701897 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8160534683539569.
[I 2024-05-17 06:52:31,102] Trial 4 finished with value: 0.5854402986208893 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 wi

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   2.071026e+01  0.81046/0.81440/0.81876   
1          TF-IDF      (1, 3)   1.571143e+01  0.81267/0.81929/0.82349   

                 precision                   recall                       f1  \
0  0.74901/0.78319/0.81602  0.69061/0.73435/0.77194  0.74810/0.75723/0.76255   
1  0.76778/0.79535/0.82420  0.70165/0.73109/0.75437  0.75801/0.76139/0.76779   

         alpha     l1_ratio  
0 1.231747e-07 8.874225e-01  
1 7.457411e-08 9.453583e-01  


[I 2024-05-17 07:16:47,035] Trial 0 finished with value: 0.6226297921727374 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6226297921727374.
[I 2024-05-17 07:17:08,227] Trial 1 finished with value: 0.8178298853943348 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8178298853943348.
[I 2024-05-17 07:17:12,733] Trial 2 finished with value: 0.6478613554778796 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8178298853943348.
[I 2024-05-17 07:17:20,593] Trial 3 finished with value: 0.7525666844250173 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8178298853943348.
[I 2024-05-17 07:17:24,710] Trial 4 finished with value: 0.5834257689904904 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   2.071026e+01  0.81046/0.81440/0.81876   
1          TF-IDF      (1, 3)   1.571143e+01  0.81267/0.81929/0.82349   
2         log(TF)      (1, 3)   1.993168e+01  0.81382/0.81783/0.82432   

                 precision                   recall                       f1  \
0  0.74901/0.78319/0.81602  0.69061/0.73435/0.77194  0.74810/0.75723/0.76255   
1  0.76778/0.79535/0.82420  0.70165/0.73109/0.75437  0.75801/0.76139/0.76779   
2  0.76625/0.78737/0.81332  0.71773/0.73868/0.77084  0.75600/0.76178/0.76854   

         alpha     l1_ratio  
0 1.231747e-07 8.874225e-01  
1 7.457411e-08 9.453583e-01  
2 6.795851e-08 9.099173e-01  


[I 2024-05-17 07:44:48,277] Trial 0 finished with value: 0.5943633358522391 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5943633358522391.
[I 2024-05-17 07:45:05,423] Trial 1 finished with value: 0.8170801920084305 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8170801920084305.
[I 2024-05-17 07:45:09,652] Trial 2 finished with value: 0.6223894971372942 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8170801920084305.
[I 2024-05-17 07:45:16,586] Trial 3 finished with value: 0.7512234126509346 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8170801920084305.
[I 2024-05-17 07:45:20,865] Trial 4 finished with value: 0.5458355862535558 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   2.071026e+01  0.81046/0.81440/0.81876   
1          TF-IDF      (1, 3)   1.571143e+01  0.81267/0.81929/0.82349   
2         log(TF)      (1, 3)   1.993168e+01  0.81382/0.81783/0.82432   
3     log(TF)-IDF      (1, 3)   1.084760e+01  0.81145/0.82170/0.82883   

                 precision                   recall                       f1  \
0  0.74901/0.78319/0.81602  0.69061/0.73435/0.77194  0.74810/0.75723/0.76255   
1  0.76778/0.79535/0.82420  0.70165/0.73109/0.75437  0.75801/0.76139/0.76779   
2  0.76625/0.78737/0.81332  0.71773/0.73868/0.77084  0.75600/0.76178/0.76854   
3  0.77487/0.79706/0.81246  0.69679/0.73647/0.76985  0.75020/0.76498/0.77775   

         alpha     l1_ratio  
0 1.231747e-07 8.874225e-01  
1 7.457411e-08 9.453583e-01  
2 6.795851e-08 9.099173e-01  
3 4.694827e-07 8.300623e-01  
CPU times: user 4min 3s, sys: 43.8 s, total: 4min 47s
Wall time: 1h 45min 18s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   11.9s finished


In [None]:
perceptron_tune = pd.DataFrame(perceptron_tune)
perceptron_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",20.710256,0.81046/0.81440/0.81876,0.74901/0.78319/0.81602,0.69061/0.73435/0.77194,0.74810/0.75723/0.76255,1.231747e-07,0.887423
TF-IDF,"(1, 3)",15.711429,0.81267/0.81929/0.82349,0.76778/0.79535/0.82420,0.70165/0.73109/0.75437,0.75801/0.76139/0.76779,7.457411e-08,0.945358
log(TF),"(1, 3)",19.93168,0.81382/0.81783/0.82432,0.76625/0.78737/0.81332,0.71773/0.73868/0.77084,0.75600/0.76178/0.76854,6.795851e-08,0.909917
log(TF)-IDF,"(1, 3)",10.8476,0.81145/0.82170/0.82883,0.77487/0.79706/0.81246,0.69679/0.73647/0.76985,0.75020/0.76498/0.77775,4.694827e-07,0.830062


In [None]:
perceptron_tune.to_csv('perceptron_tuned.csv', index=False, sep=',', encoding='utf-8')

## Training the final models

Now that we've obtained the optimal hyperparameters we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [None]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}  # maps the output path (without extension) to the fitted model

for n in range(len(X_trains)):
    # Rebuild each classifier with the hyperparameters tuned for vectorisation n.
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune_category.at[n,'alpha']))
    svm_clfs.append(SGDClassifier(loss='hinge', penalty='elasticnet', alpha=SVM_tune.at[n,'alpha'], l1_ratio=SVM_tune.at[n,'l1_ratio']))
    logreg_clfs.append(SGDClassifier(loss='log_loss', penalty='elasticnet', alpha=log_reg_tune.at[n,'alpha'], l1_ratio=log_reg_tune.at[n,'l1_ratio']))
    perceptron_clfs.append(SGDClassifier(loss='perceptron', penalty='elasticnet', alpha=perceptron_tune.at[n,'alpha'], l1_ratio=perceptron_tune.at[n,'l1_ratio']))

    # Fit on the full training set for this vectorisation.
    for model in [mnB_clfs[-1], svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    # Key each model by classifier, vectoriser type and n-gram range.
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    models[f"models/mnB_{suffix}"] = mnB_clfs[-1]
    models[f"models/svm_{suffix}"] = svm_clfs[-1]
    models[f"models/logreg_{suffix}"] = logreg_clfs[-1]
    models[f"models/perceptron_{suffix}"] = perceptron_clfs[-1]

In [None]:
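import os
import joblib

# joblib.dump does not create missing directories, so make sure the target exists.
os.makedirs('models', exist_ok=True)

for model_name, model in models.items():
    joblib.dump(model, model_name+'.joblib')

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')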

['models/video_category_encoder.joblib']

In [None]:
import os

os.makedirs('vectorizers', exist_ok=True)

for n in range(len(vectorizers)):
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    for field in ('channel_title', 'video_title', 'video_description'):
        joblib.dump(vectorizers[n][field], f"vectorizers/{field}_{suffix}.joblib")

## Probability calibration

The cross-validation scores (accuracy of roughly 0.82 and F1 of roughly 0.77) show that the models are still far from perfect classifiers. We would therefore like to model the probability $P(y\in \mathcal{C}\mid s)$ that a sample $y$ belongs to class $\mathcal{C}$ given the output $s$ of each model, which is not the same as the probability the model itself reports. (Some models, such as the hinge-loss SVM and the perceptron, do not report probabilities at all.) We can do this using a probability calibrator, which treats the output of each model as a feature that is then used to model the true probability. This requires validation data, so we'll again use a five-fold cross-validation split.

In [None]:
from sklearn.calibration import CalibratedClassifierCV

calibrated_clfs = {}

for model_name in models:
    calibrated_clfs[model_name] = CalibratedClassifierCV(models[model_name], cv = KFold(n_splits=5, random_state=42, shuffle=True))
    calibrated_clfs[model_name].fit(X_train, y_train)

[CV] END  accuracy: (test=0.805) f1: (test=0.743) precision: (test=0.758) recall: (test=0.729) roc_auc: (test=0.865) total time=   0.5s
[CV] END  accuracy: (test=0.808) f1: (test=0.748) precision: (test=0.801) recall: (test=0.701) roc_auc: (test=0.868) total time=   0.5s
[CV] END  accuracy: (test=0.793) f1: (test=0.719) precision: (test=0.758) recall: (test=0.684) roc_auc: (test=0.851) total time=   0.3s
[CV] END  accuracy: (test=0.800) f1: (test=0.733) precision: (test=0.777) recall: (test=0.694) roc_auc: (test=0.857) total time=   0.3s
[CV] END  accuracy: (test=0.790) f1: (test=0.717) precision: (test=0.792) recall: (test=0.655) roc_auc: (test=0.854) total time=   0.3s
[CV] END  accuracy: (test=0.794) f1: (test=0.720) precision: (test=0.760) recall: (test=0.683) roc_auc: (test=0.851) total time=   0.3s
[CV] END  accuracy: (test=0.780) f1: (test=0.697) precision: (test=0.759) recall: (test=0.645) roc_auc: (test=0.850) total time=   0.2s
[CV] END  accuracy: (test=0.771) f1: (test=0.686
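
To make the idea of calibration concrete, the following is a small self-contained sketch on synthetic data (an assumption; `X_toy` and `y_toy` are not the notebook's features). It wraps a hinge-loss `SGDClassifier`, which has no `predict_proba` of its own, in a `CalibratedClassifierCV` with the default sigmoid method and the same five-fold split used above.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic binary classification data (hypothetical stand-in for the text features).
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# A linear SVM trained with SGD only exposes decision_function, not predict_proba.
base = SGDClassifier(loss="hinge", random_state=0)

# For each fold, the base model is fitted on the training part and a sigmoid
# calibrator is fitted on its scores for the held-out part.
calibrated = CalibratedClassifierCV(
    base, cv=KFold(n_splits=5, shuffle=True, random_state=42)
)
calibrated.fit(X_toy, y_toy)

# Calibrated class probabilities, averaged over the five fitted folds.
proba = calibrated.predict_proba(X_toy)
```

Each row of `proba` holds the estimated probabilities for class 0 and class 1, so the second column plays the role of $P(y\in \mathcal{C})$ for the positive class.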

We'll save the calibrated models for evaluation in the next notebook:

In [None]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], model_name+'_calibrated.joblib')

## Stacking

Now that we have our sixteen models, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a meta-classifier first gathers the predictions of the individual models and then uses these predictions as features to produce a final prediction. The meta-classifier must be trained on out-of-fold predictions, so we again use cross-validation, and we need to choose a model for it. We will compare two choices: logistic regression and Gaussian naive Bayes.

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(max_iter=10000), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

In [None]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)

In [None]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')

['models/stacking_gnb.joblib']

We've now built a total of 34 classical ML models: 16 base models that combine four classification approaches (naive Bayes and three linear models) with four text vectorisation methods (TF, log(TF), TF-IDF and log(TF)-IDF), a probability-calibrated version of each, and two stacked ensembles. Calibration and stacking are intended to improve on the base models. In the [next notebook](https://github.com/tommyliphysics/tommyli-ml/blob/main/youtube_predictor/notebooks/eval.ipynb) we'll compare the performance of all of these models on the test data.