# YouTube popularity predictor (Part 2): text frequency-based models

In the previous notebook, we used natural language processing (NLP) to explore the YouTube video dataset and hunted for possible correlations between the language features in the video titles and descriptions and the video popularity, which we associated with a binary categorical variable corresponding to a video having obtained over 50k views (class 1) or under 50k views (class 0). We did indeed see that the frequency of the tokens in the byte-pair encoded text had predictive value for classification. In this notebook we will construct a variety of classification models based on text frequency.

Let's import the scikit-learn library and load the dataset, which was already processed in the previous notebook to extract the relevant ML features.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas(desc='processing rows')
pd.options.display.float_format = '{:.6e}'.format


In [2]:
videos = pd.read_csv('https://raw.githubusercontent.com/tommyliphysics/tommyli-ml/main/youtube_predictor/data/YT_data_v2.csv', lineterminator='\n')
videos

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
0,University of New Haven,27,Master of Science in Cellular and Molecular Bi...,"Christina Zito, assistant professor and coordi...",75,3.610660e+00,0
1,PennWest California,27,Faculty Showcase: Dr. Ben Reuter - Exercise Sc...,Interested in pursing a exercise science degre...,75,3.168203e+00,0
2,University of New Haven,27,Master of Science in Mechanical Engineering: B...,The University of New Haven’s master’s degree ...,75,3.447313e+00,0
3,Operation Ouch,24,Science for kids | BROKEN BONES- Unluckiest K...,Learn about Broken Bones with the Unluckiest K...,75,6.603942e+00,1
4,Crazy GkTrick,27,Science Gk : Diseases (मानव रोग ) - Part-2,Biology (‎जीव विज्ञान) | Gk Science | Science ...,76,6.409320e+00,1
...,...,...,...,...,...,...,...
31657,Morinda Enterprises,22,Vivo v30pro pro photography // aura light por...,,1,2.534026e+00,0
31658,Christian Dunham,20,POV me growing up,,1,1.000000e+00,0
31659,Gegee gegee,22,28 March 2024,,1,4.771213e-01,0
31660,Sangita . 20k views. 2 days ago,27,TLM WORKSHOP on FLN ||👏😱||#viral #tlm,"project work,tlm workshop,maths project work,t...",1,1.431364e+00,0


Let's look at the video categories:

In [3]:
videos.groupby('video_category').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,label,label,label,label,label,label,label,label
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
video_category,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
1,306.0,42.2451,23.89558,1.0,20.0,48.0,63.75,75.0,306.0,4.271067,...,5.674682,7.752964,306.0,0.4084967,0.492361,0.0,0.0,0.0,1.0,1.0
2,179.0,28.47486,22.77203,1.0,13.0,19.0,40.0,75.0,179.0,4.288575,...,5.602989,7.831337,179.0,0.4134078,0.493826,0.0,0.0,0.0,1.0,1.0
10,245.0,23.38367,22.98805,1.0,6.0,15.0,36.0,74.0,245.0,4.580353,...,5.645615,8.234742,245.0,0.5591837,0.4975013,0.0,0.0,1.0,1.0,1.0
15,41.0,31.43902,23.29383,2.0,12.0,27.0,56.0,75.0,41.0,4.541532,...,5.84489,8.419579,41.0,0.4634146,0.5048545,0.0,0.0,0.0,1.0,1.0
17,487.0,51.03491,19.40761,1.0,39.0,56.0,68.0,75.0,487.0,3.832325,...,4.665426,7.836966,487.0,0.2422998,0.4289153,0.0,0.0,0.0,0.0,1.0
19,111.0,38.67568,22.12451,1.0,19.5,38.0,57.0,75.0,111.0,4.107556,...,4.999286,7.486532,111.0,0.3063063,0.463049,0.0,0.0,0.0,1.0,1.0
20,603.0,19.15257,18.76016,1.0,6.5,14.0,21.0,76.0,603.0,4.025035,...,5.776823,8.183316,603.0,0.4610282,0.4988927,0.0,0.0,0.0,1.0,1.0
22,5831.0,34.82164,22.61416,1.0,15.0,32.0,55.0,76.0,5831.0,3.577791,...,4.629766,8.20297,5831.0,0.2364946,0.4249657,0.0,0.0,0.0,0.0,1.0
23,220.0,27.2,24.87863,1.0,7.0,17.0,54.25,75.0,220.0,5.085606,...,6.448389,8.227258,220.0,0.6545455,0.4766007,0.0,0.0,1.0,1.0,1.0
24,1382.0,29.52605,22.92586,1.0,10.0,23.5,48.0,75.0,1382.0,4.669223,...,5.98817,8.494091,1382.0,0.515919,0.4999274,0.0,0.0,1.0,1.0,1.0


We see that category 30 only has a single member, so we will drop it.

In [4]:
videos[videos['video_category']==30]

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26648,YouTube Movies,30,"Underground Aliens, Baba Vanga And Quantum Bio...",Baba Vanga was a female mystic in Bulgaria. Sh...,11,0.0,0


In [5]:
videos.drop(videos[videos['video_category']==30].index, inplace=True)

In [6]:
videos.reset_index(drop=True, inplace=True)

Let's look at the distribution of video view counts:

In [7]:
videos[['months','video_view_count','label']].groupby('label').describe()

Unnamed: 0_level_0,months,months,months,months,months,months,months,months,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count,video_view_count
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
0,19168.0,40.84401,21.00774,-1.0,24.0,42.0,59.0,76.0,19168.0,3.353037,1.067583,0.0,2.692847,3.633519,4.205265,4.69897
1,12493.0,29.56135,21.26779,1.0,12.0,24.0,46.0,76.0,12493.0,5.582265,0.6834096,4.699005,5.037442,5.433327,5.977578,8.588679


The classes are not exactly balanced: roughly 60% of the videos fall in class 0 and 40% in class 1. This is a consequence of defining the classes by the round milestone of 50k views. Choosing the threshold that balances the classes exactly would give a far less memorable cut-off.
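
As a point of reference for the accuracies reported later, the class counts in the table above fix the accuracy of a trivial classifier that always predicts the majority class. A minimal sketch, using only the two counts shown in that table:

```python
# Class counts read off the table above
# (label 0: under 50k views, label 1: over 50k views).
n_class0 = 19168
n_class1 = 12493

n_total = n_class0 + n_class1
frac_class1 = n_class1 / n_total

# A classifier that always predicts the majority class is right this often.
baseline_accuracy = max(n_class0, n_class1) / n_total

print(f"fraction in class 1: {frac_class1:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model built on the text features should be judged against this baseline rather than against 50%.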

We'll select a test set based on an 80/20 train/test split which we will then use for all future model building and validation.

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(videos[['video_title']], videos['label'], test_size=0.2, stratify=videos['video_category'], random_state=524)
# Select rows by index label; after reset_index the labels coincide with the positions.
test = videos.loc[X_test.index]
train = videos.loc[X_train.index]
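
Note that the split above stratifies on video_category rather than on label, so it is the category proportions that are preserved in both splits. The effect can be sketched on a hypothetical toy DataFrame (the column values and sizes below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three categories of very different sizes.
toy = pd.DataFrame({
    'video_category': [22] * 60 + [27] * 30 + [28] * 10,
    'label': [0, 1] * 50,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['video_category'], random_state=524
)

# With stratification, each category keeps its share of rows in both splits.
print(toy['video_category'].value_counts(normalize=True).sort_index())
print(toy_test['video_category'].value_counts(normalize=True).sort_index())
```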

In [9]:
test

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
26498,RG LECTURES,27,MHTCET FULL REVISION ONE SHOT ALL FORMULAS - P...,MHTCET PHYSICS FULL COMPLETE ONE SHOT REVISION...,11,5.238984e+00,1
27395,FuTechs,28,Tony Robbin and Robot conversation Relationshi...,"Speaker :Anthony Jay Robbins (né Mahavoric, bo...",10,4.364063e+00,0
23126,That Chemist,27,Nobel Prize in Chemistry 2022 (Recap),The Nobel Prize in Chemistry for 2022 has been...,18,4.484656e+00,0
15634,SCIENCE FUN For Everyone!,27,Friction Fun Friction Science Experiment,Have fun exploring friction with this easy sci...,36,4.503437e+00,0
7075,Michigan Medicine,26,Deconstructing the Legitimization of Acupunctu...,"Rick Harris, PhD\nAssociate Professor, Anesthe...",57,4.632467e+00,0
...,...,...,...,...,...,...,...
24112,CARB ACADEMY,27,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,Class 8th Maths Chapter 1 l NCERT EXERCISE-1.1...,16,5.496467e+00,1
2034,Rafael Verdonck's World,22,Science World #7 Will Strangelets destroy th...,Will the universe be destroyed by a tiny eleme...,70,3.183270e+00,0
22862,Trik Matematika mesi,27,deret angka matematika #shorts #maths,,19,5.764919e+00,1
6425,edureka!,27,Statistics And Probability Tutorial | Statisti...,🔥 Data Science Certification using R (Use Code...,59,5.561255e+00,1


In [10]:
train.to_csv('train.csv', index=False, encoding='utf-8', sep=',')
test.to_csv('test.csv', index=False, encoding='utf-8', sep=',')

In [11]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0
...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1


In this notebook, we will only be using the train dataset to build the models.

To convert the text into numerical features, we can use byte-pair encoding (BPE). We can train three separate encoders for the channel name, video title and video description. We will convert all text to lower case to make the vocabulary size smaller.
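
Byte-pair encoding builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols. The following is a simplified, pure-Python sketch of that merge loop on a hypothetical four-word corpus. It is not the implementation used by the tokenizers library, which also handles pre-tokenisation, special tokens and a vocabulary-size limit.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent pair of adjacent symbols, or None if there is none."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus, lower-cased as in this notebook.
corpus = ["science", "science", "scientist", "silence"]
words = Counter(tuple(w.lower()) for w in corpus)

merges = []
for _ in range(3):  # three merge steps, purely for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(list(words))
```

Each merge adds one entry to the vocabulary, so frequent character sequences end up as single tokens while rare words remain split into smaller pieces.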

We first have to set all NA values to empty strings:

In [12]:
train = train.fillna('')

In [13]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def build_tokenizer(train_texts, save=None):
    BPE_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    BPE_tokenizer.pre_tokenizer = Whitespace()
    BPE_tokenizer.train_from_iterator(train_texts, trainer=trainer)
    if save:
        BPE_tokenizer.save(save)
    return BPE_tokenizer

training_data_uncased = {field: train[field].apply(lambda x: x.lower()).tolist() for field in ['channel_title', 'video_title', 'video_description']}

In [14]:
%%time
BPE_tokenizers_uncased = {}

for field in training_data_uncased:
    BPE_tokenizers_uncased[field]= build_tokenizer(training_data_uncased[field], save=f"tokenizers/BPE_tokenizer_{field}_uncased.json")










CPU times: user 14.1 s, sys: 7.9 s, total: 22 s
Wall time: 7.2 s


In [15]:
from transformers import PreTrainedTokenizerFast

tokenizers_trained_uncased = {}

for field in training_data_uncased:
    tokenizers_trained_uncased[field] = PreTrainedTokenizerFast(tokenizer_file=f"tokenizers/BPE_tokenizer_{field}_uncased.json")



In [16]:
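def tokenize(text, field, cased=True):
    # Only uncased tokenizers were trained above, so the cased path is unavailable.
    if cased:
        raise NotImplementedError('Only uncased tokenizers have been trained.')
    return [str(t) for t in tokenizers_trained_uncased[field](text.lower())['input_ids']]

def tokenizer_decode(tokenized, field, cased=True):
    if cased:
        raise NotImplementedError('Only uncased tokenizers have been trained.')
    return tokenizers_trained_uncased[field].decode([int(t) for t in tokenized])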


In [17]:
train.loc[:,'channel_title_tokenized'] = train['channel_title'].progress_apply(lambda text: tokenize(text.lower(), 'channel_title', cased=False))
train.loc[:,'video_title_tokenized'] = train['video_title'].progress_apply(lambda text: tokenize(text.lower(), 'video_title', cased=False))
train.loc[:,'video_description_tokenized'] = train['video_description'].progress_apply(lambda text: tokenize(text.lower(), 'video_description', cased=False))

processing rows: 100%|█████████████████| 25328/25328 [00:00<00:00, 48858.10it/s]
processing rows: 100%|█████████████████| 25328/25328 [00:00<00:00, 27069.25it/s]
processing rows: 100%|██████████████████| 25328/25328 [00:09<00:00, 2683.60it/s]


In [18]:
train

Unnamed: 0,channel_title,video_category,video_title,video_description,months,video_view_count,label,channel_title_tokenized,video_title_tokenized,video_description_tokenized
22710,BKTVOK,22,BAWAR KHAN SON MUHAMMAD Khan Short video YouT...,Bawarkhan SON MUHAMMAD khan \nAmazing Videos\n...,19,1.982271e+00,0,[1165],"[2319, 2692, 3910, 2848, 6602, 3910, 2077, 196...","[10988, 5597, 12955, 5606, 5315, 4227, 4430, 4..."
26440,Rockit14,20,Add realistic waves to Minecraft! (Physics Mod),Play Minecraft with realistic physics! This mi...,11,5.338389e+00,1,[16769],"[3084, 5038, 4400, 1871, 3829, 5, 12, 1889, 59...","[4091, 9748, 4132, 17593, 4153, 5, 4123, 9748,..."
9993,MAD ABOUT SCIENCE,22,The Stirling Engine at my Institute,Very close to Carnot Engine\n\n This one is...,50,5.685385e+00,1,"[1300, 3294, 777]","[1883, 9686, 1910, 1817, 2178, 2469]","[4451, 9906, 4027, 17896, 4094, 4306, 4123, 42..."
22063,BKTVOK,22,Shergarh Bazar video YouTube amazing viralvide...,Shergarh Bazar video YouTube amazing viralvide...,21,2.287802e+00,0,[1165],"[6224, 6245, 1963, 2159, 2250, 2525, 1890, 206...","[25286, 28274, 4082, 4058, 5315, 10641, 4393, ..."
1187,HVTraining,17,Cycling Tips: The science of electrolytes and ...,Looking for a proven training plan? \nhttps://...,72,3.307282e+00,0,[19463],"[6465, 2587, 30, 1883, 1815, 1846, 21675, 1842...","[7408, 4039, 41, 17229, 5423, 4459, 33, 4006, ..."
...,...,...,...,...,...,...,...,...,...,...
7270,Khanish,22,Friction welding #tools #science #viral,Friction welding is a solid-state welding proc...,57,3.766115e+00,0,[16197],"[3683, 7242, 7, 3945, 7, 1815, 7, 2062]","[8809, 25929, 4021, 41, 7093, 17, 5087, 25929,..."
30484,TungaloyCorporation,28,We made a smile with high feed machining! #cn...,Product : AddDoFeed\nShank : VSSD08L090S05-C\n...,2,5.627098e+00,1,[10110],"[2074, 3274, 41, 10225, 1957, 2573, 3306, 5804...","[5864, 30, 5316, 44, 4035, 17185, 4053, 4299, ..."
17292,Sansad TV,25,Science Monitor | 14.08.2021,1.HUMAN-BASED MODELS TO STUDY NEURODEVELOPMENT...,32,4.421341e+00,0,"[3250, 900]","[1815, 6401, 68, 2386, 18, 4589, 18, 2158]","[21, 18, 4896, 17, 5122, 8991, 4027, 4331, 107..."
23077,The Truth Show,27,Trick for Reactivity Series of Metals #shorts ...,Join our Telegram Group ATP STAR JEE/NEET 2024...,18,6.748217e+00,1,"[829, 3098, 1169]","[2295, 1869, 7835, 2475, 1846, 2629, 7, 1897, ...","[4365, 4093, 4410, 4347, 9114, 5460, 4487, 19,..."


In [19]:
idx = train.sample(1, random_state=524).index.tolist()[0]
print('channel title:')
print(train.at[idx,'channel_title'])
print('channel title tokenized:')
print(train.at[idx,'channel_title_tokenized'])
print('video title: ')
print(train.at[idx,'video_title'])
print('video title tokenized:')
print(train.at[idx,'video_title_tokenized'])
print('video description:')
print(train.at[idx,'video_description'])
print('video description tokenized:')
print(train.at[idx,'video_description_tokenized'])

channel title:
CrashCourse
channel title tokenized:
['1946']
video title: 
Micro-Biology: Crash Course History of Science #24
video title tokenized:
['2635', '17', '1915', '30', '3465', '2299', '2744', '1846', '1815', '7', '2763']
video description:
It's all about the SUPER TINY in this episode of Crash Course: History of Science. In it, Hank Green talks about germ theory, John Snow (the other one), pasteurization,  and why following our senses isn't always the worst idea. 

***

Crash Course is on Patreon! You can support us directly by signing up at http://www.patreon.com/crashcourse

Thanks to the following Patrons for their generous monthly contributions that help keep Crash Course free for everyone forever:

Mark Brouwer, Kenneth F Penttinen, Trevin Beattie, Satya Ridhima Parvathaneni, Erika & Alexa Saur, Glenn Elliott, Justin Zingsheim, Jessica Wode, Eric Prestemon, Kathrin Benoit, Tom Trval, Jason Saslow, Nathan Taylor, Brian Thomas Gossett, Khaled El Shalakany, Indika Siriwarde

We are now ready to apply machine learning techniques to the tokenised text. To perform a frequency analysis on the tokenised data we can use TfidfVectorizer() from scikit-learn, which efficiently counts the tokens in a text and generates a vector consisting of a numerical description of the token frequencies. Rather than simply counting the token frequency in the individual samples (the *term frequency*), however, TfidfVectorizer also by default incorporates the frequencies of the tokens in the entire training corpus (the *document frequency*). It multiplies the term frequency of each token $i$ by a weight, the inverse document frequency, whose textbook form is IDF = $\log(\frac{N_{\text{samples}}}{N_{\text{samples containing }i}})$ and which describes the specificity of the token to the sample. With its default setting smooth_idf=True, scikit-learn uses the smoothed variant IDF = $\log(\frac{1 + N_{\text{samples}}}{1 + N_{\text{samples containing }i}}) + 1$.
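
The inverse document frequency can be computed by hand on a small example. The sketch below uses a hypothetical toy corpus of token-id lists and evaluates both the textbook form $\log(N/\text{df})$ and the smoothed form $\log((1+N)/(1+\text{df})) + 1$ that scikit-learn applies when smooth_idf=True:

```python
import math

# Hypothetical toy corpus: each document is a list of token ids (as strings).
docs = [
    ['1815', '7', '2062'],
    ['1815', '17', '1915'],
    ['1815', '7', '2763'],
    ['3465', '2299'],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each token.
df = {}
for doc in docs:
    for token in set(doc):
        df[token] = df.get(token, 0) + 1

idf_textbook = {t: math.log(n_docs / d) for t, d in df.items()}
idf_smoothed = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

for token in ['1815', '7', '3465']:
    print(token, df[token], round(idf_textbook[token], 3), round(idf_smoothed[token], 3))
```

In both variants a token that occurs in most documents receives a smaller weight than one that is specific to a single document.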

The parameters are:
* ngram_range: rather than considering individual tokens, we can consider pairs, triples, etc. of consecutive tokens and perform frequency analysis on these larger units. These are known as n-grams, with $n=1,2,3, \dots$ being the number of consecutive tokens that form the unit. The ngram_range is a tuple (n,m) with $n$ and $m$ being the minimum and maximum sizes of the n-grams used in generating features from the tokenised text.
* min_df, max_df: we can filter the tokens by the minimum and maximum document frequency (given as an absolute count or as a fraction of the documents) that a token may have, which allows for dimensionality reduction.
* use_idf: this allows the incorporation of the IDF factor into the vector representation of the text: without it, the text is represented as a set of numbers corresponding to the frequency of each token or n-gram appearing in the text, with a normalisation factor. With the default option 'use_idf=True', this frequency is multiplied by a factor (IDF) that down-weights tokens that appear in a large number of documents.
* norm: with 'l1', each feature vector is normalised so that the sum of the absolute values of its features is unity; with 'l2', the sum of the squares is unity.
* sublinear_tf: this replaces each term frequency tf with $1 + \log(\text{tf})$, so that repeated occurrences of a token contribute less than linearly.

We will introduce a function that fits a separate vectoriser to each of the channel names, video titles and descriptions and then stacks the three resulting feature matrices side by side. We'll also determine the effect of incorporating the video category, which will be one-hot encoded and stacked with the vectoriser output.

In [20]:
from sklearn.preprocessing import OneHotEncoder

video_category_encoder = OneHotEncoder()
video_category_encoder.fit(train[['video_category']])
video_category_encoder.categories_[0]

array([ 1,  2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29])

In [21]:
from scipy.sparse import csr_matrix, hstack

def dummy(x):
    return x

train_texts_tokenized = {'channel_title': train['channel_title_tokenized'],
                           'video_title': train['video_title_tokenized'],
                           'video_description': train['video_description_tokenized']}

def get_features(ngram_range=(1,1), min_df=1, max_df=1.0, verbose=True, use_idf=True, norm='l2', sublinear_tf=False, video_category_encoder=None):
    vectorizers = {}
    X_vectorized = {}
    for field in train_texts_tokenized:
        vectorizers[field] = TfidfVectorizer(preprocessor=dummy, tokenizer=dummy, ngram_range=ngram_range, min_df=min_df, max_df=max_df, token_pattern=None, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
        X_vectorized[field] = vectorizers[field].fit_transform(train_texts_tokenized[field])
        if verbose:
            print(f"Fit tfidf vectorizer with {len(vectorizers[field].get_feature_names_out())} features in the {ngram_range} ngram range.")

    if video_category_encoder is not None:
        X_category = video_category_encoder.transform(train[['video_category']]).toarray()
        X_train = hstack([X_category, X_vectorized['channel_title'], X_vectorized['video_title'], X_vectorized['video_description']])
    else:
        X_train = hstack([X_vectorized['channel_title'], X_vectorized['video_title'], X_vectorized['video_description']])
    return X_train, vectorizers
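
The stacking step can be illustrated in isolation. The sketch below one-hot encodes a hypothetical category column and places it next to a small sparse matrix standing in for the vectoriser output; all values are made up:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import OneHotEncoder

# Hypothetical category column for four videos.
categories = np.array([[22], [27], [22], [28]])
encoder = OneHotEncoder()
X_category = encoder.fit_transform(categories)  # sparse matrix by default

# Stand-in for the tf-idf features of the same four videos.
X_text = csr_matrix(np.array([
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.6, 0.8],
]))

# Columns are concatenated: 3 category indicators followed by 3 text features.
X_combined = hstack([X_category, X_text]).tocsr()
print(encoder.categories_[0])
print(X_combined.toarray())
```

Since scipy's hstack accepts sparse inputs directly, the one-hot block can stay sparse, which matters little for 15 categories but keeps the combined matrix in a single format.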

Let's look at the number of features for each n-gram range:

In [22]:
for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
    _,_ = get_features(ngram_range=ngram_range)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

## Multinomial naive Bayes

We see that for the higher n-gram ranges we have millions of features (over 9 million in total for the (1,5) range), which is orders of magnitude larger than the training sample size of 25,328.

We'll start our exploration of classical machine learning approaches with the multinomial naive Bayes model, which is known to perform well for text classification tasks with the tf-idf approach despite the large vocabularies. This model has two main advantages: for the number of features we are considering, it is comparatively fast, and it requires tuning of only one hyperparameter, the additive (Laplace) smoothing parameter $\alpha$, which can be fixed by cross-validation to minimise overfitting.

We'll vary the n-gram range from (1,1) (only single tokens) to (1,5), as well as the vectoriser settings (use_idf = [True, False], norm = ['l1', 'l2'] and sublinear_tf = [True, False]), and use Bayesian hyperparameter tuning with the Optuna library to optimise the value of the smoothing parameter $\alpha$.
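
To see what $\alpha$ does, recall that multinomial naive Bayes estimates the probability of feature $i$ in class $c$ as $(N_{ci} + \alpha)/(N_c + \alpha n)$, where $N_{ci}$ is the total weight of feature $i$ in class $c$, $N_c = \sum_i N_{ci}$ and $n$ is the number of features. The sketch below checks this against scikit-learn's MultinomialNB on a hypothetical 4×3 feature matrix:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical non-negative features for four samples in two classes.
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 2.0]])
y = np.array([0, 0, 1, 1])
alpha = 0.1

model = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class feature totals N_ci, then the smoothed estimate by hand.
feature_totals = np.vstack([X[y == c].sum(axis=0) for c in model.classes_])
manual = (feature_totals + alpha) / (
    feature_totals.sum(axis=1, keepdims=True) + alpha * X.shape[1]
)

print(np.exp(model.feature_log_prob_))
print(manual)
```

Without smoothing, the third feature would have probability zero in class 0 and any sample containing it could never be assigned to that class. A larger $\alpha$ pulls all probabilities towards uniform, which is why its value has to be tuned.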

In [23]:
from sklearn.metrics import *

import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.simplefilter("ignore", UndefinedMetricWarning)

In [24]:
from sklearn.model_selection import cross_validate, KFold
import optuna

max_trials=100

def objective(trial, X_train, y_train, estimator, get_params, scoring):
    np.random.seed(524)
    params = get_params(trial=trial)
    model = estimator(**params)
    scores = cross_validate(model, X_train, y_train, scoring=scoring, cv=KFold(n_splits=5, random_state=524, shuffle=True), n_jobs=-1, verbose=0)
    return np.mean(scores['test_score'])

def report_optuna_results(X_train, y_train, estimator, get_params, scoring):
    sampler = optuna.samplers.TPESampler(seed=524)
    study = optuna.create_study(sampler=sampler, direction='maximize')
    study.optimize(lambda trial: objective(trial, X_train, y_train, estimator, get_params, scoring), n_trials=max_trials)
    return study.best_params
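
The quantity being maximised here is the mean 5-fold cross-validated score as a function of $\alpha$. As a simplified stand-in for the Optuna search (a plain scan over a log-spaced grid, not the TPE sampler used above), the same objective can be sketched on hypothetical synthetic count data:

```python
import numpy as np
from sklearn.model_selection import cross_validate, KFold
from sklearn.naive_bayes import MultinomialNB

# Hypothetical synthetic data: Poisson counts whose rates depend on the
# class for the first five features only.
rng = np.random.default_rng(524)
n_samples, n_features = 200, 30
y_toy = rng.integers(0, 2, size=n_samples)
rates = np.ones((n_samples, n_features))
rates[:, :5] += 2.0 * y_toy[:, None]
X_toy = rng.poisson(rates).astype(float)

cv = KFold(n_splits=5, shuffle=True, random_state=524)
alphas = np.logspace(-9, 1, 11)  # same range as the search space for alpha

mean_scores = []
for alpha in alphas:
    scores = cross_validate(MultinomialNB(alpha=alpha), X_toy, y_toy,
                            scoring='accuracy', cv=cv)
    mean_scores.append(np.mean(scores['test_score']))

best_alpha = alphas[int(np.argmax(mean_scores))]
print(best_alpha, max(mean_scores))
```

A grid scan spends the same effort everywhere, whereas a sampler such as TPE concentrates later trials near values of $\alpha$ that scored well in earlier ones.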

In [25]:
def report_tuned_models(X_trains, y_train, params_fixed, estimator, get_params, scoring_tune, scoring_report):
    results_list = []
    for n in range(len(X_trains)):
        X_train = X_trains[n]

        best = report_optuna_results(X_train, y_train, estimator, get_params, scoring_tune)
        model = estimator(**get_params(best=best))
        scores = cross_validate(model, X_train, y_train, scoring=scoring_report, cv=KFold(n_splits=5, random_state=524, shuffle=True), n_jobs=-1, verbose=5)

        cv_results = {}
        for param in params_fixed:
            cv_results[param] = params_fixed[param][n]
        cv_results['mean_fit_time'] = np.mean(scores['fit_time'])
        for score in scoring_report:
            cv_results[score] = f'{np.min(scores["test_"+score]):.5f}/{np.mean(scores["test_"+score]):.5f}/{np.max(scores["test_"+score]):.5f}'
        for param in best:
            cv_results[param] = best[param]
        results_list.append(cv_results)
        print(pd.DataFrame(results_list))
    return results_list
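
Each metric cell produced by this function packs the minimum, mean and maximum over the five folds into a single string of the form min/mean/max. A small helper (hypothetical, not part of the notebook) can unpack such a cell for further analysis:

```python
def parse_min_mean_max(cell):
    """Split a 'min/mean/max' string into its three floats."""
    low, mean, high = (float(part) for part in cell.split('/'))
    return {'min': low, 'mean': mean, 'max': high}

# Example cell taken from the accuracy column of the results table.
accuracy = parse_min_mean_max('0.77813/0.78388/0.79017')
print(accuracy)
```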

In [26]:
from sklearn.naive_bayes import MultinomialNB

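def get_params_mnB(trial=None, best=None):
    if trial is not None:
        # During tuning, sample alpha on a log scale.
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
    elif best is not None:
        # After tuning, reuse the best value found.
        alpha = best['alpha']
    else:
        raise ValueError('Either trial or best must be provided.')
    return {'alpha': alpha}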

In [27]:
X_trains = []
params_fixed = {'vectorizer_type': [], 'norm': [], 'ngram_range': []}

vectorizer_names = {(False, False): 'TF', (False, True): 'log(TF)', (True, False): 'TF-IDF', (True, True): 'log(TF)-IDF'}

for use_idf in [False, True]:
    for norm in ['l1', 'l2']:
        for sublinear_tf in [False, True]:
            for ngram_range in [(1,1), (1,2), (1,3), (1,4), (1,5)]:
                # Label each configuration by its (use_idf, sublinear_tf) combination.
                params_fixed['vectorizer_type'].append(vectorizer_names[(use_idf, sublinear_tf)])
                params_fixed['norm'].append(norm)
                params_fixed['ngram_range'].append(ngram_range)

                X_train, _ = get_features(ngram_range=ngram_range, use_idf=use_idf, norm=norm, sublinear_tf=sublinear_tf)
                X_trains.append(X_train)

Fit tfidf vectorizer with 12424 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 24974 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 26901 features in the (1, 1) ngram range.
Fit tfidf vectorizer with 29438 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 158781 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 995189 features in the (1, 2) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 44876 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 585682 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 5766326 features in the (1, 4) ngram range.
Fit tfidf vectorizer with 47958 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 810580 features in the (1, 5) ngram range.
Fit tfidf vectorizer with 8696668 featu

In [28]:
%%time
mnB_tune_ngrams = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))


[I 2024-05-16 02:30:36,435] A new study created in memory with name: no-name-fd6ee78b-31ea-44bf-b93a-75fb743e6fa3
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  


[I 2024-05-16 02:31:02,348] Trial 0 finished with value: 0.7967861230766712 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-16 02:31:02,891] Trial 1 finished with value: 0.7942594202723458 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-16 02:31:03,409] Trial 2 finished with value: 0.7928381026910722 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-16 02:31:03,977] Trial 3 finished with value: 0.7132422993777302 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-16 02:31:04,553] Trial 4 finished with value: 0.7944962233951134 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7967861230766712.
[I 2024-05-16 02:31:05,082] Trial 5 finished with value: 0.7945357100683612 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  


[I 2024-05-16 02:31:56,125] Trial 0 finished with value: 0.7970624050782387 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-16 02:31:57,129] Trial 1 finished with value: 0.7946937191169358 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-16 02:31:58,096] Trial 2 finished with value: 0.7944567523107615 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-16 02:31:59,100] Trial 3 finished with value: 0.7181774554167321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-16 02:32:00,072] Trial 4 finished with value: 0.7924429553584686 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7970624050782387.
[I 2024-05-16 02:32:01,060] Trial 5 finished with value: 0.7925219209105163 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  
2  0.86774/0.87366/0.88516 6.921634e-03  


[I 2024-05-16 02:33:36,798] Trial 0 finished with value: 0.797733678523451 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-16 02:33:38,253] Trial 1 finished with value: 0.7896793714868962 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-16 02:33:39,704] Trial 2 finished with value: 0.7913771503420399 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-16 02:33:41,130] Trial 3 finished with value: 0.7191644741534158 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-16 02:33:42,569] Trial 4 finished with value: 0.7875472080482352 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.797733678523451.
[I 2024-05-16 02:33:44,041] Trial 5 finished with value: 0.7873103113920923 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   

                   roc_auc        alpha  
0  0.84670/0.85191/0.86416 7.060859e-02  
1  0.86802/0.87337/0.88423 1.087060e-02  
2  0.86774/0.87366/0.88516 6.921634e-03  
3  0.86564/0.87216/0.88416 4.012247e-03 

[I 2024-05-16 02:36:03,790] Trial 0 finished with value: 0.7996287972114583 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-16 02:36:05,720] Trial 1 finished with value: 0.7836780129146208 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-16 02:36:07,631] Trial 2 finished with value: 0.7879816705762319 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-16 02:36:09,549] Trial 3 finished with value: 0.7190459907503287 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-16 02:36:11,452] Trial 4 finished with value: 0.7835201597550049 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7996287972114583.
[I 2024-05-16 02:36:13,350] Trial 5 finished with value: 0.7835201675494529 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   
4  0.77510/0.78496/0.79442  0.68753/0.70184/0.72782  0.73422/0.74100/0.75862   

                   roc_auc        alpha  
0  0.846

[I 2024-05-16 02:39:17,670] Trial 0 finished with value: 0.7787823591377625 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-16 02:39:17,873] Trial 1 finished with value: 0.7731363416524776 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-16 02:39:18,121] Trial 2 finished with value: 0.7704121119485379 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-16 02:39:18,389] Trial 3 finished with value: 0.7219283775973536 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-16 02:39:18,602] Trial 4 finished with value: 0.7772030247134663 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7787823591377625.
[I 2024-05-16 02:39:18,810] Trial 5 finished with value: 0.7772030169190184 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.73657/0.74433/0.76296   
4  0.77510/0.78496/0.79442  0.68753/0.70184/0.72782  

[I 2024-05-16 02:39:38,852] Trial 0 finished with value: 0.7965097553361764 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-16 02:39:39,364] Trial 1 finished with value: 0.7948122259033668 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-16 02:39:39,892] Trial 2 finished with value: 0.7929565705052634 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-16 02:39:40,434] Trial 3 finished with value: 0.7130449283670748 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-16 02:39:40,950] Trial 4 finished with value: 0.7942198322712748 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7965097553361764.
[I 2024-05-16 02:39:41,465] Trial 5 finished with value: 0.7941803455980271 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73452/0.74075/0.75853   
3  0.76836/0.77903/0.78942  0.69837/0.71275/0.74135  0.

[I 2024-05-16 02:40:32,840] Trial 0 finished with value: 0.7977730872522193 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-16 02:40:33,909] Trial 1 finished with value: 0.794535733451705 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-16 02:40:35,069] Trial 2 finished with value: 0.7946541155269689 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-16 02:40:36,133] Trial 3 finished with value: 0.7179800298449411 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-16 02:40:37,196] Trial 4 finished with value: 0.7933905263941442 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7977730872522193.
[I 2024-05-16 02:40:38,288] Trial 5 finished with value: 0.7933510475153444 and parameters: {'alpha': 0.00039799342667825053}. Best i

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.499852e-01  0.80024/0.80930/0.82057   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.73541/0.74212/0.75787   
2  0.78280/0.79023/0.79700  0.68112/0.69725/0.72431  0.73

[I 2024-05-16 02:42:17,165] Trial 0 finished with value: 0.7978915238886188 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-16 02:42:18,607] Trial 1 finished with value: 0.7900741992471343 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-16 02:42:20,057] Trial 2 finished with value: 0.7910612491616098 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-16 02:42:21,478] Trial 3 finished with value: 0.7190460219281204 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-16 02:42:22,932] Trial 4 finished with value: 0.7875076278416121 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7978915238886188.
[I 2024-05-16 02:42:24,353] Trial 5 finished with value: 0.7872312678955653 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.499852e-01  0.80024/0.80930/0.82057   
8         log(TF)   l1      (1, 4)   1.012142e+00  0.79807/0.80764/0.81761   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/0.71037/0.72209   
1  0.76758/0.77658/0.78679  0.69837/0.71073/0.73584  0.7354

[I 2024-05-16 02:44:43,496] Trial 0 finished with value: 0.7989181150374776 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-16 02:44:45,413] Trial 1 finished with value: 0.7835595840726692 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-16 02:44:47,401] Trial 2 finished with value: 0.7879816627817839 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-16 02:44:49,388] Trial 3 finished with value: 0.7192434007332238 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-16 02:44:51,401] Trial 4 finished with value: 0.7833621662953262 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7989181150374776.
[I 2024-05-16 02:44:53,317] Trial 5 finished with value: 0.7830857907603834 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0              TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1              TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2              TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3              TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4              TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5         log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6         log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   
7         log(TF)   l1      (1, 3)   6.499852e-01  0.80024/0.80930/0.82057   
8         log(TF)   l1      (1, 4)   1.012142e+00  0.79807/0.80764/0.81761   
9         log(TF)   l1      (1, 5)   1.422942e+00  0.79747/0.80697/0.81820   

                 precision                   recall                       f1  \
0  0.74346/0.75360/0.76936  0.65599/0.67204/0.69223  0.70637/

[I 2024-05-16 02:47:57,489] Trial 0 finished with value: 0.7785452130592858 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-16 02:47:57,702] Trial 1 finished with value: 0.7729781767149442 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-16 02:47:57,908] Trial 2 finished with value: 0.7700170503548618 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7785452130592858.
[I 2024-05-16 02:47:58,117] Trial 3 finished with value: 0.7787427321644519 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7787427321644519.
[I 2024-05-16 02:47:58,347] Trial 4 finished with value: 0.7765711599970225 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7787427321644519.
[I 2024-05-16 02:47:58,573] Trial 5 finished with value: 0.7765316733237747 and parameters: {'alpha': 0.00039799342667825053}. Best 

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1               TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2               TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3               TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4               TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5          log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6          log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   
7          log(TF)   l1      (1, 3)   6.499852e-01  0.80024/0.80930/0.82057   
8          log(TF)   l1      (1, 4)   1.012142e+00  0.79807/0.80764/0.81761   
9          log(TF)   l1      (1, 5)   1.422942e+00  0.79747/0.80697/0.81820   
10              TF   l2      (1, 1)   4.091349e-02  0.78168/0.78834/0.80063   

                  precision                   recal

[I 2024-05-16 02:48:19,351] Trial 0 finished with value: 0.7980100462639456 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-16 02:48:19,865] Trial 1 finished with value: 0.7973783686142524 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-16 02:48:20,397] Trial 2 finished with value: 0.7956806754980359 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-16 02:48:20,935] Trial 3 finished with value: 0.754461382212836 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7980100462639456.
[I 2024-05-16 02:48:21,447] Trial 4 finished with value: 0.7980101086195293 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.7980101086195293.
[I 2024-05-16 02:48:21,963] Trial 5 finished with value: 0.7979311430674816 and parameters: {'alpha': 0.00039799342667825053}. Best i

   vectorizer_type norm ngram_range  mean_fit_time                 accuracy  \
0               TF   l1      (1, 1)   6.042886e-02  0.77813/0.78388/0.79017   
1               TF   l1      (1, 2)   2.756785e-01  0.79688/0.80520/0.81484   
2               TF   l1      (1, 3)   6.641018e-01  0.80004/0.80753/0.81840   
3               TF   l1      (1, 4)   1.014617e+00  0.79767/0.80689/0.81859   
4               TF   l1      (1, 5)   1.387816e+00  0.79807/0.80650/0.81761   
5          log(TF)   l1      (1, 1)   2.643867e-02  0.77852/0.78388/0.78918   
6          log(TF)   l1      (1, 2)   3.309343e-01  0.79668/0.80599/0.81563   
7          log(TF)   l1      (1, 3)   6.499852e-01  0.80024/0.80930/0.82057   
8          log(TF)   l1      (1, 4)   1.012142e+00  0.79807/0.80764/0.81761   
9          log(TF)   l1      (1, 5)   1.422942e+00  0.79747/0.80697/0.81820   
10              TF   l2      (1, 1)   4.091349e-02  0.78168/0.78834/0.80063   
11              TF   l2      (1, 2)   2.792049e-01  

[I 2024-05-16 02:49:13,094] Trial 0 finished with value: 0.7933907290497906 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7933907290497906.
[I 2024-05-16 02:49:14,079] Trial 1 finished with value: 0.7976943555336099 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7976943555336099.
[I 2024-05-16 02:49:15,065] Trial 2 finished with value: 0.7977733132912095 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-16 02:49:16,035] Trial 3 finished with value: 0.7621207367779856 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-16 02:49:17,071] Trial 4 finished with value: 0.7952463376812063 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7977733132912095.
[I 2024-05-16 02:49:18,027] Trial 5 finished with value: 0.7951673877180545 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 02:50:53,135] Trial 0 finished with value: 0.7842704143411607 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7842704143411607.
[I 2024-05-16 02:50:54,552] Trial 1 finished with value: 0.7923246901999238 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7923246901999238.
[I 2024-05-16 02:50:56,006] Trial 2 finished with value: 0.7948516424265831 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-16 02:50:57,436] Trial 3 finished with value: 0.7676877263556396 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-16 02:50:58,892] Trial 4 finished with value: 0.7869947064006838 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7948516424265831.
[I 2024-05-16 02:51:00,386] Trial 5 finished with value: 0.7872315874679308 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 02:53:19,695] Trial 0 finished with value: 0.7758607740120634 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7758607740120634.
[I 2024-05-16 02:53:21,612] Trial 1 finished with value: 0.7884950908618282 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7884950908618282.
[I 2024-05-16 02:53:23,523] Trial 2 finished with value: 0.7914957506618461 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7914957506618461.
[I 2024-05-16 02:53:25,408] Trial 3 finished with value: 0.771004115858233 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7914957506618461.
[I 2024-05-16 02:53:27,303] Trial 4 finished with value: 0.7795720770138223 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7914957506618461.
[I 2024-05-16 02:53:29,230] Trial 5 finished with value: 0.7796115714815179 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 02:56:33,020] Trial 0 finished with value: 0.7812300184455611 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7812300184455611.
[I 2024-05-16 02:56:33,115] Trial 1 finished with value: 0.7752286676677336 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7812300184455611.
[I 2024-05-16 02:56:33,327] Trial 2 finished with value: 0.7719121924262129 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7812300184455611.
[I 2024-05-16 02:56:33,526] Trial 3 finished with value: 0.7855335436015571 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7855335436015571.
[I 2024-05-16 02:56:33,757] Trial 4 finished with value: 0.7794533285995053 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7855335436015571.
[I 2024-05-16 02:56:33,970] Trial 5 finished with value: 0.779492815272753 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 02:56:54,449] Trial 0 finished with value: 0.8040902924437894 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8040902924437894.
[I 2024-05-16 02:56:54,955] Trial 1 finished with value: 0.8015239626661532 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8040902924437894.
[I 2024-05-16 02:56:55,466] Trial 2 finished with value: 0.7995104307250902 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8040902924437894.
[I 2024-05-16 02:56:56,018] Trial 3 finished with value: 0.7685958730736508 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8040902924437894.
[I 2024-05-16 02:56:56,526] Trial 4 finished with value: 0.8036559702158556 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8040902924437894.
[I 2024-05-16 02:56:57,065] Trial 5 finished with value: 0.8036164913370557 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 02:57:48,429] Trial 0 finished with value: 0.8015240172272888 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-16 02:57:49,406] Trial 1 finished with value: 0.8012081238413066 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-16 02:57:50,384] Trial 2 finished with value: 0.8007738405856124 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-16 02:57:51,259] Trial 3 finished with value: 0.7741626833790023 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-16 02:57:52,245] Trial 4 finished with value: 0.801129150494811 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8015240172272888.
[I 2024-05-16 02:57:53,223] Trial 5 finished with value: 0.8012870815989063 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 02:59:28,410] Trial 0 finished with value: 0.7935881624160295 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7935881624160295.
[I 2024-05-16 02:59:29,819] Trial 1 finished with value: 0.7978522398710174 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7978522398710174.
[I 2024-05-16 02:59:31,261] Trial 2 finished with value: 0.7985629844005816 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7985629844005816.
[I 2024-05-16 02:59:32,711] Trial 3 finished with value: 0.7782293040844076 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7985629844005816.
[I 2024-05-16 02:59:34,154] Trial 4 finished with value: 0.7950885079049341 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7985629844005816.
[I 2024-05-16 02:59:35,620] Trial 5 finished with value: 0.7950095423528865 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:02:02,749] Trial 0 finished with value: 0.7852180944991074 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7852180944991074.
[I 2024-05-16 03:02:04,759] Trial 1 finished with value: 0.7948517203710624 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7948517203710624.
[I 2024-05-16 03:02:06,716] Trial 2 finished with value: 0.795286073776788 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.795286073776788.
[I 2024-05-16 03:02:08,618] Trial 3 finished with value: 0.7810325071348428 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.795286073776788.
[I 2024-05-16 03:02:10,620] Trial 4 finished with value: 0.7890479432595368 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.795286073776788.
[I 2024-05-16 03:02:12,536] Trial 5 finished with value: 0.7891269166060323 and parameters: {'alpha': 0.00039799342667825053}. Best is t

[I 2024-05-16 03:05:17,780] Trial 1 finished with value: 0.7702146629934032 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7750314135737971.
[I 2024-05-16 03:05:17,986] Trial 2 finished with value: 0.7652793666543385 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7750314135737971.
[I 2024-05-16 03:05:18,198] Trial 3 finished with value: 0.7466832636444733 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7750314135737971.
[I 2024-05-16 03:05:18,411] Trial 4 finished with value: 0.7741627847068255 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7750314135737971.
[I 2024-05-16 03:05:18,618] Trial 5 finished with value: 0.7740838113603299 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.7750314135737971.
[I 2024-05-16 03:05:18,816] Trial 6 finished with value: 0.770727911801145 and parameters: {'alpha': 2.6074972019493715e-05}. Best 

[I 2024-05-16 03:05:37,539] Trial 0 finished with value: 0.7917719936911739 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7917719936911739.
[I 2024-05-16 03:05:38,106] Trial 1 finished with value: 0.7917719781022781 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7917719936911739.
[I 2024-05-16 03:05:38,632] Trial 2 finished with value: 0.7902716092300295 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7917719936911739.
[I 2024-05-16 03:05:39,170] Trial 3 finished with value: 0.7341675003478272 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7917719936911739.
[I 2024-05-16 03:05:39,594] Trial 4 finished with value: 0.7906665227291948 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7917719936911739.
[I 2024-05-16 03:05:40,079] Trial 5 finished with value: 0.7906665227291947 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:06:31,121] Trial 0 finished with value: 0.7888499954597341 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7888499954597341.
[I 2024-05-16 03:06:32,105] Trial 1 finished with value: 0.7888503384154433 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7888503384154433.
[I 2024-05-16 03:06:33,060] Trial 2 finished with value: 0.7905085682417557 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-16 03:06:34,015] Trial 3 finished with value: 0.7334173548839427 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-16 03:06:34,986] Trial 4 finished with value: 0.7846255449780567 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7905085682417557.
[I 2024-05-16 03:06:35,951] Trial 5 finished with value: 0.7847439894089041 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:08:10,570] Trial 0 finished with value: 0.7863232302998251 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7863232302998251.
[I 2024-05-16 03:08:12,026] Trial 1 finished with value: 0.782059285350452 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7863232302998251.
[I 2024-05-16 03:08:13,444] Trial 2 finished with value: 0.7861260307670244 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7863232302998251.
[I 2024-05-16 03:08:14,867] Trial 3 finished with value: 0.732825054785226 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7863232302998251.
[I 2024-05-16 03:08:16,282] Trial 4 finished with value: 0.7770448987481726 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7863232302998251.
[I 2024-05-16 03:08:17,746] Trial 5 finished with value: 0.7769264543173252 and parameters: {'alpha': 0.00039799342667825053}. Best is

[I 2024-05-16 03:10:36,109] Trial 0 finished with value: 0.7884158836818946 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7884158836818946.
[I 2024-05-16 03:10:38,006] Trial 1 finished with value: 0.7757816759544008 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7884158836818946.
[I 2024-05-16 03:10:39,922] Trial 2 finished with value: 0.782572619897121 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7884158836818946.
[I 2024-05-16 03:10:41,847] Trial 3 finished with value: 0.7319170093950378 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7884158836818946.
[I 2024-05-16 03:10:43,788] Trial 4 finished with value: 0.7716754048923412 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7884158836818946.
[I 2024-05-16 03:10:45,716] Trial 5 finished with value: 0.7716753893034454 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 03:13:50,416] Trial 1 finished with value: 0.7692671776966549 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7745181881493992.
[I 2024-05-16 03:13:50,618] Trial 2 finished with value: 0.764292324534311 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7745181881493992.
[I 2024-05-16 03:13:50,837] Trial 3 finished with value: 0.7464463202216429 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7745181881493992.
[I 2024-05-16 03:13:50,968] Trial 4 finished with value: 0.7731757971479336 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7745181881493992.
[I 2024-05-16 03:13:51,066] Trial 5 finished with value: 0.7730573605115341 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.7745181881493992.
[I 2024-05-16 03:13:51,271] Trial 6 finished with value: 0.7700962653292432 and parameters: {'alpha': 2.6074972019493715e-05}. Best 

[I 2024-05-16 03:14:11,556] Trial 0 finished with value: 0.7911797325646969 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7911797325646969.
[I 2024-05-16 03:14:12,078] Trial 1 finished with value: 0.7914561158940876 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7914561158940876.
[I 2024-05-16 03:14:12,589] Trial 2 finished with value: 0.7908638781509543 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7914561158940876.
[I 2024-05-16 03:14:13,108] Trial 3 finished with value: 0.7341674769644835 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7914561158940876.
[I 2024-05-16 03:14:13,615] Trial 4 finished with value: 0.7908243525054669 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7914561158940876.
[I 2024-05-16 03:14:14,198] Trial 5 finished with value: 0.7908638157953709 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:15:05,370] Trial 0 finished with value: 0.7892448310144202 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7892448310144202.
[I 2024-05-16 03:15:06,346] Trial 1 finished with value: 0.7891661694458421 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7892448310144202.
[I 2024-05-16 03:15:07,296] Trial 2 finished with value: 0.7908638937398502 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-16 03:15:08,264] Trial 3 finished with value: 0.7332989104530951 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-16 03:15:09,235] Trial 4 finished with value: 0.7848228926053683 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7908638937398502.
[I 2024-05-16 03:15:10,183] Trial 5 finished with value: 0.7848228848109204 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:16:47,662] Trial 0 finished with value: 0.7867575135555194 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7867575135555194.
[I 2024-05-16 03:16:49,089] Trial 1 finished with value: 0.7821777453701954 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7867575135555194.
[I 2024-05-16 03:16:50,500] Trial 2 finished with value: 0.7861654550846886 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7867575135555194.
[I 2024-05-16 03:16:51,917] Trial 3 finished with value: 0.733022464768121 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7867575135555194.
[I 2024-05-16 03:16:53,370] Trial 4 finished with value: 0.7766105453424471 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7867575135555194.
[I 2024-05-16 03:16:54,804] Trial 5 finished with value: 0.7767290131566383 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 03:19:13,098] Trial 0 finished with value: 0.7885737524304063 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7885737524304063.
[I 2024-05-16 03:19:15,049] Trial 1 finished with value: 0.775426272511827 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7885737524304063.
[I 2024-05-16 03:19:16,941] Trial 2 finished with value: 0.7827305042345287 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7885737524304063.
[I 2024-05-16 03:19:18,822] Trial 3 finished with value: 0.7319564726849418 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7885737524304063.
[I 2024-05-16 03:19:20,783] Trial 4 finished with value: 0.7714384926473025 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7885737524304063.
[I 2024-05-16 03:19:22,672] Trial 5 finished with value: 0.7713200404220071 and parameters: {'alpha': 0.00039799342667825053}. Best i

[I 2024-05-16 03:22:26,726] Trial 0 finished with value: 0.7760974212458723 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7760974212458723.
[I 2024-05-16 03:22:26,930] Trial 1 finished with value: 0.7697013908023175 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7760974212458723.
[I 2024-05-16 03:22:27,146] Trial 2 finished with value: 0.7663058954476136 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7760974212458723.
[I 2024-05-16 03:22:27,354] Trial 3 finished with value: 0.7888896146385969 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7888896146385969.
[I 2024-05-16 03:22:27,563] Trial 4 finished with value: 0.7740838191547779 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7888896146385969.
[I 2024-05-16 03:22:27,768] Trial 5 finished with value: 0.7741232980335777 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:22:46,566] Trial 0 finished with value: 0.798681187203543 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798681187203543.
[I 2024-05-16 03:22:47,080] Trial 1 finished with value: 0.7972205466324281 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.798681187203543.
[I 2024-05-16 03:22:47,595] Trial 2 finished with value: 0.7942200037491294 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.798681187203543.
[I 2024-05-16 03:22:48,101] Trial 3 finished with value: 0.7928378454742903 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.798681187203543.
[I 2024-05-16 03:22:48,621] Trial 4 finished with value: 0.7985628752783104 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.798681187203543.
[I 2024-05-16 03:22:49,115] Trial 5 finished with value: 0.7985628752783105 and parameters: {'alpha': 0.00039799342667825053}. Best is tri

[I 2024-05-16 03:23:39,911] Trial 0 finished with value: 0.7916536505881496 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7916536505881496.
[I 2024-05-16 03:23:40,866] Trial 1 finished with value: 0.7959965143228827 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7959965143228827.
[I 2024-05-16 03:23:41,903] Trial 2 finished with value: 0.7965494134872788 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7965494134872788.
[I 2024-05-16 03:23:42,891] Trial 3 finished with value: 0.7979704816462185 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7979704816462185.
[I 2024-05-16 03:23:43,843] Trial 4 finished with value: 0.7941410771693216 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7979704816462185.
[I 2024-05-16 03:23:44,796] Trial 5 finished with value: 0.7941410771693216 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:25:19,451] Trial 0 finished with value: 0.7837572278890024 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7837572278890024.
[I 2024-05-16 03:25:20,918] Trial 1 finished with value: 0.791337811763303 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.791337811763303.
[I 2024-05-16 03:25:22,384] Trial 2 finished with value: 0.7935488939873239 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7935488939873239.
[I 2024-05-16 03:25:23,850] Trial 3 finished with value: 0.8000235704105607 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8000235704105607.
[I 2024-05-16 03:25:25,271] Trial 4 finished with value: 0.7854943842951227 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8000235704105607.
[I 2024-05-16 03:25:26,700] Trial 5 finished with value: 0.785612852109314 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-16 03:27:45,364] Trial 0 finished with value: 0.7737287430790174 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7737287430790174.
[I 2024-05-16 03:27:47,292] Trial 1 finished with value: 0.7865211313329403 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7865211313329403.
[I 2024-05-16 03:27:49,228] Trial 2 finished with value: 0.7904692842241544 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7904692842241544.
[I 2024-05-16 03:27:51,163] Trial 3 finished with value: 0.8004579160218384 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8004579160218384.
[I 2024-05-16 03:27:53,065] Trial 4 finished with value: 0.7785456963150578 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8004579160218384.
[I 2024-05-16 03:27:54,984] Trial 5 finished with value: 0.7786246774560013 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:31:01,416] Trial 0 finished with value: 0.7774792053872106 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7774792053872106.
[I 2024-05-16 03:31:01,619] Trial 1 finished with value: 0.7705304550515623 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7774792053872106.
[I 2024-05-16 03:31:01,831] Trial 2 finished with value: 0.7683193416497496 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7774792053872106.
[I 2024-05-16 03:31:02,029] Trial 3 finished with value: 0.7912190711434339 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7912190711434339.
[I 2024-05-16 03:31:02,236] Trial 4 finished with value: 0.7758210612998255 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7912190711434339.
[I 2024-05-16 03:31:02,446] Trial 5 finished with value: 0.7758605323841774 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:31:21,887] Trial 0 finished with value: 0.802313688336661 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.802313688336661.
[I 2024-05-16 03:31:22,376] Trial 1 finished with value: 0.7991156964982273 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.802313688336661.
[I 2024-05-16 03:31:22,873] Trial 2 finished with value: 0.795917548770835 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.802313688336661.
[I 2024-05-16 03:31:23,421] Trial 3 finished with value: 0.7959568951440199 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.802313688336661.
[I 2024-05-16 03:31:23,823] Trial 4 finished with value: 0.801405603974233 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.802313688336661.
[I 2024-05-16 03:31:24,323] Trial 5 finished with value: 0.8014450906474808 and parameters: {'alpha': 0.00039799342667825053}. Best is trial

[I 2024-05-16 03:32:14,923] Trial 0 finished with value: 0.7956807300591715 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7956807300591715.
[I 2024-05-16 03:32:15,907] Trial 1 finished with value: 0.797694324355818 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.797694324355818.
[I 2024-05-16 03:32:16,871] Trial 2 finished with value: 0.7978916797775776 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7978916797775776.
[I 2024-05-16 03:32:17,827] Trial 3 finished with value: 0.7992733158244051 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7992733158244051.
[I 2024-05-16 03:32:18,845] Trial 4 finished with value: 0.7959174864152514 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7992733158244051.
[I 2024-05-16 03:32:19,808] Trial 5 finished with value: 0.796035915257203 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[I 2024-05-16 03:33:53,728] Trial 0 finished with value: 0.7860866064493601 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7860866064493601.
[I 2024-05-16 03:33:55,167] Trial 1 finished with value: 0.7931934515725104 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7931934515725104.
[I 2024-05-16 03:33:56,598] Trial 2 finished with value: 0.7952466026924361 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7952466026924361.
[I 2024-05-16 03:33:58,016] Trial 3 finished with value: 0.8013659770009225 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8013659770009225.
[I 2024-05-16 03:33:59,432] Trial 4 finished with value: 0.7892847697656482 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8013659770009225.
[I 2024-05-16 03:34:00,843] Trial 5 finished with value: 0.7891663175403528 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 03:36:18,965] Trial 0 finished with value: 0.7779138939541974 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7779138939541974.
[I 2024-05-16 03:36:20,880] Trial 1 finished with value: 0.789324373355615 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.789324373355615.
[I 2024-05-16 03:36:22,783] Trial 2 finished with value: 0.7913773997643737 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7913773997643737.
[I 2024-05-16 03:36:24,721] Trial 3 finished with value: 0.8015239081050177 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.8015239081050177.
[I 2024-05-16 03:36:26,662] Trial 4 finished with value: 0.7814674217408198 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.8015239081050177.
[I 2024-05-16 03:36:28,555] Trial 5 finished with value: 0.781664823929267 and parameters: {'alpha': 0.00039799342667825053}. Best is 

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.0s remaining:    3.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    2.0s finished


In [29]:
mnB_tune_ngrams = pd.DataFrame(mnB_tune_ngrams)
display(mnB_tune_ngrams.style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 1)",0.060429,0.77813/0.78388/0.79017,0.74346/0.75360/0.76936,0.65599/0.67204/0.69223,0.70637/0.71037/0.72209,0.84670/0.85191/0.86416,0.070609
TF,l1,"(1, 2)",0.275678,0.79688/0.80520/0.81484,0.76758/0.77658/0.78679,0.69837/0.71073/0.73584,0.73541/0.74212/0.75787,0.86802/0.87337/0.88423,0.010871
TF,l1,"(1, 3)",0.664102,0.80004/0.80753/0.81840,0.78280/0.79023/0.79700,0.68112/0.69725/0.72431,0.73452/0.74075/0.75853,0.86774/0.87366/0.88516,0.006922
TF,l1,"(1, 4)",1.014617,0.79767/0.80689/0.81859,0.76836/0.77903/0.78942,0.69837/0.71275/0.74135,0.73657/0.74433/0.76296,0.86564/0.87216/0.88416,0.004012
TF,l1,"(1, 5)",1.387816,0.79807/0.80650/0.81761,0.77510/0.78496/0.79442,0.68753/0.70184/0.72782,0.73422/0.74100/0.75862,0.86382/0.87053/0.88275,0.003352
log(TF),l1,"(1, 1)",0.026439,0.77852/0.78388/0.78918,0.74396/0.75643/0.77493,0.65500/0.66710/0.68521,0.70444/0.70886/0.71910,0.84677/0.85226/0.86451,0.077708
log(TF),l1,"(1, 2)",0.330934,0.79668/0.80599/0.81563,0.77007/0.77981/0.79039,0.69690/0.70823/0.73434,0.73385/0.74222/0.75828,0.86825/0.87379/0.88470,0.011628
log(TF),l1,"(1, 3)",0.649985,0.80024/0.80930/0.82057,0.77949/0.78914/0.79773,0.69197/0.70503/0.73333,0.73646/0.74464/0.76297,0.86793/0.87406/0.88573,0.006647
log(TF),l1,"(1, 4)",1.012142,0.79807/0.80764/0.81761,0.76861/0.78019/0.79320,0.70133/0.71354/0.73935,0.73722/0.74529/0.76149,0.86579/0.87252/0.88460,0.004018
log(TF),l1,"(1, 5)",1.422942,0.79747/0.80697/0.81820,0.76969/0.78043/0.79395,0.69887/0.71074/0.73584,0.73557/0.74388/0.76121,0.86391/0.87087/0.88318,0.003132


We can see that the (1,3) n-gram range consistently outperformed the lower ranges during cross-validation, but no improvement was seen for the (1,4) and (1,5) ranges. Let's look at the dependence on the other hyperparameters:

In [30]:
display(mnB_tune_ngrams[mnB_tune_ngrams['ngram_range']==(1,3)].style.hide())

vectorizer_type,norm,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,l1,"(1, 3)",0.664102,0.80004/0.80753/0.81840,0.78280/0.79023/0.79700,0.68112/0.69725/0.72431,0.73452/0.74075/0.75853,0.86774/0.87366/0.88516,0.006922
log(TF),l1,"(1, 3)",0.649985,0.80024/0.80930/0.82057,0.77949/0.78914/0.79773,0.69197/0.70503/0.73333,0.73646/0.74464/0.76297,0.86793/0.87406/0.88573,0.006647
TF,l2,"(1, 3)",0.688429,0.80932/0.81657/0.82925,0.75895/0.77732/0.79064,0.73780/0.74983/0.77043,0.75241/0.76328/0.78040,0.87760/0.88359/0.89375,0.076254
log(TF),l2,"(1, 3)",0.668702,0.80991/0.81878/0.82846,0.77439/0.78607/0.79538,0.73041/0.74283/0.75990,0.75700/0.76380/0.77724,0.87444/0.88138/0.89292,0.102185
TF-IDF,l1,"(1, 3)",0.634539,0.79688/0.80709/0.81879,0.77648/0.78908/0.79781,0.68260/0.69755/0.72331,0.73126/0.74043/0.75868,0.86652/0.87292/0.88448,0.010333
log(TF)-IDF,l1,"(1, 3)",0.61817,0.79668/0.80764/0.81899,0.77036/0.78300/0.79251,0.69837/0.70893/0.73484,0.73371/0.74406/0.76176,0.86649/0.87316/0.88484,0.009311
TF-IDF,l2,"(1, 3)",0.66236,0.81011/0.81870/0.82767,0.77565/0.78651/0.79403,0.72992/0.74184/0.75940,0.75683/0.76348/0.77633,0.87300/0.87943/0.89050,0.182905
log(TF)-IDF,l2,"(1, 3)",0.627617,0.80892/0.81976/0.83024,0.77184/0.78555/0.79604,0.73583/0.74713/0.76491,0.75629/0.76582/0.78016,0.87204/0.87916/0.89094,0.183701


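We find that L$^2$ normalisation performs better than L$^1$: the central accuracy value rises from about 0.807-0.809 to about 0.817-0.820. There is no clear improvement from including the IDF factor or from using log(TF) instead of TF, since within each norm the differences are at most about 0.3 percentage points.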
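For reference, the four vectorizer types and the two norms compared above can be written as `TfidfVectorizer` settings. This is a minimal sketch on a made-up toy corpus. It assumes that the `use_idf`, `sublinear_tf` and `norm` arguments of `get_features` are passed unchanged to scikit-learn's `TfidfVectorizer`, which this section does not show. The names `docs`, `configs` and `row_norms` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from the YouTube dataset)
docs = [
    "science for kids broken bones",
    "science for kids science experiment",
    "master of science in mechanical engineering",
]

# Assumed mapping from the labels in the results table to TfidfVectorizer flags
configs = {
    "TF": dict(use_idf=False, sublinear_tf=False),
    "log(TF)": dict(use_idf=False, sublinear_tf=True),
    "TF-IDF": dict(use_idf=True, sublinear_tf=False),
    "log(TF)-IDF": dict(use_idf=True, sublinear_tf=True),
}

row_norms = {}
for name, kwargs in configs.items():
    for norm in ("l1", "l2"):
        X = TfidfVectorizer(norm=norm, **kwargs).fit_transform(docs)
        if norm == "l1":
            # L1 norm of each row (all entries are non-negative)
            norms = np.asarray(X.sum(axis=1)).ravel()
        else:
            # L2 norm of each row
            norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
        row_norms[(name, norm)] = norms
        print(name, norm, norms)
```

Each row of the feature matrix has unit length in the chosen norm, so the normalisation only changes how weight is distributed among the tokens of a document, not the overall scale.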

## Including the video category

Next we incorporate the video category as an additional feature by passing `video_category_encoder` to `get_features`.
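How `get_features` combines the category with the text features is not shown in this section. One common approach, sketched here purely as an assumption, is to one-hot encode the category and stack it next to the sparse text matrix. The toy titles, category ids and labels are invented for illustration, and `OneHotEncoder` stands in for whatever `video_category_encoder` actually is.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: titles, YouTube category ids, popularity labels
titles = [
    "master of science in molecular biology",
    "science for kids broken bones",
    "faculty showcase exercise science",
    "fun science experiments for kids",
]
categories = np.array([[27], [24], [27], [24]])
labels = [0, 1, 0, 1]

# Text features: TF-IDF over the (1,3) n-gram range with L2 normalisation
vectorizer = TfidfVectorizer(ngram_range=(1, 3), norm="l2")
X_text = vectorizer.fit_transform(titles)

# Category features: one column per category seen during fitting
encoder = OneHotEncoder(handle_unknown="ignore")
X_cat = encoder.fit_transform(categories)

# Stack the two sparse blocks side by side
X = hstack([X_text, X_cat]).tocsr()

# One-hot columns are non-negative, so they are valid input for MultinomialNB
clf = MultinomialNB(alpha=0.1).fit(X, labels)
print(X.shape, clf.predict(X))
```

Because the one-hot columns are not rescaled with the text block, their weight relative to the normalised text features is a design choice that may need tuning.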

In [31]:
%%time

params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

ngram_range = (1,3)
for sublinear_tf in [False, True]:
    for use_idf in [False, True]:
        # Label each configuration: 'TF' or 'log(TF)', with an '-IDF' suffix when IDF weighting is on
        vectorizer_type = ('log(TF)' if sublinear_tf else 'TF') + ('-IDF' if use_idf else '')
        params_fixed['vectorizer_type'].append(vectorizer_type)

        params_fixed['ngram_range'].append(ngram_range)

        X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
CPU times: user 1min 9s, sys: 1.34 s, total: 1min 10s
Wall time: 1min 10s


In [32]:
mnB_tune_category = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-16 03:40:43,233] A new study created in memory with name: no-name-3f6c0e84-642f-49fd-ae38-99d6df4f9850
[I 2024-05-16 03:40:44,709] Trial 0 finished with value: 0.7973783764087002 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7973783764087002.
[I 2024-05-16 03:40:46,143] Trial 1 finished with value: 0.799470983024082 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-16 03:40:47,573] Trial 2 finished with value: 0.7991551675825793 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-16 03:40:49,043] Trial 3 finished with value: 0.7537901399454154 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-16 03:40:50,478] Trial 4 finished with value: 0.7977731885800425 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.799470983024082.
[I 2024-05-16 03:40:51,

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   6.572216e-01  0.80991/0.81842/0.82965   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   

                   roc_auc        alpha  
0  0.88022/0.88591/0.89647 6.571946e-02  


[I 2024-05-16 03:43:11,560] Trial 0 finished with value: 0.7960361023239535 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7960361023239535.
[I 2024-05-16 03:43:13,039] Trial 1 finished with value: 0.7985233886050628 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7985233886050628.
[I 2024-05-16 03:43:14,488] Trial 2 finished with value: 0.7986024243071418 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-16 03:43:15,945] Trial 3 finished with value: 0.7892450492589623 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-16 03:43:17,363] Trial 4 finished with value: 0.7967862166100466 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7986024243071418.
[I 2024-05-16 03:43:18,804] Trial 5 finished with value: 0.7968256954888464 and parameters: {'alpha': 0.00039799342667825053}. Best 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   6.572216e-01  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)   6.884888e-01  0.80754/0.82012/0.83083   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   

                   roc_auc        alpha  
0  0.88022/0.88591/0.89647 6.571946e-02  
1  0.87564/0.88196/0.89345 1.452222e-01  


[I 2024-05-16 03:45:39,554] Trial 0 finished with value: 0.8043667381287636 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-16 03:45:41,001] Trial 1 finished with value: 0.8037744224411509 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-16 03:45:42,448] Trial 2 finished with value: 0.8028269215555067 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-16 03:45:43,899] Trial 3 finished with value: 0.766503274252717 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8043667381287636.
[I 2024-05-16 03:45:45,342] Trial 4 finished with value: 0.8047219700934827 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8047219700934827.
[I 2024-05-16 03:45:46,773] Trial 5 finished with value: 0.804643012335883 and parameters: {'alpha': 0.00039799342667825053}. Best is

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   6.572216e-01  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)   6.884888e-01  0.80754/0.82012/0.83083   
2         log(TF)      (1, 3)   6.456723e-01  0.81030/0.82115/0.83281   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   
2  0.77851/0.79233/0.80597  0.73189/0.74083/0.75789  0.75553/0.76569/0.78119   

                   roc_auc        alpha  
0  0.88022/0.88591/0.89647 6.571946e-02  
1  0.87564/0.88196/0.89345 1.452222e-01  
2  0.87689/0.88348/0.89532 9.066916e-02  


[I 2024-05-16 03:48:07,651] Trial 0 finished with value: 0.798878721897605 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.798878721897605.
[I 2024-05-16 03:48:09,099] Trial 1 finished with value: 0.8001026762626713 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-16 03:48:10,589] Trial 2 finished with value: 0.7999052584853283 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-16 03:48:12,034] Trial 3 finished with value: 0.7911796312368737 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.8001026762626713.
[I 2024-05-16 03:48:13,468] Trial 4 finished with value: 0.8002210193656957 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 4 with value: 0.8002210193656957.
[I 2024-05-16 03:48:14,921] Trial 5 finished with value: 0.8001815404868957 and parameters: {'alpha': 0.00039799342667825053}. Best is

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   6.572216e-01  0.80991/0.81842/0.82965   
1          TF-IDF      (1, 3)   6.884888e-01  0.80754/0.82012/0.83083   
2         log(TF)      (1, 3)   6.456723e-01  0.81030/0.82115/0.83281   
3     log(TF)-IDF      (1, 3)   6.678052e-01  0.80853/0.82099/0.83281   

                 precision                   recall                       f1  \
0  0.76245/0.77930/0.79236  0.73829/0.75305/0.76892  0.75729/0.76590/0.78046   
1  0.77413/0.79016/0.80138  0.72942/0.74083/0.75840  0.75273/0.76467/0.77929   
2  0.77851/0.79233/0.80597  0.73189/0.74083/0.75789  0.75553/0.76569/0.78119   
3  0.77996/0.79514/0.80794  0.72203/0.73584/0.75489  0.75204/0.76431/0.78051   

                   roc_auc        alpha  
0  0.88022/0.88591/0.89647 6.571946e-02  
1  0.87564/0.88196/0.89345 1.452222e-01  
2  0.87689/0.88348/0.89532 9.066916e-02  
3  0.87435/0.88119/0.89339 1.666332e-01  


[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    1.0s remaining:    1.5s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.0s finished


In [33]:
mnB_tune_category = pd.DataFrame(mnB_tune_category)
mnB_tune_category.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
TF,"(1, 3)",0.657222,0.80991/0.81842/0.82965,0.76245/0.77930/0.79236,0.73829/0.75305/0.76892,0.75729/0.76590/0.78046,0.88022/0.88591/0.89647,0.065719
TF-IDF,"(1, 3)",0.688489,0.80754/0.82012/0.83083,0.77413/0.79016/0.80138,0.72942/0.74083/0.75840,0.75273/0.76467/0.77929,0.87564/0.88196/0.89345,0.145222
log(TF),"(1, 3)",0.645672,0.81030/0.82115/0.83281,0.77851/0.79233/0.80597,0.73189/0.74083/0.75789,0.75553/0.76569/0.78119,0.87689/0.88348/0.89532,0.090669
log(TF)-IDF,"(1, 3)",0.667805,0.80853/0.82099/0.83281,0.77996/0.79514/0.80794,0.72203/0.73584/0.75489,0.75204/0.76431/0.78051,0.87435/0.88119/0.89339,0.166633


In [34]:
mnB_tune_category.to_csv('mnB_tuned.csv', index=False, sep=',', encoding='utf-8')
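The `alpha` column in these tables is the additive smoothing parameter of `MultinomialNB`, which the logs show being sampled over several orders of magnitude. The sketch below illustrates the same idea with a plain log-spaced grid and `cross_val_score` instead of the Optuna study used in the notebook. The toy documents, the grid and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus with five documents per class
popular = [f"fun science experiment for kids number {i}" for i in range(5)]
unpopular = [f"university lecture on molecular biology part {i}" for i in range(5)]
docs = popular + unpopular
y = np.array([1] * 5 + [0] * 5)

X = TfidfVectorizer(ngram_range=(1, 3), norm="l2").fit_transform(docs)

# Log-spaced candidate values for the smoothing parameter
alphas = np.logspace(-8, 1, 10)

# Mean 5-fold cross-validated accuracy for each alpha
scores = {
    alpha: cross_val_score(MultinomialNB(alpha=alpha), X, y, cv=5, scoring="accuracy").mean()
    for alpha in alphas
}

best_alpha = max(scores, key=scores.get)
print(f"best alpha: {best_alpha:.1e}, accuracy: {scores[best_alpha]:.3f}")
```

A grid is less sample-efficient than a sequential optimiser, but it makes the dependence of the score on `alpha` easy to inspect.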

## Dimensionality reduction

Our best-performing models use the (1,3) n-gram range, which requires over 3 million features (the largest of the vectorizers fitted above alone has 3,108,791). We will now look at reducing the number of features by setting minimum and maximum document frequency filters, which drop tokens from the vocabulary that are either too rare or too common. We show results for log(TF)-IDF (`use_idf=True`, `sublinear_tf=True`) with L$^2$ norm.
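To illustrate how the n-gram range inflates the vocabulary and how the document frequency filters shrink it, here is a small sketch with scikit-learn's `TfidfVectorizer` on a made-up corpus. It uses the default word tokenizer rather than the byte-pair encoding used in this notebook, and the documents and thresholds are chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus of four documents
docs = [
    "science for kids",
    "science for adults",
    "science news today",
    "science of cooking",
]

# Unigrams only versus the (1,3) n-gram range
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
full = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)

# min_df=2 (an integer) keeps n-grams found in at least 2 documents;
# max_df=0.9 (a float) drops n-grams found in more than 90% of documents
filtered = TfidfVectorizer(ngram_range=(1, 3), min_df=2, max_df=0.9).fit(docs)

print(len(unigrams.vocabulary_), len(full.vocabulary_))
print(sorted(filtered.vocabulary_))
```

Here "science" appears in every document and is removed by `max_df`, while all n-grams that occur in a single document are removed by `min_df`.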

In [35]:
%%time

X_trains = []

params_fixed = {'min_df': [], 'max_df': []}

for min_df in [5,10,20,50,100,200,500,1000]:
    for max_df in [1.0, 0.9, 0.8, 0.7]:
        X_train, _ = get_features(ngram_range=ngram_range, use_idf=True, norm='l2', sublinear_tf=True, min_df=min_df, max_df=max_df, video_category_encoder=video_category_encoder)
        X_trains.append(X_train)
        params_fixed['min_df'].append(min_df)
        params_fixed['max_df'].append(f"{max_df:.1f}")

mnB_tune_dim_reduction = report_tuned_models(X_trains, y_train, params_fixed, MultinomialNB, get_params_mnB, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315921 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 4441 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 21320 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 315920 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 2104 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 9402 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 151356 features in the (

[I 2024-05-16 03:55:32,455] A new study created in memory with name: no-name-86a3e6b9-22ed-4f8f-9e7c-a06cf9271e8c


[CV] END  accuracy: (test=0.783) f1: (test=0.708) precision: (test=0.769) recall: (test=0.656) roc_auc: (test=0.847) total time=   0.1s
[CV] END  accuracy: (test=0.812) f1: (test=0.739) precision: (test=0.784) recall: (test=0.698) roc_auc: (test=0.871) total time=   0.9s
[CV] END  accuracy: (test=0.798) f1: (test=0.737) precision: (test=0.768) recall: (test=0.707) roc_auc: (test=0.866) total time=   1.4s
[CV] END  accuracy: (test=0.809) f1: (test=0.738) precision: (test=0.772) recall: (test=0.707) roc_auc: (test=0.871) total time=   0.4s
[CV] END  accuracy: (test=0.812) f1: (test=0.741) precision: (test=0.782) recall: (test=0.704) roc_auc: (test=0.872) total time=   0.9s
[CV] END  accuracy: (test=0.803) f1: (test=0.739) precision: (test=0.779) recall: (test=0.704) roc_auc: (test=0.873) total time=   1.9s
[CV] END  accuracy: (test=0.830) f1: (test=0.781) precision: (test=0.793) recall: (test=0.769) roc_auc: (test=0.893) total time=   1.4s
[CV] END  accuracy: (test=0.818) f1: (test=0.759

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   

                   roc_auc        alpha  
0  0.86727/0.87335/0.88590 4.750807e-03  


[I 2024-05-16 03:56:44,755] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:56:45,478] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:56:46,164] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:56:46,737] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:56:47,405] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:56:47,978] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   

                   roc_auc        alpha  
0  0.86727/0.87335/0.88590 4.750807e-03  
1  0.86727/0.87335/0.88590 4.750807e-03  


[I 2024-05-16 03:57:55,076] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:57:55,765] Trial 1 finished with value: 0.8070906950270252 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:57:56,438] Trial 2 finished with value: 0.8046428096802366 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:57:57,240] Trial 3 finished with value: 0.7877447271534013 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:57:57,917] Trial 4 finished with value: 0.809380618091927 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:57:58,671] Trial 5 finished with value: 0.8093411470075751 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   

                   roc_auc        alpha  
0  0.86727/0.87335/0.88590 4.750807e-03  
1  0.86727/0.87335/0.88590 4.750807e-03  
2  0.86727/0.87335/0.88590 4.750807e-03  


[I 2024-05-16 03:59:05,661] Trial 0 finished with value: 0.8096175459258615 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:59:06,411] Trial 1 finished with value: 0.8070512161482254 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:59:07,092] Trial 2 finished with value: 0.8045638441281889 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:59:07,794] Trial 3 finished with value: 0.7877841904433054 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:59:08,496] Trial 4 finished with value: 0.8092621736610794 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.8096175459258615.
[I 2024-05-16 03:59:09,190] Trial 5 finished with value: 0.8093016603343273 and parameters: {'alpha': 0.00039799342667825053}. Best 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   

                   roc_auc        alpha  
0  0.86727/0.87335/0.88590 4.750807e-03  
1  0.86727/0.87335/0.88590 4.750807e-03  
2  0.86727/0.87335/0.88590 4.750807e-03  
3  0.86727/0.87335/0.88590 4.580805e-03  


[I 2024-05-16 04:00:16,477] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:00:17,041] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:00:17,596] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:00:18,142] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:00:18,671] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:00:19,275] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   

                   roc_auc        alpha  
0  0.86727/0.87335/0.88590 4.750807e-03  
1  0.86727/0.87335/0.88590 4.750807e-03  
2  0.86727/0.87335/0.88590 4.750

[I 2024-05-16 04:01:13,673] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:01:14,236] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:01:14,795] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:01:15,347] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:01:15,912] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:01:16,507] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5      10    0.9   8.839407e-02  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   
5  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   

                  

[I 2024-05-16 04:02:10,285] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:02:10,888] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:02:11,452] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:02:12,000] Trial 3 finished with value: 0.779887674210783 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:02:12,556] Trial 4 finished with value: 0.7948514319764889 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:02:13,098] Trial 5 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00039799342667825053}. Best is 

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5      10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6      10    0.8   1.152582e-01  0.79171/0.79615/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.72219/0.72928/0.73872   
5  0.75927/0.76583/0.77490  0.68260/0.6

[I 2024-05-16 04:03:06,862] Trial 0 finished with value: 0.7955226820383573 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:03:07,418] Trial 1 finished with value: 0.79469343851681 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:03:07,967] Trial 2 finished with value: 0.7926797818645801 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:03:08,543] Trial 3 finished with value: 0.7798087086587353 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:03:08,997] Trial 4 finished with value: 0.7948909108552887 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7955226820383573.
[I 2024-05-16 04:03:09,564] Trial 5 finished with value: 0.7950093552861361 and parameters: {'alpha': 0.00039799342667825053}. Best is

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5      10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6      10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7      10    0.7   1.081565e-01  0.79076/0.79608/0.80340   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   
4  0.75927/0.76583/0.77490  0.68260/0.69613/0.70576  0.7221

[I 2024-05-16 04:04:03,630] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-16 04:04:04,133] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:04,653] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:05,120] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:05,625] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:06,093] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5      10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6      10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7      10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8      20    1.0   8.578439e-02  0.78030/0.78360/0.78875   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/0.78799  0.71858/0.72917/0.74336  0.74307/0.75270/0.76502   

[I 2024-05-16 04:04:53,075] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-16 04:04:53,534] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:54,030] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:54,524] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:55,005] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:04:55,519] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

   min_df max_df  mean_fit_time                 accuracy  \
0       5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1       5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2       5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3       5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4      10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5      10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6      10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7      10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8      20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9      20    0.9   1.082216e-01  0.78030/0.78360/0.78875   

                 precision                   recall                       f1  \
0  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2  0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
3  0.76636/0.77784/

[I 2024-05-16 04:05:42,603] Trial 0 finished with value: 0.7831255736226529 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7831255736226529.
[I 2024-05-16 04:05:43,102] Trial 1 finished with value: 0.7835203546162034 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:05:43,583] Trial 2 finished with value: 0.7831649823514213 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:05:44,070] Trial 3 finished with value: 0.775189516155747 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:05:44,453] Trial 4 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7835203546162034.
[I 2024-05-16 04:05:44,807] Trial 5 finished with value: 0.7833624546898998 and parameters: {'alpha': 0.00039799342667825053}. Best i

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
2   0.76661/0.77804/0.78

[I 2024-05-16 04:06:31,849] Trial 0 finished with value: 0.783046631453949 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.783046631453949.
[I 2024-05-16 04:06:32,343] Trial 1 finished with value: 0.7834413968586037 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-16 04:06:32,806] Trial 2 finished with value: 0.7830860323882696 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-16 04:06:33,318] Trial 3 finished with value: 0.775189523950195 and parameters: {'alpha': 1.2565299585684198}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-16 04:06:33,800] Trial 4 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 1 with value: 0.7834413968586037.
[I 2024-05-16 04:06:34,308] Trial 5 finished with value: 0.7832835047267481 and parameters: {'alpha': 0.00039799342667825053}. Best is 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   
11      20    0.7   6.155052e-02  0.78030/0.78348/0.78875   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.75285/0.76471   
1   0.76661/0.77804/0.78788  0.71858/0.72927

[I 2024-05-16 04:07:21,361] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:07:21,765] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:07:22,168] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:07:22,604] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:07:22,997] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:07:23,392] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   
11      20    0.7   6.155052e-02  0.78030/0.78348/0.78875   
12      50    1.0   4.971957e-02  0.75464/0.76015/0.77280   

                  precision                   recall                       f1  \
0   0.76661/0.77804/0.78788  0.71858/0.72927/0.74286  0.74374/0.

[I 2024-05-16 04:08:00,039] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:08:00,325] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:08:00,738] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:01,127] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:01,494] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:01,891] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   
11      20    0.7   6.155052e-02  0.78030/0.78348/0.78875   
12      50    1.0   4.971957e-02  0.75464/0.76015/0.77280   
13      50    0.9   4.696412e-02  0.75464/0.76015/0.77280   

                  precision                   recall                       f1  \
0  

[I 2024-05-16 04:08:39,208] Trial 0 finished with value: 0.7601070801257557 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:08:39,594] Trial 1 finished with value: 0.7600676012469558 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7601070801257557.
[I 2024-05-16 04:08:39,973] Trial 2 finished with value: 0.7601070957146515 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:40,397] Trial 3 finished with value: 0.7583304604297314 and parameters: {'alpha': 1.2565299585684198}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:40,816] Trial 4 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 2 with value: 0.7601070957146515.
[I 2024-05-16 04:08:41,096] Trial 5 finished with value: 0.7601070879202035 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   
11      20    0.7   6.155052e-02  0.78030/0.78348/0.78875   
12      50    1.0   4.971957e-02  0.75464/0.76015/0.77280   
13      50    0.9   4.696412e-02  0.75464/0.76015/0.77280   
14      50    0.8   6.905251e-02  0.75464/0.76015/0.77280   

                  preci

[I 2024-05-16 04:09:17,464] Trial 0 finished with value: 0.7599886434893561 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-16 04:09:17,836] Trial 1 finished with value: 0.7598701912640606 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-16 04:09:18,223] Trial 2 finished with value: 0.7599096857317564 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-16 04:09:18,617] Trial 3 finished with value: 0.7585278626181784 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-16 04:09:18,997] Trial 4 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7599886434893561.
[I 2024-05-16 04:09:19,382] Trial 5 finished with value: 0.7599096779373085 and parameters: {'alpha': 0.00039799342667825053}. Best 

    min_df max_df  mean_fit_time                 accuracy  \
0        5    1.0   1.983360e-01  0.80513/0.81116/0.81998   
1        5    0.9   1.749861e-01  0.80513/0.81116/0.81998   
2        5    0.8   1.646152e-01  0.80513/0.81116/0.81998   
3        5    0.7   2.062256e-01  0.80474/0.81104/0.82017   
4       10    1.0   1.064364e-01  0.79171/0.79615/0.80340   
5       10    0.9   8.839407e-02  0.79171/0.79615/0.80340   
6       10    0.8   1.152582e-01  0.79171/0.79615/0.80340   
7       10    0.7   1.081565e-01  0.79076/0.79608/0.80340   
8       20    1.0   8.578439e-02  0.78030/0.78360/0.78875   
9       20    0.9   1.082216e-01  0.78030/0.78360/0.78875   
10      20    0.8   7.721381e-02  0.78030/0.78360/0.78875   
11      20    0.7   6.155052e-02  0.78030/0.78348/0.78875   
12      50    1.0   4.971957e-02  0.75464/0.76015/0.77280   
13      50    0.9   4.696412e-02  0.75464/0.76015/0.77280   
14      50    0.8   6.905251e-02  0.75464/0.76015/0.77280   
15      50    0.7   6.75

[I 2024-05-16 04:09:57,181] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:09:57,517] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:09:57,867] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:09:58,196] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:09:58,527] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:09:58,847] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-16 04:10:31,365] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:10:31,701] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:10:32,049] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:10:32,378] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:10:32,707] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:10:33,043] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-16 04:11:05,599] Trial 0 finished with value: 0.743958885846023 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:11:05,950] Trial 1 finished with value: 0.743958885846023 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:11:06,177] Trial 2 finished with value: 0.743958885846023 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:11:06,529] Trial 3 finished with value: 0.7439193991727752 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:11:06,862] Trial 4 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.743958885846023.
[I 2024-05-16 04:11:07,226] Trial 5 finished with value: 0.743958885846023 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0

[I 2024-05-16 04:11:38,869] Trial 0 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-16 04:11:39,208] Trial 1 finished with value: 0.7442352379976218 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-16 04:11:39,559] Trial 2 finished with value: 0.7442352379976218 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-16 04:11:39,903] Trial 3 finished with value: 0.7437219813954321 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-16 04:11:40,231] Trial 4 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7442747246708696.
[I 2024-05-16 04:11:40,562] Trial 5 finished with value: 0.7442747246708696 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:12:12,851] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:13,029] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:13,322] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:13,514] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:13,803] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:14,095] Trial 6 finished with value: 0.7327065557932431 and parameters: {'alpha': 2.6074972019493715e-05}. Best

[I 2024-05-16 04:12:41,260] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:41,567] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:41,889] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:42,221] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:42,531] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:12:42,819] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:13:10,789] Trial 0 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:13:11,082] Trial 1 finished with value: 0.7327065557932431 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:13:11,393] Trial 2 finished with value: 0.7327065557932431 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:13:11,710] Trial 3 finished with value: 0.7321538203122534 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:13:11,890] Trial 4 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7327065557932431.
[I 2024-05-16 04:13:12,166] Trial 5 finished with value: 0.7327065557932431 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:13:40,509] Trial 0 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-16 04:13:40,811] Trial 1 finished with value: 0.7325880879790517 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-16 04:13:41,116] Trial 2 finished with value: 0.7325880879790517 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-16 04:13:41,411] Trial 3 finished with value: 0.7322722569486529 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-16 04:13:41,713] Trial 4 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.7325880879790517.
[I 2024-05-16 04:13:42,026] Trial 5 finished with value: 0.7325880879790517 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:14:07,547] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:07,792] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:07,941] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:08,189] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:08,340] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:08,581] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:14:32,095] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:32,348] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:32,589] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:32,856] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:33,093] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:33,330] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:14:56,579] Trial 0 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:56,831] Trial 1 finished with value: 0.7031742733333619 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:57,083] Trial 2 finished with value: 0.7031742733333619 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031742733333619.
[I 2024-05-16 04:14:57,327] Trial 3 finished with value: 0.7032136820621304 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:57,579] Trial 4 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7032136820621304.
[I 2024-05-16 04:14:57,809] Trial 5 finished with value: 0.7031742733333619 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:15:20,300] Trial 0 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-16 04:15:20,551] Trial 1 finished with value: 0.7031347866601141 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-16 04:15:20,686] Trial 2 finished with value: 0.7031347866601141 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.7031347866601141.
[I 2024-05-16 04:15:20,943] Trial 3 finished with value: 0.7034110842505775 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-16 04:15:21,183] Trial 4 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.7034110842505775.
[I 2024-05-16 04:15:21,326] Trial 5 finished with value: 0.7031347866601141 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:15:42,771] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:15:42,979] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:15:43,187] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:15:43,394] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:15:43,638] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:15:43,858] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:16:04,868] Trial 0 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:16:05,074] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:16:05,290] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:16:05,505] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:16:05,719] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:16:05,928] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best 

[I 2024-05-16 04:16:24,696] Trial 1 finished with value: 0.6803932610762028 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:16:24,797] Trial 2 finished with value: 0.6803932610762028 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6803932610762028.
[I 2024-05-16 04:16:24,897] Trial 3 finished with value: 0.6805116899181544 and parameters: {'alpha': 1.2565299585684198}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:16:25,000] Trial 4 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:16:25,205] Trial 5 finished with value: 0.6803932610762028 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 3 with value: 0.6805116899181544.
[I 2024-05-16 04:16:25,442] Trial 6 finished with value: 0.6803932610762028 and parameters: {'alpha': 2.6074972019493715e-05}. Best

[I 2024-05-16 04:16:43,284] Trial 0 finished with value: 0.6800379667559 and parameters: {'alpha': 0.0015376911652393395}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-16 04:16:43,389] Trial 1 finished with value: 0.6800379667559 and parameters: {'alpha': 8.430836792080312e-06}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-16 04:16:43,490] Trial 2 finished with value: 0.6800379667559 and parameters: {'alpha': 6.795850794294448e-08}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-16 04:16:43,593] Trial 3 finished with value: 0.6797220655754699 and parameters: {'alpha': 1.2565299585684198}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-16 04:16:43,693] Trial 4 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00040691375225885087}. Best is trial 0 with value: 0.6800379667559.
[I 2024-05-16 04:16:43,997] Trial 5 finished with value: 0.6800379667559 and parameters: {'alpha': 0.00039799342667825053}. Best is trial 0 with value: 0.68003

In [36]:
pd.DataFrame(mnB_tune_dim_reduction).style.hide()

min_df,max_df,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha
5,1.0,0.198336,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.9,0.174986,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.8,0.164615,0.80513/0.81116/0.81998,0.76661/0.77804/0.78788,0.71858/0.72927/0.74286,0.74374/0.75285/0.76471,0.86727/0.87335/0.88590,0.004751
5,0.7,0.206226,0.80474/0.81104/0.82017,0.76636/0.77784/0.78799,0.71858/0.72917/0.74336,0.74307/0.75270/0.76502,0.86727/0.87335/0.88590,0.004581
10,1.0,0.106436,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.9,0.088394,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.8,0.115258,0.79171/0.79615/0.80340,0.75927/0.76583/0.77490,0.68260/0.69613/0.70576,0.72219/0.72928/0.73872,0.85340/0.86079/0.87297,0.01519
10,0.7,0.108156,0.79076/0.79608/0.80340,0.75719/0.76481/0.77370,0.68408/0.69764/0.70777,0.72245/0.72963/0.73927,0.85350/0.86081/0.87329,0.004618
20,1.0,0.085784,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06
20,0.9,0.108222,0.78030/0.78360/0.78875,0.74147/0.74644/0.75665,0.67275/0.68378/0.68972,0.71143/0.71370/0.71723,0.84053/0.84584/0.85559,3e-06


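We see that as the vocabulary size is decreased by raising min_df, the cross-validation scores degrade steadily: the mean accuracy falls from 0.811 at min_df = 5 to 0.796 at min_df = 10 and 0.784 at min_df = 20. Lowering max_df from 1.0 to 0.7, by contrast, leaves the scores essentially unchanged.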
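To make the link between min_df and vocabulary size concrete, here is a minimal sketch on a toy corpus. The titles are made up for illustration and are not taken from the dataset, and `vocabulary_size` is a hypothetical helper rather than part of this notebook's pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of made-up titles (an assumption for illustration only).
docs = [
    "science experiment for kids",
    "science lecture on cell biology",
    "kids science song",
    "master of science in engineering",
    "engineering lecture series",
    "biology revision for kids",
]

def vocabulary_size(min_df, max_df=1.0):
    """Fit a vectoriser on the toy corpus and return how many terms it keeps."""
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df)
    vectorizer.fit(docs)
    return len(vectorizer.vocabulary_)

# Raising min_df discards terms that appear in fewer than min_df documents.
sizes = {min_df: vocabulary_size(min_df) for min_df in (1, 2, 3)}
print(sizes)

# Lowering max_df instead discards terms that appear in more than that
# fraction of the documents.
size_without_common_terms = vocabulary_size(1, max_df=0.5)
print(size_without_common_terms)
```

In this sketch an integer min_df is an absolute document count, while a float max_df is a fraction of the corpus, which mirrors how the two parameters are specified in the table above.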

All of these results show that identifying whether a video will be popular from its text metadata is a machine-learning problem that contradicts the common wisdom in text classification. It is a fundamentally different challenge from, for example, determining whether a text message or email is spam. In our case, both the most common and the rarest terms are relevant, and incorporating an IDF factor seems to have no effect on the accuracy. Whether a viewer likes a certain YouTube video or channel, and whether they share it on social media and so contribute to its virality, is primarily a subjective determination. This makes the classification problem significantly more difficult, which is reflected in the low cross-validation metrics we have seen so far.

## Further classification models

Now that we have understood the influence of the vectoriser hyperparameters (the n-gram range, the TF/log(TF)/TF-IDF/log(TF)-IDF modalities and the normalisation), we are ready to build some more models. Having already considered a Bayesian model, we can explore three linear methods:

* Support vector machine
* Logistic regression
* Perceptron

To avoid overfitting, we will employ regularisation via a combination of $L^1$ and $L^2$ penalty terms, known as *elastic net*. This introduces two hyperparameters, the overall regularisation strength alpha and the mixing parameter l1_ratio, which we will again tune using Bayesian optimisation. We will implement the linear models using SGDClassifier from scikit-learn, which fits regularised linear models by stochastic gradient descent. Because the algorithm is randomised, a random state will be set for reproducibility.

In [37]:
from sklearn.linear_model import SGDClassifier

# Fixed (non-tuned) settings recorded alongside each feature matrix.
params_fixed = {'vectorizer_type': [], 'ngram_range': []}
X_trains = []
vectorizers = []

# Human-readable label for each (sublinear_tf, use_idf) combination.
vectorizer_names = {
    (False, False): 'TF',
    (True, False): 'log(TF)',
    (False, True): 'TF-IDF',
    (True, True): 'log(TF)-IDF',
}

for ngram_range in [(1,3)]:
    for sublinear_tf in [False,True]:
        for use_idf in [False,True]:
            params_fixed['vectorizer_type'].append(vectorizer_names[(sublinear_tf, use_idf)])
            params_fixed['ngram_range'].append(ngram_range)

            X_train, vectorizer = get_features(ngram_range=ngram_range, use_idf=use_idf, norm='l2', sublinear_tf=sublinear_tf, video_category_encoder=video_category_encoder)
            X_trains.append(X_train)
            vectorizers.append(vectorizer)

Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 39389 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 360589 features in the (1, 3) ngram range.
Fit tfidf vectorizer with 3108791 features in the (1, 3) ngram range.


### Support vector machine
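The search space used for the SVM in this section draws alpha on a logarithmic scale between 1e-9 and 1e+1 and l1_ratio uniformly between 0 and 1. As a rough illustration of what sampling on a log scale means, the sketch below draws such pairs with NumPy. This is plain random sampling, not Optuna's sampler (which also takes the results of earlier trials into account), and `sample_params` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_params(n):
    """Draw n (alpha, l1_ratio) pairs: alpha log-uniform, l1_ratio uniform."""
    # Sample the exponent uniformly, then exponentiate, so that every
    # decade between 1e-9 and 1e+1 is equally likely.
    log_alpha = rng.uniform(np.log10(1e-9), np.log10(1e+1), size=n)
    alpha = 10.0 ** log_alpha
    l1_ratio = rng.uniform(0.0, 1.0, size=n)
    return alpha, l1_ratio

alpha, l1_ratio = sample_params(1000)

# Count the samples falling into each of the ten decades.
counts, _ = np.histogram(np.log10(alpha), bins=10, range=(-9, 1))
print(counts)
```

Sampling alpha uniformly on a linear scale would instead place almost every draw between 1 and 10, leaving the small values of alpha practically unexplored.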

In [38]:
%%time

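def get_params_SVM(trial=None, best=None):
    # Sample hyperparameters from an Optuna trial, or reuse the best ones found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('Either trial or best must be provided.')
    # loss='hinge' makes SGDClassifier fit a linear support vector machine.
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'hinge', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}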

SVM_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_SVM, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-16 04:18:11,561] A new study created in memory with name: no-name-e00c5644-3a37-47b2-90f4-99f597ed94d5
[I 2024-05-16 04:18:18,601] Trial 0 finished with value: 0.7120182436848408 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7120182436848408.
[I 2024-05-16 04:18:41,085] Trial 1 finished with value: 0.8138421445020498 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8138421445020498.
[I 2024-05-16 04:18:50,940] Trial 2 finished with value: 0.7548956342907384 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8138421445020498.
[I 2024-05-16 04:19:14,687] Trial 3 finished with value: 0.8170005951061 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8170005951061.
[I 2024-05-16 04:19:20,371] Trial 4 finished with value: 0.61240542

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.937010e+01  0.81642/0.82581/0.83794   

                 precision                   recall                       f1  \
0  0.78575/0.80054/0.81458  0.73047/0.74355/0.76190  0.75710/0.77099/0.78736   

         alpha     l1_ratio  
0 4.244472e-05 7.473830e-02  


[I 2024-05-16 04:45:11,154] Trial 0 finished with value: 0.647504814045907 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.647504814045907.
[I 2024-05-16 04:45:32,263] Trial 1 finished with value: 0.8164478206528708 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8164478206528708.
[I 2024-05-16 04:45:40,929] Trial 2 finished with value: 0.7255998977368432 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8164478206528708.
[I 2024-05-16 04:46:03,043] Trial 3 finished with value: 0.8196063492014003 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8196063492014003.
[I 2024-05-16 04:46:08,161] Trial 4 finished with value: 0.6062462601264493 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 wit

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.937010e+01  0.81642/0.82581/0.83794   
1          TF-IDF      (1, 3)   1.839941e+01  0.81560/0.82454/0.83675   

                 precision                   recall                       f1  \
0  0.78575/0.80054/0.81458  0.73047/0.74355/0.76190  0.75710/0.77099/0.78736   
1  0.80220/0.81192/0.82589  0.70084/0.72261/0.74185  0.75278/0.76462/0.78162   

         alpha     l1_ratio  
0 4.244472e-05 7.473830e-02  
1 4.247058e-05 1.756472e-01  


[I 2024-05-16 05:08:56,521] Trial 0 finished with value: 0.7132815756008837 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.7132815756008837.
[I 2024-05-16 05:09:16,178] Trial 1 finished with value: 0.81557907486918 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.81557907486918.
[I 2024-05-16 05:09:24,597] Trial 2 finished with value: 0.7553692561251695 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.81557907486918.
[I 2024-05-16 05:09:45,545] Trial 3 finished with value: 0.8180270927215835 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 3 with value: 0.8180270927215835.
[I 2024-05-16 05:09:50,139] Trial 4 finished with value: 0.6069964367681258 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 3 with va

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.937010e+01  0.81642/0.82581/0.83794   
1          TF-IDF      (1, 3)   1.839941e+01  0.81560/0.82454/0.83675   
2         log(TF)      (1, 3)   1.904504e+01  0.81935/0.82462/0.83656   

                 precision                   recall                       f1  \
0  0.78575/0.80054/0.81458  0.73047/0.74355/0.76190  0.75710/0.77099/0.78736   
1  0.80220/0.81192/0.82589  0.70084/0.72261/0.74185  0.75278/0.76462/0.78162   
2  0.79346/0.80683/0.82149  0.71547/0.73020/0.74737  0.75245/0.76656/0.78268   

         alpha     l1_ratio  
0 4.244472e-05 7.473830e-02  
1 4.247058e-05 1.756472e-01  
2 3.707259e-05 2.494799e-01  


[I 2024-05-16 05:35:41,276] Trial 0 finished with value: 0.6347126128587346 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6347126128587346.
[I 2024-05-16 05:35:58,066] Trial 1 finished with value: 0.819172214040217 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-16 05:36:05,809] Trial 2 finished with value: 0.7123737094829982 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-16 05:36:26,344] Trial 3 finished with value: 0.8188561024096925 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.819172214040217.
[I 2024-05-16 05:36:31,306] Trial 4 finished with value: 0.6062462601264493 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 with 

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.937010e+01  0.81642/0.82581/0.83794   
1          TF-IDF      (1, 3)   1.839941e+01  0.81560/0.82454/0.83675   
2         log(TF)      (1, 3)   1.904504e+01  0.81935/0.82462/0.83656   
3     log(TF)-IDF      (1, 3)   1.518059e+01  0.81721/0.82691/0.83952   

                 precision                   recall                       f1  \
0  0.78575/0.80054/0.81458  0.73047/0.74355/0.76190  0.75710/0.77099/0.78736   
1  0.80220/0.81192/0.82589  0.70084/0.72261/0.74185  0.75278/0.76462/0.78162   
2  0.79346/0.80683/0.82149  0.71547/0.73020/0.74737  0.75245/0.76656/0.78268   
3  0.79318/0.80974/0.82437  0.71562/0.73369/0.75288  0.76023/0.76978/0.78701   

         alpha     l1_ratio  
0 4.244472e-05 7.473830e-02  
1 4.247058e-05 1.756472e-01  
2 3.707259e-05 2.494799e-01  
3 6.782046e-05 3.152240e-03  
CPU times: user 3min 57s, sys: 35 s, total: 4min 32s
Wall time: 1h 40min 26s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   16.0s finished


In [39]:
SVM_tune = pd.DataFrame(SVM_tune)
SVM_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",19.370099,0.81642/0.82581/0.83794,0.78575/0.80054/0.81458,0.73047/0.74355/0.76190,0.75710/0.77099/0.78736,4.2e-05,0.074738
TF-IDF,"(1, 3)",18.39941,0.81560/0.82454/0.83675,0.80220/0.81192/0.82589,0.70084/0.72261/0.74185,0.75278/0.76462/0.78162,4.2e-05,0.175647
log(TF),"(1, 3)",19.045043,0.81935/0.82462/0.83656,0.79346/0.80683/0.82149,0.71547/0.73020/0.74737,0.75245/0.76656/0.78268,3.7e-05,0.24948
log(TF)-IDF,"(1, 3)",15.180594,0.81721/0.82691/0.83952,0.79318/0.80974/0.82437,0.71562/0.73369/0.75288,0.76023/0.76978/0.78701,6.8e-05,0.003152


In [40]:
SVM_tune.to_csv('SVM_tuned.csv', index=False, sep=',', encoding='utf-8')
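
For reference, the two tuned hyperparameters enter the training objective as follows. This is a sketch of the regularised objective as described in the scikit-learn user guide for `SGDClassifier` with `penalty='elasticnet'`; the symbols $w$, $b$, $L$, $n$ and $r$ are notation introduced here, with $r$ standing for `l1_ratio` and $\alpha$ for `alpha`:

$$E(w, b) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\, w^\top x_i + b\big) + \alpha\left[\frac{1-r}{2}\,\lVert w\rVert_2^2 + r\,\lVert w\rVert_1\right]$$

Setting $r=0$ gives a pure L2 penalty and $r=1$ a pure L1 penalty. In the table above, the tuned SVM values of `l1_ratio` lie between roughly 0.003 and 0.25, i.e. closer to the L2 end.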

### Logistic regression

In [41]:
%%time

def get_params_log_reg(trial=None, best=None):
    # Sample hyperparameters from an Optuna trial, or reuse the best ones found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('either trial or best must be provided')
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'log_loss', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

log_reg_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_log_reg, 'accuracy', ('accuracy', 'precision', 'recall', 'f1', 'roc_auc'))

[I 2024-05-16 05:58:37,915] A new study created in memory with name: no-name-ede89423-b2f2-4c04-b811-dcdcbb19d831
[I 2024-05-16 05:58:43,285] Trial 0 finished with value: 0.6905005867270685 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6905005867270685.
[I 2024-05-16 05:59:05,048] Trial 1 finished with value: 0.8147503223978528 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8147503223978528.
[I 2024-05-16 05:59:10,514] Trial 2 finished with value: 0.729627117507928 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8147503223978528.
[I 2024-05-16 05:59:20,828] Trial 3 finished with value: 0.8077617580221432 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8147503223978528.
[I 2024-05-16 05:59:25,332] Trial 4 finished with value: 0.646

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.427716e+00  0.81402/0.82119/0.83320   

                 precision                   recall                       f1  \
0  0.77512/0.79331/0.81469  0.71069/0.73972/0.76058  0.74772/0.76533/0.78092   

                   roc_auc        alpha     l1_ratio  
0  0.88501/0.89193/0.90051 2.829485e-06 1.043030e-01  


[I 2024-05-16 06:17:42,070] Trial 0 finished with value: 0.6489656338893244 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6489656338893244.
[I 2024-05-16 06:18:01,519] Trial 1 finished with value: 0.8170005795172042 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8170005795172042.
[I 2024-05-16 06:18:07,488] Trial 2 finished with value: 0.7054643600816702 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8170005795172042.
[I 2024-05-16 06:18:12,067] Trial 3 finished with value: 0.8054721077629194 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8170005795172042.
[I 2024-05-16 06:18:16,794] Trial 4 finished with value: 0.6061673023688496 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.427716e+00  0.81402/0.82119/0.83320   
1          TF-IDF      (1, 3)   8.918039e+00  0.81425/0.82296/0.83557   

                 precision                   recall                       f1  \
0  0.77512/0.79331/0.81469  0.71069/0.73972/0.76058  0.74772/0.76533/0.78092   
1  0.78918/0.80701/0.82677  0.71175/0.72454/0.73684  0.75399/0.76351/0.77922   

                   roc_auc        alpha     l1_ratio  
0  0.88501/0.89193/0.90051 2.829485e-06 1.043030e-01  
1  0.88643/0.89134/0.90106 9.504629e-07 2.094861e-01  


[I 2024-05-16 06:35:16,269] Trial 0 finished with value: 0.6920404344781169 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6920404344781169.
[I 2024-05-16 06:35:38,577] Trial 1 finished with value: 0.8175929419715043 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8175929419715043.
[I 2024-05-16 06:35:44,202] Trial 2 finished with value: 0.7335751223046312 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8175929419715043.
[I 2024-05-16 06:35:53,570] Trial 3 finished with value: 0.8102097992578905 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8175929419715043.
[I 2024-05-16 06:35:58,203] Trial 4 finished with value: 0.6102338373353277 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.427716e+00  0.81402/0.82119/0.83320   
1          TF-IDF      (1, 3)   8.918039e+00  0.81425/0.82296/0.83557   
2         log(TF)      (1, 3)   8.284185e+00  0.81560/0.82229/0.83379   

                 precision                   recall                       f1  \
0  0.77512/0.79331/0.81469  0.71069/0.73972/0.76058  0.74772/0.76533/0.78092   
1  0.78918/0.80701/0.82677  0.71175/0.72454/0.73684  0.75399/0.76351/0.77922   
2  0.76754/0.80088/0.81589  0.70434/0.73208/0.75539  0.75421/0.76457/0.78164   

                   roc_auc        alpha     l1_ratio  
0  0.88501/0.89193/0.90051 2.829485e-06 1.043030e-01  
1  0.88643/0.89134/0.90106 9.504629e-07 2.094861e-01  
2  0.88969/0.89509/0.90595 1.158246e-05 1.983644e-01  


[I 2024-05-16 06:56:09,553] Trial 0 finished with value: 0.6295009877514147 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6295009877514147.
[I 2024-05-16 06:56:25,832] Trial 1 finished with value: 0.8188167950087474 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8188167950087474.
[I 2024-05-16 06:56:31,408] Trial 2 finished with value: 0.7019110817173819 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8188167950087474.
[I 2024-05-16 06:56:36,270] Trial 3 finished with value: 0.8067353071733473 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8188167950087474.
[I 2024-05-16 06:56:40,850] Trial 4 finished with value: 0.6054961614292523 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   7.427716e+00  0.81402/0.82119/0.83320   
1          TF-IDF      (1, 3)   8.918039e+00  0.81425/0.82296/0.83557   
2         log(TF)      (1, 3)   8.284185e+00  0.81560/0.82229/0.83379   
3     log(TF)-IDF      (1, 3)   6.923369e+00  0.81678/0.82434/0.83675   

                 precision                   recall                       f1  \
0  0.77512/0.79331/0.81469  0.71069/0.73972/0.76058  0.74772/0.76533/0.78092   
1  0.78918/0.80701/0.82677  0.71175/0.72454/0.73684  0.75399/0.76351/0.77922   
2  0.76754/0.80088/0.81589  0.70434/0.73208/0.75539  0.75421/0.76457/0.78164   
3  0.78274/0.80510/0.82265  0.71528/0.73247/0.75111  0.75873/0.76689/0.78265   

                   roc_auc        alpha     l1_ratio  
0  0.88501/0.89193/0.90051 2.829485e-06 1.043030e-01  
1  0.88643/0.89134/0.90106 9.504629e-07 2.094861e-01  
2  0.88969/0.89509/0.90595 1.158246e-05 1.983644e-01  
3  0.89048/0.

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    7.2s remaining:   10.8s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    7.2s finished


In [42]:
log_reg_tune = pd.DataFrame(log_reg_tune)
log_reg_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,roc_auc,alpha,l1_ratio
TF,"(1, 3)",7.427716,0.81402/0.82119/0.83320,0.77512/0.79331/0.81469,0.71069/0.73972/0.76058,0.74772/0.76533/0.78092,0.88501/0.89193/0.90051,3e-06,0.104303
TF-IDF,"(1, 3)",8.918039,0.81425/0.82296/0.83557,0.78918/0.80701/0.82677,0.71175/0.72454/0.73684,0.75399/0.76351/0.77922,0.88643/0.89134/0.90106,1e-06,0.209486
log(TF),"(1, 3)",8.284185,0.81560/0.82229/0.83379,0.76754/0.80088/0.81589,0.70434/0.73208/0.75539,0.75421/0.76457/0.78164,0.88969/0.89509/0.90595,1.2e-05,0.198364
log(TF)-IDF,"(1, 3)",6.923369,0.81678/0.82434/0.83675,0.78274/0.80510/0.82265,0.71528/0.73247/0.75111,0.75873/0.76689/0.78265,0.89048/0.89561/0.90494,2e-06,0.138858


In [43]:
log_reg_tune.to_csv('log_reg_tune.csv', index=False, sep=',', encoding='utf-8')

### Perceptron

In [44]:
%%time

def get_params_perceptron(trial=None, best=None):
    # Sample hyperparameters from an Optuna trial, or reuse the best ones found.
    if trial is not None:
        alpha = trial.suggest_float('alpha', 1e-9, 1e+1, log=True)
        l1_ratio = trial.suggest_float('l1_ratio', 0, 1)
    elif best is not None:
        alpha = best['alpha']
        l1_ratio = best['l1_ratio']
    else:
        raise ValueError('either trial or best must be provided')
    return {'random_state': 524, 'n_jobs': -1, 'loss': 'perceptron', 'penalty': 'elasticnet', 'alpha': alpha, 'l1_ratio': l1_ratio}

perceptron_tune = report_tuned_models(X_trains, y_train, params_fixed, SGDClassifier, get_params_perceptron, 'accuracy', ('accuracy', 'precision', 'recall', 'f1'))

[I 2024-05-16 07:11:18,656] A new study created in memory with name: no-name-96a133ac-6bb8-49cc-9acb-5ecda6326519
[I 2024-05-16 07:11:22,758] Trial 0 finished with value: 0.6257101735862527 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6257101735862527.
[I 2024-05-16 07:11:45,595] Trial 1 finished with value: 0.8138026734176979 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-16 07:11:49,805] Trial 2 finished with value: 0.6460426067907569 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-16 07:11:58,367] Trial 3 finished with value: 0.7338138272726954 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8138026734176979.
[I 2024-05-16 07:12:02,531] Trial 4 finished with value: 0.57

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.550032e+01  0.80596/0.81637/0.82945   

                 precision                   recall                       f1  \
0  0.78098/0.79122/0.80391  0.70681/0.72605/0.75739  0.74448/0.75713/0.77766   

         alpha     l1_ratio  
0 3.938711e-07 9.148105e-01  


[I 2024-05-16 07:39:45,247] Trial 0 finished with value: 0.5549243022702498 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5549243022702498.
[I 2024-05-16 07:40:04,243] Trial 1 finished with value: 0.8171979972945472 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-16 07:40:08,287] Trial 2 finished with value: 0.5918403509995795 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-16 07:40:14,833] Trial 3 finished with value: 0.7363391192819443 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8171979972945472.
[I 2024-05-16 07:40:18,972] Trial 4 finished with value: 0.5531869899751708 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.550032e+01  0.80596/0.81637/0.82945   
1          TF-IDF      (1, 3)   1.310447e+01  0.80951/0.81894/0.83143   

                 precision                   recall                       f1  \
0  0.78098/0.79122/0.80391  0.70681/0.72605/0.75739  0.74448/0.75713/0.77766   
1  0.78966/0.80185/0.82105  0.69529/0.71849/0.73928  0.73948/0.75780/0.77359   

         alpha     l1_ratio  
0 3.938711e-07 9.148105e-01  
1 1.824530e-07 9.575129e-01  


[I 2024-05-16 08:03:11,384] Trial 0 finished with value: 0.5930998558416855 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.5930998558416855.
[I 2024-05-16 08:03:32,945] Trial 1 finished with value: 0.8170402532572023 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-16 08:03:37,037] Trial 2 finished with value: 0.6524803375307735 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-16 08:03:44,910] Trial 3 finished with value: 0.7547770417653801 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8170402532572023.
[I 2024-05-16 08:03:48,757] Trial 4 finished with value: 0.6169461586817094 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 w

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.550032e+01  0.80596/0.81637/0.82945   
1          TF-IDF      (1, 3)   1.310447e+01  0.80951/0.81894/0.83143   
2         log(TF)      (1, 3)   1.554402e+01  0.81030/0.81976/0.83241   

                 precision                   recall                       f1  \
0  0.78098/0.79122/0.80391  0.70681/0.72605/0.75739  0.74448/0.75713/0.77766   
1  0.78966/0.80185/0.82105  0.69529/0.71849/0.73928  0.73948/0.75780/0.77359   
2  0.77210/0.79302/0.81940  0.71495/0.73534/0.75361  0.74763/0.76293/0.77593   

         alpha     l1_ratio  
0 3.938711e-07 9.148105e-01  
1 1.824530e-07 9.575129e-01  
2 1.150331e-07 8.100023e-01  


[I 2024-05-16 08:29:43,519] Trial 0 finished with value: 0.6328956335112935 and parameters: {'alpha': 0.0015376911652393395, 'l1_ratio': 0.3925870682115522}. Best is trial 0 with value: 0.6328956335112935.
[I 2024-05-16 08:30:00,903] Trial 1 finished with value: 0.8194088924518177 and parameters: {'alpha': 6.795850794294448e-08, 'l1_ratio': 0.9099172847632808}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-16 08:30:04,791] Trial 2 finished with value: 0.572573488978066 and parameters: {'alpha': 0.00040691375225885087, 'l1_ratio': 0.559987589925726}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-16 08:30:11,630] Trial 3 finished with value: 0.7503941613349395 and parameters: {'alpha': 2.6074972019493715e-05, 'l1_ratio': 0.6961134464846077}. Best is trial 1 with value: 0.8194088924518177.
[I 2024-05-16 08:30:15,740] Trial 4 finished with value: 0.5285520994540379 and parameters: {'alpha': 0.010485497004775017, 'l1_ratio': 0.12828816375274932}. Best is trial 1 wi

  vectorizer_type ngram_range  mean_fit_time                 accuracy  \
0              TF      (1, 3)   1.550032e+01  0.80596/0.81637/0.82945   
1          TF-IDF      (1, 3)   1.310447e+01  0.80951/0.81894/0.83143   
2         log(TF)      (1, 3)   1.554402e+01  0.81030/0.81976/0.83241   
3     log(TF)-IDF      (1, 3)   1.429695e+01  0.81169/0.82229/0.83518   

                 precision                   recall                       f1  \
0  0.78098/0.79122/0.80391  0.70681/0.72605/0.75739  0.74448/0.75713/0.77766   
1  0.78966/0.80185/0.82105  0.69529/0.71849/0.73928  0.73948/0.75780/0.77359   
2  0.77210/0.79302/0.81940  0.71495/0.73534/0.75361  0.74763/0.76293/0.77593   
3  0.78571/0.80540/0.82843  0.70926/0.72505/0.73484  0.75268/0.76301/0.77798   

         alpha     l1_ratio  
0 3.938711e-07 9.148105e-01  
1 1.824530e-07 9.575129e-01  
2 1.150331e-07 8.100023e-01  
3 1.824397e-07 9.442757e-01  
CPU times: user 3min 57s, sys: 33.9 s, total: 4min 31s
Wall time: 1h 41min 9s


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   15.4s finished


In [45]:
perceptron_tune = pd.DataFrame(perceptron_tune)
perceptron_tune.style.hide()

vectorizer_type,ngram_range,mean_fit_time,accuracy,precision,recall,f1,alpha,l1_ratio
TF,"(1, 3)",15.500321,0.80596/0.81637/0.82945,0.78098/0.79122/0.80391,0.70681/0.72605/0.75739,0.74448/0.75713/0.77766,0.0,0.91481
TF-IDF,"(1, 3)",13.104468,0.80951/0.81894/0.83143,0.78966/0.80185/0.82105,0.69529/0.71849/0.73928,0.73948/0.75780/0.77359,0.0,0.957513
log(TF),"(1, 3)",15.544019,0.81030/0.81976/0.83241,0.77210/0.79302/0.81940,0.71495/0.73534/0.75361,0.74763/0.76293/0.77593,0.0,0.810002
log(TF)-IDF,"(1, 3)",14.296947,0.81169/0.82229/0.83518,0.78571/0.80540/0.82843,0.70926/0.72505/0.73484,0.75268/0.76301/0.77798,0.0,0.944276


In [46]:
perceptron_tune.to_csv('perceptron_tune.csv', index=False, sep=',', encoding='utf-8')
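
The three linear models above differ only in the `loss` passed to `SGDClassifier`. As a rough illustration, the sketch below writes out the three per-sample losses as documented for scikit-learn, assuming labels encoded as $\pm 1$ and a raw decision score; the function names and the example scores are made up for this illustration and are not part of the notebook's pipeline.

```python
import numpy as np

def hinge_loss(y, score):
    # Soft-margin SVM loss: zero only when the margin y * score is at least 1.
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    # Logistic regression loss log(1 + exp(-y * score)), computed stably.
    return np.logaddexp(0.0, -y * score)

def perceptron_loss(y, score):
    # Perceptron loss: zero as soon as the point is on the correct side.
    return np.maximum(0.0, -y * score)

# Hypothetical decision scores for a sample whose true label is +1.
scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = 1.0

for name, loss_fn in [('hinge', hinge_loss),
                      ('log_loss', logistic_loss),
                      ('perceptron', perceptron_loss)]:
    print(name, loss_fn(y, scores))
```

A correctly classified point with score 0.5 still incurs a hinge loss of 0.5 but no perceptron loss, which is one reason the perceptron can behave differently from the SVM under the same penalty.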

## Training the final models

Now that we've obtained the best hyperparameters found by each search, we can train the models on the full training data. We'll save the models and evaluate them in the next notebook.

In [51]:
mnB_clfs = []
svm_clfs = []
logreg_clfs = []
perceptron_clfs = []
models = {}

for n in range(len(X_trains)):
    mnB_clfs.append(MultinomialNB(alpha=mnB_tune_category.at[n,'alpha']))
    svm_clfs.append(SGDClassifier(loss='hinge', penalty='elasticnet', alpha=SVM_tune.at[n,'alpha'], l1_ratio=SVM_tune.at[n,'l1_ratio']))
    logreg_clfs.append(SGDClassifier(loss='log_loss', penalty='elasticnet', alpha=log_reg_tune.at[n,'alpha'], l1_ratio=log_reg_tune.at[n,'l1_ratio']))
    perceptron_clfs.append(SGDClassifier(loss='perceptron', penalty='elasticnet', alpha=perceptron_tune.at[n,'alpha'], l1_ratio=perceptron_tune.at[n,'l1_ratio']))

    for model in [mnB_clfs[-1], svm_clfs[-1], logreg_clfs[-1], perceptron_clfs[-1]]:
        model.fit(X_trains[n], y_train)

    models[f"models/mnB_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = mnB_clfs[-1]
    models[f"models/svm_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = svm_clfs[-1]
    models[f"models/logreg_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = logreg_clfs[-1]
    models[f"models/perceptron_{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"] = perceptron_clfs[-1]

In [52]:
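import os
import joblib

# Make sure the output directory exists before writing to it.
os.makedirs('models', exist_ok=True)

for model_name, model in models.items():
    joblib.dump(model, model_name + '.joblib')

joblib.dump(video_category_encoder, 'models/video_category_encoder.joblib')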

['models/video_category_encoder.joblib']

In [53]:
import os

# Make sure the output directory exists before writing to it.
os.makedirs('vectorizers', exist_ok=True)

for n in range(len(vectorizers)):
    suffix = f"{params_fixed['vectorizer_type'][n]}_{params_fixed['ngram_range'][n]}"
    for field in ('channel_title', 'video_title', 'video_description'):
        joblib.dump(vectorizers[n][field], f"vectorizers/{field}_{suffix}.joblib")
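
As a sanity check on the saving step, the following is a minimal sketch of a `joblib` round trip. It uses a small synthetic dataset and a temporary directory, both of which are assumptions made for this illustration rather than anything from the notebook's data, and checks that a reloaded classifier gives the same predictions as the original.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (hypothetical, not the YouTube features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = SGDClassifier(loss='hinge', penalty='elasticnet', random_state=524)
clf.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the target directory first; joblib.dump does not create it.
    model_dir = os.path.join(tmp_dir, 'models')
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, 'svm_demo.joblib')
    joblib.dump(clf, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(clf.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

The same `joblib.load` call is what a later notebook would use to restore the saved models and vectorizers, provided it runs with a compatible scikit-learn version.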

## Probability calibration

The cross-validation scores show that the models are far from perfect (the linear models above reach accuracies of around 0.82), so their raw outputs should not be read as trustworthy probabilities. We would like to model the probability $P(y\in \mathcal{C}|P)$ that a data point $y$ belongs to class $\mathcal{C}$ given the prediction $P$ of each model, which is not the same as the probability the model itself reports. (Some models, such as the SVM and the perceptron, report no probabilities at all.) We can do this with a probability calibrator, which treats the prediction of each model as a feature that is then used to model the true probability. This requires validation data, so we'll again use a five-fold cross-validation split.

In [54]:
from sklearn.calibration import CalibratedClassifierCV

calibrated_clfs = {}

for model_name, model in models.items():
    calibrated_clfs[model_name] = CalibratedClassifierCV(model, cv=KFold(n_splits=5, random_state=42, shuffle=True))
    calibrated_clfs[model_name].fit(X_train, y_train)
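
To make the effect of calibration concrete, here is a small self-contained sketch on synthetic data (the dataset and the variable names `X_demo`, `y_demo`, `base` and `calibrated` are assumptions for illustration only). A hinge-loss `SGDClassifier` only exposes a decision score, while the calibrated wrapper, using its default sigmoid method, turns that score into class probabilities.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

# Synthetic binary classification problem standing in for the real features.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

# A linear SVM trained by SGD: it provides decision_function but no probabilities.
base = SGDClassifier(loss='hinge', random_state=524)

# Each fold fits a copy of the base model on four fifths of the data and
# fits the calibrator on the held-out fifth.
calibrated = CalibratedClassifierCV(
    base, cv=KFold(n_splits=5, shuffle=True, random_state=42)
)
calibrated.fit(X_demo, y_demo)

proba = calibrated.predict_proba(X_demo)
print(proba[:3])
```

Each row of `proba` holds the calibrated probabilities for class 0 and class 1, so the rows sum to one.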

We'll save the calibrated models for evaluation in the next notebook:

In [55]:
for model_name in calibrated_clfs:
    joblib.dump(calibrated_clfs[model_name], model_name+'_calibrated.joblib')

## Stacking

Now that we have our sixteen models, we can combine them into a single classifier that uses all of their predictions. One approach is stacking, in which a meta-classifier gathers the predictions of the individual models and uses them as features to produce a final prediction. The meta-classifier has to be trained with cross-validation, so that it learns from out-of-fold predictions, and we need to choose a model for it. We will compare two choices: logistic regression and Gaussian naive Bayes.

In [56]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

In [58]:
stacking_logreg = StackingClassifier(list(models.items()), final_estimator=LogisticRegression(max_iter=10000), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_logreg.fit(X_train, y_train)

In [61]:
from sklearn.naive_bayes import GaussianNB

stacking_gnb = StackingClassifier(list(models.items()), final_estimator=GaussianNB(), cv=KFold(n_splits=5, random_state=42, shuffle=True))
stacking_gnb.fit(X_train, y_train)

In [62]:
joblib.dump(stacking_logreg, 'models/stacking_logreg.joblib')
joblib.dump(stacking_gnb, 'models/stacking_gnb.joblib')

['models/stacking_gnb.joblib']

We've successfully built a total of 34 different classical ML models: 16 base models, obtained by combining four classification approaches (Bayesian and linear) with four text vectorisation methods (TF, log(TF), TF-IDF and log(TF)-IDF vectorisation), a calibrated version of each of these, and two stacking classifiers. Probability calibration and stacking are intended to further improve model performance. In the next notebook we'll compare the performance of these models on the test data.