**Result Analysis** <br>

This notebook compares and contrasts results across 4 vectorization methods, 13 models, and 4 evaluation metrics. The analysis proceeds in four stages:
1. Load all experiment results into a dictionary and use the pandas `DataFrame.style` functionality to highlight model performance.
2. Compute the train/test score ratio to gauge how much each model overfits, and use it to select the cross-validation setting on which to base our conclusions.
3. Extract the combined F1 test scores for cross-validation = 8, and color-highlight the highest- and lowest-performing models.
4. Compare these F1 scores with those of models evaluated on the ChatGPT 3.5 dataset.

In [1]:
import pandas as pd

In [2]:
def load_exp_result(vectors, cvs):
    """load all training results"""
    train_results = {}
    for v in vectors:
        for c in cvs:
            path = f"./experiment_results/Exp_result-vector_3000_{v}-cv={c}.log"
            name = f"{v}-cv={c}"
            # raw string avoids the invalid "\s" escape warning; logs are whitespace-delimited
            train_results[name] = pd.read_csv(path, sep=r"\s+")
    return train_results

vectors_name = ['bow', 'sbert', 'sent2vec', 'tfidf']
cvs = [3, 5, 8]
train_results = load_exp_result(vectors_name, cvs)
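
As a minimal sketch of what the loader relies on: when a whitespace-delimited file has a header row with one fewer field than its data rows, `pd.read_csv` uses the first column as the index. The log excerpt below is hypothetical (the real `.log` files are assumed to follow this layout and probably contain more columns and rows); only the parsing behavior is being illustrated.

```python
import io

import pandas as pd

# Hypothetical log excerpt; the layout of the real log files is an assumption.
sample_log = """\
fit_time  test_accuracy  test_f1_score
LogisticRegression  0.822492  0.842  0.84351
SVC  14.135022  0.829667  0.831069
"""

# The header has 3 fields and each data row has 4, so the model name becomes the index.
df = pd.read_csv(io.StringIO(sample_log), sep=r"\s+")

print(df.index.tolist())
print(df.loc["SVC", "test_f1_score"])
```

With model names as the index, later cells can select rows by name, for example when excluding `SVC` from an average.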

In [5]:
def train_test_ratio(train_results, vectors, cvs, score):
    """compute the train/test ratio for each model and each cv for a given score"""
    result = pd.DataFrame()
    for v in vectors:
        for c in cvs:
            name = f"{v}-cv={c}"
            train = f"train_{score}"
            test = f"test_{score}"
            temp = pd.DataFrame(train_results[name][train]/train_results[name][test], columns=[name])
            result = pd.concat([result, temp], axis=1)
    return result

vectors_name = ['bow', 'sbert', 'sent2vec', 'tfidf']
cvs = [3, 5, 8]
overfitting_analysis = train_test_ratio(train_results, vectors_name, cvs, 'accuracy')

**ANALYSIS 1** Review individual results per vectorized dataset.

In [6]:
columns = ['fit_time', 'score_time', 'test_accuracy', 'test_f1_score', 'test_precision', 'test_recall']
train_results['bow-cv=8'][columns].style.background_gradient(cmap="Blues")

model,fit_time,score_time,test_accuracy,test_f1_score,test_precision,test_recall
LogisticRegression,0.822492,0.015499,0.842,0.84351,0.835192,0.85266
RidgeClassifier,1.689055,0.015832,0.627,0.638305,0.618648,0.660609
SVC,14.135022,1.209551,0.829667,0.831069,0.823675,0.83932
SGDClassifier,0.588027,0.010768,0.769667,0.712951,0.935267,0.580715
Perceptron,0.416715,0.009431,0.821667,0.826095,0.806339,0.847334
GaussianNB,0.243056,0.030439,0.719333,0.704109,0.742534,0.670615
DecisionTreeClassifier,1.19826,0.009997,0.728,0.723478,0.736267,0.71337
BaggingClassifier,3.886169,0.048102,0.757,0.756759,0.757667,0.756695
AdaBoostClassifier,5.543517,0.060561,0.774,0.776707,0.766102,0.78866
RandomForestClassifier,2.196557,0.038708,0.822,0.809064,0.870772,0.756681


In [7]:
columns = ['fit_time', 'score_time', 'test_accuracy', 'test_f1_score', 'test_precision', 'test_recall']
train_results['sbert-cv=8'][columns].style.background_gradient(cmap="Blues")

model,fit_time,score_time,test_accuracy,test_f1_score,test_precision,test_recall
LogisticRegression,0.409801,0.01302,0.834333,0.832632,0.841798,0.824681
RidgeClassifier,0.041585,0.014225,0.840667,0.836479,0.858715,0.816006
SVC,2.265665,0.445193,0.799333,0.794456,0.813988,0.776699
SGDClassifier,0.204513,0.008655,0.83,0.827102,0.841351,0.814022
Perceptron,0.064159,0.008633,0.807333,0.805412,0.813369,0.798672
GaussianNB,0.013584,0.012857,0.762667,0.762862,0.763301,0.763355
DecisionTreeClassifier,1.13108,0.007845,0.676333,0.677456,0.674894,0.68131
BaggingClassifier,6.284672,0.011514,0.684333,0.673506,0.697453,0.652005
AdaBoostClassifier,4.802708,0.018793,0.752667,0.736396,0.788204,0.691319
RandomForestClassifier,3.702533,0.032346,0.749667,0.74194,0.765607,0.719987


In [8]:
columns = ['fit_time', 'score_time', 'test_accuracy', 'test_f1_score', 'test_precision', 'test_recall']
train_results['sent2vec-cv=8'][columns].style.background_gradient(cmap="Blues")

model,fit_time,score_time,test_accuracy,test_f1_score,test_precision,test_recall
LogisticRegression,1.224523,0.011687,0.836667,0.837207,0.834893,0.839988
RidgeClassifier,0.126309,0.015578,0.835333,0.836138,0.832448,0.840006
SVC,3.474157,0.697498,0.811333,0.814187,0.802287,0.827334
SGDClassifier,0.347071,0.009695,0.855667,0.855611,0.856183,0.855334
Perceptron,0.244308,0.009991,0.823667,0.822953,0.826338,0.820009
GaussianNB,0.030291,0.012327,0.687333,0.666572,0.714149,0.626042
DecisionTreeClassifier,2.175453,0.007613,0.624,0.618571,0.627365,0.611979
BaggingClassifier,45.081314,0.030707,0.695333,0.681701,0.713122,0.655361
AdaBoostClassifier,28.214847,0.043256,0.717,0.711634,0.725491,0.699354
RandomForestClassifier,6.227296,0.025128,0.737333,0.733237,0.745867,0.722035


In [9]:
columns = ['fit_time', 'score_time', 'test_accuracy', 'test_f1_score', 'test_precision', 'test_recall']
train_results['tfidf-cv=8'][columns].style.background_gradient(cmap="Blues")

model,fit_time,score_time,test_accuracy,test_f1_score,test_precision,test_recall
LogisticRegression,0.506081,0.008634,0.816333,0.815058,0.821366,0.81
RidgeClassifier,1.017989,0.009047,0.604333,0.606822,0.602567,0.612619
SVC,12.017256,2.497663,0.830333,0.828944,0.837263,0.821321
SGDClassifier,0.369154,0.006274,0.775333,0.742623,0.867679,0.65391
Perceptron,0.328118,0.006445,0.802333,0.803903,0.79914,0.809346
GaussianNB,0.222436,0.040558,0.665,0.655411,0.67507,0.637957
DecisionTreeClassifier,1.174084,0.015061,0.703667,0.704376,0.701267,0.708777
BaggingClassifier,11.814149,0.115748,0.779333,0.762643,0.825205,0.709338
AdaBoostClassifier,16.219816,0.148155,0.816,0.816977,0.812273,0.822673
RandomForestClassifier,2.272838,0.029008,0.829333,0.823209,0.853124,0.796002


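**ANALYSIS 2** Review overfitting. The ratio is calculated as train score / test score (accuracy here), so values further above 1 indicate a larger gap between training and test performance.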
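
As a small worked example of this ratio, the sketch below uses hypothetical scores (not taken from the experiment logs) and an assumed cut-off of 1.2 that is chosen purely for illustration.

```python
import pandas as pd

# Hypothetical scores, not taken from the experiment logs.
scores = pd.DataFrame(
    {"train_accuracy": [1.00, 0.90], "test_accuracy": [0.80, 0.90]},
    index=["model_a", "model_b"],
)

# Ratio of train score to test score: 1.0 means no gap between the two.
ratio = scores["train_accuracy"] / scores["test_accuracy"]

# Assumed cut-off for illustration only; the notebook does not define one.
flagged = ratio[ratio > 1.2].index.tolist()

print(ratio)
print(flagged)
```

Here `model_a` scores perfectly on the training folds but only 0.80 on the test folds, giving a ratio of 1.25, while `model_b` generalizes without a gap and has a ratio of 1.0.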

In [10]:
overfitting_analysis.style.background_gradient(cmap="Greens")

model,bow-cv=3,bow-cv=5,bow-cv=8,sbert-cv=3,sbert-cv=5,sbert-cv=8,sent2vec-cv=3,sent2vec-cv=5,sent2vec-cv=8,tfidf-cv=3,tfidf-cv=5,tfidf-cv=8
LogisticRegression,1.183899,1.181102,1.187648,1.133564,1.110843,1.098511,1.185889,1.180104,1.178088,1.221498,1.221001,1.22499
RidgeClassifier,1.331558,1.469868,1.594896,1.099359,1.078051,1.075053,1.165592,1.148817,1.136815,1.383763,1.512098,1.654717
SVC,1.219146,1.198083,1.205303,1.217701,1.13177,1.205291,1.176887,1.16923,1.156063,1.211507,1.199738,1.199805
SGDClassifier,1.327625,1.250001,1.292458,1.113692,1.099149,1.091911,1.142426,1.1309,1.110356,1.269198,1.263806,1.287189
Perceptron,1.223492,1.214083,1.21698,1.106271,1.061498,1.064469,1.136107,1.113646,1.114702,1.255443,1.234061,1.246128
GaussianNB,1.285177,1.242664,1.243546,1.067269,1.054974,1.05201,1.022131,1.02533,1.016212,1.389313,1.340796,1.344719
DecisionTreeClassifier,1.136238,1.118748,1.162938,1.267478,1.243715,1.221363,1.332787,1.328168,1.304946,1.155052,1.208898,1.111321
BaggingClassifier,1.26661,1.286532,1.287915,1.380896,1.379266,1.411037,1.445946,1.503374,1.436037,1.280368,1.323051,1.271417
AdaBoostClassifier,1.067036,1.028729,1.020057,1.042943,1.091539,1.042135,1.030913,1.063971,1.067278,1.065934,1.081306,1.088235
RandomForestClassifier,1.089728,1.075335,1.075078,1.277778,1.268111,1.266022,1.333711,1.304867,1.327952,1.123167,1.11658,1.112943


**ANALYSIS 3** Review combined model results. The tables below summarize F1 test scores across models with cv = 8. <br>
- The highest-performing model for each vectorization method is highlighted in yellow (first table).
- The lowest-performing model for each vectorization method is highlighted in green (second table).

In [11]:
col_names = ['fit_time', 'score_time', 'bow', 'sbert', 'sent2vec', 'tfidf']
score = "f1_score"
cv = 8

# benchmark the four vectorization methods on F1 test score with cv=8; fit_time and score_time are averaged across the four methods
fit_time = pd.DataFrame((train_results[f'bow-cv={cv}']['fit_time'] + train_results[f'sbert-cv={cv}']['fit_time'] + train_results[f'sent2vec-cv={cv}']['fit_time'] + train_results[f'tfidf-cv={cv}']['fit_time'])/4)
score_time = pd.DataFrame((train_results[f'bow-cv={cv}']['score_time'] + train_results[f'sbert-cv={cv}']['score_time'] + train_results[f'sent2vec-cv={cv}']['score_time'] + train_results[f'tfidf-cv={cv}']['score_time'])/4)
bow = pd.DataFrame(train_results[f'bow-cv={cv}'][f'test_{score}'])
sbert = pd.DataFrame(train_results[f'sbert-cv={cv}'][f'test_{score}'])
sent2vec = pd.DataFrame(train_results[f'sent2vec-cv={cv}'][f'test_{score}'])
tfidf = pd.DataFrame(train_results[f'tfidf-cv={cv}'][f'test_{score}'])
summary_table = pd.concat([fit_time, score_time, bow, sbert, sent2vec, tfidf], axis=1)
summary_table.columns = col_names
summary_table.style.highlight_max(color='yellow') #highlight the highest performing model

model,fit_time,score_time,bow,sbert,sent2vec,tfidf
LogisticRegression,0.740724,0.01221,0.84351,0.832632,0.837207,0.815058
RidgeClassifier,0.718735,0.013671,0.638305,0.836479,0.836138,0.606822
SVC,7.973025,1.212476,0.831069,0.794456,0.814187,0.828944
SGDClassifier,0.377191,0.008848,0.712951,0.827102,0.855611,0.742623
Perceptron,0.263325,0.008625,0.826095,0.805412,0.822953,0.803903
GaussianNB,0.127342,0.024045,0.704109,0.762862,0.666572,0.655411
DecisionTreeClassifier,1.419719,0.010129,0.723478,0.677456,0.618571,0.704376
BaggingClassifier,16.766576,0.051518,0.756759,0.673506,0.681701,0.762643
AdaBoostClassifier,13.695222,0.067691,0.776707,0.736396,0.711634,0.816977
RandomForestClassifier,3.599806,0.031297,0.809064,0.74194,0.733237,0.823209
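
The summary cell above repeats the same lookups for each vectorization method. One possible way to express it more compactly is sketched below; `build_summary` is a hypothetical helper that is not part of the original notebook. The demo uses a two-model excerpt of the cv=8 tables for `bow` and `sbert` only, so its averaged times differ from the four-method averages shown above.

```python
import pandas as pd


def build_summary(results, vectors, cv, score):
    """Hypothetical helper: average the timings and collect one test score per vectorizer."""
    frames = [results[f"{v}-cv={cv}"] for v in vectors]
    summary = pd.DataFrame(
        {
            "fit_time": sum(f["fit_time"] for f in frames) / len(frames),
            "score_time": sum(f["score_time"] for f in frames) / len(frames),
        }
    )
    for v, f in zip(vectors, frames):
        summary[v] = f[f"test_{score}"]
    return summary


# Two-model excerpt of the cv=8 results shown earlier (bow and sbert only).
models = ["LogisticRegression", "SVC"]
demo_results = {
    "bow-cv=8": pd.DataFrame(
        {
            "fit_time": [0.822492, 14.135022],
            "score_time": [0.015499, 1.209551],
            "test_f1_score": [0.84351, 0.831069],
        },
        index=models,
    ),
    "sbert-cv=8": pd.DataFrame(
        {
            "fit_time": [0.409801, 2.265665],
            "score_time": [0.01302, 0.445193],
            "test_f1_score": [0.832632, 0.794456],
        },
        index=models,
    ),
}

demo_summary = build_summary(demo_results, ["bow", "sbert"], cv=8, score="f1_score")
print(demo_summary)
```

Passing the `Turbo_*` names as `vectors` would let the same helper serve the second dataset without duplicating the cell.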


In [15]:
# average score_time excluding SVC
summary_table.loc[summary_table.index != 'SVC', 'score_time'].mean()

0.033114354166666665

In [16]:
# highlight the lowest-performing model in each column
summary_table.style.highlight_min(color="green")

model,fit_time,score_time,bow,sbert,sent2vec,tfidf
LogisticRegression,0.740724,0.01221,0.84351,0.832632,0.837207,0.815058
RidgeClassifier,0.718735,0.013671,0.638305,0.836479,0.836138,0.606822
SVC,7.973025,1.212476,0.831069,0.794456,0.814187,0.828944
SGDClassifier,0.377191,0.008848,0.712951,0.827102,0.855611,0.742623
Perceptron,0.263325,0.008625,0.826095,0.805412,0.822953,0.803903
GaussianNB,0.127342,0.024045,0.704109,0.762862,0.666572,0.655411
DecisionTreeClassifier,1.419719,0.010129,0.723478,0.677456,0.618571,0.704376
BaggingClassifier,16.766576,0.051518,0.756759,0.673506,0.681701,0.762643
AdaBoostClassifier,13.695222,0.067691,0.776707,0.736396,0.711634,0.816977
RandomForestClassifier,3.599806,0.031297,0.809064,0.74194,0.733237,0.823209
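
Color highlighting is only visible in the rendered notebook. As a sketch of a text-based alternative, `idxmax` and `idxmin` return the best and worst model per vectorization method directly. The frame below is rebuilt by hand from the F1 columns of the rows shown above rather than loaded from the logs.

```python
import pandas as pd

models = [
    "LogisticRegression", "RidgeClassifier", "SVC", "SGDClassifier", "Perceptron",
    "GaussianNB", "DecisionTreeClassifier", "BaggingClassifier",
    "AdaBoostClassifier", "RandomForestClassifier",
]

# F1 test scores (cv=8) copied from the rows displayed in the summary table.
f1 = pd.DataFrame(
    {
        "bow": [0.84351, 0.638305, 0.831069, 0.712951, 0.826095,
                0.704109, 0.723478, 0.756759, 0.776707, 0.809064],
        "sbert": [0.832632, 0.836479, 0.794456, 0.827102, 0.805412,
                  0.762862, 0.677456, 0.673506, 0.736396, 0.74194],
        "sent2vec": [0.837207, 0.836138, 0.814187, 0.855611, 0.822953,
                     0.666572, 0.618571, 0.681701, 0.711634, 0.733237],
        "tfidf": [0.815058, 0.606822, 0.828944, 0.742623, 0.803903,
                  0.655411, 0.704376, 0.762643, 0.816977, 0.823209],
    },
    index=models,
)

best = f1.idxmax()   # model with the highest F1 per vectorization method
worst = f1.idxmin()  # model with the lowest F1 per vectorization method

print(pd.DataFrame({"best": best, "worst": worst}))
```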


**ANALYSIS 4** Now we run the experiment pipeline again, this time using a dataset extracted from ChatGPT 3.5.

In [17]:
vectors_name = ['Turbo_bow', 'Turbo_sbert', 'Turbo_sent2vec', 'Turbo_tfidf']
cvs = [3, 5, 8]
train_results = load_exp_result(vectors_name, cvs)

In [21]:
vectors_name = ['Turbo_bow', 'Turbo_sbert', 'Turbo_sent2vec', 'Turbo_tfidf']
cvs = [3, 5, 8]
overfitting_analysis = train_test_ratio(train_results, vectors_name, cvs, 'accuracy')

In [22]:
overfitting_analysis.style.background_gradient(cmap="Greens")

model,Turbo_bow-cv=3,Turbo_bow-cv=5,Turbo_bow-cv=8,Turbo_sbert-cv=3,Turbo_sbert-cv=5,Turbo_sbert-cv=8,Turbo_sent2vec-cv=3,Turbo_sent2vec-cv=5,Turbo_sent2vec-cv=8,Turbo_tfidf-cv=3,Turbo_tfidf-cv=5,Turbo_tfidf-cv=8
LogisticRegression,1.011805,1.011463,1.010441,1.10096,1.091654,1.084,1.080692,1.087745,1.077586,1.014885,1.017639,1.015228
RidgeClassifier,1.025291,1.03484,1.042391,1.047326,1.040555,1.038867,1.082878,1.070668,1.067415,1.036627,1.053001,1.054482
SVC,1.01283,1.014199,1.012146,1.110673,1.118717,1.106251,1.089325,1.092499,1.090117,1.01626,1.016604,1.015572
SGDClassifier,1.071429,1.118937,1.06051,1.091704,1.080318,1.074953,1.089879,1.088796,1.063054,1.040799,1.070229,1.058011
Perceptron,1.033058,1.029513,1.031992,1.079435,1.073618,1.060345,1.089004,1.077266,1.07231,1.039141,1.030928,1.032347
GaussianNB,1.117933,1.095532,1.083082,1.040385,1.030982,1.031571,1.012206,1.010266,1.005463,1.117094,1.094004,1.086609
DecisionTreeClassifier,1.028913,1.016595,1.026339,1.350527,1.259928,1.382306,1.19022,1.180915,1.171403,1.033725,1.038518,1.043622
BaggingClassifier,1.063343,1.061283,1.061637,1.389482,1.402855,1.473052,1.325792,1.238322,1.240738,1.050981,1.052405,1.0531
AdaBoostClassifier,1.021458,1.013605,1.021368,1.035149,1.036574,1.041437,1.045991,1.040051,1.034426,1.031756,1.025206,1.021904
RandomForestClassifier,1.007284,1.006087,1.009602,1.229184,1.258045,1.218657,1.144774,1.153317,1.180618,1.006603,1.009153,1.009938


In [23]:
col_names = ['fit_time', 'score_time', 'Turbo_bow', 'Turbo_sbert', 'Turbo_sent2vec', 'Turbo_tfidf']
score = "f1_score"
cv = 8

# benchmark the four vectorization methods on F1 test score with cv=8; fit_time and score_time are averaged across the four methods
fit_time = pd.DataFrame((train_results[f'Turbo_bow-cv={cv}']['fit_time'] + train_results[f'Turbo_sbert-cv={cv}']['fit_time'] + train_results[f'Turbo_sent2vec-cv={cv}']['fit_time'] + train_results[f'Turbo_tfidf-cv={cv}']['fit_time'])/4)
score_time = pd.DataFrame((train_results[f'Turbo_bow-cv={cv}']['score_time'] + train_results[f'Turbo_sbert-cv={cv}']['score_time'] + train_results[f'Turbo_sent2vec-cv={cv}']['score_time'] + train_results[f'Turbo_tfidf-cv={cv}']['score_time'])/4)
bow = pd.DataFrame(train_results[f'Turbo_bow-cv={cv}'][f'test_{score}'])
sbert = pd.DataFrame(train_results[f'Turbo_sbert-cv={cv}'][f'test_{score}'])
sent2vec = pd.DataFrame(train_results[f'Turbo_sent2vec-cv={cv}'][f'test_{score}'])
tfidf = pd.DataFrame(train_results[f'Turbo_tfidf-cv={cv}'][f'test_{score}'])
summary_table = pd.concat([fit_time, score_time, bow, sbert, sent2vec, tfidf], axis=1)
summary_table.columns = col_names
summary_table.style.highlight_max(color='yellow') #highlight the highest performing model

model,fit_time,score_time,Turbo_bow,Turbo_sbert,Turbo_sent2vec,Turbo_tfidf
LogisticRegression,0.650146,0.012287,0.989559,0.904674,0.928932,0.984913
RidgeClassifier,0.898047,0.012986,0.958488,0.92463,0.925536,0.948445
SVC,5.812186,0.630767,0.987873,0.894934,0.918265,0.984642
SGDClassifier,0.344514,0.007789,0.945249,0.90484,0.929989,0.946857
Perceptron,0.333724,0.00934,0.968626,0.89783,0.92032,0.968223
GaussianNB,0.196417,0.026213,0.920379,0.812482,0.740143,0.909343
DecisionTreeClassifier,2.305436,0.009308,0.925407,0.620811,0.726481,0.907942
BaggingClassifier,22.789661,0.078301,0.938891,0.655749,0.797464,0.947101
AdaBoostClassifier,14.765375,0.105328,0.978304,0.724023,0.812904,0.978179
RandomForestClassifier,3.444786,0.028017,0.982013,0.788492,0.816458,0.982278


In [24]:
# highlight the lowest-performing model in each column
summary_table.style.highlight_min(color="green")

model,fit_time,score_time,Turbo_bow,Turbo_sbert,Turbo_sent2vec,Turbo_tfidf
LogisticRegression,0.650146,0.012287,0.989559,0.904674,0.928932,0.984913
RidgeClassifier,0.898047,0.012986,0.958488,0.92463,0.925536,0.948445
SVC,5.812186,0.630767,0.987873,0.894934,0.918265,0.984642
SGDClassifier,0.344514,0.007789,0.945249,0.90484,0.929989,0.946857
Perceptron,0.333724,0.00934,0.968626,0.89783,0.92032,0.968223
GaussianNB,0.196417,0.026213,0.920379,0.812482,0.740143,0.909343
DecisionTreeClassifier,2.305436,0.009308,0.925407,0.620811,0.726481,0.907942
BaggingClassifier,22.789661,0.078301,0.938891,0.655749,0.797464,0.947101
AdaBoostClassifier,14.765375,0.105328,0.978304,0.724023,0.812904,0.978179
RandomForestClassifier,3.444786,0.028017,0.982013,0.788492,0.816458,0.982278


In [19]:
# full model results for the best-performing vectorization method on this dataset: bow
columns = ['fit_time', 'score_time', 'test_accuracy', 'test_f1_score', 'test_precision', 'test_recall']
train_results['Turbo_bow-cv=8'][columns].style.background_gradient(cmap="Blues")

model,fit_time,score_time,test_accuracy,test_f1_score,test_precision,test_recall
LogisticRegression,0.942576,0.014742,0.989667,0.989559,0.996639,0.982674
RidgeClassifier,2.078178,0.01614,0.959333,0.958488,0.977969,0.939996
SVC,10.447374,1.4359,0.988,0.987873,0.996017,0.980011
SGDClassifier,0.528982,0.0074,0.942,0.945249,0.90194,0.994009
Perceptron,0.694323,0.01161,0.969,0.968626,0.978874,0.958649
GaussianNB,0.469014,0.051482,0.915333,0.920379,0.870031,0.97734
DecisionTreeClassifier,3.16763,0.013488,0.925667,0.925407,0.929141,0.921998
BaggingClassifier,23.332305,0.154434,0.938667,0.938891,0.936062,0.942016
AdaBoostClassifier,25.955393,0.21166,0.978333,0.978304,0.980057,0.976665
RandomForestClassifier,2.45036,0.02865,0.982,0.982013,0.980771,0.983324
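
To make the comparison between the two datasets explicit, the F1 scores can be placed side by side and differenced. The sketch below copies three bag-of-words rows (cv=8) by hand from the tables in this notebook. The labels `first_dataset` and `chatgpt35_dataset` are names chosen here for illustration.

```python
import pandas as pd

models = ["LogisticRegression", "SVC", "RandomForestClassifier"]

# bow F1 test scores at cv=8, copied from the tables in this notebook.
comparison = pd.DataFrame(
    {
        "first_dataset": [0.84351, 0.831069, 0.809064],
        "chatgpt35_dataset": [0.989559, 0.987873, 0.982013],
    },
    index=models,
)

# Positive values mean the model scores higher on the ChatGPT 3.5 dataset.
comparison["delta"] = comparison["chatgpt35_dataset"] - comparison["first_dataset"]

print(comparison.sort_values("delta", ascending=False))
```

For these three models the bag-of-words F1 score is roughly 0.15 to 0.17 higher on the ChatGPT 3.5 dataset.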
