## Context

* For this project, you will take part in a Kaggle competition based on tabular data. The goal is to design a machine learning algorithm that, given information on a particular concertgoer's experience, automatically classifies how much that concertgoer enjoyed the concert. This is a classification problem with 4 classes. The training dataset consists of 170,000 examples and the test dataset contains 30,000 examples.

* Each training row contains a unique ID, 18 attributes, and 1 target containing the class
that needs to be predicted. You will be evaluated on the mean F1-score on the private test
leaderboard.

## Instructions

* To participate in the competition, you must provide a list of predicted outputs for the
test instances on the Kaggle website. To solve the problem, you are encouraged to use any
classification methods you can think of, presented in the course or otherwise. Looking
into creative ways to create new features from those provided may prove especially useful
in this competition.

* The goal of this competition is to classify a particular concert experience into one of four classes:
1. Worst Concert Ever
2. Did Not Enjoy
3. Enjoyed
4. Best Concert Ever

* To perform this task you will be given information on the band, the venue, as well as the specific concertgoers.
 
* The dataset contains information on the specific concert, the specific band, and the specific concertgoer. Note that all three are the same across the training data and the test data, so any conclusion about the characteristics of a particular band, concert, or concertgoer also extends to the test set.

* Unfortunately, the data-gathering step was not impeccable. Expect some of the training attributes not to reflect the underlying reality in every row. However, the "Concert Experience" column has been verified and is 100% accurate.

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Load data
train_data = pd.read_csv('./data/train.csv')
valid_data = pd.read_csv('./data/test.csv')

In [2]:
train_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin,Concert Enjoyment
0,ConcertExperience_180106,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,900.0,2980.0,False,False,,0.330843,-0.958408,-0.943548,-1.636806,29.0,concert_goer_1985,140.0,Paraguay,Did Not Enjoy
1,ConcertExperience_146268,Beyond Devon,Pop Music,United States of America (USA),1968.0,731.0,54.0,True,False,True,-2.069449,0.017777,-1.910675,0.610265,43.0,concert_goer_1874,158.0,United Kingdom (UK),Enjoyed
2,ConcertExperience_128743,Ron Talent,Rock n Roll,Canada,1955.0,,162754.0,False,False,True,-0.484268,1.968772,-0.064167,-1.260871,68.0,concert_goer_442,159.0,United States of America (USA),Did Not Enjoy
3,ConcertExperience_140839,Devon Revival,RnB,United States of America (USA),1992.0,704.0,8103.0,False,True,False,-0.858054,1.022827,-0.348389,-1.147251,17.0,concert_goer_1149,150.0,Canada,Worst Concert Ever
4,ConcertExperience_19149,Beyond Devon,Pop Music,United States of America (USA),1968.0,95.0,54.0,False,False,False,-0.793029,-1.166528,-0.043766,0.969661,59.0,concert_goer_930,166.0,United Kingdom (UK),Did Not Enjoy
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
169995,ConcertExperience_14055,Crazy Joystick Cult,RnB,Canada,1985.0,70.0,162754.0,True,False,False,-0.095021,0.175175,0.914245,0.357359,50.0,concert_goer_707,180.0,United States of America (USA),Did Not Enjoy
169996,ConcertExperience_192792,Crazy Joystick Cult,RnB,Canada,1985.0,963.0,54.0,False,False,False,-0.733719,-0.285776,-0.323312,0.641180,71.0,concert_goer_1373,143.0,Bulgaria,Worst Concert Ever
169997,ConcertExperience_152942,"Why Frogs, Why?",Heavy Metal,Canada,2005.0,764.0,54.0,False,False,False,0.744969,-0.965547,1.020598,1.027389,27.0,concert_goer_1286,176.0,Canada,Did Not Enjoy
169998,ConcertExperience_138957,Twilight of the Joystick Gods,Hip Hop/Rap,United States of America (USA),1995.0,694.0,22026.0,False,True,True,0.821976,0.351411,0.175762,1.455654,39.0,concert_goer_1845,176.0,Canada,Did Not Enjoy


In [3]:
valid_data

Unnamed: 0,Id,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Concert Goer ID,Height (cm),Concert Goer Country of Origin
0,ConcertExperience_70055,The Crazy Heroes of Devon,Rock n Roll,United States of America (USA),1980.0,350.0,2980.0,True,False,True,1.065107,0.057660,0.249639,-0.933976,74.0,concert_goer_1587,165.0,United States of America (USA)
1,ConcertExperience_34799,Joystick for the Jockies,Hip Hop/Rap,United States of America (USA),2014.0,173.0,8103.0,True,True,False,-0.886947,0.801365,0.525624,0.176655,29.0,concert_goer_293,151.0,Kenya
2,ConcertExperience_100410,Puddle of Joystick,Rock n Roll,Canada,2010.0,502.0,2980.0,True,True,False,0.744700,-0.797531,-0.034166,-0.226052,27.0,concert_goer_1068,146.0,Canada
3,ConcertExperience_106446,Flight of the Knickers,,Canada,2014.0,532.0,22026.0,True,False,False,-0.134180,-0.361512,0.969404,-2.341205,38.0,concert_goer_1315,183.0,United States of America (USA)
4,ConcertExperience_127249,Devon Revival,RnB,United States of America (USA),1992.0,636.0,2980.0,False,False,False,1.407366,-0.084155,-0.673233,1.733714,21.0,concert_goer_1777,177.0,Fiji
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,ConcertExperience_82288,Joystick of the Big Knickers,Hip Hop/Rap,United States of America (USA),1979.0,411.0,2980.0,True,False,False,-0.421714,-1.549670,-0.351770,0.132489,42.0,concert_goer_1710,178.0,United States of America (USA)
29996,ConcertExperience_27139,Big Division,Hip Hop/Rap,United States of America (USA),1978.0,135.0,8103.0,True,False,False,0.615087,-0.047092,0.339228,0.820159,37.0,concert_goer_1758,186.0,Canada
29997,ConcertExperience_197434,Crazyplay,Pop Music,United States of America (USA),1995.0,987.0,8103.0,False,False,True,-1.396551,-0.508627,-1.692584,1.640931,45.0,concert_goer_1481,158.0,Greece
29998,ConcertExperience_166029,Lord of the Crazy Frogs,RnB,United States of America (USA),1968.0,830.0,8103.0,False,True,False,0.168073,-0.785460,0.898273,1.608389,36.0,concert_goer_1461,170.0,United Kingdom (UK)


In [4]:
#Count the number of NaN in each column

train_data.isnull().sum()

Id                                  0
Band Name                         859
Band Genre                        884
Band Country of Origin            790
Band Debut                        857
Concert ID                        870
Concert Attendance                895
Inside Venue                      838
Rain                              861
Seated                            832
Personnality Trait 1              852
Personnality Trait 2              849
Personnality Trait 3              893
Personnality Trait 4              865
Concert Goer Age                  853
Concert Goer ID                   815
Height (cm)                       847
Concert Goer Country of Origin    859
Concert Enjoyment                   0
dtype: int64

In [257]:
# Preprocessing pipeline:
# Convert all "Insert ..." placeholders to NaN
# Drop 'Concert Goer ID' and 'Id' columns
# Impute 'Concert Goer Age' with median
# Impute 'Height' with median
# One hot encode 'Band Name', 'Band Genre', 'Band Country of Origin', 'Concert Goer Country of Origin'

# Convert "Insert Band Name" "Insert Band Genre" "Insert Band Country of Origin" to NaN

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer


from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV

In [258]:
train_data = train_data.replace({'Insert Band Name':np.nan, 'Insert Band Genre':np.nan, 'Insert Band Country of Origin':np.nan})
valid_data = valid_data.replace({'Insert Band Name':np.nan, 'Insert Band Genre':np.nan, 'Insert Band Country of Origin':np.nan})
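
The two `replace` calls above list three placeholder strings explicitly. A minimal sketch of a more general version follows, assuming every placeholder cell starts with the word `Insert` (an assumption: only the three strings above are confirmed). The toy frame and the helper name `placeholders_to_nan` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def placeholders_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every string cell starting with 'Insert ' becomes NaN.

    Assumption: all placeholders follow the 'Insert <Column Name>' pattern.
    """
    return df.replace(to_replace=r"^Insert .*$", value=np.nan, regex=True)


# Toy data, not taken from the competition files
toy = pd.DataFrame({
    "Band Name": ["Beyond Devon", "Insert Band Name", "Ron Talent"],
    "Band Genre": ["Pop Music", "RnB", "Insert Band Genre"],
    "Concert Attendance": [54.0, 8103.0, 2980.0],
})

cleaned = placeholders_to_nan(toy)
print(cleaned.isnull().sum())
```

Re-running `isnull().sum()` after this step is a quick way to see how many placeholders were hidden among the non-null values.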

In [259]:
# Remove the "Concert Goer ID" and "Id" columns
train_data = train_data.drop(['Concert Goer ID'], axis=1)
train_data = train_data.drop(['Id'], axis=1)

valid_data = valid_data.drop(['Concert Goer ID'], axis=1)
valid_data = valid_data.drop(['Id'], axis=1)

In [260]:
X = train_data.drop(['Concert Enjoyment'], axis=1)
y = train_data['Concert Enjoyment']

In [261]:
X.head()

Unnamed: 0,Band Name,Band Genre,Band Country of Origin,Band Debut,Concert ID,Concert Attendance,Inside Venue,Rain,Seated,Personnality Trait 1,Personnality Trait 2,Personnality Trait 3,Personnality Trait 4,Concert Goer Age,Height (cm),Concert Goer Country of Origin
0,Teenage Crazy Blue Knickers,Indie/Alt Rock,United States of America (USA),1976.0,900.0,2980.0,False,False,,0.330843,-0.958408,-0.943548,-1.636806,29.0,140.0,Paraguay
1,Beyond Devon,Pop Music,United States of America (USA),1968.0,731.0,54.0,True,False,True,-2.069449,0.017777,-1.910675,0.610265,43.0,158.0,United Kingdom (UK)
2,Ron Talent,Rock n Roll,Canada,1955.0,,162754.0,False,False,True,-0.484268,1.968772,-0.064167,-1.260871,68.0,159.0,United States of America (USA)
3,Devon Revival,RnB,United States of America (USA),1992.0,704.0,8103.0,False,True,False,-0.858054,1.022827,-0.348389,-1.147251,17.0,150.0,Canada
4,Beyond Devon,Pop Music,United States of America (USA),1968.0,95.0,54.0,False,False,False,-0.793029,-1.166528,-0.043766,0.969661,59.0,166.0,United Kingdom (UK)


In [262]:
y.head()

0         Did Not Enjoy
1               Enjoyed
2         Did Not Enjoy
3    Worst Concert Ever
4         Did Not Enjoy
Name: Concert Enjoyment, dtype: object

In [263]:
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Band Name', 'Band Genre', 'Band Country of Origin', 'Concert Goer Country of Origin']),
    remainder='passthrough')
# Encode the four target labels as integers 0-3
y = y.replace({'Worst Concert Ever':0, 'Did Not Enjoy':1, 'Enjoyed':2, 'Best Concert Ever':3})
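
The preprocessing plan listed earlier mentions median imputation, but the transformer above only one-hot encodes. XGBoost can handle missing numeric values on its own, so imputation is optional there, but other estimators need it. The following is a hedged sketch of how the planned steps could be combined in one transformer. The column subset, the toy rows, and the choice of `most_frequent` for categorical gaps are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["Band Genre", "Concert Goer Country of Origin"]
numeric = ["Concert Goer Age", "Height (cm)"]

preprocess = make_column_transformer(
    # Fill missing categories with the most frequent value, then one-hot encode.
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")), categorical),
    # Fill missing numeric values with the column median
    (SimpleImputer(strategy="median"), numeric),
    remainder="drop",
)

# Toy rows, not taken from the competition files
toy_train = pd.DataFrame({
    "Band Genre": ["RnB", "Pop Music", np.nan, "RnB"],
    "Concert Goer Country of Origin": ["Canada", np.nan, "Canada", "Fiji"],
    "Concert Goer Age": [29.0, np.nan, 68.0, 17.0],
    "Height (cm)": [140.0, 158.0, np.nan, 150.0],
})
toy_test = pd.DataFrame({
    "Band Genre": ["Heavy Metal"],  # category never seen during fit
    "Concert Goer Country of Origin": ["Canada"],
    "Concert Goer Age": [np.nan],
    "Height (cm)": [176.0],
})

preprocess.fit(toy_train)
encoded = preprocess.transform(toy_test)
# The transformer may return a sparse matrix; convert for inspection
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(encoded)
```

Fitting on the training rows only and then calling `transform` on the test rows keeps the medians and category lists free of test-set information.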


In [264]:
# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
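
The split above is purely random. Since the class shares printed at the end of this notebook are roughly 40/40/10/10, a stratified split may give a held-out set whose class balance matches the training data more closely. A small sketch on toy labels follows (the toy data and variable names are illustrative).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with a 10/40/40/10 class balance, similar to the training set
y_toy = pd.Series([0] * 10 + [1] * 40 + [2] * 40 + [3] * 10)
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=7, stratify=y_toy
)

# Class shares in the held-out part mirror those of the full toy set
print(y_te.value_counts(normalize=True).sort_index())
```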

In [265]:
column_trans.fit_transform(X)

<170000x231 sparse matrix of type '<class 'numpy.float64'>'
	with 2414289 stored elements in Compressed Sparse Row format>

In [266]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1000, max_depth = 3, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6522058823529412
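
A note on the metric: for single-label multiclass problems, micro-averaged F1 is identical to accuracy, while macro-averaged F1 weights each class equally and is more sensitive to the two rare classes. Which averaging the leaderboard's "mean F1-Score" uses is not stated above, so it may be worth checking on the competition page. The toy labels below are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: the classifier mostly predicts the majority class 0
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 3])
y_pred = np.array([0, 0, 0, 0, 1, 0, 0, 0])

micro = f1_score(y_true, y_pred, average="micro")
# zero_division=0 silences the warning for classes that are never predicted
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1 = {micro:.3f}, accuracy = {accuracy:.3f}, macro F1 = {macro:.3f}")
```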

In [267]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 3, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.655264705882353

In [268]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 2000, max_depth = 3, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6575588235294118

In [269]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 4, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6629411764705883

In [270]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6674117647058824

In [271]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 6, learning_rate = 0.1, colsample_bytree = 0.6, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6696176470588235

In [272]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.7, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6667058823529411

In [273]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6659117647058823

In [274]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.9, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6674117647058824

In [275]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 1, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6688235294117647

In [276]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.7, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6681470588235294

In [277]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.8, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6670882352941176

In [278]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6675

In [280]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 3, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6669411764705883

In [281]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 2, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6675294117647059

In [282]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 1500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.6, objective = 'multi:softmax', num_class = 4, min_child_weight = 5, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6671176470588235

In [283]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 2000, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6690588235294118

In [284]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 2500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6698235294117647

In [285]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 3000, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6691470588235294

In [286]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 3500, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.6688529411764705

In [287]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 4000, max_depth = 5, learning_rate = 0.1, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.668235294117647

In [288]:
pipe = make_pipeline(column_trans, XGBClassifier(n_estimators = 2500, max_depth = 5, learning_rate = 0.05, colsample_bytree = 0.8, subsample = 0.9, objective = 'multi:softmax', num_class = 4, min_child_weight = 4, n_jobs = -1))
pipe.fit(X_train, y_train)
f1_score(y_test, pipe.predict(X_test), average='micro')

0.667235294117647
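
The cells above change one hyperparameter at a time by hand. A loop over a small grid can record the same kind of comparison in one table. The sketch below uses `ParameterGrid` with a `RandomForestClassifier` and synthetic data as stand-ins so that it runs without XGBoost; with `XGBClassifier` the grid keys would be `n_estimators`, `max_depth`, `colsample_bytree`, and so on. Scores chosen this way on a single held-out split tend to be slightly optimistic, because the same split is used both to pick and to report the configuration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic 4-class data standing in for the concert dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=7,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo
)

# Stand-in grid; every combination is fitted and scored once
grid = ParameterGrid({"n_estimators": [20, 40], "max_depth": [3, 5]})

rows = []
for params in grid:
    model = RandomForestClassifier(random_state=7, **params)
    model.fit(X_tr, y_tr)
    score = f1_score(y_te, model.predict(X_te), average="micro")
    rows.append({**params, "f1_micro": score})

# One row per configuration, best score first
results = pd.DataFrame(rows).sort_values("f1_micro", ascending=False)
print(results.to_string(index=False))
```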

In [250]:
# Grid search for best parameters
param_grid = {
    'n_estimators': [1000, 1500, 2000],
    'max_depth': [3,4, 5, 6, 7],
    'learning_rate': [0.1,0.2],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1],
    'min_child_weight': [3, 4, 5, 6],
    'objective': ['multi:softmax'],
    'num_class': [4]
}

In [251]:
CV_clf = GridSearchCV(estimator=pipe, param_grid=param_grid, cv= 3, verbose=1, n_jobs=-1, scoring='f1_micro')

In [252]:
CV_clf.fit(X_train, y_train)

Fitting 3 folds for each of 3000 candidates, totalling 9000 fits


ValueError: Invalid parameter 'colsample_bytree' for estimator Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['Band Name', 'Band Genre',
                                                   'Band Country of Origin',
                                                   'Concert Goer Country of '
                                                   'Origin'])])),
                ('xgbclassifier',
                 XGBClassifier(base_score=None, booster=None, callbacks=None,
                               colsample_bylevel=None, colsample_bynode=None,
                               colsample_bytree=None,...
                               feature_types=None, gamma=None, gpu_id=None,
                               grow_policy=None, importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, n_estimators=100,
                               n_jobs=None, num_parallel_tree=None,
                               predictor=None, random_state=None, ...))]). Valid parameters are: ['memory', 'steps', 'verbose'].
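
The error occurs because `GridSearchCV` sets parameters on the `Pipeline` object, whose own parameters are only `memory`, `steps`, and `verbose`. Parameters of a step must be addressed as `<step name>__<parameter>`. The step name shown in the message is `xgbclassifier`, so `colsample_bytree` would become `xgbclassifier__colsample_bytree`. The sketch below shows the prefixing on a small stand-in pipeline (a `DecisionTreeClassifier` on synthetic data, chosen only so the example runs quickly without XGBoost).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for the concert dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=4,
    n_clusters_per_class=1, random_state=7,
)

demo_pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=7))

# make_pipeline names each step after its lowercased class name
step_name = demo_pipe.steps[-1][0]

# Prefix every key with the step name so the pipeline can route it
raw_grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
param_grid = {f"{step_name}__{key}": values for key, values in raw_grid.items()}

search = GridSearchCV(demo_pipe, param_grid=param_grid, cv=3, scoring="f1_micro")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Separately, the grid above has 3,000 combinations, which means 9,000 fits with 3 folds. A smaller grid or `RandomizedSearchCV` may be more practical for models with over a thousand trees.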

In [None]:
CV_clf.best_params_

### Test with Random Forest

----

In [314]:
clf = RandomForestClassifier(n_estimators=500, max_depth=25)
clf.fit(X_train, y_train)
f1_score(y_test, clf.predict(X_test), average='micro')

0.6490055585901102

---

## Test set prediction

In [258]:
best_model = XGBClassifier(n_estimators=1500, max_depth=5, objective='multi:softmax', num_class=4, learning_rate=0.1, colsample_bytree=0.6, subsample=1)
best_model.fit(X_train_3, y_train)
f1_score(y_test, best_model.predict(X_test_3), average='micro')

0.6734569289592905

In [None]:
best_y_pred = best_model.predict(valid_data_np)
best_y_pred

array([3, 2, 2, ..., 2, 0, 0])

In [None]:
len(valid_data_np)

30000

In [None]:
# Convert y_pred back to the original labels and save it to submission2.csv
y_final_pred = pd.DataFrame(best_y_pred, columns = ['Predicted'])

y_final_pred['Predicted'] = y_final_pred['Predicted'].map({0: 'Worst Concert Ever', 1: 'Did Not Enjoy', 2: 'Enjoyed', 3: 'Best Concert Ever'})

# Insert the original 'Id' values from the test file as the first column
first_y_test = pd.read_csv('./data/test.csv')
y_final_pred.insert(0, 'Id', first_y_test['Id'])

# Save the result to submission2.csv
y_final_pred.to_csv('submission2.csv', index = False)

# analyze the result
y_final_pred['Predicted'].value_counts() / y_final_pred.shape[0]

Enjoyed               0.448367
Did Not Enjoy         0.435067
Worst Concert Ever    0.065867
Best Concert Ever     0.050700
Name: Predicted, dtype: float64

In [None]:
y_final_pred['Predicted'].value_counts().sum()

30000

In [None]:
analysis = pd.read_csv('./data/train.csv')
analysis['Concert Enjoyment'].value_counts() / analysis.shape[0]

Enjoyed               0.400153
Did Not Enjoy         0.399676
Best Concert Ever     0.100159
Worst Concert Ever    0.100012
Name: Concert Enjoyment, dtype: float64
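
Placing the two distributions side by side makes the gap easier to read. The sketch below simply re-enters the shares printed above; the variable names are illustrative. If the test set has a class balance similar to the training set (which is not stated), the model appears to under-predict the two extreme classes.

```python
import pandas as pd

# Shares copied from the two outputs above
predicted = pd.Series({
    "Enjoyed": 0.448367,
    "Did Not Enjoy": 0.435067,
    "Worst Concert Ever": 0.065867,
    "Best Concert Ever": 0.050700,
}, name="predicted")
training = pd.Series({
    "Enjoyed": 0.400153,
    "Did Not Enjoy": 0.399676,
    "Best Concert Ever": 0.100159,
    "Worst Concert Ever": 0.100012,
}, name="training")

# concat aligns the two series on the class label
comparison = pd.concat([training, predicted], axis=1)
comparison["difference"] = comparison["predicted"] - comparison["training"]
print(comparison.round(4))
```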