# CISC 372: Advanced Data Analytics
## Project: Predicting whether a song has the potential to become a hit
### Project Done By: Udbhav Balaji
### Student Number: 20179467

In [1]:
# importing the necessary libraries
import warnings
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from xgboost.sklearn import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import pickle

warnings.filterwarnings('ignore')

  from pandas import MultiIndex, Int64Index


In [2]:
# Reading in the 3 datasets
data_90s = pd.read_csv('Datasets/dataset-of-90s.csv')
data_00s = pd.read_csv('Datasets/dataset-of-10s.csv')
data_10s = pd.read_csv('Datasets/dataset-of-10s.csv')

In [3]:
# Since all the 3 datasets have the same columns, we can simply concatenate them together in order to create a master dataset
master_data = pd.concat([data_90s, data_00s, data_10s], ignore_index=True, axis=0)
master_data.size

348004

In [4]:
# In order to avoid any bias due to time period, let's shuffle the rows so that their order is random
master_data = shuffle(master_data)
master_data.reset_index(inplace=True, drop=True)

In [5]:
master_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18316 entries, 0 to 18315
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             18316 non-null  object 
 1   artist            18316 non-null  object 
 2   uri               18316 non-null  object 
 3   danceability      18316 non-null  float64
 4   energy            18316 non-null  float64
 5   key               18316 non-null  int64  
 6   loudness          18316 non-null  float64
 7   mode              18316 non-null  int64  
 8   speechiness       18316 non-null  float64
 9   acousticness      18316 non-null  float64
 10  instrumentalness  18316 non-null  float64
 11  liveness          18316 non-null  float64
 12  valence           18316 non-null  float64
 13  tempo             18316 non-null  float64
 14  duration_ms       18316 non-null  int64  
 15  time_signature    18316 non-null  int64  
 16  chorus_hit        18316 non-null  float6

Now that we have our master dataset in place, let's actually see what kind of information the dataset holds. We have the following features in the dataset:

1. track - The name of the track
2. artist - The name of the artist that made the track
3. uri - The resource identifier of the track
4. danceability - How danceable the track is (0.0 is lowest, 1.0 is highest)
5. energy - How much energy does the track have (0.0 is the lowest, 1.0 is the highest)
6. key - The estimated key of the track (0 = C, etc.)
7. loudness - The overall loudness of the track in decibels (dB), Values range from -60 to 0 dB.
8. mode - Indicates the modality of the track (whether the track is major or minor).
9. speechiness - It represents the presence of spoken-words in the track. The more exclusively speech-like the recording of the song, the higher the value. (0.0 is the lowest, 1.0 is the highest).
10. acousticness - A confidence metric from 0.0 to 1.0 of whether the track is acoustic. (0.0 is lowest confidence, 1.0 is the highest confidence)
11. instrumentalness - Measure of how harmonized the vocals are (0.0 is the lowest, 1.0 is the highest).
12. liveness - detects confidence about whether the song was recorded in the presence of a live audience.
13. valence - A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
14. tempo - overall estimated tempo of the track in Beats Per Minute (BPM).
15. duration_ms - The duration of the track in milliseconds.
16. time_signature - The estimated overall time signature of the track. (Conventional measure of how many beats are in one bar).
17. chorus_hit - The estimated timestamp when the chorus will hit. It has been assumed that the chorus of every song begins at the 3rd section of the song.
18. sections - The number of sections that a track has. 
19. target - This is the target feature. '1' implies that it is a 'flop', 0 implies it is a 'flop'.

In [6]:
# Dropping duplicate rows, if any, from the master dataset
master_data.drop_duplicates(keep='first', inplace=True)
master_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11915 entries, 0 to 18311
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track             11915 non-null  object 
 1   artist            11915 non-null  object 
 2   uri               11915 non-null  object 
 3   danceability      11915 non-null  float64
 4   energy            11915 non-null  float64
 5   key               11915 non-null  int64  
 6   loudness          11915 non-null  float64
 7   mode              11915 non-null  int64  
 8   speechiness       11915 non-null  float64
 9   acousticness      11915 non-null  float64
 10  instrumentalness  11915 non-null  float64
 11  liveness          11915 non-null  float64
 12  valence           11915 non-null  float64
 13  tempo             11915 non-null  float64
 14  duration_ms       11915 non-null  int64  
 15  time_signature    11915 non-null  int64  
 16  chorus_hit        11915 non-null  float6

In [7]:
# Separating the dataset from the target variable
X = master_data.drop(labels='target', axis=1, inplace=False)
y = master_data.target

# Splitting our dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7)

In [8]:
# Gathering all numeric features together
numeric_features = [
    'danceability','energy','loudness','speechiness','acousticness',
    'instrumentalness','liveness','valence','tempo','duration_ms','chorus_hit','sections'
            ]

# Gathering all categorical features together
categorical_features = [
    'artist','key','mode','time_signature'
]

# Creating the numeric preprocessing pipeline to feed our ML model
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Creating the categorical preprocessing pipeline to feed our ML model
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Creating the column transpformer 
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Logistic Regression

In [9]:
# First, let's try using Logistic Regression
clf_log = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression())
])

# Defining the param grid that we want to test using CV
param_grid_log = {
    'clf__penalty': ['l2'],
    # 'clf__dual': [True, False],
    'clf__fit_intercept': [True, False],
    'clf__class_weight': ['balanced', None],
    'clf__solver': ['newton-cg','lbfgs','liblinear','sag','saga']
}

In [10]:
# Applying the transformations to the data (both training and testing sets)
X_train = X_train[[*numeric_features, *categorical_features]]
X_test = X_test[[*numeric_features, *categorical_features]]

In [11]:
# Performing RandomizedSearchCV to perform hyper-parameter tuning of the Logistic Regression model
random_log = RandomizedSearchCV(
    clf_log, param_grid_log, cv=5, verbose=3,
    n_jobs=2, scoring='accuracy'
)

# Fitting the training data to the model
random_log.fit(X_train, y_train)

# Getting the best possible model
print(f'Best Score = {random_log.best_score_}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits




[CV 1/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.836 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.793 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.811 total time=   0.1s
[CV 4/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.792 total time=   0.1s
[CV 5/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.801 total time=   0.1s




[CV 1/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.836 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.793 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.811 total time=   0.1s
[CV 4/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.792 total time=   0.1s
[CV 5/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.803 total time=   0.1s
[CV 1/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=liblinear;, score=0.848 total time=   0.0s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=liblinear;, score=0.808 total time=   0.0s
[CV 3/5] END clf__class_weight=balanced, clf__fit_in



[CV 1/5] END clf__class_weight=None, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.832 total time=   0.1s
[CV 2/5] END clf__class_weight=None, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.792 total time=   0.1s
[CV 3/5] END clf__class_weight=None, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.810 total time=   0.1s
[CV 5/5] END clf__class_weight=None, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.811 total time=   0.1s
[CV 4/5] END clf__class_weight=None, clf__fit_intercept=False, clf__penalty=l2, clf__solver=saga;, score=0.796 total time=   0.1s




[CV 1/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.834 total time=   0.1s
[CV 2/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.792 total time=   0.1s
[CV 4/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.796 total time=   0.1s
[CV 3/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.810 total time=   0.1s
[CV 5/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=saga;, score=0.810 total time=   0.1s
[CV 1/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=newton-cg;, score=0.848 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penalty=l2, clf__solver=newton-cg;, score=0.821 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=False, clf__penal



[CV 5/5] END clf__class_weight=None, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.819 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=newton-cg;, score=0.808 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=newton-cg;, score=0.821 total time=   0.1s
[CV 5/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=newton-cg;, score=0.826 total time=   0.1s
[CV 4/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=newton-cg;, score=0.806 total time=   0.1s




[CV 1/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.843 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.804 total time=   0.1s
[CV 4/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.801 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.820 total time=   0.1s
[CV 5/5] END clf__class_weight=balanced, clf__fit_intercept=True, clf__penalty=l2, clf__solver=sag;, score=0.822 total time=   0.1s
Best Score = 0.8217696029460736


In [12]:
# Testing the model on the testing set
y_pred = random_log.predict(X_test)

# Getting the accuracy of the model
conf_mat = confusion_matrix(y_pred, y_test)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))

Overall accuracy: 83.15549694281262 %


# Random Forest Classifier

In [13]:
clf_rf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier())
])

param_grid_rf = {
    'clf__n_estimators': [20,50,100,200],
    'clf__criterion': ['gini','entropy'],
    'clf__max_depth': [20,50],
    'clf__max_features': ['auto','sqrt','log2'],
    'clf__class_weight': ['balanced','balanced_subsample']
}

In [14]:
random_rf = RandomizedSearchCV(
    clf_rf, param_grid_rf, cv=5, n_jobs=2, 
    verbose=3, scoring='accuracy'
)

random_rf.fit(X_train, y_train)

print(f'Best score = {random_rf.best_score_}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END clf__class_weight=balanced, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=0.807 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=0.782 total time=   0.1s
[CV 3/5] END clf__class_weight=balanced, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=0.811 total time=   0.1s
[CV 4/5] END clf__class_weight=balanced, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=0.793 total time=   0.1s
[CV 5/5] END clf__class_weight=balanced, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=0.805 total time=   0.1s
[CV 2/5] END clf__class_weight=balanced_subsample, clf__criterion=gini, clf__max_depth=50, clf__max_features=log2, clf__n_estimators=20;, score=

In [15]:
# Testing the model on the testing set
y_pred = random_rf.predict(X_test)

# Getting the accuracy of the model
conf_mat = confusion_matrix(y_pred, y_test)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))

Overall accuracy: 83.40726531590936 %


# XGBoost Classifier

In [16]:
clf_xg = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('clf', XGBClassifier(verbose=0))
])

param_grid_xg = {
    'clf__n_estimators': [10,20,50,100,200],
    'clf__max_depth': [10,20,30],
    'clf__learning_rate': [0.1,0.001,0.001],
    'clf__objective': ['binary:logistic','reg:squarederror'],
    'clf__booster': ['gbtree','gblinear']
}

In [17]:
random_xg = RandomizedSearchCV(
    clf_xg, param_grid_xg, cv=5, n_jobs=2,
    verbose=0, scoring='accuracy'
)

random_xg.fit(X_train, y_train)

print(f'Best score = {random_xg.best_score_}')

  from pandas import MultiIndex, Int64Index
  from pandas import MultiIndex, Int64Index


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_



Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verb



Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.






Parameters: { "max_depth", "verbose" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


Best score = 0.8659748095042212


In [18]:
print(f'Best score = {random_xg.best_score_}')

Best score = 0.8659748095042212


In [19]:
# Testing the model on the testing set
y_pred = random_xg.predict(X_test)

# Getting the accuracy of the model
conf_mat = confusion_matrix(y_pred, y_test)
acc = np.sum(conf_mat.diagonal()) / np.sum(conf_mat)
print('Overall accuracy: {} %'.format(acc*100))

Overall accuracy: 88.15489749430525 %


Now that we have a model that has quite a good result in terms of generalizing to new data, let's serialize the model and store it in a pickle file

In [21]:
# Storing the XGB model as a pickle file to use in our application
filename = 'model.pkl'
pickle.dump(random_xg, open(filename, 'wb'))