
# MBIT School

## Executive Master en Data Science (2020-2021)

  
by

*Nuria Espadas*  
*Mireia Vecino*  
*Tomeu Mir*  

### Notebook: modelling excluded

This workbook uses the datasets created during the previous analysis (EDA and another modelling workbook) to evaluate different predictive models. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
DATA_PATH = '../data/'

In [4]:
df_train_and_test= pd.read_csv('data/df_train_and_test.csv')

In [5]:
df_train_and_test = pd.read_csv('data/df_train_and_test.csv')
cols_to_exclude=['Unnamed: 0',
                 'I_TOTAL_SALES_SC',
                'SOURCE_COUNTRY_CODE',
                'I_BOOKINGDATE',
                'I_STARTDATE',
                 'dir',
                 'presMax'
               ]
df_train_and_test.drop(cols_to_exclude, axis=1, inplace=True)
df_train_and_test.shape

(9020, 14)

In [6]:
df_train_and_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9020 entries, 0 to 9019
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   I_DAYSBEFOREBOOK     9020 non-null   int64  
 1   STOCK_CODE           9020 non-null   object 
 2   STOCK_NAME           9020 non-null   object 
 3   ADT                  9020 non-null   int64  
 4   CHD                  9020 non-null   int64  
 5   INF                  9020 non-null   int64  
 6   LEAD_PAX_AGE         9020 non-null   float64
 7   prec                 9020 non-null   float64
 8   tmax                 9020 non-null   float64
 9   velmedia             9020 non-null   float64
 10  sol                  9020 non-null   float64
 11  i_booking_dayofweek  9020 non-null   int64  
 12  i_start_dayofweek    9020 non-null   int64  
 13  i_avg_sales          9020 non-null   float64
dtypes: float64(6), int64(6), object(2)
memory usage: 986.7+ KB


In [7]:
df_train_and_test.head()

| | I_DAYSBEFOREBOOK | STOCK_CODE | STOCK_NAME | ADT | CHD | INF | LEAD_PAX_AGE | prec | tmax | velmedia | sol | i_booking_dayofweek | i_start_dayofweek | i_avg_sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | XESTCIBCSG | Teide Masca (Grand Tour) | 2 | 0 | 0 | 41.0 | 0.0 | 31.7 | 7.5 | 11.0 | 1 | 6 | 36.0 |
| 1 | 0 | XESTCIBCSG | Teide Masca (Grand Tour) | 2 | 0 | 0 | 48.794693 | 0.2 | 24.9 | 7.2 | 1.9 | 2 | 2 | 36.0 |
| 2 | 4 | XESTCIBPNI | Freebird (3H Vip Exclusive) | 2 | 1 | 0 | 44.0 | 0.0 | 27.1 | 6.9 | 10.8 | 0 | 6 | 41.0 |
| 3 | 2 | PESTCI4HYS | Gomera Safari Tour | 3 | 0 | 0 | 67.0 | 0.0 | 25.5 | 9.2 | 4.8 | 3 | 5 | 95.0 |
| 4 | 3 | XESTCIBCTW | Music Hall Tavern | 2 | 0 | 0 | 64.0 | 0.0 | 22.9 | 8.9 | 11.3 | 0 | 3 | 39.0 |


Columns excluded from the feature matrix (the target code and its descriptive name):

In [8]:
target =['STOCK_CODE','STOCK_NAME']

We divide our dataset into training and test sets.

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(df_train_and_test.drop(target, axis=1),
                                                  df_train_and_test["STOCK_CODE"], 
                                                  test_size=0.3, 
                                                  random_state=1)
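
As a rough check of what this split produces, the sketch below repeats the call on stand-in arrays with the same number of rows (9020). The arrays and the 17 hypothetical labels are assumptions for illustration, not the real booking data. With `test_size=0.3`, 2706 rows go to the test set and 6314 to the training set, and a fixed `random_state` makes the partition reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the notebook's dataframe (9020).
X_demo = np.arange(9020).reshape(-1, 1)
y_demo = np.arange(9020) % 17   # hypothetical labels, not the real STOCK_CODE values

# Two identical calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)             # sizes of the two partitions
print(np.array_equal(X_te, split_b[1]))   # same seed gives the same test rows
```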

Let's try different classification algorithms to see how they work with our dataset.

To further refine the algorithms, we will perform different transformations on the data:
- Standardising
- Principal Component Analysis (PCA)
- Kernel principal component analysis (KPCA)

**Standardising**
Standardisation is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like standard normally distributed data. Here we standardise the features by removing the mean and scaling to unit variance.

In [10]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_trainS = sc_X.fit_transform(X_train)
X_testS = sc_X.transform(X_test)
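
To make the transformation concrete, here is a minimal sketch on a toy matrix (the values are made up for illustration). It checks that `StandardScaler` matches the manual formula z = (x - mean) / std, using the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (hypothetical values): two features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# Same result computed by hand, column by column (np.std uses ddof=0 by default).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

As in the cell above, the scaler should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into the transformation.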

**Principal Component Analysis (PCA)**
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.

In [11]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_trainP = pca.fit_transform(X_train)
X_testP = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
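
PCA is sensitive to the scale of the features, and in the cell above it is fitted on the unstandardised `X_train`. The sketch below uses synthetic data (not the notebook's) to show how a single large-scale feature can dominate `explained_variance_ratio_`, and how the picture changes after standardising. The feature scales chosen here are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three independent synthetic features; the third has a much larger scale.
X_demo = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
    rng.normal(0, 100, 500),
])

pca_raw = PCA(n_components=2).fit(X_demo)
pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# On raw data the first component is almost entirely the large-scale feature.
print("raw:         ", pca_raw.explained_variance_ratio_)
# After standardising, the variance is spread more evenly across components.
print("standardised:", pca_std.explained_variance_ratio_)
```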

**Kernel principal component analysis (KPCA)**
Non-linear dimensionality reduction through the use of kernels.

In [12]:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = "rbf")
X_trainK = kpca.fit_transform(X_train)
X_testK = kpca.transform(X_test)
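
For reference, a minimal sketch of the same API on a small synthetic dataset (two concentric circles, a classic non-linear structure). The dataset and the `gamma=10` value are assumptions for illustration; the notebook leaves `gamma` at its default, which scikit-learn sets to `1 / n_features`.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Synthetic data: two concentric circles, which no straight line can separate.
X_demo, y_demo = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA; gamma controls the width of the kernel (illustrative value).
kpca_demo = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca_demo.fit_transform(X_demo)

# New points are projected with the already fitted model, as done for X_test above.
X_new_kpca = kpca_demo.transform(X_demo[:5])

print(X_kpca.shape, X_new_kpca.shape)
```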

### K-Nearest Neighbors (K-NN)
This method looks at the observations closest to the one we are trying to predict and assigns it the majority class among those neighbours. It is based on the calculation of distances.
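
The idea can be sketched from scratch in a few lines. The helper `knn_predict` and the toy points below are hypothetical and only illustrate the majority vote; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest reference points."""
    distances = np.linalg.norm(X_ref - x_new, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                 # indices of the k closest points
    votes = Counter(y_ref[i] for i in nearest)          # count the labels of the neighbours
    return votes.most_common(1)[0][0]

# Two well separated toy clusters (made-up values).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0])))
```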

In [13]:
from sklearn.neighbors import KNeighborsClassifier
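# Note: p is the power parameter of the Minkowski metric; with metric="euclidean" it has no effect.
classifier = KNeighborsClassifier(n_neighbors=10, metric="euclidean", p=10)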
classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.5136733185513673

               precision    recall  f1-score   support

  LESTCI4FWU       0.70      0.50      0.58        42
  PESTCI4FLK       0.47      0.60      0.53        55
  PESTCI4FN8       0.17      0.12      0.14        86
  PESTCI4FNG       0.42      0.47      0.45       264
  PESTCI4FXQ       0.37      0.62      0.46       183
  PESTCI4HYM       0.13      0.04      0.06        93
  PESTCI4HYS       0.99      0.90      0.94        91
  PESTCI4IGS       0.28      0.25      0.27       106
  PESTCI4IJM       0.90      0.59      0.71       142
  PESTCI4KNA       0.42      0.42      0.42        48
  PESTCI4SUG       0.24      0.09      0.13        78
  PESTCI6CBQ       0.62      0.67      0.64        39
  XESTCI9VN2       0.90      0.96      0.93       116
  XESTCIB26U       0.17      0.10      0.13        49
  XESTCIB26W       0.27      0.14      0.18        78
  XESTCIB2R4       0.17      0.06      0.09        77
  XESTCIB4FU       0.57      0.82      0.67

**Standardising our features**

In [14]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10, metric="euclidean", p=10)
classifier.fit(X_trainS,y_train)
y_pred = classifier.predict(X_testS)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.3617886178861789

               precision    recall  f1-score   support

  LESTCI4FWU       0.44      0.45      0.45        42
  PESTCI4FLK       0.35      0.24      0.28        55
  PESTCI4FN8       0.15      0.14      0.14        86
  PESTCI4FNG       0.25      0.41      0.31       264
  PESTCI4FXQ       0.26      0.42      0.32       183
  PESTCI4HYM       0.05      0.02      0.03        93
  PESTCI4HYS       0.75      0.64      0.69        91
  PESTCI4IGS       0.28      0.28      0.28       106
  PESTCI4IJM       0.53      0.43      0.47       142
  PESTCI4KNA       0.35      0.31      0.33        48
  PESTCI4SUG       0.11      0.03      0.04        78
  PESTCI6CBQ       0.22      0.18      0.20        39
  XESTCI9VN2       0.72      0.66      0.69       116
  XESTCIB26U       0.23      0.20      0.22        49
  XESTCIB26W       0.04      0.01      0.02        78
  XESTCIB2R4       0.11      0.05      0.07        77
  XESTCIB4FU       0.50      0.69      0.58

**Principal Component Analysis (PCA)**

In [15]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10, metric="euclidean", p=10)
classifier.fit(X_trainP,y_train)
y_pred = classifier.predict(X_testP)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.5716925351071692

               precision    recall  f1-score   support

  LESTCI4FWU       0.53      0.74      0.62        42
  PESTCI4FLK       0.76      0.64      0.69        55
  PESTCI4FN8       0.37      0.35      0.36        86
  PESTCI4FNG       0.65      0.72      0.68       264
  PESTCI4FXQ       0.63      0.73      0.68       183
  PESTCI4HYM       0.45      0.23      0.30        93
  PESTCI4HYS       0.99      0.91      0.95        91
  PESTCI4IGS       0.39      0.37      0.38       106
  PESTCI4IJM       0.74      0.69      0.71       142
  PESTCI4KNA       0.70      0.54      0.61        48
  PESTCI4SUG       0.12      0.06      0.08        78
  PESTCI6CBQ       0.53      0.85      0.65        39
  XESTCI9VN2       0.91      0.96      0.93       116
  XESTCIB26U       0.40      0.24      0.30        49
  XESTCIB26W       0.26      0.18      0.21        78
  XESTCIB2R4       0.20      0.10      0.14        77
  XESTCIB4FU       0.51      0.70      0.59

**Kernel principal component analysis (KPCA)**

In [16]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=10, metric="minkowski", p=2)
classifier.fit(X_trainK,y_train)
y_pred = classifier.predict(X_testK)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.17812269031781228

               precision    recall  f1-score   support

  LESTCI4FWU       0.19      0.29      0.23        42
  PESTCI4FLK       0.14      0.15      0.14        55
  PESTCI4FN8       0.07      0.06      0.06        86
  PESTCI4FNG       0.22      0.33      0.26       264
  PESTCI4FXQ       0.21      0.26      0.23       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.10      0.05      0.07        91
  PESTCI4IGS       0.11      0.10      0.11       106
  PESTCI4IJM       0.12      0.10      0.11       142
  PESTCI4KNA       0.08      0.04      0.05        48
  PESTCI4SUG       0.09      0.03      0.04        78
  PESTCI6CBQ       0.16      0.18      0.17        39
  XESTCI9VN2       0.10      0.05      0.07       116
  XESTCIB26U       0.16      0.12      0.14        49
  XESTCIB26W       0.05      0.03      0.03        78
  XESTCIB2R4       0.09      0.04      0.05        77
  XESTCIB4FU       0.25      0.35      0.2

The highest accuracy is achieved by transforming the data by means of a PCA (**Accuracy: 0.5717**). K-NN relies on the distances between observations to predict and categorise.
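
The four K-NN cells repeat the same fit, predict and score steps on different versions of the data. One possible way to compare the variants in a single loop is sketched below, on a synthetic stand-in dataset (`make_classification`); the data, the dictionary `variants` and the variable names are assumptions for illustration, not the notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the booking data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=5,
                                     n_classes=3, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Fit each transformation on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(Xtr)
pca_demo = PCA(n_components=2).fit(Xtr)
variants = {
    "raw": (Xtr, Xte),
    "standardised": (scaler.transform(Xtr), scaler.transform(Xte)),
    "PCA": (pca_demo.transform(Xtr), pca_demo.transform(Xte)),
}

# Same classifier settings for every variant, so only the data changes.
results = {}
for name, (train_part, test_part) in variants.items():
    model = KNeighborsClassifier(n_neighbors=10)
    model.fit(train_part, ytr)
    results[name] = accuracy_score(yte, model.predict(test_part))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```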


### Naive Bayes

In [17]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.2058388765705839

               precision    recall  f1-score   support

  LESTCI4FWU       0.67      0.10      0.17        42
  PESTCI4FLK       0.30      0.91      0.45        55
  PESTCI4FN8       0.12      0.26      0.16        86
  PESTCI4FNG       0.00      0.00      0.00       264
  PESTCI4FXQ       0.00      0.00      0.00       183
  PESTCI4HYM       0.08      0.85      0.15        93
  PESTCI4HYS       0.82      0.84      0.83        91
  PESTCI4IGS       0.00      0.00      0.00       106
  PESTCI4IJM       0.00      0.00      0.00       142
  PESTCI4KNA       0.16      0.21      0.18        48
  PESTCI4SUG       0.17      0.01      0.02        78
  PESTCI6CBQ       0.97      0.92      0.95        39
  XESTCI9VN2       0.84      0.80      0.82       116
  XESTCIB26U       0.26      0.22      0.24        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.17      0.01      0.02        77
  XESTCIB4FU       0.67      0.04      0.08

On the untransformed data, Gaussian Naive Bayes reaches an **accuracy** of only 0.2058, well below the K-NN results above. Without any treatment of the data, this model does not give us enough confidence.

**Standardising our features**

In [18]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_trainS, y_train)
y_pred = classifier.predict(X_testS)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.19549150036954915

               precision    recall  f1-score   support

  LESTCI4FWU       0.50      0.05      0.09        42
  PESTCI4FLK       0.30      0.91      0.45        55
  PESTCI4FN8       0.11      0.26      0.15        86
  PESTCI4FNG       0.00      0.00      0.00       264
  PESTCI4FXQ       0.00      0.00      0.00       183
  PESTCI4HYM       0.08      0.85      0.15        93
  PESTCI4HYS       0.82      0.84      0.83        91
  PESTCI4IGS       0.00      0.00      0.00       106
  PESTCI4IJM       0.00      0.00      0.00       142
  PESTCI4KNA       0.14      0.21      0.17        48
  PESTCI4SUG       0.20      0.01      0.02        78
  PESTCI6CBQ       0.97      0.92      0.95        39
  XESTCI9VN2       0.84      0.80      0.82       116
  XESTCIB26U       0.26      0.22      0.24        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.17      0.01      0.02        77
  XESTCIB4FU       1.00      0.03      0.0

**Principal Component Analysis (PCA)**

In [19]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_trainP, y_train)
y_pred = classifier.predict(X_testP)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.33739837398373984

               precision    recall  f1-score   support

  LESTCI4FWU       0.09      0.02      0.04        42
  PESTCI4FLK       0.33      0.93      0.49        55
  PESTCI4FN8       0.00      0.00      0.00        86
  PESTCI4FNG       0.17      0.02      0.03       264
  PESTCI4FXQ       0.23      0.91      0.36       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.46      0.93      0.61        91
  PESTCI4IGS       0.24      0.04      0.07       106
  PESTCI4IJM       0.50      0.25      0.34       142
  PESTCI4KNA       0.67      0.04      0.08        48
  PESTCI4SUG       0.00      0.00      0.00        78
  PESTCI6CBQ       0.51      0.87      0.64        39
  XESTCI9VN2       0.20      0.01      0.02       116
  XESTCIB26U       0.00      0.00      0.00        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.42      0.99      0.5

**Kernel principal component analysis (KPCA)**

In [20]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_trainK, y_train)
y_pred = classifier.predict(X_testK)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.13710273466371029

               precision    recall  f1-score   support

  LESTCI4FWU       0.00      0.00      0.00        42
  PESTCI4FLK       0.02      0.07      0.03        55
  PESTCI4FN8       0.00      0.00      0.00        86
  PESTCI4FNG       0.32      0.16      0.21       264
  PESTCI4FXQ       0.35      0.16      0.22       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.09      0.56      0.16        91
  PESTCI4IGS       0.67      0.02      0.04       106
  PESTCI4IJM       0.06      0.06      0.06       142
  PESTCI4KNA       0.00      0.00      0.00        48
  PESTCI4SUG       0.00      0.00      0.00        78
  PESTCI6CBQ       0.04      0.03      0.03        39
  XESTCI9VN2       0.08      0.47      0.13       116
  XESTCIB26U       0.00      0.00      0.00        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.18      0.41      0.2

This algorithm does not work well: it gives a very low accuracy in all cases. The best result is obtained after transforming the data with a PCA (**Accuracy: 0.3374**). Naive Bayes works with the probability of each observation belonging to each class.
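
The class-membership probabilities can be inspected directly with `predict_proba`. The sketch below uses a made-up one-feature dataset with two hypothetical classes; it shows that each row of probabilities sums to 1 and that `predict` returns the class with the highest probability.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy one-dimensional data (made-up values) with two clearly separated classes.
X_toy = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7]])
y_toy = np.array(["low", "low", "low", "high", "high", "high"])

nb = GaussianNB().fit(X_toy, y_toy)

x_new = np.array([[1.1], [4.9]])
proba = nb.predict_proba(x_new)   # one column per class, in the order of nb.classes_

print(nb.classes_)
print(proba.round(3))
print(nb.predict(x_new))          # class with the largest posterior probability
```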

### Support Vector Machines

In [21]:
from sklearn.svm import SVC
svc=SVC(kernel='poly', gamma='scale')
svc.fit(X_train, y_train)
y_pred=svc.predict(X_test)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.44124168514412415

               precision    recall  f1-score   support

  LESTCI4FWU       0.00      0.00      0.00        42
  PESTCI4FLK       0.49      0.53      0.51        55
  PESTCI4FN8       0.00      0.00      0.00        86
  PESTCI4FNG       0.28      0.74      0.41       264
  PESTCI4FXQ       0.33      0.02      0.03       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.99      0.91      0.95        91
  PESTCI4IGS       0.40      0.28      0.33       106
  PESTCI4IJM       0.87      0.68      0.76       142
  PESTCI4KNA       0.42      0.31      0.36        48
  PESTCI4SUG       0.00      0.00      0.00        78
  PESTCI6CBQ       0.95      0.46      0.62        39
  XESTCI9VN2       0.93      0.97      0.95       116
  XESTCIB26U       0.00      0.00      0.00        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.45      0.96      0.6

**Standardising our features**

In [22]:
from sklearn.svm import SVC
svc=SVC(kernel='poly', gamma='scale')
svc.fit(X_trainS, y_train)
y_pred=svc.predict(X_testS)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.4109386548410939

               precision    recall  f1-score   support

  LESTCI4FWU       0.67      0.52      0.59        42
  PESTCI4FLK       0.57      0.15      0.23        55
  PESTCI4FN8       0.31      0.06      0.10        86
  PESTCI4FNG       0.20      0.68      0.31       264
  PESTCI4FXQ       0.32      0.25      0.28       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.99      0.93      0.96        91
  PESTCI4IGS       0.33      0.18      0.23       106
  PESTCI4IJM       0.83      0.61      0.70       142
  PESTCI4KNA       0.52      0.33      0.41        48
  PESTCI4SUG       0.12      0.01      0.02        78
  PESTCI6CBQ       0.00      0.00      0.00        39
  XESTCI9VN2       0.96      0.96      0.96       116
  XESTCIB26U       0.33      0.20      0.25        49
  XESTCIB26W       0.20      0.01      0.02        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.53      0.79      0.63

**Principal Component Analysis (PCA)**

In [23]:
from sklearn.svm import SVC
svc=SVC(kernel='poly', gamma='scale')
svc.fit(X_trainP, y_train)
y_pred=svc.predict(X_testP)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.3625277161862528

               precision    recall  f1-score   support

  LESTCI4FWU       0.00      0.00      0.00        42
  PESTCI4FLK       0.33      0.02      0.03        55
  PESTCI4FN8       0.00      0.00      0.00        86
  PESTCI4FNG       0.18      0.92      0.30       264
  PESTCI4FXQ       0.00      0.00      0.00       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.98      0.91      0.94        91
  PESTCI4IGS       0.50      0.01      0.02       106
  PESTCI4IJM       0.52      0.68      0.59       142
  PESTCI4KNA       0.50      0.04      0.08        48
  PESTCI4SUG       0.00      0.00      0.00        78
  PESTCI6CBQ       0.00      0.00      0.00        39
  XESTCI9VN2       0.93      0.96      0.94       116
  XESTCIB26U       0.00      0.00      0.00        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.38      1.00      0.55

**Kernel principal component analysis (KPCA)**


In [24]:
from sklearn.svm import SVC
svc=SVC(kernel='poly', gamma='scale')
svc.fit(X_trainK, y_train)
y_pred=svc.predict(X_testK)
print('\nAccuracy score:', accuracy_score(y_test, y_pred) )
print('\n',classification_report(y_test, y_pred))


Accuracy score: 0.13636363636363635

               precision    recall  f1-score   support

  LESTCI4FWU       0.00      0.00      0.00        42
  PESTCI4FLK       0.00      0.00      0.00        55
  PESTCI4FN8       0.00      0.00      0.00        86
  PESTCI4FNG       0.26      0.24      0.25       264
  PESTCI4FXQ       0.47      0.08      0.13       183
  PESTCI4HYM       0.00      0.00      0.00        93
  PESTCI4HYS       0.00      0.00      0.00        91
  PESTCI4IGS       0.00      0.00      0.00       106
  PESTCI4IJM       0.00      0.00      0.00       142
  PESTCI4KNA       0.00      0.00      0.00        48
  PESTCI4SUG       0.00      0.00      0.00        78
  PESTCI6CBQ       0.00      0.00      0.00        39
  XESTCI9VN2       0.00      0.00      0.00       116
  XESTCIB26U       0.00      0.00      0.00        49
  XESTCIB26W       0.00      0.00      0.00        78
  XESTCIB2R4       0.00      0.00      0.00        77
  XESTCIB4FU       0.10      1.00      0.1

For the SVC, the best accuracy (0.4412) is obtained on the data without any transformation. SVC works with hyperplanes, trying to find the one that best separates the different instances.
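
To make the notion of a separating hyperplane explicit, the sketch below swaps in a linear kernel (the notebook uses `kernel='poly'`) on a made-up two-class dataset. With a linear kernel the hyperplane is w·x + b = 0, and the sign of w·x + b decides the predicted class.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters (made-up values).
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm_demo = SVC(kernel="linear").fit(X_toy, y_toy)

# For a linear kernel the fitted hyperplane is exposed as coef_ (w) and intercept_ (b).
w, b = svm_demo.coef_[0], svm_demo.intercept_[0]

x_new = np.array([[0.5, 0.5], [4.5, 4.5]])
scores = x_new @ w + b                               # signed score relative to the hyperplane

print(scores, svm_demo.decision_function(x_new))     # the two should agree
print(svm_demo.predict(x_new))                       # negative score: class 0, positive: class 1
```

With non-linear kernels such as `poly` or `rbf`, the hyperplane lives in an implicit feature space and `coef_` is not available, but the decision rule follows the same principle.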

# CONCLUSIONS:

None of these three algorithms reaches an accuracy high enough to be considered a suitable model for our data. We applied several transformations to the features before fitting the models, but these did not make the results good enough either. The main reason lies in the assumptions on which each model is built.

**Naive Bayes** is a probabilistic classifier based on Bayes' theorem; it uses probabilities to predict group membership or class assignment. In our case, we attribute its poor performance to the weight of the feature **i_avg_sales**: as it is more important than the rest, it becomes the determining factor in the class assignment.

**K-Nearest Neighbors** is based on the calculation of distances; there is no real training phase. Because of this, it is advisable to apply a **PCA** before using the algorithm. Doing so on our data gives an accuracy of 0.57, the best result in this notebook but still too low. We therefore conclude that this algorithm is not valid for our data.


**Support Vector Machines** seek the hyperplane that best separates the different instances, possibly a non-linear one through the use of kernels. They are mostly used for non-text and non-image classification. With our data they have not worked: the maximum accuracy we obtained is 0.44. We could not find hyperplanes separating our classes.

As these models do not work, we will rely on tree-based models (decision trees, random forest, extra trees, ensembles...).




