# MBIT School

## Executive Master en Data Science (2020-2021)

*Nuria Espadas*  
*Mireia Vecino*  
*Tomeu Mir*  

### Notebook: model_validator

This notebook loads a pickled model and the validation dataset, predicts the class of each row and evaluates those predictions. Finally, it predicts the probability of every class and simulates a "recommendation" by listing the top 5 classes in descending order of probability.



In [2]:
import pandas as pd
import numpy as np
import pickle

In [3]:
# default path for loading the objects
DATA_PATH = '../data/'
MODEL_PATH = '../models/'
model_filename = 'et_turbo_0.8792.pkl'

## Load model and datasets

In [4]:
# Load the model (the context manager closes the file after reading)
with open(MODEL_PATH + model_filename, 'rb') as f:
    model = pickle.load(f)

In [5]:
# Load df_ranking, which provides the code and name of each excursion
df_ranking = pd.read_csv(DATA_PATH+'df_ranking.csv', index_col=False)
df_ranking.drop("Unnamed: 0",axis=1,inplace=True)
df_ranking

|   | STOCK_CODE | STOCK_NAME | total |
|---|---|---|---|
| 0 | PESTCI4FNG | Royal Delfin | 1286 |
| 1 | XESTCIB4FU | Teide National Park | 1207 |
| 2 | XESTCIBO3U | Loro Express Exclusive | 1119 |
| 3 | PESTCI4FXQ | Mts. Inselrundfahrt - Tour de Ile | 1008 |
| 4 | XESTCIBCSG | Teide Masca (Grand Tour) | 934 |
| 5 | XESTCIBPNI | Freebird (3H Vip Exclusive) | 747 |
| 6 | XESTCIBSBI | Mts. Teide South | 747 |
| 7 | PESTCI4IJM | Twin Tickets | 657 |
| 8 | PESTCI4IGS | Teleferico / Cable Car | 576 |
| 9 | XESTCI9VN2 | Teide By Night And Romantic Tour Only For Adults | 530 |


In [6]:
# Load the validation dataset (reminder: it was used for neither training nor testing)
df_prod = pd.read_csv(DATA_PATH + 'df_prod.csv')
# Columns to exclude from the features passed to the model when predicting
cols_to_exclude = ['Unnamed: 0',
                   'I_TOTAL_SALES_SC',
                   'SOURCE_COUNTRY_CODE',
                   'I_BOOKINGDATE',
                   'I_STARTDATE',
                   'dir',
                   'presMax'
                  ]
df_prod.drop(cols_to_exclude, axis=1, inplace=True)
# Summary of the remaining columns
df_prod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3866 entries, 0 to 3865
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   I_DAYSBEFOREBOOK     3866 non-null   int64  
 1   STOCK_CODE           3866 non-null   object 
 2   STOCK_NAME           3866 non-null   object 
 3   ADT                  3866 non-null   int64  
 4   CHD                  3866 non-null   int64  
 5   INF                  3866 non-null   int64  
 6   LEAD_PAX_AGE         3866 non-null   float64
 7   prec                 3866 non-null   float64
 8   tmax                 3866 non-null   float64
 9   velmedia             3866 non-null   float64
 10  sol                  3866 non-null   float64
 11  i_booking_dayofweek  3866 non-null   int64  
 12  i_start_dayofweek    3866 non-null   int64  
 13  i_avg_sales          3866 non-null   float64
dtypes: float64(6), int64(6), object(2)
memory usage: 423.0+ KB


## Predict

In [7]:
# Predict the class (excursion code) for each row
target = ['STOCK_CODE', 'STOCK_NAME']
# Features only: drop the target columns before predicting
pred = model.predict(df_prod.drop(target, axis=1))
print(len(pred))

3866


In [30]:
pred

array(['XESTCIBO3U', 'PESTCI4FXQ', 'XESTCIB4FU', ..., 'XESTCIBPNI',
       'PESTCI4FXQ', 'PESTCI4FXQ'], dtype=object)

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Accuracy of the predictions
print('\nAccuracy score in validation dataset:', accuracy_score(df_prod['STOCK_CODE'], pred) )
print('\n',classification_report(df_prod['STOCK_CODE'], pred))


Accuracy score in validation dataset: 0.8792033109156752

               precision    recall  f1-score   support

  LESTCI4FWU       0.97      0.87      0.92        78
  PESTCI4FLK       0.98      1.00      0.99        62
  PESTCI4FN8       0.90      0.85      0.88       117
  PESTCI4FNG       0.93      0.94      0.94       395
  PESTCI4FXQ       0.84      0.92      0.88       271
  PESTCI4HYM       0.80      0.62      0.70       139
  PESTCI4HYS       0.99      0.97      0.98       116
  PESTCI4IGS       0.90      0.87      0.88       159
  PESTCI4IJM       0.98      0.93      0.95       201
  PESTCI4KNA       0.93      0.83      0.88        78
  PESTCI4SUG       0.87      0.46      0.60       101
  PESTCI6CBQ       0.97      1.00      0.98        65
  XESTCI9VN2       0.97      1.00      0.99       152
  XESTCIB26U       0.81      0.88      0.85        69
  XESTCIB26W       0.79      0.77      0.78       104
  XESTCIB2R4       0.72      0.53      0.61       134
  XESTCIB4FU       0.

In [35]:
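from sklearn.metrics import confusion_matrix
# Show every column of wide DataFrames
pd.set_option('display.max_columns', None)

# Rows are true classes and columns are predicted classes, in sorted label order
cm = pd.DataFrame(confusion_matrix(df_prod['STOCK_CODE'], pred))
# Append the column totals as a last row, then the row totals as a last column
cm.loc["Total"] = cm.sum()
cm["Total"] = cm.sum(axis=1)
cm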

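|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 78 |
| 1 | 0 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 62 |
| 2 | 0 | 0 | 100 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 117 |
| 3 | 0 | 0 | 1 | 373 | 0 | 0 | 0 | 1 | 0 | 4 | 0 | 0 | 0 | 3 | 1 | 3 | 4 | 1 | 0 | 0 | 3 | 0 | 1 | 0 | 395 |
| 4 | 0 | 0 | 0 | 0 | 249 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 271 |
| 5 | 0 | 0 | 1 | 1 | 45 | 86 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 139 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 112 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 116 |
| 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 138 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 12 | 0 | 0 | 2 | 0 | 0 | 1 | 4 | 159 |
| 8 | 0 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 187 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 4 | 2 | 0 | 0 | 201 |
| 9 | 0 | 1 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 65 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 78 |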
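
The row totals of the confusion matrix equal the `support` column of the classification report, so the per-class recall can be recovered as the diagonal count divided by the row total. The following is a minimal sketch, assuming that the numeric indices follow the sorted label order that `confusion_matrix` uses when no `labels` argument is passed. The counts are copied from the ten rows displayed above; the variable names are illustrative only.

```python
# (correct predictions on the diagonal, row total) for the ten displayed rows,
# keyed by the class code that the row index is assumed to stand for.
displayed_rows = {
    "LESTCI4FWU": (68, 78),
    "PESTCI4FLK": (62, 62),
    "PESTCI4FN8": (100, 117),
    "PESTCI4FNG": (373, 395),
    "PESTCI4FXQ": (249, 271),
    "PESTCI4HYM": (86, 139),
    "PESTCI4HYS": (112, 116),
    "PESTCI4IGS": (138, 159),
    "PESTCI4IJM": (187, 201),
    "PESTCI4KNA": (65, 78),
}

# Recall = rows of the class predicted correctly / all true rows of the class.
recall = {code: correct / total
          for code, (correct, total) in displayed_rows.items()}

for code, value in recall.items():
    print(f"{code}  recall={value:.2f}")
```

Under the same assumption, most of the missed `PESTCI4HYM` rows (45 of the 53 misclassified) fall in column 4, that is, they were predicted as `PESTCI4FXQ`.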


## Predicting probabilities

In [31]:
# Predict the probability of each class for every row
pred_proba = model.predict_proba(df_prod.drop(target, axis=1))
print(len(pred_proba))

3866


In [32]:
pred_proba

array([[0.   , 0.   , 0.   , ..., 0.002, 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.002, 0.   , 0.   ],
       [0.014, 0.   , 0.   , ..., 0.   , 0.006, 0.09 ],
       ...,
       [0.   , 0.   , 0.004, ..., 0.434, 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , ..., 0.002, 0.   , 0.   ]])
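
Because the notebook ends up recommending the 5 most probable classes, it may be useful to measure how often the true class appears among the top k, in addition to plain accuracy. The following is a minimal sketch on toy data; the function name `top_k_hit_rate` and the toy arrays are hypothetical and are not part of the original notebook.

```python
import numpy as np

def top_k_hit_rate(proba, classes, y_true, k=5):
    """Fraction of rows whose true label is among the k most probable classes."""
    proba = np.asarray(proba)
    classes = np.asarray(classes)
    y_true = np.asarray(y_true)
    # Column indices of the k largest probabilities in each row.
    top_idx = np.argsort(-proba, axis=1)[:, :k]
    # Map the indices to labels and check whether the true label is among them.
    hits = (classes[top_idx] == y_true[:, None]).any(axis=1)
    return hits.mean()

# Toy data (hypothetical): three classes, four rows.
toy_classes = ["A", "B", "C"]
toy_proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
]
toy_true = ["A", "B", "C", "A"]

print(top_k_hit_rate(toy_proba, toy_classes, toy_true, k=1))
print(top_k_hit_rate(toy_proba, toy_classes, toy_true, k=2))
```

With the notebook's objects the call would look like `top_k_hit_rate(pred_proba, model.classes_, df_prod['STOCK_CODE'], k=5)`. Ties between equal probabilities are resolved in whatever order `argsort` returns them.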

### Listing items based on predicted probabilities

In [61]:
def recommendator(rownum):
    """Print the actual excursion, the predicted one and the top 5 by probability."""
    print('Excursion on the validation dataset is:')
    print(df_prod[['STOCK_CODE', 'STOCK_NAME']].loc[rownum])
    print()
    print('my_predicted_exc:', pred[rownum])

    # One row per class, sorted from most to least probable
    reco = pd.DataFrame({'STOCK_CODE': list(model.classes_),
                         'proba': pred_proba[rownum]}).sort_values('proba', ascending=False)
    # Get the excursion name by merging with df_ranking on the excursion code
    reco = pd.merge(reco, df_ranking, on='STOCK_CODE')
    print('\nmy recommendation list : ')
    print(reco[:5])  # Top 5

In [62]:
recommendator(0)

Excursion on the validation dataset is:
STOCK_CODE                XESTCIBO3U
STOCK_NAME    Loro Express Exclusive
Name: 0, dtype: object

my_predicted_exc: XESTCIBO3U

my recommendation list : 
   STOCK_CODE  proba                   STOCK_NAME  total
0  XESTCIBO3U  0.836       Loro Express Exclusive   1119
1  PESTCI4SUG  0.160          Loro Parque Express    352
2  XESTCIBPNI  0.002  Freebird (3H Vip Exclusive)    747
3  PESTCI4HYM  0.002  Jeep Safari Teide Masca Sur    453
4  LESTCI4FWU  0.000       Loro Parque (Entrance)    231


In [63]:
recommendator(101)

Excursion on the validation dataset is:
STOCK_CODE                  XESTCIBCSG
STOCK_NAME    Teide Masca (Grand Tour)
Name: 101, dtype: object

my_predicted_exc: XESTCIBCSG

my recommendation list : 
   STOCK_CODE  proba                STOCK_NAME  total
0  XESTCIBCSG  0.998  Teide Masca (Grand Tour)    934
1  XESTCIBO3U  0.002    Loro Express Exclusive   1119
2  LESTCI4FWU  0.000    Loro Parque (Entrance)    231
3  PESTCI4FLK  0.000            La Gomera Tour    243
4  XESTCIBSBI  0.000          Mts. Teide South    747
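
`recommendator` prints its result, which makes the list hard to reuse elsewhere. The following is a minimal sketch of a variant that returns the top n classes as a DataFrame instead. The function name `top_n_recommendations`, its parameters and the toy codes and names are hypothetical. It also swaps the default inner merge for a left merge, an assumption made here so that a class without an entry in the names table is kept rather than dropped.

```python
import pandas as pd

def top_n_recommendations(proba_row, classes, names, n=5):
    """Return the n most probable classes for one row as a DataFrame."""
    reco = pd.DataFrame({"STOCK_CODE": list(classes),
                         "proba": list(proba_row)})
    # Keep the n classes with the highest probability, most probable first.
    reco = reco.sort_values("proba", ascending=False).head(n)
    # Left join: every candidate survives even if `names` has no row for it.
    reco = reco.merge(names, on="STOCK_CODE", how="left")
    return reco.reset_index(drop=True)

# Toy inputs (hypothetical codes and names, not taken from the datasets).
toy_classes = ["EXC_A", "EXC_B", "EXC_C"]
toy_names = pd.DataFrame({
    "STOCK_CODE": ["EXC_A", "EXC_B"],
    "STOCK_NAME": ["Excursion A", "Excursion B"],
})

top = top_n_recommendations([0.1, 0.3, 0.6], toy_classes, toy_names, n=2)
print(top)
```

With the notebook's objects the call would look like `top_n_recommendations(pred_proba[0], model.classes_, df_ranking)`.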
