In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
plt.rcParams['figure.figsize'] = [20, 20]
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [3]:
#from google.colab import drive
#drive.mount('/content/drive')

In [2]:
db = pd.read_csv('online_shoppers_intention.csv')

**Dataset analysis**

In [3]:
print('# samples in the negative class: ',db[db["Revenue"]==False].shape[0])
print('# samples in the positive class: ',db[db["Revenue"]==True].shape[0])

# samples in the negative class:  10422
# samples in the positive class:  1908


As the counts show, the dataset is heavily imbalanced (roughly 5.5 negative samples for every positive one), so a balancing strategy is required.
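To make the imbalance concrete, the following minimal sketch derives a few ratios from the two class counts printed above. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the output above.
n_negative = 10422
n_positive = 1908
total = n_negative + n_positive

positive_share = n_positive / total
imbalance_ratio = n_negative / n_positive

print(f"positive share: {positive_share:.1%}")
print(f"negatives per positive: {imbalance_ratio:.2f}")

# Accuracy of a trivial classifier that always predicts the negative class.
majority_baseline_accuracy = n_negative / total
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

The last figure is a useful reference point: any model whose accuracy is close to it may be doing little more than predicting the majority class.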

In [4]:
if db.isnull().values.any():
  print("the dataset has missing values")
else:
  print("the dataset has no missing values")

the dataset has no missing values


The result above shows that the dataset is complete: no column has missing values, so no imputation technique is needed.
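When a dataset does contain gaps, a per-column count is usually more informative than a single yes/no check. The sketch below uses a small made-up frame (the values are hypothetical; only the column names are borrowed from the dataset) to show one way to obtain it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "BounceRates": [0.2, np.nan, 0.05],
    "ExitRates": [0.2, 0.1, 0.14],
    "Month": ["Feb", "Mar", None],
})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Show only the columns that actually have gaps.
print(missing_per_column[missing_per_column > 0])
```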

In [5]:

print('Predictor variables\n')
for x in db.columns[0:-1]:
  print('* '+x)
print('\nTarget variable\n')
print('* '+db.columns[-1])

Predictor variables

* Administrative
* Administrative_Duration
* Informational
* Informational_Duration
* ProductRelated
* ProductRelated_Duration
* BounceRates
* ExitRates
* PageValues
* SpecialDay
* Month
* OperatingSystems
* Browser
* Region
* TrafficType
* VisitorType
* Weekend

Target variable

* Revenue


From the problem description we know that "Month", "OperatingSystems", "Browser", "Region", "TrafficType", "VisitorType", "Weekend" and "Revenue" (the target variable) are categorical, and the remaining variables are numeric. Since "Revenue" is binary, this is a **binary classification** problem. The other categorical variables need to be encoded.

**Dataset transformation**


Separating the predictor columns from the target column

In [11]:
X=db.iloc[:,0:-1]
Y=db.iloc[:,-1]
X.shape[1]

17

Encoding the categorical variables as dummy (one-hot) indicator columns

In [12]:
columnas_categoricas=['Month','OperatingSystems','Browser','Region','TrafficType','VisitorType','Weekend']
columnas_numericas=[x for x in X.columns if x not in columnas_categoricas]
# Start from the numeric columns and append the indicator columns of each categorical variable
X_result=X[columnas_numericas]

for columna in columnas_categoricas:
  # dtype=int yields 0/1 columns (recent pandas versions default to booleans)
  dummies=pd.get_dummies(X[columna], prefix = columna, dtype=int)
  X_result = pd.concat([X_result, dummies], axis=1)
X=X_result
X.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Weekend_False,Weekend_True
0,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
1,0,0.0,0,0.0,2,64.0,0.0,0.1,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
2,0,0.0,0,0.0,1,0.0,0.2,0.2,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
3,0,0.0,0,0.0,2,2.666667,0.05,0.14,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
4,0,0.0,0,0.0,10,627.5,0.02,0.05,0.0,0.0,...,0,0,0,0,0,0,0,1,0,1
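The loop above creates one indicator column per category, as the `Weekend_False` / `Weekend_True` pair in the output shows. The sketch below contrasts this with reference-level coding via `drop_first=True`, which drops one redundant column per variable. The toy frame is hypothetical and only reuses two column names from the dataset.

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"],
    "Weekend": [False, False, True, True],
})

# One indicator column per category (what the loop above produces).
one_hot = pd.get_dummies(toy, columns=["VisitorType", "Weekend"], dtype=int)

# Reference-level coding: the first category of each variable is dropped.
dummy = pd.get_dummies(toy, columns=["VisitorType", "Weekend"], dtype=int, drop_first=True)

print(one_hot.columns.tolist())
print(dummy.columns.tolist())
```

Dropping the first level mostly matters for linear models, where fully redundant columns can make the coefficients harder to interpret. Tree-based models and KNN are generally less sensitive to the choice.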



In [14]:
x = X.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
X = pd.DataFrame(x_scaled, columns=X.columns)
X.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,...,TrafficType_16,TrafficType_17,TrafficType_18,TrafficType_19,TrafficType_20,VisitorType_New_Visitor,VisitorType_Other,VisitorType_Returning_Visitor,Weekend_False,Weekend_True
0,0.0,0.0,0.0,0.0,0.001418,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.002837,0.001,0.0,0.5,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.001418,0.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.002837,4.2e-05,0.25,0.7,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.014184,0.009809,0.1,0.25,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


After encoding, the number of predictor variables grew from 17 to 75.
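The min-max scaling above is fitted on the whole dataset, so the minima and maxima of rows that later end up in the test split also shape the transformation. A common alternative, sketched below on synthetic stand-in data (the array shapes, seed, and `_demo` names are assumptions for illustration), is to fit the scaler on the training split only and reuse it on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; shapes and values are assumptions for illustration.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# stratify keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training min/max

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

With this approach, scaled test values may fall slightly outside the [0, 1] range, which is expected.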

**Balancing the data**

Due to hardware limitations, we use **undersampling** to randomly remove samples from the majority class, aware that this may discard important information.

In [6]:
import random
numero_muestras_valance=2000
indices_clase_negativa=db[db["Revenue"]==False].index
# Randomly choose which majority-class rows to keep (a set makes the membership test below fast)
indices_elegidos=set(random.sample(list(indices_clase_negativa), numero_muestras_valance))
# Drop every other majority-class row
indices_a_eliminar=[i for i in indices_clase_negativa if i not in indices_elegidos]
db=db.drop(indices_a_eliminar)
print('# samples in the negative class: ',db[db["Revenue"]==False].shape[0])
print('# samples in the positive class: ',db[db["Revenue"]==True].shape[0])


# samples in the negative class:  2000
# samples in the positive class:  1908
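Because the feature matrix and the labels are both sliced from `db`, undersampling only takes effect if they are built after the rows have been dropped. The sketch below shows one possible way to undersample reproducibly with pandas and derive features and labels from the same balanced frame. The toy data, `n_keep`, and the `_bal` names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 8 negatives, 2 positives.
toy = pd.DataFrame({
    "PageValues": range(10),
    "Revenue": [False] * 8 + [True] * 2,
})

n_keep = 3  # assumed number of majority-class rows to keep
negatives = toy[toy["Revenue"] == False].sample(n=n_keep, random_state=42)
positives = toy[toy["Revenue"] == True]

# Shuffle so the classes are interleaved, then rebuild X and y from the same rows.
balanced = pd.concat([negatives, positives]).sample(frac=1, random_state=42)
X_bal = balanced.drop(columns="Revenue")
y_bal = balanced["Revenue"]

print(y_bal.value_counts())
```

Fixing `random_state` makes the retained subset repeatable across runs, which the `random.sample` call above does not guarantee unless a seed is set.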


**Training**

For training we use scikit-learn's GridSearchCV class to evaluate each model over a grid of hyperparameters and select the best configuration of each. We also set CV=5, which for classifiers makes GridSearchCV use stratified cross-validation (Stratified K-Folds cross-validator) with 5 folds.
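As a minimal sketch of this setup, the example below runs a grid search with an explicit `StratifiedKFold` splitter on synthetic data. The data, the grid, and the `_demo` names are assumptions for illustration and are not the notebook's dataset or final parameter grids.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (an assumption, not the notebook's dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=0
)

# For a classifier, cv=5 behaves like an unshuffled StratifiedKFold with 5 splits.
cv = StratifiedKFold(n_splits=5)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=cv,
    return_train_score=True,  # also record training scores for each candidate
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Stratification keeps the class proportions roughly equal in every fold, which matters when one class is much rarer than the other.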


In [7]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix



In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

resultados_modelos_df = pd.DataFrame([], columns = ['Modelo', 'Mejores parametros','TP','FP','FN','TN','Sensibilidad','Especificidad', 'Eficiencia', 'Precision', 'Error']) 

CV=5

The models to train are:
* Naïve Bayes
* K-nearest neighbors
* Artificial neural networks
* Random Forest
* Support vector machines with a linear kernel and with an RBF kernel

To validate the models we use the metrics covered in class for binary classification:

* **TP**: true positives
* **FP**: false positives
* **FN**: false negatives
* **TN**: true negatives
* **Sensitivity** (Sensibilidad): proportion of positive samples correctly predicted as positive, out of all positive samples.
* **Specificity** (Especificidad): proportion of negative samples correctly predicted as negative, out of all negative samples.
* **Accuracy** (Eficiencia): proportion of correctly predicted samples out of the total number of samples.
* **Precision** (Precision): proportion of samples predicted as positive that are actually positive.
* **Error**: 1 - Accuracy
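
In terms of the confusion-matrix counts, the metrics computed by `obtener_resultados` can be written as:

$$
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{Accuracy} &= \frac{TP + TN}{TP + FP + FN + TN}, &
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Error} &= 1 - \text{Accuracy}.
\end{aligned}
$$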



The hyperparameters varied for each model are taken from the corresponding course labs.


In [11]:
def obtener_resultados(y_true, y_pred):
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  sensibilidad=tp/(tp+fn)
  especificidad=tn/(tn+fp)
  eficiencia=(tp+tn)/(tn+fp+fn+tp)
  precision=tp/(tp+fp)
  error=1-eficiencia
  return {
      'TP':tp,
      'FP':fp,
      'FN':fn,
      'TN':tn,
      'Sensibilidad':sensibilidad,
      'Especificidad':especificidad,
      'Eficiencia':eficiencia,
      'Precision':precision,
      'Error':error
  }


def filtrar_diccionario(dic,filtros):
    result={}
    for key in filtros:
        result[key]=dic[key]
    return result

def obtener_dataframe_resultados(parametros,modelo, model_name):
    global CV,X_train,y_train,X_test,y_test,resultados_modelos_df
    parametros_usados=list(parametros.keys())
    # return_train_score=True keeps mean_train_score/std_train_score in cv_results_
    clf = GridSearchCV(modelo, parametros, cv=CV, verbose=5, n_jobs=-1, return_train_score=True)
    clf.fit(X_train, y_train)
    # Evaluate the best configuration on the held-out test split
    y_pred = clf.best_estimator_.predict(X_test)
    model_results=obtener_resultados(y_test, y_pred)
    model_results['Modelo']=model_name
    model_results['Mejores parametros']=filtrar_diccionario(clf.best_params_,parametros_usados)
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    resultados_modelos_df = pd.concat([resultados_modelos_df, pd.DataFrame([model_results])], ignore_index=True)
    # Keep only the score and parameter columns of cv_results_
    filtros_resultado={}
    for k, v in clf.cv_results_.items():
        if 'time' not in k and 'split' not in k:
            filtros_resultado[k]=v
    df=pd.DataFrame(filtros_resultado)
    return df.sort_values(by="rank_test_score")
    


**Naïve Bayes**

In [90]:
parametros = {'var_smoothing':[1e-02,1e-04,1e-06,1e-08,1e-10,1e-12]}
GNB = GaussianNB()
resultados_df=obtener_dataframe_resultados(parametros,GNB,"Naïve Bayes")
resultados_df.head(resultados_df.shape[0])


Unnamed: 0,param_var_smoothing,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
2,1e-06,{'var_smoothing': 1e-06},0.856434,0.007216,1,0.856464,0.000916
3,1e-08,{'var_smoothing': 1e-08},0.838397,0.009459,2,0.838549,0.002727
1,0.0001,{'var_smoothing': 0.0001},0.835129,0.005774,3,0.835068,0.002257
0,0.01,{'var_smoothing': 0.01},0.830529,0.002903,4,0.830953,0.000966
4,1e-10,{'var_smoothing': 1e-10},0.593391,0.034438,5,0.597718,0.034183
5,1e-12,{'var_smoothing': 1e-12},0.365089,0.021028,6,0.367268,0.019533


**KNN**

In [91]:
parametros = {'n_neighbors':[1,2,3,4,5,6,7,100]}
KNN = KNeighborsClassifier()
resultados_df=obtener_dataframe_resultados(parametros,KNN,"KNN")
resultados_df.head(resultados_df.shape[0])

Unnamed: 0,param_n_neighbors,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
5,6,{'n_neighbors': 6},0.866602,0.004702,1,0.884033,0.001351
3,4,{'n_neighbors': 4},0.866481,0.003281,2,0.892689,0.001109
4,5,{'n_neighbors': 5},0.866118,0.00746,3,0.895564,0.00116
6,7,{'n_neighbors': 7},0.864786,0.006667,4,0.886606,0.001163
1,2,{'n_neighbors': 2},0.862002,0.005621,5,0.907215,0.000925
2,3,{'n_neighbors': 3},0.859702,0.006484,6,0.912722,0.000762
7,100,{'n_neighbors': 100},0.847234,0.000827,7,0.847173,0.000422
0,1,{'n_neighbors': 1},0.828108,0.005813,8,1.0,0.0


**Artificial Neural Networks**

In [93]:
parametros = {'hidden_layer_sizes':[(20,),(24,),(28,),(32,),(36,),(20,20,),(24,24,),(28,28,),(32,32,),(36,36,)], 'activation':['tanh','relu']}
MPL = MLPClassifier()
resultados_df=obtener_dataframe_resultados(parametros,MPL,"Redes Neuronales Artificiales")
resultados_df.head(resultados_df.shape[0])

Unnamed: 0,param_activation,param_hidden_layer_sizes,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
1,tanh,"(24,)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.890207,0.006914,1,0.891993,0.002419
3,tanh,"(32,)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.889844,0.003939,2,0.892446,0.002443
4,tanh,"(36,)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.888875,0.004755,3,0.893657,0.00278
2,tanh,"(28,)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.888391,0.007835,4,0.893536,0.002406
9,tanh,"(36, 36)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.888028,0.007281,5,0.889602,0.005092
0,tanh,"(20,)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.887907,0.006932,6,0.890026,0.004101
8,tanh,"(32, 32)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.887665,0.012335,7,0.890721,0.002487
5,tanh,"(20, 20)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.887665,0.00588,7,0.888028,0.003692
6,tanh,"(24, 24)","{'activation': 'tanh', 'hidden_layer_sizes': (...",0.884397,0.002901,9,0.88821,0.004902
19,relu,"(36, 36)","{'activation': 'relu', 'hidden_layer_sizes': (...",0.883186,0.005965,10,0.887453,0.003205


**Random Forest**

In [95]:
parametros = {'n_estimators':[5,10,20,50,100], 'max_features':[5,10,15,20,25,30]}
RFC = RandomForestClassifier()
resultados_df=obtener_dataframe_resultados(parametros,RFC,"Random Forest")
resultados_df.head(resultados_df.shape[0])

Unnamed: 0,param_max_features,param_n_estimators,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
13,15,50,"{'max_features': 15, 'n_estimators': 50}",0.905096,0.005687,1,0.999576,0.0001132194
14,15,100,"{'max_features': 15, 'n_estimators': 100}",0.904733,0.0058,2,0.99997,6.053269e-05
24,25,100,"{'max_features': 25, 'n_estimators': 100}",0.904733,0.008487,2,0.99997,6.052353e-05
19,20,100,"{'max_features': 20, 'n_estimators': 100}",0.903765,0.007315,4,0.99997,6.052353e-05
23,25,50,"{'max_features': 25, 'n_estimators': 50}",0.903523,0.007986,5,0.999516,0.0003503231
29,30,100,"{'max_features': 30, 'n_estimators': 100}",0.903523,0.007786,5,1.0,0.0
28,30,50,"{'max_features': 30, 'n_estimators': 50}",0.903159,0.007245,7,0.999788,0.0002264571
18,20,50,"{'max_features': 20, 'n_estimators': 50}",0.901828,0.008648,8,0.999788,0.0001815706
17,20,20,"{'max_features': 20, 'n_estimators': 20}",0.901586,0.00539,9,0.997276,0.0002872404
9,10,100,"{'max_features': 10, 'n_estimators': 100}",0.901586,0.006988,9,0.99997,6.053269e-05


**SVM RBF**

In [None]:
parametros = {'C':[0.001,0.010,0.100,1.000,10.000,100.000],'gamma':[0.00,0.01,0.10,1.00]}
SVM = svm.SVC(kernel='rbf')
resultados_df=obtener_dataframe_resultados(parametros,SVM,"SVM RBF")
resultados_df.head(resultados_df.shape[0])

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


**SVM Linear**

In [12]:
parametros = {'C':[0.001,0.010,0.100,1.000,10.000,100.000]}
linearSVM = LinearSVC()
resultados_df=obtener_dataframe_resultados(parametros,linearSVM,"SVM Linear")
resultados_df.head(resultados_df.shape[0])

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done  22 out of  30 | elapsed:    6.6s remaining:    2.4s
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    7.5s finished


Unnamed: 0,param_C,params,mean_test_score,std_test_score,rank_test_score,mean_train_score,std_train_score
1,0.01,{'C': 0.01},0.863455,0.033751,1,0.864242,0.031942
0,0.001,{'C': 0.001},0.862365,0.046089,2,0.861213,0.040328
3,1.0,{'C': 1.0},0.852681,0.017777,3,0.850956,0.013551
5,100.0,{'C': 100.0},0.805471,0.066691,4,0.804686,0.068564
4,10.0,{'C': 10.0},0.772788,0.104948,5,0.773335,0.106012
2,0.1,{'C': 0.1},0.719647,0.181332,6,0.722154,0.179625


**Results**

In [11]:
resultados_modelos_df.sort_values(by="Error").head(resultados_modelos_df.shape[0])

Unnamed: 0,Modelo,Mejores parametros,TP,FP,FN,TN,Sensibilidad,Especificidad,Eficiencia,Precision,Error
0,SVM Linear,{},123,22,510,3414,0.194313,0.993597,0.869255,0.848276,0.130745
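
The metrics in the row above can be reproduced by hand from its four counts. The sketch below does so and also computes F1 and balanced accuracy, two summary figures that are not part of the original table and are added here only for illustration.

```python
# Counts copied from the "SVM Linear" row of the results table.
tp, fp, fn, tn = 123, 22, 510, 3414

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)

# Two summary metrics that are not in the original table (added for illustration).
f1 = 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity={sensitivity:.6f} specificity={specificity:.6f}")
print(f"accuracy={accuracy:.6f} precision={precision:.6f}")
print(f"f1={f1:.3f} balanced_accuracy={balanced_accuracy:.3f}")
```

Accuracy looks high mainly because 3436 of the 4069 test samples are negative. Of the 633 positive samples only 123 are detected, which is what the low sensitivity reflects.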


**References**

1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

3. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

4. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold

5. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

6. https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

7. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

8. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC