# Machine learning problem - Online sales

## Variable descriptions

1. **Revenue**: the class label.
2. **Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated, and ProductRelated_Duration** give the number of pages of each type that the visitor viewed during the session and the total time spent on each of these page categories.
3. **BounceRates, ExitRates, and PageValues** are the metrics measured by Google Analytics for each page of the e-commerce site.
4. **SpecialDay** indicates how close the visit time is to a specific special day (for example, Mother's Day or Valentine's Day), when sessions are more likely to end in a transaction.

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

## Reading the data

In [6]:
XY = pd.read_csv("https://masterdatascience.s3.us-east-2.amazonaws.com/online_shoppers_intention.csv", sep=',')

In [7]:
XY

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0.0,0.0,0.0,0.0,1.0,0.000000,0.200000,0.200000,0.000000,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0.0,0.0,0.0,0.0,2.0,64.000000,0.000000,0.100000,0.000000,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0.0,-1.0,0.0,-1.0,1.0,-1.000000,0.200000,0.200000,0.000000,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0.0,0.0,0.0,0.0,2.0,2.666667,0.050000,0.140000,0.000000,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0.0,0.0,0.0,0.0,10.0,627.500000,0.020000,0.050000,0.000000,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12325,3.0,145.0,0.0,0.0,53.0,1783.791667,0.007143,0.029031,12.241717,0.0,Dec,4,6,1,1,Returning_Visitor,True,False
12326,0.0,0.0,0.0,0.0,5.0,465.750000,0.000000,0.021333,0.000000,0.0,Nov,3,2,1,8,Returning_Visitor,True,False
12327,0.0,0.0,0.0,0.0,6.0,184.250000,0.083333,0.086667,0.000000,0.0,Nov,3,2,1,13,Returning_Visitor,True,False
12328,4.0,75.0,0.0,0.0,15.0,346.000000,0.000000,0.021053,0.000000,0.0,Nov,2,2,3,11,Returning_Visitor,False,False


In [8]:
print(f'- Number of rows in the dataset: {XY.shape[0]}')
print(f'- Number of columns in the dataset: {XY.shape[1]}')
print(f'- Variable names: {list(XY.columns)}')

- Number of rows in the dataset: 12330
- Number of columns in the dataset: 18
- Variable names: ['Administrative', 'Administrative_Duration', 'Informational', 'Informational_Duration', 'ProductRelated', 'ProductRelated_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'VisitorType', 'Weekend', 'Revenue']


## Data preprocessing

In [11]:
XY.isnull().sum()

Administrative             14
Administrative_Duration    14
Informational              14
Informational_Duration     14
ProductRelated             14
ProductRelated_Duration    14
BounceRates                14
ExitRates                  14
PageValues                  0
SpecialDay                  0
Month                       0
OperatingSystems            0
Browser                     0
Region                      0
TrafficType                 0
VisitorType                 0
Weekend                     0
Revenue                     0
dtype: int64

In [19]:
# Drop rows that contain null values
XY = XY.dropna()
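
The null counts above are reported per column, so on their own they do not say how many rows `dropna()` removes. A minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration and is not taken from the dataset) shows one way to count the affected rows before dropping them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the same two rows are null in both columns
toy = pd.DataFrame({
    "Administrative": [0.0, np.nan, 3.0, np.nan],
    "BounceRates": [0.2, np.nan, 0.0, np.nan],
    "PageValues": [0.0, 1.5, 0.0, 2.0],
})

nulls_per_column = toy.isnull().sum()             # per-column null counts
rows_with_null = toy.isnull().any(axis=1).sum()   # rows with at least one null

cleaned = toy.dropna()  # default how='any': drops every row containing a null
rows_removed = len(toy) - len(cleaned)
print(nulls_per_column.to_dict(), rows_with_null, rows_removed)
```

In the notebook, the row count falling from 12330 to 12316 is consistent with the 14 nulls per column sitting in the same 14 rows.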

## Categorical to numeric

In [22]:
# List the non-numeric (categorical) columns
XY.select_dtypes(exclude=['number']).columns

Index(['Month', 'VisitorType', 'Weekend', 'Revenue'], dtype='object')

In [23]:
le = LabelEncoder()

### Month

In [24]:
XY['Month'].value_counts()

May     3363
Nov     2998
Mar     1894
Dec     1727
Oct      549
Sep      448
Aug      433
Jul      432
June     288
Feb      184
Name: Month, dtype: int64

In [27]:
XY.Month = le.fit_transform(XY.Month.values)
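
`LabelEncoder` assigns integer codes in sorted order of the labels, so for strings the codes follow alphabetical order rather than calendar order. The sketch below uses the ten month labels listed in the value counts to show the resulting mapping; the names `months`, `month_encoder`, and `mapping` are illustrative and do not appear in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The ten month labels that appear in the value counts
months = ["May", "Nov", "Mar", "Dec", "Oct", "Sep", "Aug", "Jul", "June", "Feb"]

month_encoder = LabelEncoder()
month_encoder.fit(months)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(month_encoder.classes_)}
print(mapping)
```

Here `Aug` maps to 0 and `Sep` to 9, so the encoded `Month` column is best read as a nominal code, not as an ordered month number.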

### VisitorType

In [28]:
XY['VisitorType'].value_counts()

Returning_Visitor    10537
New_Visitor           1694
Other                   85
Name: VisitorType, dtype: int64

In [29]:
XY.VisitorType = le.fit_transform(XY.VisitorType.values)

### Weekend

In [35]:
XY['Weekend'].value_counts()

False    9451
True     2865
Name: Weekend, dtype: int64

In [36]:
XY.Weekend = le.fit_transform(XY.Weekend.values)

### Revenue

In [37]:
XY['Revenue'].value_counts()

0    10408
1     1908
Name: Revenue, dtype: int64

In [38]:
XY.Revenue = le.fit_transform(XY.Revenue.values)
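
Because the single `le` instance is refit for every column, it only remembers the classes of the last column it was fitted on, so earlier columns can no longer be decoded with `inverse_transform`. One possible alternative, sketched here on a hypothetical toy frame, keeps a separate encoder per column in a dictionary (`encoders` is an illustrative name, not something the notebook defines):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with the same kind of non-numeric columns
toy = pd.DataFrame({
    "Month": ["Feb", "Nov", "Dec", "Nov"],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Other", "Returning_Visitor"],
    "Weekend": [False, True, False, False],
})

encoders = {}
for column in ["Month", "VisitorType", "Weekend"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # keep the fitted encoder for later decoding

# Recover the original labels of one column from its integer codes
decoded_months = [str(m) for m in encoders["Month"].inverse_transform(toy["Month"])]
print(toy)
print(decoded_months)
```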

## Checking that all columns are numeric

In [39]:
XY.select_dtypes(exclude=['number']).columns

Index([], dtype='object')

As the output shows, no non-numeric columns remain: the categorical variables were encoded earlier with `LabelEncoder`.

## Dropping columns

In [40]:
XY.drop(['OperatingSystems','Browser'], axis=1, inplace=True)

## Splitting into features X and target Y

In [42]:
X = XY.drop('Revenue', axis=1)
Y = XY['Revenue']
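
About 15% of the sessions have `Revenue` equal to 1, so a plain random split could leave the two partitions with different class proportions. As a sketch of a possible next step using the `train_test_split` already imported above, the example below stratifies on the target; the synthetic data, the 80/20 ratio, and `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and Y: 200 rows, 15% positive class
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "PageValues": rng.random(200),
    "ExitRates": rng.random(200),
})
Y_demo = pd.Series([1] * 30 + [0] * 170, name="Revenue")

# stratify keeps the class proportions equal in both partitions
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=0
)
print(Y_train.mean(), Y_test.mean())
```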

## Visualization and correlations

In [43]:
XY.describe()

,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,Region,TrafficType,VisitorType,Weekend,Revenue
count,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0,12316.0
mean,2.317798,80.906176,0.503979,34.506387,31.763884,1196.037057,0.022152,0.043003,5.895952,0.061497,5.164095,3.148019,4.070477,1.718009,0.232624,0.15492
std,3.322754,176.860432,1.270701,140.825479,44.490339,1914.372511,0.048427,0.048527,18.577926,0.19902,2.371528,2.402211,4.024598,0.691086,0.422522,0.361844
min,0.0,-1.0,0.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,7.0,185.0,0.0,0.014286,0.0,0.0,5.0,1.0,2.0,2.0,0.0,0.0
50%,1.0,8.0,0.0,0.0,18.0,599.76619,0.003119,0.025124,0.0,0.0,6.0,3.0,2.0,2.0,0.0,0.0
75%,4.0,93.5,0.0,0.0,38.0,1466.479902,0.016684,0.05,0.0,0.0,7.0,4.0,4.0,2.0,0.0,0.0
max,27.0,3398.75,24.0,2549.375,705.0,63973.52223,0.2,0.2,361.763742,1.0,9.0,9.0,20.0,2.0,1.0,1.0
