# Trabajo Final de Aprendizaje de Máquina

### Walmart Recruiting: Trip Type Classification


**J. Adrián Sánchez Castro**   116369<br>
**Nohemí Limnones Calvillo**    136135<br>
**Manuel Ríos Zagal**           159284<br>


#### Diciembre, 2016

Instituto Tecnológico Autónomo de México

## Introducción

El proyecto final consiste en clasificar tipos de viajes de compras basándose en los artículos que compraron los clientes. Para dar algunos ejemplos hipotéticos de tipos de viaje: un cliente puede hacer un pequeño viaje diario para la cena, un viaje semanal para compra de muchos comestibles, un viaje para comprar regalos para unas próximas vacaciones, o un viaje temporada para comprar ropa.

Walmart ha categorizado los viajes contenidos en estos datos en 38 distintos tipos usando un método patentado aplicado a un conjunto extendido de datos. Se le reta a recrear esta categorización / agrupación con un conjunto más limitado de características.


Para resolver el proyecto se utilizarán 3 algoritmos de aprendizaje máquina: Regresión Logística, Bosques Aleatorios y K-Vecinos Cercanos. El objetivo es tratar de lograr el mayor porcentaje de certeza en la predicción del tipo de viaje.

Este proyecto es entregado en un *Ipython Jupyter Notebook* que se encuentra dividido en cuatro secciones.

- **Sección 1.** Análisis Exploratorio de Datos
- **Sección 2.** Limpieza y Transformación de Datos<br>
- **Sección 3.** Modelado
- **Sección 4.** Análisis de Resultados

## Sección 1. Análisis Exploratorio de Datos

In [1]:
import pandas as pd
import numpy as np
import re 
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import scipy
import sklearn
%matplotlib inline

In [2]:
data = pd.read_csv("data/train.csv")
data.head()

Unnamed: 0,TripType,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
0,999,5,Friday,68113150000.0,-1,FINANCIAL SERVICES,1000.0
1,30,7,Friday,60538820000.0,1,SHOES,8931.0
2,30,7,Friday,7410811000.0,1,PERSONAL CARE,4504.0
3,26,8,Friday,2238404000.0,2,PAINT AND ACCESSORIES,3565.0
4,26,8,Friday,2006614000.0,2,PAINT AND ACCESSORIES,1017.0


In [3]:
data.count()

TripType                 647054
VisitNumber              647054
Weekday                  647054
Upc                      642925
ScanCount                647054
DepartmentDescription    645693
FinelineNumber           642925
dtype: int64

In [4]:
print(647054-642925)
print(4129/647054)

4129
0.006381229387346342


Podemos ver que hay 4129 filas que se encuentran vacías y representan al rededor .6 % y podemos remover 

In [5]:
len(data.TripType.unique())

38

## Sección 2. Limpieza y Transformación de Datos

### Conjunto de Entrenamiento

In [6]:
#Importamos la base con pandas Data Frame para el conjunto de entrenamiento
train = pd.read_csv("data/train.csv")
train.columns = ['tipo_visita', 'ticket', 'dia', 'producto', 'items_comprados', 'departamento', 'categoria']
train.head()

Unnamed: 0,tipo_visita,ticket,dia,producto,items_comprados,departamento,categoria
0,999,5,Friday,68113150000.0,-1,FINANCIAL SERVICES,1000.0
1,30,7,Friday,60538820000.0,1,SHOES,8931.0
2,30,7,Friday,7410811000.0,1,PERSONAL CARE,4504.0
3,26,8,Friday,2238404000.0,2,PAINT AND ACCESSORIES,3565.0
4,26,8,Friday,2006614000.0,2,PAINT AND ACCESSORIES,1017.0


In [7]:
#Importamos la base con pandas Data Frame para el conjunto de prueba
test = pd.read_csv("data/test.csv")
test.columns = ['ticket', 'dia', 'producto', 'items_comprados', 'departamento', 'categoria']
test.head()

Unnamed: 0,ticket,dia,producto,items_comprados,departamento,categoria
0,1,Friday,72503390000.0,1,SHOES,3002.0
1,1,Friday,1707711000.0,1,DAIRY,1526.0
2,1,Friday,89470000000.0,1,DAIRY,1431.0
3,1,Friday,88491210000.0,1,GROCERY DRY GOODS,3555.0
4,2,Friday,2840015000.0,1,DSD GROCERY,4408.0


### Dummies por departamento

In [8]:
#Creamos dummies de cada tipo de producto definido por el departamento
dummies = pd.get_dummies(train.departamento)
train[dummies.columns] = dummies 
train.head()

Unnamed: 0,tipo_visita,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,999,5,Friday,68113150000.0,-1,FINANCIAL SERVICES,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,30,7,Friday,60538820000.0,1,SHOES,8931.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,30,7,Friday,7410811000.0,1,PERSONAL CARE,4504.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26,8,Friday,2238404000.0,2,PAINT AND ACCESSORIES,3565.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,26,8,Friday,2006614000.0,2,PAINT AND ACCESSORIES,1017.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Reasignamos los valores de los días de  la semana por un valor entero

In [9]:
train['dia'] = train['dia'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7})
train.head()

Unnamed: 0,tipo_visita,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,999,5,5,68113150000.0,-1,FINANCIAL SERVICES,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,30,7,5,60538820000.0,1,SHOES,8931.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,30,7,5,7410811000.0,1,PERSONAL CARE,4504.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26,8,5,2238404000.0,2,PAINT AND ACCESSORIES,3565.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,26,8,5,2006614000.0,2,PAINT AND ACCESSORIES,1017.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Total de productos por departamento por ticket

In [10]:
#Creación de un dataframe que ligue las categorias de producto por ticket
data_dummies = train.iloc[:,7:]
data_dummies.head()

Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
data_dummies = data_dummies.apply(lambda x: x*train["items_comprados"])
data_dummies = data_dummies.replace(-0,0)
data_dummies.describe()

Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
count,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,...,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0
mean,0.0007,0.001923,0.008448,0.012834,0.007574,0.023301,0.003666,0.001516,0.00568,0.002691,...,0.003403,4.2e-05,0.016655,0.000822,0.008894,0.003408,0.00795,0.000913,0.009597,0.002473
std,0.037966,0.050949,0.113774,0.138664,0.107618,0.175774,0.07975,0.04352,0.083321,0.059365,...,0.070044,0.006695,0.142415,0.042769,0.119666,0.068176,0.116839,0.036888,0.125519,0.058284
min,-2.0,-2.0,-6.0,-6.0,-3.0,-9.0,-2.0,-2.0,-3.0,-3.0,...,-1.0,-1.0,-2.0,-2.0,-3.0,-2.0,-5.0,-1.0,-10.0,-2.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,9.0,12.0,12.0,10.0,20.0,12.0,7.0,4.0,4.0,...,11.0,1.0,7.0,10.0,51.0,15.0,20.0,2.0,16.0,4.0


In [12]:
#Creando dummie especial si hubo algun retorno de producto
train.loc[train.items_comprados < 0, 'devolución'] = 1
train.loc[train.devolución != 1, 'devolución'] = 0
train.head()

Unnamed: 0,tipo_visita,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,...,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS,devolución
0,999,5,5,68113150000.0,-1,FINANCIAL SERVICES,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,30,7,5,60538820000.0,1,SHOES,8931.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,30,7,5,7410811000.0,1,PERSONAL CARE,4504.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26,8,5,2238404000.0,2,PAINT AND ACCESSORIES,3565.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,26,8,5,2006614000.0,2,PAINT AND ACCESSORIES,1017.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Agrupamiento por ticket por tipo de producto por departamento

In [13]:
grouped = train.groupby("ticket")

In [14]:
grouped = grouped.agg({'dia': np.max, "tipo_visita": np.max, 'items_comprados': np.sum, 'devolución': np.max, 
              '1-HR PHOTO': np.sum, 'ACCESSORIES': np.sum,
       'AUTOMOTIVE': np.sum, 'BAKERY': np.sum, 'BATH AND SHOWER': np.sum, 'BEAUTY': np.sum, 'BEDDING': np.sum,
       'BOOKS AND MAGAZINES': np.sum, 'BOYS WEAR': np.sum, 'BRAS & SHAPEWEAR': np.sum,
       'CAMERAS AND SUPPLIES': np.sum, 'CANDY, TOBACCO, COOKIES': np.sum, 'CELEBRATION': np.sum,
       'COMM BREAD': np.sum, 'CONCEPT STORES': np.sum, 'COOK AND DINE': np.sum, 'DAIRY': np.sum, 'DSD GROCERY': np.sum,
       'ELECTRONICS': np.sum, 'FABRICS AND CRAFTS': np.sum, 'FINANCIAL SERVICES': np.sum,
       'FROZEN FOODS': np.sum, 'FURNITURE': np.sum, 'GIRLS WEAR, 4-6X  AND 7-14': np.sum,
       'GROCERY DRY GOODS': np.sum, 'HARDWARE': np.sum, 'HEALTH AND BEAUTY AIDS': np.sum, 'HOME DECOR': np.sum,
       'HOME MANAGEMENT': np.sum, 'HORTICULTURE AND ACCESS': np.sum,
       'HOUSEHOLD CHEMICALS/SUPP': np.sum, 'HOUSEHOLD PAPER GOODS': np.sum,
       'IMPULSE MERCHANDISE': np.sum, 'INFANT APPAREL': np.sum, 'INFANT CONSUMABLE HARDLINES': np.sum,
       'JEWELRY AND SUNGLASSES': np.sum, 'LADIES SOCKS': np.sum, 'LADIESWEAR': np.sum,
       'LARGE HOUSEHOLD GOODS': np.sum, 'LAWN AND GARDEN': np.sum, 'LIQUOR,WINE,BEER': np.sum,
       'MEAT - FRESH & FROZEN': np.sum, 'MEDIA AND GAMING': np.sum, 'MENS WEAR': np.sum, 'MENSWEAR': np.sum,
       'OFFICE SUPPLIES': np.sum, 'OPTICAL - FRAMES': np.sum, 'OPTICAL - LENSES': np.sum,
       'OTHER DEPARTMENTS': np.sum, 'PAINT AND ACCESSORIES': np.sum, 'PERSONAL CARE': np.sum,
       'PETS AND SUPPLIES': np.sum, 'PHARMACY OTC': np.sum, 'PHARMACY RX': np.sum,
       'PLAYERS AND ELECTRONICS': np.sum, 'PLUS AND MATERNITY': np.sum, 'PRE PACKED DELI': np.sum,
       'PRODUCE': np.sum, 'SEAFOOD': np.sum, 'SEASONAL': np.sum, 'SERVICE DELI': np.sum, 'SHEER HOSIERY': np.sum,
       'SHOES': np.sum, 'SLEEPWEAR/FOUNDATIONS': np.sum, 'SPORTING GOODS': np.sum,
       'SWIMWEAR/OUTERWEAR': np.sum, 'TOYS': np.sum, 'WIRELESS': np.sum})

In [None]:
train = grouped[["tipo_visita", "dia", "items_comprados", "devolución", 'ACCESSORIES',
               'AUTOMOTIVE', 'BAKERY', 'BATH AND SHOWER', 'BEAUTY', 'BEDDING',
               'BOOKS AND MAGAZINES', 'BOYS WEAR', 'BRAS & SHAPEWEAR',
               'CAMERAS AND SUPPLIES', 'CANDY, TOBACCO, COOKIES', 'CELEBRATION',
               'COMM BREAD', 'CONCEPT STORES', 'COOK AND DINE', 'DAIRY', 'DSD GROCERY',
               'ELECTRONICS', 'FABRICS AND CRAFTS', 'FINANCIAL SERVICES',
               'FROZEN FOODS', 'FURNITURE', 'GIRLS WEAR, 4-6X  AND 7-14',
               'GROCERY DRY GOODS', 'HARDWARE', 'HOME DECOR',
               'HOME MANAGEMENT', 'HORTICULTURE AND ACCESS',
               'HOUSEHOLD CHEMICALS/SUPP', 'HOUSEHOLD PAPER GOODS',
               'IMPULSE MERCHANDISE', 'INFANT APPAREL', 'INFANT CONSUMABLE HARDLINES',
               'JEWELRY AND SUNGLASSES', 'LADIES SOCKS', 'LADIESWEAR',
               'LARGE HOUSEHOLD GOODS', 'LAWN AND GARDEN', 'LIQUOR,WINE,BEER',
               'MEAT - FRESH & FROZEN', 'MEDIA AND GAMING', 'MENS WEAR', 'MENSWEAR',
               'OFFICE SUPPLIES', 'OPTICAL - FRAMES', 'OPTICAL - LENSES',
               'OTHER DEPARTMENTS', 'PAINT AND ACCESSORIES', 'PERSONAL CARE',
               'PETS AND SUPPLIES', 'PHARMACY OTC', 'PHARMACY RX',
               'PLAYERS AND ELECTRONICS', 'PLUS AND MATERNITY', 'PRE PACKED DELI',
               'PRODUCE', 'SEAFOOD', 'SEASONAL', 'SERVICE DELI', 'SHEER HOSIERY',
               'SHOES', 'SLEEPWEAR/FOUNDATIONS', 'SPORTING GOODS',
               'SWIMWEAR/OUTERWEAR', 'TOYS', 'WIRELESS']]

In [16]:
train['tipo_visita'] = train['tipo_visita'].map({3:0, 4:1, 5:2, 6:3, 7:4, 8:5, 9:6, 12:7, 14:8, 15:9,  
                                         18:10, 19:11, 20:12, 21:13, 22:14, 23:15, 24:16, 25:17, 26:18,  
                                         27:19, 28:20, 29:21, 30:22, 31:23, 32:24, 33:25, 34:26, 35:27, 
                                         36:28, 37:29, 38:30, 39:31, 40:32, 41:33, 42:34, 43:35, 44:36, 999:37})

In [17]:
train.head(5)

Unnamed: 0,tipo_visita,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,...,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS,devolución
0,37,5,5,68113150000.0,-1,FINANCIAL SERVICES,1000.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,22,7,5,60538820000.0,1,SHOES,8931.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22,7,5,7410811000.0,1,PERSONAL CARE,4504.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,18,8,5,2238404000.0,2,PAINT AND ACCESSORIES,3565.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,18,8,5,2006614000.0,2,PAINT AND ACCESSORIES,1017.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


De esta manera, tenemos un dataframe que agrupo por cada id de ticket con el día de compra y el tipo de producto que se compró

In [18]:
train.shape

(647054, 76)

### Conjunto de Prueba

In [19]:
#Creamos dummies de cada tipo de producto definido por el departamento
dummies = pd.get_dummies(test.departamento)
test[dummies.columns] = dummies 
test.head()

Unnamed: 0,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,1,Friday,72503390000.0,1,SHOES,3002.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1,Friday,1707711000.0,1,DAIRY,1526.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,Friday,89470000000.0,1,DAIRY,1431.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,Friday,88491210000.0,1,GROCERY DRY GOODS,3555.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,Friday,2840015000.0,1,DSD GROCERY,4408.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Reasignamos los valores de los días de  la semana por un valor entero

In [20]:
test['dia'] = test['dia'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7})
test.head()

Unnamed: 0,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,1,5,72503390000.0,1,SHOES,3002.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1,5,1707711000.0,1,DAIRY,1526.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,5,89470000000.0,1,DAIRY,1431.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,5,88491210000.0,1,GROCERY DRY GOODS,3555.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,5,2840015000.0,1,DSD GROCERY,4408.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
#Creación de un dataframe que ligue las categorias de producto por ticket
data_dummies_test = test.iloc[:,6:]
data_dummies_test.head()

Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
data_dummies_test = data_dummies.apply(lambda x: x*test["items_comprados"])
data_dummies_test = data_dummies.replace(-0,0)
data_dummies_test.describe()

Unnamed: 0,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,BEDDING,BOOKS AND MAGAZINES,BOYS WEAR,BRAS & SHAPEWEAR,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
count,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,...,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0,647054.0
mean,0.0007,0.001923,0.008448,0.012834,0.007574,0.023301,0.003666,0.001516,0.00568,0.002691,...,0.003403,4.2e-05,0.016655,0.000822,0.008894,0.003408,0.00795,0.000913,0.009597,0.002473
std,0.037966,0.050949,0.113774,0.138664,0.107618,0.175774,0.07975,0.04352,0.083321,0.059365,...,0.070044,0.006695,0.142415,0.042769,0.119666,0.068176,0.116839,0.036888,0.125519,0.058284
min,-2.0,-2.0,-6.0,-6.0,-3.0,-9.0,-2.0,-2.0,-3.0,-3.0,...,-1.0,-1.0,-2.0,-2.0,-3.0,-2.0,-5.0,-1.0,-10.0,-2.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,5.0,9.0,12.0,12.0,10.0,20.0,12.0,7.0,4.0,4.0,...,11.0,1.0,7.0,10.0,51.0,15.0,20.0,2.0,16.0,4.0


## Concatenación con base de dummies

In [23]:
test = pd.concat([test, data_dummies_test], axis=1)
test.head()

Unnamed: 0,ticket,dia,producto,items_comprados,departamento,categoria,1-HR PHOTO,ACCESSORIES,AUTOMOTIVE,BAKERY,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
0,1,5,72503390000.0,1,SHOES,3002.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,5,1707711000.0,1,DAIRY,1526.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1,5,89470000000.0,1,DAIRY,1431.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,5,88491210000.0,1,GROCERY DRY GOODS,3555.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,5,2840015000.0,1,DSD GROCERY,4408.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Agrupamiento por ticket por tipo de producto por departamento

In [24]:
grouped = test.groupby("ticket")

In [25]:
grouped = grouped.agg({'dia': np.max, 'items_comprados': np.sum, 'devolución': np.max,'1-HR PHOTO': np.sum, 'ACCESSORIES': np.sum,
       'AUTOMOTIVE': np.sum, 'BAKERY': np.sum, 'BATH AND SHOWER': np.sum, 'BEAUTY': np.sum, 'BEDDING': np.sum,
       'BOOKS AND MAGAZINES': np.sum, 'BOYS WEAR': np.sum, 'BRAS & SHAPEWEAR': np.sum,
       'CAMERAS AND SUPPLIES': np.sum, 'CANDY, TOBACCO, COOKIES': np.sum, 'CELEBRATION': np.sum,
       'COMM BREAD': np.sum, 'CONCEPT STORES': np.sum, 'COOK AND DINE': np.sum, 'DAIRY': np.sum, 'DSD GROCERY': np.sum,
       'ELECTRONICS': np.sum, 'FABRICS AND CRAFTS': np.sum, 'FINANCIAL SERVICES': np.sum,
       'FROZEN FOODS': np.sum, 'FURNITURE': np.sum, 'GIRLS WEAR, 4-6X  AND 7-14': np.sum,
       'GROCERY DRY GOODS': np.sum, 'HARDWARE': np.sum, 'HEALTH AND BEAUTY AIDS': np.sum, 'HOME DECOR': np.sum,
       'HOME MANAGEMENT': np.sum, 'HORTICULTURE AND ACCESS': np.sum,
       'HOUSEHOLD CHEMICALS/SUPP': np.sum, 'HOUSEHOLD PAPER GOODS': np.sum,
       'IMPULSE MERCHANDISE': np.sum, 'INFANT APPAREL': np.sum, 'INFANT CONSUMABLE HARDLINES': np.sum,
       'JEWELRY AND SUNGLASSES': np.sum, 'LADIES SOCKS': np.sum, 'LADIESWEAR': np.sum,
       'LARGE HOUSEHOLD GOODS': np.sum, 'LAWN AND GARDEN': np.sum, 'LIQUOR,WINE,BEER': np.sum,
       'MEAT - FRESH & FROZEN': np.sum, 'MEDIA AND GAMING': np.sum, 'MENS WEAR': np.sum, 'MENSWEAR': np.sum,
       'OFFICE SUPPLIES': np.sum, 'OPTICAL - FRAMES': np.sum, 'OPTICAL - LENSES': np.sum,
       'OTHER DEPARTMENTS': np.sum, 'PAINT AND ACCESSORIES': np.sum, 'PERSONAL CARE': np.sum,
       'PETS AND SUPPLIES': np.sum, 'PHARMACY OTC': np.sum, 'PHARMACY RX': np.sum,
       'PLAYERS AND ELECTRONICS': np.sum, 'PLUS AND MATERNITY': np.sum, 'PRE PACKED DELI': np.sum,
       'PRODUCE': np.sum, 'SEAFOOD': np.sum, 'SEASONAL': np.sum, 'SERVICE DELI': np.sum, 'SHEER HOSIERY': np.sum,
       'SHOES': np.sum, 'SLEEPWEAR/FOUNDATIONS': np.sum, 'SPORTING GOODS': np.sum,
       'SWIMWEAR/OUTERWEAR': np.sum, 'TOYS': np.sum, 'WIRELESS': np.sum})

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/base.py", line 545, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/base.py", line 496, in _agg
    result[fname] = func(fname, agg_how)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/base.py", line 477, in _agg_1dim
    raise SpecificationError("nested dictionary is ambiguous "
pandas.core.base.SpecificationError: nested dictionary is ambiguous in aggregation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/groupby.py", line 99, in f
    return self._cython_agg_general(alias, numeric_only=numeric_only)
  File "/Library/Frameworks/Python.framework/Versions/3

TypeError: Can't convert 'list' object to str implicitly

In [None]:
grouped = data.groupby("VisitNumber")
    grouped = grouped.agg({'Weekday': np.max, 'NumItems': np.sum, 'Return': np.max, 
                'ACCESSORIES': np.sum,
           'AUTOMOTIVE': np.sum, 'BAKERY': np.sum, 'BATH AND SHOWER': np.sum, 'BEAUTY': np.sum, 'BEDDING': np.sum,
           'BOOKS AND MAGAZINES': np.sum, 'BOYS WEAR': np.sum, 'BRAS & SHAPEWEAR': np.sum,
           'CAMERAS AND SUPPLIES': np.sum, 'CANDY, TOBACCO, COOKIES': np.sum, 'CELEBRATION': np.sum,
           'COMM BREAD': np.sum, 'CONCEPT STORES': np.sum, 'COOK AND DINE': np.sum, 'DAIRY': np.sum, 'DSD GROCERY': np.sum,
           'ELECTRONICS': np.sum, 'FABRICS AND CRAFTS': np.sum, 'FINANCIAL SERVICES': np.sum,
           'FROZEN FOODS': np.sum, 'FURNITURE': np.sum, 'GIRLS WEAR, 4-6X  AND 7-14': np.sum,
           'GROCERY DRY GOODS': np.sum, 'HARDWARE': np.sum, 'HOME DECOR': np.sum,
           'HOME MANAGEMENT': np.sum, 'HORTICULTURE AND ACCESS': np.sum,
           'HOUSEHOLD CHEMICALS/SUPP': np.sum, 'HOUSEHOLD PAPER GOODS': np.sum,
           'IMPULSE MERCHANDISE': np.sum, 'INFANT APPAREL': np.sum, 'INFANT CONSUMABLE HARDLINES': np.sum,
           'JEWELRY AND SUNGLASSES': np.sum, 'LADIES SOCKS': np.sum, 'LADIESWEAR': np.sum,
           'LARGE HOUSEHOLD GOODS': np.sum, 'LAWN AND GARDEN': np.sum, 'LIQUOR,WINE,BEER': np.sum,
           'MEAT - FRESH & FROZEN': np.sum, 'MEDIA AND GAMING': np.sum, 'MENS WEAR': np.sum, 'MENSWEAR': np.sum,
           'OFFICE SUPPLIES': np.sum, 'OPTICAL - FRAMES': np.sum, 'OPTICAL - LENSES': np.sum,
           'OTHER DEPARTMENTS': np.sum, 'PAINT AND ACCESSORIES': np.sum, 'PERSONAL CARE': np.sum,
           'PETS AND SUPPLIES': np.sum, 'PHARMACY OTC': np.sum, 'PHARMACY RX': np.sum,
           'PLAYERS AND ELECTRONICS': np.sum, 'PLUS AND MATERNITY': np.sum, 'PRE PACKED DELI': np.sum,
           'PRODUCE': np.sum, 'SEAFOOD': np.sum, 'SEASONAL': np.sum, 'SERVICE DELI': np.sum, 'SHEER HOSIERY': np.sum,
           'SHOES': np.sum, 'SLEEPWEAR/FOUNDATIONS': np.sum, 'SPORTING GOODS': np.sum,
           'SWIMWEAR/OUTERWEAR': np.sum, 'TOYS': np.sum, 'WIRELESS': np.sum})

## Columna de total de tipos de productos comprados

In [20]:
# para cada fila a partir de la columna 4, suma uno por cada tipo de producto comprado en una lista y lo apendea en una lista
alist = []
for array in np.asarray(data.iloc[:,4:]):
    count = 0
    for item in array:
        if item > 0:
            count += 1
    alist.append(count)

In [21]:
len(alist)

95674

In [22]:
cat_counts = pd.DataFrame(alist)
cat_counts = cat_counts.rename(columns={0:"total_tipo_productos"})
cat_counts = cat_counts.set_index(data.index)
data.insert(4, 'total_categorias_compradas', cat_counts)

In [23]:
data.head(5)

Unnamed: 0_level_0,tipo_visita,dia,items_comprados,devolución,total_categorias_compradas,ACCESSORIES,AUTOMOTIVE,BAKERY,BATH AND SHOWER,BEAUTY,...,SEAFOOD,SEASONAL,SERVICE DELI,SHEER HOSIERY,SHOES,SLEEPWEAR/FOUNDATIONS,SPORTING GOODS,SWIMWEAR/OUTERWEAR,TOYS,WIRELESS
ticket,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
5,999,5,-1,1.0,1,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,30,5,2,0.0,2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,26,5,28,1.0,6,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,8,5,3,0.0,2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,8,5,3,0.0,2,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Asignación de parámetros

In [24]:
param = ['dia', 'items_comprados', 'devolución', 'total_categorias_compradas', 'ACCESSORIES',
       'AUTOMOTIVE', 'BAKERY', 'BATH AND SHOWER', 'BEAUTY', 'BEDDING',
       'BOOKS AND MAGAZINES', 'BOYS WEAR', 'BRAS & SHAPEWEAR',
       'CAMERAS AND SUPPLIES', 'CANDY, TOBACCO, COOKIES', 'CELEBRATION',
       'COMM BREAD', 'CONCEPT STORES', 'COOK AND DINE', 'DAIRY', 'DSD GROCERY',
       'ELECTRONICS', 'FABRICS AND CRAFTS', 'FINANCIAL SERVICES',
       'FROZEN FOODS', 'FURNITURE', 'GIRLS WEAR, 4-6X  AND 7-14',
       'GROCERY DRY GOODS', 'HARDWARE', 'HOME DECOR',
       'HOME MANAGEMENT', 'HORTICULTURE AND ACCESS',
       'HOUSEHOLD CHEMICALS/SUPP', 'HOUSEHOLD PAPER GOODS',
       'IMPULSE MERCHANDISE', 'INFANT APPAREL', 'INFANT CONSUMABLE HARDLINES',
       'JEWELRY AND SUNGLASSES', 'LADIES SOCKS', 'LADIESWEAR',
       'LARGE HOUSEHOLD GOODS', 'LAWN AND GARDEN', 'LIQUOR,WINE,BEER',
       'MEAT - FRESH & FROZEN', 'MEDIA AND GAMING', 'MENS WEAR', 'MENSWEAR',
       'OFFICE SUPPLIES', 'OPTICAL - FRAMES', 'OPTICAL - LENSES',
       'OTHER DEPARTMENTS', 'PAINT AND ACCESSORIES', 'PERSONAL CARE',
       'PETS AND SUPPLIES', 'PHARMACY OTC', 'PHARMACY RX',
       'PLAYERS AND ELECTRONICS', 'PLUS AND MATERNITY', 'PRE PACKED DELI',
       'PRODUCE', 'SEAFOOD', 'SEASONAL', 'SERVICE DELI', 'SHEER HOSIERY',
       'SHOES', 'SLEEPWEAR/FOUNDATIONS', 'SPORTING GOODS',
       'SWIMWEAR/OUTERWEAR', 'TOYS', 'WIRELESS']

## Regresión Logística

In [25]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss



In [28]:
#Separamos la base de datos original de train en train y test nuevamente 
train_2,test_2 = train_test_split(data, test_size=0.4)

In [31]:

#Asignamos los tipos de variables para cada categoría del nuevo train y test
train_2.tipo_visita = train_2.tipo_visita.astype(float)
train_2.dia = train_2.dia.astype(float)  
train_2.items_comprados = train_2.items_comprados.astype(float)                         

test_2.tipo_visita = test_2.tipo_visita.astype(float)
test_2.dia = test_2.dia.astype(float)  
test_2.items_comprados = test_2.items_comprados.astype(float) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [33]:
LR = LogisticRegression()
LR.fit(X = np.asarray(train_2[param]), y = np.asarray(train_2.tipo_visita))

predicciones = LR.predict_proba(np.asarray(test_2[param]))
predicciones_noProba = LR.predict(np.asarray(test_2[param]))

print("Porcentaje de Pérdida Logarítmica: {}".format(log_loss(test_2.tipo_visita, predicciones).round(5)))

print("Resultado de Precisión del Modelo: {}".format(sklearn.metrics.accuracy_score(test_2.tipo_visita, predicciones_noProba)))

Porcentaje de Pérdida Logarítmica: 1.21894
Resultado de Precisión del Modelo: 0.6542461458061144


## Bosques Aleatorios

In [34]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
bosque = RandomForestClassifier(n_estimators = 300, max_features=2)

In [38]:
bosque = bosque.fit(np.asarray(train_2[param]), np.asarray(train_2.tipo_visita))

In [None]:
#predicciones = bosque.predict_proba(np.asarray(test_2[param]))
#predicciones_noProba = bosque.predict(np.asarray(test_2[fe]))
print("Porcentaje de Pérdida Logarítmica: {}".format(log_loss(test_2.tipo_visita, predicciones).round(5)))
print("Resultado de Precisión del Modelo: {}".format(sklearn.metrics.accuracy_score(test_2.tipo_visita, predicciones_noProba)))

## K - Vecinos Cercanos


In [40]:
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

In [41]:
KNN = KNeighborsClassifier()

In [43]:
KNN = KNN.fit(np.asarray(train_2[param]), np.asarray(train_2.tipo_visita))

In [44]:
#predicciones = KNN.predict_proba(np.asarray(test_2[features]))
log_loss(test_2.tipo_visita, predicciones).round(5)

NameError: name 'features' is not defined

# Clasificador de Soporte Vectorial

In [46]:
from sklearn.svm import LinearSVC
LSVC= LinearSVC()
LSVC = LSVC.fit(np.asarray(train_2[param]), np.asarray(train_2.tipo_visita))