# Diplodatos Kaggle Competition
---

We present this piece of code as the baseline for the competition and as an example of how to approach this kind of problem. The main goals are that you:

1. Learn
1. Try different models and see which one best fits the given data
1. Get a higher score than the given one in the current baseline example
1. Try to get the highest score in the class :)



Data fields

* `TripType` - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.

* `VisitNumber` - an id corresponding to a single trip by a single customer

* `Weekday` - the weekday of the trip
* `Upc` - the UPC number of the product purchased
* `ScanCount` - the number of the given item that was purchased. A negative value indicates a product return.

* `DepartmentDescription` - a high-level description of the item's department

* `FinelineNumber` - a more refined category for each of the products, created by Walmart

In [1]:
# Import the required packages
import os
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.tree import DecisionTreeClassifier as DT

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# Note: plot_confusion_matrix was removed in scikit-learn 1.2,
# so this import needs an older release.
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold


from scipy.stats import uniform, truncnorm, randint

In [2]:
# variables.

file_path_train = 'https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/train.csv'
file_path_test = 'https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/test.csv'

dtype={
        #'TripType': np.uint8, # unsigned number
        #'VisitNumber': np.uint32,
        #'Weekday': str,
        'Upc': str,
        #'ScanCount': np.int32,
        #'DepartmentDescription': str,
        #'FinelineNumber': str # long
        }

Read the *original* dataset...

In [None]:
original_df = pd.read_csv(file_path_train, dtype=dtype)

original_df.dtypes

Looking into the columns values...

In [None]:
original_df

**TripType** is the column that we should predict. It is not present
in the test set.

The minimum value in the `ScanCount` column is `-10`; a negative value indicates a 
product return. It is possible to derive a new column that records whether a value 
is negative.

In [None]:
original_df.describe(include='all')

In [None]:
original_df.Weekday.nunique(dropna=False)

In [None]:
original_df.DepartmentDescription.nunique(dropna=False)

In [None]:
original_df.FinelineNumber.nunique(dropna=False)

In [None]:
original_df.Upc.nunique(dropna=False)

## 1 Pre-processing
---

### 1.1 `NaN` values

There are `NaN`s in some columns; let us find them...

In [None]:
original_df.isna().sum()

In [None]:
original_df[original_df.DepartmentDescription.isna()]

When `DepartmentDescription` is `NaN`, are `Upc` and `FinelineNumber` both `NaN` too?

In [None]:
(
    original_df.DepartmentDescription.isna().sum(),
    (original_df.DepartmentDescription.isna() & 
     original_df.Upc.isna() & 
     original_df.FinelineNumber.isna()).sum())

In [None]:
original_df[original_df.Upc.isna()]

When `Upc` is `NaN`, is `FinelineNumber` also `NaN`?

In [None]:
(original_df.Upc.isna().sum(),
 original_df.FinelineNumber.isna().sum(),
 (original_df.FinelineNumber.isna() & original_df.Upc.isna()).sum())

But it may be the case that both `Upc` and `FinelineNumber` are `NaN` while `DepartmentDescription` is not...




In [None]:
fil = (original_df.FinelineNumber.isna() & original_df.Upc.isna())
original_df[fil]['DepartmentDescription'].value_counts(dropna=False)

In the previous case, even though `Upc` and `FinelineNumber` are `NaN`, 
there are rows whose `DepartmentDescription` has a value: `PHARMACY RX`.

In [None]:
print(original_df[original_df.Upc.isna()].TripType.nunique())

plt.figure(figsize=(16,8))
sns.countplot(
   x=original_df[original_df.Upc.isna()].TripType, color='dodgerblue')
plt.title('Number of NaN UPCs per TripType')
plt.xlabel('TripType')
# plt.ylabel('Number of records')
plt.show()

So, `Upc` and `FinelineNumber` are both `NaN` at the same
time.
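
The checks above can be condensed into two expressions. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show one way of testing whether two columns are missing on exactly the same rows, and of tabulating how that missingness relates to `DepartmentDescription`.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mimicking the missing-value pattern seen above.
toy = pd.DataFrame({
    "Upc": ["4011.0", np.nan, np.nan, "3107.0"],
    "FinelineNumber": [5501.0, np.nan, np.nan, 103.0],
    "DepartmentDescription": ["PRODUCE", np.nan, "PHARMACY RX", "PRODUCE"],
})

# True when the two columns are missing on exactly the same rows.
same_rows = bool((toy["Upc"].isna() == toy["FinelineNumber"].isna()).all())

# Cross-tabulate missing Upc values against missing department descriptions.
missing_table = pd.crosstab(
    toy["Upc"].isna().rename("Upc_missing"),
    toy["DepartmentDescription"].isna().rename("Dept_missing"),
)

print(same_rows)
print(missing_table)
```

The same two expressions should work on `original_df` unchanged, since they only rely on the column names.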

### 1.2 Analysis

Our last step in this analysis is to see how balanced the data is...

In [None]:
print(original_df[['TripType']].nunique())

plt.figure(figsize=(16,8))
sns.countplot(
    x=original_df.TripType, color='dodgerblue')
plt.title('Number of entries per TripType')
plt.xlabel('TripType')
# plt.ylabel('Number of records')
plt.show()
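
Besides the plot, the balance can be summarised numerically. This is a minimal sketch on a made-up label series; `imbalance_ratio` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up labels standing in for original_df.TripType.
trip_types = pd.Series([8, 8, 8, 8, 39, 39, 40, 999, 999, 999], name="TripType")

# Share of rows per class, ordered by class id.
shares = trip_types.value_counts(normalize=True).sort_index()

# Ratio between the most and the least frequent class.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(imbalance_ratio)
```

A ratio far above 1 suggests that plain accuracy may hide poor performance on the rare classes.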

In [None]:
plt.figure(figsize=(16,8))
sns.countplot(
   x=original_df[original_df.ScanCount < 0].TripType, color='dodgerblue')
plt.xlabel('TripType')
plt.title('Number of returns per TripType')
# plt.ylabel('Number of records')
plt.show()

In [None]:
del original_df

```
def to_categorical(column, bin_size=100, min_cut=0, max_cut=9998):
    if min_cut is None:
        min_cut = int(round(column.min())) - 1
    value_max = int(np.ceil(column.max()))
    max_cut = min(max_cut, value_max)
    # Clip each bin at max_cut so it cannot overlap the final (max_cut, value_max) bin.
    intervals = [(x, min(x + bin_size, max_cut)) for x in range(min_cut, max_cut, bin_size)]
    if max_cut != value_max:
        intervals.append((max_cut, value_max))
    #print(intervals)
    return pd.cut(column, pd.IntervalIndex.from_tuples(intervals))
```

### 1.3 Data Cleaning.


In [3]:
clean_df = pd.read_csv(file_path_train, dtype=dtype)

#### 1.3.1 Get Labels `TripType`

In [4]:
y = clean_df.groupby(
    ['VisitNumber', 'Weekday'], as_index=False).first().TripType
y

0        999
1          8
2          8
3         35
4         41
        ... 
67024     24
67025     38
67026     25
67027     22
67028      8
Name: TripType, Length: 67029, dtype: int64
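
Taking the first `TripType` of each group only makes sense if every visit carries a single label. The sketch below, on a made-up frame, shows the same `groupby(...).first()` pattern together with a consistency check one could run before trusting it.

```python
import pandas as pd

# Made-up rows: one row per purchased item, several rows per visit.
toy = pd.DataFrame({
    "TripType": [999, 8, 8, 8, 35],
    "VisitNumber": [5, 9, 9, 9, 10],
    "Weekday": ["Friday"] * 5,
    "ScanCount": [-1, 1, 1, 1, 1],
})

# One label per visit, in the sorted order of the group keys.
labels = toy.groupby(
    ["VisitNumber", "Weekday"], as_index=False).first().TripType

# Every visit should have exactly one distinct TripType.
consistent = bool((toy.groupby("VisitNumber").TripType.nunique() == 1).all())

print(labels.tolist(), consistent)
```

Because `groupby` sorts by the group keys, the labels come out in the same order as any later aggregation that groups by the same keys.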

#### 1.3.2 Concat Test and Train Dataframes

In [5]:
test_df = pd.read_csv(file_path_test, dtype=dtype)

In [6]:
clean_df = clean_df.drop(['TripType'], axis=1)
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber
0,5,Friday,68113152929.0,-1,FINANCIAL SERVICES,1000.0
1,9,Friday,1070080727.0,1,IMPULSE MERCHANDISE,115.0
2,9,Friday,3107.0,1,PRODUCE,103.0
3,9,Friday,4011.0,1,PRODUCE,5501.0
4,10,Friday,6414410235.0,1,DSD GROCERY,2008.0
...,...,...,...,...,...,...
453406,191344,Sunday,73150956660.0,1,BEAUTY,3405.0
453407,191344,Sunday,65053002603.0,1,WIRELESS,1712.0
453408,191344,Sunday,7918131034.0,1,BEAUTY,3405.0
453409,191347,Sunday,4190007664.0,1,DAIRY,1512.0


In [7]:
clean_df['is_train_set'] = 1
test_df['is_train_set'] = 0

In [8]:
clean_df = pd.concat([clean_df, test_df])

In [9]:
del test_df

#### 1.3.3 `NaN` values?

```
clean_df.FinelineNumber = clean_df.FinelineNumber.fillna(-1)
clean_df.Upc = clean_df.Upc.fillna('-1')
clean_df.DepartmentDescription = clean_df.DepartmentDescription.fillna('nan')
```

In [10]:
clean_df.isna().sum()

VisitNumber                 0
Weekday                     0
Upc                      4129
ScanCount                   0
DepartmentDescription    1361
FinelineNumber           4129
is_train_set                0
dtype: int64

Checking new column

#### 1.3.4 `returns` column

Derive a new `returns` column from `ScanCount`:

* the number of returned units (the absolute value of `ScanCount`) when `ScanCount` is negative
* `0` when there is no return

In [11]:

def repay_column(df: pd.DataFrame):
    """
    add new return column 
    """
    df['returns'] = df.apply(
        lambda x: abs(x['ScanCount']) if x['ScanCount'] < 0 else 0, axis=1
    )
    return df

In [12]:
clean_df = repay_column(clean_df)
clean_df[['ScanCount', 'returns']]

Unnamed: 0,ScanCount,returns
0,-1,1
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
193638,1,0
193639,1,0
193640,1,0
193641,1,0


#### 1.3.5 Positive `ScanCount` column

`rScount` keeps the positive `ScanCount` values and sets returns (negative values) to `0`.

In [13]:
clean_df['rScount'] = clean_df.ScanCount
clean_df.loc[clean_df.ScanCount < 0, 'rScount'] = 0
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber,is_train_set,returns,rScount
0,5,Friday,68113152929.0,-1,FINANCIAL SERVICES,1000.0,1,1,0
1,9,Friday,1070080727.0,1,IMPULSE MERCHANDISE,115.0,1,0,1
2,9,Friday,3107.0,1,PRODUCE,103.0,1,0,1
3,9,Friday,4011.0,1,PRODUCE,5501.0,1,0,1
4,10,Friday,6414410235.0,1,DSD GROCERY,2008.0,1,0,1
...,...,...,...,...,...,...,...,...,...
193638,191346,Sunday,3120033013.0,1,DSD GROCERY,4639.0,0,0,1
193639,191346,Sunday,3700091229.0,1,HOUSEHOLD CHEMICALS/SUPP,8947.0,0,0,1
193640,191346,Sunday,32390001778.0,1,PHARMACY OTC,1118.0,0,0,1
193641,191346,Sunday,7874205336.0,1,FROZEN FOODS,1752.0,0,0,1
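
The row-wise `apply` used for `returns` can be slow on several hundred thousand rows. As a possible alternative, both derived columns can be written with `Series.clip`; this is a sketch on made-up values, not the code that produced the tables above.

```python
import pandas as pd

# Made-up ScanCount values, including two returns and a zero.
toy = pd.DataFrame({"ScanCount": [-1, 1, 3, -2, 0]})

# Returned units: flip the sign, then cut everything below zero.
toy["returns"] = (-toy["ScanCount"]).clip(lower=0)

# Purchased units: cut negative counts to zero.
toy["rScount"] = toy["ScanCount"].clip(lower=0)

print(toy)
```

By construction `ScanCount == rScount - returns` holds on every row, which is a cheap sanity check for the two columns.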


Total Count

#### 1.3.6 `Upc` columns

In its standard version (UPC-A), the bar code consists of a five-digit 
manufacturer number and a five-digit product number. In addition, there is a
one-digit number system identifier at the start of the code. The number 
system digit denotes the use of one of ten number systems defined by UPC:

* `0, 1, 6, 7 and 8` are for regular UPC codes.
* `2` is for random weight items, e.g. meat, marked in-store.
* `3` is for National Drug Code and National Health Related Items.
* `4` is for in-store marking of non-food items.
* `5 and 9` are for coupon use.

<p style="text-align: center;">
<img src=http://www.computalabel.com/Images/UPCdiag.png width=75%>
</p>



The UPC symbol also has a `check digit` which is the last digit of the 
code and is calculated according to the algorithm used for EAN.
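
As an illustration of the check digit, here is a minimal sketch of the usual UPC-A rule: digits in odd positions are weighted by 3, digits in even positions by 1, and the check digit brings the total up to a multiple of 10. The function name `upc_check_digit` is hypothetical, and the notebook itself does not use the check digit as a feature.

```python
def upc_check_digit(first_eleven: str) -> int:
    """Return the UPC-A check digit for the first 11 digits of a code."""
    if len(first_eleven) != 11 or not first_eleven.isdigit():
        raise ValueError("expected a string of exactly 11 digits")
    digits = [int(c) for c in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    total = 3 * odd_sum + even_sum
    # Smallest digit that makes the total a multiple of 10.
    return (10 - total % 10) % 10


check = upc_check_digit("03600029145")
print(check)
```

For `03600029145` the weighted sum is 3 * 14 + 16 = 58, so the check digit is 2.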

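First step: left-pad the values in the `Upc` column with `0`s to complete 11 digits. The last two characters are dropped first, because the values are read as strings such as `3107.0`.

Missing values (`NaN`) are replaced by the string `'-1'`.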

In [14]:
def clean_upc(df):
    
    def f(x):
        if x == '-1' or not isinstance(x, str) :
            x = '-1'
        elif len(x) < 11:
            x = '0' * (11 - len(x)) + x
        return x
    
    df.Upc = df.Upc.str[:-2].apply(f)
    return df

In [15]:
clean_df = clean_upc(clean_df)
clean_df[['Upc']]

Unnamed: 0,Upc
0,68113152929
1,01070080727
2,00000003107
3,00000004011
4,06414410235
...,...
193638,03120033013
193639,03700091229
193640,32390001778
193641,07874205336
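
The same padding can also be expressed with pandas string methods. The sketch below is an alternative to `clean_upc`, shown on made-up values; it assumes, as above, that every non-missing `Upc` string ends in `.0`.

```python
import numpy as np
import pandas as pd

# Made-up raw values as they look after reading the CSV with dtype=str.
raw = pd.Series(["68113152929.0", "3107.0", np.nan])

padded = (
    raw.str[:-2]        # drop the trailing ".0"
       .str.zfill(11)   # left-pad with zeros up to 11 characters
       .fillna("-1")    # missing values become the string "-1"
)

print(padded.tolist())
```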


In [16]:
def upc_columns(df):
    df['numSysChar'] = df.apply(
        lambda x: x.Upc[0] if x.Upc != '-1' else '-1', axis=1)
    df['manNum'] = df.apply(
        lambda x: x.Upc[1:6] if x.Upc != '-1' else '-1', axis=1)
    #df['itemNum'] = df.apply(
    #     lambda x: x.Upc[6:11] if x.Upc != '-1' else '-1', axis=1)
    
    # df['checkDig'] = df.apply(
    #     lambda x: int(x.Upc[-1]) if isinstance(x.Upc, str) else -1, axis=1)
    return df

In [17]:
clean_df = upc_columns(clean_df)
clean_df[['Upc', 'numSysChar']]

Unnamed: 0,Upc,numSysChar
0,68113152929,6
1,01070080727,0
2,00000003107,0
3,00000004011,0
4,06414410235,0
...,...,...
193638,03120033013,0
193639,03700091229,0
193640,32390001778,3
193641,07874205336,0


* `0, 1 , 6, 7 and 8` are for regular UPC codes.
* `2` is for random weight items, e.g. meat, marked in-store.
* `3` is for National Drug Code and National Health Related Items.
* `4` is for in-store marking of non-food items.
* `5 and 9` are for coupon use.

In [18]:
fil = ['0','1','6','7', '8']
clean_df.loc[clean_df.numSysChar.isin(fil), 'numSysChar'] = 'regular'
fil = ['5','9']
clean_df.loc[clean_df.numSysChar.isin(fil), 'numSysChar'] = 'cupon'

In [19]:
clean_df.numSysChar.value_counts()

regular    612971
3           13892
2           13706
-1           4129
4            1762
cupon         594
Name: numSysChar, dtype: int64
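
The grouping of number system digits can also be written as a single lookup table. This is a sketch on made-up digits; `NUMBER_SYSTEM_GROUPS` is a hypothetical name, and the `'cupon'` spelling is kept only so the labels match the column names used in this notebook.

```python
import pandas as pd

# Lookup table for the digits that are merged into a group.
NUMBER_SYSTEM_GROUPS = {
    "0": "regular", "1": "regular", "6": "regular",
    "7": "regular", "8": "regular",
    "5": "cupon", "9": "cupon",
}

# Made-up first digits, including the '-1' marker for missing codes.
first_digit = pd.Series(["6", "0", "3", "9", "-1", "2"])

# Digits without an entry in the table keep their original value.
grouped = first_digit.map(NUMBER_SYSTEM_GROUPS).fillna(first_digit)

print(grouped.tolist())
```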

### 1.4 Convert column types

In [20]:
clean_df.Upc = clean_df.Upc.astype('float')
clean_df.FinelineNumber = clean_df.FinelineNumber.astype('float')
clean_df.manNum = clean_df.manNum.astype('int')

In [21]:
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,DepartmentDescription,FinelineNumber,is_train_set,returns,rScount,numSysChar,manNum
0,5,Friday,6.811315e+10,-1,FINANCIAL SERVICES,1000.0,1,1,0,regular,81131
1,9,Friday,1.070081e+09,1,IMPULSE MERCHANDISE,115.0,1,0,1,regular,10700
2,9,Friday,3.107000e+03,1,PRODUCE,103.0,1,0,1,regular,0
3,9,Friday,4.011000e+03,1,PRODUCE,5501.0,1,0,1,regular,0
4,10,Friday,6.414410e+09,1,DSD GROCERY,2008.0,1,0,1,regular,64144
...,...,...,...,...,...,...,...,...,...,...,...
193638,191346,Sunday,3.120033e+09,1,DSD GROCERY,4639.0,0,0,1,regular,31200
193639,191346,Sunday,3.700091e+09,1,HOUSEHOLD CHEMICALS/SUPP,8947.0,0,0,1,regular,37000
193640,191346,Sunday,3.239000e+10,1,PHARMACY OTC,1118.0,0,0,1,3,23900
193641,191346,Sunday,7.874205e+09,1,FROZEN FOODS,1752.0,0,0,1,regular,78742


### 1.5 Dummies, groupby columns 

Now we create the dummy columns: the one-hot encoding of 
`DepartmentDescription` and `numSysChar`.

Then we group by `VisitNumber` and `Weekday` (each visit should have a single 
weekday, so this is the same as grouping by visit) and sum all values, 
including `ScanCount` and the one-hot columns.

In [22]:
clean_df = pd.get_dummies(
    clean_df, 
    columns=['DepartmentDescription'], 
    dummy_na=True)
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,FinelineNumber,is_train_set,returns,rScount,numSysChar,manNum,...,DepartmentDescription_SEASONAL,DepartmentDescription_SERVICE DELI,DepartmentDescription_SHEER HOSIERY,DepartmentDescription_SHOES,DepartmentDescription_SLEEPWEAR/FOUNDATIONS,DepartmentDescription_SPORTING GOODS,DepartmentDescription_SWIMWEAR/OUTERWEAR,DepartmentDescription_TOYS,DepartmentDescription_WIRELESS,DepartmentDescription_nan
0,5,Friday,6.811315e+10,-1,1000.0,1,1,0,regular,81131,...,0,0,0,0,0,0,0,0,0,0
1,9,Friday,1.070081e+09,1,115.0,1,0,1,regular,10700,...,0,0,0,0,0,0,0,0,0,0
2,9,Friday,3.107000e+03,1,103.0,1,0,1,regular,0,...,0,0,0,0,0,0,0,0,0,0
3,9,Friday,4.011000e+03,1,5501.0,1,0,1,regular,0,...,0,0,0,0,0,0,0,0,0,0
4,10,Friday,6.414410e+09,1,2008.0,1,0,1,regular,64144,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193638,191346,Sunday,3.120033e+09,1,4639.0,0,0,1,regular,31200,...,0,0,0,0,0,0,0,0,0,0
193639,191346,Sunday,3.700091e+09,1,8947.0,0,0,1,regular,37000,...,0,0,0,0,0,0,0,0,0,0
193640,191346,Sunday,3.239000e+10,1,1118.0,0,0,1,3,23900,...,0,0,0,0,0,0,0,0,0,0
193641,191346,Sunday,7.874205e+09,1,1752.0,0,0,1,regular,78742,...,0,0,0,0,0,0,0,0,0,0


In [23]:
clean_df = pd.get_dummies(
    clean_df, 
    columns=['numSysChar'], 
    dummy_na=False)
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,FinelineNumber,is_train_set,returns,rScount,manNum,DepartmentDescription_1-HR PHOTO,...,DepartmentDescription_SWIMWEAR/OUTERWEAR,DepartmentDescription_TOYS,DepartmentDescription_WIRELESS,DepartmentDescription_nan,numSysChar_-1,numSysChar_2,numSysChar_3,numSysChar_4,numSysChar_cupon,numSysChar_regular
0,5,Friday,6.811315e+10,-1,1000.0,1,1,0,81131,0,...,0,0,0,0,0,0,0,0,0,1
1,9,Friday,1.070081e+09,1,115.0,1,0,1,10700,0,...,0,0,0,0,0,0,0,0,0,1
2,9,Friday,3.107000e+03,1,103.0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
3,9,Friday,4.011000e+03,1,5501.0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,10,Friday,6.414410e+09,1,2008.0,1,0,1,64144,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193638,191346,Sunday,3.120033e+09,1,4639.0,0,0,1,31200,0,...,0,0,0,0,0,0,0,0,0,1
193639,191346,Sunday,3.700091e+09,1,8947.0,0,0,1,37000,0,...,0,0,0,0,0,0,0,0,0,1
193640,191346,Sunday,3.239000e+10,1,1118.0,0,0,1,23900,0,...,0,0,0,0,0,0,1,0,0,0
193641,191346,Sunday,7.874205e+09,1,1752.0,0,0,1,78742,0,...,0,0,0,0,0,0,0,0,0,1


In [24]:
clean_df = clean_df.groupby(['VisitNumber', 'Weekday'], as_index=False).sum()

In [25]:
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,FinelineNumber,is_train_set,returns,rScount,manNum,DepartmentDescription_1-HR PHOTO,...,DepartmentDescription_SWIMWEAR/OUTERWEAR,DepartmentDescription_TOYS,DepartmentDescription_WIRELESS,DepartmentDescription_nan,numSysChar_-1,numSysChar_2,numSysChar_3,numSysChar_4,numSysChar_cupon,numSysChar_regular
0,5,Friday,6.811315e+10,-1,1000.0,1,1,0,81131,0,...,0,0,0,0,0,0,0,0,0,1
1,7,Friday,6.794963e+10,2,13435.0,0,0,2,79496,0,...,0,0,0,0,0,0,0,0,0,2
2,8,Friday,4.259239e+11,28,58669.0,0,2,30,859233,0,...,0,0,0,1,1,2,0,0,0,20
3,9,Friday,1.070088e+09,3,5719.0,3,0,3,10700,0,...,0,0,0,0,0,0,0,0,0,3
4,10,Friday,1.700927e+10,3,10073.0,3,0,3,170092,0,...,0,0,0,0,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95669,191343,Sunday,4.516225e+11,9,33991.0,7,0,9,516223,0,...,0,0,0,0,0,0,0,0,0,7
95670,191344,Sunday,1.614572e+11,5,15127.0,5,0,5,314571,0,...,0,0,1,0,0,0,0,0,0,5
95671,191345,Sunday,1.191780e+11,17,49902.0,0,0,17,591775,0,...,0,0,0,0,0,0,2,0,0,11
95672,191346,Sunday,9.870798e+10,17,72908.0,0,0,17,687075,0,...,0,0,0,0,0,0,1,0,0,16
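
To see why summing one-hot columns is useful, consider this small sketch on made-up rows: after the `groupby(...).sum()`, each department column holds the number of item rows of that department in the visit.

```python
import pandas as pd

# Made-up item rows for two visits.
toy = pd.DataFrame({
    "VisitNumber": [9, 9, 9, 10],
    "Weekday": ["Friday"] * 4,
    "ScanCount": [1, 1, 2, 1],
    "DepartmentDescription": ["PRODUCE", "PRODUCE", "DAIRY", "DAIRY"],
})

# One 0/1 column per department.
encoded = pd.get_dummies(toy, columns=["DepartmentDescription"], dtype=int)

# One row per visit; one-hot columns turn into per-visit counts.
per_visit = encoded.groupby(["VisitNumber", "Weekday"], as_index=False).sum()

print(per_visit)
```

Note that the sum is applied to every numeric column, so identifiers such as `Upc` or `FinelineNumber` are summed as well, and those sums have no direct interpretation.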


drop columns 

In [26]:
clean_df = clean_df.drop(
    ['numSysChar_-1', 'DepartmentDescription_HEALTH AND BEAUTY AIDS',
    'DepartmentDescription_SEASONAL'], axis=1)
clean_df

Unnamed: 0,VisitNumber,Weekday,Upc,ScanCount,FinelineNumber,is_train_set,returns,rScount,manNum,DepartmentDescription_1-HR PHOTO,...,DepartmentDescription_SPORTING GOODS,DepartmentDescription_SWIMWEAR/OUTERWEAR,DepartmentDescription_TOYS,DepartmentDescription_WIRELESS,DepartmentDescription_nan,numSysChar_2,numSysChar_3,numSysChar_4,numSysChar_cupon,numSysChar_regular
0,5,Friday,6.811315e+10,-1,1000.0,1,1,0,81131,0,...,0,0,0,0,0,0,0,0,0,1
1,7,Friday,6.794963e+10,2,13435.0,0,0,2,79496,0,...,0,0,0,0,0,0,0,0,0,2
2,8,Friday,4.259239e+11,28,58669.0,0,2,30,859233,0,...,0,0,0,0,1,2,0,0,0,20
3,9,Friday,1.070088e+09,3,5719.0,3,0,3,10700,0,...,0,0,0,0,0,0,0,0,0,3
4,10,Friday,1.700927e+10,3,10073.0,3,0,3,170092,0,...,0,0,0,0,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95669,191343,Sunday,4.516225e+11,9,33991.0,7,0,9,516223,0,...,0,0,0,0,0,0,0,0,0,7
95670,191344,Sunday,1.614572e+11,5,15127.0,5,0,5,314571,0,...,0,0,0,1,0,0,0,0,0,5
95671,191345,Sunday,1.191780e+11,17,49902.0,0,0,17,591775,0,...,0,0,0,0,0,0,2,0,0,11
95672,191346,Sunday,9.870798e+10,17,72908.0,0,0,17,687075,0,...,0,0,0,0,0,0,1,0,0,16


In [27]:
clean_df = pd.get_dummies(clean_df, columns=["Weekday"], dummy_na=False)

In [28]:
df_test = clean_df[clean_df.is_train_set == 0]
clean_df = clean_df[clean_df.is_train_set != 0]

In [29]:
clean_df = clean_df.drop(["is_train_set"], axis=1)
df_test = df_test.drop(["is_train_set"], axis=1)

## 3 Models Train and Test 
---

Load the data...

### 3.1 Create the model and evaluate it

Split the training dataset into train and "validation" sets. Model selection 
below relies on cross-validation, so the validation set is only used for a 
final accuracy check, but it could be more useful for you depending on your approach.

In [30]:
# state = 42
#state = np.random.RandomState(42)
X_train, X_valid, y_train, y_valid = train_test_split(
    clean_df, y, 
    test_size=0.2, 
    random_state=42)

In [31]:
print(X_train.shape, y_train.shape)

(53623, 86) (53623,)
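
The split above is purely random. Since the classes are not balanced, one optional variation is to pass `stratify=y` so that both parts keep the class proportions. This is a sketch on made-up data, not what the notebook does; stratification needs at least two samples of every class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and labels: 10, 6 and 4 samples of three classes.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 6 + [2] * 4)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

# Class counts in each half of the split.
train_counts = np.bincount(y_tr).tolist()
valid_counts = np.bincount(y_va).tolist()

print(train_counts, valid_counts)
```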


### 3.2 `GradientBoostingClassifier` 

In [None]:
kfold = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

parameters = {
    'learning_rate': [0.1],
    'loss': ['deviance'],
    'min_samples_split': [2],
    #'criterion': ['friedman_mse', 'mse', 'mae']
    'max_depth': [3],
    
}

clf3 = GradientBoostingClassifier(random_state=42, n_estimators=115)
boost_clf3 = GridSearchCV(clf3, parameters, cv=kfold, scoring='accuracy', n_jobs=6) 
# scoring='balanced_accuracy')
boost_clf3.fit(X_train, y_train)
best_tree_clf = boost_clf3.best_estimator_
# clf3.fit(X_train, y_train)

In [None]:
print(f'Best GradientBoostingClassifier: Score = {best_tree_clf.score(X_valid, y_valid)}')
print('Best GradientBoostingClassifier: ', boost_clf3.best_score_)
print(best_tree_clf)
# `results` is not defined earlier in the notebook, so it is created here.
results = pd.DataFrame(
    [{'clf': best_tree_clf, 'best_acc': boost_clf3.best_score_}]
    )

print('The best classifier so far is: ')
print(results.loc[results['best_acc'].idxmax()]['clf'])

### 3.3 `XGBClassifier` (Extreme Gradient Boosting)


In [33]:
kfold = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

xgbc = XGBClassifier(random_state=42)
parameters = {
    'eta': [0.1],
}
xgbc_clf = GridSearchCV(
    xgbc, parameters, cv=kfold, scoring='balanced_accuracy', n_jobs=6)

xgbc_clf.fit(X_train, y_train)
best_tree_clf = xgbc_clf.best_estimator_
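
Depending on the installed `xgboost` version, `XGBClassifier.fit` may reject class labels that are not consecutive integers starting at 0, and the `TripType` values here include `999`. The run recorded in this notebook accepted them as they are. If yours does not, one possible workaround is to encode the labels with scikit-learn's `LabelEncoder` and decode the predictions afterwards, as in this sketch on made-up labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up TripType labels.
y_raw = np.array([999, 8, 8, 35, 41, 999])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_raw)   # labels become 0..n_classes-1

# A model would be fitted on y_encoded; its predictions are decoded with:
y_back = encoder.inverse_transform(y_encoded)

print(encoder.classes_.tolist(), y_encoded.tolist())
```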

In [37]:
best_tree_clf

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.1, gamma=0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.100000001, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=42, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [39]:
y_pred = best_tree_clf.predict(X_valid)
predictions = [value for value in y_pred]

accuracy = accuracy_score(y_valid, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# print('Best GradientBoostingClassifier: ', best_tree_clf.best_score_)

Accuracy: 71.87%
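
Note that the grid search above selects the model by `balanced_accuracy`, while the number printed here is plain accuracy. The two can differ a lot on imbalanced data, as this made-up example shows: a classifier that always predicts the majority class gets a high accuracy but a low balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: three samples of class 0, one of class 1.
y_true = [0, 0, 0, 1]
# A classifier that always predicts the majority class.
y_pred = [0, 0, 0, 0]

acc = accuracy_score(y_true, y_pred)               # fraction of correct rows
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print(acc, bal_acc)
```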


### 3.4 `RandomForestClassifier`

In [None]:
# `state` is only defined in a commented-out line above, so the seed is written explicitly.
rf_model = RandomForestClassifier(n_estimators=150, random_state=42)

kfold = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

model_params = {
    #'max_features': [32, 64], # X_train.shape[1]],
    # 'max_features': ['auto', 'sqrt', 'log2'],
    #'min_samples_leaf':[1, 2],
    'min_samples_split': [2, 3, 4, 6],
    'class_weight': ['balanced'],
    'max_depth': [64, 96, 108, 128],
    'bootstrap': [False],
}


tree_clf = GridSearchCV(
    rf_model, model_params, cv=kfold, scoring='accuracy', 
    n_jobs=6
) # scoring='balanced_accuracy')

tree_clf.fit(X_train, y_train)
best_tree_clf = tree_clf.best_estimator_

In [None]:
print(f'Best Random Forest accuracy: Score = {best_tree_clf.score(X_valid, y_valid)}')

### 3.5 `DecisionTreeClassifier`

In [None]:
tree_param = {
    'criterion':('gini', 'entropy'),
    'min_samples_leaf':(1, 2, 5),
    'min_samples_split':(2, 3, 5, 10, 50, 100)}

tree = DT(random_state=42)
tree_clf = GridSearchCV(tree, tree_param, cv=3, scoring='accuracy') #scoring='balanced_accuracy')
tree_clf.fit(X_train, y_train)
best_tree_clf = tree_clf.best_estimator_

In [None]:
print(f'Best Decision Tree accuracy: Score = {best_tree_clf.score(X_valid, y_valid)}')

In [None]:
# sns.set_context(context='talk', font_scale=0.5)
fig = plt.figure(figsize=(25,25))
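ax = plt.subplot(111)
# Use the fitted estimator from the last grid search; `rf_model` itself
# is never fitted because GridSearchCV fits clones of it.
plot_confusion_matrix(
    best_tree_clf, X_valid, y_valid,
    cmap=plt.cm.Blues,
    normalize='true',
    ax=ax
    )
plt.title('Confusion matrix of the best model')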
plt.show()

## 4 Results write back
---

In [None]:
# This ranks the features by their importance to the model
# (tail(20) shows the 20 least important ones).
# Taken from https://www.kaggle.com/zlatankr/titanic-random-forest-82-78/data
pd.concat((
    pd.DataFrame(X_train.columns, columns = ['variable']), 
    pd.DataFrame(
        best_tree_clf.feature_importances_, columns = ['importance'])), 
    axis=1
).sort_values(by='importance', ascending=False).tail(20)


In [None]:
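# `xgbc` itself is never fitted (GridSearchCV fits clones of it), so predict
# with the fitted search object, which delegates to its best estimator.
yy = xgbc_clf.predict(df_test)
## yy = best_tree_clf.predict(df_test)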

In [None]:
submission = pd.DataFrame(
    list(zip(df_test.VisitNumber, yy)), 
    columns=["VisitNumber", "TripType"])

submission

In [None]:
submission.to_csv("../data/submission.csv", header=True, index=False)

---
## End