<h1>XGBoost vs FastAI/PyTorch ConvNet on Categorical Data: Part I</h1>

<h3>Part I </h3>

This section demonstrates the efficacy of XGBoost straight out of the box, with no hyperparameter tuning. The script reaches an accuracy of 64.07% on the test set. Feature engineering is minimal: all continuous variables are excluded, and the remaining categorical variables are one-hot encoded.

<h3>Part II </h3> 

This section will implement a ConvNet on the same dataset.


<h3>Part III</h3>

Part III will compare both algorithms after hyperparameter tuning. A twist in Part III will be the conversion of all categorical variables to <b><i>entity embeddings</i></b>, where each level of a variable is mapped to a point in a continuous n-dimensional space.
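To make the idea of an entity embedding concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method Part III will use: the helper `build_embedding_table` is hypothetical, the currency values are toy data, and the vectors are random, whereas in a neural network they would be learned during training.

```python
import random


def build_embedding_table(levels, n_dims, seed=0):
    """Assign every distinct level a vector of n_dims floats.

    Hypothetical helper for illustration only: the vectors are random,
    whereas a trained network would learn them from the data.
    """
    rng = random.Random(seed)
    return {
        level: [rng.uniform(-1.0, 1.0) for _ in range(n_dims)]
        for level in sorted(set(levels))
    }


# Toy column of categorical values (not taken from the dataset).
currencies = ["GBP", "USD", "USD", "EUR"]

table = build_embedding_table(currencies, n_dims=3)

# Replace each categorical value by its n-dimensional vector.
embedded = [table[c] for c in currencies]

print(len(table), "levels, each mapped to", len(embedded[0]), "dimensions")
```

Rows that share a level share a vector, so the width of the encoded column is the chosen `n_dims` rather than the number of levels.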


-------------------

<h3>A bit about the data</h3>

The aim with this Kaggle dataset is to predict whether a Kickstarter project will succeed or fail, based on a mix of categorical and continuous variables.

Here is a link to the datasets, hosted on Kaggle: https://www.kaggle.com/kemical/kickstarter-projects.

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing
import re
import math
from sklearn.metrics import accuracy_score


%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt

In [4]:
import sys
sys.path.append("/home/paperspace/fastai/")

from fastai.imports import *
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.sgdr import *
from fastai.plots import *
from fastai.structured import *
from fastai.column_data import *

In [5]:
dataset   = pd.read_csv("./ks-projects-201612.csv", encoding="latin1", low_memory=False)

In [6]:
dataset.iloc[:5,:]

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


<h3>Descriptive stats on the data</h3>

In [7]:
print("the number of instances (rows) are:  ", dataset.shape[0])
print("the number of features (columns) are:", dataset.shape[1])

the number of instances (rows) are:   323750
the number of features (columns) are: 17


In [8]:
print("Stats on the 'project state' label-variable: \n\n")
print(dataset.iloc[:,9].describe())
print("\n\n\nBreakdown of the top 15 levels of the label-variable: \n\n")
print(dataset.iloc[:,9].value_counts().iloc[:15])

Stats on the 'project state' label-variable: 


count     323750
unique       410
top       failed
freq      168221
Name: state , dtype: object



Breakdown of the top 15 levels of the label-variable: 


failed        168221
successful    113081
canceled       32354
live            4428
undefined       3555
suspended       1479
0                 96
1                 15
5                 11
25                11
10                10
100                8
50                 7
65                 6
55                 5
Name: state , dtype: int64


In [9]:
datasetClean = dataset.iloc[np.where((dataset.iloc[:,9].values=='failed') | (dataset.iloc[:,9].values=='successful'))]

In [10]:
print("reducing the number of levels of the label-variable to just two:\n\n")
print(datasetClean.iloc[:,9].value_counts())

reducing the number of levels of the label-variable to just two:


failed        168221
successful    113081
Name: state , dtype: int64


In [11]:
print("Count the number of missing values in the dataset: \n\n")
datasetClean.isnull().sum(axis=0)

Count the number of missing values in the dataset: 




ID                     0
name                   3
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged          210
Unnamed: 13       281302
Unnamed: 14       281302
Unnamed: 15       281302
Unnamed: 16       281302
dtype: int64

<h5>The Model will predict State using the Category, Main Category, Currency and Country variables.</h5>

In [12]:
## subsetting category, main_category, currency, state, country
datasetSub = datasetClean.iloc[:,[2,3,4,9,11]]
datasetSub.head()

Unnamed: 0,category,main_category,currency,state,country
0,Poetry,Publishing,GBP,failed,GB
1,Narrative Film,Film & Video,USD,failed,US
2,Music,Music,USD,failed,US
4,Restaurants,Food,USD,successful,US
5,Food,Food,USD,successful,US


In [13]:
datasetSub.describe()

Unnamed: 0,category,main_category,currency,state,country
count,281302,281302,281302,281302,281302
unique,158,15,13,2,22
top,Product Design,Film & Video,USD,failed,US
freq,14516,51057,229517,168221,229366


In [14]:
print("the number of instances (rows) are:  ", datasetSub.shape[0])
print("the number of features (columns) are:", datasetSub.shape[1])

the number of instances (rows) are:   281302
the number of features (columns) are: 5


<h3>One Hot Encoding the dataset</h3>

<h5>An example of one hot encoding</h5>

In [15]:
df = pd.DataFrame({'Primary': ['alpha', 'beta', 'alpha'], 'Secondary': ['uno', 'dos', 'tres'], 'Tertiary': ["1", "2", "3"]})

In [16]:
df

Unnamed: 0,Primary,Secondary,Tertiary
0,alpha,uno,1
1,beta,dos,2
2,alpha,tres,3


The final dimensions of the one-hot encoded dataset depend on the number of variables and the number of levels in each variable:
- There are 2 levels in the first column of `df`.
- There are 3 levels in the second column.
- There are 3 levels in the third column.

Each level becomes its own indicator column, so the encoded dataset needs 2 + 3 + 3 = 8 columns. With 3 rows, the output will be 3 x 8.

In [17]:
pd.get_dummies(df)

Unnamed: 0,Primary_alpha,Primary_beta,Secondary_dos,Secondary_tres,Secondary_uno,Tertiary_1,Tertiary_2,Tertiary_3
0,1,0,0,0,1,1,0,0
1,0,1,1,0,0,0,1,0
2,1,0,0,1,0,0,0,1
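To see where the eight columns come from, the same table can be built by hand. The following is a minimal sketch in plain Python. The function `one_hot` is a hypothetical helper written for this illustration, not part of pandas, and it assumes the columns are ordered by variable and then alphabetically by level.

```python
# The toy data from `df`, written as a list of plain dictionaries.
rows = [
    {"Primary": "alpha", "Secondary": "uno", "Tertiary": "1"},
    {"Primary": "beta", "Secondary": "dos", "Tertiary": "2"},
    {"Primary": "alpha", "Secondary": "tres", "Tertiary": "3"},
]


def one_hot(rows):
    """Return (column_names, encoded_rows) with one 0/1 column per level."""
    variables = list(rows[0].keys())
    columns = []
    for var in variables:
        # One indicator column for each distinct level of this variable.
        for level in sorted({row[var] for row in rows}):
            columns.append((var, level))
    encoded = [
        [1 if row[var] == level else 0 for var, level in columns]
        for row in rows
    ]
    names = [f"{var}_{level}" for var, level in columns]
    return names, encoded


names, encoded = one_hot(rows)
print(names)
for line in encoded:
    print(line)
print(len(encoded), "x", len(names))
```

Each row contains exactly one 1 per original variable, which is why the row sums equal the number of variables.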


As a sanity check, the dimensions of the one-hot encoded (OHE) train and test sets can be calculated ahead of time if we know the number of variables and the number of levels in each.

In [18]:
datasetSub.describe()

Unnamed: 0,category,main_category,currency,state,country
count,281302,281302,281302,281302,281302
unique,158,15,13,2,22
top,Product Design,Film & Video,USD,failed,US
freq,14516,51057,229517,168221,229366


From the description of `datasetSub`:
- There are 158 levels in the `category` variable.
- There are 15 levels in the `main_category` variable.
- There are 13 levels in the `currency` variable.
- There are 2 levels in the `state` variable.
- There are 22 levels in the `country` variable.

If we subset out the state variable, which we will require to be the binary label, we can calculate the dimensions of the OHE dataset thus:

In [22]:
print("281302," ,158 + 15 + 13 + 22)

281302, 208


In [23]:
datasetOHEPrep = datasetSub.iloc[:,[0,1,2,4]]
datasetOHEPrep.head()

Unnamed: 0,category,main_category,currency,country
0,Poetry,Publishing,GBP,GB
1,Narrative Film,Film & Video,USD,US
2,Music,Music,USD,US
4,Restaurants,Food,USD,US
5,Food,Food,USD,US


In [24]:
datasetOHE = pd.get_dummies(datasetOHEPrep)
datasetOHE.shape

(281302, 208)

Now to encode the `state` variable as either 0 (failed) or 1 (successful) and stitch it back onto the dataset.

In [25]:
datasetBinaryLabel = datasetSub.iloc[:,3].replace(to_replace={"failed": 0, "successful" : 1})
datasetBinaryLabel.head()

0    0
1    0
2    0
4    1
5    1
Name: state , dtype: int64

In [26]:
datasetPrep = pd.concat([datasetOHE, datasetBinaryLabel], axis=1)
datasetPrep.head()

Unnamed: 0,category _3D Printing,category _Academic,category _Accessories,category _Action,category _Animals,category _Animation,category _Anthologies,category _Apparel,category _Apps,category _Architecture,...,country _LU,country _MX,"country _N,""0",country _NL,country _NO,country _NZ,country _SE,country _SG,country _US,state
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1


In [45]:
datasetPrep_Reindex = datasetPrep.reset_index(drop=True)

Limit the dataset to 200,000 rows for easier maths.
Take 60% of the rows for the training set, and 20% each for the test and validation sets.

In [47]:
datasetTrain = datasetPrep_Reindex.iloc[0:120000,:]
datasetTest  = datasetPrep_Reindex.iloc[120000:160000,:]
datasetValid = datasetPrep_Reindex.iloc[160000:200000,:]

In [48]:
print(datasetTrain.shape)
print(datasetTest.shape)
print(datasetValid.shape)

(120000, 209)
(40000, 209)
(40000, 209)
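The row boundaries used above follow directly from the 60/20/20 fractions. As a minimal sketch, they could be derived rather than typed in by hand. The helper `split_bounds` is hypothetical and assumes the splits are taken as consecutive, unshuffled row ranges, as in the cell above.

```python
def split_bounds(n_rows, fractions):
    """Turn fractions such as (0.6, 0.2, 0.2) into consecutive (start, stop) ranges."""
    bounds = []
    start = 0
    for frac in fractions:
        stop = start + round(n_rows * frac)
        bounds.append((start, stop))
        start = stop
    return bounds


train_rows, test_rows, valid_rows = split_bounds(200_000, (0.6, 0.2, 0.2))
print(train_rows, test_rows, valid_rows)
```

Each pair can be passed to `iloc` as `start:stop`, which keeps the three sets contiguous and non-overlapping.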


In [49]:
datasetTrain_DMATRIX = xgb.DMatrix(data = datasetTrain.iloc[1:,0:208], label = datasetTrain.iloc[1:,[208]])

In [50]:
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
num_round = 2
bst = xgb.train(param, datasetTrain_DMATRIX, num_round)

In [51]:
datasetTest_DMATRIX = xgb.DMatrix(data = datasetTest.iloc[1:,0:208], label = datasetTest.iloc[1:,[208]])

In [52]:
predict = bst.predict(datasetTest_DMATRIX)

In [53]:
predictions = [round(value) for value in predict]
print(len(predictions))

39999


In [54]:
accuracy = accuracy_score(datasetTest.iloc[1:,[208]], predictions)

In [55]:
print(accuracy)

0.6407160179004475
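For context, the sketch below shows in plain Python how an accuracy figure of this kind is obtained from predicted probabilities, and compares it with a majority-class baseline. The function `accuracy` is a hypothetical stand-in for `accuracy_score`, and the labels and probabilities are toy values. The baseline uses the `failed` and `successful` counts reported earlier for the full 281,302 rows, so it is only a rough reference for the 39,999-row test slice.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Share of rows whose thresholded probability matches the 0/1 label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


# Toy labels and predicted probabilities (not taken from the model).
toy_labels = [0, 1, 1, 0]
toy_probabilities = [0.2, 0.7, 0.4, 0.1]
toy_accuracy = accuracy(toy_labels, toy_probabilities)
print(toy_accuracy)

# Majority-class baseline: always predict "failed".
failed, successful = 168221, 113081
baseline = max(failed, successful) / (failed + successful)
print(round(baseline, 4))
```

On these counts, always predicting `failed` would be right about 59.8% of the time, which gives a sense of how much the 64.07% above adds.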


<h3>From XGBoost to a Convolutional Neural Network</h3>

In [58]:
PATH = '/home/paperspace/fastai/courses/dl1/XGBoost/'
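# Slicing rows with [0:208] keeps every column, so this lists all 209 column names
columnNames = list(datasetTest[0:208])
print(len(columnNames))
# Drop the trailing label column so that only the 208 feature columns remain
del columnNames[-1]
print(len(columnNames))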

209
208


In [59]:
columnNames

['category _3D Printing',
 'category _Academic',
 'category _Accessories',
 'category _Action',
 'category _Animals',
 'category _Animation',
 'category _Anthologies',
 'category _Apparel',
 'category _Apps',
 'category _Architecture',
 'category _Art',
 'category _Art Books',
 'category _Audio',
 'category _Bacon',
 'category _Blues',
 'category _Calendars',
 'category _Camera Equipment',
 'category _Candles',
 'category _Ceramics',
 "category _Children's Books",
 'category _Childrenswear',
 'category _Chiptune',
 'category _Civic Design',
 'category _Classical Music',
 'category _Comedy',
 'category _Comic Books',
 'category _Comics',
 'category _Community Gardens',
 'category _Conceptual Art',
 'category _Cookbooks',
 'category _Country & Folk',
 'category _Couture',
 'category _Crafts',
 'category _Crochet',
 'category _DIY',
 'category _DIY Electronics',
 'category _Dance',
 'category _Design',
 'category _Digital Art',
 'category _Documentary',
 'category _Drama',
 'category _Dri

In [None]:
train_ratio = 0.75
train_size = int(samp_size * train_ratio)
val_idx = list(range(train_size, len(df)))

In [73]:
md = ColumnarModelData.from_data_frames(PATH, 
                                        datasetTrain.iloc[1:,0:208], 
                                        datasetValid.iloc[1:,0:208],
                                        datasetTrain.iloc[1:,[208]].astype('int'), 
                                        datasetValid.iloc[1:,[208]].astype('int'), 
                                        columnNames,
                                        128,
                                        test_df=datasetTest.iloc[1:,0:208],
                                        is_reg=False,
                                        is_multi=False
                                       )

In [None]:
## m = md.get_learner(emb_szs=, y_range=[0,1])