In [1]:
import numpy as np
import pandas as pd
import random as rd
from collections import defaultdict
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Load Data 

In [2]:
train = pd.read_csv('data/train.csv')
train = train.set_index('ID')


Get a sample to play with if we need it

In [3]:
# random.sample needs a sequence, so convert the index to a list first
samp = train.loc[rd.sample(list(train.index), 5000)]

## EDA

In [4]:
# Variable types
vartypes = train.dtypes

# Divide into categories
cat_df = train.loc[:, vartypes[vartypes=='O'].index]
int_df = train.loc[:, vartypes[vartypes=='int64'].index]
float_df = train.loc[:, vartypes[vartypes=='float64'].index]


### Categorical

The first thing we need to do is make a summary of all the categorical variables, so that we can figure out how to process them. Things I would like to look at:
- Number of unique values
- Frequency of top value
- Top Item
- Number of missing values
- First few examples
- Spread of target frequencies
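
A minimal sketch of a per-column summary covering the items above. The real implementation is get_summary in proc_functions, which is not shown here; summarize_column and the toy data are hypothetical stand-ins. "Spread of target frequencies" is assumed to mean the gap between the highest and lowest per-category target rate.

```python
import numpy as np
import pandas as pd


def summarize_column(col, target, n_examples=4):
    """Summarise one categorical column (hypothetical stand-in for get_summary)."""
    counts = col.value_counts(dropna=True)
    # Mean of the binary target within each category (missing keys are dropped)
    target_rate = target.groupby(col).mean()
    return pd.Series({
        'Examples': ', '.join(str(v) for v in col.dropna().unique()[:n_examples]),
        'Fraction Missing': col.isna().mean(),
        'Freq Range': target_rate.max() - target_rate.min(),
        'NumUnique': col.nunique(),
        'Top Freq': counts.iloc[0] / len(col),
        'Top Item': counts.index[0],
    })


# Toy data, not from train.csv
toy = pd.DataFrame({
    'colour': ['red', 'red', 'blue', np.nan],
    'target': [1, 0, 1, 0],
})
summary = summarize_column(toy['colour'], toy['target'])
print(summary)
```

Note that if nans are replaced with 'NA' before summarising, the missing fraction computed this way will be 0 and 'NA' is counted as an ordinary category.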

In [5]:
from proc_functions import get_summary

# Replace nans
cat_df = cat_df.replace(np.nan, 'NA')

# Summarise each categorical variable
cat_df.apply(lambda col: get_summary(col, train['target'])).T

Variable,Examples,Fraction Missing,Freq Range,NumUnique,Top Freq,Top Item
VAR_0001,"H, R, Q",0,0.06985867,3,0.5845377,R
VAR_0005,"C, B, N, S",0,0.117376,4,0.491968,B
VAR_0008,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0009,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0010,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0011,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0012,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0043,"False, NA",0,0.000404069,2,0.9996144,False
VAR_0044,"[], NA",0,0.000404069,2,0.9996144,[]
VAR_0073,"NA, 04SEP12:00:00:00, 26JAN12:00:00:00, 18SEP1...",0,1.0,1459,0.6963183,


Use the above summaries to categorize the variables in the first part of process_categorical.py

#### Dates
Start by converting all date columns from raw strings (e.g. 04SEP12:00:00:00) to datetimes

In [6]:
from process_categorical import convert_all_date_columns

date_df = convert_all_date_columns(train)
date_df.describe().T

Variable,count,unique,top,freq,first,last
VAR_0073,44104,1458,2009-03-13 00:00:00,260,2008-01-02 00:00:00,2012-10-31 00:00:00
VAR_0075,145175,2371,2010-09-22 00:00:00,1168,2001-01-01 00:00:00,2012-11-01 00:00:00
VAR_0156,5870,730,2011-12-12 00:00:00,45,2008-04-11 00:00:00,2012-10-29 00:00:00
VAR_0157,920,424,2012-07-27 00:00:00,8,2008-10-17 00:00:00,2012-10-31 00:00:00
VAR_0158,2089,407,2012-06-01 00:00:00,28,2008-09-30 00:00:00,2012-10-29 00:00:00
VAR_0159,5870,650,2012-06-04 00:00:00,49,2008-09-23 00:00:00,2012-10-29 00:00:00
VAR_0166,14230,2145,2011-12-19 00:00:00,45,2002-07-30 00:00:00,2012-10-30 00:00:00
VAR_0167,2567,853,2011-11-03 00:00:00,27,2005-02-12 00:00:00,2012-10-30 00:00:00
VAR_0168,10725,1645,2011-11-03 00:00:00,206,1999-12-31 00:00:00,2012-11-01 00:00:00
VAR_0169,14230,1908,2011-12-19 00:00:00,48,2002-07-30 00:00:00,2012-11-01 00:00:00
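
For reference, a sketch of how a single raw date column could be parsed. It assumes every date string follows the pattern seen in the categorical summary (for example 04SEP12:00:00:00); parse_date_column is a hypothetical helper, not the code in process_categorical.py.

```python
import numpy as np
import pandas as pd


def parse_date_column(col):
    """Parse strings like '04SEP12:00:00:00'; unparseable entries become NaT."""
    # Assumed format: day, abbreviated month, two-digit year, then time of day
    return pd.to_datetime(col, format='%d%b%y:%H:%M:%S', errors='coerce')


# Toy input using the two example strings from the summary table
raw = pd.Series(['04SEP12:00:00:00', '26JAN12:00:00:00', np.nan])
parsed = parse_date_column(raw)
print(parsed)
```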


Now do a little bit of feature engineering on the dates:
- Take all pairwise differences between dates
- Earliest date record
- Most recent date record
- Number of non-nan date records

In [106]:
# Feature engineering on dates
from process_categorical import engineer_dates
diff_df = engineer_dates(date_df)

In [109]:
diff_df.describe().T

Variable,count,mean,std,min,25%,50%,75%,max
VAR_0073-VAR_0075,44104,1 days 23:59:19.947251,505 days 21:33:52.319903,0 days 00:00:00,2 days 00:00:00,222 days 00:00:00,713 days 06:00:00,4281 days 00:00:00
VAR_0073-VAR_0156,5787,-13 days +04:55:22.719694,179 days 20:47:54.450964,-976 days +00:00:00,1 days 00:00:00,50 days 00:00:00,153 days 00:00:00,1488 days 00:00:00
VAR_0073-VAR_0157,901,34 days 19:25:06.326304,170 days 01:03:23.695355,-860 days +00:00:00,-11 days +00:00:00,19 days 00:00:00,56 days 00:00:00,1328 days 00:00:00
VAR_0073-VAR_0158,2043,-2 days +23:14:53.392070,184 days 02:25:14.461456,-1177 days +00:00:00,-61 days +12:00:00,19 days 00:00:00,94 days 00:00:00,1357 days 00:00:00
VAR_0073-VAR_0159,5787,-6 days +10:27:48.692982,146 days 11:25:39.359484,-976 days +00:00:00,0 days 00:00:00,22 days 00:00:00,79 days 00:00:00,1357 days 00:00:00
VAR_0073-VAR_0166,13646,-2 days +22:04:05.344246,583 days 18:45:27.032582,-1428 days +00:00:00,46 days 00:00:00,287 days 00:00:00,822 days 00:00:00,2893 days 00:00:00
VAR_0073-VAR_0167,2532,-27 days +08:36:25.694656,380 days 03:00:44.062032,-649 days +00:00:00,4 days 00:00:00,97 days 12:00:00,329 days 00:00:00,2792 days 00:00:00
VAR_0073-VAR_0168,10252,-2 days +06:12:57.276802,463 days 04:21:50.644763,-1429 days +00:00:00,0 days 00:00:00,98 days 00:00:00,407 days 00:00:00,4619 days 00:00:00
VAR_0073-VAR_0169,13646,-4 days +08:04:42.599072,444 days 01:24:06.607731,-1428 days +00:00:00,3 days 00:00:00,65 days 00:00:00,371 days 00:00:00,2893 days 00:00:00
VAR_0073-VAR_0176,16885,-6 days +23:17:29.796870,554 days 17:18:24.374012,-1428 days +00:00:00,28 days 00:00:00,192 days 00:00:00,689 days 00:00:00,2893 days 00:00:00
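
A rough sketch of the date features listed earlier (pairwise differences, earliest and latest record, number of non-missing dates). engineer_dates_sketch and its output column names are assumptions for illustration; the actual engineer_dates may differ, for instance in the sign convention of the differences.

```python
from itertools import combinations

import pandas as pd


def engineer_dates_sketch(date_df):
    """Build simple per-row features from a frame of datetime columns."""
    out = pd.DataFrame(index=date_df.index)
    # Difference for every unordered pair of date columns
    for a, b in combinations(date_df.columns, 2):
        out[f'{a}-{b}'] = date_df[a] - date_df[b]
    out['earliest'] = date_df.min(axis=1)
    out['latest'] = date_df.max(axis=1)
    # Number of non-missing dates in each row
    out['n_dates'] = date_df.notna().sum(axis=1)
    return out


# Toy data, not from train.csv
toy_dates = pd.DataFrame({
    'd1': pd.to_datetime(['2012-01-01', '2012-03-01']),
    'd2': pd.to_datetime(['2012-01-11', None]),
})
features = engineer_dates_sketch(toy_dates)
print(features)
```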


#### Geography Variables
- We should make a feature regarding whether the two state variables are the same or different
- It's important that this happens before processing 'thin' columns

In [None]:
# Geography
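
One possible shape for the same-state feature, as a sketch only. The column names state_a and state_b are placeholders, since the two state variables are not named here, and leaving the feature missing when either state is missing is a design assumption.

```python
import numpy as np
import pandas as pd


def same_state_feature(df, col_a, col_b):
    """Return 1.0 if the two state columns agree, 0.0 if not, NaN if either is missing."""
    both_present = df[col_a].notna() & df[col_b].notna()
    same = (df[col_a] == df[col_b]).astype(float)
    return same.where(both_present)


# Toy data with placeholder column names
toy_geo = pd.DataFrame({
    'state_a': ['CA', 'TX', np.nan],
    'state_b': ['CA', 'NY', 'FL'],
})
same_state = same_state_feature(toy_geo, 'state_a', 'state_b')
print(same_state)
```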

#### Thin columns
- These columns are sliced too thin to be used as true categories (like categorizing people by social security number)
- Instead, we replace the significant ones with their p-values from a chi-squared test.
- We need a threshold for which p-values to actually include, to avoid fitting accidental proportions (about 1 in 20 categories will appear significant at the 95% level purely by chance)
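
A sketch of one way to score thin categories, under stated assumptions. It uses a z-score for each category's target rate against the overall rate rather than a p-value (for a single category the squared z-score equals the one-degree-of-freedom chi-squared statistic), and it interprets the threshold as a minimum category count. category_zscores and its parameters are hypothetical, not the code in process_categorical.py.

```python
import numpy as np
import pandas as pd


def category_zscores(col, target, min_count=100):
    """Replace each category with the z-score of its target rate vs. the overall rate."""
    p0 = target.mean()
    stats = target.groupby(col).agg(['mean', 'count'])
    # Standard error of a proportion under the overall rate p0
    z = (stats['mean'] - p0) / np.sqrt(p0 * (1 - p0) / stats['count'])
    # Categories with too few rows are treated as uninformative
    z = z[stats['count'] >= min_count]
    return col.map(z).fillna(0.0)


# Toy data: 'a' always hits the target, 'b' never does, 'c' is too rare to score
toy_thin = pd.DataFrame({
    'city': ['a'] * 4 + ['b'] * 4 + ['c'],
    'target': [1] * 4 + [0] * 4 + [1],
})
scores = category_zscores(toy_thin['city'], toy_thin['target'], min_count=4)
print(scores)
```

Rare or unseen categories map to 0 here, which amounts to saying they carry no evidence either way.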

In [77]:
# Get features from thinly sliced columns
from process_categorical import convert_all_thin_vars
THRESH = 100
tscore_df = convert_all_thin_vars(train, THRESH, train['target']) 

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Get a p-value for each proportion


In [138]:
d = defaultdict(int, col)

In [140]:
col.head()

ID
2        Other
4        Other
5        Other
7        Other
8    FRANKFORT
Name: VAR_0200, dtype: object

In [179]:
from process_categorical import zscore_convert, var_types
zconv = zscore_convert(thresh=100)
zconv.fit(train[var_types['thin']], train['target'])
thin_convert_df = zconv.transform(train[var_types['thin']])

In [180]:
thin_convert_df.describe()

Statistic,VAR_0200,VAR_0237,VAR_0274,VAR_0342,VAR_0404
count,145231.0,145231.0,145231.0,145231.0,145231.0
mean,-2.598308,-0.53374,-0.337783,0.875256,1.060971
std,2.660417,6.610559,5.020041,4.818371,1.010594
min,-5.931612,-13.445297,-11.673988,-7.039453,-3.632704
25%,-4.508292,-5.287099,-3.726071,-3.254337,1.339791
50%,-4.508292,-1.085341,-0.669519,0.884012,1.339791
75%,-0.437341,5.79037,3.289203,3.129548,1.339791
max,5.885018,10.505208,8.365205,9.335687,1.339791


### Numeric Variables
I suspect there are no truly non-integer float variables in the data. It looks like features with negative entries (just -1?) are being typed as floats. Another likely cause is missing values: pandas stores an integer column that contains NaN as float64.

We should verify this; if it holds, everything can be treated the same way.

In [26]:
# Verify difference between integer and float
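
A minimal sketch of the check, assuming "actual float" means a column with at least one non-whole value. integer_valued_floats is a hypothetical helper and the toy frame stands in for train.

```python
import numpy as np
import pandas as pd


def integer_valued_floats(df):
    """For each float64 column, report whether all non-missing values are whole numbers."""
    float_cols = df.select_dtypes(include='float64')
    # A value is whole if it leaves no remainder when divided by 1
    is_whole = float_cols.apply(lambda c: (c.dropna() % 1 == 0).all())
    has_nan = float_cols.isna().any()
    return pd.DataFrame({'integer_valued': is_whole, 'has_missing': has_nan})


# Toy data: 'a' is whole-valued with a gap, 'b' is a true float, 'c' is already int
toy_num = pd.DataFrame({
    'a': [1.0, -1.0, np.nan],
    'b': [0.5, 2.0, 3.0],
    'c': [1, 2, 3],
})
report = integer_valued_floats(toy_num)
print(report)
```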

#### Integer
 
- Each integer-valued variable needs to be checked to see if it should be treated as categorical or numerical
- It's possible these can all be taken care of just by using trees
- If mutual information and linear correlation tell wildly different stories, then the integer may actually be a category (for example, zip codes)
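
To illustrate the last point, here is a sketch comparing Pearson correlation with mutual information on a toy code-like variable. mutual_information is a hand-rolled estimate for discrete variables written for this example, not a library function, and the data is invented.

```python
import numpy as np
import pandas as pd


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete variables."""
    joint = pd.crosstab(x, y, normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)   # marginal of x, shape (r, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of y, shape (1, c)
    nonzero = joint > 0                     # skip empty cells to avoid log(0)
    return float((joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])).sum())


# Toy data: the code fully determines the target, but not linearly
code = pd.Series([1, 2, 3, 4] * 25, name='code')
target = pd.Series([1, 0, 0, 1] * 25, name='target')

corr = code.corr(target)                # Pearson correlation
mi = mutual_information(code, target)
print(corr, mi)
```

In this toy case the correlation is essentially zero while the mutual information is at its maximum for a balanced binary target, which is the kind of disagreement that suggests treating the integer as a category.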

#### Are all the missing values (99, 999, etc.) in the same spot?
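
A sketch of how this question could be checked: group columns by the exact set of rows in which a sentinel value appears. The sentinel list and the helper name sentinel_groups are assumptions; the actual codes used in the data would need to be confirmed first.

```python
import pandas as pd


def sentinel_groups(df, sentinels=(99, 999)):
    """Group columns whose sentinel values occur in exactly the same rows."""
    mask = df.isin(list(sentinels))
    groups = {}
    for name in mask.columns:
        if mask[name].any():
            # Use the raw bytes of the boolean mask as a hashable row pattern
            key = mask[name].to_numpy().tobytes()
            groups.setdefault(key, []).append(name)
    return list(groups.values())


# Toy data: 'a' and 'b' are missing in the same row, 'c' elsewhere, 'd' never
toy_sent = pd.DataFrame({
    'a': [1, 99, 3],
    'b': [5, 999, 7],
    'c': [99, 2, 3],
    'd': [1, 2, 3],
})
groups = sentinel_groups(toy_sent)
print(groups)
```

Large groups would suggest that the missingness comes from a shared source, in which case one indicator per group may be enough.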

#### Float
- Float variables _should_ be the easiest; they can be taken at face value

Combine models from all three variable types