# StateFarm Code Screen 

> This is Sungryong Hong. I will use both **python** and **pyspark** to solve this problem. 

> For the Spark calculations, I used my hand-installed Spark cluster with a Hadoop `hdfs` file system. I also occasionally use Dataproc on Google Cloud Platform. 

#### Encoding the test set 

## 1. Import basic libraries

In [1]:
# Basic Libraries 
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree
import gc


pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None
#pd.set_option("display.precision", 10)

# plot settings
plt.rc('font', family='serif') 
plt.rc('font', serif='Times New Roman') 
plt.rcParams.update({'font.size': 16})
plt.rcParams['mathtext.fontset'] = 'stix'

In [2]:
# Basic PySpark Libraries

# Old Style : SparkContext 
#from pyspark import SparkContext   
# SQLContext is used below, so import it explicitly instead of relying on the shell session
from pyspark.sql import SQLContext


# New Style : Spark Session  
#Shell-Mode: Spark Session Name is `spark`

sc = spark.sparkContext
sqlsc = SQLContext(sc)
sc.setCheckpointDir("hdfs://master:54310/tmp/spark/checkpoints")

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import Row
from pyspark.sql.window import Window as W

In [3]:
# Enable Arrow to speed up conversions between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

## 2. Read and Explore the Data

In [4]:
# Read data
rawdf = pd.read_csv("./exercise_01_test.csv", low_memory=False)

In [5]:
rawdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 100 columns):
x0     10000 non-null float64
x1     9998 non-null float64
x2     9997 non-null float64
x3     9997 non-null float64
x4     9998 non-null float64
x5     9999 non-null float64
x6     9996 non-null float64
x7     10000 non-null float64
x8     9999 non-null float64
x9     9998 non-null float64
x10    9997 non-null float64
x11    10000 non-null float64
x12    9999 non-null float64
x13    9996 non-null float64
x14    9998 non-null float64
x15    9999 non-null float64
x16    9999 non-null float64
x17    9998 non-null float64
x18    9999 non-null float64
x19    9999 non-null float64
x20    9999 non-null float64
x21    9998 non-null float64
x22    9998 non-null float64
x23    10000 non-null float64
x24    9995 non-null float64
x25    9997 non-null float64
x26    10000 non-null float64
x27    9999 non-null float64
x28    9999 non-null float64
x29    9999 non-null float64
x30    10000 non

In [6]:
rawdf.head(10).transpose()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| x0 | -23.2309 | 138.561 | -9.24305 | 8.96371 | 27.431 | -12.0991 | 9.45928 | 56.6346 | 30.7946 | 80.855 |
| x1 | -1.80976 | 1.10747 | -10.2073 | 17.5805 | -6.23285 | -15.7851 | 6.25889 | -12.5004 | -17.1707 | -10.6775 |
| x2 | 12.3807 | -19.781 | -7.5078 | 13.8842 | 52.7808 | -35.8741 | -4.63648 | -1.46464 | 12.7751 | -11.4187 |
| x3 | -4.1012 | -17.5848 | 3.15211 | -17.1642 | -7.0539 | -16.1511 | -23.7135 | 11.7088 | -27.5976 | 2.21547 |
| x4 | -60.7607 | -76.9221 | -14.9151 | -33.5475 | 5.67919 | -16.9613 | 31.5643 | 32.8602 | -30.7603 | -2.54534 |
| x5 | -22.9575 | 71.8168 | 30.5762 | 19.2882 | -29.6181 | 37.3716 | -36.1961 | 96.7381 | -44.4052 | -34.4462 |
| x6 | -1.96408 | -0.418432 | -0.378178 | -1.21902 | 1.33183 | -0.276613 | 1.13096 | -0.543605 | -0.960201 | 0.334544 |
| x7 | -0.631029 | 1.40396 | 2.60635 | 5.57461 | 4.42569 | -2.47701 | 1.9764 | -4.75293 | 4.7501 | -1.11558 |
| x8 | -4.30662 | -5.36705 | 1.58168 | -3.87966 | -4.21326 | -2.13307 | -2.13749 | -4.37732 | -1.95603 | -0.218714 |
| x9 | -4.6942 | 0.039857 | 4.80297 | 2.69311 | -0.398755 | -1.16244 | 0.909426 | 2.33175 | 2.91485 | -3.14503 |


> `x34`: car maker; `x35`: day of the week; `x41`: dollar amounts; `x45`: percentages; `x68`: month; `x93`: region. 

> `x41` and `x45` are numerical values stored as strings. Only `x34`, `x35`, `x68`, and `x93` are truly categorical. 

In [7]:
rawdf.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| x0 | 10000.0 | 7.659723 | 38.154705 | -146.942945 | -18.268008 | 7.50968 | 33.18692 | 143.801953 |
| x1 | 9998.0 | -3.090298 | 15.250796 | -63.335984 | -13.30154 | -3.064799 | 6.844091 | 60.33981 |
| x2 | 9997.0 | 1.111079 | 24.735211 | -96.450383 | -15.68688 | 1.764322 | 17.980219 | 90.310926 |
| x3 | 9997.0 | -0.738489 | 15.361257 | -65.740406 | -11.062302 | -0.634181 | 9.692347 | 60.732554 |
| x4 | 9998.0 | 0.47923 | 41.934791 | -159.472121 | -27.693495 | 0.533755 | 28.755266 | 158.49833 |
| x5 | 9999.0 | -1.447082 | 41.382954 | -160.162736 | -28.929959 | -1.152902 | 26.287108 | 148.496327 |
| x6 | 9996.0 | 0.007241 | 1.059882 | -3.97143 | -0.707804 | 0.007485 | 0.704268 | 4.298322 |
| x7 | 10000.0 | -0.007471 | 3.353986 | -12.138488 | -2.248168 | 0.009257 | 2.244817 | 12.827708 |
| x8 | 9999.0 | -0.663452 | 2.924507 | -11.529654 | -2.580388 | -0.666028 | 1.281501 | 10.722596 |
| x9 | 9998.0 | -0.02946 | 1.875564 | -7.123307 | -1.289821 | -0.044106 | 1.234191 | 6.945239 |


In [8]:
rawdf.dtypes

x0     float64
x1     float64
x2     float64
x3     float64
x4     float64
x5     float64
x6     float64
x7     float64
x8     float64
x9     float64
x10    float64
x11    float64
x12    float64
x13    float64
x14    float64
x15    float64
x16    float64
x17    float64
x18    float64
x19    float64
x20    float64
x21    float64
x22    float64
x23    float64
x24    float64
x25    float64
x26    float64
x27    float64
x28    float64
x29    float64
x30    float64
x31    float64
x32    float64
x33    float64
x34     object
x35     object
x36    float64
x37    float64
x38    float64
x39    float64
x40    float64
x41     object
x42    float64
x43    float64
x44    float64
x45     object
x46    float64
x47    float64
x48    float64
x49    float64
x50    float64
x51    float64
x52    float64
x53    float64
x54    float64
x55    float64
x56    float64
x57    float64
x58    float64
x59    float64
x60    float64
x61    float64
x62    float64
x63    float64
x64    float64
x65    float64
x66    flo

In [9]:
rawdf.isnull().sum(axis=0)

x0     0
x1     2
x2     3
x3     3
x4     2
x5     1
x6     4
x7     0
x8     1
x9     2
x10    3
x11    0
x12    1
x13    4
x14    2
x15    1
x16    1
x17    2
x18    1
x19    1
x20    1
x21    2
x22    2
x23    0
x24    5
x25    3
x26    0
x27    1
x28    1
x29    1
x30    0
x31    2
x32    3
x33    3
x34    2
x35    3
x36    2
x37    3
x38    0
x39    2
x40    2
x41    1
x42    3
x43    1
x44    1
x45    2
x46    4
x47    2
x48    4
x49    0
x50    2
x51    2
x52    3
x53    2
x54    0
x55    3
x56    1
x57    2
x58    3
x59    2
x60    2
x61    3
x62    3
x63    2
x64    1
x65    2
x66    1
x67    2
x68    2
x69    1
x70    2
x71    2
x72    2
x73    7
x74    3
x75    2
x76    4
x77    2
x78    3
x79    1
x80    2
x81    1
x82    3
x83    1
x84    0
x85    3
x86    3
x87    2
x88    2
x89    3
x90    1
x91    3
x92    2
x93    1
x94    1
x95    2
x96    2
x97    6
x98    2
x99    2
dtype: int64

### 2.1 Convert `Dollar` and `Percentage` to Numericals

In [10]:
rawdf[['x41','x45']].head(5)

| | x41 | x45 |
|---|---|---|
| 0 | $-1073.61 | 0.01% |
| 1 | $1775.77 | 0.01% |
| 2 | $697.23 | 0.0% |
| 3 | $-134.48 | -0.02% |
| 4 | $1195.16 | -0.0% |


In [11]:
rawdf[['x41','x45']].dtypes

x41    object
x45    object
dtype: object

In [12]:
%%time
rawdf['x41n'] = rawdf['x41'].apply(lambda x: np.double(str(x).strip('$')))

CPU times: user 9.84 ms, sys: 2.65 ms, total: 12.5 ms
Wall time: 10.3 ms


In [13]:
%%time
rawdf['x45n'] = rawdf['x45'].apply(lambda x: np.double(str(x).strip('%')))

CPU times: user 8.71 ms, sys: 2.14 ms, total: 10.9 ms
Wall time: 9.11 ms


In [14]:
rawdf[['x41','x45','x41n','x45n']].head(5)

| | x41 | x45 | x41n | x45n |
|---|---|---|---|---|
| 0 | $-1073.61 | 0.01% | -1073.61 | 0.01 |
| 1 | $1775.77 | 0.01% | 1775.77 | 0.01 |
| 2 | $697.23 | 0.0% | 697.23 | 0.0 |
| 3 | $-134.48 | -0.02% | -134.48 | -0.02 |
| 4 | $1195.16 | -0.0% | 1195.16 | -0.0 |


In [15]:
rawdf[['x41','x45','x41n','x45n']].dtypes

x41      object
x45      object
x41n    float64
x45n    float64
dtype: object
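
The conversion in this subsection can also be sketched in plain Python. This is only an illustration, not the notebook's exact code: the helper name `to_number` is hypothetical, and it assumes every non-missing entry is a plain number wrapped in at most a `$` or `%` symbol.

```python
import math

def to_number(text, symbols="$%"):
    """Strip currency/percent symbols from a string and return a float.

    Missing entries (None or NaN) come back as NaN, which matches what
    np.double(str(x).strip(...)) yields for a pandas NaN.
    """
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return float("nan")
    return float(str(text).strip().strip(symbols))

# Sample values taken from the x41 / x45 preview above
print(to_number("$-1073.61"))  # -1073.61
print(to_number("-0.02%"))     # -0.02
```

As in the cells above, percentages stay in percent units (`0.01` means 0.01%); no division by 100 is applied.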

### 2.2 Trim the columns to `numericals` and `categoricals`

In [16]:
rawdf.columns

Index([u'x0', u'x1', u'x2', u'x3', u'x4', u'x5', u'x6', u'x7', u'x8', u'x9',
       ...
       u'x92', u'x93', u'x94', u'x95', u'x96', u'x97', u'x98', u'x99', u'x41n',
       u'x45n'],
      dtype='object', length=102)

In [17]:
dtypelist = rawdf.dtypes

#### Numerical Columns 

In [18]:
listNumerical = list(dtypelist[dtypelist.values == 'float64'].index)

In [19]:
rawdf[listNumerical[:3]].head()

| | x0 | x1 | x2 |
|---|---|---|---|
| 0 | -23.230884 | -1.809757 | 12.38069 |
| 1 | 138.561415 | 1.107473 | -19.781009 |
| 2 | -9.243047 | -10.207303 | -7.507803 |
| 3 | 8.963713 | 17.580528 | 13.88417 |
| 4 | 27.431028 | -6.232849 | 52.780835 |


In [20]:
len(listNumerical)

96

#### Categorical Columns

> We already know that we only have four categorical columns, `['x34','x35','x68','x93']`

In [21]:
listCategorical = ['x34','x35','x68','x93']

In [22]:
rawdf[listCategorical].dtypes

x34    object
x35    object
x68    object
x93    object
dtype: object

In [23]:
rawdf[listCategorical].head(3)

| | x34 | x35 | x68 | x93 |
|---|---|---|---|---|
| 0 | volkswagon | wednesday | Jun | asia |
| 1 | volkswagon | thurday | Jun | asia |
| 2 | Toyota | tuesday | July | asia |


In [24]:
rawdf.x34.value_counts(dropna=False)

volkswagon    3117
Toyota        2749
bmw           1822
Honda         1343
tesla          556
chrystler      277
nissan          93
ford            34
mercades         5
chevrolet        2
NaN              2
Name: x34, dtype: int64

#### Quick and Dirty Imputation

> Only a tiny fraction of the values are `nan` (at most 7 of 10,000 rows in any column, per the null counts above), so the choice of imputation should barely affect ML performance, at least for this problem.

In [25]:
rawdf[listCategorical] = rawdf[listCategorical].fillna('others')

In [26]:
rawdf.x34.value_counts(dropna=False)

volkswagon    3117
Toyota        2749
bmw           1822
Honda         1343
tesla          556
chrystler      277
nissan          93
ford            34
mercades         5
chevrolet        2
others           2
Name: x34, dtype: int64

In [27]:
rawdf[listNumerical] = rawdf[listNumerical].fillna(rawdf[listNumerical].median())
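
The two imputation rules used here (a placeholder label for categoricals, the column median for numericals) can be sketched without pandas as below. The helper names are hypothetical, and the sketch assumes missing numbers are NaN and missing labels are `None`. Like pandas' `median()`, it ignores missing entries when computing the median.

```python
import math
from statistics import median

def fill_numeric_with_median(values):
    """Replace NaN entries with the median of the non-missing entries."""
    observed = [v for v in values if not math.isnan(v)]
    med = median(observed)
    return [med if math.isnan(v) else v for v in values]

def fill_categorical(values, placeholder="others"):
    """Replace missing labels (None) with a placeholder category."""
    return [placeholder if v is None else v for v in values]

print(fill_numeric_with_median([1.0, float("nan"), 3.0, 10.0]))  # [1.0, 3.0, 3.0, 10.0]
print(fill_categorical(["Toyota", None, "bmw"]))                 # ['Toyota', 'others', 'bmw']
```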

In [28]:
rawdf[listNumerical].describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| x0 | 10000.0 | 7.659723 | 38.154705 | -146.942945 | -18.268008 | 7.50968 | 33.18692 | 143.801953 |
| x1 | 10000.0 | -3.090292 | 15.249271 | -63.335984 | -13.299095 | -3.064799 | 6.840624 | 60.33981 |
| x2 | 10000.0 | 1.111275 | 24.731503 | -96.450383 | -15.683698 | 1.764322 | 17.976587 | 90.310926 |
| x3 | 10000.0 | -0.738457 | 15.358952 | -65.740406 | -11.052687 | -0.634181 | 9.690136 | 60.732554 |
| x4 | 10000.0 | 0.479241 | 41.930597 | -159.472121 | -27.681615 | 0.533755 | 28.749545 | 158.49833 |
| x5 | 10000.0 | -1.447052 | 41.380885 | -160.162736 | -28.925701 | -1.152902 | 26.284512 | 148.496327 |
| x6 | 10000.0 | 0.007241 | 1.05967 | -3.97143 | -0.707606 | 0.007485 | 0.70368 | 4.298322 |
| x7 | 10000.0 | -0.007471 | 3.353986 | -12.138488 | -2.248168 | 0.009257 | 2.244817 | 12.827708 |
| x8 | 10000.0 | -0.663452 | 2.924361 | -11.529654 | -2.580253 | -0.666028 | 1.28143 | 10.722596 |
| x9 | 10000.0 | -0.029463 | 1.875376 | -7.123307 | -1.289778 | -0.044106 | 1.233568 | 6.945239 |


### Now Using Apache Spark ... 

#### Define a spark dataframe from the pandasDF

In [29]:
['y']+listCategorical

['y', 'x34', 'x35', 'x68', 'x93']

In [30]:
df = spark.createDataFrame(rawdf[listNumerical+listCategorical])



In [31]:
df.describe().toPandas().transpose()

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| summary | count | mean | stddev | min | max |
| x0 | 10000 | 7.659723203367136 | 38.154704663913705 | -146.94294471732132 | 143.80195322877827 |
| x1 | 10000 | -3.090292490079438 | 15.24927093090432 | -63.335984263236405 | 60.33980991681287 |
| x2 | 10000 | 1.1112753240007516 | 24.73150257276899 | -96.45038306157231 | 90.31092565447685 |
| x3 | 10000 | -0.7384574433942562 | 15.358952173979194 | -65.74040558704641 | 60.73255416666648 |
| x4 | 10000 | 0.4792410882606864 | 41.93059704080318 | -159.47212072302844 | 158.49833000878922 |
| x5 | 10000 | -1.4470522080562491 | 41.38088457972964 | -160.1627355559169 | 148.4963269062084 |
| x6 | 10000 | 0.007240945144702367 | 1.059669704539627 | -3.971429549734245 | 4.298321580116536 |
| x7 | 10000 | -0.007470527837147 | 3.3539856522606772 | -12.138487792297136 | 12.827708343844936 |
| x8 | 10000 | -0.6634518980376386 | 2.924361142322127 | -11.529654112589498 | 10.722595854974573 |


### 2.3 Encode the `categoricals` using `StringIndexer`

In [32]:
from pyspark.ml import Pipeline
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.0 renamed it to OneHotEncoder
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

In [33]:
indexersCategorical = [
    StringIndexer(inputCol=eachcat, outputCol="{0}_indexed".format(eachcat),
                  handleInvalid='keep')
    for eachcat in listCategorical
]

In [34]:
print([eachindexer.getOutputCol() for eachindexer in indexersCategorical])

['x34_indexed', 'x35_indexed', 'x68_indexed', 'x93_indexed']


In [35]:
listCategoricalIndexed = [eachindexer.getOutputCol() for eachindexer in indexersCategorical]

#### Sanity Check for StringIndexer

In [36]:
for idx in range(len(listCategorical)):
    indexersCategorical[idx]\
        .fit(df).transform(df)\
        .select('x0',listCategorical[idx],indexersCategorical[idx].getOutputCol())\
        .show(8)

+-------------------+----------+-----------+
|                 x0|       x34|x34_indexed|
+-------------------+----------+-----------+
|-23.230883903348346|volkswagon|        0.0|
| 138.56141534989942|volkswagon|        0.0|
| -9.243047142331873|    Toyota|        1.0|
|  8.963712705509357|volkswagon|        0.0|
| 27.431028306988978|    Toyota|        1.0|
|-12.099124839844775|    Toyota|        1.0|
|  9.459280736855304|volkswagon|        0.0|
| 56.634596214519924|    Toyota|        1.0|
+-------------------+----------+-----------+
only showing top 8 rows

+-------------------+---------+-----------+
|                 x0|      x35|x35_indexed|
+-------------------+---------+-----------+
|-23.230883903348346|wednesday|        2.0|
| 138.56141534989942|  thurday|        1.0|
| -9.243047142331873|  tuesday|        4.0|
|  8.963712705509357|      wed|        0.0|
| 27.431028306988978|  tuesday|        4.0|
|-12.099124839844775|      wed|        0.0|
|  9.459280736855304|      wed|        

#### Now run the stages of StringIndexers

In [37]:
%%time
pipeline = Pipeline(stages=indexersCategorical)
numdf = pipeline.fit(df).transform(df)

CPU times: user 29.1 ms, sys: 7.35 ms, total: 36.4 ms
Wall time: 1.31 s


In [38]:
numdf.cache()

DataFrame[x0: double, x1: double, x2: double, x3: double, x4: double, x5: double, x6: double, x7: double, x8: double, x9: double, x10: double, x11: double, x12: double, x13: double, x14: double, x15: double, x16: double, x17: double, x18: double, x19: double, x20: double, x21: double, x22: double, x23: double, x24: double, x25: double, x26: double, x27: double, x28: double, x29: double, x30: double, x31: double, x32: double, x33: double, x36: double, x37: double, x38: double, x39: double, x40: double, x42: double, x43: double, x44: double, x46: double, x47: double, x48: double, x49: double, x50: double, x51: double, x52: double, x53: double, x54: double, x55: double, x56: double, x57: double, x58: double, x59: double, x60: double, x61: double, x62: double, x63: double, x64: double, x65: double, x66: double, x67: double, x69: double, x70: double, x71: double, x72: double, x73: double, x74: double, x75: double, x76: double, x77: double, x78: double, x79: double, x80: double, x81: double,

In [39]:
df.select('x93').groupby('x93').count().sort(F.desc("count")).show()

+-------+-----+
|    x93|count|
+-------+-----+
|   asia| 8875|
|america|  783|
| euorpe|  341|
| others|    1|
+-------+-----+



In [40]:
numdf.select('x93_indexed').groupby('x93_indexed').count().sort(F.desc("count")).show()

+-----------+-----+
|x93_indexed|count|
+-----------+-----+
|        0.0| 8875|
|        1.0|  783|
|        2.0|  341|
|        3.0|    1|
+-----------+-----+
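
The index assignment seen in the two tables above follows from how Spark's `StringIndexer` works: by default it orders labels by descending frequency, and `handleInvalid='keep'` reserves one extra index for labels not seen during fitting. A minimal plain-Python sketch of that rule is given below. The function names are hypothetical, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically here; that tie rule is an assumption
    of this sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

def transform_label(mapping, label):
    """Look up a label; unseen labels go to one extra bucket ('keep')."""
    return mapping.get(label, float(len(mapping)))

# Counts copied from the x93 table above
labels = ["asia"] * 8875 + ["america"] * 783 + ["euorpe"] * 341 + ["others"]
mapping = fit_string_index(labels)
print(mapping)  # {'asia': 0.0, 'america': 1.0, 'euorpe': 2.0, 'others': 3.0}
print(transform_label(mapping, "africa"))  # 4.0
```

Because the indices depend on label frequencies in whatever data the indexer is fitted on, an indexer fitted on a different table may assign different indices to the same labels.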



#### One-hot Encoding the indexed categoricals 

In [41]:
encoder = OneHotEncoderEstimator(inputCols=listCategoricalIndexed,\
                                 outputCols=[eachcol+ "_onehot" for eachcol in listCategoricalIndexed])

In [42]:
listCategoricalOneHot = encoder.getOutputCols()

In [43]:
listCategoricalOneHot

['x34_indexed_onehot',
 'x35_indexed_onehot',
 'x68_indexed_onehot',
 'x93_indexed_onehot']

In [44]:
%%time
numdf = encoder.fit(numdf).transform(numdf)

CPU times: user 9.18 ms, sys: 2.21 ms, total: 11.4 ms
Wall time: 86.3 ms


In [45]:
numdf.select('x34','x34_indexed','x34_indexed_onehot').show(5)

+----------+-----------+------------------+
|       x34|x34_indexed|x34_indexed_onehot|
+----------+-----------+------------------+
|volkswagon|        0.0|    (11,[0],[1.0])|
|volkswagon|        0.0|    (11,[0],[1.0])|
|    Toyota|        1.0|    (11,[1],[1.0])|
|volkswagon|        0.0|    (11,[0],[1.0])|
|    Toyota|        1.0|    (11,[1],[1.0])|
+----------+-----------+------------------+
only showing top 5 rows
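
The `x34_indexed_onehot` column is printed in Spark's sparse-vector notation `(size, [indices], [values])`. The sketch below expands such a triple into a dense list, using a hypothetical helper `sparse_to_dense`. The size of 11 is consistent with 11 distinct `x34` labels plus the extra bucket from `handleInvalid='keep'`, minus the last category that the encoder drops by default. That accounting is an interpretation of the output, not something verified in this notebook.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# (11,[1],[1.0]) is the encoding shown above for 'Toyota' (index 1.0)
toyota = sparse_to_dense(11, [1], [1.0])
print(toyota)
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```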



In [46]:
numdf.cache()

DataFrame[x0: double, x1: double, x2: double, x3: double, x4: double, x5: double, x6: double, x7: double, x8: double, x9: double, x10: double, x11: double, x12: double, x13: double, x14: double, x15: double, x16: double, x17: double, x18: double, x19: double, x20: double, x21: double, x22: double, x23: double, x24: double, x25: double, x26: double, x27: double, x28: double, x29: double, x30: double, x31: double, x32: double, x33: double, x36: double, x37: double, x38: double, x39: double, x40: double, x42: double, x43: double, x44: double, x46: double, x47: double, x48: double, x49: double, x50: double, x51: double, x52: double, x53: double, x54: double, x55: double, x56: double, x57: double, x58: double, x59: double, x60: double, x61: double, x62: double, x63: double, x64: double, x65: double, x66: double, x67: double, x69: double, x70: double, x71: double, x72: double, x73: double, x74: double, x75: double, x76: double, x77: double, x78: double, x79: double, x80: double, x81: double,

### 2.4 Vectorize the features

In [47]:
from pyspark.ml.feature import VectorAssembler

In [48]:
vectorizedFeatures = listNumerical+listCategoricalOneHot

In [49]:
len(vectorizedFeatures)

100

In [50]:
print(vectorizedFeatures)

['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18', 'x19', 'x20', 'x21', 'x22', 'x23', 'x24', 'x25', 'x26', 'x27', 'x28', 'x29', 'x30', 'x31', 'x32', 'x33', 'x36', 'x37', 'x38', 'x39', 'x40', 'x42', 'x43', 'x44', 'x46', 'x47', 'x48', 'x49', 'x50', 'x51', 'x52', 'x53', 'x54', 'x55', 'x56', 'x57', 'x58', 'x59', 'x60', 'x61', 'x62', 'x63', 'x64', 'x65', 'x66', 'x67', 'x69', 'x70', 'x71', 'x72', 'x73', 'x74', 'x75', 'x76', 'x77', 'x78', 'x79', 'x80', 'x81', 'x82', 'x83', 'x84', 'x85', 'x86', 'x87', 'x88', 'x89', 'x90', 'x91', 'x92', 'x94', 'x95', 'x96', 'x97', 'x98', 'x99', 'x41n', 'x45n', 'x34_indexed_onehot', 'x35_indexed_onehot', 'x68_indexed_onehot', 'x93_indexed_onehot']


In [51]:
vecAssem = VectorAssembler(inputCols = vectorizedFeatures, outputCol= "features")

In [52]:
mldata_test = vecAssem.transform(numdf).select(vectorizedFeatures+['features'])

In [53]:
mldata_test.cache()

DataFrame[x0: double, x1: double, x2: double, x3: double, x4: double, x5: double, x6: double, x7: double, x8: double, x9: double, x10: double, x11: double, x12: double, x13: double, x14: double, x15: double, x16: double, x17: double, x18: double, x19: double, x20: double, x21: double, x22: double, x23: double, x24: double, x25: double, x26: double, x27: double, x28: double, x29: double, x30: double, x31: double, x32: double, x33: double, x36: double, x37: double, x38: double, x39: double, x40: double, x42: double, x43: double, x44: double, x46: double, x47: double, x48: double, x49: double, x50: double, x51: double, x52: double, x53: double, x54: double, x55: double, x56: double, x57: double, x58: double, x59: double, x60: double, x61: double, x62: double, x63: double, x64: double, x65: double, x66: double, x67: double, x69: double, x70: double, x71: double, x72: double, x73: double, x74: double, x75: double, x76: double, x77: double, x78: double, x79: double, x80: double, x81: double,

In [54]:
mldata_test.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|[-23.230883903348...|
|[138.561415349899...|
|[-9.2430471423318...|
|[8.96371270550935...|
|[27.4310283069889...|
+--------------------+
only showing top 5 rows
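
Conceptually, `VectorAssembler` concatenates the listed columns into one flat vector per row: scalar columns contribute one slot each and vector columns contribute all of their slots. The 100 input columns therefore produce a `features` vector longer than 100, since each one-hot column spans several slots. A rough plain-Python sketch of the idea follows, with a hypothetical helper `assemble` and a made-up toy row.

```python
def assemble(row, columns):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in columns:
        value = row[name]
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)
        else:
            features.append(float(value))
    return features

# Hypothetical toy row: two numeric columns and one 3-slot one-hot column
row = {"x0": -23.23, "x1": -1.81, "x93_indexed_onehot": [1.0, 0.0, 0.0]}
print(assemble(row, ["x0", "x1", "x93_indexed_onehot"]))
# [-23.23, -1.81, 1.0, 0.0, 0.0]
```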



#### Save the result as a Parquet table

In [55]:
import pyarrow as pa
import pyarrow.parquet as pq

In [56]:
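# State the format explicitly instead of relying on the session's default data source
mldata_test.write.format("parquet").option("compression","snappy")\
      .mode("overwrite").save("hdfs://master:54310/data/spark/statefarm/mldata_test.parquet.snappy")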