# Introduction

This notebook builds classification models on `data_cleaned.csv`, the output of `cleaning.ipynb`.

In [133]:
import pandas as pd
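# Use the fully qualified option name; the short form is ambiguous in newer pandas versions.
pd.set_option("display.max_columns", None)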

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# General Helper Functions

In [12]:
# Prints the normalized counts of the top-N values for each column in ls.
def many_count(ls, df, n=5):
    for i in ls:
        print(df[i].value_counts(normalize=True).iloc[:n])
        print('---------------')

In [13]:
# Prints all unique values for each column in ls.
def many_unique(ls, df):
    for i in ls:
        print(df[i].unique())
        print('---------------')

In [73]:
# Creates dummies for the top-N most common values of each column in ls
# (modifying df in place) and optionally drops the original column.
def top_dummies(ls, df, n=5, drop=True):
    for x in ls:
        for i in df[x].value_counts().index[:n]:
            # str() guards against non-string values such as booleans or numbers.
            df[x + "_" + str(i).replace(" ", "_")] = (df[x] == i).astype(int)
        if drop:
            df.drop(x, axis=1, inplace=True)
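
As a quick sanity check of what `top_dummies` does, here is a minimal sketch on a toy DataFrame. The `toy` frame and its values are made up for illustration, and the function is repeated (with a `str()` guard) only so the snippet runs on its own.

```python
import pandas as pd

# Copy of the helper so this snippet is self-contained.
def top_dummies(ls, df, n=5, drop=True):
    for x in ls:
        for i in df[x].value_counts().index[:n]:
            df[x + "_" + str(i).replace(" ", "_")] = (df[x] == i).astype(int)
        if drop:
            df.drop(x, axis=1, inplace=True)

# Hypothetical toy data: 'never pay' is the single most common value.
toy = pd.DataFrame(
    {"payment": ["never pay", "never pay", "pay annually", "never pay", "unknown"]}
)

# Keep only a binary flag for the top value and drop the original column.
top_dummies(["payment"], toy, n=1)

print(toy.columns.tolist())              # ['payment_never_pay']
print(toy["payment_never_pay"].tolist())  # [1, 1, 0, 1, 0]
```

Every value outside the top N ends up as all zeros across the new columns, which is how this differs from a full one-hot encoding.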

# Importing Data

In [109]:
df = pd.read_csv('../data/data_cleaned.csv')
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,id,amount_tsh,funder,gps_height,installer,longitude,latitude,wpt_name,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,permit,construction_year,extraction_type_class,management,payment,quality_group,quantity,source,waterpoint_type,status_group,day_of_year,month
0,69572,6000.0,Roman,1390,Roman,34.938093,-9.856322,none,Lake Nyasa,Mnyusi B,Iringa,11,5,Ludewa,Mundindi,109,True,GeoData Consultants Ltd,VWC,False,1999,gravity,vwc,pay annually,good,enough,spring,communal standpipe,functional,73,3
1,8776,0.0,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,Lake Victoria,Nyamara,Mara,20,2,Serengeti,Natta,280,True,GeoData Consultants Ltd,Other,True,2010,gravity,wug,never pay,good,insufficient,rainwater harvesting,communal standpipe,functional,65,3
2,34310,25.0,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,Pangani,Majengo,Manyara,21,4,Simanjiro,Ngorika,250,True,GeoData Consultants Ltd,VWC,True,2009,gravity,vwc,pay per bucket,good,enough,dam,communal standpipe multiple,functional,56,2
3,67743,0.0,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,Ruvuma / Southern Coast,Mahakamani,Mtwara,90,63,Nanyumbu,Nanyumbu,58,True,GeoData Consultants Ltd,VWC,True,1986,submersible,vwc,never pay,good,dry,machine dbh,communal standpipe multiple,non functional,28,1
4,19728,0.0,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,Lake Victoria,Kyanyamisa,Kagera,18,1,Karagwe,Nyakasimbi,0,True,GeoData Consultants Ltd,WUA,True,0,gravity,other,never pay,good,seasonal,rainwater harvesting,communal standpipe,functional,194,7


# General Preprocessing

## Replacing Target Values

In [110]:
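# Encode the three target classes as integers; assign back rather than relying on an in-place replace.
df['status_group'] = df['status_group'].map({'functional':0, 'functional needs repair':1, 'non functional':2})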
df.status_group.unique()

array([0, 2, 1], dtype=int64)

## Creating Selected Dummies

In the section below, I separate the categorical columns from the continuous ones and choose which categorical columns to turn into dummies. There are too many categories to reasonably encode all of them, so some judgment calls are needed. If the models end up performing poorly, this step can be revisited and done differently.

In [113]:
cats = []
for i in df.columns:
    # Treat anything that is not a 64-bit int or float (strings, booleans) as categorical.
    if not isinstance(df[i].iloc[0], (np.int64, np.float64)):
        cats.append(i)

print(len(cats))
print(cats)

19
['funder', 'installer', 'wpt_name', 'basin', 'subvillage', 'region', 'lga', 'ward', 'public_meeting', 'recorded_by', 'scheme_management', 'permit', 'extraction_type_class', 'management', 'payment', 'quality_group', 'quantity', 'source', 'waterpoint_type']


In [114]:
conts = [i for i in df.columns if i not in cats]

print(len(conts))
print(conts)

12
['id', 'amount_tsh', 'gps_height', 'longitude', 'latitude', 'region_code', 'district_code', 'population', 'construction_year', 'status_group', 'day_of_year', 'month']
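
The same categorical/continuous split can probably be sketched with `DataFrame.select_dtypes` instead of inspecting the first value of each column. The `toy` frame below is a made-up stand-in for `df`; note that boolean columns are not counted as `"number"`, so they land with the categoricals, as they do above.

```python
import pandas as pd

# Hypothetical miniature of the real data: two numeric, one boolean, one string column.
toy = pd.DataFrame({
    "amount_tsh": [6000.0, 0.0, 25.0],
    "population": [109, 280, 250],
    "permit": [False, True, True],
    "basin": ["Lake Nyasa", "Lake Victoria", "Pangani"],
})

# Numeric dtypes (ints and floats) are treated as continuous.
conts = toy.select_dtypes(include="number").columns.tolist()
# Everything else (booleans, strings) is treated as categorical.
cats = toy.select_dtypes(exclude="number").columns.tolist()

print(conts)  # ['amount_tsh', 'population']
print(cats)   # ['permit', 'basin']
```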


In [115]:
many_count(cats, df)

Government Of Tanzania    0.162903
Danida                    0.056125
Hesawa                    0.039472
Rwssp                     0.024971
Kkkt                      0.023056
Name: funder, dtype: float64
---------------
DWE           0.312474
Government    0.032882
RWE           0.021955
DANIDA        0.018906
Commu         0.016297
Name: installer, dtype: float64
---------------
none         0.060292
Shuleni      0.029037
Zahanati     0.013739
Msikitini    0.009029
Kanisani     0.005302
Name: wpt_name, dtype: float64
---------------
Lake Victoria      0.173440
Pangani            0.151451
Rufiji             0.135120
Internal           0.131884
Lake Tanganyika    0.108963
Name: basin, dtype: float64
---------------
Madukani    0.008606
Shuleni     0.008572
Majengo     0.008504
Kati        0.006319
Mtakuja     0.004438
Name: subvillage, dtype: float64
---------------
Iringa         0.089685
Shinyanga      0.084399
Mbeya          0.078588
Kilimanjaro    0.074184
Morogoro       0.067865
Na

Columns to be given 5 dummies: `extraction_type_class` (because of its relationship to the function of the well)

Columns to be given 2 dummies: `management` (because EDA suggested a relationship between the top management values and well failure), `quality_group` (same reasoning, but for water quality), `waterpoint_type` (because of its relationship to the function of the well)

Columns to be reduced to a single binary for the top value: `public_meeting`, `permit`, `payment`, `funder`, `installer`

> *Next columns to try if these fail:* `source`, `quantity`, `basin`

The cells below replace the columns listed above with their specified number of dummies and drop all remaining object columns from the DataFrame.

In [116]:
five_dum = ['extraction_type_class']
two_dum = ['management', 'quality_group', 'waterpoint_type']
one_dum = ['payment', 'funder', 'installer']
one_dum_already_bool = ['public_meeting', 'permit']

In [117]:
dum_df = df.copy()

In [118]:
top_dummies(five_dum, dum_df, n=5)
top_dummies(two_dum, dum_df, n=2)
top_dummies(one_dum, dum_df, n=1)

In [119]:
for i in one_dum_already_bool:
    dum_df[i] = dum_df[i].astype('int')

In [120]:
# Drop every remaining object (string) column in one call instead of mutating while looping.
obj_cols = dum_df.columns[dum_df.dtypes == 'O']
dum_df.drop(obj_cols, axis=1, inplace=True)

In [121]:
dum_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59029 entries, 0 to 59028
Data columns (total 28 columns):
id                                    59029 non-null int64
amount_tsh                            59029 non-null float64
gps_height                            59029 non-null int64
longitude                             59029 non-null float64
latitude                              59029 non-null float64
region_code                           59029 non-null int64
district_code                         59029 non-null int64
population                            59029 non-null int64
public_meeting                        59029 non-null int32
permit                                59029 non-null int32
construction_year                     59029 non-null int64
status_group                          59029 non-null int64
day_of_year                           59029 non-null int64
month                                 59029 non-null int64
extraction_type_class_gravity         59029 non-null int32


In [125]:
np.sum(dum_df.dtypes == 'O')

0

In [126]:
dum_df.head()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,region_code,district_code,population,public_meeting,permit,construction_year,status_group,day_of_year,month,extraction_type_class_gravity,extraction_type_class_handpump,extraction_type_class_other,extraction_type_class_submersible,extraction_type_class_motorpump,management_vwc,management_wug,quality_group_good,quality_group_salty,waterpoint_type_communal_standpipe,waterpoint_type_hand_pump,payment_never_pay,funder_Government_Of_Tanzania,installer_DWE
0,69572,6000.0,1390,34.938093,-9.856322,11,5,109,1,0,1999,0,73,3,1,0,0,0,0,1,0,1,0,1,0,0,0,0
1,8776,0.0,1399,34.698766,-2.147466,20,2,280,1,1,2010,0,65,3,1,0,0,0,0,0,1,1,0,1,0,1,0,0
2,34310,25.0,686,37.460664,-3.821329,21,4,250,1,1,2009,0,56,2,1,0,0,0,0,1,0,1,0,0,0,0,0,0
3,67743,0.0,263,38.486161,-11.155298,90,63,58,1,1,1986,2,28,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0
4,19728,0.0,0,31.130847,-1.825359,18,1,0,1,1,0,0,194,7,1,0,0,0,0,0,0,1,0,1,0,1,0,0


Beautiful, numbers only! Now we can pass these numeric columns to scikit-learn and see how the baseline models perform.

# Test-Train Split

In [128]:
X = dum_df.drop("status_group", axis=1)
y = dum_df['status_group']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline Model

I'm starting with a KNN baseline because this data still contains a fair amount of geographic information, even after dropping `region` and `ward`: we still have height, latitude, longitude, and the region and district codes.

In [129]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [131]:
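# Accuracy is the share of test rows predicted correctly, so divide by the test set size.
np.sum(knn.predict(X_test) == y_test) / len(y_test)

The denominator matters here: the test set holds only 25% of the rows, so dividing by `len(df)` instead would understate accuracy about fourfold (0.1302 rather than roughly 0.52). Even at roughly 0.52, this is a weak baseline, plausibly because unscaled columns with large ranges, such as `id`, dominate the distance calculation.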
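
One thing worth checking in a hand-computed score like this is the denominator. The sketch below uses made-up predictions and a hypothetical `n_total_rows` to show how dividing the number of correct test predictions by the size of the full dataset, rather than the test set, deflates the result.

```python
import numpy as np

# Hypothetical labels and predictions for a 4-row test set.
y_test = np.array([0, 2, 1, 0])
preds = np.array([0, 2, 0, 0])

# Hypothetical size of the full dataset (test set is 25% of it).
n_total_rows = 16

correct = np.sum(preds == y_test)

# Correct: fraction of test rows predicted correctly.
accuracy = correct / len(y_test)

# Incorrect: the full dataset size shrinks the score by the train/test ratio.
deflated = correct / n_total_rows

print(accuracy)  # 0.75
print(deflated)  # 0.1875
```

With a 25% test split the deflated figure is a quarter of the true accuracy; `knn.score(X_test, y_test)` avoids the issue entirely.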

# KNN Training

Let's see if building a standard-scaled model with grid-searching will improve this.
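
To illustrate why scaling can matter for KNN, here is a small sketch with made-up rows of `[id, latitude]`. Before scaling, the Euclidean distance is dominated by the column with the larger range; after `StandardScaler`, both columns contribute on a comparable footing. The numbers and the `nearest_to_first` helper are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [id, latitude]. Row 1 is geographically close to row 0,
# while row 2 merely has a similar id.
X = np.array([
    [100.0, -9.0],
    [5000.0, -9.1],
    [120.0, -2.0],
    [9000.0, -5.0],
])

def nearest_to_first(data):
    """Return the index of the row closest (Euclidean) to row 0."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X))

print(raw_neighbor)     # 2: the id column dominates the raw distance
print(scaled_neighbor)  # 1: after scaling, latitude carries comparable weight
```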

In [136]:
knn_pipe = Pipeline([('SS', StandardScaler()),
                     ('KN', KNeighborsClassifier())])

In [144]:
knn_grid = [{'KN__n_neighbors' : [3, 5, 7]}]

In [145]:
knn_gridsearch = GridSearchCV(estimator=knn_pipe, param_grid=knn_grid, scoring='accuracy', cv=3)

In [146]:
knn_gridsearch.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('SS',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('KN',
                                        KNeighborsClassifier(algorithm='auto',
                                                             leaf_size=30,
                                                             metric='minkowski',
                                                             metric_params=None,
                                                             n_jobs=None,
                                                             n_neighbors=5, p=2,
                                                             weights='uniform'))],
                                verbose=False),
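
Once a grid search like this has been fit, the chosen hyperparameter and scores can be read from the fitted object. The sketch below runs the same pipeline and grid on synthetic data from `make_classification` (an assumption, standing in for the real training set) and reads `best_params_`, `best_score_`, and the held-out score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data, used only so the snippet runs on its own.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same structure as the notebook's pipeline: scale, then classify.
pipe = Pipeline([("SS", StandardScaler()), ("KN", KNeighborsClassifier())])
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"KN__n_neighbors": [3, 5, 7]},
    scoring="accuracy",
    cv=3,
)
grid.fit(X_train, y_train)

best_k = grid.best_params_["KN__n_neighbors"]  # neighbor count with the best CV accuracy
cv_score = grid.best_score_                    # mean cross-validated accuracy for best_k
test_score = grid.score(X_test, y_test)        # accuracy of the refit pipeline on held-out data

print(best_k, round(cv_score, 3), round(test_score, 3))
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so the cross-validated score is not contaminated by statistics from the validation fold.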


In [None]:
np.sum()