# Basic Submission

The aim of this notebook is to submit a baseline that exceeds 60% accuracy in the competition and to get familiar with the basics of the Kaggle API.

Let's start by interacting with the Kaggle API to get our data and submit a prediction.

In [38]:
# imports
import pandas as pd
pd.options.display.max_columns = 999
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score, confusion_matrix
from sklearn.metrics import classification_report, make_scorer

from sklearn.model_selection import GridSearchCV, validation_curve, learning_curve
from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.pipeline import make_pipeline

from sklearn.preprocessing import StandardScaler

from sklearn.feature_selection import SelectKBest

import category_encoders as ce

## Loading Data

In [3]:
# download data from kaggle
# %env KAGGLE_CONFIG_DIR=/Users/zach/Kaggle
# !kaggle competitions download -c ds1-predictive-modeling-challenge

env: KAGGLE_CONFIG_DIR=/Users/zach/Kaggle
Downloading sample_submission.csv to /Users/zach/repos/water-pump-prediction
100%|████████████████████████████████████████| 236k/236k [00:00<00:00, 2.20MB/s]

Downloading test_features.csv.zip to /Users/zach/repos/water-pump-prediction
  0%|                                                | 0.00/948k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 948k/948k [00:00<00:00, 10.4MB/s]
Downloading train_labels.csv.zip to /Users/zach/repos/water-pump-prediction
  0%|                                                | 0.00/211k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 211k/211k [00:00<00:00, 62.6MB/s]
Downloading train_features.csv.zip to /Users/zach/repos/water-pump-prediction
100%|██████████████████████████████████████| 3.81M/3.81M [00:00<00:00, 14.3MB/s]



In [49]:
# load in training and submission data
df_train = pd.read_csv('train_features.csv', index_col=0).join(pd.read_csv('train_labels.csv', index_col=0))

df_submit = pd.read_csv('test_features.csv', index_col=0)

df_submit.head()

id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,Magoma,Manyara,21,3,Mbulu,Bashay,321,True,GeoData Consultants Ltd,Parastatal,,True,2012,other,other,other,parastatal,parastatal,never pay,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,Kimnyak,Arusha,2,2,Arusha Rural,Kimnyaki,300,True,GeoData Consultants Ltd,VWC,TPRI pipe line,True,2000,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,Msatu,Singida,13,2,Singida Rural,Puma,500,True,GeoData Consultants Ltd,VWC,P,,2010,other,other,other,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,Kipindimbi,Lindi,80,43,Liwale,Mkutano,250,,GeoData Consultants Ltd,VWC,,True,1987,other,other,other,vwc,user-group,unknown,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,Losonga,Ruvuma,10,3,Mbinga,Mbinga Urban,60,,GeoData Consultants Ltd,Water Board,BRUDER,True,2000,gravity,gravity,gravity,water board,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


## First Submission - Majority Classifier

To practice working with Kaggle's API, let's make a submission predicting the majority class for all observations.

In [22]:
# get majority class from training data
majority_class = df_train['status_group'].mode()[0]
maj_pred = [majority_class] * df_submit.shape[0]

# create a dataframe and write it to csv for submission
df_majority = pd.DataFrame(data=maj_pred, index=df_submit.index, columns=['status_group'])
df_majority.to_csv('majority_class_submission.csv')
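
As a rough check on what score to expect from this submission, the accuracy of a majority classifier equals the share of the most common class. The sketch below shows that calculation on a small made-up label series. The label values and counts are placeholders, not figures from the competition data; in this notebook `labels` would be `df_train['status_group']`.

```python
import pandas as pd

# Placeholder labels standing in for df_train['status_group'] (made-up values)
labels = pd.Series(
    ['functional'] * 6 + ['non functional'] * 3 + ['functional needs repair'],
    name='status_group',
)

# Share of each class; the largest share is the accuracy a majority
# classifier would get on data with the same class balance
class_share = labels.value_counts(normalize=True)
majority_class = class_share.idxmax()
majority_accuracy = class_share.max()

print(majority_class, round(majority_accuracy, 3))
```

If the leaderboard score is far from this number, the class balance of the test set probably differs from that of the training set.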

In [23]:
# SUBMIT!!
# !kaggle competitions submit -c ds1-predictive-modeling-challenge -f majority_class_submission.csv -m 'majority classifier submission'

100%|█████████████████████████████████████████| 236k/236k [00:01<00:00, 204kB/s]
Successfully submitted to DS1 Predictive Modeling Challenge

## Second Submission - Simple Logistic Regression Baseline

Next, let's work on getting a simple model submission using logistic regression. Let's see how our accuracy changes if we only use the numeric features to fit our model.

In [32]:
# separate into X and y variables, create train and validation data
X = df_train.select_dtypes(exclude=['object'])
y = df_train['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.1,
                                                   random_state=1,
                                                   stratify=y)

pipe_log_reg_basic = make_pipeline(StandardScaler(),
                                   LogisticRegression(solver='lbfgs',
                                                      max_iter=5000)).fit(X_train, y_train)
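
To see which columns a numeric-only model actually receives, it can help to list them explicitly. The sketch below uses a tiny made-up frame (column names borrowed from the dataset, values invented) and `select_dtypes(include='number')`. This is a slightly stricter alternative to the `exclude=['object']` call above, since it also leaves out boolean columns.

```python
import pandas as pd

# Made-up frame mixing numeric, text, and boolean columns
toy = pd.DataFrame({
    'gps_height': [1996, 1569, 267],
    'longitude': [35.29, 36.66, 38.06],
    'basin': ['Internal', 'Pangani', 'Internal'],
    'permit': [True, False, True],
})

# Keep only integer and float columns
numeric_only = toy.select_dtypes(include='number')
numeric_columns = list(numeric_only.columns)

print(numeric_columns)
```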



In [33]:
# report train and validation accuracy
print('Training Accuracy Score %.3f' % (pipe_log_reg_basic.score(X_train, y_train)))
print('Validation Accuracy Score %.3f' % (pipe_log_reg_basic.score(X_test, y_test)))

Training Accuracy Score 0.557
Validation Accuracy Score 0.557




Sadly, numeric features alone won't get us over the 60% baseline target. Let's one-hot encode the categorical (object) columns and fit another logistic regression.

We'll also drop some high-cardinality features from the dataset to keep computation time down.
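
One way to decide which high-arity (high-cardinality) columns to drop is to count the distinct values in each non-numeric column, since one-hot encoding creates one new column per distinct value. The sketch below does this on a made-up frame. The `max_levels` threshold is a hypothetical choice for illustration, not the rule used to pick the columns dropped in this notebook.

```python
import pandas as pd

# Made-up frame: wpt_name is unique per row, basin has only two levels
toy = pd.DataFrame({
    'wpt_name': ['a', 'b', 'c', 'd', 'e', 'f'],
    'basin': ['Internal', 'Pangani', 'Internal', 'Pangani', 'Internal', 'Internal'],
    'gps_height': [1996, 1569, 1567, 267, 1260, 300],
})

# Number of distinct values in each non-numeric column, largest first
cardinality = (toy.select_dtypes(exclude='number')
                  .nunique()
                  .sort_values(ascending=False))

# Hypothetical cutoff: columns with more levels than this get dropped
max_levels = 3
high_cardinality = cardinality[cardinality > max_levels].index.tolist()

print(cardinality)
print(high_cardinality)
```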

In [42]:
# drop the target and high-cardinality columns; the rest are one-hot encoded in the pipeline
X = df_train.drop(['status_group', 'date_recorded','subvillage', 'lga', 'funder',
                  'ward', 'scheme_name', 'wpt_name'], axis=1)
y = df_train['status_group']

X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.1,
                                                   random_state=1,
                                                   stratify=y)

pipe_log_reg2 = make_pipeline(ce.OneHotEncoder(use_cat_names=True),
                              StandardScaler(),
                              LogisticRegression(solver='lbfgs',
                                                 max_iter=500)).fit(X_train, y_train)



In [43]:
# report train and validation accuracy
print('Training Accuracy Score %.3f' % (pipe_log_reg2.score(X_train, y_train)))
print('Validation Accuracy Score %.3f' % (pipe_log_reg2.score(X_test, y_test)))



Training Accuracy Score 0.770
Validation Accuracy Score 0.752




Alright, a validation accuracy of 0.752 should comfortably exceed our 60% baseline for submission. Let's go ahead and submit, and then work on a more advanced model.

In [45]:
# create a dataframe and write it to csv for submission
X_submit_logistic = df_submit.drop(['date_recorded','subvillage', 'lga', 'funder',
                  'ward', 'scheme_name', 'wpt_name'], axis=1)
y_submit_logistic = pipe_log_reg2.predict(X_submit_logistic)
df_logistic = pd.DataFrame(data=y_submit_logistic, index=df_submit.index, columns=['status_group'])
df_logistic.to_csv('basic_logistic_regression_submission.csv')
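
Before uploading, it can be worth confirming that the prediction file has the same layout as `sample_submission.csv`. The sketch below is one possible check, written against small made-up frames. `check_submission` is a hypothetical helper, not part of pandas or the Kaggle API; in this notebook the inputs would be `df_logistic` and the sample file loaded with `index_col=0`.

```python
import pandas as pd

# Made-up stand-ins for sample_submission.csv and a prediction frame
ids = pd.Index([50785, 51630, 17168], name='id')
sample = pd.DataFrame({'status_group': ['functional'] * 3}, index=ids)
submission = pd.DataFrame(
    {'status_group': ['functional', 'non functional', 'functional']},
    index=ids,
)


def check_submission(candidate, reference):
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    if list(candidate.columns) != list(reference.columns):
        problems.append('column names differ')
    if not candidate.index.equals(reference.index):
        problems.append('ids differ or are in a different order')
    if candidate.isna().any().any():
        problems.append('missing predictions')
    return problems


problems = check_submission(submission, sample)
print(problems)
```

An empty list only means the layout matches; it says nothing about how good the predictions are.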



In [46]:
# SUBMIT!!
# !kaggle competitions submit -c ds1-predictive-modeling-challenge -f basic_logistic_regression_submission.csv -m 'logistic regression basic classifier submission'

100%|█████████████████████████████████████████| 257k/257k [00:01<00:00, 143kB/s]
Successfully submitted to DS1 Predictive Modeling Challenge