<a href="https://colab.research.google.com/github/qweliant/DS-Unit-2-Kaggle-Challenge/blob/master/module2/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [1]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    get_ipython().system('pip install category_encoders==2.*')

# If you're working locally:
else:
    DATA_PATH = '../data/'



Reinitialized existing Git repository in /content/.git/
fatal: remote origin already exists.
From https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge
 * branch            master     -> FETCH_HEAD
Already up to date.


In [0]:
import pandas as pd
import category_encoders as ce
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import numpy as np

from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [0]:
def wrangle(X):
    """Clean a features dataframe and add date-based features."""
    X = X.copy()

    # For each high-cardinality column, keep its 10 most frequent values
    # and lump everything else into 'other' ('lga' and 'basin' are left as-is)
    high_cardinality = ['wpt_name', 'subvillage', 'region', 'ward',
                        'funder', 'scheme_name', 'installer']
    for col in high_cardinality:
        top10 = X[col].value_counts()[:10].index
        X.loc[~X[col].isin(top10), col] = 'other'

    # Treat the near-zero latitude placeholder as zero, then mark zeros as missing
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 'gps_height', 'population']
    X['construction_year'] = X['construction_year'].astype(dtype='int32')
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)

    # Drop columns that duplicate 'quantity' and 'payment'
    X = X.drop(columns=['quantity_group', 'payment_type'])

    # Index by inspection date and derive the pump's age at inspection
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    X = X.set_index('date_recorded')
    X['inspec_year'] = X.index.year.astype(dtype='int32')
    X['time_before_inspection'] = X['inspec_year'] - X['construction_year']
    X['years_MISSING'] = X['time_before_inspection'].isnull()
    return X


def train_validate(df):
    # Stratify on the target of the frame passed in, not on the global `train`
    train_set, validation_set = train_test_split(
        df, train_size=0.80, test_size=0.20,
        stratify=df['status_group'], random_state=42)
    return train_set, validation_set
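
The `wrangle` function above recomputes the top 10 values on whichever frame it receives, so the training, validation, and test splits can end up keeping different categories. A minimal sketch of one alternative, assuming the top values are learned from the training split only and then reused; `fit_top_categories` and `apply_top_categories` are hypothetical helpers, not part of the notebook:

```python
import pandas as pd

def fit_top_categories(df, columns, n=10):
    """Learn the n most frequent values of each column from one frame."""
    return {col: set(df[col].value_counts().index[:n]) for col in columns}

def apply_top_categories(df, top_categories, other='other'):
    """Replace every value outside the learned sets with `other`."""
    df = df.copy()
    for col, keep in top_categories.items():
        # Same rule as wrangle(): anything not in the kept set becomes 'other'
        df.loc[~df[col].isin(keep), col] = other
    return df

# Toy frames standing in for the train and validation splits
toy_train = pd.DataFrame({'funder': ['a', 'a', 'a', 'b', 'b', 'c']})
toy_val = pd.DataFrame({'funder': ['a', 'c', 'c', 'd']})

top = fit_top_categories(toy_train, ['funder'], n=2)
toy_val_reduced = apply_top_categories(toy_val, top)
print(toy_val_reduced['funder'].tolist())
```

In the toy data `c` is the most frequent value in the validation frame, yet it is still mapped to `other` because only the training frame decides which values are kept.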

In [0]:
train_set, validation_set = train_validate(train)

In [0]:
train_set = wrangle(train_set)
validation_set = wrangle(validation_set)
test = wrangle(test)

In [6]:
train_set.select_dtypes(exclude='number').describe().T.sort_values(by='unique')

| | count | unique | top | freq |
|---|---|---|---|---|
| recorded_by | 47520 | 1 | GeoData Consultants Ltd | 47520 |
| years_MISSING | 47520 | 2 | False | 31003 |
| permit | 45077 | 2 | True | 31071 |
| public_meeting | 44876 | 2 | True | 40838 |
| source_class | 47520 | 3 | groundwater | 36638 |
| status_group | 47520 | 3 | functional | 25807 |
| quantity | 47520 | 5 | enough | 26567 |
| management_group | 47520 | 5 | user-group | 42027 |
| quality_group | 47520 | 6 | good | 40598 |
| waterpoint_type_group | 47520 | 6 | communal standpipe | 27642 |
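
The `status_group` row above gives a reference point for the models that follow: `functional` is the most frequent class, with 25807 of the 47520 training rows. A minimal sketch of the accuracy a classifier would reach on this split by always predicting that class (the variable names are illustrative):

```python
# Counts copied from the describe() output above
majority_count = 25807   # training rows labelled 'functional'
total_rows = 47520       # all rows in the training split

# Accuracy of always predicting the most frequent class
majority_baseline = majority_count / total_rows
print(f'Majority-class baseline accuracy: {majority_baseline:.4f}')
```

Because the split is stratified on `status_group`, the validation set should show roughly the same proportion, so validation accuracies are best read against a floor of about 54%.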


In [7]:
train_set.select_dtypes(include='number').describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 47520.0 | 37037.915699 | 21412.099719 | 0.0 | 18482.75 | 36986.5 | 55450.25 | 74247.0 |
| amount_tsh | 47520.0 | 321.925261 | 3197.240487 | 0.0 | 0.0 | 0.0 | 25.0 | 350000.0 |
| gps_height | 31215.0 | 1019.312991 | 612.056739 | -63.0 | 395.5 | 1167.0 | 1497.0 | 2770.0 |
| longitude | 46078.0 | 35.149033 | 2.604241 | 29.607122 | 33.284679 | 35.008578 | 37.223501 | 40.344301 |
| latitude | 46078.0 | -5.884512 | 2.805599 | -11.64944 | -8.633876 | -5.170151 | -3.375068 | -0.998464 |
| num_private | 47520.0 | 0.477736 | 13.312977 | 0.0 | 0.0 | 0.0 | 0.0 | 1776.0 |
| region_code | 47520.0 | 15.258291 | 17.530228 | 1.0 | 5.0 | 12.0 | 17.0 | 99.0 |
| district_code | 47520.0 | 5.616751 | 9.62123 | 0.0 | 2.0 | 3.0 | 5.0 | 80.0 |
| population | 30454.0 | 280.566034 | 553.488321 | 1.0 | 40.0 | 150.0 | 321.0 | 15300.0 |
| construction_year | 31003.0 | 1996.825469 | 12.499247 | 1960.0 | 1988.0 | 2000.0 | 2008.0 | 2013.0 |


In [0]:
target = "status_group"
numbers_columns = train_set.select_dtypes(include='number').columns.drop('id').tolist()
categorical_columns = train_set.select_dtypes(exclude='number').columns.drop(target).tolist()

# Keep only the categorical columns with at most 200 distinct values
low_cardinality_columns = [col for col in categorical_columns if train_set[col].nunique() <= 200]
features = numbers_columns + low_cardinality_columns

In [0]:
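# test_features.csv carries no status_group labels, so this mapping is not expected
# to change anything; the models below are fit on the original string labels.
test.replace({'functional':0, 'functional needs repair':1, 'non functional':2}, inplace=True)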

In [0]:
X_train = train_set[features]
y_train = train_set[target]
X_val = validation_set[features]
y_val = validation_set[target]
X_test = test[features]

In [24]:
pipeline1 = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, oob_score=True)
)

pipeline1.fit(X_train, y_train)


print('Train Accuracy:', pipeline1.score(X_train, y_train))
print('Validation Accuracy:', pipeline1.score(X_val, y_val))

Train Accuracy: 0.9970749158249158
Validation Accuracy: 0.8067340067340067
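
`pipeline1` sets `oob_score=True`, but the cell never reads the result. A minimal sketch of how the out-of-bag estimate can be read from a fitted forest inside a pipeline, using synthetic data rather than the waterpumps features; `X_toy`, `y_toy`, and `toy_pipeline` are illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data: the label depends only on the 'signal' column
rng = np.random.RandomState(42)
X_toy = pd.DataFrame({'signal': rng.normal(size=300),
                      'noise': rng.normal(size=300)})
y_toy = np.where(X_toy['signal'] > 0, 'functional', 'non functional')

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42, oob_score=True),
)
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']
oob_accuracy = forest.oob_score_
print('Out-of-bag accuracy:', oob_accuracy)
```

For the notebook's model the equivalent lookup would be `pipeline1.named_steps['randomforestclassifier'].oob_score_`. Each tree is scored on the rows left out of its bootstrap sample, so the estimate does not use the validation split.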


In [21]:
pipeline2 = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

pipeline2.fit(X_train, y_train)


print('Train Accuracy:', pipeline2.score(X_train, y_train))
print('Validation Accuracy:', pipeline2.score(X_val, y_val))

Train Accuracy: 0.9970749158249158
Validation Accuracy: 0.8126262626262626
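
One of the stretch goals is to get and plot feature importances. A minimal sketch of pulling them out of a fitted pipeline, again on synthetic data; the column names `signal`, `noise_a`, and `noise_b` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic data: only 'signal' carries information about the label
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
y_toy = (X_toy['signal'] > 0).astype(int)

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
toy_pipeline.fit(X_toy, y_toy)

# Pair each importance with its column name and sort, largest first
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` would draw the chart. Since `ce.OrdinalEncoder` keeps one output column per input column, `X_train.columns` should line up with the forest's importances for `pipeline2`; with one-hot encoding the names would have to come from the encoder's output instead.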


In [22]:
pipeline3 = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'), 
    GradientBoostingClassifier(n_estimators=32, random_state=42)
)

pipeline3.fit(X_train, y_train)


print('Train Accuracy:', pipeline3.score(X_train, y_train))
print('Validation Accuracy:', pipeline3.score(X_val, y_val))

Train Accuracy: 0.7361111111111112
Validation Accuracy: 0.7345959595959596


In [23]:
pipeline4 = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='median'), 
    GradientBoostingClassifier(max_features="log2",n_estimators=48, random_state=42)
)

pipeline4.fit(X_train, y_train)


print('Train Accuracy:', pipeline4.score(X_train, y_train))
print('Validation Accuracy:', pipeline4.score(X_val, y_val))

Train Accuracy: 0.7197390572390573
Validation Accuracy: 0.7175925925925926


In [15]:
y_pred = pipeline2.predict(X_test)

# Copy the id column so that adding status_group does not write to a slice of `test`
submission = test[['id']].copy()

# pipeline2 was fit on the string labels, so its predictions need no decoding
submission['status_group'] = y_pred
submission.head()


| date_recorded | id | status_group |
|---|---|---|
| 2013-02-04 | 50785 | functional |
| 2013-02-04 | 51630 | functional |
| 2013-02-01 | 17168 | functional |
| 2013-01-22 | 45559 | non functional |
| 2013-03-27 | 49871 | functional |


In [0]:
submission.to_csv('submission.csv', index=False)
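
The notebook loads `sample_submission` but never uses it. A minimal sketch of a pre-upload check against it, assuming the sample file has the same `id` and `status_group` columns as `submission`; `check_submission` is a hypothetical helper and the toy frames stand in for the real files:

```python
import pandas as pd

# The three labels used in this challenge
VALID_LABELS = {'functional', 'functional needs repair', 'non functional'}

def check_submission(submission, sample_submission):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    if list(submission.columns) != list(sample_submission.columns):
        problems.append('columns differ from sample_submission')
    if len(submission) != len(sample_submission):
        problems.append('row count differs from sample_submission')
    if 'status_group' in submission.columns:
        unexpected = set(submission['status_group']) - VALID_LABELS
        if unexpected:
            problems.append(f'unexpected labels: {sorted(map(str, unexpected))}')
    return problems

# Toy frames standing in for sample_submission.csv and the predictions
toy_sample = pd.DataFrame({'id': [10, 11, 12],
                           'status_group': ['functional'] * 3})
toy_submission = pd.DataFrame({
    'id': [10, 11, 12],
    'status_group': ['functional', 'non functional', 'functional needs repair'],
})
toy_problems = check_submission(toy_submission, toy_sample)
print(toy_problems)
```

In the notebook the call would be `check_submission(submission, sample_submission)`, run before writing the CSV.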

In [0]:
from google.colab import files

files.download('submission.csv')