Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [x] Read [“Adopting a Hypothesis-Driven Workflow”](http://archive.is/Nu3EI), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [x] Continue to participate in our Kaggle challenge.
- [x] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [x] Try Ordinal Encoding.
- [x] Try a Random Forest Classifier.
- [x] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [x] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](http://contrib.scikit-learn.org/category_encoders/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do, for example [CatBoost](https://catboost.ai/) or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).
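
To make the difference between these encodings concrete, here is a minimal hand-rolled sketch using only pandas. It illustrates the ideas and is not how category_encoders implements them; the toy `basin` values and the particular integer codes are assumptions made for the example.

```python
import pandas as pd

# Toy categorical column (values chosen for illustration only)
s = pd.Series(['Lake Victoria', 'Pangani', 'Rufiji', 'Pangani'], name='basin')

# Numeric / ordinal encoding: one arbitrary integer per category
codes, categories = pd.factorize(s)
ordinal = pd.Series(codes + 1, name='basin_ordinal')  # 1-based codes

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(s, prefix='basin', dtype=int)

# Binary encoding: write each ordinal code in base 2, one column per bit
n_bits = int(ordinal.max()).bit_length()
binary = pd.DataFrame(
    {f'basin_bit{b}': (ordinal // 2 ** b) % 2 for b in reversed(range(n_bits))}
)

print(pd.concat([s, ordinal, one_hot, binary], axis=1))
```

With three categories, one-hot needs three columns while binary needs two. The gap widens as cardinality grows, which is one reason binary encoding can be attractive for high-cardinality columns.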


**2.** The short video **[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoders' mean encoding implementations work for regression problems or binary classification problems.

For multi-class classification problems, you will need to temporarily reformulate the target as binary. For example:

```python
import category_encoders as ce

encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_val)  # Transform validation features without the target
```

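A pipeline passes the same `y` to every step, so the encoder and the classifier cannot receive different targets. For this reason, mean encoding won't work well within pipelines for multi-class classification problems.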
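
The idea behind mean encoding with a binary reformulation can be sketched by hand in pandas. This is a simplified illustration, not category_encoders' exact formula: the additive smoothing weight `m`, the fallback to the global mean for unseen categories, and the toy data are all assumptions.

```python
import pandas as pd

# Toy training data (made up for illustration)
toy_train = pd.DataFrame({
    'basin': ['A', 'A', 'A', 'B', 'B', 'C'],
    'status_group': ['functional', 'functional', 'non functional',
                     'non functional', 'functional needs repair', 'functional'],
})
toy_val = pd.DataFrame({'basin': ['A', 'C', 'D']})  # 'D' is unseen in train

# Temporarily reformulate the multi-class target as binary
y_binary = (toy_train['status_group'] == 'functional').astype(int)

# Per-category mean of the binary target, shrunk toward the global mean.
# m is a hypothetical smoothing weight: larger m means more shrinkage.
m = 2
prior = y_binary.mean()
stats = y_binary.groupby(toy_train['basin']).agg(['sum', 'count'])
encoding = (stats['sum'] + m * prior) / (stats['count'] + m)

# Learn the mapping on train only; unseen categories fall back to the prior
train_encoded = toy_train['basin'].map(encoding)
val_encoded = toy_val['basin'].map(encoding).fillna(prior)
print(val_encoded.tolist())
```

The key design point is that the mapping is learned from the training rows only and then applied unchanged to validation and test rows, so no target information from held-out data leaks into the features.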

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation — maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.90, test_size=0.10, 
                              stratify=train['status_group'], random_state=61)
train.shape, val.shape, test.shape

((53460, 41), (5940, 41), (14358, 40))

In [3]:
print(train.info())
train.sample(20)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 53460 entries, 23040 to 55243
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     53460 non-null  int64  
 1   amount_tsh             53460 non-null  float64
 2   date_recorded          53460 non-null  object 
 3   funder                 50178 non-null  object 
 4   gps_height             53460 non-null  int64  
 5   installer              50152 non-null  object 
 6   longitude              53460 non-null  float64
 7   latitude               53460 non-null  float64
 8   wpt_name               53460 non-null  object 
 9   num_private            53460 non-null  int64  
 10  basin                  53460 non-null  object 
 11  subvillage             53126 non-null  object 
 12  region                 53460 non-null  object 
 13  region_code            53460 non-null  int64  
 14  district_code          53460 non-null  int64  
 15

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
730,49888,50.0,2013-01-22,Lga,403,LGA,38.688737,-10.54667,Galani,0,...,soft,good,dry,dry,rainwater harvesting,rainwater harvesting,surface,hand pump,hand pump,non functional
56410,67983,0.0,2013-02-28,Private,1260,Private,34.931724,-10.92111,Shuleni Sekondari,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
1520,43831,0.0,2011-07-20,Adp,1144,DWE,33.155839,-2.123621,Kwaihano,0,...,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
46756,30695,250.0,2013-01-18,Dwe,1266,DWE,30.379222,-4.358021,Kwa Felis,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
10766,17927,0.0,2013-03-03,Rwssp,0,DWE,0.0,-2e-08,Ntililiko,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional
13297,66901,500.0,2011-03-28,Wananchi,1393,wananchi,34.643113,-8.831093,Kwa Nodick Kaleme,0,...,soft,good,enough,enough,river,river/lake,surface,communal standpipe multiple,communal standpipe,functional
15019,45777,0.0,2011-08-06,Norad,752,DWE,31.081836,-8.362348,Kwa Mzee Dismas Sokotela,0,...,soft,good,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,non functional
26591,54795,450.0,2013-02-25,Germany Republi,1380,CES,37.19246,-3.222416,Kwa Ebeza Salo,0,...,soft,good,enough,enough,spring,spring,groundwater,communal standpipe multiple,communal standpipe,functional
2294,40214,0.0,2011-08-05,Government Of Tanzania,0,Government,33.5629,-2.525226,Kwa Stelia Kahabi,0,...,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,non functional
32266,29604,0.0,2012-10-24,World Vision,0,World vision,32.238514,-3.319898,Azimio,0,...,salty,salty,enough,enough,shallow well,shallow well,groundwater,hand pump,hand pump,functional


# Baseline for Classification
Majority class baseline

In [4]:
# Arrange ALL data (uncleaned) into X features matrix and y target vector for baseline

# We'll change this later; we just want a benchmark to see whether and to what extent
# our feature selection and engineering improves our model's predictive power

target = 'status_group'
features = train.columns.drop(target)

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [5]:
# Just predict the most common target value 100% of the time
baseline = train['status_group'].value_counts(normalize=True, dropna=False).iloc[0]
print('If we guessed the majority class for every prediction, our accuracy')
print(f'on the training set would be {round(baseline * 100, 2)}%.')

If we guessed the majority class for every prediction, our accuracy
on the training set would be 54.31%.
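
The baseline above is measured on the training set. To score the same baseline on a held-out set, one option is scikit-learn's `DummyClassifier`. The sketch below uses made-up labels rather than the competition data, and the `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels for illustration (not the competition data)
y_train_toy = np.array(['functional'] * 6 + ['non functional'] * 3
                       + ['functional needs repair'])
y_val_toy = np.array(['functional'] * 2 + ['non functional'] * 2)

# The baseline ignores the features, so a placeholder column is enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

baseline_model = DummyClassifier(strategy='most_frequent')
baseline_model.fit(X_train_toy, y_train_toy)

# Accuracy equals the share of the training majority class in each set
train_acc = baseline_model.score(X_train_toy, y_train_toy)
val_acc = baseline_model.score(X_val_toy, y_val_toy)
print(f'Train baseline: {train_acc:.2f}, validation baseline: {val_acc:.2f}')
```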


With more advanced models

In [6]:
# Random forest accuracy
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=400, random_state=61, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print('Training Accuracy:', pipeline.score(X_train, y_train))
print('Validation Accuracy:', pipeline.score(X_val, y_val)) # mean = .813, strategy='median' = .814

Training Accuracy: 1.0
Validation Accuracy: 0.8144781144781145
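
One of the stretch goals is to get feature importances. A fitted forest inside a pipeline exposes them through `named_steps`. The sketch below uses synthetic stand-in data (an assumption) and a smaller forest so it runs quickly; the `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption; not the waterpumps data)
X_arr, y_toy = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=61)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

toy_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=50, random_state=61)
)
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Labeling the importances with the input column names assumes that every earlier step keeps one output column per input column, in the same order. That is worth checking before relying on the labels, for example if an encoder expands columns or an imputer drops an all-NaN column.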


## Submission 1

In [7]:
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('koul_2.csv', index=False)

## Alternative tree model

In [8]:
from sklearn.ensemble import ExtraTreesClassifier
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    ExtraTreesClassifier(n_estimators=400, random_state=61, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Training Accuracy:', pipeline.score(X_train, y_train))
print('Validation Accuracy:', pipeline.score(X_val, y_val)) # Not as good

Training Accuracy: 1.0
Validation Accuracy: 0.8074074074074075


In [9]:
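# Intended check: columns whose values are more than 1/3 NaNs.
# As written, len(col) is the length of the column *name*, so the threshold
# is only a few rows and every column with more than a handful of NaNs is listed.
# Comparing against len(X_train) / 3 would apply the intended 1/3 cutoff.
for col in X_train.columns:
    if X_train[col].isnull().sum() > (len(col) / 3):
        print(col)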

funder
installer
subvillage
public_meeting
scheme_management
scheme_name
permit
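
A vectorized way to find columns that are more than 1/3 NaN is to take the mean of the null mask, which gives the fraction of missing rows per column. This is a minimal sketch on a made-up frame; the column values are assumptions for illustration.

```python
import pandas as pd

# Toy frame (made up for illustration)
df = pd.DataFrame({
    'funder': ['Lga', None, 'Adp', 'Dwe', 'Rwssp', 'Norad'],
    'scheme_name': [None, None, None, 'K', None, 'M'],
    'gps_height': [403, 1260, 1144, 1266, 0, 1393],
})

# Fraction of missing values per column
nan_fraction = df.isnull().mean()

# Columns where more than 1/3 of the rows are NaN
mostly_missing = nan_fraction[nan_fraction > 1 / 3].index.tolist()
print(mostly_missing)  # ['scheme_name']
```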


In [10]:
def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # # About 3% of the time, latitude has small values near zero,
    # # outside Tanzania, so we'll treat these values like zero.
    # X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # # When columns have zeros and shouldn't, they are like null values.
    # # So we will replace the zeros with nulls, and impute missing values later.
    # # Also create a "missing indicator" column, because the fact that
    # # values are missing may be a predictive signal.
    # cols_with_zeros = ['longitude', 'latitude']
    # for col in cols_with_zeros:
    #     X[col] = X[col].replace(0, np.nan)
    #     # X[col+'_MISSING'] = X[col].isnull()
            
    # # Drop near-duplicate columns, e.g. waterpoint_type_group vs. waterpoint_type
    # duplicates = ['quantity_group', 'payment_type', 'waterpoint_type']
    # X = X.drop(columns=duplicates)

    # Drop recorded_by (never varies) and id (always varies, random).
    # num_private (98.7% zeros) is kept: the model was more accurate *with* it.
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)

    # # Replace nulls with a random sample from the column if column has more than 1/3 NaNs
    # for col in X.columns:
    #   missing = X[col].isnull()
    #   if missing.sum() > (len(X) / 3):
    #     X.loc[missing, col] = X[col].dropna().sample(missing.sum(), replace=True).values

    # # Convert date_recorded to datetime
    # X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # # Extract components from date_recorded, then drop the original column
    # X['year_recorded'] = X['date_recorded'].dt.year
    # # X['month_recorded'] = X['date_recorded'].dt.month
    # # X['day_recorded'] = X['date_recorded'].dt.day
    # X = X.drop(columns='date_recorded')
    
    # # Engineer feature: how many years from construction_year to date_recorded
    # X['years'] = X['year_recorded'] - X['construction_year']
    # X['years_MISSING'] = X['years'].isnull()
    
    # X = X.drop(columns='subvillage')
    
    # return the wrangled dataframe
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
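
Most of the feature engineering in `wrangle` is commented out. The sketch below shows what those steps would do on a two-row toy frame. It is an illustrative subset, not the notebook's final function: `wrangle_sketch` is a hypothetical name, and treating `construction_year == 0` as unknown is an assumption.

```python
import numpy as np
import pandas as pd

def wrangle_sketch(X):
    """Illustrative subset of the wrangling ideas on a toy frame."""
    X = X.copy()

    # Treat the near-zero latitude placeholder as zero, then zeros as missing
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    for col in ['longitude', 'latitude']:
        X[col] = X[col].replace(0, np.nan)
        X[col + '_MISSING'] = X[col].isnull()

    # Extract the year from date_recorded, then drop the original column
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    X['year_recorded'] = X['date_recorded'].dt.year
    X = X.drop(columns='date_recorded')

    # Years from construction to inspection
    # (assumption: construction_year == 0 means "unknown")
    X['construction_year'] = X['construction_year'].replace(0, np.nan)
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
    return X

toy = pd.DataFrame({
    'longitude': [38.688737, 0.0],
    'latitude': [-10.54667, -2e-08],
    'date_recorded': ['2013-01-22', '2013-03-03'],
    'construction_year': [2009, 0],
})
wrangled = wrangle_sketch(toy)
print(wrangled)
```

Because the function copies its input and uses no statistics learned from the data, applying it to train, validation, and test in turn treats all three the same way without leaking information between them.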

In [11]:
# Reassign X, y sets
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test

In [12]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=400, random_state=61, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Training Accuracy:', pipeline.score(X_train, y_train))
print('Validation Accuracy:', pipeline.score(X_val, y_val))

y_pred = pipeline.predict(X_test)

Training Accuracy: 0.9999812944257389
Validation Accuracy: 0.8131313131313131


In [None]:
# Let's try using the median for imputation
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=400, random_state=61, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Training Accuracy:', pipeline.score(X_train, y_train))
print('Validation Accuracy:', pipeline.score(X_val, y_val))  # No significant difference
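
To see what the two imputation strategies actually fill in, here is a small sketch on a made-up, skewed column with one missing value. The numbers are assumptions for illustration, not taken from the waterpumps data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Skewed toy column with one missing value (made up)
X_toy = np.array([[0.0], [0.0], [50.0], [500.0], [np.nan]])

mean_filled = SimpleImputer(strategy='mean').fit_transform(X_toy)
median_filled = SimpleImputer(strategy='median').fit_transform(X_toy)

print(mean_filled[-1, 0])    # 137.5
print(median_filled[-1, 0])  # 25.0
```

On a skewed column the mean is pulled toward the large values while the median is not, so the two strategies can place imputed rows on different sides of a tree's split threshold.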