BloomTech Data Science

*Unit 2, Sprint 2, Module 3*

---
<p style="padding: 10px; border: 2px solid red;">
    <b>Before you start:</b> Today is the day you should submit the dataset for your Unit 2 Build Week project. You can review the guidelines and make your submission in the Build Week course for your cohort on Canvas.</p>

# Module Project: Hyperparameter Tuning

This sprint, the module projects will focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model to predict whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Use the `wrangle` function to import training and test data.
- **Task 2:** Split training data into feature matrix `X` and target vector `y`.
- **Task 3:** Establish the baseline accuracy score for your dataset.
- **Task 4:** Build `clf_dt`.
- **Task 5:** Build `clf_rf`.
- **Task 6:** Evaluate classifiers using k-fold cross-validation.
- **Task 7:** Tune hyperparameters for the best-performing classifier.
- **Task 8:** Print out the best score and params for the tuned model.
- **Task 9:** Create `submission.csv` and upload to Kaggle.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [None]:
!pip install category_encoders==2.*
!pip install pandas-profiling==2.*





In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/MyDrive/Kaggle Challenge

/content/drive/MyDrive/Kaggle Challenge


In [None]:
!ls

kaggle.json	    sample_submission.csv  train_features.csv
new_submission.csv  test_features.csv	   train_labels.csv


In [None]:
def wrangle(fm_path, tv_path=None):
    """Read the feature matrix (and, optionally, the target vector) and clean it."""
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path,
                                  na_values=[0, -2.000000e-08],
                                  parse_dates=['date_recorded']),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path,
                         na_values=[0, -2.000000e-08],
                         parse_dates=['date_recorded'],
                         index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)

    # Drop high-cardinality categorical columns (HCCCs)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns (judged on the first 15 rows only)
    is_dupe = df.head(15).T.duplicated()
    dupe_cols = is_dupe[is_dupe].index
    df.drop(columns=dupe_cols, inplace=True)

    # Make pump_age feature
    df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']

    # Drop columns that are not used as features
    df = df.drop(columns=['date_recorded', 'num_private', 'management_group',
                          'source_type', 'public_meeting'])

    # One-hot encode 'quantity'
    df = pd.concat([df, pd.get_dummies(df['quantity'], prefix='quantity')], axis=1)
    df = df.drop(columns=['quantity'])

    # Change True/False to 1/0
    df['permit'] = df['permit'] * 1

    return df

df = wrangle(fm_path= 'train_features.csv',
             tv_path = 'train_labels.csv')
X_test = wrangle(fm_path= 'test_features.csv')

**Task 1:** Use the above `wrangle` function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.
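As a rough illustration of the high-cardinality filter inside `wrangle`, the sketch below applies the same `nunique() > cutoff` rule to a small made-up DataFrame. The column values and the cutoff of `3` are hypothetical, chosen only to keep the example small; `wrangle` itself uses a cutoff of `100`.

```python
import pandas as pd

# Toy data (hypothetical): 'name' is unique per row, 'basin' has two values.
# dtype='object' is set explicitly so select_dtypes('object') picks them up.
toy = pd.DataFrame({
    'name': pd.Series(['a', 'b', 'c', 'd', 'e'], dtype='object'),
    'basin': pd.Series(['north', 'south', 'north', 'south', 'north'],
                       dtype='object'),
    'height': [10, 20, 30, 40, 50],
})

cutoff = 3  # assumed threshold for this toy example only

# Same rule as in wrangle: keep a categorical column only if its number
# of distinct values does not exceed the cutoff.
high_card_cols = [col for col in toy.select_dtypes('object').columns
                  if toy[col].nunique() > cutoff]
toy_reduced = toy.drop(columns=high_card_cols)

print(high_card_cols)
print(list(toy_reduced.columns))
```

Only `name` exceeds the toy cutoff here, so it is the only column removed.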

# II. Split Data

**Task 2:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

**Note:** You won't need to do a train-test split because you'll use cross-validation instead.

In [None]:
X = df.drop(columns = ['status_group'])
y = df['status_group']

# III. Establish Baseline

**Task 3:** Since this is a **classification** problem, you should establish a baseline accuracy score. Find the majority class in `y` and the percentage of your training observations that it represents.
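A minimal sketch of the majority-class baseline, using a made-up label vector rather than the real pump data. The idea is that a model which always predicts the most frequent class would be correct exactly as often as that class appears.

```python
import pandas as pd

# Hypothetical labels: 6 functional, 3 non functional, 1 needs repair.
y_toy = pd.Series(['functional'] * 6
                  + ['non functional'] * 3
                  + ['functional needs repair'])

# Share of each class among all observations.
class_shares = y_toy.value_counts(normalize=True)

majority_class = class_shares.idxmax()  # most frequent label
baseline = class_shares.max()           # accuracy of always predicting it

print(majority_class, baseline)
```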

In [None]:
baseline_acc = y.value_counts(normalize=True).max()
print('Baseline Accuracy Score:', baseline_acc)

Baseline Accuracy Score: 0.5429828068772491


# IV. Build Models

**Task 4:** Build a `Pipeline` named `clf_dt`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `DecisionTreeClassifier` predictor.

**Note:** Do not train `clf_dt`. You'll do that in a subsequent task. 

In [None]:
clf_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42)
)

**Task 5:** Build a `Pipeline` named `clf_rf`. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Do not train `clf_rf`. You'll do that in a subsequent task. 

In [None]:
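# The imputer strategy, n_estimators, and max_depth below match the
# best params reported by the random search in Task 8.
clf_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42, n_estimators=125, max_depth=20, n_jobs=-1)
)

# Fitted here on the full training set; this fitted pipeline makes the
# predictions for the submission in Task 9.
clf_rf.fit(X, y)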

Pipeline(steps=[('ordinalencoder',
                 OrdinalEncoder(cols=['basin', 'region', 'scheme_management',
                                      'permit', 'extraction_type',
                                      'extraction_type_class', 'management',
                                      'payment', 'payment_type',
                                      'water_quality', 'quality_group',
                                      'source', 'source_class',
                                      'waterpoint_type'],
                                mapping=[{'col': 'basin',
                                          'data_type': dtype('O'),
                                          'mapping': Internal                   1
Lake Rukwa                 2
Rufiji                     3
Wami / Ruvu                4
L...
unknown        3
NaN           -2
dtype: int64},
                                         {'col': 'waterpoint_type',
                                          'data_type': dtype('O'),
 

In [None]:
sorted(clf_rf.get_params())

['memory',
 'ordinalencoder',
 'ordinalencoder__cols',
 'ordinalencoder__drop_invariant',
 'ordinalencoder__handle_missing',
 'ordinalencoder__handle_unknown',
 'ordinalencoder__mapping',
 'ordinalencoder__return_df',
 'ordinalencoder__verbose',
 'randomforestclassifier',
 'randomforestclassifier__bootstrap',
 'randomforestclassifier__ccp_alpha',
 'randomforestclassifier__class_weight',
 'randomforestclassifier__criterion',
 'randomforestclassifier__max_depth',
 'randomforestclassifier__max_features',
 'randomforestclassifier__max_leaf_nodes',
 'randomforestclassifier__max_samples',
 'randomforestclassifier__min_impurity_decrease',
 'randomforestclassifier__min_samples_leaf',
 'randomforestclassifier__min_samples_split',
 'randomforestclassifier__min_weight_fraction_leaf',
 'randomforestclassifier__n_estimators',
 'randomforestclassifier__n_jobs',
 'randomforestclassifier__oob_score',
 'randomforestclassifier__random_state',
 'randomforestclassifier__verbose',
 'randomforestclassifier_
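The parameter names listed above follow the pattern `<step name>__<parameter name>`, where `make_pipeline` names each step after its lowercased class name. The sketch below shows this with a smaller pipeline that uses only `sklearn` classes (no `OrdinalEncoder`), which is an assumption made to keep the example self-contained.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# A two-step pipeline, used here only to illustrate the naming scheme.
pipe = make_pipeline(SimpleImputer(), DecisionTreeClassifier(random_state=42))

# make_pipeline derives each step name from the lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters are addressed as '<step name>__<parameter name>'.
pipe.set_params(simpleimputer__strategy='median',
                decisiontreeclassifier__max_depth=5)
print(pipe.get_params()['decisiontreeclassifier__max_depth'])
```

These are the same keys a hyperparameter search expects in its parameter grid.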

# V. Check Metrics

**Task 6:** Evaluate the performance of both of your classifiers using k-fold cross-validation.
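For orientation, here is a small sketch of k-fold cross-validation on synthetic data. The dataset from `make_classification` and the bare `DecisionTreeClassifier` are stand-ins, not the pump data or the pipelines built above. With `cv=5`, `cross_val_score` fits the model five times, each time scoring it on the fold that was held out.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data (an assumption for illustration only).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)

# One accuracy score per held-out fold.
scores = cross_val_score(clf, X_toy, y_toy, cv=5)

print(scores)
print('Mean:', scores.mean(), 'STD:', scores.std())
```

The mean summarizes expected performance, and the standard deviation gives a sense of how much it varies from fold to fold.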

In [None]:
# Align X_test with X: 'waterpoint_type_group' is not among the training features.
X_test.drop(columns=['waterpoint_type_group'], inplace=True)

In [None]:
cv_scores_dt = cross_val_score(clf_dt, X, y, cv = 5, n_jobs= -1)
cv_scores_rf = cross_val_score(clf_rf, X, y, cv = 5, n_jobs= -1)

In [None]:
print('CV scores DecisionTreeClassifier')
print(cv_scores_dt)
print('Mean CV accuracy score:', cv_scores_dt.mean())
print('STD CV accuracy score:', cv_scores_dt.std())

CV scores DecisionTreeClassifier
[0.74673822 0.7522096  0.75189394 0.74989478 0.75365674]
Mean CV accuracy score: 0.7508786543926763
STD CV accuracy score: 0.0023929567271821404


In [None]:
print('CV score RandomForestClassifier')
print(cv_scores_rf)
print('Mean CV accuracy score:', cv_scores_rf.mean())
print('STD CV accuracy score:', cv_scores_rf.std())

CV score RandomForestClassifier
[0.79734848 0.80082071 0.79776936 0.80397727 0.79953699]
Mean CV accuracy score: 0.7998905626470607
STD CV accuracy score: 0.0023938689101471876


# VI. Tune Model

**Task 7:** Choose the best performing of your two models and tune its hyperparameters using a `RandomizedSearchCV` named `model`. Make sure that you include cross-validation and that `n_iter` is set to at least `25`.

**Note:** If you're not sure which hyperparameters to tune, check the notes from today's guided project and the `sklearn` documentation. 
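The sketch below shows one way a randomized search over pipeline hyperparameters might look, using synthetic data and a deliberately small search space so that it runs quickly. The data, the value ranges, `n_iter=5`, and `cv=3` are all assumptions for illustration and are smaller than what the task asks for. Passing `random_state` to `RandomizedSearchCV` is an addition that makes the sampled candidates reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic binary data (an assumption for illustration only).
X_toy, y_toy = make_classification(
    n_samples=150, n_features=5, n_informative=3, n_redundant=0,
    random_state=42
)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

# Keys follow the '<step name>__<parameter name>' pattern.
param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(2, 8, 2),
    'randomforestclassifier__n_estimators': range(10, 40, 10),
}

# 2 * 3 * 3 = 18 possible combinations; only 5 are sampled and evaluated.
search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Each sampled candidate is scored with cross-validation, so `best_score_` is a mean cross-validated accuracy rather than a score on the training data.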

In [None]:
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': range(5,50,5),
    'randomforestclassifier__n_estimators': range(25,150,25),
}
model = RandomizedSearchCV(
    clf_rf,
    param_distributions = param_grid,
    n_jobs = -1,
    cv = 5,
    verbose = 1,
    n_iter = 50
)
model.fit(X, y)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('ordinalencoder',
                                              OrdinalEncoder()),
                                             ('simpleimputer', SimpleImputer()),
                                             ('randomforestclassifier',
                                              RandomForestClassifier(n_jobs=-1,
                                                                     random_state=42))]),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'randomforestclassifier__max_depth': range(5, 50, 5),
                                        'randomforestclassifier__n_estimators': range(25, 150, 25),
                                        'simpleimputer__strategy': ['mean',
                                                                    'median']},
                   verbose=1)

**Task 8:** Print out the best score and best params for `model`.

In [None]:
best_score = model.best_score_
best_params = model.best_params_


print('Best score for `model`:', best_score)
print('Best params for `model`:', best_params)

Best score for `model`: 0.8038678597331128
Best params for `model`: {'simpleimputer__strategy': 'median', 'randomforestclassifier__n_estimators': 125, 'randomforestclassifier__max_depth': 20}


# VII. Communicate Results

**Task 9:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting.
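One way to check the format before uploading is to write the submission to an in-memory buffer and inspect the header row. The sketch below uses made-up ids and predictions, and it assumes that the expected header is `id,status_group`; confirm this against `sample_submission.csv`.

```python
import io

import pandas as pd

# Hypothetical ids and predictions, standing in for X_test.index and
# the model's output.
ids = pd.Index([101, 102, 103], name='id')
preds = ['functional', 'non functional', 'functional']

# Naming the column at construction avoids a default column name of 0.
submission_toy = pd.DataFrame({'status_group': preds}, index=ids)

# Write to an in-memory buffer instead of a file to inspect the result.
buffer = io.StringIO()
submission_toy.to_csv(buffer)
header = buffer.getvalue().splitlines()[0]

print(header)
```

Because the index is named `id`, `to_csv` writes that name as the first header field.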

In [None]:
submission = pd.DataFrame(data = clf_rf.predict(X_test), index= X_test.index)

In [None]:
submission

| id | 0 |
| --- | --- |
| 37098 | non functional |
| 14530 | functional |
| 62607 | functional |
| 46053 | non functional |
| 47083 | functional |
| ... | ... |
| 26092 | functional |
| 919 | functional |
| 47444 | non functional |
| 61128 | non functional |


In [None]:
submission.columns = ['status_group']

In [None]:
submission

| id | status_group |
| --- | --- |
| 37098 | non functional |
| 14530 | functional |
| 62607 | functional |
| 46053 | non functional |
| 47083 | functional |
| ... | ... |
| 26092 | functional |
| 919 | functional |
| 47444 | non functional |
| 61128 | non functional |


In [None]:
submission.to_csv('new_submission.csv')


In [None]:
from google.colab import files
files.download('new_submission.csv')
