Pixeltests School Data Science

*Unit 2, Sprint 2, Module 2*

---

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/pixeltests/datasets/main/' #You might not need this, use the data from Kaggle directly!
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

# Module Project: Random Forests

This week, the module projects focus on creating and improving a model for the Tanzania Water Pump dataset. Your goal is to create a model that predicts whether a water pump is functional, non-functional, or needs repair.

Dataset source: [DrivenData.org](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

## Directions

The tasks for this project are as follows:

- **Task 1:** Enter the [Kaggle](https://www.kaggle.com/t/6169ee7701164d24943c98eda2de9b5e) competition using exactly this link!
- **Task 2:** Use the `wrangle` function to import training and test data.
- **Task 3:** Split training data into feature matrix `X` and target vector `y`.
- **Task 4:** Split feature matrix `X` and target vector `y` into training and validation sets.
- **Task 5:** Establish the baseline accuracy score for your dataset.
- **Task 6:** Build and train `model_rf`.
- **Task 7:** Calculate the training and validation accuracy score for your model.
- **Task 8:** Adjust model's `max_depth` to reduce overfitting.
- **Task 9 `stretch goal`:** Create a horizontal bar chart showing the 10 most important features for your model.

You should limit yourself to the following libraries for this project:

- `category_encoders`
- `matplotlib`
- `pandas`
- `pandas-profiling`
- `sklearn`

# I. Wrangle Data

In [62]:
def wrangle(fm_path, tv_path=None):
    if tv_path:
        df = pd.merge(pd.read_csv(fm_path, 
                                  na_values=[0, -2.000000e-08]),
                      pd.read_csv(tv_path)).set_index('id')
    else:
        df = pd.read_csv(fm_path, 
                         na_values=[0, -2.000000e-08],
                         index_col='id')

    # Drop constant columns
    df.drop(columns=['recorded_by'], inplace=True)
    # Drop HCCCs
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)

    # Drop duplicate columns
    dupe_cols = [col for col in df.head(15).T.duplicated().index
                 if df.head(15).T.duplicated()[col]]
    df.drop(columns=dupe_cols, inplace=True)             

    return df

In [78]:
def preprocess(df):
    # Drop constant column
    df.drop(columns=['recorded_by'], inplace=True)
    # Drop HCCCs (high-cardinality categorical columns)
    cutoff = 100
    drop_cols = [col for col in df.select_dtypes('object').columns
                 if df[col].nunique() > cutoff]
    df.drop(columns=drop_cols, inplace=True)
    # Engineer pump age in years, relative to 2021
    df['pump_age'] = 2021 - df['construction_year']
    # Drop duplicate columns (compared on the first 15 rows only)
    dupe_cols = [col for col in df.head(15).T.duplicated().index
                 if df.head(15).T.duplicated()[col]]
    df.drop(columns=dupe_cols, inplace=True)

    return df

**Task 1:** Sign up for a [Kaggle](https://www.kaggle.com/) account. Choose a username that's based on your real name. Like GitHub, Kaggle is part of your public profile as a data scientist.

**Task 2:** Modify the `wrangle` function to engineer a `'pump_age'` feature. Then use the function to read `train_features.csv` and `train_labels.csv` into the DataFrame `df`, and `test_features.csv` into the DataFrame `X_test`.
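
The `'pump_age'` step can be sketched on a toy frame as follows. This is a minimal illustration, not the notebook's own code: `add_pump_age` and `reference_year` are hypothetical names, and the sketch assumes that a `construction_year` of 0 means "unknown", mirroring the `na_values=[0, ...]` argument used when reading the CSV files.

```python
import numpy as np
import pandas as pd


def add_pump_age(df, reference_year=2021):
    """Return a copy of df with a 'pump_age' column (hypothetical helper)."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
    # Treat 0 as an unknown year so it does not produce an age of 2021
    year = df['construction_year'].replace(0, np.nan)
    df['pump_age'] = reference_year - year
    return df


# Toy data (assumed values, not rows from the pump dataset)
toy = pd.DataFrame({'construction_year': [1999, 0, 2011]}, index=[10, 11, 12])
toy_aged = add_pump_age(toy)
print(toy_aged)
```

Returning a copy keeps the helper safe to call more than once on the same frame, which in-place drops do not allow.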

In [79]:
import pandas as pd

# Read 0 and -2e-08 as missing values (NaN)
train_features = pd.read_csv('/content/train_features.csv', na_values=[0, -2.000000e-08])
train_labels = pd.read_csv('/content/train_labels.csv', na_values=[0, -2.000000e-08])
# Attach the target column; this relies on both files listing rows in the same order
train_features['status_group'] = train_labels['status_group']
df = preprocess(train_features)
X_test = pd.read_csv('/content/test_features.csv', na_values=[0, -2.000000e-08])


In [80]:
X_test=preprocess(X_test)

In [81]:
X_test.shape


(11880, 31)

In [82]:
df.shape

(47520, 32)

# II. Split Data

**Task 3:** Split your DataFrame `df` into a feature matrix `X` and the target vector `y`. You want to predict `'status_group'`.

In [83]:
X = df.drop('status_group',axis=1)
y = df['status_group']

**Task 4:** Using a randomized split, divide `X` and `y` into a training set (`X_train`, `y_train`) and a validation set (`X_val`, `y_val`).

In [84]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X,y,train_size=0.8)
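
The split above draws a different random sample on every run because no seed is set. A hedged sketch of a reproducible, stratified variant is shown here on toy data; `X_toy`, `y_toy`, and the seed `42` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y (assumed data, not the pump dataset)
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series(['functional'] * 60 + ['non functional'] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy,
    train_size=0.8,
    random_state=42,  # fixed seed: the same rows land in each part on every run
    stratify=y_toy,   # keep class proportions equal in both parts
)

print(len(X_tr), len(X_va))
print(y_va.value_counts(normalize=True))
```

Stratifying matters most for small classes such as `'functional needs repair'`, which could otherwise be under-represented in the validation set by chance.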

# III. Establish Baseline

**Task 5:** Since this is a **classification** problem, you should establish a baseline accuracy score. Figure out what is the majority class in `y_train` and what percentage of your training observations it represents.

In [85]:
y_train.value_counts(normalize=True)


#baseline_acc = 
#print('Baseline Accuracy Score:', baseline_acc)

functional                 0.545297
non functional             0.381629
functional needs repair    0.073074
Name: status_group, dtype: float64
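
The baseline accuracy is the share of the majority class, which in the output above is `functional` at roughly 0.545. A minimal sketch of computing it directly, using a toy target vector with assumed values:

```python
import pandas as pd

# Toy target vector (assumed values, not the real labels)
y_toy = pd.Series(
    ['functional'] * 5 + ['non functional'] * 4 + ['functional needs repair']
)

class_shares = y_toy.value_counts(normalize=True)
majority_class = class_shares.idxmax()  # label of the most frequent class
baseline_acc = class_shares.max()       # accuracy of always predicting that label

print('Majority class:', majority_class)
print('Baseline Accuracy Score:', baseline_acc)
```

Replacing `y_toy` with `y_train` gives the value that the commented-out `baseline_acc` line in the cell above is meant to hold.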

In [86]:
from sklearn.metrics import accuracy_score
sample_submission=pd.read_csv('/content/tanzania sample solution.csv')
sample_submission.shape

(11880, 2)

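In [88]:
import numpy as np

# Majority-class baseline: predict code 0 for every test row
# (0 is the code that 'functional' maps to in the submission mapping used later)
l = np.zeros(len(X_test), dtype=int)
y_pred = pd.DataFrame(l)
y_pred

# Reuse the ID column from the sample submission and attach the baseline predictions
submission = sample_submission[['S.No.']].copy()
submission['status_group'] = y_pred
submission.to_csv('baseline1.csv', index=False)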

# IV. Build Model

**Task 6:** Build a `Pipeline` named `model_rf`, and fit it to your training data. Your `Pipeline` should include:

- an `OrdinalEncoder` transformer for categorical features.
- a `SimpleImputer` transformer for missing values.
- a `RandomForestClassifier` predictor.

**Note:** Don't forget to set the `random_state` parameter for your `RandomForestClassifier`. Also, to decrease training time, set `n_jobs` to `-1`.

In [91]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

model_rf = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_jobs=-1, bootstrap=True, oob_score=True, random_state=25),
)
model_rf.fit(X_train, y_train)


# V. Check Metrics

**Task 7:** Calculate the training and validation accuracy scores for `model_rf`.

In [92]:
# accuracy_score expects the true labels first, then the predictions
training_acc = accuracy_score(y_train, model_rf.predict(X_train))
val_acc = accuracy_score(y_val, model_rf.predict(X_val))

print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

Training Accuracy Score: 1.0
Validation Accuracy Score: 0.7990319865319865
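
A training accuracy of 1.0 against a validation accuracy near 0.80 points to overfitting. When a forest is fitted with `oob_score=True`, as in Task 6, scikit-learn also stores an out-of-bag accuracy estimate that gives a second, validation-like number without touching `X_val`. The sketch below shows this on synthetic data; `X_toy`, `y_toy`, and the hyperparameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumed, not the pump dataset)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=25)
forest.fit(X_toy, y_toy)

# oob_score_ is the accuracy on rows that each tree did not see during bootstrapping
print('Training accuracy:', forest.score(X_toy, y_toy))
print('OOB accuracy:', forest.oob_score_)
```

For a pipeline built with `make_pipeline`, the fitted forest should be reachable as `model_rf.named_steps['randomforestclassifier']`, since step names default to the lowercased class name.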



# VI. Tune Model

**Task 8:** Tune `n_estimators` and `max_depth` hyperparameters for your `RandomForestClassifier` to get the best validation accuracy score for `model_rf`. 

In [94]:
# Use this cell to experiment and then change 
# your model hyperparameters in Task 6
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

model_rf = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    RandomForestClassifier(max_depth=10, n_estimators=10, random_state=25),
)
model_rf.fit(X_train, y_train)

# accuracy_score expects the true labels first, then the predictions
training_acc = accuracy_score(y_train, model_rf.predict(X_train))
val_acc = accuracy_score(y_val, model_rf.predict(X_val))
print('Training Accuracy Score:', training_acc)
print('Validation Accuracy Score:', val_acc)

Training Accuracy Score: 0.7693339646464646
Validation Accuracy Score: 0.7488425925925926
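
Trying one setting at a time makes it hard to see the trade-off between training and validation accuracy. A minimal sketch of a small sweep over `max_depth` follows; the synthetic data, the candidate depths, and `n_estimators=50` are assumptions chosen only to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed, not the pump dataset)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0
)

results = {}
for depth in [2, 5, 10, None]:  # None lets each tree grow until its leaves are pure
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=25)
    forest.fit(X_tr, y_tr)
    results[depth] = (
        accuracy_score(y_tr, forest.predict(X_tr)),
        accuracy_score(y_va, forest.predict(X_va)),
    )

for depth, (train_acc, val_acc) in results.items():
    print(f'max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}')

# Pick the depth with the highest validation accuracy
best_depth = max(results, key=lambda d: results[d][1])
print('Best max_depth by validation accuracy:', best_depth)
```

The same loop can wrap the full pipeline from Task 6 by building `make_pipeline(...)` inside it and fitting on `X_train`, `y_train`.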


# VII. Communicate Results

**Task 9:** Generate a list of predictions for `X_test`. The list should be named `y_pred`.

In [95]:
y_pred1 = model_rf.predict(X_test)
len(y_pred1)

# Check the model predictions (y_pred1), not the earlier baseline y_pred
assert len(y_pred1) == len(X_test), f'Your list of predictions should have {len(X_test)} items in it. '

In [96]:
len(X_test)
len(y_pred1)

11880

**Task 11 `stretch goal`:** Create a DataFrame `submission` whose index is the same as `X_test` and that has one column `'status_group'` with your predictions. Next, save this DataFrame as a CSV file and upload your submissions to our competition site. 

**Note:** Check the `sample_submission.csv` file on the competition website to make sure your submission follows the same formatting.

In [100]:
# Map class labels to integer codes for the submission file
label_codes = {'functional': 0, 'functional needs repair': 1, 'non functional': 2}
y_pred1 = pd.Series(y_pred1).map(label_codes)



In [101]:
# Reuse the ID column from the sample submission and attach the predictions
submission = sample_submission[['S.No.']].copy()
submission['status_group'] = y_pred1
submission.to_csv('baselinersv.csv',index=False)
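
The Directions list also mentions a stretch goal: a horizontal bar chart of the 10 most important features. The notebook does not include one, so the following is only a sketch on synthetic data; the feature names `feat_0` ... `feat_14` and the forest settings are assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns (assumed, not the pump dataset)
X_arr, y_toy = make_classification(n_samples=400, n_features=15, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(15)])

forest = RandomForestClassifier(n_estimators=50, random_state=25).fit(X_toy, y_toy)

# Pair each importance with its column name, then keep the 10 largest
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
top10 = importances.sort_values().tail(10)  # ascending, so the largest bar is drawn on top

ax = top10.plot(kind='barh')
ax.set_xlabel('Gini importance')
ax.set_title('Top 10 feature importances')
plt.tight_layout()
```

For the pipeline in this notebook, the importances would come from `model_rf.named_steps['randomforestclassifier'].feature_importances_` with `X_train.columns` as the index, assuming the encoder and imputer leave the number and order of columns unchanged.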