# Phase 3 Final Project: Identifying Faulty Water Wells


## Introduction

For the Phase 3 final project, we will develop a model to predict water well failure in Tanzania, using data gathered by the Tanzanian government and hosted as a competition by DrivenData.

## Business Understanding

Our goal is to develop a classification model that accurately predicts which water wells in Tanzania will fail. We are given 40 independent variables (columns) on which to base our predictive model. There are 59,400 entries (wells) in our training data and 14,850 in our test data. Tanzania faces increased demand for water based on population growth projections, as well as increased contamination of groundwater storage from mining and agricultural runoff. Additionally, the city of Dar es Salaam is subject to frequent cholera outbreaks due to well water contamination from human runoff. A further threat is posed by changing climate conditions that have shifted rainfall patterns, producing storms with more intense rainfall and increased flooding. As a result, well safety will also be taken into account when determining which water sources will remain viable for human and agricultural use.

## Modeling

For this project, we will iterate through three successively more complex models to reach our goal: a model with tuned parameters that accurately predicts which water wells will fail, given our independent variables.

* We'll start with a single decision tree as our baseline model
* Next, we'll move on to the more complex modeling strategy of a random forest
* We'll conclude with an XGBoost pipeline
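The progression above can be sketched as follows. This is a minimal illustration on synthetic data from scikit-learn's `make_classification`, not the well data. The three-class setup, the hyperparameters, the macro-averaged F1 metric, and the helper name `evaluate_models` are all assumptions made for illustration. XGBoost is an optional third-party package, so the sketch skips that step when it is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

try:
    from xgboost import XGBClassifier  # optional dependency
except ImportError:
    XGBClassifier = None


def evaluate_models(random_state=42):
    """Fit each candidate model on synthetic data and return macro F1 scores.

    Hypothetical helper: the data and hyperparameters are placeholders.
    """
    # Synthetic three-class problem standing in for the well status labels
    X, y = make_classification(
        n_samples=600, n_features=10, n_informative=5,
        n_classes=3, random_state=random_state,
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )

    models = {
        # Baseline: a single, shallow decision tree
        "decision_tree": DecisionTreeClassifier(
            max_depth=5, random_state=random_state
        ),
        # Second iteration: an ensemble of trees
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }
    if XGBClassifier is not None:
        # Third iteration: gradient boosting inside a pipeline. The scaler
        # is not needed by tree models; it stands in for real preprocessing.
        models["xgboost_pipeline"] = Pipeline([
            ("scale", StandardScaler()),
            ("clf", XGBClassifier(n_estimators=100, random_state=random_state)),
        ])

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = f1_score(y_val, model.predict(X_val), average="macro")
    return scores


scores = evaluate_models()
for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Scoring every model on the same held-out split keeps the comparison between iterations fair; the real notebook would replace the synthetic data with the encoded training set.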

### Import and Examine Datasets

Our first step will be to import the necessary libraries.

In [2]:
import pandas as pd

We'll import the three datasets here and perform the following tasks:
* Examine the first 5 rows
* Look for NaN values
* Examine the "water_quality" column values in our training set

In [27]:
# The independent variables that need predictions.
# We will keep our "Test" data separate to prevent data leakage
df_test_values = pd.read_csv('CSVFiles/TestSetValues.csv')
# Examine values
df_test_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
1,51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,...,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
2,17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
3,45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
4,49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,...,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [26]:
# The independent variables for the training set
df_train_values = pd.read_csv('CSVFiles/TrainingSetValues.csv')
# Examine values
df_train_values.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [17]:
#  The dependent variable (status_group) for each of the rows in Training set values
df_train_labels = pd.read_csv('CSVFiles/TrainingSetLabels.csv')
# Examine labels
df_train_labels.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional
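Because the labels and the training values share an `id` column, they can be joined before modeling. The sketch below is one possible approach, not necessarily the one used later in this notebook. It uses small toy frames built from the rows shown above (the names `values`, `labels`, and `merged` are hypothetical) rather than reading the CSV files.

```python
import pandas as pd

# Toy stand-in for df_train_values (first five ids from the output above)
values = pd.DataFrame({
    "id": [69572, 8776, 34310, 67743, 19728],
    "amount_tsh": [6000.0, 0.0, 25.0, 0.0, 0.0],
})

# Toy stand-in for df_train_labels, deliberately in a different row order
labels = pd.DataFrame({
    "id": [19728, 67743, 34310, 8776, 69572],
    "status_group": [
        "functional", "non functional", "functional",
        "functional", "functional",
    ],
})

# Join on id rather than on row position; validate raises an error
# if an id appears more than once on either side
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")
print(merged)
```

Merging on `id` avoids relying on the two files being in the same row order.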


In [35]:
df_train_values.groupby('water_quality').count()

water_quality,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment,payment_type,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
coloured,490,490,490,391,490,391,490,490,490,490,...,490,490,490,490,490,490,490,490,490,490
fluoride,200,200,200,181,200,176,200,200,200,200,...,200,200,200,200,200,200,200,200,200,200
fluoride abandoned,17,17,17,17,17,17,17,17,17,17,...,17,17,17,17,17,17,17,17,17,17
milky,804,804,804,788,804,785,804,804,804,804,...,804,804,804,804,804,804,804,804,804,804
salty,4856,4856,4856,4803,4856,4801,4856,4856,4856,4856,...,4856,4856,4856,4856,4856,4856,4856,4856,4856,4856
salty abandoned,339,339,339,331,339,331,339,339,339,339,...,339,339,339,339,339,339,339,339,339,339
soft,50818,50818,50818,47945,50818,47948,50818,50818,50818,50818,...,50818,50818,50818,50818,50818,50818,50818,50818,50818,50818
unknown,1876,1876,1876,1309,1876,1296,1876,1876,1876,1876,...,1876,1876,1876,1876,1876,1876,1876,1876,1876,1876
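In the output above, the `funder` and `installer` counts are lower than the `id` counts within each group because `count()` skips missing values. A more direct way to look for NaN values and to tally a single column is sketched below on a toy frame; the frame `toy` and its contents are illustrative assumptions, and the same two calls could be applied to `df_train_values`.

```python
import pandas as pd

# Toy frame with a few missing entries, standing in for df_train_values
toy = pd.DataFrame({
    "funder": ["Roman", None, "Unicef", "Grumeti"],
    "installer": ["Roman", None, "UNICEF", "GRUMETI"],
    "water_quality": ["soft", "soft", "salty", "soft"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Frequency of each water_quality category, including NaN if present
quality_counts = toy["water_quality"].value_counts(dropna=False)
print(quality_counts)
```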


### 1st Model: Simple Decision Tree

### 2nd Model: Random Forest

### 3rd Model: XG Boost Pipeline

### Classification Metrics
For our models, it will be important both to capture as many well failures as we can (recall) and to be sure that the failures we predict are correct (precision). Fortunately, we can use the F1 score (the harmonic mean of precision and recall) to balance the two.
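To make the relationship concrete, the sketch below computes precision, recall, and F1 (the harmonic mean of precision and recall) from raw counts for a single class. The counts and the helper name `precision_recall_f1` are made up for illustration; scikit-learn's `precision_score`, `recall_score`, and `f1_score` compute the same quantities from label arrays.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and
    false negative counts. Hypothetical helper for illustration."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 80 failures correctly flagged, 20 false alarms,
# 40 failures missed
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot earn a high F1 by maximizing recall while ignoring precision, or the reverse.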

After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model

## Conclusion

### Findings
A predictive finding might include:

* How well your model is able to predict the target
* What features are most important to your model

### Predictions

* The contexts/situations where the predictions made by your model would and would not be useful for your stakeholder and business problem
* Suggestions for how the business might modify certain input variables to achieve certain target results

## Thank You