# Data Science - Module 3 - Final Project Submission

* Student Name: **James Toop**
* Student Pace: **Self Paced**
* Scheduled project review date/time: **TBC**
* Instructor name: **Jeff Herman**
* Blog post URL: **TBC**

---
<a name="business-case"></a>
## 1. Business Case

Tanzania has a water and sanitation crisis. Only 50% of the population of 53 million have access to an improved source of safe water, and 34% of the population has access to improved sanitation. The demand for both water and sanitation is high.

Water is essential to life, yet millions of people around the world still lack access to clean water. Drinking dirty, disease-carrying water is one of the most common causes of death in the developing world.

Did you know that 748 million people in the world don't have access to safe water?

Water wells provide clean water for years. In rural areas they are a lifeline for the inhabitants, as a well may be the only source of potable water.

Using data from Taarifa and the Tanzanian Ministry of Water, can you predict which pumps are functional, which need some repairs, and which don't work at all? This is an intermediate-level practice competition. Predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

The goal is to predict the operating condition of a waterpoint for each record in the dataset. You are provided the following set of information about the waterpoints:

* **`amount_tsh`** : Total static head (amount of water available to waterpoint)
* **`date_recorded`** : The date the row was entered
* **`funder`** : Who funded the well
* **`gps_height`** : Altitude of the well
* **`installer`** : Organization that installed the well
* **`longitude`** : GPS coordinate
* **`latitude`** : GPS coordinate
* **`wpt_name`** : Name of the waterpoint if there is one
* **`num_private`** : NO FIELD DEFINITION, CONSIDER DROPPING
* **`basin`** : Geographic water basin
* **`subvillage`** : Geographic location
* **`region`** : Geographic location
* **`region_code`** : Geographic location (coded)
* **`district_code`** : Geographic location (coded)
* **`lga`** : Geographic location
* **`ward`** : Geographic location
* **`population`** : Population around the well
* **`public_meeting`** : NO FIELD DEFINITION, CONSIDER DROPPING
    * True
    * False
* **`recorded_by`** : Group entering this row of data
* **`scheme_management`** : Who operates the waterpoint
    * VWC
    * WUG
    * Water authority
    * WUA
    * Water Board
    * Parastatal
    * Private operator
    * Company
    * Other
    * SWC
    * Trust
    * None
* **`scheme_name`** : Who operates the waterpoint
* **`permit`** : If the waterpoint is permitted
    * True
    * False
* **`construction_year`** : Year the waterpoint was constructed
* **`extraction_type`** : The kind of extraction the waterpoint uses (mostly brand names of pumps) : CONSIDER DROPPING?
    * gravity
    * nira/tanira
    * other
    * submersible
    * swn 80
    * mono
    * india mark ii
    * afridev
    * ksb 
    * other - rope pump
    * other - swn 81
    * windmill
    * india mark iii
    * cemo
    * other - play pump
    * walimi
    * climax
    * other - mkulima/shinyanga
* **`extraction_type_group`** : The kind of extraction the waterpoint uses : CONSIDER DROPPING?
    * gravity
    * nira/tanira
    * other
    * submersible
    * swn 80
    * mono
    * india mark ii
    * afridev
    * rope pump
    * other handpump
    * other motorpump
    * wind-powered
    * india mark iii
* **`extraction_type_class`** : The kind of extraction the waterpoint uses
    * gravity
    * handpump
    * other
    * submersible
    * motorpump
    * rope pump
    * wind-powered
* **`management`** : How the waterpoint is managed : TOO MUCH DETAIL, CONSIDER DROPPING?
    * vwc
    * wug
    * water board
    * wua
    * private operator
    * parastatal
    * water authority
    * other
    * company
    * unknown
    * other - school
    * trust
* **`management_group`** : How the waterpoint is managed
    * user-group
    * commercial
    * parastatal
    * other
    * unknown
* **`payment`** : What the water costs : DUPLICATE FIELD CONSIDER DROPPING?
    * never pay
    * pay per bucket
    * pay monthly
    * pay when scheme fails
    * pay annually
    * other
    * unknown
* **`payment_type`** : What the water costs
    * never pay
    * per bucket
    * monthly
    * on failure
    * annually
    * other
    * unknown
* **`water_quality`** : The quality of the water
    * soft
    * salty
    * milky
    * coloured
    * salty abandoned
    * fluoride
    * fluoride abandoned
    * unknown
* **`quality_group`** : The quality of the water
    * good
    * salty
    * milky
    * colored
    * fluoride
    * unknown
* **`quantity`** : The quantity of water
    * enough
    * insufficient
    * dry
    * seasonal
    * unknown
* **`quantity_group`** : The quantity of water : DUPLICATE FIELD CONSIDER DROPPING?
    * enough
    * insufficient
    * dry
    * seasonal
    * unknown
* **`source`** : The source of the water : CONSIDER DROPPING?
    * spring
    * shallow well
    * machine dbh
    * river
    * rainwater harvesting
    * hand dtw
    * lake
    * dam
    * other
    * unknown
* **`source_type`** : The source of the water : CONSIDER DROPPING?
    * spring
    * shallow well
    * borehole
    * river/lake
    * rainwater harvesting
    * dam
    * other
* **`source_class`** : The source of the water
    * groundwater
    * surface
    * unknown
* **`waterpoint_type`** : The kind of waterpoint : CONSIDER DROPPING?
    * communal standpipe
    * hand pump
    * communal standpipe multiple
    * improved spring
    * cattle trough
    * dam
    * other    
* **`waterpoint_type_group`** : The kind of waterpoint : CONSIDER DROPPING?
    * communal standpipe
    * hand pump
    * improved spring
    * cattle trough
    * dam
    * other

---
<a name="eda"></a>
## 2. Exploratory Data Analysis (EDA)

<a name="data-discovery"></a>
### 2A. Data Discovery

This section takes a first look at the supplied datasets: it investigates and documents the available data fields and their relationships, and highlights any potential issues or shortcomings.

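In [1]:
# Import the relevant libraries for data discovery and exploratory data analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress all warnings to keep notebook output readable (this also hides deprecation notices)
import warnings
warnings.filterwarnings('ignore')

# Set styles and color palette for Seaborn
# Matplotlib 3.6 renamed the bundled seaborn styles to 'seaborn-v0_8-*', so use whichever name is available
whitegrid = 'seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(whitegrid)
pres_palette = ['#599191','#8dd8d3','#0b6374','#c0791b','#424242','#fd5b58','#d7e6a3','#d558ab','#27278b']
sns.set_palette(sns.color_palette(pres_palette))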

In [11]:
# Import the waterpoints training data file from the repository then inspect the data
waterpoints = pd.read_csv('training-set-values.csv')
waterpoints.head()

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |


In [12]:
waterpoints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 40 columns):
id                       59400 non-null int64
amount_tsh               59400 non-null float64
date_recorded            59400 non-null object
funder                   55765 non-null object
gps_height               59400 non-null int64
installer                55745 non-null object
longitude                59400 non-null float64
latitude                 59400 non-null float64
wpt_name                 59400 non-null object
num_private              59400 non-null int64
basin                    59400 non-null object
subvillage               59029 non-null object
region                   59400 non-null object
region_code              59400 non-null int64
district_code            59400 non-null int64
lga                      59400 non-null object
ward                     59400 non-null object
population               59400 non-null int64
public_meeting           56066 non-null object
...
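The `info()` output shows fewer than 59400 non-null entries for `funder` (55765), `installer` (55745), `subvillage` (59029) and `public_meeting` (56066), which means 3635, 3655, 371 and 3334 missing values respectively. A minimal sketch of a helper that tabulates such gaps for any DataFrame follows. The function name `missing_summary` and the small `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have gaps."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    # Keep only the columns with at least one missing value, worst first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Tiny illustrative frame; the NaN positions are made up for the example
sample = pd.DataFrame({
    "funder": ["Roman", np.nan, "Unicef", np.nan],
    "installer": ["Roman", "GRUMETI", np.nan, "Artisan"],
    "wpt_name": ["none", "Zahanati", "Kwa Mahundi", "Shuleni"],
})

print(missing_summary(sample))
```

On the real data the same call would be `missing_summary(waterpoints)`.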

In [13]:
waterpoints['public_meeting'].value_counts()

True     51011
False     5055
Name: public_meeting, dtype: int64
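The two counts add up to 56066, matching the non-null count reported by `info()`, so the remaining 3334 rows are missing and are silently left out because `value_counts()` drops `NaN` by default. The sketch below shows how `dropna=False` surfaces them. The `public_meeting` series here is a made-up stand-in for the real column.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for an object column holding True / False / NaN
public_meeting = pd.Series([True, True, False, np.nan, True, np.nan], dtype="object")

# Default behaviour: NaN rows are excluded from the tally
counts_default = public_meeting.value_counts()

# dropna=False adds a NaN row so the size of the gap is visible
counts_with_nan = public_meeting.value_counts(dropna=False)

print(counts_default)
print(counts_with_nan)
```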

In [20]:
waterpoints['quantity_group'].value_counts()

enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: quantity_group, dtype: int64
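The field list flags `quantity_group` as a possible duplicate of `quantity`. Before dropping a column on that basis, it may be worth confirming that the two really agree row for row. The following is a minimal sketch of such a check. The helper name `is_duplicate_column` and the `sample` rows are assumptions made for illustration.

```python
import pandas as pd


def is_duplicate_column(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True when two columns hold the same value in every row (NaN matches NaN)."""
    a, b = df[col_a], df[col_b]
    return bool(((a == b) | (a.isna() & b.isna())).all())


# Illustrative rows using category labels from the field list
sample = pd.DataFrame({
    "quantity": ["enough", "insufficient", "dry", "seasonal", "unknown"],
    "quantity_group": ["enough", "insufficient", "dry", "seasonal", "unknown"],
    "payment": ["pay annually", "never pay", "pay per bucket", "never pay", "unknown"],
    "payment_type": ["annually", "never pay", "per bucket", "never pay", "unknown"],
})

print(is_duplicate_column(sample, "quantity", "quantity_group"))  # True
print(is_duplicate_column(sample, "payment", "payment_type"))     # False
```

The second call returns `False` because `payment` and `payment_type` use different labels for the same categories, so an exact comparison alone would not catch that kind of duplication.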

In [26]:
waterpoints['source'].value_counts()

spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: source, dtype: int64
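The field list suggests that `source`, `source_type` and `source_class` describe the same thing at decreasing levels of detail (for example `machine dbh` is a `borehole`, which is `groundwater`). If each detailed value always rolls up to a single coarser value, the coarser column adds no information beyond the detailed one. Here is a sketch of how that could be checked. The helper name `maps_to_single_parent` is an assumption, and the `sample` rows reuse the five records shown in the `head()` output.

```python
import pandas as pd


def maps_to_single_parent(df: pd.DataFrame, child: str, parent: str) -> bool:
    """True when every value of `child` co-occurs with exactly one `parent` value."""
    return bool(df.groupby(child)[parent].nunique().eq(1).all())


# The five records from the head() output, reduced to the three source columns
sample = pd.DataFrame({
    "source": ["spring", "rainwater harvesting", "dam", "machine dbh", "rainwater harvesting"],
    "source_type": ["spring", "rainwater harvesting", "dam", "borehole", "rainwater harvesting"],
    "source_class": ["groundwater", "surface", "surface", "groundwater", "surface"],
})

print(maps_to_single_parent(sample, "source", "source_type"))        # True
print(maps_to_single_parent(sample, "source_type", "source_class"))  # True
print(maps_to_single_parent(sample, "source_class", "source"))       # False
```

The last call fails in the reverse direction, as expected, because one `source_class` covers several `source` values.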

In [43]:
waterpoints['permit'].value_counts()

True     38852
False    17492
Name: permit, dtype: int64

In [15]:
# Import the relevant data file from the repository then inspect the data
waterpoints_status = pd.read_csv('training-set-labels.csv')
waterpoints_status['status_group'].value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64
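The three classes are far from balanced. As a rough sketch, the counts above can be turned into percentages, along with the accuracy a model would reach by always predicting the majority class. The variable names are illustrative. On the real data, `waterpoints_status['status_group'].value_counts(normalize=True)` gives the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
status_counts = pd.Series(
    {"functional": 32259, "non functional": 22824, "functional needs repair": 4317},
    name="status_group",
)

# Share of each class as a percentage of all labelled waterpoints
status_pct = (status_counts / status_counts.sum() * 100).round(1)
print(status_pct)

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = status_counts.max() / status_counts.sum()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Roughly 54.3% of waterpoints are functional, 38.4% are non functional and only 7.3% need repair, so the smallest class is worth keeping in mind when choosing an evaluation metric.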