# Brian's EDA - Origin Data

The goal is to determine whether water pumps in Tanzania are functional, functional but in need of maintenance, or non-functional.

I thought it would be prudent to bring in soil data, since some soils may accelerate the deterioration of piping and other pump mechanisms. This likely correlates with the `water_quality` column of our original data set, which is a categorical description of the water (soft, salty, milky, and so on).



In [1]:
import numpy as np
import pandas as pd

In [2]:
# Pull unmodified data into notebook

X = pd.read_csv('./00_Source_Data/DrivenData/X_train.csv')
y = pd.read_csv('./00_Source_Data/DrivenData/y_train.csv')
X.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [3]:
X.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group'],
      dtype='object')

In [4]:
X.describe()

Unnamed: 0,id,amount_tsh,gps_height,longitude,latitude,num_private,region_code,district_code,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,37115.131768,317.650385,668.297239,34.077427,-5.706033,0.474141,15.297003,5.629747,179.909983,1300.652475
std,21453.128371,2997.574558,693.11635,6.567432,2.946019,12.23623,17.587406,9.633649,471.482176,951.620547
min,0.0,0.0,-90.0,0.0,-11.64944,0.0,1.0,0.0,0.0,0.0
25%,18519.75,0.0,0.0,33.090347,-8.540621,0.0,5.0,2.0,0.0,0.0
50%,37061.5,0.0,369.0,34.908743,-5.021597,0.0,12.0,3.0,25.0,1986.0
75%,55656.5,20.0,1319.25,37.178387,-3.326156,0.0,17.0,5.0,215.0,2004.0
max,74247.0,350000.0,2770.0,40.345193,-2e-08,1776.0,99.0,80.0,30500.0,2013.0


## Columns of Interest and Descriptions

### Water Quality

Water quality is categorical and describes the condition of the water.

This attribute is also summarized in the `quality_group` column.

The quality of the water may be an important factor in determining the longevity of the pump.

In [5]:
X.water_quality.value_counts()

soft                  50818
salty                  4856
unknown                1876
milky                   804
coloured                490
salty abandoned         339
fluoride                200
fluoride abandoned       17
Name: water_quality, dtype: int64

In [6]:
X.quality_group.value_counts()

good        50818
salty        5195
unknown      1876
milky         804
colored       490
fluoride      217
Name: quality_group, dtype: int64
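
One way to confirm that `quality_group` is a coarser summary of `water_quality` is to cross-tabulate the two columns. The sketch below is self-contained and runs on a small hand-made frame (`toy`, an assumption standing in for `X`). The label pairings are inferred from the matching counts above (for example, 4856 + 339 = 5195 for the salty labels), not read from the data itself.

```python
import pandas as pd

# Toy stand-in for X (assumption): one row per label seen in the outputs above.
toy = pd.DataFrame({
    "water_quality": ["soft", "salty", "salty abandoned", "fluoride",
                      "fluoride abandoned", "coloured", "milky", "unknown"],
    "quality_group": ["good", "salty", "salty", "fluoride",
                      "fluoride", "colored", "milky", "unknown"],
})

# Rows: detailed label, columns: grouped label, cells: number of observations.
mapping = pd.crosstab(toy["water_quality"], toy["quality_group"])

# If each detailed label belongs to exactly one group, every row of the
# cross-tabulation has a single non-zero cell.
groups_per_label = (mapping > 0).sum(axis=1)
print(mapping)
print(groups_per_label)
```

Running the same two lines on `X["water_quality"]` and `X["quality_group"]` would show whether the real columns nest as cleanly as the counts suggest.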

### Construction Year

The older a pump is, the more time its mechanisms and piping have had to erode. I postulate that an older pump is more likely to be in need of maintenance than a newer pump.

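There are two problems with this data column:
- 20709 of our observations have no recorded year. They are stored as `0` rather than NaN, as the `describe()` output above shows (count of 59400, minimum and 25th percentile of 0).
- Older pumps have fewer observations.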

We may be able to reconcile this data by binning pumps by age into 'New' and 'Old', where 'Old' is defined by some cut-off year. It is also likely that a large portion (or all) of the missing years fall into the 'Old' category, which would let us impute the missing data with likely information without referencing other parameters.

In [25]:
X.construction_year.head()

0    1999
1    2010
2    2009
3    1986
4       0
Name: construction_year, dtype: int64
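
The fifth value above is `0`, and the earlier `describe()` output reports a full count of 59400 for this column, which suggests that unrecorded years are stored as `0` rather than NaN. A minimal sketch of how the placeholder zeros could be counted, converted to real missing values, and tabulated per decade follows. The series `years` is a toy stand-in for `X.construction_year` (an assumption), so the printed counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X.construction_year (assumption): 0 marks an unrecorded year.
years = pd.Series([1999, 2010, 2009, 1986, 0, 0, 1974, 2004],
                  name="construction_year")

# Count the placeholder zeros, then convert them to real missing values.
n_missing = int((years == 0).sum())
years_clean = years.replace(0, np.nan)

# Observations per decade, ignoring the missing years.
decade = (years_clean.dropna() // 10 * 10).astype(int)
per_decade = decade.value_counts().sort_index()
print(n_missing)
print(per_decade)
```

Converting the zeros to NaN first keeps summary statistics such as the mean year from being dragged toward zero.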

### Source

Source may be important because it determines the mechanism of the pump. A borehole groundwater pump will be designed differently from a rainwater-collecting pump. These mechanisms will have different specifications regarding wear and corrosion, and may therefore end up with different statuses in our model.

In [8]:
X_source = X[[ 'source', 'source_type', 'source_class']]
X_source.head()

Unnamed: 0,source,source_type,source_class
0,spring,spring,groundwater
1,rainwater harvesting,rainwater harvesting,surface
2,dam,dam,surface
3,machine dbh,borehole,groundwater
4,rainwater harvesting,rainwater harvesting,surface


In [9]:
X_source.source.value_counts()

spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: source, dtype: int64

In [10]:
X_source.source_type.value_counts()

spring                  17021
shallow well            16824
borehole                11949
river/lake              10377
rainwater harvesting     2295
dam                       656
other                     278
Name: source_type, dtype: int64

In [11]:
X_source.source_class.value_counts()

groundwater    45794
surface        13328
unknown          278
Name: source_class, dtype: int64
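
The three source columns appear to be nested: the counts above are consistent with each `source` label falling into one `source_type`, and each `source_type` into one `source_class` (for example, 11075 + 874 = 11949 for the borehole labels). A sketch of how that could be checked is shown below. `toy_source` is a hand-made stand-in for `X_source` (an assumption), and its pairings are inferred from the counts rather than taken from the data.

```python
import pandas as pd

# Toy stand-in for X_source (assumption): one row per source label seen above.
toy_source = pd.DataFrame({
    "source": ["spring", "shallow well", "machine dbh", "hand dtw", "river",
               "lake", "rainwater harvesting", "dam", "other", "unknown"],
    "source_type": ["spring", "shallow well", "borehole", "borehole",
                    "river/lake", "river/lake", "rainwater harvesting",
                    "dam", "other", "other"],
    "source_class": ["groundwater", "groundwater", "groundwater",
                     "groundwater", "surface", "surface", "surface",
                     "surface", "unknown", "unknown"],
})

# For a clean hierarchy, every finer label maps to exactly one coarser label.
types_per_source = toy_source.groupby("source")["source_type"].nunique()
classes_per_type = toy_source.groupby("source_type")["source_class"].nunique()
is_nested = bool((types_per_source == 1).all() and (classes_per_type == 1).all())
print(is_nested)
```

If the check holds on the real data, keeping all three columns adds no information beyond the finest one, which is a reason to keep only one or two of them.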

### Inflow / Outflow

The column `amount_tsh` is defined as "Total static head (amount water available to waterpoint)".

By combining `amount_tsh` with `quantity` or `quantity_group`, we may get some idea of whether or not the pump is working properly: a pump that has water to draw but is not drawing it is likely in need of maintenance.

`quantity_group` and `quantity` are identical. We will drop `quantity_group` because it has the longer column name.

`amount_tsh` may be problematic because of the number of `0` entries: a `0` that stands for a missing value cannot be distinguished from a true measurement of 0. We may remove this column later because of this.

In [12]:
X_IO = X[['amount_tsh', 'quantity', 'quantity_group']]
X_IO.head()

Unnamed: 0,amount_tsh,quantity,quantity_group
0,6000.0,enough,enough
1,0.0,insufficient,insufficient
2,25.0,enough,enough
3,0.0,dry,dry
4,0.0,seasonal,seasonal


In [13]:
X_IO.quantity.value_counts()

enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: quantity, dtype: int64

In [14]:
X_IO.quantity_group.value_counts()

enough          33186
insufficient    15129
dry              6246
seasonal         4050
unknown           789
Name: quantity_group, dtype: int64

In [15]:
test_df = X_IO.quantity_group == X_IO.quantity
test_df.value_counts()

True    59400
dtype: int64

In [16]:
X_IO.amount_tsh.describe()

count     59400.000000
mean        317.650385
std        2997.574558
min           0.000000
25%           0.000000
50%           0.000000
75%          20.000000
max      350000.000000
Name: amount_tsh, dtype: float64
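
To gauge how serious the zero problem is, one option is to compute the share of zero entries and compare them with the reported `quantity`. The sketch below uses only the five rows shown in the `head()` output as a toy frame (`toy_io`, an assumption), so its numbers say nothing about the full data set.

```python
import pandas as pd

# Toy stand-in for X_IO (assumption): the five rows shown in the head() output.
toy_io = pd.DataFrame({
    "amount_tsh": [6000.0, 0.0, 25.0, 0.0, 0.0],
    "quantity": ["enough", "insufficient", "enough", "dry", "seasonal"],
})

# Share of rows where amount_tsh is exactly zero.
zero_share = (toy_io["amount_tsh"] == 0).mean()

# Label each row as "zero" or "positive" and compare with the reported quantity.
# Many zeros on rows marked "enough" would hint that 0 is a placeholder
# rather than a measurement.
tsh_label = (toy_io["amount_tsh"] == 0).map({True: "zero", False: "positive"})
zeros_by_quantity = pd.crosstab(tsh_label.rename("amount_tsh"), toy_io["quantity"])
print(zero_share)
print(zeros_by_quantity)
```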

### Extraction Type

Extraction type is the actual mechanism of the pump. Different mechanisms are easier or more difficult to repair and break down at different rates. This may be more useful than source, as multiple types of mechanisms can be used for the same source type.

In [17]:
X.extraction_type_class.value_counts()

gravity         26780
handpump        16456
other            6430
submersible      6179
motorpump        2987
rope pump         451
wind-powered      117
Name: extraction_type_class, dtype: int64

### Population

{write stuff here}

In [18]:
y.head()

Unnamed: 0,id,status_group
0,69572,functional
1,8776,functional
2,34310,functional
3,67743,non functional
4,19728,functional


## Data Cleaning

Things to do:
- Concatenate X and y so that any rows removed are removed from both.
- Remove unwanted or clerical columns for `political` models and `physical` models.
- Bin `construction_year` and potentially roll smaller values into `other` bins to help with class imbalance.
- Output cleaned data to new csv for modeling purposes.

These cleaning processes need to be done for the spatial join dataset (different notebook) as well.

The spatial join data set and the original data set are being kept separate due to differences in dataframe shape.
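
The roll-up of small categories mentioned in the list above could look something like the following sketch. The helper `roll_up_rare` and its `min_count` threshold are hypothetical names introduced for illustration, and the series is a toy stand-in for a column such as `extraction_type_class` (an assumption). The right threshold for the real data is still an open choice.

```python
import pandas as pd

def roll_up_rare(values: pd.Series, min_count: int, other: str = "other") -> pd.Series:
    """Replace labels seen fewer than min_count times with `other`."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep values whose label is not rare; substitute `other` everywhere else.
    return values.where(~values.isin(rare), other)

# Toy stand-in for a categorical column (assumption).
toy_class = pd.Series(["gravity"] * 5 + ["handpump"] * 4
                      + ["rope pump"] + ["wind-powered"])
rolled = roll_up_rare(toy_class, min_count=3)
print(rolled.value_counts())
```

Using a count threshold rather than a hard-coded list of labels means the same helper can be reused on the spatial join data set.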

In [20]:
y.columns

Index(['id', 'status_group'], dtype='object')

In [19]:
# Check if X and y id values are indexed appropriately
test_df = X['id'] == y['id']
test_df.value_counts()

True    59400
Name: id, dtype: int64

In [21]:
# Concatenate 'target' values from y into dataframe to keep target values indexed appropriately.

X['target'] = y.status_group
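
The assignment above relies on `X` and `y` sharing the same row order, which the `id` check confirms for this data. A sketch of an alternative that does not depend on row order is a key-based merge on `id`. The frames below are toy stand-ins (assumptions), with `toy_y` deliberately shuffled, and `validate="one_to_one"` asks pandas to raise an error if an `id` is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for X and y (assumption); toy_y is in a different row order.
toy_X = pd.DataFrame({"id": [69572, 8776, 67743],
                      "amount_tsh": [6000.0, 0.0, 0.0]})
toy_y = pd.DataFrame({"id": [67743, 69572, 8776],
                      "status_group": ["non functional", "functional", "functional"]})

# Join on the shared key so each row gets the label belonging to its id.
merged = (
    toy_X.merge(toy_y, on="id", how="inner", validate="one_to_one")
         .rename(columns={"status_group": "target"})
)
print(merged)
```

With an inner join, any row later dropped from one side simply disappears from the merged frame, which keeps features and target aligned.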

In [22]:
# Binning construction_year into new boolean column is_new with cut-off year 'NewYear'

# Set cut-off year
NewYear = 2000

# Create new boolean column with cut-off year.
# Unrecorded years are stored as 0, so they evaluate to False (not new).
X['is_new'] = X['construction_year'] >= NewYear
X.is_new.value_counts()

False    38909
True     20491
Name: is_new, dtype: int64

In [23]:
# Dropping columns we're not interested in or do not make sense for analysis

# Set list of kept columns for concatenation with Samira's kept columns
keep_cols = ['is_new', 'extraction_type_class', 'amount_tsh', 'quantity', 
             'source_type', 'source_class', 'quality_group', 'population', 
             'target']

# New dataframe with kept columns
df = X[keep_cols]

In [26]:
# Display cleaned data frame
df

Unnamed: 0,is_new,extraction_type_class,amount_tsh,quantity,source_type,source_class,quality_group,population,target
0,False,gravity,6000.0,enough,spring,groundwater,good,109,functional
1,True,gravity,0.0,insufficient,rainwater harvesting,surface,good,280,functional
2,True,gravity,25.0,enough,dam,surface,good,250,functional
3,False,submersible,0.0,dry,borehole,groundwater,good,58,non functional
4,False,gravity,0.0,seasonal,rainwater harvesting,surface,good,0,functional
...,...,...,...,...,...,...,...,...,...
59395,False,gravity,10.0,enough,spring,groundwater,good,125,functional
59396,False,gravity,4700.0,enough,river/lake,surface,good,56,functional
59397,False,handpump,0.0,enough,borehole,groundwater,fluoride,0,functional
59398,False,handpump,0.0,insufficient,shallow well,groundwater,good,0,functional
