# Brian's EDA

The goal is to determine whether water pumps in Tanzania are functional, functional but in need of maintenance, or non-functional.

I thought it might be prudent to bring in soil data, as some soils would accelerate the deterioration of piping and other pump mechanisms. This likely correlates with the `water_quality` column of our original data set, which is a categorical description of the water's condition (soft, salty, milky, and so on).



In [3]:
import numpy as np
import pandas as pd

In [5]:
# Pull unmodified data into notebook

X = pd.read_csv('./00_Source_Data/DrivenData/X_train.csv')
X.head()

| | id | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | ... | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 69572 | 6000.0 | 2011-03-14 | Roman | 1390 | Roman | 34.938093 | -9.856322 | none | 0 | ... | annually | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 1 | 8776 | 0.0 | 2013-03-06 | Grumeti | 1399 | GRUMETI | 34.698766 | -2.147466 | Zahanati | 0 | ... | never pay | soft | good | insufficient | insufficient | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |
| 2 | 34310 | 25.0 | 2013-02-25 | Lottery Club | 686 | World vision | 37.460664 | -3.821329 | Kwa Mahundi | 0 | ... | per bucket | soft | good | enough | enough | dam | dam | surface | communal standpipe multiple | communal standpipe |
| 3 | 67743 | 0.0 | 2013-01-28 | Unicef | 263 | UNICEF | 38.486161 | -11.155298 | Zahanati Ya Nanyumbu | 0 | ... | never pay | soft | good | dry | dry | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 4 | 19728 | 0.0 | 2011-07-13 | Action In A | 0 | Artisan | 31.130847 | -1.825359 | Shuleni | 0 | ... | never pay | soft | good | seasonal | seasonal | rainwater harvesting | rainwater harvesting | surface | communal standpipe | communal standpipe |


In [7]:
X.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group'],
      dtype='object')

In [9]:
X.describe()

| | id | amount_tsh | gps_height | longitude | latitude | num_private | region_code | district_code | population | construction_year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 | 59400.0 |
| mean | 37115.131768 | 317.650385 | 668.297239 | 34.077427 | -5.706033 | 0.474141 | 15.297003 | 5.629747 | 179.909983 | 1300.652475 |
| std | 21453.128371 | 2997.574558 | 693.11635 | 6.567432 | 2.946019 | 12.23623 | 17.587406 | 9.633649 | 471.482176 | 951.620547 |
| min | 0.0 | 0.0 | -90.0 | 0.0 | -11.64944 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18519.75 | 0.0 | 0.0 | 33.090347 | -8.540621 | 0.0 | 5.0 | 2.0 | 0.0 | 0.0 |
| 50% | 37061.5 | 0.0 | 369.0 | 34.908743 | -5.021597 | 0.0 | 12.0 | 3.0 | 25.0 | 1986.0 |
| 75% | 55656.5 | 20.0 | 1319.25 | 37.178387 | -3.326156 | 0.0 | 17.0 | 5.0 | 215.0 | 2004.0 |
| max | 74247.0 | 350000.0 | 2770.0 | 40.345193 | -2e-08 | 1776.0 | 99.0 | 80.0 | 30500.0 | 2013.0 |


## Columns of Interest and Descriptions

### Water Quality

Water quality is a categorical variable that describes the condition of the water.

This attribute is also summarized in the `quality_group` column.

The quality of the water may be an important factor in determining the longevity of the pump.

In [6]:
X.water_quality.value_counts()

soft                  50818
salty                  4856
unknown                1876
milky                   804
coloured                490
salty abandoned         339
fluoride                200
fluoride abandoned       17
Name: water_quality, dtype: int64

In [8]:
X.quality_group.value_counts()

good        50818
salty        5195
unknown      1876
milky         804
colored       490
fluoride      217
Name: quality_group, dtype: int64
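
The two sets of counts above suggest that `quality_group` is a roll-up of `water_quality` (for example, 4856 salty + 339 salty abandoned = 5195 salty). A minimal sketch of how that could be checked with a cross-tabulation follows. The `sample` frame and its level-to-group mapping are assumptions inferred from the matching counts, not read from the data; on the real data, `sample` would be replaced by `X`.

```python
import pandas as pd

# Toy frame with one row per water_quality level seen in the counts above.
# The pairing with quality_group is an ASSUMPTION inferred from matching counts.
sample = pd.DataFrame({
    "water_quality": ["soft", "salty", "salty abandoned", "unknown", "milky",
                      "coloured", "fluoride", "fluoride abandoned"],
    "quality_group": ["good", "salty", "salty", "unknown", "milky",
                      "colored", "fluoride", "fluoride"],
})

# Rows are the detailed levels, columns are the grouped levels.
mapping = pd.crosstab(sample["water_quality"], sample["quality_group"])

# If every detailed level falls in exactly one group,
# each row of the table has exactly one non-zero cell.
groups_per_level = (mapping > 0).sum(axis=1)
is_pure_rollup = bool((groups_per_level == 1).all())

print(mapping)
print("quality_group is a pure roll-up of water_quality:", is_pure_rollup)
```

If the check holds on `X`, keeping only one of the two columns as a feature would avoid carrying redundant information.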

### Construction Year

The older a pump is, the more time its mechanisms and piping have had to wear and corrode. I postulate that an older pump is more likely to be in need of maintenance than a newer pump.

There are two problems with this data column:
- 20709 of our observations have a construction year of 0, which appears to be a placeholder for a missing value rather than an actual NaN.
- Older pumps have fewer observations.

We may be able to reconcile this data by binning pumps by age into 'New' and 'Old', where 'Old' is defined by some cut-off year. It is also likely that a large portion (or all) of the missing years will fall into the 'Old' category, allowing us to impute our missing data with likely information without referencing other parameters.

In [11]:
X.construction_year.value_counts().head()

0       20709
2010     2645
2008     2613
2009     2533
2000     2091
Name: construction_year, dtype: int64
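
The binning idea described above could be sketched as follows. This is only an illustration under stated assumptions: the `years` values are a toy sample, `CUTOFF_YEAR` is a hypothetical cut-off that has not been chosen yet, and the 'Unknown' label is introduced here so that missing years stay visible until we decide whether to fold them into 'Old'.

```python
import pandas as pd

# Toy values; on the real data this would be X["construction_year"].
years = pd.Series([0, 2010, 2008, 1986, 0, 2000], name="construction_year")

# Treat the 0 placeholder as missing (mask replaces matching values with NaN).
years_clean = years.mask(years == 0)

CUTOFF_YEAR = 2000  # hypothetical cut-off, to be tuned

# Start everything as 'Unknown', then label the rows whose year is known.
# Comparisons against NaN are False, so missing years keep the 'Unknown' label.
age_bin = pd.Series("Unknown", index=years.index, name="age_bin")
age_bin[years_clean >= CUTOFF_YEAR] = "New"
age_bin[years_clean < CUTOFF_YEAR] = "Old"

print(age_bin.value_counts())
```

Keeping 'Unknown' as its own level makes it possible to compare its pump status distribution against 'Old' before committing to the imputation.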

### Source

Source may be important because it determines the mechanism of the pump. A borehole groundwater pump will be designed differently from a rainwater harvesting system. These mechanisms will have different specifications regarding wear and corrosion, which may lead our model to predict different statuses.

In [19]:
X_source = X[['source', 'source_type', 'source_class']]
X_source.head()

| | source | source_type | source_class |
|---|---|---|---|
| 0 | spring | spring | groundwater |
| 1 | rainwater harvesting | rainwater harvesting | surface |
| 2 | dam | dam | surface |
| 3 | machine dbh | borehole | groundwater |
| 4 | rainwater harvesting | rainwater harvesting | surface |


In [20]:
X_source.source.value_counts()

spring                  17021
shallow well            16824
machine dbh             11075
river                    9612
rainwater harvesting     2295
hand dtw                  874
lake                      765
dam                       656
other                     212
unknown                    66
Name: source, dtype: int64

In [21]:
X_source.source_type.value_counts()

spring                  17021
shallow well            16824
borehole                11949
river/lake              10377
rainwater harvesting     2295
dam                       656
other                     278
Name: source_type, dtype: int64

In [22]:
X_source.source_class.value_counts()

groundwater    45794
surface        13328
unknown          278
Name: source_class, dtype: int64
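
The counts above look nested: 11075 machine dbh + 874 hand dtw = 11949 borehole, and 9612 river + 765 lake = 10377 river/lake. A rough sketch of how to confirm that `source`, `source_type` and `source_class` form a strict hierarchy is shown below. The `sample` frame and its mapping are assumptions inferred from those sums; on the real data, `X_source` would take its place.

```python
import pandas as pd

# Toy frame with one row per source level.
# The mapping is an ASSUMPTION inferred from the value counts adding up.
sample = pd.DataFrame({
    "source": ["spring", "shallow well", "machine dbh", "hand dtw", "river",
               "lake", "rainwater harvesting", "dam", "other", "unknown"],
    "source_type": ["spring", "shallow well", "borehole", "borehole",
                    "river/lake", "river/lake", "rainwater harvesting",
                    "dam", "other", "other"],
    "source_class": ["groundwater", "groundwater", "groundwater", "groundwater",
                     "surface", "surface", "surface", "surface",
                     "unknown", "unknown"],
})

# A strict hierarchy means each source has exactly one source_type,
# and each source_type has exactly one source_class.
types_per_source = sample.groupby("source")["source_type"].nunique()
classes_per_type = sample.groupby("source_type")["source_class"].nunique()

is_nested = bool((types_per_source == 1).all() and (classes_per_type == 1).all())
print("source -> source_type -> source_class is nested:", is_nested)
```

If the hierarchy holds on the full data, `source` carries all of the information in the other two columns, and the coarser columns are only useful as a way to reduce the number of levels.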

### Inflow / Outflow

The column `amount_tsh` is defined as "Total static head (amount water available to waterpoint)".

By combining `amount_tsh` with `quantity` or `quantity_group`, we may get some idea of whether or not the pump is working properly: a pump that has water to draw but is not delivering water is likely in need of maintenance.

In [23]:
X_IO = X[['amount_tsh', 'quantity', 'quantity_group']]
X_IO.head()

| | amount_tsh | quantity | quantity_group |
|---|---|---|---|
| 0 | 6000.0 | enough | enough |
| 1 | 0.0 | insufficient | insufficient |
| 2 | 25.0 | enough | enough |
| 3 | 0.0 | dry | dry |
| 4 | 0.0 | seasonal | seasonal |
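
One way to turn the idea above into a feature is to flag pumps that report available head but little or no delivered water. The sketch below is a first pass, not a settled method. It assumes that `amount_tsh > 0` means water is available and that 'dry' and 'insufficient' mean low output; whether a 0 in `amount_tsh` is a true zero or a missing value is not yet known. The first five rows mirror the `head()` output, and the last row is hypothetical, added only so that the flag has something to catch. The column name `suspect` is also made up for illustration.

```python
import pandas as pd

# First five rows mirror the head() output; the sixth row is HYPOTHETICAL.
sample = pd.DataFrame({
    "amount_tsh": [6000.0, 0.0, 25.0, 0.0, 0.0, 500.0],
    "quantity": ["enough", "insufficient", "enough", "dry", "seasonal", "dry"],
})

# ASSUMPTION: a positive amount_tsh means water is available to the waterpoint.
has_head = (sample["amount_tsh"] > 0).rename("has_head")

# ASSUMPTION: these two quantity levels indicate little or no delivered water.
low_output = sample["quantity"].isin(["dry", "insufficient"])

# Water available but not delivered: a candidate signal for needed maintenance.
sample["suspect"] = has_head & low_output

# Overview of how the two variables combine.
print(pd.crosstab(has_head, sample["quantity"]))
print(sample)
```

Because the median of `amount_tsh` is 0, most rows will have `has_head` equal to False, so the flag is only informative if those zeros turn out to be real measurements.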
