# Brian's EDA - Origin Data

This notebook works toward predicting whether water pumps in Tanzania are functional, functional but in need of maintenance, or non-functional.

I thought it might be prudent to bring in soil data, since some soils could accelerate the deterioration of piping and other pump mechanisms. This likely correlates with the 'quality' parameter of our original data set, which is a categorical classification of water hardness.



In [8]:
import numpy as np
import pandas as pd

In [9]:
# Pull SpatialJoin data into notebook

xls = pd.ExcelFile("./00_Source_Data/Data/X_Train_SpatialJoin.xls")
X = xls.parse(0)

In [10]:
X.columns

Index(['OBJECTID', 'Join_Count', 'TARGET_FID', 'JOIN_FID', 'X_train_csv_id',
       'X_train_csv_amount_tsh', 'X_train_csv_date_recorded',
       'X_train_csv_funder', 'X_train_csv_gps_height', 'X_train_csv_installer',
       'X_train_csv_longitude', 'X_train_csv_latitude', 'X_train_csv_wpt_name',
       'X_train_csv_num_private', 'X_train_csv_basin',
       'X_train_csv_subvillage', 'X_train_csv_region',
       'X_train_csv_region_code', 'X_train_csv_district_code',
       'X_train_csv_lga', 'X_train_csv_ward', 'X_train_csv_population',
       'X_train_csv_public_meeting', 'X_train_csv_recorded_by',
       'X_train_csv_scheme_management', 'X_train_csv_scheme_name',
       'X_train_csv_permit', 'X_train_csv_construction_year',
       'X_train_csv_extraction_type', 'X_train_csv_extraction_type_group',
       'X_train_csv_extraction_type_class', 'X_train_csv_management',
       'X_train_csv_management_group', 'X_train_csv_payment',
       'X_train_csv_payment_type', 'X_train_csv_water_qu

### Initial Data Cleaning

- reformat column names
- remove 'junk columns'

In [11]:
# Define a list of junk columns
junk_cols = ['OBJECTID', 'Join_Count', 'TARGET_FID', 'JOIN_FID', 'y_train_csv_id']

# Drop junk columns
X.drop(columns=junk_cols, inplace=True)

In [12]:
# Set lists of new and old names for columns.
old_names = ['X_train_csv_id', 
             'X_train_csv_amount_tsh', 'X_train_csv_date_recorded', 
             'X_train_csv_funder', 'X_train_csv_gps_height', 'X_train_csv_installer', 
             'X_train_csv_longitude', 'X_train_csv_latitude', 'X_train_csv_wpt_name', 
             'X_train_csv_num_private', 'X_train_csv_basin', 
             'X_train_csv_subvillage', 'X_train_csv_region', 
             'X_train_csv_region_code', 'X_train_csv_district_code', 
             'X_train_csv_lga', 'X_train_csv_ward', 'X_train_csv_population', 
             'X_train_csv_public_meeting', 'X_train_csv_recorded_by', 
             'X_train_csv_scheme_management', 'X_train_csv_scheme_name', 
             'X_train_csv_permit', 'X_train_csv_construction_year', 
             'X_train_csv_extraction_type', 'X_train_csv_extraction_type_group', 
             'X_train_csv_extraction_type_class', 'X_train_csv_management', 
             'X_train_csv_management_group', 'X_train_csv_payment', 
             'X_train_csv_payment_type', 'X_train_csv_water_quality', 
             'X_train_csv_quality_group', 'X_train_csv_quantity', 
             'X_train_csv_quantity_group', 'X_train_csv_source', 
             'X_train_csv_source_type', 'X_train_csv_source_class', 
             'X_train_csv_waterpoint_type', 'X_train_csv_waterpoint_type_group', 
             'y_train_csv_status_group', 'LANDFORM', 'LITHOLOGY', 
             'SOILS', 'WRB', 'DOMSOILS', 'CODE_WRB']

new_names = ['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height', 
             'installer', 'longitude', 'latitude', 'wpt_name', 'num_private', 
             'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga', 
             'ward', 'population', 'public_meeting', 'recorded_by', 
             'scheme_management', 'scheme_name', 'permit', 'construction_year', 
             'extraction_type', 'extraction_type_group', 'extraction_type_class', 
             'management', 'management_group', 'payment', 'payment_type', 
             'water_quality', 'quality_group', 'quantity', 'quantity_group', 
             'source', 'source_type', 'source_class', 'waterpoint_type', 
             'waterpoint_type_group', 'target', 'landform', 'lithology', 'soils', 
             'wrb', 'dominant_soil', 'code_wrb']

# Rename columns using the newly created lists.
X.rename(columns=dict(zip(old_names, new_names)), inplace=True)

In [13]:
X

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,...,source_class,waterpoint_type,waterpoint_type_group,target,landform,lithology,soils,wrb,dominant_soil,code_wrb
0,69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,...,groundwater,communal standpipe,communal standpipe,functional,TM,MA2,LPe,Eutric Leptosols,LP,LP-eu
1,8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,...,surface,communal standpipe,communal standpipe,functional,LP,UP,PHl,Chromi-Luvic Phaeozems,PH,PH-lv-cr
2,34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,...,surface,communal standpipe multiple,communal standpipe,functional,LP,MA2,LVx,Humi-Rhodic Luvisols,LV,LV-ro-hu
3,67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,...,groundwater,communal standpipe multiple,communal standpipe,non functional,LP,MA2,CMo,Ferralic Cambisols,CM,CM-fl
4,19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,...,surface,communal standpipe,communal standpipe,functional,SH,MA3,LPu,Humi-Umbric Leptosols,LP,LP-um-hu
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57583,60739,10.0,2013-05-03,Germany Republi,1210,CES,37.169807,-3.253847,Area Three Namba 27,0,...,groundwater,communal standpipe,communal standpipe,functional,SH,UP,NTh,Eutric Nitisols,NT,NT-eu
57584,27263,4700.0,2011-05-07,Cefa-njombe,1212,Cefa,35.249991,-9.070629,Kwa Yahona Kuvala,0,...,surface,communal standpipe,communal standpipe,functional,SH,MA2,ACh,Rhodic Acrisols,AC,AC-ro
57585,37057,0.0,2011-04-11,,0,,34.017087,-8.750434,Mashine,0,...,groundwater,hand pump,hand pump,functional,LP,UF,LPe,Eutric Leptosols,LP,LP-eu
57586,31282,0.0,2011-03-08,Malec,0,Musa,35.861315,-6.378573,Mshoro,0,...,groundwater,hand pump,hand pump,functional,SH,IA1,CMo,Chromi-Ferralic Cambisols,CM,CM-fl-cr


## Columns of Interest and Descriptions

### Water Quality

Water quality is a categorical variable that describes the condition of the water.

This attribute is also summarized in the `quality_group` column.

The quality of the water may be an important factor in determining the longevity of the pump.

In [None]:
X.water_quality.value_counts()

In [None]:
X.quality_group.value_counts()

### Construction Year

The older a pump is, the more time the mechanisms and piping have had to erode. I postulate that an older pump is more likely to be in need of maintenance than a newer pump.

There are two problems with this column:
- 20709 of our observations are NaN.
- Older pumps have fewer observations.

We may be able to reconcile this data by binning pumps by age into 'New' and 'Old', where 'Old' is defined by some cut-off year. It is also likely that a large portion (or all) of our NaNs belong in the 'Old' category, which would let us impute the missing values plausibly without referencing other parameters.
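A minimal sketch of the binning idea, run on a small made-up frame rather than on `X`. The cut-off year of 2000, the `age_bin` column name and the toy values are assumptions for illustration, and the sketch assumes missing years are stored as NaN.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X; the values are invented for illustration only.
toy = pd.DataFrame({"construction_year": [1978, 1999, 2000, 2011, np.nan]})

CUTOFF_YEAR = 2000  # assumed cut-off, not a tuned value

# Count missing years before deciding how to impute them.
n_missing = toy["construction_year"].isna().sum()

# NaN >= CUTOFF_YEAR evaluates to False, so missing years fall into 'Old'.
toy["age_bin"] = np.where(toy["construction_year"] >= CUTOFF_YEAR, "New", "Old")

print(n_missing)
print(toy["age_bin"].value_counts())
```

Treating every NaN as 'Old' is a strong assumption, so it may be worth checking the target distribution of the NaN rows against the known 'Old' rows before committing to it.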

In [None]:
X.construction_year.head()

### Source

Source may be important because it determines the mechanism of the pump. A borehole groundwater pump will be designed differently than a rainwater-collecting pump. These mechanisms will have different specifications regarding wear and corrosion, which may lead our model to predict different statuses.

In [None]:
X_source = X[[ 'source', 'source_type', 'source_class']]
X_source.head()

In [None]:
X_source.source.value_counts()

In [None]:
X_source.source_type.value_counts()

In [None]:
X_source.source_class.value_counts()

### Inflow / Outflow

The column `amount_tsh` is defined as "Total static head (amount water available to waterpoint)".

By combining `amount_tsh` with `quantity` or `quantity_group`, we may get some idea of whether or not the pump is working properly: a pump that has water to draw but is not drawing water is likely in need of maintenance.

`quantity_group` and `quantity` contain the same values. We will drop `quantity_group` because it has the longer column name.

`amount_tsh` may be problematic because of the large number of `0` entries, which cannot be distinguished as either missing values or true zeros. We may remove it later because of this.
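The check described above could be sketched as follows, on a made-up frame rather than on `X`. The `quantity` labels used here (`enough`, `insufficient`, `dry`) and the `has_head` column are assumptions for illustration; the real labels should be read off `value_counts()`.

```python
import pandas as pd

# Toy stand-in for X; values are invented for illustration only.
toy = pd.DataFrame({
    "amount_tsh": [6000.0, 0.0, 25.0, 0.0, 500.0],
    "quantity": ["enough", "dry", "insufficient", "enough", "dry"],
})

# Flag rows that report a positive static head.
toy["has_head"] = toy["amount_tsh"] > 0

# Water is available at the waterpoint but none is being drawn.
suspect = toy[toy["has_head"] & (toy["quantity"] == "dry")]

# Share of zero entries, which are ambiguous between missing and true zero.
zero_share = (toy["amount_tsh"] == 0).mean()

print(len(suspect), zero_share)
```

If the share of zeros is large on the real data, the `has_head` flag is only trustworthy for rows where it is True.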

In [None]:
X_IO = X[['amount_tsh', 'quantity', 'quantity_group']]
X_IO.head()

In [None]:
X_IO.quantity.value_counts()

In [None]:
X_IO.quantity_group.value_counts()

In [None]:
test_df = X_IO.quantity_group == X_IO.quantity
test_df.value_counts()

In [None]:
X_IO.amount_tsh.describe()

### Extraction Type

Extraction type is the actual mechanism of the pump. Different mechanisms are easier or more difficult to repair and break down at different rates. This may be more useful than source, as multiple types of mechanisms can be used for the same source type.

In [None]:
X.extraction_type_class.value_counts()
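One hedged way to see whether mechanisms break down at different rates is a row-normalized cross-tabulation of extraction class against the target. The frame below is invented for illustration; the class and status labels are assumptions, not counts from our data.

```python
import pandas as pd

# Toy stand-in for X; values are invented for illustration only.
toy = pd.DataFrame({
    "extraction_type_class": ["gravity", "gravity", "handpump", "handpump", "other"],
    "target": ["functional", "functional", "functional",
               "non functional", "non functional"],
})

# Each row sums to 1, giving the share of each status within a mechanism class.
rates = pd.crosstab(
    toy["extraction_type_class"], toy["target"], normalize="index"
)

print(rates)
```

The same call with `source_type` in place of `extraction_type_class` would allow a direct comparison of the two candidate features.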

### Population

{write stuff here}

### Landform, Lithology and Soils

{write stuff here}

## Data Cleaning

Things to do:
- Remove unwanted or clerical columns.
- Bin `construction_year` and potentially roll smaller values into `other` bins to help with class imbalance.
- Output cleaned data to new csv for modeling purposes.
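Rolling sparse values into an `other` bin could look like the sketch below. The helper `roll_rare`, its `min_count` threshold and the toy series are hypothetical names and values chosen for illustration.

```python
import pandas as pd

def roll_rare(series, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest with the catch-all label.
    return series.where(~series.isin(rare), other_label)

# Toy example; the labels are invented for illustration only.
toy = pd.Series(["spring", "spring", "spring", "dam", "lake"])
rolled = roll_rare(toy, min_count=2)

print(rolled.value_counts())
```

The threshold is a judgment call: too high and informative categories are lost, too low and the class imbalance remains.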

The spatial join data set and the original data set are being kept separate due to differences in dataframe shape. Time permitting, we may take a second look at the spatial join and see if we can impute lat-long using random placement and `lga` to recover the roughly 2,000 lost entries.
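One possible reading of "random placement", sketched under stated assumptions: for each row with missing coordinates, draw a point uniformly from the bounding box of the known points in the same `lga`. The toy frame, the fixed seed and the bounding-box choice are all assumptions, and the sketch assumes missing coordinates are stored as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy stand-in; coordinates and lga labels are invented for illustration.
toy = pd.DataFrame({
    "lga": ["A", "A", "A", "B", "B"],
    "longitude": [34.0, 35.0, np.nan, 37.0, np.nan],
    "latitude": [-9.0, -8.0, np.nan, -3.0, np.nan],
})

coords = ["longitude", "latitude"]
missing = toy["longitude"].isna()

# Per-lga bounding box of the known points (NaN is skipped by min/max).
lo = toy.groupby("lga")[coords].transform("min")
hi = toy.groupby("lga")[coords].transform("max")

# Uniform draw inside each row's lga bounding box.
draw = lo + (hi - lo) * rng.random(size=lo.shape)

# Only overwrite the rows that were missing.
toy.loc[missing, coords] = draw.loc[missing]

print(toy)
```

A bounding box can extend beyond the true `lga` boundary, so imputed points may land in a neighbouring soil polygon; sampling inside the actual `lga` shape would be more faithful but needs the boundary geometry.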

### Binning

In [None]:
# Binning construction_year into new boolean column is_new with cut-off year 'NewYear'

# Set cut-off year
NewYear = 2000

# Create new boolean column with cut-off year.
# NaN years compare as False, so they are counted as not new.
X['is_new'] = X['construction_year'] >= NewYear
X.is_new.value_counts()

### Dropping

In [None]:
# Dropping columns we're not interested in or do not make sense for analysis

# Set list of kept columns for concatenation with Samira's kept columns
keep_cols = ['is_new', 'extraction_type_class', 'amount_tsh', 'quantity', 
             'source_type', 'source_class', 'quality_group', 'population', 
             'target', 'landform', 'lithology', 'soils', 'wrb', 
             'dominant_soil', 'code_wrb']

# New dataframe with kept columns
# .copy() avoids SettingWithCopyWarning if df is modified later
df = X[keep_cols].copy()

In [None]:
# Display cleaned data frame
df
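The remaining item on the to-do list, writing the cleaned data to a CSV, could be sketched as below. The file name and the temporary directory are placeholders rather than the project's actual output location, and the toy frame stands in for `df`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned frame; values are invented for illustration.
toy = pd.DataFrame({
    "is_new": [True, False],
    "target": ["functional", "non functional"],
})

# Hypothetical output path; replace with the project's data folder.
out_path = os.path.join(tempfile.gettempdir(), "cleaned_spatial_join_example.csv")

# index=False keeps the row index out of the file, so no 'Unnamed: 0' on reload.
toy.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved shape and columns.
check = pd.read_csv(out_path)
print(check.shape)
```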