# Pump it Up: Data Mining the Water Table
An intermediate-level practice competition on https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/

In this competition, we are asked to predict which water pumps are faulty. We will be using the data from Taarifa and the Tanzanian Ministry of Water, to be able to predict which pumps are functional, which need some repairs, and which don't work at all. We will predict one of these three classes based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

We have three different data sets for this competition.
1. test_set_values.csv
2. training_set_labels.csv
3. training_set_values.csv

# Importing the data

In [1]:
import pandas as pd

In [2]:
X_train = pd.read_csv("training_set_values.csv", index_col=[0])
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 69572 to 26348
Data columns (total 39 columns):
amount_tsh               59400 non-null float64
date_recorded            59400 non-null object
funder                   55765 non-null object
gps_height               59400 non-null int64
installer                55745 non-null object
longitude                59400 non-null float64
latitude                 59400 non-null float64
wpt_name                 59400 non-null object
num_private              59400 non-null int64
basin                    59400 non-null object
subvillage               59029 non-null object
region                   59400 non-null object
region_code              59400 non-null int64
district_code            59400 non-null int64
lga                      59400 non-null object
ward                     59400 non-null object
population               59400 non-null int64
public_meeting           56066 non-null object
recorded_by              59400 non-null obj

Checking the data type of each column reveals three things:
1. "date_recorded" is stored as "object" rather than as a datetime, so we should convert it to a datetime.
2. "region_code" is stored as "int64", but it is a code rather than a quantity, so it should not be treated as a numerical feature when training our models. We change it to "object".
3. "district_code" is also stored as "int64" and is likewise a code rather than a quantity, so we change it to "object" as well.

In [3]:
X_train["region_code"] = X_train["region_code"].astype("object") # To change the data type of "region_code" column
X_train["district_code"] = X_train["district_code"].astype("object") # To change the data type of "district_code" column
X_train["date_recorded"] = pd.to_datetime(X_train["date_recorded"])  # To change the data type of "date_recorded" column (the ISO dates parse without extra arguments)

In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59400 entries, 69572 to 26348
Data columns (total 39 columns):
amount_tsh               59400 non-null float64
date_recorded            59400 non-null datetime64[ns]
funder                   55765 non-null object
gps_height               59400 non-null int64
installer                55745 non-null object
longitude                59400 non-null float64
latitude                 59400 non-null float64
wpt_name                 59400 non-null object
num_private              59400 non-null int64
basin                    59400 non-null object
subvillage               59029 non-null object
region                   59400 non-null object
region_code              59400 non-null object
district_code            59400 non-null object
lga                      59400 non-null object
ward                     59400 non-null object
population               59400 non-null int64
public_meeting           56066 non-null object
recorded_by              59400 no

As shown above, the data types of the three columns have been changed.
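Because the test set later passes through the same pipelines, it may be worth applying the same three conversions to it as well. The following is a minimal sketch, assuming a helper named `fix_dtypes` (a hypothetical name, not part of the original notebook) and a tiny made-up frame in place of the real CSV files:

```python
import pandas as pd

def fix_dtypes(df):
    """Return a copy with the three columns converted as described above."""
    out = df.copy()
    out["region_code"] = out["region_code"].astype("object")
    out["district_code"] = out["district_code"].astype("object")
    out["date_recorded"] = pd.to_datetime(out["date_recorded"])
    return out

# Made-up rows for illustration only
demo = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-03-06"],
    "region_code": [11, 20],
    "district_code": [5, 2],
})
demo_fixed = fix_dtypes(demo)
print(demo_fixed.dtypes)
```

In the notebook this would presumably be called as `X_train = fix_dtypes(X_train)` and `X_test = fix_dtypes(X_test)`.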

In [5]:
X_train.head()

id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
69572,6000.0,2011-03-14,Roman,1390,Roman,34.938093,-9.856322,none,0,Lake Nyasa,...,annually,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe
8776,0.0,2013-03-06,Grumeti,1399,GRUMETI,34.698766,-2.147466,Zahanati,0,Lake Victoria,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe
34310,25.0,2013-02-25,Lottery Club,686,World vision,37.460664,-3.821329,Kwa Mahundi,0,Pangani,...,per bucket,soft,good,enough,enough,dam,dam,surface,communal standpipe multiple,communal standpipe
67743,0.0,2013-01-28,Unicef,263,UNICEF,38.486161,-11.155298,Zahanati Ya Nanyumbu,0,Ruvuma / Southern Coast,...,never pay,soft,good,dry,dry,machine dbh,borehole,groundwater,communal standpipe multiple,communal standpipe
19728,0.0,2011-07-13,Action In A,0,Artisan,31.130847,-1.825359,Shuleni,0,Lake Victoria,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,communal standpipe,communal standpipe


In [6]:
X_test = pd.read_csv("test_set_values.csv", index_col=[0])
X_test.head()

id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,...,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group
50785,0.0,2013-02-04,Dmdd,1996,DMDD,35.290799,-4.059696,Dinamu Secondary School,0,Internal,...,never pay,soft,good,seasonal,seasonal,rainwater harvesting,rainwater harvesting,surface,other,other
51630,0.0,2013-02-04,Government Of Tanzania,1569,DWE,36.656709,-3.309214,Kimnyak,0,Pangani,...,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe
17168,0.0,2013-02-01,,1567,,34.767863,-5.004344,Puma Secondary,0,Internal,...,never pay,soft,good,insufficient,insufficient,rainwater harvesting,rainwater harvesting,surface,other,other
45559,0.0,2013-01-22,Finn Water,267,FINN WATER,38.058046,-9.418672,Kwa Mzee Pange,0,Ruvuma / Southern Coast,...,unknown,soft,good,dry,dry,shallow well,shallow well,groundwater,other,other
49871,500.0,2013-03-27,Bruder,1260,BRUDER,35.006123,-10.950412,Kwa Mzee Turuka,0,Ruvuma / Southern Coast,...,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe


In [7]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14850 entries, 50785 to 68707
Data columns (total 39 columns):
amount_tsh               14850 non-null float64
date_recorded            14850 non-null object
funder                   13981 non-null object
gps_height               14850 non-null int64
installer                13973 non-null object
longitude                14850 non-null float64
latitude                 14850 non-null float64
wpt_name                 14850 non-null object
num_private              14850 non-null int64
basin                    14850 non-null object
subvillage               14751 non-null object
region                   14850 non-null object
region_code              14850 non-null int64
district_code            14850 non-null int64
lga                      14850 non-null object
ward                     14850 non-null object
population               14850 non-null int64
public_meeting           14029 non-null object
recorded_by              14850 non-null obj

In [8]:
y_train = pd.read_csv("training_set_labels.csv", index_col=[0])
y_train = y_train.status_group
y_train.head()

id
69572        functional
8776         functional
34310        functional
67743    non functional
19728        functional
Name: status_group, dtype: object

# Creation of Submission Pipeline

The "training_set_labels.csv" dataset contains three different labels, so this is a classification problem, not a regression problem.
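Before modelling, it can help to look at how the rows are spread over the three classes. The sketch below uses a handful of made-up labels instead of the real `y_train`; the exact spelling of the third label ("functional needs repair") is an assumption, since only "functional" and "non functional" appear in the output above.

```python
import pandas as pd

# Made-up labels for illustration; the real ones live in y_train
labels = pd.Series(
    ["functional", "functional", "non functional",
     "functional needs repair", "functional", "non functional"],
    name="status_group",
)

# Share of each class, largest first
class_share = labels.value_counts(normalize=True)
print(class_share)
```

In the notebook the equivalent call would be `y_train.value_counts(normalize=True)`.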

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer

First, I will create a "submission pipeline", which I need in order to produce a file to submit to the competition. To see the way forward, I will begin by training a model on a single numerical column.<br>
"gps_height" should be a good starting point.

In [10]:
columns = ['gps_height']


col_trans = ColumnTransformer(remainder="drop",
                             transformers=[('select', 'passthrough',columns)])

model_a = Pipeline([
    ('selector', col_trans),
    ('predictor', DecisionTreeClassifier()) # I used "DecisionTreeClassifier" which builds classification models in the form of a tree structure
    
])

In [11]:
model_a.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('selector',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('select', 'passthrough',
                                                  ['gps_height'])],
                                   verbose=False)),
                ('predictor',
                 DecisionTreeClassifier(class_weight=None, criterion='gini',
                                        max_depth=None, max_features=None,
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        pres

In [12]:
model_a.score(X_train, y_train)

0.5714814814814815

For "model_a", the accuracy on the training data is about 57 %, which means I need to improve the way I am predicting.
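A score like this is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below shows the idea with scikit-learn's `DummyClassifier` on made-up data; the numbers are illustrative and say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: 6 of 10 rows belong to the majority class
X_demo = np.zeros((10, 1))  # the dummy model ignores the features
y_demo = np.array(["functional"] * 6 + ["non functional"] * 4)

# Always predicts the most frequent label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```

Fitting the same baseline on `X_train` and `y_train` would show how far a real model is above simply guessing the majority class.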

In [13]:
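from pathlib import Path

def make_submission(model, X_test, out_dir='predictions'):
    """Predict on X_test and write a timestamped submission CSV."""
    y_test_pred = model.predict(X_test)
    predictions = pd.Series(data=y_test_pred, index=X_test.index, name='status_group')
    Path(out_dir).mkdir(exist_ok=True)  # create the output folder if it is missing
    date = pd.Timestamp.now().strftime('%Y-%m-%d_%H-%M_')  # no ':' so the file name is also valid on Windows
    predictions.to_csv(f'{out_dir}/{date}submission.csv',
                   index=True,
                  header=True)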
    

In [14]:
make_submission(model_a, X_test)

When I submitted the predictions of "model_a", I got a score of 53 %, so I need to improve my model.

# Creation of a Model with Numerical Features

I decided to create a model by using the numerical features.

In [15]:
num_feat = X_train.select_dtypes(include='number').columns.to_list() # To be able to see the columns which have numerical values
num_feat

['amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'population',
 'construction_year']

In [16]:
X_train.select_dtypes(include='number').describe()

statistic,amount_tsh,gps_height,longitude,latitude,num_private,population,construction_year
count,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0,59400.0
mean,317.650385,668.297239,34.077427,-5.706033,0.474141,179.909983,1300.652475
std,2997.574558,693.11635,6.567432,2.946019,12.23623,471.482176,951.620547
min,0.0,-90.0,0.0,-11.64944,0.0,0.0,0.0
25%,0.0,0.0,33.090347,-8.540621,0.0,0.0,0.0
50%,0.0,369.0,34.908743,-5.021597,0.0,25.0,1986.0
75%,20.0,1319.25,37.178387,-3.326156,0.0,215.0,2004.0
max,350000.0,2770.0,40.345193,-2e-08,1776.0,30500.0,2013.0


When we check the table above, we see that some columns contain "0" where a value other than "0" is expected. That means we have some missing or wrong values, and we should deal with them first.
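One way to see how widespread these zeros are is to compute the share of zero entries per column. The sketch below uses a small made-up frame; whether a zero is a placeholder or a real measurement still has to be judged column by column.

```python
import pandas as pd

# Made-up numeric rows for illustration only
demo = pd.DataFrame({
    "gps_height": [1390, 0, 686, 0],
    "population": [109, 0, 250, 58],
    "construction_year": [1999, 0, 2009, 0],
})

# Fraction of rows that are exactly 0 in each column, largest first
zero_share = (demo == 0).mean().sort_values(ascending=False)
print(zero_share)
```

In the notebook the same expression could be applied to `X_train[num_feat]`.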

In [17]:
num_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=0, strategy='mean')) # I used "SimpleImputer" to deal with the missing values
])

col_trans2 = ColumnTransformer(remainder="drop",
                             transformers=[('numarical', num_pipe, num_feat)])

model_b = Pipeline([
    ('col_trans', col_trans2),
    ('classifier', DecisionTreeClassifier()) # I used "DecisionTreeClassifier" which builds classification models in the form of a tree structure
])

In [18]:
model_b.fit(X_train, y_train);

In [19]:
model_b.score(X_train, y_train)

0.9841414141414141

For "model_b", I got a training accuracy of 98.4 %. It sounds great, but this score is measured on the same data the model was trained on.

In [20]:
make_submission(model_b, X_test)

When I submitted the predictions of "model_b", I got a score of 66.6 %, which means the model is over-fitting the training data.
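A gap like this can usually be spotted before submitting by scoring the model on rows it was not trained on. The following sketch uses 5-fold cross-validation on synthetic three-class data (a stand-in for the real features, generated with `make_classification`) to compare training accuracy with cross-validated accuracy for an unconstrained decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, flip_y=0.2, random_state=0,
)

tree = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the tree was fitted on
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# 5-fold cross-validation: each fold is scored on rows the tree did not see
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"training accuracy: {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

Since a scikit-learn pipeline is itself an estimator, the same check should work in the notebook as `cross_val_score(model_b, X_train, y_train, cv=5)`.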

# Creating a Model with Numerical and Categorical Features

After scoring 98.4 % on the training data but only 66.6 % in the competition, a sign of over-fitting, I decided to create a model that uses both numerical and categorical features.

In [21]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.decomposition import TruncatedSVD

In [22]:
num_feat

['amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'population',
 'construction_year']

In [23]:
cat_feat = X_train.select_dtypes(include='object').columns.to_list() # To be able to see the columns which have categorical features
cat_feat

['funder',
 'installer',
 'wpt_name',
 'basin',
 'subvillage',
 'region',
 'region_code',
 'district_code',
 'lga',
 'ward',
 'public_meeting',
 'recorded_by',
 'scheme_management',
 'scheme_name',
 'permit',
 'extraction_type',
 'extraction_type_group',
 'extraction_type_class',
 'management',
 'management_group',
 'payment',
 'payment_type',
 'water_quality',
 'quality_group',
 'quantity',
 'quantity_group',
 'source',
 'source_type',
 'source_class',
 'waterpoint_type',
 'waterpoint_type_group']
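The pipeline that follows one-hot encodes these columns, which creates one output column per distinct value. It can therefore help to check how many distinct values each categorical column has. A minimal sketch on a made-up frame (the values are illustrative only):

```python
import pandas as pd

# Made-up categorical rows for illustration only
demo = pd.DataFrame({
    "basin": ["Pangani", "Internal", "Pangani", "Internal"],
    "wpt_name": ["Zahanati", "Kimnyak", "Shuleni", "Kwa Mahundi"],
    "recorded_by": ["A", "A", "A", "A"],
})

# Number of distinct values per column, largest first
cardinality = demo.nunique().sort_values(ascending=False)
print(cardinality)
```

In the notebook the equivalent call would be `X_train[cat_feat].nunique()`. A column with a single distinct value carries no information for the classifier, and a column with thousands of values adds thousands of sparse features.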

In [24]:
num_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=0, strategy='mean')),  # I used "SimpleImputer" to deal with the missing values
    ('scaler', StandardScaler()) # I used "StandardScaler" to standardize the features
])


cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # I used "SimpleImputer" to deal with the missing values
    ('encode', OneHotEncoder(handle_unknown='ignore')) # I used "OneHotEncoder" to convert the categorical variables into a form that could be provided to the model to do a better job in prediction
])

col_trans3 = ColumnTransformer(remainder="drop",
                             transformers=[
                                 ('numarical', num_pipe, num_feat),
                                 ('categorical', cat_pipe, cat_feat)
                             ])

model_c = Pipeline([
    ('col_trans', col_trans3),
    ('classifier', DecisionTreeClassifier()) # I used "DecisionTreeClassifier" which builds classification models in the form of a tree structure
])

In [25]:
model_c.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('col_trans',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numarical',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=0,
                                                                                 strategy='mean',
                                                              

In [26]:
model_c.score(X_train, y_train)

0.99996632996633

For "model_c", I got a training accuracy of more than 99.99 %. It sounds great.

In [27]:
make_submission(model_c, X_test)

When I submitted the predictions of "model_c", I got a score of 78.5 %, which means the model is still over-fitting.

I decided to change the classifier at this point: instead of "DecisionTreeClassifier", I want to use "RandomForestClassifier".

In [28]:
from sklearn.ensemble import RandomForestClassifier

In [29]:
num_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=0, strategy='mean')),
    ('scaler', StandardScaler())
])


cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore'))
])

col_trans3 = ColumnTransformer(remainder="drop",
                             transformers=[
                                 ('numarical', num_pipe, num_feat),
                                 ('categorical', cat_pipe, cat_feat)
                             ])

model_d = Pipeline([
    ('col_trans', col_trans3),
    ('classifier', RandomForestClassifier(n_jobs=-1)) # I used "RandomForestClassifier", which builds a set of decision trees, each from a random sample of the training set, and aggregates their votes to decide the final class
])

In [30]:
model_d.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('col_trans',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numarical',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value=None,
                                                                                 missing_values=0,
                                                                                 strategy='mean',
                                                              

In [31]:
model_d.score(X_train, y_train)

0.9856902356902357

For "model_d", I got a training accuracy of 98.6 %. It sounds great.

In [32]:
make_submission(model_d, X_test)

When I submitted the predictions of "model_d", I got a score of 78.72 %, which means the model is still over-fitting.
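One common way to reduce over-fitting in a random forest is to limit how flexible each tree is and to choose the limits by cross-validation. The sketch below does this with `GridSearchCV` on synthetic data. The grid values are hypothetical and not tuned for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)

# Hypothetical grid: shallower trees and larger leaves give a less flexible forest
param_grid = {
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=3,  # each combination is scored on held-out folds
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

When the estimator is a pipeline such as "model_d", the parameter names take the step name as a prefix, for example `classifier__max_depth`.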

To get some practice on other projects, I will stop working on this one at this point. I will come back and try other methods once I have more experience with classifiers.