# *Pump-it-up project*

### Can you predict which water pumps are faulty?

## Goal
Using data from Taarifa and the Tanzanian Ministry of Water, predict which pumps are functional, which need some repairs, and which don't work at all based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. 

A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

# I. Exploratory data analysis

## 1. Brief description of the data
### 1.1 Features

* amount_tsh - Total static head (amount of water available to waterpoint)
* date_recorded - The date the row was entered
* funder - Who funded the well
* gps_height - Altitude of the well
* installer - Organization that installed the well
* longitude - GPS coordinate
* latitude - GPS coordinate
* wpt_name - Name of the waterpoint if there is one
* num_private - no description
* basin - Geographic water basin
* subvillage - Geographic location
* region - Geographic location
* region_code - Geographic location (coded)
* district_code - Geographic location (coded)
* lga - Geographic location
* ward - Geographic location
* population - Population around the well
* public_meeting - True/False
* recorded_by - Group entering this row of data
* scheme_management - Who operates the waterpoint
* scheme_name - Who operates the waterpoint
* permit - If the waterpoint is permitted
* construction_year - Year the waterpoint was constructed
* extraction_type - The kind of extraction the waterpoint uses
* extraction_type_group - The kind of extraction the waterpoint uses
* extraction_type_class - The kind of extraction the waterpoint uses
* management - How the waterpoint is managed
* management_group - How the waterpoint is managed
* payment - What the water costs
* payment_type - What the water costs
* water_quality - The quality of the water
* quality_group - The quality of the water
* quantity - The quantity of water
* quantity_group - The quantity of water
* source - The source of the water
* source_type - The source of the water
* source_class - The source of the water
* waterpoint_type - The kind of waterpoint
* waterpoint_type_group - The kind of waterpoint

### 1.2 Labels

* **functional** - the waterpoint is operational and there are no repairs needed
* **functional needs repair** - the waterpoint is operational, but needs repairs
* **non functional** - the waterpoint is not operational

## 2. Libraries and input data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
%matplotlib inline

from sklearn.metrics import accuracy_score

pd.set_option('display.max_columns', None)

print("Setup Complete")

In [None]:
pylab.rcParams["figure.figsize"] = (14,8)

In [None]:
# Read the training features, training labels and test features
X_train = pd.read_csv("../input/X_train_raw.csv")
y_train = pd.read_csv("../input/y_train_raw.csv")
X_test = pd.read_csv("../input/X_test_raw.csv")

# Merge train X and y values
train_df = X_train.merge(y_train,how='outer',left_index=True, right_index=True)

## 3. Descriptive statistics

In [None]:
train_df.head()

In [None]:
train_df.describe()

From this table we can see the distribution of the numerical columns. Several columns have a minimum of 0, which most likely indicates missing data that we'll need to handle before modelling.

In [None]:
train_df.info()

The train set includes 59400 observations and 41 columns. 

The "status_group" column holds the label (target) for each pump. The other 40 columns are features: 10 are numerical and the rest are categorical.
We'll explore the numerical columns first, after setting a baseline.

## 4. Preliminary accuracy score (baseline)

Let's take a look at the target variable distribution in the train dataset and calculate the baseline for our further predictions.

In [None]:
label_dict = {"functional":2,"functional needs repair":1,"non functional":0}
train_df["label"] = train_df["status_group"].map(label_dict)
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(train_df["label"], discrete=True)

In [None]:
majority_class = train_df['status_group'].mode()[0]
print("The most frequent label is", majority_class)

y_prelim_pred = np.full(shape=train_df['status_group'].shape, fill_value=majority_class)
accuracy_score(train_df['status_group'], y_prelim_pred)

This means that always predicting the most frequent class, "functional", is correct for 54.31% of the pumps in the training set. This accuracy is the baseline that future models must beat.
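The same baseline can be read directly from the class shares. The sketch below uses a small made-up series in place of `train_df["status_group"]`, so the printed numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df["status_group"] (values are made up)
status = pd.Series(
    ["functional"] * 5 + ["non functional"] * 3 + ["functional needs repair"] * 2,
    name="status_group",
)

# Share of each class; the largest share equals the accuracy of
# a model that always predicts the majority class
class_share = status.value_counts(normalize=True)
baseline_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(class_share)
print(f"Always predicting '{baseline_class}' gives accuracy {baseline_accuracy:.2%}")
```

On the real data, replacing `status` with `train_df["status_group"]` should reproduce the accuracy computed above.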

**Side note**: Our target variable is discrete, so we will need a supervised learning **classification** algorithm for the label prediction.

**Machine Learning Classification Algorithms:**

* Ensemble Methods
* Generalized Linear Models (GLM)
* Naive Bayes - possible with multiple classes
* Nearest Neighbors
* Support Vector Machines (SVM)
* Decision Trees
* Discriminant Analysis

## 5. Numerical columns

In [None]:
# Select numerical columns
numerical_vars = [col for col in train_df.columns if 
                train_df[col].dtype in ['int64', 'float64']]

### 5.1 Construction year
Let's plot this variable against the number of pumps constructed that year.

In [None]:
sns.countplot(x=train_df["construction_year"],hue=train_df["status_group"])
plt.xticks(rotation=45, 
    horizontalalignment='right')
plt.title("Number of pumps constructed over the years", fontsize=14)
plt.xlabel("Construction year", fontsize=12)
plt.ylabel("Number of pumps constructed", fontsize=12)

We can see that most pumps built before 1985 are non functional, whereas more recent pumps tend to be functional. This suggests that the "construction_year" feature could be very useful in our prediction model. The number of pumps that need repair is not very high and is quite stable over the years. The rows with a construction year of 0 need to be checked.
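One possible way to check the zero years is to measure their share and treat them as missing. This is a sketch on made-up values; whether 0 really means "unknown" in the real data is an assumption to verify.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df["construction_year"] (values are made up)
toy = pd.DataFrame({"construction_year": [0, 1985, 2000, 0, 2010, 0]})

# Share of rows where the year is recorded as 0
zero_share = (toy["construction_year"] == 0).mean()

# Assumption: a year of 0 is a placeholder for a missing value
year_clean = toy["construction_year"].replace(0, np.nan)

print(f"Share of zero years: {zero_share:.1%}")
print("Earliest real construction year:", year_clean.min())
```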

### 5.2 Amount_tsh
This variable is the total static head, i.e. the amount of water available to the waterpoint. It could be useful for predicting if a pump is functional.

In [None]:
sns.scatterplot(y=train_df["amount_tsh"],x=train_df["status_group"])

If the "amount_tsh" > 150000 then most likely the pump is functional.

**TO DO**: create a binary variable that is 1 when "amount_tsh" > 150000 (pumps that are most likely functional) and 0 otherwise (feature engineering).
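A minimal sketch of such a flag, built from the feature rather than the label so that it does not leak the target. The column name `high_amount_tsh` and the toy values are hypothetical; the threshold is the one noted above.

```python
import pandas as pd

TSH_THRESHOLD = 150_000  # threshold read off the scatter plot

# Toy stand-in for train_df["amount_tsh"] (values are made up)
toy = pd.DataFrame({"amount_tsh": [0.0, 500.0, 150000.0, 200000.0]})

# 1 when amount_tsh is strictly above the threshold, 0 otherwise
toy["high_amount_tsh"] = (toy["amount_tsh"] > TSH_THRESHOLD).astype(int)

print(toy)
```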

### 5.3 Distributions of numerical attributes



In [None]:
fig = plt.figure(figsize=(12,18))
for i, col in enumerate(numerical_vars):
    fig.add_subplot(9,4,i+1)
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(train_df[col].dropna(), bins=50)
    plt.xlabel(col)

plt.tight_layout()
plt.show()

#### Notes for Data Cleaning & Preprocessing:
Uni-modal, skewed distributions could potentially be log transformed:
* longitude
* district_code
* gps_height
* region_code

Whether this is needed depends on the algorithm.
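A minimal sketch of a log transform, assuming a non-negative, right-skewed column (the toy values are made up). `np.log1p` computes log(1 + x), so zeros stay at zero; it is undefined for values of -1 or less, so a column that can go negative would need shifting first.

```python
import numpy as np
import pandas as pd

# Toy right-skewed, non-negative values (made up)
skewed = pd.Series([0, 1, 2, 5, 10, 100, 1000], dtype=float)

# log1p(x) = log(1 + x); safe for zeros
logged = np.log1p(skewed)

# Sample skewness before and after the transform
print("Skewness before:", round(skewed.skew(), 2))
print("Skewness after: ", round(logged.skew(), 2))
```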

Some numerical columns behave like categorical ones. For example, "construction_year" mostly takes values near 0 or 2000.
The "amount_tsh" and "population" variables have mostly 0 values.

### 5.4 Finding Outliers
Visualisation of data may support the discovery of possible outliers within the data. 

Examples of how this can be done include:

* Within **univariate** analysis, for example through using box plots. Outliers are observations more than a multiple (1.5-3) of the IQR (inter-quartile range) beyond the upper or lower quartile. (If data is skewed, it may be helpful to transform them first to a more symmetric distribution shape)
* Within **bivariate** analysis, for example scatterplots. Outliers have y-values that are unusual in relation to other observations with similar x-values. Alternatively, plots of the residuals from fitted least square line of bivariate regression can also indicate outliers.

The consensus is that all outliers should be carefully examined:

* Go back to the original data to check for recording or transcription errors.
* If there are no such errors, look carefully for unusual features of the individual unit that explain the difference. This may lead to new theories or discoveries.
* If the data cannot be checked further, the outlier is usually dropped from the dataset.

The box plots and the scatter plots of the label against each numerical attribute are shown below, applying the univariate and bivariate methods above.
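The univariate IQR rule described above can be sketched as follows. The helper name `iqr_outliers` and the toy values are hypothetical; the multiplier `k` defaults to 1.5, the lower end of the range mentioned above.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask, True where a value lies more than
    k * IQR below the lower quartile or above the upper quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy stand-in for a numerical column (values are made up)
toy_population = pd.Series([0, 1, 25, 50, 120, 215, 300, 30000])

mask = iqr_outliers(toy_population)
print(toy_population[mask])
```

For heavily skewed columns, applying the rule after a transformation to a more symmetric shape may flag fewer legitimate values.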

#### 5.4.1 Univariate analysis - box plots for numerical attributes

In [None]:
fig = plt.figure(figsize=(12, 18))

for i in range(len(numerical_vars)):
    fig.add_subplot(9, 4, i+1)
    sns.boxplot(y=train_df[numerical_vars].iloc[:,i])

plt.tight_layout()
plt.show()

#### Notes for Data Cleaning & Preprocessing:
The outliers:
- Population > 200000.

#### 5.4.2 Bivariate analysis - scatter plots for target versus numerical attributes

In [None]:
f = plt.figure(figsize=(14,20))

for i in range(len(numerical_vars)):
    f.add_subplot(9, 4, i+1)
    # x and y must be passed as keywords in recent seaborn releases
    sns.scatterplot(x=train_df[numerical_vars].iloc[:,i], y=train_df["label"])
    
plt.tight_layout()
plt.show()

#### Notes for Data Cleaning & Preprocessing
Based on a first look at the scatter plots against the label, there appear to be a few outliers to check in:
* amount_tsh (> 200000) 
* population (> 13000)

### 5.5 Assess correlations among attributes
The linear correlation between two columns of data is shown below. There are various correlation calculation methods, but the Pearson correlation is often used and is the default method. It may be useful to note that:

* A combination of the correlation figure and a scatter plot can support the understanding of whether there is a non-linear correlation (i.e. depending on the data, this may result in a low value of linear correlation, but the variables may still be strongly correlated in a non-linear fashion).
* Correlation values may be heavily influenced by single outliers!

Several authors have suggested that "to use linear regression for modelling, it is necessary to remove correlated variables to improve your model", and "it's a good practice to remove correlated variables during feature selection"
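To illustrate the point about non-linear relationships, the sketch below compares the Pearson coefficient with the rank-based Spearman coefficient on made-up data where `y` grows exponentially with `x`. The relationship is perfectly monotonic, yet the linear correlation is well below 1.

```python
import numpy as np
import pandas as pd

# Made-up data: y is a monotonic but strongly non-linear function of x
x = np.linspace(0, 10, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson measures linear association; Spearman measures monotonic association
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```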


Below is a heatmap of the correlation of the numerical columns:

In [None]:
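# Restrict to numerical columns: recent pandas versions raise an error
# when corr() is called on a frame that contains string columns
correlation = train_df[numerical_vars].corr()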

f, ax = plt.subplots(figsize=(8,6))
plt.title('Correlation of numerical attributes', size=12)
sns.heatmap(correlation)

The correlation between "district_code" and "region_code" is quite high. Consider removing one of them.

The correlation between "construction_year" and "gps_height" is also high, but these two variables have no obvious connection, so this correlation should be explored further before making a decision.

With reference to the target Label, the top correlated attributes are:

In [None]:
correlation['label'].sort_values(ascending=False)

The negative correlation to the target variable of the "region_code" is higher than that of the "district_code". Keep the variable with higher correlation to the target.

Linear correlation to the target is quite low for all variables, but this does not rule out a non-linear relationship.

### 5.6 Missing/null values in numerical columns


In [None]:
train_df[numerical_vars].isna().sum().sort_values(ascending=False)

##### Population

In [None]:
len(train_df.population[train_df.population == 0])

21381 observations have a population of 0, which most likely means the value is missing.

**TO DO**: A solution could be to convert it into categorical data by creating bins (feature engineering)
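A possible sketch of the binning with `pd.cut`. The bin edges, labels and the column name `population_bin` are hypothetical and should be chosen from the real distribution. Zeros are sent to an "unknown" category, on the assumption that they are missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df["population"] (values are made up)
toy = pd.DataFrame({"population": [0, 1, 40, 250, 1200, 0, 9000]})

# Hypothetical bin edges and labels
bins = [0, 50, 500, 5000, np.inf]
labels = ["1-50", "51-500", "501-5000", "5000+"]

# Intervals are right-closed by default, e.g. (0, 50], so 0 falls
# outside every bin and becomes NaN
binned = pd.cut(toy["population"], bins=bins, labels=labels)

# Assumption: population == 0 means the value is missing
toy["population_bin"] = binned.cat.add_categories("unknown").fillna("unknown")

print(toy)
```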

## 6. Categorical columns

In [None]:
cat_vars = train_df.select_dtypes(include='object').columns
print(cat_vars)

### 6.1 Missing/null values in categorical columns

In [None]:
train_df[cat_vars].isna().sum().sort_values(ascending=False)

In [None]:
## Count of categories within Scheme_management attribute
sns.countplot(x='scheme_management', data=train_df)
plt.xticks(rotation=90)
plt.ylabel('Frequency')
plt.show()

#### Notes for Data Cleaning & Preprocessing

Subvillage: 371 values missing, will be replaced with the mode of the values for the same Region_code.
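A sketch of the group-wise mode imputation on made-up data. The helper name `fill_with_group_mode` is hypothetical; groups with no known value at all are left unchanged.

```python
import pandas as pd

# Toy stand-in for the two relevant columns (values are made up)
toy = pd.DataFrame({
    "region_code": [1, 1, 1, 2, 2, 2],
    "subvillage": ["A", "A", None, "B", None, "B"],
})

def fill_with_group_mode(group: pd.Series) -> pd.Series:
    """Fill missing values with the most frequent value of the group."""
    mode = group.mode()  # missing values are ignored by default
    return group.fillna(mode.iloc[0]) if not mode.empty else group

# Impute within each region_code separately
toy["subvillage"] = (
    toy.groupby("region_code")["subvillage"].transform(fill_with_group_mode)
)

print(toy)
```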

The following columns will be one-hot encoded after reducing them to a maximum cardinality of 10. Their 0 values will be replaced with "unknown":
* installer
* funder

The "scheme_management" column has only 11 categories. Null values will be replaced with "unknown", and the column will then be one-hot encoded as well.

The following columns have True/False values, so we'll replace null values with "unknown":
* public_meeting
* permit
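A minimal sketch of this replacement on made-up data. Mapping the booleans to strings first keeps each column a single, consistent type with three categories.

```python
import pandas as pd

# Toy stand-in for the two True/False columns (values are made up)
toy = pd.DataFrame({
    "public_meeting": [True, False, None],
    "permit": [None, True, True],
})

for col in ["public_meeting", "permit"]:
    # Booleans become strings; missing values map to NaN and are then filled
    toy[col] = toy[col].map({True: "True", False: "False"}).fillna("unknown")

print(toy)
```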

## 7. Export data after EDA

In [None]:
train_df.to_csv("train_df_after_EDA.csv", index=False)
X_test.to_csv("X_test_after_EDA.csv", index=False)