# *Pump-it-up project*

### Can you predict which water pumps are faulty?

## Goal
Using data from Taarifa and the Tanzanian Ministry of Water, predict which pumps are functional, which need some repairs, and which don't work at all based on a number of variables about what kind of pump is operating, when it was installed, and how it is managed. 

A smart understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.

# III. Model selection and submission
## 1. Data preparation
### 1.1 Libraries and input data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler as ss
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


pd.set_option('display.max_columns', None)

# machine learning
#Trees    
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import ExtraTreeClassifier

#Ensemble Methods
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting  # explicitly require this experimental feature
from sklearn.ensemble import HistGradientBoostingClassifier # now you can import normally from ensemble
from lightgbm import LGBMClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost
from xgboost import XGBClassifier

#Gaussian Processes
from sklearn.gaussian_process import GaussianProcessClassifier
    
#GLM
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import Perceptron   
    
#Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier
    
#SVM
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.svm import NuSVC

#Discriminant Analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

#Naive Bayes
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB

# metrics
from sklearn.metrics import accuracy_score, confusion_matrix

# PCA
from sklearn import decomposition

print("Setup Complete")

In [None]:
# Specify the path of the CSV file to read
train_df_final = pd.read_csv("../input/pumpitup-challenge-dataset/train_df_final.csv")
X_test_final = pd.read_csv("../input/pumpitup-challenge-dataset/X_test_final.csv")

In [None]:
X_test_final.shape

In [None]:
train_df_final.shape

### 1.2 Train/test splitting

In [None]:
X = train_df_final.drop("label",axis=1)
y = train_df_final["label"]

In [None]:
# Create training and test sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, stratify=y, random_state=42)

In [None]:
X.isnull().values.any()

The `stratify` parameter used in the split above ensures that the class proportions in each resulting subset match the proportions in the array passed to `stratify`.

For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% of zeros and 75% of ones, stratify=y will make sure that your random split has 25% of 0's and 75% of 1's.
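The 25%/75% example above can be checked on toy data. This is a minimal sketch: `X_toy` and `y_toy` are made-up arrays for illustration, not the pump dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (illustrative assumption): 25% zeros and 75% ones
y_toy = np.array([0] * 25 + [1] * 75)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Share of ones in each part of the split; both should stay at about 0.75
share_train = y_tr.mean()
share_valid = y_va.mean()
print(share_train, share_valid)
```

Without `stratify`, the two shares would vary from one random split to the next, which matters most for the rarest class.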

### 1.3 Standard Scaling

The idea behind StandardScaler is that it will transform your data such that its distribution will have a mean value 0 and standard deviation of 1.

In case of multivariate data, this is done feature-wise (in other words independently for each column of the data).

Given the distribution of the data, each value in the dataset will have the mean value subtracted, and then divided by the standard deviation of the whole dataset (or feature in the multivariate case).

https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler
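As a small illustration of the description above, the sketch below compares `StandardScaler` with the manual formula `(x - mean) / std` applied column by column. The array `X_toy` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative assumption): two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual version: subtract the column mean, divide by the column standard deviation
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Each column ends up with mean 0 and standard deviation 1, regardless of its original scale.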

In [None]:
sc = ss()
X_train = sc.fit_transform(X_train)
X_valid = sc.transform(X_valid)
X_test = sc.transform(X_test_final)

### 1.4 Principal component analysis (PCA)
Note: PCA didn't improve the score, so I don't use it in the final model.

In [None]:
# Make an instance of the Model
pca = decomposition.PCA(.95)

In [None]:
pca.fit(X_train)

In [None]:
pca.n_components_

In [None]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
X_valid_pca = pca.transform(X_valid)
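Passing a float such as `.95` to `PCA` asks scikit-learn to keep just enough components to explain that share of the variance. The sketch below shows this on synthetic data (the arrays are assumptions for illustration, not the pump features).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (illustrative assumption): 10 features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(500, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(500, 10))

pca_toy = PCA(0.95).fit(X_toy)

# Cumulative explained variance of the components that were kept
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)
print(pca_toy.n_components_, cumulative)
```

The last value of `cumulative` is the share of variance retained, so it should be at least 0.95.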

## 2. Model selection
### Test different models with standard parameters on validation set 


**TO DO**: combine all models in a loop
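One possible way to address this TO DO is sketched below. It uses a synthetic dataset and only four of the classifiers so that it runs on its own; the names `candidates` and `scores` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the pump data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Hypothetical registry of models to compare
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# Fit every model and record its validation accuracy in percent
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = round(accuracy_score(y_va, model.predict(X_va)) * 100, 2)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score}")
```

A loop like this also guarantees that each score is computed from that model's own predictions, which is easy to get wrong when cells are copied by hand.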

### 2.1 Trees

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_valid)

acc_decision_tree = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_decision_tree

In [None]:
# Extra Tree

extra_tree = ExtraTreeClassifier()
extra_tree.fit(X_train, y_train)
y_pred = extra_tree.predict(X_valid)

acc_extra_tree = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_extra_tree

### 2.2 Ensembles

In [None]:
# Random Forest

rfc = RandomForestClassifier(criterion='entropy', n_estimators = 1000,min_samples_split=8,random_state=42,verbose=5)
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_valid)

acc_rfc = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_rfc

In [None]:
# GradientBoostingClassifier

GB = GradientBoostingClassifier(n_estimators=100, learning_rate=0.075, 
                                max_depth=13,max_features=0.5,
                                min_samples_leaf=14, verbose=5)

GB.fit(X_train, y_train)     
y_pred = GB.predict(X_valid)

acc_GB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_GB

In [None]:
# Histogram-based Gradient Boosting Classification Tree.

#This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10 000).


HGB = HistGradientBoostingClassifier(learning_rate=0.075, loss='categorical_crossentropy', 
                                               max_depth=8, min_samples_leaf=15)

HGB = HGB.fit(X_train_pca, y_train)

y_pred = HGB.predict(X_valid_pca)

acc_HGB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_HGB

In [None]:
# LightGBM 

#is another fast tree based gradient boosting algorithm, which supports GPU, and parallel learning.


LGB = LGBMClassifier(objective='multiclass', learning_rate=0.75, num_iterations=100, 
                     num_leaves=50, random_state=123, max_depth=8)

LGB.fit(X_train, y_train)
y_pred = LGB.predict(X_valid)

acc_LGB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_LGB

In [None]:
# AdaBoost classifier

AB = AdaBoostClassifier(n_estimators=100, learning_rate=0.075)
AB.fit(X_train, y_train)     
y_pred = AB.predict(X_valid)

acc_AB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_AB

In [None]:
# BaggingClassifier

BC = BaggingClassifier(n_estimators=100)
BC.fit(X_train_pca, y_train)     
y_pred = BC.predict(X_valid_pca)

acc_BC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_BC

In [None]:
# XGBoost

xgb = XGBClassifier(n_estimators=1000, learning_rate=0.05, n_jobs=5)
xgb.fit(X_train, y_train, 
             early_stopping_rounds=5, 
             eval_set=[(X_valid, y_valid)], 
             verbose=False)

y_pred = xgb.predict(X_valid)
acc_xgb = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_xgb

In [None]:
# ExtraTreesClassifier

ETC = ExtraTreesClassifier(n_estimators=100)
ETC.fit(X_train, y_train)     
y_pred = ETC.predict(X_valid)

acc_ETC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_ETC

### 2.3 Generalized Linear Models

In [None]:
# Logistic Regression for multiclass classification

# https://acadgild.com/blog/logistic-regression-multiclass-classification
# https://medium.com/@jjw92abhi/is-logistic-regression-a-good-multi-class-classifier-ad20fecf1309

LG = LogisticRegression(solver="lbfgs", multi_class="multinomial")
LG.fit(X_train, y_train)     
y_pred = LG.predict(X_valid)

acc_LG = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_LG

We can use Logistic Regression to sanity-check our feature engineering decisions. This can be done by inspecting the coefficient of each feature in the decision function.

Positive coefficients increase the odds of the response (and thus increase the probability), and negative coefficients decrease the odds of the response (and thus decrease the probability).
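The link between a coefficient and the odds can be made concrete: in a binary logistic regression, increasing a feature by one unit multiplies the odds by `exp(coefficient)`. The sketch below checks this on synthetic one-feature data (an assumption for illustration). The model in this notebook has three classes, so there each class has its own row of coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic binary problem (illustrative assumption): larger x makes class 1 more likely
x = rng.normal(size=(400, 1))
y_bin = (x[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

clf = LogisticRegression().fit(x, y_bin)
coef = clf.coef_[0, 0]

def odds(value):
    """Odds p / (1 - p) of class 1 at a given feature value."""
    p = clf.predict_proba([[value]])[0, 1]
    return p / (1 - p)

# Moving from x = 0 to x = 1 multiplies the odds by exp(coef)
ratio = odds(1.0) / odds(0.0)
print(coef, ratio, np.exp(coef))
```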

In [None]:
# Coefficients for the first class (LG.classes_[0]), one per feature in X
coeff_df = pd.DataFrame({'Feature': X.columns})
coeff_df["Coefficient"] = pd.Series(LG.coef_[0])

coeff_df.sort_values(by='Coefficient', ascending=False)

In [None]:
# PassiveAggressiveClassifier

PAC = PassiveAggressiveClassifier()
PAC.fit(X_train, y_train)
y_pred = PAC.predict(X_valid)

acc_PAC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_PAC

In [None]:
# RidgeClassifierCV

RC = RidgeClassifierCV()
RC.fit(X_train, y_train)
y_pred = RC.predict(X_valid)

acc_RC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_RC

In [None]:
# Perceptron

P = Perceptron()
P.fit(X_train, y_train)
y_pred = P.predict(X_valid)

acc_P = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_P

In [None]:
# Stochastic Gradient Descent
# https://scikit-learn.org/stable/modules/sgd.html

SGD = SGDClassifier(shuffle=True,average=True)
SGD.fit(X_train, y_train)
y_pred = SGD.predict(X_valid)

acc_SGD = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_SGD

### 2.4 KNN

In [None]:
# KNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_valid)

acc_knn = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_knn

### 2.5 SVC

In [None]:
# Support Vector Classifier

# Use a lowercase variable name so the imported SVC class is not overwritten
svc = SVC(probability=True)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_valid)

acc_SVC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_SVC

In [None]:
# Linear SVC

linear_SVC = LinearSVC()
linear_SVC.fit(X_train,y_train)
y_pred = linear_SVC.predict(X_valid)

acc_linear_SVC = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_linear_SVC

### 2.6 Discriminant Analysis

In [None]:
# LinearDiscriminantAnalysis

LDA = LinearDiscriminantAnalysis()
LDA.fit(X_train,y_train)
y_pred = LDA.predict(X_valid)

acc_LDA = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_LDA

In [None]:
# QuadraticDiscriminantAnalysis

QDA = QuadraticDiscriminantAnalysis()
QDA.fit(X_train,y_train)
y_pred = QDA.predict(X_valid)

acc_QDA = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_QDA

### 2.7 Naive Bayes

In [None]:
# BernoulliNB

bernoulliNB = BernoulliNB()
bernoulliNB.fit(X_train,y_train)
y_pred = bernoulliNB.predict(X_valid)

acc_bernoulliNB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_bernoulliNB

In [None]:
# GaussianNB

gaussianNB = GaussianNB()
gaussianNB.fit(X_train,y_train)
y_pred = gaussianNB.predict(X_valid)

acc_gaussianNB = round(accuracy_score(y_valid,y_pred) * 100, 2)
acc_gaussianNB

## 3. Compare model results

In [None]:
models = pd.DataFrame({
    'Model': ['LightGBM','Decision Tree',"Extra Tree",'Random Forest','Support Vector', 'KNN', 'Logistic Regression', 
              'Stochastic Gradient Descent', 'Linear SVC',"XGBoost", "Ada Boost Classifier", 
              "Bagging Classifier", "Passive Aggressive", "Ridge","Perceptron",
              'Gradient Boosting Classifier','Extra Trees',
              "LinearDA","QuadraticDA","BernoulliNB","GaussianNB"],
    'Score': [acc_LGB,acc_decision_tree,acc_extra_tree,acc_rfc, acc_SVC, acc_knn, acc_LG,
              acc_SGD, acc_linear_SVC, acc_xgb, acc_AB, 
              acc_BC, acc_PAC, acc_RC, acc_P,
              acc_GB, acc_ETC,
             acc_LDA, acc_QDA, acc_bernoulliNB, acc_gaussianNB]})
sorted_by_score = models.sort_values(by='Score', ascending=False)

In [None]:
#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html
sns.barplot(x='Score', y = 'Model', data = sorted_by_score, color = 'g')

#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html
plt.title('Machine Learning Algorithm Accuracy Score \n')
plt.xlabel('Accuracy Score on validation data (%)')
plt.ylabel('Model')

The top 3 models are: 
- Random Forest
- Gradient Boosting Classifier
- LightGBM

Of these, the Gradient Boosting Classifier is the fastest, but Random Forest gives a slightly better score so far (79.71% compared to 79.24% for GB).
We are now going to find the best parameters for these 3 models using grid search.

## 4. Hyperparameter tuning

We will use grid search to tune the hyperparameters of the 3 best models.

# Tuning for RF
sc = ss()
X = sc.fit_transform(X)

rfc = RandomForestClassifier(criterion='entropy', n_estimators = 50,random_state=42)

params = {"min_samples_split" : [4, 6, 8],
             "n_estimators" : [500, 700, 1000]}


grid_search = GridSearchCV(estimator=rfc, cv=4, param_grid=params, n_jobs=-1, verbose=5) # n_jobs=-1 = use all the CPU cores

grid_search.fit(X, y.values.ravel())

print(grid_search.best_score_)
print(grid_search.best_params_)

# Tuning for LGB
sc = ss()
X = sc.fit_transform(X)

LGB = LGBMClassifier(objective='multiclass', num_threads=2, verbose=2, random_state=123)

params = {'num_iterations': [100, 150, 200],
          'max_depth': [5, 8, 15],
          'learning_rate': [0.01, 0.75, 0.1, 0.2],
          'num_leaves' : [25, 40, 50]
         }

grid_search = GridSearchCV(estimator=LGB, cv=4, param_grid=params, n_jobs=-1, verbose=5) # n_jobs=-1 = use all the CPU cores

grid_search.fit(X, y.values.ravel())

print(grid_search.best_score_)
print(grid_search.best_params_)

We need to find the best values of the following parameters for our Gradient Boosting model:
- learning_rate
- max_depth
- min_samples_leaf
- max_features
- n_estimators

The full GridSearchCV takes very long (it ran for more than 12 hours without finishing, so I interrupted it manually), so we'll perform a randomized search instead.

Reference: https://zlatankr.github.io/posts/2017/01/23/pump-it-up
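To see why the full grid is so expensive, we can count the model fits each approach needs for the Gradient Boosting parameter lists used in this section. The sketch assumes the same 10-fold cross-validation for both searches; `ParameterGrid` is used here only to count combinations.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as in the randomized search for Gradient Boosting
param_dist = {"n_estimators": [50, 100, 150],
              "learning_rate": [0.05, 0.025, 0.075, 0.01],
              "max_depth": [12, 13, 14],
              "min_samples_leaf": [14, 15, 16, 17],
              "max_features": [0.5, 0.3, 0.7, 1.0]}

cv_folds = 10  # assumption: same number of folds for both searches
n_iter = 10    # number of sampled combinations in the randomized search

n_combinations = len(ParameterGrid(param_dist))
grid_fits = n_combinations * cv_folds  # exhaustive grid search
random_fits = n_iter * cv_folds        # randomized search

print(n_combinations, grid_fits, random_fits)
```

The randomized search therefore explores only a small sample of the grid, which is the price paid for a run time that is manageable.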

# randomized search full
GB = GradientBoostingClassifier(n_estimators=100, 
                                learning_rate=0.075,
                                max_depth=14,
                                max_features=1.0,
                                min_samples_leaf=16)


param_dist = {"n_estimators" : [50,100, 150],
              "learning_rate":[0.05, 0.025, 0.075, 0.01],
             "max_depth" : [12,13,14], 
              "min_samples_leaf":[14,15,16,17],
             "max_features" : [0.5,0.3,0.7,1.0]}

rs = RandomizedSearchCV(estimator=GB,
                  param_distributions=param_dist,
                  scoring='accuracy',
                  cv=10, n_iter=10, n_jobs=-1)

rs.fit(X, y)

print(rs.best_score_)
print(rs.best_params_)

RandomizedSearchCV result:

param_dist 
- "n_estimators" : [50,100, 150],
- "learning_rate":[0.05, 0.025, 0.075, 0.01],
- "max_depth" : [12,13,14], 
- "min_samples_leaf":[14,15,16,17],
- "max_features" : [0.5,0.3,0.7,1.0]

0.7958922558922559
{'n_estimators': 100, 'min_samples_leaf': 14, 'max_features': 0.5, 'max_depth': 13, 'learning_rate': 0.075}

## 5. Retrain the tuned model on the whole train set

For some reason, retraining the model on the whole train set (train + validation) gives much worse results on the test set. The reason is not yet clear to me (possibly overfitting) and needs further investigation. For the time being I will omit this step. Instead, I have adjusted the parameters of the top 3 models above based on the tuning results.

## 6. Voting classifier

In [None]:
"""
You can combine your best predictors as a VotingClassifier, which can enhance the performance.

"""

estimators = [('RFC', rfc), ('LGB', LGB), ('GB', GB)]

ensemble = VotingClassifier(estimators, voting='soft')

ensemble.fit(X, y)
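With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest average. The sketch below reproduces this by hand on synthetic data; the three member models are stand-ins chosen for speed, not the tuned models from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

voter = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(max_depth=3, random_state=0))],
    voting="soft",
).fit(X_demo, y_demo)

# Manual soft vote: average the members' probabilities, then take the argmax
avg_proba = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(avg_proba, axis=1)]

print(np.array_equal(manual_pred, voter.predict(X_demo)))
```

Soft voting requires every member to implement `predict_proba`, which is the case for the three models combined above.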

## 7. Submission

In [None]:
submission_df = pd.read_csv("../input/pumpitup-challenge-dataset/SubmissionFormat.csv")

In [None]:
X_test = sc.transform(X_test_final)
submission_df['status_group']=rfc.predict(X_test)

In [None]:
vals_to_replace = {2:'functional', 1:'functional needs repair', 0:'non functional'}

submission_df.status_group = submission_df.status_group.replace(vals_to_replace)

In [None]:
submission_df.to_csv("submission_TatianaSwrt_rfc_noretrain_80.csv",sep=',', index=False)

## 8. Conclusion and possible future improvements

The goal of this project was to predict if a pump is functional, non-functional or needs repair based on some data describing the pump and its surroundings.

In my research I first performed an exploratory data analysis. I began by calculating a baseline accuracy score of 54.31%: a model that predicts with lower accuracy adds no value, because it is no better than an uneducated guess. I then split the data into numerical and categorical columns, identified missing values to deal with in the preprocessing phase, searched for outliers in the data and assessed correlations among attributes.
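A baseline of this kind can be computed with a dummy model that always predicts the most frequent class. This is a sketch under the assumption that the baseline is the majority-class share; the label counts below are hypothetical and are not the real class distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical label counts (illustration only): class 2 is the majority with 54 of 100 rows
y_toy = np.array([2] * 54 + [0] * 38 + [1] * 8)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)

print(baseline_acc)
```

Any real model should be judged against this number and not against 0%.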

In the next step I performed data cleaning and preprocessing. First, I dropped features containing similar information to avoid multicollinearity. Then I filled missing values and reduced the cardinality of several categorical features that had many distinct values so that they could be encoded. I performed ordinal encoding for those variables where it made sense and one-hot encoding for the rest of the variables. Finally, I used feature engineering to create new predictors (including LDA, binning, binary variables, and turning a date-time variable into a continuous numerical variable).

After all the preprocessing I performed feature selection based on L1 regularization with logistic regression that yielded 80 most important variables out of 90 total columns.

In the final step I tested multiple models with standard parameters and plotted the results on a graph.

The top 3 models for this project are: 
- Random Forest
- Gradient Boosting Classifier
- Light Gradient Boosting Classifier

Hyperparameter tuning only slightly improved the scores.
Surprisingly, retraining didn't improve the scores; this needs to be investigated further.
Principal Component Analysis didn't improve the scores either, and neither did the Voting classifier I created from the 3 top models listed above.

**I achieved the maximum of 79.71% accuracy on the validation data and 79.6% accuracy on the test data submitted to the competition on DrivenData with the Random Forest model.**

Ideas for future improvements:

- Create a 'for' loop to automate the process of model selection
- More feature engineering
- Try to remove amount_tsh
- 2 binary variables -> replace unknown with false
- Log transform to reduce skew: population, amount_tsh
- Don't do ordinal but other type of encoding
- Deal with imbalanced classes (keras to balance classes?)
- Xgboost -> feature importance (not Random Forest)
- Try a different scaler
- Remove outliers in the population variable
- Fill missing values with median/mean
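As an illustration of the log-transform idea in the list above, the sketch below measures skewness before and after `np.log1p` on a synthetic right-skewed series. The series only stands in for a variable such as `population`; it is not the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values (assumption for illustration)
population = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=5000))

skew_before = population.skew()
# log1p computes log(1 + x), so zero values remain valid inputs
skew_after = np.log1p(population).skew()

print(skew_before, skew_after)
```

`log1p` is a natural choice when a column contains many zeros, since a plain logarithm is undefined at zero.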