
# Imagine this scenario ...

In [0]:
!pip install category_encoders



In [0]:
import pandas as pd

X_test = pd.read_csv('https://drive.google.com/uc?export=download&id=1350jvoAsNoQmsgfkmh4iSYqiOF2q6xFF')
print(f'The test set has {len(X_test)} observations')

X_train = pd.read_csv('https://drive.google.com/uc?export=download&id=1Bwj7iwO2RvN3x_LZeGnY3zgTY8aQPMMC')
y_train = pd.read_csv('https://drive.google.com/uc?export=download&id=1dxe6zLADkJwgI-i4NYjbkZFCh2ANrIhB')['status_group']
print(f'The train set has {len(X_train)} observations')

The test set has 14358 observations
The train set has 59400 observations


Suppose there are over 14,000 waterpumps that you _do_ have some information about, but you _don't_ know whether they are currently functional, or functional but need repair, or non-functional.

You have the time and resources to go to just 2,000 waterpumps for proactive maintenance. You want to predict which 2,000 are most likely non-functional or in need of repair, to help you triage and prioritize your waterpump inspections.

You have historical inspection data for over 59,000 other waterpumps, which you'll use to fit your predictive model.

Based on this historical data, if you randomly chose waterpumps to inspect, then about 46% of the waterpumps would need repairs, and 54% would not need repairs. Can you do better than random at prioritizing inspections?

In [0]:
y_train.value_counts(normalize=True)

functional                 0.543081
non functional             0.384242
functional needs repair    0.072677
Name: status_group, dtype: float64

In this scenario, we should define our target differently. We want to identify which waterpumps are non-functional _or_ are functional but need repair:

In [0]:
y_train = y_train != 'functional'
y_train.value_counts(normalize=True)

False    0.543081
True     0.456919
Name: status_group, dtype: float64

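And in this scenario, accuracy isn't the best metric! We can only act on 2,000 of our predictions, so what matters is how many of those 2,000 are correct, not how well we classify every waterpump.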

Instead, let's use something more similar to **Precision@K**, where k=2,000.

Read more here: [Recall and Precision at k for Recommender Systems: Detailed Explanation with examples](https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54)

> Precision at k is the proportion of recommended items in the top-k set that are relevant

> Mathematically precision@k is defined as: `Precision@k = (# of recommended items @k that are relevant) / (# of recommended items @k)`

> In the context of recommendation systems we are most likely interested in recommending top-N items to the user. So it makes more sense to compute precision and recall metrics in the first N items instead of all the items. Thus the notion of precision and recall at k where k is a user definable integer that is set by the user to match the top-N recommendations objective.

What we want is similar, but even simpler: make exactly 2,000 positive predictions (the 2,000 waterpumps we recommend inspecting), then count how many of those recommendations were relevant.
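
As a rough illustration of the metric quoted above, here is a minimal sketch of precision@k in plain Python. The function name `precision_at_k` and the toy labels and scores are hypothetical, not part of the original notebook.

```python
def precision_at_k(y_true, y_score, k):
    """Fraction of the k highest-scored items whose true label is positive."""
    # Copy to plain lists so positional indexing works for lists, arrays, or Series.
    y_true = list(y_true)
    y_score = list(y_score)
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of items")
    # Rank positions by score, highest first; tied scores keep their original order.
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = sum(bool(y_true[i]) for i in ranked[:k])
    return hits / k


# Toy example: the three highest scores belong to items 0, 1, and 2,
# and two of those three are truly positive.
toy_labels = [True, False, True, True, False]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_precision = precision_at_k(toy_labels, toy_scores, k=3)
```

In this scenario, `k` would be 2,000, the labels would be the true repair status of each waterpump, and the scores would be the model's predicted probabilities.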

---

So, let's make a validation set the same size as our test set.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=len(X_test), stratify=y_train, random_state=42)

X_train.shape, X_val.shape, X_test.shape

((45042, 40), (14358, 40), (14358, 40))

Then fit a model. (This is just a quick example, but you can and should make a better model than this!)

In [0]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier

features = ['longitude', 'latitude', 'quantity']
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train[features])
X_val_encoded = encoder.transform(X_val[features])
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', n_jobs=-1, random_state=42)
model.fit(X_train_encoded, y_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=100, n_jobs=-1, oob_score=False,
                       random_state=42, verbose=0, warm_start=False)

Then get predicted probabilities for the positive class, in the validation set.

In [0]:
y_pred_proba = model.predict_proba(X_val_encoded)[:,1]

The predicted probabilities range from 0 to 1, as expected.

In [0]:
pd.Series(y_pred_proba).describe()

count    14358.000000
mean         0.456897
std          0.357175
min          0.000000
25%          0.110000
50%          0.410000
75%          0.800000
max          1.000000
dtype: float64

Identify the 2,000 waterpumps in the validation set with highest predicted probabilities.

In [0]:
results = pd.DataFrame({'y_val': y_val, 'y_pred_proba': y_pred_proba})
top2000 = results.sort_values(by='y_pred_proba', ascending=False)[:2000]

Most of these top 2,000 waterpumps will be relevant recommendations, meaning `y_val==True`, meaning the waterpump is non-functional or needs repairs.

Some of these top 2,000 waterpumps will be irrelevant recommendations, meaning `y_val==False`, meaning the waterpump is functional and does not need repairs.

In [0]:
top2000.sample(n=50)

Unnamed: 0,y_val,y_pred_proba
50807,True,1.0
10359,True,1.0
47911,True,1.0
26671,True,1.0
4335,True,0.98
48737,True,1.0
26016,True,0.98
13478,True,1.0
38679,True,1.0
28315,True,1.0


So how many of our recommendations were relevant? ...

In [0]:
top2000['y_val'].sum()

1926

In the validation set, 1,926 of our 2,000 recommendations (about 96%) were waterpumps in need of repair!

So we will use this predictive model with the dataset of over 14,000 waterpumps that we _do_ have some information about, but we _don't_ know whether they are currently functional, or functional but need repair, or non-functional.

We will predict which 2,000 are most likely non-functional or in need of repair.
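
One way this step might look in code is sketched below. The helper `recommend_top_k` is hypothetical; it assumes an encoder and a model that behave like the fitted `encoder` and `model` from earlier (a `transform` method, and a `predict_proba` method whose second column is the positive class).

```python
import numpy as np
import pandas as pd


def recommend_top_k(model, encoder, X, features, k=2000):
    """Return the k rows of X with the highest predicted probability of needing repair."""
    X_encoded = encoder.transform(X[features])
    # Column 1 of predict_proba is the positive class (True) for a boolean target.
    proba = np.asarray(model.predict_proba(X_encoded))[:, 1]
    ranked = pd.Series(proba, index=X.index, name='y_pred_proba')
    # Highest probabilities first; the order among tied probabilities is arbitrary.
    return ranked.sort_values(ascending=False).head(k)


# Hypothetical usage with the objects fitted earlier:
# recommendations = recommend_top_k(model, encoder, X_test, features, k=2000)
```

The returned Series is indexed like `X`, so its index identifies which rows of the test set to inspect.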

We estimate that approximately 1,900 waterpumps will be repaired after these 2,000 maintenance visits.

If we had randomly chosen waterpumps to inspect, we estimate that only about 920 (46% of 2,000) would be repaired after 2,000 maintenance visits.
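
The comparison with random inspection can be worked through as simple arithmetic. The sketch below uses only numbers reported earlier in this notebook; the variable names are illustrative. With the unrounded base rate of 0.456919 the random estimate comes to about 914 rather than 920, which does not change the conclusion.

```python
k = 2000               # inspections we can afford
hits_model = 1926      # relevant recommendations found in the validation set
base_rate = 0.456919   # share of waterpumps needing repair in the historical data

precision_model = hits_model / k          # share of our visits that find a problem
expected_random_hits = base_rate * k      # expected problems found by random visits
lift = precision_model / base_rate        # how many times better than random

print(f'Precision@{k}: {precision_model:.3f}')
print(f'Expected hits from random inspections: {expected_random_hits:.0f}')
print(f'Lift over random: {lift:.2f}x')
```

This assumes the model performs about as well on the unlabeled waterpumps as it did on the validation set.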

So we're confident that our predictive model will help triage and prioritize waterpump inspections!