This project was conceived by Taarifa as a competition among data scientists and is hosted on the DrivenData platform, where the competition can be accessed.
Taarifa is a platform that offers business management solutions to clients, including online data storage, information analysis, and many other features. It provides high data security with easy access for administrators on all devices. In other words, Taarifa is an open source platform for the crowd-sourced reporting and triaging of infrastructure-related issues.
For this project, the data is sourced from the Taarifa waterpoints dashboard, which aggregates data from the Tanzania Ministry of Water.
This project seeks to build a model that predicts whether a water pump is functional, needs some repairs, or is totally non-functional. The prediction of one of these three classes is based on a number of variables describing what kind of pump is operating, when it was installed, and how it is managed. A good understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.
The project is a multi-class classification problem. The best-performing of the models built will be deployed to production.
The dataset for this project has two files:
- `water-train.csv`: contains all the features
- `water-label.csv`: contains the label
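
Because features and labels live in separate files, they need to be joined before training. The sketch below shows one way to do that with the standard library. It assumes both files share an `id` column and that the label column is called `status_group`; neither name is stated above, and the sample values are made up for illustration.

```python
import csv
import io


def read_rows(source):
    """Read CSV text from an open file (or any iterable of lines) into a list of dicts."""
    return list(csv.DictReader(source))


def join_on_key(features, labels, key="id"):
    """Attach each label row to the feature row that shares the same key value."""
    labels_by_key = {row[key]: row for row in labels}
    joined = []
    for row in features:
        label_row = labels_by_key.get(row[key])
        if label_row is None:
            continue  # skip feature rows that have no matching label
        joined.append({**row, **label_row})
    return joined


# Tiny in-memory stand-ins for water-train.csv and water-label.csv.
# Column names `id` and `status_group` and all values are assumptions.
train_csv = io.StringIO("id,amount_tsh,basin\n1,50.0,Pangani\n2,0.0,Rufiji\n")
label_csv = io.StringIO("id,status_group\n2,non functional\n1,functional\n")

rows = join_on_key(read_rows(train_csv), read_rows(label_csv))
for row in rows:
    print(row["id"], row["basin"], row["status_group"])
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open("water-train.csv", newline="")` and `open("water-label.csv", newline="")`.
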
The goal is to predict the operating condition of a waterpoint for each record in the dataset. The following list gives an overview of the waterpoint features:
- `amount_tsh`: Total static head (amount of water available to the waterpoint)
- `date_recorded`: The date the row was entered
- `funder`: Who funded the well
- `gps_height`: Altitude of the well
- `installer`: Organization that installed the well
- `longitude`: GPS coordinate
- `latitude`: GPS coordinate
- `wpt_name`: Name of the waterpoint if there is one
- `num_private`: (no description)
- `basin`: Geographic water basin
- `subvillage`: Geographic location
- `region`: Geographic location
- `region_code`: Geographic location (coded)
- `district_code`: Geographic location (coded)
- `lga`: Geographic location
- `ward`: Geographic location
- `population`: Population around the well
- `public_meeting`: True/False
- `recorded_by`: Group entering this row of data
- `scheme_management`: Who operates the waterpoint
- `scheme_name`: Who operates the waterpoint
- `permit`: If the waterpoint is permitted
- `construction_year`: Year the waterpoint was constructed
- `extraction_type`: The kind of extraction the waterpoint uses
- `extraction_type_group`: The kind of extraction the waterpoint uses
- `extraction_type_class`: The kind of extraction the waterpoint uses
- `management`: How the waterpoint is managed
- `management_group`: How the waterpoint is managed
- `payment`: What the water costs
- `payment_type`: What the water costs
- `water_quality`: The quality of the water
- `quality_group`: The quality of the water
- `quantity`: The quantity of water
- `quantity_group`: The quantity of water
- `source`: The source of the water
- `source_type`: The source of the water
- `source_class`: The source of the water
- `waterpoint_type`: The kind of waterpoint
- `waterpoint_type_group`: The kind of waterpoint
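
Several features in the list share the same description (for example the three `extraction_type*` columns), which suggests that some of them may carry overlapping information. The sketch below is one possible way to check whether one column fully determines another. The helper name `determines` and the sample values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def determines(rows, fine_col, coarse_col):
    """Return True if every value of fine_col maps to exactly one value of coarse_col."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[fine_col]].add(row[coarse_col])
    return all(len(values) == 1 for values in seen.values())


# Hypothetical sample rows; the values are illustrative only.
sample = [
    {"extraction_type": "gravity", "extraction_type_class": "gravity"},
    {"extraction_type": "pump a", "extraction_type_class": "handpump"},
    {"extraction_type": "pump b", "extraction_type_class": "handpump"},
    {"extraction_type": "gravity", "extraction_type_class": "gravity"},
]

fine_to_coarse = determines(sample, "extraction_type", "extraction_type_class")
coarse_to_fine = determines(sample, "extraction_type_class", "extraction_type")
print(fine_to_coarse, coarse_to_fine)
```

If the finer column determines the coarser one, the coarser column adds no information beyond a grouping, so keeping only one of them may be enough for modelling.
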
Since this is a classification problem, I'll be building three prediction models:
- Random Forest
- Gradient Boosting (xgboost)
- Logistic regression

The best-performing model will be deployed to production.
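
Choosing the best of several models can be sketched as a small loop that fits each candidate and scores it on held-out data. To keep the example runnable without extra packages, the two classes below are deliberately trivial stand-ins and not the Random Forest, xgboost, or logistic regression models used in this project. The data, the label strings, and the use of plain accuracy as the metric are also assumptions.

```python
from collections import Counter


class MajorityClassBaseline:
    """Stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class ThresholdRule:
    """Stand-in model: a fixed, hand-written rule on the first feature."""

    def fit(self, X, y):
        return self  # nothing to learn

    def predict(self, X):
        return ["functional" if x[0] > 0 else "non functional" for x in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pick_best(models, X_train, y_train, X_val, y_val):
    """Fit every candidate and return the name with the highest validation accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(y_val, model.predict(X_val))
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (made up): a single feature per waterpoint.
X_train = [[10.0], [0.0], [5.0], [0.0]]
y_train = ["functional", "non functional", "functional", "functional"]
X_val = [[3.0], [0.0], [0.0], [8.0]]
y_val = ["functional", "non functional", "non functional", "functional"]

models = {"baseline": MajorityClassBaseline(), "threshold_rule": ThresholdRule()}
best, scores = pick_best(models, X_train, y_train, X_val, y_val)
print(best, scores)
```

Any object exposing `fit` and `predict` can be passed in the `models` dictionary; scikit-learn estimators follow that interface.
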
Of all the models trained on the provided dataset, Random Forest performed best. More focus was placed on the label class where the water pump is functioning but needs repair. This helps guard against the sudden collapse of a waterpoint, since prompt action can be taken on any waterpoint that shows signs of malfunctioning. I therefore paid particular attention to how well the model predicts waterpoints that are functioning but need repair.
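
Focusing on a single class means looking beyond overall accuracy. A minimal sketch of per-class recall follows: for each true class, it computes the share of its records that the model labelled correctly. The label strings and the predictions are made up for illustration and are not results from this project.

```python
def per_class_recall(y_true, y_pred):
    """For each true class, return the fraction of its records predicted correctly."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            hits[t] = hits.get(t, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


# Assumed label strings for the three classes; predictions are illustrative.
F, R, N = "functional", "functional needs repair", "non functional"
y_true = [F, F, F, R, R, N, N, N]
y_pred = [F, F, F, F, R, N, N, F]

recall = per_class_recall(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(overall, recall[R])
```

In this toy case the overall accuracy is 0.75 while recall on the "needs repair" class is only 0.5, which is why the class of interest deserves its own metric.
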
To give the model more predictive power, I tuned several hyperparameters to determine which value of each works best. The final model was built with the most promising hyperparameters.
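
The tuning step can be pictured as an exhaustive grid search: try every combination of candidate values and keep the one with the best score. The sketch below is a generic version and not necessarily the procedure used here. The parameter names `n_estimators` and `max_depth`, their candidate values, and `toy_score` are assumptions standing in for "train with these settings and return a validation score".

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (-abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 10) / 100)


# Illustrative grid; these are not the values tuned in this project.
param_grid = {"n_estimators": [100, 200, 400], "max_depth": [5, 10, 20]}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

In practice `score_fn` would fit the model on training data and return a validation metric, ideally the one that matters most for the project, such as recall on the "needs repair" class.
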
After the final model was built, I wrapped it in a Docker container, which was later deployed on AWS Cloud. This repo contains all the files needed to access the model in the cloud.
You can access the project app here: water-pump-project-serving-env.eba-gyvqqh2a.us-east-1.elasticbeanstalk.com