<a href="https://colab.research.google.com/github/trajinthan/pump-it-up-data-mining/blob/main/pumb_it_up.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load data**

Authenticate with Google Drive.

In [1068]:
# from pydrive.auth import GoogleAuth
# from pydrive.drive import GoogleDrive
# from google.colab import auth
# from oauth2client.client import GoogleCredentials

# auth.authenticate_user()
# gauth = GoogleAuth()
# gauth.credentials = GoogleCredentials.get_application_default()
# drive = GoogleDrive(gauth)

In [1069]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from typing import Dict, Tuple



Load the data from Google Drive into the Colab workspace using each CSV file's ID.

In [1070]:

# training_labels = drive.CreateFile({'id':'12QS3xedC7EoPS4Xj2cNVuwBSLnvbJMNM'}) 
# training_labels.GetContentFile('TrainLabel.csv')  
train_label = pd.read_csv('TrainLabel.csv')

# training_values = drive.CreateFile({'id':'1F4TZBjMRpTPkEbW7vjpQBIhl7Kp3QlEf'}) 
# training_values.GetContentFile('TrainValue.csv')  
train_value = pd.read_csv('TrainValue.csv')

# testing_labels = drive.CreateFile({'id':'1Y4Idhc-WeUTM5uQSjZOqyQ5r4ePgUD84'}) 
# testing_labels.GetContentFile('TestData.csv')  
Xtest = pd.read_csv('TestData.csv')

Merge the training values with their corresponding training labels.

In [1071]:
train_data = train_value.merge(train_label, on='id')
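A merge on `id` can silently duplicate rows when the key repeats, and the default inner join drops ids that appear in only one frame. As a minimal sketch, assuming each file holds exactly one row per `id`, the `validate` argument of `merge` can make that assumption explicit. The toy frames below are illustrative stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for train_value and train_label (values are illustrative).
values = pd.DataFrame({"id": [1, 2, 3], "gps_height": [1390, 1399, 686]})
labels = pd.DataFrame({"id": [3, 1, 2], "status_group": ["a", "b", "c"]})

# validate="one_to_one" raises pandas.errors.MergeError if `id` repeats in either frame.
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")

# An inner merge drops unmatched ids, so compare row counts to detect losses.
assert len(merged) == len(values) == len(labels)
print(merged)
```

If the row counts differ, some ids exist in only one of the two files and are worth inspecting before training.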

In [1072]:
train_data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 69572 | 8776 | 34310 | 67743 | 19728 |
| amount_tsh | 6000 | 0 | 25 | 0 | 0 |
| date_recorded | 2011-03-14 | 2013-03-06 | 2013-02-25 | 2013-01-28 | 2011-07-13 |
| funder | Roman | Grumeti | Lottery Club | Unicef | Action In A |
| gps_height | 1390 | 1399 | 686 | 263 | 0 |
| installer | Roman | GRUMETI | World vision | UNICEF | Artisan |
| longitude | 34.9381 | 34.6988 | 37.4607 | 38.4862 | 31.1308 |
| latitude | -9.85632 | -2.14747 | -3.82133 | -11.1553 | -1.82536 |
| wpt_name | none | Zahanati | Kwa Mahundi | Zahanati Ya Nanyumbu | Shuleni |
| num_private | 0 | 0 | 0 | 0 | 0 |


Get the data type of each column.

In [1073]:
train_data.dtypes

id                         int64
amount_tsh               float64
date_recorded             object
funder                    object
gps_height                 int64
installer                 object
longitude                float64
latitude                 float64
wpt_name                  object
num_private                int64
basin                     object
subvillage                object
region                    object
region_code                int64
district_code              int64
lga                       object
ward                      object
population                 int64
public_meeting            object
recorded_by               object
scheme_management         object
scheme_name               object
permit                    object
construction_year          int64
extraction_type           object
extraction_type_group     object
extraction_type_class     object
management                object
management_group          object
payment                   object
payment_ty

In [1074]:
# train_data['status_group'].value_counts()

# **Data Preprocessing**

**Drop identical or unnecessary columns**

1. The features **quantity** and **quantity_group** are both described as **The quantity of water**, so we need to check whether they carry the same information.

In [1075]:
# train_data['quantity'].value_counts()

In [1076]:
# train_data['quantity_group'].value_counts()

In [1077]:
# train_data.groupby(['quantity','quantity_group']).count()

As both features carry identical values, we can drop either **quantity** or **quantity_group**.

2. The features **water_quality** and **quality_group** are both described as **The quality of the water**, so we need to check whether they carry the same information.

In [1078]:
# train_data['water_quality'].value_counts()

In [1079]:
# train_data['quality_group'].value_counts()

In [1080]:
# train_data.groupby(['water_quality','quality_group']).count()

As both features have almost the same values, we can drop one of them. **water_quality** is the more informative of the two.

3. The features **payment** and **payment_type** are both described as **What the water costs**, so we need to check whether they carry the same information.

In [1081]:
# train_data['payment'].value_counts()

In [1082]:
# train_data['payment_type'].value_counts()

In [1083]:
# train_data.groupby(['payment','payment_type']).count()

As both features carry identical values, we can drop either **payment** or **payment_type**.

4. The features **waterpoint_type** and **waterpoint_type_group** are both described as **The kind of waterpoint**, so we need to check whether they carry the same information.

In [1084]:
# train_data['waterpoint_type'].value_counts()

In [1085]:
# train_data['waterpoint_type_group'].value_counts()

In [1086]:
# train_data.groupby(['waterpoint_type','waterpoint_type_group']).count()

As both features have almost the same values, we can drop one of them. **waterpoint_type** is the more informative of the two.

5. The features **source**, **source_type** and **source_class** are all described as **The source of the water**, so we need to check whether they carry the same information.

In [1087]:
# train_data['source'].value_counts()

In [1088]:
# train_data['source_type'].value_counts()

In [1089]:
# train_data['source_class'].value_counts()

In [1090]:
# train_data.groupby(['source_class','source_type','source']).count()

As **source_class** and **source_type** are coarser groupings of **source**, we can drop **source_class** and **source_type**. **source** is the most informative of the three.

6. The features **management** and **management_group** are both described as **How the waterpoint is managed**, so we need to check whether they carry the same information.

In [1091]:
# train_data['management'].value_counts()

In [1092]:
# train_data['management_group'].value_counts()

In [1093]:
# train_data.groupby(['management_group','management']).count()

**management** and **management_group** contain the same information, and **management** is more detailed, so **management_group** can be dropped.

7. The features **extraction_type**, **extraction_type_class** and **extraction_type_group** are all described as **The kind of extraction the waterpoint uses**, so we need to check whether they carry the same information.

In [1094]:
# train_data['extraction_type'].value_counts()

In [1095]:
# train_data['extraction_type_class'].value_counts()

In [1096]:
# train_data['extraction_type_group'].value_counts()

In [1097]:
# train_data.groupby(['extraction_type_class','extraction_type_group','extraction_type']).count()

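The three features describe the same attribute at different levels of detail, so only one of them needs to be kept. The drop list below keeps **extraction_type_group** and drops **extraction_type** and **extraction_type_class**.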

8. The features **scheme_management** and **scheme_name** are both described as **Who operates the waterpoint**, so we need to check whether they carry the same information.

In [1098]:
# train_data['scheme_management'].value_counts()

In [1099]:
# train_data['scheme_name'].value_counts()

In [1100]:
# train_data.groupby(['scheme_management','scheme_name']).count()

9. The feature **recorded_by** can be dropped because it has only one distinct value.

In [1101]:
train_data['recorded_by'].value_counts()

GeoData Consultants Ltd    59400
Name: recorded_by, dtype: int64

10. The features **region** and **region_code** both identify the region, so only one of them is kept. The drop list below removes **region_code**.

In [1102]:
train_data['region'].value_counts()

Iringa           5294
Shinyanga        4982
Mbeya            4639
Kilimanjaro      4379
Morogoro         4006
Arusha           3350
Kagera           3316
Mwanza           3102
Kigoma           2816
Ruvuma           2640
Pwani            2635
Tanga            2547
Dodoma           2201
Singida          2093
Mara             1969
Tabora           1959
Rukwa            1808
Mtwara           1730
Manyara          1583
Lindi            1546
Dar es Salaam     805
Name: region, dtype: int64
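The pairwise checks in items 1 to 8 and the single-value check in item 9 can also be run programmatically instead of being read off `value_counts()` and `groupby()` output. The sketch below is one possible way to do that. The helper names `is_coarser_grouping` and `constant_columns` are hypothetical, and the toy frame only imitates the structure of `train_data`.

```python
import pandas as pd

def is_coarser_grouping(df: pd.DataFrame, fine: str, coarse: str) -> bool:
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    # Count how many distinct coarse values occur for each fine value.
    distinct_per_fine = df.groupby(fine)[coarse].nunique()
    return bool((distinct_per_fine == 1).all())

def constant_columns(df: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique() <= 1]

# Toy frame standing in for train_data (values are illustrative only).
toy = pd.DataFrame({
    "source": ["spring", "river", "lake", "spring"],
    "source_class": ["groundwater", "surface", "surface", "groundwater"],
    "recorded_by": ["GeoData Consultants Ltd"] * 4,
})

# source_class can be derived from source, but not the other way round.
print(is_coarser_grouping(toy, "source", "source_class"))
print(is_coarser_grouping(toy, "source_class", "source"))
print(constant_columns(toy))
```

When the first check is true for a pair, the coarser column adds no information beyond the finer one, which is the reasoning used for dropping the `*_group`, `*_class` and `*_type` variants.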

Define a function that drops the columns

In [1103]:
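def drop_columns(dataset: pd.DataFrame):
    # Redundant or uninformative columns removed from the dataset, in place.
    columns_to_drop = ['management_group', 'scheme_management',
                       'quantity_group', 'source_class',
                       'source_type', 'recorded_by', 'quality_group',
                       'payment_type', 'extraction_type_class',
                       'extraction_type', 'waterpoint_type_group',
                       'region_code', 'amount_tsh', 'num_private']
    # Use the `columns` keyword: pandas 2.x no longer accepts a positional axis.
    dataset.drop(columns=columns_to_drop, inplace=True)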

In [1104]:
drop_columns(train_data)
drop_columns(Xtest)

**Handling Null Values**

Get the count of null values in each feature.

In [1105]:
# train_data.isnull().sum()

Analyze the values of the features that have missing values.

In [1106]:
# train_data['funder'].value_counts().head(20)

In [1107]:
# train_data['installer'].value_counts().head(20)

In [1108]:
# train_data['scheme_name'].value_counts().head(20)

Missing values in **funder**, **installer** and **scheme_name** can be filled with **n/a**.

In [1109]:
# train_data['public_meeting'].value_counts().head(20)

In [1110]:
# train_data['permit'].value_counts().head(20)

**public_meeting** and **permit** have nearly 3000 null values, and **true** occurs far more often than **false** in both, so we can fill the null values with **true**.
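Filling with the majority value can be written so that the fill value is computed from the data rather than hard-coded. The following is a minimal sketch under that assumption. `fill_with_mode` is a hypothetical helper, and the toy frame only mimics the two boolean columns.

```python
import numpy as np
import pandas as pd

def fill_with_mode(df: pd.DataFrame, columns) -> dict:
    """Fill NaN in each listed column with its most frequent value.

    Returns the fill values so the same ones can be reused on another frame.
    """
    fill_values = {}
    for col in columns:
        mode_value = df[col].mode(dropna=True).iloc[0]
        df[col] = df[col].fillna(mode_value)
        fill_values[col] = mode_value
    return fill_values

# Toy frame with missing booleans (values are illustrative only).
toy = pd.DataFrame({
    "permit": [True, True, False, np.nan, True],
    "public_meeting": [True, np.nan, True, True, False],
})

used = fill_with_mode(toy, ["permit", "public_meeting"])
print(used)
print(toy.isnull().sum())
```

Returning the fill values makes it possible to apply the training-set majority to the test frame, instead of recomputing it there.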

Define a function that replaces missing values

In [1111]:
def replace_null_value(dataset: pd.DataFrame):
    # Text columns: mark missing entries with an explicit placeholder.
    for column in ['funder', 'installer', 'scheme_name', 'subvillage']:
        dataset[column] = dataset[column].fillna('n/a')
    # Boolean columns: fill with the majority value, True.
    for column in ['permit', 'public_meeting']:
        dataset[column] = dataset[column].fillna(True)

In [1112]:
replace_null_value(train_data)
replace_null_value(Xtest)

In [1113]:
import datetime

def convert_date_columns_to_epoch(dataset: pd.DataFrame, timestamp_format="%Y-%m-%d"):
    # Replace each date string with seconds since the Unix epoch (local time zone).
    dataset['date_recorded'] = [datetime.datetime.strptime(x, timestamp_format).timestamp()
                                for x in dataset['date_recorded']]

In [1114]:
# convert_date_columns_to_epoch(train_data)
# convert_date_columns_to_epoch(Xtest)

**Encoding categorical columns**

Convert boolean values into **0** and **1**

In [1115]:
train_data['permit'] = train_data['permit'].astype(bool).astype(int)
Xtest['permit'] = Xtest['permit'].astype(bool).astype(int)

In [1116]:
train_data['public_meeting'] = train_data['public_meeting'].astype(bool).astype(int)
Xtest['public_meeting'] = Xtest['public_meeting'].astype(bool).astype(int)
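Note that `astype(bool)` on an object column relies on Python truthiness, so every non-empty string becomes `True`, including the string `"False"`. The sketch below contrasts that behaviour with an explicit mapping. `to_binary` is a hypothetical helper, and its `default` of 1 mirrors the choice above of filling missing values with true.

```python
import pandas as pd

raw = pd.Series([True, False, "true", "False"])

# Truthiness-based cast: the string "False" is non-empty, so it becomes 1.
print(raw.astype(bool).astype(int).tolist())

def to_binary(series: pd.Series, default: int = 1) -> pd.Series:
    """Map booleans and their string spellings to 0/1; anything else gets `default`."""
    mapping = {"true": 1, "false": 0}
    normalised = series.astype(str).str.lower().map(mapping)
    return normalised.fillna(default).astype(int)

# Explicit mapping: the spelling decides the value, not the string length.
print(to_binary(raw).tolist())
```

For the columns in this notebook the two approaches should agree as long as the only string present is the fill value for true. The explicit mapping is safer if the raw file ever contains textual `False` entries.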

Apply label encoding for categorical columns

In [1117]:
cat_cols = train_data.select_dtypes('object').columns

In [1118]:
def encode_categorical_columns(dataset: pd.DataFrame) -> Dict[str, LabelEncoder]:
    encoders = {}
    for column in cat_cols:
        # Skip columns missing from this frame (e.g. the label column in Xtest).
        if column not in dataset.columns:
            continue

        le = LabelEncoder()
        dataset[column] = le.fit_transform(dataset[column])
        encoders[column] = le

    return encoders

In [1119]:
encoders = encode_categorical_columns(train_data)
encode_categorical_columns(Xtest)

{'basin': LabelEncoder(),
 'date_recorded': LabelEncoder(),
 'extraction_type_group': LabelEncoder(),
 'funder': LabelEncoder(),
 'installer': LabelEncoder(),
 'lga': LabelEncoder(),
 'management': LabelEncoder(),
 'payment': LabelEncoder(),
 'quantity': LabelEncoder(),
 'region': LabelEncoder(),
 'scheme_name': LabelEncoder(),
 'source': LabelEncoder(),
 'subvillage': LabelEncoder(),
 'ward': LabelEncoder(),
 'water_quality': LabelEncoder(),
 'waterpoint_type': LabelEncoder(),
 'wpt_name': LabelEncoder()}
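Because the cell above fits a fresh `LabelEncoder` on `Xtest`, the same string can receive a different integer in the two frames whenever their sets of distinct values differ. One way to keep the codes aligned is to learn the mapping from the training frame only and reuse it on the test frame. The sketch below does this with plain pandas rather than `LabelEncoder`. The helper names and the `-1` code for unseen values are assumptions made for illustration.

```python
import pandas as pd

def fit_category_maps(train: pd.DataFrame, columns) -> dict:
    """Build a value -> integer code mapping per column from the training data only."""
    return {
        col: {value: code for code, value in enumerate(sorted(train[col].unique()))}
        for col in columns
    }

def apply_category_maps(df: pd.DataFrame, maps: dict, unseen_code: int = -1) -> pd.DataFrame:
    """Encode columns with the training mappings; unseen values get `unseen_code`."""
    encoded = df.copy()
    for col, mapping in maps.items():
        if col in encoded.columns:
            encoded[col] = encoded[col].map(mapping).fillna(unseen_code).astype(int)
    return encoded

# Toy frames (values are illustrative only).
train = pd.DataFrame({"basin": ["Rufiji", "Pangani", "Rufiji"]})
test = pd.DataFrame({"basin": ["Pangani", "Lake Nyasa"]})

maps = fit_category_maps(train, ["basin"])
train_encoded = apply_category_maps(train, maps)
test_encoded = apply_category_maps(test, maps)
print(maps)
print(test_encoded)
```

With this approach "Pangani" gets the same code in both frames, and a value that never occurred in training is flagged explicitly instead of shifting the other codes.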

**Scale all columns using StandardScaler**

In [1120]:
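def scale_columns(dataset: pd.DataFrame, exclude=('id', 'status_group')):
    # Write the scaled values back into the frame; assigning to the local
    # name `dataset` would leave the caller's DataFrame unchanged.
    cols = [c for c in dataset.columns if c not in exclude]
    dataset[cols] = StandardScaler().fit_transform(dataset[cols])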

In [1121]:
scale_columns(train_data)
scale_columns(Xtest)
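Scaling the training and test frames independently gives each its own mean and variance, so the same raw value can map to different scaled values. A common alternative is to fit the scaler on the training features and reuse it on the test features. The following is a sketch of that idea. `scale_features` is a hypothetical helper, and the toy frames are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_features(train: pd.DataFrame, test: pd.DataFrame, feature_cols):
    """Fit a StandardScaler on `train` and apply the same transform to `test`."""
    scaler = StandardScaler()
    train_scaled, test_scaled = train.copy(), test.copy()
    train_scaled[feature_cols] = scaler.fit_transform(train[feature_cols].astype(float))
    test_scaled[feature_cols] = scaler.transform(test[feature_cols].astype(float))
    return train_scaled, test_scaled, scaler

# Toy frames: `id` and the label are left out of the scaled columns.
train_toy = pd.DataFrame({"id": [1, 2, 3], "gps_height": [0, 10, 20], "status_group": [0, 1, 0]})
test_toy = pd.DataFrame({"id": [4, 5], "gps_height": [10, 30]})

train_s, test_s, fitted = scale_features(train_toy, test_toy, ["gps_height"])
print(train_s)
print(test_s)
```

The identifier and the label are excluded on purpose: scaling the label would turn the class codes into floats, and the identifier carries no feature information.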

**Split train dataset**

In [1122]:
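# Features: every column except the trailing label column and the row identifier.
X = train_data.iloc[:, :-1].drop(columns='id')
y = train_data.iloc[:, -1]
# Split X and y into train and test parts. With a test_size this small,
# almost every row ends up in the training part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.00000001, random_state=42)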
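With such a small `test_size`, the held-out part is too small for the accuracy and F1 figures computed later to be informative. When the goal is evaluation rather than a final fit on all rows, a larger stratified hold-out is a common choice. The sketch below assumes a 20% hold-out, which is an illustrative value and not taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 50 rows with an imbalanced three-class label (codes are illustrative).
X_toy = pd.DataFrame({"feature": range(50)})
y_toy = pd.Series([0] * 30 + [1] * 15 + [2] * 5, name="status_group")

# Hold out 20% of the rows and keep the class proportions similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_val))
print(y_val.value_counts().sort_index().to_dict())
```

Stratifying on the label ensures that the rarest class still appears in the validation part, which matters for a macro-averaged metric.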

# **Model Selection**


In [1123]:
# !pip install catboost

**CatBoost Classifier**

In [1124]:
# from catboost import CatBoostClassifier
# model= CatBoostClassifier(
#                          learning_rate = 0.39730054363848666,
#         # n_estimators=1000,
#         subsample=0.075,
#         max_depth=5,
#         l2_leaf_reg = 40,
#         verbose=100,
#         bootstrap_type="Bernoulli"
#         # auto_class_weights="SqrtBalanced",
#         # loss_function='MultiClass'
#         )

**XGBoost Classifier** 

In [1125]:
# from xgboost import XGBClassifier
# model = XGBClassifier(nthread=2, num_class=3, 
#                         min_child_weight=3, max_depth=15,
#                         gamma=0.5, scale_pos_weight=0.8,
#                         subsample=0.7, colsample_bytree = 0.8,
#                         objective='multi:softmax')

**RandomForest Classifier**

In [1126]:
# from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
# model = RandomForestClassifier(max_depth=25,
#                                n_estimators = 42*5, 
#                                criterion = 'entropy',
#                                random_state = 0)
    

**Train model**

In [1127]:
# model.fit(X_train, y_train)
# print(model.score(X_test, y_test))


Find the F1 score

In [1128]:
# train_pred = model.predict(X_test)
# f1_score(y_test, train_pred, average='macro')
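Macro-averaged F1 computes one F1 score per class and then takes their unweighted mean, so a rare class counts as much as a frequent one. The pure-Python sketch below spells out that arithmetic on toy labels. It is intended to mirror what `average='macro'` does, not to replace `f1_score`, and the label values are illustrative.

```python
def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class with no hits scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels: class 2 is rare and never predicted.
y_true_toy = [0, 0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 0, 0, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
print(round(accuracy, 3))                          # 5 of 7 rows are correct
print(round(macro_f1(y_true_toy, y_pred_toy), 3))  # mean of 0.8, 2/3 and 0
```

The per-class scores here are 0.8, 2/3 and 0, so the macro F1 is noticeably lower than the accuracy because the missed rare class pulls the mean down.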

In [None]:

# print(classification_report(y_test, train_pred))

# **Prediction of labels for given test values**

In [None]:
# id = Xtest['id']
# Xtest = Xtest.drop(columns='id')

In [None]:
# ytest = model.predict(Xtest)

In [None]:
# status_group_encoder = encoders['status_group']

In [None]:
# decoded_y = pd.DataFrame(status_group_encoder.inverse_transform(ytest), columns = ['status_group'])

In [None]:
# result = pd.concat([id, decoded_y], axis=1)

In [None]:
# result.to_csv("submission.csv", index=False)