# Data Cleaning
This notebook contains the code used to clean the data.

In [117]:
import pandas as pd
import numpy as np

import pickle

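# Use the full option name; the bare 'precision' pattern is ambiguous in recent pandas versions
pd.set_option('display.precision', 4)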
pd.options.display.max_seq_items = None

In [3]:
Y = pd.read_csv('DATA/TRAINING_LABELS.csv')
df = pd.read_csv('DATA/TRAINING_VALUES.csv')

In [4]:
df.shape

(59400, 40)

## Duplicates?
Check whether there are any duplicate ids.

In [12]:
#df[df.duplicated('id')]
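As a minimal sketch of this check, the toy frame below stands in for `df`. Only the `id` column name comes from the dataset; the `value` column and all values are made up for illustration. Passing `keep=False` flags every row that shares an `id`, so both copies of a duplicate are visible.

```python
import pandas as pd

# Toy frame for illustration only; 'value' is a hypothetical column
toy = pd.DataFrame({'id': [1, 2, 2, 3], 'value': [0.0, 50.0, 50.0, 10.0]})

# keep=False marks every member of a duplicated group, not just the later copies
dup_rows = toy[toy.duplicated('id', keep=False)]

# Number of rows that repeat an id already seen earlier in the frame
n_dup_ids = toy['id'].duplicated().sum()

print(dup_rows)
print(n_dup_ids)
```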

## Target variable
First, let's look at the target variable.

In [14]:
Y.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

It seems we have some class imbalance: 'functional needs repair' is much rarer than the other two classes. I'll merge Y into df for now to make EDA easier.
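To quantify the imbalance, the counts printed above can be turned into class shares. This is a small sketch that re-enters the counts by hand; on the real data, `Y.status_group.value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series(
    {'functional': 32259, 'non functional': 22824, 'functional needs repair': 4317},
    name='status_group',
)

# Convert counts to shares of the whole training set
shares = counts / counts.sum()
print(shares.round(3))
```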


In [19]:
df = df.merge(Y, on = 'id')
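One optional safeguard for this kind of merge is the `validate` argument of `DataFrame.merge`, which raises an error if the key is not unique on both sides. The sketch below uses toy stand-ins for `df` and `Y`; the frame names and values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for df and Y (hypothetical values)
values = pd.DataFrame({'id': [10, 11, 12], 'region': ['A', 'B', 'A']})
labels = pd.DataFrame({
    'id': [10, 11, 12],
    'status_group': ['functional', 'non functional', 'functional'],
})

# validate='one_to_one' raises MergeError if either side has repeated ids
merged = values.merge(labels, on='id', validate='one_to_one')

# The default inner join silently drops unmatched ids, so compare row counts
print(len(values), len(merged))
```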

## Missing values
Let's deal with all the missing values.

In [94]:
#df.isnull().sum()

The following columns have missing values: funder, installer, subvillage (location), public_meeting (T/F), scheme_management (operator), scheme_name (operator), and permit (T/F).
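A compact way to produce such a list is to keep only the columns whose null count is above zero. The sketch below runs on a toy frame with made-up values; on the real data, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of three columns (values are hypothetical)
toy = pd.DataFrame({
    'funder': ['Gov', None, 'Gov', None],
    'permit': [True, np.nan, False, True],
    'region': ['A', 'B', 'A', 'B'],
})

null_counts = toy.isnull().sum()

# Keep only columns that actually have gaps, largest first
missing = null_counts[null_counts > 0].sort_values(ascending=False)
missing_pct = (missing / len(toy) * 100).round(1)

print(pd.DataFrame({'n_missing': missing, 'pct_missing': missing_pct}))
```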

### Funder  & Installer

In [31]:
df.funder.isnull().sum()

3635

In [100]:
#df.funder.value_counts()

A substantial number of values are missing, even relative to the largest categories. I will create an 'Unknown' category to hold all missing funder values, and do the same for installer.

In [104]:
df['funder'] = df.funder.fillna('Unknown')
df['installer'] = df.installer.fillna('Unknown')
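The comparison behind this choice can be sketched with `value_counts(dropna=False)`, which lists missing values alongside the named categories so they can be ranked together. The series below is a toy example; the funder names are hypothetical.

```python
import pandas as pd

# Toy column; funder names are hypothetical
funder = pd.Series(['Gov', None, 'Gov', 'NGO', None, None])

# dropna=False keeps NaN as its own row so it can be ranked against real categories
with_missing = funder.value_counts(dropna=False)
print(with_missing)

# After filling, the gaps form an explicit 'Unknown' category
filled = funder.fillna('Unknown')
print(filled.value_counts())
```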

### Scheme Management & Name
I'll do the same thing (an 'Unknown' category) for scheme management. For scheme name, I'll impute 'None', a string that already exists in the column.

In [101]:
df['scheme_management'] = df.scheme_management.fillna('Unknown')
df['scheme_name'] = df.scheme_name.fillna('None')

### Subvillage

In [35]:
df.subvillage.isnull().sum()

371

In [37]:
#df.subvillage.value_counts()

Since region has no missing values, I will impute each missing subvillage with the most frequent subvillage in its region.

In [68]:
freq_subvil = df.groupby(['region']).subvillage.apply(lambda x: x.value_counts().index[0])

In [77]:
df['subvillage'] = np.where(df.subvillage.isnull(), 
                            freq_subvil[df.region], 
                            df.subvillage)
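The same group-wise imputation can also be written with `map` and `fillna`, which aligns the fill values on the row index instead of relying on positional order. This is an alternative sketch on a toy frame with made-up region and subvillage names, not the code used above.

```python
import pandas as pd

# Toy frame: one missing subvillage in each region (values are hypothetical)
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'subvillage': ['x', 'x', None, 'y', None, 'y'],
})

# Most frequent subvillage per region (value_counts ignores NaN by default)
freq = toy.groupby('region').subvillage.agg(lambda s: s.value_counts().index[0])

# map() looks up each row's region, so the fill values share toy's index
toy['subvillage'] = toy.subvillage.fillna(toy.region.map(freq))
print(toy)
```

Note that `index[0]` raises an `IndexError` for a region whose subvillages are all missing, so that case would need separate handling.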

### Public Meeting & Permit
Public meeting and permit are boolean columns. If the classes are highly imbalanced, I'll impute the more frequent class; if not, I'll pick a value at random.

In [80]:
#df.public_meeting.value_counts()

In [79]:
df['public_meeting'] = df.public_meeting.fillna(True)

In [84]:
#df.permit.value_counts()

In [87]:
rand_choice = np.random.choice([True, False], df.permit.isnull().sum())

In [91]:
# Fill missing permits with random True/False draws; mask() only uses the draws at null positions
df['permit'] = df.permit.mask(df.permit.isnull(), np.random.choice([True, False], size=len(df)))

In [114]:
df['permit'] = df.permit.astype('bool')
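Because the fill is random, the result changes on every run unless the generator is seeded. A minimal sketch of a reproducible variant follows, assuming NumPy's `default_rng`; the seed value and the toy series are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy permit column with gaps (values are hypothetical)
permit = pd.Series([True, None, False, None, True, None], dtype='object')

rng = np.random.default_rng(42)  # seed chosen arbitrarily for the example
is_null = permit.isnull()

# Draw exactly one value per missing entry
draws = rng.choice([True, False], size=is_null.sum())

filled = permit.copy()
filled[is_null] = draws

# Cast only after filling, so no missing value reaches the bool conversion
filled = filled.astype(bool)
print(filled)
```

Filling before the cast matters: `bool(float('nan'))` is `True` in Python, so casting a column that still holds NaN would quietly turn the gaps into `True`.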

## Pickle

I'll pickle the dataframe for use in EDA.

In [119]:
#df.to_pickle('PKL/clean_df.pkl')
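One reason to pickle rather than write a CSV is that column dtypes, such as the boolean `permit`, survive the round trip. The sketch below demonstrates this on a toy frame written to a temporary directory; the file name mirrors the one above but the path is otherwise hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned df
toy = pd.DataFrame({'id': [1, 2], 'permit': [True, False]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'clean_df.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# dtypes are preserved, so permit is still bool after reloading
print(restored.dtypes)
```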