# Data Cleaning
This notebook contains the code used to clean the data.

In [25]:
import pandas as pd
import numpy as np

import pickle

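# Use the fully qualified option name; the bare 'precision' pattern is ambiguous in recent pandas
pd.set_option('display.precision', 4)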
pd.options.display.max_seq_items = None

In [26]:
Y = pd.read_csv('DATA/TRAINING_LABELS.csv')
df = pd.read_csv('DATA/TRAINING_VALUES.csv')

In [27]:
df.shape

(59400, 40)

## Duplicates?
Check whether any `id` appears more than once.

In [28]:
#df[df.duplicated('id')]
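
The check above is commented out, so here is a minimal sketch of what it does, run on a small made-up frame (the `toy` data and its values are hypothetical, not taken from the training set). It counts repeated ids, shows every member of a duplicated group, and drops the repeats.

```python
import pandas as pd

# Toy frame standing in for the training values (hypothetical data).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "region": ["Iringa", "Mara", "Mara", "Pwani"],
})

# Rows whose id has already appeared earlier in the frame.
dup_rows = toy[toy.duplicated("id")]
n_dups = len(dup_rows)

# keep=False marks every member of a duplicated group, which helps when inspecting them.
all_dup_rows = toy[toy.duplicated("id", keep=False)]

# Drop repeats, keeping the first occurrence of each id.
deduped = toy.drop_duplicates("id")
```

On the real data, an empty `dup_rows` would mean no further action is needed.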

## Target variable
First, let's look at the target variable.

In [29]:
Y.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

The classes are imbalanced: `functional needs repair` makes up only about 7% of the labels. For now I'll merge Y into df to make EDA easier.
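
To put a number on the imbalance, the counts printed above can be turned into shares. This sketch rebuilds the counts by hand from that output; on the real data `Y.status_group.value_counts(normalize=True)` should give the same figures directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {
        "functional": 32259,
        "non functional": 22824,
        "functional needs repair": 4317,
    },
    name="status_group",
)

# Share of each class, rounded to four decimals.
shares = (counts / counts.sum()).round(4)
print(shares)
```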


In [30]:
df = df.merge(Y, on = 'id')
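
A merge can silently add or drop rows if `id` is not unique or not shared between the two files. A minimal sketch of a guarded merge on made-up frames (the `values` and `labels` names and their contents are hypothetical); `validate="one_to_one"` makes pandas raise if either side has repeated keys.

```python
import pandas as pd

# Hypothetical stand-ins for the values and labels files.
values = pd.DataFrame({"id": [1, 2, 3], "region": ["Iringa", "Mara", "Pwani"]})
labels = pd.DataFrame({
    "id": [1, 2, 3],
    "status_group": ["functional", "non functional", "functional"],
})

# validate raises a MergeError if ids repeat on either side.
merged = values.merge(labels, on="id", how="inner", validate="one_to_one")

# An inner merge keeps only matching ids, so compare row counts afterwards.
rows_preserved = len(merged) == len(values)
```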

## Missing values
Let's deal with all the missing values.

In [31]:
#df.isnull().sum()

The following columns have missing values: funder, installer, subvillage (location), public_meeting (T/F), scheme_management (operator), scheme_name (operator), and permit (T/F).
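
Since the `isnull().sum()` cell is commented out, here is a small sketch of how a list like the one above can be produced. The `toy` frame is hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete columns and one complete column.
toy = pd.DataFrame({
    "funder": ["Roman", None, "Unicef"],
    "permit": [True, np.nan, False],
    "region": ["Iringa", "Mara", "Pwani"],
})

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```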

### Funder & Installer

In [32]:
df.funder.isnull().sum()

3635

In [33]:
#df.funder.value_counts()

A substantial number of values (3,635) are missing, too many to fold into any of the existing categories. I will create an 'Unknown' category to hold all missing funder and installer values.

In [34]:
df['funder'] = df.funder.fillna('Unknown')
df['installer'] = df.installer.fillna('Unknown')

### Scheme Management & Name
I'll do the same thing (an 'Unknown' category) for scheme management. For scheme name, I'll impute 'None', a string that already exists in the column.

In [35]:
df['scheme_management'] = df.scheme_management.fillna('Unknown')
df['scheme_name'] = df.scheme_name.fillna('None')
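
The four fills above can also be written as a single call, since `DataFrame.fillna` accepts a dict mapping column names to fill values. A sketch on a made-up frame (the `toy` rows are hypothetical; the fill values are the ones used above):

```python
import pandas as pd

# Hypothetical rows with gaps in each of the four columns.
toy = pd.DataFrame({
    "funder": ["Roman", None],
    "installer": [None, "DWE"],
    "scheme_management": ["VWC", None],
    "scheme_name": [None, "Roman"],
})

# One fill value per column, matching the choices made in the cells above.
fill_values = {
    "funder": "Unknown",
    "installer": "Unknown",
    "scheme_management": "Unknown",
    "scheme_name": "None",
}
filled = toy.fillna(value=fill_values)
```

Keeping the mapping in one dict makes it easy to apply the same fills to the test set later.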

### Subvillage

In [36]:
df.subvillage.isnull().sum()

371

In [37]:
#df.subvillage.value_counts()

Region has no missing values, so for each missing subvillage I will impute the most frequent subvillage within the same region.

In [38]:
freq_subvil = df.groupby(['region']).subvillage.apply(lambda x: x.value_counts().index[0])

In [39]:
df['subvillage'] = np.where(df.subvillage.isnull(), 
                            freq_subvil[df.region], 
                            df.subvillage)
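
An equivalent way to write the same imputation is with `map` and `fillna`, which aligns each row with its region's most frequent subvillage by label rather than by position. This is a sketch on hypothetical rows, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical rows: one missing subvillage in each region.
toy = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Iringa", "Mara", "Mara"],
    "subvillage": ["Majengo", "Majengo", None, "Shuleni", None],
})

# Most frequent non-missing subvillage per region (value_counts ignores NaN).
freq = toy.groupby("region")["subvillage"].agg(lambda s: s.value_counts().index[0])

# Look up each row's regional mode and use it only where the value is missing.
toy["subvillage"] = toy["subvillage"].fillna(toy["region"].map(freq))
```

Note that either version fails with an `IndexError` if a region has no non-missing subvillage at all, because `value_counts()` would be empty for that region.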

### Public Meeting & Permit
Public meeting and permit are boolean columns. If the classes are highly imbalanced, I'll impute the more frequent class; if not, I'll fill each missing value with a random choice.

In [40]:
#df.public_meeting.value_counts()

In [41]:
df['public_meeting'] = df.public_meeting.fillna(True)

In [42]:
#df.permit.value_counts()

In [43]:
rand_choice = np.random.choice([True, False], df.permit.isnull().sum())

In [44]:
df['permit'] = df.permit.mask(df.permit.isnull(), np.random.choice([True, False], size=len(df)))

In [45]:
df['permit'] = df.permit.astype('bool')
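
The random fill above gives different results on every run. A sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable here (the seed value 42 and the `toy` frame are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility

# Hypothetical rows: one gap in public_meeting, two gaps in permit.
toy = pd.DataFrame({
    "public_meeting": [True, True, True, None, False],
    "permit": [True, False, None, None, True],
})

# Imbalanced column: fill with the more frequent class.
majority = toy["public_meeting"].mode()[0]
toy["public_meeting"] = toy["public_meeting"].fillna(majority).astype(bool)

# Roughly balanced column: draw one random value per missing entry only.
missing = toy["permit"].isnull()
toy.loc[missing, "permit"] = rng.choice([True, False], size=missing.sum())

# Cast only after filling, so no missing marker is coerced into a boolean.
toy["permit"] = toy["permit"].astype(bool)
```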

## Outliers / Abnormalities

In [53]:
#df.describe()

Some longitude values are 0. Since all records are in Tanzania, these values don't make sense. I'll compute the mean latitude and longitude of each region from the valid rows and use them to fill in the bad ones.

In [47]:
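# Keep only rows with a plausible longitude before averaging
tmp = df[df.longitude > 5]
# Select both columns with a list; tuple-style selection after groupby fails in recent pandas
avg_lat_long = tmp.groupby('region')[['latitude', 'longitude']].mean()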

In [48]:
df['latitude'] = np.where(df.longitude < 5, 
         avg_lat_long['latitude'][df.region], df.latitude)
df['longitude'] = np.where(df.longitude < 5, 
         avg_lat_long['longitude'][df.region], df.longitude)
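
The same idea can be expressed by blanking out the placeholder coordinates and filling them with a per-region mean via `groupby(...).transform`. This is a sketch on hypothetical coordinates; the threshold of 5 is the one used in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the third record has placeholder coordinates.
toy = pd.DataFrame({
    "region": ["Mwanza", "Mwanza", "Mwanza", "Iringa", "Iringa"],
    "latitude": [-2.5, -2.7, 0.0, -8.0, -8.4],
    "longitude": [33.0, 32.8, 0.0, 35.0, 35.4],
})

bad = toy["longitude"] < 5  # same threshold as the cells above

# Blank out both coordinates on placeholder rows so they do not affect the means.
coords = toy[["latitude", "longitude"]].copy()
coords.loc[bad, :] = np.nan

# Per-region mean of the valid rows, broadcast back to every row.
region_means = coords.groupby(toy["region"]).transform("mean")

# Fill only the blanked-out cells.
toy[["latitude", "longitude"]] = coords.fillna(region_means)
```

A region whose rows are all placeholders would still end up with NaN coordinates, so it is worth checking `toy.isnull().sum()` afterwards.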

Construction year has 0 values. This does not make sense.

In [68]:
len(df[df.construction_year == 0])/len(df)

0.34863636363636363

In fact, almost 35% of the rows have no construction year. Imputing such a large share of the data wouldn't make sense, but it's also too much data to simply drop. I will keep the 0s as they are and take this into account during feature selection.

In [69]:
len(df[df.gps_height == 0]) /len(df)

0.3440740740740741

gps_height shows a similar pattern: about 34% of its values are 0. We'll address this later, during feature selection.
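
The two zero rates above are close (about 35% and 34%), which raises the question of whether the same rows are affected. A sketch of how that could be checked, on hypothetical values:

```python
import pandas as pd

# Hypothetical values for the two columns discussed above.
toy = pd.DataFrame({
    "construction_year": [0, 1999, 0, 2008],
    "gps_height": [0, 1390, 0, 0],
})

# Share of zeros per column.
zero_share = (toy == 0).mean()

# Share of rows where both columns are zero at the same time.
both_zero = ((toy["construction_year"] == 0) & (toy["gps_height"] == 0)).mean()
```

If `both_zero` on the real data is close to the individual rates, the zeros likely come from the same incomplete records.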

## Pickle

I'll pickle the dataframe so it can be used for EDA.

In [27]:
#df.to_pickle('PKL/clean_df.pkl')
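
A quick round trip confirms that the pickle preserves values and dtypes (for example, the boolean `permit` column). This sketch writes to a temporary directory rather than `PKL/`, and the `toy` frame is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame.
toy = pd.DataFrame({"id": [1, 2], "permit": [True, False]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_df.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes.
same = restored.equals(toy)
```

Note that `to_pickle` does not create missing directories, so `PKL/` has to exist before the real call is run.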