# Data Cleaning
This notebook contains the code to clean the data.

In [78]:
import pandas as pd
import numpy as np

import pickle

pd.set_option('display.precision', 4)
pd.options.display.max_seq_items = None

In [79]:
Y = pd.read_csv('DATA/TRAINING_LABELS.csv')
df = pd.read_csv('DATA/TRAINING_VALUES.csv')

In [80]:
df.shape

(59400, 40)

## Duplicates?
Check whether there are any duplicate ids.

In [4]:
#df[df.duplicated('id')]
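As a small sketch of what this check does, here is the same idea on a toy frame. The toy data is made up; only the `id` column name comes from the real data.

```python
import pandas as pd

# Toy frame standing in for the training values (values are made up)
toy = pd.DataFrame({'id': [1, 2, 2, 3], 'funder': ['A', 'B', 'B', 'C']})

# Rows whose id has already appeared earlier in the frame
dup_rows = toy[toy.duplicated('id')]
n_dups = len(dup_rows)

# Quick yes/no answer: are all ids distinct?
ids_unique = toy['id'].is_unique
```

An empty `dup_rows` (or `ids_unique` being `True`) would mean there is nothing to drop.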

## Target variable
First, let's look at the target variable.

In [5]:
Y.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

It seems like we have a bit of class imbalance. I'll merge Y into df for now to make EDA easier.
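To put a number on the imbalance, the counts can be expressed as shares. This is a minimal sketch on toy labels; the class names match `status_group`, but the counts are made up.

```python
import pandas as pd

# Toy labels with the same three classes as status_group (counts are made up)
labels = pd.Series(
    ['functional'] * 5 + ['non functional'] * 4 + ['functional needs repair'],
    name='status_group',
)

# normalize=True returns class shares instead of raw counts
shares = labels.value_counts(normalize=True)
```

On the real data the equivalent call would be `Y.status_group.value_counts(normalize=True)`.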


In [6]:
df = df.merge(Y, on = 'id')
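The merge above relies on every id appearing exactly once on each side. A hedged sketch of how that assumption could be checked, using toy frames with made-up values:

```python
import pandas as pd

# Toy stand-ins for the values and labels tables (values are made up)
values = pd.DataFrame({'id': [1, 2, 3], 'funder': ['A', 'B', 'C']})
labels = pd.DataFrame({
    'id': [3, 1, 2],
    'status_group': ['functional', 'non functional', 'functional'],
})

# validate='one_to_one' raises MergeError if an id repeats on either side
merged = values.merge(labels, on='id', how='inner', validate='one_to_one')

# An inner merge silently drops unmatched ids, so compare row counts
rows_preserved = len(merged) == len(values)
```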

## Missing values
Let's deal with all the missing values.

In [7]:
#df.isnull().sum()

funder, installer, subvillage (location), public_meeting (T/F), scheme_management (operator), scheme_name (operator), and permit (T/F) have missing values.
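A list like the one above can be pulled out directly instead of being read off the full `isnull().sum()` output. A sketch on a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame: two columns with gaps, one complete column (values are made up)
toy = pd.DataFrame({
    'funder': ['A', None, 'B'],
    'permit': [True, np.nan, np.nan],
    'region': ['Iringa', 'Mara', 'Manyara'],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
```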

### Funder  & Installer

In [8]:
df.funder.isnull().sum()

3635

In [9]:
#df.funder.value_counts()

A substantial number of funder values are missing, comparable to the largest categories. I will create an 'Unknown' category to hold all missing funder values, and do the same for installer.

In [10]:
df['funder'] = df.funder.fillna('Unknown')
df['installer'] = df.installer.fillna('Unknown')

Funder has a lot of typos and spelling variants. We'll consolidate these.
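The consolidation itself is not shown in this notebook. One possible first step, sketched here under my own assumptions, is to normalize case and whitespace and then apply a small manual mapping. The example values and the `manual_fixes` dictionary are hypothetical, not taken from the data.

```python
import pandas as pd

# Made-up spelling variants of the same funder
funder = pd.Series(['Government', 'government ', 'GOVT', 'Unknown'])

# Trim whitespace and lowercase so case/spacing variants collapse together
normalized = funder.str.strip().str.lower()

# Hypothetical manual mapping for variants that normalization cannot catch
manual_fixes = {'govt': 'government'}
consolidated = normalized.replace(manual_fixes)

n_unique = consolidated.nunique()
```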

### Scheme Management & Name
I'll do the same thing (an 'Unknown' category) for scheme management. For scheme name, I'll impute the string 'None', which already exists as a value in the column.

In [11]:
df['scheme_management'] = df.scheme_management.fillna('Unknown')
df['scheme_name'] = df.scheme_name.fillna('None')

### Subvillage

In [12]:
df.subvillage.isnull().sum()

371

In [13]:
#df.subvillage.value_counts()

Since region has no missing values, I will impute each missing subvillage with the most frequent subvillage in its region.

In [81]:
freq_subvil = df.groupby(['region']).subvillage.apply(lambda x: x.value_counts().index[0])

In [83]:
dict(freq_subvil)

{'Arusha': 'Madukani',
 'Dar es Salaam': 'Mtaa Wa Kitunda Kati',
 'Dodoma': 'Kawawa',
 'Iringa': 'M',
 'Kagera': 'Bunukangoma',
 'Kigoma': 'Majengo',
 'Kilimanjaro': 'Majengo',
 'Lindi': 'Shuleni',
 'Manyara': 'Madukani',
 'Mara': 'Senta',
 'Mbeya': 'Katumba',
 'Morogoro': 'Shuleni',
 'Mtwara': 'Majengo',
 'Mwanza': '1',
 'Pwani': 'Vikuge',
 'Rukwa': 'Majengo',
 'Ruvuma': 'Muungano',
 'Shinyanga': 'Madukani',
 'Singida': 'Madukani',
 'Tabora': 'Majengo',
 'Tanga': 'Sokoni'}

In [15]:
df['subvillage'] = np.where(df.subvillage.isnull(), 
                            freq_subvil[df.region], 
                            df.subvillage)
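The same imputation can also be written with `map` and `fillna`, which makes the per-row lookup explicit. This is a sketch on a toy frame with made-up rows, not a replacement for the cell above.

```python
import pandas as pd

# Toy frame with one missing subvillage per region (rows are made up)
toy = pd.DataFrame({
    'region': ['Arusha', 'Arusha', 'Arusha', 'Tanga', 'Tanga'],
    'subvillage': ['Madukani', 'Madukani', None, 'Sokoni', None],
})

# Most frequent subvillage per region (value_counts ignores NaN by default)
freq_subvil = toy.groupby('region').subvillage.apply(
    lambda x: x.value_counts().index[0]
)

# Look up each row's regional mode, then use it only where subvillage is missing
toy['subvillage'] = toy.subvillage.fillna(toy.region.map(freq_subvil))
```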

### Public Meeting & Permit
Public meeting and permit are boolean values. If the classes are highly imbalanced, I'll impute the more frequent class; if not, I'll pick a value at random.

In [86]:
#df.public_meeting.value_counts()

In [17]:
df['public_meeting'] = df.public_meeting.fillna(True)

In [88]:
#df.permit.value_counts()

In [20]:
# Replace missing permits with a random True/False draw; observed values are kept
df['permit'] = df.permit.mask(df.permit.isnull(), np.random.choice([True, False], size=len(df)))

In [21]:
df['permit'] = df.permit.astype('bool')
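The random fill above gives different results on every run. A sketch of a reproducible variant, assuming a seeded NumPy generator is acceptable (the seed value and the toy series are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy permit column with two missing entries (values are made up)
permit = pd.Series([True, None, False, None, True], dtype='object')

# Seeded generator so the random fill is the same on every run (seed is arbitrary)
rng = np.random.default_rng(42)
random_fill = pd.Series(
    rng.choice([True, False], size=len(permit)), index=permit.index
)

# Keep observed values; use the random draw only where permit is missing
permit_filled = permit.where(permit.notnull(), random_fill).astype('bool')
```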

## Outliers / Abnormalities

In [22]:
#df.describe()

Longitude has 0 values. Since all the records are in Tanzania, these values don't make sense. I'll compute the mean latitude and longitude of each region from the valid rows and use them to fill in the bad rows.

In [23]:
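# Keep only rows with a plausible longitude (the bad rows have longitude 0)
tmp = df[df.longitude > 5]
# Select multiple columns with a list; the bare tuple form is no longer supported
avg_lat_long = tmp.groupby('region')[['latitude', 'longitude']].mean()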

In [93]:
#pd.to_pickle(avg_lat_long, 'PKL/avg_lat_long.pkl')

In [24]:
df['latitude'] = np.where(df.longitude < 5, 
         avg_lat_long['latitude'][df.region], df.latitude)
df['longitude'] = np.where(df.longitude < 5, 
         avg_lat_long['longitude'][df.region], df.longitude)
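Note that the cell above fills latitude before longitude, because both steps depend on the original `longitude < 5` test. A sketch of an order-independent version that stores the mask once, shown on a toy frame with made-up coordinates:

```python
import pandas as pd

# Toy frame: the third row has a missing location recorded as 0 (values are made up)
toy = pd.DataFrame({
    'region': ['Mwanza', 'Mwanza', 'Mwanza', 'Tanga', 'Tanga'],
    'latitude': [-2.0, -3.0, 0.0, -5.0, -5.5],
    'longitude': [33.0, 32.0, 0.0, 39.0, 38.0],
})

# Flag the bad rows once, before either column is changed
bad = toy.longitude < 5

# Regional means computed from the valid rows only
avg_lat_long = toy[~bad].groupby('region')[['latitude', 'longitude']].mean()

# Fill each coordinate from its region's mean, only on the flagged rows
for col in ['latitude', 'longitude']:
    toy.loc[bad, col] = toy.loc[bad, 'region'].map(avg_lat_long[col])
```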

Construction year has 0 values. This does not make sense.

In [25]:
len(df[df.construction_year == 0])/len(df)

0.34863636363636363

In fact, almost 35% of the rows have no construction year. It doesn't make sense to impute such a large share of the data, but it's also too much to simply drop. I will keep the 0s as they are and take this into account during feature selection.

In [26]:
len(df[df.gps_height == 0]) /len(df)

0.3440740740740741

It seems like gps height shows a similar pattern, with about 34% zeros. We'll address this later during feature selection.
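Since the two zero shares are so close, it may be worth checking how much they overlap. A minimal sketch on a toy frame with made-up values; I have not run this check on the real data here.

```python
import pandas as pd

# Toy frame with zeros in both columns (values are made up)
toy = pd.DataFrame({
    'construction_year': [0, 1999, 0, 2005],
    'gps_height': [0, 1390, 0, 0],
})

# The mean of a boolean mask is the fraction of True values
zero_share = (toy == 0).mean()

# Fraction of rows where both columns are zero at once
both_zero = ((toy.construction_year == 0) & (toy.gps_height == 0)).mean()
```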

## Pickle

I'll pickle the dataframe to use for EDA.

In [27]:
#df.to_pickle('PKL/clean_df.pkl')
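Pickling keeps the column dtypes (for example the boolean `permit`), so the EDA notebook gets the frame back as it was saved. A small round-trip sketch, using a temporary directory and a toy frame rather than the real `PKL/` folder:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned data (values are made up)
toy = pd.DataFrame({'id': [1, 2], 'permit': [True, False]})

# Write to a throwaway location and read it straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'clean_df.pkl')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes
same = restored.equals(toy)
```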