# Data Cleaning
This notebook looks at the data as a whole to locate abnormalities and resolve them.

In [1]:
import pandas as pd
import numpy as np

import pickle

pd.set_option('display.precision', 4)
pd.options.display.max_seq_items = None

In [2]:
Y = pd.read_csv('DATA/TRAINING_LABELS.csv')
df = pd.read_csv('DATA/TRAINING_VALUES.csv')

In [3]:
df.shape

(59400, 40)

## Duplicates?
Check whether there are any duplicate rows, based on `id`.

In [4]:
#df[df.duplicated('id')]
# no duplicates found
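
The check above is commented out, so here is a minimal sketch of how it could be run. The `toy` frame and its values are made up for illustration and are not the competition data; `duplicated` and its `keep` argument are standard pandas.

```python
import pandas as pd

# Hypothetical toy frame with one repeated id (not the real data).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "amount_tsh": [0.0, 50.0, 50.0, 10.0],
})

# keep=False marks every member of a duplicated group, so all copies are visible.
dup_rows = toy[toy.duplicated("id", keep=False)]

# The default keep='first' counts only the extra copies.
n_extra = int(toy["id"].duplicated().sum())

print(dup_rows)
print(n_extra)
```

On the real `df`, an empty `dup_rows` would confirm the "no duplicates found" note.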

## Target variable
First, let's look at the target variable.

In [5]:
Y.status_group.value_counts()

functional                 32259
non functional             22824
functional needs repair     4317
Name: status_group, dtype: int64

It seems we have some class imbalance: 'functional needs repair' is much rarer than the other two classes. We'll merge the target variable into df for now to make EDA easier.


In [6]:
df = df.merge(Y, on = 'id')

## Missing values
Let's check if we have any missing values.

In [7]:
#df.isnull().sum()

funder, installer, subvillage (location), public_meeting (T/F), scheme_management (operator), scheme_name (operator), and permit (T/F) have missing values.
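
Since the `isnull().sum()` call is commented out, the following sketch shows one way to get a compact table of only the columns with gaps. The `toy` frame is a made-up stand-in, and `missing_summary` is a name chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real training values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "funder": ["Gov", None, "Unicef", None],
    "permit": [True, np.nan, False, True],
    "region": ["Iringa", "Mara", "Mara", "Iringa"],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Report both the raw count and the share of rows affected.
missing_summary = pd.DataFrame({
    "n_missing": missing,
    "share": missing / len(toy),
})
print(missing_summary)
```

Reporting the share alongside the count makes it easier to judge whether imputing a column is reasonable.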

### Funder & Installer


In [8]:
# check the number of missing funder values
df.funder.isnull().sum()

3635

A substantial number of funder values are missing (3,635 of 59,400 rows, about 6%). We will create a category called 'Unknown' to hold all missing funder and installer values.

In [9]:
df['funder'] = df.funder.fillna('Unknown')
df['installer'] = df.installer.fillna('Unknown')

### Scheme Management & Name
We'll do the same thing (an 'Unknown' category) for scheme management. For scheme name, we'll impute the string 'None', which already exists as a value in that column.

In [10]:
df['scheme_management'] = df.scheme_management.fillna('Unknown')
df['scheme_name'] = df.scheme_name.fillna('None')

### Subvillage

In [11]:
df.subvillage.isnull().sum()

371

Since the region column has no missing values, we will fill each missing subvillage with the most frequent subvillage in its region.

In [12]:
freq_subvil = df.groupby(['region']).subvillage.apply(lambda x: x.value_counts().index[0])

In [13]:
df['subvillage'] = np.where(df.subvillage.isnull(), 
                            freq_subvil[df.region], 
                            df.subvillage)
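
As an alternative to the `np.where` lookup, the same per-region mode imputation can be sketched with `map` and `fillna`, which align on the row index. The `toy` frame and the name `mode_by_region` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame (not the real data).
toy = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Iringa", "Mara", "Mara", "Mara"],
    "subvillage": ["A", "A", None, "B", None, "B"],
})

# Most frequent non-missing subvillage per region
# (value_counts ignores NaN and sorts by count, descending).
mode_by_region = toy.groupby("region")["subvillage"].agg(
    lambda s: s.value_counts().index[0]
)

# Look up each row's regional mode and use it only where the value is missing.
toy["subvillage"] = toy["subvillage"].fillna(toy["region"].map(mode_by_region))
print(toy)
```

This assumes every region has at least one non-missing subvillage; otherwise `index[0]` would raise an `IndexError`.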

### Public Meeting & Permit
Public meeting and permit are boolean columns. If the classes are highly imbalanced, we'll impute the more frequent class. If not, we'll fill each missing value with a random choice between True and False.

In [14]:
df['public_meeting'] = df.public_meeting.fillna(True)

In [15]:
df['permit'] = df.permit.mask(df.permit.isnull(), np.random.choice([True, False], size=len(df)))

In [16]:
df['permit'] = df.permit.astype('bool')
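
The rule described above can be sketched as a single helper. `impute_bool`, its `threshold` of 0.8, and the fixed `seed` are assumptions introduced here for illustration; the notebook does not state a cut-off for "highly imbalanced" and does not seed its random draw.

```python
import numpy as np
import pandas as pd

def impute_bool(col, threshold=0.8, seed=0):
    """Fill missing booleans: majority class if it dominates, else a seeded random draw.

    `threshold` (assumed value) is the minimum share of the majority class
    for the column to count as highly imbalanced.
    """
    share_true = col.dropna().astype(bool).mean()
    missing = col.isnull()
    out = col.copy()
    if max(share_true, 1 - share_true) >= threshold:
        # Highly imbalanced: use the more frequent class.
        out[missing] = bool(share_true >= 0.5)
    else:
        # Roughly balanced: draw True/False at random, reproducibly.
        rng = np.random.default_rng(seed)
        out[missing] = rng.choice([True, False], size=int(missing.sum()))
    return out.astype(bool)

# Hypothetical toy columns (not the real data).
skewed = pd.Series([True] * 9 + [None])
balanced = pd.Series([True, False, True, False, None, None])

filled_skewed = impute_bool(skewed)
filled_balanced = impute_bool(balanced)
print(filled_skewed.tolist())
print(filled_balanced.tolist())
```

Seeding the generator keeps the imputed values stable between runs, which matters if the cleaned frame is pickled and compared later.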

## Outliers / Abnormalities

In [17]:
#df.describe()

Longitude has 0 values. Since all of the water points are in Tanzania, these values don't make sense. I'll compute the mean latitude and longitude of each region from the valid rows and use them to fill in the invalid ones.

In [18]:
# use only rows with a plausible longitude to compute the regional means
tmp = df[df.longitude > 5]
avg_lat_long = tmp.groupby('region')[['latitude', 'longitude']].mean()

In [19]:
#pd.to_pickle(avg_lat_long, 'PKL/avg_lat_long.pkl')

In [20]:
df['latitude'] = np.where(df.longitude < 5, 
         avg_lat_long['latitude'][df.region], df.latitude)
df['longitude'] = np.where(df.longitude < 5, 
         avg_lat_long['longitude'][df.region], df.longitude)
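
A variant of the same idea, sketched on made-up data: mark the invalid coordinates as NaN, then fill them with the per-region mean via `groupby(...).transform`. The `toy` frame and the names `bad`, `coords`, and `region_means` are illustrative; only the `longitude < 5` cut-off is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the real data); longitude 0 marks a bad reading.
toy = pd.DataFrame({
    "region": ["Iringa", "Iringa", "Iringa", "Mara", "Mara"],
    "latitude": [-8.0, -9.0, -2e-08, -1.5, -2.5],
    "longitude": [35.0, 36.0, 0.0, 34.0, 0.0],
})

bad = toy["longitude"] < 5  # same cut-off as in the notebook

# Blank out both coordinates on the bad rows so they don't affect the means.
coords = toy[["latitude", "longitude"]].copy()
coords.loc[bad, ["latitude", "longitude"]] = np.nan

# Per-region means, broadcast back to one value per row (NaN is skipped).
region_means = coords.groupby(toy["region"]).transform("mean")

# Fill only the blanked-out cells.
toy[["latitude", "longitude"]] = coords.fillna(region_means)
print(toy)
```

Computing the mask once, before either column is changed, avoids any dependence on the order in which latitude and longitude are updated.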

Construction year also has 0 values, which cannot be real years.

In [21]:
len(df[df.construction_year == 0])/len(df)

0.34863636363636363

In fact, almost 35% of the rows have no construction year. It doesn't make sense to impute such a large share of the data, but it is also too much to simply drop. We will keep the 0 values as they are and take them into consideration during EDA.

In [22]:
len(df[df.gps_height == 0])/len(df)

0.3440740740740741

It seems gps_height shows a similar pattern: about 34% of the rows have a value of 0. We'll address this later in the EDA.
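
Similar shares do not by themselves show that the same rows are affected. A quick way to check the overlap, sketched here on made-up data, is a cross-tabulation of the two zero flags. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame (not the real data).
toy = pd.DataFrame({
    "construction_year": [0, 0, 1999, 2005, 0, 2010],
    "gps_height": [0, 0, 1200, 0, 15, 300],
})

zero_year = (toy["construction_year"] == 0).rename("zero_year")
zero_height = (toy["gps_height"] == 0).rename("zero_height")

# Rows: is construction_year zero? Columns: is gps_height zero?
overlap = pd.crosstab(zero_year, zero_height)

# Share of rows where both columns are zero.
share_both = (zero_year & zero_height).mean()
print(overlap)
print(share_both)
```

A large count in the True/True cell on the real `df` would suggest the zeros come from the same records rather than from two independent problems.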

## Pickling

Exporting the dataframe to use for EDA.

In [24]:
df.to_pickle('PKL/clean_df.pkl')