# Data Cleaning

#### To-do:

- drop irrelevant columns
- deal with nulls/missing data
- string cleaning of names
- explore combining target classes down to two classes
- **REMEMBER TO APPLY THE SAME DATA CLEANING TO THE TEST SET AS WELL**


In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from skrub import SimilarityEncoder

In [3]:
# Importing the custom functions
import sys
import os 

sys.path.append(os.path.abspath('../src'))

from utils import *

In [4]:
df_train_raw = pd.read_csv('../data/raw/raw_training_set_full.csv')
df_test_raw = pd.read_csv('../data/raw/test_set_values.csv')

## Drop Unneeded Columns

During EDA, I reviewed every feature in the data and made an initial decision on which to keep and which to drop. Those decisions drive this part of the data cleaning.

In [5]:
df_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 41 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     59400 non-null  int64  
 1   amount_tsh             59400 non-null  float64
 2   date_recorded          59400 non-null  object 
 3   funder                 55763 non-null  object 
 4   gps_height             59400 non-null  int64  
 5   installer              55745 non-null  object 
 6   longitude              59400 non-null  float64
 7   latitude               59400 non-null  float64
 8   wpt_name               59398 non-null  object 
 9   num_private            59400 non-null  int64  
 10  basin                  59400 non-null  object 
 11  subvillage             59029 non-null  object 
 12  region                 59400 non-null  object 
 13  region_code            59400 non-null  int64  
 14  district_code          59400 non-null  int64  
 15  lg

In [6]:
# Dropping the columns that are not needed based on my EDA.
# 'amount_tsh' is not dropped here: it still appears in the 23-column info() output that follows.
cols_to_drop = ['target', 'id', 'wpt_name', 'num_private', 'subvillage', 'ward', 'recorded_by', 'scheme_name',
                'scheme_management', 'water_quality', 'waterpoint_type_group', 'quantity_group', 'region_code',
                'extraction_type', 'extraction_type_group', 'payment', 'source_class', 'source_type']
df_train = df_train_raw.drop(columns=cols_to_drop)
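
Since the to-do list calls for the same cleaning on the test set, here is a minimal sketch of how the drop could be wrapped in a reusable function. The name `drop_unneeded`, the shortened column list, and the toy frames are all illustrative assumptions, not code from this notebook. It also assumes the test set has no `target` column, which is why `errors='ignore'` is used.

```python
import pandas as pd

# Hypothetical, shortened drop list for illustration only
COLS_TO_DROP = ['target', 'id', 'wpt_name', 'num_private']

def drop_unneeded(df, cols=COLS_TO_DROP):
    """Return a copy of df without the listed columns, skipping any that are absent."""
    return df.drop(columns=cols, errors='ignore')

# Toy frames standing in for df_train_raw and df_test_raw
toy_train = pd.DataFrame({
    'id': [1, 2],
    'wpt_name': ['a', 'b'],
    'num_private': [0, 0],
    'basin': ['x', 'y'],
    'target': ['class_a', 'class_b'],
})
toy_test = toy_train.drop(columns='target')  # assumed: test set carries no label

train_clean = drop_unneeded(toy_train)
test_clean = drop_unneeded(toy_test)

# Both frames should end up with identical feature columns
print(list(train_clean.columns), list(test_clean.columns))
```

Keeping the column list in one place means a later change to the drop list reaches both sets without a second edit.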

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59400 entries, 0 to 59399
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   amount_tsh             59400 non-null  float64
 1   date_recorded          59400 non-null  object 
 2   funder                 55763 non-null  object 
 3   gps_height             59400 non-null  int64  
 4   installer              55745 non-null  object 
 5   longitude              59400 non-null  float64
 6   latitude               59400 non-null  float64
 7   basin                  59400 non-null  object 
 8   region                 59400 non-null  object 
 9   district_code          59400 non-null  int64  
 10  lga                    59400 non-null  object 
 11  population             59400 non-null  int64  
 12  public_meeting         56066 non-null  object 
 13  permit                 56344 non-null  object 
 14  construction_year      59400 non-null  int64  
 15  ex

Further pruning will likely happen when I begin feature engineering. I'm specifically thinking of 'lga', 'installer', 'funder' and the coordinate features: the first three because of their high cardinality, and the coordinates because I'm not sure that level of geographic specificity is needed for an initial classification model.
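
To put a number on the cardinality concern, a quick check along these lines could help. This is only a sketch: `cardinality_report` is a hypothetical helper and the toy values are made up, so the real counts would come from running it on `df_train`.

```python
import pandas as pd

def cardinality_report(df, cols):
    """Count distinct non-null values per column, highest first."""
    return df[cols].nunique().sort_values(ascending=False)

# Toy data standing in for df_train; the values are invented for illustration
toy = pd.DataFrame({
    'lga': ['A', 'B', 'C', 'A'],
    'installer': ['p', 'p', None, 'q'],
    'funder': ['x', 'x', 'x', 'x'],
})

report = cardinality_report(toy, ['lga', 'installer', 'funder'])
print(report)
```

On the real data the call would be `cardinality_report(df_train, ['lga', 'installer', 'funder'])`. Note that `nunique()` ignores missing values by default.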

## Dealing with Missing Values

### Missing data overview

In [8]:
missing = round((df_train.isna().sum() / len(df_train)) * 100, 2)
missing = missing.sort_values(ascending=False)
missing

installer                6.15
funder                   6.12
public_meeting           5.61
permit                   5.14
amount_tsh               0.00
source                   0.00
quantity                 0.00
quality_group            0.00
payment_type             0.00
management_group         0.00
management               0.00
extraction_type_class    0.00
construction_year        0.00
population               0.00
date_recorded            0.00
lga                      0.00
district_code            0.00
region                   0.00
basin                    0.00
latitude                 0.00
longitude                0.00
gps_height               0.00
waterpoint_type          0.00
dtype: float64
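
Only four columns have missing values, each at roughly 5 to 6 percent. One possible way to handle the two text columns ('installer' and 'funder') is to fill the gaps with a placeholder label rather than drop rows. The sketch below is an assumption about one option, not necessarily the approach taken in the rest of this notebook; `fill_missing_labels` and the `'unknown'` placeholder are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_missing_labels(df, cols, placeholder='unknown'):
    """Return a copy of df with NaNs in the given columns replaced by a placeholder label."""
    out = df.copy()
    out[cols] = out[cols].fillna(placeholder)
    return out

# Toy data standing in for df_train; the values are invented for illustration
toy = pd.DataFrame({
    'installer': ['a', np.nan, 'b'],
    'funder': [np.nan, 'x', 'x'],
    'gps_height': [10, 20, 30],
})

filled = fill_missing_labels(toy, ['installer', 'funder'])
print(filled)
```

Because the function works on a copy, the same call can be repeated on the test set without side effects. The boolean-like columns ('public_meeting' and 'permit') may need a separate decision, since a string placeholder would mix types in those columns.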

## String Cleaning