# Handling missing data

Look at your data first, perhaps using the visualization tools covered in section 2. Also compare the data before and after replacing or removing missing values.

Also try to understand why the values are missing, and check whether the missingness correlates with other features or with other missing values. *'Understanding the reasons why data are missing is important for handling the remaining data correctly.'* See https://en.wikipedia.org/wiki/Missing_data
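One way to check whether missingness is related to another feature is to build a 0/1 indicator for the missing values and compare it against that feature. The sketch below uses a small hypothetical dataset (not the one used later in this section) in which `Salary` is missing only for the youngest rows; the column values and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Salary is missing only for the two youngest rows
toy = pd.DataFrame({
    'Age':    [22, 25, 31, 38, 44, 52],
    'Salary': [np.nan, np.nan, 48000, 61000, 72000, 83000],
})

# 1 where Salary is missing, 0 where it is present
salary_missing = toy['Salary'].isnull().astype(int).rename('salary_missing')

# Average Age for rows with (0) and without (1) a Salary value
mean_age_by_missing = toy.groupby(salary_missing)['Age'].mean()
print(mean_age_by_missing)

# Correlation between the missingness indicator and Age
corr = salary_missing.corr(toy['Age'])
print('Correlation between missing Salary and Age:', corr)
```

A clearly non-zero correlation suggests the values are not missing completely at random, so dropping those rows could bias the data.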

**Method 1: Omission**
Remove samples that have missing data. However, this may introduce gaps or bias, and it can change your dataset dramatically if the removed samples carry important information.

**Method 2: Imputation by Mean (average), Median (middle value) or Mode (most common value)**
You don't lose any samples, and it's quick and easy. However, results can vary, because every gap in a feature is filled with the same value. The most common method is using the mean. For categorical data, try using the mode (most_frequent) or method 3. See https://www.researchgate.net/post/What_is_the_proper_imputation_method_for_categorical_missing_value
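As a minimal sketch of the median and mode variants, `SimpleImputer` accepts `strategy='median'` and `strategy='most_frequent'`. The arrays below are hypothetical toy data, chosen so that an outlier (90) pulls the mean away from the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric and one categorical column, each with a gap
ages = np.array([[44.0], [27.0], [np.nan], [38.0], [90.0]])
countries = np.array([['France'], ['Spain'], [np.nan], ['Spain']], dtype=object)

# Median is less sensitive to outliers (such as 90) than the mean
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
ages_filled = median_imputer.fit_transform(ages)

# most_frequent (the mode) also works on string/categorical columns
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
countries_filled = mode_imputer.fit_transform(countries)

print(ages_filled.ravel())
print(countries_filled.ravel())
```

Here the missing age is filled with the median of the observed ages, and the missing country with the most frequent observed country.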

**Method 3: Imputation using supervised learning**
Models like k-nearest neighbors (KNN) can yield the best results, but they are harder and more time-consuming to implement.
You could also use the expectation-maximization algorithm (https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) or maximum likelihood estimation (https://en.wikipedia.org/wiki/Maximum_likelihood_estimation).
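A minimal sketch of KNN-based imputation, assuming a scikit-learn version that provides `sklearn.impute.KNNImputer` (0.22 or later). It uses the Age and Salary values from the example dataset in this section; `n_neighbors=2` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Age and Salary columns of the example dataset, with NaN for missing values
X_num = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

# Each missing value is replaced by the average of that feature over the
# n_neighbors closest rows, measured on the features both rows have observed
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X_num)

print('Row with missing Salary:', X_knn[4])
print('Row with missing Age:   ', X_knn[6])
```

Unlike mean imputation, the filled value depends on the other features of the same row. When several features on different scales contribute to the distance, scaling them first is usually advisable.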

Which method you use depends on how much data is missing, why it's missing, and what method would have the least impact or bias on your data.

Also consider which features you are likely to use in your model. Are there a lot of missing values in a feature that is unlikely to affect your model?
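To see how much data is missing per feature, the share of missing values can be computed with `isnull().mean()`. The sketch below uses a hypothetical dataset with a sparse `Notes` column; the 0.5 threshold is an assumption for illustration, and the right cut-off depends on your data and model.

```python
import pandas as pd

# Hypothetical toy data with one mostly empty column
toy = pd.DataFrame({
    'Age':    [44, 27, None, 38, None],
    'Salary': [72000, 48000, 54000, None, 61000],
    'Notes':  [None, None, None, 'call back', None],
})

# Fraction of missing values per feature, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed threshold: drop features with more than half of their values missing
threshold = 0.5
sparse_features = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=sparse_features)
print('Dropped features:', sparse_features)
```

Dropping a sparse feature that the model is unlikely to need can be cheaper and safer than imputing most of its values.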

In [1]:
## ------- Import libraries and data ------- ##

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

lst = [['France', 44, 72000, 'No'],
       ['Spain', 27, 48000, 'Yes'],
       ['Germany', 30, 54000, 'No'],
       ['Spain', 38, 61000, 'No'],
       ['Germany', 40, None, 'Yes'],
       ['France', 35, 58000, 'Yes'],
       ['Spain', None, 52000, 'No'],
       ['France', 48, 79000, 'Yes'],
       ['Germany', 50, 83000, 'No'],
       ['France', 37, 67000, 'Yes']]

df = pd.DataFrame(lst, columns=['Country', 'Age', 'Salary', 'Purchased'])

# set X as all independent variables
X = df.iloc[:,:-1].values

# set y as dependent variable
y = df.iloc[:,-1].values


## ------- detecting missing values ------- ##

# detect any NaN values - isnull() only flags None/NaN, so placeholders
# such as '?', '' or -999 must first be converted to NaN
missing_data = df.isnull().sum()

print('Missing values in each feature: \n')
print(missing_data)
print('\n','_'*7,'\n')


## ------- Method 1: drop samples with missing data ------- ##

dropped_df = df.dropna()
dropped_missing_data = dropped_df.isnull().sum()

print('Check after dropping NaN/missing values: \n')
print(dropped_df, '\n')
print(dropped_missing_data)
print('\n','_'*7,'\n')

## ------- Method 2: replace missing values by the Mean ------- ##

print('Check X before imputer: \n\n', X, '\n\n')

# create imputer object - for more info use help(SimpleImputer)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit the imputer on the numeric columns that have missing data (Age, Salary)
imputer = imputer.fit(X[:, 1:3])

# replace NaN values with the column means learned during fit
X[:, 1:3] = imputer.transform(X[:, 1:3])

print('Check X after imputer: \n\n', X)
print('\n','_'*7,'\n')

## ------- Method 3: ML model ------- ##

# TODO

Missing values in each feature: 

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

 _______ 

Check after dropping NaN/missing values: 

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes 

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64

 _______ 

Check X before imputer: 

 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]] 


Check X after imputer: 

 [['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.777