# Missing values

First, let's import our dataset:

In [1]:
from sklearn.neighbors import NearestNeighbors as NN
import pandas as pd
import numpy as np

df = pd.read_csv('MV_example.csv')
print(df)

     Amount          Loan type  Age Gender
0   50000.0           Mortgage   19      F
1    1000.0           Car loan   23      M
2   27000.0           Car loan   44      M
3  655555.0           Mortgage   45      F
4  187666.0           Mortgage   65      F
5  165777.0           Mortgage   39    NaN
6       NaN           Mortgage   36      F
7  145000.0                NaN   27      F
8  156899.0           Mortgage   48      F
9   15000.0  Short-term credit   55      M
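
Before choosing a strategy, it helps to see where the gaps are. The sketch below rebuilds the table printed above by hand (an assumption that stands in for reading `MV_example.csv`) and counts the missing values per column; the names `missing_per_column` and `rows_with_na` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the table above (stands in for MV_example.csv)
df = pd.DataFrame({
    'Amount': [50000, 1000, 27000, 655555, 187666,
               165777, np.nan, 145000, 156899, 15000],
    'Loan type': ['Mortgage', 'Car loan', 'Car loan', 'Mortgage', 'Mortgage',
                  'Mortgage', 'Mortgage', np.nan, 'Mortgage', 'Short-term credit'],
    'Age': [19, 23, 44, 45, 65, 39, 36, 27, 48, 55],
    'Gender': ['F', 'M', 'M', 'F', 'F', np.nan, 'F', 'F', 'F', 'M'],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
```

Each of `Amount`, `Loan type` and `Gender` has exactly one missing entry, in rows 6, 7 and 5 respectively, while `Age` is complete.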


## Dropping NAs

The simplest approach is to drop every observation (row) that contains a missing value (NaN in pandas):

In [2]:
print(df.dropna())

     Amount          Loan type  Age Gender
0   50000.0           Mortgage   19      F
1    1000.0           Car loan   23      M
2   27000.0           Car loan   44      M
3  655555.0           Mortgage   45      F
4  187666.0           Mortgage   65      F
8  156899.0           Mortgage   48      F
9   15000.0  Short-term credit   55      M


Alternatively, you can drop every column that contains a missing value, which here leaves only Age:

In [3]:
print(df.dropna(axis=1))

   Age
0   19
1   23
2   44
3   45
4   65
5   39
6   36
7   27
8   48
9   55
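
Both variants above throw away a lot of data. As a minimal sketch, assuming only the `Amount` column matters for the analysis at hand (that choice is an assumption made for illustration), `dropna` can be restricted to selected columns with its `subset` parameter. The table is rebuilt by hand so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table (stands in for MV_example.csv)
df = pd.DataFrame({
    'Amount': [50000, 1000, 27000, 655555, 187666,
               165777, np.nan, 145000, 156899, 15000],
    'Loan type': ['Mortgage', 'Car loan', 'Car loan', 'Mortgage', 'Mortgage',
                  'Mortgage', 'Mortgage', np.nan, 'Mortgage', 'Short-term credit'],
    'Age': [19, 23, 44, 45, 65, 39, 36, 27, 48, 55],
    'Gender': ['F', 'M', 'M', 'F', 'F', np.nan, 'F', 'F', 'F', 'M'],
})

# Drop a row only if 'Amount' is missing; NaNs in other columns are kept
amount_known = df.dropna(subset=['Amount'])
print(amount_known)

# Drop a row only if *all* of its values are missing (none here)
not_empty = df.dropna(how='all')
print(len(not_empty))
```

With `subset=['Amount']` only row 6 is removed, so nine of the ten observations remain instead of seven.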


## Imputation

A quick fix is to fill the NaNs in a numeric column with that column's mean or median:

In [4]:
# We use nanmean and nanmedian instead of mean and median to ignore NaNs
mean = np.nanmean(df['Amount'])
median = np.nanmedian(df['Amount'])
print('Mean: ', mean, ' median: ', median)

print(df['Amount'].fillna(mean))
print(df['Amount'].fillna(median))
# To replace the NaNs in df itself, assign the result back:
# df['Amount'] = df['Amount'].fillna(mean)
# print(df)

Mean:  155988.55555555556  median:  145000.0
0     50000.000000
1      1000.000000
2     27000.000000
3    655555.000000
4    187666.000000
5    165777.000000
6    155988.555556
7    145000.000000
8    156899.000000
9     15000.000000
Name: Amount, dtype: float64
0     50000.0
1      1000.0
2     27000.0
3    655555.0
4    187666.0
5    165777.0
6    145000.0
7    145000.0
8    156899.0
9     15000.0
Name: Amount, dtype: float64
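
The mean and median only make sense for numeric columns. For the categorical columns, one common quick fix is to use the most frequent value (the mode). The following is a sketch under that assumption, not necessarily the approach intended here; it passes a dictionary to `fillna` so each column gets its own replacement. The table is rebuilt by hand and the name `fill_values` is illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example table (stands in for MV_example.csv)
df = pd.DataFrame({
    'Amount': [50000, 1000, 27000, 655555, 187666,
               165777, np.nan, 145000, 156899, 15000],
    'Loan type': ['Mortgage', 'Car loan', 'Car loan', 'Mortgage', 'Mortgage',
                  'Mortgage', 'Mortgage', np.nan, 'Mortgage', 'Short-term credit'],
    'Age': [19, 23, 44, 45, 65, 39, 36, 27, 48, 55],
    'Gender': ['F', 'M', 'M', 'F', 'F', np.nan, 'F', 'F', 'F', 'M'],
})

# One replacement value per column: median for numbers, mode for categories
fill_values = {
    'Amount': df['Amount'].median(),        # median() skips NaNs by default
    'Loan type': df['Loan type'].mode()[0],  # most frequent loan type
    'Gender': df['Gender'].mode()[0],        # most frequent gender
}

filled = df.fillna(fill_values)
print(filled)
```

Note that mode imputation pushes the data further towards the majority category, so it is best treated as a baseline.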


### Nearest neighbour imputation

Now let's use the nearest neighbour algorithm to find replacements for missing values, based on the complete observations that are most similar to the incomplete one:

In [5]:
# Let's first try observation 6 (counting from 0), which is missing the amount
X = df.loc[:, df.columns != 'Amount']

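# One-hot encode the categorical variables
# (dtype=int keeps the 0/1 columns shown below; recent pandas versions default to bool)
X = pd.get_dummies(X, prefix='cat', drop_first=True, dtype=int)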

# Keep observation 6 separately
x_6 = X.loc[6, :]

# Keep the complete observations only, leaving out rows 5 and 7
# (the 6th and 8th rows), which also have missing values
X = X.loc[[0, 1, 2, 3, 4, 8, 9], :]

print(X)

   Age  cat_Mortgage  cat_Short-term credit  cat_M
0   19             1                      0      0
1   23             0                      0      1
2   44             0                      0      1
3   45             1                      0      0
4   65             1                      0      0
8   48             1                      0      0
9   55             0                      1      1


In [6]:
print(x_6)

Age                      36
cat_Mortgage              1
cat_Short-term credit     0
cat_M                     0
Name: 6, dtype: int64
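
With `X` and `x_6` prepared, one possible next step is sketched below: fit `NearestNeighbors` on the complete rows, look up the single closest row to observation 6, and borrow its amount. This is a minimal illustration, assuming one neighbour and the default Euclidean distance on unscaled features. The values are copied by hand from the printed outputs above, and `amounts`, `neighbour_label` and `imputed_amount` are illustrative names.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Complete rows as printed above: Age plus dummy-coded Loan type and Gender
X = pd.DataFrame(
    {'Age': [19, 23, 44, 45, 65, 48, 55],
     'cat_Mortgage': [1, 0, 0, 1, 1, 1, 0],
     'cat_Short-term credit': [0, 0, 0, 0, 0, 0, 1],
     'cat_M': [0, 1, 1, 0, 0, 0, 1]},
    index=[0, 1, 2, 3, 4, 8, 9])

# Known amounts of those same rows
amounts = pd.Series(
    [50000.0, 1000.0, 27000.0, 655555.0, 187666.0, 156899.0, 15000.0],
    index=X.index)

# Observation 6 as a one-row frame with the same columns
x_6 = pd.DataFrame([[36, 1, 0, 0]], columns=X.columns, index=[6])

# Fit on the complete rows and query the single nearest neighbour
nn = NearestNeighbors(n_neighbors=1).fit(X)
distances, positions = nn.kneighbors(x_6)

# kneighbors returns positions within X, so map back to the row label
neighbour_label = X.index[positions[0, 0]]
imputed_amount = amounts.loc[neighbour_label]
print(neighbour_label, distances[0, 0], imputed_amount)
```

Because the features are not scaled, Age dominates the distance: the closest row is observation 2 (a 44-year-old with a car loan), not one of the mortgage holders. Standardising the columns first, or averaging over several neighbours, are options worth considering.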
