# Draft 0

# Introduction

Real-world data often contains many missing values. Missing data can range from variables with no observations at all to individual questions with no recorded answer. Causes include human error in data collection, system malfunctions, data corruption, and other unintended sources.

It is good practice to identify and replace missing values in each column of the input data before modeling a prediction task. This is called missing data imputation, or imputing for short. A popular approach is to calculate a statistic for each column (such as the mean) and replace all missing values in that column with the statistic. The approach is popular because the statistic is easy to calculate from the training dataset and because it often results in good performance.
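As a minimal sketch of the statistic-based approach described above, the following fills each column with its training-set mean. The frames `train` and `test` and their values are toy assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Toy training and test frames with missing entries (illustrative values only)
train = pd.DataFrame({"age": [10.0, np.nan, 30.0, 20.0],
                      "rooms": [4.0, 6.0, np.nan, 5.0]})
test = pd.DataFrame({"age": [np.nan, 40.0],
                     "rooms": [3.0, np.nan]})

# Compute one statistic per column from the training data only;
# pandas skips NaN entries when it computes the mean
col_means = train.mean()

# Replace missing values in both sets with the training statistic
train_imputed = train.fillna(col_means)
test_imputed = test.fillna(col_means)

print(col_means)
print(test_imputed)
```

Computing the statistic on the training rows and reusing it for the test rows keeps information from the test set out of the imputation step.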

#### Objective
To understand the impact of imputation on model performance by applying imputation strategies to a complete dataset from which values are deliberately removed.

# Baseline Model Development

In [1]:
# Import package dependencies
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from ml_metrics import rmse
import matplotlib.pyplot as plt
from sklearn import datasets

### Overview of the dataset

In [2]:
# Load in the dataset
california = datasets.fetch_california_housing()
print(california.data.shape)

(20640, 8)


In [3]:
# Convert the matrix to pandas
cal = pd.DataFrame(california.data)
cal.columns = california.feature_names
cal['MedHouseVal'] = california.target
cal.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [4]:
cal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


The dataset consists of 20,640 records and 9 columns: 8 features plus the target. All columns have the float64 data type, and none of them contains missing values, as the output of `cal.info()` above shows.

### Fitting a Linear Regression model to the full dataset

**Create a training and testing split**

The train-test split is a technique for evaluating the performance of a machine learning algorithm. It applies to both classification and regression problems and works with any supervised learning algorithm. We use a 70/30 split for this project: 70% of the rows form the training set and 30% form the test set.
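For comparison, a 70/30 split can also be sketched with scikit-learn's `train_test_split`. The small frame `df` below is a stand-in assumption for the housing data, and the rows it selects will generally differ from those picked by `DataFrame.sample`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small stand-in frame (hypothetical columns x and y)
df = pd.DataFrame({"x": np.arange(20, dtype=float),
                   "y": np.arange(20, dtype=float) * 2})

# 70/30 split; random_state fixes the shuffle so the split is reproducible
train_df, test_df = train_test_split(df, test_size=0.3, random_state=100)

print(len(train_df), len(test_df))
# The two parts share no rows
print(train_df.index.intersection(test_df.index).empty)
```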

In [5]:
train_set = cal.sample(frac=0.7, random_state=100)
test_set = cal.drop(train_set.index) # the remaining 30% of rows
print(train_set.shape[0])
print(test_set.shape[0])

14448
6192


In [6]:
# Converting the training and testing datasets back to matrix-formats
X_train = train_set.iloc[:, 0:8].values # training features (target excluded)
Y_train = train_set.iloc[:, -1].values # training target only
X_test = test_set.iloc[:, 0:8].values # test features (target excluded)
Y_test = test_set.iloc[:, -1].values # test target only

In [7]:
# Fit a linear regression to the training data
reg = LinearRegression(normalize=True).fit(X_train, Y_train)
print(reg.score(X_train, Y_train))
print(reg.coef_)
print(reg.intercept_)
print(reg.get_params())

0.6160214522398206
[ 4.59063361e-01  9.72601795e-03 -1.37408894e-01  8.20058010e-01
 -5.20832695e-06 -3.38987100e-03 -4.12859546e-01 -4.28244613e-01]
-36.61483864947015
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': True, 'positive': False}


In [8]:
# Find the variable with the largest "normalized" coefficient value
print('The positive(max) coef-value is {}'.format(max(reg.coef_))) # Positive Max
#print('The abs(max) coef-value is {}'.format(max(reg.coef_, key=abs))) # ABS Max
max_var = max(reg.coef_) # Positive Max
#max_var = max(reg.coef_, key=abs) # ABS Max
var_index = reg.coef_.tolist().index(max_var)
print('The variable associated with this coef-value is {}'.format(california.feature_names[var_index]))

The positive(max) coef-value is 0.8200580100743708
The variable associated with this coef-value is AveBedrms
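Raw regression coefficients depend on the units of each feature, so comparing them directly can be misleading. The sketch below, on synthetic data with hypothetical features `x_small` and `x_large`, shows one possible way to put coefficients on a common scale by standardizing the features first. It is an illustration under these assumptions, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (hypothetical data)
x_small = rng.normal(0.0, 1.0, 500)      # unit scale
x_large = rng.normal(0.0, 1000.0, 500)   # large scale
X = np.column_stack([x_small, x_large])
y = 0.5 * x_small + 0.01 * x_large + rng.normal(0.0, 0.1, 500)

# Fit once on the raw features and once on standardized features
raw = LinearRegression().fit(X, y)
std = LinearRegression().fit(StandardScaler().fit_transform(X), y)

print("raw coefficients:         ", raw.coef_)
print("standardized coefficients:", std.coef_)
```

On the raw scale the first feature has the larger coefficient, while after standardization the second one does, because each standardized coefficient measures the effect of a one-standard-deviation change.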


In [9]:
Y_pred = reg.predict(X_test)

orig_mae = mean_absolute_error(Y_test,Y_pred)
orig_mse = mean_squared_error(Y_test,Y_pred)
orig_rmse_val = rmse(Y_test,Y_pred)
orig_r2 = r2_score(Y_test,Y_pred)
print("MAE: %.3f"%orig_mae)
print("MSE:  %.3f"%orig_mse)
print("RMSE:  %.3f"%orig_rmse_val)
print("R2:  %.3f"%orig_r2)

MAE: 0.537
MSE:  0.556
RMSE:  0.746
R2:  0.580
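RMSE is the square root of MSE, which the values above reflect (0.746 squared is roughly 0.556). A minimal sketch, using toy arrays that are assumptions for illustration, shows how it can be derived from `mean_squared_error` with NumPy alone:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions (illustrative values only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse_val = np.sqrt(mse)  # RMSE is the square root of MSE

print("MSE:  %.3f" % mse)
print("RMSE: %.3f" % rmse_val)
```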


In [10]:
res_frame = pd.DataFrame({'data':'original', 'imputation':'none', 'mae': orig_mae, 
                   'mse': orig_mse, 'rmse':orig_rmse_val, 'R2':orig_r2, 'mae_diff':np.nan,
                   'mse_diff':np.nan, 'rmse_diff':np.nan, 'R2_diff':np.nan}, index=[0])

In [11]:
res_frame

| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 0.53676 | 0.556336 | 0.74588 | 0.58016 | NaN | NaN | NaN | NaN |


# Missing Completely at Random

In [12]:
# Note: frac=0.05 here, so the "1%" sample holds the same 1,032 rows as the 5% sample
in_sample_1 = cal.sample(frac=0.05, random_state=99)
in_sample_5 = cal.sample(frac=0.05, random_state=99)
in_sample_10 = cal.sample(frac=0.10, random_state=99)
in_sample_20 = cal.sample(frac=0.20, random_state=99)
in_sample_33 = cal.sample(frac=0.33, random_state=99)
in_sample_50 = cal.sample(frac=0.5, random_state=99)

print("at 1%", in_sample_1.shape)
print("at 5%", in_sample_5.shape)
print("at 10%", in_sample_10.shape)
print("at 20%", in_sample_20.shape)
print("at 33%", in_sample_33.shape)
print("at 50%", in_sample_50.shape)

at 1% (1032, 9)
at 5% (1032, 9)
at 10% (2064, 9)
at 20% (4128, 9)
at 33% (6811, 9)
at 50% (10320, 9)


In [13]:
in_sample_1['HouseAge'] = np.nan
in_sample_5['HouseAge'] = np.nan
in_sample_10['HouseAge'] = np.nan
in_sample_20['HouseAge'] = np.nan
in_sample_33['HouseAge'] = np.nan
in_sample_50['HouseAge'] = np.nan
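The masking steps above repeat the same operation for each missing-data rate. A sketch of how they could be folded into one helper follows. The function `mask_completely_at_random` and the `demo` frame are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def mask_completely_at_random(df, column, frac, seed):
    """Return a copy of df with a random `frac` of `column` set to NaN.

    Hypothetical helper: rows are chosen uniformly at random,
    independent of any value in the data.
    """
    out = df.copy()
    missing_idx = out.sample(frac=frac, random_state=seed).index
    out.loc[missing_idx, column] = np.nan
    return out

# Stand-in frame with 100 rows (illustrative values only)
demo = pd.DataFrame({"HouseAge": np.arange(100, dtype=float),
                     "MedInc": np.ones(100)})

for frac in [0.01, 0.05, 0.10, 0.20, 0.33, 0.50]:
    masked = mask_completely_at_random(demo, "HouseAge", frac, seed=99)
    print(frac, int(masked["HouseAge"].isna().sum()))
```

Passing the rate as an argument keeps each label tied to the fraction actually sampled.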

In [14]:
print(in_sample_1.head())
in_sample_5
in_sample_10
in_sample_20
in_sample_33
in_sample_50

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
7914   3.2639       NaN  3.832661   1.046371      2685.0  5.413306     33.88   
11963  2.6442       NaN  6.185930   1.085427       841.0  4.226131     34.01   
18738  1.9881       NaN  4.513889   0.998843      2145.0  2.482639     40.56   
17431  1.9257       NaN  4.203036   1.096774      1544.0  2.929791     34.65   
17947  4.0667       NaN  5.109131   0.915367      1054.0  2.347439     37.33   

       Longitude  MedHouseVal  
7914     -118.08        1.201  
11963    -117.41        0.920  
18738    -122.35        0.852  
17431    -120.45        1.353  
17947    -121.96        2.769  


Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.20100
11963,2.6442,,6.185930,1.085427,841.0,4.226131,34.01,-117.41,0.92000
18738,1.9881,,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.85200
17431,1.9257,,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.35300
17947,4.0667,,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.76900
...,...,...,...,...,...,...,...,...,...
12004,2.3417,,3.937500,1.051136,365.0,2.073864,33.89,-117.56,1.91700
14752,3.1696,,5.079365,1.015873,2450.0,4.320988,32.57,-117.05,1.38000
14331,1.3071,,2.647295,1.102204,781.0,1.565130,32.72,-117.15,2.50000
15116,5.2820,,5.918495,1.000000,950.0,2.978056,32.85,-117.00,1.40800


In [15]:
print(in_sample_1.head())
print(in_sample_5.head())
print(in_sample_10.head())
print(in_sample_20.head())
print(in_sample_33.head())
print(in_sample_50.head())

       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
7914   3.2639       NaN  3.832661   1.046371      2685.0  5.413306     33.88   
11963  2.6442       NaN  6.185930   1.085427       841.0  4.226131     34.01   
18738  1.9881       NaN  4.513889   0.998843      2145.0  2.482639     40.56   
17431  1.9257       NaN  4.203036   1.096774      1544.0  2.929791     34.65   
17947  4.0667       NaN  5.109131   0.915367      1054.0  2.347439     37.33   

       Longitude  MedHouseVal  
7914     -118.08        1.201  
11963    -117.41        0.920  
18738    -122.35        0.852  
17431    -120.45        1.353  
17947    -121.96        2.769  
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
7914   3.2639       NaN  3.832661   1.046371      2685.0  5.413306     33.88   
11963  2.6442       NaN  6.185930   1.085427       841.0  4.226131     34.01   
18738  1.9881       NaN  4.513889   0.998843      2145.0  2.482639     40.56   
17431  

In [16]:
# Rows that were not selected for masking keep their observed HouseAge values
out_sample_1 = cal.drop(in_sample_1.index)
out_sample_5 = cal.drop(in_sample_5.index)
out_sample_10 = cal.drop(in_sample_10.index)
out_sample_20 = cal.drop(in_sample_20.index)
out_sample_33 = cal.drop(in_sample_33.index)
out_sample_50 = cal.drop(in_sample_50.index)

In [17]:
in_sample_1['HouseAge'] = in_sample_1['HouseAge'].fillna(out_sample_1['HouseAge'].median())
in_sample_5['HouseAge'] = in_sample_5['HouseAge'].fillna(out_sample_5['HouseAge'].median())
in_sample_10['HouseAge'] = in_sample_10['HouseAge'].fillna(out_sample_10['HouseAge'].median())
in_sample_20['HouseAge'] = in_sample_20['HouseAge'].fillna(out_sample_20['HouseAge'].median())
in_sample_33['HouseAge'] = in_sample_33['HouseAge'].fillna(out_sample_33['HouseAge'].median())
in_sample_50['HouseAge'] = in_sample_50['HouseAge'].fillna(out_sample_50['HouseAge'].median())
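The cell above fills the masked entries with the median of the observed `HouseAge` values. As a sketch of an equivalent route, scikit-learn's `SimpleImputer` with `strategy="median"` computes the same statistic while ignoring missing entries. The toy column below is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries (illustrative values only)
df = pd.DataFrame({"HouseAge": [10.0, np.nan, 30.0, 52.0, np.nan, 20.0]})

# Median of the observed values only; pandas skips NaN by default
observed_median = df["HouseAge"].median()

# SimpleImputer learns the same median during fit and fills NaN in transform
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(df[["HouseAge"]])

print(observed_median, imputer.statistics_[0])
print(filled.ravel())
```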

In [18]:
in_sample_1.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [19]:
in_sample_5.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [20]:
in_sample_10.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [21]:
in_sample_20.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [22]:
in_sample_33.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [23]:
in_sample_50.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
7914,3.2639,29.0,3.832661,1.046371,2685.0,5.413306,33.88,-118.08,1.201
11963,2.6442,29.0,6.18593,1.085427,841.0,4.226131,34.01,-117.41,0.92
18738,1.9881,29.0,4.513889,0.998843,2145.0,2.482639,40.56,-122.35,0.852
17431,1.9257,29.0,4.203036,1.096774,1544.0,2.929791,34.65,-120.45,1.353
17947,4.0667,29.0,5.109131,0.915367,1054.0,2.347439,37.33,-121.96,2.769


In [24]:
imputed_data_1 = pd.concat([in_sample_1, out_sample_1])
imputed_data_5 = pd.concat([in_sample_5, out_sample_5])
imputed_data_10 = pd.concat([in_sample_10, out_sample_10])
imputed_data_20 = pd.concat([in_sample_20, out_sample_20])
imputed_data_33 = pd.concat([in_sample_33, out_sample_33])
imputed_data_50 = pd.concat([in_sample_50, out_sample_50])

In [25]:
imputed_data_1.shape

(20640, 9)

In [26]:
train_set = imputed_data_1.sample(frac=0.7, random_state=100)
test_set = imputed_data_1[~imputed_data_1.isin(train_set)].dropna()

In [27]:
train_set = imputed_data_1.sample(frac=0.7, random_state=100)
test_set = imputed_data_1[~imputed_data_1.isin(train_set)].dropna()

X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

# Note: the evaluation arrays are taken from train_set, so the metrics below are in-sample
X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 

reg1 = LinearRegression(normalize=True).fit(X_train, Y_train)

In [28]:
Y_pred = reg1.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r1 = r2_score(Y_test,Y_pred)

temp_frame_1 = pd.DataFrame({'data':'1% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_1%':r1, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r1-orig_r2}, index=[0])



In [29]:
temp_frame_1

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_1%,mae_diff,mse_diff,rmse_diff,R2_diff
0,1% imputed,MAR,0.529113,0.520766,0.721641,0.609728,-0.007648,-0.03557,-0.024238,0.029568


In [30]:
train_set = imputed_data_5.sample(frac=0.7, random_state=100)
test_set = imputed_data_5[~imputed_data_5.isin(train_set)].dropna()

X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 

reg5 = LinearRegression(normalize=True).fit(X_train, Y_train)


In [31]:
Y_pred = reg5.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r5 = r2_score(Y_test,Y_pred)

temp_frame_5 = pd.DataFrame({'data':'5% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_5%':r5, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r5-orig_r2}, index=[0])

In [32]:
temp_frame_5

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_5%,mae_diff,mse_diff,rmse_diff,R2_diff
0,5% imputed,MAR,0.529113,0.520766,0.721641,0.609728,-0.007648,-0.03557,-0.024238,0.029568


In [33]:
train_set = imputed_data_10.sample(frac=0.7, random_state=100)
test_set = imputed_data_10[~imputed_data_10.isin(train_set)].dropna()

                            
X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 

reg10 = LinearRegression(normalize=True).fit(X_train, Y_train)

In [34]:
Y_pred = reg10.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r10 = r2_score(Y_test,Y_pred)

temp_frame_10 = pd.DataFrame({'data':'10% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_10%':r10, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r10-orig_r2}, index=[0])

In [35]:
temp_frame_10

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_10%,mae_diff,mse_diff,rmse_diff,R2_diff
0,10% imputed,MAR,0.529816,0.51517,0.717753,0.61452,-0.006944,-0.041166,-0.028126,0.03436


In [36]:
train_set = imputed_data_20.sample(frac=0.7, random_state=100)
test_set = imputed_data_20[~imputed_data_20.isin(train_set)].dropna()

                             
X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 

reg20 = LinearRegression().fit(X_train, Y_train)

In [37]:
Y_pred = reg20.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r20 = r2_score(Y_test,Y_pred)

temp_frame_20 = pd.DataFrame({'data':'20% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_20%':r20, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r20-orig_r2}, index=[0])

In [38]:
temp_frame_20

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_20%,mae_diff,mse_diff,rmse_diff,R2_diff
0,20% imputed,MAR,0.529883,0.525442,0.724873,0.608509,-0.006878,-0.030895,-0.021006,0.028349


In [39]:
train_set = imputed_data_33.sample(frac=0.7, random_state=100)
test_set = imputed_data_33[~imputed_data_33.isin(train_set)].dropna()

                             
X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 

reg33 = LinearRegression(normalize=True).fit(X_train, Y_train)

In [40]:
Y_pred = reg33.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r33 = r2_score(Y_test,Y_pred)

temp_frame_33 = pd.DataFrame({'data':'33% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_33%':r33, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r33-orig_r2}, index=[0])

In [41]:
temp_frame_33

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_33%,mae_diff,mse_diff,rmse_diff,R2_diff
0,33% imputed,MAR,0.532691,0.529767,0.727851,0.602485,-0.00407,-0.026569,-0.018029,0.022325


In [42]:
train_set = imputed_data_50.sample(frac=0.7, random_state=100)
test_set = imputed_data_50[~imputed_data_50.isin(train_set)].dropna()
                              
X_train = train_set.iloc[:, 0:8].values 
Y_train = train_set.iloc[:, -1].values

X_test = train_set.iloc[:, 0:8].values 
Y_test = train_set.iloc[:, -1].values 


reg50 = LinearRegression(normalize=True).fit(X_train, Y_train)

In [43]:
Y_pred = reg50.predict(X_test)

mae = mean_absolute_error(Y_test,Y_pred)
mse = mean_squared_error(Y_test,Y_pred)
rmse_val = rmse(Y_test,Y_pred)
r50 = r2_score(Y_test,Y_pred)

temp_frame_50 = pd.DataFrame({'data':'50% imputed', 'imputation':'MAR', 'mae': mae, 'mse': mse, 
                   'rmse':rmse_val, 'R2_at_50%':r50, 'mae_diff':mae-orig_mae, 'mse_diff':mse-orig_mse,
                   'rmse_diff':rmse_val-orig_rmse_val, 'R2_diff':r50-orig_r2}, index=[0])

In [44]:
temp_frame_50

Unnamed: 0,data,imputation,mae,mse,rmse,R2_at_50%,mae_diff,mse_diff,rmse_diff,R2_diff
0,50% imputed,MAR,0.527224,0.521339,0.722038,0.605956,-0.009537,-0.034998,-0.023842,0.025796


In [45]:
res_frame = pd.concat([res_frame, temp_frame_1, temp_frame_5, temp_frame_10, temp_frame_20,
                       temp_frame_33, temp_frame_50])


In [46]:
res_frame


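| | data | imputation | mae | mse | rmse | R2 | mae_diff | mse_diff | rmse_diff | R2_diff | R2_at_1% | R2_at_5% | R2_at_10% | R2_at_20% | R2_at_33% | R2_at_50% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | none | 0.53676 | 0.556336 | 0.74588 | 0.58016 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 0 | 1% imputed | MAR | 0.529113 | 0.520766 | 0.721641 | NaN | -0.007648 | -0.03557 | -0.024238 | 0.029568 | 0.609728 | NaN | NaN | NaN | NaN | NaN |
| 0 | 5% imputed | MAR | 0.529113 | 0.520766 | 0.721641 | NaN | -0.007648 | -0.03557 | -0.024238 | 0.029568 | NaN | 0.609728 | NaN | NaN | NaN | NaN |
| 0 | 10% imputed | MAR | 0.529816 | 0.51517 | 0.717753 | NaN | -0.006944 | -0.041166 | -0.028126 | 0.03436 | NaN | NaN | 0.61452 | NaN | NaN | NaN |
| 0 | 20% imputed | MAR | 0.529883 | 0.525442 | 0.724873 | NaN | -0.006878 | -0.030895 | -0.021006 | 0.028349 | NaN | NaN | NaN | 0.608509 | NaN | NaN |
| 0 | 33% imputed | MAR | 0.532691 | 0.529767 | 0.727851 | NaN | -0.00407 | -0.026569 | -0.018029 | 0.022325 | NaN | NaN | NaN | NaN | 0.602485 | NaN |
| 0 | 50% imputed | MAR | 0.527224 | 0.521339 | 0.722038 | NaN | -0.009537 | -0.034998 | -0.023842 | 0.025796 | NaN | NaN | NaN | NaN | NaN | 0.605956 |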
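The per-rate cells above repeat the same steps and store R2 under a different column name each time, which leaves the combined table sparse. They also predict on arrays built from `train_set`, so the imputed rows report in-sample metrics while the baseline row was scored on held-out rows. The sketch below shows one possible way to run the whole comparison in a loop, scoring every model on a held-out 30% and keeping a single `R2` column. It uses synthetic data with hypothetical columns `f0`, `f1`, and `target` so that it is self-contained, and it omits the `normalize` argument, which was removed from `LinearRegression` in scikit-learn 1.2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the housing data (hypothetical columns)
data = pd.DataFrame({"f0": rng.normal(size=2000), "f1": rng.normal(size=2000)})
data["target"] = 2.0 * data["f0"] - data["f1"] + rng.normal(scale=0.5, size=2000)

def evaluate(df, label):
    """Fit on 70% of df and score on the held-out 30%."""
    train, test = train_test_split(df, test_size=0.3, random_state=100)
    features = ["f0", "f1"]
    model = LinearRegression().fit(train[features], train["target"])
    pred = model.predict(test[features])
    mse = mean_squared_error(test["target"], pred)
    return {"data": label,
            "mae": mean_absolute_error(test["target"], pred),
            "mse": mse,
            "rmse": np.sqrt(mse),
            "R2": r2_score(test["target"], pred)}

rows = [evaluate(data, "original")]
for frac in [0.01, 0.05, 0.10, 0.20, 0.33, 0.50]:
    imputed = data.copy()
    # Mask a random fraction of one feature, then fill with the observed median
    missing_idx = imputed.sample(frac=frac, random_state=99).index
    imputed.loc[missing_idx, "f1"] = np.nan
    imputed["f1"] = imputed["f1"].fillna(imputed["f1"].median())
    rows.append(evaluate(imputed, f"{round(frac * 100)}% imputed"))

# One row per experiment, with the same columns in every row
results = pd.DataFrame(rows)
results["R2_diff"] = results["R2"] - results.loc[0, "R2"]
print(results)
```

Because every row shares the same columns and the same evaluation procedure, the differences against the baseline can be read directly from `R2_diff`.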


# Missing at Random

# Missing Not at Random

# Conclusion