# **Univariate approach to Imputation**
The next three steps are taken from the [scikit-learn documentation](https://scikit-learn.org/stable/modules/impute.html). The first of these is a simple tool to help you implement univariate imputation. Remember that this approach is safest when the values are missing completely at random (MCAR). scikit-learn's `SimpleImputer` can fill missing values with a constant, the mean, the median or the mode (`strategy='most_frequent'`). Before you implement this process, it is important to visualise the data. For example, when a variable is normally distributed you should probably use the mean. If you have outliers or skewed data, use the median or the mode. The code in the next section gives a simple example of this. I would strongly recommend that you experiment with it so you can understand the implications of your choice.
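
A quick numeric check can complement the plot when choosing between the mean and the median. The sketch below is only an illustration: it takes the observed values of the second variable from the example that follows and compares their mean, median and sample skewness. The use of `scipy.stats.skew` here is my own choice, not something the scikit-learn guide prescribes.

```python
import numpy as np
from scipy.stats import skew

# Second variable from the example in this section (NaN marks a missing value)
x2 = np.array([5.1, 4.5, np.nan, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4,
               np.nan, 7.8, np.nan, np.nan, 10.1, 6.7, np.nan])

# Keep only the observed values before computing summary statistics
observed = x2[~np.isnan(x2)]

mean_val = observed.mean()
median_val = np.median(observed)
skewness = skew(observed)

# A large gap between mean and median, or a skewness far from zero,
# suggests that the median may be the safer fill value
print(f"mean={mean_val:.3f}  median={median_val:.3f}  skewness={skewness:.3f}")
```

If the mean and median are close, as they are here, the choice of strategy is unlikely to matter much.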


In [1]:
import numpy as np
from sklearn.impute import SimpleImputer

y=np.array([[780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
    [5.1,4.5,np.nan,3.3,3.6,9.3,6.7,2.8,5.4,np.nan,7.8,np.nan,np.nan,10.1,6.7,np.nan],
    [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
    [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]])
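# SimpleImputer works column by column, so put each variable in its own column: shape (16, 4)
y=y.transpose()
# Replace every NaN with the mean of the observed values in the same column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
y=imp.fit_transform(y)
print(y)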

[[7.80000000e+02 5.10000000e+00 7.80000000e+04 5.00000000e-01]
 [7.50000000e+02 4.50000000e+00 7.50000000e+04 5.50000000e-01]
 [6.90000000e+02 5.93636364e+00 1.00000000e+05 1.00000000e-01]
 [7.10000000e+02 3.30000000e+00 7.10000000e+04 6.00000000e-01]
 [6.80000000e+02 3.60000000e+00 6.80000000e+04 7.00000000e-01]
 [7.30000000e+02 9.30000000e+00 7.00000000e+04 4.50000000e-01]
 [6.90000000e+02 6.70000000e+00 6.90000000e+04 5.60000000e-01]
 [7.20000000e+02 2.80000000e+00 7.20000000e+04 7.30000000e-01]
 [7.40000000e+02 5.40000000e+00 7.40000000e+04 4.50000000e-01]
 [9.00000000e+02 5.93636364e+00 6.90000000e+04 6.70000000e-01]
 [9.50000000e+02 7.80000000e+00 1.02000000e+05 4.30000000e-01]
 [9.75000000e+02 5.93636364e+00 1.01000000e+05 2.30000000e-01]
 [9.95000000e+02 5.93636364e+00 7.90000000e+04 7.80000000e-01]
 [1.00000000e+03 1.01000000e+01 1.14000000e+05 4.20000000e-01]
 [1.01000000e+03 6.70000000e+00 1.01000000e+05 3.60000000e-01]
 [1.02000000e+03 5.93636364e+00 9.50000000e+04 2.30000000e-01]]

We have to transpose the matrix so that each variable sits in its own column, because SimpleImputer imputes column by column.
You will notice that the missing values in the second column have now been replaced with 5.936, the mean of the observed values in that column. We now fit a simple regression model and note the results.
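
Before fitting the regression, the fill value can be checked directly. The following is a minimal sketch, assuming only the second variable is passed to the imputer as a single column. The `statistics_` attribute of a fitted `SimpleImputer` holds the value used for each column, and here it should agree with `np.nanmean`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Second variable from the example above, reshaped to one column (16 rows, 1 column)
x2 = np.array([5.1, 4.5, np.nan, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4,
               np.nan, 7.8, np.nan, np.nan, 10.1, 6.7, np.nan]).reshape(-1, 1)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
x2_filled = imp.fit_transform(x2)

# statistics_ stores the per-column fill value learned during fit
fill_value = imp.statistics_[0]
print(fill_value, np.nanmean(x2))

# Every entry that was missing should now hold exactly that fill value
missing_mask = np.isnan(x2)
all_filled_with_mean = np.allclose(x2_filled[missing_mask], fill_value)
print(all_filled_with_mean)
```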

In [3]:
import statsmodels.formula.api as sm
import statsmodels.stats.stattools as st
import statsmodels.stats.api as sms
import pandas as pd
df=pd.DataFrame(y)
df.columns=['X1','X2','X3','Y']

formula_str="Y~X1 + X2 +X3"

result=sm.ols(formula=formula_str,data=df).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.597
Model:                            OLS   Adj. R-squared:                  0.497
Method:                 Least Squares   F-statistic:                     5.931
Date:                Tue, 11 Feb 2025   Prob (F-statistic):             0.0101
Time:                        14:29:50   Log-Likelihood:                 11.472
No. Observations:                  16   AIC:                            -14.94
Df Residuals:                      12   BIC:                            -11.85
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9979      0.226      4.413      0.0



Now we will re-run the analysis but this time with the median as our estimate of the missing value.

In [4]:
y=np.array([[780,750,690,710,680,730,690,720,740,900,950,975,995,1000,1010,1020],
    [5.1,4.5,np.nan,3.3,3.6,9.3,6.7,2.8,5.4,np.nan,7.8,np.nan,np.nan,10.1,6.7,np.nan],
    [78000,75000,100000,71000,68000,70000,69000,72000,74000,69000,102000,101000,79000,114000,101000,95000],
    [0.5,0.55,0.1,0.6,0.7,0.45,0.56,0.73,0.45,0.67,0.43,0.23,0.78,0.42,0.36,0.23]])

y=y.transpose()
imp = SimpleImputer(missing_values=np.nan, strategy='median')
#imp.fit(y)
y=imp.fit_transform(y)
print(y)
df=pd.DataFrame(y)
df.columns=['X1','X2','X3','Y']

formula_str="Y~X1 + X2 +X3"

result=sm.ols(formula=formula_str,data=df).fit()
print(result.summary())

[[7.80e+02 5.10e+00 7.80e+04 5.00e-01]
 [7.50e+02 4.50e+00 7.50e+04 5.50e-01]
 [6.90e+02 5.40e+00 1.00e+05 1.00e-01]
 [7.10e+02 3.30e+00 7.10e+04 6.00e-01]
 [6.80e+02 3.60e+00 6.80e+04 7.00e-01]
 [7.30e+02 9.30e+00 7.00e+04 4.50e-01]
 [6.90e+02 6.70e+00 6.90e+04 5.60e-01]
 [7.20e+02 2.80e+00 7.20e+04 7.30e-01]
 [7.40e+02 5.40e+00 7.40e+04 4.50e-01]
 [9.00e+02 5.40e+00 6.90e+04 6.70e-01]
 [9.50e+02 7.80e+00 1.02e+05 4.30e-01]
 [9.75e+02 5.40e+00 1.01e+05 2.30e-01]
 [9.95e+02 5.40e+00 7.90e+04 7.80e-01]
 [1.00e+03 1.01e+01 1.14e+05 4.20e-01]
 [1.01e+03 6.70e+00 1.01e+05 3.60e-01]
 [1.02e+03 5.40e+00 9.50e+04 2.30e-01]]
                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.594
Model:                            OLS   Adj. R-squared:                  0.492
Method:                 Least Squares   F-statistic:                     5.846
Date:                Tue, 11 Feb 2025   Prob (F-statistic):



There isn't much difference between the results: R-squared is 0.597 with mean imputation and 0.594 with median imputation.

Insert a number of high values into the experience variable and re-run your analysis. What happens?
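
One possible starting point for this exercise is sketched below. It rests on two assumptions of mine: that the variable to modify is the second one (the column with the missing values), and that the three large values 40, 45 and 50 are reasonable stand-ins for "high" values. Both are hypothetical choices, so substitute your own. The helper `fill_value` is also just an illustrative name.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Second variable from the example above (assumed to be the one to modify)
x2 = np.array([5.1, 4.5, np.nan, 3.3, 3.6, 9.3, 6.7, 2.8, 5.4,
               np.nan, 7.8, np.nan, np.nan, 10.1, 6.7, np.nan])

# Hypothetical outliers: overwrite three observed entries with large values
x2_out = x2.copy()
x2_out[[0, 5, 13]] = [40.0, 45.0, 50.0]

def fill_value(values, strategy):
    """Return the value SimpleImputer would use to fill a single column."""
    imp = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imp.fit(values.reshape(-1, 1))
    return imp.statistics_[0]

# Compare the fill values before and after adding the outliers
for label, col in [("original", x2), ("with outliers", x2_out)]:
    print(label,
          "| mean fill:", round(fill_value(col, "mean"), 3),
          "| median fill:", round(fill_value(col, "median"), 3))
```

With these particular values the mean fill moves much further than the median fill. Re-running the regression with each imputed column shows how that difference carries through to the fitted model.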