# Python for Data Science Practice Session 4 : Mathematics and Statistics

In [2]:
#Import pandas, numpy, numpy.random and matplotlib.pyplot 


## Missing data types: A closer look

In this section, I would like to talk more about missing values. They are one of the most influential problems that Data Scientists have to deal with carefully, and the same goes for practitioners in other fields who analyse data to help form decisions and conclusions. Together, we will go through the three main types of missing values: Missing Completely at Random, Missing at Random and Missing Not at Random. Then, we will build on the imputation methods discussed in this week's teaching session by using two more complex ones, namely Expectation Maximization and Linear Regression. We will also make use of the dataframe reshaping methods that we learned in this week's teaching session to make our lives easier throughout this project session.

Let's start by generating a mock dataset that we will use to demonstrate the different missing value types. This dataset will contain simulated data of three consecutive lap times for ten F1 racing drivers on a track.

Start by creating a Pandas dataframe with column labels 'D1', 'D2', 'D3', ... , 'D10' representing the ten drivers we have, and indexes 'Lap1', 'Lap2' representing the first two lap times for each driver.

In [3]:
#Create the dataframe as specified above
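
One possible way to set this up, as a minimal sketch: the variable names `drivers` and `laps` are my own choices, and the frame starts out filled with NaN so that the simulated values can be written into it afterwards.

```python
import numpy as np
import pandas as pd

# Column labels 'D1' ... 'D10', one per driver
drivers = [f"D{i}" for i in range(1, 11)]

# One row per lap; the values start as NaN and are filled in by the simulation step
laps = pd.DataFrame(index=["Lap1", "Lap2"], columns=drivers, dtype=float)

print(laps.shape)
```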


Now, we simulate lap time values from normal distributions. Assume that the lap 1 times are normally distributed with a mean of 150 and a standard deviation of 10, and that the lap 2 times are normally distributed with a mean of 145 and a standard deviation of 10. Simulating the data in this way allows some drivers to improve their lap time on their second attempt and others to get slower.

In [4]:
#Simulate data from normal distributions for laps 1 and 2 using the means and standard deviations specified above
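
A sketch of the simulation step, assuming NumPy's `default_rng` generator with an arbitrary seed (the seed and the variable names are not part of the exercise; any seed works).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

drivers = [f"D{i}" for i in range(1, 11)]
laps = pd.DataFrame(index=["Lap1", "Lap2"], columns=drivers, dtype=float)

# Lap 1: mean 150, standard deviation 10; Lap 2: mean 145, standard deviation 10
laps.loc["Lap1"] = rng.normal(loc=150, scale=10, size=len(drivers))
laps.loc["Lap2"] = rng.normal(loc=145, scale=10, size=len(drivers))

print(laps.round(1))
```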


In [5]:
#View the dataset


The structure of the dataframe is not visually pleasing in my opinion, especially since we have ten columns and only two rows. Try transposing the dataframe: 

In [6]:
#Transpose the dataset


In [7]:
#View the dataset


This looks much better, let us now proceed. 

Now, create a column for lap 3 times. Drivers whose lap 2 time is better than their lap 1 time have their lap 3 time generated from a normal distribution with a mean of 140 and a standard deviation of 5, and drivers whose lap 2 time is worse than their lap 1 time have their lap 3 time generated from a normal distribution with a mean of 155 and a standard deviation of 5. This can be thought of in the following sense: the drivers who do better in lap 2 than in lap 1 are more likely to be motivated and to improve again with a faster lap 3 time, while drivers who do worse in lap 2 might be disappointed and thus more likely to do worse in lap 3. (Of course, you can argue that this is not usually the case in real life, but for simplicity we will proceed with it, as it will help us explain the different missing data types.)

In [8]:
#Create the column Lap 3 following the scenario mentioned above
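
The scenario above could be sketched as follows. This builds the already-transposed frame directly (one row per driver), which is a shortcut of mine rather than the route taken in the earlier cells, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
drivers = [f"D{i}" for i in range(1, 11)]

# One row per driver, as obtained after transposing
laps = pd.DataFrame(
    {"Lap1": rng.normal(150, 10, 10), "Lap2": rng.normal(145, 10, 10)},
    index=drivers,
)

# A smaller time is a better time, so "improved" means Lap2 < Lap1
improved = laps["Lap2"] < laps["Lap1"]

# Draw from N(140, 5) for improvers and from N(155, 5) for everyone else
laps["Lap3"] = np.where(
    improved,
    rng.normal(140, 5, len(laps)),
    rng.normal(155, 5, len(laps)),
)

print(laps.round(1))
```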


In [9]:
#View the dataset


## Missing Completely At Random (MCAR)

As the name suggests, data are said to be missing completely at random if they are missing due to completely random reasons. There is no specific pattern in the dataset: 
- No relationship between whether a value is missing and the values of the other variables.
- No relationship between whether a value is missing and the value of the variable itself. 

Examples of this in our dataset would be:
- The tracking device on one of the cars suddenly got damaged.
- A car's engine broke down before finishing one of the laps.

We are now going to create a column called MCAR which could represent Lap 3 times from real life data where the missing Lap 3 times for some of the drivers are missing completely at random. 

There are plenty of ways to do so, but the one I would hint towards is multiplying each lap 3 time by a number simulated from a binomial distribution with one trial and a probability of success of 0.5, and then replacing the resulting zeros with NaN. 

In [10]:
#Create the column called MCAR for Lap 3 times, with some of them missing completely at random
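
A sketch of one way to do this. Multiplying by a 0/1 draw leaves zeros rather than missing values, so this version uses the draw as a mask instead; the frame is rebuilt here so that the snippet runs on its own, and the seed and the name `keep` are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
drivers = [f"D{i}" for i in range(1, 11)]
laps = pd.DataFrame(
    {"Lap1": rng.normal(150, 10, 10), "Lap2": rng.normal(145, 10, 10)},
    index=drivers,
)
laps["Lap3"] = np.where(
    laps["Lap2"] < laps["Lap1"], rng.normal(140, 5, 10), rng.normal(155, 5, 10)
)

# One Bernoulli(0.5) draw per driver: 1 keeps the Lap3 value, 0 drops it
keep = rng.binomial(n=1, p=0.5, size=len(laps))

# where() keeps values where the condition holds and writes NaN elsewhere
laps["MCAR"] = laps["Lap3"].where(keep == 1)

print(laps.round(1))
```

The draw does not look at any lap time, which is what makes the resulting gaps MCAR.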


In [11]:
#View the dataset


## Missing At Random (MAR)

Missing values are said to be Missing at Random if they are missing due to a dependence on other variable/s. This means that for any observation, whether a variable is missing depends on the values of other variable/s. An example of this in our dataset would be: 

- Drivers whose lap 2 time was 10 or more seconds slower than their lap 1 time are ordered by their teams not to attempt a third lap, as this might indicate loss of power in the car's engine. So, the availability of lap 3 times is affected by the times for laps 1 and 2. 

In [12]:
#Create a column called MAR that follows the scenario in the above bulletpoint
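
A possible sketch of the MAR scenario, again rebuilding the frame so the snippet is self-contained (seed and variable names are my own). With only ten drivers it can happen that nobody meets the 10-second rule, in which case the column has no missing values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
drivers = [f"D{i}" for i in range(1, 11)]
laps = pd.DataFrame(
    {"Lap1": rng.normal(150, 10, 10), "Lap2": rng.normal(145, 10, 10)},
    index=drivers,
)
laps["Lap3"] = np.where(
    laps["Lap2"] < laps["Lap1"], rng.normal(140, 5, 10), rng.normal(155, 5, 10)
)

# Team rule: no third lap if lap 2 was at least 10 seconds slower than lap 1
slower_by_10 = (laps["Lap2"] - laps["Lap1"]) >= 10

# mask() writes NaN where the condition holds and keeps the value elsewhere
laps["MAR"] = laps["Lap3"].mask(slower_by_10)

print(laps.round(1))
```

The missingness here depends only on the observed Lap1 and Lap2 columns, never on the Lap3 value itself.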


In [13]:
#View the dataset


## Missing Not at Random (MNAR)

Missing values are said to be Missing Not at Random if they are missing because of the values themselves. This might sound unintuitive, but here is an example from our dataset that will make it clearer: 

- Assume that drivers on this track generally score in the range of 120-160 seconds. Drivers who score above 160 seconds on Lap 3 get disheartened by that and sometimes decide not to report their scores as they are afraid it will reflect badly on their performance. This means that missing values in lap 3 are missing because of the time scored in lap 3.

In [14]:
#Create a column called MNAR that follows the scenario in the above bulletpoint
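
A sketch of the MNAR scenario. The text says such drivers "sometimes" withhold their time, so the 0.5 probability of withholding used here is my own assumption, as are the seed and the names `slow` and `hides`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
drivers = [f"D{i}" for i in range(1, 11)]
laps = pd.DataFrame(
    {"Lap1": rng.normal(150, 10, 10), "Lap2": rng.normal(145, 10, 10)},
    index=drivers,
)
laps["Lap3"] = np.where(
    laps["Lap2"] < laps["Lap1"], rng.normal(140, 5, 10), rng.normal(155, 5, 10)
)

# Only lap 3 times above 160 seconds are candidates for being withheld
slow = laps["Lap3"] > 160

# Assumed: a slow driver withholds the time with probability 0.5
hides = rng.binomial(1, 0.5, len(laps)) == 1

laps["MNAR"] = laps["Lap3"].mask(slow & hides)

print(laps.round(1))
```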


In [15]:
#View the dataset


A good question right now would be: How to detect whether the missing values in your dataset are MCAR, MAR or MNAR? 

Here is a good discussion I found on Kaggle: https://www.kaggle.com/questions-and-answers/105010 

Now, we move onto the more advanced missing value imputation methods, starting with Expectation Maximization.

## Expectation Maximization

Expectation Maximization is an iterative algorithm used to calculate the maximum likelihood estimates for the parameters of a statistical model that depends on latent variables (or unobserved variables). Because it works with statistical models involving unobserved variables, it is well suited to missing data imputation.

In more detail, suppose you have an explicit form for the joint distribution of $X_{obs}$ (the observed data) and $X_{mis}$ (the missing data). The goal is to estimate the parameters $\theta \in \mathbb{R}^{d}$ of the statistical model for the joint distribution of ($X_{obs}$,$X_{mis}$) by maximizing the likelihood function of the marginal distribution of $X_{obs}$. We have that:
$$
L(\theta ; X_{obs}) = p(X_{obs} | \theta) = \int p(X_{obs},X_{mis} | \theta) \, dX_{mis} = \int p(X_{obs} | X_{mis},\theta) p(X_{mis}|\theta) \, dX_{mis} 
$$
As we do not observe the missing values $X_{mis}$, and the integral over them often has no closed form, we cannot always compute or maximize the above explicitly. 

This is where EM comes into play. It obtains the maximum likelihood estimates for the marginal distribution of $X_{obs}$ by iteratively maximizing the expected complete-data log likelihood. Here are the two steps of the iteration:

Start with an initial estimate $\theta^{(0)}$ and let $\theta^{(t)}$ be the iterate for the parameter $\theta$ in the  $t^{th}$ iteration of the algorithm.

<b> 1) Expectation: </b> Compute the expectation of the complete-data log likelihood with respect to the conditional distribution of $X_{mis}$ given $X_{obs}$ and the current estimate $\theta^{(t)}$: 
$$
Q(\theta, \theta^{(t)}) = \mathbb{E}_{X_{mis} | X_{obs},\theta^{(t)}}[\log L(\theta ; X_{obs}, X_{mis})]
$$

<b> 2) Maximization: </b> Obtain the value of $\theta$ that maximizes $Q(\theta, \theta^{(t)})$:

$$
\theta^{(t+1)} = \underset{\theta}{\operatorname{arg\,max}} \, Q(\theta, \theta^{(t)})
$$

The algorithm keeps iterating until the difference between successive estimates is negligible, at which point it is considered to have converged.

In a nutshell, the algorithm starts with $\theta^{(0)}$, then it estimates the values for the missing data using the observed data and the parameters $\theta^{(0)}$, then it calculates the maximum likelihood parameter estimates for the complete-data, then it estimates new values for the previous missing data, and then the process keeps repeating until it converges. 


Expectation Maximization gives valid estimates only for MCAR and MAR missing data, and it works particularly well with distributions from the exponential family. One thing to bear in mind is that it converges to a local maximum of the likelihood, so for multimodal likelihoods the global maximum might not be found. 
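
To make the two steps concrete, here is a minimal sketch of EM for one specific case: multivariate normal data where only the last coordinate has gaps. The function name `em_mvn_last_column`, the example parameters and the seed are all hypothetical choices of mine, and this is not how any particular library implements EM. In the E-step the missing coordinate is replaced by its conditional mean given the observed coordinates, and the conditional variance is carried into the covariance update in the M-step.

```python
import numpy as np

def em_mvn_last_column(X, n_iter=200, tol=1e-8):
    """EM for a multivariate normal where only the last column contains NaN."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    miss = np.isnan(X[:, -1])

    # theta^(0): mean-fill the gaps and take the sample mean and covariance
    filled = X.copy()
    filled[miss, -1] = np.nanmean(X[:, -1])
    mu = filled.mean(axis=0)
    sigma = np.cov(filled, rowvar=False, bias=True)

    for _ in range(n_iter):
        # E-step: conditional mean and variance of the last coordinate
        beta = np.linalg.solve(sigma[:-1, :-1], sigma[:-1, -1])
        cond_var = sigma[-1, -1] - sigma[-1, :-1] @ beta
        filled[miss, -1] = mu[-1] + (filled[miss, :-1] - mu[:-1]) @ beta

        # M-step: maximum likelihood estimates from the expected statistics
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = centered.T @ centered / n
        new_sigma[-1, -1] += miss.sum() * cond_var / n

        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return mu, sigma, filled

rng = np.random.default_rng(0)  # arbitrary seed
true_mu = np.array([10.0, 12.0, 8.0])
true_sigma = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])
full = rng.multivariate_normal(true_mu, true_sigma, size=500)

# Remove the third entry completely at random
with_gaps = full.copy()
with_gaps[rng.binomial(1, 0.5, size=500) == 0, 2] = np.nan

mu_hat, sigma_hat, imputed = em_mvn_last_column(with_gaps)
print(mu_hat)
print(sigma_hat)
```

The returned `imputed` array holds the conditional means from the final E-step, which is what one would use as the imputed values.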

- - - - - -

Let us work through a simulated dataset from a multivariate normal distribution, where we are going to remove data randomly with the help of a binomial distribution as we did before, so that they fall under the MCAR missing data category.

We are going to simulate random vectors of size 3 from a multivariate normal distribution with the following mean vector and variance-covariance matrix (feel free to change the parameters if you'd like):

In [None]:
#The mean vector
Mean = random.normal(10,5,3)
Mean

In [None]:
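#The variance-covariance matrix
#Note: this all-ones matrix is singular (rank 1), so the three entries are perfectly correlated;
#swap in a positive-definite matrix if you want a less degenerate example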
Cov_matrix = np.array([[1,1,1], [1,1,1], [1,1,1]])
Cov_matrix

In [None]:
#Generate 200 vectors from a multivariate normal distribution with the mean vector and var-cov matrix specified above 
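
A sketch of the sampling step. It assumes a positive-definite covariance matrix of my own choosing rather than the all-ones matrix above, because the all-ones matrix makes the three entries move in lockstep; the seed and the names `mean`, `cov` and `data` are also mine.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

# Mean vector drawn as in the cell above: three values from N(10, 5)
mean = rng.normal(10, 5, 3)

# Assumed positive-definite variance-covariance matrix (illustrative values)
cov = np.array([[1.0, 0.5, 0.3],
                [0.5, 1.0, 0.4],
                [0.3, 0.4, 1.0]])

# 200 draws, one vector of size 3 per row
data = rng.multivariate_normal(mean, cov, size=200)

print(data[:10])
```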


In [None]:
#View the first 10 vectors


Now, we are going to remove some of the data randomly from the third entry of the vectors so that we can then try imputing them using EM and other simple imputation methods. 

Let us start by creating a copy of the array, so that we have the full array and another array with MCAR values in the third entry that we can impute and compare to the original array.

In [None]:
#Create a copy from the original array


In [None]:
#Remove data from the third entry randomly (as done before)
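
The copy-and-remove step could look like this. The data are regenerated with illustrative parameters so that the snippet runs on its own; `data_mcar` and `drop` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])  # assumed values
data = rng.multivariate_normal([10.0, 12.0, 8.0], cov, size=200)

# Work on a copy so the complete array stays available for comparison
data_mcar = data.copy()

# One Bernoulli(0.5) draw per row: a 0 means the third entry is removed
drop = rng.binomial(1, 0.5, size=len(data)) == 0
data_mcar[drop, 2] = np.nan

print(np.isnan(data_mcar[:, 2]).sum(), "values removed")
```

Using `.copy()` matters here: a plain assignment would give a second name for the same array, and the original values would be overwritten too.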


Install a library called `impyute` using `pip install impyute`. 

Then, import impyute as impy into the notebook:

In [None]:
#import impyute


Quickly research how impyute is used, and then create three arrays where each array has its missing values imputed using one of the following methods: 

- Expectation Maximization
- Mean
- Median 

In [None]:
#Create the array imputed with EM


In [None]:
#Create the array imputed with the mean


In [None]:
#Create the array imputed with the median
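
The exercise asks for `impyute`, but as a library-independent cross-check, mean and median imputation can also be sketched with NumPy alone. This is a swapped-in approach for comparison, not the `impyute` API, and the data and names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])  # assumed values
data = rng.multivariate_normal([10.0, 12.0, 8.0], cov, size=200)
data_mcar = data.copy()
data_mcar[rng.binomial(1, 0.5, size=200) == 0, 2] = np.nan

missing = np.isnan(data_mcar[:, 2])

# Mean imputation: every gap gets the mean of the observed third entries
mean_imputed = data_mcar.copy()
mean_imputed[missing, 2] = np.nanmean(data_mcar[:, 2])

# Median imputation: same idea with the median of the observed third entries
median_imputed = data_mcar.copy()
median_imputed[missing, 2] = np.nanmedian(data_mcar[:, 2])

print(mean_imputed[missing][:5])
print(median_imputed[missing][:5])
```

Both methods fill every gap with a single constant, which is why they flatten the variability of the imputed column.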


## Comparing them together

Now, we are going to identify the entries that were imputed by the three imputation methods, and then compare them by concatenating them alongside the original data and using matplotlib to plot the data from the three imputation methods as well as the original data. 

Let us start by specifying the indices of the observations where there was a missing value in the third entry: 


In [None]:
#Create an array with the indices of the observations where there was a missing value in the third entry


In [None]:
#Create an array with the EM imputed values 


In [None]:
#Create an array with the mean imputed values


In [None]:
#Create an array with the median imputed values 


In [None]:
#Create an array with the original data 


Now, create three datasets where each dataset contains the imputed values from one data imputation method, and one final dataset with the original data.

In [None]:
#Create a dataset with the EM imputed values


In [None]:
#Create a dataset with the mean imputed values


In [None]:
#Create a dataset with the median imputed values


In [None]:
#Create a dataset with the original values


Now, concatenate the four datasets together to help in the comparison between the values from the different imputation methods:

In [None]:
#Concatenate the above four datasets 
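
A sketch of the concatenation with `pd.concat`. To stay self-contained it builds only the mean, median and original datasets with NumPy; an EM dataset would be added to the list in the same way. All names and parameter values are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])  # assumed values
data = rng.multivariate_normal([10.0, 12.0, 8.0], cov, size=200)
data_mcar = data.copy()
data_mcar[rng.binomial(1, 0.5, size=200) == 0, 2] = np.nan

# Row indices of the observations whose third entry is missing
idx = np.flatnonzero(np.isnan(data_mcar[:, 2]))

# One single-column dataset per method, all sharing the same index
mean_df = pd.DataFrame({"Mean": np.full(idx.size, np.nanmean(data_mcar[:, 2]))}, index=idx)
median_df = pd.DataFrame({"Median": np.full(idx.size, np.nanmedian(data_mcar[:, 2]))}, index=idx)
original_df = pd.DataFrame({"Original": data[idx, 2]}, index=idx)

# axis=1 places the datasets side by side, aligned on the shared index
comparison = pd.concat([mean_df, median_df, original_df], axis=1)

print(comparison.head())
```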


In [None]:
#View the new dataset


Now, plot a line graph for the data imputed from the three imputation methods as well as the original data in one plot:

In [None]:
#Plot a line graph for the data imputed from the three imputation methods and the original data 
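
A plotting sketch with matplotlib, under the same assumptions as before (illustrative data, NumPy-based mean and median columns, no EM column). Each column of the comparison frame becomes one line.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed
cov = np.array([[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]])  # assumed values
data = rng.multivariate_normal([10.0, 12.0, 8.0], cov, size=200)
data_mcar = data.copy()
data_mcar[rng.binomial(1, 0.5, size=200) == 0, 2] = np.nan
idx = np.flatnonzero(np.isnan(data_mcar[:, 2]))

comparison = pd.DataFrame(
    {
        "Mean": np.full(idx.size, np.nanmean(data_mcar[:, 2])),
        "Median": np.full(idx.size, np.nanmedian(data_mcar[:, 2])),
        "Original": data[idx, 2],
    },
    index=idx,
)

# One line per column, plotted against the observation index
fig, ax = plt.subplots(figsize=(10, 4))
for name in comparison.columns:
    ax.plot(comparison.index, comparison[name], label=name)
ax.set_xlabel("Observation index")
ax.set_ylabel("Third entry")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a plain script, add `plt.show()`.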


- - - - - -

## Linear Regression imputation

If you have a dataset where one of the variables has some missing values, and this variable correlates strongly with one or more other variables in the dataset, then Linear Regression imputation might work really well. 

It fits a linear regression model where the variable with the missing values is the target/response variable, and the other variable/s that correlate with it (and have no missing data) are the predictor variable/s. 

Because relationships/correlations between variables are preserved in this imputation method, it has an advantage over the simpler imputation methods such as the mean and median in some cases.

There are two versions for this imputation method: 

<b> 1) Deterministic Regression Imputation: </b>

This is when the missing values are imputed with the exact predictions from the linear regression model, without adding any error terms to them. A disadvantage of this is that it reduces the variability of the imputed variable, because the imputed values lie exactly on the regression hyperplane, which does not represent real-world data well. This leads to the second version.

<b> 2) Stochastic Regression Imputation: </b>

This follows the same idea of Deterministic Regression Imputation but with the addition of a random error term to the predicted value. 
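
The difference between the two versions can be sketched with NumPy alone. The slope, intercept, noise level and seed below are illustrative assumptions, and drawing the error term from a normal distribution with the residual standard deviation is one common choice rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
v1 = rng.normal(5, 1, 100)
v2 = 3.0 + 2.0 * v1 + rng.normal(0, 1, 100)  # assumed linear relationship plus noise

# Remove v2 values completely at random
missing = rng.binomial(1, 0.5, 100) == 0
observed = ~missing

# Least-squares line fitted on the complete cases only
slope, intercept = np.polyfit(v1[observed], v2[observed], deg=1)
residuals = v2[observed] - (intercept + slope * v1[observed])
residual_sd = residuals.std(ddof=2)  # two estimated coefficients

# Deterministic: the prediction itself, exactly on the fitted line
deterministic = intercept + slope * v1[missing]

# Stochastic: the prediction plus a random error term
stochastic = deterministic + rng.normal(0, residual_sd, missing.sum())

print("variance of true values:      ", v2[missing].var())
print("variance, deterministic fill: ", deterministic.var())
print("variance, stochastic fill:    ", stochastic.var())
```

Comparing the three printed variances shows how much spread each version keeps relative to the values that were removed.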

You will now work through an example of imputing missing data with both methods yourself, to see how each one performs. 

Again, let us simulate two variables that are correlated.

Start by generating 100 data points from a normal distribution of mean 5 and standard deviation 1. Call this array v1.

In [None]:
# Generate 100 data points from a normal distribution of mean 5 and standard deviation 1. 


Simulate another random variable that correlates with it. Remember to add a random noise term to prevent the two variables from correlating perfectly. Call this array v2.

In [None]:
#Generate another 100 data points that correlate with the previously generated 100 data points
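
One way to get a correlated second variable is to take a linear function of the first and add noise. The intercept, slope and noise level below are my own illustrative choices, and the arrays are named `v1` and `v2` to match the later cells.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# First variable: 100 points from N(5, 1)
v1 = rng.normal(5, 1, 100)

# Second variable: linear in v1 plus independent noise, so the correlation is strong but not perfect
v2 = 3.0 + 2.0 * v1 + rng.normal(0, 1, 100)

print(np.corrcoef(v1, v2)[0, 1])
```

A larger noise standard deviation weakens the correlation, and a smaller one pushes it towards 1.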


In [None]:
#Create a scatterplot for v2 against v1


Again, we will remove data randomly from v2 using the binomial distribution, so that the missing data fall under the MCAR type.

In [None]:
#Create a copy of v2.


In [None]:
#Remove data randomly


In [None]:
#View v2 with missing data


In [None]:
#Find the indices of the observations that have a missing value for v2


Now, we are going to fit a linear regression model to the observations with no missing data. Recall that in the third project session, we went through creating a linear regression model step by step. This time, we are going to make use of a library called `sklearn`, which provides a built-in class called LinearRegression. 

Read more about how to use it in the scikit-learn documentation, and then use it to fit a linear regression model where v1 is the predictor variable and v2 is the target/response variable. 


In [None]:
#Import LinearRegression from sklearn.linear_model


In [None]:
#Find the indices of the observations that have no missing values 


In [None]:
#Create X_train for the v1 values for observations with no missing values


In [None]:
#Create y_train for the v2 values for observations with no missing values


The LinearRegression class requires the X_train values to follow a specific format: a 2D array where each entry is an array on its own, and all of those individual arrays are stacked on top of each other. I will reformat it for you.

In [204]:
X_train = np.reshape(X_train, [X_train.size,1])

In [18]:
#Fit the linear regression model
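
A sketch of the fitting step with `sklearn.linear_model.LinearRegression`. The data are regenerated with illustrative parameters so the snippet runs on its own; `v2_missing` and `complete` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)  # arbitrary seed
v1 = rng.normal(5, 1, 100)
v2 = 3.0 + 2.0 * v1 + rng.normal(0, 1, 100)  # assumed relationship

# Copy of v2 with values removed completely at random
v2_missing = v2.copy()
v2_missing[rng.binomial(1, 0.5, 100) == 0] = np.nan

# Indices of the observations with no missing value
complete = np.flatnonzero(~np.isnan(v2_missing))

# Predictors must be 2D: one row per observation, one column per feature
X_train = v1[complete].reshape(-1, 1)
y_train = v2_missing[complete]

model = LinearRegression().fit(X_train, y_train)

print(model.intercept_, model.coef_)
```

`reshape(-1, 1)` is equivalent to the `np.reshape(X_train, [X_train.size, 1])` call above.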


## Deterministic Linear Regression

Let us start by imputing the missing values using the Deterministic Linear Regression approach. 

Start by using the fitted regression model to predict values for the missing data:

In [19]:
#Create an array of the v1 values of the observations that have missing values 


In [20]:
#Reshaping of the array to be suitable for usage in the regression model


In [21]:
#Create an array for the missing v2 values 


Now, create a dataset where each row has the imputed value of v2 by linear regression, and the original value of v2 from the original array:

In [22]:
#Create the dataset as specified above
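
The prediction and comparison steps could be sketched as follows, again with regenerated illustrative data and hypothetical names. The fitted model predicts v2 for exactly those observations whose v2 value was removed, and the result is placed next to the original values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)  # arbitrary seed
v1 = rng.normal(5, 1, 100)
v2 = 3.0 + 2.0 * v1 + rng.normal(0, 1, 100)  # assumed relationship
v2_missing = v2.copy()
v2_missing[rng.binomial(1, 0.5, 100) == 0] = np.nan

# Fit on the complete cases
complete = np.flatnonzero(~np.isnan(v2_missing))
model = LinearRegression().fit(v1[complete].reshape(-1, 1), v2_missing[complete])

# v1 values of the observations whose v2 is missing, reshaped to 2D for predict()
missing_idx = np.flatnonzero(np.isnan(v2_missing))
X_missing = v1[missing_idx].reshape(-1, 1)

# Deterministic imputation: the model's prediction, with no error term added
v2_imputed = model.predict(X_missing)

comparison = pd.DataFrame(
    {"Imputed": v2_imputed, "Original": v2[missing_idx]},
    index=missing_idx,
)
print(comparison.head())
```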


In [23]:
#View the dataset


In [24]:
#Plot the observations using the original values and the observations with the imputed values


Now, I will leave you to try and impute data using Stochastic Linear Regression. This resource might help you out: https://www.kaggle.com/shashankasubrahmanya/missing-data-imputation-using-regression but feel free to do your own research.