https://towardsdatascience.com/why-using-a-mean-for-missing-data-is-a-bad-idea-alternative-imputation-algorithms-837c731c1008

### Why using a mean for missing data is a bad idea. Alternative imputation algorithms.

- We all know the pain when the dataset we want to use for Machine Learning contains missing data.

- The quick and easy workaround is to substitute the **mean for numerical features** and the **mode for categorical ones**.

- Even worse, some people just insert 0s or discard the incomplete data and go straight to training the model.

### Mean and mode ignore feature correlations

- Let’s look at a very simple example to visualize the problem. The following table has three variables: Age, Gender and Fitness Score. It shows the Fitness Score results (0–10) achieved by people of different ages and genders.

![10_Missing_Values](image/2.JPG)

- Now let’s assume that some of the Fitness Score values are missing and fill them in with mean imputation, so that we can compare the two tables.

![10_Missing_Values](image/3.JPG)

- The imputed values don’t really make sense. In fact, they can hurt accuracy when we train our ML model.
  
- For example, the 78-year-old woman now has a Fitness Score of 5.1, which is typical for people aged between 42 and 60.

- Mean imputation doesn’t take into account the fact that Fitness Score is correlated with the Age and Gender features. It simply inserts 5.1, the mean of the Fitness Score, and ignores potential feature correlations.
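
- To make this concrete, here is a minimal sketch of mean imputation. The rows are hypothetical (they are not the values from the table above), and `rows`, `fill_value` and `imputed` are illustrative names.

```python
from statistics import mean

# Hypothetical (age, fitness_score) rows; None marks a missing score.
rows = [(22, 9.0), (35, 7.0), (48, 5.0), (61, 4.0), (78, None), (25, None)]

# Mean imputation computes one number from the observed scores...
observed = [score for _, score in rows if score is not None]
fill_value = mean(observed)

# ...and writes it into every gap, without ever looking at the age column.
imputed = [(age, fill_value if score is None else score) for age, score in rows]

for age, score in imputed:
    print(age, score)
```

- In this toy data the 78-year-old and the 25-year-old end up with the same score, because the fill value never depends on Age.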

### Mean reduces the variance of the data

- Based on the previous example, the variance of the real Fitness Scores differs from the variance of their mean-imputed equivalents. The figure below shows the variance in both cases:

![10_Missing_Values](image/4.JPG)

- As we can see, the variance dropped after mean imputation (the change is this large because the dataset is very small). Every imputed value sits exactly at the mean, so it adds nothing to the spread while still increasing the number of data points. Going deeper into the mathematics, a smaller variance leads to a narrower confidence interval in the probability distribution[3]. In other words, mean imputation introduces a bias into our model.
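
- The shrinkage can be checked numerically. The sketch below uses hypothetical scores (not the ones in the figure) and the population variance from Python's `statistics` module; all variable names are illustrative.

```python
from statistics import mean, pvariance

# Hypothetical "real" Fitness Scores; pretend the first and last were never recorded.
true_scores = [9.0, 8.0, 7.0, 5.0, 4.0, 2.0]
missing_idx = {0, 5}

observed = [s for i, s in enumerate(true_scores) if i not in missing_idx]
fill_value = mean(observed)

# Mean imputation: every gap receives the mean of the observed scores.
imputed = [fill_value if i in missing_idx else s for i, s in enumerate(true_scores)]

var_true = pvariance(true_scores)    # variance of the real data
var_observed = pvariance(observed)   # variance of what was actually recorded
var_imputed = pvariance(imputed)     # variance after mean imputation

print(var_true, var_observed, var_imputed)
```

- Each imputed value adds zero squared deviation while the count grows, so with the population variance the imputed variance equals the observed variance scaled by `len(observed) / len(imputed)`.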

### Alternative Imputation Algorithms

- Fortunately, there are many brilliant alternatives to mean and mode imputation. Many of them are based on existing Machine Learning algorithms. The following list briefly describes the most popular methods, as well as a few lesser-known imputation techniques.

#### MICE
- MICE stands for Multivariate Imputation by Chained Equations. According to [4], it is the second most popular imputation method, right after the mean. First, a simple imputation (e.g. the mean) replaces the missing data in each variable, and we record the positions of the imputed values. Then we take each feature in turn and predict its missing values with a regression model, using the remaining features as the independent variables (predictors). The process is repeated several times, updating the imputed values on each pass. A common number of iterations is 10, but it depends on the dataset. A more detailed explanation of the algorithm can be found in [5].
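
- The loop described above can be sketched in plain Python. This is a simplified illustration, not a full MICE implementation: it handles only two numeric columns, uses a simple linear regression, and produces a single deterministic imputation, whereas real MICE draws several imputed datasets. The data and the names `fit_line` and `chained_impute` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def chained_impute(data, n_iter=10):
    """Fill None entries in two-column rows by chained regressions."""
    missing = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]

    # Step 1: start from a mean imputation and remember where it was used.
    for j in (0, 1):
        col_mean = mean(row[j] for row in data if row[j] is not None)
        for i, row in enumerate(filled):
            if missing[i][j]:
                row[j] = col_mean

    # Step 2: regress each column on the other, refresh its imputed cells, repeat.
    for _ in range(n_iter):
        for j in (0, 1):
            k = 1 - j  # the remaining feature acts as the predictor
            xs = [row[k] for i, row in enumerate(filled) if not missing[i][j]]
            ys = [row[j] for i, row in enumerate(filled) if not missing[i][j]]
            a, b = fit_line(xs, ys)
            for i, row in enumerate(filled):
                if missing[i][j]:
                    row[j] = a + b * row[k]
    return filled


# Hypothetical [age, fitness_score] rows; None marks a missing value.
data = [[20, 9.0], [30, 8.0], [40, 6.0], [50, 5.0], [60, None], [None, 3.0], [75, 2.0]]
filled = chained_impute(data)
for row in filled:
    print(row)
```

- Unlike the mean, the value filled in for the 60-year-old is pulled below the average score because the regression picks up the downward trend with age.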

#### KNN
- This popular imputation technique is based on the K-Nearest Neighbours algorithm. For a given instance with missing data, KNN Impute finds the k most similar neighbours and replaces the missing element with the mean or mode of the neighbours' values. The choice depends on the feature type: the mean for a continuous feature, the mode for a categorical one. A great paper for a more in-depth understanding is [6].
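
- A minimal sketch of the idea for a continuous feature, assuming Euclidean distance on unscaled features and hypothetical data. The function name `knn_impute_numeric` is illustrative; for a categorical feature the mean would be swapped for the mode.

```python
from math import dist  # Euclidean distance, Python 3.8+
from statistics import mean


def knn_impute_numeric(features, values, k=2):
    """Replace each None in `values` with the mean of its k nearest donors."""
    # Donors are the rows whose value was actually observed.
    donors = [(f, v) for f, v in zip(features, values) if v is not None]
    result = []
    for f, v in zip(features, values):
        if v is None:
            nearest = sorted(donors, key=lambda d: dist(f, d[0]))[:k]
            v = mean(val for _, val in nearest)
        result.append(v)
    return result


# Hypothetical data: one feature (age) per row, and a fitness score to impute.
ages = [(22,), (25,), (47,), (52,), (76,), (80,), (78,)]
scores = [9.0, 8.0, 5.0, 5.0, 2.0, 3.0, None]

filled = knn_impute_numeric(ages, scores, k=2)
print(filled[-1])
```

- With several features on different scales, the features would normally be standardized first so that no single one dominates the distance. scikit-learn also ships a ready-made `KNNImputer` in `sklearn.impute`.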

#### MissForest
- It is a non-standard but fairly flexible imputation algorithm. It uses a Random Forest at its core to predict the missing data. It can be applied to both continuous and categorical variables, which gives it an advantage over other imputation algorithms. See what the authors of MissForest wrote about its implementation in [7].
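
- The sketch below approximates the idea with scikit-learn rather than the original MissForest implementation: an `IterativeImputer` whose per-feature model is a `RandomForestRegressor`. It assumes numeric data only (Gender is pre-encoded as 0/1 and has no gaps), whereas MissForest itself also imputes categorical columns. The data is hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer

# Hypothetical rows: [age, gender (0/1), fitness_score]; np.nan marks a missing score.
X = np.array([
    [22, 0, 9.0],
    [25, 1, 8.0],
    [47, 0, 5.0],
    [52, 1, 5.0],
    [76, 0, 2.0],
    [80, 1, 3.0],
    [78, 1, np.nan],
])

# Each feature with gaps is predicted from the other features by a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled[-1])
```

- Because a forest averages observed target values, the imputed score always stays inside the range of scores seen in the data.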

#### Fuzzy K-means Clustering
- It is a lesser-known imputation technique, but according to [8] it proves to be more accurate and faster than basic clustering algorithms. It computes clusters of instances and fills in each missing value depending on which cluster the instance with missing data belongs to.
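
- One common way to do this is sketched below; it is not necessarily the exact procedure from [8]. Fuzzy c-means is run on the complete rows, then each incomplete row gets its memberships from the features it does have, and every gap is filled with the membership-weighted average of the cluster centers. The data, the initial centers and all function names are hypothetical, and the features are left unscaled for brevity.

```python
def memberships(point, centers, dims, m=2.0):
    """Fuzzy membership of `point` in each cluster, using only the features in `dims`."""
    d2 = [sum((point[d] - c[d]) ** 2 for d in dims) for c in centers]
    hits = [x == 0 for x in d2]
    if any(hits):  # the point sits exactly on one or more centers
        return [h / sum(hits) for h in hits]
    inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
    return [v / sum(inv) for v in inv]


def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Refine `centers` on complete rows with the usual fuzzy c-means updates."""
    dims = range(len(points[0]))
    for _ in range(n_iter):
        u = [memberships(p, centers, dims, m) for p in points]
        centers = [
            tuple(
                sum(u[i][c] ** m * p[d] for i, p in enumerate(points))
                / sum(u[i][c] ** m for i in range(len(points)))
                for d in dims
            )
            for c in range(len(centers))
        ]
    return centers


def impute_row(row, centers, m=2.0):
    """Fill None entries with a membership-weighted average of the cluster centers."""
    observed = [d for d, v in enumerate(row) if v is not None]
    u = memberships(row, centers, observed, m)
    return tuple(
        sum(w * c[d] for w, c in zip(u, centers)) if v is None else v
        for d, v in enumerate(row)
    )


# Hypothetical complete (age, fitness_score) rows and one row with a missing score.
complete = [(22, 9.0), (25, 8.0), (47, 5.0), (52, 5.0), (76, 2.0), (80, 3.0)]
centers = fuzzy_c_means(complete, centers=[complete[0], complete[2], complete[5]])
filled = impute_row((78, None), centers)
print(filled)
```

- Because memberships are soft, a row lying between two clusters receives a blend of both centers instead of being forced into one of them.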