#### There are three different types of missing data

1) Missing completely at random (MCAR)
2) Missing at random (MAR)
3) Not missing at random (NMAR)
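
To make the three mechanisms concrete, here is a minimal simulation sketch. It assumes the usual definitions (MCAR: missingness is unrelated to any variable; MAR: it depends only on observed variables; NMAR: it depends on the unobserved value itself). The column names and probabilities are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(20, 70, n),            # hypothetical, always observed
    "income": rng.normal(50_000, 15_000, n),   # hypothetical, will get gaps
})

# MCAR: every income value has the same 20% chance of being missing.
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: the chance of a missing income depends only on the observed age.
p_mar = np.where(df["age"] > 50, 0.4, 0.1)
mar = df["income"].mask(rng.random(n) < p_mar)

# NMAR: the chance of a missing income depends on the income itself.
p_nmar = np.where(df["income"] > 60_000, 0.5, 0.05)
nmar = df["income"].mask(rng.random(n) < p_nmar)

print(df["income"].mean(), mcar.mean(), mar.mean(), nmar.mean())
```

In the NMAR case the observed mean is pulled down because high incomes are the ones that go missing, which is why this mechanism is the hardest to correct by imputation alone.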

#### Popular ways to impute data in cross-sectional datasets

Source: https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

#### 1. Do Nothing:

We can simply let the algorithm handle the missing data. Some algorithms can factor in the missing values and learn the 
best way to treat them based on the training loss reduction (e.g., XGBoost, which learns a default branch direction for 
missing values at each split). Some others have the option to just ignore them (e.g., LightGBM with use_missing=false). 
However, other algorithms will throw an error complaining about the missing values (e.g., scikit-learn's LinearRegression). 
In that case, you will need to handle the missing data and clean it before feeding it to the algorithm.

#### 2. Imputation Using (Mean/Median) Values:

This works by calculating the mean/median of the non-missing values in a column and then replacing the missing values within 
each column separately and independently from the others. It can only be used with numeric data.

##### Pros:
Easy and fast.
Works well with small numerical datasets.

##### Cons:
Doesn’t factor in the correlations between features; it only works at the column level.
Will give poor results on encoded categorical features (do NOT use it on categorical features).
Not very accurate.
Doesn’t account for the uncertainty in the imputations.
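
A minimal sketch of mean and median imputation with scikit-learn's SimpleImputer. The toy columns are hypothetical; the deliberate outlier shows why the median can be the safer choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0],
    "weight": [50.0, np.nan, 65.0, 1000.0],  # 1000.0 is a deliberate outlier
})

# Each column is imputed on its own, using only its non-missing values.
mean_filled = SimpleImputer(strategy="mean").fit_transform(df)
median_filled = SimpleImputer(strategy="median").fit_transform(df)

print(mean_filled)
print(median_filled)
```

The missing weight becomes about 371.7 with the mean but 65.0 with the median, because the mean is dragged up by the outlier.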
    
#### 3. Imputation Using (Most Frequent) or (Zero/Constant) Values:
Most Frequent is another statistical strategy to impute missing values and, yes, it works with categorical features 
(strings or numerical representations) by replacing missing data with the most frequent value within each column.
    
##### Pros:
Works well with categorical features.

##### Cons:
It also doesn’t factor in the correlations between features.
It can introduce bias in the data.

Zero or Constant imputation, as the name suggests, replaces the missing values with either zero or any constant 
value you specify.
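
Both strategies are available in scikit-learn's SimpleImputer. The following is a small sketch on made-up data; the placeholder string "unknown" is just an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Categorical (string) columns with gaps.
cats = pd.DataFrame({
    "colour": ["red", "blue", np.nan, "red"],
    "size": ["S", np.nan, "M", "M"],
})

# Most frequent: each gap gets the mode of its own column.
most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(cats)

# Constant: every gap gets the same placeholder value.
constant = SimpleImputer(strategy="constant", fill_value="unknown").fit_transform(cats)

# Zero imputation on numeric data is the same idea with fill_value=0.
nums = np.array([[1.0, np.nan], [np.nan, 4.0]])
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(nums)

print(most_frequent)
print(constant)
print(zero_filled)
```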



#### Imputation Using k-NN:

The k-nearest-neighbours algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the 
new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for 
predicting missing values: find the k closest neighbours of the observation with missing data, then impute the gaps based 
on the non-missing values in the neighbourhood. 

The implementation described in the source article first fills the gaps with a basic mean imputation, then uses the resulting 
complete data to construct a KDTree. It queries that KDTree for the nearest neighbours (NN) of each incomplete row and, once it 
has found the k-NNs, imputes each missing value with their weighted average.
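
The KDTree-based procedure can be sketched as follows. This is an illustrative reimplementation using SciPy's cKDTree, not the code of any particular library; the function name, the inverse-distance weighting, and the toy data are assumptions of this sketch, and it expects more rows than k.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_knn_impute(X, k=2, eps=1e-9):
    """Hypothetical helper: mean impute, build a KDTree, then refine with k-NN."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Step 1: basic mean imputation so that every row is complete.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Step 2: build a KDTree on the completed rows.
    tree = cKDTree(filled)

    result = filled.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Step 3: ask for k + 1 neighbours because the row usually finds itself.
        dist, idx = tree.query(filled[i], k=k + 1)
        keep = idx != i
        dist, idx = dist[keep][:k], idx[keep][:k]

        # Step 4: inverse-distance weighted average of the neighbours' values.
        weights = 1.0 / (dist + eps)
        for j in np.flatnonzero(missing[i]):
            result[i, j] = np.average(filled[idx, j], weights=weights)
    return result

X = np.array([[1.0, 2.0],
              [1.1, np.nan],
              [5.0, 6.0],
              [np.nan, 6.2],
              [0.9, 1.9]])
X_imputed = kdtree_knn_impute(X, k=2)
print(X_imputed)
```

Note that a neighbour's value may itself be a mean-imputed placeholder, which is one reason this approach is only approximate.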

##### Pros:
Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
##### Cons:
Computationally expensive. KNN works by storing the whole training dataset in memory.
K-NN is quite sensitive to outliers in the data (unlike SVM)

Let's discuss KNN Imputer with an example:

In [3]:
!pip3 install -U scikit-learn



###### How does KNN Imputer work?
According to the scikit-learn docs: Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close. By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors.

In [11]:
# Create a DataFrame with missing values
import numpy as np
import pandas as pd

data = {'first': [112, 90, np.nan, 89],
        'second': [30, 45, 56, np.nan],
        'Third': [np.nan, 40, 80, 98]}

df = pd.DataFrame(data)
df

| | first | second | Third |
|---|---|---|---|
| 0 | 112.0 | 30.0 | NaN |
| 1 | 90.0 | 45.0 | 40.0 |
| 2 | NaN | 56.0 | 80.0 |
| 3 | 89.0 | NaN | 98.0 |


In [15]:
# Initialize KNNImputer
# You can choose your own n_neighbors value (as is typical of k-NN algorithms)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Fit on df and return a NumPy array with the missing values filled in
df_filled = imputer.fit_transform(df)
df_filled

array([[112. ,  30. ,  69. ],
       [ 90. ,  45. ,  40. ],
       [100.5,  56. ,  80. ],
       [ 89. ,  43. ,  98. ]])

#### 4. Imputation Using Multivariate Imputation by Chained Equations (MICE)

This type of imputation works by filling in the missing data multiple times. Multiple imputations (MIs) are much better than a 
single imputation because they capture the uncertainty of the missing values in a better way. The chained equations approach is 
also very flexible and can handle variables of different data types (e.g., continuous or binary) as well as complexities 
such as bounds or survey skip patterns. For more information on the algorithm mechanics, see the source article linked above.
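
One way to approximate multiple imputation in Python is to run scikit-learn's IterativeImputer several times with sample_posterior=True and different seeds. This is a sketch on synthetic data, not a full MICE workflow (it does not pool model estimates); the number of imputations and the missingness rate are arbitrary choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

mask = rng.random(X.shape) < 0.1  # roughly 10% of the cells go missing
X_missing = X.copy()
X_missing[mask] = np.nan

# Each run draws imputations from the predictive posterior with a new seed.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

stacked = np.stack(imputations)   # shape: (5, 100, 3)
pooled = stacked.mean(axis=0)     # average of the completed datasets
spread = stacked.std(axis=0)      # between-imputation variability per cell

print("mean spread at imputed cells:", spread[mask].mean())
```

The spread is zero for observed cells and positive for imputed ones, which is the uncertainty that a single imputation hides.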


#### 5. Imputation Using Deep Learning (Datawig)

This method works very well with categorical and non-numerical features. Datawig is a library that trains deep neural networks to impute missing values in a dataframe. 
It supports both CPU and GPU training.

##### Pros:
Quite accurate compared to other methods.
It has some functions that can handle categorical data (Feature Encoder).
It supports CPUs and GPUs.
##### Cons:
Single Column imputation.
Can be quite slow with large datasets.
You have to specify the columns that contain information about the target column that will be imputed.

#### 6. Other Imputation Methods:

##### a) Stochastic regression imputation:
It is quite similar to regression imputation, which predicts the missing values by regressing them on other related variables in the same dataset; the stochastic variant then adds a random residual value to each prediction.
##### b) Extrapolation and Interpolation:
Interpolation estimates a value within the range of a discrete set of known data points; extrapolation estimates one beyond that range.
##### c) Hot-Deck imputation:
Works by randomly choosing the missing value from a set of related and similar observations.
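
Two of these ideas are short enough to sketch directly. The example below is illustrative only: the data is synthetic, and the residual noise is assumed to be normal with the standard deviation of the fitted residuals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# --- Stochastic regression imputation ---
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
missing = rng.random(200) < 0.2
y_missing = y.copy()
y_missing[missing] = np.nan

observed = ~missing
model = LinearRegression().fit(x[observed].reshape(-1, 1), y_missing[observed])
residuals = y_missing[observed] - model.predict(x[observed].reshape(-1, 1))
residual_sd = residuals.std(ddof=2)  # two fitted parameters: slope and intercept

# Prediction plus a random residual, so imputed values keep realistic scatter.
y_imputed = y_missing.copy()
y_imputed[missing] = (
    model.predict(x[missing].reshape(-1, 1))
    + rng.normal(scale=residual_sd, size=missing.sum())
)

# --- Linear interpolation between known neighbouring points ---
series = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
interpolated = series.interpolate()

print(np.isnan(y_imputed).sum(), interpolated.tolist())
```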

In conclusion, there is no perfect way to compensate for the missing values in a dataset. Each strategy can perform better for certain datasets and missing data types but may perform much worse on other types of datasets. There are some set rules to decide which strategy to use for particular types of missing values, but beyond that, you should experiment and check which strategy works best for your dataset.


# Source: https://towardsdatascience.com/4-tips-for-advanced-feature-engineering-and-preprocessing-ec11575c09ea
#### 1. Resampling Imbalanced Data

In real-world scenarios, you will encounter imbalanced data more often than not (i.e., many more records of one class than of another in the training/test data). This does not necessarily have to be a problem if your target only has a slight imbalance. 

You could then resolve it by using proper validation measures for the data such as Balanced Accuracy, Precision-Recall Curves or F1-score. Unfortunately, this is not always the case and your target variable might be highly imbalanced (e.g., 10:1). 

Instead, you can oversample the minority target in order to introduce balance using a technique called SMOTE.

##### SMOTE
More information on paper: https://jair.org/index.php/jair/article/view/10302

SMOTE stands for Synthetic Minority Oversampling Technique and is an oversampling technique used to increase the samples in a minority class.

It generates new samples by looking at the feature space of the minority class and detecting nearest neighbors. Then, it picks a minority sample and one of its neighbors and creates a synthetic sample by interpolating randomly between the two, so the new point stays within the feature space of the neighboring samples.
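
The interpolation step can be sketched in plain NumPy. This is an illustration of the idea rather than the imbalanced-learn implementation; the function name is hypothetical, and using one random interpolation factor per synthetic sample is an assumption of this sketch.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Hypothetical helper: create n_new synthetic minority samples."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise distances between minority samples (a sample is not its own neighbour).
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per sample

    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way from the base sample towards the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_sketch(X_min, n_new=10, k=2)
print(synthetic)
```

Because every synthetic point lies between two real minority samples, it can never fall outside the region they span.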
SMOTE is implemented in the imbalanced-learn package. You can simply import it and call fit_resample on your features and target.

There are several strategies that you can take when oversampling using SMOTE (set through the sampling_strategy parameter):
'minority': resample only the minority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;

When sampling_strategy is a dict, the keys correspond to the targeted classes and the values correspond to the desired number of samples for each targeted class.
I chose to use a dictionary to specify the extent to which I wanted to oversample my data.

If you have categorical variables in your dataset, SMOTE is likely to create values for those variables that cannot occur. For example, if you have a variable called isMale, which can only take 0 or 1, then SMOTE might create 0.365 as a value.
Instead, you can use SMOTENC, which takes into account the nature of categorical variables. This version is also available in the imbalanced-learn package. 

Another important thing to consider: make sure to oversample after creating the train/test split so that you only oversample the train data. You typically do not want to test your model on synthetic data.

#### 2. Creating New Features

To improve the quality and predictive power of our models, new features from existing variables are often created. We can create some interaction (e.g., multiply or divide) between each pair of variables hoping to find an interesting new feature. This, however, is a lengthy process and requires a significant amount of coding. Fortunately, this can be automated using Deep Feature Synthesis.

###### Deep Feature Synthesis

Deep feature synthesis (DFS) is an algorithm which enables you to quickly create new variables with varying depth. For example, you can multiply pairs of columns but you can also choose to first multiply Column A with Column B and then add Column C.
First, let me introduce the data I will be using for the example. I have chosen to use HR analytics data since the features are easily interpretable


The first step is to create an entity from which relationships can be created with other tables if necessary. Next, we can simply run ft.dfs in order to create new variables. We specify how variables are created with the parameter trans_primitives. We chose to either add numeric variables together or multiply them.
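
To show what "add" and "multiply" primitives produce, here is a plain-pandas imitation. It is not featuretools code and does not use ft.dfs; the helper name and the HR-style columns are hypothetical.

```python
from itertools import combinations

import pandas as pd

def pairwise_features(df, columns):
    """Hypothetical helper: add the sum and product of every pair of columns."""
    out = df.copy()
    for a, b in combinations(columns, 2):
        out[f"{a} + {b}"] = df[a] + df[b]
        out[f"{a} * {b}"] = df[a] * df[b]
    return out

hr = pd.DataFrame({
    "satisfaction": [0.4, 0.8, 0.1],
    "projects": [2, 5, 6],
    "monthly_hours": [157, 262, 272],
})

# Depth 1: one operation per new feature.
depth1 = pairwise_features(hr, ["satisfaction", "projects", "monthly_hours"])

# Depth 2: combine an existing depth-1 feature with another column, e.g. (A * B) + C.
depth2 = depth1["satisfaction * projects"] + hr["monthly_hours"]

print(depth1.columns.tolist())
```

Three input columns already yield six new depth-1 features, which illustrates how quickly the feature count grows with depth.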

Another interesting feature of DFS is that it can create new variables from aggregations between tables (e.g., facts and dimensions).

Run ft.list_primitives() to see the full list of aggregations and transformations that you can apply. It even handles timestamps, null values, and long/lat information.

#### 3. Handling Missing Values

As always, there is no one best way of dealing with missing values. Depending on your data it might be sufficient to simply fill them with the mean or mode of certain groups. However, there are advanced techniques that use known parts of the data to impute the missing values.
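
Filling with a group-wise mean or mode takes only a few lines of pandas. The table and column names below are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "sales", "sales", "it", "it", "it"],
    "salary": [40.0, np.nan, 60.0, 80.0, 90.0, np.nan],
    "level": ["junior", "junior", np.nan, "senior", np.nan, "senior"],
})

# Numeric column: fill each gap with the mean of its own department.
group_mean = df.groupby("department")["salary"].transform("mean")
df["salary"] = df["salary"].fillna(group_mean)

# Categorical column: fill each gap with the most frequent value in its department.
group_mode = df.groupby("department")["level"].transform(lambda s: s.mode().iloc[0])
df["level"] = df["level"].fillna(group_mode)

print(df)
```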
One such method is IterativeImputer, a relatively new (and still experimental) class in scikit-learn that is based on the popular R algorithm for imputing missing variables, MICE.


###### Iterative Imputer
The Iterative Imputer is developed by Scikit-Learn and models each feature with missing values as a function of other features. It uses that as an estimate for imputation. At each step, a feature is selected as output y and all other features are treated as inputs X. A regressor is then fitted on X and y and used to predict the missing values of y. This is done for each feature and repeated for several imputation rounds.

The great thing about this method is that it allows you to use an estimator of your choosing. I used a RandomForestRegressor to mimic the behavior of the frequently used missForest in R.
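
A minimal sketch of that setup, assuming synthetic data and small, arbitrary settings (20 trees, 5 rounds) to keep it fast:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
X[:, 3] = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=150)  # non-linear dependency

mask = rng.random(X.shape) < 0.1  # roughly 10% of the cells go missing
X_missing = X.copy()
X_missing[mask] = np.nan

# Each feature with gaps is modelled from the other features by a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)

print("remaining NaNs:", np.isnan(X_imputed).sum())
```

scikit-learn may emit a convergence warning with so few rounds; raising max_iter trades run time for stability.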

If you have sufficient data, then it might be an attractive option to simply delete samples with missing data. However, keep in mind that it could create bias in your data. Perhaps the missing data follows a pattern that you miss out on.

The Iterative Imputer allows for different estimators to be used. After some testing, I found out that you can even use CatBoost as an estimator! Unfortunately, in my tests LightGBM and XGBoost did not work, since their random state parameter names differ.