# Handling Missing Values

We often have to deal with missing values. They can show up as NaN (not a number), empty strings, or sentinel values like -999.

Sometimes the fact that a value is missing is useful information in itself. What was the reason the missing value occurred?

We can see from the distributions if the data collectors have used something other than NaN to record missing values. We may see, on a histogram, a small cluster at a specific number.

So, **missing values can be hidden from us**!!! This is very important to realize.
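A minimal sketch of how such a hidden sentinel might be spotted without a plot. The data is synthetic and the -999 sentinel is an assumption made for illustration; `value_counts` stands in for the histogram, since one exact value repeated many times in a continuous feature is the same "small cluster" seen in text form.

```python
import numpy as np
import pandas as pd

# Synthetic feature: real values lie in [0, 100], but (by assumption)
# the data collectors wrote -999 wherever the value was unknown.
rng = np.random.default_rng(0)
values = rng.uniform(0, 100, size=1000)
values[:50] = -999
feature = pd.Series(values, name="feature")

# value_counts sorts by frequency, so a repeated sentinel rises to the top.
counts = feature.value_counts()
suspect = counts.index[0]
n_suspect = counts.iloc[0]

# Once we are convinced it is a placeholder, turn it back into a real NaN.
cleaned = feature.replace(suspect, np.nan)
```

Whether the suspicious value really means "missing" still has to be judged from the data description or domain knowledge.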

## Imputation

- Replace NaN with a value outside the feature's normal range, e.g. -999 or -1. Useful for trees, which can then treat missing values as a separate category. But linear models and NNs would suffer.
- Replace NaN with the mean or median. Useful for simple linear models or NNs. For trees, though, it becomes hard (if not impossible) to single out the objects which had a missing value in the first place.
- Reconstruct the value.
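The first two options can be sketched in pandas as follows. The `age` column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"age": [22.0, np.nan, 35.0, 58.0, np.nan]})

# Option 1: a sentinel outside the normal range (tree-friendly).
df["age_sentinel"] = df["age"].fillna(-999)

# Option 2: mean or median (friendlier to linear models and NNs).
# pandas skips NaN when computing both statistics.
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())
```

In a real train/test setup the mean or median would normally be computed on the training data and reused for the test data.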

We can add an `isnull` feature, a binary column that indicates whether that particular feature is NaN or not. That way we keep the ability to see if a value was missing and also get the benefit of having no missing values (after imputation).

The downside is that, if we do this for every column, we double the number of columns in the dataset...
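A small sketch of the indicator idea, using a made-up `income` column. The key detail is the order: the indicator has to be created before the imputation, otherwise there is nothing left to flag.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"income": [40.0, np.nan, 55.0, np.nan, 70.0]})

# Record missingness first...
df["income_isnull"] = df["income"].isna().astype(int)

# ...then impute (median chosen here purely as an example).
df["income"] = df["income"].fillna(df["income"].median())
```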

If we have time series data, it is not unreasonable to reconstruct the missing values. E.g. if we have temperature data for the start and end of the month but not the middle, we can approximate the missing values from the nearby observations, using something like KNN.

But this nice opportunity is unlikely to happen very often. 

In most scenarios, the rows in our data set are independent and we usually will not find any proper logic to reconstruct them.
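For the time series case, here is a rough sketch of reconstruction from neighbouring observations. Note that it uses pandas' time-based linear interpolation rather than KNN, simply because it is the shortest way to show the idea; the dates and temperatures are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with a two-day gap.
days = pd.date_range("2023-06-01", periods=7, freq="D")
temp = pd.Series([18.0, 19.0, np.nan, np.nan, 22.0, 23.0, 24.0], index=days)

# method="time" interpolates linearly using the spacing of the DatetimeIndex.
filled = temp.interpolate(method="time")
```

This only makes sense because neighbouring rows are related in time. For independent rows there are no "neighbours" to borrow from.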

### Concerns

If we generate new features from columns with missing values, we need to be careful. If we impute missing values and then use that column to generate new feature columns, these new columns will contain the assumptions we made when we imputed the missing values. 

Usually we don't have the time to perform super precise interpolation of missing values.

We will look at this more when we do advanced feature engineering, but here is an example. Suppose we have a categorical column and a numeric column and want to combine them, so that the new `categorical_encoded` column is the mean of all the numeric values from each category. If we encode missing values in the numeric column as -999, this will skew the categorical encoding massively: if one category has a lot of missing values, its encoded value will be pulled heavily towards -999.

The same thing happens if we fill missing values with the mean or median of the feature. One positive here is that the encoding gets pulled towards the mean or median and not towards some value outside the range. One could argue that the mean is itself heavily affected by huge values, but neither the mean nor the median is a massive outlier, so the skew is arguably much less of a problem.

Either way, this imputation can definitely screw up the feature we are constructing.

The way to handle it (in this case) is to ignore missing values when calculating the mean for each category.

**Be very careful with early NaN imputation if you want to generate new features!!**
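The skew described in this section is easy to reproduce on a toy frame. The column names and numbers below are invented; the sketch compares the per-category mean after early -999 imputation with the same mean computed while the NaNs are still in place.

```python
import numpy as np
import pandas as pd

# Category "b" has mostly missing values.
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "value": [10.0, 12.0, 14.0, 20.0, np.nan, np.nan],
})

# Early imputation: the sentinel leaks into the per-category mean.
imputed = df.assign(value=df["value"].fillna(-999))
skewed = imputed.groupby("category")["value"].transform("mean")

# NaNs left in place: groupby mean skips them, so only real values count.
categorical_encoded = df.groupby("category")["value"].transform("mean")
```

Here the encoding for "b" is 20 when the NaNs are ignored, but roughly -659 after filling with -999.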

### Other ideas

- Sometimes we can treat outliers as missing values if they are particularly extreme or don't make sense. E.g. if we are classifying songs and some were apparently composed before Ancient Rome or after 2050.
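A short sketch of turning implausible values into NaN. The `year` column and the 1900 to 2025 bounds are assumptions for illustration; the right bounds depend on the dataset.

```python
import pandas as pd

# Hypothetical song release years, two of which make no sense.
songs = pd.DataFrame({"year": [1969, 2004, -300, 2077, 1987]})

# Keep values inside the assumed plausible range, set the rest to NaN.
plausible = songs["year"].between(1900, 2025)
songs["year_clean"] = songs["year"].where(plausible)
```

After this step the implausible years can be handled by whichever NaN strategy is used for the rest of the column.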

Sometimes it is beneficial to change the categories which are present in the test data but not in the training data. The model could not learn anything about such a category during training and so will not be able to infer anything from it when making predictions. So you can actually go to the test data and manually change any categories that appear only there to ones that appear in the training data.
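One possible way to do this replacement, as a sketch. The notes do not say which training category to substitute, so mapping unseen categories to the most frequent training category is purely an assumption here, as are the `city` column and its values.

```python
import pandas as pd

train = pd.DataFrame({"city": ["paris", "rome", "paris", "oslo"]})
test = pd.DataFrame({"city": ["rome", "lima", "paris", "cairo"]})

# Assumption: fall back to the most frequent training category.
fallback = train["city"].mode()[0]

# Flag test categories the model never saw during training.
unseen = ~test["city"].isin(train["city"].unique())

# Keep known categories, replace unseen ones with the fallback.
test["city_fixed"] = test["city"].where(~unseen, fallback)
```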

Unsupervised encodings of categorical features can help, e.g. replacing each category with its frequency. The model then treats a category according to how often it appears, so we can even work with categories in the test data that were never seen in training. Pretty cool!

We can create a column `categorical_encoded` which is the number of times each category appears across _both_ train and test data. 
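A minimal sketch of this frequency encoding, with invented `city` data. Counts are taken over train and test together, so a category that appears only in the test data ("lima" below) still gets a meaningful value.

```python
import pandas as pd

train = pd.DataFrame({"city": ["paris", "rome", "paris", "oslo"]})
test = pd.DataFrame({"city": ["rome", "lima", "paris", "lima"]})

# Count each category across both datasets.
counts = pd.concat([train["city"], test["city"]]).value_counts()

# Replace each category with its combined count.
train["categorical_encoded"] = train["city"].map(counts)
test["categorical_encoded"] = test["city"].map(counts)
```

No target values are used, which is why this counts as an unsupervised encoding.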

## Summary

- Choice of NaN fill method depends on the situation
- The usual way to deal with them is to replace them with -999, the mean, or the median (the first lets trees create a separate category for NaNs automatically, but linear models would be thrown off)
- Missing values may already be replaced with something by the competition organizers - can find the rows by browsing histograms
- Binary feature `isnull` can be beneficial for each column (but doubles the number of columns in the dataset!)
- In general, avoid filling NaNs before feature generation (it can massively decrease the usefulness of the features)
- XGBoost and LightGBM can handle NaNs themselves, so you don't even need to impute them!! Woo!

