# Dealing with NA values

1. Impute with some statistical (reasonable) value.
    - Statistic over feature. 
    > fill NA Age with mean Age
    - Statistic over groupby feature. 
    > fill NA Age of Females with mean Age of Females
2. Impute with an anomalous value and create an indicator column (preferred for tree-based methods).
3. Other
    - Leave them as is. 
    - Predict missing values based on other features (*typically* impractical).
    - KNN based
    - Drop (*typically* the worst option).
    

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('./ml\ds/titanic_data.csv')

In [3]:
df.count()

Passenger Class                       1309
Name                                  1309
Sex                                   1309
Age                                   1046
No of Siblings or Spouses on Board    1309
No of Parents or Children on Board    1309
Ticket Number                         1309
Passenger Fare                        1308
Cabin                                  295
Port of Embarkation                   1307
Life Boat                              486
Survived                              1309
dtype: int64

In [4]:
X_train, X_test = train_test_split(df, train_size=0.33, random_state=10)

# 1. Impute with different statistics.

- Impute with some statistics of this particular feature:
    - mean (average)
    - median
    - percentiles
    - mode (most frequent). Also works for categorical features
    
- Impute with some statistics computed within categorical groups:
    - Impute missing `Age` with average/median `Age` of a person of the same `Sex`
    - Impute missing `Price` with the average/median `Price` of an item from the same `Category`
    
> This must be done with extreme **CAUTION**; otherwise it is easy to overfit.

Imputation in the **test** part of the data should be based on statistics computed on the **train** part.
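As a minimal sketch of the rule above, the snippet below computes a median (numeric feature) and a mode (categorical feature) on the train part only and reuses them for the test part. The two small frames are made-up stand-ins for the Titanic columns, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the train/test parts (values are made up for illustration).
train = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Port of Embarkation": ["S", "S", None, "C"],
})
test = pd.DataFrame({
    "Age": [np.nan, 40.0],
    "Port of Embarkation": [None, "Q"],
})

# Statistics are computed on the train part only.
median_age = train["Age"].median()                  # numeric feature
mode_port = train["Port of Embarkation"].mode()[0]  # categorical feature

# The same train statistics fill the gaps in both parts.
for part in (train, test):
    part["Age"] = part["Age"].fillna(median_age)
    part["Port of Embarkation"] = part["Port of Embarkation"].fillna(mode_port)

print(test)
```

Nothing from `test` enters the statistics, so the test part stays a fair estimate of performance on unseen data.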

---

## 1.1 Statistic over the feature

### Manual computing

In [5]:
aver_age = df.Age.mean()

df.Age = df.Age.fillna(aver_age)

# Or if we have X_train, X_test

aver_age = X_train.Age.mean()

X_train.Age = X_train.Age.fillna(aver_age)
X_test.Age = X_test.Age.fillna(aver_age)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


### Using `sklearn.impute` to use with `sklearn.pipeline.Pipeline`

In [6]:
from sklearn.impute import SimpleImputer

In [7]:
imputer = SimpleImputer(strategy='mean')

df.Age = imputer.fit_transform(df.Age.values.reshape(-1, 1))

# Or if we have X_train, X_test

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train.Age.values.reshape(-1, 1))

X_train.Age = imputer.transform(X_train.Age.values.reshape(-1, 1))
X_test.Age = imputer.transform(X_test.Age.values.reshape(-1, 1))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
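The point of using `SimpleImputer` is that it can be chained with a model. A minimal sketch follows, assuming a toy single-column dataset and `LogisticRegression` as a placeholder model (neither comes from the notebook above):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (made up): `Age` has gaps, y_train plays the role of `Survived`.
X_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 58.0, 4.0, np.nan]})
y_train = pd.Series([0, 0, 1, 0, 1, 1])
X_test = pd.DataFrame({"Age": [np.nan, 40.0]})

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # learns the mean during fit
    ("model", LogisticRegression()),             # placeholder estimator
])

pipe.fit(X_train, y_train)    # the mean is computed on X_train only
pred = pipe.predict(X_test)   # X_test is filled with the train mean

learned_mean = pipe.named_steps["impute"].statistics_[0]
print(learned_mean)
```

Because the imputer is fitted inside the pipeline, cross-validation utilities refit it on each training fold, which keeps test-fold statistics out of the imputation.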


## 1.2 Statistics within groups

### Manual computing

In [8]:
aver_age_sex = df.groupby(['Sex'])['Age'].mean()

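# Assign through .loc: `df[mask].Age = ...` writes to a temporary copy and leaves df unchanged
df.loc[df.Age.isna(), 'Age'] = df.loc[df.Age.isna(), 'Sex'].map(aver_age_sex)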

# Or if we have X_train, X_test

aver_age_sex = X_train.groupby(['Sex'])['Age'].mean()

# Both parts are filled with the group means computed on X_train
X_train.loc[X_train.Age.isna(), 'Age'] = X_train.loc[X_train.Age.isna(), 'Sex'].map(aver_age_sex)
X_test.loc[X_test.Age.isna(), 'Age'] = X_test.loc[X_test.Age.isna(), 'Sex'].map(aver_age_sex)
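When there is a single frame and no train/test split to respect, the same group-wise fill can be sketched with `groupby(...).transform`, which returns the group statistic already aligned with the rows. The toy frame below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing Age per group.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# transform("mean") yields one value per row: the mean Age of that row's group.
group_mean = toy.groupby("Sex")["Age"].transform("mean")

# Only the missing entries are replaced; observed ages are kept.
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy)
```

For a train/test setup the `map`-based version is still needed, since `transform` computes the statistic on whichever frame it is called on.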

### Using `sklearn.impute` to use with `sklearn.pipeline.Pipeline`

This requires writing a custom imputer class.

In [9]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

In [10]:
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, groupby_col, impute_col, agg):
        """Imputes missing values using groupby aggregated statistic.
        
        Parameters
        ---
        
        groupby_col - str,
            column to groupby over
            
        impute_col - str,
            column to make imputation
            
        agg - function,
            aggregation function
        """
        self.groupby_col = groupby_col
        self.impute_col = impute_col
        self.agg = agg
        self._mapper = None
        
    def fit(self, X, y=None):
        self._mapper = X.groupby(self.groupby_col)[self.impute_col].apply(self.agg)
        return self
        
    def transform(self, X):
        # Fill only the missing entries; work on a copy so the input frame is left untouched
        X = X.copy()
        group_stat = X[self.groupby_col].map(self._mapper)
        X[self.impute_col] = X[self.impute_col].fillna(group_stat)
        return X

In [11]:
# Statistic within groups

imputer = CustomImputer(groupby_col='Sex', impute_col='Age', agg=np.mean)

# transform returns the whole DataFrame, so reassign the frame rather than a single column
df = imputer.fit_transform(df)

# Or if we have X_train, X_test

imputer = CustomImputer(groupby_col='Sex', impute_col='Age', agg=np.mean)

imputer.fit(X_train)

X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)


# 2. Create indicator column.

Often, the fact that a value is missing is itself a strong signal.
> Recall `Life Boat` in the Titanic dataset: a passenger with NA in this column most probably did not survive (the converse also holds: those with a non-NA `Life Boat` most likely survived).

This approach is especially useful if you use tree-based methods.

1. Impute NA values with some anomalous, impossible value, e.g. negative value for `Age`.
2. Create an additional indicator column, 1 if value is missing and 0 otherwise.

Alternatively, IF most of the values are missing, you could simply create the indicator column (and drop the original column).

> This makes cross-validation easier, since the preprocessing uses no statistics of the data and can be done before the `train-test` split.

In [12]:
# Create indicator column

df['has_NA_Age'] = df.Age.isna().astype(int)

# Fill missing values with impossible value for this feature

df['Age'] = df['Age'].fillna(-999)
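The same two steps can be sketched with scikit-learn, which is convenient inside a pipeline: `SimpleImputer` with `strategy="constant"` fills the gaps, and `add_indicator=True` appends the indicator column. The toy array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy `Age` column (made-up values) with two missing entries.
ages = np.array([[22.0], [np.nan], [35.0], [np.nan]])

imputer = SimpleImputer(
    strategy="constant",
    fill_value=-999.0,   # impossible value for Age
    add_indicator=True,  # append a 0/1 "was missing" column
)
out = imputer.fit_transform(ages)

# Column 0: imputed Age, column 1: missing indicator.
print(out)
```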

# 3. Other

## 3.1 Do not deal with them.

Some modern implementations (e.g. catboost) handle missing values for you, typically with a variant of one of the methods above.
For example, catboost offers two modes:
- “Min” — Missing values are processed as the minimum value (less than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.
- “Max” — Missing values are processed as the maximum value (greater than all other values) for the feature. It is guaranteed that a split that separates missing values from all other values is considered when selecting trees.

from  https://catboost.ai/docs/concepts/input-data_custom-borders.html

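Both are variants of approach `2. Create indicator column`: the missing values get an out-of-range value, so a single split can separate them from everything else.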
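To see why the "Min" mode behaves like an indicator, here is a small NumPy emulation. It only illustrates the idea and is not how catboost is implemented internally; the sentinel and threshold choices are assumptions made for the example.

```python
import numpy as np

# Toy feature (made-up values) with two missing entries.
feature = np.array([5.0, np.nan, 12.0, 7.0, np.nan])
missing = np.isnan(feature)

# "Min"-style processing: place missing values below every observed value.
smallest = np.nanmin(feature)
sentinel = smallest - 1.0
as_min = np.where(missing, sentinel, feature)

# A threshold between the sentinel and the smallest observed value
# sends exactly the missing rows to the left branch of a split.
threshold = (sentinel + smallest) / 2
goes_left = as_min < threshold

print(as_min)
print(goes_left)
```

The split `as_min < threshold` carries the same information as an explicit `has_NA` column.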

In [13]:
"¯\_(ツ)_/¯"

'¯\\_(ツ)_/¯'

## 3.2 Predict missing values based on other features (typically impractical)

> The idea is to solve a regression or classification problem in which the column with missing values replaces the original target variable. The problem is that this requires nested cross-validation, which goes far beyond this introductory course.
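A minimal sketch of the idea, assuming a toy frame in which `Age` happens to depend linearly on `Fare` and `LinearRegression` as a placeholder model. Both the data and the model choice are made up for illustration, and the validation scheme mentioned above is ignored here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame (made-up values): Age is missing in two rows.
toy = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age": [21.0, 31.0, np.nan, 51.0, np.nan],
})
known = toy["Age"].notna()

# The column with gaps becomes the target; rows where it is known are the training set.
model = LinearRegression()
model.fit(toy.loc[known, ["Fare"]], toy.loc[known, "Age"])

# Predictions fill only the rows where Age was missing.
toy.loc[~known, "Age"] = model.predict(toy.loc[~known, ["Fare"]])

print(toy)
```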

## 3.3 Use similar observations (KNN)

> Find the k observations that have the most similar feature representation (and a non-NA value in the required column) and average that column over them.

This option is similar to `1.2 Statistics within groups`.


In [14]:
from sklearn.impute import KNNImputer

In [15]:
# Example from sklearn documentation

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, np.nan]])
print('Before imputation\n', X)

imputer = KNNImputer(n_neighbors=2)
X = imputer.fit_transform(X)
print('\nAfter imputation\n', X)

Before imputation
 [[ 1.  2. nan]
 [ 3.  4.  3.]
 [nan  6.  5.]
 [ 8.  8. nan]]

After imputation
 [[1.  2.  4. ]
 [3.  4.  3. ]
 [5.5 6.  5. ]
 [8.  8.  4. ]]


## 3.4 Drop missing values.

Rarely a good choice.

In [16]:
# Drop rows with NA

df.dropna(subset=['Age'], inplace=True)

In [17]:
# Drop columns with `any` NA

df.dropna(inplace=True, axis=1, how='any')
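To get a feel for how much data dropping can cost, the sketch below counts what survives on a toy frame (made-up values, loosely modelled on the sparse `Cabin` column):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): every row has at least one NA.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Cabin": [None, "C85", None, None],
    "Survived": [0, 1, 1, 0],
})

rows_left = len(toy.dropna())                 # rows with no NA in any column
cols_left = list(toy.dropna(axis=1).columns)  # columns with no NA at all

print(rows_left, cols_left)
```

Here dropping rows leaves nothing and dropping columns leaves only the target, which is why it is worth checking these counts before calling `dropna`.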

# Recap

I would personally recommend using the `2. Indicator column` approach almost always (or methods which work with NA out of the box), but under certain circumstances other methods might work better.