# Dealing with Missing Data

If we have blank spaces, NaN (not a number), or NULL values in the dataset, we need to address this for our models to work.

### Identifying Missing Values

In [4]:
import pandas as pd
from io import StringIO #lets us treat a string as a file-like object, as if reading it from disk

#make sample csv file
csv_data = \
'''
W,X,Y,Z
1.0,,3.0,4.0
5.0,6.0,,8.0
9.0,10.0,11.0,12.0
'''

#read into a dataframe
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,W,X,Y,Z
0,1.0,,3.0,4.0
1,5.0,6.0,,8.0
2,9.0,10.0,11.0,12.0


In [9]:
#count number of missing values per column
df.isnull().sum()

W    0
X    1
Y    1
Z    0
dtype: int64

Note that `sklearn` estimators were designed around `NumPy` arrays, so we often pass arrays rather than `pandas` dataframes (although most estimators also accept dataframes). We can always access the underlying `NumPy` array of a dataframe through its `values` attribute, `df.values`, before feeding it into an `sklearn` estimator.

### Eliminating Missing Values

We can use the `dropna` method from the `pandas` library.

In [12]:
#drop rows that contain a missing value
df.dropna(axis=0)

Unnamed: 0,W,X,Y,Z
2,9.0,10.0,11.0,12.0


In [13]:
#drop features that contain a missing value
df.dropna(axis=1)

Unnamed: 0,W,Z
0,1.0,4.0
1,5.0,8.0
2,9.0,12.0


The `dropna` method also takes various parameters which can be useful.

In [14]:
#only drop rows where all features are NaN
df.dropna(how='all')

Unnamed: 0,W,X,Y,Z
0,1.0,,3.0,4.0
1,5.0,6.0,,8.0
2,9.0,10.0,11.0,12.0


In [15]:
#drop rows that have fewer than 4 actual values
df.dropna(thresh=4)

Unnamed: 0,W,X,Y,Z
2,9.0,10.0,11.0,12.0


In [16]:
#only drop rows where NaN appears in a specific column
df.dropna(subset=['Y'])

Unnamed: 0,W,X,Y,Z
0,1.0,,3.0,4.0
2,9.0,10.0,11.0,12.0


### Imputing Missing Values

Often dropping instances that contain missing values eliminates too much useful data. Instead, we can use *interpolation* techniques to estimate the missing values from other training examples in the dataset.

A common method is *mean imputation*, where we replace the missing value by the mean value of the feature. We can do this using the `SimpleImputer` class in the `sklearn` library. Other options for the strategy parameter are `'median'` (if the data are skewed so that this is a better predictor) and `'most_frequent'` (this is useful for categorical data, for example, if the feature is an encoding of color names).

In [17]:
from sklearn.impute import SimpleImputer
import numpy as np

#create an instance of SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df.values)
imputed_data = imputer.transform(df.values)
imputed_data #mean of 2nd feature is 8, mean of 3rd feature is 7

array([[ 1.,  8.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])
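
The other two strategies mentioned above can be sketched in the same way. The arrays below are made-up illustrations (they are not part of the dataset used in this section), and the integer color codes are an assumed encoding.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical skewed numeric feature (second column) with one missing entry.
X_num = np.array([[1.0, np.nan],
                  [5.0, 6.0],
                  [9.0, 8.0],
                  [2.0, 100.0]])

# The median of the observed values 6, 8, 100 is 8, while their mean is 38.
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_num_imputed = median_imputer.fit_transform(X_num)

# Hypothetical categorical feature: colors encoded as 0=blue, 1=green, 2=red.
X_cat = np.array([[2.0], [0.0], [np.nan], [2.0]])

# 'most_frequent' fills the gap with the mode of the column (here 2, i.e. red).
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X_cat_imputed = mode_imputer.fit_transform(X_cat)

print(X_num_imputed)
print(X_cat_imputed)
```

In this toy example the median is far less affected by the single large value than the mean would be, which is the reason to prefer it for skewed data.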

Alternatively, we can just use the `fillna` method from the `pandas` library to fill every NaN with the mean.

In [20]:
df.fillna(df.mean())

Unnamed: 0,W,X,Y,Z
0,1.0,8.0,3.0,4.0
1,5.0,6.0,7.0,8.0
2,9.0,10.0,11.0,12.0


There is also a `KNNImputer` class in the `sklearn` library, which imputes missing values using the k-Nearest Neighbors approach. Each missing feature is imputed using the values from the `n_neighbors` nearest neighbors that have a value for that feature. The neighbors' feature values are either averaged uniformly or weighted by the distance to each neighbor.

In [23]:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights='uniform')
imputer.fit_transform(df.values)

array([[ 1.,  8.,  3.,  4.],
       [ 5.,  6.,  7.,  8.],
       [ 9., 10., 11., 12.]])
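
With only three rows and `n_neighbors=2`, every missing value above has exactly two candidate neighbors, so the result coincides with mean imputation. The sketch below uses a small made-up array with four rows to show a case where the two approaches differ.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Illustrative data: two rows each have one missing feature.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Average the feature over the 2 nearest rows that have a value for it.
knn_imputed = KNNImputer(n_neighbors=2, weights='uniform').fit_transform(X)

# Column-mean imputation for comparison.
mean_imputed = SimpleImputer(strategy='mean').fit_transform(X)

print(knn_imputed)
print(mean_imputed)
```

Here the k-Nearest Neighbors estimate for the first row's third feature is 4 (the average of its two closest rows), whereas the column mean is 5.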

The `SimpleImputer` class is part of the transformer API in `sklearn`, which contains classes to help with data transformation. In general, these classes have two main methods: the `fit` method by which the transformer (NOT the neural network architecture) learns its parameters from a dataset, and the `transform` method where it uses those parameters to transform a dataset. Any dataset that is transformed must have the same number of features as the dataset that was used to fit the transformer.

For example, given a training set `X_train` and a test set `X_test`, if we want to transform our data, it's best practice to learn the parameters of the training dataset and then transform both the training and test set according to these parameters:

In [None]:
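#schematic only: Transformer stands in for any sklearn transformer class
tf = Transformer(param_1=x, param_2=y)
tf.fit(X_train)                        #learn the parameters from the training set only
X_train_tf = tf.transform(X_train)     #apply them to the training set
X_test_tf = tf.transform(X_test)       #reuse the same parameters on the test set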
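
As a concrete, runnable instance of this pattern, here is a minimal sketch that uses `SimpleImputer` as the transformer. The tiny `X_train` and `X_test` arrays are invented for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test sets with the same number of features.
X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
X_test = np.array([[np.nan, 10.0],
                   [7.0, np.nan]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train)                      # learns the column means of X_train only

X_train_imp = imputer.transform(X_train)  # fills gaps with the training means
X_test_imp = imputer.transform(X_test)    # reuses the same training means

print(imputer.statistics_)                # the learned per-column means
print(X_test_imp)
```

Note that the missing values in the test set are filled with the training means (3 and 4), not with statistics computed from the test set itself.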

# Handling Categorical Data

We need to distinguish between *ordinal* and *nominal* features when considering categorical features. Ordinal features have a natural ordering, whereas nominal features have no order. For example, t-shirt size is an ordinal feature, while t-shirt color is a nominal feature.

Class labels are generally treated as nominal.

In [25]:
#create new dataframe
df = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['blue', 'L', 13.5, 'class1'],
    ['red', 'XL', 15.1, 'class1']
])
df.columns = ['color', 'size', 'price', 'classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,blue,L,13.5,class1
2,red,XL,15.1,class1


### Mapping Ordinal Features

Suppose we want to encode the ordinal variable above (size) into integers; we can do this by defining a dictionary describing the conversion that creates the correct order and using the `map` function:

In [26]:
#map according to mapping dict
size_map = {'M':0, 'L':1, 'XL':2}
df['size'] = df['size'].map(size_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,0,10.1,class2
1,blue,1,13.5,class1
2,red,2,15.1,class1


If we want to recover the original size labels from the newly transformed feature, we can simply define an inverse mapping dict and follow the same procedure as above.

In [27]:
#use a dict comprehension to swap the keys and values of the original dict
inverse_map = {code: label for label, code in size_map.items()}
df['size'].map(inverse_map)

0     M
1     L
2    XL
Name: size, dtype: object

### Encoding Ordinal Features

Suppose we have an ordinal feature where the numerical difference between the values it takes is unknown or ill-defined. Then we can encode the values using a *threshold encoding* with binary values.

The idea is to create subsets of values. For example, in terms of an adult's education level, given the set of values {high school, college, law school, medical school}, it might make sense to create the subsets {high school}, {high school, college}, {high school, college, law school}, {high school, college, medical school}. We are assuming that if someone has completed college, they have also completed high school. We also assume that someone who has gone to law school is unlikely to have completed medical school as well, which is why those two values end up in separate subsets.

In [56]:
#recreate the original data frame under a new name so that df keeps its integer-encoded sizes
df_thresh = pd.DataFrame([
    ['green', 'M', 10.1, 'class2'],
    ['blue', 'L', 13.5, 'class1'],
    ['red', 'XL', 15.1, 'class1']
])
df_thresh.columns = ['color', 'size', 'price', 'classlabel']

#create binary-valued features for size subsets
df_thresh['size > M'] = df_thresh['size'].apply(lambda x: 1 if x in {'L', 'XL'} else 0)
df_thresh['size > L'] = df_thresh['size'].apply(lambda x: 1 if x == 'XL' else 0)
del df_thresh['size']
df_thresh

Unnamed: 0,color,price,classlabel,size > M,size > L
0,green,10.1,class2,0,0
1,blue,13.5,class1,1,0
2,red,15.1,class1,1,1
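
For a strictly ordered feature, the threshold columns can also be generated from a list of levels instead of being written by hand. The helper `threshold_encode` below is a hypothetical function written for this illustration, not part of `pandas` or `sklearn`.

```python
import pandas as pd

def threshold_encode(series, ordered_levels):
    """Return one binary column per threshold 'feature > level'.

    ordered_levels lists the levels from smallest to largest; the largest
    level needs no column because nothing can exceed it.
    """
    rank = {level: i for i, level in enumerate(ordered_levels)}
    ranks = series.map(rank)
    columns = {
        f'{series.name} > {level}': (ranks > rank[level]).astype(int)
        for level in ordered_levels[:-1]
    }
    return pd.DataFrame(columns, index=series.index)

sizes = pd.Series(['M', 'L', 'XL'], name='size')
encoded = threshold_encode(sizes, ['M', 'L', 'XL'])
print(encoded)
```

This sketch only covers a single linear ordering; branching cases such as the education example would need the subsets to be specified explicitly.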


### Encoding Class Labels

Most machine learning libraries require class labels to be encoded as integer values. We can proceed as we did for ordinal features, except that, because class labels have no order, we can simply enumerate the unique class labels.

In [38]:
class_map = {label:index for index, label in enumerate(np.unique(df['classlabel']))}
df['classlabel'] = df['classlabel'].map(class_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,0,10.1,1
1,blue,1,13.5,0
2,red,2,15.1,0


Alternatively, we can use the `LabelEncoder` class from `sklearn` which is designed to accomplish this.

In [40]:
from sklearn.preprocessing import LabelEncoder

#create an instance of the LabelEncoder class
labe = LabelEncoder()
labe.fit_transform(df['classlabel'].values)

array([1, 0, 0])
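
A fitted `LabelEncoder` also remembers the original labels, so the integer codes can be mapped back. The following is a small self-contained sketch that starts from the string labels used earlier in this section.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['class2', 'class1', 'class1'])

encoder = LabelEncoder()
y = encoder.fit_transform(labels)         # integer codes in sorted label order

print(encoder.classes_)                   # the labels, indexed by their code
recovered = encoder.inverse_transform(y)  # map the codes back to the labels
print(recovered)
```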

### One-Hot Encoding

For nominal variables other than class labels, if we do the same thing we did above, we will end up imposing an ordering on the values that the feature takes. This may influence the classification algorithm adversely and unnecessarily; as a result, we prefer to employ *one-hot encoding*. We use binary values to encode each value of the categorical feature, and we thereby replace the original feature with `len(np.unique(feature))` many binary-valued features.

In [41]:
from sklearn.preprocessing import OneHotEncoder

#make a numpy array for the non-classlabel features
X = df[['color', 'size', 'price']].values

#use sklearn class OneHotEncoder
color_ohe = OneHotEncoder()
color_ohe.fit_transform(X[:,0].reshape(-1,1)).toarray()

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])
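
To see which output column corresponds to which color, we can inspect the fitted encoder's `categories_` attribute. This is a minimal standalone sketch that rebuilds the color column rather than reusing `X`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(['green', 'blue', 'red']).reshape(-1, 1)

ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors).toarray()

# One output column per unique value, ordered as in categories_.
print(ohe.categories_[0])
print(encoded.shape[1] == len(np.unique(colors)))
```

The categories are sorted, so the columns here stand for blue, green, and red, in that order.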

If we want to transform multiple columns at once according to different encodings, we can use the `ColumnTransformer` class in the `sklearn` library. It accepts a list of `(name, transformer, column(s))` tuples, where `column(s)` is a list of columns to which you want to apply the specified transformation.

In [45]:
from sklearn.compose import ColumnTransformer

ctr = ColumnTransformer([
    ('onehot', OneHotEncoder(), [0]),
    ('nothing', 'passthrough', [1,2])
])

ctr.fit_transform(X).astype(float) #X has dtype object, so cast the result to a numeric array

array([[ 0. ,  1. ,  0. ,  0. , 10.1],
       [ 1. ,  0. ,  0. ,  1. , 13.5],
       [ 0. ,  0. ,  1. ,  2. , 15.1]])

An even easier way to implement the same idea is to use the `get_dummies` function in the `pandas` library. Applied to a `DataFrame`, `get_dummies` will only convert string columns and will leave all other columns unchanged.

In [48]:
pd.get_dummies(df[['color', 'size', 'price']], dtype=float)

Unnamed: 0,size,price,color_blue,color_green,color_red
0,0,10.1,0.0,1.0,0.0
1,1,13.5,1.0,0.0,0.0
2,2,15.1,0.0,0.0,1.0


However, we need to note that one-hot encoding introduces *multicollinearity*, which is a problem for some algorithms (such as those that require matrix inversion). To reduce correlation, we remove one feature column from the one-hot encoded array. Note that this only removes redundant information (because if an instance does not take any of the $K-1$ retained values, it must take the one that was dropped).

In [52]:
pd.get_dummies(df[['color', 'size', 'price']], dtype=float, drop_first=True).astype(float)

Unnamed: 0,size,price,color_green,color_red
0,0.0,10.1,1.0,0.0
1,1.0,13.5,0.0,0.0
2,2.0,15.1,0.0,1.0
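
The redundancy can be checked numerically. In the sketch below (with a made-up one-hot array and an added intercept column, which is an assumption about how the model is set up), the full encoding yields a rank-deficient design matrix, while dropping the first dummy column restores full column rank.

```python
import numpy as np

# Hypothetical one-hot columns for (blue, green, red) over four instances.
onehot = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])
intercept = np.ones((4, 1))

# All K dummy columns sum to the intercept column, so they are linearly dependent.
full = np.hstack([intercept, onehot])
rank_full = np.linalg.matrix_rank(full)

# Dropping the first dummy column removes the dependency.
reduced = np.hstack([intercept, onehot[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)        # number of columns vs. rank
print(reduced.shape[1], rank_reduced)
```

A matrix whose rank is lower than its number of columns makes $X^TX$ singular, which is what breaks methods that rely on inverting it.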


In order to drop a redundant column via the `OneHotEncoder`, we need to set `drop='first'`. We also pass `categories='auto'` (which is the default) so that the categories are determined from the data.

In [53]:
color_ohe = OneHotEncoder(categories='auto', drop='first')

ctr = ColumnTransformer([
    ('onehot', color_ohe, [0]),
    ('nothing', 'passthrough', [1,2])
])

ctr.fit_transform(X).astype(float)

array([[ 1. ,  0. ,  0. , 10.1],
       [ 0. ,  0. ,  1. , 13.5],
       [ 0. ,  1. ,  2. , 15.1]])

# Partitioning a Dataset into Training and Test Sets

If we are dividing a dataset into training and test datasets, we have to keep in mind that we are withholding valuable information that the learning algorithm could benefit from. Thus, we don’t want to allocate too much information to the test set. However, the smaller the test set, the more inaccurate the estimation of the generalization error. Dividing a dataset into training and test datasets is all about balancing this tradeoff. In practice, the most commonly used splits are 60:40, 70:30, or 80:20, depending on the size of the initial dataset. However, for large datasets, 90:10 or 99:1 splits are also common and appropriate. For example, if the dataset contains more than 100,000 training examples, it might be fine to withhold only 10,000 examples for testing in order to get a good estimate of the generalization performance.

In [5]:
#load wine dataset from UCI ML archive
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash',\
                   'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',\
                   'Proanthocyanins', 'Color intensity', 'Hue',\
                   'OD280/OD315 of diluted wines', 'Proline']
df_wine.head()

Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [6]:
from sklearn.model_selection import train_test_split

#separate out features from class labels
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

#do 70-30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

In the above, setting `stratify=y` ensures that both the training and test sets have the same class proportions as the original dataset.
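
The effect of `stratify=y` can be verified by comparing class proportions before and after the split. The sketch below uses synthetic, imbalanced labels rather than the wine data so that it runs without a download.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 60/30/10 class imbalance and a dummy feature column.
y = np.array([0] * 60 + [1] * 30 + [2] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Class proportions in the full data, the training set, and the test set.
print(np.bincount(y) / len(y))
print(np.bincount(y_train) / len(y_train))
print(np.bincount(y_test) / len(y_test))
```

All three lines print the same proportions, which would generally not be the case for a purely random split of such a small, imbalanced dataset.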

# Bringing Features onto the Same Scale

Remember to fit the scaler on the training set only, and then transform the test set using those same parameters.

Most machine learning algorithms perform better when the features are on the same scale (decision trees and random forests are two of the few that do not need this, as they are scale-invariant). Having all features on the same scale ensures that the training process is not dominated by a small number of features with disproportionately large scale.

There are two common ways to accomplish this: *normalization* and *standardization*. In general, normalization involves scaling features to be within the range $[0, 1]$ with an application of *min-max scaling*. To calculate a normalized version of the feature $X$, for every observation $x^{(i)}\in X$ we compute
$$
x^{(i)}_{norm} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}},
$$
where $x_{max} = \max_{x\in X} x$ and $x_{min}$ is defined similarly. We can implement this with `sklearn` as follows:

In [7]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
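
To connect the formula with the scaler, here is a minimal sketch that applies the min-max equation by hand and compares the result with `MinMaxScaler`. The small arrays are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test data with two features.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [4.0, 40.0]])
X_test = np.array([[3.0, 50.0]])

# x_min and x_max come from the training set only.
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
manual_train = (X_train - x_min) / (x_max - x_min)
manual_test = (X_test - x_min) / (x_max - x_min)

mms = MinMaxScaler().fit(X_train)
print(np.allclose(mms.transform(X_train), manual_train))
print(np.allclose(mms.transform(X_test), manual_test))
print(manual_test)
```

Because the test set is scaled with the training minimum and maximum, test values can fall outside $[0, 1]$, as the second feature does here.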

For many machine learning algorithms, it is likely more effective to perform standardization, which entails centering each feature at mean zero and rescaling it to standard deviation one. This is helpful for many algorithms (including those trained with gradient descent), as it is common to initialize weights to zero or to small random values near zero. Note that standardization does NOT change the shape of the distribution of the data, so the new features will not be normally distributed unless the initial data are normally distributed. To standardize $X$ we compute
$$
x_{std}^{(i)} = \frac{x^{(i)}-\mu_{X}}{\sigma_{X}},
$$
where $\mu_X$ is the mean of the feature $X$ and $\sigma_X$ is its standard deviation. We can implement standardization with `sklearn` as well.

In [8]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)
X_test_std = ss.transform(X_test)

It's important to note that we always fit the scaler only once, on the training features. We then transform any new data (such as the test set) using those parameters.
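
The same check works for standardization. The sketch below (again on made-up arrays) computes $\mu_X$ and $\sigma_X$ from the training set by hand and compares the result with `StandardScaler`, which uses the population standard deviation (`ddof=0`, the `NumPy` default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data with two features.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])
X_test = np.array([[4.0, 30.0]])

# Mean and (population) standard deviation of the training features.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
manual_train = (X_train - mu) / sigma
manual_test = (X_test - mu) / sigma

ss = StandardScaler().fit(X_train)
print(np.allclose(ss.transform(X_train), manual_train))
print(np.allclose(ss.transform(X_test), manual_test))
print(manual_train.mean(axis=0), manual_train.std(axis=0))
```

The standardized training features have mean zero and standard deviation one (up to floating-point error); the test features generally do not, since they are transformed with the training parameters.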

Another option is to use `RobustScaler`, which operates on each feature independently and removes the median value before scaling according to the first and third quartiles of the feature, so that extreme values and outliers become less pronounced. This method is recommended for small datasets that contain many outliers, and for machine learning algorithms that are prone to overfitting.
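
As a minimal sketch of this behavior with its default settings, the example below scales a single made-up feature that contains one extreme value and reproduces the result by hand from the median and the quartiles.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0]).reshape(-1, 1)

rs = RobustScaler()
x_robust = rs.fit_transform(x)

# Manual version: subtract the median, divide by the interquartile range.
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_manual = (x - median) / (q3 - q1)

print(rs.center_, rs.scale_)   # learned median and interquartile range
print(x_robust.ravel())
print(np.allclose(x_robust, x_manual))
```

The median and the interquartile range are the same as they would be without the outlier, so the scaling of the typical values is not distorted by it.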