## Handling missing values for categorical variables

In [1]:
import pandas as pd
import numpy as np

The dataset we are going to use contains reports of UFO sightings from different locations in the USA.

In [2]:
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [3]:
ufo.shape

(18241, 5)

In [4]:
ufo.dtypes

City               object
Colors Reported    object
Shape Reported     object
State              object
Time               object
dtype: object

<h4>Count missing values in each column</h4>
<p>
The DataFrame method isna() followed by sum() quickly gives the number of missing values in each column. isna() returns True where a value is missing and False where it is present. sum() treats True as 1 and False as 0, so the sum of each column is its number of missing values.
</p>

In [5]:
ufo.isna().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64
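To see why summing the boolean mask gives a count, here is a minimal sketch on a small made-up frame. The `toy` data and the variable names are assumptions for illustration only, not part of the UFO dataset. Taking the mean of the same mask gives the fraction of missing values per column, which is often easier to compare across columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the UFO data
toy = pd.DataFrame({
    "city": ["Ithaca", None, "Abilene", "Holyoke"],
    "shape": ["DISK", np.nan, np.nan, "OVAL"],
})

mask = toy.isna()            # boolean frame: True where a value is missing
missing_count = mask.sum()   # True counts as 1, so this is a per-column count
missing_share = mask.mean()  # mean of booleans = fraction missing per column

print(missing_count)
print(missing_share)
```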

Based on the summary above, the dataset has 18241 rows, and three columns contain missing values:
<ol>
    <li>"City": 25 missing values</li>
    <li>"Colors Reported": 15359 missing values</li>
    <li>"Shape Reported": 2644 missing values</li>
</ol>

#### 1. Row Deletion

We simply delete the rows that contain a missing value.

When should we apply it?  
Row deletion assumes that the data are missing completely at random (MCAR).

In [6]:
# simply drop whole row with NaN in "City" column
ufo.dropna(subset=["City"], axis=0, how="any", inplace=True)

In [7]:
# reset the index, because we dropped 25 rows
ufo.reset_index(drop=True, inplace=True)

In [8]:
ufo.shape

(18216, 5)

In [9]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


#### Advantages and Disadvantages of Row Deletion:

##### Advantages:
1. Easy to implement
2. Fast way to obtain a complete dataset

##### Disadvantages:
1. If the data are not MCAR, deleting rows introduces bias
2. We may lose valuable information, so row deletion should only be a last resort
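Before deleting rows, it can help to measure how much data would be lost. The sketch below uses a small made-up frame (the `toy` data and the names `cleaned`, `rows_lost` and `share_lost` are assumptions for illustration). It also uses the non-inplace form of `dropna`, which returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the UFO data
toy = pd.DataFrame({
    "City": ["Ithaca", np.nan, "Holyoke", "Abilene", np.nan],
    "State": ["NY", "NJ", "CO", "KS", "NY"],
})

# dropna returns a new frame; toy itself is left unchanged
cleaned = toy.dropna(subset=["City"]).reset_index(drop=True)

rows_lost = len(toy) - len(cleaned)
share_lost = rows_lost / len(toy)
print(f"dropped {rows_lost} of {len(toy)} rows ({share_lost:.0%})")
```

If the share of lost rows is large, one of the imputation methods is probably a better starting point than deletion.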

#### 2. Frequent Category (Mode) Imputation

Mode imputation replaces each NaN with the most frequent value of the column.

When should we apply it?  
Mode imputation assumes that the data are missing completely at random (MCAR). 

In [10]:
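# function to fill the missing values of a column with its mode
def impute_nan_mode(df, variable):
    most_frequent_category = df[variable].mode()[0]
    # assign the result back instead of using inplace=True on a column
    # selection, which recent pandas versions warn about
    df[variable] = df[variable].fillna(most_frequent_category)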

In [11]:
impute_nan_mode(ufo,"Shape Reported")
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [12]:
ufo.isnull().sum()

City                   0
Colors Reported    15339
Shape Reported         0
State                  0
Time                   0
dtype: int64

We will read the dataset again to discard all the changes we have made.

In [13]:
ufo = pd.read_csv("http://bit.ly/uforeports")
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


#### Advantages and Disadvantages of Mode Imputation

##### Advantages
1. Easy to implement
2. Fast to apply

##### Disadvantages
1. Because every NaN is replaced by the most frequent label, that label becomes over-represented when there are many NaN
2. It distorts the relationship between the most frequent label and the other variables
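The over-representation is easy to see on a small made-up series. In the sketch below (the `shape` values are assumptions for illustration, not taken from the UFO data), half of the values are missing, and after mode imputation the share of the most frequent label rises from one half to three quarters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with 4 observed values and 4 missing ones
shape = pd.Series(["LIGHT", "LIGHT", "DISK", "OVAL",
                   np.nan, np.nan, np.nan, np.nan])

mode_value = shape.mode()[0]        # mode() ignores NaN by default
imputed = shape.fillna(mode_value)

before = shape.value_counts(normalize=True)    # shares among observed values
after = imputed.value_counts(normalize=True)   # shares after imputation

print(before["LIGHT"], after["LIGHT"])  # 0.5 0.75
```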

#### 3. Random Sample Imputation

Random sample imputation takes random observations from the non-missing values of the variable and uses them to replace the NaN values.

When should it be used?   
It assumes that the data are missing completely at random (MCAR).

In [14]:
ufo["Shape Reported"].isnull().sum()

2644

We need 2644 random values to replace the 2644 missing values above. So we first drop the rows with a missing value for "Shape Reported" and then sample 2644 random values from the remaining rows.

In [15]:
ufo["Shape Reported"].dropna().sample(ufo["Shape Reported"].isnull().sum(), random_state=0)

12938     TRIANGLE
5533      CYLINDER
15145     FIREBALL
2675         LIGHT
582          CIGAR
           ...    
16469     FIREBALL
3290     RECTANGLE
216         CIRCLE
13764       SPHERE
15861     TRIANGLE
Name: Shape Reported, Length: 2644, dtype: object

We can find the index labels of the rows with missing values as shown below. To fill those rows by assignment, we have to replace the index of the random sample above with these index labels, because pandas aligns on the index when assigning.

In [16]:
ufo.loc[ufo["Shape Reported"].isnull(), :].index

Int64Index([   16,    17,    21,    53,    56,    57,    66,    67,    69,
               71,
            ...
            18173, 18176, 18179, 18183, 18193, 18206, 18223, 18232, 18235,
            18238],
           dtype='int64', length=2644)

In [17]:
def impute_nan_random(df,variable):
    
    #copying original data to a new column
    df[variable+"_random"]=df[variable]
    
    #generating same number of random values as missing values
    random_sample = df[variable].dropna().sample(df[variable].isnull().sum(),random_state=0)
    
    #pandas aligns on the index when assigning,
    #so give random_sample the index of the rows with missing values
    random_sample.index=df.loc[df[variable].isnull(), :].index
    
    #overwriting missing values with random values
    df.loc[df[variable].isnull(), variable+"_random"]=random_sample

In [18]:
impute_nan_random(ufo,"Shape Reported")

In [19]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,Shape Reported_random
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,TRIANGLE
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,OTHER
2,Holyoke,,OVAL,CO,2/15/1931 14:00,OVAL
3,Abilene,,DISK,KS,6/1/1931 13:00,DISK
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,LIGHT
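The first five rows have no missing shapes, so the new column looks identical to the original here. To check the behaviour on rows that actually change, here is a sketch of a non-mutating variant on a small made-up series. The function name `random_sample_impute` and the `toy` data are assumptions for illustration. Unlike the function above, it samples with replacement (`replace=True`), so it also works when more values are missing than observed; it still assumes at least one observed value.

```python
import numpy as np
import pandas as pd

def random_sample_impute(series, random_state=0):
    """Return a copy of `series` with NaN replaced by random draws
    from its own observed values (hypothetical helper)."""
    missing = series.isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return series.copy()
    # replace=True: draws may repeat, which differs from the function above
    draws = series.dropna().sample(n_missing, replace=True,
                                   random_state=random_state)
    # give the draws the index labels of the missing positions
    draws.index = series.index[missing.to_numpy()]
    filled = series.copy()
    filled.loc[missing] = draws
    return filled

# Hypothetical toy series, not the UFO data
toy = pd.Series(["DISK", np.nan, "LIGHT", "DISK", np.nan, "OVAL"])
filled = random_sample_impute(toy)
print(filled)
```

The observed values stay where they were, and every filled value is one of the labels that already occur in the series.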


We will drop the column we just created before moving to other methods.

In [20]:
ufo.drop("Shape Reported_random", axis=1, inplace=True)
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [21]:
ufo.isnull().sum()

City                  25
Colors Reported    15359
Shape Reported      2644
State                  0
Time                   0
dtype: int64

#### 4. Capturing NaN values with a new feature

We create a new column that marks which values are missing, and then replace the missing values with the mode or with random sample values, as we did above.

When should it be used?  
It can be used whether or not the data are MCAR 

In [22]:
ufo["Colors_Reported_NAN"]=np.where(ufo["Colors Reported"].isnull(),1,0)
# or, equivalently:
# ufo["Colors_Reported_NAN"] = ufo["Colors Reported"].isnull().astype(int)

In [23]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,Colors_Reported_NAN
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,1
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,1
2,Holyoke,,OVAL,CO,2/15/1931 14:00,1
3,Abilene,,DISK,KS,6/1/1931 13:00,1
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,1


In [24]:
CR_mode = ufo["Colors Reported"].mode()[0]
CR_mode

'RED'

In [25]:
ufo["Colors Reported"] = ufo["Colors Reported"].fillna(CR_mode)

In [26]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time,Colors_Reported_NAN
0,Ithaca,RED,TRIANGLE,NY,6/1/1930 22:00,1
1,Willingboro,RED,OTHER,NJ,6/30/1930 20:00,1
2,Holyoke,RED,OVAL,CO,2/15/1931 14:00,1
3,Abilene,RED,DISK,KS,6/1/1931 13:00,1
4,New York Worlds Fair,RED,LIGHT,NY,4/18/1933 19:00,1


We will drop the column we just created before moving to other methods.

In [27]:
ufo.drop("Colors_Reported_NAN", axis=1, inplace=True)
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,RED,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,RED,OTHER,NJ,6/30/1930 20:00
2,Holyoke,RED,OVAL,CO,2/15/1931 14:00
3,Abilene,RED,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,RED,LIGHT,NY,4/18/1933 19:00


In [28]:
ufo.isnull().sum()

City                 25
Colors Reported       0
Shape Reported     2644
State                 0
Time                  0
dtype: int64

#### Advantages and Disadvantages of Capturing NaN with new feature

##### Advantages
1. Easy to implement
2. Captures the importance of missing values

##### Disadvantages
1. Creating Additional Features may hamper accuracy (Curse of Dimensionality)
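The order of the two steps matters: the indicator has to be built before the NaN values are filled, because afterwards the information about which rows were missing is gone. A compact sketch on made-up data (the `toy` frame is an assumption for illustration) shows both steps together.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the UFO data
toy = pd.DataFrame({"Colors Reported": ["RED", np.nan, "GREEN", np.nan, "RED"]})

# Step 1: record which values are missing (1 = missing, 0 = present)
toy["Colors_Reported_NAN"] = toy["Colors Reported"].isna().astype(int)

# Step 2: only now fill the missing values, here with the mode
mode_value = toy["Colors Reported"].mode()[0]
toy["Colors Reported"] = toy["Colors Reported"].fillna(mode_value)

print(toy)
```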

#### 5. Replacing NaN with a new category

If there is more than one mode, or the value counts of the different categories are close to each other, we can fill the missing values with a new category instead.

When should it be used?  
It can be used whether or not the data are MCAR 
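One way to check for tied modes is to look at what `mode()` returns: it lists every value that shares the highest count. The sketch below uses a made-up series (the `shape` values and the simple tie rule are assumptions for illustration) and falls back to a new "Missing" category only when there is no single mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series where DISK and LIGHT are tied
shape = pd.Series(["DISK", "LIGHT", "DISK", "LIGHT", "OVAL", np.nan, np.nan])

modes = shape.mode()            # every tied mode, in sorted order
counts = shape.value_counts()   # NaN is excluded by default

if len(modes) > 1:
    # no single most frequent label: use a new category instead
    filled = shape.fillna("Missing")
else:
    filled = shape.fillna(modes[0])

print(counts)
print(filled)
```

A similar check could compare the top two entries of `counts` to decide whether the difference between categories is too small for mode imputation.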

In [29]:
ufo["Shape Reported"] = ufo["Shape Reported"].fillna("Missing")

In [30]:
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,RED,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,RED,OTHER,NJ,6/30/1930 20:00
2,Holyoke,RED,OVAL,CO,2/15/1931 14:00
3,Abilene,RED,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,RED,LIGHT,NY,4/18/1933 19:00


In [31]:
ufo.isnull().sum()

City               25
Colors Reported     0
Shape Reported      0
State               0
Time                0
dtype: int64

#### Advantages and Disadvantages of Replacing NaN with a new category

##### Advantages
1. Easy to implement
2. Captures the importance of missing values

##### Disadvantages
1. Replacing NaN with a new category does not work in every situation