## Removing observations with missing data

**Complete Case Analysis (CCA)**, also called list-wise deletion, consists of
discarding every observation that has a missing value in any of the variables. CCA
can be applied to both categorical and numerical variables. It is quick and easy to
implement, and it preserves the distribution of the variables, provided the data is
missing at random and only a small proportion of the data is missing. However, if data
is missing across many variables, CCA may remove a large portion of the dataset.
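To illustrate the last point, here is a minimal sketch on a hypothetical toy DataFrame (the column names `x`, `y`, `z` and their values are made up for illustration, not taken from the credit approval dataset). Each column is missing only a quarter of its values, yet because the gaps fall in different rows, list-wise deletion discards most of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value per column, each in a different row
toy = pd.DataFrame({
    "x": [np.nan, 2.0, 3.0, 4.0],   # numerical
    "y": ["a", None, "c", "d"],     # categorical
    "z": [1.0, 2.0, np.nan, 4.0],   # numerical
})

# Fraction of missing values in each column
per_column = toy.isnull().mean()

# Fraction of rows that complete case analysis would discard
rows_dropped = 1 - len(toy.dropna()) / len(toy)

print(per_column)
print("Fraction of rows dropped:", rows_dropped)
```

In this sketch every column reports 0.25 missing, while three of the four rows are removed, so it can be worth checking the row-level loss before committing to CCA.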

In [13]:
# let's import the necessary libraries
import pandas as pd

In [14]:
# load credit approval dataset
data = pd.read_csv('data/creditApprovalUCI.csv')
data.head()

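|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | 1 |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | 1 |
| 2 | a | 24.5 | NaN | u | g | q | h | NaN | NaN | NaN | 0 | f | g | 280.0 | 824 | 1 |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | 1 |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | 1 |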


In [15]:
# let's calculate the fraction of missing values in each variable and sort in ascending order
data.isnull().mean().sort_values(ascending=True)

A11    0.000000
A12    0.000000
A13    0.000000
A15    0.000000
A16    0.000000
A4     0.008696
A5     0.008696
A6     0.013043
A7     0.013043
A1     0.017391
A2     0.017391
A14    0.018841
A3     0.133333
A8     0.133333
A9     0.133333
A10    0.133333
dtype: float64

**Note:** The output lists the fraction of missing values in each variable, in ascending order. For example, 0.133333 for A3 means that about 13.3% of its values are missing.
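If percentages are easier to read than fractions, the same calculation can be scaled by 100 and restricted to the variables that actually contain missing data. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values) so that it runs on its own; on the real dataset, `data` would take its place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the credit approval dataset
toy = pd.DataFrame({
    "A2": [30.83, np.nan, 24.5, 27.83],
    "A3": [0.0, 4.46, np.nan, np.nan],
    "A11": [1, 6, 0, 5],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = toy.isnull().mean().mul(100).sort_values()

# Show only the columns that contain at least one missing value
print(missing_pct[missing_pct > 0])
```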

In [16]:
# now we will remove the observations with a missing value in any variable
data_cca = data.dropna()

**TIP:** To remove observations where data is missing in a **subset of variables**, we
can execute **`data.dropna(subset=['A3', 'A4'])`**. To remove observations only if
data is **missing in all the variables**, we can execute
**`data.dropna(how='all')`**.
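The difference between these `dropna` variants is easiest to see side by side. The following is a minimal sketch on a hypothetical four-row DataFrame; it reuses the column names A3, A4 and A5, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data:
# row 0 is complete, row 1 misses only A5,
# row 2 misses everything, row 3 misses only A4
toy = pd.DataFrame({
    "A3": [1.0, 2.0, np.nan, 4.0],
    "A4": ["u", "y", None, None],
    "A5": ["g", None, None, "g"],
})

# Default: drop a row if ANY column is missing
any_missing = toy.dropna()

# Drop a row only if A3 or A4 is missing; A5 is ignored
subset_missing = toy.dropna(subset=["A3", "A4"])

# Drop a row only if ALL columns are missing
all_missing = toy.dropna(how="all")

print(len(any_missing), len(subset_missing), len(all_missing))
```

Here the default call keeps only row 0, the `subset` call keeps rows 0 and 1, and `how='all'` keeps every row except row 2.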

In [17]:
# let's print and compare the sizes
print('Total number of observations, including those with NaN values:', len(data))
print('Total number of observations after removing the NaN values:', len(data_cca))

Total number of observations, including those with NaN values: 690
Total number of observations after removing the NaN values: 564
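To put these two counts in perspective, the share of observations lost to CCA can be worked out directly. The sketch below hard-codes the counts printed above (690 and 564) so that it runs on its own; in the notebook, `len(data)` and `len(data_cca)` would be used instead.

```python
# Counts taken from the output above
n_total = 690      # observations before removing NaN values
n_complete = 564   # observations after removing NaN values

n_removed = n_total - n_complete
fraction_removed = n_removed / n_total

print(f"{n_removed} observations removed ({fraction_removed:.1%} of the dataset)")
# prints: 126 observations removed (18.3% of the dataset)
```

Roughly 18% of the observations are discarded here, which is a useful figure to weigh against the "small proportion" condition mentioned at the start of this section.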
