# Missing Values and Outliers

## Data Science, Machine Learning and Artificial Intelligence - by Farzad Minooei

## Missing Values

Ref: 

Flexible Imputation of Missing Data, Stef van Buuren (2018). Ed. 2.

https://stefvanbuuren.name/fimd/

Multivariate Data Analysis, Joseph F. Hair, William C. Black, Barry J. Babin, Rolph E. Anderson (2013). Ed. 7.

Problems w/ MVs:

     #1: reduction of the sample size available for analysis
     
     #2: bias resulting from differences between missing and complete data

#### Four-Step Process for Identifying Missing Data and Applying Remedies

Step 1: Determine the type of MVs

    Know the cause

    Ignorable MVs
    
         Specific design of the data collection process
         
         Censored data

Step 2:  Determine the extent of MVs

    How many MVs are acceptable? ---> MVs can generally be ignored
         
         when fewer than about 5% - 10% of the observations are missing.
         
    Randomness
    
    Sufficient data for the selected analysis technique

Step 3:  Diagnose the randomness of the MVs processes

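    Missing not at random (MNAR): the missingness of Y depends on the (unobserved) values of Y itself
    
    Missing completely at random (MCAR): the missingness of Y is unrelated to both X and Y
    
    Missing at random (MAR): the missingness of Y depends on X, but not on Y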
    
Step 4:  Select the imputation method
    
    Imputation is the process of estimating the missing value    
    based on valid values of other variables and/or cases in the sample.
    
    Some imputation methods:
    
        -- Complete case approach
        
        -- Using all-available data
        
        -- Mean/Median substitution
        
        -- MICE (Multivariate Imputation by Chained Equations) algorithm
        
            step 1: For each variable, replace the missing values using a simple imputation strategy such as mean imputation. These initial fills act as “placeholders.”
               
            step 2: The “placeholders” for the first variable, X1, are set back to missing, and a model is fitted on the observed values of X1, with X1 as the dependent variable and the remaining variables as the independent variables. The same is done for each variable in turn, until every variable with missing values has served as the dependent variable.
            
            step 3: The original “placeholders” of each variable are then replaced with the predictions from its model.
            
            step 4: The replacement process is repeated for a number of cycles, generally ten according to Raghunathan et al. (2002), and the imputations are updated at each cycle.
            
            step 5: At the end of the final cycle, the missing values are ideally replaced with predictions that best reflect the relationships identified in the data.
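
The MICE steps listed above can be sketched in a few lines of NumPy. This is only an illustration on synthetic data, not the miceforest implementation used later in the notebook: ordinary least squares stands in for the machine-learning model, and the function name `chained_regression_impute` and all variable names are made up for this example.

```python
import numpy as np

def chained_regression_impute(X, n_cycles=10):
    """Toy chained-equations imputation; linear regression is the per-variable model."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: fill every missing cell with its column mean (the "placeholders")
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_cycles):                # Step 4: repeat for several cycles
        for j in range(X.shape[1]):          # each variable takes a turn as dependent
            rows = missing[:, j]
            if not rows.any():
                continue
            # Step 2: regress the observed values of column j on all other columns
            design = np.column_stack([np.ones(len(filled)),
                                      np.delete(filled, j, axis=1)])
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            # Step 3: overwrite the placeholders with the model's predictions
            filled[rows, j] = design[rows] @ coef
    return filled

# Synthetic example: three related columns with about 20% of cells removed at random
rng = np.random.default_rng(123)
x1 = rng.normal(size=200)
truth = np.column_stack([x1,
                         2 * x1 + rng.normal(scale=0.1, size=200),
                         -x1 + rng.normal(scale=0.1, size=200)])
holes = rng.random(truth.shape) < 0.2
observed = np.where(holes, np.nan, truth)

imputed = chained_regression_impute(observed)
mean_filled = np.where(holes, np.nanmean(observed, axis=0), observed)

# Compare both imputations against the known true values of the removed cells
rmse_chained = np.sqrt(np.mean((imputed[holes] - truth[holes]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[holes] - truth[holes]) ** 2))
print(f"RMSE chained: {rmse_chained:.3f}, RMSE mean substitution: {rmse_mean:.3f}")
```

Because the three columns are strongly related, the chained predictions should land much closer to the removed values than the column means do. With weakly related variables the gain would be smaller.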

### Business Understanding

Goal:

Survey analysis: calculate Pearson correlation w/ missing values

### Initial Setup

In [None]:
#Required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
#Read data from file
data = pd.read_csv('cs_02.csv')
data.head()

In [None]:
data.shape

### Exploratory Data Analysis

#### 1: Understand Data Collection Process

Data collected from an online customer satisfaction survey.

#### 2: Document Data Set Description (Meta Data)

In [None]:
data.info()

id: unique identification number

age: age of the respondent (numeric)

gender: gender of the respondent (binary: 'F': female, 'M': male)

customer_longevity: the length of time a customer continues to transact with the company (ordinal: 0: Never used, 1: Less than 1 year, 2: 1 - 2 years, 3: Over 2 years)

customer_satisfaction_score: customer satisfaction score from 1 to 10 (numeric)

net_promoter_score: the likelihood of recommending the company from 1 to 10 (numeric)

customer_effort_score: the amount of effort the respondent had to exert to use the company's product, from 1 to 10 (numeric)

#### 3: Check for Missing Values

In [None]:
#Step 1: Determine the type of MVs
#Know the cause
np.sum(data.isnull(), axis = 0)

In [None]:
#The number of MVs coded as '.' in each column
np.sum(data == '.', axis = 0)

In [None]:
#Replace '.' with nan
data[data == '.'] = np.nan

In [None]:
#The number of MVs in each column
np.sum(data.isnull(), axis = 0)

In [None]:
#Get info
data.info()

In [None]:
#Use astype method to change data type of a column (returns a new Series; data is unchanged until reassigned)
data['customer_satisfaction_score'].astype('float')

In [None]:
#Change data type of numeric columns (the three score columns)
data[data.columns[4 : ]] = data.iloc[:, 4 : ].astype('float')

In [None]:
#Get info
data.info()
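
The '.' placeholders replaced above could instead be declared as missing when the file is read, through the `na_values` parameter of `pd.read_csv`. The sketch below uses a small in-memory CSV with made-up rows (not the contents of `cs_02.csv`) to show that the affected columns then arrive as floats with `NaN`, so no separate replacement or type conversion is needed.

```python
import io
import pandas as pd

# Made-up rows for illustration; '.' marks a missing answer
csv_text = """id,age,customer_satisfaction_score
1,34,7
2,.,8
3,51,.
"""

# Treat '.' as a missing-value marker while parsing
toy = pd.read_csv(io.StringIO(csv_text), na_values=['.'])
print(toy.dtypes)
print(toy.isnull().sum())
```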

In [None]:
#Step 2: Determine the extent of MVs
#Summary of MVs in each column
mvs_summary = pd.DataFrame({'freq' : np.sum(data.isnull(), axis = 0)})
mvs_summary['pct'] = round(mvs_summary['freq'] / data.shape[0] * 100, 1)
mvs_summary.sort_values(by = 'pct', ascending = False)

In [None]:
#Summary of MVs for each case
data.loc[:, 'mvs'] = np.sum(data.isnull(), axis = 1)
data.sort_values(by = 'mvs', ascending = False).head(10)

In [None]:
#Decision: remove cases with more than 50% mvs
#(row labels read from the sorted table in the previous cell)
data.drop(index = [65, 84, 85, 87], inplace = True)
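
The row labels in the cell above are specific to this file. A data-driven version of the same decision might look like the sketch below, which computes the share of missing values per case and drops every case above a threshold. The toy frame and the 0.5 cutoff are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 has 3 of 4 values missing, row 3 has exactly half missing
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, np.nan, np.nan, 4.0],
    'c': [1.0, np.nan, 3.0, 4.0],
    'd': [1.0, 2.0, 3.0, np.nan],
})

# Share of missing values in each case (row)
missing_share = toy.isnull().mean(axis=1)
# Cases with strictly more than 50% missing values
to_drop = toy.index[missing_share > 0.5]
toy_reduced = toy.drop(index=to_drop)
print(list(to_drop))
print(toy_reduced.shape)
```

Note that in the notebook the share would have to be computed before helper columns such as `mvs` are added, otherwise they count toward the denominator.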

In [None]:
#Decision: remove customers with 0 longevity
data.drop(index = data.loc[data['customer_longevity'] == '0', :].index, inplace = True)

In [None]:
#Summary of MVs in each column
mvs_summary = pd.DataFrame({'freq' : np.sum(data.isnull(), axis = 0)})
mvs_summary['pct'] = round(mvs_summary['freq'] / data.shape[0] * 100, 1)
mvs_summary.sort_values(by = 'pct', ascending = False)

In [None]:
#Step 3: Diagnose the randomness of the MVs processes
#Create a list of conditions
conditions = [data['customer_satisfaction_score'].isnull(), data['customer_satisfaction_score'].notnull()]
#Create a list of the values needed to assign for each condition
values = [1, 0]
#Create a new column and use np.select to assign values to it using the lists as arguments
data['if_null'] = np.select(conditions, values)
data.tail()

In [None]:
#Evaluate the randomness of the MVs in customer_satisfaction_score from age perspective
data.groupby(by = 'if_null')['age'].mean()

In [None]:
#Box plot for evaluating the randomness of the MVs in customer_satisfaction_score from age perspective
plt.boxplot([data.loc[data['if_null'] == 0, 'age'],
             data.loc[data['if_null'] == 1, 'age']])
plt.xticks(ticks = [1, 2], labels = [0, 1])
plt.title('MVs in customer satisfaction score \n from age perspective')
plt.show()

In [None]:
#Evaluate the randomness of the MVs in customer_satisfaction_score from gender perspective
#Cross tabulation analysis
cross_tab_pct = round(pd.crosstab(data['gender'], data['if_null'], normalize = 'index'), 2)
cross_tab_pct

In [None]:
#Evaluate the randomness of the MVs in customer_satisfaction_score from customer_longevity perspective
#Cross tabulation analysis
cross_tab_pct = round(pd.crosstab(data['customer_longevity'], data['if_null'], normalize = 'index'), 2)
cross_tab_pct
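
To see what the group comparisons above are looking for, the following sketch simulates the two cases on synthetic data. Under MCAR every respondent has the same chance of skipping the score, under MAR the chance depends on age. All numbers (sample size, age distribution, missingness rates) are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(40, 10, size=n)

# MCAR: every score has the same 20% chance of being missing
mcar_null = rng.random(n) < 0.20
# MAR: respondents over 50 skip the question more often (depends on age only)
mar_null = rng.random(n) < np.where(age > 50, 0.40, 0.10)

sim = pd.DataFrame({'age': age,
                    'mcar_null': mcar_null.astype(int),
                    'mar_null': mar_null.astype(int)})

# Same diagnostic as in the notebook: mean age by missingness indicator
mcar_means = sim.groupby('mcar_null')['age'].mean()
mar_means = sim.groupby('mar_null')['age'].mean()
mcar_gap = mcar_means.loc[1] - mcar_means.loc[0]
mar_gap = mar_means.loc[1] - mar_means.loc[0]
print(f"Mean age gap under MCAR: {mcar_gap:.2f}")
print(f"Mean age gap under MAR:  {mar_gap:.2f}")
```

A gap close to zero is consistent with MCAR, while a clear gap suggests the missingness is related to the grouping variable. A group comparison of this kind cannot rule out MNAR, because that would require the unobserved scores themselves.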

In [None]:
#Remove temporary variables: mvs and if_null
data.drop(columns = ['mvs', 'if_null'], inplace = True)
data.head()

In [None]:
#Get Shape
data.shape

In [None]:
#Step 4: Select the imputation method
#Method 1: complete case approach
data_complete_case = data.dropna(axis = 0, inplace = False)
print(data_complete_case.shape)
np.sum(data_complete_case.isnull(), axis = 0)

In [None]:
#Method 2: mean substitution
data_mean_sub = data.copy()
#Substitute NAs w/ mean of each column
data_mean_sub.iloc[:, 4 : ] = data_mean_sub.iloc[:, 4 : ].fillna(data_mean_sub.iloc[:, 4 : ].mean())
print(data_mean_sub.shape)
np.sum(data_mean_sub.isnull(), axis = 0)
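
One side effect of mean substitution is worth keeping in mind when comparing the methods: it preserves the column mean but shrinks the spread, which can distort correlations computed afterwards. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up scores with two missing answers
scores = pd.Series([2.0, 4.0, np.nan, 6.0, np.nan, 8.0, 10.0])
filled = scores.fillna(scores.mean())

std_before = scores.std()   # computed on the 5 observed values only
std_after = filled.std()    # computed on all 7 values after substitution
print(f"mean: {scores.mean():.2f} -> {filled.mean():.2f}")
print(f"std:  {std_before:.2f} -> {std_after:.2f}")
```

The imputed cases all sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the number of cases.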

In [None]:
#Data preparation for MICE 
data_mice_imputation = data.iloc[:, 1 :].reset_index(drop = True).copy()
#Convert object columns to category data type
data_mice_imputation['gender'] = data_mice_imputation['gender'].astype('category')
data_mice_imputation['customer_longevity'] = data_mice_imputation['customer_longevity'].astype('category')

In [None]:
#Method 3: MICE
#%pip install miceforest 
from miceforest import ImputationKernel #It uses lightgbm as a backend

mice_kernel = ImputationKernel(data = data_mice_imputation, 
                               random_state = 123)
#Run the kernel on the data for 10 iterations
mice_kernel.mice(10)
#Create the imputed data
data_mice_imputation = mice_kernel.complete_data()
print(data_mice_imputation.shape)
np.sum(data_mice_imputation.isnull(), axis = 0)

In [None]:
#Step 5: correlation analysis
#Method 1: complete case approach
corr_complete_case = round(data_complete_case.iloc[:, 4 : ].corr(), 2)
corr_complete_case
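
The "using all-available data" approach from the method list is not run as a separate cell, but pandas offers it directly: `DataFrame.corr()` excludes missing values pair by pair, so each coefficient uses every row where both variables are observed. The sketch below contrasts this with the complete case approach on a made-up frame.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column, in different rows
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    'y': [2.0, 1.0, 4.0, np.nan, 5.0, 6.0],
    'z': [np.nan, 1.0, 2.0, 2.0, 4.0, 5.0],
})

# All-available (pairwise) approach: NaNs are dropped per pair of columns
corr_pairwise = toy.corr()
# Complete case (listwise) approach: only fully observed rows are used
corr_listwise = toy.dropna().corr()

# Manual check for one pair: rows where both x and y are observed
both = toy[['x', 'y']].dropna()
manual_xy = np.corrcoef(both['x'], both['y'])[0, 1]

print(round(corr_pairwise, 2))
print(round(corr_listwise, 2))
```

The pairwise table keeps more information, but each coefficient may be based on a different subset of cases.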

In [None]:
#Method 2: mean substitution
corr_mean_sub = round(data_mean_sub.iloc[:, 4 : ].corr(), 2)
corr_mean_sub

In [None]:
#Method 3: MICE
corr_mice_imputation = round(data_mice_imputation.iloc[:, 3 : ].corr(), 2)
corr_mice_imputation

In [None]:
#Final correlation table
corr_table = round((corr_complete_case + corr_mean_sub + corr_mice_imputation) / 3, 2)
corr_table

In [None]:
#Summary
#   The missing data process is MCAR
#   Imputation is the most logical course of action
#   Correlations differ slightly across imputation techniques

## Outliers

Problems w/ Outliers:
    
    #1: can have a marked effect on any type of empirical analysis

    #2: may not be representative of the population

Sources of their uniqueness:
     
     procedural error
     
     extraordinary event
     
     extraordinary observations
    
     unique in combination of values across the variables

In [None]:
#Problem of Masking
import numpy as np
x = np.array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1000])
y = np.array([2, 2, 3, 3, 3, 4, 4, 4, 10000, 100000])

In [None]:
#Classic method for outlier detection
#|(x - mean)/ sd| > 3
print(abs((x - np.mean(x))/ np.std(x)) > 3)
print(abs((y - np.mean(y))/ np.std(y)) > 3)
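
For `y` the classic rule flags nothing: the two extreme values inflate the mean and the standard deviation so much that even 100000 has a z-score of about 2.98. A common robust alternative, which is not part of the original notebook, is the modified z-score based on the median and the median absolute deviation (MAD), with the constant 0.6745 and a cutoff of 3.5 as suggested by Iglewicz and Hoaglin. The helper name `modified_z_scores` is made up for this sketch, and it assumes the MAD is not zero.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score: 0.6745 * (x - median) / MAD. Assumes MAD > 0."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return 0.6745 * (values - median) / mad

y = np.array([2, 2, 3, 3, 3, 4, 4, 4, 10000, 100000])

# Classic rule: mean and standard deviation are distorted by the outliers
classic_flags = np.abs((y - np.mean(y)) / np.std(y)) > 3
# Robust rule: median and MAD are barely affected by the outliers
robust_flags = np.abs(modified_z_scores(y)) > 3.5

print(classic_flags)
print(robust_flags)
```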

In [None]:
#Tukey's method
from scipy.stats import iqr
#x > q(0.75) + 1.5 * IQR(x)
#x < q(0.25) - 1.5 * IQR(x)
print(x > np.quantile(x, 0.75) + 1.5 * iqr(x))
print(y > np.quantile(y, 0.75) + 1.5 * iqr(y))
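
The cell above prints only the upper-fence check. A small helper that applies both fences from the comments could look like the sketch below. The function name `tukey_outliers` and the third array with a low outlier are made up for illustration.

```python
import numpy as np

def tukey_outliers(values, k=1.5):
    """Flag values below q(0.25) - k * IQR or above q(0.75) + k * IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    spread = q3 - q1  # interquartile range
    return (values < q1 - k * spread) | (values > q3 + k * spread)

x = np.array([2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1000])
y = np.array([2, 2, 3, 3, 3, 4, 4, 4, 10000, 100000])
low = np.array([-50, 2, 3, 3, 4, 4, 5])  # hypothetical data with a low outlier

x_flags = tukey_outliers(x)
y_flags = tukey_outliers(y)
low_flags = tukey_outliers(low)
print(x_flags)
print(y_flags)
print(low_flags)
```

Because the quartiles are hardly moved by a few extreme values, both extremes in `y` are flagged here, unlike with the classic z-score rule.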

# End of Code