### <center>The Art of Cleaning Data</center>
    
Data cleaning and preprocessing are essential steps in data science and data analysis projects. You cannot go straight to modeling with raw data. As data scientists commonly say: garbage in, garbage out. It is therefore indispensable to clean your data, handle missing values, and deal with outliers before building your machine learning models.

In this tutorial we will go through several techniques for handling missing data: we inject missing values into a dataset, then drop them or impute them using the mean, median, mode, a constant value, forward fill, backward fill, and interpolation. After that, we will see how to implement some common methods to detect and remove outliers from a dataset.

In [None]:
import os
import pandas as pd 
import numpy as np 
import seaborn as sb 
from matplotlib import pyplot as plt 
import warnings
warnings.filterwarnings('ignore')
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#choose columns for the dataframe 
cols = ['Emp ID','First Name', 'Last Name','Gender','E Mail','Date of Birth','Age in Yrs.',
       'Weight in Kgs.','Year of Joining','Age in Company (Years)','Salary','State', 'Zip']
#get the data 
employees = pd.read_csv('/kaggle/input/employees/employees.csv',usecols= cols)
heart = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
#copy the employees dataset for further use
df = employees.copy()

In [None]:
#rename some columns 
df.rename(columns={'Emp ID':'ID','E Mail':'Email','Date of Birth':'Birth Date','Age in Yrs.':'Age'
          ,'Weight in Kgs.':'Weight','Age in Company (Years)':'Experience'},inplace=True)
df.head()

In [None]:
#round the Age and Experience columns 
df.Age = df.Age.apply(lambda x:int(round(x,0)))
df.Experience = df.Experience.apply(lambda x:int(round(x,0)))
df.head()

In [None]:
df.shape

In [None]:
#randomly put null values in some columns of the dataframe
#(np.nan is used directly: the pd.np alias was removed in recent pandas versions)
df.loc[df.sample(frac=0.01).index, 'Gender'] = np.nan
df.loc[df.sample(frac=0.115).index, 'Age'] = np.nan
df.loc[df.sample(frac=0.16).index, 'Weight'] = np.nan
df.loc[df.sample(frac=0.05).index, 'Salary'] = np.nan
df.loc[df.sample(frac=0.23).index, 'Experience'] = np.nan
df.loc[df.sample(frac=0.056).index, 'Birth Date'] = np.nan
#check the null counts 
df.isnull().sum()
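
Because the rows are sampled at random, the null counts change on every run. A minimal sketch, assuming a small toy frame (`toy` and the seed `42` are illustrative and not part of the original notebook), of making the injection reproducible with the `random_state` argument of `DataFrame.sample`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the employees data (hypothetical values)
toy = pd.DataFrame({'Salary': np.arange(100, dtype=float)})

# random_state fixes the random draw, so the same rows are chosen on every run
idx_first = toy.sample(frac=0.1, random_state=42).index
idx_second = toy.sample(frac=0.1, random_state=42).index

toy.loc[idx_first, 'Salary'] = np.nan
n_missing = toy['Salary'].isnull().sum()
print(n_missing)  # 10 rows out of 100
```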

In [None]:
#mess with data by changing some row values 
df.loc[df.query('Gender== "F"').sample(frac=0.021).index,'Gender'] = 'Femme'
df.loc[df.query('Gender =="M"').sample(frac=0.011).index,'Gender'] = 'Homme'
df.loc[df.query('Age == 34').sample(frac=0.1).index,'Age'] = 99999

### Dropping rows or columns with NaN

One approach is to remove all rows or columns that contain missing values. This can easily be done by calling *dropna()* on the whole dataframe.


In [None]:
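# Drop all rows with NaN values. The result is not written back to df here,
# so the imputation examples below still have missing values to work with.
df.dropna(axis=0).isnull().sum()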

Let's have a look at the function parameters: 
- inplace : if True, the changes are made in the existing DataFrame and nothing is returned. Without it, you have to assign the result back to the DataFrame.

- axis : specifies whether to drop rows (value 0) or columns (value 1).

- how : controls whether a row is removed when it contains at least one NaN ('any') or only when all of its values are NaN ('all').
- thresh : the minimum number of non-NaN values a row or column must have in order to be kept. It is a count, not a percentage.
- subset : the columns to check for NaN values when dropping rows. NaN values in other columns are ignored.
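
The parameters above are easier to compare on a tiny example. This is a sketch on a made-up frame (`toy` and its values are hypothetical), not on the employees data:

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame: one complete row, two rows with one NaN, one empty row
toy = pd.DataFrame({
    'Age':    [25, np.nan, 40, np.nan],
    'Salary': [50000, 62000, np.nan, np.nan],
    'State':  ['NY', 'CA', 'TX', np.nan],
})

any_nan = toy.dropna(how='any')        # keeps only rows with no NaN at all
all_nan = toy.dropna(how='all')        # drops only rows where every value is NaN
at_least_two = toy.dropna(thresh=2)    # keeps rows with at least 2 non-NaN values
age_only = toy.dropna(subset=['Age'])  # checks for NaN in 'Age' only

print(len(any_nan), len(all_nan), len(at_least_two), len(age_only))  # 1 3 3 2
```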



### Imputing Missing Values 

Dropping data isn't always the best way to deal with NaNs. These rows or columns might contain valuable data, and removing them can skew the dataset. In this case we can handle missing data by imputing the empty values from the other non-null values. We can either:
1. Fill NaN with Mean, Median or Mode of the data
2. Fill NaN with a constant value
3. Forward Fill or Backward Fill NaN
4. Interpolate Data and Fill NaN


#### 1. Fill NaN values with the mean, median, mode, or a constant value

In [None]:
# Using mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Using median
df['Age'] = df['Age'].fillna(int(df['Age'].median()))

# Using mode (mode() returns a Series, so take its first value)
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

#check again 
df[['Age','Gender','Salary']].isnull().sum()
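
One detail worth spelling out: `Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent. A small sketch on made-up data (the `gender` Series is hypothetical) showing how to extract the scalar before filling:

```python
import numpy as np
import pandas as pd

gender = pd.Series(['M', 'F', 'M', np.nan, 'M'])

mode_series = gender.mode()     # a Series, here holding the single value 'M'
most_frequent = mode_series[0]  # the scalar we actually want to fill with

filled = gender.fillna(most_frequent)
print(filled.tolist())  # ['M', 'F', 'M', 'M', 'M']
```

Passing `str(gender.mode())` instead would insert the printed form of the whole Series as the fill value.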

In [None]:
#we could also fill null data in a certain column with a constant 
df['Weight'] = df['Weight'].fillna(67)
# check 
df.Weight.isnull().sum()

In [None]:
#let's mess with the salaries again 
df.loc[df.sample(frac=0.05).index, 'Salary'] = np.nan
# forward fill (ffill) replaces each missing value with the last non-missing value before it
df['Salary'] = df['Salary'].ffill()
# backward fill (bfill) replaces each missing value with the first non-missing value after it
# (after the forward fill, only NaNs at the very start of the column can remain)
df['Salary'] = df['Salary'].bfill()
# we could also fill gaps by interpolation (linear here); interpolate returns a new Series, so assign it
df['Salary'] = df['Salary'].interpolate(method='linear', limit_direction='forward')
df.isnull().sum()
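
Since the three fills are applied one after another above, only the first one has much left to do. To see how they differ, here is a sketch that applies each of them separately to the same made-up Series (`s` is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

forward = s.ffill()                      # copy the last known value forward
backward = s.bfill()                     # copy the next known value backward
linear = s.interpolate(method='linear')  # straight line between known neighbours

print(forward.tolist())   # [10.0, 10.0, 10.0, 40.0, 40.0]
print(backward.tolist())  # [10.0, 40.0, 40.0, 40.0, nan]
print(linear.tolist())    # [10.0, 20.0, 30.0, 40.0, 40.0]
```

Note that the backward fill leaves the last value missing, because there is no later value to copy from.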

In [None]:
# fill the null dates with one of the most frequent dates 
df['Birth Date'] = df['Birth Date'].fillna('11/18/1965')
#check 
df.isnull().sum()

In [None]:
#fill Experience with the rounded mean of all experiences 
df['Experience'] = df['Experience'].fillna(round(df.Experience.mean(),0))
#check 
df.isnull().sum()

### Replacing inconsistent column values using a mapping dict

In [None]:
#let's have a look at the Gender column values 
df.Gender.value_counts()

In [None]:
#let's replace Homme with M and Femme with F
df['Gender'] = df['Gender'].replace(to_replace='Homme',value='M')
df['Gender'] = df['Gender'].replace(to_replace='Femme',value='F')
#also clean up the stringified mode, in case an earlier fill inserted it
df['Gender'] = df['Gender'].replace(to_replace='0    M\ndtype: object',value='M')
#check 
df.Gender.value_counts()

In [None]:
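#we can also use a mapping dictionary 
#map() returns NaN for any value that is not a key, so the valid values are listed too
gender_dict ={'M':'M','F':'F','Homme':'M','Femme':'F','0    M\ndtype: object':'M'}
#replace old values with the right values 
df['Gender'] = df.Gender.map(gender_dict)
#check 
df.Gender.value_counts()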
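
The difference between `replace` and `map` matters when the dictionary does not cover every value. A short sketch on made-up data (`gender` and `fix` are hypothetical names):

```python
import pandas as pd

gender = pd.Series(['M', 'Homme', 'F', 'Femme'])
fix = {'Homme': 'M', 'Femme': 'F'}

replaced = gender.replace(fix)  # values not in the dict are left as they are
mapped = gender.map(fix)        # values not in the dict become NaN

print(replaced.tolist())      # ['M', 'M', 'F', 'F']
print(mapped.isnull().sum())  # 2
```

So `replace` is the safer choice for patching a few bad values, while `map` suits a complete recoding of the column.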

### Removing outliers 
In machine learning projects, it is important to handle outliers while building a model, because their presence can mislead it. Outliers can shift the mean and standard deviation of the whole dataset, which can badly affect the model's performance. They also increase the error variance and reduce the power of statistical tests. Outliers arise from data entry errors, experimental errors, measurement (instrument) errors, and sampling errors. <br>
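
To illustrate how strongly a single bad value can shift the mean and standard deviation, here is a sketch on made-up ages (both Series are hypothetical; the value 99999 mirrors the bad entries injected into the Age column earlier):

```python
import pandas as pd

ages = pd.Series([31, 33, 34, 35, 37])
ages_with_error = pd.Series([31, 33, 34, 35, 99999])  # one bad entry

# The mean and standard deviation explode, while the median does not move
print(ages.mean(), ages.std(), ages.median())
print(ages_with_error.mean(), ages_with_error.std(), ages_with_error.median())
```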

Detecting outliers is one of the challenging jobs in data cleaning. There is no single precise way to detect and remove outliers, because it depends on the specifics of each dataset. Instead, assumptions and observations must be made to remove the values that seem unusual compared with the rest of the data. The two ways to detect outliers are:

- Visualization methods: box plots and scatter plots 
- Statistical methods: the interquartile range (IQR) method, the standard deviation method, etc.<br>

We will use both visual and statistical methods to handle outliers. 

In [None]:
#let's define a function to get the outliers 
def outlier_iqr(df, col):
    Q1= df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    upper_limit = Q3 + 1.5 * IQR
    lower_limit = Q1 - 1.5 * IQR
    return upper_limit, lower_limit
#using standard deviation 
def outlier_std(df, col,cutoff=2):
    upper_limit = df[col].mean() + cutoff * df[col].std()
    lower_limit = df[col].mean() - cutoff * df[col].std()
    return upper_limit, lower_limit
# define a function for plotting a boxplot
def boxplot(df,col):
    fig = plt.figure(figsize=(12,8))
    sb.set_style('darkgrid')
    sb.boxplot(data = df,x = col)
    # title follows the plotted column instead of a fixed label
    plt.title(f"Distribution of {col}")
    plt.show()
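
To make the two rules concrete, here is a worked example on made-up numbers. It recomputes the limits inline, rather than calling the functions above, so that it can run on its own. The `values` Series is hypothetical.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 60])

# IQR rule: limits are 1.5 * IQR beyond the first and third quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)  # 12.0 and 16.0
iqr = q3 - q1
iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 6.0 and 22.0

# Standard deviation rule with a cutoff of 2
std_lower = values.mean() - 2 * values.std()
std_upper = values.mean() + 2 * values.std()

iqr_outliers = values[(values < iqr_lower) | (values > iqr_upper)]
std_outliers = values[(values < std_lower) | (values > std_upper)]
print(iqr_outliers.tolist(), std_outliers.tolist())  # [60] [60]
```

Both rules flag the same value here, but they need not agree in general: the standard deviation limits are themselves inflated by the outlier, while the quartiles are not.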

In [None]:
# As the plot shows, the employees data doesn't have any outliers, 
# so we will work with another dataset that has some to deal with. 
heart.head()

### IQR Method

In [None]:
boxplot(heart,'chol')

In [None]:
#as you can see, there are some outliers beyond the upper limit of the range 
# let's deal with them 
upper, lower = outlier_iqr(heart,'chol')
#print 
print(f'The upper limit is: {upper}')
print(f'The lower limit is: {lower}')

In [None]:
# Now that we have the upper and lower limits, we can select the outliers in the column 
outliers = heart[(heart['chol'] < lower) | (heart['chol'] > upper)]['chol']
# print the outliers 
print(outliers)

In [None]:
#let's remove the outliers (values equal to a limit are not outliers, so they are kept)
heart = heart[(heart['chol'] >= lower) & (heart['chol'] <= upper)]
#plot again
boxplot(heart,'chol')
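
The same filter can be wrapped in a small reusable helper. This is only a sketch: `drop_outside` is a hypothetical name, and the toy frame and limits are made up rather than taken from the heart dataset. It relies on `Series.between`, which includes both limits by default.

```python
import pandas as pd

def drop_outside(df, col, lower, upper):
    """Return the rows of df whose value in col lies within [lower, upper]."""
    mask = df[col].between(lower, upper)  # inclusive on both ends by default
    return df[mask]

# Hypothetical values and limits, for illustration only
toy = pd.DataFrame({'chol': [180, 210, 240, 564]})
cleaned = drop_outside(toy, 'chol', lower=100, upper=400)
print(len(toy) - len(cleaned))  # 1 row removed
```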

### Standard Deviation Method

In [None]:
#let's plot this variable from the dataset 
boxplot(heart,'trestbps')

In [None]:
#there appear to be some outliers; let's get the limits 
upper, lower = outlier_std(heart,'trestbps')
#print 
print(f'The upper limit is: {upper}')
print(f'The lower limit is: {lower}')

In [None]:
# Now that we have the upper and lower limits, we can select the outliers in the column 
outliers = heart[(heart['trestbps'] < lower) | (heart['trestbps'] > upper)]['trestbps']
# print the outliers 
print(outliers)

In [None]:
#now let us remove them and check (values equal to a limit are kept)
heart = heart[(heart['trestbps'] >= lower) & (heart['trestbps'] <= upper)]
#plot 
boxplot(heart,'trestbps')
