
![banner](https://i.imgur.com/vmCCvUB.png "https://www.notion.so/sterlingdatascience")
## Data Science MSc - DATA101: Principles of Data Science
# DATA101: Cleaning and Preparing Data

## About

This is the code notebook for the "Cleaning and Preparing Data" lecture of Sterling Osborne's Data Science MSc online course found here:


### Contents:
1. Missing/Null Values
    - Identify Missing Continuous Values
    - Identify Missing Categorical Values
    - Fix Missing Continuous Values
    - Fix Missing Categorical Values
2. Outlier Values
    - Check for and Remove Outlier Values 
3. Transforming Variables
    - Log Scale for Visualisations
    - Normalization
4. Other Methods
    - Feature Reduction
    - Data Split into Train/Test Subsets
5. Conclusion

Code Notebook by Sterling Osborne

October 2019

https://twitter.com/DataOsborne

---


---


# Import and Summarise Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
employee_data = pd.read_csv('../input/ibm-hr-analytics-attrition-dataset/WA_Fn-UseC_-HR-Employee-Attrition.csv')
employee_data.head()

In [None]:
# Randomly replace some of the MonthlyRate values with NaN for our demonstration on fixing missing values
random_replace = 0.1

item_list = []
for item in employee_data['MonthlyRate']:
    rng = np.random.rand()
    if rng <= random_replace:
        item_list.append(np.nan)
    else:
        item_list.append(item)
    

employee_data['MonthlyRate'] = item_list
employee_data['MonthlyRate'].head()
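
Because the loop above draws new random numbers on every run, the number of missing values changes each time the notebook is executed. As a minimal sketch of a reproducible variant, assuming a seeded NumPy generator and a small made-up table in place of the employee data, the same replacement can be written without a loop. The seed value and the `toy` name are illustrative choices, not part of the course material.

```python
import numpy as np
import pandas as pd

# Seeded generator so that the same rows are blanked on every run
# (the seed value 42 is an arbitrary choice)
rng = np.random.default_rng(seed=42)

# Made-up stand-in for the employee data
toy = pd.DataFrame({"MonthlyRate": np.arange(1000, 2000)})

random_replace = 0.1

# One random number per row; True marks the rows to blank out
to_blank = rng.random(len(toy)) <= random_replace

# mask() replaces the marked entries with NaN and keeps the rest
toy["MonthlyRate"] = toy["MonthlyRate"].mask(to_blank)

# Share of missing values, close to random_replace
print(toy["MonthlyRate"].isna().mean())
```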

In [None]:
list(employee_data)

In [None]:
data_types = employee_data.dtypes
data_types

## Change Data Types

We note that the MonthlyRate feature is the only float column, which appears to be incorrect. Therefore, we try to change it to an integer column to align with the other features.

However, we receive an error because this column contains nan values, so we first need to check whether missing/null values exist in the dataset.
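
To see why the conversion fails, the sketch below reproduces the error on a small made-up column. It also shows the pandas nullable "Int64" dtype as one possible way to hold integers alongside missing values. This is only an aside: the notebook itself takes the route of fixing the missing values first.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the MonthlyRate column
rates = pd.Series([19479.0, np.nan, 2396.0], name="MonthlyRate")

try:
    rates.astype(int)
    cast_failed = False
except ValueError as err:
    # NaN has no integer representation, so pandas refuses the cast
    cast_failed = True
    print("Cast failed:", err)

# The nullable integer dtype stores the gap as <NA> instead of failing
rates_nullable = rates.astype("Int64")
print(rates_nullable)
```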


In [None]:
employee_data['MonthlyRate'] = employee_data['MonthlyRate'].astype(int)

---

# Missing/Null Values

## Check for Missing/Null Values

### Continuous Variables

For continuous variables we can use the describe function to return the count result for each feature and compare to the length of the dataset. We find that only the "MonthlyRate" column has missing/null values.
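
As a complementary sketch, the same check can be made by counting nan values directly with `isna()`, shown here on a small made-up table rather than the employee data. The column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up numeric table with two missing MonthlyRate values
toy = pd.DataFrame({
    "Age": [41, 49, 37, 33],
    "DailyRate": [1102, 279, 1373, 1392],
    "MonthlyRate": [19479.0, np.nan, 2396.0, np.nan],
})

# Number of missing values per column
missing_count = toy.isna().sum()

# Share of missing values per column, as a percentage
missing_percent = toy.isna().mean() * 100

print(pd.DataFrame({"missing": missing_count, "percent": missing_percent}))
```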

In [None]:
len(employee_data)

In [None]:
employee_data.describe().iloc[0]

In [None]:
employee_data['MonthlyRate'][(employee_data['MonthlyRate']).isna()==True].head()

In [None]:
len(employee_data['MonthlyRate'][(employee_data['MonthlyRate']).isna()==True])

In [None]:
len(employee_data['MonthlyRate'][(employee_data['MonthlyRate']).isna()==True])/len(employee_data)*100

### Categorical Variables

For categorical variables, we first select only the relevant columns. This can be done either manually as shown in the previous lecture or by applying a more systematic approach. 


More specifically, a useful habit when programming is to extract lists of information and subset on these. For example, instead of manually sub-setting the original data just for categorical features, we can extract the list of categorical features from the '.dtypes' attribute and use this directly for our column selection, as shown in the comparison cells below. In short, we:

1. Subset the '.dtypes' result only for 'object' columns
2. Take the index of these (row names of a list) that correspond to the column headers
3. Use this index list as the column selection for sub-setting the original data for categorical features only

With this completed, we can again check the columns manually or apply a simple loop function to systematically check each.
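
The three steps above can be sketched on a small made-up table as follows. The sketch also shows `select_dtypes()`, a pandas method that gives the same selection in one call. The text columns are created with an explicit object dtype so that the example behaves the same across pandas versions.

```python
import pandas as pd

# Made-up table with one numeric and two text columns
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "Department": pd.Series(["Sales", "Research & Development", "Sales"], dtype="object"),
    "OverTime": pd.Series(["Yes", "No", "Yes"], dtype="object"),
})

# Steps 1 and 2: keep the 'object' entries of .dtypes and take their index
data_types = toy.dtypes
categorical_columns = list(data_types[data_types == "object"].index)

# Step 3: use the list of names as the column selection
toy_categorical = toy[categorical_columns]

# Same selection in a single call
toy_categorical_alt = toy.select_dtypes(include="object")

print(categorical_columns)
```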


**Your first loop!**

Unlike with continuous variables, we do not have a simple function that provides the null-value information we need for all columns. Instead, we need to perform a more manual check. This could mean going through each column one-by-one with our isna check, but this is laborious and unnecessary.

Instead, we introduce an important feature of programming: **iterative loops**. [Pythonforbeginners.com](https://www.pythonforbeginners.com/loops/for-while-and-nested-loops-in-python) provides a gentle introduction to loops in Python and summarises how the "for loop" works as follows:

> The for loop that is used to iterate over elements of a sequence, it is often used when you have a piece of code which you want to repeat "n" number of time. 
> It works like this: " for all elements in a list, do this "

For loop summary:

1. *Generate list of categorical feature column names*
2. ***Iterate** over this list where each step is a column name*
3. *At each step, create a variable only for this column*
4. ***Return**:* 
    1. The column name
    2. The list of unique values for that column
5. *Format results with some extra print functions*


We confirm with this simple print loop that "JobRole" and "MaritalStatus" are the only categorical features with missing/null values.
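
As a small self-contained sketch of the loop summarised above, the example below runs the same steps on a made-up two-column table. It also records which columns contain missing values, which is an optional extra step not in the summary.

```python
import numpy as np
import pandas as pd

# Made-up categorical table; JobRole has one missing value
toy_categorical = pd.DataFrame({
    "JobRole": pd.Series(["Manager", np.nan, "Sales Executive"], dtype="object"),
    "OverTime": pd.Series(["Yes", "No", "Yes"], dtype="object"),
})

columns_with_missing = []
for col_id in list(toy_categorical):
    # Create a variable for this column only
    column = toy_categorical[col_id]
    n_missing = column.isna().sum()

    # Print the column name, its unique values and the missing count
    print("-----------------------------")
    print(col_id)
    print(column.unique())
    print("missing:", n_missing)

    if n_missing > 0:
        columns_with_missing.append(col_id)

print(columns_with_missing)
```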

In [None]:
employee_data_categorical = employee_data[['Attrition', 'BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus', 'Over18', 'OverTime']]
employee_data_categorical.head()

In [None]:
data_types[data_types=='object'].index

In [None]:
employee_data_categorical = employee_data[data_types[data_types=='object'].index]
employee_data_categorical.head()

In [None]:
len(employee_data_categorical['MaritalStatus'][(employee_data_categorical['MaritalStatus']).isna()==True])

In [None]:
len(employee_data_categorical['JobRole'][(employee_data_categorical['JobRole']).isna()==True])

In [None]:
employee_data_categorical_columns = list(employee_data_categorical)
employee_data_categorical_columns

In [None]:
employee_data_categorical_columns = list(employee_data_categorical)

print("Categorical Features - unique value checks")
for col_id in employee_data_categorical_columns:
    column = employee_data_categorical[col_id]
    print("-----------------------------")
    print(col_id)
    print(column.unique())


In [None]:
employee_data['JobRole'][(employee_data['JobRole']).isna()==True].head()

In [None]:
len(employee_data['JobRole'][(employee_data['JobRole']).isna()==True])/len(employee_data)*100

In [None]:
employee_data['MaritalStatus'][(employee_data['MaritalStatus']).isna()==True].head()

In [None]:
len(employee_data['MaritalStatus'][(employee_data['MaritalStatus']).isna()==True])/len(employee_data)*100

---

# Fixing Missing Values
### Continuous Variables

#### 1. Remove the entry → delete row

We can either subset the dataset to keep only the rows without na in the MonthlyRate column, or we can use the built-in pandas function "dropna()": https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html.
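
A minimal sketch on a made-up table suggests that the two approaches give the same result. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing MonthlyRate value
toy = pd.DataFrame({
    "DailyRate": [1102, 279, 1373],
    "MonthlyRate": [19479.0, np.nan, 2396.0],
})

# Option 1: boolean subset that keeps rows where MonthlyRate is present
removed_mask = toy[toy["MonthlyRate"].notna()]

# Option 2: dropna() restricted to the MonthlyRate column
removed_dropna = toy.dropna(subset=["MonthlyRate"])

# Both options keep the same rows
same = removed_mask.equals(removed_dropna)
print(same, len(removed_dropna))
```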

In [None]:
employee_data['MonthlyRate'].iloc[0:20]

In [None]:
employee_data_removed_1 = employee_data[employee_data['MonthlyRate'].isna()==False]
print(len(employee_data_removed_1))

In [None]:
employee_data_removed_2 = employee_data.dropna(subset = ['MonthlyRate'])
print(len(employee_data_removed_2))

#### 2. Remove the feature → delete column if lots of missing values

First, we check how many rows in the MonthlyRate column are missing. We find this to be approximately 9.2%, which is not enough to justify simply removing the column.

However, we show how to drop a single column in case this is more significant in your case.
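
When there are many columns, this decision can be applied to all of them at once. The sketch below assumes a hypothetical cut-off of 50% missing values (`max_missing_share`) and a made-up table. The right threshold depends on your data and is a judgment call.

```python
import numpy as np
import pandas as pd

# Made-up table: MonthlyRate is 25% missing, Notes is 75% missing
toy = pd.DataFrame({
    "DailyRate": [1102, 279, 1373, 1392],
    "MonthlyRate": [19479.0, np.nan, 2396.0, 23159.0],
    "Notes": [np.nan, np.nan, np.nan, 1.0],
})

# Hypothetical cut-off: drop columns with more than half their values missing
max_missing_share = 0.5

# Share of missing values per column
missing_share = toy.isna().mean()

columns_to_drop = list(missing_share[missing_share > max_missing_share].index)
toy_reduced = toy.drop(columns_to_drop, axis=1)

print(columns_to_drop)
print(list(toy_reduced))
```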

In [None]:
print(np.round(len(employee_data[employee_data['MonthlyRate'].isna()==True])/len(employee_data),3)*100, "%")

In [None]:
employee_data_removed_3 = employee_data.drop('MonthlyRate', axis=1)
list(employee_data_removed_3)

#### 3. Fill in value manually → requires strong justification for chosen values and time consuming

This is not often advised but in some cases, we can deduce what the missing values should be based on a link to another column. For example, it seems logical that MonthlyRate could be linked to DailyRate.

We therefore estimate the multiplication value based on known values.

We find that this multiplication factor aligns with the fact that there are 28-31 days in a month. Therefore, we could replace missing values in the MonthlyRate with the result of $MonthlyRate = DailyRate * 30$.

In order to replace just the missing values, we use either the pandas fillna() function or the numpy function 'np.where', which works similarly to the Excel 'if()' function: https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html.
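
As a small worked sketch on made-up values, both functions fill the single missing MonthlyRate with $DailyRate * 30$ and leave the known values untouched.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing MonthlyRate value
toy = pd.DataFrame({
    "DailyRate": [1102, 279, 1373],
    "MonthlyRate": [19479.0, np.nan, 2396.0],
})

# Estimated monthly rate for every row
estimate = toy["DailyRate"] * 30

# fillna(): only the missing entries take the estimate
filled_fillna = toy["MonthlyRate"].fillna(estimate)

# np.where(): if missing, use the estimate, else keep the original value
filled_where = pd.Series(
    np.where(toy["MonthlyRate"].isna(), estimate, toy["MonthlyRate"]),
    index=toy.index,
    name="MonthlyRate",
)

print(filled_fillna.tolist())
print(filled_where.tolist())
```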

In [None]:
employee_data[['DailyRate','MonthlyRate']].head(20)

In [None]:
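employee_data_monthlyrate_dailyrate_div = employee_data['MonthlyRate']/employee_data['DailyRate']
employee_data_monthlyrate_dailyrate_div.head()

In [None]:
# NaN ratios (rows with a missing MonthlyRate) carry no information here, so drop them before plotting
plt.hist(employee_data_monthlyrate_dailyrate_div.dropna(), bins=20)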
plt.title("Histogram of the Multiplication Factor for \n Daily to Monthly Rate")
plt.show()

In [None]:
# fillna() function
employee_data_replaced_1 = employee_data.copy()
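# Assign the result back: in-place fillna on a selected column is unreliable in recent pandas versions
employee_data_replaced_1['MonthlyRate'] = employee_data_replaced_1['MonthlyRate'].fillna(employee_data_replaced_1['DailyRate']*30)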
employee_data_replaced_1[['DailyRate','MonthlyRate']].head(20)

In [None]:
# np.where() function
employee_data_replaced_1 = employee_data.copy()
employee_data_replaced_1['MonthlyRate'] = np.where(employee_data_replaced_1['MonthlyRate'].isna()==True,
                                                       employee_data_replaced_1['DailyRate']*30,
                                                       employee_data_replaced_1['MonthlyRate'])
employee_data_replaced_1[['DailyRate','MonthlyRate']].head(20)

#### 4. Use a standard value to replace with → e.g. "unknown", works well if only a few for categorical features

Similarly, we could just replace missing values with 0.

In [None]:
# fillna() function
employee_data_replaced_2 = employee_data.copy()
employee_data_replaced_2['MonthlyRate'] = employee_data_replaced_2['MonthlyRate'].fillna(0)
employee_data_replaced_2[['DailyRate','MonthlyRate']].head(20)

In [None]:
# np.where() function
employee_data_replaced_2 = employee_data.copy()
employee_data_replaced_2['MonthlyRate'] = np.where(employee_data_replaced_2['MonthlyRate'].isna()==True,
                                                       0,
                                                       employee_data_replaced_2['MonthlyRate'])
employee_data_replaced_2[['DailyRate','MonthlyRate']].head(20)

#### 5. Use an average value to replace with → mean, median and mode, works well if only a few for continuous features

Again, we can very easily replace with the mean and median values (mode is advised for categorical features).
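
The choice between mean and median matters when the column is skewed. As a minimal sketch with made-up values, one very large rate pulls the mean far above the typical value while the median stays close to it.

```python
import numpy as np
import pandas as pd

# Made-up rates with one missing value and one very large value
rates = pd.Series([2000.0, 2100.0, np.nan, 2200.0, 20000.0], name="MonthlyRate")

# Mean of the known values: (2000 + 2100 + 2200 + 20000) / 4
filled_mean = rates.fillna(rates.mean())

# Median of the known values: midpoint of 2100 and 2200
filled_median = rates.fillna(rates.median())

print("Filled with mean:  ", filled_mean[2])
print("Filled with median:", filled_median[2])
```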


In [None]:
# fillna() function
employee_data_replaced_3 = employee_data.copy()
employee_data_replaced_3['MonthlyRate'] = employee_data_replaced_3['MonthlyRate'].fillna(employee_data_replaced_3['MonthlyRate'].mean())
employee_data_replaced_3[['DailyRate','MonthlyRate']].head(20)

In [None]:
# np.where() function
employee_data_replaced_3 = employee_data.copy()
employee_data_replaced_3['MonthlyRate'] = np.where(employee_data_replaced_3['MonthlyRate'].isna()==True,
                                                       employee_data_replaced_3['MonthlyRate'].mean(),
                                                       employee_data_replaced_3['MonthlyRate'])
employee_data_replaced_3[['DailyRate','MonthlyRate']].head(20)

In [None]:
# fillna() function
employee_data_replaced_4 = employee_data.copy()
employee_data_replaced_4['MonthlyRate'] = employee_data_replaced_4['MonthlyRate'].fillna(employee_data_replaced_4['MonthlyRate'].median())
employee_data_replaced_4[['DailyRate','MonthlyRate']].head(20)

In [None]:
# np.where() function
employee_data_replaced_4 = employee_data.copy()
employee_data_replaced_4['MonthlyRate'] = np.where(employee_data_replaced_4['MonthlyRate'].isna()==True,
                                                       employee_data_replaced_4['MonthlyRate'].median(),
                                                       employee_data_replaced_4['MonthlyRate'])
employee_data_replaced_4[['DailyRate','MonthlyRate']].head(20)

#### 6. Use an average value for each class to replace with → e.g. mean for age group 18-24

A little more complicated: we use the classes of a logically related categorical feature as a more sophisticated indication of what the missing values should be. For example, we can use the 'Department' feature as a better indicator.

To calculate the statistics by classes, we use the 'groupby()' function.

Continuing from our previous replacement method, we extend the np.where function and can nest if-else statements for a manual replacement.

**Advanced:** Because this is a very manual replacement, we can utilise a loop as before to speed up the replacement, which is particularly important if we have more classes. Because this is fairly complex, I have done my best to summarise the process in the comments, but you may wish to find an alternative method for achieving the same result if you are more comfortable with programming.
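
One possible alternative, sketched here on a made-up table, is to combine `groupby()` with `transform('mean')`. This returns the class mean aligned to every row, so a single `fillna()` call can replace the nested np.where statements or the loop. This is a suggestion rather than the method used in the cells below.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing MonthlyRate value per department
toy = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "Human Resources", "Human Resources"],
    "MonthlyRate": [10000.0, 20000.0, np.nan, 4000.0, np.nan],
})

# Mean of the known MonthlyRate values within each department,
# repeated on every row of that department
department_mean = toy.groupby("Department")["MonthlyRate"].transform("mean")

# Missing entries take the mean of their own department
toy["MonthlyRate_filled"] = toy["MonthlyRate"].fillna(department_mean)

print(toy)
```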

In [None]:
employee_data['Department'].head()

In [None]:
employee_data['Department'].unique()

In [None]:
employee_data_department_monthylratemean = employee_data.groupby('Department')['MonthlyRate'].mean()
employee_data_department_monthylratemean

In [None]:
employee_data_department_monthylratemean['Sales']

In [None]:
employee_data_replaced_5 = employee_data.copy()
employee_data_replaced_5['MonthlyRate'] = np.where((employee_data_replaced_5['MonthlyRate'].isna()==True)&(employee_data_replaced_5['Department']=='Sales'),
                                                    employee_data_department_monthylratemean['Sales'],
                                            np.where((employee_data_replaced_5['MonthlyRate'].isna()==True)&(employee_data_replaced_5['Department']=='Research & Development'),
                                                      employee_data_department_monthylratemean['Research & Development'],
                                                np.where((employee_data_replaced_5['MonthlyRate'].isna()==True)&(employee_data_replaced_5['Department']=='Human Resources'),
                                                         employee_data_department_monthylratemean['Human Resources'],employee_data_replaced_5['MonthlyRate'])))


employee_data_replaced_5[['Department','MonthlyRate']].head(20)

In [None]:
# REPLACE EACH ITEM IN THE MONTHLYRATE COLUMN WITH THE AVERAGE OF THE DEPARTMENT CLASS USING A FOR LOOP
#-----------------------------------------------------------------------------------------------------
# Create a copy of the employee dataset
# Compute the mean value of the MonthlyRate for each Department class
#
# MAIN LOOP:
#
# Initialise an empty list for logging results
# for each index,row in our employee data:
#     if the MonthlyRate value is not null --> don't change
#     else:
#         find the Department class for the row, look up the previously computed mean and replace with this value
#
#     store each value into the initialised list with the ".append()" function
#
# replace the MonthlyRate column with this updated list
#-----------------------------------------------------------------------------------------------------

employee_data_replaced_6 = employee_data.copy()

# Find the mean results of the MonthlyRate column for each Department class for all known values
employee_data_department_monthylratemean = employee_data.groupby('Department')['MonthlyRate'].mean()

MonthlyRate_replacement = []
for n,row in employee_data_replaced_6.iterrows():
    if pd.notnull(row['MonthlyRate']):
        MonthlyRate_value = row['MonthlyRate']
    else:
        row_dep_class = row['Department']
        row_dep_class_mean = employee_data_department_monthylratemean[row_dep_class]
        MonthlyRate_value = row_dep_class_mean
    
    MonthlyRate_replacement.append(MonthlyRate_value)


employee_data_replaced_6['MonthlyRate'] = MonthlyRate_replacement
employee_data_replaced_6[['Department','MonthlyRate']].head(20)

#### 7. Use algorithm output values to replace with → e.g. regression or decision tree predictions 

Our features do not correlate well in a simple linear regression, so this approach would not be advised here, but we can still show how it would work. First, we use the sklearn package to fit a linear regression line to the known values of "TotalWorkingYears" and "MonthlyRate". With this, we predict each missing "MonthlyRate" value from the known "TotalWorkingYears" value for that row.

The previous "fillna()" and "np.where()" replacement functions do not work directly in this case, so we use the more complex for loop method shown before.
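
As a possible alternative to the loop, the predictions can be written back in one step by selecting the missing rows with a boolean mask and `.loc`. The sketch below assumes a made-up, perfectly linear table so that the filled values are easy to check by eye. Real data will not behave this neatly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up table where MonthlyRate is exactly 1000 * TotalWorkingYears
toy = pd.DataFrame({
    "TotalWorkingYears": [1, 2, 3, 4, 5],
    "MonthlyRate": [1000.0, 2000.0, np.nan, 4000.0, np.nan],
})

# Rows with a known MonthlyRate are used to fit the model
known = toy[toy["MonthlyRate"].notna()]
missing = toy["MonthlyRate"].isna()

regr = LinearRegression()
regr.fit(known[["TotalWorkingYears"]], known["MonthlyRate"])

# Predict only for the missing rows and write the results back in place
toy.loc[missing, "MonthlyRate"] = regr.predict(toy.loc[missing, ["TotalWorkingYears"]])

print(toy)
```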


In [None]:
list(employee_data)

In [None]:
plt.scatter(employee_data['TotalWorkingYears'], employee_data['MonthlyRate'])
plt.title("Comparison of Total Working Years against \n Monthly Rate for all Employees")
plt.xlabel("Total Working Years")
plt.ylabel("Monthly Rate")
plt.show()

In [None]:
# https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
from sklearn import linear_model

In [None]:
employee_data_replaced_7 = employee_data.copy()

employee_data_7_NAN = employee_data_replaced_7[employee_data_replaced_7['MonthlyRate'].isna()==True]
employee_data_7 = employee_data_replaced_7[employee_data_replaced_7['MonthlyRate'].isna()==False]

X = employee_data_7[['TotalWorkingYears']]
y = employee_data_7[['MonthlyRate']]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X, y)

# The regression coefficient (slope of the fitted line)
print('Coefficients: \n', regr.coef_)

# The intercept
print('Intercept: \n', regr.intercept_)


In [None]:
plt.scatter(employee_data['TotalWorkingYears'], employee_data['MonthlyRate'])
plt.plot(X, regr.predict(X),color='r')
plt.title("Comparison of Total Working Years against \n Monthly Rate for all Employees")
plt.xlabel("Total Working Years")
plt.ylabel("Monthly Rate")
plt.show()

In [None]:
# Make predictions for the rows with a missing MonthlyRate
nan_pred = regr.predict(employee_data_7_NAN[['TotalWorkingYears']])
nan_pred[0:10]

In [None]:
employee_data_replaced_7 = employee_data.copy()

MonthlyRate_replacement = []
for n,row in employee_data_replaced_7.iterrows():
    if pd.notnull(row['MonthlyRate']):
        MonthlyRate_value = row['MonthlyRate']
    else:
        regr_pred = regr.predict([row[['TotalWorkingYears']]])
        MonthlyRate_value = regr_pred[0][0]
    
    MonthlyRate_replacement.append(MonthlyRate_value)


employee_data_replaced_7['MonthlyRate'] = MonthlyRate_replacement

employee_data_replaced_7[['TotalWorkingYears','MonthlyRate']].head(20)

### Categorical Variables

Unlike for continuous variables, we have fewer replacement choices for categorical variables. As before, we can simply remove the row or column, add an improved class label (e.g. "unknown") or replace with the mode.

The first thing to consider is that the MaritalStatus feature has almost 26% of its values missing. Furthermore, this may be considered "personal information" that has no relevance to making predictions or that may be unethical to base decisions on. Therefore, we can make a solid justification for removing this feature completely.
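
The "unknown" label option mentioned above can be sketched as follows on a made-up column, next to the mode replacement for comparison. The label text "Unknown" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with one missing value
toy = pd.DataFrame({
    "JobRole": pd.Series(["Manager", np.nan, "Manager", "Sales Executive"], dtype="object"),
})

# Option 1: keep the gap visible as its own class
filled_unknown = toy["JobRole"].fillna("Unknown")

# Option 2: replace with the most frequent class
# mode() returns a Series, so [0] takes the first (most frequent) value
filled_mode = toy["JobRole"].fillna(toy["JobRole"].mode()[0])

print(filled_unknown.tolist())
print(filled_mode.tolist())
```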

In [None]:
print(np.round(len(employee_data['JobRole'][(employee_data['JobRole']).isna()==True])/len(employee_data)*100,2),"%")

In [None]:
print(np.round(len(employee_data['MaritalStatus'][(employee_data['MaritalStatus']).isna()==True])/len(employee_data)*100,2),"%")

In [None]:
employee_data_removed_8 = employee_data.copy()
employee_data_removed_8 = employee_data_removed_8.drop('MaritalStatus',axis=1)
list(employee_data_removed_8)

In [None]:
employee_data_removed_8['JobRole'].mode()[0]

In [None]:
# fillna() function
employee_data_removed_9 = employee_data.copy()
employee_data_removed_9['JobRole'] = employee_data_removed_9['JobRole'].fillna(employee_data_removed_9['JobRole'].mode()[0])
employee_data_removed_9[['JobRole']].head(20)

In [None]:
# np.where() function
employee_data_removed_9 = employee_data.copy()
employee_data_removed_9['JobRole'] = np.where(employee_data_removed_9['JobRole'].isna()==True,
                                                       employee_data_removed_9['JobRole'].mode()[0],
                                                       employee_data_removed_9['JobRole'])
employee_data_removed_9[['JobRole']].head(20)

### Final Removed and Replaced Dataset

1. MonthlyRate replaced with mean value
2. MaritalStatus feature removed
3. JobRole replaced with mode class

In [None]:
employee_data_clean = employee_data.copy()

employee_data_clean['MonthlyRate'] = employee_data_clean['MonthlyRate'].fillna(employee_data_clean['MonthlyRate'].mean())
employee_data_clean = employee_data_clean.drop('MaritalStatus',axis=1)
employee_data_clean['JobRole'] = employee_data_clean['JobRole'].fillna(employee_data_clean['JobRole'].mode()[0])

#Final check for any null values, if false then there are none :)
employee_data_clean.isnull().values.any()

In [None]:
employee_data_clean.head(10)

## Change Data Types - Re-visited

Now that missing values have been fixed, we can attempt to change the MonthlyRate data type to align with the other features.


In [None]:
employee_data_clean['MonthlyRate'] = employee_data_clean['MonthlyRate'].astype(np.int64)
data_types_clean = employee_data_clean.dtypes
data_types_clean

---

# Check and Fix Outlier Values

### Numerical Computation

If a data point lies more than 3 standard deviations from the mean, it is very likely to be anomalous or an outlier.

Based on this, we find that MonthlyIncome appears to have no clear outliers.
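
The rule can be sketched on a small made-up column as follows. Each value is converted to a z-score (its distance from the mean in standard deviations), and anything beyond 3 in either direction is flagged.

```python
import pandas as pd

# Made-up incomes: 29 typical values and one extreme value
income = pd.Series([100.0] * 29 + [1000.0], name="MonthlyIncome")

mu = income.mean()
# Sample standard deviation (ddof=1), the same definition as statistics.stdev
sigma = income.std()

# Distance of each value from the mean, in standard deviations
z_scores = (income - mu) / sigma

# Flag values more than 3 standard deviations above or below the mean
outliers = income[z_scores.abs() > 3]

print(outliers)
```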

In [None]:
import statistics 

x = employee_data_clean['MonthlyIncome']
# mean and stdev
mu = employee_data_clean['MonthlyIncome'].mean()
sigma = statistics.stdev(employee_data_clean['MonthlyIncome'])

num_bins = 50

fig, ax = plt.subplots()

# the histogram of the data
n, bins, patches = ax.hist(x, num_bins, density=1)

# add a 'best fit' line
y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
     np.exp(-0.5 * (1 / sigma * (bins - mu))**2))
ax.plot(bins, y, 'r--')
ax.set_xlabel('MonthlyIncome')
ax.set_ylabel('Probability density')
ax.set_title(r'Histogram of MonthlyIncome for 50 bins')

# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()

In [None]:
# compute value for outlier cutoff points
mu + 3*sigma

In [None]:
print(np.round(len(employee_data_clean[employee_data_clean['MonthlyIncome']>=(mu + 3*sigma)])/len(employee_data_clean)*100,5), "%")

In [None]:
print(np.round(len(employee_data_clean[employee_data_clean['MonthlyIncome']<=(mu - 3*sigma)])/len(employee_data_clean)*100,5), "%")


### Visual Computation

Alternatively, we can use a box plot for visual outlier detection and deduce that any point above, say, 17,000 is an outlier.
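
The points a box plot draws beyond its whiskers follow, by default, the 1.5 x IQR rule, so a cut-off can also be computed rather than read off the chart. The sketch below applies that rule to made-up incomes. It is an alternative to the visually chosen value of 17,000, not a replacement for it.

```python
import pandas as pd

# Made-up incomes with one extreme value
income = pd.Series([2000, 2500, 3000, 3500, 4000, 4500, 5000, 19000], name="MonthlyIncome")

# First and third quartiles and the interquartile range
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are treated as outliers
outliers = income[(income < lower_fence) | (income > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```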

In [None]:
ax = sns.boxplot(x=employee_data_clean['MonthlyIncome'])
plt.title("Simple Box Plot of MonthlyIncome to find Outliers")
plt.show()

In [None]:
# .copy() so that new columns can be added later without a SettingWithCopyWarning
employee_data_clean_outliers = employee_data_clean[employee_data_clean['MonthlyIncome']<17000].copy()
print("Mean Monthly Income BEFORE removing outliers = ", employee_data_clean['MonthlyIncome'].mean())
print("Mean Monthly Income AFTER removing outliers = ", employee_data_clean_outliers['MonthlyIncome'].mean())

# Transforming Variables

### Logarithm Scales for Visualisations

In [None]:
plt.scatter(employee_data_clean_outliers['MonthlyIncome'],employee_data_clean_outliers['MonthlyRate'])
plt.title("Comparison of Monthly Income against Monthly Rate")
plt.xlabel("Monthly Income")
plt.ylabel("Monthly Rate")
plt.show()

In [None]:
plt.scatter(employee_data_clean_outliers['MonthlyIncome'],employee_data_clean_outliers['MonthlyRate'])
plt.title("Comparison of Monthly Income against Monthly Rate (log scale)")
plt.xlabel("Monthly Income")
plt.ylabel("Monthly Rate (log scale)")

plt.yscale("log")

plt.show()

In [None]:
plt.scatter(employee_data_clean_outliers['MonthlyIncome'],employee_data_clean_outliers['MonthlyRate'])
plt.title("Comparison of Monthly Income (log scale) against Monthly Rate")
plt.xlabel("Monthly Income (log scale)")
plt.ylabel("Monthly Rate")

plt.xscale("log")

plt.show()

In [None]:
plt.scatter(employee_data_clean_outliers['MonthlyIncome'],employee_data_clean_outliers['MonthlyRate'])
plt.title("Comparison of Monthly Income (log scale) against Monthly Rate (log scale)")
plt.xlabel("Monthly Income (log scale)")
plt.ylabel("Monthly Rate (log scale)")

plt.yscale("log")
plt.xscale("log")

plt.show()

### Normalisation for Improved Comparison

We use normalisation when variables to compare are on vastly different scales. For example, HourlyRate and MonthlyIncome would otherwise be incomparable.
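
The form of normalisation used below is min-max scaling, $x' = (x - x_{min}) / (x_{max} - x_{min})$, which maps every feature to the range 0 to 1. As a minimal sketch on made-up values, the formula can be wrapped in a small helper. The function name `min_max_normalise` is hypothetical and not part of pandas.

```python
import pandas as pd

def min_max_normalise(column):
    """Rescale a numeric Series to the range [0, 1]."""
    return (column - column.min()) / (column.max() - column.min())

# Made-up features on very different scales
toy = pd.DataFrame({
    "HourlyRate": [30, 65, 100],
    "MonthlyIncome": [1000, 10500, 20000],
})

# apply() runs the helper on each column in turn
toy_norm = toy.apply(min_max_normalise)

print(toy_norm)
```

Note that a column with a single repeated value has $x_{max} = x_{min}$, so this formula would divide by zero and return nan for that column.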

In [None]:
plt.hist(employee_data_clean_outliers['HourlyRate'], alpha=0.5)
plt.hist(employee_data_clean_outliers['MonthlyIncome'], alpha=0.5)
plt.title("Histogram comparison between the \n Monthly Income and Hourly Rate")
plt.show()

In [None]:
hourly_rate = employee_data_clean_outliers['HourlyRate']
monthly_income = employee_data_clean_outliers['MonthlyIncome']

# Min-max normalisation: rescale each feature to the range [0, 1]
employee_data_clean_outliers['HourlyRate_norm'] = (hourly_rate - hourly_rate.min()) / (hourly_rate.max() - hourly_rate.min())
employee_data_clean_outliers['MonthlyIncome_norm'] = (monthly_income - monthly_income.min()) / (monthly_income.max() - monthly_income.min())


In [None]:
plt.hist(employee_data_clean_outliers['HourlyRate_norm'], alpha=0.5)
plt.hist(employee_data_clean_outliers['MonthlyIncome_norm'], alpha=0.5)
plt.title("Normalised Histogram comparison between the \n Monthly Income and Hourly Rate")
plt.show()

# Other Methods

### Feature Reduction

#### Manual Removal

We have covered this previously when we removed the "MaritalStatus" feature.

#### Feature selection

Will cover in Machine Learning.

#### PCA merging features

Will cover in Machine Learning.


### Data Split into Train/Test Subsets

Will cover in Machine Learning.



---

## Conclusion

In this notebook, we have successfully demonstrated how to complete the majority of pre-processing steps in Python. This includes checking for and removing missing/null values and outliers. The final pre-processing steps will be covered in later modules.

It is important to understand that pre-processing often requires a good understanding of the data to make decisions. Often data available publicly will already be clean but it is important to check before applying any Machine Learning or Visual Analytics. 
