<font color='orange' size='5'>**Data Cleaning and Preprocessing**</font>


<div style="border-top: 3px solid black"></div>  


<font color='orange' size='5'>**Steven Archuleta**</font><br>  
<font color='#000080'>Interview Candidate</font><br>  
<font color='#000080'>Research Data Analyst I/II</font><br>

<div style="float: left"> <img src="../reports/figures/RADD_logo.jpg" /> </div>
<div style="clear: both"></div>

<font color='black'>

- CDSS (California Department of Social Services)
- RADD (Research Automation Data Division)
- RDI (Research Data Insights Branch)
- EARS (Exploratory Analysis & Research Section)
- AM (Analysis & Modelling Unit)

</font>


<div style="border-top: 3px solid black"></div>  

&nbsp;
<a id='top'></a>  
**TABLE OF CONTENTS**

&nbsp;
    
|<font color=white>.</font> | <font color=white>.</font> | <font color=white>.</font> |<font color=white>.</font> |
|:----------|:---------|:---------|:---------|
| <a href='#libraries'>1. Libraries and Packages</a> | <a href='#readdata'>2. Load Data</a> | <a href='#shape'>3. Shape of Data Frame</a>  |<a href='#sample'>4. Sample of DataFrame</a> |
| <a href='#columnnames'>5. Feature Names</a> |<a href='#comprehensions'>6. List and Dictionary Comprehensions</a>|<a href='#info'>7. Dataframe Info</a>| <a href='#fix'>8. Fix Datatypes</a> | 
| [9. Dictionary Comprehension](#dictionary) |[10. DataType Count](#dtcount) | [11. DataFrame Information](#info) | [12. Fix Datatypes](#fixdt) |
| [13. Data Description](#describe)| [14. Duplicate Data Objects](#duplicates) | [15. Drop Rows](#drop)  |[16. Outliers](#outliers)|
|  [17. Missing Values](#missing) | [18. Impute Values](#impute) | [19. Unique Values](#uniquevalues)| [20. Negative Values](#negative)|
|  [21. Values Equal Zero](#zero) | [22. Values Equal 100](#100) |[23. Contingency Tables](#crosstab) |  [24. Group By](#groupby) |
| [25. EDA](#eda) | [26. Rename Columns](#rename) | [27. Combinations of KPIs](#kpi) | [28. Sandbox](#sandbox) |
&nbsp;  


<div style="border-top: 3px solid black"></div> 
<a id=libraries></a>
<h3 style="color:#4682B4">Import libraries and packages</h3>
<a href='#top'>🔼</a>
<hr>
<h4 style="color:orange">OBSERVATION</h4>
<p style="font-size:15px">
Libraries and packages are collections of modules that provide pre-written code for common tasks. They simplify programming and speed up data cleaning, preprocessing, and analysis in Python.
</p>
<ol>
<li>
<b>Pandas:</b>
<p>Pandas is a popular data manipulation library in Python that provides flexible data structures that make working with structured data fast and easy.</p>
<ul>
    <li>Alias: pandas as pd</li>
</ul>
</li>
<p></p>
<li>
<b>NumPy:</b>
<p>NumPy is a Python library for numerical computation on arrays. It covers mathematical and logical operations, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and more.</p>
<ul>
    <li>Alias: numpy as np</li>
</ul>
</li>
<p></p>
<li>
<b>Matplotlib:</b>
<p>Matplotlib is a plotting library for Python and its numerical mathematics extension, NumPy. It provides an object-oriented API for embedding plots into applications.</p>
<ul>
    <li>Alias: matplotlib.pyplot as plt</li>
</ul>
</li>
<p></p>
<li>
<b>Seaborn:</b>
<p>Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for creating attractive and informative statistical graphics.</p>
<ul>
    <li>Alias: seaborn as sns</li>
</ul>
</li>
<p></p>
<li>
<b>Datetime:</b>
<p>Datetime is a module in the Python standard library that supplies classes for manipulating dates and times, such as obtaining the current date and time, formatting or parsing dates, and performing time arithmetic.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>Math:</b>
<p>The Math module is a standard module in Python that provides mathematical functions. These functions cannot be used with complex numbers; use the functions of the same name from the cmath module if you require support for complex numbers.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>MissingNo:</b>
<p>MissingNo is a Python library that provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allow you to get a quick visual summary of the completeness (or lack thereof) of your dataset.</p>
<ul>
    <li>Alias: missingno as msno</li>
</ul>
</li>
<p></p>
<li>
<b>StandardScaler:</b>
<p>StandardScaler is a feature scaling method in the Scikit-Learn library in Python that standardizes features by removing the mean and scaling to unit variance. It is a common requirement for many machine learning estimators.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>Regular Expressions:</b>
<p>Regular expressions (often abbreviated as "regex") are sequences of characters that form a search pattern. They are used in Python, and many other programming languages, for text searching and text manipulation. The 're' module in Python provides support for regular expressions.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>MinMaxScaler:</b>
<p>MinMaxScaler is another feature scaling method from the Scikit-Learn library in Python. It scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>OneHotEncoder:</b>
<p>OneHotEncoder is a transformer class in the Scikit-Learn library in Python that converts categorical (text) data into a numeric array with one binary column per category, a form that most models can work with.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>LabelEncoder:</b>
<p>LabelEncoder is a Scikit-Learn utility class that normalizes labels so that they contain only values between 0 and n_classes-1, where n_classes is the number of distinct labels. It can also transform non-numerical labels into numerical labels, which matters in machine learning because many algorithms do not support non-numerical values.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>Warnings:</b>
<p>The warnings module in Python's standard library is used when you want to warn users of your program about a certain condition (that you specified) instead of raising an exception. It's typically used for deprecations, or changes in a function or behavior.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<p></p>
<li>
<b>Display:</b>
<p>The 'display' family of functions (including 'display_html', which this notebook imports) is part of the IPython.display module in Python. These functions render outputs in different formats within Jupyter notebook cells, such as HTML, JSON, PNG, JPEG, SVG, and LaTeX.</p>
<ul>
    <li>No Alias</li>
</ul>
</li>
<!-- Add as many list items as you need -->
</ol>
<a href='#top'>🔼</a>
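
The scaling and encoding classes described above can be sketched on toy data. This is a minimal illustration, not part of the notebook's pipeline: the `values` array and `labels` list are hypothetical and do not come from the orders or returns datasets.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    StandardScaler,
)

# Hypothetical toy data, not taken from the notebook's datasets
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
labels = ["low", "high", "medium", "low", "high"]

# StandardScaler: subtract the column mean, divide by the column standard deviation
standardized = StandardScaler().fit_transform(values)

# MinMaxScaler: map the column minimum to 0 and the maximum to 1 (default range)
rescaled = MinMaxScaler().fit_transform(values)

# LabelEncoder: map each distinct label to an integer in 0..n_classes-1
label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(labels)

# OneHotEncoder: one binary column per category; expects a 2-D input
one_hot = OneHotEncoder().fit_transform(np.array(labels).reshape(-1, 1)).toarray()

print(standardized.ravel())
print(rescaled.ravel())
print(list(label_encoder.classes_), list(encoded))
print(one_hot.shape)
```

LabelEncoder assigns integers in sorted label order, so the codes carry no meaning beyond identity; OneHotEncoder avoids implying an order between categories.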


In [1]:
# ===================================================
# Load the nb_black extension (if not already loaded) so that
# each code cell is automatically formatted with Black
from IPython import get_ipython

ipython = get_ipython()
if "nb_black" not in ipython.extension_manager.loaded:
    %load_ext nb_black
# ====================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import datetime
import math
import missingno as msno
import re
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

# ====================================================
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# ====================================================

# ====================================================
## Create a function to display side by side
from IPython.display import display_html


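def display_side_by_side(*args):
    """Render the given DataFrames next to each other in one output cell."""
    html_str = ""
    for df in args:
        html_str += df.to_html() + "&nbsp;&nbsp;&nbsp;"
    # Target only opening <table> tags so closing tags are left untouched
    display_html(html_str.replace("<table", '<table style="display:inline"'), raw=True)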


# ====================================================


<div style="border-top: 3px solid black"></div> 
<h3 style="color:#4682B4">Load raw dataset(s)</h3>
<a id=readdata></a>
<a href='#top'>🔼</a>

In [None]:
import os
print(os.getcwd())

# Load the CSV files (paths are relative to the working directory printed above)
orders_raw = pd.read_csv("../data/orders.csv")
returns_raw = pd.read_csv("../data/returns.csv")
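
Relative paths such as `../data/orders.csv` only resolve when the notebook runs from the expected working directory, which is presumably why the cell prints `os.getcwd()` first. A minimal sketch of a guarded load follows; the helper name `load_csv_if_present` and the variable `demo_orders` are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_if_present(path):
    """Return a DataFrame read from `path`, or None when the file is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        # Report the absolute path so a wrong working directory is easy to spot
        print(f"File not found: {csv_path.resolve()}")
        return None
    return pd.read_csv(csv_path)


# Path taken from the cell above; the result is None if the file is absent
demo_orders = load_csv_if_present("../data/orders.csv")
```

Returning `None` instead of raising keeps the notebook running during exploration; for a production script, letting `pd.read_csv` raise `FileNotFoundError` is usually the safer choice.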

<div style="border-top: 3px solid black"></div> 
<a id=shape></a>
<h3 style="color:#4682B4">Shape of Dataframe(s)</h3>
<a href='#top'>🔼</a>

In [None]:
# Create the two dataframes
orders_shape_raw = pd.DataFrame(
    {"Rows": [orders_raw.shape[0]], "Columns": [orders_raw.shape[1]]},
    index=["orders_raw"],
)
returns_shape_raw = pd.DataFrame(
    {"Rows": [returns_raw.shape[0]], "Columns": [returns_raw.shape[1]]},
    index=["returns_raw"],
)

# Call function to display the two dataframes side by side
display_side_by_side(orders_shape_raw, returns_shape_raw)

<div style="border-top: 3px solid black"></div> 
<a id=sample></a>
<h3 style="color:#4682B4">Display Random Sample of DataFrame(s)</h3>
<a href='#top'>🔼</a>

In [None]:
# Create copies of the raw data so the originals stay unchanged
orders_1 = orders_raw.copy()
returns_1 = returns_raw.copy()

# Obtain the same random results every time
# View a sample of 10 data objects
np.random.seed(1)
orders_1.sample(n=10)
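
Seeding NumPy's global generator works because `DataFrame.sample` falls back to it when no `random_state` is given, but the seed then affects every later random call in the session. A minimal sketch of the alternative, passing `random_state` directly to `sample`, is shown here on a hypothetical stand-in DataFrame (`demo`), not on the orders data.

```python
import pandas as pd

# Hypothetical stand-in for orders_1
demo = pd.DataFrame({"order_id": range(100), "amount": range(100, 200)})

# The same random_state yields the same rows on every call,
# without touching NumPy's global random state
first = demo.sample(n=10, random_state=1)
second = demo.sample(n=10, random_state=1)

print(first.equals(second))
```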

<div style="border-top: 3px solid black"></div> 
<a id=rename></a>
<h3 style="color:#4682B4">Rename Columns</h3>
<a href='#top'>🔼</a>
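
One common way to rename columns is to normalize every name to snake_case in a single pass. The following is a minimal sketch under stated assumptions: the helper `to_snake_case` and the column names in `demo` are hypothetical, since the actual column names of the orders and returns data are not shown in this notebook.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Hypothetical column names for illustration only
demo = pd.DataFrame(
    {"Order ID": [1], "Ship Date": ["2020-01-01"], "Sub-Category": ["Chairs"]}
)

# rename() accepts a function and applies it to every column label
renamed = demo.rename(columns=to_snake_case)
print(list(renamed.columns))
```

Passing a function to `rename` returns a new DataFrame, so the original column names remain available on `demo` for comparison.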

<div style="border-top: 3px solid black"></div> 
<a id=columnnames></a>
<h3 style="color:#4682B4">Column Names of DataFrame(s)</h3>
<a href='#top'>🔼</a>

In [None]:
# Create the two dataframes
df1 = pd.DataFrame(data={"Orders Features": orders_1.columns})
df2 = pd.DataFrame(data={"Returns Features": returns_1.columns})

# Call function to display the two dataframes side by side
display_side_by_side(df1, df2)

<div style="border-top: 3px solid black"></div> 
<a id='section_id'></a>
<h3 style="color:blue">Section Title Here</h3>
<hr>
<h4 style="color:orange">OBSERVATION</h4>
<p style="font-size:15px">
Your observations and analysis about this section go here.
</p>
<ol>
<li>
<b>Sub-section Title :</b>
<p>Description for this sub-section goes here.</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
</li>
<p></p>
<li>
<b>Another Sub-section Title :</b>
<p>Description for this sub-section goes here.</p>
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
    <li>Item 4</li>
</ul>
</li>
<!-- Add as many list items as you need -->
</ol>
<a href='#top'>🔼</a>




