<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview-of-Pandas-Data-Cleaning" data-toc-modified-id="Overview-of-Pandas-Data-Cleaning-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview of Pandas Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Objectives</a></span></li></ul></li><li><span><a href="#Functions-&amp;-Methods-for-Data-Cleaning" data-toc-modified-id="Functions-&amp;-Methods-for-Data-Cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Functions &amp; Methods for Data Cleaning</a></span><ul class="toc-item"><li><span><a href="#Lambda-Functions" data-toc-modified-id="Lambda-Functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Lambda Functions</a></span></li><li><span><a href="#Aggregation-Functions" data-toc-modified-id="Aggregation-Functions-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Aggregation Functions</a></span></li></ul></li><li><span><a href="#Dealing-with-Missing-Data" data-toc-modified-id="Dealing-with-Missing-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dealing with Missing Data</a></span></li><li><span><a href="#Combining-DataFrames" data-toc-modified-id="Combining-DataFrames-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Combining DataFrames</a></span></li></ul></div>

# Overview of Pandas Data Cleaning

## Objectives

You will be able to:
* Split a DataFrame into subgroups using .groupby() and aggregation functions (.min(), .max(), .count(), .sum())
* Explain the different types of joins (outer, inner, left, right)
* Explain strategies for handling missing data (categorical & numerical)

# Functions & Methods for Data Cleaning

## Lambda Functions

* Experiment and solve for individual cases first
* Generalize your solution
* Watch for edge cases & exceptions
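
The three steps above can be sketched on a small example. The `prices` DataFrame and its column names are hypothetical (they are not part of this notebook), and the cleaning rule is only one possible approach: solve a single value by hand, wrap the same logic in a lambda, then guard against the missing entry that would otherwise raise an error.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook): prices stored as text.
prices = pd.DataFrame({"price": ["$1,200", "$85", None, "$3,450"]})

# 1. Solve one individual case first.
single = float("$1,200".replace("$", "").replace(",", ""))  # 1200.0

# 2. Generalize the solution into a lambda.
to_number = lambda s: float(s.replace("$", "").replace(",", ""))

# 3. Watch for edge cases: a missing value is not a string and has no
#    .replace() method, so pass it through as NaN instead of failing.
prices["price_clean"] = prices["price"].apply(
    lambda s: to_number(s) if isinstance(s, str) else float("nan")
)
```

Without the `isinstance` guard in step 3, the missing entry would raise an `AttributeError`, which is the kind of exception the last bullet warns about.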

## Aggregation Functions

* .min() -- returns the minimum value for each column by group
* .max() -- returns the maximum value for each column by group
* .mean() -- returns the average value for each column by group
* .median() -- returns the median value for each column by group
* .count() -- returns the number of non-null values in each column by group

**Being familiar and comfortable with splitting DataFrames using aggregation methods will be VERY IMPORTANT for correctly using pivot tables, stack/unstack, and multi-hierarchical indexing**

In [None]:
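from sklearn.datasets import load_diabetes
import pandas as pd

# Load the diabetes dataset and put its features into a DataFrame.
# Note: the features are mean-centered and scaled by default, so 'sex'
# holds two distinct float codes rather than readable labels.
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head()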

In [None]:
# Maximum of every feature within each 'sex' group
df.groupby('sex').max()
# To inspect the grouping column itself:
# df['sex'].unique()

In [None]:
# Columns to aggregate
r_list = ['bp', 's1']
# Group by two keys; the result is a DataFrame indexed by (sex, bmi)
df2 = df.groupby(['sex', 'bmi'])[r_list].mean()
df2
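
The diabetes features contain no missing values, so the difference between `.count()` and the group size is not visible in the cells above. A minimal sketch on a hypothetical `scores` DataFrame (invented for illustration) shows several aggregations at once and how `.count()` skips NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the notebook) with one missing score.
scores = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [10.0, np.nan, 7.0, 9.0, 8.0],
})

grouped = scores.groupby("team")["score"]

# Apply several aggregation functions in a single call.
summary = grouped.agg(["min", "max", "mean", "median", "count"])

# .size() counts rows per group, including rows where score is NaN,
# whereas "count" above only counts non-null values.
sizes = grouped.size()
```

Here team `a` has two rows but only one non-null score, so its `count` is 1 while its `size` is 2.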

# Dealing with Missing Data

* How to detect missing data?
    1. NaNs - df.isna().sum()
    2. Placeholder values

        a. Numerical - sentinel values such as 0 or 999; check with df['col'].value_counts()

        b. Categorical - unexpected labels; check with df['col'].unique()

* How to deal with missing data?
    1. Remove - df.dropna()

    2. Replace/Impute - df['col'].fillna(df['col'].median()) for numerical data, OR the most common value for categorical data (found with df['col'].value_counts())

        a. Why median instead of mean? The median is less sensitive to outliers and skewed values.

    3. Keep

        a. Categorical - a 'missing'/'NaN' label can give useful info about the dataset

        b. Numerical - binning
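
The detection and handling steps above can be sketched end to end. Everything in this example is an assumption made for illustration: the `patients` DataFrame, the choice of 999 and `"?"` as placeholders, and the bin edges. It is one way to apply the strategies, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 999 is a numeric placeholder, "?" a categorical one.
patients = pd.DataFrame({
    "age": [34.0, 41.0, np.nan, 999.0, 29.0, 38.0],
    "city": ["Oslo", "?", "Lima", "Oslo", None, "Oslo"],
})

# --- Detect ---
nan_counts = patients.isna().sum()            # true NaNs per column
age_counts = patients["age"].value_counts()   # 999 shows up as a suspicious value
city_values = patients["city"].unique()       # "?" shows up among the labels

# The placeholder distorts the mean far more than the median.
raw_mean = patients["age"].mean()      # about 228
raw_median = patients["age"].median()  # 38.0

# Convert placeholders to NaN so pandas treats them as missing.
cleaned = patients.copy()
cleaned["age"] = cleaned["age"].replace(999.0, np.nan)
cleaned["city"] = cleaned["city"].replace("?", np.nan)

# --- Remove: drop any row with a missing value ---
dropped = cleaned.dropna()

# --- Replace/Impute ---
age_filled = cleaned["age"].fillna(cleaned["age"].median())
most_common_city = cleaned["city"].value_counts().idxmax()
city_filled = cleaned["city"].fillna(most_common_city)

# --- Keep ---
# Categorical: label the gap explicitly.
city_kept = cleaned["city"].fillna("missing")
# Numerical: bin the values (bin edges are an arbitrary choice here),
# then give missing values their own category.
age_bins = pd.cut(cleaned["age"], bins=[0, 35, 120],
                  labels=["35_or_under", "over_35"])
age_bins = age_bins.cat.add_categories("missing").fillna("missing")
```

Comparing `raw_mean` with `raw_median` illustrates the median question from the list: a single extreme value drags the mean a long way but barely moves the median.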

# Combining DataFrames

* Outer Join - returns all records from both tables, with NaN filling the columns where a key has no match in the other table.

* Inner Join - returns only the records with matching keys in both tables.

* Left Join - returns all the records from the left table, as well as any records from the right table that have a matching key with a record from the left table.

* Right Join - returns all the records from the right table, as well as any records from the left table that have a matching key with a record from the right table.
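
The four join types can be compared side by side with `pd.merge` and its `how` parameter. The two small tables below are hypothetical and exist only to make the differences in returned keys easy to see.

```python
import pandas as pd

# Hypothetical toy tables sharing a "key" column; keys 2 and 3 overlap.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Outer: every key from either table (1, 2, 3, 4); gaps become NaN.
outer = pd.merge(left, right, on="key", how="outer")

# Inner: only keys present in both tables (2, 3).
inner = pd.merge(left, right, on="key", how="inner")

# Left: every key from the left table (1, 2, 3).
left_join = pd.merge(left, right, on="key", how="left")

# Right: every key from the right table (2, 3, 4).
right_join = pd.merge(left, right, on="key", how="right")
```

In `left_join`, the row for key 1 has NaN in `right_val` because the right table has no matching record; `right_join` mirrors this for key 4 and `left_val`.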