# Case Study: My Brother's Keeper Data

### Objectives
After this lesson you should be able to...
+ Tidy real datasets

## Introduction
[data.gov](https://www.data.gov) is an excellent place to find interesting and messy (and occasionally tidy) datasets. This case study examines the [My Brother's Keeper](https://catalog.data.gov/dataset/my-brothers-keeper-key-statistical-indicators-on-boys-and-men-of-color) dataset.

**Description**: 'MBK is an interagency effort to improve measurably the expected educational and life outcomes for and address the persistent opportunity gaps faced by boys and young men of color'

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../data/tidy/my_brothers_keeper.csv')

df.head()

In [None]:
df.info()

In [None]:
df.shape

## Variables as column names
Some of the column names contain variable values, which violates one of the tidy data principles. Age group, and in some columns gender, are stored in the column names.

There also appear to be two other variables: **`birth rate`** and **`percentage male/female`**.

### Split data into two DataFrames
Because both the 'Rate' (**`birth rate`**) columns and the 'Distribution' (**`percentage male/female`**) columns need to be melted, we will split them into two separate DataFrames and combine the results at the end.

In [None]:
distribution_cols = \
    ['Race', 
     'Year', 
     'Distribution of male children born to women ages 18-19',
     'Distribution of female children born to women ages 18-19',
     'Distribution of male children born to women ages 20-24',
     'Distribution of female children born to women ages 20-24']
    
rate_cols = \
    ['Race', 
     'Year',
     'Rate of birth to women ages 18-19',
     'Rate of birth to women ages 20-24']

In [None]:
percent = df[distribution_cols]
rate = df[rate_cols]

In [None]:
percent.head()

In [None]:
rate.head()
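
The column lists above are typed out by hand. As an alternative sketch, the same split can be built from the column-name prefixes. The toy frame below uses made-up values and only a subset of the columns; `toy`, `percent_toy`, and `rate_toy` are hypothetical names used for illustration.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the dataset's pattern.
toy = pd.DataFrame({
    'Race': ['A', 'B'],
    'Year': [2010, 2010],
    'Rate of birth to women ages 18-19': [50.0, 40.0],
    'Distribution of male children born to women ages 18-19': ['51%', '52%'],
    'Distribution of female children born to women ages 18-19': ['49%', '48%'],
})

id_cols = ['Race', 'Year']

# Keep the identifier columns plus every column that starts with a prefix.
dist_cols_toy = id_cols + [c for c in toy.columns if c.startswith('Distribution')]
rate_cols_toy = id_cols + [c for c in toy.columns if c.startswith('Rate')]

percent_toy = toy[dist_cols_toy]
rate_toy = toy[rate_cols_toy]

print(percent_toy.columns.tolist())
print(rate_toy.columns.tolist())
```

Selecting by prefix avoids typos in long column names, but it assumes every relevant column really does start with 'Distribution' or 'Rate'.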

## Melt the distribution DataFrame

In [None]:
percent_melt = percent.melt(id_vars=['Race', 'Year'], value_name='Gender Percent')
percent_melt.head()
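
To see what `melt` does in isolation, here is a minimal sketch on a toy table. The values and the shortened column names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy wide table with made-up values.
wide = pd.DataFrame({
    'Race': ['A', 'B'],
    'Year': [2010, 2010],
    'male 18-19': ['51%', '52%'],
    'female 18-19': ['49%', '48%'],
})

# Columns not listed in id_vars are stacked: their names go into a column
# called 'variable' and their contents into the column named by value_name.
long = wide.melt(id_vars=['Race', 'Year'], value_name='Gender Percent')

print(long)
```

Two rows and two melted columns produce four rows, one per original cell, with the old column name preserved in `variable` for later parsing.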

## Extracting age
We can extract the variables gender and age from the new **`variable`** column above. For age, we will use a regular expression that matches two digits, a dash, and two more digits.

In [None]:
# Raw string keeps the backslashes literal; expand=False returns a Series
age_group = percent_melt['variable'].str.extract(r'(\d{2}-\d{2})', expand=False)
age_group.head()
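
The age pattern can be checked on its own against two of the column names. This is a small sketch using a standalone Series; `labels` and `ages` are hypothetical names.

```python
import pandas as pd

# Two of the dataset's column names, used here as plain strings.
labels = pd.Series([
    'Distribution of male children born to women ages 18-19',
    'Rate of birth to women ages 20-24',
])

# \d{2} matches exactly two digits; the parentheses form the capture group
# that str.extract returns. With one group and expand=False the result is
# a Series rather than a one-column DataFrame.
ages = labels.str.extract(r'(\d{2}-\d{2})', expand=False)

print(ages.tolist())
```

Strings that do not match the pattern would come back as missing values, so checking for nulls after extraction is a cheap sanity test.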

### Add new column for Age Group

In [None]:
percent_melt['Age Group'] = age_group
percent_melt.head()

## Extracting Gender
Gender is extracted with a simple regular expression that searches for either 'male' or 'female'.

In [None]:
gender = percent_melt['variable'].str.extract(r'(male|female)', expand=False)
gender.head()
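
Because 'male' is a substring of 'female', it is worth confirming that the pattern does not mislabel female columns. The sketch below uses Python's `re` module directly on two of the column names. The regex engine scans from left to right and returns the earliest match, and in 'female' the alternative `female` matches two characters before `male` could.

```python
import re

label_f = 'Distribution of female children born to women ages 18-19'
label_m = 'Distribution of male children born to women ages 18-19'

pattern = re.compile(r'(male|female)')

# search() returns the leftmost match. At the 'f' of 'female' the first
# alternative fails and the second succeeds, so the whole word is captured.
found_f = pattern.search(label_f).group(1)
found_m = pattern.search(label_m).group(1)

print(found_f, found_m)
```

A stricter variant such as `r'\b(male|female)\b'` anchors on word boundaries and gives the same result for these labels.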

### Add new column for Gender

In [None]:
percent_melt['Gender'] = gender
percent_melt.head()

### Convert Gender Percent to numeric
The percent sign in the **`Gender Percent`** column prevents pandas from reading the column as numeric. Let's strip the percent sign and then convert the column to float.

In [None]:
percent_melt['Gender Percent'] = percent_melt['Gender Percent'].str.strip('%').astype(float)
percent_melt.head()
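
`astype(float)` raises an error if any entry is still non-numeric after stripping. As a hedged alternative, `pd.to_numeric` with `errors='coerce'` turns such entries into `NaN` instead. The toy Series below is made up; whether the real file contains entries like 'N/A' is an assumption, not something verified here.

```python
import pandas as pd

# Made-up values, including one entry that is not a number.
raw = pd.Series(['51.2%', 'N/A', '48.8%'])

# Strip the trailing percent sign, then coerce anything non-numeric to NaN.
clean = pd.to_numeric(raw.str.strip('%'), errors='coerce')

print(clean)
```

Coercion hides bad input rather than reporting it, so counting the resulting nulls with `clean.isna().sum()` is a reasonable follow-up check.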

Verify data types

In [None]:
percent_melt.dtypes

### Drop `variable` column

In [None]:
percent_melt = percent_melt.drop(columns='variable')
percent_melt.head()

## Do the same procedure with the rate DataFrame
We can take a similar approach with the **`rate`** DataFrame, which is displayed again below. Its column names contain only the age group, not gender.

In [None]:
rate.head()

In [None]:
rate_melt = rate.melt(id_vars=['Race', 'Year'], value_name='Birth Rate')
rate_melt.head()

### Get age group

In [None]:
rate_melt['Age Group'] = rate_melt['variable'].str.extract(r'(\d{2}-\d{2})', expand=False)
rate_melt = rate_melt.drop(columns='variable')
rate_melt.head()

## Melted Distribution and Rate tables

In [None]:
percent_melt.head()

In [None]:
rate_melt.head()

In [None]:
percent_melt.shape

In [None]:
rate_melt.shape

## Join tables back together with `merge`
We now can put both the tables back together. Use the **`merge`** method to join the two tables together. Use the **`on`** parameter to only join rows that have the same race, year, and age group.

In [None]:
mbk_tidy = percent_melt.merge(rate_melt, on=['Race', 'Year', 'Age Group'])
mbk_tidy.head()

In [None]:
mbk_tidy.shape
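
The merge pairs many rows of the percent table (one per gender) with a single row of the rate table. The sketch below reproduces that on made-up values and adds the `validate` argument of `merge`, which raises an error if the key columns are not unique on the right-hand side. `percent_toy` and `rate_toy` are hypothetical names.

```python
import pandas as pd

# Made-up values: two gender rows share one Race/Year/Age Group key.
percent_toy = pd.DataFrame({
    'Race': ['A', 'A'],
    'Year': [2010, 2010],
    'Age Group': ['18-19', '18-19'],
    'Gender': ['male', 'female'],
    'Gender Percent': [51.0, 49.0],
})
rate_toy = pd.DataFrame({
    'Race': ['A'],
    'Year': [2010],
    'Age Group': ['18-19'],
    'Birth Rate': [50.0],
})

merged = percent_toy.merge(
    rate_toy,
    on=['Race', 'Year', 'Age Group'],
    how='inner',               # the default: keep only keys present in both
    validate='many_to_one',    # fail loudly if rate_toy has duplicate keys
)

print(merged)
```

The single rate row is repeated once for each gender row, so the merged table has as many rows as the percent table.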

## Change column order to make more sense
The order of the columns can make a difference in readability. Putting the descriptive static columns first is usually a good idea.

In [None]:
cols = ['Year', 'Race', 'Age Group', 'Gender', 'Gender Percent', 'Birth Rate']
mbk_tidy = mbk_tidy[cols]
mbk_tidy.head()

### `Birth Rate` Incorrect!
The **`Birth Rate`** column is now showing incorrect data. The original birth rate is for the entire population, both male and female. Because the merge copied each total rate onto both gender rows, our new DataFrame shows the male and female birth rates as each being equal to that of the total population.

### Fix `Birth Rate`
To fix this, we will multiply the **`Birth Rate`** column by the **`Gender Percent`** column and divide by 100, since the percentage is stored on a 0-100 scale.

In [None]:
mbk_tidy['Birth Rate'] = mbk_tidy['Birth Rate'] * mbk_tidy['Gender Percent'] / 100
mbk_tidy.head()
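
The arithmetic can be checked on a pair of made-up rows: the two gender-specific rates should add back up to the total rate. The sketch writes the result to a new, hypothetical column named `Birth Rate by Gender`. The notebook cell overwrites `Birth Rate` in place, so running that cell twice would apply the scaling twice.

```python
import pandas as pd

# Made-up values: the total rate of 50.0 appears on both gender rows.
toy = pd.DataFrame({
    'Gender': ['male', 'female'],
    'Gender Percent': [51.0, 49.0],
    'Birth Rate': [50.0, 50.0],
})

# Scale the total rate by each gender's share (percent / 100).
toy['Birth Rate by Gender'] = toy['Birth Rate'] * toy['Gender Percent'] / 100

# The gender-specific rates should sum to the total rate.
total = toy['Birth Rate by Gender'].sum()

print(toy)
print(total)
```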

## Tidy data?
Let's check the three tidy principles to see if we have made our way to tidy data.
* Is each variable a column? Although we don't have a common and strict definition of a variable, the column names are now labels and do not contain data like age or gender.
* Is each row an observational unit? Each row contains data from exactly one specific observation.
* Is there one type of observational unit? Not exactly, but close enough. Race, Year, Age Group and Gender each repeat and could be separated into their own table. This 'normalizes' the data in the relational database sense, i.e., it minimizes the number of repetitions. The drawback is that data typically needs to be in one table to visualize, aggregate and apply machine learning to. We will separate this data into two tables later in the notebook.

## Steps to produce tidy data
Although no exact set of procedures will always result in a tidy dataset, this loose guideline may help you turn messy data into tidy data.

1. Identify each variable
1. Look for variable values (such as age or gender) masquerading as column names
1. Look for variable names masquerading as values in a column
1. Examine the 5 types of common messy data sets to see which one your dataset most closely resembles
1. You will likely need to use **`melt`** or **`pivot`** to restructure your DataFrame
1. You might need to separate different variables into their own DataFrame to make for easier tidying
1. Parse data with the **`str`** accessor to extract multiple variables from a single piece of data
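
This case study only needed `melt`. For the opposite situation, where variable names are stored as values in a column, `pivot` spreads them back out into columns. The sketch below uses an entirely made-up table; the names `measure`, `births`, and `population` are hypothetical and unrelated to the dataset.

```python
import pandas as pd

# Made-up long table: the 'measure' column holds variable names.
long = pd.DataFrame({
    'Year': [2010, 2010, 2011, 2011],
    'measure': ['births', 'population', 'births', 'population'],
    'value': [10, 200, 12, 210],
})

# Each distinct entry in 'measure' becomes its own column.
wide = long.pivot(index='Year', columns='measure', values='value').reset_index()
wide.columns.name = None  # drop the leftover 'measure' label on the columns

print(wide)
```

`pivot` raises an error if an index/column pair appears more than once, which is a useful signal that the data needs aggregating first.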