# CSS 201.5 - CSS MA Bootcamp

## Week 02 - Lecture 4 (morning)

# Data Wrangling

In [None]:
# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly import express as px
from plotly.subplots import make_subplots

# Dates and times handlers
from datetime import date
from datetime import timedelta
from datetime import datetime as dtm
from datetime import timezone as tmz

## Roadmap

1. Chaining

1. Stacking data

1. Joining data

1. Reshaping data

## Loading PErisk

In [None]:
perisk = pd.read_csv('PErisk.csv')
perisk.head(2)

In [None]:
tips = pd.read_csv('tips.csv')
tips.head(2)

## Variable operations

**Your turn**: Create a variable `share_bill_tip` that computes the fraction of the bill that was given as a `tip`.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Quantiles of a variable:

In [None]:
perisk.gdpw2.quantile(q = [0, 0.25, 0.5, 0.75, 1])

And we can bin by quantiles:

In [None]:
new_perisk = perisk.assign(
    gdpw2_bin = pd.qcut(perisk.gdpw2, q = 4)
)
new_perisk.head()
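
The quantile binning above can be checked on a small example. This is a minimal sketch on made-up numbers (the `values` Series and the label names are assumptions for illustration, not part of `PErisk.csv`); it shows that `pd.qcut` puts roughly the same number of observations in each bin and that the `labels` argument replaces the default interval labels.

```python
import pandas as pd

# Hypothetical toy values standing in for a numeric column such as gdpw2
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# Four quantile bins, with readable labels instead of interval labels
bins = pd.qcut(values, q=4, labels=["low", "mid-low", "mid-high", "high"])

# Count the observations per bin, keeping the bins in their natural order
counts = bins.value_counts().sort_index()
print(counts)
```

Because the cut points are quantiles, the bins have equal counts but not equal widths.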

## Quantile cuts

**Your turn**: Cut `share_bill_tip` into three categories. Then build a frequency table of the new variable.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Rescaling a variable to a 0-1 index (subtract the minimum, then divide by the range):

In [None]:
new_perisk = perisk.assign(
    z1_barb2 = (perisk.barb2 - perisk.barb2.min()) / (perisk.barb2.max() - perisk.barb2.min())
)
new_perisk.head()
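
The same rescaling can be wrapped in a small helper so it can be reused across columns. This is a sketch on made-up data; the function name `zero_one` and the `score` column are assumptions for illustration.

```python
import pandas as pd

def zero_one(x: pd.Series) -> pd.Series:
    """Rescale a numeric Series so its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical toy data
toy = pd.DataFrame({"score": [10.0, 15.0, 20.0, 30.0]})
toy = toy.assign(z1_score=zero_one(toy.score))
print(toy)
```

Note that the helper divides by zero if every value in the column is the same, so it is worth checking for constant columns first.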

## Zero-One Indexing

**Your turn**: Create a variable `zero_one_totbill` that rescales the total bill to the zero-one range.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Standardizing values or taking absolute values:

In [None]:
stdz = lambda x: (x - x.mean()) / x.std()
new_perisk = perisk.assign(
    stdz_barb2 = stdz(perisk.barb2),
    stdz_gdpw2 = stdz(perisk.gdpw2),
    abs_barb2 = perisk.barb2.abs()
)
new_perisk.head()
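
A quick way to check a standardization is to confirm that the result has mean 0 and standard deviation 1. The sketch below does this on made-up numbers (the Series `x` is an assumption for illustration). Keep in mind that the pandas `.std()` method uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

def stdz(x: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std()

# Hypothetical toy values
x = pd.Series([2.0, 4.0, 6.0, 8.0])
z = stdz(x)

# After standardizing, the mean is (numerically) 0 and the std is 1
print(z.mean(), z.std())
```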

Clipping values: force every value below `lower` or above `upper` to equal that bound (danger zone: this overwrites the original extreme values!).

In [None]:
new_perisk = new_perisk.assign(
    stdz_barb2 = new_perisk.stdz_barb2.clip(lower = -1, upper = 1),
    stdz_gdpw2 = new_perisk.stdz_gdpw2.clip(lower = -1, upper = 1),
)
new_perisk.head(3)
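
To see why clipping is a danger zone, the sketch below standardizes a made-up Series with one extreme value (an assumption for illustration), clips it, and counts how many observations were altered. After clipping, the variable no longer has mean 0.

```python
import pandas as pd

# Hypothetical toy values with one outlier
x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
z = (x - x.mean()) / x.std()

# Force standardized values into the interval [-1, 1]
clipped = z.clip(lower=-1, upper=1)

# How many observations were changed by the clipping?
n_clipped = int((z != clipped).sum())
print(n_clipped, clipped.mean())
```

Counting the altered values before overwriting a column is a cheap way to know how much information the clip throws away.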

## Standardizing

**Your turn**: 

1. Standardize the total bill.
2. Create a new variable that standardizes the tips, clipping values to be between -2 and 2 standard deviations.
3. Count the values within and outside these bounds.

Do you know what they mean?

In [None]:
tips.head(2)
# Your answers here

## Chaining

Chaining is useful when you want to apply several operations in sequence without saving the intermediate results.

In [None]:
new_perisk = (perisk.assign(expr_risk = 5 - perisk.prsexp2,
                            corr_risk = 5 - perisk.prscorr2)
                    .query('courts == 1')
                    .sample(n = 5)
                    .set_index('country')
                    .sort_index()
             )

In [None]:
new_perisk.head()
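
One detail matters inside a chain: an expression such as `perisk.prsexp2` always refers to the original DataFrame, not to the intermediate result of the chain. When a step should use the rows or columns produced by an earlier step, `assign` accepts a function that receives the current DataFrame. The sketch below illustrates this on made-up data (the `toy` DataFrame and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "country": ["C", "A", "B", "D"],
    "courts": [1, 1, 0, 1],
    "risk": [2.0, 4.0, 1.0, 3.0],
})

result = (
    toy.query("courts == 1")
       # d is the filtered DataFrame, so the mean uses only the kept rows
       .assign(risk_centered=lambda d: d.risk - d.risk.mean())
       .set_index("country")
       .sort_index()
)
print(result)
```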

## Chaining

**Your turn**: Using chaining, perform the following operations:

1. Select 50 samples of `tips`
1. Standardize the total bill.
1. Create a new variable that standardizes the tips, clipping values to be between -2 and 2 standard deviations.
1. Create a new variable that creates a zero-one representation of `totbill` and `tip`
1. Use qcut to create a 10-cut discrete variable representation of the zero-one `tip` created before.

In [None]:
tips.head(2)
# Your answers here

## Stacking data

Suppose you have two datasets, both with half of the data you need, and the same variables in both datasets.

For instance:

In [None]:
# First dataset
perisk_1sthalf = perisk.loc[0:1]
perisk_1sthalf

In [None]:
# Second dataset
perisk_2ndhalf = perisk.loc[2:3]
perisk_2ndhalf

## Stacking data

To stack the data, you should do:

In [None]:
pd.concat([perisk_1sthalf, perisk_2ndhalf])
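
By default `pd.concat` keeps the original row labels of each piece, which can leave duplicated index values. The sketch below shows this on made-up data (the `first` and `second` DataFrames are assumptions for illustration) and uses `ignore_index=True` to build a fresh index.

```python
import pandas as pd

# Hypothetical toy datasets with the same columns
first = pd.DataFrame({"country": ["A", "B"], "gdp": [1, 2]})
second = pd.DataFrame({"country": ["C", "D"], "gdp": [3, 4]})

# Default: each piece keeps its own row labels (0, 1, 0, 1)
stacked = pd.concat([first, second])

# ignore_index=True renumbers the rows from 0 to n - 1
stacked_clean = pd.concat([first, second], ignore_index=True)
print(stacked_clean)
```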

## Stacking

**Your turn**: Stack the datasets `tips_1` and `tips_2`.

In [None]:
tips_1 = tips.head(2)
tips_2 = tips.tail(2)
# Your answers here

## Stacking data

But what if the variables are in a different order, and some of them appear in only one of the datasets?

In [None]:
perisk_1sthalf = (
    perisk.loc[0:1]
          .drop(columns = ['barb2'])
          .reindex(columns = ['country', 'prscorr2', 'gdpw2', 'courts', 'prsexp2'])
)
perisk_1sthalf

In [None]:
perisk_2ndhalf = (
    perisk.loc[2:3]
          .drop(columns = ['gdpw2'])
          .reindex(columns = ['country', 'courts', 'prsexp2', 'prscorr2', 'barb2'])
)
perisk_2ndhalf

In [None]:
pd.concat([perisk_1sthalf, perisk_2ndhalf])
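
When the columns differ, `pd.concat` aligns them by name and fills the gaps with missing values. The sketch below shows this on made-up data (the DataFrames and column names are assumptions for illustration), and also shows `join="inner"`, which keeps only the columns present in every piece.

```python
import pandas as pd

# Hypothetical toy datasets with different columns in a different order
first = pd.DataFrame({"country": ["A", "B"], "gdp": [1.0, 2.0]})
second = pd.DataFrame({"risk": [0.3, 0.7], "country": ["C", "D"]})

# Default (outer): union of the columns, NaN where a column is absent
outer = pd.concat([first, second], ignore_index=True)

# join="inner": only the columns shared by all datasets
inner = pd.concat([first, second], ignore_index=True, join="inner")
print(outer)
print(inner)
```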

## Stacking

**Your turn**: Stack the datasets `tips_1` and `tips_2`, which now have different columns.

In [None]:
tips_1 = tips.head(2).drop(columns = ['day'])
tips_2 = tips.tail(2).drop(columns = ['time'])
# Your answers here

## Join Data

Suppose you have two datasets that have a common key, with different types of information in them. How do we join them together?

In [None]:
perisk_inc1 = (
    perisk.loc[0:4]
          .drop(columns = ['prsexp2', 'prscorr2', 'gdpw2'])
)
perisk_inc1

In [None]:
perisk_inc2 = (
    perisk.loc[1:5]
          .drop(columns = ['courts', 'barb2'])
)
perisk_inc2

## Join Data

Inner joins:

In [None]:
pd.merge(perisk_inc1, perisk_inc2, how = 'inner', on = 'country')

## Inner Join

**Your turn**: Join the datasets `tips_1` and `tips_2` using inner join.

In [None]:
tips_1 = tips.head(4).drop(columns = ['day'])
tips_2 = tips.loc[2:].head(4).drop(columns = ['time'])
# Your answers here

## Join Data

Left joins:

In [None]:
pd.merge(perisk_inc1, perisk_inc2, how = 'left', on = 'country')

## Left Join

**Your turn**: Join the datasets `tips_1` and `tips_2` using left join.

In [None]:
tips_1 = tips.head(4).drop(columns = ['day'])
tips_2 = tips.loc[2:].head(4).drop(columns = ['time'])
# Your answers here

## Join Data

Right joins:

In [None]:
pd.merge(perisk_inc1, perisk_inc2, how = 'right', on = 'country')

## Right Join

**Your turn**: Join the datasets `tips_1` and `tips_2` using right join.

In [None]:
tips_1 = tips.head(4).drop(columns = ['day'])
tips_2 = tips.loc[2:].head(4).drop(columns = ['time'])
# Your answers here

## Join Data

Full (outer) joins:

In [None]:
pd.merge(perisk_inc1, perisk_inc2, how = 'outer', on = 'country')

## Full Join

**Your turn**: Join the datasets `tips_1` and `tips_2` using full join.

In [None]:
tips_1 = tips.head(4).drop(columns = ['day'])
tips_2 = tips.loc[2:].head(4).drop(columns = ['time'])
# Your answers here

## Join Data

Diagnostics one: matched?

In [None]:
perisk_inc1[perisk_inc1.country.isin(perisk_inc2.country)].country

Diagnostics two: Unmatched?

In [None]:
perisk_inc1[~perisk_inc1.country.isin(perisk_inc2.country)].country

## Join Data

Now reversing:

Diagnostics one: matched?

In [None]:
perisk_inc2[perisk_inc2.country.isin(perisk_inc1.country)].country

Diagnostics two: Unmatched?

In [None]:
perisk_inc2[~perisk_inc2.country.isin(perisk_inc1.country)].country
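
An alternative to the `isin` checks is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording whether each row came from the left dataset only, the right dataset only, or both. The sketch below uses made-up data (the `left` and `right` DataFrames are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy datasets sharing the key "country"
left = pd.DataFrame({"country": ["A", "B", "C"], "courts": [1, 0, 1]})
right = pd.DataFrame({"country": ["B", "C", "D"], "gdp": [10, 20, 30]})

# indicator=True adds a _merge column with the origin of each row
merged = pd.merge(left, right, how="outer", on="country", indicator=True)

# Tabulate matched and unmatched rows
counts = merged["_merge"].astype(str).value_counts().to_dict()
print(merged)
print(counts)
```

Running this on an outer join gives the matched and unmatched keys of both datasets in a single table.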

## Joins

**Your turn**: Diagnose the joins you ran before.

In [None]:
tips_1 = tips.head(4).drop(columns = ['day'])
tips_2 = tips.loc[2:].head(4).drop(columns = ['time'])
# Your answers here

## Reshaping data

The data we use often comes in a different format from the one required for analysis.

Fortunately, it is easy to deal with that in `pandas`.

In [None]:
# cases data
cases = pd.DataFrame({
  'country': ["Afghanistan", "Brazil", "China"],
  1999: [745, 37737, 212258],
  2000: [2666, 80488, 213766]  
})
cases

## Reshaping data (gather)

Suppose you have this data:

In [None]:
cases_spread = pd.DataFrame({
  'country': ["Afghanistan", "Brazil", "China"],
  1999: [745, 37737, 212258],
  2000: [2666, 80488, 213766]  
})
cases_spread

The first thing we can do here is to gather this data into long format, which in `pandas` is done with `pd.melt`:

In [None]:
cases_new = pd.melt(cases_spread, id_vars = 'country', var_name = 'year', value_name = 'cases')
cases_new

## Reshaping data (spread)

Now, suppose you have this data:

In [None]:
cases_new

But you want to go back to the previous pattern:

In [None]:
pd.pivot(cases_new, index = 'country', columns = 'year', values = 'cases').reset_index()
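
The two reshaping steps can be combined into a round trip to confirm that nothing is lost. This is a minimal sketch that rebuilds the same small table; the names `wide`, `long`, and `back` are assumptions for illustration. After `pivot`, the column axis carries the name `year`, and `rename_axis(columns=None)` removes it so the result looks like the original wide table.

```python
import pandas as pd

wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    1999: [745, 37737, 212258],
    2000: [2666, 80488, 213766],
})

# Wide to long: one row per country-year pair
long = pd.melt(wide, id_vars="country", var_name="year", value_name="cases")

# Long to wide: one column per year again
back = (
    long.pivot(index="country", columns="year", values="cases")
        .reset_index()
        .rename_axis(columns=None)  # drop the leftover "year" axis name
)
print(back)
```

`pivot` requires each combination of index and column values to appear only once; with duplicated combinations it raises an error.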

## Reshaping data

**Your turn**: Create a dataset that separates the tips and total bills based on whether the people were smokers or non-smokers.

In [None]:
tips.head(2)

# Great work!