# CSS 201.5 - CSS MA Bootcamp

## Week 02 - Lecture 3 (morning)

# Data Wrangling

## Wrangling Categorical Variables

In [1]:
# Loading packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly import express as px
from plotly.subplots import make_subplots

# Dates and times handlers
from datetime import date
from datetime import timedelta
from datetime import datetime as dtm
from datetime import timezone as tmz

# Wrangling Dates and Times (cont'd)

In [10]:
# Here are two dates
dts = ['5/17/2020', '12/2/2022']
dt1 = dtm(2021, 5, 5, 23, 20, 2)
dts2 = [date(2020, 5, 17), date(2022, 12, 2)]

In [11]:
delta = dts2[1] - dts2[0]
delta.days

929

## Wrangling Dates and Times

### Timedeltas and duration

As we saw before, we can create time differences (or timedeltas) by subtracting one date from another:

In [12]:
begin = dtm(2021, 5, 5, 23, 20, 2)
end = dtm(2021, 7, 5, 8, 15, 22)
delta = end - begin

## Wrangling Dates and Times

### Timedeltas and duration

A cheat sheet with the handlers for timedeltas and durations can be found [here](https://images.datacamp.com/image/upload/v1666944896/Marketing/Blog/Working_with_Dates_and_Times_Cheat_Sheet.pdf).

In [13]:
delta.total_seconds()

5216120.0
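
A minimal sketch of how a timedelta stores its value, reusing the `begin` and `end` values from the cell above (the standard-library names are imported in full here rather than through the `dtm` alias). It illustrates that `days` and `seconds` are separate components, while `total_seconds()` combines them:

```python
from datetime import datetime

begin = datetime(2021, 5, 5, 23, 20, 2)
end = datetime(2021, 7, 5, 8, 15, 22)
delta = end - begin

# A timedelta is stored as whole days plus leftover seconds
print(delta.days)             # 60
print(delta.seconds)          # 32120

# total_seconds() folds both components into a single number
print(delta.total_seconds())  # 5216120.0

# Durations in other units can be obtained by division
hours = delta.total_seconds() / 3600
```

Note that `delta.seconds` alone is not the full duration; it only holds the part that does not fit into whole days.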

## Wrangling Dates and Times

**Exercise:** Read the following date and time correctly: "Jan 16, 2021 at 3:30 AM"

In [14]:
# Your answers here

## Wrangling Dates and Times

### Timezone

In [15]:
# Timezone PDT
PDT = tmz(timedelta(hours = -7))

# Date and time in a specific time zone
dtPDT = dtm(2021, 5, 12, 15, 23, 25, tzinfo = PDT)
print(dtPDT)

2021-05-12 15:23:25-07:00


In [16]:
# Or we can convert an existing datetime to another time zone
ET = tmz(timedelta(hours = -5))

# Before
print(dt1)

# After
dtET = dt1.astimezone(ET)
print(dtET)

2021-05-05 23:20:02
2021-05-06 01:20:02-05:00
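
One detail worth noting: `dt1` carries no time zone, and `astimezone` treats such naive datetimes as being in the system's local time, so the output above depends on the machine that ran the cell. The sketch below, which assumes the original time was recorded in the fixed UTC-7 offset, attaches the zone explicitly first so the result is the same everywhere:

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))
ET = timezone(timedelta(hours=-5))

naive = datetime(2021, 5, 5, 23, 20, 2)

# replace() only labels the datetime with a zone; the clock time stays the same
aware = naive.replace(tzinfo=PDT)

# astimezone() converts to the new zone; the instant in time is unchanged
converted = aware.astimezone(ET)
print(converted)           # 2021-05-06 01:20:02-05:00
print(converted == aware)  # True
```

Fixed offsets such as these ignore daylight saving time. For named zones that follow those rules, the standard-library `zoneinfo` module (Python 3.9+) is one option.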


## Wrangling Dates and Times

**Exercise:** Change `dt1` to the India time zone (UTC+5:30).

In [17]:
# Your answers here

## Wrangling Dates and Times

### Pandas

In [18]:
# Datasets
dat2 = pd.read_csv('lakers.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'lakers.csv'

## Wrangling Dates and Times

**Exercise**: Explore this dataset.

In [None]:
# Your code here

## Wrangling Dates and Times

### Pandas

To parse dates and times in pandas, we use the `pd.to_datetime` function:

In [None]:
# More on that later...
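
In the meantime, here is a small sketch of `pd.to_datetime` applied to the two date strings defined earlier in this lecture. The explicit `format` string is an assumption that the strings are month/day/year:

```python
import pandas as pd

dts = ['5/17/2020', '12/2/2022']

# An explicit format avoids ambiguity between month-first and day-first dates
parsed = pd.to_datetime(dts, format='%m/%d/%Y')
print(parsed)

# Subtracting two parsed timestamps gives a pandas Timedelta
span = parsed[1] - parsed[0]
print(span.days)  # 929
```

The result matches the `delta.days` value computed with the standard-library `date` objects above.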

# Advanced Data Wrangling

## Roadmap

1. Drop variables

1. Sort variables

1. Indexing (basics)

1. Subsetting observations

1. Variable computations

1. Chaining

1. Stacking data

1. Joining data

1. Reshaping data

## Loading the PErisk and tips datasets

In [None]:
perisk = pd.read_csv('PErisk.csv')
perisk.head(2)

In [None]:
tips = pd.read_csv('tips.csv')
tips.head(2)

## Dropping variables

Dropping is useful when a dataset has many variables and you only need some of them.

In [None]:
new_perisk = perisk.drop(columns = ['courts', 'barb2', 'gdpw2'])

In [None]:
new_perisk.head(4)
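
A self-contained sketch of the same idea, for readers without the CSV file at hand. The `toy` DataFrame is a hypothetical stand-in for `perisk` and its values are made up:

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'courts': [0, 1, 1],
    'barb2': [-0.5, 0.0, 1.2],
    'gdpw2': [8.1, 9.4, 10.3],
})

# drop returns a new DataFrame; toy itself is left unchanged
slim = toy.drop(columns=['courts', 'barb2'])
print(slim.columns.tolist())  # ['country', 'gdpw2']
print(toy.shape)              # (3, 4)

# errors='ignore' skips names that do not exist instead of raising a KeyError
same = toy.drop(columns=['not_a_column'], errors='ignore')
print(same.shape)             # (3, 4)
```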

## Dropping variables

**Your turn**: Drop two variables of your choice in the `tips` dataset.

In [None]:
tips.head(2)
# Your answers here

## Changing the position of variables

The `reindex` method helps with this:

In [None]:
new_perisk = perisk.reindex(columns = ['country', 'courts', 'prsexp2', 'prscorr2', 'gdpw2', 'barb2'])

In [None]:
new_perisk.head(4)
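
Two behaviours of `reindex` are easy to miss, so here is a short sketch on a made-up `toy` DataFrame (a hypothetical stand-in for `perisk`): columns left out of the list are dropped, and names that do not exist become columns of missing values rather than raising an error.

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B'],
    'courts': [0, 1],
    'gdpw2': [8.1, 9.4],
})

# Columns not listed ('courts') are silently dropped
reordered = toy.reindex(columns=['gdpw2', 'country'])
print(reordered.columns.tolist())     # ['gdpw2', 'country']

# A misspelled name ('gdp') yields an all-NaN column instead of an error
with_typo = toy.reindex(columns=['country', 'gdp'])
print(with_typo['gdp'].isna().all())  # True
```

Checking the column names after reordering is therefore a sensible habit.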

## Changing the position of variables

**Your turn**: Organize the `tips` dataset by placing the numeric variables first, and the other variables last.

In [None]:
tips.head(2)
# Your answers here

## Sorting variables

Sorting is useful when you need to inspect the dataset.

In [None]:
new_perisk = perisk.sort_values('gdpw2', ascending = True)

In [None]:
new_perisk.head(3)

In [None]:
new_perisk = perisk.sort_values('gdpw2', ascending = False)

In [None]:
new_perisk.head(3)
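
`sort_values` also accepts several columns at once, each with its own direction. A sketch on a made-up `toy` DataFrame (a hypothetical stand-in for `perisk`):

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'courts': [1, 0, 1, 0],
    'gdpw2': [9.4, 8.1, 10.3, 7.9],
})

# Sort by courts (ascending), then by gdpw2 (descending) within each courts value
ordered = toy.sort_values(['courts', 'gdpw2'], ascending=[True, False])
print(ordered['country'].tolist())  # ['B', 'D', 'C', 'A']
```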

## Sorting variables

**Your turn**: Sort the `tips` dataset by the value of the tip.

In [None]:
tips.head(2)
# Your answers here

## Indexing (basics)

Adding indexes:

In [None]:
new_perisk = perisk.set_index('country')

In [None]:
new_perisk.head(2)

Sort indexes:

In [None]:
new_perisk = new_perisk.sort_index(ascending = False)
new_perisk.head(2)

Reset the index (move it back to a regular column):

In [None]:
new_perisk = new_perisk.reset_index()
new_perisk.head(2)
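
One reason to set an index is label-based lookup with `.loc`. The sketch below uses a made-up `toy` DataFrame (a hypothetical stand-in for `perisk`) and also shows the difference between restoring and discarding the index:

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdpw2': [8.1, 9.4, 10.3],
})

indexed = toy.set_index('country')

# With country as the index, rows can be looked up by label
print(indexed.loc['B', 'gdpw2'])    # 9.4

# reset_index() moves the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())    # ['country', 'gdpw2']

# drop=True discards the index values instead of keeping them
discarded = indexed.reset_index(drop=True)
print(discarded.columns.tolist())   # ['gdpw2']
```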

## Indexing (basics)

**Your turn**: Set `obs` as the index of the `tips` dataset. Then undo it.

In [None]:
tips.head(2)
# Your answers here

## Sampling

Sample a fraction of the data:

In [None]:
new_perisk = perisk.sample(frac = 0.05)
new_perisk

Sample a given number of cases:

In [None]:
new_perisk = perisk.sample(n = 3)
new_perisk
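
Because sampling is random, the cells above return different rows on each run. A sketch of how the `random_state` argument makes a sample reproducible, using a made-up `toy` DataFrame:

```python
import pandas as pd

# Toy data: 100 rows numbered 0 to 99
toy = pd.DataFrame({'x': range(100)})

# The same random_state gives the same rows every time
s1 = toy.sample(n=5, random_state=42)
s2 = toy.sample(n=5, random_state=42)
print(s1.equals(s2))  # True

# frac is a share of the rows: 10% of 100 rows is 10 rows
tenth = toy.sample(frac=0.1, random_state=42)
print(len(tenth))     # 10
```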

## Sampling

**Your turn**: Sample `20` observations from `tips`.

In [None]:
tips.head(2)
# Your answers here

## Subsetting

Subsetting using query:

In [None]:
new_perisk = perisk.query('gdpw2 > 10 and courts == 1')
new_perisk.head()
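
For comparison, the sketch below shows that a `query` string is equivalent to a boolean mask, and that Python variables can be referenced inside the string with `@`. The `toy` DataFrame is a made-up stand-in for `perisk`, and `threshold` is a name introduced only for this example:

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'courts': [1, 0, 1, 0],
    'gdpw2': [9.4, 8.1, 10.3, 11.0],
})

# The query string and the boolean mask select the same rows
by_query = toy.query('gdpw2 > 10 and courts == 1')
by_mask = toy[(toy.gdpw2 > 10) & (toy.courts == 1)]
print(by_query.equals(by_mask))   # True

# @ refers to a variable defined outside the DataFrame
threshold = 9
rich = toy.query('gdpw2 > @threshold')
print(rich['country'].tolist())   # ['A', 'C', 'D']
```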

## Subsetting

**Your turn**: Sample `50` observations from `tips`. Then, keep only the observations where the tip is greater than or equal to 10 dollars or the customer was a smoker. How many observations did you end up with?

In [None]:
tips.head(2)
# Your answers here

## Subsetting

Dropping duplicates:

In [None]:
new_perisk = perisk.sample(n = 5, replace = True, random_state = 479)
new_perisk

In [None]:
new_perisk.drop_duplicates()
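
To count repeated rows before removing them, `duplicated` can be combined with `sum`. A sketch on a made-up `toy` DataFrame (a hypothetical stand-in for a resampled `perisk`):

```python
import pandas as pd

# Toy data with repeated rows; the values are invented for illustration
toy = pd.DataFrame({
    'country': ['A', 'B', 'A', 'C', 'B'],
    'gdpw2': [8.1, 9.4, 8.1, 10.3, 9.4],
})

# duplicated() flags every row that repeats an earlier row
print(toy.duplicated().sum())  # 2

# drop_duplicates() keeps the first occurrence of each distinct row
unique_rows = toy.drop_duplicates()
print(len(unique_rows))        # 3

# subset restricts the comparison to the listed columns
one_per_country = toy.drop_duplicates(subset='country')
print(one_per_country['country'].tolist())  # ['A', 'B', 'C']
```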

## Subsetting

**Your turn**: A common operation in data science is called [`bootstrapping`](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). It consists of randomly drawing samples, with replacement, from the dataset you are working with.

1. Create a sample from `tips` with replacement, with the same size as the original data.

2. Check how many observations were repeated.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Creating new variables with computations (multiple columns):

In [None]:
new_perisk = perisk.assign(
    risk_expr = 5 - perisk.prsexp2,
    risk_corr = 5 - perisk.prscorr2,
)
new_perisk.head()
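
`assign` also accepts functions, which receive the DataFrame being built. This lets a later column use one created earlier in the same call. A sketch with made-up values; `risk_total` is a hypothetical variable introduced only for this example:

```python
import pandas as pd

# Toy stand-in for perisk; the values are invented for illustration
toy = pd.DataFrame({
    'prsexp2': [1, 3, 5],
    'prscorr2': [0, 2, 4],
})

out = toy.assign(
    # d is the DataFrame as it stands when this column is computed
    risk_expr=lambda d: 5 - d.prsexp2,
    # risk_expr already exists here because assign works left to right
    risk_total=lambda d: d.risk_expr + (5 - d.prscorr2),
)
print(out.risk_total.tolist())  # [9, 5, 1]
```

The lambda form is also what makes `assign` convenient inside a chain, where the intermediate DataFrame has no name.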

## Variable operations

**Your turn**: Create a variable `share_bill_tip` that computes the fraction of the bill that was given as a `tip`.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Quantiles of a variable:

In [None]:
perisk.gdpw2.quantile(q = [0, 0.25, 0.5, 0.75, 1])

And we can bin by quantiles:

In [None]:
new_perisk = perisk.assign(
    gdpw2_bin = pd.qcut(perisk.gdpw2, q = 4)
)
new_perisk.head()
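
By default `pd.qcut` names each bin after its interval. The sketch below, on made-up values, passes `labels` to get readable names and checks that the bins hold the same number of rows. The label names `q1` to `q4` are arbitrary choices for this example:

```python
import pandas as pd

# Toy stand-in for perisk.gdpw2; the values are invented for illustration
toy = pd.DataFrame({'gdpw2': [7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5]})

binned = toy.assign(
    gdpw2_bin=pd.qcut(toy.gdpw2, q=4, labels=['q1', 'q2', 'q3', 'q4'])
)

# Quantile bins hold (roughly) equal numbers of observations
counts = binned.gdpw2_bin.value_counts().sort_index()
print(counts.tolist())  # [2, 2, 2, 2]
```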

## Quantile cuts

**Your turn**: Cut `share_bill_tip` into three quantile-based categories. Then build a frequency table of the new variable.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Rescaling a variable to a 0-1 index (subtract the minimum, then divide by the range):

In [None]:
new_perisk = perisk.assign(
    z1_barb2 = (perisk.barb2 - perisk.barb2.min()) / (perisk.barb2.max() - perisk.barb2.min())
)
new_perisk.head()
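
A small worked check of the 0-1 rescaling on made-up numbers. The helper `min_max` is a name introduced for this sketch, not part of pandas:

```python
import pandas as pd

# Toy stand-in for perisk.barb2; the values are invented for illustration
toy = pd.DataFrame({'barb2': [-2.0, 0.0, 2.0, 6.0]})

def min_max(x):
    """Rescale a Series so its minimum maps to 0 and its maximum to 1."""
    return (x - x.min()) / (x.max() - x.min())

out = toy.assign(z1_barb2=min_max(toy.barb2))
print(out.z1_barb2.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The rescaling divides by the range, so it is undefined for a column whose values are all identical.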

## Zero-One Indexing

**Your turn**: Create a variable `zero_one_totbill` that rescales the total bill to the zero-one range.

In [None]:
tips.head(2)
# Your answers here

## Variable operations

Standardizing values or taking absolute values:

In [None]:
stdz = lambda x: (x - x.mean()) / x.std()
new_perisk = perisk.assign(
    stdz_barb2 = stdz(perisk.barb2),
    stdz_gdpw2 = stdz(perisk.gdpw2),
    abs_barb2 = perisk.barb2.abs()
)
new_perisk.head()

Clipping values: force values below a lower bound or above an upper bound to equal that bound (danger zone: this changes the data!).

In [None]:
new_perisk = new_perisk.assign(
    stdz_barb2 = new_perisk.stdz_barb2.clip(lower = -1, upper = 1),
    stdz_gdpw2 = new_perisk.stdz_gdpw2.clip(lower = -1, upper = 1),
)
new_perisk.head(3)
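
To see how much clipping changes the data, one option is to compare the standardized values before and after. A sketch on a made-up Series with one extreme value:

```python
import pandas as pd

# Toy values with one large outlier; invented for illustration
x = pd.Series([1.0, 2.0, 3.0, 4.0, 20.0])

# Standardize: mean 0 and standard deviation 1
# (Series.std() uses the sample formula, ddof=1, by default)
z = (x - x.mean()) / x.std()

# Clip to one standard deviation on each side of the mean
clipped = z.clip(lower=-1, upper=1)

# Count how many values were altered by the clipping
n_changed = int((z != clipped).sum())
print(n_changed)      # 1
print(clipped.max())  # 1.0
```

Only the outlier is moved here, but its standardized value drops from about 1.77 to 1, which is why clipping deserves care.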

## Standardizing

**Your turn**: 

1. Standardize the total bill.
2. Create a new variable that standardizes the tips, clipping values to be between -2 and 2 standard deviations.
3. Count the values within and outside these bounds.

Do you know what these counts mean?

In [None]:
tips.head(2)
# Your answers here

# Great work!