# CSS 201.5 - CSS MA Bootcamp

## Week 02 - Lecture 2 (morning)

# Data Wrangling

## Wrangling Categorical Variables

In [None]:
## Loading a couple of friends in here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly import express as px
from plotly.subplots import make_subplots

## Wrangling Categorical Variables

In [None]:
# Loading the General Social Survey
dat = pd.read_csv('gss.csv')
dat.head()

## Wrangling Categorical Variables

### Updating Categories

We can rename categories with the `.cat.rename_categories` method, passing a dictionary that maps old labels to new ones.

In [None]:
## Seeing it
polv = pd.Series(
    pd.Categorical(
        dat.polviews, categories = [
            'EXTREMELY LIBERAL', 'LIBERAL', 'SLIGHTLY LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE', 'CONSERVATIVE', 'EXTRMLY CONSERVATIVE'
        ], 
        ordered = True)
)
polv3 = polv.cat.rename_categories(new_categories = {
    'SLGHTLY CONSERVATIVE': 'SLIGHTLY CONSERVATIVE',
    'EXTRMLY CONSERVATIVE': 'EXTREMELY CONSERVATIVE'
})
polv3.value_counts(sort = False)
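
The same idea on a small, self-contained example may help when `gss.csv` is not at hand. The toy values below are hypothetical; only the misspelled label mirrors the GSS column.

```python
import pandas as pd

# Toy data (hypothetical), using the same abbreviated label as the GSS column
raw = ['LIBERAL', 'SLGHTLY CONSERVATIVE', 'MODERATE', 'SLGHTLY CONSERVATIVE']
toy = pd.Series(pd.Categorical(
    raw,
    categories=['LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE'],
    ordered=True,
))

# Only the labels listed in the dictionary change; order and counts are kept
toy_fixed = toy.cat.rename_categories(
    {'SLGHTLY CONSERVATIVE': 'SLIGHTLY CONSERVATIVE'}
)
print(list(toy_fixed.cat.categories))
print(toy_fixed.value_counts(sort=False))
```

Renaming touches only the labels, so the category order and the `ordered` flag carry over unchanged.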

## Wrangling Categorical Variables

### Updating Categories

Your turn: fix the `degreevar` values.

In [None]:
## Seeing it
degreevar = pd.Series(
    pd.Categorical(
        dat.degree, categories = [
            'LT HIGH SCHOOL', 'HIGH SCHOOL', 'JUNIOR COLLEGE', 'BACHELOR', 'GRADUATE'
        ], 
        ordered = True)
)
degreevar.value_counts(sort = False)

## Wrangling Categorical Variables

### Updating Categories

A handy alternative: `.cat.rename_categories` also accepts a function, which is applied to every category label.

In [None]:
## Seeing it (no worries about lambda and title, we will learn those)
polv6 = polv3.cat.rename_categories(lambda cat: cat.title())
polv6.value_counts(sort = False)
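
As a small sketch of the function-based form, any callable that takes a label and returns a new one should work. The housing labels below are hypothetical toy data.

```python
import pandas as pd

# Toy categorical (hypothetical labels)
toy = pd.Series(['OWN HOME', 'RENT', 'OWN HOME'], dtype='category')

# A built-in string method can be passed directly...
titled = toy.cat.rename_categories(str.title)

# ...or a lambda, when the transformation needs more than one step
snake = toy.cat.rename_categories(lambda cat: cat.lower().replace(' ', '_'))

print(list(titled.cat.categories))
print(list(snake.cat.categories))
```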

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`

In [None]:
## Your code here

## Wrangling Categorical Variables

### Collapsing Categories

In [None]:
## collapsing
polv7 = polv6.replace({
    'Extremely Liberal': 'Liberal',
    'Slightly Liberal': 'Liberal',
    'Slightly Conservative': 'Conservative',
    'Extremely Conservative': 'Conservative'
})
polv7.value_counts(sort = False)
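
The behavior of `.replace` on categorical columns has changed across pandas versions, and recent releases may emit a warning. A minimal alternative sketch, on hypothetical toy data, is to map the labels as plain objects and then rebuild the categorical with the collapsed categories:

```python
import pandas as pd

# Toy data (hypothetical)
views = pd.Series(pd.Categorical(
    ['Extremely Liberal', 'Liberal', 'Moderate',
     'Slightly Conservative', 'Conservative']
))

mapping = {
    'Extremely Liberal': 'Liberal',
    'Slightly Liberal': 'Liberal',
    'Slightly Conservative': 'Conservative',
    'Extremely Conservative': 'Conservative',
}

# Labels missing from the mapping are kept as they are
labels = views.astype(object).map(lambda label: mapping.get(label, label))

# Rebuild the categorical with the three collapsed categories, in order
collapsed = pd.Series(pd.Categorical(
    labels,
    categories=['Liberal', 'Moderate', 'Conservative'],
    ordered=True,
))
print(collapsed.value_counts(sort=False))
```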

## Wrangling Categorical Variables

**Exercise:** Collapse `degree` categories to just three categories of your choice.

In [None]:
## Your code here

## Wrangling Categorical Variables

### Reordering Categories

In [None]:
## reordering
polv8 = polv7.cat.reorder_categories(
    new_categories = ['Conservative', 'Moderate', 'Liberal'],
    ordered = True
)
polv8.value_counts(sort = False)
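
One reason the order matters: with an ordered categorical, `min`, `max`, and comparisons follow the category order rather than the alphabet. A short sketch on hypothetical toy data:

```python
import pandas as pd

# Toy data (hypothetical)
views = pd.Series(pd.Categorical(
    ['Liberal', 'Moderate', 'Conservative', 'Liberal'],
    categories=['Liberal', 'Moderate', 'Conservative'],
    ordered=True,
))

# Same values, opposite category order
flipped = views.cat.reorder_categories(
    ['Conservative', 'Moderate', 'Liberal'], ordered=True
)

print(views.min(), views.max())
print(flipped.min(), flipped.max())

# Comparisons use the category order too
print((flipped > 'Moderate').tolist())
```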

## Wrangling Categorical Variables

**Exercise:** Reorder the `degree` categories.

In [None]:
## Your code here

## Wrangling Categorical Variables

### Wrong Data Types

- A column may be stored with the wrong type, for example `object` instead of `category`.

- We know how to deal with this!

### Inconsistencies

- Data are very often inconsistent.

Example: `house`, `House`, ` House`, `HOUSE`. These all mean the same thing to us, but the computer treats them as four different values.

In [None]:
myv = ['house', 'condo', ' house', 'condo', 'house',
       'House', 'Condo', 'CONDO',  'House', 'house',
       'House', 'house', 'house', 'Condo', ' Condo',
       'Horse', 'Hoseu', 'Codno']
myv = pd.Series(myv)
myv.value_counts()

## Wrangling Categorical Variables

### Inconsistencies

- Removing whitespace:

In [None]:
myv = myv.str.strip() # Strip out whitespaces!
myv.value_counts()

## Wrangling Categorical Variables

### Inconsistencies

- Capitalization: we can use `.str.lower`, `.str.upper`, or `.str.title`
    + `upper`: All upper case
    + `lower`: All lower case
    + `title`: First letter of each word capitalized, the rest lower case

In [None]:
myv = myv.str.title() # Make it title
myv.value_counts()

## Wrangling Categorical Variables

### Misspelling

- Example: `House`, `Hoseu`.

In [None]:
myv = myv.replace({
    'Horse': 'House',
    'Hoseu': 'House',
    'Codno': 'Condo',
})
myv.value_counts()
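
The three cleaning steps (whitespace, capitalization, misspellings) can also be chained into one expression. This is a sketch on a shorter, hypothetical vector:

```python
import pandas as pd

# Toy data (hypothetical)
messy = pd.Series(['house', ' House', 'HOUSE ', 'Hoseu', 'condo', 'Codno'])

clean = (
    messy
    .str.strip()                                    # drop leading/trailing spaces
    .str.title()                                    # consistent capitalization
    .replace({'Hoseu': 'House', 'Codno': 'Condo'})  # fix known misspellings
)
print(clean.value_counts())
```

Running the misspelling fix last keeps the dictionary short, since each typo needs only one spelling after the case has been normalized.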

## Wrangling Categorical Variables

**Exercise:** Fix the following vector:

In [None]:
vec = pd.Series(
    ['yes', 'yes', 'yes', 'Yes', 'yEs', 'YESSS', ' yes',
     ' yes ', 'no', 'nooo', '  no way', 'no', 'No', 'Nononono',
     'Sure thing']
)
## Your code here

## Wrangling Categorical Variables

**Exercise**: Fix the remaining variables in this dataset. If you want, build this as a report.

In [None]:
## Answers here (in case you don't want to build a report)

# Wrangling Dates and Times

## Wrangling Dates and Times

Working with dates and times in a computer requires special methods.

1. Representation: Dates and times have a complex and structured nature. Storing and manipulating these components in a consistent and reliable manner requires specific data structures and formats.

1. Time Zones: Accurate handling of time zones is crucial for calculating time differences, scheduling events, and ensuring correct timestamps across different locations.

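1. Leap Years and Daylight Saving Time: Not every year has the same number of days, and not every day has 24 hours, so calendar rules must be built into the calculations.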

1. Arithmetic Operations: Dates and times often require arithmetic operations such as addition, subtraction, and comparison.

1. Formatting and Localization: Displaying dates and times in a human-readable format, adhering to cultural conventions.

1. Integration with External Systems: Dates and times often need to be exchanged and communicated with external systems, such as databases, APIs, and other software applications.
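
As a small illustration of the leap-year point, the standard library already knows the calendar rules. The sketch below uses only `calendar` and `datetime`:

```python
import calendar
from datetime import date

print(calendar.isleap(2020))
print(calendar.isleap(2021))

# February has a different length depending on the year
feb_2020 = (date(2020, 3, 1) - date(2020, 2, 1)).days
feb_2021 = (date(2021, 3, 1) - date(2021, 2, 1)).days
print(feb_2020, feb_2021)

# A date that does not exist is rejected instead of silently accepted
try:
    date(2021, 2, 29)
except ValueError as err:
    print('Invalid date:', err)
```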

## Wrangling Dates and Times

Let us look at this example:

In [None]:
# Here are two dates
dts = ['5/17/2020', '12/2/2022']

## Wrangling Dates and Times

Let's get started by importing from `datetime`, the standard-library module that deals with dates and times:

In [None]:
# Dates and times handlers
from datetime import date
from datetime import timedelta
from datetime import datetime as dtm
from datetime import timezone as tmz

## Wrangling Dates and Times

- To create a date object:
    + The order: `date(YEAR, MONTH, DAY)`

In [None]:
dts2 = [date(2020, 5, 17), date(2022, 12, 2)]
print(dts2[0], end = '\n\n') # ISO formatted date!
print(dts2[0].year, end = '\n\n')
print(dts2[0].month, end = '\n\n')
print(dts2[0].day, end = '\n\n')

## Wrangling Dates and Times

Weekdays:

- 0 = Monday
- 1 = Tuesday
- .
- .
- .
- 6 = Sunday

In [None]:
dts2[0].weekday()
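
Two related calls may be useful here: `isoweekday()` counts from Monday = 1 to Sunday = 7, and `strftime('%A')` returns the weekday name. A short sketch:

```python
from datetime import date

d = date(2020, 5, 17)

print(d.weekday())       # Monday = 0, ..., Sunday = 6
print(d.isoweekday())    # Monday = 1, ..., Sunday = 7
print(d.strftime('%A'))  # weekday name, in the current locale
```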

## Wrangling Dates and Times

**Exercise:** Find the weekday of the following date:

In [None]:
mydate = date(2020, 1, 16)
## Your code here

## Wrangling Dates and Times

### Math

- When dates are stored as plain strings, the computer compares them character by character, not chronologically. For instance:

In [None]:
print(dts)
min(dts)

In [None]:
# Difference between two dates?
# Run this and see: '12/2/2022' - '5/17/2020'

## Wrangling Dates and Times

### Math

With date objects, things get less complicated:

In [None]:
print(dts2)
min(dts2)

In [None]:
# Subtracting two dates creates a timedelta object:
dts2[1] - dts2[0]

In [None]:
delta = dts2[1] - dts2[0]
delta.days

## Wrangling Dates and Times

**Exercise:** Find the difference in days between the following dates:

In [None]:
mydate1 = date(2020, 1, 16)
mydate2 = date(2021, 3, 26)
## Your code here

## Wrangling Dates and Times

### Math

What day will it be exactly 30 days from now?

We can use `timedelta` to create the difference and add it to today's date.

In [None]:
# Today's date, and the date 30 days from today
print(date.today())
print(date.today() + timedelta(days = 30))
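
Because `date.today()` changes every day, a fixed start date makes the arithmetic easier to check. This sketch also shows that `timedelta` accepts `weeks` and that subtraction works the same way:

```python
from datetime import date, timedelta

start = date(2020, 1, 31)

# 30 days later crosses February of a leap year
later = start + timedelta(days=30)
print(later)

# Going back in time works with subtraction
earlier = start - timedelta(weeks=2)
print(earlier)
```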

## Wrangling Dates and Times

### Turning dates back to strings

- ISO dates:
    + Format: 'YYYY-MM-DD'
    + Months and days are zero-padded (e.g., '2021-03-05'), so every date has the same length.
    + Because of that, ISO dates sort chronologically even when stored as strings.

In [None]:
print(date.today())

Express the date in ISO format:

In [None]:
date.today().isoformat() # Creates a string!
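
A quick way to see the sorting claim is to sort the same three dates written in ISO format and in US format. The third date is added here only for illustration:

```python
from datetime import date

iso = ['2022-12-02', '2020-05-17', '2021-01-09']
us = ['12/2/2022', '5/17/2020', '1/9/2021']

# ISO strings come out in chronological order
print(sorted(iso))

# US-format strings are sorted character by character, not by date
print(sorted(us))

# An ISO string converts back to a date object
back = date.fromisoformat(iso[0])
print(back.year, back.month, back.day)
```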

## Wrangling Dates and Times

### Turning dates back to strings

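- Custom dates: We will use the `strftime` method
    + '%Y': Four-digit year
    + '%m': Month as a zero-padded number
    + '%d': Day of the month, zero-padded
    + '%B': Full month name
    + '%D': US-style short date, MM/DD/YY (support may depend on the platform)
    + And others [here](https://strftime.org)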

In [None]:
print(date.today().strftime('%Y'), end = '\n\n')
print(date.today().strftime('The year is %Y'), end = '\n\n')
print(date.today().strftime('%B (%Y)'), end = '\n\n')
print(date.today().strftime('%Y/%m/%d'), end = '\n\n')
print(date.today().strftime('%Y-%j'), end = '\n\n')
print(date.today().strftime('%D'), end = '\n\n')
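
With a fixed date the output of each format code is reproducible, which makes the codes easier to compare. A short sketch; note that names such as the month in `%B` follow the current locale:

```python
from datetime import date

d = date(2021, 3, 5)

iso_like = d.strftime('%Y-%m-%d')     # year-month-day, zero-padded
day_first = d.strftime('%d/%m/%Y')    # day first, as used in many countries
day_of_year = d.strftime('%j')        # day of the year, padded to three digits
long_form = d.strftime('%B %d, %Y')   # full month name (locale dependent)

print(iso_like, day_first, day_of_year, long_form, sep='\n')
```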

## Wrangling Dates and Times

**Exercise:** Print each of these dates in four different formats based on the formats in [here](https://strftime.org).

In [None]:
mydate1 = date(2020, 1, 16)
mydate2 = date(2021, 3, 26)
## Your code here

## Wrangling Dates and Times

### Times and dates

Sometimes, we also have to work with times: E.g., detect credit card fraud!

For instance:

> May 12, 2021, 3:23:25 PM

Means:

In [None]:
# Computers work with 24-h settings
dt1 = dtm(2021, 5, 12, 15, 23, 25)
dt1

## Wrangling Dates and Times

### Times and dates

We can display the dates nicely:

In [None]:
print(dt1.isoformat(), end = '\n\n')
print(dt1.strftime('%Y-%m-%d'), end = '\n\n')
print(dt1.strftime('%Y-%m-%d %H:%M:%S'), end = '\n\n')
print(dt1.strftime('The thing happened in-on-at (no idea...) %H:%M:%S in-on-at %Y-%m-%d'), end = '\n\n')

## Wrangling Dates and Times

### Times and dates

And if we have dates stored as strings in a known format, such as:

In [None]:
mydt = dt1.strftime('%m/%d/%Y %H:%M:%S')
print(mydt)

We can easily parse it:

In [None]:
dtm.strptime(mydt, '%m/%d/%Y %H:%M:%S')
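
Parsing also resolves the ordering problem seen earlier with the two date strings. The sketch below repeats those strings, parses them with `strptime`, and compares the results:

```python
from datetime import datetime

dts = ['5/17/2020', '12/2/2022']

# Parse each string, then keep only the date part
parsed = [datetime.strptime(s, '%m/%d/%Y').date() for s in dts]

# As strings, '12/2/2022' is the "minimum" because '1' sorts before '5'
print(min(dts))

# As dates, the minimum is the earlier day
print(min(parsed))

# Subtraction is now possible too
gap = (parsed[1] - parsed[0]).days
print(gap)
```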

## Wrangling Dates and Times

### Timedeltas and duration

As we saw before, we can create time differences (or timedeltas) by subtracting one date from another:

In [None]:
begin = dtm(2021, 5, 5, 23, 20, 2)
end = dtm(2021, 7, 5, 8, 15, 22)
delta = end - begin

## Wrangling Dates and Times

### Timedeltas and duration

In [None]:
delta.total_seconds()
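
A `timedelta` stores whole days and leftover seconds separately, and `total_seconds()` combines them. A sketch of how to break the same difference into days, hours, minutes, and seconds:

```python
from datetime import datetime

begin = datetime(2021, 5, 5, 23, 20, 2)
end = datetime(2021, 7, 5, 8, 15, 22)
delta = end - begin

print(delta.days)             # whole days
print(delta.seconds)          # seconds left over after the whole days
print(delta.total_seconds())  # everything expressed in seconds

# Split the leftover seconds into hours, minutes, and seconds
hours, remainder = divmod(delta.seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f'{delta.days} days, {hours} h {minutes} min {seconds} s')
```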

## Wrangling Dates and Times

**Exercise:** Read the following date and time correctly: "Jan 16, 2021 at 3:30 AM"

In [None]:
# Your answers here

## Wrangling Dates and Times

### Timezone

In [None]:
# Timezone PDT
PDT = tmz(timedelta(hours = -7))

# Date and time in a specific time zone
dtPDT = dtm(2021, 5, 12, 15, 23, 25, tzinfo = PDT)
print(dtPDT)

In [None]:
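# Or we can convert to another time zone.
# Note: dt1 has no time zone (it is "naive"), so astimezone assumes
# it is expressed in this computer's local time.
# ET here is a fixed UTC-5 offset (Eastern Standard Time)
ET = tmz(timedelta(hours = -5))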

# Before
print(dt1)

# After
dtET = dt1.astimezone(ET)
print(dtET)
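
Converting a time-zone-aware datetime gives the same result on every machine, because it does not depend on the local clock settings. A minimal sketch with fixed offsets; the `EDT` offset (UTC-4) is an assumption added here for illustration:

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7), 'PDT')
EDT = timezone(timedelta(hours=-4), 'EDT')  # assumed offset for Eastern daylight time

aware = datetime(2021, 5, 12, 15, 23, 25, tzinfo=PDT)

in_utc = aware.astimezone(timezone.utc)
in_edt = aware.astimezone(EDT)
print(in_utc)
print(in_edt)

# Same instant, different wall-clock reading
print(aware == in_edt)
```

Fixed offsets do not switch between standard and daylight time on their own. The standard-library `zoneinfo` module (Python 3.9+) offers named zones that follow those rules.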

## Wrangling Dates and Times

**Exercise:** Change `dt1` to India time zone (UTC+5:30).

In [None]:
# Your answers here

## Wrangling Dates and Times

### Pandas

In [None]:
# Datasets
dat2 = pd.read_csv('lakers.csv')

## Wrangling Dates and Times

**Exercise**: Explore this dataset.

In [None]:
# Your code here

## Wrangling Dates and Times

### Pandas

To parse dates and times in pandas, we use the `pd.to_datetime` function:

In [None]:
# More on that later...
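
As a brief preview, here is a sketch of `pd.to_datetime` on a toy Series rather than on `lakers.csv`, whose columns are not shown here. The strings are the two dates used earlier in the lecture.

```python
import pandas as pd

raw = pd.Series(['5/17/2020', '12/2/2022'])

# Giving the format explicitly avoids guessing between month-first and day-first
parsed = pd.to_datetime(raw, format='%m/%d/%Y')
print(parsed)

# The .dt accessor exposes the date parts for the whole column
years = parsed.dt.year.tolist()
weekdays = parsed.dt.day_name().tolist()
print(years, weekdays)

# Subtracting two timestamps gives a Timedelta
gap = (parsed[1] - parsed[0]).days
print(gap)
```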

# Great work!