# CSS 201.5 - CSS MA Bootcamp

## Lecture 04 - Python for Data Analysis

# Data Wrangling

## Data Wrangling

Data wrangling, also known as data munging or data preprocessing, refers to the process of cleaning, transforming, and preparing raw data into a structured format suitable for analysis. Here are some common tasks involved in data wrangling:

1. Data cleaning: Handling missing values, dealing with outliers, correcting inconsistent or inaccurate data.

1. Data integration: When working with multiple data sources, data integration involves combining data from different sources into a single dataset.

1. Data transformation: Data transformation involves converting data into a suitable format for analysis.

1. Data reduction: In some cases, the original dataset may be too large or contain unnecessary variables. Data reduction involves selecting relevant variables and reducing the size of the dataset without losing critical information.

1. Handling inconsistencies: Addressing inconsistencies in the data, such as inconsistent formatting, inconsistent units of measurement, or inconsistent categorical values.


Data wrangling is one of the most important skills you are going to learn as a Computational Social Scientist. It is a crucial step in the data analysis pipeline, as the quality of the final analysis depends on the quality of the prepared data.

# Wrangling Categorical Variables

## Wrangling Categorical Variables

- **Categorical variable**: Variable that represents qualitative data, often organized into discrete categories or groups.

- It can take on a limited number of distinct values or levels, where each value represents a particular group or category.

## Wrangling Categorical Variables

- Categorical variables are commonly used to classify or describe characteristics, attributes, or qualities.

- Examples of categorical variables:
    + *Gender* (male, female, and others)
    + *Marital status* (single, married, divorced, etc)
    + *Educational level* (high school, college, graduate, etc).

## Wrangling Categorical Variables

- Categorical variables can be further classified into:
    + **Nominal**: Categories have no inherent order
    + **Ordinal**: Categories have a specific order or ranking.
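
As a minimal sketch of the distinction (the values below are made up for illustration, not taken from any dataset), pandas can represent both kinds with `pd.Categorical`. Only the ordinal version supports comparisons and `min`/`max`:

```python
import pandas as pd

# Nominal: categories are just labels, with no order
marital = pd.Series(pd.Categorical(['single', 'married', 'divorced', 'married']))

# Ordinal: we state the ranking explicitly
education = pd.Series(pd.Categorical(
    ['college', 'high school', 'graduate', 'college'],
    categories=['high school', 'college', 'graduate'],
    ordered=True))

print(marital.cat.ordered)                 # is the nominal variable ordered?
print(education.cat.ordered)               # is the ordinal variable ordered?
print(education.min())                     # lowest level present in the data
print((education > 'high school').sum())   # how many are above high school
```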

## Wrangling Categorical Variables

In [1]:
## Loading a couple of friends in here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
from plotly import express as px
from plotly.subplots import make_subplots

## Wrangling Categorical Variables

In [2]:
# Loading the General Social Survey
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/gss.csv')
dat.head()

Unnamed: 0,region,income,happy,age,finrela,marital,degree,health,wrkstat,partyid,polviews,sex,year
0,E. NOR. CENTRAL,$10000 - 14999,NOT TOO HAPPY,54.0,AVERAGE,MARRIED,LT HIGH SCHOOL,FAIR,WORKING FULLTIME,OTHER PARTY,,MALE,1973
1,E. NOR. CENTRAL,$7000 TO 7999,VERY HAPPY,51.0,AVERAGE,MARRIED,LT HIGH SCHOOL,GOOD,KEEPING HOUSE,NOT STR DEMOCRAT,,FEMALE,1973
2,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,36.0,AVERAGE,MARRIED,LT HIGH SCHOOL,EXCELLENT,WORKING FULLTIME,IND NEAR REP,,FEMALE,1973
3,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,32.0,AVERAGE,MARRIED,HIGH SCHOOL,EXCELLENT,WORKING FULLTIME,NOT STR DEMOCRAT,,MALE,1973
4,E. NOR. CENTRAL,$10000 - 14999,PRETTY HAPPY,54.0,AVERAGE,MARRIED,LT HIGH SCHOOL,GOOD,KEEPING HOUSE,IND NEAR REP,,FEMALE,1973


## Wrangling Categorical Variables

**Exercise:** Explore the data.

1. Make at least two plots
2. Generate at least two descriptive stats
3. Is there missing data? Describe it and, if there is any, do something about it.

Feel free to ask me the meaning of the variables, if it is not clear.

In [3]:
## Your code here

## Wrangling Categorical Variables

The first step to take is to explore the variables. There are two ways to do that:

1. `.describe()`
2. `.value_counts()`

In [4]:
# Health status
print(dat.health.describe(), end = '\n\n\n')
print(dat.health.value_counts())

count     40273
unique        6
top        GOOD
freq      17747
Name: health, dtype: object


GOOD         17747
EXCELLENT    12135
FAIR          7402
POOR          2224
IAP            763
DK               2
Name: health, dtype: int64


## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `polviews`.

In [5]:
## Your code here

## Wrangling Categorical Variables

The second step is to identify which columns hold numerical data and which hold categorical data.

We start by printing the data types:

In [6]:
dat.dtypes

region       object
income       object
happy        object
age         float64
finrela      object
marital      object
degree       object
health       object
wrkstat      object
partyid      object
polviews     object
sex          object
year          int64
dtype: object

## Wrangling Categorical Variables

Data types can be:

- `int`: Integer data
- `float`: Continuous numerical data
- `object`: Generic Python objects, most often text (strings)
- others (e.g., `bool`, `datetime64`, `category`)

For categorical variables, we want the type to be `category`.

In [7]:
dat.happy.dtype

dtype('O')

In [8]:
dat.happy = dat.happy.astype('category')

## Wrangling Categorical Variables

And when we check it again:

In [9]:
dat.dtypes

region        object
income        object
happy       category
age          float64
finrela       object
marital       object
degree        object
health        object
wrkstat       object
partyid       object
polviews      object
sex           object
year           int64
dtype: object

In [10]:
dat.happy.dtype

CategoricalDtype(categories=['DK', 'NOT TOO HAPPY', 'PRETTY HAPPY', 'VERY HAPPY'], ordered=False)

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [11]:
## Your code here

## Wrangling Categorical Variables

There are two ways to create a Categorical Series in pandas.

- By the way, in pandas, the variables (data columns) are called *series*.

Way 1:

In [12]:
myvar = ['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']
myseries = pd.Series(myvar, dtype = 'category')
myseries

0    Happy
1    Happy
2      Sad
3    Happy
4      Sad
5      Meh
dtype: category
Categories (3, object): ['Happy', 'Meh', 'Sad']

Way 2 (better!):

In [13]:
myseries2 = pd.Categorical(myvar, 
                           categories = ['Sad', 'Meh', 'Happy'],
                           ordered = True)
myseries2

['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']
Categories (3, object): ['Sad' < 'Meh' < 'Happy']
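
One detail worth knowing: `pd.Categorical` returns a Categorical array, not a Series. A minimal sketch of wrapping it in `pd.Series` so that the `.cat` accessor and the usual Series methods are available (the name `mood` is just for illustration):

```python
import pandas as pd

myvar = ['Happy', 'Happy', 'Sad', 'Happy', 'Sad', 'Meh']
mood = pd.Series(pd.Categorical(myvar,
                                categories=['Sad', 'Meh', 'Happy'],
                                ordered=True))

print(type(mood))                    # a pandas Series
print(list(mood.cat.categories))     # categories in the order we declared
print(mood.sort_values().tolist())   # sorted by mood, not alphabetically
print(mood.max())                    # the "highest" mood in the data
```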

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`.

In [14]:
## Your code here

## Wrangling Categorical Variables

Why store variables as categorical?

1. Built-in methods to deal with the data
1. We can specify an order
1. Very important but often neglected: memory usage

In [15]:
vobject = dat.polviews
vobject.nbytes

427792

In [16]:
vobject = dat.polviews.astype('category')
vobject.nbytes

53538
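
Note that for an `object` column, `.nbytes` counts only the 8-byte references to the strings, not the strings themselves, so the real saving is usually larger. A small sketch with a synthetic column (made-up data, not the GSS) using `memory_usage(deep=True)`, which also counts the strings:

```python
import pandas as pd

# Synthetic column: three labels repeated many times (made-up data)
labels = pd.Series(['MODERATE', 'LIBERAL', 'CONSERVATIVE'] * 10_000,
                   dtype='object')
as_cat = labels.astype('category')

# deep=True also counts the Python strings behind an object column
obj_bytes = labels.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)

print(obj_bytes, cat_bytes)
print(cat_bytes < obj_bytes)
```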

## Wrangling Categorical Variables

And we can specify all that at the beginning stage, when loading the data:

In [17]:
data_types = {'happy': 'category'}
dat = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/gss.csv',
                 dtype = data_types)
dat.dtypes

region        object
income        object
happy       category
age          float64
finrela       object
marital       object
degree        object
health        object
wrkstat       object
partyid       object
polviews      object
sex           object
year           int64
dtype: object

Doing this step at this point saves lots of memory!
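
The same idea extends to ordered categories. Here is a sketch, assuming a tiny made-up CSV held in memory instead of the real file, that passes a `CategoricalDtype` to `read_csv` so the categories and their order are fixed at load time:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# A tiny made-up CSV standing in for a real file or URL
csv_text = """happy,age
VERY HAPPY,54
PRETTY HAPPY,36
NOT TOO HAPPY,41
PRETTY HAPPY,29
"""

# Declare the categories and their order once, then hand it to read_csv
happy_type = CategoricalDtype(
    categories=['NOT TOO HAPPY', 'PRETTY HAPPY', 'VERY HAPPY'],
    ordered=True)

toy = pd.read_csv(io.StringIO(csv_text), dtype={'happy': happy_type})
print(toy.dtypes)
print(toy.happy.min())   # lowest happiness level in the toy data
```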

## Wrangling Categorical Variables

We have used `groupby` before. Let's do it again, now with categorical data.

In [18]:
## Group by (method chaining)
(dat.groupby('happy')
    .age
    .mean())

happy
DK               51.500000
NOT TOO HAPPY    46.130420
PRETTY HAPPY     44.632033
VERY HAPPY       46.953361
Name: age, dtype: float64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree` and `age`.

In [19]:
## Your code here

## Wrangling Categorical Variables

We can do that with multiple categories.

In [20]:
## Group by (method chaining)
(dat.groupby(['happy', 'health'])
    .age
    .mean())

happy          health   
DK             DK                 NaN
               EXCELLENT          NaN
               FAIR         51.500000
               GOOD               NaN
               IAP                NaN
               POOR               NaN
NOT TOO HAPPY  DK           53.500000
               EXCELLENT    40.870445
               FAIR         47.171703
               GOOD         42.869019
               IAP          49.242718
               POOR         56.920000
PRETTY HAPPY   DK                 NaN
               EXCELLENT    39.277410
               FAIR         51.043950
               GOOD         43.226660
               IAP          47.133772
               POOR         60.323988
VERY HAPPY     DK                 NaN
               EXCELLENT    43.093289
               FAIR         55.592857
               GOOD         47.616559
               IAP          50.044554
               POOR         61.384401
Name: age, dtype: float64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree`, `happy`, and `age`.

In [21]:
## Your code here

## Wrangling Categorical Variables

The `NaN` values appear because some combinations of categories have no observations.

We can check the number of cases within each combination using the `.size()` method:

In [22]:
## Groupby + size
(dat.groupby(['happy', 'health'])
    .size())

happy          health   
DK             DK              0
               EXCELLENT       0
               FAIR            2
               GOOD            0
               IAP             0
               POOR            0
NOT TOO HAPPY  DK              2
               EXCELLENT     745
               FAIR         1464
               GOOD         1675
               IAP           103
               POOR          730
PRETTY HAPPY   DK              0
               EXCELLENT    5290
               FAIR         3969
               GOOD         9971
               IAP           456
               POOR          965
VERY HAPPY     DK              0
               EXCELLENT    5171
               FAIR         1402
               GOOD         4665
               IAP           202
               POOR          361
dtype: int64
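
Whether empty combinations show up at all is controlled by the `observed` argument of `groupby`. Its default has changed across pandas versions, so it is safer to pass it explicitly. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data: the ('DK', 'GOOD') combination never occurs
toy = pd.DataFrame({
    'happy': pd.Categorical(['VERY HAPPY', 'VERY HAPPY', 'DK']),
    'health': pd.Categorical(['GOOD', 'FAIR', 'FAIR']),
    'age': [30, 50, 40],
})

# observed=False keeps every combination, including empty ones
all_combos = toy.groupby(['happy', 'health'], observed=False).size()

# observed=True keeps only combinations that appear in the data
seen_only = toy.groupby(['happy', 'health'], observed=True).size()

print(all_combos)
print(seen_only)
```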

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree` and `happy`.

In [23]:
## Your code here

## Wrangling Categorical Variables

The empty combinations can be dropped from the result with `.dropna()`:

In [24]:
## Groupby + dropna
(dat.groupby(['happy', 'health'])
    .age
    .mean()
    .dropna())

happy          health   
DK             FAIR         51.500000
NOT TOO HAPPY  DK           53.500000
               EXCELLENT    40.870445
               FAIR         47.171703
               GOOD         42.869019
               IAP          49.242718
               POOR         56.920000
PRETTY HAPPY   EXCELLENT    39.277410
               FAIR         51.043950
               GOOD         43.226660
               IAP          47.133772
               POOR         60.323988
VERY HAPPY     EXCELLENT    43.093289
               FAIR         55.592857
               GOOD         47.616559
               IAP          50.044554
               POOR         61.384401
Name: age, dtype: float64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variables `degree`, `happy`, and `age`, removing the missing values.

In [25]:
## Your code here

## Wrangling Categorical Variables

The third step is to set up the categories (their names, levels, and order) in a way that helps the analysis.

Let us look at the variable `polviews`. We start by converting it to categorical and checking what it contains:

In [26]:
dat.polviews = dat.polviews.astype('category')
(dat.polviews
    .value_counts(dropna = False))

MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64

## Wrangling Categorical Variables

To see some more details, let us separate the series:

In [27]:
polv = dat.polviews
polv

0                         NaN
1                         NaN
2                         NaN
3                         NaN
4                         NaN
                 ...         
53469    SLGHTLY CONSERVATIVE
53470    EXTRMLY CONSERVATIVE
53471        SLIGHTLY LIBERAL
53472        SLIGHTLY LIBERAL
53473                MODERATE
Name: polviews, Length: 53474, dtype: category
Categories (8, object): ['CONSERVATIVE', 'DK', 'EXTREMELY LIBERAL', 'EXTRMLY CONSERVATIVE', 'LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE', 'SLIGHTLY LIBERAL']

## Wrangling Categorical Variables

### Set Categories

In [28]:
polv2 = polv.cat.set_categories(
    new_categories = ['EXTREMELY LIBERAL', 
                      'LIBERAL', 
                      'SLIGHTLY LIBERAL', 
                      'MODERATE', 
                      'SLGHTLY CONSERVATIVE', 
                      'CONSERVATIVE', 
                      'EXTRMLY CONSERVATIVE']
)
print(polv2.value_counts(dropna = False), end = '\n\n\n')
print(polv.value_counts(dropna = False), end = '\n\n\n')

MODERATE                17781
NaN                      7472
SLGHTLY CONSERVATIVE     7423
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
Name: polviews, dtype: int64


MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64
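
Notice what happened to `DK`: it was not in the new list of categories, so its 61 observations became missing (the `NaN` count went from 7411 to 7472). A toy sketch of the same behavior, with made-up values:

```python
import pandas as pd

# Made-up answers, including a "don't know" (DK) response
views = pd.Series(['LIBERAL', 'DK', 'MODERATE', 'LIBERAL'], dtype='category')

# 'DK' is left out of the new category list, so it turns into NaN
views2 = views.cat.set_categories(['LIBERAL', 'MODERATE', 'CONSERVATIVE'])

print(views.isna().sum())           # missing values before
print(views2.isna().sum())          # missing values after
print(list(views2.cat.categories))  # the categories we asked for
```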




## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`

In [29]:
## Your code here

## Wrangling Categorical Variables

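### Set Categories

Passing `ordered = True` makes the categories ordinal, and `sort = False` in `.value_counts()` lists the counts in category order rather than by frequency: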

In [30]:
polv3 = polv.cat.set_categories(
    new_categories = ['EXTREMELY LIBERAL', 'LIBERAL', 'SLIGHTLY LIBERAL', 'MODERATE', 'SLGHTLY CONSERVATIVE', 'CONSERVATIVE', 'EXTRMLY CONSERVATIVE'],
    ordered = True
)
print(polv3.value_counts(dropna = False, sort = False), end = '\n\n\n')
print(polv.value_counts(dropna = False), end = '\n\n\n')

EXTREMELY LIBERAL        1249
LIBERAL                  5338
SLIGHTLY LIBERAL         5973
MODERATE                17781
SLGHTLY CONSERVATIVE     7423
CONSERVATIVE             6800
EXTRMLY CONSERVATIVE     1438
NaN                      7472
Name: polviews, dtype: int64


MODERATE                17781
SLGHTLY CONSERVATIVE     7423
NaN                      7411
CONSERVATIVE             6800
SLIGHTLY LIBERAL         5973
LIBERAL                  5338
EXTRMLY CONSERVATIVE     1438
EXTREMELY LIBERAL        1249
DK                         61
Name: polviews, dtype: int64




## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`

In [31]:
## Your code here

## Wrangling Categorical Variables

### Add and Remove Categories

We can also add and remove categories. If we add categories that are not there, they will have a zero count:

In [32]:
polv4 = polv.cat.add_categories(
    new_categories = [
        'very very liberal',
        'very very conservative'],
)
print(polv4.value_counts(dropna = False, sort = False))

CONSERVATIVE               6800
DK                           61
EXTREMELY LIBERAL          1249
EXTRMLY CONSERVATIVE       1438
LIBERAL                    5338
MODERATE                  17781
SLGHTLY CONSERVATIVE       7423
SLIGHTLY LIBERAL           5973
very very liberal             0
very very conservative        0
NaN                        7411
Name: polviews, dtype: int64


## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`, creating one imaginary degree.

In [33]:
## Your code here

## Wrangling Categorical Variables

### Add and Remove Categories

In [34]:
polv5 = polv4.cat.remove_categories(
    removals = ['MODERATE']
)
print(polv5.value_counts(dropna = False, sort = False))

CONSERVATIVE               6800
DK                           61
EXTREMELY LIBERAL          1249
EXTRMLY CONSERVATIVE       1438
LIBERAL                    5338
SLGHTLY CONSERVATIVE       7423
SLIGHTLY LIBERAL           5973
very very liberal             0
very very conservative        0
NaN                       25192
Name: polviews, dtype: int64
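
Removing a category that has observations turns those observations into `NaN` (here, the 17781 `MODERATE` cases). If the goal is only to get rid of categories with zero observations, `.cat.remove_unused_categories()` does that without touching the data. A small sketch with made-up values:

```python
import pandas as pd

views = pd.Series(['LIBERAL', 'MODERATE', 'LIBERAL'], dtype='category')
views = views.cat.add_categories(['VERY VERY LIBERAL'])  # never used

# Drops only categories with zero observations; no value becomes NaN
tidy = views.cat.remove_unused_categories()

# Drops the named category and turns its observations into NaN
no_mod = views.cat.remove_categories(['MODERATE'])

print(list(tidy.cat.categories))
print(tidy.isna().sum())
print(no_mod.isna().sum())
```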


## Wrangling Categorical Variables

**Exercise:** Remove the imaginary degree you created (`degree`).

In [35]:
## Your code here

## Wrangling Categorical Variables

### Updating Categories

We can rename categories using the `.rename_categories` method.

In [36]:
## Seeing it
polv3 = polv3.cat.rename_categories(new_categories = {
    'SLGHTLY CONSERVATIVE': 'SLIGHTLY CONSERVATIVE',
    'EXTRMLY CONSERVATIVE': 'EXTREMELY CONSERVATIVE'
})
polv3.value_counts(sort = False)

EXTREMELY LIBERAL          1249
LIBERAL                    5338
SLIGHTLY LIBERAL           5973
MODERATE                  17781
SLIGHTLY CONSERVATIVE      7423
CONSERVATIVE               6800
EXTREMELY CONSERVATIVE     1438
Name: polviews, dtype: int64

## Wrangling Categorical Variables

### Updating Categories

Conveniently, `.rename_categories` also accepts a function, which is applied to every category label.

In [37]:
## Seeing it (no worries about lambda and title, we will learn those)
polv6 = polv3.cat.rename_categories(lambda cat: cat.title())
polv6.value_counts(sort = False)

Extremely Liberal          1249
Liberal                    5338
Slightly Liberal           5973
Moderate                  17781
Slightly Conservative      7423
Conservative               6800
Extremely Conservative     1438
Name: polviews, dtype: int64

## Wrangling Categorical Variables

**Exercise:** Do the same with the variable `degree`

In [38]:
## Your code here

## Wrangling Categorical Variables

### Collapsing Categories

In [39]:
## collapsing
polv7 = polv6.replace({
    'Extremely Liberal': 'Liberal',
    'Slightly Liberal': 'Liberal',
    'Slightly Conservative': 'Conservative',
    'Extremely Conservative': 'Conservative'
})
polv7.value_counts(sort = False)

Liberal         12560
Moderate        17781
Conservative    15661
Name: polviews, dtype: int64
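
Depending on the pandas version, `.replace` on a categorical Series may emit a deprecation warning when it merges categories. One alternative, sketched here on made-up values (this is not the approach used in the cell above), is to map every label through a dictionary and convert back to `category`:

```python
import pandas as pd

views = pd.Series(['Extremely Liberal', 'Liberal', 'Moderate',
                   'Slightly Conservative', 'Conservative'],
                  dtype='category')

# Every original label needs an entry; unmapped labels would become NaN
collapse = {
    'Extremely Liberal': 'Liberal',
    'Liberal': 'Liberal',
    'Moderate': 'Moderate',
    'Slightly Conservative': 'Conservative',
    'Conservative': 'Conservative',
}

# Map label by label, then convert the result back to category
collapsed = views.astype(object).map(collapse).astype('category')
print(collapsed.value_counts(sort=False))
```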

## Wrangling Categorical Variables

**Exercise:** Collapse `degree` categories to just three categories of your choice.

In [40]:
## Your code here

## Wrangling Categorical Variables

### Reordering Categories

In [41]:
## reordering
polv8 = polv7.cat.reorder_categories(
    new_categories = ['Conservative', 'Moderate', 'Liberal'],
    ordered = True
)
polv8.value_counts(sort = False)

Conservative    15661
Moderate        17781
Liberal         12560
Name: polviews, dtype: int64

## Wrangling Categorical Variables

**Exercise:** Reorder the `degree` categories.

In [42]:
## Your code here

## Wrangling Categorical Variables

### Wrong Data Types

- Being, for example, object instead of category

- We know how to deal with this!

### Inconsistencies

- It is very common that our data is inconsistent. 
    
Example: `house`, `House`, ` House`, `HOUSE`. All the same for us, but all different for the computer.

In [43]:
myv = ['house', 'condo', ' house', 'condo', 'house',
       'House', 'Condo', 'CONDO',  'House', 'house',
       'House', 'house', 'house', 'Condo', ' Condo',
       'Horse', 'Hoseu', 'Codno']
myv = pd.Series(myv)
myv.value_counts()

house     5
House     3
condo     2
Condo     2
 house    1
CONDO     1
 Condo    1
Horse     1
Hoseu     1
Codno     1
dtype: int64

## Wrangling Categorical Variables

### Inconsistencies

- Removing whitespace:

In [44]:
myv = myv.str.strip() # Strip leading and trailing whitespace!
myv.value_counts()

house    6
House    3
Condo    3
condo    2
CONDO    1
Horse    1
Hoseu    1
Codno    1
dtype: int64

## Wrangling Categorical Variables

### Inconsistencies

- Capitalization: we can use `.str.lower()`, `.str.upper()`, or `.str.title()`
    + `upper`: All upper case
    + `lower`: All lower case
    + `title`: First letter of each word capitalized

In [45]:
myv = myv.str.title() # Make it title
myv.value_counts()

House    9
Condo    6
Horse    1
Hoseu    1
Codno    1
dtype: int64

## Wrangling Categorical Variables

### Misspelling

- Example: `House`, `Hoseu`.

In [46]:
myv = myv.replace({
    'Horse': 'House',
    'Hoseu': 'House',
    'Codno': 'Condo',
})
myv.value_counts()

House    11
Condo     7
dtype: int64
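
Writing the replacement dictionary by hand works well for a few typos. With many misspellings, one option is fuzzy matching against the list of valid labels. The sketch below uses `difflib` from the standard library; the helper `closest` and the cutoff of 0.6 are illustrative choices, and the matches should always be reviewed by eye (a value like `Horse` might be a real answer, not a typo):

```python
import difflib
import pandas as pd

valid = ['House', 'Condo']   # the spellings we accept
raw = pd.Series(['House', 'Hoseu', 'Codno', 'Condo', 'Horse'])

def closest(label):
    # Return the most similar valid label, or the input if none is close
    match = difflib.get_close_matches(label, valid, n=1, cutoff=0.6)
    return match[0] if match else label

fixed = raw.map(closest)
print(fixed.tolist())
```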

## Wrangling Categorical Variables

**Exercise:** Fix the following vector:

In [47]:
vec = pd.Series(
    ['yes', 'yes', 'yes', 'Yes', 'yEs', 'YESSS', ' yes',
     ' yes ', 'no', 'nooo', '  no way', 'no', 'No', 'Nononono',
     'Sure thing']
)
## Your code here

## Wrangling Categorical Variables

**Exercise**: Fix the remaining variables in this dataset. If you want, build this as a report.

In [48]:
## Answers here (in case you don't want to build a report)

# Wrangling Dates and Times

## Wrangling Dates and Times

Working with dates and times in a computer requires special methods.

1. Representation: Dates and times have a complex and structured nature. Storing and manipulating these components in a consistent and reliable manner requires specific data structures and formats.

1. Time Zones: Accurate handling of time zones is crucial for calculating time differences, scheduling events, and ensuring correct timestamps across different locations.

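1. Leap Years and Daylight Saving Time: Years and days do not all have the same length, so calendar arithmetic cannot assume fixed durations.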

1. Arithmetic Operations: Dates and times often require arithmetic operations such as addition, subtraction, and comparison.

1. Formatting and Localization: Displaying dates and times in a human-readable format, adhering to cultural conventions.

1. Integration with External Systems: Dates and times often need to be exchanged and communicated with external systems, such as databases, APIs, and other software applications.
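
To make the leap-year point concrete, here is a short sketch using Python's built-in `datetime` module: adding one day to February 28 gives different results in a leap year and in a regular year.

```python
from datetime import date, timedelta

one_day = timedelta(days=1)

# 2020 is a leap year, 2021 is not
print(date(2020, 2, 28) + one_day)   # 2020-02-29
print(date(2021, 2, 28) + one_day)   # 2021-03-01

# Number of days in each year
print((date(2021, 1, 1) - date(2020, 1, 1)).days)   # 366
print((date(2022, 1, 1) - date(2021, 1, 1)).days)   # 365
```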

## Wrangling Dates and Times

Let us look at this example:

In [49]:
# Here are two dates
dts = ['5/17/2020', '12/2/2022']

## Wrangling Dates and Times

Let's get started by loading a package that deals with dates and times:

In [50]:
# Dates and times handlers
from datetime import date
from datetime import timedelta
from datetime import datetime as dtm
from datetime import timezone as tmz

## Wrangling Dates and Times

- To create a date object:
    + The order: `date(YEAR, MONTH, DAY)`

In [51]:
dts2 = [date(2020, 5, 17), date(2022, 12, 2)]
print(dts2[0], end = '\n\n') # ISO formatted date!
print(dts2[0].year, end = '\n\n')
print(dts2[0].month, end = '\n\n')
print(dts2[0].day, end = '\n\n')

2020-05-17

2020

5

17



## Wrangling Dates and Times

Weekdays, as returned by `.weekday()`:

- 0 = Monday
- 1 = Tuesday
- 2 = Wednesday
- 3 = Thursday
- 4 = Friday
- 5 = Saturday
- 6 = Sunday

In [52]:
dts2[0].weekday()

6

## Wrangling Dates and Times

**Exercise:** Find the weekday of the following date:

In [53]:
mydate = date(2020, 1, 16)
## Your code here

## Wrangling Dates and Times

### Math

- Ordering dates stored as text is not trivial for a computer. Strings are compared character by character, so the later date below is returned as the minimum:

In [54]:
print(dts)
min(dts)

['5/17/2020', '12/2/2022']


'12/2/2022'

In [55]:
# Difference between two dates?
# Run this and see: '12/2/2022' - '5/17/2020'
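
A sketch of one way out, assuming the strings are in month/day/year order: parse them into date objects first, and only then compare or subtract.

```python
from datetime import datetime

dts = ['5/17/2020', '12/2/2022']

# As text, '1' sorts before '5', so the later date looks "smaller"
print(min(dts))

# Parse month/day/year text into date objects, then compare
parsed = [datetime.strptime(s, '%m/%d/%Y').date() for s in dts]
print(min(parsed))
print((parsed[1] - parsed[0]).days)   # difference in days
```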

## Wrangling Dates and Times

### Math

With date objects, things get less complicated:

In [56]:
print(dts2)
min(dts2)

[datetime.date(2020, 5, 17), datetime.date(2022, 12, 2)]


datetime.date(2020, 5, 17)

In [57]:
# Differences create something called a timedelta object:
dts2[1] - dts2[0]

datetime.timedelta(days=929)

In [58]:
delta = dts2[1] - dts2[0]
delta.days

929

## Wrangling Dates and Times

**Exercise:** Find the difference in days of the following dates

In [59]:
mydate1 = date(2020, 1, 16)
mydate2 = date(2021, 3, 26)
## Your code here

## Wrangling Dates and Times

### Math

What day will it be exactly 30 days from now?

We can use `timedelta` to create the difference and add it to today's date.

In [60]:
# Today's date, and the date 30 days from today
print(date.today())
print(date.today() + timedelta(days = 30))

2023-07-07
2023-08-06


## Wrangling Dates and Times

### Turning dates back to strings

- ISO dates: 
    + Format: 'YYYY-MM-DD'
    + Great for computation, since they always have the same length.
    + Months and days are zero-padded (e.g., `05` for May).
    + Even as strings, they sort in chronological order!

In [61]:
print(date.today())

2023-07-07


Express the date in ISO format:

In [62]:
date.today().isoformat() # Creates a string!

'2023-07-07'
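
A quick sketch of the sorting claim, with made-up dates: the same three dates are sorted as plain strings, once in US style and once in ISO style. Only the ISO version comes out in chronological order.

```python
from datetime import date

us_style = ['5/17/2020', '12/2/2022', '1/3/2021']
iso_style = ['2020-05-17', '2022-12-02', '2021-01-03']

print(sorted(us_style))    # text order, not chronological
print(sorted(iso_style))   # text order matches chronological order

# An ISO string converts straight back into a date object
print(date.fromisoformat('2020-05-17'))
```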

## Wrangling Dates and Times

### Turning dates back to strings

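- Custom dates: We will use the `.strftime()` method with format codes:
    + '%Y': Four-digit year
    + '%m': Month as a zero-padded number
    + '%d': Day of the month as a zero-padded number
    + '%B': Full month name
    + '%j': Day of the year
    + '%D': US short format (`mm/dd/yy`); availability depends on the platform
    + And others [here](https://strftime.org)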

In [63]:
print(date.today().strftime('%Y'), end = '\n\n')
print(date.today().strftime('The year is %Y'), end = '\n\n')
print(date.today().strftime('%B (%Y)'), end = '\n\n')
print(date.today().strftime('%Y/%m/%d'), end = '\n\n')
print(date.today().strftime('%Y-%j'), end = '\n\n')
print(date.today().strftime('%D'), end = '\n\n')

2023

The year is 2023

July (2023)

2023/07/07

2023-188

07/07/23



## Wrangling Dates and Times

**Exercise:** Print each of these dates in four different formats based on the formats in [here](https://strftime.org).

In [64]:
mydate1 = date(2020, 1, 16)
mydate2 = date(2021, 3, 26)
## Your code here

## Wrangling Dates and Times

### Times and dates

Sometimes we also have to work with times, e.g., to detect credit card fraud.

For instance:

> May 12, 2021, 3:23:25 PM

Means:

In [65]:
# Computers work with 24-h settings
dt1 = dtm(2021, 5, 12, 15, 23, 25)
dt1

datetime.datetime(2021, 5, 12, 15, 23, 25)

## Wrangling Dates and Times

### Times and dates

We can display the dates nicely:

In [66]:
print(dt1.isoformat(), end = '\n\n')
print(dt1.strftime('%Y-%m-%d'), end = '\n\n')
print(dt1.strftime('%Y-%m-%d %H:%M:%S'), end = '\n\n')
print(dt1.strftime('The thing happened in-on-at (no idea...) %H:%M:%S in-on-at %Y-%m-%d'), end = '\n\n')

2021-05-12T15:23:25

2021-05-12

2021-05-12 15:23:25

The thing happened in-on-at (no idea...) 15:23:25 in-on-at 2021-05-12



## Wrangling Dates and Times

### Times and dates

And if we have data in the same format, such as:

In [67]:
mydt = dt1.strftime('%m/%d/%Y %H:%M:%S')
print(mydt)

05/12/2021 15:23:25


We can easily parse it:

In [68]:
dtm.strptime(mydt, '%m/%d/%Y %H:%M:%S')

datetime.datetime(2021, 5, 12, 15, 23, 25)
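
The same method handles month names and 12-hour clocks. A minimal sketch, assuming an English locale (month names and AM/PM markers are locale-dependent):

```python
from datetime import datetime

text = 'May 12, 2021, 3:23:25 PM'

# %B = full month name, %I = hour on a 12-hour clock, %p = AM/PM
# (month names and AM/PM depend on the locale; English is assumed here)
parsed = datetime.strptime(text, '%B %d, %Y, %I:%M:%S %p')
print(parsed)
print(parsed.hour)   # hour on the 24-hour clock
```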

## Wrangling Dates and Times

### Timedeltas and duration

As we saw before, we can create time differences (or timedeltas) by subtracting one date from another:

In [69]:
begin = dtm(2021, 5, 5, 23, 20, 2)
end = dtm(2021, 7, 5, 8, 15, 22)
delta = end - begin

## Wrangling Dates and Times

### Timedeltas and duration

In [70]:
delta.total_seconds()

5216120.0
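
Besides `.total_seconds()`, a timedelta stores whole days and the leftover seconds separately. A short sketch that turns the same difference into a readable duration:

```python
from datetime import datetime

begin = datetime(2021, 5, 5, 23, 20, 2)
end = datetime(2021, 7, 5, 8, 15, 22)
delta = end - begin

# A timedelta stores whole days plus the leftover seconds
print(delta.days, delta.seconds)

# Split the leftover seconds into hours, minutes, and seconds
hours, rest = divmod(delta.seconds, 3600)
minutes, seconds = divmod(rest, 60)
print(f'{delta.days} days, {hours}h {minutes}m {seconds}s')
```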

## Wrangling Dates and Times

**Exercise:** Read the following date and time correctly: "Jan 16, 2021 at 3:30 AM"

In [71]:
# Your answers here

## Wrangling Dates and Times

### Timezone

In [72]:
# Timezone PDT
PDT = tmz(timedelta(hours = -7))

# Date and time in a specific time zone
dtPDT = dtm(2021, 5, 12, 15, 23, 25, tzinfo = PDT)
print(dtPDT)

2021-05-12 15:23:25-07:00


In [73]:
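# Or we can convert to another time zone
# (UTC-5 is a fixed offset: Eastern Standard Time, not daylight time)
ET = tmz(timedelta(hours = -5))

# Before (dt1 is naive: it has no time zone attached)
print(dt1)

# After (astimezone treats a naive datetime as local system time,
# so this output depends on the time zone of the machine running it)
dtET = dt1.astimezone(ET)
print(dtET)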

2021-05-12 15:23:25
2021-05-12 17:23:25-05:00
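
To get the same answer on every machine, attach a time zone to the datetime before converting. A sketch using fixed offsets only (fixed offsets do not follow daylight saving changes; the names `aware` and `EST` are illustrative):

```python
from datetime import datetime, timedelta, timezone

PDT = timezone(timedelta(hours=-7))   # fixed offset, UTC-7
EST = timezone(timedelta(hours=-5))   # fixed offset, UTC-5

# Attach the time zone first, so the starting point is unambiguous
aware = datetime(2021, 5, 12, 15, 23, 25, tzinfo=PDT)

print(aware.astimezone(EST))            # same instant, shown in UTC-5
print(aware.astimezone(timezone.utc))   # same instant, shown in UTC
print(aware == aware.astimezone(EST))   # both represent the same moment
```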


## Wrangling Dates and Times

**Exercise:** Change `dt1` to India time zone (UTC+5:30).

In [74]:
# Your answers here

## Wrangling Dates and Times

### Pandas

In [75]:
# Datasets
dat2 = pd.read_csv('https://raw.githubusercontent.com/umbertomig/elements_css/main/lakers.csv')
dat3 = pd.read_csv('https://raw.githubusercontent.com/umbertomig/elements_css/main/nyc13flight.csv')

## Wrangling Dates and Times

**Exercise**: Explore these datasets.

In [76]:
# Your code here

## Wrangling Dates and Times

### Pandas

To parse dates and times in pandas, we use the `pd.to_datetime` function:

In [77]:
# More on that later...
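
As a small preview, here is a sketch on a made-up column of month/day/year strings (not the `lakers` or `nyc13flight` data). Stating the `format` explicitly avoids any guessing about which number is the day and which is the month:

```python
import pandas as pd

# Made-up column of month/day/year strings
raw = pd.Series(['5/17/2020', '12/2/2022', '1/3/2021'])

# Stating the format avoids day/month guessing
when = pd.to_datetime(raw, format='%m/%d/%Y')

print(when.dtype)                       # a datetime dtype
print(when.min())                       # earliest date
print(when.dt.year.tolist())            # .dt exposes the date parts
print((when.max() - when.min()).days)   # span of the data in days
```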

# Great work!