

# 3.3 Working with Dates and Times using `pandas` and `pytz`




Date and time data are often messy and inconsistent: the same day may appear as "2025-04-10", "10/04/2025", or "April 10, 2025". In R, `lubridate` simplifies this process. In Python, **`pandas`** is the go-to package for cleaning, parsing, extracting, and manipulating datetime data.

This tutorial will show you how to:
- Parse inconsistent date strings
- Extract components (year, month, day, hour, etc.)
- Filter, group, and manipulate datetime data
- Handle time zones
- Construct new datetime values from parts

We'll use:
- `pandas` — for parsing, extracting, and manipulating dates/times
- `pytz` — for advanced timezone handling (optional but recommended)

### Prerequisites

Install the required packages:

In [None]:
import importlib.util
import subprocess
import sys

# List of required packages
packages = ['pandas', 'pytz']

# Check and install missing packages
for package in packages:
    if importlib.util.find_spec(package) is None:
        try:
            # Run pip through the current interpreter; pip.main() is not a supported API
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
        except subprocess.CalledProcessError:
            print(f"Failed to install {package}.")

# Import packages (numpy is installed as a pandas dependency)
import pandas as pd
import pytz
import numpy as np
from datetime import datetime

In [2]:
# Verify package availability
for package in packages:
    print(f"{package} installed: {bool(importlib.util.find_spec(package))}")

pandas installed: True
pytz installed: True


In [3]:
# Optional: Set display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

### Simulate the Dataset

Let's recreate the messy dataset from the R example:

In [4]:
np.random.seed(123)

df = pd.DataFrame({
    'id': range(1, 11),
    'name': np.random.choice(['Alice', 'Bob', 'Carol'], size=10),
    'raw_date': np.random.choice([
        '2025-04-10',
        '10/04/2025',
        'April 10, 2025'
    ], size=10),
    'timestamp': pd.to_datetime(np.random.choice(
        pd.date_range(start='2025-04-10 08:00', end='2025-04-10 18:00', freq='h'),
        size=10
    ))
})

print(df)

   id   name        raw_date           timestamp
0   1  Carol  April 10, 2025 2025-04-10 11:00:00
1   2    Bob      10/04/2025 2025-04-10 12:00:00
2   3  Carol      2025-04-10 2025-04-10 08:00:00
3   4  Carol      10/04/2025 2025-04-10 08:00:00
4   5  Alice  April 10, 2025 2025-04-10 12:00:00
5   6  Carol      10/04/2025 2025-04-10 09:00:00
6   7  Carol      2025-04-10 2025-04-10 15:00:00
7   8    Bob  April 10, 2025 2025-04-10 11:00:00
8   9  Carol      2025-04-10 2025-04-10 10:00:00
9  10    Bob      10/04/2025 2025-04-10 12:00:00



### Parse Inconsistent Date Formats

In R, `parse_date_time()` handles multiple formats when given a set of candidate orders. In pandas, `pd.to_datetime()` plays the same role: with `format='mixed'` (pandas 2.0 and later) it infers the format of each element individually.

In [8]:
# Parse the messy 'raw_date' column
df['parsed_date'] = pd.to_datetime(df['raw_date'], format='mixed')

print(df[['raw_date', 'parsed_date']])

         raw_date parsed_date
0  April 10, 2025  2025-04-10
1      10/04/2025  2025-10-04
2      2025-04-10  2025-04-10
3      10/04/2025  2025-10-04
4  April 10, 2025  2025-04-10
5      10/04/2025  2025-10-04
6      2025-04-10  2025-04-10
7  April 10, 2025  2025-04-10
8      2025-04-10  2025-04-10
9      10/04/2025  2025-10-04


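Pandas parses all three styles: "2025-04-10" (ISO), "10/04/2025", and "April 10, 2025" (month name format). Note, however, that the ambiguous "10/04/2025" is read month-first (MM/DD/YYYY) by default, so it becomes October 4, 2025 rather than April 10. If such strings are day-first, say so explicitly with `dayfirst=True` or an explicit `format=`.

> Tip: If automatic detection fails or is ambiguous, use the `format=` parameter:  
> `pd.to_datetime(date_str, format='%B %d, %Y')` for explicit formatting.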

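When the possible formats are known in advance, one alternative to format inference is to try a short list of explicit formats in turn. The sketch below assumes that slash-separated dates in this dataset are day-first (DD/MM/YYYY); the helper name `parse_with_formats` and the `formats` list are illustrative and not part of pandas.

```python
import pandas as pd

def parse_with_formats(values, formats):
    """Try each explicit format in order; keep the first successful parse per row."""
    parsed = None
    for fmt in formats:
        # errors='coerce' turns rows that do not match this format into NaT
        attempt = pd.to_datetime(values, format=fmt, errors='coerce')
        parsed = attempt if parsed is None else parsed.fillna(attempt)
    return parsed

raw = pd.Series(['2025-04-10', '10/04/2025', 'April 10, 2025'])

# Assumption: slash-separated dates are DD/MM/YYYY
formats = ['%Y-%m-%d', '%d/%m/%Y', '%B %d, %Y']

parsed = parse_with_formats(raw, formats)
print(parsed)
```

Under that assumption every row resolves to April 10, 2025. A row that matches none of the formats stays `NaT`, which makes unparsed values easy to spot.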
### Extract Date Components

Just like `year()`, `month()`, `day()`, etc., in `lubridate`, pandas provides the `.dt` accessor:

In [9]:
df['year'] = df['parsed_date'].dt.year
df['month'] = df['parsed_date'].dt.month
df['month_name'] = df['parsed_date'].dt.month_name()  # Full name: "April"
df['day'] = df['parsed_date'].dt.day
df['weekday'] = df['parsed_date'].dt.day_name()       # Full weekday: "Thursday"
df['hour'] = df['timestamp'].dt.hour
df['minute'] = df['timestamp'].dt.minute

print(df[['parsed_date', 'year', 'month', 'month_name', 'day', 'weekday', 'hour', 'minute']])

  parsed_date  year  month month_name  day   weekday  hour  minute
0  2025-04-10  2025      4      April   10  Thursday    11       0
1  2025-10-04  2025     10    October    4  Saturday    12       0
2  2025-04-10  2025      4      April   10  Thursday     8       0
3  2025-10-04  2025     10    October    4  Saturday     8       0
4  2025-04-10  2025      4      April   10  Thursday    12       0
5  2025-10-04  2025     10    October    4  Saturday     9       0
6  2025-04-10  2025      4      April   10  Thursday    15       0
7  2025-04-10  2025      4      April   10  Thursday    11       0
8  2025-04-10  2025      4      April   10  Thursday    10       0
9  2025-10-04  2025     10    October    4  Saturday    12       0


`month_name()` and `day_name()` return full, readable labels, much like `label = TRUE, abbr = FALSE` in `lubridate`.

### Reformat Dates (Like `format()` in R)

Convert the parsed date to a custom string format:

In [10]:
df['date_reformatted'] = df['parsed_date'].dt.strftime('%d-%b-%Y')

print(df[['parsed_date', 'date_reformatted']])

  parsed_date date_reformatted
0  2025-04-10      10-Apr-2025
1  2025-10-04      04-Oct-2025
2  2025-04-10      10-Apr-2025
3  2025-10-04      04-Oct-2025
4  2025-04-10      10-Apr-2025
5  2025-10-04      04-Oct-2025
6  2025-04-10      10-Apr-2025
7  2025-04-10      10-Apr-2025
8  2025-04-10      10-Apr-2025
9  2025-10-04      04-Oct-2025


> `%d` = day, `%b` = abbreviated month, `%Y` = 4-digit year.

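A few more format codes may help when choosing an output style. This is a small sketch on a made-up timestamp; the month and weekday names assume the default English locale.

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2025-04-10 15:30']))

# Day, abbreviated month, four-digit year
short = ts.dt.strftime('%d-%b-%Y')
# Full weekday and month names
long_form = ts.dt.strftime('%A, %B %d, %Y')
# ISO-like date and time down to the minute
iso_minutes = ts.dt.strftime('%Y-%m-%dT%H:%M')

print(short[0])        # 10-Apr-2025
print(long_form[0])    # Thursday, April 10, 2025
print(iso_minutes[0])  # 2025-04-10T15:30
```

Note that `strftime` returns strings, so the result no longer supports `.dt` operations; keep the original datetime column for further calculations.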
### Filter Data by Time

Filter rows where the timestamp is after 12 PM (noon). Because every timestamp here falls exactly on the hour, comparing the extracted `hour` is enough:

In [11]:
df_after_noon = df[df['hour'] > 12]

print(df_after_noon[['name', 'timestamp', 'hour']])

    name           timestamp  hour
6  Carol 2025-04-10 15:00:00    15

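If timestamps can include minutes, comparing only the hour would miss values such as 12:30. One possible approach, sketched here on a small made-up frame (`events` is illustrative), is to compare the time elapsed since midnight:

```python
import pandas as pd

events = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2025-04-10 11:00', '2025-04-10 12:00',
        '2025-04-10 12:30', '2025-04-10 15:00',
    ])
})

# Time elapsed since midnight of the same day, as a Timedelta
time_of_day = events['timestamp'] - events['timestamp'].dt.normalize()

# Strictly after 12:00, including 12:30
after_noon = events[time_of_day > pd.Timedelta(hours=12)]

# Comparing the hour alone keeps only hours 13 and later
by_hour = events[events['timestamp'].dt.hour > 12]

print(after_noon)
print(by_hour)
```

The first filter keeps the 12:30 and 15:00 rows, while the hour-based filter keeps only 15:00.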

### Group by Weekday

Group and count entries by weekday:

In [12]:
grouped_by_weekday = df.groupby('weekday').size().reset_index(name='entries')

print(grouped_by_weekday)

    weekday  entries
0  Saturday        4
1  Thursday        6

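Grouping on weekday names sorts them alphabetically (Saturday before Thursday). If calendar order matters, one option is an ordered categorical, sketched below on a small made-up frame; `day_order` and `sample` are illustrative names.

```python
import pandas as pd

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

sample = pd.DataFrame({
    'date': pd.to_datetime(['2025-04-10', '2025-10-04', '2025-04-10'])
})

# An ordered categorical makes groupby sort by calendar position
sample['weekday'] = pd.Categorical(
    sample['date'].dt.day_name(), categories=day_order, ordered=True
)

# observed=True drops weekdays that never occur in the data
counts = (
    sample.groupby('weekday', observed=True)
    .size()
    .reset_index(name='entries')
)
print(counts)
```

Here Thursday is listed before Saturday, following `day_order` rather than the alphabet.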

### Construct New Datetime from Parts

In R, you used `make_datetime(year, month, day, hour, min)`. In pandas, use `pd.Timestamp` or `pd.to_datetime()` with a dictionary:

In [13]:
df['full_datetime'] = pd.to_datetime(
    dict(
        year=df['year'],
        month=df['month'],
        day=df['day'],
        hour=df['hour'],
        minute=df['minute']
    )
)

print(df[['parsed_date', 'timestamp', 'full_datetime']].head())

  parsed_date           timestamp       full_datetime
0  2025-04-10 2025-04-10 11:00:00 2025-04-10 11:00:00
1  2025-10-04 2025-04-10 12:00:00 2025-10-04 12:00:00
2  2025-04-10 2025-04-10 08:00:00 2025-04-10 08:00:00
3  2025-10-04 2025-04-10 08:00:00 2025-10-04 08:00:00
4  2025-04-10 2025-04-10 12:00:00 2025-04-10 12:00:00


You now have a clean, constructed datetime combining date and time parts!

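For completeness, here is a minimal sketch of two related options: building a single value with `pd.Timestamp`, and adding time offsets to an existing date column with `pd.to_timedelta()`. The sample values are made up for illustration.

```python
import pandas as pd

# A single datetime built from its parts
single = pd.Timestamp(year=2025, month=4, day=10, hour=11, minute=0)

# Vectorized alternative: date column plus hour and minute offsets
dates = pd.Series(pd.to_datetime(['2025-04-10', '2025-10-04']))
hours = pd.Series([11, 12])
minutes = pd.Series([0, 30])

combined = (
    dates
    + pd.to_timedelta(hours, unit='h')
    + pd.to_timedelta(minutes, unit='min')
)

print(single)
print(combined)
```

The timedelta route can be convenient when the date is already a datetime column and only the time-of-day parts are stored separately.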
### Handle Time Zones

In R, you used `with_tz()` and `force_tz()`. In pandas, the counterparts are `tz_convert()` and `tz_localize()`, which accept IANA timezone names such as those shipped with `pytz`.

#### Step 1: Assign a timezone (e.g., America/New_York)

In [14]:
# Assume original timestamps are in US Eastern Time
df['full_datetime_est'] = df['full_datetime'].dt.tz_localize('America/New_York')

#### Step 2: Convert to UTC and Tokyo

In [15]:
# Convert to UTC
df['full_datetime_utc'] = df['full_datetime_est'].dt.tz_convert('UTC')

# Convert to Tokyo
df['full_datetime_tokyo'] = df['full_datetime_est'].dt.tz_convert('Asia/Tokyo')

print(df[['full_datetime', 'full_datetime_est', 'full_datetime_utc', 'full_datetime_tokyo']].head())

        full_datetime         full_datetime_est         full_datetime_utc  \
0 2025-04-10 11:00:00 2025-04-10 11:00:00-04:00 2025-04-10 15:00:00+00:00   
1 2025-10-04 12:00:00 2025-10-04 12:00:00-04:00 2025-10-04 16:00:00+00:00   
2 2025-04-10 08:00:00 2025-04-10 08:00:00-04:00 2025-04-10 12:00:00+00:00   
3 2025-10-04 08:00:00 2025-10-04 08:00:00-04:00 2025-10-04 12:00:00+00:00   
4 2025-04-10 12:00:00 2025-04-10 12:00:00-04:00 2025-04-10 16:00:00+00:00   

        full_datetime_tokyo  
0 2025-04-11 00:00:00+09:00  
1 2025-10-05 01:00:00+09:00  
2 2025-04-10 21:00:00+09:00  
3 2025-10-04 21:00:00+09:00  
4 2025-04-11 01:00:00+09:00  


Notice: Tokyo (UTC+9) is 13 hours ahead of New York while daylight saving time (EDT, UTC-4) is in effect.  
`tz_localize()` attaches a timezone to naive timestamps and leaves the clock time unchanged.  
`tz_convert()` changes the timezone without changing the absolute moment in time.

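The difference between the two methods can be checked on a single made-up value. This is a minimal sketch, not tied to the dataset above:

```python
import pandas as pd

naive = pd.Series(pd.to_datetime(['2025-04-10 11:00']))

# Attach a timezone: the clock time stays 11:00
new_york = naive.dt.tz_localize('America/New_York')

# Convert: the clock time changes, the instant does not
utc = new_york.dt.tz_convert('UTC')
tokyo = new_york.dt.tz_convert('Asia/Tokyo')

print(new_york[0])  # 2025-04-10 11:00:00-04:00
print(utc[0])       # 2025-04-10 15:00:00+00:00
print(tokyo[0])     # 2025-04-11 00:00:00+09:00

# All three represent the same moment in time
print(new_york[0] == utc[0] == tokyo[0])  # True
```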
## Summary and Key Takeaways

| R (`lubridate`) | Python (`pandas`) |
|------------------|-------------------|
| `ymd()`, `ymd_hms()`, `dmy()`, `dmy_hms()`, `mdy()`, ... | `pd.to_datetime()` with `format=`, `dayfirst=`, or `yearfirst=` |
| `parse_date_time(..., orders=...)` | `pd.to_datetime(..., format='mixed')` (infers each element's format) |
| `year()`, `month()`, `day()` | `series.dt.year`, `series.dt.month`, `series.dt.day` |
| `month(..., label=TRUE)` | `series.dt.month_name()` |
| `wday(..., label=TRUE)` | `series.dt.day_name()` |
| `hour()`, `minute()` | `series.dt.hour`, `series.dt.minute` |
| `format(..., "%d-%b-%Y")` | `series.dt.strftime("%d-%b-%Y")` |
| `make_datetime()` | `pd.to_datetime(dict(year=..., month=..., ...))` |
| `with_tz(x, "UTC")` | `x.dt.tz_convert("UTC")` |
| `force_tz(x, "UTC")` | `x.dt.tz_localize("UTC")` for naive `x`; `x.dt.tz_localize(None).dt.tz_localize("UTC")` if `x` is already timezone-aware |

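The `force_tz()` row deserves a quick illustration, since relabeling a timezone is easy to confuse with converting one. A minimal sketch on a made-up value:

```python
import pandas as pd

aware = (
    pd.Series(pd.to_datetime(['2025-04-10 11:00']))
    .dt.tz_localize('America/New_York')
)

# force_tz-style: drop the timezone, then relabel the same clock time as UTC
forced = aware.dt.tz_localize(None).dt.tz_localize('UTC')

# with_tz-style: same instant, expressed in UTC
converted = aware.dt.tz_convert('UTC')

print(forced[0])     # 2025-04-10 11:00:00+00:00
print(converted[0])  # 2025-04-10 15:00:00+00:00
```

Calling `tz_convert()` on a naive series raises a `TypeError`, so the relabeling path has to go through `tz_localize()`.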
## Final Thoughts

While `lubridate` revolutionized date-time handling in R, **`pandas` brings even greater power and flexibility to Python**. It:
- Auto-parses many common formats out of the box (ambiguous day/month order still needs `dayfirst=` or an explicit `format=`)
- Has a consistent `.dt` accessor for all datetime operations
- Integrates seamlessly with `DataFrame` operations
- Handles timezones reliably with `pytz`

You don't need a separate "lubridate for Python" — **`pandas` is it**.

## Resources

- [pandas Timestamp Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html)
- [pandas Datetime Properties](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components)
- [pytz Documentation](https://pythonhosted.org/pytz/)
- [Python Datetime Guide (Real Python)](https://realpython.com/python-datetime/)
