# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#4: Advanced Data Manipulation`**

10. **Merging and Concatenating DataFrames**
    
    - Combining DataFrames
    - Concatenation and merging operations
11. **Reshaping Data**
    
    - Pivoting and melting
    - Stacking and unstacking
12. **Time Series Data**
    
    - Handling time and date data
    - Resampling and frequency conversion

### **`12. Time Series Data`**

### **`Handling Time and Date Data in Pandas`**

#### Importance of Handling Time and Date Data:

Handling time and date data is crucial in data analysis for various reasons:

1. **Temporal Analysis:**
   - Time-based insights, trends, and patterns are essential for understanding data.

2. **Time Series Analysis:**
   - Analyzing data collected over time for forecasting and trend identification.

3. **Data Alignment:**
   - Aligning datasets based on time indices for effective merging and analysis.

4. **Event Sequencing:**
   - Understanding the chronological order of events for context-aware analysis.


#### `DatetimeIndex` in Pandas:

Pandas provides the `DatetimeIndex`, a powerful tool for working with time and date data.

In [3]:
import pandas as pd

# Creating a DatetimeIndex
date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')

# Creating a DataFrame whose 'date' column holds the generated dates
df = pd.DataFrame(date_rng, columns=['date'])  # same syntax as creating a DataFrame from an array

# Displaying the DataFrame
print("DataFrame with DatetimeIndex:")
print(df)

# Result : The 'date' column holds the dates from '2022-01-01' to '2022-01-10'; the DataFrame's own index is the default integer index (0 to 9).


DataFrame with DatetimeIndex:
        date
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
4 2022-01-05
5 2022-01-06
6 2022-01-07
7 2022-01-08
8 2022-01-09
9 2022-01-10


***Explanation***:

The above code uses the pandas library to generate a `DatetimeIndex` and then builds a DataFrame whose 'date' column holds those dates. Let's break down the code step by step:

1. **DatetimeIndex Creation:**
   ```python
   date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
   ```
   - `pd.date_range`: This function generates a fixed-frequency DatetimeIndex.
   - `start='2022-01-01'`: The starting date of the range.
   - `end='2022-01-10'`: The ending date of the range.
   - `freq='D'`: The frequency of the date range, in this case, 'D' stands for daily.

   The result (`date_rng`) is a DatetimeIndex containing dates from '2022-01-01' to '2022-01-10', with a daily frequency.

2. **Creating a DataFrame with DatetimeIndex:**
   ```python
   df = pd.DataFrame(date_rng, columns=['date'])
   ```
   - `pd.DataFrame`: This function creates a DataFrame.
   - `date_rng`: The DatetimeIndex generated earlier is used as the data for the 'date' column in the DataFrame.
   - `columns=['date']`: Specifies the name of the column in the DataFrame. Here, the column is named 'date'.

   The resulting DataFrame (`df`) has a single column named 'date' with one row per date in the specified range. Note that the dates are stored as column values: the index of the DataFrame is still the default integer index (0 to 9), as the printed output shows. To make the dates the index, call `df.set_index('date')`.


3. **date_range()** :
The `pd.date_range()` function in pandas generates a fixed-frequency `DatetimeIndex`. It is particularly useful when you work with time series data and need to create a regular sequence of dates. Let's break down the parameters of `pd.date_range()` (signature as of pandas 2.x):

```python
pd.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', *, unit=None, **kwargs)
```

- **`start`**: The start date of the sequence.
- **`end`**: The end date of the sequence.
- **`periods`**: The total number of periods (dates) to generate.
- **`freq`**: The frequency of the data. This can be a string representing a frequency alias (e.g., 'D' for day, 'H' for hour, spelled 'h' from pandas 2.2 onward) or an offset object.
- **`tz`**: Time zone for the resulting DatetimeIndex.
- **`normalize`**: If True, normalize the start and end dates to midnight before generating the range.
- **`name`**: Name to be stored in the resulting DatetimeIndex.
- **`inclusive`**: Which endpoints to include: 'both' (the default), 'neither', 'left', or 'right'. This parameter replaces the older `closed` parameter, which was removed in pandas 2.0.
- **`unit`**: Desired resolution of the result, such as 's' or 'ns' (available from pandas 2.0).
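
The parameters above can be combined in several ways; of `start`, `end`, `periods`, and `freq`, exactly three need to be given. The following is a minimal sketch of a few common combinations. The variable names are illustrative only and are not used elsewhere in this module.

```python
import pandas as pd

# Ten daily dates defined by start + periods instead of start + end
by_periods = pd.date_range(start="2022-01-01", periods=10, freq="D")

# Every Monday in January 2022 (anchored weekly frequency)
mondays = pd.date_range(start="2022-01-01", end="2022-01-31", freq="W-MON")

# Business days only (Monday to Friday)
business_days = pd.date_range(start="2022-01-01", end="2022-01-10", freq="B")

# A named, time-zone-aware index
utc_days = pd.date_range(start="2022-01-01", periods=3, freq="D",
                         tz="UTC", name="timestamp")

print(by_periods[-1])          # 2022-01-10 00:00:00
print(list(mondays.day))       # [3, 10, 17, 24, 31]
print(len(business_days))      # 6
print(utc_days.tz, utc_days.name)
```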



#### Manipulating Time Series Data:

In [5]:
# Adding a new column with random values
import numpy as np

df['value'] = np.random.randint(0, 100, size=(len(date_rng)))

# Displaying the updated DataFrame
print("\nDataFrame with Random Values:")
print(df)

# Result : The DataFrame now includes a 'value' column with random integer values.


DataFrame with Random Values:
        date  value
0 2022-01-01     34
1 2022-01-02     65
2 2022-01-03     75
3 2022-01-04     92
4 2022-01-05     52
5 2022-01-06     16
6 2022-01-07     58
7 2022-01-08     44
8 2022-01-09     31
9 2022-01-10     82
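
The DataFrame built so far keeps the dates in an ordinary column, with a default integer index. Many time series features, such as slicing by date strings, become available once the dates are the index. The sketch below assumes fixed values (10 to 100) instead of random ones so that the result is reproducible; `ts` and `window` are illustrative names.

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start="2022-01-01", end="2022-01-10", freq="D")
df = pd.DataFrame({"date": date_rng, "value": np.arange(10, 110, 10)})

# As built, the frame has a default integer index
print(type(df.index).__name__)      # RangeIndex

# Promote the 'date' column to the index to get a DatetimeIndex
ts = df.set_index("date")
print(type(ts.index).__name__)      # DatetimeIndex

# Label-based slicing with date strings; both endpoints are included
window = ts.loc["2022-01-03":"2022-01-05"]
print(window)
```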


#### Time Series Operations:

In [6]:
# Extracting components of the date
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()

# Displaying the DataFrame with extracted components
print("\nDataFrame with Date Components:")
print(df)

# Result : Additional columns are added for the year, month, day, and weekday of each date.


DataFrame with Date Components:
        date  value  year  month  day    weekday
0 2022-01-01     34  2022      1    1   Saturday
1 2022-01-02     65  2022      1    2     Sunday
2 2022-01-03     75  2022      1    3     Monday
3 2022-01-04     92  2022      1    4    Tuesday
4 2022-01-05     52  2022      1    5  Wednesday
5 2022-01-06     16  2022      1    6   Thursday
6 2022-01-07     58  2022      1    7     Friday
7 2022-01-08     44  2022      1    8   Saturday
8 2022-01-09     31  2022      1    9     Sunday
9 2022-01-10     82  2022      1   10     Monday


***Explanation:***

In the above code, `dt` is the accessor used to access the datetime properties of a Series in pandas. This is commonly used when you have a column with datetime values in a DataFrame. Let's break down the code:

1. **Extracting Year, Month, and Day:**
   ```python
   df['year'] = df['date'].dt.year
   df['month'] = df['date'].dt.month
   df['day'] = df['date'].dt.day
   ```
   - `df['date'].dt.year`: Accesses the year component of the 'date' column and creates a new column named 'year' in the DataFrame.
   - `df['date'].dt.month`: Accesses the month component of the 'date' column and creates a new column named 'month' in the DataFrame.
   - `df['date'].dt.day`: Accesses the day component of the 'date' column and creates a new column named 'day' in the DataFrame.

2. **Extracting Weekday:**
   ```python
   df['weekday'] = df['date'].dt.day_name()
   ```
   - `df['date'].dt.day_name()`: Accesses the day name (e.g., Monday, Tuesday) of each date in the 'date' column and creates a new column named 'weekday' in the DataFrame.

The `dt` accessor is used to make these datetime-related operations concise and easy. It's important to note that the 'date' column must contain datetime values for these operations to work. If 'date' is not a datetime column, you would need to convert it to datetime using `pd.to_datetime` before using the `dt` accessor.

For example, if 'date' is not already a datetime column, you can convert it as follows:

```python
df['date'] = pd.to_datetime(df['date'])
```

After this conversion, you can use the `dt` accessor as shown in the original code.
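
As a small illustration of the conversion step, the sketch below starts from dates stored as strings, one of which is deliberately malformed. The `raw` DataFrame is hypothetical sample data; `errors="coerce"` is one possible way to deal with bad entries, not the only one.

```python
import pandas as pd

# Hypothetical raw data: dates arrive as strings, one of them malformed
raw = pd.DataFrame({
    "date": ["2022-01-01", "2022-01-02", "not a date"],
    "value": [10, 20, 30],
})

# errors="coerce" turns unparseable entries into NaT instead of raising an error
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

print(pd.api.types.is_datetime64_any_dtype(raw["date"]))   # True
print(raw["date"].isna().sum())                            # 1

# The dt accessor now works; the NaT row yields a missing weekday
raw["weekday"] = raw["date"].dt.day_name()
print(raw)
```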

#### Time Resampling:

In [7]:
# Resampling the DataFrame to weekly frequency
# numeric_only=True skips the text column 'weekday', which cannot be summed meaningfully
weekly_df = df.resample('W-Mon', on='date').sum(numeric_only=True)

# Displaying the resampled DataFrame
print("\nResampled DataFrame (Weekly):")
print(weekly_df)

# Result : The DataFrame is resampled to a weekly frequency, aggregating values based on the sum.


Resampled DataFrame (Weekly):
            value   year  month  day
date                                
2022-01-03    174   6066      3    6
2022-01-10    375  14154      7   49


***Explanation***

The above code uses the `resample()` method in pandas to resample a DataFrame based on a specified frequency. Let's break down the code:

```python
# Resampling the DataFrame to weekly frequency
weekly_df = df.resample('W-Mon', on='date').sum(numeric_only=True)
```

- **`df.resample('W-Mon', on='date')`**: This part of the code groups the rows of `df` into weekly bins based on the 'date' column. `'W-Mon'` (canonically written `'W-MON'`) is an anchored weekly frequency: each weekly bin *ends* on a Monday and is labeled with that Monday's date. That is why the output rows are labeled 2022-01-03 and 2022-01-10 (both Mondays), and why the first row covers only January 1 to 3.

- **`.sum(numeric_only=True)`**: After resampling, the values in each weekly bin are summed. `numeric_only=True` restricts the sum to numeric columns, because newer pandas versions no longer drop text columns such as 'weekday' automatically. Only the 'value' sums are meaningful here; the 'year', 'month', and 'day' columns are summed as plain numbers too, which is why they show totals such as 6066.

- **`weekly_df`**: The result of the resampling and aggregation is stored in a new DataFrame named `weekly_df`.

```python
# Displaying the resampled DataFrame
print("\nResampled DataFrame (Weekly):")
print(weekly_df)
```

The above code then prints the resulting DataFrame `weekly_df`, which contains the aggregated values for each week.

**Explanation of `resample()`**:

The `resample()` method in pandas is used for time-based resampling of time series data. It allows you to change the frequency of your time series data, such as converting daily data to monthly data or weekly data. The syntax generally looks like:

```python
df.resample(rule, on=None, ...)
```

- **`rule`**: A string specifying the frequency at which to resample the data (e.g., 'D' for daily, 'W' for weekly, 'M' for month end, spelled 'ME' from pandas 2.2 onward). Weekly rules can include an anchor such as 'W-Mon', which sets the day on which each week ends.

- **`on`**: The name of the datetime-like column on which to perform the resampling.

After applying `resample()`, you often chain aggregation functions (like `sum()`, `mean()`, etc.) to perform some operation on the data within each resampled interval.

In the provided code, the DataFrame is resampled to a weekly frequency, and the values for each week are summed up. This is a common operation when dealing with time series data to get a summarized view at a different frequency.
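
To illustrate chaining different aggregations, here is a minimal sketch that assumes the dates are the index (so `on` is not needed) and uses fixed values from 10 to 100. The names `ts` and `weekly` are illustrative.

```python
import pandas as pd

date_rng = pd.date_range(start="2022-01-01", end="2022-01-10", freq="D")
ts = pd.DataFrame({"value": list(range(10, 110, 10))}, index=date_rng)

# With a DatetimeIndex, resample() works directly on the index
weekly = ts["value"].resample("W-MON").agg(["sum", "mean", "count"])
print(weekly)
#             sum  mean  count
# 2022-01-03   60  20.0      3
# 2022-01-10  490  70.0      7
```

The `count` column makes it visible that the first bin holds only three days, which is worth checking before comparing weekly totals.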

#### Considerations:

- **Date Components:**
  - Extracting date components facilitates detailed analysis and reporting.

- **Resampling Frequency:**
  - Choose the appropriate frequency when resampling data to suit analysis requirements.

#### Tips:

- **Time Zone Handling:**
  - Consider time zone information when working with data from different regions.

- **Periods and Durations:**
  - Explore Pandas' `Period` and `Timedelta` for handling periods and durations.

Handling time and date data with Pandas' `DatetimeIndex` enables effective analysis, visualization, and manipulation of time series data. Leveraging these functionalities enhances the ability to derive meaningful insights from datasets with temporal components.
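
The tips above mention time zones, `Period`, and `Timedelta`. The following is a brief sketch of each, under the assumption that the `America/New_York` zone is available in the local time zone database; all variable names are illustrative.

```python
import pandas as pd

# Time zones: attach a zone to naive timestamps, then convert
naive = pd.date_range(start="2022-01-01 09:00", periods=3, freq="D")
utc = naive.tz_localize("UTC")
new_york = utc.tz_convert("America/New_York")
print(new_york[0])            # 2022-01-01 04:00:00-05:00

# Timedelta: a duration that can be added to or subtracted from timestamps
delta = pd.Timestamp("2022-01-10") - pd.Timestamp("2022-01-01")
print(delta.days)             # 9
print(pd.Timestamp("2022-01-01") + pd.Timedelta(days=1, hours=12))

# Period: a span of time (here a whole month) rather than a single instant
january = pd.Period("2022-01", freq="M")
print(january.start_time, january.end_time)
print(january + 1)            # 2022-02
```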


### **`Resampling and Frequency Conversion in Pandas`**

#### Resampling and Frequency Conversion Concepts:

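Resampling changes the frequency of time series data to suit analysis or visualization needs. Downsampling moves to a lower frequency (for example, daily to weekly) and requires an aggregation such as a sum or mean. Upsampling moves to a higher frequency (for example, weekly to daily) and creates empty slots that must be filled or interpolated. Both are common operations in time series analysis.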
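
A minimal sketch of both directions, assuming a hypothetical series observed every second day:

```python
import pandas as pd

# Observations every second day: Jan 1, 3, 5, 7, 9
idx = pd.date_range(start="2022-01-01", periods=5, freq="2D")
sparse = pd.Series([10, 20, 30, 40, 50], index=idx, name="value")

# Increasing the frequency (to daily) creates new, empty slots
daily = sparse.resample("D").asfreq()
print(len(daily), daily.isna().sum())   # 9 4

# Decreasing the frequency (to 4-day bins) needs an aggregation
four_day = sparse.resample("4D").sum()
print(four_day.tolist())                # [30, 70, 50]
```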

#### Using `resample()` in Pandas:

The `resample()` method in Pandas groups rows into time bins; an aggregation or fill method is then applied to each bin.

In [8]:
import pandas as pd

# Sample DataFrame with a 'date' column built from a DatetimeIndex
date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Resampling to weekly frequency
weekly_df = df.resample('W-Mon', on='date').sum()

# Displaying the resampled DataFrame
print("Resampled DataFrame (Weekly):")
print(weekly_df)

# Result : The DataFrame is resampled into weekly bins that each end on a Monday, and the values are summed for each bin.


Resampled DataFrame (Weekly):
            value
date             
2022-01-03     60
2022-01-10    490


#### Handling Missing Values during Resampling:

In [9]:
# Adding some missing values to the DataFrame
df.loc[df['date'] == '2022-01-03', 'value'] = None
df.loc[df['date'] == '2022-01-07', 'value'] = None

# Resampling to daily frequency with forward fill (ffill) for missing values
# min_count=1 keeps a day without any valid value as NaN; a plain sum() would turn it into 0
resampled_filled = df.resample('D', on='date').sum(min_count=1).ffill()

# Displaying the resampled and filled DataFrame
print("\nResampled DataFrame with Forward Fill for Missing Values:")
print(resampled_filled)

# Result : Each missing value is replaced by the last valid value before it (forward fill).


Resampled DataFrame with Forward Fill for Missing Values:
            value
date             
2022-01-01   10.0
2022-01-02   20.0
2022-01-03   20.0
2022-01-04   40.0
2022-01-05   50.0
2022-01-06   60.0
2022-01-07   60.0
2022-01-08   80.0
2022-01-09   90.0
2022-01-10  100.0


***Explanation***

```python
# Adding some missing values to the DataFrame
df.loc[df['date'] == '2022-01-03', 'value'] = None
df.loc[df['date'] == '2022-01-07', 'value'] = None
```

Here, missing values are added to the 'value' column of the DataFrame `df` at specific dates ('2022-01-03' and '2022-01-07'). Pandas stores `None` in a numeric column as `NaN`, which is why the column becomes floating point.

```python
# Resampling to daily frequency with forward fill (ffill) for missing values
resampled_filled = df.resample('D', on='date').sum(min_count=1).ffill()
```

The DataFrame `df` is then resampled with a daily frequency ('D') using the `resample` method. By default, `sum()` skips missing values and returns 0 for a bin that contains nothing else, which would leave nothing for `ffill()` to fill. Passing `min_count=1` requires at least one valid value per bin, so the two affected days stay `NaN`. Forward fill (`ffill`) then replaces each missing value with the last valid observation, propagating it forward in time: January 3 takes the value 20.0 from January 2, and January 7 takes 60.0 from January 6.

```python
# Displaying the resampled and filled DataFrame
print("\nResampled DataFrame with Forward Fill for Missing Values:")
print(resampled_filled)
```

The resulting DataFrame `resampled_filled` is then printed, showing the resampled data with missing values filled using forward fill.

In summary, this code snippet demonstrates how to add missing values to a DataFrame at specific dates, keep them visible during resampling, and fill them with forward fill to obtain a complete series at a regular time frequency.

#### Applications:

- **Aggregating Data:**
  - Summarize data over larger time intervals for higher-level insights.

- **Handling Missing Values:**
  - Address missing values during resampling using methods like forward fill or interpolation.

#### Considerations:

- **Resampling Rule:**
  - Choose the appropriate resampling rule ('D' for day, 'W' for week, etc.) based on analysis requirements.

- **Handling Missing Values:**
  - Consider the method for handling missing values during resampling, such as forward fill, backward fill, or interpolation.
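
The three strategies named above can give quite different results. A minimal side-by-side sketch on a hypothetical five-day series with a two-day gap:

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start="2022-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0], index=idx)

filled = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),               # carry the last valid value forward
    "bfill": s.bfill(),               # pull the next valid value backward
    "interpolate": s.interpolate(),   # linear interpolation by default
})
print(filled)
#             original  ffill  bfill  interpolate
# 2022-01-01      10.0   10.0   10.0         10.0
# 2022-01-02       NaN   10.0   40.0         20.0
# 2022-01-03       NaN   10.0   40.0         30.0
# 2022-01-04      40.0   40.0   40.0         40.0
# 2022-01-05      50.0   50.0   50.0         50.0
```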

#### Tips:

- **Custom Resampling Rules:**
  - Combine a multiple with a frequency alias (for example, '3D' for three-day bins or '2W' for two-week bins) to fit specific business requirements.

- **Chaining Operations:**
  - Chain operations like resampling and aggregations for more complex analysis.
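
As one possible illustration of both tips, the sketch below uses a three-day rule and chains several steps. The `share` column is a hypothetical derived metric, chosen only to show a further chained step.

```python
import pandas as pd

date_rng = pd.date_range(start="2022-01-01", end="2022-01-10", freq="D")
df = pd.DataFrame({"date": date_rng, "value": list(range(10, 110, 10))})

summary = (
    df.resample("3D", on="date")["value"]      # three-day bins starting 2022-01-01
      .agg(["sum", "mean"])                    # two aggregations per bin
      .assign(share=lambda t: t["sum"] / t["sum"].sum())  # each bin's share of the total
)
print(summary)
print(summary["sum"].tolist())   # [60, 150, 240, 100]
```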

Resampling and frequency conversion in Pandas are powerful techniques for adjusting the temporal granularity of time series data. These operations facilitate meaningful analysis and visualization, ensuring that the data aligns with the desired temporal context.
