# **`Data Science Learners Hub`**

**Module : Python**

**email** : [datasciencelearnershub@gmail.com](mailto:datasciencelearnershub@gmail.com)

## **`#4: Advanced Data Manipulation`**
10. **Merging and Concatenating DataFrames**
    - Combining DataFrames
    - Concatenation and merging operations

11. **Reshaping Data**
    - Pivoting and melting
    - Stacking and unstacking

12. **Time Series Data**
    - Handling time and date data
    - Resampling and frequency conversion

### **`10. Merging and Concatenating DataFrames`**

#### **`Combining DataFrames in Pandas`**

#### Concept of Combining DataFrames:

Combining or merging DataFrames in Pandas involves bringing together information from two or more DataFrames based on a common key or index. This is particularly useful when dealing with related datasets or when you want to integrate information from multiple sources.

#### Scenarios for DataFrame Combination:

1. **Data Integration:**
   - Combine datasets with shared information to create a unified view.

2. **Relational Databases:**
   - Mimic relational database joins for complex data relationships.

3. **Time Series Alignment:**
   - Align datasets based on time indices for time series analysis.

4. **Handling Missing Data:**
   - Fill in missing information by combining datasets with complementary information.
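
As a rough illustration of the missing-data scenario above, the sketch below uses `combine_first()` to patch gaps in one DataFrame with values from another. The `primary` and `backup` frames and their salary figures are hypothetical and chosen only for demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two sources describing the same IDs, each with gaps
primary = pd.DataFrame(
    {'Salary': [60000.0, np.nan, 70000.0]},
    index=pd.Index([1, 2, 3], name='ID'),
)
backup = pd.DataFrame(
    {'Salary': [58000.0, 45000.0, np.nan]},
    index=pd.Index([1, 2, 3], name='ID'),
)

# combine_first keeps the values in `primary` and fills only its NaN cells from `backup`
filled = primary.combine_first(backup)
print(filled)
```

Values already present in `primary` are left alone; only ID 2, which is missing there, is taken from `backup`.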

#### Types of Merges:

##### 1. Inner Merge:

In [1]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['laxman', 'harshita', 'naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 45000, 70000]})

# Inner Merge on 'ID'
merged_inner = pd.merge(df1, df2, on='ID', how='inner')

# Displaying the merged DataFrame
print("Inner Merge Result:")
print(merged_inner)


Inner Merge Result:
   ID      Name  Salary
0   2  harshita   60000
1   3     naina   45000


- **Result:**
  - Only rows with common 'ID' values in both DataFrames are retained.

##### 2. Outer Merge:

In [2]:
# Outer Merge on 'ID'
merged_outer = pd.merge(df1, df2, on='ID', how='outer')

# Displaying the merged DataFrame
print("\nOuter Merge Result:")
print(merged_outer)


Outer Merge Result:
   ID      Name   Salary
0   1    laxman      NaN
1   2  harshita  60000.0
2   3     naina  45000.0
3   4       NaN  70000.0


- **Result:**
  - All rows from both DataFrames are included. NaN is used for missing values.

##### 3. Left Merge:

In [3]:
# Left Merge on 'ID'
merged_left = pd.merge(df1, df2, on='ID', how='left')

# Displaying the merged DataFrame
print("\nLeft Merge Result:")
print(merged_left)


Left Merge Result:
   ID      Name   Salary
0   1    laxman      NaN
1   2  harshita  60000.0
2   3     naina  45000.0


- **Result:**
  - All rows from the left DataFrame (df1) are retained. Columns that come from the right DataFrame (here 'Salary') are NaN where no matching 'ID' exists.

##### 4. Right Merge:

In [4]:
# Right Merge on 'ID'
merged_right = pd.merge(df1, df2, on='ID', how='right')

# Displaying the merged DataFrame
print("\nRight Merge Result:")
print(merged_right)


Right Merge Result:
   ID      Name  Salary
0   2  harshita   60000
1   3     naina   45000
2   4       NaN   70000


- **Result:**
  - All rows from the right DataFrame (df2) are retained. Columns that come from the left DataFrame (here 'Name') are NaN where no matching 'ID' exists.

#### Implications of Merge Types:

- **Inner Merge:**
  - Retains only rows with matching keys in both DataFrames.

- **Outer Merge:**
  - Retains all rows from both DataFrames, filling in missing values with NaN.

- **Left Merge:**
  - Retains all rows from the left DataFrame, filling in missing values with NaN.

- **Right Merge:**
  - Retains all rows from the right DataFrame, filling in missing values with NaN.
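
One way to see these implications directly is the `indicator` parameter of `merge()`, which adds a `_merge` column recording where each row came from. The sketch below reuses the same sample `df1` and `df2` as the merge examples; the variable name `audit` is just an illustrative choice.

```python
import pandas as pd

# Same sample data as in the merge examples
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['laxman', 'harshita', 'naina']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [60000, 45000, 70000]})

# indicator=True adds a '_merge' column with 'left_only', 'right_only', or 'both'
audit = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(audit)

# Rows marked 'both' are exactly the rows an inner merge would keep
inner_ids = audit.loc[audit['_merge'] == 'both', 'ID'].tolist()
print(inner_ids)
```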

#### Considerations:

- **Key Column(s):**
  - Specify the key column(s) on which to merge the DataFrames.

- **Duplicate Keys:**
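  - Be cautious about duplicate keys; a key that repeats in both DataFrames produces one output row for every left/right pairing, so the result can be much larger than either input.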

- **Multiple Key Columns:**
  - Merge on multiple columns for more complex relationships.
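
The considerations above can be sketched as follows. The `orders`, `customers`, `sales`, and `targets` frames are hypothetical; the `validate` argument of `merge()` is used here as one possible guard against unexpected duplicate keys.

```python
import pandas as pd

# Hypothetical data: customer 1 has two orders, so the key repeats on the left
orders = pd.DataFrame({'CustomerID': [1, 1, 2], 'Amount': [250, 100, 75]})
customers = pd.DataFrame({'CustomerID': [1, 2], 'City': ['Pune', 'Delhi']})

# validate='many_to_one' allows repeated keys on the left but requires unique keys on the right
merged = pd.merge(orders, customers, on='CustomerID', how='left', validate='many_to_one')
print(merged)

# If the right side also contains duplicate keys, the same check raises MergeError
dup_customers = pd.concat([customers, customers])
try:
    pd.merge(orders, dup_customers, on='CustomerID', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print("Duplicate keys detected:", raised)

# Merging on multiple key columns: pass a list to `on`
sales = pd.DataFrame({'Year': [2022, 2022], 'Region': ['N', 'S'], 'Sales': [10, 20]})
targets = pd.DataFrame({'Year': [2022, 2022], 'Region': ['N', 'S'], 'Target': [12, 18]})
both = pd.merge(sales, targets, on=['Year', 'Region'])
print(both)
```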

#### Tips:

- **Suffixes:**
  - Use `suffixes` parameter to differentiate columns with the same name in the merged DataFrames.

- **Index-Based Merge:**
  - Merge based on indices using `left_index` and `right_index` parameters.
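
A minimal sketch of both tips, using made-up quarterly sales and salary data (the frame names and values are assumptions for illustration only):

```python
import pandas as pd

# Both frames have a 'Sales' column, so the merged result needs suffixes to tell them apart
q1 = pd.DataFrame({'ID': [1, 2], 'Sales': [100, 150]})
q2 = pd.DataFrame({'ID': [1, 2], 'Sales': [120, 130]})
by_quarter = pd.merge(q1, q2, on='ID', suffixes=('_q1', '_q2'))
print(by_quarter)

# Index-based merge: the row labels act as the key instead of a column
names = pd.DataFrame({'Name': ['laxman', 'harshita']}, index=[1, 2])
salaries = pd.DataFrame({'Salary': [50000, 60000]}, index=[2, 3])
by_index = pd.merge(names, salaries, left_index=True, right_index=True, how='inner')
print(by_index)
```

Only index label 2 appears in both frames, so the inner index-based merge keeps a single row.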

Merging DataFrames is a crucial aspect of data manipulation in Pandas, enabling the combination of information from diverse sources. Understanding the types of merges and their implications empowers efficient data integration and analysis.

#### **`Concatenation and Merging Operations in Pandas`**

#### Concatenation with `concat()`:

Concatenation in Pandas involves combining DataFrames either vertically or horizontally.

##### 1. Vertical Concatenation:

In [5]:
import pandas as pd

# Sample DataFrames
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Vertical Concatenation
concatenated_vertical = pd.concat([df1, df2])

# Displaying the concatenated DataFrame
print("Vertical Concatenation Result:")
print(concatenated_vertical)

Vertical Concatenation Result:
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3


- **Result:**
  - Rows from both DataFrames are stacked vertically.
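
Note that the original row labels (0, 1, 0, 1) are kept in the output above. If unique labels are needed, a sketch along these lines may help; `ignore_index` and `keys` are existing `concat()` parameters, while the labels `'first'` and `'second'` are arbitrary choices.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# ignore_index=True discards the old labels and renumbers rows 0..n-1
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)

# keys=... keeps the old labels and adds an outer index level naming the source frame
keyed = pd.concat([df1, df2], keys=['first', 'second'])
print(keyed)
```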

##### 2. Horizontal Concatenation:

In [6]:
# Sample DataFrames
df3 = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']})

# Horizontal Concatenation
concatenated_horizontal = pd.concat([df1, df3], axis=1)

# Displaying the concatenated DataFrame
print("\nHorizontal Concatenation Result:")
print(concatenated_horizontal)


Horizontal Concatenation Result:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


- **Result:**
  - Columns from both DataFrames are joined horizontally.
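
Horizontal concatenation aligns rows by index label rather than by position. The sketch below, with two small made-up frames whose indices only partly overlap, shows what that means in practice.

```python
import pandas as pd

# Hypothetical frames whose indices overlap only on label 1
left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C1', 'C2']}, index=[1, 2])

# Default join='outer': union of the indices, NaN where a frame has no row for a label
outer = pd.concat([left, right], axis=1)
print(outer)

# join='inner': keep only the labels present in every frame
inner = pd.concat([left, right], axis=1, join='inner')
print(inner)
```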

#### Merging with `merge()`:

The `merge()` function combines DataFrames based on specified columns.

In [7]:
# Sample DataFrames
df4 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df5 = pd.DataFrame({'ID': [2, 3], 'Salary': [60000, 45000]})

# Merging on 'ID'
merged_result = pd.merge(df4, df5, on='ID', how='inner')

# Displaying the merged DataFrame
print("\nMerge Result:")
print(merged_result)


Merge Result:
   ID Name  Salary
0   2  Bob   60000


- **Result:**
  - Inner merge on 'ID' retains only rows with common 'ID' values.

#### Merging Parameters:

- **`how`:**
  - Specifies the type of merge (e.g., 'inner', 'outer', 'left', 'right').

- **`on`:**
  - Specifies the key column(s) for merging.

- **`suffixes`:**
  - Appends suffixes to duplicate column names in case of overlap.

#### Considerations:

- **Common Key Columns:**
  - Ensure the key columns contain common values. If they have different names in the two DataFrames, use `left_on` and `right_on` instead of `on`.

- **Duplicate Columns:**
  - Be cautious about duplicate columns; use `suffixes` to handle them.
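
When the key columns are named differently in the two DataFrames, `left_on` and `right_on` can be used in place of `on`. The following is a small sketch; the column name `emp_id` is an assumption made for the example.

```python
import pandas as pd

# Hypothetical frames: the key is called 'emp_id' on the left and 'ID' on the right
employees = pd.DataFrame({'emp_id': [1, 2], 'Name': ['Alice', 'Bob']})
pay = pd.DataFrame({'ID': [2, 3], 'Salary': [60000, 45000]})

# left_on / right_on name the key column on each side
merged = pd.merge(employees, pay, left_on='emp_id', right_on='ID', how='inner')

# Both key columns survive the merge, so drop the redundant one
merged = merged.drop(columns='ID')
print(merged)
```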

#### Tips:

- **Multiple Key Columns:**
  - Merge on multiple columns for complex relationships using a list in the `on` parameter.

- **Index-Based Merge:**
  - Merge based on indices using `left_index` and `right_index` parameters.

Combining DataFrames using `concat()` and `merge()` provides flexibility in managing and integrating data. Understanding these functions and their parameters allows for efficient data manipulation and analysis in various scenarios.


### **`11. Reshaping Data`**

#### **`Pivoting and Melting in Pandas for Data Reshaping`**

#### Pivoting in Pandas:

Pivoting reshapes a DataFrame from long format to wide format: the unique values of one column become new column headers, another column supplies the row index, and a third supplies the cell values.

In [8]:
import pandas as pd

# Sample DataFrame
data = {'Date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02'],
        'Category': ['A', 'B', 'A', 'B'],
        'Value': [10, 20, 30, 40]}

df = pd.DataFrame(data)

# Pivoting DataFrame
pivot_result = df.pivot(index='Date', columns='Category', values='Value')

# Displaying the pivoted DataFrame
print("Pivoted DataFrame:")
print(pivot_result)

# Result - Rows with the same 'Date' are combined, and 'Category' values become separate columns.

Pivoted DataFrame:
Category     A   B
Date              
2022-01-01  10  20
2022-01-02  30  40


#### Melting in Pandas:

Melting involves transforming a DataFrame from wide format to long format, unpivoting it.

In [10]:
# Melting DataFrame: unpivot the 'A' and 'B' columns of pivot_result back into rows
wide_df = pivot_result.reset_index()
melted_result = pd.melt(wide_df, id_vars='Date', value_vars=['A', 'B'], var_name='Category', value_name='Value')

# Displaying the melted DataFrame
print("\nMelted DataFrame:")
print(melted_result)

# Result:
# Columns 'A' and 'B' from the pivoted DataFrame become rows, with a new 'Category' column.


Melted DataFrame:
         Date Category  Value
0  2022-01-01        A     10
1  2022-01-02        A     30
2  2022-01-01        B     20
3  2022-01-02        B     40


#### Applications:

- **Pivoting:**
  - Convert data for better presentation or visualization.
  - Facilitate analysis by organizing data for specific requirements.

- **Melting:**
  - Convert aggregated or summarized data into a long format.
  - Prepare data for specific analyses or visualizations.

#### Use Cases:

1. **Pivoting Example:**
   - Convert long-format sales data, where each row records a date, a product category, and a quantity, into a table with one column per product category.

2. **Melting Example:**
   - Transform a wide DataFrame with one column per product category back into a long format, where each row holds a category and its quantity, for easier analysis.

#### Considerations:

- **Unique Index Values:**
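  - Ensure that each combination of index and column values appears only once; `pivot()` raises a `ValueError` on duplicates, in which case `pivot_table()` with an aggregation function is the usual alternative.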

- **Melting Wide Data:**
  - Specify columns to be preserved as identifier variables and those to be melted.
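
The uniqueness consideration can be sketched as follows. The data is hypothetical and deliberately contains two rows for the same date and category, so `pivot()` fails and `pivot_table()` is used to aggregate instead.

```python
import pandas as pd

# Hypothetical data: ('2022-01-01', 'A') appears twice
df = pd.DataFrame({
    'Date': ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-02'],
    'Category': ['A', 'A', 'B', 'A'],
    'Value': [10, 5, 20, 30],
})

# pivot() cannot decide which of the duplicate values to keep and raises ValueError
try:
    df.pivot(index='Date', columns='Category', values='Value')
    pivot_failed = False
except ValueError:
    pivot_failed = True
print("pivot() failed on duplicates:", pivot_failed)

# pivot_table() aggregates duplicates; fill_value replaces missing combinations
table = df.pivot_table(index='Date', columns='Category', values='Value',
                       aggfunc='sum', fill_value=0)
print(table)
```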

#### Tips:

- **Multi-level Index:**
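  - When the values you want to pivot on are stored in a multi-level index, call `reset_index()` first so that those levels become regular columns that `pivot()` can reference.

- **Handling NaN Values:**
  - Check for NaN values after pivoting; they appear wherever an index/column combination has no row in the original data.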

Pivoting and melting are powerful tools in reshaping data to meet specific analysis or visualization needs. Mastering these operations allows for efficient manipulation and exploration of diverse datasets in Pandas.


#### **`Stacking and Unstacking in Pandas for Hierarchical Index Reshaping`**

#### Stacking and Unstacking Concepts:

In Pandas, stacking and unstacking are operations used to manipulate DataFrames with hierarchical index structures, particularly those with multi-level indices.

#### Stacking:

Stacking involves "compressing" a level in the DataFrame's columns to produce a new level in the index.

In [11]:
import pandas as pd

# Sample DataFrame with Multi-level Index
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))

df = pd.DataFrame({'Value': [10, 20, 30, 40]}, index=index)

# Stacking DataFrame
stacked_result = df.stack()

# Displaying the stacked DataFrame
print("Stacked DataFrame:")
print(stacked_result)

# Result:
# The 'Value' column label moves into a new innermost index level, so the result is a Series.

Stacked DataFrame:
Letter  Number       
A       1       Value    10
        2       Value    20
B       1       Value    30
        2       Value    40
dtype: int64


#### Unstacking:

Unstacking is the inverse operation of stacking. It involves "expanding" a level in the DataFrame's index to produce a new level in the columns.

In [12]:
# Unstacking DataFrame
unstacked_result = df.unstack()

# Displaying the unstacked DataFrame
print("\nUnstacked DataFrame:")
print(unstacked_result)

# Result : The innermost index level ('Number') moves into the columns, creating a new column level.


Unstacked DataFrame:
       Value    
Number     1   2
Letter          
A         10  20
B         30  40


#### Applications:

- **Stacking:**
  - Transform a DataFrame with a multi-level column index into a long format.
  - Facilitate analysis or visualization requiring a simpler column structure.

- **Unstacking:**
  - Convert data with a multi-level index into a wide format for better presentation.
  - Facilitate analysis by organizing data in a way that simplifies access to information.

#### Use Cases:

1. **Stacking Example:**
   - Convert sales data with a multi-level column index (products, regions) into a long format for easy analysis.

2. **Unstacking Example:**
   - Transform a DataFrame with a multi-level index representing time series data into a wide format with columns for each time point.

#### Considerations:

- **Level Selection:**
  - Specify the level to be stacked or unstacked.

- **NaN Values:**
  - Check for NaN values after unstacking, especially if the original DataFrame had missing data.

#### Tips:

- **Multiple Levels:**
  - Stack or unstack multiple levels by passing a list of level names or level numbers.

- **Naming Levels:**
  - Assign names to levels for clarity using the `names` parameter in `MultiIndex`.
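
The level-selection and NaN considerations above can be sketched like this. The data is a made-up variant of the earlier example in which the combination ('B', 2) is missing.

```python
import pandas as pd

# Hypothetical data: there is no row for ('B', 2)
index = pd.MultiIndex.from_arrays([['A', 'A', 'B'], [1, 2, 1]], names=('Letter', 'Number'))
df = pd.DataFrame({'Value': [10, 20, 30]}, index=index)

# Choose which index level moves into the columns, by name (or by position)
by_letter = df.unstack(level='Letter')
print(by_letter)

# Unstacking 'Number' leaves NaN where a combination is missing
wide = df.unstack(level='Number')
print(wide)

# fill_value substitutes a value for the missing combinations
wide_filled = df.unstack(level='Number', fill_value=0)
print(wide_filled)
```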

Stacking and unstacking are essential operations for reshaping hierarchical index structures in Pandas. Understanding when and how to use these operations allows for efficient manipulation and exploration of multi-level index DataFrames.


### **`12. Time Series Data`**

#### **`Handling Time and Date Data in Pandas`**

#### Importance of Handling Time and Date Data:

Handling time and date data is crucial in data analysis for various reasons:

1. **Temporal Analysis:**
   - Time-based insights, trends, and patterns are essential for understanding data.

2. **Time Series Analysis:**
   - Analyzing data collected over time for forecasting and trend identification.

3. **Data Alignment:**
   - Aligning datasets based on time indices for effective merging and analysis.

4. **Event Sequencing:**
   - Understanding the chronological order of events for context-aware analysis.


#### `DatetimeIndex` in Pandas:

Pandas provides the `DatetimeIndex`, an index made of timestamps that supports date-based selection, alignment, and resampling.

In [13]:
import pandas as pd

# Creating a DatetimeIndex
date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')

# Creating a DataFrame with a 'date' column built from the DatetimeIndex
df = pd.DataFrame(date_rng, columns=['date'])

# Displaying the DataFrame
print("DataFrame with Date Column:")
print(df)

# Result :  The 'date' column holds the dates from '2022-01-01' to '2022-01-10'; the DataFrame's own index is still the default integer index.


DataFrame with Date Column:
        date
0 2022-01-01
1 2022-01-02
2 2022-01-03
3 2022-01-04
4 2022-01-05
5 2022-01-06
6 2022-01-07
7 2022-01-08
8 2022-01-09
9 2022-01-10
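
In the example above the dates live in an ordinary column. As a minimal sketch of what changes when the dates are used as the index itself, the following builds a hypothetical frame `ts` indexed by the same date range and selects rows by date strings.

```python
import pandas as pd

date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')

# Hypothetical frame that uses the DatetimeIndex as its row index
ts = pd.DataFrame({'value': range(10)}, index=date_rng)
print(isinstance(ts.index, pd.DatetimeIndex))

# With a DatetimeIndex, rows can be sliced by date strings (both endpoints are included)
subset = ts.loc['2022-01-03':'2022-01-05']
print(subset)
```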


#### Manipulating Time Series Data:

In [14]:
# Adding a new column with random values
import numpy as np

df['value'] = np.random.randint(0, 100, size=(len(date_rng)))

# Displaying the updated DataFrame
print("\nDataFrame with Random Values:")
print(df)

# Result : The DataFrame now includes a 'value' column with random integer values (no seed is set, so the numbers differ between runs).


DataFrame with Random Values:
        date  value
0 2022-01-01     64
1 2022-01-02     62
2 2022-01-03     91
3 2022-01-04     30
4 2022-01-05     52
5 2022-01-06     66
6 2022-01-07     60
7 2022-01-08      7
8 2022-01-09     19
9 2022-01-10     69


#### Time Series Operations:

In [16]:
# Extracting components of the date
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()

# Displaying the DataFrame with extracted components
print("\nDataFrame with Date Components:")
print(df)

# Result : Additional columns are added for the year, month, day, and weekday of each date.


DataFrame with Date Components:
        date  value  year  month  day    weekday
0 2022-01-01     64  2022      1    1   Saturday
1 2022-01-02     62  2022      1    2     Sunday
2 2022-01-03     91  2022      1    3     Monday
3 2022-01-04     30  2022      1    4    Tuesday
4 2022-01-05     52  2022      1    5  Wednesday
5 2022-01-06     66  2022      1    6   Thursday
6 2022-01-07     60  2022      1    7     Friday
7 2022-01-08      7  2022      1    8   Saturday
8 2022-01-09     19  2022      1    9     Sunday
9 2022-01-10     69  2022      1   10     Monday


#### Time Resampling:

In [17]:
# Resampling the 'value' column to weekly frequency (weeks ending on Monday)
weekly_df = df.resample('W-MON', on='date')[['value']].sum()

# Displaying the resampled DataFrame
print("\nResampled DataFrame (Weekly):")
print(weekly_df)

# Result : The 'value' column is summed per week. Only 'value' is selected because summing the year, month, day, or weekday columns is not meaningful.


Resampled DataFrame (Weekly):
            value
date             
2022-01-03    217
2022-01-10    303


#### Considerations:

- **Date Components:**
  - Extracting date components facilitates detailed analysis and reporting.

- **Resampling Frequency:**
  - Choose the appropriate frequency when resampling data to suit analysis requirements.

#### Tips:

- **Time Zone Handling:**
  - Consider time zone information when working with data from different regions.

- **Periods and Durations:**
  - Explore Pandas' `Period` and `Timedelta` for handling periods and durations.
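
A brief sketch of these tips follows. The series values, the 09:00 timestamps, and the `Asia/Kolkata` target zone are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical series with naive (timezone-unaware) timestamps at 09:00
ts = pd.Series([1, 2, 3], index=pd.date_range('2022-01-01 09:00', periods=3, freq='D'))

# Declare the timestamps as UTC, then convert them to another time zone
utc = ts.tz_localize('UTC')
kolkata = utc.tz_convert('Asia/Kolkata')
print(kolkata)

# Timedelta represents a duration that can be added to a timestamp
delta = pd.Timedelta(days=1, hours=12)
shifted = pd.Timestamp('2022-01-01') + delta
print(shifted)

# Period represents a whole span of time, here the month of January 2022
month = pd.Period('2022-01', freq='M')
print(month.start_time, month.end_time)
```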

Handling time and date data with Pandas' `DatetimeIndex` enables effective analysis, visualization, and manipulation of time series data. Leveraging these functionalities enhances the ability to derive meaningful insights from datasets with temporal components.


#### **`Resampling and Frequency Conversion in Pandas`**

#### Resampling and Frequency Conversion Concepts:

Resampling involves changing the frequency of time series data, either increasing or decreasing the frequency, to suit analysis or visualization needs. It is a crucial operation in time series analysis.
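
To make the two directions concrete, here is a small sketch on a made-up three-day series: upsampling to 12-hour intervals (increasing the frequency) and downsampling to 2-day bins (decreasing it).

```python
import pandas as pd

# Hypothetical daily series
daily = pd.Series([10, 20, 30], index=pd.date_range('2022-01-01', periods=3, freq='D'))

# Upsampling: new 12-hour slots are created; asfreq() leaves them as NaN
up = daily.resample('12h').asfreq()
print(up)

# Upsampling with forward fill: each new slot repeats the previous known value
up_filled = daily.resample('12h').ffill()
print(up_filled)

# Downsampling: several days are aggregated into each 2-day bin
down = daily.resample('2D').sum()
print(down)
```

Upsampling creates slots with no data, so a fill strategy is needed; downsampling merges several observations, so an aggregation function is needed.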

#### Using `resample()` in Pandas:

The `resample()` function in Pandas allows for flexible and powerful resampling of time series data.

In [18]:
import pandas as pd

# Sample DataFrame with DatetimeIndex
date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Resampling to weekly frequency
weekly_df = df.resample('W-Mon', on='date').sum()

# Displaying the resampled DataFrame
print("Resampled DataFrame (Weekly):")
print(weekly_df)

# Result : The DataFrame is resampled into weekly bins that end on Monday, and the values are summed for each week.


Resampled DataFrame (Weekly):
            value
date             
2022-01-03     60
2022-01-10    490


#### Handling Missing Values during Resampling:

In [19]:
# Adding some missing values to the DataFrame
df.loc[df['date'] == '2022-01-03', 'value'] = None
df.loc[df['date'] == '2022-01-07', 'value'] = None

# Resampling while keeping missing days as NaN (min_count=1), then forward filling (ffill)
resampled_filled = df.resample('D', on='date').sum(min_count=1).ffill()

# Displaying the resampled and filled DataFrame
print("\nResampled DataFrame with Forward Fill for Missing Values:")
print(resampled_filled)

# Result : sum() alone would turn an all-NaN day into 0, so min_count=1 keeps it as NaN and `ffill` then carries the previous day's value forward.


Resampled DataFrame with Forward Fill for Missing Values:
            value
date             
2022-01-01   10.0
2022-01-02   20.0
2022-01-03   20.0
2022-01-04   40.0
2022-01-05   50.0
2022-01-06   60.0
2022-01-07   60.0
2022-01-08   80.0
2022-01-09   90.0
2022-01-10  100.0

#### Applications:

- **Aggregating Data:**
  - Summarize data over larger time intervals for higher-level insights.

- **Handling Missing Values:**
  - Address missing values during resampling using methods like forward fill or interpolation.

#### Considerations:

- **Resampling Rule:**
  - Choose the appropriate resampling rule ('D' for day, 'W' for week, etc.) based on analysis requirements.

- **Handling Missing Values:**
  - Consider the method for handling missing values during resampling, such as forward fill, backward fill, or interpolation.
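
The three fill strategies mentioned above give different results on the same gaps. The sketch below compares them side by side on a short made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two missing days
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0],
              index=pd.date_range('2022-01-01', periods=5, freq='D'))

# Compare the strategies column by column
filled = pd.DataFrame({
    'ffill': s.ffill(),              # carry the previous value forward
    'bfill': s.bfill(),              # pull the next value backward
    'interpolate': s.interpolate(),  # linear interpolation between neighbours
})
print(filled)
```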

#### Tips:

- **Custom Resampling Rules:**
  - Create custom resampling rules to fit specific business requirements.

- **Chaining Operations:**
  - Chain operations like resampling and aggregations for more complex analysis.
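
As a sketch of both tips, the example below uses a multiple of a base frequency (`'3D'`, i.e. three-day bins) as a custom rule and chains `agg()` to compute several statistics at once. It reuses the sample values from the weekly resampling example.

```python
import pandas as pd

date_rng = pd.date_range(start='2022-01-01', end='2022-01-10', freq='D')
df = pd.DataFrame({'date': date_rng,
                   'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# '3D' groups the data into three-day bins; agg() computes several statistics per bin
summary = df.resample('3D', on='date')['value'].agg(['sum', 'mean', 'max'])
print(summary)
```

The last bin contains only 2022-01-10, so its sum, mean, and max are all 100.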

Resampling and frequency conversion in Pandas are powerful techniques for adjusting the temporal granularity of time series data. These operations facilitate meaningful analysis and visualization, ensuring that the data aligns with the desired temporal context.
