## Section 1: Combining and Cleaning Data

### `pd.concat()`
- **Purpose**: Combines multiple DataFrames into one, stacking them vertically (row-wise) by default.

- **Key Parameter**:
    - `ignore_index`: When set to `True`, discards the original row labels and gives the resulting DataFrame a new continuous index (0, 1, 2, ...).

In [None]:
import pandas as pd

# Department 1 data
df_dept1 = pd.DataFrame({
    'EmployeeID': [101, 102, 103],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['Sales', 'Sales', 'Sales'],
    'Salary': [60000, 62000, 61000]
})

# Department 2 data
df_dept2 = pd.DataFrame({
    'EmployeeID': [104, 105],
    'Name': ['David', 'Eva'],
    'Department': ['Marketing', 'Marketing'],
    'Salary': [65000, 67000]
})

# Concatenating the two DataFrames (vertical concatenation)
df_employees = pd.concat([df_dept1, df_dept2], ignore_index=True)
print("Combined DataFrame:")
print(df_employees)

Combined DataFrame:
   EmployeeID     Name Department  Salary
0         101    Alice      Sales   60000
1         102      Bob      Sales   62000
2         103  Charlie      Sales   61000
3         104    David  Marketing   65000
4         105      Eva  Marketing   67000
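
To see what `ignore_index` changes, the following minimal sketch concatenates two small frames with and without it. The frames `df_a` and `df_b` are hypothetical stand-ins, not the department data above.

```python
import pandas as pd

# Two small hypothetical frames, each with the default index 0..n-1
df_a = pd.DataFrame({'EmployeeID': [101, 102], 'Salary': [60000, 62000]})
df_b = pd.DataFrame({'EmployeeID': [104, 105], 'Salary': [65000, 67000]})

# Default behaviour: the original row labels are kept, so 0 and 1 repeat
kept = pd.concat([df_a, df_b])
print(kept.index.tolist())    # [0, 1, 0, 1]

# With ignore_index=True the rows are relabelled 0..3
reset = pd.concat([df_a, df_b], ignore_index=True)
print(reset.index.tolist())   # [0, 1, 2, 3]

# Repeated labels make a label-based lookup return several rows
print(len(kept.loc[0]))       # 2
```

Repeated labels are not an error in pandas, but they tend to make later label-based selection ambiguous, which is why resetting the index is usually the safer choice when the old labels carry no meaning.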


### `rename()`
- **Purpose**: Changes column names to more meaningful or standardized labels.
- **Key Parameter**:
    - `columns`: A dictionary mapping old column names to new names.

In [None]:
# Rename columns to more descriptive names
df_employees_renamed = df_employees.rename(columns={
    'EmployeeID': 'Emp_ID',
    'Name': 'Employee_Name',
    'Department': 'Dept',
    'Salary': 'Annual_Salary'
})
print("\nRenamed DataFrame:")
print(df_employees_renamed)


Renamed DataFrame:
   Emp_ID Employee_Name       Dept  Annual_Salary
0     101         Alice      Sales          60000
1     102           Bob      Sales          62000
2     103       Charlie      Sales          61000
3     104         David  Marketing          65000
4     105           Eva  Marketing          67000
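
Two details of `rename()` are easy to miss: it returns a new DataFrame rather than modifying the original, and by default it silently ignores dictionary keys that match no column. The sketch below illustrates both on a hypothetical one-row frame; the misspelled key `'Employee_ID'` is deliberate.

```python
import pandas as pd

df = pd.DataFrame({'EmployeeID': [101], 'Name': ['Alice']})

# rename() returns a new DataFrame; the original keeps its column names
renamed = df.rename(columns={'EmployeeID': 'Emp_ID'})
print(df.columns.tolist())         # ['EmployeeID', 'Name']
print(renamed.columns.tolist())    # ['Emp_ID', 'Name']

# A key that matches no column is silently ignored by default
unchanged = df.rename(columns={'Employee_ID': 'Emp_ID'})  # deliberate typo
print(unchanged.columns.tolist())  # ['EmployeeID', 'Name']

# errors='raise' turns the same typo into a KeyError
try:
    df.rename(columns={'Employee_ID': 'Emp_ID'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)                 # True
```

Passing `errors='raise'` can be a useful guard in longer pipelines, where a mistyped column name would otherwise go unnoticed until a later step fails.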


### `sort_values()`
- **Purpose**: Sorts the DataFrame by one or more columns.
- **Key Parameters**:
    - `by`: Specifies the column(s) to sort by.
    - `ascending`: When set to `False`, sorts the data in descending order (the default is `True`).

In [None]:
# Sort by Annual_Salary in descending order
df_sorted = df_employees_renamed.sort_values(by='Annual_Salary', ascending=False)
print("\nDataFrame Sorted by Annual Salary (Descending):")
print(df_sorted)


DataFrame Sorted by Annual Salary (Descending):
   Emp_ID Employee_Name       Dept  Annual_Salary
4     105           Eva  Marketing          67000
3     104         David  Marketing          65000
1     102           Bob      Sales          62000
2     103       Charlie      Sales          61000
0     101         Alice      Sales          60000
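
Because `by` accepts a list, a sort can also use several columns with a different direction for each. The following sketch, which rebuilds a reduced copy of the employee data as an assumption, sorts by department alphabetically and then by salary from highest to lowest; `ignore_index=True` is an optional extra that renumbers the rows after sorting.

```python
import pandas as pd

df = pd.DataFrame({
    'Emp_ID': [101, 102, 103, 104, 105],
    'Dept': ['Sales', 'Sales', 'Sales', 'Marketing', 'Marketing'],
    'Annual_Salary': [60000, 62000, 61000, 65000, 67000]
})

# Dept ascending (A to Z), then Annual_Salary descending within each Dept
by_dept = df.sort_values(
    by=['Dept', 'Annual_Salary'],
    ascending=[True, False],
    ignore_index=True
)
print(by_dept['Emp_ID'].tolist())  # [105, 104, 102, 103, 101]
print(by_dept.index.tolist())      # [0, 1, 2, 3, 4]
```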


## Section 2: Method Chaining

### **Concept**: Combining several DataFrame operations into one continuous expression.

### **Advantages**:
- Improves readability and conciseness.
- Reduces the need for intermediate variables.

### Steps in this chain:
- **Concatenation**: Merges the two department DataFrames.
- **Renaming**: Standardizes column names.
- **Sorting**: Orders the DataFrame by annual salary in descending order.

In [None]:
import pandas as pd

# Department 1 data
df_dept1 = pd.DataFrame({
    'EmployeeID': [101, 102, 103],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Department': ['Sales', 'Sales', 'Sales'],
    'Salary': [60000, 62000, 61000]
})

# Department 2 data
df_dept2 = pd.DataFrame({
    'EmployeeID': [104, 105],
    'Name': ['David', 'Eva'],
    'Department': ['Marketing', 'Marketing'],
    'Salary': [65000, 67000]
})

# Method chaining: Combine, rename, and sort in descending order of salary.
df_employees = (
    pd.concat([df_dept1, df_dept2], ignore_index=True)
    .rename(columns={
        'EmployeeID': 'Emp_ID',
        'Name': 'Employee_Name',
        'Department': 'Dept',
        'Salary': 'Annual_Salary'
    })
    .sort_values(by='Annual_Salary', ascending=False)
)

print("Combined and Cleaned Employee DataFrame:")
print(df_employees)

Combined and Cleaned Employee DataFrame:
   Emp_ID Employee_Name       Dept  Annual_Salary
4     105           Eva  Marketing          67000
3     104         David  Marketing          65000
1     102           Bob      Sales          62000
2     103       Charlie      Sales          61000
0     101         Alice      Sales          60000
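
As a quick sanity check, a chain should produce exactly the same result as the equivalent step-by-step code. The sketch below compares the two on small hypothetical frames (`df_a`, `df_b`); note that every step after `rename()` has to refer to the new column names.

```python
import pandas as pd

df_a = pd.DataFrame({'EmployeeID': [101, 102], 'Salary': [60000, 62000]})
df_b = pd.DataFrame({'EmployeeID': [104], 'Salary': [65000]})

# Step-by-step version with intermediate variables
step1 = pd.concat([df_a, df_b], ignore_index=True)
step2 = step1.rename(columns={'Salary': 'Annual_Salary'})
step3 = step2.sort_values(by='Annual_Salary', ascending=False)

# Chained version: each method is called on the result of the previous one
chained = (
    pd.concat([df_a, df_b], ignore_index=True)
    .rename(columns={'Salary': 'Annual_Salary'})
    .sort_values(by='Annual_Salary', ascending=False)
)

print(chained.equals(step3))              # True
print(chained['Annual_Salary'].tolist())  # [65000, 62000, 60000]
```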


## Group Activity: Cleaning an Untidy Sales Dataset Using Method Chaining

### Method Chaining Instructions:
- Remove duplicates.
- Fill missing values with 0.
- Reshape the DataFrame from wide to long format.
- Strip the `Sales_` prefix from the quarter labels.
- Sort the final DataFrame by region, product, and quarter.

In [None]:
import pandas as pd

df_sales = pd.DataFrame({
    'Product': [
        'Widget A', 'Widget B', 'Widget A', 'Widget C',
        'Widget B', 'Widget A', 'Widget D', 'Widget E',
        'Widget C', 'Widget D', 'Widget B', 'Widget E'
    ],
    'Region': [
        'North', 'South', 'North', 'East',
        'South', 'North', 'West', 'East',
        'Central', 'North', 'West', 'South'
    ],
    'Sales_Q1': [100, 200, 100, 150, None, 100, 180, 210, 140, 190, 205, 220],
    'Sales_Q2': [110, None, 110, 160, 210, 110, 185, 220, 150, 200, 215, 230],
    'Sales_Q3': [105, 205, 105, None, 215, 105, 175, 205, 145, 195, 210, 225],
    'Sales_Q4': [115, 215, 115, 165, 225, None, 190, 215, 155, 205, 220, 235]
})

print("Expanded df_sales DataFrame:")
print(df_sales)


# Method chaining: Clean the dataset in one pipeline.
df_sales_clean = (
    df_sales
    .drop_duplicates()                          # Remove duplicate rows
    .fillna(0)                                   # Replace missing sales with 0
    .melt(id_vars=['Product', 'Region'],         # Reshape from wide to long format
          value_vars=['Sales_Q1', 'Sales_Q2', 'Sales_Q3', 'Sales_Q4'],
          var_name='Quarter',
          value_name='Sales')
    .assign(Quarter=lambda d: d['Quarter'].str.replace('Sales_', '', regex=False))  # Strip the 'Sales_' prefix from quarter labels
    .sort_values(by=['Region', 'Product', 'Quarter'])  # Sort data
)

print("Cleaned and Tidy Sales Data:")
df_sales_clean
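
The reshape step is easier to follow on a tiny example. This sketch uses a hypothetical two-row frame named `wide` rather than `df_sales`; each quarter column becomes one row per product, and the `Sales_` prefix is stripped afterwards.

```python
import pandas as pd

wide = pd.DataFrame({
    'Product': ['Widget A', 'Widget B'],
    'Sales_Q1': [100, 200],
    'Sales_Q2': [110, 0]
})

tidy = (
    wide
    # Every column not listed in id_vars is unpivoted into (Quarter, Sales) pairs
    .melt(id_vars=['Product'], var_name='Quarter', value_name='Sales')
    .assign(Quarter=lambda d: d['Quarter'].str.replace('Sales_', '', regex=False))
    .sort_values(by=['Product', 'Quarter'], ignore_index=True)
)

print(tidy.shape)                # (4, 3)
print(tidy['Quarter'].tolist())  # ['Q1', 'Q2', 'Q1', 'Q2']
print(tidy['Sales'].tolist())    # [100, 110, 200, 0]
```

Two wide rows with two quarter columns become four long rows: one row per product and quarter.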

In [None]:
import seaborn as sns
# Create a pivot table that calculates the sum of sales per Product per Region
pivot_sales = pd.pivot_table(
    df_sales_clean,
    values='Sales',
    index=['Product', 'Region'],
    columns='Quarter',
    aggfunc='sum',
    fill_value=0
)

print("Pivot Table of Total Sales:")
print(pivot_sales)
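
The choice of `aggfunc` only matters when several rows share the same index and column combination. The sketch below uses a hypothetical three-row frame to show how `'sum'` and `'mean'` differ in that case.

```python
import pandas as pd

# Hypothetical tidy data: 'Widget A' has two Q1 rows
tidy = pd.DataFrame({
    'Product': ['Widget A', 'Widget A', 'Widget B'],
    'Quarter': ['Q1', 'Q1', 'Q1'],
    'Sales': [100, 50, 200]
})

total = pd.pivot_table(tidy, values='Sales', index='Product',
                       columns='Quarter', aggfunc='sum')
average = pd.pivot_table(tidy, values='Sales', index='Product',
                         columns='Quarter', aggfunc='mean')

print(total.loc['Widget A', 'Q1'])    # 150
print(average.loc['Widget A', 'Q1'])  # 75.0
```

For groups with a single row, such as 'Widget B' here, both aggregations return the same value.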