# Advanced Pandas

**SESSION 1: Date & Time Operations**

*Why Date & Time Is CRITICAL in Data Analytics*

Almost every real business question is time-based:

- 📈 Monthly sales trend
- 📊 Year-over-year growth
- 🛒 Orders in the last 30 days
- 👥 Active users per week

In [None]:
import pandas as pd

**Step 1: Create a Sample Dataset**

In [None]:

data = {
    'order_id': [101, 102, 103, 104],
    'order_date': ['2024-01-05', '2024-02-10', '2024-02-15', '2024-03-01'],
    'sales': [2500, 1800, 3200, 1500]
}

df = pd.DataFrame(data)
df


**Step 2: Check the Data Type**

In [None]:
df.info()


**Step 3: Convert String → Datetime**

In [None]:
df['order_date'] = pd.to_datetime(df['order_date'])


In [None]:
df.info()
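Real data is often not in ISO format and may contain bad values. The sketch below shows one way to handle that with the `format` and `errors` parameters of `pd.to_datetime()`; the `raw` series and its values are hypothetical and not part of the dataset above.

```python
import pandas as pd

# Hypothetical raw input: day/month/year strings plus one invalid entry
raw = pd.Series(['05/01/2024', '10/02/2024', 'not a date'])

# format tells pandas how to read each string; errors='coerce' turns
# unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of values that failed to parse
```

Counting the `NaT` values after conversion is a quick check on how many rows failed to parse.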

**Step 4: Extract Date Components**

In [None]:
# Extract Year
df['year'] = df['order_date'].dt.year


In [None]:
# Extract Month
df['month'] = df['order_date'].dt.month


In [None]:
# Extract Day
df['day'] = df['order_date'].dt.day


In [None]:
# Extract Weekday
df['weekday'] = df['order_date'].dt.day_name()


These columns are used directly in:

- GroupBy
- Pivot tables
- Dashboards
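As a minimal sketch of the GroupBy case, the extracted `month` column can be used to total sales per month. The data repeats the sample orders from Step 1 so the block runs on its own; `monthly_sales` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'order_date': pd.to_datetime(
        ['2024-01-05', '2024-02-10', '2024-02-15', '2024-03-01']
    ),
    'sales': [2500, 1800, 3200, 1500],
})

# Extract the month number, then aggregate sales per month
df['month'] = df['order_date'].dt.month
monthly_sales = df.groupby('month')['sales'].sum()

print(monthly_sales)
```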

**Step 5: Filtering Data Using Dates**

In [None]:
# Example 1: Orders on or after Feb 1, 2024

df[df['order_date'] >= '2024-02-01']


In [None]:
# Example 2: Orders in February only

df[df['month'] == 2]
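Filtering on `month == 2` matches February of every year in the data. When the year matters, a date range is a safer filter. The sketch below uses `Series.between()` with assumed start and end dates for February 2024; the data repeats the Step 1 sample.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'order_date': pd.to_datetime(
        ['2024-01-05', '2024-02-10', '2024-02-15', '2024-03-01']
    ),
    'sales': [2500, 1800, 3200, 1500],
})

# Assumed range: all of February 2024 (both ends are inclusive by default)
start = pd.Timestamp('2024-02-01')
end = pd.Timestamp('2024-02-29')

feb_2024 = df[df['order_date'].between(start, end)]
print(feb_2024)
```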



**Step 6: Date Difference**

In [None]:
# Days since order
df['days_since_order'] = (pd.Timestamp('2024-03-10') - df['order_date']).dt.days
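Subtracting two datetimes gives a `Timedelta`, and `.dt.days` turns it into a whole number of days. The same idea works between rows: the sketch below uses `Series.diff()` to get the gap between consecutive orders. The column name `days_since_prev_order` is an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'order_date': pd.to_datetime(
        ['2024-01-05', '2024-02-10', '2024-02-15', '2024-03-01']
    ),
})

# Sort first so each row is compared with the order just before it
df = df.sort_values('order_date')

# diff() subtracts the previous row; the first row has no predecessor (NaN)
df['days_since_prev_order'] = df['order_date'].diff().dt.days

print(df)
```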


**PRACTICE QUESTIONS**

*Practice 1 (Basic)*

Create a DataFrame with:
- employee_name
- joining_date


In [None]:
data = {
    'employee_name': ['john', 'karan', 'meena', 'jack', 'king'],
    'joining_date': ['2025-02-01', '2025-02-02', '2025-03-03', '2025-03-05', '2025-03-04'],
    'sales': [2000, 12000, 15000, 30000, 50000]
}
practice = pd.DataFrame(data)
practice

In [None]:
# Convert joining_date to datetime.
practice['joining_date'] = pd.to_datetime(practice['joining_date'])


In [None]:
practice.info()

*Practice 2 (Intermediate)*

In [None]:
# Extract year, month, and day
practice['year'] = practice['joining_date'].dt.year
practice['month'] = practice['joining_date'].dt.month
practice['day'] = practice['joining_date'].dt.day

In [None]:
# Filter employees who joined in 2025 (the sample data contains only 2025 dates)
practice_2025 = practice[practice['year'] == 2025]


In [None]:
practice.head()

**INTERVIEW QUESTIONS**

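*Q1️⃣ Why do we convert dates using pd.to_datetime()?*

- Because pandas loads raw dates as strings (object dtype), and strings do not support time-based operations such as extracting the month or subtracting dates.

*Q2️⃣ Difference between object and datetime64?*

- object holds generic Python objects, usually text; datetime64 stores real timestamps, which allows reliable filtering, component extraction, and date arithmetic.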

*Q3️⃣ How do you filter records from the last 30 days?*

In [None]:
df[df['date'] >= pd.Timestamp.today() - pd.Timedelta(days=30)]
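`pd.Timestamp.today()` changes on every run, so results are hard to reproduce. The sketch below shows the same filter with a fixed reference date; the `reference` value and the sample rows are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(
        ['2024-01-05', '2024-02-10', '2024-02-15', '2024-03-01']
    ),
    'sales': [2500, 1800, 3200, 1500],
})

# Assumed "today"; in production this could be pd.Timestamp.today().normalize()
reference = pd.Timestamp('2024-03-10')
cutoff = reference - pd.Timedelta(days=30)

# Keep only rows on or after the cutoff date
recent = df[df['date'] >= cutoff]
print(recent)
```

Calling `.normalize()` on a timestamp resets the time to midnight, so the cutoff does not depend on the hour the code runs.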


*Q4️⃣ How do you extract month name?*

In [None]:
df['date'].dt.month_name()



*Q5️⃣ Common mistakes while working with dates?*

- Not converting the column to datetime, parsing with the wrong format, and comparing dates as strings.

**SESSION 2: Pivot Tables**

*Why Pivot Tables Are CRITICAL for Data Analysts*

Pivot tables are used to:

- Create summary reports
- Answer business questions fast
- Replace complex GroupBy code
- Do Excel / Power BI-style analysis


*What Is a Pivot Table*

A pivot table:

- Takes raw data
- Groups it
- Aggregates it
- Shows results in a clean table

In short: Pivot table = GroupBy + Aggregate + Reshape


In [None]:
import pandas as pd 

**Step 1: Create a Sample Dataset**

In [None]:

data = {
    'city': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Chennai', 'Chennai'],
    'year': [2023, 2024, 2023, 2024, 2023, 2024],
    'sales': [5000, 7000, 4000, 6500, 3000, 4800]
}

df = pd.DataFrame(data)
df


**Step 2: Understand the Business Question**

Example questions:

- Total sales per city?
- Year-wise sales per city?
- Average sales per city?

In [None]:
df.groupby('city')['sales'].sum()

In [None]:
df.groupby(['city', 'year'])['sales'].sum()


In [None]:
df.groupby('city')['sales'].mean()

**Step 3: Basic Pivot Table**

In [None]:
# Total Sales per City
pd.pivot_table(
    df,
    values='sales',
    index='city',
    aggfunc='sum'
)

# values → what to calculate
# index → group by
# aggfunc → how to summarize


**Step 4: Pivot with Rows + Columns**

In [None]:
# Year-wise Sales per City
pd.pivot_table(
    df,
    values='sales',
    index='city',
    columns='year',
    aggfunc='sum'
)
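To make the "GroupBy + Aggregate + Reshape" idea concrete, the sketch below builds the same year-wise table twice: once with `pd.pivot_table()` and once with an explicit `groupby()` followed by `unstack()`. The data repeats the Step 1 sample; `pivot` and `grouped` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Chennai', 'Chennai'],
    'year': [2023, 2024, 2023, 2024, 2023, 2024],
    'sales': [5000, 7000, 4000, 6500, 3000, 4800],
})

# One call: group, aggregate, and reshape
pivot = pd.pivot_table(
    df, values='sales', index='city', columns='year', aggfunc='sum'
)

# The same result in explicit steps: group + aggregate, then reshape
grouped = df.groupby(['city', 'year'])['sales'].sum().unstack('year')

print(pivot)
print(grouped)
```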


**Step 5: Multiple Aggregations**

In [None]:
# Sum + Average Sales per City
pd.pivot_table(
    df,
    values='sales',
    index='city',
    aggfunc=['sum', 'mean']
)
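Passing a list to `aggfunc` produces two-level column labels such as `('sum', 'sales')`. If flat column names are easier to work with, one option is to join the levels, as in this sketch. The naming pattern `sales_sum` is an assumption, not a pandas convention.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Chennai', 'Chennai'],
    'year': [2023, 2024, 2023, 2024, 2023, 2024],
    'sales': [5000, 7000, 4000, 6500, 3000, 4800],
})

summary = pd.pivot_table(
    df, values='sales', index='city', aggfunc=['sum', 'mean']
)

# Each column label is a (function, value) pair; join it into one string
summary.columns = [f'{value}_{func}' for func, value in summary.columns]

print(summary)
```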


**Step 6: Handle Missing Values in Pivot Tables**

By default:

- Index/column combinations with no matching rows appear as NaN

In [None]:
# Fill missing with 0
pd.pivot_table(
    df,
    values='sales',
    index='city',
    columns='year',
    aggfunc='sum',
    fill_value=0
)
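The sample dataset has a row for every city and year, so `fill_value` makes no visible difference there. The sketch below uses a smaller hypothetical dataset in which Mumbai has no 2024 row, to show the NaN and its replacement.

```python
import pandas as pd

# Hypothetical data: Mumbai has no row for 2024
df_gap = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai'],
    'year': [2023, 2024, 2023],
    'sales': [5000, 7000, 4000],
})

# Without fill_value the missing combination shows up as NaN
without_fill = pd.pivot_table(
    df_gap, values='sales', index='city', columns='year', aggfunc='sum'
)

# With fill_value=0 the same cell becomes 0
with_fill = pd.pivot_table(
    df_gap, values='sales', index='city', columns='year',
    aggfunc='sum', fill_value=0
)

print(without_fill)
print(with_fill)
```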


**pivot() vs pivot_table()**

pivot()

- No aggregation
- Data must be unique

pivot_table()

- Supports aggregation
- Handles duplicates
- Used in real analytics
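The difference shows up as soon as the data has duplicates. In this sketch, built on a hypothetical dataset with two Delhi rows for 2023, `pivot()` raises a `ValueError` while `pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data: two rows share the same city/year pair
df_dup = pd.DataFrame({
    'city': ['Delhi', 'Delhi', 'Mumbai'],
    'year': [2023, 2023, 2023],
    'sales': [5000, 2000, 4000],
})

# pivot() only reshapes, so duplicate index/column pairs are an error
try:
    df_dup.pivot(index='city', columns='year', values='sales')
    pivot_error = None
except ValueError as exc:
    pivot_error = str(exc)
    print('pivot() failed:', pivot_error)

# pivot_table() combines the duplicates with the chosen aggfunc
summary = pd.pivot_table(
    df_dup, values='sales', index='city', columns='year', aggfunc='sum'
)
print(summary)
```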

**PRACTICE QUESTIONS**

*Practice 1 (Basic)*

In [None]:

data = {
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai', 'Mumbai'],
    'sales': [200, 300, 150, 400, 250]
}

df = pd.DataFrame(data)
df


In [None]:
pd.pivot_table(
    df,
    values= 'sales',
    index= 'city',
    aggfunc= 'sum'
)


*Practice 2*

In [None]:
data = {
    'city': ['Delhi', 'Delhi', 'Mumbai', 'Mumbai', 'Chennai', 'Chennai'],
    'year': [2022, 2023, 2022, 2023, 2022, 2023],
    'sales': [200, 300, 250, 350, 400, 500]
}

df = pd.DataFrame(data)
df


In [None]:
pd.pivot_table(
    df,
    values='sales',
    index='city',
    columns='year',
    aggfunc='mean'
)


*Practice 3 (Advanced)*

In [None]:
# Sample employee dataset
data = {
    'department': ['HR', 'HR', 'IT', 'IT', 'Finance'],
    'year': [2022, 2023, 2022, 2023, 2022],
    'salary': [50000, 55000, 70000, None, 60000]
}

df_emp = pd.DataFrame(data)
df_emp


In [None]:
# Pivot table solution (with missing values filled)
pd.pivot_table(
    df_emp,
    values='salary',
    index='department',
    columns='year',
    aggfunc='mean',
    fill_value=0
)


**INTERVIEW QUESTIONS**

*Q1️⃣ What is a pivot table?*

* A summary table that groups and aggregates data.

*Q2️⃣ Difference between GroupBy and Pivot Table?*

* GroupBy returns long-format results and is more flexible for custom logic; a pivot table reshapes the same aggregation into a wide, report-style grid.

*Q3️⃣ When would you prefer pivot tables?*

* When creating dashboards, reports, or Excel-like summaries.

*Q4️⃣ How do you handle missing values in pivot tables?*

In [None]:
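# Pass fill_value to pivot_table() to replace NaN in the result
pd.pivot_table(df, values='sales', index='city', columns='year', aggfunc='sum', fill_value=0)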


*Q5️⃣ Difference between pivot() and pivot_table()?*

- pivot() only reshapes and requires unique index/column pairs; pivot_table() supports aggregation and handles duplicates.

**SESSION 3: Performance Optimization & Best Practices**

*Why Performance Matters in Data Analytics*

In real jobs:

- Datasets can have millions of rows
- Slow code = slow reports = unhappy business
- Interviewers check how you think, not just output
- Good analysts write clean + fast + readable code.

*First Concept: Vectorization*
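The cells in this session assume a DataFrame `df` with `name`, `city`, `salary`, and `experience` columns, which the notebook does not define at this point. Below is a minimal sketch with hypothetical sample values. It also contrasts a row-by-row Python loop with a vectorized expression that gives the same result.

```python
import pandas as pd

# Hypothetical employee data; column names match the cells in this session
df = pd.DataFrame({
    'name': ['asha', 'ravi', 'meena', 'john'],
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai'],
    'salary': [45000, 60000, 52000, 38000],
    'experience': [3, 8, 6, 2],
})

# Slow way: a Python loop that handles one value at a time
looped = []
for value in df['salary']:
    looped.append(value * 1.1)

# Fast way: one vectorized expression over the whole column
vectorized = df['salary'] * 1.1

# Both approaches produce the same numbers
print(looped == vectorized.tolist())
```

On four rows the speed difference is invisible; it only becomes noticeable as the row count grows.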

In [None]:
# Correct Way (Vectorized Operation)
df['updated_salary'] = df['salary'] * 1.1


*Use Built-in Pandas Functions*

In [None]:

df['name'] = df['name'].str.upper()


*Avoid Chained Indexing*

In [None]:
df.loc[df['salary'] > 50000, 'bonus'] = 5000
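To show what the `.loc` form protects against, the sketch below first tries the chained version on a small hypothetical DataFrame. The assignment lands on a temporary copy, so the original is left unchanged. The warning pandas emits for this pattern is silenced here only to keep the output short.

```python
import warnings

import pandas as pd

df_demo = pd.DataFrame({'salary': [45000, 60000, 52000]})

# Chained indexing: df_demo[mask] returns a new object, so the assignment
# is applied to a temporary copy and df_demo itself is not modified
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the chained-assignment warning
    df_demo[df_demo['salary'] > 50000]['bonus'] = 5000

chained_worked = 'bonus' in df_demo.columns
print('bonus column created by chained indexing:', chained_worked)

# Single-step .loc indexing updates df_demo directly
df_demo.loc[df_demo['salary'] > 50000, 'bonus'] = 5000
print(df_demo)
```

Rows that do not match the condition get NaN in the new `bonus` column.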


*Memory Optimization*

In [None]:
# Convert object → category
df['city'] = df['city'].astype('category')
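The saving from `category` can be measured with `memory_usage(deep=True)`. This sketch uses a hypothetical column of 30,000 values drawn from three city names; the exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 30,000 strings, but only 3 distinct values
cities = pd.Series(['Delhi', 'Mumbai', 'Chennai'] * 10_000, dtype=object)

# deep=True counts the memory used by the string objects themselves
as_object = cities.memory_usage(deep=True)
as_category = cities.astype('category').memory_usage(deep=True)

print('object bytes:  ', as_object)
print('category bytes:', as_category)
```

The conversion pays off when a column has few distinct values relative to its length; a column of mostly unique strings gains little.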


In [None]:
# Use copy() When Needed
filtered_df = df[df['salary'] > 40000].copy()


**PRACTICE QUESTIONS**

*Practice 1*

In [None]:
# Increase salary by 15% without loops.
df["salary"] = df["salary"] * 1.15

*Practice 2*

In [None]:
# "Senior" if experience > 5
# "Junior" otherwise

df["level"] = "Junior"
df.loc[df["experience"] > 5, "level"] = "Senior"
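An alternative, shown here as a sketch, is `numpy.where()`, which picks between two values in a single vectorized step. The sample `experience` values are hypothetical.

```python
import numpy as np
import pandas as pd

df_levels = pd.DataFrame({'experience': [3, 8, 6, 2]})

# np.where(condition, value_if_true, value_if_false) works on the whole column
df_levels['level'] = np.where(df_levels['experience'] > 5, 'Senior', 'Junior')

print(df_levels)
```

Both versions avoid loops; the `.loc` form extends more naturally when there are more than two categories.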


**INTERVIEW QUESTIONS**

*Q1️⃣ Why are loops slow in Pandas?*

- A Python loop processes one value at a time in the interpreter, while vectorized operations run over the whole column in compiled code.

*Q2️⃣ When should you use .apply()?*

- When the logic cannot easily be expressed with vectorized or built-in operations.

*Q3️⃣ What is chained indexing?*

- Indexing in two steps, such as df[mask]['col'] = value. The first step may return a copy, so the assignment can fail silently; use df.loc[mask, 'col'] instead.

*Q4️⃣ How do you optimize memory usage?*

- Use category for low-cardinality text columns, choose appropriate dtypes, and drop unused columns.

*Q5️⃣ How do you improve Pandas performance?*

- Use vectorization, avoid loops, minimize .apply(), and optimize datatypes.