# 📘 Notebook 2: Core Concepts and Notation in Panel Data


This notebook builds on the previous session by introducing:
- Panel data notation
- Difference between **Panel Data** and **Pooled Data**
- Why tracking the **same entities** over time adds analytical power

We'll use the same synthetic dataset of Indian firms from 2015–2023.


## 🔹 Step 1: Load and Prepare the Dataset

In [1]:

# Import required packages
import pandas as pd

# Load the dataset into a pandas DataFrame
df = pd.read_csv("synthetic_indian_firms_panel_data.csv")  # Dataset contains financial panel data for Indian firms

# Convert necessary columns to numeric format for safety
df['Revenue_Cr'] = pd.to_numeric(df['Revenue_Cr'], errors='coerce')
df['RD_Spend_Cr'] = pd.to_numeric(df['RD_Spend_Cr'], errors='coerce')
df['Employees'] = pd.to_numeric(df['Employees'], errors='coerce')
df['Profit_Cr'] = pd.to_numeric(df['Profit_Cr'], errors='coerce')

# Preview the data
df.head()  # Show first few rows to confirm structure

| | Firm | Year | Revenue_Cr | RD_Spend_Cr | Employees | Profit_Cr |
|---|---|---|---|---|---|---|
| 0 | Tata | 2015 | 25795.0 | 1360.0 | 46624 | 6355.856746 |
| 1 | Tata | 2016 | 27035.935025 | 1388.167938 | 49602 | 6661.621632 |
| 2 | Tata | 2017 | 30175.623068 | 1464.404549 | 47843 | 7435.236962 |
| 3 | Tata | 2018 | 29997.364272 | 1479.214781 | 49643 | 7391.314211 |
| 4 | Tata | 2019 | 29989.199169 | 1499.711555 | 48482 | 7389.302339 |
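
Because `errors='coerce'` silently turns unparseable entries into `NaN`, it can be worth counting how many values were lost in the conversion. The following is a minimal sketch; the raw strings are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical raw values, invented for illustration (not from the dataset)
raw = pd.Series(["25795.0", "n/a", "1,360"])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
coerced = pd.to_numeric(raw, errors="coerce")

# Count how many entries were lost in the conversion
n_missing = int(coerced.isna().sum())
print(coerced)
print("Values coerced to NaN:", n_missing)
```

On the notebook's DataFrame, a comparable check after the conversion would be `df[['Revenue_Cr', 'RD_Spend_Cr', 'Employees', 'Profit_Cr']].isna().sum()`.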


## ✏️ Step 2: Panel Data Notation
Panel data uses a compact notation in which every observation is indexed by both an entity and a time period:

- **Y<sub>it</sub>** → Dependent variable (e.g. `Revenue_Cr`) for firm `i` in year `t`
- **X<sub>it</sub>** → Independent variables (e.g. `RD_Spend_Cr`, `Employees`) for firm `i` in year `t`
- **i** → Cross-sectional unit (Firm name)
- **t** → Time unit (Year)

This structure allows us to model not just variation across firms, but also how the same firm changes over time.
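
The notation maps naturally onto a pandas `MultiIndex`: indexing by `(Firm, Year)` lets a single observation Y<sub>it</sub> be looked up by its `i` and `t`. This is a small sketch rather than part of the original workflow; the names `toy` and `panel` are illustrative, and the six revenue values are re-typed from the notebook outputs instead of being read from the CSV.

```python
import pandas as pd

# Small illustrative table; values re-typed from the notebook outputs
toy = pd.DataFrame({
    "Firm": ["Tata", "Tata", "Tata", "Infosys", "Infosys", "Infosys"],
    "Year": [2015, 2016, 2017, 2015, 2016, 2017],
    "Revenue_Cr": [25795.0, 27035.935025, 30175.623068,
                   13005.0, 13518.665546, 14012.02215],
})

# Index by (i, t) = (Firm, Year) so each row is one Y_it
panel = toy.set_index(["Firm", "Year"]).sort_index()

# Y_it for i = Tata, t = 2016
y_it = panel.loc[("Tata", 2016), "Revenue_Cr"]
print(y_it)

# All time periods for one firm (fixed i, varying t)
tata_series = panel.loc["Tata", "Revenue_Cr"]
print(tata_series)
```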


## 🔍 Step 3: Panel vs Pooled Data


In **Panel Data**, we track the **same firms** across time.  
In **Pooled Data** (pooled cross-sections), a fresh sample is drawn each year, so a given firm may not reappear and its changes cannot be followed over time.

### 👇 Example:
- Panel: Tata, Infosys, HDFC from 2015 to 2023 — we see how *each* changes.
- Pooled: A new sample of firms every year — different firms may appear/disappear.

Let’s simulate and compare the structures below.


In [2]:
# 📊 Panel Data View: Same firms (Tata, Infosys) over time

# Keep only the Tata and Infosys rows, sorted by firm and then by year
panel_sample = df[df['Firm'].isin(['Tata', 'Infosys'])].sort_values(['Firm', 'Year'])

print("📘 Panel Data Sample (Tata & Infosys over time):")
# display() renders the DataFrame as a formatted table in Jupyter/IPython
display(panel_sample[['Firm', 'Year', 'Revenue_Cr']])

📘 Panel Data Sample (Tata & Infosys over time):


| | Firm | Year | Revenue_Cr |
|---|---|---|---|
| 9 | Infosys | 2015 | 13005.0 |
| 10 | Infosys | 2016 | 13518.665546 |
| 11 | Infosys | 2017 | 14012.02215 |
| 12 | Infosys | 2018 | 14419.647891 |
| 13 | Infosys | 2019 | 15791.409085 |
| 14 | Infosys | 2020 | 16238.490861 |
| 15 | Infosys | 2021 | 16556.855537 |
| 16 | Infosys | 2022 | 15294.525635 |
| 17 | Infosys | 2023 | 18704.18759 |
| 0 | Tata | 2015 | 25795.0 |


In [3]:
# ⚠️ Pooled Data View: a fresh random draw of firms each year, so no firm can be followed reliably
years_to_sample = [2015, 2016, 2017]  # Years to draw from

pooled_rows = []  # One sampled DataFrame per year

for yr in years_to_sample:
    # Draw 2 random rows (firms) from this year's data; seeding with the year
    # makes the draw reproducible across runs
    sample = df[df['Year'] == yr].sample(n=2, random_state=yr)
    pooled_rows.append(sample)

# Stack the yearly samples into one DataFrame, sorted by year and then by firm
pooled_sample = pd.concat(pooled_rows).sort_values(['Year', 'Firm'])

print("⚠️ Pooled Data Sample (different firms in each year):")

# display() renders the DataFrame as a formatted table in Jupyter/IPython
display(pooled_sample[['Firm', 'Year', 'Revenue_Cr']])

⚠️ Pooled Data Sample (different firms in each year):


| | Firm | Year | Revenue_Cr |
|---|---|---|---|
| 9 | Infosys | 2015 | 13005.0 |
| 36 | Wipro | 2015 | 26157.0 |
| 10 | Infosys | 2016 | 13518.665546 |
| 1 | Tata | 2016 | 27035.935025 |
| 2 | Tata | 2017 | 30175.623068 |
| 38 | Wipro | 2017 | 29164.325913 |
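
One way to make the difference concrete is to check whether every firm is observed in every year. The helper `is_balanced_panel` below is a hypothetical name introduced for this sketch, and the two small tables re-type the firm and year pairs from the samples shown in this step (restricted to 2015 to 2017) rather than reusing `df`.

```python
import pandas as pd

def is_balanced_panel(data, entity="Firm", time="Year"):
    """Return True if every entity appears exactly once in every period."""
    # Rows = entities, columns = periods, cells = number of observations
    counts = data.groupby([entity, time]).size().unstack(fill_value=0)
    return bool((counts == 1).all().all())

# Panel-style sample: the same two firms in all three years
panel_toy = pd.DataFrame({
    "Firm": ["Infosys"] * 3 + ["Tata"] * 3,
    "Year": [2015, 2016, 2017] * 2,
})

# Pooled-style sample: the firm/year pairs drawn by the random sampling
pooled_toy = pd.DataFrame({
    "Firm": ["Infosys", "Wipro", "Infosys", "Tata", "Tata", "Wipro"],
    "Year": [2015, 2015, 2016, 2016, 2017, 2017],
})

panel_ok = is_balanced_panel(panel_toy)
pooled_ok = is_balanced_panel(pooled_toy)
print("Panel sample balanced: ", panel_ok)
print("Pooled sample balanced:", pooled_ok)
```

The check passes for the panel-style sample and fails for the pooled one, because each firm in the pooled sample is missing from at least one year.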


## ✅ Summary: Why Panel Structure Matters


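- Panel datasets allow you to **control for hidden traits** of firms that stay roughly constant over time (like management quality), because each firm serves as its own baseline.
- Panel data helps us **understand changes within the same unit** over time.
- Pooled data may look like panel data, but it **does not follow the same entities across time**, so within-firm changes cannot be measured.

This distinction matters for causal inference: only with repeated observations of the same firm can a model separate the effect of a variable such as R&D spending from firm-specific traits that never appear in the data.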
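
The two panel-only operations behind these points can be sketched in a few lines: a within-firm year-over-year change, and a deviation from each firm's own mean, which removes anything about the firm that is constant over time. This is an illustrative sketch, not a full fixed-effects model; the column names `Revenue_Change` and `Revenue_Within` are assumptions, and the values are re-typed from the notebook outputs.

```python
import pandas as pd

# Small illustrative panel; values re-typed from the notebook outputs
toy = pd.DataFrame({
    "Firm": ["Tata", "Tata", "Tata", "Infosys", "Infosys", "Infosys"],
    "Year": [2015, 2016, 2017, 2015, 2016, 2017],
    "Revenue_Cr": [25795.0, 27035.935025, 30175.623068,
                   13005.0, 13518.665546, 14012.02215],
}).sort_values(["Firm", "Year"])

# Year-over-year change within each firm (NaN for a firm's first year)
toy["Revenue_Change"] = toy.groupby("Firm")["Revenue_Cr"].diff()

# Deviation from the firm's own average: subtracting the firm mean removes
# any firm-level component that does not vary over time
toy["Revenue_Within"] = (
    toy["Revenue_Cr"] - toy.groupby("Firm")["Revenue_Cr"].transform("mean")
)

print(toy)
```

Neither column can be computed from a pooled sample, since both require at least two observations of the same firm.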
