# Feature Engineering: Electric Vehicle Sales by Makers Analysis

This notebook focuses on creating meaningful features from our EV sales by manufacturers dataset to enhance our analytical capabilities. Building on the cleaned data from our previous data cleaning notebook, we'll implement several feature engineering strategies to better understand:

1. Manufacturer performance metrics
2. Temporal patterns in manufacturer performance
3. Market competition dynamics
4. Brand growth indicators
5. Segment-specific features
6. Regional performance estimation


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

In [None]:
# Suppress warnings
warnings.filterwarnings("ignore")

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)

# Load the cleaned datasets created in the data cleaning notebook
timestamp = "20250806"  # Using the timestamp from the files

# Main cleaned maker dataset
df_maker = pd.read_csv(
    f"../../data/processed/ev_sales_by_makers_cleaned_{timestamp}.csv"
)

# Maker monthly share dataset
df_maker_share = pd.read_csv(
    f"../../data/processed/ev_maker_monthly_share_{timestamp}.csv"
)

# Estimated maker state sales dataset
df_estimated_maker_state = pd.read_csv(
    f"../../data/processed/ev_estimated_maker_state_sales_{timestamp}.csv"
)

# Enhanced state dataset
df_state = pd.read_csv(
    f"../../data/processed/ev_sales_by_state_enhanced_{timestamp}.csv"
)

# Convert date columns to datetime
df_maker["date"] = pd.to_datetime(df_maker["date"])
df_maker_share["date"] = pd.to_datetime(df_maker_share["date"])
df_estimated_maker_state["date"] = pd.to_datetime(df_estimated_maker_state["date"])
df_state["date"] = pd.to_datetime(df_state["date"])

# Display basic information about the datasets
print("\n=== Dataset Overview ===")
print("-" * 50)
print(
    f"Time Range: {df_maker['date'].min().strftime('%B %Y')} to {df_maker['date'].max().strftime('%B %Y')}"
)
print(f"Number of Manufacturers: {df_maker['maker'].nunique()}")
print(f"Vehicle Categories: {', '.join(df_maker['vehicle_category'].unique())}")
print(f"Total Records: {len(df_maker)}")


=== Dataset Overview ===
--------------------------------------------------
Time Range: April 2021 to March 2024
Number of Manufacturers: 26
Vehicle Categories: 2-Wheelers, 4-Wheelers
Total Records: 816


## 1. Manufacturer Performance Metrics

Let's create features that help us better understand manufacturer performance in the EV market:

1. Relative market position
2. Market leadership indicators
3. Sales distribution metrics
4. Performance benchmarking


In [4]:
# Working with the main maker dataset
# Create a copy to avoid modifying the original
df_performance = df_maker.copy()

# Calculate manufacturer rank within each vehicle category and time period
df_performance["maker_rank"] = df_performance.groupby(["date", "vehicle_category"])[
    "electric_vehicles_sold"
].rank(ascending=False, method="dense")

# Identify market leaders (rank = 1) for each vehicle category and time period
df_performance["is_market_leader"] = (df_performance["maker_rank"] == 1).astype(int)

# Calculate the distance from market leader (as a percentage of leader's sales)
leader_sales = df_performance.groupby(["date", "vehicle_category"])[
    "electric_vehicles_sold"
].transform("max")
df_performance["leader_gap_pct"] = (
    (leader_sales - df_performance["electric_vehicles_sold"]) / leader_sales * 100
).round(2)
df_performance["leader_gap_pct"] = (
    df_performance["leader_gap_pct"].replace([np.inf, -np.inf], 100).fillna(100)
)

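# Calculate performance relative to the category average, as a fraction
# (e.g. -1.0 means 100% below average); note that leader_gap_pct is a percentage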
avg_sales = df_performance.groupby(["date", "vehicle_category"])[
    "electric_vehicles_sold"
].transform("mean")
df_performance["performance_vs_avg"] = (
    (df_performance["electric_vehicles_sold"] - avg_sales) / avg_sales
).round(2)
df_performance["performance_vs_avg"] = (
    df_performance["performance_vs_avg"].replace([np.inf, -np.inf], 0).fillna(0)
)

# Create market position categories
conditions = [
    (df_performance["maker_rank"] == 1),
    (df_performance["maker_rank"] <= 3),
    (df_performance["maker_rank"] <= 5),
    (df_performance["maker_rank"] <= 10),
]
choices = ["Market Leader", "Top 3", "Top 5", "Top 10"]
df_performance["market_position"] = np.select(conditions, choices, default="Other")

# Display the new features
print("\nManufacturer Performance Metrics:")
print("-" * 50)
display(
    df_performance[
        [
            "date",
            "maker",
            "vehicle_category",
            "electric_vehicles_sold",
            "maker_rank",
            "is_market_leader",
            "leader_gap_pct",
            "performance_vs_avg",
            "market_position",
        ]
    ].head(10)
)

# Show the top manufacturers by category
print("\nTop Manufacturers by Vehicle Category (Overall):")
print("-" * 50)
top_makers = (
    df_performance.groupby(["vehicle_category", "maker"])["electric_vehicles_sold"]
    .sum()
    .reset_index()
)
top_makers = top_makers.sort_values(
    ["vehicle_category", "electric_vehicles_sold"], ascending=[True, False]
)

for category in top_makers["vehicle_category"].unique():
    print(f"\n{category} Top 5 Manufacturers:")
    display(top_makers[top_makers["vehicle_category"] == category].head(5))


Manufacturer Performance Metrics:
--------------------------------------------------


index,date,maker,vehicle_category,electric_vehicles_sold,maker_rank,is_market_leader,leader_gap_pct,performance_vs_avg,market_position
0,2021-04-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
1,2022-04-01,OKAYA EV,2-Wheelers,0,13.0,0,100.0,-1.0,Other
2,2021-05-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
3,2021-06-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
4,2021-07-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
5,2021-08-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
6,2021-09-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
7,2021-10-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
8,2021-11-01,OLA ELECTRIC,2-Wheelers,0,12.0,0,100.0,-1.0,Other
9,2021-04-01,BYD India,4-Wheelers,0,6.0,0,100.0,-1.0,Top 10



Top Manufacturers by Vehicle Category (Overall):
--------------------------------------------------

2-Wheelers Top 5 Manufacturers:


index,vehicle_category,maker,electric_vehicles_sold
11,2-Wheelers,OLA ELECTRIC,489473
15,2-Wheelers,TVS,272575
1,2-Wheelers,ATHER,204449
6,2-Wheelers,HERO ELECTRIC,170394
0,2-Wheelers,AMPERE,167274



4-Wheelers Top 5 Manufacturers:


index,vehicle_category,maker,electric_vehicles_sold
24,4-Wheelers,Tata Motors,88935
21,4-Wheelers,Mahindra and Mahindra,41193
20,4-Wheelers,MG Motor,13753
17,4-Wheelers,BYD India,2419
18,4-Wheelers,Hyundai Motor,2076
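
In the sample output above, BYD India has zero sales in April 2021 yet is labeled "Top 10", because dense ranking assigns a rank to every row in a small segment. The following is a minimal sketch of one possible adjustment, assuming zero-sales rows should be left unranked. The toy data, the `rank_active_makers` helper, and the "No Sales" label are hypothetical and not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def rank_active_makers(df: pd.DataFrame) -> pd.DataFrame:
    """Rank makers per (date, vehicle_category), leaving zero-sales rows unranked."""
    out = df.copy()
    # Mask zero sales so they receive NaN instead of a rank
    out["active_sales"] = out["electric_vehicles_sold"].where(
        out["electric_vehicles_sold"] > 0
    )
    out["maker_rank"] = out.groupby(["date", "vehicle_category"])[
        "active_sales"
    ].rank(ascending=False, method="dense")

    # Check for missing ranks first so that inactive makers get their own label
    conditions = [
        out["maker_rank"].isna(),
        out["maker_rank"] == 1,
        out["maker_rank"] <= 3,
    ]
    choices = ["No Sales", "Market Leader", "Top 3"]
    out["market_position"] = np.select(conditions, choices, default="Other")
    return out.drop(columns="active_sales")


# Hypothetical single-month segment with two inactive makers
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-04-01"] * 4),
        "maker": ["MAKER_A", "MAKER_B", "MAKER_C", "MAKER_D"],
        "vehicle_category": ["4-Wheelers"] * 4,
        "electric_vehicles_sold": [322, 118, 0, 0],
    }
)
ranked = rank_active_makers(toy)
print(ranked[["maker", "electric_vehicles_sold", "maker_rank", "market_position"]])
```

Whether inactive makers should be ranked is a modeling decision; keeping them ranked, as the notebook does, preserves a complete panel, while masking them avoids position labels that overstate a maker's standing.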


## 2. Temporal Growth Features

Now let's create features to analyze growth patterns and temporal dynamics for each manufacturer:

1. Growth rates and momentum indicators
2. Seasonality metrics
3. Cumulative and relative growth
4. Time-based ranking shifts


In [10]:
# Create a copy of the DataFrame
df_growth = df_maker.copy()

# Sort the data for time series calculations
df_growth = df_growth.sort_values(["maker", "vehicle_category", "date"])

# Add month and quarter info
df_growth["year"] = df_growth["date"].dt.year
df_growth["month"] = df_growth["date"].dt.month
df_growth["quarter"] = df_growth["date"].dt.quarter
df_growth["month_name"] = df_growth["date"].dt.strftime("%B")

# Calculate maker rank within each vehicle category and time period
df_growth["maker_rank"] = df_growth.groupby(["date", "vehicle_category"])[
    "electric_vehicles_sold"
].rank(ascending=False, method="dense")

# Calculate months since first appearance for each maker
first_appearance = df_growth.groupby("maker")["date"].transform("min")
df_growth["months_in_market"] = (
    (df_growth["date"].dt.year - first_appearance.dt.year) * 12
    + (df_growth["date"].dt.month - first_appearance.dt.month)
    + 1
)

# Calculate growth rates
df_growth["monthly_growth_rate"] = (
    df_growth.groupby(["maker", "vehicle_category"])["electric_vehicles_sold"]
    .pct_change()
    .fillna(0)
)
df_growth["monthly_growth_rate"] = (
    df_growth["monthly_growth_rate"].replace([np.inf, -np.inf], np.nan).fillna(0)
)

# Calculate rolling statistics for each maker (3-month windows)
df_growth["rolling_3m_avg"] = df_growth.groupby(["maker", "vehicle_category"])[
    "electric_vehicles_sold"
].transform(lambda x: x.rolling(window=3, min_periods=1).mean())
df_growth["rolling_3m_std"] = df_growth.groupby(["maker", "vehicle_category"])[
    "electric_vehicles_sold"
].transform(lambda x: x.rolling(window=3, min_periods=1).std())
df_growth["rolling_3m_growth"] = df_growth.groupby(["maker", "vehicle_category"])[
    "monthly_growth_rate"
].transform(lambda x: x.rolling(window=3, min_periods=1).mean())

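# Calculate momentum indicators: compare the current 3-month average with the
# 3-month average from three months earlier. The earlier window is NaN for a
# maker's first three months, so those rows default to "Negative".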
df_growth["sales_momentum"] = np.where(
    df_growth["rolling_3m_avg"]
    > df_growth.groupby(["maker", "vehicle_category"])[
        "electric_vehicles_sold"
    ].transform(lambda x: x.shift(3).rolling(window=3, min_periods=1).mean()),
    "Positive",
    "Negative",
)

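# Calculate year-over-year growth by comparing each month with the same calendar
# month one year earlier (assumes no missing months per maker and category).
# Rows without a prior-year value are filled with 0 below.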
df_growth["yoy_sales"] = df_growth.groupby(["maker", "vehicle_category", "month"])[
    "electric_vehicles_sold"
].transform(lambda x: x / x.shift(1) - 1)
df_growth["yoy_sales"] = (
    df_growth["yoy_sales"].replace([np.inf, -np.inf], np.nan).fillna(0)
)

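# Calculate rank changes over time (month to month).
# A maker's first month has no previous rank, so it is filled with the current
# rank; rank_change is then 0 and the row is labeled "Stable", which means the
# "New" default below is never reached.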
df_growth["prev_month_rank"] = (
    df_growth.groupby(["maker", "vehicle_category"])["maker_rank"]
    .shift(1)
    .fillna(df_growth["maker_rank"])
)
df_growth["rank_change"] = df_growth["prev_month_rank"] - df_growth["maker_rank"]
df_growth["rank_movement"] = np.select(
    [
        df_growth["rank_change"] > 0,
        df_growth["rank_change"] < 0,
        df_growth["rank_change"] == 0,
    ],
    ["Improved", "Declined", "Stable"],
    default="New",
)

# Display the new features
print("\nManufacturer Growth Metrics:")
print("-" * 50)
display(
    df_growth[
        [
            "date",
            "maker",
            "vehicle_category",
            "electric_vehicles_sold",
            "months_in_market",
            "monthly_growth_rate",
            "rolling_3m_avg",
            "sales_momentum",
            "yoy_sales",
            "rank_movement",
        ]
    ].head(15)
)


Manufacturer Growth Metrics:
--------------------------------------------------


index,date,maker,vehicle_category,electric_vehicles_sold,months_in_market,monthly_growth_rate,rolling_3m_avg,sales_momentum,yoy_sales,rank_movement
371,2021-04-01,AMPERE,2-Wheelers,751,1,0.0,751.0,Negative,0.0,Stable
407,2021-05-01,AMPERE,2-Wheelers,147,2,-0.804261,449.0,Negative,0.0,Declined
444,2021-06-01,AMPERE,2-Wheelers,299,3,1.034014,399.0,Negative,0.0,Declined
481,2021-07-01,AMPERE,2-Wheelers,663,4,1.217391,369.666667,Negative,0.0,Declined
518,2021-08-01,AMPERE,2-Wheelers,810,5,0.221719,590.666667,Positive,0.0,Stable
555,2021-09-01,AMPERE,2-Wheelers,807,6,-0.003704,760.0,Positive,0.0,Improved
592,2021-10-01,AMPERE,2-Wheelers,1083,7,0.342007,900.0,Positive,0.0,Stable
629,2021-11-01,AMPERE,2-Wheelers,2077,8,0.917821,1322.333333,Positive,0.0,Improved
666,2021-12-01,AMPERE,2-Wheelers,3410,9,0.641791,2190.0,Positive,0.0,Improved
704,2022-01-01,AMPERE,2-Wheelers,4366,10,0.280352,3284.333333,Positive,0.0,Stable
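
The year-over-year feature above groups by calendar month and compares consecutive years. As a cross-check, the same quantity can be sketched with a 12-period `pct_change`, assuming each maker and category has one row per month with no gaps. The toy series and column names `yoy_growth` and `yoy_by_month` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical maker with 24 consecutive months of steadily rising sales
toy = pd.DataFrame(
    {
        "date": pd.date_range("2021-04-01", periods=24, freq="MS"),
        "maker": "MAKER_A",
        "vehicle_category": "2-Wheelers",
        "electric_vehicles_sold": np.arange(100, 340, 10),
    }
).sort_values(["maker", "vehicle_category", "date"])

# Approach 1: compare each row with the row 12 months earlier
toy["yoy_growth"] = toy.groupby(["maker", "vehicle_category"])[
    "electric_vehicles_sold"
].pct_change(periods=12)

# Approach 2: group by calendar month and compare consecutive years
toy["month"] = toy["date"].dt.month
toy["yoy_by_month"] = toy.groupby(["maker", "vehicle_category", "month"])[
    "electric_vehicles_sold"
].transform(lambda x: x / x.shift(1) - 1)

# The first 12 months have no prior-year value and stay NaN in this sketch
print(toy[["date", "electric_vehicles_sold", "yoy_growth", "yoy_by_month"]].tail(13))
```

Leaving the first year as NaN, rather than filling it with 0, keeps "no prior-year data" distinguishable from "flat year over year"; which convention suits downstream models is a judgment call.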


## 3. Market Share and Competition Features

Let's analyze the competitive landscape with features that track market share dynamics and competitive positioning:

1. Market share evolution
2. Category dominance metrics
3. Competition intensity indicators
4. Market concentration metrics


In [None]:
# Working with the market share dataset
df_competition = df_maker_share.copy()

# Calculate the Herfindahl-Hirschman Index (HHI) for market concentration.
# HHI is the sum of squared market shares, expressed as fractions (0-1).
# Despite its name, "maker_share_percent" already holds fractions, not percentages.
df_competition["market_share_fraction"] = df_competition["maker_share_percent"]
df_competition["squared_market_share"] = df_competition["market_share_fraction"] ** 2

# Calculate HHI for each time period and vehicle category
hhi_df = (
    df_competition.groupby(["date", "vehicle_category"])["squared_market_share"]
    .sum()
    .reset_index()
)
hhi_df.rename(columns={"squared_market_share": "hhi_index"}, inplace=True)
df_competition = pd.merge(df_competition, hhi_df, on=["date", "vehicle_category"])

# Classify market concentration based on HHI
# < 0.15: Unconcentrated, 0.15-0.25: Moderately Concentrated, > 0.25: Highly Concentrated
df_competition["market_concentration"] = pd.cut(
    df_competition["hhi_index"],
    bins=[-float("inf"), 0.15, 0.25, float("inf")],
    labels=["Unconcentrated", "Moderately Concentrated", "Highly Concentrated"],
)

# Calculate number of effective competitors (1/HHI)
df_competition["effective_competitors"] = (1 / df_competition["hhi_index"]).round(1)

# Calculate market share tiers
df_competition["market_share_tier"] = pd.cut(
    df_competition["market_share_fraction"],
    bins=[-float("inf"), 0.05, 0.1, 0.2, 0.3, float("inf")],
    labels=["<5%", "5-10%", "10-20%", "20-30%", ">30%"],
)

# Calculate market share changes over time
df_competition_sorted = df_competition.sort_values(
    ["vehicle_category", "maker", "date"]
)
df_competition_sorted["prev_market_share"] = df_competition_sorted.groupby(
    ["vehicle_category", "maker"]
)["market_share_fraction"].shift(1)
df_competition_sorted["market_share_change"] = (
    df_competition_sorted["market_share_fraction"]
    - df_competition_sorted["prev_market_share"]
)
df_competition_sorted["market_share_change_pct"] = (
    df_competition_sorted["market_share_change"]
    / df_competition_sorted["prev_market_share"]
    * 100
).fillna(0)
df_competition_sorted["market_share_change_pct"] = df_competition_sorted[
    "market_share_change_pct"
].replace([np.inf, -np.inf], 0)

# Calculate share change status
df_competition_sorted["share_change_status"] = np.select(
    [
        df_competition_sorted["market_share_change"] > 0.02,
        df_competition_sorted["market_share_change"] > 0,
        df_competition_sorted["market_share_change"] < -0.02,
        df_competition_sorted["market_share_change"] < 0,
    ],
    ["Strong Gain", "Slight Gain", "Strong Loss", "Slight Loss"],
    default="Stable",
)

# Display the new features
print("\nMarket Competition Metrics:")
print("-" * 50)
display(
    df_competition_sorted[
        [
            "date",
            "vehicle_category",
            "maker",
            "electric_vehicles_sold",
            "market_share_fraction",
            "hhi_index",
            "market_concentration",
            "effective_competitors",
            "market_share_tier",
            "market_share_change",
            "share_change_status",
        ]
    ].head(15)
)


Market Competition Metrics:
--------------------------------------------------


index,date,vehicle_category,maker,electric_vehicles_sold,market_share_fraction,hhi_index,market_concentration,effective_competitors,market_share_tier,market_share_change,share_change_status
0,2021-04-01,2-Wheelers,AMPERE,751,0.132009,0.14096,Unconcentrated,7.1,10-20%,,Stable
22,2021-05-01,2-Wheelers,AMPERE,147,0.118357,0.145179,Unconcentrated,6.9,10-20%,-0.013652,Slight Loss
44,2021-06-01,2-Wheelers,AMPERE,299,0.064039,0.16734,Moderately Concentrated,6.0,5-10%,-0.054318,Strong Loss
66,2021-07-01,2-Wheelers,AMPERE,663,0.045176,0.167943,Moderately Concentrated,6.0,<5%,-0.018864,Slight Loss
88,2021-08-01,2-Wheelers,AMPERE,810,0.050467,0.182248,Moderately Concentrated,5.5,5-10%,0.005291,Slight Gain
110,2021-09-01,2-Wheelers,AMPERE,807,0.044754,0.206342,Moderately Concentrated,4.8,<5%,-0.005714,Slight Loss
132,2021-10-01,2-Wheelers,AMPERE,1083,0.052035,0.182304,Moderately Concentrated,5.5,5-10%,0.007281,Slight Gain
154,2021-11-01,2-Wheelers,AMPERE,2077,0.084658,0.177029,Moderately Concentrated,5.6,5-10%,0.032623,Strong Gain
176,2021-12-01,2-Wheelers,AMPERE,3410,0.12807,0.152122,Moderately Concentrated,6.6,10-20%,0.043412,Strong Gain
198,2022-01-01,2-Wheelers,AMPERE,4366,0.144944,0.154775,Moderately Concentrated,6.5,10-20%,0.016874,Slight Gain
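
To make the HHI and effective-competitors columns concrete, here is a small worked sketch on two hypothetical markets. The `hhi` helper and the share values are illustrative assumptions; the bins and labels mirror the ones used in the cell above.

```python
import pandas as pd


def hhi(shares: pd.Series) -> float:
    """Sum of squared market shares; shares must be fractions that sum to ~1."""
    if abs(shares.sum() - 1.0) > 1e-6:
        raise ValueError("shares should be fractions summing to 1, not percentages")
    return float((shares**2).sum())


# Hypothetical markets: one dominated by a single maker, one evenly split
concentrated = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})
balanced = pd.Series({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})

hhi_concentrated = hhi(concentrated)  # 0.25 + 0.09 + 0.04
hhi_balanced = hhi(balanced)  # 4 * 0.0625

# 1 / HHI is the number of equal-sized firms that would yield the same HHI
effective_concentrated = 1 / hhi_concentrated
effective_balanced = 1 / hhi_balanced

# pd.cut uses right-inclusive bins, so an HHI of exactly 0.25 falls in the middle tier
labels = pd.cut(
    [hhi_concentrated, hhi_balanced],
    bins=[-float("inf"), 0.15, 0.25, float("inf")],
    labels=["Unconcentrated", "Moderately Concentrated", "Highly Concentrated"],
)
print(list(labels), round(effective_concentrated, 2), effective_balanced)
```

The sum check in `hhi` guards against feeding percentages (0-100) into the formula, which would put the index on a 0-10,000 scale and make the 0.15 and 0.25 thresholds meaningless.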


## 4. Vehicle Category Analysis Features

Let's create features that help compare and contrast the 2-wheeler and 4-wheeler segments:

1. Segment specialization metrics
2. Dual-segment presence indicators
3. Segment performance comparisons
4. Segment-specific growth rates


In [None]:
# Identify manufacturers present in both 2W and 4W segments
makers_per_category = (
    df_maker.groupby(["maker", "vehicle_category"]).size().unstack(fill_value=0)
)
makers_per_category = makers_per_category.reset_index()

if (
    "2-Wheelers" in makers_per_category.columns
    and "4-Wheelers" in makers_per_category.columns
):
    makers_per_category["dual_segment"] = np.where(
        (makers_per_category["2-Wheelers"] > 0)
        & (makers_per_category["4-Wheelers"] > 0),
        1,
        0,
    )
else:
    # Handle cases where one of the categories might be missing
    present_categories = [col for col in makers_per_category.columns if col != "maker"]
    makers_per_category["dual_segment"] = 0  # Default to 0
    if len(present_categories) > 1:
        # Count the categories in which each maker has at least one record
        makers_per_category["dual_segment"] = np.where(
            (makers_per_category[present_categories] > 0).sum(axis=1) > 1, 1, 0
        )

# Create a dictionary mapping maker to dual_segment status
dual_segment_dict = dict(
    zip(makers_per_category["maker"], makers_per_category["dual_segment"])
)

# Add this back to the main dataframe
df_maker["dual_segment_presence"] = df_maker["maker"].map(dual_segment_dict)

# Calculate the primary segment for each manufacturer
maker_segment_sales = (
    df_maker.groupby(["maker", "vehicle_category"])["electric_vehicles_sold"]
    .sum()
    .reset_index()
)
maker_total_sales = (
    maker_segment_sales.groupby("maker")["electric_vehicles_sold"].sum().reset_index()
)
maker_total_sales.rename(
    columns={"electric_vehicles_sold": "total_sales"}, inplace=True
)

# Merge to get total sales
maker_segment_sales = pd.merge(maker_segment_sales, maker_total_sales, on="maker")

# Calculate segment contribution
maker_segment_sales["segment_contribution"] = (
    maker_segment_sales["electric_vehicles_sold"] / maker_segment_sales["total_sales"]
).round(2)

# Find primary segment (highest contribution)
primary_segment = maker_segment_sales.loc[
    maker_segment_sales.groupby("maker")["segment_contribution"].idxmax()
]
primary_segment = primary_segment[["maker", "vehicle_category", "segment_contribution"]]
primary_segment.columns = ["maker", "primary_segment", "primary_segment_contribution"]

# Merge these features back into the original dataframe
df_segment = df_maker.copy()
df_segment = pd.merge(df_segment, primary_segment, on="maker")

# Calculate if currently operating in primary segment
df_segment["in_primary_segment"] = (
    df_segment["vehicle_category"] == df_segment["primary_segment"]
).astype(int)

# Calculate segment specialization score
# 1.0 = fully specialized in one segment, 0.5 = equal sales in both segments
df_segment["segment_specialization"] = np.where(
    df_segment["dual_segment_presence"] == 1,
    df_segment["primary_segment_contribution"],
    1.0,
)

# Display the new features
print("\nVehicle Category Analysis Metrics:")
print("-" * 50)
display(
    df_segment[
        [
            "maker",
            "vehicle_category",
            "electric_vehicles_sold",
            "dual_segment_presence",
            "primary_segment",
            "primary_segment_contribution",
            "in_primary_segment",
            "segment_specialization",
        ]
    ]
    .drop_duplicates(subset=["maker"])
    .head(15)
)

# Calculate segment-specific metrics
segment_metrics = (
    df_maker.groupby(["date", "vehicle_category"])
    .agg(
        total_sales=("electric_vehicles_sold", "sum"),
        avg_sales=("electric_vehicles_sold", "mean"),
        max_sales=("electric_vehicles_sold", "max"),
        min_sales=("electric_vehicles_sold", "min"),
        makers_count=("maker", "nunique"),
    )
    .reset_index()
)

# Calculate segment growth rates
segment_metrics_sorted = segment_metrics.sort_values(["vehicle_category", "date"])
segment_metrics_sorted["prev_total_sales"] = segment_metrics_sorted.groupby(
    "vehicle_category"
)["total_sales"].shift(1)
segment_metrics_sorted["segment_growth"] = (
    (
        segment_metrics_sorted["total_sales"]
        / segment_metrics_sorted["prev_total_sales"]
        - 1
    )
    .fillna(0)
    .replace([np.inf, -np.inf], 0)
)

# Compare segments
print("\nSegment Comparison Metrics:")
print("-" * 50)
display(segment_metrics_sorted.head(15))


Vehicle Category Analysis Metrics:
--------------------------------------------------


index,maker,vehicle_category,electric_vehicles_sold,dual_segment_presence,primary_segment,primary_segment_contribution,in_primary_segment,segment_specialization
0,OLA ELECTRIC,2-Wheelers,0,0,2-Wheelers,1.0,1,1.0
1,OKAYA EV,2-Wheelers,0,0,2-Wheelers,1.0,1,1.0
9,BYD India,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
10,PCA Automobiles,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
11,BMW India,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
12,Volvo Auto India,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
13,KIA Motors,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
20,Mercedes-Benz AG,4-Wheelers,0,0,4-Wheelers,1.0,1,1.0
88,Tata Motors,4-Wheelers,322,0,4-Wheelers,1.0,1,1.0
89,MG Motor,4-Wheelers,118,0,4-Wheelers,1.0,1,1.0



Segment Comparison Metrics:
--------------------------------------------------


index,date,vehicle_category,total_sales,avg_sales,max_sales,min_sales,makers_count,prev_total_sales,segment_growth
0,2021-04-01,2-Wheelers,5689,474.083333,1251,0,12,,0.0
2,2021-05-01,2-Wheelers,1242,103.5,260,0,12,5689.0,-0.781684
4,2021-06-01,2-Wheelers,4669,389.083333,1355,0,12,1242.0,2.759259
6,2021-07-01,2-Wheelers,14676,1223.0,4557,0,12,4669.0,2.143286
8,2021-08-01,2-Wheelers,16050,1337.5,5527,0,12,14676.0,0.093622
10,2021-09-01,2-Wheelers,18032,1502.666667,6727,0,12,16050.0,0.123489
12,2021-10-01,2-Wheelers,20813,1734.416667,6799,0,12,18032.0,0.154226
14,2021-11-01,2-Wheelers,24534,2044.5,7493,0,12,20813.0,0.178782
16,2021-12-01,2-Wheelers,26626,2218.833333,6402,240,12,24534.0,0.085269
18,2022-01-01,2-Wheelers,30122,2510.166667,8238,535,12,26626.0,0.1313
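
Every maker in the sample output is single-segment, so the dual-segment logic is never exercised there. The sketch below shows how the dual-segment flag and the specialization score behave on hypothetical data that does include a dual-segment maker. The makers, sales figures, and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data: MAKER_A sells in both segments, MAKER_B and MAKER_C in one each
toy = pd.DataFrame(
    {
        "maker": ["MAKER_A", "MAKER_A", "MAKER_A", "MAKER_B", "MAKER_B", "MAKER_C"],
        "vehicle_category": [
            "2-Wheelers", "2-Wheelers", "4-Wheelers",
            "2-Wheelers", "2-Wheelers", "4-Wheelers",
        ],
        "electric_vehicles_sold": [100, 150, 50, 80, 90, 30],
    }
)

# Count distinct categories per maker (not rows): MAKER_B has two rows but one category
segments_per_maker = toy.groupby("maker")["vehicle_category"].nunique()
dual_segment = (segments_per_maker > 1).astype(int)

# Share of each maker's total sales that comes from each segment
segment_sales = toy.groupby(["maker", "vehicle_category"])["electric_vehicles_sold"].sum()
contribution = segment_sales / segment_sales.groupby(level="maker").transform("sum")

# Specialization is the largest segment share: 1.0 means a single-segment maker
specialization = contribution.groupby(level="maker").max()
print(dual_segment.to_dict())
print(specialization.round(2).to_dict())
```

Counting distinct categories rather than rows matters for makers like MAKER_B here: a row-count threshold would wrongly flag them as dual-segment.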


## 5. Regional Market Estimation Features

Using the estimated state-wise sales dataset, let's create features to understand regional market performance:

1. Geographic concentration metrics
2. State-wise market position indicators
3. Regional preference patterns
4. Geographic expansion metrics


In [None]:
# Working with the estimated maker state sales
df_regional = df_estimated_maker_state.copy()

# Calculate total estimated sales per maker
maker_total_est_sales = (
    df_regional.groupby(["date", "maker", "vehicle_category"])["estimated_ev_sales"]
    .sum()
    .reset_index()
)
maker_total_est_sales.rename(
    columns={"estimated_ev_sales": "total_est_sales"}, inplace=True
)

# Merge back to get total estimated sales
df_regional = pd.merge(
    df_regional, maker_total_est_sales, on=["date", "maker", "vehicle_category"]
)

# Calculate state contribution for each manufacturer (what % of sales comes from each state)
df_regional["state_contribution"] = (
    df_regional["estimated_ev_sales"] / df_regional["total_est_sales"]
).fillna(0).round(4) * 100

# Count the number of states where each maker has sales
state_count = (
    df_regional[df_regional["estimated_ev_sales"] > 0]
    .groupby(["date", "maker", "vehicle_category"])["state"]
    .nunique()
    .reset_index()
)
state_count.rename(columns={"state": "active_states"}, inplace=True)

# Calculate geographic concentration (Herfindahl-Hirschman Index for state distribution)
df_regional["state_share_squared"] = (df_regional["state_contribution"] / 100) ** 2
geo_hhi = (
    df_regional.groupby(["date", "maker", "vehicle_category"])["state_share_squared"]
    .sum()
    .reset_index()
)
geo_hhi.rename(columns={"state_share_squared": "geo_hhi"}, inplace=True)

# Calculate total states
total_states = df_regional["state"].nunique()
print(f"\nTotal number of states in dataset: {total_states}")

# Merge state count and geographic concentration back
df_regional = pd.merge(
    df_regional, state_count, on=["date", "maker", "vehicle_category"], how="left"
)
df_regional = pd.merge(
    df_regional, geo_hhi, on=["date", "maker", "vehicle_category"], how="left"
)
df_regional["active_states"] = df_regional["active_states"].fillna(0)

# Calculate geographic reach (% of states covered)
df_regional["geographic_reach"] = (
    df_regional["active_states"] / total_states * 100
).round(1)

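# Calculate geographic concentration category.
# Note: makers with no estimated sales have geo_hhi == 0 and therefore fall into
# "Very Diversified", even though they have no geographic footprint.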
df_regional["geo_concentration"] = pd.cut(
    df_regional["geo_hhi"],
    bins=[-float("inf"), 0.10, 0.20, 0.30, float("inf")],
    labels=[
        "Very Diversified",
        "Moderately Diversified",
        "Concentrated",
        "Highly Concentrated",
    ],
)

# For each maker, calculate their top state by sales
top_states = df_regional[df_regional["estimated_ev_sales"] > 0].sort_values(
    ["date", "maker", "vehicle_category", "estimated_ev_sales"],
    ascending=[True, True, True, False],
)
top_states = (
    top_states.groupby(["date", "maker", "vehicle_category"]).first().reset_index()
)
top_states = top_states[["date", "maker", "vehicle_category", "state"]]
top_states.rename(columns={"state": "top_state"}, inplace=True)

# Merge back to get top state
df_regional_final = pd.merge(
    df_regional, top_states, on=["date", "maker", "vehicle_category"], how="left"
)
df_regional_final["is_top_state"] = (
    df_regional_final["state"] == df_regional_final["top_state"]
).astype(int)

# Display the new features
print("\nRegional Market Estimation Features:")
print("-" * 50)
display(
    df_regional_final[
        [
            "date",
            "maker",
            "vehicle_category",
            "state",
            "estimated_ev_sales",
            "state_contribution",
            "active_states",
            "geographic_reach",
            "geo_hhi",
            "geo_concentration",
            "top_state",
            "is_top_state",
        ]
    ].head(15)
)

# Calculate regional preference (which makers dominate in which states)
state_leader = (
    df_regional.groupby(["date", "state", "vehicle_category"])
    .apply(
        lambda x: (
            x.loc[x["estimated_ev_sales"].idxmax(), "maker"]
            if len(x) > 0 and x["estimated_ev_sales"].max() > 0
            else None
        )
    )
    .reset_index()
)
state_leader.columns = ["date", "state", "vehicle_category", "leading_maker"]

print("\nState-wise Market Leaders (Sample):")
print("-" * 50)
display(state_leader.head(15))


Total number of states in dataset: 34

Regional Market Estimation Features:
--------------------------------------------------


index,date,maker,vehicle_category,state,estimated_ev_sales,state_contribution,active_states,geographic_reach,geo_hhi,geo_concentration,top_state,is_top_state
0,2021-04-01,AMPERE,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.151155,Moderately Diversified,Karnataka,0
1,2021-04-01,ATHER,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.151946,Moderately Diversified,Karnataka,0
2,2021-04-01,BAJAJ,2-Wheelers,Andaman and Nicobar Islands,0,0.0,13.0,38.2,0.159786,Moderately Diversified,Karnataka,0
3,2021-04-01,BEING,2-Wheelers,Andaman and Nicobar Islands,0,0.0,19.0,55.9,0.154273,Moderately Diversified,Karnataka,0
4,2021-04-01,HERO ELECTRIC,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.151702,Moderately Diversified,Karnataka,0
5,2021-04-01,JITENDRA,2-Wheelers,Andaman and Nicobar Islands,0,0.0,13.0,38.2,0.154335,Moderately Diversified,Karnataka,0
6,2021-04-01,OKINAWA,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.151249,Moderately Diversified,Karnataka,0
7,2021-04-01,OLA ELECTRIC,2-Wheelers,Andaman and Nicobar Islands,0,0.0,0.0,0.0,0.0,Very Diversified,,0
8,2021-04-01,OTHERS,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.151388,Moderately Diversified,Karnataka,0
9,2021-04-01,PURE EV,2-Wheelers,Andaman and Nicobar Islands,0,0.0,24.0,70.6,0.150405,Moderately Diversified,Karnataka,0



State-wise Market Leaders (Sample):
--------------------------------------------------


index,date,state,vehicle_category,leading_maker
0,2021-04-01,Andaman and Nicobar Islands,2-Wheelers,
1,2021-04-01,Andaman and Nicobar Islands,4-Wheelers,Tata Motors
2,2021-04-01,Andhra Pradesh,2-Wheelers,OKINAWA
3,2021-04-01,Andhra Pradesh,4-Wheelers,Tata Motors
4,2021-04-01,Arunachal Pradesh,2-Wheelers,
5,2021-04-01,Arunachal Pradesh,4-Wheelers,
6,2021-04-01,Assam,2-Wheelers,OKINAWA
7,2021-04-01,Assam,4-Wheelers,
8,2021-04-01,Bihar,2-Wheelers,OKINAWA
9,2021-04-01,Bihar,4-Wheelers,Tata Motors
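
In the regional output above, OLA ELECTRIC has no sales in April 2021, ends up with `geo_hhi` of 0.0, and is therefore labeled "Very Diversified". One possible way to avoid that, sketched here under the assumption that makers without sales should carry a missing concentration value, is to keep the share undefined when the total is zero. The toy data and maker names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical single-period data: MAKER_A sells in three states, MAKER_B sells nothing
toy = pd.DataFrame(
    {
        "maker": ["MAKER_A"] * 3 + ["MAKER_B"] * 3,
        "state": ["Karnataka", "Maharashtra", "Kerala"] * 2,
        "estimated_ev_sales": [50, 30, 20, 0, 0, 0],
    }
)

# Replace zero totals with NaN so the state share stays undefined instead of 0
totals = toy.groupby("maker")["estimated_ev_sales"].transform("sum")
toy["state_share"] = toy["estimated_ev_sales"] / totals.replace(0, np.nan)
toy["state_share_squared"] = toy["state_share"] ** 2

# min_count=1 returns NaN when every value in the group is missing
geo_hhi = toy.groupby("maker")["state_share_squared"].sum(min_count=1)

# NaN inputs stay unlabeled after pd.cut
geo_concentration = pd.cut(
    geo_hhi,
    bins=[-float("inf"), 0.10, 0.20, 0.30, float("inf")],
    labels=[
        "Very Diversified",
        "Moderately Diversified",
        "Concentrated",
        "Highly Concentrated",
    ],
)
print(pd.DataFrame({"geo_hhi": geo_hhi, "geo_concentration": geo_concentration}))
```

Downstream code would then need to handle the missing category explicitly, for example by adding a separate "No Sales" label.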


## 6. Advanced Statistical Features

Let's create more advanced statistical features to add deeper analytical capabilities:

1. Z-scores for relative performance
2. Moving averages and volatility measures
3. Seasonal decomposition
4. Growth trend indicators


In [17]:
# Create a copy of the main dataset
df_stats = df_maker.copy()

# Sort the data for time series calculations
df_stats = df_stats.sort_values(["maker", "vehicle_category", "date"])

# Calculate z-scores for sales within each time period and category
df_stats["sales_z_score"] = df_stats.groupby(["date", "vehicle_category"])[
    "electric_vehicles_sold"
].transform(lambda x: (x - x.mean()) / x.std() if x.std() != 0 else 0)

# Calculate monthly growth rate
df_stats["monthly_growth_rate"] = df_stats.groupby(["maker", "vehicle_category"])[
    "electric_vehicles_sold"
].pct_change()
df_stats["monthly_growth_rate"] = df_stats["monthly_growth_rate"].fillna(0)
df_stats["monthly_growth_rate"] = df_stats["monthly_growth_rate"].replace(
    [np.inf, -np.inf], 0
)

# Calculate rolling volatility (standard deviation of growth rates)
df_stats["rolling_volatility"] = df_stats.groupby(["maker", "vehicle_category"])[
    "monthly_growth_rate"
].transform(lambda x: x.rolling(window=3, min_periods=1).std())

# Calculate acceleration (change in growth rate)
df_stats["growth_acceleration"] = df_stats.groupby(["maker", "vehicle_category"])[
    "monthly_growth_rate"
].diff()
df_stats["growth_acceleration"] = df_stats["growth_acceleration"].fillna(0)

# Calculate performance consistency
# (Higher values indicate more consistent growth)
df_stats["consistency_score"] = 1 - df_stats["rolling_volatility"].clip(upper=1)

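# Create stability indicator (combines growth direction with volatility).
# Rows with NaN volatility (a maker's first month) fail every comparison and
# fall through to "Volatile Decline".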
df_stats["stability_indicator"] = np.where(
    (df_stats["monthly_growth_rate"] > 0) & (df_stats["rolling_volatility"] < 0.2),
    "Stable Growth",
    np.where(
        (df_stats["monthly_growth_rate"] > 0) & (df_stats["rolling_volatility"] >= 0.2),
        "Volatile Growth",
        np.where(
            (df_stats["monthly_growth_rate"] <= 0)
            & (df_stats["rolling_volatility"] < 0.2),
            "Stable Decline",
            "Volatile Decline",
        ),
    ),
)

# Calculate long-term trend (6-month moving average)
df_stats["long_term_trend"] = df_stats.groupby(["maker", "vehicle_category"])[
    "electric_vehicles_sold"
].transform(lambda x: x.rolling(window=6, min_periods=1).mean())

# Create trend direction indicator
df_stats["trend_direction"] = np.where(
    df_stats["electric_vehicles_sold"] > df_stats["long_term_trend"],
    "Above Trend",
    "Below Trend",
)

# Display the new features
print("\nAdvanced Statistical Features:")
print("-" * 50)
display(
    df_stats[
        [
            "date",
            "maker",
            "vehicle_category",
            "electric_vehicles_sold",
            "sales_z_score",
            "rolling_volatility",
            "growth_acceleration",
            "consistency_score",
            "stability_indicator",
            "long_term_trend",
            "trend_direction",
        ]
    ].head(15)
)


Advanced Statistical Features:
--------------------------------------------------


| | date | maker | vehicle_category | electric_vehicles_sold | sales_z_score | rolling_volatility | growth_acceleration | consistency_score | stability_indicator | long_term_trend | trend_direction |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 371 | 2021-04-01 | AMPERE | 2-Wheelers | 751 | 0.672506 | NaN | 0.0 | NaN | Volatile Decline | 751.0 | Below Trend |
| 407 | 2021-05-01 | AMPERE | 2-Wheelers | 147 | 0.467099 | 0.568698 | -0.804261 | 0.431302 | Volatile Decline | 449.0 | Below Trend |
| 444 | 2021-06-01 | AMPERE | 2-Wheelers | 299 | -0.22078 | 0.921527 | 1.838275 | 0.078473 | Volatile Growth | 399.0 | Below Trend |
| 481 | 2021-07-01 | AMPERE | 2-Wheelers | 663 | -0.435077 | 1.118031 | 0.183378 | 0.0 | Volatile Growth | 465.0 | Above Trend |
| 518 | 2021-08-01 | AMPERE | 2-Wheelers | 810 | -0.346589 | 0.529907 | -0.995672 | 0.470093 | Volatile Growth | 534.0 | Above Trend |
| 555 | 2021-09-01 | AMPERE | 2-Wheelers | 807 | -0.364826 | 0.649776 | -0.225423 | 0.350224 | Volatile Decline | 579.5 | Above Trend |
| 592 | 2021-10-01 | AMPERE | 2-Wheelers | 1083 | -0.329965 | 0.1755 | 0.345711 | 0.8245 | Stable Growth | 634.833333 | Above Trend |
| 629 | 2021-11-01 | AMPERE | 2-Wheelers | 2077 | 0.014353 | 0.465526 | 0.575813 | 0.534474 | Volatile Growth | 956.5 | Above Trend |
| 666 | 2021-12-01 | AMPERE | 2-Wheelers | 3410 | 0.565726 | 0.287988 | -0.27603 | 0.712012 | Volatile Growth | 1475.0 | Above Trend |
| 704 | 2022-01-01 | AMPERE | 2-Wheelers | 4366 | 0.764498 | 0.319687 | -0.361439 | 0.680313 | Volatile Growth | 2092.166667 | Above Trend |
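
The first row of the output has a missing `rolling_volatility` and is still labeled "Volatile Decline". The sketch below reproduces that behavior on a toy series and shows one possible way to label such rows explicitly. The growth values and the "Insufficient History" label are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy growth-rate series for one maker/segment group (illustrative values).
growth = pd.Series([0.0, -0.80, 1.03, 1.22])

# Same call as in the cell above: a 3-month window that emits from the first row.
volatility = growth.rolling(window=3, min_periods=1).std()

# The sample standard deviation needs two observations, so row 0 is NaN.
first_is_nan = bool(np.isnan(volatility.iloc[0]))

# Comparisons against NaN are always False, so neither `< 0.2` nor `>= 0.2`
# matches and the row falls through to the default label.
# Hypothetical alternative: check for missing volatility first.
labels = np.select(
    [
        volatility.isna(),
        (growth > 0) & (volatility < 0.2),
        (growth > 0) & (volatility >= 0.2),
        (growth <= 0) & (volatility < 0.2),
    ],
    ["Insufficient History", "Stable Growth", "Volatile Growth", "Stable Decline"],
    default="Volatile Decline",
)
print(first_is_nan, labels.tolist())
```

Whether to relabel, drop, or keep these first-month rows depends on how the downstream analysis treats launch months.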


## 7. Integrating All Features and Export

Finally, we combine the features built above into three integrated datasets and export them for further analysis:

1. Manufacturer performance dataset
2. Temporal performance dataset
3. Regional analysis dataset
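
The integration step relies on left merges, so a duplicated key on the right-hand side would silently multiply rows. A minimal sketch of guarding against that with the `validate` and `indicator` arguments of `pd.merge` follows. The toy frames and the maker name `MAKER_B` are hypothetical stand-ins, not data from this notebook.

```python
import pandas as pd

# Toy stand-ins for the performance and segment frames (illustrative values).
left = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-04-01", "2021-05-01", "2021-04-01"]),
        "maker": ["AMPERE", "AMPERE", "MAKER_B"],  # MAKER_B is hypothetical
        "vehicle_category": ["2-Wheelers"] * 3,
        "electric_vehicles_sold": [751, 147, 10],
    }
)
right = pd.DataFrame(
    {
        "maker": ["AMPERE"],
        "vehicle_category": ["2-Wheelers"],
        "primary_segment": ["2-Wheelers"],
    }
)

merged = pd.merge(
    left,
    right,
    on=["maker", "vehicle_category"],
    how="left",
    validate="many_to_one",  # raises MergeError if `right` has duplicate keys
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
)

# A left merge against unique keys must preserve the left row count.
assert len(merged) == len(left)

# Rows without a match keep NaN in the merged-in columns.
unmatched = int((merged["_merge"] == "left_only").sum())
print(unmatched)
```

Checking the row count and the number of unmatched rows after each merge makes it easier to spot key mismatches before the export.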


In [22]:
# Create a timestamp for versioning the exports
# (note: this overwrites the `timestamp` used to load the input files above)
timestamp = datetime.now().strftime("%Y%m%d")

# 1. Create manufacturer performance dataset
# Select key columns from the performance metrics
performance_cols = [
    "date",
    "year",
    "month",
    "maker",
    "vehicle_category",
    "electric_vehicles_sold",
    "cumulative_sales",
    "maker_rank",
    "market_position",
    "leader_gap_pct",
    "performance_vs_avg",
]

# Select from market share features
competition_cols = [
    "market_share_fraction",
    "hhi_index",
    "market_concentration",
    "effective_competitors",
    "market_share_tier",
    "share_change_status",
]

# Select from segment features
segment_cols = [
    "dual_segment_presence",
    "primary_segment",
    "primary_segment_contribution",
    "segment_specialization",
]

# Create the integrated manufacturer performance dataset
df_manufacturer = df_performance[performance_cols].copy()

# Merge with competition metrics
df_competition_selected = df_competition_sorted[
    ["date", "maker", "vehicle_category"] + competition_cols
]
df_manufacturer = pd.merge(
    df_manufacturer,
    df_competition_selected,
    on=["date", "maker", "vehicle_category"],
    how="left",
)

# Merge with segment metrics
df_segment_selected = df_segment[
    ["maker", "vehicle_category"] + segment_cols
].drop_duplicates()
df_manufacturer = pd.merge(
    df_manufacturer, df_segment_selected, on=["maker", "vehicle_category"], how="left"
)

# 2. Create temporal analysis dataset
# Select key temporal columns
temporal_cols = [
    "date",
    "maker",
    "vehicle_category",
    "electric_vehicles_sold",
    "monthly_growth_rate",
    "sales_z_score",
    "rolling_volatility",
    "stability_indicator",
    "long_term_trend",
    "trend_direction",
]

df_temporal = df_stats[temporal_cols].copy()

# 3. Create regional analysis dataset
# Select key regional columns
regional_cols = [
    "date",
    "maker",
    "vehicle_category",
    "state",
    "estimated_ev_sales",
    "state_contribution",
    "active_states",
    "geographic_reach",
    "geo_hhi",
    "geo_concentration",
    "top_state",
    "is_top_state",
]

df_regional_selected = df_regional_final[regional_cols].copy()

# Export the datasets
df_manufacturer.to_csv(
    f"../../data/processed/ev_manufacturer_performance_{timestamp}.csv", index=False
)
df_temporal.to_csv(
    f"../../data/processed/ev_manufacturer_temporal_{timestamp}.csv", index=False
)
df_regional_selected.to_csv(
    f"../../data/processed/ev_manufacturer_regional_{timestamp}.csv", index=False
)

# Print confirmation
print("\n=== Export Complete ===")
print("-" * 50)
print(f"Files exported to data/processed/ directory with timestamp: {timestamp}")
print("\nExported Files:")
print(f"1. ev_manufacturer_performance_{timestamp}.csv")
print(f"   - Rows: {len(df_manufacturer)}")
print(f"   - Columns: {df_manufacturer.columns.tolist()}")

print(f"\n2. ev_manufacturer_temporal_{timestamp}.csv")
print(f"   - Rows: {len(df_temporal)}")
print(f"   - Columns: {df_temporal.columns.tolist()}")

print(f"\n3. ev_manufacturer_regional_{timestamp}.csv")
print(f"   - Rows: {len(df_regional_selected)}")
print(f"   - Columns: {df_regional_selected.columns.tolist()}")


=== Export Complete ===
--------------------------------------------------
Files exported to data/processed/ directory with timestamp: 20250809

Exported Files:
1. ev_manufacturer_performance_20250809.csv
   - Rows: 816
   - Columns: ['date', 'year', 'month', 'maker', 'vehicle_category', 'electric_vehicles_sold', 'cumulative_sales', 'maker_rank', 'market_position', 'leader_gap_pct', 'performance_vs_avg', 'market_share_fraction', 'hhi_index', 'market_concentration', 'effective_competitors', 'market_share_tier', 'share_change_status', 'dual_segment_presence', 'primary_segment', 'primary_segment_contribution', 'segment_specialization']

2. ev_manufacturer_temporal_20250809.csv
   - Rows: 816
   - Columns: ['date', 'maker', 'vehicle_category', 'electric_vehicles_sold', 'monthly_growth_rate', 'sales_z_score', 'rolling_volatility', 'stability_indicator', 'long_term_trend', 'trend_direction']

3. ev_manufacturer_regional_20250809.csv
   - Rows: 27711
   - Columns: ['date', 'maker', 'vehicle_
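
CSV files do not store dtypes, so the `date` column comes back as plain strings unless it is parsed again on load. The sketch below shows a simple round-trip check, using an in-memory buffer in place of the exported files; the small frame is an illustrative stand-in.

```python
import io

import pandas as pd

# Illustrative stand-in for one of the exported datasets.
df_out = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-04-01", "2021-05-01"]),
        "maker": ["AMPERE", "AMPERE"],
        "electric_vehicles_sold": [751, 147],
    }
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
buffer.seek(0)

# Re-parse the date column explicitly when reading the data back.
df_back = pd.read_csv(buffer, parse_dates=["date"])

same_shape = df_back.shape == df_out.shape
dates_restored = pd.api.types.is_datetime64_any_dtype(df_back["date"])
print(same_shape, dates_restored)
```

The same `parse_dates=["date"]` argument can be used when a later notebook loads the exported files.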

## Conclusion and Next Steps

In this feature engineering notebook, we extended the cleaned EV manufacturer sales data with six groups of features and exported them as three datasets (manufacturer performance, temporal, and regional):

### Key Features Added:

1. **Performance Metrics**: Ranking, market position, and comparison to leaders
2. **Temporal Dynamics**: Growth rates, momentum indicators, and stability metrics
3. **Competitive Landscape**: Market concentration, share metrics, and competitive positioning
4. **Segment Analysis**: Specialization, dual-segment presence, and primary segment performance
5. **Regional Insights**: Geographic reach, state contributions, and regional market leadership
6. **Advanced Statistics**: Z-scores, volatility measures, and trend indicators

### Key Insights:

- Clear differentiation between 2-wheeler and 4-wheeler manufacturer dynamics
- Identification of market leaders and their dominance patterns
- Understanding of growth trajectories and stability across manufacturers
- Geographic distribution patterns of manufacturer sales
- Competitive intensity differences across vehicle segments

### Next Steps:

1. **Exploratory Data Analysis**: Utilize these features for in-depth EV manufacturer analysis
2. **Visualizations**: Create dashboards and charts to communicate manufacturer dynamics
3. **Predictive Modeling**: Use these features to predict future market share trends
4. **Comparative Analysis**: Compare manufacturer performance across states and segments
5. **Strategic Insights**: Develop manufacturer-specific insights for market positioning

The engineered features provide a comprehensive foundation for understanding the EV manufacturer landscape in India, supporting both descriptive and predictive analytics.
