<a href="https://colab.research.google.com/github/sreeproject/AI-/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  PhonePe Data Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project analyzes PhonePe’s digital payments data to uncover insights into transaction trends, user engagement, and insurance adoption across India. By parsing JSON datasets, building structured MySQL databases, and performing exploratory data analysis, it highlights top-performing states, districts, and pincodes. Visualizations map user registrations, spending patterns, and device preferences, supporting customer segmentation and targeted marketing. The project ultimately transforms raw data into actionable intelligence, helping stakeholders enhance financial inclusion, optimize regional strategies, and personalize services. It provides a robust template for ongoing monitoring and deeper analytics in India’s rapidly evolving digital payments ecosystem.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


With the growing dependence on digital payment platforms like PhonePe, gaining insights into transaction trends, user engagement, and insurance-related activities has become essential for enhancing services and effectively targeting customers. This project seeks to analyze and visualize aggregated payment data, map total values across states and districts, and identify the top-performing states, districts, and pin codes to uncover meaningful patterns and drive strategic decisions.





#### **Define Your Business Objective?**

This project analyzes PhonePe’s digital payments data to uncover trends in transactions, user engagement, and insurance adoption across India. By processing raw data, storing it in structured databases, and creating insightful visualizations, it highlights where people use PhonePe the most, how they spend, and which regions show high growth potential.

Business Objectives:

*   Find where most people use PhonePe.
*   Understand how customers spend and pay.
*   Spot top areas for growth and marketing.
*   See who buys insurance to plan offers.
*   Use data to make better business decisions.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


### Dataset Loading

In [None]:
# Load the 9 datasets: aggregate, map, and top tables for transactions, users, and insurance
agregate_transaction_df = pd.read_csv('/content/agregate_transactions.csv')
agregate_user_df = pd.read_csv('/content/agregate_users.csv')
agregate_insurance_df  = pd.read_csv('/content/agregate_insurances.csv')
map_transaction_df = pd.read_csv('/content/map_trans.csv')
map_user_df =  pd.read_csv('/content/map_users.csv')
map_insurance_df = pd.read_csv('/content/map_insurances.csv')
top_transaction_df = pd.read_csv('/content/top_trans.csv')
top_user_df = pd.read_csv('/content/top_user.csv')
top_insurance_df =pd.read_csv('/content/top_insurance.csv')
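
Since the guidelines ask for exception handling, the nine loads above could also be written as a loop that fails early with a clear message when a file is missing. This is a minimal sketch, not the notebook's actual loader: `load_datasets` and the `FILE_NAMES` keys are hypothetical names, while the file names are copied from the cell above.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir, file_names):
    """Read each CSV into a DataFrame, raising a clear error if a file is absent."""
    frames = {}
    for name, file_name in file_names.items():
        path = Path(data_dir) / file_name
        if not path.exists():
            raise FileNotFoundError(f"Expected dataset '{name}' at {path}")
        frames[name] = pd.read_csv(path)
    return frames


# File names taken from the loading cell; the dictionary keys are illustrative.
FILE_NAMES = {
    "agregate_transaction": "agregate_transactions.csv",
    "agregate_user": "agregate_users.csv",
    "agregate_insurance": "agregate_insurances.csv",
    "map_transaction": "map_trans.csv",
    "map_user": "map_users.csv",
    "map_insurance": "map_insurances.csv",
    "top_transaction": "top_trans.csv",
    "top_user": "top_user.csv",
    "top_insurance": "top_insurance.csv",
}

# Usage in the notebook, assuming the files were uploaded to /content:
# frames = load_datasets("/content", FILE_NAMES)
# agregate_transaction_df = frames["agregate_transaction"]
```

Keeping the file names in one dictionary also makes it easier to reuse the same names in later loops over all nine dataframes.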

### Dataset First View

In [None]:
agregate_transaction_df.head()

In [None]:
agregate_user_df.head()

In [None]:
agregate_insurance_df.head()

In [None]:
map_transaction_df.head()

In [None]:
map_user_df.head()

In [None]:
map_insurance_df.head()

In [None]:
top_transaction_df.head()

In [None]:
top_user_df.head()

In [None]:
top_insurance_df.head()

### Dataset Rows & Columns count

In [None]:
agregate_transaction_df.shape

5034 entries and 6 columns are there.

In [None]:
agregate_user_df.shape

6732 entries and 6 columns are there.

In [None]:
agregate_insurance_df.shape

682 entries and 6 columns are there.

In [None]:
map_transaction_df.shape

20604 entries and 6 columns are there.

In [None]:
map_insurance_df.shape

13876 entries and 7 columns are there.

In [None]:
map_user_df.shape

20608 entries and 6 columns are there.

In [None]:
top_transaction_df.shape

8296 entries and 7 columns are there.

In [None]:
top_user_df.shape

10000 entries and 5 columns are there.

In [None]:
top_insurance_df.shape

6668 entries and 7 columns are there.

### Dataset Information

In [None]:
agregate_transaction_df.info()

In [None]:
agregate_user_df.info()

In [None]:
agregate_insurance_df.info()

In [None]:
map_transaction_df.info()

In [None]:
map_user_df.info()

In [None]:
map_insurance_df.info()

In [None]:
top_transaction_df.info()

In [None]:
top_user_df.info()

In [None]:
top_insurance_df.info()

#### Duplicate Values

In [None]:
print("Duplicates in aggregated_transaction:", agregate_transaction_df.duplicated().sum())
print("Duplicates in aggregate_user:", agregate_user_df.duplicated().sum())
print("Duplicates in aggregated_insurance:", agregate_insurance_df.duplicated().sum())
print("Duplicates in map_user:", map_user_df.duplicated().sum())
print("Duplicates in map_transaction:", map_transaction_df.duplicated().sum())
print("Duplicates in map_insurance:", map_insurance_df.duplicated().sum())
print("Duplicates in top_user:", top_user_df.duplicated().sum())
print("Duplicates in top_transaction:", top_transaction_df.duplicated().sum())
print("Duplicates in top_insurance:", top_insurance_df.duplicated().sum())

None of these 9 dataframes contains duplicate rows.

#### Missing Values/Null Values

In [None]:
# Total number of missing cells in each dataframe
print("Missing values in agregate_transaction_df:", agregate_transaction_df.isnull().sum().sum())
print("Missing values in agregate_user_df:", agregate_user_df.isnull().sum().sum())
print("Missing values in agregate_insurance_df:", agregate_insurance_df.isnull().sum().sum())
print("Missing values in map_transaction_df:", map_transaction_df.isnull().sum().sum())
print("Missing values in map_user_df:", map_user_df.isnull().sum().sum())
print("Missing values in map_insurance_df:", map_insurance_df.isnull().sum().sum())
print("Missing values in top_transaction_df:", top_transaction_df.isnull().sum().sum())
print("Missing values in top_user_df:", top_user_df.isnull().sum().sum())
print("Missing values in top_insurance_df:", top_insurance_df.isnull().sum().sum())


In [None]:
top_insurance_df.isnull().sum()

In [None]:
# Visualize missing values using a heatmap
plt.figure(figsize=(10,6))
sns.heatmap(top_insurance_df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap in top_insurance')
plt.show()

### What did you know about your dataset?

There are 9 dataframes in total, and each one contains the columns: States, Years, and Quarter. None of these dataframes have duplicate records. Among them, only the top_insurance dataframe contains missing values; the rest are complete with no missing data.
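
The shape, duplicate, and missing-value checks above can be collected into one table, which makes the summary in this section easier to verify at a glance. The sketch below assumes a dictionary that maps names to dataframes; `summarize_frames` and the two demo frames are hypothetical and only illustrate the output layout.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per DataFrame: rows, columns, duplicate rows, missing cells."""
    records = []
    for name, df in frames.items():
        records.append({
            "dataframe": name,
            "rows": df.shape[0],
            "columns": df.shape[1],
            "duplicate_rows": int(df.duplicated().sum()),
            "missing_cells": int(df.isnull().sum().sum()),
        })
    return pd.DataFrame(records).set_index("dataframe")


# Tiny made-up frames, used only to show what the summary looks like.
demo = {
    "clean": pd.DataFrame({"States": ["A", "B"], "Years": [2018, 2019]}),
    "messy": pd.DataFrame({
        "States": ["A", "A", "B"],
        "pincode": [560001.0, 560001.0, float("nan")],
    }),
}
summary = summarize_frames(demo)
print(summary)
```

In the notebook, the same function could be called with the nine real dataframes, for example `summarize_frames({"top_insurance_df": top_insurance_df, ...})`.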

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
agregate_transaction_df.columns

In [None]:
agregate_user_df.columns

In [None]:
agregate_insurance_df.columns

In [None]:
map_transaction_df.columns

In [None]:
map_user_df.columns

In [None]:
map_insurance_df.columns

In [None]:
top_transaction_df.columns

In [None]:
top_user_df.columns

In [None]:
top_insurance_df.columns

In [None]:
# Dataset Describe
agregate_transaction_df.describe()


In [None]:
agregate_user_df.describe()


In [None]:
agregate_insurance_df.describe()


In [None]:
map_transaction_df.describe()


In [None]:
map_user_df.describe()


In [None]:
map_insurance_df.describe()


In [None]:
top_transaction_df.describe()


In [None]:
top_user_df.describe()


In [None]:
top_insurance_df.describe()


### Variables Description

agregate_transaction_df dataframe:

*   States: Name of the Indian state where the transactions took place.
*   Years: Year of the recorded transaction data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4) representing the three-month period.
*   Transaction_type: Type of transaction (e.g., Peer-to-peer transfer, merchant payment).
*   Transaction_count: Total number of transactions recorded for that type, state, year, and quarter.
*   Transaction_amount: Total value of transactions (likely in INR) for that type, state, year, and quarter.

agregate_user_df dataframe:

*   States: Name of the Indian state where user data was recorded.
*   Years: Year of the data record (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4), indicating the three-month period.
*   Brand_type: Mobile device brand used by registered users (e.g., Xiaomi, Samsung).
*   User_count: Number of users of that brand in the specified state, year, and quarter.
*   User_percentage: Percentage of total users that belong to this brand segment in the given grouping.

agregate_insurance_df dataframe:

*   States: Name of the Indian state where the insurance transactions were recorded.
*   Years: Year of the transaction data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4), representing the three-month period.
*   Transaction_type: Type/category of insurance transaction (such as “TOTAL”).
*   Transaction_count: Total number of insurance transactions in that period and state.
*   Transaction_amount: Total value (amount in INR) of these insurance transactions.

map_transaction_df dataframe:

*   States: Name of the Indian state where the mapped transactions were recorded.
*   Years: Year of the transaction data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4), representing the three-month period.
*   Map_type: Geographic level of mapping (typically district or pincode name).
*   Map_count: Number of transactions in that specific mapped region.
*   Map_amount: Total value (amount in INR) of transactions in that mapped region.


map_user_df dataframe:

*   States: Name of the Indian state where the data was recorded.
*   Years: Year of the data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4), representing the three-month period.
*   District: Name of the district within the state.
*   Regi_users: Number of registered users in that district.
*   app_opens: Number of times the app was opened in that district.


map_insurance_df dataframe:

*   States: Name of the Indian state where the insurance data was recorded.
*   Years: Year of the data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4), representing the three-month period.
*   District: Name of the district within the state.
*   Transaction_type: Type of insurance transaction (e.g., TOTAL).
*   Transaction_count: Number of insurance transactions in that district for the period.
*   Transaction_amount: Total value of insurance transactions in that district for the period (the column plotted in Chart - 6).

top_transaction_df dataframe:

*   States: Name of the state where the data is recorded.
*   Years: Year of the data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4).
*   District: Name of the district within the state.
*   Pincodes: Pincode area within the district (can be empty or specific).
*   Transaction_type: Type/category of the transaction (e.g., TOTAL, recharge, bill payments).
*   Transaction_count: Number of transactions of this type in that area and period.
*   Transaction_amount: Total value (amount in INR) of these transactions.


top_user_df dataframe:

*   States: Name of the state where the data was recorded.
*   Years: Year of the data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4).
*   pincodes: Specific pincode area within the state where users were counted.
*   Regi_users: Number of registered users in that pincode during the period.



top_insurance_df dataframe:

*   States: Name of the state where the insurance data was recorded.
*   Years: Year of the data (e.g., 2018, 2019).
*   Quarter: Quarter of the year (1 to 4).
*   pincode: Specific pincode area within the state.
*   Transaction_type: Type/category of insurance transaction (usually ‘TOTAL’ here).
*   Transaction_count: Number of insurance transactions in that pincode for the period.
*   Transaction_amount: Total amount transacted for insurance in that pincode.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
agregate_transaction_df.nunique()


In [None]:
agregate_user_df.nunique()


In [None]:
agregate_insurance_df.nunique()


In [None]:
map_transaction_df.nunique()


In [None]:
map_user_df.nunique()


In [None]:
map_insurance_df.nunique()


In [None]:
top_transaction_df.nunique()


In [None]:
top_user_df.nunique()


In [None]:
top_insurance_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
dfs = {
    "agregate_transaction_df": agregate_transaction_df,
    "agregate_user_df": agregate_user_df,
    "agregate_insurance_df": agregate_insurance_df,
    "map_transaction_df": map_transaction_df,
    "map_user_df": map_user_df,
    "map_insurance_df": map_insurance_df,
    "top_transaction_df": top_transaction_df,
    "top_user_df": top_user_df,
    "top_insurance_df": top_insurance_df
}

# Loop through each dataframe and clean
for name, df in dfs.items():
    print(f"\nProcessing {name}")


    # Handle missing values
    if df.isnull().values.any():
        print(f"Missing values found in {name}")
        if name == "top_insurance_df":
            # Example: fill missing with 0
            df.fillna(0, inplace=True)


    # Reset index after cleaning
    df.reset_index(drop=True, inplace=True)


### What all manipulations have you done and insights you found?

There are 9 datasets, all fairly large. None of them contains duplicate rows. Only the pincode field of top_insurance_df contains missing values, and these were filled with 0 using the fillna() method.
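
Because only one column is affected, a narrower alternative to filling the whole dataframe is to fill just that column and keep a flag that records which rows were originally missing. The sketch below is an illustration under assumptions: `fill_missing_column` is a hypothetical helper, the demo frame is made up, and the column name `pincode` follows the variable description of top_insurance_df.

```python
import pandas as pd


def fill_missing_column(df, column, fill_value):
    """Return a copy of df with missing values in one column replaced.

    A boolean '<column>_was_missing' column is added so that filled rows
    stay distinguishable from rows that really held fill_value.
    """
    result = df.copy()
    result[f"{column}_was_missing"] = result[column].isnull()
    result[column] = result[column].fillna(fill_value)
    return result


# Made-up rows standing in for top_insurance_df.
demo = pd.DataFrame({
    "States": ["A", "B", "C"],
    "pincode": [560001.0, float("nan"), 110001.0],
    "Transaction_count": [10, 20, 30],
})
cleaned = fill_missing_column(demo, "pincode", 0)
print(cleaned)
```

A pincode of 0 is a placeholder rather than a real location, so the flag column helps exclude those rows from any pincode-level ranking.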





## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
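# Chart - 1 Transactions over the years
plt.figure(figsize=(10,6))
# Note: sns.barplot aggregates with the mean by default, so each bar is the average
# Transaction_count per row for that year; pass estimator='sum' to plot yearly totals.
# errorbar=None replaces the deprecated ci=None (seaborn 0.12 or newer).
sns.barplot(data=agregate_transaction_df, x="Years", y="Transaction_count", errorbar=None)
plt.title("Total Transactions per Year")
plt.ylabel("Transaction Count")
plt.xlabel("Year")
plt.show()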

##### 1. Why did you pick the specific chart?

A bar chart was chosen because it compares one numeric measure, transaction count, across a small set of discrete years, which makes the year-over-year trend easy to read. The bars rise steadily from 2018 to 2024, with 2024 the highest at more than 800 million. This upward trend indicates expanding activity, possibly reflecting user growth, system scalability, or increased engagement on the platform.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a consistent and significant year-over-year increase in total transactions, signaling strong momentum and sustained growth from 2018 to 2024. The sharp rise in transactions by 2024, surpassing 800 million, may indicate a maturing platform that’s scaling efficiently and attracting more users or business activity.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The steady increase in transactions from 2018 to 2024 suggests strong performance and consistent business growth. This trend indicates rising customer engagement, successful platform strategies, and opportunities for expansion. No signs of negative growth appear, as there are no dips or plateaus throughout the data—only upward momentum.
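
The growth claim can be checked numerically by summing the transaction count per year and computing the year-over-year change. This matters because `sns.barplot` plots the mean of each group unless another estimator is passed. The sketch below is a minimal illustration: `yearly_totals` is a hypothetical helper and the demo numbers are made up, not taken from the PhonePe data.

```python
import pandas as pd


def yearly_totals(df, year_col="Years", value_col="Transaction_count"):
    """Sum value_col per year and add year-over-year growth in percent."""
    totals = df.groupby(year_col)[value_col].sum().sort_index()
    out = totals.to_frame("total")
    out["yoy_growth_pct"] = totals.pct_change() * 100
    return out


# Made-up rows standing in for agregate_transaction_df.
demo = pd.DataFrame({
    "Years": [2018, 2018, 2019, 2019, 2020],
    "Transaction_count": [100, 50, 200, 100, 600],
})
growth = yearly_totals(demo)
print(growth)
```

In the notebook, `yearly_totals(agregate_transaction_df)` would give the totals behind the chart, and any negative value in `yoy_growth_pct` would flag a year of decline.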

#### Chart - 2

In [None]:
# Chart - 2 Histogram of user_count
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.hist(agregate_user_df['User_count'], bins=30, color='skyblue', edgecolor='black')
plt.title('Distribution of User Count')
plt.xlabel('User Count')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was chosen because it shows how a single numeric variable, User_count, is distributed. Each observation is the number of users of one device brand in one state and quarter. Most observations fall in the lowest bin, which has a frequency of around 5000, and the frequency drops sharply as the user count increases, giving a strongly right-skewed distribution.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that most brand, state, and quarter records have low user counts, with a strong concentration at the lower end of the distribution. As the user count increases, the frequency rapidly declines, so only a small number of records reach very high values. Because User_count measures how many users a brand has in a state and quarter rather than how active each user is, the skew suggests that the user base is concentrated in a few large brand and state combinations.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The skew suggests that a few brand and state combinations account for most users, so optimizing the app experience for those devices and regions is likely to have the largest positive impact. The long tail of low-count combinations is not evidence of negative growth on its own, but it marks segments where reach is limited and where targeted growth strategies could help.
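
With a distribution this skewed, almost everything lands in the first bin of a linear histogram. One common remedy is to use logarithmically spaced bins and a log-scaled x-axis. The sketch below assumes strictly positive values and uses a synthetic lognormal sample in place of `agregate_user_df['User_count']`; `log_spaced_bins` is a hypothetical helper.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def log_spaced_bins(values, n_bins=30):
    """Return bin edges evenly spaced on a log10 scale over the positive values."""
    positive = values[values > 0]
    return np.logspace(np.log10(positive.min()), np.log10(positive.max()), n_bins + 1)


# Synthetic right-skewed sample, used only for illustration.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=10, sigma=2, size=1000))

bins = log_spaced_bins(sample)
print("Skewness:", sample.skew())

plt.figure(figsize=(8, 5))
plt.hist(sample, bins=bins, color='skyblue', edgecolor='black')
plt.xscale('log')
plt.title('Distribution of User Count (log-scaled x-axis)')
plt.xlabel('User Count')
plt.ylabel('Frequency')
plt.show()
```

A positive skewness value confirms the right skew, and the log-scaled view shows how the low-count records are spread instead of collapsing them into one bar.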

#### Chart - 3

In [None]:
# Chart - 3 Histogram for Transaction_amount
plt.figure(figsize=(8,5))
plt.hist(agregate_insurance_df['Transaction_amount'], bins=20, color='purple', edgecolor='black')
plt.title('Distribution of Transaction Amount')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

The histogram titled "Distribution of Transaction Amount" shows how insurance transaction amounts, aggregated per state and quarter, are distributed. Most records are clustered at the lower end of the amount range, with frequency sharply declining as the amount increases. This pattern suggests that small totals dominate while large totals are rare.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most state and quarter records have small total insurance amounts, clustered tightly near the lower end of the scale. As the amount increases, the frequency sharply decreases, forming a classic right-skewed pattern. Large totals are rare and come from a few state and quarter combinations, which could be worth deeper investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can lead to a positive business impact by focusing efforts on optimizing low-value, high-frequency transactions. The lack of high-value transactions could signal missed opportunities or customer reluctance, which might hinder revenue growth.

#### Chart - 4

In [None]:
# Chart - 4
import seaborn as sns

plt.figure(figsize=(10,7))
sns.scatterplot(data=map_transaction_df, x='Map_count', y='Map_amount', hue='States', palette='tab10', s=100)
plt.title('Scatter Plot of Map_count vs Map_amount by State')
plt.xlabel('Map_count')
plt.ylabel('Map_amount')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is ideal for comparing continuous values like transaction count and amount across multiple states. It clearly shows how volume and value vary, revealing trends, clusters, or anomalies. This visual clarity makes it easy to spot which states contribute more or less to overall business performance.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals a positive correlation between transaction count and amount, meaning states with more activity tend to generate higher monetary values. Maharashtra clearly leads in both metrics, making it a key driver of overall performance. Most other states show lower values, hinting at regional imbalances that might need strategic attention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights support a positive business impact by revealing high-performing regions and validating that higher engagement leads to greater value. However, the concentration of low-activity states suggests limited market reach, which can slow growth if ignored. Targeted campaigns, infrastructure boosts, or regional incentives could help unlock untapped potential in those areas.
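
The positive correlation read off the scatter plot can be backed with a number. The sketch below computes the Pearson correlation and a rank-based (Spearman-style) correlation, which is less sensitive to a few very large regions. `count_amount_correlation` is a hypothetical helper and the demo values are made up; the default column names follow map_transaction_df.

```python
import pandas as pd


def count_amount_correlation(df, count_col="Map_count", amount_col="Map_amount"):
    """Return Pearson and rank-based correlation between two numeric columns."""
    pearson = df[count_col].corr(df[amount_col])
    # Spearman correlation equals the Pearson correlation of the ranks.
    spearman = df[count_col].rank().corr(df[amount_col].rank())
    return {"pearson": pearson, "spearman": spearman}


# Made-up values standing in for map_transaction_df.
demo = pd.DataFrame({
    "Map_count": [1, 2, 3, 4, 5],
    "Map_amount": [10, 40, 90, 160, 250],
})
result = count_amount_correlation(demo)
print(result)
```

Values close to 1 would support the reading that regions with more transactions also generate higher amounts, while a large gap between the two measures would hint that a few outliers drive the Pearson value.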

#### Chart - 5

In [None]:
plt.figure(figsize=(10,6))
sns.histplot(map_user_df['Regi_users'], bins=20, kde=True, color='teal')
plt.title('Distribution of Registered Users')
plt.xlabel('Registered Users')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay was chosen because it shows how registered-user counts per district are distributed. The bars show frequency, while the smoothed line highlights the overall shape of the distribution. It is well suited to spotting concentration points and understanding the spread of registrations at scale.

##### 2. What is/are the insight(s) found from the chart?

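Most district records fall in the lowest registered-user range, and the frequency drops sharply as the number of registered users rises, so districts with very large user bases are rare.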

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart shows that most district records fall within the lowest registered-user range, indicating limited reach across many districts. The sharp drop in frequency as registered users rise suggests that districts with very large user bases are rare. This trend highlights the need to boost registrations and activity through targeted outreach or incentives.

#### Chart - 6

In [None]:
# Chart - 6
plt.figure(figsize=(10,6))
sns.scatterplot(x='Transaction_count', y='Transaction_amount', data=map_insurance_df, hue='District', palette='viridis', s=100)

plt.title('Transaction Count vs Transaction Amount by District')
plt.xlabel('Transaction Count')
plt.ylabel('Transaction Amount')
plt.legend(title='District', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it shows the relationship between two continuous variables, here the insurance transaction count and transaction amount of each district record. Most data points cluster near a diagonal trend, while a few outliers deviate noticeably. The color-coded markers distinguish districts, helping pinpoint where the unusual values come from.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests that insurance transaction amount rises with transaction count, since most data points lie close to a diagonal trend. The few outliers indicate district records whose amount is unusually high or low for their transaction count, and these may be worth examining separately.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, a consistent relationship between transaction count and amount supports confident and informed business decision-making, because growth in insurance transactions can be expected to bring growth in value. The few outliers should not be overlooked, since they may reflect unusual policy sizes or data issues that could lead to misjudgments. Examining those districts in more detail could reduce that risk.

#### Chart - 7

In [None]:
# Chart - 7
plt.figure(figsize=(12,6))
sns.boxplot(data=top_transaction_df, x='States', y='Transaction_amount')

plt.title('Transaction Amount Distribution by State')
plt.xlabel('State')
plt.ylabel('Transaction Amount')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

This box plot was selected to visualize the spread and concentration of transaction amounts for each state clearly and efficiently. It highlights the median value, variability, and presence of extreme outliers, helping identify patterns and anomalies at a glance. Its compact structure makes it ideal for comparing financial data with wide value ranges and potential skewness.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals that most transactions in West Bengal are low in value, with a few exceptionally large ones pushing the upper limits of the distribution. This skewed pattern suggests routine financial activity mixed with rare, high-value outliers—possibly from institutional or enterprise sources. The wide spread in transaction amounts signals significant variability that may need closer monitoring or segmentation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights offer potential for positive business impact by showing that most transactions are small, while rare high-value outliers exist, possibly from major enterprises. This helps businesses optimize services for frequent low-value users while designing targeted strategies for high-value clients. However, the sharp disparity in transaction sizes may signal unequal engagement or market dependence on a few large contributors, which could pose risks if those outliers disappear or shift.

#### Chart - 8

In [None]:
# Chart - 8
state_totals = top_user_df.groupby('States')['Regi_users'].sum().sort_values(ascending=False)

# Create the bar chart
plt.figure(figsize=(12,6))
state_totals.plot(kind='bar', color='skyblue')

plt.title('Registered Users by State')
plt.xlabel('State')
plt.ylabel('Registered Users')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart was chosen because it clearly compares registered user counts across various states, making regional performance easy to visualize. It highlights top states like Delhi and Karnataka, showing which areas have strong user engagement. The steep drop-off between states reveals significant regional disparities that could inform outreach and growth strategies.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Delhi leads in registered users, followed by Karnataka, Uttar Pradesh, and Maharashtra, each with over 70 million users. There’s a steep decline after the top states, highlighting a major engagement gap across regions. States like Mizoram and Lakshadweep have notably low registrations, pointing to limited outreach or digital access.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights support positive business impact by identifying high-registration states like Delhi and Karnataka, which can be models for successful outreach. However, states with low user counts, such as Mizoram and Lakshadweep, reveal digital engagement gaps that could slow inclusive growth. Addressing these disparities through targeted strategies may unlock untapped potential and broaden market reach.

#### Chart - 9

In [None]:
# Chart - 9
state_totals = top_insurance_df.groupby('States')['Transaction_amount'].sum().sort_values(ascending=False)

# Create the bar chart
plt.figure(figsize=(12,6))
state_totals.plot(kind='bar', color='skyblue')

plt.title('Total Transaction Amount by State')
plt.xlabel('State')
plt.ylabel('Total Transaction Amount')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart was chosen to show how total transaction volumes vary across states, clearly highlighting regional economic activity. It identifies top performers like Karnataka, Telangana, and Maharashtra, making it easier to pinpoint financial strongholds. Its clean layout helps visualize disparities in transaction value, aiding strategic planning and resource allocation.

##### 2. What is/are the insight(s) found from the chart?

Karnataka stands out with the highest transaction amount, indicating strong digital adoption and economic activity. Telangana, Maharashtra, and Uttar Pradesh follow with substantial volumes, showcasing solid regional engagement. However, states like Lakshadweep and Mizoram show minimal activity, highlighting opportunities for deeper financial inclusion and infrastructure development.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the chart reveals strong business opportunities in high-performing states like Karnataka and Telangana due to active digital transaction ecosystems. However, low transaction values in places like Lakshadweep and Mizoram suggest underutilized markets that could hinder inclusive growth if ignored. Addressing these gaps through improved infrastructure and outreach could unlock new avenues and balance regional development.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here
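
One possible starting point for this chart is a heatmap of the pairwise correlations between numeric columns. This is a sketch rather than the notebook's actual chart: `plot_correlation_heatmap` is a hypothetical helper, and the demo frame is made up to mirror the columns of agregate_transaction_df.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df, title="Correlation Heatmap"):
    """Plot pairwise Pearson correlations of the numeric columns and return them."""
    corr = df.select_dtypes(include="number").corr()
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
    plt.title(title)
    plt.tight_layout()
    plt.show()
    return corr


# Made-up rows; in the notebook, pass one of the real dataframes instead.
demo = pd.DataFrame({
    "States": ["A", "B", "A", "B"],
    "Years": [2018, 2019, 2020, 2021],
    "Transaction_count": [10, 20, 40, 80],
    "Transaction_amount": [100.0, 250.0, 380.0, 900.0],
})
corr = plot_correlation_heatmap(demo)
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable if the heatmap is drawn for several dataframes.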

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

**1. Customer Segmentation: Identify distinct user groups based on spending habits to tailor marketing strategies.**

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

# Merge the relevant dataframes
merged_df = agregate_transaction_df.merge(
    agregate_user_df[['States', 'Years', 'Quarter', 'User_count']], # Select relevant columns from agregate_user_df
    on=['States', 'Years', 'Quarter'],
    how='inner'
)

# Prepare data for clustering
# We can use Transaction_count, Transaction_amount, and User_count for segmentation
# .copy() gives an independent frame so the Segment column can be added safely later
data = merged_df[['Transaction_count', 'Transaction_amount', 'User_count']].dropna().copy()

# Scale the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) # Add n_init for KMeans
data['Segment'] = kmeans.fit_predict(scaled_data)

# View cluster centers (in scaled space)
print("Cluster Centers (Scaled Features):")
print(kmeans.cluster_centers_)

# Visualize clusters
plt.figure(figsize=(10,6))
# Use the original data for plotting, with Segment from the clustering result
sns.scatterplot(x='User_count', y='Transaction_amount', hue='Segment', data=data, palette='viridis')
plt.title('Customer Segmentation based on Spending Habits and User Count')
plt.xlabel('User Count')
plt.ylabel('Transaction Amount')
plt.show()

This scatter plot shows the three K-Means segments by User Count (x-axis) and Transaction Amount (y-axis). Each point is an aggregated state-level record for one year and quarter, not an individual customer, so the segments describe groups of regional records:

Segment 0 (purple): High user counts but moderate transaction amounts, suggesting a broad base of regular users with standard spending behavior.

Segment 1 (teal): Lower user counts but notably high transaction amounts, indicating high-value activity concentrated among fewer users, possibly premium or business-heavy usage.

Segment 2 (yellow): Low on both user count and transaction amount. This group may represent new, inactive, or low-value markets, offering potential for targeted engagement or upselling.
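The cluster centers printed by the cell above are in standardized units, which are hard to interpret directly. A minimal sketch of mapping them back to the original units with `StandardScaler.inverse_transform` follows. The `toy` data frame is hypothetical and only stands in for the merged state-quarter records.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the merged state-quarter records
toy = pd.DataFrame({
    "Transaction_count": [100, 120, 110, 5000, 5200, 5100, 900, 950, 1000],
    "Transaction_amount": [1e4, 1.2e4, 1.1e4, 9e6, 9.5e6, 9.2e6, 5e5, 5.5e5, 6e5],
    "User_count": [50, 60, 55, 2000, 2100, 2050, 400, 420, 450],
})
features = ["Transaction_count", "Transaction_amount", "User_count"]

# Standardize, then cluster in the standardized space
scaler = StandardScaler()
scaled = scaler.fit_transform(toy[features])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
toy["Segment"] = kmeans.fit_predict(scaled)

# Map the centers from standardized space back to the original units
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=features,
)
centers.index.name = "Segment"
print(centers.round(0))
```

Reading the centers in original units makes it easier to label each segment (for example "high users, high amount") before describing it.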

**Fraud Detection: Analyze transaction patterns to spot and prevent fraudulent activities.**


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

df = agregate_transaction_df.copy()
scaler = StandardScaler()
features = df[['Transaction_count', 'Transaction_amount']]
scaled_features = scaler.fit_transform(features)

# ---------------------------------
# Fit Isolation Forest
# ---------------------------------
iso_forest = IsolationForest(contamination=0.02, random_state=42)
df['anomaly'] = iso_forest.fit_predict(scaled_features)

# anomaly = -1 means flagged as fraud/anomaly
df['is_fraud'] = df['anomaly'].apply(lambda x: 1 if x == -1 else 0)

# ---------------------------------
# Plot anomalies
# ---------------------------------
plt.figure(figsize=(10,6))
sns.scatterplot(x='Transaction_count', y='Transaction_amount', hue='is_fraud', data=df, palette={0:'blue',1:'red'})
plt.title('Fraud Detection: Anomaly Points Highlighted')
plt.show()

# ---------------------------------
# Display suspected fraudulent records
# ---------------------------------
frauds = df[df['is_fraud']==1]
print(frauds[['States', 'Years', 'Quarter', 'Transaction_count', 'Transaction_amount']])

This scatter plot highlights the records that Isolation Forest flags as anomalous, based on two metrics. Because the data is aggregated by state and quarter, a flagged point is a statistical outlier that merits review, not a confirmed fraudulent transaction. With contamination=0.02, roughly 2% of records are flagged whether or not any fraud exists.

Axes Explanation:

The x-axis shows the number of transactions (Transaction_count).

The y-axis, scaled to trillions (1e12), represents the Transaction_amount.

Color Coding:

Blue points (is_fraud = 0) are records the model treats as normal, mostly clustered at lower counts and amounts.

Red points (is_fraud = 1) are flagged anomalies, which tend to have higher counts and amounts and deviate clearly from the typical pattern.
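Since the points flagged above are aggregated totals, a simple rule-based cross-check can help judge whether they are just very large records. The sketch below applies the 1.5 × IQR rule on a log scale. The `records` frame and its values are hypothetical, and this is a swapped-in sanity check, not part of the Isolation Forest method itself.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated records (one row per state-quarter)
records = pd.DataFrame({
    "States": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "Transaction_amount": [1.0e6, 1.2e6, 0.9e6, 1.1e6, 1.3e6, 1.0e6, 1.15e6, 9.0e8],
})

# Work on a log scale because aggregated amounts span several orders of magnitude
log_amount = np.log10(records["Transaction_amount"])

# Classic 1.5 * IQR fences
q1, q3 = log_amount.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

records["is_outlier"] = (log_amount > upper) | (log_amount < lower)
print(records[records["is_outlier"]])
```

Records flagged by both approaches are the strongest candidates for a closer manual look.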

In [None]:
df = top_transaction_df.copy()
scaler = StandardScaler()
features = df[['Transaction_count', 'Transaction_amount']]
scaled_features = scaler.fit_transform(features)

# ---------------------------------
# Fit Isolation Forest
# ---------------------------------
iso_forest = IsolationForest(contamination=0.02, random_state=42)
df['anomaly'] = iso_forest.fit_predict(scaled_features)

# anomaly = -1 means flagged as fraud/anomaly
df['is_fraud'] = df['anomaly'].apply(lambda x: 1 if x == -1 else 0)

# ---------------------------------
# Plot anomalies
# ---------------------------------
plt.figure(figsize=(10,6))
sns.scatterplot(x='Transaction_count', y='Transaction_amount', hue='is_fraud', data=df, palette={0:'blue',1:'red'})
plt.title('Fraud Detection: Anomaly Points Highlighted')
plt.show()

# ---------------------------------
# Display suspected fraudulent records
# ---------------------------------
frauds = df[df['is_fraud']==1]
print(frauds[['States', 'Years', 'Quarter', 'Transaction_count', 'Transaction_amount']])

This plot repeats the anomaly analysis on top_transaction_df. The x-axis shows the number of transactions (Transaction_count) and the y-axis shows the Transaction_amount.

Blue points (is_fraud = 0) are records the model treats as normal, and red points (is_fraud = 1) are flagged anomalies. The red points are statistical outliers worth reviewing, not confirmed fraudulent transactions.

In [None]:
df = map_transaction_df.copy()
scaler = StandardScaler()
features = df[['Map_count', 'Map_amount']]
scaled_features = scaler.fit_transform(features)

# ---------------------------------
# Fit Isolation Forest
# ---------------------------------
iso_forest = IsolationForest(contamination=0.02, random_state=42)
df['anomaly'] = iso_forest.fit_predict(scaled_features)

# anomaly = -1 means flagged as fraud/anomaly
df['is_fraud'] = df['anomaly'].apply(lambda x: 1 if x == -1 else 0)

# ---------------------------------
# Plot anomalies
# ---------------------------------
plt.figure(figsize=(10,6))
sns.scatterplot(x='Map_count', y='Map_amount', hue='is_fraud', data=df, palette={0:'blue',1:'red'})
plt.title('Fraud Detection: Anomaly Points Highlighted')
plt.show()

# ---------------------------------
# Display suspected fraudulent records
# ---------------------------------
frauds = df[df['is_fraud']==1]
print(frauds[['States', 'Years', 'Quarter', 'Map_count', 'Map_amount']])

This plot repeats the anomaly analysis on map_transaction_df. The x-axis shows the number of transactions (Map_count) and the y-axis shows the transaction amount (Map_amount).

Blue points (is_fraud = 0) are records the model treats as normal, and red points (is_fraud = 1) are flagged anomalies. The red points are statistical outliers worth reviewing, not confirmed fraudulent transactions.

**Geographical Insights: Understand payment trends at state and district levels for targeted marketing.**



State-level transaction amount

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

state_summary = map_transaction_df.groupby('States').agg(
    Total_Transactions=('Map_count','sum'),
    Total_Amount=('Map_amount','sum')
).reset_index()

plt.figure(figsize=(12,6))
sns.barplot(x='Total_Amount', y='States', data=state_summary.sort_values('Total_Amount', ascending=False))
plt.title('Total Transaction Amount by State')
plt.xlabel('Total Amount')
plt.ylabel('State')
plt.show()

This horizontal bar chart compares total transaction amounts by state, with Telangana leading and Lakshadweep at the bottom. The x-axis uses scientific notation (1e13) to represent the large transaction values, highlighting significant economic differences between regions. This visualization makes it easy to assess which states have strong digital financial activity and which may need more investment or outreach.

In [None]:
# District-level registered users

In [None]:
district_summary = map_user_df.groupby('District').agg(
    Total_Users=('Regi_users', 'sum')
).reset_index()

plt.figure(figsize=(12,6))
sns.barplot(x='Total_Users', y='District', data=district_summary.sort_values('Total_Users', ascending=False).head(10))
plt.title('Top 10 Districts by Registered Users')
plt.xlabel('Total Registered Users')
plt.ylabel('District')
plt.show()

Bengaluru Urban tops the chart with the highest number of registered users among the top 10 districts. Other leading districts include Pune, Thane, Jaipur, and Mumbai Suburban, all showing strong digital engagement. The distribution reveals urban centers as major hubs of user registration, pointing to concentrated adoption in metropolitan regions.

**Payment Performance: Evaluate the popularity of different payment categories for strategic investments.**


Transaction count by category

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

type_summary = agregate_transaction_df.groupby('Transaction_type').agg(
    Total_Transactions=('Transaction_count', 'sum'),
    Total_Amount=('Transaction_amount', 'sum')
).reset_index()

plt.figure(figsize=(10,6))
sns.barplot(x='Total_Transactions', y='Transaction_type', data=type_summary.sort_values('Total_Transactions', ascending=False))
plt.title('Total Transactions by Payment Category')
plt.xlabel('Total Transactions')
plt.ylabel('Payment Category')
plt.show()

Merchant payments account for the highest number of total transactions, showing strong consumer engagement in retail spending. Peer-to-peer and bill payments follow as the next most common categories, indicating widespread use for personal transfers and utilities. Financial services appear at the bottom, suggesting lower transaction frequency in investment or lending-related activities.

Average transaction size by category

In [None]:
type_summary['Avg_Transaction_Size'] = type_summary['Total_Amount'] / type_summary['Total_Transactions']

plt.figure(figsize=(10,6))
sns.barplot(x='Avg_Transaction_Size', y='Transaction_type', data=type_summary.sort_values('Avg_Transaction_Size', ascending=False))
plt.title('Average Transaction Size by Payment Category')
plt.xlabel('Average Amount per Transaction')
plt.ylabel('Payment Category')
plt.show()

This chart compares the average transaction size across five payment categories, using horizontal bars for clarity. Peer-to-peer payments have the highest average value, suggesting larger transfers between individuals, while Merchant payments rank lowest, reflecting frequent but lower-value retail transactions. The distribution helps identify where high-value interactions occur, which is useful for tailoring financial strategies or prioritizing security and infrastructure.

**User Engagement: Monitor user activity to develop strategies that enhance retention and satisfaction.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Total registered users over time
reg_users_over_time = map_user_df.groupby(['Years', 'Quarter'])['Regi_users'].sum().reset_index()

plt.figure(figsize=(12,6))
sns.lineplot(data=reg_users_over_time, x='Years', y='Regi_users', hue='Quarter', marker='o')
plt.title('Registered Users over Time')
plt.ylabel('Registered Users')
plt.xlabel('Year')
plt.show()

# Similarly for app opens
app_opens_over_time = map_user_df.groupby(['Years', 'Quarter'])['app_opens'].sum().reset_index()

plt.figure(figsize=(12,6))
sns.lineplot(data=app_opens_over_time, x='Years', y='app_opens', hue='Quarter', marker='s')
plt.title('App Opens over Time')
plt.ylabel('App Opens')
plt.xlabel('Year')
plt.show()

This line chart displays the growth in registered users across quarters from 2018 to 2024:

All four quarterly lines show a consistent upward trend, meaning the number of users steadily increased over time regardless of seasonality.

Quarter 4 stands out as the fastest-growing segment, reaching nearly 600 million users by 2024, suggesting peak registration activity during the year’s end.

The overall trajectory points to strong and sustained adoption, reflecting successful engagement strategies and market expansion.

**Product Development: Use data insights to inform the creation of new features and services.**


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# For example, see avg transaction amount by state (over product types)
avg_txn_state = map_transaction_df.groupby('States')['Map_amount'].mean().reset_index()

plt.figure(figsize=(12,6))
sns.barplot(data=avg_txn_state, x='States', y='Map_amount')
plt.xticks(rotation=90)
plt.title('Average Transaction Amount by State')
plt.ylabel('Avg Transaction Amount')
plt.xlabel('State')
plt.show()

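This chart compares the mean of Map_amount by state. Each row of map_transaction_df is already an aggregate (it also carries a Map_count), so this is the average of aggregated totals rather than the average value of a single transaction. It shows where large aggregate amounts are recorded.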

Andaman & Nicobar Islands and Maharashtra stand out with the highest average amounts, suggesting either large individual transactions or a skewed concentration from big-ticket payments.

Most other states reflect moderate to low averages, indicating smaller or more frequent transactions.

The wide disparity may point to differing economic profiles or transaction behaviors across regions, which can guide localized business strategies.
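To get the average value of a single transaction per state, the total amount can be divided by the total count instead of taking the mean of row-level totals. The sketch below contrasts the two calculations. `toy_map` is a hypothetical frame that only reuses the column names of map_transaction_df.

```python
import pandas as pd

# Hypothetical rows with the same column names as map_transaction_df
toy_map = pd.DataFrame({
    "States": ["X", "X", "Y", "Y"],
    "Map_count": [1000, 10, 200, 300],
    "Map_amount": [100000.0, 50000.0, 40000.0, 60000.0],
})

totals = toy_map.groupby("States")[["Map_count", "Map_amount"]].sum()

# Average value per transaction = total amount / total count
totals["Avg_per_transaction"] = totals["Map_amount"] / totals["Map_count"]

# For comparison: the mean of row-level totals, which is what .mean() on Map_amount gives
totals["Mean_row_amount"] = toy_map.groupby("States")["Map_amount"].mean()

print(totals)
```

In this toy data state X has the larger mean row amount but the smaller value per transaction, which shows how the two measures can rank states differently.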

**Insurance Insights: Analyze insurance transaction data to improve product offerings and customer experience.**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Average insurance transaction amount by state
avg_insurance_state = agregate_insurance_df.groupby("States")["Transaction_amount"].mean().reset_index()

plt.figure(figsize=(12,6))
sns.barplot(data=avg_insurance_state, x="States", y="Transaction_amount")
plt.title("Average Insurance Transaction Amount by State")
plt.ylabel("Avg Insurance Transaction Amount")
plt.xlabel("State")
plt.xticks(rotation=90)
plt.show()

# Trend over time
time_trend = agregate_insurance_df.groupby(["Years", "Quarter"])["Transaction_count"].sum().reset_index()

plt.figure(figsize=(10,5))
sns.lineplot(data=time_trend, x="Years", y="Transaction_count", hue="Quarter", marker="o")
plt.title("Insurance Transactions Over Time by Quarter")
plt.ylabel("Total Transaction Count")
plt.xlabel("Year")
plt.show()

This bar chart shows the average insurance transaction amount across different Indian states, offering a snapshot of how insurance spending varies regionally.

Karnataka leads with the highest average transaction amount, indicating strong insurance adoption or higher-value policies in that region.

Maharashtra and Uttar Pradesh follow with substantial averages, suggesting robust consumer engagement or premium insurance activity.

Meanwhile, states like Mizoram and Lakshadweep reflect significantly lower values, pointing to either limited coverage, access, or smaller-scale insurance markets.

This line chart tracks insurance transaction counts by quarter from 2020 to 2024, revealing key usage trends over time:

Quarter 4 consistently leads, with the highest transaction volumes year after year, peaking near 1.4 million—hinting at seasonal spikes, possibly due to year-end policy renewals or offers.

Quarterly growth is steady, especially between 2021 and 2024, showing rising consumer confidence and insurance adoption across India.

The upward trajectory across all four quarters signals a maturing digital insurance market, with increasing awareness and outreach driving more transactions over time.

**Marketing Optimization: Tailor marketing campaigns based on user behavior and transaction patterns.**


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Find top states by transaction volume
state_tx = agregate_transaction_df.groupby("States")["Transaction_amount"].sum().reset_index()
state_tx = state_tx.sort_values(by="Transaction_amount", ascending=False)

plt.figure(figsize=(12,6))
sns.barplot(data=state_tx, x="States", y="Transaction_amount", palette="magma")
plt.xticks(rotation=90)
plt.title("Total Transaction Amount by State")
plt.ylabel("Transaction Amount (₹)")
plt.xlabel("States")
plt.show()

# 2. User engagement vs spending
eng_spend = agregate_user_df.merge(agregate_transaction_df, on=["States", "Years", "Quarter"])
plt.figure(figsize=(8,6))
sns.scatterplot(data=eng_spend, x="User_percentage", y="Transaction_amount", hue="States")
plt.title("User Engagement vs Transaction Amount")
plt.xlabel("User Engagement %")
plt.ylabel("Transaction Amount (₹)")
plt.show()


This bar chart presents the total transaction amount by Indian state, offering a clear snapshot of regional financial performance:

Telangana tops the list, with the highest overall transaction value, indicating a highly active digital payment environment.

Karnataka and Maharashtra follow closely, showing strong economic activity and digital transaction adoption.

States toward the bottom, like Lakshadweep and Mizoram, have minimal transaction amounts, pointing to either low population or limited digital infrastructure.

This scatter plot maps how user engagement compares with transaction amounts across Indian states:

States like Maharashtra and Karnataka sit toward the upper-right quadrant, indicating both high engagement and high transaction amounts, which reflects strong digital participation and economic activity.

Regions clustered near the lower-left—such as Lakshadweep and Mizoram—show low engagement and low transaction volumes, signaling areas with limited digital access or adoption.

The spread of data points reveals diverse state-level performance, which can guide differentiated strategies: expanding infrastructure in low-engagement states and reinforcing services in high-activity ones.
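One thing to watch in the engagement-versus-spending merge: agregate_transaction_df holds one row per transaction type, so if the user table also has several rows per state, year and quarter, a direct merge pairs every row with every other row for the same key. A minimal sketch of collapsing each table to one row per key first is shown below. The `tx` and `users` frames are hypothetical, and the `Brands` column name is an assumption.

```python
import pandas as pd

# Hypothetical rows: transactions split by type, users split by device brand
tx = pd.DataFrame({
    "States": ["S1", "S1", "S2"],
    "Years": [2022, 2022, 2022],
    "Quarter": [1, 1, 1],
    "Transaction_type": ["Merchant payments", "Peer-to-peer payments", "Merchant payments"],
    "Transaction_amount": [100.0, 300.0, 50.0],
})
users = pd.DataFrame({
    "States": ["S1", "S1", "S2"],
    "Years": [2022, 2022, 2022],
    "Quarter": [1, 1, 1],
    "Brands": ["Xiaomi", "Samsung", "Xiaomi"],  # assumed column name
    "User_count": [10, 20, 5],
})

keys = ["States", "Years", "Quarter"]

# A direct merge pairs every transaction row with every user row for the same key
naive = tx.merge(users, on=keys)

# Collapsing each table to one row per key first avoids the duplication
tx_agg = tx.groupby(keys, as_index=False)["Transaction_amount"].sum()
users_agg = users.groupby(keys, as_index=False)["User_count"].sum()
clean = tx_agg.merge(users_agg, on=keys, validate="one_to_one")

print(len(naive), len(clean))
```

The `validate="one_to_one"` argument makes pandas raise an error if either side still has duplicate keys, which guards against silently inflated scatter plots or clusters.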

**Trend Analysis: Examine transaction trends over time to anticipate demand fluctuations.**


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Group transaction data by Year & Quarter
trend_tx = agregate_transaction_df.groupby(['Years', 'Quarter'])['Transaction_amount'].sum().reset_index()

plt.figure(figsize=(10,6))
sns.lineplot(data=trend_tx, x='Quarter', y='Transaction_amount', hue='Years', marker='o')
plt.title("Transaction Amount Trend by Quarter")
plt.xlabel("Quarter")
plt.ylabel("Total Transaction Amount (₹)")
plt.legend(title="Year")
plt.show()

# Alternatively: aggregate user registrations trend
trend_users = agregate_user_df.groupby(['Years', 'Quarter'])['User_count'].sum().reset_index()
plt.figure(figsize=(10,6))
sns.lineplot(data=trend_users, x='Quarter', y='User_count', hue='Years', marker='o')
plt.title("User Registrations Trend by Quarter")
plt.xlabel("Quarter")
plt.ylabel("Total User Count")
plt.legend(title="Year")
plt.show()


📈 This line graph illustrates the quarterly trend of total transaction amounts from 2018 to 2024, showcasing consistent growth over time.

Each year is represented by a separate colored line, and the upward slope in most lines indicates steady increases in transaction volumes, especially noticeable in 2024, which peaks across all quarters.

The y-axis uses scientific notation (in ₹ trillions), highlighting the rapid scaling of digital financial activity, and suggesting successful adoption and expansion of digital payment platforms year after year.

This line chart illustrates the trend of user registrations by quarter across five years, 2018 to 2022:

Each year shows a steady rise in user count from Q1 to Q4, with 2022 (navy blue) reaching the highest user registrations, surpassing 300 million by the last quarter.

The consistent upward slope across all years signals strong growth momentum, suggesting increasing digital adoption and expanding platform reach.

The seasonal pattern—especially higher registrations in Q4—may hint at marketing pushes, new feature rollouts, or year-end incentives driving user interest.
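To anticipate demand, it can also help to put all quarters on one chronological axis and compute quarter-over-quarter growth. The sketch below builds a quarterly `PeriodIndex` from the Years and Quarter columns. `toy_trend` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical quarterly totals using the notebook's column names
toy_trend = pd.DataFrame({
    "Years": [2021, 2021, 2021, 2021, 2022, 2022],
    "Quarter": [1, 2, 3, 4, 1, 2],
    "Transaction_amount": [100.0, 110.0, 121.0, 150.0, 165.0, 198.0],
})

# Build one chronological quarterly index from the Years and Quarter columns
labels = toy_trend["Years"].astype(str) + "Q" + toy_trend["Quarter"].astype(str)
toy_trend.index = pd.PeriodIndex(labels, freq="Q")

# Quarter-over-quarter growth along the time-ordered series
series = toy_trend["Transaction_amount"].sort_index()
qoq_growth = series.pct_change().dropna()
print(qoq_growth.round(3))
```

A single time-ordered series makes jumps such as the Q3 to Q4 step easier to compare across years than separate lines per year.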

**Competitive Benchmarking: Compare performance against competitors to identify areas for improvement.**


In [None]:
# PhonePe's quarter-over-quarter growth in total transaction count
phonepe_growth = (
    agregate_transaction_df.groupby(['Years', 'Quarter'])['Transaction_count']
    .sum()
    .pct_change()
    .dropna()
)

# Placeholder industry growth rates (example values only, not real NPCI data)
industry_growth = [0.12, 0.08, 0.15, 0.10, 0.09]

# Align both series to the same number of periods
n = min(len(phonepe_growth), len(industry_growth))
comparison_df = pd.DataFrame({
    'PhonePe_growth': phonepe_growth.values[:n],
    'Industry_growth': industry_growth[:n]
})

comparison_df.plot(kind='bar', figsize=(8,5))
plt.title("PhonePe vs Industry Growth Rate per Quarter")
plt.ylabel("Growth Rate")
plt.xlabel("Quarter")
plt.legend(["PhonePe", "Industry Average"])
plt.show()

📊 This bar chart compares PhonePe's quarter-over-quarter growth rate with an industry growth rate across five quarters. The industry figures are placeholder example values, not real NPCI data, so the comparison illustrates the method rather than establishing actual outperformance.

In the chart as plotted, PhonePe's growth rate is higher in each quarter, peaking in Quarter 1 at 0.8 against placeholder industry values of roughly 0.1.

Even in Quarter 4, where PhonePe dips to 0.2, it is still about double the placeholder rate.

Drawing a conclusion about PhonePe's competitive edge requires replacing the placeholder list with published industry growth figures for the same periods.

In [None]:
tx_sum = agregate_transaction_df.groupby(['Years'])['Transaction_count'].sum()
user_sum = agregate_user_df.groupby(['Years'])['User_count'].sum()

tx_per_user = (tx_sum / user_sum).reset_index()
tx_per_user.columns = ['Years', 'Transactions_per_User']

sns.lineplot(data=tx_per_user, x='Years', y='Transactions_per_User', marker='o')
plt.title("Transactions per User over Years (PhonePe)")
plt.ylabel("Avg Transactions per User")
plt.xlabel("Year")
plt.show()

📈 This line graph shows how the average transactions per PhonePe user have grown from 2018 to 2022:

There's a steady rise from 2018 onward, with a sharp surge in 2022, indicating increased user engagement and possibly better service offerings or incentives.

The y-axis (0 to 100 range) highlights how users became significantly more active over time, suggesting strong adoption of digital payments.

This upward trend signals a deepening relationship between users and the platform—more than just sign-ups, it’s about habitual usage and trust.


### Business case study

1. Decoding Transaction Dynamics on PhonePe

Scenario

PhonePe, a leading digital payments platform, has recently identified significant variations in transaction behavior across states, quarters, and payment categories. While some regions and transaction types demonstrate consistent growth, others show stagnation or decline. The leadership team seeks a deeper understanding of these patterns to drive targeted business strategies.


To decode the transaction dynamics on PhonePe, the following key questions will guide the analysis:

Which states consistently show growth in transaction counts and amounts?

Draw a simple line chart to visualize how transaction counts and amounts have changed over the years for each state.
This makes it easy to spot which states are growing.

In [None]:
#Line chart for transaction count

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,6))
sns.lineplot(data=agregate_transaction_df, x='Years', y='Transaction_count', hue='States', marker='o')
plt.title('Transaction Count Over Years by State')
plt.ylabel('Transaction Count')
plt.xlabel('Year')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

If a state's line keeps rising, its transactions are growing consistently.

For example, a state like Karnataka might show a steady rise in transaction counts from 2018 to 2024, while Kerala might show ups and downs (no consistent growth).



In [None]:
#Line chart for transaction amount

plt.figure(figsize=(12,6))
sns.lineplot(data=agregate_transaction_df, x='Years', y='Transaction_amount', hue='States', marker='o')
plt.title('Transaction Amount Over Years by State')
plt.ylabel('Transaction Amount')
plt.xlabel('Year')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

This chart shows how transaction amounts have changed over the years for each state.
States like Karnataka and Maharashtra show clear upward trends, meaning their transactions are growing steadily.
Meanwhile, states like Bihar and Assam appear flat or slightly declining, indicating stagnation or a drop.
This helps identify where PhonePe can strengthen efforts or maintain momentum to boost growth.
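With many states on one chart, consistent growth is hard to read by eye. A rough sketch of checking it programmatically follows: a state is marked as consistently growing when every year-over-year change in its yearly total is positive. `toy_states` is hypothetical data with the notebook's column names.

```python
import pandas as pd

# Hypothetical yearly records for two states
toy_states = pd.DataFrame({
    "States": ["A", "A", "A", "B", "B", "B"],
    "Years": [2020, 2021, 2022, 2020, 2021, 2022],
    "Transaction_count": [100, 150, 220, 100, 90, 130],
})

# One row per state, one column per year
yearly = (
    toy_states.groupby(["States", "Years"])["Transaction_count"]
    .sum()
    .unstack("Years")
)

# A state grows consistently if every year-over-year change is positive
yoy_change = yearly.diff(axis=1).iloc[:, 1:]
consistent = (yoy_change > 0).all(axis=1)
print(consistent)
```

If the last year in the data covers only some quarters, it would appear as a decline, so a partial year should be dropped before running a check like this.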

Which payment categories (like recharge, bills, merchant payments) drive most of the growth?

In [None]:
category_growth = agregate_transaction_df.groupby('Transaction_type')['Transaction_amount'].sum().reset_index()

# Sort for better visualization
category_growth = category_growth.sort_values('Transaction_amount', ascending=False)

# Plot
plt.figure(figsize=(10,6))
sns.barplot(x='Transaction_amount', y='Transaction_type', data=category_growth, palette='viridis')
plt.title('Total Transaction Amount by Payment Category')
plt.xlabel('Total Transaction Amount')
plt.ylabel('Payment Category')
plt.tight_layout()
plt.show()

This chart displays the total transaction amounts by payment category, highlighting how financial activity is distributed across use cases.

Peer-to-peer payments dominate, reaching approximately ₹2.5 × 10¹⁴, showing strong personal transfer usage.

Merchant payments follow, with nearly ₹1.0 × 10¹⁴ in volume—indicating robust retail engagement.

Categories like recharge & bill payments, financial services, and others show far smaller contributions, pointing to specific areas where usage or outreach could grow.
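The bar chart above ranks categories by total amount, which is not the same as growth. One possible way to see which categories drive growth is to compare the first and last year per category, as sketched below. `toy_cat` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical yearly amounts per payment category
toy_cat = pd.DataFrame({
    "Years": [2021, 2021, 2022, 2022],
    "Transaction_type": ["Merchant payments", "Peer-to-peer payments",
                         "Merchant payments", "Peer-to-peer payments"],
    "Transaction_amount": [100.0, 400.0, 200.0, 500.0],
})

# One row per category, one column per year
by_year = toy_cat.pivot_table(
    index="Transaction_type", columns="Years",
    values="Transaction_amount", aggfunc="sum",
)

first, last = by_year.columns.min(), by_year.columns.max()
growth = pd.DataFrame({
    "absolute_increase": by_year[last] - by_year[first],
    "growth_rate": by_year[last] / by_year[first] - 1,
})

# Share of the total increase that each category accounts for
growth["share_of_increase"] = growth["absolute_increase"] / growth["absolute_increase"].sum()
print(growth)
```

In this toy data the smaller category grows faster in percentage terms while both contribute equally to the absolute increase, so both views are worth reporting.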

2. Insurance Penetration and Growth Potential Analysis

Scenario

PhonePe has ventured into the insurance domain, providing users with options to secure various policies. With increasing transactions in this segment, the company seeks to analyze its growth trajectory and identify untapped opportunities for insurance adoption at the state level. This data will help prioritize regions for marketing efforts and partnerships with insurers.



In [None]:
#Top States by Insurance Transaction Count

import matplotlib.pyplot as plt

top_states = top_insurance_df.groupby("States")["Transaction_count"].sum().sort_values(ascending=False).head(5)

plt.figure(figsize=(8,5))
top_states.plot(kind='bar', color='mediumseagreen')
plt.title("Top 5 States by Insurance Transaction Count")
plt.ylabel("Transaction Count")
plt.xlabel("States")
plt.xticks(rotation=45)
plt.show()

This chart shows the top five Indian states by insurance transaction count, with Karnataka leading at over 300,000 transactions. Maharashtra, Delhi, Telangana, and Uttar Pradesh follow, indicating solid insurance engagement across urban and economically active regions. The data reflects strong adoption of digital insurance services in these states, signaling opportunities for deeper product offerings and outreach.

In [None]:
#Quarter-wise Insurance Growth

# Use the insurance table; agregate_transaction_df would count all payment transactions
quarter_growth = agregate_insurance_df.groupby("Quarter")["Transaction_count"].sum()

plt.figure(figsize=(8,5))
quarter_growth.plot(marker='o', color='dodgerblue')
plt.title("Quarterly Growth in Insurance Transactions")
plt.ylabel("Transaction Count")
plt.xlabel("Quarter")
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()

This line graph shows insurance transaction counts summed over all years for each quarter (Q1 to Q4). Because the years are pooled, it reflects within-year seasonality rather than growth over time. A rise from Q1 to Q4 is consistent with the quarterly insurance trend chart shown earlier, where Quarter 4 leads, and points to purchases concentrating toward the end of the year. The y-axis is a count of transactions, not a rupee amount.

In [None]:
#Lowest States by Insurance Transaction Count

lowest_states = top_insurance_df.groupby("States")["Transaction_count"].sum().sort_values().head(5)

plt.figure(figsize=(8,5))
lowest_states.plot(kind='barh', color='salmon')
plt.title("Bottom 5 States by Insurance Transaction Count")
plt.xlabel("Transaction Count")
plt.ylabel("States")
plt.show()

This chart highlights the five Indian states with the lowest insurance transaction counts, led by Nagaland and ending with Lakshadweep. The short bar lengths indicate limited digital insurance activity in these regions. These low counts may suggest opportunities for outreach, infrastructure development, or policy interventions to boost coverage and financial inclusion.
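Raw counts favor large states, so a penetration measure can be more useful for spotting untapped regions. A minimal sketch, assuming insurance counts and registered users are each available as one row per state, is shown below. `toy_ins`, `toy_reg` and the `insurance_tx_per_1000_users` column are hypothetical names.

```python
import pandas as pd

# Hypothetical state-level totals
toy_ins = pd.DataFrame({"States": ["A", "B", "C"], "Transaction_count": [300, 50, 10]})
toy_reg = pd.DataFrame({"States": ["A", "B", "C"], "Regi_users": [100000, 10000, 1000]})

# Join the two tables; validate guards against duplicate state rows
penetration = toy_ins.merge(toy_reg, on="States", validate="one_to_one")

# Insurance transactions per 1,000 registered users
penetration["insurance_tx_per_1000_users"] = (
    1000 * penetration["Transaction_count"] / penetration["Regi_users"]
)

ranked = penetration.sort_values("insurance_tx_per_1000_users", ascending=False)
print(ranked)
```

A state with a low raw count can still rank high on this ratio, so the two rankings together give a clearer picture of where adoption is weak relative to the user base.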


3. User Engagement and Growth Strategy

Scenario

PhonePe seeks to enhance its market position by analyzing user engagement across different states and districts. With a significant number of registered users and app opens, understanding user behavior can provide valuable insights for strategic decision-making and growth opportunities.


In [None]:
#Top states by registered users

import matplotlib.pyplot as plt

top_states_users = map_user_df.groupby("States")["Regi_users"].sum().sort_values(ascending=False).head(5)

plt.figure(figsize=(8,5))
top_states_users.plot(kind='bar', color='steelblue')
plt.title("Top 5 States by Registered Users")
plt.ylabel("Registered Users")
plt.xticks(rotation=45)
plt.show()

Maharashtra leads the chart with the highest number of registered users, showing strong digital participation. Uttar Pradesh and Karnataka follow closely, highlighting widespread user adoption in both northern and southern regions. Andhra Pradesh and Rajasthan round out the top five, suggesting solid engagement that supports continued regional expansion.

In [None]:
# App opens vs Registered users (state level)
state_data = map_user_df.groupby("States")[["Regi_users", "app_opens"]].sum()

plt.figure(figsize=(8,6))
plt.scatter(state_data["Regi_users"], state_data["app_opens"], color='purple')
plt.title("App Opens vs Registered Users by State")
plt.xlabel("Registered Users")
plt.ylabel("App Opens")

for state in state_data.index:
    plt.annotate(state, (state_data.loc[state,"Regi_users"], state_data.loc[state,"app_opens"]),
                 fontsize=8, alpha=0.7)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

This scatter plot visualizes app engagement across Indian states by comparing the number of app opens with registered users. States like Maharashtra, Karnataka, and Uttar Pradesh exhibit high activity, appearing in the upper-right region, which indicates widespread and frequent usage. States in the lower-left show lower participation, suggesting areas with potential for improved outreach and digital growth.
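Because large states sit in the upper-right of this plot mostly due to their size, a per-user ratio can separate engagement from scale. The sketch below computes app opens per registered user by state. `toy_users` is hypothetical, and it assumes that summing Regi_users across rows is meaningful, which would not hold if that column is cumulative.

```python
import pandas as pd

# Hypothetical rows with the same column names as map_user_df
toy_users = pd.DataFrame({
    "States": ["P", "P", "Q", "Q"],
    "Regi_users": [1000, 1500, 4000, 6000],
    "app_opens": [20000, 30000, 20000, 30000],
})

state_data = toy_users.groupby("States")[["Regi_users", "app_opens"]].sum()

# Engagement intensity: app opens per registered user
state_data["opens_per_user"] = state_data["app_opens"] / state_data["Regi_users"]

ranked = state_data.sort_values("opens_per_user", ascending=False)
print(ranked)
```

Here the smaller state P has the same number of app opens as Q but four times the opens per user, which the raw scatter plot would not reveal.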

4. Transaction Analysis Across States and Districts

Scenario

PhonePe is conducting an analysis of transaction data to identify the top-performing states, districts, and pin codes in terms of transaction volume and value. This analysis will help understand user engagement patterns and identify key areas for targeted marketing efforts.




In [None]:
#Top states by transaction count and amount

import matplotlib.pyplot as plt

# Group by state
state_summary = agregate_transaction_df.groupby("States")[["Transaction_count", "Transaction_amount"]].sum()

# Top 5 by transaction count
top_states_count = state_summary.sort_values("Transaction_count", ascending=False).head(5)

# Top 5 by transaction amount
top_states_amount = state_summary.sort_values("Transaction_amount", ascending=False).head(5)

# Bar plots
plt.figure(figsize=(10,4))
top_states_count["Transaction_count"].plot(kind="bar", color="teal")
plt.title("Top 5 States by Transaction Count")
plt.ylabel("Transaction Count")
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(10,4))
top_states_amount["Transaction_amount"].plot(kind="bar", color="orange")
plt.title("Top 5 States by Transaction Amount")
plt.ylabel("Transaction Amount")
plt.xticks(rotation=45)
plt.show()

1,This chart shows the top five Indian states by transaction count, highlighting regions with the most digital payment activity.

Maharashtra ranks first, indicating a highly active digital user base with frequent transactions—possibly driven by its urban centers and financial hubs.

Karnataka, Telangana, Andhra Pradesh, and Uttar Pradesh follow, reflecting solid digital adoption across both southern and northern states.

2. This bar chart shows the top five Indian states by total transaction amount, offering a comparison of regional economic activity and digital engagement:

Telangana leads with the highest transaction amount, crossing ₹4.0 × 10¹³, reflecting widespread use of digital payments and a mature financial ecosystem.

Karnataka and Maharashtra follow closely, reinforcing their roles as major digital commerce hubs with substantial transaction volumes.

Andhra Pradesh and Uttar Pradesh complete the top five.
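The two rankings differ: one state leads on count while another leads on amount. One way to reconcile them is to look at the average value per transaction. The sketch below assumes the same `States`, `Transaction_count`, and `Transaction_amount` columns used above. The rows are invented, and `avg_transaction_value` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical rows standing in for the aggregated transaction table.
transactions = pd.DataFrame(
    {
        "States": ["State A", "State A", "State B", "State B"],
        "Transaction_count": [600, 400, 100, 100],
        "Transaction_amount": [30_000.0, 20_000.0, 40_000.0, 20_000.0],
    }
)

# Same grouping as the state-level summary in the cell above.
state_summary = transactions.groupby("States")[
    ["Transaction_count", "Transaction_amount"]
].sum()

# Average value per transaction: total amount divided by total count.
state_summary["avg_transaction_value"] = (
    state_summary["Transaction_amount"] / state_summary["Transaction_count"]
)

print(state_summary.sort_values("avg_transaction_value", ascending=False))
```

A state with fewer but larger payments can top the amount ranking without topping the count ranking, as State B does in this toy data.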

In [None]:
#Top pin codes by transaction amount (using top_transaction_df)

pincode_summary = top_transaction_df.groupby("entityName")[["Transaction_count", "Transaction_amount"]].sum()

top_pincodes_amount = pincode_summary.sort_values("Transaction_amount", ascending=False).head(5)

plt.figure(figsize=(10,4))
top_pincodes_amount["Transaction_amount"].plot(kind="bar", color="green")
plt.title("Top 5 Pin Codes by Transaction Amount")
plt.ylabel("Transaction Amount")
plt.xticks(rotation=45)
plt.show()

Bengaluru Urban leads with the highest transaction amount among the top five entries, reflecting its strong digital financial activity. Hyderabad and Pune follow, indicating high engagement in metropolitan areas with robust payment ecosystems. These `entityName` values are district names rather than numeric pin codes, so the chart is better read as a ranking of districts. It emphasizes regional transaction concentration, which is useful for targeting high-value zones in strategic business planning.
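Transaction concentration can also be expressed as a single number: the share of the total amount held by the top five entries. This is a rough sketch on invented totals. The entity labels `E1` to `E7` and the variable `top_share` are hypothetical.

```python
import pandas as pd

# Hypothetical total transaction amount per entity (made-up values).
amounts = pd.Series(
    {"E1": 500.0, "E2": 200.0, "E3": 100.0, "E4": 80.0,
     "E5": 70.0, "E6": 30.0, "E7": 20.0},
    name="Transaction_amount",
)

# Take the five largest totals, then compare them with the overall sum.
top = amounts.nlargest(5)
top_share = top.sum() / amounts.sum()

print(f"Top 5 share of total amount: {top_share:.1%}")
```

A share close to 1 would suggest that activity is concentrated in a few places, while a lower share would suggest a broader spread.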

5. User Registration Analysis

Scenario

PhonePe aims to conduct an analysis of user registration data to identify the top states, districts, and pin codes from which the most users registered during a specific year-quarter combination. This analysis will provide insights into user engagement patterns and highlight potential growth areas.


In [None]:
#Top states by user registrations

import matplotlib.pyplot as plt

state_users = top_user_df.groupby("States")["Regi_users"].sum().sort_values(ascending=False).head(5)

plt.figure(figsize=(10,4))
state_users.plot(kind="bar", color="coral")
plt.title("Top 5 States by Registered Users")
plt.ylabel("Registered Users")
plt.xticks(rotation=45)
plt.show()

This bar chart shows the top five Indian states by number of registered users, offering a snapshot of digital user penetration:

Delhi leads the chart, with the highest user count nearing 10 million, reflecting strong digital adoption and service awareness in the capital.

Karnataka, Uttar Pradesh, Maharashtra, and Haryana follow in descending order, suggesting robust engagement in both southern and northern regions.

The data highlights where digital platforms have succeeded most, which can guide targeted expansion or deepen service offerings in already active states.

In [None]:
#Top pin codes by user registrations

pincode_users = top_user_df.groupby("pincodes")["Regi_users"].sum().sort_values(ascending=False).head(5)

plt.figure(figsize=(10,4))
pincode_users.plot(kind="bar", color="skyblue")
plt.title("Top 5 Pin Codes by Registered Users")
plt.ylabel("Registered Users")
plt.xticks(rotation=45)
plt.show()

This chart highlights the top five Indian pin codes with the highest number of registered users, led by 201301 with over 1.6 million users. Other top regions like 110059 and 560068 show strong digital participation, emphasizing urban engagement. The spread indicates concentrated adoption in specific areas, useful for regional targeting and user expansion strategies.
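The scenario asks for a specific year-quarter combination, whereas the cells above sum over whatever rows `top_user_df` contains. A possible way to restrict the ranking to one period is sketched here. The `Years` and `Quarter` column names, the helper `top_registrations`, and all sample rows are assumptions for illustration and may not match the actual table.

```python
import pandas as pd

# Hypothetical slice of the user table; "Years" and "Quarter" are assumed names.
top_user_df = pd.DataFrame(
    {
        "States": ["State A", "State B", "State A", "State B"],
        "Years": [2022, 2022, 2023, 2023],
        "Quarter": [1, 1, 1, 1],
        "pincodes": ["000001", "000002", "000001", "000002"],
        "Regi_users": [100, 300, 500, 200],
    }
)


def top_registrations(df, year, quarter, by, n=5):
    """Return the n largest Regi_users totals for one year-quarter, grouped by `by`."""
    period = df[(df["Years"] == year) & (df["Quarter"] == quarter)]
    return period.groupby(by)["Regi_users"].sum().nlargest(n)


top_states_2023_q1 = top_registrations(top_user_df, 2023, 1, by="States")
top_pincodes_2023_q1 = top_registrations(top_user_df, 2023, 1, by="pincodes")
print(top_states_2023_q1)
```

The same helper serves both the state and pin code rankings, so the period filter is applied consistently.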

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?

1. Focus marketing on high-potential states & segments:

Analysis shows that states like Maharashtra, Karnataka, and Gujarat consistently lead in transaction volumes and insurance penetration. Intensify localized marketing campaigns and build partnerships there to capitalize on momentum.

2. Boost underperforming regions through targeted offers:

States or districts with stagnating growth can be revived with cashback offers, referral schemes, or educational drives about digital payments and insurance.

3. Leverage customer segmentation:

Cluster analysis of spending patterns can help you design tiered loyalty programs: for example, high-value transactors get premium offers, while low-frequency users get reactivation incentives.

4. Enhance user engagement:

Data on app opens and registered users suggests opportunities to improve in-app experiences and push timely notifications to drive repeat use.

5. Expand insurance cross-sell:

Since insurance transactions are growing, integrate smart nudges in the app (like micro-insurance offers right after bill payments) to further boost uptake.
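The tiering idea in the segmentation recommendation can be prototyped before investing in a full cluster analysis. The sketch below swaps in simple quantile binning with `pd.qcut` as a stand-in for clustering. The per-user table, its `user_id` and `total_spend` columns, the tier labels, and the offer names are all hypothetical.

```python
import pandas as pd

# Hypothetical per-user spending totals (invented values).
users = pd.DataFrame(
    {
        "user_id": range(1, 10),
        "total_spend": [50, 120, 300, 800, 1500, 2500, 4000, 9000, 20000],
    }
)

# Split users into three equally sized spending tiers by quantile.
# This is a simpler substitute for cluster analysis, not the same technique.
users["tier"] = pd.qcut(users["total_spend"], q=3, labels=["low", "mid", "high"])

# Map each tier to an illustrative offer type.
offers = {
    "low": "reactivation incentive",
    "mid": "standard cashback",
    "high": "premium offer",
}
users["offer"] = users["tier"].astype(str).map(offers)

print(users)
```

Quantile tiers are easy to explain to stakeholders. A clustering method could replace `pd.qcut` later if spending patterns turn out to need more than one dimension.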

Summary:
Use these insights to tailor marketing, refine user journeys, and allocate resources effectively. This will help drive higher transaction volumes, improve insurance adoption, and deepen customer loyalty, all aligning directly with your business growth objectives.

# **Conclusion**

Through this comprehensive analysis, it’s clear that PhonePe has significant opportunities to strengthen its market position by leveraging data-driven insights. The trends in transactions, insurance penetration, and user engagement highlight both high-performing regions that can be further maximized and underperforming areas where targeted strategies can stimulate growth. By focusing on tailored marketing, customer segmentation, and personalized user engagement initiatives, PhonePe can drive sustainable business growth, boost transaction volumes, and expand insurance adoption. Overall, these insights equip PhonePe to make informed decisions that align operational efforts with strategic objectives, ensuring long-term success in an increasingly competitive digital payments landscape.











