# Summary & Key Insights: Indian Startup Funding Analysis

This notebook summarizes the findings from a descriptive statistical analysis
of Indian startup funding data.

The focus of this project was not prediction, but understanding data behavior
using core statistical concepts such as central tendency, dispersion, and outliers.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/startup_funding.csv")

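# Strip thousands separators so the amounts can be parsed as numbers.
# The comma is a literal character, so no regex matching is needed.
funding_numeric = (
    df["Amount in USD"]
    .astype(str)
    .str.replace(",", "", regex=False)
)

# Entries that still cannot be parsed as numbers become NaN.
funding_numeric = pd.to_numeric(funding_numeric, errors="coerce")

# Keep only the rows with a parseable funding amount.
valid = funding_numeric.notna()
df_funding = df.loc[valid].copy()
df_funding["Amount in USD"] = funding_numeric[valid]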

In [2]:
summary_stats = {
    "mean_funding_usd": df_funding["Amount in USD"].mean(),
    "median_funding_usd": df_funding["Amount in USD"].median(),
    "std_funding_usd": df_funding["Amount in USD"].std(),
    "total_records": df_funding.shape[0]
}

summary_stats

{'mean_funding_usd': np.float64(18429897.27080872),
 'median_funding_usd': 1700000.0,
 'std_funding_usd': 121373444.12759419,
 'total_records': 2065}

In [3]:
# IQR rule: flag values above Q3 + 1.5 * IQR as outliers.
# Only the upper fence is checked here.
Q1 = df_funding["Amount in USD"].quantile(0.25)
Q3 = df_funding["Amount in USD"].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

outlier_pct = (
    df_funding[df_funding["Amount in USD"] > upper_bound].shape[0]
    / df_funding.shape[0]
) * 100

round(outlier_pct, 2)

13.7
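
The cell above checks only the upper fence. As a minimal sketch of the two-sided IQR rule, the hypothetical helper `iqr_fences` below returns both fences. The amounts in `sample_usd` are made up for illustration and are not taken from the dataset; on the real data the same helper could be called with `df_funding["Amount in USD"]`.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up funding amounts in USD, for illustration only.
sample_usd = pd.Series(
    [100_000, 250_000, 500_000, 1_000_000,
     2_000_000, 5_000_000, 50_000_000, 300_000_000],
    dtype="float64",
)

lower, upper = iqr_fences(sample_usd)
n_low = int((sample_usd < lower).sum())   # rounds below the lower fence
n_high = int((sample_usd > upper).sum())  # rounds above the upper fence
print(f"lower={lower:,.0f}  upper={upper:,.0f}  low={n_low}  high={n_high}")
```

For strongly right-skewed, non-negative amounts like these, the lower fence tends to fall below zero, in which case no round can be flagged on the low side and checking only the upper fence gives the same count.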

## Key Insights

1. **Funding data is highly right-skewed**
   - Mean funding (about $18.4M) is roughly 11 times the median ($1.7M) because
     a small number of very large funding rounds pull the average upward.

2. **Median is a better measure of “typical” startup funding**
   - While the mean funding exceeds $18M, the median is $1.7M: half of the
     2,065 rounds with a parseable amount raised $1.7M or less.

3. **Dispersion is extremely high**
   - Standard deviation (about $121M) is more than six times the mean (about $18.4M),
     highlighting very high variability in funding outcomes.

4. **Outliers are structurally important**
   - About 13.7% of funding rounds lie above the upper IQR fence (Q3 + 1.5 × IQR)
     and are flagged as outliers.
   - These most likely represent legitimate late-stage or unicorn investments rather
     than data errors, although the IQR rule alone cannot confirm this.

5. **Outliers disproportionately influence averages**
   - Removing IQR-defined outliers cuts the mean funding nearly sixfold,
     while the median changes only marginally.
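
The insights above can be checked with a few lines of pandas. This is a sketch under stated assumptions, not the notebook's original code: `outlier_impact` is a hypothetical helper, and `sample_usd` holds made-up amounts rather than rows from the dataset. It compares the mean and median with and without the rounds above the upper IQR fence, then computes the sample skewness and the coefficient of variation (standard deviation divided by mean).

```python
import pandas as pd


def outlier_impact(values: pd.Series, k: float = 1.5) -> pd.DataFrame:
    """Compare mean and median with and without upper IQR outliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    trimmed = values[values <= upper]  # drop rounds above the upper fence
    return pd.DataFrame(
        {
            "mean": [values.mean(), trimmed.mean()],
            "median": [values.median(), trimmed.median()],
            "count": [len(values), len(trimmed)],
        },
        index=["all_rounds", "without_upper_outliers"],
    )


# Made-up funding amounts in USD, for illustration only.
sample_usd = pd.Series(
    [100_000, 250_000, 500_000, 1_000_000,
     2_000_000, 5_000_000, 50_000_000, 300_000_000],
    dtype="float64",
)

impact = outlier_impact(sample_usd)
mean_ratio = (
    impact.loc["all_rounds", "mean"]
    / impact.loc["without_upper_outliers", "mean"]
)
skewness = sample_usd.skew()              # positive for a long right tail
cv = sample_usd.std() / sample_usd.mean() # coefficient of variation

print(impact)
print(f"mean shrinks by a factor of {mean_ratio:.1f}")
print(f"skewness={skewness:.2f}  cv={cv:.2f}")
```

Passing `df_funding["Amount in USD"]` instead of `sample_usd` would give the corresponding figures for the real data; the ratio printed for the made-up sample says nothing about the dataset itself.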

## What This Project Demonstrates

- A structured approach to exploratory data analysis
- Understanding of when averages can be misleading
- Practical interpretation of variance and standard deviation
- Proper handling and interpretation of outliers
- Ability to translate statistical results into business-relevant insights


## Limitations & Next Steps

- This analysis focuses only on descriptive statistics and does not attempt
  to model or predict funding outcomes.
- Future work could include:
  - Time-based analysis of funding trends
  - Comparison across industries or cities
  - Log-scale modeling or hypothesis testing for deeper insights
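
As a starting point for the log-scale idea, the sketch below applies a base-10 log transform and compares skewness before and after. The amounts in `sample_usd` are made up for illustration, and the transform assumes every amount is strictly positive; zero amounts, if the real data contains any, would need to be handled first.

```python
import numpy as np
import pandas as pd

# Made-up funding amounts in USD, for illustration only.
sample_usd = pd.Series(
    [100_000, 250_000, 500_000, 1_000_000,
     2_000_000, 5_000_000, 50_000_000, 300_000_000],
    dtype="float64",
)

# log10 compresses the long right tail: each unit is one order of magnitude.
log_amount = np.log10(sample_usd)

raw_skew = sample_usd.skew()
log_skew = log_amount.skew()
print(f"skewness raw={raw_skew:.2f}  log10={log_skew:.2f}")
```

On the real data, `np.log10(df_funding["Amount in USD"])` would play the role of `log_amount`, and statistics or tests computed on it would describe orders of magnitude rather than dollar amounts.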