# Outlier Detection Using IQR Method

Outliers are extreme values that don't fit the typical funding pattern. We use the Interquartile Range (IQR) method to flag them and measure how much they skew our summary statistics. This helps us understand if the average is being pulled up by a few mega-rounds or if it represents the typical startup.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/startup_funding.csv")

funding_numeric = (
    df["Amount in USD"]
    .astype(str)
    .str.replace(",", "", regex=False)  # the comma is a literal, not a pattern
)

funding_numeric = pd.to_numeric(funding_numeric, errors="coerce")

df_funding = df.loc[funding_numeric.notna()].copy()
df_funding["Amount in USD"] = funding_numeric[funding_numeric.notna()]

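We start by loading the raw data and cleaning the funding column. This means removing commas from the numbers and converting everything to actual numeric values. Anything that still can't be parsed becomes NaN (that's what `errors="coerce"` does), and we keep only the rows with a valid amount. This prep work is crucial because we can't identify outliers statistically if our data is messy or text-based.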
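To make the cleaning step concrete, here is a minimal sketch on a made-up series. The values in `raw` are hypothetical and not taken from the dataset; they only show what happens to commas, unparseable text, and missing entries.

```python
import pandas as pd

# Hypothetical raw values, invented for illustration (not from the dataset).
raw = pd.Series(["1,000,000", "250000", "undisclosed", None, "3,50,000"])

# Same idea as the cell above: strip commas, then coerce to numbers.
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", "", regex=False),
    errors="coerce",  # anything unparseable becomes NaN
)

n_dropped = int(cleaned.isna().sum())  # rows we would lose
amounts = cleaned.dropna()             # rows we would keep

print(amounts.tolist(), n_dropped)
```

Checking how many rows get dropped is worth doing on the real data too, since a large count would mean the cleaning rule is too crude.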

In [2]:
Q1 = df_funding["Amount in USD"].quantile(0.25)
Q3 = df_funding["Amount in USD"].quantile(0.75)
IQR = Q3 - Q1

Q1, Q3, IQR

(np.float64(470000.0), np.float64(8000000.0), np.float64(7530000.0))

Next we calculate the quartiles: Q1 is the value that 25% of the data falls below, and Q3 is the value that 75% falls below. The IQR (interquartile range) is the gap between Q1 and Q3, so it spans the middle 50% of funding rounds. Here that middle zone runs from 470K to 8M, which gives an IQR of 7.53M. This middle zone is what we use to decide what counts as "weird" and what's normal.
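A small worked example can help build intuition for what `quantile` returns. The series `toy` below is invented for illustration and chosen so the quartiles land on whole numbers.

```python
import pandas as pd

# Toy amounts invented for illustration; not taken from the dataset.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = toy.quantile(0.25)  # pandas interpolates linearly by default
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Share of values inside [q1, q3]; roughly half, endpoints included.
share_inside = toy.between(q1, q3).mean()

print(q1, q3, iqr, share_inside)
```

Notice that the extreme value 100 has no effect on Q1 or Q3, which is why the IQR is a robust yardstick.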

In [3]:
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

lower_bound, upper_bound

(np.float64(-10825000.0), np.float64(19295000.0))

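The standard rule for finding outliers (often called Tukey's fences) is to multiply the IQR by 1.5 and extend that distance below Q1 and above Q3. Anything beyond these bounds is considered an outlier. In our case the lower bound is negative (about -10.8M), and a funding amount can't be negative, so nothing can fall below it. Only the upper bound of about 19.3M matters: anything above it is a mega-round that's way out of the typical funding pattern.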
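The bound arithmetic can be checked by hand. The sketch below plugs in the Q1 and Q3 values reported in the output above; the helper name `iqr_bounds` and its `k` parameter are our own, introduced just for this illustration.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Q1 and Q3 as reported in the notebook output above.
lower, upper = iqr_bounds(470_000.0, 8_000_000.0)

# Funding amounts cannot be negative, so a negative lower fence
# means only the upper fence can flag anything.
only_upper_matters = lower < 0

print(lower, upper, only_upper_matters)
```

Keeping `k` as a parameter makes it easy to try a stricter fence such as 3.0, which is sometimes used for "extreme" outliers.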

In [4]:
outliers = df_funding[
    df_funding["Amount in USD"] > upper_bound
]

outliers.shape

(283, 10)

Now we actually filter the data to find all funding rounds that exceed the upper bound. The result has 283 rows, so 283 rounds count as mega-rounds, and the filtered frame tells us exactly which startups raised them. The count matters because if only a few rounds are outliers, they might be skewing our statistics. If there are lots, maybe mega-rounds are just part of normal startup life.

In [7]:
outlier_percentage = (outliers.shape[0] / df_funding.shape[0]) * 100
outlier_percentage

13.704600484261501

We convert the outlier count into a percentage so we can say something concrete: about 13.7% of funding rounds are mega-rounds. That is roughly one round in seven, so outliers here are not a tiny fringe but a meaningful chunk of the startup landscape.
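As a side note, the count and the percentage can both come from a single boolean mask, since the mean of a boolean series is the fraction of `True` values. This is a sketch on invented data; `toy` and `toy_upper_bound` are hypothetical names and values.

```python
import pandas as pd

# Toy amounts invented for illustration; not taken from the dataset.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
toy_upper_bound = 13.0  # Q3 + 1.5 * IQR for this toy series (Q1 = 3, Q3 = 7)

is_outlier = toy > toy_upper_bound

outlier_count = int(is_outlier.sum())  # True counts as 1
outlier_pct = is_outlier.mean() * 100  # fraction of True values, as a percent

print(outlier_count, round(outlier_pct, 2))
```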

In [8]:
mean_with_outliers = df_funding["Amount in USD"].mean()
median_with_outliers = df_funding["Amount in USD"].median()

mean_without_outliers = df_funding[
    df_funding["Amount in USD"] <= upper_bound
]["Amount in USD"].mean()

median_without_outliers = df_funding[
    df_funding["Amount in USD"] <= upper_bound
]["Amount in USD"].median()

(mean_with_outliers, median_with_outliers,
 mean_without_outliers, median_without_outliers)

(np.float64(18429897.27080872),
 1700000.0,
 np.float64(3159511.1471492704),
 1000000.0)

Finally, we calculate the mean and median both with and without outliers. This comparison shows us how much the mega-rounds are dragging up our average. If the mean drops way down but the median stays similar, that's a dead giveaway that outliers are inflating the "typical" picture we were seeing before.
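The same comparison is easier to read as a small table. The sketch below builds one on invented data (the `toy` series is hypothetical), reusing a single mask instead of repeating the filter for each statistic.

```python
import pandas as pd

# Toy amounts invented for illustration; not taken from the dataset.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
keep = toy <= upper  # one mask, reused for both statistics

summary = pd.DataFrame(
    {
        "mean": [toy.mean(), toy[keep].mean()],
        "median": [toy.median(), toy[keep].median()],
    },
    index=["with outliers", "without outliers"],
)

print(summary)
```

In this toy case a single extreme value more than triples the mean while the median barely moves, which is the pattern we are checking for in the funding data.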

## What the Numbers Actually Mean

Look at the comparison:

- With outliers: mean is about 18.43M but median is only 1.7M
- Without outliers: mean drops to 3.16M and median is 1.0M

The big jump in the mean when we include outliers (from 3.16M to 18.43M, almost six times higher) shows that the mega-rounds, about 14% of all rounds, are inflating the "average." The median moves far less (from 1.0M to 1.7M) because it doesn't care how big those giants are, it just looks at the middle value. This tells us that if you're a typical startup, you should pay way more attention to the median (around 1 to 1.7M) than the mean. The mean is getting pulled up by unicorn-hunting companies raising huge rounds, but that's not what most startups actually experience.
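To put a number on "how much", we can compute the relative drop in each statistic from the values reported in the notebook output. The helper `pct_drop` is our own name, introduced for this sketch.

```python
# Values copied from the notebook output.
mean_with, median_with = 18429897.27080872, 1700000.0
mean_without, median_without = 3159511.1471492704, 1000000.0

def pct_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100

mean_drop = pct_drop(mean_with, mean_without)
median_drop = pct_drop(median_with, median_without)

print(f"mean drop: {mean_drop:.1f}%, median drop: {median_drop:.1f}%")
```

The mean loses roughly 83% of its value once the mega-rounds are removed, while the median loses roughly 41%, so the median is the steadier summary of a typical round.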