# Upsets overview

One of the most exciting parts of March Madness is the upsets. It's something special when a double-digit seed beats a highly ranked team; it's one of the most exciting events in sports. One strategy for the Kaggle competition is to take a couple of calculated risks with submissions. A very good model may place in the top 10%, but placing high on the leaderboard often comes down to luck.

With the submission deadline coming, one strategy could be to take a chance on a couple of different games to differentiate a submission. Rather than guessing, let's look at the data and see if any picks are better than others. We'll start by plotting the frequency of upsets.

## Gather games by Seed and Margin of Victory (MOV)

The following section joins the relevant CSVs to compute a margin of victory (or defeat) for each game. The data is re-ordered to always have the lower (better) seed first to assist with future calculations. First Four games are skipped since they don't count in this competition. Games from the 2021 season are also skipped, since all games were played in a single location and that tournament had other anomalies.

In [None]:
import marchmania_2022 as marchmania

# See https://www.kaggle.com/davidmezzetti/marchmania-2022
results = marchmania.results()
results
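
The `marchmania` helper hides the reordering step, so here is a minimal sketch of what putting the lower (better) seed first could look like. This is not the helper's actual code. The input columns (`WinSeed`, `LoseSeed`, `WinScore`, `LoseScore`) and the function name `orient_by_seed` are hypothetical; the output columns mirror the ones used later in this notebook (`LSeed`, `HSeed`, `LScore`, `HScore`, `MOV`, `Wins`).

```python
import pandas as pd

def orient_by_seed(games: pd.DataFrame) -> pd.DataFrame:
    """Reorder each game so the lower (better) seed comes first (hypothetical helper)."""
    # True when the winner was the lower (better) seed.
    # Same-seed games fall through as a loss here; a real pipeline needs its own rule.
    favorite_won = games["WinSeed"] < games["LoseSeed"]

    out = pd.DataFrame({
        "LSeed": games[["WinSeed", "LoseSeed"]].min(axis=1),
        "HSeed": games[["WinSeed", "LoseSeed"]].max(axis=1),
        # Pick the score that belongs to the lower seed / higher seed
        "LScore": games["WinScore"].where(favorite_won, games["LoseScore"]),
        "HScore": games["LoseScore"].where(favorite_won, games["WinScore"]),
    })

    # Margin of victory from the lower seed's point of view (negative = upset)
    out["MOV"] = out["LScore"] - out["HScore"]
    out["Wins"] = favorite_won.astype(int)
    return out

# Toy games: a 1-seed win, a 12-over-5 upset and a 9-over-8 win
toy = pd.DataFrame({
    "WinSeed": [1, 12, 9],
    "LoseSeed": [16, 5, 8],
    "WinScore": [80, 70, 65],
    "LoseScore": [60, 66, 64],
})
oriented = orient_by_seed(toy)
print(oriented)
```

With this layout, `Wins.eq(0)` marks every game the higher seed won, which is what the upset filters below rely on.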

## Get average MOV per seed pairing

Group the data together based on seed-pairings. Compute the margin of victory (MOV) and win percentage for the lower (better) seed.

In [None]:
# See https://www.kaggle.com/davidmezzetti/marchmania-2022
averages = marchmania.averages(results)
averages
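
For readers without the helper module, the grouping described above can be sketched with a plain pandas `groupby`. This is an assumption about the shape of the computation, not the helper's implementation. `seed_pair_averages` and the `Win Pct` column name are made up for illustration; `Count`, `MOV` and `Key` match columns referenced elsewhere in this notebook.

```python
import pandas as pd

def seed_pair_averages(results: pd.DataFrame) -> pd.DataFrame:
    """Aggregate games by seed pairing (hypothetical stand-in for marchmania.averages)."""
    grouped = results.groupby(["LSeed", "HSeed"]).agg(
        Count=("Wins", "count"),  # games played for this pairing
        Wins=("Wins", "sum"),     # wins by the lower (better) seed
        MOV=("MOV", "mean"),      # average margin for the lower seed
    ).reset_index()

    grouped["Win Pct"] = grouped["Wins"] / grouped["Count"]

    # Label such as "5-12", handy as a plot axis
    grouped["Key"] = grouped["LSeed"].astype(str) + "-" + grouped["HSeed"].astype(str)
    return grouped

# Toy data: three 5-12 games and one 8-9 game
toy_results = pd.DataFrame({
    "LSeed": [5, 5, 5, 8],
    "HSeed": [12, 12, 12, 9],
    "MOV": [10, -4, 3, -1],
    "Wins": [1, 0, 1, 0],
})
toy_averages = seed_pair_averages(toy_results)
print(toy_averages)
```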

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

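# Upsets: seed disparity of at least 4 where the lower (better) seed lost
upsets = results[(results.HSeed - results.LSeed >= 4) & results.Wins.eq(0)]

# Count upsets per (HSeed, LSeed) pairing, then pivot HSeed into columns
upsets = upsets.groupby(["HSeed", "LSeed"]).agg({"HScore": ["count"]})
upsets = upsets.unstack(level=0)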

fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(upsets, cmap="Reds")

# Tick labels for the winning seeds; assumes the grid has one column per seed from 5 to 16
xticks_labels = range(5, 17)

plt.xticks(np.arange(12) + .5, labels=xticks_labels)
plt.xlabel("Winning Seed")
plt.ylabel("Losing Seed")
plt.show()

*Note: this notebook defines an upset as games with a seed disparity >= 4*
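
As a small illustration of that definition, the sketch below flags upsets on toy data and counts them per seed pairing with `pd.crosstab`, which gives a grid with plainly named axes. The function name `upset_grid` and the toy rows are made up; the `LSeed`, `HSeed` and `Wins` columns follow the layout used in this notebook.

```python
import pandas as pd

MIN_DISPARITY = 4  # seed gap required before a loss counts as an upset

def upset_grid(results: pd.DataFrame) -> pd.DataFrame:
    """Count upsets per (losing seed, winning seed) pairing."""
    is_upset = (
        (results["HSeed"] - results["LSeed"] >= MIN_DISPARITY)
        & results["Wins"].eq(0)  # the lower (better) seed lost
    )
    upsets = results[is_upset]

    # Rows: losing (lower) seed, columns: winning (higher) seed
    return pd.crosstab(upsets["LSeed"], upsets["HSeed"])

toy_results = pd.DataFrame({
    "LSeed": [5, 5, 6, 8, 2],
    "HSeed": [12, 12, 11, 9, 7],
    "Wins":  [0, 0, 0, 0, 1],
})
grid = upset_grid(toy_results)
print(grid)
```

The 9-over-8 result is left out because the seed gap is only 1, and the 2-7 game is left out because the 2-seed won.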

Looking at the graph above, a couple points stick out. The 5-12 and 6-11 upsets are the most common, not surprisingly. Some years a majority of 11s or 12s win. But a couple other points stick out.

2-7, 2-10, 3-7, 3-10 and 3-11 upsets are also fairly common. The advantage of these upsets is that a team had to win at least one tournament game to get to that point, so in a way it could be said that the team is more reliable.

Let's look closer at the more common data points.

# Common upsets

In [None]:
stats = []
columns = None
for row in upsets.stack().sort_values(by=("HScore", "count"), ascending=False).iterrows():
    lseed, hseed = row[0]

    # See https://www.kaggle.com/davidmezzetti/marchmania-2022
    values = marchmania.stats(averages[averages.LSeed.eq(lseed) & averages.HSeed.eq(hseed)], False)
    if columns is None:
        columns = values.keys()

    stats.append((lseed, hseed) + tuple(values))

stats = pd.DataFrame(stats, columns=("LSeed", "HSeed") + tuple(columns))

stats[stats.Count.gt(10) & stats.MOV.lt(6)].sort_values(by="MOV").reset_index(drop=True)

The table above shows statistics on the more common upsets in the graph above. The data shows the win percentage of the lower (better) seed beating the higher seed and margin of victory.

Looking at 3-7 and 3-10 matchups, there are a fair number of upsets in that category and the games tend to be close. In this scenario, a 7 or 10 seed would need to win 2 games to get to this point, most likely beating a 2-seed along the way. 

4-12, 3-11, 2-7 and 2-10 are Round of 32 matchups that are also worth a look. These teams had to win a tournament game to get to this point.

Let's deep dive further into the seed matchups. 

# 8-9 seed analysis

The following plot shows how 8-9 seeds do against lower (better) seeds.

The MOV (Margin of Victory) is the average number of points by which the lower (better) seed beats the higher (worse) seed. The wins and win percentage are for the lower (better) seed beating the higher (worse) seed. The lower the MOV, the better the higher (worse) seed does. The fewer wins and the lower the win percentage, the higher the frequency of an upset.

In [None]:
import seaborn as sns

def plot(data, palette, size):
    sns.set(style="whitegrid", color_codes=True, rc={'figure.figsize': size})
    sns.barplot(data=data, x="Key", y="MOV", palette=palette)

plot(averages[averages.HSeed.isin([8, 9])], "Greens_d", (15, 5))

## 8-9 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(8) & averages.HSeed.eq(9)])

9-seeds actually have a few more wins than 8-seeds, but this is a toss-up. With two submissions available, some take a 50/50 toss-up game like this and set the probability to 1 for the 8-seed in one submission and to 1 for the 9-seed in the other.
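
The two-submission idea above can be sketched as a small helper that overrides a single game in a copy of a submission. It assumes the usual Kaggle layout for this competition: an `ID` column of the form `Season_TeamID1_TeamID2` with the lower team ID first, and a `Pred` column holding the probability that the first team wins. The function name `force_winner` and the team IDs are placeholders, not real tournament teams.

```python
import pandas as pd

def force_winner(submission, winner_id, loser_id, season=2022):
    """Return a copy of the submission with one game set to a certain outcome."""
    # Assumed ID layout: lower team ID always comes first
    low, high = sorted((winner_id, loser_id))
    game_id = f"{season}_{low}_{high}"

    out = submission.copy()
    mask = out["ID"].eq(game_id)
    if not mask.any():
        raise KeyError(f"{game_id} not found in submission")

    # Pred is the probability that the first (lower ID) team wins
    out.loc[mask, "Pred"] = 1.0 if winner_id == low else 0.0
    return out

# Placeholder IDs: 1101 and 1102 stand in for the 8-seed and 9-seed
base = pd.DataFrame({
    "ID": ["2022_1101_1102", "2022_1101_1103"],
    "Pred": [0.5, 0.7],
})
sub_a = force_winner(base, winner_id=1101, loser_id=1102)  # 8-seed wins
sub_b = force_winner(base, winner_id=1102, loser_id=1101)  # 9-seed wins
print(sub_a, sub_b, sep="\n")
```

Whichever team wins, exactly one of the two submissions gets the game fully right and the other takes the maximum penalty for it.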

## 8-9 seeds in the 2nd round

In all but 1 case, the winner of the 8-9 matchup plays a 1-seed. Stats for that matchup below.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(1) & averages.HSeed.isin([8, 9])])

As expected, 8-9s don't do well against 1-seeds with a MOV of -11.55, typically a blowout. Let's look at what happens in future rounds in the case of an 8-1 or 9-1 upset.

## 8-9 seeds in later rounds

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.LSeed.ne(1) &
    averages.HSeed.isin([8, 9]) &
    (averages.LSeed.ne(8) | averages.HSeed.ne(9))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

When an 8 or 9 seed beats a 1-seed, it gets more interesting. Look at the entries for 4-8, 4-9, 5-8 and 5-9 matchups and the averages across this set. These games don't occur often but when they do, it's meaningful. When seeds are used as a feature, most models will be able to pick up on this, which normally would be good.

8-9s actually have a higher win percentage than 4-5s! Once again this is a small sample size but models can pick up on this.

# 7-10 seed analysis

The following plot shows how 7-10 seeds do against lower (better) seeds. 

In [None]:
plot(averages[averages.HSeed.isin([7, 10])], "Blues_d", (15, 5))

## 7-10 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(7) & averages.HSeed.eq(10)])

7-10 matchups are not as close as 8-9 games, but they are still close and there are plenty of wins for the 10-seed.

## 7-10 seeds in the 2nd round

In the vast majority of cases, the winner of the 7-10 matchup plays a 2-seed. Stats for that matchup below.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(2) & averages.HSeed.isin([7, 10])])

7-10s have a much better shot against a 2-seed than an 8-9 has against a 1-seed.

This is a fairly common upset. For those who do make it past the 2-seed, let's see how later round games look.

## 7-10 seeds in later rounds

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.LSeed.ne(2) & 
    averages.HSeed.isin([7, 10]) &
    (averages.LSeed.ne(7) | averages.HSeed.ne(10))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

The later round games are closer but the win percentage overall isn't higher. There are a couple of matchups to look at: both the 3-7 and 3-10 matchups lead to close games. Plenty of upsets to consider in this group.

# 6-11 seed analysis

The following plot shows how 6-11 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([6, 11])], "Purples_d", (15, 5))

## 6-11 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(6) & averages.HSeed.eq(11)])

6-11 matchups are still pretty close. There are a lot of 11-6 upsets and this is a good one to consider.

Next let's look at how 6-11 seeds do against a 3-seed.

## 6-11 seeds in the 2nd round

In the vast majority of cases, the winner of the 6-11 matchup plays a 3-seed. Stats for that matchup below.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(3) & averages.HSeed.isin([6, 11])])

6-11 seeds do well in this matchup. In many ways, 6-11 seeds are set up well to make a run. Let's look at all upset possibilities for a 6-11 seed.

## 6-11 seeds in later rounds

Data for later rounds below. This also includes 3-6 and 3-11 matchups given the relatively high percentage chance of winning for the lower seeds.

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.HSeed.isin([6, 11]) &
    (averages.LSeed.ne(6) | averages.HSeed.ne(11))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

There are a lot of great matchups to look at in this group. If a 2-seed is upset, 11-seeds historically have a good shot at advancing to the Elite 8. In the 1-11 matchup, 11-seeds have won 3 out of 7 times.
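
A record like 3 wins in 7 games is a very small sample, and a quick way to see how much uncertainty sits behind it is a Wilson score interval. This is an added illustration rather than part of the original analysis. The function name `wilson_interval` is made up, and the 95% level (`z=1.96`) is an assumption.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate (z=1.96 is roughly a 95% level)."""
    if games <= 0:
        raise ValueError("games must be positive")

    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half, center + half

# 11-seeds beating 1-seeds: 3 wins in 7 games
low, high = wilson_interval(3, 7)
print(f"{low:.2f} to {high:.2f}")
```

For 3 of 7 the interval runs from roughly 16% to 75%, so the historical rate is a hint worth noting rather than a reliable estimate.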

# 5-12 seed analysis

The following plot shows how 5-12 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([5, 12])], "Reds_d", (15, 5))

## 5-12 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(5) & averages.HSeed.eq(12)])

There are a lot of 12-5 upsets. In some years, more 12s win than 5s. Along with the 11-6 upset, this is the most common upset, as shown previously.

## 5-12 seeds in the 2nd round

In the vast majority of cases, the winner of the 5-12 matchup plays a 4-seed. Stats for that matchup below.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(4) & averages.HSeed.isin([5, 12])])

Not very surprising that a 4-5 matchup is close. A 4-12 matchup is also reasonably close.

## 5-12 seeds in later rounds

Data for later rounds below. This also includes 4-12 matchups given the relatively high percentage chance of winning.

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    (averages.LSeed.ne(4) | averages.HSeed.ne(5)) & 
    averages.HSeed.isin([5, 12]) &
    (averages.LSeed.ne(5) | averages.HSeed.ne(12))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

5-12s don't have a great path. They usually have to face a 1 seed in the Sweet 16 and that doesn't go well.

# 4-13 seed analysis

The following plot shows how 4-13 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([4, 13])], "Greys_d", (15, 5))

## 4-13 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(4) & averages.HSeed.eq(13)])

While there are upsets here, it's not very common.

## 4-13 seeds in the 2nd round

Let's look at the 5-13 matchup if a 13-seed were to upset a 4-seed.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(5) & averages.HSeed.isin([4, 13])])

Not much to be excited about here. Let's look at later rounds for both 4 and 13 seeds.

## 4-13 seeds in later rounds


In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.HSeed.isin([4, 13]) &
    (averages.LSeed.ne(4) | averages.HSeed.ne(13))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

13-seeds that upset a 4-seed don't have a great track record after that. Even in the case of a 12-13 matchup, that doesn't go well. 4-seeds have to face a 1-seed in the Sweet 16, so that typically limits how far they can go. 

# 3-14 seed analysis

The following plot shows how 3-14 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([3, 14])], "Greens_d", (15, 5))

## 3-14 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(3) & averages.HSeed.eq(14)])

Couple upsets every once in a while but the 14-3 upset is quite uncommon. 

## 3-14 seeds in the 2nd round

Let's look at the 6-14 matchup if a 14-seed were to upset a 3-seed.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(6) & averages.HSeed.isin([3, 14])])

Not much to be excited about here. Let's look at later rounds for both 3 and 14 seeds.

## 3-14 seeds in later rounds

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.HSeed.isin([3, 14]) &
    (averages.LSeed.ne(3) | averages.HSeed.ne(14))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

Past an occasional 14-3 upset, 14-seeds don't have much of a chance. 3-seeds are competitive but hardly an upset pick. 

# 2-15 seed analysis

The following plot shows how 2-15 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([2, 15])], "Blues_d", (15, 5))

## 2-15 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(2) & averages.HSeed.eq(15)])

As expected, 2-seeds almost always beat 15-seeds. 

## 15 seeds in later rounds

2-seeds aren't considered upset picks in any scenario. So let's see what 15-seeds do if they do upset a 2-seed.

In [None]:
matchups = averages[
    averages.Count.ge(3) &
    averages.HSeed.ne(2) &
    averages.HSeed.isin([2, 15]) &
    (averages.LSeed.ne(2) | averages.HSeed.ne(15))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

Not much here, we've already seen much more interesting possibilities above. 

# 1-16 seed analysis

The following plot shows how 1-16 seeds do against lower (better) seeds. This is just for completeness; there has only been a single 16-1 upset and there isn't much to pick here.

In [None]:
plot(averages[averages.HSeed.isin([1, 16])], "Greens_d", (15, 5))

## 1-16 seed 1st round matchup

In [None]:
marchmania.stats(averages[averages.LSeed.eq(1) & averages.HSeed.eq(16)])

Only once has a 16-seed beaten a 1-seed.

## 16 seeds in later rounds

1-seeds aren't considered upset picks in any scenario. So let's see what 16-seeds do if they do upset a 1-seed.

In [None]:
matchups = averages[
    averages.HSeed.ne(1) &
    averages.HSeed.isin([1, 16]) &
    (averages.LSeed.ne(1) | averages.HSeed.ne(16))
]

marchmania.stats(matchups)

In [None]:
matchups.sort_values(by="MOV").reset_index(drop=True)

The one time a 16-seed won, they did have a close game but otherwise not much to see here!

# Upsets by Day

In [None]:
import seaborn as sns

def plot(data, palette, size):
    sns.set(style="whitegrid", color_codes=True, rc={'figure.figsize': size})
    sns.barplot(data=data, x="Day", y="Upset Pct", palette=palette)

# Get all upsets
upsets = results[(results.HSeed - results.LSeed >= 4) & results.Wins.eq(0)]

# Get total number of games with seed disparity >= 4 by day
total = results[(results.HSeed - results.LSeed >= 4)].groupby(["DayNum"]).agg({"DayNum": "count"})
total = total.rename_axis(None).reset_index()
total.columns = ["DayNum", "Count"]

# Get total number of upsets by day
averages = upsets.groupby(["DayNum"]).agg({"DayNum": ["count"]}).reset_index()
averages.columns = ["DayNum", "Count"]

# Add upset percentage
averages = averages.merge(total, how="left", left_on=["DayNum"], right_on=["DayNum"])
averages["Upset Pct"] = averages["Count_x"] / averages["Count_y"]
averages["Day"] = averages.apply(lambda x: marchmania.day(x["DayNum"]), axis=1)
averages = averages.sort_values(by="DayNum")

# Draw the graph
plot(averages, "Blues_d", (15, 5))

The graph above shows the percentage of games with a seed disparity >= 4 that ended in an upset, by day of the tournament. The points are labeled by round and day, for example Round of 64 Day 1 is R64 D1.

A couple of data points stick out: the 2nd day of the Round of 32 and the 1st day of the Sweet 16. Do the normal Friday/Sunday teams tend to lose more than the Thursday/Saturday teams? The first day of the Sweet 16 tends to make sense, as in a normal year all teams have to travel to a new location and playing first can be challenging in sports in general. This could certainly be a feature to consider for modeling.
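
The per-day percentage can also be computed in a single pass by flagging upsets and taking the group mean. The sketch below is an alternative formulation on toy data, not the cell used for the graph. The function name `upset_rate_by_day` and the output column names are illustrative. One difference from the merge-based approach: days with eligible games but no upsets stay in the table with a rate of 0.

```python
import pandas as pd

def upset_rate_by_day(results: pd.DataFrame, min_disparity: int = 4) -> pd.DataFrame:
    """Upset counts and rates per day for games with a large enough seed gap."""
    eligible = results[results["HSeed"] - results["LSeed"] >= min_disparity]

    # Upset = the lower (better) seed lost
    flagged = eligible.assign(Upset=eligible["Wins"].eq(0))

    # The mean of a boolean column is the share of True values
    rates = flagged.groupby("DayNum")["Upset"].agg(
        Upsets="sum", Games="count", UpsetPct="mean"
    )
    return rates.reset_index()

toy_results = pd.DataFrame({
    "DayNum": [136, 136, 136, 137, 137, 137],
    "LSeed":  [1, 5, 8, 2, 6, 3],
    "HSeed":  [16, 12, 9, 15, 11, 14],
    "Wins":   [1, 0, 0, 1, 0, 0],
})
by_day = upset_rate_by_day(toy_results)
print(by_day)
```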

# Upsets by State

In [None]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

def transform(values):
    cities = values[values.CityID.ge(0)]

    cities = cities.groupby(["CityID"]).agg({"Wins": ["count"]}).reset_index()
    cities.columns = ["CityID", "Upsets"]

    citynames = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/Cities.csv")

    states = cities.merge(citynames, how="left", left_on=["CityID"], right_on=["CityID"])
    states = states.groupby(["State"]).agg({"Upsets": ["sum"]}).reset_index()
    states.columns = ["State", "Upsets"]

    return states

stateupsets = transform(upsets[upsets.CityID.ge(0)])
stateresults = transform(results[results.CityID.ge(0)])

states = stateupsets.merge(stateresults, how="left", left_on="State", right_on="State")
states["Percent"] = states.apply(lambda x: int((x["Upsets_x"] / x["Upsets_y"]) * 100), axis=1)
states.columns = ["State", "Upsets", "Total Possible", "Percent"]

data = {
    "type": "choropleth",
    "colorscale": "reds",
    "locations": states["State"],
    "locationmode": "USA-states",
    "z": states["Percent"],
    "text": states['State'],
    "marker": {
        "line": {
            "color": "rgb(255,255,255)",
            "width": 1
        }
    },
    "colorbar": {'title' : "Upset %"}
}

layout = {
    "title": "",
    "geo": {
        "scope": "usa"
    }
}

choromap = go.Figure(data=[data], layout=layout)
choromap.show(config={"displayModeBar": False, "scrollZoom": False})

The map above shows upsets by location since 2010 (the first year location data is available in Kaggle). A couple of interesting data points stand out. California is one, with a high volume of games and upsets happening 25% of the time when there is a seed disparity >= 4. This makes sense given the long distance traveled for some teams and the tendency for those games to be late for those who have yet to adjust to the time change. Colorado has the highest upset percentage, which could be correlated with altitude.

The table below shows the details for the highest points on the map.

In [None]:
states[states.Percent.ge(25)].sort_values(by="Percent", ascending=False).reset_index(drop=True)

This year there are games in Illinois (Chicago), California (San Francisco and San Diego) and Louisiana (New Orleans).

# Upsets by Team

In [None]:
import re

# Filter down to seasons since 2010 and games with a seed disparity >= 4
teams = results[results.Season.ge(2010) & (results.HSeed - results.LSeed >= 4)].reset_index()
teams["Upsets"] = (teams["HScore"] > teams["LScore"]).astype(int)

# Get upset wins, games played, distinct tournaments and average MOV per team
teams = teams.groupby(["HTeamID"]).agg({"Upsets": ["sum", "count"], "Season": pd.Series.nunique, "MOV": ["mean"]}).reset_index()
teams.columns = ["TeamID", "Upsets", "Games", "Tournaments", "MOV"]
teams["Percent"] = teams["Upsets"] / teams["Games"]

# Current tournament teams with a 5 or higher seed
current = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage2/MNCAATourneySeeds.csv")
current["Seed"] = current["Seed"].apply(lambda x: int(re.sub(r"[^0-9]", "", x)))
current = current[current.Season.eq(2022) & current.Seed.ge(5)]

# Filter down to current tournament teams
teams = teams[teams["TeamID"].isin(current["TeamID"])]

# Join to get team name
teamlist = pd.read_csv("/kaggle/input/mens-march-mania-2022/MDataFiles_Stage1/MTeams.csv")
teams = teams.merge(teamlist, how="left", left_on=["TeamID"], right_on="TeamID")
teams = teams.drop(["TeamID", "FirstD1Season", "LastD1Season"], axis=1)

# Filter down to teams with at least 1 upset and a lower seed margin of victory of <= 9
teams = teams[teams["Upsets"].ge(1) & teams["MOV"].le(9)]

teams.sort_values(by="MOV").reset_index(drop=True)

The table above shows teams with at least 1 upset since 2010 and an average lower seed MOV of less than or equal to 9 points. Appearances are filtered to seasons when a team's seed is 5 or greater.

# Potential upsets

Putting these factors together, the following are upsets to consider.

## Upsets based on location

- (11) Rutgers / (11) Notre Dame over (6) Alabama (1st round in San Diego, CA)
- (8) Seton Hall / (9) TCU over (1) Arizona (2nd round in San Diego, CA)
- (4) Arkansas / (5) UConn over (1) Gonzaga (West Sweet 16 in San Francisco, CA)

## Upsets based on past team performance

Possible 1st round upsets based on past team performance:

- (12) UAB over (5) Houston
- (12) Richmond over (5) Iowa
- (11) Michigan over (6) Colorado State

2nd round upsets to consider (should the matchups happen):

- (9) Marquette over (1) Baylor
- (7) Ohio State/(10) Loyola winner over (2) Villanova
- (7) USC over (2) Auburn
- (7) Murray State over (2) Kentucky
- (7) Michigan State over (2) Duke

Upsets even further down the road:

- (5) UConn over (1) Gonzaga
- (5) Saint Mary's over (1) Baylor

It's important to state that not all of these will happen, and in some cases the games may not even take place. These are just games to take a closer look at based on the data.

# Conclusions

While many focus on first round upsets, the second and third rounds are also worth considering. The advantage is that a team needs to win to even get to that point, which means there is at least something going for that team. The game may not actually happen if the team loses in an earlier round, but if it does, the pick may give a score boost. This strategy is an effective way to take a chance, and an easy one if entries are edited manually.
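
To put rough numbers on the score boost and on the risk, the sketch below works through per-game log loss, assuming the competition is scored with log loss and that predictions are clipped away from exactly 0 and 1. The clipping value `eps=1e-15` and the function names are assumptions for illustration, not the official scoring code.

```python
import math

def log_loss_one(pred, outcome, eps=1e-15):
    """Log loss for one game; pred is the predicted win probability, outcome is 1 or 0."""
    # Clip so a fully confident miss gives a large but finite penalty (eps is assumed)
    p = min(max(pred, eps), 1 - eps)
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

def expected_loss(pred, true_prob):
    """Average loss for a prediction when the team really wins with true_prob."""
    return true_prob * log_loss_one(pred, 1) + (1 - true_prob) * log_loss_one(pred, 0)

hedged = log_loss_one(0.5, 1)      # coin-flip prediction, same cost either way
bold_right = log_loss_one(1.0, 1)  # all-in pick that hits
bold_wrong = log_loss_one(1.0, 0)  # all-in pick that misses

print(f"hedged: {hedged:.3f}")
print(f"bold and right: {bold_right:.3g}")
print(f"bold and wrong: {bold_wrong:.1f}")
print(f"expected loss of a bold pick on a true 50/50 game: {expected_loss(1.0, 0.5):.1f}")
```

A bold pick that hits removes about 0.69 of loss for that game compared with a coin-flip prediction, while a miss is very costly. That asymmetry is why this is a way to chase the top of the leaderboard rather than a way to improve the average score.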

Good luck this year!