# Picking an Upset

Upsets are one of the most exciting parts of March Madness. There is something special about a double-digit seed beating a highly ranked team; it is one of the most exciting events in sports. One strategy for the Kaggle competition is to take a couple of calculated risks with submissions. A very good model may place in the top 10%, but placing high on the leaderboard often comes down to luck.

With the submission deadline coming, one strategy could be to take a chance on a couple of different games to differentiate a submission. But rather than guessing, let's look at the data and see whether some picks are better than others.

# Gather games by Seed and Margin of Victory (MOV)

The following section joins the relevant CSVs to compute a margin of victory (or defeat) for each game. The data is reordered so the lower (better) seed always comes first, which simplifies later calculations. First Four games are skipped because they don't count in this competition.
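The `marchmania` helper module is not shown in this notebook, so here is a rough sketch of what the reordering could look like. It is not the actual implementation. The input column names (`WinnerSeed`, `LoserSeed`, `WinnerScore`, `LoserScore`) and the toy rows are made up for illustration; the output names (`LSeed`, `HSeed`, `LScore`, `HScore`, `Wins`) mirror the columns used later in this notebook.

```python
import pandas as pd

# Toy games (hypothetical values): one row per game, winner and loser side by side
games = pd.DataFrame({
    "Season": [2019, 2019, 2018],
    "WinnerSeed": [12, 1, 16],
    "WinnerScore": [77, 70, 74],
    "LoserSeed": [5, 16, 1],
    "LoserScore": [76, 64, 54],
})

def order_by_seed(games):
    """Put the lower (better) seed first and compute its margin of victory."""
    better_won = games.WinnerSeed <= games.LoserSeed
    out = pd.DataFrame({
        "Season": games.Season,
        # Lower (better) seed and its score
        "LSeed": games.WinnerSeed.where(better_won, games.LoserSeed),
        "LScore": games.WinnerScore.where(better_won, games.LoserScore),
        # Higher (worse) seed and its score
        "HSeed": games.LoserSeed.where(better_won, games.WinnerSeed),
        "HScore": games.LoserScore.where(better_won, games.WinnerScore),
        # 1 when the lower (better) seed won, 0 otherwise
        "Wins": better_won.astype(int),
    })
    # Negative MOV means the lower (better) seed lost
    out["MOV"] = out.LScore - out.HScore
    return out

ordered = order_by_seed(games)
print(ordered)
```

With this layout, an upset is simply a row where `Wins` is 0 and the seed gap is large enough.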

In [None]:
import marchmania

# See https://www.kaggle.com/davidmezzetti/marchmania
results = marchmania.results()
results

# Get average MOV per seed pairing

Group the data together based on seed-pairings. Compute the margin of victory (MOV) and win percentage for the lower (better) seed.
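As a minimal sketch of that grouping (again, not the actual `marchmania.averages` code), the toy frame below assumes one row per game with the lower seed first. `Count` and `MOV` match column names used later in this notebook; `WinPct` is a name picked for this example.

```python
import pandas as pd

# Toy games (hypothetical values) with the lower (better) seed first
games = pd.DataFrame({
    "LSeed": [5, 5, 5, 1],
    "HSeed": [12, 12, 12, 16],
    "LScore": [70, 60, 80, 90],
    "HScore": [65, 68, 71, 60],
    "Wins": [1, 0, 1, 1],
})
games["MOV"] = games.LScore - games.HScore

# One row per seed pairing: games played, win rate and average margin for the lower seed
pairings = (
    games.groupby(["LSeed", "HSeed"])
    .agg(Count=("Wins", "count"), WinPct=("Wins", "mean"), MOV=("MOV", "mean"))
    .reset_index()
)
print(pairings)
```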

In [None]:
# See https://www.kaggle.com/davidmezzetti/marchmania
averages = marchmania.averages(results)
averages

# Plot Upset Frequency

*Note: this notebook defines an upset as a loss by the lower (better) seed in a game with a seed disparity >= 4*

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

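# Upsets: seed disparity >= 4 and the lower (better) seed lost
upsets = results[(results.HSeed - results.LSeed >= 4) & results.Wins.eq(0)]
# Count upsets per seed pairing, then pivot so the winning (higher) seed becomes the columns
upsets = upsets.groupby(["HSeed", "LSeed"]).agg({"HScore": ["count"]})
upsets = upsets.unstack(level=0)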

fig, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(upsets, cmap="Reds")

xticks_labels = range(5, 17)

plt.xticks(np.arange(12) + .5, labels=xticks_labels)
plt.xlabel("Winning Seed")
plt.ylabel("Losing Seed")
plt.show()


Looking at the graph above, the 5-12 and 6-11 upsets are the most common, which is not surprising. Some years a majority of 11s or 12s win. But a couple of other points stick out.

2-7, 2-10, 3-7, 3-10 and 3-11 upsets are also fairly common. The advantage of these upsets is that the team had to win at least one tournament game to get to that point, so in a way it could be said that the team is more reliable (2021 is a different year though, see [Seed-based models risk in 2021](https://www.kaggle.com/davidmezzetti/seed-based-model-risks-in-2021) for more info on this).

Let's look closer at the more common data points.

# Analyzing common upsets

In [None]:
stats = []
columns = None
for row in upsets.stack().sort_values(by=("HScore", "count"), ascending=False).iterrows():
    lseed, hseed = row[0]

    # See https://www.kaggle.com/davidmezzetti/marchmania
    values = marchmania.stats(averages[averages.LSeed.eq(lseed) & averages.HSeed.eq(hseed)], False)
    if columns is None:
        columns = values.keys()

    stats.append((lseed, hseed) + tuple(values))

stats = pd.DataFrame(stats, columns=("LSeed", "HSeed") + tuple(columns))

stats[stats.Count.gt(10) & stats.MOV.lt(6)].sort_values(by="MOV").reset_index(drop=True)

The table above shows statistics for the more common upsets in the graph above, limited to pairings with a Count above 10 and a MOV below 6. It shows how often the lower (better) seed beats the higher seed, along with its margin of victory.

Looking at 3-7 and 3-10 matchups, there are a fair number of upsets and the games tend to be close. To reach this game, a 7 or 10 seed has to win 2 games, most likely beating a 2-seed along the way.

4-12, 3-11, 2-7 and 2-10 are Round of 32 games that are also worth a look. The higher seed in each of these had to win a tournament game to get to this point.
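To double-check which round a seed pairing falls in, here is a small sketch based on the standard 16-team regional bracket. The `meeting_round` helper is made up for this example and is not part of `marchmania`; it only covers two seeds from the same region.

```python
# Standard regional bracket, listed top to bottom by first-round slot
BRACKET = [1, 16, 8, 9, 5, 12, 4, 13, 6, 11, 3, 14, 7, 10, 2, 15]
ROUNDS = {1: "Round of 64", 2: "Round of 32", 3: "Sweet 16", 4: "Elite 8"}

def meeting_round(seed_a, seed_b):
    """Return the round in which two different seeds from one region can meet."""
    if seed_a == seed_b:
        raise ValueError("seeds must differ within a region")
    a, b = BRACKET.index(seed_a), BRACKET.index(seed_b)
    rnd = 0
    # Halve each slot index per round until both seeds land in the same slot
    while a != b:
        a, b = a // 2, b // 2
        rnd += 1
    return ROUNDS[rnd]

for pairing in [(3, 7), (3, 10), (4, 12), (3, 11), (2, 7), (2, 10)]:
    print(pairing, meeting_round(*pairing))
```

The 3-7 and 3-10 pairings come out as Sweet 16 games, and the other four as Round of 32 games, matching the notes above.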

# Upsets by Day

In [None]:
import seaborn as sns

def plot(data, palette, size):
    sns.set(style="whitegrid", color_codes=True, rc={'figure.figsize': size})
    sns.barplot(data=data, x="Day", y="Upset Pct", palette=palette)

# Get all upsets
upsets = results[(results.HSeed - results.LSeed >= 4) & results.Wins.eq(0)]

# Get total number of games with seed disparity >= 4 by day
total = results[(results.HSeed - results.LSeed >= 4)].groupby(["DayNum"]).agg({"DayNum": "count"})
total = total.rename_axis(None).reset_index()
total.columns = ["DayNum", "Count"]

# Get total number of upsets by day
averages = upsets.groupby(["DayNum"]).agg({"DayNum": ["count"]}).reset_index()
averages.columns = ["DayNum", "Count"]

# Add upset percentage
averages = averages.merge(total, how="left", left_on=["DayNum"], right_on=["DayNum"])
averages["Upset Pct"] = averages["Count_x"] / averages["Count_y"]
averages["Day"] = averages.apply(lambda x: marchmania.day(x["DayNum"]), axis=1)

averages = averages.sort_values(by="DayNum")
averages

# Draw the graph
plot(averages, "Blues_d", (15, 5))


The graph above shows the average percentage of upsets by day for games with seed disparity >= 4 over the course of the tournament. The bars are labeled by round and day, for example Round of 64 Day 1 is R64 D1.

A couple of data points stick out: the 2nd day of the Round of 32 and the 1st day of the Sweet 16. Do the normal Friday/Sunday teams tend to lose more than the Thursday/Saturday teams? The first day of the Sweet 16 makes some sense: in a normal year all teams have to travel to a new location, and playing first can be challenging in sports in general. Day of play could be a feature to consider for modeling.

# Upsets by State

In [None]:
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode,iplot

init_notebook_mode(connected=True)

def transform(values):
    cities = values[values.CityID.ge(0)]

    cities = cities.groupby(["CityID"]).agg({"Wins": ["count"]}).reset_index()
    cities.columns = ["CityID", "Upsets"]

    citynames = pd.read_csv("/kaggle/input/ncaam-march-mania-2021/MDataFiles_Stage1/Cities.csv")

    states = cities.merge(citynames, how="left", left_on=["CityID"], right_on=["CityID"])
    states = states.groupby(["State"]).agg({"Upsets": ["sum"]}).reset_index()
    states.columns = ["State", "Upsets"]

    return states

stateupsets = transform(upsets[upsets.CityID.ge(0)])
stateresults = transform(results[results.CityID.ge(0)])

states = stateupsets.merge(stateresults, how="left", left_on="State", right_on="State")
states["Percent"] = states.apply(lambda x: int((x["Upsets_x"] / x["Upsets_y"]) * 100), axis=1)
states.columns = ["State", "Upsets", "Total Possible", "Percent"]

data = {
    "type": "choropleth",
    "colorscale": "reds",
    "locations": states["State"],
    "locationmode": "USA-states",
    "z": states["Percent"],
    "text": states['State'],
    "marker": {
        "line": {
            "color": "rgb(255,255,255)",
            "width": 1
        }
    },
    "colorbar": {'title' : "Upset %"}
}

layout = {
    "title": "",
    "geo": {
        "scope": "usa"
    }
}

choromap = go.Figure(data = [data], layout = layout)
choromap.show(config={"displayModeBar": False, "scrollZoom": False})

The map above shows upsets by location since 2010 (the first year location data is available in Kaggle). A couple of interesting data points stand out. California is one, with a high volume of games and upsets (seed disparity >= 4) accounting for 25% of the tournament games played there. This makes sense given the long distance some teams travel and the late tip-off times for teams that have yet to adjust to the time change. Colorado has the highest upset percentage, which could be correlated with altitude.

The table below shows the details for the highest points on the map.

In [None]:
states[states.Percent.ge(25)].sort_values(by="Percent", ascending=False).reset_index(drop=True)

# Upsets in Indiana

In 2021, location data is mostly moot as all games will be played in Indiana. A couple of upsets have happened in Indianapolis in past tournaments. Let's look closer at those.

In [None]:
# 4161 = Indianapolis
indiana = upsets[upsets.CityID.eq(4161)]

teams = pd.read_csv("/kaggle/input/ncaam-march-mania-2021/MDataFiles_Stage1/MTeams.csv")

indiana = indiana.merge(teams, how="left", left_on="LTeamID", right_on="TeamID")
indiana = indiana.merge(teams, how="left", left_on="HTeamID", right_on="TeamID")

indiana = indiana.rename(columns={"TeamName_x": "LTeam", "TeamName_y": "HTeam"})

indiana = indiana[["Season", "Day", "HSeed", "HTeam", "HScore", "LSeed", "LScore", "LTeam"]]
indiana = indiana.loc[:,~indiana.columns.duplicated()]
indiana


A couple of upsets have happened there, but in this case there are likely other factors that are better indicators (note the day and seed pairings).

# Upsets by Team

In [None]:
# Filter down to last 15 seasons
teams = upsets[upsets.Season.gt(2005)]

# Get number of upset wins by season/team
teams = teams.groupby(["Season", "HTeamID"]).agg({"Wins": ["count"]}).reset_index()
teams.columns = ["Season", "TeamID", "Upsets"]

# Join with seeds to filter to seasons where team had a #5 seed or greater (only way an upset as defined could occur)
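seedlist = pd.read_csv("/kaggle/input/ncaam-march-mania-2021/MDataFiles_Stage1/MNCAATourneySeeds.csv")
# Keep the same seasons as the upsets above so appearances and upsets cover the same window
seedlist = seedlist[seedlist.Season.gt(2005)]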
teams = teams.merge(seedlist, how="right", left_on=["Season", "TeamID"], right_on=["Season", "TeamID"])
teams["Seed"] = teams.apply(lambda x: marchmania.seed(x["Seed"]), axis=1)
teams = teams[teams.Seed.ge(5)]

# Aggregate statistics
teams = teams.groupby(["TeamID"]).agg({"Upsets": ["sum", "count"], "TeamID": ["count"]}).reset_index()
teams.columns = ["TeamID", "Total Upsets", "Seasons with Upset", "Appearances"]
teams["Total Upsets"] = teams["Total Upsets"].astype(int)
teams["Percent"] = teams["Seasons with Upset"] / teams["Appearances"]

# Join to get team name
teamlist = pd.read_csv("/kaggle/input/ncaam-march-mania-2021/MDataFiles_Stage1/MTeams.csv")
teams = teams.merge(teamlist, how="left", left_on=["TeamID"], right_on="TeamID")
teams = teams.drop(["TeamID", "FirstD1Season", "LastD1Season"], axis=1)

# Filter to teams with an upset at least 20% of the time (3 / 15 years)
teams = teams[teams["Seasons with Upset"].ge(3)]

teams.sort_values(by="Percent", ascending=False).reset_index(drop=True)

The table above shows teams with an upset in at least 3 seasons over the last 15 years. Appearances are limited to seasons when a team's seed is 5 or greater. Three of the four teams are in the tournament this year as 9-11 seeds. Combined with the data above, those teams could be good candidates for a first- or second-round upset based on past history.

# Conclusions

While many focus on first-round upsets, the second and third rounds are also worth considering in this competition. The advantage is that a team needs to win to even get to that point, which means there is at least something going for that team. The game may not actually happen if the team loses in an earlier round, but if it does happen, the pick may give a score boost. 2021 is different in that a team may advance due to a cancelled game, but this strategy could still be an effective way to take a chance (and an easy one if entries are edited manually).
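For anyone who would rather script the edit than change a CSV by hand, here is a minimal sketch. It assumes the submission has an `ID` column shaped like `Season_TeamID1_TeamID2` with the lower TeamID first, and a `Pred` column holding the probability that the first team wins. The `override_pick` helper, the team IDs and the 0.9 probability are all made up for illustration.

```python
import pandas as pd

def override_pick(submission, season, winner, loser, prob=0.9):
    """Return a copy of the submission where `winner` beats `loser` with probability `prob`."""
    low, high = sorted((winner, loser))
    game_id = f"{season}_{low}_{high}"
    # Pred is the chance the lower TeamID wins, so flip it when the pick is the higher ID
    pred = prob if winner == low else 1 - prob
    mask = submission.ID.eq(game_id)
    if not mask.any():
        raise KeyError(f"{game_id} not found in submission")
    out = submission.copy()
    out.loc[mask, "Pred"] = pred
    return out

# Toy submission with hypothetical team IDs
submission = pd.DataFrame({
    "ID": ["2021_1101_1104", "2021_1101_1111"],
    "Pred": [0.5, 0.5],
})

# Take a chance on team 1111 beating team 1101
edited = override_pick(submission, 2021, winner=1111, loser=1101, prob=0.9)
print(edited)
```

Only the row for that one matchup changes, so the rest of the model's predictions stay untouched.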

Good luck this year!