# Seed-based model risks in 2021

*Note: For a more detailed look at upsets in general, check out [Picking an Upset](https://www.kaggle.com/davidmezzetti/picking-an-upset)*

With the tournament cancelled last year and 2021 a much different looking tournament, the number of unknowns is higher than ever. The tournament is normally hard to model to begin with, but the following will add much more uncertainty:

- Entire tournament happening in Indiana
- Lack of conference play
  - Colgate is ranked 8th in the NET ratings and does very well in many other ratings models. If they win their conference, they would have beaten 5 unique teams over 15 games, all in-conference. They played only 3 teams (Boston University, Holy Cross and Army) in the regular season.
- Potential for cancellation of games with a team moving on by default

The last item is the one this notebook covers. With what is happening in the conference tournaments (Duke, UVA and Kansas all out due to COVID-19), this risk looms large over the tournament. Many models use seeds as a feature, and that has implications. For example, in a 4-8 matchup, a model may learn that an 8-seed playing a 4-seed means something: the 8-seed beat a 1-seed to get there. The same goes for a 3-10 or 3-7 matchup, where the 7/10 team had to beat a 2-seed. Thankfully, cancelled games will not count in the scoring for this competition, but the follow-on games that are played will.

Let's take a look at what the data shows.


# Gather games by Seed and Margin of Victory (MOV)

The following section joins the relevant CSVs to compute a margin of victory (or defeat) for each game. The data is re-ordered so that the lower (better) seed always comes first, which simplifies later calculations. First Four games are skipped because they don't count in this competition.

In [None]:
import marchmania

# See https://www.kaggle.com/davidmezzetti/marchmania
results = marchmania.results()
results
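
The internals of `marchmania.results()` aren't shown in this notebook. As a rough sketch of what such a join might look like, the snippet below builds tiny made-up stand-ins for the tournament results and seeds CSVs and computes a per-game MOV with the lower seed first. The column names (`Season`, `WTeamID`, `WScore`, `LTeamID`, `LScore`, `TeamID`, `Seed`), the helper name `game_margins`, and all values are assumptions for illustration, not the helper's actual code.

```python
import numpy as np
import pandas as pd

# Made-up miniature stand-ins for the seeds and tournament results CSVs.
seeds = pd.DataFrame({
    "Season": [2019] * 5,
    "TeamID": [1, 2, 3, 4, 5],
    "Seed": ["W01", "W08", "W04", "W16a", "W16b"],
})
games = pd.DataFrame({
    "Season": [2019, 2019, 2019],
    "WTeamID": [4, 2, 3],
    "WScore": [70, 70, 80],
    "LTeamID": [5, 1, 2],
    "LScore": [60, 65, 60],
})

def game_margins(games, seeds):
    """Hypothetical helper: MOV per game from the lower (better) seed's view."""
    win = seeds.rename(columns={"TeamID": "WTeamID", "Seed": "WinSeed"})
    loss = seeds.rename(columns={"TeamID": "LTeamID", "Seed": "LossSeed"})
    df = games.merge(win, on=["Season", "WTeamID"]).merge(loss, on=["Season", "LTeamID"])

    # First Four games pair two teams sharing region and seed number (W16a vs W16b).
    df = df[df.WinSeed.str[:3] != df.LossSeed.str[:3]].copy()

    win_seed = df.WinSeed.str[1:3].astype(int)
    loss_seed = df.LossSeed.str[1:3].astype(int)
    margin = df.WScore - df.LScore

    df["LSeed"] = np.minimum(win_seed, loss_seed)
    df["HSeed"] = np.maximum(win_seed, loss_seed)
    # Positive when the lower seed won, negative when it lost.
    df["MOV"] = np.where(win_seed <= loss_seed, margin, -margin)
    return df[["Season", "LSeed", "HSeed", "MOV"]].reset_index(drop=True)

margins = game_margins(games, seeds)
print(margins)
```

In this toy data, the 8-seed's win over the 1-seed shows up as a negative MOV for the 1-8 pairing, and the First Four game is dropped.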

# Get average MOV per seed pairing

Group the data by seed pairing.

In [None]:
# See https://www.kaggle.com/davidmezzetti/marchmania
averages = marchmania.averages(results)
averages
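
For readers without the helper module, here is a minimal sketch of how per-pairing averages could be computed with a pandas `groupby`. The columns `Key`, `LSeed`, `HSeed`, `MOV` and `Count` mirror the names used in the cells of this notebook; the function name `seed_pair_averages`, the `LWinPct` column and the sample values are hypothetical.

```python
import pandas as pd

# Invented per-game rows: MOV is from the lower (better) seed's point of view.
sample_games = pd.DataFrame({
    "LSeed": [1, 1, 1, 4, 4],
    "HSeed": [8, 8, 9, 8, 8],
    "MOV": [12, -3, 15, -6, 2],
})

def seed_pair_averages(games):
    """Hypothetical helper: average MOV, game count and lower-seed win rate per pairing."""
    grouped = games.groupby(["LSeed", "HSeed"], as_index=False).agg(
        MOV=("MOV", "mean"),
        Count=("MOV", "size"),
        LWinPct=("MOV", lambda m: (m > 0).mean()),
    )
    # Label such as "1-8" for plotting on the x-axis.
    key = grouped.LSeed.astype(str) + "-" + grouped.HSeed.astype(str)
    grouped.insert(0, "Key", key)
    return grouped

pair_averages = seed_pair_averages(sample_games)
print(pair_averages)
```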

# Average MOV by seed pairing

In [None]:
import seaborn as sns

def plot(data, palette, size):
    sns.set(style="whitegrid", color_codes=True, rc={'figure.figsize': size})
    sns.barplot(data=data, x="Key", y="MOV", palette=palette)

# Plot averages for seed pairings with more than 5 games
plot(averages[averages.Count.gt(5)], "Purples_d", (25, 10))

The graph above shows the MOV for seed matchups with more than 5 occurrences. For the most part, the MOV is higher when the seed disparity is higher, but there are exceptions. This plot doesn't show frequency, so keep in mind that some of the points are less common than others.

# Analysis on how 8/9 seeds fare


Let's plot how 8/9 seeds do against lower (better) seeds. 

In [None]:
plot(averages[averages.HSeed.isin([8, 9])], "Greens_d", (15, 5))

As expected, 8/9 seeds don't do well against 1-seeds: their average MOV is -11.55, typically a blowout.

In [None]:
marchmania.stats(averages[averages.LSeed.eq(1) & averages.HSeed.isin([8, 9])])

But for the 8/9 seeds that do win, it starts to get more interesting. Look at the entries for 4-8, 4-9, 5-8 and 5-9 and the averages across this set. These matchups don't occur often, but when they do, the result is meaningful. When seeds are used as a feature, most models will be able to pick up on this, which would normally be good.

In [None]:
marchmania.stats(averages[averages.LSeed.isin([4, 5]) & averages.HSeed.isin([8, 9])])

In [None]:
averages[averages.LSeed.isin([4, 5]) & averages.HSeed.isin([8, 9])]

8/9s actually have a higher win percentage than 4/5s! Once again, this is a small sample size, but models can pick up on it.

Unfortunately, our models aren't going to know how a team advanced. Every year until this year, advancing meant winning the previous game. What if an 8/9 seed were to advance past a 1-seed due to a cancelled game? Would the model think it's a special 8/9 team when in fact it's an ordinary 8/9 that would have lost by 11-12 points against a 1-seed? Will other features help the model avoid a bad prediction?
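
To make the concern concrete, here is a small simulation sketch of the selection effect. It is not based on tournament data: team strengths, the 11.5-point gap to 1-seeds, the 5-point gap to 4/5 seeds and the noise levels are all invented assumptions. It compares 8/9 seeds that earned their way past a 1-seed with ordinary 8/9 seeds that advance by default.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed latent team strengths in points (every parameter here is invented).
eight_nine = rng.normal(0.0, 5.0, n)   # 8/9 seeds
one_seed = rng.normal(11.5, 5.0, n)    # 1-seeds, about 11.5 points better on average
four_five = rng.normal(5.0, 5.0, n)    # 4/5 seeds waiting in the next round

def play(a, b):
    """Return True where team a beats team b, with random game-to-game noise."""
    return (a - b + rng.normal(0.0, 10.0, a.shape)) > 0

# Normal year: only 8/9 seeds that actually beat the 1-seed advance.
beat_one_seed = play(eight_nine, one_seed)
earned = eight_nine[beat_one_seed]
earned_win_rate = play(earned, four_five[beat_one_seed]).mean()

# Walkover: every 8/9 seed advances, regardless of strength.
walkover_win_rate = play(eight_nine, four_five).mean()

print(f"mean strength, all 8/9 seeds:    {eight_nine.mean():.2f}")
print(f"mean strength, earned advance:   {earned.mean():.2f}")
print(f"next-round win rate, earned:     {earned_win_rate:.3f}")
print(f"next-round win rate, walkover:   {walkover_win_rate:.3f}")
```

Under these assumptions, the teams that earned the upset are stronger on average than the full pool of 8/9 seeds, so a model trained only on earned advances would tend to overrate a walkover team.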

# Analysis on how 7/10 seeds fare

Let's plot how 7/10 seeds do against lower (better) seeds.

In [None]:
plot(averages[averages.HSeed.isin([7, 10])], "Blues_d", (15, 5))

Once again, the odds are generally against 7/10 seeds when playing a 2-seed, but it is not nearly as daunting a task as facing a 1-seed. If a 7/10 seed does win, the odds of winning do go up slightly, but the effect is not as pronounced as for an 8/9 seed. The follow-on games are significantly closer, though.


In [None]:
marchmania.stats(averages[averages.LSeed.eq(2) & averages.HSeed.isin([7, 10])])

In [None]:
marchmania.stats(averages[averages.LSeed.isin([3, 6]) & averages.HSeed.isin([7, 10])])

In [None]:
averages[averages.LSeed.isin([3, 6]) & averages.HSeed.isin([7, 10])]

# Conclusions

This notebook picked a couple of different second- and third-round scenarios to discuss impacts that could be felt this year. Hopefully, all 63 games are played in this tournament.

Mitigation strategies could be put in place to modify the predictions or pull them closer to 0.5. If you're doing anything special this year that you'd like to share, please feel free to put it in the comments.
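
As one possible illustration of pulling predictions toward 0.5, the sketch below blends each prediction with 0.5 using a fixed weight and compares log loss before and after. The function names `shrink` and `log_loss`, the weight of 0.8, the sample predictions and the assumption that scoring uses log loss are all illustrative choices, not a recommended setting.

```python
import numpy as np

def shrink(pred, weight=0.8):
    """Blend predictions with 0.5: weight=1 keeps them, weight=0 returns all 0.5."""
    pred = np.asarray(pred, dtype=float)
    return 0.5 + weight * (pred - 0.5)

def log_loss(y_true, pred):
    """Mean binary log loss, clipping predictions away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    pred = np.clip(np.asarray(pred, dtype=float), 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(pred) + (1 - y_true) * np.log(1 - pred)))

# Invented example: three confident picks, the third one ends in an upset.
preds = np.array([0.95, 0.90, 0.85])
outcomes = np.array([1, 1, 0])

raw_loss = log_loss(outcomes, preds)
shrunk_loss = log_loss(outcomes, shrink(preds))
print(f"raw: {raw_loss:.4f}  shrunk: {shrunk_loss:.4f}")
```

Shrinking helps in this example because one confident pick was wrong. If every favorite had won, the shrunk predictions would score slightly worse, so this is a hedge rather than a free improvement. The same function could be applied only to games that follow a walkover.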

In my case, I'm going to run with my model as if it's 2019, a regular year, and see how it goes. In future years, should this year be *really* different, perhaps I would not even include 2021 data when training. Maybe there is a clever person out there who will hedge their bets in a way that wins the competition. That's what makes this a fun competition.

Good luck this year!