I'm going to show you how to use betting futures to create a March Madness bracket. You can use this same approach
for many other sports and really anything where there is a prediction market.

#### Outline
1. Scraping Futures Using Beautiful Soup
2. Matching up with Kaggle Data
3. Generating a bracket


In [None]:
import requests
from bs4 import BeautifulSoup
from IPython.display import HTML

# Scraping Futures Data

In [None]:
futures_url="https://www.vegasinsider.com/college-basketball/odds/futures/"
req=requests.get(futures_url)
html=req.text
#Uncomment HTML(req.text) to render the page inline and confirm the html was downloaded
#Rendering it overrides the notebook's css (the Jupyter display styling), so be prepared to refresh the page
req.text

In [None]:
#looks like the data is there, let's extract the table

In [None]:
soup=BeautifulSoup(html,"html.parser")
futures=soup.find("bc-futures")
futures

In [None]:
#Get column names
header=futures.find("div", "fixed-odds-header-transition")#shortcut for .find("div", {"class": "xxx"})
bookmakers=header.find_all("div","bookmaker-rotated")
column_names=["consensus"]#First one is tricky
for bookmaker in bookmakers[1:]:
    link=bookmaker.parent.parent.parent.find("a")#many ways to parse this
    if link:
        column_names.append(link.attrs.get("href","").split("/goto/")[-1].split("/")[0])
    else:
        column_names.append(bookmaker.text.strip())#no link: fall back to the label text, not the Tag object
column_names

In [None]:
import re
index_name_divs=futures.find_all("div","team-stats-box")
index_names=[]
for div in index_name_divs:
    name=re.findall(r"\s+(.*)\n",div.text)#raw string so \s is not treated as a string escape
    if len(name)==3:
        index_names.append(name[0])
    else:
        print(name)#err if here!
index_names

In [None]:
odds_divs=futures.find_all("div","odds-slider-all")[1].find_all("div","pt-3")
odds=[]
for div in odds_divs:
    odd=re.findall(r"\s+(.*)\n",div.text)
    if len(odd)==1:
        odd=odd[0]
        odd=float(odd.replace("+","")) if odd!="N/A" else None
        odds.append(odd)
    else:
        print(odd)#err if here!
odds

In [None]:
len(odds)==len(index_names)*len(column_names)#Should be True: one odds value per team per column

In [None]:
from collections import defaultdict
import pandas as pd
j=0
data=defaultdict(list)
for i,odd in enumerate(odds):
    if i%len(index_names)==0 and i!=0:
        j+=1
    data[column_names[j]].append(odd)
df=pd.DataFrame(data, index=index_names)
df.index.name="full_team_name"
df=df.reset_index()
df

Awesome, we've got the data. Now we can rank teams by their odds (the lower the payout, the better the market thinks the team is) and generate a bracket!
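
To see why a lower number means a better team, it can help to convert the lines into implied probabilities. The sketch below assumes the scraped values are American odds (a line of +400 pays 400 on a 100 stake); the function name and the example lines are hypothetical and are not taken from the scraped table.

```python
def american_to_implied_prob(odds):
    """Convert American odds (e.g. +400 or -150) to the bookmaker's implied probability."""
    if odds > 0:
        # Underdog style line: risk 100 to win `odds`
        return 100.0 / (odds + 100.0)
    # Favorite style line: risk `-odds` to win 100
    return -odds / (-odds + 100.0)

# Hypothetical lines for illustration only
example_lines = {"Team A": 400.0, "Team B": 900.0, "Team C": -150.0}
implied = {team: american_to_implied_prob(line) for team, line in example_lines.items()}

# Sorting by the raw line or by implied probability gives the same order, reversed
for team, prob in sorted(implied.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: {prob:.3f}")
```

Implied probabilities across a whole futures market usually sum to more than 1 because of the bookmaker's margin, so they are best read as a ranking unless you normalize them.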

# Bracket Generation
We're going to use data from Kaggle and a tool called bracketeer.

## Kaggle Download
Go to https://www.kaggle.com/c/mens-march-mania-2022/data and download the stage 2 files for:
- MNCAATourneySeeds.csv
- MNCAATourneySlots.csv
- MTeams.csv

Place these in the data/ folder (I've already done it for 2022).

In [None]:

YEAR=2022
DATA_ROOT="data"
seeds=pd.read_csv(DATA_ROOT+"/MNCAATourneySeeds.csv")
slots=pd.read_csv(DATA_ROOT+"/MNCAATourneySlots.csv")
teams=pd.read_csv(DATA_ROOT+"/MTeams.csv")
teams

In [None]:
#For Kaggle and for our bracket builder we need to generate a matchup prediction for any potential matchup
#in the format {YEAR}_{team_ID_1}_{team_ID_2}
# Team ID 1 must be lower than team ID 2

#Somewhat fancy way of getting upper triangle of pairs i,j of NxN matrix 
current_seeds=seeds[seeds['Season']==YEAR]
matchups=current_seeds.merge(current_seeds,on=["Season"],how="outer").sort_values(by=["TeamID_x","TeamID_y"])
matchups=matchups.loc[matchups['TeamID_x']<matchups['TeamID_y']]
matchups

We've got the matchups, but now we need to match our betting lines to Kaggle team IDs.
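
This notebook does the matching by hand, but for comparison here is a minimal sketch of a semi-automated first pass using the standard library's `difflib`. The helper name `suggest_team`, the cutoff, and the short `kaggle_names` list are assumptions for illustration; in practice the candidates would come from the `TeamName` column of MTeams.csv.

```python
import difflib

def suggest_team(full_name, kaggle_names, cutoff=0.5):
    """Return the closest Kaggle team name by string similarity, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(full_name, kaggle_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical stand-in for teams['TeamName']
kaggle_names = ["Duke", "Kansas", "Gonzaga"]

for name in ["Kansas Jayhawks", "Gonzaga Bulldogs", "Duke Blue Devils"]:
    # A long mascot can push a correct match below the cutoff, so review every suggestion
    print(name, "->", suggest_team(name, kaggle_names))
```

With this cutoff, "Duke Blue Devils" gets no suggestion because the mascot makes up most of the string. That is the kind of edge case that makes a manual pass reasonable for 68 teams.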

In [None]:
#There are some semi-automated ways of doing this but with just 68 or so teams I think it's better to do it by hand
#Too many edge cases
lookup={}
#generate assignment to be filled in next
for name in sorted(df['full_team_name']):
    lookup[name]=0
lookup

In [None]:
lookup={'Akron Zips': 1103,
 'Alabama Crimson Tide': 1104,
 'Arizona Wildcats': 1112,
 'Arkansas Razorbacks': 1116,
 'Auburn Tigers': 1120,
 'Baylor Bears': 1124,
 'Boise State Broncos': 1129,
 'Bryant Bulldogs': 1136,
 'Cal State Fullerton Titans': 1168,
 'Chattanooga Mocs': 1151,
 'Colgate Raiders': 1159,
 'Colorado State Rams': 1161,
 'Connecticut Huskies': 1163,
 'Creighton Bluejays': 1166,
 'Davidson Wildcats': 1172,
 "Delaware Fightin' Blue Hens": 1174,
 'Duke Blue Devils': 1181,
 'Georgia State Panthers': 1209,
 'Gonzaga Bulldogs': 1211,
 'Houston Cougars': 1222,
 'Illinois Fighting Illini': 1228,
 'Indiana Hoosiers': 1231,
 'Iowa Hawkeyes': 1234,
 'Iowa State Cyclones': 1235,
 'Jacksonville State Gamecocks': 1240,
 'Kansas Jayhawks': 1242,
 'Kentucky Wildcats': 1246,
 'LSU Tigers': 1261,
 'Longwood Lancers': 1255,
 'Loyola Chicago Ramblers': 1260,
 'Marquette Golden Eagles': 1266,
 'Memphis Tigers': 1272,
 'Miami (FL) Hurricanes': 1274,
 'Michigan State Spartans': 1277,
 'Michigan Wolverines': 1276,
 'Montana State Bobcats': 1286,
 'Murray State Racers': 1293,
 'New Mexico State Aggies': 1308,
 'Norfolk State Spartans': 1313,
 'North Carolina Tar Heels': 1314,
 'Notre Dame Fighting Irish': 1323,
 'Ohio State Buckeyes': 1326,
 'Providence Friars': 1344,
 'Purdue Boilermakers': 1345,
 'Richmond Spiders': 1350,
 "Saint Mary's Gaels": 1388,
 'San Diego State Aztecs': 1361,
 'San Francisco Dons': 1362,
 'South Dakota State Jackrabbits': 1355,
 'TCU Horned Frogs': 1395,
 'Tennessee Volunteers': 1397,
 'Texas A&M-Corpus Christi Islanders': 1394,
 'Texas Longhorns': 1400,
 'Texas Southern Tigers': 1411,
 'Texas Tech Red Raiders': 1403,
 'UAB Blazers': 1412,
 'UCLA Bruins': 1417,
 'USC Trojans': 1425,
 'Vermont Catamounts': 1436,
 'Villanova Wildcats': 1437,
 'Virginia Tech Hokies': 1439,
 'Wisconsin Badgers': 1458,
 'Wright State Raiders': 1460,
 'Yale Bulldogs': 1463}

matched=[team_id for name,team_id in lookup.items() if team_id!=0]#avoid shadowing the builtin id
pd.options.display.max_rows=68
teams[(teams['TeamID'].isin(current_seeds['TeamID']))&~(teams['TeamID'].isin(matched))]
        

In [None]:
#A few seeded teams have no futures line (see the table above); we fill those in by hand further down
df['TeamID']=df['full_team_name'].apply(lambda x: lookup[x])
#The rest look good!
df.merge(teams.merge(current_seeds))[['full_team_name','TeamName','TeamID','consensus']]

Now that we've matched everything up, we can make our projections


In [None]:
df['rank']=df['consensus']#lower is better 
team_rankings={row['TeamID']:row['rank'] for idx,row in df.iterrows()}
team_rankings

In [None]:
#Filled in the blanks using https://www.covers.com/sport/basketball/ncaab/odds/futures
team_rankings[1353]=45000#Rutgers
team_rankings[1371]=25000#Seton Hall
team_rankings[1389]=200000#Saint Peter's
team_rankings[1461]=9e10#Wyoming: already eliminated, so rank it last


In [None]:
matchups['Rank_x']=matchups['TeamID_x'].apply(lambda x: team_rankings[x])
matchups['Rank_y']=matchups['TeamID_y'].apply(lambda x: team_rankings[x])

In [None]:
matchups['Pred']=(matchups['Rank_x']<matchups['Rank_y']).astype(float)
matchups

In [None]:
#For Kaggle output the file like below
matchups['ID']=matchups.apply(lambda row: f"{row['Season']}_{row['TeamID_x']}_{row['TeamID_y']}", axis=1)
matchups[['ID','Pred']].to_csv(f"{DATA_ROOT}/submission.csv",index=False)

In [None]:
matchups[['ID','Pred']]

In [None]:
import bracketeer
b = bracketeer.build_bracket(
        outputPath=f'{YEAR}_rankings_bracket.png',
        teamsPath=DATA_ROOT+'/MTeams.csv',
        seedsPath=DATA_ROOT+'/MNCAATourneySeeds.csv',
        submissionPath=f'{DATA_ROOT}/submission.csv',
        slotsPath=DATA_ROOT+'/MNCAATourneySlots.csv',
        year=YEAR
)

# Thoughts
This bracket should be considered the chalkiest of the chalk: it is what most people think is going to happen. Use that info as you will. It should help you find gaps in other brackets you make, such as a big injury or an absent coach you didn't know about. Better yet, it can help you find something you have a strong conviction on that other people don't. That is a great time to bet, provided your conviction is well founded.

The main upsets:
- Iowa over Providence
- Houston over Illinois
- Kentucky over Baylor (Injury woes for Baylor possibly the reason?)

To generate better probability distributions, you could fit each team to a skill distribution and use it to predict matchup probabilities. https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/ is a good starting point, and pymc3 is one tool for fitting such a model.
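
As a rough illustration of what a skill-based prediction looks like, the FiveThirtyEight methodology linked above is built on Elo ratings, where the win probability depends only on the rating gap. The sketch below uses the standard Elo formula; the function name and the ratings are hypothetical, and nothing here is derived from the futures data.

```python
def elo_win_probability(rating_a, rating_b):
    """Probability that team A beats team B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings for illustration only
strong, weak = 1700.0, 1500.0
p = elo_win_probability(strong, weak)
print(f"P(strong beats weak) = {p:.3f}")
```

A value like this could replace the 0/1 `Pred` column with something between 0 and 1, assuming you have a way to estimate the ratings.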
  
Also read up on the Kaggle discussion boards https://www.kaggle.com/c/mens-march-mania-2022/discussion. You can use this futures data to help your Kaggle model too!