#### What are you trying to do in this notebook?
Each season there are thousands of NCAA basketball games played between Division I men's teams, culminating in March Madness®, the 68-team national championship that starts in the middle of March.
Armed with historical data, we can explore it and develop our own distinctive ways of predicting March Madness® game outcomes. We can even evaluate and compare different approaches by seeing which of them would have done best at predicting tournament games from the past.
Define states that represent the point-scoring process with a simple graphical model.
Calculate the transition probabilities from the detailed game data, build a transition matrix from them, and run a Monte Carlo simulation.
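
The transition-matrix idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the state names in `POINTS`, the toy `observed` sequence, and the helper functions `estimate_transitions` and `simulate_points` are all hypothetical and are not part of the competition data or of the code cells in this notebook.

```python
import random
from collections import defaultdict

# Hypothetical possession outcomes and the points each one is worth.
POINTS = {"miss": 0, "free_throw": 1, "two": 2, "three": 3}

def estimate_transitions(sequence):
    """Row-normalised counts of state -> next-state moves in one observed sequence."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }

def simulate_points(matrix, start, n_steps, rng):
    """Walk the chain for n_steps possessions and add up the points scored."""
    state, total = start, 0
    for _ in range(n_steps):
        row = matrix.get(state)
        if not row:  # state never seen as a source: stop early
            break
        state = rng.choices(list(row), weights=list(row.values()))[0]
        total += POINTS[state]
    return total

# Made-up sequence of possession outcomes, for illustration only.
observed = ["miss", "two", "miss", "three", "two", "miss", "free_throw", "two", "miss"]
matrix = estimate_transitions(observed)

# Monte Carlo: repeat the simulation many times and average the totals.
rng = random.Random(0)
runs = [simulate_points(matrix, "miss", 60, rng) for _ in range(2000)]
mean_points = sum(runs) / len(runs)
print("Estimated mean points:", mean_points)
```

With real data, the sequence would have to be derived from the detailed results, and the choice of states is a modelling decision that this sketch does not settle.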

#### What did we learn while making this notebook?
In stage one of this two-stage competition, participants will build and test their models against previous tournaments. In the second stage, participants will predict the outcome of the 2022 tournament. You don’t need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2022 results.

#### Why are you trying it?
- A simple method: use the regular-season win-loss record between a pair of teams as the probability of winning.
- For instance, if team A and team B have played 5 games in total and team A won 3 of them, the probability of team A winning is 0.6.

The probability of winning is 0.5 if A and B have no historical matchup.
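
The head-to-head rule described above can be written as a short function. This is a minimal sketch, not the model fitted later in this notebook: `head_to_head_prob` is a hypothetical helper, and the toy list of `(winner_id, loser_id)` pairs is made up to mirror the 3-out-of-5 example.

```python
from collections import Counter

def head_to_head_prob(games, team_a, team_b):
    """Share of past A-vs-B games won by A; 0.5 when the pair never met.

    games: iterable of (winner_id, loser_id) pairs.
    """
    wins = Counter(games)
    a_wins = wins[(team_a, team_b)]
    b_wins = wins[(team_b, team_a)]
    total = a_wins + b_wins
    return 0.5 if total == 0 else a_wins / total

# Toy data (made up): teams 1101 and 1102 meet five times, 1101 wins three.
toy_games = [(1101, 1102), (1101, 1102), (1102, 1101), (1101, 1102), (1102, 1101)]

p_ab = head_to_head_prob(toy_games, 1101, 1102)          # 3 wins out of 5
p_no_history = head_to_head_prob(toy_games, 1101, 1103)  # never played
print(p_ab, p_no_history)
```

On the real data, the pairs could be taken from the `WTeamID` and `LTeamID` columns of the regular-season results.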

- Loading necessities
- Exploratory data analysis
- Feature engineering 
- Create train and test sets
- Modelling and model verification

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import glob

import numpy as np
import pandas as pd
# Import only the sklearn submodules used below instead of a wildcard import.
from sklearn import linear_model, metrics

In [None]:
f = {f.split('/')[-1]: pd.read_csv(f, encoding='latin1') for f in glob.glob('/*/*/m*-m*2022/MD*2/**')}

In [None]:
teams = f['MTeams.csv']
teams2 = f['MTeamSpellings.csv']
season_cresults = f['MRegularSeasonCompactResults.csv']
season_dresults = f['MRegularSeasonDetailedResults.csv']
tourney_cresults = f['MNCAATourneyCompactResults.csv']
tourney_dresults = f['MNCAATourneyDetailedResults.csv']
slots = f['MNCAATourneySlots.csv']
seeds = f['MNCAATourneySeeds.csv']
seeds = {'_'.join(map(str,[int(k1),k2])):int(v[1:3]) for k1, v, k2 in seeds[['Season', 'Seed', 'TeamID']].values}
#seeds = {**seeds, **{k.replace('2021_','2022_'):seeds[k] for k in seeds if '2021_' in k}}
cities = f['Cities.csv']
gcities = f['MGameCities.csv']
seasons = f['MSeasons.csv']
sub = f['MSampleSubmissionStage2.csv']

In [None]:
teams2 = teams2.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
teams2.columns = ['TeamID', 'TeamNameCount']
teams = pd.merge(teams, teams2, how='left', on=['TeamID'])
del teams2

In [None]:
season_cresults['ST'] = 'S'
season_dresults['ST'] = 'S'
tourney_cresults['ST'] = 'T'
tourney_dresults['ST'] = 'T'
#games = pd.concat((season_cresults, tourney_cresults), axis=0, ignore_index=True)
games = pd.concat((season_dresults, tourney_dresults), axis=0, ignore_index=True)
games.reset_index(drop=True, inplace=True)
games['WLoc'] = games['WLoc'].map({'A': 1, 'H': 2, 'N': 3})

In [None]:
games['ID'] = games.apply(lambda r: '_'.join(map(str, [r['Season']]+sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
games['IDTeams'] = games.apply(lambda r: '_'.join(map(str, sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
games['Team1'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[0], axis=1)
games['Team2'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[1], axis=1)
games['IDTeam1'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
games['IDTeam2'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)

In [None]:
games['Team1Seed'] = games['IDTeam1'].map(seeds).fillna(0)
games['Team2Seed'] = games['IDTeam2'].map(seeds).fillna(0)

In [None]:
games['ScoreDiff'] = games['WScore'] - games['LScore']
games['Pred'] = games.apply(lambda r: 1. if sorted([r['WTeamID'],r['LTeamID']])[0]==r['WTeamID'] else 0., axis=1)
games['ScoreDiffNorm'] = games.apply(lambda r: r['ScoreDiff'] * -1 if r['Pred'] == 0. else r['ScoreDiff'], axis=1)
games['SeedDiff'] = games['Team1Seed'] - games['Team2Seed'] 
games = games.fillna(-1)

In [None]:
c_score_col = ['NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl',
 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl',
 'LBlk', 'LPF']
c_score_agg = ['sum', 'mean', 'median', 'max', 'min', 'std', 'skew', 'nunique']
gb = games.groupby(by=['IDTeams']).agg({k: c_score_agg for k in c_score_col}).reset_index()
gb.columns = [''.join(c) + '_c_score' for c in gb.columns]

games = games[games['ST']=='T']

In [None]:
sub['WLoc'] = 3
sub['Season'] = sub['ID'].map(lambda x: x.split('_')[0])
sub['Season'] = sub['Season'].astype(int)
sub['Team1'] = sub['ID'].map(lambda x: x.split('_')[1])
sub['Team2'] = sub['ID'].map(lambda x: x.split('_')[2])
sub['IDTeams'] = sub.apply(lambda r: '_'.join(map(str, [r['Team1'], r['Team2']])), axis=1)
sub['IDTeam1'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
sub['IDTeam2'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)
sub['Team1Seed'] = sub['IDTeam1'].map(seeds).fillna(0)
sub['Team2Seed'] = sub['IDTeam2'].map(seeds).fillna(0)
sub['SeedDiff'] = sub['Team1Seed'] - sub['Team2Seed'] 
sub = sub.fillna(-1)

In [None]:
games = pd.merge(games, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')
sub = pd.merge(sub, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')

In [None]:
# 'IDTeams_c_score' is the string merge key from gb, not a feature, so it is excluded too.
col = [c for c in games.columns if c not in ['ID', 'DayNum', 'ST', 'Team1', 'Team2', 'IDTeams', 'IDTeam1', 'IDTeam2', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'NumOT', 'Pred', 'ScoreDiff', 'ScoreDiffNorm', 'WLoc', 'IDTeams_c_score'] + c_score_col]

In [None]:
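# Fit a linear regression on the 0/1 target and treat its clipped output as a probability.
reg = linear_model.LinearRegression()
reg.fit(games[col].fillna(-1), games['Pred'])
pred = reg.predict(games[col].fillna(-1)).clip(0,1)
# In-sample score: measured on the same tournament games the model was fit on.
print('Log Loss:', metrics.log_loss(games['Pred'], pred))
# Clip the submission as well, since a linear model can predict outside [0, 1].
sub['Pred'] = reg.predict(sub[col].fillna(-1)).clip(0, 1)
sub[['ID', 'Pred']].to_csv('submission.csv', index=False)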

#### Did it work?
- We should submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (2016-2019 and 2021). Note that there was no tournament held in 2020.

- We should submit predicted probabilities for every possible matchup before the 2022 tournament begins.

Refer to the Timeline page for specific dates. In both stages, the sample submission will tell you which games to predict.
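
To make "every possible matchup" concrete, the sketch below enumerates each pair of teams once and formats it as `Season_Team1_Team2`, the ID layout that the code above splits on `_`. Putting the lower TeamID first follows the sorted-ID convention used for `games['ID']` and is assumed here to match the sample submission as well. `matchup_ids` is a hypothetical helper and the three team IDs are placeholders.

```python
from itertools import combinations

def matchup_ids(season, team_ids):
    """Every unordered pair once, lower TeamID first, as 'Season_Team1_Team2'."""
    return [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]

# Placeholder team IDs, for illustration only.
ids = matchup_ids(2022, [1104, 1101, 1103])
print(ids)
```

For n teams this gives n * (n - 1) / 2 rows, so a 68-team field produces 2278 matchups per season.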

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems while working on it. If anything I do in this notebook is unclear, please leave a comment on it.

#### What else do you think you can try as part of this approach?
In stage one of this two-stage competition, participants will build and test their models against previous tournaments. In the second stage, participants will predict the outcome of the 2022 tournament. You don’t need to participate in the first stage to enter the second. The first stage exists to incentivize model building and provide a means to score predictions. The real competition is forecasting the 2022 results.