# Predicting the NBA

In [None]:
import datetime
import pandas as pd
import numpy as np

teams = pd.read_csv('/kaggle/input/nba-games/teams.csv')
games = pd.read_csv('/kaggle/input/nba-games/games.csv')
# We do not need game details, rankings, or players data.

# Rename visitor to away.
games = games.rename(columns={'VISITOR_TEAM_ID': 'AWAY_TEAM_ID'})

# We only care about these columns for games.
games['ID'] = games['GAME_ID']
games = games[['ID', 'GAME_DATE_EST', 'GAME_ID', 'HOME_TEAM_ID', 'AWAY_TEAM_ID', 'SEASON', 'PTS_home', 'PTS_away', 'HOME_TEAM_WINS']]

# Set index to game ID.
games = games.set_index('ID')

# Format dates.
games['GAME_DATE_EST'] = pd.to_datetime(games['GAME_DATE_EST'], format='%Y-%m-%d')
# Change type of wins to boolean.
games['HOME_TEAM_WINS'] = games['HOME_TEAM_WINS'].astype(bool)

# The season for 2019 is not complete so filter these games out.
games = games[games['SEASON'] != 2019]

# Sort games by date.
games = games.sort_values(by='GAME_DATE_EST')

Since we predict each game from previous games, let us add, for every game, a pointer to each team's previous game. This makes it easier to trace through a team's history.
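
The notebook's own cell does this lookup row by row with `apply`. As a possible alternative, here is a minimal sketch that gets the same pointers with a grouped `shift`. The function name `add_previous_game_ids` and the helper columns `TEAM_ID`, `SIDE`, and `PREVIOUS_GAME_ID` are hypothetical, and the sketch assumes that `GAME_ID` is unique and that a team plays at most one game per date.

```python
import pandas as pd

def add_previous_game_ids(games):
    """Return a copy of games with each team's previous GAME_ID (-1 if none)."""
    games = games.sort_values('GAME_DATE_EST').copy()
    # Stack home and away appearances into one row per (game, team).
    appearances = pd.concat(
        [
            games[['GAME_ID', 'GAME_DATE_EST', f'{side}_TEAM_ID']]
            .rename(columns={f'{side}_TEAM_ID': 'TEAM_ID'})
            .assign(SIDE=side)
            for side in ('HOME', 'AWAY')
        ],
        ignore_index=True,
    ).sort_values('GAME_DATE_EST')
    # Within each team's timeline, the previous game is one row up.
    appearances['PREVIOUS_GAME_ID'] = appearances.groupby('TEAM_ID')['GAME_ID'].shift(1)
    for side in ('HOME', 'AWAY'):
        previous = (
            appearances[appearances['SIDE'] == side]
            .set_index('GAME_ID')['PREVIOUS_GAME_ID']
        )
        # Games without a previous game get the sentinel -1.
        games[f'{side}_TEAM_PREVIOUS_GAME_ID'] = (
            games['GAME_ID'].map(previous).fillna(-1).astype('int64')
        )
    return games
```

This avoids filtering the full table once per game, which may matter as more seasons are added.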

In [None]:
# TODO: Use decorators to clean up this code.

def get_previous_game(team, date):
    """Returns the previous game for the given team before the given date."""
    # TODO: If necessary, do not look at all games, just look at this team's games.
    games_for_team = games[(games['HOME_TEAM_ID'] == team) | (games['AWAY_TEAM_ID'] == team)]
    # TODO: If necessary, do not look at all of this team's games, just look at the team's games for that season.
    games_for_team_before_date = games_for_team[(games_for_team['GAME_DATE_EST'] < date)]
    # Only consider the most recent game.
    previous_games = games_for_team_before_date.tail(1)
    if len(previous_games) > 0:
        return previous_games.iloc[0]['GAME_ID']

def get_previous_game_for_home_team(row):
    """Gets the previous game for the home team."""
    row['HOME_TEAM_PREVIOUS_GAME_ID'] = get_previous_game(row['HOME_TEAM_ID'], row['GAME_DATE_EST'])
    return row

def get_previous_game_for_away_team(row):
    """Gets the previous game for the away team."""
    row['AWAY_TEAM_PREVIOUS_GAME_ID'] = get_previous_game(row['AWAY_TEAM_ID'], row['GAME_DATE_EST'])
    return row

# Get previous games for every game.
games = games.apply(get_previous_game_for_home_team, axis=1)
games = games.apply(get_previous_game_for_away_team, axis=1)

# Previous game IDs come back as floats because a team's first game has no
# previous game, and the resulting NaN forces a float dtype. Use -1 as the
# "no previous game" sentinel and cast back to integers.
games['HOME_TEAM_PREVIOUS_GAME_ID'] = games['HOME_TEAM_PREVIOUS_GAME_ID'].fillna(-1).astype(np.int64)
games['AWAY_TEAM_PREVIOUS_GAME_ID'] = games['AWAY_TEAM_PREVIOUS_GAME_ID'].fillna(-1).astype(np.int64)

## Expanding on the Initial Model

In the previous version of our prediction model, for the two teams in a game, we examined the point differential in each team's previous game. Considering **1 team** in **1 season** for **82 games** yielded a prediction accuracy of **57.32%**. Not bad! This raises some questions, however:
* How does this model scale? We have data for seasons from 2004 to today. Does the accuracy rise or fall when we use all of it?
* We only considered the previous game for each team. Will our accuracy improve if we consider the previous $n$ games? What is the optimal value of $n$ (or rather, how many previous games are indicative of how well a team will perform on a given night)?

**The Underlying Assumption**: A team that is "hot" will "stay hot". In concrete terms, a team that keeps winning games is likely to continue winning.

### A Rudimentary Model
- In their previous $n$ games, team **A** scored an average of $a_1$ points and allowed an average of $a_2$ points.
- In their previous $n$ games, team **B** scored an average of $b_1$ points and allowed an average of $b_2$ points.
- If the average of $a_1$ and $b_2$ is larger than the average of $b_1$ and $a_2$, we predict that team **A** will win; if it is smaller, we predict team **B**. If the two averages are equal, we predict team **A** (the home team in the code below).
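
The rule above can be sketched as a small standalone function. The name `predict_team_a_wins` is hypothetical and used only for illustration; the arguments follow the $a_1$, $a_2$, $b_1$, $b_2$ notation from the list.

```python
def predict_team_a_wins(a1, a2, b1, b2):
    """Return True if the rudimentary model predicts that team A wins.

    a1, a2: average points scored and allowed by team A in its previous n games.
    b1, b2: average points scored and allowed by team B in its previous n games.
    """
    # A's expected score blends what A usually scores with what B usually allows.
    predicted_points_a = (a1 + b2) / 2.0
    # B's expected score blends what B usually scores with what A usually allows.
    predicted_points_b = (b1 + a2) / 2.0
    # Ties go to team A.
    return predicted_points_a >= predicted_points_b
```

For example, with $a_1 = 110$, $a_2 = 100$, $b_1 = 105$, $b_2 = 108$, team **A** is expected to score 109 against 102.5 for team **B**, so the model picks **A**.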

**The Question**: What is the most accurate value for $n$?

In [None]:
# Do not overwrite games!

# TODO: Clean up this code. Decoration is a nightmare.

def get_average_points_in_previous_n_games(row, n, side, points_for):
    """Adds a team's average points over its previous n games to the row.

    side is 'HOME' or 'AWAY' and selects which team of this game to trace.
    points_for selects points scored (True) or points allowed (False).
    """
    column = f"{side}_avg_pts_{'for' if points_for else 'against'}_in_previous_games"
    row[column] = None
    team = row[f'{side}_TEAM_ID']
    points = 0
    number_of_games = 0
    previous_game_id = row[f'{side}_TEAM_PREVIOUS_GAME_ID']
    # Walk back through the chain of previous games, stopping early if the history runs out.
    while number_of_games < n and previous_game_id != -1:
        previous_game = games.loc[previous_game_id]
        was_home = team == previous_game['HOME_TEAM_ID']
        # Points scored come from the team's own side, points allowed from the opponent's side.
        if was_home == points_for:
            points += previous_game['PTS_home']
        else:
            points += previous_game['PTS_away']
        # The next previous game is stored on the side the team played in this game.
        if was_home:
            previous_game_id = previous_game['HOME_TEAM_PREVIOUS_GAME_ID']
        else:
            previous_game_id = previous_game['AWAY_TEAM_PREVIOUS_GAME_ID']
        number_of_games += 1
    # Finally, compute the average.
    if number_of_games > 0:
        row[column] = points / number_of_games
    return row

def get_average_points_for_home_team_in_previous_n_games(row, n):
    return get_average_points_in_previous_n_games(row, n, 'HOME', True)

def get_average_points_against_home_team_in_previous_n_games(row, n):
    return get_average_points_in_previous_n_games(row, n, 'HOME', False)

def get_average_points_for_away_team_in_previous_n_games(row, n):
    return get_average_points_in_previous_n_games(row, n, 'AWAY', True)

def get_average_points_against_away_team_in_previous_n_games(row, n):
    return get_average_points_in_previous_n_games(row, n, 'AWAY', False)

def get_accuracy_for_n(n):
    """Returns the prediction accuracy (in percent) when averaging over the previous n games."""
    # apply returns a new DataFrame each time, so the global games is never modified.
    games_for_n = games
    games_for_n = games_for_n.apply(get_average_points_for_home_team_in_previous_n_games, axis=1, args=(n,))
    games_for_n = games_for_n.apply(get_average_points_against_home_team_in_previous_n_games, axis=1, args=(n,))
    games_for_n = games_for_n.apply(get_average_points_for_away_team_in_previous_n_games, axis=1, args=(n,))
    games_for_n = games_for_n.apply(get_average_points_against_away_team_in_previous_n_games, axis=1, args=(n,))
    # Rename to the notation used in the model description (A is home, B is away).
    games_for_n = games_for_n.rename(columns={
        'HOME_avg_pts_for_in_previous_games': 'a1',
        'HOME_avg_pts_against_in_previous_games': 'a2',
        'AWAY_avg_pts_for_in_previous_games': 'b1',
        'AWAY_avg_pts_against_in_previous_games': 'b2'
    })
    # Drop rows with missing values, e.g. games where a team has no history to average over.
    games_for_n = games_for_n.dropna()
    games_for_n['PREDICTION_PTS_home'] = (games_for_n['a1'] + games_for_n['b2']) / 2.0
    games_for_n['PREDICTION_PTS_away'] = (games_for_n['b1'] + games_for_n['a2']) / 2.0
    # Ties go to the home team.
    games_for_n['PREDICTION_HOME_TEAM_WINS'] = games_for_n['PREDICTION_PTS_home'] >= games_for_n['PREDICTION_PTS_away']
    games_for_n['PREDICTION_CORRECT'] = games_for_n['HOME_TEAM_WINS'] == games_for_n['PREDICTION_HOME_TEAM_WINS']
    # The mean of a boolean column is the fraction of correct predictions. Unlike
    # value_counts()[True], this does not raise a KeyError when no prediction is correct.
    return games_for_n['PREDICTION_CORRECT'].mean() * 100

# TODO: Check that these functions are getting the correct model values.

# print(get_accuracy_for_n(1))
# Evaluate the model for n = 1 through 50, indexing each accuracy by its n.
values_of_n = np.arange(1, 51)
s = pd.Series(values_of_n, index=values_of_n)
n_accuracy_values = s.map(get_accuracy_for_n)

In [None]:
n_accuracy_values

We should still decide whether or not to prune preseason and postseason games.

## Exploring Home Team Advantage

**The Underlying Assumption**: A home team is more likely to beat their visiting opponents.
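
One simple way to check this assumption against the data is to measure how often the home team actually wins. The sketch below assumes the `SEASON` and boolean `HOME_TEAM_WINS` columns prepared earlier in this notebook; the function name `home_win_rate_by_season` is hypothetical.

```python
import pandas as pd

def home_win_rate_by_season(games):
    """Return the percentage of games won by the home team, per season."""
    # The mean of a boolean column is the fraction of True values.
    return games.groupby('SEASON')['HOME_TEAM_WINS'].mean() * 100
```

A rate consistently above 50% would support the assumption, and it would also give a baseline ("always pick the home team") to compare any prediction model against.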

TODO: Add some description about this. Anthony mentioned something about Las Vegas bookies.

TODO: What is the *number* or *thing* we are looking for?

**The Question**: ...

## Well-Rested Advantage

**The Underlying Assumption**: A well-rested team is more likely to win games than a less rested team.

> **Notes from Meeting**
>
> Then, can focus on rest. Just focus on rest. How does it do?
> - A tired good team is probably still better than a healthy bad team.
> - Then, can explore for one team.
> - Could also just add points for rest.
> - Visualize with scatterplot.
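
To explore rest, a first step could be to measure the days between a team's game and its previous game. This is a minimal sketch, assuming `games` is indexed by a unique game ID and carries the `HOME_TEAM_PREVIOUS_GAME_ID` and `AWAY_TEAM_PREVIOUS_GAME_ID` pointers built earlier (with -1 meaning no previous game). The function name `add_rest_days` and the columns `HOME_REST_DAYS` and `AWAY_REST_DAYS` are hypothetical.

```python
import pandas as pd

def add_rest_days(games):
    """Return a copy of games with the days since each team's previous game."""
    games = games.copy()
    # Lookup from game ID (the index) to the date of that game.
    date_by_game = games['GAME_DATE_EST']
    for side in ('HOME', 'AWAY'):
        # The sentinel -1 is not a valid game ID, so it maps to a missing date (NaT).
        previous_dates = games[f'{side}_TEAM_PREVIOUS_GAME_ID'].map(date_by_game)
        games[f'{side}_REST_DAYS'] = (games['GAME_DATE_EST'] - previous_dates).dt.days
    return games
```

Because the pointers cross season boundaries, the first game of a season would show a very long rest period, so those rows may need to be capped or dropped before plotting.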

**The Question**: ...


In [None]:
# TODO: A scatterplot would fit nicely here.

## Final Notes

TODO: Here is everything we know and believe and the data to back it.

TODO: Here are some things we should take a look at.

## What's Next?

TODO: Here's everything we would like to look into and expand going forward.

TODO: How do we expand on this going forward?

- Regression
- Explore *similar* games
- Team Health
