Data and statistics are ubiquitous in sports. Whether it is “Power Play Goals” in hockey, “First Serve Points Won” in tennis, or “Height and Reach” in boxing, these metrics reflect how integral we believe numbers to be to sports performance and outcomes. Despite the universality of data in sports in this decade, the question of how to use this data to inform decisions in sports is a fairly recent development.
The Society for American Baseball Research (SABR), founded in 1971, was the first major sports-related institution to recognize the value of statistics in sports (DSG, 2021). However, the concept of leveraging these statistics to make decisions in baseball (aptly named SABR-metrics) did not gain public or commercial recognition until the publication of Moneyball (Lewis, 2004) and the release of the film of the same name in 2011.

Basketball’s own Association for Professional Basketball Research was established in 1997, and the resulting APBR-metrics (a much less phonetically pleasing name) only made their way into the sport in 2004, when the Seattle SuperSonics hired Dean Oliver, making him the first statistician hired in the NBA. The current “Statistical Player Value” is derived from “Minutes, Points, Rebounds, Assists, Steals, Blocks, and Turnovers” (Statistical Player Value, SPV - NBAstuffer, 2022). Other metrics such as “Offensive Rating”, “Defensive Rating”, and “Player Winning Percentage” (a combination of Offensive and Defensive Rating) have been developed to try to represent the game of basketball more accurately in statistical terms.

As avid fans of the game, we found that combining our love for the sport with this project made the assignment engaging, and it was a joy to work through problems and surpass our own expectations. Within the scope of this engagement, each of us explored basketball from an analytical point of view, trying to understand how certain statistics affect the general functioning of a team, which metrics matter for winning games, and which player-related data sources are available.
From a broader point of view, that is, looking at the NBA and the sport as a whole, our analysis gives minor insights into team and player behaviors and highlights the metrics and the types of players that an NBA General Manager needs to consider when trying to build a championship team.

Dataset and Web Scraping 

As a team, our goal was to find datasets that contain relevant information to answer our guiding questions. The sources we are utilizing for our datasets are:
Basketball-Reference - Basketball Statistics and History.
NBA.com - Official NBA Stats
Data on these websites is provided by Sportradar (a sports technology company), the official stats partner of the National Basketball Association (NBA). We are permitted to use the stats on the NBA website under its copyright terms, which state that “the NBA Statistics may only be used, displayed or published for legitimate news reporting or private, non-commercial purposes” (National Basketball Association).
For player analysis, our dataset includes a player’s information such as age, team, game_played, 2-point field goals, 3-point field goals, points scored, blocks, assists and injuries (body part).
For team analysis, our dataset includes information on a specific NBA team’s Wins, Losses, make_Playoffs, Championships won, games_played, Points Per Game and Opponent Points Per Game.
For meta-game analysis, our datasets include the following variables, which help us analyze how basketball has changed over the years: season, games, field goal, points_per_game and offensive/defensive rebounds.
The data used for the above three topics was collected using web scraping. The Python web scraper is built on two main libraries, BeautifulSoup4 and urllib. The pages on these websites are built mainly with HTML and CSS, which allows us to parse each page's HTML and extract the components we need.
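As a minimal sketch of this parsing approach, the snippet below runs BeautifulSoup on a small inline HTML string instead of a live page. The HTML, the player names and the helper `table_to_dataframe` are made up for illustration; the real pages have many more columns, and our actual scripts hard-code the column names.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML standing in for a downloaded stats page
SAMPLE_HTML = """
<table id="per_game_stats">
  <tr><th>Player</th><th>Tm</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>BOS</td><td>30.1</td></tr>
  <tr><th>Player</th><th>Tm</th><th>PTS</th></tr>
  <tr><td>Player B</td><td>GSW</td><td>29.4</td></tr>
</table>
"""

def table_to_dataframe(html, table_id):
    """Return the data rows of the table with the given id as a DataFrame."""
    soup = BeautifulSoup(html, features="html.parser")
    table = soup.find("table", id=table_id)
    rows = table.find_all("tr")
    # The first row holds the column names in <th> cells
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # Keep only rows that contain <td> cells; repeated header rows have none
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            records.append(cells)
    return pd.DataFrame(records, columns=columns)

sample_df = table_to_dataframe(SAMPLE_HTML, "per_game_stats")
print(sample_df)
```

Skipping rows without `<td>` cells is what removes the header rows that these tables repeat partway down the page.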

Parsing Contracts data for Players playing in 2022-2023 Season
WEBPAGE: https://www.basketball-reference.com/contracts/players.html

The following is one of the simpler scripts we have built: it extracts the sub-header for contracts by finding all the “tr” elements in the player contracts table. One of our questions deals with player injuries and whether there is a relationship between a player's injuries and the contract that player is currently on. For instance, Derrick Rose was one of the best young players in the league after being drafted in 2008, and he won the MVP award in 2011 (the youngest player ever to do so). Unfortunately, he suffered an ACL tear in 2012, and his career has since taken a major dip from its peak. Since 2018, he has sustained 22 lower-body injuries. The goal here is to understand the kind of contract an injury-prone player like him commands right now and to compare it with other players in similar situations.
In our SQL joining scripts, this data is joined with injury-related data to analyze how injuries affect a player's contract, how relevant this high-level observation is, and whether there are any outliers.
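The join described above can be sketched in pandas as follows. This is only an illustration under assumed column names (`Player`, `Salary`, `Body_Part`) and made-up rows; our actual join is done in SQL against the scraped tables.

```python
import pandas as pd

# Made-up illustrative values, not scraped data
contracts = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C"],
    "Salary": [30_000_000, 12_000_000, 2_500_000],
})
injuries = pd.DataFrame({
    "Player": ["Player A", "Player C", "Player C"],
    "Body_Part": ["knee", "ankle", "knee"],
})

# Count injuries per player, then left-join so uninjured players are kept
injury_counts = (
    injuries.groupby("Player").size().rename("Injury_Count").reset_index()
)
contract_injuries = contracts.merge(injury_counts, on="Player", how="left")
contract_injuries["Injury_Count"] = (
    contract_injuries["Injury_Count"].fillna(0).astype(int)
)
print(contract_injuries)
```

A left join is used here so that players with no recorded injuries still appear, with a count of zero, which keeps them available as a comparison group.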

In [2]:
#Importing required libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import time
import plotly.express as px
import numpy as np
from plotly import graph_objs as go
import sqlalchemy as sq
from sqlalchemy import update, delete, insert
import mysql.connector
from mysql.connector import errorcode

In [None]:
#Creating list with NBA years (1980 through 2023)
years = list(range(1980, 2024))

#Looping through each year in years list
for i in years:
    #Loading Website to variable
    url = "https://www.basketball-reference.com/leagues/NBA_"+str(i)+"_per_game.html"
    html = urlopen(url)
    soup = BeautifulSoup(html, features="html.parser")
    #Parsing the required HTML elements
    data = soup.find_all('table', id="per_game_stats")[0].findAll('tr')
    pergame_stats = [[td.getText() for td in data[i].findAll('td')] for i in range(0,len(data))]
    head = [[th.getText() for th in data[i].findAll('th')] for i in range(0,len(data))]
    pergame_stats=pergame_stats[1:]
    header = head[1:]
    #Setting headers and data
    pergame_stat = pd.DataFrame(pergame_stats, columns = ['Player', 'Pos', 'Age', 'Tm', 'G', 'GS', 'MP', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P', '2PA', '2P%', 'eFG%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS'])
    pergame_stat["Season"]= i 
    pergame_stat.dropna(inplace = True) 
    #Writing data to file
    pergame_stat.to_csv("pergamebasics.csv", mode='a', index=False)
    print(str(i)," year done...")
    time.sleep(5)

This data source hosts overall team statistics from 1980 to 2023, including per game, shooting and advanced statistics for all teams in each season. This data is essential for an overview of how the NBA has changed over the years, including rule changes such as the introduction of the 3-point line and the boom in its usage after Steph Curry joined the NBA in 2009. This dataset supports our meta-analysis of the game through graphical representation.

This data source is used where the “meta” of basketball is discussed, especially the rise of 3-point shooting in the NBA alongside the meteoric rise of Steph Curry within the league.


In [3]:
#Loading Website to variable
url = "https://www.basketball-reference.com/leagues/NBA_stats_per_game.html"
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")

#Parsing the required HTML elements
data = soup.find_all('table', id="stats")[0].findAll('tr')
overall_stats = [[td.getText() for td in data[i].findAll('td')] for i in range(0,len(data))]
overall_stats = overall_stats[2:]

#Setting headers and data
overall_header = [[th.getText() for th in data[i].findAll('th')] for i in range(0,len(data))]
overall_stat = pd.DataFrame(overall_stats,columns = ['Season', 'Lg', 'Age', 'Ht', 'Wt', 'G', 'MP', 'FG', 'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'FG%', '3P%', 'FT%', 'Pace', 'eFG%', 'TOV%', 'ORB%', 'FT/FGA', 'ORtg'])
overall_stat.dropna(inplace = True) 

#Writing data to file
overall_stat.to_csv("stat_overall.csv", index=False)

Parsing per game statistics per player for each team 
WEBPAGES: 
https://www.basketball-reference.com/teams/BOS/2023.html
https://www.basketball-reference.com/players/t/tatumja01.html
https://www.basketball-reference.com/players/t/tatumja01/gamelog/2018

Algorithm
From a code point of view, the script does the following:
1. For each team in the NBA, set the abbreviation and build the URL (a).
2. Open the web page using urllib and BeautifulSoup and find the HREFs in the roster table to get all players for the team.
3. Loop through each player on each team.
4. List all seasons the player has played in. For example, Jayson Tatum has played in 6 NBA seasons.
5. Build the URL using each season's code (b).
6. For each season, go to the per game stats and pull the data (c).
7. Store the data in a CSV file in append mode.
8. Exit the loop.
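
The URL-building steps above can be sketched as below. The helper names are hypothetical; the URL patterns follow the three example pages listed earlier, and the playoff filter mirrors the `gamelog-playoffs` check in our script.

```python
BASE_URL = "https://www.basketball-reference.com"

def team_url(team_abbreviation, season_end_year):
    """Roster page for a team, for example BOS and 2023."""
    return f"{BASE_URL}/teams/{team_abbreviation}/{season_end_year}.html"

def is_regular_season_gamelog(href):
    """True for regular-season game log links, False for playoff logs."""
    return "gamelog" in href and "gamelog-playoffs" not in href

def gamelog_url(href):
    """Absolute URL for a relative game log link found on a player page."""
    return BASE_URL + href

print(team_url("BOS", 2023))
print(gamelog_url("/players/t/tatumja01/gamelog/2018"))
```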

The data collected by the script below forms the basis of the “Game Simulator” that the team has built. The script parses the per game statistics of all currently active players on each team in the league. The data is then cleaned, type-corrected and filtered to the most recent season, 2022-2023. We use this data to predict each player's score per game in our simulations. These scores aggregate to a team total which, when compared with the opposing team's total, allows us to predict the outcome of a head-to-head matchup.
Models similar to the ones we have built can be seen on betting sites like FanDuel and PointsBet and in some fantasy leagues, where each statistic (each point, assist, rebound, steal or block a player records) contributes to a fantasy score. The player with the highest fantasy score ends up being the winner.
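
A minimal sketch of the aggregation idea follows, assuming each player's predicted score is simply their average points per game (the simulator itself may use a different predictor). The rows and function names are made up for illustration.

```python
import pandas as pd

# Made-up game logs; the real input would be the cleaned per-game data
game_logs = pd.DataFrame({
    "Tm": ["BOS", "BOS", "BOS", "BOS", "GSW", "GSW", "GSW", "GSW"],
    "Player": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "PTS": [30, 26, 20, 22, 28, 32, 15, 13],
})

def predicted_team_scores(logs):
    """Sum each player's average points into a predicted team total."""
    player_avg = logs.groupby(["Tm", "Player"])["PTS"].mean()
    return player_avg.groupby(level="Tm").sum()

def predict_winner(logs, team_a, team_b):
    """Return the team with the higher predicted total (team_a on a tie)."""
    totals = predicted_team_scores(logs)
    return team_a if totals[team_a] >= totals[team_b] else team_b

print(predicted_team_scores(game_logs))
print(predict_winner(game_logs, "BOS", "GSW"))
```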


In [None]:
# Setting batches of teams since parsing through multiple players per game stats takes some time to process
batch1 = ["BOS", "TOR", "PHI", "BRK", "NYK"]
batch2 = ["MIL", "CLE", "IND", "CHI", "DET"]
batch3 = ["ATL", "WAS", "MIA", "ORL", "CHO"]
batch4 = ["UTA", "DEN", "POR", "MIN", "OKC"]
batch5 = ["PHO", "SAC", "LAC", "GSW", "LAL"]
batch6 = ["MEM", "NOP", "DAL", "SAS", "HOU"]

# change your team selection here!
teams = batch1
# csv file path
csvFilePath = f"../Cleaned_Datasets/cummalativePerGame"+"batch1"+".csv"
#Looping though each team in batch
for team in teams:
    print(str(team)+" starting...")
    #Loading URL to variable
    url = "https://www.basketball-reference.com/teams/"+str(team)+"/2023.html"
    html = urlopen(url)
    soup = BeautifulSoup(html, features="html.parser")
    #Parsing all players in each team
    data = soup.find_all('table', id="roster")[0].findAll("td")
    #Looping through each player in parsed data
    for player in data:
        for a in player.findAll("a", href=True):
            time.sleep(1)
            if "players" in a["href"]:
                #Loading URL to variable
                url1 = "https://www.basketball-reference.com"+str(a["href"])
                html = urlopen(url1)
                soup = BeautifulSoup(html, features="html.parser")
                playername = soup.findAll('div', id="meta")[
                    0].findAll("span")[0].text.strip()
                data1 = soup.find_all('div', id="bottom_nav_container")[0]
                print("Scraping "+str(playername)+" data")
                #Looping through per game stats for each player
                for data in data1.findAll("a"):
                    time.sleep(1)
                    if "gamelog-playoffs" in data.attrs["href"]:
                        continue
                    elif "gamelog" in data.attrs["href"]:
                        #Loading URL to variable
                        url2 = "https://www.basketball-reference.com" + \
                            str(data.attrs["href"])
                        html = urlopen(url2)
                        soup = BeautifulSoup(html, features="html.parser")
                        #Parsing the required HTML elements
                        pergamestats = soup.find_all('table', id="pgl_basic")[
                            0].findAll("tr")
                        pergamestatsdata = [[td.getText() for td in pergamestats[i].findAll(
                            'td')] for i in range(0, len(pergamestats))]
                        pergamestatsheader = [[th.getText() for th in pergamestats[i].findAll(
                            'th')] for i in range(0, len(pergamestats))]
                        pergamestatsdata = pergamestatsdata[1:]
                        #Setting headers and data
                        pergamestat = pd.DataFrame(pergamestatsdata, columns=['G', 'Date', 'Age', 'Tm', '\xa0', 'Opp', '\xa0', 'GS', 'MP', 'FG', 'FGA',
                                                   'FG%', '3P', '3PA', '3P%', 'FT', 'FTA', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'GmSc', '+/-'])
                        pergamestat["Player"] = playername
                        #Writing data to file
                        pergamestat.to_csv(
                            csvFilePath, mode='a', index=False)
                    else:
                        continue
            else:
                continue
    print(str(team), " done...")

Parsing player continuity data since 1952 season
WEBPAGE: https://www.basketball-reference.com/friv/continuity.html

Team chemistry is an intangible that cannot be evaluated by simple statistics, but a team’s ability to retain its players is an important aspect of team building and of overall team morale. For instance, the San Antonio Spurs' core of Tim Duncan, Tony Parker and Manu Ginóbili, along with their coach Gregg Popovich, was one of the longest-tenured groups working together in the NBA in the 2000s and 2010s. A team's bonds cannot be quantified directly because no such statistic is available, but we can quantify whether keeping a group together actually led to success for the team. Joining and visualizing team continuity with overall team success was in scope for our work.

In [None]:
#Loading URL to variable
url = "https://www.basketball-reference.com/friv/continuity.html"
html = urlopen(url)
soup = BeautifulSoup(html, features="html.parser")

#Parsing the required HTML elements
data = soup.find_all('table', id="continuity")[0].findAll('tr')
continuity = [[td.getText() for td in data[i].findAll('td')] for i in range(0,len(data))]
head = [[th.getText() for th in data[i].findAll('th')] for i in range(0,len(data))]
print(head[1:])
continuity=continuity[1:]
#Setting headers and data
continuity_stat = pd.DataFrame(continuity, columns = ['ATL', 'BOS', 'CHA', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 
                                                      'MIA', 'MIL', 'MIN', 'NJN', 'NOH', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'TOR', 'UTA', 'WAS'])
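#Taking the first <th> cell of each row as the season label, so the column holds plain strings instead of lists
continuity_stat["Season"]= [row[0] if row else None for row in head[1:]]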
#Writing data to file
continuity_stat.to_csv(r"C:\Code\continuity.csv", index=False)

Parsing individual accomplishments data since the inception of the awards
WEBPAGE: https://www.basketball-reference.com/awards/mvp.html

The common thought is that the top teams have the top players playing for them, and awards are one of the most sound metrics for deciding who these top players are. The purpose of getting this data is to understand how often the teams with the so-called “Best Players” actually end up having championship success. For example, the Boston Celtics went up against the Golden State Warriors in the 2022 NBA Finals. The Celtics had the reigning Defensive Player of the Year, but the award counted for little in the series: Steph Curry of the Warriors averaged 31.2 points over the 6 games and won the title. Looking into how relevant these award winners are can be useful from a team building point of view.

In [None]:
awards = ['mvp','roy','dpoy','smoy','mip']
executive_awards=['coy','eoy']

#Looping through each award in the awards list
for award in awards:
    #Loading URL to variable
    url = "https://www.basketball-reference.com/awards/"+str(award)+".html"
    html = urlopen(url)
    soup = BeautifulSoup(html, features="html.parser")
    #Using table_id rather than id so the built-in id() is not shadowed
    table_id = str(award)+"_NBA"
    #Parsing the required HTML elements
    data = soup.find_all('table', id=table_id)[0].findAll('tr')
    award_stats = [[td.getText() for td in data[i].findAll('td')] for i in range(0,len(data))]
    head = [[th.getText() for th in data[i].findAll('th')] for i in range(0,len(data))]
    award_stats=award_stats[2:]
    h = head[1]
    header = h[1:]
    #Setting headers and data
    award_stats = pd.DataFrame(award_stats, columns = header)
    award_stats['Award'] = award
    #Writing data to file
    award_stats.to_csv(r"C:\Code\statsaward.csv", mode='a', index=False)
    print(str(award)," award done...")

Parsing advanced statistics from 1980 to 2023
WEBPAGE: https://www.basketball-reference.com/leagues/NBA_2022_advanced.html#advanced_stats::per

Ever since sports analytics became a popular profession, teams have been hiring analysts and tracking advanced statistics for player and team metrics. These include statistics such as how often a player turns the ball over or how much of the team's possessions go through that player. Moreover, we can look at Player Efficiency Rating, win shares and true shooting percentage as metrics to evaluate overall confidence in a player's skills.
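
As one concrete example, true shooting percentage is commonly computed as PTS / (2 * (FGA + 0.44 * FTA)), which credits free throws and 3-pointers alongside 2-point field goals. The sketch below applies that formula; the function name and the sample numbers are ours, not values from the dataset.

```python
def true_shooting_pct(points, fga, fta):
    """True shooting percentage from points, field goal attempts and free throw attempts."""
    # 0.44 is the conventional weight for free throw attempts
    true_shooting_attempts = fga + 0.44 * fta
    if true_shooting_attempts == 0:
        return None  # undefined for a player with no attempts
    return points / (2 * true_shooting_attempts)

# Hypothetical game line: 30 points on 20 field goal attempts and 5 free throw attempts
print(round(true_shooting_pct(30, 20, 5), 3))
```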

In [None]:
#Setting NBA years (1980 through 2023) into years list
years = list(range(1980, 2024))

#Looping through each year in years list        
for season in years:
    time.sleep(2)
    #Loading URL to variable
    url = "https://www.basketball-reference.com/leagues/NBA_"+str(season)+"_advanced.html#advanced_stats::per"
    html = urlopen(url)
    soup = BeautifulSoup(html, features="html.parser")
    
    #Parsing the required HTML elements
    data = soup.find_all('table', id="advanced_stats")[0].findAll('tr')
    player_stats = [[td.getText() for td in data[i].findAll('td')] for i in range(0,len(data))]
    player_header = [[th.getText() for th in data[i].findAll('th')] for i in range(0,len(data))]
    player_stats = player_stats[1:]
    
    #Setting headers and data
    player_stat = pd.DataFrame(player_stats,columns = ['Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 
                                                        'STL%', 'BLK%', 'TOV%', 'USG%', '\xa0', 'OWS', 'DWS', 'WS', 'WS/48', '\xa0', 'OBPM', 'DBPM', 'BPM', 'VORP'])
    player_stat.dropna(inplace = True)
    player_stat['Season'] = season
    #Writing data to file
    player_stat.to_csv(r"C:\Code\playerstat.csv", mode='a', index=False)
    print(str(season)," season done...")

Data Exploration

Guiding Question 1: How have the rule changes in basketball affected team stats?

The NBA introduced the 3-point line in 1979, and not much changed right away. The arithmetic favors the longer shot: a player who makes one-third of their 3-point attempts scores as many points per shot as a player who makes half of their 2-point attempts, so any 3-point accuracy above one-third yields more. In other words, for a team that shoots the 3-pointer reasonably well, taking more 3-pointers leads to a higher score. The questions we would like to analyze are: have teams adapted their shooting style to take more 3-pointers? What is the general trend in teams' 3-point shooting? And, based on the analysis, should we suggest that coaches make sure their players are well trained in taking 3-point shots?
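
The arithmetic behind this comparison is points per attempt: make rate multiplied by shot value. The short sketch below, with hypothetical function names, shows that one-third accuracy on 3-pointers is the break-even point against 50% accuracy on 2-pointers, and that anything above it comes out ahead.

```python
def expected_points_per_shot(make_rate, shot_value):
    """Average points scored per attempt for a given make rate."""
    return make_rate * shot_value

def break_even_three_point_rate(two_point_rate):
    """3-point make rate needed to match a given 2-point make rate."""
    return two_point_rate * 2 / 3

two_point_value = expected_points_per_shot(0.50, 2)
three_point_value = expected_points_per_shot(0.36, 3)  # assumed 36% shooter
print(two_point_value, three_point_value)
print(break_even_three_point_rate(0.50))
```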

In [None]:
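#Connecting to the MariaDB database; the password is read from an environment variable
#(DB_PASSWORD is an assumed variable name) rather than being hard-coded in the notebook
import os
engine = sq.create_engine('mysql+mysqlconnector://L01-7:' + os.environ['DB_PASSWORD'] + '@datasciencedb.ucalgary.ca/L01-7')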

In [None]:
#Importing the dataset stat_overall into pandas dataframe
df = pd.read_csv("stat_overall.csv")

#Filling all the NaN in the dataframe with 0
df.fillna(0,inplace=True)

#Reversing the row order so that the seasons run from earliest to latest
overallstat=df[::-1]  
display(overallstat)

In [None]:
#Reading the stat_overall table back from MariaDB using SQLAlchemy

query = pd.read_sql_query('SELECT * FROM stat_overall', engine)
query.head()

In [None]:
#selecting only necessary columns for our analysis
query= '''
SELECT Season, 3PA FROM stat_overall
'''
overallstat = pd.read_sql_query(query, engine)
display(overallstat)

In [None]:
#Scatter Graph
fig = px.scatter(overallstat, x="Season", y="3PA", color="Season", trendline="ols",
           )
fig.update_layout(
    title_text="3P Trend",
    showlegend=False,
    width=800, height=500
)
fig.add_vline(x='1978-79', line_width=3, line_dash="dash", line_color="green", name='3 Pointer Started')
fig.add_vline(x='2009-10', line_width=3, line_dash="dash", line_color="red", name='Steph Curry Drafted')

fig.show()

From the above graph we can see that after the 3-point line was introduced in 1979 (green dashed line), players started taking advantage of it, and there is a steady rise in the number of 3-point attempts. Steph Curry was drafted in 2009 (red dashed line); in 2012-13 he set the NBA single-season record for most 3-pointers made with 272, and he later raised his own record to 402 in 2015-16. The graph shows a steep increase in 3-point attempts after 2009, which suggests that one player's success encouraged teams across the league to take more 3-point shots. This analysis shows how a single rule change has gradually reshaped the whole NBA game over the years.

This data would be useful for coaches who want to improve their team's overall performance, for example by making sure that players focus on improving their 3-point accuracy.

Guiding Question 2: How does player continuity affect the overall success of the team?

Throughout the history of the NBA, one of the most frequently overlooked ingredients in the recipe for winning a championship has been team consistency. Is it important for NBA teams to stick with the same players so that they can evolve together and play better? As Bob Myers, the general manager of the Golden State Warriors, said, "Playing together with the same group of people for a long time makes you better. It just does."

To test this theory, we use the following datasets:

team_continuity.csv: this dataset reflects rotation continuity on NBA teams from 1952 to 2022. Continuity is defined as the overlap in minutes played by the same players from one year to the next, or 100 minus the % change in rotation minutes between years.

teamstats.csv: this dataset contains the number of wins and losses for every team in every season.

team_mapping.csv: this dataset contains team names and their abbreviations.
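
To make the continuity definition concrete, the sketch below computes the overlap between two seasons' minute distributions. This is our own simplified reading of the definition, with made-up minutes, and not necessarily the exact formula Basketball-Reference uses.

```python
def rotation_continuity(prev_minutes, curr_minutes):
    """Percentage overlap between two seasons' shares of minutes per player."""
    prev_total = sum(prev_minutes.values())
    curr_total = sum(curr_minutes.values())
    if prev_total == 0 or curr_total == 0:
        return 0.0
    overlap = 0.0
    for player, minutes in curr_minutes.items():
        # A player contributes the smaller of their two seasonal shares
        prev_share = prev_minutes.get(player, 0) / prev_total
        curr_share = minutes / curr_total
        overlap += min(prev_share, curr_share)
    return 100 * overlap

# Hypothetical minutes played by player, for two consecutive seasons
previous = {"A": 3000, "B": 2000, "C": 1000}
current = {"A": 3000, "B": 1500, "D": 1500}
print(rotation_continuity(previous, current))
```

Under this reading, a roster that returns unchanged with the same minute distribution scores 100, and a completely new roster scores 0.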

In [None]:
#Importing the datasets into pandas dataframe
df1 = pd.read_csv("team_continuity.csv")
df2 = pd.read_csv("teamstats.csv")
df3 = pd.read_csv("team_mapping.csv")

In [None]:
#Importing the team_continuity.csv dataset into pandas dataframe
df1 = pd.read_csv("team_continuity.csv")

#Filling all the team continuity NaN in the dataframe with 0
df1.fillna(0,inplace=True)
display(df1)

#Unpivoting (melting) the team_continuity data into one row per season and team, to match the format of the other datasets used for joining
team_continuity = pd.melt(df1, id_vars = 'Season', value_vars = ['ATL','BOS','CHA','CHI','CLE','DAL','DEN','DET','GSW','HOU','IND','LAC','LAL','MEM','MIA','MIL','MIN','NJN','NOH','NYK','OKC','ORL','PHI','PHO','POR','SAC','SAS','TOR','UTA','WAS']
,var_name ='Team', value_name ='Coninuity%') 

display(team_continuity)

In [None]:
# Importing team continuity to SQL
# team_continuity.to_sql('team_continuity', engine)

query = '''
SELECT * FROM team_continuity;
'''
team_continuity = pd.read_sql_query(query, engine)

team_continuity.head()

In [None]:
#Importing the teamstats.csv dataset into pandas dataframe

df2 = pd.read_csv("teamstats.csv")
df2=df2.drop(df2.columns[[8,15]], axis=1) #delete unnamed and NAN columns

display(df2)

df2.drop(['SRS', 'W/L%','Lg','Finish', 'Playoffs','Coaches','Top WS'], axis=1, inplace=True)
df2=df2[df2.Season != 'Season'] # Cleaning row headers in middle of every data

df2.fillna(0,inplace=True) #Filling all the teamstat NaN in the dataframe with 0

#Changing the datatypes to numeric
df2[['Pace','W','L','Rel Pace', 'ORtg', 'Rel ORtg', 'DRtg', 'Rel DRtg']]=df2[['Pace','W','L', 'Rel Pace', 'ORtg', 'Rel ORtg', 'DRtg', 'Rel DRtg']].apply(pd.to_numeric)

#Adding Winning percentage column after calculating
df2['W%'] = df2['W']/(df2['W']+df2['L']) 

print(df2.dtypes)
display(df2)

In [None]:
# Importing teamstats to SQL
# df2.to_sql('teamstats', engine)

query = '''

SELECT * FROM teamstats;
'''
teamstat = pd.read_sql_query(query, engine)

display(teamstat)

In [None]:
#Importing the team_mapping.csv dataset into pandas dataframe

df3 = pd.read_csv("team_mapping.csv")

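# Importing team map to SQL (if_exists='replace' lets the cell be re-run without failing on an existing table)
df3.to_sql('team_map', engine, if_exists='replace')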

query = '''
SELECT * FROM team_map;
'''
teammap = pd.read_sql_query(query, engine)

display(teammap)

In [None]:
#Joining the three datasets using SQL queries
query = '''

SELECT subquery.Season,subquery.Team,
       subquery.Abbreviation as Team_Abrv,round(subquery.`W%`,2) as `Win%`,tc.`Coninuity%` as `Continuity%`
FROM (
    SELECT ts.Season,
        REPLACE(ts.Team, '*', '') as Team,
        tm.Abbreviation,
        ts.`W%`
     FROM teamstats as ts
        JOIN team_map as tm
        on REPLACE(ts.Team, '*', '') = tm.Team_Name
) as subquery
JOIN team_continuity as tc
    on subquery.Abbreviation = tc.Team and tc.Season = subquery.Season;
'''
team_wc = pd.read_sql_query(query, engine)

display(team_wc)

In [None]:
team_wc.loc[(team_wc['Team_Abrv'] == 'CHI') & (team_wc['Continuity%'] > .70)]

For example, let's check the data for the Chicago Bulls. The team won NBA championships in 1991, 1992, 1993, 1996, 1997 and 1998. In the data above, the Bulls' team continuity stayed above 80% in each of those title-winning seasons. This continuity was one of the factors behind the Bulls winning three consecutive championships on two separate occasions. In 1996-97, when the Bulls' (CHI) team continuity was 97%, their winning percentage was 84%.

From the results above, there appears to be a positive correlation between winning percentage and team continuity percentage. This suggests that teams that keep their players together tend to play better and win more of their games over the season.
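
One way to put a number on this relationship is the Pearson correlation coefficient between the two columns. The sketch below uses made-up rows with the same column names as `team_wc`; the function name is hypothetical, and it assumes both columns can be converted to numbers.

```python
import pandas as pd

def continuity_win_correlation(df):
    """Pearson correlation between the Win% and Continuity% columns."""
    # Coerce to numbers so that values stored as text do not break the calculation
    win = pd.to_numeric(df["Win%"], errors="coerce")
    continuity = pd.to_numeric(df["Continuity%"], errors="coerce")
    return win.corr(continuity)

# Made-up rows for illustration only, not results from our data
sample_wc = pd.DataFrame({
    "Win%": [0.84, 0.61, 0.45, 0.30, 0.55],
    "Continuity%": [97, 80, 62, 48, 70],
})
r = continuity_win_correlation(sample_wc)
print(round(r, 3))
```

A value close to 1 would support the positive relationship, a value near 0 would indicate no linear relationship, and correlation alone does not show that continuity causes winning.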

Finally, we can agree with general manager Bob Myers' statement: team continuity plays a big role in team wins, and it should not be overlooked by any General Manager when building a roster or trading players in pursuit of an NBA championship.

Guiding Question 3: How does playing defensively vs offensively impact a team's overall season?

There is a common belief in the NBA that defense has a bigger impact on overall team success than offense. On multiple occasions, commentators and NBA legends like Bill Russell have been quoted as saying "Offense wins games, Defense wins Championships". From a broader view this makes sense, but with the rise in 3-point usage in the league and the development of fast-paced, offense-focused teams, the saying seems moot. We attempt to understand whether there is any weight in the late NBA legend's saying, or whether the 3-point-heavy, pass-heavy, team-focused style of play has overpowered the modern NBA defense.

In [4]:
def_vs_off = pd.read_sql_query('''
Select
distinct Def_off_flag,
sum(
    case
        when Winner_flag != 0 then 1
    else 0
end
) / sum(
case
    when (
        Winner_flag = 0
        and played_flag != 0
    ) then 1
    else 0
end
) as win_pct
from(
Select
    distinct cleaned_team,case
        when Net_RTG > 0 then "Offensive"
        else "Defensive"
    end as Def_off_flag,
    sum(Winner_flag) as Winner_flag,
    sum(played_flag) as played_flag
from(
        Select
            distinct cleaned_team,
            round(ORtg - DRtg, 0) as Net_RTG,
            Winner_flag,
            played_flag
        from(
                select
                    distinct replace(Team, '*', '') as cleaned_team,
                    Sum(ORtg) as ORtg,
                    SUM(DRtg) as DRtg,
                    sum(
                        case
                            when Playoffs = ("Won Finals") then 1
                            else 0
                        end
                    ) as Winner_flag,
                    sum(
                        case
                            when Playoffs is not null then 1
                            else 0
                        end
                    ) as played_flag
                FROM
                    teamstats_1
                group by
                    1
            ) a
    ) a
group by
    1,
    2
) a
group by
	1;''', engine)
print(def_vs_off)


The resulting table suggests that teams with better offensive play (those whose offensive rating exceeds their defensive rating) have a significantly higher winning percentage than the other group of teams.
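
For readers who prefer pandas, the grouping logic of the query can be sketched as below on made-up rows. It is a simplified illustration (it skips the rounding of the net rating), and the function name is hypothetical. As in the query, the value is the number of title-winning teams divided by the number of playoff teams that never won a title.

```python
import pandas as pd

# Made-up team seasons for illustration only
team_seasons = pd.DataFrame({
    "Team": ["A", "A", "B", "B", "C", "D", "E"],
    "ORtg": [112, 110, 105, 104, 111, 103, 114],
    "DRtg": [108, 107, 107, 106, 109, 104, 110],
    "Playoffs": ["Won Finals", "Lost Finals", None, "Lost 1st Rd",
                 "Lost Conf. Finals", "Lost 1st Rd", "Won Finals"],
})

def title_ratio_by_style(df):
    """Ratio of title-winning teams to title-less playoff teams, per style."""
    flags = df.assign(
        won=df["Playoffs"].eq("Won Finals"),
        played=df["Playoffs"].notna(),
    )
    per_team = flags.groupby("Team").agg(
        ortg=("ORtg", "sum"),
        drtg=("DRtg", "sum"),
        titles=("won", "sum"),
        appearances=("played", "sum"),
    )
    # A team is labelled Offensive when its summed ORtg exceeds its summed DRtg
    per_team["style"] = "Defensive"
    per_team.loc[per_team["ortg"] > per_team["drtg"], "style"] = "Offensive"
    per_team["winner"] = per_team["titles"] > 0
    per_team["non_winner"] = (per_team["titles"] == 0) & (per_team["appearances"] > 0)
    counts = per_team.groupby("style")[["winner", "non_winner"]].sum()
    return counts["winner"] / counts["non_winner"]

ratios = title_ratio_by_style(team_seasons)
print(ratios)
```

Because the denominator counts only teams without a title, this value is a ratio rather than a true percentage; dividing by all playoff teams instead would give a share between 0 and 1.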


Guiding Question 4: Do individual accomplishments accumulate towards winning the NBA season?

Continuing with the theme of team success and the factors that affect it, we now look at the individual accomplishments of players who have won one or more of the 5 core NBA awards: Most Valuable Player, Defensive Player of the Year, Most Improved Player, Sixth Man of the Year and Rookie of the Year. The purpose of this study is to understand the importance of having award-winning players on a team and whether these players can lead a team to a successful record or a championship. Our initial inference, based on personal understanding of the game, is that having a top-5 player and award winners on a team leads it deep into the NBA playoffs. For instance, Milwaukee Bucks star Giannis Antetokounmpo has been to the All-Star Game 6 times and has won three of the 5 core awards (MVP twice, MIP and DPOY); moreover, he was the star player of the Bucks when they won the championship in 2021. Is this trend consistent since the inception of the NBA's awards, or was this an isolated incident?

In [None]:
individual_accomplishments = pd.read_sql_query('''
with awards as (
Select
    distinct Lg,
    Tm,
    Team_Name,
    sum(Flag) as total_no_of_awards
from(
        Select
            saw.Lg,
            saw.Player,
            saw.Tm,
            map.Team_Name,
            saw.Award,
            1 as Flag
        from
            statsaward saw
            left join Team_mapping map on saw.Tm = map.Abbreviation
        where
            Lg = "NBA"
    ) a
group by
    1,
    2
)
Select
Awards_category,
count(cleaned_team),
sum(
    case
        when category = "Win" then 1
        else 0
    end
) / sum(
    case
        when category = "Did not Win" then 1
        else 0
    end
) as win_pct
from(
    Select
        *,
        case
            when Winner_flag != 0 then "Win"
            when (
                Winner_flag = 0
                and played_flag != 0
            ) then "Did not Win"
            else "Did not Play"
        end as category,
        case
            when total_no_of_awards > 4 then "High"
            else "Low"
        end as Awards_category
    from
        (
            Select
                distinct cleaned_team,
                sum(Winner_flag) as Winner_flag,
                sum(played_flag) as played_flag
            from(
                    Select
                        *,
                        replace(Team, '*', '') as cleaned_team,
                        case
                            when Playoffs = ("Won Finals") then 1
                            else 0
                        end as Winner_flag,
                        case
                            when Playoffs is not null then 1
                            else 0
                        end as played_flag
                    from
                        teamstats_1
                ) a
            group by
                1
        ) a
        left join awards aw on a.cleaned_team = aw.Team_Name
    where
        total_no_of_awards is not null
) a
group by
1;''',engine)
print(individual_accomplishments)
print("\nFrom the data above, we can see that teams whose players have received more individual awards (more than four in total) have a markedly higher win ratio, measured as title-winning teams relative to playoff teams that never won, than teams with fewer award-winning players")

From the data above, we can see that teams whose players have received more individual awards (more than four in total) have a markedly higher win ratio, measured as title-winning teams relative to playoff teams that never won, than teams with fewer award-winning players
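
To make the win_pct column more concrete, the following minimal sketch repeats the same aggregation in plain Python on a handful of made-up teams. The team names, award counts, and flags are hypothetical placeholders and not values from our database; only the threshold (more than 4 awards is "High") and the Win / Did not Win / Did not Play categories mirror the query. The extra win_share field is our own addition for illustration, showing how a true percentage would differ from the ratio the query returns.

```python
from collections import defaultdict

# Hypothetical rows: (team, total_no_of_awards, titles_won, playoff_appearances).
# These values are invented for illustration only.
teams = [
    ("Team A", 12, 3, 20),
    ("Team B", 7, 0, 15),
    ("Team C", 5, 1, 10),
    ("Team D", 2, 0, 8),
    ("Team E", 1, 0, 0),
    ("Team F", 3, 1, 12),
    ("Team G", 4, 0, 6),
]


def categorize(titles_won, playoff_appearances):
    """Mirror the CASE expression that builds the 'category' column."""
    if titles_won != 0:
        return "Win"
    if playoff_appearances != 0:
        return "Did not Win"
    return "Did not Play"


def summarize(rows):
    """Group teams by awards category and compute the win metrics."""
    counts = defaultdict(lambda: defaultdict(int))
    for _team, awards, titles, played in rows:
        awards_category = "High" if awards > 4 else "Low"
        counts[awards_category][categorize(titles, played)] += 1

    summary = {}
    for awards_category, c in counts.items():
        wins, losses = c["Win"], c["Did not Win"]
        summary[awards_category] = {
            "teams": sum(c.values()),
            # Ratio computed by the query: winners per non-winner.
            "win_ratio": wins / losses if losses else None,
            # Share of playoff teams with at least one title.
            "win_share": wins / (wins + losses) if wins + losses else None,
        }
    return summary


summary = summarize(teams)
for name, stats in sorted(summary.items()):
    print(name, stats)
```

On real data the two metrics rank the categories the same way, but only win_share is bounded between 0 and 1.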

In [None]:
NEED TO EXPLAIN RESULTS IN DETAIL- MRIDUL

Guiding Question 5: Do a player's net statistics affect their in-game performance? Does this performance translate to team success?

In the NBA there is a consistent mix of good and bad teams each season. One common trend is that good players on bad teams often "stat pad": they score, pass, and rebound more, which inflates their statistics beyond their actual value. Teams are often trapped into thinking that a player on a bad team is a star based on these artificial numbers. The purpose of the following analysis is to determine whether a player's effective net statistics are a good determinant of the player's in-game statistics.

In [None]:
points_vs_nrr = pd.read_sql_query('''
Select
distinct a.Player,
avg(a.PTS) as Avg_points,
avg(b.WS) as avg_NRR
from
pergamebasic a
left join (
    Select
        distinct Player as player_b,
        season as season_b,
        WS
    from
        advancedstats
) b on a.Player = b.player_b
and a.Season = b.season_b
group by
1''', engine)

print(points_vs_nrr[["Avg_points", "avg_NRR"]].corr())

import seaborn as sns
sns.scatterplot(x="Avg_points", y="avg_NRR", data=points_vs_nrr);

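As we can observe from the correlation matrix and the scatter plot, there is a positive correlation between a player's average points and the net statistic used here (avg_NRR, which the query computes from the WS column of advancedstats); that is, as a player's points increase, we also observe an increase in this net statistic.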
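
The value printed by `.corr()` is the Pearson coefficient, which is the pandas default. As a minimal sketch of what that number measures, here it is computed by hand in plain Python. The per-player values below are invented for illustration and are not taken from our tables, and the `pearson` helper is a hypothetical name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-player averages (points per game, WS).
avg_points = [4.2, 8.5, 12.1, 17.8, 25.3]
avg_ws = [0.9, 2.4, 3.1, 6.0, 9.7]

r = pearson(avg_points, avg_ws)
print(f"r = {r:.3f}")
```

A coefficient close to +1 means the two columns rise together almost linearly. It does not by itself show that scoring more causes the net statistic to rise, or that a given player is stat padding.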

NEED TO EXPLAIN RESULTS IN DETAIL- MRIDUL

Guiding Question 6: How do injuries affect player contracts?

Injuries and the harsh NBA schedule go hand in hand, and this has been the case since the 82-game season was first set in motion. Many players have been affected by the "injury bug" due to this schedule and have lost a lot of money over the years. Players like Derrick Rose, Greg Oden, Brandon Roy, and Gordon Hayward have become the poster children for injured players in the NBA. Calls to shorten the season have been frequent over the past 10 years, but to no avail. Here we analyze the amount of money players lose due to injuries and whether the NBPA (National Basketball Players Association) has legitimate evidence to support its plea to shorten the season.

In [None]:
injuries = pd.read_sql_query('''
with injuries as (
Select
    distinct trim(Relinquished) as Player,
    count(date) as times_injured
from
    `injuries_2010-2020`
where
    Acquired is null
    and Relinquished is not null
    and Relinquished != ("76ers")
group by
    1
)
Select
distinct Injured_freq_category,
sum(
    cast(
        replace(right(Guaranteed, length(Guaranteed) -1), ',', '') as UNSIGNED
    )
) as contract_value
from(
    Select
        *,
        case
            when times_injured >= 21 THEN "High"
            else "Low"
        END as Injured_freq_category
    From(
            Select
                distinct c.Player,
                c.Guaranteed,
                i.times_injured
            from
                contracts c
                left join injuries i on i.Player = c.Player
            where
                i.Player is not null
        ) a
) b
group by
1;''', engine)
print(injuries)

The data above relates the frequency of injuries to the contracts that current NBA players hold. Players who have been injured more than 20 times over the last 10 years are placed in the HIGH injury category; all others are placed in the LOW category. The guaranteed contract value is then aggregated for each category (the query sums the Guaranteed column) and the two categories are compared, with the results shown above. Players with a high frequency of injuries account for roughly 50% of the contract value of those who tend to remain fit or have fewer injuries. Because the query sums rather than averages, this comparison also depends on how many players fall into each category.
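
The difference between summing and averaging can be illustrated with a small sketch. It parses contract strings the same way the query does (dropping the leading currency symbol and the thousands separators) and then reports both the total and the average per category. The player names and contract figures are invented, and `parse_dollars` is a hypothetical helper, not part of our code base.

```python
from statistics import mean


def parse_dollars(text):
    """Convert a string such as '$1,234,567' to an integer number of dollars."""
    return int(text.lstrip("$").replace(",", ""))


# Hypothetical (player, guaranteed contract, times injured) rows.
rows = [
    ("Player A", "$40,000,000", 25),
    ("Player B", "$10,000,000", 30),
    ("Player C", "$60,000,000", 5),
    ("Player D", "$30,000,000", 12),
    ("Player E", "$30,000,000", 0),
]

# Same threshold as the query: 21 or more injuries is "High".
by_category = {"High": [], "Low": []}
for _player, guaranteed, times_injured in rows:
    category = "High" if times_injured >= 21 else "Low"
    by_category[category].append(parse_dollars(guaranteed))

totals = {cat: sum(vals) for cat, vals in by_category.items()}
averages = {cat: mean(vals) for cat, vals in by_category.items()}
print("total:  ", totals)
print("average:", averages)
```

In this made-up sample the HIGH group holds about 42% of the LOW group's total but about 63% of its average, which shows how much the group sizes can move the comparison.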

Game Simulator

The collaboration of the entire group, from data collection to model building to the web application, comes together in this component of our project. Our game simulator is built as an aesthetically simple UI that allows users to build their own "Dream Team", along the lines of the Dream Team assembled by the USA for the 1992 Olympics. The game allows two or more users to pick a team from a selection of current players with the 2022-2023 salary cap in mind. The teams built by the users can then go head to head against each other through the Play button on the UI. The result of the head-to-head matchup is computed from the models built through linear regression, which provide an aggregated value of the net +/- statistic representing the contribution of each player while they are on the court. This value is computed with a random deviation to introduce randomness into our games, similar to that of a real NBA game.
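
As a rough sketch of the head-to-head step described above, and not our actual simulator code: each player carries a predicted net +/- value (in the project these come from the linear regression models), a team's rating is the sum over its roster, and Gaussian noise is added to each side before the two are compared. The player labels, ratings, noise scale, and function names are all hypothetical, and the salary cap check is left out.

```python
import random


def team_rating(roster, ratings):
    """Sum the predicted net +/- of every player on the roster."""
    return sum(ratings[player] for player in roster)


def simulate_game(team_a, team_b, ratings, noise_sd=5.0, rng=None):
    """Return the simulated margin (team A minus team B) for one game.

    noise_sd is an assumed standard deviation, not a fitted value.
    """
    if rng is None:
        rng = random.Random()
    score_a = team_rating(team_a, ratings) + rng.gauss(0, noise_sd)
    score_b = team_rating(team_b, ratings) + rng.gauss(0, noise_sd)
    return score_a - score_b


def win_probability(team_a, team_b, ratings, n_games=10_000, seed=0):
    """Estimate how often team A beats team B over many simulated games."""
    rng = random.Random(seed)
    wins = sum(
        simulate_game(team_a, team_b, ratings, rng=rng) > 0
        for _ in range(n_games)
    )
    return wins / n_games


# Hypothetical predicted net +/- values per player.
ratings = {"P1": 6.5, "P2": 3.0, "P3": -1.0, "P4": 4.0, "P5": 1.5, "P6": -2.5}
team_a = ["P1", "P2", "P3"]  # combined rating 8.5
team_b = ["P4", "P5", "P6"]  # combined rating 3.0

p = win_probability(team_a, team_b, ratings)
print(f"Team A wins about {p:.0%} of simulated games")
```

With the noise term in place the stronger roster wins most games but not all of them, which is the behaviour the random deviation is meant to produce.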

In [None]:
MODEL WEIGHTS 

Project Overview Diagram

Legend

<img src="Legend.png" width="800" height="400">

Project Hierarchy

<img src="hierarchy.png" width="800" height="400">

MODEL

<img src="model.png" width="800" height="400">

Discussion

In my opinion, the web scraping use case was both challenging and interesting. Working through multiple datasets and seeing real-time results from scraping web elements was very rewarding. Furthermore, seeing the scraped data being used by myself and my teammates was interesting because of the additional data slicing and type processing involved, as well as the need to understand which data sources my teammates required.

From a game simulation point of view, having more time and an understanding of intermediate machine learning concepts would have helped us refine the simulation code and obtain more accurate results. Moreover, the use of linear regression caused minor issues due to the type and magnitude of the data we are using. Based on some brief research on the topic, an algorithm like Random Forest would likely work considerably better than a simple or multiple linear regression model.

With a better understanding of the mistakes made and the lessons learned over the course of the project, areas for improvement include building and refining the web app for the simulation and building a real-time ML algorithm that works accurately with our current data.
-Vardaan Bhatia

In [None]:
GAVIN

In [None]:
PREM

In [None]:
MRIDUL

Conclusion

In [None]:
NEED TO WRITE CONCLUSION... TIRED 

REFERENCES

[1] Association for Professional Basketball Research [WWW Document], n.d. URL https://www.apbr.org/ (accessed 11.2.22).
[2] Basketball Statistics & History of Every Team & NBA and WNBA Players [WWW Document], n.d. Basketball-Reference.com. URL https://www.basketball-reference.com (accessed 11.5.22).
[3] DSG, 2021. The Rise Of Sports Analytics [WWW Document]. URL https://datasportsgroup.com/news-article/74282/the-rise-of-sports-analytics/ (accessed 11.2.22).
[4] Lewis, M. (2004) Moneyball: The Art of Winning an Unfair Game. 1st edition. New York, NY: W. W. Norton & Company.
[5] Official NBA Stats | Stats | NBA.com [WWW Document], n.d. URL https://www.nba.com/stats (accessed 11.5.22).
[6] The Role of Data Science in Sports, 2020. CORP-MIDS1 (MDS). URL https://www.mastersindatascience.org/resources/big-data-in-sports/ (accessed 11.5.22).
[7] Statistical Player Value, SPV - NBAstuffer (2022). Available at: https://www.nbastuffer.com/analytics101/statistical-player-value-spv/ (accessed: 7 November 2022).
[8] www.Researchgate.Net [Online]. Available at: https://www.researchgate.net/publication/332406802_A_systematic_review_of_sports_analytics (Accessed: 7 November 2022).

MORE REFRENCES???