# COGS 108 - Data Checkpoint

# Names

- Alan Xia
- David Lu
- Jin Noh
- Nathan Tewinpagti
- Ricky Wen

# Research Question

Which statistical metrics (e.g., Player Efficiency Rating, box plus-minus, true shooting percentage) best predict a player's career longevity in the NBA over the past 20 seasons, and how does this differ across positions?

## Background and Prior Work

In the fast-paced, high-stakes world of professional basketball, fans, sports analysts, and front offices constantly seek key indicators of a player’s long-term success. Traditional stats like points per game and rebounds have long been used to assess a player’s value; however, advanced metrics such as player efficiency rating, box plus-minus, and true shooting percentage offer deeper insight into a player’s impact on the court and future value to an NBA team. This project aims to discover which advanced statistic is most informative when predicting a player’s career length.

<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Prior work relevant to our research question includes an NBA career longevity prediction project in which the authors applied machine learning models to predict whether players would play at least 5 years in the NBA, given their performance in their rookie season. This project used more than twenty variables to understand what factors allow players to thrive in the NBA, such as BLK (average blocks per game), OREB (average offensive rebounds per game), and TOV (average turnovers per game). By applying models such as SVM and logistic regression to these statistics, the authors concluded that variables such as the number of games played in a season and win percentage tend to contribute more to a team’s results and thus to a player’s long-term career.

<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) In another project, analyst Ian Geertsen sought to compare how people rank the greatest NBA players with how a “machine” would rank them based on stats. Geertsen scraped data for the “people” ranking from 3 groups: media, fans, and experts. For the media and expert portions, Geertsen chose the rankings of 8 reliable media sources and 7 trusted experts. For the fan portion, Geertsen used 7 polls and weighted the results by the number of responses so that differences in sample size would not heavily skew the data.

For the “machine” part, Geertsen used ten metrics to rank every player in order to find the greatest player. Geertsen categorized these statistics into two types: a rate-based category (e.g., BPM, PIPM) and a sum-based category (e.g., Win Shares, RAPTOR JAWS). To evaluate a player’s greatness more fairly, it is important to weigh a player’s efficiency alongside their career totals, since certain players had shorter careers. When comparing the machine ranking with the human ranking, the two lists were largely similar, with similar players in the top 12 of each list except for one outlier, David Robinson. From these results, Geertsen concluded that the statistics he used (VORP, PER, BPM, Win Shares, RAPTOR +/-, RAPTOR JAWS, CORP, WOWYR, POPM +/-, PIPM Wins Added) were good predictors of how great a player was. Since great players tend to have long careers, we can use the statistics that Geertsen used in our project to see how each statistic correlates with the length of an NBA career and how these statistics apply to a larger sample.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) (Feb 2022) NBA Career Longevity Prediction. *The New York Times*. https://achanbour.github.io/nba-project/index.html#i-project-introduction 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Geertsen, Ian. (21 Sep 2020) Man vs Machine: Human and Analytical Evaluations of NBA Greats. *Bruin Sports Analytics*. https://www.bruinsportsanalytics.com/post/man_vs_machine 

# Hypothesis


We predict that players with higher win shares and true shooting percentages (TS%) are more likely to have longer careers. TS% measures a player's scoring efficiency, which is one of the most important factors in team success. We also expect win shares, an estimate of a player's overall contribution to winning games, to correlate strongly with career longevity because teams prioritize players who enhance their chances of winning.
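As a concrete reference for the efficiency metric in this hypothesis, here is a minimal sketch of how TS% is commonly computed from box score totals, using the Basketball-Reference formula TS% = PTS / (2 * (FGA + 0.44 * FTA)). The function name and the sample totals are hypothetical and are not taken from our datasets.

```python
import pandas as pd

def true_shooting_pct(pts, fga, fta):
    """TS% = PTS / (2 * (FGA + 0.44 * FTA)); NaN when a player has no attempts."""
    tsa = fga + 0.44 * fta  # "true shooting attempts"
    return pts / (2 * tsa.where(tsa > 0))  # mask zero attempts to avoid dividing by 0

# Hypothetical season totals for two players (illustrative values only)
totals = pd.DataFrame({"PTS": [1500, 0], "FGA": [1100, 0], "FTA": [400, 0]})
totals["TS%"] = true_shooting_pct(totals["PTS"], totals["FGA"], totals["FTA"])
print(totals)
```

The 0.44 coefficient is the conventional estimate of how many free throw attempts end a possession; a player with no shot attempts gets a missing value rather than a division error.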

# Data

Assuming the dataset were ideal, we would have access to detailed player performance metrics such as points per game, assists per game, rebounds per game, and shooting efficiency. We would also get detailed injury history data for each player, covering everything from minor to severe injuries, including recovery times and any recurrence of those injuries throughout their career. Ideally, this dataset would include all NBA players throughout history, around 4800 in total. The data would be organized in a relational database for easy storage and querying. Player statistics would be sourced from Basketball-Reference, while injury data would be gathered from the official NBA injury reports provided by each team.

The NBA provides detailed reports of all players and their stats from the 2024-2025 season all the way back to the 1946-47 season. For easier data wrangling we can also use the NBA Players Dataset on Kaggle or Basketball-Reference. The NBA Injury Database can provide the injury information. However, this may differ from the ideal dataset, as minor injuries may go unreported and injuries sustained before entering the NBA may be protected by medical confidentiality.

## Data overview

To access the data, please run the following command in a terminal or cell: `pip install nba_api`

- NBA Advanced Stats
  - NBA Advanced Stats 2002-2022
  - https://www.kaggle.com/datasets/owenrocchi/nba-advanced-stats-20022022
  - Number of observations: 12211
  - Number of variables: 28
- NBA Games Data
  - Games Details
  - https://www.kaggle.com/datasets/nathanlauga/nba-games/data?select=games_details.csv
  - Number of observations: 668628
  - Number of variables: 29


The first dataset provides advanced statistical metrics for NBA players from 2002 to 2022, offering detailed insight into player performance beyond basic box score statistics. Some important variables in the dataset are PER (player efficiency rating), TS% (true shooting percentage), and WS (win shares), which are crucial metrics for our research question. We would need to clean the data by dropping unnecessary columns such as "Unnamed: 0", handling any missing values, and splitting the combined "year-name" column into a player-name column, since a "year" column already exists.

The second dataset provides basic box score statistics from 2004 to 2023, giving a general view of specific performances in a player's career. Important variables include FTA (free throws attempted), PF (personal fouls), and the team name; the latter can be used for predicting the player's value to a team. Several columns are unnecessary, such as the team abbreviation and player nickname, so we can simply remove them.

We plan to standardize the second dataset so that we can combine any overlapping variables between both datasets, giving us both basic box scores and advanced statistics with which to predict career longevity. We can also merge the datasets, given the position of each player and the teams they have played for.
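The combination step described above could look roughly like the sketch below, which uses small toy frames rather than the real files. It assumes the per-game data has already been given a `season` column that matches the `year` column of the advanced stats. That column is hypothetical: `games_details.csv` does not ship with one, so it would have to be derived separately. Matching on player names is also an assumption and would mis-join players who share a name.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical values)
advanced = pd.DataFrame({
    "Player Name": ["Ray Allen", "Malik Allen"],
    "year": [2003, 2003],
    "WS": [9.1, 0.9],
})
games = pd.DataFrame({
    "PLAYER_NAME": ["Ray Allen", "Ray Allen", "Tre Jones"],
    "season": [2003, 2003, 2023],  # assumed column, not present in the raw file
    "PTS": [20.0, 30.0, 19.0],
    "FTA": [4.0, 6.0, 2.0],
})

# Collapse per-game rows to one row per player-season
box = (
    games.groupby(["PLAYER_NAME", "season"], as_index=False)[["PTS", "FTA"]]
    .sum()
    .rename(columns={"PLAYER_NAME": "Player Name", "season": "year"})
)

# Left join keeps every advanced-stats row; validate raises on duplicate keys,
# and indicator adds a _merge column showing which rows found a match
merged = advanced.merge(
    box, on=["Player Name", "year"], how="left",
    validate="one_to_one", indicator=True,
)
print(merged)
```

Rows marked `left_only` in `_merge` are player-seasons with no box score match, which is a quick way to spot name or season mismatches before modeling.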

## NBA Advanced Stats

In [37]:
import pandas as pd
import numpy as np

In [38]:
df1 = pd.read_csv("Data/NBA_Advanced_Stats_2002-2022.csv")
df1.shape

(12211, 28)

In [39]:
# take a look at first dataset
df1.head()

,Unnamed: 0,year-name,Pos,Age,Tm,G,MP,PER,TS%,3PAr,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,year
0,0,2003-Tariq Abdul-Wahad,SG,28,DAL,14,204,12.4,0.47,0.017,...,15.0,0.2,0.2,0.4,0.104,-1.6,0.2,-1.4,0.0,2003
1,1,2003-Shareef Abdur-Rahim,PF,26,ATL,81,3087,19.9,0.566,0.051,...,24.2,7.4,2.3,9.7,0.151,2.3,-0.7,1.6,2.8,2003
2,2,2003-Courtney Alexander,PG,25,NOH,66,1360,9.3,0.459,0.113,...,21.3,0.1,1.0,1.1,0.04,-3.3,-1.2,-4.5,-0.9,2003
3,3,2003-Malik Allen,PF,24,MIA,80,2318,9.9,0.455,0.005,...,19.7,-1.7,2.6,0.9,0.018,-3.9,-0.4,-4.4,-1.4,2003
4,4,2003-Ray Allen,SG,27,TOT,76,2880,21.3,0.565,0.391,...,27.8,7.6,1.5,9.1,0.152,4.7,-1.0,3.6,4.1,2003


In [40]:
df1 = df1.drop(columns=["Unnamed: 0"])
# split only on the first hyphen so hyphenated names (e.g. "Abdul-Wahad") stay intact
df1["Player Name"] = df1["year-name"].str.split("-", n=1).str[1]
df1 = df1.drop(columns=["year-name"])
df1.columns

Index(['Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'ORB%',
       'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS',
       'WS', 'WS/48', 'OBPM', 'DBPM', 'BPM', 'VORP', 'year', 'Player Name'],
      dtype='object')
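With a player-name column and a `year` column in place, the career-length outcome from our research question could be derived roughly as follows. This is a sketch on hypothetical rows, not the final method. The column names `first_year`, `last_year`, `seasons_played`, and `censored` are our own, and the 2003 and 2022 bounds are taken from the range of `year` in this dataset. Players whose first or last recorded season sits on a bound may have careers that extend outside the window, so they are flagged rather than treated as complete.

```python
import pandas as pd

# Hypothetical rows in the cleaned layout; a traded player can appear
# more than once in the same year
seasons = pd.DataFrame({
    "Player Name": ["A", "A", "A", "B", "B"],
    "year": [2003, 2004, 2004, 2010, 2012],
})

career = (
    seasons.groupby("Player Name")["year"]
    .agg(first_year="min", last_year="max", seasons_played="nunique")
    .reset_index()
)

# Flag careers that may be cut off by the 2003-2022 data window
career["censored"] = (career["first_year"] == 2003) | (career["last_year"] == 2022)
print(career)
```

Counting distinct years instead of rows keeps mid-season trades from inflating a career. Grouping by name alone would still merge different players who share a name, so a unique player ID would be safer if one is available.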

In [41]:
df1.describe()

,Age,G,MP,PER,TS%,3PAr,FTr,ORB%,DRB%,TRB%,...,USG%,OWS,DWS,WS,WS/48,OBPM,DBPM,BPM,VORP,year
count,12211.0,12211.0,12211.0,12206.0,12143.0,12139.0,12139.0,12206.0,12206.0,12206.0,...,12206.0,12211.0,12211.0,12211.0,12206.0,12211.0,12211.0,12211.0,12211.0,12211.0
mean,26.538777,45.357874,1046.308329,12.445379,0.513641,0.270418,0.287378,5.451262,14.5699,10.010437,...,18.543192,1.110867,1.051151,2.163377,0.067993,-1.751232,-0.234649,-1.986004,0.500925,2013.036852
std,4.202183,26.08499,861.618007,6.655481,0.104659,0.224198,0.23757,5.040813,6.837086,5.128961,...,5.51034,1.872498,1.107955,2.723092,0.106706,9.997313,1.880475,10.4644,1.200409,5.826648
min,18.0,1.0,0.0,-54.4,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-3.3,-0.6,-2.1,-1.312,-1000.0,-31.1,-1000.0,-2.0,2003.0
25%,23.0,22.0,264.5,9.4,0.48,0.029,0.165,2.0,9.9,6.2,...,14.9,0.0,0.2,0.2,0.031,-3.3,-1.1,-3.7,-0.1,2008.0
50%,26.0,48.0,861.0,12.5,0.526,0.269,0.25,3.9,13.5,8.9,...,18.1,0.4,0.7,1.2,0.077,-1.4,-0.2,-1.5,0.0,2013.0
75%,29.0,69.0,1710.5,15.8,0.565,0.4315,0.358,8.1,18.5,13.2,...,21.9,1.7,1.6,3.3,0.118,0.3,0.7,0.4,0.7,2018.0
max,44.0,85.0,3485.0,133.8,1.5,1.0,6.0,100.0,100.0,86.4,...,54.6,14.8,9.1,20.3,2.712,199.4,42.7,242.2,11.8,2022.0


## Games Details

In [42]:
df2 = pd.read_csv("Data/games_details.csv", low_memory=False)
df2.shape

(668628, 29)

In [43]:
# take a look at second dataset
df2.head()

,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22200477,1610612759,SAS,San Antonio,1629641,Romeo Langford,Romeo,F,,18:06,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0
1,22200477,1610612759,SAS,San Antonio,1631110,Jeremy Sochan,Jeremy,F,,31:01,...,6.0,3.0,9.0,6.0,1.0,0.0,2.0,1.0,23.0,-14.0
2,22200477,1610612759,SAS,San Antonio,1627751,Jakob Poeltl,Jakob,C,,21:42,...,1.0,3.0,4.0,1.0,1.0,0.0,2.0,4.0,13.0,-4.0
3,22200477,1610612759,SAS,San Antonio,1630170,Devin Vassell,Devin,G,,30:20,...,0.0,9.0,9.0,5.0,3.0,0.0,2.0,1.0,10.0,-18.0
4,22200477,1610612759,SAS,San Antonio,1630200,Tre Jones,Tre,G,,27:44,...,0.0,2.0,2.0,3.0,0.0,0.0,2.0,2.0,19.0,0.0


In [44]:
df2 = df2.drop(columns=["TEAM_ABBREVIATION", "NICKNAME", "COMMENT"])
df2.columns

Index(['GAME_ID', 'TEAM_ID', 'TEAM_CITY', 'PLAYER_ID', 'PLAYER_NAME',
       'START_POSITION', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A',
       'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL',
       'BLK', 'TO', 'PF', 'PTS', 'PLUS_MINUS'],
      dtype='object')
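The `MIN` column is stored as text such as `18:06`, so it needs converting before it can be summed or averaged. Below is a minimal sketch of one way to do that. The helper name `minutes_played` is hypothetical, and the handling of values without a colon is an assumption about how irregular entries might look, not something verified against the full file.

```python
import pandas as pd

def minutes_played(min_col: pd.Series) -> pd.Series:
    """Convert 'MM:SS' strings to float minutes; unparseable values become NaN."""
    parts = min_col.str.split(":", n=1, expand=True)
    mins = pd.to_numeric(parts[0], errors="coerce")
    if 1 in parts.columns:
        # Missing seconds (e.g. a bare "12") are treated as zero
        secs = pd.to_numeric(parts[1], errors="coerce").fillna(0)
    else:
        secs = 0
    return mins + secs / 60

sample = pd.Series(["18:06", "31:01", None, "12"])
converted = minutes_played(sample)
print(converted)
```

Rows where a player did not appear stay missing rather than becoming zero, which keeps per-game averages from being pulled down by games the player never entered.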

In [45]:
df2.describe()

,GAME_ID,TEAM_ID,PLAYER_ID,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
count,668628.0,668628.0,668628.0,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,...,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,558938.0,535277.0
mean,21717710.0,1610613000.0,401343.4,3.588446,7.896652,0.416842,0.778117,2.186019,0.201032,1.733217,...,1.024212,3.033798,4.05801,2.103958,0.721436,0.460339,1.320297,1.999538,9.688218,-0.000488
std,5656289.0,8.65226,7225618.0,3.030466,5.677002,0.251913,1.227615,2.569913,0.289685,2.353981,...,1.39783,2.687384,3.4825,2.475476,0.972231,0.860962,1.402329,1.502963,8.082152,10.665573
min,10300000.0,1610613000.0,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-57.0
25%,20700030.0,1610613000.0,2466.0,1.0,3.0,0.267,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,-7.0
50%,21200960.0,1610613000.0,201181.0,3.0,7.0,0.429,0.0,1.0,0.0,1.0,...,1.0,2.0,3.0,1.0,0.0,0.0,1.0,2.0,8.0,0.0
75%,21800140.0,1610613000.0,203471.0,5.0,11.0,0.571,1.0,4.0,0.4,3.0,...,2.0,4.0,6.0,3.0,1.0,1.0,2.0,3.0,14.0,6.0
max,52100210.0,1610613000.0,1962938000.0,28.0,50.0,1.0,14.0,24.0,1.0,26.0,...,18.0,25.0,31.0,25.0,10.0,12.0,12.0,15.0,81.0,57.0


# Ethics & Privacy

Although our study relies on publicly available NBA player statistics from the last 20 seasons, privacy concerns may arise when dealing with specific player information, such as injury history and external factors that might influence career longevity. Michael Beasley is a prominent example of this. Although he showed flashes of brilliance early in his career, his lack of professionalism and off-court issues led to a shorter than expected NBA career. While our study mainly focuses on data relating to individual player performance, it is important to note that career longevity is affected by many variables off the court, such as a player’s personal choices as well as outside factors that are out of their control. These external factors, such as lifestyle and personal history, cannot be directly incorporated into the scope of our analysis. With player injury information in particular, it can be difficult to uncover specific details because the injury histories released by teams are relatively vague. It is difficult to discern the impact of these injuries on specific players from online reports alone, which raises the question of whether incorporating injury data is a reasonable choice. Overall, it is important to balance the need for comprehensive data analysis with respect for the individual privacy of each player, especially when dealing with the long-term impact of injuries on a player’s career.

Many statistical metrics used in the NBA like player efficiency rating, box-plus-minus, and fantasy score, are inherently biased towards certain archetypes or playstyles. In particular, offensive contributions are far more quantifiable than defensive contributions. While a great offensive player’s impact is clearly shown in their statline, a strong defensive player’s influence on the game is not as evident from metrics alone. This inherent skew in importance is a direct result of how these statistics are recorded. From a defensive perspective, the number of blocks a defender gets per game is not directly correlated with their overall defensive performance. That is to say, a player can have a low number of blocks recorded but might be a great defender overall. This disparity between offensive and defensive contributions needs to be carefully handled, and requires a delicate balance of which metrics to include in our analysis.

To address privacy issues relating to player data, we will strictly limit our analysis to publicly available statistics and acknowledge the limitations of our study by noting that statistical metrics alone cannot fully explain career longevity. By excluding external factors that are difficult to quantify, such as those relating to a player’s personal life off the court, our analysis becomes more generalizable while also avoiding privacy issues. In the same vein, it is best to exclude injury histories as a variable in our analysis because public injury reports might not cover the full context around a player’s injury. Additionally, teams are known to occasionally list players on the injury report so they can rest during games to prevent injuries, a practice colloquially known as load management. This further supports not directly using injury histories as a metric in our data. We can instead use a related but simpler statistic such as total games missed per season.

During analysis, we can best mitigate bias by incorporating a wide variety of metrics to limit the influence of spurious correlations and confounding variables. Since offensive contributions are often favored in NBA statistics, we should balance our metrics by including a combination of offensive, defensive, and more advanced metrics. Some examples of defensive metrics we can use include steal percentage, defensive win shares, and the number of opponent points scored in the paint. A well-balanced set of metrics reduces bias overall, as it allows for a nuanced and standardized analysis of players with vastly different playstyles and archetypes. We could also implement separate models and use different features for players of different positions, such as placing guards, forwards, and centers in separate groups. Additionally, factors like a player’s role on their team (e.g., first option, role player, bench player), playing time, injuries, and differences between the eras in which players competed could potentially skew our results. To account for this, we can group players by era and team role and, where available, include an injury proxy such as games missed per season as a variable.

Furthermore, we will clearly emphasize that our findings are merely for descriptive and exploratory purposes and should not be used as a strict set of guidelines to adhere to, and that further research should be encouraged to look at a multitude of views. Teams should consider a wide range of factors when deciding on which players to recruit, including many factors that cannot be quantified via our model. It is important to note that our model’s intent is not to be used to predict a player’s future, but rather, to examine preexisting trends based on prior data. Ultimately, this project is one that seeks to understand and uncover past trends, rather than serving as a definitive predictor of player longevity.
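The per-position grouping discussed in this section could be explored descriptively along the lines of the sketch below, which computes a Spearman rank correlation between one metric and career length within each position group. The frame, its values, and the `seasons_played` column are hypothetical; Spearman is our choice for illustration, not a committed method, and it is computed here as a Pearson correlation on ranks so that only pandas is needed.

```python
import pandas as pd

# Hypothetical per-player summary: position group, career-average WS, seasons played
players = pd.DataFrame({
    "Pos": ["G", "G", "G", "G", "C", "C", "C", "C"],
    "WS": [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0],
    "seasons_played": [2, 4, 6, 8, 3, 5, 7, 9],
})

def spearman(group: pd.DataFrame, x: str, y: str) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return group[x].rank().corr(group[y].rank())

# One correlation per position group
by_pos = {
    pos: spearman(group, "WS", "seasons_played")
    for pos, group in players.groupby("Pos")
}
print(by_pos)
```

A rank-based measure is less sensitive to the extreme values that rate statistics can take, and comparing the per-position numbers side by side shows whether a metric relates to longevity differently for guards than for centers.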

# Team Expectations 

* *Our team expects to communicate respectfully with one another throughout the project’s timeline*
* *Our team expects each member to complete their responsibilities by each deadline*
* *Our team expects to contribute equally and collaborate by providing feedback and constructive criticism as well as by communicating our findings*
* *Our team expects to prepare for and attend group meetings to check on each other and cover specific issues in case of conflicts*
* *Should conflicts arise, we expect to cooperate by sharing our viewpoints and then finding a conclusion as a team*
* *Alan, David, Jin, Nathan, and Ricky will perform at their best and approach each project component with utmost diligence and contribution*

# Project Timeline Proposal




| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/16  |  8:30 PM | Take a look at project proposal feedback  | Discuss changes and new info about checkpoint #1 | 
| 2/19  |  8:30 PM | Background research on topic data | Discuss dataset(s); draft checkpoint #1 | 
| 2/22  |  8:30 PM | Edit checkpoint #1; Look at datasets in detail  | Discuss possible wrangling approaches |
| 2/23  |  8:30 PM | Import Data | Discuss Wrangling and Analysis Plan; Assign group members to lead each specific part |
| 3/1   |  8:30 PM  | Finalize wrangling/EDA; Begin Analysis | Discuss/edit Analysis; Complete project check-in |
| 3/13  |  8:30 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/20  |  Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |