===
NBA 5-Man Lineup Analysis Worksheet
===

This notebook will guide you through:
1. Pulling 5-man lineup data from the nba_api.

2. Identifying lineups that have varying numbers of good 3-point shooters.

3. Performing a simple regression to see if there's a correlation with Offensive Rating.

4. (Optional) Clustering lineups based on 3PT shooting and Offensive Rating.

Feel free to expand, modify, and explore further!
The comments and instructions below will help you fill in the steps.

-------------------------
Step 0: Install Dependencies
-------------------------
If you haven't installed nba_api or scikit-learn in your environment, uncomment and run:


In [None]:
#!pip install nba_api
#!pip install scikit-learn

In [None]:
import pandas as pd
import numpy as np

#nba_api endpoints:
from nba_api.stats.endpoints import leaguedashlineups, playercareerstats
from nba_api.stats.static import players, teams

#For regression/clustering:
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

-------------------------
Step 1: Pull 5-Man Lineup Data
-------------------------
We'll call the LeagueDashLineups endpoint, which returns data for 5-man groupings.
By default it returns the "Base" measure type.
For advanced stats (including Offensive Rating), pass measure_type_detailed_defense="Advanced".

Try changing parameters such as season or season_type_all_star below.

In [None]:
five_man_lineups = leaguedashlineups.LeagueDashLineups(
    season='2023-24',
    season_type_all_star='Regular Season',  # or 'Playoffs'
    group_quantity=5,  # change this to 3 for 3-man lineups
    measure_type_detailed_defense='Advanced',  # needed for OFF_RATING in Step 3
    per_mode_detailed='Totals',  # or 'PerGame', 'Per100Possessions', etc.
    pace_adjust='N'  # or 'Y'
)

#Convert the data into a DataFrame
df_lineups = five_man_lineups.get_data_frames()[0]

#Let's peek at the columns we have
print("Columns:", df_lineups.columns.tolist())
df_lineups.head()

-------------------------
Step 1.5: Parse 5-Man Lineup player IDs
-------------------------

Notice that each row's GROUP_ID looks like "-1626157-1628384-1628404-1628969-1628973-",
a string of five player IDs separated by hyphens.
We'll strip the leading/trailing hyphens, split on '-', and get a list of player IDs.


In [None]:
df_lineups["PLAYER_IDS"] = df_lineups["GROUP_ID"].apply(
    lambda g: [x for x in g.strip("-").split("-") if x]
)

# Now df_lineups["PLAYER_IDS"] is a list of the 5 player IDs for each lineup.
df_lineups[["GROUP_ID", "PLAYER_IDS"]].head()
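
As a quick sanity check on the parsing rule above, here is a minimal standalone sketch. The helper name parse_group_id is hypothetical, and converting to int assumes every ID is purely numeric, as in the sample string.

```python
def parse_group_id(group_id):
    """Split a GROUP_ID string such as "-1-2-3-4-5-" into a list of int player IDs."""
    # Assumption: every ID between the hyphens is purely numeric.
    return [int(part) for part in group_id.strip("-").split("-") if part]

sample_group_id = "-1626157-1628384-1628404-1628969-1628973-"
sample_ids = parse_group_id(sample_group_id)

# A 5-man lineup should yield exactly five distinct IDs.
is_valid_lineup = len(sample_ids) == 5 and len(set(sample_ids)) == 5
print(sample_ids, is_valid_lineup)
```

Integer IDs can be handy if you later match against tables that store player IDs as numbers; keep them as strings if that is what the rest of your notebook expects.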

-------------------------
Step 2: Filter / Identify 3PT Shooters in Each Lineup
-------------------------
The "GROUP_NAME" column in df_lineups might contain the 5 player names,
but you also have individual player stats endpoints in nba_api.

One approach:
 - Parse the player names from "GROUP_NAME" or "GROUP_ID" (whichever is more convenient).
 - For each player, pull career or season 3PT% using the PlayerCareerStats endpoint or LeagueDashPlayerStats.
 - Decide on your threshold for "good" 3PT shooting (e.g., 35% or 37%).
 - Count how many "good" shooters a given lineup has.

This step can involve a few sub-queries. The cells below sketch one way to do it; adapt them as needed.



In [None]:
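### 2.1 Look up how well each player shoots from three
good_shooter_threshold = 0.36

def fetch_player_3pt_percentage(player_id, season='2023-24'):
    # 1) player_id comes from the PLAYER_IDS list built in Step 1.5.
    # 2) Pull the player's regular-season totals (one row per season and team).
    career = playercareerstats.PlayerCareerStats(player_id=player_id).get_data_frames()[0]
    # 3) Keep only the rows for the season you want.
    season_rows = career[career['SEASON_ID'] == season]
    if season_rows.empty:
        return None
    # 4) Return that 3P% as a float. A traded player has one row per team plus a
    #    combined row; the row with the most games played is the combined one.
    pct = season_rows.loc[season_rows['GP'].idxmax(), 'FG3_PCT']
    return float(pct) if pd.notna(pct) else None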


In [None]:
### 2.2 Count the good shooters in each lineup
# Cache each player's 3P% so the API is queried once per player, not once per lineup.
three_pt_cache = {}

def count_good_shooters(player_ids):
    count = 0
    for player_id in player_ids:
        if player_id not in three_pt_cache:
            three_pt_cache[player_id] = fetch_player_3pt_percentage(player_id)
        three_pt_percent = three_pt_cache[player_id]
        # Players without a usable 3P% (None) are not counted.
        if three_pt_percent is not None and three_pt_percent > good_shooter_threshold:
            count += 1
    return count

# Uses the PLAYER_IDS column from Step 1.5, since the lookup function expects IDs, not names.
df_lineups['Good3ptShootersCount'] = df_lineups['PLAYER_IDS'].apply(count_good_shooters)
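
The approach above also mentions LeagueDashPlayerStats, which returns stats for many players at once. Assuming you have reduced such a table to a lookup of makes and attempts per player, the counting step might look like the sketch below. The lookup contents, the helper names, and the min_attempts cutoff are all made up for illustration; none of them come from nba_api.

```python
# Hypothetical season stats keyed by player ID: (3-pointers made, 3-pointers attempted).
# The IDs and numbers are invented for illustration.
player_threes = {
    "101": (150, 380),  # 39.5% on high volume
    "102": (4, 8),      # 50.0%, but only 8 attempts
    "103": (90, 270),   # 33.3%
    "104": (120, 320),  # 37.5%
    "105": (0, 0),      # no attempts
}

def is_qualified_shooter(player_id, threshold=0.36, min_attempts=100):
    """True if the player clears both the accuracy threshold and the volume cutoff."""
    made, attempted = player_threes.get(player_id, (0, 0))
    if attempted == 0 or attempted < min_attempts:
        return False
    return made / attempted > threshold

def count_qualified_shooters(player_ids, threshold=0.36, min_attempts=100):
    """Count the players in a lineup who qualify as good shooters."""
    return sum(is_qualified_shooter(pid, threshold, min_attempts) for pid in player_ids)

lineup = ["101", "102", "103", "104", "105"]
print(count_qualified_shooters(lineup))                  # 2
print(count_qualified_shooters(lineup, min_attempts=0))  # 3
```

The attempts cutoff is a design choice, not a rule: without it, a player who went 4-for-8 counts as a 50% shooter, which is probably not what "good shooter" is meant to capture.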

-------------------------
Step 3: Explore Relationship: 3PT Shooting Count vs. Offensive Rating
-------------------------
Offensive Rating appears in the "OFF_RATING" column only when the lineup data is pulled with the Advanced measure type; Base data does not include it.
We can do a quick regression or correlation check.

Let's assume you've filled 'Good3ptShootersCount' with real data from Step 2.



In [None]:
# 3.1 - Basic correlation
correlation = df_lineups[['Good3ptShootersCount', 'OFF_RATING']].corr()
print(correlation)

In [None]:
# 3.2 - Simple linear regression
X = df_lineups[['Good3ptShootersCount']]  # predictor
y = df_lineups['OFF_RATING']              # outcome

reg = LinearRegression()
reg.fit(X, y)

print("Intercept:", reg.intercept_)
print("Coefficient:", reg.coef_)
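
To help interpret the intercept and coefficient printed above, here is a hand-rolled ordinary least squares fit on a tiny made-up dataset. It is only a sketch of the arithmetic behind a one-predictor linear regression, not the notebook's actual data, and fit_line is a hypothetical helper name.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Made-up example: number of good shooters vs. offensive rating for five lineups.
shooters = [0, 1, 2, 3, 4]
ratings = [106.0, 109.0, 110.0, 113.0, 112.0]

slope, intercept, r_squared = fit_line(shooters, ratings)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r_squared={r_squared:.3f}")
```

With these invented numbers the slope is 1.6, meaning each additional good shooter is associated with about 1.6 more points of offensive rating, and the intercept (106.8) is the predicted rating for a lineup with zero good shooters. A positive slope on real data would still be an association, not proof that shooters cause the higher rating.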

-------------------------
Step 4: Optional Clustering
-------------------------
You could also see how lineups cluster based on OFF_RATING and number of good 3PT shooters.
We'll outline a quick approach with KMeans. Try different numbers of clusters (n_clusters).

In [None]:
### 4.1 Divide the lineups into groups (clusters) based on offensive efficiency and number of good 3-point shooters
kmeans = KMeans(n_clusters=3, random_state=42)

df_lineups['cluster_label'] = kmeans.fit_predict(df_lineups[['Good3ptShootersCount', 'OFF_RATING']])

In [None]:
### 4.2 Then you can explore how lineups are grouped:

df_lineups.groupby('cluster_label')[['Good3ptShootersCount','OFF_RATING']].mean()
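
One thing to watch for: KMeans groups points by Euclidean distance, so a feature with a wider numeric spread (here, most likely OFF_RATING) can dominate the clusters. A common remedy is to standardize each feature first. The sketch below shows the arithmetic on made-up values using only the standard library; in the notebook itself, scikit-learn's StandardScaler would be one way to do the same thing.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # assumes the feature is not constant (sigma > 0)
    return [(v - mu) / sigma for v in values]

# Made-up feature values for five lineups.
shooter_counts = [0, 1, 2, 3, 4]
off_ratings = [102.0, 108.0, 111.0, 115.0, 119.0]

# Before scaling, the two features have quite different spreads.
print(pstdev(shooter_counts), pstdev(off_ratings))

z_counts = standardize(shooter_counts)
z_ratings = standardize(off_ratings)

# After scaling, both features contribute on a comparable footing.
print([round(z, 2) for z in z_counts])
print([round(z, 2) for z in z_ratings])
```

Whether to scale is a judgment call: if you want offensive rating to drive the clusters, leaving the features unscaled may be what you intend.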

-------------------------
Step 5: Interpret & Extend
-------------------------
1. Interpret the correlation/regression results: is there a positive relationship?

2. Experiment with your threshold for what "good" shooting means (for example 38% or 40%), or measure shooting by 3-point attempts per minute rather than accuracy alone.

3. Try pulling data for multiple seasons or focusing on only specific teams.

4. Extend your analysis with other advanced stats, such as eFG%, TS%, or scoring efficiency.

5. Try a different clustering approach or add more features into your model.
