# NBA Players: Data Cleaning

## Data Fields

### Player Identification
- **player_name** — Player full name
- **team_abbreviation** — Team played for during the season
- **season** — NBA season of record

### Player Background
- **age** — Player age during the season
- **player_height** — Player height (cm)
- **player_weight** — Player weight (kg)
- **college** — College attended
- **country** — Country of origin

### Draft Information
- **draft_year** — Year drafted
- **draft_round** — Draft round selected
- **draft_number** — Overall draft pick number

### Playing Time
- **gp** — Games played

### Performance Metrics
- **pts** — Points per game
- **reb** — Rebounds per game
- **ast** — Assists per game

### Advanced Metrics
- **net_rating** — Team point differential per 100 possessions while the player is on the court
- **oreb_pct** — Offensive rebound percentage
- **dreb_pct** — Defensive rebound percentage
- **usg_pct** — Usage percentage
- **ts_pct** — True shooting percentage
- **ast_pct** — Assist percentage

In [12]:
import pandas as pd

df = pd.read_csv('raw-data/players_data.csv')

df.info()  # Inspect column dtypes and non-null counts before cleaning
print(f"\nTotal duplicated records: {df.duplicated().sum()}")  # Count duplicate rows; none expected in this dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12844 entries, 0 to 12843
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         12844 non-null  int64  
 1   player_name        12844 non-null  object 
 2   team_abbreviation  12844 non-null  object 
 3   age                12844 non-null  float64
 4   player_height      12844 non-null  float64
 5   player_weight      12844 non-null  float64
 6   college            10990 non-null  object 
 7   country            12844 non-null  object 
 8   draft_year         12844 non-null  object 
 9   draft_round        12844 non-null  object 
 10  draft_number       12844 non-null  object 
 11  gp                 12844 non-null  int64  
 12  pts                12844 non-null  float64
 13  reb                12844 non-null  float64
 14  ast                12844 non-null  float64
 15  net_rating         12844 non-null  float64
 16  oreb_pct           128

In [13]:
# Drop the redundant index column
df.drop(columns=['Unnamed: 0'], inplace=True)
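
One caveat about the duplicate count in the first cell: `DataFrame.duplicated()` compares entire rows, so while an index-like column such as `Unnamed: 0` is still present, two otherwise identical records would not be flagged. The sketch below illustrates this on a toy frame. The values are made up, and it assumes `Unnamed: 0` is a unique row counter, which the source does not confirm.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 differ only in the index-like column
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "player_name": ["A. Player", "A. Player", "B. Player"],
    "season": ["1996-97", "1996-97", "1996-97"],
    "pts": [10.1, 10.1, 4.3],
})

# Full-row comparison: the unique counter hides the repeated record
dupes_with_index = int(toy.duplicated().sum())

# Same check after removing the counter column
dupes_without_index = int(toy.drop(columns=["Unnamed: 0"]).duplicated().sum())

print(dupes_with_index, dupes_without_index)
```

If that assumption holds for the real file, re-running the duplicate check after this cell would be the more informative test.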

In [14]:
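# Cast the draft columns (stored as text) plus age and gp to pandas' nullable
# integer type; entries that cannot be parsed as numbers become <NA>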
cols = ["age", "draft_year", "draft_round", "draft_number", "gp"]

for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")
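
To make the effect of `errors="coerce"` concrete, here is a minimal sketch on a toy series. The `"Undrafted"` label is an assumption about how the raw file marks missing draft data; the source only shows that `draft_year` goes from 12844 to 10486 non-null values after the conversion.

```python
import pandas as pd

# Hypothetical raw values; the real file may mark undrafted players differently
raw = pd.Series(["1996", "Undrafted", "2003"])

# Unparseable strings become NaN, then the nullable Int64 dtype stores them as <NA>
converted = pd.to_numeric(raw, errors="coerce").astype("Int64")
n_missing = int(converted.isna().sum())

print(converted)
print(f"Values coerced to missing: {n_missing}")
```

Comparing `isna().sum()` before and after the cast is a quick way to see how many values were silently turned into missing data.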

In [15]:
# Replace missing college values to group international / non-NCAA players
df["college_clean"] = df["college"].fillna("International")

# Standardize country names
df['country'] = df['country'].str.strip().replace({
    'US Virgin Islands': 'USA',
    'U.S. Virgin Islands': 'USA'
})

# Create a draft status flag (True = drafted, False = undrafted)
df["draft_status"] = df["draft_year"].notna()

# Parse the starting year (first four characters of "season") into a new datetime column
df["season_start"] = pd.to_datetime(df["season"].str[:4], format="%Y")

# Rename the player and team columns to shorter names
df = df.rename(columns={"player_name": "name",
                        "team_abbreviation": "team",
                        "player_height": "height",
                        "player_weight": "weight"})

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12844 entries, 0 to 12843
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   name           12844 non-null  object        
 1   team           12844 non-null  object        
 2   age            12844 non-null  Int64         
 3   height         12844 non-null  float64       
 4   weight         12844 non-null  float64       
 5   college        10990 non-null  object        
 6   country        12844 non-null  object        
 7   draft_year     10486 non-null  Int64         
 8   draft_round    10433 non-null  Int64         
 9   draft_number   10430 non-null  Int64         
 10  gp             12844 non-null  Int64         
 11  pts            12844 non-null  float64       
 12  reb            12844 non-null  float64       
 13  ast            12844 non-null  float64       
 14  net_rating     12844 non-null  float64       
 15  oreb_pct       1284
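
As an illustration of the `season_start` step above, the sketch below parses two toy season labels. The `"1996-97"` label format is an assumption, since the source does not show any `season` values.

```python
import pandas as pd

# Assumed label format: four-digit start year, a hyphen, two-digit end year
seasons = pd.Series(["1996-97", "2019-20"])

# Keep the first four characters and parse them as a year
season_start = pd.to_datetime(seasons.str[:4], format="%Y")
start_year = season_start.dt.year

print(season_start)
```

With `format="%Y"` each timestamp falls on January 1 of the starting year rather than on the actual start of the season, so the column is best read as a year marker. If only the year is needed, an integer column derived from the same slice may be simpler.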

In [16]:
# Reorder the columns into a logical sequence for readability
new_order = [
    "name", "team", "age", "height", "weight",
    "college", "college_clean", "country",
    "draft_status", "draft_year", "draft_round", "draft_number",
    "gp", "pts", "reb", "ast",
    "net_rating", "oreb_pct", "dreb_pct", "usg_pct", "ts_pct", "ast_pct",
    "season", "season_start",
]

df = df[new_order]

# Summary of the final dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12844 entries, 0 to 12843
Data columns (total 24 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   name           12844 non-null  object        
 1   team           12844 non-null  object        
 2   age            12844 non-null  Int64         
 3   height         12844 non-null  float64       
 4   weight         12844 non-null  float64       
 5   college        10990 non-null  object        
 6   college_clean  12844 non-null  object        
 7   country        12844 non-null  object        
 8   draft_status   12844 non-null  bool          
 9   draft_year     10486 non-null  Int64         
 10  draft_round    10433 non-null  Int64         
 11  draft_number   10430 non-null  Int64         
 12  gp             12844 non-null  Int64         
 13  pts            12844 non-null  float64       
 14  reb            12844 non-null  float64       
 15  ast            1284

In [17]:
# Export the cleaned dataset
df.to_csv('players_data.csv', index=False)
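
CSV files do not store dtypes, so the nullable integer and datetime columns created above need to be restored when the exported file is loaded again. The sketch below shows one way to do that with `read_csv`, using an in-memory buffer and a toy frame in place of the real file.

```python
import io
import pandas as pd

# Toy frame (hypothetical values) with the two dtypes that CSV cannot represent
toy = pd.DataFrame({
    "draft_year": pd.array([1996, None], dtype="Int64"),
    "season_start": pd.to_datetime(["1996-01-01", "1997-01-01"]),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the nullable integer comes back as float64, the date as plain text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Restore the dtypes explicitly on load
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"draft_year": "Int64"},
    parse_dates=["season_start"],
)

print(plain.dtypes)
print(restored.dtypes)
```

The same `dtype` and `parse_dates` arguments would apply when reading `players_data.csv`, extended to the other `Int64` columns.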