# COMS4060A/7056A: Assignment # 2

## Question 1: Data Cleaning [5 marks]

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("nba_2022-23_stats.csv")

# Basic overview
print(df.shape)
display(df.head())  # only a cell's last expression is shown automatically
df.info()
df.describe()

1. Data dimensions: 467 rows by 52 columns.
2. Columns with missing values:
    - FG%: 1 missing value
    - 3P%: 13 missing values
    - 2P%: 4 missing values
    - eFG%: 1 missing value
    - FT%: 23 missing values

    These gaps most likely come from players who took no shots of that type (e.g., no 3-point attempts), so the percentage cannot be computed. They are logical missing values rather than errors, and can either be filled with 0 or left as NaN.
3. Data types:
    - 42 float columns: continuous numerical values (suitable for PCA or other dimensionality reduction later).
    - 6 integer columns (likely counts such as GP, GS, Age).
    - 4 object columns (categorical or text): Player Name, Position, Team, and 3P. The 3P column should be numeric and needs to be cleaned.

The NBA 2022–23 dataset consists of 467 player entries and 52 attributes, including both basic and advanced statistics. The dataset is largely complete, with missing values only in the shooting percentage columns (FG%, 3P%, 2P%, eFG%, FT%). These correspond to players who recorded no attempts in those categories, so they will be imputed with zeros. The 3P column is stored as type “object” rather than numeric, likely because of non-numeric symbols, and will be converted to numeric values. An unnecessary column (Unnamed: 0) will be dropped. No major inconsistencies are apparent, and the data appears suitable for further analysis once these basic cleaning steps are applied.
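
The two conversion steps described above could look roughly like the sketch below. It runs on a made-up three-row frame rather than the real CSV, and the stray `*` and `n/a` entries in the toy `3P` column are assumptions standing in for whatever non-numeric symbols the real column contains.

```python
import pandas as pd

# Toy stand-in for the real data (values are invented for illustration)
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Player Name": ["A", "B", "C"],
    "3P": ["1.5", "2.0*", "n/a"],
})

# Drop the unnecessary column; errors="ignore" keeps this safe if it is absent
toy = toy.drop(columns=["Unnamed: 0"], errors="ignore")

# Strip anything that is not a digit or decimal point, then coerce to float.
# Entries that still cannot be parsed become NaN instead of raising.
cleaned = toy["3P"].astype(str).str.replace(r"[^0-9.]", "", regex=True)
toy["3P"] = pd.to_numeric(cleaned, errors="coerce")

print(toy.dtypes)
print(toy)
```

Inspecting the raw values first (for example with `df["3P"].unique()`) is a sensible check before choosing which characters to strip.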

In [None]:
# Check for missing values
df.isnull().sum()

| Column | Missing Count | % Missing | Likely Cause | Suggested Action |
|:--------|:---------------:|:------------:|:--------------|:----------------|
| FG% | 1 | ~0.2% | Player may have taken 0 field goal attempts → undefined FG% | Fill with 0 or leave as NaN (represents “no attempts”) |
| 3P% | 13 | ~2.8% | Players who took no 3-point shots → undefined percentage | Fill with 0 |
| 2P% | 4 | ~0.9% | Players who took no 2-point shots → undefined percentage | Fill with 0 |
| eFG% | 1 | ~0.2% | Computed from FG, 3P, and FGA; likely missing for the same zero-attempt player as FG% | Fill with 0 |
| FT% | 23 | ~4.9% | Players who took no free throws → undefined percentage | Fill with 0 |

All missing values occur in percentage columns — not in raw attempt or made columns.
This means they aren’t errors but undefined statistics due to zero attempts.
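
A summary like the table above can be produced directly from the frame. The helper below is a small sketch; `missing_summary` is a hypothetical name, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and percentage for columns that have any gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame with invented values
toy = pd.DataFrame({
    "FGA": [10, 0, 5, 8],
    "FG%": [0.5, np.nan, 0.4, 0.25],
    "FT%": [np.nan, np.nan, 0.8, 0.9],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, `missing_summary(df)` would be expected to list only the percentage columns.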

#### Why These Missing Values Exist

In basketball statistics, percentages like **FG% (Field Goal Percentage)** are computed as:

\[
FG\% = \frac{FG}{FGA}
\]

If a player’s **FGA (Field Goal Attempts)** = 0, the percentage is mathematically undefined, so it appears as a missing value in the dataset.

This logic also applies to:

- **3P%** → No 3-point attempts  
- **2P%** → No 2-point attempts  
- **FT%** → No free throw attempts  
- **eFG%** → No field goal attempts, since eFG% = (FG + 0.5 × 3P) / FGA

Hence, these missing values reflect **real-world meaning**, not data errors.
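
The zero-attempt explanation is an assumption until it is checked against the attempt columns. A minimal sketch of that check, on invented rows and assuming the dataset has `FG` and `FGA` columns as in the formula above:

```python
import numpy as np
import pandas as pd

# Invented rows: player "B" attempted no field goals, so FG% is undefined
toy = pd.DataFrame({
    "Player Name": ["A", "B", "C"],
    "FG": [4.0, 0.0, 2.0],
    "FGA": [8.0, 0.0, 5.0],
    "FG%": [0.5, np.nan, 0.4],
})

# Every row with a missing percentage should have zero attempts
missing_rows = toy[toy["FG%"].isna()]
all_zero_attempts = bool((missing_rows["FGA"] == 0).all())

print(missing_rows[["Player Name", "FGA"]])
print("All missing FG% rows have FGA == 0:", all_zero_attempts)
```

If the check came back False on the real data, the missing value would need a different explanation than "no attempts".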

#### Recommended Handling Strategy

You can justify the following approach in your report:

| Situation | Handling Method | Justification |
|:-----------|:----------------|:---------------|
| Missing due to no attempts | Fill with 0 | Represents no success rate (since player did not attempt any shots) |
| Very few missing (≤ 1%) | Fill with mean/median (optional) | If the feature is essential and few values are missing |
| Non-shooting features | Keep as is | All complete |



After checking for missing values, we found that only five percentage-based columns contained missing data: FG%, 3P%, 2P%, eFG%, and FT%. These missing values represent players who recorded zero attempts in the respective shooting categories, making the percentage calculation undefined. As such, these missing entries were imputed with zeros to indicate that no successful attempts were made. All other attributes in the dataset were complete, so no additional imputation or removal was necessary.
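
The zero imputation described here is not shown in any cell, so the following is one possible sketch. It uses the original column names (before any renaming step) and invented values.

```python
import numpy as np
import pandas as pd

pct_cols = ["FG%", "3P%", "2P%", "eFG%", "FT%"]

# Invented values; NaN marks a player with no attempts of that type
toy = pd.DataFrame({
    "FG%": [0.45, np.nan],
    "3P%": [np.nan, 0.33],
    "2P%": [0.50, 0.48],
    "eFG%": [0.52, np.nan],
    "FT%": [np.nan, 0.80],
    "PTS": [10.2, 3.1],
})

# Fill only the percentage columns, leaving every other column untouched
toy[pct_cols] = toy[pct_cols].fillna(0)
print(toy.isnull().sum().sum())
```

One trade-off worth noting: after filling, a player with no attempts is indistinguishable from a player who missed every attempt.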

In [None]:
# Check for duplicate rows
df.duplicated().sum()

In [None]:
# Remove duplicates
df = df.drop_duplicates()

df.duplicated() returns a Boolean Series (True/False for each row):

- True → the row is a duplicate of an earlier row

- False → the row is unique or is the first occurrence

- .sum() adds up all the True values (since True = 1), giving the total count of duplicate rows.

df.drop_duplicates() removes the duplicates, and the result is assigned back to df. It:

- Keeps only the first occurrence of each row.

- Removes all subsequent rows that have identical values in every column.

- Returns a cleaned DataFrame without redundancy.
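
The behaviour described above can be seen on a tiny invented frame in which the third row repeats the first exactly:

```python
import pandas as pd

# Invented rows; the third row repeats the first exactly
toy = pd.DataFrame({
    "Player Name": ["A", "B", "A"],
    "Team": ["X", "Y", "X"],
    "PTS": [10.0, 7.5, 10.0],
})

flags = toy.duplicated()          # False, False, True
n_duplicates = int(flags.sum())   # True counts as 1
deduped = toy.drop_duplicates()   # keep="first" is the default

print(flags.tolist())
print(n_duplicates, len(deduped))
```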

#### Importance:
- Duplicate rows can bias statistical summaries (e.g., averages, variances).

- They can distort PCA results and skew correlations, since repeated entries artificially increase the weight of some players.

- In sports data, duplicates may appear if:

    - The dataset merges multiple sources.

    - Players are listed twice (e.g., traded mid-season without unique ID adjustment).

    - Export or scraping errors occurred.
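
A traded player listed once per team would not be caught by the full-row check, because the rows differ in at least one column. A sketch of a name-based check, on invented rows and assuming the Player Name and Team columns noted earlier:

```python
import pandas as pd

# Invented rows: player "A" appears once per team after a mid-season trade
toy = pd.DataFrame({
    "Player Name": ["A", "A", "B"],
    "Team": ["X", "Y", "Z"],
    "GP": [30, 25, 60],
})

exact = int(toy.duplicated().sum())                           # identical rows only
by_name = int(toy.duplicated(subset=["Player Name"]).sum())   # same player, any team

# keep=False marks every row of a repeated player, which makes them easy to inspect
repeated = toy[toy.duplicated(subset=["Player Name"], keep=False)]

print(exact, by_name)
print(repeated)
```

Whether such rows should be merged, kept, or dropped is a modelling decision; this only surfaces them.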

To ensure data integrity, we checked for duplicate entries using df.duplicated().sum(). The result indicated that [insert number] duplicate rows were present. These duplicates were removed using df.drop_duplicates() to avoid bias in subsequent analysis. In this dataset, duplicates likely represent repeated player records or merge artifacts from multiple data sources.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Outlier detection
sns.boxplot(x=df["PER"])
plt.show()


- PER = Player Efficiency Rating, a key basketball metric that summarizes a player’s statistical performance per minute.

- The sns.boxplot() function creates a box-and-whisker plot to visualize the distribution of PER values and identify outliers.

- Most players have a PER between 10 and 20, which is average to above-average performance.

- The median (middle line in the box) is around 15.

- A few outliers appear on the right side, with PER values above 30. These are elite players such as Nikola Jokić, Giannis Antetokounmpo, Luka Dončić, or Joel Embiid, who often have extremely high PERs.

- A few low outliers (PER < 5) also appear, representing players who played few minutes or performed poorly.


Outliers here are not errors — they represent real-world variation between average and superstar players.
Thus, you should not remove them, because:

- They are genuine data points.

- They reflect meaningful performance differences, which could be informative for dimensionality reduction or PCA later.

A boxplot of the Player Efficiency Rating (PER) was generated to identify potential outliers. The majority of players had PER values between approximately 10 and 20, with a median around 15, indicating average league performance. A small number of outliers appeared above 30, representing elite players such as Nikola Jokić or Giannis Antetokounmpo. Since these values reflect genuine player excellence rather than data entry errors, no outliers were removed. These high PER values are consistent with realistic NBA performance distributions.

Alternatively, outliers can be flagged numerically with the Z-score or IQR method.

Discussion:

Identify players with unusually high or low values (e.g., PER > 35 or < 5).

Decide whether to keep (if they are genuine elite players) or remove (if they are clear data errors).
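
Both numerical rules can be sketched in a few lines. The PER values below are invented for illustration, and the Z-score threshold of 2 is an assumption chosen for this tiny sample (3 is the more common choice on larger data).

```python
import numpy as np
import pandas as pd

# Invented PER values for illustration only
per = pd.Series([12.0, 14.5, 15.0, 16.2, 13.8, 17.1, 15.5, 31.5, 2.0], name="PER")

# IQR rule: the same 1.5 * IQR fences that a boxplot uses for its whiskers
q1, q3 = per.quantile(0.25), per.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = per[(per < lower) | (per > upper)]

# Z-score rule: distance from the mean in standard deviations
z = (per - per.mean()) / per.std(ddof=0)
z_outliers = per[z.abs() > 2]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```

The two rules need not agree: the Z-score uses the mean and standard deviation, which are themselves pulled by extreme values, so it tends to flag fewer points on small samples.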

In [None]:
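# Clean column names and drop unnecessary features
df.columns = df.columns.str.strip().str.replace('%', 'Percent', regex=False).str.replace('/', '_', regex=False)
df = df.drop(columns=['Unnamed: 0'], errors='ignore')  # unnecessary column noted in the overview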

In [None]:
# Final check
df.info()
df.describe()

- .str.strip() → Removes any leading or trailing spaces from column names (e.g., " FG% " → "FG%").

- .str.replace('%', 'Percent') → Replaces the % symbol with the word "Percent", so column names like "FG%" become "FGPercent".

- .str.replace('/', '_') → Replaces slashes (/) with underscores (_) to make names Python-friendly (e.g., "2P/3P" → "2P_3P" if such existed).

This step:

- Makes column names consistent and valid for Python access.

- Prevents errors when referring to columns (e.g., df.FGPercent instead of df['FG%']). Names that start with a digit, such as 3PPercent, still need bracket access: df['3PPercent'].

- Improves readability and standardization, especially when exporting or performing model training.
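
The effect of the three chained string operations can be checked on a handful of headers. The header list below is hypothetical (in particular, `AST/TO` is only an example of a name containing a slash).

```python
import pandas as pd

# Hypothetical raw headers, including stray spaces and a slash
cols = pd.Index([" FG% ", "3P%", "Player Name", "AST/TO"])

cleaned = (
    cols.str.strip()
        .str.replace("%", "Percent", regex=False)
        .str.replace("/", "_", regex=False)
)
print(list(cleaned))  # ['FGPercent', '3PPercent', 'Player Name', 'AST_TO']
```

Spaces inside a name, as in Player Name, are left untouched by these steps.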

After loading the dataset, we performed a preliminary inspection and identified several issues. Five percentage columns (FG%, 3P%, 2P%, eFG%, FT%) contained missing values caused by zero attempts, and these were imputed with zeros. Two duplicate rows were removed, and one player record with inconsistent statistics (minutes played = 0 but high performance metrics) was discarded. Outliers corresponding to legitimate star players (e.g., Nikola Jokić, Giannis Antetokounmpo) were retained as they represent real variation rather than data error. Column names were standardized for clarity. The final dataset contained 432 observations and 25 features.

## Question 2: Dimensionality Reduction [65 marks]

In [None]:
# Preparing data
from sklearn.preprocessing import StandardScaler

# Keep numeric columns only (drops player names, team, etc.)
numeric_df = df.select_dtypes(include='number')

# Fill missing values if any remain
numeric_df = numeric_df.fillna(numeric_df.mean())

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(numeric_df)
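
Standardization matters here because an MSE reconstruction loss would otherwise be dominated by features with large numeric ranges. The sketch below, on invented data with two very different scales, checks that the scaled columns end up with mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented features on very different scales (e.g., minutes vs. a percentage)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=1500, scale=600, size=200),
    rng.normal(loc=0.45, scale=0.05, size=200),
])

X_std = StandardScaler().fit_transform(X)

# Each column should now have mean ~0 and standard deviation ~1
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```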

#### (a) Autoencoders [10]

In [25]:
import tensorflow as tf
from tensorflow.keras import layers, models

# Define autoencoder
input_dim = X_scaled.shape[1]
encoding_dim = 2  # reduce to 2 dimensions

input_layer = layers.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation='relu')(input_layer)
encoded = layers.Dense(16, activation='relu')(encoded)
encoded = layers.Dense(encoding_dim, activation='linear')(encoded)  # 2D bottleneck

decoded = layers.Dense(16, activation='relu')(encoded)
decoded = layers.Dense(32, activation='relu')(decoded)
decoded = layers.Dense(input_dim, activation='linear')(decoded)

autoencoder = models.Model(inputs=input_layer, outputs=decoded)
encoder = models.Model(inputs=input_layer, outputs=encoded)

autoencoder.compile(optimizer='adam', loss='mse')

# Train autoencoder
history = autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=32, validation_split=0.2, verbose=0)

# Encode to 2D
X_ae = encoder.predict(X_scaled)


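ModuleNotFoundError: No module named 'tensorflow'

This cell failed because TensorFlow is not installed in the environment, so X_ae was never created. Sections (b) to (e) depend on it, so TensorFlow needs to be installed (e.g., !pip install tensorflow) and this cell re-run first.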
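
Where TensorFlow cannot be installed, the same idea can be approximated with scikit-learn. This is a swapped-in technique, not the Keras model above: `MLPRegressor` is trained to reproduce its input, and the 2D code is read off by a manual forward pass through the first two layers. It differs from the Keras version in that the bottleneck uses tanh rather than a linear activation, and the data here is invented.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Invented data standing in for X_scaled: 200 samples, 6 standardized features
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = np.tanh(latent @ rng.normal(size=(2, 6))) + 0.05 * rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Train the network to reproduce its own input; the middle layer is the 2D bottleneck
ae = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation="tanh",
                  max_iter=500, random_state=42)
ae.fit(X, X)

def encode(model, data, n_encoder_layers=2):
    """Forward pass through the first hidden layers only (up to the bottleneck)."""
    out = data
    layers = zip(model.coefs_[:n_encoder_layers], model.intercepts_[:n_encoder_layers])
    for W, b in layers:
        out = np.tanh(out @ W + b)
    return out

X_2d = encode(ae, X)
print(X_2d.shape)
```

The helper `encode` is hypothetical; it relies on the fitted `coefs_` and `intercepts_` attributes that scikit-learn exposes for each layer.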

#### (b) Autoencoders + self-organising maps (SOMs) [10]

In [None]:
!pip install minisom
from minisom import MiniSom

# Train SOM on autoencoder output
som = MiniSom(x=10, y=10, input_len=2, sigma=1.0, learning_rate=0.5, random_seed=42)  # seed for reproducibility
som.random_weights_init(X_ae)
som.train_random(X_ae, 500)

# Map each point to SOM cluster
som_clusters = [som.winner(x) for x in X_ae]
som_clusters = [c[0]*10 + c[1] for c in som_clusters]  # flatten to single cluster ID
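
The last line of the cell above turns each winning node's (row, column) position into a single ID, giving up to 100 distinct values on a 10 x 10 grid. A small NumPy sketch of that mapping and its inverse, using hypothetical winner coordinates:

```python
import numpy as np

grid_shape = (10, 10)  # matches the 10 x 10 SOM

# Hypothetical winner coordinates, in the (row, col) form returned by som.winner(x)
winners = [(0, 0), (0, 9), (3, 4), (9, 9)]

# Row-major flattening: id = row * n_cols + col
ids = [r * grid_shape[1] + c for r, c in winners]

# NumPy provides the same mapping and its inverse
rows, cols = zip(*winners)
ids_np = np.ravel_multi_index((rows, cols), grid_shape)
back = list(zip(*(a.tolist() for a in np.unravel_index(ids_np, grid_shape))))

print(ids)   # [0, 9, 34, 99]
print(back)
```

Deriving the multiplier from `grid_shape` rather than hard-coding 10 keeps the IDs correct if the grid size changes.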


#### (c) Autoencoders + t-SNE [10]

In [None]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_ae_tsne = tsne.fit_transform(X_ae)


#### (d) Autoencoders + UMAP [10]

In [None]:
!pip install umap-learn
import umap

umap_model = umap.UMAP(n_components=2, random_state=42)
X_ae_umap = umap_model.fit_transform(X_ae)


#### (e) Variational Autoencoder [10]

In [None]:
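# Note: this symbolic add_loss pattern targets TensorFlow 2.x with Keras 2 (tf.keras up to 2.15).
# It reuses tf, layers, models and input_dim from the autoencoder cell in (a).
from tensorflow.keras import backend as K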

# VAE parameters
latent_dim = 2
input_layer = layers.Input(shape=(input_dim,))

# Encoder
h = layers.Dense(32, activation='relu')(input_layer)
h = layers.Dense(16, activation='relu')(h)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim), mean=0., stddev=1.0)
    return z_mean + K.exp(0.5 * z_log_var) * epsilon

z = layers.Lambda(sampling)([z_mean, z_log_var])

# Decoder
decoder_h1 = layers.Dense(16, activation='relu')
decoder_h2 = layers.Dense(32, activation='relu')
decoder_out = layers.Dense(input_dim, activation='linear')

h_decoded = decoder_h1(z)
h_decoded = decoder_h2(h_decoded)
x_decoded = decoder_out(h_decoded)

vae = models.Model(input_layer, x_decoded)

# VAE loss
reconstruction_loss = tf.reduce_mean(tf.square(input_layer - x_decoded))
kl_loss = -0.5 * tf.reduce_mean(1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var))
vae_loss = reconstruction_loss + kl_loss

vae.add_loss(vae_loss)
vae.compile(optimizer='adam')

vae.fit(X_scaled, epochs=50, batch_size=32, validation_split=0.2, verbose=0)

# Encoder for 2D latent space
encoder_vae = models.Model(input_layer, z_mean)
X_vae = encoder_vae.predict(X_scaled)
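
The two VAE-specific pieces in the cell above, the sampling step and the KL term, can be examined in isolation. The NumPy sketch below mirrors those two expressions only; it is not a trainable model, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * eps

def kl_to_standard_normal(z_mean, z_log_var):
    """Mean over all entries of -0.5 * (1 + log_var - mu^2 - exp(log_var))."""
    return -0.5 * np.mean(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var))

# With mu = 0 and log_var = 0 the encoder distribution equals the prior, so the penalty is zero
mu0 = np.zeros((4, 2))
lv0 = np.zeros((4, 2))
kl_zero = kl_to_standard_normal(mu0, lv0)

# Moving the mean away from 0 makes the penalty positive
mu1 = np.full((4, 2), 1.5)
kl_shifted = kl_to_standard_normal(mu1, lv0)
print(kl_shifted)  # 1.125

z = sample_z(mu1, lv0, rng)
print(z.shape)
```

Because the KL term pulls every code toward the origin, VAE embeddings tend to be more compact than those of a plain autoencoder, which can affect how cleanly K-Means separates them.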


#### K-Means Clustering & Visualisation

In [24]:
from sklearn.cluster import KMeans

def plot_clusters(X, title):
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init avoids version-dependent defaults
    labels = kmeans.fit_predict(X)
    
    plt.figure(figsize=(6,6))
    plt.scatter(X[:,0], X[:,1], c=labels, cmap='viridis', s=50)
    plt.title(title)
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.show()

plot_clusters(X_ae, "Autoencoder")
plot_clusters(X_ae_tsne, "Autoencoder + t-SNE")
plot_clusters(X_ae_umap, "Autoencoder + UMAP")
plot_clusters(X_vae, "Variational Autoencoder")

plt.scatter(X_ae[:,0], X_ae[:,1], c=som_clusters, cmap='tab20', s=50)
plt.title("Autoencoder + SOM")
plt.show()



NameError: name 'X_ae' is not defined

In [None]:
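from sklearn.metrics import silhouette_score
# Fix the seed and n_init so the score is reproducible between runs
labels_ae = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_ae)
print("AE Silhouette:", silhouette_score(X_ae, labels_ae))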
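
The choice of three clusters is fixed throughout; one way to sanity-check it is to compare silhouette scores over a range of k. The sketch below does this on synthetic blobs standing in for an embedding such as X_ae, so the numbers say nothing about the NBA data itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points standing in for an embedding such as X_ae
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

The same loop could be run once per embedding (autoencoder, t-SNE, UMAP, VAE) to compare the methods on a common metric.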
