
# Week 8: Unsupervised Learning
### Project: Credit Card Fraud Detection

In Week 8 I applied unsupervised learning to my dataset. I used **K-Means clustering** and visualized results in 2D using **PCA**. I also evaluated clustering quality with the Silhouette score and saved the clustered dataset.

## Overview of tasks
1. Load the cleaned dataset (`creditcard_cleaned.csv`) or create a small sample if it's missing.
2. Prepare numeric features (fill missing values, scale).
3. Run an elbow analysis to choose number of clusters (k).
4. Fit K-Means, assign cluster labels.
5. Reduce dimensionality to 2D with PCA and plot clusters.
6. Compute Silhouette score and save the dataset with cluster labels.


In [None]:
# Imports and settings
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
%matplotlib inline
sns.set(style='whitegrid')


In [None]:
# Step 1: Load dataset (use cleaned dataset from Week 2 if available)
csv_name = 'creditcard_cleaned.csv'
if os.path.exists(csv_name):
    df = pd.read_csv(csv_name)
    print(f"Loaded '{csv_name}' (shape: {df.shape})")
else:
    print(f"'{csv_name}' not found — creating a sample dataset for clustering.")
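    rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
    amounts = np.concatenate([rng.normal(100, 50, 40), rng.normal(1000, 300, 10)])
    data = {
        'TransactionID': list(range(1, 51)),
        'Amount': np.clip(amounts, 0, None).tolist(),  # a transaction amount cannot be negative
        'Age': rng.integers(18, 70, size=50).tolist(),
        'Fraudulent': rng.choice([0, 0, 0, 1], size=50).tolist()  # roughly 25% flagged as fraud
    }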
    df = pd.DataFrame(data)
    df.to_csv(csv_name, index=False)
    print(f"Sample '{csv_name}' created (shape: {df.shape})")

df.head()

## Step 2: Select numeric features & preprocessing
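I use the numeric columns for clustering, excluding the identifier and the fraud label so that the clusters are not shaped by the very outcome I want to study. I fill missing values with each column's median and scale the features with `StandardScaler`, because K-Means is distance-based and an unscaled feature with a large range (such as `Amount`) would dominate.

In [None]:
# Select numeric columns (exclude identifiers and the target label)
numeric = df.select_dtypes(include=[np.number]).copy()
# 'Class' is an assumption: it is the label name in the public Kaggle version of this dataset
exclude = [c for c in ('TransactionID', 'Fraudulent', 'Class') if c in numeric.columns]
numeric = numeric.drop(columns=exclude)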

print("Numeric columns used for clustering:", list(numeric.columns))

# Fill missing values with median (if any)
numeric = numeric.fillna(numeric.median())

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(numeric)

print("Scaled feature matrix shape:", X_scaled.shape)
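
To make the scaling step concrete, here is a minimal sketch of what standardization does, written with plain NumPy instead of `StandardScaler`. The toy amount and age values are made up for illustration. The point is that each column ends up with mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means relies on.

```python
import numpy as np

# Toy feature matrix (hypothetical values): column 0 = amount, column 1 = age
X = np.array([
    [20.0, 25.0],
    [150.0, 40.0],
    [90.0, 33.0],
    [1200.0, 58.0],
])

# Per-column mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print("column means after scaling:", X_std.mean(axis=0).round(6))
print("column stds after scaling: ", X_std.std(axis=0).round(6))
```

Before scaling, the amount column spans more than a thousand units while age spans about thirty; after scaling both are on the same footing.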


## Step 3: Elbow method to choose k (number of clusters)
I will compute inertia for k from 2 to 8 and plot the elbow curve to pick a reasonable k.

In [None]:
# Fit K-Means for each candidate k and record the inertia
inertias = []
K_range = range(2, 9)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(7,4))
plt.plot(list(K_range), inertias, '-o')
plt.xlabel('k (number of clusters)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('Elbow Method for choosing k')
plt.xticks(list(K_range))
plt.show()

print('Inertias:', dict(zip(K_range, inertias)))

👉 **I inspect the elbow plot** and choose the k at which the drop in inertia starts to level off. For this notebook I pick `k = 3` as a reasonable starting point; adjust it after inspecting the plot.
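
Reading the elbow by eye is subjective, so a rough numeric cross-check can help. The sketch below uses a hypothetical helper, `elbow_by_second_difference`, which picks the k where the inertia curve bends most sharply (largest discrete second difference). The inertia values are made up to exercise the helper; this is one simple heuristic under those assumptions, not a standard scikit-learn function.

```python
import numpy as np

def elbow_by_second_difference(ks, inertias):
    """Return the interior k with the largest second difference of inertia.

    Hypothetical helper: for each interior k it evaluates
    inertia[k-1] - 2 * inertia[k] + inertia[k+1] and returns the argmax.
    """
    ks = list(ks)
    inertias = np.asarray(inertias, dtype=float)
    if len(ks) < 3 or len(ks) != len(inertias):
        raise ValueError("need matching sequences with at least three k values")
    second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # +1 because second_diff[0] belongs to the second entry of ks
    return ks[int(np.argmax(second_diff)) + 1]

# Made-up inertia values for k = 2..8, only to exercise the helper
example_ks = range(2, 9)
example_inertias = [100.0, 40.0, 32.0, 27.0, 23.0, 20.0, 18.0]
best_k = elbow_by_second_difference(example_ks, example_inertias)
print("suggested k:", best_k)
```

In this notebook the call would be `elbow_by_second_difference(K_range, inertias)`. The heuristic can never return the smallest or largest k tried, so I treat its answer as a hint to compare against the plot rather than a decision.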

In [None]:
# Step 4: Fit K-Means with chosen k and attach labels
k = 3
kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
cluster_labels = kmeans.fit_predict(X_scaled)

df_clusters = df.copy()
df_clusters['cluster'] = cluster_labels
print('Cluster counts:')
print(df_clusters['cluster'].value_counts().sort_index())
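
For intuition about what `fit_predict` does, here is a minimal sketch of the classic Lloyd iteration that K-Means is based on: assign each point to its nearest center, move each center to the mean of its points, and repeat. It is a teaching version with simple random initialization on synthetic blobs, not scikit-learn's implementation (which adds k-means++ initialization, multiple restarts via `n_init`, and optimized distance computations). The function name `lloyd_kmeans` is hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns (labels, centers, inertia)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (random init, not k-means++)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and inertia (within-cluster sum of squared distances)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centers, inertia

# Two well-separated synthetic blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(30, 2))
X_demo = np.vstack([blob_a, blob_b])

labels_demo, centers_demo, inertia_demo = lloyd_kmeans(X_demo, k=2)
print("cluster sizes:", np.bincount(labels_demo, minlength=2))
print("inertia:", round(float(inertia_demo), 3))
```

The inertia returned here is the same quantity plotted in the elbow analysis: the sum of squared distances from each point to its assigned center.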


## Step 5: PCA to 2D & Visualization
I reduce the scaled features to 2 principal components and plot the clusters in 2D. This helps me visually inspect cluster separation.

In [None]:
# PCA to 2 components
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

df_clusters['pca1'] = X_pca[:,0]
df_clusters['pca2'] = X_pca[:,1]

plt.figure(figsize=(8,6))
palette = sns.color_palette('tab10', n_colors=k)
sns.scatterplot(data=df_clusters, x='pca1', y='pca2', hue='cluster', palette=palette, s=80)
plt.title('K-Means clusters visualized in 2D PCA space')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend(title='cluster')
plt.show()
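
A 2D PCA plot is only as trustworthy as the share of variance the two components keep. The sketch below computes PCA by hand with an SVD on synthetic data (the data and variable names are illustrative assumptions) and reports the explained variance ratio, which is worth checking before reading too much into visual cluster separation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 4-feature data whose variance lies mostly along two directions
latent = rng.normal(size=(200, 2)) * np.array([3.0, 1.5])
mixing = rng.normal(size=(2, 4))
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(200, 4))

# PCA via SVD of the centered data
X_centered = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Projection onto the first two principal components
X_2d = X_centered @ Vt[:2].T

print("explained variance ratio per component:", explained_variance_ratio.round(3))
print("variance kept by 2 components:", explained_variance_ratio[:2].sum().round(3))
```

In the notebook, the fitted `pca` object exposes the same information as `pca.explained_variance_ratio_`. If the first two components keep only a small fraction of the variance, clusters that overlap in the plot may still be well separated in the full feature space.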


### Optional: show cluster centers in PCA space
I convert cluster centers (which are in scaled feature space) to PCA space and plot them as X markers.

In [None]:
centers_scaled = kmeans.cluster_centers_
centers_pca = pca.transform(centers_scaled)

plt.figure(figsize=(8,6))
sns.scatterplot(data=df_clusters, x='pca1', y='pca2', hue='cluster', palette=palette, s=80, alpha=0.6)
plt.scatter(centers_pca[:,0], centers_pca[:,1], marker='X', s=200, c='black', label='centers')
plt.title('Clusters and cluster centers (PCA space)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend()
plt.show()
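
Because the cluster centers live in scaled feature space, their coordinates are in standard deviations rather than in currency or years. A small sketch of undoing the standardization is shown below, using made-up values and labels that I simply assume came from K-Means. It illustrates the arithmetic `x = z * std + mean` with NumPy only.

```python
import numpy as np

# Hypothetical raw features: column 0 = amount, column 1 = age
X_raw = np.array([
    [50.0, 22.0],
    [70.0, 30.0],
    [900.0, 45.0],
    [1100.0, 55.0],
])
mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)
X_scaled_demo = (X_raw - mean) / std

# Pretend these labels came from K-Means on the scaled data
labels_demo = np.array([0, 0, 1, 1])
centers_scaled_demo = np.vstack([
    X_scaled_demo[labels_demo == j].mean(axis=0) for j in range(2)
])

# Undo the standardization to read the centers in original units
centers_original = centers_scaled_demo * std + mean
print(centers_original)
```

In the notebook the equivalent call would be `scaler.inverse_transform(kmeans.cluster_centers_)`, which returns one row per cluster in the original units of the features used for clustering.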


## Step 6: Evaluate clustering quality (Silhouette score)
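The Silhouette score ranges from -1 to 1; higher is better. For each point it compares the mean distance to the other points in its own cluster with the mean distance to the points in the nearest other cluster. Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.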

In [None]:
if len(set(cluster_labels)) > 1:
    sil_score = silhouette_score(X_scaled, cluster_labels)
    print(f"Silhouette score for k={k}: {sil_score:.3f}")
else:
    print('Only one cluster found — silhouette score not defined.')
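
To show where the number comes from, here is a sketch of the silhouette computed by hand. For each point `i`, `a` is the mean distance to the other members of its cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The helper name `silhouette_by_hand` and the four 1D points are illustrative assumptions; the helper builds a full pairwise distance matrix, so it only suits small data.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean silhouette with Euclidean distance; assumes at least 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix (O(n^2) memory, fine for small data)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton clusters score 0 by convention
        # Mean distance to own cluster, excluding the point itself
        a = dist[i, same].sum() / (n_same - 1)
        # Smallest mean distance to any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_demo = np.array([0, 0, 1, 1])
score = silhouette_by_hand(X_demo, labels_demo)
print(f"silhouette: {score:.4f}")
```

On a large dataset the pairwise distances become expensive. `silhouette_score` accepts a `sample_size` argument (with `random_state`) to estimate the score on a random subset instead.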


## Step 7: Inspect cluster characteristics
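I compute summary statistics per cluster (count, mean and median `Amount`, mean `Age`) to understand cluster behavior. Columns that are absent from the loaded dataset are skipped.

In [None]:
# Build the aggregation only from columns that exist in the loaded dataset
agg_spec = {'count': ('cluster', 'size')}
if 'Amount' in df_clusters.columns:
    agg_spec['mean_amount'] = ('Amount', 'mean')
    agg_spec['median_amount'] = ('Amount', 'median')
if 'Age' in df_clusters.columns:
    agg_spec['mean_age'] = ('Age', 'mean')
cluster_summary = df_clusters.groupby('cluster').agg(**agg_spec).reset_index()
cluster_summary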

👉 **Insight:** I look for clusters with higher average `Amount` or other distinguishing traits — such clusters could highlight suspicious behavior worth investigating further.
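
One way to check whether a cluster really is suspicious is to compare the known fraud rate across clusters. The sketch below uses a tiny made-up frame with the same column names as the sample dataset (`Amount`, `Fraudulent`, `cluster`); it assumes a binary label column is available, which may not hold for every version of the data.

```python
import pandas as pd

# Tiny made-up frame with the same column names the sample dataset uses
demo = pd.DataFrame({
    "Amount":     [20.0, 35.0, 50.0, 900.0, 1200.0, 1500.0],
    "Fraudulent": [0,    0,    1,    0,     1,      1],
    "cluster":    [0,    0,    0,    1,     1,      1],
})

# The mean of a 0/1 label is the fraction of fraudulent rows in each cluster
fraud_by_cluster = (
    demo.groupby("cluster")
        .agg(count=("Fraudulent", "size"),
             fraud_rate=("Fraudulent", "mean"),
             mean_amount=("Amount", "mean"))
        .reset_index()
)
print(fraud_by_cluster)
```

A cluster whose fraud rate is well above the overall rate is a better candidate for follow-up than one that merely has a high average amount.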

In [None]:
# Step 8: Save clustered dataset for the repo
out_csv = 'creditcard_clusters.csv'
df_clusters.to_csv(out_csv, index=False)
print(f"Saved clustered dataset as '{out_csv}' (shape: {df_clusters.shape})")


## Conclusion / Project Milestone
- I applied K-Means clustering and visualized clusters in 2D using PCA.  
- I computed the Silhouette score to get a quick sense of cluster quality.  
- I saved the dataset with cluster labels as `creditcard_clusters.csv` for further analysis.  

**Next steps:** use clusters to help with feature engineering (create cluster membership feature), or apply anomaly detection methods to the high-value/sparse clusters.
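
As a starting point for those next steps, here is a sketch of two cluster-derived features: a one-hot membership vector and the distance from each point to its own cluster center, which can serve as a crude anomaly score. The helper `cluster_features` and the example points and centers are hypothetical and only illustrate the idea.

```python
import numpy as np

def cluster_features(X, centers):
    """Return (one_hot_membership, distance_to_own_center) for each row of X."""
    # Euclidean distance from every point to every center
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
    labels = d.argmin(axis=1)
    one_hot = np.eye(len(centers), dtype=int)[labels]
    dist_own = d[np.arange(len(X)), labels]
    return one_hot, dist_own

# Made-up scaled points and centers for illustration
X_demo = np.array([[0.0, 0.0], [0.2, -0.1], [5.0, 5.0], [9.0, 9.0]])
centers_demo = np.array([[0.0, 0.0], [5.0, 5.0]])

one_hot, dist_own = cluster_features(X_demo, centers_demo)
print(one_hot)
print(dist_own.round(3))

# The point farthest from its own center is the most isolated candidate
print("most isolated row:", int(dist_own.argmax()))
```

With the fitted model from this notebook, `kmeans.transform(X_scaled)` returns the distance from each row to every cluster center, so the same features can be derived without recomputing distances by hand.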