Importing the Dependencies

https://levelup.gitconnected.com/dbscan-a-density-based-clustering-algorithm-110b726fd6fe 
https://clustering-visualizer.web.app/dbscan 
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ 
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/ 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

Data Collection & Analysis

In [2]:
# loading the data from csv file to a Pandas DataFrame
path = r'https://media.githubusercontent.com/media/shahil04/ds_materials/refs/heads/main/8.0_Machine%20Learning/class/Module%209.1.1%20Mall_Customers.csv'
customer_data = pd.read_csv(path)

In [3]:
# first 5 rows in the dataframe
customer_data.head()

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40


In [4]:
# finding the number of rows and columns
customer_data.shape

(200, 5)

In [5]:
# getting some information about the dataset
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB


In [6]:
# checking for missing values
customer_data.isnull().sum()

CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64

# 📘 PCA (Principal Component Analysis): Code-Based Tutorial

### Example: Customer Segmentation Dataset

### Dataset Columns

```
CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100)
```

---

## 🔹 1. Why PCA? (1-minute explanation for students)

* PCA is a **dimensionality reduction** technique
* It **compresses data** while keeping **maximum information (variance)**
* Converts many features → **few principal components**
* Helps in:

  * Visualization
  * Noise reduction
  * Faster ML models

---


## 🔹 2. Import Required Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```

---

## 🔹 3. Create the Dataset (Your Data)

```python
data = {
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40]
}

df = pd.DataFrame(data)
df
```

---

## 🔹 4. Encode Categorical Column (Gender)

PCA **only works on numbers**, not text.

```python
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})
df
```

---

## 🔹 5. Select Features (Remove CustomerID)

```python
X = df.drop("CustomerID", axis=1)
X
```

---

## 🔹 6. Feature Scaling (VERY IMPORTANT)

> PCA is **variance-based**: features measured on larger scales would otherwise dominate the components, so standardize the features first.

```python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_scaled
```
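To see why the scaling step matters, the sketch below refits PCA on the unscaled features and checks which one dominates the first component. It rebuilds the same five rows (with Gender already encoded) so it runs on its own. The names `pca_raw`, `raw_loadings`, and `dominant_feature` are illustrative and not part of the tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

X_scaled = StandardScaler().fit_transform(X)

# After standardization each column has mean 0 and (population) std 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print("Column means:", col_means.round(6))
print("Column stds: ", col_stds.round(6))

# Without scaling, the feature with the largest raw variance dominates PC1.
pca_raw = PCA(n_components=1).fit(X)
raw_loadings = pd.Series(np.abs(pca_raw.components_[0]), index=X.columns)
dominant_feature = raw_loadings.idxmax()
print(raw_loadings.round(3))
print("Dominant feature on unscaled data:", dominant_feature)
```

On this sample the spending score has by far the largest raw spread, so it takes over the first component unless the features are standardized.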

---

## 🔹 7. Apply PCA (2 Components)

```python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

X_pca
```

Now the dataset is reduced from **4 features → 2 principal components**.
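One way to check how much was lost in the reduction is to map the two components back to the original feature space with `inverse_transform` and measure the reconstruction error. This is a minimal sketch on the same five rows. `X_restored`, `lost_fraction`, and `retained_fraction` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Map the 2-D scores back to the 4-D (scaled) feature space.
X_restored = pca.inverse_transform(X_pca)

# Share of the total squared variation that two components could not rebuild.
lost_fraction = np.sum((X_scaled - X_restored) ** 2) / np.sum(X_scaled ** 2)
retained_fraction = pca.explained_variance_ratio_.sum()

print("Shape before:", X_scaled.shape, "after:", X_pca.shape)
print("Variance retained:", round(retained_fraction, 3))
print("Variance lost:    ", round(lost_fraction, 3))
```

The retained and lost fractions add up to 1, which is why the explained variance ratio is a convenient summary of the compression.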

---

## 🔹 8. Explained Variance Ratio (KEY CONCEPT)

```python
pca.explained_variance_ratio_
```

### Example Output (illustrative values; yours will depend on the data):

```
[0.63, 0.27]
```

### Explain to Students:

* PC1 explains about **63% of the variance**
* PC2 explains about **27% of the variance**
* Together ≈ **90% of the variance is retained**
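Adding the ratios up is easier with a running total. The sketch below fits PCA with all available components on the same five rows and computes the cumulative explained variance. The 90% threshold and the names `pca_full`, `cumulative`, and `k_90` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
X_scaled = StandardScaler().fit_transform(X)

# n_components=None keeps every component: min(n_samples, n_features) = 4 here.
pca_full = PCA().fit(X_scaled)

# Running total of the per-component variance ratios.
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%.
k_90 = int(np.argmax(cumulative >= 0.90)) + 1

for i, value in enumerate(cumulative, start=1):
    print(f"First {i} component(s): {value:.3f}")
print("Components needed for 90%:", k_90)
```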

---

## 🔹 9. Convert PCA Result to DataFrame

```python
pca_df = pd.DataFrame(
    X_pca,
    columns=["Principal Component 1", "Principal Component 2"]
)

pca_df
```

---

## 🔹 10. Visualize PCA (2D Plot)

```python
plt.figure(figsize=(6,4))
plt.scatter(
    pca_df["Principal Component 1"],
    pca_df["Principal Component 2"],
    c=df["Gender"], cmap="coolwarm", s=80
)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA - Customer Data")
plt.colorbar(label="Gender (0=Male, 1=Female)")
plt.grid()
plt.show()
```

---

## 🔹 11. What PCA Actually Did (Simple Explanation)

| Original Feature | PCA Effect |
| ---------------- | ---------- |
| Age              | Combined   |
| Gender           | Combined   |
| Income           | Combined   |
| Spending Score   | Combined   |

➡ PCA **creates new axes** that capture **maximum variation**
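The "new axes" can be reproduced by hand: they are the eigenvectors of the covariance matrix of the scaled data, ordered by eigenvalue. The sketch below is one way to check this against scikit-learn on the same five rows. It is not how `PCA` is implemented internally (scikit-learn uses an SVD), and the names `cov`, `manual_ratio`, and `sk_pca` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
X_scaled = StandardScaler().fit_transform(X)

# Covariance matrix of the features (columns are variables).
cov = np.cov(X_scaled, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue's share of the total variance.
manual_ratio = eigvals / eigvals.sum()

sk_pca = PCA(n_components=2).fit(X_scaled)

print("Manual ratios: ", manual_ratio[:2].round(4))
print("sklearn ratios:", sk_pca.explained_variance_ratio_.round(4))

# Directions agree up to sign, so compare absolute values.
same_axes = np.allclose(np.abs(eigvecs[:, :2].T), np.abs(sk_pca.components_), atol=1e-6)
print("Same axes (up to sign):", same_axes)
```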

---

## 🔹 12. PCA Loadings (Advanced but Useful)

```python
pca.components_
```

Each row shows **how much each original feature contributes** to a component.
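The raw array is easier to read with labels attached. A minimal sketch, assuming the same five rows and illustrative labels `PC1` and `PC2`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Rows are components, columns are the original features.
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=["PC1", "PC2"])
print(loadings.round(3))

# Each component is a unit-length direction, so squared weights sum to 1 per row.
row_norms = (loadings ** 2).sum(axis=1)
print(row_norms.round(6))
```

A weight with a large absolute value means that feature pulls strongly on the component. The sign only indicates direction and can flip between runs or library versions.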

---

## 🔹 13. When to Use PCA (Interview-Ready)

* ✅ High-dimensional data
* ✅ Visualization needed
* ✅ Reduce multicollinearity
* ❌ When interpretability is critical
* ❌ When there are already very few features
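The multicollinearity point can be checked directly: the original features may be strongly correlated, while the principal component scores are uncorrelated by construction. A minimal sketch on the same five rows, with illustrative names `feature_corr`, `pc_corr`, `max_feature_corr`, and `pc1_pc2_corr`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same five rows as the tutorial, with Gender already encoded (Male=0, Female=1).
X = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 1],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# Correlation between the original features (columns are variables).
feature_corr = np.corrcoef(X_scaled, rowvar=False)
# Largest off-diagonal correlation in absolute value.
max_feature_corr = np.abs(feature_corr - np.eye(feature_corr.shape[0])).max()

# Correlation between the two principal component scores.
pc_corr = np.corrcoef(X_pca, rowvar=False)
pc1_pc2_corr = pc_corr[0, 1]

print("Strongest correlation between original features:", round(max_feature_corr, 3))
print("Correlation between PC1 and PC2:", round(abs(pc1_pc2_corr), 6))
```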

---

## 🔹 14. PCA vs Feature Selection (1-line)

| PCA                    | Feature Selection         |
| ---------------------- | ------------------------- |
| Creates new features   | Selects existing features |
| Loses interpretability | Keeps meaning             |

---

## 🔹 15. Mini Assignment (For Students)

1. Apply PCA with **3 components**
2. Plot **cumulative explained variance**
3. Use PCA output for **K-Means clustering**