
# 🌟 Data Mining Project: Principal Component Analysis (PCA) on the Adult Dataset 🌟

Welcome to your Data Mining project! In this comprehensive exercise, you'll apply **Principal Component Analysis (PCA)** to analyze the Adult dataset. PCA helps reduce dimensionality, simplify visualization, and highlight underlying patterns in data.

🎯 **Project Goals:**

By completing this project, you'll learn how to:

- Import essential Python libraries for data analysis.
- Load, clean, and preprocess real-world data.
- Perform Exploratory Data Analysis (EDA) to uncover data insights.
- Encode categorical variables and normalize numerical features.
- Implement PCA manually to better understand the algorithm.
- Visualize PCA results clearly and interpret principal components.

Let's start your journey into PCA analysis! 🚀



## 📚 Step 1: Importing Essential Libraries

In this initial step, you will import all necessary Python libraries required for data manipulation, visualization, and preprocessing.

Run the provided code to import the following libraries:

- **pandas**: For data handling and manipulation.
- **numpy**: For numerical computations.
- **matplotlib** and **seaborn**: For creating insightful visualizations.
- **StandardScaler and OneHotEncoder from sklearn**: For scaling numerical features and encoding categorical data.

Execute the cell below to load these libraries into your environment.


In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, OneHotEncoder



## 📂 Step 2: Loading and Exploring the Dataset

In this step, load the dataset named `adult.csv` into a DataFrame using pandas. Once loaded, briefly inspect the dataset by displaying the first five rows.

**Instructions:**

- Load your data using `pd.read_csv()`.
- Use the `.head()` method to preview the data structure.

**Example Code:**

```python
df = pd.read_csv('adult.csv')  # adjust the path if the file lives elsewhere
df.head()
```


In [None]:
# Write your code here


## 🧹 Step 3: Cleaning and Preprocessing the Dataset

Data in the real world is often incomplete or messy. Your task here is to clean the dataset by:

- Replacing '?' entries (unknown values) with `NaN`.
- Removing all rows containing any `NaN` values.
- Resetting the DataFrame's index to ensure it's clean and orderly.
- Checking the data type and completeness of each feature with `.info()`.

**Useful Methods:**

- `.replace()` for replacing values.
- `.dropna()` for removing missing values.
- `.reset_index()` to renumber the index after rows are dropped (pass `drop=True` to discard the old index).

Complete the tasks in the following cell.
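The cleaning steps above can be sketched on a tiny hand-made DataFrame. This is only an illustration: `toy` and its values are hypothetical stand-ins, and the real `adult.csv` has more columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Adult data; the real columns and values may differ.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Tech-support", "?", "Sales"],
})

# 1. Mark unknown values as missing.
toy_clean = toy.replace("?", np.nan)

# 2. Drop every row that has at least one missing value.
toy_clean = toy_clean.dropna()

# 3. Renumber the index from 0; drop=True discards the old index
#    instead of keeping it as a new column.
toy_clean = toy_clean.reset_index(drop=True)

# 4. Check dtypes and non-null counts.
toy_clean.info()
```

The same chain of calls should carry over to the full dataset, provided its unknown values are stored exactly as the string `'?'`.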


In [None]:
# Write your code here


## 📊 Step 4: Exploratory Data Analysis (EDA)

EDA helps you understand your data and discover insights before modeling. Complete the following visual analyses:

- **Scatter Plot:** Examine the relationship between `age` and `hours-per-week`, distinguishing individuals by `income`.
- **Histogram:** Plot the distribution of the `age` variable.
- **Box Plot:** Identify potential outliers in the `age` data.
- **Pair Plot:** Explore relationships and interactions among `age`, `educational-num`, and `hours-per-week` with respect to `income`.

**Recommended Functions:**

- `sns.scatterplot()` for scatter plots.
- `sns.histplot()` for histograms.
- `sns.boxplot()` to detect outliers visually.
- `sns.pairplot()` to study pairwise relationships between multiple features.

Label each plot clearly and briefly interpret what you observe.


In [None]:
# Write your code here


## ⚙️ Step 5: Encoding Categorical Data & Normalizing Numerical Features

Before PCA can be applied, it's important to convert categorical data into numerical form and scale numerical features:

- Apply **One-Hot Encoding** to transform categorical variables (`workclass`, `education`, `gender`, etc.) using `pd.get_dummies()`.
- Normalize numerical variables (`age`, `fnlwgt`, `hours-per-week`, etc.) using `StandardScaler()` from sklearn.

After processing, display the first 5 rows of your cleaned and transformed dataset to verify results.

**Example:**

```python
df_encoded = pd.get_dummies(df, columns=['your-categorical-columns'])
scaler = StandardScaler()
df_encoded[your_numerical_columns] = scaler.fit_transform(df_encoded[your_numerical_columns])
df_encoded.head()
```
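To check the result of this step, it can help to run the same two calls on a toy frame and inspect the output. In this sketch the frame `toy` and the column split (`categorical_cols`, `numerical_cols`) are hypothetical. Passing `dtype=float` is an optional choice that keeps the dummy columns numeric rather than boolean.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cleaned data; values are made up.
toy = pd.DataFrame({
    "age": [25, 38, 52, 44],
    "hours-per-week": [40, 50, 40, 30],
    "workclass": ["Private", "Private", "State-gov", "Self-emp"],
    "gender": ["Male", "Female", "Male", "Female"],
})

categorical_cols = ["workclass", "gender"]  # assumed split of columns
numerical_cols = ["age", "hours-per-week"]

# One new column per category level; the original text columns are removed.
toy_encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=float)

# Standardise to zero mean and unit variance (population std, ddof=0).
scaler = StandardScaler()
toy_encoded[numerical_cols] = scaler.fit_transform(toy_encoded[numerical_cols])

print(toy_encoded.columns.tolist())
print(toy_encoded[numerical_cols].mean().round(6).tolist())
print(toy_encoded[numerical_cols].std(ddof=0).round(6).tolist())
```

After scaling, each numerical column should have a mean close to 0 and a population standard deviation close to 1, which is a quick sanity check before moving on to PCA.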


In [None]:
# Write your code here


## 🛠️ Step 6: Implement PCA Manually

Gain deeper insight by manually implementing PCA. Create a function called `perform_pca()` that:

- Accepts normalized data and the number of principal components (`n_components`) as arguments.
- Computes the covariance matrix of the features, then its eigenvalues and eigenvectors.
- Sorts the eigenvectors by descending eigenvalue and keeps the top `n_components`.
- Projects data onto the selected components to reduce dimensionality.

Run PCA with 2 components and verify the output by displaying the first 5 rows of the result.

**Structure your function as follows:**

```python
def perform_pca(data, n_components=2):
    # Your PCA code here
    return pca_result
```
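One possible way to fill in this skeleton is sketched below. It is not the only valid solution. Several choices here are assumptions rather than requirements of the assignment: centring the data inside the function, using `np.linalg.eigh`, returning a DataFrame with `PC1`, `PC2`, ... columns, and the random `demo` data.

```python
import numpy as np
import pandas as pd


def perform_pca(data, n_components=2):
    """Project `data` onto its top `n_components` principal components."""
    X = np.asarray(data, dtype=float)

    # Centre each feature; PCA directions are defined around the mean.
    X_centered = X - X.mean(axis=0)

    # Covariance matrix of the features (columns are variables).
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns real eigenvalues
    # in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Reorder to descending eigenvalue and keep the leading eigenvectors.
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:n_components]]

    # Project the centred data onto the selected directions.
    projected = X_centered @ components
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(projected, columns=columns)


# Usage on random data (a stand-in for the encoded, scaled DataFrame).
rng = np.random.default_rng(42)
demo = rng.normal(size=(100, 5))
pca_result = perform_pca(demo, n_components=2)
print(pca_result.head())
```

The sign of each component is arbitrary, so a result that differs from another implementation only by a flipped sign in a column is still equivalent.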


In [None]:

def perform_pca(data, n_components=2):
    # Implement PCA steps here
    pass



## 📈 Step 7: Visualizing and Interpreting PCA Results

Now, visualize the PCA output clearly:

- **Scatter plot:** Plot the two principal components to visualize data separation and clusters.
- **Heatmap:** Show the correlations between the original features and the principal components.
- Provide a brief interpretation of what each principal component represents in terms of original features.

**Visualization Tools:**

- Use `plt.scatter()` for scatter plots.
- Use `sns.heatmap()` for correlation heatmaps.

Reflect on your findings briefly in your analysis.
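The two plots could be produced along the following lines. This sketch runs on synthetic data. The `features` frame, its random values, and the inline two-component projection are stand-ins for your own scaled data and PCA result. It uses `Axes.scatter`, the Axes-level equivalent of `plt.scatter()`, so that both plots share one figure.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the scaled features (values are made up).
rng = np.random.default_rng(1)
features = pd.DataFrame(
    rng.normal(size=(150, 4)),
    columns=["age", "fnlwgt", "educational-num", "hours-per-week"],
)

# Two principal components via eigendecomposition of the covariance matrix.
centered = (features - features.mean()).to_numpy()
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
top2 = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]
scores = pd.DataFrame(centered @ top2, columns=["PC1", "PC2"])

# Correlation of every original feature with every component.
corr = (
    pd.concat([features, scores], axis=1)
    .corr()
    .loc[features.columns, scores.columns]
)

fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(12, 4))

ax_scatter.scatter(scores["PC1"], scores["PC2"], s=10, alpha=0.6)
ax_scatter.set_xlabel("PC1")
ax_scatter.set_ylabel("PC2")
ax_scatter.set_title("Data projected onto the first two components")

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Feature vs. component correlation")

fig.tight_layout()
```

A feature whose correlation with a component is large in absolute value contributes strongly to that component, which is a reasonable starting point for the written interpretation.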


In [None]:
# Write your code here


## 📉 Step 8: PCA Explained Variance

Determine how many principal components are needed to retain most of the variance by plotting the cumulative explained variance.

**Tasks:**

- Fit `PCA` from `sklearn.decomposition` to your encoded and scaled data.
- Plot the cumulative sum of the explained variance ratio.

**Hint:**  
```python
from sklearn.decomposition import PCA
pca = PCA().fit(your_data)
# plot cumulative explained variance here
```
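A minimal sketch of the cumulative plot follows, run on synthetic data. The array `X` and the 95% threshold are arbitrary assumptions for illustration; the assignment does not prescribe a cut-off.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: six features with deliberately different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.2])

pca = PCA().fit(X)  # no n_components given, so all components are kept
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that reaches the (arbitrary) 95% threshold.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

fig, ax = plt.subplots()
ax.plot(range(1, len(cumulative) + 1), cumulative, marker="o")
ax.axhline(0.95, linestyle="--", color="gray")
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance")
ax.set_title("Cumulative explained variance")
print(n_needed)
```

The curve always ends at 1.0 when all components are kept. The interesting part is how quickly it rises, since a steep early rise means a few components capture most of the variance.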


In [None]:
# Write your code here


## 🌀 Step 9: K-means Clustering on PCA Results

Apply K-means clustering (with 2 clusters) on your PCA-transformed data and visualize the clusters.

**Tasks:**

- Apply `KMeans` clustering from `sklearn.cluster`.
- Visualize clusters using scatter plots and mark cluster centers.

**Hint:**
```python
from sklearn.cluster import KMeans
# apply KMeans on PCA data
# visualize your clusters clearly
```
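The hint above might be expanded roughly as follows. The array `pca_points` is a synthetic stand-in for your two-column PCA result. The `n_init=10` and `random_state=0` settings are assumptions chosen only to make the run reproducible.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for a two-column PCA result: two loose blobs.
rng = np.random.default_rng(0)
pca_points = np.vstack([
    rng.normal(loc=(-2.0, 0.0), scale=0.7, size=(100, 2)),
    rng.normal(loc=(2.0, 0.0), scale=0.7, size=(100, 2)),
])

# Fit K-means with 2 clusters and get one label per point.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(pca_points)
centers = kmeans.cluster_centers_

fig, ax = plt.subplots()
# Colour each point by its assigned cluster.
ax.scatter(pca_points[:, 0], pca_points[:, 1], c=labels, cmap="viridis", s=10)
# Mark the cluster centers on top.
ax.scatter(centers[:, 0], centers[:, 1], c="red", marker="X", s=200,
           label="Cluster centers")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```

On the real data, comparing the cluster labels with the `income` column is one way to judge whether the two clusters correspond to anything meaningful.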


In [None]:
# Write your code here


---
*Project designed by: Amirerfan Teimoori*
