# PCA Lab
In this notebook, you will work with Principal Component Analysis (PCA) using a customer retail dataset.

## Step 1: Load the Customer Data


In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('retail_customer_data.csv')

# Display the first few rows of the dataset
df.head()


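| | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) | Number of Purchases | Years as Customer |
|---|---|---|---|---|---|---|
| 0 | 1 | 56 | 89 | 47 | 17 | 2 |
| 1 | 2 | 69 | 91 | 86 | 7 | 2 |
| 2 | 3 | 46 | 46 | 23 | 5 | 2 |
| 3 | 4 | 32 | 28 | 66 | 12 | 3 |
| 4 | 5 | 60 | 81 | 27 | 17 | 2 |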


## Step 2: Standardize the Data
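PCA finds the directions of greatest variance, so features measured on larger scales (such as income in k$) would otherwise dominate the components. Standardizing first gives each feature a mean of 0 and a standard deviation of 1, so every feature contributes on an equal footing.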

In [2]:
from sklearn.preprocessing import StandardScaler

# Standardize every column in df (note that this includes the CustomerID identifier)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()

| | CustomerID | Age | Annual Income (k$) | Spending Score (1-100) | Number of Purchases | Years as Customer |
|---|---|---|---|---|---|---|
| 0 | -1.723412 | 0.843704 | 0.65565 | -0.112121 | 1.237061 | -1.148346 |
| 1 | -1.706091 | 1.715924 | 0.723173 | 1.208939 | -0.572831 | -1.148346 |
| 2 | -1.688771 | 0.172767 | -0.796098 | -0.925081 | -0.934809 | -1.148346 |
| 3 | -1.67145 | -0.766547 | -1.403806 | 0.531473 | 0.332115 | -0.761697 |
| 4 | -1.654129 | 1.11208 | 0.385557 | -0.789587 | 1.237061 | -1.148346 |
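
To see what the scaler does, the sketch below standardizes a small synthetic table and checks the resulting means and standard deviations. The data and the `demo_*` names are made up for illustration (this is not the lab's CSV), and leaving out the identifier column is an assumption about what one might prefer, not what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the lab's CSV)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "CustomerID": np.arange(1, 51),
    "Age": rng.integers(18, 70, size=50),
    "Annual Income (k$)": rng.integers(15, 140, size=50),
})

# Assumption: the ID is a label, not a measurement, so leave it out
feature_cols = [c for c in demo_df.columns if c != "CustomerID"]

demo_scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo_df[feature_cols]),
    columns=feature_cols,
)

# StandardScaler divides by the population standard deviation (ddof=0)
col_means = demo_scaled.mean()
col_stds = demo_scaled.std(ddof=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Note that pandas' `std()` defaults to `ddof=1`, so calling it without arguments on scaled data reports values slightly above 1.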


## Step 3: Apply PCA to Reduce Dimensionality
Now, apply PCA to reduce the dimensionality of the dataset. Let's start by reducing it to 2 components.

In [3]:
from sklearn.decomposition import PCA

# Apply PCA with 2 components
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_df)

explained_variance = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance)

Explained variance ratio: [0.21154792 0.18624409]
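
Each explained variance ratio is the fraction of the total variance that lies along one principal component: that component's covariance eigenvalue divided by the sum of all eigenvalues. The sketch below checks this on synthetic stand-in data (the array `X` and its scales are invented for illustration) by comparing a manual NumPy calculation with scikit-learn's result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 200 samples, 4 features with different spreads
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Eigenvalues of the sample covariance matrix, sorted largest first
cov = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

# The same quantity as reported by scikit-learn when all components are kept
sk_ratio = PCA(n_components=4).fit(X).explained_variance_ratio_

print("Manual: ", manual_ratio.round(4))
print("sklearn:", sk_ratio.round(4))
print("Share kept by first two components:", round(float(manual_ratio[:2].sum()), 4))
```

Summing the first few ratios gives the share of variance retained when only those components are kept.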


### Exercise 1: Analyze the Explained Variance
What do the explained variance ratios tell you about the data? How much of the original data's variance is captured by the first two principal components? Write your observations below.

In [4]:
#Your answer here

## Step 4: Visualize the PCA Results
Let's visualize the dataset in the new PCA-transformed 2D space.

In [None]:
import matplotlib.pyplot as plt

# Plot the PCA results
plt.figure(figsize=(8, 6))
plt.scatter(pca_result[:, 0], pca_result[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Retail Customer Dataset')
plt.show()
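
When reading a plot like this, it can help to know which original features drive each axis. The sketch below, on synthetic stand-in data with made-up column names, tabulates the entries of `pca.components_` per feature. It is one possible aid to interpretation, not a required step of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical column names and values)
rng = np.random.default_rng(1)
income = rng.normal(60, 20, size=300)
demo = pd.DataFrame({
    "income": income,
    "spending": 0.8 * income + rng.normal(0, 5, size=300),  # tied to income
    "tenure": rng.normal(5, 2, size=300),                    # independent
})

demo_pca = PCA(n_components=2)
demo_pca.fit(StandardScaler().fit_transform(demo))

# Rows of components_ are principal axes; columns follow the feature order
weights = pd.DataFrame(
    demo_pca.components_.T,
    index=demo.columns,
    columns=["PC1", "PC2"],
)
print(weights.round(3))
```

Because `income` and `spending` are constructed to move together here, they would be expected to carry most of the weight on PC1, while `tenure` contributes little to it. The sign of each column is arbitrary; only the relative magnitudes and sign patterns within a column are meaningful.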

### Exercise 2: Interpretation of the Plot
Interpret the scatter plot. Do you notice any clusters or patterns? What might this indicate about the underlying structure of the data?

In [5]:
#Your answer here

## Step 5: Experiment with a Different Number of Components
Experiment with applying PCA with a different number of components (e.g., 3 or 4). Observe how the cumulative explained variance changes.

In [None]:
# Apply PCA with 4 components
pca = PCA(n_components=4)
pca_result = pca.fit_transform(scaled_df)

cumulative_variance = pca.explained_variance_ratio_.cumsum()
print("Cumulative explained variance:", cumulative_variance)

# Plot cumulative explained variance
plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance by PCA Components')
plt.show()
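
One common way to turn a cumulative variance curve into a decision is to pick the smallest number of components that reaches a target share of variance. The sketch below does this on synthetic stand-in data; the array `X`, its scales, and the 0.90 threshold are illustrative assumptions, not values from the lab.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 300 samples, 6 features with decreasing spread
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6)) * np.array([5.0, 4.0, 3.0, 1.0, 0.5, 0.2])

# Fit with all components kept, then accumulate the variance ratios
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

threshold = 0.90  # illustrative target, not a value from the lab
# Index of the first cumulative value that reaches the threshold, plus one
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for", threshold, "->", n_needed)
```

scikit-learn can also make this selection itself: passing a float between 0 and 1 as `n_components` together with `svd_solver='full'` keeps just enough components to exceed that share of variance.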

### Exercise 3: Reflection on PCA Components
Reflect on how the cumulative explained variance changes as more components are added. What does this suggest about the number of components needed to adequately represent the data?

In [6]:
#Your answer here

## Conclusion
In this lab, you applied PCA to a retail customer dataset. You standardized the data, projected it onto two principal components, and used the explained variance ratios to judge how much of the original variance each projection keeps. PCA reduces dimensionality while retaining as much of the data's variance as the chosen number of components allows.