Experiment - 1 (Principle Component Analysis)

---

Name : Shruti Hore

PRN : 24070126172

---

## Principle Component Analysis (PCA)

- PCA is a dimensionality reduction technique and helps us to reduce the number of features in a dataset while keeping the most important information.

- It changes complex datasets by transforming correlated features into a smaller set of uncorrelated components.

# 1. Iris Dataset

# Import libraries and load dataset

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [2]:
iris_df = sns.load_dataset('iris')
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
iris_df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [4]:
iris_df.shape

(150, 5)

In [5]:
iris_df.isnull().sum()

Unnamed: 0,0
sepal_length,0
sepal_width,0
petal_length,0
petal_width,0
species,0


In [6]:
# drop missing values
df = iris_df.dropna()

# Feature Selection

In [7]:
X = iris_df.drop(columns=['species'])
y = iris_df['species']

In [8]:
X

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [9]:
y

Unnamed: 0,species
0,setosa
1,setosa
2,setosa
3,setosa
4,setosa
...,...
145,virginica
146,virginica
147,virginica
148,virginica


# Train Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Standardization

In [11]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Classication (without PCA)

In [12]:
CLF = RandomForestClassifier(n_estimators=100, random_state=42)
CLF.fit(X_train_scaled, y_train)
y_pred = CLF.predict(X_test_scaled)
print("Accuracy on raw data :", metrics.accuracy_score(y_test, y_pred))

Accuracy on raw data : 0.9666666666666667


# With PCA

## 2 components

In [13]:
# Apply PCA (2 components)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

In [14]:
# Train classifier with PCA-transformed data
CLF_PCA = RandomForestClassifier(n_estimators=100, random_state=42)
CLF_PCA.fit(X_train_pca, y_train)
y_pred_pca2 = CLF_PCA.predict(X_test_pca)

raw = metrics.accuracy_score(y_test, y_pred_pca2)
print(f"Accuracy with PCA (2 components) : {raw:5f}")

Accuracy with PCA (2 components) : 0.933333


## 3 components

In [15]:
# Apply PCA (3 components)
pca3 = PCA(n_components=3)
X_train_pca3 = pca3.fit_transform(X_train_scaled)
X_test_pca3 = pca3.transform(X_test_scaled)

CLF_PCA3 = RandomForestClassifier(n_estimators=100, random_state=42)
CLF_PCA3.fit(X_train_pca3, y_train)
y_pred_pca3 = CLF_PCA3.predict(X_test_pca3)

print("Accuracy with PCA (3 components) :", metrics.accuracy_score(y_test, y_pred_pca3))

Accuracy with PCA (3 components) : 0.9333333333333333


## 4 components

In [16]:
# Apply PCA (4 components)
pca4 = PCA(n_components=4)
X_train_pca4 = pca4.fit_transform(X_train_scaled)
X_test_pca4 = pca4.transform(X_test_scaled)

CLF_PCA4 = RandomForestClassifier(n_estimators=100, random_state=42)
CLF_PCA4.fit(X_train_pca4, y_train)
y_pred_pca4 = CLF_PCA4.predict(X_test_pca4)
print("Accuracy with PCA (4 components) :", metrics.accuracy_score(y_test, y_pred_pca4))

Accuracy with PCA (4 components) : 0.9666666666666667


## Results and Analysis

## Accuracy

In [17]:
print(f"Accuracy on raw data : {metrics.accuracy_score(y_test, y_pred):.3f}")
print(f"Accuracy with PCA (2 components) : {metrics.accuracy_score(y_test, y_pred_pca2):.3f}")
print(f"Accuracy with PCA (3 components) : {metrics.accuracy_score(y_test, y_pred_pca3):.3f}")
print(f"Accuracy with PCA (4 components) : {metrics.accuracy_score(y_test, y_pred_pca4):.3f}")

Accuracy on raw data : 0.967
Accuracy with PCA (2 components) : 0.933
Accuracy with PCA (3 components) : 0.933
Accuracy with PCA (4 components) : 0.967


## Variance

- Explained Variance Ratio (EVR) tells us:

  How much of the total data information (variance) is captured by each principal component.

In [18]:
print("Explained Variance Ratio (2 components):", pca.explained_variance_ratio_)
print("Total Variance Retained:", np.sum(pca.explained_variance_ratio_))

Explained Variance Ratio (2 components): [0.72551423 0.23000922]
Total Variance Retained: 0.955523448362601


Iris has only 4 features

- With PCA:
2 components usually retain ~95–98% variance

This means dimensionality can be reduced by 50% with minimal information loss

# Conclusion

- PCA was applied to the Iris dataset to reduce dimensionality. The explained variance ratio showed that the first few principal components captured most of the data variance.

- Using PCA reduced the number of features while maintaining classification accuracy comparable to the original dataset.

- This indicates that PCA effectively reduces dimensionality without significant loss of information and improves computational efficiency.

# Taxi Dataset

# Import libraries and load dataset

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [20]:
df = sns.load_dataset('taxis')
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [21]:
df.isnull().sum()

Unnamed: 0,0
pickup,0
dropoff,0
passengers,0
distance,0
fare,0
tip,0
tolls,0
total,0
color,0
payment,44


In [22]:
# drop missing values
taxi_df = df.dropna()

# Encoding

In [23]:
# encode target variable (payment type)
LE = LabelEncoder()
y = LE.fit_transform(taxi_df['payment'])

In [24]:
x = taxi_df[['fare', 'distance', 'pickup_borough',
             'dropoff_borough', 'total', 'passengers',
             'pickup_zone', 'dropoff_zone']]
x = pd.get_dummies(x, drop_first=True) # one-hot encode catergorical data

In [25]:
print("Total features after encoding :", x.shape[1])

Total features after encoding : 406


# Train_Test_Split

In [26]:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.27,
                                                    random_state=67)

# Feature Scaling

In [27]:
SS = StandardScaler()
x_train_sc = SS.fit_transform(x_train)
x_test_sc = SS.transform(x_test)

# Classification

## Without PCA

In [28]:
RFC = RandomForestClassifier(n_estimators=100,
                             random_state=78)
RFC.fit(x_train_sc, y_train)

y_pred = RFC.predict(x_test_sc)
acc_raw = accuracy_score(y_test, y_pred)
print("Accuracy on raw data :", acc_raw)

Accuracy on raw data : 0.8061879743140689


## 2 components

In [29]:
PCA2 = PCA(n_components=2)
x_train_pca2 = PCA2.fit_transform(x_train_sc)
x_test_pca2 = PCA2.transform(x_test_sc)

In [30]:
RFC_PCA2 = RandomForestClassifier(n_estimators=100, random_state=67)
RFC_PCA2.fit(x_train_pca2, y_train)

y_pred_pca2 = RFC_PCA2.predict(x_test_pca2)
acc_pca2 = accuracy_score(y_test, y_pred_pca2)

print("Accuracy with PCA (2 components):", acc_pca2)

Accuracy with PCA (2 components): 0.6935201401050788


In [31]:
print("\nExplained Variance Ratio (2 components):",
      PCA2.explained_variance_ratio_)
print("Total Variance Retained (2 components):",
      np.sum(PCA2.explained_variance_ratio_))


Explained Variance Ratio (2 components): [0.01390138 0.00889541]
Total Variance Retained (2 components): 0.022796794236760204


Using 2 components retains very little variance **(2.24%)**, which explains the significant drop in model performance after PCA

## 3 components

In [32]:
PCA3 = PCA(n_components=3)
x_train_pca3 = PCA3.fit_transform(x_train_sc)
x_test_pca3 = PCA3.transform(x_test_sc)

In [33]:
RFC_PCA3 = RandomForestClassifier(n_estimators=100, random_state=67)
RFC_PCA3.fit(x_train_pca3, y_train)

y_pred_pca3 = RFC_PCA3.predict(x_test_pca3)
acc_pca3 = accuracy_score(y_test, y_pred_pca3)

print("Accuracy with PCA (3 components):", acc_pca3)

Accuracy with PCA (3 components): 0.7011091652072388


In [34]:
print("\nExplained Variance Ratio (3 components):",
      pca3.explained_variance_ratio_)
print("Total Variance Retained (3 components):",
      np.sum(pca3.explained_variance_ratio_))


Explained Variance Ratio (3 components): [0.72551423 0.23000922 0.03960774]
Total Variance Retained (3 components): 0.9951311833787948


Using 3 principal components retained only **2.98%** of the total variance, which is insufficient to represent the dataset effectively.

This significant information loss led to reduced classification accuracy after applying PCA.

## 4 components

In [35]:
PCA4 = PCA(n_components=4)
x_train_pca4 = PCA4.fit_transform(x_train_sc)
x_test_pca4 = PCA4.transform(x_test_sc)

In [36]:
RFC_PCA4 = RandomForestClassifier(n_estimators=100, random_state=67)
RFC_PCA4.fit(x_train_pca4, y_train)

y_pred_pca4 = RFC_PCA4.predict(x_test_pca4)
acc_pca4 = accuracy_score(y_test, y_pred_pca4)

print("Accuracy with PCA (4 components):", acc_pca4)

Accuracy with PCA (4 components): 0.7139521307647402


# Results & Analysis

In [37]:
print("Accuracy without PCA:", acc_raw)
print("Accuracy with PCA (2):", acc_pca2)
print("Accuracy with PCA (3):", acc_pca3)
print("Accuracy with PCA (4):", acc_pca4)

Accuracy without PCA: 0.8061879743140689
Accuracy with PCA (2): 0.6935201401050788
Accuracy with PCA (3): 0.7011091652072388
Accuracy with PCA (4): 0.7139521307647402


# Conclusion
- Accuracy without PCA: 80.62% (best performance).

- Using 2–4 components reduced accuracy (69–71%), showing information loss.

- Fewer components improve efficiency but hurt performance.

PCA reduces dimensionality, but too few components significantly decrease accuracy for the Taxi dataset.