In [78]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### ***PCA (Principal Component Analysis)***
PCA is a `dimensionality reduction technique` that reduces the number of features in a dataset while retaining most of its variance. It transforms the original features into a new set of uncorrelated features called principal components, ordered by the amount of variance each one captures.

In [79]:
data = sns.load_dataset('car_crashes')

data = data.loc[:, 'total':'not_distracted'].copy()  # copy so later column assignments modify this frame, not a view
data.head()

Unnamed: 0,total,speeding,alcohol,not_distracted
0,18.8,7.332,5.64,18.048
1,18.1,7.421,4.525,16.29
2,18.6,6.51,5.208,15.624
3,22.4,4.032,5.824,21.056
4,12.0,4.2,3.36,10.92


In [80]:
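from sklearn.preprocessing import Binarizer
# Turn 'total' into a binary target: 1 if above the column mean, else 0
binarizer = Binarizer(threshold=data['total'].mean())  # renamed from `bin` to avoid shadowing the built-in
data['total'] = binarizer.transform(data[['total']]).ravel().astype(int)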
data.head()



Unnamed: 0,total,speeding,alcohol,not_distracted
0,1,7.332,5.64,18.048
1,1,7.421,4.525,16.29
2,1,6.51,5.208,15.624
3,1,4.032,5.824,21.056
4,0,4.2,3.36,10.92


1 -> Large number of crashes (`total` above the dataset mean)

0 -> Small number of crashes (`total` at or below the dataset mean)

---
**STEPS INVOLVED:**

1. **Standardization**: Standardize the dataset to have a mean of 0 and a standard deviation of 1.
2. **Covariance Matrix Computation**: Compute the covariance matrix to understand how the features vary with respect to each other.
3. **Eigenvalue and Eigenvector Calculation**: Calculate the eigenvalues and eigenvectors of the covariance matrix to identify the principal components.
4. **Sort Eigenvalues and Eigenvectors**: Sort the eigenvalues in descending order and arrange the corresponding eigenvectors accordingly.
5. **Select Principal Components**: Choose the top 'k' eigenvectors (principal components) based on the largest eigenvalues.
6. **Transform the Data**: Project the original data onto the selected principal components to obtain a dataset with reduced dimensions.
---
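The six steps above can be sketched as one small NumPy function. This is a minimal illustration, not the notebook's own code: the helper name `pca_numpy` and the synthetic `X_demo` array are assumptions made for the example, and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

def pca_numpy(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # 1. Standardize: zero mean and unit standard deviation per feature
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features (columns are the variables)
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen-decomposition; the eigenvectors are the columns of eig_vecs
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 4. Sort the eigenpairs by descending eigenvalue
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # 5. Keep the first k eigenvectors (columns)
    W = eig_vecs[:, :k]
    # 6. Project the standardized data onto them
    return Z @ W, eig_vals

# Hypothetical correlated data, used only to exercise the function
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.1], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
X_demo = rng.normal(size=(200, 3)) @ mixing

scores, eig_vals = pca_numpy(X_demo, k=2)
print(scores.shape)   # (200, 2)
print(eig_vals)       # eigenvalues in descending order
```

The sample variance of each column of `scores` equals the corresponding eigenvalue, which is a quick sanity check for any hand-written PCA.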

***Summary through story:***
To transform the vectors in the dataset we need a matrix. Which matrix? The covariance matrix, because it tells us how the features vary with respect to each other. Which axes are ideal for the transformation? The ones along which the variance is maximum. How do we find those axes? By computing the eigenvectors of the covariance matrix: the eigenvectors with the largest eigenvalues point in the directions of maximum variance. Finally, we project the original data onto these new axes to get a reduced representation that retains most of its variance.
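The claim that the eigenvectors give the directions of maximum variance can be written out in a few lines. The notation here is an assumption introduced for this sketch: $\Sigma$ is the covariance matrix of the standardized data, $w$ a unit-length direction, and $\lambda$ a Lagrange multiplier.

```latex
\begin{aligned}
\text{variance along } w &:\quad \operatorname{Var}(Xw) = w^{\top} \Sigma\, w,
  \qquad \text{subject to } w^{\top} w = 1 \\
\text{Lagrangian} &:\quad \mathcal{L}(w, \lambda) = w^{\top} \Sigma\, w - \lambda\,(w^{\top} w - 1) \\
\text{stationarity} &:\quad \nabla_{w} \mathcal{L} = 2\,\Sigma\, w - 2\,\lambda\, w = 0
  \;\Longrightarrow\; \Sigma\, w = \lambda\, w \\
\text{variance at a solution} &:\quad w^{\top} \Sigma\, w = \lambda\, w^{\top} w = \lambda
\end{aligned}
```

So every candidate direction is an eigenvector of $\Sigma$, the variance captured along it is its eigenvalue, and the maximum is reached at the eigenvector with the largest eigenvalue.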

In [81]:
# 1. Standardize the three feature columns; the binary target in column 0 is left untouched
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data.iloc[:, 1:] = sc.fit_transform(data.iloc[:, 1:])
data.head()

Unnamed: 0,total,speeding,alcohol,not_distracted
0,1,1.168148,0.439938,1.002301
1,1,1.212695,-0.211311,0.608532
2,1,0.756709,0.187615,0.459357
3,1,-0.483614,0.547408,1.676052
4,0,-0.399524,-0.891763,-0.594276


In [82]:
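# 2. Covariance matrix of the standardized features (each row passed to np.cov is one variable).
#    np.cov divides by n - 1 while StandardScaler divides by n, hence 51/50 = 1.02 on the diagonal.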
cov_matrix = np.cov([data.iloc[:, 1], data.iloc[:, 2], data.iloc[:, 3]])
cov_matrix

array([[1.02      , 0.68311294, 0.5997706 ],
       [0.68311294, 1.02      , 0.74747271],
       [0.5997706 , 0.74747271, 1.02      ]])
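The diagonal is 1.02 rather than 1 because the two steps use different divisors. A small sketch on synthetic data (the `X_demo` array is hypothetical, sized to 51 rows like the dataset) shows the effect, assuming standardization with the population standard deviation as `StandardScaler` does:

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(51, 3))   # hypothetical data with 51 rows

# Standardize with the population standard deviation (ddof=0)
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

cov_sample = np.cov(Z, rowvar=False)              # default: divides by n - 1
cov_population = np.cov(Z, rowvar=False, ddof=0)  # divides by n

print(np.diag(cov_sample))      # every entry is n / (n - 1) = 51 / 50 = 1.02
print(np.diag(cov_population))  # every entry is 1.0
```

With `ddof=0` the covariance matrix of standardized data is exactly the correlation matrix of the original data. The constant factor 51/50 rescales the eigenvalues but leaves the eigenvectors unchanged.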

In [83]:
# 3. Eigen-decomposition; np.linalg.eig returns the eigenvectors as the COLUMNS of eig_vecs
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
eig_vals, eig_vecs

(array([2.3753564 , 0.42799725, 0.25664635]),
 array([[ 0.55661313,  0.78999511,  0.25707889],
        [ 0.59837107, -0.16656398, -0.78371455],
        [ 0.57631058, -0.59005438,  0.56542191]]))
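`np.linalg.eig` does not promise any particular ordering of the eigenvalues, so step 4 is worth doing explicitly even when the result happens to come out sorted. The sketch below re-creates the covariance matrix from the values printed in step 2 and sorts the eigenpairs; the names `order`, `eig_vals_sorted`, `eig_vecs_sorted` and `explained_ratio` are introduced here for illustration.

```python
import numpy as np

# Covariance matrix copied from the output of step 2
cov_matrix = np.array([[1.02,       0.68311294, 0.5997706],
                       [0.68311294, 1.02,       0.74747271],
                       [0.5997706,  0.74747271, 1.02]])
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)

# 4. Sort eigenvalues in descending order and reorder the eigenvector COLUMNS to match
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]

# Share of the total variance captured by each component
explained_ratio = eig_vals_sorted / eig_vals_sorted.sum()
print(explained_ratio.round(3))   # roughly 0.776, 0.140, 0.084
```

The ratios are a common guide for choosing `k` in step 5: here the first two components together account for about 92% of the variance.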

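In [84]:
# 4. The eigenvalues above are already in descending order, so no re-sorting is needed here.
# 5. Keep the top two eigenvectors. They are the columns of eig_vecs, so slice
#    columns (not rows) and transpose to get one component per row.
pcs = eig_vecs[:, 0:2].T
pcs

array([[ 0.55661313,  0.59837107,  0.57631058],
       [ 0.78999511, -0.16656398, -0.59005438]])

In [85]:
# 6. Project the standardized features onto the two components
transformed_data = np.dot(data.iloc[:, 1:], pcs.T)
transformed_data_df = pd.DataFrame(transformed_data, columns=['PCA1', 'PCA2'])
transformed_data_df["TARGET"] = data["total"]
transformed_data_df.head().round(4)

Unnamed: 0,PCA1,PCA2,TARGET
0,1.4911,0.2581,1
1,0.8993,0.6342,1
2,0.7982,0.2955,1
3,1.0243,-1.4622,1
4,-1.0985,0.1836,0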
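A manual projection like this can be cross-checked against scikit-learn's `PCA`. The sketch below uses synthetic data (`X_demo` is hypothetical) rather than the crash dataset, and compares absolute values because an eigenvector is only defined up to its sign, so the two routes may flip a component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical correlated data with the same shape as the feature block
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.6, 0.5], [0.0, 0.8, 0.4], [0.0, 0.0, 0.6]])
X_demo = rng.normal(size=(51, 3)) @ mixing

Z = StandardScaler().fit_transform(X_demo)

# Manual route: eigenvectors of the covariance matrix, top two columns
eig_vals, eig_vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eig_vals)[::-1]
manual = Z @ eig_vecs[:, order[:2]]

# Library route
pca = PCA(n_components=2)
library = pca.fit_transform(Z)

print(np.allclose(np.abs(manual), np.abs(library)))               # True
print(np.allclose(pca.explained_variance_, eig_vals[order[:2]]))  # True
```

In `PCA`, `components_` stores one component per row, which is the transpose of the column layout returned by `np.linalg.eig`.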



***Why is standardization a key step in PCA?***
Standardization is a crucial step in PCA because it ensures that all features contribute equally to the analysis. PCA is sensitive to the scale of the data; if features are on different scales, those with larger ranges can dominate the variance captured by the principal components. By standardizing the data (mean = 0, standard deviation = 1), we ensure that each feature has equal weight in the covariance matrix computation, allowing PCA to accurately identify the directions of maximum variance across all features.
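The effect of scale is easy to see on two correlated features where one is expressed in much larger units. The data and the helper `first_pc` below are hypothetical, made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)    # correlated with a, similar spread
X_demo = np.column_stack([a, b * 1000])   # second feature in much larger units

def first_pc(M):
    """Return the eigenvector of the covariance matrix with the largest eigenvalue."""
    vals, vecs = np.linalg.eigh(np.cov(M, rowvar=False))
    return vecs[:, np.argmax(vals)]

raw_pc = first_pc(X_demo)

Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
std_pc = first_pc(Z)

print(np.abs(raw_pc).round(3))   # almost all weight on the large-scale feature
print(np.abs(std_pc).round(3))   # equal weight on both features
```

Without standardization the first component simply follows the feature with the largest numbers. After standardization both features contribute equally, because for two standardized features the leading eigenvector always has equal absolute weights.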

---

In [87]:
from sklearn.ensemble import RandomForestClassifier
clf1 = RandomForestClassifier()
clf2 = RandomForestClassifier()

In [88]:
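# clf1 is trained on the three standardized features;
# clf2 is trained on the first principal component only (column PCA1)
clf1.fit(X=data.iloc[:, 1:], y=data["total"])
clf2.fit(X=transformed_data_df.iloc[:, 0:1], y=transformed_data_df["TARGET"])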

In [93]:
preds = clf1.predict(data.iloc[:, 1:])
preds

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1])

In [92]:
preds_ = clf2.predict(transformed_data_df.iloc[:, 0:1])
preds_

array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1])

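***Both models give the same predictions on the training data, even though `clf2` sees only `PCA1`.*** Because these predictions are made on the same rows the models were fitted on, the match shows in-sample agreement rather than how well either model generalizes.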
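To compare the two feature sets on unseen rows, one option is cross-validation. The sketch below is an assumption-laden illustration, not the notebook's experiment: the data (`X_demo`, `y_demo`) is synthetic, `make_clf` is a hypothetical helper, and for brevity the projection is fitted on all rows before splitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))              # hypothetical features
y_demo = (X_demo.sum(axis=1) > 0).astype(int)   # hypothetical binary target

# Standardize, then project onto the first principal component
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
pc1 = Z @ vecs[:, [np.argmax(vals)]]            # shape (120, 1)

def make_clf():
    # Fixed seed so repeated runs give the same forest
    return RandomForestClassifier(n_estimators=100, random_state=0)

train_acc = make_clf().fit(Z, y_demo).score(Z, y_demo)          # in-sample accuracy
cv_full = cross_val_score(make_clf(), Z, y_demo, cv=5).mean()   # held-out, all features
cv_pc1 = cross_val_score(make_clf(), pc1, y_demo, cv=5).mean()  # held-out, PC1 only
print(f"train={train_acc:.2f}  cv(all features)={cv_full:.2f}  cv(PC1 only)={cv_pc1:.2f}")
```

A random forest usually scores close to 100% on its own training rows, so the cross-validated numbers are the ones that indicate how much predictive information the reduced representation keeps.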