# Principal component analysis
Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.
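
To make "uncorrelated variables that successively maximize variance" concrete, here is a minimal sketch on synthetic data. The array `data` and the way it is built are assumptions made purely for illustration; they are not part of the cars dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: three features, the first two strongly correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
data = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.5, size=500),
    rng.normal(size=500),
])

pca = PCA(n_components=3)
components = pca.fit_transform(data)

# Covariance between different principal components is numerically zero.
cov = np.cov(components, rowvar=False)
off_diagonal = cov - np.diag(np.diag(cov))
print(np.allclose(off_diagonal, 0.0, atol=1e-8))

# Variance captured by each component, sorted from largest to smallest.
print(pca.explained_variance_)
```

The diagonal of `cov` matches `pca.explained_variance_`, so the first component carries the most variance and each later one carries less.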

We repeat the same preprocessing steps as before, so refer to the Linear-Regression notebook for explanations.

In [1]:
import numpy as np
import pandas as pd

In [2]:
cars = pd.read_csv("cleaned_data.csv")
cars.columns

Index(['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'price', 'Year', 'Engine V', 'Brand'],
      dtype='object')

In [3]:
X = cars[['Name', 'style', 'Exterior color', 'interior color', 'Engine',
       'drive type', 'Fuel Type', 'Transmission', 'Mileage', 'mpg city',
       'mpg highway', 'Year', 'Engine V', 'Brand']]


Y = cars["price"].values

In [4]:
from sklearn.preprocessing import OneHotEncoder

onehot = OneHotEncoder(categories="auto", handle_unknown="ignore")

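# One-hot encode style, Engine, drive type, Fuel Type, Transmission and Brand.
categorical_features = onehot.fit_transform(X.iloc[:, [1,4,5,6,7,13]]).toarray()
# Drop the encoded columns plus Name, Exterior color and interior color.
X = np.delete(X.values, [0,1,2,3,4,5,6,7,13], 1)
# Append the one-hot columns to the remaining columns.
X = np.concatenate((X,categorical_features), axis=1)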

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, Y,
    test_size=0.1,
    random_state=42
)

In [6]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
# Fit on the training split only so no test-set statistics leak into the scaling.
std_scaler.fit(x_train)

x_train_std = std_scaler.transform(x_train)
x_test_std  = std_scaler.transform(x_test)

The main purpose of this notebook is to see whether principal component analysis (PCA) can help us get a higher test score (R², as returned by `LinearRegression.score`).

## Principal component analysis (PCA)
Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.
<br><br>
It uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract.<br>
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html">read full documentation</a>
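
As a rough illustration of the description above, the following sketch reproduces PCA by hand: center the data without scaling, take the SVD, and project onto the leading right singular vectors. The random matrix `data` and the choice `k = 2` are assumptions for the example. `svd_solver="full"` is passed so that scikit-learn uses the exact LAPACK-based SVD rather than the randomized one.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 samples, 5 correlated features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center each feature (no scaling), then decompose.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

k = 2  # number of components to keep (arbitrary for this sketch)
manual_projection = centered @ Vt[:k].T
manual_variance = S[:k] ** 2 / (data.shape[0] - 1)

pca = PCA(n_components=k, svd_solver="full")
sklearn_projection = pca.fit_transform(data)

# The sign of each component is arbitrary, so compare absolute values.
print(np.allclose(np.abs(manual_projection), np.abs(sklearn_projection)))
print(np.allclose(manual_variance, pca.explained_variance_))
```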

In [7]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

results = list()

for i in range(2,25):
    pca = PCA(n_components=i)
    # `normalize` was removed in scikit-learn 1.2; the data is already standardized.
    lr = LinearRegression(fit_intercept=True)

    x_train_pca = pca.fit_transform(x_train_std)
    x_test_pca  = pca.transform(x_test_std)

    lr.fit(x_train_pca, y_train)
    result = [i, lr.score(x_test_pca, y_test)]

    results.append(result)

In [8]:
# Sort by score (descending), breaking ties in favor of fewer components.
results = sorted(
    results,
    key=lambda x: (-x[1], x[0])
)

results_df = pd.DataFrame(
    results,
    columns=["n principal components", "score"]
)
results_df[:3]

   n principal components     score
0                      22  0.720175
1                      24  0.716087
2                      23  0.714258


As you can see, we didn't get a better result even with 24 components, which means PCA dimensionality reduction doesn't really help us here.
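
One way to judge how many components are worth keeping, independent of the regression score, is to look at the cumulative explained variance. The sketch below uses synthetic data, not the cars data. The array `features`, its hidden four-factor structure, and the 95% threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 observed features driven by 4 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
mixing = rng.normal(size=(4, 10))
features = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

features_std = StandardScaler().fit_transform(features)

# Keep all components so the full variance spectrum can be inspected.
pca = PCA().fit(features_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that retains at least 95% of the variance.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print(n_needed)
```

If the curve only approaches 1 near the full number of components, the features carry little redundancy and PCA has little to remove.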

Using a pipeline for simplicity:

In [9]:
from sklearn.pipeline import make_pipeline

pip_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=23),
    LinearRegression(fit_intercept=True)
)

# The pipeline standardizes internally, so it is fit on the unscaled split.
pip_lr.fit(x_train, y_train)

print("Test R^2 : {:.3f}".format(pip_lr.score(x_test, y_test)))

Test R^2 : 0.701
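
The loop in cell 7 picks the number of components by looking at the test score. A possible alternative, sketched here with synthetic data from `make_regression` standing in for the cars data, is to let cross-validation on the training split choose `n_components`. The names `X_demo`, `y_demo`, `pipe` and `search` are hypothetical, and the grid of 2 to 12 components is an assumption tied to the 12 synthetic features.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cars data (hypothetical).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=12, n_informative=6, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

pipe = make_pipeline(StandardScaler(), PCA(), LinearRegression())

# make_pipeline names each step after its lowercased class,
# so the PCA parameter is addressed as "pca__n_components".
search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": list(range(2, 13))},
    cv=5,
    scoring="r2",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("Test R^2 : {:.3f}".format(search.score(X_te, y_te)))
```

With this setup the test split is only used once, for the final score, instead of also guiding the choice of `n_components`.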


Sina Kazemi<br>
Github : <a href="https://github.com/sina96n/">sina96n</a>