# Dimensionality Reduction

Feature Selection vs. Feature Extraction

**Feature Selection**: the algorithm selects the best attributes, ranked by importance

**Feature Extraction**: finds related attributes and combines them into one, so it creates new attributes
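
The difference between the two approaches can be illustrated with a small sketch. The data here is synthetic, and `SelectKBest` with `f_classif` is only one possible selection method, picked for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # hypothetical data: 200 rows, 6 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the class depends only on the first two attributes

# Feature selection: keep 2 of the original columns, unchanged
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 2 new columns, each a combination of all 6
extractor = PCA(n_components=2)
X_extracted = extractor.fit_transform(X)

print(kept_columns)                 # indices of the original columns that were kept
print(extractor.components_.shape)  # (2, 6): each new feature mixes all 6 originals
```

The selected columns are exact copies of original attributes, while the extracted columns are new variables that do not exist in the original data.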

## Principal Component Analysis - PCA


- **Unsupervised** ML

- Identifies correlation between variables: strong correlation makes it possible to reduce the number of dimensions

- From $m$ independent variables, **PCA** extracts $p\leq m$ new variables (the principal components)

- The user can choose $p$

- Suited to linearly separable problems
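
The points above can be sketched with NumPy alone. This is a minimal illustration on synthetic data, assuming the usual eigen-decomposition formulation of PCA; it is not how scikit-learn implements it internally:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 variables, the second is almost a copy of the first
a = rng.normal(size=500)
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

# 1. Standardize so every variable has mean 0 and variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized variables
cov = np.cov(X_std, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Keep the p strongest directions and project the data onto them
p = 2
X_reduced = X_std @ eigenvectors[:, :p]

explained_ratio = eigenvalues / eigenvalues.sum()
print(X_reduced.shape)
print(explained_ratio)
```

Because two of the three variables are strongly correlated, two components carry almost all of the variance, which is why the dimension can be reduced with little loss.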

<img src=".\theory\pca_lda.png"  style="width: 700px;"/>

## Linear Discriminant Analysis - LDA

LDA finds the axes that maximize the separation between multiple classes (the classes are the categories of y)

- **Supervised** ML (it uses the y classes)

- From $m$ independent variables, **LDA** extracts $p\leq m$ new variables that best separate the classes of the dependent variable

- Suited to linearly separable problems

- More useful when the number of y classes is high, because at most (number of classes minus 1) components can be extracted
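
For the two-class case, the axis LDA looks for can be sketched with Fisher's criterion. The data below is synthetic and the variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class data in 2 dimensions
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(100, 2))

mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: spread of each class around its own mean
S_w = (X0 - mean0).T @ (X0 - mean0) + (X1 - mean1).T @ (X1 - mean1)

# Fisher's direction for two classes: w is proportional to S_w^{-1} (mean1 - mean0)
w = np.linalg.solve(S_w, mean1 - mean0)
w /= np.linalg.norm(w)

# Projecting onto w gives the single LDA component (n_classes - 1 = 1)
z0, z1 = X0 @ w, X1 @ w
print(z0.mean(), z1.mean())
```

Unlike PCA, the direction `w` is computed from the class labels: it pushes the class means apart while keeping each class compact.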

<img src=".\theory\pca_lda.png"  style="width: 700px;"/>

## Kernel PCA

- Used for more complex (non-linearly separable) problems

- It's a version of PCA where the data are mapped to a higher-dimensional space using the **kernel trick**

<img src=".\theory\kernel_trick.png"  style="width: 500px;"/>

- The principal components (combined variables) are then extracted from the higher-dimensional data
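
A minimal NumPy sketch of the idea, assuming an RBF kernel and synthetic ring-shaped data. The value of `gamma` is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two concentric rings, not separable by a straight line
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.repeat([1.0, 3.0], 100)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

gamma = 0.5  # assumed RBF width, chosen only for illustration

# RBF kernel: similarity of every pair of points in the implicit high-dimensional space
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-gamma * sq_dists)

# Center the kernel matrix (equivalent to centering the data in the mapped space)
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Principal components in the mapped space come from the eigenvectors of K_centered
eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
order = np.argsort(eigenvalues)[::-1][:2]
X_kpca = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

print(X_kpca.shape)
```

The mapped coordinates are never computed explicitly; only the pairwise kernel values are needed, which is what the kernel trick refers to.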

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

In [2]:
census_data = pd.read_csv('census.csv')
census_data

Unnamed: 0,age,workclass,final-weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loos,hour-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [3]:
x_census = census_data.iloc[:,0:14].values
x_census

array([[39, ' State-gov', 77516, ..., 0, 40, ' United-States'],
       [50, ' Self-emp-not-inc', 83311, ..., 0, 13, ' United-States'],
       [38, ' Private', 215646, ..., 0, 40, ' United-States'],
       ...,
       [58, ' Private', 151910, ..., 0, 40, ' United-States'],
       [22, ' Private', 201490, ..., 0, 20, ' United-States'],
       [52, ' Self-emp-inc', 287927, ..., 0, 40, ' United-States']],
      dtype=object)

In [4]:
y_census = census_data.iloc[:,14].values
y_census

array([' <=50K', ' <=50K', ' <=50K', ..., ' <=50K', ' <=50K', ' >50K'],
      dtype=object)

In [5]:
label_encoder_workclass = LabelEncoder()
label_encoder_education = LabelEncoder()
label_encoder_marital = LabelEncoder()
label_encoder_occupation   = LabelEncoder()
label_encoder_relationship = LabelEncoder()
label_encoder_race = LabelEncoder()
label_encoder_sex  = LabelEncoder()
label_encoder_country = LabelEncoder()

x_census[:,1] = label_encoder_workclass.fit_transform(x_census[:,1])
x_census[:,3] = label_encoder_education.fit_transform(x_census[:,3])
x_census[:,5] = label_encoder_marital.fit_transform(x_census[:,5])
x_census[:,6] = label_encoder_occupation.fit_transform(x_census[:,6])
x_census[:,7] = label_encoder_relationship.fit_transform(x_census[:,7])
x_census[:,8] = label_encoder_race.fit_transform(x_census[:,8])
x_census[:,9] = label_encoder_sex.fit_transform(x_census[:,9])
x_census[:,13] = label_encoder_country.fit_transform(x_census[:,13])

In [6]:
x_census

array([[39, 7, 77516, ..., 0, 40, 39],
       [50, 6, 83311, ..., 0, 13, 39],
       [38, 4, 215646, ..., 0, 40, 39],
       ...,
       [58, 4, 151910, ..., 0, 40, 39],
       [22, 4, 201490, ..., 0, 20, 39],
       [52, 5, 287927, ..., 0, 40, 39]], dtype=object)
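
To see what `LabelEncoder` does to a column, here is a small sketch on a few workclass values taken from the rows shown earlier:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small sample of workclass values as they appear in the data (note the leading space)
workclass = np.array([' State-gov', ' Self-emp-not-inc', ' Private', ' Private'])

encoder = LabelEncoder()
codes = encoder.fit_transform(workclass)

print(encoder.classes_)                  # the categories, sorted alphabetically
print(codes)                             # position of each value in classes_
print(encoder.inverse_transform(codes))  # back to the original strings
```

The codes are positions in a sorted list of categories, so their numeric order carries no meaning of its own. The codes in this sample differ from those in the full dataset, which has more categories.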

In [7]:
from sklearn.preprocessing import StandardScaler #standardization
scaler = StandardScaler()
x_census = scaler.fit_transform(x_census)
x_census

array([[ 0.03067056,  2.15057856, -1.06361075, ..., -0.21665953,
        -0.03542945,  0.29156857],
       [ 0.83710898,  1.46373585, -1.008707  , ..., -0.21665953,
        -2.22215312,  0.29156857],
       [-0.04264203,  0.09005041,  0.2450785 , ..., -0.21665953,
        -0.03542945,  0.29156857],
       ...,
       [ 1.42360965,  0.09005041, -0.35877741, ..., -0.21665953,
        -0.03542945,  0.29156857],
       [-1.21564337,  0.09005041,  0.11095988, ..., -0.21665953,
        -1.65522476,  0.29156857],
       [ 0.98373415,  0.77689313,  0.92989258, ..., -0.21665953,
        -0.03542945,  0.29156857]])
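
Standardization matters for PCA because components follow the directions of largest variance, and an unscaled column such as final-weight would dominate. A minimal check of what `StandardScaler` computes, using the age and final-weight values of the first four rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# age and final-weight of the first four rows: very different scales
X = np.array([[39.0, 77516.0],
              [50.0, 83311.0],
              [38.0, 215646.0],
              [53.0, 234721.0]])

# z = (x - mean) / standard deviation, computed column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

The values differ from the notebook output above because the mean and standard deviation here come from four rows only.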

In [8]:
from sklearn.model_selection import train_test_split
x_census_training, x_census_test, y_census_training, y_census_test = train_test_split(x_census,
                                                                                      y_census,
                                                                                      test_size = 0.15,
                                                                                      random_state = 0)

In [9]:
x_census_training.shape, x_census_test.shape, y_census_training.shape, y_census_test.shape

((27676, 14), (4885, 14), (27676,), (4885,))

# PCA

In [10]:
from sklearn.decomposition import PCA

In [17]:
pca = PCA(n_components=8)  # reduce the 14 original features to 8 components

x_census_training_pca = pca.fit_transform(x_census_training)
x_census_test_pca = pca.transform(x_census_test)

In [18]:
x_census_training_pca.shape, x_census_test_pca.shape

((27676, 8), (4885, 8))

In [19]:
pca.explained_variance_ratio_

array([0.151561  , 0.10109701, 0.08980379, 0.08076277, 0.07627678,
       0.07357646, 0.06772289, 0.06690789])

In [20]:
pca.explained_variance_ratio_.sum()  # the 8 components explain about 70% of the variance (all 14 would explain 100%)

0.7077085943199323
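
The ratios can also be used the other way around, to find how many components are needed to reach a target share of variance. A small sketch using the ratios printed above; `components_for_variance` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np

# Ratios reported by pca.explained_variance_ratio_ above
ratios = np.array([0.151561, 0.10109701, 0.08980379, 0.08076277,
                   0.07627678, 0.07357646, 0.06772289, 0.06690789])

def components_for_variance(ratios, threshold):
    """Smallest number of leading components whose ratios sum to >= threshold.

    Hypothetical helper; returns None if the threshold is never reached.
    """
    cumulative = np.cumsum(ratios)
    index = np.searchsorted(cumulative, threshold)
    return int(index) + 1 if index < len(cumulative) else None

print(np.cumsum(ratios))
print(components_for_variance(ratios, 0.50))  # 6
print(components_for_variance(ratios, 0.90))  # None: 8 components reach only about 0.71
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver='full'`), in which case it keeps as many components as are needed to explain that fraction of the variance.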

## Random Forest

In [33]:
from sklearn.ensemble import RandomForestClassifier
random_f_census = RandomForestClassifier(criterion='entropy',
                                         #min_samples_leaf=1,
                                         #min_samples_split=5,
                                         n_estimators=40,
                                         random_state=0)
random_f_census.fit(x_census_training_pca,y_census_training)

In [34]:
from sklearn.metrics import accuracy_score
prediction = random_f_census.predict(x_census_test_pca)
accuracy_score(y_census_test,prediction)

0.8372569089048106

In [35]:
from sklearn.metrics import classification_report
print(classification_report(y_census_test, prediction))

              precision    recall  f1-score   support

       <=50K       0.87      0.93      0.90      3693
        >50K       0.72      0.55      0.62      1192

    accuracy                           0.84      4885
   macro avg       0.79      0.74      0.76      4885
weighted avg       0.83      0.84      0.83      4885
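
In this notebook the scaler is fitted on the full dataset before the train/test split. One possible alternative, sketched here on synthetic data rather than the census file, is to chain the steps in a `Pipeline` so that scaling and PCA are learned from the training rows only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded census features (14 columns, 2 classes)
X, y = make_classification(n_samples=1000, n_features=14, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    random_state=0)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=8)),
    ('forest', RandomForestClassifier(criterion='entropy', n_estimators=40,
                                      random_state=0)),
])

# fit() learns the scaling and the components from the training rows only;
# score() applies those same fitted steps to the test rows
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

The accuracy printed here refers to the synthetic data and is not comparable with the census results above.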



# Kernel PCA

In [28]:
from sklearn.decomposition import KernelPCA

In [29]:
kpca = KernelPCA(n_components=8, kernel='rbf')  # rbf is the usual choice for non-linearly separable problems

x_census_training_kpca = kpca.fit_transform(x_census_training)
x_census_test_kpca = kpca.transform(x_census_test)

In [30]:
x_census_training_kpca.shape, x_census_test_kpca.shape

((27676, 8), (4885, 8))

## Random Forest

In [36]:
random_f_census = RandomForestClassifier(criterion='entropy',
                                         #min_samples_leaf=1,
                                         #min_samples_split=5,
                                         n_estimators=40,
                                         random_state=0)
random_f_census.fit(x_census_training_kpca,y_census_training)

In [37]:
prediction = random_f_census.predict(x_census_test_kpca)
accuracy_score(y_census_test,prediction)

0.8235414534288639

In [38]:
print(classification_report(y_census_test, prediction))

              precision    recall  f1-score   support

       <=50K       0.86      0.92      0.89      3693
        >50K       0.68      0.52      0.59      1192

    accuracy                           0.82      4885
   macro avg       0.77      0.72      0.74      4885
weighted avg       0.81      0.82      0.81      4885



# LDA

The class is **income**

In [39]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [41]:
x_census_training.shape, y_census_training.shape

((27676, 14), (27676,))

In [42]:
lda = LinearDiscriminantAnalysis(n_components=1)

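# n_components must be <= min(n_features, n_classes - 1) = min(14, 2 - 1) = 1, so LDA can extract only 1 component here.
# A larger value (such as the 8 used for PCA) raises:
# ValueError: n_components cannot be larger than min(n_features, n_classes - 1).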

x_census_training_lda = lda.fit_transform(x_census_training, y_census_training)
x_census_test_lda = lda.transform(x_census_test)

In [43]:
x_census_training_lda.shape, x_census_test_lda.shape

((27676, 1), (4885, 1))
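
The limit on `n_components` can be reproduced on synthetic data. This sketch assumes random features and labels, so the components themselves are meaningless and only the shapes and the error are of interest:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 14))         # hypothetical data with 14 features
y_two = rng.integers(0, 2, size=300)   # 2 classes, like income
y_four = rng.integers(0, 4, size=300)  # 4 classes

# With 2 classes, asking for 2 components fails
try:
    LinearDiscriminantAnalysis(n_components=2).fit(X, y_two)
    raised = False
except ValueError as error:
    raised = True
    print(error)

# With 4 classes, up to min(14, 4 - 1) = 3 components are allowed
X_lda = LinearDiscriminantAnalysis(n_components=3).fit_transform(X, y_four)
print(X_lda.shape)
```

This is why a binary target such as income leaves LDA with a single component, while PCA and Kernel PCA could keep 8.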

## Random Forest

In [44]:
random_f_census = RandomForestClassifier(criterion='entropy',
                                         #min_samples_leaf=1,
                                         #min_samples_split=5,
                                         n_estimators=40,
                                         random_state=0)
random_f_census.fit(x_census_training_lda,y_census_training)

In [45]:
prediction = random_f_census.predict(x_census_test_lda)
accuracy_score(y_census_test,prediction)

0.7334698055271238

In [46]:
print(classification_report(y_census_test, prediction))

              precision    recall  f1-score   support

       <=50K       0.82      0.82      0.82      3693
        >50K       0.45      0.45      0.45      1192

    accuracy                           0.73      4885
   macro avg       0.64      0.64      0.64      4885
weighted avg       0.73      0.73      0.73      4885

