### What is Dimensionality Reduction?
* Dimensionality reduction is the process of reducing the number of dimensions in your feature set. The feature set could be a dataset with a hundred columns (i.e. features), or an array of points that make up a large sphere in three-dimensional space. Dimensionality reduction means bringing the number of columns down to, say, twenty, or projecting the sphere onto a circle in two-dimensional space.

###### PCA (Principal Component Analysis): Popularly used for dimensionality reduction in continuous data, PCA rotates the data onto new orthogonal axes ordered by decreasing variance and projects it onto the first few of them. These axes, each a linear combination of the original features, are the principal components.

### Objectives
* Introduction & Concepts
* Properties of PCA
* Application of PCA

### Introduction 
* The main idea is to reduce the dimensionality of the data
* The dimensions of the data are its columns
* Feature selection means choosing a subset of the important original features
* Dimensionality reduction, in contrast, derives new features (m) from the original features (n)
* m < n
* The reduction should not cost much accuracy
* This is achieved by transforming the variables (columns) into a new set of variables known as principal components
* Each principal component has an associated explained variance that tells how important it is in representing the data
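
The points above can be sketched without scikit-learn. The following is a minimal illustration, assuming a small synthetic dataset (the data and the variable names are made up for this example): it centres the columns, takes the eigenvectors of the covariance matrix as the new axes, and keeps the m axes with the largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): 200 samples, n = 3 features,
# where the second feature is almost a multiple of the first.
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# 1. Centre each column.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features (n x n).
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition. eigh returns eigenvalues in ascending order,
#    so reorder them from largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep m < n directions and project the data onto them.
m = 2
components = eigvecs[:, :m]
X_reduced = X_centered @ components

print(X_reduced.shape)  # m derived columns instead of n
print(eigvals)          # variance along each principal direction
```

The eigenvalues play the role of the importance parameter mentioned in the last bullet: the variance of each derived column equals the corresponding eigenvalue.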

#### Importing Libraries

In [19]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [4]:
# loading data set
breast_cancer_data = load_breast_cancer()

In [7]:
# creating minmax object
mm = MinMaxScaler()

In [9]:
data = mm.fit_transform(breast_cancer_data.data)

In [10]:
data.shape # right now we have 569 rows and 30 columns

(569, 30)

In [11]:
# pca object
pca = PCA(n_components=10)

In [12]:
res = pca.fit_transform(data)

In [16]:
res

array([[ 1.38702121,  0.42689533, -0.54170264, ..., -0.03945588,
         0.07759022,  0.1552944 ],
       [ 0.46230825, -0.55694674, -0.20517458, ...,  0.0206439 ,
        -0.07063961, -0.08528526],
       [ 0.95462147, -0.10970115, -0.1478484 , ...,  0.00736216,
        -0.05933439, -0.07368895],
       ...,
       [ 0.22631131, -0.28794577,  0.31522402, ...,  0.01570717,
        -0.10833436, -0.07691075],
       [ 1.67783369,  0.33594595,  0.29611601, ..., -0.09247446,
         0.0837925 ,  0.0054401 ],
       [-0.90506804, -0.10410875,  0.38285992, ...,  0.03943239,
        -0.01986765,  0.10546525]])

In [17]:
res.shape # after reduction we have 10 columns 

(569, 10)

In [18]:
pca.explained_variance_

array([0.33133389, 0.10785038, 0.0443947 , 0.04000678, 0.02549742,
       0.01916637, 0.00986455, 0.00743488, 0.00616788, 0.00589966])
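
The raw values in explained_variance_ are easier to read as fractions of the total variance. A small sketch, assuming the same min-max scaled breast cancer data as above, uses the explained_variance_ratio_ attribute and its cumulative sum:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(load_breast_cancer().data)
pca = PCA(n_components=10).fit(X)

# Fraction of the total variance captured by each component,
# and the running total as components are added.
ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(ratio)

for i, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{i}: {r:.3f} (cumulative {c:.3f})")
```

The cumulative column shows how much of the original variance survives when only the first k components are kept.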

### Notes
* Scaling should be done before PCA, since PCA is sensitive to the scale of each feature
* explained_variance_ tells how important each derived feature is
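
To see why scaling matters, one possible check is to fit PCA on the raw and on the standardized breast cancer features and compare how the variance is distributed. This is only a sketch; StandardScaler is used here as one scaling choice among several.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# Without scaling, features measured in large units dominate the variance.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After standardization every feature has unit variance before PCA.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

On the unscaled data the first component takes a much larger share of the variance, because it mostly tracks the few columns with the largest numeric range.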

In [20]:
pca.components_.shape

(10, 30)

###### Creating a new DataFrame using pandas

In [23]:
df = pd.DataFrame({'A':[1,2,3,4,5],'B':[6,7,8,9,10]})

In [24]:
df

| | A | B |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 2 | 7 |
| 2 | 3 | 8 |
| 3 | 4 | 9 |
| 4 | 5 | 10 |


In [26]:
pca = PCA(n_components=1)

In [27]:
pca.fit_transform(df)

array([[ 2.82842712],
       [ 1.41421356],
       [-0.        ],
       [-1.41421356],
       [-2.82842712]])

In [28]:
pca.components_.shape

(1, 2)

In [29]:
df

| | A | B |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 2 | 7 |
| 2 | 3 | 8 |
| 3 | 4 | 9 |
| 4 | 5 | 10 |


In [30]:
df['C']=20

In [31]:
pca = PCA(n_components=1)

In [32]:
pca.fit_transform(df)

array([[ 2.82842712],
       [ 1.41421356],
       [-0.        ],
       [-1.41421356],
       [-2.82842712]])

In [34]:
pca.explained_variance_

array([5.])
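
Adding the constant column C does not change the result because a constant has zero variance, so PCA gives it no weight. The sketch below rebuilds the same small DataFrame and inspects the per-column variance and the weights of the single component:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [20, 20, 20, 20, 20],
})

# Sample variance of each column: A and B vary, C does not.
print(df.var())

pca = PCA(n_components=1).fit(df)

# Weights of A, B and C in the first principal component.
print(pca.components_)
print(pca.explained_variance_)
```

A and B are perfectly correlated and each has a sample variance of 2.5, so the single component carries their combined variance of 5, while the weight on C is zero up to floating-point error.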

In [35]:
df

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 6 | 20 |
| 1 | 2 | 7 | 20 |
| 2 | 3 | 8 | 20 |
| 3 | 4 | 9 | 20 |
| 4 | 5 | 10 | 20 |


In [37]:
# breast cancer dataset

data = breast_cancer_data.data
target = breast_cancer_data.target

In [38]:
# StandardScaler object (the pipeline below creates its own instance)
ss = StandardScaler()

In [39]:
# creating pipeline
pipeline = make_pipeline(StandardScaler(),PCA(n_components=10),DecisionTreeClassifier())

In [40]:
pipeline

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=10)),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

In [41]:
gs = GridSearchCV(pipeline, param_grid = {'pca__n_components':[10,11,12,13,14]},cv=5)

In [42]:
# splitting the data into train and test sets
trainX, testX, trainY, testY = train_test_split(data,target)

In [43]:
gs.fit(trainX,trainY)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('pca', PCA(n_components=10)),
                                       ('decisiontreeclassifier',
                                        DecisionTreeClassifier())]),
             param_grid={'pca__n_components': [10, 11, 12, 13, 14]})

In [44]:
gs.best_params_

{'pca__n_components': 11}

In [45]:
gs.best_score_

0.9343091655266758

In [46]:
gs.score(testX, testY)

0.8951048951048951
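
best_params_ reports only the winning setting. To see how close the other candidates were, the cv_results_ attribute of the fitted search can be put into a DataFrame. The sketch below repeats the decision tree search; the fixed random_state values are an assumption added for reproducibility, so the scores will not match the ones printed above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# random_state is fixed here only to make the sketch repeatable.
trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(),
    DecisionTreeClassifier(random_state=0),
)
gs = GridSearchCV(
    pipeline,
    param_grid={'pca__n_components': [10, 11, 12, 13, 14]},
    cv=5,
)
gs.fit(trainX, trainY)

# Mean and spread of the cross-validated accuracy for every candidate.
results = pd.DataFrame(gs.cv_results_)[
    ['param_pca__n_components', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False))
```

If the mean scores differ by less than their standard deviations, the choice between candidates is likely within noise, and a different train/test split may pick a different value.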

In [47]:
# creating pipeline using logistic Regression
pipeline = make_pipeline(StandardScaler(),PCA(n_components=10),LogisticRegression())

In [48]:
pipeline

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('pca', PCA(n_components=10)),
                ('logisticregression', LogisticRegression())])

In [49]:
# Hyperparameter tuning
gs = GridSearchCV(pipeline, param_grid={'pca__n_components':[10,11,12,13,14]},cv=5)

In [50]:
gs.fit(trainX, trainY)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('standardscaler', StandardScaler()),
                                       ('pca', PCA(n_components=10)),
                                       ('logisticregression',
                                        LogisticRegression())]),
             param_grid={'pca__n_components': [10, 11, 12, 13, 14]})

In [51]:
gs.best_params_

{'pca__n_components': 11}

In [52]:
gs.best_score_

0.97890560875513

In [53]:
gs.score(testX, testY)

0.986013986013986
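
As an alternative to searching over a fixed list of component counts, PCA also accepts a fraction between 0 and 1 for n_components and then keeps as many components as are needed to reach that share of the variance. The 0.95 threshold below is an arbitrary choice for illustration, and svd_solver='full' is set explicitly to be safe across scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# Keep the smallest number of components whose explained variance
# adds up to at least 95% of the total.
pca = PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                     # number of components kept
print(pca.explained_variance_ratio_.sum())   # variance actually retained
```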

In [55]:
# creating simple model
lr  = LogisticRegression()

In [56]:
lr.fit(trainX, trainY)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [57]:
lr.score(testX, testY)

0.9230769230769231
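
The warning above suggests either raising max_iter or scaling the data. A minimal sketch of the second option wraps the same simple model in a pipeline with StandardScaler; the random_state on the split is an assumption added for repeatability, so the score will differ from the one shown above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# random_state is fixed here only to make the sketch repeatable.
trainX, testX, trainY, testY = train_test_split(X, y, random_state=0)

# Standardize the features first so the solver sees columns on a common scale.
lr = make_pipeline(StandardScaler(), LogisticRegression())
lr.fit(trainX, trainY)

score = lr.score(testX, testY)
print(score)
```

This makes for a fairer baseline when comparing against the PCA pipelines, which already include a scaling step.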

### Advantages of using PCA
* Reduces the dimension of the data, which can shorten training time
* Since the derived features are uncorrelated and fewer in number, simple (linear) models might work better.