### What are the main motivations for reducing a dataset’s dimensionality? 
### What are the main drawbacks? 

MOTIVATION
1. As the number of features grows, a model becomes more complex and fits the training data more closely, which increases the risk of overfitting. Dimensionality reduction (DR) can reduce that risk.
2. Fewer features mean less computing time and lower memory use.
3. DR can remove redundant features and some of the noise.
4. Finding correlated features by hand is nearly impossible in a large dataset. PCA handles this efficiently, and the resulting principal components are uncorrelated with one another.
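
As a rough illustration of point 4, the sketch below builds a small synthetic dataset (the data and variable names are made up for this example) in which two features are strongly correlated, then compares the correlation matrix before and after PCA. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two strongly
# correlated features plus one unrelated feature.
a = rng.normal(size=500)
X = np.column_stack([
    a,
    2.0 * a + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

# Correlation between the original features
feature_corr = np.corrcoef(X, rowvar=False)

# Correlation between the principal components
Z = PCA().fit_transform(X)
component_corr = np.corrcoef(Z, rowvar=False)

print(np.round(feature_corr, 2))
print(np.round(component_corr, 2))
```

The off-diagonal entries of the second matrix should be numerically zero, which is what "uncorrelated" means here. It does not imply full statistical independence.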

DRAWBACKS
1. It can result in information loss.
2. DR does not always improve a model's accuracy; the effect varies from model to model and from dataset to dataset.
3. The transformed features are less interpretable than the original variables.


### What are other applications of PCA (other than visualizing data)?

1. Data compression: reducing the dimensionality of the data to speed up model training, and sometimes to improve accuracy.
2. Image processing: storing an image with far fewer values than it has pixels while retaining most of its shape.
3. Removing linear dependencies (correlation) between features.
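
The compression use case can be sketched on a small scale. The example below is an assumption-laden illustration, not part of the exercise: it uses scikit-learn's bundled 8x8 digits dataset instead of MNIST so that nothing has to be downloaded, compresses the images with PCA, and restores them with `inverse_transform` to measure how much is lost.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1,797 images of 8x8 pixels, flattened to 64 features each
X = load_digits().data

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_compressed = pca.fit_transform(X)

# Map the compressed representation back to pixel space
X_restored = pca.inverse_transform(X_compressed)

# Mean squared reconstruction error: the information that was lost
mse = np.mean((X - X_restored) ** 2)

print("original features: ", X.shape[1])
print("kept components:   ", pca.n_components_)
print("reconstruction MSE:", mse)
```

The restored images are close to the originals but not identical. The nonzero reconstruction error is the information loss listed among the drawbacks.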


### What are the limitations of PCA?

1. PCA is sensitive to feature scale, so the data usually needs to be standardized first; otherwise the features with the largest variance dominate the principal components.
2. The number of principal components must be chosen with care, or too much information may be lost.
3. PCA is not well suited to non-continuous (for example, categorical) data.
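
To make the first limitation concrete, here is a minimal sketch with two made-up, unrelated features on very different scales (the "height" and "income" columns are hypothetical). It compares the explained variance ratio with and without `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: height in metres and income in dollars
height = rng.normal(1.7, 0.1, size=300)
income = rng.normal(50_000, 15_000, size=300)
X = np.column_stack([height, income])

# Without scaling, the large-variance feature dominates the first component
raw_ratio = PCA().fit(X).explained_variance_ratio_

# After standardization both features contribute on an equal footing
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA().fit(X_std).explained_variance_ratio_

print("without standardization:", raw_ratio)
print("with standardization:   ", std_ratio)
```

On the raw data the first component captures almost all the variance simply because income is measured in larger numbers. After standardization the variance is split roughly evenly, as expected for two unrelated features.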

### Load the MNIST dataset and split it into a training set and a test set
### Take the first 60,000 instances for training, and the remaining 10,000 for testing.


In [1]:
import warnings

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

# Download MNIST from OpenML: 70,000 images with 784 pixel features each
mnist = fetch_openml('mnist_784')

# shuffle=False keeps the original order: first 60,000 rows train, last 10,000 test
x_train, x_test, y_train, y_test = train_test_split(mnist.data, mnist.target, test_size=10000, shuffle=False)


### Train a Random Forest classifier on the dataset and time how long it takes, 
### then evaluate the resulting model on the test set. 

In [None]:
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = RandomForestClassifier()
LRmodel = LogisticRegression()

# Time the Random Forest fit on the raw 784-pixel features
start = time.time()
model.fit(x_train, y_train)
end = time.time()

# Time the logistic regression fit on the same data
LRstart = time.time()
LRmodel.fit(x_train, y_train)
LRend = time.time()

# accuracy_score expects (y_true, y_pred)
y_pred = model.predict(x_test)
print("Accuracy of Random Forest without PCA = ", accuracy_score(y_test, y_pred))
print("Time taken by Random Forest classifier without PCA : ", end - start)

print()

y_pred = LRmodel.predict(x_test)
print("Accuracy of Logistic Regression without PCA = ", accuracy_score(y_test, y_pred))
print("Time taken by Logistic Regression without PCA : ", LRend - LRstart)

### Next, use PCA to reduce the dataset’s dimensionality, with an explained variance ratio of 95%.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to both sets
scaler_model = StandardScaler()
scaler_model.fit(x_train)

x_train_std = scaler_model.transform(x_train)
x_test_std = scaler_model.transform(x_test)

# A float n_components keeps the fewest components explaining at least 95% of the variance
pca_model = PCA(n_components=0.95)
pca_model.fit(x_train_std)

# Despite the _df suffix, transform returns NumPy arrays
trans_train_data_df = pca_model.transform(x_train_std)
trans_test_data_df = pca_model.transform(x_test_std)

#print(trans_train_data_df.shape)
#print(trans_test_data_df.shape)
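
How `PCA(n_components=0.95)` arrives at a component count can be checked by hand. The sketch below is an illustration on scikit-learn's small bundled digits dataset (an assumption made to avoid downloading MNIST): it fits a full PCA, accumulates the explained variance ratios, and finds the smallest number of components that reaches 95%.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Fit with all components, then find the smallest count reaching 95% variance
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
n_needed = int(np.argmax(cumulative >= 0.95)) + 1

# Let scikit-learn pick the count from the same threshold
auto_pca = PCA(n_components=0.95).fit(X)

print("components needed for 95% variance:", n_needed)
print("components chosen by PCA(0.95):    ", auto_pca.n_components_)
```

The two numbers should agree. Plotting `cumulative` against the component index is a common way to choose the threshold by eye instead of fixing it at 95%.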

### Train a new Random Forest classifier on the reduced dataset and see how long it takes.
### Was training much faster? 

In [None]:
# Refit on the PCA-reduced data; fit() discards the earlier fit and trains from scratch
start1 = time.time()
model.fit(trans_train_data_df, y_train)
end1 = time.time()

# accuracy_score expects (y_true, y_pred)
y_pred = model.predict(trans_test_data_df)
print("Accuracy of Random Forest with PCA = ", accuracy_score(y_test, y_pred))
print("Time taken by Random Forest classifier with PCA : ", end1 - start1)

print()

LRstart1 = time.time()
LRmodel.fit(trans_train_data_df, y_train)
LRend1 = time.time()

y_pred = LRmodel.predict(trans_test_data_df)
print("Accuracy of Logistic Regression with PCA = ", accuracy_score(y_test, y_pred))
print("Time taken by Logistic Regression with PCA : ", LRend1 - LRstart1)


### Next evaluate the classifier on the test set: how does it compare to the previous classifier?

OBSERVATIONS

Logistic regression takes longer to train when the data is fed in as is. After PCA, training takes much less time, and there is also a slight increase in accuracy.

For the Random Forest, training after PCA takes much longer than training on the raw data, and accuracy drops when the reduced data is used.

CONCLUSION

PCA does not always give a better result. The outcome depends on the model being used and on the type of data.
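
The same comparison can be repeated quickly on other models or datasets. The sketch below is one possible way to do it, not the notebook's own method: it uses scikit-learn's small digits dataset instead of MNIST, a hypothetical helper named `fit_and_score`, and pipelines so that scaling and PCA are fitted on the training split only. Timings will vary from machine to machine.

```python
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def fit_and_score(estimator):
    """Return (training seconds, test accuracy) for an unfitted estimator."""
    start = time.perf_counter()
    estimator.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, estimator.score(X_test, y_test)


# Factories so that every run gets a fresh, unfitted model
candidates = {
    "random forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": lambda: LogisticRegression(max_iter=2000),
}

results = {}
for name, make_model in candidates.items():
    results[name, "no_pca"] = fit_and_score(
        make_pipeline(StandardScaler(), make_model())
    )
    results[name, "pca"] = fit_and_score(
        make_pipeline(StandardScaler(), PCA(n_components=0.95), make_model())
    )

for (name, variant), (seconds, accuracy) in results.items():
    print(f"{name:20s} {variant:6s} time={seconds:.3f}s accuracy={accuracy:.3f}")
```

Because the digits images are much smaller than MNIST, the timing differences here may not match the ones observed above, which is itself consistent with the conclusion that the effect of PCA depends on the data.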