# Dimensionality reduction

In this task you will practice dimensionality reduction.
Use code cells to answer the Tasks and Markdown cells for the Questions (Q's).

In [None]:
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
%matplotlib inline 


# Load data

In [None]:
(X, y) = load_wine(return_X_y=True, as_frame=True)

# split X into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)

Let's take a quick look at the data:

In [None]:
pd.DataFrame(X_train).describe()

# PCA + SVM

Task 1: Use X_train, y_train to train an SVM (sklearn's SVC) with the default parameters. You can read more about the algorithm in sklearn's documentation.
Make sure you normalize the data by using StandardScaler.
Evaluate the algorithm using accuracy score and X_test, y_test.

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)
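
A single train/test split gives only one accuracy estimate. As a rough complement, the `cross_val_score` helper imported above can score the same pipeline over several folds. This is a minimal sketch, not part of the task: the names `X_cv`, `y_cv` and `cv_scores` and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative names; the notebook itself works with X_train / X_test.
X_cv, y_cv = load_wine(return_X_y=True)

# The scaler sits inside the pipeline, so it is refit on each training fold
# and never sees the held-out fold.
cv_pipe = make_pipeline(StandardScaler(), SVC())
cv_scores = cross_val_score(cv_pipe, X_cv, y_cv, cv=5)

print("fold accuracies:", cv_scores)
print("mean accuracy:", cv_scores.mean())
```

Keeping the scaler inside the pipeline is the design point here: scaling the full dataset before splitting would leak test-fold statistics into training.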

Task 2: Now do the same, but use PCA.

In this task, we want to keep all of the variance! No data is going to be discarded.
You are asked to use the maximal number of components for PCA.

Q1: Your co-worker says that the results should be at least as good as the results we had without PCA. Explain why he might say that.
    
> PCA just rotates the dataset so that the new coordinate axes (the principal components) are ordered by the variance they capture.
>
> Using all PCs means we're using exactly the same data in a different coordinate system. Since the model gets the same information to work with, it makes sense that we won't get different results (barring some regularization etc.)

Print the accuracy of SVM + PCA.

In [None]:
pca_pipe = make_pipeline(StandardScaler(), PCA(), SVC()).fit(X_train, y_train)
y_pred_pca= pca_pipe.predict(X_test)
accuracy_score(y_test, y_pred_pca)

Q2: Did the results improve, stay the same, or get worse?
    
> Got the same results.
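
The "same data, different coordinate system" argument can be checked numerically. The sketch below assumes the default RBF kernel of `SVC`, which depends on the inputs only through pairwise distances, and verifies that PCA with all components leaves those distances unchanged. The names `X_demo`, `X_std`, `X_rot` and `max_gap` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_demo)

# PCA() with no n_components keeps every component: a centering plus a rotation.
X_rot = PCA().fit_transform(X_std)

# Compare all pairwise Euclidean distances before and after the rotation.
same_shape = X_rot.shape == X_std.shape
max_gap = np.abs(pairwise_distances(X_std) - pairwise_distances(X_rot)).max()

print("same shape:", same_shape)
print("distances preserved:", max_gap < 1e-8)
```

If the distances match up to floating-point error, a distance-based kernel sees the same inputs in both pipelines, which is consistent with the identical accuracy observed above.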

# PCA + logistic regression

Task 3: Repeat Task 1 with logistic regression.

In [None]:
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy_score(y_test, y_pred)

Task 4: Repeat Task 2 with logistic regression.

In [None]:

pca_pipe = make_pipeline(StandardScaler(), PCA(), LogisticRegression()).fit(X_train, y_train)
y_pred_pca = pca_pipe.predict(X_test)
accuracy_score(y_test, y_pred_pca)

Q3: Did the results improve, stay the same, or get worse?


Q4: How can you explain the difference between the answers to Q2 and Q3? Hint: think about the nature of logistic regression and how SVM mainly differs from it. Hint: SVM assumes the data can be separated by a hyperplane.
> TODO

# Visualizing

Task 5: Use locally linear embedding in sklearn to visualize the data. Plot the results.
Optimize n_neighbors by running at least 5 times and use the best-looking result you can find.

In [None]:
from sklearn.manifold import locally_linear_embedding
import matplotlib.pyplot as plt
import seaborn as sns

ns = range(45, 146, 25)

fig, axs = plt.subplots(1, 5, sharex=True, sharey=True)
fig.set_figwidth(20)
fig.set_figheight(4)


X_scaled = StandardScaler().fit_transform(X)

for ax, n in zip(axs.flatten(), ns):
    if n == 95:
        # Highlight the panel that looked best with a thick frame
        for side in ['bottom', 'top', 'right', 'left']:
            ax.spines[side].set_color('0.2')
            ax.spines[side].set_linewidth(4)
        ax.set_title(f'This one seems ok (n={n})')
    ax.set_box_aspect(1)
    # locally_linear_embedding returns (embedding, reconstruction_error)
    X_emb = locally_linear_embedding(X_scaled, n_neighbors=n, n_components=2)[0]
    sns.scatterplot(x=X_emb[:, 0], y=X_emb[:, 1], hue=y, ax=ax, legend=False)

fig.tight_layout()


Task 6: Use t-SNE to visualize the data. Plot the results.

In [None]:
from yellowbrick.features.manifold import manifold_embedding
manifold_embedding(X_scaled, y, normalized_stress='auto', manifold='tsne');

Task 7: Use UMAP to visualize the data. Plot the results.

In [10]:
import umap
X_emb = umap.UMAP().fit_transform(X_scaled)
sns.scatterplot(x=X_emb[:,0], y=X_emb[:,1], hue=y, palette='tab10')

Q5: If we run one of these visualization algorithms several times with the default parameters, are we guaranteed to see the same results? Why?
    
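> LLE is deterministic here and would yield the same results.
>
> UMAP and t-SNE are not guaranteed to: both depend on a random number generator (for initialization and, in UMAP's case, stochastic gradient descent), so results can vary between runs unless `random_state` is fixed.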
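
The effect of the seed can be checked directly with sklearn's `TSNE`. The sketch below is a minimal illustration, not the notebook's own plotting code: the helper `embed` and the names `X_rep`, `emb_a`, `emb_b`, `emb_c` are made up for this example, `init="random"` is chosen to make the dependence on the seed explicit, and `method="exact"` is used because the dataset is small. `umap.UMAP` also accepts a `random_state` argument, which is not exercised here.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_rep, _ = load_wine(return_X_y=True)
X_rep = StandardScaler().fit_transform(X_rep)


def embed(seed):
    """Run t-SNE with a fixed seed and return the 2-D embedding."""
    tsne = TSNE(n_components=2, init="random", method="exact", random_state=seed)
    return tsne.fit_transform(X_rep)


emb_a = embed(0)
emb_b = embed(0)  # same seed as emb_a
emb_c = embed(1)  # different seed

same_seed_match = np.allclose(emb_a, emb_b)
different_seed_match = np.allclose(emb_a, emb_c)

print("same seed gives the same embedding:", same_seed_match)
print("different seed gives the same embedding:", different_seed_match)
```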
