<div class="alert alert-block alert-info">
<h2> There are several dimensionality reduction techniques. This notebook covers the most frequently used ones:
            PCA, Factor Analysis, SVD,
            t-SNE and
            UMAP.
</h2></div>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import os
import gc
import json
import math
import cv2
from PIL import Image
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
#matplotlib.interactive(False)
import scipy
from tqdm import tqdm
%matplotlib inline

from keras.preprocessing import image
from sklearn.ensemble import RandomForestClassifier

### In this kernel, I've tried to apply the common dimensionality reduction methods.

<div class="alert alert-block alert-warning"><h3> 1. Factor analysis</h3> <h3> 2. PCA </h3>
                                          <h3> 3. SVD </h3> <h3> 4. t-SNE </h3> <h3> 5. UMAP </h3></div>

In [None]:
train_df = pd.read_csv("../input/siim-isic-melanoma-classification/train.csv")
test_df = pd.read_csv("../input/siim-isic-melanoma-classification/test.csv")

<div class="alert alert-block alert-info">
<h3> Resizing and converting the images to array </h3></div>

In [None]:
def preprocess_image(image_path, desired_size=128):
    im = Image.open(image_path)
    im = im.resize((desired_size, )*2, resample=Image.LANCZOS)
    return im

In [None]:
def convert_to_array(df, size=128):
    N = df.shape[0]
    x_train = np.empty((N, size, size, 3), dtype=np.uint8)
    for i, image_id in enumerate(tqdm(df['image_name'])):
        x_train[i, :, :, :] = preprocess_image(
            f'../input/siim-isic-melanoma-classification/jpeg/train/{image_id}.jpg'
        )
    return x_train

##### The resized array loaded below comes from @tunguz's kernel (https://www.kaggle.com/tunguz/image-resizing-128x128-train), because converting the images to an array here takes a long time.

In [None]:
#x_train = convert_to_array(train_df)
x_train = np.load('../input/x-train-128npy/x_train_128.npy')

In [None]:
x_t = x_train[:2000]
x_t.shape
#train = x_train.reshape((x_train.shape[0], 128*128*3))

### As you can see above, the array is 4-dimensional: (images, height, width, channels), so each image is a 3-dimensional array. The upcoming techniques expect a 2-dimensional input of shape (samples, features), so each image must be flattened into a 1-dimensional vector:

In [None]:
image = []
for i in range(0,2000):
    img = x_t[i].flatten()
    image.append(img)
image = np.array(image)
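
The loop above can also be written as a single `reshape`. The following is a minimal sketch on a small random stand-in array (`batch` is a hypothetical name, not the notebook's `x_t`), showing that both approaches give the same (samples, features) matrix:

```python
import numpy as np

# Stand-in for x_t: 5 tiny RGB "images" (the notebook uses 2000 images of 128x128x3).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(5, 4, 4, 3), dtype=np.uint8)

# Loop version, mirroring the cell above.
flat_loop = np.array([img.flatten() for img in batch])

# Vectorized version: keep the sample axis, collapse all remaining axes.
flat_reshape = batch.reshape(batch.shape[0], -1)

print(flat_loop.shape, flat_reshape.shape)
print(np.array_equal(flat_loop, flat_reshape))
```

On the real data the equivalent call would be `x_t.reshape(x_t.shape[0], -1)`, which avoids building a Python list of 2000 arrays.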

In [None]:
feat_cols = ['pixel'+str(i) for i in range(image.shape[1])]
df = pd.DataFrame(image,columns=feat_cols)
df['label'] = train_df['benign_malignant'][:2000]
df['label1'] = train_df['target'][:2000]

<div class="alert alert-block alert-info">
<h1> 1. Factor Analysis </h1></div>

### In the Factor Analysis technique, variables are grouped by their correlations, i.e., all variables in a particular group have a high correlation among themselves, but a low correlation with variables of other groups. Each group is known as a factor. The number of factors is small compared to the original number of dimensions in the data.
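
To make the idea of correlated groups concrete, here is a small synthetic sketch (not the melanoma data). It assumes two hypothetical hidden factors and an illustrative `loadings` matrix, generates six observed variables from them, and inspects the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

# Two hypothetical hidden factors, independent of each other.
factors = rng.normal(size=(n_samples, 2))

# Illustrative loadings: variables 0-2 depend only on factor 0,
# variables 3-5 depend only on factor 1.
loadings = np.array([
    [1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9, 0.8],
])

# Observed variables = factor contribution + independent noise.
noise = 0.3 * rng.normal(size=(n_samples, 6))
observed = factors @ loadings + noise

corr = np.corrcoef(observed, rowvar=False)
within = corr[:3, :3]    # correlations inside the first group
between = corr[:3, 3:]   # correlations across the two groups

print(np.round(corr, 2))
```

The printed matrix should show two blocks of high correlations on the diagonal and values near zero elsewhere, which is the structure Factor Analysis tries to summarize with a few factors.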

In [None]:
from sklearn.decomposition import FactorAnalysis

FA = FactorAnalysis(n_components=2).fit_transform(df[feat_cols].values)

# Here, n_components will decide the number of factors in the transformed data. After transforming the data, it’s time to visualize the results:

fa_data = np.vstack((FA.T, df['label1'])).T
fa_df = pd.DataFrame(fa_data, columns=['1st Component', '2nd Component', 'Label'])
# Note: seaborn renamed FacetGrid's `size` argument to `height`
sns.FacetGrid(fa_df, hue="Label", height=10).map(plt.scatter, "1st Component", "2nd Component").add_legend()
plt.show()

## With 2000 images, the plot shows mostly blue points (label 0) and only a few orange ones (label 1), which suggests the classes are highly imbalanced.

<div class="alert alert-block alert-info">
<h1> 2. PCA </h1></div>

### PCA is a technique that extracts a new, smaller set of variables from an existing large set of variables. These newly extracted variables are called Principal Components. Some key points you should know about PCA before proceeding further:

    - A principal component is a linear combination of the original variables
    - Principal components are extracted in such a way that the first principal component explains maximum variance in the dataset
    - Second principal component tries to explain the remaining variance in the dataset and is uncorrelated to the first principal component
    - Third principal component tries to explain the variance which is not explained by the first two principal components and so on
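
The properties listed above can be checked numerically. The sketch below is a NumPy-only illustration on random toy data (not the notebook's images, and not scikit-learn's implementation): it computes principal components from the SVD of the centered data and verifies that the variances are sorted and the component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 5 correlated features.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Center the data, then take the SVD; rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Each score column is a linear combination of the original variables.
scores = Xc @ Vt.T

# Variance captured by each component, and its share of the total.
explained_variance = S**2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Covariance of the scores: diagonal holds the variances, off-diagonals are ~0.
score_cov = np.cov(scores, rowvar=False)

print(np.round(explained_ratio, 3))
print(np.round(score_cov, 3))
```

The ratios printed first are decreasing, matching the statement that each component explains the most remaining variance, and the near-zero off-diagonal entries show that the components are uncorrelated with each other.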

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(df[feat_cols].values)

### In this case, n_components will decide the number of principal components in the transformed data. 
### Let’s visualize how much variance has been explained using these 2 components. We will use explained_variance_ratio_ to calculate the same.

In [None]:
plt.figure(figsize=(12.5, 8))
plt.plot(range(2), pca.explained_variance_ratio_)
plt.plot(range(2), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")

### In the above graph, the blue line represents component-wise explained variance while the orange line represents the cumulative explained variance. Since only two components were fitted here, the last point of the orange line is the share of the dataset's variance explained by those two components together. Let us now visualize the data projected onto these components.

In [None]:
from sklearn.decomposition import PCA 
pca = PCA(n_components=2, random_state=42).fit_transform(df[feat_cols].values)
pca_data = np.vstack((pca.T, df['label1'])).T
pca_df = pd.DataFrame(pca_data, columns=['1st Component', '2nd Component', 'Label'])
# Note: seaborn renamed FacetGrid's `size` argument to `height`
sns.FacetGrid(pca_df, hue="Label", height=10).map(plt.scatter, "1st Component", "2nd Component").add_legend()
plt.show()

<div class="alert alert-block alert-info">
<h1> 3. SVD </h1></div>

### We can also use Singular Value Decomposition (SVD) to decompose our original dataset into its constituents, resulting in dimensionality reduction. SVD factors the data matrix into three constituent matrices (X = UΣVᵀ). Keeping only the largest singular values removes redundant directions from the dataset. The three matrices are closely tied to eigenvalues and eigenvectors: the squared singular values are the eigenvalues of XᵀX.
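
As a rough illustration of the decomposition, the NumPy sketch below factors a small random matrix (a stand-in, not the image data), rebuilds it from the three matrices, and then keeps only the top `k` singular values to get a reduced representation. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))   # 6 samples, 4 features

# Thin SVD: A = U @ diag(s) @ Vt, with singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_full = U @ np.diag(s) @ Vt  # exact reconstruction

# Keep the k largest singular values for a rank-k approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# Reduced representation: project the samples onto the top-k right singular vectors.
reduced = A @ Vt[:k].T        # shape (6, 2)

# Link to eigenvalues: squared singular values are the eigenvalues of A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]

print(reduced.shape)
print(np.round(s**2, 4), np.round(eigvals, 4))
```

Note that this sketch does not center the data first. scikit-learn's `TruncatedSVD` likewise skips centering, which is the main practical difference from PCA.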

### Let’s implement SVD and decompose our original variables:

In [None]:
from sklearn.decomposition import TruncatedSVD 
svd = TruncatedSVD(n_components=2, random_state=42).fit_transform(df[feat_cols].values)
svd_data = np.vstack((svd.T, df['label1'])).T
svd_df = pd.DataFrame(svd_data, columns=['1st Component', '2nd Component', 'Label'])
# Note: seaborn renamed FacetGrid's `size` argument to `height`
sns.FacetGrid(svd_df, hue="Label", height=10).map(plt.scatter, "1st Component", "2nd Component").add_legend()
plt.show()

### The above scatter plot shows us the decomposed components.

<div class="alert alert-block alert-info">
<h1> 4. t-sne </h1></div>

### There are mainly two types of approaches we can use to map the data points:

    - Local approaches : They map nearby points on the manifold to nearby points in the low dimensional representation.
    - Global approaches : They attempt to preserve geometry at all scales, i.e. mapping nearby points on the manifold to nearby points in the low dimensional representation as well as far away points to far away points.
    
    - t-SNE focuses on preserving the local structure of the data while trying to retain some of its global structure
    - It converts distances between points into similarity probabilities in the high dimensional space as well as in the low dimensional space, and places the low dimensional points so that the two sets of probabilities match as closely as possible
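
The similarity probabilities mentioned above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it assumes one fixed Gaussian bandwidth `sigma` for every point (real t-SNE tunes a bandwidth per point from the perplexity), uses a random low dimensional layout `Y` instead of an optimized one, and the function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all rows of Z."""
    sq_norms = (Z ** 2).sum(axis=1)
    d = sq_norms[:, None] + sq_norms[None, :] - 2 * Z @ Z.T
    return np.maximum(d, 0)  # clip tiny negatives caused by rounding

def conditional_probabilities(X, sigma=1.0):
    """p_{j|i}: Gaussian similarity of j to i, with one fixed sigma (a simplification)."""
    logits = -pairwise_sq_dists(X) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # a point is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def student_t_similarities(Y):
    """q_ij: heavy-tailed Student-t similarity, normalised over all pairs."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))  # high dimensional points
Y = rng.normal(size=(50, 2))   # an arbitrary (unoptimized) 2-D layout

P_cond = conditional_probabilities(X)
P = (P_cond + P_cond.T) / (2 * len(X))  # symmetrised joint probabilities
Q = student_t_similarities(Y)

# KL divergence between P and Q: the quantity t-SNE minimises by moving Y.
mask = P > 0
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(kl)
```

t-SNE runs gradient descent on `Y` to reduce this divergence; the sketch only evaluates it once for a random layout.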

In [None]:
from sklearn.manifold import TSNE 
tsne = TSNE(n_components=2).fit_transform(df[feat_cols].values)
ts_data = np.vstack((tsne.T, df['label1'])).T
ts_df = pd.DataFrame(ts_data, columns=['1st Component', '2nd Component', 'Label'])
# Note: seaborn renamed FacetGrid's `size` argument to `height`
sns.FacetGrid(ts_df, hue="Label", height=10).map(plt.scatter, "1st Component", "2nd Component").add_legend()
plt.show()

<div class="alert alert-block alert-info">
<h1> 5. UMAP </h1></div>

### Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that aims to preserve as much of the local data structure as t-SNE does, and more of the global structure.

### Some of the key advantages of UMAP are:
    - It can handle large datasets and high dimensional data without too much difficulty
    - It combines the power of visualization with the ability to reduce the dimensions of the data

In [None]:
import umap
umap_data = umap.UMAP(n_neighbors=5, min_dist=0.3, n_components=2).fit_transform(df[feat_cols].values)

# Here:
# n_neighbors is the number of neighboring points used to estimate local structure
# min_dist controls how tightly points may be packed in the embedding; larger values spread the embedded points more evenly

umap_data = np.vstack((umap_data.T, df['label1'])).T
umap_df = pd.DataFrame(umap_data, columns=['1st Component', '2nd Component', 'Label'])
# Note: seaborn renamed FacetGrid's `size` argument to `height`
sns.FacetGrid(umap_df, hue="Label", height=10).map(plt.scatter, "1st Component", "2nd Component").add_legend()
plt.show()

### Baseline model

In [None]:
def convert_to_tarray(df, size=128):
    N = df.shape[0]
    x_test = np.empty((N, size, size, 3), dtype=np.uint8)
    for i, image_id in enumerate(tqdm(df['image_name'])):
        x_test[i, :, :, :] = preprocess_image(
            f'../input/siim-isic-melanoma-classification/jpeg/test/{image_id}.jpg'
        )
    return x_test

x_test = convert_to_tarray(test_df)

In [None]:
x_train = x_train.reshape((x_train.shape[0], 128*128*3))
# Flatten the test images the same way; predict_proba below needs 2-D input
x_test = x_test.reshape((x_test.shape[0], 128*128*3))
y = train_df.target.values

In [None]:
train_oof = np.zeros((x_train.shape[0], ))
test_preds = 0

In [None]:
n_splits = 3

kf = KFold(n_splits=n_splits, random_state=137, shuffle=True)

for ij, (train_index, val_index) in enumerate(kf.split(x_train)):
    
    print("Fitting fold", ij+1)
    train_features = x_train[train_index]
    train_target = y[train_index]
    
    val_features = x_train[val_index]
    val_target = y[val_index]
        
    model = RandomForestClassifier(max_depth=2, random_state=0)
    model.fit(train_features, train_target)
    
    val_pred = model.predict_proba(val_features)[:,1]
    
    train_oof[val_index] = val_pred
    
    print("Fold AUC:", roc_auc_score(val_target, val_pred))
    test_preds += model.predict_proba(x_test)[:,1]/n_splits
    
    del train_features, train_target, val_features, val_target
    gc.collect()
    
print(roc_auc_score(y, train_oof))

In [None]:
sample_submission = pd.read_csv('../input/siim-isic-melanoma-classification/sample_submission.csv')

sample_submission['target'] = test_preds

sample_submission.to_csv('submission_RF_01.csv', index=False)

sample_submission['target'].max()

sample_submission['target'].min()