# Context
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including the id. The response is the glass type (a discrete variable with 7 possible values).

# Content
Attribute Information:

1. Id number: 1 to 214 (removed from CSV file)
2. RI: refractive index
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)
4. Mg: Magnesium
5. Al: Aluminum
6. Si: Silicon
7. K: Potassium
8. Ca: Calcium
9. Ba: Barium
10. Fe: Iron

Type of glass (class attribute):
* 1: building windows, float processed
* 2: building windows, non-float processed
* 3: vehicle windows, float processed
* 4: vehicle windows, non-float processed (none in this database)
* 5: containers
* 6: tableware
* 7: headlamps

# **Principal Component Analysis (PCA)**
PCA is a dimensionality-reduction technique that is often used to transform a high-dimensional dataset into a lower-dimensional subspace before running a machine learning algorithm on the data.
## So how can this algorithm help us? What are the uses of this algorithm?
* Identifies the most relevant directions of variance in the data.
* Helps capture the most “important” features.
* Makes computations on the dataset cheaper after dimensionality reduction, since there is less data to deal with.
* Visualization of the data.

## So what are the steps to make PCA work? How do we apply the magic?
1. Take the dataset you want to apply the algorithm on.
2. Calculate the covariance matrix.
3. Calculate the eigenvectors and their eigenvalues.
4. Sort the eigenvectors according to their eigenvalues in descending order.
5. Choose the first K eigenvectors (where K is the dimension we’d like to end up with).
6. Build the new, reduced dataset by projecting the data onto those K eigenvectors.
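The steps above can be sketched directly in NumPy. This is a minimal illustration, not the implementation scikit-learn uses internally; the function name `pca_from_scratch` and the random demo data are assumptions made for the example.

```python
import numpy as np

def pca_from_scratch(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    # Step 1: take the dataset and center it so covariance is measured around the mean
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the features (rowvar=False: columns are variables)
    cov = np.cov(X_centered, rowvar=False)
    # Step 3: eigenvectors and eigenvalues; eigh suits a symmetric matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort by eigenvalue in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Step 5: keep the first k eigenvectors (stored as columns)
    components = eigenvectors[:, :k]
    # Step 6: build the reduced dataset by projecting onto those eigenvectors
    return X_centered @ components, eigenvalues

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))  # synthetic data, for illustration only
X_reduced, eigenvalues = pca_from_scratch(X_demo, k=2)
print(X_reduced.shape)  # (100, 2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue keeps the directions of greatest variance.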

# Importing Libraries


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import numpy as np

# Importing Dataset


In [None]:
data=pd.read_csv('../input/glass/glass.csv')

# Let's take a look at what our dataset looks like:


In [None]:
data.head(5)


# Preprocessing
The first preprocessing step is to divide the dataset into a feature set and the corresponding labels. The following script stores the nine feature columns in the x variable and the series of corresponding labels in the y variable.

In [None]:
x = data.iloc[:,0:9]
y = data.iloc[:,9]

In [None]:
x.shape

In [None]:
y.shape
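Positional slicing depends on the column order of the file. As an alternative sketch, the same split can be written by column name. The small frame below is hand-made with invented values; it assumes the CSV headers match the attribute abbreviations listed earlier plus a `Type` column.

```python
import pandas as pd

# Tiny hand-made frame mimicking the assumed glass.csv layout; all values are invented
demo = pd.DataFrame({
    "RI": [1.52, 1.51, 1.53],
    "Na": [13.6, 12.9, 14.8],
    "Mg": [4.5, 3.6, 0.0],
    "Al": [1.1, 1.4, 2.0],
    "Si": [71.8, 72.7, 73.0],
    "K": [0.06, 0.48, 0.0],
    "Ca": [8.8, 7.8, 8.9],
    "Ba": [0.0, 0.0, 1.6],
    "Fe": [0.0, 0.0, 0.0],
    "Type": [1, 2, 7],
})

features = demo.drop(columns="Type")  # every column except the label
labels = demo["Type"]                 # the class attribute
print(features.shape, labels.shape)   # (3, 9) (3,)
```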

# Standardization of the data
If you’re familiar with data analysis and processing, you know that skipping standardization will probably bias the outcome, because PCA favors variables with larger numeric ranges. Standardization scales the data so that every variable has zero mean and unit variance, which puts all values in a similar range. We will use scikit-learn's StandardScaler to standardize our feature set. To do this, execute the following code:

In [None]:
sample_data = StandardScaler().fit_transform(x)
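To make the transformation concrete, here is a small sketch comparing `StandardScaler` with the formula it applies per column, `(x - mean) / std`. The array values are made up for illustration and are not taken from the glass data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales, for illustration only
X_demo = np.array([[1.52, 13.6, 70.0],
                   [1.51, 12.9, 73.0],
                   [1.53, 14.8, 71.5],
                   [1.52, 13.2, 72.3]])

scaled = StandardScaler().fit_transform(X_demo)

# Manual version: subtract the column mean, divide by the population std (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, so no single attribute dominates the covariance matrix.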

# Applying PCA
Performing PCA with Python's Scikit-Learn library takes only a few lines of code, using the PCA class. PCA depends only on the feature set and not on the label data, so it can be considered an unsupervised machine learning technique. Performing PCA using Scikit-Learn is a two-step process:

1. Initialize the PCA class by passing the number of components to the constructor.
2. Call the fit and then the transform method, passing the feature set to each (fit_transform does both in one call). The transform method returns the specified number of principal components.

In [None]:
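from sklearn.decomposition import PCA

# Keep the first two principal components
pca = PCA(n_components=2)

# Fit on the standardized features (sample_data), not the raw x, so no attribute dominates by scale
pct = pca.fit_transform(sample_data)

principal_df = pd.DataFrame(pct,columns=['pc1','pc2'])

# Attach the glass type so the points can be colored by class
finaldf= pd.concat([principal_df,data[['Type']]],axis=1)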

In [None]:
import seaborn as sn
# FacetGrid takes height (the older size argument was renamed in seaborn 0.9)
sn.FacetGrid(finaldf, hue="Type", height=6).map(plt.scatter, 'pc1', 'pc2').add_legend()
plt.show()
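A two-dimensional scatter plot does not show how much of the original variance the two components retain. One way to check, sketched here on synthetic data rather than the glass features, is the `explained_variance_ratio_` attribute of a fitted PCA object. The names `X_demo` and `pca_demo` are placeholders for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a 214 x 9 feature matrix (not the glass data)
X_demo = rng.normal(size=(214, 9)) * np.arange(1, 10)
X_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2)
pca_demo.fit(X_scaled)                    # learn the principal directions
projected = pca_demo.transform(X_scaled)  # project the data onto them

# Fraction of the total variance captured by each retained component
print(pca_demo.explained_variance_ratio_)
print(projected.shape)  # (214, 2)
```

If the two ratios sum to a small fraction, the plot is discarding most of the structure in the data and should be read with caution.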

# PCA for dimensionality reduction (not for visualization)

In [None]:

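# Refit on the standardized data, keeping all 9 components
pca.n_components = 9
pca_data = pca.fit_transform(sample_data)

# Fraction of the total variance explained by each component
percentage_var_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)

cum_var_explained = np.cumsum(percentage_var_explained)

# Plot the PCA spectrum
plt.figure(1, figsize=(6, 4))

plt.clf()
# x-axis starts at 1 so that it reads as the number of components kept
plt.plot(np.arange(1, len(cum_var_explained) + 1), cum_var_explained, linewidth=2)
plt.axis('tight')
plt.grid()
plt.xlabel('n_components')
plt.ylabel('Cumulative_explained_variance')
plt.show()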
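A cumulative-variance curve is usually read against a target, for example "keep enough components to explain 95% of the variance". The sketch below shows two ways to pick that number. It uses synthetic data and a hypothetical 0.95 threshold, neither of which comes from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data: 9 observed features driven by 3 latent factors plus noise
latent = rng.normal(size=(214, 3))
X_demo = latent @ rng.normal(size=(3, 9)) + 0.1 * rng.normal(size=(214, 9))

pca_full = PCA(svd_solver="full").fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

threshold = 0.95  # hypothetical target, chosen for illustration
# First position where the cumulative variance reaches the threshold, as a count
n_keep = int(np.argmax(cum_var >= threshold)) + 1
print(n_keep)

# scikit-learn makes the same selection when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_demo)
print(pca_auto.n_components_)
```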


# t-Distributed Stochastic Neighbor Embedding (t-SNE)
* t-SNE is a non-linear dimensionality-reduction algorithm for exploring high-dimensional data. It maps multi-dimensional data to two or three dimensions suitable for human observation. With the help of t-SNE, you may need fewer exploratory plots the next time you work with high-dimensional data.
* It is particularly well suited to visualizing high-dimensional datasets and is extensively applied in image processing, NLP, genomic data, and speech processing.


In [None]:
# TSNE

from sklearn.manifold import TSNE

# The dataset has only 214 rows, so t-SNE can run on all of them without subsampling
tsne_features = sample_data[0:214,:]
tsne_labels = y[0:214]

# Configuring the parameters:
# number of components = 2
# perplexity = 30 (the default)
# learning rate is left at the library default
# number of optimization iterations = 5000 (the default is 1000)
# Note: newer scikit-learn releases call this parameter max_iter instead of n_iter
model = TSNE(n_components=2, random_state=0,perplexity=30,n_iter=5000)

tsne_data = model.fit_transform(tsne_features)


# Creating a new data frame which helps us plot the result data
tsne_data = np.vstack((tsne_data.T, tsne_labels)).T
tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "Type"))

# Plotting the result of t-SNE (height replaces the older size argument)
sn.FacetGrid(tsne_df, hue="Type", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
plt.show()
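t-SNE output can change noticeably with the perplexity setting, so it is often worth comparing a few values before drawing conclusions from a single plot. The following is a minimal sketch on three synthetic clusters, not the glass data; the perplexity values 5 and 30 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Three synthetic Gaussian clusters in 9 dimensions, 30 points each
centers = rng.normal(scale=5.0, size=(3, 9))
X_demo = np.vstack([c + rng.normal(size=(30, 9)) for c in centers])

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples (90 here)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_demo)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can then be plotted side by side; clusters that persist across perplexity values are more likely to reflect real structure.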