### Unsupervised learning

In [None]:
Unsupervised learning is a type of machine learning where the algorithm learns patterns or relationships in the data without
being explicitly told what to look for. In unsupervised learning, the algorithm is given a dataset without any pre-existing 
labels or target variables to guide its learning process.

The goal of unsupervised learning is to find structure and patterns within the data that can be used to gain insights or make 
predictions about new data. Unsupervised learning is often used in exploratory data analysis, data clustering, and dimensionality
reduction.

Some common unsupervised learning techniques include clustering, anomaly detection, and dimensionality reduction. In clustering,
the algorithm groups data points based on their similarity or distance to each other. In anomaly detection, the algorithm 
identifies outliers or anomalies in the data that do not fit with the rest of the data. In dimensionality reduction, the 
algorithm reduces the number of variables or features in the data while retaining as much information as possible.
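
As a rough illustration of the clustering and anomaly-detection ideas above, the following sketch runs k-means on synthetic data. The blob centers, the choice of three clusters, and the 95th-percentile cutoff are all assumptions made for this example, and distance to the nearest centroid is only one simple heuristic for flagging outliers.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, unlabeled data: 300 points around 3 hand-picked centers (assumed values)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.6, random_state=0)

# Clustering: group points by distance to the learned centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_

# Simple anomaly heuristic: flag points unusually far from their nearest centroid
dist_to_nearest = kmeans.transform(X).min(axis=1)
threshold = np.percentile(dist_to_nearest, 95)   # cutoff is an arbitrary choice
is_outlier = dist_to_nearest > threshold

print(np.bincount(cluster_ids))   # points per cluster
print(is_outlier.sum())           # number of flagged points
```

Note that no labels are passed to `fit`; the grouping comes only from the geometry of the data.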

Unsupervised learning can be more challenging than supervised learning because there are no pre-existing labels to evaluate the 
accuracy of the algorithm's predictions. However, unsupervised learning can be a powerful tool for uncovering hidden patterns
and relationships in the data that may not be apparent through other methods.

### PCA (principal component analysis)

In [None]:
PCA stands for Principal Component Analysis, which is a technique used in data analysis and machine learning to reduce the 
dimensionality of a large dataset while preserving the important information. In PCA, a dataset is transformed into a new 
coordinate system in such a way that the largest variability in the data is captured by the first few principal components.

PCA works by finding the orthogonal linear combinations of the original variables that explain the largest amount of variability
in the data. These linear combinations, known as principal components, are ordered by the amount of variability they explain, 
with the first principal component explaining the most variability in the data.
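
The description above can be sketched directly in NumPy: center the data, take the eigenvectors of the covariance matrix, and sort them by eigenvalue. The toy dataset below is made up for illustration, and this is a minimal sketch of the idea rather than how scikit-learn's `PCA` is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): 3 features that are noisy multiples of one latent variable
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

# Center the data and compute the covariance matrix of the features
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so re-sort in descending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of total variance explained by each principal component
explained_ratio = eigvals / eigvals.sum()

# Project onto the first two principal components
scores = Xc @ eigvecs[:, :2]

print(explained_ratio)
print(np.allclose(eigvecs.T @ eigvecs, np.eye(3)))   # components are orthogonal
```

Because the three features are almost perfectly correlated here, nearly all of the variance ends up in the first component.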

By reducing the dimensionality of the data to a smaller set of principal components, PCA can help simplify complex datasets,
reduce noise and redundancy, and identify underlying patterns and relationships in the data. This makes it a useful tool in 
data exploration, visualization, and clustering, as well as in machine learning tasks such as classification and regression.

PCA is widely used in various fields, including finance, biology, engineering, and computer science, among others.

### What is manifold learning?

In [None]:
Manifold learning is a class of unsupervised machine learning techniques used for dimensionality reduction and data 
visualization. It is based on the idea that many high-dimensional datasets actually lie on or near lower-dimensional
manifolds embedded in the high-dimensional space. Manifold learning methods aim to discover these lower-dimensional manifolds
and to represent the data in a lower-dimensional space that preserves the underlying structure and relationships in the data.

The term "manifold" refers to a mathematical concept that describes a curved space that can be locally approximated by a flat 
space. In manifold learning, the goal is to find a low-dimensional representation of the data that captures the local structure
of the manifold.

There are several popular manifold learning techniques, including Isomap, Locally Linear Embedding (LLE), t-Distributed
Stochastic Neighbor Embedding (t-SNE), and multidimensional scaling (MDS). These techniques differ in their underlying
assumptions, computational complexity, and ability to preserve different aspects of the data structure.
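
To make the idea concrete, here is a small sketch that applies Isomap to the synthetic "swiss roll" dataset, a 2-D sheet rolled up in 3-D space. The sample size, noise level, and `n_neighbors=10` are assumptions picked for illustration and would normally need tuning.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3-D points lying on a rolled-up 2-D sheet; t is the position along the roll
X, t = make_swiss_roll(n_samples=500, noise=0.05, random_state=0)

# Isomap builds a nearest-neighbor graph and preserves distances along that graph
isomap = Isomap(n_neighbors=10, n_components=2)
X_iso = isomap.fit_transform(X)

print(X.shape, X_iso.shape)
```

Plotting `X_iso` colored by `t` is a common way to check whether the roll has been "unrolled" into a flat sheet.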

Manifold learning has applications in various fields, including image and speech recognition, natural language processing, 
and bioinformatics, among others. It is particularly useful in situations where the data has many dimensions, making it 
difficult to visualize and analyze, or where the underlying structure of the data is complex and nonlinear.

### t-SNE

In [None]:
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular dimensionality reduction technique used for visualizing
high-dimensional datasets. It was proposed in 2008 by Laurens van der Maaten and Geoffrey Hinton.

The goal of t-SNE is to map high-dimensional data points to a lower-dimensional space, typically two or three dimensions,
while preserving the pairwise similarities between the data points as much as possible. The algorithm achieves this by
converting pairwise distances into similarity probabilities in both spaces and minimizing the Kullback-Leibler divergence
between the high-dimensional and the low-dimensional similarity distributions.

The t-SNE algorithm has become popular for visualizing complex datasets, such as images, text, and genomic data. It is 
particularly useful for identifying clusters and patterns in the data, and for revealing the underlying structure of the
data that is not easily visible in the high-dimensional space.

However, t-SNE is a non-linear, stochastic algorithm with a non-convex cost function, so its results can be sensitive to the
choice of hyperparameters (notably the perplexity) and to initialization. It is therefore often recommended to experiment with
different hyperparameters and random seeds to obtain stable and meaningful visualizations of the data.
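
One way to explore this sensitivity is to fit t-SNE several times with different perplexity values and compare the embeddings. The sketch below uses the first 200 samples of scikit-learn's digits dataset; the subset size and the two perplexity values are arbitrary choices made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Small subset of the 64-dimensional digits data to keep the runtime short
X_digits, _ = load_digits(return_X_y=True)
X_small = X_digits[:200]

embeddings = {}
for perplexity in (5, 30):
    # Fixed random_state so each setting is reproducible
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_small)
    # kl_divergence_ is the final value of the cost function for this run
    print(perplexity, embeddings[perplexity].shape, float(tsne.kl_divergence_))
```

The two embeddings will generally look different, so conclusions about cluster structure are safer when they hold across several settings.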

### How to evaluate unsupervised learning

In [None]:
Evaluating unsupervised learning algorithms can be challenging since there are no predefined labels or ground truth to compare
the output of the algorithm against. However, there are several techniques that can be used to evaluate the performance of
unsupervised learning algorithms, including:

1. Visualization: Visualization is a powerful tool for evaluating unsupervised learning algorithms since it allows you to 
    inspect the output of the algorithm and assess whether the patterns and structures in the data have been captured correctly.
    Scatter plots, heat maps, and other types of visualizations can be used to assess the clustering or grouping of data points.
    

2. Internal validation: Internal validation metrics such as silhouette score, Calinski-Harabasz index, and Davies-Bouldin index 
    can be used to evaluate the performance of clustering algorithms. These metrics measure the quality of the clustering by 
    assessing the compactness and separation of the clusters.

3. External validation: In some cases, external validation measures can be used to evaluate the performance of unsupervised 
    learning algorithms. This involves comparing the output of the algorithm with some ground truth or labeled data. However, 
    this approach is less common in unsupervised learning since the ground truth is often not available.

4. Domain-specific evaluation: Finally, in some cases, domain-specific evaluation metrics can be used to assess the performance
    of unsupervised learning algorithms. For example, in image analysis, unsupervised learning algorithms can be evaluated based
    on how well they capture visual features such as edges, shapes, and textures.
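
The internal and external validation measures listed above are available in `sklearn.metrics`. The following sketch computes them for a k-means clustering of synthetic data; the blob centers and the use of the generating labels as "ground truth" are assumptions for this example, since real unsupervised problems rarely come with labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# Synthetic data with known generating labels (used only for external validation)
X, y_true = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.6, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal validation: uses only the data and the cluster assignments
sil = silhouette_score(X, labels)           # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)     # higher is better
db = davies_bouldin_score(X, labels)        # lower is better

# External validation: compares assignments against the known labels
ari = adjusted_rand_score(y_true, labels)   # 1.0 means identical partitions

print(sil, ch, db, ari)
```

The adjusted Rand index ignores how clusters are numbered, so it is unaffected by k-means assigning different integer ids than the generating labels.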

Overall, evaluating unsupervised learning algorithms can be challenging, and the appropriate evaluation metric will depend on 
the specific problem and data at hand. Therefore, it is important to carefully select and apply evaluation techniques that are
appropriate for the specific application.

### Unsupervised learning

### Preamble and datasets

In [1]:
%matplotlib notebook
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Breast cancer dataset
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

# Our sample fruits dataset
fruits = pd.read_table('assets/fruit_data_with_colors.txt')
X_fruits = fruits[['mass','width','height', 'color_score']]
y_fruits = fruits[['fruit_label']] - 1

### Dimensionality reduction and manifold learning

#### Principal component analysis (PCA)

#### Using PCA to find the first two principal components of the breast cancer dataset

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Each feature should be centered (zero mean) and scaled to unit variance before PCA
X_normalized = StandardScaler().fit(X_cancer).transform(X_cancer)

pca = PCA(n_components=2).fit(X_normalized)
X_pca = pca.transform(X_normalized)

print(X_cancer.shape, X_pca.shape)

(569, 30) (569, 2)
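
A natural follow-up question is how much of the original variance the two retained components capture. The self-contained sketch below checks this with the fitted model's `explained_variance_ratio_` attribute; the variable names (`X_scaled`, `pca_2`, `ratios`) are chosen here so they do not overwrite the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Reload and standardize the data so the example runs on its own
X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca_2 = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance explained by each of the two components
ratios = pca_2.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the sum is low, the 2-D scatter plot is discarding a large share of the structure in the data and should be interpreted with care.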


#### Plotting the PCA-transformed version of the breast cancer dataset

In [5]:
from adspy_shared_utilities import plot_labelled_scatter

plot_labelled_scatter(X_pca, y_cancer, ['malignant', 'benign'])

plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Breast Cancer Dataset PCA (n_components = 2)');


#### Plotting the magnitude of each feature value for the first two principal components

In [13]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5))
plt.imshow(pca.components_, interpolation = 'none', cmap = 'plasma')
feature_names = list(cancer.feature_names)

plt.gca().set_xticks(np.arange(-.5, len(feature_names)));
plt.gca().set_yticks(np.arange(0.5, 2));
plt.gca().set_xticklabels(feature_names+[""], rotation=90, ha='left', fontsize=12);
plt.gca().set_yticklabels(['First PC', 'Second PC'], va='bottom', fontsize=12); 

plt.colorbar(orientation='horizontal', ticks=[pca.components_.min(), 0,
                                              pca.components_.max()], pad=0.65);


#### PCA on the fruit dataset (for comparison)

In [25]:
%load_ext autoreload
%autoreload 2

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


X_normalized = StandardScaler().fit(X_fruits).transform(X_fruits)

pca = PCA(n_components=2).fit(X_normalized)
X_pca = pca.transform(X_normalized)

from adspy_shared_utilities import plot_labelled_scatter
plot_labelled_scatter(X_pca,y_fruits,['apple','mandarin','orange','lemon'])

plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('Fruits Dataset PCA (n_components = 2)');


### Manifold learning

#### Multidimensional scaling (MDS) on the fruit dataset

In [32]:
from adspy_shared_utilities import plot_labelled_scatter
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

X_normalized = StandardScaler().fit(X_fruits).transform(X_fruits)

mds = MDS(n_components=2)
X_mds = mds.fit_transform(X_normalized)

plot_labelled_scatter(X_mds,y_fruits,['apple','mandarin','orange','lemon'])

plt.xlabel('First MDS dimension')
plt.ylabel('Second MDS dimension')
plt.title('Fruits Dataset MDS (n_components = 2)');


#### Multidimensional scaling (MDS) on the breast cancer dataset

In [34]:
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

# each feature should be centered (zero mean) and with unit variance
X_normalized = StandardScaler().fit(X_cancer).transform(X_cancer)  

mds = MDS(n_components = 2)

X_mds = mds.fit_transform(X_normalized)

from adspy_shared_utilities import plot_labelled_scatter
plot_labelled_scatter(X_mds, y_cancer, ['malignant', 'benign'])

plt.xlabel('First MDS dimension')
plt.ylabel('Second MDS dimension')
plt.title('Breast Cancer Dataset MDS (n_components = 2)');


#### t-SNE on the fruit dataset

In [35]:
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Standardize the fruit features before running t-SNE
X_fruits_normalized = StandardScaler().fit(X_fruits).transform(X_fruits)

tsne = TSNE(random_state = 0)

X_tsne = tsne.fit_transform(X_fruits_normalized)

plot_labelled_scatter(X_tsne, y_fruits, 
    ['apple', 'mandarin', 'orange', 'lemon'])
plt.xlabel('First t-SNE feature')
plt.ylabel('Second t-SNE feature')
plt.title('Fruits dataset t-SNE');