<img src="./images/banner.png" width="800">

# Introduction to Unsupervised Learning

Unsupervised learning is a branch of machine learning that focuses on discovering patterns, structures, or relationships in data without the guidance of labeled outcomes or explicit feedback. Unlike supervised learning, where the algorithm learns from labeled examples, unsupervised learning algorithms work with raw, unlabeled data to uncover hidden insights.


<img src="./images/types-of-ml.png" width="800">

<img src="./images/supervised-learning.png" width="800">

Unsupervised learning can be formally defined as follows:

- Given a dataset $X = \{x_1, x_2, ..., x_n\}$ where each $x_i$ is a feature vector, the goal is to find patterns or structures in $X$ without any corresponding target variables or labels.


<img src="./images/unsupervised-learning.png" width="800">

💡 **Tip:** Think of unsupervised learning as exploring a new city without a map or guide. You're discovering the layout, landmarks, and patterns on your own!


Here are some key characteristics of unsupervised learning:

1. **No labeled data:** Unsupervised learning algorithms work with datasets that don't have predefined target variables or labels.

2. **Pattern discovery:** The primary objective is to identify inherent structures, groupings, or relationships within the data.

3. **Exploratory nature:** Unsupervised learning is often used for exploratory data analysis, helping to gain insights into the underlying structure of the data.


And here are some of the main types of unsupervised learning:

1. **Clustering:** Grouping similar data points together based on certain characteristics.

   Example: $K$-means clustering algorithm

2. **Dimensionality Reduction:** Reducing the number of features while preserving the essential information.

   Example: Principal Component Analysis (PCA)

3. **Association Rule Learning:** Discovering interesting relations between variables in large databases.

   Example: Apriori algorithm for market basket analysis
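To make the association-rule idea above more concrete, here is a minimal sketch of the two quantities such algorithms rely on: support and confidence. The transactions and the 0.6 support threshold are made up for illustration, and the code checks item pairs by brute force rather than implementing Apriori's candidate pruning.

```python
from itertools import combinations

# Hypothetical toy transactions (illustrative data only)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Brute-force search over item pairs (Apriori would prune candidates instead)
items = sorted(set().union(*transactions))
min_support = 0.6
frequent_pairs = {
    pair: support(pair, transactions)
    for pair in combinations(items, 2)
    if support(pair, transactions) >= min_support
}

for pair, s in frequent_pairs.items():
    print(pair, f"support={s:.2f}")
print(f"confidence(beer -> diapers) = {confidence({'beer'}, {'diapers'}, transactions):.2f}")
```

A rule such as "beer implies diapers" is usually reported only when both its support and its confidence exceed thresholds chosen by the analyst.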


<img src="./images/algorithms.jpg" width="800">

Note that unsupervised learning is not limited to these three types, and there are many other techniques and algorithms that fall under the umbrella of unsupervised learning.


The term "unsupervised" refers to the fact that the learning process is not guided by a specific target or outcome. Instead, the algorithm must find structure in the data on its own.


❗️ **Important Note:** While unsupervised learning doesn't use labeled data for training, domain expertise is often crucial for interpreting the results and determining their relevance.


Unsupervised learning bears similarities to how humans learn about their environment through observation and pattern recognition. For instance, a child who learns to tell different animals apart without being taught the name of every species is doing something similar: grouping by observed similarity rather than by provided labels.


By leveraging unsupervised learning techniques, data scientists and researchers can uncover hidden patterns, detect anomalies, and gain valuable insights from complex, high-dimensional datasets, paving the way for more informed decision-making and deeper understanding of various phenomena.

**Table of contents**<a id='toc0_'></a>    
- [Unsupervised vs. Supervised Learning](#toc1_)    
  - [Fundamental Differences](#toc1_1_)    
  - [Comparative Example](#toc1_2_)    
    - [Supervised Learning Approach](#toc1_2_1_)    
    - [Unsupervised Learning Approach](#toc1_2_2_)    
  - [Key Differences in Application](#toc1_3_)    
  - [Challenges and Considerations](#toc1_4_)    
  - [Hybrid Approaches](#toc1_5_)    
- [Key Objectives of Unsupervised Learning](#toc2_)    
  - [Pattern Discovery](#toc2_1_)    
  - [Dimensionality Reduction](#toc2_2_)    
  - [Feature Learning](#toc2_3_)    
  - [Density Estimation](#toc2_4_)    
  - [Data Compression](#toc2_5_)    
  - [Generative Modeling](#toc2_6_)    
  - [Exploratory Data Analysis](#toc2_7_)    
- [Real-World Applications of Unsupervised Learning](#toc3_)    
  - [Customer Segmentation in Marketing](#toc3_1_)    
  - [Anomaly Detection in Cybersecurity](#toc3_2_)    
  - [Recommendation Systems](#toc3_3_)    
  - [Image and Video Analysis](#toc3_4_)    
  - [Natural Language Processing](#toc3_5_)    
  - [Genomics and Bioinformatics](#toc3_6_)    
  - [Anomaly Detection in Industrial IoT](#toc3_7_)    
  - [Financial Market Analysis](#toc3_8_)    
- [Summary](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_'></a>[Unsupervised vs. Supervised Learning](#toc0_)

Understanding the differences between unsupervised and supervised learning is crucial for grasping the unique aspects of each approach. This section will compare these two fundamental paradigms of machine learning, highlighting their key characteristics, use cases, and challenges.


### <a id='toc1_1_'></a>[Fundamental Differences](#toc0_)


1. **Data Requirements**
   - **Supervised Learning:** Requires labeled data with input features $(X)$ and corresponding target variables $(y)$.
   - **Unsupervised Learning:** Works with unlabeled data, using only input features $(X)$.

2. **Learning Process**
   - **Supervised Learning:** Learns a mapping function $f$ such that $f(X) \approx y$, then uses it to predict outputs for new inputs.
   - **Unsupervised Learning:** Discovers inherent patterns or structures within the data without predefined outputs.

3. **Objective**
   - **Supervised Learning:** Minimize the difference between predicted and actual outputs.
   - **Unsupervised Learning:** Optimize a label-free criterion, such as within-cluster variance, reconstruction error, or data likelihood, that reflects the structure of the data.


### <a id='toc1_2_'></a>[Comparative Example](#toc0_)


Let's consider a dataset of customer information for an e-commerce platform:


```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 75000, 90000, 100000],
    'Spending': [5000, 7000, 8000, 10000, 12000]
})

print(data)
```


#### <a id='toc1_2_1_'></a>[Supervised Learning Approach](#toc0_)

In supervised learning, we might predict 'Spending' based on 'Age' and 'Income':


```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = data[['Age', 'Income']]
y = data['Spending']

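# Hold out one of the five rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# Predict spending for a new customer; reuse the training column names
# so scikit-learn does not warn about missing feature names
new_customer = pd.DataFrame({'Age': [28], 'Income': [55000]})
predicted_spending = model.predict(new_customer)
print(f"Predicted spending: ${predicted_spending[0]:.2f}")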
```


#### <a id='toc1_2_2_'></a>[Unsupervised Learning Approach](#toc0_)

In unsupervised learning, we might cluster customers based on their attributes:


```python
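from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = data[['Age', 'Income', 'Spending']]

# Standardize first: K-means uses Euclidean distance, so the large
# 'Income' values would otherwise dominate 'Age'
X_scaled = StandardScaler().fit_transform(X)

# Fixed random_state and explicit n_init make the clustering reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
data['Cluster'] = kmeans.fit_predict(X_scaled)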

print(data)
```


### <a id='toc1_3_'></a>[Key Differences in Application](#toc0_)


1. **Problem Types**
   - **Supervised:** Classification, Regression
   - **Unsupervised:** Clustering, Dimensionality Reduction, Anomaly Detection

2. **Evaluation Metrics**
   - **Supervised:** Accuracy, Precision, Recall, Mean Squared Error
   - **Unsupervised:** Silhouette Score, Inertia, Explained Variance Ratio

3. **Use Cases**
   - **Supervised:** Spam detection, Price prediction, Image classification
   - **Unsupervised:** Customer segmentation, Topic modeling, Feature learning
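The unsupervised metrics listed above can be computed without any labels. The sketch below, which assumes synthetic data with three well-separated groups, compares inertia and silhouette score across several values of $k$; the cluster centers and the range of $k$ are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative assumption)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    # inertia_: within-cluster sum of squared distances (lower is tighter)
    # silhouette: ranges from -1 to 1 (higher means better-separated clusters)
    scores[k] = (km.inertia_, silhouette_score(X_demo, km.labels_))
    print(f"k={k}: inertia={scores[k][0]:.1f}, silhouette={scores[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(scores, key=lambda k: scores[k][1])
print("Best k by silhouette:", best_k)
```

Inertia tends to keep falling as $k$ grows, so it is usually read with an "elbow" heuristic, whereas the silhouette score can be compared directly across values of $k$.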


### <a id='toc1_4_'></a>[Challenges and Considerations](#toc0_)


1. **Data Quality and Quantity**
   - **Supervised:** Requires high-quality labeled data, which can be expensive and time-consuming to obtain.
   - **Unsupervised:** Can work with unlabeled data but may require larger datasets for meaningful patterns.

2. **Interpretability**
   - **Supervised:** Often easier to interpret, as the relationship between inputs and outputs is explicit.
   - **Unsupervised:** Can be more challenging to interpret, requiring domain expertise to validate findings.

3. **Computational Complexity**
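   - **Supervised:** Training optimizes a well-defined loss, so progress is straightforward to monitor on held-out labeled data.
   - **Unsupervised:** There is no ground truth to check against, and some methods (for example, hierarchical clustering or t-SNE) become computationally expensive on large or high-dimensional datasets.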


💡 **Tip:** When choosing between supervised and unsupervised learning, consider your data availability, problem type, and desired outcomes.


### <a id='toc1_5_'></a>[Hybrid Approaches](#toc0_)


In practice, supervised and unsupervised learning are often used in combination:

1. **Semi-supervised Learning:** Uses a small amount of labeled data with a large amount of unlabeled data.
2. **Transfer Learning:** Reuses representations learned on one task for another; for example, features learned without labels (unsupervised or self-supervised pretraining) can initialize a supervised model.
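As a rough illustration of semi-supervised learning, the sketch below hides all but five labels per class and lets scikit-learn's `LabelSpreading` propagate them through a nearest-neighbor graph. The two-blob dataset, the number of retained labels, and `n_neighbors=7` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Two synthetic classes (illustrative data)
X_semi, y_true = make_blobs(
    n_samples=200, centers=[[0, 0], [6, 6]], cluster_std=1.0, random_state=0
)

# Keep only 5 labels per class; scikit-learn marks unlabeled samples with -1
rng = np.random.default_rng(0)
y_partial = np.full_like(y_true, -1)
for cls in (0, 1):
    keep = rng.choice(np.flatnonzero(y_true == cls), size=5, replace=False)
    y_partial[keep] = cls

# Spread the few known labels to their neighbors in feature space
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X_semi, y_partial)

# Compare inferred labels with the hidden ground truth on unlabeled points
unlabeled = y_partial == -1
accuracy = (model.transduction_[unlabeled] == y_true[unlabeled]).mean()
print(f"Accuracy on the {unlabeled.sum()} unlabeled points: {accuracy:.2f}")
```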


❗️ **Important Note:** The choice between supervised and unsupervised learning is not always binary. Many real-world problems benefit from a combination of both approaches.


By understanding the distinctions between supervised and unsupervised learning, data scientists can make informed decisions about which approach (or combination of approaches) is most suitable for their specific problem and dataset.

## <a id='toc2_'></a>[Key Objectives of Unsupervised Learning](#toc0_)

Unsupervised learning algorithms are designed to explore and analyze data without predefined labels or outcomes. This approach to machine learning has several key objectives that make it a powerful tool for data analysis and discovery.


### <a id='toc2_1_'></a>[Pattern Discovery](#toc0_)


One of the primary objectives of unsupervised learning is to uncover hidden patterns or structures within data.


**Techniques for Pattern Discovery:**
- **Clustering:** Grouping similar data points together.
- **Association Rule Learning:** Finding relationships between variables in large datasets.
- **Anomaly Detection:** Identifying unusual patterns that don't conform to expected behavior.


💡 **Tip:** Pattern discovery can reveal insights that might not be immediately obvious, even to domain experts!


Example of K-means clustering for pattern discovery:


<img src="./images/k-means.webp" width="800">

```python
from sklearn.cluster import KMeans
import numpy as np

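# Generate sample data (seeded so the result is reproducible).
# Uniform noise has no real clusters; K-means will still partition it.
rng = np.random.default_rng(42)
X = rng.random((100, 2))

# Apply K-means clustering
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)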

# Cluster assignments
labels = kmeans.labels_
```


### <a id='toc2_2_'></a>[Dimensionality Reduction](#toc0_)


Another crucial objective is to reduce the complexity of high-dimensional data while preserving its essential characteristics.


<img src="./images/dim-reduction.webp" width="800">

**Benefits of Dimensionality Reduction:**
- Mitigates the "curse of dimensionality"
- Improves computational efficiency
- Facilitates visualization of high-dimensional data


**Common Techniques:**
- Principal Component Analysis (PCA)
- t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Autoencoders


Example of PCA for dimensionality reduction:


```python
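import numpy as np
from sklearn.decomposition import PCA

# Stand-in for high-dimensional data: 100 samples, 10 features
X_high = np.random.default_rng(0).normal(size=(100, 10))

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_high)  # shape (100, 2)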
```
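One common way to decide how many components to keep is to look at the cumulative explained variance. The sketch below assumes synthetic data in which ten observed features are driven by two latent factors plus a little noise; the 95% threshold is a conventional but arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 observed features driven by 2 latent factors plus noise
latent = rng.normal(size=(500, 2))
mixing = np.zeros((2, 10))
mixing[0, :5] = 1.0   # factor 1 drives features 0-4
mixing[1, 5:] = 1.0   # factor 2 drives features 5-9
X_latent = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_latent)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print("Cumulative explained variance:", np.round(cumulative[:4], 3))
print("Components needed for 95%:", n_keep)
```

scikit-learn can also make this choice directly: passing a float such as `PCA(n_components=0.95)` keeps just enough components to reach that fraction of the variance.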


### <a id='toc2_3_'></a>[Feature Learning](#toc0_)


Unsupervised learning can automatically learn meaningful features from raw data, which is particularly useful when dealing with complex, high-dimensional datasets.


**Applications of Feature Learning:**
- Image and speech recognition
- Natural language processing
- Recommendation systems


❗️ **Important Note:** Learned features can capture more nuanced and relevant information than hand-engineered features, although how much they help depends on the data and the model.
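As a small illustration of feature learning, the sketch below uses PCA as a simple stand-in for a learned feature extractor: it derives 16 features from raw pixel values without using labels, and a classifier is then trained on those features. The dataset (scikit-learn's bundled digits), the number of components, and the classifier are all assumptions made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# 8x8 digit images flattened to 64 pixel features
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0, stratify=y_digits
)

# PCA ignores the labels: its 16 features come from the pixel values alone.
# Only the logistic regression step uses y.
clf = make_pipeline(
    PCA(n_components=16, random_state=0),
    LogisticRegression(max_iter=2000),
)
clf.fit(X_tr, y_tr)

test_accuracy = clf.score(X_te, y_te)
print(f"Test accuracy with 16 learned features: {test_accuracy:.3f}")
```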


### <a id='toc2_4_'></a>[Density Estimation](#toc0_)


Estimating the underlying probability distribution of data is another key objective of unsupervised learning.


<img src="./images/kde.gif" width="800">

**Uses of Density Estimation:**
- Generating new samples similar to the training data
- Detecting outliers or anomalies
- Improving the performance of other machine learning models


Example of Gaussian Mixture Model for density estimation:


```python
from sklearn.mixture import GaussianMixture

# Assume X is your data
gmm = GaussianMixture(n_components=3)
gmm.fit(X)

# Generate new samples; sample() returns a tuple (samples, component_labels)
new_samples, component_labels = gmm.sample(n_samples=100)
```
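A fitted density model can also be used for the outlier-detection use case listed above. The sketch below is one possible approach, assuming a single-component Gaussian fit and an arbitrary cutoff at the bottom 1% of training log-densities.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic "normal" data: a standard 2-D Gaussian (illustrative assumption)
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

gmm_density = GaussianMixture(n_components=1, random_state=0).fit(X_normal)

# score_samples returns the log-density of each point under the fitted model
train_log_density = gmm_density.score_samples(X_normal)
threshold = np.percentile(train_log_density, 1)  # bottom 1% (arbitrary cutoff)

# One point near the center of the data, one far away from it
candidates = np.array([[0.1, -0.2], [6.0, 6.0]])
is_outlier = gmm_density.score_samples(candidates) < threshold
print(is_outlier)
```

Points whose log-density falls below the threshold are flagged; in practice the cutoff is tuned to the tolerated false-alarm rate.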


### <a id='toc2_5_'></a>[Data Compression](#toc0_)


Unsupervised learning techniques can be used to compress data by finding more efficient representations.


**Benefits of Data Compression:**
- Reduces storage requirements
- Speeds up data transmission
- Can improve learning in downstream tasks
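To give a feel for lossy compression with an unsupervised method, the sketch below stores each 64-pixel digit image as 16 PCA coefficients and measures how much information is lost on reconstruction. The dataset and the choice of 16 components are assumptions for illustration; `svd_solver="full"` is set so the decomposition is exact.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_img, _ = load_digits(return_X_y=True)   # shape (1797, 64)

pca_codec = PCA(n_components=16, svd_solver="full").fit(X_img)
codes = pca_codec.transform(X_img)               # compressed form, shape (1797, 16)
X_restored = pca_codec.inverse_transform(codes)  # lossy reconstruction, shape (1797, 64)

compression_ratio = X_img.shape[1] / codes.shape[1]

# Share of the total variance that is lost in the reconstruction
relative_error = np.sum((X_img - X_restored) ** 2) / np.sum(
    (X_img - X_img.mean(axis=0)) ** 2
)
print(f"Compression ratio: {compression_ratio:.0f}x")
print(f"Relative reconstruction error: {relative_error:.3f}")
```

The relative error equals one minus the total explained variance ratio of the retained components, so keeping more components trades storage for fidelity.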


### <a id='toc2_6_'></a>[Generative Modeling](#toc0_)


Creating models that can generate new data similar to the training set is an exciting objective of unsupervised learning.


<img src="./images/auto-encoder.png" width="800">

**Applications of Generative Modeling:**
- Creating synthetic images or text
- Data augmentation for improving supervised learning models
- Simulating scenarios for decision-making or planning


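Example of a simple autoencoder. A plain autoencoder learns to reconstruct its input and is a building block for generative models such as variational autoencoders, rather than a full generative model by itself: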


```python
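import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

# Placeholder data scaled to [0, 1], as binary cross-entropy expects.
# The sizes are illustrative; replace with your own normalized features.
original_dim, encoding_dim = 784, 32
rng = np.random.default_rng(0)
X_train = rng.random((1000, original_dim)).astype("float32")
X_test = rng.random((200, original_dim)).astype("float32")

# Define the encoder
input_layer = Input(shape=(original_dim,))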
encoded = Dense(encoding_dim, activation='relu')(input_layer)

# Define the decoder
decoded = Dense(original_dim, activation='sigmoid')(encoded)

# Create the autoencoder model
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train the model
autoencoder.fit(X_train, X_train, epochs=50, batch_size=256, shuffle=True, validation_data=(X_test, X_test))
```


### <a id='toc2_7_'></a>[Exploratory Data Analysis](#toc0_)


Unsupervised learning serves as a powerful tool for exploratory data analysis, helping to gain insights into the structure and characteristics of datasets.


**Benefits for Data Analysis:**
- Reveals underlying structures in the data
- Identifies potential subgroups or clusters
- Highlights relationships between variables


💡 **Tip:** Always visualize the results of unsupervised learning algorithms to gain intuition about your data!


By pursuing these objectives, unsupervised learning algorithms can extract valuable information from unlabeled data, paving the way for more informed decision-making, improved data understanding, and enhanced performance in various machine learning tasks.

## <a id='toc3_'></a>[Real-World Applications of Unsupervised Learning](#toc0_)

Unsupervised learning techniques have found a wide range of applications across various industries and domains. These methods excel at discovering hidden patterns and structures in data, making them invaluable for many real-world problems. Let's explore some of the most impactful applications:


### <a id='toc3_1_'></a>[Customer Segmentation in Marketing](#toc0_)


Unsupervised learning algorithms, particularly clustering techniques, are widely used in marketing for customer segmentation.


**Key Benefits:**
- Identify distinct customer groups based on behavior, preferences, or demographics
- Tailor marketing strategies to specific segments
- Improve customer retention and acquisition


**Example Technique:** K-means clustering


💡 **Tip:** Combining unsupervised segmentation with supervised prediction models can lead to more effective targeted marketing campaigns.


### <a id='toc3_2_'></a>[Anomaly Detection in Cybersecurity](#toc0_)


Unsupervised learning plays a crucial role in identifying unusual patterns that may indicate security threats or fraudulent activities.


**Applications:**
- Network intrusion detection
- Fraud detection in financial transactions
- Identifying unusual user behavior


**Example Technique:** Isolation Forest
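A minimal sketch of Isolation Forest on made-up network-style data follows. The two features (imagined here as requests per minute and bytes per request), the injected anomalies, and `contamination=0.01` are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical "normal" events: [requests per minute, bytes per request]
normal_traffic = rng.normal(loc=[50.0, 500.0], scale=[5.0, 50.0], size=(300, 2))

# Two injected events that sit far from typical behavior (rows 300 and 301)
suspicious = np.array([[120.0, 5000.0], [5.0, 20.0]])
X_events = np.vstack([normal_traffic, suspicious])

# contamination is the expected share of anomalies (an assumed value here)
detector = IsolationForest(contamination=0.01, random_state=42).fit(X_events)

pred = detector.predict(X_events)      # +1 = inlier, -1 = anomaly
flagged = np.flatnonzero(pred == -1)
print("Flagged row indices:", flagged)
```

Isolation Forest scores points by how few random splits are needed to isolate them, so no labeled attacks are required for training.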


### <a id='toc3_3_'></a>[Recommendation Systems](#toc0_)


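Collaborative filtering, which infers preferences from patterns in user-item interactions rather than from explicit labels, is commonly grouped with unsupervised methods and is the backbone of many recommendation systems.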


**Use Cases:**
- Product recommendations in e-commerce
- Movie and music suggestions in streaming services
- Content recommendations in social media


**Example Technique:** Matrix Factorization
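The idea behind matrix factorization can be sketched in a few lines of NumPy: approximate the ratings matrix as a product of low-dimensional user and item factors, fitting only the observed entries. The ratings, the number of factors, the learning rate, and the regularization strength below are all hypothetical values chosen for a toy example.

```python
import numpy as np

# Hypothetical user x item ratings; np.nan marks items a user has not rated
R = np.array([
    [5, 4, np.nan, 1, np.nan],
    [4, np.nan, 5, 1, 1],
    [1, 1, np.nan, 5, 4],
    [np.nan, 1, 2, 4, 5],
], dtype=float)
observed = ~np.isnan(R)
R_filled = np.where(observed, R, 0.0)

rng = np.random.default_rng(0)
n_factors, lr, reg = 2, 0.01, 0.02
U = 0.1 * rng.normal(size=(R.shape[0], n_factors))  # user factors
V = 0.1 * rng.normal(size=(R.shape[1], n_factors))  # item factors

# Full-batch gradient descent on squared error over observed entries only
for _ in range(5000):
    error = observed * (R_filled - U @ V.T)
    U, V = U + lr * (error @ V - reg * U), V + lr * (error.T @ U - reg * V)

R_hat = U @ V.T   # predictions for every user-item pair, including unrated ones
rmse = np.sqrt(((R_filled - R_hat)[observed] ** 2).mean())
print(f"RMSE on observed ratings: {rmse:.3f}")
print(np.round(R_hat, 1))
```

The entries of `R_hat` at positions that were `nan` in `R` serve as predicted ratings; production systems typically use stochastic or alternating updates and evaluate on held-out ratings.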


### <a id='toc3_4_'></a>[Image and Video Analysis](#toc0_)


Unsupervised learning techniques are powerful tools for processing and understanding visual data.


**Applications:**
- Image compression
- Feature extraction for computer vision tasks
- Video summarization


**Example Technique:** Autoencoders for image compression


### <a id='toc3_5_'></a>[Natural Language Processing](#toc0_)


Unsupervised learning techniques are fundamental in various NLP tasks.


**Applications:**
- Topic modeling in document analysis
- Word embedding generation
- Language translation (for example, unsupervised machine translation trained on monolingual text)


**Example Technique:** Latent Dirichlet Allocation (LDA) for topic modeling


### <a id='toc3_6_'></a>[Genomics and Bioinformatics](#toc0_)


Unsupervised learning helps in analyzing complex biological data.


**Applications:**
- Gene expression analysis
- Protein structure prediction
- Disease subtype discovery


❗️ **Important Note:** In bioinformatics, careful preprocessing and domain knowledge are crucial when applying unsupervised learning techniques.


### <a id='toc3_7_'></a>[Anomaly Detection in Industrial IoT](#toc0_)


Unsupervised learning is valuable for monitoring and maintaining industrial equipment.


**Applications:**
- Predictive maintenance
- Quality control in manufacturing
- Energy consumption optimization


**Example Technique:** One-class SVM for anomaly detection
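A one-class SVM is trained only on readings from normal operation and then flags readings that fall outside the learned region. The sketch below assumes two hypothetical sensor channels (temperature and vibration) with made-up operating ranges; `nu=0.05` is an assumed upper bound on the fraction of training points treated as outliers.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Hypothetical readings under normal operation: [temperature in C, vibration in mm/s]
healthy = rng.normal(loc=[70.0, 2.0], scale=[2.0, 0.2], size=(400, 2))

# Scale the features, then learn a boundary around the normal readings
monitor = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
monitor.fit(healthy)

# One typical reading and one far outside the normal operating range
new_readings = np.array([[70.5, 2.1], [95.0, 4.5]])
status = monitor.predict(new_readings)   # +1 = looks normal, -1 = anomaly
print(status)
```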


### <a id='toc3_8_'></a>[Financial Market Analysis](#toc0_)


Unsupervised learning techniques help in understanding complex financial data.


**Applications:**
- Portfolio optimization
- Risk assessment
- Market trend analysis


💡 **Tip:** Combining unsupervised learning with time series analysis can provide deeper insights into financial market behavior.


These real-world applications demonstrate the versatility and power of unsupervised learning techniques. By extracting meaningful patterns and structures from unlabeled data, these methods continue to drive innovation and insights across numerous fields, from business and technology to science and healthcare.

## <a id='toc4_'></a>[Summary](#toc0_)

Unsupervised learning represents a powerful and versatile branch of machine learning that focuses on extracting insights and patterns from unlabeled data. Throughout this lecture, we've explored various aspects of unsupervised learning, its objectives, and its real-world applications. Let's recap the key points:

1. **Definition:** Unsupervised learning algorithms work with unlabeled data to discover hidden structures and patterns without predefined outcomes.

2. **Contrast with Supervised Learning:** Unlike supervised learning, unsupervised learning doesn't rely on labeled data for training, making it suitable for exploratory data analysis and pattern discovery.

3. **Main Objectives:**
   - Pattern discovery
   - Dimensionality reduction
   - Feature learning
   - Density estimation
   - Data compression
   - Generative modeling
   - Exploratory data analysis

4. **Common Techniques:**
   - Clustering (e.g., K-means, hierarchical clustering)
   - Dimensionality reduction (e.g., PCA, t-SNE)
   - Association rule learning
   - Autoencoders


Unsupervised learning continues to evolve, with exciting developments in areas such as:

- Self-supervised learning
- Generative adversarial networks (GANs)
- Unsupervised reinforcement learning
- Advances in clustering algorithms for high-dimensional data


As the field progresses, we can expect unsupervised learning to play an increasingly important role in extracting value from the vast amounts of unlabeled data generated daily.


In conclusion, unsupervised learning offers a unique approach to understanding complex datasets, uncovering hidden patterns and generating insights that might otherwise go unnoticed. By mastering these techniques, data scientists and researchers can unlock new possibilities in data analysis and machine learning applications across various industries and scientific domains.