# Notations

## Numerical Objects

<img src="0.1 numerical_objects.png" alt="Numerical Objects" width="800"/>

## Set Theory

<img src="0.2 set_theory.png" alt="Set Theory" width="800"/>

## Functions and Operators

<img src="0.3 functions_and_operators.png" alt="Functions and Operators" width="800"/>

## Calculus

<img src="0.4 calculus.png" alt="Calculus" width="800"/>

## Probability and Information Theory

<img src="0.5 probability_and_information_theory.png" alt="Probability and Information Theory" width="800" />

# Introduction

<center><strong>Machine learning is the study of algorithms that can learn from experience</strong></center>

## Overview

Traditional computer programs operate on **rigid**, **predefined rules**. For example, an e-commerce platform might use a set of explicit instructions to manage user interactions, database operations, and business logic. While this approach works well for many applications, it falls short when dealing with complex, dynamic, or poorly understood problems.

### Limitations of Rule-Based Systems

Consider these challenging tasks:
- Predicting tomorrow's weather
- Answering free-form text questions
- Identifying people in images
- Recommending products users might enjoy

These problems are difficult to solve with traditional programming methods due to:
1. Changing patterns over time
2. Complex relationships between inputs and outputs
3. Processes that lie beyond our conscious understanding

### The Machine Learning Approach

**Machine learning algorithms learn from experience**, typically in the form of data or environmental interactions. As they accumulate experience, their performance improves. This adaptability makes machine learning particularly suited for tasks where:

1. Patterns change over time
2. Relationships are too complex for manual coding
3. The precise steps to perform the task are unknown

Deep learning, a powerful subset of machine learning techniques, is driving innovations in various fields, including computer vision, natural language processing, healthcare, and genomics.

## A Motivating Example: Voice Recognition

Modern smartphones use multiple machine learning models in everyday interactions. Consider the process of using voice commands to get directions:

1. Wake word recognition ("Hey Siri")
2. Speech-to-text conversion
3. Natural language processing
4. Route planning and travel time prediction

### The Challenge of Programming Voice Recognition

Coding a wake word recognizer from scratch is extremely difficult:

- Audio input: ~44,000 samples per second
- Complex mapping from raw audio to yes/no predictions
- No clear rules for recognition

### The Machine Learning Approach

Instead of explicit programming, we:

1. Collect a large dataset of audio snippets and labels
2. Define a flexible program with **adjustable parameters**
3. Use a learning algorithm to find optimal parameter values

Key concepts:
- **Model**: A program whose parameters have been fixed to specific values
- **Model family**: The set of all distinct programs that can be produced by adjusting the parameters
- **Learning algorithm**: The procedure that uses the dataset to choose the parameter values
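
As a minimal sketch of these terms, the snippet below treats a one-feature threshold rule as the model family. The function `make_model` and the parameter names `weight` and `bias` are hypothetical, chosen only for illustration; a real wake-word recognizer would have far more parameters.

```python
def make_model(weight, bias):
    """Return one model: a program whose parameters are now fixed."""
    def model(x):
        # Predict "yes" (True) when the weighted input exceeds the threshold.
        return weight * x + bias > 0
    return model

# Two members of the same model family, differing only in parameter values.
model_a = make_model(weight=1.0, bias=-0.5)
model_b = make_model(weight=-2.0, bias=1.0)

print(model_a(1.0), model_b(1.0))
```

Every choice of `(weight, bias)` gives a different model; the learning algorithm's job is to pick a good one from this family using data.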

### The Training Process

1. Start with a **randomly initialized model**
2. Use **labeled data** (audio snippets and corresponding labels)
3. **Adjust parameters** to improve performance on the data
4. Repeat steps 2-3 until satisfactory performance is achieved

This approach, "programming with data," allows us to create complex systems like wake-word recognizers, image classifiers, and more without explicitly coding the rules. **Deep learning** is one of the most powerful tools for solving these kinds of problems.

<img src="1.1.2 training process.png" alt="Training Process" width="800"/>

To summarize, rather than code up a wake-word recognizer, we code up a program that can **learn** to recognize wake words from a large dataset.

## Key Components
1. **Data:** to learn from.
2. **Model:** to transform the data.
3. **Objective/Loss Function:** to quantify how well the model is doing.
4. **Optimization Algorithm:** to optimize the objective/loss function by adjusting the model’s parameters.

### 1. Data

The foundation of any machine learning task is data, which consists of examples with features (inputs) and labels (targets). The data must be accurately represented numerically. Fixed-length vectors are convenient but not always possible, especially with varying-length data like text. Large, high-quality datasets are crucial for training effective models, as poor data quality can lead to biased or inaccurate predictions.

### 2. Model

Models transform data into predictions. They can range from simple statistical models to complex deep learning architectures. Deep learning models are characterized by multiple layers of transformations, allowing them to handle complex tasks that simpler models cannot.

### 3. Objective/Loss Function

These functions measure the performance of a model. Typically, lower values indicate better performance. Common objective functions include squared error for regression and error rate for classification. In practice, surrogate objectives might be used for optimization due to the complexity of directly optimizing some functions.
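
The two objectives named above can be sketched in a few lines. The helper names `squared_error` and `error_rate` and the sample values are assumptions made for illustration.

```python
def squared_error(predictions, targets):
    """Mean squared error for regression."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def error_rate(predicted_labels, true_labels):
    """Fraction of misclassified examples."""
    wrong = sum(p != t for p, t in zip(predicted_labels, true_labels))
    return wrong / len(true_labels)

mse = squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
err = error_rate(["spam", "ham", "spam", "ham"], ["spam", "spam", "spam", "ham"])
print(mse, err)
```

Error rate only changes when a prediction flips from right to wrong, so it gives no useful gradient. This is one reason a smooth surrogate is usually optimized instead.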

### 4. Optimization Algorithm

Algorithms like gradient descent are used to adjust model parameters to minimize the loss function. This involves iteratively updating parameters in the direction that reduces the loss.
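
A minimal sketch of gradient descent, assuming a one-feature linear model `w * x + b` trained with mean squared error. The toy data, the learning rate, and the number of steps are illustrative choices, not values from any particular system.

```python
# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # initial parameters
learning_rate = 0.05     # step size (an assumed value)

for step in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move each parameter a small step against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))
```

The loop touches all four components: data (`xs`, `ys`), a model (`w * x + b`), a loss (mean squared error), and an optimizer (the update rule).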

## Kinds of Machine Learning Problems

### Supervised Machine Learning

In supervised learning, the algorithm is trained on a labeled dataset. Each input-label pair in the dataset is called an **example**.

Supervised learning can be divided into two main types:
- **Classification**: Predicts `discrete categories`, such as determining if an email is spam or not.
- **Regression**: Predicts `continuous values`, like estimating house prices based on features.

- **Applications**: Supervised learning is used in various applications, including image recognition, fraud detection, and medical diagnosis.

- **Optimization**: Models are optimized using loss functions, such as **cross-entropy** for classification and **squared error** for regression, to minimize prediction errors.

#### Regression

Regression is focused on predicting a **continuous numerical target** from input features. It is commonly used to answer **"how much?"** or **"how many?"** questions.

- **Feature Vectors**: The inputs of each example are represented by a **fixed-length vector** of features.

**Real-World Applications**: 
  - Predicting movie ratings based on user preferences.
  - Estimating the duration of a surgery.
  - Forecasting rainfall amounts.

**Modeling and Loss Function**: Regression models aim to minimize the difference between predicted and actual values, often using the **squared error loss** function. Minimizing squared error corresponds to assuming that the observed targets are corrupted by Gaussian noise.
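
The link between squared error and Gaussian noise can be made explicit with a short derivation. The symbols are assumptions introduced here: $\hat{y}$ is the model's prediction for input $x$, and the observed target is $y = \hat{y} + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The negative log-likelihood of one example is then

$$
-\log p(y \mid x) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y - \hat{y})^2}{2\sigma^2}
$$

The first term does not depend on the model's parameters, so for a fixed $\sigma$, maximizing the likelihood is the same as minimizing the squared error $(y - \hat{y})^2$.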

#### Classification

Classification is focused on predicting the **category or class of an input** from a discrete set of options.

- **Binary Classification**: Involves two classes, such as determining whether an email is spam or not.

- **Multiclass Classification**: Involves more than two classes, such as classifying handwritten digits (0-9). Models assign **probabilities** to each class, allowing for uncertainty estimation.

- **Hierarchical Classification**: Addresses structured classes where errors vary in severity. For example, confusing similar dog breeds is less severe than confusing a dog with a dinosaur.

**Model Output and Optimization:**

- **Probabilistic Output**: Models often output probabilities for each class, which helps in decision-making by conveying uncertainty. For instance, a model might predict a 90% probability that an image is a cat.

- **Loss Function**: **Cross-entropy** is commonly used to measure the difference between predicted and true class probabilities.
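
As a sketch of how probabilistic outputs and cross-entropy fit together, the snippet below converts raw class scores into probabilities with a softmax and then scores them against the true class. The score values and function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])       # hypothetical scores for 3 classes
loss_right = cross_entropy(probs, 0)   # true class is the most likely one
loss_wrong = cross_entropy(probs, 2)   # true class is the least likely one
print(probs, loss_right, loss_wrong)
```

The loss is small when the model puts high probability on the true class and grows quickly as that probability approaches zero.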


**Risk Assessment**: Beyond probabilities, decision-making involves weighing potential risks and benefits. For example, even with an 80% probability that a mushroom is safe, the risk of it being poisonous may deter consumption.
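
The mushroom example can be turned into a small expected-loss calculation. The loss values below are made up for illustration; only the 80% safety probability comes from the text.

```python
def expected_loss(p_poisonous, loss_if_poisonous, loss_if_safe):
    """Average loss of an action, weighted by the predicted probabilities."""
    return p_poisonous * loss_if_poisonous + (1 - p_poisonous) * loss_if_safe

p_poisonous = 0.2  # the model is 80% sure the mushroom is safe

# Hypothetical losses: eating a poisonous mushroom is catastrophic,
# discarding a safe one only wastes a meal.
risk_eat = expected_loss(p_poisonous, loss_if_poisonous=1000.0, loss_if_safe=0.0)
risk_discard = expected_loss(p_poisonous, loss_if_poisonous=0.0, loss_if_safe=1.0)

action = "eat" if risk_eat < risk_discard else "discard"
print(risk_eat, risk_discard, action)
```

Even though "safe" is the most probable class, the action with the lower expected loss is to discard the mushroom.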

### Tagging

In the context of classification problems, traditional binary or multiclass classification might not always be suitable, especially when dealing with scenarios **where multiple labels can be applied simultaneously**. A classic example is identifying animals in an image like the "Town Musicians of Bremen," which features multiple animals together. In such cases, multi-label classification is more appropriate, allowing the model to identify all present categories, such as a cat, dog, donkey, and rooster, rather than forcing a single category choice.

This concept extends to **auto-tagging** problems, such as tagging posts on a technical blog with multiple relevant categories like "machine learning," "technology," and "cloud computing." These tags often exhibit correlation, as certain topics frequently appear together.

In more complex scenarios, such as tagging medical articles in **PubMed**, a vast set of possible tags exists, drawn from the **Medical Subject Headings (MeSH)** ontology, which includes around 28,000 tags. Accurate tagging is crucial for facilitating comprehensive literature reviews. Given the time-consuming nature of manual tagging, machine learning can assist by providing provisional tags, helping bridge the gap until a manual review is completed. **Competitions** like those hosted by **BioASQ** aim to improve machine learning techniques for such tasks.
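
One common way to set up multi-label tagging is to give every label its own independent probability and keep all labels above a threshold. The sketch below assumes hypothetical per-label scores and a threshold of 0.5; it ignores the correlations between tags mentioned above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

labels = ["cat", "dog", "donkey", "rooster", "horse"]
scores = [2.1, 1.4, 0.8, 3.0, -2.5]   # hypothetical per-label model scores

# Each label gets its own independent probability; they need not sum to 1.
probabilities = [sigmoid(s) for s in scores]
predicted = [name for name, p in zip(labels, probabilities) if p >= 0.5]
print(predicted)
```

Unlike a softmax, which forces the classes to compete for a single choice, this setup lets any number of labels be switched on at once.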

<img src="1.3.3 tagging.png" alt="Tagging" width="400" />

### Search

In **information retrieval**, particularly in web search, the focus is on **ranking items** to determine which results should be most prominently displayed to a user. The objective is not just to identify relevant pages, but to prioritize them effectively. Initially, systems like **Google's PageRank** assigned scores to pages based on their authority, independent of the specific query. This involved a simple relevance filter to identify relevant candidates, followed by using PageRank to prioritize them.

Today, search engines have evolved to use machine learning and **behavioral models** to generate query-dependent relevance scores, allowing for more personalized and accurate search results. This area of study is so significant that it has spawned entire academic conferences dedicated to exploring and advancing these techniques.
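
The early filter-then-rank pipeline described above can be sketched as follows. The pages, the `authority` values (standing in for a PageRank-style score), and the exact-word relevance filter are all hypothetical simplifications.

```python
# Hypothetical pages with a query-independent authority score
# (standing in for a PageRank-style value).
pages = [
    {"url": "a.example", "text": "deep learning tutorial", "authority": 0.30},
    {"url": "b.example", "text": "cooking recipes", "authority": 0.90},
    {"url": "c.example", "text": "learning to rank with deep models", "authority": 0.55},
]

def search(query, pages):
    terms = query.lower().split()
    # Stage 1: a simple relevance filter keeps pages containing every query term.
    candidates = [p for p in pages if all(t in p["text"].split() for t in terms)]
    # Stage 2: order the candidates by their query-independent authority score.
    return sorted(candidates, key=lambda p: p["authority"], reverse=True)

results = [p["url"] for p in search("deep learning", pages)]
print(results)
```

A learned, query-dependent ranker would replace the fixed `authority` key with a score computed from both the query and the page.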

### Recommender System

Recommender systems are closely **related to search** and **ranking**, with the key distinction being their focus on personalizing content for individual users. Unlike search engines, which aim to rank relevant items for general queries, recommender systems tailor their suggestions based on user-specific preferences. For example, a science fiction fan and a Peter Sellers comedy enthusiast would receive different movie recommendations.

These systems **rely on** both **explicit feedback**, like product ratings and reviews, and **implicit feedback**, such as skipped songs in a playlist, to estimate user preferences. The simplest models predict scores like expected ratings or purchase probabilities, which are then used to recommend items with the highest scores to users. Advanced systems incorporate detailed user behavior and item characteristics to refine these predictions.
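
The simplest "predict a score, then recommend the top items" scheme might look like the sketch below. The ratings are invented, and `recommend` is a hypothetical helper rather than part of any library.

```python
# Hypothetical predicted ratings for one user (item -> expected rating).
predicted_ratings = {
    "Dune": 4.6,
    "The Pink Panther": 2.1,
    "Blade Runner": 4.8,
    "Dr. Strangelove": 2.9,
}
already_seen = {"Dune"}

def recommend(scores, seen, k=2):
    """Return the k unseen items with the highest predicted score."""
    unseen = {item: s for item, s in scores.items() if item not in seen}
    return sorted(unseen, key=unseen.get, reverse=True)[:k]

top_items = recommend(predicted_ratings, already_seen)
print(top_items)
```

In a real system the scores would come from a model trained on explicit and implicit feedback rather than from a hand-written dictionary.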

Despite their economic value, recommender systems face challenges due to their reliance on censored feedback, where users tend to rate only items they feel strongly about, leading to skewed data. Additionally, feedback loops can occur, where items recommended more frequently are perceived as better due to increased purchases, reinforcing their prominence in recommendations. Addressing issues like feedback loops and data censoring remains an important area of ongoing research.

### Sequence Learning

Sequence learning addresses problems where **inputs** and **outputs** are **not fixed in number**, but rather consist of sequences that can vary in length. Unlike traditional models that handle independent observations, sequence learning models consider the context provided by previous and succeeding elements in a sequence, making them suitable for tasks like video analysis, language processing, and medical monitoring.

In sequence learning, models are designed to handle sequences of inputs and/or outputs. A key example is sequence-to-sequence learning, where both inputs and outputs are variable-length sequences. This is crucial for tasks such as:

- **Machine Translation**: Translating sentences from one language to another, where input and output sequences may differ in length and order.
- **Automatic Speech Recognition**: Converting audio recordings into text, where the audio input is much longer than the text output.
- **Text to Speech**: The reverse of speech recognition, where text is converted into audio, resulting in a longer output sequence.

Other sequence learning tasks include **tagging** and **parsing**, where text sequences are annotated with attributes like **parts of speech** or **named entities**, and **dialogue systems**, which require understanding context over long temporal distances.

Sequence learning is a vibrant area of research, with challenges such as handling unaligned data, as seen in machine translation, and managing complex dialogue interactions. These tasks require sophisticated models that can capture the nuances of sequential data, making sequence learning one of the most exciting applications of machine learning today.

## Unsupervised and Self-Supervised Learning

**Unsupervised** and **self-supervised learning** represent areas of machine learning **where models learn from data without explicit labels**, unlike supervised learning, where models are trained with labeled datasets. This approach is akin to working with a vague directive, where the data scientist must creatively determine the questions to ask and the insights to derive.

**Unsupervised Learning** focuses on tasks such as:

- **Clustering**: Grouping data into categories, like sorting photos into landscapes, animals, and people, or categorizing users based on browsing behavior.
- **Subspace Estimation / Dimensionality Reduction**: Identifying a small set of parameters that capture essential data characteristics, such as using principal component analysis to reduce dimensionality.
- **Representation in Euclidean Space**: Mapping complex objects and their relationships into a vector space so that symbolic relationships show up as geometric ones, for example "Rome" - "Italy" + "France" ≈ "Paris" for word vectors.
- **Causality and Probabilistic Models**: Understanding the root causes and relationships within data, such as analyzing demographic factors affecting house prices.
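
As an illustration of dimensionality reduction, the sketch below finds the first principal component of a few 2-D points in plain Python. The data is made up, and power iteration is used here only to keep the example dependency-free; libraries usually compute this with an eigendecomposition or SVD.

```python
# Hypothetical 2-D points that lie close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 sample covariance matrix.
cxx = sum(x * x for x, _ in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)

# Power iteration: repeatedly apply the covariance matrix and renormalize
# to converge on its leading eigenvector (the first principal component).
vx, vy = 1.0, 0.0
for _ in range(100):
    vx, vy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
    norm = (vx ** 2 + vy ** 2) ** 0.5
    vx, vy = vx / norm, vy / norm

# Each 2-D point is now summarized by a single coordinate along that direction.
projected = [x * vx + y * vy for x, y in centered]
print((vx, vy), projected)
```

Because the points lie almost on a line, one coordinate along `(vx, vy)` retains nearly all of the variance in the original two.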

A significant advancement in unsupervised learning is **deep generative models**, which **estimate data density** and **generate new data** samples. Notable models include **variational autoencoders, generative adversarial networks, normalizing flows,** and **diffusion models**.

**Self-Supervised Learning** leverages unlabeled data to create supervisory signals. In text, models can **predict masked words using context**, while in images, models might learn by predicting the relative position of image parts or identifying perturbed versions of the same image. These techniques often produce robust representations that can be fine-tuned for specific tasks, enhancing their applicability in various domains.
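
The masked-word idea can be sketched by showing how training pairs are produced from raw text with no human labeling. The function name and the `[MASK]` token are assumptions for illustration.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked input, target word) pairs from one unlabeled sentence."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

examples = make_masked_examples("the cat sat on the mat")
for masked_input, target in examples[:2]:
    print(masked_input, "->", target)
```

Each pair is an ordinary supervised example, but the "label" is a word hidden from the input rather than an annotation.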

## Interacting with an Environment

Interacting with an environment introduces a dynamic aspect to machine learning that goes beyond the static nature of offline learning. In traditional supervised and unsupervised learning, models are trained on pre-collected datasets without further interaction with the environment. This approach, known as **offline learning**, allows for isolated pattern recognition but limits the capability to develop intelligent agents that can act and adapt in real-world scenarios.

To create truly intelligent agents, we need to consider how actions impact the environment and future observations. This involves addressing several key questions:

- Does the environment remember what we did previously?
- Does the environment want to help us, e.g., a user reading text into a speech recognizer?
- Does the environment want to beat us, e.g., spammers adapting their emails to evade spam filters?
- Does the environment have shifting dynamics? For example, would future data always resemble the past or would the patterns change over time, either naturally or in response to our automated tools?

These considerations lead to the concept of **distribution shift**, where the data encountered during deployment differs from the training data. This is akin to facing different exam questions than those practiced in homework.

**Reinforcement learning** provides a framework for addressing these challenges by allowing an agent to learn through interaction with the environment. The **agent makes decisions, receives feedback,** and **adjusts its actions** to maximize long-term rewards, making it well-suited for scenarios where actions directly impact future states and observations. This approach is essential for developing adaptive, intelligent systems capable of operating effectively in dynamic and complex environments.

## Reinforcement Learning

**Reinforcement learning (RL)** is a machine learning paradigm focused on **developing agents** that interact with environments to make decisions and take actions. It is particularly useful in applications such as robotics, dialogue systems, and AI for video games. **Deep reinforcement learning**, which combines deep learning with RL, has gained prominence with achievements like the **deep Q-network** surpassing humans in Atari games and AlphaGo defeating the world champion in Go.

In RL, an agent interacts with an environment over time, receiving observations and selecting actions that influence the environment. The agent receives rewards based on its actions, and its behavior is guided by a policy—a function mapping observations to actions. The primary goal is to develop policies that maximize long-term rewards.

RL is versatile and can be applied to problems beyond the scope of supervised learning. Unlike supervised learning, where inputs come with correct labels, RL does not assume optimal actions for each observation. Instead, agents receive rewards, which may not directly indicate which actions led to success, presenting the credit assignment problem. For example, in chess, the reward comes only at the game's end, requiring the agent to determine which moves contributed to winning or losing.

RL also addresses partial observability, where the current observation might not fully reveal the agent's state. For instance, a cleaning robot trapped in one of many identical closets must infer its location from earlier observations.

A key challenge in RL is balancing exploration and exploitation. An agent must decide whether to exploit known strategies for immediate rewards or explore new strategies that might yield better long-term outcomes.

The RL framework encompasses various problem settings:

- **Markov Decision Processes (MDPs)**: When the environment is fully observable.
- **Contextual Bandit Problems**: When the state does not depend on previous actions.
- **Multi-Armed Bandit Problems**: When there is no state, just a set of actions with unknown rewards.

These specialized cases help researchers tackle RL problems with varying levels of complexity, making RL a versatile and widely applicable approach in machine learning.
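
The exploration-exploitation trade-off is easiest to see in the multi-armed bandit setting. Below is a minimal epsilon-greedy sketch, which is one simple strategy among many; the win rates, `epsilon`, and the number of steps are hypothetical.

```python
import random

random.seed(0)                        # fixed seed for reproducibility
true_win_rates = [0.2, 0.5, 0.8]      # hypothetical arms, unknown to the agent
epsilon = 0.1                         # fraction of steps spent exploring

counts = [0] * len(true_win_rates)    # pulls per arm
values = [0.0] * len(true_win_rates)  # running average reward per arm

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_win_rates))                     # explore
    else:
        arm = max(range(len(true_win_rates)), key=lambda a: values[a])  # exploit
    reward = 1.0 if random.random() < true_win_rates[arm] else 0.0
    counts[arm] += 1
    # Incremental update of the average reward observed for this arm.
    values[arm] += (reward - values[arm]) / counts[arm]

best_arm = max(range(len(true_win_rates)), key=lambda a: values[a])
print(best_arm, counts)
```

With `epsilon = 0` the agent can lock onto the first arm that happens to pay out. A small amount of random exploration lets it discover better arms at the cost of some wasted pulls.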

<img src="1.3.7 reinforcement learning.png" alt="Reinforcement Learning" width="800"/>