'''
 * Copyright (c) 2005 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

![image.png](attachment:image.png)
Examples of handwritten digits taken from US zip codes.

![image-2.png](attachment:image-2.png)
Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.

## Pattern Recognition and Machine Learning

The problem of searching for patterns in data is fundamental and has a long history. For example, Tycho Brahe's extensive astronomical observations in the 16th century allowed Johannes Kepler to discover empirical laws of planetary motion. These discoveries later laid the foundation for classical mechanics.

Similarly, discovering regularities in atomic spectra played a crucial role in developing quantum physics in the early 20th century. The field of **pattern recognition** deals with automatic discovery of data regularities through algorithms and applies these regularities to classify data into various categories.

Consider recognizing handwritten digits. Each digit can be represented as a $28 \times 28$ pixel image, forming a vector $x$ of 784 real numbers. The goal is to create a machine learning model that maps an input vector $x$ to an output indicating the identity of the digit (0–9).
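As a minimal sketch of this representation, the snippet below flattens a $28 \times 28$ array into a vector of 784 numbers. The random array is only a stand-in for a real digit image, and the row-by-row ordering is an assumption; any fixed ordering works as long as it is applied consistently.

```python
import numpy as np

# Stand-in for a grayscale digit image (random values, not real data).
rng = np.random.default_rng(0)
image = rng.random((28, 28))

# Flatten row by row into the input vector x of 784 real numbers.
x = image.reshape(-1)

print(x.shape)  # (784,)
```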

Building such a machine is challenging because handwriting varies widely. Hand-crafted rules for telling digits apart tend to multiply into rules and exceptions to rules, and give poor results. Machine learning offers a better approach: a large set of $N$ digits $\{x_1, \dots, x_N\}$, called a **training set**, is used to tune the parameters of an adaptive model. Each image in the training set is labeled with its correct digit, and this label is expressed as a **target vector** $t$. The learning algorithm produces a function $y(x)$, which takes a new image $x$ and outputs a vector $y$, encoded in the same way as the target vectors, that corresponds to the predicted digit.

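The precise form of the function $y(x)$ is determined during the **training phase** using the training data. After training, the model can be applied to new digit images, which make up the **test set**. The ability to categorize correctly new inputs that were not part of the training set is called **generalization**. Because the training data can cover only a tiny fraction of all possible inputs, generalization is a central goal in pattern recognition.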

In most applications, input data undergoes **pre-processing** to simplify the recognition task. For digit recognition, images are typically translated and scaled to fit within a fixed-size box, reducing variability. Pre-processing might also reduce dimensionality, making computation faster and more efficient.
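The translation part of this pre-processing can be sketched as follows. The helper `center_digit` and its `threshold` parameter are hypothetical names introduced for illustration; the sketch only shifts the inked region to the middle of the canvas and leaves out scaling, which would need an interpolation step.

```python
import numpy as np

def center_digit(image, threshold=0.0):
    """Translate the inked region of `image` to the center of the canvas."""
    inked = image > threshold
    rows = np.any(inked, axis=1)
    cols = np.any(inked, axis=0)
    if not rows.any():
        return image.copy()  # blank image: nothing to move

    # Bounding box of the inked pixels.
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    crop = image[r0:r1 + 1, c0:c1 + 1]

    # Paste the crop into the middle of an empty canvas of the same size.
    out = np.zeros_like(image)
    h, w = crop.shape
    top = (image.shape[0] - h) // 2
    left = (image.shape[1] - w) // 2
    out[top:top + h, left:left + w] = crop
    return out

# Toy example: a 6x4 blob near the top-left corner of a 28x28 canvas.
img = np.zeros((28, 28))
img[2:8, 3:7] = 1.0
centered = center_digit(img)
```

After this step, two images of the same shape drawn at different positions map to the same array, which removes one source of variability before learning.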

## Learning Paradigms

- **Supervised Learning**: Training data includes input vectors and corresponding target vectors. Examples include **classification** (e.g., digit recognition) and **regression** (predicting continuous variables, such as chemical yield).
  
- **Unsupervised Learning**: Training data consists only of input vectors without target values. The aim is to discover patterns like **clustering** (grouping similar examples) or **density estimation** (modeling data distribution).

- **Reinforcement Learning**: The algorithm learns by interacting with the environment to maximize a reward. Unlike supervised learning, there are no explicit examples of correct outputs. The system learns through trial and error, balancing **exploration** (trying new actions) and **exploitation** (using actions known to provide high rewards).

Reinforcement learning continues to be an active area of research.


![image.png](attachment:image.png)

The error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w).

## Polynomial Curve Fitting Example

We introduce a simple **regression** problem to motivate key concepts in curve fitting. Suppose we observe a real-valued input variable $x$, and we want to predict a real-valued target variable $t$. For illustration, consider an artificial dataset generated from the function $\sin(2\pi x)$ with added Gaussian noise, as shown in Appendix A.

Let the training set consist of $N$ observations of $x$ (denoted as $\mathbf{x} = (x_1, \dots, x_N)^\top$) and their corresponding target values $\mathbf{t} = (t_1, \dots, t_N)^\top$. The inputs $x_n$ are spaced uniformly in the range $[0, 1]$, and the corresponding $t_n$ values are computed as:

$$
t_n = \sin(2\pi x_n) + \text{noise}
$$

The training set in Figure 1.2 comprises $N = 10$ data points, with noise added using a Gaussian distribution. This method mimics real-world datasets, where underlying patterns exist but individual observations are corrupted by noise.
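A data set of this kind can be generated with a few lines of NumPy. This is a sketch under stated assumptions: the function name `make_sine_data`, the noise standard deviation of 0.3, and the random seed are illustrative choices, since the text above does not specify them.

```python
import numpy as np

def make_sine_data(n_points=10, noise_std=0.3, seed=0):
    """Return inputs x spaced uniformly on [0, 1] and noisy targets t."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    # noise_std is an assumed value, not one taken from the text.
    noise = rng.normal(0.0, noise_std, size=n_points)
    t = np.sin(2 * np.pi * x) + noise
    return x, t

x_train, t_train = make_sine_data()
print(x_train.shape, t_train.shape)  # (10,) (10,)
```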

Our goal is to predict a new value $\hat{t}$ for a given $\hat{x}$. In essence, we aim to discover the underlying function $\sin(2\pi x)$. However, because the observed data are noisy, there is uncertainty in predicting $\hat{t}$ for $\hat{x}$.

## Polynomial Curve Fitting

We begin by fitting the data with a polynomial of the form:

$$
y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j
$$

where $M$ is the order of the polynomial, and the coefficients $w_0, \dots, w_M$ are collectively represented as the vector $\mathbf{w}$.

Although the polynomial function $y(x, \mathbf{w})$ is nonlinear in $x$, it is linear in the parameters $\mathbf{w}$. Such models, which are linear in their unknown parameters, are called **linear models**.
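The linearity in $\mathbf{w}$ can be checked numerically. In the sketch below, `design_matrix` and `polynomial` are hypothetical helper names; the polynomial is evaluated as a matrix-vector product whose matrix holds the powers $x^0, \dots, x^M$, so scaling or adding coefficient vectors scales or adds the outputs.

```python
import numpy as np

def design_matrix(x, M):
    """Matrix whose column j holds x**j for j = 0..M."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, M + 1, increasing=True)

def polynomial(x, w):
    """Evaluate y(x, w) = sum_j w_j * x**j at every point in x."""
    w = np.asarray(w, dtype=float)
    return design_matrix(x, len(w) - 1) @ w

x = np.linspace(0.0, 1.0, 5)
w1 = np.array([1.0, -2.0, 0.5])
w2 = np.array([0.0, 3.0, 1.0])

# Linear in w: y(x, a*w1 + b*w2) equals a*y(x, w1) + b*y(x, w2).
lhs = polynomial(x, 2.0 * w1 + 3.0 * w2)
rhs = 2.0 * polynomial(x, w1) + 3.0 * polynomial(x, w2)
print(np.allclose(lhs, rhs))  # True
```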

## Error Function

The polynomial coefficients $\mathbf{w}$ are determined by minimizing an error function, which measures the difference between the model predictions $y(x_n, \mathbf{w})$ and the actual target values $t_n$. A common error function is the sum of squares of errors:

$$
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2
$$

The factor $\frac{1}{2}$ is included for convenience in later calculations. The error function is non-negative, and it is zero if and only if $y(x, \mathbf{w})$ passes exactly through every training point.
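The error function and its minimization can be sketched in NumPy as follows. Because the model is linear in $\mathbf{w}$, minimizing $E(\mathbf{w})$ is a linear least-squares problem. Using `np.linalg.lstsq` is one way to solve it and is an assumption of this sketch, not a method named in the text; the helper names, the noise level, and the seed are likewise illustrative.

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)**2 for a polynomial model."""
    w = np.asarray(w, dtype=float)
    Phi = np.vander(np.asarray(x, dtype=float), len(w), increasing=True)
    residuals = Phi @ w - np.asarray(t, dtype=float)
    return 0.5 * np.sum(residuals ** 2)

def fit_polynomial(x, t, M):
    """Coefficients w_0..w_M that minimize the sum-of-squares error."""
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(t, dtype=float), rcond=None)
    return w

# Synthetic data: sin(2*pi*x) plus Gaussian noise (assumed std 0.3).
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, size=10)

w_fit = fit_polynomial(x_train, t_train, M=3)
e_fit = sum_of_squares_error(w_fit, x_train, t_train)
e_zero = sum_of_squares_error(np.zeros(4), x_train, t_train)
print("fitted error:", e_fit, "error with w = 0:", e_zero)
```

The fitted coefficients give an error no larger than any other choice of $\mathbf{w}$ of the same order, which is why the comparison against $\mathbf{w} = \mathbf{0}$ is a quick sanity check.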
