In [None]:
'''
 * Copyright (c) 2004 Radhamadhab Dalai
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
'''

# Deep Learning Book - Chapter 1: Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997). When programmable computers were first conceived, people wondered whether they might become intelligent, over a hundred years before one was built (Lovelace, 1842). 

Today, artificial intelligence (AI) is a thriving field with many practical applications and active research topics. We look to intelligent software to automate routine labor, understand speech or images, make diagnoses in medicine and support basic scientific research.

In the early days of artificial intelligence, the field rapidly tackled and solved problems that are intellectually difficult for human beings but relatively straightforward for computers—problems that can be described by a list of formal, mathematical rules. The true challenge to artificial intelligence proved to be solving the tasks that are easy for people to perform but hard for people to describe formally—problems that we solve intuitively, that feel automatic, like recognizing spoken words or faces in images.

## The Deep Learning Solution

This book is about a solution to these more intuitive problems. This solution is to allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI **deep learning**.

## Early AI Successes and Limitations

Many of the early successes of AI took place in relatively sterile and formal environments and did not require computers to have much knowledge about the world. For example, IBM's Deep Blue chess-playing system defeated world champion Garry Kasparov in 1997 (Hsu, 2002). Chess is of course a very simple world, containing only sixty-four locations and thirty-two pieces that can move in only rigidly circumscribed ways. 

Devising a successful chess strategy is a tremendous accomplishment, but the challenge is not due to the difficulty of describing the set of chess pieces and allowable moves to the computer. Chess can be completely described by a very brief list of completely formal rules, easily provided ahead of time by the programmer.

Ironically, abstract and formal tasks that are among the most difficult mental undertakings for a human being are among the easiest for a computer. Computers have long been able to defeat even the best human chess player, but are only recently matching some of the abilities of average human beings to recognize objects or speech.

## The Knowledge Challenge

A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture this same knowledge in order to behave in an intelligent way. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer.

### The Knowledge Base Approach

Several artificial intelligence projects have sought to hard-code knowledge about the world in formal languages. A computer can reason about statements in these formal languages automatically using logical inference rules. This is known as the **knowledge base approach** to artificial intelligence. None of these projects has led to a major success.

One of the most famous such projects is Cyc (Lenat and Guha, 1989). Cyc is an inference engine and a database of statements in a language called CycL. These statements are entered by a staff of human supervisors. It is an unwieldy process. People struggle to devise formal rules with enough complexity to accurately describe the world.

#### The Fred Shaving Example

For example, Cyc failed to understand a story about a person named Fred shaving in the morning (Linde, 1992). Its inference engine detected an inconsistency in the story: it knew that people do not have electrical parts, but because Fred was holding an electric razor, it believed the entity "FredWhileShaving" contained electrical parts. It therefore asked whether Fred was still a person while he was shaving.
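
The failure mode described above can be imitated with a very small forward-chaining sketch. This is a hypothetical toy, not CycL or Cyc's actual inference engine: the triple format, the rule `holding_implies_contains`, and every function name are assumptions made only for illustration. It shows how one overly broad hand-written rule leads a logically consistent system to a conclusion that contradicts common sense.

```python
# Toy knowledge base: each fact is a (subject, relation, object) triple.
# All names and rules here are hypothetical; this is not CycL.

def holding_implies_contains(facts):
    """Overly broad rule: the parts of anything an entity holds become its parts."""
    derived = set()
    for subject, relation, obj in facts:
        if relation == "holds":
            for subject2, relation2, part in facts:
                if subject2 == obj and relation2 == "has_part":
                    derived.add((subject, "has_part", part))
    return derived


def forward_chain(facts, rules):
    """Apply every rule repeatedly until no new fact is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_facts = rule(facts) - facts
            if new_facts:
                facts |= new_facts
                changed = True
    return facts


def find_inconsistencies(facts):
    """Return entities that are persons and yet have electrical parts."""
    persons = {s for s, r, o in facts if r == "is_a" and o == "Person"}
    electrical = {s for s, r, o in facts
                  if r == "has_part" and o == "ElectricalParts"}
    return sorted(persons & electrical)


facts = {
    ("FredWhileShaving", "is_a", "Person"),
    ("FredWhileShaving", "holds", "ElectricRazor"),
    ("ElectricRazor", "has_part", "ElectricalParts"),
}
closed = forward_chain(facts, [holding_implies_contains])
conflicts = find_inconsistencies(closed)
print(conflicts)
```

Repairing the rule by hand is possible here, but every such repair is one more formal statement that a person has to anticipate and write down.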

## Machine Learning as a Solution

The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as **machine learning**.

The introduction of machine learning allowed computers to tackle problems involving knowledge of the real world and make decisions that appear subjective. Examples include:

- A simple machine learning algorithm called **logistic regression** can determine whether to recommend cesarean delivery (Mor-Yosef et al., 1990)
- A simple machine learning algorithm called **naive Bayes** can separate legitimate e-mail from spam e-mail
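
As a rough illustration of the second item, the sketch below trains a word-count naive Bayes classifier on four made-up messages. The messages, the label convention (1 = spam, 0 = legitimate), the function names, and the use of add-one smoothing are all assumptions chosen for brevity; a real filter would need far more data and more careful text handling.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count word occurrences per class and how often each class appears."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary


def predict(text, word_counts, class_counts, vocabulary):
    """Return the class (0 or 1) with the higher log-posterior score."""
    total_messages = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior: fraction of training messages in this class.
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: 1 = spam, 0 = legitimate.
messages = [
    "win free cash prize now",
    "urgent offer click to win money",
    "project meeting schedule update",
    "team report review before deadline",
]
labels = [1, 1, 0, 0]
model = train_naive_bayes(messages, labels)

spam_guess = predict("free money offer", *model)
ham_guess = predict("meeting about the project report", *model)
print(spam_guess, ham_guess)
```

The "naive" part is the assumption that words occur independently given the class, which is why the score is a plain sum of per-word log-probabilities.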

## The Representation Problem

The performance of these simple machine learning algorithms depends heavily on the **representation** of the data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant information, such as the presence or absence of a uterine scar. Each piece of information included in the representation of the patient is known as a **feature**.

Logistic regression learns how each of these features of the patient correlates with various outcomes. However, it cannot influence the way that the features are defined in any way. If logistic regression were given an MRI scan of the patient, rather than the doctor's formalized report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible correlation with any complications that might occur during delivery.
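
One standard way to write the model makes this limitation visible. The notation is introduced here for illustration and does not come from the text above: $\mathbf{x}$ is the feature vector supplied to the system, $\mathbf{w}$ and $b$ are the learned weights and bias, and $\sigma$ is the logistic sigmoid.

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Learning adjusts only $\mathbf{w}$ and $b$. Each weight $w_j$ scales one fixed input $x_j$, so the model can reweight the features it is given but has no way to construct new ones.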

### The Universality of Representation

This dependence on representations is a general phenomenon that appears throughout computer science and even daily life. In computer science, operations such as searching a collection of data can proceed exponentially faster if the collection is structured and indexed intelligently. People can easily perform arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming.
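
The numeral example can be made concrete with a short sketch. The helper names below are illustrative assumptions, and the code handles only well-formed numerals. The point is that the simplest way to add two Roman numerals is to convert them to a positional representation, add there, and convert back.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
ROMAN_SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def roman_to_int(numeral):
    """Convert a well-formed Roman numeral to an integer."""
    total = 0
    for i, symbol in enumerate(numeral):
        value = ROMAN_VALUES[symbol]
        # Subtractive notation: a smaller symbol before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def int_to_roman(number):
    """Convert a positive integer to a Roman numeral."""
    parts = []
    for value, symbol in ROMAN_SYMBOLS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def add_roman(a, b):
    """Add two Roman numerals by detouring through integers."""
    return int_to_roman(roman_to_int(a) + roman_to_int(b))


print(add_roman("XIV", "XXVII"))  # 14 + 27 = 41, printed as XLI
```

In positional notation the addition itself is a single `+`. All of the remaining code exists only to translate between representations.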

For example, consider the computational complexity of different representations:

$$\text{Search Time} = \begin{cases} 
O(\log n) & \text{if data is sorted and indexed} \\
O(n) & \text{if data is unsorted}
\end{cases}$$
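
The two cases in the expression above can be compared by counting loop iterations. This is a minimal sketch with illustrative function names. It treats a sorted Python list as the "sorted and indexed" case and, for the unsorted case, performs a linear scan that makes no use of ordering.

```python
def linear_search(items, target):
    """Check elements one by one; return (index, steps taken)."""
    steps = 0
    for index, value in enumerate(items):
        steps += 1
        if value == target:
            return index, steps
    return -1, steps


def binary_search(sorted_items, target):
    """Halve the candidate range on every step; return (index, steps taken)."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2
        steps += 1
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps


data = list(range(0, 2_000_000, 2))  # 1,000,000 sorted even numbers
target = data[-1]                    # worst case for the linear scan

lin_index, lin_steps = linear_search(data, target)
bin_index, bin_steps = binary_search(data, target)
print(f"linear scan:   {lin_steps} steps")
print(f"binary search: {bin_steps} steps")
```

For one million elements the linear scan needs one million steps in this worst case, while binary search needs no more than 20, since $2^{20} > 10^6$.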

It is not surprising that the choice of representation has an enormous effect on the performance of machine learning algorithms.

## Feature Engineering Challenges

Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. This estimate gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values.

### The Wheel Detection Problem

A wheel has a simple geometric shape, but its image may be complicated by:
- Shadows falling on the wheel
- The sun glaring off the metal parts of the wheel  
- The fender of the car or an object in the foreground obscuring part of the wheel
- And so on...

This illustrates the fundamental challenge that deep learning aims to solve: automatically learning good representations from raw data, without requiring hand-crafted features.

---



In [1]:
#!/usr/bin/env python3
"""
Deep Learning Chapter 1 Implementation Examples
Demonstrating key concepts from the introduction chapter
"""

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

class RepresentationDemo:
    """Demonstrates the importance of data representation"""
    
    def __init__(self):
        self.data = None
        self.labels = None
    
    def create_spiral_data(self, n_samples=1000):
        """Create spiral data that's hard to classify in original space"""
        t = np.linspace(0, 4*np.pi, n_samples)
        x1 = t * np.cos(t) + np.random.normal(0, 0.5, n_samples)
        y1 = t * np.sin(t) + np.random.normal(0, 0.5, n_samples)
        
        x2 = (t + np.pi) * np.cos(t + np.pi) + np.random.normal(0, 0.5, n_samples)
        y2 = (t + np.pi) * np.sin(t + np.pi) + np.random.normal(0, 0.5, n_samples)
        
        X = np.vstack([np.column_stack([x1, y1]), np.column_stack([x2, y2])])
        y = np.hstack([np.zeros(n_samples), np.ones(n_samples)])
        
        return X, y
    
    def polar_transform(self, X):
        """Transform Cartesian coordinates to polar representation"""
        r = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
        theta = np.arctan2(X[:, 1], X[:, 0])
        return np.column_stack([r, theta])
    
    def demonstrate_representation_importance(self):
        """Show how representation affects classification performance"""
        X, y = self.create_spiral_data()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        
        # Original representation
        clf_original = LogisticRegression()
        clf_original.fit(X_train, y_train)
        score_original = clf_original.score(X_test, y_test)
        
        # Polar representation
        X_train_polar = self.polar_transform(X_train)
        X_test_polar = self.polar_transform(X_test)
        
        clf_polar = LogisticRegression()
        clf_polar.fit(X_train_polar, y_train)
        score_polar = clf_polar.score(X_test_polar, y_test)
        
        print(f"Original representation accuracy: {score_original:.3f}")
        print(f"Polar representation accuracy: {score_polar:.3f}")
        
        return X, y, score_original, score_polar

class MedicalDiagnosisDemo:
    """Demonstrates logistic regression for medical diagnosis (cesarean delivery)"""
    
    def __init__(self):
        self.features = [
            'maternal_age', 'previous_cesarean', 'gestational_age', 
            'fetal_weight_estimate', 'maternal_height', 'labor_duration'
        ]
    
    def generate_synthetic_data(self, n_samples=1000):
        """Generate synthetic medical data for demonstration"""
        np.random.seed(42)
        
        # Generate realistic medical features
        maternal_age = np.random.normal(28, 5, n_samples)
        previous_cesarean = np.random.binomial(1, 0.3, n_samples)
        gestational_age = np.random.normal(39, 2, n_samples)
        fetal_weight = np.random.normal(3200, 400, n_samples)  # grams
        maternal_height = np.random.normal(165, 8, n_samples)  # cm
        labor_duration = np.random.exponential(8, n_samples)  # hours
        
        X = np.column_stack([
            maternal_age, previous_cesarean, gestational_age,
            fetal_weight, maternal_height, labor_duration
        ])
        
        # Create realistic target variable based on medical factors
        risk_score = (
            0.1 * (maternal_age > 35) +
            0.4 * previous_cesarean +
            0.2 * (gestational_age > 42) +
            0.2 * (fetal_weight > 4000) +
            0.1 * (maternal_height < 150) +
            0.1 * (labor_duration > 12)
        )
        
        # Add some randomness
        y = np.random.binomial(1, 1 / (1 + np.exp(-2 * risk_score + 1)), n_samples)
        
        return X, y
    
    def train_and_evaluate(self):
        """Train logistic regression model for medical diagnosis"""
        X, y = self.generate_synthetic_data()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        
        # Standardize features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        
        # Train model
        model = LogisticRegression()
        model.fit(X_train_scaled, y_train)
        
        # Evaluate
        train_score = model.score(X_train_scaled, y_train)
        test_score = model.score(X_test_scaled, y_test)
        
        print(f"Medical Diagnosis Model Performance:")
        print(f"Training accuracy: {train_score:.3f}")
        print(f"Test accuracy: {test_score:.3f}")
        print(f"Feature importance (coefficients):")
        
        for i, feature in enumerate(self.features):
            print(f"  {feature}: {model.coef_[0][i]:.3f}")
        
        return model, scaler

class SpamDetectionDemo:
    """Demonstrates Naive Bayes for spam detection"""
    
    def __init__(self):
        self.spam_words = [
            'free', 'money', 'win', 'cash', 'prize', 'urgent', 'click',
            'offer', 'deal', 'guarantee', 'limited', 'act now'
        ]
        self.normal_words = [
            'meeting', 'project', 'report', 'schedule', 'team', 'work',
            'deadline', 'update', 'status', 'review', 'discussion'
        ]
    
    def generate_email_features(self, n_samples=1000):
        """Generate synthetic email features for spam detection"""
        X = []
        y = []
        
        for i in range(n_samples):
            is_spam = np.random.binomial(1, 0.3)  # 30% spam
            
            if is_spam:
                # Generate spam email features
                spam_word_count = np.random.poisson(3)
                normal_word_count = np.random.poisson(1)
                caps_ratio = np.random.beta(2, 1)  # More caps in spam
                exclamation_count = np.random.poisson(2)
            else:
                # Generate normal email features
                spam_word_count = np.random.poisson(0.2)
                normal_word_count = np.random.poisson(5)
                caps_ratio = np.random.beta(1, 3)  # Less caps in normal
                exclamation_count = np.random.poisson(0.1)
            
            features = [spam_word_count, normal_word_count, caps_ratio, exclamation_count]
            X.append(features)
            y.append(is_spam)
        
        return np.array(X), np.array(y)
    
    def train_and_evaluate(self):
        """Train Naive Bayes model for spam detection"""
        X, y = self.generate_email_features()
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
        
        # Train Naive Bayes model
        model = GaussianNB()
        model.fit(X_train, y_train)
        
        # Evaluate
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        
        print(f"Spam Detection Model Performance:")
        print(f"Training accuracy: {train_score:.3f}")
        print(f"Test accuracy: {test_score:.3f}")
        
        return model

class FeatureEngineeringDemo:
    """Demonstrates the challenge of manual feature engineering"""
    
    def __init__(self):
        pass
    
    def create_image_like_data(self, size=28):
        """Create simple 2D data that simulates pixel values"""
        # Create a simple "wheel" pattern
        center = size // 2
        y, x = np.ogrid[:size, :size]
        
        # Create circle (wheel)
        wheel_mask = (x - center)**2 + (y - center)**2 <= (size//3)**2
        outer_ring = ((x - center)**2 + (y - center)**2 <= (size//3)**2) & \
                    ((x - center)**2 + (y - center)**2 >= (size//4)**2)
        
        wheel_image = np.zeros((size, size))
        wheel_image[wheel_mask] = 0.5
        wheel_image[outer_ring] = 1.0
        
        # Add noise and distortions
        noise = np.random.normal(0, 0.1, (size, size))
        wheel_image += noise
        
        return wheel_image
    
    def extract_hand_crafted_features(self, image):
        """Extract hand-crafted features from image"""
        features = []
        
        # Circular symmetry measure
        center = image.shape[0] // 2
        # Squared distance of every pixel from the center, computed once
        y, x = np.ogrid[:image.shape[0], :image.shape[1]]
        dist_sq = (x - center)**2 + (y - center)**2
        radial_profile = []
        for r in range(1, center):
            # Pixels lying between radius r-1 and radius r
            ring_mask = (dist_sq <= r**2) & (dist_sq >= (r-1)**2)
            if np.sum(ring_mask) > 0:
                radial_profile.append(np.mean(image[ring_mask]))
        
        # Variance in radial profile (wheels should be relatively uniform)
        features.append(np.var(radial_profile) if radial_profile else 0)
        
        # Edge density
        sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
        sobel_y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])
        
        # Simple convolution approximation
        edge_strength = 0
        for i in range(1, image.shape[0]-1):
            for j in range(1, image.shape[1]-1):
                patch = image[i-1:i+2, j-1:j+2]
                gx = np.sum(patch * sobel_x)
                gy = np.sum(patch * sobel_y)
                edge_strength += np.sqrt(gx**2 + gy**2)
        
        features.append(edge_strength)
        
        # Intensity statistics
        features.extend([np.mean(image), np.std(image), np.max(image)])
        
        return np.array(features)
    
    def demonstrate_feature_engineering(self):
        """Show the complexity of manual feature engineering"""
        print("Manual Feature Engineering Challenges:")
        print("=====================================")
        
        # Create different "wheel" variations
        normal_wheel = self.create_image_like_data()
        
        # Wheel with shadow
        shadowed_wheel = normal_wheel.copy()
        shadowed_wheel[:14, :] *= 0.5  # Add shadow
        
        # Partially occluded wheel
        occluded_wheel = normal_wheel.copy()
        occluded_wheel[20:, 15:] = 0  # Occlude part
        
        wheels = [normal_wheel, shadowed_wheel, occluded_wheel]
        labels = ["Normal", "Shadowed", "Occluded"]
        
        print("\nExtracted features for different wheel conditions:")
        feature_names = ["Radial Var", "Edge Strength", "Mean", "Std", "Max"]
        
        for wheel, label in zip(wheels, labels):
            features = self.extract_hand_crafted_features(wheel)
            print(f"\n{label} wheel features:")
            for fname, fval in zip(feature_names, features):
                print(f"  {fname}: {fval:.3f}")
        
        return wheels, labels

def main():
    """Run all demonstrations"""
    print("Deep Learning Chapter 1 - Implementation Examples")
    print("=" * 50)
    
    # 1. Representation importance
    print("\n1. IMPORTANCE OF REPRESENTATION")
    print("-" * 30)
    rep_demo = RepresentationDemo()
    X, y, score_orig, score_polar = rep_demo.demonstrate_representation_importance()
    print(f"Improvement: {((score_polar - score_orig) / score_orig * 100):.1f}%")
    
    # 2. Medical diagnosis with logistic regression
    print("\n2. MEDICAL DIAGNOSIS (Logistic Regression)")
    print("-" * 40)
    med_demo = MedicalDiagnosisDemo()
    model, scaler = med_demo.train_and_evaluate()
    
    # 3. Spam detection with Naive Bayes
    print("\n3. SPAM DETECTION (Naive Bayes)")
    print("-" * 30)
    spam_demo = SpamDetectionDemo()
    spam_model = spam_demo.train_and_evaluate()
    
    # 4. Feature engineering challenges
    print("\n4. FEATURE ENGINEERING CHALLENGES")
    print("-" * 33)
    feature_demo = FeatureEngineeringDemo()
    wheels, labels = feature_demo.demonstrate_feature_engineering()
    
    print("\n" + "=" * 50)
    print("Key Insights from Chapter 1:")
    print("- Representation is crucial for ML performance")
    print("- Simple algorithms can solve complex problems with good features")
    print("- Manual feature engineering is challenging and domain-specific")
    print("- Deep learning aims to learn representations automatically")

if __name__ == "__main__":
    main()

Deep Learning Chapter 1 - Implementation Examples

1. IMPORTANCE OF REPRESENTATION
------------------------------
Original representation accuracy: 0.535
Polar representation accuracy: 0.637
Improvement: 19.0%

2. MEDICAL DIAGNOSIS (Logistic Regression)
----------------------------------------
Medical Diagnosis Model Performance:
Training accuracy: 0.689
Test accuracy: 0.713
Feature importance (coefficients):
  maternal_age: 0.007
  previous_cesarean: 0.443
  gestational_age: 0.144
  fetal_weight_estimate: 0.104
  maternal_height: 0.041
  labor_duration: 0.166

3. SPAM DETECTION (Naive Bayes)
------------------------------
Spam Detection Model Performance:
Training accuracy: 0.984
Test accuracy: 0.980

4. FEATURE ENGINEERING CHALLENGES
---------------------------------
Manual Feature Engineering Challenges:

Extracted features for different wheel conditions:

Normal wheel features:
  Radial Var: 0.110
  Edge Strength: 808.315
  Mean: 0.225
  Std: 0.380
  Max: 1.180

Shadowed wheel feat

In [3]:
#!/usr/bin/env python3
"""
Deep Learning Chapter 1 Implementation Examples
Pure Python implementation without external dependencies
Demonstrating key concepts from the introduction chapter
"""

import math
import random
from collections import defaultdict, Counter

# Set random seed for reproducibility
random.seed(42)

class RepresentationDemo:
    """Demonstrates the importance of data representation"""
    
    def __init__(self):
        self.data = None
        self.labels = None
    
    def create_spiral_data(self, n_samples=500):
        """Create spiral data that's hard to classify in original space"""
        data = []
        labels = []
        
        # First spiral
        for i in range(n_samples):
            t = (i / n_samples) * 4 * math.pi
            x = t * math.cos(t) + random.gauss(0, 0.5)
            y = t * math.sin(t) + random.gauss(0, 0.5)
            data.append([x, y])
            labels.append(0)
        
        # Second spiral (offset)
        for i in range(n_samples):
            t = (i / n_samples) * 4 * math.pi + math.pi
            x = t * math.cos(t) + random.gauss(0, 0.5)
            y = t * math.sin(t) + random.gauss(0, 0.5)
            data.append([x, y])
            labels.append(1)
        
        return data, labels
    
    def polar_transform(self, data):
        """Transform Cartesian coordinates to polar representation"""
        polar_data = []
        for x, y in data:
            r = math.sqrt(x**2 + y**2)
            theta = math.atan2(y, x)
            polar_data.append([r, theta])
        return polar_data
    
    def split_data(self, data, labels, test_ratio=0.3):
        """Split data into training and testing sets"""
        n = len(data)
        indices = list(range(n))
        random.shuffle(indices)
        
        split_idx = int(n * (1 - test_ratio))
        train_indices = indices[:split_idx]
        test_indices = indices[split_idx:]
        
        train_data = [data[i] for i in train_indices]
        train_labels = [labels[i] for i in train_indices]
        test_data = [data[i] for i in test_indices]
        test_labels = [labels[i] for i in test_indices]
        
        return train_data, train_labels, test_data, test_labels
    
    def simple_linear_classifier(self, train_data, train_labels, test_data):
        """Nearest-centroid classifier: assign each test point to the class whose training mean is closer"""
        # Calculate means for each class
        class_0_data = [data for data, label in zip(train_data, train_labels) if label == 0]
        class_1_data = [data for data, label in zip(train_data, train_labels) if label == 1]
        
        if not class_0_data or not class_1_data:
            return [0] * len(test_data)
        
        # Calculate centroids
        centroid_0 = [sum(x[i] for x in class_0_data) / len(class_0_data) for i in range(2)]
        centroid_1 = [sum(x[i] for x in class_1_data) / len(class_1_data) for i in range(2)]
        
        # Classify test data based on distance to centroids
        predictions = []
        for test_point in test_data:
            dist_0 = sum((test_point[i] - centroid_0[i])**2 for i in range(2))
            dist_1 = sum((test_point[i] - centroid_1[i])**2 for i in range(2))
            predictions.append(0 if dist_0 < dist_1 else 1)
        
        return predictions
    
    def calculate_accuracy(self, predictions, true_labels):
        """Calculate classification accuracy"""
        correct = sum(1 for pred, true in zip(predictions, true_labels) if pred == true)
        return correct / len(true_labels)
    
    def demonstrate_representation_importance(self):
        """Show how representation affects classification performance"""
        data, labels = self.create_spiral_data()
        train_data, train_labels, test_data, test_labels = self.split_data(data, labels)
        
        # Original representation
        pred_original = self.simple_linear_classifier(train_data, train_labels, test_data)
        score_original = self.calculate_accuracy(pred_original, test_labels)
        
        # Polar representation
        train_data_polar = self.polar_transform(train_data)
        test_data_polar = self.polar_transform(test_data)
        
        pred_polar = self.simple_linear_classifier(train_data_polar, train_labels, test_data_polar)
        score_polar = self.calculate_accuracy(pred_polar, test_labels)
        
        print(f"Original representation accuracy: {score_original:.3f}")
        print(f"Polar representation accuracy: {score_polar:.3f}")
        
        return score_original, score_polar

class LogisticRegression:
    """Simple logistic regression implementation"""
    
    def __init__(self, learning_rate=0.01, max_iterations=1000):
        self.learning_rate = learning_rate
        self.max_iterations = max_iterations
        self.weights = None
        self.bias = None
    
    def sigmoid(self, z):
        """Sigmoid activation function with overflow protection"""
        z = max(-500, min(500, z))  # Prevent overflow
        return 1 / (1 + math.exp(-z))
    
    def fit(self, X, y):
        """Train the logistic regression model"""
        n_samples = len(X)
        n_features = len(X[0])
        
        # Initialize weights and bias
        self.weights = [0.0] * n_features
        self.bias = 0.0
        
        # Gradient descent
        for _ in range(self.max_iterations):
            # Forward pass
            predictions = []
            for i in range(n_samples):
                z = self.bias + sum(self.weights[j] * X[i][j] for j in range(n_features))
                predictions.append(self.sigmoid(z))
            
            # Calculate gradients
            dw = [0.0] * n_features
            db = 0.0
            
            for i in range(n_samples):
                error = predictions[i] - y[i]
                db += error
                for j in range(n_features):
                    dw[j] += error * X[i][j]
            
            # Update weights
            self.bias -= self.learning_rate * db / n_samples
            for j in range(n_features):
                self.weights[j] -= self.learning_rate * dw[j] / n_samples
    
    def predict(self, X):
        """Make predictions"""
        predictions = []
        for x in X:
            z = self.bias + sum(self.weights[j] * x[j] for j in range(len(x)))
            prob = self.sigmoid(z)
            predictions.append(1 if prob >= 0.5 else 0)
        return predictions
    
    def predict_proba(self, X):
        """Predict probabilities"""
        probabilities = []
        for x in X:
            z = self.bias + sum(self.weights[j] * x[j] for j in range(len(x)))
            probabilities.append(self.sigmoid(z))
        return probabilities

class MedicalDiagnosisDemo:
    """Demonstrates logistic regression for medical diagnosis"""
    
    def __init__(self):
        self.features = [
            'maternal_age', 'previous_cesarean', 'gestational_age', 
            'fetal_weight_estimate', 'maternal_height', 'labor_duration'
        ]
    
    def generate_synthetic_data(self, n_samples=800):
        """Generate synthetic medical data for demonstration"""
        data = []
        labels = []
        
        for _ in range(n_samples):
            # Generate realistic medical features
            maternal_age = random.gauss(28, 5)
            previous_cesarean = 1 if random.random() < 0.3 else 0
            gestational_age = random.gauss(39, 2)
            fetal_weight = random.gauss(3200, 400)  # grams
            maternal_height = random.gauss(165, 8)  # cm
            labor_duration = random.expovariate(1/8)  # hours
            
            features = [maternal_age, previous_cesarean, gestational_age,
                       fetal_weight, maternal_height, labor_duration]
            
            # Create realistic target variable
            risk_score = (
                0.1 * (1 if maternal_age > 35 else 0) +
                0.4 * previous_cesarean +
                0.2 * (1 if gestational_age > 42 else 0) +
                0.2 * (1 if fetal_weight > 4000 else 0) +
                0.1 * (1 if maternal_height < 150 else 0) +
                0.1 * (1 if labor_duration > 12 else 0)
            )
            
            # Convert to probability
            prob = 1 / (1 + math.exp(-2 * risk_score + 1))
            label = 1 if random.random() < prob else 0
            
            data.append(features)
            labels.append(label)
        
        return data, labels
    
    def standardize_features(self, data):
        """Standardize features (z-score normalization)"""
        n_features = len(data[0])
        n_samples = len(data)
        
        # Calculate means and standard deviations
        means = []
        stds = []
        
        for j in range(n_features):
            feature_values = [data[i][j] for i in range(n_samples)]
            mean = sum(feature_values) / n_samples
            variance = sum((x - mean)**2 for x in feature_values) / n_samples
            std = math.sqrt(variance) if variance > 0 else 1.0
            
            means.append(mean)
            stds.append(std)
        
        # Standardize data
        standardized_data = []
        for i in range(n_samples):
            standardized_row = []
            for j in range(n_features):
                standardized_value = (data[i][j] - means[j]) / stds[j]
                standardized_row.append(standardized_value)
            standardized_data.append(standardized_row)
        
        return standardized_data, means, stds
    
    def train_and_evaluate(self):
        """Train logistic regression model for medical diagnosis"""
        data, labels = self.generate_synthetic_data()
        
        # Split data
        n = len(data)
        split_idx = int(n * 0.7)
        train_data = data[:split_idx]
        train_labels = labels[:split_idx]
        test_data = data[split_idx:]
        test_labels = labels[split_idx:]
        
        # Standardize features
        train_data_std, means, stds = self.standardize_features(train_data)
        
        # Standardize test data using training statistics
        test_data_std = []
        for row in test_data:
            std_row = [(row[j] - means[j]) / stds[j] for j in range(len(row))]
            test_data_std.append(std_row)
        
        # Train model
        model = LogisticRegression(learning_rate=0.1, max_iterations=1000)
        model.fit(train_data_std, train_labels)
        
        # Evaluate
        train_pred = model.predict(train_data_std)
        test_pred = model.predict(test_data_std)
        
        train_accuracy = sum(1 for p, t in zip(train_pred, train_labels) if p == t) / len(train_labels)
        test_accuracy = sum(1 for p, t in zip(test_pred, test_labels) if p == t) / len(test_labels)
        
        print("Medical Diagnosis Model Performance:")
        print(f"Training accuracy: {train_accuracy:.3f}")
        print(f"Test accuracy: {test_accuracy:.3f}")
        print("Feature importance (coefficients):")
        
        for i, feature in enumerate(self.features):
            print(f"  {feature}: {model.weights[i]:.3f}")
        
        return model

class NaiveBayes:
    """Simple Naive Bayes classifier"""
    
    def __init__(self):
        self.class_probs = {}
        self.feature_stats = {}
    
    def fit(self, X, y):
        """Train the Naive Bayes model"""
        n_samples = len(X)
        n_features = len(X[0])
        
        # Calculate class probabilities
        class_counts = Counter(y)
        self.class_probs = {cls: count/n_samples for cls, count in class_counts.items()}
        
        # Calculate feature statistics for each class
        self.feature_stats = {}
        for cls in class_counts:
            self.feature_stats[cls] = []
            class_data = [X[i] for i in range(n_samples) if y[i] == cls]
            
            for j in range(n_features):
                feature_values = [row[j] for row in class_data]
                mean = sum(feature_values) / len(feature_values)
                variance = sum((x - mean)**2 for x in feature_values) / len(feature_values)
                std = math.sqrt(variance) if variance > 0 else 1e-6
                self.feature_stats[cls].append((mean, std))
    
    def gaussian_pdf(self, x, mean, std):
        """Calculate Gaussian probability density"""
        exponent = -0.5 * ((x - mean) / std) ** 2
        return (1 / (std * math.sqrt(2 * math.pi))) * math.exp(exponent)
    
    def predict(self, X):
        """Make predictions"""
        predictions = []
        for x in X:
            class_scores = {}
            
            for cls in self.class_probs:
                score = math.log(self.class_probs[cls])
                for j, feature_val in enumerate(x):
                    mean, std = self.feature_stats[cls][j]
                    prob = self.gaussian_pdf(feature_val, mean, std)
                    score += math.log(prob + 1e-10)  # Add small constant to avoid log(0)
                class_scores[cls] = score
            
            predicted_class = max(class_scores, key=class_scores.get)
            predictions.append(predicted_class)
        
        return predictions

class SpamDetectionDemo:
    """Demonstrates Naive Bayes for spam detection"""
    
    def generate_email_features(self, n_samples=800):
        """Generate synthetic email features for spam detection"""
        data = []
        labels = []
        
        for _ in range(n_samples):
            is_spam = 1 if random.random() < 0.3 else 0  # 30% spam
            
            if is_spam:
                # Generate spam email features
                spam_word_count = max(0, int(random.gauss(3, 1)))
                normal_word_count = max(0, int(random.gauss(1, 0.5)))
                caps_ratio = min(1.0, max(0.0, random.betavariate(2, 1)))
                exclamation_count = max(0, int(random.gauss(2, 1)))
            else:
                # Generate normal email features
                spam_word_count = max(0, int(random.gauss(0.2, 0.3)))
                normal_word_count = max(0, int(random.gauss(5, 2)))
                caps_ratio = min(1.0, max(0.0, random.betavariate(1, 3)))
                exclamation_count = max(0, int(random.gauss(0.1, 0.2)))
            
            features = [spam_word_count, normal_word_count, caps_ratio, exclamation_count]
            data.append(features)
            labels.append(is_spam)
        
        return data, labels
    
    def train_and_evaluate(self):
        """Train Naive Bayes model for spam detection"""
        data, labels = self.generate_email_features()
        
        # Split data
        n = len(data)
        split_idx = int(n * 0.7)
        train_data = data[:split_idx]
        train_labels = labels[:split_idx]
        test_data = data[split_idx:]
        test_labels = labels[split_idx:]
        
        # Train model
        model = NaiveBayes()
        model.fit(train_data, train_labels)
        
        # Evaluate
        train_pred = model.predict(train_data)
        test_pred = model.predict(test_data)
        
        train_accuracy = sum(1 for p, t in zip(train_pred, train_labels) if p == t) / len(train_labels)
        test_accuracy = sum(1 for p, t in zip(test_pred, test_labels) if p == t) / len(test_labels)
        
        print("Spam Detection Model Performance:")
        print(f"Training accuracy: {train_accuracy:.3f}")
        print(f"Test accuracy: {test_accuracy:.3f}")
        
        return model

class FeatureEngineeringDemo:
    """Demonstrates the challenge of manual feature engineering"""
    
    def create_image_like_data(self, size=20):
        """Create simple 2D data that simulates pixel values"""
        # Create a simple "wheel" pattern
        center = size // 2
        image = [[0.0 for _ in range(size)] for _ in range(size)]
        
        # Create circle (wheel)
        for y in range(size):
            for x in range(size):
                dist_from_center = math.sqrt((x - center)**2 + (y - center)**2)
                
                # Outer circle
                if dist_from_center <= size//3:
                    image[y][x] = 0.5
                
                # Inner ring (rim)
                if size//4 <= dist_from_center <= size//3:
                    image[y][x] = 1.0
        
        # Add noise
        for y in range(size):
            for x in range(size):
                image[y][x] += random.gauss(0, 0.1)
                image[y][x] = max(0, min(1, image[y][x]))  # Clamp to [0,1]
        
        return image
    
    def extract_hand_crafted_features(self, image):
        """Extract hand-crafted features from image"""
        size = len(image)
        center = size // 2
        features = []
        
        # Circular symmetry measure
        radial_values = []
        for r in range(1, center):
            ring_values = []
            for y in range(size):
                for x in range(size):
                    dist = math.sqrt((x - center)**2 + (y - center)**2)
                    if r-1 <= dist < r:
                        ring_values.append(image[y][x])
            
            if ring_values:
                radial_values.append(sum(ring_values) / len(ring_values))
        
        # Variance in radial profile
        if radial_values:
            mean_radial = sum(radial_values) / len(radial_values)
            variance = sum((x - mean_radial)**2 for x in radial_values) / len(radial_values)
            features.append(variance)
        else:
            features.append(0)
        
        # Edge density (simplified)
        edge_strength = 0
        for y in range(1, size-1):
            for x in range(1, size-1):
                # Simple gradient approximation
                gx = image[y][x+1] - image[y][x-1]
                gy = image[y+1][x] - image[y-1][x]
                edge_strength += math.sqrt(gx**2 + gy**2)
        
        features.append(edge_strength)
        
        # Intensity statistics
        flat_image = [pixel for row in image for pixel in row]
        mean_intensity = sum(flat_image) / len(flat_image)
        variance_intensity = sum((x - mean_intensity)**2 for x in flat_image) / len(flat_image)
        std_intensity = math.sqrt(variance_intensity)
        max_intensity = max(flat_image)
        
        features.extend([mean_intensity, std_intensity, max_intensity])
        
        return features
    
    def add_shadow(self, image):
        """Add shadow effect to image"""
        size = len(image)
        shadowed = [[image[y][x] for x in range(size)] for y in range(size)]
        
        # Darken upper half
        for y in range(size//2):
            for x in range(size):
                shadowed[y][x] *= 0.5
        
        return shadowed
    
    def add_occlusion(self, image):
        """Add occlusion to image"""
        size = len(image)
        occluded = [[image[y][x] for x in range(size)] for y in range(size)]
        
        # Occlude bottom-right corner
        for y in range(size//2, size):
            for x in range(size//2, size):
                occluded[y][x] = 0
        
        return occluded
    
    def demonstrate_feature_engineering(self):
        """Show the complexity of manual feature engineering"""
        print("Manual Feature Engineering Challenges:")
        print("=====================================")
        
        # Create different "wheel" variations
        normal_wheel = self.create_image_like_data()
        shadowed_wheel = self.add_shadow(normal_wheel)
        occluded_wheel = self.add_occlusion(normal_wheel)
        
        wheels = [normal_wheel, shadowed_wheel, occluded_wheel]
        labels = ["Normal", "Shadowed", "Occluded"]
        
        print("\nExtracted features for different wheel conditions:")
        feature_names = ["Radial Var", "Edge Strength", "Mean", "Std", "Max"]
        
        for wheel, label in zip(wheels, labels):
            features = self.extract_hand_crafted_features(wheel)
            print(f"\n{label} wheel features:")
            for fname, fval in zip(feature_names, features):
                print(f"  {fname}: {fval:.3f}")
        
        return wheels, labels

def main():
    """Run all demonstrations"""
    print("Deep Learning Chapter 1 - Pure Python Implementation")
    print("=" * 55)
    
    # 1. Representation importance
    print("\n1. IMPORTANCE OF REPRESENTATION")
    print("-" * 30)
    rep_demo = RepresentationDemo()
    score_orig, score_polar = rep_demo.demonstrate_representation_importance()
    if score_orig > 0:
        improvement = ((score_polar - score_orig) / score_orig * 100)
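        print(f"Improvement: {improvement:.1f}%")

    # 2. Medical diagnosis (MedicalDiagnosisDemo) is defined above but not run
    #    here, so the section numbering jumps from 1 to 3.
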
    # 3. Spam detection with Naive Bayes
    print("\n3. SPAM DETECTION (Naive Bayes)")
    print("-" * 30)
    spam_demo = SpamDetectionDemo()
    spam_model = spam_demo.train_and_evaluate()
    
    # 4. Feature engineering challenges
    print("\n4. FEATURE ENGINEERING CHALLENGES")
    print("-" * 33)
    feature_demo = FeatureEngineeringDemo()
    wheels, labels = feature_demo.demonstrate_feature_engineering()
    
    print("\n" + "=" * 55)
    print("Key Insights from Chapter 1:")
    print("- Representation is crucial for ML performance")
    print("- Simple algorithms can solve complex problems with good features")
    print("- Manual feature engineering is challenging and domain-specific")
    print("- Deep learning aims to learn representations automatically")
    print("\nMathematical Concepts Implemented:")
    print("- Sigmoid function: σ(z) = 1/(1 + e^(-z))")
    print("- Polar transformation: (r,θ) = (√(x²+y²), atan2(y,x))")
    print("- Gaussian PDF: f(x) = 1/(σ√(2π)) * e^(-(x-μ)²/(2σ²))")
    print("- Logistic regression with gradient descent")
    print("- Naive Bayes with Gaussian assumption")

if __name__ == "__main__":
    main()

Deep Learning Chapter 1 - Pure Python Implementation

1. IMPORTANCE OF REPRESENTATION
------------------------------
Original representation accuracy: 0.510
Polar representation accuracy: 0.647
Improvement: 26.8%

3. SPAM DETECTION (Naive Bayes)
------------------------------
Spam Detection Model Performance:
Training accuracy: 1.000
Test accuracy: 0.996

4. FEATURE ENGINEERING CHALLENGES
---------------------------------
Manual Feature Engineering Challenges:

Extracted features for different wheel conditions:

Normal wheel features:
  Radial Var: 0.077
  Edge Strength: 115.557
  Mean: 0.227
  Std: 0.323
  Max: 1.000

Shadowed wheel features:
  Radial Var: 0.045
  Edge Strength: 91.456
  Mean: 0.175
  Std: 0.267
  Max: 1.000

Occluded wheel features:
  Radial Var: 0.049
  Edge Strength: 94.838
  Mean: 0.161
  Std: 0.290
  Max: 1.000

Key Insights from Chapter 1:
- Representation is crucial for ML performance
- Simple algorithms can solve complex problems with good features
- Manual fe

# Representation Learning and Deep Learning

## Cartesian vs. Polar Coordinates
Figure 1.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)

One solution to this problem is to use machine learning to discover not only the mapping from representation to output but also the representation itself. This approach is known as **representation learning**. Learned representations often result in much better performance than can be obtained with hand-designed representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention. A representation learning algorithm can discover a good set of features for a simple task in minutes, or a complex task in hours to months. Manually designing features for a complex task requires a great deal of human time and effort; it can take decades for an entire community of researchers.

### Autoencoders
The quintessential example of a representation learning algorithm is the **autoencoder**. An autoencoder is the combination of an encoder function that converts the input data into a different representation, and a decoder function that converts the new representation back into the original format. Autoencoders are trained to preserve as much information as possible when an input is run through the encoder and then the decoder, but are also trained to make the new representation have various nice properties. Different kinds of autoencoders aim to achieve different kinds of properties.

When designing features or algorithms for learning features, our goal is usually to separate the **factors of variation** that explain the observed data. In this context, we use the word “factors” simply to refer to separate sources of influence; the factors are usually not combined by multiplication. Such factors are often not quantities that are directly observed. Instead, they may exist either as unobserved objects or unobserved forces in the physical world that affect observable quantities. They may also exist as constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data. They can be thought of as concepts or abstractions that help us make sense of the rich variability in the data.

- **Example 1**: When analyzing a speech recording, the factors of variation include the speaker’s age, their sex, their accent, and the words that they are speaking.
- **Example 2**: When analyzing an image of a car, the factors of variation include the position of the car, its color, and the angle and brightness of the sun.

A major source of difficulty in many real-world artificial intelligence applications is that many of the factors of variation influence every single piece of data we are able to observe. The individual pixels in an image of a red car might be very close to black at night. The shape of the car’s silhouette depends on the viewing angle. Most applications require us to disentangle the factors of variation and discard the ones that we do not care about. Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker’s accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

## Deep Learning
Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Figure 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

### Multilayer Perceptron (MLP)
The quintessential example of a deep learning model is the **feedforward deep network or multilayer perceptron (MLP)**. A multilayer perceptron is just a mathematical function mapping some set of input values to output values. The function is formed by composing many simpler functions. We can think of each application of a different mathematical function as providing a new representation of the input.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer’s memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions.

Figure 1.2: Illustration of a deep learning model. It is difficult for a computer to understand the meaning of raw sensory input data, such as this image represented as a collection of pixel values. The function mapping from a set of pixels to an object identity is very complicated. Learning or evaluating this mapping seems insurmountable if tackled directly. Deep learning resolves this difficulty by breaking the desired complicated mapping into a series of nested simple mappings, each described by a different layer of the model. The input is presented at the input layer, so named because it contains the variables that we are able to observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers are called “hidden” because their values are not given in the data; instead the model must determine which concepts are useful for explaining the relationships in the observed data. The images here are visualizations of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify edges, by comparing the brightness of neighboring pixels. Given the first hidden layer’s description of the edges, the second hidden layer can easily search for corners and extended contours, which are recognizable as collections of edges. Given the second hidden layer’s description of the image in terms of corners and contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of contours and corners. Finally, this description of the image in terms of the object parts it contains can be used to recognize the objects present in the image. Images reproduced with permission from Zeiler and Fergus (2014).

# Deep Learning and Representation Learning
## A Comprehensive Introduction

### 1. Coordinate System Transformations and Data Representation

![image-2.png](attachment:image-2.png)

Fig.1: Example of different representations: suppose we want to separate two categories of data by drawing a line between them in a scatterplot. In the plot on the left, we represent some data using Cartesian coordinates, and the task is impossible. In the plot on the right, we represent the data with polar coordinates and the task becomes simple to solve with a vertical line. (Figure produced in collaboration with David Warde-Farley)

The choice of representation can dramatically affect the difficulty of a machine learning task. Consider the example of separating two categories of data:

**Cartesian Coordinates vs Polar Coordinates:**
- **Cartesian**: $(x, y)$ coordinates
- **Polar**: $(r, \theta)$ coordinates where:
  $$r = \sqrt{x^2 + y^2}$$
  $$\theta = \operatorname{atan2}(y, x)$$
  (the quadrant-aware form of $\arctan(y/x)$, which is also defined when $x = 0$)

**Conversion formulas:**
$$x = r \cos(\theta)$$
$$y = r \sin(\theta)$$

The same dataset can be:
- Impossible to separate linearly in Cartesian coordinates
- Easily separable with a simple vertical line in polar coordinates

This demonstrates that the right representation can transform an intractable problem into a trivial one.
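
The effect of the change of coordinates can be checked numerically. The following is a minimal sketch, assuming a hypothetical toy dataset of two concentric rings (the helper names `to_polar`, `make_rings`, and `threshold_accuracy` and the ring radii are illustrative choices, not taken from the text). It compares a single threshold on $x$ with a single threshold on $r$.

```python
import math
import random

def to_polar(x, y):
    """Convert Cartesian (x, y) to polar (r, theta); atan2 handles all quadrants."""
    return math.hypot(x, y), math.atan2(y, x)

def make_rings(n_per_class=100, seed=0):
    """Hypothetical toy data: class 0 on an inner ring, class 1 on an outer ring."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (r_low, r_high) in enumerate([(0.5, 1.0), (2.0, 2.5)]):
        for _ in range(n_per_class):
            r = rng.uniform(r_low, r_high)
            theta = rng.uniform(-math.pi, math.pi)
            points.append((r * math.cos(theta), r * math.sin(theta)))
            labels.append(label)
    return points, labels

def threshold_accuracy(values, labels, threshold):
    """Accuracy of the rule 'predict class 1 when value > threshold'."""
    correct = sum((v > threshold) == bool(y) for v, y in zip(values, labels))
    return correct / len(labels)

points, labels = make_rings()
x_values = [x for x, _ in points]
r_values = [to_polar(x, y)[0] for x, y in points]

acc_cartesian = threshold_accuracy(x_values, labels, 0.0)  # a cut on x alone
acc_polar = threshold_accuracy(r_values, labels, 1.5)      # a cut on r alone
print(f"threshold on x: {acc_cartesian:.2f}, threshold on r: {acc_polar:.2f}")
```

Because the two rings do not overlap in radius, any threshold between 1.0 and 2.0 on $r$ separates them, while a threshold on $x$ does little better than chance on this data.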

### 2. Representation Learning

**Definition:** Representation learning is the use of machine learning to discover both:
1. The mapping from representation to output
2. The representation itself

**Key advantages:**
- Better performance than hand-designed features
- Rapid adaptation to new tasks
- Minimal human intervention required

**Time comparison:**
- Manual feature design: decades for entire research communities
- Representation learning: minutes to months depending on complexity

### 3. Autoencoders

**Definition:** An autoencoder consists of two main components:

1. **Encoder function**: $f: \mathcal{X} \rightarrow \mathcal{H}$
   $$\mathbf{h} = f(\mathbf{x})$$

2. **Decoder function**: $g: \mathcal{H} \rightarrow \mathcal{X}$
   $$\mathbf{x}' = g(\mathbf{h})$$

**Objective:** Minimize reconstruction error while learning useful representations:
$$\mathcal{L}(\mathbf{x}, \mathbf{x}') = \|\mathbf{x} - g(f(\mathbf{x}))\|^2$$

**Training goal:** Preserve information while achieving desired properties in $\mathbf{h}$.
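
To make the reconstruction objective concrete, here is a small sketch of an encoder and decoder with a one-dimensional code. It assumes a fixed, hand-picked projection direction (`DIRECTION` is hypothetical; a trained autoencoder would learn its parameters) and only evaluates the loss, it does not train anything.

```python
# Hypothetical unit direction chosen by hand; a real autoencoder would learn it.
DIRECTION = (3 / 5, 4 / 5)

def encode(x):
    """Encoder f: project a 2-D point onto DIRECTION, giving a 1-D code h."""
    return x[0] * DIRECTION[0] + x[1] * DIRECTION[1]

def decode(h):
    """Decoder g: map the 1-D code back to a 2-D point along DIRECTION."""
    return (h * DIRECTION[0], h * DIRECTION[1])

def reconstruction_error(x):
    """Squared error ||x - g(f(x))||^2 for a single input."""
    x_rec = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

on_line = (3.0, 4.0)    # lies along DIRECTION, so the code keeps all information
off_line = (4.0, -3.0)  # orthogonal to DIRECTION, so the code keeps none of it

print(reconstruction_error(on_line))   # close to 0
print(reconstruction_error(off_line))  # close to 25, the squared norm of the input
```

A one-dimensional code can only preserve inputs that vary along one direction, which is why the choice of representation has to match the structure of the data.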

### 4. Factors of Variation

**Definition:** Separate sources of influence that explain observed data.

**Mathematical representation:**
Let $\mathbf{x}$ be observed data and $\mathbf{z} = [z_1, z_2, ..., z_k]^T$ be latent factors:
$$\mathbf{x} = f(\mathbf{z}) + \boldsymbol{\epsilon}$$

where $\boldsymbol{\epsilon}$ represents noise.

**Examples:**

**Speech analysis factors:**
- Speaker age: $z_1$
- Speaker sex: $z_2$ 
- Accent: $z_3$
- Words spoken: $z_4$

**Image analysis factors:**
- Object position: $\mathbf{z}_{pos} = [x, y, z]^T$
- Color: $\mathbf{z}_{color} = [R, G, B]^T$
- Lighting: $\mathbf{z}_{light} = [\text{angle}, \text{brightness}]^T$

**Challenge:** Disentangling factors when they influence every observable piece of data.
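
The entanglement of factors can be illustrated with a toy generator. This is a sketch under stated assumptions: `render` is a hypothetical stand-in for $f(\mathbf{z})$ that draws a 3-pixel object in a one-dimensional "image", with position and brightness as the only two factors.

```python
import random

def render(position, brightness, width=8, noise_std=0.0, seed=0):
    """Hypothetical generator f(z) + noise: a 3-pixel object at `position`."""
    rng = random.Random(seed)
    pixels = []
    for i in range(width):
        value = brightness if position <= i < position + 3 else 0.0
        noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
        pixels.append(value + noise)
    return pixels

def squared_distance(a, b):
    """Squared Euclidean distance between two pixel lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

day = render(position=2, brightness=1.0)
night = render(position=2, brightness=0.1)  # same object, different lighting
moved = render(position=4, brightness=1.0)  # same lighting, different position

print(squared_distance(day, night))  # about 2.43: three pixels change by 0.9
print(squared_distance(day, moved))  # 4.0: four pixels flip between 0 and 1
```

Both factors act on the same pixels, so a raw pixel distance alone does not reveal which factor changed.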

### 5. Deep Learning Architecture

**Core principle:** Build complex concepts from simpler concepts through hierarchical representations.

**Multilayer Perceptron (MLP):**
For an $L$-layer network:
$$\mathbf{h}^{(0)} = \mathbf{x}$$
$$\mathbf{h}^{(l)} = \sigma(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}) \quad \text{for } l = 1, 2, ..., L-1$$
$$\mathbf{y} = \mathbf{W}^{(L)}\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}$$

where:
- $\mathbf{W}^{(l)}$ are weight matrices
- $\mathbf{b}^{(l)}$ are bias vectors  
- $\sigma(\cdot)$ is the activation function
- $\mathbf{h}^{(l)}$ is the $l$-th hidden layer representation

**Function composition view:**
$$f(\mathbf{x}) = f^{(L)}(f^{(L-1)}(...f^{(2)}(f^{(1)}(\mathbf{x}))))$$
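
The layer equations translate directly into a loop over layers. The following is a minimal sketch of the forward pass using plain lists, assuming `tanh` as the hidden activation and hand-picked weights for a hypothetical 2-2-1 network (none of these values come from the text).

```python
import math

def dense(weights, bias, inputs):
    """Affine map W h + b for one layer; `weights` is a list of rows."""
    return [sum(w * h for w, h in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def forward(x, layers, activation=math.tanh):
    """Compose the layers: activation on hidden layers, linear output layer."""
    h = x
    for weights, bias in layers[:-1]:
        h = [activation(z) for z in dense(weights, bias, h)]
    weights, bias = layers[-1]
    return dense(weights, bias, h)

# Hand-picked (hypothetical) parameters for a 2-2-1 network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # W1, b1
    ([[2.0, -1.0]], [0.5]),                   # W2, b2
]

y = forward([0.0, 0.0], layers)
print(y)  # [0.5]: both hidden units are tanh(0) = 0, so only the output bias remains
```

Each pass through the loop applies one $f^{(l)}$, so the function-composition view and the layer-by-layer recurrence describe the same computation.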

### 6. Hierarchical Feature Learning

**Layer-by-layer abstraction:**

**Layer 1 (Edge Detection):**
$$h^{(1)}_{i,j,k} = \sigma\left(\sum_{p,q} w^{(k)}_{p,q} \cdot x_{i+p,j+q} + b_k\right)$$

Each filter $k$ detects simple edges by comparing neighboring pixel intensities.

**Layer 2 (Corners and Contours):**
$$h^{(2)}_{i,j,m} = \sigma\left(\sum_{k} w_{m,k} \cdot h^{(1)}_{i,j,k} + b_m\right)$$

Each unit $m$ combines edges to form corners and extended contours.

**Layer 3 (Object Parts):**
$$h^{(3)}_{i,j} = \sigma\left(\sum_{k} w_k \cdot h^{(2)}_{i,j,k} + b\right)$$

Detects specific object parts from collections of contours and corners.

**Output Layer (Object Recognition):**
$$y = \text{softmax}(\mathbf{W}^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)})$$

where:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
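
A direct implementation of the softmax formula can overflow for large scores. The sketch below uses the common max-subtraction trick, which is an implementation choice assumed here rather than something stated in the text; it leaves the result mathematically unchanged.

```python
import math

def softmax(scores):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])

# Without the shift, math.exp(1000.0) would raise OverflowError.
large = softmax([1000.0, 1000.0])
print(large)  # [0.5, 0.5]
```

The outputs are non-negative and sum to 1, so they can be read as class probabilities.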

### 7. Two Perspectives on Deep Learning

#### Perspective 1: Representation Learning
Each layer learns increasingly abstract representations:
$$\mathbf{x} \rightarrow \mathbf{h}^{(1)} \rightarrow \mathbf{h}^{(2)} \rightarrow ... \rightarrow \mathbf{h}^{(L-1)} \rightarrow \mathbf{y}$$

#### Perspective 2: Multi-step Computer Program
Each layer represents the state of computer memory after executing instructions:
- **Sequential processing**: Later layers can reference earlier computations
- **Parallel execution**: Each layer processes multiple features simultaneously
- **Memory states**: $\mathbf{h}^{(l)}$ represents program state after step $l$

**Computational complexity:**
For a network with $L$ layers and $n$ units per layer:
- **Forward pass**: $O(L \cdot n^2)$ operations
- **Backward pass**: $O(L \cdot n^2)$ operations
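
The $O(L \cdot n^2)$ figure comes from counting one multiply-add per weight. A small sketch, assuming fully connected layers and ignoring bias additions and activations (the function name is illustrative):

```python
def forward_multiply_adds(layer_sizes):
    """Count multiply-adds in one forward pass of a dense network.

    `layer_sizes` lists the number of units per layer, input layer first.
    """
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n, L = 100, 4
count = forward_multiply_adds([n] * (L + 1))
print(count)  # 4 * 100 * 100 = 40000
```

Doubling the width roughly quadruples the count, while doubling the depth only doubles it.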

### 8. Mathematical Formulation Summary

**Key equations:**

**Universal Approximation:**
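For any continuous function $f$ on a compact set $K$ and any $\epsilon > 0$, a feedforward network with one hidden layer of sufficient width and a suitable non-polynomial activation can approximate $f$ uniformly:
$$\exists \text{ network } f_{\text{net}} : \|f(\mathbf{x}) - f_{\text{net}}(\mathbf{x})\| < \epsilon \quad \forall \mathbf{x} \in K$$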

**Depth vs Width Trade-off:**
Some functions that need exponentially many units in a shallow network can be represented by a deep network whose depth grows only linearly with the input size $n$:
$$\text{Width}_{\text{shallow}} = O(2^n) \text{ vs } \text{Depth}_{\text{deep}} = O(n)$$

**Information Processing:**
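Each layer is computed from the previous one alone, so $\mathbf{x} \rightarrow \mathbf{h}^{(1)} \rightarrow \cdots \rightarrow \mathbf{h}^{(L-1)} \rightarrow \mathbf{y}$ forms a Markov chain. By the data processing inequality, later layers cannot carry more information about the input than earlier ones:
$$I(\mathbf{x}; \mathbf{h}^{(1)}) \geq I(\mathbf{x}; \mathbf{h}^{(2)}) \geq \cdots \geq I(\mathbf{x}; \mathbf{h}^{(L-1)}) \geq I(\mathbf{x}; \mathbf{y})$$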

where $I(\cdot; \cdot)$ denotes mutual information.

### 9. Practical Implementation Notes

**Activation functions:**
- **ReLU**: $\sigma(z) = \max(0, z)$
- **Sigmoid**: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- **Tanh**: $\sigma(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

**Loss functions:**
- **Regression**: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2$
- **Classification**: $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{i} y_i \log(\hat{y}_i)$

**Optimization:**
Gradient descent with backpropagation:
$$\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$$
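
The update rule can be seen in isolation on a one-parameter model. This is a minimal sketch, assuming a single weight $w$, the loss $\frac{1}{2}(wx - y)^2$ averaged over a hypothetical dataset generated by $y = 3x$, and a learning rate picked by hand; backpropagation through multiple layers is not shown.

```python
def train_weight(samples, learning_rate=0.1, steps=100):
    """Gradient descent on the mean of 0.5 * (w*x - y)^2 over `samples`."""
    w = 0.0
    for _ in range(steps):
        # dL/dw for one sample is (w*x - y) * x; average it over the data.
        grad = sum((w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad  # the update w <- w - alpha * dL/dw
    return w

samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # generated by y = 3x
w = train_weight(samples)
print(f"learned w = {w:.6f}")  # converges toward 3
```

With this data each step shrinks the error $w - 3$ by a constant factor, so too large a learning rate would make that factor exceed 1 in magnitude and the iteration would diverge.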

### 10. Conclusion

Deep learning's power lies in its ability to:
1. **Learn hierarchical representations** automatically
2. **Compose simple functions** into complex mappings  
3. **Disentangle factors of variation** in high-dimensional data
4. **Scale with data and compute** to solve previously intractable problems

The key insight is that **depth enables complexity**: by stacking simple transformations, we can build arbitrarily sophisticated functions that capture the underlying structure of data.

![image.png](attachment:image.png)

Fig.2: Illustration of a deep learning model.

In [1]:
import math
import random
from typing import List, Tuple, Callable, Optional

class Matrix:
    """Matrix class for linear algebra operations"""
    
    def __init__(self, data: List[List[float]]):
        self.data = data
        self.rows = len(data)
        self.cols = len(data[0]) if data else 0
        
    def __getitem__(self, key):
        return self.data[key]
    
    def __setitem__(self, key, value):
        self.data[key] = value
    
    def __repr__(self):
        return f"Matrix({self.rows}x{self.cols})"
    
    def transpose(self) -> 'Matrix':
        """Transpose the matrix"""
        transposed = [[self.data[j][i] for j in range(self.rows)] 
                     for i in range(self.cols)]
        return Matrix(transposed)
    
    def multiply(self, other: 'Matrix') -> 'Matrix':
        """Matrix multiplication"""
        if self.cols != other.rows:
            raise ValueError("Matrix dimensions don't match for multiplication")
        
        result = [[0.0 for _ in range(other.cols)] for _ in range(self.rows)]
        
        for i in range(self.rows):
            for j in range(other.cols):
                for k in range(self.cols):
                    result[i][j] += self.data[i][k] * other.data[k][j]
        
        return Matrix(result)
    
    def add(self, other: 'Matrix') -> 'Matrix':
        """Matrix addition"""
        if self.rows != other.rows or self.cols != other.cols:
            raise ValueError("Matrix dimensions don't match for addition")
        
        result = [[self.data[i][j] + other.data[i][j] 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def subtract(self, other: 'Matrix') -> 'Matrix':
        """Matrix subtraction"""
        if self.rows != other.rows or self.cols != other.cols:
            raise ValueError("Matrix dimensions don't match for subtraction")
        
        result = [[self.data[i][j] - other.data[i][j] 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def scalar_multiply(self, scalar: float) -> 'Matrix':
        """Scalar multiplication"""
        result = [[self.data[i][j] * scalar 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def apply_function(self, func: Callable[[float], float]) -> 'Matrix':
        """Apply function element-wise"""
        result = [[func(self.data[i][j]) 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    @staticmethod
    def zeros(rows: int, cols: int) -> 'Matrix':
        """Create zero matrix"""
        return Matrix([[0.0 for _ in range(cols)] for _ in range(rows)])
    
    @staticmethod
    def random_matrix(rows: int, cols: int, scale: float = 1.0) -> 'Matrix':
        """Create random matrix"""
        data = [[random.gauss(0, scale) for _ in range(cols)] 
                for _ in range(rows)]
        return Matrix(data)

class ActivationFunctions:
    """Collection of activation functions and their derivatives"""
    
    @staticmethod
    def sigmoid(x: float) -> float:
        """Sigmoid activation function"""
        # Clamp x to [-500, 500] so math.exp cannot overflow
        return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))
    
    @staticmethod
    def sigmoid_derivative(x: float) -> float:
        """Derivative of sigmoid"""
        s = ActivationFunctions.sigmoid(x)
        return s * (1 - s)
    
    @staticmethod
    def relu(x: float) -> float:
        """ReLU activation function"""
        return max(0.0, x)  # 0.0 keeps the return type float for negative x
    
    @staticmethod
    def relu_derivative(x: float) -> float:
        """Derivative of ReLU"""
        return 1.0 if x > 0 else 0.0
    
    @staticmethod
    def tanh(x: float) -> float:
        """Tanh activation function"""
        return math.tanh(x)
    
    @staticmethod
    def tanh_derivative(x: float) -> float:
        """Derivative of tanh"""
        t = math.tanh(x)
        return 1 - t * t
    
    @staticmethod
    def linear(x: float) -> float:
        """Linear activation (identity)"""
        return x
    
    @staticmethod
    def linear_derivative(x: float) -> float:
        """Derivative of linear"""
        return 1.0

class LossFunctions:
    """Collection of loss functions and their derivatives"""
    
    @staticmethod
    def mean_squared_error(predicted: List[float], actual: List[float]) -> float:
        """Mean squared error loss"""
        if len(predicted) != len(actual):
            raise ValueError("Prediction and actual lengths must match")
        
        total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
        return total / len(predicted)
    
    @staticmethod
    def mse_derivative(predicted: List[float], actual: List[float]) -> List[float]:
        """Derivative of MSE with respect to predictions"""
        if len(predicted) != len(actual):
            raise ValueError("Prediction and actual lengths must match")
        
        n = len(predicted)
        return [2 * (p - a) / n for p, a in zip(predicted, actual)]
    
    @staticmethod
    def cross_entropy(predicted: List[float], actual: List[float]) -> float:
        """Binary cross-entropy loss, averaged over the outputs"""
        epsilon = 1e-15  # For numerical stability
        total = 0
        for p, a in zip(predicted, actual):
            p = max(epsilon, min(1 - epsilon, p))  # Clip for stability
            total += a * math.log(p) + (1 - a) * math.log(1 - p)
        return -total / len(predicted)

class CoordinateTransforms:
    """Coordinate system transformations"""
    
    @staticmethod
    def cartesian_to_polar(x: float, y: float) -> Tuple[float, float]:
        """Convert Cartesian to polar coordinates"""
        r = math.sqrt(x**2 + y**2)
        theta = math.atan2(y, x)
        return r, theta
    
    @staticmethod
    def polar_to_cartesian(r: float, theta: float) -> Tuple[float, float]:
        """Convert polar to Cartesian coordinates"""
        x = r * math.cos(theta)
        y = r * math.sin(theta)
        return x, y
    
    @staticmethod
    def transform_dataset(data: List[Tuple[float, float]], 
                         transform_func: Callable) -> List[Tuple[float, float]]:
        """Apply coordinate transformation to dataset"""
        return [transform_func(x, y) for x, y in data]

class NeuralLayer:
    """Single layer of a neural network"""
    
    def __init__(self, input_size: int, output_size: int, 
                 activation: str = 'sigmoid'):
        self.input_size = input_size
        self.output_size = output_size
        
        # Initialize weights with Xavier initialization
        scale = math.sqrt(2.0 / (input_size + output_size))
        self.weights = Matrix.random_matrix(output_size, input_size, scale)
        self.biases = Matrix.zeros(output_size, 1)
        
        # Set activation function
        self.activation_name = activation
        if activation == 'sigmoid':
            self.activation = ActivationFunctions.sigmoid
            self.activation_derivative = ActivationFunctions.sigmoid_derivative
        elif activation == 'relu':
            self.activation = ActivationFunctions.relu
            self.activation_derivative = ActivationFunctions.relu_derivative
        elif activation == 'tanh':
            self.activation = ActivationFunctions.tanh
            self.activation_derivative = ActivationFunctions.tanh_derivative
        else:  # linear
            self.activation = ActivationFunctions.linear
            self.activation_derivative = ActivationFunctions.linear_derivative
        
        # Store for backpropagation
        self.last_input = None
        self.last_z = None  # Pre-activation values
        self.last_output = None
    
    def forward(self, input_matrix: Matrix) -> Matrix:
        """Forward pass through the layer"""
        self.last_input = input_matrix
        
        # z = W * x + b
        z = self.weights.multiply(input_matrix).add(self.biases)
        self.last_z = z
        
        # Apply activation function
        output = z.apply_function(self.activation)
        self.last_output = output
        
        return output
    
    def backward(self, output_gradient: Matrix, learning_rate: float) -> Matrix:
        """Backward pass (backpropagation)"""
        # Compute activation derivative
        activation_grad = self.last_z.apply_function(self.activation_derivative)
        
        # Element-wise multiplication of gradients
        delta = Matrix([[output_gradient[i][j] * activation_grad[i][j]
                        for j in range(output_gradient.cols)]
                       for i in range(output_gradient.rows)])
        
        # Compute gradients (the input gradient uses the weights before the update)
        input_gradient = self.weights.transpose().multiply(delta)
        weight_gradient = delta.multiply(self.last_input.transpose())
        bias_gradient = delta
        
        # Update weights and biases
        self.weights = self.weights.subtract(
            weight_gradient.scalar_multiply(learning_rate))
        self.biases = self.biases.subtract(
            bias_gradient.scalar_multiply(learning_rate))
        
        return input_gradient

class MultilayerPerceptron:
    """Multi-layer Perceptron (Deep Neural Network)"""
    
    def __init__(self, layer_sizes: List[int], 
                 activations: List[str] = None):
        if len(layer_sizes) < 2:
            raise ValueError("Need at least input and output layers")
        
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes) - 1
        
        # Default activations
        if activations is None:
            activations = ['sigmoid'] * (self.num_layers - 1) + ['linear']
        
        # Create layers
        self.layers = []
        for i in range(self.num_layers):
            layer = NeuralLayer(layer_sizes[i], layer_sizes[i + 1], 
                              activations[i])
            self.layers.append(layer)
    
    def forward(self, input_data: List[float]) -> List[float]:
        """Forward propagation through the network"""
        # Convert input to matrix
        current = Matrix([[x] for x in input_data])
        
        # Forward through each layer
        for layer in self.layers:
            current = layer.forward(current)
        
        # Convert output back to list
        return [current[i][0] for i in range(current.rows)]
    
    def backward(self, predicted: List[float], actual: List[float], 
                learning_rate: float = 0.01):
        """Backward propagation (training)"""
        # Compute output gradient
        output_grad = LossFunctions.mse_derivative(predicted, actual)
        current_grad = Matrix([[grad] for grad in output_grad])
        
        # Backpropagate through layers (reverse order)
        for layer in reversed(self.layers):
            current_grad = layer.backward(current_grad, learning_rate)
    
    def train(self, training_data: List[Tuple[List[float], List[float]]], 
              epochs: int = 1000, learning_rate: float = 0.01, 
              verbose: bool = True):
        """Train the neural network"""
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs, targets in training_data:
                # Forward pass
                predicted = self.forward(inputs)
                
                # Compute loss
                loss = LossFunctions.mean_squared_error(predicted, targets)
                total_loss += loss
                
                # Backward pass
                self.backward(predicted, targets, learning_rate)
            
            if verbose and epoch % 100 == 0:
                avg_loss = total_loss / len(training_data)
                print(f"Epoch {epoch}, Average Loss: {avg_loss:.6f}")
    
    def predict(self, input_data: List[float]) -> List[float]:
        """Make prediction"""
        return self.forward(input_data)

class Autoencoder:
    """Autoencoder for representation learning"""
    
    def __init__(self, input_size: int, hidden_sizes: List[int]):
        # Create encoder layers
        encoder_sizes = [input_size] + hidden_sizes
        self.encoder = MultilayerPerceptron(
            encoder_sizes, 
            ['sigmoid'] * (len(encoder_sizes) - 2) + ['linear']
        )
        
        # Create decoder layers (reverse of encoder)
        decoder_sizes = list(reversed(encoder_sizes))
        self.decoder = MultilayerPerceptron(
            decoder_sizes,
            ['sigmoid'] * (len(decoder_sizes) - 2) + ['linear']
        )
    
    def encode(self, input_data: List[float]) -> List[float]:
        """Encode input to hidden representation"""
        return self.encoder.forward(input_data)
    
    def decode(self, hidden_data: List[float]) -> List[float]:
        """Decode hidden representation back to input space"""
        return self.decoder.forward(hidden_data)
    
    def forward(self, input_data: List[float]) -> List[float]:
        """Full autoencoder forward pass"""
        hidden = self.encode(input_data)
        reconstructed = self.decode(hidden)
        return reconstructed
    
    def train(self, data: List[List[float]], epochs: int = 1000, 
              learning_rate: float = 0.01, verbose: bool = True):
        """Train autoencoder to reconstruct input"""
        training_data = [(x, x) for x in data]  # Target is same as input
        
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs, targets in training_data:
                # Forward pass through full autoencoder
                reconstructed = self.forward(inputs)
                
                # Compute reconstruction loss
                loss = LossFunctions.mean_squared_error(reconstructed, targets)
                total_loss += loss
                
                # Train encoder and decoder separately
                hidden = self.encode(inputs)
                
                # Backward pass for decoder (reuses the activations cached by
                # the forward pass above)
                self.decoder.backward(reconstructed, targets, learning_rate)
                
                # Backward pass for encoder
                # BUG: `hidden` has the bottleneck size while `targets` has the
                # input size, so mse_derivative raises ValueError whenever the
                # two sizes differ. The encoder's gradient has to be propagated
                # back through the decoder instead.
                self.encoder.backward(hidden, targets, learning_rate * 0.5)
            
            if verbose and epoch % 100 == 0:
                avg_loss = total_loss / len(data)
                print(f"Epoch {epoch}, Reconstruction Loss: {avg_loss:.6f}")

class FactorAnalysis:
    """Simple factor analysis for understanding data variation"""
    
    def __init__(self, data: List[List[float]]):
        self.data = data
        self.num_samples = len(data)
        self.num_features = len(data[0]) if data else 0
        
    def compute_mean(self) -> List[float]:
        """Compute mean of each feature"""
        if not self.data:
            return []
        
        means = [0.0] * self.num_features
        for sample in self.data:
            for i, value in enumerate(sample):
                means[i] += value
        
        return [mean / self.num_samples for mean in means]
    
    def compute_variance(self) -> List[float]:
        """Compute population variance (divides by N) of each feature"""
        means = self.compute_mean()
        variances = [0.0] * self.num_features
        
        for sample in self.data:
            for i, value in enumerate(sample):
                variances[i] += (value - means[i]) ** 2
        
        return [var / self.num_samples for var in variances]
    
    def normalize_data(self) -> List[List[float]]:
        """Normalize data (zero mean, unit variance)"""
        means = self.compute_mean()
        variances = self.compute_variance()
        std_devs = [math.sqrt(var) for var in variances]
        
        normalized = []
        for sample in self.data:
            normalized_sample = []
            for i, value in enumerate(sample):
                if std_devs[i] > 0:
                    normalized_value = (value - means[i]) / std_devs[i]
                else:
                    normalized_value = 0.0
                normalized_sample.append(normalized_value)
            normalized.append(normalized_sample)
        
        return normalized

def demonstrate_coordinate_transformation():
    """Demonstrate coordinate transformation for data separation"""
    print("=== Coordinate Transformation Demo ===")
    
    # Generate circular data that's hard to separate in Cartesian coordinates
    data_cartesian = []
    labels = []
    
    for _ in range(100):
        # Inner circle (class 0)
        r = random.uniform(0.5, 1.0)
        theta = random.uniform(0, 2 * math.pi)
        x, y = CoordinateTransforms.polar_to_cartesian(r, theta)
        data_cartesian.append((x, y))
        labels.append(0)
        
        # Outer circle (class 1)
        r = random.uniform(1.5, 2.0)
        theta = random.uniform(0, 2 * math.pi)
        x, y = CoordinateTransforms.polar_to_cartesian(r, theta)
        data_cartesian.append((x, y))
        labels.append(1)
    
    # Convert to polar coordinates
    data_polar = [CoordinateTransforms.cartesian_to_polar(x, y) 
                  for x, y in data_cartesian]
    
    print(f"Generated {len(data_cartesian)} data points")
    print("Sample Cartesian points:", data_cartesian[:5])
    print("Sample Polar points:", data_polar[:5])
    print("Sample labels:", labels[:5])
    
    # In polar coordinates, separation is easy: r < 1.25 vs r >= 1.25
    correct_polar = sum(1 for i, (r, theta) in enumerate(data_polar)
                       if (r < 1.25 and labels[i] == 0) or 
                          (r >= 1.25 and labels[i] == 1))
    
    print(f"Polar coordinate classification accuracy: {correct_polar/len(labels):.2%}")

def demonstrate_mlp():
    """Demonstrate Multi-layer Perceptron"""
    print("\n=== Multi-layer Perceptron Demo ===")
    
    # Create XOR dataset (classic non-linearly separable problem)
    xor_data = [
        ([0, 0], [0]),
        ([0, 1], [1]),
        ([1, 0], [1]),
        ([1, 1], [0])
    ]
    
    # Create MLP with hidden layer
    mlp = MultilayerPerceptron([2, 4, 1], ['sigmoid', 'sigmoid'])
    
    print("Training MLP on XOR problem...")
    mlp.train(xor_data, epochs=2000, learning_rate=0.5, verbose=True)
    
    # Test predictions
    print("\nXOR Predictions:")
    for inputs, expected in xor_data:
        predicted = mlp.predict(inputs)
        print(f"Input: {inputs}, Expected: {expected[0]}, "
              f"Predicted: {predicted[0]:.3f}")

def demonstrate_autoencoder():
    """Demonstrate Autoencoder"""
    print("\n=== Autoencoder Demo ===")
    
    # Create simple dataset
    data = [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 1]
    ]
    
    # Create autoencoder (4-2-4: compress to 2D)
    autoencoder = Autoencoder(4, [2])
    
    print("Training autoencoder...")
    autoencoder.train(data, epochs=1000, learning_rate=0.1, verbose=True)
    
    # Test reconstruction
    print("\nReconstruction Results:")
    for i, original in enumerate(data):
        reconstructed = autoencoder.forward(original)
        encoded = autoencoder.encode(original)
        
        print(f"Original: {[f'{x:.1f}' for x in original]}")
        print(f"Encoded:  {[f'{x:.3f}' for x in encoded]}")
        print(f"Reconstructed: {[f'{x:.3f}' for x in reconstructed]}")
        print()

def demonstrate_hierarchical_features():
    """Demonstrate hierarchical feature learning concept"""
    print("\n=== Hierarchical Feature Learning Demo ===")
    
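    # Simulate image-like data (8x8 = 64 pixels, stored row by row)
    # Create simple patterns
    edge_horizontal = [1] * 32 + [0] * 32             # top four rows on
    edge_vertical = [1, 1, 1, 1, 0, 0, 0, 0] * 8      # left four columns on
    corner = [1, 1, 1, 1, 0, 0, 0, 0] * 4 + [0] * 32  # top-left quadrant on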
    
    patterns = [edge_horizontal, edge_vertical, corner]
    pattern_names = ["Horizontal Edge", "Vertical Edge", "Corner"]
    
    # Create deep network for hierarchical learning
    deep_net = MultilayerPerceptron(
        [64, 32, 16, 8, 3],  # 64 -> 32 -> 16 -> 8 -> 3 features
        ['relu', 'relu', 'relu', 'linear']
    )
    
    # Create training data (patterns -> one-hot labels)
    training_data = []
    for i, pattern in enumerate(patterns):
        label = [0, 0, 0]
        label[i] = 1
        training_data.append((pattern, label))
    
    print("Training deep network for pattern recognition...")
    deep_net.train(training_data, epochs=1000, learning_rate=0.01, verbose=True)
    
    # Test pattern recognition
    print("\nPattern Recognition Results:")
    for i, (pattern, name) in enumerate(zip(patterns, pattern_names)):
        prediction = deep_net.predict(pattern)
        predicted_class = prediction.index(max(prediction))
        confidence = max(prediction)
        
        print(f"{name}: Predicted class {predicted_class}, "
              f"Confidence: {confidence:.3f}")

def main():
    """Main demonstration function"""
    print("Deep Learning Implementation in Core Python")
    print("=" * 50)
    
    # Set random seed for reproducibility
    random.seed(42)
    
    # Run demonstrations
    demonstrate_coordinate_transformation()
    demonstrate_mlp()
    demonstrate_autoencoder()
    demonstrate_hierarchical_features()
    
    print("\n" + "=" * 50)
    print("All demonstrations completed!")

if __name__ == "__main__":
    main()

Deep Learning Implementation in Core Python
=== Coordinate Transformation Demo ===
Generated 200 data points
Sample Cartesian points: [(0.8096126996969449, 0.12828613863115132), (0.2743298806071533, 1.614372130473761), (-0.38588771238938485, -0.7777684377962036), (1.6628892480320796, 1.010972203175772), (0.6985370082648876, 0.13233088607202836)]
Sample Polar points: [(0.8197133992289418, 0.157147209736526), (1.6375146591845597, 1.402474430341393), (0.8682356070820061, -2.031357030427993), (1.9460897838524227, 0.5462527958004928), (0.7109609098426354, 0.18722145136808954)]
Sample labels: [0, 1, 0, 1, 0]
Polar coordinate classification accuracy: 100.00%

=== Multi-layer Perceptron Demo ===
Training MLP on XOR problem...
Epoch 0, Average Loss: 0.302097
Epoch 100, Average Loss: 0.275704
Epoch 200, Average Loss: 0.269335
Epoch 300, Average Loss: 0.245575
Epoch 400, Average Loss: 0.195457
Epoch 500, Average Loss: 0.136552
Epoch 600, Average Loss: 0.045073
Epoch 700, Average Loss: 0.017113
Ep

ValueError: Prediction and actual lengths must match

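The `ValueError` above appears to come from `Autoencoder.train`, which calls `self.encoder.backward(hidden, targets, ...)`. In the 4-2-4 demo `hidden` has 2 entries and `targets` has 4, so `LossFunctions.mse_derivative` rejects the pair. The encoder has no target of its own: its gradient has to arrive through the decoder via the chain rule. The following is a minimal sketch of that idea for a purely linear 4-2-4 autoencoder without biases. The names `matvec`, `joint_step`, `W_enc`, and `W_dec` are hypothetical and not part of the notebook's classes.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def joint_step(W_enc, W_dec, x, lr):
    """One SGD step on a linear autoencoder: h = W_enc x, y = W_dec h."""
    h = matvec(W_enc, x)   # bottleneck code
    y = matvec(W_dec, h)   # reconstruction
    n = len(x)
    loss = sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / n

    # dL/dy has the input size, matching the MSE derivative used in the notebook
    dy = [2 * (yi - xi) / n for yi, xi in zip(y, x)]

    # dL/dh = W_dec^T dy, computed BEFORE W_dec is updated
    dh = [sum(W_dec[i][j] * dy[i] for i in range(n)) for j in range(len(h))]

    # Gradient-descent updates: dL/dW_dec = dy h^T and dL/dW_enc = dh x^T
    for i in range(n):
        for j in range(len(h)):
            W_dec[i][j] -= lr * dy[i] * h[j]
    for j in range(len(h)):
        for k in range(n):
            W_enc[j][k] -= lr * dh[j] * x[k]
    return loss

rng = random.Random(0)
W_enc = [[rng.gauss(0, 0.5) for _ in range(4)] for _ in range(2)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(4)]
data = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def epoch_loss(lr):
    """Run one pass over the data and return the average loss."""
    return sum(joint_step(W_enc, W_dec, x, lr) for x in data) / len(data)

first_loss = epoch_loss(0.1)
for _ in range(500):
    last_loss = epoch_loss(0.1)

print(f"first epoch: {first_loss:.4f}, last epoch: {last_loss:.4f}")
```

The key point is that the input-sized gradient `dy` is mapped to a bottleneck-sized gradient `dh` before it reaches the encoder, so no length mismatch can occur. The loss should fall over training, although a 2-unit linear bottleneck cannot reconstruct four independent one-hot inputs exactly.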
In [2]:
import math
import random
from typing import List, Tuple, Callable, Optional

class Matrix:
    """Matrix class for linear algebra operations"""
    
    def __init__(self, data: List[List[float]]):
        self.data = data
        self.rows = len(data)
        self.cols = len(data[0]) if data else 0
        
    def __getitem__(self, key):
        return self.data[key]
    
    def __setitem__(self, key, value):
        self.data[key] = value
    
    def __repr__(self):
        return f"Matrix({self.rows}x{self.cols})"
    
    def transpose(self) -> 'Matrix':
        """Transpose the matrix"""
        transposed = [[self.data[j][i] for j in range(self.rows)] 
                     for i in range(self.cols)]
        return Matrix(transposed)
    
    def multiply(self, other: 'Matrix') -> 'Matrix':
        """Matrix multiplication"""
        if self.cols != other.rows:
            raise ValueError("Matrix dimensions don't match for multiplication")
        
        result = [[0.0 for _ in range(other.cols)] for _ in range(self.rows)]
        
        for i in range(self.rows):
            for j in range(other.cols):
                for k in range(self.cols):
                    result[i][j] += self.data[i][k] * other.data[k][j]
        
        return Matrix(result)
    
    def add(self, other: 'Matrix') -> 'Matrix':
        """Matrix addition"""
        if self.rows != other.rows or self.cols != other.cols:
            raise ValueError("Matrix dimensions don't match for addition")
        
        result = [[self.data[i][j] + other.data[i][j] 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def subtract(self, other: 'Matrix') -> 'Matrix':
        """Matrix subtraction"""
        if self.rows != other.rows or self.cols != other.cols:
            raise ValueError("Matrix dimensions don't match for subtraction")
        
        result = [[self.data[i][j] - other.data[i][j] 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def scalar_multiply(self, scalar: float) -> 'Matrix':
        """Scalar multiplication"""
        result = [[self.data[i][j] * scalar 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    def apply_function(self, func: Callable[[float], float]) -> 'Matrix':
        """Apply function element-wise"""
        result = [[func(self.data[i][j]) 
                  for j in range(self.cols)] for i in range(self.rows)]
        return Matrix(result)
    
    @staticmethod
    def zeros(rows: int, cols: int) -> 'Matrix':
        """Create zero matrix"""
        return Matrix([[0.0 for _ in range(cols)] for _ in range(rows)])
    
    @staticmethod
    def random_matrix(rows: int, cols: int, scale: float = 1.0) -> 'Matrix':
        """Create random matrix"""
        data = [[random.gauss(0, scale) for _ in range(cols)] 
                for _ in range(rows)]
        return Matrix(data)

class ActivationFunctions:
    """Collection of activation functions and their derivatives"""
    
    @staticmethod
    def sigmoid(x: float) -> float:
        """Sigmoid activation function"""
        # Clamp x to [-500, 500] so math.exp cannot overflow
        return 1.0 / (1.0 + math.exp(-max(-500, min(500, x))))
    
    @staticmethod
    def sigmoid_derivative(x: float) -> float:
        """Derivative of sigmoid"""
        s = ActivationFunctions.sigmoid(x)
        return s * (1 - s)
    
    @staticmethod
    def relu(x: float) -> float:
        """ReLU activation function"""
        return max(0.0, x)  # 0.0 keeps the return type float for negative x
    
    @staticmethod
    def relu_derivative(x: float) -> float:
        """Derivative of ReLU"""
        return 1.0 if x > 0 else 0.0
    
    @staticmethod
    def tanh(x: float) -> float:
        """Tanh activation function"""
        return math.tanh(x)
    
    @staticmethod
    def tanh_derivative(x: float) -> float:
        """Derivative of tanh"""
        t = math.tanh(x)
        return 1 - t * t
    
    @staticmethod
    def linear(x: float) -> float:
        """Linear activation (identity)"""
        return x
    
    @staticmethod
    def linear_derivative(x: float) -> float:
        """Derivative of linear"""
        return 1.0

class LossFunctions:
    """Collection of loss functions and their derivatives"""
    
    @staticmethod
    def mean_squared_error(predicted: List[float], actual: List[float]) -> float:
        """Mean squared error loss"""
        if len(predicted) != len(actual):
            raise ValueError("Prediction and actual lengths must match")
        
        total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
        return total / len(predicted)
    
    @staticmethod
    def mse_derivative(predicted: List[float], actual: List[float]) -> List[float]:
        """Derivative of MSE with respect to predictions"""
        if len(predicted) != len(actual):
            raise ValueError("Prediction and actual lengths must match")
        
        n = len(predicted)
        return [2 * (p - a) / n for p, a in zip(predicted, actual)]
    
    @staticmethod
    def cross_entropy(predicted: List[float], actual: List[float]) -> float:
        """Binary cross-entropy loss, averaged over the outputs"""
        epsilon = 1e-15  # For numerical stability
        total = 0
        for p, a in zip(predicted, actual):
            p = max(epsilon, min(1 - epsilon, p))  # Clip for stability
            total += a * math.log(p) + (1 - a) * math.log(1 - p)
        return -total / len(predicted)

class CoordinateTransforms:
    """Coordinate system transformations"""
    
    @staticmethod
    def cartesian_to_polar(x: float, y: float) -> Tuple[float, float]:
        """Convert Cartesian to polar coordinates"""
        r = math.sqrt(x**2 + y**2)
        theta = math.atan2(y, x)
        return r, theta
    
    @staticmethod
    def polar_to_cartesian(r: float, theta: float) -> Tuple[float, float]:
        """Convert polar to Cartesian coordinates"""
        x = r * math.cos(theta)
        y = r * math.sin(theta)
        return x, y
    
    @staticmethod
    def transform_dataset(data: List[Tuple[float, float]], 
                         transform_func: Callable) -> List[Tuple[float, float]]:
        """Apply coordinate transformation to dataset"""
        return [transform_func(x, y) for x, y in data]

class NeuralLayer:
    """Single layer of a neural network"""
    
    def __init__(self, input_size: int, output_size: int, 
                 activation: str = 'sigmoid'):
        self.input_size = input_size
        self.output_size = output_size
        
        # Initialize weights with Xavier initialization
        scale = math.sqrt(2.0 / (input_size + output_size))
        self.weights = Matrix.random_matrix(output_size, input_size, scale)
        self.biases = Matrix.zeros(output_size, 1)
        
        # Set activation function
        self.activation_name = activation
        if activation == 'sigmoid':
            self.activation = ActivationFunctions.sigmoid
            self.activation_derivative = ActivationFunctions.sigmoid_derivative
        elif activation == 'relu':
            self.activation = ActivationFunctions.relu
            self.activation_derivative = ActivationFunctions.relu_derivative
        elif activation == 'tanh':
            self.activation = ActivationFunctions.tanh
            self.activation_derivative = ActivationFunctions.tanh_derivative
        else:  # linear
            self.activation = ActivationFunctions.linear
            self.activation_derivative = ActivationFunctions.linear_derivative
        
        # Store for backpropagation
        self.last_input = None
        self.last_z = None  # Pre-activation values
        self.last_output = None
    
    def forward(self, input_matrix: Matrix) -> Matrix:
        """Forward pass through the layer"""
        self.last_input = input_matrix
        
        # z = W * x + b
        z = self.weights.multiply(input_matrix).add(self.biases)
        self.last_z = z
        
        # Apply activation function
        output = z.apply_function(self.activation)
        self.last_output = output
        
        return output
    
    def backward(self, output_gradient: Matrix, learning_rate: float) -> Matrix:
        """Backward pass (backpropagation)"""
        # Compute activation derivative
        activation_grad = self.last_z.apply_function(self.activation_derivative)
        
        # Element-wise multiplication of gradients
        delta = Matrix([[output_gradient[i][j] * activation_grad[i][j]
                        for j in range(output_gradient.cols)]
                       for i in range(output_gradient.rows)])
        
        # Compute gradients (the input gradient uses the weights before the update)
        input_gradient = self.weights.transpose().multiply(delta)
        weight_gradient = delta.multiply(self.last_input.transpose())
        bias_gradient = delta
        
        # Update weights and biases
        self.weights = self.weights.subtract(
            weight_gradient.scalar_multiply(learning_rate))
        self.biases = self.biases.subtract(
            bias_gradient.scalar_multiply(learning_rate))
        
        return input_gradient

class MultilayerPerceptron:
    """Multi-layer Perceptron (Deep Neural Network)"""
    
    def __init__(self, layer_sizes: List[int], 
                 activations: Optional[List[str]] = None):
        if len(layer_sizes) < 2:
            raise ValueError("Need at least input and output layers")
        
        self.layer_sizes = layer_sizes
        self.num_layers = len(layer_sizes) - 1
        
        # Default activations
        if activations is None:
            activations = ['sigmoid'] * (self.num_layers - 1) + ['linear']
        
        # Create layers
        self.layers = []
        for i in range(self.num_layers):
            layer = NeuralLayer(layer_sizes[i], layer_sizes[i + 1], 
                              activations[i])
            self.layers.append(layer)
    
    def forward(self, input_data: List[float]) -> List[float]:
        """Forward propagation through the network"""
        # Convert input to matrix
        current = Matrix([[x] for x in input_data])
        
        # Forward through each layer
        for layer in self.layers:
            current = layer.forward(current)
        
        # Convert output back to list
        return [current[i][0] for i in range(current.rows)]
    
    def backward(self, predicted: List[float], actual: List[float], 
                learning_rate: float = 0.01):
        """Backward propagation (training)"""
        # Compute output gradient
        output_grad = LossFunctions.mse_derivative(predicted, actual)
        current_grad = Matrix([[grad] for grad in output_grad])
        
        # Backpropagate through layers (reverse order)
        for layer in reversed(self.layers):
            current_grad = layer.backward(current_grad, learning_rate)
    
    def train(self, training_data: List[Tuple[List[float], List[float]]], 
              epochs: int = 1000, learning_rate: float = 0.01, 
              verbose: bool = True):
        """Train the neural network"""
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs, targets in training_data:
                # Forward pass
                predicted = self.forward(inputs)
                
                # Compute loss
                loss = LossFunctions.mean_squared_error(predicted, targets)
                total_loss += loss
                
                # Backward pass
                self.backward(predicted, targets, learning_rate)
            
            if verbose and epoch % 100 == 0:
                avg_loss = total_loss / len(training_data)
                print(f"Epoch {epoch}, Average Loss: {avg_loss:.6f}")
    
    def predict(self, input_data: List[float]) -> List[float]:
        """Make prediction"""
        return self.forward(input_data)

class Autoencoder:
    """Autoencoder for representation learning"""
    
    def __init__(self, input_size: int, hidden_sizes: List[int]):
        # Create encoder layers
        encoder_sizes = [input_size] + hidden_sizes
        self.encoder = MultilayerPerceptron(
            encoder_sizes, 
            ['sigmoid'] * (len(encoder_sizes) - 2) + ['linear']
        )
        
        # Create decoder layers (reverse of encoder)
        decoder_sizes = list(reversed(encoder_sizes))
        self.decoder = MultilayerPerceptron(
            decoder_sizes,
            ['sigmoid'] * (len(decoder_sizes) - 2) + ['linear']
        )
    
    def encode(self, input_data: List[float]) -> List[float]:
        """Encode input to hidden representation"""
        return self.encoder.forward(input_data)
    
    def decode(self, hidden_data: List[float]) -> List[float]:
        """Decode hidden representation back to input space"""
        return self.decoder.forward(hidden_data)
    
    def forward(self, input_data: List[float]) -> List[float]:
        """Full autoencoder forward pass"""
        hidden = self.encode(input_data)
        reconstructed = self.decode(hidden)
        return reconstructed
    
    def train(self, data: List[List[float]], epochs: int = 1000, 
              learning_rate: float = 0.01, verbose: bool = True):
        """Train autoencoder to reconstruct input"""
        
        for epoch in range(epochs):
            total_loss = 0
            
            for inputs in data:
                # Forward pass through full autoencoder
                reconstructed = self.forward(inputs)
                
                # Compute reconstruction loss
                loss = LossFunctions.mean_squared_error(reconstructed, inputs)
                total_loss += loss
                
                # Proper autoencoder training:
                # 1. Train decoder to reconstruct from hidden representation
                hidden = self.encode(inputs)
                self.decoder.train([(hidden, inputs)], epochs=1, 
                                 learning_rate=learning_rate, verbose=False)
                
                # 2. Train encoder by minimizing reconstruction error
                # We need to backpropagate through the entire autoencoder
                self._train_encoder_decoder_jointly(inputs, learning_rate)
            
            if verbose and epoch % 100 == 0:
                avg_loss = total_loss / len(data)
                print(f"Epoch {epoch}, Reconstruction Loss: {avg_loss:.6f}")
    
    def _train_encoder_decoder_jointly(self, inputs: List[float], learning_rate: float):
        """Train encoder and decoder jointly with one end-to-end backward pass"""
        # Forward pass through both networks. This assumes, as in
        # MultilayerPerceptron.backward, that each layer caches what it
        # needs from its most recent forward call.
        reconstructed = self.forward(inputs)
        
        # Gradient of the reconstruction loss with respect to the output
        output_grad = LossFunctions.mse_derivative(reconstructed, inputs)
        current_grad = Matrix([[grad] for grad in output_grad])
        
        # Backpropagate through the decoder, then carry the gradient at the
        # code layer on into the encoder (chain rule across both networks).
        # Training the encoder on (inputs, hidden) instead would give a zero
        # gradient, because `hidden` is the encoder's own output.
        for layer in reversed(self.decoder.layers):
            current_grad = layer.backward(current_grad, learning_rate)
        for layer in reversed(self.encoder.layers):
            current_grad = layer.backward(current_grad, learning_rate)

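class FactorAnalysis:
    """Per-feature statistics (mean, variance) and standardization.
    
    A preprocessing helper for studying how the data varies; it does not
    fit a full factor-analysis model.
    """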
    
    def __init__(self, data: List[List[float]]):
        self.data = data
        self.num_samples = len(data)
        self.num_features = len(data[0]) if data else 0
        
    def compute_mean(self) -> List[float]:
        """Compute mean of each feature"""
        if not self.data:
            return []
        
        means = [0.0] * self.num_features
        for sample in self.data:
            for i, value in enumerate(sample):
                means[i] += value
        
        return [mean / self.num_samples for mean in means]
    
    def compute_variance(self) -> List[float]:
        """Compute variance of each feature"""
        means = self.compute_mean()
        variances = [0.0] * self.num_features
        
        for sample in self.data:
            for i, value in enumerate(sample):
                variances[i] += (value - means[i]) ** 2
        
        return [var / self.num_samples for var in variances]
    
    def normalize_data(self) -> List[List[float]]:
        """Normalize data (zero mean, unit variance)"""
        means = self.compute_mean()
        variances = self.compute_variance()
        std_devs = [math.sqrt(var) for var in variances]
        
        normalized = []
        for sample in self.data:
            normalized_sample = []
            for i, value in enumerate(sample):
                if std_devs[i] > 0:
                    normalized_value = (value - means[i]) / std_devs[i]
                else:
                    normalized_value = 0.0
                normalized_sample.append(normalized_value)
            normalized.append(normalized_sample)
        
        return normalized

def demonstrate_coordinate_transformation():
    """Demonstrate coordinate transformation for data separation"""
    print("=== Coordinate Transformation Demo ===")
    
    # Generate circular data that's hard to separate in Cartesian coordinates
    data_cartesian = []
    labels = []
    
    for _ in range(100):
        # Inner circle (class 0)
        r = random.uniform(0.5, 1.0)
        theta = random.uniform(0, 2 * math.pi)
        x, y = CoordinateTransforms.polar_to_cartesian(r, theta)
        data_cartesian.append((x, y))
        labels.append(0)
        
        # Outer circle (class 1)
        r = random.uniform(1.5, 2.0)
        theta = random.uniform(0, 2 * math.pi)
        x, y = CoordinateTransforms.polar_to_cartesian(r, theta)
        data_cartesian.append((x, y))
        labels.append(1)
    
    # Convert to polar coordinates
    data_polar = [CoordinateTransforms.cartesian_to_polar(x, y) 
                  for x, y in data_cartesian]
    
    print(f"Generated {len(data_cartesian)} data points")
    print("Sample Cartesian points:", data_cartesian[:5])
    print("Sample Polar points:", data_polar[:5])
    print("Sample labels:", labels[:5])
    
    # In polar coordinates, separation is easy: r < 1.25 vs r >= 1.25
    correct_polar = sum(1 for i, (r, theta) in enumerate(data_polar)
                       if (r < 1.25 and labels[i] == 0) or 
                          (r >= 1.25 and labels[i] == 1))
    
    print(f"Polar coordinate classification accuracy: {correct_polar/len(labels):.2%}")

def demonstrate_mlp():
    """Demonstrate Multi-layer Perceptron"""
    print("\n=== Multi-layer Perceptron Demo ===")
    
    # Create XOR dataset (classic non-linearly separable problem)
    xor_data = [
        ([0, 0], [0]),
        ([0, 1], [1]),
        ([1, 0], [1]),
        ([1, 1], [0])
    ]
    
    # Create MLP with hidden layer
    mlp = MultilayerPerceptron([2, 4, 1], ['sigmoid', 'sigmoid'])
    
    print("Training MLP on XOR problem...")
    mlp.train(xor_data, epochs=2000, learning_rate=0.5, verbose=True)
    
    # Test predictions
    print("\nXOR Predictions:")
    for inputs, expected in xor_data:
        predicted = mlp.predict(inputs)
        print(f"Input: {inputs}, Expected: {expected[0]}, "
              f"Predicted: {predicted[0]:.3f}")

def demonstrate_autoencoder():
    """Demonstrate Autoencoder"""
    print("\n=== Autoencoder Demo ===")
    
    # Create simple dataset
    data = [
        [1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 0, 1, 1]
    ]
    
    # Create autoencoder (4-2-4: compress to 2D)
    autoencoder = Autoencoder(4, [2])
    
    print("Training autoencoder...")
    autoencoder.train(data, epochs=1000, learning_rate=0.1, verbose=True)
    
    # Test reconstruction
    print("\nReconstruction Results:")
    for i, original in enumerate(data):
        reconstructed = autoencoder.forward(original)
        encoded = autoencoder.encode(original)
        
        print(f"Original: {[f'{x:.1f}' for x in original]}")
        print(f"Encoded:  {[f'{x:.3f}' for x in encoded]}")
        print(f"Reconstructed: {[f'{x:.3f}' for x in reconstructed]}")
        print()

def demonstrate_hierarchical_features():
    """Demonstrate hierarchical feature learning concept"""
    print("\n=== Hierarchical Feature Learning Demo ===")
    
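    # Simulate image-like data (8x8 = 64 pixels, flattened row by row)
    # Create simple patterns
    edge_horizontal = [1] * 32 + [0] * 32             # top half on, bottom half off
    edge_vertical = [1, 1, 1, 1, 0, 0, 0, 0] * 8      # left half on, right half off
    corner = [1, 1, 1, 1, 0, 0, 0, 0] * 4 + [0] * 32  # top-left quadrant on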
    
    patterns = [edge_horizontal, edge_vertical, corner]
    pattern_names = ["Horizontal Edge", "Vertical Edge", "Corner"]
    
    # Create deep network for hierarchical learning
    deep_net = MultilayerPerceptron(
        [64, 32, 16, 8, 3],  # 64 -> 32 -> 16 -> 8 -> 3 features
        ['relu', 'relu', 'relu', 'linear']
    )
    
    # Create training data (patterns -> one-hot labels)
    training_data = []
    for i, pattern in enumerate(patterns):
        label = [0, 0, 0]
        label[i] = 1
        training_data.append((pattern, label))
    
    print("Training deep network for pattern recognition...")
    deep_net.train(training_data, epochs=1000, learning_rate=0.01, verbose=True)
    
    # Test pattern recognition
    print("\nPattern Recognition Results:")
    for i, (pattern, name) in enumerate(zip(patterns, pattern_names)):
        prediction = deep_net.predict(pattern)
        predicted_class = prediction.index(max(prediction))
        confidence = max(prediction)
        
        print(f"{name}: Predicted class {predicted_class}, "
              f"Confidence: {confidence:.3f}")

def main():
    """Main demonstration function"""
    print("Deep Learning Implementation in Core Python")
    print("=" * 50)
    
    # Set random seed for reproducibility
    random.seed(42)
    
    # Run demonstrations
    demonstrate_coordinate_transformation()
    demonstrate_mlp()
    demonstrate_autoencoder()
    demonstrate_hierarchical_features()
    
    print("\n" + "=" * 50)
    print("All demonstrations completed!")

if __name__ == "__main__":
    main()

Deep Learning Implementation in Core Python
=== Coordinate Transformation Demo ===
Generated 200 data points
Sample Cartesian points: [(0.8096126996969449, 0.12828613863115132), (0.2743298806071533, 1.614372130473761), (-0.38588771238938485, -0.7777684377962036), (1.6628892480320796, 1.010972203175772), (0.6985370082648876, 0.13233088607202836)]
Sample Polar points: [(0.8197133992289418, 0.157147209736526), (1.6375146591845597, 1.402474430341393), (0.8682356070820061, -2.031357030427993), (1.9460897838524227, 0.5462527958004928), (0.7109609098426354, 0.18722145136808954)]
Sample labels: [0, 1, 0, 1, 0]
Polar coordinate classification accuracy: 100.00%

=== Multi-layer Perceptron Demo ===
Training MLP on XOR problem...
Epoch 0, Average Loss: 0.302097
Epoch 100, Average Loss: 0.275704
Epoch 200, Average Loss: 0.269335
Epoch 300, Average Loss: 0.245575
Epoch 400, Average Loss: 0.195457
Epoch 500, Average Loss: 0.136552
Epoch 600, Average Loss: 0.045073
Epoch 700, Average Loss: 0.017113

![image-2.png](attachment:image-2.png)

Fig.3: Illustration of computational graphs mapping an input to an output where each node performs an operation

![image-3.png](attachment:image-3.png)

Fig.4: A Venn diagram showing how deep learning is a kind of representation learning, which is in turn a kind of machine learning, which is used for many but not all approaches to AI. Each section of the Venn diagram includes an example of an AI technology.

# 1.1 What Is Deep Learning?

## Depth and Computational Graphs

Consider a function computed as:

$$
\sigma(w^\top x)
$$

where:

- $x$ is the input vector
- $w$ are the model weights
- $\sigma$ is the logistic sigmoid function

Depending on how we define the basic operations in our computational language, this model can have different **depths**:

- If **addition, multiplication, and sigmoid** are atomic elements → depth = 3
- If we treat **logistic regression** as a single element → depth = 1
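
The dependence of depth on the chosen primitives can be checked with a small sketch. The nested-tuple expression format and the `depth` helper are illustrative assumptions, not part of the original text; depth is taken here to be the longest chain of operations from an input to the output, and a two-input case stands in for $w^\top x$.

```python
def depth(node):
    """Longest chain of operations from any leaf up to this node."""
    if isinstance(node, str):  # leaf: an input or a weight, costs no operation
        return 0
    _op, *args = node
    return 1 + max(depth(arg) for arg in args)

# Fine-grained primitives: multiply, add, sigmoid
fine_grained = ("sigmoid", ("add", ("mul", "w1", "x1"), ("mul", "w2", "x2")))

# Coarse primitive: logistic regression counted as a single operation
coarse = ("logistic_regression", "w", "x")

print(depth(fine_grained))  # 3
print(depth(coarse))        # 1
```

Both expressions describe the same function; only the vocabulary of allowed operations differs.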

---

## Computational Graph vs Conceptual Graph

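- **Computational depth**: length of the longest path of sequential operations from input to output
- **Conceptual depth**: depth of the graph describing how concepts relate to each other (e.g., from edges to faces to identities)

> Example: a system looking at a face with one eye in shadow may first detect only one eye, then infer from the presence of a face that a second eye is probably there. The concept graph has only two layers (eyes and faces), yet the computation is deeper because each estimate is refined given the other, possibly over several passes.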

---

## No Universal Depth Definition

Different choices in "primitive operations" lead to different flowchart depths:
- Just like in programming, the same task may take more or fewer steps depending on the language

There is no **single correct depth** for a model—what matters is the **degree of composition** and **hierarchical abstraction** it learns.

---

## Summary: What Is Deep Learning?

- A **subset of machine learning**, which itself is a subset of **AI**
- Learns representations by **composing simpler functions into deeper hierarchies**
- Facilitates learning from **raw data** with **minimal manual feature engineering**
- Often measured by the **depth of composition**, not just the number of parameters

---

## Figure 1.5: How the Parts of an AI System Relate Across Disciplines



![image-4.png](attachment:image-4.png)

Fig.5: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines. Shaded boxes indicate components that are able to learn from data.

# 1.1 Applications and Organization of This Book

Deep learning has proven to be impactful across a variety of software disciplines:

- Computer vision
- Speech and audio processing
- Natural language processing (NLP)
- Robotics
- Bioinformatics and chemistry
- Video games
- Search engines
- Online advertising
- Finance

---

## Organization of the Book

The book is divided into **three parts**:

1. **Part I**: Introduces mathematical tools and machine learning foundations
2. **Part II**: Covers well-established and widely-used deep learning algorithms
3. **Part III**: Discusses speculative and research-focused topics for future advancements

Readers are encouraged to **skip sections** based on their background and interest:

- Those with a strong grasp of **linear algebra, probability, and machine learning basics** may skip Part I.
- Readers focused on **practical implementation** can stop at the end of Part II.

> A flowchart (Figure 1.6) is provided to help navigate the book's structure.

The book assumes a **computer science background**, including familiarity with:

- Programming
- Computational performance and complexity theory
- Introductory calculus
- Basic graph theory terminology

---

# 1.2 Historical Trends in Deep Learning

Understanding deep learning is aided by considering its evolution:

### 🔁 Terminology and Popularity

- Deep learning has existed for decades under various names, often reflecting **different philosophical perspectives**.
- Its popularity has **waxed and waned** over time.

### 📊 Data-Driven Effectiveness

- Its utility **increases with more training data**.
- Larger datasets empower models to learn more expressive representations.

### 💽 Infrastructure Evolution

- As **hardware and software infrastructures** improve, so do deep learning models.
- Models have grown dramatically in size and complexity with access to powerful computing resources.

### 🧠 Expanding Capabilities

- Modern deep learning systems solve tasks that were previously considered infeasible.
- These include accurate performance on complex real-world tasks such as language understanding, image recognition, and game playing.

---

Deep learning today draws on **mathematical theory**, **computational power**, and **massive data availability**; together these let models reach useful accuracy on tasks such as image recognition and language understanding that earlier systems could not handle.

![image-5.png](attachment:image-5.png)

Fig.6: The high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.

## The Many Names and Changing Fortunes of Neural Networks

While many consider **deep learning** a recent breakthrough, its roots trace back to the **1940s**, and the field has evolved through several phases and renamings, each reflecting shifts in philosophy and dominant perspectives.

---

## Historical Phases of Deep Learning

- **1940s–1960s**: Known as **cybernetics**  
- **1980s–1990s**: Known as **connectionism**  
- **2006 onward**: Resurgence under the name **deep learning**

This historical evolution is depicted in *Figure 1.7*.

---

## Biological Motivation and Neural Analogies

Many early learning algorithms were designed as **computational models of biological learning**, particularly inspired by the human brain.

- Term: **Artificial Neural Networks (ANNs)**
- Inspiration: Simulating how neurons might collectively represent and compute

Even though modern networks are often not biologically accurate, the analogy remains influential.

---

## Two Motivations Behind the Neural Perspective

1. **Engineering Inspiration**:
   - The brain demonstrates that intelligent behavior is possible.
   - Reverse engineering its computational principles might guide AI system design.

2. **Scientific Curiosity**:
   - Machine learning can offer insight into how intelligence arises.
   - Such models are valuable even when they don't lead directly to practical applications.

---

## Beyond Neuroscience: Toward General Representation Learning

Today, **“deep learning”** transcends its neural roots:

- It refers to systems that learn **multiple levels of representation and composition**.
- Applicable in frameworks not explicitly modeled after the brain.

Thus, while the origins were biologically motivated, modern deep learning emphasizes abstract, scalable **function composition** more than biological realism.

---

*Deep learning* has become a unifying term that captures the power of hierarchical representation, regardless of whether the architecture mimics neural biology or not.

## The Many Names and Changing Fortunes of Neural Networks

Although often seen as an emerging technology, **deep learning** has a history dating back to the **1940s**. Its limited popularity during certain decades and recurring rebranding have obscured this long lineage.

---

## 📈 Figure 1.7: Frequency of Historical Terms

This figure visualizes two waves of neural network research using term frequency from Google Books:

- **Cybernetics**: First wave (1940s–1960s)
- **Connectionism + Neural Networks**: Second wave (1980s–1995)

The third wave—modern **deep learning**—began around **2006**, but is too recent to show up prominently in this corpus.

---

## 🧠 The Three Historical Waves

### 1. Cybernetics (1940s–1960s)
- Focus: Modeling biological learning
- Key models:
  - **McCulloch-Pitts Neuron** (1943): Linear binary classifier using hard thresholds
  - **Perceptron** (Rosenblatt, 1958): Learnable weights for binary classification
  - **ADALINE** (Widrow and Hoff, 1960): Predicts real values using $f(x) = \sum x_i w_i$

These models used hand-set or learned weights to compute:

$$
f(x, w) = \sum_{i=1}^{n} x_i w_i
$$

The **perceptron** was the first algorithm to learn these weights from labeled examples. ADALINE introduced learning for regression.

> 🚧 Limitations of linear models: They cannot represent non-linear functions such as XOR.
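
The XOR limitation can be seen with a minimal perceptron sketch. The bias term, the zero initialization, the learning rate, and the helper names are assumptions made for this illustration rather than details from the text; the update is the standard perceptron rule applied to a hard-threshold linear unit.

```python
def predict(weights, bias, x):
    """Hard-threshold linear unit: 1 if the weighted sum plus bias is positive."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

def train_perceptron(data, epochs=50, lr=1.0):
    """Perceptron rule: adjust weights only on misclassified examples."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

def count_correct(data):
    """Train on the data, then count correctly classified examples."""
    weights, bias = train_perceptron(data)
    return sum(predict(weights, bias, x) == t for x, t in data)

AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

print("AND correct:", count_correct(AND), "of 4")  # linearly separable
print("XOR correct:", count_correct(XOR), "of 4")  # never reaches 4
```

AND is linearly separable, so the rule settles on weights that classify all four cases. No single line separates the XOR classes, so at least one case is always wrong regardless of how long training runs.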

### 2. Connectionism (1980s–1990s)
- Revival with **multi-layer neural networks**
- Popularization of **backpropagation** (Rumelhart et al., 1986)
- Backpropagation was used to train networks with one or two hidden layers

### 3. Deep Learning (2006–present)
- Sparked by work from **Hinton et al. (2006)** and others
- Emphasizes deep compositions of functions beyond biological realism

---

## 🎯 Lasting Contributions from Early Models

- The rule used to adapt ADALINE's weights was a special case of **stochastic gradient descent**; slightly modified versions remain the dominant training algorithms for deep learning.
- **Linear models** like the perceptron and ADALINE are still widely used, though often trained differently from the originals.
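
As a rough sketch of per-example gradient descent on the linear model $f(x, w) = \sum_i x_i w_i$, the snippet below fits the weights on a tiny, hypothetical, noise-free dataset. The data, learning rate, and epoch count are assumptions chosen for illustration, and examples are visited in a fixed order to keep the run reproducible.

```python
def adaline_sgd(data, lr=0.1, epochs=200):
    """Fit f(x, w) = sum_i x_i * w_i with one gradient step per example."""
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, target in data:
            prediction = sum(w * xi for w, xi in zip(weights, x))
            error = prediction - target
            # Gradient of 0.5 * error**2 with respect to w_i is error * x_i
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights

# Hypothetical targets generated by y = 2*x1 - 3*x2
data = [([1, 0], 2), ([0, 1], -3), ([1, 1], -1), ([1, -1], 5)]
weights = adaline_sgd(data)
print([round(w, 4) for w in weights])  # approximately [2.0, -3.0]
```

Each update uses a single example rather than the whole dataset, which is the sense in which the procedure is stochastic.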

---

## 🔄 Shift Away from Neuroscience

- Early AI was inspired by biology, but today’s deep learning is mostly driven by **computational pragmatism**.
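- Neuroscience plays a smaller role today mainly because too little is known about the brain to use it as a guide: understanding the algorithms it uses would require monitoring the activity of at least thousands of interconnected neurons simultaneously.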

---

Today, neuroscience provides **conceptual inspiration** rather than engineering blueprints. Deep learning continues to evolve as a field rooted in math, optimization, and data—not solely in biology.

# 1.2.1 Neural Networks: From Biological Inspiration to Distributed Representation

## Neural Architecture Inspired by Biology

While our understanding of biological learning is incomplete (Olshausen & Field, 2005), neuroscience has helped inspire key hypotheses and architectures in deep learning:

- **Single algorithm hypothesis**: Ferrets whose brains were rewired to send visual signals to the auditory processing region learned to "see" with that region (Von Melchner et al., 2000), suggesting that much of the mammalian brain may apply a **shared computational algorithm** to diverse tasks.
- Before this hypothesis, research was more fragmented: vision, speech, NLP, and robotics developed largely in isolation. Today it is common for a deep learning research group to work on several of these areas at once.

---

## Architectures Inspired by Neuroscience

- 🧠 **Neocognitron** (Fukushima, 1980): Inspired by the **mammalian visual cortex**, it laid the foundation for convolutional neural networks (CNNs).
- 💡 **Rectified Linear Units (ReLU)**:
  - Simplified activation functions used in most modern networks.
  - Influenced by brain modeling (e.g., Nair & Hinton, 2010; Glorot et al., 2011).
  - Engineering-focused studies (e.g., Jarrett et al., 2009) also contributed.

> Real neurons compute very different functions from ReLUs, but greater neural realism has not yet improved machine learning performance. Neuroscience is an inspiration, not a strict blueprint.

---

## Deep Learning ≠ Brain Simulation

- Media often highlight brain-like features in deep learning.
- Truth: deep learning owes as much to **linear algebra, information theory, and optimization** as it does to neuroscience.

There is, however, a parallel field: 🧬 **Computational Neuroscience**, which seeks to model the brain itself—not just engineer intelligent systems.

---

## Second Wave: Connectionism and Distributed Cognition

In the 1980s, **connectionism** emerged (Rumelhart et al., 1986; McClelland et al., 1995):

- 👥 Modeled **distributed networks of simple processing units** (neurons or artificial nodes)
- Sought **biologically plausible** alternatives to symbolic AI in cognitive science
- Inspired by early work (Hebb, 1949) and revived neural-style learning

---

## Distributed Representation: A Foundational Idea

Rather than one neuron per concept (e.g., "red bird"), use **shared feature-based encoding**:

- Example:
  - Objects: {car, truck, bird}
  - Colors: {red, green, blue}
  - Naive representation: 3 × 3 = 9 distinct neurons
  - Distributed representation:
    - 3 neurons for color
    - 3 neurons for object
    - Total = 6 neurons

This supports **combinatorial generalization** and **parameter sharing**, reducing redundancy and improving scalability.

> Each feature (e.g., redness) participates in many combinations, enabling more efficient learning and generalization.
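
The counting argument above can be made concrete with a short sketch. The function names and the 0/1 encoding are assumptions made for illustration; the colors and objects are the ones from the example.

```python
from itertools import product

COLORS = ["red", "green", "blue"]
OBJECTS = ["car", "truck", "bird"]

def one_unit_per_combination(color, obj):
    """9 units: exactly one fires for each (color, object) pair."""
    return [int(pair == (color, obj)) for pair in product(COLORS, OBJECTS)]

def distributed(color, obj):
    """6 units: 3 describe the color, 3 describe the object."""
    return [int(c == color) for c in COLORS] + [int(o == obj) for o in OBJECTS]

def shared_units(a, b):
    """Number of units that are active in both codes."""
    return sum(u * v for u, v in zip(a, b))

print(len(one_unit_per_combination("red", "bird")))  # 9
print(len(distributed("red", "bird")))               # 6

# "red bird" and "red car" overlap only in the distributed code
print(shared_units(one_unit_per_combination("red", "bird"),
                   one_unit_per_combination("red", "car")))                 # 0
print(shared_units(distributed("red", "bird"), distributed("red", "car")))  # 1
```

In the distributed code, anything learned about the redness unit from red cars also applies to red birds; with one unit per combination, the two concepts have nothing in common.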

---

Modern deep learning, while influenced by brain science, is a computational discipline focused on **building intelligent systems**, not simulating biological ones.

# 1.2.1 The Evolution of Neural Network Research: Distributed Representations and the Rise of Deep Learning

## 🔶 Distributed Representation

A key idea in modern deep learning is the **distributed representation**:

- Instead of having one neuron per concept (e.g., a neuron for "red bird"),
- The network uses multiple neurons to represent different **attributes**, such as color and object identity.

### Example:

| Feature Type | Features        |
|--------------|-----------------|
| Color        | Red, Green, Blue |
| Object       | Car, Truck, Bird |

This enables:

- Shared features across categories
- Generalization from limited data
- Reduced model complexity (6 neurons instead of 9 combinations)

> Each concept is encoded by a **combination** of active units.

---

## 🔁 Back-Propagation and Deep Models

One of the major accomplishments of the **connectionist movement** was the successful use and popularization of the **back-propagation algorithm**:

- Popularized by Rumelhart et al. (1986) and LeCun (1987)
- Enabled training of deep neural networks with internal representations
- Has waxed and waned in popularity, but remains the dominant approach to training deep models

---

## ⏳ Sequence Modeling and LSTMs

In the 1990s:

- Researchers tackled the challenge of **modeling long sequences**
- Hochreiter (1991) and Bengio et al. (1994) identified **vanishing gradient problems**
- Hochreiter & Schmidhuber (1997) introduced the **Long Short-Term Memory (LSTM)**

> LSTMs allow deep networks to capture **long-term dependencies** and are widely used in NLP and time-series tasks.
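
A toy calculation gives some intuition for the vanishing gradient problem. The scalar recurrence $h_t = \sigma(w\,h_{t-1})$, the weight `w`, and the starting value `h0` are assumptions chosen for illustration; this is not the analysis from the cited papers, only a sketch of how a product of small per-step derivatives shrinks with sequence length.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_through_time(steps, w=1.0, h0=0.5):
    """d h_T / d h_0 for the scalar recurrence h_t = sigmoid(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = sigmoid(w * h)
        grad *= w * h * (1.0 - h)  # chain rule: one factor per time step
    return grad

for steps in (1, 10, 50):
    print(steps, gradient_through_time(steps))
```

Because the sigmoid derivative is at most 0.25, each step with `w = 1` multiplies the gradient by a factor no larger than 0.25, so the influence of the initial state on the final state decays exponentially with the number of steps.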

---

## 📉 Decline in Popularity (Late 1990s to 2006)

Despite strong results:

- **Investor disillusionment** due to overhyped AI ventures
- Rise of competing methods:
  - **Kernel machines** (SVMs)
  - **Probabilistic graphical models**
- Resulted in decreased interest in neural networks

---

## 🧠 CIFAR and Neural Research Support

The **Canadian Institute for Advanced Research (CIFAR)** maintained interest in deep learning through its **Neural Computation and Adaptive Perception (NCAP)** program, which included:

- Geoffrey Hinton (Toronto)
- Yoshua Bengio (Montreal)
- Yann LeCun (NYU)

> This collaborative effort kept neural research active during its "quiet years."

---

## 🚀 The Third Wave: Deep Learning (2006–present)

### Breakthrough:

- Hinton et al. (2006) introduced **greedy layer-wise pretraining** for **deep belief networks**
- Later extended to other deep architectures (Bengio et al., 2007; Ranzato et al., 2007)

### Key Results:

- Enabled **efficient training** of deep models
- Highlighted the **theoretical importance of depth**
- Outperformed earlier AI systems on many benchmarks

> Deep learning became the umbrella term to describe successful training and usage of **deep architectures**

---

## 📈 Shifting Trends Within the Third Wave

| Period                            | Focus                                                     |
|-----------------------------------|-----------------------------------------------------------|
| Start of the third wave (c. 2006) | Unsupervised learning, generalization from small datasets |
| Today                             | Supervised learning, leveraging large labeled datasets    |

Deep learning has evolved to absorb both classic and modern training paradigms, scaling impressively with data and compute.

---

The journey of neural networks—across three historical waves—reflects the dynamic interaction between theoretical insights, algorithmic breakthroughs, and advances in computational power.


In [3]:
import random
import math

# Activation functions and derivatives
def relu(x): return max(0, x)
def drelu(x): return 1 if x > 0 else 0

def sigmoid(x): return 1 / (1 + math.exp(-x))
def dsigmoid(x):  # derivative of sigmoid with respect to its input x
    s = sigmoid(x)
    return s * (1 - s)

# Initialize weights for each layer
def init_weights(layer_sizes):
    weights = []
    for i in range(len(layer_sizes) - 1):
        layer = []
        for _ in range(layer_sizes[i + 1]):
            neuron = [random.uniform(-1, 1) for _ in range(layer_sizes[i] + 1)]  # +1 for bias
            layer.append(neuron)
        weights.append(layer)
    return weights

# Forward pass
def forward_pass(x, weights):
    activations = [x[:]]
    for layer in weights:
        x_new = []
        for neuron in layer:
            z = sum(wi * xi for wi, xi in zip(neuron[:-1], x)) + neuron[-1]  # bias
            a = relu(z)
            x_new.append(a)
        x = x_new
        activations.append(x)
    return activations

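# Backpropagation (with ReLU)
def backprop(x, y, weights, lr=0.01):
    activations = forward_pass(x, weights)
    # Output-layer error; drelu(a) equals drelu(z) because a = relu(z).
    # Note: the output unit is also a ReLU, so delta is 0 whenever the
    # output is clipped to 0 and that example then produces no update.
    delta = [(a - y) * drelu(a) for a in activations[-1]]

    for l in range(len(weights) - 1, -1, -1):
        inputs = activations[l]

        # Propagate the error to the previous layer BEFORE updating this
        # layer's weights, so the gradient uses the pre-update weights
        new_delta = []
        if l != 0:
            for i in range(len(inputs)):
                s = sum(weights[l][j][i] * delta[j] for j in range(len(weights[l])))
                new_delta.append(s * drelu(inputs[i]))

        # Gradient step for this layer's weights and biases
        for j in range(len(weights[l])):
            for i in range(len(inputs)):
                weights[l][j][i] -= lr * delta[j] * inputs[i]
            weights[l][j][-1] -= lr * delta[j]  # bias update

        if l != 0:
            delta = new_delta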

# Train on a toy dataset: XOR
data = [
    ([0, 0], 0),
    ([1, 0], 1),
    ([0, 1], 1),
    ([1, 1], 0),
]

weights = init_weights([2, 4, 1])  # 2-input, 4-hidden, 1-output

for epoch in range(10000):
    x, y = random.choice(data)
    backprop(x, y, weights, lr=0.1)

# Evaluate
for x, y in data:
    output = forward_pass(x, weights)[-1][0]
    print(f"Input: {x}, Target: {y}, Predicted: {round(output, 3)}")


Input: [0, 0], Target: 0, Predicted: 0.0
Input: [1, 0], Target: 1, Predicted: 1.0
Input: [0, 1], Target: 1, Predicted: 0
Input: [1, 1], Target: 0, Predicted: 0
