

## What is Machine Learning?
- **Definition**: A subfield of AI focused on building mathematical models that learn patterns from data and use them to make predictions.
- **Learning Process**: Models adjust parameters based on data to make predictions.
- **Application**: Used to predict outcomes for new, unseen data.

## Types of Machine Learning

### Supervised Learning
- **Data**: Labeled data.
- **Tasks**: 
  - **Classification**: Predict discrete categories (e.g., spam or not spam).
  - **Regression**: Predict continuous values (e.g., price prediction).

### Unsupervised Learning
- **Data**: Unlabeled data.
- **Tasks**:
  - **Clustering**: Group similar data points (e.g., customer segmentation).
  - **Dimensionality Reduction**: Simplify data while preserving important information (e.g., PCA).

### Semi-Supervised Learning
- **Data**: Partially labeled data.
- **Tasks**: Combines aspects of supervised and unsupervised learning, useful when only some data is labeled.

## Supervised vs Unsupervised Learning

| Feature            | Supervised Learning          | Unsupervised Learning         |
|--------------------|------------------------------|-------------------------------|
| **Labels**         | Requires labeled data         | Works with unlabeled data      |
| **Common Tasks**   | Classification, Regression    | Clustering, Dimensionality Reduction |
| **Goal**           | Predict labels for new data   | Find hidden patterns in data   |


## Learning Process Flow

1. **Data Input**: Labeled or unlabeled data.
2. **Model Selection**: Choose a supervised or unsupervised learning method that matches the data and the task.
3. **Training**: Adjust model parameters based on data.
4. **Prediction**: Use the model to predict or analyze new, unseen data.
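
The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy data, the choice of a nearest-centroid classifier, and the function names `fit` and `predict` are hypothetical and do not come from any particular library.

```python
from math import dist

# 1. Data input: labeled 2-D points (hypothetical toy data).
X_train = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
y_train = ["a", "a", "b", "b"]


# 2. Model selection: a nearest-centroid classifier (a supervised method).
def fit(points, labels):
    """3. Training: the learned parameters are one mean point per class."""
    sums = {}
    for (x, y), label in zip(points, labels):
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """4. Prediction: return the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


centroids = fit(X_train, y_train)
new_label = predict(centroids, (5.5, 4.0))  # a new, unseen point
```

Here the "parameters" adjusted during training are simply the class centroids; more capable models follow the same fit-then-predict pattern with more parameters.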

---


<img src="/Users/tanukhanuja/data_science_essential_packages/sklearn/Screenshot 2024-09-15 at 9.17.09 AM.png" alt="Alt Text" width="500">


# Qualitative Examples of Machine Learning Applications

## 1. Classification: Predicting Discrete Labels

### Overview
- **Task**: Classify new, unlabeled points based on labeled points.
- **Data**: Two-dimensional data (x, y positions).
- **Labels**: Discrete categories like "blue" or "red".

### Model
- **Assumption**: A straight line can separate the two groups.
- **Parameters**: Numbers describing the location and orientation of the line.
- **Training**: Adjusts parameters to fit the line to the data.
- **Prediction**: Assign a label to each new data point according to which side of the fitted line it falls on.
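
As a rough sketch of how a separating line can be learned, the block below uses the classic perceptron update rule. The perceptron is a swapped-in technique chosen for brevity, not necessarily the model these notes have in mind; the toy points, the `-1`/`+1` encoding of "blue"/"red", and the function names are assumptions.

```python
def train_perceptron(points, labels, max_epochs=100):
    """Learn weights w and bias b so that the sign of w.x + b matches labels (+1/-1)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            # A non-positive product means the point is on the wrong side
            # of the line (or exactly on it), so nudge the line toward it.
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return w, b


def predict(w, b, point):
    """Label a point by the side of the line w.x + b = 0 it falls on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "red" if score > 0 else "blue"


points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]  # -1 = "blue", +1 = "red"
w, b = train_perceptron(points, labels)
```

The two weights and the bias are the "numbers describing the location and orientation of the line". The perceptron only settles on a line when the two groups are linearly separable, as they are in this toy data.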

### Real-World Example: Automated Spam Detection
- **Features**: Counts of important words/phrases (e.g., "Viagra," "Nigerian prince").
- **Labels**: "Spam" or "Not Spam".
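- **Training**: A small sample of emails is labeled by hand and used to train the model, which then predicts labels for the remaining emails.
- **Effectiveness**: Can be very effective when the model has enough well-constructed features, typically thousands or millions of word or phrase counts.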
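
To make the "counts of important words" idea concrete, here is a minimal sketch of turning an email's text into a feature vector. The three-word `VOCABULARY` and the function name `count_features` are hypothetical; a real spam filter would use a far larger vocabulary and would also count multi-word phrases, which this single-word tokenizer does not handle.

```python
import re
from collections import Counter

# Hypothetical, tiny vocabulary of words to count.
VOCABULARY = ["free", "prize", "meeting"]


def count_features(text, vocabulary=VOCABULARY):
    """Return how often each vocabulary word appears in the text (case-insensitive)."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    return [words[term] for term in vocabulary]


features = count_features("FREE prize! Claim your free prize now.")
```

Each email becomes one such vector, and the vectors for the hand-labeled sample are what a classifier is trained on.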

### Important Classification Algorithms

| Algorithm                   | Description                    | 
|-----------------------------|--------------------------------|
| **Gaussian Naive Bayes**    | Probabilistic classification    | 
| **Support Vector Machines** | Classification using hyperplanes | 
| **Random Forest Classification** | Ensemble method with decision trees |

---

## 2. Regression: Predicting Continuous Labels

### Overview
- **Regression**: Predicts continuous values, as opposed to the discrete categories predicted in classification.

### Simple Linear Regression
- **Concept**: Treat the continuous label as an extra dimension and fit a line (one feature) or a plane (two features) to the data.
- **Purpose**: Predict continuous labels for new data points.
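
For the one-feature case, the fitted line has a closed-form least-squares solution. The sketch below assumes made-up data lying exactly on `y = 2x + 1`; the function name `fit_line` is illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy labels generated from y = 2x + 1
slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # continuous label for a new point x = 4
```

With two features the same idea fits a plane instead of a line, which is usually done with a linear algebra routine rather than by hand.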

### Key Algorithms

| Algorithm                         | Description                                            |
|-----------------------------------|--------------------------------------------------------|
| **Linear Regression**             | Fits a line or plane to the data                       |
| **Support Vector Machines (SVM)** | Can be adapted to regression (support vector regression) |
| **Random Forest Regression**      | Ensemble method that averages many decision trees      |

---
## 3. Clustering: Inferring Labels on Unlabeled Data

- **Clustering**: An unsupervised learning technique to group data into discrete clusters without using known labels.

### Key Concepts
- **Unsupervised Learning**: Describes data without predefined labels.
- **Clustering**: Automatically assigns data to distinct groups based on intrinsic data structure.

### Algorithms
- **k-Means Clustering**:
  - **Method**: Fits a model with k cluster centers.
  - **Goal**: Choose the centers so that the total squared distance from each point to its assigned cluster center is as small as possible.
- **Gaussian Mixture Models**: Probabilistic model for clustering based on data distribution.
- **Spectral Clustering**: Uses graph theory to identify clusters.
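
A minimal sketch of k-means, using the standard alternation between assigning points to their nearest center and moving each center to the mean of its points. The toy data and the explicit starting centers are assumptions made to keep the example deterministic; library implementations typically choose the starting centers randomly or with a smarter seeding scheme.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old_center)  # keep the center of an empty cluster
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, labels


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(points, centers=[(1, 1), (8, 8)])
```

No labels are supplied anywhere: the cluster indices in `labels` are inferred purely from how the points are arranged.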

### Application
- **Use Case**: Extract useful patterns from complex datasets.

---

## 4. Dimensionality Reduction: Inferring Structure of Unlabeled Data

- **Dimensionality Reduction**: An unsupervised algorithm that infers labels or information from the dataset's structure.
- **Objective**: Extract a low-dimensional representation that retains relevant qualities of the full dataset.

### Key Concepts
- **Abstract Nature**: Seeks to simplify data while preserving its essential features.
- **Nonlinear Structure**: Suitable models detect and represent complex, nonlinear structures.

### Example
- **Data Structure**: Data drawn from a one-dimensional line arranged in a spiral within a two-dimensional space.
- **Dimensionality Reduction**: A suitable model is sensitive to this nonlinear embedded structure and recovers the underlying one-dimensional representation (position along the spiral).

### Important Algorithms
- **Principal Component Analysis (PCA)**: A linear method that projects the data onto the directions of greatest variance (the principal components).
- **Manifold Learning Algorithms**: Includes Isomap and Locally Linear Embedding (LLE), which handle complex data structures.
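
As an illustration of the simplest case, the sketch below reduces two-dimensional points to one dimension with PCA, using the closed-form angle of the leading eigenvector of a 2x2 covariance matrix. The toy data (points on the line `y = 2x`) and the function name `pca_2d_to_1d` are assumptions. Because PCA is linear, it would not "unroll" the spiral described in the example above; that is the kind of structure manifold learning methods such as Isomap or LLE are designed for.

```python
from math import atan2, cos, sin


def pca_2d_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the symmetric covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n

    # Angle of the direction of greatest variance (leading eigenvector).
    theta = 0.5 * atan2(2 * b, a - c)
    component = (cos(theta), sin(theta))

    # One coordinate per point: its position along the principal direction.
    projected = [x * component[0] + y * component[1] for x, y in centered]
    return component, projected


points = [(-2, -4), (-1, -2), (0, 0), (1, 2), (2, 4)]
component, projected = pca_2d_to_1d(points)
```

Each point is now described by a single number, and because the toy data lies exactly on a line, no information is lost in the reduction.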

