# Supervised Learning

## Introduction to Supervised Learning

Supervised learning is a type of machine learning where we teach a model to map inputs to known outputs. The algorithm learns from examples (input-output pairs) and then predicts outcomes for new, unseen data.

In practice, supervised learning uses **labeled data**: each example comes with a correct answer. The model adjusts its parameters as it sees more data, learning the relationship between features and labels. This explicit guidance helps the model make accurate predictions on new inputs.

Supervised learning is widely used in real-world problems, such as:  
- Email spam detection  
- Stock price prediction  
- Customer churn prediction  

It is the natural choice when labeled examples are available and the goal is to predict a well-defined target for new, unseen data.

Main points:  
- Predict a target variable (y) from input features (X).  
- Targets can be **continuous** (regression) or **categorical** (classification).  
- Difference from other learning types:  
  - **Unsupervised:** find patterns without labels  
  - **Reinforcement:** learn by interacting with an environment


---

## Fundamental Concepts

Before working with models, some basics to keep in mind:  

- **Dataset:** collection of samples with features (X) and targets (y)  
- **Train vs Test:** train to learn, test to evaluate generalization  
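- **Overfitting / Underfitting:** an overfit model is too complex and memorizes noise in the training data; an underfit model is too simple to capture the underlying pattern  
- **Bias-Variance tradeoff:** simpler models tend to have high bias and low variance, while more flexible models have low bias and high variance; the goal is the balance that minimizes error on unseen data  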
- **Model evaluation:** use metrics like MSE, R², accuracy, or F1-score; cross-validation helps estimate performance reliably
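
As a rough illustration of the train/test idea above, the sketch below shuffles a toy dataset and holds out a fraction of it for evaluation. The helper name `split_train_test`, the 25% test fraction, and the data values are assumptions made for this example; they are not taken from any library.

```python
import random

def split_train_test(X, y, test_fraction=0.25, seed=0):
    """Shuffle sample indices, then hold out a fraction of them for testing."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = max(1, round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy dataset: 8 samples with one feature each (values are made up).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9]

X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), "training samples,", len(X_test), "test samples")
```

The model would be fitted on `X_train` and `y_train` only; `X_test` and `y_test` stay untouched until the final evaluation.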


## Data Preprocessing

Data preprocessing is a critical step in any machine learning workflow. Raw data is often messy, inconsistent, or incomplete, and models trained directly on such data can perform poorly or produce biased results. Preprocessing ensures that the dataset is clean, structured, and in a form suitable for the algorithms we want to use.  

The first step is **data cleaning**, which involves identifying and correcting errors, inconsistencies, or duplicate entries. Missing values are common in real-world datasets, and handling them appropriately is crucial. Depending on the context, missing data can be removed, imputed using statistics like the mean or median, or predicted using more advanced methods such as k-nearest neighbors or regression models.  
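
A minimal sketch of the simplest imputation strategies mentioned above, assuming missing entries are represented as `None`. The function name `impute_missing` and the example column are hypothetical.

```python
from statistics import mean, median

def impute_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]

# Hypothetical "age" column where None marks a missing entry.
ages = [22, None, 35, 41, None, 30]
ages_mean = impute_missing(ages, "mean")
ages_median = impute_missing(ages, "median")
print(ages_mean)
print(ages_median)
```

In a real workflow the fill value would normally be computed on the training split only and then reused for the test split, so that no information from the test data leaks into preprocessing.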

Another important consideration is **scaling and normalization**. Many algorithms, especially those based on distances (like KNN or SVM) or gradient-based optimization, are sensitive to the scale of input features. Standardization transforms each feature to have zero mean and unit variance, while min-max normalization rescales each feature to a fixed range such as [0, 1]. Choosing the right scaling method can significantly affect model convergence and performance.  
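
The two scaling methods described above can be sketched as follows. The helper names `standardize` and `min_max_scale` and the income values are assumptions for illustration, and the population standard deviation is used here for simplicity.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sigma for v in values]

def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot rescale a constant feature")
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]

# Made-up feature column on a large scale.
incomes = [30_000, 45_000, 60_000, 75_000, 90_000]
z_scores = standardize(incomes)
scaled = min_max_scale(incomes)
print(z_scores)
print(scaled)
```

As with imputation, the mean, standard deviation, minimum, and maximum would usually be estimated from the training data and then applied unchanged to new data.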

**Categorical variables** require special treatment because most machine learning algorithms only process numerical data. Common approaches include **label encoding**, which assigns an integer to each category, and **one-hot encoding**, which creates one binary column per category. Label encoding is compact but implies an ordering between categories that may not exist, whereas one-hot encoding avoids this at the cost of more columns. More sophisticated encodings, such as target encoding or frequency encoding, can capture additional information without introducing spurious orderings.  
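
To make the difference between the two basic encodings concrete, here is a small sketch. The function names and the `colors` column are hypothetical, and categories are sorted alphabetically only so that the output is deterministic.

```python
def label_encode(values):
    """Map each category to an integer index (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {category: index for index, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Create one 0/1 column per category (columns in alphabetical order)."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories

colors = ["red", "green", "blue", "green"]

labels, mapping = label_encode(colors)
one_hot, columns = one_hot_encode(colors)
print(labels, mapping)
print(columns, one_hot)
```

Note that the label-encoded integers suggest `blue < green < red`, an ordering with no real meaning; the one-hot rows carry no such ordering.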

Finally, **feature engineering and selection** are essential for improving model accuracy and interpretability. Feature engineering involves creating new features or transforming existing ones to better capture underlying patterns. Feature selection helps reduce dimensionality, remove irrelevant or redundant features, and mitigate overfitting. Techniques range from simple correlation analysis to more advanced methods like recursive feature elimination or model-based importance scores.  
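
The simplest of the selection techniques above, correlation analysis, could look roughly like this. The sketch ranks features by the absolute Pearson correlation with the target; the helper names and the toy housing data are assumptions, and correlation only detects linear relationships.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def rank_features(features, target):
    """Sort feature names by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up data: price depends on size; the second feature is unrelated noise.
features = {
    "size_m2": [50, 60, 80, 100, 120],
    "noise": [3, 1, 4, 1, 5],
}
price = [150, 180, 240, 300, 360]

ranking = rank_features(features, price)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```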

Overall, careful and thoughtful data preprocessing forms the foundation of effective machine learning models. Without it, even the most sophisticated algorithms may fail to deliver reliable results.


## Regression
### Linear Regression

Linear regression is one of the most fundamental and widely used techniques in supervised learning. It models the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a linear equation to the observed data. The main idea is to predict the target as a weighted sum of the input features plus an intercept term.

Mathematically, the model can be expressed as:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon
$$

Here, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_n$ are the coefficients representing the influence of each feature on the target, and $\epsilon$ is the error term capturing the part of the target that the linear combination of features does not explain. Each coefficient can be interpreted as the expected change in the target variable for a one-unit change in the corresponding feature, assuming all other features remain constant.
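
A small sketch of the prediction rule and of the coefficient interpretation just described. The function name `predict` and the coefficient values are made up for illustration.

```python
def predict(features, coefficients, intercept):
    """Return intercept + sum of coefficient * feature (the linear model without noise)."""
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical fitted model with two features.
intercept = 50.0             # beta_0
coefficients = [3.0, -2.0]   # beta_1, beta_2

base = predict([10.0, 5.0], coefficients, intercept)
bumped = predict([11.0, 5.0], coefficients, intercept)  # x_1 increased by one unit

print(base)           # 50 + 3*10 - 2*5 = 70.0
print(bumped - base)  # equals beta_1 = 3.0
```

Increasing $x_1$ by one unit while holding $x_2$ fixed changes the prediction by exactly $\beta_1$.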

Parameter estimation is usually performed using **Ordinary Least Squares (OLS)**, which finds the coefficients that minimize the sum of squared differences between the predicted and actual target values:

$$
\text{Minimize } \sum_{i=1}^m (y_i - \hat{y}_i)^2
$$

This ensures the best possible linear fit to the training data in terms of squared error.
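
For the special case of a single feature, the OLS solution has a well-known closed form: the slope is $\beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and the intercept is $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below implements only this one-feature case; the function name `fit_simple_ols` and the data are assumptions for illustration.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one feature)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Made-up observations that roughly follow y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_ols(x, y)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With several features the same minimization is usually solved with linear algebra routines rather than by hand.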

Linear regression relies on several key **assumptions** that ensure the model produces valid, interpretable, and reliable results. Understanding these assumptions is crucial because if they are violated, the predictions, inference, or coefficient interpretations can be misleading.

- **Linearity**: This assumption states that the relationship between each feature and the target variable is linear. In other words, the expected change in the target is proportional to a change in the feature. If the true relationship is non-linear, linear regression may underfit the data, leading to biased predictions. Visualizing scatter plots of each feature against the target can help detect non-linearity. 
- **Independence**: Observations should be independent of each other. This means that the value of one observation does not influence another. Violations occur in time series data or grouped data where correlations exist between samples. Ignoring dependence can result in underestimated standard errors and misleading significance tests. 
- **Homoscedasticity**: The variance of residuals (errors) should be constant across all levels of the features. If the spread of the residuals systematically increases or decreases with the predicted values (heteroscedasticity), the coefficient estimates become less efficient and the usual standard errors are unreliable, so confidence intervals and hypothesis tests can be inaccurate. Plotting residuals versus predicted values is a common way to check for this.
- **Normality of residuals**: The residuals should follow a normal distribution. This assumption matters mainly for inference, such as calculating confidence intervals or p-values for coefficients. Predictions can still be reasonably accurate when residuals are not normal, but hypothesis tests may be unreliable, particularly with small samples. Histograms or Q-Q plots of residuals are often used to check normality.
- **No multicollinearity**: Features should not be highly correlated with each other. When multicollinearity exists, it becomes difficult to isolate the individual effect of each feature on the target, coefficients can become unstable, and small changes in the data can lead to large changes in estimates. Correlation matrices or variance inflation factors (VIF) are commonly used to detect multicollinearity.
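
As one example of checking these assumptions, the sketch below computes a variance inflation factor for the simplest case of exactly two features, where the VIF of either feature reduces to $1 / (1 - r^2)$ with $r$ the Pearson correlation between them. The helper names and data are hypothetical; with more features, each VIF instead requires regressing one feature on all the others.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def vif_two_features(a, b):
    """VIF shared by both features when the model has exactly two features."""
    unexplained = 1.0 - pearson(a, b) ** 2
    if unexplained <= 0:
        return float("inf")  # perfectly collinear features
    return 1.0 / unexplained

# Made-up data: the same height in two units, plus an unrelated feature.
height_cm = [150, 160, 170, 180, 190]
height_in = [59.1, 63.0, 66.9, 70.9, 74.8]
coffee_cups = [3, 1, 4, 1, 5]

vif_redundant = vif_two_features(height_cm, height_in)
vif_unrelated = vif_two_features(height_cm, coffee_cups)
print(f"redundant pair: {vif_redundant:.0f}")
print(f"unrelated pair: {vif_unrelated:.2f}")
```

A VIF close to 1 indicates little collinearity; values above roughly 5 to 10 are a commonly quoted rule-of-thumb warning sign, although the exact threshold is a matter of convention.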

---

Evaluating a linear regression model is essential to understand how well it predicts the target variable and how much of the underlying variability it captures. Unlike classification problems where metrics like accuracy or F1-score are used, regression requires measures that quantify the **difference between predicted and actual values**, as well as the overall explanatory power of the model.

One of the most common metrics is the **Mean Squared Error (MSE)**, which calculates the average of the squared differences between predicted and actual values:

$$
\text{MSE} = \frac{1}{m} \sum_{i=1}^m (y_i - \hat{y}_i)^2
$$

Here, $y_i$ represents the actual target value, $\hat{y}_i$ is the predicted value, and $m$ is the number of samples. Squaring the errors penalizes larger deviations more heavily, making MSE sensitive to outliers. A lower MSE indicates better predictive performance.

The **Root Mean Squared Error (RMSE)** is simply the square root of the MSE:

$$
\text{RMSE} = \sqrt{\text{MSE}}
$$

RMSE is often preferred because it is expressed in the same units as the target variable, making it easier to interpret and compare with the scale of the data.
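
A direct translation of the MSE and RMSE formulas above, as a minimal sketch with made-up values. The function names `mse` and `rmse` are chosen for this example.

```python
from math import sqrt

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return sqrt(mse(y_true, y_pred))

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 11.0]  # errors: 0.5, 0.0, -1.0, -2.0

mse_value = mse(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mse_value)             # (0.25 + 0 + 1 + 4) / 4 = 1.3125
print(round(rmse_value, 4))
```

The single largest error (2.0) contributes 4 of the 5.25 total squared error, which illustrates why MSE is sensitive to outliers.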

Another important metric is the **Coefficient of Determination ($R^2$)**:

$$
R^2 = 1 - \frac{\sum_{i=1}^m (y_i - \hat{y}_i)^2}{\sum_{i=1}^m (y_i - \bar{y})^2}
$$

$R^2$ measures the proportion of variance in the target variable that is explained by the model. A value of $R^2 = 1$ indicates a perfect fit, whereas $R^2 = 0$ suggests that the model does no better than predicting the mean of the target. Negative values can occur when the model performs worse than simply predicting the mean.
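
The three cases described above (perfect fit, mean prediction, worse than the mean) can be reproduced with a short sketch. The function name `r_squared` and the data are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]  # mean is 6.0

r2_perfect = r_squared(y_true, [3.0, 5.0, 7.0, 9.0])   # exact predictions
r2_mean = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])      # always predict the mean
r2_reversed = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # worse than the mean

print(r2_perfect, r2_mean, r2_reversed)
```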

Together, these metrics give **complementary views** of model performance: MSE and RMSE measure the size of the prediction errors, while $R^2$ measures how much of the variability in the target the model explains relative to always predicting the mean.

In practice, these evaluation measures guide decisions on model selection, hyperparameter tuning, and whether further feature engineering or preprocessing is needed. Even though linear regression is conceptually simple, mastering these metrics is crucial for building a solid foundation before moving on to more complex supervised learning models.



### Polynomial Regression
- Using polynomials for non-linear models
- Overfitting and regularization

### Regularized Regression
- Ridge regression (L2)
- Lasso regression (L1)
- Elastic Net

### Other Regression Models
- Support Vector Regression (SVR)
- Decision Tree Regressor
- Random Forest Regressor
- Gradient Boosting Regressor

## Classification
### Binary Classification
- Logistic Regression
- Interpretation of coefficients
- Sigmoid function
- Decision boundary

### Multiclass Classification
- One-vs-Rest
- Softmax
- Multinomial Logistic Regression

### Classification Algorithms
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree Classifier
- Random Forest Classifier
- Gradient Boosting Classifier (XGBoost, LightGBM, CatBoost)
- Naive Bayes
- Neural Networks for classification

## Model Evaluation
### Regression Metrics
- MSE, RMSE, MAE
- R² (Coefficient of determination)
- Adjusted R²

### Classification Metrics
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
- ROC curve and AUC
- Log-loss
- Matthews Correlation Coefficient (MCC)
- Cohen’s Kappa
- Balanced accuracy
- F-beta score

### Cross-validation
- K-Fold Cross Validation
- Leave-One-Out Cross Validation
- Stratified K-Fold

## Advanced Techniques
- Ensemble methods
  - Bagging (Random Forest)
  - Boosting (AdaBoost, Gradient Boosting)
  - Stacking
- Feature importance
- Hyperparameter tuning
  - Grid Search
  - Random Search
  - Bayesian Optimization
- Learning curves and validation curves

## Advanced Topics / Algorithms
- Probabilistic models:
  - Bayesian regression
  - Gaussian Naive Bayes
- Distance-based methods beyond KNN:
  - Metric learning concepts
- Tree-based advanced techniques:
  - Extra Trees
  - Gradient Boosting variants (CatBoost specifics)
- Neural networks for tabular data (basic MLP)
- Calibration of probabilistic classifiers

## Practical Data Handling
- Feature transformation:
  - Log transformation, polynomial features
  - Interaction terms
- Advanced categorical variable encoding:
  - Target encoding
  - Frequency encoding
- Advanced handling of missing data:
  - Imputation techniques (mean, median, k-NN, MICE)
- Data leakage prevention
- Pipeline automation (scikit-learn pipelines)

## Evaluation / Metrics
- Learning curves (training vs validation performance)
- Validation curves (hyperparameter impact)
- Bootstrapping and Monte Carlo evaluation
- Nested Cross-validation
- Confounding variables and collinearity
- Concept drift handling

## Practical Considerations
- Imbalanced datasets
  - Oversampling (SMOTE)
  - Undersampling
- Handling outliers
- Model interpretability
- Model deployment
- Scalability and optimization

## Applications
- Sales prediction (regression)
- Image or text classification
- Fraud detection
- Churn prediction
- Medical diagnosis


# Unsupervised Learning

## Introduction to Unsupervised Learning
- Definition of unsupervised learning
- Difference between supervised, unsupervised, and reinforcement learning
- Goal: find patterns, structures, or groupings in data
- No target variable (y)
- Common tasks:
  - Clustering
  - Dimensionality reduction
  - Anomaly detection

## Fundamental Concepts
- Dataset and features
- Distance and similarity measures
  - Euclidean distance
  - Manhattan distance
  - Cosine similarity
  - Correlation
- Overfitting and underfitting in unsupervised learning
- Evaluation challenges (lack of ground truth)

## Data Preprocessing
- Data cleaning
- Handling missing values
- Normalization and standardization
- Encoding categorical variables
- Feature scaling
- Feature selection and feature engineering
- Dimensionality reduction before clustering (optional)

## Clustering
### Partitioning Methods
- K-Means
  - Algorithm overview
  - Choosing number of clusters (elbow method, silhouette score)
  - Limitations: sensitive to initialization, outliers
- K-Medoids / PAM
- Mini-Batch K-Means

### Hierarchical Clustering
- Agglomerative clustering
  - Linkage methods: single, complete, average, ward
- Divisive clustering
- Dendrogram visualization

### Density-Based Clustering
- DBSCAN
- OPTICS
- HDBSCAN

### Model-Based Clustering
- Gaussian Mixture Models (GMM)
- Expectation-Maximization algorithm
- Choosing number of components (BIC/AIC)

### Other Clustering Techniques
- Spectral Clustering
- Self-Organizing Maps (SOM)
- Mean-Shift clustering

## Dimensionality Reduction
- Principal Component Analysis (PCA)
- Kernel PCA
- Independent Component Analysis (ICA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Uniform Manifold Approximation and Projection (UMAP)
- Linear Discriminant Analysis (LDA, supervised variant)

## Anomaly Detection
- Z-score and statistical methods
- Isolation Forest
- One-Class SVM
- Local Outlier Factor (LOF)
- Autoencoder-based anomaly detection

## Evaluation of Unsupervised Models
- Internal metrics:
  - Silhouette score
  - Davies-Bouldin index
  - Calinski-Harabasz index
- External metrics (if ground truth available):
  - Adjusted Rand Index (ARI)
  - Normalized Mutual Information (NMI)
  - Fowlkes-Mallows score
- Visual inspection:
  - Scatter plots, cluster plots
  - Heatmaps

## Advanced Techniques
- Ensemble clustering
- Consensus clustering
- Subspace clustering
- Feature learning with autoencoders
- Self-supervised learning (modern approach)

## Practical Considerations
- Choosing the right number of clusters/components
- Handling high-dimensional data
- Handling categorical features
- Handling outliers
- Scaling for large datasets
- Interpretability of clusters or latent features

## Applications
- Customer segmentation
- Market basket analysis
- Anomaly/fraud detection
- Image compression or embedding
- Topic modeling in text
- Dimensionality reduction for visualization
