# Intro to Machine Learning: Preprocessing Data

Before we can train a model to predict or classify anything, we need to make sure our **data is ready to learn from**. This is what preprocessing is all about — cleaning, organizing, and transforming data so that it can be understood by algorithms.

Throughout this lecture, we’ll learn **why preprocessing is important**, explore key techniques like **scaling**, **encoding**, and **dimensionality reduction**, and discuss how these steps can improve performance and interpretability.

By the end, you should be able to:
- Explain what preprocessing is and why we do it.
- Describe how scaling and encoding work.
- Understand what PCA (Principal Component Analysis) does.
- Recognize when preprocessing helps a model and when it might not be necessary.

## Why Preprocess?
Raw data is rarely clean or consistent. Some features might be on different scales (like income in dollars vs. age in years), and others might be categorical (like city names or shirt sizes). Without preprocessing, your model might misinterpret these differences.

For example, imagine building a model to predict a student’s grade based on study hours and number of snacks eaten while studying. If you don’t adjust the scales, the feature with bigger numbers (snacks, possibly in the hundreds!) could outweigh study hours in the model’s calculations, even if it is the less relevant of the two.

### Common Reasons to Preprocess:
1. **Scaling**: Puts numerical features on comparable scales so that no feature dominates just because of its units.
2. **Encoding**: Converts text categories into numbers that models can understand.
3. **Dimensionality Reduction**: Simplifies datasets with many features without losing much information.

### Q: Why might a model give inaccurate predictions if some features have much larger numbers than others?

### A: 
YOUR ANSWER HERE

----
## Scaling Data 📏
Scaling is one of the most important preprocessing steps. It ensures that features measured in different units are treated fairly by algorithms.

Many models rely on distances or on iterative optimization. **K-Nearest Neighbors (KNN)** compares data points by how far apart they are, and **Logistic Regression** is usually fit by an iterative optimizer. When one feature’s scale is much larger than another’s, it can dominate these calculations.

### Two Common Scaling Methods
1. **Standard Scaling (Z-Score Normalization)**  
   - In math:  
     $$ z = \frac{x - \mu}{\sigma} $$
   - In words: This centers each feature at zero and rescales it so its standard deviation equals one. It does not change the shape of the distribution, so skewed data stays skewed. It is a common default for distance-based and gradient-based models.

2. **Min-Max Scaling**  
   - In math:  
     $$ x' = \frac{x - x_{min}}{x_{max} - x_{min}} $$
   - In words: This rescales all values to fall between 0 and 1. It’s useful when you need every feature in the same fixed, bounded range. Keep in mind that the minimum and maximum are sensitive to outliers.
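
To make the two formulas concrete, here is a minimal sketch using NumPy. The function names and the sample values are made up for illustration, and the z-score version assumes the population standard deviation (NumPy's default).

```python
import numpy as np

def standard_scale(x):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Map the smallest value to 0 and the largest value to 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Made-up sample: hours of sleep for four people
hours_of_sleep = [4.0, 6.0, 8.0, 10.0]

z = standard_scale(hours_of_sleep)
m = min_max_scale(hours_of_sleep)

print(z)  # mean 0, standard deviation 1
print(m)  # smallest value becomes 0, largest becomes 1
```

Both functions divide by zero if every value in the feature is identical, so a real implementation would need to handle constant columns.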

### Example: Predicting a Person’s Happiness
Suppose you’re studying factors that affect happiness. You collect data on **hours of sleep** (0–12) and **monthly income** (0–10,000). 

If you don’t scale these features, the model may think income is far more important simply because it has larger numbers!

By scaling both features, you let the model evaluate them on equal footing.
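
The effect is easy to see with distances, which is what a model like KNN works with. The sketch below uses three hypothetical people and min-max scaling with the ranges from this example (sleep 0 to 12, income 0 to 10,000). All values are invented for illustration.

```python
import numpy as np

# Hypothetical people: [hours of sleep, monthly income]
a = np.array([8.0, 3000.0])
b = np.array([4.0, 3000.0])  # same income as a, but half the sleep
c = np.array([8.0, 3100.0])  # same sleep as a, slightly higher income

def distance(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

# Feature ranges taken from the example: sleep 0-12, income 0-10,000
low = np.array([0.0, 0.0])
high = np.array([12.0, 10000.0])

def min_max(p):
    """Min-max scale a point using the known feature ranges."""
    return (p - low) / (high - low)

raw_ab = distance(a, b)  # 4.0
raw_ac = distance(a, c)  # 100.0

scaled_ab = distance(min_max(a), min_max(b))  # 4 / 12, about 0.333
scaled_ac = distance(min_max(a), min_max(c))  # 100 / 10000 = 0.01

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

Before scaling, a small income difference makes `c` look much farther from `a` than `b` is. After scaling, the large difference in sleep is the one that stands out.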

### Q: When might Min-Max scaling be better than Standard scaling?

### A: 
YOUR ANSWER HERE

----
## Encoding Categorical Data
Most machine learning models operate on **numbers**, not text. So we must convert any categorical (non-numeric) data into numeric form.

### Two Main Types of Encoding
1. 🔥 **One-Hot Encoding**: For categories without order (like colors or city names).  
   - Example: ‘Color’ → Red = [1, 0, 0], Blue = [0, 1, 0], Green = [0, 0, 1]
   - Creates a new column for each category, marking presence (1) or absence (0).

2. 🔢 **Ordinal Encoding**: For categories with a natural order (like sizes or satisfaction levels).  
   - Example: ‘Size’ → Small = 1, Medium = 2, Large = 3
   - This tells the model that Large > Medium > Small.
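
Both encoders can be sketched in a few lines of plain Python. The helper names below are hypothetical, and the category lists are passed in explicitly so that the column order and the rank order are under your control.

```python
def one_hot_encode(values, categories):
    """Return one 0/1 row per value, with a single 1 in that category's column."""
    column = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        rows.append(row)
    return rows

def ordinal_encode(values, ordered_categories):
    """Map each value to its 1-based rank in the given order."""
    rank = {cat: i + 1 for i, cat in enumerate(ordered_categories)}
    return [rank[value] for value in values]

colors = ["Red", "Green", "Blue", "Red"]
one_hot = one_hot_encode(colors, ["Red", "Blue", "Green"])

sizes = ["Medium", "Small", "Large"]
ordinal = ordinal_encode(sizes, ["Small", "Medium", "Large"])

print(one_hot)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(ordinal)  # [2, 1, 3]
```

A value that is not in the category list raises a `KeyError` in this sketch; deciding what to do with unseen categories is a design choice a real encoder has to make.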

### Example: Movie Preferences
You’re analyzing survey data about favorite movie genres and enjoyment levels. ‘Genre’ (Comedy, Drama, Action) can be One-Hot Encoded, while ‘Enjoyment’ (Low, Medium, High) should be Ordinal Encoded.

### Key Note 📝
**Using the wrong encoder can mislead your model**. 

For instance, if you Ordinal Encode unordered data (like ‘Genre’), the model might assume Drama > Comedy just because of higher numbers.
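
One way to see the problem is to compare how far apart the genres end up under each encoding. This is a small illustrative sketch; the numeric codes are assigned arbitrarily, which is exactly the issue.

```python
import math

genres = ["Comedy", "Drama", "Action"]

# Ordinal codes assigned in list order: Comedy = 1, Drama = 2, Action = 3
ordinal_code = {genre: i + 1 for i, genre in enumerate(genres)}

# One-hot codes: one position per genre
one_hot_code = {
    genre: [1 if j == i else 0 for j in range(len(genres))]
    for i, genre in enumerate(genres)
}

def ordinal_gap(g1, g2):
    """Absolute difference between two ordinal codes."""
    return abs(ordinal_code[g1] - ordinal_code[g2])

def one_hot_gap(g1, g2):
    """Euclidean distance between two one-hot vectors."""
    return math.dist(one_hot_code[g1], one_hot_code[g2])

print(ordinal_gap("Comedy", "Drama"), ordinal_gap("Comedy", "Action"))  # 1 2
print(one_hot_gap("Comedy", "Drama"), one_hot_gap("Comedy", "Action"))  # both sqrt(2)
```

With ordinal codes, Action looks twice as far from Comedy as Drama does, even though no such relationship exists in the data. With one-hot codes, every pair of genres is the same distance apart.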

### Q: When could One-Hot Encoding become a problem if there are many possible categories?
Please explain using your own hypothetical example!

### A: 
YOUR ANSWER HERE

----
## Dimensionality Reduction: Principal Component Analysis (PCA)
In large datasets with many features, some information overlaps or doesn’t add much value. PCA helps reduce the number of features while keeping the most important patterns.

### How PCA Works
1. Finds directions (called **principal components**) where the data varies the most.
2. Projects the data onto the first few of those directions, giving a smaller space that preserves most of the data’s variation.
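
The two steps above can be sketched with NumPy's singular value decomposition. This is one common way to compute PCA, not the only one. The `pca` function and the synthetic dataset are assumptions made for illustration: the second feature is built to be roughly twice the first, so a single direction should capture almost all of the variation.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (a minimal sketch)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA works on centered data

    # Rows of Vt are the principal directions, ordered from most to least variance
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]

    # Share of the total variance captured by each kept component
    variance = S ** 2 / (X.shape[0] - 1)
    explained_ratio = variance[:n_components] / variance.sum()

    projected = X_centered @ components.T
    return projected, components, explained_ratio

# Synthetic data: 200 rows, 3 features, with features 1 and 2 strongly correlated
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

projected, components, ratio = pca(X, n_components=1)
print(projected.shape)  # (200, 1)
print(ratio)            # close to 1: one component captures nearly all the variance
```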

Think of PCA as summarizing a long story into a few key points — you still understand the main idea without reading every detail.

### Example: Fashion Trends
Imagine you have data on 1,000 shoppers with 20 features each (color preferences, style choices, budgets, etc.). PCA could help you reduce those 20 features down to just 2 or 3 — allowing you to plot the shoppers in 2D and see clusters like ‘casual’, ‘sporty’, and ‘formal’ styles.

### When to Use PCA
- When you have many correlated features.
- When you want to visualize high-dimensional data.
- When reducing computational cost matters.

### ⚠️ Be Careful!
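PCA can make results less interpretable because new features (components) are combinations of old ones. **Sometimes, simplicity comes at the cost of explainability.** PCA is also sensitive to scale: because it looks for the directions of greatest variance, features with large numeric ranges will dominate unless you scale the data first.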

## The Big Picture 🖼️
Let’s summarize how preprocessing fits into the broader ML workflow:

1. **Collect Data** — Surveys, experiments, sensors, or public datasets.
2. **Clean Data** — Handle missing values, fix outliers, and remove irrelevant features.
3. **Preprocess** — Scale numerical values, encode categorical ones, and maybe reduce dimensions.
4. **Train Model** — Use processed data for training.
5. **Evaluate Results** — Check accuracy, precision, or other metrics.

### Example Scenario
You’re building a model to predict which products a store should restock. You have features like price, product category, weekly sales, and supplier rating. You would:
1. Scale price and sales.
2. Encode product category.
3. Maybe apply PCA if you end up with many correlated features (for example, several supplier ratings per product).
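
Here is a rough sketch of steps 1 and 2 for this scenario, using NumPy. The product rows are entirely made up, the supplier rating is left out to keep it short, and PCA is skipped because there are only a handful of features.

```python
import numpy as np

# Hypothetical product data; every value here is invented for illustration
prices = np.array([2.5, 40.0, 15.0, 120.0])
weekly_sales = np.array([300.0, 20.0, 75.0, 5.0])
categories = ["Snacks", "Electronics", "Clothing", "Electronics"]

def standard_scale(x):
    """Z-score scaling: mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

# Step 1: scale the numeric features
scaled_prices = standard_scale(prices)
scaled_sales = standard_scale(weekly_sales)

# Step 2: one-hot encode the category (columns in alphabetical order)
category_names = sorted(set(categories))
one_hot = np.array([
    [1.0 if category == name else 0.0 for name in category_names]
    for category in categories
])

# Combine everything into one feature matrix: 2 numeric + 3 category columns
features = np.column_stack([scaled_prices, scaled_sales, one_hot])

print(category_names)  # ['Clothing', 'Electronics', 'Snacks']
print(features.shape)  # (4, 5)
```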

### Q: Which preprocessing step do you think would have the biggest impact in this scenario — scaling, encoding, or PCA?
Please explain your reasoning below!

### A: 
YOUR ANSWER HERE

## Summary 📚
Preprocessing is how we prepare data so that it’s ready to learn from.

- **Scaling** puts features on similar scales.
- **Encoding** lets models understand words and categories.
- **Principal Component Analysis (PCA)** simplifies complex data while keeping its structure.

By applying these techniques, we help models learn more efficiently, make fairer predictions, and uncover clearer patterns.

**Next Steps:** In the next lesson, we’ll see how preprocessing fits into an actual ML pipeline using real-world data.