# Lecture 05: Data Representation & Normalization
## Possible Subjective Exam Questions
---

## Section 1: Data Representation Questions

### Q1. What are the different ways to represent data in machine learning? Explain each type briefly.

**Answer:**

Data can be represented in two main ways:

**1. Vector Representation:**
- **Static Data:** Data that does not change with time. Each sample is a fixed-size vector.
- **Time-series Data:** Data that changes over time, like stock prices or temperature readings.

**2. Non-vectorial Representation:**
- **Strings:** A group of characters (like text data)
- **Graph/Tree Patterns:** Data structured as graphs where:
  - A graph is an ordered pair $G = (V, E)$
  - $V$ is a set of vertices (nodes)
  - $E$ is a set of edges (connections between vertices)

### Q2. Define a graph mathematically. What are vertices and edges?

**Answer:**

A graph is mathematically defined as:

$$G = (V, E)$$

Where:
- $V$ = Set of vertices (also called nodes). These are the points in the graph.
- $E$ = Set of edges. Each edge is a pair of vertices that shows a connection (an unordered pair in an undirected graph, an ordered pair in a directed graph).

**Example:** In a social network, people are vertices and friendships are edges.
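The definition above can be sketched in a few lines of Python. This is only an illustration: the people's names and the `neighbors` helper are made up for the example, and each edge is stored as a `frozenset` so that the pair is unordered (an undirected graph).

```python
# Hypothetical social network: the names are illustrative only.
V = {"Asha", "Ben", "Chen", "Dia"}

# Each edge is an unordered pair of vertices, so
# frozenset({"Asha", "Ben"}) == frozenset({"Ben", "Asha"}).
E = {
    frozenset({"Asha", "Ben"}),
    frozenset({"Ben", "Chen"}),
    frozenset({"Chen", "Dia"}),
}

# The graph is the ordered pair (V, E).
G = (V, E)


def neighbors(graph, vertex):
    """Return the set of vertices that share an edge with `vertex`."""
    _, edges = graph
    return {u for edge in edges if vertex in edge for u in edge if u != vertex}


print(sorted(neighbors(G, "Ben")))  # ['Asha', 'Chen']
```

Here the friendships are the edges, and asking "who are Ben's friends?" is the same as asking for the neighbors of the vertex `"Ben"`.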

### Q3. What is the difference between static data and time-series data? Give examples.

**Answer:**

| Static Data | Time-series Data |
|-------------|------------------|
| Does not change with time | Changes with time |
| Each sample is independent | Samples are ordered by time |
| Example: Height, weight of a person | Example: Stock prices, daily temperature |
| No time dependency | Has time dependency |

## Section 2: Data Quality Questions

### Q4. List and explain the five measures of data quality.

**Answer:**

The five measures of data quality are:

1. **Accuracy:** How correct is the observed data? Data should match the real values.

2. **Completeness:** Are there any missing feature values? Complete data has no missing entries.

3. **Consistency:** Is the data the same across multiple sources? Data should not have conflicts.

4. **Believability:** How credible is the data? Can we trust the data source?

5. **Interpretability:** How meaningful is the data? Can we understand what the data represents?

### Q5. Why is data quality important in machine learning?

**Answer:**

Data quality is important because:

1. **Garbage in, garbage out:** Bad quality data leads to bad models
2. Poor accuracy leads to wrong predictions
3. Missing data can bias the model
4. Inconsistent data confuses the learning algorithm
5. Good quality data saves time and resources in preprocessing

## Section 3: Data Preprocessing Questions

### Q6. What are the four key data preprocessing tasks? Explain each briefly.

**Answer:**

The four key preprocessing tasks are:

1. **Data Cleaning:** Fixing errors, handling missing values, removing noise

2. **Data Integration:** Combining data from multiple sources into one dataset

3. **Data Reduction:** Reducing the size of data while keeping important information

4. **Data Transformation:** Changing data format, scaling, or creating new features

### Q7. What are the different methods to handle missing data? Explain each method.

**Answer:**

Methods to fill missing data:

1. **Global Constant:** Fill with a fixed value like "unknown" or create a new class

2. **Attribute Mean:** Fill with the average value of that attribute
   - Example: If age is missing, use average age of all samples

3. **Class-wise Attribute Mean:** Fill with the mean of samples from the same class (smarter approach)
   - Example: Use average age of males for missing male ages

4. **Most Probable Value:** Use inference-based methods such as Bayes' rule, regression, or a decision tree to predict the most likely value
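Methods 2 and 3 above can be sketched as follows. This is a minimal illustration, not a library API: the toy `records`, the field names, and both helper functions are assumptions made for the example, and `None` is used to mark a missing value.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing age.
records = [
    {"group": "manager", "age": 45},
    {"group": "manager", "age": None},
    {"group": "manager", "age": 55},
    {"group": "intern", "age": 21},
    {"group": "intern", "age": 23},
    {"group": "intern", "age": None},
]


def fill_with_mean(rows, attr):
    """Fill missing values with the mean of all observed values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Fill missing values with the mean of observed values in the same class."""
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in observed.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


print(fill_with_mean(records, "age")[1]["age"])                 # 36
print(fill_with_class_mean(records, "age", "group")[1]["age"])  # 50
```

The overall mean (36) is a poor guess for a manager, while the class-wise mean (50) stays close to the other managers. The sketch assumes every class has at least one observed value.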

### Q8. Why is class-wise attribute mean considered smarter than simple attribute mean for filling missing values?

**Answer:**

Class-wise attribute mean is smarter because:

1. It considers the group/class the sample belongs to
2. Different classes may have different distributions
3. It gives more accurate estimates

**Example:** 
- Average salary of all employees = ₹50,000
- Average salary of managers = ₹80,000
- Average salary of interns = ₹20,000

If a manager's salary is missing, using ₹80,000 is better than ₹50,000.

### Q9. What is correlation coefficient and why is it used in data integration?

**Answer:**

**Correlation Coefficient:**
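- It measures the strength and direction of the linear relationship between two variables (Pearson correlation coefficient)
- Value ranges from $-1$ to $+1$
- $+1$ = Perfect positive linear correlation
- $-1$ = Perfect negative linear correlation
- $0$ = No linear correlation (a non-linear relationship may still exist)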

**Use in Data Integration:**
- Helps identify if two features from different sources are the same
- Helps find redundant features
- Helps detect if the same attribute has different names in different sources
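As a rough sketch of how this check could look, the function below computes the Pearson correlation coefficient from its definition. The function name and the sample columns are assumptions for illustration; the height example imitates the same attribute stored in two sources under different units.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Same attribute from two hypothetical sources: height in cm and in inches.
height_cm = [150, 160, 170, 180]
height_in = [h / 2.54 for h in height_cm]
print(round(pearson(height_cm, height_in), 6))       # 1.0

# Perfect negative linear relationship.
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))           # -1.0

# y = x^2 is fully determined by x, yet the linear correlation is 0.
print(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))   # 0.0
```

A coefficient close to $+1$ or $-1$ between two columns suggests one of them is redundant. The last case shows why $0$ only rules out a linear relationship.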

## Section 4: Curse of Dimensionality Questions

### Q10. What is the curse of dimensionality? Explain its effects.

**Answer:**

**Curse of Dimensionality:**

When the number of dimensions (features) increases, data becomes increasingly sparse (spread out).

**Effects:**
1. Data points become far apart from each other
2. Density of data decreases
3. Distance between points becomes less meaningful
4. More data is needed to fill the space
5. Many machine learning algorithms, especially distance-based ones such as KNN, perform poorly
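Effect 3 can be observed with a small experiment. The sketch below is an assumption-laden illustration: it draws random points uniformly from the unit hypercube and reports the "relative contrast" (a made-up helper name), which is the gap between the farthest and nearest pair divided by the nearest pair distance.

```python
import math
import random


def relative_contrast(n_points, dim, seed=0):
    """(max - min) / min over all pairwise distances of random points in [0, 1]^dim."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(100, dim), 2))
```

In low dimensions the nearest pair is far closer than the farthest pair. In high dimensions all pairwise distances become similar, so "nearest" carries less information.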

### Q11. What is dimensionality reduction? List its benefits.

**Answer:**

**Dimensionality Reduction:**
It is the process of reducing the number of features in the dataset.

**Benefits:**
1. Avoids the curse of dimensionality
2. Helps eliminate irrelevant features
3. Reduces noise in the data
4. Reduces time required for training
5. Reduces space/memory required
6. Allows easier visualization (can plot 2D or 3D)

## Section 5: Data Transformation Questions

### Q12. What are the different types of data transformation techniques?

**Answer:**

The data transformation techniques are:

1. **Smoothing:** Remove noise from data

2. **Attribute/Feature Construction:** Create new attributes from existing ones

3. **Normalization:** Scale data to a smaller range
   - Min-max normalization
   - Z-score normalization
   - Decimal scaling
   - Log scaling

### Q13. Explain Min-Max Normalization with formula and example.

**Answer:**

**Min-Max Normalization (Linear Scaling):**

It scales data to a new range $[new\_min_A, new\_max_A]$

**Formula:**

$$v' = \frac{v - min_A}{max_A - min_A} \times (new\_max_A - new\_min_A) + new\_min_A$$

**Example:**
- Income range: \$12,000 to \$98,000
- Normalize to range: $[0.0, 1.0]$
- For income \$73,600:

$$v' = \frac{73600 - 12000}{98000 - 12000} \times (1.0 - 0) + 0$$

$$v' = \frac{61600}{86000} = 0.716$$
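The same calculation as a short Python sketch. The function name `min_max` and its parameter names are chosen here to mirror the symbols in the formula; the guard against `max_a == min_a` is an added assumption to avoid division by zero.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("max_a must differ from min_a")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(min_max(12_000, 12_000, 98_000))            # 0.0
print(min_max(98_000, 12_000, 98_000))            # 1.0
```

The minimum always maps to `new_min` and the maximum to `new_max`; every other value falls proportionally in between.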

### Q14. Explain Z-score Normalization with formula and example.

**Answer:**

**Z-score Normalization (Standardization / Whitening Transform):**

It transforms data to have mean = 0 and standard deviation = 1.

**Formula:**

$$v' = \frac{v - \mu}{\sigma}$$

Where:
- $\mu$ = mean of attribute A
- $\sigma$ = standard deviation of attribute A

**Example:**
- $\mu = 54{,}000$
- $\sigma = 16{,}000$
- For value $v = 73{,}600$:

$$v' = \frac{73600 - 54000}{16000} = \frac{19600}{16000} = 1.225$$
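A minimal Python sketch of the same computation, plus a check that a whole standardized column ends up with mean 0 and standard deviation 1. The sample `values` are made up, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the formula above does not say whether $\sigma$ is the population or sample value.

```python
from statistics import mean, pstdev


def z_score(v, mu, sigma):
    """Number of standard deviations that v lies above (+) or below (-) the mean."""
    return (v - mu) / sigma


print(z_score(73_600, 54_000, 16_000))  # 1.225

# Standardizing a whole (hypothetical) column.
values = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = mean(values), pstdev(values)   # 5 and 2.0
standardized = [z_score(v, mu, sigma) for v in values]
print(standardized)
print(mean(standardized), pstdev(standardized))  # 0.0 1.0
```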

### Q15. Explain Decimal Scaling normalization with formula and example.

**Answer:**

**Decimal Scaling:**

It normalizes by moving the decimal point.

**Formula:**

$$v' = \frac{v}{10^j}$$

Where $j$ is the smallest integer such that $Max(|v'|) < 1$

**Example:**
- If values range from $-999$ to $999$
- We need $j = 3$ (divide by $10^3 = 1000$)
- Then $999$ becomes $0.999$ and $-999$ becomes $-0.999$
- All values are now between $-1$ and $1$
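The search for $j$ can be sketched as a simple loop. Both function names are hypothetical, and the sketch assumes $j \geq 0$, which is the usual case when the largest absolute value is at least 1.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10**j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, where j comes from the whole column."""
    j = decimal_scaling_exponent(values)
    return [v / 10**j for v in values]


print(decimal_scaling_exponent([-999, 999]))  # 3
print(decimal_scale([-999, 999]))             # [-0.999, 0.999]
print(decimal_scaling_exponent([1000]))       # 4
```

The last call shows the boundary case: $1000 / 10^3 = 1$ is not strictly less than 1, so $j = 4$ is required.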

### Q16. What is Log Scaling? When is it useful?

**Answer:**

**Log Scaling:**

**Formula:**

$$x' = \log(x)$$

**When to use:**
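1. When data has a very large range of values
2. When data is skewed (not symmetric), especially with a long right tail
3. When data grows exponentially or follows a heavy-tailed distribution
4. To reduce the effect of large outliers

Note: $\log(x)$ is defined only for $x > 0$, so zero or negative values must be shifted or handled separately first.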

**Example:**
- Population of cities: 1000, 10000, 100000, 1000000
- After log scaling with base 10: 3, 4, 5, 6 (much smaller range)
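The city-population example can be reproduced with Python's standard library. Base 10 is used here because it matches the numbers in the example; the helper name `log_scale` and its error check are assumptions for illustration.

```python
import math


def log_scale(values):
    """Apply log base 10 to every value; all values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log scaling requires strictly positive values")
    return [math.log10(v) for v in values]


populations = [1_000, 10_000, 100_000, 1_000_000]
scaled = log_scale(populations)
print([round(s, 6) for s in scaled])  # [3.0, 4.0, 5.0, 6.0]

# The raw range spans a factor of 1000; the scaled range spans only 3 units.
print(max(populations) / min(populations), max(scaled) - min(scaled))
```

A different base (for example the natural logarithm, `math.log`) only rescales the result by a constant factor, so the compression effect is the same.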

### Q17. Compare Min-Max Normalization and Z-score Normalization.

**Answer:**

| Min-Max Normalization | Z-score Normalization |
|----------------------|----------------------|
| Scales to a fixed range like $[0,1]$ | Centers around mean with unit variance |
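| Sensitive to outliers (a single extreme value sets the min or max) | Less sensitive to outliers, though $\mu$ and $\sigma$ are still affected by them |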
| Formula: $v' = \frac{v - min_A}{max_A - min_A}$ | Formula: $v' = \frac{v - \mu}{\sigma}$ |
| Uses min and max values | Uses mean and standard deviation |
| Good when you know the bounds | Good when distribution is Gaussian |
| Also called linear scaling | Also called standardization |

### Q18. Why is normalization important in machine learning?

**Answer:**

Normalization is important because:

1. **Equal importance:** Features with larger values don't dominate smaller ones

2. **Faster convergence:** Gradient descent converges faster with normalized data

3. **Better accuracy:** Many algorithms work better with normalized data

4. **Distance calculations:** In KNN, distances are meaningful only when features are on the same scale

5. **Avoiding numerical issues:** Very large or small values can cause overflow/underflow
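Points 1 and 4 can be seen in a tiny distance calculation. Everything in this sketch is hypothetical: the three people, the units, and the assumed feature ranges used for min-max scaling.

```python
import math

# Hypothetical people described by (salary in rupees, age in years).
a = (50_000, 25)
b = (51_000, 60)  # similar salary, very different age
c = (58_000, 26)  # different salary, similar age

# Raw Euclidean distances: salary dominates, so b looks closer to a than c does.
print(math.dist(a, b) < math.dist(a, c))  # True

# Assumed feature ranges for min-max scaling to [0, 1].
SALARY_RANGE = (20_000, 100_000)
AGE_RANGE = (18, 65)


def scale(person):
    """Min-max scale both features to [0, 1] using the assumed ranges."""
    salary, age = person
    return (
        (salary - SALARY_RANGE[0]) / (SALARY_RANGE[1] - SALARY_RANGE[0]),
        (age - AGE_RANGE[0]) / (AGE_RANGE[1] - AGE_RANGE[0]),
    )


# After scaling, the 35-year age gap counts, and c becomes the nearer neighbour.
print(math.dist(scale(a), scale(b)) > math.dist(scale(a), scale(c)))  # True
```

Before scaling, a nearest-neighbour search effectively ignores age. After scaling, both features contribute on comparable terms.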

## Section 6: Numerical Problems

### Q19. Given income range \$20,000 to \$80,000, normalize \$50,000 to the range [0, 1] using Min-Max normalization.

**Answer:**

Given:
- $min_A = 20{,}000$
- $max_A = 80{,}000$
- $v = 50{,}000$
- $new\_min_A = 0$, $new\_max_A = 1$

**Formula:**
$$v' = \frac{v - min_A}{max_A - min_A} \times (new\_max_A - new\_min_A) + new\_min_A$$

**Solution:**
$$v' = \frac{50000 - 20000}{80000 - 20000} \times (1 - 0) + 0$$

$$v' = \frac{30000}{60000} = 0.5$$

### Q20. For an attribute with mean = 100 and standard deviation = 25, calculate the z-score for value 150.

**Answer:**

Given:
- $\mu = 100$
- $\sigma = 25$
- $v = 150$

**Formula:**
$$v' = \frac{v - \mu}{\sigma}$$

**Solution:**
$$v' = \frac{150 - 100}{25} = \frac{50}{25} = 2$$

The z-score is $2$, meaning the value is 2 standard deviations above the mean.

### Q21. What value of j is needed for decimal scaling if the maximum value in dataset is 4567?

**Answer:**

Given: Maximum value = 4567

We need: $Max(|v'|) < 1$

**Checking:**
- $j = 3$: $v' = \frac{4567}{10^3} = \frac{4567}{1000} = 4.567$ (Not < 1)
- $j = 4$: $v' = \frac{4567}{10^4} = \frac{4567}{10000} = 0.4567$ (< 1 ✓)

**Answer:** $j = 4$

### Q22. Normalize the value 60,000 to range [-1, 1] given min = 40,000 and max = 100,000.

**Answer:**

Given:
- $min_A = 40{,}000$
- $max_A = 100{,}000$
- $v = 60{,}000$
- $new\_min_A = -1$, $new\_max_A = 1$

**Formula:**
$$v' = \frac{v - min_A}{max_A - min_A} \times (new\_max_A - new\_min_A) + new\_min_A$$

**Solution:**
$$v' = \frac{60000 - 40000}{100000 - 40000} \times (1 - (-1)) + (-1)$$

$$v' = \frac{20000}{60000} \times 2 - 1$$

$$v' = 0.333 \times 2 - 1 = 0.667 - 1 = -0.333$$

## Section 7: Conceptual Questions

### Q23. What is the difference between data cleaning and data transformation?

**Answer:**

| Data Cleaning | Data Transformation |
|---------------|--------------------|
| Fixes errors in data | Changes format of data |
| Handles missing values | Normalizes/scales data |
| Removes noise | Creates new features |
| Makes data correct | Makes data suitable for algorithms |
| Example: Filling missing ages | Example: Scaling salary to [0,1] |

### Q24. What is smoothing in data transformation and why is it needed?

**Answer:**

**Smoothing:**
It is the process of removing noise from data.

**Why needed:**
1. Real data contains random errors (noise)
2. Noise can mislead machine learning models
3. Smoothing helps find the true pattern in data
4. Reduces the effect of outliers

**Common methods:**
- Moving average
- Binning
- Regression
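Two of the methods listed above, sketched in plain Python. The function names and sample data are illustrative assumptions, and the binning variant shown is equal-frequency binning with smoothing by bin means, which is only one of several ways to smooth by binning.

```python
def moving_average(values, window):
    """Replace each run of `window` consecutive values by its mean."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def smooth_by_bin_means(values, bin_size):
    """Sort, split into bins of `bin_size` values, replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# A noisy spike (30) is damped by the moving average.
print(moving_average([10, 12, 11, 30, 12, 11], 3))

# Bins [4, 8, 15], [21, 21, 24], [25, 28, 34] have means 9, 22, 29.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```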

### Q25. What is feature/attribute construction? Give an example.

**Answer:**

**Feature Construction:**
Creating new features from existing ones to improve model performance.

**Examples:**
1. From "date of birth", create "age"
2. From "height" and "weight", create "BMI"
3. From "length" and "width", create "area"
4. From "total marks" and "maximum marks", create "percentage"

**Benefits:**
- Can capture domain knowledge
- May improve model accuracy
- Sometimes reduces dimensionality (for example, BMI can replace both height and weight)
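Three of the examples above written as small Python functions. The function names and sample inputs are hypothetical; the BMI formula assumed here is the standard weight in kilograms divided by height in metres squared.

```python
from datetime import date


def age_in_years(date_of_birth, today):
    """Completed years between date_of_birth and today."""
    years = today.year - date_of_birth.year
    # Subtract one year if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m**2


def percentage(total_marks, maximum_marks):
    """Marks obtained as a percentage of the maximum."""
    return 100 * total_marks / maximum_marks


print(age_in_years(date(2000, 6, 15), date(2024, 6, 14)))  # 23
print(round(bmi(70, 1.75), 1))                             # 22.9
print(percentage(450, 500))                                # 90.0
```

Each function turns one or two raw attributes into a single feature that encodes domain knowledge the model would otherwise have to learn on its own.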

### Q26. When would you use log scaling instead of min-max normalization?

**Answer:**

**Use Log Scaling when:**
1. Data spans many orders of magnitude (like 10, 100, 1000, 10000)
2. Data is highly skewed (not symmetric)
3. Data has exponential growth pattern
4. There are extreme outliers

**Use Min-Max Normalization when:**
1. Data is roughly uniformly distributed
2. You need values in a specific range like $[0, 1]$
3. The range of data is not too extreme

### Q27. What problems can occur if we don't normalize data before training a model?

**Answer:**

Problems without normalization:

1. **Feature dominance:** Features with large values dominate the learning
   - Example: Salary in rupees (values like 500,000) vs Age in years (values like 30)

2. **Slow convergence:** Gradient descent takes many more iterations

3. **Poor distance calculations:** In KNN or K-means, distances become meaningless

4. **Numerical instability:** Very large values can cause overflow

5. **Biased model:** Model gives wrong importance to features

## Section 8: Short Answer Questions

### Q28. What does data sparsity mean in the context of curse of dimensionality?

**Answer:**

Data sparsity means that as dimensions increase, data points become spread out far from each other. The volume of space increases exponentially, but the number of data points remains the same. This makes the data "sparse" - there are very few points in a very large space.

### Q29. What is the range of z-score normalized values?

**Answer:**

Z-score normalization does NOT have a fixed range. Values can be any positive or negative number. However:
- If the data is approximately normally distributed, about 99.7% of values fall between $-3$ and $+3$
- Mean becomes $0$
- Standard deviation becomes $1$
- Outliers will have z-scores far from 0

### Q30. Why is z-score normalization also called "whitening transform"?

**Answer:**

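Z-score normalization is loosely called a "whitening transform" because:

1. It gives every feature mean = 0 and variance = 1
2. The name is an analogy with white noise, a signal with zero mean and equal power at all frequencies
3. After the transform, all features have the same mean and variance, so no feature stands out because of its scale
4. Strictly speaking, a full whitening transform also removes correlations between features (the covariance matrix becomes the identity); per-feature z-scoring only equalizes mean and variance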

---
## Summary of Important Formulas

| Normalization Type | Formula |
|-------------------|--------|
| Min-Max | $v' = \frac{v - min_A}{max_A - min_A} \times (new\_max - new\_min) + new\_min$ |
| Z-score | $v' = \frac{v - \mu}{\sigma}$ |
| Decimal Scaling | $v' = \frac{v}{10^j}$ where $Max(|v'|) < 1$ |
| Log Scaling | $x' = \log(x)$ |

---