### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling, also known as Min-Max normalization, is a data preprocessing technique that transforms the values of each feature in a dataset into a specific range, usually [0, 1]. It brings all features onto a common scale, which helps algorithms that are sensitive to the scale of their inputs, such as k-nearest neighbors or models trained with gradient descent. Because the transformation is linear, it does not change the shape of a feature's distribution; it only shifts and rescales the values so that they fit within the desired range. One consequence is that a single extreme outlier can compress the remaining values into a narrow part of that range.

Here's how Min-Max scaling is applied in data preprocessing:

1. **Identify the range:** Determine the desired range for the scaled values. While [0, 1] is the most common range, you can choose a different range if needed.

2. **For each feature:**
   - Find the minimum (min_value) and maximum (max_value) values of the feature in the dataset.
   - Apply the Min-Max scaling formula to each value of the feature:
     ```
     scaled_value = (original_value - min_value) / (max_value - min_value)
     ```
     This formula scales each value to [0, 1] based on its relative position within the original range of the feature. For a different target range [a, b], multiply the result by (b - a) and add a.

3. **Repeat step 2 for all features** in the dataset.
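
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `min_max_scale`, its parameters `new_min` and `new_max`, and the small input array are all assumptions made for the example. Constant columns are mapped to `new_min` to avoid a division by zero, which is one possible convention among several.

```python
import numpy as np

def min_max_scale(X, new_min=0.0, new_max=1.0):
    """Scale each column of X linearly into [new_min, new_max]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)          # per-feature minimum
    col_max = X.max(axis=0)          # per-feature maximum
    span = col_max - col_min
    # A constant column has span 0; use 1 there so it maps to new_min.
    safe_span = np.where(span == 0, 1.0, span)
    unit = (X - col_min) / safe_span  # values now lie in [0, 1]
    return unit * (new_max - new_min) + new_min

# Hypothetical two-feature dataset, used only to exercise the function.
scaled = min_max_scale([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
print(scaled)
```

Each column is scaled independently, so the second column's middle value becomes (20 - 10) / (40 - 10) = 1/3 even though the first column's middle value becomes 0.5.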

Let's illustrate this with an example using a small dataset of exam scores. Suppose we have the following dataset:

```
| Student | Math Score | English Score |
|---------|------------|---------------|
| Alice   | 80         | 90            |
| Bob     | 60         | 70            |
| Carol   | 75         | 85            |
```

We want to apply Min-Max scaling to both the Math Score and English Score features within the [0, 1] range.

1. **Identify the range:** We'll use the range [0, 1].

2. **Math Score feature:**
   - Min: 60
   - Max: 80
   - Apply Min-Max scaling formula:
     - For Alice: `(80 - 60) / (80 - 60) = 1.0`
     - For Bob: `(60 - 60) / (80 - 60) = 0.0`
     - For Carol: `(75 - 60) / (80 - 60) = 0.75`

3. **English Score feature:**
   - Min: 70
   - Max: 90
   - Apply Min-Max scaling formula:
     - For Alice: `(90 - 70) / (90 - 70) = 1.0`
     - For Bob: `(70 - 70) / (90 - 70) = 0.0`
     - For Carol: `(85 - 70) / (90 - 70) = 0.75`

The scaled dataset would look like this:

```
| Student | Scaled Math Score | Scaled English Score |
|---------|-------------------|----------------------|
| Alice   | 1.0               | 1.0                  |
| Bob     | 0.0               | 0.0                  |
| Carol   | 0.75              | 0.75                 |
```

Now, both the Math Score and English Score features are scaled within the [0, 1] range, making them comparable and suitable for algorithms that are sensitive to feature scales.
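
The same example can be reproduced with scikit-learn's `MinMaxScaler`, which is one common way to apply this technique in practice. The sketch below assumes the three rows are ordered Alice, Bob, Carol and the columns are Math and English; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rows: Alice, Bob, Carol. Columns: Math Score, English Score.
scores = np.array([[80, 90],
                   [60, 70],
                   [75, 85]], dtype=float)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_scores = scaler.fit_transform(scores)  # learns min/max per column, then scales
print(scaled_scores)
```

Carol's row comes out as 0.75 in both columns, since (75 - 60) / 20 and (85 - 70) / 20 both equal 15 / 20.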

### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

The Unit Vector technique in feature scaling is a method used to transform the feature vectors of a dataset such that each feature vector has a unit length. This is achieved by dividing each component of the feature vector by the Euclidean norm of the vector. The Euclidean norm, also known as the Euclidean length or L2 norm, is a measure of the vector's length in the Euclidean space.

Unit Vector scaling differs from Min-Max scaling in that it does not map each feature to a specific range (e.g., [0, 1]). Instead, it rescales each feature vector (each sample, or row) to unit length while preserving the direction in which it points. This technique is particularly useful when the direction of the feature vectors matters more than their magnitude, as with cosine similarity, because it makes vectors comparable without bias from their magnitude.

**Example: House Price Prediction**

Suppose we have a dataset for house price prediction with the following features: square footage (ranging from 1,000 to 5,000 square feet), number of bedrooms (ranging from 1 to 5), and number of bathrooms (ranging from 1 to 3).

1. **Unit Vector Scaling:**
   - Calculate the Euclidean norm of each feature vector.
   - Divide each component of the feature vector by the Euclidean norm.
   - The resulting feature vectors will have a unit length. Each component lies in [-1, 1], but the features are not individually mapped onto a fixed range the way they are in Min-Max scaling.

2. **Min-Max Scaling:**
   - Scale each feature to a specific range (e.g., [0, 1]) based on the minimum and maximum values of that feature.
   - The resulting feature vectors will have values within the chosen range, but their direction may change.

Let's calculate the scaled feature vectors for a specific house using both Unit Vector scaling and Min-Max scaling:

Original feature vector: [2500 sqft, 3 bedrooms, 2 bathrooms]

**Unit Vector Scaling:**
1. Calculate the Euclidean norm: sqrt((2500^2) + (3^2) + (2^2)) = sqrt(6,250,013) ≈ 2500.0026
2. Divide each component by the norm (**scaled_vector = feature_vector / ||feature_vector||**):
   - Square footage: 2500 / 2500.0026 ≈ 0.999999
   - Bedrooms: 3 / 2500.0026 ≈ 0.0012
   - Bathrooms: 2 / 2500.0026 ≈ 0.0008

**Min-Max Scaling:**
1. Square footage:
   - Min: 1000
   - Max: 5000
   - Scaled square footage: (2500 - 1000) / (5000 - 1000) = 0.375
2. Bedrooms:
   - Min: 1
   - Max: 5
   - Scaled bedrooms: (3 - 1) / (5 - 1) = 0.5
3. Bathrooms:
   - Min: 1
   - Max: 3
   - Scaled bathrooms: (2 - 1) / (3 - 1) = 0.5

In Unit Vector scaling, the scaled feature vector has unit length, whereas in Min-Max scaling each feature is mapped into a specific range. Unit Vector scaling preserves the direction of the original vector, while Min-Max scaling rescales each feature independently and therefore generally changes the direction. Note that in this example the unit vector is dominated by square footage, because the raw features are on very different scales.

Unit Vector scaling is more suitable when the direction of the features is crucial and magnitude should not introduce bias. Min-Max scaling is useful when you want to constrain the values within a certain range, often for algorithms sensitive to feature scales.
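
The two calculations for the example house can be checked with a short NumPy sketch. The feature minimums and maximums are the ranges stated in the example; the variable names are illustrative.

```python
import numpy as np

# Example house: [square footage, bedrooms, bathrooms]
house = np.array([2500.0, 3.0, 2.0])

# Unit Vector scaling: divide the whole vector by its Euclidean (L2) norm.
norm = np.linalg.norm(house)
unit = house / norm

# Min-Max scaling: rescale each feature using its own min and max.
feature_min = np.array([1000.0, 1.0, 1.0])
feature_max = np.array([5000.0, 5.0, 3.0])
min_max = (house - feature_min) / (feature_max - feature_min)

print("norm:", norm)
print("unit vector:", unit, "length:", np.linalg.norm(unit))
print("min-max:", min_max)
```

The unit vector keeps the original direction, so its components remain in the same proportions as 2500 : 3 : 2, while Min-Max scaling treats each feature separately.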

### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

PCA is a popular dimensionality reduction technique that can be used to reduce the number of features in a dataset while preserving as much of the variance as possible. This can be useful for making data sets easier to visualize, or for improving the performance of machine learning algorithms.

The main idea behind PCA is to find a new set of features that are a linear combination of the original features, but that are uncorrelated with each other. These new features are called principal components, and they are ordered by their importance, with the first principal component capturing the most variance in the data.

Here is an example of how PCA can be used in dimensionality reduction. Let's say we have a dataset of images of faces, and each image is represented as a 100-dimensional vector. We can use PCA to reduce the dimensionality of the dataset to 50 dimensions, while still preserving as much of the variance as possible. This would halve the storage and computation needed per image and let us train a machine learning algorithm on the 50-dimensional representation; keeping only the first two or three components would additionally let us plot the data.

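Here is the core formula for PCA:

```
principal_components = X_centered * V_k
```

where:

* `principal_components` is the matrix of projected data (the principal component scores), with one row per observation
* `X_centered` is the dataset `X` with the mean of each feature subtracted
* `V_k` is the matrix whose columns are the `k` eigenvectors of the covariance matrix `cov(X)` with the largest eigenvalues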
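
As a hedged illustration, PCA can be sketched directly in NumPy: center the data, take the eigenvectors of the covariance matrix, sort them by eigenvalue, and project the centered data onto the top `k`. The function name `pca_project` and the random placeholder data are assumptions made for this example.

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its first k principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)            # subtract each feature's mean
    cov = np.cov(X_centered, rowvar=False)     # features are columns
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                       # top-k eigenvectors as columns
    return X_centered @ V_k, eigvals

# Placeholder data: 200 samples, 5 features, two of them strongly correlated.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]

Z, eigvals = pca_project(X, k=2)
print(Z.shape)
print(eigvals / eigvals.sum())  # fraction of variance per component
```

The projected columns are uncorrelated with each other, and the variance of each projected column equals the corresponding eigenvalue.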

PCA is a powerful dimensionality reduction technique that can be used to simplify data sets and improve the performance of machine learning algorithms. It is a versatile technique that can be used in a variety of applications.

### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

PCA and feature extraction are closely related: feature extraction is the general task, and PCA is one specific method for carrying it out.

Feature extraction is a **dimensionality reduction** approach that builds a smaller set of new features from the original ones, so that the new features retain most of the information in the original dataset. It differs from **feature selection**, which keeps a subset of the original features unchanged.

PCA is a feature extraction technique that finds new features that are linear combinations of the original features and are uncorrelated with each other. These new features are called principal components, and they are ordered by importance, with the first principal component capturing the most variance in the data.

PCA is used for feature extraction by keeping the first few principal components as the new features. This yields a smaller set of features that still captures most of the variance in the original data.

Here is an example of how PCA can be used for feature extraction. Let's say we have a dataset of images of faces, and each image is represented as a 100-dimensional vector. We can use PCA to reduce the dimensionality of the dataset to 50 dimensions, while still preserving as much of the variance as possible. This would result in a new set of 50 features, each a weighted combination of the original 100 values.

The new set of features can then be used for machine learning tasks, such as classification or clustering. An advantage of using PCA for feature extraction is that discarding the low-variance components often removes noise, which can improve the performance of machine learning algorithms.

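Here is a table that summarizes the relationship between PCA and feature extraction:

| Aspect | PCA | Feature extraction (in general) |
|---|---|---|
| Goal | Derive uncorrelated features that capture as much variance as possible | Derive a smaller set of informative features from the original ones |
| Method | Linear projection onto the top eigenvectors of the covariance matrix | Various transformations, linear or non-linear |
| Output | Principal components, ordered by explained variance | New features computed from the original ones |
| Advantage | Reduces dimensionality and discards low-variance directions, which are often noise | Reduces dimensionality and redundancy |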
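
A minimal sketch of PCA as a feature extractor, using scikit-learn. The random array stands in for the 100-dimensional face vectors and is purely a placeholder, so the retained variance it reports says nothing about real images.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for 300 face images, 100 values each.
rng = np.random.default_rng(42)
faces = rng.normal(size=(300, 100))

pca = PCA(n_components=50)
features = pca.fit_transform(faces)            # 50 extracted features per image
retained = pca.explained_variance_ratio_.sum()  # fraction of variance kept

print(features.shape)
print(pca.components_.shape)  # each row holds the weights over the 100 original values
print(round(float(retained), 3))
```

Every extracted feature is a weighted sum of all 100 original values, which is what distinguishes extraction from selecting 50 of the original columns.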

### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

In the context of a food delivery recommendation system, Min-Max scaling could be used to normalize the following features:

* **Price:** The price of a food item can vary widely, from a few dollars to over $100. Min-Max scaling maps prices into [0, 1] so that this large numeric range does not dominate the other features in distance or similarity calculations.
* **Rating:** Ratings span a narrow range, from 1 star to 5 stars. Min-Max scaling maps them into [0, 1] so that ratings carry weight comparable to features with much larger raw values.
* **Delivery time:** Delivery time can range from a few minutes to over an hour. Min-Max scaling maps it into [0, 1] so that its unit (minutes) does not determine how much it influences the model.

To use Min-Max scaling, you would first need to calculate the minimum and maximum values for each feature. For example, the minimum price in the dataset might be $1, and the maximum price might be $100. Once you have the minimum and maximum values, you can use the following formula to scale the data:

```
scaled_value = (value - min_value) / (max_value - min_value)
```

For example, if the price of a food item is $50, the scaled value would be (50 - 1) / (100 - 1) ≈ 0.495, just under the midpoint of the range.

Once you have scaled the data, you can use it to train a machine learning model to recommend food items to users. The model will be able to learn the relationships between the features, and it will be able to recommend items that are similar to items that the user has previously rated highly.
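
A hedged sketch of this workflow with scikit-learn's `MinMaxScaler` follows. All rows are hypothetical values invented for the illustration. The scaler is fitted on the training rows only, and the same minimums and maximums are then reused for a new item, which is the usual way to keep training and serving data on the same scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: [price in dollars, rating in stars, delivery time in minutes]
train = np.array([
    [1.0,   3.5, 20.0],
    [25.0,  4.8, 45.0],
    [100.0, 1.0, 75.0],
    [50.0,  5.0, 10.0],
])

scaler = MinMaxScaler()                    # default feature_range is (0, 1)
train_scaled = scaler.fit_transform(train)  # learn per-feature min and max

# A new item is scaled with the min/max learned from the training rows.
new_item = np.array([[50.0, 4.0, 30.0]])
new_scaled = scaler.transform(new_item)
print(new_scaled)
```

A new value outside the training range would fall outside [0, 1] after this transform, so the range should be re-estimated or clipped if that matters for the downstream model.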

### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

In the context of a stock price prediction model, PCA reduces the dimensionality of the dataset by replacing the original features with a smaller number of principal components. These are obtained from the eigenvalues and eigenvectors of the covariance matrix of the dataset, computed after standardizing the features because financial variables are typically measured in very different units. Each eigenvector gives the direction of a principal component, and the corresponding eigenvalue gives the amount of variance captured along that direction.

The most important principal components are those with the largest eigenvalues. Projecting the data onto them yields a new, lower-dimensional dataset that still retains most of the variance in the original dataset.

For example, let's say we have a dataset of stock prices that contains 100 features. We can use PCA to reduce the dimensionality of the dataset to 50 dimensions, while still preserving as much of the variance as possible. This would result in a new dataset that captures most of the variance of the original data, but that is easier to analyze and cheaper to model.

The new dataset can then be used to train a machine learning model to predict stock prices. With fewer, uncorrelated inputs the model is less exposed to redundancy and multicollinearity, which may improve accuracy, although this is not guaranteed and should be checked on held-out data.

Here are the steps for using PCA to reduce the dimensionality of a stock price dataset:

1. Standardize the features (zero mean, unit variance) so that no feature dominates because of its units.
2. Calculate the covariance matrix of the standardized dataset.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvalues in descending order.
5. Select the eigenvectors that correspond to the largest eigenvalues.
6. Project the standardized data onto the selected eigenvectors to construct a new, lower-dimensional dataset.
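
In practice these steps are often delegated to a library. The sketch below uses a scikit-learn pipeline that standardizes the features and then keeps as many components as are needed to explain 95% of the variance. The 95% threshold, the synthetic data (100 features driven by 10 hidden factors plus noise), and the variable names are all assumptions for illustration; real market data would behave differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: 500 rows, 100 features generated from 10 hidden factors.
rng = np.random.default_rng(7)
factors = rng.normal(size=(500, 10))
loadings = rng.normal(size=(10, 100))
X = factors @ loadings + 0.1 * rng.normal(size=(500, 100))

# A float n_components asks PCA to keep enough components to reach that
# fraction of explained variance; svd_solver="full" is set explicitly for this.
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = reducer.fit_transform(X)

pca = reducer.named_steps["pca"]
print(X_reduced.shape)
print(round(float(pca.explained_variance_ratio_.sum()), 4))
```

For time-ordered data such as stock prices, the pipeline would normally be fitted on the training period only and then applied to later periods with `reducer.transform`, so that no information from the future leaks into the model.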

### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

To perform Min-Max scaling to transform the given dataset values to a range of -1 to 1, follow these steps:

1. **Identify the Range:**
   We want to scale the values from the original range to the range of -1 to 1.

2. **Calculate Min and Max:**
   Find the minimum and maximum values in the original dataset.

   ```
   Min = 1
   Max = 20
   ```

3. **Apply Min-Max Scaling Formula:**
   Apply the Min-Max scaling formula for each value in the dataset:

   ```
   scaled_value = ((original_value - Min) / (Max - Min)) * (new_max - new_min) + new_min
   ```

   In this case, `new_min` is -1 and `new_max` is 1.

4. **Calculate Scaled Values:**

   For each value in the dataset: [1, 5, 10, 15, 20]

   - For 1:
     ```
     scaled_value = ((1 - 1) / (20 - 1)) * (1 - (-1)) + (-1) = -1
     ```

   - For 5:
     ```
     scaled_value = ((5 - 1) / (20 - 1)) * (1 - (-1)) + (-1) ≈ -0.579
     ```

   - For 10:
     ```
     scaled_value = ((10 - 1) / (20 - 1)) * (1 - (-1)) + (-1) ≈ -0.053
     ```

   - For 15:
     ```
     scaled_value = ((15 - 1) / (20 - 1)) * (1 - (-1)) + (-1) ≈ 0.474
     ```

   - For 20:
     ```
     scaled_value = ((20 - 1) / (20 - 1)) * (1 - (-1)) + (-1) = 1
     ```

So, after performing Min-Max scaling, the transformed dataset values in the range of -1 to 1 would be:

```
[-1, -0.579, -0.053, 0.474, 1]
```
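
The calculation can be verified with a few lines of plain Python. The variable names are illustrative.

```python
values = [1, 5, 10, 15, 20]
new_min, new_max = -1.0, 1.0

lo, hi = min(values), max(values)

# Scale to [0, 1] first, then stretch and shift into [new_min, new_max].
scaled = [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

print([round(s, 4) for s in scaled])
# → [-1.0, -0.5789, -0.0526, 0.4737, 1.0]
```

The values are not evenly spaced because the original values are not: the gap between 1 and 5 is 4, while the other gaps are 5.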

### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

Here are the steps for performing Feature Extraction using PCA on a dataset containing the features [height, weight, age, gender, blood pressure]:

1. Encode gender numerically and standardize all features, since they are measured in different units.
2. Calculate the covariance matrix of the standardized dataset.
3. Calculate the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvalues in descending order.
5. Choose the number of principal components to retain based on the **cumulative explained variance**.
6. Project the data onto the selected principal components to construct a new dataset that is lower-dimensional.

In this case, we want the smallest number of principal components that still captures most of the variance in the dataset, so that the new dataset remains representative of the original one.

Because the question provides no actual measurements, the following table uses illustrative values for the cumulative explained variance of each principal component:

| Principal component | Cumulative explained variance |
|---|---|
| 1 | 50% |
| 2 | 75% |
| 3 | 85% |
| 4 | 95% |
| 5 | 100% |

In this illustration, the first three principal components capture 85% of the variance in the dataset, so the new dataset remains representative of the original even though two dimensions are dropped.

Based on this, we would choose to retain the first three principal components. If a stricter threshold such as 95% were required, we would retain four.
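
Choosing the number of components from a cumulative explained variance curve can be sketched as a small helper. The function name `components_for_threshold` is hypothetical, and the input list holds illustrative values matching the table (with the final value at 1.0, since all components together explain all of the variance).

```python
import numpy as np

def components_for_threshold(cumulative, threshold):
    """Return the smallest number of components whose cumulative
    explained variance reaches the given threshold."""
    cumulative = np.asarray(cumulative, dtype=float)
    # searchsorted finds the first index with cumulative[idx] >= threshold;
    # add 1 to convert the zero-based index into a component count.
    return int(np.searchsorted(cumulative, threshold, side="left") + 1)

# Illustrative cumulative explained variance for five components.
illustrative = [0.50, 0.75, 0.85, 0.95, 1.00]

k85 = components_for_threshold(illustrative, 0.85)
k95 = components_for_threshold(illustrative, 0.95)
print(k85, k95)
```

With these values an 85% threshold keeps three components and a 95% threshold keeps four. The threshold itself is a judgment call that depends on how much information loss the downstream task can tolerate.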

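```python
# The following code implements the steps above.

import numpy as np
from sklearn.decomposition import PCA

def pca_feature_extraction(data):
    """Fit PCA and return the component directions and cumulative explained variance."""
    pca = PCA()  # keep all components so the full variance profile is visible
    pca.fit(data)
    principal_components = pca.components_  # one row per principal component
    cumulative_explained_variance = np.cumsum(pca.explained_variance_ratio_)
    return principal_components, cumulative_explained_variance

# height, weight, age, gender and blood_pressure are assumed to be defined
# earlier as equal-length numeric arrays; they are not shown in this section.
# PCA treats rows as samples and columns as features, so check the orientation
# of `data` (use `data.T` if each array holds one feature's measurements).
data = np.array([height, weight, age, gender, blood_pressure])
principal_components, cumulative_explained_variance = pca_feature_extraction(data)

print("Principal Components:")
print(principal_components)

print("Cumulative Explained Variance:")
print(cumulative_explained_variance)
```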

```
[[ 0.43061727  0.47321273  0.45259804  0.46327019  0.41373207]
 [ 0.25004738 -0.06356383  0.13107784 -0.78488214  0.5479184 ]
 [-0.12454569 -0.50983231 -0.3608226   0.40101039  0.65845054]
 [ 0.81266799 -0.48724045 -0.06307698  0.07552655 -0.30411269]
 [ 0.27586987  0.52411556 -0.80237271 -0.05317114  0.05069086]]
[0.995989   0.99985877 0.99997777 1.         1.        ]
```

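In the output above, the first principal component alone accounts for about 99.6% of the variance, and the first two for about 99.99%. This differs from the 85% figure in the table above. A first component this dominant is typical when the features are not standardized, because the feature with the largest numeric scale dominates the covariance matrix. Standardizing the features before PCA usually gives a more balanced variance profile.

Therefore, the number of components to retain should be read from the cumulative explained variance of the data at hand: three components for the values in the table above with an 85% threshold, but a single component for the unstandardized output shown here.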