Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
application.

Min-Max scaling is a data preprocessing technique used to scale numerical features to a specific range, usually between 0 and 1. The purpose of Min-Max scaling is to standardize the range of independent variables or features of the data. The formula for Min-Max scaling is given by:

$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where:

$X_{\text{scaled}}$ is the scaled value,
$X$ is the original value,
$X_{\min}$ is the minimum value of the feature,
$X_{\max}$ is the maximum value of the feature.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample dataset
data = {'Feature1': [2, 5, 8, 11, 14],
        'Feature2': [15, 22, 35, 40, 50]}

df = pd.DataFrame(data)

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(df)

# Convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)

print("Original Data:")
print(df)
print("\nScaled Data:")
print(scaled_df)


Original Data:
   Feature1  Feature2
0         2        15
1         5        22
2         8        35
3        11        40
4        14        50

Scaled Data:
   Feature1  Feature2
0      0.00  0.000000
1      0.25  0.200000
2      0.50  0.571429
3      0.75  0.714286
4      1.00  1.000000
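
As a sanity check, the scaled values can be reproduced directly from the formula. The sketch below is a minimal NumPy version for the `Feature2` column only; the variable names are chosen here for illustration and are not part of the original cell.

```python
import numpy as np

# Feature2 values from the sample dataset
feature2 = np.array([15, 22, 35, 40, 50], dtype=float)

# Apply X_scaled = (X - X_min) / (X_max - X_min)
x_min, x_max = feature2.min(), feature2.max()
feature2_scaled = (feature2 - x_min) / (x_max - x_min)

print(feature2_scaled)
```

Up to rounding, the printed values should agree with the `Feature2` column produced by `MinMaxScaler`.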


Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?
Provide an example to illustrate its application.

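The Unit Vector technique, also known as vector normalization or L2 normalization, is a feature scaling method that divides every component of a data point by the Euclidean norm (L2 norm) of that data point's feature vector, so that the resulting vector has a length of 1. It is particularly useful when the direction of the data points matters more than their magnitudes. It differs from Min-Max scaling in what it operates on: Min-Max scaling works column by column, using each feature's minimum and maximum to map values into a fixed range such as 0 to 1, whereas Unit Vector scaling works row by row, using each sample's own norm, so it does not guarantee any fixed range per feature.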

The formula for Unit Vector scaling is given by:

$$X_{\text{scaled}} = \frac{X}{\lVert X \rVert_2}$$

where:

$X_{\text{scaled}}$ is the scaled vector,
$X$ is the original vector,
$\lVert X \rVert_2$ is the L2 norm (Euclidean norm) of the vector.

In [2]:
import numpy as np
from sklearn.preprocessing import Normalizer

# Sample dataset
data = np.array([[2, 5],
                 [8, 11],
                 [14, 20]])

# Initialize the Normalizer with L2 normalization
normalizer = Normalizer(norm='l2')

# Transform the data
normalized_data = normalizer.transform(data)

print("Original Data:")
print(data)
print("\nNormalized Data:")
print(normalized_data)


Original Data:
[[ 2  5]
 [ 8 11]
 [14 20]]

Normalized Data:
[[0.37139068 0.92847669]
 [0.5881717  0.80873608]
 [0.57346234 0.81923192]]
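
To make the contrast with Min-Max scaling concrete, the following sketch applies both techniques by hand to the same array. It is only an illustration in plain NumPy: Unit Vector scaling divides each row by its own norm, while Min-Max scaling rescales each column by that column's minimum and maximum.

```python
import numpy as np

data = np.array([[2, 5],
                 [8, 11],
                 [14, 20]], dtype=float)

# Unit Vector scaling: divide each row by its own L2 norm
row_norms = np.linalg.norm(data, axis=1, keepdims=True)
unit_rows = data / row_norms

# Min-Max scaling: rescale each column using that column's min and max
col_min, col_max = data.min(axis=0), data.max(axis=0)
minmax_cols = (data - col_min) / (col_max - col_min)

print("Row lengths after Unit Vector scaling:", np.linalg.norm(unit_rows, axis=1))
print("Min-Max scaled columns:\n", minmax_cols)
```

After Unit Vector scaling every row has length 1, while after Min-Max scaling every column spans 0 to 1.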


Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
example to illustrate its application.

PCA (Principal Component Analysis) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much of the original variability as possible. The idea is to identify the principal components, which are linear combinations of the original features that capture the most significant information in the data.

Here's a step-by-step explanation of how PCA works:

Standardize the Data: Standardize the features to have zero mean and unit variance.

Compute the Covariance Matrix: Calculate the covariance matrix of the standardized data.

Compute Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. Eigenvectors represent the directions of maximum variance, and eigenvalues indicate the magnitude of variance in each direction.

Sort Eigenvalues: Sort the eigenvalues in descending order and correspondingly arrange the eigenvectors.

Select Principal Components: Choose the top $k$ eigenvectors based on the explained variance or a specified threshold. These eigenvectors become the new basis for the lower-dimensional space.

Transform the Data: Project the standardized data onto the selected principal components to obtain the reduced-dimensional representation.
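
The steps above can be sketched by hand with NumPy. This is a minimal illustration rather than how scikit-learn implements PCA internally; the toy array `X` and the choice `k = 2` are hypothetical values picked only to keep the example small.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 features (values chosen only for illustration)
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 0.8]])

# 1. Standardize each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data (features in columns)
cov = np.cov(X_std, rowvar=False)

# 3. Eigen decomposition; eigh is suited to symmetric matrices
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvalues (and matching eigenvectors) in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Keep the top k eigenvectors as the new basis
k = 2
components = eigenvectors[:, :k]

# 6. Project the standardized data onto that basis
X_reduced = X_std @ components

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", explained_variance_ratio)
print("Reduced shape:", X_reduced.shape)
```

The signs of the projected columns may differ from scikit-learn's output, since an eigenvector is only defined up to its sign.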

In [3]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the data
X_standardized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

# Print the results
print("Original Data Shape:", X.shape)
print("Reduced Data Shape:", X_pca.shape)


Original Data Shape: (150, 4)
Reduced Data Shape: (150, 2)


Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
Extraction? Provide an example to illustrate this concept.

PCA (Principal Component Analysis) can be used for feature extraction, particularly when dealing with high-dimensional data. Feature extraction involves transforming the original features into a new set of features, often of lower dimensionality, while retaining the most important information. PCA achieves this by identifying the principal components, which are linear combinations of the original features.

Here's the relationship between PCA and feature extraction:

Dimensionality Reduction: PCA is primarily used for dimensionality reduction. It identifies the directions (principal components) in the data that capture the maximum variance. These principal components can be seen as new features that are linear combinations of the original features.

Feature Extraction: In the context of PCA, the principal components themselves serve as the extracted features. Each principal component is a linear combination of the original features, and it represents a direction in the high-dimensional space.

In [4]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Standardize the data
X_standardized = (X - np.mean(X, axis=0)) / np.std(X, axis=0)

# Apply PCA for feature extraction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

# Create a DataFrame with the principal components
principal_components_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])

# Concatenate the principal components with the target variable
result_df = pd.concat([principal_components_df, pd.Series(y, name='Target')], axis=1)

# Print the result
print(result_df.head())


        PC1       PC2  Target
0 -2.264703  0.480027       0
1 -2.080961 -0.674134       0
2 -2.364229 -0.341908       0
3 -2.299384 -0.597395       0
4 -2.389842  0.646835       0
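
To see that each extracted feature really is a linear combination of the original features, one can inspect the fitted `components_` attribute. The sketch below repeats the Iris setup and rebuilds the principal component scores by hand; the names `loadings` and `X_manual` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_standardized)

# Rows are principal components, columns are the weights on the original features
loadings = pd.DataFrame(pca.components_, columns=iris.feature_names, index=['PC1', 'PC2'])
print(loadings)

# Each extracted feature is a weighted sum of the standardized original features
X_manual = X_standardized @ pca.components_.T
print("Manual projection matches fit_transform:", np.allclose(X_manual, X_pca))
```

The manual projection matches here because the standardized data already has zero mean, which is the centering that PCA applies before projecting.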


Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset
contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
preprocess the data.


Min-Max scaling is a technique used to normalize the range of independent variables or features of a dataset. It scales the values of the features to a specific range, usually between 0 and 1. This scaling is particularly useful when features have different units or scales, ensuring that each feature contributes equally to the model. Here's how you could use Min-Max scaling to preprocess the data for a recommendation system in the context of a food delivery service:

In [5]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Assume you have a DataFrame 'food_data' with columns like 'price', 'rating', 'delivery_time', etc.
# The rows below are hypothetical placeholders so that the cell runs; replace them with the real data.
food_data = pd.DataFrame({
    'restaurant': ['A', 'B', 'C', 'D'],
    'price': [8.5, 12.0, 20.0, 15.5],
    'rating': [4.2, 3.8, 4.9, 4.5],
    'delivery_time': [25, 40, 55, 30],
})

# Extract the relevant features
features = ['price', 'rating', 'delivery_time']

# Create a new DataFrame with only the selected features
selected_features_df = food_data[features]

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the selected features using Min-Max scaling
scaled_features = scaler.fit_transform(selected_features_df)

# Create a new DataFrame with the scaled features, keeping the original index so rows stay aligned
scaled_features_df = pd.DataFrame(data=scaled_features, columns=features, index=food_data.index)

# Concatenate the scaled features with the remaining columns in the original DataFrame
preprocessed_data = pd.concat([scaled_features_df, food_data.drop(features, axis=1)], axis=1)

# Now 'preprocessed_data' contains the original data with the selected features scaled using Min-Max scaling
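
The claim that scaling lets each feature contribute equally can be illustrated with a distance calculation, which is a common building block of similarity-based recommenders. The sketch below is an assumption-laden illustration: the two restaurants and the bounding row are made-up values, and Euclidean distance is used only as an example similarity measure.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical restaurants: [price, rating, delivery_time in minutes]
restaurants = np.array([[10.0, 4.8, 30.0],
                        [14.0, 2.1, 40.0],
                        [25.0, 4.7, 55.0]])

scaled = MinMaxScaler().fit_transform(restaurants)

def rating_share(a, b):
    """Fraction of the squared Euclidean distance that comes from the rating column."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[1] / squared_diff.sum())

# Compare the first two restaurants, which differ strongly in rating
share_raw = rating_share(restaurants[0], restaurants[1])
share_scaled = rating_share(scaled[0], scaled[1])

print("Rating share of squared distance, raw data:   ", round(share_raw, 3))
print("Rating share of squared distance, scaled data:", round(share_scaled, 3))
```

On the raw data the large rating gap is mostly drowned out by price and delivery time, simply because those columns use larger numbers; after Min-Max scaling the rating difference carries far more weight in the distance.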

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many
features, such as company financial data and market trends. Explain how you would use PCA to reduce the
dimensionality of the dataset.

Principal Component Analysis (PCA) is a dimensionality reduction technique that is commonly used to reduce the number of features in a dataset while retaining most of its original variability. Here's how you could use PCA to reduce the dimensionality of a dataset for predicting stock prices:

Data Preprocessing:

Standardize the features: It's essential to standardize or normalize the features to ensure that each feature contributes equally to the PCA process. This step is crucial because PCA is sensitive to the scale of the features.

Applying PCA:

Use PCA to transform the standardized features into principal components. These components are linear combinations of the original features and are orthogonal to each other.
Choose the number of principal components you want to retain. This decision is often based on the explained variance. You might aim to retain a certain percentage of the total variance (e.g., 95%).
Fit the PCA model on the standardized features and transform the data to obtain the principal components.

In [6]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Assume you have a DataFrame 'stock_data' with various financial and market trend features.
# The random placeholder below is hypothetical and only makes the cell runnable; replace it with the real data.
rng = np.random.default_rng(0)
stock_data = pd.DataFrame(rng.normal(size=(100, 5)),
                          columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
stock_data['ticker'] = 'XYZ'

# Extract relevant features (placeholder names; list your own feature columns here)
features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']

# Create a new DataFrame with only the selected features
selected_features_df = stock_data[features]

# Standardize the features
scaler = StandardScaler()
standardized_features = scaler.fit_transform(selected_features_df)

# Choose the number of principal components to retain (e.g., n_components=3)
n_components = 3
pca = PCA(n_components=n_components)

# Fit PCA and transform the standardized features
principal_components = pca.fit_transform(standardized_features)

# Create a DataFrame with the principal components, keeping the original index so rows stay aligned
pc_df = pd.DataFrame(data=principal_components,
                     columns=[f'PC{i}' for i in range(1, n_components + 1)],
                     index=stock_data.index)

# Concatenate the principal components with the remaining columns in the original DataFrame
preprocessed_data = pd.concat([pc_df, stock_data.drop(features, axis=1)], axis=1)
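
Instead of fixing the number of components up front, the retained-variance target mentioned above (for example 95%) can be passed to scikit-learn directly: a float between 0 and 1 for `n_components` asks `PCA` to keep just enough components to reach that share of variance. The sketch below assumes a synthetic stand-in for the stock features, generated from three hidden factors, since no real data is given.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical stand-in for stock features: 10 correlated columns driven by 3 hidden factors
rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
features = factors @ mixing + 0.05 * rng.normal(size=(200, 10))

standardized = StandardScaler().fit_transform(features)

# A float in (0, 1) asks PCA to keep enough components to reach that share of variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(standardized)

print("Components kept:", pca.n_components_)
print("Variance explained:", pca.explained_variance_ratio_.sum())
```

With real financial data the number of components kept will depend on how strongly the features are correlated, so it is worth inspecting `explained_variance_ratio_` rather than relying on a fixed rule.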

Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
values to a range of -1 to 1.

To perform Min-Max scaling on a dataset and transform the values to a range of -1 to 1, you can use the following formula:

$$\text{Scaled Value} = \frac{\text{Original Value} - \text{Min Value}}{\text{Max Value} - \text{Min Value}} \times (\text{New Max} - \text{New Min}) + \text{New Min}$$

Let's apply this formula to the given dataset: [1, 5, 10, 15, 20], and scale the values to the range of -1 to 1.

In [7]:
import numpy as np

# Given dataset
original_values = np.array([1, 5, 10, 15, 20])

# Define the new range
new_min, new_max = -1, 1

# Calculate Min-Max scaling
min_value = np.min(original_values)
max_value = np.max(original_values)

scaled_values = (original_values - min_value) / (max_value - min_value) * (new_max - new_min) + new_min

print("Original Values:", original_values)
print("Scaled Values (Min-Max Scaling to -1 to 1):", scaled_values)


Original Values: [ 1  5 10 15 20]
Scaled Values (Min-Max Scaling to -1 to 1): [-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
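
The same result can be cross-checked with scikit-learn, whose `MinMaxScaler` accepts the target range through its `feature_range` parameter. This is a small sketch for comparison; the reshape is needed because the scaler expects a 2-D array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

original_values = np.array([1, 5, 10, 15, 20], dtype=float)

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled_column = scaler.fit_transform(original_values.reshape(-1, 1))

print(scaled_column.ravel())
```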


Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
Feature Extraction using PCA. How many principal components would you choose to retain, and why?

In Feature Extraction using Principal Component Analysis (PCA), the number of principal components to retain is a subjective decision and often involves a trade-off between retaining enough information and reducing the dimensionality of the dataset. The decision is typically based on the explained variance ratio.

Here's a general approach to perform PCA and decide on the number of principal components to retain:

Standardize the Data:
It's important to standardize or normalize the features since PCA is sensitive to the scale of the data.

Compute Covariance Matrix:
Calculate the covariance matrix of the standardized data.

Compute Eigenvectors and Eigenvalues:
Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues indicate the amount of variance captured by each component.

Sort Eigenvalues:
Sort the eigenvalues in descending order.

Choose the Number of Principal Components:
Decide on the number of principal components to retain based on the explained variance ratio. A common approach is to choose a number that explains a significant portion (e.g., 95% or 99%) of the total variance.

Project Data onto Principal Components:
Project the standardized data onto the selected principal components.

In [9]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sample dataset
data = [[170, 65, 30, 1, 120],
        [160, 55, 25, 0, 110],
        [180, 75, 35, 1, 130]]
        # ... (more rows)

# Standardize the data
scaled_data = StandardScaler().fit_transform(data)

# Apply PCA
pca = PCA()
pca.fit(scaled_data)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Choose the number of principal components
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
num_components_to_retain = np.argmax(cumulative_variance_ratio >= 0.95) + 1

# Apply PCA with the selected number of components
pca = PCA(n_components=num_components_to_retain)
principal_components = pca.fit_transform(scaled_data)

print("Explained Variance Ratio:", explained_variance_ratio)
print("Cumulative Variance Ratio:", cumulative_variance_ratio)
print("Number of Components to Retain:", num_components_to_retain)
print("Principal Components:\n", principal_components)


Explained Variance Ratio: [0.95825757 0.04174243 0.        ]
Cumulative Variance Ratio: [0.95825757 1.         1.        ]
Number of Components to Retain: 1
Principal Components:
 [[-0.29383087]
 [ 2.81565659]
 [-2.52182571]]
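
One caveat when reading this output: with only three sample rows, centering leaves at most two directions with non-zero variance, which is why the third explained variance ratio is zero. The sketch below checks that bound on the same three rows; it suggests the "retain 1 component" result reflects the tiny sample rather than the real dataset, so the choice should be repeated on the full data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# The three sample rows used in the cell above
data = np.array([[170, 65, 30, 1, 120],
                 [160, 55, 25, 0, 110],
                 [180, 75, 35, 1, 130]], dtype=float)

scaled_data = StandardScaler().fit_transform(data)
n_samples, n_features = scaled_data.shape

pca = PCA().fit(scaled_data)

# Centering removes one degree of freedom, so at most
# min(n_samples - 1, n_features) components can carry variance
max_informative = min(n_samples - 1, n_features)
informative = int(np.sum(pca.explained_variance_ratio_ > 1e-12))

print("Upper bound on informative components:", max_informative)
print("Components with non-zero variance:", informative)
```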
