**Q1.** What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its
application.

Min-Max scaling is a data normalization technique used in data preprocessing to scale numerical features within a specific range, typically between 0 and 1. It linearly transforms the original data into a range by subtracting the minimum value from each data point and then dividing by the range of the data.
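
As a minimal sketch of that formula, the transformation can be written by hand in NumPy. The variable names here (`x`, `x_scaled`) are illustrative choices, and the values are the same sample data used in the scikit-learn cell that follows.

```python
import numpy as np

# Same sample values as the scikit-learn example in this answer
x = np.array([1000, 1500, 1200, 1800, 900], dtype=float)

# Min-Max formula: x_scaled = (x - min) / (max - min)
x_min, x_max = x.min(), x.max()
x_scaled = (x - x_min) / (x_max - x_min)

print(x_scaled)
```

The smallest value (900) maps to 0, the largest (1800) maps to 1, and everything else falls proportionally in between.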

In [2]:
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = [[1000], [1500], [1200], [1800], [900]]

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print("Original data:")
print(data)
print("\nScaled data:")
print(scaled_data)


Original data:
[[1000], [1500], [1200], [1800], [900]]

Scaled data:
[[0.11111111]
 [0.66666667]
 [0.33333333]
 [1.        ]
 [0.        ]]


**Q2.** What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling?
Provide an example to illustrate its application.

The Unit Vector technique, also known as vector normalization or normalization to unit norm, is a feature scaling method that scales each sample or data point to have a unit norm (length or magnitude) in the feature space. It involves dividing each data point by its magnitude, effectively scaling the vector to have a length of 1 while preserving its direction.

Unit vector scaling differs from Min-Max scaling primarily in how it handles the scaling process. While Min-Max scaling adjusts the values within a specific range (e.g., [0, 1]), unit vector scaling focuses on normalizing the vector so that its magnitude becomes 1 without changing the direction of the vector.

In [13]:
import numpy as np

# Sample data
data = np.array([[3, 4], [1, -2], [5, 0]])

# Calculate L2 norm for each row (axis=1)
norms = np.linalg.norm(data, axis=1, ord=2)

# Unit vector scaling
unit_scaled_data = data / norms[:,None]

print("Original data:")
print(data)
print("\nUnit-scaled data:")
print(unit_scaled_data)


Original data:
[[ 3  4]
 [ 1 -2]
 [ 5  0]]

Unit-scaled data:
[[ 0.6         0.8       ]
 [ 0.4472136  -0.89442719]
 [ 1.          0.        ]]
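
To make the contrast with Min-Max scaling concrete, here is a small sketch that applies both techniques to the same array. It uses scikit-learn's `normalize` function in place of the manual division above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

data = np.array([[3, 4], [1, -2], [5, 0]], dtype=float)

# Unit vector scaling works per row (sample): every row ends up with L2 norm 1
unit_scaled = normalize(data, norm="l2", axis=1)
row_norms = np.linalg.norm(unit_scaled, axis=1)

# Min-Max scaling works per column (feature): every column ends up spanning [0, 1]
minmax_scaled = MinMaxScaler().fit_transform(data)

print("Row norms after unit scaling:", row_norms)
print("Column minima after Min-Max:", minmax_scaled.min(axis=0))
print("Column maxima after Min-Max:", minmax_scaled.max(axis=0))
```

Unit vector scaling depends only on the values within one sample, while Min-Max scaling depends on the minimum and maximum of each feature across all samples.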


**Q3.** What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an
example to illustrate its application.

Principal Component Analysis (PCA) is a widely used technique in machine learning and statistics for dimensionality reduction. Its main goal is to transform high-dimensional data into a lower-dimensional space while preserving the most important information or variance present in the original data.

PCA works by identifying the principal components, which are new uncorrelated variables that are linear combinations of the original variables. These components are ordered by the amount of variance they explain in the data, with the first component capturing the most variance and subsequent components capturing progressively less variance.



Here's a step-by-step explanation of PCA:

Mean Centering: The mean is subtracted from each feature to ensure that the data is centered around the origin.

Covariance Matrix Calculation: The covariance matrix of the mean-centered data is computed.

Eigenvalue Decomposition: The eigenvectors and eigenvalues of the covariance matrix are calculated. The eigenvectors represent the directions (principal components), and the eigenvalues indicate the amount of variance explained by each principal component.

Selection of Principal Components: The eigenvectors corresponding to the largest eigenvalues (which explain the most variance) are chosen as the principal components.

Dimensionality Reduction: The original data is projected onto the new space formed by the selected principal components, effectively reducing the dimensionality while retaining most of the variance.
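
The five steps above can be sketched directly in NumPy. This is an illustrative implementation under simple assumptions (a small dense array, `k = 2` components kept), not how scikit-learn computes PCA internally. The sign of each component is arbitrary, so the projected values may be flipped relative to the scikit-learn output.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype=float)

# 1. Mean centering: subtract each feature's mean
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (rows are samples, columns are features)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by descending eigenvalue and keep the top k eigenvectors
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]

# 5. Project the centered data onto the selected components
X_reduced = X_centered @ components

explained_ratio = eigenvalues[order] / eigenvalues.sum()
print(X_reduced)
print(explained_ratio)
```

Because the three columns of this sample differ only by a constant offset, all of the variance lies along a single direction and the first component explains essentially 100% of it.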

In [19]:
from sklearn.decomposition import PCA
import numpy as np

data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

# Initialize PCA and specify the number of components
pca = PCA(n_components=2)

# Fit PCA on the data and transform it
transformed_data = pca.fit_transform(data)

print("Transformed data (after PCA):")
print(transformed_data)
print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)


Transformed data (after PCA):
[[-7.79422863  0.        ]
 [-2.59807621  0.        ]
 [ 2.59807621  0.        ]
 [ 7.79422863 -0.        ]]

Explained variance ratio:
[1. 0.]


**Q4.** What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature
Extraction? Provide an example to illustrate this concept.

PCA is itself a feature extraction technique. Rather than selecting a subset of the original features, it constructs a smaller set of new features (principal components) that capture most of the information, measured as variance, present in the original features.

**Relationship between PCA and Feature Extraction:**

Dimensionality Reduction: PCA is a form of feature extraction that transforms high-dimensional data into a lower-dimensional space by creating new features (principal components) that are linear combinations of the original features.

Information Retention: It captures the variance within the data while compressing it into a reduced set of features, thereby extracting the most important information.

**Using PCA for Feature Extraction:**

Identifying Important Directions: PCA identifies the directions (principal components) that explain the most variance in the data. Each component is a weighted combination of all the original features, not a single original feature.

Discarding Less Important Components: Components that contribute minimally to the overall variance can be dropped, which compresses the data with little loss of information.
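
As a small sketch of what "extraction" means here, the fitted `components_` attribute of scikit-learn's `PCA` holds the weight that each original feature contributes to each new feature. The example below assumes a single retained component and reuses the 5-feature sample from this answer; `manual` is an illustrative name.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9],
], dtype=float)

pca = PCA(n_components=1)
extracted = pca.fit_transform(X)

# Each row of components_ holds the weights of the original features
# in one principal component
weights = pca.components_[0]

# The extracted feature is the centered data multiplied by those weights
manual = (X - pca.mean_) @ pca.components_.T

print("Weights of original features:", weights)
print("Extracted feature:", extracted.ravel())
```

Here every original feature receives the same weight in magnitude, because all five columns move together in this sample.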

In [15]:
from sklearn.decomposition import PCA
import numpy as np

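# Sample dataset (5 features). Every column increases by 1 per row, so the
# features are perfectly correlated and one component captures all the variance.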
data = np.array([
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
    [4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9]
])

# Initialize PCA for feature extraction
pca = PCA(n_components=3)  # Selecting 3 principal components

# Fit PCA on the data and transform it for feature extraction
extracted_features = pca.fit_transform(data)

# Print the extracted features
print("Extracted Features:")
print(extracted_features)
print("\nExplained variance ratio:")
print(pca.explained_variance_ratio_)


Extracted Features:
[[ 4.47213595  0.          0.        ]
 [ 2.23606798  0.          0.        ]
 [-0.          0.          0.        ]
 [-2.23606798  0.          0.        ]
 [-4.47213595  0.         -0.        ]]

Explained variance ratio:
[1. 0. 0.]


**Q5.** You are working on a project to build a recommendation system for a food delivery service. The dataset
contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to
preprocess the data.

**Objective:**

Normalize numerical features like 'price,' 'rating,' and 'delivery time' to a common scale (e.g., [0, 1]) for fair comparison and modeling.

**Steps for Min-Max Scaling:**

**Identify Numerical Features:**

Select relevant numerical features from the dataset ('price,' 'rating,' 'delivery time').

**Import the Necessary Libraries:**

Use libraries like scikit-learn in Python that provide Min-Max scaling functionality.

**Initialize Min-Max Scaler:**

Create an instance of the MinMaxScaler from the preprocessing module.

**Fit and Transform:**

Apply Min-Max scaling to each numerical feature separately.

Calculate the minimum and maximum values for each feature.

Scale the values within the range [0, 1] using the Min-Max scaling formula.

In [20]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

data = pd.DataFrame({
    'price': [1000, 1500, 1200, 1800, 900],
    'rating': [4, 3, 5, 4, 2],
    'delivery_time': [30, 45, 35, 50, 25]
})

numerical_columns = ['price', 'rating', 'delivery_time']

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit and transform the selected columns
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])


In [21]:
data

      price    rating  delivery_time
0  0.111111  0.666667            0.2
1  0.666667  0.333333            0.8
2  0.333333  1.0                 0.4
3  1.0       0.666667            1.0
4  0.0       0.0                 0.0


**Outcome:**

The numerical features 'price,' 'rating,' and 'delivery time' are transformed to a common scale between 0 and 1.

**Integration in Recommendation System:**

Use the scaled features as inputs to the recommendation model.

Algorithms such as collaborative filtering or content-based filtering can then compare items without one feature dominating simply because of its units.

**Refinement and Evaluation:**

Continuously assess the performance of the recommendation system.

Adjust the preprocessing if performance metrics or user feedback indicate that recommendation accuracy can be improved.
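
To illustrate why the common scale matters for a similarity-based recommender, here is a small sketch that finds the item most similar to the first one, before and after scaling. The `nearest_neighbor` helper is a hypothetical function written for this example, and Euclidean distance is an assumed similarity measure.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: price, rating, delivery_time (same sample values as above)
features = np.array([
    [1000, 4, 30],
    [1500, 3, 45],
    [1200, 5, 35],
    [1800, 4, 50],
    [900, 2, 25],
], dtype=float)

def nearest_neighbor(matrix, query_index):
    """Return the index of the row closest (Euclidean) to the query row."""
    distances = np.linalg.norm(matrix - matrix[query_index], axis=1)
    distances[query_index] = np.inf  # exclude the query item itself
    return int(np.argmin(distances))

raw_nearest = nearest_neighbor(features, 0)
scaled_nearest = nearest_neighbor(MinMaxScaler().fit_transform(features), 0)

print("Nearest to item 0 on raw features:", raw_nearest)
print("Nearest to item 0 on scaled features:", scaled_nearest)
```

On the raw values the distance is driven almost entirely by price, so item 4 (closest in price, but with a much lower rating) is picked. After scaling, rating and delivery time contribute comparably and item 2 becomes the closest match.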

**Q6.** You are working on a project to build a model to predict stock prices. The dataset contains many
features, such as company financial data and market trends. Explain how you would use PCA to reduce the
dimensionality of the dataset.

**Dataset Overview:**

The dataset includes numerous features encompassing company-specific financial data and market trends relevant to stock price movements.

**Preprocessing Steps:**

**Data Cleaning:** Handle missing values, outliers, and ensure data consistency.

**Normalization/Scaling:** Normalize or scale the features to bring them to a common scale, especially if they have different units or ranges.

**Approach for Dimensionality Reduction:**

**Feature Selection Exploration:**

Consider traditional feature selection techniques (like correlation analysis or feature importance) to identify crucial features directly related to stock price prediction.

**PCA Implementation:**

Normalization: Scale the dataset to ensure comparable feature magnitudes.

PCA Application: Apply PCA to the scaled dataset.

Determine the number of principal components based on explained variance or a predefined threshold.

Fit PCA to the data and transform it to a lower-dimensional space.
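
One way to pick the number of components from an explained-variance threshold is to pass a fraction to `n_components`, which scikit-learn's `PCA` accepts with the full SVD solver. The sketch below assumes a 95% threshold and uses hypothetical synthetic data (10 features driven by 3 hidden factors) in place of real financial features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 10 correlated features built from
# 3 latent factors plus a little noise
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# Scale first so that no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep just enough components to reach
# that fraction of explained variance
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print("Components kept:", pca.n_components_)
print("Variance retained:", pca.explained_variance_ratio_.sum())
```

The threshold itself is a judgment call; 0.95 is only an assumed starting point to be revisited during model evaluation.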

**Model Development:**

Utilize the transformed dataset post-PCA for training predictive models.

Employ regression models (linear regression, SVR, etc.) or more advanced techniques tailored for stock price prediction (time series models, neural networks).

**Evaluation and Refinement:**

Model Evaluation: Assess model performance with relevant metrics (RMSE, MAE, etc.) and validation techniques (cross-validation, train-test split).

Iterative Process: Iterate on feature selection strategies or PCA components if the model performance requires improvement.

**Objective of PCA:**

PCA aims to reduce dimensionality while preserving as much information/variance as possible, aiding in computational efficiency and potentially improving model accuracy in stock price prediction.

In [22]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
data = pd.DataFrame(np.random.randn(100, 10), columns=[f'feature_{i}' for i in range(1, 11)])

# Normalize or scale the data (StandardScaler in this case)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Apply PCA for dimensionality reduction
pca = PCA(n_components=5)  # Specify the number of components to retain
transformed_data = pca.fit_transform(scaled_data)

# Check the explained variance ratio to understand the importance of components
print("Explained Variance Ratio:", pca.explained_variance_ratio_)



# Placeholder target: random values stand in for stock prices here,
# so the test score is expected to be close to zero
target = np.random.randn(100)

# Split the transformed data and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(transformed_data, target, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
print("Model Score on Test Data:", model.score(X_test, y_test))


Explained Variance Ratio: [0.15598434 0.13026324 0.12116218 0.1121681  0.10047779]
Model Score on Test Data: 0.0031885851065562854
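
The cell above fits the scaler and PCA on the full dataset before splitting it. A variation worth considering is to split first and wrap the steps in a scikit-learn pipeline, so that scaling and PCA are learned from the training rows only. The sketch below assumes the same kind of synthetic placeholder data; it is not a realistic stock-price setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(100, 10)   # synthetic placeholder features
y = np.random.randn(100)       # synthetic placeholder target

# Split before any fitting so the test rows stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling, PCA, and regression are all fitted on the training split only
model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
model.fit(X_train, y_train)

test_score = model.score(X_test, y_test)
print("Model Score on Test Data:", test_score)
```

Since the target is random noise, the score carries no meaning here; the point is only the order of operations.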


**Q7.** For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the
values to a range of -1 to 1.

In [24]:
import numpy as np

# Given dataset
data = np.array([1, 5, 10, 15, 20])

min_val = -1  
max_val = 1  

# Calculate original min and max values
data_min = np.min(data)
data_max = np.max(data)

# Min-Max scaling formula
scaled_data = min_val + ((data - data_min) * (max_val - min_val)) / (data_max - data_min)


print("Original data:", data)
print("Scaled data (range -1 to 1):", scaled_data)


Original data: [ 1  5 10 15 20]
Scaled data (range -1 to 1): [-1.         -0.57894737 -0.05263158  0.47368421  1.        ]
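
The formula used in the cell above can be written out as follows, where $a = -1$ and $b = 1$ are the target bounds and $x_{\min} = 1$, $x_{\max} = 20$ come from the dataset. The symbols $a$ and $b$ are notation introduced here for the code's `min_val` and `max_val`.

```latex
x' = a + \frac{(x - x_{\min})\,(b - a)}{x_{\max} - x_{\min}}

% Worked example for x = 5:
x' = -1 + \frac{(5 - 1)\,(1 - (-1))}{20 - 1}
   = -1 + \frac{8}{19}
   \approx -0.5789
```

This matches the second value in the printed output.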


In [26]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([1, 5, 10, 15, 20]).reshape(-1, 1)  # Reshape to 2D array for MinMaxScaler

# Initialize MinMaxScaler for range (-1, 1)
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))

# Fit and transform the data to the desired range
scaled_data = min_max_scaler.fit_transform(data)

# Reshape the scaled data to a 1D array for better visibility
scaled_data = scaled_data.flatten()

print("Original data:", data.flatten())
print("Scaled data (range -1 to 1):", scaled_data)


Original data: [ 1  5 10 15 20]
Scaled data (range -1 to 1): [-1.         -0.57894737 -0.05263158  0.47368421  1.        ]


**Q8.** For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform
Feature Extraction using PCA. How many principal components would you choose to retain, and why?

**Variance Retention:**

Aim to retain a high cumulative variance (e.g., 90% or more) to preserve most of the dataset's information.

**Explained Variance Ratio:**

Analyze the explained variance ratio provided by PCA.

Retain components that collectively explain a significant portion of the variance.

**Scree Plot Analysis:**

Plot the explained variance against the number of components.

Select the number of components before the variance explained plateaus.

**Business/Application Context:**

Consider the specific needs of the problem domain.

Choose a number of components that balances complexity and predictive power.

**Computational Efficiency:**

Consider computational constraints when selecting the number of components.

Fewer components reduce computational overhead.

**Interpretability vs. Information Loss:**

Balance interpretability and information loss.

Fewer components might be more interpretable but may sacrifice some nuanced information.

**Trade-off between Dimensionality and Noise:**

Retaining too few components discards useful signal, while retaining too many keeps low-variance components that are often dominated by noise. Aim for a balance between the two.

In [30]:
from sklearn.decomposition import PCA
import numpy as np

# Features: [height, weight, age, gender, blood pressure]

# Random data stands in for the real measurements (100 samples, 5 features).
# The columns are uncorrelated, so the variance is spread across all components.
np.random.seed(42)
data = np.random.randn(100, 5)

# Initialize PCA
pca = PCA()

# Fit PCA on the data
pca.fit(data)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

# Calculate cumulative explained variance
cumulative_variance = np.cumsum(explained_variance_ratio)

# Determine the number of components to retain (e.g., 90% variance)
num_components = np.argmax(cumulative_variance >= 0.9) + 1

# Print explained variance ratio and number of components to retain
print("Explained Variance Ratio:", explained_variance_ratio)
print("Cumulative Explained Variance Ratio:", cumulative_variance)
print(f"Number of components to retain for 90% variance: {num_components}")


Explained Variance Ratio: [0.26256655 0.21579693 0.20219416 0.18127308 0.13816928]
Cumulative Explained Variance Ratio: [0.26256655 0.47836348 0.68055764 0.86183072 1.        ]
Number of components to retain for 90% variance: 5
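
With real measurements the features would be correlated and on different units, so the result would likely differ from the random-data run above. The sketch below uses hypothetical synthetic values (all distributions and coefficients are invented for illustration), encodes gender as 0/1, and standardizes before PCA so that no feature dominates because of its scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Hypothetical synthetic values, invented only for illustration
height = rng.normal(170, 10, n)                          # cm
weight = 0.9 * (height - 100) + rng.normal(0, 8, n)      # kg, tied to height
age = rng.integers(20, 70, n).astype(float)              # years
gender = rng.integers(0, 2, n).astype(float)             # encoded as 0/1
blood_pressure = 100 + 0.5 * age + rng.normal(0, 8, n)   # mmHg, tied to age

X = np.column_stack([height, weight, age, gender, blood_pressure])

# Standardize so each feature has mean 0 and unit variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
num_components = int(np.argmax(cumulative >= 0.9) + 1)

print("Cumulative explained variance:", cumulative)
print("Components to retain for 90% variance:", num_components)
```

The number to retain is whatever the cumulative variance supports on the actual data; the 90% threshold is the same assumed cut-off used in the cell above.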
