Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

### Q1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

Min-Max scaling is a data preprocessing technique that linearly rescales each numerical feature to a fixed range, usually 0 to 1 or -1 to 1. Putting features on a common scale keeps those with large raw values (such as income) from dominating those with small raw values (such as age), which matters most for distance-based and gradient-based learning algorithms.

Min-Max formula for the range (0, 1):

$$x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

Min-Max formula for the range (-1, 1):

$$x'_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \cdot 2 - 1$$

In [46]:
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler

d = {'age': [23,54,34,46,61,31],
     'income' : [20000, 300000, 100000, 240000, 150000, 500000]}

df = pd.DataFrame(d)

min_max = MinMaxScaler(feature_range=(0,1))

scaled_df = min_max.fit_transform(df)

pd.DataFrame(scaled_df, columns=df.columns)

| | age | income |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.815789 | 0.583333 |
| 2 | 0.289474 | 0.166667 |
| 3 | 0.605263 | 0.458333 |
| 4 | 1.0 | 0.270833 |
| 5 | 0.210526 | 1.0 |
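
As a cross-check on the scaler output above, the (0, 1) formula can be applied directly with NumPy. This is only a minimal sketch: it reuses the age and income values from the cell above and assumes no column is constant, since a constant column would make the denominator zero.

```python
import numpy as np

# Same age/income values as in the cell above
X = np.array([
    [23, 20000],
    [54, 300000],
    [34, 100000],
    [46, 240000],
    [61, 150000],
    [31, 500000],
], dtype=float)

# Column-wise minimum and maximum
x_min = X.min(axis=0)
x_max = X.max(axis=0)

# (x - x_min) / (x_max - x_min), applied to every column at once
X_scaled = (X - x_min) / (x_max - x_min)

print(np.round(X_scaled, 6))
```

Each column is scaled with its own minimum and maximum, which is why age and income both end up spanning the full 0 to 1 range.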


### Q2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

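Unit Vector scaling is a data preprocessing technique that rescales each sample so that its feature vector has a length (norm) of 1. Each value in a row is divided by the Euclidean (L2) norm of that row, which is the square root of the sum of the squares of the row's values.

The main difference from Min-Max scaling is the axis along which the scaling works. Min-Max scaling operates on each feature (column) independently and maps it to a fixed range such as 0 to 1, so it preserves the relative spacing of the values within a feature. Unit Vector scaling operates on each sample (row) independently: it preserves the direction of the sample vector and discards its magnitude. In the example below, income is several orders of magnitude larger than age, so after normalization the income component of every row is close to 1 and the age component is close to 0.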



In [56]:
import pandas as pd
from sklearn.preprocessing import Normalizer

d = {'age': [23,54,34,46,61,31],
     'income' : [20000, 300000, 100000, 240000, 150000, 500000]}

df = pd.DataFrame(d)

unit_vector = Normalizer()

scaled_uni_vector = unit_vector.fit_transform(df)

scaled_df_1 = pd.DataFrame(scaled_uni_vector, columns=df.columns)

scaled_df_1

| | age | income |
|---|---|---|
| 0 | 0.00115 | 0.999999 |
| 1 | 0.00018 | 1.0 |
| 2 | 0.00034 | 1.0 |
| 3 | 0.000192 | 1.0 |
| 4 | 0.000407 | 1.0 |
| 5 | 6.2e-05 | 1.0 |
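
The row-wise versus column-wise distinction can be made explicit with plain NumPy. The sketch below takes the first three rows of the data above and applies both techniques by hand. It is an illustration of the arithmetic, not a replacement for `Normalizer` or `MinMaxScaler`.

```python
import numpy as np

X = np.array([[23, 20000], [54, 300000], [34, 100000]], dtype=float)

# Unit vector scaling: divide each row (sample) by its own Euclidean norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

# Min-Max scaling: rescale each column (feature) to the range 0..1
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Row lengths after unit vector scaling
print(np.linalg.norm(X_unit, axis=1))
# Column minima and maxima after Min-Max scaling
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Because the norm of each row is dominated by income, unit vector scaling on raw data like this mostly erases the age information. Scaling the columns first is one way to avoid that.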


### Q3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

PCA (Principal Component Analysis) is a widely used technique for dimensionality reduction, which involves transforming a large set of variables into a smaller set of uncorrelated variables called principal components.

PCA works by identifying the directions in the feature space where the variance of the data is maximum and then projecting the data onto those directions. The resulting principal components are ordered in terms of their explained variance, where the first principal component explains the most variance in the data, and each subsequent component explains progressively less.

In [1]:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# load the iris dataset
iris = load_iris()

# create an instance of the PCA class with 2 components
pca = PCA(n_components=2)

# fit and transform the data
iris_pca = pca.fit_transform(iris.data)

print(iris_pca[:5])

[[-2.68412563  0.31939725]
 [-2.71414169 -0.17700123]
 [-2.88899057 -0.14494943]
 [-2.74534286 -0.31829898]
 [-2.72871654  0.32675451]]
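
To make the "directions of maximum variance" idea concrete, here is a minimal from-scratch sketch using an eigen-decomposition of the covariance matrix. The data is synthetic and purely hypothetical (it is not the iris data), and the sign of each component is arbitrary, so projected values may differ in sign from what a library returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, two of them strongly correlated
a = rng.normal(size=200)
X = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

# 1. Center each feature
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centered, rowvar=False)

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order,
#    so reorder them from largest to smallest
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the centered data onto the two directions of largest variance
X_pca = X_centered @ eigvecs[:, :2]

# Share of the total variance carried by each component
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
```

The eigenvalues play the role of the explained variance: sorting them in descending order is what puts the component with the most variance first.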


### Q4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

PCA (Principal Component Analysis) is a feature extraction technique. Feature extraction reduces the dimensionality of the data by constructing a smaller set of new features from the original ones. This is in contrast to feature selection, which keeps a subset of the original features unchanged. In PCA the new features are the principal components: uncorrelated linear combinations of the original variables.

PCA can be used for feature extraction by identifying the directions in the feature space where the variance of the data is maximum and then projecting the data onto those directions. The resulting principal components are ordered in terms of their explained variance, where the first principal component explains the most variance in the data, and each subsequent component explains progressively less. By selecting a subset of the principal components, we can effectively reduce the dimensionality of the data while retaining most of the important information.

In [4]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

pca = PCA(n_components=10)

digit_pca = pca.fit_transform(digits.data)

digit_pca, digit_pca.shape

(array([[ -1.25944953,  21.27481525,  -9.46294903, ...,   2.56424314,
          -0.58727595,   3.60167822],
        [  7.95759679, -20.76882077,   4.43939022, ...,  -4.61145426,
           3.53649103,  -1.02412114],
        [  6.99187905,  -9.95584018,   2.95870928, ..., -16.41198281,
           0.75892723,   4.28081477],
        ...,
        [ 10.80126114,  -6.96000443,   5.59951508, ...,  -7.43308853,
          -3.89862392, -13.0739433 ],
        [ -4.8721472 ,  12.42395304, -10.1707731 , ...,  -4.33296756,
           3.91927311, -13.08721077],
        [ -0.34438531,   6.36568448,  10.77377433, ...,   0.66309635,
          -4.06781981, -12.59296193]]),
 (1797, 10))
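
The sketch below illustrates why the extracted features count as new features rather than a subset of the old ones: each one is a weighted sum of all original columns. It uses an SVD on hypothetical synthetic data (not the digits data), and the value `k = 2` is an assumption that matches how the data was generated.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 samples, 6 features generated from 2 hidden factors
factors = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + 0.01 * rng.normal(size=(100, 6))

mean = X.mean(axis=0)
# Rows of Vt are the principal directions (weights on the original features)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 2
components = Vt[:k]                       # shape (k, 6)
X_extracted = (X - mean) @ components.T   # k new features per sample

# Map the extracted features back to the original 6-dimensional space
X_restored = X_extracted @ components + mean
rel_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X - mean)
print(X_extracted.shape, rel_error)
```

A small relative reconstruction error indicates that the extracted features retain most of the information in the original columns.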

### Q5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

We have a dataset of food delivery orders that includes the following features:

- Price (in dollars)
- Rating (on a scale of 1 to 5)
- Delivery time (in minutes)

We could apply Min-Max scaling to each of these features so that they fall within the range of 0 to 1. This would involve subtracting the minimum value of each feature from each data point, dividing the result by the range of the feature, and then multiplying by the desired range (in this case, 1). This would result in a dataset where each feature has been normalized to the same scale, making it easier to compare and analyze the data.

For example, the Min-Max scaling formula for the price feature would be:

scaled_price = (price - min_price) / (max_price - min_price) * 1

where min_price is the minimum price in the dataset, max_price is the maximum price in the dataset, and scaled_price is the transformed price value between 0 and 1.

In [17]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = {'price' : [299, 149, 179, 249], 
       'rating' : [3.4, 4.1, 5.0, 3.8], 
       'delivery time' : [20, 25, 18, 15]}

df = pd.DataFrame(data)

min_max = MinMaxScaler(feature_range=(0,1))

scaled = min_max.fit_transform(df)

scaled_df = pd.DataFrame(scaled, columns=df.columns)

scaled_df

| | price | rating | delivery time |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.5 |
| 1 | 0.0 | 0.4375 | 1.0 |
| 2 | 0.2 | 1.0 | 0.3 |
| 3 | 0.666667 | 0.25 | 0.0 |


In [20]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[299, 3.4, 20], [149,4.1,25], [179,5.0, 18], [249, 3.8, 15]])

scaler = MinMaxScaler(feature_range=(0,1))

transformed_value = scaler.fit_transform(X)

transformed_value

array([[1.        , 0.        , 0.5       ],
       [0.        , 0.4375    , 1.        ],
       [0.2       , 1.        , 0.3       ],
       [0.66666667, 0.25      , 0.        ]])
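
In a recommendation system, new orders keep arriving after the scaler has been fitted. They should be scaled with the minimum and range learned from the existing data, not refitted. The sketch below shows this with plain NumPy. The `new_order` values are hypothetical, and the `scale` helper is an illustration only, not part of any library.

```python
import numpy as np

# Existing orders: price, rating, delivery time (same values as above)
X_train = np.array([[299, 3.4, 20], [149, 4.1, 25], [179, 5.0, 18], [249, 3.8, 15]])

# Learn the per-feature minimum and range once
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min


def scale(rows):
    """Apply the stored Min-Max parameters to new rows."""
    return (np.asarray(rows, dtype=float) - col_min) / col_range


# Hypothetical new order: cheaper than anything seen so far
new_order = [[99, 4.5, 22]]
print(scale(new_order))
```

A value outside the fitted range, such as the price here, lands outside 0 to 1. Whether to clip such values or refit the scaler periodically is a design choice. `MinMaxScaler` offers a `clip` option for the former.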

### Q6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset.

Standardize the data: First, we would standardize the data by subtracting the mean and dividing by the standard deviation of each feature. PCA is driven by variance, so without this step features measured in large units would dominate the components. For time-ordered data such as stock prices, the mean and standard deviation should come from the training period only, so that no information from the test period leaks into the model.

Determine the number of principal components: We would then decide how many principal components to retain. A common approach is to inspect the explained variance ratio of each component and keep the smallest number whose cumulative explained variance reaches a chosen threshold, for example 90% or 95%.

Perform PCA: We would perform PCA to transform the data into a new set of principal components. Each principal component is a linear combination of the original features, with the first principal component explaining the most variance in the data.

Choose the new features: We would choose the new features based on the most important principal components. These features are the linear combination of the original features that contribute the most to the variance in the data.

Train the model: We would then train the model on the reduced feature set and evaluate its performance on the test data.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the dataset (placeholders: X is the feature matrix, y is the target)
X = ...
y = ...

# Standardize the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Perform PCA
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_std)

# Get the explained variance of each principal component
explained_var = pca.explained_variance_ratio_

# Keep the leading principal components as the new features
# (2 is illustrative; choose the count from explained_var)
new_features = X_pca[:, :2]

# Train the model on the reduced feature set
model = ...
model.fit(new_features, y)
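
Because stock data is time-ordered, one reasonable way to apply the steps above is to learn the scaling statistics and the principal directions from the training period only, then reuse them on later data. The sketch below uses random numbers as a hypothetical stand-in for the real features. The 80/20 split and `k = 3` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-in for the stock data: 300 time-ordered rows, 8 features
X = rng.normal(size=(300, 8))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)  # make two features correlated

# Chronological split: the first 80% is training data, the rest is held out
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]

# Standardize with statistics from the training period only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mu) / sigma, (X_test - mu) / sigma

# Principal directions from the training period only (Z_train is already centered)
_, _, Vt = np.linalg.svd(Z_train, full_matrices=False)
k = 3  # illustrative; in practice chosen from the explained variance
train_pca = Z_train @ Vt[:k].T
test_pca = Z_test @ Vt[:k].T

print(train_pca.shape, test_pca.shape)
```

The same pattern applies with scikit-learn: call `fit` on the training rows and only `transform` on the test rows.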


### Q7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

In [18]:
from sklearn.preprocessing import MinMaxScaler
data = [[1],[5],[10],[15],[20]]

In [33]:
min_max = MinMaxScaler(feature_range=(-1,1))
min_max

In [25]:
min_max.fit_transform(data)

array([[-1.        ],
       [-0.57894737],
       [-0.05263158],
       [ 0.47368421],
       [ 1.        ]])

In [30]:
min_max.fit(data)

In [31]:
min_max.transform(data)

array([[-1.        ],
       [-0.57894737],
       [-0.05263158],
       [ 0.47368421],
       [ 1.        ]])

In [39]:
x = [1, 5, 10, 15, 20]

# calculate the minimum and maximum values
x_min = min(x)
x_max = max(x)

# perform Min-Max scaling
x_scaled = [(xi - x_min) / (x_max - x_min) * 2 - 1 for xi in x]

print(x_scaled)


[-1.0, -0.5789473684210527, -0.052631578947368474, 0.4736842105263157, 1.0]
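
The manual calculation above hard-codes the target range of -1 to 1. As a sketch, the same arithmetic can be written for an arbitrary target range, together with the inverse mapping as a sanity check. The function names `min_max_scale` and `min_max_invert` are hypothetical helpers, not library functions.

```python
def min_max_scale(values, new_min=-1.0, new_max=1.0):
    """Scale values linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; the range is zero")
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


def min_max_invert(scaled, lo, hi, new_min=-1.0, new_max=1.0):
    """Map scaled values back to the original range [lo, hi]."""
    span = new_max - new_min
    return [lo + (s - new_min) / span * (hi - lo) for s in scaled]


x = [1, 5, 10, 15, 20]
x_scaled = min_max_scale(x)
x_back = min_max_invert(x_scaled, min(x), max(x))
print(x_scaled)
print(x_back)
```

Recovering the original values from the scaled ones confirms that the transformation is linear and loses no information, as long as the original minimum and maximum are kept.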


### Q8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

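The number of principal components to retain cannot be fixed without looking at the data. It depends on the specific requirements of the project, the share of variance in the data that we want to retain, and the trade-off between accuracy and computational efficiency. With five features, at most five components exist. Note also that gender is categorical and must be encoded numerically (for example as 0/1) before PCA can be applied.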

To determine the number of principal components to retain, we can use the scree plot or the cumulative explained variance plot. The scree plot shows the amount of variance explained by each principal component, while the cumulative explained variance plot shows the cumulative amount of variance explained as a function of the number of principal components. We can select the number of principal components that explain a significant amount of variance in the data, such as 80% or 90%, and ignore the rest.



In [None]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the dataset
df = pd.read_csv('health_data.csv')

# Separate the features and target variable
X = df[['height', 'weight', 'age', 'gender', 'blood_pressure']]
y = df['target']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA for feature extraction
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Check the explained variance ratio of the principal components
print(pca.explained_variance_ratio_)

# X_pca already holds only the first three principal components
# (n_components=3), so no further slicing is needed
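
The cumulative explained variance rule described above can be sketched as follows. The health data here is synthetic and hypothetical: the correlations between height and weight and between age and blood pressure are assumptions made for illustration, and gender is encoded as 0/1. The resulting number of components therefore says nothing about a real dataset.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
# Hypothetical health data; gender is encoded as 0/1 before PCA
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + rng.normal(70, 8, n)
age = rng.uniform(20, 70, n)
gender = rng.integers(0, 2, n).astype(float)
blood_pressure = 0.5 * age + rng.normal(100, 10, n)
X = np.column_stack([height, weight, age, gender, blood_pressure])

# Standardize, then take the eigenvalues of the covariance of the scaled data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order

ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90
k = int(np.searchsorted(cumulative, threshold) + 1)
print(np.round(cumulative, 3), k)
```

Plotting `ratio` against the component index gives the scree plot, and plotting `cumulative` gives the cumulative explained variance plot.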
