# 1. What is Min-Max scaling, and how is it used in data preprocessing? Provide an example to illustrate its application.

- Min-Max scaling is like adjusting the volume of your music to make it fit the mood of the party.
- In data preprocessing, it helps to transform data values into a specific range, typically between 0 and 1, so they play well together and don't overshadow each other.

> - Here's how it works!! :

> - >  [i] **Find the Minimum and Maximum Values:** Imagine you have a playlist, and you want to know the softest and loudest songs. 
> - > You find the quietest (minimum) and the loudest (maximum) song volumes.
> - >  [ii] **Rescale the Values:** Now, you adjust all song volumes so they fit between 0 (quietest) and 1 (loudest). 
> - > It's like turning the volume knob in a way that soft songs become 0, loud songs become 1, and everything else adjusts proportionally in between.

> - eg: 
- Let's say you have a dataset of house sizes, and they range from 500 square feet (the smallest) to 2000 square feet (the largest).

- A 500-square-foot house would be scaled to 0.
- A 1000-square-foot house would be scaled to about 0.33, because (1000 - 500) / (2000 - 500) = 1/3. The midpoint 0.5 belongs to a 1250-square-foot house.
- A 2000-square-foot house would be scaled to 1 (the largest). 

> Hence, Min-Max scaling helps to put all the house sizes on the same volume level, making it easier for a model to understand and work with the data. 
> - It's like making sure all songs in your playlist have a similar volume so that no single song drowns out the others at your party.
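The two steps above boil down to one formula: subtract the minimum, then divide by the range (maximum minus minimum). Here is a minimal plain-Python sketch of that idea; the helper name `min_max_scale` is made up for illustration and is not a library function.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # a constant list has zero range, so the formula would divide by zero
        raise ValueError("all values are equal, so the range is zero")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


house_sizes = [500, 1000, 1500, 2000]
scaled_sizes = min_max_scale(house_sizes)
print([round(s, 3) for s in scaled_sizes])  # [0.0, 0.333, 0.667, 1.0]
```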

In [14]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# sample data 
data = {
    'House Size (sq. ft.)': [500, 1000, 1500, 2000] # House sizes in square feet
}

df = pd.DataFrame(data)

# Initializing the Min-Max scaler
scaler = MinMaxScaler()

# fit and transform the data to scale it between 0 and 1
df['Scaled Size'] = scaler.fit_transform(df[['House Size (sq. ft.)']])

print(df)


   House Size (sq. ft.)  Scaled Size
0                   500     0.000000
1                  1000     0.333333
2                  1500     0.666667
3                  2000     1.000000


- Here, we first created a dataset with house sizes. 
- Then, we used the Min-Max scaler from scikit-learn to transform the data. 
- The "Scaled Size" column represents the Min-Max scaled values, where 0 corresponds to the smallest house size, and 1 corresponds to the largest house size.
- All house sizes in between are scaled proportionally within this range.
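In practice the scaler is usually fitted once and then reused on data it has not seen. The sketch below assumes two hypothetical new houses (1250 and 2500 sq. ft.) that are not part of the original example. It shows that the scaler keeps using the minimum and maximum it learned during `fit`, so a value outside the fitted range lands outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

column = 'House Size (sq. ft.)'
train = pd.DataFrame({column: [500, 1000, 1500, 2000]})

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=500 and max=2000 from the fitted data only

# hypothetical new houses that were not part of the fitted data
new_houses = pd.DataFrame({column: [1250, 2500]})
new_scaled = scaler.transform(new_houses)

# 1250 sits at the midpoint (0.5); 2500 is larger than the fitted maximum, so it exceeds 1
print(new_scaled.ravel())
```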

# 2. What is the Unit Vector technique in feature scaling, and how does it differ from Min-Max scaling? Provide an example to illustrate its application.

- The **Unit Vector technique** in feature scaling is like resizing all the arrows in your collection to have the same length without changing their directions.
- Each arrow is one data point (a row of feature values). Its length becomes 1, so only its direction, meaning the proportions between its feature values, is kept.

> - Here's how it differs from Min-Max scaling!! :

> - > - [i] **Min-Max Scaling:** It works on one feature (column) at a time and squeezes that feature's values into a fixed range (e.g., between 0 and 1). 
- The smallest value becomes 0 and the largest becomes 1, like adjusting the volume of songs in a playlist to fit between the quietest and loudest.
> - > - [ii] **Unit Vector Scaling:** It works on one data point (row) at a time: the row's values are divided by the row's overall length (its norm), so every arrow ends up with a length of 1.
- The direction of each arrow, meaning the proportions between its feature values, stays the same; only its length changes.

> - eg : 

- Let's use a dataset of houses with two features: "House Size" and "Number of Bedrooms."
- Min-Max scaling would rescale each feature separately, so both the sizes and the bedroom counts end up between 0 and 1.
- Unit Vector scaling, on the other hand, would treat each house as one arrow (size, bedrooms) and shrink it to a length of 1 while preserving its direction.
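A minimal NumPy sketch of what unit vector scaling computes, assuming the Euclidean (L2) norm: each row (one house) is divided by its own length, so every rescaled row has length 1.

```python
import numpy as np

# rows are houses; columns are [house size (sq. ft.), number of bedrooms]
X = np.array([[500, 1], [1000, 2], [1500, 3], [2000, 4]], dtype=float)

# Euclidean (L2) length of each row, kept as a column so it broadcasts row by row
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / row_norms

print(X_unit)
print(np.linalg.norm(X_unit, axis=1))  # length of every rescaled row
```

Note that the lengths are computed per row, not per column, which is the main difference from Min-Max scaling.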


In [15]:
import pandas as pd
from sklearn.preprocessing import Normalizer

# sample data
data = {
    'House Size (sq. ft.)': [500, 1000, 1500, 2000],  # House size and number of bedrooms
    'Number of Bedrooms': [1, 2, 3, 4]
}

df = pd.DataFrame(data)

# initializing the Normalizer for Unit Vector scaling
normalizer = Normalizer()

# Normalizer is stateless, so fit_transform simply rescales each row to unit length
df[['Scaled Size', 'Scaled Bedrooms']] = normalizer.fit_transform(df)

print(df)


   House Size (sq. ft.)  Number of Bedrooms  Scaled Size  Scaled Bedrooms
0                   500                   1     0.999998            0.002
1                  1000                   2     0.999998            0.002
2                  1500                   3     0.999998            0.002
3                  2000                   4     0.999998            0.002




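- Here, we used the Normalizer from scikit-learn to apply Unit Vector scaling to the dataset.
- The "Scaled Size" and "Scaled Bedrooms" columns together form one unit vector per house: in every row, the squared values add up to 1.
- Because house size is hundreds of times larger than the bedroom count, it dominates every vector. All four rows come out identical because every house has the same bedrooms-to-size ratio (1 bedroom per 500 sq. ft.).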

# 3. What is PCA (Principal Component Analysis), and how is it used in dimensionality reduction? Provide an example to illustrate its application.

- **Principal Component Analysis (PCA)** is like simplifying a complex dish to its core flavors. 
- In dimensionality reduction, it helps you identify the most essential ingredients (features) in your data, allowing you to reduce the complexity while retaining the most critical information.

> Here's how PCA works!! :

> - > - [i] **Data Preparation:** Imagine you have a recipe with many ingredients, but some might be redundant or less important. 
- PCA looks for the combinations of ingredients (features) that explain most of the variation in your recipe.
> - > - [ii] **Create New Ingredients:** PCA combines and transforms your original ingredients into new ones, known as "principal components." 
- These components capture the most significant variations in your data. It's like creating flavor combinations to represent the core taste of a dish.
> - > - [iii] **Rank by Importance:** PCA ranks these new components by their importance. 
- The first component represents the most significant variation in your data, the second represents the second most significant, and so on.
> - > - [iv] **Choose Components:** You can decide how many principal components to keep, simplifying your recipe.
- It's like selecting the primary flavors to make your dish without losing its essence.
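The four steps above can be sketched by hand with NumPy. This is one common way to compute PCA (eigen-decomposition of the covariance matrix), not necessarily what scikit-learn does internally, and the signs of the components may differ between implementations. The data is the small dish dataset used in this section.

```python
import numpy as np

# dish data: columns are Saltiness, Spiciness, Sweetness
X = np.array([[2, 1, 3], [3, 2, 2], [4, 3, 1], [5, 4, 2]], dtype=float)

# 1. center each feature on its mean
X_centered = X - X.mean(axis=0)

# 2. covariance matrix of the features (3 x 3)
cov = np.cov(X_centered, rowvar=False)

# 3. eigenvectors are the new axes; eigenvalues are the variance along each axis
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. rank the components from largest to smallest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. keep the top k components and project the data onto them
k = 2
projected = X_centered @ eigenvectors[:, :k]

print(eigenvalues / eigenvalues.sum())  # share of variance per component
print(projected)
```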


In [16]:
import pandas as pd
from sklearn.decomposition import PCA

# sample data
data = {
    'Saltiness': [2, 3, 4, 5],  # Ingredients (features) in a dish
    'Spiciness': [1, 2, 3, 4],
    'Sweetness': [3, 2, 1, 2] 
}

df = pd.DataFrame(data)

# initializing PCA with 2 components
pca = PCA(n_components=2)

# fit and transform the data to create principal components
principal_components = pca.fit_transform(df)

# creating a DataFrame to show the transformed data
df_pca = pd.DataFrame(data=principal_components, columns=['Component 1', 'Component 2'])

print(df_pca)


   Component 1  Component 2
0     2.324567    -0.310461
1     0.673887     0.214186
2    -0.976793     0.738834
3    -2.021662    -0.642559


- Here, we applied PCA to reduce the number of features (ingredients) while retaining the core information.
- The resulting "Component 1" and "Component 2" represent the simplified, essential flavors of the dish.
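To check how much information the two components actually keep, scikit-learn exposes `explained_variance_ratio_` on a fitted `PCA`. In this sample, Saltiness and Spiciness rise in lockstep (each Saltiness value is exactly Spiciness plus 1), so two components should retain practically all of the variance. The sketch below prints the ratios rather than assuming their exact values.

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    'Saltiness': [2, 3, 4, 5],
    'Spiciness': [1, 2, 3, 4],
    'Sweetness': [3, 2, 1, 2],
})

pca = PCA(n_components=2).fit(df)

ratios = pca.explained_variance_ratio_  # fraction of total variance per component
print(ratios)
print(ratios.sum())                     # fraction retained by the 2 kept components
```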

# 4. What is the relationship between PCA and Feature Extraction, and how can PCA be used for Feature Extraction? Provide an example to illustrate this concept.

- PCA (Principal Component Analysis) and Feature Extraction are like two chefs working together to create a simplified and tastier dish.
- In the world of data and machine learning, PCA is a technique used for feature extraction, which helps reduce the number of features (ingredients) while preserving the most important information.

> - Here's how PCA can be used for feature extraction!! :

> - > - [i] **Data Complexity Reduction:** Imagine you have a recipe with lots of ingredients, some of which are redundant or less important.
- Feature extraction, with the help of PCA, simplifies the recipe by blending the original ingredients (features) into a few new ones, rather than just picking some ingredients and dropping the rest.
> - > - [ii] **Create Essential Components:** PCA transforms the original ingredients into new components that capture the most significant variations in your data.
- These components are like the core flavors of the dish.
> - > - [iii] **Feature Reduction:** You can choose how many of these components to keep. 
- For example, if your original recipe had 20 ingredients (features), PCA might find that you can get nearly the same taste with just 5 components (new features).
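What makes this extraction rather than selection is that every new feature is a weighted mix of all the original ones. A hedged sketch using the fitted `components_` and `mean_` attributes of scikit-learn's `PCA`: it prints the weights and rebuilds the extracted features by hand to confirm the relationship.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    'Saltiness': [2, 3, 4, 5],
    'Spiciness': [1, 2, 3, 4],
    'Sweetness': [3, 2, 1, 2],
})

pca = PCA(n_components=2)
extracted = pca.fit_transform(df)

# each row of components_ holds the weights that build one extracted feature
loadings = pd.DataFrame(pca.components_, columns=df.columns,
                        index=['Feature 1', 'Feature 2'])
print(loadings)

# rebuild the extracted features by hand: center the data, then apply the weights
manual = (df.to_numpy() - pca.mean_) @ pca.components_.T
print(np.allclose(manual, extracted))
```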


In [17]:
import pandas as pd
from sklearn.decomposition import PCA

# sample data
data = {
    'Saltiness': [2, 3, 4, 5], # Ingredients (features) in a dish
    'Spiciness': [1, 2, 3, 4],
    'Sweetness': [3, 2, 1, 2]
}

df = pd.DataFrame(data)

# initializing PCA to reduce to 2 components
pca = PCA(n_components=2)

# fit and transform the data to create principal components
extracted_features = pca.fit_transform(df)

# creating a DataFrame to show the extracted features
df_extracted = pd.DataFrame(data=extracted_features, columns=['Feature 1', 'Feature 2'])

print(df_extracted)


   Feature 1  Feature 2
0   2.324567  -0.310461
1   0.673887   0.214186
2  -0.976793   0.738834
3  -2.021662  -0.642559


- Here, we applied PCA for feature extraction. 
- The "Feature 1" and "Feature 2" are the essential components that capture the most critical information from the original ingredients (features).
- By using only these extracted features, we simplify our recipe without losing the core flavors.

# 5. You are working on a project to build a recommendation system for a food delivery service. The dataset contains features such as price, rating, and delivery time. Explain how you would use Min-Max scaling to preprocess the data.

- In our food delivery recommendation system project, using Min-Max scaling is like making sure all your food preferences (features like price, rating, and delivery time) are on the same scale, just like you'd compare pizza prices, burger ratings, and sushi delivery times equally.

> - Here's how we'd use Min-Max scaling!! :

> - > - [i] **Data Gathering:** You collect a dataset with various features like food prices (ranging from $5 to $30), ratings (on a scale from 1 to 5), and delivery times (from 10 to 60 minutes).
> - > - [ii] **Scaling Range:** You decide on a range, often between 0 and 1, in which you want to fit your data. 
- With a range of 0 to 1, the smallest value of each feature becomes 0 and the largest becomes 1 (e.g., the $5 dish gets a scaled price of 0 and the $30 dish gets 1).
> - > - [iii] **Adjusting Values:** You apply Min-Max scaling to each feature individually.
- It's like taking each feature's values and adjusting them proportionally so that they fit within your chosen range.
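The "each feature individually" part can be sketched directly in pandas, without any scaler object. The sample values below match the ranges described above. `df.min()` and `df.max()` are computed per column, so each feature is rescaled with its own minimum and range.

```python
import pandas as pd

df = pd.DataFrame({
    'Price': [5, 15, 30, 10],
    'Rating': [2, 4, 5, 3],
    'Delivery Time (min)': [10, 30, 60, 20],
})

# column-wise: subtract each column's minimum, then divide by that column's range
df_scaled = (df - df.min()) / (df.max() - df.min())

print(df_scaled)
```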


In [18]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# sample data
data = {
    'Price': [5, 15, 30, 10],   #  food preferences with different scales
    'Rating': [2, 4, 5, 3],
    'Delivery Time (min)': [10, 30, 60, 20]
}

df = pd.DataFrame(data)

# initializing the Min-Max scaler
scaler = MinMaxScaler()

# fit and transform the data to scale within the range [0, 1]
scaled_data = scaler.fit_transform(df)

# creating a DataFrame to show the scaled data
df_scaled = pd.DataFrame(data=scaled_data, columns=['Scaled Price', 'Scaled Rating', 'Scaled Delivery Time'])

print(df_scaled)


   Scaled Price  Scaled Rating  Scaled Delivery Time
0           0.0       0.000000                   0.0
1           0.4       0.666667                   0.4
2           1.0       1.000000                   1.0
3           0.2       0.333333                   0.2


- Here, we applied Min-Max scaling to the dataset, which rescaled each feature to fit within the range [0, 1].
- Now that price, rating, and delivery time are on the same scale, no feature dominates just because of its units, which makes the data suitable for building our food delivery recommendation system.

# 6. You are working on a project to build a model to predict stock prices. The dataset contains many features, such as company financial data and market trends. Explain how you would use PCA to reduce the dimensionality of the dataset. 

- In our stock price prediction project, using PCA (Principal Component Analysis) to reduce dimensionality is like simplifying a complex stock market analysis by focusing on the most critical factors, similar to distilling a vast recipe down to its essential ingredients.

> - Here's how we'd use PCA for dimensionality reduction!! :

> - > - [i] **Data Complexity Reduction:** Imagine we have a recipe with lots of ingredients (features), but not all of them are equally important. 
- Similarly, in our dataset, some features might be redundant or have less impact on stock prices.
> - > - [ii] **Create New Factors:** PCA transforms our original features into new components (factors) that represent the most significant variations in our data. 
- These components are like the core flavors in our recipe.
> - > - [iii] **Rank by Importance:** PCA ranks these new components based on their importance.
- The first component captures the most significant variation, the second captures the second most, and so on.
> - > - [iv] **Select Fewer Components:** We can choose how many of these components to keep. 
- Instead of dealing with all the original features, we focus on a smaller set of the most influential components.
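Financial features often live on very different scales, so a common refinement is to standardize before PCA and to let PCA pick the number of components from a variance target. The sketch below is one possible setup, not the only one: it assumes scikit-learn's `StandardScaler` and a 95% variance target, and it uses three illustrative features.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Earnings Growth': [0.05, 0.03, 0.06, 0.02],
    'Market Volatility': [0.1, 0.15, 0.08, 0.12],
    'Debt-to-Equity Ratio': [1.2, 0.9, 1.4, 1.1],
})

# standardize first so no feature dominates purely because of its units,
# then keep as many components as needed to explain 95% of the variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
reduced = pipeline.fit_transform(df)

pca = pipeline.named_steps['pca']
print(pca.n_components_)              # number of components PCA decided to keep
print(pca.explained_variance_ratio_)  # variance share of each kept component
```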

In [19]:
import pandas as pd
from sklearn.decomposition import PCA

# sample data
data = {
    'Earnings Growth': [0.05, 0.03, 0.06, 0.02],    # company financial data and market trends (features)
    'Market Volatility': [0.1, 0.15, 0.08, 0.12],
    'Debt-to-Equity Ratio': [1.2, 0.9, 1.4, 1.1],
    # ... More financial and market features
}

df = pd.DataFrame(data)

# initializing PCA to reduce to 2 components (for simplicity)
pca = PCA(n_components=2)

# fit and transform the data to create principal components
reduced_features = pca.fit_transform(df)

# creating a DataFrame to show the reduced features
df_reduced = pd.DataFrame(data=reduced_features, columns=['Component 1', 'Component 2'])

print(df_reduced)


   Component 1  Component 2
0    -0.051828    -0.006746
1     0.252880    -0.007247
2    -0.252863    -0.002488
3     0.051811     0.016481


- Here, we applied PCA to reduce the dimensionality of the dataset. 
- The "Component 1" and "Component 2" represent the most essential components that capture the primary variations in the data.
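- By focusing on these components, we simplify our analysis while keeping most of the variation in the original features. PCA never looks at the stock price itself, so whether these components actually help predict it still has to be checked with the model.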

# 7. For a dataset containing the following values: [1, 5, 10, 15, 20], perform Min-Max scaling to transform the values to a range of -1 to 1.

- Min-Max scaling is like resizing a group of values so they fit within a specific range, and in this case, we want to make them fit between -1 and 1. 
- It's similar to adjusting the volume of music to make it neither too quiet (less than -1) nor too loud (more than 1). 
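Rescaling to an arbitrary range $[a, b]$ can be written as a single formula. This is the standard form of Min-Max scaling; the symbols $a$, $b$, $x_{\min}$ and $x_{\max}$ are notation introduced here:

$$x' = a + \frac{(x - x_{\min})\,(b - a)}{x_{\max} - x_{\min}}$$

For this dataset, $a = -1$, $b = 1$, $x_{\min} = 1$ and $x_{\max} = 20$, so for example $x = 5$ gives $x' = -1 + \frac{4 \cdot 2}{19} = -\frac{11}{19} \approx -0.579$.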

In [20]:
# sample data
data = [1, 5, 10, 15, 20]

# defining the minimum and maximum values in the original data
min_value = min(data)
max_value = max(data)

# defining the desired range (from -1 to 1)
new_min = -1
new_max = 1

# performing Min-Max scaling
scaled_data = [((x - min_value) / (max_value - min_value)) * (new_max - new_min) + new_min for x in data]

print(scaled_data)


[-1.0, -0.5789473684210527, -0.052631578947368474, 0.4736842105263157, 1.0]


- Here, we first find the minimum and maximum values in the original data.
- Then, we use these values to transform the data into the desired range of -1 to 1.
- The scaled data will now be adjusted to fit within this range, making it easier to compare and work with.
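The same result can be cross-checked with scikit-learn, whose `MinMaxScaler` accepts a `feature_range` argument. This is a minimal sketch; the reshape is needed because the scaler expects a 2-D array with one column per feature.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# one feature, five samples
data = np.array([1, 5, 10, 15, 20], dtype=float).reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data).ravel()

print(scaled)
```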

# 8. For a dataset containing the following features: [height, weight, age, gender, blood pressure], perform Feature Extraction using PCA. How many principal components would you choose to retain, and why?

- In feature extraction using PCA, we aim to simplify our dataset by focusing on the most essential factors while reducing complexity.
- The number of principal components to retain depends on how much variance (variation) in our data we want to preserve.
- We typically choose to keep a sufficient number of principal components to maintain a high percentage of the original data's variance.
- A common choice is to retain components that capture 95% or more of the variance.

> - eg : 
- If we choose to retain 2 principal components, it means we keep the two most influential factors that explain a significant portion of the data's variance.
- If we choose to retain 3 principal components, it means we keep three factors that explain even more of the data's variance.


- > - The decision on how many principal components to retain is a trade-off between simplification and preserving the data's information. Retaining more principal components captures more variance but reduces complexity less.



In [21]:
import pandas as pd
from sklearn.decomposition import PCA

# sample data
data = {
    'Height': [170, 180, 160, 165, 175], #  Features (height, weight, age, gender, blood pressure)
    'Weight': [70, 90, 60, 65, 80],
    'Age': [30, 40, 25, 35, 50],
    'Gender': [0, 1, 0, 1, 0],  # Assuming 0 for male and 1 for female
    'Blood Pressure': [120, 140, 110, 130, 135]
}

df = pd.DataFrame(data)

# initializing PCA with no specified number of components
pca = PCA()

# fit PCA on the data
pca.fit(df)

# determining how many components to retain to capture 95% of the variance
cumulative_variance = pca.explained_variance_ratio_.cumsum()
num_components = len(cumulative_variance[cumulative_variance < 0.95]) + 1

print(f"Number of principal components to retain: {num_components}")


Number of principal components to retain: 2


- Here, we used PCA to determine how many principal components to retain to capture 95% of the data's variance; for this sample, that is 2 components.
- The number of components we choose helps strike a balance between simplification and preserving the data's information.
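One caveat with PCA on raw columns: features with larger numeric ranges (such as height or blood pressure) carry more variance than a 0/1 gender flag, so they can dominate the components. Below is a hedged variant that standardizes with scikit-learn's `StandardScaler` first, a step the cell above does not include. The number of components it reports may differ from the unscaled run, so it is printed rather than assumed.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Height': [170, 180, 160, 165, 175],
    'Weight': [70, 90, 60, 65, 80],
    'Age': [30, 40, 25, 35, 50],
    'Gender': [0, 1, 0, 1, 0],
    'Blood Pressure': [120, 140, 110, 130, 135],
})

# put every feature on a comparable scale (mean 0, standard deviation 1)
X_std = StandardScaler().fit_transform(df)

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# first position where the cumulative share reaches 95%, converted to a count
k = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative)
print(k)
```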