### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

In [None]:
'''
Ordinal Encoding and Label Encoding are two popular techniques used to convert categorical variables into numerical ones.
The main difference between them lies in which variable they are meant for and in whether the order of the integers carries
meaning.

Ordinal Encoding is suitable when a categorical feature has an inherent order or ranking. It assigns a unique numerical label
to each category, preserving the ordinal relationship between them. For example, if you have a variable representing education
qualification with categories like "High School," "Bachelor's Degree," and "Master's Degree," you can assign them numerical
labels like 1, 2, and 3, respectively.

On the other hand, Label Encoding (as implemented by scikit-learn's LabelEncoder) is intended for encoding the target variable,
and it fits categorical variables with no inherent order. It assigns each unique category an integer based on the sorted
(alphabetical) order of the labels, so the integers do not express any ranking. For example, if you have a dataset with a
column called "Team" containing categories like "A," "B," and "C," they are assigned 0 for "A," 1 for "B," and 2 for "C".

Here's an example to illustrate when you might choose one over the other:

Suppose you are working on a machine learning project where you need to predict student performance based on various features.
One of the features is the student's education qualification, which has categories like "High School," "Bachelor's Degree," and
"Master's Degree." In this case, you might choose Ordinal Encoding because there is an inherent order or ranking among these
categories. By preserving this order, your model can potentially learn from the ordinal relationship between education
qualifications.

However, if you are working on a recommendation system that uses categorical variables to represent user preferences or item
categories, you might choose Label Encoding. For example, if you have a dataset with a column representing different genres
of movies like "Action," "Comedy," and "Drama," you can assign them integer labels based on alphabetical order. This allows
your model to work with numerical data while still distinguishing the categories. Because the integers imply no ranking, this
fits best when the column is the prediction target or when the model does not treat the codes as magnitudes.

'''
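
The contrast can be sketched with scikit-learn. The sample rows below are hypothetical, and note that `OrdinalEncoder` numbers the categories from 0 (as floats) rather than from 1 as in the prose example; the explicit `categories` list is what fixes the ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "Education": ["Bachelor's Degree", "High School", "Master's Degree", "High School"],
    "Team": ["B", "A", "C", "A"],
})

# OrdinalEncoder: the ranking is stated explicitly, lowest to highest.
education_order = [["High School", "Bachelor's Degree", "Master's Degree"]]
ordinal_encoder = OrdinalEncoder(categories=education_order)
education_encoded = ordinal_encoder.fit_transform(df[["Education"]]).ravel()

# LabelEncoder: integers follow the sorted order of the labels, with no ranking implied.
label_encoder = LabelEncoder()
team_encoded = label_encoder.fit_transform(df["Team"])

print(education_encoded)
print(team_encoded)
print(label_encoder.classes_)
```

Without the `categories` argument, `OrdinalEncoder` would also fall back to sorted order, which would place "Bachelor's Degree" below "High School".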

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

In [None]:
'''
Target Guided Ordinal Encoding is a technique used to encode categorical variables for machine learning models by using their
relationship with the target variable. It is particularly useful when a feature has many categories and no natural order of
its own, because the target supplies the order.

The process of Target Guided Ordinal Encoding involves the following steps:

1. Sort the categories based on the corresponding target variable. For example, if we have a dataset of employees with columns like
"City," "Highest Qualification," and "Salary," we can sort the cities based on the mean salary of each city.
2. Assign ranks to the categories based on their order in step 1. The category with the highest mean salary gets the highest rank,
and so on.
3. Use these ranks to encode the categorical variable in the dataset.

Here's an example to illustrate when you might use Target Guided Ordinal Encoding in a machine learning project:

Suppose you are working on a project to predict employee salaries based on various features such as city, education qualification,
and experience. The city where an employee lives can have a significant impact on their salary. By using Target Guided Ordinal
Encoding, you can encode the city column based on the mean salary of each city. This encoding allows your model to capture the
relationship between cities and salaries more effectively.

By incorporating this encoding technique, your machine learning model can learn from the ordinal relationship between cities and
make better predictions about employee salaries.
'''
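
The three steps above can be sketched in pandas as follows. The cities and salaries are hypothetical, and the name `city_rank` is introduced only for this illustration.

```python
import pandas as pd

# Hypothetical employee data; cities and salaries are invented for illustration.
df = pd.DataFrame({
    "City": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai"],
    "Salary": [40000, 60000, 90000, 50000, 70000, 80000],
})

# Step 1: mean target value per category, sorted from lowest to highest.
mean_salary = df.groupby("City")["Salary"].mean().sort_values()

# Step 2: rank the categories; the highest mean salary gets the highest rank.
city_rank = {city: rank for rank, city in enumerate(mean_salary.index, start=1)}

# Step 3: replace each category with its rank.
df["City_encoded"] = df["City"].map(city_rank)

print(city_rank)
print(df)
```

In a real project the mapping would normally be computed on the training split only and then applied to validation and test data, so that target information does not leak into evaluation.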

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

In [None]:
'''
Covariance is a statistical measure that quantifies the joint variability of two random variables, denoted as X and Y. Its
sign describes the direction of the linear relationship between these variables. Covariance is important in statistical
analysis because it helps us understand how changes in one variable are associated with changes in another variable, and
because it is the building block of the correlation coefficient.

For a sample of n paired observations, covariance is calculated as:

    cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1)

That is, take each observation's deviation from the mean of X and from the mean of Y, multiply the two deviations, sum the
products, and divide by n - 1 (divide by n instead for a population covariance).

The resulting covariance value can be positive, negative, or zero. A positive covariance indicates that when one variable
increases, the other tends to increase as well. A negative covariance indicates that when one variable increases, the other
tends to decrease. A covariance of zero suggests no linear relationship between the variables.

It's important to note that the magnitude of covariance alone does not indicate the strength of the relationship, because it
depends on the units of the variables. To address this, we can use the correlation coefficient, which divides covariance by the
product of the two standard deviations and provides a standardized measure of linear association between -1 and 1.

Because covariance is sensitive to the scale of the variables being analyzed, it's crucial to interpret covariance
values carefully and consider their context when assessing relationships between variables.
'''
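
As a minimal sketch of the calculation, assuming the usual sample definition (sum of the products of deviations from the means, divided by n - 1), the snippet below computes covariance by hand and compares it with `np.cov`. The arrays `x` and `y` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical paired observations, invented for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Sample covariance: sum of products of deviations, divided by n - 1.
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; entry [0, 1] is cov(x, y).
cov_numpy = np.cov(x, y)[0, 1]

# Correlation rescales covariance by both sample standard deviations.
corr = cov_manual / (x.std(ddof=1) * y.std(ddof=1))

print(cov_manual, cov_numpy, corr)
```

Multiplying `x` by 100 would multiply the covariance by 100 but leave the correlation unchanged, which is why correlation is the better gauge of strength.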

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
# Let's build a sample dataset first
import pandas as pd

df = pd.DataFrame({
    "Color": ["red","green","blue","green","red","blue"],
    "Size": ["small","medium","large","small","large","large"],
    "Material": ["wood","metal","plastic","plastic","wood","metal"]
})
df

,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic
3,green,small,plastic
4,red,large,wood
5,blue,large,metal


In [2]:
from sklearn.preprocessing import LabelEncoder

In [3]:
encoder = LabelEncoder()

In [6]:
encoded_color = encoder.fit_transform(df["Color"])
encoded_color

array([2, 1, 0, 1, 2, 0])

In [8]:
encoded_size = encoder.fit_transform(df['Size'])
encoded_material = encoder.fit_transform(df['Material'])

In [11]:
df["Color_encoded"] = encoded_color
df["Size_encoded"] = encoded_size
df["Material_encoded"] = encoded_material
df

,Color,Size,Material,Color_encoded,Size_encoded,Material_encoded
0,red,small,wood,2,2,2
1,green,medium,metal,1,1,0
2,blue,large,plastic,0,0,1
3,green,small,plastic,1,2,1
4,red,large,wood,2,0,2
5,blue,large,metal,0,0,0


In [None]:
'''
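Label Encoding replaces each category with an integer.
LabelEncoder assigns the integers in alphabetical order of the category names, so the Color column gets 0 for blue, 1 for green, and 2 for red.
In the same way, Size gets 0 for large, 1 for medium, and 2 for small, and Material gets 0 for metal, 1 for plastic, and 2 for wood.
Note that the alphabetical codes for Size do not follow its natural order (small < medium < large), so an ordinal encoding
with an explicit order would suit that column better.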
'''
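
The cells above reuse a single `LabelEncoder`, so after the last `fit_transform` it only remembers the Material mapping. A possible variation, sketched below, keeps one fitted encoder per column so that every mapping can be inspected or reversed later; the `encoders` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue", "green", "red", "blue"],
    "Size": ["small", "medium", "large", "small", "large", "large"],
    "Material": ["wood", "metal", "plastic", "plastic", "wood", "metal"],
})

# One fitted encoder per column, so each mapping stays available afterwards.
encoders = {}
for column in ["Color", "Size", "Material"]:
    encoders[column] = LabelEncoder()
    df[column + "_encoded"] = encoders[column].fit_transform(df[column])

# classes_ lists the categories in the order of their integer codes.
for column, enc in encoders.items():
    print(column, dict(zip(enc.classes_.tolist(), range(len(enc.classes_)))))

# inverse_transform recovers the original labels from the codes.
print(encoders["Color"].inverse_transform([2, 1, 0]))
```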

### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [12]:
age = [25, 30, 35, 40, 45, 50, 55, 60]
income = [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000]
education_level = ["Bachelor's degree", "Master's degree", "PhD", "Bachelor's degree", "Master's degree", "PhD", "Bachelor's degree", "Master's degree"]

df = pd.DataFrame({
    "Age": age,
    "Income": income,
    "Education Level": education_level
})
df

,Age,Income,Education Level
0,25,50000,Bachelor's degree
1,30,60000,Master's degree
2,35,70000,PhD
3,40,80000,Bachelor's degree
4,45,90000,Master's degree
5,50,100000,PhD
6,55,110000,Bachelor's degree
7,60,120000,Master's degree


In [14]:
df.cov(numeric_only = True)

,Age,Income
Age,150.0,300000.0
Income,300000.0,600000000.0


In [None]:
'''
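Education Level is a text column, so numeric_only=True leaves it out; the matrix covers only Age and Income.
The diagonal holds the variances: 150 for Age and 600,000,000 for Income.
The off-diagonal value, 300,000, is the covariance of Age and Income. It is positive, so income tends to rise with age.
Its size reflects the units of the two variables, not the strength of the relationship. In this sample Income is exactly
2000 * Age, so the two variables are perfectly linearly related.
To include Education Level, it would first have to be encoded as numbers, for example with ordinal encoding.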
'''
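
The covariance matrix above covers only Age and Income. One way to bring Education Level in, sketched below, is to map it to ranks first. The ranking Bachelor's < Master's < PhD and the equal spacing between ranks are assumptions of this sketch, and the `Education Rank` column is a name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60],
    "Income": [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000],
    "Education Level": ["Bachelor's degree", "Master's degree", "PhD",
                        "Bachelor's degree", "Master's degree", "PhD",
                        "Bachelor's degree", "Master's degree"],
})

# Assumed ranking: Bachelor's < Master's < PhD (equal spacing is also an assumption).
education_rank = {"Bachelor's degree": 1, "Master's degree": 2, "PhD": 3}
df["Education Rank"] = df["Education Level"].map(education_rank)

# 3x3 covariance matrix over the two numeric columns and the encoded ranks.
cov_matrix = df[["Age", "Income", "Education Rank"]].cov()
print(cov_matrix)
```

Any covariance involving `Education Rank` depends on the numbers chosen for the ranks, so it should be read as a rough indication of direction only.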

### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

In [None]:
'''
1. For Gender, I would use one-hot encoding (OHE), a nominal encoding, because the categories have no rank or order.
2. For Education Level, I would use ordinal encoding, because the levels have a natural order (High School < Bachelor's < Master's < PhD).
3. For Employment Status, I would use one-hot encoding, treating the categories as having no specific order; it is also convenient with only three categories.
'''
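
A minimal sketch of these choices, using hypothetical rows built from the categories named in the question, could look like this. It uses `OrdinalEncoder` with an explicit order for Education Level and `pd.get_dummies` for the two columns treated as nominal.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows using the categories named in the question.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["PhD", "High School", "Master's", "Bachelor's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Ordinal encoding for Education Level, with the order stated explicitly.
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
df["Education Level"] = OrdinalEncoder(categories=education_order).fit_transform(
    df[["Education Level"]]
).ravel()

# One-hot encoding for the two columns treated as nominal.
encoded = pd.get_dummies(df, columns=["Gender", "Employment Status"], dtype=int)
print(encoded)
```

For linear models, one dummy column per variable is often dropped (for example with `drop_first=True`) to avoid perfectly collinear features.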

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [16]:
import numpy as np

temperature = np.random.randint(10, 40, size=1000)
humidity = np.random.randint(30, 90, size=1000)
weather_condition = np.random.choice(["Sunny", "Cloudy", "Rainy"], size=1000)
wind_direction = np.random.choice(["North", "South", "East", "West"], size=1000)
data = {'temperature': temperature, 'humidity': humidity, 'weather_condition': weather_condition, 'wind_direction': wind_direction}
df = pd.DataFrame(data)

df.head()

,temperature,humidity,weather_condition,wind_direction
0,31,85,Sunny,East
1,37,61,Rainy,East
2,16,79,Cloudy,East
3,29,59,Cloudy,North
4,14,55,Cloudy,South


In [18]:
df.cov(numeric_only=True)

,temperature,humidity
temperature,74.559839,0.917847
humidity,0.917847,293.657737


In [None]:
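'''
Only temperature and humidity enter the covariance matrix; numeric_only=True drops the two categorical columns, and covariance
is not defined for unordered categories such as weather condition and wind direction.
The covariance between temperature and humidity is positive (about 0.92) but tiny compared with the variances (74.56 and
293.66), which corresponds to a correlation of roughly 0.006. In other words, there is essentially no linear relationship,
as expected because the two columns were generated independently at random.
'''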
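
To judge how strong a covariance is, it helps to rescale it into a correlation. The sketch below regenerates comparable random data with a fixed seed (an arbitrary choice, so its numbers will differ from the output above) and derives the correlation matrix directly from the covariance matrix.

```python
import numpy as np
import pandas as pd

# Seed chosen arbitrarily so the sketch is repeatable; values differ from the run above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature": rng.integers(10, 40, size=1000),
    "humidity": rng.integers(30, 90, size=1000),
})

cov = df.cov()

# Correlation = covariance divided by the product of the standard deviations.
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)

print(cov)
print(corr_from_cov)
```

Because the two columns are drawn independently, the off-diagonal correlation should sit close to zero whatever the sign of the covariance happens to be.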