# 21 MARCH ASSIGNMENT

Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Both ordinal encoding and label encoding convert categorical data into numerical values. They differ in how they assign a value to each category.

Ordinal encoding assigns each category an integer that reflects its order or rank. For example, a categorical variable "education" with the three categories "high school," "college," and "graduate" could be encoded as 1, 2, and 3. The integers convey the order of the categories, but not necessarily a fixed numerical distance between them.

Label encoding also assigns each category a unique integer, but the assignment is arbitrary. scikit-learn's LabelEncoder, for example, numbers the categories in sorted order, so the integers only identify the category and say nothing about rank.

The choice between ordinal encoding and label encoding depends on the properties of the data. Where there is a natural order or hierarchy between the categories, as in the "education" example, where "graduate" ranks above "college" and "high school," ordinal encoding is more appropriate. Label encoding is better suited to categories with no inherent order, such as colours or country names, although a model that treats the integers as magnitudes may still read a spurious order into them.

Assume we have a categorical variable "income level" with three levels: low, medium, and high. We may use ordinal encoding here since the levels have a natural order. In contrast, if we have a categorical variable "preferred colour" with categories "red," "green," and "blue," we might use label encoding because the colours have no inherent order or hierarchy.
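
The two cases above can be sketched with scikit-learn as follows. This is a minimal illustration rather than a definitive recipe: the sample rows and the column names `income_level` and `preferred_colour` are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    'income_level': ['low', 'high', 'medium', 'low'],
    'preferred_colour': ['red', 'green', 'blue', 'green'],
})

# Ordinal encoding: the category order is stated explicitly (low < medium < high)
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['income_level_encoded'] = (
    ordinal.fit_transform(df[['income_level']]).ravel().astype(int)
)

# Label encoding: integers follow the sorted order of the labels
label = LabelEncoder()
df['preferred_colour_encoded'] = label.fit_transform(df['preferred_colour'])

print(df)
```

With ordinal encoding the integers follow the order we supplied, while the label-encoded integers simply follow alphabetical order (blue, green, red).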

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

Target Guided Ordinal Encoding is a method for assigning numerical values to categories based on their relationship to the target variable. The objective is to give categories with similar target values similar codes. This can help capture the association between the categorical variable and the target variable, which can improve the machine learning model's performance.

The following is the general procedure for Target Guided Ordinal Encoding:

1. For each category of the categorical variable, calculate the mean of the target variable over all rows where that category appears.
2. Sort the categories in ascending order of their mean target values.
3. Assign each category a numerical value according to its position in the sorted list.

For example, suppose we have a categorical variable "region" with the categories "North", "South", "East", and "West", and the target variable is the income of customers. To encode "region" with Target Guided Ordinal Encoding, we would calculate the mean income for each region, sort the regions by that mean, and assign 1 to the region with the lowest mean income up to 4 for the region with the highest.
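
The three steps can be sketched in pandas as follows. The income figures are invented purely for illustration, and the names `mean_income`, `region_rank`, and `region_encoded` are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical customer data; the income values are made up for illustration
df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West',
               'North', 'South', 'East', 'West'],
    'income': [52000, 38000, 61000, 45000,
               48000, 42000, 59000, 47000],
})

# Step 1: mean of the target for each category
mean_income = df.groupby('region')['income'].mean()

# Step 2: sort the categories by their mean target value (ascending)
ordered_regions = mean_income.sort_values().index

# Step 3: assign each category its rank in the sorted list, starting at 1
region_rank = {region: rank
               for rank, region in enumerate(ordered_regions, start=1)}
df['region_encoded'] = df['region'].map(region_rank)

print(region_rank)
```

In a real project the mapping would normally be learned from the training data only and then applied to the test data, so that the encoding does not leak information about the test targets.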


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

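Covariance is a statistical measure of how two random variables vary together. When the two variables tend to increase and decrease together, their covariance is positive. When one tends to increase as the other decreases, their covariance is negative. If the variables are independent, their covariance is 0, although a covariance of 0 does not by itself imply independence. The magnitude of the covariance depends on the units of the variables, so it indicates the direction of a linear relationship rather than its strength.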

Covariance is important in statistical analysis for several reasons. First, it indicates the direction of the relationship between two variables: a positive covariance suggests that the variables tend to move together, whereas a negative covariance indicates that they tend to move in opposite directions. Second, covariance is used to calculate the correlation coefficient, a standardised measure of the strength and direction of the linear relationship between two variables. Third, covariance is used in regression models, which predict the value of one variable from the value of another. Finally, covariance is needed to compute the variance of a linear combination of two or more random variables.


The covariance between two random variables X and Y is calculated as follows:

1. Find the mean of X (μX) and the mean of Y (μY).

2. Subtract the X mean from each X value, and the Y mean from each Y value. This yields the deviations of X and Y from their respective means.

3. Multiply each X deviation by the corresponding Y deviation. This gives the products of the deviations.

4. Add the products of the deviations together.

5. Divide the sum by N - 1, where N is the total number of observations. Dividing by N - 1 gives the sample covariance.

The formula for covariance can be written as:

Cov(X, Y) = Σ((Xi - μX) * (Yi - μY)) / (N - 1)
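
As a quick check of the formula, the calculation can be carried out by hand in NumPy and compared with `np.cov`. The values of `x` and `y` below are hypothetical and chosen only to keep the arithmetic simple.

```python
import numpy as np

# Hypothetical sample values for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Multiply the deviations, sum the products, and divide by N - 1
cov_manual = (dx * dy).sum() / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)
```

Both values agree because `np.cov` also divides by N - 1 by default.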

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic']
})

# Fit a separate encoder per column so each column's mapping stays available
encoders = {}
for col in data.columns:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

print(data)
```

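```
   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
```

LabelEncoder sorts the distinct values of each column alphabetically and numbers them from 0. For Color this gives blue = 0, green = 1, red = 2; for Size, large = 0, medium = 1, small = 2; for Material, metal = 0, plastic = 1, wood = 2. Each output row holds the codes for the corresponding input row, so the first row (red, small, wood) becomes (2, 2, 2). Note that the codes for Size follow alphabetical order, not the natural order small < medium < large.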


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 40, 45, 50],
    'Income': [40000, 60000, 80000, 100000, 120000],
    'Education': [12, 14, 16, 18, 20]
})

# rowvar=False treats each column as a variable and each row as an observation
cov_matrix = np.cov(data, rowvar=False)

print(cov_matrix)
```


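```
[[1.075e+02 3.250e+05 3.250e+01]
 [3.250e+05 1.000e+09 1.000e+05]
 [3.250e+01 1.000e+05 1.000e+01]]
```

The rows and columns are in the order Age, Income, Education. The diagonal holds the variances: 107.5 for Age, 1.0e9 for Income, and 10 for Education. The off-diagonal entries are all positive (325,000 for Age and Income, 32.5 for Age and Education, 100,000 for Income and Education), so in this sample the three variables tend to increase together. The magnitudes reflect the units of each variable, so they cannot be compared with one another directly; the correlation coefficient is the appropriate measure of strength.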
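
Because covariances depend on the units of the variables, it can help to rescale them to correlations. The following sketch divides each covariance by the product of the two standard deviations; the names `std` and `corr_matrix` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 40, 45, 50],
    'Income': [40000, 60000, 80000, 100000, 120000],
    'Education': [12, 14, 16, 18, 20]
})

cov_matrix = np.cov(data, rowvar=False)

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Divide each covariance by the product of the two standard deviations
corr_matrix = cov_matrix / np.outer(std, std)

print(pd.DataFrame(corr_matrix, index=data.columns, columns=data.columns).round(3))
```

The result should match `data.corr()`, which computes the same Pearson correlations directly.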


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In a machine learning project, we must transform categorical variables to numerical values so that they may be used as inputs to machine learning algorithms. There are numerous encoding methods available, and the one chosen relies on the nature of the variable and the project's unique requirements. We have three categorical variables in this case: gender, education level, and employment status. Below are some potential encoding techniques for each variable, along with reasons why they would be appropriate:

Gender: Since Gender is a binary categorical variable with just two possible values (Male/Female), we may convert it to a number using binary encoding. Each category is assigned either 0 or 1. In this scenario, we can assign 0 to Male and 1 to Female. Binary encoding is appropriate here because it keeps the variable's binary nature while avoiding needless ordinal relationships between the categories.

Education Level: Because Education Level is an ordinal categorical variable with four possible values (High School/Bachelor's/Master's/PhD), we may convert it to a numerical value using ordinal encoding. Ordinal encoding gives each category a numerical value according to its order or rank. In this situation, we may assign 0 to High School, 1 to Bachelor's, 2 to Master's, and 3 to PhD. Ordinal encoding is appropriate here since it retains the ordinal relationships between the categories.

Employment Status: Because Employment Status is treated here as a nominal categorical variable with three possible values (Unemployed/Part-Time/Full-Time), we can convert it to numerical values using one-hot encoding. One-hot encoding generates a binary vector for each category, with 1 representing the presence of the category and 0 representing its absence. We may generate three binary vectors in this case: [1, 0, 0] for Unemployed, [0, 1, 0] for Part-Time, and [0, 0, 1] for Full-Time. One-hot encoding is appropriate here because it avoids creating unnecessary ordinal relationships between the categories and allows the machine learning algorithm to treat each category independently.


Overall, the encoding method chosen is determined by the nature of the categorical variable and the project's specific requirements. Selecting the appropriate encoding technique matters because the wrong one can introduce a spurious order or relationship between the categories.
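
The three choices can be sketched with pandas as follows. The sample rows are hypothetical, and using `map`, `pd.Categorical`, and `pd.get_dummies` is one possible way to apply these encodings, not the only one.

```python
import pandas as pd

# Hypothetical rows, invented for illustration
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ["Bachelor's", 'PhD', 'High School', "Master's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# Binary encoding for Gender: Male -> 0, Female -> 1
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Ordinal encoding for Education Level, with the order stated explicitly
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
df['Education Level'] = pd.Categorical(
    df['Education Level'], categories=education_order, ordered=True
).codes

# One-hot encoding for Employment Status: one 0/1 column per category
df = pd.get_dummies(df, columns=['Employment Status'], dtype=int)

print(df)
```

Stating the education order explicitly avoids relying on alphabetical order, which would not match the intended ranking.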

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

```python
import numpy as np
import pandas as pd

data = {'Temperature': [25, 28, 27, 26, 24],
        'Humidity': [50, 60, 55, 45, 70],
        'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
        'Wind Direction': ['North', 'South', 'East', 'West', 'South']}

df = pd.DataFrame(data)

# Covariance is computed for the two continuous variables only;
# np.cov expects one variable per row, hence the transpose
cov_matrix = np.cov(df[['Temperature', 'Humidity']].T)
print(cov_matrix)
```

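```
[[ 2.5  -3.75]
 [-3.75 92.5 ]]
```

The variance of Temperature is 2.5 and the variance of Humidity is 92.5. Their covariance is -3.75, so in this sample humidity tends to be lower when temperature is higher. Weather Condition and Wind Direction are nominal variables, so covariance is not defined for them until they are converted to numbers, for example with one-hot encoding.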
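
One way to include the categorical variables, sketched here under the assumption that one-hot encoding is acceptable for this analysis, is to turn each category into a 0/1 column and then compute the covariance of every pair of columns. The names `encoded` and `cov_all` are introduced for this sketch.

```python
import pandas as pd

data = {'Temperature': [25, 28, 27, 26, 24],
        'Humidity': [50, 60, 55, 45, 70],
        'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
        'Wind Direction': ['North', 'South', 'East', 'West', 'South']}

df = pd.DataFrame(data)

# One-hot encode the nominal columns so that every column is numeric
encoded = pd.get_dummies(
    df, columns=['Weather Condition', 'Wind Direction'], dtype=float
)

# DataFrame.cov computes the sample covariance (N - 1) for every pair of columns
cov_all = encoded.cov()

print(cov_all.round(2))
```

The covariance between a continuous variable and a 0/1 column is positive when the variable tends to be above its mean for rows in that category and negative when it tends to be below. With only five rows, these values are illustrative rather than conclusive.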
