Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

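Ordinal Encoding and Label Encoding are two methods for converting categorical variables into integers so that they can be fed to machine learning models, which typically require numerical input. Both replace each category with an integer code; the difference is whether the order of those codes carries meaning. In Ordinal Encoding the mapping follows an order that you specify, so a larger code means a higher rank. In Label Encoding the codes are assigned arbitrarily (scikit-learn's LabelEncoder uses the sorted order of the category names), so the numeric order has no meaning.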

When to choose one over the other:

Ordinal Encoding is suitable when the categorical variable has an inherent order or rank. For example, "Low", "Medium", and "High" have a natural order, and mapping them to 0, 1, and 2 preserves that information for the model.

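Label Encoding is appropriate when there is no inherent order and the categories are just arbitrary labels. For example, "Red", "Green", and "Blue" are three different categories without any meaningful order. Label Encoding is most commonly applied to the target variable of a classifier. When it is applied to an input feature, linear and distance-based models may read the integer codes as an order that does not exist, so one-hot encoding is often the safer choice for such features; tree-based models are usually less sensitive to this.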
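The contrast can be sketched with scikit-learn. This is a minimal illustration, not the only way to do it: the sample values and the variable names (`levels`, `colors`) are made up for the example, and the order passed to `categories` is an assumption about what "Low < Medium < High" should mean.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# OrdinalEncoder expects a 2D input: one row per sample, one column per feature.
levels = [["Low"], ["High"], ["Medium"], ["Low"]]

# Ordinal encoding: the order Low < Medium < High is stated explicitly.
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
levels_encoded = ordinal.fit_transform(levels).ravel()
print(levels_encoded)

# Label encoding: codes follow the sorted category names, so the order is arbitrary.
colors = ["Red", "Green", "Blue", "Green"]
label = LabelEncoder()
colors_encoded = label.fit_transform(colors)
print(colors_encoded)
print(list(label.classes_))
```

In the first case the code 2 really does mean "higher than" the code 1; in the second, "Red" receives the largest code only because it sorts last alphabetically.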

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

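Target Guided Ordinal Encoding encodes a categorical variable using the target variable: compute the mean of the target for each category, rank the categories by that mean, and assign each category its rank as an integer. It is a data-driven approach for supervised learning, and it is useful when a categorical variable (often one with many categories) has no natural order of its own but its categories differ systematically in the target. Because the encoding uses the target, the per-category means should be computed on the training data only, so that no target information leaks into validation or test data. A closely related variant, mean (target) encoding, uses the per-category mean itself as the encoded value.
Example (mean-encoding variant, with city as the feature and price as the target):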

In [1]:
import pandas as pd
df = pd.DataFrame({'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'], 'price': [200,150,300,250,180,320]})
df

,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [7]:
mean_price = df.groupby('city')['price'].mean().to_dict()
df['city_encoded'] = df['city'].map(mean_price)
df

,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
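The cell above stores the mean price itself as the encoded value. To obtain integer ranks instead, the means can be sorted and numbered. The following is a minimal sketch of that step on the same data; the names `city_rank` and the choice of starting ranks at 1 are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted ascending so the lowest mean gets rank 1.
mean_price = df.groupby("city")["price"].mean().sort_values()

# Map each city to its position in the sorted list of means.
city_rank = {city: rank for rank, city in enumerate(mean_price.index, start=1)}
df["city_rank"] = df["city"].map(city_rank)
print(city_rank)
print(df)
```

In a real project the mapping would be learned from the training split and then applied unchanged to validation and test data.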


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that describes the extent to which two random variables change together. Its sign gives the direction of the linear relationship between the two variables. A positive covariance indicates that when one variable increases, the other tends to increase as well, and when one variable decreases, the other tends to decrease as well. A negative covariance indicates an inverse relationship, where one variable tends to increase as the other decreases, and vice versa. A covariance of zero indicates no linear relationship between the variables, although a non-linear relationship may still exist. The magnitude of the covariance depends on the units of the variables, so it cannot be read directly as the strength of the relationship; dividing by the two standard deviations gives the correlation coefficient, which is unit-free.

Covariance is an important concept in statistical analysis for several reasons:

Relationship Assessment: Covariance shows whether two variables tend to move together or in opposite directions. To judge how strong the association is, the covariance is usually standardized into a correlation coefficient.

Portfolio Diversification: In finance, covariance is used to assess the diversification benefits of combining different assets in a portfolio. The lower the covariance between the assets, the lower the variance of the combined portfolio, so mixing assets with low or negative covariance reduces overall risk for a given expected return.

Regression Analysis: Covariance is used to estimate the coefficients of a regression model. In simple linear regression the least-squares slope equals Cov(X, Y) / Var(X); in multiple regression the coefficient estimates are built from the covariances among the predictors and between the predictors and the response.

Risk Management: Covariance is used in risk management to assess the risk of multiple variables collectively. It helps to understand how changes in one variable may affect other variables, which is important in managing risks associated with different factors.
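Two of the uses listed above can be checked numerically. The sketch below uses made-up numbers (the arrays `x`, `y`, `returns_a`, `returns_b` and the weight `w` are all hypothetical) to show that the regression slope equals Cov(X, Y) / Var(X) and that the variance of a two-asset portfolio depends on the covariance between the assets.

```python
import numpy as np

# Hypothetical data, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression: slope = Cov(X, Y) / Var(X), using n - 1 in both.
cov_xy = np.cov(x, y)[0, 1]
slope = cov_xy / np.var(x, ddof=1)
slope_polyfit = np.polyfit(x, y, 1)[0]  # least-squares fit for comparison
print(slope, slope_polyfit)

# Two-asset portfolio: Var(w*A + (1-w)*B) = w^2 Var(A) + (1-w)^2 Var(B) + 2w(1-w) Cov(A, B)
returns_a = np.array([0.02, -0.01, 0.03, 0.00, 0.01])
returns_b = np.array([-0.01, 0.02, -0.02, 0.01, 0.00])
w = 0.5
cov_ab = np.cov(returns_a, returns_b)[0, 1]
combined_var = (
    w**2 * np.var(returns_a, ddof=1)
    + (1 - w) ** 2 * np.var(returns_b, ddof=1)
    + 2 * w * (1 - w) * cov_ab
)
direct_var = np.var(w * returns_a + (1 - w) * returns_b, ddof=1)
print(combined_var, direct_var)
```

The last term of the portfolio formula is where covariance enters: when it is negative, it pulls the combined variance down.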

The sample covariance is calculated using the following formula:

Cov(X, Y) = Σ [(xi - x̄) * (yi - ȳ)] / (n - 1)

Where:

Cov(X, Y) is the covariance between variables X and Y.
xi and yi are the individual data points of X and Y, respectively.
x̄ and ȳ are the sample means of X and Y, respectively.
n is the number of data points in X and Y.

Dividing by n - 1 gives the unbiased sample estimate; when the data cover the entire population, the sum is divided by n instead.
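The formula can be applied step by step and compared with NumPy's result. This sketch borrows the temperature and humidity values used in Q7; the variable names are chosen for illustration.

```python
import numpy as np

x = np.array([25, 30, 28, 20, 22], dtype=float)  # Temperature values from Q7
y = np.array([50, 60, 55, 40, 45], dtype=float)  # Humidity values from Q7

n = len(x)
# Sum of products of deviations from the means, divided by n - 1.
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y).
numpy_cov = np.cov(x, y)[0, 1]
print(manual_cov, numpy_cov)
```

Both values agree because `np.cov` also divides by n - 1 by default.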

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [12]:
dic = {'Color': ['red', 'green', 'blue'], 'Size': ['small', 'medium','large'],'Material':['wood', 'metal', 'plastic']}

In [9]:
import pandas as pd
df = pd.DataFrame(dic)  # use the dictionary defined above, not the built-in dict type
df

,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic


In [16]:
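# First method: one-hot encoding, shown for comparison.
# It creates one 0/1 column per category rather than a single integer label.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()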

In [22]:
encoded = encoder.fit_transform(df[['Color','Size','Material']]).toarray()
encoded_df = pd.DataFrame(encoded,columns = encoder.get_feature_names_out())
encoded_df

,Color_blue,Color_green,Color_red,Size_large,Size_medium,Size_small,Material_metal,Material_plastic,Material_wood
0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [None]:
# Second method: label encoding with LabelEncoder, as the question asks

In [23]:
from sklearn.preprocessing import LabelEncoder

color = ['red', 'green', 'blue', 'red', 'blue', 'green']
size = ['small', 'medium', 'large', 'medium', 'small', 'large']
material = ['wood', 'metal', 'plastic', 'plastic', 'wood', 'metal']

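# A single LabelEncoder is reused; each fit_transform call refits it on the new column,
# so only the last mapping (material) remains stored on the encoder afterwards.
label_encoder = LabelEncoder()

color_encoded = label_encoder.fit_transform(color)
print("Encoded Color:", color_encoded)

size_encoded = label_encoder.fit_transform(size)
print("Encoded Size:", size_encoded)

material_encoded = label_encoder.fit_transform(material)
print("Encoded Material:", material_encoded)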

Encoded Color: [2 1 0 2 0 1]
Encoded Size: [2 1 0 1 2 0]
Encoded Material: [2 0 1 1 2 0]
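Explanation of the output: LabelEncoder sorts the distinct values of each column alphabetically and numbers them from 0. For Color this gives blue = 0, green = 1, red = 2, so ['red', 'green', 'blue', 'red', 'blue', 'green'] becomes [2 1 0 2 0 1]. Note that for Size the alphabetical order (large, medium, small) does not match the natural order small < medium < large. The mapping for every column can be inspected through the fitted `classes_` attribute, as in this sketch (the `columns` and `mappings` dictionaries are names introduced here for illustration):

```python
from sklearn.preprocessing import LabelEncoder

columns = {
    "Color": ["red", "green", "blue", "red", "blue", "green"],
    "Size": ["small", "medium", "large", "medium", "small", "large"],
    "Material": ["wood", "metal", "plastic", "plastic", "wood", "metal"],
}

mappings = {}
for name, values in columns.items():
    encoder = LabelEncoder()  # one encoder per column keeps each fitted mapping
    encoder.fit(values)
    # classes_ holds the sorted categories; the position of each is its integer code.
    mappings[name] = {str(label): code for code, label in enumerate(encoder.classes_)}
    print(name, mappings[name])
```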


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [26]:
import numpy as np
age = [12,34,56,78,43,25]
income = [2000,67000,78000,40000,80000,75000]
education = [8,14,16,18,14,16]
data = np.vstack((age,income,education))
cov_matrix = np.cov(data)
print(f"The covariance matrix is {cov_matrix}")

The covariance matrix is [[5.48666667e+02 1.93800000e+05 6.38666667e+01]
 [1.93800000e+05 9.41600000e+08 6.80000000e+04]
 [6.38666667e+01 6.80000000e+04 1.18666667e+01]]
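Interpretation: the rows and columns follow the order Age, Income, Education. The diagonal entries are the variances of each variable (about 548.7 for Age, 9.416e+08 for Income, and 11.9 for Education). All off-diagonal entries are positive, so in this sample higher age, higher income, and more years of education tend to occur together. The raw magnitudes mostly reflect the units (Income is measured in much larger numbers than Education), so they should not be compared with each other. One way to put the three pairs on a common scale is to convert the covariance matrix to a correlation matrix, as sketched below on the same data:

```python
import numpy as np

age = [12, 34, 56, 78, 43, 25]
income = [2000, 67000, 78000, 40000, 80000, 75000]
education = [8, 14, 16, 18, 14, 16]
data = np.vstack((age, income, education))

cov_matrix = np.cov(data)

# Correlation = covariance divided by the product of the two standard deviations.
std = np.sqrt(np.diag(cov_matrix))
corr_from_cov = cov_matrix / np.outer(std, std)

# np.corrcoef computes the same matrix directly.
corr_matrix = np.corrcoef(data)
print(np.round(corr_matrix, 2))
```

With only six observations these values describe the sample rather than establish a general relationship.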


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

The right encoding depends on whether the categories have a meaningful order and on the machine learning algorithm you plan to use. For these three variables: "Gender" has no order, so One-Hot Encoding fits (with two categories, a single 0/1 column is enough). "Education Level" has a clear order, so Ordinal Encoding with an explicitly specified order fits. "Employment Status" can be one-hot encoded if it is treated as a set of unordered labels, or ordinally encoded if the progression Unemployed < Part-Time < Full-Time is meaningful for the problem. The methods are described below:

One-Hot Encoding: This method creates a binary (0/1) indicator variable for each category in a categorical variable. For example, "Gender" would be encoded as two binary variables, "Male" and "Female", with values of 0 or 1 to indicate the absence or presence of each category; because the two columns are redundant, one of them can be dropped. One-hot encoding is suitable when there is no ordinal relationship or hierarchy among the categories, since it does not impose any order on them.

Ordinal Encoding: This method assigns an integer to each category following an order that you specify. For example, "Education Level" could be encoded as 0 for "High School", 1 for "Bachelor's", 2 for "Master's", and 3 for "PhD". It is suitable when there is an ordinal relationship or hierarchy among the categories, so that the numerical values represent the level of each category. Plain Label Encoding also produces integers, but scikit-learn's LabelEncoder assigns them alphabetically, which would not reproduce this order.

Binary Encoding: This method first assigns an integer to each category and then writes that integer in binary, using one column per binary digit. For example, "Employment Status" could be encoded with the integer codes 1, 2, and 3 written as 01, 10, and 11, which needs two columns instead of the three that one-hot encoding requires. Binary encoding is suitable when a variable has many categories and you want to minimize the number of columns; with only three categories the saving is small, so it is not needed here.

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [39]:
import numpy as np
import pandas as pd
data = {'Temperature': [25, 30, 28, 20, 22],
        'Humidity': [50, 60, 55, 40, 45],
        'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Sunny'],
        'Wind Direction': ['North', 'South', 'East', 'West', 'North']}
df = pd.DataFrame(data)
cov_temp_hum = np.cov(df['Temperature'], df['Humidity'])[0][1]
print(cov_temp_hum)


32.5
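Interpretation: the covariance between Temperature and Humidity is 32.5, which is positive, so in this sample warmer observations tend to be more humid. Covariance is defined only for numerical variables, so "Weather Condition" and "Wind Direction" cannot enter the calculation as text labels, and assigning them arbitrary integers would produce values with no meaning. One possible workaround, sketched below, is to one-hot encode them and compute the covariance of the numerical variables with each 0/1 indicator column. The use of `pd.get_dummies` here is a choice made for this sketch, not part of the original analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 30, 28, 20, 22],
    "Humidity": [50, 60, 55, 40, 45],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Cloudy", "Sunny"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# Turn each category into its own 0/1 column.
indicators = pd.get_dummies(df[["Weather Condition", "Wind Direction"]], dtype=float)

# Covariance matrix over the numerical columns plus the indicators.
numeric = pd.concat([df[["Temperature", "Humidity"]], indicators], axis=1)
cov_matrix = numeric.cov()
print(cov_matrix.loc[["Temperature", "Humidity"]].round(2))
```

A positive covariance between Temperature and an indicator column means the temperature tends to be above average when that category is present, and a negative value means the opposite. With five observations, these numbers are illustrative only.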
