Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Label Encoding:

Definition: Label Encoding assigns a unique numerical label to each category in a categorical variable.
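Application: It is typically used when the categorical variable does not have an inherent order or hierarchy among its categories. Because the integer codes are arbitrary, a model that treats the column as numeric may read a false order into them.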
Example: Consider a categorical variable representing different fruits: 'Apple', 'Orange', 'Banana'. Using Label Encoding, these categories might be encoded as 0 for 'Apple', 1 for 'Orange', and 2 for 'Banana'. Here, the numerical labels are arbitrary and don't imply any specific order.

Ordinal Encoding:

Definition: Ordinal Encoding assigns numerical labels that follow the inherent order or hierarchy among the categories of a categorical variable.
Application: It is employed when the categorical variable has a natural order or ranking among its categories.
Example: Suppose a categorical variable represents education levels: 'High School', 'Bachelor's Degree', 'Master's Degree', 'Ph.D.'. Using Ordinal Encoding, these categories might be encoded as 0 for 'High School', 1 for 'Bachelor's Degree', 2 for 'Master's Degree', and 3 for 'Ph.D.'. Here, the numerical labels represent a specific order or hierarchy of education levels.

When to Choose One Over the Other:

Choosing Label Encoding:

Use Label Encoding when the categories have no meaningful order or hierarchy, for instance nominal variables such as types of fruits, colors, or genders.

Choosing Ordinal Encoding:

Use Ordinal Encoding when the categories have a natural order or hierarchy, for example education levels, rankings (e.g., low, medium, high), or economic status (e.g., low-income, middle-income, high-income).
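
The contrast can be sketched with scikit-learn, assuming its `LabelEncoder` and `OrdinalEncoder` classes. Note that `LabelEncoder` numbers categories in sorted order, so 'Orange' receives 2 here rather than the hand-assigned 1 in the fruit example, while `OrdinalEncoder` follows whatever ranking is passed through `categories`. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Nominal example: LabelEncoder numbers the classes in sorted order
fruits = ['Apple', 'Orange', 'Banana']
fruit_codes = LabelEncoder().fit_transform(fruits)
print(fruit_codes)  # [0 2 1]: Apple=0, Banana=1, Orange=2

# Ordinal example: pass the ranking explicitly so the codes follow it
education_order = ['High School', "Bachelor's Degree", "Master's Degree", 'Ph.D.']
education = [["Master's Degree"], ['High School'], ['Ph.D.'], ["Bachelor's Degree"]]
ordinal_encoder = OrdinalEncoder(categories=[education_order])
education_codes = ordinal_encoder.fit_transform(education)
print(education_codes.ravel())  # [2. 0. 3. 1.]
```

Without the explicit `categories` argument, `OrdinalEncoder` would also fall back to sorted order, which rarely matches the real ranking.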

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

This encoding technique is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

Each category is replaced with a numerical value based on the mean or median of the target variable for that category. For example, a city column can be encoded with the mean price observed for each city.

In [16]:
import pandas as pd

In [17]:
df = pd.DataFrame({'city': ['Newyork','london','paris','tokyo','Newyork','paris'],
                  'price' : [200,150,300,250,180,320]})

In [18]:
df

      city  price
0  Newyork    200
1   london    150
2    paris    300
3    tokyo    250
4  Newyork    180
5    paris    320


In [19]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [20]:
mean_price

{'Newyork': 190.0, 'london': 150.0, 'paris': 310.0, 'tokyo': 250.0}

In [21]:
df['city_encoded']=df['city'].map(mean_price)

In [22]:
df

      city  price  city_encoded
0  Newyork    200         190.0
1   london    150         150.0
2    paris    300         310.0
3    tokyo    250         250.0
4  Newyork    180         190.0
5    paris    320         310.0
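
The cells above store the mean price itself. When integer ordinal codes are wanted instead, a common variant ranks the categories by that mean and maps each one to its rank. The following is a minimal sketch of that variant on the same data; `ordered_cities`, `rank_map`, and `city_rank` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Newyork', 'london', 'paris', 'tokyo', 'Newyork', 'paris'],
                   'price': [200, 150, 300, 250, 180, 320]})

# Mean target per category, sorted from lowest to highest
ordered_cities = df.groupby('city')['price'].mean().sort_values().index

# Rank 0 goes to the city with the lowest mean price
rank_map = {city: rank for rank, city in enumerate(ordered_cities)}
df['city_rank'] = df['city'].map(rank_map)

print(rank_map)  # {'london': 0, 'Newyork': 1, 'tokyo': 2, 'paris': 3}
```

In a real project the mapping would normally be learned on the training split only and then applied to the test split, so that target values from the test data do not leak into the feature.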


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

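Covariance measures how two random variables vary together. A positive covariance means the variables tend to increase together, a negative covariance means one tends to decrease as the other increases, and a value near zero means there is no linear relationship. It is important in statistical analysis because it shows the direction of the linear relationship between variables and is the basis for the correlation coefficient and the covariance matrix.

For n paired observations, the sample covariance is calculated as cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1). Dividing the covariance by the product of the two standard deviations gives the Pearson correlation coefficient; the Spearman rank correlation applies the same calculation to the ranks of the values.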
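
As a quick check of the sample covariance formula, the sketch below computes it by hand and compares it with NumPy's `np.cov`. The values of `x` and `y` are made up for illustration.

```python
import numpy as np

# Illustrative values only
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Sample covariance: sum of products of deviations, divided by n - 1
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the full 2x2 matrix; entry [0, 1] is cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

print(manual_cov, numpy_cov)  # 5.0 5.0
```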

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [23]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample data
data = {
    'Color': ['red', 'green', 'blue', 'red'],
    'Size': ['small', 'medium', 'large', 'medium'],
    'Material': ['wood', 'metal', 'plastic', 'wood']
}

# Creating a DataFrame
df = pd.DataFrame(data)

# Apply a fresh LabelEncoder to each column so that every fitted encoder
# (and its classes_) stays available, e.g. for inverse_transform later
encoders = {}
for col in list(df.columns):  # all three columns are categorical
    encoders[col] = LabelEncoder()
    df[col + '_encoded'] = encoders[col].fit_transform(df[col])

# Displaying the encoded DataFrame
print(df)


   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
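3    red  medium     wood              2             1                 2

LabelEncoder sorts each column's categories alphabetically and numbers them from 0, so Color becomes blue = 0, green = 1, red = 2; Size becomes large = 0, medium = 1, small = 2; and Material becomes metal = 0, plastic = 1, wood = 2. The codes are identifiers only. For Size, the alphabetical codes run opposite to the natural order small < medium < large, so they should not be read as a ranking.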
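
To see where the codes come from, the fitted encoder can be inspected directly. This is a small sketch for the Color column only, assuming scikit-learn's `classes_` attribute and `inverse_transform` method; `mapping` is a name introduced here.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'green', 'blue', 'red']
encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted categories; the position in it is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```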


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [24]:
import numpy as np

# Sample data for Age, Income, and Education Level (just for illustration)
# Replace this with your actual dataset
age = [30, 40, 25, 35, 28]
income = [50000, 60000, 45000, 55000, 48000]
education_level = [12, 16, 10, 14, 12]

# Create a numpy array from the data
data = np.array([age, income, education_level])

# Calculate the covariance matrix
covariance_matrix = np.cov(data)

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
[[3.53e+01 3.53e+04 1.34e+01]
 [3.53e+04 3.53e+07 1.34e+04]
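 [1.34e+01 1.34e+04 5.20e+00]]

The diagonal holds the sample variances: 35.3 for Age, about 3.53e+07 for Income, and 5.2 for Education level. Every off-diagonal covariance is positive, so in this sample data Age, Income, and Education level all tend to increase together. The magnitudes are not comparable with each other, because covariance carries the units of both variables; the Income entries are large only because income is measured on a much larger scale. Comparing the strength of the relationships requires a scale-free measure such as the correlation coefficient.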
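
Because covariance depends on the units of each variable, one way to compare the pairs is to rescale it into a correlation matrix. The sketch below uses NumPy's `np.corrcoef` on the same illustrative data. In this sample, income happens to equal 1000 * age + 20000 exactly, so the Age/Income correlation comes out as 1.

```python
import numpy as np

age = [30, 40, 25, 35, 28]
income = [50000, 60000, 45000, 55000, 48000]
education_level = [12, 16, 10, 14, 12]

# Each row is one variable, matching the layout passed to np.cov
data = np.array([age, income, education_level])

# Pearson correlation = covariance / (std_x * std_y), always in [-1, 1]
correlation_matrix = np.corrcoef(data)
print(np.round(correlation_matrix, 2))
```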


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

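For Gender, we can use binary encoding (for example Male = 0, Female = 1), because there are only two categories and no order between them.
For Education Level, we use ordinal encoding, because the categories have a natural rank (High School < Bachelor's < Master's < PhD).
For Employment Status, we use nominal (one-hot, OHE) encoding, treating the categories as unranked.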
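
A minimal sketch of these three choices, assuming pandas and scikit-learn. The rows of the DataFrame are hypothetical and exist only to make the example runnable; the new column names are illustrative as well.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical rows, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'Education Level': ["Master's", 'High School', 'PhD'],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time'],
})

# Binary: two categories map to 0/1 (which one gets 1 is arbitrary)
df['Gender_encoded'] = df['Gender'].map({'Male': 0, 'Female': 1})

# Ordinal: state the rank explicitly so the codes follow it
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
encoder = OrdinalEncoder(categories=[education_order])
df['Education_encoded'] = encoder.fit_transform(df[['Education Level']]).ravel()

# One-hot: one indicator column per employment status
df = pd.get_dummies(df, columns=['Employment Status'], dtype=int)
print(df)
```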

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [25]:
import numpy as np

# Sample data for Temperature and Humidity (just for illustration)
temperature = [25, 30, 28, 32, 27]
humidity = [60, 65, 62, 68, 63]

# Calculate the covariance between Temperature and Humidity
covariance_temp_humidity = np.cov(temperature, humidity)

print("Covariance between Temperature and Humidity:")
print(covariance_temp_humidity)


Covariance between Temperature and Humidity:
[[7.3  7.95]
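 [7.95 9.3 ]]

The diagonal entries are the sample variances of Temperature (7.3) and Humidity (9.3); the off-diagonal entry, 7.95, is their covariance. It is positive, so in this sample humidity tends to rise as temperature rises. Weather Condition and Wind Direction are nominal categories, so covariance is not defined for them directly. They would first have to be encoded numerically (for example as one-hot indicator columns), and the resulting covariances would only show how the presence of each category co-varies with the other variables.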
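
One possible way to bring the categorical variables into the calculation is to one-hot encode them and then take pairwise covariances. The sketch below assumes pandas' `get_dummies` and `DataFrame.cov`. The Weather Condition and Wind Direction values are hypothetical, added only so the example runs; they are not part of the original data.

```python
import pandas as pd

# Temperature and Humidity repeat the sample above; the two categorical
# columns are hypothetical values added only for illustration
df = pd.DataFrame({
    'Temperature': [25, 30, 28, 32, 27],
    'Humidity': [60, 65, 62, 68, 63],
    'Weather Condition': ['Cloudy', 'Sunny', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North'],
})

# Turn each category into a 0/1 indicator column so covariance is defined
encoded = pd.get_dummies(df, columns=['Weather Condition', 'Wind Direction'], dtype=float)

# Pairwise sample covariance between every pair of columns
cov_matrix = encoded.cov()
print(cov_matrix.round(2))
```

Each indicator's covariance with Temperature or Humidity is positive when that category tends to appear alongside above-average values and negative when it appears alongside below-average values.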
