Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

In [23]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

data={'color':['red','green','yellow','red','green','yellow','red','green','yellow']}

df=pd.DataFrame(data)

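# LabelEncoder assigns codes in sorted (alphabetical) order: green=0, red=1, yellow=2
le=LabelEncoder()
df['Label_encoded']=le.fit_transform(df['color'])

# OrdinalEncoder follows the order given in `categories`: red=0, yellow=1, green=2
oe=OrdinalEncoder(categories=[['red','yellow','green']])
df['Ordinal_encoded']=oe.fit_transform(df[['color']])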
print(df)

    color  Label_encoded  Ordinal_encoded
0     red              1              0.0
1   green              0              2.0
2  yellow              2              1.0
3     red              1              0.0
4   green              0              2.0
5  yellow              2              1.0
6     red              1              0.0
7   green              0              2.0
8  yellow              2              1.0
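
A short sketch of when each encoder fits, using a hypothetical `size` feature and `label` target that are not part of the data above. OrdinalEncoder works on 2-D feature columns and accepts an explicit category order, so it suits features whose levels are ranked. LabelEncoder works on a 1-D array, always sorts the classes, and is intended by scikit-learn for target labels, where the order of the codes carries no meaning.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical data, for illustration only
df_sizes = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],
    'label': ['no', 'yes', 'yes', 'no'],
})

# Ranked feature: pass the order explicitly so that small < medium < large
size_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_codes = size_encoder.fit_transform(df_sizes[['size']]).ravel()

# Target labels: the codes are only identifiers, so sorted order is acceptable
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(df_sizes['label'])

print(size_codes)    # [0. 2. 1. 0.]
print(label_codes)   # [0 1 1 0]
```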


Target Guided Ordinal Encoding encodes a categorical variable using its relationship to the target variable. It is most useful when the categories have no reliable order of their own, or when an existing order is not evenly spaced: the mean (or median) of the target within each category supplies the order, so the encoded values carry information about the target.

Here's how Target Guided Ordinal Encoding works:

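Compute Mean/Median per Category: For each category, calculate the mean (or median) of the target variable within that category. This summarizes how the target behaves for that category.

Rank the Categories: Sort the categories by that statistic.

Assign Ordinal Labels: Replace each category with its rank. The example below instead maps each city directly to its mean price, a closely related variant (mean encoding) that preserves the same ordering.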

In [30]:
import pandas as pd
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})


In [31]:
df

       city  price
0  New York    200
1    London    150
2     Paris    300
3     Tokyo    250
4  New York    180
5     Paris    320


In [33]:
mean_value=df.groupby('city')['price'].mean().to_dict()

In [34]:
mean_value

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [35]:
df['City_Encoded']=df['city'].map(mean_value)

In [43]:
df.drop(columns=['city'],inplace=True)

In [44]:
df

   price  City_Encoded
0    200         190.0
1    150         150.0
2    300         310.0
3    250         250.0
4    180         190.0
5    320         310.0
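
The cells above map each city to its mean price. A minimal sketch of the rank-based form of the technique follows, using the same city and price values; the names `df_city`, `mean_price`, `rank_map` and `city_rank` are illustrative. In a real project the means would normally be computed on the training split only, so that the target does not leak into the encoded feature.

```python
import pandas as pd

df_city = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320],
})

# Mean target value per category, sorted from lowest to highest
mean_price = df_city.groupby('city')['price'].mean().sort_values()

# Rank 0 goes to the category with the lowest mean price
rank_map = {city: rank for rank, city in enumerate(mean_price.index)}

df_city['city_rank'] = df_city['city'].map(rank_map)
print(rank_map)
# {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}
```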


Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that quantifies the degree to which two variables change together. In other words, it measures the relationship and direction of change between two variables. It indicates whether the variables tend to increase or decrease together (positive covariance), move in opposite directions (negative covariance), or show no consistent pattern (near zero covariance).



In [6]:
import numpy as np

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Calculate covariance
covariance = np.cov(x, y)[0, 1]

print("Covariance:", covariance)


Covariance: -2.5
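
The same value can be reproduced from the definition of sample covariance, cov(x, y) = sum((x_i - mean(x)) * (y_i - mean(y))) / (n - 1), which is what np.cov computes by default. The sketch below uses the same x and y; `manual_cov` is an illustrative name.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

# Sample covariance: sum of products of deviations from the means, divided by n - 1
n = len(x)
manual_cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

print(manual_cov)  # -2.5
```

The negative sign reflects that y decreases as x increases.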


Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


data={'color':['red','green','blue'],'size':['small','medium','large'],'material':['wood','metal','plastic']}

df=pd.DataFrame(data)


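# fit_transform refits the encoder on each column; codes follow sorted (alphabetical) order
le=LabelEncoder()
df['C_E']=le.fit_transform(df['color'])      # blue=0, green=1, red=2
df['S_E']=le.fit_transform(df['size'])       # large=0, medium=1, small=2
df['M_e']=le.fit_transform(df['material'])   # metal=0, plastic=1, wood=2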

In [17]:
df

   color    size material  C_E  S_E  M_e
0    red   small     wood    2    2    2
1  green  medium    metal    1    1    0
2   blue   large  plastic    0    0    1
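
To explain the output: LabelEncoder sorts the distinct values of a column and numbers them from 0, so the codes reflect alphabetical order rather than any real ranking (for example, `small` gets 2 and `large` gets 0). A small sketch, assuming one encoder per column and using the illustrative names `columns` and `mappings`, recovers the category-to-code mapping from the fitted `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

columns = {
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic'],
}

# Fit one encoder per column and record which category each code stands for
mappings = {}
for name, values in columns.items():
    encoder = LabelEncoder().fit(values)
    mappings[name] = {str(category): code
                      for code, category in enumerate(encoder.classes_)}

print(mappings['color'])     # {'blue': 0, 'green': 1, 'red': 2}
print(mappings['size'])      # {'large': 0, 'medium': 1, 'small': 2}
print(mappings['material'])  # {'metal': 0, 'plastic': 1, 'wood': 2}
```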


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [18]:
import numpy as np

# Sample data
age = np.array([25, 30, 28, 35, 40])
income = np.array([50000, 60000, 55000, 70000, 80000])
education_level = np.array([12, 16, 14, 18, 20])

# Create a data matrix
data_matrix = np.vstack((age, income, education_level))

# Calculate covariance matrix
covariance_matrix = np.cov(data_matrix)

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
[[3.53e+01 7.15e+04 1.85e+01]
 [7.15e+04 1.45e+08 3.75e+04]
 [1.85e+01 3.75e+04 1.00e+01]]
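
Interpretation: the diagonal holds the variance of each variable, and every off-diagonal entry is positive, so age, income and education level tend to increase together in this sample. The magnitudes are not comparable, because covariance carries the units of both variables (income is measured in tens of thousands). One way to put the pairs on a common scale, sketched here as a supplementary step, is the correlation matrix from np.corrcoef.

```python
import numpy as np

age = np.array([25, 30, 28, 35, 40])
income = np.array([50000, 60000, 55000, 70000, 80000])
education_level = np.array([12, 16, 14, 18, 20])

# Rows are variables, as in the covariance example
data_matrix = np.vstack((age, income, education_level))

# Correlation rescales each covariance by the two standard deviations,
# giving unit-free values between -1 and 1
correlation_matrix = np.corrcoef(data_matrix)

print(np.round(correlation_matrix, 3))
```

For this sample all three pairwise correlations are close to 1, which indicates strong positive linear relationships.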


Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In [2]:
pip install category_encoders


Collecting category_encoders
  Downloading category_encoders-2.6.2-py2.py3-none-any.whl (81 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.6.2
Note: you may need to restart the kernel to use updated packages.


In [12]:
import pandas as pd
from category_encoders import BinaryEncoder
from sklearn.preprocessing import LabelEncoder,OrdinalEncoder


data = {
    "Gender": ['Male', 'Female', 'Male', 'Female'],
    "Education Level": ['High School', 'Bachelor', 'Masters', 'PhD'],
    "Employment Status": ['Unemployed', 'Part-Time', 'Full-Time', 'Unemployed']
}

df = pd.DataFrame(data)

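# Gender has two unordered levels, so binary encoding is sufficient
encoder = BinaryEncoder()
# Employment Status is treated as nominal here; LabelEncoder assigns codes alphabetically
le=LabelEncoder()
# Education Level is ordered, so pass the order explicitly;
# the default would sort alphabetically (Bachelor before High School)
oe=OrdinalEncoder(categories=[['High School', 'Bachelor', 'Masters', 'PhD']])

df['o_e']=oe.fit_transform(df[['Education Level']])
df['l_e']=le.fit_transform(df['Employment Status'])
df_be=encoder.fit_transform(df[['Gender']])

# BinaryEncoder returns the columns Gender_0 and Gender_1; rename them for readability
df_be.columns= ['Female', 'Male']

df=pd.concat([df,df_be],axis=1)
df


   Gender  Education Level  Employment Status  o_e  l_e  Female  Male
0    Male      High School         Unemployed  0.0    2       0     1
1  Female         Bachelor          Part-Time  1.0    1       1     0
2    Male          Masters          Full-Time  2.0    0       0     1
3  Female              PhD         Unemployed  3.0    2       1     0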


Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [40]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Temperature': [25, 28, 22, 20, 30],
    'humidity': [60, 55, 70, 75, 50],
    'weather condition': ['sunny', 'cloudy', 'rainy', 'sunny', 'cloudy'],
    'wind direction': ['north', 'south', 'east', 'west', 'north']
}

df = pd.DataFrame(data)


encoder=LabelEncoder()
df["encoded_weather_condition"]=encoder.fit_transform(df['weather condition'])
df["encoded_wind_direction"]=encoder.fit_transform(df['wind direction'])


df.drop(columns=['weather condition','wind direction'],inplace=True)

covariance=df.cov()

In [41]:
df

   Temperature  humidity  encoded_weather_condition  encoded_wind_direction
0           25        60                          2                       1
1           28        55                          0                       2
2           22        70                          1                       0
3           20        75                          2                       3
4           30        50                          0                       1


In [42]:
covariance

                           Temperature  humidity  encoded_weather_condition  encoded_wind_direction
Temperature                      17.00    -42.50                      -3.25                   -1.00
humidity                        -42.50    107.50                       7.50                    2.75
encoded_weather_condition        -3.25      7.50                       1.00                    0.25
encoded_wind_direction           -1.00      2.75                       0.25                    1.30
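
Interpretation: the covariance between Temperature and humidity is -42.5, so higher temperatures go with lower humidity in this sample. The entries that involve the encoded columns need more care. Weather condition and wind direction are nominal, and the integer codes are arbitrary, so those covariances depend on how the codes happen to be assigned. The sketch below illustrates this with the wind direction data; `codes_b` is a hypothetical alternative assignment, not something produced by the cells above.

```python
import pandas as pd

temperature = pd.Series([25, 28, 22, 20, 30])
wind = pd.Series(['north', 'south', 'east', 'west', 'north'])

# Alphabetical codes, as LabelEncoder assigns them
codes_a = wind.map({'east': 0, 'north': 1, 'south': 2, 'west': 3})
# A different but equally valid assignment (hypothetical)
codes_b = wind.map({'north': 0, 'south': 1, 'east': 2, 'west': 3})

cov_a = temperature.cov(codes_a)
cov_b = temperature.cov(codes_b)

# Same data, different covariance: roughly -1.0 versus -4.5
print(cov_a, cov_b)
```

Because relabelling alone changes the result, only the Temperature and humidity entry supports a direct interpretation.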
