In [None]:
Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

In [None]:
Encoding Method:
    
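OrdinalEncoder encodes categorical features as integers (0 to n-1) that follow the order of
the categories, which can be specified explicitly.
LabelEncoder assigns a unique integer to each category in sorted (alphabetical) order, so
the integers carry no meaning beyond identifying the category.

Handling of Ordinality:
OrdinalEncoder preserves the ordinal information in a feature, provided the order is passed
through its categories parameter; by default it also falls back to sorted order.
LabelEncoder does not consider the ordinality of the categories and treats them as nominal.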

Suitable Data Types:
OrdinalEncoder is suitable for encoding ordinal categorical features where the order matters,
such as temperature categories (cold, warm, hot).
LabelEncoder is typically used for nominal labels where there is no inherent order, such as
color categories (red, blue, green). In scikit-learn it is designed for the target column (y)
rather than for input features.

Example:
OrdinalEncoder can be used to encode temperature categories (cold, warm, hot) as 0, 1, 2,
preserving their order.
LabelEncoder encodes color categories (red, blue, green) as blue = 0, green = 1, red = 2,
following alphabetical order rather than any meaningful order.

Library:
Both OrdinalEncoder and LabelEncoder are available in scikit-learn (sklearn.preprocessing).
OrdinalEncoder works on 2-D feature arrays, while LabelEncoder works on a single 1-D array.

In [None]:
In short, label encoding can be applied to both ordinal and nominal categorical variables: it
simply assigns a number to each category, and the numbers carry no order. Ordinal encoding
is used when the categories have a natural order, like low, medium, high, and the integers
must follow that order. For example, responses like 'strongly agree,' 'agree,' 'neutral,'
'disagree,' and 'strongly disagree' are ordinal because they follow a specific sequence, so
ordinal encoding is the better choice for them.
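
The difference can be sketched with scikit-learn as follows. This is a minimal illustration
on made-up data; the variable names (temps, colors, ordered_codes, ...) are hypothetical and
not part of the original answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data, for illustration only
temps = pd.DataFrame({'Temperature': ['cold', 'hot', 'warm', 'cold']})
colors = ['red', 'blue', 'green', 'red']

# OrdinalEncoder with an explicit order: cold < warm < hot
ordered = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
ordered_codes = ordered.fit_transform(temps).ravel()

# OrdinalEncoder without an explicit order falls back to sorted order (cold, hot, warm)
default_codes = OrdinalEncoder().fit_transform(temps).ravel()

# LabelEncoder takes a 1-D array and numbers the classes in sorted order
color_codes = LabelEncoder().fit_transform(colors)

print(ordered_codes)
print(default_codes)
print(color_codes)
```

With the explicit order, 'hot' is encoded as 2; with the default sorted order it is encoded
as 1, which no longer reflects the temperature scale.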

In [None]:
Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.

In [None]:
Target Guided Ordinal Encoding (TGOE) is a technique for transforming a categorical
variable into numerical values using information from the target variable.
It requires a target for which a per-category statistic such as the mean can be computed
(a numeric or binary target), and it is particularly useful when the categories have no
obvious order of their own but differ clearly in their average target value.

How it works:
    
1. Sort Categories: Categories are sorted based on the mean (or another statistic) of 
the target variable for each category.

2. Rank Categories: Categories are assigned numerical values based on their rank in the
sorted order. Higher ranks receive higher values.

3. Encode Data: Each data point is replaced with the numerical value corresponding to
its category's rank.

Benefits of TGOE:
    
1. Gives the categories an order that reflects their relationship with the target.
2. Can lead to better performance than an arbitrary integer (label) encoding.
3. Produces a single column, unlike one-hot encoding, which helps when a feature has many categories.

Drawbacks of TGOE:

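1. Susceptible to data leakage if the category statistics are computed on rows that are
later used for validation or testing; they should be computed on the training data only.
2. Can be sensitive to outliers in the target variable, especially for categories with few rows.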

When to use TGOE:

1. When the target variable is numeric or binary, so that a mean per category is meaningful.
2. When the categorical variable has many categories or no natural order of its own.
3. When we want to capture the relationship between the categorical variable and the target variable.


In [5]:
# Example:

import pandas as pd

df = pd.DataFrame({'Employee Id': ['A100','A101','A102','B101','B102','C100','D103'],
                   'City': ['Delhi','Delhi','Mumbai','Pune','Kolkata','Pune','Kolkata'],
                   'Highest Qualification': ['phd','bsc','msc','bsc','phd','msc','msc'],
                   'Salary': [50000,30000,45000,25000,48000,30000,44000]
                   })
df

   Employee Id     City  Highest Qualification  Salary
0         A100    Delhi                    phd   50000
1         A101    Delhi                    bsc   30000
2         A102   Mumbai                    msc   45000
3         B101     Pune                    bsc   25000
4         B102  Kolkata                    phd   48000
5         C100     Pune                    msc   30000
6         D103  Kolkata                    msc   44000


In [8]:
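# Mean salary (the target) for each qualification
mean_salary = df.groupby('Highest Qualification')['Salary'].mean().to_dict()
mean_salary

{'bsc': 27500.0, 'msc': 39666.666666666664, 'phd': 49000.0}

In [10]:
# Replace each qualification with the mean salary of its category
df['Highest Qualification_encoded'] = df['Highest Qualification'].map(mean_salary)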
df

   Employee Id     City  Highest Qualification  Salary  Highest Qualification_encoded
0         A100    Delhi                    phd   50000                        49000.0
1         A101    Delhi                    bsc   30000                        27500.0
2         A102   Mumbai                    msc   45000                   39666.666667
3         B101     Pune                    bsc   25000                        27500.0
4         B102  Kolkata                    phd   48000                        49000.0
5         C100     Pune                    msc   30000                   39666.666667
6         D103  Kolkata                    msc   44000                   39666.666667
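
The example above replaces each category with its mean salary. The ranking step described
in "How it works" can be sketched as below. This is a minimal illustration on the same
data, not a definitive implementation; the names category_means, rank_map and the
'Highest Qualification_rank' column are hypothetical. In a real project the means would be
computed on the training split only, to avoid the leakage mentioned above.

```python
import pandas as pd

df = pd.DataFrame({
    'Highest Qualification': ['phd', 'bsc', 'msc', 'bsc', 'phd', 'msc', 'msc'],
    'Salary': [50000, 30000, 45000, 25000, 48000, 30000, 44000],
})

# Step 1: mean of the target for each category
category_means = df.groupby('Highest Qualification')['Salary'].mean()

# Step 2: sort categories by that mean and assign ranks 1, 2, 3, ...
ordered_categories = category_means.sort_values().index
rank_map = {category: rank for rank, category in enumerate(ordered_categories, start=1)}

# Step 3: replace each category with its rank
df['Highest Qualification_rank'] = df['Highest Qualification'].map(rank_map)

print(rank_map)
print(df)
```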


In [None]:
Q3. Define covariance and explain why it is important in statistical analysis. 
How is covariance calculated?

In [None]:
Covariance is a statistical measure of how two random variables change together.
Its sign gives the direction of the relationship: a positive value means the variables
tend to move in tandem, and a negative value means they tend to move in opposite directions.
Its magnitude depends on the units of the variables, so it does not measure strength by itself.

Covariance is important in statistical analysis because it can help us understand the
relationship between variables and analyze risk. For example, in finance, covariance
is used in portfolio theory to help diversify assets and reduce unsystematic risk.

Formula for Covariance (sample version):

Cov(x, y) = ∑ (xi − x̄)(yi − ȳ) / (n − 1)
where,
xi = data points of x
x̄ = sample mean of x
yi = data points of y
ȳ = sample mean of y
n = Sample Size
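
The formula can be checked by hand against NumPy. This is a small sketch on illustrative
numbers; the function name sample_covariance is hypothetical.

```python
import numpy as np

# Illustrative data: x could be ages, y incomes
x = np.array([35, 28, 31, 27, 38], dtype=float)
y = np.array([50000, 30000, 45000, 25000, 48000], dtype=float)

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

manual = sample_covariance(x, y)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(x, y)
library = np.cov(x, y)[0, 1]

print(manual, library)
```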

In [None]:
Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [40]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['red','green','blue','red','blue'],
                   'Size': ['small','medium','large','medium','large'],
                   'Material': ['wood','metal','plastic','wood','plastic']})

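# Use a separate encoder per column; LabelEncoder numbers the classes in sorted order
encoded1 = LabelEncoder().fit_transform(df['Color'])
encoded2 = LabelEncoder().fit_transform(df['Size'])
encoded3 = LabelEncoder().fit_transform(df['Material'])
print(encoded1, encoded2, encoded3)

[2 1 0 2 0] [2 1 0 1 0] [2 0 1 2 1]


In [None]:
Label Encoding assigns the integers in alphabetical order of the categories:
Color: blue is '0', green is '1' and red is '2'.
Size: large is '0', medium is '1' and small is '2'.
Material: metal is '0', plastic is '1' and wood is '2'.
The numbers are only identifiers. For Size they do not follow the natural order
small < medium < large, so an ordinal encoding with an explicit order would suit it better.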
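
To read the mapping directly instead of inferring it from the output, the fitted encoder's
classes_ attribute can be inspected. The sketch below keeps one encoder per column; the
names encoders, mappings and decoded are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Color': ['red', 'green', 'blue', 'red', 'blue'],
                   'Size': ['small', 'medium', 'large', 'medium', 'large'],
                   'Material': ['wood', 'metal', 'plastic', 'wood', 'plastic']})

# One fitted encoder per column, so each mapping stays available afterwards
encoders = {col: LabelEncoder().fit(df[col]) for col in df.columns}

# classes_ holds the categories in sorted order; the position is the integer code
mappings = {col: {str(label): code for code, label in enumerate(enc.classes_)}
            for col, enc in encoders.items()}
print(mappings)

# inverse_transform converts integer codes back to the original labels
decoded = encoders['Size'].inverse_transform([2, 1, 0])
print(decoded)
```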

In [None]:
Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [1]:
import pandas as pd

df = pd.DataFrame({'Age': [35,28,31,27,38],
                   'Income': [50000,30000,45000,25000,48000],
                   'Education': ['phd','bsc','msc','bsc','phd']})

# Education is a text column, so restrict the calculation to the numeric columns
df.cov(numeric_only=True)


            Age       Income
Age        21.7      46900.0
Income  46900.0  128300000.0


In [None]:
The covariance matrix is symmetric, as the covariance between Age and Income is the
same as the covariance between Income and Age.

The variance of Age is approximately 21.7.
The variance of Income is approximately 128300000.0.

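Age and Income have a positive covariance (46900.0), which means there is a direct
relationship between the variables: as Age increases, Income tends to increase as well
(they move together). The magnitude depends on the units of both variables, so it does
not by itself show how strong the relationship is.

Education is a text column, so it does not appear in the covariance matrix. It would have
to be encoded as numbers (for example, ordinally) before it could be included.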
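
One way to bring Education into the matrix is to encode it ordinally first. The sketch
below assumes the ranking bsc < msc < phd; the mapping education_order and the column
name Education_encoded are hypothetical choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Age': [35, 28, 31, 27, 38],
                   'Income': [50000, 30000, 45000, 25000, 48000],
                   'Education': ['phd', 'bsc', 'msc', 'bsc', 'phd']})

# Assumed ordinal ranking of the education levels
education_order = {'bsc': 0, 'msc': 1, 'phd': 2}
df['Education_encoded'] = df['Education'].map(education_order)

# Covariance matrix over the three numeric columns
cov_matrix = df[['Age', 'Income', 'Education_encoded']].cov()
print(cov_matrix)
```

Under this assumed encoding, a positive covariance between Education_encoded and the other
two columns would indicate that higher education levels go together with higher Age and
Income in this small sample.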

In [None]:
Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

In [None]:
Gender (Binary Categorical Variable: Male/Female): Gender is a binary categorical variable
with only two possible values (Male and Female), so the preferred encoding method is
Label Encoding. We can assign 0 to one category (e.g., Male) and 1 to the other
category (e.g., Female).

Education Level (Ordinal Categorical Variable: High School/Bachelor's/Master's/PhD): The
categories have a natural order, so the recommended encoding method is Ordinal Encoding.
For example, we can encode High School as 0, Bachelor's as 1, Master's as 2, and PhD as 3.

Employment Status (Nominal Categorical Variable: Unemployed/Part-Time/Full-Time): Employment
Status is treated here as a nominal categorical variable with no inherent order among its
categories, so the preferred encoding method is One-Hot Encoding.
For example, we can create three columns (Unemployed, Part-Time, Full-Time) where the
corresponding category is encoded as 1 and the others as 0.
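
The three choices can be combined in one preprocessing step. This is a sketch on made-up
rows, assuming scikit-learn's ColumnTransformer; the step names ('gender', 'education',
'employment') and the sample data are hypothetical. The binary 0/1 mapping for Gender is
done with OrdinalEncoder here, since LabelEncoder is intended for target columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows, for illustration only
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['PhD', 'High School', "Master's", "Bachelor's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

preprocess = ColumnTransformer(
    transformers=[
        # Binary feature: Male -> 0, Female -> 1
        ('gender', OrdinalEncoder(categories=[['Male', 'Female']]), ['Gender']),
        # Ordinal feature: explicit order from lowest to highest level
        ('education',
         OrdinalEncoder(categories=[['High School', "Bachelor's", "Master's", 'PhD']]),
         ['Education Level']),
        # Nominal feature: one 0/1 column per category (in sorted order)
        ('employment', OneHotEncoder(), ['Employment Status']),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(df)
print(encoded)
```

The result has five columns: Gender, Education Level, and one column each for Full-Time,
Part-Time and Unemployed.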

In [None]:
Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [2]:
import pandas as pd

df = pd.DataFrame({'Temperature': [48,50,33,30,28],
                   'Humidity': [60,20,55,15,58],
                   'Weather Condition': ['Rainy','Sunny','Cloudy','Sunny','Cloudy'],
                   'Wind Direction': ['North','South','East','West','North']})

In [3]:
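# Covariance is only defined for numeric columns, so the categorical ones are skipped
df.cov(numeric_only=True)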


             Temperature  Humidity
Temperature        108.2    -23.35
Humidity          -23.35     490.3


In [None]:
The covariance matrix is symmetric, as the covariance between Temperature and
Humidity is the same as the covariance between Humidity and Temperature.

The variance of Temperature is approximately 108.20.
The variance of Humidity is approximately 490.30.

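Temperature and Humidity have a negative covariance (-23.35), which means there is an
inverse relationship between the variables: as Temperature increases, Humidity tends to
decrease (they move in opposite directions), and vice versa. The covariance is small
compared with the two variances, so the relationship in this sample is weak.

Weather Condition and Wind Direction are categorical, so covariance is not defined for
them and they do not appear in the matrix.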
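
Because covariance depends on the units of the variables, the correlation coefficient is a
convenient companion for judging strength. The sketch below recomputes both on the same
data and, as one possible way to look at a categorical column, compares mean Temperature
per Weather Condition; the names numeric, cov_matrix, corr_matrix and temp_by_weather are
hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Temperature': [48, 50, 33, 30, 28],
                   'Humidity': [60, 20, 55, 15, 58],
                   'Weather Condition': ['Rainy', 'Sunny', 'Cloudy', 'Sunny', 'Cloudy'],
                   'Wind Direction': ['North', 'South', 'East', 'West', 'North']})

# Covariance and correlation for the two continuous variables
numeric = df[['Temperature', 'Humidity']]
cov_matrix = numeric.cov()
corr_matrix = numeric.corr()

# For a categorical column, group means show how a numeric variable differs by category
temp_by_weather = df.groupby('Weather Condition')['Temperature'].mean()

print(cov_matrix)
print(corr_matrix)
print(temp_by_weather)
```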