In [None]:
#Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.
"""
The key difference is that ordinal encoding assigns integers according to a meaningful order or ranking among the categories, whereas label 
encoding assigns an arbitrary integer to each category (scikit-learn's LabelEncoder uses alphabetical order), so the integers carry no ranking. 
In general, if there is a clear order or ranking among the categories, ordinal encoding is more appropriate. If there is no such order, label 
encoding may be a better choice, although some models can misread the arbitrary integers as an order.

For example, to predict a person's salary from their education level (High School < Bachelor's < Master's < PhD), where the categories have a clear 
order, we may choose ordinal encoding. To predict whether a customer is likely to buy a product based on their occupation, where the categories have 
no inherent order, we may choose label encoding.
"""

In [None]:
#Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.
"""
Target Guided Ordinal Encoding is a technique used to encode categorical features based on the relationship between the feature and the target 
variable in a supervised learning problem. It involves assigning a numerical value to each category of a categorical variable based on the mean of 
the target variable for that category.

The steps involved in Target Guided Ordinal Encoding are:
1. Calculate the mean of the target variable for each category of the categorical variable.
2. Sort the categories in descending order of the mean of the target variable.
3. Assign a numerical value to each category based on its position in the sorted list.

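For example, suppose we have a dataset with a categorical variable "City" and a target variable "Sales". We compute the mean of "Sales" for each 
city, rank the cities by that mean, and replace each city with its rank. This is useful when a nominal feature has many categories and one-hot 
encoding would create too many columns. The means should be computed on the training data only, so that the target does not leak into the test set.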
"""

In [None]:
#Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
"""
Covariance is a statistical measure that quantifies the extent to which two random variables vary together.

Covariance is important in statistical analysis because its sign gives the direction of the linear relationship between two variables. If the 
covariance is positive, the variables tend to increase or decrease together, while a negative covariance indicates that they tend to move in 
opposite directions. Its magnitude depends on the units of the variables, so the strength of a relationship is usually judged with the correlation 
coefficient, which is the covariance divided by the product of the two standard deviations.

The sample covariance is calculated using the formula:
Cov(X,Y) = Σ[(Xi - X_mean)(Yi - Y_mean)] / (n - 1)
where n is the number of observations.
"""

In [7]:
#Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal,
#    plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

from sklearn.preprocessing import LabelEncoder
import pandas as pd

df = pd.DataFrame({'Color': ['red', 'green', 'blue', 'red', 'green', 'blue', 'red', 'green', 'blue'],
        'Size': ['small', 'medium', 'large', 'small', 'medium', 'large', 'small', 'medium', 'large'],
        'Material': ['wood', 'metal', 'plastic', 'metal', 'plastic', 'wood', 'plastic', 'wood', 'metal']})

# Fit one encoder per column so that each fitted mapping (classes_) stays available afterwards.
encoders = {}
for col in ['Color', 'Size', 'Material']:
    encoders[col] = LabelEncoder()
    df[col + '_encoded'] = encoders[col].fit_transform(df[col])

df

"""
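The LabelEncoder encodes each categorical variable as integers, assigning every category a unique integer in alphabetical order. Since each variable 
has three categories, the values range from 0 to 2: Color (blue=0, green=1, red=2), Size (large=0, medium=1, small=2), and Material (metal=0, 
plastic=1, wood=2). The integers are identifiers only; for Size, the alphabetical codes do not follow the natural order small < medium < large.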
"""

,Color,Size,Material,Color_encoded,Size_encoded,Material_encoded
0,red,small,wood,2,2,2
1,green,medium,metal,1,1,0
2,blue,large,plastic,0,0,1
3,red,small,metal,2,2,0
4,green,medium,plastic,1,1,1
5,blue,large,wood,0,0,2
6,red,small,plastic,2,2,1
7,green,medium,wood,1,1,2
8,blue,large,metal,0,0,0


In [11]:
#Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

import pandas as pd

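# The output shows the columns YearsExperience, Age, and Salary; they stand in for the variables named in the question.
data=pd.read_csv('Salary_Data.csv')

# Interpretation: the diagonal holds the variances, and every off-diagonal entry is positive, so the three variables tend to increase together.
# The Salary entries are large only because salary is measured in larger units; covariance is scale-dependent.
data.cov()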

,YearsExperience,Age,Salary
YearsExperience,8.053609,14.46046,76106.303448
Age,14.46046,26.638678,137889.293103
Salary,76106.303448,137889.293103,751551000.0


In [None]:
#Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), 
#    "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you
#    use for each variable, and why?
"""
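Gender: One-Hot Encoding
    Reason: The two categories have no order. With only two categories, this reduces to a single 0/1 column if one dummy column is dropped.
    
Education Level: Ordinal Encoding
    Reason: The categories have a natural ranking (High School < Bachelor's < Master's < PhD).
    
Employment Status: One-Hot Encoding
    Reason: The three categories are treated as unordered, and one-hot encoding adds only three columns.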
"""

In [10]:
#Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" 
#    (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Note: the question's "Wind Direction" variable is stored here under the column name 'Weather Direction'.
df = pd.DataFrame({
    'Temperature': [20, 22, 25, 19, 18, 23, 26, 21, 24, 20],
    'Humidity': [50, 45, 55, 60, 62, 48, 52, 58, 54, 51],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Rainy', 'Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy', 'Sunny'],
    'Weather Direction': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West', 'North', 'South']
})

# Covariance is defined only for numeric data, so one-hot encode the two categorical columns first.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Weather Condition', 'Weather Direction']]).toarray()
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

# Append the 0/1 indicator columns to the original frame.
data = pd.concat([df, encoded_df], axis=1)

# numeric_only=True skips the original string columns.
cov_matrix = data.cov(numeric_only=True)

print(cov_matrix)


                          Temperature      Humidity  Weather Condition_Cloudy  \
Temperature                  7.066667 -5.777778e+00                  0.400000   
Humidity                    -5.777778  2.894444e+01                 -1.500000   
Weather Condition_Cloudy     0.400000 -1.500000e+00                  0.233333   
Weather Condition_Rainy      0.511111  7.222222e-01                 -0.100000   
Weather Condition_Sunny     -0.911111  7.777778e-01                 -0.133333   
Weather Direction_East       0.822222  1.541976e-17                 -0.066667   
Weather Direction_North     -0.377778  6.111111e-01                  0.011111   
Weather Direction_South     -0.044444 -1.833333e+00                  0.122222   
Weather Direction_West      -0.400000  1.222222e+00                 -0.066667   

                          Weather Condition_Rainy  Weather Condition_Sunny  \
Temperature                              0.511111                -0.911111   
Humidity                         