Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.

Label encoding and ordinal encoding are both techniques used to convert categorical data into numerical data, but they have some key differences and are used in different scenarios:

Label Encoding:

1. Assigns a unique integer to each category.
2. The integers carry no meaning of their own; scikit-learn's LabelEncoder assigns them in sorted (alphabetical) order.
3. Best used when:
   - There's no inherent order in the categories
   - The machine learning algorithm can handle non-ordinal numeric inputs (e.g., tree-based models)
   - There are many categories

Example:
Colors: Blue -> 0, Green -> 1, Red -> 2 (alphabetical order, as scikit-learn's LabelEncoder assigns them)

Ordinal Encoding:

1. Assigns integers to categories based on their relative order or rank.
2. The assigned integers reflect a meaningful order in the data.
3. Best used when:
   - There's a clear, logical order to the categories
   - You want to preserve the ordinal relationship in the encoded values

Example:
Education levels: High School -> 0, Bachelor's -> 1, Master's -> 2, PhD -> 3
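
A minimal sketch of the difference using scikit-learn, with made-up sample values. LabelEncoder sorts the class names on its own, while OrdinalEncoder follows the order passed in `categories`.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative data (not from any real dataset).
colors = ['Red', 'Blue', 'Green', 'Blue']
education = [['High School'], ["Master's"], ["Bachelor's"], ['PhD']]

# LabelEncoder: integers follow the sorted order of the class names.
label_encoder = LabelEncoder()
color_codes = label_encoder.fit_transform(colors)
print(label_encoder.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(color_codes.tolist())             # [2, 0, 1, 0]

# OrdinalEncoder: integers follow the explicit rank order we supply.
ordinal_encoder = OrdinalEncoder(
    categories=[['High School', "Bachelor's", "Master's", 'PhD']]
)
education_codes = ordinal_encoder.fit_transform(education).ravel()
print(education_codes.tolist())         # [0.0, 2.0, 1.0, 3.0]
```

Note that LabelEncoder works on a 1-D sequence, while OrdinalEncoder expects a 2-D input (one column per feature) and returns floats.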

Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.



Target Guided Ordinal Encoding is a technique used to encode categorical variables based on their relationship with the target variable in a supervised learning context. 

How Target Guided Ordinal Encoding works:

1. For each category in the categorical feature, calculate the mean (for regression) or proportion (for classification) of the target variable.
2. Rank the categories based on these calculated values.
3. Assign ordinal numbers to the categories based on this ranking.

This method creates an ordering of categories that is informed by their relationship with the target variable, potentially capturing more relevant information than standard ordinal or label encoding.
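
The three steps above can be sketched with pandas as follows. The column names `city` and `price` and all values are hypothetical, chosen only to illustrate the idea for a regression target.

```python
import pandas as pd

# Hypothetical toy data: 'city' is the categorical feature, 'price' the target.
df = pd.DataFrame({
    'city': ['A', 'B', 'C', 'A', 'B', 'C'],
    'price': [100, 300, 200, 120, 320, 180],
})

# Step 1: mean of the target per category.
category_means = df.groupby('city')['price'].mean()

# Step 2: rank the categories by that mean (ascending).
ordered_categories = category_means.sort_values().index

# Step 3: map each category to its rank (0 = lowest mean target).
mapping = {category: rank for rank, category in enumerate(ordered_categories)}
df['city_encoded'] = df['city'].map(mapping)

print(mapping)  # {'A': 0, 'C': 1, 'B': 2}
```

In a real project the mapping would normally be learned from the training split only and then applied to validation and test data, so that target information does not leak into evaluation.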

Situations where it can be useful:
1. High cardinality features: When a categorical variable has many unique values, making one-hot encoding impractical.
2. Unknown ordinal relationships: When you suspect there's an ordinal relationship in a categorical variable, but it's not immediately apparent.
3. Feature importance: When you want to create a feature that might be more predictive than standard encoding methods.
4. Reducing dimensionality: As an alternative to one-hot encoding to keep the number of features lower.
5. Time series forecasting: Encoding cyclical features like days of the week or months based on their relationship with the target.
6. Customer segmentation: Encoding customer attributes based on their relationship with a key metric like total spend.



Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a statistical measure that indicates the extent to which two random variables change together. If the variables tend to show similar behavior, i.e., when one variable increases, the other tends to increase, and vice versa, the covariance is positive. Conversely, if one variable tends to increase when the other decreases, the covariance is negative.

Why covariance is important:
1. Covariance identifies the direction of the linear relationship between two variables.
2. In finance, covariance is used to understand how different assets move together, which is crucial for diversification and risk management.
3. Covariance is a foundational concept in regression analysis, helping to understand the relationship between the independent and dependent variables.
4. It underlies the correlation coefficient, which is the covariance divided by the product of the two standard deviations and so makes the strength of relationships comparable across variables.

Calculation of Covariance
1. Find the Mean: Calculate the mean of each variable.
2. Deviations: Subtract the mean of X from each X value to get the deviation scores for X. Do the same for Y.
3. Product of Deviations: Multiply the deviation scores of X and Y for each data point.
4. Sum of Products: Sum up all the products obtained in the previous step.
5. Average: Divide the sum by n - 1 (for a sample) to get the covariance.
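
Written as a formula, the sample covariance is cov(X, Y) = sum((x_i - mean(X)) * (y_i - mean(Y))) / (n - 1). A minimal sketch of the steps in NumPy, using illustrative values and checking the result against np.cov:

```python
import numpy as np

# Illustrative values only.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Steps 1-2: deviations of each value from its variable's mean.
dx = x - x.mean()
dy = y - y.mean()

# Steps 3-5: multiply the deviations, sum them, divide by n - 1.
cov_xy = (dx * dy).sum() / (len(x) - 1)

print(cov_xy)               # manual result (14 / 3)
print(np.cov(x, y)[0, 1])   # np.cov also uses n - 1 by default
```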

Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.

In [1]:
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green', 'Blue', 'Red', 'Green', 'Blue'],
    'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small', 'Medium', 'Large', 'Large', 'Small', 'Small', 'Medium', 'Large', 'Large', 'Small', 'Medium'],
    'Material': ['Wood', 'Metal', 'Plastic', 'Plastic', 'Metal', 'Wood', 'Metal', 'Wood', 'Plastic', 'Metal', 'Wood', 'Metal', 'Plastic', 'Wood', 'Metal']
}

df = pd.DataFrame(data)
df.head()


   ID  Color    Size  Material
0   1    Red   Small      Wood
1   2  Green  Medium     Metal
2   3   Blue   Large   Plastic
3   4    Red  Medium   Plastic
4   5  Green   Small     Metal


In [15]:
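from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder assigns integers in sorted (alphabetical) order of the class names.
# One encoder per column keeps each fitted mapping (classes_) available afterwards.
color_encoder = LabelEncoder()
material_encoder = LabelEncoder()
# Size has a natural order, so it is passed explicitly to OrdinalEncoder.
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

new_df = df.copy()
new_df['encoded_colors'] = color_encoder.fit_transform(df['Color'])
new_df['encoded_Materials'] = material_encoder.fit_transform(df['Material'])
new_df['encoded_Size'] = size_encoder.fit_transform(df[['Size']]).ravel()

new_df.head()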

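   ID  Color    Size  Material  encoded_colors  encoded_Materials  encoded_Size
0   1    Red   Small      Wood               2                  2           0.0
1   2  Green  Medium     Metal               1                  0           1.0
2   3   Blue   Large   Plastic               0                  1           2.0
3   4    Red  Medium   Plastic               2                  1           1.0
4   5  Green   Small     Metal               1                  0           0.0

Explanation of the output:
1. encoded_colors: LabelEncoder sorts the class names alphabetically, so Blue -> 0, Green -> 1, Red -> 2.
2. encoded_Materials: likewise, Metal -> 0, Plastic -> 1, Wood -> 2.
3. encoded_Size: Size has a natural order, so OrdinalEncoder is given the explicit order Small -> 0, Medium -> 1, Large -> 2. It returns floats, hence 0.0, 1.0, 2.0.

The label-encoded integers are identifiers only; their numeric order carries no meaning.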


Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.

In [63]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 70000, 80000, 90000],
    'Education Level': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
}

# Create DataFrame
df = pd.DataFrame(data)

# Encode Education Level with an explicit order: Bachelor < Master < PhD
encoder = OrdinalEncoder(categories=[['Bachelor', 'Master', 'PhD']])
df['encoded'] = encoder.fit_transform(df[['Education Level']]).ravel()

# np.cov treats each row as a variable, so transpose the (samples x variables) frame
cov_matrix = np.cov(df[['Age', 'Income', 'encoded']].T)
cov_matrix

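array([[6.25e+01, 1.25e+05, 1.25e+00],
       [1.25e+05, 2.50e+08, 2.50e+03],
       [1.25e+00, 2.50e+03, 7.00e-01]])

Interpretation (rows and columns are in the order Age, Income, encoded Education Level):
1. The diagonal holds the variances: 62.5 for Age, 2.5e8 for Income, and 0.7 for the encoded Education Level.
2. The covariance between Age and Income is 125000, which is positive: Income rises with Age. In this data Income increases by exactly 2000 for every year of Age, so the relationship is perfectly linear.
3. Education Level has a positive covariance with both Age (1.25) and Income (2500), but the relationship is weak; the corresponding correlation is about 0.19.
4. The magnitudes are not comparable with each other because covariance depends on the units of the variables (Income is in the tens of thousands, Education Level ranges from 0 to 2). Correlation is needed to compare the strength of the relationships.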
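
Because covariance depends on units, it can help to rescale the matrix into correlations. The sketch below rebuilds the same three variables as plain arrays (with the education codes written out by hand as Bachelor=0, Master=1, PhD=2) and divides each covariance by the product of the two standard deviations.

```python
import numpy as np

age = np.array([25, 30, 35, 40, 45])
income = np.array([50000, 60000, 70000, 80000, 90000])
education = np.array([0, 1, 2, 0, 1])  # Bachelor=0, Master=1, PhD=2

# Each row passed to np.cov is one variable.
cov_matrix = np.cov(np.vstack([age, income, education]))

# Correlation = covariance / (std_i * std_j).
std = np.sqrt(np.diag(cov_matrix))
corr_matrix = cov_matrix / np.outer(std, std)

print(np.round(corr_matrix, 3))
```

The result agrees with `np.corrcoef`, which computes the same quantity directly.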

Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?

1. For Gender, I would use label encoding: the feature has only two categories, so a single 0/1 column represents it without implying a misleading order.
2. For Education Level, I would use ordinal encoding, because the categories have a natural order (High School < Bachelor's < Master's < PhD) that the encoded values should preserve.
3. For Employment Status, I would use one-hot encoding (OHE), because there are only a few categories and I treat them as having no inherent order.
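
A minimal sketch of these three choices on made-up rows. For brevity, pandas' `get_dummies` stands in for scikit-learn's OneHotEncoder here; the column prefix `Status` and the output column names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Made-up rows for illustration.
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ['High School', "Master's", "Bachelor's", 'PhD'],
    'Employment Status': ['Unemployed', 'Full-Time', 'Part-Time', 'Full-Time'],
})

# Gender: two categories -> a single 0/1 column (Female=0, Male=1).
df['Gender_encoded'] = LabelEncoder().fit_transform(df['Gender'])

# Education Level: explicit rank order.
education_order = [['High School', "Bachelor's", "Master's", 'PhD']]
df['Education_encoded'] = (
    OrdinalEncoder(categories=education_order)
    .fit_transform(df[['Education Level']])
    .ravel()
)

# Employment Status: one indicator column per category.
status_dummies = pd.get_dummies(df['Employment Status'], prefix='Status', dtype=int)

encoded = pd.concat(
    [df[['Gender_encoded', 'Education_encoded']], status_dummies], axis=1
)
print(encoded)
```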

Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.

In [56]:
import pandas as pd
import numpy as np

# Synthetic data: 20 random observations, with a fixed seed for reproducibility
np.random.seed(42)
n = 20
temperature = np.random.uniform(15, 35, n)
humidity = np.random.uniform(40, 90, n)
weather_conditions = np.random.choice(['Sunny', 'Cloudy', 'Rainy'], n)
wind_directions = np.random.choice(['North', 'South', 'East', 'West'], n)

data = {
    'Temperature': temperature,
    'Humidity': humidity,
    'Weather Condition': weather_conditions,
    'Wind Direction': wind_directions
}
df = pd.DataFrame(data)


In [57]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode Weather Condition (no natural order)
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Weather Condition']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
df = pd.concat([df, encoded_df], axis=1)

# Encode Wind Direction as the sum of the mean Temperature and mean Humidity
# observed for each direction
mean_temp = df.groupby('Wind Direction')['Temperature'].mean().to_dict()
mean_humidity = df.groupby('Wind Direction')['Humidity'].mean().to_dict()
df['encoded1'] = df['Wind Direction'].map(mean_humidity)
df['encoded2'] = df['Wind Direction'].map(mean_temp)
df['final_encoded'] = df['encoded1'] + df['encoded2']
df.drop(['encoded1', 'encoded2', 'Weather Condition', 'Wind Direction'], axis=1, inplace=True)
print(df)

    Temperature   Humidity  Weather Condition_Cloudy  Weather Condition_Rainy  \
0     22.490802  70.592645                       0.0                      1.0   
1     34.014286  46.974693                       0.0                      0.0   
2     29.639879  54.607232                       0.0                      1.0   
3     26.973170  58.318092                       0.0                      1.0   
4     18.120373  62.803499                       1.0                      0.0   
5     18.119890  79.258798                       0.0                      0.0   
6     16.161672  49.983689                       1.0                      0.0   
7     32.323523  65.711722                       1.0                      0.0   
8     27.022300  69.620728                       1.0                      0.0   
9     29.161452  42.322521                       1.0                      0.0   
10    15.411690  70.377243                       1.0                      0.0   
11    34.398197  48.526206  

In [61]:
# Covariance matrix; np.cov expects one variable per row, hence the transpose
columns = ['Humidity', 'Temperature', 'Weather Condition_Cloudy',
           'Weather Condition_Sunny', 'Weather Condition_Rainy', 'final_encoded']
cov_matrix = np.cov(df[columns].T)

In [62]:
cov_matrix

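array([[ 2.11963073e+02, -5.28354175e+01, -2.43616850e+00,
         1.33965386e+00,  1.09651464e+00,  3.12105862e+01],
       [-5.28354175e+01,  3.78684834e+01, -2.26255863e-03,
        -5.69041345e-02,  5.91666931e-02,  1.23242177e+00],
       [-2.43616850e+00, -2.26255863e-03,  2.39473684e-01,
        -1.02631579e-01, -1.36842105e-01, -9.24499032e-01],
       [ 1.33965386e+00, -5.69041345e-02, -1.02631579e-01,
         1.34210526e-01, -3.15789474e-02,  5.69175338e-01],
       [ 1.09651464e+00,  5.91666931e-02, -1.36842105e-01,
        -3.15789474e-02,  1.68421053e-01,  3.55323694e-01],
       [ 3.12105862e+01,  1.23242177e+00, -9.24499032e-01,
         5.69175338e-01,  3.55323694e-01,  3.24430079e+01]])

Interpretation (rows and columns follow the order passed to np.cov: Humidity, Temperature, Weather Condition_Cloudy, Weather Condition_Sunny, Weather Condition_Rainy, final_encoded):
1. The diagonal holds the variances: about 212 for Humidity and 37.9 for Temperature.
2. Humidity and Temperature have a covariance of about -52.8, so in this sample higher temperatures tend to go with lower humidity (a correlation of roughly -0.59).
3. The covariances between the one-hot weather columns are all negative. This is by construction, since exactly one of them is 1 in each row.
4. Temperature has near-zero covariance with the weather indicators. Humidity covaries negatively with Cloudy (-2.44) and positively with Sunny (1.34) and Rainy (1.10).
5. final_encoded covaries positively with Humidity (31.2) and Temperature (1.23), which is expected because it is built from the per-direction means of those two variables.

Because the data are randomly generated, these relationships reflect chance in the sample, not real weather patterns.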
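
A raw NumPy array is hard to read because the row and column labels are lost. As a sketch of one alternative, pandas' `DataFrame.cov()` returns the same sample covariances with labels attached. The data below are regenerated with an arbitrary seed and `pd.get_dummies` is used for the one-hot step, so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
n = 20
df = pd.DataFrame({
    'Temperature': rng.uniform(15, 35, n),
    'Humidity': rng.uniform(40, 90, n),
    'Weather Condition': rng.choice(['Sunny', 'Cloudy', 'Rainy'], n),
})

# One-hot encode the categorical column as float indicator columns.
dummies = pd.get_dummies(df['Weather Condition'], prefix='Weather', dtype=float)
numeric_df = pd.concat([df[['Temperature', 'Humidity']], dummies], axis=1)

# DataFrame.cov() uses the sample (n - 1) denominator, like np.cov.
cov_df = numeric_df.cov()
print(cov_df.round(3))
```

`numeric_df.corr()` gives the unit-free correlations in the same labeled form.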