Q1. Label Encoding vs Ordinal Encoding:-

•Label Encoding can be applied to both nominal and ordinal categorical variables. It assigns each category an integer without regard to any meaningful order.

•Ordinal Encoding is applied to ordinal categorical variables, where the integers must follow the natural order of the categories.

Example:- 

Ordinal Encoding:- labels are translated to numbers based on their ordinal relationship to one another. For example, if one feature contains {low, medium, high}, it can be converted into {1,2,3}, where 1 represents low, 2 represents medium, and 3 represents high.

Label Encoding:- converts categorical features into numerical values. Categorical variables are features that define a category, e.g. Color (red, blue, green) or Gender (Male, Female). Most machine learning models expect features to be floats or integers, so categorical features such as color or gender must be converted to numbers before training. Because the integers are assigned arbitrarily, label encoding a nominal feature can suggest an order that does not exist.
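
The difference can be seen on the {low, medium, high} example. The following is a minimal sketch, assuming scikit-learn is installed; the variable names are illustrative only. Note that scikit-learn's OrdinalEncoder numbers categories from 0 rather than from 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = np.array(['low', 'medium', 'high', 'medium', 'low'])

# LabelEncoder sorts the labels alphabetically: high=0, low=1, medium=2
label_codes = LabelEncoder().fit_transform(levels)
print(label_codes)

# OrdinalEncoder follows the order we pass in: low=0, medium=1, high=2
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordinal_codes = ordinal_encoder.fit_transform(levels.reshape(-1, 1)).ravel()
print(ordinal_codes)
```

Only the second encoding preserves low < medium < high; the first one places "high" below "low".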

Q2. Target Guided Ordinal Encoding is a technique used in machine learning for encoding categorical variables based on their relationship with the target variable. It is particularly useful when dealing with ordinal categorical variables, which have a natural order or ranking among their categories. The goal of this encoding is to capture the information about the ordinal nature of the categories and their impact on the target variable.

Here's how Target Guided Ordinal Encoding works:

Order Categories: First, you need to establish a meaningful order or ranking for the categories within the ordinal variable. For example, if you're working with an ordinal variable representing education level (e.g., "High School," "Bachelor's," "Master's," "Ph.D."), you would assign a numerical order to these categories based on their level of education.

Calculate Target Statistics: For each unique category in the ordinal variable, calculate relevant statistics of the target variable. Typically, these statistics include measures like mean, median, or some other quantile of the target variable for each category. These statistics reflect the relationship between the ordinal variable and the target variable.

Map Categories to Statistics: Map each category to the calculated target statistic. In other words, replace the ordinal categories with their corresponding statistic values. For example, if the ordinal variable is education level and you've calculated the mean target value for each education level, you would replace "High School" with the mean target value for that category, and so on.

Handle Missing Categories: If a category is missing from the data used to compute the statistics (for example, it appears only in new data), there is no statistic to map it to. Assign such categories a default value, such as the overall mean or median of the target variable.

In [1]:
# Example:-

import pandas as pd

data = {
    'Employment_Type': ['Unemployed', 'Part-Time', 'Full-Time', 'Self-Employed', 'Full-Time', 'Part-Time'],
    'Default_Rate': [0.2, 0.1, 0.05, 0.15, 0.08, 0.12]
}

df = pd.DataFrame(data)

In [2]:
print(df)

  Employment_Type  Default_Rate
0      Unemployed          0.20
1       Part-Time          0.10
2       Full-Time          0.05
3   Self-Employed          0.15
4       Full-Time          0.08
5       Part-Time          0.12


In [3]:
# Mean of the target (default rate) for each category, sorted from lowest to highest
mean_default_rate = df.groupby('Employment_Type')['Default_Rate'].mean().sort_values()

# Lookup table: category -> mean default rate
mapping = mean_default_rate.to_dict()

# Replace each category with its mean default rate
df['Employment_Type_Encoded'] = df['Employment_Type'].map(mapping)

print(df)

  Employment_Type  Default_Rate  Employment_Type_Encoded
0      Unemployed          0.20                    0.200
1       Part-Time          0.10                    0.110
2       Full-Time          0.05                    0.065
3   Self-Employed          0.15                    0.150
4       Full-Time          0.08                    0.065
5       Part-Time          0.12                    0.110
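
A common variant maps each category to its integer rank in the sorted order of the target means, instead of to the mean itself. The sketch below shows that variant together with the default-value fallback for unseen categories. It reuses the example data above; the names `rank_mapping`, `new_data` and the category "Contract" are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employment_Type': ['Unemployed', 'Part-Time', 'Full-Time',
                        'Self-Employed', 'Full-Time', 'Part-Time'],
    'Default_Rate': [0.2, 0.1, 0.05, 0.15, 0.08, 0.12],
})

# Mean target per category, sorted from lowest to highest
category_means = df.groupby('Employment_Type')['Default_Rate'].mean().sort_values()

# Rank variant: the category with the lowest mean gets 0, the next one 1, and so on
rank_mapping = {category: rank for rank, category in enumerate(category_means.index)}
df['Employment_Type_Rank'] = df['Employment_Type'].map(rank_mapping)
print(df)

# Fallback for a category that never appeared when the means were computed
new_data = pd.Series(['Full-Time', 'Contract'])  # 'Contract' is unseen
overall_mean = df['Default_Rate'].mean()
encoded_new = new_data.map(category_means).fillna(overall_mean)
print(encoded_new)
```

In practice the statistics would be computed on the training split only, so that the target values of the evaluation data do not leak into the encoding.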


Q3. Covariance is a statistical measure that indicates the extent to which two random variables change together. Its sign gives the direction of the linear relationship between the two variables, while its magnitude depends on the units of the variables and so does not by itself show how strong the relationship is. A positive covariance indicates that as one variable increases, the other tends to increase as well. A negative covariance indicates that as one variable increases, the other tends to decrease. A covariance close to zero suggests that there is little to no linear relationship between the variables.

Covariance is important in statistical analysis for several reasons:

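Relationship Assessment: The sign of the covariance shows whether two variables tend to move in the same direction or in opposite directions.
Portfolio Diversification: In finance, combining assets whose returns have low or negative covariance reduces the variance of the overall portfolio.
Feature Selection: Features that covary strongly with one another may carry redundant information.
Multivariate Analysis: The covariance matrix summarizes the pairwise relationships among all variables and is an input to many multivariate methods.
Principal Component Analysis (PCA): PCA derives its components from the eigenvectors of the covariance matrix.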


In [4]:
import numpy as np

x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]

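# np.cov returns the 2x2 covariance matrix; the off-diagonal entry [0, 1] is cov(x, y).
# By default it computes the sample covariance (division by n - 1).
covariance = np.cov(x, y)[0, 1]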

print("Covariance:", covariance)


Covariance: 2.5
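
The same value can be checked by hand from the definition of the sample covariance, the sum of the products of the deviations from the means divided by n - 1. This is a minimal sketch; `manual_cov` is an illustrative name.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 4, 5, 6], dtype=float)

n = len(x)
# Deviations of each observation from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of products of deviations, divided by n - 1
manual_cov = np.sum(dx * dy) / (n - 1)
print("Manual covariance:", manual_cov)
print("Matches np.cov:", np.isclose(manual_cov, np.cov(x, y)[0, 1]))
```

Here both sets of deviations are (-2, -1, 0, 1, 2), so the sum of products is 10 and the covariance is 10 / 4 = 2.5.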


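Q4. Label encoding is a technique used to convert categorical variables into numerical values. In the example below, scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the labels, so for Size the result is large=0, medium=1, small=2 rather than the natural size order.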

In [5]:
from sklearn.preprocessing import LabelEncoder

data = {
    'Color': ['red', 'green', 'blue', 'red', 'blue'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'plastic']
}


In [6]:
import pandas as pd
df = pd.DataFrame(data)


In [7]:
print(df)

   Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3    red  medium    metal
4   blue   small  plastic


In [10]:
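# Fit one LabelEncoder per column and keep it, so the codes can be decoded later
label_encoders = {}
for column in list(df.columns):  # copy the column list because new columns are added in the loop
    label_encoders[column] = LabelEncoder()
    df[column + '_encoded'] = label_encoders[column].fit_transform(df[column])
print(df)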

   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    red   small     wood              2             2                 2
1  green  medium    metal              1             1                 0
2   blue   large  plastic              0             0                 1
3    red  medium    metal              2             1                 0
4   blue   small  plastic              0             2                 1
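
To see which integer was assigned to which label, the fitted encoder can be inspected. The sketch below uses only the Size column; `size_mapping` is an illustrative name. It relies on the `classes_` attribute and the `inverse_transform` method of LabelEncoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small'])

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the labels in sorted order; a label's position is its code
size_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(size_mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The mapping confirms that the codes follow alphabetical order, which is why an explicitly ordered encoder is the safer choice for an ordinal feature such as Size.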


Q5. Calculate the covariance matrix:-

In [8]:
import numpy as np

age = [25, 30, 28, 35, 22]
income = [50000, 60000, 55000, 75000, 40000]
education_level = [2, 3, 2, 4, 1]

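# Each row is one variable and each column one observation,
# which is the layout np.cov expects by default (rowvar=True)
data = np.array([age, income, education_level])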
print(data)

[[   25    30    28    35    22]
 [50000 60000 55000 75000 40000]
 [    2     3     2     4     1]]


In [12]:
covariance_matrix = np.cov(data)

print("Covariance Matrix:")
print(covariance_matrix)

Covariance Matrix:
[[2.450e+01 6.375e+04 5.500e+00]
 [6.375e+04 1.675e+08 1.450e+04]
 [5.500e+00 1.450e+04 1.300e+00]]
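
All off-diagonal entries are positive, so age, income and education level tend to increase together in this sample. The magnitudes are not comparable with each other, however, because income is measured on a much larger scale than the other two variables. One way to compare them is to rescale the covariance matrix into a correlation matrix, as in this minimal sketch (variable names are illustrative):

```python
import numpy as np

age = [25, 30, 28, 35, 22]
income = [50000, 60000, 55000, 75000, 40000]
education_level = [2, 3, 2, 4, 1]

data = np.array([age, income, education_level])
covariance_matrix = np.cov(data)

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(covariance_matrix))

# Dividing each covariance by the product of the two standard deviations
# gives the correlation, which always lies between -1 and 1
correlation_matrix = covariance_matrix / np.outer(std, std)
print(np.round(correlation_matrix, 3))

# np.corrcoef computes the same matrix directly
print(np.allclose(correlation_matrix, np.corrcoef(data)))
```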


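Q6. Encoding method for each of the given categorical variables:

  Gender (Nominal Variable - No Inherent Order): One-hot encoding, so that no artificial order is imposed on the categories.

  Education Level (Ordinal Variable with Implicit Order): Ordinal encoding, with integers that follow the order High School < Bachelor's < Master's < PhD.

  Employment Status (Nominal Variable - No Inherent Order): One-hot encoding, for the same reason as Gender.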
  

In [9]:
import pandas as pd

data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Education Level': ['Bachelor\'s', 'Master\'s', 'PhD', 'High School', 'Master\'s'],
    'Employment Status': ['Full-Time', 'Part-Time', 'Unemployed', 'Full-Time', 'Part-Time']
}

df = pd.DataFrame(data)
print(df)

   Gender Education Level Employment Status
0    Male      Bachelor's         Full-Time
1  Female        Master's         Part-Time
2    Male             PhD        Unemployed
3  Female     High School         Full-Time
4    Male        Master's         Part-Time


In [12]:
from sklearn.preprocessing import OrdinalEncoder

# Ordinal encoding for Education Level: list the categories from lowest to highest
education_levels = ['High School', "Bachelor's", "Master's", 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[education_levels])
# fit_transform returns a 2-D array with a single column; ravel() flattens it
df['Education_Level_Encoded'] = ordinal_encoder.fit_transform(df[['Education Level']]).ravel()
print(df)

   Gender Education Level Employment Status  Education_Level_Encoded
0    Male      Bachelor's         Full-Time                      1.0
1  Female        Master's         Part-Time                      2.0
2    Male             PhD        Unemployed                      3.0
3  Female     High School         Full-Time                      0.0
4    Male        Master's         Part-Time                      2.0


In [7]:
# One-hot encoding for Gender, applied after the ordinal step so the
# Education_Level_Encoded column is carried along
df_encoded = pd.get_dummies(df, columns=['Gender'], prefix=['Gender'])
print(df_encoded)

  Education Level Employment Status  Education_Level_Encoded  Gender_Female  \
0      Bachelor's         Full-Time                      1.0              0   
1        Master's         Part-Time                      2.0              1   
2             PhD        Unemployed                      3.0              0   
3     High School         Full-Time                      0.0              1   
4        Master's         Part-Time                      2.0              0   

   Gender_Male  
0            1  
1            0  
2            1  
3            0  
4            1  

In [6]:
# One-hot encode both nominal columns in a single call, so that the Gender dummies are kept.
# Employment Status becomes Employment_Status_Full-Time, Employment_Status_Part-Time
# and Employment_Status_Unemployed.
df_encoded = pd.get_dummies(df, columns=['Gender', 'Employment Status'], prefix=['Gender', 'Employment_Status'])
print(df_encoded)


Q7. Calculate the covariance between each pair of variables in the dataset:-

In [10]:
import numpy as np

temperature = [75, 68, 82, 60, 72]
humidity = [45, 60, 75, 80, 55]
weather_condition = ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy']
wind_direction = ['North', 'South', 'East', 'West', 'North']


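In [17]:
# Rows are the variables (temperature, humidity); columns are the observations
data = np.array([temperature, humidity])
covariance_matrix = np.cov(data)
print("Covariance Matrix between Temperature and Humidity:")
print(covariance_matrix)

Covariance Matrix between Temperature and Humidity:
[[ 66.8 -31.5]
 [-31.5 207.5]]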


In [12]:

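# Label-encode the nominal variables so that np.cov can be applied.
# These integer codes are arbitrary, so the resulting covariance depends on the
# chosen coding and should not be read as a real relationship between the variables.
weather_encoded = np.array([0, 1, 2, 0, 1])  # Encoding Sunny=0, Cloudy=1, Rainy=2
wind_encoded = np.array([0, 1, 2, 3, 0])    # Encoding North=0, South=1, East=2, West=3
categorical_data = np.array([weather_encoded, wind_encoded])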

covariance_matrix_categorical = np.cov(categorical_data)

print("\nCovariance Matrix between Weather Condition and Wind Direction:")
print(covariance_matrix_categorical)



Covariance Matrix between Weather Condition and Wind Direction:
[[0.7  0.05]
 [0.05 1.7 ]]
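
The two matrices above cover only two of the possible pairs. One way to obtain the covariance for every pair, without relying on arbitrary integer codes, is to one-hot encode the nominal variables and compute a single covariance matrix over all columns. This is a sketch of that alternative, assuming pandas is available; the DataFrame and column names are illustrative.

```python
import pandas as pd

weather = pd.DataFrame({
    'temperature': [75, 68, 82, 60, 72],
    'humidity': [45, 60, 75, 80, 55],
    'weather_condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'wind_direction': ['North', 'South', 'East', 'West', 'North'],
})

# One indicator column per category; dtype=float keeps the columns numeric
encoded = pd.get_dummies(
    weather,
    columns=['weather_condition', 'wind_direction'],
    dtype=float,
)

# DataFrame.cov() returns the sample covariance for every pair of columns
pairwise_cov = encoded.cov()
print(pairwise_cov.round(2))
```

Each entry involving an indicator column describes how a numeric variable, or another indicator, varies with the presence of that specific category. With only five observations the values are illustrative rather than reliable.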
