# 1.  What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

- **Ordinal Encoding** and **Label Encoding** are both techniques used to convert categorical data into numerical values, but they are typically used in different situations and have different characteristics.

> - Ordinal Encoding is used when you have categorical data with a clear order or hierarchy. 
> > - For example, "Low," "Medium," and "High" have a logical order, and you can assign numerical values like 1, 2, and 3 to represent this order.

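> - Label Encoding, on the other hand, assigns an arbitrary unique integer to each category without using any order among the categories.
> > - The integers are identifiers only (scikit-learn's LabelEncoder, for example, numbers the classes in sorted order), so a model may still read a ranking into them that does not exist in the data.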

- > For example, let's say we're working with a dataset that includes a "T-shirt size" feature.

- In this case:
> - If we want to keep the inherent order of "T-shirt size" (e.g., "Small" < "Medium" < "Large"), we would use Ordinal Encoding. 
> > - We assign 1, 2, and 3 to represent "Small," "Medium," and "Large," respectively.

> - If we only need a distinct integer for each size and the order does not matter, we would use Label Encoding. 
> > - Each size ("Small," "Medium," "Large") is given a unique integer label, but the integers need not follow the size order (alphabetical sorting, for example, gives Large = 0, Medium = 1, Small = 2).

In [1]:
import pandas as pd

# sample data: T-shirt sizes
data = {
    "T-shirt Size": ["Small", "Medium", "Large", "Small", "Medium"],
}
df = pd.DataFrame(data)

# ordinal encoding: the mapping states the order Small < Medium < Large explicitly
ordinal_mapping = {"Small": 1, "Medium": 2, "Large": 3}
df["T-shirt size (ordinal)"] = df["T-shirt Size"].map(ordinal_mapping)

# label encoding: cat.codes numbers the categories in sorted (alphabetical) order,
# so Large = 0, Medium = 1, Small = 2
df["T-shirt size (label)"] = df["T-shirt Size"].astype("category").cat.codes

print(df)

  T-shirt Size  T-shirt size (ordinal)  T-shirt size (label)
0        Small                       1                     2
1       Medium                       2                     1
2        Large                       3                     0
3        Small                       1                     2
4       Medium                       2                     1


> - > - Here, we demonstrated both encoding techniques on the same column. 
> - > - The ordinal column preserves the order Small < Medium < Large (1, 2, 3), while the label column only gives each size a distinct code in alphabetical order (Large = 0, Medium = 1, Small = 2).
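The same contrast can be sketched with scikit-learn's encoders. This is a minimal illustration, assuming scikit-learn is installed; the variable names (`sizes`, `ordinal_codes`, `label_codes`) are made up for this example. Note that `OrdinalEncoder` numbers categories from 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = pd.DataFrame({"T-shirt Size": ["Small", "Medium", "Large", "Small", "Medium"]})

# OrdinalEncoder: the order is passed explicitly, so Small < Medium < Large is kept.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal_encoder.fit_transform(sizes[["T-shirt Size"]]).ravel().astype(int)

# LabelEncoder: classes are sorted alphabetically (Large, Medium, Small),
# so the integers do not follow the size order.
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(sizes["T-shirt Size"])

print(ordinal_codes.tolist())  # [0, 1, 2, 0, 1]
print(label_codes.tolist())    # [2, 1, 0, 2, 1]
```

scikit-learn documents `LabelEncoder` as a tool for encoding target labels; for input features with a known order, `OrdinalEncoder` with an explicit `categories` list is usually the closer fit.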

# 2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

- **Target Guided Ordinal Encoding** is a technique used to convert categorical data into ordinal numerical values based on the relationship between the category and the target variable.
- It's particularly useful when you want to create ordinal labels for categories that show a meaningful and predictive relationship with the target variable.

> - We calculate the mean or median of the target variable (e.g., the percentage of customers who churned) for each category of the categorical feature.
> - We order the categories based on these mean or median values. 
> - The category with the lowest mean/median is assigned the lowest numerical label, and the category with the highest mean/median is assigned the highest numerical label.
> - This creates an ordinal encoding in which the integer order follows each category's relationship with the target variable.
> - Because the mapping is derived from the target, it should be computed on the training data only and then applied unchanged to validation and test data; otherwise target information leaks into the features.

- > - For example, consider a customer credit dataset.
- Suppose we're working on a project to predict customer credit risk, and we have a dataset with a "Credit Score Range" feature.
- This feature has categories like "Poor," "Fair," "Good," and "Excellent."

- We observe that there's a strong relationship between the "Credit Score Range" and the likelihood of a customer defaulting on a loan.
- "Poor" credit scores are associated with a higher default rate, while "Excellent" scores have a low default rate.

- In this case, we can use Target Guided Ordinal Encoding to create ordinal labels for "Credit Score Range" based on the observed relationship with the target variable (default/non-default). 
- This encoding will reflect the predictive power of each credit score range in determining credit risk.

In [3]:
import pandas as pd 

# sample data - credit score range and target (default: 1, non-default: 0)

data = {
    "Credit Score Range" : ["Poor", "Fair", "Good", "Excellent", "Good", "Excellent", "Poor"],
    "Default" : [1, 0 ,0 ,0 ,1, 0,1]
}
df = pd.DataFrame(data)
df


  Credit Score Range  Default
0               Poor        1
1               Fair        0
2               Good        0
3          Excellent        0
4               Good        1
5          Excellent        0
6               Poor        1


In [5]:
# calculating the mean default rate for each category
mean_default_rate = df.groupby("Credit Score Range")["Default"].mean()

# ordering the categories by default rate, lowest first
ordinal_mapping = mean_default_rate.sort_values().index

# creating ordinal labels: 0 for the lowest default rate, counting upwards
ordinal_labels = {category: label for label, category in enumerate(ordinal_mapping)}

# applying target guided ordinal encoding
df["Credit Score Range (Target Guided Ordinal)"] = df["Credit Score Range"].map(ordinal_labels)

print(df)

  Credit Score Range  Default  Credit Score Range (Target Guided Ordinal)
0               Poor        1                                           3
1               Fair        0                                           1
2               Good        0                                           2
3          Excellent        0                                           0
4               Good        1                                           2
5          Excellent        0                                           0
6               Poor        1                                           3


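> - > - Here, we calculated the mean default rate for each "Credit Score Range" category and then created ordinal labels based on that rate: "Poor" has the highest default rate and gets the highest label (3). 
> - > - "Excellent" and "Fair" both have a default rate of 0 in this small sample, so their relative order (0 and 1) is arbitrary. With more data, the encoding reflects how strongly each credit score range is associated with default, which can make it a useful feature for the machine learning model.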
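In practice the mapping is usually learned once and then reused on new rows. The sketch below is one possible way to do that; the helper name `fit_target_guided_mapping`, the `-1` code for unseen categories, and the `"Unknown"` row are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

def fit_target_guided_mapping(frame, feature, target):
    """Return {category: rank}, with categories ranked by mean target value."""
    means = frame.groupby(feature)[target].mean()
    # A stable sort keeps tied categories in a reproducible (alphabetical) order.
    ordered = means.sort_values(kind="mergesort").index
    return {category: rank for rank, category in enumerate(ordered)}

train = pd.DataFrame({
    "Credit Score Range": ["Poor", "Fair", "Good", "Excellent", "Good", "Excellent", "Poor"],
    "Default": [1, 0, 0, 0, 1, 0, 1],
})
mapping = fit_target_guided_mapping(train, "Credit Score Range", "Default")

# Hypothetical new rows: "Unknown" never appeared in the training data.
new_rows = pd.DataFrame({"Credit Score Range": ["Good", "Poor", "Unknown"]})
# Categories missing from the mapping become NaN, which we replace with -1 here.
new_rows["encoded"] = new_rows["Credit Score Range"].map(mapping).fillna(-1).astype(int)

print(mapping)
print(new_rows)
```

How to treat unseen categories is a design choice; a sentinel such as `-1` is only one option, and dropping or imputing those rows may suit some projects better.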

# 3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

- **Covariance** is a statistical measure that helps us understand the relationship between two sets of data or variables.
- It tells us how changes in one variable are associated with changes in another.
- In a nutshell, its sign shows whether two variables tend to increase or decrease together (positive), move in opposite directions (negative), or have no linear relationship (close to zero).

> - Covariance is essential in statistical analysis for several reasons:
-  > [i] **Relationship Assessment:** It helps us understand if there is a relationship between two variables. 
- If the covariance is positive, it suggests that as one variable increases, the other tends to increase as well. 
- If it's negative, one variable tends to decrease as the other increases.
-  > [ii] **Direction of Association:** It indicates the direction of the relationship. 
- A positive covariance means the two variables move in the same direction, while a negative covariance means they move in opposite directions.
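-  > [iii] **Strength of Association:** The magnitude of the covariance depends on the units of both variables, so on its own it is not a reliable measure of strength. 
- Dividing the covariance by the product of the two standard deviations gives the correlation coefficient, which is unit-free and can be compared across pairs of variables.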
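
The question also asks how covariance is calculated. For $n$ paired observations $(x_i, y_i)$, the sample covariance (the version NumPy's `np.cov` returns by default) is:

$$
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

Here $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$. Dividing by $n$ instead of $n - 1$ gives the population covariance.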


In [4]:
import numpy as np 

## sample data - two sets of data (X and Y)
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]

## calculate the covariance
## np.cov returns the 2 x 2 covariance matrix [[var(X), cov(X, Y)], [cov(Y, X), var(Y)]];
## [0, 1] picks cov(X, Y) from the first row (0) and second column (1)
covariance = np.cov(X, Y)[0, 1]

print(f"Covariance between X and Y : {covariance}")

Covariance between X and Y : 2.0


> - > - Here, we used the np.cov function from the NumPy library to calculate the covariance between the two sets of data, X and Y. 
> - > - The result, 2.0, is positive, so X and Y tend to increase together. By default np.cov uses the sample formula, which divides by n - 1.
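To make the calculation itself visible, the same value can be reproduced step by step. This is a small sketch using the X and Y values from the cell above; the names `dx`, `dy`, and `manual_cov` are introduced only for this illustration.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 5, 4, 6])

# Deviation of each observation from its own mean
dx = X - X.mean()
dy = Y - Y.mean()

# Sample covariance: sum of the products of deviations, divided by (n - 1)
manual_cov = (dx * dy).sum() / (len(X) - 1)

print(manual_cov)
# Compare with NumPy's result
print(np.isclose(manual_cov, np.cov(X, Y)[0, 1]))
```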

# 4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [9]:
from sklearn.preprocessing import LabelEncoder 
import pandas as pd 

## sample data - Categorical variables 
data = {
    "Color" : ["red", "green", "blue", "red", "green"],
    "Size" : ["small", "medium", "large", "medium", "small"], 
    "Material" : ["wood", "metal", "plastic", "metal", "wood"]
}

df = pd.DataFrame(data)
df

   Color    Size  Material
0    red   small      wood
1  green  medium     metal
2   blue   large   plastic
3    red  medium     metal
4  green   small      wood


In [20]:
# initializing the LabelEncoder
label_encoder = LabelEncoder()

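# performing label encoding for each categorical variable
# (fit_transform refits the same encoder on every column, so afterwards
# label_encoder.classes_ only holds the classes of the last column)
for column in df.select_dtypes(include=['object']).columns:
    df[column] = label_encoder.fit_transform(df[column])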

print("Encoded DataFrame:")
print(df)

Encoded DataFrame:
   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      2     1         0
4      1     2         2


> - > - Here, we started with a DataFrame containing categorical data.
> - > - We created a LabelEncoder to transform the categorical values into numerical labels.
> - > - For each categorical column, we applied the label encoding using fit_transform.
> - > - The resulting DataFrame displays the encoded values.
> - > - To interpret the output: LabelEncoder sorts the categories of each column alphabetically and numbers them from 0, so Color becomes blue = 0, green = 1, red = 2; Size becomes large = 0, medium = 1, small = 2; and Material becomes metal = 0, plastic = 1, wood = 2.
> - > - The integers are identifiers only. For Size, the codes do not follow the natural small < medium < large order.

- The encoded representation lets machine learning algorithms that require numeric input work with categorical data.
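If the original category names are needed again later, one option is to keep a separate encoder per column. The sketch below assumes that approach; `df_raw`, `df_encoded`, and the `encoders` dictionary are names chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df_raw = pd.DataFrame({
    "Color": ["red", "green", "blue", "red", "green"],
    "Size": ["small", "medium", "large", "medium", "small"],
    "Material": ["wood", "metal", "plastic", "metal", "wood"],
})

encoders = {}
df_encoded = df_raw.copy()
for column in df_raw.columns:
    encoder = LabelEncoder()
    df_encoded[column] = encoder.fit_transform(df_raw[column])
    encoders[column] = encoder
    # classes_ is sorted; the position of each class is its integer label
    print(column, dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# Recover the original strings from the integer labels
restored_colors = encoders["Color"].inverse_transform(df_encoded["Color"])
print(list(restored_colors))
```

Keeping the fitted encoders also makes it possible to apply the same mapping to new data with `transform` instead of refitting.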

# 5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [24]:
import numpy as np 

## sample data - Age, Income, and Education level 
Age = [30, 35, 25, 40, 28]
Income = [50000, 60000, 45000, 70000, 55000]
Education = [12, 16, 14, 18, 13]

## creating a data matrix (each row is one variable, each column one observation)
data_matrix = np.array([Age, Income, Education])

## calculating the covariance matrix
covariance_matrix = np.cov(data_matrix)

print("Covariance Matrix : ")
print(covariance_matrix)

Covariance Matrix : 
[[3.530e+01 5.425e+04 1.180e+01]
 [5.425e+04 9.250e+07 1.925e+04]
 [1.180e+01 1.925e+04 5.800e+00]]


- The covariance matrix shows how these variables vary individually (diagonal) and together (off-diagonal):
> - > - [i] **Age and Age (Variance):** The variance of Age is approximately 35.3. 
- Variance is the average squared deviation from the mean, so the typical spread is its square root: a standard deviation of about 5.9 years.
> - > - [ii] **Income and Income (Variance):** The variance of Income is 92,500,000.
- The corresponding standard deviation is about 9,600, which is how far individual incomes typically sit from the average income.
> - > - [iii] **Education and Education (Variance):** The variance of Education level is 5.8, a standard deviation of about 2.4 years.
> - > - [iv] **Age and Income (Covariance):** The covariance between Age and Income is 54,250. 
- The positive sign means that as Age increases, Income also tends to increase. 
- The size of the number mostly reflects the units (years times currency), so it does not by itself say how strong the relationship is.
> - > - [v] **Age and Education (Covariance):** The covariance between Age and Education is approximately 11.8.
- This is also positive: Education level tends to increase with Age. It is much smaller than 54,250 only because Education is measured on a much smaller scale than Income.
> - > - [vi] **Income and Education (Covariance):** The covariance between Income and Education is 19,250. 
- It is positive, so higher Income tends to go with a higher Education level in this sample.

> - In simpler terms, the sign of each covariance tells us whether two variables tend to move together or in opposite directions. 
- For example, Age and Income show a positive relationship, meaning that as one increases, the other tends to increase. 
- To compare the strength of these relationships, the covariances need to be converted to correlation coefficients, which are unit-free and lie between -1 and 1.
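
Because covariance depends on units, a unit-free view of the same data can help with interpretation. The sketch below uses `np.corrcoef` on the same three variables; it is an addition for illustration and was not part of the original question.

```python
import numpy as np

Age = [30, 35, 25, 40, 28]
Income = [50000, 60000, 45000, 70000, 55000]
Education = [12, 16, 14, 18, 13]

# Each row is one variable, as in the covariance example
data_matrix = np.array([Age, Income, Education])

# Pearson correlation: covariance divided by the product of standard deviations
correlation_matrix = np.corrcoef(data_matrix)

print(np.round(correlation_matrix, 2))
```

On this sample the off-diagonal correlations come out at roughly 0.95 (Age and Income), 0.82 (Age and Education), and 0.83 (Income and Education), so all three pairs are strongly positively related even though their covariances differ by orders of magnitude.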

# 6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

- **Gender (Binary Categorical Variable - Male/Female):**
> - We use One-Hot Encoding: "Gender" has only two categories here (Male and Female) and no order, so one-hot encoding is a suitable choice. 
> - It converts each category into a binary column (0 or 1), so no artificial order is introduced. Because the two columns are exact complements, a single column (for example via drop_first=True in pd.get_dummies) carries the same information.


In [25]:
import pandas as pd 

data = {
    "Gender" : ["Male", "Female", "Male", "Female", "Male"], 
}

df = pd.DataFrame(data)

## performing One-hot encoding for "Gender" variable 
df = pd.get_dummies(df, columns=["Gender"], prefix = "Gender")

print(df) 

   Gender_Female  Gender_Male
0              0            1
1              1            0
2              0            1
3              1            0
4              0            1


- **Education Level (Ordinal Categorical Variable - High School/Bachelor's/Master's/PhD):**
> - "Education Level" has a natural order (High School < Bachelor's < Master's < PhD), so ordinal encoding is usually the natural choice. 
> - One-hot encoding, used below, is a reasonable alternative when we do not want the model to assume the steps between levels are equally spaced. It represents each education level with a separate binary column.

In [26]:
data = {
    'Education Level': ["High School", "Bachelor's", "Master's", "PhD", "Bachelor's"],
}

df = pd.DataFrame(data)

# Performing one-hot encoding for 'Education Level'
df = pd.get_dummies(df, columns=['Education Level'], prefix='Education')

print(df)

   Education_Bachelor's  Education_High School  Education_Master's  \
0                     0                      1                   0   
1                     1                      0                   0   
2                     0                      0                   1   
3                     0                      0                   0   
4                     1                      0                   0   

   Education_PhD  
0              0  
1              0  
2              0  
3              1  
4              0  
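
If the education levels are treated as ordered (High School < Bachelor's < Master's < PhD), an ordinal encoding is an alternative to the one-hot columns above. The following is a sketch using scikit-learn's `OrdinalEncoder`; the order in `education_order` is an assumption stated for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education = pd.DataFrame({
    "Education Level": ["High School", "Bachelor's", "Master's", "PhD", "Bachelor's"],
})

# Assumed order, lowest to highest
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
encoder = OrdinalEncoder(categories=[education_order])

# fit_transform returns a 2-D float array; flatten it and cast to int
education["Education Level (Ordinal)"] = (
    encoder.fit_transform(education[["Education Level"]]).ravel().astype(int)
)

print(education)
```

The codes run from 0 (High School) to 3 (PhD). This keeps the order in a single column, at the cost of implying equal spacing between levels.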


- **Employment Status (Ordinal Categorical Variable - Unemployed/Part-Time/Full-Time):**
> - We use Ordinal Encoding: we treat "Employment Status" as ordered by amount of work (Unemployed < Part-Time < Full-Time).
> - Ordinal encoding assigns numerical values according to this order. If that ordering is not meaningful for the prediction task, one-hot encoding is the safer choice.

In [27]:
data = {
    'Employment Status': ['Part-Time', 'Unemployed', 'Full-Time', 'Part-Time', 'Full-Time'],
}

df = pd.DataFrame(data)

# Defining the ordinal mapping
ordinal_mapping = {'Unemployed': 1, 'Part-Time': 2, 'Full-Time': 3}

# Performing ordinal encoding for 'Employment Status'
df['Employment Status (Ordinal)'] = df['Employment Status'].map(ordinal_mapping)

print(df)

  Employment Status  Employment Status (Ordinal)
0         Part-Time                            2
1        Unemployed                            1
2         Full-Time                            3
3         Part-Time                            2
4         Full-Time                            3


> - > - Hence, by matching the encoding method to the type of variable, we give the model numeric features that reflect the data.
> - > - One-hot encoding suits nominal data, while ordinal encoding suits data with a meaningful order.
> - > - Choosing correctly keeps the model from reading an order into categories that have none, or from losing an order that does exist.

# 7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

- To calculate the covariance between each pair of variables in a dataset with two continuous variables (Temperature and Humidity) and two categorical variables (Weather Condition and Wind Direction), we need to handle the continuous variables separately from the categorical ones.

> - Covariance is generally used to measure the linear relationship between two continuous variables. 
> - It is not directly applicable to categorical variables. 
> - We can calculate the covariance between the two continuous variables (Temperature and Humidity) directly and interpret its sign. 
> - For the categorical variables, we first one-hot encode them and then calculate covariances with the resulting 0/1 indicator columns.

In [31]:
import numpy as np
import pandas as pd

# sample data - Temperature, Humidity, Weather Condition, Wind Direction
Temperature = [25, 30, 22, 27, 28]
Humidity = [50, 60, 45, 55, 58]
Weather_Condition = ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy']
Wind_Direction = ['North', 'South', 'East', 'North', 'West']

# Calculating the covariance between Temperature and Humidity
cov_cont = np.cov(Temperature, Humidity)[0, 1]

print(f'Covariance between Temperature and Humidity: {cov_cont:.2f}')

# Encoding categorical variables using one-hot encoding
Weather_Encoded = pd.get_dummies(Weather_Condition, prefix='Weather')
Wind_Encoded = pd.get_dummies(Wind_Direction, prefix='Wind')

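# calculating the covariance between Temperature and each one-hot column
# rowvar=False treats each column of the encoded frame as one variable;
# the result is ordered as Temperature followed by the dummy columns
cov_cont_weather = np.cov(Temperature, Weather_Encoded, rowvar=False)
cov_cont_wind = np.cov(Temperature, Wind_Encoded, rowvar=False)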

print('Covariance between Temperature and Weather Condition:')
print(cov_cont_weather)
print('\nCovariance between Temperature and Wind Direction:')
print(cov_cont_wind)


Covariance between Temperature and Humidity: 18.45
Covariance between Temperature and Weather Condition:
[[ 9.3  0.9 -0.7 -0.2]
 [ 0.9  0.2 -0.1 -0.1]
 [-0.7 -0.1  0.3 -0.2]
 [-0.2 -0.1 -0.2  0.3]]

Covariance between Temperature and Wind Direction:
[[ 9.3  -1.1  -0.2   0.9   0.4 ]
 [-1.1   0.2  -0.1  -0.05 -0.05]
 [-0.2  -0.1   0.3  -0.1  -0.1 ]
 [ 0.9  -0.05 -0.1   0.2  -0.05]
 [ 0.4  -0.05 -0.1  -0.05  0.2 ]]
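
The unlabeled matrices above are easier to read with row and column names. One way to get them, sketched here under the assumption that pandas is available, is to put all variables in one DataFrame and call `DataFrame.cov()`; the names `weather`, `encoded`, and `cov_table` are chosen for this example.

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature": [25, 30, 22, 27, 28],
    "Humidity": [50, 60, 45, 55, 58],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"],
    "Wind": ["North", "South", "East", "North", "West"],
})

# dtype=float makes the indicator columns numeric regardless of pandas version
encoded = pd.get_dummies(weather, columns=["Weather", "Wind"], dtype=float)

# DataFrame.cov() returns the sample covariance with row and column labels
cov_table = encoded.cov()

print(cov_table.round(2))
```

This also covers the pairs involving Humidity, which the cell above does not compute.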


- # [i] **Covariance between Temperature and Humidity:**
> - > - The covariance is 18.45. The positive sign indicates that as Temperature increases, Humidity also tends to increase. 
> - > - In simple terms, the warmer days in this sample are also the more humid ones.
- # [ii] **Covariance between Temperature and Weather Condition:**
> - > - The first row and column of the matrix belong to Temperature (variance 9.3). The remaining three belong to the one-hot columns, which get_dummies orders alphabetically: Cloudy, Rainy, Sunny.
> - > - The first row therefore reads: Temperature with Cloudy 0.9, with Rainy -0.7, and with Sunny -0.2. A positive value means Temperature was above its overall average on days in that category (the single Cloudy day was the hottest), and a negative value means it was below. 
> - > - The lower-right 3 x 3 block holds the covariances among the indicator columns themselves. Their off-diagonal entries are negative simply because a day can belong to only one category.
- # [iii] **Covariance between Temperature and Wind Direction:**
> - > - The layout is the same: Temperature first, then East, North, South, West. The first row reads: East -1.1, North -0.2, South 0.9, West 0.4.
> - > - With only one or two days per direction, these values describe this tiny sample rather than any general pattern.

- Covariance with a one-hot column is harder to read than covariance between two continuous variables, because it mixes how far the category's mean lies from the overall mean with how common the category is.
> - To assess relationships between categorical and continuous variables, comparing group means or using a test such as ANOVA is usually more suitable.
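
As one simple example of such an alternative, the mean Temperature can be compared across weather categories. This is only a descriptive sketch on the five sample rows, not a statistical test; the names `weather` and `group_means` are introduced for this illustration.

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature": [25, 30, 22, 27, 28],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Sunny", "Rainy"],
})

# Mean Temperature within each weather category
group_means = weather.groupby("Weather")["Temperature"].mean()

print(group_means)
```

Group means are stated in the original units (degrees), which makes them easier to read than covariances with indicator columns. With so few rows, any difference between groups should be treated as illustrative only.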