# Feature Engineering Assignment - 4

**Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other.**

Ordinal encoding and label encoding both represent categorical data as integers. They differ in whether the order of those integers is meant to carry information.

1. **Label Encoding:**
   - In label encoding, each unique category is assigned a unique integer label.
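   - The order of the labels does not convey any information about the relationships between categories. scikit-learn's `LabelEncoder`, for instance, simply numbers the classes in sorted (alphabetical) order.
   - Label encoding is commonly used for nominal data (categories without an inherent order) and for target labels. Models that treat inputs as numeric may still read the integers as ordered, so one-hot encoding is often safer for nominal input features.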

2. **Ordinal Encoding:**
   - Ordinal encoding is used when there is an inherent order or ranking among the categories.
   - It assigns integer labels to categories based on their order, preserving the information about the ordinal relationship between them.
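   - Ordinal encoding is suitable for ordinal data (categories with a meaningful order). For example, a size feature with levels Low < Medium < High calls for ordinal encoding, while a color feature or a class label, where no order exists, can be label encoded.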

In [9]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

color = ['Red', 'Green', 'Blue', 'Red', 'Blue']
# OrdinalEncoder expects 2D input: one row per sample, one column per feature
categories = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

# Label encoding for color: integers follow the sorted order of the classes
color_encoder = LabelEncoder()
color_encoding = color_encoder.fit_transform(color)
print("Color Encoding:", color_encoding)

# Ordinal encoding for categories: the explicit list fixes Low < Medium < High
ordinal_encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
categories_encoding = ordinal_encoder.fit_transform(categories)
print("Categories Encoding:", categories_encoding)


Color Encoding: [2 1 0 2 0]
Categories Encoding: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]


**Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project.**

Target Guided Ordinal Encoding is a technique used to encode categorical variables based on the mean or other statistical measure of the target variable for each category. The steps involved are:

1. **Calculate the Mean Target for Each Category:**
   - Group the data by the categorical variable.
   - Calculate the mean (or another statistical measure) of the target variable for each category.

2. **Order Categories Based on Target Mean:**
   - Order the categories based on the calculated mean of the target variable in ascending or descending order.

3. **Assign Ordinal Labels:**
   - Assign ordinal labels to the categories based on their order of means. The category with the lowest mean gets the lowest label, and so on.

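The idea is to use information from the target variable so that the integer codes reflect each category's relationship with the target. This is particularly useful for nominal variables with many categories (for example, a city or product-ID column when predicting purchases), where one-hot encoding would create too many columns. Because the encoding is derived from the target, it should be computed on the training data only and then applied to validation and test data; otherwise target information leaks into the features.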

In [12]:
import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C'],
    'Target': [1, 0, 1, 0, 1, 1, 0, 0, 1]
}

df = pd.DataFrame(data)

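# Step 1: mean of the target for each category
mean_target = df.groupby('Category')['Target'].mean().to_dict()

# Map each category to its target mean (mean encoding).
# Ranking these means would turn them into the integer labels of steps 2-3.
df['Category_encoded'] = df['Category'].map(mean_target)

print(df)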


  Category  Target  Category_encoded
0        A       1          0.666667
1        B       0          0.333333
2        A       1          0.666667
3        C       0          0.666667
4        B       1          0.333333
5        C       1          0.666667
6        A       0          0.666667
7        B       0          0.333333
8        C       1          0.666667
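
The cell above maps each category to its target mean, and categories A and C tie at 0.666667. To obtain the integer labels described in step 3, the means can be ranked. The following is a minimal sketch under assumptions: the toy data is hypothetical and was chosen so that the three means differ, and `ordinal_map` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data, chosen so that the three category means differ
df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "C", "A", "B", "C"],
    "Target":   [1, 0, 1, 0, 0, 1, 1, 0, 0],
})

# Steps 1-2: mean target per category, sorted from lowest to highest
means = df.groupby("Category")["Target"].mean().sort_values()

# Step 3: the category with the lowest mean gets label 0, the next gets 1, ...
ordinal_map = {category: rank for rank, category in enumerate(means.index)}

df["Category_encoded"] = df["Category"].map(ordinal_map)
print(ordinal_map)
print(df)
```

In a real project the mapping would be learned on the training split only and then applied to the other splits with the same `map` call.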


**Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?**

**Covariance:**
Covariance is a measure of how much two random variables change together. It indicates the direction of the linear relationship between two variables. In other words, it measures the extent to which one variable tends to increase or decrease as the other variable does. A positive covariance suggests a direct relationship (both variables tend to increase or decrease together), while a negative covariance suggests an inverse relationship (one variable tends to increase as the other decreases).

**Importance in Statistical Analysis:**
1. **Relationship Assessment:** Covariance is crucial for understanding the relationship between two variables. It helps identify whether an increase in one variable is associated with an increase or decrease in another.

2. **Portfolio Diversification:** In finance, covariance is used to analyze the relationships between the returns of different assets. Positive covariance indicates that the assets move in the same direction, while negative covariance suggests diversification potential.

3. **Regression Analysis:** In simple linear regression, covariance determines the slope of the fitted line: the least-squares slope equals \(\text{Cov}(X, Y) / \text{Var}(X)\).

**Calculation of Covariance:**
The covariance between two variables, X and Y, is calculated using the following formula:

\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \]

Where:
- \(X_i\) and \(Y_i\) are individual data points.
- \(\bar{X}\) and \(\bar{Y}\) are the means of X and Y, respectively.
- \(n\) is the number of data points.

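The formula above is the sample covariance; dividing by \(n-1\) rather than \(n\) gives an unbiased estimate. The corresponding population definition uses expected values, with \(\mu_X = \mathbb{E}[X]\) and \(\mu_Y = \mathbb{E}[Y]\):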

\[ \text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] \]

Covariance depends on the units of the variables, so its magnitude alone does not indicate the strength of the relationship. For a standardized measure, the correlation coefficient is used: \(\rho_{X,Y} = \text{Cov}(X, Y) / (\sigma_X \sigma_Y)\), which always lies between -1 and 1.
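
As a small check of the sample formula, the sketch below computes the covariance by hand and compares it with NumPy. The arrays `x` and `y` are made-up illustrative values, not data from the assignment.

```python
import numpy as np

# Illustrative values only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)

# Sample covariance: sum of products of deviations, divided by n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix and also divides by n - 1 by default
cov_numpy = np.cov(x, y)[0, 1]

# Correlation: covariance scaled by the two sample standard deviations
corr = cov_manual / (x.std(ddof=1) * y.std(ddof=1))

print(cov_manual, cov_numpy, corr)
```

Here the manual result and `np.cov` agree (both give 1.5), and `corr` matches `np.corrcoef(x, y)[0, 1]`.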

**Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output.**

In [29]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

Color = ['Red', 'Green', 'Blue', 'Red', 'Blue']
Size = ['Low', 'Medium', 'High', 'Medium', 'Low']
Material = ['wood', 'metal', 'plastic', 'metal', 'wood']
df = pd.DataFrame({'Color': Color, 'Size': Size, 'Material': Material})

columns = ['Color', 'Size', 'Material']

# Fit a separate encoder per column so each learned mapping stays available
encoders = {}
for column in columns:
    encoders[column] = LabelEncoder()
    df[column + '_encoded'] = encoders[column].fit_transform(df[column])
print(df)

   Color    Size Material  Color_encoded  Size_encoded  Material_encoded
0    Red     Low     wood              2             1                 2
1  Green  Medium    metal              1             2                 0
2   Blue    High  plastic              0             0                 1
3    Red  Medium    metal              2             2                 0
4   Blue     Low     wood              0             1                 2
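
To read the output: `LabelEncoder` sorts the distinct values of each column and numbers them from 0. That gives Blue=0, Green=1, Red=2 for Color; High=0, Low=1, Medium=2 for Size; and metal=0, plastic=1, wood=2 for Material. The Size codes therefore follow alphabetical order rather than Low < Medium < High, which is why an ordinal encoder with an explicit category order is the better fit for that column. A minimal sketch that recovers these mappings from the fitted encoders' `classes_` attribute (`mappings` is an illustrative name):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Red", "Blue"],
    "Size": ["Low", "Medium", "High", "Medium", "Low"],
    "Material": ["wood", "metal", "plastic", "metal", "wood"],
})

mappings = {}
for column in df.columns:
    encoder = LabelEncoder().fit(df[column])
    # classes_ holds the sorted unique values; the value at position i is encoded as i
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```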


**Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results.**

In [31]:
import pandas as pd
data = {
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 75000, 80000, 90000],
    'EducationLevel': [12, 16, 14, 18, 20]
}

df = pd.DataFrame(data)

covariance_matrix = df.cov()

print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
                     Age       Income  EducationLevel
Age                 62.5     125000.0            22.5
Income          125000.0  255000000.0         42500.0
EducationLevel      22.5      42500.0            10.0
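
Interpreting the matrix: the diagonal holds the variances (62.5 for Age, 255,000,000 for Income, 10 for EducationLevel), and every off-diagonal entry is positive, so in this sample Age, Income and EducationLevel all tend to rise together. The magnitudes are not comparable with each other because they depend on the units; the Age-Income covariance of 125,000 is large mainly because Income is measured in large numbers. One way to compare the strength of the relationships is to convert to correlations, sketched below with the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [50000, 60000, 75000, 80000, 90000],
    "EducationLevel": [12, 16, 14, 18, 20],
})

# Pearson correlation: each covariance divided by the product of the
# two standard deviations, which removes the dependence on units
correlation_matrix = df.corr()
print(correlation_matrix.round(3))
```

The correlations come out at roughly 0.99 for Age-Income, 0.90 for Age-EducationLevel and 0.84 for Income-EducationLevel, so all three pairs are strongly and positively related in this small sample.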


**Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why?**

When dealing with categorical variables in a machine learning project, the choice of encoding method depends on the nature of the variables and the machine learning algorithm you plan to use. Here's a general guideline for encoding the given categorical variables:

1. **Gender (Binary Categorical Variable):**
   - **Encoding Method:** Label Encoding or One-Hot Encoding.
   - **Explanation:**
     - For binary categorical variables like "Gender" (Male/Female), label encoding (0 and 1) is enough: with only two values, the integers mark the presence or absence of one category and imply no ranking.
     - Alternatively, one-hot encoding can be used. It creates one binary column per category, but for a binary variable the two columns are redundant (one is always 1 minus the other), so one of them is usually dropped, which gives the same result as the 0/1 label encoding.

2. **Education Level (Ordinal Categorical Variable):**
   - **Encoding Method:** Ordinal Encoding or One-Hot Encoding.
   - **Explanation:**
     - Since "Education Level" has an inherent order (e.g., High School < Bachelor's < Master's < PhD), you can use ordinal encoding to preserve this order.
     - Alternatively, one-hot encoding treats education levels as distinct categories. It avoids the assumption, implicit in ordinal encoding, that the steps between levels are equally spaced, at the cost of discarding the order and adding one column per level.

3. **Employment Status (Nominal Categorical Variable):**
   - **Encoding Method:** One-Hot Encoding.
   - **Explanation:**
     - "Employment Status" is treated here as nominal (Unemployed, Part-Time, Full-Time). One-hot encoding is appropriate for nominal variables because it creates one binary column per category and imposes no order. If the analysis instead treats the categories as increasing degrees of employment, ordinal encoding is a possible alternative.

In summary:
- Use label encoding or one-hot encoding for binary variables like "Gender."
- Use ordinal encoding or one-hot encoding for ordinal variables like "Education Level."
- Use one-hot encoding for nominal variables like "Employment Status."

The choice of encoding can affect model performance. Tree-based models are often tolerant of integer codes, whereas linear and distance-based models treat the codes as magnitudes, so it is worth comparing encodings empirically for your specific model.
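
The choices above can be sketched as follows. The rows are hypothetical sample data, and the output column names (`Gender_encoded`, `Education_encoded`, the `Status` prefix) are illustrative assumptions rather than anything prescribed by the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Unemployed", "Full-Time", "Part-Time", "Full-Time"],
})

# Binary variable: a single 0/1 column is sufficient
df["Gender_encoded"] = (df["Gender"] == "Male").astype(int)

# Ordinal variable: pass the category order explicitly
education_order = [["High School", "Bachelor's", "Master's", "PhD"]]
education_encoder = OrdinalEncoder(categories=education_order)
df["Education_encoded"] = education_encoder.fit_transform(df[["Education Level"]]).ravel()

# Nominal variable: one binary column per category
status_dummies = pd.get_dummies(df["Employment Status"], prefix="Status", dtype=int)

encoded = pd.concat([df[["Gender_encoded", "Education_encoded"]], status_dummies], axis=1)
print(encoded)
```

`pd.get_dummies` is used here for brevity; inside a scikit-learn pipeline, `OneHotEncoder` would play the same role.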

**Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results.**

In [34]:
import pandas as pd

# Example Data
data = {
    'Temperature': [25, 28, 22, 30, 26],
    'Humidity': [50, 60, 45, 70, 55],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Cloudy'],
    'Wind Direction': ['North', 'South', 'East', 'West', 'North']
}

df = pd.DataFrame(data)

# Calculate the covariance matrix
covariance_matrix = df[['Temperature','Humidity']].cov()

# Display the covariance matrix
print("Covariance Matrix:")
print(covariance_matrix)


Covariance Matrix:
             Temperature  Humidity
Temperature          9.2      28.5
Humidity            28.5      92.5


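Interpreting the results:

- The diagonal elements are the variances of each variable: 9.2 for Temperature and 92.5 for Humidity.
- The off-diagonal element, 28.5, is the covariance between Temperature and Humidity. It is positive, so in this sample higher temperatures tend to occur with higher humidity; a negative value would have indicated an inverse relationship.
- The magnitude of 28.5 depends on the units. Dividing by the two standard deviations gives a correlation of about 0.98, which indicates a strong linear relationship.
- Covariance is defined only for numeric variables, so "Weather Condition" and "Wind Direction" are left out of the matrix. They would need to be encoded first (for example, one-hot encoded); integer codes from label encoding would not give meaningful covariances because their order is arbitrary.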
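
One possible way to bring the categorical variables into the calculation, offered as an assumption about how to approach it rather than as the required answer, is to one-hot encode them and compute the covariance matrix over all resulting numeric columns. The sketch reuses the example data from the cell above; `encoded` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [25, 28, 22, 30, 26],
    "Humidity": [50, 60, 45, 70, 55],
    "Weather Condition": ["Sunny", "Cloudy", "Rainy", "Sunny", "Cloudy"],
    "Wind Direction": ["North", "South", "East", "West", "North"],
})

# One indicator column (0.0 or 1.0) per category of each categorical variable
encoded = pd.get_dummies(
    df, columns=["Weather Condition", "Wind Direction"], dtype=float
)

# Covariance between every pair of numeric and indicator columns
covariance_matrix = encoded.cov()
print(covariance_matrix.round(2))
```

A positive covariance between a numeric variable and an indicator column means the numeric variable tends to be higher when that category is present. With only five rows, and some categories appearing once, these values are illustrative rather than reliable.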