### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.


Ordinal Encoding and Label Encoding are techniques to convert categorical data into numerical format.

- **Label Encoding:**
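  - This assigns a unique integer to each category. The integers are arbitrary identifiers and are not meant to encode any order, although a model that treats the column as numeric can still read an order into them. It is typically used for categorical data that does not have a specific order.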
  - Example: If you have a feature `Color` with categories `Red`, `Green`, and `Blue`, label encoding might assign `0` to `Red`, `1` to `Green`, and `2` to `Blue`.

- **Ordinal Encoding:**
  - This also assigns a unique integer to each category but implies an order or ranking among the categories. It is used for categorical data that has a clear order.
  - Example: For a feature `Size` with categories `Small`, `Medium`, and `Large`, ordinal encoding might assign `0` to `Small`, `1` to `Medium`, and `2` to `Large`.

Choose Label Encoding when the categorical variable does not have an intrinsic order (e.g., color, brand names). Choose Ordinal Encoding when the categorical variable has a meaningful order (e.g., rating scales, sizes).
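The contrast can be sketched with scikit-learn. This is a minimal illustration, assuming `LabelEncoder` for the unordered `Color` feature and `OrdinalEncoder` with an explicitly supplied order for `Size`; the sample values are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative values only (not from a real dataset).
colors = ["Red", "Green", "Blue", "Green"]
sizes = [["Small"], ["Large"], ["Medium"], ["Small"]]  # 2D: one column

# LabelEncoder sorts the classes alphabetically, so the integers are arbitrary IDs.
color_encoder = LabelEncoder()
color_codes = color_encoder.fit_transform(colors)
print(color_encoder.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(color_codes.tolist())             # [2, 1, 0, 1]

# OrdinalEncoder accepts an explicit category order, so the integers carry rank.
size_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = size_encoder.fit_transform(sizes)
print(size_codes.ravel().tolist())      # [0.0, 2.0, 1.0, 0.0]
```

Note that scikit-learn documents `LabelEncoder` as a tool for target labels; it is used on a feature here only to mirror the definition above.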


### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.


Target Guided Ordinal Encoding involves ordering the categories according to the mean (or median) of the target variable for each category. It is useful when there is a correlation between the categorical feature and the target variable.

**Steps:**
1. Calculate the mean (or median) of the target variable for each category.
2. Assign ranks to the categories based on these statistics.
3. Replace categories with their corresponding ranks.

**Example:**
Suppose we have a dataset with a categorical variable `City` and a target variable `House Price`. If we calculate the mean house price for each city and rank the cities accordingly, we can replace city names with their rank.

Use Target Guided Ordinal Encoding when you believe the categorical variable has a significant impact on the target variable.
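The three steps can be sketched in pandas. The city labels and prices below are hypothetical values invented for illustration, and the column names `HousePrice` and `City_encoded` are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "City": ["A", "B", "C", "A", "B", "C"],
    "HousePrice": [300, 150, 200, 320, 170, 220],
})

# Step 1: mean of the target for each category.
city_means = df.groupby("City")["HousePrice"].mean()

# Step 2: rank categories by that mean (0 = lowest mean price).
ordered_cities = city_means.sort_values().index
city_rank = {city: rank for rank, city in enumerate(ordered_cities)}

# Step 3: replace each category with its rank.
df["City_encoded"] = df["City"].map(city_rank)
print(city_rank)  # {'B': 0, 'C': 1, 'A': 2}
```

In a real project the means would normally be computed on the training split only and then applied to validation and test data, so that target information does not leak into evaluation.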


### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance is a measure of how much two random variables vary together. If the greater values of one variable correspond to the greater values of the other variable, and the lesser values correspond to the lesser values, the covariance is positive. If greater values of one variable correspond to lesser values of the other variable, the covariance is negative.

**Importance:**
Covariance is important because it indicates the direction of the linear relationship between variables. It helps in understanding how one variable changes with respect to another.

**Calculation:**

The sample covariance is

$$\operatorname{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

where:
- $X$ and $Y$ are the two variables
- $X_i$ and $Y_i$ are individual samples
- $\bar{X}$ and $\bar{Y}$ are the means of the variables
- $n$ is the number of samples
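As a check on the calculation, the following sketch implements the sample covariance (with the `n - 1` divisor) by hand and compares it with `numpy.cov`. The helper name `sample_covariance` and the input values are made up for this example.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    return float(np.sum((x - x.mean()) * (y - y.mean())) / (n - 1))

# Illustrative values only.
x = [2, 4, 6, 8]
y = [1, 3, 5, 11]

manual = sample_covariance(x, y)
# np.cov returns the 2x2 covariance matrix; [0, 1] is Cov(x, y).
# Its default divisor is also n - 1.
reference = float(np.cov(x, y)[0, 1])
print(manual, reference)
```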


### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.



In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']
}
df = pd.DataFrame(data)

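label_encoder = LabelEncoder()

# fit_transform refits the encoder on every call, so each column gets its own
# independent 0-based mapping (categories are numbered in alphabetical order).
df['Color'] = label_encoder.fit_transform(df['Color'])
df['Size'] = label_encoder.fit_transform(df['Size'])
df['Material'] = label_encoder.fit_transform(df['Material'])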

df


|   | Color | Size | Material |
|---|-------|------|----------|
| 0 | 2 | 2 | 2 |
| 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 |
| 4 | 2 | 2 | 2 |


The output DataFrame shows each categorical variable encoded as integers. `LabelEncoder` sorts the categories alphabetically and numbers them from `0`, so:
- `Color`: `blue` → `0`, `green` → `1`, `red` → `2`
- `Size`: `large` → `0`, `medium` → `1`, `small` → `2`
- `Material`: `metal` → `0`, `plastic` → `1`, `wood` → `2`

Note that the `Size` codes follow alphabetical order, not the natural small < medium < large ranking.
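To read the mapping off the encoder instead of inferring it from the output, one option is to fit a separate `LabelEncoder` per column and inspect its `classes_` attribute. The sketch below rebuilds the same data; the names `raw` and `mappings` are assumptions of this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

raw = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood'],
})

mappings = {}
for column in raw.columns:
    encoder = LabelEncoder().fit(raw[column])
    # classes_ is sorted; the category at position i is encoded as integer i.
    mappings[column] = {
        category: code for code, category in enumerate(encoder.classes_.tolist())
    }

print(mappings['Size'])  # {'large': 0, 'medium': 1, 'small': 2}
```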


### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.


In [4]:
import pandas as pd

data = {
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 100000, 75000, 120000, 45000],
    'Education': [12, 16, 14, 18, 12]
}
df = pd.DataFrame(data)

cov_matrix = df.cov()
cov_matrix


|   | Age | Income | Education |
|---|-----|--------|-----------|
| Age | 141.8 | 381500.0 | 30.7 |
| Income | 381500.0 | 1032500000.0 | 83500.0 |
| Education | 30.7 | 83500.0 | 6.8 |
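Every off-diagonal entry is positive, so in this sample Age, Income, and Education tend to rise together, and the diagonal entries are the variances of each variable. The magnitudes are not directly comparable because they depend on the units (the Income entries are large simply because Income is measured in much larger numbers). A sketch of one way to put the pairs on a common scale, assuming Pearson correlation via `DataFrame.corr()` is an acceptable follow-up:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 100000, 75000, 120000, 45000],
    'Education': [12, 16, 14, 18, 12],
})

cov_matrix = df.cov()

# Diagonal entries of a covariance matrix are the per-variable variances.
variances = df.var()

# Pearson correlation rescales each covariance by the two standard deviations,
# which removes the units and bounds every entry to [-1, 1].
corr_matrix = df.corr()
print(corr_matrix.round(3))
```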


### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including Gender (Male/Female), Education Level (High School/Bachelor's/Master's/PhD), and Employment Status (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?


- **Gender:**
  Use Label Encoding because there are only two categories and no intrinsic order (Male, Female).

- **Education Level:**
  Use Ordinal Encoding because the categories have a meaningful order (High School < Bachelor's < Master's < PhD).

- **Employment Status:**
  Use One-Hot Encoding because there are multiple categories with no intrinsic order (Unemployed, Part-Time, Full-Time).
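The three choices can be sketched together as follows. The rows are hypothetical and exist only to exercise the encoders; using `pd.get_dummies` for the one-hot step is an assumption of this sketch (scikit-learn's `OneHotEncoder` would be an alternative).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical rows, invented only to exercise the three encoders.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["Bachelor's", "High School", "PhD", "Master's"],
    "Employment Status": ["Full-Time", "Unemployed", "Part-Time", "Full-Time"],
})

# Gender: two categories, so a single 0/1 column is enough.
df["Gender"] = LabelEncoder().fit_transform(df["Gender"])

# Education Level: pass the order explicitly so the integers follow the ranking.
education_order = ["High School", "Bachelor's", "Master's", "PhD"]
education_encoder = OrdinalEncoder(categories=[education_order])
df["Education Level"] = education_encoder.fit_transform(df[["Education Level"]]).ravel()

# Employment Status: one indicator column per category, no order implied.
df = pd.get_dummies(df, columns=["Employment Status"], dtype=int)
print(df)
```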


### Q7. You are analyzing a dataset with two continuous variables, Temperature and Humidity, and two categorical variables, Weather Condition (Sunny/Cloudy/Rainy) and Wind Direction (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.


In [5]:
import pandas as pd

data = {
    'Temperature': [30, 25, 27, 22, 28],
    'Humidity': [70, 65, 68, 75, 72]
}
df = pd.DataFrame(data)

cov_temp_humidity = df.cov()
cov_temp_humidity


|   | Temperature | Humidity |
|---|-------------|----------|
| Temperature | 9.3 | -3.25 |
| Humidity | -3.25 | 14.5 |


The off-diagonal entry of the covariance matrix is the covariance between Temperature and Humidity: `-3.25`.
- The value is negative, so in this sample higher Temperature tends to go with lower Humidity. (A positive value would have meant the two tend to increase together.)
- The diagonal entries are the variances: `9.3` for Temperature and `14.5` for Humidity.
- Since categorical variables cannot be used directly in covariance calculations, Weather Condition and Wind Direction are not included in this analysis.
