### Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.
Ans: \

###  **1. Ordinal Encoding**

- Assigns **integer values** to categories **with a meaningful order** (i.e., ordinal data).
- The numeric values **represent ranking or levels**.
  
**Example:**
```text
Education Level = ['High School', 'Bachelor', 'Master', 'PhD']
Encoded as:      [0, 1, 2, 3]
```

 **Use when** the categories have **natural order or hierarchy**.
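
A minimal sketch of the idea with scikit-learn's `OrdinalEncoder`, assuming a small hypothetical `Education` column. The ranking is passed explicitly through `categories`, because otherwise the encoder sorts the categories alphabetically and the integers would not follow the real hierarchy.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample; the order is supplied explicitly so that the integers
# follow the real ranking rather than alphabetical order.
education_order = ["High School", "Bachelor", "Master", "PhD"]
df = pd.DataFrame({"Education": ["Master", "High School", "PhD", "Bachelor"]})

encoder = OrdinalEncoder(categories=[education_order])
# fit_transform returns a 2D float array with one column; flatten it.
df["Education_enc"] = encoder.fit_transform(df[["Education"]]).ravel()

print(df)
```

Here `High School` becomes 0.0 and `PhD` becomes 3.0, matching the order in `education_order`.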

---

###  **2. Label Encoding**

- Also assigns **integers** to categories, **but without assuming any order**.
- Treats categories as **just different labels**.

**Example:**
```text
Color = ['Red', 'Green', 'Blue']
Encoded as: [0, 1, 2]
```

 **Problem**: Some models may **misinterpret** these values as having order (e.g., 2 > 1 > 0), even though there's no such relationship.

 **Use only when:**
- The number of categories is small
- You're using a model that can **handle categorical data properly** (e.g., tree-based models)
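
As an illustration, scikit-learn's `LabelEncoder` assigns integers by sorted class order, so the codes it produces will not necessarily match the order in which the categories are listed above. Its documentation describes it as a tool for encoding target labels; the sketch below uses a hypothetical list of colors only to show how the codes are assigned.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for a nominal feature.
colors = ["Red", "Green", "Blue", "Green"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted categories; each category's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)
print(codes.tolist())
```

The integers come purely from alphabetical sorting (`Blue` = 0, `Green` = 1, `Red` = 2), which is why they carry no meaning beyond identity.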

---

###  **Key Differences:**

| Feature               | Ordinal Encoding               | Label Encoding                |
|-----------------------|--------------------------------|-------------------------------|
| Category type         | Ordered                        | Unordered                     |
| Values represent      | Rank/Level                     | Unique identity only          |
| Risk in linear models | Low (if true order exists)     | High (can introduce false order) |
| Example use case      | Education level, Shirt sizes   | Country, City, Animal species |

---

###  **When to Choose Which:**

- **Ordinal Encoding**: Use for ordered categories like `["Low", "Medium", "High"]` or `["Beginner", "Intermediate", "Expert"]`
- **Label Encoding**: Use for unordered categories **only if**:
  - The model doesn’t assume ordinal meaning (e.g., Decision Trees)
  - Or as a temporary step before using embedding layers (in deep learning)

### Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.
Ans: \

**Target Guided Ordinal Encoding** is a feature encoding method where **categorical values are assigned numerical labels based on their relationship with the target variable**.

It applies only to **supervised learning problems** (**classification or regression**), because it needs a target variable to rank the categories.

---

###  **How It Works:**

1. **Group by the categorical feature**
2. **Calculate the mean (or another statistic) of the target** for each category
3. **Sort the categories by the target statistic**
4. **Assign ordinal values based on the sorted order**
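
The four steps above can be sketched in pandas as follows. The data frame, the column names `Contract` and `Churn`, and the values are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month",
                 "One year", "Two year", "Month-to-month", "Two year"],
    "Churn":    [1, 0, 0, 1, 1, 0, 0, 0],
})

# Steps 1-2: group by the category and take the mean of the target.
target_means = df.groupby("Contract")["Churn"].mean()

# Step 3: sort the categories by that statistic (lowest first).
ordered = target_means.sort_values().index

# Step 4: assign 0, 1, 2, ... following the sorted order.
mapping = {category: rank for rank, category in enumerate(ordered)}
df["Contract_enc"] = df["Contract"].map(mapping)

print(target_means)
print(mapping)
```

Sorting from lowest to highest is a convention; reversing the order would work equally well as long as it is applied consistently.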

---

###  **Example: Predicting Customer Churn**

Let’s say you have a feature:  
`Contract Type = ['Month-to-month', 'One year', 'Two year']`  
And a target:  
`Churn = 1 (Yes), 0 (No)`

Let’s calculate the **mean churn rate** for each contract type:

| Contract Type     | Churn Rate |
|-------------------|------------|
| Month-to-month    | 0.43       |
| One year          | 0.11       |
| Two year          | 0.05       |

**Sorted order (lowest churn to highest):**
- `Two year` → 0
- `One year` → 1
- `Month-to-month` → 2

So after Target Guided Ordinal Encoding:

```text
Contract Type = ['Month-to-month', 'One year', 'Two year']
Encoded as    = [2, 1, 0]
```

---

###  **When to Use It:**

- When you have **categorical features with few categories**
- When the categories show **clear influence on the target**
- Useful in **logistic regression** or **linear models** that benefit from ordinal representations

---

###  **Caution:**

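- Can **leak target information** unless the category ranking is learned from the **training split only** (or separately inside each **cross-validation** fold)
- Never compute the target statistics from test data; apply the ranking learned on the training data to the test data instead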
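
One way to respect this is to learn the mapping on the training rows and then reuse it unchanged on the held-out rows. The sketch below assumes hypothetical toy data and uses a simple positional split as a stand-in for a real train/test split.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month",
                 "One year", "Two year", "Month-to-month", "Two year"],
    "Churn":    [1, 0, 0, 1, 1, 0, 0, 0],
})

# Stand-in for a real split: first six rows for training, last two held out.
train = df.iloc[:6].copy()
test = df.iloc[6:].copy()

# Learn the category ranking from the training rows only.
ranking = train.groupby("Contract")["Churn"].mean().sort_values().index
mapping = {category: rank for rank, category in enumerate(ranking)}

# Apply the same mapping to both splits. A category never seen in training
# would become NaN here and needs an explicit policy (e.g. a fallback code).
train["Contract_enc"] = train["Contract"].map(mapping)
test["Contract_enc"] = test["Contract"].map(mapping)

print(test)
```

The `Churn` values of the held-out rows are never read when building `mapping`, which is the property that prevents leakage.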

---

###  **Summary:**

**Target Guided Ordinal Encoding** transforms categories based on their **impact on the target**, helping your model learn patterns more effectively — especially when a natural ordering **doesn’t exist but can be inferred** from the target variable.

### Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?
Ans: \

**Covariance** is a statistical measure that tells us **how two variables change together**.

- If both variables **increase or decrease together**, the covariance is **positive**.
- If one increases while the other decreases, the covariance is **negative**.
- If there is **no consistent pattern**, the covariance is **close to zero**.

---

###  **Why Is Covariance Important in Statistics and Data Science?**

- It helps identify **relationships between variables**.
- It’s the **foundation for correlation** (a normalized form of covariance).
- It’s widely used in:
  - **Portfolio optimization** in finance
  - **Principal Component Analysis (PCA)** in machine learning
  - **Feature selection and multivariate analysis**
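
To make the link with correlation concrete, here is a small sketch with hypothetical numbers. It divides the sample covariance by the two sample standard deviations and shows that rescaling one variable changes the covariance but not the correlation.

```python
import numpy as np

# Hypothetical data in arbitrary units.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([50.0, 65.0, 60.0, 80.0])

# Sample covariance (divides by n - 1); [0, 1] picks Cov(x, y) from the matrix.
cov_xy = np.cov(x, y)[0, 1]
# Correlation: covariance divided by both sample standard deviations.
corr_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Rescaling x by 100 multiplies the covariance by 100 ...
cov_scaled = np.cov(x * 100, y)[0, 1]
# ... but leaves the correlation unchanged.
corr_scaled = np.corrcoef(x * 100, y)[0, 1]

print(cov_xy, corr_xy)
print(cov_scaled, corr_scaled)
```

This is why covariance indicates the direction of a relationship, while correlation is the quantity to compare across variable pairs.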

---

###  **How Is Covariance Calculated?**

Let’s say we have two variables:
- $X = [x_1, x_2, \dots, x_n]$
- $Y = [y_1, y_2, \dots, y_n]$

#### **Covariance Formula:**
$$
\text{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})
$$
Where:
- $\bar{x}$ = Mean of X  
- $\bar{y}$ = Mean of Y  
- $n$ = Number of data points

---

###  **Interpretation:**

- **Positive Covariance** → Variables move in the same direction
- **Negative Covariance** → Variables move in opposite directions
- **Zero or near-zero** → No linear relationship

---

###  **Example:**

Let’s say:

| X (Study Hours) | Y (Scores) |
|------------------|------------|
| 2                | 50         |
| 4                | 60         |
| 6                | 70         |
| 8                | 80         |

1. $\bar{x} = 5$, $\bar{y} = 65$
2. Use the formula to compute covariance:
$$
\text{Cov}(X, Y) = \frac{1}{3}[(2 - 5)(50 - 65) + (4 - 5)(60 - 65) + (6 - 5)(70 - 65) + (8 - 5)(80 - 65)]
$$
$$
= \frac{1}{3}[(-3)(-15) + (-1)(-5) + (1)(5) + (3)(15)] = \frac{1}{3}[45 + 5 + 5 + 45] = \frac{100}{3} \approx 33.33
$$
Positive covariance → **more study = higher score**
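
The hand calculation can be checked with a short NumPy sketch that writes the formula out term by term and compares it with `np.cov`, which also divides by $n - 1$ by default.

```python
import numpy as np

x = np.array([2, 4, 6, 8])      # study hours
y = np.array([50, 60, 70, 80])  # scores

# Sample covariance written out from the formula, dividing by n - 1.
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the 2x2 covariance matrix; [0, 1] is Cov(X, Y).
cov_numpy = np.cov(x, y)[0, 1]

print(f"{cov_manual:.2f} {cov_numpy:.2f}")  # 33.33 33.33
```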

### Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.
Ans: \


```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

print(data)

# Create a copy to encode
encoded_data = data.copy()

# Apply Label Encoding to each categorical column
label_encoders = {}

for column in encoded_data.columns:
    le = LabelEncoder()
    encoded_data[column] = le.fit_transform(encoded_data[column])
    label_encoders[column] = le  # Save encoder to reverse if needed

print(encoded_data)
```

**Output:**

```
   Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3  green  medium     wood
4    red   small    metal
   Color  Size  Material
0      2     2         2
1      1     1         0
2      0     0         1
3      1     1         2
4      2     2         0
```
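
To explain the encoded output, it helps to print which integer each category received. The sketch below rebuilds the same sample data and reads the mapping from each encoder's `classes_` attribute, which stores the categories in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'red'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'wood', 'metal']
})

# Fit one encoder per column and record which integer each category received.
mappings = {}
for column in data.columns:
    le = LabelEncoder().fit(data[column])
    mappings[column] = {str(label): code for code, label in enumerate(le.classes_)}

for column, mapping in mappings.items():
    print(column, mapping)
```

Each column is encoded independently and alphabetically: `blue` = 0, `green` = 1, `red` = 2; `large` = 0, `medium` = 1, `small` = 2; `metal` = 0, `plastic` = 1, `wood` = 2. Note that for `Size` the alphabetical codes do not follow the real small < medium < large order, so they should be read as identifiers only.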


### Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.
Ans: \


```python
import pandas as pd

# Create the DataFrame
data = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50],
    'Income': [40, 50, 65, 55, 70],
    'Education': [2, 3, 4, 2, 3]
})

# Calculate covariance matrix
cov_matrix = data.cov()

print(cov_matrix)
```

**Output:**

```
             Age  Income  Education
Age        107.5   122.5        5.5
Income     122.5   142.5        6.5
Education    5.5     6.5        0.7
```


### 🧠 **Interpretation:**

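1. **Diagonal Elements** → Variance of each variable:
   - Age variance = 107.5
   - Income variance = 142.5
   - Education variance = 0.7

2. **Off-Diagonal Elements** → Covariance between variables:
   - **Age vs Income** = 122.5 → **Positive relationship**: older individuals in this sample tend to have higher income
   - **Age vs Education** = 5.5 → Positive relationship
   - **Income vs Education** = 6.5 → Positive relationship

Covariance values depend on the units of each variable, so the smaller numbers for Education reflect its narrow range (2 to 4) and do not by themselves indicate a weaker relationship.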
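
Because raw covariances are not comparable across pairs, one option is to look at the correlation matrix for the same data as well. This is a small addition to the exercise rather than part of the original question.

```python
import pandas as pd

data = pd.DataFrame({
    'Age': [25, 30, 45, 35, 50],
    'Income': [40, 50, 65, 55, 70],
    'Education': [2, 3, 4, 2, 3]
})

# Pearson correlation = covariance divided by the product of the two standard
# deviations, so every entry lies in [-1, 1] regardless of units.
corr_matrix = data.corr()
print(corr_matrix.round(2))
```

On this sample, Age and Income are correlated at roughly 0.99, while Education correlates with Age and Income at roughly 0.63 and 0.65.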



### Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?
Ans: \

###  **1. Gender (Male/Female)**

- **Type**: **Nominal** (no meaningful order)
- **Recommended Encoding**: **One-Hot Encoding**

 **Why?**  
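- Gender has only two categories in this dataset.
- One-Hot Encoding represents each category as its own indicator column instead of placing the categories on a single ranked scale.
- With only two categories, a single 0/1 column (One-Hot with one column dropped, or **Label Encoding**) carries the same information and does not impose a misleading order, so either approach works; One-Hot simply makes the intent explicit.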

 **Example (One-Hot):**
```
Male → [1, 0]  
Female → [0, 1]
```

---

###  **2. Education Level (High School, Bachelor’s, Master’s, PhD)**

- **Type**: **Ordinal** (has a meaningful order)
- **Recommended Encoding**: **Ordinal Encoding**

 **Why?**  
- There’s a clear hierarchy: High School < Bachelor’s < Master’s < PhD.
- Ordinal Encoding preserves this order, which can help models understand progression.

 **Example (Ordinal Encoding):**
```
High School → 1  
Bachelor’s   → 2  
Master’s     → 3  
PhD          → 4
```

---

###  **3. Employment Status (Unemployed, Part-Time, Full-Time)**

- **Type**: **Nominal** (depends on context, but generally treated as unordered)
- **Recommended Encoding**: **One-Hot Encoding**

 **Why?**  
- No strict numerical ordering (although it *feels* like increasing employment, there's no universal scale).
- One-Hot prevents the model from assuming "Full-Time > Part-Time > Unemployed" unless that’s specifically intended.

 **Example (One-Hot):**
```
Unemployed → [1, 0, 0]  
Part-Time  → [0, 1, 0]  
Full-Time  → [0, 0, 1]
```

---

###  **Summary Table:**

| Variable           | Type     | Encoding Method    | Reason                             |
|--------------------|----------|--------------------|-------------------------------------|
| Gender             | Nominal  | One-Hot Encoding   | Avoids false order                  |
| Education Level    | Ordinal  | Ordinal Encoding   | Maintains educational hierarchy     |
| Employment Status  | Nominal  | One-Hot Encoding   | No strict ordering among categories |

### Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/ East/West). Calculate the covariance between each pair of variables and interpret the results.
Ans: \

> **Covariance is only defined for numerical variables.**  
So we can directly compute covariance only between "Temperature" and "Humidity", **not** between a numerical and a categorical variable (such as "Temperature" vs "Weather Condition").

To include categorical variables, you'd first need to **encode** them numerically, **but this can be misleading** unless the categories are ordinal and the encoding reflects their true relationship.

###  **Step 1: Sample Data**

Let’s assume this data:

| Temperature (°C) | Humidity (%) | Weather Condition | Wind Direction |
|------------------|--------------|-------------------|----------------|
| 30               | 70           | Sunny             | North          |
| 25               | 65           | Cloudy            | East           |
| 20               | 90           | Rainy             | South          |
| 28               | 80           | Cloudy            | West           |
| 22               | 85           | Rainy             | North          |

---

###  **Step 2: Covariance of Continuous Variables Only**

Let’s calculate covariance between **Temperature** and **Humidity** using Python:

```python
import pandas as pd

# Create sample data
data = pd.DataFrame({
    'Temperature': [30, 25, 20, 28, 22],
    'Humidity': [70, 65, 90, 80, 85]
})

# Calculate covariance matrix
cov_matrix = data.cov()
print(cov_matrix)
```

**Output:**

```
             Temperature  Humidity
Temperature        17.00    -28.75
Humidity          -28.75    107.50
```

---

###  **Interpretation:**

- **Cov(Temperature, Temperature) = 17.0** → variance of Temperature
- **Cov(Humidity, Humidity) = 107.5** → variance of Humidity
- **Cov(Temperature, Humidity) = -28.75**
  - **Negative covariance** means: As **temperature increases**, **humidity tends to decrease**.
  - The magnitude depends on the units of both variables, so it does not measure strength by itself. Dividing by the two standard deviations gives a correlation of about -0.67, a moderately strong negative linear relationship in this small sample.

---

###  Categorical Variables:

To compute covariance with **"Weather Condition"** or **"Wind Direction"**, you’d need to **encode them**, e.g., using label encoding:

```python
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Temperature': [30, 25, 20, 28, 22],
    'Humidity': [70, 65, 90, 80, 85],
    'Weather': ['Sunny', 'Cloudy', 'Rainy', 'Cloudy', 'Rainy'],
    'Wind': ['North', 'East', 'South', 'West', 'North']
})

le_weather = LabelEncoder()
le_wind = LabelEncoder()

df['Weather_enc'] = le_weather.fit_transform(df['Weather'])
df['Wind_enc'] = le_wind.fit_transform(df['Wind'])

print(df[['Temperature', 'Humidity', 'Weather_enc', 'Wind_enc']].cov())
```

But this numeric transformation **does not preserve true relationships**; for example, Sunny is not “greater than” Cloudy, and the integers only reflect alphabetical order. So this kind of covariance is **computable but semantically misleading**.

---

###  **Conclusion:**

- Covariance is only meaningful between **numerical variables**, here "Temperature" and "Humidity".
- For **categorical variables**, consider:
  - Using **correlation ratios** or **ANOVA** for relation with continuous variables.
  - Or **chi-square test** for relationships between two categorical variables.
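
As a sketch of the first alternative, the correlation ratio (eta squared) between "Humidity" and "Weather Condition" can be computed directly from group means on the sample data above. It is the share of the total variation in Humidity that lies between the weather groups; with only five rows the result is illustrative, not conclusive.

```python
import pandas as pd

df = pd.DataFrame({
    "Humidity": [70, 65, 90, 80, 85],
    "Weather": ["Sunny", "Cloudy", "Rainy", "Cloudy", "Rainy"],
})

overall_mean = df["Humidity"].mean()
groups = df.groupby("Weather")["Humidity"]

# Between-group sum of squares: group size times the squared distance of
# each group mean from the overall mean.
ss_between = (groups.size() * (groups.mean() - overall_mean) ** 2).sum()
# Total sum of squares around the overall mean.
ss_total = ((df["Humidity"] - overall_mean) ** 2).sum()

eta_squared = ss_between / ss_total
print(f"{eta_squared:.3f}")
```

A value near 0 means the weather categories explain little of the variation in humidity, and a value near 1 means they explain most of it. Unlike covariance on label-encoded values, this measure does not depend on how the categories are numbered.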