
---

### **Q1. What is data encoding? How is it useful in data science?**

**Data encoding** is the process of converting categorical (non-numerical) data into numerical format so that machine learning algorithms can process it. Most algorithms work with numbers, not strings, so encoding is a crucial step in the data preprocessing phase.

**Why it’s useful:**
- Converts real-world labels (e.g., gender = "Male", "Female") into machine-readable format.
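- Lets the model use categorical information it would otherwise have to discard; the choice of encoding can also affect model performance.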
- Enables the use of mathematical operations on previously non-numeric data.

---

### **Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.**

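**Nominal encoding** refers to assigning an arbitrary numerical value to each category of an **unordered categorical variable** (a nominal variable). This is often done using **Label Encoding**. Because the categories have no natural order, the integers act only as identifiers; their magnitudes carry no meaning.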

#### **Example:**
For a dataset with a feature `Fruit = [Apple, Banana, Orange]`, nominal encoding might look like:
- Apple → 0
- Banana → 1
- Orange → 2

#### **Real-world use case:**
In an e-commerce dataset, you might encode payment methods like:
- Credit Card → 0
- PayPal → 1
- Cash on Delivery → 2
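
The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the `payments` list is made-up sample data, and codes are assigned in order of first appearance (an assumption; a library encoder may assign them differently, for example alphabetically).

```python
# Hypothetical payment-method column from an e-commerce dataset.
payments = ["Credit Card", "PayPal", "Cash on Delivery", "PayPal", "Credit Card"]

# Assign each distinct category an integer code in order of first appearance.
# The codes are arbitrary identifiers; their magnitude carries no meaning.
code_of = {}
for method in payments:
    if method not in code_of:
        code_of[method] = len(code_of)

encoded = [code_of[method] for method in payments]
print(code_of)   # {'Credit Card': 0, 'PayPal': 1, 'Cash on Delivery': 2}
print(encoded)   # [0, 1, 2, 1, 0]
```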

---

### **Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.**

Nominal (Label) encoding is preferred when:
- The categorical feature has **many unique values**.
- You want to avoid **dimensionality explosion** caused by one-hot encoding.
- The categorical values are **unordered**, and the model you're using is **not misled by arbitrary integer codes** (e.g., tree-based models, which can isolate individual codes through successive splits).

#### **Practical example:**
In a dataset with thousands of **product IDs**, one-hot encoding would create thousands of columns. Instead, nominal encoding keeps it compact:
- ProductID_123 → 0
- ProductID_456 → 1
- ProductID_789 → 2
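
To make the size difference concrete, the sketch below counts the columns each approach would produce. The 5,000 product IDs are generated purely for illustration; the real cardinality depends on the dataset.

```python
# Hypothetical high-cardinality column: 5,000 distinct product IDs.
product_ids = [f"ProductID_{i}" for i in range(5000)]

n_categories = len(set(product_ids))

# Label encoding keeps a single integer column regardless of cardinality.
label_encoded_columns = 1
# One-hot encoding needs one binary column per distinct value.
one_hot_columns = n_categories

print(label_encoded_columns, one_hot_columns)  # 1 5000
```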

---

### **Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.**

It depends on the nature of the data:

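- If the values are **unordered**, use **One-Hot Encoding**.
  - It avoids implying any ordinal relationship.
  - It creates 5 new binary columns (one per category), which is manageable at this cardinality.

- If the values are **ordered** (e.g., a rating scale), use **Label (Ordinal) Encoding** with an explicit mapping, so that the integers reflect the real order rather than an arbitrary one.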

#### **Best practice:**
Use **One-Hot Encoding** for small, nominal (unordered) categories to avoid misleading the model.
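
Both branches of this decision can be sketched in plain Python. The `one_hot` helper, the colour labels, and the size scale are all hypothetical names and data chosen for illustration; in practice a library encoder would usually do this work.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Unordered case: five hypothetical colour labels -> five binary columns.
colors = ["red", "green", "blue", "yellow", "purple", "green"]
categories, rows = one_hot(colors)
print(categories)  # ['blue', 'green', 'purple', 'red', 'yellow']
print(rows[0])     # [0, 0, 0, 1, 0]  (the row for "red")

# Ordered case: map each value to its rank explicitly so the codes follow the order.
size_order = ["XS", "S", "M", "L", "XL"]
rank_of = {size: rank for rank, size in enumerate(size_order)}
sizes_encoded = [rank_of[size] for size in ["M", "XS", "XL"]]
print(sizes_encoded)  # [2, 0, 4]
```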

---

### **Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.**

With **nominal (label) encoding**, each categorical column is converted into **1 numerical column**.

So:
- Original categorical columns: 2
- After nominal encoding: Still 2 columns (just numerical now)

✅ **Total columns after encoding: 3 (numerical) + 2 (encoded categorical) = 5**

No new columns are created; the values in the two categorical columns are replaced in place, and the row count (1000) is unaffected.
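
The calculation can be checked on a small stand-in table. The column names and values below are invented for illustration, and only 3 rows are shown instead of 1000, since the row count does not affect the column count.

```python
# Hypothetical 5-column table: 2 categorical columns, 3 numerical columns.
table = {
    "city": ["Paris", "Tokyo", "Paris"],   # categorical
    "plan": ["basic", "pro", "basic"],     # categorical
    "age": [31, 45, 27],
    "income": [52000, 61000, 48000],
    "visits": [3, 9, 1],
}
categorical = ["city", "plan"]

columns_before = len(table)

for col in categorical:
    # Map each distinct value to an integer code in order of first appearance.
    codes = {value: i for i, value in enumerate(dict.fromkeys(table[col]))}
    table[col] = [codes[value] for value in table[col]]

columns_after = len(table)
new_columns = columns_after - columns_before
print(columns_before, columns_after, new_columns)  # 5 5 0
```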

---

### **Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.**

Most likely, **One-Hot Encoding** would be appropriate here because:
- Features like `species`, `habitat`, and `diet` are **nominal** (unordered).
- One-hot encoding avoids implying any numeric relationship (e.g., coding Dog = 1 and Cat = 2 would suggest that Cat is somehow "greater than" Dog).

#### Example:
- Habitat = [Forest, Desert, Ocean]  
A row with `Habitat = Forest` is one-hot encoded as:
```
Habitat_Forest  Habitat_Desert  Habitat_Ocean
       1               0              0
```

This makes the data machine-friendly without adding false assumptions about the order.
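
As a rough sketch of how all three features could be encoded together, the helper below expands each categorical column into one indicator per observed category. The `one_hot_records` function and the animal records are hypothetical, written only to show the mechanics.

```python
# Hypothetical animal records; the values are illustrative only.
animals = [
    {"species": "Wolf", "habitat": "Forest", "diet": "Carnivore"},
    {"species": "Camel", "habitat": "Desert", "diet": "Herbivore"},
    {"species": "Shark", "habitat": "Ocean", "diet": "Carnivore"},
]

def one_hot_records(records, columns):
    """Replace each named column with one 0/1 indicator per observed category."""
    # Collect the categories of each column in order of first appearance.
    categories = {
        col: list(dict.fromkeys(record[col] for record in records))
        for col in columns
    }
    encoded = []
    for record in records:
        # Keep any columns that are not being encoded.
        row = {key: value for key, value in record.items() if key not in columns}
        for col in columns:
            for category in categories[col]:
                row[f"{col}_{category}"] = 1 if record[col] == category else 0
        encoded.append(row)
    return encoded

encoded = one_hot_records(animals, ["species", "habitat", "diet"])
print(encoded[0])
```

With this sample, 3 species + 3 habitats + 2 diets give 8 indicator columns per row.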

---

### **Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features:**

- `gender` (categorical)
- `age` (numerical)
- `contract type` (categorical)
- `monthly charges` (numerical)
- `tenure` (numerical)

#### **Step-by-step Encoding Plan:**

1. **Identify Categorical Columns:**
   - `gender`
   - `contract type`

2. **Encoding Techniques:**
   - `gender`: Binary → use **Label Encoding**
     - Male → 0, Female → 1
   - `contract type`: Nominal with multiple values (e.g., month-to-month, one year, two year) → use **One-Hot Encoding**

3. **Final Structure After Encoding:**
   - `gender`: 1 column
   - `contract type`: 3 one-hot columns (if 3 unique types)
   - `age`, `monthly charges`, `tenure`: 3 columns (unchanged)

✅ **Total columns after encoding:**  
1 (`gender`) + 3 (`contract type`) + 3 numerical = **7 columns**

This setup ensures the categorical data is numerically represented **without introducing bias from false ordering**.
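
The plan above can be sketched as follows. The customer rows are invented sample data, and `GENDER_CODES`, `CONTRACT_TYPES`, and `encode_customer` are hypothetical names; the three contract types are the ones assumed in step 2.

```python
# Hypothetical customer rows using the five features listed above.
customers = [
    {"gender": "Male", "age": 34, "contract type": "month-to-month",
     "monthly charges": 70.5, "tenure": 5},
    {"gender": "Female", "age": 51, "contract type": "two year",
     "monthly charges": 42.0, "tenure": 48},
    {"gender": "Female", "age": 27, "contract type": "one year",
     "monthly charges": 55.25, "tenure": 12},
]

GENDER_CODES = {"Male": 0, "Female": 1}                      # label encoding
CONTRACT_TYPES = ["month-to-month", "one year", "two year"]  # one-hot categories

def encode_customer(row):
    """Label-encode gender, one-hot encode contract type, keep numeric columns."""
    encoded = {
        "gender": GENDER_CODES[row["gender"]],
        "age": row["age"],
        "monthly charges": row["monthly charges"],
        "tenure": row["tenure"],
    }
    for contract in CONTRACT_TYPES:
        encoded[f"contract type_{contract}"] = 1 if row["contract type"] == contract else 0
    return encoded

encoded_customers = [encode_customer(row) for row in customers]
print(len(encoded_customers[0]))  # 7 columns: 1 + 3 + 3
```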

---
