## Q1. What is data encoding? How is it useful in data science?

Answer:

Data encoding is the process of converting categorical data (non-numeric) into a numerical format so that it can be used by machine learning algorithms.

Why is it useful?
Most machine learning models require numerical input.

Encoding ensures that categorical data can be meaningfully represented in models.

A suitable encoding also lets the model use patterns tied to category membership, which it cannot do with raw text labels.

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Answer:

Nominal encoding (also called label encoding) assigns a unique integer to each category in a categorical feature. It is used when the categories have no ordinal relationship, so the integers serve only as arbitrary identifiers and not as ranks.

Example:
For a "Color" column: ["Red", "Blue", "Green"]

Nominal encoding might produce:

Red → 0

Blue → 1

Green → 2

Real-World Scenario:
For a car dataset, we might encode the "Brand" column (e.g., Honda, Ford, BMW) using nominal encoding.
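The "Color" mapping above can be reproduced with a minimal sketch, assuming pandas is available. The `colors` series is made-up illustration data, and `pd.factorize` is only one of several ways to obtain integer codes. It numbers categories in order of first appearance, which happens to match the mapping shown; scikit-learn's `LabelEncoder` sorts the labels first and would give Blue → 0, Green → 1, Red → 2.

```python
import pandas as pd

# Hypothetical "Color" column, extended with repeats for illustration.
colors = pd.Series(["Red", "Blue", "Green", "Blue", "Red"], name="Color")

# factorize returns integer codes plus the distinct categories,
# numbered in order of first appearance.
codes, categories = pd.factorize(colors)

# Build a readable category -> code mapping.
mapping = {category: code for code, category in enumerate(categories)}

print(mapping)         # {'Red': 0, 'Blue': 1, 'Green': 2}
print(codes.tolist())  # [0, 1, 2, 1, 0]
```

The same call would apply to the "Brand" column of the car dataset.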



## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Answer:

Nominal encoding is preferred:

When the number of unique categories is large.

When memory or storage is a concern.

When the model can interpret label values appropriately (e.g., tree-based models like Random Forest or XGBoost).

Example:
If you have a column "City" with 100 unique values, one-hot encoding would create 100 new columns, but nominal encoding would only use 1 column with values from 0 to 99.
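The column counts in this example can be checked with a rough sketch, assuming pandas. The city names (`City_000` to `City_099`) are placeholders invented for the illustration.

```python
import pandas as pd

# Hypothetical data: 100 distinct city names, each appearing 3 times.
cities = [f"City_{i:03d}" for i in range(100)]
df = pd.DataFrame({"City": cities * 3})

# One-hot encoding: one indicator column per distinct city.
one_hot = pd.get_dummies(df, columns=["City"])

# Nominal (label) encoding: a single column of integer codes.
label_codes, _ = pd.factorize(df["City"])

print(one_hot.shape)                         # (300, 100)
print(label_codes.min(), label_codes.max())  # 0 99
```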

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

Answer:

Choice: With only 5 unique values, one-hot encoding adds just 5 columns, so if the algorithm is linear or distance-based (e.g., Logistic Regression, KNN), I would use One-Hot Encoding to avoid implying any order.

If the model is tree-based or the dataset is large, I would use Nominal (Label) Encoding to reduce dimensionality.

So:

One-Hot Encoding for algorithms sensitive to distance and feature scale.

Nominal Encoding for algorithms like Decision Trees.
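To illustrate why integer codes can mislead a distance-based model, the sketch below compares Euclidean distances under both encodings for a 5-value feature. The category names `A` to `E` are placeholders, and only the standard library is used.

```python
import math

categories = ["A", "B", "C", "D", "E"]  # hypothetical 5-value feature

# Nominal (label) encoding: each category becomes a single number 0..4.
label = {c: [float(i)] for i, c in enumerate(categories)}

# One-hot encoding: each category becomes a 5-element indicator vector.
one_hot = {
    c: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, c in enumerate(categories)
}

# With integer codes, "A" looks four times farther from "E" than from "B".
label_ab = math.dist(label["A"], label["B"])
label_ae = math.dist(label["A"], label["E"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
hot_ab = math.dist(one_hot["A"], one_hot["B"])
hot_ae = math.dist(one_hot["A"], one_hot["E"])

print(label_ab, label_ae)  # 1.0 4.0
print(hot_ab == hot_ae)    # True
```

A tree-based model splits on thresholds and does not rely on these distances, which is why integer codes are usually acceptable there.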

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

Answer:

Nominal encoding does not add columns; it replaces each categorical value with an integer code in place.

So:

2 categorical columns → 2 encoded columns (0 new columns)

3 numerical columns remain unchanged (0 new columns)

New Columns Created: 0 + 0 = 0

Total Columns After Encoding: 2 + 3 = 5
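The count can be verified with a small sketch, assuming pandas. The column names and values are invented placeholders; only the shape (1000 rows, 2 categorical and 3 numerical columns) follows the question.

```python
import pandas as pd

n_rows = 1000

# Hypothetical dataset: 2 categorical + 3 numerical columns.
df = pd.DataFrame({
    "cat_a": ["x", "y"] * (n_rows // 2),
    "cat_b": ["p", "q", "r", "s"] * (n_rows // 4),
    "num_1": range(n_rows),
    "num_2": [0.5] * n_rows,
    "num_3": [1] * n_rows,
})

# Replace each categorical column with its integer codes, in place.
encoded = df.copy()
for col in ["cat_a", "cat_b"]:
    encoded[col] = pd.factorize(encoded[col])[0]

new_columns = encoded.shape[1] - df.shape[1]
print(df.shape, encoded.shape, new_columns)  # (1000, 5) (1000, 5) 0
```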

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

Answer:

For this dataset:

If the number of categories in each column is small → use One-Hot Encoding.

If the categories are many and the model is tree-based → use Nominal Encoding.

Justification:
"Species", "Habitat", and "Diet" are nominal categories (no inherent order).

If using a model like Logistic Regression or SVM → One-Hot Encoding.

If using Random Forest or XGBoost → Nominal Encoding.
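The choice above can be sketched as a small helper, assuming pandas. The `animals` rows and the `encode_nominal` function are hypothetical and exist only to illustrate the two branches; they are not part of any library.

```python
import pandas as pd

# Hypothetical animal records for illustration.
animals = pd.DataFrame({
    "species": ["Lion", "Parrot", "Frog", "Lion"],
    "habitat": ["Savanna", "Rainforest", "Wetland", "Savanna"],
    "diet": ["Carnivore", "Herbivore", "Carnivore", "Carnivore"],
})


def encode_nominal(df, columns, tree_based):
    """Hypothetical helper: integer codes for tree models, one-hot otherwise."""
    if tree_based:
        out = df.copy()
        for col in columns:
            # Integer codes, numbered in order of first appearance.
            out[col] = pd.factorize(out[col])[0]
        return out
    # One indicator column per distinct category.
    return pd.get_dummies(df, columns=columns, dtype=int)


cols = ["species", "habitat", "diet"]
for_linear = encode_nominal(animals, cols, tree_based=False)
for_trees = encode_nominal(animals, cols, tree_based=True)

print(for_linear.shape)  # (4, 8): 3 species + 3 habitats + 2 diets
print(for_trees.shape)   # (4, 3): one integer column per feature
```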

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

Answer:

Features:
Categorical: Gender, Contract Type

Numerical: Age, Monthly Charges, Tenure

Step-by-step Encoding:
Identify Categorical Columns:

Gender → ["Male", "Female"]

Contract Type → ["Month-to-month", "One year", "Two year"]

Choose Encoding:

Use One-Hot Encoding for both: each feature is nominal with few categories, and churn prediction is often done with a linear model such as logistic regression, which would treat integer codes as ordered values.

Apply Encoding:

```python
import pandas as pd

# Small sample of the churn dataset with the five features.
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'contract_type': ['Month-to-month', 'One year', 'Two year'],
    'age': [25, 45, 35],
    'monthly_charges': [70.5, 80.0, 60.2],
    'tenure': [12, 24, 6]
})

# One-hot encode the two categorical columns; numerical columns pass through unchanged.
encoded_df = pd.get_dummies(df, columns=['gender', 'contract_type'])
print(encoded_df)
```

```
   age  monthly_charges  tenure  gender_Female  gender_Male  \
0   25             70.5      12          False         True
1   45             80.0      24           True        False
2   35             60.2       6           True        False

   contract_type_Month-to-month  contract_type_One year  \
0                          True                   False
1                         False                    True
2                         False                   False

   contract_type_Two year
0                   False
1                   False
2                    True
```
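If the encoded features feed a logistic regression, one common variation is to drop one indicator per feature, since a full set of indicators sums to 1 and is redundant alongside an intercept. The sketch below is an optional variant and not part of the answer above. It assumes pandas and reuses the same sample rows; `drop_first` and `dtype` are existing `pd.get_dummies` parameters.

```python
import pandas as pd

# Same sample rows as in the example above.
df = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'contract_type': ['Month-to-month', 'One year', 'Two year'],
    'age': [25, 45, 35],
    'monthly_charges': [70.5, 80.0, 60.2],
    'tenure': [12, 24, 6]
})

# drop_first=True removes the first (alphabetically sorted) category of each
# feature; dtype=int emits 0/1 integers instead of booleans.
reduced_df = pd.get_dummies(
    df, columns=['gender', 'contract_type'], drop_first=True, dtype=int
)

print(reduced_df.columns.tolist())
# ['age', 'monthly_charges', 'tenure', 'gender_Male',
#  'contract_type_One year', 'contract_type_Two year']
```

Here "Female" and "Month-to-month" become the implicit baseline categories, represented by zeros in the remaining indicator columns.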
