# Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting categorical data into a numerical format that machine learning algorithms can process. It is useful in data science because most machine learning algorithms require numerical input. A suitable encoding lets a model use categorical features without, for example, reading an order into categories that have none.


# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.


Nominal encoding is a method of encoding categorical variables that have no inherent order or ranking. An example is encoding the colors of cars (red, blue, green) as numerical values (e.g., red = 1, blue = 2, green = 3). The numbers are arbitrary labels; they do not mean that green is "greater than" red. This can be used in a car sales dataset where the color of the car might influence the sales price.
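
A minimal sketch of the color mapping described above, assuming pandas is available. The `cars` table, its prices, and the `color_code` column name are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical car sales data; the values are made up for illustration.
cars = pd.DataFrame({
    "color": ["red", "blue", "green", "red"],
    "price": [21000, 18500, 19900, 22500],
})

# Arbitrary integer labels for each color (red = 1, blue = 2, green = 3).
color_codes = {"red": 1, "blue": 2, "green": 3}

# Replace each color name with its integer label in a new column.
cars["color_code"] = cars["color"].map(color_codes)

print(cars)
```

The mapping is written out by hand here so the codes match the example; any other assignment of distinct integers would serve equally well, since the codes carry no order.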



# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.


Nominal encoding is preferred over one-hot encoding when the categorical variable has a large number of unique values, because one-hot encoding would create one new column per value. For example, in a dataset with 1000 unique city names, nominal encoding converts the cities into numerical values (1-1000) in a single column instead of creating 1000 new columns. The trade-off is that some models may read the integer codes as an ordering, so this approach tends to suit models such as decision trees that are less sensitive to it.
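
To make the size difference concrete, the sketch below compares the two encodings on 1000 synthetic city names. The names (`city_0` to `city_999`) are placeholders, and `pd.factorize` is used here as one possible way to assign the integer codes.

```python
import pandas as pd

# 1000 placeholder city names, one row per city.
cities = pd.Series([f"city_{i}" for i in range(1000)], name="city")

# Integer codes: factorize returns 0-based codes, so add 1 to get 1-1000.
codes, uniques = pd.factorize(cities)
city_codes = codes + 1

# One-hot encoding creates one column per unique city.
one_hot = pd.get_dummies(cities)

print("integer-coded columns:", 1)
print("one-hot columns:", one_hot.shape[1])
```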



# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.


For 5 unique values, one-hot encoding is often preferred. Each unique value is represented as a binary vector, so the algorithm does not assume any ordinal relationship between the categories, and the encoding produces only 5 binary columns, which is easy to manage.
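
As a small illustration, the sketch below one-hot encodes a hypothetical column with 5 unique values (`A` to `E`) using `pd.get_dummies`. The column name `category` and its values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical categorical column with exactly 5 unique values.
df = pd.DataFrame({"category": ["A", "B", "C", "D", "E", "A", "C"]})

# One binary column per unique value; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["category"], prefix="category", dtype=int)

print(one_hot)
```

Each row contains a single 1, in the column of its own category, so no category is numerically "larger" than another.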



# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

With nominal encoding, each unique category in the two categorical columns is assigned a unique numerical value, and each categorical column is simply replaced by its encoded values. The calculation is therefore: new columns created = 0. The dataset keeps 3 numerical columns + 2 encoded columns = 5 columns, and still has 1000 rows.
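
The column count can be checked on a synthetic dataset of the same shape. In this sketch the column names and values are invented, and scikit-learn's `OrdinalEncoder` stands in for the integer-code assignment; despite its name, it is used here only to give each category an arbitrary integer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n_rows = 1000

# Synthetic data: 2 categorical columns and 3 numerical columns.
df = pd.DataFrame({
    "cat_a": rng.choice(["x", "y", "z"], size=n_rows),
    "cat_b": rng.choice(["p", "q"], size=n_rows),
    "num_1": rng.normal(size=n_rows),
    "num_2": rng.normal(size=n_rows),
    "num_3": rng.normal(size=n_rows),
})

cat_cols = ["cat_a", "cat_b"]

# Replace each categorical column with its integer codes.
encoded = df.copy()
encoded[cat_cols] = OrdinalEncoder().fit_transform(df[cat_cols])

new_columns = encoded.shape[1] - df.shape[1]
print("before:", df.shape, "after:", encoded.shape, "new columns:", new_columns)
```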



# Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.


One-hot encoding would be suitable for transforming the categorical data (species, habitat, diet) because it ensures that the algorithm does not assume any ordinal relationship between the categories. This is important for attributes like species and habitat, where there is no inherent order.
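
A minimal sketch of this choice on a tiny, made-up animal table. The rows and column names are assumptions; `pd.get_dummies` with the `columns` argument encodes all three categorical columns in one call.

```python
import pandas as pd

# Hypothetical animal data for illustration only.
animals = pd.DataFrame({
    "species": ["lion", "eagle", "shark", "deer"],
    "habitat": ["savanna", "mountains", "ocean", "forest"],
    "diet": ["carnivore", "carnivore", "carnivore", "herbivore"],
})

# One-hot encode every categorical column; each unique value becomes a 0/1 column.
encoded = pd.get_dummies(animals, columns=["species", "habitat", "diet"], dtype=int)

# The number of encoded columns equals the total number of unique values.
expected_columns = sum(animals[col].nunique() for col in animals.columns)
print(encoded.shape[1], expected_columns)
```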



# Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

1. Identify categorical columns: 'Gender' and 'Contract Type'.
2. Apply one-hot encoding to both columns.
3. Concatenate the encoded columns with the numerical columns ('Age', 'Monthly Charges', 'Tenure').





In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [23, 45, 31, 35],
    'Contract Type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month'],
    'Monthly Charges': [70, 80, 90, 85],
    'Tenure': [12, 24, 36, 48]
}

df = pd.DataFrame(data)

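# One-hot encode the categorical columns as a dense array.
# `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use `sparse=False`.
encoder = OneHotEncoder(sparse_output=False)
encoded_columns = encoder.fit_transform(df[['Gender', 'Contract Type']])
encoded_df = pd.DataFrame(encoded_columns, columns=encoder.get_feature_names_out(['Gender', 'Contract Type']))

# Concatenate the numerical columns with the encoded columns.
final_df = pd.concat([df[['Age', 'Monthly Charges', 'Tenure']], encoded_df], axis=1)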

print(final_df)

   Age  Monthly Charges  Tenure  Gender_Female  Gender_Male  \
0   23               70      12            0.0          1.0   
1   45               80      24            1.0          0.0   
2   31               90      36            1.0          0.0   
3   35               85      48            0.0          1.0   

   Contract Type_Month-to-Month  Contract Type_One Year  \
0                           1.0                     0.0   
1                           0.0                     1.0   
2                           0.0                     0.0   
3                           1.0                     0.0   

   Contract Type_Two Year  
0                     0.0  
1                     0.0  
2                     1.0  
3                     0.0  


