## Q1. What is data encoding? How is it useful in data science?

## ANS:-

Data encoding is the process of converting data from one form to another. In machine learning, data encoding is used to convert categorical data into numerical data so that it can be used in machine learning models. There are two popular techniques for encoding categorical data: Ordinal Encoding and One-Hot Encoding.

In data science, data encoding plays an important role in data preprocessing, which is a crucial step in preparing data for analysis.

Machine learning algorithms typically require numerical data, but many datasets contain categorical or textual data. Encoding techniques can be used to transform categorical or textual data into numerical data, making it suitable for machine learning.
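As a minimal sketch of the two techniques named above, the snippet below encodes one unordered column with one-hot encoding and one ordered column with ordinal encoding. The `city` and `size` columns and their values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: one unordered and one ordered categorical column.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],
    "size": ["small", "large", "medium"],
})

# One-hot encoding: one 0/1 column per distinct city.
city_onehot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Ordinal encoding: map each size to its rank in an explicit order.
size_order = {"small": 0, "medium": 1, "large": 2}
size_ordinal = df["size"].map(size_order)

encoded = pd.concat([city_onehot, size_ordinal], axis=1)
print(encoded)
```

The result contains only numbers, so it can be passed to an estimator that expects a numerical feature matrix.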

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

## ANS:- 

Nominal encoding transforms a nominal categorical feature, one whose categories have no inherent order, into numerical data. The most common nominal encoding technique is one-hot encoding, and the two terms are often used interchangeably. In one-hot encoding, one binary feature is created for each unique category value; it is set to 1 for rows that belong to that category and 0 otherwise.

For example, suppose we have a dataset of customer purchases, and one of the categorical features is the payment method used for the purchase, with four possible values: cash, credit card, debit card, and UPI. To use this data in a machine learning algorithm, we need to encode this feature numerically. We can use nominal encoding to create four new binary features, one for each payment method, as follows:

In [1]:
import pandas as pd
df = pd.DataFrame({'payment_method':['Cash','Credit Card','Debit Card','UPI']})
df

Unnamed: 0,payment_method
0,Cash
1,Credit Card
2,Debit Card
3,UPI


In [3]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['payment_method']])
encoded_df = pd.DataFrame(encoded.toarray(),columns=encoder.get_feature_names_out())
encoded_df

Unnamed: 0,payment_method_Cash,payment_method_Credit Card,payment_method_Debit Card,payment_method_UPI
0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0


In [5]:
pd.concat([df,encoded_df],axis=1)

Unnamed: 0,payment_method,payment_method_Cash,payment_method_Credit Card,payment_method_Debit Card,payment_method_UPI
0,Cash,1.0,0.0,0.0,0.0
1,Credit Card,0.0,1.0,0.0,0.0
2,Debit Card,0.0,0.0,1.0,0.0
3,UPI,0.0,0.0,0.0,1.0


## Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

## ANS:-

One-hot encoding is itself a nominal encoding technique, and the two terms are often used interchangeably. Each category value gets its own binary feature, and it is the most commonly used nominal encoding technique in data science.

The practical alternative is label encoding (also called ordinal encoding when the order is specified explicitly), where each unique category value is assigned a numerical label. Strictly speaking this is not a nominal technique: it is useful when the categorical values have an inherent order or ranking, such as rating scales or levels of education.

For example, in a dataset of job applicants, we might have a feature for the level of education, with values such as high school, bachelor's degree, and master's degree. We could use label encoding to assign numerical labels to each of these values, with high school as 1, bachelor's degree as 2, and master's degree as 3. This preserves the inherent order of the values while still transforming them into numerical data for use in machine learning algorithms. By contrast, one-hot encoding a feature such as color, which has no order, creates one binary column per color.

![1.png](attachment:b404e482-5cf7-4ddd-9fea-83f8e65c8a14.png)
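The education example can be sketched with scikit-learn's `OrdinalEncoder`. The `applicants` frame and the `education_rank` column name are hypothetical. `OrdinalEncoder` numbers categories from 0, so 1 is added to match the 1, 2, 3 labels used in the text.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical applicant data for illustration.
applicants = pd.DataFrame({
    "education": ["bachelor's degree", "high school",
                  "master's degree", "high school"]
})

# State the order explicitly; otherwise categories are sorted alphabetically.
levels = ["high school", "bachelor's degree", "master's degree"]
encoder = OrdinalEncoder(categories=[levels])

# fit_transform returns a 2D float array with codes 0, 1, 2; shift to 1, 2, 3.
codes = encoder.fit_transform(applicants[["education"]]).ravel()
applicants["education_rank"] = codes + 1
print(applicants)
```

Passing `categories` explicitly is the important design choice here: without it the numeric labels would follow alphabetical order rather than the real ranking.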

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

## ANS:-

If we have a dataset containing categorical data with 5 unique values, we could use nominal encoding techniques such as one-hot encoding to transform this data into a format suitable for machine learning algorithms. In one-hot encoding, we would create 5 new binary features, one for each unique category value, and assign a value of 1 to the corresponding feature for each data point.

We choose one-hot encoding here because it represents the categories numerically without creating false relationships between them, and with only 5 unique values it adds just 5 columns. Label encoding, by contrast, assigns numerical labels that imply an order or ranking, which is misleading when the categories have none.

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

## ANS:-

If we use nominal encoding to transform the two categorical columns in the dataset, we would create new binary features for each unique category value in each column. The number of new binary features created for each column would depend on the number of unique category values in each column.

Let's assume that the first categorical column has 4 unique category values, and the second categorical column has 6 unique category values. To perform one-hot encoding on these columns, we would create 4 new binary features for the first column (one for each unique category value), and 6 new binary features for the second column (again, one for each unique category value). Each row in the original dataset would then be represented by the original three numerical columns, as well as the 4 binary features for the first categorical column and the 6 binary features for the second categorical column.

Therefore, the number of new columns created through one-hot encoding would be 4 + 6 = 10. The two original categorical columns are replaced by these 10 binary columns, so the transformed dataset would have 10 + 3 = 13 columns in total, still with 1000 rows.
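The count can be checked with a small sketch. The column names and category values below are hypothetical, and the 4 and 6 unique values are the same assumption made in the answer above.

```python
import pandas as pd

n_rows = 1000
cat_a_values = ["a1", "a2", "a3", "a4"]              # 4 unique values (assumed)
cat_b_values = ["b1", "b2", "b3", "b4", "b5", "b6"]  # 6 unique values (assumed)

# Hypothetical dataset: 2 categorical and 3 numerical columns, 1000 rows.
df = pd.DataFrame({
    "cat_a": [cat_a_values[i % 4] for i in range(n_rows)],
    "cat_b": [cat_b_values[i % 6] for i in range(n_rows)],
    "num_1": range(n_rows),
    "num_2": range(n_rows),
    "num_3": range(n_rows),
})

# New binary columns = sum of unique values over the categorical columns.
new_columns = df["cat_a"].nunique() + df["cat_b"].nunique()

# get_dummies replaces the two categorical columns with their binary columns.
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"])

print(new_columns)    # 4 + 6
print(encoded.shape)  # rows, 3 numerical + 10 binary columns
```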

## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

## ANS:-

For transforming the categorical data in the animal dataset, I would use nominal encoding techniques, such as one-hot encoding. This is because nominal encoding techniques are preferred for categorical data since they can accurately represent the categorical data in numerical form without creating false relationships between categories.

In the animal dataset, we have categorical variables such as species, habitat, and diet. None of them has a natural order, so one-hot encoding is a suitable technique for all three. For example, we could create binary features for each unique value in the species variable, such as lion, tiger, and leopard. Similarly, we could create binary features for each unique value in the habitat variable, such as forest and grassland, and in the diet variable, such as carnivorous and herbivorous.

## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

## ANS:-

To transform the categorical data in the customer churn dataset into numerical data, you can use either ordinal encoding or one-hot encoding.

If the categorical variable has a natural order or ranking, then ordinal encoding can be used. For example, if the dataset contains information about the contract type of the customers, such as “month-to-month”, “one year”, and “two year”, then ordinal encoding can be used to encode this information.

If the categorical variable has no natural order or ranking, then one-hot encoding can be used. For example, if the dataset contains information about the gender of the customers, such as “male” and “female”, then one-hot encoding can be used to encode this information.

Here are the steps to implement this encoding:
1. Identify the categorical variables in the dataset. In this case, they are the customer’s gender and contract type.

2. Separate the nominal and ordinal variables. Here gender is a nominal variable, while contract type is an ordinal variable.

3. Apply one-hot encoding to the nominal variable (gender).

4. Apply ordinal encoding to the ordinal variable (contract type), passing the categories in their natural order.

5. Scale the numerical data (age, monthly charges, and tenure) using StandardScaler. The cells below skip this step and keep the raw numerical values.

6. Combine the one-hot encoded, ordinal encoded, and numerical columns into a single dataframe.

7. The data is now ready for a machine learning model.

In [6]:
import pandas as pd
data={'gender':['Female','Female','Male','Female','Male'],
      'age':[34,36,20,29,52],
      'contract':['yearly','half yearly','yearly','quarterly','half yearly'],
      'monthly_charges':[1042,966,1165,1002,1043],
      'tenure':[25,12,13,35,16]
     }

df = pd.DataFrame(data)
df

Unnamed: 0,gender,age,contract,monthly_charges,tenure
0,Female,34,yearly,1042,25
1,Female,36,half yearly,966,12
2,Male,20,yearly,1165,13
3,Female,29,quarterly,1002,35
4,Male,52,half yearly,1043,16


In [11]:
# Performing one hot encoding on gender column
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['gender']])
new_gender = pd.DataFrame(encoded.toarray(),columns=encoder.get_feature_names_out())
new_gender

Unnamed: 0,gender_Female,gender_Male
0,1.0,0.0
1,1.0,0.0
2,0.0,1.0
3,1.0,0.0
4,0.0,1.0


In [9]:
# Performing ordinal encoding on Contract type
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[['monthly','quarterly','half yearly','yearly']])
contract_encoded = ordinal_encoder.fit_transform(df[['contract']])
new_contract = pd.DataFrame(contract_encoded.flatten(),columns=['contract'])
new_contract

Unnamed: 0,contract
0,3.0
1,2.0
2,3.0
3,1.0
4,2.0


In [10]:
# Getting numeric variables
numeric = df.select_dtypes(exclude='object')
numeric

Unnamed: 0,age,monthly_charges,tenure
0,34,1042,25
1,36,966,12
2,20,1165,13
3,29,1002,35
4,52,1043,16


In [13]:
# Concatenating all 3 variables Nominal, Ordinal and Numerical
new_df = pd.concat([numeric,new_contract,new_gender],axis=1)
new_df

Unnamed: 0,age,monthly_charges,tenure,contract,gender_Female,gender_Male
0,34,1042,25,3.0,1.0,0.0
1,36,966,12,2.0,1.0,0.0
2,20,1165,13,3.0,0.0,1.0
3,29,1002,35,1.0,1.0,0.0
4,52,1043,16,2.0,0.0,1.0
