Q1. What is data encoding? How is it useful in data science?
 

Data encoding is the process of converting data into a specific format for efficient storage, transmission, and processing. In data science, this often involves transforming categorical data into a numerical format that can be used in machine learning models.
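As a minimal sketch of the idea, the snippet below turns a small text column into integer codes with `pd.factorize`. The `colors` data is hypothetical and only serves as an illustration; it does not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical categorical column (illustration only).
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# pd.factorize gives each distinct value an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(colors)

print(list(codes))    # [0, 1, 2, 1]
print(list(uniques))  # ['red', 'green', 'blue']
```

The integer codes can be stored and processed like any other numeric column, while `uniques` keeps the mapping back to the original labels.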

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal Encoding is a technique used to transform categorical variables that have no intrinsic ordering into numerical values that can be used in machine learning models. One common method for nominal encoding is one-hot encoding, which creates one binary (0/1) column per category, so that each row has a 1 only in the column of its own category.

Real-World Example:
Imagine a retail company wants to analyze customer purchase behavior based on their demographic data. One of the variables in the dataset is the "Customer Type" which can have values like "Regular", "Loyal", and "New".

In [1]:
# One-hot encode the nominal "customer type" column
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'customer type': ['Regular', 'Loyal', 'New', 'Regular', 'New', 'Loyal', 'Regular']})
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix with one column per category
encoded = encoder.fit_transform(df[['customer type']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
# Show the original column next to its encoded columns
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,customer type,customer type_Loyal,customer type_New,customer type_Regular
0,Regular,0.0,0.0,1.0
1,Loyal,1.0,0.0,0.0
2,New,0.0,1.0,0.0
3,Regular,0.0,0.0,1.0
4,New,0.0,1.0,0.0
5,Loyal,1.0,0.0,0.0
6,Regular,0.0,0.0,1.0


Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.




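Situations where nominal encoding with a single integer-coded column is preferred over one-hot encoding:

1. High Cardinality: one-hot encoding adds one column per category, so a variable with hundreds of categories produces a wide, sparse feature matrix, while integer codes keep it to one column.
2. Tree-Based Models: decision trees, random forests, and gradient boosting split on thresholds, so they can often work with arbitrary integer codes, whereas linear models may read the codes as an ordering.
3. Feature Importance: the variable stays in one column, so its importance is reported as a single value instead of being spread across many dummy columns.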
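The high-cardinality point can be sketched by comparing the width of the two encodings. The 500 category names below are hypothetical, and `OrdinalEncoder` is used here as scikit-learn's integer encoder for feature columns; this is an illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical high-cardinality column: 500 distinct category labels.
categories = [f"category_{i}" for i in range(500)]
df_hc = pd.DataFrame({"Product Category": categories})

# One-hot encoding: one column per distinct category.
one_hot = OneHotEncoder().fit_transform(df_hc[["Product Category"]])

# Integer encoding: a single column of codes.
integer_codes = OrdinalEncoder().fit_transform(df_hc[["Product Category"]])

print(one_hot.shape)        # (500, 500)
print(integer_codes.shape)  # (500, 1)
```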

Practical example
Scenario: Predicting Product Returns in E-Commerce
An e-commerce company wants to predict whether a product will be returned based on various features, including the "Product Category". The dataset contains a large number of unique product categories.

In [10]:
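import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Order ID': [1, 2, 3, 4, 5],
    'Product Category': ['Electronics', 'Clothing', 'Home Decor', 'Electronics', 'Beauty'],
    'Price': [299, 49, 79, 199, 29],
    'Returned': ['Yes', 'No', 'No', 'Yes', 'No']
})
# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
# Note: scikit-learn documents LabelEncoder for target labels; OrdinalEncoder
# is its counterpart for feature columns.
encoder = LabelEncoder()
encoded = encoder.fit_transform(data['Product Category'])
encoded_df = pd.DataFrame({'p_c': encoded})
pd.concat([data, encoded_df], axis=1)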


Unnamed: 0,Order ID,Product Category,Price,Returned,p_c
0,1,Electronics,299,Yes,2
1,2,Clothing,49,No,1
2,3,Home Decor,79,No,3
3,4,Electronics,199,Yes,2
4,5,Beauty,29,No,0


Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

**If there is no ordinal relationship**
Use One-Hot Encoding to avoid implying any ordinal relationship and to ensure that the categorical data is represented in a way that most algorithms can handle effectively.

**If there exists an ordinal relationship**
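Use Label (Ordinal) Encoding, with the integers assigned in the categories' real order, to preserve the ordinal information. Note that scikit-learn's LabelEncoder assigns codes in sorted (alphabetical) order, so the order has to be specified explicitly, for example with a manual mapping or with OrdinalEncoder.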
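For the ordinal case, the integer codes only carry the ordinal information if they follow the real order of the categories. Below is a minimal sketch using `OrdinalEncoder` with an explicit category order; the `rating` column and its five levels are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature with 5 unique values (illustration only).
ratings = pd.DataFrame({"rating": ["poor", "excellent", "good", "fair", "very good"]})

# State the order explicitly, from lowest to highest.
order = ["poor", "fair", "good", "very good", "excellent"]
ordinal_encoder = OrdinalEncoder(categories=[order])

# Each value is replaced by its position in `order` (as a float).
ratings["rating_encoded"] = ordinal_encoder.fit_transform(ratings[["rating"]]).ravel()
print(ratings)
```

Without the `categories` argument the codes would follow sorted order ("excellent" < "fair" < "good" ...), which does not match the intended ranking.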




Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

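It depends on the number of unique values in each categorical column. With one-hot encoding, each unique value becomes one new column.

Suppose:
column A = 5 unique values
column B = 3 unique values
Therefore, 5 + 3 = 8 new columns will be created. These replace the 2 original categorical columns, so the encoded dataset has 3 + 8 = 11 columns, and the number of rows stays at 1000.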
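The calculation can be checked with a small sketch. The DataFrame below is hypothetical (6 rows instead of 1000, made-up values) and only mirrors the assumed 5 and 3 unique values; `pd.get_dummies` is used here as a compact way to one-hot encode.

```python
import pandas as pd

# Hypothetical data: A has 5 unique values, B has 3, plus three numerical columns.
df_q5 = pd.DataFrame({
    "A": ["a1", "a2", "a3", "a4", "a5", "a1"],
    "B": ["b1", "b2", "b3", "b1", "b2", "b3"],
    "num1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "num2": [10, 20, 30, 40, 50, 60],
    "num3": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# One-hot encode both categorical columns; the originals are dropped.
encoded_q5 = pd.get_dummies(df_q5, columns=["A", "B"])
new_cols = [c for c in encoded_q5.columns if c not in df_q5.columns]

print(len(new_cols))        # 8 new columns (5 + 3)
print(encoded_q5.shape[1])  # 11 columns in total (3 numerical + 8 encoded)

# Dropping one level per variable gives (5 - 1) + (3 - 1) = 6 new columns.
reduced_q5 = pd.get_dummies(df_q5, columns=["A", "B"], drop_first=True)
print(reduced_q5.shape[1] - 3)  # 6
```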

Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

To transform categorical data into a format suitable for machine learning algorithms, I would use One-Hot Encoding.

Because:
1. Suitability for Nominal Data: One-hot encoding is particularly useful for categorical data where there is no inherent order among the categories (nominal data). In the context of animal species, habitat, and diet, these categories do not have a specific order, making one-hot encoding a fitting choice.

2. Avoiding Ordinal Interpretation: If we were to use techniques like label encoding, which assigns a unique integer to each category, the algorithm might misinterpret the integers as having some ordinal relationship. This could lead to erroneous conclusions, especially when there's no actual order in the data.

3. Compatibility with Many Algorithms: Many machine learning algorithms, such as linear regression, logistic regression, and neural networks, perform better with one-hot encoded data because it prevents them from assuming any sort of priority or hierarchy among the categories.

Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.


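The encoding technique depends on the nature of each categorical feature:
1. One-Hot Encoding for nominal (unordered) categorical features with more than two categories.
2. Label Encoding for binary categorical features.

The numerical features (age, monthly charges, and tenure) need no encoding.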

#### Step 1: Inspect the Dataset

In [22]:
import pandas as pd
data=pd.read_csv("C:/Users/TRISHA ROY/Downloads/WA_Fn-UseC_-Telco-Customer-Churn.csv")
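# .copy() makes the selection an independent DataFrame, so the column
# assignments in later steps do not trigger a SettingWithCopyWarning
data=data[['gender','tenure','MonthlyCharges','Contract']].copy()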
data.head()

Unnamed: 0,gender,tenure,MonthlyCharges,Contract
0,Female,1,29.85,Month-to-month
1,Male,34,56.95,One year
2,Male,2,53.85,Month-to-month
3,Male,45,42.3,One year
4,Female,2,70.7,Month-to-month


#### Step 2: Identify Categorical Features
In this dataset:
1. Gender is a binary categorical feature.
2. Contract is a nominal categorical feature.

In [23]:
## Step 3: Apply Label Encoding to Binary Features
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
data['gender']=encoder.fit_transform(data['gender'])

In [24]:
## Step 4: Apply One-Hot Encoding (OHE) to Nominal Categorical Features
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()
new_data=pd.DataFrame(encoder.fit_transform(data[['Contract']]).toarray(),columns=encoder.get_feature_names_out())
data=pd.concat([data,new_data],axis=1)
data.drop('Contract',axis=1,inplace=True)
data

Unnamed: 0,gender,tenure,MonthlyCharges,Contract_Month-to-month,Contract_One year,Contract_Two year
0,0,1,29.85,1.0,0.0,0.0
1,1,34,56.95,0.0,1.0,0.0
2,1,2,53.85,1.0,0.0,0.0
3,1,45,42.30,0.0,1.0,0.0
4,0,2,70.70,1.0,0.0,0.0
...,...,...,...,...,...,...
7038,1,24,84.80,0.0,1.0,0.0
7039,0,72,103.20,0.0,1.0,0.0
7040,0,11,29.60,1.0,0.0,0.0
7041,1,4,74.40,1.0,0.0,0.0
