Q1. What is data encoding? How is it useful in data science?

Data encoding refers to the process of converting categorical or textual data into a numerical format that can be easily processed by machine learning algorithms.

Here's why data encoding is useful in data science:

1. Algorithm Compatibility: 
Most machine learning algorithms are designed to work with numerical data. By encoding categorical features into numerical representations, we make our data compatible with a wide range of algorithms.

2. Feature Engineering: 
Categorical data encoding is a crucial step in feature engineering, where we create new features or modify existing ones to improve model performance. Proper encoding can uncover patterns and relationships in the data that were previously hidden in categorical features.

3. Dimensionality Reduction: 
In some cases, data encoding can lead to a more compact representation of categorical variables, reducing the dimensionality of the feature space and improving computational efficiency.

4. Interpretability: 
Encoded features can improve the interpretability of models. Numerical representations of categorical data allow for better visualization and understanding of relationships between features and the target variable.

5. Avoiding Bias: 
Proper data encoding helps avoid introducing biases into the model due to the way categorical data is represented. Biased encoding can lead to incorrect model assumptions and predictions.

6. Handling Missing Values: 
A missing value in a categorical feature can be treated as its own category or imputed (for example, with the most frequent category) before encoding, so the encoded column contains no gaps.

7. Scalability: 
Numerical arrays are stored and processed more efficiently than strings, which is important for large datasets.
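
As a minimal sketch of what "converting categorical data into a numerical format" looks like in practice, the snippet below encodes a toy column in two common ways with pandas. The `color` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Integer codes: one number per category (categories are sorted alphabetically,
# so blue -> 0, green -> 1, red -> 2).
codes = colors.astype("category").cat.codes

# One-hot: one 0/1 column per category.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

print(codes.tolist())  # [2, 1, 0, 1]
print(one_hot)
```

Both results are purely numerical, so either can be passed to an algorithm that rejects string inputs; which one is appropriate depends on whether the integer order would mislead the model.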

Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding, also known as categorical encoding, is a technique used to convert categorical data into numerical form. It is typically applied to categorical variables whose categories have no inherent order or ranking, for example the city a person lives in, a person's gender, or marital status. In nominal encoding, each category is assigned a unique numerical representation, allowing machine learning algorithms to process the data.

Here's an example of how nominal encoding could be used in a real-world scenario:

Scenario: Predicting Car Prices

Suppose we are building a machine learning model to predict car prices based on various features, including the car's make, model, and color. Since make, model, and color are all categorical variables, we need to encode them numerically.

We decide to use one-hot encoding for the "Make" feature, which has categories like "Toyota," "Honda," and "Ford." Each make becomes a binary column, and if a car is a Toyota, the "Make_Toyota" column would be 1 while the others are 0.

Similarly, for the "Model" feature we might use one-hot encoding. If the categories are "Sedan," "SUV," and "Convertible," each model becomes a binary column.

By performing nominal encoding, we transform categorical data into a numerical format suitable for machine learning algorithms, allowing the model to learn patterns and make predictions effectively.
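
The car-price scenario can be sketched with `pandas.get_dummies`. The rows and prices below are invented for illustration, and the `Make`/`Model` prefixes are chosen only to match the column names used in the text.

```python
import pandas as pd

# Hypothetical sample of the car dataset; all values are made up.
cars = pd.DataFrame({
    "make": ["Toyota", "Honda", "Ford", "Toyota"],
    "model": ["Sedan", "SUV", "Convertible", "SUV"],
    "price": [24000, 31000, 38000, 29000],
})

# Replace each categorical column with one binary column per category.
encoded = pd.get_dummies(
    cars,
    columns=["make", "model"],
    prefix=["Make", "Model"],
    dtype=int,
)

print(encoded.columns.tolist())
print(encoded)
```

The first row is a Toyota, so its `Make_Toyota` value is 1 while `Make_Honda` and `Make_Ford` are 0.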

Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding (here, assigning each category a single integer label) is preferred over one-hot encoding when the categorical variable has a large number of unique categories and the one-hot encoded matrix would become extremely high-dimensional. One-hot encoding then produces a sparse, memory-intensive representation that makes the dataset harder to handle and increases computational cost. A single integer column is far more compact. The trade-off is that integer labels imply an order that does not exist among the categories, so this choice is generally safer with tree-based models, which split on thresholds, than with linear or distance-based models.

Practical Example Scenario: Movie Genres

Imagine you are working on a movie recommendation system and you have a feature that represents movie genres. There are hundreds of unique genres like "Action," "Comedy," "Drama," "Sci-Fi," "Fantasy," and so on.

One-Hot Encoding:
In one-hot encoding, each unique genre is represented as a separate binary feature column. For example, 10 unique genres produce 10 new columns, and hundreds of genres produce hundreds. If a movie belongs to a particular genre, its corresponding column has a value of 1, and all other genre columns have a value of 0. With many unique genres this yields a wide, sparse matrix.

Nominal Encoding:
In nominal encoding, you assign a unique integer identifier to each genre. For instance, "Action" might be encoded as 1, "Comedy" as 2, "Drama" as 3, and so on. This results in a single column of integers representing the genres.
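
The difference in width between the two encodings can be seen in a small sketch. The genre list is a tiny stand-in for the hundreds of genres described above, and `pd.factorize` is used here as one possible way to assign integer identifiers in order of first appearance.

```python
import pandas as pd

# Hypothetical genre column; a real catalogue would have far more categories.
genres = pd.Series(
    ["Action", "Comedy", "Drama", "Sci-Fi", "Fantasy", "Comedy"],
    name="genre",
)

# One-hot encoding: one column per unique genre.
one_hot = pd.get_dummies(genres, prefix="genre", dtype=int)

# Integer encoding: a single column, identifiers assigned in order of appearance.
codes, uniques = pd.factorize(genres)
integer_encoded = pd.Series(codes + 1, name="genre_id")  # start at 1, as in the text

print(one_hot.shape)             # (6, 5)
print(integer_encoded.tolist())  # [1, 2, 3, 4, 5, 2]
```

The integer column stays one column wide however many genres exist, but the numbers carry no real ordering, which is worth keeping in mind when choosing the model.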

Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

If you have a categorical variable with 5 unique values, you have a few options for encoding the data into numerical format suitable for machine learning algorithms. The choice of encoding technique depends on the nature of the categorical variable, its relationship to the target variable, and the algorithm you plan to use. \
In this scenario, with only 5 unique values, one-hot encoding (OHE) is a reasonable choice.

One-Hot Encoding (OHE):\
One-hot encoding is a technique that converts categorical variables into binary columns, where each unique value gets its own column. For a categorical variable with 5 unique values, one-hot encoding would create 5 binary columns, each representing one of the values.

Why Choose One-Hot Encoding:
1. Preservation of Information: 
OHE ensures that each unique value is explicitly represented, allowing the algorithm to understand the distinctions between categories.

2. No Assumption of Order: 
OHE is suitable for nominal categorical data, where there is no inherent order or ranking among the categories.

3. Manageable Dimensionality: 
With only 5 unique values, the resulting one-hot encoded matrix would have a manageable number of features, avoiding excessive dimensionality.

4. Compatibility with Many Algorithms: 
One-hot encoded data is compatible with a wide range of machine learning algorithms, including linear models, decision trees, and neural networks.

5. Interpretability: 
One-hot encoded features are easy to interpret and visualize. Each binary column directly corresponds to a categorical value.

Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

If nominal encoding is implemented as one-hot encoding, each categorical column is replaced by one binary column per unique value, so the number of new columns equals the total number of unique values across both categorical columns. The question does not state how many unique values each column has, so the count below rests on an assumption. (If each category were instead mapped to an integer label, as in Q3, each categorical column would remain a single column and no extra columns would be created.)

Let's assume the following:

The first categorical column has 10 unique values.\
The second categorical column has 8 unique values.

The total number of new columns created for nominal encoding would be:\
Number of new columns = Total unique values in categorical column 1 + Total unique values in categorical column 2 \
Number of new columns = 10 + 8 \
Number of new columns = 18 

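So, if you were to use nominal encoding on a dataset with 2 categorical columns, where one has 10 unique values and the other has 8 unique values, you would create 18 new columns. These replace the 2 original categorical columns, so together with the 3 numerical columns the transformed dataset has 3 + 18 = 21 columns.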
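
The calculation can be checked on a synthetic dataset of the same shape. The column names and the assumed cardinalities (10 and 8) are illustrative and are not given in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1000

# Synthetic dataset: 3 numerical columns and 2 categorical columns.
df = pd.DataFrame({
    "num_a": rng.normal(size=n_rows),
    "num_b": rng.normal(size=n_rows),
    "num_c": rng.normal(size=n_rows),
    # Cycle through the labels so every category is guaranteed to appear.
    "cat_1": [f"a{i % 10}" for i in range(n_rows)],  # assumed 10 unique values
    "cat_2": [f"b{i % 8}" for i in range(n_rows)],   # assumed 8 unique values
})

new_columns = df["cat_1"].nunique() + df["cat_2"].nunique()
encoded = pd.get_dummies(df, columns=["cat_1", "cat_2"], dtype=int)

print(new_columns)    # 18
print(encoded.shape)  # (1000, 21)
```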

Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

In the scenario where you're working with a dataset containing information about different types of animals, including their species, habitat, and diet, you would likely need to apply a combination of encoding techniques to transform the categorical data into a format suitable for machine learning algorithms. The choice of techniques depends on the nature of the categorical variables, their relationships, and the machine learning algorithms you plan to use.

Here's how you might approach encoding for the different categorical variables:

Species (Nominal Categorical):\
The "species" column likely represents nominal categorical data, where each species has no inherent order. One-hot encoding (OHE) is a suitable technique for this type of variable. Each species would get its own binary column, and OHE would ensure that the algorithm doesn't assume any relationships between species.

Habitat (Nominal Categorical):\
Similarly, the "habitat" column is also likely nominal categorical data, as habitats typically don't have a natural order. You would again use one-hot encoding to create binary columns for each unique habitat.

Diet (Ordinal Categorical):\
The "diet" column might be treated as ordinal categorical data if the diets are read as a scale (e.g., herbivore < omnivore < carnivore by share of meat in the diet). If that order is meaningful for the prediction task, you could use ordinal encoding, assigning numerical values that reflect it. If the order is debatable, which is often the case for diet, one-hot encoding is the safer choice because it avoids introducing an unintended ordinal relationship, and with only a few categories it adds few columns.
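
One way to combine the two techniques is scikit-learn's `ColumnTransformer`, sketched below. The animal rows are invented, and the diet order (herbivore, omnivore, carnivore) is an assumption taken from the example above, not something the dataset dictates.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample of the animal dataset.
animals = pd.DataFrame({
    "species": ["lion", "deer", "bear", "eagle"],
    "habitat": ["savanna", "forest", "forest", "mountain"],
    "diet": ["carnivore", "herbivore", "omnivore", "carnivore"],
})

# Assumed ordering for the ordinal feature: herbivore=0, omnivore=1, carnivore=2.
diet_order = [["herbivore", "omnivore", "carnivore"]]

transformer = ColumnTransformer(
    transformers=[
        ("nominal", OneHotEncoder(), ["species", "habitat"]),
        ("ordinal", OrdinalEncoder(categories=diet_order), ["diet"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(animals)
print(encoded.shape)  # (4, 8): 4 species + 3 habitats + 1 diet column
print(encoded[:, -1])  # diet codes in the last column
```

If the diet order turns out not to be meaningful, moving `"diet"` into the one-hot list is a one-line change.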

Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

The categorical features in the dataset are "gender" and "contract type." These features have distinct and non-ordinal categories. \
OHE is particularly useful for nominal categorical data like gender and contract type, where no inherent order exists between categories. OHE will create separate binary columns for each category.

Gender is a binary categorical feature with two unique values: "male" and "female" --> (2 binary columns representing each category). \
Contract type is a nominal categorical feature with multiple unique values: "month-to-month", "one year" and "two year" --> (3 binary columns representing each category). \
Age, Monthly Charges, and Tenure: These are already in numerical format and don't require encoding.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample dataset
data = {
    'gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'age': [32, 45, 22, 28, 59],
    'contract_type': ['Month-to-month', 'Two year', 'Month-to-month', 'One year', 'Two year'],
    'monthly_charges': [65.5, 75.2, 85.0, 60.3, 95.7],
    'tenure': [12, 24, 6, 18, 36]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Select categorical columns for encoding
categorical_columns = ['gender', 'contract_type']

# Initialize OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the categorical columns using OneHotEncoder
encoded_data = encoder.fit_transform(df[categorical_columns])

# Create a DataFrame from the encoded data
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out())

# Concatenate the encoded DataFrame with the original DataFrame
final_df = pd.concat([df, encoded_df], axis=1)

print(final_df)

   gender  age   contract_type  monthly_charges  tenure  gender_Female  \
0    Male   32  Month-to-month             65.5      12            0.0   
1  Female   45        Two year             75.2      24            1.0   
2    Male   22  Month-to-month             85.0       6            0.0   
3  Female   28        One year             60.3      18            1.0   
4    Male   59        Two year             95.7      36            0.0   

   gender_Male  contract_type_Month-to-month  contract_type_One year  \
0          1.0                           1.0                     0.0   
1          0.0                           0.0                     0.0   
2          1.0                           1.0                     0.0   
3          0.0                           0.0                     1.0   
4          1.0                           0.0                     0.0   

   contract_type_Two year  
0                     0.0  
1                     1.0  
2                     0.0  
3         