### Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one form to another to make it suitable for various computational processes. In data science, encoding is essential for several reasons:

1. **Handling Categorical Data**: Many machine learning algorithms require numerical input, so categorical data must be converted to numerical form. Techniques like one-hot encoding, label encoding, and binary encoding are commonly used.

2. **Improving Model Performance**: An encoding that matches the nature of a variable (for example, one-hot for unordered categories and ordinal codes for ranked ones) lets a model learn real patterns instead of artifacts of arbitrary numbering, which can improve accuracy.

3. **Data Compression**: Encoding can reduce the size of data, making storage and transmission more efficient.

4. **Feature Engineering**: Encoding can create new features from existing data, enriching the dataset and potentially enhancing model predictions.

### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding converts a categorical variable that has no intrinsic order or ranking into numerical values. The term is used loosely: in this answer it means assigning each unique category a distinct integer (the same mechanics as label encoding), while other sources use it for one-hot encoding. Because the integers are arbitrary identifiers, they carry no ranking, although some models may still read one into them.

**Example**:

Imagine you have a dataset with a feature called `Color` with values like `Red`, `Blue`, and `Green`. You can apply nominal encoding to convert these categories into numerical values:

- `Red` → 0
- `Blue` → 1
- `Green` → 2

In a real-world scenario, let's say you’re working on a machine learning model to classify products based on their color. By applying nominal encoding to the `Color` feature, you transform it into numerical values that the model can use. This allows the model to incorporate color information into its predictions.
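The mapping above can be sketched with pandas. This is a minimal illustration, assuming a hypothetical `products` DataFrame; the column names and the integer codes are arbitrary choices that only need to be applied consistently.

```python
import pandas as pd

# Hypothetical product data; the column name and values are illustrative.
products = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

# Explicit mapping that mirrors the example above.
color_codes = {"Red": 0, "Blue": 1, "Green": 2}

# Series.map looks each category up in the dictionary.
products["Color_encoded"] = products["Color"].map(color_codes)

print(products)
```

Keeping the mapping in an explicit dictionary makes it easy to reuse the same codes on new data.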

### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Integer-based nominal encoding is preferred over one-hot encoding when the categorical variable has a large number of unique categories or when memory is constrained, because it keeps the feature in a single column instead of adding one column per category.

**Practical Example**:

Consider a dataset with a feature `Country` containing the names of hundreds of countries. Using one-hot encoding would result in a very large number of additional features (one for each country), which can significantly increase the dataset's dimensionality and computational complexity.

Instead, nominal encoding can be used to assign a unique integer to each country. For instance:

- `USA` → 0
- `Canada` → 1
- `Germany` → 2
- and so on...

This approach is more memory-efficient and easier to manage when there are many categories. The trade-off is that integer codes suggest an order that does not exist (here, `Germany` > `Canada` > `USA`), and linear or distance-based models will treat that order as meaningful. Integer codes are therefore best reserved for models that are less sensitive to the numeric scale of a feature, such as tree-based models.
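The difference in width can be checked directly. The sketch below uses a hypothetical six-row `Country` column standing in for hundreds of countries; `pd.factorize` numbers categories in order of first appearance, which is one of several reasonable ways to assign the integers.

```python
import pandas as pd

# Hypothetical data: a handful of countries standing in for hundreds.
df = pd.DataFrame(
    {"Country": ["USA", "Canada", "Germany", "Canada", "USA", "India"]}
)

# Integer codes: a single column, however many countries there are.
codes, uniques = pd.factorize(df["Country"])
df["Country_code"] = codes

# One-hot encoding: one column per distinct country.
one_hot = pd.get_dummies(df["Country"], prefix="Country")

print("integer-coded columns:", 1)
print("one-hot columns:", one_hot.shape[1])
```

With four distinct countries the one-hot frame already has four columns; the integer-coded version stays at one column regardless of cardinality.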

### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

For a categorical feature with 5 unique values, **one-hot encoding** is generally preferred, provided the categories have no inherent order. It adds only 5 binary columns (or 4 if one redundant column is dropped), so the dimensionality cost is small, and it avoids introducing an unintended ordinal relationship because each category gets its own column. If the 5 values do have a natural order, ordinal encoding is the better choice.
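As a rough illustration, `pd.get_dummies` shows how little a 5-category feature adds. The `city` column and its values are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical feature with exactly 5 unique values.
df = pd.DataFrame(
    {"city": ["Delhi", "Mumbai", "Pune", "Chennai", "Kolkata", "Pune"]}
)

# Full one-hot encoding: one binary column per category.
full = pd.get_dummies(df["city"], prefix="city")

# drop_first=True removes one redundant column; the dropped category
# is implied when all remaining columns are 0.
reduced = pd.get_dummies(df["city"], prefix="city", drop_first=True)

print(full.shape[1], reduced.shape[1])
```

Dropping one column is mainly useful for linear models, where the full set of dummy columns is perfectly collinear with an intercept.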

### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

The answer depends on how many unique categories each categorical column has, which the question does not state. Let's assume:

- **Categorical Column 1** has 8 unique values.
- **Categorical Column 2** has 12 unique values.

Taking nominal encoding to mean one-hot encoding, each unique value in these columns becomes a separate binary column. Thus:

- For Categorical Column 1: 8 new columns.
- For Categorical Column 2: 12 new columns.

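**Total new columns created** = 8 (from Column 1) + 12 (from Column 2) = **20 columns**.

These 20 columns replace the 2 original categorical columns, so the encoded dataset has 3 + 20 = 23 columns and still 1000 rows. If each category were instead mapped to an integer code, no new columns would be created: the 2 categorical columns would simply be overwritten with numbers.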
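The count can be verified on synthetic data. The sketch below builds a 1000-row frame under the same assumption of 8 and 12 unique values; the column names and category labels are invented for illustration.

```python
import numpy as np
import pandas as pd

n_rows = 1000

# Hypothetical category labels: 8 for the first column, 12 for the second.
cats_1 = [f"a{i}" for i in range(8)]
cats_2 = [f"b{i}" for i in range(12)]

df = pd.DataFrame({
    # np.resize cycles through the labels so every category appears.
    "cat_1": np.resize(cats_1, n_rows),
    "cat_2": np.resize(cats_2, n_rows),
    "num_1": np.arange(n_rows),
    "num_2": np.arange(n_rows) * 0.5,
    "num_3": np.arange(n_rows) % 7,
})

# One-hot encode only the two categorical columns.
encoded = pd.get_dummies(df, columns=["cat_1", "cat_2"])

new_columns = encoded.shape[1] - 3  # subtract the 3 numerical columns
print(new_columns)    # number of one-hot columns
print(encoded.shape)  # rows, total columns after encoding
```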

### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.


For a dataset with categorical information such as species, habitat, and diet, **one-hot encoding** is generally preferred. 

**Justification**:
- **Avoids Ordinal Assumptions**: Species, habitat, and diet have no inherent order, and one-hot encoding does not impose one.
- **Machine Learning Compatibility**: Many algorithms, particularly linear and distance-based models, work better with one-hot encoded data because integer codes would otherwise be read as magnitudes.

In this case, each category is treated as a separate binary feature, making it easier for the algorithm to learn patterns without misinterpreting categorical values.

### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

For predicting customer churn, you would use **one-hot encoding** for the categorical features and **leave the numerical features as they are**. Here's a step-by-step explanation:

1. **Identify Categorical Features**:
   - **Gender** (e.g., Male, Female)
   - **Contract Type** (e.g., Month-to-Month, One Year, Two Year)

2. **Apply One-Hot Encoding**:
   - **Gender**:
     - Create a new binary feature for each category:
       - `Gender_Male` (1 if Male, 0 otherwise)
       - `Gender_Female` (1 if Female, 0 otherwise)

   - **Contract Type**:
     - Create a new binary feature for each category:
       - `Contract_Month-to-Month` (1 if Month-to-Month, 0 otherwise)
       - `Contract_One Year` (1 if One Year, 0 otherwise)
       - `Contract_Two Year` (1 if Two Year, 0 otherwise)

3. **Leave Numerical Features**:
   - **Age**, **Monthly Charges**, and **Tenure** are already numerical and do not need encoding.

4. **Combine All Features**:
   - After encoding, you will have the following features:
     - `Gender_Male`, `Gender_Female`
     - `Contract_Month-to-Month`, `Contract_One Year`, `Contract_Two Year`
     - `Age`, `Monthly Charges`, `Tenure`

This process will transform the categorical features into a format suitable for machine learning algorithms while preserving the numerical features in their original form.

## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

In [2]:
# OHE Encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [6]:
# creating simple categorical df
df = pd.DataFrame({
    'color' : ['blue','red','green','green','red','blue','blue']
})

df.head()

| | color |
|---|---|
| 0 | blue |
| 1 | red |
| 2 | green |
| 3 | green |
| 4 | red |


In [7]:
# creating an instance of one hot encoder
encoder = OneHotEncoder()

In [8]:
# performing fit-transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [9]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [11]:
# the output columns follow the alphabetical order of the categories
encoder_df

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |
| 6 | 1.0 | 0.0 | 0.0 |


In [12]:
# encode new data: pass a DataFrame with the same column name used in fit
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()



array([[1., 0., 0.]])

In [13]:
pd.concat([df,encoder_df],axis=1)

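| | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | blue | 1.0 | 0.0 | 0.0 |
| 1 | red | 0.0 | 0.0 | 1.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |
| 6 | blue | 1.0 | 0.0 | 0.0 |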
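A fitted `OneHotEncoder` also stores what it learned, which helps when checking the column order or decoding predictions. The sketch below rebuilds the same `color` frame so that it runs on its own.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    {"color": ["blue", "red", "green", "green", "red", "blue", "blue"]}
)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["color"]]).toarray()

# categories_ holds one sorted array of categories per input column;
# the one-hot columns follow this order.
print(encoder.categories_)

# inverse_transform maps one-hot rows back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```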


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or by category frequency. For example, if we have a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

1. Red: 1
2. Green: 2
3. Blue: 3

The particular numbers are arbitrary. scikit-learn's `LabelEncoder`, used below, sorts the classes and numbers them from 0, so it produces blue = 0, green = 1, red = 2. It is designed for target labels (`y`) and therefore expects 1-D input.

In [16]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [17]:
# LabelEncoder expects a 1-D array, so pass the Series rather than a one-column DataFrame
lbl_encoder.fit_transform(df['color'])

array([0, 2, 1, 1, 2, 0, 0])

In [18]:
# encode several labels in one call; the output follows the order of the input list
lbl_encoder.transform(['green', 'blue', 'red'])

array([1, 0, 2])
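To see exactly which integer each label received, the fitted encoder's `classes_` attribute can be inspected. The sketch below is self-contained and uses a plain list in place of the `color` column.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["blue", "red", "green", "green", "red", "blue", "blue"]

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(colors)

# classes_ is sorted; the position of a label in it is its integer code.
mapping = {str(label): code for code, label in enumerate(lbl_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the codes.
decoded = lbl_encoder.inverse_transform(codes)
print(list(decoded))
```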

### Ordinal Encoding
It is used to encode categorical data that have an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in the order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4

In [19]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [20]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [21]:
df

| | size |
|---|---|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |
| 5 | large |


In [22]:
# create an instance of OrdinalEncoder with the category order given explicitly
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [23]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [24]:
# encode new data: pass a DataFrame with the same column name used in fit
encoder.transform(pd.DataFrame({'size': ['small']}))



array([[0.]])
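The `categories` argument matters because, without it, `OrdinalEncoder` sorts the categories alphabetically, which rarely matches the intended ranking. The following sketch compares the two on the same `size` data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame(
    {"size": ["small", "medium", "large", "medium", "small", "large"]}
)

# Default behaviour: categories are sorted alphabetically,
# so large=0, medium=1, small=2.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(df[["size"]]).ravel()

# Explicit order: small=0, medium=1, large=2.
explicit_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
explicit_codes = explicit_encoder.fit_transform(df[["size"]]).ravel()

print(default_codes)
print(explicit_codes)
```

With the default ordering, `small` receives the largest code, which inverts the real ranking.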