### Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one format or representation to another. In data science, it is mainly used for categorical variables, which take one of a limited, fixed set of values. Most machine learning algorithms expect numeric input, so encoding categories as numbers or binary indicators lets them be used as model features.
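As a minimal sketch of the idea, the snippet below turns a small text column into integer codes using pandas' categorical dtype. The `size` column and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy column, not taken from any real dataset
sizes = pd.Series(["small", "large", "medium", "small"], name="size")

# Convert to pandas' categorical dtype; string categories are sorted alphabetically
as_category = sizes.astype("category")

# Integer code for each row, plus a lookup table from code back to label
codes = as_category.cat.codes
lookup = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(lookup)
```

The lookup table matters as much as the codes: without it, the integers cannot be translated back into the original categories.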

### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

Nominal encoding, which this notebook treats as label encoding, assigns a unique integer to each category of a categorical variable. For example, for a variable "City" with categories {New York, London, Paris}, scikit-learn's `LabelEncoder` sorts the categories alphabetically and assigns London → 0, New York → 1, Paris → 2. The integers are identifiers only and do not imply any order among the cities. In a real-world scenario, you might use nominal encoding for variables like "color" or "department" in a dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = {
    'City': ['New York', 'London', 'Paris', 'New York', 'Paris'],
    'Product_Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing', 'Books'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Contract_Type': ['A', 'B', 'A', 'C', 'B'],
    'Age': [30, 25, 40, 35, 45],
    'Monthly_Charges': [100, 80, 120, 90, 110],
    'Tenure': [3, 2, 5, 4, 6]
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Q2. Nominal encoding for 'City'
label_encoder = LabelEncoder()
df['City_Encoded'] = label_encoder.fit_transform(df['City'])
df
```

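| | City | Product_Category | Gender | Contract_Type | Age | Monthly_Charges | Tenure | City_Encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | New York | Electronics | Male | A | 30 | 100 | 3 | 1 |
| 1 | London | Clothing | Female | B | 25 | 80 | 2 | 0 |
| 2 | Paris | Electronics | Male | A | 40 | 120 | 5 | 2 |
| 3 | New York | Clothing | Female | C | 35 | 90 | 4 | 1 |
| 4 | Paris | Books | Male | B | 45 | 110 | 6 | 2 |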
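To see which integer each city received, one option is to read the fitted encoder's `classes_` attribute, as sketched below. The variable names (`cities`, `encoder`, `mapping`) are introduced here for illustration and are not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same city values as in the sample dataset
cities = pd.Series(['New York', 'London', 'Paris', 'New York', 'Paris'])

encoder = LabelEncoder()
encoded = encoder.fit_transform(cities)

# classes_ holds the categories in sorted order; the position is the assigned code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Note that scikit-learn documents `LabelEncoder` as intended for target labels; for feature columns, `OrdinalEncoder` offers the same kind of mapping with a 2D interface.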


### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

Nominal encoding is preferred over one-hot encoding when a categorical variable has high cardinality (a large number of unique categories) or when memory efficiency is a concern, because it produces a single integer column instead of one column per category. For example, for a "Product Category" variable with hundreds of unique categories, nominal encoding is far more compact than one-hot encoding. The trade-off is that the integer codes carry an arbitrary order, which tree-based models usually tolerate better than linear or distance-based models.

```python
# Nominal encoding for 'Product_Category'
# (this toy column has only 3 categories; it stands in for a high-cardinality column)
df['Product_Category_Encoded'] = label_encoder.fit_transform(df['Product_Category'])
df
```

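| | City | Product_Category | Gender | Contract_Type | Age | Monthly_Charges | Tenure | City_Encoded | Product_Category_Encoded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | New York | Electronics | Male | A | 30 | 100 | 3 | 1 | 2 |
| 1 | London | Clothing | Female | B | 25 | 80 | 2 | 0 | 1 |
| 2 | Paris | Electronics | Male | A | 40 | 120 | 5 | 2 | 2 |
| 3 | New York | Clothing | Female | C | 35 | 90 | 4 | 1 | 1 |
| 4 | Paris | Books | Male | B | 45 | 110 | 6 | 2 | 0 |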
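The memory argument is easier to see with a genuinely high-cardinality column. The sketch below builds a hypothetical column with 300 made-up category names (an assumption for illustration, not data from this notebook) and compares the width of the two encodings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical high-cardinality column: 300 made-up category names, each repeated 10 times
categories = [f"category_{i:03d}" for i in range(300)]
df_wide = pd.DataFrame({"Product_Category": categories * 10})

# Label encoding: a single integer column regardless of cardinality
label_codes = LabelEncoder().fit_transform(df_wide["Product_Category"])

# One-hot encoding: one indicator column per unique category
one_hot = pd.get_dummies(df_wide["Product_Category"])

print(label_codes.shape)  # one value per row
print(one_hot.shape)      # one column per category
```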


### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

If the dataset contains a categorical variable with 5 unique values, nominal encoding is a workable choice: it assigns a unique integer to each category and keeps the data in a single column, rather than expanding it into five columns as one-hot encoding would. Note that the integer codes do not represent any real ordering of the categories; the order is arbitrary (alphabetical with `LabelEncoder`). With only 5 categories, one-hot encoding is also inexpensive and is often the safer option for models that treat the codes as magnitudes, such as linear models.

```python
# Sample dataset (3 colors for brevity; the same code applies to 5 unique values)
data = {
    'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Initialize label encoder
label_encoder = LabelEncoder()

# Apply label encoding to 'Color'
df['Color_Encoded'] = label_encoder.fit_transform(df['Color'])

print(df)
```

```
   Color  Color_Encoded
0    Red              2
1  Green              1
2   Blue              0
3    Red              2
4  Green              1
```
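For comparison, here is a minimal sketch of one-hot encoding a column with exactly five unique values using `pd.get_dummies`. The extra colors (Yellow, Black) are hypothetical additions so that the column reaches five categories.

```python
import pandas as pd

# Hypothetical column with exactly five unique values
colors = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Yellow", "Black", "Red"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot_colors = pd.get_dummies(colors, columns=["Color"], dtype=int)

print(one_hot_colors.columns.tolist())
print(one_hot_colors.shape)
```

Each row has exactly one indicator set to 1, so no category is treated as larger or smaller than another.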


### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

The answer depends on which encoding is meant. Nominal encoding as defined in Q2 (label encoding) replaces each categorical column with a single integer column, so encoding the two categorical columns produces 2 encoded columns and, if they replace the originals, the dataset still has 5 columns. New columns in proportion to the number of categories appear only with one-hot encoding, where each unique category becomes its own column. The calculation below covers that case.

Given:
- Dataset size: 1000 rows and 5 columns
- Two columns are categorical, and the remaining three are numerical.

Let's denote:
- \( n_1 \) as the number of unique categories in the first categorical column.
- \( n_2 \) as the number of unique categories in the second categorical column.

With one-hot encoding, the total number of new columns is \( n_1 + n_2 \).

### Calculations:

We don't have the actual data, so assume:
- Unique categories in the first categorical column: \( n_1 = 4 \)
- Unique categories in the second categorical column: \( n_2 = 3 \)

Total new columns = \( n_1 + n_2 = 4 + 3 = 7 \)

So, under these assumptions, one-hot encoding would create 7 new columns. They replace the 2 original categorical columns, giving \( 3 + 7 = 10 \) columns in total. Label encoding would keep the total at 5 columns.
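The column counts can be checked on synthetic data. The sketch below assumes the same 4 and 3 unique categories; the column names (`cat_1`, `num_1`, and so on) and the values are hypothetical, and integer label codes are compared against a one-hot encoding.

```python
import numpy as np
import pandas as pd

n_rows = 1000
rng = np.random.default_rng(0)

# Hypothetical data: two categorical columns (4 and 3 categories), three numerical columns
df_q5 = pd.DataFrame({
    "cat_1": np.resize(["a", "b", "c", "d"], n_rows),
    "cat_2": np.resize(["x", "y", "z"], n_rows),
    "num_1": rng.normal(size=n_rows),
    "num_2": rng.normal(size=n_rows),
    "num_3": rng.normal(size=n_rows),
})

# Integer label codes: each categorical column is replaced by one integer column
label_encoded = df_q5.copy()
for col in ["cat_1", "cat_2"]:
    label_encoded[col] = label_encoded[col].astype("category").cat.codes

# One-hot encoding: each categorical column expands into one column per category
one_hot_encoded = pd.get_dummies(df_q5, columns=["cat_1", "cat_2"])
indicator_cols = [c for c in one_hot_encoded.columns if c.startswith("cat_")]

print(label_encoded.shape)    # column count unchanged
print(one_hot_encoded.shape)  # 3 numerical + 7 indicator columns
print(len(indicator_cols))    # n_1 + n_2
```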

### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

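For a dataset containing information about different types of animals, including their species, habitat, and diet, all three variables are nominal: their categories have no inherent order. The example below uses nominal (label) encoding, which gives a compact integer representation of each variable. Because the integer codes follow alphabetical order rather than any real ranking, this works best with tree-based models; for linear or distance-based models, one-hot encoding avoids implying that, for example, Tiger (3) is greater than Bear (0).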

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample dataset
data = {
    'Species': ['Tiger', 'Lion', 'Bear', 'Elephant', 'Tiger'],
    'Habitat': ['Forest', 'Savanna', 'Mountain', 'Jungle', 'Savanna'],
    'Diet': ['Carnivore', 'Carnivore', 'Omnivore', 'Herbivore', 'Carnivore']
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Initialize label encoder
label_encoder = LabelEncoder()

# Apply label encoding to 'Species', 'Habitat', and 'Diet'
# (each fit_transform call refits the encoder, so afterwards
# label_encoder.classes_ only holds the categories of the last column)
df['Species_Encoded'] = label_encoder.fit_transform(df['Species'])
df['Habitat_Encoded'] = label_encoder.fit_transform(df['Habitat'])
df['Diet_Encoded'] = label_encoder.fit_transform(df['Diet'])

df
```


| | Species | Habitat | Diet | Species_Encoded | Habitat_Encoded | Diet_Encoded |
|---|---|---|---|---|---|---|
| 0 | Tiger | Forest | Carnivore | 3 | 0 | 0 |
| 1 | Lion | Savanna | Carnivore | 2 | 3 | 0 |
| 2 | Bear | Mountain | Omnivore | 0 | 2 | 2 |
| 3 | Elephant | Jungle | Herbivore | 1 | 1 | 1 |
| 4 | Tiger | Savanna | Carnivore | 3 | 3 | 0 |


### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

For predicting customer churn in a telecommunications company, the dataset has two categorical features (gender and contract type) and three numerical features (age, monthly charges, and tenure). Only the categorical features need encoding: we can apply nominal encoding to gender and contract type and leave the numerical features unchanged.

Here's a step-by-step explanation of how we would implement the encoding:

1. **Identify Categorical and Numerical Features**:
   - Categorical Features: Gender, Contract Type
   - Numerical Features: Age, Monthly Charges, Tenure

2. **Nominal Encoding for Categorical Features**:
   - For gender and contract type, we'll use nominal encoding (label encoding) to convert the categorical values into numerical format.
   - Each unique category in gender and contract type will be assigned a unique numerical value.

3. **Leave Numerical Features Unchanged**:
   - Since age, monthly charges, and tenure are already in numerical format, we don't need to encode them further. These features can be used directly in the machine learning model without any transformation.

### Implementation in Python:

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Sample dataset
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Age': [30, 25, 40, 35, 45],
    'Contract_Type': ['A', 'B', 'A', 'C', 'B'],
    'Monthly_Charges': [100, 80, 120, 90, 110],
    'Tenure': [3, 2, 5, 4, 6]
}

# Convert data to DataFrame
df = pd.DataFrame(data)

# Initialize label encoder
label_encoder = LabelEncoder()

# Nominal encoding for 'Gender' and 'Contract_Type'
df['Gender_Encoded'] = label_encoder.fit_transform(df['Gender'])
df['Contract_Type_Encoded'] = label_encoder.fit_transform(df['Contract_Type'])

# Display the resulting DataFrame
df
```

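| | Gender | Age | Contract_Type | Monthly_Charges | Tenure | Gender_Encoded | Contract_Type_Encoded |
|---|---|---|---|---|---|---|---|
| 0 | Male | 30 | A | 100 | 3 | 1 | 0 |
| 1 | Female | 25 | B | 80 | 2 | 0 | 1 |
| 2 | Male | 40 | A | 120 | 5 | 1 | 0 |
| 3 | Female | 35 | C | 90 | 4 | 0 | 2 |
| 4 | Male | 45 | B | 110 | 6 | 1 | 1 |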
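In a modeling pipeline, the "encode the categorical columns, leave the numerical ones alone" step is often expressed with scikit-learn's `ColumnTransformer`. The sketch below is one possible variant that swaps in one-hot encoding for the categorical columns; it is an illustration under that assumption, not the method used in the cell above, and the names `churn` and `preprocessor` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same sample values as in the churn dataset above
churn = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Age': [30, 25, 40, 35, 45],
    'Contract_Type': ['A', 'B', 'A', 'C', 'B'],
    'Monthly_Charges': [100, 80, 120, 90, 110],
    'Tenure': [3, 2, 5, 4, 6],
})

categorical = ['Gender', 'Contract_Type']
numerical = ['Age', 'Monthly_Charges', 'Tenure']

# One-hot encode the categorical columns and pass the numerical ones through unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical),
        ('numerical', 'passthrough', numerical),
    ]
)

features = preprocessor.fit_transform(churn)
print(features.shape)  # 2 gender + 3 contract indicators + 3 numerical columns
```

Fitting the transformer on the training data and reusing it on new data keeps the category-to-column mapping consistent between training and prediction.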
