# 1. What is data encoding? How is it useful in data science?

- **Data encoding** is like translating a message from one language to another so that everyone can understand it. 
- In data science, it's the process of converting data from one format or representation to another. 
- It's useful because it allows us to work with different types of data, handle categorical variables, and make data suitable for machine learning models.

> For example:
- Suppose you're working with a dataset of movie genres, and the genres are in text form like "Action," "Comedy," and "Drama."
- However, to use this data in a machine learning model, you need to convert it into a numerical format.
- Data encoding helps us do this.


In [20]:
import pandas as pd

# sample data
data = {
    'Movie': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'Genre': ['Action', 'Comedy', 'Drama', 'Action']  # movie genres
}

df = pd.DataFrame(data)

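# performing data encoding - using one-hot encoding
# dtype=int keeps the 0/1 output shown below (pandas 2.0+ returns True/False by default)
encoded_data = pd.get_dummies(df, columns=['Genre'], prefix='Genre', dtype=int)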

print(encoded_data)

     Movie  Genre_Action  Genre_Comedy  Genre_Drama
0  Movie A             1             0            0
1  Movie B             0             1            0
2  Movie C             0             0            1
3  Movie D             1             0            0


- Here, we performed data encoding using one-hot encoding, which converts the movie genres into numerical format.
- Each genre becomes a binary (0 or 1) column, making it suitable for machine learning.
- In effect, the genres have been translated from labels that people read, like "Action" or "Comedy," into 0s and 1s that a machine learning model can work with.
- This binary representation lets the model process the data numerically.
- The model can then make predictions or classifications based on these binary features.
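
To make the "0s and 1s" concrete, the minimal sketch below reuses the genre sample and pulls the encoded columns out as a plain numeric array, which is the form most model libraries expect. The variable names `genre_cols` and `X` are illustrative and not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "Movie": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Genre": ["Action", "Comedy", "Drama", "Action"],
})

# dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(df, columns=["Genre"], prefix="Genre", dtype=int)

# collect only the one-hot columns and convert them to a NumPy array
genre_cols = [c for c in encoded.columns if c.startswith("Genre_")]
X = encoded[genre_cols].to_numpy()

print(genre_cols)              # ['Genre_Action', 'Genre_Comedy', 'Genre_Drama']
print(X.shape)                 # (4, 3)
print(X.sum(axis=1).tolist())  # [1, 1, 1, 1] -> exactly one genre per movie
```

Each row sums to 1 because every movie belongs to exactly one genre.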

# 2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario. 

- **Nominal encoding** is like giving names or labels to different categories or groups, allowing us to distinguish between them.
- In data science, it's a way to convert categorical data into numerical form while preserving the distinct categories.
- It's useful for scenarios where the categories have no inherent order or ranking.

> - For example:
- Imagine you're working with a dataset of car colors, which include categories like "Red," "Blue," and "Green." 
- These colors have no natural order, and you want to convert them into numerical values for a machine learning model.


In [21]:
import pandas as pd

# sample data
data = {
    'Car': ['Car A', 'Car B', 'Car C', 'Car D'],
    'Color': ['Red', 'Blue', 'Green', 'Red']  # car colors
}

df = pd.DataFrame(data)

# performing nominal encoding
color_mapping = {'Red': 1, 'Blue': 2, 'Green': 3}
df['Color Code'] = df['Color'].map(color_mapping)

print(df)


     Car  Color  Color Code
0  Car A    Red           1
1  Car B   Blue           2
2  Car C  Green           3
3  Car D    Red           1


- Here, we used nominal encoding to convert car colors into numerical values.
- Each color is given a unique code (1 for Red, 2 for Blue, 3 for Green) that acts only as an identifier; the numbers do not mean that Green is "greater than" Red.
- Nominal encoding is useful when you have categories without a specific order, and you want to maintain the distinct identities of each category in numerical form.
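
Writing the mapping by hand works for three colors but becomes tedious for longer lists. As a hedged alternative, the sketch below uses `pd.factorize`, which assigns codes in order of first appearance. Note that its codes start at 0, not 1 as in the hand-written mapping above.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red"], name="Color")

# factorize returns integer codes plus the distinct values they refer to
codes, uniques = pd.factorize(colors)

print(codes.tolist())   # [0, 1, 2, 0]
print(list(uniques))    # ['Red', 'Blue', 'Green']

# the codes index back into uniques, so the encoding is reversible
decoded = uniques[codes]
print(list(decoded))    # ['Red', 'Blue', 'Green', 'Red']
```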

# 3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

- Both techniques apply to categorical data whose categories have no natural order or ranking, so the absence of order alone does not decide between them.
- Nominal encoding with integer codes is preferred when you want to represent the categories as numerical values without creating multiple binary columns, for example when a column has many distinct values.

> - For example:
- You're working on a dataset of countries, and you want to encode their regions.
- Each country belongs to a specific region, and you want to represent this information numerically.


In [22]:
import pandas as pd

# sample data
data = {
    'Country': ['Country A', 'Country B', 'Country C', 'Country D'],
    'Region': ['Europe', 'Asia', 'Africa', 'Europe']  # region of each country
}

df = pd.DataFrame(data)

# performing nominal encoding
region_mapping = {'Europe': 1, 'Asia': 2, 'Africa': 3}
df['Region Code'] = df['Region'].map(region_mapping)

print(df)

     Country  Region  Region Code
0  Country A  Europe            1
1  Country B    Asia            2
2  Country C  Africa            3
3  Country D  Europe            1


- Here, we assigned each country a numerical code based on its region.
- Nominal encoding simplifies the representation of regions, making it easier for machine learning models to work with. 
- It's a suitable choice when the categories have no natural order, and you want to preserve their identities as numerical values.
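
The difference in column count is easiest to see with many categories. The sketch below uses hypothetical data (200 rows spread over 50 made-up region labels, not taken from the original notebook) to compare the two encodings.

```python
import pandas as pd

# hypothetical data: 200 rows cycling through 50 made-up region labels
regions = [f"Region {i % 50}" for i in range(200)]
df = pd.DataFrame({"Region": regions})

# integer codes: always a single column, whatever the number of categories
df["Region Code"] = df["Region"].astype("category").cat.codes

# one-hot: one column per distinct region
one_hot = pd.get_dummies(df["Region"], prefix="Region", dtype=int)

print(df["Region Code"].nunique())  # 50 distinct codes in one column
print(one_hot.shape)                # (200, 50)
```

With three regions the difference is small; with fifty, one-hot encoding adds fifty columns where integer codes add one.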

# 4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

- As we have categorical data with 5 unique values, one common and straightforward encoding technique is nominal encoding using integer labels.
- This choice is made because nominal encoding preserves the distinct categories in numerical form without creating an excessive number of binary columns (as in one-hot encoding).

> - For example:
- Consider we're working with a dataset of phone models, and we want to encode the phone brands. 
- Each phone model belongs to a specific brand, and we want to represent this information numerically. 



In [23]:
import pandas as pd 

# sample data 
data = {
    "Phone Model": ["Model A", "Model B", "Model C", "Model D", "Model E"],
    "Brand": ["Samsung", "Apple", "Samsung", "Huawei", "Sony"]  # brand of each phone model
}

df = pd.DataFrame(data)

# performing nominal encoding
brand_mapping = {"Samsung": 1, "Apple": 2, "Huawei": 3, "Sony": 4}
df["Brand Code"] = df["Brand"].map(brand_mapping)

print(df)

  Phone Model    Brand  Brand Code
0     Model A  Samsung           1
1     Model B    Apple           2
2     Model C  Samsung           1
3     Model D   Huawei           3
4     Model E     Sony           4


- Here, we used nominal encoding to convert phone brands into numerical values.
- Each brand is given a unique code, making it suitable for machine learning models.
- The sample has five phone models but only four distinct brands; the same mapping approach extends directly to a column with 5 unique values by adding one more entry.
- Nominal encoding with integer labels is an efficient choice for a moderate number of categories, because it maintains the identity of each category in numerical form without introducing excessive complexity.
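
One practical caveat with a hand-written mapping: `Series.map` returns `NaN` for any value that is missing from the dictionary. The sketch below shows this with a hypothetical brand ("Nokia") that the mapping does not cover. Using -1 for unknown brands is just one possible convention, assumed here for illustration.

```python
import pandas as pd

brand_mapping = {"Samsung": 1, "Apple": 2, "Huawei": 3, "Sony": 4}

# "Nokia" is a hypothetical brand that is absent from the mapping
brands = pd.Series(["Samsung", "Nokia", "Sony"])

codes = brands.map(brand_mapping)
print(codes.isna().tolist())   # [False, True, False]

# assumed convention: reserve -1 for brands the mapping does not know
codes_filled = codes.fillna(-1).astype(int)
print(codes_filled.tolist())   # [1, -1, 4]
```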

# 5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

- When using nominal encoding for categorical data, each unique category is assigned a unique integer label. 
- So, for each categorical column, we create a single new column to store the encoded values.
- No additional columns are created for each category within the categorical columns.

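- In this case, we have two categorical columns, each producing one encoded column: 2 × 1 = 2 new columns.
- If the encoded columns are added alongside the originals, the dataset grows from 5 to 5 + 2 = 7 columns; the number of rows stays at 1000.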


In [24]:
import pandas as pd 

# sample data: 2 categorical columns and 3 numerical columns
data = {
    "Cat 1" : ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'C'],
    "Cat 2" : ['X', 'Y', 'Y', 'X', 'Z', 'Z', 'X', 'Y'],
    "Num 1" : [10, 20, 30, 40, 50, 60, 70, 80],
    "Num 2" : [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
    "Num 3" : [100, 200, 300, 400, 500, 600, 700, 800]
}

df = pd.DataFrame(data)

df

  Cat 1 Cat 2  Num 1  Num 2  Num 3
0     A     X     10    1.1    100
1     B     Y     20    2.2    200
2     A     Y     30    3.3    300
3     C     X     40    4.4    400
4     B     Z     50    5.5    500
5     A     Z     60    6.6    600
6     C     X     70    7.7    700
7     C     Y     80    8.8    800


In [25]:
# performing nominal encoding for the categorical columns
df["Category 1 Code"] = df["Cat 1"].astype("category").cat.codes
df["Category 2 Code"] = df["Cat 2"].astype("category").cat.codes

print(df)

  Cat 1 Cat 2  Num 1  Num 2  Num 3  Category 1 Code  Category 2 Code
0     A     X     10    1.1    100                0                0
1     B     Y     20    2.2    200                1                1
2     A     Y     30    3.3    300                0                1
3     C     X     40    4.4    400                2                0
4     B     Z     50    5.5    500                1                2
5     A     Z     60    6.6    600                0                2
6     C     X     70    7.7    700                2                0
7     C     Y     80    8.8    800                2                1


- Here, nominal encoding is performed on the two categorical columns, resulting in two new columns to store the encoded values.
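
The count can also be checked programmatically. The sketch below rebuilds the same sample, counts the columns added by integer encoding, and, for comparison only, counts how many columns one-hot encoding would add for the same data. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Cat 1": ["A", "B", "A", "C", "B", "A", "C", "C"],
    "Cat 2": ["X", "Y", "Y", "X", "Z", "Z", "X", "Y"],
    "Num 1": [10, 20, 30, 40, 50, 60, 70, 80],
    "Num 2": [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
    "Num 3": [100, 200, 300, 400, 500, 600, 700, 800],
})

cat_cols = ["Cat 1", "Cat 2"]
cols_before = df.shape[1]

# integer encoding adds exactly one column per categorical column
for col in cat_cols:
    df[f"{col} Code"] = df[col].astype("category").cat.codes

new_cols_integer = df.shape[1] - cols_before
print(new_cols_integer)   # 2

# one-hot encoding would add one column per distinct value instead
new_cols_one_hot = sum(df[col].nunique() for col in cat_cols)
print(new_cols_one_hot)   # 3 + 3 = 6
```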

# 6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer. 

- To transform categorical data about animals, such as species, habitat, and diet, into a format suitable for machine learning algorithms, I would recommend using nominal encoding with integer labels.

- **Nominal Encoding for Categorical Data:** Nominal encoding is ideal when we have categories with no natural order or ranking, which is often the case with animal species, habitats, and diets. 
> - It preserves the distinct categories in numerical form without introducing a large number of binary columns (as in one-hot encoding).
- **Integer Labels:** Using integer labels for encoding keeps the representation simple and efficient. 
> - Each category is assigned a unique integer, making it straightforward for machine learning models to work with.

In [26]:
import pandas as pd 

# sample data - animal information 
data = {
    "Animal" :["Lion","Elephant","Giraffe","Tiger","Kangaroo"],
    "Habitat" : ["Savannah", "Jungle", "Savannah", "Jungle", "Forest"],
    "Diet" : ["Carnivore", "Herbivore", "Herbivore", "Carnivore", "Herbivore"]
}

df = pd.DataFrame(data)
df


     Animal   Habitat       Diet
0      Lion  Savannah  Carnivore
1  Elephant    Jungle  Herbivore
2   Giraffe  Savannah  Herbivore
3     Tiger    Jungle  Carnivore
4  Kangaroo    Forest  Herbivore


In [27]:
# performing nominal encoding for the categorical columns
animal_mapping = {"Lion": 1, "Elephant": 2, "Giraffe": 3, "Tiger": 4, "Kangaroo": 5}
habitat_mapping = {"Savannah": 1, "Jungle": 2, "Forest": 3}
diet_mapping = {"Carnivore": 1, "Herbivore": 2}

df["Animal Code"] = df["Animal"].map(animal_mapping)
df["Habitat Code"] = df["Habitat"].map(habitat_mapping)
df["Diet Code"] = df["Diet"].map(diet_mapping)

print(df)

     Animal   Habitat       Diet  Animal Code  Habitat Code  Diet Code
0      Lion  Savannah  Carnivore            1             1          1
1  Elephant    Jungle  Herbivore            2             2          2
2   Giraffe  Savannah  Herbivore            3             1          2
3     Tiger    Jungle  Carnivore            4             2          1
4  Kangaroo    Forest  Herbivore            5             3          2


- Here, we used nominal encoding to convert categorical animal data into numerical values.
- Each category (species, habitat, diet) is assigned a unique integer code.
- This representation is suitable for machine learning algorithms and maintains the identities of animal categories without introducing unnecessary complexity.
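
With several categorical columns, separate hand-written dictionaries can be replaced by a loop. The sketch below is one possible approach, not the notebook's own method: it encodes each column through the pandas `category` dtype and keeps a lookup table so the codes can be translated back. The codes it produces follow alphabetical order of the labels and start at 0, so they differ from the hand-assigned codes above. The `lookups` dictionary is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Lion", "Elephant", "Giraffe", "Tiger", "Kangaroo"],
    "Habitat": ["Savannah", "Jungle", "Savannah", "Jungle", "Forest"],
    "Diet": ["Carnivore", "Herbivore", "Herbivore", "Carnivore", "Herbivore"],
})

lookups = {}  # column name -> {code: original label}
for col in ["Animal", "Habitat", "Diet"]:
    as_cat = df[col].astype("category")
    df[f"{col} Code"] = as_cat.cat.codes
    lookups[col] = dict(enumerate(as_cat.cat.categories))

print(lookups["Diet"])               # {0: 'Carnivore', 1: 'Herbivore'}
print(df["Habitat Code"].tolist())   # [2, 1, 2, 1, 0]
```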

# 7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

- To transform the categorical data in the customer churn dataset into numerical data, we can use a combination of encoding techniques, depending on the nature of the categorical variables.

- **Label Encoding for Ordinal Data:** If the dataset contains categorical features with a clear order or hierarchy, such as "contract type," we can use label encoding.
> - In label encoding, each category is assigned an integer value based on its position in the order.
> - For instance, we can assign 0 for "Month-to-Month," 1 for "One Year," and 2 for "Two Year" contract types.
- **One-Hot Encoding for Nominal Data:** For categorical variables like "gender," which have no inherent order or ranking, we should use one-hot encoding.
> - One-hot encoding creates binary columns (0 or 1) for each category.
> - For "gender," it will create two columns: one for "Male" and one for "Female."


In [28]:
import pandas as pd

# Sample data: Customer churn dataset
data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Age': [25, 30, 40, 35, 28],
    'Contract Type': ['Month-to-Month', 'One Year', 'Two Year', 'Month-to-Month', 'One Year'],
    'Monthly Charges': [50, 60, 80, 55, 70],
    'Tenure': [6, 12, 24, 8, 18]
}

df = pd.DataFrame(data)

# Step 1: Label Encoding for 'Contract Type'
contract_mapping = {'Month-to-Month': 0, 'One Year': 1, 'Two Year': 2}
df['Contract Type'] = df['Contract Type'].map(contract_mapping)

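# Step 2: One-Hot Encoding for 'Gender'
# dtype=int keeps the 0/1 output shown below (pandas 2.0+ returns True/False by default)
df = pd.get_dummies(df, columns=['Gender'], prefix='Gender', dtype=int)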

print(df)


   Age  Contract Type  Monthly Charges  Tenure  Gender_Female  Gender_Male
0   25              0               50       6              0            1
1   30              1               60      12              1            0
2   40              2               80      24              0            1
3   35              0               55       8              1            0
4   28              1               70      18              0            1


In [29]:
import pandas as pd 

# sample data - Customer Churn dataset 

data = {
    "Gender" : ["Male", "Female", "Male", "Female", "Male"], 
    "Age" : [25, 30, 45, 35, 28], 
    "Contract Type" : ["Month-to-Month", "One Year", "Two Year", "Month-to-Month", "One Year"],
    "Monthly Charges" : [50, 60, 80, 55, 70], 
    "Tenure" : [6, 12, 24, 8, 18]
}

df = pd.DataFrame(data)
df

   Gender  Age   Contract Type  Monthly Charges  Tenure
0    Male   25  Month-to-Month               50       6
1  Female   30        One Year               60      12
2    Male   45        Two Year               80      24
3  Female   35  Month-to-Month               55       8
4    Male   28        One Year               70      18


In [30]:
# Step 1: label encoding for "Contract Type"
contract_mapping = {"Month-to-Month": 0, "One Year": 1, "Two Year": 2}
df["Contract Type"] = df["Contract Type"].map(contract_mapping)

# Step 2: one-hot encoding for "Gender"
# dtype=int keeps the 0/1 output shown below (pandas 2.0+ returns True/False by default)
df = pd.get_dummies(df, columns=["Gender"], prefix="Gender", dtype=int)

print(df)

   Age  Contract Type  Monthly Charges  Tenure  Gender_Female  Gender_Male
0   25              0               50       6              0            1
1   30              1               60      12              1            0
2   45              2               80      24              0            1
3   35              0               55       8              1            0
4   28              1               70      18              0            1


- Here, we used label encoding for the "Contract Type" column, where each contract type is assigned a numerical value.
- One-hot encoding is used for the "Gender" column, creating separate binary columns for "Male" and "Female."
- This combination of encoding techniques makes the data suitable for machine learning models while maintaining the meaningful distinctions between categories.
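
The contract-type order can also be stated explicitly instead of being implied by a dictionary. The sketch below is an alternative to the mapping above, using an ordered `pd.Categorical`; the `order` list is an assumption that contracts rank by commitment length.

```python
import pandas as pd

contracts = pd.Series(
    ["Month-to-Month", "One Year", "Two Year", "Month-to-Month", "One Year"]
)

# assumed ranking: shortest commitment first
order = ["Month-to-Month", "One Year", "Two Year"]

# ordered=True records that the categories have a meaningful rank
contract_cat = pd.Categorical(contracts, categories=order, ordered=True)

# codes follow the position of each value in `order`
codes = contract_cat.codes
print(codes.tolist())   # [0, 1, 2, 0, 1]
```

The resulting codes match the hand-written `contract_mapping`, but the order now lives in one list that is easy to inspect or change.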