In [None]:
""" Q1. What is data encoding? How is it useful in data science? """


Ans.

    Data encoding is the process of converting data from one format to another, typically for representation, storage, or transmission. In data science, it is mainly used to turn raw values into a numerical form that analysis and modeling tools can work with.

    Data encoding is useful in data science for several reasons:

    1. Categorical Variable Encoding: Many datasets contain features that take categorical values (e.g., colors, labels, or categories). Most machine learning algorithms expect numerical input, so these features are encoded as numbers first. Common techniques include one-hot encoding, label encoding, ordinal encoding, and binary encoding.

    2. Feature Scaling: Strictly a transformation rather than an encoding, scaling is usually applied in the same preprocessing step to put numerical features on a similar scale. Common techniques are Min-Max scaling (also known as normalization) and Z-score standardization, which keep features with larger ranges from dominating the analysis.

    3. Encoding Time and Dates: Temporal data, such as dates and timestamps, can be broken into components such as day of the week, month, or year, so that time-related patterns and trends can be used in analysis and modeling.
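
The three uses listed above can be sketched with pandas alone. This is a minimal illustration on a hypothetical toy table; the column names `color`, `price`, and `sold_on` are assumptions made for the example and do not come from any dataset in this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "price": [10.0, 20.0, 30.0, 40.0],
    "sold_on": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-02-14", "2024-03-03"]),
})

# 1. Categorical encoding: one binary column per category.
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

# 2. Min-Max scaling: map the smallest value to 0 and the largest to 1.
price_range = df["price"].max() - df["price"].min()
price_scaled = (df["price"] - df["price"].min()) / price_range

# 3. Date encoding: extract calendar components as numeric features.
df["month"] = df["sold_on"].dt.month
df["day_of_week"] = df["sold_on"].dt.dayofweek  # Monday = 0, Sunday = 6

print(list(color_dummies.columns))  # ['color_blue', 'color_green', 'color_red']
print(df["day_of_week"].tolist())   # [0, 5, 2, 6]
```

In practice the scaling parameters would be learned on the training split only and reused on new data.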


In [None]:
""" Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario. """

Ans:
    Nominal encoding is a technique for representing categorical variables numerically when there is no particular order or hierarchy among the categories (for example sex, day, or payment method). Each category gets its own code, and the codes carry no numerical meaning. A common implementation is one-hot encoding, which creates one binary column per category. The example below one-hot encodes the sex, day, smoker, and time columns of the restaurant tips dataset, as one might do before modeling tip amounts.

In [1]:
import seaborn as sns
df = sns.load_dataset("tips")
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
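from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# One-hot encode the four nominal columns (the result is a sparse matrix).
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["sex","day","smoker","time"]])

# Convert to a dense DataFrame with readable column names; keep the first 10 rows.
df1 = pd.DataFrame(encoded.toarray(),columns = encoder.get_feature_names_out())[0:10]
df1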

,sex_Female,sex_Male,day_Fri,day_Sat,day_Sun,day_Thur,smoker_No,smoker_Yes,time_Dinner,time_Lunch
0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
7,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
8,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
9,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [10]:
pd.concat([df,df1],axis = 1)[0:5]

,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,day_Fri,day_Sat,day_Sun,day_Thur,smoker_No,smoker_Yes,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.5,Male,No,Sun,Dinner,3,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [None]:
""" Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example. """

Ans.
Nominal encoding and one-hot encoding are two approaches to representing categorical variables numerically. Here, nominal encoding means assigning each category a single numerical code, while one-hot encoding creates one binary column per category. The choice between them depends on the characteristics of the dataset and the goals of the analysis.

Nominal encoding is preferred over one-hot encoding in the following situation:

High Cardinality: When a categorical variable has a large number of unique categories (high cardinality), one-hot encoding adds one column per category and can greatly increase the dimensionality of the dataset. This raises memory and computation costs, especially if the dataset is large. Nominal encoding keeps the variable in a single column of numerical codes. Because the codes imply an arbitrary order, this works best with models that do not rely on the numerical distance between codes, such as tree-based models.

Practical Example: Consider a dataset for customer churn prediction in a telecommunications company. One of the categorical variables is "Payment Method," with the categories "Credit Card," "Bank Transfer," "Electronic Check," and "Mailed Check." Numerical codes keep this variable in one column instead of four. With only four categories either encoding is workable; the saving matters far more for variables with hundreds of categories.
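
To make the dimensionality argument concrete, here is a minimal sketch comparing the two encodings on a high-cardinality column. The `product_id` column and its 500 distinct values are hypothetical and chosen only for illustration; `pd.factorize` stands in for the numerical-code encoding described above.

```python
import pandas as pd

# Hypothetical high-cardinality column: 500 distinct IDs over 2,000 rows.
product_id = pd.Series([f"P{i % 500:03d}" for i in range(2000)], name="product_id")

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(product_id, prefix="product_id")

# Integer codes keep a single column; the codes carry no meaningful order.
codes, uniques = pd.factorize(product_id)

print(one_hot.shape)  # (2000, 500)
print(codes.shape)    # (2000,)
print(len(uniques))   # 500
```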

In [None]:
""" Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice. """

Ans.
1. For a categorical variable with 5 unique values, either ordinal encoding or one-hot encoding can produce a format suitable for machine learning algorithms. The right choice depends on whether the categories are ordered.
2. If the categorical variable has a natural order or ranking, ordinal encoding preserves that order in a single column. If it has no natural order or ranking, one-hot encoding is the safer choice.
3. In general, one-hot encoding is preferred for unordered data because it does not assume any ordinal relationship between the categories. With 5 unique values it adds only 5 binary columns, so the curse of dimensionality is not a concern here.
4. One-hot encoding becomes costly only when the number of unique values is very large. In that case a more compact encoding such as integer codes may be used, keeping in mind that integer codes impose an artificial order on unordered categories.
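
The two cases can be sketched as follows. The `city` and `rating` columns are hypothetical examples of an unordered and an ordered variable with 5 unique values each; the rank order used for `rating` is an assumption stated in the code.

```python
import pandas as pd

# Hypothetical columns, each with 5 unique values.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai", "Chennai", "Kolkata", "Delhi"],
    "rating": ["poor", "fair", "good", "very good", "excellent", "good"],
})

# No natural order: one-hot encode, giving 5 binary columns.
city_one_hot = pd.get_dummies(df["city"], prefix="city", dtype=int)

# Natural order: map categories to ranks 0..4 in an explicitly stated order.
rating_order = ["poor", "fair", "good", "very good", "excellent"]
rating_codes = pd.Categorical(df["rating"], categories=rating_order, ordered=True).codes

print(city_one_hot.shape)     # (6, 5)
print(rating_codes.tolist())  # [0, 1, 2, 3, 4, 2]
```

Stating the category order explicitly matters: without it the codes would follow alphabetical order rather than the intended ranking.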

In [None]:
""" Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations. """

Ans:
    
    The number of new columns depends on the number of unique categories in each categorical column, which the question does not specify. The 1000 rows do not affect the count.

    Let’s assume that the first categorical column has 12 unique values and the second categorical column has 5 unique values. One-hot encoding, the usual way to encode nominal data, would then create 12 + 5 = 17 new columns. If the two original categorical columns are dropped afterwards, the dataset has 3 + 17 = 20 columns in total.

    In general, if the first categorical column has n unique values and the second categorical column has m unique values, one-hot encoding creates n + m new columns. If nominal encoding is instead done with integer codes, each categorical column is replaced in place and no new columns are created.
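
The calculation can be checked on a synthetic table. This sketch assumes the same cardinalities as above (12 and 5); the column names `num_1` to `num_3`, `cat_a`, and `cat_b` are hypothetical.

```python
import numpy as np
import pandas as pd

n_rows = 1000
idx = np.arange(n_rows)

# Hypothetical dataset: 3 numerical columns and 2 categorical columns
# with 12 and 5 unique values respectively.
df = pd.DataFrame({
    "num_1": idx * 0.5,
    "num_2": idx % 7,
    "num_3": idx % 3,
    "cat_a": [f"a{i % 12}" for i in idx],
    "cat_b": [f"b{i % 5}" for i in idx],
})

n = df["cat_a"].nunique()  # 12
m = df["cat_b"].nunique()  # 5

# get_dummies replaces each listed column with its indicator columns.
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"])

print(n + m)          # 17 new indicator columns
print(encoded.shape)  # (1000, 20): 3 numerical + 17 indicator columns
```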

In [None]:
""" Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer. """

Ans:
    For a dataset describing animals by species, habitat, and diet, the choice of encoding technique depends on the characteristics of the categorical variables and the requirements of the machine learning algorithms being used. One-hot encoding is the most suitable choice here, for the following reasons:

1. Categorical Variables: Species, habitat, and diet are all categorical. One-hot encoding is well suited to categorical variables that have no inherent order or hierarchy among their categories. It represents each category as a separate binary column indicating the presence or absence of that category for each data point.

2. No Numerical Relationships: There is no inherent numerical relationship or ordering between the categories of species, habitat, or diet. One-hot encoding ensures that the encoded representation does not introduce unintended numerical relationships among them, as integer codes would.

3. Machine Learning Algorithms: Many machine learning algorithms, such as decision trees, work directly with one-hot encoded features. They can handle binary inputs and capture the relationships between the categorical variables and the target variable.

One-hot encoding therefore transforms species, habitat, and diet into a format suitable for machine learning algorithms. The resulting features can be used as inputs for tasks such as classification, clustering, or regression on the animal dataset. If species has a very large number of distinct values, the resulting column count should be checked before committing to this approach.
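
A minimal sketch of this approach with scikit-learn's `OneHotEncoder` follows. The animal records are hypothetical, and `handle_unknown="ignore"` is an added design choice (not stated in the answer above) so that a species not seen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical animal records; the values are illustrative only.
train = pd.DataFrame({
    "species": ["lion", "shark", "eagle", "frog"],
    "habitat": ["savanna", "ocean", "mountain", "wetland"],
    "diet": ["carnivore", "carnivore", "carnivore", "insectivore"],
})

# A new record whose species was not seen during fitting.
new = pd.DataFrame({"species": ["wolf"], "habitat": ["ocean"], "diet": ["carnivore"]})

# Unknown categories are encoded as all zeros rather than raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
new_encoded = encoder.transform(new).toarray()

print(train_encoded.shape)  # (4, 10): 4 species + 4 habitat + 2 diet columns
print(new_encoded.sum())    # 2.0: habitat and diet matched, species did not
```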

In [None]:
""" Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding. """

In [6]:
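import numpy as np
import pandas as pd

np.random.seed(501)  # fixed seed so the synthetic data is reproducible
n = 1000             # number of customers

# Synthetic features for the churn example
gender = np.random.choice(['Male','Female'],size = n)
age = np.random.randint(low = 25,high = 50,size = n)      # ages 25 to 49 (high is exclusive)
contract = np.random.choice(['Monthly','quaterly','half_yearly','Yearly'],size = n)
Monthly_charges = np.random.normal(loc = 1000,scale = 100,size = n)
tenure = np.random.randint(low = 10,high = 15,size = n)   # tenure 10 to 14

# Target label: 0 with probability 0.1, 1 with probability 0.9
churn = np.random.choice([0,1] , p = [0.1,0.9] ,size = n)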

In [9]:
dic = {
        'gender' : gender,
        'age' : age,
        'contract' : contract,
        'monthly_charges' : Monthly_charges,
        'tenure' : tenure,
        'churn' : churn
        }
df = pd.DataFrame(dic)
df.head(10)

,gender,age,contract,monthly_charges,tenure,churn
0,Female,33,Monthly,1035.214284,10,1
1,Female,42,Yearly,1043.907428,13,1
2,Female,34,Yearly,830.449614,13,1
3,Female,27,half_yearly,1002.337786,10,1
4,Male,28,quaterly,1060.907822,10,1
5,Female,36,Yearly,847.574953,14,0
6,Male,36,Monthly,1142.517164,13,1
7,Female,29,half_yearly,1086.298181,10,1
8,Female,42,half_yearly,1053.332184,12,1
9,Male,25,Yearly,854.030685,13,1


In [17]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
encode = ohe.fit_transform(df[['gender']]).toarray()
ohe_encode = pd.DataFrame(encode,columns = ohe.get_feature_names_out())
ohe_encode

,gender_Female,gender_Male
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,0.0,1.0
...,...,...
995,0.0,1.0
996,0.0,1.0
997,1.0,0.0
998,1.0,0.0


In [21]:
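from sklearn.preprocessing import OrdinalEncoder

# Contract lengths have a natural order, so list the categories from shortest to longest.
oe = OrdinalEncoder(categories = [['Monthly','quaterly','half_yearly','Yearly']])
contract_codes = oe.fit_transform(df[["contract"]])   # Monthly -> 0.0, ..., Yearly -> 3.0
oe_encode = pd.DataFrame(contract_codes,columns = oe.get_feature_names_out())
oe_encode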

,contract
0,0.0
1,3.0
2,3.0
3,2.0
4,1.0
...,...
995,2.0
996,0.0
997,1.0
998,0.0


In [23]:
numeric = df[["age","monthly_charges","tenure"]]
numeric

,age,monthly_charges,tenure
0,33,1035.214284,10
1,42,1043.907428,13
2,34,830.449614,13
3,27,1002.337786,10
4,28,1060.907822,10
...,...,...,...
995,28,952.821949,14
996,43,1127.720683,11
997,33,893.592186,13
998,29,968.635549,13


In [26]:
churn = df["churn"]

In [27]:
df1 = pd.concat([ohe_encode,numeric,oe_encode,churn],axis = 1)
df1

,gender_Female,gender_Male,age,monthly_charges,tenure,contract,churn
0,1.0,0.0,33,1035.214284,10,0.0,1
1,1.0,0.0,42,1043.907428,13,3.0,1
2,1.0,0.0,34,830.449614,13,3.0,1
3,1.0,0.0,27,1002.337786,10,2.0,1
4,0.0,1.0,28,1060.907822,10,1.0,1
...,...,...,...,...,...,...,...
995,0.0,1.0,28,952.821949,14,2.0,1
996,0.0,1.0,43,1127.720683,11,0.0,1
997,1.0,0.0,33,893.592186,13,1.0,1
998,1.0,0.0,29,968.635549,13,0.0,1
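
The same steps (one-hot encode gender, ordinal-encode contract, pass the numerical columns through) can also be combined in a single scikit-learn `ColumnTransformer`. This is one possible sketch, not the only way to do it; the `sample` frame is a small hypothetical stand-in with the same column names, and the contract labels keep the spelling used in the cells above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Small hypothetical sample with the same columns as the dataset above.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [33, 42, 34, 27],
    "contract": ["Monthly", "Yearly", "half_yearly", "quaterly"],
    "monthly_charges": [1035.2, 1043.9, 830.4, 1002.3],
    "tenure": [10, 13, 13, 10],
})

# Explicit order for the contract categories, shortest to longest.
contract_order = [["Monthly", "quaterly", "half_yearly", "Yearly"]]

preprocess = ColumnTransformer(
    transformers=[
        ("gender", OneHotEncoder(), ["gender"]),
        ("contract", OrdinalEncoder(categories=contract_order), ["contract"]),
    ],
    remainder="passthrough",  # keep age, monthly_charges, tenure unchanged
    sparse_threshold=0,       # always return a dense array
)

# Output column order: gender_Female, gender_Male, contract, then the remainder.
features = preprocess.fit_transform(sample)

print(features.shape)           # (4, 6)
print(features[:, 2].tolist())  # [0.0, 3.0, 2.0, 1.0]
```

Fitting the transformer once and reusing it on new data keeps the category-to-column mapping consistent between training and prediction.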
