In [None]:
Q1. What is data encoding? How is it useful in data science?

In [None]:
Data encoding is a crucial step in data science in which categorical data is converted into
numerical representations that machine learning algorithms can work with.
Machine learning algorithms rely on mathematical computations and statistical operations,
which can't be performed on category labels directly.
Encoding therefore transforms data from a human-readable format to a machine-readable format.

Decoding matters as well, because it converts data from a machine-readable format back
to a human-readable format. This is crucial for understanding the results of data analysis
and machine learning models, and for communicating these results to stakeholders.
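
The round trip between encoding and decoding can be sketched as follows. This is a minimal
illustration, assuming scikit-learn's LabelEncoder and a hypothetical list of colour labels
that is not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample data, used only for illustration
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
# Encoding: classes are sorted, so blue=0, green=1, red=2
codes = encoder.fit_transform(colors)
# Decoding: map the integers back to the original labels
decoded = encoder.inverse_transform(codes)

print(codes.tolist())    # [2, 1, 0, 1]
print(decoded.tolist())  # ['red', 'green', 'blue', 'green']
```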

In [None]:
Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

In [None]:
Nominal encoding is a process that converts categorical data into numerical data for
use in machine learning models. It's used for nominal data, which is data that has
distinct categories but no inherent order. For example, "city" is a nominal variable
because cities have no natural rank or order.

One common method of nominal encoding is one-hot encoding. In this method, each category
is converted into a new column, with a 1 in the column that matches the category and a 0
in the other columns. For example, if an "ethnicity" column has values "Asian", "Indonesian",
and "Japanese", one-hot encoding would create three new columns: "ethnicity_Asian",
"ethnicity_Indonesian", and "ethnicity_Japanese". A person with Asian ethnicity would have
a 1 in ethnicity_Asian and a 0 in the other two columns, and likewise for the other categories.

In [3]:
# Example: one-hot encode the 'ethnicity' column

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'ethnicity':['Asian','Indonesian','Japanese','Japanese','Asian','Indonesian']})
encoder = OneHotEncoder()
# fit_transform returns a sparse matrix with one column per category
encoded = encoder.fit_transform(df[['ethnicity']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
# Show the original column next to its encoded columns
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,ethnicity,ethnicity_Asian,ethnicity_Indonesian,ethnicity_Japanese
0,Asian,1.0,0.0,0.0
1,Indonesian,0.0,1.0,0.0
2,Japanese,0.0,0.0,1.0
3,Japanese,0.0,0.0,1.0
4,Asian,1.0,0.0,0.0
5,Indonesian,0.0,1.0,0.0


In [None]:
Q3. In what situations is nominal encoding preferred over one-hot encoding? 
Provide a practical example.

In [None]:
The two techniques differ in how they represent categories. Label encoding assigns
each unique category a unique integer such as 1, 2, 3, 4, 5, based on alphabetical
or numerical ordering, so the data stays in a single column.

One-hot encoding, on the other hand, creates a binary column for each category,
indicating the presence (1) or absence (0) of that category. This increases the
dimensionality significantly, because a feature with n categories produces n new
columns.

Label encoding is preferred when the extra dimensions would make the model too
complex, whereas one-hot encoding is preferred when the model expects binary inputs.
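
The difference in dimensionality can be checked directly. The sketch below is an
illustration only: it reuses the ethnicity values from the example that follows and
compares the shape of the label-encoded output with the shape of the one-hot output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'ethnicity': ['American', 'Japanese', 'Asian',
                                 'Asian', 'Japanese', 'Indonesian']})

# Label encoding: a single array of integers (sorted classes get 0, 1, 2, 3)
label_codes = LabelEncoder().fit_transform(df['ethnicity'])

# One-hot encoding: one binary column per unique category
one_hot = OneHotEncoder().fit_transform(df[['ethnicity']]).toarray()

print(label_codes.tolist())  # [0, 3, 1, 1, 3, 2]
print(label_codes.shape)     # (6,)
print(one_hot.shape)         # (6, 4)
```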

In [5]:
# Example: one-hot encoding adds one column per unique ethnicity

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'name':['Jack','Yongchu','Khushi','Sunil','Sumo','Klin'],
                   'ethnicity':['American','Japanese','Asian','Asian','Japanese','Indonesian']})
encoder = OneHotEncoder()
# Encode only the categorical column; 'name' is left untouched
encoded = encoder.fit_transform(df[['ethnicity']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,name,ethnicity,ethnicity_American,ethnicity_Asian,ethnicity_Indonesian,ethnicity_Japanese
0,Jack,American,1.0,0.0,0.0,0.0
1,Yongchu,Japanese,0.0,0.0,0.0,1.0
2,Khushi,Asian,0.0,1.0,0.0,0.0
3,Sunil,Asian,0.0,1.0,0.0,0.0
4,Sumo,Japanese,0.0,0.0,0.0,1.0
5,Klin,Indonesian,0.0,0.0,1.0,0.0


In [None]:
Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding
technique would you use to transform this data into a format suitable for machine learning algorithms?
Explain why you made this choice.

In [None]:
If the dataset contains categorical data with 5 unique values and there is no inherent
order or ranking among the categories, we would prefer to use one-hot encoding to
transform the data into a format suitable for machine learning algorithms.

One-hot encoding is the most appropriate technique for nominal data because it 
converts each category into a binary vector representation, effectively removing
any numerical relationship between the categories.

It ensures that no ordinality is imposed among the categories. Each category is 
represented as a binary vector with a single '1' and the rest '0's, which prevents 
the model from interpreting any magnitude or rank relationship among the categories.

It provides better interpretability for the model's predictions. The encoded binary
vectors directly represent the presence or absence of each category, making it easier
to understand the impact of each category on the model's output.

It creates a sparse representation of the data, which is efficient in terms of
memory usage and computation when stored as a sparse matrix. Only one element in each
binary vector is '1', so only the non-zero entries need to be stored.

Many machine learning algorithms, such as logistic regression, decision trees, and
support vector machines, are designed to work with numerical inputs. One-hot encoding
converts categorical data into numerical form, making it compatible with a wide range
of machine learning algorithms.
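
The sparsity claim can be illustrated with a small sketch. It assumes a hypothetical
'Color' column with 5 unique values (not part of the original dataset) and relies on
OneHotEncoder returning a SciPy sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal column with 5 unique values across 7 rows
df = pd.DataFrame({'Color': ['red', 'green', 'blue', 'black', 'white', 'red', 'blue']})

# By default the encoder returns a sparse matrix rather than a dense array
encoded = OneHotEncoder().fit_transform(df[['Color']])

print(encoded.shape)  # (7, 5): one column per unique value
print(encoded.nnz)    # 7: exactly one stored non-zero entry per row
```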

In [7]:
# Example: one-hot encoding a categorical column. 'Country' has 3 unique values here;
# a column with 5 unique values is handled the same way and yields 5 new columns.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Name':['Jack','Yami','Khushi','Sunil','Sofia'],
                   'Country':['USA','Japan','India','India','USA']})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Country']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,Name,Country,Country_India,Country_Japan,Country_USA
0,Jack,USA,0.0,0.0,1.0
1,Yami,Japan,0.0,1.0,0.0
2,Khushi,India,1.0,0.0,0.0
3,Sunil,India,1.0,0.0,0.0
4,Sofia,USA,0.0,0.0,1.0


In [None]:
Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns
are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to
transform the categorical data, how many new columns would be created? Show your calculations.

In [None]:
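When using nominal (one-hot) encoding to transform categorical data, the number of new
columns created is equal to the total number of unique categories across the original
categorical columns. Each unique category is represented as a binary vector, where one
column is created for each category. The number of rows (1000) does not affect the count.

The question does not state how many unique categories each column has, so suppose
categorical Column 1 has 4 unique categories and categorical Column 2 has 5 unique
categories.

Total New Columns = Unique Categories in Categorical Column 1 + Unique Categories in Categorical Column 2
= 4 + 5 = 9

If the 2 original categorical columns are then dropped, the encoded dataset has
3 numerical columns + 9 new columns = 12 columns.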
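
The calculation above can be verified with a short sketch. The column names and values
below are hypothetical placeholders chosen to match the assumed 4 and 5 unique
categories; pandas' get_dummies is used here as one convenient way to one-hot encode.

```python
import pandas as pd

# Hypothetical dataset: 1000 rows, 2 categorical and 3 numerical columns
df = pd.DataFrame({
    'cat_1': ['a', 'b', 'c', 'd'] * 250,        # 4 unique categories
    'cat_2': ['p', 'q', 'r', 's', 't'] * 200,   # 5 unique categories
    'num_1': range(1000),
    'num_2': range(1000),
    'num_3': range(1000),
})

categorical = ['cat_1', 'cat_2']
# New columns = sum of unique categories over the categorical columns
new_columns = sum(df[col].nunique() for col in categorical)

# get_dummies replaces the categorical columns with their one-hot columns
encoded = pd.get_dummies(df, columns=categorical)

print(new_columns)    # 9
print(encoded.shape)  # (1000, 12): 3 numerical + 9 one-hot columns
```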

In [None]:
Q6. You are working with a dataset containing information about different types of animals, including their
species, habitat, and diet. Which encoding technique would you use to transform the categorical data into
a format suitable for machine learning algorithms? Justify your answer.

In [None]:
The dataset contains categorical information about different types of animals,
including their species, habitat, and diet. Since there is no inherent order or
ranking among these categories, we would prefer to use one-hot encoding to
transform the data into a format suitable for machine learning algorithms.

One-hot encoding ensures that no ordinality is imposed among the categories. 
Each category is represented as a binary vector with a single '1' and the rest
'0's, which prevents the model from interpreting any magnitude or rank relationship
among the categories.

In [None]:
Q7. You are working on a project that involves predicting customer churn for a telecommunications
company. You have a dataset with 5 features, including the customer's gender, age, contract type,
monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical
data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

In [None]:
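Of the 5 features, gender and contract type are categorical, while age, monthly
charges, and tenure are already numerical and need no encoding.

For contract type we would prefer one-hot encoding. It ensures that no ordinality
is imposed among the categories: each category is represented as a binary vector
with a single '1' and the rest '0's, which prevents the model from interpreting any
magnitude or rank relationship among the categories.

For gender, which has only two values, label encoding is sufficient. The steps are:
first fit a OneHotEncoder on the contract type column, then convert the result into
a DataFrame and concatenate it with the original data, and finally fit a LabelEncoder
on the gender column and store the result as a new column.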

In [8]:
# Example: one-hot encode 'Contract type' and label encode 'Gender'

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Gender':['Male','Female','Female','Male','Female'],
                   'Age': [28,27,28,29,30],
                   'Contract type': ['month','one year','two years','one year','month'],
                   'Monthly charges': [45000,20000,25000,48000,40000],
                   'Tenure': [5,1,2,5.5,4]})
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['Contract type']])
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())
df = pd.concat([df, encoded_df], axis=1)

# LabelEncoder expects 1-D input, so pass the Series rather than a one-column DataFrame
encoder1 = LabelEncoder()
df['gender_label'] = encoder1.fit_transform(df['Gender'])
df


Unnamed: 0,Gender,Age,Contract type,Monthly charges,Tenure,Contract type_month,Contract type_one year,Contract type_two years,gender_label
0,Male,28,month,45000,5.0,1.0,0.0,0.0,1
1,Female,27,one year,20000,1.0,0.0,1.0,0.0,0
2,Female,28,two years,25000,2.0,0.0,0.0,1.0,0
3,Male,29,one year,48000,5.5,0.0,1.0,0.0,1
4,Female,30,month,40000,4.0,1.0,0.0,0.0,0


In [None]:
We have used one-hot encoding for the feature Contract type, so each value is represented
as a binary vector with a single '1' and the rest '0's.
We have used label encoding for the feature Gender, as it is a binary categorical variable
with only two possible values (Male and Female). LabelEncoder sorts the categories, so it
assigns 0 to Female and 1 to Male.
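
As an alternative sketch of the same steps, both encodings can be applied in one pass
with scikit-learn's ColumnTransformer. This is not the approach used in the cell above;
the transformer names 'contract' and 'gender' are arbitrary labels, and
OneHotEncoder(drop='if_binary') is swapped in for LabelEncoder so that Gender still
becomes a single 0/1 column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
                   'Age': [28, 27, 28, 29, 30],
                   'Contract type': ['month', 'one year', 'two years', 'one year', 'month'],
                   'Monthly charges': [45000, 20000, 25000, 48000, 40000],
                   'Tenure': [5, 1, 2, 5.5, 4]})

preprocess = ColumnTransformer(
    transformers=[
        # One binary column per contract type
        ('contract', OneHotEncoder(), ['Contract type']),
        # A two-category feature is reduced to a single 0/1 column
        ('gender', OneHotEncoder(drop='if_binary'), ['Gender']),
    ],
    remainder='passthrough',  # keep Age, Monthly charges, Tenure unchanged
    sparse_threshold=0,       # always return a dense array
)

X = preprocess.fit_transform(df)
print(X.shape)  # (5, 7): 3 contract + 1 gender + 3 numerical columns
```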