## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
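
The mapping above can be reproduced by hand. The following is a minimal sketch rather than a library implementation; the names `CATEGORIES`, `POSITION` and `one_hot` are assumptions made for illustration, and the category order is fixed to match the list above.

```python
# Minimal sketch of one-hot encoding by hand.
# CATEGORIES, POSITION and one_hot are illustrative names, not part of any library.
CATEGORIES = ["red", "green", "blue"]
POSITION = {category: i for i, category in enumerate(CATEGORIES)}


def one_hot(value):
    """Return a list with a 1 at the category's position and 0 elsewhere."""
    vector = [0] * len(CATEGORIES)
    vector[POSITION[value]] = 1
    return vector


encoded_colors = {color: one_hot(color) for color in CATEGORIES}
print(encoded_colors)
```

scikit-learn's `OneHotEncoder` sorts the categories, so its columns come out in the order blue, green, red instead.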

In [1]:
# Disadvantages of OHE:
# 1. Avoid it when a feature has many categories, since it adds one new column per category.
# 2. The result is a sparse matrix (mostly zeros); with that many extra features the model can fit the training data too closely, which leads to overfitting.
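
To make the two disadvantages concrete, here is a small sketch that one-hot encodes a high-cardinality column and measures how sparse the result is. The column `user_id` and its 200 made-up values are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 200 made-up IDs, each appearing 5 times.
user_ids = [f"user_{i}" for i in range(200)]
demo_df = pd.DataFrame({"user_id": user_ids * 5})

# By default fit_transform returns a SciPy sparse matrix.
wide = OneHotEncoder().fit_transform(demo_df[["user_id"]])

n_rows, n_cols = wide.shape
density = wide.nnz / (n_rows * n_cols)  # fraction of cells that are non-zero
print(n_rows, n_cols, density)
```

One input column becomes 200 columns, and each row holds a single 1 among them.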

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [60]:
## create a simple df
df = pd.DataFrame({
    'colour':['red','blue','green','green','red','blue']
})

In [61]:
df.head()

Unnamed: 0,colour
0,red
1,blue
2,green
3,green
4,red


In [6]:
## create an instance of onehotencoder

encoder=OneHotEncoder()

In [10]:
## Perform fit and transform
encoded=encoder.fit_transform(df[['colour']]).toarray()

In [11]:
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [13]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
encoder_df

Unnamed: 0,colour_blue,colour_green,colour_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [16]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,colour,colour_blue,colour_green,colour_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


Additional Assignment: Tips Dataset

In [24]:
import seaborn as sns

tips_df=sns.load_dataset('tips')

In [25]:
tips_df['time'].unique()

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']

In [31]:
from sklearn.preprocessing import OneHotEncoder

encode=OneHotEncoder()

time_encoded=encode.fit_transform(tips_df[['time']]).toarray()

encoded_categories_df=pd.DataFrame(time_encoded,columns=encode.get_feature_names_out())

In [32]:
encoded_categories_df

Unnamed: 0,time_Dinner,time_Lunch
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0
...,...,...
239,1.0,0.0
240,1.0,0.0
241,1.0,0.0
242,1.0,0.0


In [35]:
concat_df=pd.concat([tips_df['time'],encoded_categories_df],axis=1)

In [40]:
concat_df.head(200)

Unnamed: 0,time,time_Dinner,time_Lunch
0,Dinner,1.0,0.0
1,Dinner,1.0,0.0
2,Dinner,1.0,0.0
3,Dinner,1.0,0.0
4,Dinner,1.0,0.0
...,...,...,...
195,Lunch,0.0,1.0
196,Lunch,0.0,1.0
197,Lunch,0.0,1.0
198,Lunch,0.0,1.0



### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or by category frequency; scikit-learn's `LabelEncoder`, for example, sorts the categories and numbers them from 0. For a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

1. Red: 1
2. Green: 2
3. Blue: 3
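
The two assignment strategies mentioned above (alphabetical and frequency-based) can be sketched in plain Python. The sample list `colors` is hypothetical, and the labels start at 0 here; the starting number is arbitrary.

```python
from collections import Counter

# Hypothetical sample in which the categories occur with different frequencies.
colors = ["red", "blue", "green", "green", "red", "green"]

# Alphabetical assignment: sort the unique categories and number them from 0.
alphabetical = {c: i for i, c in enumerate(sorted(set(colors)))}

# Frequency-based assignment: the most frequent category gets 0.
by_frequency = {c: i for i, (c, _) in enumerate(Counter(colors).most_common())}

encoded_alphabetical = [alphabetical[c] for c in colors]
encoded_by_frequency = [by_frequency[c] for c in colors]
print(alphabetical, encoded_alphabetical)
print(by_frequency, encoded_by_frequency)
```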

In [None]:
## Disadvantage of label encoding: it assigns integers such as 0, 1, 2 to the categories.
## A model may then treat the category labelled 2 as greater than the categories labelled 1 or 0, even though the categories have no natural order.

In [43]:
df.head()

Unnamed: 0,colour
0,red
1,blue
2,green
3,green
4,red


In [44]:
from sklearn.preprocessing import LabelEncoder

In [45]:
label_encoder=LabelEncoder()

In [50]:
## LabelEncoder expects a 1-D array, so pass the column as a Series rather than a one-column DataFrame
encoded_labels=label_encoder.fit_transform(df['colour'])

In [113]:
encoded_labels

array([2, 0, 1, 1, 2, 0])

In [None]:
label_mapping=dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))) ## dictionary mapping each category to its label

In [55]:
print(label_mapping)

{'blue': np.int64(0), 'green': np.int64(1), 'red': np.int64(2)}


In [47]:
label_encoder.transform(['red'])

array([2])

In [48]:
label_encoder.transform(['blue'])

array([0])

In [49]:
label_encoder.transform(['green'])

array([1])

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
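
The ranking above can be applied with a plain dictionary and `Series.map`. This is a minimal sketch; the names `levels`, `rank` and the sample series `education` are assumptions for illustration.

```python
import pandas as pd

# Categories listed from lowest to highest rank, as in the list above.
levels = ["high school", "college", "graduate", "post-graduate"]
rank = {level: position for position, level in enumerate(levels, start=1)}

# Hypothetical sample column.
education = pd.Series(["college", "high school", "post-graduate", "graduate"])
education_encoded = education.map(rank)
print(education_encoded.tolist())
```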

In [56]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

In [65]:
df1= pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})

df1

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [66]:
## Assigning ranks
# 1. Create an instance of OrdinalEncoder
# 2. Call fit_transform

In [68]:
# 1. Create an OrdinalEncoder with the categories listed from lowest to highest rank
oe=OrdinalEncoder(categories=[['small','medium','large']])
# 2. Fit and transform the column
oe.fit_transform(df1[['size']])


array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [69]:
oe.transform([['small']])



array([[0.]])
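
The explicit `categories` list matters because, without it, `OrdinalEncoder` sorts the categories and here that means alphabetical order. The sketch below compares the two on the same data; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small", "large"]})

# Default: categories are sorted, so large=0, medium=1, small=2.
default_codes = OrdinalEncoder().fit_transform(sizes[["size"]]).ravel().tolist()

# Explicit order: small=0, medium=1, large=2.
ranked_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ranked_codes = ranked_encoder.fit_transform(sizes[["size"]]).ravel().tolist()

print(default_codes)
print(ranked_codes)
```

With the default ordering, "small" receives the largest code, which reverses the intended ranking.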

### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in our machine learning model.

In target guided ordinal encoding, each category is replaced with a numerical value derived from the mean or median of the target variable for that category. The encoded values then increase with the category's average target, a monotonic relationship that can improve the predictive power of the model.

In [70]:
# create a sample dataframe with a categorical variable and a target variable
df2 = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [71]:
df2

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [72]:
df2.groupby('city')['price'].mean()

city
London      150.0
New York    190.0
Paris       310.0
Tokyo       250.0
Name: price, dtype: float64

In [98]:
mean_price=df2.groupby('city')['price'].mean().to_dict()

In [99]:
df2['city_encoded']=df2['city'].map(mean_price)

In [100]:
df2['city_encoded']

0    190.0
1    150.0
2    310.0
3    250.0
4    190.0
5    310.0
Name: city_encoded, dtype: float64

In [102]:
df2[['price','city_encoded']]

Unnamed: 0,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0
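
The cells above map each city to its mean price directly. A rank-based variant, sketched here under the assumption that only the ordering of the means should be kept, sorts the categories by mean target and assigns 1, 2, 3, and so on. The frame `data` repeats the sample values from above so that the block is self-contained; `city_rank` is an illustrative name.

```python
import pandas as pd

# Same sample values as the city/price frame above.
data = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean price per city, sorted from lowest to highest.
city_means = data.groupby("city")["price"].mean().sort_values()

# Rank 1 goes to the city with the lowest mean price.
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}
data["city_rank"] = data["city"].map(city_rank)
print(city_rank)
print(data["city_rank"].tolist())
```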


In [105]:
## Target guided ordinal encoding on the tips dataset

import seaborn as sns

df5=sns.load_dataset('tips')
df5

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [106]:
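# Group by 'time' and take the mean total bill per category.
# 'time' is a categorical column, so observed=False is passed explicitly to keep the current behaviour and avoid the pandas FutureWarning.
mean_price_groupby_time=df5.groupby('time', observed=False)['total_bill'].mean().to_dict()

mean_price_groupby_time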

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [108]:
df5['time_encoded']=df5['time'].map(mean_price_groupby_time)

In [112]:
df5[['total_bill','time_encoded']]

Unnamed: 0,total_bill,time_encoded
0,16.99,20.797159
1,10.34,20.797159
2,21.01,20.797159
3,23.68,20.797159
4,24.59,20.797159
...,...,...
239,29.03,20.797159
240,27.18,20.797159
241,22.67,20.797159
242,17.82,20.797159


In [22]:
import pandas as pd
df=pd.DataFrame({'student_id':[1,2,3,4,5],
                 'age':[20,30,40,50,60],
                 'no':[1,2,3,4,4]})

In [4]:
student_data=[[1,2],[2,11],[1,3]]

In [7]:
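## Inspect the raw list under its own name; rebinding df to it would replace the DataFrame used in the cells below
student_data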

[[1, 2], [2, 11], [1, 3]]

In [15]:
## Check whether any row has student_id equal to 1 (a comparison uses ==, not =)
(df['student_id']==1).any()

In [17]:
df.loc[df['student_id']==1,['age']]

Unnamed: 0,age
0,20


In [23]:
df

Unnamed: 0,student_id,age,no
0,1,20,1
1,2,30,2
2,3,40,3
3,4,50,4
4,5,60,4


In [27]:
## The columns are 'student_id', 'age' and 'no'; there is no 'id' column, so df['id'] raises KeyError: 'id'
df['student_id']