<a href="https://colab.research.google.com/github/s-ourabh001/Machine-Learning/blob/main/lecture_04/label_encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Label encoding

Label Encoding is a technique in machine learning used to convert categorical data (non-numerical labels) into numeric format so that algorithms can work with them.

🔷 What is Label Encoding?
Label Encoding assigns a unique integer to each unique category in a column.

Example:
Suppose we have a column "Color":

Color
Red
Blue
Green
Red

Using label encoding:

Color	Encoded
Red	2
Blue	0
Green	1
Red	2

🔶 How Label Encoding Works:
1. Find all unique categories in the feature.

2. Assign an integer value to each category.

3. Replace the categories with their corresponding integers.

In Python, this can be done using LabelEncoder from sklearn.preprocessing:

```python
from sklearn.preprocessing import LabelEncoder

data = ['Red', 'Blue', 'Green', 'Red']
encoder = LabelEncoder()
encoded = encoder.fit_transform(data)

print(encoded)  # Output: [2 0 1 2]
```
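The three steps can also be reproduced by hand in plain Python. This is only a sketch of the idea, not scikit-learn's actual implementation; it assumes the categories are sorted before numbering, which is consistent with the [2 0 1 2] result shown for the Color example. The names categories and mapping are illustrative.

```python
# Manual label encoding, mirroring the three steps described earlier.
data = ['Red', 'Blue', 'Green', 'Red']

# Step 1: find the unique categories (sorted so the numbering is deterministic).
categories = sorted(set(data))

# Step 2: assign an integer to each category.
mapping = {category: index for index, category in enumerate(categories)}

# Step 3: replace each category with its integer.
encoded = [mapping[value] for value in data]

print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}
print(encoded)  # [2, 0, 1, 2]
```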
⚠️ When to Use Label Encoding?
Use label encoding when:

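The categorical variable has an ordinal relationship (e.g., "Low", "Medium", "High"). LabelEncoder numbers labels in sorted (alphabetical) order, so the intended order has to be mapped explicitly.

There are few categories, and the model does not treat the integer codes as magnitudes (tree-based models, for example, are typically less sensitive to an arbitrary numbering than linear or distance-based models).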
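A short sketch of why ordinal data needs care: LabelEncoder sorts the labels alphabetically, which does not match Low < Medium < High. The explicit dictionary below is one simple way to keep the intended order; the variable names (sizes, order) are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(['Low', 'High', 'Medium', 'Low'])

# LabelEncoder numbers labels alphabetically: High=0, Low=1, Medium=2.
alphabetical = LabelEncoder().fit_transform(sizes).tolist()

# An explicit mapping keeps the intended order Low < Medium < High.
order = {'Low': 0, 'Medium': 1, 'High': 2}
ordinal = sizes.map(order).tolist()

print(alphabetical)  # [1, 0, 2, 1]
print(ordinal)       # [0, 2, 1, 0]
```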

⚠️ Caution:
For non-ordinal categorical data, the integer codes impose an artificial order (e.g., Blue < Green < Red) that many algorithms, such as linear or distance-based models, will treat as meaningful. In such cases, consider using One-Hot Encoding instead.

🔁 Alternative: One-Hot Encoding
Instead of assigning a single integer, one-hot encoding creates one binary (0/1) column per category, so no order is implied. It is usually the better choice for non-ordinal data.
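As a minimal sketch, one-hot encoding of the Color example can be done with pandas.get_dummies. The DataFrame name df_colors is illustrative, and dtype=int is passed only so the columns show 0/1 rather than True/False.

```python
import pandas as pd

df_colors = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# One binary column per category; exactly one column is 1 in each row.
one_hot = pd.get_dummies(df_colors, columns=['Color'], dtype=int)

print(one_hot)
```

Each row has a 1 in exactly one of the columns Color_Blue, Color_Green and Color_Red.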



In [5]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import sklearn.datasets

In [13]:
df=pd.read_csv('data (1).csv')
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


In [16]:
#finding the count of different labels of a column
df['diagnosis'].value_counts()

diagnosis,count
B,357
M,212


In [18]:
# create a LabelEncoder and replace the diagnosis labels with integers
encoder = LabelEncoder()
df['diagnosis'] = encoder.fit_transform(df['diagnosis'])

df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,perimeter_se,area_se,smoothness_se,compactness_se,concavity_se,concave points_se,symmetry_se,fractal_dimension_se,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,


LabelEncoder assigns integers in sorted label order, so:

0 = B (benign)

1 = M (malignant)
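The mapping does not have to be read off by eye. A fitted LabelEncoder stores the original labels in its classes_ attribute and can convert integers back with inverse_transform. The sketch below uses a small toy list as a stand-in for df['diagnosis'], since the CSV file is not available here.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['M', 'M', 'B', 'M', 'B']  # toy stand-in for df['diagnosis']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the original labels; position i is the label encoded as i.
label_to_code = {str(label): code for code, label in enumerate(encoder.classes_)}
print(label_to_code)  # {'B': 0, 'M': 1}

# inverse_transform converts integers back to the original labels.
decoded = encoder.inverse_transform(codes).tolist()
print(decoded)  # ['M', 'M', 'B', 'M', 'B']
```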

///////////////////////////////////////

Handling imbalanced datasets

Example: a mail dataset with a column valid, where

valid = 0 means spam

valid = 1 means ham (legitimate mail)

If the dataset has 2000 rows, of which 1900 have valid = 1 and only 100 have valid = 0, it is an imbalanced dataset.

Handling an imbalanced dataset is crucial in machine learning, especially in classification tasks where one class dominates the others. If not addressed, models tend to be biased toward the majority class, leading to poor performance on the minority class (often the most important one, like fraud detection or disease prediction).
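To see why majority-class bias is a problem, consider the mail example with 1900 ham and 100 spam rows. The sketch below uses a deliberately trivial "model" that always predicts ham; it is an illustration in plain Python, not a real classifier.

```python
# Toy labels from the mail example: 1900 ham (1) and 100 spam (0).
y_true = [1] * 1900 + [0] * 100

# A "model" that ignores its input and always predicts the majority class.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on spam: the share of actual spam messages that were flagged as spam.
spam_total = sum(t == 0 for t in y_true)
spam_caught = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
spam_recall = spam_caught / spam_total

print(accuracy)     # 0.95
print(spam_recall)  # 0.0
```

Accuracy looks high even though not a single spam message is detected, which is why accuracy alone is a weak metric on imbalanced data.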

⚖️ What is Imbalanced Data?
An imbalanced dataset has a large difference in the number of samples between classes.

Example:
Class	Count
0	950
1	50

Here, class 1 is the minority class (only 5%).
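The class shares can be computed directly with pandas. This is a small sketch on toy labels that match the table (950 and 50 samples); the names shares and imbalance_ratio are illustrative.

```python
import pandas as pd

# Toy labels matching the table: 950 samples of class 0, 50 of class 1.
y = pd.Series([0] * 950 + [1] * 50)

counts = y.value_counts()
shares = y.value_counts(normalize=True)  # fraction of rows per class

# Ratio of majority to minority class size.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(imbalance_ratio)  # 19.0
```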

✅ How to Handle Imbalanced Data
🔹 1. Resampling Techniques
A. Undersampling – Reduce the majority class

```python
from imblearn.under_sampling import RandomUnderSampler

# X is the feature matrix and y the class labels (assumed to be defined already)
rus = RandomUnderSampler()
X_resampled, y_resampled = rus.fit_resample(X, y)
```
B. Oversampling – Increase the minority class

Random Oversampling: Duplicate minority samples

SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic examples

```python
from imblearn.over_sampling import SMOTE

# X is the feature matrix and y the class labels (assumed to be defined already)
sm = SMOTE()
X_resampled, y_resampled = sm.fit_resample(X, y)
```
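Random oversampling, the simpler of the two oversampling options, can be sketched with pandas alone by sampling the minority rows with replacement. The toy DataFrame and its column names are assumptions made for illustration.

```python
import pandas as pd

# Toy data: 8 majority rows (Class 0) and 2 minority rows (Class 1).
toy = pd.DataFrame({
    'feature': range(10),
    'Class': [0] * 8 + [1] * 2,
})

majority = toy[toy['Class'] == 0]
minority = toy[toy['Class'] == 1]

# Sample minority rows with replacement until they match the majority count.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

oversampled = pd.concat([majority, minority_upsampled], axis=0)
print(oversampled['Class'].value_counts())
```

Because the minority rows are only duplicated, no new information is added; SMOTE instead generates synthetic points between existing minority samples.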

In [19]:
fraud_data=pd.read_csv('credit_data.csv')
fraud_data.head()


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [22]:
print(fraud_data.shape)
print(fraud_data['Class'].value_counts())

(71396, 31)
Class
0.0    71218
1.0      177
Name: count, dtype: int64
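The two counts add up to 71,395, one fewer than the 71,396 rows reported by shape. value_counts() skips missing values by default, so a plausible explanation (an assumption, since the file itself is not shown here) is that one row has no Class value. The sketch below shows how to check this on a toy column standing in for fraud_data['Class'].

```python
import numpy as np
import pandas as pd

# Toy column with one missing label, standing in for fraud_data['Class'].
labels = pd.Series([0.0, 0.0, 1.0, np.nan])

default_counts = labels.value_counts()           # NaN is excluded by default
all_counts = labels.value_counts(dropna=False)   # NaN is counted as its own entry
n_missing = labels.isna().sum()                  # number of missing labels

print(default_counts.sum(), all_counts.sum(), n_missing)

# Dropping such rows explicitly keeps the row counts consistent.
clean = labels.dropna()
```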


In [25]:
legit=fraud_data[fraud_data['Class']==0]
fraud=fraud_data[fraud_data['Class']==1]
print(legit.shape,fraud.shape)

(71218, 31) (177, 31)


In [24]:
# apply undersampling: reduce the legit rows to a count close to the fraud rows (177)

legit_sample = legit.sample(n=200)
legit_sample.shape

(200, 31)

In [28]:
balanced_dataset=pd.concat([legit_sample,fraud],axis=0)
balanced_dataset.shape

(377, 31)

In [29]:
balanced_dataset['Class'].value_counts()

Class,count
0.0,200
1.0,177
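A possible next step, sketched here on toy data rather than credit_data.csv, is to draw exactly as many legit rows as there are fraud rows, shuffle the result, and separate features from the target. The random_state values and the single Amount feature are assumptions chosen only to make the example small and repeatable.

```python
import pandas as pd

# Toy stand-in for fraud_data: 20 legit rows (Class 0) and 4 fraud rows (Class 1).
toy = pd.DataFrame({
    'Amount': [float(i) for i in range(24)],
    'Class': [0] * 20 + [1] * 4,
})

legit = toy[toy['Class'] == 0]
fraud = toy[toy['Class'] == 1]

# Draw exactly as many legit rows as there are fraud rows; random_state makes
# the draw repeatable.
legit_sample = legit.sample(n=len(fraud), random_state=42)

# Combine, then shuffle so the classes are not stored in two contiguous blocks.
balanced = pd.concat([legit_sample, fraud], axis=0)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

# Separate features and target for model training.
X = balanced.drop(columns='Class')
y = balanced['Class']

print(y.value_counts())
```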
