# One-Hot Encoding in Pandas
One-hot encoding converts a categorical column into a set of binary indicator columns, one per category. Most machine learning algorithms require numerical inputs, so this step is usually needed before modeling with categorical features.

In this tutorial, we'll cover:
- Basic one-hot encoding
- Adding dummy variables to the original DataFrame
- One-hot encoding for multiple columns
- Dropping the first dummy variable to avoid multicollinearity

## Step 1: Basic One-Hot Encoding

In [3]:
import pandas as pd

# Sample DataFrame
data = {'Color': ['Red', 'Blue', 'Green', 'Red']}
df = pd.DataFrame(data)

# Create dummy variables
one_hot = pd.get_dummies(df['Color'])

# Display the result
print(one_hot)

    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True


### Explanation:
- `pd.get_dummies()` converts the categorical column `Color` into binary columns for each unique category.
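- Each row contains `True` in the column that matches its category and `False` in all others. Recent pandas versions return booleans by default; they behave like 1 and 0 in arithmetic.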
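
If a model or an export step needs literal 0/1 integers rather than booleans, `get_dummies` accepts a `dtype` argument. A minimal sketch, reusing the `Color` data from Step 1:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# dtype=int requests integer 0/1 columns instead of the boolean default
one_hot_int = pd.get_dummies(df['Color'], dtype=int)

print(one_hot_int)
```

Each row still has exactly one non-zero entry, so the row sums are all 1.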

## Step 2: Adding Dummy Variables to the Original DataFrame

In [4]:
# Add dummy variables to the original DataFrame
df_with_dummies = pd.concat([df, one_hot], axis=1)

print(df_with_dummies)

   Color   Blue  Green    Red
0    Red  False  False   True
1   Blue   True  False  False
2  Green  False   True  False
3    Red  False  False   True


### Explanation:
- `pd.concat([...], axis=1)` places the dummy columns next to the original DataFrame, aligning rows by index.
- This way, the original column is preserved alongside the new binary columns.
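
When the original column is not needed afterwards, the `columns` argument of `get_dummies` can encode it in place instead. The sketch below contrasts the two approaches on the same `Color` data; the variable names `kept` and `replaced` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# Option A: keep the original column next to its dummies (as in Step 2)
kept = pd.concat([df, pd.get_dummies(df['Color'])], axis=1)

# Option B: columns= replaces 'Color' with dummies prefixed by the column name
replaced = pd.get_dummies(df, columns=['Color'])

print(kept.columns.tolist())      # ['Color', 'Blue', 'Green', 'Red']
print(replaced.columns.tolist())  # ['Color_Blue', 'Color_Green', 'Color_Red']
```

Option B is often convenient before model training, since the raw string column would have to be dropped anyway.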

## Step 3: One-Hot Encoding for Multiple Columns

In [5]:
# Sample DataFrame with multiple columns
data = {
    'Color': ['Red', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'XL']
}
df_multi = pd.DataFrame(data)

# Create dummy variables for all categorical columns
one_hot_multi = pd.get_dummies(df_multi)

print(one_hot_multi)

   Color_Blue  Color_Green  Color_Red  Size_L  Size_M  Size_S  Size_XL
0       False        False       True   False   False    True    False
1        True        False      False   False    True   False    False
2       False         True      False    True   False   False    False
3       False        False       True   False   False   False     True


### Explanation:
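- When passed an entire DataFrame, `get_dummies()` encodes every column with object, string, or category dtype and leaves numeric columns unchanged.
- Each new column is named `<original column>_<category>`, which keeps the dummies from different source columns apart.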
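
To see how this plays out with mixed column types, the sketch below adds a hypothetical numeric `Price` column (not part of the original data) and also shows `columns=` for encoding only a chosen subset:

```python
import pandas as pd

df_mixed = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red'],
    'Size': ['S', 'M', 'L', 'XL'],
    'Price': [10.0, 12.5, 9.0, 11.0],  # hypothetical numeric column
})

# Encode every string column; 'Price' passes through untouched
encoded = pd.get_dummies(df_mixed)

# Encode only 'Size'; 'Color' stays as a plain string column
size_only = pd.get_dummies(df_mixed, columns=['Size'])

print(encoded.columns.tolist())
print(size_only.columns.tolist())
```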

## Step 4: Dropping the First Dummy Variable

In [6]:
# Create dummy variables with drop_first=True
one_hot_drop = pd.get_dummies(df_multi, drop_first=True)

print(one_hot_drop)

   Color_Green  Color_Red  Size_M  Size_S  Size_XL
0        False       True   False    True    False
1        False      False    True   False    False
2         True      False   False   False    False
3        False       True   False   False     True


### Explanation:
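- `drop_first=True` removes the first category of each encoded column (here `Color_Blue` and `Size_L`) to avoid the **dummy variable trap**. A row with `False` in all remaining dummies of a column belongs to the dropped category.
- This matters most for linear regression models with an intercept: the full set of dummies for a column always sums to 1, so it is perfectly collinear with the intercept term.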
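
The collinearity can be checked numerically. The sketch below is only an illustration, not part of the original notebook: it builds a design matrix with an explicit intercept column and compares its rank with and without `drop_first=True`, using NumPy's `matrix_rank`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

full = pd.get_dummies(df['Color'], dtype=float)
reduced = pd.get_dummies(df['Color'], drop_first=True, dtype=float)

# Column of ones standing in for the regression intercept
intercept = np.ones((len(df), 1))

X_full = np.hstack([intercept, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns, rank 3: one column is redundant
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3: full column rank
```

With all three dummies present, the intercept equals their sum, so the matrix loses one rank. Dropping one dummy restores full column rank.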

## Real-Life Example

In [7]:
df = pd.read_csv(r'../Datasets/Property_Crimes.csv')
df

,Area_Name,Year,Group_Name,Sub_Group_Name,Cases_Property_Recovered,Cases_Property_Stolen,Value_of_Property_Recovered,Value_of_Property_Stolen
0,Andaman & Nicobar Islands,2001,Burglary - Property,3. Burglary,27,64,755858,1321961
1,Andhra Pradesh,2001,Burglary - Property,3. Burglary,3321,7134,51483437,147019348
2,Arunachal Pradesh,2001,Burglary - Property,3. Burglary,66,248,825115,4931904
3,Assam,2001,Burglary - Property,3. Burglary,539,2423,3722850,21466955
4,Bihar,2001,Burglary - Property,3. Burglary,367,3231,2327135,17023937
...,...,...,...,...,...,...,...,...
2444,Tamil Nadu,2010,Total Property,7. Total Property Stolen & Recovered,16125,21509,660311804,1317919190
2445,Tripura,2010,Total Property,7. Total Property Stolen & Recovered,192,879,5666102,33032746
2446,Uttar Pradesh,2010,Total Property,7. Total Property Stolen & Recovered,9130,35068,577591772,1442670414
2447,Uttarakhand,2010,Total Property,7. Total Property Stolen & Recovered,964,2234,47135685,123398840


In [9]:
df.shape

(2449, 8)

In [10]:
# Encode the crime group; prefix='Group' names each column 'Group_<category>'
pd.get_dummies(df['Group_Name'], prefix='Group').head()

,Group_Burglary - Property,Group_Criminal Breach of Trust - Property,Group_Dacoity -Property,Group_Other heads of Property,Group_Robbery - Property,Group_Theft - Property,Group_Total Property
0,True,False,False,False,False,False,False
1,True,False,False,False,False,False,False
2,True,False,False,False,False,False,False
3,True,False,False,False,False,False,False
4,True,False,False,False,False,False,False


## Notes:
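- **When to Use One-Hot Encoding**: Use it for nominal (non-ordinal) categorical data, where the categories have no natural order.
- **Scalability**: Be cautious with columns that have many unique categories. Each category becomes its own column, so memory usage grows with the number of categories.
- **Best Practices**: Decide whether to drop the first dummy variable based on the model being used. It matters for linear models with an intercept, where the full set of dummies is redundant.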
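
One way to gauge the scalability concern before encoding is to count the categories with `nunique()` and, if the count is large, compare the dense result with the `sparse=True` option of `get_dummies`. The sketch below uses a synthetic, hypothetical column (200 labels over 5,000 rows), so the sizes it prints are illustrative only:

```python
import pandas as pd

# Hypothetical high-cardinality column: 200 distinct labels over 5,000 rows
labels = pd.Series([f'cat_{i % 200}' for i in range(5_000)], name='Category')

# The number of unique values equals the number of dummy columns created
print(labels.nunique())

dense = pd.get_dummies(labels)
sparse = pd.get_dummies(labels, sparse=True)  # stores only the non-zero entries

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()

print(dense.shape, dense_bytes, sparse_bytes)
```

Because each row is `True` in only one of the 200 columns, the sparse version needs far less memory. Whether a downstream library accepts sparse columns is a separate question to check.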