# Dummy Variables in Machine Learning

In machine learning, categorical variables (such as names, colors, locations, etc.) cannot be directly used by most algorithms. Dummy variables are a way to represent categorical data in a format that can be easily processed by algorithms. A dummy variable is a binary (0 or 1) variable used to indicate the presence or absence of a certain category in a dataset.

##### Understanding Dummy Variables

A dummy variable transforms categorical data into a binary form:

- 0: indicates the absence of a category.
- 1: indicates the presence of a category.

For example, in a dataset with different cities, we create a dummy variable for each city. A row in the dataset will contain a 1 in the column corresponding to the city where the data was recorded, and 0s in all other city columns.
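To make the definition concrete, here is a minimal sketch that builds the city dummy columns by hand with plain comparisons. The `cities` series and the `manual_dummies` name are illustrative assumptions, not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample of city labels, used only for this illustration.
cities = pd.Series(["Cairo", "Giza", "Alexandria", "Cairo"], name="City")

# One 0/1 column per distinct city: 1 where the row's city matches, else 0.
manual_dummies = pd.DataFrame(
    {city: (cities == city).astype(int) for city in sorted(cities.unique())}
)

print(manual_dummies)
```

Each row contains exactly one 1, in the column of that row's city.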

##### 1. Create a Sample Dataset

Let's create a simple dataset that includes categorical data for cities along with numerical data (e.g., area and price of properties).

```python
import pandas as pd

# Sample data
data = {
    "City": ["Cairo", "Giza", "Alexandria", "Cairo", "Giza", "Alexandria", "Cairo", "Giza", "Alexandria"],
    "Area": [200, 300, 250, 220, 280, 310, 240, 290, 260],
    "Price": [500000, 600000, 450000, 480000, 550000, 520000, 470000, 580000, 510000]
}

# Create DataFrame
df = pd.DataFrame(data)
df.head()
```


|   | City | Area | Price |
|---|------|------|-------|
| 0 | Cairo | 200 | 500000 |
| 1 | Giza | 300 | 600000 |
| 2 | Alexandria | 250 | 450000 |
| 3 | Cairo | 220 | 480000 |
| 4 | Giza | 280 | 550000 |


##### 2. Convert Categorical Data to Dummy Variables

We can use `pd.get_dummies()` to convert the "City" column into dummy variables. This creates one new column per unique city. In recent pandas versions the columns are boolean by default, so `True` marks the presence of a city and `False` its absence; these are equivalent to 1 and 0.



```python
# Create dummy variables for the 'City' column
dummies = pd.get_dummies(df['City'])

# Display dummy variables
dummies
```


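|   | Alexandria | Cairo | Giza |
|---|------------|-------|------|
| 0 | False | True | False |
| 1 | False | False | True |
| 2 | True | False | False |
| 3 | False | True | False |
| 4 | False | False | True |
| 5 | True | False | False |
| 6 | False | True | False |
| 7 | False | False | True |
| 8 | True | False | False |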
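If integer 0/1 columns are preferred over booleans, `pd.get_dummies()` accepts a `dtype` argument. The sketch below uses a small stand-in series (`cities`, an assumption for illustration) rather than the full DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the 'City' column.
cities = pd.Series(["Cairo", "Giza", "Alexandria"], name="City")

# dtype=int requests integer 0/1 columns instead of the boolean default.
int_dummies = pd.get_dummies(cities, dtype=int)

print(int_dummies)
```

Most numerical libraries treat `True`/`False` as 1/0 anyway, so this is mainly a matter of readability.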


##### 3. Concatenate Dummy Variables with Original Data

To ensure that we retain both the original features and the dummy variables, we concatenate the dummy variables with the original dataset. This allows us to keep the numerical data while adding the encoded categorical data.



```python
# Concatenate the original dataframe with the dummy variables
df_dummy = pd.concat([df, dummies], axis='columns')

# Display the new dataframe with dummy variables
df_dummy
```


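|   | City | Area | Price | Alexandria | Cairo | Giza |
|---|------|------|-------|------------|-------|------|
| 0 | Cairo | 200 | 500000 | False | True | False |
| 1 | Giza | 300 | 600000 | False | False | True |
| 2 | Alexandria | 250 | 450000 | True | False | False |
| 3 | Cairo | 220 | 480000 | False | True | False |
| 4 | Giza | 280 | 550000 | False | False | True |
| 5 | Alexandria | 310 | 520000 | True | False | False |
| 6 | Cairo | 240 | 470000 | False | True | False |
| 7 | Giza | 290 | 580000 | False | False | True |
| 8 | Alexandria | 260 | 510000 | True | False | False |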


##### 4. Drop Unnecessary Columns to Avoid Multicollinearity

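When creating dummy variables, it's common practice to drop one of the dummy columns. The full set of dummy columns for a feature sums to 1 in every row, so any one column can be computed exactly from the others. In models with an intercept, such as linear regression, this perfect multicollinearity (often called the dummy variable trap) means the coefficients cannot be determined uniquely. Dropping one category, which then serves as the reference category, removes the redundancy without losing information.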
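The redundancy can be checked numerically. The following sketch, using a small hypothetical `cities` series and NumPy's `matrix_rank`, compares a design matrix that has an intercept plus all dummy columns with one where a single dummy has been dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical city labels, used only for this illustration.
cities = pd.Series(["Cairo", "Giza", "Alexandria", "Cairo", "Giza", "Alexandria"])
full = pd.get_dummies(cities, dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1.
row_sums = full.sum(axis=1)

# Intercept column plus ALL dummies: 4 columns, but one is redundant.
with_all = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_all = np.linalg.matrix_rank(with_all)

# Intercept plus dummies with 'Cairo' dropped: 3 independent columns.
reduced = full.drop(columns="Cairo")
with_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(with_reduced)

print(with_all.shape[1], rank_all)          # 4 3
print(with_reduced.shape[1], rank_reduced)  # 3 3
```

A rank lower than the number of columns means the columns are linearly dependent, which is the situation dropping one dummy avoids.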

```python
# Drop the original 'City' column and one dummy column (e.g., 'Cairo') to avoid multicollinearity
df_dummy.drop(['City', 'Cairo'], axis='columns', inplace=True)

# Display the updated dataframe
df_dummy
```


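|   | Area | Price | Alexandria | Giza |
|---|------|-------|------------|------|
| 0 | 200 | 500000 | False | False |
| 1 | 300 | 600000 | False | True |
| 2 | 250 | 450000 | True | False |
| 3 | 220 | 480000 | False | False |
| 4 | 280 | 550000 | False | True |
| 5 | 310 | 520000 | True | False |
| 6 | 240 | 470000 | False | False |
| 7 | 290 | 580000 | False | True |
| 8 | 260 | 510000 | True | False |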
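As an alternative to steps 2 to 4, `pd.get_dummies()` can encode a column, replace it in the DataFrame, and drop one category in a single call. This is a sketch on a shortened copy of the data, not the approach used in this notebook. Note that `drop_first=True` drops the first category in sorted order (here Alexandria, not Cairo) and that the new columns get a `City_` prefix.

```python
import pandas as pd

# Shortened copy of the sample data, for illustration only.
df = pd.DataFrame({
    "City": ["Cairo", "Giza", "Alexandria"],
    "Area": [200, 300, 250],
    "Price": [500000, 600000, 450000],
})

# Encode 'City' in place of the original column and drop the first category.
encoded = pd.get_dummies(df, columns=["City"], drop_first=True)

print(encoded.columns.tolist())
# ['Area', 'Price', 'City_Cairo', 'City_Giza']
```

Which category is dropped only changes the reference point for interpreting coefficients, not the information in the data.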


##### 5. Separate Features and Target Variable

After encoding categorical variables as dummy variables, we separate the dataset into features (X) and the target variable (y). The target variable is the column we want to predict, and the features are the other columns that will be used as input for the model.

```python
# Separate features (X) and target variable (y)
X = df_dummy.drop('Price', axis='columns')  # Features (excluding 'Price')
y = df_dummy['Price']  # Target variable ('Price')

# Display the features and target variable
print(X.head())
print(y.head())
```


```text
   Area  Alexandria   Giza
0   200       False  False
1   300       False   True
2   250        True  False
3   220       False  False
4   280       False   True
0    500000
1    600000
2    450000
3    480000
4    550000
Name: Price, dtype: int64
```


##### 6. Prepare Data for Machine Learning

Now that the categorical data is encoded and the features are separated from the target variable, the dataset is ready for machine learning. We can proceed with model training and evaluation.
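The notebook does not name a model, so the following is only one possible next step: a minimal sketch that fits an ordinary least squares regression with NumPy's `lstsq`. The choice of model, the intercept column, and the names `A`, `coef`, and `predictions` are assumptions made for illustration. The block rebuilds the encoded data so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Cairo", "Giza", "Alexandria", "Cairo", "Giza", "Alexandria", "Cairo", "Giza", "Alexandria"],
    "Area": [200, 300, 250, 220, 280, 310, 240, 290, 260],
    "Price": [500000, 600000, 450000, 480000, 550000, 520000, 470000, 580000, 510000],
})

# Same encoding as in the steps above: keep 'Area', add city dummies,
# and drop 'Cairo' so that it acts as the reference category.
X = pd.concat([df[["Area"]], pd.get_dummies(df["City"], dtype=float)], axis="columns")
X = X.drop(columns="Cairo")
y = df["Price"].to_numpy(dtype=float)

# Assumed model: linear regression with an intercept, solved by least squares.
A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Fitted prices for the training rows.
predictions = A @ coef

print(dict(zip(["intercept", *X.columns], coef.round(2))))
```

With only nine rows this is a demonstration of the data flow, not a meaningful model; the coefficients for Alexandria and Giza are price offsets relative to Cairo at the same area.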