<h2>Data wrangling - Indicator Variable (or Dummy Variable) - part 5</h2>

<h3>What is an indicator variable?</h3>

An indicator variable (or dummy variable) is a numerical variable used to label categories. Such variables are called 'dummies' because the numbers themselves have no inherent meaning.

Why do we use indicator variables?

We use indicator variables so that categorical variables can be used in regression analysis in later modules.

Example: the column "fuel-type" has two unique values, "gas" and "diesel". Regression works on numbers, not words, so to use this attribute in regression analysis we convert "fuel-type" to indicator variables.

We will use the pandas function 'get_dummies' to create one indicator column for each category of fuel type.
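
Before applying it to the car data, here is a minimal sketch of what 'get_dummies' does, using a three-row toy Series. The name `fuel` and its values are assumptions made up for illustration; they are not read from 'automobile_new.csv'.

```python
import pandas as pd

# Toy data standing in for the "fuel-type" column (illustrative values only).
fuel = pd.Series(["gas", "diesel", "gas"], name="fuel-type")

# get_dummies returns one column per category, sorted alphabetically.
# Each row is marked only in the column of its own category.
dummies = pd.get_dummies(fuel)
print(dummies.columns.tolist())
print(dummies)
```

Every row has exactly one marked column, which is why this encoding is also called one-hot encoding.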

In [1]:
import numpy as np
import pandas as pd

headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]

df = pd.read_csv('automobile_new.csv',names=headers)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,80,mpfi,3.329751,3.255423,9.4,135,6000,16,23,15645
66,0,93.0,mercedes-benz,diesel,turbo,two,hardtop,rwd,front,106.7,...,183,idi,3.58,3.64,21.5,123,4350,22,25,28176
145,0,85.0,subaru,gas,turbo,four,wagon,4wd,front,96.9,...,108,mpfi,3.62,2.64,7.7,111,4800,23,23,11694
128,3,150.0,saab,gas,std,two,hatchback,fwd,front,99.1,...,121,mpfi,3.54,3.07,9.31,110,5250,21,28,11850
139,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,108,mpfi,3.62,2.64,9.0,94,5200,26,32,9960


In [2]:
df.columns

Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')

In [3]:
df['fuel-type'].value_counts()

fuel-type
gas       9
diesel    1
Name: count, dtype: int64

In [4]:
dummy_variables1 = pd.get_dummies(df['fuel-type'])
dummy_variables1.head()

Unnamed: 0,diesel,gas
55,False,True
66,True,False
145,False,True
128,False,True
139,False,True


In [5]:
# Change the column names for clarity:
dummy_variables1.rename(columns={'gas':'fuel-type-gas', 'diesel':'fuel-type-diesel'}, inplace=True)
dummy_variables1.head()

Unnamed: 0,fuel-type-diesel,fuel-type-gas
55,False,True
66,True,False
145,False,True
128,False,True
139,False,True


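The dummy dataframe now has one column per fuel type, 'fuel-type-diesel' and 'fuel-type-gas'. Each value is True where the car has that fuel type and False otherwise. pandas 2.0 and later return booleans here, and they behave as 1 and 0 in arithmetic.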
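
The output above displays True/False rather than integers. If literal 0/1 values are preferred, one option is the `dtype` parameter of 'get_dummies', or an `astype(int)` conversion afterwards. The sketch below uses a toy Series (`fuel`, an assumed name with illustrative values) rather than the real data.

```python
import pandas as pd

# Toy data standing in for the "fuel-type" column (illustrative values only).
fuel = pd.Series(["gas", "diesel", "gas"], name="fuel-type")

# dtype=int asks get_dummies for integer 0/1 columns instead of booleans.
as_int = pd.get_dummies(fuel, dtype=int)

# Alternative: convert an existing boolean dummy dataframe after the fact.
as_bool = pd.get_dummies(fuel, dtype=bool)
converted = as_bool.astype(int)

print(as_int)
```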

In [6]:
# merge data frame 'df' and 'dummy_variables1'

df = pd.concat([df, dummy_variables1], axis=1)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,fuel-type-diesel,fuel-type-gas
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,3.329751,3.255423,9.4,135,6000,16,23,15645,False,True
66,0,93.0,mercedes-benz,diesel,turbo,two,hardtop,rwd,front,106.7,...,3.58,3.64,21.5,123,4350,22,25,28176,True,False
145,0,85.0,subaru,gas,turbo,four,wagon,4wd,front,96.9,...,3.62,2.64,7.7,111,4800,23,23,11694,False,True
128,3,150.0,saab,gas,std,two,hatchback,fwd,front,99.1,...,3.54,3.07,9.31,110,5250,21,28,11850,False,True
139,0,102.0,subaru,gas,std,four,sedan,fwd,front,97.2,...,3.62,2.64,9.0,94,5200,26,32,9960,False,True


In [7]:
# drop original column "fuel-type" from "df"
df.drop('fuel-type', axis=1, inplace=True)

In [8]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,fuel-type-diesel,fuel-type-gas
55,3,150.0,mazda,std,two,hatchback,rwd,front,95.3,169.0,...,3.329751,3.255423,9.4,135,6000,16,23,15645,False,True
66,0,93.0,mercedes-benz,turbo,two,hardtop,rwd,front,106.7,187.5,...,3.58,3.64,21.5,123,4350,22,25,28176,True,False
145,0,85.0,subaru,turbo,four,wagon,4wd,front,96.9,173.6,...,3.62,2.64,7.7,111,4800,23,23,11694,False,True
128,3,150.0,saab,std,two,hatchback,fwd,front,99.1,186.6,...,3.54,3.07,9.31,110,5250,21,28,11850,False,True
139,0,102.0,subaru,std,four,sedan,fwd,front,97.2,172.0,...,3.62,2.64,9.0,94,5200,26,32,9960,False,True
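
Note that 'fuel-type-diesel' and 'fuel-type-gas' always add up to 1, so either column can be derived from the other. For a linear regression with an intercept, this redundancy is often removed by keeping only k-1 of the k indicator columns. A hedged sketch using the `drop_first` parameter of 'get_dummies' on toy data follows; the Series `fuel` is an assumption for illustration, and this notebook itself keeps both columns.

```python
import pandas as pd

# Toy data standing in for the "fuel-type" column (illustrative values only).
fuel = pd.Series(["gas", "diesel", "gas"], name="fuel-type")

# drop_first=True removes the first category in alphabetical order ("diesel"),
# so a 0 in the remaining column means the row is a diesel car.
reduced = pd.get_dummies(
    fuel, prefix="fuel-type", prefix_sep="-", drop_first=True, dtype=int
)
print(reduced.columns.tolist())
print(reduced)
```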


**Create an indicator variable for the column "aspiration"**

In [9]:
dummy_variables2 = pd.get_dummies(df['aspiration'])
dummy_variables2.head()

Unnamed: 0,std,turbo
55,True,False
66,False,True
145,False,True
128,True,False
139,True,False


In [10]:
# Change the column names for clarity:
dummy_variables2.rename(columns={'std':'aspiration-std', 'turbo':'aspiration-turbo'}, inplace=True)
dummy_variables2.head()

Unnamed: 0,aspiration-std,aspiration-turbo
55,True,False
66,False,True
145,False,True
128,True,False
139,True,False


**Merge the new dataframe to the original dataframe, then drop the column 'aspiration'.**

In [11]:
# merge the new dataframe into the original dataframe
df = pd.concat([df, dummy_variables2], axis=1)
df.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,fuel-type-diesel,fuel-type-gas,aspiration-std,aspiration-turbo
55,3,150.0,mazda,std,two,hatchback,rwd,front,95.3,169.0,...,9.4,135,6000,16,23,15645,False,True,True,False
66,0,93.0,mercedes-benz,turbo,two,hardtop,rwd,front,106.7,187.5,...,21.5,123,4350,22,25,28176,True,False,False,True
145,0,85.0,subaru,turbo,four,wagon,4wd,front,96.9,173.6,...,7.7,111,4800,23,23,11694,False,True,False,True
128,3,150.0,saab,std,two,hatchback,fwd,front,99.1,186.6,...,9.31,110,5250,21,28,11850,False,True,True,False
139,0,102.0,subaru,std,four,sedan,fwd,front,97.2,172.0,...,9.0,94,5200,26,32,9960,False,True,True,False


In [12]:
# drop original column "aspiration" from "df"
df.drop('aspiration', axis=1, inplace=True)

In [13]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,fuel-type-diesel,fuel-type-gas,aspiration-std,aspiration-turbo
55,3,150.0,mazda,two,hatchback,rwd,front,95.3,169.0,65.7,...,9.4,135,6000,16,23,15645,False,True,True,False
66,0,93.0,mercedes-benz,two,hardtop,rwd,front,106.7,187.5,70.3,...,21.5,123,4350,22,25,28176,True,False,False,True
145,0,85.0,subaru,four,wagon,4wd,front,96.9,173.6,65.4,...,7.7,111,4800,23,23,11694,False,True,False,True
128,3,150.0,saab,two,hatchback,fwd,front,99.1,186.6,66.5,...,9.31,110,5250,21,28,11850,False,True,True,False
139,0,102.0,subaru,four,sedan,fwd,front,97.2,172.0,65.4,...,9.0,94,5200,26,32,9960,False,True,True,False
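
The create, rename, concat, and drop steps used for "fuel-type" and "aspiration" can also be collapsed into a single call by passing the whole dataframe to 'get_dummies' with the `columns` and `prefix_sep` parameters. The sketch below is an alternative to the cells above, not the method this notebook follows, and the dataframe `cars` is a made-up three-row stand-in for the real data.

```python
import pandas as pd

# Small stand-in for the car data (illustrative values only).
cars = pd.DataFrame({
    "make": ["mazda", "mercedes-benz", "subaru"],
    "fuel-type": ["gas", "diesel", "gas"],
    "aspiration": ["std", "turbo", "turbo"],
})

# columns= selects which columns to encode. Each selected column is replaced
# by its indicator columns, named "<column>-<category>" because prefix_sep="-".
encoded = pd.get_dummies(cars, columns=["fuel-type", "aspiration"], prefix_sep="-")
print(encoded.columns.tolist())
```

The original "fuel-type" and "aspiration" columns are removed automatically, and the new column names match the ones produced by the manual renaming.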


In [14]:
df.to_csv('clean_df')