---
title: Categorical & Dummy Variables
type: lesson
duration: "1:5"
creator:
    name: Lucy Williams
    city: DC
---

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Categorical & Dummy Variables
Week 2 | Lesson 3.3

![](assets/images/balloon_dataset.png)

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Use `get_dummies` and other methods to convert categorical data to numerical data
- Create indicator variable (0 or 1) columns from categorical data

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 10 min  | [Introduction](#introduction)   | Categorical & Dummy Variables |
| 25 min  | [Demo / Guided Practice](#demo)  | Categorical Variables  |
| 25 min  | [Demo / Guided Practice](#demo)  | Dummy Variables  |
| 25 min  | [Independent Practice](#ind-practice)  |   |
| 5 min  | [Conclusion](#conclusion)  |  |

<a name="introduction"></a>
## Introduction: Categorical & Dummy Variables (10 mins)

Regression analysis is designed for numerical variables. Its results only have a valid
interpretation if it makes sense to assume that a value of 2 on some variable
really does mean twice as much of something as a value of 1, and that a 50 means
50 times as much as a 1. Sometimes, however, you need to work with categorical variables,
whose values have no real numerical relationship with each other.
The solution is to use categorical and dummy variables.

A categorical variable is an independent or predictor variable that contains
values indicating membership in one of several possible categories. E.g.,
gender (male or female), marital status (married, single, divorced,
widowed). The categories are often assigned numerical values used as
labels, e.g., 0 = male; 1 = female.

A dummy variable is created by recoding categorical variables that have more than
two categories into a series of binary (0 or 1) variables.
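
As a minimal sketch of that recoding, the snippet below builds indicator columns with `pd.get_dummies`. The `status` Series and its values are hypothetical and are not part of the lesson data; `dtype=int` is passed only so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical marital-status column (illustration only, not lesson data)
status = pd.Series(["married", "single", "divorced", "single"], name="status")

# One indicator column per category; each row has exactly one 1
dummies = pd.get_dummies(status, prefix="status", dtype=int)
print(dummies)
```

Each original value becomes a row with a single 1 in the column for its category, so no category is treated as "larger" than another.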

Here is more information on different [types of variables](http://www.indiana.edu/~educy520/sec5982/week_2/variable_types.pdf).

In [None]:
import pandas as pd
balloons_df = pd.read_csv('../data/balloons.csv', names=['Color', 'Size', 'Act', 'Age', 'Inflated'])

In [None]:
balloons_df

## Categorical Variables

The variables in this dataset are categorical rather than numerical in nature. Consider a row vector from this set, $v \in B$, where $B$ represents the entire feature space.

We know that any given row vector will have

- $v_1 \in \{\texttt{'YELLOW'}, \texttt{'PURPLE'}\}$
- $v_2 \in \{\texttt{'SMALL'}, \texttt{'LARGE'}\}$
- $v_3 \in \{\texttt{'STRETCH'}, \texttt{'DIP'}\}$
- $v_4 \in \{\texttt{'ADULT'}, \texttt{'CHILD'}\}$
- $v_5 \in \{\texttt{'T'}, \texttt{'F'}\}$

and that each row vector has dimension 5. 

### NumPy (And Thus Pandas, SciPy, Sklearn, etc.) Cannot Perform Mathematical Operations on String Vectors

In [None]:
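# .ix has been removed from pandas; use .iloc for position-based row access
v_1 = balloons_df.iloc[0]
v_2 = balloons_df.iloc[1]
v_1.dot(v_2)  # expected to raise an error: the values are strings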

### In Order to Work With Our Dataset Using Our Core Libraries, We Must Encode It Numerically

In [None]:
encode_color = lambda x: 1 if x == 'YELLOW' else 0
encode_size = lambda x: 1 if x == 'SMALL' else 0
encode_act = lambda x: 1 if x == 'STRETCH' else 0
encode_age = lambda x: 1 if x == 'ADULT' else 0
encode_inflated = lambda x: 1 if x =='T' else 0

In [None]:
# Apply each encoder to its column of balloons_df, replacing strings with 0/1
balloons_df.Color = balloons_df.Color.apply(encode_color)
balloons_df.Size = balloons_df.Size.apply(encode_size)
balloons_df.Act = balloons_df.Act.apply(encode_act)
balloons_df.Age = balloons_df.Age.apply(encode_age)
balloons_df.Inflated = balloons_df.Inflated.apply(encode_inflated)

In [None]:
balloons_df

In [None]:
balloons_df.T.dot(balloons_df)

### What is the significance of the diagonal? What is the significance of an off-diagonal? Can you see any features or feature combinations that would be a good predictor of `Inflated`?

## What if the category has more than two classes? 

<img src='assets/images/lenses_dataset.png' width=700px>

In [None]:
lenses_df = pd.read_csv('../data/lenses.csv', index_col=0, 
                        names=['Age', 'Spectacle Prescription', 'Astigmatic', 'Tear Production Rate','Prescription Class'])
lenses_df

    Attribute Information:
        -- 3 Classes
         1 : the patient should be fitted with hard contact lenses,
         2 : the patient should be fitted with soft contact lenses,
         3 : the patient should not be fitted with contact lenses.

        1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
        2. spectacle prescription:  (1) myope, (2) hypermetrope
        3. astigmatic:     (1) no, (2) yes
        4. tear production rate:  (1) reduced, (2) normal


### Numerical Data signifying Classes

In [None]:
pd.get_dummies(lenses_df['Age']).head()

## Build a New Dataframe Using Our Dummy Columns

In [None]:
Age_dummies = pd.get_dummies(lenses_df['Age'])
Spectacle_dummies = pd.get_dummies(lenses_df['Spectacle Prescription'])
Astigmatic_dummies = pd.get_dummies(lenses_df['Astigmatic'])
Tear_dummies = pd.get_dummies(lenses_df['Tear Production Rate'])

In [None]:
Age_dummies.columns = ['age_young', 'age_pre-presbyopic', 'age_presbyopic']
Spectacle_dummies.columns = ['spectacle_myope', 'spectacle_hypermetrope']
Astigmatic_dummies.columns = ['astigmatic_no', 'astigmatic_yes']
Tear_dummies.columns = ['tear_reduced', 'tear_normal']
Age_dummies.head()

In [None]:
lenses_encoded = pd.concat([Age_dummies,Spectacle_dummies,Astigmatic_dummies,Tear_dummies], axis=1)
lenses_encoded
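
The same result can be sketched in a single call by passing `columns` and `prefix` to `pd.get_dummies`. The `toy` frame below is a hypothetical stand-in for two of the integer-coded lenses columns. The generated names use the raw codes (for example `age_1`) rather than the descriptive names assigned by hand above.

```python
import pandas as pd

# Hypothetical stand-in for two integer-coded columns of the lenses data
toy = pd.DataFrame({"Age": [1, 2, 3, 1], "Astigmatic": [1, 2, 2, 1]})

# Encode both columns at once; one prefix per encoded column
encoded = pd.get_dummies(
    toy,
    columns=["Age", "Astigmatic"],
    prefix=["age", "astigmatic"],
    dtype=int,
)
print(encoded.columns.tolist())
```

Building each dummy frame separately, as in the cells above, gives more control over column names; the one-call form is shorter when the default names are acceptable.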

<a name="demo"></a>
## Demo / Guided Practice: Categorical Variables (25 mins)

Why exactly would you want to use categorical variables?
The categorical data type is useful in the following cases:

- A string variable consists of only a few different values. Converting such a
    string variable to a categorical variable will save some memory, see
    [here](https://pandas-docs.github.io/pandas-docs-travis/categorical.html#categorical-memory).
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a
    categorical and specifying an order on the categories, sorting and min/max will
    use the logical order instead of the lexical order, see
    [here](https://pandas-docs.github.io/pandas-docs-travis/categorical.html#categorical-sort).
- As a signal to other Python libraries that this column should be treated as a
    categorical variable (e.g. to use suitable statistical methods or plot types).
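
The first two cases above can be sketched as follows. The Series contents and the variable names (`number_type`, `big`, and so on) are made up for illustration, and the exact byte counts will depend on your pandas version.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(["one", "two", "three", "two", "one"])

# Plain strings sort lexically, so "three" lands before "two"
lexical = s.sort_values().tolist()

# An ordered categorical sorts by the category order we specify
number_type = CategoricalDtype(categories=["one", "two", "three"], ordered=True)
cat = s.astype(number_type)
logical = cat.sort_values().tolist()
smallest, largest = cat.min(), cat.max()

# Memory: many repeated strings versus the same data stored as categories
big = pd.Series(["one", "two", "three"] * 10_000)
bytes_as_strings = big.memory_usage(deep=True)
bytes_as_category = big.astype("category").memory_usage(deep=True)

print(lexical, logical, smallest, largest)
print(bytes_as_strings, bytes_as_category)
```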

Let's use pandas to create a few Categorical Series. One way is by specifying
dtype="category" when constructing a Series:

> Here is a link to the [demo code](./w2-3.3-demo.ipynb).

```Python
import pandas as pd
s = pd.Series(["a","b","c","a"], dtype="category")
s
```

Another way is to convert an existing Series or column to a category dtype:
```Python
df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = df["A"].astype('category')
df
```

You can also pass a pandas.Categorical object to a Series or assign it to a DataFrame.
```Python
raw_cat = pd.Categorical(["a","b","c","a"], categories=["b","c","d"],
                          ordered=False)
```

```Python
s = pd.Series(raw_cat)
s
```
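
One detail worth checking in the example above: the `categories` list does not include `"a"`, and it does include an unused `"d"`. A small sketch of what to expect, reusing the same `raw_cat` definition:

```python
import pandas as pd

raw_cat = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"],
                         ordered=False)
s = pd.Series(raw_cat)

# Values that are not listed in `categories` become missing (NaN)
missing = s.isna().tolist()

# Categories are kept even when no value uses them
categories = s.cat.categories.tolist()
counts = s.value_counts()

print(missing)
print(categories)
print(counts)
```

So declaring categories explicitly both restricts the allowed values and records categories that happen not to appear in the data.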

**Check:** Why would you use a categorical variable?
See the pandas documentation on [categorical variables](https://pandas-docs.github.io/pandas-docs-travis/categorical.html).

<a name="ind-practice"></a>
## Independent Practice: Dummy Variables (25 minutes)

Use the Shuttle Control Data (`'shuttle_control.csv'`) to:

1. Create dummy variables for each feature
2. Create a new dataframe containing your encoded data


    Shuttle Control Attribute Information:
    1. Class: noauto, auto
       -- that is, advise using manual/automatic control
    2. STABILITY: stab, xstab
    3. ERROR: XL, LX, MM, SS
    4. SIGN: pp, nn
    5. WIND: head, tail
    6. MAGNITUDE: Low, Medium, Strong, OutOfRange
    7. VISIBILITY: yes, no


<a name="conclusion"></a>
## Conclusion (5 mins)

We learned that categorical and dummy variables let us work with data whose values
are labels rather than quantities. A categorical variable is useful when a string
column has only a few distinct values, or when the lexical order of a variable is
not the same as its logical order. Dummy variables turn those categories into 0/1
indicator columns that numerical libraries can use. When we start Regression in
Week 3, it will become even more apparent how valuable these tools are for managing
our data and making it easier to analyze.