# 1. Introduction to Categorical Data

Almost every dataset contains categorical information, and it is often an underexplored source of insight. In this chapter, you'll learn how pandas handles categorical columns using the category data type. You'll also learn how to group data by category to compute summary statistics.

# Exploring a target variable

You have been asked to build a machine learning model to predict whether or not a person makes over $50,000 in a year. To understand the target variable, Above/Below 50k, you decide to explore the variable in more detail.

The Python package pandas is used throughout this course and is loaded as pd. The adult census income dataset, adult, has also been preloaded for you.

# Instructions:

- Explore the Above/Below 50k variable by printing out a description of the variable's contents.

In [2]:
import pandas as pd
import numpy as np
adult = pd.read_csv("adult.csv")
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object


- Explore the Above/Below 50k variable by printing out a frequency table of the values found in this column.

In [3]:
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object
 <=50K    24720
 >50K      7841
Name: Above/Below 50k, dtype: int64


- Rerun .value_counts(), but this time print out the relative frequency values instead of the counts.

In [4]:
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

# Print relative frequency values
print(adult["Above/Below 50k"].value_counts(normalize=True))

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object
 <=50K    24720
 >50K      7841
Name: Above/Below 50k, dtype: int64
 <=50K    0.75919
 >50K     0.24081
Name: Above/Below 50k, dtype: float64


# Question
Given the output from the previous steps, do more people make more or less than $50,000?
# Possible answers

( ) More than $50,000

(x) Less than $50,000
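
The three calls used in this exercise can also be tried without the adult.csv file. The following is a minimal sketch on a small made-up Series; the four values are an assumption for illustration, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the "Above/Below 50k" column (not the real data)
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="Above/Below 50k")

# describe() on a text column reports count, unique, top, and freq
summary = income.describe()

# value_counts() returns the absolute frequency of each value
counts = income.value_counts()

# normalize=True returns each value's share of the total instead
shares = income.value_counts(normalize=True)

print(summary)
print(counts)
print(shares)
```

Reading the shares rather than the counts makes class imbalance in a target variable easier to spot.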

# Setting dtypes and saving memory

A colleague of yours is exploring a list of occupations and how they relate to salary. She has given you a list of these occupations, list_of_occupations, and has a few simple questions such as "How many different titles are there?" and "Which position is the most common?".

# Instructions:

- Create a pandas Series, series1, using the list_of_occupations (do not set the dtype).

In [5]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

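NameError: name 'list_of_occupations' is not defined

(The list is supplied by the course environment; this error appears when the cell is run without it.)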

- Print both the data type of series1 and the number of bytes it uses.

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

- Create a second pandas Series, series2, using the list_of_occupations and set the dtype to "category".

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

- Print both the data type of series2 and the number of bytes it uses.

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)
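
Because list_of_occupations is not defined outside the course environment, the comparison can be reproduced with a stand-in list. This is a rough sketch: the occupation titles and their counts below are assumptions chosen for illustration, and the exact byte counts will depend on the pandas version and its default string dtype.

```python
import pandas as pd

# Hypothetical stand-in for list_of_occupations: few distinct titles, many repeats
list_of_occupations = (
    ["Sales"] * 500 + ["Tech-support"] * 300 + ["Exec-managerial"] * 200
)

# Default dtype: every element is stored (or referenced) individually
series1 = pd.Series(list_of_occupations)

# "category" dtype: small integer codes plus one copy of each distinct title
series2 = pd.Series(list_of_occupations, dtype="category")

print("series1:", series1.dtype, series1.nbytes)
print("series2:", series2.dtype, series2.nbytes)

# The colleague's two questions
print("distinct titles:", series2.nunique())
print("most common:", series2.value_counts().idxmax())
```

The saving comes from repetition: the fewer distinct values a column has relative to its length, the more the category dtype helps.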

# Creating a categorical pandas Series

Another colleague at work has collected information on the number of "Gold", "Silver", and "Bronze" medals won by the USA at the Summer and Winter Olympics since 1896. She has provided this as a list, medals_won. Before looking at the total number of each medal won, you want to create a categorical pandas Series. You know that these medals have a specific order: Gold is better than Silver, and Silver is better than Bronze. Use the medals_won object to build it.

# Instructions:

- Create a categorical pandas Series without using pd.Series().
- Specify the three known medal categories such that "Bronze" < "Silver" < "Gold".
- Specify that the order of the categories is important when creating this Series.

In [None]:
# Create a categorical Series and specify the categories (let pandas know the order matters!)
medals = pd.Categorical(medals_won, categories=["Bronze", "Silver", "Gold"], ordered=True)
print(medals)
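
To see what ordered=True makes possible, the same call can be run on a short stand-in list. This is a sketch only: the six medals below are made up, not the real medals_won data.

```python
import pandas as pd

# Hypothetical stand-in for medals_won
medals_won = ["Silver", "Gold", "Bronze", "Gold", "Silver", "Gold"]

medals = pd.Categorical(
    medals_won, categories=["Bronze", "Silver", "Gold"], ordered=True
)

# The order is recorded on the object itself
print(medals.ordered)

# min() and max() follow the category order, not alphabetical order
print(medals.min(), medals.max())

# Comparisons against a category are allowed because the categories are ordered
better_than_bronze = medals > "Bronze"
print(better_than_bronze.sum())
```

Without ordered=True, the min(), max(), and > operations above would raise a TypeError, since pandas would have no ranking to compare against.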

# Setting dtype when reading data

You are preparing to create a machine learning model to predict a person's income category using the adult census income dataset. You don't have access to any cloud resources and you want to make sure that your laptop will be able to load the full dataset and process its contents. You have read in the first five rows of the dataset adult to help you understand what kind of columns are available.

# Instructions:

- Call the correct attribute on the adult DataFrame to review the data types.

In [None]:
# Check the dtypes
print(adult.dtypes)
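
As the section title suggests, dtypes can also be set while the file is being read, through the dtype parameter of pd.read_csv. The sketch below uses a tiny in-memory CSV in place of adult.csv; the "Age" and "Marital Status" column names and the three rows are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for adult.csv
csv_text = """Age,Marital Status,Above/Below 50k
39,Never-married,<=50K
50,Married-civ-spouse,>50K
38,Divorced,<=50K
"""

# Default read: text columns get the default string/object dtype
default_df = pd.read_csv(io.StringIO(csv_text))

# Map column names to dtypes so the text columns load as categories
column_dtypes = {"Marital Status": "category", "Above/Below 50k": "category"}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=column_dtypes)

print(default_df.dtypes)
print(typed_df.dtypes)
```

Setting the dtype at read time avoids first holding the full text columns in memory and converting them afterwards, which matters when the dataset is large relative to the machine.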