# 1. Introduction to Categorical Data

Almost every dataset contains categorical information, and it is often an unexplored goldmine. In this chapter, you'll learn how pandas handles categorical columns using the category data type. You'll also learn how to group data by categories to compute summary statistics.

# Exploring a target variable

You have been asked to build a machine learning model to predict whether or not a person makes over $50,000 in a year. To understand the target variable, Above/Below 50k, you decide to explore the variable in more detail.

The Python package pandas is used throughout this course and is loaded as pd. The adult census income dataset, adult, has also been preloaded for you.

# Instructions:

- Explore the Above/Below 50k variable by printing out a description of the variable's contents.

In [1]:
import pandas as pd
import numpy as np
adult = pd.read_csv("adult.csv")

# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object


- Explore the Above/Below 50k variable by printing out a frequency table of the values found in this column.

In [2]:
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object
 <=50K    24720
 >50K      7841
Name: Above/Below 50k, dtype: int64


- Rerun .value_counts(), but this time print out the relative frequency values instead of the counts.

In [3]:
# Explore the Above/Below 50k variable
print(adult["Above/Below 50k"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["Above/Below 50k"].value_counts())

# Print relative frequency values
print(adult["Above/Below 50k"].value_counts(normalize = True))

count      32561
unique         2
top        <=50K
freq       24720
Name: Above/Below 50k, dtype: object
 <=50K    24720
 >50K      7841
Name: Above/Below 50k, dtype: int64
 <=50K    0.75919
 >50K     0.24081
Name: Above/Below 50k, dtype: float64


# Question
Given the output from the previous steps, do more people make more or less than $50,000?
# Possible answers

( ) More than $50,000

(x) Less than $50,000
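
The steps above can be reproduced on a small, made-up Series. This is only a sketch: the `income` values below are hypothetical and merely stand in for the real `adult["Above/Below 50k"]` column.

```python
import pandas as pd

# Hypothetical toy data standing in for adult["Above/Below 50k"]
income = pd.Series(["<=50K", "<=50K", ">50K", "<=50K"], name="Above/Below 50k")

counts = income.value_counts()                # absolute frequencies
shares = income.value_counts(normalize=True)  # relative frequencies (sum to 1)

print(counts)
print(shares)

# idxmax() returns the label with the highest count, i.e. the majority class
majority = counts.idxmax()
print("Majority class:", majority)
```

Reading the majority class off the frequency table this way answers the question without inspecting the output by eye.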

# Setting dtypes and saving memory

A colleague of yours is exploring a list of occupations and how they relate to salary. She has given you a list of these occupations, list_of_occupations, and has a few simple questions such as "How many different titles are there?" and "Which position is the most common?".

# Instructions:

- Create a pandas Series, series1, using the list_of_occupations (do not set the dtype).

In [7]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

- Print both the data type and the number of bytes used by this new Series.

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

- Create a second pandas Series, series2, using the list_of_occupations and set the dtype to "category".

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

- Print both the data type and the number of bytes used by this new Series.

In [None]:
# Create a Series, default dtype
series1 = pd.Series(list_of_occupations)

# Print out the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)

# Create a Series, "category" dtype
series2 = pd.Series(list_of_occupations, dtype="category")

# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)
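
Since list_of_occupations is not shown here, the following sketch uses a hypothetical list, `occupations`, with a few distinct titles repeated many times. It illustrates why the category dtype tends to use fewer bytes in that situation (each distinct string is stored once, and every row holds only a small integer code), and how the colleague's two questions might be answered.

```python
import pandas as pd

# Hypothetical stand-in for list_of_occupations: few distinct values, many repeats
occupations = ["Sales"] * 500 + ["Tech-support"] * 300 + ["Craft-repair"] * 200

obj_series = pd.Series(occupations)                    # default dtype
cat_series = pd.Series(occupations, dtype="category")  # categorical dtype

print("default :", obj_series.dtype, obj_series.nbytes)
print("category:", cat_series.dtype, cat_series.nbytes)

# The colleague's questions can be answered from either Series
n_titles = cat_series.nunique()
most_common = cat_series.value_counts().idxmax()
print("distinct titles:", n_titles)
print("most common    :", most_common)
```

The saving depends on how often values repeat; a column in which nearly every value is unique gains little from the category dtype.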

# Creating a categorical pandas Series

Another colleague at work has collected information on the number of "Gold", "Silver", and "Bronze" medals won by the USA at the Summer and Winter Olympics since 1896. She has provided this as a list, medals_won. Before looking at the total number of each medal won, you want to create a categorical pandas Series. You know that these medals have a specific order: Gold is better than Silver, and Silver is better than Bronze. Use the object medals_won to help.

# Instructions:

- Create a categorical pandas Series without using pd.Series().
- Specify the three known medal categories such that "Bronze" < "Silver" < "Gold".
- Specify that the order of the categories is important when creating this Series.

In [None]:
# Create a categorical Series and specify the categories (let pandas know the order matters!)
medals = pd.Categorical(medals_won, categories=["Bronze", "Silver", "Gold"], ordered=True)
print(medals)
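
To see what declaring the order buys you, here is a minimal sketch on a hypothetical list, `medals_sample` (the real medals_won is not shown). With ordered=True, pandas can compute a minimum and maximum and compare values against a category.

```python
import pandas as pd

# Hypothetical stand-in for medals_won
medals_sample = ["Silver", "Gold", "Bronze", "Silver", "Gold"]

medals_cat = pd.Categorical(
    medals_sample,
    categories=["Bronze", "Silver", "Gold"],
    ordered=True,
)

print(medals_cat.categories)                # categories in the declared order
print(medals_cat.min(), medals_cat.max())   # lowest and highest medal present

# Wrapping in a Series allows element-wise comparison against a category
medal_series = pd.Series(medals_cat)
better_than_bronze = medal_series > "Bronze"
print(better_than_bronze.tolist())
```

Without ordered=True, the same comparison raises a TypeError, because pandas has no way to rank unordered categories.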

# Setting dtype when reading data

You are preparing to create a machine learning model to predict a person's income category using the adult census income dataset. You don't have access to any cloud resources and you want to make sure that your laptop will be able to load the full dataset and process its contents. You have read in the first five rows of the dataset adult to help you understand what kind of columns are available.

# Instructions:

- Call the correct attribute on the adult DataFrame to review the data types.

In [None]:
# Check the dtypes
print(adult.dtypes)

# Question

Based on the data types in adult, which columns are good candidates for specifying a dtype of "category" when reading in the adult dataset?

# Possible answers


( ) "Age", "Education Num", and "Race"

( ) "Age", "Hours/Week", and "Capital Loss"

( ) "Workclass", "Education Num", "Hours/Week", and "Above/Below 50k"

(x) "Workclass", "Education", "Relationship", "Above/Below 50k"

- Create a dictionary with keys: "Workclass", "Education", "Relationship", and "Above/Below 50k".
- Set the value for each key to be "category".

In [11]:
# Check the dtypes
print(adult.dtypes)

# Create a dictionary with column names as keys and "category" as values
adult_dtypes = {"Workclass": "category", "Education": "category", "Relationship": "category", "Above/Below 50k": "category"}

Age                 int64
Workclass          object
fnlgwt              int64
Education          object
Education Num       int64
Marital Status     object
Occupation         object
Relationship       object
Race               object
Sex                object
Capital Gain        int64
Capital Loss        int64
Hours/Week          int64
Country            object
Above/Below 50k    object
dtype: object


- Use the newly created dictionary, adult_dtypes, when reading in adult.csv.

In [10]:
# Check the dtypes
print(adult.dtypes)

# Create a dictionary with column names as keys and "category" as values
adult_dtypes = {
   "Workclass": "category",
   "Education": "category",
   "Relationship": "category",
   "Above/Below 50k": "category" 
}

# Read in the CSV using the dtypes parameter
adult2 = pd.read_csv(
  "adult.csv",
  dtype=adult_dtypes
)
print(adult2.dtypes)

Age                 int64
Workclass          object
fnlgwt              int64
Education          object
Education Num       int64
Marital Status     object
Occupation         object
Relationship       object
Race               object
Sex                object
Capital Gain        int64
Capital Loss        int64
Hours/Week          int64
Country            object
Above/Below 50k    object
dtype: object
Age                   int64
Workclass          category
fnlgwt                int64
Education          category
Education Num         int64
Marital Status       object
Occupation           object
Relationship       category
Race                 object
Sex                  object
Capital Gain          int64
Capital Loss          int64
Hours/Week            int64
Country              object
Above/Below 50k    category
dtype: object
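
The same pattern can be tried without the real file. The sketch below reads a hypothetical three-row extract from an in-memory string (`csv_text` and `sample_dtypes` are illustrative names, not part of the exercise) and compares memory use with and without the dtype mapping.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the real adult.csv has more columns and rows
csv_text = """Age,Workclass,Education,Above/Below 50k
39,State-gov,Bachelors,<=50K
50,Private,Bachelors,<=50K
38,Private,HS-grad,>50K
"""

sample_dtypes = {
    "Workclass": "category",
    "Education": "category",
    "Above/Below 50k": "category",
}

default_df = pd.read_csv(io.StringIO(csv_text))
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=sample_dtypes)

print(typed_df.dtypes)

# deep=True includes the memory held by the string values themselves
print("default:", default_df.memory_usage(deep=True).sum())
print("typed  :", typed_df.memory_usage(deep=True).sum())
```

On a sample this small the difference is negligible; the benefit shows up on a full dataset where each category repeats many times.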


# Create lots of groups

You want to find the mean Age of adults when grouping by the following categories:

- "Workclass" (which has 9 categories)
- "Above/Below 50k" (which has 2 categories)
- "Education" (which has 16 categories)

You have developed the following bit of code:

gb = adult.groupby(by=["Workclass",
                       "Above/Below 50k",
                       "Education"])

How many groups are in the gb object, and what is the maximum possible number of groups that could have been created? The dataset adult and the gb object have been preloaded for you.

# Possible answers

( ) 2 are created out of 2 possible groups.

( ) 2 are created out of 18 possible groups.

( ) 288 are created out of 288 possible groups.

(x) 208 are created out of 288 possible groups.
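
The maximum is the product of the category counts, 9 × 2 × 16 = 288, while the number actually created depends on which combinations occur in the data. A minimal sketch of that check, using a hypothetical DataFrame `toy` with two grouping columns:

```python
import pandas as pd

# Hypothetical toy data; one Workclass/income combination is missing on purpose
toy = pd.DataFrame({
    "Workclass": ["Private", "Private", "State-gov", "State-gov", "Self-emp"],
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "Age": [25, 40, 31, 52, 47],
})

group_cols = ["Workclass", "Above/Below 50k"]
toy_gb = toy.groupby(by=group_cols)

# Groups that actually occur in the data
observed_groups = toy_gb.ngroups

# Upper bound: product of the number of distinct values in each column
possible_groups = 1
for col in group_cols:
    possible_groups *= toy[col].nunique()

print(observed_groups, "of", possible_groups, "possible groups")
```

Here only 4 of the 6 possible combinations appear, for the same reason that adult yields 208 of 288.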

# Setting up a .groupby() statement

The gender wage gap is a hot topic in the United States and across the world. Using the adult census income dataset, loaded as adult, you want to check whether some of the recently published data lines up with this income survey.

# Instructions:

- Split the adult dataset across the "Sex" and "Above/Below 50k" columns, saving this object as gb.
- Print out the number of observations found in each group.
- Using gb, find the average of each numerical column.

In [12]:
# Group the adult dataset by "Sex" and "Above/Below 50k"
gb = adult.groupby(by=["Sex", "Above/Below 50k"])

# Print out how many rows are in each created group
print(gb.size())

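# Print out the mean of each group for all numeric columns
# (numeric_only=True skips the string columns, which recent pandas versions refuse to average)
print(gb.mean(numeric_only=True))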

Sex      Above/Below 50k
 Female   <=50K              9592
          >50K               1179
 Male     <=50K             15128
          >50K               6662
dtype: int64
                               Age         fnlgwt  Education Num  \
Sex     Above/Below 50k                                            
 Female  <=50K           36.210801  185999.381359       9.820475   
         >50K            42.125530  183687.406277      11.787108   
 Male    <=50K           37.147012  193093.609268       9.452142   
         >50K            44.625788  188769.101321      11.580606   

                         Capital Gain  Capital Loss  Hours/Week  
Sex     Above/Below 50k                                          
 Female  <=50K             121.986134     47.364470   35.916701  
         >50K             4200.389313    173.648855   40.426633  
 Male    <=50K             165.723823     56.806782   40.693879  
         >50K             3971.765836    198.780396   46.366106  
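
As a small, self-contained illustration of the two calls, the sketch below uses a hypothetical DataFrame `toy` that mixes a string column with a numeric one. Passing numeric_only=True is an addition here: it keeps the mean restricted to numeric columns, which pandas 2.0 and later require when string columns are present.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric, non-grouping column
toy = pd.DataFrame({
    "Sex": ["Female", "Female", "Male", "Male", "Male"],
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
    "Occupation": ["Sales", "Sales", "Tech-support", "Sales", "Tech-support"],
    "Hours/Week": [35, 40, 40, 42, 50],
})

toy_gb = toy.groupby(by=["Sex", "Above/Below 50k"])

group_sizes = toy_gb.size()                   # number of rows per group
group_means = toy_gb.mean(numeric_only=True)  # "Occupation" is skipped

print(group_sizes)
print(group_means)
```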



# Using pandas functions effectively

You are creating a Python application that will calculate summary statistics based on user-selected variables. The complete dataset is quite large. For now, you are setting up your code using part of the dataset, preloaded as adult. As you create a reusable process, make sure you think through the most efficient way to set up the GroupBy object.

# Instructions:

- Create a list of the names for two user-selected variables: "Education" and "Above/Below 50k".
- Create a GroupBy object, gb, using the user_list as the grouping variables.
- Calculate the mean of "Hours/Week" across each group using the most efficient approach covered in the video.

In [13]:
# Create a list of user-selected variables
user_list = ["Education", "Above/Below 50k"]

# Create a GroupBy object using this list
gb = adult.groupby(by=user_list)

# Find the mean for the variable "Hours/Week" for each group - Be efficient!
print(gb["Hours/Week"].mean())

Education      Above/Below 50k
 10th           <=50K             36.574053
                >50K              43.774194
 11th           <=50K             33.322870
                >50K              45.133333
 12th           <=50K             35.035000
                >50K              44.818182
 1st-4th        <=50K             37.864198
                >50K              48.833333
 5th-6th        <=50K             38.539432
                >50K              46.000000
 7th-8th        <=50K             38.830033
                >50K              47.500000
 9th            <=50K             37.667351
                >50K              44.851852
 Assoc-acdm     <=50K             39.264339
                >50K              44.256604
 Assoc-voc      <=50K             40.817826
                >50K              43.853186
 Bachelors      <=50K             40.586152
                >50K              45.475462
 Doctorate      <=50K             45.429907
                >50K              47.513072
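
One way to see why selecting the column first is the efficient choice: both orderings give the same result, but aggregating first computes a mean for every numeric column and then discards all but one. The sketch below checks this on a hypothetical DataFrame `toy`.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toy = pd.DataFrame({
    "Education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad"],
    "Above/Below 50k": ["<=50K", ">50K", "<=50K", "<=50K"],
    "Age": [30, 45, 28, 39],
    "Hours/Week": [40, 50, 35, 45],
})

toy_gb = toy.groupby(by=["Education", "Above/Below 50k"])

# Select first, then aggregate: only "Hours/Week" is averaged
select_then_mean = toy_gb["Hours/Week"].mean()

# Aggregate first, then select: every numeric column is averaged, then one is kept
mean_then_select = toy_gb.mean(numeric_only=True)["Hours/Week"]

print(select_then_mean)
print("Same result:", select_then_mean.equals(mean_then_select))
```

On a wide dataset the second form does proportionally more work for the same answer.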
 