# 2 Categorical pandas Series

Now it’s time to learn how to set, add, and remove categories from a Series. You’ll also explore how to update, rename, collapse, and reorder categories, before applying your new skills to clean and access other data within your DataFrame.

# Setting categories

After exploring the pandas Series "size" from the adoptable dogs dataset, you have decided that it should be an ordinal categorical variable. Creating such a variable takes a few steps. If these steps are performed out of order, you may not be able to access or use the necessary methods. The goal is to convert the "size" column from the dogs dataset into an ordered categorical pandas Series with the following categories: ["small", "medium", "large"].
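
The steps described above can be sketched as follows. This is a minimal illustration on a made-up five-row stand-in for the "size" column, not the real dataset, and it assumes the column starts out as plain strings. The conversion to the category dtype comes first because the .cat accessor is only available on categorical data.

```python
import pandas as pd

# Toy stand-in for the "size" column (made-up values, not the real dataset)
dogs = pd.DataFrame({"size": ["medium", "small", "large", "medium", "small"]})

# Step 1: convert to the category dtype so the .cat accessor becomes available
dogs["size"] = dogs["size"].astype("category")

# Step 2: set the categories in their natural order and mark them as ordered
dogs["size"] = dogs["size"].cat.set_categories(
    new_categories=["small", "medium", "large"],
    ordered=True,
)

print(dogs["size"].cat.categories)
print(dogs["size"].cat.ordered)
```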

# Adding categories

The owner of a local dog adoption agency has listings for almost 3,000 dogs. One of the most common questions they have been receiving lately is: "What type of area was the dog previously kept in?". You are setting up a pipeline to do some analysis and want to look into what information is available regarding the "keep_in" variable. Both pandas, as pd, and the dogs dataset have been preloaded.

# Instructions:

- Print the frequency of the responses in the "keep_in" variable and make sure the count of NaN values is shown.

In [7]:
import pandas as pd
dogs = pd.read_csv("ShelterDogs.csv")

# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

both flat and garden    1224
NaN                     1021
garden                   510
flat                     182
Name: keep_in, dtype: int64


- Convert the "keep_in" variable to a categorical Series.

In [8]:
# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")

both flat and garden    1224
NaN                     1021
garden                   510
flat                     182
Name: keep_in, dtype: int64


- Add the list of new categories provided by the adoption agency, new_categories, to the "keep_in" column.

In [9]:
# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")

# Add new categories
new_categories = ["Unknown History", "Open Yard (Countryside)"]
dogs["keep_in"] = dogs["keep_in"].cat.add_categories(new_categories)

both flat and garden    1224
NaN                     1021
garden                   510
flat                     182
Name: keep_in, dtype: int64


- Print the frequency counts of the keep_in column and do not drop NaN values.

In [None]:
# Check frequency counts while also printing the NaN count
print(dogs["keep_in"].value_counts(dropna=False))

# Switch to a categorical variable
dogs["keep_in"] = dogs["keep_in"].astype("category")

# Add new categories
new_categories = ["Unknown History", "Open Yard (Countryside)"]
dogs["keep_in"] = dogs["keep_in"].cat.add_categories(new_categories)

# Check frequency counts one more time
print(dogs["keep_in"].value_counts(dropna=False))
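
To see what adding categories does to the frequency table, here is a small sketch on a made-up five-value Series (the values are illustrative, not taken from the dataset). Newly added categories should appear with a count of zero, and NaN stays a missing value rather than becoming a category.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "keep_in" column (made-up values)
keep_in = pd.Series(
    ["garden", "flat", np.nan, "garden", "both flat and garden"],
    dtype="category",
)

# Add the two new categories; no existing value changes
new_categories = ["Unknown History", "Open Yard (Countryside)"]
keep_in = keep_in.cat.add_categories(new_categories)

# The new categories are listed even though no row uses them yet
print(keep_in.value_counts(dropna=False))
```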

# Removing categories

Before adopting dogs, parents might want to know whether or not a new dog likes children. When looking at the adoptable dogs dataset, dogs, you notice that the frequency of responses for the categorical Series "likes_children" looks like this:

maybe     1718
yes       1172
no          47

The owner of the data wants to convert all "maybe" responses to "no", as it would be unsafe to let a family adopt a dog if it doesn't like children. The code to convert all "maybe" to "no" is provided in Step 1. However, the option for "maybe" still remains as a category.

# Instructions:

- Print out the categories of the categorical Series dogs["likes_children"].

In [None]:
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Print out categories
print(dogs["likes_children"].cat.categories)

- Print out the frequency table for "likes_children" to see if any "maybe" responses remain.

In [None]:
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Print out categories
print(dogs["likes_children"].cat.categories)

# Print the frequency table
print(dogs["likes_children"].value_counts())

- Remove the "maybe" category from the Series.

In [None]:
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Print out categories
print(dogs["likes_children"].cat.categories)

# Print the frequency table
print(dogs["likes_children"].value_counts())

# Remove the "maybe" category
dogs["likes_children"] = dogs["likes_children"].cat.remove_categories(removals=["maybe"])
print(dogs["likes_children"].value_counts())

- Print out the categories of "likes_children" one more time.

In [None]:
# Set "maybe" to be "no"
dogs.loc[dogs["likes_children"] == "maybe", "likes_children"] = "no"

# Print out categories
print(dogs["likes_children"].cat.categories)

# Print the frequency table
print(dogs["likes_children"].value_counts())

# Remove the "maybe" category
dogs["likes_children"] = dogs["likes_children"].cat.remove_categories(["maybe"])
print(dogs["likes_children"].value_counts())

# Print the categories one more time
print(dogs["likes_children"].cat.categories)
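
The order of the two operations matters. The sketch below, on a made-up four-value Series, contrasts removing the category before recoding (the affected rows become NaN) with recoding first and removing afterwards. The variable names too_early and recoded are only for illustration.

```python
import pandas as pd

# Toy stand-in for the "likes_children" column (made-up values)
raw = pd.Series(["maybe", "yes", "no", "maybe"], dtype="category")

# Removing the category first turns every "maybe" row into NaN
too_early = raw.cat.remove_categories(["maybe"])
print(too_early.isna().sum())

# Recoding first keeps the rows, then the empty category can be dropped
recoded = raw.copy()
recoded[recoded == "maybe"] = "no"
recoded = recoded.cat.remove_categories(["maybe"])
print(recoded.value_counts())
```

When the goal is simply to drop every category that no row uses, .cat.remove_unused_categories() is an alternative that does not require naming the category.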

# Renaming categories

The likes_children column of the adoptable dogs dataset needs an update. Here are the current frequency counts:

Maybe?    1718
yes       1172
no          47

Two things that stick out are the differences in capitalization and the ? found in the Maybe? category. The data should be cleaner than this and you are being asked to make a few changes.

# Instructions:

- Create a dictionary called my_changes that will update the Maybe? category to Maybe.
- Rename the categories in likes_children using the my_changes dictionary.
- Update the categories one more time so that all categories are uppercase using the .upper() method.
- Print out the categories of the updated likes_children Series.

In [None]:
# Create the my_changes dictionary
my_changes = {"Maybe?": "Maybe"}

# Rename the categories listed in the my_changes dictionary
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(my_changes)

# Use a lambda function to convert all categories to uppercase using upper()
dogs["likes_children"] = dogs["likes_children"].cat.rename_categories(lambda c: c.upper())

# Print the list of categories
print(dogs["likes_children"].cat.categories)
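
The same two renaming steps can be tried on a small made-up Series. This sketch passes str.upper directly as the callable, which should behave like the lambda above. Renaming changes the labels and leaves the rows untouched. The new names are expected to stay unique, which is why merging two categories into one calls for a different approach.

```python
import pandas as pd

# Toy stand-in for the "likes_children" column (made-up values)
likes = pd.Series(["Maybe?", "yes", "no", "yes"], dtype="category")

# Rename a single category with a dictionary; unlisted categories are kept as they are
likes = likes.cat.rename_categories({"Maybe?": "Maybe"})

# Rename every category with a callable
likes = likes.cat.rename_categories(str.upper)

print(list(likes.cat.categories))
print(likes.tolist())
```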

# Collapsing categories

One problem that users of a local dog adoption website have voiced is that there are too many options. As they look through the different types of dogs, they are getting lost in the overwhelming amount of choice. To simplify some of the data, you are going through each column and collapsing data if appropriate. To preserve the original data, you are going to make new updated columns in the dogs dataset. You will start with the coat column. The frequency table is listed here:

short          1969
medium          565
wirehaired      220
long            180
medium-long       3

# Instructions:

- Create a dictionary named update_coats to map both wirehaired and medium-long to medium.
- Collapse the categories listed in this new dictionary and save this as a new column, coat_collapsed.
- Convert this new column into a categorical Series.
- Print the frequency table of this new Series.

In [15]:
# Create the update_coats dictionary
update_coats = {
  "wirehaired": "medium",
  "medium-long": "medium"
}

# Create a new column, coat_collapsed
dogs["coat_collapsed"] = dogs["coat"].replace(update_coats)

# Convert the column to categorical
dogs["coat_collapsed"] = dogs["coat_collapsed"].astype("category")

# Print the frequency table
print(dogs["coat_collapsed"].value_counts())

short     1972
medium     785
long       180
Name: coat_collapsed, dtype: int64
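
As a small check of the collapsing pattern, the sketch below applies the same dictionary to a made-up six-value Series of plain strings and converts to category afterwards. Starting from strings is an assumption made here to sidestep version-specific behavior of .replace() on categorical data.

```python
import pandas as pd

# Toy stand-in for the "coat" column, stored as plain strings (made-up values)
coat = pd.Series(["short", "wirehaired", "medium", "medium-long", "long", "short"])

# Map the two rare coat types onto "medium"
update_coats = {"wirehaired": "medium", "medium-long": "medium"}

# Replace first, then convert, so the category list only contains the collapsed labels
coat_collapsed = coat.replace(update_coats).astype("category")

print(coat_collapsed.value_counts())
```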


# Reordering categories in a Series

The owner of a local dog adoption agency has asked you to take a look at her data on adoptable dogs. She is specifically interested in the size of the dogs in her dataset and wants to know if there are differences in other variables, given a dog's size. The adoptable dogs dataset has been loaded as dogs and the "size" variable has already been saved as a categorical column.

# Instructions:

- Print out the current categories of the "size" pandas Series.

In [None]:
# Print out the current categories of the size variable
print(dogs['size'].cat.categories)

- Reorder the categories in the "size" column using the categories "small", "medium", "large"; do not set the ordered parameter.

In [None]:
# Print out the current categories of the size variable
print(dogs["size"].cat.categories)

# Reorder the categories using the list provided
dogs["size"] = dogs["size"].cat.reorder_categories(
    new_categories=["small", "medium", "large"]
)


- Update the reorder_categories() method so that pandas knows the variable has a natural order.

In [None]:
# Print out the current categories of the size variable
print(dogs["size"].cat.categories)

# Reorder the categories, specifying the Series is ordinal
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True
)

- Add an argument to the method so that the "size" column is updated without needing to save it to itself.

In [None]:
# Print out the current categories of the size variable
print(dogs["size"].cat.categories)

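# Reorder the categories, specifying the Series is ordinal, and overwriting the original series
# Note: inplace was deprecated in pandas 1.3 and removed in 2.0; on newer versions, assign the result back instead
dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True,
  inplace=True
)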
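
To illustrate what ordered=True buys you, here is a minimal sketch on a made-up four-value Series. It assigns the result back instead of using inplace, so it should run on both older and newer pandas versions. With an ordered categorical, comparisons, min/max, and sorting follow the category order rather than alphabetical order.

```python
import pandas as pd

# Toy stand-in for the "size" column (made-up values)
size = pd.Series(["large", "small", "medium", "small"], dtype="category")

# Reorder and mark as ordinal, assigning the result back
size = size.cat.reorder_categories(["small", "medium", "large"], ordered=True)

# Smallest and largest follow the category order
print(size.min(), size.max())

# Sorting follows small < medium < large
print(size.sort_values().tolist())

# Comparisons against a category label are allowed on ordered categoricals
print((size > "small").tolist())
```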

# Using .groupby() after reordering

It is now time to run some analyses on the adoptable dogs dataset that focus on the "size" of the dog. You have already developed some code to reorder the categories. In this exercise, you will develop two similar .groupby() statements to help better understand the effect of "size" on other variables. dogs has been preloaded for you.

# Instructions:

- Print out the frequency table of "sex" for each category of the "size" column.
- Print out the frequency table of "keep_in" for each category of the "size" column.

In [None]:
# Previous code (assigned back so it also runs on pandas versions without inplace)
dogs["size"] = dogs["size"].cat.reorder_categories(
  new_categories=["small", "medium", "large"],
  ordered=True
)

# How many Male/Female dogs are available of each size?
print(dogs.groupby(by="size")["sex"].value_counts())

# Do larger dogs need more room to roam?
print(dogs.groupby(by="size")["keep_in"].value_counts())
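
The effect of reordering on grouped output can be sketched on a made-up six-row DataFrame. Passing observed=True is a choice made for this example: it limits the result to categories that actually occur and avoids relying on the default, which has changed between pandas versions.

```python
import pandas as pd

# Toy stand-in for the dogs dataset (made-up rows)
dogs = pd.DataFrame(
    {
        "size": ["large", "small", "medium", "small", "large", "small"],
        "sex": ["male", "female", "male", "male", "female", "male"],
    }
)
dogs["size"] = dogs["size"].astype("category").cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)

# Groups come back in category order, not alphabetical order
group_sizes = dogs.groupby("size", observed=True).size()
print(group_sizes)

# Frequency table of "sex" within each size
sex_by_size = dogs.groupby("size", observed=True)["sex"].value_counts()
print(sex_by_size)
```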

# Cleaning variables

Users of an online entry system used to have the ability to freely type in responses to questions. This is causing issues when trying to analyze the adoptable dogs dataset, dogs. Here is the current frequency table of the "sex" column:

male      1672
female    1249
 MALE        10
 FEMALE       5
Malez        1

Now that the system only takes responses of "female" and "male", you want this variable to match the updated system.

# Instructions:

- Update the misspelled response "Malez" to be "male" by creating the replacement map, replace_map.

In [21]:
# Fix the misspelled word
replace_map = {"Malez": "male"}

- Replace all occurrences of "Malez" with "male" by using replace_map.

In [22]:
# Fix the misspelled word 
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

print(dogs["sex"].value_counts())

male      1681
female    1256
Name: sex, dtype: int64


- Remove the leading spaces of the " MALE" and " FEMALE" responses.

In [23]:
# Fix the misspelled word
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

# Strip away leading whitespace
dogs["sex"] = dogs["sex"].str.strip()

print(dogs["sex"].value_counts())

male      1681
female    1256
Name: sex, dtype: int64


- Convert all responses to be strictly lowercase.

In [24]:
# Fix the misspelled word
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

# Strip away leading whitespace
dogs["sex"] = dogs["sex"].str.strip()

# Make all responses lowercase
dogs["sex"] = dogs["sex"].str.lower()

print(dogs["sex"].value_counts())

male      1681
female    1256
Name: sex, dtype: int64


- Convert the "sex" column to a categorical pandas Series.

In [25]:
# Fix the misspelled word
replace_map = {"Malez": "male"}

# Update the sex column using the created map
dogs["sex"] = dogs["sex"].replace(replace_map)

# Strip away leading whitespace
dogs["sex"] = dogs["sex"].str.strip()

# Make all responses lowercase
dogs["sex"] = dogs["sex"].str.lower()

# Convert to a categorical Series
dogs["sex"] = dogs["sex"].astype("category")

print(dogs["sex"].value_counts())

male      1681
female    1256
Name: sex, dtype: int64
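
The full cleaning sequence can be checked on a made-up six-value Series that contains each kind of problem from the frequency table. The string fixes run while the data is still plain text, and the conversion to category comes last so that only the two clean labels end up as categories.

```python
import pandas as pd

# Toy stand-in for the "sex" column (made-up values mirroring the messy responses)
sex = pd.Series(["male", "female", " MALE", " FEMALE", "Malez", "male"])

# Fix the misspelled response
replace_map = {"Malez": "male"}
sex = sex.replace(replace_map)

# Strip surrounding whitespace, lowercase everything, then convert
sex = sex.str.strip().str.lower().astype("category")

print(sex.value_counts())
```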


# Accessing and filtering data

You are working on a Python application to display information about the dogs available for adoption at your local animal shelter. Some of the variables of interest, such as "breed", "size", and "coat", are saved as categorical variables. In order for this application to work properly, you need to be able to access and filter data using these columns.

The ID variable has been set as the index of the pandas DataFrame dogs.

- Print the "coat" value for the dog with an ID of 23807.

In [None]:
# Print the category of the coat for ID 23807
print(dogs.loc[23807, "coat"])

short

- For dogs with a long "coat", print the number of each "sex".

In [27]:
# Find the count of male and female dogs who have a "long" coat
print(dogs.loc[dogs["coat"] == "long", "sex"].value_counts())

male      124
female     56
Name: sex, dtype: int64


- Print the average age of dogs with a "breed" of "English Cocker Spaniel".

In [28]:
# Print the mean age of dogs with a breed of "English Cocker Spaniel"
print(dogs.loc[dogs["breed"] == "English Cocker Spaniel", "age"].mean())

8.186153846153847


- Filter to the dogs with "English" in their "breed" name using the .str.contains() method.

In [29]:
# Count the number of dogs that have "English" in their breed name
print(dogs[dogs["breed"].str.contains("English", regex=False)].shape[0])

35
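
The access patterns above can be tried on a made-up four-row DataFrame. The IDs, ages, and most breed names below are invented for illustration. The sketch uses .loc with an index label, a boolean mask on a categorical column, and .str.contains() on a categorical column whose categories are strings.

```python
import pandas as pd

# Toy stand-in for the dogs dataset, indexed by a made-up ID
dogs = pd.DataFrame(
    {
        "breed": ["English Cocker Spaniel", "Unknown Mix", "English Greyhound", "Vizsla"],
        "coat": ["long", "short", "short", "short"],
        "sex": ["male", "female", "female", "male"],
        "age": [8.0, 2.5, 4.0, 6.0],
    },
    index=pd.Index([101, 102, 103, 104], name="ID"),
)
for col in ["breed", "coat", "sex"]:
    dogs[col] = dogs[col].astype("category")

# Single value by index label and column name
print(dogs.loc[101, "coat"])

# Frequency of "sex" among dogs with a short coat
short_sex = dogs.loc[dogs["coat"] == "short", "sex"].value_counts()
print(short_sex)

# Rows whose breed contains the literal text "English"
english = dogs[dogs["breed"].str.contains("English", regex=False)]
print(english.shape[0])
```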
