# Subsetting `DataFrames` -- Exercises

## Goal

Practice `pandas` subsetting operations: selecting columns, filtering rows, and combining the two.

## Exercises

### 0. Import `pandas` and load the penguins data set

In [None]:
import pandas as pd

penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

### 1. Get the data from the column for the flipper length. What is the type of the output when you select a single column from a `DataFrame`?

In [None]:
# When subsetting to specific columns, it's useful to print out the column names for reference
list(penguins.columns)

In [None]:
# Just index the DataFrame with the column name
flipper_len = penguins["flipper_length_mm"]
flipper_len.head(10)

In [None]:
# Use the type function to get the type of the output
# Here, the type is a pandas Series object since we only selected a single column
type(flipper_len)
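As a small aside, the sketch below contrasts single-bracket and double-bracket selection of one column. It uses a made-up three-row stand-in table (`toy`, with invented values) rather than the real penguins data, so it runs without a network connection.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "flipper_length_mm": [181.0, 217.0, 195.0],
})

# Indexing with a single column name returns a Series
as_series = toy["flipper_length_mm"]

# Indexing with a list, even a one-element list, returns a DataFrame
as_frame = toy[["flipper_length_mm"]]

print(type(as_series).__name__)
print(type(as_frame).__name__)
```

The one-element list form can be handy when later code expects a `DataFrame` rather than a `Series`.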

### 2. Subset the penguins data to just the columns containing length/depth measurements.  What is the type of the output when you select multiple columns from a `DataFrame`?

In [None]:
# Subsetting to multiple columns requires wrapping the column names in a Python list with []
length_data = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]]
length_data.head(10)

In [None]:
# Since multiple columns were selected, the output is a DataFrame
type(length_data)

### 3. What are the names of the different islands represented in the data set?

In [None]:
# Select the island column, then get the unique values
list(penguins["island"].unique())
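Two related `Series` methods, `nunique` and `value_counts`, are often useful alongside `unique`. The sketch below uses an invented stand-in column (`toy`, including one deliberately missing value) to show how each treats missing data; it is an illustration, not part of the original exercise.

```python
import pandas as pd

# Hypothetical stand-in data; the missing value is added on purpose
toy = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe", None]})

# unique() keeps missing values, so drop them first if only names are wanted
island_names = sorted(toy["island"].dropna().unique())

# nunique() counts distinct values and ignores missing ones by default
n_islands = toy["island"].nunique()

# value_counts() tallies how many rows carry each value
rows_per_island = toy["island"].value_counts()

print(island_names)
print(n_islands)
print(rows_per_island)
```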

### 4. How many rows have missing body mass values?

Hint: You'll need to find (or guess) the name of a helper function very similar to one we used in the lesson.

In [None]:
# The isna function indicates whether or not a value is missing
penguins[penguins["body_mass_g"].isna()]

In [None]:
# With only a few rows, it's easy to count them by eye, but for larger outputs
# shape can be used to get the row count programmatically
penguins[penguins["body_mass_g"].isna()].shape[0]
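An alternative that skips building the filtered `DataFrame` is to sum the boolean mask directly, since `True` counts as 1. The sketch below checks that both approaches agree on an invented stand-in column (`toy`), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with two missing body mass values
toy = pd.DataFrame({"body_mass_g": [3750.0, np.nan, 3250.0, np.nan, 5000.0]})

# Summing the boolean mask counts the True entries
n_missing = int(toy["body_mass_g"].isna().sum())

# Same count via filtering and shape, as in the solution above
n_missing_via_shape = toy[toy["body_mass_g"].isna()].shape[0]

print(n_missing, n_missing_via_shape)
```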

### 5. Get the subset of data that match ALL of the following criteria

* Penguins of the Gentoo or Chinstrap species
* Flipper length less than 200 mm
* Females only

In [None]:
# This is mainly an exercise in getting the syntax correct
penguins[(penguins["species"].isin(["Gentoo", "Chinstrap"])) &
         (penguins["flipper_length_mm"] < 200) &
         (penguins["sex"] == "female")]
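One way to make a multi-condition filter easier to read is to give each boolean mask a name before combining them. The sketch below does this on an invented five-row stand-in table (`toy`); the mask names are illustrative choices, not part of the original solution.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Chinstrap", "Gentoo"],
    "flipper_length_mm": [181.0, 217.0, 195.0, 198.0, 199.0],
    "sex": ["female", "female", "female", "male", None],
})

# Each condition becomes its own boolean Series
is_target_species = toy["species"].isin(["Gentoo", "Chinstrap"])
has_short_flipper = toy["flipper_length_mm"] < 200
is_female = toy["sex"] == "female"  # a missing sex compares as False

# Combine with & (element-wise AND); no extra parentheses are needed here
# because each comparison was already evaluated when its mask was built
subset = toy[is_target_species & has_short_flipper & is_female]
print(subset)
```

When the comparisons are written inline instead, each one needs its own parentheses because `&` binds more tightly than `<` and `==`.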

### 6. If we only wanted to select the `species`, `flipper_length_mm`, and `sex` columns from the above exercise, how would we need to modify the code?

In [None]:
# To filter rows AND select specific columns in one step, use the .loc indexer: .loc[rows, columns]
penguins.loc[(penguins["species"].isin(["Gentoo", "Chinstrap"])) &
             (penguins["flipper_length_mm"] < 200) &
             (penguins["sex"] == "female"), ["species", "flipper_length_mm", "sex"]]
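For reading data, chaining a row filter and a column selection gives the same result as a single `.loc` call. The sketch below compares the two on an invented stand-in table (`toy`); `.loc` is generally the safer habit because it does both steps in one indexing operation, which matters when assigning values.

```python
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181.0, 217.0, 195.0, 198.0],
    "sex": ["female", "female", "female", "male"],
})

mask = toy["species"].isin(["Gentoo", "Chinstrap"]) & (toy["sex"] == "female")
cols = ["species", "sex"]

# One step: rows and columns together
via_loc = toy.loc[mask, cols]

# Two steps: filter rows, then select columns from the result
via_chain = toy[mask][cols]

print(via_loc.equals(via_chain))
```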