# Subsetting `DataFrames` -- Exercises

## Goal

Practice `pandas` subsetting operations.

## Exercises

### 0. Import `pandas` and load the penguins data set

In [1]:
import pandas as pd

penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv")

### 1. Get the data from the flipper length column.  What is the type of the output when you select a single column from a `DataFrame`?

In [2]:
# When subsetting to specific columns, it's useful to print out the column names for reference
list(penguins.columns)

['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex',
 'year']

In [3]:
# Just index the DataFrame with the column name
flipper_len = penguins["flipper_length_mm"]
flipper_len.head(10)

0    181.0
1    186.0
2    195.0
3      NaN
4    193.0
5    190.0
6    181.0
7    195.0
8    193.0
9    190.0
Name: flipper_length_mm, dtype: float64

In [4]:
# Use the type function to get the type of the output
# Here, the type is a pandas Series object since we only selected a single column
type(flipper_len)

pandas.core.series.Series
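
The difference between single and double brackets is easy to check on a small frame. The sketch below uses a hypothetical three-row stand-in for the penguins data (the name `toy` and its values are made up for illustration) so that it runs without a download.

```python
import pandas as pd

# Hypothetical three-row stand-in for the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "flipper_length_mm": [181.0, 217.0, 192.0],
})

as_series = toy["flipper_length_mm"]    # a single column name returns a Series
as_frame = toy[["flipper_length_mm"]]   # a list with one name returns a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Wrapping a single name in a list is a handy way to keep a one-column result two-dimensional when later code expects a `DataFrame`.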

### 2. Subset the penguins data to just the columns containing length/depth measurements.  What is the type of the output when you select multiple columns from a `DataFrame`?

In [5]:
# Subsetting to multiple columns requires wrapping the column names in a Python list with []
length_data = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm"]]
length_data.head(10)

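|   | bill_length_mm | bill_depth_mm | flipper_length_mm |
|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 |
| 1 | 39.5 | 17.4 | 186.0 |
| 2 | 40.3 | 18.0 | 195.0 |
| 3 | NaN | NaN | NaN |
| 4 | 36.7 | 19.3 | 193.0 |
| 5 | 39.3 | 20.6 | 190.0 |
| 6 | 38.9 | 17.8 | 181.0 |
| 7 | 39.2 | 19.6 | 195.0 |
| 8 | 34.1 | 18.1 | 193.0 |
| 9 | 42.0 | 20.2 | 190.0 |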


In [6]:
# Since multiple columns were selected, the output is a DataFrame
type(length_data)

pandas.core.frame.DataFrame

### 3. What are the names of the different islands represented in the data set?

In [7]:
# Select the island column, then get the unique values
list(penguins["island"].unique())

['Torgersen', 'Biscoe', 'Dream']
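
Two related `Series` methods are often useful alongside `unique`. The sketch below assumes a small made-up `island` column (the frame `toy` is hypothetical) that includes one missing value to show how each method treats it.

```python
import pandas as pd

# Hypothetical island column with one missing value
toy = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe", None]})

islands = toy["island"].unique()        # distinct values, missing value included
n_islands = toy["island"].nunique()     # count of distinct values, missing excluded by default
counts = toy["island"].value_counts()   # rows per island, missing excluded by default

print(n_islands)
print(counts)
```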

### 4. How many rows have missing body mass values?

Hint: You'll need to find (or guess) the name of a helper function very similar to one we used in the lesson.

In [8]:
# The isna function indicates whether or not a value is missing
penguins[penguins["body_mass_g"].isna()]

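|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 271 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 2009 |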


In [9]:
# It's easy to count the rows in the output above, but if there were many more,
# shape can be used to get the count programmatically
penguins[penguins["body_mass_g"].isna()].shape[0]

2
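
An alternative way to count missing values is to sum the boolean mask directly, since `True` counts as 1. This is a minimal sketch on a hypothetical five-row column (`toy` and its values are invented for illustration), comparing it with the `shape` approach used above.

```python
import pandas as pd

# Hypothetical body mass column with two missing values
toy = pd.DataFrame({"body_mass_g": [3750.0, None, 3250.0, None, 3450.0]})

mask = toy["body_mass_g"].isna()   # boolean Series, True where the value is missing

n_by_shape = toy[mask].shape[0]    # filter the rows, then count them
n_by_sum = int(mask.sum())         # sum the mask without building the subset

print(n_by_shape, n_by_sum)
```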

### 5. Get the subset of data that match ALL of the following criteria

* Penguins of the Gentoo or Chinstrap species
* Flipper length less than 200 mm
* Females only

In [10]:
# This is mainly an exercise in getting the syntax correct: each comparison needs
# its own parentheses because & binds more tightly than < and ==
penguins[(penguins["species"].isin(["Gentoo", "Chinstrap"])) &
         (penguins["flipper_length_mm"] < 200) &
         (penguins["sex"] == "female")]

|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 276 | Chinstrap | Dream | 46.5 | 17.9 | 192.0 | 3500.0 | female | 2007 |
| 279 | Chinstrap | Dream | 45.4 | 18.7 | 188.0 | 3525.0 | female | 2007 |
| 281 | Chinstrap | Dream | 45.2 | 17.8 | 198.0 | 3950.0 | female | 2007 |
| 282 | Chinstrap | Dream | 46.1 | 18.2 | 178.0 | 3250.0 | female | 2007 |
| 284 | Chinstrap | Dream | 46.0 | 18.9 | 195.0 | 4150.0 | female | 2007 |
| 286 | Chinstrap | Dream | 46.6 | 17.8 | 193.0 | 3800.0 | female | 2007 |
| 288 | Chinstrap | Dream | 47.0 | 17.3 | 185.0 | 3700.0 | female | 2007 |
| 290 | Chinstrap | Dream | 45.9 | 17.1 | 190.0 | 3575.0 | female | 2007 |
| 293 | Chinstrap | Dream | 58.0 | 17.8 | 181.0 | 3700.0 | female | 2007 |
| 294 | Chinstrap | Dream | 46.4 | 18.6 | 190.0 | 3450.0 | female | 2007 |
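
One way to keep a multi-condition filter readable is to give each condition its own named mask and combine them afterwards. The sketch below assumes a hypothetical five-row frame (`toy`, along with the mask names, is made up for illustration) with the same column names as the penguins data.

```python
import pandas as pd

# Hypothetical five-row stand-in for the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181.0, 217.0, 192.0, 205.0, 188.0],
    "sex": ["female", "female", "female", "female", "male"],
})

# Each mask is a boolean Series aligned with the rows of toy
is_species = toy["species"].isin(["Gentoo", "Chinstrap"])
is_short = toy["flipper_length_mm"] < 200
is_female = toy["sex"] == "female"

# & combines the masks element-wise; a row is kept only if all three are True
subset = toy[is_species & is_short & is_female]
print(subset)
```

Because each comparison is evaluated on its own line, no extra parentheses are needed when the masks are combined.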


### 6. If we only wanted to select the `species`, `flipper_length_mm`, and `sex` columns from the above exercise, how would we need to modify the code?

In [11]:
# To filter rows AND select specific columns, we need to use the .loc indexer:
# the row condition goes before the comma, the list of columns after it
penguins.loc[(penguins["species"].isin(["Gentoo", "Chinstrap"])) &
             (penguins["flipper_length_mm"] < 200) &
             (penguins["sex"] == "female"), ["species", "flipper_length_mm", "sex"]]

|   | species | flipper_length_mm | sex |
|---|---|---|---|
| 276 | Chinstrap | 192.0 | female |
| 279 | Chinstrap | 188.0 | female |
| 281 | Chinstrap | 198.0 | female |
| 282 | Chinstrap | 178.0 | female |
| 284 | Chinstrap | 195.0 | female |
| 286 | Chinstrap | 193.0 | female |
| 288 | Chinstrap | 185.0 | female |
| 290 | Chinstrap | 190.0 | female |
| 293 | Chinstrap | 181.0 | female |
| 294 | Chinstrap | 190.0 | female |
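
For comparison, the same rows and columns can also be reached by chaining two bracket selections. The sketch below uses a hypothetical three-row frame (`toy`, `rows`, and `cols` are names invented for illustration) to show that both routes return the same data when reading.

```python
import pandas as pd

# Hypothetical three-row stand-in for the penguins data
toy = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Chinstrap"],
    "flipper_length_mm": [181.0, 192.0, 188.0],
    "sex": ["female", "female", "male"],
})

rows = toy["sex"] == "female"   # boolean mask selecting rows
cols = ["species", "sex"]       # list of column names to keep

via_loc = toy.loc[rows, cols]   # rows and columns in a single indexing step
via_chain = toy[rows][cols]     # two separate steps: filter rows, then pick columns

print(via_loc.equals(via_chain))
```

The single `.loc` step is generally preferred, particularly when assigning values, because chained selection may operate on a temporary copy rather than the original frame.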
