## Subsetting
A data scientist can _slice_ out the 10 relevant features from a dataframe with hundreds of columns. Or they can _filter_ a dataframe to remove rows with incomplete data.

#### Data Scope and Question
Let's demonstrate dataframe operations using a dataframe of baby names.

In [1]:
import pandas as pd

In [2]:
baby = pd.read_csv("chapter6/babynames.txt")
baby

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
2020717,Ula,F,5,1880
2020718,Vannie,F,5,1880
2020719,Verona,F,5,1880
2020720,Vertie,F,5,1880


The data in the baby table comes from the US Social Security Administration (SSA), which records the baby name and birth sex for birth certificate purposes.

The SSA website describes the data in more detail. The passage most relevant to the data's limitations reads:

> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
>
> All data are from a 100% sample of our records on Social Security card applications as of March 2021.

#### Dataframes and Indices
Row labels in a dataframe can also be strings, as in the example below, where each row is labeled with the dog breed name.

In [5]:
dogs = pd.read_csv("chapter6/dogs.txt")
dogs.set_index('breed', inplace=True)
dogs

breed,grooming,food_cost,kids,size
Labrador Retriever,weekly,466.0,high,medium
German Shepherd,weekly,466.0,medium,large
Beagle,daily,324.0,high,small
Golden Retriever,weekly,466.0,high,medium
Yorkshire Terrier,daily,324.0,low,small
Bulldog,weekly,466.0,medium,medium
Boxer,weekly,466.0,high,medium


In [7]:
# pandas stores the row labels in a special pd.Index object
dogs.index

Index(['Labrador Retriever', 'German Shepherd', 'Beagle', 'Golden Retriever',
       'Yorkshire Terrier', 'Bulldog', 'Boxer'],
      dtype='object', name='breed')
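As a small sketch of how row labels behave, the cell below rebuilds the first three rows of the dogs table by hand (instead of reading the file, so that it runs on its own) and shows that setting the index moves ```breed``` out of the regular columns. The name ```dogs_small``` is an assumption made for illustration only.

```python
import pandas as pd

# Hand-built copy of the first three rows of the dogs table shown above
dogs_small = pd.DataFrame({
    'breed': ['Labrador Retriever', 'German Shepherd', 'Beagle'],
    'grooming': ['weekly', 'weekly', 'daily'],
    'food_cost': [466.0, 466.0, 324.0],
})

# set_index returns a new dataframe whose row labels are the breed names
dogs_small = dogs_small.set_index('breed')

# 'breed' is now the index, not a regular column
print(list(dogs_small.columns))   # ['grooming', 'food_cost']
print(dogs_small.index.name)      # breed

# Row labels can be used directly with .loc
print(dogs_small.loc['Beagle', 'food_cost'])  # 324.0
```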

#### Slicing

In [8]:
baby

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
2020717,Ula,F,5,1880
2020718,Vannie,F,5,1880
2020719,Verona,F,5,1880
2020720,Vertie,F,5,1880


```.loc``` lets us select rows and columns using their labels.

In [9]:
baby.loc[1, 'Name']

'Noah'

To slice multiple rows or columns, we can use Python slice syntax instead of individual values:

In [10]:
baby.loc[0:3, 'Name': 'Count']

Unnamed: 0,Name,Sex,Count
0,Liam,M,19659
1,Noah,M,18252
2,Oliver,M,14147
3,Elijah,M,13034
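Notice that the output includes row ```3``` and the ```Count``` column: label-based slices keep both endpoints, unlike ordinary Python slices. The sketch below illustrates the difference on a hand-built copy of the first five rows; ```baby_small``` is a name we made up for this example.

```python
import pandas as pd

# First five rows of the baby table shown above, rebuilt by hand
baby_small = pd.DataFrame({
    'Name': ['Liam', 'Noah', 'Oliver', 'Elijah', 'William'],
    'Sex': ['M', 'M', 'M', 'M', 'M'],
    'Count': [19659, 18252, 14147, 13034, 12541],
    'Year': [2020] * 5,
})

# .loc slices by label and keeps BOTH endpoints: rows 0, 1, 2, 3
by_label = baby_small.loc[0:3, 'Name':'Count']
print(by_label.shape)          # (4, 3)

# A plain Python list slice with the same bounds drops the endpoint
names = list(baby_small['Name'])
print(names[0:3])              # ['Liam', 'Noah', 'Oliver']
```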


To get an entire column of data, we can pass an empty slice as the first argument:

In [13]:
baby.loc[:, 'Count']

0          19659
1          18252
2          14147
3          13034
4          12541
           ...  
2020717        5
2020718        5
2020719        5
2020720        5
2020721        5
Name: Count, Length: 2020722, dtype: int64

Selecting out a single row or column of a dataframe produces a ```pd.Series``` object:

In [14]:
counts = baby.loc[:, 'Count']
type(counts).__name__

'Series'

In [15]:
# Shorthand for selecting columns:
baby['Name']

0             Liam
1             Noah
2           Oliver
3           Elijah
4          William
            ...   
2020717        Ula
2020718     Vannie
2020719     Verona
2020720     Vertie
2020721      Wilma
Name: Name, Length: 2020722, dtype: object

In [16]:
baby[['Name', 'Count']]

Unnamed: 0,Name,Count
0,Liam,19659
1,Noah,18252
2,Oliver,14147
3,Elijah,13034
4,William,12541
...,...,...
2020717,Ula,5
2020718,Vannie,5
2020719,Verona,5
2020720,Vertie,5
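The two bracket forms above return different kinds of objects. As a rough illustration on a three-row copy of the table (the variable names here are our own), a single label gives a ```pd.Series``` while a list of labels gives a dataframe, even when the list holds only one label.

```python
import pandas as pd

# Three rows of the baby table shown above, rebuilt by hand
baby_small = pd.DataFrame({
    'Name': ['Liam', 'Noah', 'Oliver'],
    'Count': [19659, 18252, 14147],
})

one_col = baby_small['Name']        # a single column label -> Series
one_col_df = baby_small[['Name']]   # a list of labels -> DataFrame
one_row = baby_small.loc[1, :]      # a single row -> Series

print(type(one_col).__name__)       # Series
print(type(one_col_df).__name__)    # DataFrame
print(type(one_row).__name__)       # Series
```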


In [17]:
dogs

breed,grooming,food_cost,kids,size
Labrador Retriever,weekly,466.0,high,medium
German Shepherd,weekly,466.0,medium,large
Beagle,daily,324.0,high,small
Golden Retriever,weekly,466.0,high,medium
Yorkshire Terrier,daily,324.0,low,small
Bulldog,weekly,466.0,medium,medium
Boxer,weekly,466.0,high,medium


```.iloc``` uses the _positions_ of rows and columns rather than labels.

In [18]:
dogs.iloc[0:3, 0:2]

breed,grooming,food_cost
Labrador Retriever,weekly,466.0
German Shepherd,weekly,466.0
Beagle,daily,324.0


In [19]:
dogs.loc['Labrador Retriever': 'Beagle', 'grooming': 'food_cost']

breed,grooming,food_cost
Labrador Retriever,weekly,466.0
German Shepherd,weekly,466.0
Beagle,daily,324.0
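The two cells above return the same rows and columns even though their slice bounds look different: position slices stop before the end position, while label slices include the end label. Here is a minimal sketch that checks this on a hand-built, four-row version of the dogs table (```dogs_small``` is a name chosen for this example).

```python
import pandas as pd

# First four rows of the dogs table, with breed names as row labels
dogs_small = pd.DataFrame(
    {
        'grooming': ['weekly', 'weekly', 'daily', 'weekly'],
        'food_cost': [466.0, 466.0, 324.0, 466.0],
        'kids': ['high', 'medium', 'high', 'high'],
    },
    index=pd.Index(
        ['Labrador Retriever', 'German Shepherd', 'Beagle', 'Golden Retriever'],
        name='breed',
    ),
)

# Positions 0, 1, 2 and columns 0, 1 (end positions are excluded)
by_position = dogs_small.iloc[0:3, 0:2]

# Labels up to and including 'Beagle' and 'food_cost'
by_label = dogs_small.loc['Labrador Retriever':'Beagle', 'grooming':'food_cost']

print(by_position.equals(by_label))  # True
```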


#### Filtering Rows
Let's say we want to find the most popular baby names in 2020. To do this, we can filter rows to keep only the rows where the ```Year``` is 2020.

In [20]:
# Get a Series with the Year data
baby['Year']

0          2020
1          2020
2          2020
3          2020
4          2020
           ... 
2020717    1880
2020718    1880
2020719    1880
2020720    1880
2020721    1880
Name: Year, Length: 2020722, dtype: int64

In [21]:
# Compare with 2020
baby['Year'] == 2020

0           True
1           True
2           True
3           True
4           True
           ...  
2020717    False
2020718    False
2020719    False
2020720    False
2020721    False
Name: Year, Length: 2020722, dtype: bool

Now we tell ```pandas``` to keep only the rows where the comparison evaluated to ```True```:

In [23]:
baby_2020 = baby.loc[baby['Year'] == 2020, :]

In [26]:
baby_2020

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
31265,Zykeria,F,5,2020
31266,Zylani,F,5,2020
31267,Zylynn,F,5,2020
31268,Zynique,F,5,2020


In [25]:
# Filtering has a shorthand without using .loc:
baby[baby['Year'] == 2020]

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
2,Oliver,M,14147,2020
3,Elijah,M,13034,2020
4,William,M,12541,2020
...,...,...,...,...
31265,Zykeria,F,5,2020
31266,Zylani,F,5,2020
31267,Zylynn,F,5,2020
31268,Zynique,F,5,2020
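To see what the filter is doing, it can help to look at the boolean Series on its own. The sketch below uses four rows copied from the table above (two from 2020, two from 1880); the names ```baby_small``` and ```is_2020``` are assumptions made for this example.

```python
import pandas as pd

# A few rows taken from the baby table above (two different years)
baby_small = pd.DataFrame({
    'Name': ['Liam', 'Noah', 'Ula', 'Vannie'],
    'Sex': ['M', 'M', 'F', 'F'],
    'Count': [19659, 18252, 5, 5],
    'Year': [2020, 2020, 1880, 1880],
})

is_2020 = baby_small['Year'] == 2020    # boolean Series, one entry per row
print(is_2020.tolist())                 # [True, True, False, False]
print(is_2020.sum())                    # 2, since True counts as 1

# Both forms keep exactly the rows where the mask is True
with_loc = baby_small.loc[is_2020, :]
shorthand = baby_small[is_2020]
print(with_loc.equals(shorthand))       # True
```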


Finally, to find the most common names in 2020, sort the dataframe by ```Count``` in descending order:

In [27]:
(baby[baby['Year'] == 2020]
    .sort_values('Count', ascending=False)
    .head(7)
)

Unnamed: 0,Name,Sex,Count,Year
0,Liam,M,19659,2020
1,Noah,M,18252,2020
13911,Emma,F,15581,2020
2,Oliver,M,14147,2020
13912,Ava,F,13084,2020
3,Elijah,M,13034,2020
13913,Charlotte,F,13003,2020
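The row labels in this output (```0```, ```1```, ```13911```, and so on) are no longer in order because sorting moves each row together with its original label. A small sketch with five of the rows above shows the same effect; ```baby_2020_small``` is a hypothetical name for this hand-built table.

```python
import pandas as pd

# Five rows from the 2020 output above, keeping their original row labels
baby_2020_small = pd.DataFrame(
    {
        'Name': ['Liam', 'Noah', 'Oliver', 'Emma', 'Ava'],
        'Sex': ['M', 'M', 'M', 'F', 'F'],
        'Count': [19659, 18252, 14147, 15581, 13084],
    },
    index=[0, 1, 2, 13911, 13912],
)

top3 = (baby_2020_small
    .sort_values('Count', ascending=False)
    .head(3)
)

print(top3['Name'].tolist())   # ['Liam', 'Noah', 'Emma']
print(top3.index.tolist())     # [0, 1, 13911]
```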


#### Example: How Recently Has Luna Become a Popular Name?
The NYT article mentions that the name Luna was almost nonexistent before 2000 but has since grown to become a very popular name for girls. When exactly did Luna become popular?

When approaching a data manipulation task, we recommend breaking the problem down into smaller steps. For this question, the steps could be:
1. Filter: keep only rows with ```'Luna'``` in the ```Name``` column.
2. Filter: keep only rows with ```'F'``` in the ```Sex``` column.
3. Slice: keep the ```Count``` and ```Year``` columns.

In [31]:
luna = baby[baby['Name'] == 'Luna'] # [1]
luna = luna[luna['Sex'] == 'F'] # [2]
luna = luna[['Count', 'Year']] # [3]
luna

Unnamed: 0,Count,Year
13923,7770,2020
45366,7772,2019
77393,6929,2018
109741,5351,2017
142368,3677,2016
...,...,...
2009654,15,1885
2011900,18,1884
2014083,17,1883
2018187,18,1881
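The three steps can also be written as a single ```.loc``` call by combining the two conditions with the ```&``` operator. This is only a sketch on toy data: the two female Luna rows and the Emma row come from the outputs above, but the male Luna row is invented so that the second condition has something to remove.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Luna', 'Luna', 'Emma', 'Luna'],
    'Sex': ['F', 'F', 'F', 'M'],
    'Count': [7770, 7772, 15581, 10],   # the last row (Luna, M, 10) is invented
    'Year': [2020, 2019, 2020, 2020],
})

# Steps 1 and 2 combined: each comparison needs its own parentheses
is_luna_f = (toy['Name'] == 'Luna') & (toy['Sex'] == 'F')

# Step 3 in the same call: rows by mask, columns by a list of labels
luna_toy = toy.loc[is_luna_f, ['Count', 'Year']]

print(luna_toy['Count'].tolist())   # [7770, 7772]
print(list(luna_toy.columns))       # ['Count', 'Year']
```

Writing the steps separately, as in the cell above, makes each intermediate result easy to inspect, while the combined form is more compact once the logic is settled.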


In [33]:
import plotly.express as px

In [37]:
px.line(luna, x='Year', y='Count', width=350, height=350, title='Luna Popularity Trend')

If someone tells you that their name is Luna, you can take a pretty good guess at their age even without any other information about them!

In [38]:
siri = (baby.query('Name == "Siri"')
        .query('Sex == "F"'))
px.line(siri, x='Year', y='Count', width=350, height=350, title='Siri Popularity Trend')
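The ```.query``` method used for Siri is another way to write the same kind of filter as the boolean comparisons used for Luna. The sketch below compares the two styles on toy data (the male Luna row is invented for illustration, and the variable names are our own).

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Luna', 'Luna', 'Emma', 'Luna'],
    'Sex': ['F', 'F', 'F', 'M'],
    'Count': [7770, 7772, 15581, 10],   # the last row (Luna, M, 10) is invented
    'Year': [2020, 2019, 2020, 2020],
})

# Style 1: chained .query calls, as in the Siri cell
chained = toy.query('Name == "Luna"').query('Sex == "F"')

# Style 2: boolean filtering, as in the Luna cell
masked = toy[toy['Name'] == 'Luna']
masked = masked[masked['Sex'] == 'F']

# Style 3: one query string with both conditions
single = toy.query('Name == "Luna" and Sex == "F"')

print(chained.equals(masked))   # True
print(chained.equals(single))   # True
```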

Siri also happens to be the name of Apple's voice assistant, which was introduced in 2011. Let's draw a vertical line at the year 2011 and take a look...

In [40]:
fig = px.line(siri, x='Year', y='Count', width=350, height=350, title='Siri Popularity Trend')
fig.add_vline(
    x=2011, line_color="red", line_dash="dashdot", line_width=4, opacity=0.7
)