# 05. Selecting Subsets of Data from DataFrames with `.loc`

### Objectives

+ `.loc` can select rows, columns, or rows and columns simultaneously
+ `.loc` selects only by **label**

# Subset selection with `.loc`
The **`.loc`** indexer selects data in a different manner than *just the brackets*. We must learn its set of rules.

## Simultaneous row and column subset selection with `.loc`
**`.loc`** can select rows and columns simultaneously. You cannot do this with *just the brackets*. 

This is done by separating the row and column selections with a **comma**. The selection will look something like this:

```
df.loc[rows, cols]
```

## `.loc` only selects data by LABEL

Very importantly, **`.loc`** only selects data by the **LABEL** of the rows and columns. You must provide **`.loc`** with the label of the rows and/or columns you would like to select.
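
To make the label-only rule concrete, here is a minimal sketch. It builds a small hypothetical DataFrame (the names and ages are made up and are not taken from `sample_data.csv`) and shows that `.loc` accepts labels but raises a `KeyError` for an integer position when the index holds strings.

```python
import pandas as pd

# Hypothetical miniature DataFrame; the names and ages are made up
df = pd.DataFrame({"age": [30, 2, 54]}, index=["Jane", "Niko", "Dean"])

# Both selections are labels, so this works
niko_age = df.loc["Niko", "age"]

# 0 is an integer position, not a label in this index, so .loc rejects it
try:
    df.loc[0]
    position_accepted = True
except KeyError:
    position_accepted = False

print(niko_age, position_accepted)
```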

## Select two rows and three columns with `.loc`
If we wanted to select the rows **`Dean`** and **`Cornelia`** along with the columns **`age`**, **`state`**, and **`score`** we would do this:

In [None]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

In [None]:
rows = ['Dean', 'Cornelia']
cols = ['age', 'state', 'score']

df.loc[rows, cols]

## The possible types of selections for `.loc`
Row or column selections can be any of the following:

* A single label
* A list of labels
* A slice with labels

We can use any of these three for either row or column selections with `.loc`.

### Select two rows and a single column:
Here we use a list for the rows and a string for the column. The row selection is **`['Dean', 'Aaron']`** and the column selection is **`food`**. Note that this returns a Series because we are selecting exactly one column with a single label.

In [None]:
rows = ['Dean', 'Aaron']
cols = 'food'

df.loc[rows, cols]
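
The return type depends on whether the column selection is a single label or a list. The sketch below uses a small hypothetical DataFrame (made-up values, not the tutorial's CSV) to compare the two: a string gives a Series, while a one-item list keeps the result as a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration
df = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"], "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)

rows = ["Jane", "Aaron"]

as_series = df.loc[rows, "food"]    # single label -> Series
as_frame = df.loc[rows, ["food"]]   # one-item list -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```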

# Use slice notation to select a range of rows
We have seen slice notation when working with Python lists. The same notation is allowed with `.loc`. Let's choose all of the rows from `Jane` through `Penelope` with slice notation, along with the columns `state` and `color`.

In [None]:
cols = ['state', 'color']

df.loc['Jane':'Penelope', cols]

## Slice notation only works in brackets attached to the object
Python only allows us to use slice notation within the brackets that are attached to an object. If we try to assign slice notation to a variable outside of those brackets, we get a syntax error, as shown below.

In [None]:
rows = ['Jane':'Penelope']

## Use the `slice` function to separate out the selection in a different line
There is a built-in `slice` function that you can use to assign your selection to a variable. It takes the same three values, **start**, **stop**, and **step**, but this time as function arguments.

In [None]:
rows = slice('Jane', 'Penelope')
cols = ['state', 'color']

df.loc[rows, cols]
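
As a quick check that the two forms are interchangeable, the following sketch compares slice notation with a `slice` object. The DataFrame is a hypothetical stand-in with made-up values.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration
df = pd.DataFrame(
    {"state": ["CA", "AK", "TX", "AL", "AK"],
     "color": ["blue", "green", "red", "white", "gray"]},
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)
cols = ["state", "color"]

rows = slice("Niko", "Penelope")
print(rows.start, rows.stop, rows.step)   # the three parts of the slice

with_function = df.loc[rows, cols]
with_notation = df.loc["Niko":"Penelope", cols]

# Both selections hold the same rows and columns
same = with_function.equals(with_notation)
```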

### Slice both the rows and columns

In [None]:
df.loc[:'Dean', 'height':]

Use `None` to denote an empty part of the slice.

In [None]:
rows = slice(None, 'Dean')
cols = slice('height', None)

df.loc[rows, cols]

## Slices with `.loc` are inclusive of the stop value
Notice that the row or column at the stop label is included in the returned DataFrame. This differs from slicing Python lists, where the element at the stop position is **excluded**.
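
The difference is easy to see side by side. This sketch slices a plain list by position and a small hypothetical DataFrame (made-up ages) by label.

```python
import pandas as pd

names = ["Jane", "Niko", "Aaron", "Dean"]

# Hypothetical stand-in data; ages are invented for illustration
df = pd.DataFrame({"age": [30, 2, 12, 32]}, index=names)

# List slicing: the item at the stop position (index 2, 'Aaron') is excluded
list_slice = names[0:2]

# .loc slicing: the row at the stop label ('Aaron') is included
loc_slice = df.loc["Jane":"Aaron"]

print(list_slice)
print(list(loc_slice.index))
```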

# Use slice notation or the slice function?
Almost no one uses the `slice` function, so you will probably want to use slice notation. That said, the slice function does help separate the row and column selection into their own lines of code.

### Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. Here, we select all of the rows and two of the columns.

In [None]:
cols = ['food', 'color']
df.loc[:, cols]

In [None]:
rows = slice(None)
cols = ['food', 'color']

df.loc[rows, cols]

## The above is not necessary! Use *just the brackets*
You will rarely see all of the rows and a couple of columns selected with `.loc` like that. This is exactly what *just the brackets* are built for.

In [None]:
cols = ['food', 'color']
df[cols]
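
To confirm that the two approaches select the same data, this sketch compares them with the `equals` method on a hypothetical DataFrame (values are made up).

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration
df = pd.DataFrame(
    {"food": ["Steak", "Lamb", "Mango"],
     "color": ["blue", "green", "red"],
     "age": [30, 2, 12]},
    index=["Jane", "Niko", "Aaron"],
)
cols = ["food", "color"]

with_brackets = df[cols]
with_loc = df.loc[:, cols]

# True when both results have the same shape, labels, and values
same = with_brackets.equals(with_loc)
print(same)
```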

### A single colon is slice notation for select all
That single colon might be intimidating but it is technically slice notation that selects all items. See the following example with a list:

In [None]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:]
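
The lone colon and `slice(None)` describe the same "select everything" slice. A short sketch with the same list illustrates this:

```python
a_list = [1, 2, 3, 4, 5, 6]

# slice(None) leaves start, stop, and step all unset, just like a lone colon
everything = slice(None)

with_colon = a_list[:]
with_slice = a_list[everything]

print(with_colon == with_slice)
```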

## Use a single colon to select all the columns

In [None]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]

### The above is usually shortened
By default, Pandas will select all of the columns if you only provide a row selection. Providing the colon is not necessary and the following will do the same:

In [None]:
rows = ['Penelope','Cornelia']
df.loc[rows]

## Use slice notation to select a range of rows with all of the columns
Similarly, we can slice from Niko through Dean while selecting all of the columns. Because we do not provide a column selection, Pandas returns all of the columns by default.

In [None]:
df.loc['Niko':'Dean']

## Other slicing examples
You can slice in a variety of ways such as taking every other row by setting the step size to 2:

In [None]:
df.loc['Niko':'Christina':2]

The same selection with the `slice` function:

In [None]:
rows = slice('Niko', 'Christina', 2)
df.loc[rows]
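
To see which rows a step of 2 keeps, here is a sketch on a hypothetical five-row DataFrame (the order of names and the ages are made up). Starting from the first label, every second row is kept, up to and including the stop label when the step lands on it.

```python
import pandas as pd

# Hypothetical stand-in data; row order and ages are invented
df = pd.DataFrame(
    {"age": [30, 2, 12, 4, 32]},
    index=["Jane", "Niko", "Aaron", "Penelope", "Dean"],
)

every_other = df.loc["Jane":"Dean":2]

# Same selection expressed with the slice function
rows = slice("Jane", "Dean", 2)
every_other_again = df.loc[rows]

print(list(every_other.index))
```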

Omitting the start value to include all rows until the stop value:

In [None]:
df.loc[:'Penelope']

Use `None` to represent a missing component of the slice.

In [None]:
rows = slice(None, 'Penelope')
df.loc[rows]

Omitting the stop value to keep all rows after the start value:

In [None]:
df.loc['Aaron':]

In [None]:
rows = slice('Aaron', None)
df.loc[rows]

## Select a single row and a single column
This returns a scalar value and NOT a DataFrame or Series.

In [None]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]
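
A small sketch, using a hypothetical DataFrame with made-up values, confirms that a single row label paired with a single column label produces a plain value rather than a pandas container.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration
df = pd.DataFrame(
    {"state": ["CA", "AK"], "age": [30, 2]},
    index=["Jane", "Niko"],
)

value = df.loc["Jane", "state"]

# The result is neither a Series nor a DataFrame
is_container = isinstance(value, (pd.Series, pd.DataFrame))
print(value, is_container)
```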

## Select a single row as a Series with `.loc`
The `.loc` indexer will return a single row as a Series when given a single row label. Let's select the row for Niko. Notice that the column names have now become index labels.

In [None]:
df.loc['Niko']
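
The sketch below inspects such a row on a hypothetical DataFrame (made-up values). The column names become the index of the returned Series, and the row label becomes its `name`.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for illustration
df = pd.DataFrame(
    {"state": ["CA", "AK"], "age": [30, 2]},
    index=["Jane", "Niko"],
)

row = df.loc["Niko"]

print(type(row).__name__)   # Series
print(list(row.index))      # the former column names
print(row.name)             # the row label
```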

# Is this confusing?
Think about why this output may be confusing.

# Summary of `.loc`
* Only uses labels
* Can select rows and columns simultaneously
* Selection can be a single label, a list of labels or a slice of labels
* Put a comma between row and column selections

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Select all columns for the movie 'The Dark Knight Rises'.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">Select all columns for the movies 'Tangled' and 'Avatar'.</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">What year were 'Tangled' and 'Avatar' made, and what were their IMDB scores?</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Can you tell what the data type of the `year` column is by just looking at its values?</span>

In [None]:
# Turn this into a markdown cell and write your answer here

### Problem 5
<span  style="color:green; font-size:16px">Use a single method to output the data type and number of non-missing values of `year`. Is it missing any?</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Select every 100th movie between 'Tangled' and 'Forrest Gump'. Why doesn't 'Forrest Gump' appear in the results?</span>

In [None]:
# your code here