# 05. Selecting Subsets of Data from DataFrames with `.loc`

### Objectives

+ `.loc` can select rows, columns, or rows and columns simultaneously
+ `.loc` selects only by **label**, never by integer position

# Subset selection with `.loc`
The **`.loc`** indexer selects data in a different manner than *just the brackets*. We must learn its set of rules.

## Simultaneous row and column subset selection with `.loc`
**`.loc`** can select rows and columns simultaneously. You cannot do this with *just the brackets*. 

This is done by separating the row and column selections with a **comma**. The selection will look something like this:

```
df.loc[rows, cols]
```

## `.loc` only selects data by LABEL

Very importantly, **`.loc`** only selects data by the **LABEL** of the rows and columns. You must provide **`.loc`** with the label of the rows and/or columns you would like to select.
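As a quick check of the label-only rule, the sketch below builds a tiny, hypothetical DataFrame inline (the name `small` and its values are stand-ins, not the CSV used in this notebook) and asks `.loc` for the integer `0`. No row carries that label, so pandas raises an error. Depending on the pandas version this is a `KeyError` or a `TypeError`, so both are caught here.

```python
import pandas as pd

# Hypothetical three-row frame with string row labels
small = pd.DataFrame(
    {'state': ['NY', 'TX', 'FL'], 'age': [30, 2, 12]},
    index=['Jane', 'Niko', 'Aaron'],
)

by_label = small.loc['Niko', 'age']   # works: 'Niko' and 'age' are labels

try:
    small.loc[0]                      # 0 is a position here, not a label
    integer_worked = True
except (KeyError, TypeError):
    integer_worked = False

print(by_label, integer_worked)
```

If the index itself held integer labels, `.loc[0]` would look up the label `0`, which is still a label lookup and not a positional one.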

## Select two rows and three columns with `.loc`
If we wanted to select the rows **`Dean`** and **`Cornelia`** along with the columns **`age`**, **`state`**, and **`score`** we would do this:

In [1]:
import pandas as pd
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


In [2]:
rows = ['Dean', 'Cornelia']        # using list 
cols = ['age', 'state', 'score']   # using list 

df.loc[rows, cols]

Unnamed: 0,age,state,score
Dean,32,AK,1.8
Cornelia,69,TX,2.2


## The possible types of selections for `.loc`
Row or column selections can be any of the following:

* A single label
* A list of labels
* A slice with labels

We can use any of these three for either row or column selections with `.loc`.

### Select two rows and a single column:
Here we use a list for the rows and string for the column. The row selection is **`['Dean', 'Aaron']`** and the column selection is **`food`**. Note how this returns a Series since we are selecting exactly a single column.

In [3]:
rows = ['Dean', 'Aaron']
cols = 'food'

df.loc[rows, cols]

Dean     Cheese
Aaron     Mango
Name: food, dtype: object
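To see how the type of the column selection changes the result, the sketch below compares a string with a one-item list. The DataFrame is rebuilt inline as a stand-in for `../data/sample_data.csv`, using a subset of the columns from the table shown earlier.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {
        'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
        'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
        'age': [30, 2, 12, 4, 32, 33, 69],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

rows = ['Dean', 'Aaron']
as_series = df.loc[rows, 'food']     # string column selection -> Series
as_frame = df.loc[rows, ['food']]    # one-item list -> one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```

Wrapping the single label in a list is a common way to keep the result two-dimensional.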

# Use slice notation to select a range of rows
We have seen slice notation when working with Python lists. This same notation is allowed with DataFrames. Let's choose all of the rows from Jane to Penelope with slice notation along with the columns state and color.

In [4]:
cols = ['state', 'color']

df.loc['Jane':'Penelope', cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white


## Slice notation only works in brackets attached to the object
Python only allows slice notation within the brackets that are attached to an object. If we try to write slice notation anywhere else, such as inside a list literal assigned to a variable, we get a syntax error like the one below.

In [None]:
rows = ['Jane':'Penelope']  # raises SyntaxError
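Running that line directly would stop the whole cell. One way to confirm the error without halting execution is to hand the text to the built-in `compile` function, as in this sketch (the variable names `source` and `raised` are illustrative):

```python
# The offending line, kept as a string so this cell itself stays valid Python
source = "rows = ['Jane':'Penelope']"

try:
    compile(source, '<example>', 'exec')
    raised = False
except SyntaxError as err:
    raised = True
    print('SyntaxError:', err.msg)
```

The exact error message differs between Python versions, so none is shown here.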

## Use the `slice` function to separate out the selection in a different line
There is a built-in `slice` function that you can use to assign your selection to a variable. It takes the same three values **start**, **stop**, and **step**, but this time as function parameters.

In [5]:
rows = slice('Jane', 'Penelope')  # built-in slice function creates a slice object
cols = ['state', 'color']

df.loc[rows, cols]

Unnamed: 0,state,color
Jane,NY,blue
Niko,TX,green
Aaron,FL,red
Penelope,AL,white
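The sketch below checks that the `slice` object and slice notation give the same result, and prints the three values stored on the slice object. The DataFrame is an inline stand-in for `../data/sample_data.csv`, rebuilt from part of the table shown earlier.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {
        'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
        'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
        'age': [30, 2, 12, 4, 32, 33, 69],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

rows = slice('Jane', 'Penelope')
cols = ['state', 'color']

# A slice object simply stores start, stop, and step (step is None if omitted)
print(rows.start, rows.stop, rows.step)

with_function = df.loc[rows, cols]
with_notation = df.loc['Jane':'Penelope', cols]
same = with_function.equals(with_notation)
print(same)
```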


### Slice both the rows and columns

In [6]:
df.loc[:'Dean', 'height':]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


Use `None` to denote an empty part of the slice.

In [7]:
rows = slice(None, 'Dean')  # no start; stop at 'Dean'
cols = slice('height', None)

df.loc[rows, cols]

Unnamed: 0,height,score
Jane,165,4.6
Niko,70,8.3
Aaron,120,9.0
Penelope,80,3.3
Dean,180,1.8


## Slices with `.loc` are inclusive of the stop value
Notice that the stop value is included in the returned DataFrame. When slicing Python lists, the element at the stop position is **excluded**.
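A side-by-side comparison makes the difference concrete. In this sketch the DataFrame is an inline stand-in for the sample data, and `names` is simply its row labels as a plain Python list. `'Penelope'` sits at position 3.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {'age': [30, 2, 12, 4, 32, 33, 69]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

names = list(df.index)

list_part = names[0:3]                  # list slicing: position 3 is excluded
loc_part = df.loc['Jane':'Penelope']    # .loc slicing: label 'Penelope' is included

print(list_part)
print(list(loc_part.index))
```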

# Use slice notation or the slice function?
Almost no one uses the `slice` function, so you will probably want to use slice notation. That said, the slice function does help separate the row and column selection into their own lines of code.

### Selecting all of the rows and some columns
It is possible to select all of the rows by using a single colon. Here, we select all of the rows and two of the columns.

In [15]:
cols = ['food', 'color']
df.loc[:, cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


In [16]:
rows = slice(None)
cols = ['food', 'color']

df.loc[rows, cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red


## The above is not necessary! Use *just the brackets*
In practice, you rarely see `.loc` used to select all of the rows and a few columns like that. This is exactly what *just the brackets* are built for.

In [8]:
cols = ['food', 'color']
df[cols]

Unnamed: 0,food,color
Jane,Steak,blue
Niko,Lamb,green
Aaron,Mango,red
Penelope,Apple,white
Dean,Cheese,gray
Christina,Melon,black
Cornelia,Beans,red
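To confirm that these three spellings return the same data, the sketch below compares them with `DataFrame.equals`. The DataFrame is an inline stand-in for `../data/sample_data.csv`, rebuilt from part of the table shown earlier.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {
        'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
        'color': ['blue', 'green', 'red', 'white', 'gray', 'black', 'red'],
        'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

cols = ['food', 'color']

with_brackets = df[cols]                      # just the brackets
with_colon = df.loc[:, cols]                  # .loc with a colon for the rows
with_slice_none = df.loc[slice(None), cols]   # .loc with slice(None) for the rows

print(with_brackets.equals(with_colon), with_brackets.equals(with_slice_none))
```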


### A single colon is slice notation for select all
That single colon might be intimidating but it is technically slice notation that selects all items. See the following example with a list:

In [9]:
a_list = [1, 2, 3, 4, 5, 6]
a_list[:] # means everything

[1, 2, 3, 4, 5, 6]

## Use a single colon to select all the columns

In [10]:
rows = ['Penelope','Cornelia']
df.loc[rows, :]  # the colon selects all columns; df.loc[rows] is the usual shorthand

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2


### The above is usually shortened
By default, Pandas will select all of the columns if you only provide a row selection. Providing the colon is not necessary and the following will do the same:

In [11]:
rows = ['Penelope','Cornelia']
df.loc[rows]

Unnamed: 0,state,color,food,age,height,score
Penelope,AL,white,Apple,4,80,3.3
Cornelia,TX,red,Beans,69,150,2.2
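The sketch below verifies that the shortened form matches the explicit one. It also shows that *just the brackets* are not a substitute here: given a list, `df[...]` looks for column labels, so a list of row labels raises a `KeyError`. The DataFrame is an inline stand-in for the sample data.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {
        'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
        'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
        'age': [30, 2, 12, 4, 32, 33, 69],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

rows = ['Penelope', 'Cornelia']

explicit = df.loc[rows, :]    # colon selects all columns
shorthand = df.loc[rows]      # column selection omitted

try:
    df[rows]                  # brackets with a list select columns, not rows
    bracket_failed = False
except KeyError:
    bracket_failed = True

print(explicit.equals(shorthand), bracket_failed)
```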


## Use slice notation to select a range of rows with all of the columns
Similarly, we can slice from Niko through Dean while selecting all of the columns. We do not provide a specific column selection. By default, Pandas returns all of the columns.

In [12]:
df.loc['Niko':'Dean']  # a single selection is treated as a row selection

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8


## Other slicing examples
You can slice in a variety of ways such as taking every other row by setting the step size to 2:

In [13]:
df.loc['Niko':'Christina':2]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5


With the `slice` function.

In [14]:
rows = slice('Niko', 'Christina', 2)
df.loc[rows]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3
Penelope,AL,white,Apple,4,80,3.3
Christina,TX,black,Melon,33,172,9.5
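With a step, the stop label is only returned when it happens to fall on one of the steps. The sketch below, using an inline stand-in for the sample data, compares two stop labels and uses `Index.get_loc` to show how far each one is from the start label.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {'age': [30, 2, 12, 4, 32, 33, 69]},
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

start = df.index.get_loc('Niko')
offset_christina = df.index.get_loc('Christina') - start   # 4 rows after Niko
offset_dean = df.index.get_loc('Dean') - start             # 3 rows after Niko

# Christina is an even number of rows from Niko, so a step of 2 lands on it
to_christina = list(df.loc['Niko':'Christina':2].index)

# Dean is an odd number of rows from Niko, so a step of 2 skips over it
to_dean = list(df.loc['Niko':'Dean':2].index)

print(offset_christina, to_christina)
print(offset_dean, to_dean)
```

The same reasoning applies to any step size: the stop label appears only if its distance from the start label is a multiple of the step.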


Omitting the start value to include all rows until the stop value:

In [None]:
df.loc[:'Penelope']

Use `None` to represent a missing component of the slice.

In [None]:
rows = slice(None, 'Penelope')
df.loc[rows]

Omitting the stop value to keep all rows after the start value:

In [None]:
df.loc['Aaron':]

In [None]:
rows = slice('Aaron', None)
df.loc[rows]

## Select a single row and a single column
This returns a scalar value, not a DataFrame or Series.

In [17]:
rows = 'Jane'
cols = 'state'
df.loc[rows, cols]  # a single row label and a single column label return one scalar (e.g. bool, str, int, float)

'NY'

## Select a single row as a Series with `.loc`
The `.loc` indexer will return a single row as a Series when given a single row label. Let's select the row for Niko. Notice that the column names have now become index labels.

In [18]:
df.loc['Niko']

state        TX
color     green
food       Lamb
age           2
height       70
score       8.3
Name: Niko, dtype: object

In [19]:
type(df.loc['Niko'])  # a single row label returns a Series
# to keep the row horizontal (a one-row DataFrame), pass the label in a list

pandas.core.series.Series

In [20]:
rows = ['Niko']
df.loc[rows]

Unnamed: 0,state,color,food,age,height,score
Niko,TX,green,Lamb,2,70,8.3


In [21]:
type(df.loc[rows])

pandas.core.frame.DataFrame
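The three return types seen so far can be collected in one place. This sketch uses an inline stand-in for the sample data and shows a scalar, a Series, and a DataFrame coming from `.loc`, depending only on whether each selection is a single label or a list.

```python
import pandas as pd

# Stand-in for the sample data, rebuilt from the table shown earlier
df = pd.DataFrame(
    {
        'state': ['NY', 'TX', 'FL', 'AL', 'AK', 'TX', 'TX'],
        'food': ['Steak', 'Lamb', 'Mango', 'Apple', 'Cheese', 'Melon', 'Beans'],
        'age': [30, 2, 12, 4, 32, 33, 69],
    },
    index=['Jane', 'Niko', 'Aaron', 'Penelope', 'Dean', 'Christina', 'Cornelia'],
)

scalar = df.loc['Jane', 'state']   # single label, single label -> scalar
row_series = df.loc['Niko']        # single row label -> Series
row_frame = df.loc[['Niko']]       # one-item list -> one-row DataFrame

# In the Series, the former column names are now the index labels
print(scalar)
print(list(row_series.index))
print(row_frame.shape)
```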

# Is this confusing?
Think about why this output may be confusing.

# Summary of `.loc`
* Only uses labels
* Can select rows and columns simultaneously
* Selection can be a single label, a list of labels or a slice of labels
* Put a comma between row and column selections

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Select all columns for the movie 'The Dark Knight Rises'.</span>

In [24]:
# your code here
movie = pd.read_csv('../data/movie.csv', index_col= 'title')
movie.head(2) 

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1


In [25]:
movie.loc['The Dark Knight Rises']  

year                                                            2012
color                                                          Color
content_rating                                                 PG-13
duration                                                         164
director_name                                      Christopher Nolan
director_fb                                                    22000
actor1                                                     Tom Hardy
actor1_fb                                                      27000
actor2                                                Christian Bale
actor2_fb                                                      23000
actor3                                          Joseph Gordon-Levitt
actor3_fb                                                      23000
gross                                                    4.48131e+08
genres                                               Action|Thriller
num_reviews                       

### Problem 2
<span  style="color:green; font-size:16px">Select all columns for the movies 'Tangled' and 'Avatar'.</span>

In [26]:
# your code here
movie.loc[['Tangled', 'Avatar']]  

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,...,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9


### Problem 3
<span  style="color:green; font-size:16px">In what years were 'Tangled' and 'Avatar' made, and what were their IMDb scores?</span>

In [28]:
# your code here
movie.loc[['Tangled', 'Avatar'],['year','imdb_score']]  

title,year,imdb_score
Tangled,2010.0,7.8
Avatar,2009.0,7.9


### Problem 4
<span  style="color:green; font-size:16px">Can you tell what the data type of the `year` column is by just looking at its values?</span>

In [36]:
# Turn this into a markdown cell and write your answer here
movie['year'].dtypes  # movie.loc['year'] would not work here: .loc would look for a row labeled 'year' in the index

dtype('float64')

In [40]:
movie.dtypes

year               float64
color               object
content_rating      object
duration           float64
director_name       object
director_fb        float64
actor1              object
actor1_fb          float64
actor2              object
actor2_fb          float64
actor3              object
actor3_fb          float64
gross              float64
genres              object
num_reviews        float64
num_voted_users      int64
plot_keywords       object
language            object
country             object
budget             float64
imdb_score         float64
dtype: object

### Problem 5
<span  style="color:green; font-size:16px">Use a single method to output the data type and number of non-missing values of `year`. Is it missing any?</span>

In [42]:
# your code here
movie.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4916 entries, Avatar to My Date with Drew
Data columns (total 21 columns):
year               4810 non-null float64
color              4897 non-null object
content_rating     4616 non-null object
duration           4901 non-null float64
director_name      4814 non-null object
director_fb        4814 non-null float64
actor1             4909 non-null object
actor1_fb          4909 non-null float64
actor2             4903 non-null object
actor2_fb          4903 non-null float64
actor3             4893 non-null object
actor3_fb          4893 non-null float64
gross              4054 non-null float64
genres             4916 non-null object
num_reviews        4867 non-null float64
num_voted_users    4916 non-null int64
plot_keywords      4764 non-null object
language           4904 non-null object
country            4911 non-null object
budget             4432 non-null float64
imdb_score         4916 non-null float64
dtypes: float64(10), int64(1), 
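The `movie.info()` output above reports 4916 entries but only 4810 non-null values for `year`, so 106 values are missing. A likely explanation for the `float64` dtype (an assumption here, since the CSV itself is not inspected) is that a default integer column cannot hold missing values, so pandas stores the column as floats with `NaN`. The toy Series below is hypothetical data, not taken from the movie file, and shows the effect:

```python
import pandas as pd

# Hypothetical years with one missing entry
years = pd.Series([2009, 2010, None])

dtype_name = str(years.dtype)          # whole numbers plus a missing value -> float64
non_missing = int(years.count())       # count() ignores missing values
missing = int(years.isna().sum())      # number of missing values

print(dtype_name, non_missing, missing)
```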

### Problem 6
<span  style="color:green; font-size:16px">Select every 100th movie between 'Tangled' and 'Forrest Gump'. Why doesn't 'Forrest Gump' appear in the results?</span>

In [43]:
# your code here
# Forrest Gump is not a multiple of 100 away from Tangled
movie.loc['Tangled':'Forrest Gump':100]

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Tangled,2010.0,Color,PG,100.0,Nathan Greno,15.0,Brad Garrett,799.0,Donna Murphy,553.0,...,284.0,200807262.0,Adventure|Animation|Comedy|Family|Fantasy|Musi...,324.0,294810,17th century|based on fairy tale|disney|flower...,English,USA,260000000.0,7.8
Shrek the Third,2007.0,Color,PG,93.0,Chris Miller,50.0,Justin Timberlake,3000.0,Eric Idle,795.0,...,692.0,320706665.0,Adventure|Animation|Comedy|Family|Fantasy,227.0,211971,disney spoof|fairy tale|prince|princess|tough guy,English,USA,160000000.0,6.1
X-Men 2,2003.0,Color,PG-13,134.0,Bryan Singer,0.0,Hugh Jackman,20000.0,Bruce Davison,505.0,...,346.0,214948780.0,Action|Adventure|Fantasy|Sci-Fi|Thriller,289.0,405973,mutant|prison|professor|school|x men,English,Canada,110000000.0,7.5
Cloud Atlas,2012.0,Color,R,172.0,Tom Tykwer,670.0,Tom Hanks,15000.0,Jim Sturgess,5000.0,...,1000.0,27098580.0,Drama|Sci-Fi,511.0,284825,composer|future|letter|nonlinear timeline|nurs...,English,Germany,102000000.0,7.5
Divergent,2014.0,Color,PG-13,139.0,Neil Burger,168.0,Kate Winslet,14000.0,Theo James,5000.0,...,1000.0,150832203.0,Adventure|Mystery|Sci-Fi,459.0,341058,army|brother sister relationship|dystopia|fath...,English,USA,85000000.0,6.7
Hidalgo,2004.0,Color,PG-13,136.0,Joe Johnston,394.0,J.K. Simmons,24000.0,Viggo Mortensen,10000.0,...,1000.0,67286731.0,Action|Adventure|Western,140.0,67856,arab|cowboy|horse|race|sheik,English,USA,100000000.0,6.7
Doom,2005.0,Color,R,113.0,Andrzej Bartkowiak,43.0,Dwayne Johnson,12000.0,Ben Daniels,585.0,...,452.0,28031250.0,Action|Adventure|Horror|Sci-Fi,237.0,88146,commando unit|extra chromosome|first person sh...,English,UK,60000000.0,5.2
Gone Girl,2014.0,Color,R,149.0,David Fincher,21000.0,Patrick Fugit,835.0,Sela Ward,812.0,...,625.0,167735396.0,Crime|Drama|Mystery|Thriller,568.0,569841,based on novel|disappearance|missing person|mi...,English,USA,61000000.0,8.1
"Sabrina, the Teenage Witch",,Color,TV-G,22.0,,,Nate Richert,870.0,Soleil Moon Frye,558.0,...,271.0,,Comedy|Family|Fantasy,20.0,24420,female protagonist|hereditary gift of witchcra...,English,USA,3000000.0,6.6
