# Selecting Subsets of Data from a Series

## Using Dot Notation to Select a Column as a Series
Previously we learned how to use *just the brackets* to select a single column as a Series. Another common way to do this uses dot notation: write the name of your DataFrame, then a dot, then the column name. Let's begin by reading in the movie dataset and setting the index as the title.

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head(3)

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1
Spectre,2015.0,Color,PG-13,148.0,Sam Mendes,0.0,Christoph Waltz,11000.0,Rory Kinnear,393.0,...,161.0,200074175.0,Action|Adventure|Thriller,602.0,275868,bomb|espionage|sequel|spy|terrorist,English,UK,245000000.0,6.8


Instead of using *just the brackets* to select a single column, you can use dot notation. Let's select the `year` column in such a manner.

In [2]:
movie.year.head(3)

title
Avatar                                      2009.0
Pirates of the Caribbean: At World's End    2007.0
Spectre                                     2015.0
Name: year, dtype: float64

### I don't recommend doing this
Although this is valid pandas syntax, I don't recommend using this notation for the following reasons:

* You cannot select columns whose names contain spaces
* You cannot select columns that share a name with a DataFrame attribute or method, such as `count`
* You cannot select a column whose name is stored in a variable
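
Each of these limitations can be seen on a small, made-up DataFrame. This is only a sketch: the frame and its column names (`imdb score`, `count`, `year`) are hypothetical and chosen to trigger each case. They are not the movie dataset.

```python
import pandas as pd

# Hypothetical three-column frame, not the movie dataset
df = pd.DataFrame({'imdb score': [7.9, 7.1],
                   'count': [10, 20],
                   'year': [2009, 2007]})

# 1. A column name with a space is not a valid attribute name,
#    so only the brackets can reach it
score = df['imdb score']

# 2. `count` is already a DataFrame method, so the dot returns the method
dot_is_method = callable(df.count)
count_col = df['count']

# 3. The dot looks for an attribute literally named `col`,
#    not for the column whose name is stored in the variable
col = 'year'
try:
    df.col
    dot_found_column = True
except AttributeError:
    dot_found_column = False
year = df[col]
```

In all three cases the brackets return the column as a Series, while the dot either fails or returns something else.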

Using *just the brackets* **always** works, so I recommend doing the following instead:

In [3]:
movie['year'].head(3)

title
Avatar                                      2009.0
Pirates of the Caribbean: At World's End    2007.0
Spectre                                     2015.0
Name: year, dtype: float64

### Why even know about this?
Different people write pandas code in different ways, and you will definitely see this syntax in other people's code, so it's important to be aware of it.

## Selecting Subsets of Data From a Series
Selecting subsets of data from a Series is very similar to selecting them from a DataFrame. Since there are no columns in a Series, there isn't a need to use *just the brackets*. Instead, you can do all of your subset selection with `loc` and `iloc`. Let's select the `imdb_score` column as a Series and output the head.

In [4]:
imdb = movie['imdb_score']
imdb.head(3)

title
Avatar                                      7.9
Pirates of the Caribbean: At World's End    7.1
Spectre                                     6.8
Name: imdb_score, dtype: float64

### Selection with a scalar, a list, and a slice
Just like with a DataFrame, both `loc` and `iloc` accept a single scalar, a list, or a slice. The `loc` indexer also accepts a boolean Series/array, which will be covered in a later chapter. Let's select the IMDB score for 'Forrest Gump'. Since we are selecting a single label, only the value is returned.

In [5]:
imdb.loc['Forrest Gump']

8.8

Select the scores for both 'Forrest Gump' and 'Avatar' with a list. Notice that a Series is returned.

In [6]:
locs = ['Forrest Gump', 'Avatar']
imdb.loc[locs]

title
Forrest Gump    8.8
Avatar          7.9
Name: imdb_score, dtype: float64

Select every 100th movie from 'Avatar' to 'Forrest Gump' with slice notation:

In [7]:
imdb.loc['Avatar':'Forrest Gump':100]

title
Avatar                                   7.9
The Fast and the Furious                 6.7
Harry Potter and the Sorcerer's Stone    7.5
Epic                                     6.7
102 Dalmatians                           4.8
Pompeii                                  5.6
Wall Street: Money Never Sleeps          6.3
Hop                                      5.5
Beyond Borders                           6.5
Name: imdb_score, dtype: float64
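
If you don't have the movie dataset at hand, the same three kinds of `loc` selection can be tried on a tiny Series. This is a minimal sketch: the labels and values below are made up for illustration. Note that a `loc` slice includes its stop label.

```python
import pandas as pd

# Hypothetical scores keyed by made-up labels
s = pd.Series([7.9, 7.1, 6.8, 8.8, 5.5],
              index=['a', 'b', 'c', 'd', 'e'], name='score')

one = s.loc['d']                # scalar label -> a single value
some = s.loc[['d', 'a']]        # list of labels -> a Series, in list order
every_other = s.loc['a':'e':2]  # slice -> a Series; the stop label 'e' is included
```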

### Repeat with `iloc`
The `iloc` indexer works analogously to `loc` on a Series but only uses integer location. Let's make selections with a single integer, a list of integers, and a slice of integers. We'll begin by selecting the score of the 21st movie (integer location 20).

In [8]:
imdb.iloc[20]

7.5

In this example, we select three scores with a list.

In [9]:
ilocs = [10, 20, 30]
imdb.iloc[ilocs]

title
Batman v Superman: Dawn of Justice           6.9
The Hobbit: The Battle of the Five Armies    7.5
Skyfall                                      7.8
Name: imdb_score, dtype: float64

Here is an example that uses slice notation.

In [10]:
imdb.iloc[3000:3050:10]

title
Quartet               6.8
The Guru              5.4
Machine Gun McCain    6.2
The Blue Butterfly    6.3
Stripes               6.9
Name: imdb_score, dtype: float64
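
The `iloc` selections can also be tried on a tiny Series. Again, this is only a sketch with made-up labels and values. Unlike `loc`, an `iloc` slice excludes its stop position, just like a Python list slice.

```python
import pandas as pd

# Hypothetical scores keyed by made-up labels
s = pd.Series([7.9, 7.1, 6.8, 8.8, 5.5],
              index=['a', 'b', 'c', 'd', 'e'], name='score')

first = s.iloc[0]        # single integer -> a single value
picked = s.iloc[[4, 0]]  # list of integers -> a Series, in list order
middle = s.iloc[1:4]     # slice -> positions 1, 2, 3; the stop 4 is excluded
last_two = s.iloc[-2:]   # negative positions count from the end
```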

### Trouble with *just the brackets*
It is possible to use just the brackets to make the same selections as above. See the following examples:

In [11]:
imdb['Forrest Gump']

8.8

In [12]:
imdb['Avatar':'Forrest Gump':100]

title
Avatar                                   7.9
The Fast and the Furious                 6.7
Harry Potter and the Sorcerer's Stone    7.5
Epic                                     6.7
102 Dalmatians                           4.8
Pompeii                                  5.6
Wall Street: Money Never Sleeps          6.3
Hop                                      5.5
Beyond Borders                           6.5
Name: imdb_score, dtype: float64

In [13]:
ilocs = [10, 20, 30]
imdb[ilocs]

title
Batman v Superman: Dawn of Justice           6.9
The Hobbit: The Battle of the Five Armies    7.5
Skyfall                                      7.8
Name: imdb_score, dtype: float64

In [14]:
imdb[3000:3050:10]

title
Quartet               6.8
The Guru              5.4
Machine Gun McCain    6.2
The Blue Butterfly    6.3
Stripes               6.9
Name: imdb_score, dtype: float64

### Can you spot the problem?
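The major issue is that using *just the brackets* is **ambiguous** and **not explicit**. We don't know if we are selecting by label or by integer location. With `loc` and `iloc`, it is clear what our intentions are. I suggest using `loc` and `iloc` for clarity. In addition, as far as I know, pandas 2.1 and later deprecate treating integers inside *just the brackets* as positions when the index labels are not integers, so a selection like `imdb[ilocs]` raises a warning there and may stop working in later versions.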
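
The ambiguity is easiest to see when the index labels are themselves integers. In this sketch, the Series and its deliberately out-of-order labels are made up for illustration.

```python
import pandas as pd

# Hypothetical Series whose integer labels are deliberately out of order
s = pd.Series([1.5, 2.5, 3.5], index=[2, 0, 1])

by_label = s.loc[0]      # the value labeled 0
by_position = s.iloc[0]  # the first value
bare = s[0]              # with an integer index, the brackets look up the label
```

Here `s[0]` returns the value labeled 0, not the first value, and nothing in the expression itself tells the reader which one was intended.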

## Comparison to Python Lists and Dictionaries
It may be helpful to compare pandas' ability to make selections by label and integer location to that of Python lists and dictionaries. Python lists allow for selection of data only through **integer location**. You can use a single integer or slice notation to make the selection but NOT a list of integers. Let's see examples of subset selection of lists using integers:

In [15]:
a_list = [10, 5, 3, 89, 20, 44, 37]

In [16]:
a_list[4]

20

In [17]:
a_list[-3:]

[20, 44, 37]
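
To see that a list of integers really is rejected, here is a short sketch using the same `a_list`. The names `positions` and `picked` are my own, and the list comprehension is just one possible workaround.

```python
a_list = [10, 5, 3, 89, 20, 44, 37]

# A list of integers is not a valid index for a Python list
try:
    a_list[[0, 3, 5]]
    list_index_allowed = True
except TypeError:
    list_index_allowed = False

# One workaround: pick the positions one at a time in a list comprehension
positions = [0, 3, 5]
picked = [a_list[i] for i in positions]
```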

### Selection by label with Python dictionaries
Every value in a dictionary is labeled by a key. We use this key to make single selections. Dictionaries only allow selection with a single label. Slices and lists of labels are not allowed.

In [18]:
d = {'a':1, 'b':2, 't':20, 'z':26, 'A':27}
d['a']

1

In [19]:
d['A']

27
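
A short sketch of what happens when a list of labels is used with the same dictionary `d`. The names `keys` and `picked` are my own, and the dict comprehension is just one possible workaround.

```python
d = {'a': 1, 'b': 2, 't': 20, 'z': 26, 'A': 27}

# A list of keys is rejected because a list cannot be used as a dictionary key
try:
    d[['a', 'z']]
    list_of_keys_allowed = True
except TypeError:
    list_of_keys_allowed = False

# One workaround: look up each key separately in a dict comprehension
keys = ['a', 'z']
picked = {k: d[k] for k in keys}
```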

### pandas has the power of lists and dictionaries
DataFrames and Series are able to make selections with integers like a list and with labels like a dictionary.
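
As a small illustration, the dictionary from the previous section can be turned into a Series, which then supports both kinds of selection. This is only a sketch; the variable names are my own.

```python
import pandas as pd

# Build a Series from the dictionary used earlier; the keys become the index
s = pd.Series({'a': 1, 'b': 2, 't': 20, 'z': 26, 'A': 27})

like_a_dict = s.loc['t']         # selection by label
like_a_list = s.iloc[-3:]        # selection by integer location
beyond_both = s.loc[['a', 'A']]  # a list of labels, which a dictionary does not allow
```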

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the bikes dataset. We will be using it for the next few questions. Select the wind speed column as a Series and assign it to a variable and output the head. What kind of index does this Series have?</span>

In [20]:
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy


In [21]:
wind = bikes['wind_speed']
wind.head(3)

0    12.7
1     6.9
2    16.1
Name: wind_speed, dtype: float64

In [22]:
wind.index

RangeIndex(start=0, stop=50089, step=1)

### Exercise 2
<span  style="color:green; font-size:16px">From the wind speed Series, select integer locations 4 up to, but not including, 10.</span>

In [23]:
wind.iloc[4:10]

4    17.3
5    17.3
6    15.0
7     5.8
8     0.0
9    12.7
Name: wind_speed, dtype: float64

### Exercise 3
<span  style="color:green; font-size:16px">Copy and paste your answer to Exercise 2 below but use `loc` instead. Do you get the same result? Why not?</span>

In [24]:
wind.loc[4:10]

4     17.3
5     17.3
6     15.0
7      5.8
8      0.0
9     12.7
10     9.2
Name: wind_speed, dtype: float64
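
The results differ because a `loc` slice includes its stop label while an `iloc` slice excludes its stop position. With a default `RangeIndex` the labels and positions happen to be the same integers, so the only visible difference is the extra row labeled 10. Here is a small sketch on a made-up Series (the values are arbitrary, not the bikes data):

```python
import pandas as pd

# Hypothetical Series with a default RangeIndex: labels 0 through 19
s = pd.Series(range(100, 120))

by_position = s.iloc[4:10]  # positions 4 through 9
by_label = s.loc[4:10]      # labels 4 through 10
```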

### Exercise 4
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select `actor1` as a Series. Who is the `actor1` for 'My Big Fat Greek Wedding'?</span>

In [25]:
movie = pd.read_csv('../data/movie.csv', index_col='title')
actor1 = movie['actor1']

actor1.loc['My Big Fat Greek Wedding']

'Nia Vardalos'

### Exercise 5
<span  style="color:green; font-size:16px">Find `actor1` for your two favorite movies.</span>

In [26]:
actor1.loc[['Titanic', 'Blood Diamond']]

title
Titanic          Leonardo DiCaprio
Blood Diamond    Leonardo DiCaprio
Name: actor1, dtype: object

### Exercise 6
<span  style="color:green; font-size:16px">Select the last 3 values from `actor1` in two different ways.</span>

In [27]:
actor1.iloc[-3:]

title
A Plague So Pleasant    Eva Boehnke
Shanghai Calling          Alan Ruck
My Date with Drew       John August
Name: actor1, dtype: object

In [28]:
actor1.tail(3)

title
A Plague So Pleasant    Eva Boehnke
Shanghai Calling          Alan Ruck
My Date with Drew       John August
Name: actor1, dtype: object