# Selecting Data In A DataFrame
## Notebook Outline:

* [Introduction To Indexing](#IntroToIndexing)
* [Introduction to the .iloc method](#IntroducingILoc)
* [Introduction to the .loc method](#IntroducingLoc)
* [Using .loc to get rows based on a condition](#UsingLocWithCondition)
* [Using .loc to get rows based on multiple conditions](#UsingLocWithMultipleConditions)

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then apply what you learn to your own data files, or (if you do not have any data of your own) to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining the material! It would be easy to just skim this notebook, but you will learn more by doing!

<a name="IntroToIndexing"></a>
#  Introduction to Indexing
Indexing and slicing are ways to grab specific rows and columns from a dataset.  Maybe you want the value at the 100th row and 11th column, or maybe you want all the rows of data for the month of January in 2017. Maybe you want all the rows where a certain value is greater than 0.  These are all cases where you will want to use indexing.

We are going to cover the two main methods of indexing, .iloc and .loc, and then start using them in examples. First, we need to get a handle on the basics of these methods!

<a name='IntroducingILoc'></a>
# Introducing the .iloc[] method
The .iloc method allows us to select rows and columns based on the _number_ (position) of the row and column. For example, we can select the 10th row and 3rd column, or we can select all values in the 17th row.  Let's learn about .iloc[] via the examples below.

We need a dataset to practice on, so let's load the Illinois Boy Baby names dataset that we saw in a previous lecture.

In [35]:
# In this cell we import pandas and load the datafile.
import pandas as pd
filepath = ('/Users/yuzhang/Dropbox/Academia/Lecturer/I&C_SCI_X426.62/week3/'
            'Most_Popular_Baby_Boy_Names__Illinois_1980-2013.csv')
nameData = pd.read_csv(filepath)

#### Let's use the .head() method to get a quick look at the data

In [36]:
nameData.head(3)

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273


#### Now let's use the .iloc[] method
The .iloc method allows us to index a dataframe in a very similar way to how we would index a list. It is important to remember that .iloc does not use any of the row or column labels (names), but it uses the numerical position of each row and column.

While it is important to know about .iloc, I actually don't use it very often. I more often use a similar method, .loc, that we will learn about next!

To use .iloc, simply write the method after the dataframe variable name and then use _square_ brackets to grab the row and column you want. The first value is the row number, and the second is the column number.  For example, nameData.iloc[0, 3] will get the value from the first row and the fourth column.

In the next cell we grab the value in the first row and first column of the dataframe (remember that Python is zero-indexed).

The value that is returned is 1. Note that this value corresponds to the value in the first row and first column ('Rank') in the output above.

In [37]:
# We grab the value from the first row and the first column.
nameData.iloc[0, 0]

1

#### Now, let's use .iloc[]  to get the 2nd row.
Remember, Python is zero-indexed, so the 2nd row is at index 1.

Note how we use ':' to get all the columns.

In [38]:
nameData.iloc[1, :]

Rank             2
Year          1980
Name         Jason
Frequency     2389
Name: 1, dtype: object

#### Let's use .iloc[] to get all the values in the 3rd column
Note how we use ':' to get all the rows.

In [39]:
nameData.iloc[:, 2]

0          Michael
1            Jason
2      Christopher
3          Matthew
4            David
          ...     
845         NATHAN
846         ANDREW
847          HENRY
848          DAVID
849           JACK
Name: Name, Length: 850, dtype: object

#### Now let's use .iloc[] to get the first 10 rows.
When getting a range of rows, we can type the range as < first row number >: < last row number + 1>. For example nameData.iloc[0:10, :] will get all the rows from row 0 through row 9.

In [40]:
nameData.iloc[0:10, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273
3,4,1980,Matthew,2112
4,5,1980,David,2088
5,6,1980,James,1925
6,7,1980,Robert,1763
7,8,1980,Daniel,1724
8,9,1980,John,1722
9,10,1980,Joseph,1710


#### Let's use .iloc[] to get the first 10 rows and the first 2 columns

In [41]:
nameData.iloc[0:10, 0:2]

Unnamed: 0,Rank,Year
0,1,1980
1,2,1980
2,3,1980
3,4,1980
4,5,1980
5,6,1980
6,7,1980
7,8,1980
8,9,1980
9,10,1980


#### If you are starting your selection at 0, you don't actually need to type the 0. For example:

In [42]:
nameData.iloc[:10, :2]

Unnamed: 0,Rank,Year
0,1,1980
1,2,1980
2,3,1980
3,4,1980
4,5,1980
5,6,1980
6,7,1980
7,8,1980
8,9,1980
9,10,1980


#### Now let's get every other row. The notation is dataframe.iloc[< first row index > : < last row index + 1 > : < step size >, :]

In [43]:
nameData.iloc[0:10:2, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
2,3,1980,Christopher,2273
4,5,1980,David,2088
6,7,1980,Robert,1763
8,9,1980,John,1722
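
As a recap, the .iloc forms used above can be collected into one short sketch. The dataframe `toy` is a hypothetical stand-in built by hand (its values mirror the first five rows shown earlier), so the cell runs without the CSV file. The negative position in the last line is not used elsewhere in this notebook; it works the same way it does for a Python list.

```python
import pandas as pd

# Hypothetical toy data, shaped like the first rows of the baby-name table.
toy = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5],
    'Year': [1980, 1980, 1980, 1980, 1980],
    'Name': ['Michael', 'Jason', 'Christopher', 'Matthew', 'David'],
    'Frequency': [3886, 2389, 2273, 2112, 2088],
})

single_value = toy.iloc[0, 3]    # first row, fourth column
second_row = toy.iloc[1, :]      # all columns of the second row
first_three = toy.iloc[:3, :]    # rows 0, 1, 2 (the stop value 3 is excluded)
every_other = toy.iloc[::2, :]   # rows 0, 2, 4
last_row = toy.iloc[-1, :]       # negative positions count from the end

print(single_value)
print(every_other)
```

Every selection here depends only on position, so the results would be the same even if the rows or columns had different labels.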


<a name='IntroducingLoc'></a>
# Introducing the .loc method
.loc[] lets us select rows and columns by their labels or by boolean values (True/False tests). I use .loc _much_ more often than I use .iloc.

For example, we use the .loc method below to select the 'Name' column.

In [44]:
nameData.loc[:, 'Name']

0          Michael
1            Jason
2      Christopher
3          Matthew
4            David
          ...     
845         NATHAN
846         ANDREW
847          HENRY
848          DAVID
849           JACK
Name: Name, Length: 850, dtype: object

#### Let's use .loc to grab just the first row of the 'Name' column.
Notice that our row labels also happen to be the row numbers. This is not always the case, but it is here. We can confirm this by looking at the dataframe's index.

In [45]:
nameData.loc[0 , 'Name']

'Michael'

In [46]:
nameData.index

RangeIndex(start=0, stop=850, step=1)
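
To see why it matters that row labels are not always row numbers, here is a minimal sketch on a hypothetical hand-built dataframe whose row labels are years. The name `toy` and the choice of index are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the row labels are years rather than 0, 1, 2.
toy = pd.DataFrame(
    {'Name': ['Michael', 'Jacob', 'Noah'], 'Frequency': [3886, 1640, 738]},
    index=[1980, 2000, 2013],
)

by_label = toy.loc[2000, 'Name']   # the row whose label is 2000
by_position = toy.iloc[1, 0]       # the second row, first column

print(by_label, by_position)

# .loc looks for a row *labeled* 1, and there is none in this index.
try:
    toy.loc[1, 'Name']
    found = True
except KeyError:
    found = False
print(found)
```

Both selections return the same cell, but .loc finds it by label and .iloc finds it by position.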

#### Let's now get the first two rows and the columns 'Name' and 'Rank'.
Notice that when you want multiple rows and/or columns, you need to list the labels of the rows and/or columns you want. (That is, the labels go inside square brackets, as a list.)

In [47]:
nameData.loc[[0, 1], ['Name', 'Rank']]

Unnamed: 0,Name,Rank
0,Michael,1
1,Jason,2
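
One detail worth knowing when moving between the two methods: a slice of labels in .loc includes the stop label, while a slice of positions in .iloc excludes the stop position. The sketch below uses a hypothetical four-row dataframe named `toy` to show the difference.

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, 2, 3 row labels.
toy = pd.DataFrame({
    'Rank': [1, 2, 3, 4],
    'Name': ['Michael', 'Jason', 'Christopher', 'Matthew'],
})

loc_rows = toy.loc[0:2, ['Name', 'Rank']]   # labels 0, 1, and 2
iloc_rows = toy.iloc[0:2, :]                # positions 0 and 1 only

print(len(loc_rows), len(iloc_rows))
```

The column order in the result follows the order of the list passed to .loc, which is why 'Name' comes before 'Rank' here.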


<a name='UsingLocWithCondition'></a>
# Using .loc to get rows based on a _condition_
In this section, we are going to look at how we get rows where a certain condition is True. This is a very common thing to do!  Examples are often shown on random data, but let's use real data, starting with the name data!

#### Reviewing Booleans
We first need to do a quick Boolean review. A 'boolean' is a variable type that can have a value of either True or False. Booleans are usually created by performing some kind of simple test. For example, the statement 2 > 5 is _False_ because it is _not true_ that 2 is greater than 5. You will want to briefly review what each symbol below means:
* a == b, tests if a is the same value as b.
* a != b, tests if a is not the same value as b.
* a > b, tests if a is greater than b.
* a >= b, tests if a is greater than or equal to b.
* a < b, tests if a is less than b.
* a <= b, tests if a is less than or equal to b.

##### NOTE: '==' is not the same as '='. '=' is used to assign values to variable names. '==' is used to test for equivalence.

Let's try some other tests in the cell below.

In [48]:
print(2 > 1)
print(1 == 1)
print(1 == 3)
print(5 <= 6)
print(5 <= 5)
print(100 >= 101)
print(2 != 4)
print('apple' != 'banana')

True
True
False
True
True
False
True
True


#### Creating a column of True/False values based on values in a column of a dataframe.
Let's say we want all the rows of the baby name data where the value in the 'Rank' column is 1. We just need to use the '==' operator we reviewed above.

First get the column from the dataframe using the .loc method and then use '==' to test for equivalence to 1. Notice that this prints a Series (which is like a pandas DataFrame, but 1-dimensional instead of having multiple columns).

In [49]:
nameData.loc[:, 'Rank'] == 1

0       True
1      False
2      False
3      False
4      False
       ...  
845    False
846    False
847    False
848    False
849    False
Name: Rank, Length: 850, dtype: bool

#### Assign the True/False values to a variable and use it with the .loc method to index the dataframe
This time, we will assign the output True/False values to the variable name _ranked1st_.  Now we can use this variable to index our dataframe.

In [50]:
ranked1st = nameData.loc[:, 'Rank'] == 1

#### Using a boolean series with .loc
You can use a boolean series with .loc to select the rows (or columns) where the series has a value of True. You can _not_ do this with .iloc.

Note how the cell below returns only the rows where _ranked1st_ has the value True.

In [51]:
nameData.loc[ranked1st, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
25,1,1981,Michael,3632
50,1,1982,Michael,3664
75,1,1983,Michael,3681
100,1,1984,Michael,3669
125,1,1985,Michael,3480
150,1,1986,Michael,3337
175,1,1987,Michael,3467
200,1,1988,Michael,3540
225,1,1989,Michael,3624


#### Note that you can use the True/False test directly in the .loc method. This is usually what you will see.

In [52]:
nameData.loc[nameData['Rank'] == 1, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
25,1,1981,Michael,3632
50,1,1982,Michael,3664
75,1,1983,Michael,3681
100,1,1984,Michael,3669
125,1,1985,Michael,3480
150,1,1986,Michael,3337
175,1,1987,Michael,3467
200,1,1988,Michael,3540
225,1,1989,Michael,3624
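
The two approaches above (storing the test in a variable first, or writing it directly inside .loc) select the same rows. The following sketch checks this on a small hypothetical dataframe named `toy`; it also shows that summing a boolean Series counts the rows that passed the test, which can be a handy sanity check.

```python
import pandas as pd

# Hypothetical toy data: two years, two ranks per year.
toy = pd.DataFrame({
    'Rank': [1, 2, 1, 2],
    'Year': [1980, 1980, 1981, 1981],
    'Name': ['Michael', 'Jason', 'Michael', 'Matthew'],
})

ranked_first = toy['Rank'] == 1             # boolean Series, one value per row
via_variable = toy.loc[ranked_first, :]
via_inline = toy.loc[toy['Rank'] == 1, :]

print(ranked_first.sum())                   # True counts as 1, so this counts matching rows
print(via_variable.equals(via_inline))
```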


#### Let's look at some more examples: Use booleans and .loc to get all the rows for the name 'William'.

In [53]:
nameData.loc[nameData['Name'] == 'William', :]

Unnamed: 0,Rank,Year,Name,Frequency
17,18,1980,William,1192
40,16,1981,William,1176
69,20,1982,William,1124
95,21,1983,William,1033
120,21,1984,William,1075
149,25,1985,William,919
173,24,1986,William,949
196,22,1987,William,983
223,24,1988,William,916
244,20,1989,William,983


#### Get all rows where the rank is 3 or better (3rd place, 2nd place, or 1st place). Because 1 is the best rank, this means testing for Rank <= 3.

In [54]:
nameData.loc[nameData['Rank'] <= 3, :]

Unnamed: 0,Rank,Year,Name,Frequency
0,1,1980,Michael,3886
1,2,1980,Jason,2389
2,3,1980,Christopher,2273
25,1,1981,Michael,3632
26,2,1981,Matthew,2329
...,...,...,...,...
801,2,2012,NOAH,749
802,3,2012,ALEXANDER,742
825,1,2013,NOAH,738
826,2,2013,JACOB,714


#### Practice Exercise: Get all the rows for the year 2000

In [55]:
# In this cell, use the .loc method and an '==' test to get all the rows where the year is equal to 2000

<a name='UsingLocWithMultipleConditions'></a>
# Using .loc to get rows based on multiple _conditions_
In this section, we are going to look at how we get rows where multiple conditions are True. This is also a very common thing to do!

First we need a quick review of the symbols we use for _and_ and _or_ when working with arrays (or Series) of True/False values:

* & - means 'and'
* | - means 'or'

#### How to get the row where the rank equals 1 and the year equals 2000.
Use the same equivalence tests we used above, but combine them with the & operator. This produces a Series of True/False values, where each value is True only if both tests are True. We expect only one row to test as True; that is, there should be only one row where the year equals 2000 and the rank equals 1.

##### Note you must now use parentheses to group each test.

In [56]:
nameData.loc[(nameData['Rank'] == 1) & (nameData['Year'] == 2000), :]

Unnamed: 0,Rank,Year,Name,Frequency
500,1,2000,Jacob,1640
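
The parentheses are needed because, in Python, & and | bind more tightly than comparison operators such as ==. Without them, Python would try to evaluate `1 & nameData['Year']` first, which is not the test we meant. Here is a minimal sketch of both operators on a hypothetical dataframe named `toy`; the single-letter names are placeholders, not real data.

```python
import pandas as pd

# Hypothetical toy data; the names are placeholders.
toy = pd.DataFrame({
    'Rank': [1, 2, 1, 2],
    'Year': [1999, 1999, 2000, 2000],
    'Name': ['A', 'B', 'C', 'D'],
})

# Each test is wrapped in parentheses before being combined.
both = (toy['Rank'] == 1) & (toy['Year'] == 2000)     # True only where both tests pass
either = (toy['Rank'] == 1) | (toy['Year'] == 2000)   # True where at least one test passes

print(toy.loc[both, 'Name'].tolist())
print(toy.loc[either, 'Name'].tolist())
```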


### Now let's try some examples on our auto data. First we will load the data.

In [57]:
outPath = ('/Users/yuzhang/Dropbox/Academia/Lecturer/I&C_SCI_X426.62/week3/auto-mpg-tabs.csv')

autoMPGData = pd.read_csv(outPath, sep='\t', index_col=0)
autoMPGData.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino


#### Use the or operator, '|', to get the rows for where the name is 'ford gran torino' or 'ford pinto'.

In [58]:
autoMPGData.loc[(autoMPGData['carname'] == 'ford gran torino') | (autoMPGData['carname'] == 'ford pinto'), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
88,14.0,8,302.0,137.0,4042.0,14.5,73,ford gran torino
112,19.0,4,122.0,85.00,2310.0,18.5,73,ford pinto
130,26.0,4,122.0,80.00,2451.0,16.5,74,ford pinto
136,16.0,8,302.0,140.0,4141.0,14.0,74,ford gran torino
168,23.0,4,140.0,83.00,2639.0,17.0,75,ford pinto
174,18.0,6,171.0,97.00,2984.0,14.5,75,ford pinto
190,14.5,8,351.0,152.0,4215.0,12.8,76,ford gran torino
206,26.5,4,140.0,72.00,2565.0,13.6,76,ford pinto


#### Get all rows for cars built after 1980. This time we assign the output to a variable named _carModelsAbove80_ and use .head() to print the first few rows.
We also print the type of _carModelsAbove80_ so you can see that it is also a dataframe.

In [59]:
carModelsAbove80 = autoMPGData.loc[(autoMPGData['model year'] > 80), :]
print(type(carModelsAbove80))
carModelsAbove80.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
338,27.2,4,135.0,84.0,2490.0,15.7,81,plymouth reliant
339,26.6,4,151.0,84.0,2635.0,16.4,81,buick skylark
340,25.8,4,156.0,92.0,2620.0,14.4,81,dodge aries wagon (sw)
341,23.5,6,173.0,110.0,2725.0,12.6,81,chevrolet citation
342,30.0,4,135.0,84.0,2385.0,12.9,81,plymouth reliant


#### Introducing the .isin() method for testing if a value in a column is in a list of possible values.
Let's say we wanted all rows where the car name is 'ford pinto', 'ford gran torino', or 'ford maverick'. We could use three equivalence tests and two or operators to string the tests together. But another way is to use the .isin() method.  See the example below:

In [60]:
# We can call the .isin() method on any column and pass it the list of
# values we want to check for.
autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick'])

0      False
1      False
2      False
3      False
4      False
       ...  
393    False
394    False
395    False
396    False
397    False
Name: carname, Length: 398, dtype: bool

In [61]:
# Now, let's use it in the .loc[] method to get those rows from the dataframe
autoMPGData.loc[autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick']), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
17,21.0,6,200.0,85.00,2587.0,16.0,70,ford maverick
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
88,14.0,8,302.0,137.0,4042.0,14.5,73,ford gran torino
100,18.0,6,250.0,88.00,3021.0,16.5,73,ford maverick
112,19.0,4,122.0,85.00,2310.0,18.5,73,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,ford maverick
130,26.0,4,122.0,80.00,2451.0,16.5,74,ford pinto
136,16.0,8,302.0,140.0,4141.0,14.0,74,ford gran torino
155,15.0,6,250.0,72.00,3158.0,19.5,75,ford maverick
168,23.0,4,140.0,83.00,2639.0,17.0,75,ford pinto
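
To confirm that .isin() gives the same answer as chaining equivalence tests with |, here is a minimal sketch on a hypothetical four-row dataframe named `toy`. It also uses the ~ operator, which is not covered elsewhere in this notebook: placed in front of a boolean Series, it flips True and False, which is one way to select the rows that are _not_ in the list.

```python
import pandas as pd

# Hypothetical toy data, loosely modeled on the auto data.
toy = pd.DataFrame({
    'carname': ['ford pinto', 'ford maverick', 'buick skylark', 'ford gran torino'],
    'mpg': [25.0, 21.0, 26.6, 14.0],
})

wanted = ['ford pinto', 'ford gran torino', 'ford maverick']

with_isin = toy['carname'].isin(wanted)
with_or = ((toy['carname'] == 'ford pinto')
           | (toy['carname'] == 'ford gran torino')
           | (toy['carname'] == 'ford maverick'))

print(with_isin.equals(with_or))
print(toy.loc[~with_isin, 'carname'].tolist())   # ~ flips True and False
```

The .isin() version stays the same length no matter how many values are in the list, which is the main reason to prefer it once there are more than two or three values to test.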


#### Now, let's use an equivalence test to get all the rows where the horsepower column has a value of '?'

In [62]:
autoMPGData.loc[autoMPGData['horsepower'] == '?', :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,amc concord dl


#### Now, use the not-equal test, !=, to get all the rows that do not have a missing horsepower value (marked with '?').

In [63]:
autoMPGDataCleaned = autoMPGData.loc[autoMPGData['horsepower'] != '?', :]
autoMPGDataCleaned.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
393,27.0,4,140.0,86.0,2790.0,15.6,82,ford mustang gl
394,44.0,4,97.0,52.0,2130.0,24.6,82,vw pickup
395,32.0,4,135.0,84.0,2295.0,11.6,82,dodge rampage
396,28.0,4,120.0,79.0,2625.0,18.6,82,ford ranger
397,31.0,4,119.0,82.0,2720.0,19.4,82,chevy s-10


In [64]:
autoMPGDataCleaned['horsepower'].unique()

array(['130.0', '165.0', '150.0', '140.0', '198.0', '220.0', '215.0',
       '225.0', '190.0', '170.0', '160.0', '95.00', '97.00', '85.00',
       '88.00', '46.00', '87.00', '90.00', '113.0', '200.0', '210.0',
       '193.0', '100.0', '105.0', '175.0', '153.0', '180.0', '110.0',
       '72.00', '86.00', '70.00', '76.00', '65.00', '69.00', '60.00',
       '80.00', '54.00', '208.0', '155.0', '112.0', '92.00', '145.0',
       '137.0', '158.0', '167.0', '94.00', '107.0', '230.0', '49.00',
       '75.00', '91.00', '122.0', '67.00', '83.00', '78.00', '52.00',
       '61.00', '93.00', '148.0', '129.0', '96.00', '71.00', '98.00',
       '115.0', '53.00', '81.00', '79.00', '120.0', '152.0', '102.0',
       '108.0', '68.00', '58.00', '149.0', '89.00', '63.00', '48.00',
       '66.00', '139.0', '103.0', '125.0', '133.0', '138.0', '135.0',
       '142.0', '77.00', '62.00', '132.0', '84.00', '64.00', '74.00',
       '116.0', '82.00'], dtype=object)

![](Success!.png)

# Lesson Summary:
In this lesson you learned:
* How to use the .iloc[] method to select rows and columns from a dataframe by their number.
* How to use the .loc[] method to select rows and columns from a dataframe by their label.
* How to use the .loc[] method to select data based on boolean (True/False) arrays.