# Selecting Data In A DataFrame

In this notebook, we will learn how to select data in a DataFrame!

## Notebook Outline:

* <a href='#IntroToIndexing'>Introduction To Indexing</a>
* <a href='#IntroducingILoc'>Introduction To iLoc</a>
* <a href='#IntroducingLoc'>Introduction To Loc</a>
* <a href='#UsingLocWithCondition'>Using Loc With A Condition</a>
* <a href='#UsingLocWithMultipleConditions'>Using Loc With Multiple Conditions</a>

# How to use this Notebook

The best way to use this notebook is to follow along with the lecture and then to apply what you learn to your own data files, or (if you do not have any of your own data) to practice using these functions and methods on the provided data. A little practice goes a long way towards understanding and retaining! It would be easy to just skim this notebook, but you will learn more by doing!

<a name="IntroToIndexing"></a>
#  Introduction to Indexing
Indexing and slicing refer to methods for grabbing specific rows and columns from a dataset.  Maybe you want the value at the 100th row and 11th column, or maybe you want all the rows of data for the month of January in 2017. Maybe you want all the rows where a certain value is greater than 0.  These are all examples where you will want to use indexing.

We are going to cover the two main methods of indexing, `.iloc[]` and `.loc[]`, and then start using them in examples. First, we need to get a handle on the basics of these methods!

<a name='IntroducingILoc'></a>
# Introducing the `.iloc[]` method
The .iloc method will allow us to select rows and columns based on the _number_ of the row and column. For example, we can select the 10th row and 3rd column, or we can select all values on the 17th row, etc...  Let's learn about .iloc[] via the examples below.

We need a dataset to practice on, so let's load the ShiftManagerApp_LaborSheet dataset that we saw in a previous lecture.

In [1]:
# In this cell we import pandas and load the datafile.
import pandas as pd
import os

filepath = os.path.join(os.getcwd(), 'data', 'ShiftManagerApp_LaborSheet.csv')
store_data = pd.read_csv(filepath)

#### Let's use the `.head()` method to get a quick look at the data

Exercise: fill in the line below to use the `head` method.

In [2]:
store_data.head()

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
3,4462,JessicaB,2017-02-05,08:00:00,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,


#### now let's use the `.iloc[]` method
The `.iloc[]` method allows us to index a dataframe in a very similar way to how we would index a list. It is important to remember that `.iloc[]` does not use any of the row or column labels (names), but it uses the numerical position of each row and column.

While it is important to know about `.iloc[]`, I actually don't use it very often. I more often use a similar method, `.loc[]`, that we will learn about next!

To use `.iloc[]`, simply write it after the dataframe variable name and then use _square_ brackets to grab the row and column you want. The first value is the row number, and the second is the column number.  For example, `store_data.iloc[0, 3]` will get the value from the first row and the fourth column.

In the next cell we grab the value in the first row and first column of the dataframe (remember that Python is zero-indexed).

The value that is returned is 4462. Note that this value corresponds to the value in the first row and first column (`Store_ID`) in the output above.

In [3]:
# We grab the value from the first row and the first column.
store_data.iloc[0, 0]

4462
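
To see that `.iloc[]` really does ignore labels, here is a minimal sketch on a small made-up dataframe. The name `toy`, its columns, and its values are assumptions for illustration only and are not part of the store data.

```python
import pandas as pd

# A small, made-up dataframe with string row labels (hypothetical data).
toy = pd.DataFrame(
    {"Store_ID": [4462, 4462, 31225], "Sales": [420.0, 155.0, 311.0]},
    index=["a", "b", "c"],
)

# .iloc uses positions only: row 0, column 0 is the top-left value,
# even though the row is labelled "a" and the column is named "Store_ID".
first_value = toy.iloc[0, 0]

# Row position 2, column position 1 is the last row of the 'Sales' column.
last_sales = toy.iloc[2, 1]

print(first_value, last_sales)
```

Because only positions matter, the same two lines would work no matter how the rows and columns are named.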

#### Now, let's use `.iloc[]`  to get the 2nd row.
Remember, python is zero-indexed so the 2nd row is at index 1.

Note how we use ':' to get all the columns.

In [4]:
store_data.iloc[1, :]

Store_ID                                4462
Manager                                ZoeyD
Date                              2017-02-05
Ending_Hour                         06:00:00
Projected_Sales                           90
Sales                                    155
DT_TTL                                   114
Car_Count                                NaN
KVS_Total                                 78
Scheduled_People                         NaN
Actual_People                            NaN
Reason_for_Labor_Diff                    NaN
Reason_for_High_TTLs                     NaN
Manager_Entering_Data                    NaN
Timestamp                2017-02-05 11:30:48
OEPE                                     NaN
Park_Percentage                          NaN
Name: 1, dtype: object

#### You don't actually need to use the ':' to get all the columns. But you do need to use ':' to get all the rows (see a few cells below). So, as a practice, it's easier to remember to just use ':'.

In [5]:
store_data.iloc[1, ]

Store_ID                                4462
Manager                                ZoeyD
Date                              2017-02-05
Ending_Hour                         06:00:00
Projected_Sales                           90
Sales                                    155
DT_TTL                                   114
Car_Count                                NaN
KVS_Total                                 78
Scheduled_People                         NaN
Actual_People                            NaN
Reason_for_Labor_Diff                    NaN
Reason_for_High_TTLs                     NaN
Manager_Entering_Data                    NaN
Timestamp                2017-02-05 11:30:48
OEPE                                     NaN
Park_Percentage                          NaN
Name: 1, dtype: object

#### Let's use `.iloc[]` to get all the values in the 3rd column
Note how we use ':' to get all the rows. The 3rd column is at index 2.

In [6]:
store_data.iloc[:, 2]

0        2017-01-23
1        2017-02-05
2        2017-02-05
3        2017-02-05
4        2017-02-05
            ...    
25466    2018-07-28
25467    2018-07-28
25468    2018-07-28
25469    2018-07-28
25470    2018-07-28
Name: Date, Length: 25471, dtype: object

#### Now let's use `.iloc[]` to get the first 10 rows.
When getting a range of rows, we can type the range as < first row number >: < last row number + 1>. For example store_data.iloc[0:10, :] will get all the rows from row 0 through row 9.

In [7]:
store_data.iloc[0:10, :]

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
3,4462,JessicaB,2017-02-05,08:00:00,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,
5,4462,JessicaB,2017-02-05,10:00:00,554.0,534.0,170.0,,107.0,,,,,,2017-02-05 12:04:04,,
6,4462,JessicaB,2017-02-05,11:00:00,594.0,827.0,257.0,,150.0,,,,,,2017-02-05 12:12:48,,
7,4462,JessicaB,2017-02-05,13:00:00,649.0,740.0,201.0,,117.0,,,,,,2017-02-05 15:16:20,,
8,4462,JessicaB,2017-02-05,14:00:00,552.0,474.0,220.0,,167.0,,,,,,2017-02-05 15:17:56,,
9,4462,JillianA,2017-02-05,18:00:00,474.0,322.0,177.0,,108.0,,,,,,2017-02-05 18:15:06,,


#### Let's use `.iloc[]` to get the first 10 rows and the first 2 columns

In [8]:
store_data.iloc[0:10, 0:2]

Unnamed: 0,Store_ID,Manager
0,4462,JillianA
1,4462,ZoeyD
2,4462,JessicaB
3,4462,JessicaB
4,4462,JessicaB
5,4462,JessicaB
6,4462,JessicaB
7,4462,JessicaB
8,4462,JessicaB
9,4462,JillianA


#### If you are starting your selection at 0, you don't actually need to type the 0. For example:

In [9]:
store_data.iloc[:10, :2]

Unnamed: 0,Store_ID,Manager
0,4462,JillianA
1,4462,ZoeyD
2,4462,JessicaB
3,4462,JessicaB
4,4462,JessicaB
5,4462,JessicaB
6,4462,JessicaB
7,4462,JessicaB
8,4462,JessicaB
9,4462,JillianA


#### Now let's get every other row. The notation is dataframe.iloc[< first row index > : < last row index + 1> : < step size >, :]

In [10]:
store_data.iloc[0:10:2, :]

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,
6,4462,JessicaB,2017-02-05,11:00:00,594.0,827.0,257.0,,150.0,,,,,,2017-02-05 12:12:48,,
8,4462,JessicaB,2017-02-05,14:00:00,552.0,474.0,220.0,,167.0,,,,,,2017-02-05 15:17:56,,
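
The start, stop, and step values work the same way as they do for a Python list. Here is a short sketch on a made-up dataframe (`toy` and its `value` column are hypothetical) that shows the stop position is not included and that the start and stop can be left out.

```python
import pandas as pd

# Six made-up rows so the effect of the step is easy to see.
toy = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15]})

# Rows 0 up to (but not including) 6, taking every 2nd row.
every_other = toy.iloc[0:6:2, :]

# Leaving out the start and stop means "from the beginning to the end".
same_rows = toy.iloc[::2, :]

print(every_other)
```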


#### You can also select specific rows and columns:

In [11]:
store_data.iloc[[1, 4, 7], :]

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,
7,4462,JessicaB,2017-02-05,13:00:00,649.0,740.0,201.0,,117.0,,,,,,2017-02-05 15:16:20,,


In [12]:
store_data.iloc[[1, 4, 7], [1, 3]]

Unnamed: 0,Manager,Ending_Hour
1,ZoeyD,06:00:00
4,JessicaB,09:00:00
7,JessicaB,13:00:00


## In Class Exercise
Please create a cell below and use the .iloc[] method to explore the dataset.

<a name='IntroducingLoc'></a>
# Introducing the `.loc[]` method
.loc[] lets us select rows and columns by their labels or by boolean value (true/false tests). I use .loc _much_ more often than I use .iloc.

For example, we use the .loc method below to select the 'Store_ID' column.

In [13]:
store_data.loc[:, 'Store_ID']

0         4462
1         4462
2         4462
3         4462
4         4462
         ...  
25466    31225
25467    31225
25468    31225
25469    31225
25470    31225
Name: Store_ID, Length: 25471, dtype: int64

#### Let's use `.loc[]` to grab just the first row of the 'Store_ID' column.
Notice that our row labels also happen to be the number of the row. This is not always the case but it is here.

In [14]:
store_data.loc[0 , 'Store_ID']

4462
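
When the row labels are not the same as the row positions, `.loc[]` and `.iloc[]` need different values to reach the same row. The sketch below uses a made-up dataframe (`toy`, with row labels 10, 20, 30, 40 chosen only for illustration) to show this, and also shows that a label slice with `.loc[]` includes its end label while a position slice with `.iloc[]` does not.

```python
import pandas as pd

# Made-up data whose row labels (10, 20, 30, 40) differ from the positions (0, 1, 2, 3).
toy = pd.DataFrame({"Sales": [420.0, 155.0, 182.0, 311.0]}, index=[10, 20, 30, 40])

by_label = toy.loc[20, "Sales"]    # the row labelled 20
by_position = toy.iloc[1, 0]       # the second row, first column

# A label slice includes both endpoints: labels 10, 20 and 30.
label_slice = toy.loc[10:30, "Sales"]

# A position slice excludes the stop: positions 0 and 1 only.
position_slice = toy.iloc[0:2, 0]

print(by_label, by_position, len(label_slice), len(position_slice))
```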

#### Let's now get all the rows and the columns 'Manager' and 'Sales'.
Notice that when you want multiple rows and/or columns you need to list the labels of the rows and/or columns you want. (That is, the labels are in square brackets, so they are in a list.)

In [15]:
store_data.loc[:, ['Manager', 'Sales'] ]

Unnamed: 0,Manager,Sales
0,JillianA,420.0
1,ZoeyD,155.0
2,JessicaB,182.0
3,JessicaB,311.0
4,JessicaB,598.0
5,JessicaB,534.0
6,JessicaB,827.0
7,JessicaB,740.0
8,JessicaB,474.0
9,JillianA,322.0


## In Class Exercise
Please create a cell below and use the .loc[] method to explore the dataset.

<a name='UsingLocWithCondition'></a>
# Using `.loc[]` to get rows based on a _condition_
In this section, we are going to look at how we get rows where a certain condition is True. This is a very common thing to do!  Often examples are shown on random data, but let's use it on real data, starting with the store data!

#### Reviewing Booleans
We first need to do a quick Boolean review. A 'boolean' is a variable type that can have a value of either True or False. They are usually created by performing some kind of simple test. For example, the statement 2 > 5 is _false_ because it is _not true_ that 2 > 5. You will want to briefly review what each symbol below means:
* a == b, tests if a is the same value as b.
* a != b, tests if a is not the same value as b.
* a > b, tests if a is greater than b.
* a >= b, tests if a is greater than or equal to b.
* a < b, tests if a is less than b.
* a <= b, tests if a is less than or equal to b.

##### NOTE: '==' is not the same as '='. '=' is used to assign values to variable names. '==' is used to test for equivalence.

Let's try some other tests in the cell below.

In [None]:
print(2 > 1)
print(1 == 1)
print(1 == 3)
print(5 <= 6)
print(5 <= 5)
print(100 >= 101)
print(2 != 4)
print('apple' != 'banana')

#### Creating a column of true/false values based on values in a column of a dataframe.

First get the column from the dataframe using the .loc method and then use '==' to test for equivalence to 4462. Notice that this prints a Series (which is like a pandas DataFrame but just 1-dimensional instead of having multiple columns).

In [16]:
store_data.loc[:, 'Store_ID'] == 4462

0         True
1         True
2         True
3         True
4         True
         ...  
25466    False
25467    False
25468    False
25469    False
25470    False
Name: Store_ID, Length: 25471, dtype: bool

#### Assign the True/False values to a variable and use it with the .loc method to index the dataframe
This time, we will assign the output True/False values to the variable name `store_4462`.  Now we can use this variable to index our dataframe.

In [17]:
store_4462 = store_data['Store_ID'] == 4462

#### Using a boolean series with .loc
You can use a boolean series with .loc to select the rows (or columns) where the series has a value of True. You can _not_ do this with .iloc.

Note how the below only gets the rows where `store_4462` has the value of True.

In [18]:
store_data.loc[store_4462, :]

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
3,4462,JessicaB,2017-02-05,08:00:00,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,
5,4462,JessicaB,2017-02-05,10:00:00,554.0,534.0,170.0,,107.0,,,,,,2017-02-05 12:04:04,,
6,4462,JessicaB,2017-02-05,11:00:00,594.0,827.0,257.0,,150.0,,,,,,2017-02-05 12:12:48,,
7,4462,JessicaB,2017-02-05,13:00:00,649.0,740.0,201.0,,117.0,,,,,,2017-02-05 15:16:20,,
8,4462,JessicaB,2017-02-05,14:00:00,552.0,474.0,220.0,,167.0,,,,,,2017-02-05 15:17:56,,
9,4462,JillianA,2017-02-05,18:00:00,474.0,322.0,177.0,,108.0,,,,,,2017-02-05 18:15:06,,
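
Here is the same idea as a small self-contained sketch on made-up data (`toy` and its values are hypothetical). As far as I know, passing a boolean Series directly to `.iloc[]` raises an error, but `.iloc[]` does accept a plain boolean array, which you can get from the Series with `.to_numpy()`.

```python
import pandas as pd

# Made-up data with two different store IDs.
toy = pd.DataFrame({"Store_ID": [4462, 4462, 31225], "Sales": [420.0, 155.0, 311.0]})

# A boolean Series: True where the test passes, False elsewhere.
is_4462 = toy["Store_ID"] == 4462

# .loc keeps only the rows where the Series is True.
rows_loc = toy.loc[is_4462, :]

# .iloc works with the underlying boolean array instead of the Series.
rows_iloc = toy.iloc[is_4462.to_numpy(), :]

print(rows_loc)
```

In practice `.loc[]` is the more natural choice here, since it matches the True/False values to rows by label.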


#### Note that you can use the True/False test directly in the .loc method. This is usually what you will see.

In [19]:
store_4462 = store_data.loc[store_data['Store_ID'] == 4462, :]
store_4462.head()

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
1,4462,ZoeyD,2017-02-05,06:00:00,90.0,155.0,114.0,,78.0,,,,,,2017-02-05 11:30:48,,
2,4462,JessicaB,2017-02-05,07:00:00,173.0,182.0,106.0,,81.0,,,,,,2017-02-05 11:35:48,,
3,4462,JessicaB,2017-02-05,08:00:00,333.0,311.0,102.0,,55.0,,,,,,2017-02-05 11:52:05,,
4,4462,JessicaB,2017-02-05,09:00:00,594.0,598.0,155.0,,106.0,,,,,,2017-02-05 11:59:35,,


#### Let's look at some more examples: Use booleans and .loc to get all the rows for the manager 'JillianA'.

In [20]:
store_data.loc[store_data['Manager'] == 'JillianA', :]

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
0,4462,JillianA,2017-01-23,08:00:00,540.0,420.0,170.0,,100.0,,,,,,2017-01-23 09:52:14,,
9,4462,JillianA,2017-02-05,18:00:00,474.0,322.0,177.0,,108.0,,,,,,2017-02-05 18:15:06,,
10,4462,JillianA,2017-02-05,19:00:00,420.0,382.0,224.0,,129.0,,,,,,2017-02-05 19:19:42,,
11,4462,JillianA,2017-02-05,20:00:00,371.0,305.0,220.0,,152.0,,,,,,2017-02-05 21:14:54,,
12,4462,JillianA,2017-02-05,21:00:00,204.0,335.0,207.0,,99.0,,,,,,2017-02-05 21:16:31,,
45,4462,JillianA,2017-02-09,15:00:00,585.0,460.0,168.0,,112.0,,,,,,2017-02-09 15:06:13,,
46,4462,JillianA,2017-02-09,16:00:00,526.0,595.0,221.0,,116.0,,,,,,2017-02-09 16:08:27,,
47,4462,JillianA,2017-02-09,17:00:00,522.0,491.0,196.0,,116.0,,,,,,2017-02-09 17:13:56,,
48,4462,JillianA,2017-02-09,18:00:00,668.0,689.0,177.0,,91.0,,,,,,2017-02-09 18:03:58,,
50,4462,JillianA,2017-02-09,20:00:00,524.0,427.0,227.0,,95.0,,,,,,2017-02-09 20:10:56,,


#### Get all rows where sales >= 1000

In [21]:
top_sales_hours = store_data.loc[store_data['Sales'] >= 1000, :]
top_sales_hours.head()

Unnamed: 0,Store_ID,Manager,Date,Ending_Hour,Projected_Sales,Sales,DT_TTL,Car_Count,KVS_Total,Scheduled_People,Actual_People,Reason_for_Labor_Diff,Reason_for_High_TTLs,Manager_Entering_Data,Timestamp,OEPE,Park_Percentage
64,4462,CharlesS,2017-02-11,13:00:00,1050.0,1028.0,179.0,,86.0,,,,,,2017-02-11 13:06:28,,
111,4462,JessicaB,2017-02-17,13:00:00,875.0,1002.0,223.0,,146.0,,,,,,2017-02-17 13:29:14,,
118,4462,CharlesS,2017-02-18,12:00:00,973.0,1025.0,197.0,,118.0,,,,,,2017-02-18 12:12:31,,
119,4462,CharlesS,2017-02-18,13:00:00,1004.0,1028.0,227.0,,117.0,,,,,,2017-02-18 13:38:16,,
120,4462,CharlesS,2017-02-18,14:00:00,799.0,1049.0,243.0,,99.0,,,,,,2017-02-18 14:34:16,,


## In Class Exercise
Please create a cell below and use the .loc[] method to explore the dataset. Try to select the rows with DT_TTL < 60.

<a name='UsingLocWithMultipleConditions'></a>
# Using `.loc[]` to get rows based on multiple _conditions_
In this section, we are going to look at how we get rows where multiple conditions are True. This is also a very common thing to do!

First we need a quick review of the symbols we use for _and_ and _or_ when using arrays (or series) of True/False values:

* & - means 'and'
* | - means 'or'

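#### How to get the rows where the DT_TTL is below 150 and the Sales are above 500
Use the same kind of tests we used above, but combine them with the & operator. This produces a series of True/False values, where each value is True only if both tests are True. In the cell below we also select just the 'Manager' column and call `.value_counts()` to count how many of the matching rows belong to each manager.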

##### Note you must now use parentheses to group each test.

In [22]:
store_data.loc[(store_data['DT_TTL'] < 150) & (store_data['Sales'] > 500), 'Manager'].value_counts()

JessicaB      34
HeatherW      31
JordanW       31
ArielA        30
ErinS         25
JessicaM      23
RachelH       16
JessicaA      15
TomW          13
MelissaJ      12
EricaF         8
CeaunnaS       7
KyllieT        6
JoseM          6
MichelleM      6
AdrianaR       4
KyllieA        4
BrandonD       4
MonicaH        4
OraM           4
CharlesE       3
SamanthaF      3
ClaudiaA       3
BlaineO        3
JillianA       3
TommyA         3
CharizmaM      2
ShannonH       2
ShannonL       2
CharlesS       2
DonovanS       2
ColterB        2
DaisyD         1
ChristinaS     1
CarmellaR      1
LynnW          1
KaylaW         1
JennaJ         1
JadenM         1
EmilyP         1
DacotaA        1
CheyenneN      1
JammieT        1
CheyenneK      1
KatieB         1
StephanieA     1
GregD          1
Erin           1
Name: Manager, dtype: int64
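
The parentheses matter because Python evaluates `&` and `|` before comparisons such as `<` and `>`. The sketch below uses made-up values (`toy` is hypothetical) to show the two operators, and shows that leaving the parentheses out leads to an error instead of the result you want.

```python
import pandas as pd

# Made-up data: three rows with a drive-through time and a sales value.
toy = pd.DataFrame({"DT_TTL": [120, 200, 140], "Sales": [600, 700, 300]})

# True only where BOTH tests are True.
both = (toy["DT_TTL"] < 150) & (toy["Sales"] > 500)

# True where AT LEAST ONE test is True.
either = (toy["DT_TTL"] < 150) | (toy["Sales"] > 500)

# Without parentheses, Python groups this as DT_TTL < (150 & Sales) > 500,
# which pandas cannot turn into a single True/False answer.
try:
    toy["DT_TTL"] < 150 & toy["Sales"] > 500
    parentheses_needed = False
except (ValueError, TypeError):
    parentheses_needed = True

print(both.tolist(), either.tolist(), parentheses_needed)
```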

### Now let's try some examples on our auto data. First we will load the data.

In [23]:
filepath = os.path.join(os.getcwd(), 'data', 'auto-mpg-tabs.csv')

autoMPGData = pd.read_csv(filepath, sep='\t', index_col=0)
autoMPGData.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino


#### Use the or operator, '|', to get the rows where the name is 'ford gran torino' or 'ford pinto'.

In [24]:
autoMPGData.loc[(autoMPGData['carname'] == 'ford gran torino') |
                (autoMPGData['carname'] == 'ford pinto'), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
88,14.0,8,302.0,137.0,4042.0,14.5,73,ford gran torino
112,19.0,4,122.0,85.00,2310.0,18.5,73,ford pinto
130,26.0,4,122.0,80.00,2451.0,16.5,74,ford pinto
136,16.0,8,302.0,140.0,4141.0,14.0,74,ford gran torino
168,23.0,4,140.0,83.00,2639.0,17.0,75,ford pinto
174,18.0,6,171.0,97.00,2984.0,14.5,75,ford pinto
190,14.5,8,351.0,152.0,4215.0,12.8,76,ford gran torino
206,26.5,4,140.0,72.00,2565.0,13.6,76,ford pinto


#### Get all rows for cars built after 1980. This time we assign the output to a variable named _carModelsAbove80_ and use `.head()` to print the first few rows.
We also print the type of _carModelsAbove80_ so you can see that it is also a dataframe.

In [25]:
carModelsAbove80 = autoMPGData.loc[(autoMPGData['model year'] > 80), :]
print(type(carModelsAbove80))
carModelsAbove80.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
338,27.2,4,135.0,84.0,2490.0,15.7,81,plymouth reliant
339,26.6,4,151.0,84.0,2635.0,16.4,81,buick skylark
340,25.8,4,156.0,92.0,2620.0,14.4,81,dodge aries wagon (sw)
341,23.5,6,173.0,110.0,2725.0,12.6,81,chevrolet citation
342,30.0,4,135.0,84.0,2385.0,12.9,81,plymouth reliant


### Introducing `.between()`

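We can use the `.between()` method to find values that are between two values. By default both endpoints are included, so `.between(80, 85)` keeps model years 80 and 85 as well as everything in between.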

In [26]:
autoMPGData.loc[autoMPGData['model year'].between(80, 85), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
309,41.5,4,98.0,76.00,2144.0,14.7,80,vw rabbit
310,38.1,4,89.0,60.00,1968.0,18.8,80,toyota corolla tercel
311,32.1,4,98.0,70.00,2120.0,15.5,80,chevrolet chevette
312,37.2,4,86.0,65.00,2019.0,16.4,80,datsun 310
313,28.0,4,151.0,90.00,2678.0,16.5,80,chevrolet citation
314,26.4,4,140.0,88.00,2870.0,18.1,80,ford fairmont
315,24.3,4,151.0,90.00,3003.0,20.1,80,amc concord
316,19.1,6,225.0,90.00,3381.0,18.7,80,dodge aspen
317,34.3,4,97.0,78.00,2188.0,15.8,80,audi 4000
318,29.8,4,134.0,90.00,2711.0,15.5,80,toyota corona liftback
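
As a quick check of how `.between()` treats its endpoints, here is a small sketch on a made-up series of model years (the values are hypothetical). It compares `.between()` with the two-test version written with `&`.

```python
import pandas as pd

# Made-up model years, including values just outside and exactly on the limits.
years = pd.Series([79, 80, 82, 85, 86], name="model year")

# .between() includes both endpoints by default.
with_between = years.between(80, 85)

# The same test written out with two comparisons and the & operator.
with_tests = (years >= 80) & (years <= 85)

print(with_between.tolist())
```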


### Introducing `.isin()`
Let's say we wanted all rows where the car name is 'ford pinto', 'ford gran torino', or 'ford maverick'. We could use three equivalence tests and two or operators to string the tests together. But, another way is to use the .isin() method.  See the example below:

In [27]:
# We can just use the .isin() method on any column and then pass the list of
# the values we want checked to the .isin() method
autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick'])

0      False
1      False
2      False
3      False
4      False
       ...  
393    False
394    False
395    False
396    False
397    False
Name: carname, Length: 398, dtype: bool

In [28]:
# Now, let's use it in the .loc[] method to get those rows from the dataframe
autoMPGData.loc[autoMPGData['carname'].isin(['ford pinto', 'ford gran torino','ford maverick']), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
17,21.0,6,200.0,85.00,2587.0,16.0,70,ford maverick
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
88,14.0,8,302.0,137.0,4042.0,14.5,73,ford gran torino
100,18.0,6,250.0,88.00,3021.0,16.5,73,ford maverick
112,19.0,4,122.0,85.00,2310.0,18.5,73,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,ford maverick
130,26.0,4,122.0,80.00,2451.0,16.5,74,ford pinto
136,16.0,8,302.0,140.0,4141.0,14.0,74,ford gran torino
155,15.0,6,250.0,72.00,3158.0,19.5,75,ford maverick
168,23.0,4,140.0,83.00,2639.0,17.0,75,ford pinto
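
To confirm that `.isin()` gives the same answer as chaining equivalence tests with `|`, here is a minimal sketch on a made-up series of car names (the series is hypothetical). It also uses `~` to flip the result and get the names that are _not_ in the list.

```python
import pandas as pd

# Made-up car names for illustration.
names = pd.Series(
    ["ford pinto", "buick skylark", "ford maverick", "ford gran torino"],
    name="carname",
)
wanted = ["ford pinto", "ford gran torino", "ford maverick"]

# One call to .isin() ...
with_isin = names.isin(wanted)

# ... matches three equivalence tests joined with the or operator.
with_or = (
    (names == "ford pinto")
    | (names == "ford gran torino")
    | (names == "ford maverick")
)

# ~ flips True and False, so this keeps the names that are not in the list.
not_wanted = names[~with_isin]

print(with_isin.tolist(), not_wanted.tolist())
```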


#### Selecting rows by wildcard to select all the models with the word ford

We can use the `.str.contains()` method on any column of strings to find all the rows with a string that contains a given substring.  We can also set the `case` argument to False to ignore case. Note that we have to use the `str` attribute to access string methods for the column.

In [30]:
# Now, let's use the contains method to find all the rows with ford in the carname
autoMPGData.loc[autoMPGData['carname'].str.contains('ford', case=False), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,ford galaxie 500
17,21.0,6,200.0,85.00,2587.0,16.0,70,ford maverick
25,10.0,8,360.0,215.0,4615.0,14.0,70,ford f250
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
36,19.0,6,250.0,88.00,3302.0,15.5,71,ford torino 500
40,14.0,8,351.0,153.0,4154.0,13.5,71,ford galaxie 500
43,13.0,8,400.0,170.0,4746.0,12.0,71,ford country squire (sw)
48,18.0,6,250.0,88.00,3139.0,14.5,71,ford mustang
61,21.0,4,122.0,86.00,2226.0,16.5,72,ford pinto runabout


#### We can use the '|' as an 'or' in our matches, because the pattern is treated as a regular expression by default.

In [31]:
# Now, let's use the contains method to find all the rows with ford or chevrolet in the carname
autoMPGData.loc[autoMPGData['carname'].str.contains('ford|chevrolet', case=False), :].head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino
5,15.0,8,429.0,198.0,4341.0,10.0,70,ford galaxie 500
6,14.0,8,454.0,220.0,4354.0,9.0,70,chevrolet impala
12,15.0,8,400.0,150.0,3761.0,9.5,70,chevrolet monte carlo
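
Here is a small sketch of the options used above, on a made-up series of names that also contains a missing value (the data is hypothetical). The `na=False` argument is an addition that the cells above do not use; it tells pandas to treat missing names as "no match" so the result can be used safely with `.loc[]`. Setting `regex=False` shows what happens when '|' is treated as a plain character instead of 'or'.

```python
import pandas as pd

# Made-up names with mixed case and one missing value.
names = pd.Series(
    ["Ford Pinto", "chevrolet impala", "toyota corolla", None],
    name="carname",
)

# case=False ignores upper/lower case; na=False counts missing values as no match.
has_ford = names.str.contains("ford", case=False, na=False)

# By default the pattern is a regular expression, so '|' means 'or'.
ford_or_chevy = names.str.contains("ford|chevrolet", case=False, na=False)

# With regex=False the whole text 'ford|chevrolet' must appear literally.
literal_pipe = names.str.contains("ford|chevrolet", regex=False, na=False)

print(has_ford.tolist(), ford_or_chevy.tolist(), literal_pipe.tolist())
```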


###  Now, let's use an equivalence test to get all the rows where the horsepower column has a value of '?'

In [32]:
autoMPGData.loc[autoMPGData['horsepower'] == '?', :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname
32,25.0,4,98.0,?,2046.0,19.0,71,ford pinto
126,21.0,6,200.0,?,2875.0,17.0,74,ford maverick
330,40.9,4,85.0,?,1835.0,17.3,80,renault lecar deluxe
336,23.6,4,140.0,?,2905.0,14.3,80,ford mustang cobra
354,34.5,4,100.0,?,2320.0,15.8,81,renault 18i
374,23.0,4,151.0,?,3035.0,20.5,82,amc concord dl


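#### Now, use the not equivalent test, !=, to get all the rows that do not have a missing horsepower value.

First we store the result of the test in a new column named `has_horsepower`, and then we use the same test with `.loc[]` to keep only the rows that pass it. Note that we also use the `.copy()` method when selecting the rows. This means that we are actually copying those rows into a new dataframe and not just creating a dataframe that references the rows in the original dataframe.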

In [36]:
autoMPGData["has_horsepower"] = autoMPGData['horsepower'] != '?'

In [37]:
autoMPGData.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname,has_horsepower
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu,True
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320,True
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite,True
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst,True
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino,True


In [39]:
autoMPGDataCleaned = autoMPGData.loc[autoMPGData['horsepower'] != '?', :].copy()
autoMPGDataCleaned.tail()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname,has_horsepower
393,27.0,4,140.0,86.0,2790.0,15.6,82,ford mustang gl,True
394,44.0,4,97.0,52.0,2130.0,24.6,82,vw pickup,True
395,32.0,4,135.0,84.0,2295.0,11.6,82,dodge rampage,True
396,28.0,4,120.0,79.0,2625.0,18.6,82,ford ranger,True
397,31.0,4,119.0,82.0,2720.0,19.4,82,chevy s-10,True
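
The sketch below shows the effect of `.copy()` on a made-up dataframe (`toy` is hypothetical): changes made to the copy do not touch the original, and the copy keeps the original row labels of the rows that were selected.

```python
import pandas as pd

# Made-up data with a '?' standing in for a missing horsepower value.
toy = pd.DataFrame({"horsepower": ["130.0", "?", "150.0"]})

# Select the rows without '?' and copy them into an independent dataframe.
cleaned = toy.loc[toy["horsepower"] != "?", :].copy()

# Changing the copy leaves the original dataframe as it was.
cleaned["horsepower"] = cleaned["horsepower"].astype(float)

print(toy["horsepower"].tolist())
print(cleaned["horsepower"].tolist(), cleaned.index.tolist())
```

In some pandas versions, assigning to a selection that was made without `.copy()` can produce a `SettingWithCopyWarning`, which is one more reason to copy explicitly.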


#### Now, use the `.astype()` method to convert the column to a float type.

In [40]:
# autoMPGDataCleaned.loc[:, 'horsepower'] = autoMPGDataCleaned.loc[:, 'horsepower'].astype(float)
autoMPGDataCleaned['horsepower'] = autoMPGDataCleaned['horsepower'].astype(float)


# Now we can compare horsepower to numbers. The ~ operator flips True and False,
# so this selects the cars whose horsepower is NOT over 190 (that is, 190 or less)
autoMPGDataCleaned.loc[~(autoMPGDataCleaned['horsepower'] > 190), :]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,accelartion,model year,carname,has_horsepower
0,18.0,8,307.0,130.0,3504.0,12.0,70,chevrolet chevelle malibu,True
1,15.0,8,350.0,165.0,3693.0,11.5,70,buick skylark 320,True
2,18.0,8,318.0,150.0,3436.0,11.0,70,plymouth satellite,True
3,16.0,8,304.0,150.0,3433.0,12.0,70,amc rebel sst,True
4,17.0,8,302.0,140.0,3449.0,10.5,70,ford torino,True
9,15.0,8,390.0,190.0,3850.0,8.5,70,amc ambassador dpl,True
10,15.0,8,383.0,170.0,3563.0,10.0,70,dodge challenger se,True
11,14.0,8,340.0,160.0,3609.0,8.0,70,plymouth 'cuda 340,True
12,15.0,8,400.0,150.0,3761.0,9.5,70,chevrolet monte carlo,True
14,24.0,4,113.0,95.0,2372.0,15.0,70,toyota corona mark ii,True
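
A different approach, which this notebook does not use, is to convert the column with `pd.to_numeric` and `errors='coerce'`. Instead of removing the rows with '?', this turns every value that cannot be read as a number into NaN. The sketch below uses a made-up series (the values are hypothetical), so treat it as an alternative to consider and not as the method used above.

```python
import pandas as pd

# Made-up horsepower values stored as text, with '?' for a missing value.
raw = pd.Series(["130.0", "?", "150.0"], name="horsepower")

# errors='coerce' replaces anything that is not a number with NaN.
as_numbers = pd.to_numeric(raw, errors="coerce")

print(as_numbers.isnull().tolist())
```

Which approach fits best depends on whether you want to keep the rows with missing horsepower in the dataframe.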


# Introducing `.isnull()`

In [42]:
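# .isnull() is True where a value is missing (NaN). The ~ flips True and False,
# so we keep only the rows where DT_TTL is not missing.
store_data_clean = store_data.loc[~store_data['DT_TTL'].isnull(), :]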
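
Here is the same pattern as a small sketch on made-up data (`toy` is hypothetical). It also shows `.notnull()`, which is the opposite of `.isnull()` and gives the same rows as flipping `.isnull()` with `~`.

```python
import numpy as np
import pandas as pd

# Made-up drive-through times with one missing value.
toy = pd.DataFrame({"DT_TTL": [170.0, np.nan, 106.0]})

# True where the value is missing.
missing = toy["DT_TTL"].isnull()

# Keep the rows that are NOT missing, written two equivalent ways.
kept_with_tilde = toy.loc[~missing, :]
kept_with_notnull = toy.loc[toy["DT_TTL"].notnull(), :]

print(missing.tolist(), len(kept_with_tilde))
```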

## In Class Exercise
Please create a cell below and use the `.loc[]` method to explore the dataset (`autoMPGData` or `store_data`). Use multiple tests, and also use the `.between()` and `.isin()` method.

# Lesson Summary:
In this lesson you learned:
* How to use the `.iloc[]` method to select rows and columns from a dataframe by their numerical position.
* How to use the `.loc[]` method to select rows and columns from a dataframe by their label.
* How to use the `.loc[]` method to select data based on boolean (True/False) arrays.

## Questions or Comments About This Notebook?
Feel free to contact me via my LinkedIn: https://www.linkedin.com/in/william-j-henry <br>
You can also email me at will@henryanalytics.com <br>