# How do I apply multiple filter criteria to a pandas DataFrame?

In [1]:
import pandas as pd

In [2]:
movies = pd.read_csv('http://bit.ly/imdbratings')

In [3]:
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [4]:
movies.shape

(979, 6)

## Filter the DataFrame so that we only see rows whose duration is at least 200 minutes (>= 200)

In [7]:
type(movies.duration > 200)

pandas.core.series.Series

Here's what we need to do first: construct a list of booleans that contains `True` when the duration is >= 200 and `False` otherwise. Let's do this the hard way first to get an understanding.

In [11]:
booleans = []
for length in movies.duration:
    if length >= 200:
        booleans.append(True)
    else:
        booleans.append(False)

In [12]:
booleans[0:5]

[False, False, True, False, False]

In [13]:
len(booleans)

979

**Next, we need to convert the boolean list to a pandas Series!**

In [14]:
is_long = pd.Series(booleans)

In [15]:
type(is_long)

pandas.core.series.Series

In [16]:
is_long.head()

0    False
1    False
2     True
3    False
4    False
dtype: bool

**The last step in filtering is to pass the Series object to the DataFrame using bracket notation.**

In [17]:
movies[is_long]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
7,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
17,8.7,Seven Samurai,UNRATED,Drama,207,"[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
85,8.4,Lawrence of Arabia,PG,Adventure,216,"[u""Peter O'Toole"", u'Alec Guinness', u'Anthony..."
142,8.3,Lagaan: Once Upon a Time in India,PG,Adventure,224,"[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
204,8.1,Ben-Hur,G,Adventure,212,"[u'Charlton Heston', u'Jack Hawkins', u'Stephe..."
445,7.9,The Ten Commandments,APPROVED,Adventure,220,"[u'Charlton Heston', u'Yul Brynner', u'Anne Ba..."
476,7.8,Hamlet,PG-13,Drama,242,"[u'Kenneth Branagh', u'Julie Christie', u'Dere..."


Normally, bracket notation is used to pull out a given column. In this case, though, we are passing a Series of `True` and `False` values. Observe how pandas reacts: it returns a DataFrame containing all the columns, but only the rows where the duration is at least 200 minutes. This may seem strange, since we are not being explicit by using `loc`. Just remember that when you pass a boolean Series, plain bracket notation filters rows.
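
To see the row-filtering behaviour in isolation, here is a minimal sketch. It assumes a tiny hand-built DataFrame, `movies_small`, which is a hypothetical stand-in for the real `movies` data (the three rows are copied from the output above) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical miniature stand-in for the movies data.
movies_small = pd.DataFrame({
    "title": ["The Godfather", "The Godfather: Part II", "The Dark Knight"],
    "duration": [175, 200, 152],
})

# One boolean per row, in row order.
mask = pd.Series([False, True, False])

# A boolean Series inside brackets selects rows, not columns.
long_movies = movies_small[mask]

print(long_movies["title"].tolist())  # ['The Godfather: Part II']
print(list(long_movies.columns))      # ['title', 'duration']
```

Note that the surviving row keeps its original index label (1 here), just as the filtered output above keeps labels such as 2, 7 and 17.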

That was the long way of doing things. You don't need to write the for loop, so here are two simpler ways to get the same result.

First, you can build the boolean Series directly with a comparison:

In [18]:
is_long = movies.duration >= 200
is_long.head()

0    False
1    False
2     True
3    False
4    False
Name: duration, dtype: bool

This works because, when you combine a Series such as `movies.duration` with a comparison operator (`>=` in this case), pandas knows you don't want a single `True` or `False`. It compares each value in the Series against 200 and returns a boolean Series of the same length.
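
As a quick check that the comparison really does the same work as the loop, here is a small sketch. The `durations` Series is a hypothetical stand-in built from the first five durations shown in the `head()` output above.

```python
import pandas as pd

# Hypothetical stand-in: the first five durations from the output above.
durations = pd.Series([142, 175, 200, 152, 154], name="duration")

# The long way: test each value in an explicit Python loop.
looped = pd.Series([length >= 200 for length in durations])

# The short way: one element-wise comparison on the whole Series.
vectorized = durations >= 200

print(vectorized.tolist())                     # [False, False, True, False, False]
print(looped.tolist() == vectorized.tolist())  # True
```

The only visible difference is that the comparison result keeps the name of the Series it came from (`duration`), which is why `Name: duration` appears in the output above.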

In summary, you don't need to write a for loop to create this boolean Series. And this boolean Series is what you pass to the DataFrame `movies` using bracket notation to filter the rows.

The second way is to skip creating `is_long` altogether and put the Series comparison directly inside the brackets.

In [19]:
movies[movies.duration >= 200]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
7,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
17,8.7,Seven Samurai,UNRATED,Drama,207,"[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
85,8.4,Lawrence of Arabia,PG,Adventure,216,"[u""Peter O'Toole"", u'Alec Guinness', u'Anthony..."
142,8.3,Lagaan: Once Upon a Time in India,PG,Adventure,224,"[u'Aamir Khan', u'Gracy Singh', u'Rachel Shell..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
204,8.1,Ben-Hur,G,Adventure,212,"[u'Charlton Heston', u'Jack Hawkins', u'Stephe..."
445,7.9,The Ten Commandments,APPROVED,Adventure,220,"[u'Charlton Heston', u'Yul Brynner', u'Anne Ba..."
476,7.8,Hamlet,PG-13,Drama,242,"[u'Kenneth Branagh', u'Julie Christie', u'Dere..."


**So that is the end of filtering rows!!**

BONUS: What if I want to view just the title and genre of this filtered DataFrame?
Well, remember that the output of `movies[movies.duration >= 200]` is itself a DataFrame. So, you can use dot notation to pull out a single column (for example `.genre`), or you can be explicit and select the columns with the `loc` indexer.

In [20]:
movies[movies.duration >= 200].loc[:, ['title','genre']]

Unnamed: 0,title,genre
2,The Godfather: Part II,Crime
7,The Lord of the Rings: The Return of the King,Adventure
17,Seven Samurai,Drama
78,Once Upon a Time in America,Crime
85,Lawrence of Arabia,Adventure
142,Lagaan: Once Upon a Time in India,Adventure
157,Gone with the Wind,Drama
204,Ben-Hur,Adventure
445,The Ten Commandments,Adventure
476,Hamlet,Drama


**BUT, and this is a big BUT, you can also do it in a single `loc` call, like this:**

In [21]:
movies.loc[movies.duration >= 200, ['title', 'genre']]

Unnamed: 0,title,genre
2,The Godfather: Part II,Crime
7,The Lord of the Rings: The Return of the King,Adventure
17,Seven Samurai,Drama
78,Once Upon a Time in America,Crime
85,Lawrence of Arabia,Adventure
142,Lagaan: Once Upon a Time in India,Adventure
157,Gone with the Wind,Drama
204,Ben-Hur,Adventure
445,The Ten Commandments,Adventure
476,Hamlet,Drama


What you are saying here is: give me the rows where the boolean Series is `True` (i.e. hey pandas, filter the rows based on this condition), and give me only the columns I listed in the second position of `loc`!
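
The sketch below compares the chained form with the single `loc` call. It assumes a hypothetical three-row stand-in, `movies_small`, with values copied from the tables above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the movies data.
movies_small = pd.DataFrame({
    "title": ["The Godfather", "The Godfather: Part II", "Hamlet"],
    "genre": ["Crime", "Crime", "Drama"],
    "duration": [175, 200, 242],
})

# Two steps: filter the rows first, then select columns from the result.
chained = movies_small[movies_small.duration >= 200].loc[:, ["title", "genre"]]

# One step: row condition and column list in a single loc call.
single = movies_small.loc[movies_small.duration >= 200, ["title", "genre"]]

print(chained.equals(single))    # True
print(single["title"].tolist())  # ['The Godfather: Part II', 'Hamlet']
```

Both forms return the same result here. The single `loc` call states the row condition and the column selection in one place, which many people find easier to read.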

## MULTIPLE FILTERING CRITERIA

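Use the element-wise `&` (and) and `|` (or) operators to combine several filtering criteria. Notice it is a single ampersand, not a double ampersand. Each condition must also be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>=` and `==`.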

In [22]:
movies.loc[(movies.duration >= 200) & (movies.genre == 'Crime'), ['title', 'genre']]

Unnamed: 0,title,genre
2,The Godfather: Part II,Crime
78,Once Upon a Time in America,Crime


The key thing to note is that you pass a boolean Series to select the rows!! Here the boolean Series happens to be the result of a compound expression.
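
Here is a small sketch of how compound conditions behave. It assumes a hypothetical four-row stand-in, `movies_small`, and additionally shows the `~` (not) operator and what happens with Python's `and` keyword, neither of which is used in the walkthrough above.

```python
import pandas as pd

# Hypothetical miniature stand-in for the movies data.
movies_small = pd.DataFrame({
    "title": ["The Godfather", "The Godfather: Part II", "Hamlet", "The Dark Knight"],
    "genre": ["Crime", "Crime", "Drama", "Action"],
    "duration": [175, 200, 242, 152],
})

is_long = movies_small.duration >= 200
is_crime = movies_small.genre == "Crime"

# & keeps rows where both conditions hold; | keeps rows where either holds.
long_and_crime = movies_small.loc[is_long & is_crime, "title"].tolist()
long_or_crime = movies_small.loc[is_long | is_crime, "title"].tolist()

# ~ negates a boolean Series element by element.
not_crime = movies_small.loc[~is_crime, "title"].tolist()

print(long_and_crime)  # ['The Godfather: Part II']
print(long_or_crime)   # ['The Godfather', 'The Godfather: Part II', 'Hamlet']
print(not_crime)       # ['Hamlet', 'The Dark Knight']

# Python's `and` asks for a single truth value of the whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    movies_small[is_long and is_crime]
except ValueError as err:
    and_error = str(err)
    print("ValueError:", and_error)
```

Storing each condition in a named variable, as done here, is one way to avoid forgetting the parentheses when the conditions are written inline.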

## Use boolean Series to filter rows! 

**BONUS!!**

### What if I have many OR conditions? That is, what if I want to select movies that are Crime, Drama, or Action?

One way to do this is by using multiple OR conditions, like this:

In [23]:
movies.loc[(movies.genre == 'Crime') | (movies.genre == 'Drama') | (movies.genre == 'Action'), ['title', 'genre']]

Unnamed: 0,title,genre
0,The Shawshank Redemption,Crime
1,The Godfather,Crime
2,The Godfather: Part II,Crime
3,The Dark Knight,Action
4,Pulp Fiction,Crime
5,12 Angry Men,Drama
9,Fight Club,Drama
11,Inception,Action
12,Star Wars: Episode V - The Empire Strikes Back,Action
13,Forrest Gump,Drama


This is a lot of typing! Fortunately, pandas Series has a method, `isin`, that checks whether each value in the Series belongs to a given list of values!

In [25]:
movies.loc[movies.genre.isin(['Crime', 'Drama', 'Action']), ['title', 'genre']]

Unnamed: 0,title,genre
0,The Shawshank Redemption,Crime
1,The Godfather,Crime
2,The Godfather: Part II,Crime
3,The Dark Knight,Action
4,Pulp Fiction,Crime
5,12 Angry Men,Drama
9,Fight Club,Drama
11,Inception,Action
12,Star Wars: Episode V - The Empire Strikes Back,Action
13,Forrest Gump,Drama
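
To confirm that `isin` matches the chained `|` conditions, here is a minimal sketch. The `genres` Series is a hypothetical stand-in for `movies.genre`, and the `~` negation at the end is an extra not shown in the walkthrough above.

```python
import pandas as pd

# Hypothetical stand-in for the genre column.
genres = pd.Series(["Crime", "Action", "Drama", "Adventure", "Crime"], name="genre")

wanted = ["Crime", "Drama", "Action"]

# The long way: one comparison per genre, joined with |.
via_or = (genres == "Crime") | (genres == "Drama") | (genres == "Action")

# The short way: a single membership test against the list.
via_isin = genres.isin(wanted)

print(via_or.equals(via_isin))  # True
print(via_isin.tolist())        # [True, True, True, False, True]

# Negating the mask selects everything that is NOT in the list.
others = genres[~via_isin].tolist()
print(others)                   # ['Adventure']
```

Because the list of values is ordinary Python data, it can be built or extended elsewhere in the code without rewriting the filter expression.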
