## 6. How do I filter rows based on column values (Single and Multiple Criteria)?

Filtering data means extracting the subset of the available data that satisfies our criteria. We filter to understand our data better and to prepare it for a machine learning problem.

In [1]:
import pandas as pd

We will use the movie dataset, which contains top-rated movies from the Internet Movie Database (IMDB). Each row represents a movie, and the columns represent its features. We can also check the shape and data types of the DataFrame.

In [2]:
movies = pd.read_csv("http://bit.ly/imdbratings")
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [3]:
movies.shape

(979, 6)

In [4]:
movies.dtypes

star_rating       float64
title              object
content_rating     object
genre              object
duration            int64
actors_list        object
dtype: object

### 6.1. Filtering by single criteria

Suppose we want to examine long movies, that is, only the rows where duration is at least 200 minutes. We simply pass a Series of Booleans (True/False) to the DataFrame movies inside square brackets, and only the rows marked True are kept. When we compare a Series with a number or a string, pandas generates such a Series of Booleans, as the code below shows.

In [5]:
movies[movies.duration>=200].head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
7,8.9,The Lord of the Rings: The Return of the King,PG-13,Adventure,201,"[u'Elijah Wood', u'Viggo Mortensen', u'Ian McK..."
17,8.7,Seven Samurai,UNRATED,Drama,207,"[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K..."
78,8.4,Once Upon a Time in America,R,Crime,229,"[u'Robert De Niro', u'James Woods', u'Elizabet..."
85,8.4,Lawrence of Arabia,PG,Adventure,216,"[u""Peter O'Toole"", u'Alec Guinness', u'Anthony..."
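
To make the intermediate step visible, here is a minimal sketch that builds the Boolean Series first and then passes it to the DataFrame. It uses a small hand-built DataFrame named `sample` (an assumption for illustration, with a few rows copied from the outputs above) so that it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical mini-dataset: a few rows copied from the movies output above.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Godfather: Part II", "The Dark Knight", "Seven Samurai"],
    "genre": ["Crime", "Crime", "Crime", "Action", "Drama"],
    "duration": [142, 175, 200, 152, 207],
})

# Comparing a Series with a number yields a Series of Booleans, one per row.
is_long = sample.duration >= 200
print(is_long.tolist())

# Passing that Boolean Series in square brackets keeps only the True rows.
long_movies = sample[is_long]
print(long_movies.title.tolist())
```

Naming the Boolean Series (here `is_long`, a name chosen only for this sketch) can make longer filters easier to read than writing the comparison inline.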


In [6]:
movies[movies.content_rating=="R"].head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
8,8.9,Schindler's List,R,Biography,195,"[u'Liam Neeson', u'Ralph Fiennes', u'Ben Kings..."


Notice that the above code outputs a DataFrame. If we want a single column out of that DataFrame, we can simply use dot or bracket notation. Similarly, because the result is then a Series, we can chain Series methods such as mean() to perform arithmetic on it.

In [7]:
movies[movies.content_rating=="R"].title.head()

0    The Shawshank Redemption
1               The Godfather
2      The Godfather: Part II
4                Pulp Fiction
8            Schindler's List
Name: title, dtype: object

In [8]:
movies[movies.content_rating=="R"].star_rating.mean()

7.85478260869563
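
Chaining a column selection onto the filtered DataFrame works, but the same result can also be obtained in one step with `.loc`, which accepts the row filter and the column name together. The following is a minimal sketch on a hypothetical hand-built `sample` DataFrame (rows copied from the output above), not on the full dataset.

```python
import pandas as pd

# Hypothetical mini-dataset: a few rows copied from the movies output above.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Dark Knight", "Pulp Fiction"],
    "content_rating": ["R", "R", "PG-13", "R"],
    "star_rating": [9.3, 9.2, 9.0, 8.9],
})

is_r = sample.content_rating == "R"

# Two steps: filter the rows, then pick the column from the result.
chained = sample[is_r].star_rating

# One step: .loc takes the row filter and the column label together.
with_loc = sample.loc[is_r, "star_rating"]

print(chained.equals(with_loc))
print(with_loc.mean())
```

Both forms return the same Series here; `.loc` is often preferred when the selection will later be used for assignment.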

### 6.2. Filtering by multiple criteria

To use multiple filter criteria, we need to be familiar with the logical operators and (‘&’) and or (‘|’). We will assume everyone is familiar with them; if you are not, they are easy to learn on your own, so please do that before proceeding. Another important point is the parentheses “( )”. Our code won't work unless each individual criterion is wrapped in its own pair of parentheses, with the logical operators placed between them. In Python, & and | bind more tightly than comparison operators such as >= and ==, so the parentheses are needed to make the evaluation order explicit. We can also apply multiple criteria to the same column.

In [9]:
movies[(movies.duration>=200) & (movies.genre=='Drama')]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
17,8.7,Seven Samurai,UNRATED,Drama,207,"[u'Toshir\xf4 Mifune', u'Takashi Shimura', u'K..."
157,8.2,Gone with the Wind,G,Drama,238,"[u'Clark Gable', u'Vivien Leigh', u'Thomas Mit..."
476,7.8,Hamlet,PG-13,Drama,242,"[u'Kenneth Branagh', u'Julie Christie', u'Dere..."
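
The role of the parentheses can be seen by applying two criteria to the same column, with and without them. This is a minimal sketch on a hypothetical hand-built `sample` DataFrame (rows copied from the outputs above); the thresholds 150 and 200 are chosen only for illustration, and the exact error message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical mini-dataset: a few rows copied from the movies output above.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Godfather: Part II", "The Dark Knight", "Seven Samurai"],
    "duration": [142, 175, 200, 152, 207],
})

# With parentheses: each comparison is evaluated first, then combined with &.
mid_length = sample[(sample.duration >= 150) & (sample.duration <= 200)]
print(mid_length.title.tolist())

# Without parentheses, & binds first, so Python reads the expression as
#   sample.duration >= (150 & sample.duration) <= 200
# which is a chained comparison that needs one True/False from a whole Series.
try:
    sample[sample.duration >= 150 & sample.duration <= 200]
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```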


In [10]:
movies[(movies.genre=='Crime') | (movies.genre=='Drama') | (movies.genre=='Action')]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
...,...,...,...,...,...,...
970,7.4,Wonder Boys,R,Drama,107,"[u'Michael Douglas', u'Tobey Maguire', u'Franc..."
972,7.4,Blue Valentine,NC-17,Drama,112,"[u'Ryan Gosling', u'Michelle Williams', u'John..."
973,7.4,The Cider House Rules,PG-13,Drama,126,"[u'Tobey Maguire', u'Charlize Theron', u'Micha..."
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo..."


When we are filtering rows based on multiple criteria from the same column, there is a simpler alternative: the Series method “isin()”. We simply pass it a list of the values we want to keep.

In [11]:
movies[movies.genre.isin(["Crime", "Drama", "Action"])]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."
...,...,...,...,...,...,...
970,7.4,Wonder Boys,R,Drama,107,"[u'Michael Douglas', u'Tobey Maguire', u'Franc..."
972,7.4,Blue Valentine,NC-17,Drama,112,"[u'Ryan Gosling', u'Michelle Williams', u'John..."
973,7.4,The Cider House Rules,PG-13,Drama,126,"[u'Tobey Maguire', u'Charlize Theron', u'Micha..."
976,7.4,Master and Commander: The Far Side of the World,PG-13,Action,138,"[u'Russell Crowe', u'Paul Bettany', u'Billy Bo..."
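
As a quick check that the two approaches agree, the sketch below compares the chained ‘|’ filter with “isin()” on a hypothetical hand-built `sample` DataFrame (rows copied from the outputs above). It also shows the `~` operator, which inverts a Boolean Series and so selects every row whose genre is not in the list.

```python
import pandas as pd

# Hypothetical mini-dataset: a few rows copied from the movies output above.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Dark Knight", "Seven Samurai",
              "The Lord of the Rings: The Return of the King",
              "Lawrence of Arabia"],
    "genre": ["Crime", "Action", "Drama", "Adventure", "Adventure"],
})

wanted = ["Crime", "Drama", "Action"]

# Chained "or" conditions on the same column ...
with_or = ((sample.genre == "Crime")
           | (sample.genre == "Drama")
           | (sample.genre == "Action"))

# ... and the equivalent isin() call.
with_isin = sample.genre.isin(wanted)
print(with_or.equals(with_isin))

# ~ flips True and False, keeping the rows whose genre is NOT in the list.
others = sample[~with_isin]
print(others.title.tolist())
```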


There’s one other common use for multiple filter criteria. Instead of passing the Series of Booleans to the DataFrame to filter rows, we can use it to create a new column that combines multiple existing columns based on some criteria. We simply use the Series “map()” method to map the Booleans to numeric values and assign the result to a new column.

In [12]:
(movies.content_rating=="R") & (movies.duration>=200)

0      False
1      False
2       True
3      False
4      False
       ...  
974    False
975    False
976    False
977    False
978    False
Length: 979, dtype: bool

In [13]:
movies["long_adult_movie"]=((movies.content_rating=="R") & (movies.duration>=200)).map({True:1, False:0})
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,long_adult_movie
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",0
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",0
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",1
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",0
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",0
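
When the target values are simply 1 and 0, casting the Boolean Series with `astype(int)` is a possible alternative to “map()”. The sketch below compares the two on a hypothetical hand-built `sample` DataFrame (rows copied from the outputs above); “map()” remains the more general tool when the Booleans should map to other values, such as labels.

```python
import pandas as pd

# Hypothetical mini-dataset: a few rows copied from the movies output above.
sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather: Part II",
              "The Dark Knight", "Seven Samurai"],
    "content_rating": ["R", "R", "PG-13", "UNRATED"],
    "duration": [142, 200, 152, 207],
})

flag = (sample.content_rating == "R") & (sample.duration >= 200)

# Approach used above: map each Boolean to a number explicitly.
mapped = flag.map({True: 1, False: 0})

# Alternative: cast the Booleans directly (True becomes 1, False becomes 0).
cast = flag.astype(int)

sample["long_adult_movie"] = cast
print(mapped.tolist())
print(cast.tolist())
```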
