In [2]:
import numpy as np
np.set_printoptions(threshold=50)
path_data = '../../../assets/data/'
# pd is a common shorthand for pandas
import pandas as pd
topmovies = pd.read_csv(path_data + 'top_movies_2017.csv')

# Selecting and Sorting

<h1>Getting Started</h1>

In each section of this chapter, we will work with the Top Movies dataset we introduced earlier. We will pose a question, break the question down into high-level steps, and then translate each step into Python code using `pandas` DataFrames.

<h1>Answering a Question</h1>

Let's use `pandas` to answer the following question:

What were the five highest-grossing movies of 2016?

<h2>Breaking the Problem Down</h2>

We can decompose this question into the following simpler table manipulations:

1. Choose the set of relevant columns (i.e., 'Title', 'Year', and 'Gross').
2. Slice out the rows for the year 2016.
3. Sort the rows in descending order by Gross.
4. Select the first five rows from the sorted DataFrame.
   
Now, we can express these steps in pandas.

<h2>Selecting</h2>

Every task except task 3 involves selecting a subset of the data in the DataFrame. We therefore start with selection, which covers the first two tasks.

<h3> Slicing using `.loc` </h3>

The `.loc` attribute accesses rows and columns by their labels.
To select a subset of a DataFrame, we use the `.loc` slicing syntax: the first argument is the row label and the second is the column label.

**Note:** The index of a DataFrame is also called its row labels.

In [3]:
topmovies

Unnamed: 0,Title,Studio,Gross,Gross (Adjusted),Year
0,Gone with the Wind,MGM,198676459,1796176700,1939
1,Star Wars,Fox,460998007,1583483200,1977
2,The Sound of Music,Fox,158671368,1266072700,1965
3,E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
4,Titanic,Paramount,658672302,1204368000,1997
...,...,...,...,...,...
195,9 to 5,Fox,103290500,341357800,1980
196,Batman v Superman: Dawn of Justice,Warner Brothers,330360194,340137000,2016
197,The Firm,Paramount,158348367,340028200,1993
198,Suicide Squad,Warner Brothers,325100054,339411900,2016


In [12]:
topmovies.loc[1,'Title']

'Star Wars'

To slice out multiple rows or columns, we can use `:`. Note that `.loc` slicing includes both endpoints, unlike Python's built-in slicing, which excludes the end.

In [13]:
topmovies.loc[0:9, 'Title':'Studio']

Unnamed: 0,Title,Studio
0,Gone with the Wind,MGM
1,Star Wars,Fox
2,The Sound of Music,Fox
3,E.T.: The Extra-Terrestrial,Universal
4,Titanic,Paramount
5,The Ten Commandments,Paramount
6,Jaws,Universal
7,Doctor Zhivago,MGM
8,The Exorcist,Warner Brothers
9,Snow White and the Seven Dwarves,Disney
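
The inclusive behavior of `.loc` can be checked on a small table. The sketch below is illustrative only: `movies` is a hypothetical stand-in built from a few rows shown above, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Top Movies table, built from rows shown above.
movies = pd.DataFrame({
    'Title': ['Gone with the Wind', 'Star Wars', 'The Sound of Music',
              'E.T.: The Extra-Terrestrial', 'Titanic'],
    'Studio': ['MGM', 'Fox', 'Fox', 'Universal', 'Paramount'],
    'Year': [1939, 1977, 1965, 1982, 1997],
})

# Label-based slice: row label 3 and column 'Studio' are both included.
by_label = movies.loc[0:3, 'Title':'Studio']

# Plain Python slicing on a list stops before the end position.
titles = movies['Title'].tolist()
by_position = titles[0:3]

print(by_label.shape)    # (4, 2): four rows, two columns
print(len(by_position))  # 3
```

The same `0:3` written in a `.loc` slice and in a list slice therefore returns a different number of items.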


<h3> Selecting Column(s)</h3>

We use `.loc` slicing to choose a set of columns from a DataFrame.

We will often want a single column from a DataFrame:

In [16]:
topmovies.loc[:, 'Year']

0      1939
1      1977
2      1965
3      1982
4      1997
       ... 
195    1980
196    2016
197    1993
198    2016
199    1988
Name: Year, Length: 200, dtype: int64

Note that selecting a single column returns a pandas Series. A Series behaves like a one-dimensional NumPy array: arithmetic applies to every element at once. For example, we can compute the gross in millions from the 'Gross' column and assign the result to a new column, 'Gross in Millions'.

In [7]:
topmovies['Gross in Millions'] = topmovies.loc[:, 'Gross'] / 1000000
topmovies

Unnamed: 0,Title,Studio,Gross,Gross (Adjusted),Year,Gross in Millions
0,Gone with the Wind,MGM,198676459,1796176700,1939,198.676459
1,Star Wars,Fox,460998007,1583483200,1977,460.998007
2,The Sound of Music,Fox,158671368,1266072700,1965,158.671368
3,E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982,435.110554
4,Titanic,Paramount,658672302,1204368000,1997,658.672302
...,...,...,...,...,...,...
195,9 to 5,Fox,103290500,341357800,1980,103.290500
196,Batman v Superman: Dawn of Justice,Warner Brothers,330360194,340137000,2016,330.360194
197,The Firm,Paramount,158348367,340028200,1993,158.348367
198,Suicide Squad,Warner Brothers,325100054,339411900,2016,325.100054
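
As a minimal sketch of element-wise arithmetic on a Series, the example below divides three gross values taken from the table above. The variable names are illustrative.

```python
import pandas as pd

# Three 'Gross' values copied from the first rows of the table above.
gross = pd.Series([198676459, 460998007, 158671368], name='Gross')

# One expression divides every element; no explicit loop is needed.
gross_millions = gross / 1_000_000

# The result is a new Series with the same index and a float dtype.
print(gross_millions)
print(gross_millions.dtype)  # float64
```

Because the result keeps the original index, it lines up row by row when assigned back to the DataFrame as a new column.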


To convert a Series to a DataFrame, you can use the `to_frame()` method.

In [17]:
topmovies.loc[:, 'Year'].to_frame()

Unnamed: 0,Year
0,1939
1,1977
2,1965
3,1982
4,1997
...,...
195,1980
196,2016
197,1993
198,2016
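
To make the Series/DataFrame distinction concrete, here is a small sketch on a hypothetical three-row table. It also shows that selecting with a one-element list gives a one-column DataFrame directly.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Top Movies table.
movies = pd.DataFrame({
    'Title': ['Gone with the Wind', 'Star Wars', 'The Sound of Music'],
    'Year': [1939, 1977, 1965],
})

year_series = movies.loc[:, 'Year']     # a single label returns a Series
year_frame = year_series.to_frame()     # convert the Series to a DataFrame
year_frame_alt = movies.loc[:, ['Year']]  # a one-element list returns a DataFrame

print(type(year_series).__name__)  # Series
print(type(year_frame).__name__)   # DataFrame
print(year_frame.shape)            # (3, 1)
```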


To select specific columns, we can pass a list of column labels into the `.loc` slice:

In [20]:
topmovies.loc[:, ['Title','Year']]

Unnamed: 0,Title,Year
0,Gone with the Wind,1939
1,Star Wars,1977
2,The Sound of Music,1965
3,E.T.: The Extra-Terrestrial,1982
4,Titanic,1997
...,...,...
195,9 to 5,1980
196,Batman v Superman: Dawn of Justice,2016
197,The Firm,1993
198,Suicide Squad,2016


Selecting columns is common, so there's a shorthand: index the DataFrame directly with a column label, or with a list of labels.

In [23]:
topmovies['Title']

0                      Gone with the Wind
1                               Star Wars
2                      The Sound of Music
3             E.T.: The Extra-Terrestrial
4                                 Titanic
                      ...                
195                                9 to 5
196    Batman v Superman: Dawn of Justice
197                              The Firm
198                         Suicide Squad
199               Who Framed Roger Rabbit
Name: Title, Length: 200, dtype: object

In [30]:
topmovies = topmovies[['Title','Year','Gross']]
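
The following sketch checks, on a hypothetical three-row table, that the bracket shorthand and the `.loc` form select the same data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Top Movies table.
movies = pd.DataFrame({
    'Title': ['Gone with the Wind', 'Star Wars', 'The Sound of Music'],
    'Year': [1939, 1977, 1965],
    'Gross': [198676459, 460998007, 158671368],
})

# Single column: both forms return the same Series.
one_col_short = movies['Title']
one_col_loc = movies.loc[:, 'Title']

# Several columns: both forms return the same DataFrame.
two_cols_short = movies[['Title', 'Year']]
two_cols_loc = movies.loc[:, ['Title', 'Year']]

print(one_col_short.equals(one_col_loc))    # True
print(two_cols_short.equals(two_cols_loc))  # True
```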

<h3> Selecting Rows </h3>

Often you will want to select rows: either specific rows whose labels you already know, or rows whose entries share a particular feature. For example, you might want the movies released in 2016, or the first five movies.

To select specific rows whose index labels are known beforehand, we can use `.loc` slicing.

In [11]:
topmovies.loc[1:5,:]

Unnamed: 0,Title,Studio,Gross,Gross (Adjusted),Year,Gross in Millions
1,Star Wars,Fox,460998007,1583483200,1977,460.998007
2,The Sound of Music,Fox,158671368,1266072700,1965,158.671368
3,E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982,435.110554
4,Titanic,Paramount,658672302,1204368000,1997,658.672302
5,The Ten Commandments,Paramount,65500000,1164590000,1956,65.5


<h3>Filtering Rows</h3>
To select rows that satisfy a condition, we can slice the DataFrame with that condition. To slice out the rows for the year 2016, we first create a Series containing True for each row we want to keep and False for each row we want to drop. This is simple because arithmetic and comparison operators on a Series are applied to each element.

In [31]:
topmovies['Year']

0      1939
1      1977
2      1965
3      1982
4      1997
       ... 
195    1980
196    2016
197    1993
198    2016
199    1988
Name: Year, Length: 200, dtype: int64

In [32]:
topmovies['Year'] == 2016

0      False
1      False
2      False
3      False
4      False
       ...  
195    False
196     True
197    False
198     True
199    False
Name: Year, Length: 200, dtype: bool

Once we have this Series of True and False, we can pass it into `.loc`.

In [33]:
topmovies2016 = topmovies.loc[topmovies['Year'] == 2016]
topmovies2016

Unnamed: 0,Title,Year,Gross
56,Rogue One: A Star Wars Story,2016,532177324
72,Finding Dory,2016,486295561
119,Captain America: Civil War,2016,408084349
146,The Secret Life of Pets,2016,368384330
157,Deadpool,2016,363070709
164,The Jungle Book (2016),2016,364001123
184,Zootopia,2016,341268248
196,Batman v Superman: Dawn of Justice,2016,330360194
198,Suicide Squad,2016,325100054
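
The filtering step can be tried in isolation. This sketch uses four rows from the end of the table (with their original index labels) as a hypothetical stand-in; the variable names are illustrative.

```python
import pandas as pd

# Four rows from the end of the table above, keeping their index labels.
movies = pd.DataFrame({
    'Title': ['9 to 5', 'Batman v Superman: Dawn of Justice',
              'The Firm', 'Suicide Squad'],
    'Year': [1980, 2016, 1993, 2016],
}, index=[195, 196, 197, 198])

# The comparison is applied to each element and yields a boolean Series.
is_2016 = movies['Year'] == 2016

# True counts as 1, so summing the mask counts the matching rows.
match_count = int(is_2016.sum())

# Passing the mask to .loc keeps only the rows marked True.
kept = movies.loc[is_2016]

print(is_2016.tolist())     # [False, True, False, True]
print(match_count)          # 2
print(kept.index.tolist())  # [196, 198]
```

Note that the filtered rows keep their original index labels rather than being renumbered from zero.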


<h2>Sorting Rows</h2>

The next step is to sort the rows in descending order by 'Gross'. We can use the `sort_values()` method with `ascending=False`.

In [34]:
sorted_2016 = topmovies2016.sort_values('Gross', ascending=False)
sorted_2016

Unnamed: 0,Title,Year,Gross
56,Rogue One: A Star Wars Story,2016,532177324
72,Finding Dory,2016,486295561
119,Captain America: Civil War,2016,408084349
146,The Secret Life of Pets,2016,368384330
164,The Jungle Book (2016),2016,364001123
157,Deadpool,2016,363070709
184,Zootopia,2016,341268248
196,Batman v Superman: Dawn of Justice,2016,330360194
198,Suicide Squad,2016,325100054
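
Two details of `sort_values()` are worth checking on a small example: it returns a new DataFrame rather than modifying the original, and each row keeps its index label. The sketch below uses three of the 2016 rows shown above.

```python
import pandas as pd

# Three of the 2016 rows from the output above, with their index labels.
movies_2016 = pd.DataFrame({
    'Title': ['Deadpool', 'The Jungle Book (2016)', 'Zootopia'],
    'Gross': [363070709, 364001123, 341268248],
}, index=[157, 164, 184])

# Sort from highest to lowest gross; the result is a new DataFrame.
by_gross = movies_2016.sort_values('Gross', ascending=False)

print(by_gross.index.tolist())     # [164, 157, 184]
print(movies_2016.index.tolist())  # [157, 164, 184], original order unchanged
```

This is why rows 164 and 157 appear swapped in the sorted output above: the labels travel with the rows.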


<h2>Selecting Top 5 Rows</h2>

Now we can use `.iloc` slicing to select the first five rows, which are the five highest-grossing movies of 2016.

In [35]:
sorted_2016.iloc[0:5, :]

Unnamed: 0,Title,Year,Gross
56,Rogue One: A Star Wars Story,2016,532177324
72,Finding Dory,2016,486295561
119,Captain America: Civil War,2016,408084349
146,The Secret Life of Pets,2016,368384330
164,The Jungle Book (2016),2016,364001123


Note that both `.loc` and `.iloc` are used for slicing. While `.loc` takes row and column labels, `.iloc` takes integer positions to identify rows and columns. Another difference is that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.
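
The difference matters most after sorting or filtering, when index labels no longer match row positions. The following sketch, using the top three 2016 rows as a hypothetical stand-in, contrasts the two.

```python
import pandas as pd

# Top three 2016 rows from the sorted output above, with their index labels.
sorted_movies = pd.DataFrame({
    'Title': ['Rogue One: A Star Wars Story', 'Finding Dory',
              'Captain America: Civil War'],
    'Gross': [532177324, 486295561, 408084349],
}, index=[56, 72, 119])

# Position-based: positions 0 and 1 (the end position 2 is excluded).
first_two = sorted_movies.iloc[0:2, :]

# Label-based: labels 56 through 119 (the end label is included).
through_119 = sorted_movies.loc[56:119, :]

# The same cell addressed by position and by label.
by_position = sorted_movies.iloc[0, 0]
by_label = sorted_movies.loc[56, 'Title']

print(len(first_two))           # 2
print(len(through_119))         # 3
print(by_position == by_label)  # True
```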

<h1>Conclusion</h1>
In this section, we answered a data science question about tabular data using `pandas` slicing and sorting. We first broke the question down into smaller, manageable subproblems, then solved each one in Python to answer the main question.
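
As a recap, the four steps can be collected into one function. This is a sketch under stated assumptions: `top_grossing` is a hypothetical helper name, and `sample` is a six-row stand-in built from rows shown earlier, so it contains only two 2016 movies.

```python
import pandas as pd

def top_grossing(movies, year, n=5):
    """Return up to n highest-grossing movies for a year (hypothetical helper)."""
    relevant = movies[['Title', 'Year', 'Gross']]            # step 1: choose columns
    in_year = relevant.loc[relevant['Year'] == year]         # step 2: filter rows
    ranked = in_year.sort_values('Gross', ascending=False)   # step 3: sort
    return ranked.iloc[0:n, :]                               # step 4: first n rows

# Six-row stand-in built from rows shown earlier in this section.
sample = pd.DataFrame({
    'Title': ['Star Wars', 'Titanic', '9 to 5',
              'Batman v Superman: Dawn of Justice', 'The Firm', 'Suicide Squad'],
    'Studio': ['Fox', 'Paramount', 'Fox',
               'Warner Brothers', 'Paramount', 'Warner Brothers'],
    'Gross': [460998007, 658672302, 103290500, 330360194, 158348367, 325100054],
    'Year': [1977, 1997, 1980, 2016, 1993, 2016],
})

result = top_grossing(sample, 2016)
print(result)
```

Because `.iloc` slicing tolerates an end position past the last row, the function returns both 2016 rows here even though `n` is 5.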