In [1]:
import pathlib

import pandas as pd


PATH_DATA = pathlib.Path.cwd().parent.parent.parent.parent / 'data'

pd.set_option('display.max_rows', 10)
pd.set_option('display.show_dimensions', False)

In [2]:
nba = pd.read_csv(PATH_DATA / 'nba_salaries.csv')
nba.rename(columns={"'15-'16 SALARY": 'SALARY'}, inplace=True)

# Selecting Rows

Often, we would like to extract just those rows that correspond to entries with a particular feature. For example, we might want only the rows corresponding to the Warriors, or to players who earned more than \$10 million. Or we might just want the top five earners.

### Specified Rows

The DataFrame property `iloc` selects rows by their integer position. It accepts, among other forms, a single integer or a list (or array) of integers: a single integer returns that row as a Series, while a list returns a new DataFrame consisting of only those rows.

For example, if we wanted just the first row of `nba`, we could use `iloc` as follows.

In [3]:
nba

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659
1,Al Horford,C,Atlanta Hawks,12.000000
2,Tiago Splitter,C,Atlanta Hawks,9.756250
3,Jeff Teague,PG,Atlanta Hawks,8.000000
4,Kyle Korver,SG,Atlanta Hawks,5.746479
...,...,...,...,...
412,Gary Neal,PG,Washington Wizards,2.139000
413,DeJuan Blair,C,Washington Wizards,2.000000
414,Kelly Oubre Jr.,SF,Washington Wizards,1.920240
415,Garrett Temple,SG,Washington Wizards,1.100602


In [4]:
nba.iloc[[0]]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659


This is a new DataFrame with just the single row that we specified.

Note that, by default, the index starts with the integer `0`.
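To make the difference between the two return types concrete, here is a minimal sketch. It assumes a small stand-in frame, `hawks`, built by hand from three rows of the table above rather than read from the CSV.

```python
import pandas as pd

# Three rows copied from the table above, for illustration only.
hawks = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Al Horford', 'Tiago Splitter'],
    'POSITION': ['PF', 'C', 'C'],
    'TEAM': ['Atlanta Hawks'] * 3,
    'SALARY': [18.671659, 12.0, 9.75625],
})

first_as_series = hawks.iloc[0]    # a single integer returns a Series
first_as_frame = hawks.iloc[[0]]   # a one-element list returns a DataFrame

print(type(first_as_series).__name__)
print(type(first_as_frame).__name__)
print(first_as_series['PLAYER'])
```

The Series form is convenient when we want individual values from the row; the DataFrame form keeps the tabular layout.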

We could also get the fourth, fifth, and sixth rows by specifying a range of indices as the argument – starting at index `3` and ending (before) index `6`.

In [5]:
list(range(3, 6))

[3, 4, 5]

In [6]:
nba.iloc[range(3, 6)]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
3,Jeff Teague,PG,Atlanta Hawks,8.0
4,Kyle Korver,SG,Atlanta Hawks,5.746479
5,Thabo Sefolosha,SF,Atlanta Hawks,4.0


More typically, we might specify such a range using Python's `slice` syntax.

In [7]:
nba.iloc[3:6]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
3,Jeff Teague,PG,Atlanta Hawks,8.0
4,Kyle Korver,SG,Atlanta Hawks,5.746479
5,Thabo Sefolosha,SF,Atlanta Hawks,4.0


If we want a DataFrame of the top 5 highest paid players, we can first sort the data by salary and then select the first five rows:

In [8]:
nba.sort_values(['SALARY'], ascending=False).iloc[:5]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
169,Kobe Bryant,SF,Los Angeles Lakers,25.0
29,Joe Johnson,SF,Brooklyn Nets,24.894863
72,LeBron James,SF,Cleveland Cavaliers,22.9705
255,Carmelo Anthony,SF,New York Knicks,22.875
131,Dwight Howard,C,Houston Rockets,22.359364
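As a possible shortcut, pandas also provides `DataFrame.nlargest`, which combines the sort and the selection. The sketch below assumes a hand-built frame, `sample`, holding five rows copied from this section with their original index labels; it is not the full data set.

```python
import pandas as pd

# Five rows copied from tables in this section; the index labels match the originals.
sample = pd.DataFrame(
    {
        'PLAYER': ['Paul Millsap', 'Joe Johnson', 'LeBron James',
                   'Stephen Curry', 'Kobe Bryant'],
        'SALARY': [18.671659, 24.894863, 22.9705, 11.370786, 25.0],
    },
    index=[0, 29, 72, 121, 169],
)

# Sort, then take the first three positions.
top3_sorted = sample.sort_values('SALARY', ascending=False).iloc[:3]

# The same selection in one call.
top3_nlargest = sample.nlargest(3, 'SALARY')

print(top3_nlargest)

# iloc counts positions in the sorted frame; the original labels travel with the rows.
print(top3_sorted.index.tolist())
```

Note that `iloc[:3]` refers to the first three positions after sorting, not to the rows labeled `0`, `1`, and `2`.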


### Rows Corresponding to a Specified Feature

More often, we will want to access data in a set of rows that have a certain feature, but whose indices we don't know ahead of time. For example, we might want data on all the players who made more than \$10 million, but we don't want to spend time counting rows in the sorted DataFrame.

The DataFrame property `loc` does the job for us. Given a condition, it returns a DataFrame with the same columns as the original but only the rows where the feature occurs.

We specify the condition to `loc` as a Boolean sequence with one entry per row, where `True` marks the rows satisfying our condition, such as:

    [True, False, True]

…and which we may generate from a simple conditional expression in Python, applied to the Pandas `Series` of data underlying the feature itself.
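To see that a literal Boolean sequence and a generated one behave the same way, consider this small sketch. The three-row frame `hawks` is an assumption made for illustration, copied by hand from the first rows of `nba`.

```python
import pandas as pd

# Three rows copied from the top of the nba table, for illustration only.
hawks = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Al Horford', 'Tiago Splitter'],
    'SALARY': [18.671659, 12.0, 9.75625],
})

# A hand-written Boolean list, one entry per row.
by_list = hawks.loc[[True, True, False]]

# The same sequence, generated by comparing a column to a value.
mask = hawks['SALARY'] > 10
by_mask = hawks.loc[mask]

print(mask.tolist())
print(by_mask)
```

In practice we rarely type the list by hand; the comparison produces it for us.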

In the first example, we extract the data for all those who earned more than \$10 million.

In [9]:
nba['SALARY']

0      18.671659
1      12.000000
2       9.756250
3       8.000000
4       5.746479
         ...    
412     2.139000
413     2.000000
414     1.920240
415     1.100602
416     0.561716
Name: SALARY, dtype: float64

In [10]:
nba['SALARY'] > 10

0       True
1       True
2      False
3      False
4      False
       ...  
412    False
413    False
414    False
415    False
416    False
Name: SALARY, dtype: bool

In [11]:
nba.loc[nba['SALARY'] > 10]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659
1,Al Horford,C,Atlanta Hawks,12.000000
29,Joe Johnson,SF,Brooklyn Nets,24.894863
30,Thaddeus Young,PF,Brooklyn Nets,11.235955
42,Al Jefferson,C,Charlotte Hornets,13.500000
...,...,...,...,...
368,DeMar DeRozan,SG,Toronto Raptors,10.050000
383,Gordon Hayward,SF,Utah Jazz,15.409570
400,John Wall,PG,Washington Wizards,15.851950
401,Nene Hilario,C,Washington Wizards,13.000000


In [12]:
nba.loc[nba['SALARY'] > 10].shape

(69, 4)

There are 69 rows in the new DataFrame, corresponding to the 69 players who made more than \$10 million. Arranging these rows in order makes the data easier to analyze. DeMar DeRozan of the Toronto Raptors was the "poorest" of this group, at a salary of just over \$10 million.

In [13]:
nba.loc[nba['SALARY'] > 10].sort_values(['SALARY'])

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
368,DeMar DeRozan,SG,Toronto Raptors,10.050000
298,Gerald Wallace,SF,Philadelphia 76ers,10.105855
204,Luol Deng,SF,Miami Heat,10.151612
144,Monta Ellis,SG,Indiana Pacers,10.300000
95,Wilson Chandler,SF,Denver Nuggets,10.449438
...,...,...,...,...
131,Dwight Howard,C,Houston Rockets,22.359364
255,Carmelo Anthony,SF,New York Knicks,22.875000
72,LeBron James,SF,Cleveland Cavaliers,22.970500
29,Joe Johnson,SF,Brooklyn Nets,24.894863


How much did Stephen Curry make? For the answer, we have to access the row where the value of `PLAYER` is equal to `Stephen Curry`. The result is a DataFrame consisting of just one row:

In [14]:
nba.loc[nba['PLAYER'] == 'Stephen Curry']

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
121,Stephen Curry,PG,Golden State Warriors,11.370786
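If we want the salary itself rather than a one-row DataFrame, one option is to pass `loc` a column label alongside the mask and then take the first element. This is a sketch on an assumed two-row frame, `warriors`, copied by hand from the data; the variable names are illustrative.

```python
import pandas as pd

# Two rows copied from the Warriors' entries, keeping their original labels.
warriors = pd.DataFrame(
    {
        'PLAYER': ['Andre Iguodala', 'Stephen Curry'],
        'SALARY': [11.710456, 11.370786],
    },
    index=[120, 121],
)

is_curry = warriors['PLAYER'] == 'Stephen Curry'

# Row mask plus column label gives a Series; iloc[0] extracts its only value.
curry_salary = warriors.loc[is_curry, 'SALARY'].iloc[0]

print(curry_salary)
```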


Curry made just under \$11.4 million. That's a lot of money, but it's less than half the salary of LeBron James. You'll find that salary in the "Top 5" table earlier in this section, or you could find it by replacing `'Stephen Curry'` with `'LeBron James'` in the line of code above.

In that code, rather than the "greater than" operator, `>`, we used the "equality" operator, `==`. The same operator lets us construct a DataFrame of all the Warriors:

In [15]:
nba.loc[nba['TEAM'] == 'Golden State Warriors']

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
117,Klay Thompson,SG,Golden State Warriors,15.501000
118,Draymond Green,PF,Golden State Warriors,14.260870
119,Andrew Bogut,C,Golden State Warriors,13.800000
120,Andre Iguodala,SF,Golden State Warriors,11.710456
121,Stephen Curry,PG,Golden State Warriors,11.370786
...,...,...,...,...
126,Leandro Barbosa,SG,Golden State Warriors,2.500000
127,Festus Ezeli,C,Golden State Warriors,2.008748
128,Brandon Rush,SF,Golden State Warriors,1.270964
129,Kevon Looney,SF,Golden State Warriors,1.131960


This portion of the data is already sorted by salary because, within each team, the original data lists players in descending order of salary.

### Multiple Features ###

You can access rows that have multiple specified features by using `loc` repeatedly. For example, here is a way to extract all the Point Guards whose salaries were over \$15 million.

In [16]:
nba.loc[nba['POSITION'] == 'PG'].loc[nba['SALARY'] > 15]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
60,Derrick Rose,PG,Chicago Bulls,20.093064
74,Kyrie Irving,PG,Cleveland Cavaliers,16.407501
156,Chris Paul,PG,Los Angeles Clippers,21.468695
269,Russell Westbrook,PG,Oklahoma City Thunder,16.744218
400,John Wall,PG,Washington Wizards,15.85195
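In the chained call above, the second mask is computed from the full `nba` but applied to the already-filtered frame; pandas matches the mask to the rows by index label. A sketch of an equivalent approach, which builds each mask from the frame it filters, follows. The four-row frame `sample` is an assumption, copied by hand from rows shown in this section.

```python
import pandas as pd

# Four rows copied from this section, keeping their original index labels.
sample = pd.DataFrame(
    {
        'PLAYER': ['Derrick Rose', 'LeBron James', 'Kyrie Irving', 'Stephen Curry'],
        'POSITION': ['PG', 'SF', 'PG', 'PG'],
        'SALARY': [20.093064, 22.9705, 16.407501, 11.370786],
    },
    index=[60, 72, 74, 121],
)

# Step 1: keep only the point guards.
pgs = sample.loc[sample['POSITION'] == 'PG']

# Step 2: build the salary mask from the filtered frame itself.
rich_pgs = pgs.loc[pgs['SALARY'] > 15]

# The chained form used above, with the mask taken from the full frame.
chained = sample.loc[sample['POSITION'] == 'PG'].loc[sample['SALARY'] > 15]

print(rich_pgs)
```

Both forms select the same rows here; the two-step form simply makes it explicit which frame each condition refers to.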


Alternatively, you can combine conditions using Python's "bitwise" operators, `&` and `|`, and pass the resulting specification to a single invocation of `loc`.

In [17]:
nba.loc[(nba['POSITION'] == 'PG') & (nba['SALARY'] > 15)]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
60,Derrick Rose,PG,Chicago Bulls,20.093064
74,Kyrie Irving,PG,Cleveland Cavaliers,16.407501
156,Chris Paul,PG,Los Angeles Clippers,21.468695
269,Russell Westbrook,PG,Oklahoma City Thunder,16.744218
400,John Wall,PG,Washington Wizards,15.85195


Though the above invocation of `loc` is, in a sense, more straightforward, note that parentheses are required around each comparison. The `&` operator binds more tightly than `==` and `>`, so without them Python would evaluate `'PG' & nba['SALARY']` first and misinterpret the expression.
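The following sketch shows what happens when the parentheses are left out. It assumes a small hand-built frame, `sample`, copied from rows in this section; the exact exception type may vary between pandas versions, so the sketch catches `Exception` broadly.

```python
import pandas as pd

# Four rows copied from this section, for illustration only.
sample = pd.DataFrame(
    {
        'PLAYER': ['Derrick Rose', 'LeBron James', 'Kyrie Irving', 'Stephen Curry'],
        'POSITION': ['PG', 'SF', 'PG', 'PG'],
        'SALARY': [20.093064, 22.9705, 16.407501, 11.370786],
    },
    index=[60, 72, 74, 121],
)

# Without parentheses, Python groups this as
#   sample['POSITION'] == ('PG' & sample['SALARY']) > 15
# which is not the condition we meant, and it fails.
try:
    sample['POSITION'] == 'PG' & sample['SALARY'] > 15
    raised = False
except Exception as err:  # exact type may depend on the pandas version
    raised = True
    print(type(err).__name__)

# With parentheses, each comparison is evaluated before the masks are combined.
good = sample.loc[(sample['POSITION'] == 'PG') & (sample['SALARY'] > 15)]
print(good)
```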

At the cost of some verbosity, we can make this clearer – and preserve our conditions for reuse – by assigning them names.

In [18]:
nba_position_equals_pg = nba['POSITION'] == 'PG'

nba_salary_more_than_15 = nba['SALARY'] > 15

nba.loc[nba_position_equals_pg & nba_salary_more_than_15]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
60,Derrick Rose,PG,Chicago Bulls,20.093064
74,Kyrie Irving,PG,Cleveland Cavaliers,16.407501
156,Chris Paul,PG,Los Angeles Clippers,21.468695
269,Russell Westbrook,PG,Oklahoma City Thunder,16.744218
400,John Wall,PG,Washington Wizards,15.85195


### General Form ###

By now you will have realized that the general way to create a new DataFrame by selecting rows with a given feature is to use `loc` with the appropriate Boolean sequence – or "mask" – for our condition(s):

    DATA_FRAME.loc[MASK]
    
Where a "mask" is generated by performing a logical operation on a column of our data:

    DATA_FRAME[COLUMN_LABEL] LOGICAL_OPERATOR CONDITION_VALUE
    
And these masks may be composed:

    MASK0 BITWISE_OPERATOR MASK1

In [19]:
nba.loc[(nba['SALARY'] >= 10) & (nba['SALARY'] < 10.3)]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
204,Luol Deng,SF,Miami Heat,10.151612
298,Gerald Wallace,SF,Philadelphia 76ers,10.105855
356,Danny Green,SG,San Antonio Spurs,10.0
368,DeMar DeRozan,SG,Toronto Raptors,10.05


Notice that the DataFrame above includes Danny Green who made \$10 million, but *not* Monta Ellis who made \$10.3 million. This is because we used the "greater than or equal to" operator, `>=`, with the condition value `10`, and the "strictly less than" operator, `<`, with the condition value `10.3`.

If we specify a condition that isn't satisfied by any row, we get a DataFrame with column labels but no rows.

In [20]:
nba.loc[nba['PLAYER'] == 'Barack Obama']

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
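When a selection may come back empty, it can be useful to check for that before using the result. A minimal sketch, assuming a hand-built two-row frame `sample` copied from the data:

```python
import pandas as pd

# Two rows copied from this section, for illustration only.
sample = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Stephen Curry'],
    'SALARY': [18.671659, 11.370786],
})

nobody = sample.loc[sample['PLAYER'] == 'Barack Obama']

print(nobody.empty)             # True when no row satisfied the condition
print(len(nobody))              # number of rows selected
print(nobody.columns.tolist())  # the column labels are still present
```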


### Some More Conditions ###

Here are the logical operators in Python (the language reference calls them comparison operators), as they apply to conditional selection of data.

| **Logical operator** | Name                     | Description                                                         |
| -------------------- | ------------------------ | ------------------------------------------------------------------- |
| `==`                 | Equal to                 | True if data cell value is equal to condition value                 |
| `!=`                 | Not equal to             | True if data cell value is not equal to condition value             |
| `>`                  | Greater than             | True if data cell value is greater than condition value             |
| `>=`                 | Greater than or equal to | True if data cell value is greater than or equal to condition value |
| `<`                  | Less than                | True if data cell value is less than condition value                |
| `<=`                 | Less than or equal to    | True if data cell value is less than or equal to condition value    |

And here are the bitwise operators with which you may combine conditions.

| **Bitwise operator** | Name           | Description                                |
| -------------------- | -------------- | ------------------------------------------ |
| `&`                  | Bitwise AND    | True if BOTH conditions are True           |
| `\|`                 | Bitwise OR     | True if EITHER condition, or both, is True |
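The examples so far combine conditions with `&` only, so here is a brief sketch of `|`. It also uses `~`, which is not listed in the table above: applied to a Boolean mask, it inverts each entry. The frame `sample` is an assumption, built by hand from five rows in this section.

```python
import pandas as pd

# Five rows copied from this section, keeping their original index labels.
sample = pd.DataFrame(
    {
        'PLAYER': ['Kyle Korver', 'Derrick Rose', 'LeBron James',
                   'Kyrie Irving', 'Stephen Curry'],
        'POSITION': ['SG', 'PG', 'SF', 'PG', 'PG'],
        'SALARY': [5.746479, 20.093064, 22.9705, 16.407501, 11.370786],
    },
    index=[4, 60, 72, 74, 121],
)

is_pg = sample['POSITION'] == 'PG'
over_20 = sample['SALARY'] > 20

# OR: point guards, plus anyone earning more than 20 (million dollars).
either = sample.loc[is_pg | over_20]

# NOT: everyone who is not a point guard.
not_pg = sample.loc[~is_pg]

print(either)
print(not_pg)
```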

We end the section with a series of examples. 

Another common condition is that a string value *contains* another string. Pandas supports this type of comparison with its own method, `contains`, provided through the `str` accessor of a Series.

This can help save some typing. For example, you can just specify that the team name contains the word `Warriors` instead of that it is exactly equal to the phrase `Golden State Warriors`.

In [21]:
nba.loc[nba['TEAM'].str.contains('Warriors')]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
117,Klay Thompson,SG,Golden State Warriors,15.501000
118,Draymond Green,PF,Golden State Warriors,14.260870
119,Andrew Bogut,C,Golden State Warriors,13.800000
120,Andre Iguodala,SF,Golden State Warriors,11.710456
121,Stephen Curry,PG,Golden State Warriors,11.370786
...,...,...,...,...
126,Leandro Barbosa,SG,Golden State Warriors,2.500000
127,Festus Ezeli,C,Golden State Warriors,2.008748
128,Brandon Rush,SF,Golden State Warriors,1.270964
129,Kevon Looney,SF,Golden State Warriors,1.131960
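By default, `str.contains` is case-sensitive and treats its argument as a regular expression. The sketch below illustrates the `case` and `regex` parameters on an assumed three-row frame, `teams`, whose team names are taken from the data.

```python
import pandas as pd

# Three team names taken from the data, for illustration only.
teams = pd.DataFrame({
    'TEAM': ['Golden State Warriors', 'Atlanta Hawks', 'Washington Wizards'],
})

# Default behavior: case-sensitive match.
exact_case = teams['TEAM'].str.contains('Warriors')

# case=False ignores capitalization.
any_case = teams['TEAM'].str.contains('warriors', case=False)

# As a regular expression, 'W.' means "W followed by any character".
as_regex = teams['TEAM'].str.contains('W.')

# regex=False looks for the literal two characters 'W.' instead.
as_literal = teams['TEAM'].str.contains('W.', regex=False)

print(exact_case.tolist())
print(any_case.tolist())
print(as_regex.tolist())
print(as_literal.tolist())
```

Passing `regex=False` is a reasonable habit whenever the search text contains punctuation that we mean literally.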


Or you can extract data for all the guards, both Point Guards and Shooting Guards:

In [22]:
nba.loc[nba['POSITION'].str.contains('G')]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
3,Jeff Teague,PG,Atlanta Hawks,8.000000
4,Kyle Korver,SG,Atlanta Hawks,5.746479
8,Dennis Schroder,PG,Atlanta Hawks,1.763400
9,Tim Hardaway Jr.,SG,Atlanta Hawks,1.304520
11,Jason Richardson,SG,Atlanta Hawks,0.947276
...,...,...,...,...
409,Alan Anderson,SG,Washington Wizards,4.000000
411,Ramon Sessions,PG,Washington Wizards,2.170465
412,Gary Neal,PG,Washington Wizards,2.139000
415,Garrett Temple,SG,Washington Wizards,1.100602
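Matching on the letter `G` works here because only the two guard positions contain it. A more explicit alternative, sketched below, is the Series method `isin`, which tests membership in a list of values. The frame `sample` is an assumption, built by hand from four Hawks rows shown earlier.

```python
import pandas as pd

# Four rows copied from the Hawks' entries, keeping their original index labels.
sample = pd.DataFrame(
    {
        'PLAYER': ['Paul Millsap', 'Al Horford', 'Jeff Teague', 'Kyle Korver'],
        'POSITION': ['PF', 'C', 'PG', 'SG'],
    },
    index=[0, 1, 3, 4],
)

# Substring match, as in the cell above.
guards_by_substring = sample.loc[sample['POSITION'].str.contains('G')]

# Explicit membership test against the two guard positions.
guards_by_isin = sample.loc[sample['POSITION'].isin(['PG', 'SG'])]

print(guards_by_isin)
```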


Or you can get all the players who were not Cleveland Cavaliers and had a salary of no less than \$20 million:

In [23]:
nba.loc[(nba['TEAM'] != 'Cleveland Cavaliers') & (nba['SALARY'] >= 20)]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
29,Joe Johnson,SF,Brooklyn Nets,24.894863
60,Derrick Rose,PG,Chicago Bulls,20.093064
131,Dwight Howard,C,Houston Rockets,22.359364
156,Chris Paul,PG,Los Angeles Clippers,21.468695
169,Kobe Bryant,SF,Los Angeles Lakers,25.0
201,Chris Bosh,PF,Miami Heat,22.19273
202,Dwyane Wade,SG,Miami Heat,20.0
255,Carmelo Anthony,SF,New York Knicks,22.875
268,Kevin Durant,SF,Oklahoma City Thunder,20.158622


As you can see, the use of `loc` with conditional masks gives you great flexibility in accessing rows with features that interest you. Don't hesitate to experiment!