In [1]:
import pathlib

import pandas as pd


pd.set_option('display.max_rows', 10)
pd.set_option('display.show_dimensions', False)


PATH_DATA = pathlib.Path.cwd().parent.parent.parent.parent / 'data'

In [2]:
cones = pd.read_csv(PATH_DATA / 'cones.csv')

nba = pd.read_csv(PATH_DATA / 'nba_salaries.csv').rename(columns={"'15-'16 SALARY": "SALARY"})

# Introduction to DataFrames

We can now apply Python to analyze data. We will work with data stored in the pandas library's `DataFrame` structure.

DataFrames are a fundamental way of representing data sets. A DataFrame can be viewed in two ways:
* a sequence of named columns that each describe a single attribute of all entries in a data set, or
* a sequence of rows that each contain all information about a single individual in a data set.

We will study DataFrames in great detail in the next several chapters. For now, we will just introduce a few methods without going into technical details. 

The DataFrame `cones` has been imported for us; later we will see how, but here we will just work with it. First, let's take a look at it.

In [3]:
cones

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


The DataFrame has six rows. Each row corresponds to one ice cream cone. The ice cream cones are the *individuals*.

Each cone has three attributes: flavor, color, and price. Each column contains the data on one of these attributes, and so all the entries of any single column are of the same kind. Each column has a label. We will refer to columns by their labels.

A DataFrame method is just like a function, but it must operate on a DataFrame. So in general the call looks like:

    name_of_dataframe.method(arguments)

On the other hand, a DataFrame property is accessed without parentheses and returns an object that you can operate on further, for example with Python's *slice* notation, `[start:stop:step]`:

    name_of_dataframe.property[start:stop:step]

For example, if you want to see just the first two rows of data, you can use the DataFrame property `iloc`, which looks up rows by *integer location*, that is, by their position in the DataFrame.

In [4]:
cones.iloc[:2]

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75


You can replace 2 by any number of rows. If you ask for more than six, you will only get six, because `cones` only has six rows.
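As a minimal sketch of how slicing with `iloc` behaves, the cell below rebuilds the six cones by hand in a DataFrame called `cones_demo` (a stand-in name used only for this illustration, not part of the original notebook) and tries a few slices, including one that asks for more rows than exist.

```python
import pandas as pd

# Stand-in for `cones`, typed in by hand from the table shown above.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Color': ['pink', 'light brown', 'dark brown',
              'pink', 'dark brown', 'pink'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

first_two = cones_demo.iloc[:2]      # rows at positions 0 and 1
too_many = cones_demo.iloc[:100]     # stop is past the end: all six rows, no error
every_other = cones_demo.iloc[::2]   # a step of 2 keeps positions 0, 2 and 4

print(len(first_two), len(too_many), list(every_other.index))
```

Asking for too many rows is not an error; the slice simply stops at the last row.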

### Choosing Sets of Columns ###
The property `loc` provides access to groups of rows and columns by *label*.

In [5]:
cones.loc[:, ['Flavor']]

Unnamed: 0,Flavor
0,strawberry
1,chocolate
2,chocolate
3,strawberry
4,chocolate
5,bubblegum


Selecting columns this way leaves the original DataFrame unchanged.

In [6]:
cones

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
5,bubblegum,pink,4.75


You can select more than one column by listing the column labels, separated by commas, inside the square brackets.

In [7]:
cones.loc[:, ['Flavor', 'Price']]

Unnamed: 0,Flavor,Price
0,strawberry,3.55
1,chocolate,4.75
2,chocolate,5.25
3,strawberry,5.25
4,chocolate,5.25
5,bubblegum,4.75


You can also *drop* columns you don't want. The table above can be created by dropping the `Color` column.

In [8]:
cones.drop(columns=['Color'])

Unnamed: 0,Flavor,Price
0,strawberry,3.55
1,chocolate,4.75
2,chocolate,5.25
3,strawberry,5.25
4,chocolate,5.25
5,bubblegum,4.75


You can name this new table and look at it again by just typing its name.

In [9]:
no_colors = cones.drop(columns=['Color'])

no_colors

Unnamed: 0,Flavor,Price
0,strawberry,3.55
1,chocolate,4.75
2,chocolate,5.25
3,strawberry,5.25
4,chocolate,5.25
5,bubblegum,4.75


As with the `loc` property, the `drop` method creates a smaller DataFrame and leaves the original DataFrame unchanged. In order to explore your data, you can create any number of smaller DataFrames by choosing or dropping columns. It will do no harm to your original DataFrame.
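To check this claim directly, here is a small sketch that selects columns with `loc`, drops a column with `drop`, and then confirms that the original still has all of its columns. The name `cones_demo` is a hand-built stand-in for `cones`, introduced only for this example.

```python
import pandas as pd

# Stand-in for `cones`, typed in by hand from the table shown earlier.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Color': ['pink', 'light brown', 'dark brown',
              'pink', 'dark brown', 'pink'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

selected = cones_demo.loc[:, ['Flavor', 'Price']]   # keep two columns
dropped = cones_demo.drop(columns=['Color'])        # remove the third column

print(selected.equals(dropped))     # do both routes give the same table?
print(list(cones_demo.columns))     # columns of the untouched original
```

Both routes build a new two-column DataFrame, so either can be used depending on whether it is shorter to name the columns you want or the ones you do not.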

### Sorting Rows ###

The `sort_values` method creates a new DataFrame by arranging the rows of the original DataFrame in ascending order of the values in the specified column. Here the `cones` DataFrame has been sorted in ascending order of the price of the cones.

In [10]:
cones.sort_values('Price')

Unnamed: 0,Flavor,Color,Price
0,strawberry,pink,3.55
1,chocolate,light brown,4.75
5,bubblegum,pink,4.75
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25


To sort in descending order, you can use an *optional* argument to `sort_values`. As the name implies, optional arguments don't have to be used, but they can be used if you want to change the default behavior of a method. 

By default, `sort_values` sorts in increasing order of the values in the specified column. To sort in decreasing order, use the optional argument `ascending=False`.

In [11]:
cones.sort_values('Price', ascending=False)

Unnamed: 0,Flavor,Color,Price
2,chocolate,dark brown,5.25
3,strawberry,pink,5.25
4,chocolate,dark brown,5.25
1,chocolate,light brown,4.75
5,bubblegum,pink,4.75
0,strawberry,pink,3.55


Like `drop`, the `sort_values` method leaves the original DataFrame unchanged.
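The following sketch, again using a hand-built stand-in called `cones_demo` (a name assumed for illustration), sorts in both directions and then checks that the original row order is still intact. Note that each row keeps its original index label after sorting.

```python
import pandas as pd

# Stand-in for `cones`; only the columns needed here are included.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

cheapest_first = cones_demo.sort_values('Price')                   # default: ascending
priciest_first = cones_demo.sort_values('Price', ascending=False)  # optional argument

print(list(cheapest_first['Price']))
print(list(priciest_first['Price']))
print(list(cones_demo.index))   # the original is still in its original order
```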

### Selecting Rows that Satisfy a Condition ###

The `loc` property also allows you to create a new DataFrame consisting only of those rows specified by a given Boolean series, such as:

    [True, False, True]

We can generate such a series simply by writing a comparison expression in Python.

In this section we will work with a very simple condition, which is that the value in a specified column must be equal to a value that we also specify.

We will create a DataFrame consisting only of those rows corresponding to *chocolate* cones.

In our first example with `loc`, we passed a one-item list of column labels,

    ['Flavor']

and got back a new `DataFrame`. In this example we will pass only the column label itself, which selects the `Series` of data underlying that column.

In the cell below, the data is the same, but it is not presented as a table, because it is not a `DataFrame`.

In [12]:
cones.loc[:, 'Flavor']

0    strawberry
1     chocolate
2     chocolate
3    strawberry
4     chocolate
5     bubblegum
Name: Flavor, dtype: object

Because this is a very common operation, this expression can be shortened, such that the operation is applied to the DataFrame itself, as though it were a common Python `dict`.

In [13]:
cones['Flavor']

0    strawberry
1     chocolate
2     chocolate
3    strawberry
4     chocolate
5     bubblegum
Name: Flavor, dtype: object
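
The difference between these forms can be checked by inspecting the type of each result. This is a small sketch using a hand-built stand-in, `cones_demo` (an assumed name, not from the original notebook).

```python
import pandas as pd

# Stand-in for `cones`; only the columns needed here are included.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

as_frame = cones_demo.loc[:, ['Flavor']]   # list of labels: a one-column DataFrame
as_series = cones_demo.loc[:, 'Flavor']    # single label: a Series
shorthand = cones_demo['Flavor']           # dict-style shorthand: the same Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_series.equals(shorthand))
```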

From this `Series`, we can construct a new one indicating those rows whose values are `chocolate`.

In [14]:
cones['Flavor'] == 'chocolate'

0    False
1     True
2     True
3    False
4     True
5    False
Name: Flavor, dtype: bool

Putting it all together, we can hand this Boolean series to `loc`, to select only the rows corresponding to chocolate cones.

In [15]:
where_chocolate = cones['Flavor'] == 'chocolate'
cones.loc[where_chocolate]

Unnamed: 0,Flavor,Color,Price
1,chocolate,light brown,4.75
2,chocolate,dark brown,5.25
4,chocolate,dark brown,5.25
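
A Boolean series is also handy on its own. In the sketch below, which uses a hand-built stand-in named `cones_demo` (assumed for illustration), summing the series counts the `True` entries, and the selected rows keep their original index labels.

```python
import pandas as pd

# Stand-in for `cones`; only the columns needed here are included.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

where_chocolate = cones_demo['Flavor'] == 'chocolate'   # one True/False per row
chocolate = cones_demo.loc[where_chocolate]             # keep only the True rows

print(where_chocolate.sum())     # True counts as 1, so this counts matching rows
print(list(chocolate.index))     # index labels of the rows that were kept
```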


It is important to provide the value exactly, including its capitalization. For example, if we specify `Chocolate` instead of `chocolate`, then the condition correctly finds no rows where the flavor is `Chocolate`.

In [16]:
where_chocolate = cones['Flavor'] == 'Chocolate'
where_chocolate

0    False
1    False
2    False
3    False
4    False
5    False
Name: Flavor, dtype: bool

In [17]:
cones.loc[where_chocolate]

Unnamed: 0,Flavor,Color,Price
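
If the capitalization of the data is uncertain, one possible workaround is to convert both sides of the comparison to lower case first. The sketch below uses the pandas string method `str.lower` on a hand-built stand-in named `cones_demo` (an assumed name). It is an optional aside rather than something the rest of this section relies on.

```python
import pandas as pd

# Stand-in for `cones`; only the columns needed here are included.
cones_demo = pd.DataFrame({
    'Flavor': ['strawberry', 'chocolate', 'chocolate',
               'strawberry', 'chocolate', 'bubblegum'],
    'Price': [3.55, 4.75, 5.25, 5.25, 5.25, 4.75],
})

# Exact comparison: 'Chocolate' does not equal 'chocolate'.
exact = cones_demo['Flavor'] == 'Chocolate'

# Case-insensitive comparison: lower-case both sides before comparing.
normalized = cones_demo['Flavor'].str.lower() == 'Chocolate'.lower()

print(exact.sum(), normalized.sum())   # number of matching rows for each approach
```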


### Example: Salaries in the NBA ###

"The NBA is the highest paying professional sports league in the world," [reported CNN](http://edition.cnn.com/2015/12/04/sport/gallery/highest-paid-nba-players/) in March 2016. The table `nba` contains the [salaries of all National Basketball Association players](https://www.statcrunch.com/app/index.php?dataid=1843341) in 2015-2016.

Each row represents one player. The columns are:

| **Column Label**   | Description                                         |
|--------------------|-----------------------------------------------------|
| `PLAYER`           | Player's name                                       |
| `POSITION`         | Player's position on team                           |
| `TEAM`             | Team name                                           |
| `SALARY`           | Player's salary in 2015-2016, in millions of dollars |
 
The code for the positions is PG (Point Guard), SG (Shooting Guard), PF (Power Forward), SF (Small Forward), and C (Center). But what follows doesn't involve details about how basketball is played.

The first row shows that Paul Millsap, Power Forward for the Atlanta Hawks, had a salary of almost $\$18.7$ million in 2015-2016.

In [18]:
nba

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
0,Paul Millsap,PF,Atlanta Hawks,18.671659
1,Al Horford,C,Atlanta Hawks,12.000000
2,Tiago Splitter,C,Atlanta Hawks,9.756250
3,Jeff Teague,PG,Atlanta Hawks,8.000000
4,Kyle Korver,SG,Atlanta Hawks,5.746479
...,...,...,...,...
412,Gary Neal,PG,Washington Wizards,2.139000
413,DeJuan Blair,C,Washington Wizards,2.000000
414,Kelly Oubre Jr.,SF,Washington Wizards,1.920240
415,Garrett Temple,SG,Washington Wizards,1.100602


Fans of Stephen Curry can find his row by using `loc`.

In [19]:
nba.loc[
    nba['PLAYER'] == 'Stephen Curry'
]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
121,Stephen Curry,PG,Golden State Warriors,11.370786


We can also create a new DataFrame called `warriors` consisting of just the data for the Golden State Warriors.

In [20]:
warriors = nba.loc[nba['TEAM'] == 'Golden State Warriors']

warriors

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
117,Klay Thompson,SG,Golden State Warriors,15.501000
118,Draymond Green,PF,Golden State Warriors,14.260870
119,Andrew Bogut,C,Golden State Warriors,13.800000
120,Andre Iguodala,SF,Golden State Warriors,11.710456
121,Stephen Curry,PG,Golden State Warriors,11.370786
...,...,...,...,...
126,Leandro Barbosa,SG,Golden State Warriors,2.500000
127,Festus Ezeli,C,Golden State Warriors,2.008748
128,Brandon Rush,SF,Golden State Warriors,1.270964
129,Kevon Looney,SF,Golden State Warriors,1.131960


Pandas has been configured here to display no more than 10 rows of a DataFrame by default. You can use the `to_string` method to customize this display. To display the entire table, call `to_string` with no arguments in the parentheses.

In [21]:
print(warriors.to_string())

                PLAYER POSITION                   TEAM     SALARY
117      Klay Thompson       SG  Golden State Warriors  15.501000
118     Draymond Green       PF  Golden State Warriors  14.260870
119       Andrew Bogut        C  Golden State Warriors  13.800000
120     Andre Iguodala       SF  Golden State Warriors  11.710456
121      Stephen Curry       PG  Golden State Warriors  11.370786
122     Jason Thompson       PF  Golden State Warriors   7.008475
123   Shaun Livingston       PG  Golden State Warriors   5.543725
124    Harrison Barnes       SF  Golden State Warriors   3.873398
125  Marreese Speights        C  Golden State Warriors   3.815000
126    Leandro Barbosa       SG  Golden State Warriors   2.500000
127       Festus Ezeli        C  Golden State Warriors   2.008748
128       Brandon Rush       SF  Golden State Warriors   1.270964
129       Kevon Looney       SF  Golden State Warriors   1.131960
130   Anderson Varejao       PF  Golden State Warriors   0.289755
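
The contrast between the truncated default display and `to_string` can be sketched with a made-up table. Here `roster_demo` and its placeholder names and values are invented purely for illustration, and `pd.option_context` is used so that the 10-row limit applies only inside the `with` block.

```python
import pandas as pd

# Hypothetical 20-row table; names and salaries are placeholders, not real data.
roster_demo = pd.DataFrame({
    'PLAYER': [f'Player {i}' for i in range(20)],
    'SALARY': [float(i) for i in range(20)],
})

# Temporarily apply a 10-row display limit, as configured in this chapter.
with pd.option_context('display.max_rows', 10):
    truncated = repr(roster_demo)    # middle rows are replaced by an ellipsis row

full = roster_demo.to_string()       # every row, regardless of display options

print(truncated)
print(full)
```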


The `nba` DataFrame is sorted in alphabetical order of the team names. To see how the players were paid in 2015-2016, it is useful to sort the data by salary. Remember that by default, the sorting is in increasing order.

In [22]:
nba.sort_values('SALARY')

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
267,Thanasis Antetokounmpo,SF,New York Knicks,0.030888
327,Cory Jefferson,PF,Phoenix Suns,0.049709
326,Jordan McRae,SG,Phoenix Suns,0.049709
324,Orlando Johnson,SG,Phoenix Suns,0.055722
325,Phil Pressey,PG,Phoenix Suns,0.055722
...,...,...,...,...
131,Dwight Howard,C,Houston Rockets,22.359364
255,Carmelo Anthony,SF,New York Knicks,22.875000
72,LeBron James,SF,Cleveland Cavaliers,22.970500
29,Joe Johnson,SF,Brooklyn Nets,24.894863


These figures are somewhat difficult to compare as some of these players changed teams during the season and received salaries from more than one team; only the salary from the last team appears in the table.  

The CNN report is about the other end of the salary scale – the players who are among the highest paid in the world. To identify these players we can sort in descending order of salary and look at the top few rows.

In [23]:
nba.sort_values('SALARY', ascending=False).iloc[:10]

Unnamed: 0,PLAYER,POSITION,TEAM,SALARY
169,Kobe Bryant,SF,Los Angeles Lakers,25.0
29,Joe Johnson,SF,Brooklyn Nets,24.894863
72,LeBron James,SF,Cleveland Cavaliers,22.9705
255,Carmelo Anthony,SF,New York Knicks,22.875
131,Dwight Howard,C,Houston Rockets,22.359364
201,Chris Bosh,PF,Miami Heat,22.19273
156,Chris Paul,PG,Los Angeles Clippers,21.468695
268,Kevin Durant,SF,Oklahoma City Thunder,20.158622
60,Derrick Rose,PG,Chicago Bulls,20.093064
202,Dwyane Wade,SG,Miami Heat,20.0


Kobe Bryant, since retired, was the highest earning NBA player in 2015-2016.
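
The "sort in descending order, then take the top rows" pattern can be tried on a small excerpt. The sketch below copies five rows from the outputs above into a DataFrame called `nba_demo` (an assumed name), and also shows the pandas method `nlargest` as one alternative way to get the same rows.

```python
import pandas as pd

# Five rows copied from the tables above; a small excerpt, not the full data set.
nba_demo = pd.DataFrame({
    'PLAYER': ['Paul Millsap', 'Stephen Curry', 'LeBron James',
               'Joe Johnson', 'Kobe Bryant'],
    'SALARY': [18.671659, 11.370786, 22.970500, 24.894863, 25.0],
})

# Sort from highest to lowest salary, then keep the first three rows by position.
top_three = nba_demo.sort_values('SALARY', ascending=False).iloc[:3]

# Alternative: ask pandas directly for the three largest salaries.
top_three_alt = nba_demo.nlargest(3, 'SALARY')

print(list(top_three['PLAYER']))
print(list(top_three_alt['PLAYER']))
```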