In [None]:
%matplotlib inline

import babypandas as bpd

# Lecture 5: Accessing, Sorting, and Querying

# NBA Salaries

- The file `nba_salaries.csv` contains all salaries from the 2015-2016 NBA season.
- CSV: *comma-separated values*

## Reading a CSV

- We can read a CSV using `bpd.read_csv()`. Pass it the name of the file if the file is in the current directory, or a path to the file otherwise.

In [None]:
nba_salaries = bpd.read_csv('data/nba_salaries.csv')
nba_salaries
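
To see what "comma-separated values" means without the data file, here is a minimal sketch that parses CSV text held in a string. It uses plain pandas as a stand-in for babypandas (an assumption made so the example runs on its own), and the two rows are made up; only the column names come from the lecture.

```python
import io
import pandas as pd  # plain pandas, used here as a stand-in for babypandas

# Made-up rows for illustration; only the column names come from the lecture.
csv_text = """PLAYER,POSITION,TEAM,2015_SALARY
Player A,C,Team X,12.0
Player B,PG,Team Y,3.5
"""

# The first line supplies the column labels; commas separate the columns.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

Reading from a real file works the same way, with a file name or path in place of the `io.StringIO(...)` object.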

## Discussion Question

What would be a good column to use as the index?

- A) PLAYER
- B) POSITION
- C) TEAM
- D) 2015_SALARY

Is there something we should be worried about?

**Solution**: (A)

- We'll use the player name.
- But we should be careful that two players don't have the same name.

## Setting the index

In [None]:
salaries = nba_salaries.set_index('PLAYER')
salaries
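
The worry about repeated names can be checked before setting the index. The sketch below is one possible check, not part of the lecture: it compares the number of distinct names to the number of rows. It uses plain pandas as a stand-in for babypandas, and the table is made-up toy data.

```python
import numpy as np
import pandas as pd  # stand-in for babypandas

# Made-up toy table.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "TEAM": ["Team X", "Team Y", "Team X"],
})

names = toy.get("PLAYER")
# If every name is distinct, the number of unique names equals the number of rows.
names_are_unique = len(np.unique(names)) == toy.shape[0]
print(names_are_unique)

# set_index returns a new table; PLAYER moves from the columns to the row labels.
by_player = toy.set_index("PLAYER")
print(by_player)
```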

### Shape of a table:

- `.shape` returns the number of rows and number of columns
- Access each with `[]`:

In [None]:
salaries.shape

In [None]:
salaries.shape[0] #number of rows

In [None]:
salaries.shape[1] #number of columns
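
A small sketch of `.shape` on made-up data (plain pandas standing in for babypandas). It also illustrates that the index is not counted as a column.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table with three rows.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "TEAM": ["Team X", "Team Y", "Team X"],
    "2015_SALARY": [12.0, 3.5, 7.25],
}).set_index("PLAYER")

# .shape is a pair: (number of rows, number of columns).
n_rows = toy.shape[0]
n_cols = toy.shape[1]
print(n_rows, n_cols)  # PLAYER is the index now, so only TEAM and 2015_SALARY count
```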

## Use Case: Adjust for Inflation

- These salaries are old. We should adjust for inflation.
- $\$1.00$ in 2015 = $\$1.09$ in 2021
- Workflow:
    - get the column of salaries
    - multiply every element by 1.09
    - add new column to table

### Step 1) Getting a column

- We can get a column from a DataFrame using `.get(column_name)`.
- Warning: column names are case sensitive!
- The result looks like a 1-column DataFrame, but is actually a *Series*.

In [None]:
salaries.get("2015_SALARY")

### Digression: Series

- A *Series* is like an array, but with an index
- In particular, supports arithmetic

In [None]:
salaries.get("2015_SALARY")

### Step 2) Adjust the salaries for inflation

In [None]:
salaries.get("2015_SALARY") * 1.09
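
Multiplying a Series by a number multiplies every element and keeps the row labels. A minimal sketch with a hand-built Series of made-up salaries (plain pandas standing in for babypandas):

```python
import pandas as pd  # stand-in for babypandas

# Made-up salaries in millions, labeled by player.
salary = pd.Series([10.0, 2.0, 0.5],
                   index=["Player A", "Player B", "Player C"])

# Arithmetic applies to each element; the index is carried over unchanged.
adjusted = salary * 1.09
print(adjusted)
```

The original `salary` Series is left as it was; the multiplication produces a new Series.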

### Step 3) Add adjusted salaries to table

- Use `.assign(Name_of_column=data_in_array)` to add an array (or Series, or list) to a table as a new column.
- **Warning!** No quotes around `Name_of_column`.
- Creates a new DataFrame! Must save it to a variable.

In [None]:
salaries.assign(
    ADJUSTED_SALARY=salaries.get("2015_SALARY") * 1.09
)

In [None]:
salaries

In [None]:
adjusted_salaries = salaries.assign(
    ADJUSTED_SALARY=salaries.get("2015_SALARY") * 1.09
)
adjusted_salaries
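
The difference between the two cells above can be checked directly: `.assign` leaves the original table alone and returns a new one. A sketch on made-up data, with plain pandas standing in for babypandas:

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table; salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B"],
    "2015_SALARY": [12.0, 3.5],
}).set_index("PLAYER")

# The result must be saved to a variable; `toy` itself does not change.
adjusted = toy.assign(ADJUSTED_SALARY=toy.get("2015_SALARY") * 1.09)

print(list(toy.columns))       # original table: no new column
print(list(adjusted.columns))  # new table: includes ADJUSTED_SALARY
```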

## Use Case: Getting a particular player's salary

- How much did LeBron James make in 2015 (adjusted for inflation)?

In [None]:
# this is a Series!
adjusted_salaries.get('ADJUSTED_SALARY')

## Accessing a Series by row label: `.loc`

- Use `.loc[]` to *access* an element of the series with a particular row label

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').loc['LeBron James']

## How to get a particular element from a table:

1. `.get()` the column label
2. `.loc[]` the row label

In this class, we'll get the column, then row (but row, then column is also possible).

Example: What position does LeBron play?

In [None]:
adjusted_salaries.get('POSITION').loc['LeBron James']
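
Both orders mentioned above can be sketched on made-up data. This uses plain pandas as a stand-in; the row-first form relies on `.loc` applied to the whole table, which is a pandas feature and is assumed here rather than taken from the lecture.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B"],
    "POSITION": ["C", "PG"],
    "TEAM": ["Team X", "Team Y"],
}).set_index("PLAYER")

# Column, then row (the order used in this class).
column_first = toy.get("POSITION").loc["Player B"]

# Row, then column (plain pandas; shown only for comparison).
row_first = toy.loc["Player B"].get("POSITION")

print(column_first, row_first)
```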

## Use Case: Salary Analysis

- What was the biggest/smallest salary? What was the average salary?
- *Series* have helpful methods, like `.min()`, `.max()`, `.mean()`, etc.

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').min()

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').max()

In [None]:
adjusted_salaries.get('ADJUSTED_SALARY').mean()
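
On a table small enough to check by hand, the same three methods look like this. The values are made up, and plain pandas stands in for babypandas.

```python
import pandas as pd  # stand-in for babypandas

# Made-up salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "2015_SALARY": [12.0, 3.5, 7.25, 0.5],
}).set_index("PLAYER")

salary = toy.get("2015_SALARY")
smallest = salary.min()
largest = salary.max()
average = salary.mean()  # (12.0 + 3.5 + 7.25 + 0.5) / 4
print(smallest, largest, average)
```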

## Use Case: *Who* had the biggest salary?

- Strategy: Sort the table by salary and take the name at the top

### Step 1) Sort the table

- Use the `.sort_values(by=column_name)` method to sort.
- **Notice:** Creates a new table, doesn't change the old.
- **Notice:** By default, sorts in ascending order (small to large).

In [None]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY')

### Step 1) Sorting the table in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order

In [None]:
adjusted_salaries.sort_values?

In [None]:
highest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY', ascending=False)
highest_salaries

### Step 2) Get the *name* of the person with the highest salary

- We saw that it was Kobe Bryant, but how do we get the name using code?
- Remember, the index is an array

In [None]:
highest_salaries.index[0]
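
The sort-then-take-the-top strategy, sketched on made-up data (plain pandas standing in for babypandas). The top earner is deliberately not the first row of the unsorted table.

```python
import pandas as pd  # stand-in for babypandas

# Made-up salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C"],
    "2015_SALARY": [3.5, 12.0, 0.5],
}).set_index("PLAYER")

# Largest salary first; `toy` itself keeps its original order.
highest_first = toy.sort_values(by="2015_SALARY", ascending=False)

# After sorting, the first row label is the highest-paid player.
top_name = highest_first.index[0]
print(top_name)
```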

## Use Case: What team did the person with the third-lowest salary play for?

- We have the tools, but it's a little tricky. Can you think of a strategy?

In [None]:
salaries

## Strategy #1

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`
2. Get the name of the person using `.index[2]` (remember, indexing starts at 0)
3. Use `.get('TEAM').loc[their_name]` to get their team name.



In [None]:
lowest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY')
lowest_salaries

In [None]:
name = lowest_salaries.index[2]
name

In [None]:
lowest_salaries.get('TEAM').loc[name]

## Another Approach

- To get the third element using `.loc[]`, we first had to find its label.
- Can we just get the 3rd element without knowing the label?
- Yes, with `.iloc[]`:

In [None]:
lowest_salaries.get('TEAM')

In [None]:
lowest_salaries.get('TEAM').loc['Jordan McRae']

In [None]:
lowest_salaries.get('TEAM').iloc[2]

## Strategy #2

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`, as before.
2. Use `.get('TEAM').iloc[2]` to get their team name.

In [None]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY').get('TEAM').iloc[2]
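
Both strategies should agree. The sketch below runs them side by side on made-up data, with plain pandas standing in for babypandas.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table; salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "TEAM": ["Team X", "Team Y", "Team Z", "Team W"],
    "2015_SALARY": [12.0, 3.5, 7.25, 0.5],
}).set_index("PLAYER")

lowest_first = toy.sort_values(by="2015_SALARY")

# Strategy #1: find the label at position 2, then look it up with .loc.
name = lowest_first.index[2]
team_via_loc = lowest_first.get("TEAM").loc[name]

# Strategy #2: go straight to position 2 with .iloc.
team_via_iloc = lowest_first.get("TEAM").iloc[2]

print(name, team_via_loc, team_via_iloc)
```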

## Summary of accessing a Series

- There are two ways to get an element of a Series:
    - `.loc[]` uses the row label
    - `.iloc[]` uses the integer position
- Usually `.loc` is more convenient

## Note

- Sometimes the integer position and row label are the same
- This happens by default with `bpd.read_csv`:

In [None]:
bpd.read_csv('data/nba_salaries.csv')

In [None]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').loc[3]

In [None]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').iloc[3]
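
The agreement between `.loc` and `.iloc` only lasts while the rows stay in their original order. A sketch on made-up data (plain pandas standing in for babypandas) shows the two diverging after a sort, because sorting moves the rows but each row keeps its label.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table with the default index 0, 1, 2, 3.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "2015_SALARY": [12.0, 3.5, 7.25, 0.5],
})

# Before sorting: label 3 and position 3 are the same row.
before_loc = toy.get("PLAYER").loc[3]
before_iloc = toy.get("PLAYER").iloc[3]

# After sorting: label 3 still names Player D, but position 3 is the last row.
by_salary = toy.sort_values(by="2015_SALARY")
after_loc = by_salary.get("PLAYER").loc[3]
after_iloc = by_salary.get("PLAYER").iloc[3]

print(before_loc, before_iloc, after_loc, after_iloc)
```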

## Questions we can answer:

- What was the biggest salary?
- How many players were there?
- What was LeBron James' salary?
- *Who* had the biggest salary?

## Questions we can't yet answer:

- What is the total payroll of the Cleveland Cavaliers?
- How many players make over 10 million?
- Who is the highest paid center (C)?

# Selecting Rows

## Use Case: Who was the highest paid center (C)?

In [None]:
salaries

## Selecting Rows

- We could do it if we had a table consisting only of centers.
- But how do we get that table?

## The Solution

In [None]:
salaries[salaries.get('POSITION') == 'C']

In [None]:
'PG' == 'C'

In [None]:
'C' == 'C'

In [None]:
salaries.get('POSITION') == 'C'

## Boolean Indexing

To select only some rows of `salaries`:

1. Make a list/array/Series of `True`s (keep) and `False`s (toss)
2. Then pass it into `salaries[]`.

Rather than making the list by hand, we usually generate it by making a comparison.
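
To make the two steps concrete, the sketch below builds the keep/toss list by hand and then generates the same selection with a comparison. The table is made up, and plain pandas stands in for babypandas.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "POSITION": ["C", "PG", "C", "SF"],
}).set_index("PLAYER")

# By hand: one True/False per row, in row order.
by_hand = toy[[True, False, True, False]]

# By comparison: the == produces one True/False per row for us.
is_center = toy.get("POSITION") == "C"
by_comparison = toy[is_center]

print(by_comparison)
```

Both selections keep the same two rows; the comparison version scales to tables too large to label by hand.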

## Elementwise comparisons work as expected

In [None]:
salaries.get('2015_SALARY') > 5

In [None]:
#- make a table with only players who made more than 5 million
 

## Another example

In [None]:
#- get only the Cleveland Cavaliers


## When the condition is not satisfied

In [None]:
salaries[salaries.get('TEAM') == 'San Diego Surfers']
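
When no row satisfies the condition, the result is an empty table rather than an error. A sketch on made-up data (plain pandas standing in for babypandas; "Team Q" is a hypothetical team that appears in no row):

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B"],
    "TEAM": ["Team X", "Team Y"],
    "2015_SALARY": [12.0, 3.5],
}).set_index("PLAYER")

# Every comparison is False, so no rows are kept.
nobody = toy[toy.get("TEAM") == "Team Q"]

print(nobody.shape)          # zero rows, same columns as before
print(list(nobody.columns))
```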

## Use Case: Who was the highest paid center?

1. Extract a table of centers
2. Sort by salary
3. Return first element in index
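
The three steps above can be sketched on made-up data as follows. This is one possible way to combine them, using plain pandas as a stand-in for babypandas; the highest-paid player overall is deliberately not a center.

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table; salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "POSITION": ["PG", "C", "C", "SF"],
    "2015_SALARY": [15.0, 7.25, 9.5, 0.5],
}).set_index("PLAYER")

# 1. Extract a table of centers.
toy_centers = toy[toy.get("POSITION") == "C"]

# 2. Sort by salary, largest first.
toy_centers_sorted = toy_centers.sort_values(by="2015_SALARY", ascending=False)

# 3. Return the first element of the index.
top_center = toy_centers_sorted.index[0]
print(top_center)
```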

In [None]:
#- extract a table of centers
centers = ...
centers

In [None]:
#- sort and return first thing in index


## Discussion Question

What was the total payroll of the Cleveland Cavaliers?

- a) `salaries[salaries.get('TEAM') == 'Cleveland Cavaliers'].get('2015_SALARY').sum()`
- b) `salaries.get('2015_SALARY').sum()[salaries.get('TEAM') == 'Cleveland Cavaliers']`
- c) `salaries['Cleveland Cavaliers'].get('2015_SALARY').sum()`

## Answer: a)

In [None]:
cavs = salaries[salaries.get('TEAM') == 'Cleveland Cavaliers']
cavs

In [None]:
#- use series method .sum()
cavs.get('2015_SALARY').sum()
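
The same filter-then-sum pattern on made-up data, with plain pandas standing in for babypandas. It also hints at why the other options fail: option (b) calls `.sum()` first, which collapses the column to a single number and leaves nothing to filter, and option (c) puts a team name inside `[]`, where the table expects a Boolean sequence (in plain pandas the name is treated as a column label, which does not exist).

```python
import pandas as pd  # stand-in for babypandas

# Made-up toy table; salaries in millions.
toy = pd.DataFrame({
    "PLAYER": ["Player A", "Player B", "Player C", "Player D"],
    "TEAM": ["Team X", "Team Y", "Team X", "Team Y"],
    "2015_SALARY": [12.0, 3.5, 7.25, 0.5],
}).set_index("PLAYER")

# Option (a): keep one team's rows, get the salary column, add it up.
team_x = toy[toy.get("TEAM") == "Team X"]
payroll = team_x.get("2015_SALARY").sum()
print(payroll)

# Option (c) in plain pandas: "Team X" is looked up as a column label and not found.
try:
    toy["Team X"]
    option_c_failed = False
except KeyError:
    option_c_failed = True
print(option_c_failed)
```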