In [1]:
import babypandas as bpd

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1024,
        'height': 768,
        'scroll': True,
})

# Lecture 3: Arrays and Tables

# How do we store *sequences*?

For instance:
- all temperatures in the month of January
- the age of every user on Facebook
- the salary of every NBA player

## Each as its own variable?

In [2]:
temperature_on_jan_01 = 68
temperature_on_jan_02 = 72
temperature_on_jan_03 = 65
temperature_on_jan_04 = 64
temperature_on_jan_05 = 62
temperature_on_jan_06 = 61
temperature_on_jan_07 = 59
temperature_on_jan_08 = 64
temperature_on_jan_09 = 64
temperature_on_jan_10 = 63
temperature_on_jan_11 = 65
temperature_on_jan_12 = 62

```
avg_temperature = 1/12 * (
    temperature_on_jan_01
    + temperature_on_jan_02
    + temperature_on_jan_03
    + ...)
```

## Python's `list`s

- To create a `list`, place commas between things and surround with square brackets:

In [3]:
temperature_list = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temperature_list

In [4]:
temperature_list = [temperature_on_jan_01, 72, 65, temperature_on_jan_04, 62, 61, 59, 64, 64, 63, 65, 62]
temperature_list

## `list`s make working with sequences easy

In [5]:
# compute the average temperature using `sum`
sum(temperature_list) / len(temperature_list)

## The Problem

- Lists are sloowwww
- Not a big deal when there aren't many entries
- A big problem when there are millions/billions of entries

# Arrays
* Like lists, but faster.
* Slightly less easy to work with.
* Provided by a package called `numpy`

In [6]:
import numpy as np

## Creating arrays

- To create an array, pass a list to the `np.array` function
- Remember the square brackets!

In [7]:
temperature_array = np.array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])
temperature_array

In [8]:
np.array(temperature_list)

## Accessing elements of arrays

- The things inside of an array are called its *elements*
- To get a particular element, use `[]`:

In [9]:
temperature_array

In [10]:
temperature_array[3]

## Warning!

- Python (like most languages) starts counting from 0, not 1!

In [11]:
# get the first element of the array
temperature_array[0]

## Out-of-bounds errors

In [12]:
temperature_array[42]

## Array/Number arithmetic

- `numpy` arrays make it easy to do the same thing to every element

In [13]:
temperature_array

In [14]:
# increase all temperatures by 3 degrees
temperature_array + 3

In [15]:
# halve all temperatures
temperature_array / 2

In [16]:
# convert all temperatures to Celsius
(5/9) * (temperature_array - 32)
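For contrast, here is a minimal sketch of the same Celsius conversion done on a plain list and on an array. The names `temps_f_list` and `temps_f_array` and the three sample values are made up for this illustration.

```python
import numpy as np

# Hypothetical sample data, for illustration only
temps_f_list = [68, 72, 65]
temps_f_array = np.array(temps_f_list)

# With a list, we have to visit each element ourselves
celsius_from_list = [(5/9) * (t - 32) for t in temps_f_list]

# With an array, the arithmetic is applied to every element at once
celsius_from_array = (5/9) * (temps_f_array - 32)
```

Both approaches give the same numbers; the array version simply reads like the formula.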

## Array/Array arithmetic

- two arrays of the same size can be added, subtracted, multiplied, etc.
- the arithmetic happens *elementwise*

In [17]:
a1 = np.array([1,2,3])
a2 = np.array([4,5,6])

In [18]:
a1

In [19]:
a2

In [20]:
a1 + a2

In [21]:
a1 - a2

In [22]:
a1 * a2
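The "same size" requirement matters. As a small sketch, adding a 3-element array to a 2-element array fails, because `numpy` cannot pair up the elements. The array `too_short` is a hypothetical example.

```python
import numpy as np

a1 = np.array([1, 2, 3])
too_short = np.array([4, 5])  # hypothetical array with a different length

try:
    a1 + too_short
    outcome = "no error"
except ValueError:
    # numpy cannot match 3 elements with 2 elements
    outcome = "ValueError"

print(outcome)
```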

## Arrays for basic statistics: newborn birth weight

In [23]:
#: four girls with weight in kg: g1 = 3.405, g2 = 3.207, g3 = 2.42, g4 = 3.984

g1 = 3.405 
g2 = 3.207
g3 = 2.42
g4 = 3.984

# average weight of a newborn girl (in kg): 3.3
girl_av_weight = 3.3

### Load the weights into an array of floats

In [24]:
weights_kg_g = np.array([g1, g2, g3, g4]) 

weights_kg_g

### Calculate the deviation of weights from the average weight
* Subtracting a number from an array subtracts the number from each element.

In [25]:
weights_kg_g - girl_av_weight

### Convert the weights to pounds (2.2 lb/kg)

In [26]:
weights_lbs_g = weights_kg_g * 2.2
weights_lbs_g

### How many girls are recorded in the array?

- The function `len()` returns the length of an array (or list).

In [27]:
len(weights_lbs_g)

## Arrays for basic statistics: daily temperatures

### Below is an array of daily high temperatures in San Diego from August 2018

In [28]:
temps = np.array([86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80])

Number of days in August for which temperatures were collected:

In [29]:
len(temps)

### temperature statistics (mean, min, max)

- Arrays have handy methods for common tasks

In [30]:
...

In [31]:
temps.mean() # the array's built-in mean method

In [32]:
max(temps) # built-in functions work on arrays

In [33]:
temps.max() # the array has its own min/max methods (faster)

# Ranges

- We often find ourselves needing to make arrays like this:

In [34]:
days_in_january = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
    23, 24, 25, 26, 27, 28, 29, 30, 31
])

# Ranges
* A range is an array of evenly spaced numbers (consecutive integers by default)
* ```np.arange(end)```: An array of increasing integers from 0 up to (and excluding!) end
* ```np.arange(start, end)```: An array of increasing integers from start up to (excluding!) end
* ```np.arange(start, end, step)```: A range with step between consecutive values
* The range always includes start but excludes end (i.e. a half-open interval)

In [35]:
np.arange(5)

In [36]:
np.arange(3, 9)

In [37]:
np.arange(3, 30, 5)

In [38]:
np.arange(-3, 2, 0.5)

In [39]:
np.arange(1, -3)
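Coming back to the hand-typed `days_in_january` array from earlier, here is a short sketch of how a range could build it. Because the end is excluded, the end has to be 32 for the array to include day 31.

```python
import numpy as np

# start at 1, stop before 32, so the last value is 31
days_in_january = np.arange(1, 32)

print(len(days_in_january))
print(days_in_january[0], days_in_january[30])
```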

## Discussion Question

On the first day of January, you are paid 1 cent. Every day thereafter, your pay doubles: on the 2nd day it is 2 cents, on the 3rd it is 4 cents, on the 4th it is 8 cents, and so on.

January has 31 days.

Which of these expressions calculates the total amount of money you'll make in January (in dollars)?

- A) `(2**(np.arange(31) * .01)).sum()`
- B) `(2**(np.arange(32) * .01)).sum()`
- C) `((2**np.arange(31)) * .01).sum()`
- D) `((2**np.arange(32)) * .01).sum()`

_Type your answer here, replacing this text._

In [40]:
...

In [41]:
...

In [42]:
...

## (Optional) Speed Comparison, list vs array

In [43]:
n = 1_000_000
lst = list(range(n))
arr = np.arange(n)

In [54]:
sum(lst)

In [55]:
arr.sum()

In [56]:
%%timeit
_ = sum(lst)

In [57]:
%%timeit 
_ = arr.sum()

In [100]:
# compute ratio
...

# Tables

<img width=75% src="./data/imdb.png"/>

## How do we store *tabular data*?

- Could have an array for title, another for rating, another for year, etc.
- But this is not convenient.
- Instead, we use something called a *DataFrame*

In [59]:
bpd.read_csv('data/imdb.csv')

## `pandas`

- DataFrames are provided by a package called `pandas`
- `pandas` is *the* tool for doing data science in Python
    - downloaded $380,000$ times *yesterday*
    - last month: 14 million downloads

## But `pandas` is not so cute...

<img height=100% src="./data/angrypanda.jpg"/>

## Instead

- We at UCSD have created a smaller, nicer version of `pandas`
- Keeps important stuff, throws out the rest.
- Easier to learn, but is still valid `pandas` code.

## We call it `babypandas`

<img height=75% src="./data/babypanda.jpg"/>

## Importing `babypandas`

In [60]:
import babypandas as bpd

## Table Structure

- Tables have *columns* and *rows*
- Can think of each column as an array
- Every column has a label: "Votes", "Rating", etc.
- Every row does too: 0, 1, 2, 3

In [61]:
movies = bpd.read_csv('data/imdb.csv').take(np.arange(4))
movies

## The Index

- Together, the row labels are called the *index*.
- It's not a separate column!

In [62]:
movies

## Setting a new index

- Can set a better index using `.set_index(column_name)`
- Row labels should (ideally) be unique identifiers.
- Returns a copy!
- Looks nicer, but also really useful.

In [63]:
movies_by_name = movies.set_index('Title')
movies_by_name
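To see what "returns a copy" means, here is a minimal sketch using plain `pandas` and a made-up table. The name `films` and its values are hypothetical and are not the IMDb data.

```python
import pandas as pd

# Hypothetical stand-in table; not the real IMDb data
films = pd.DataFrame({
    'Title': ['Film A', 'Film B', 'Film C'],
    'Rating': [8.1, 7.4, 9.0],
})

films_by_title = films.set_index('Title')

# The original table still has its default 0, 1, 2 row labels
print(list(films.index))
# The copy is labeled by title instead
print(list(films_by_title.index))
```

The original table is unchanged, so the result of `.set_index` has to be saved to a variable to be used later.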

## The index is an array

In [64]:
movies_by_name

In [65]:
movies_by_name.index

## Discussion Question

Which of these will return `Léon`?

- A) `movies_by_name['Title'][3]`
- B) `movies_by_name['Title'][4]`
- C) `movies_by_name.index[3]`
- D) `movies_by_name.index[4]`

_Type your answer here, replacing this text._

In [101]:
...

# NBA Salaries

- The file `nba_salaries.csv` contains all salaries from the 2015-2016 NBA season.
- CSV: *Comma-separated values*

In [69]:
print(open('data/nba_salaries.csv').read())

## Reading a CSV

- We can read a CSV using `bpd.read_csv()`. Give it the name of the file.

In [70]:
salaries = bpd.read_csv('data/nba_salaries.csv')
salaries

## Discussion Question

What would be a good column to use as the index?

- A) PLAYER
- B) POSITION
- C) TEAM
- D) 2015_SALARY

Is there something we should be worried about?

_Type your answer here, replacing this text._

## Setting the index

In [71]:
salaries_by_player = salaries.set_index('PLAYER')
salaries_by_player

### Shape of a table:

- `.shape` returns the number of rows and number of columns
- Access each with `[]`:

In [72]:
salaries_by_player.shape

In [73]:
salaries_by_player.shape[0]

In [74]:
salaries_by_player.shape[1]

## Use Case: Adjust for Inflation

- These salaries are old. We should adjust for inflation.
- $\$1.00$ in 2015 = $\$1.09$ in 2020
- Workflow:
    - get the column of salaries
    - multiply every element by 1.09
    - add new column to table

### Step 1) Getting a column

- We can get a column from a dataframe using `.get(column_name)`:
- Warning: case sensitive!
- The result looks like a 1-column DataFrame, but is actually a *Series*

In [75]:
salaries_by_player.get("2015_SALARY")

### Digression: Series

- A *Series* is like an array, but with an index
- In particular, supports arithmetic

In [76]:
salaries_by_player.get("2015_SALARY")

In [77]:
# Step 2) Adjust the salaries for inflation
salaries_by_player.get("2015_SALARY") * 1.09

### Step 3) Add adjusted salaries to table

- Use `.assign(Name_of_column=data_in_array)` to add an array (or series, or list) to a table as a new column.
- **Warning!** No quotes around `Name_of_column`
- Creates a new dataframe! Must save to variable.

In [78]:
salaries_by_player.assign(
    ADJUSTED_SALARY=salaries_by_player.get("2015_SALARY") * 1.09
)

In [79]:
salaries_by_player

In [80]:
adjusted_salaries = salaries_by_player.assign(
    ADJUSTED_SALARY=salaries_by_player.get("2015_SALARY") * 1.09
)
adjusted_salaries

## Use Case: Getting a particular player's salary

- How much did LeBron James make in 2015 (adjusted for inflation)?

In [81]:
# this is a Series!
adjusted_salaries.get('ADJUSTED_SALARY')

## Accessing a Series by row label: `.loc`

- Use `.loc[]` to *access* an element of the series with a particular row label

In [82]:
adjusted_salaries.get('ADJUSTED_SALARY').loc['LeBron James']

## How to get a particular element from a table:

1. `.get()` the column label
2. `.loc[]` the row label

In this class, we'll always get column, then row (but row, then column is also possible).

Example: What position does LeBron play?

In [83]:
adjusted_salaries.get('POSITION').loc['LeBron James']
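For the curious, here is a sketch of both orders on a made-up table. It uses plain `pandas` rather than `babypandas`, which is an assumption of this sketch, and the players and values are hypothetical.

```python
import pandas as pd

# Hypothetical players and values, for illustration only
players = pd.DataFrame({
    'PLAYER': ['Player A', 'Player B'],
    'POSITION': ['SF', 'C'],
    'TEAM': ['Team X', 'Team Y'],
}).set_index('PLAYER')

# column first, then row (the order used in this class)
column_then_row = players.get('POSITION').loc['Player A']

# row first, then column
row_then_column = players.loc['Player A'].get('POSITION')

print(column_then_row, row_then_column)
```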

## Use Case: Salary Analysis

- What was the biggest/smallest salary? What was the average salary?
- *Series* have helpful methods, like `.min()`, `.max()`, `.mean()`, etc.

In [84]:
adjusted_salaries.get('ADJUSTED_SALARY').min()

In [85]:
adjusted_salaries.get('ADJUSTED_SALARY').max()

In [86]:
adjusted_salaries.get('ADJUSTED_SALARY').mean()

## Use Case: *Who* had the biggest salary?

- Strategy: Sort the table by salary and take the name at the top

### Step 1) Sort the table

- Use the `.sort_values(by=column_name)` method to sort.
- **Notice:** Creates a new table.
- Everything works as expected, but we wanted *descending* order.

In [87]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY')

### Step 1) Sorting the table in *descending* order

- Use `.sort_values(by=column_name, ascending=False)` to sort in *descending* order

In [88]:
highest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY', ascending=False)
highest_salaries

### Step 2) Get the *name* of the person with the highest salary

- We saw that it was Kobe, but how do we get the name using code?
- Remember, the index is an array

In [89]:
highest_salaries.index[0]

## Use Case: What team did the person with the third-lowest salary play for?

- We have the tools, but it's a little tricky. Can you think of a strategy?

## Strategy #1

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`
2. Get the name of the person using `.index[2]` (remember, counting starts at 0)
3. Use `.get('TEAM').loc[their_name]` to get their team name.



In [90]:
lowest_salaries = adjusted_salaries.sort_values(by='ADJUSTED_SALARY')
lowest_salaries

In [91]:
name = lowest_salaries.index[2]
name

In [92]:
lowest_salaries.get('TEAM').loc[name]

## Another Approach

- To get the third element using `.loc[]`, we first had to find its label.
- Can we just get the 3rd element without knowing the label?
- Yes, with `.iloc[]`:

In [93]:
lowest_salaries.get('TEAM')

In [94]:
lowest_salaries.get('TEAM').loc['Jordan McRae']

In [95]:
lowest_salaries.get('TEAM').iloc[2]

## Strategy #2

1. Sort the table in ascending order using `.sort_values(by='ADJUSTED_SALARY')`, as before.
2. Use `.get('TEAM').iloc[2]` to get their team name.

In [96]:
adjusted_salaries.sort_values(by='ADJUSTED_SALARY').get('TEAM').iloc[2]

## Summary of accessing a Series

- There are two ways to get an element of a series:
    - `.loc[]` uses the row label
    - `.iloc[]` uses the integer position
- Usually `.loc` is more convenient
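A minimal side-by-side sketch of the two, using plain `pandas` and a hypothetical series with string row labels:

```python
import pandas as pd

# Hypothetical series with string row labels
teams = pd.Series(
    ['Team X', 'Team Y', 'Team Z'],
    index=['Player A', 'Player B', 'Player C'],
)

by_label = teams.loc['Player C']   # look up by row label
by_position = teams.iloc[2]        # look up by integer position (third element)

print(by_label, by_position)
```

Both lines reach the same element: one asks for it by name, the other by where it sits.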

## Note

- Sometimes the integer position and row label are the same
- This happens by default with `bpd.read_csv`:

In [97]:
bpd.read_csv('data/nba_salaries.csv')

In [98]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').loc[3]

In [99]:
bpd.read_csv('data/nba_salaries.csv').get('PLAYER').iloc[3]

# More Questions

- What is the total payroll of the Cleveland Cavaliers?
- How many players make over 10 million?
- Who is the highest paid center (C)?