# Lecture 3: Sequences

## Announcements
* If you are not enrolled in Gradescope, you can add yourself using the code in the Piazza FAQ.
* If you are not enrolled in OKPY, please send us a private message on Piazza so we can add you.
* Please contact me through Piazza, not email.
* Labs must be done individually, but HWs can be done with a partner using pair programming.
* We grade written questions on Gradescope, programming questions on OKPY. Your total score for an assignment will be the sum of these two scores (you'll have to compute it yourself).
* Lab 2 and HW 2 are out. Lab due Thursday, HW due Sunday.

## Last Time: Python basics
* Expressions
* Variables
* Functions
* Data types: `int`, `float`, `str`
* Type conversion

## Type conversion to and from strings
* Any value can be converted to a string using ```str```
* Strings can be converted to ```int``` and ```float``` when possible

In [None]:
str(3.0)

In [None]:
float('3')

In [None]:
int('chicken!')

In [None]:
'6.0' + 3.0

In [None]:
int('4.0')

In [None]:
int(float('4.0'))
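
As a rough summary of the cells above, the sketch below collects the conversions that succeed and catches the ones that fail. The variable names (`s`, `f`, `n`, `int_failed`, `concat_failed`) are made up for illustration.

```python
# Conversions that succeed
s = str(3.0)            # any value can become a string: '3.0'
f = float('3')          # a numeric string can become a float: 3.0
n = int(float('4.0'))   # go through float first, then truncate to int: 4

# int() only accepts strings that look like integers, so '4.0' is rejected
try:
    int('4.0')
    int_failed = False
except ValueError:
    int_failed = True

# A string and a float cannot be added without converting one of them first
try:
    '6.0' + 3.0
    concat_failed = False
except TypeError:
    concat_failed = True

print(s, f, n, int_failed, concat_failed)
```

Which side to convert depends on the intent: `float('6.0') + 3.0` adds numbers, while `'6.0' + str(3.0)` joins strings.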

<center><img src="q3.png"  width="1000"/></center>

In [None]:
x=3
y='4'
z='5.6'

In [None]:
x+y

In [None]:
x+int(y+z)

In [None]:
str(x)+int(y)

In [None]:
str(x)+z

### Type conversion causes messy data!

Genomics data (string-to-date):
> "Geneticists use MARCH1 as shorthand for membrane associated ring-CH-type finger 1. But Excel interprets MARCH1 as a date, automatically converting it to 1-Mar or another designation for the first of March."

[Excel Is Autocorrecting Scientific Research. And That's Not Cool](https://science.howstuffworks.com/innovation/scientific-experiments/excel-is-autocorrecting-scientific-research-thats-not-cool.htm)

### Type conversion causes messy data!

Genomics data (string-to-float):

> "Excel misidentifies some other gene names as coordinates or floating points. You might be able to suss out that 1-Mar is actually MARCH1, but how about 2.31E+13? That's how Excel converts the RIKEN identifier 2310009E13."

[Excel Is Autocorrecting Scientific Research. And That's Not Cool](https://science.howstuffworks.com/innovation/scientific-experiments/excel-is-autocorrecting-scientific-research-thats-not-cool.htm)

<center><img src="./type_inference_2.png"  width="800"/></center>

# Arrays and Ranges

# Arrays
* An array contains a sequence of values.
* All elements of an array should have the **same type**.
* Arithmetic is applied to each element individually.
* When two arrays are added, they must have the same size; corresponding elements are added in the result.
    - Unless one of the arrays has size one, in which case its single value is combined with every element of the other array.
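
The rules above can be sketched with plain NumPy arrays. Here `np.array` is used as a stand-in for the course's `make_array` (an assumption made so the example runs without the `datascience` package), and the names `a`, `b`, and `mismatch_failed` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([3, 2, 1])

doubled = a * 2                  # arithmetic applies to each element
summed = a + b                   # same size: corresponding elements are added
shifted = a + np.array([10])     # size-one array: 10 is added to every element

# Arrays of different sizes (neither of size one) cannot be added
try:
    a + np.array([1, 2])
    mismatch_failed = False
except ValueError:
    mismatch_failed = True

print(doubled, summed, shifted, mismatch_failed)
```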

In [None]:
from datascience import *        # datascience library for course
import numpy as np               # 'numerical python library' for working with arrays

## Arrays make working with data easy
* Add, subtract, multiply, divide, exponentiate.
* Use ``.item`` to access an array element by index.
* Warning: array indices start with zero!

In [None]:
a1 = make_array(1,2,3)
a2 = make_array(3,2,1)

In [None]:
a1 + a2

## Arrays for basic statistics: newborn birth weight

In [None]:
baby1 = 3.405 
baby2 = 3.207
baby3 = 2.42
baby4 = 3.984

### Load the weights into an array of floats
* `make_array()`

In [None]:
weights_kg = make_array(baby1, baby2, baby3, baby4)  # array of the four weights (kg)

### Calculate the deviation of weights from average
* Subtracting a number from an array subtracts the number from each element.

In [None]:
avg_weight = 3.5 # average weight of ALL newborns (in kg)

### Convert the weights to pounds (2.2 lbs per kg)

### How many baby weights are recorded in the array?
* `len()` or `.size`
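
One possible way to work through the three prompts above is sketched below. It uses `np.array` in place of `make_array` (an assumption, so the block runs without the `datascience` package) and the names `deviations`, `weights_lb`, and `count` are illustrative.

```python
import numpy as np

# The four newborn weights from the cell above, in kilograms
weights_kg = np.array([3.405, 3.207, 2.42, 3.984])
avg_weight = 3.5  # average weight of ALL newborns (in kg)

# Subtracting a number subtracts it from each element
deviations = weights_kg - avg_weight

# Multiplying by a number converts every element (2.2 lbs per kg)
weights_lb = weights_kg * 2.2

# Two equivalent ways to count the elements
count = len(weights_kg)
same_count = weights_kg.size

print(deviations, weights_lb, count, same_count)
```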

## Arrays for basic statistics: daily temperatures

### Below is an array of daily high temperatures in San Diego from August 2018

In [None]:
temps = make_array(86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80)

Number of days for which temperatures were collected in August:

### Temperature statistics (mean, min, max)

In [None]:
temps.sum() / temps.size  # use sum and size

In [None]:
temps.mean() # built-in mean method

In [None]:
min(temps), max(temps) # built-in functions work on arrays

In [None]:
temps.min(), temps.max() # the array has its own min/max methods (faster)

### Sort the temperatures / calculate differences

In [None]:
np.sort(temps)

In [None]:
np.diff(temps)

### Convert from Fahrenheit to Celsius
$$C = \dfrac{5}{9}\left(F-32\right)$$
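
Because arithmetic applies element by element, the formula can be written once and applied to the whole array. The sketch below assumes `np.array` as a stand-in for `make_array`; `temps_c` is an illustrative name.

```python
import numpy as np

# Daily high temperatures (Fahrenheit) from the cell above
temps = np.array([86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                  83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80])

# C = (5/9) * (F - 32), applied to every element at once
temps_c = (5 / 9) * (temps - 32)

print(temps_c.round(1))
```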

# Ranges
* A range is an array of evenly spaced numbers (consecutive integers by default).
* ```np.arange(end)```: An array of increasing integers from ``0`` up to ``end``
* ```np.arange(start, end)```: An array of increasing integers from ``start`` up to ``end``
* ```np.arange(start, end, step)```: Jump by ``step`` between consecutive values
* The range always includes ``start`` but excludes ``end`` (i.e. a half-open interval)
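
The half-open behavior described above can be checked directly. This is a small sketch; the variable names are illustrative only.

```python
import numpy as np

up_to_five = np.arange(5)          # 0, 1, 2, 3, 4 (5 itself is excluded)
three_to_nine = np.arange(3, 9)    # 3, 4, 5, 6, 7, 8
by_fives = np.arange(3, 30, 5)     # 3, 8, 13, 18, 23, 28
empty = np.arange(1, -3)           # default step is +1, so nothing lies between 1 and -3
counting_down = np.arange(1, -3, -1)  # 1, 0, -1, -2 (-3 is excluded)

print(up_to_five, three_to_nine, by_fives, empty, counting_down)
```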

In [None]:
np.arange(5)

In [None]:
np.arange(3, 9)

In [None]:
np.arange(3, 30, 5)

In [None]:
np.arange(-3, 2, 0.5)

In [None]:
np.arange(1, -3)

In [None]:
np.arange(1, -3, -1)

<center><img src="q4.png"  width="800"/></center>


In [None]:
x = make_array(2, 3, 4)
y = np.arange(2, 3, 4)
z = np.arange(3)

In [None]:
x+y

In [None]:
x+z

In [None]:
z.item(0)+y.item(0)

In [None]:
x.item(1)+y.item(1)

# Tables


![image.png](attachment:image.png)


# The datascience module
* See the documentation here: http://data8.org/datascience/tables.html
* Similar to Pandas DataFrames, R's `data.frame`, and Spark DataFrames.

## Why not use a "real" library for tables?
* Production libraries are harder to use because they evolve over time.
* The datascience module highlights only what's important.
* Easy to transition to production libraries (do it, if you're motivated!).

# Table Structure
* A table is a sequence of labeled columns.
* The labels, or column names, are strings.
* Columns are arrays, all with the same length.
* Different columns can have different data types. 
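
To make this structure concrete, the sketch below models a table as a dictionary that maps string labels to equal-length NumPy arrays. This is only a mental model with hypothetical toy values, not how the `datascience` module is implemented.

```python
import numpy as np

# Hypothetical toy table: each label (a string) maps to one column (an array)
toy_table = {
    'Name': np.array(['a', 'b', 'c']),   # a column of strings
    'Count': np.array([10, 20, 30]),     # a column of integers
}

labels = list(toy_table.keys())                 # the column names
num_columns = len(toy_table)                    # how many columns
column_lengths = {len(col) for col in toy_table.values()}
num_rows = len(toy_table['Name'])               # every column has this length

print(labels, num_columns, num_rows, column_lengths)
```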

# Table Structure
![table_anatomy.png](attachment:table_anatomy.png)

# Charles Joseph Minard, 1781 - 1870

* French civil engineer who created one of the greatest graphs of all time

<img src="attachment:minard.jpg" width="25%" align="middle"/>

# Minard's Map

## Visualized Napoleon's 1812 invasion of Russia, including:

* the number of soldiers
* the direction of the march
* the latitude and longitude of each city
* the temperature on the return journey
* dates in November and December


# Visualization of 1812 March

![image.png](attachment:image.png)

# What's the data powering the map? 

In [None]:
from datascience import *        # datascience library for course
import numpy as np               # 'numerical python library' for working with arrays

In [None]:
# read './minard.csv'
minard = Table.read_table('minard.csv')
minard

### Shape of a table:
* number of columns,
* number of rows

In [None]:
minard.num_columns

In [None]:
minard.num_rows

### Labels and relabeling columns
* `.labels` and `.relabeled(old_name, new_name)`
* `.relabeled` returns a new table (doesn't change the current one)

In [None]:
minard.labels

In [None]:
minard.relabeled('City', 'City Name')

In [None]:
minard.labels

In [None]:
minard.relabeled('City', 'City Name').relabeled('Survivors', 'Number Alive')

In [None]:
minard = minard.relabeled('City', 'City Name')
minard

### Selecting columns and table elements
* `.column()` takes a column name/index; returns an array.
* `.select()` takes column(s) (name/index); returns a table.

In [None]:
# access a column (array)
minard.column('City Name')

In [None]:
# access an element of the table
minard.column('City Name').item(0)

### Selecting columns and table elements
* `.column()` takes a column name/index; returns an array.
* `.select()` takes column(s) (name/index); returns a table.

In [None]:
minard.select('Latitude', 'Longitude')

### Adding columns: percentage of surviving troops at step k
* use `.with_column(col_name, array)`
* display the new column as percentages with `.set_format(col, PercentFormatter)`

In [None]:
initial = minard.column('Survivors').item(0)
minard = minard.with_column(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard

In [None]:
# format the percent column
minard.set_format('Percent Surviving', PercentFormatter)

### Dropping columns
* `.drop(cols)` returns a new table without the given columns (doesn't change the current one)

In [None]:
minard.drop('Latitude', 'Longitude')

In [None]:
minard

<center><img src="q6.png"  width="1000"/></center>

# Summary of Table methods so far

Description|datascience module methods
---|---
Creating and extending tables|`Table.read_table` and `Table().with_columns`
Finding the size|`num_rows` and `num_columns`
Referring to columns: labels, relabeling, and indices|`labels` and `relabeled`; column indices start at 0
Accessing data in a column|`column` takes a label or index and returns an array
Using array methods to work with data in columns|`item`, `sum`, `min`, `max`, and so on
Creating new tables containing some of the original columns|`select` and `drop`

# Sorting Tables

* The `sort` method creates a new table with the same rows in a different order (the original table is unaffected).

* The `show` method displays the first rows of a table; `show(n)` displays the first `n` rows.

In [None]:
# Larger tables have rows omitted
# read in ./nba_salaries.csv
nba = Table.read_table('nba_salaries.csv')
nba

# About the data

<center><img src="nba_description.png"  width="800"/></center>

### In 2015-16, what was the total payroll for all NBA teams combined?

### What's the largest salary in the NBA in 2015-16? Who earned it?

In [None]:
nba.column("'15-'16 SALARY").max()

### How about the top 5?

In [None]:
nba.sort("'15-'16 SALARY", descending=True).show(5)

### What are the optional arguments for `sort`?

* `descending=True` sorts the column in descending order (default: ascending order).
* `distinct=True` omits repeated values of the column, keeping only the first occurrence of each.
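
The effect of these two options can be imitated on plain NumPy arrays, as in the sketch below. The `groups` and `values` arrays are hypothetical toy data, and the NumPy calls are a stand-in chosen for illustration rather than a description of how `sort` works internally.

```python
import numpy as np

# Hypothetical toy columns: a group label and a numeric value for each row
groups = np.array(['b', 'a', 'b', 'a', 'c'])
values = np.array([10, 40, 30, 20, 50])

# Like descending=True: reorder the rows from largest value to smallest
order = np.argsort(-values, kind='stable')
groups_desc = groups[order]
values_desc = values[order]

# Like distinct=True: keep only the first row seen for each group label
# (np.unique returns the labels in sorted order with their first positions)
_, first_rows = np.unique(groups_desc, return_index=True)
kept_groups = groups_desc[first_rows]
kept_values = values_desc[first_rows]

print(kept_groups, kept_values)
```

Since the rows were already ordered from largest to smallest, the first row kept for each label is the one with the largest value in that group.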

### What does the code below do?

In [None]:
nba.sort(3, descending=True).sort(1, distinct=True)

<center><img src="q7.png"  width="1000"/></center>