# Lecture 04: Arrays, Ranges, and Tables

# Arrays
* An array contains a sequence of values.
* All elements of an array should have the **same type**.
* Arithmetic is applied to each element individually.
* When two arrays are added, they must have the same size; corresponding elements are added in the result.
    - Unless one of the arrays has size one, in which case its single value is added to every element of the other array.
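
The rules above can be sketched in a few lines. This is a minimal illustration that uses `np.array` directly as a stand-in for `make_array` (an assumption made here so the snippet runs with NumPy alone); the names `a`, `b`, `one`, and `size_mismatch_error` are made up for this example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
one = np.array([100])      # an array of size one

print(a * 2)               # arithmetic applies to each element: [2 4 6]
print(a + b)               # same size: corresponding elements are added: [11 22 33]
print(a + one)             # size one: the single value is added to every element: [101 102 103]

# Sizes 3 and 2 do not match, and neither array has size one, so this fails.
try:
    a + np.array([1, 2])
    size_mismatch_error = None
except ValueError as err:
    size_mismatch_error = err
    print("ValueError:", err)
```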

In [1]:
# Add necessary imports

from datascience import *        # datascience library for this course
import numpy as np               # 'numerical python library' for working with arrays

In [2]:
# Let's create an array with people and count how many friends they have in this class
# friends = ...
# Number of friends for each person:
# Chen = 10
# Max = 4
# Danny = 2
# Sam = 7
# Taylor = 5

In [3]:
# friends = ...
friends = make_array(10, 4, 2, 7, 5)
friends

array([10,  4,  2,  7,  5])

In [4]:
# they all make 4 more friends...
friends + 4

array([14,  8,  6, 11,  9])

### use ```.item()``` to access an array element by index
* Warning: array indices start with zero!

In [7]:
# How many friends does the first person have?
friends.item(0)

10

In [9]:
# How many friends does the fourth person listed in the array have?
friends.item(3)

7
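
Because indices start at zero, the last element of an array with `n` elements sits at index `n - 1`, and asking for index `n` is an error. A small sketch of that boundary, again using `np.array` in place of `make_array` (an assumption for self-containment); `last_index`, `last_friend_count`, and `out_of_range` are illustrative names.

```python
import numpy as np

friends = np.array([10, 4, 2, 7, 5])   # same values as make_array(10, 4, 2, 7, 5)

last_index = friends.size - 1           # 5 elements, so valid indices are 0 through 4
last_friend_count = friends.item(last_index)
print(last_friend_count)                # 5

# Index 5 would be a sixth element, which does not exist.
try:
    friends.item(friends.size)
    out_of_range = None
except IndexError as err:
    out_of_range = err
    print("IndexError:", err)
```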

## Arrays make working with data easy
* add them, subtract them, multiply, divide, exponentiate

In [10]:
# Note that these statements produce no output:
# these assignment statements just create two new names and assign arrays to them
a1 = make_array(1,2,3)
a2 = make_array(3,2,1)

In [13]:
# this is how you can see the contents that are stored in any name you defined
a1

array([1, 2, 3])

In [76]:
a2

array([3, 2, 1])

In [77]:
a1 + a2

array([4, 4, 4])

In [78]:
a1 - a2

array([-2,  0,  2])

In [79]:
a1 * a2

array([3, 4, 3])

In [14]:
a1/a2

array([0.33333333, 1.        , 3.        ])

In [81]:
a1**a2

array([1, 4, 3])

## Arrays for basic statistics: daily temperatures

### Below is an array of daily high temperatures in San Diego from August 2018

In [87]:
#temps = make_array(86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
#                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80)

temps = make_array(86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80)

Number of days in August for which temperatures were collected:

In [88]:
temps.size

31

### temperature statistics (mean, min, max)

In [89]:
temps.sum() / temps.size  # use sum and size

83.29032258064517

In [90]:
temps.mean() # built-in mean method

83.29032258064517

In [91]:
min(temps), max(temps) # built-in functions work on arrays

(79, 91)

In [92]:
temps.min(), temps.max() # the array has its own min/max methods (faster)

(79, 91)

# Ranges
* A range is an array of evenly spaced numbers. By default the `step` is 1, which gives consecutive integers, but `step` can be set to other values.
* ```np.arange(end)```: An array of increasing integers from 0 up to, but not including, `end` (so `end` is also the number of elements)
* ```np.arange(start, end)```: An array of increasing integers from `start` up to, but not including, `end`
* ```np.arange(start, end, step)```: A range with `step` between consecutive values
* The range always includes `start` but excludes `end` (i.e. a half-open interval)

In [16]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [17]:
np.arange(3, 9)

array([3, 4, 5, 6, 7, 8])

In [97]:
np.arange(3, 30, 5)

array([ 3,  8, 13, 18, 23, 28])

In [98]:
np.arange(-3, 2, 0.5)

array([-3. , -2.5, -2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])

In [99]:
np.arange(1, -3)

array([], dtype=int64)

In [100]:
np.arange(1, -3, -1)

array([ 1,  0, -1, -2])
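
One way to predict how many elements a range will have is to divide the distance from `start` to `end` by `step` and round up, treating a negative result as an empty range. The sketch below checks that rule against the six ranges shown above. `expected_size` is a hypothetical helper written for this example, not a NumPy function, and with non-integer steps floating-point rounding can occasionally make the real size differ from this estimate.

```python
import math
import numpy as np

def expected_size(start, end, step=1):
    """Hypothetical helper: predicted number of elements in np.arange(start, end, step)."""
    return max(0, math.ceil((end - start) / step))

# (start, end, step) for the ranges shown above; a step of 1 is the default.
cases = [(0, 5, 1), (3, 9, 1), (3, 30, 5), (-3, 2, 0.5), (1, -3, 1), (1, -3, -1)]

sizes = []
for start, end, step in cases:
    r = np.arange(start, end, step)
    sizes.append(r.size)
    print((start, end, step), "actual:", r.size, "predicted:", expected_size(start, end, step))
```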

### Discussion Question

Assume you have run the following statements:

`x = make_array(2,3,4)`

`y = np.arange(2,3,4)`

`z = np.arange(3)`


|Choose the expression that will cause an error: |
|-----|
|`A.                    x + y`|
|`B.                    x + z`|
|`C.    x.item(0) + y.item(0)`|
|`D.    x.item(1) + y.item(1)`|


In [34]:
x = make_array(2,3,4)

y = np.arange(2,3,4)

z = np.arange(3)

In [35]:
print(" x = ", x, "\n y = ", y, "\n z = ", z)

 x =  [2 3 4] 
 y =  [2] 
 z =  [0 1 2]


In [38]:
# You can run them one by one
#x + y
#x + z
#x.item(0) + y.item(0)
#x.item(1) + y.item(1)
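
As an alternative to uncommenting the lines one at a time, the four expressions can be wrapped in functions and tried in a loop, so that an error in one does not stop the others. This is only a sketch: `np.array` stands in for `make_array`, and the names `expressions` and `raised` are invented for this example.

```python
import numpy as np

x = np.array([2, 3, 4])    # stands in for make_array(2, 3, 4)
y = np.arange(2, 3, 4)
z = np.arange(3)

# Each option is wrapped in a lambda so it is evaluated only inside the try block.
expressions = {
    "A": lambda: x + y,
    "B": lambda: x + z,
    "C": lambda: x.item(0) + y.item(0),
    "D": lambda: x.item(1) + y.item(1),
}

raised = {}
for label, expression in expressions.items():
    try:
        print(label, "->", expression())
        raised[label] = False
    except Exception as err:
        print(label, "->", type(err).__name__ + ":", err)
        raised[label] = True
```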

# Tables


<img src="attachment:image.png" width="100%" align="middle"/>


# Table Structure
* A table is a sequence of labeled columns.
* The labels, or column names, are strings.
* Columns are arrays, all with the same length.
* Different columns can have different data types. 
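
As a rough mental model only (this is not how the `datascience` `Table` is implemented), the four properties above can be checked on a plain Python dictionary that maps string labels to NumPy arrays. The names `columns`, `labels_are_strings`, `lengths`, and `kinds` are illustrative.

```python
import numpy as np

# Analogy: each key is a column label, each value is the column's array.
columns = {
    "Lat": np.array([32, 37, 42]),
    "Long": np.array([54, 55, 55.4]),
    "City": np.array(["Smolensk", "Dorogobouge", "Orscha"]),
    "Number": np.arange(3),
}

labels_are_strings = all(isinstance(label, str) for label in columns)
lengths = {len(values) for values in columns.values()}                    # one distinct length expected
kinds = {label: values.dtype.kind for label, values in columns.items()}   # 'i' int, 'f' float, 'U' text

print(labels_are_strings)
print(lengths)
print(kinds)
```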

# Table Structure
![table_anatomy.png](attachment:table_anatomy.png)

In [41]:
# Since we said that the columns in a table are arrays, when we create a table with columns, 
# we need to tell it not just the name of the column, but what to store in the corresponding array.
# Note that we first have to create an empty table using Table(), 
# and then apply the method to create columns and add values.
# Caution: the call below passes only strings, so they are read as (label, value) pairs:
# 'Lat' gets the single value 'Long', and 'City' gets the single value 'Number'.
Table().with_columns('Lat', 'Long', 'City', 'Number')

Lat,City
Long,Number


In [43]:
# In this example, we follow each column name with the array values corresponding to that column
Table().with_columns(
    'Lat', make_array(32, 37, 42), 
    'Long', make_array(54, 55, 55.4), 
    'City', make_array('Smolensk','Dorogobouge', 'Orscha'),
    'Number', np.arange(3))

Lat,Long,City,Number
32,54.0,Smolensk,0
37,55.0,Dorogobouge,1
42,55.4,Orscha,2


# Minard's Map

## Charles Joseph Minard, 1781 - 1870

* French civil engineer who created one of the greatest graphs of all time

<img src="attachment:minard.jpg" width="25%" align="middle"/>

# Minard's Map

## Visualized Napoleon's 1812 invasion of Russia, including:

* the number of soldiers
* the direction of the march
* the latitude and longitude of each city
* the temperature on the return journey
* Dates in November and December


# Visualization of 1812 March

![image.png](attachment:image.png)

# What's the data powering the map? (DEMO)

In [20]:
# wildcard, i.e., * -- not good practice, but keeps things simple!
from datascience import *

In [21]:
# what did we just import?
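# iterate over a snapshot: the loop variables n and v are themselves globals,
# so looping over globals() directly can change the dict while it is being read
for n, v in list(globals().items()):
    try:
        if 'datascience.' in v.__module__:
            print(n)
    except (AttributeError, TypeError):
        # skip objects that have no usable __module__ attribute
        pass
    
# Below is the list of names that the wildcard import brought in from the module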

Table
default_formatter
Formatter
NumberFormatter
CurrencyFormatter
DateFormatter
PercentFormatter
DistributionFormatter
Map
Marker
Circle
Region
are
make_array
percentile
plot_cdf_area
plot_normal_cdf
table_apply
proportions_from_distribution
sample_proportions
minimize


# Minard's data: the anatomy of a table

In [22]:
# read './minard.csv'
minard = Table.read_table('minard.csv')
minard

Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


### Shape of a table:
* number of columns,
* number of rows

In [66]:
minard.num_columns

5

In [67]:
minard.num_rows

8

### labels and relabeling columns
* `.labels` and `.relabeled(old_name, new_name)`
* `.relabeled` returns a new table (doesn't change the current one)

In [68]:
minard.labels

('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')

In [69]:
minard.relabeled('City', 'City Name')

Longitude,Latitude,City Name,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


In [70]:
minard

Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


In [72]:
(
    minard
    .relabeled('City', 'City Name')
    .relabeled('Survivors', 'Number Alive')
)

Longitude,Latitude,City Name,Direction,Number Alive
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


In [28]:
minard = minard.relabeled('City', 'City Name')

### Selecting columns and table elements
* `.column()` takes a column name/index; returns an array.
* `.select()` takes column(s) (name/index); returns a table.

In [38]:
# access a column (array)
minard.column('City Name')

array(['Smolensk', 'Dorogobouge', 'Chjat', 'Moscou', 'Wixma', 'Smolensk',
       'Orscha', 'Moiodexno'], dtype='<U11')

In [39]:
# access an element of the table
minard.column('City Name').item(0)

'Smolensk'

In [34]:
minard.select('City Name', 'Latitude')

City Name,Latitude
Smolensk,54.8
Dorogobouge,54.9
Chjat,55.5
Moscou,55.8
Wixma,55.2
Smolensk,54.6
Orscha,54.4
Moiodexno,54.3


In [52]:
lat_long_cols = ['Latitude', 'Longitude']
minard.select(lat_long_cols)

Latitude,Longitude
54.8,32.0
54.9,33.2
55.5,34.4
55.8,37.6
55.2,34.3
54.6,32.0
54.4,30.4
54.3,26.8


### Adding columns: percentage of surviving troops at step k
* use `.with_column(col_name, array)`
* apply `PercentFormatter` with `.set_format(col, formatter)`

In [47]:
initial = minard.column('Survivors').item(0)
minard = minard.with_column(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,1.0
33.2,54.9,Dorogobouge,Advance,140000,0.965517
34.4,55.5,Chjat,Advance,127100,0.876552
37.6,55.8,Moscou,Advance,100000,0.689655
34.3,55.2,Wixma,Retreat,55000,0.37931
32.0,54.6,Smolensk,Retreat,24000,0.165517
30.4,54.4,Orscha,Retreat,20000,0.137931
26.8,54.3,Moiodexno,Retreat,12000,0.0827586


In [48]:
# format the percent column
minard.set_format('Percent Surviving', PercentFormatter)

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,100.00%
33.2,54.9,Dorogobouge,Advance,140000,96.55%
34.4,55.5,Chjat,Advance,127100,87.66%
37.6,55.8,Moscou,Advance,100000,68.97%
34.3,55.2,Wixma,Retreat,55000,37.93%
32.0,54.6,Smolensk,Retreat,24000,16.55%
30.4,54.4,Orscha,Retreat,20000,13.79%
26.8,54.3,Moiodexno,Retreat,12000,8.28%
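
The percentages displayed above can be reproduced with plain array arithmetic and Python string formatting. This sketch does not show how `PercentFormatter` works internally; it only recomputes the displayed values from the `Survivors` numbers in the table. The names `fractions` and `percent_strings` are illustrative.

```python
import numpy as np

# Survivors column, copied from the minard table above.
survivors = np.array([145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000])

# Divide every element by the first one (the initial army size).
fractions = survivors / survivors.item(0)

# Format each fraction as a percentage with two decimal places.
percent_strings = [f"{fraction:.2%}" for fraction in fractions]
print(percent_strings)
```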


### Dropping columns
* `.drop(cols)`

In [53]:
minard.drop(lat_long_cols)

City Name,Direction,Survivors,Percent Surviving
Smolensk,Advance,145000,100.00%
Dorogobouge,Advance,140000,96.55%
Chjat,Advance,127100,87.66%
Moscou,Advance,100000,68.97%
Wixma,Retreat,55000,37.93%
Smolensk,Retreat,24000,16.55%
Orscha,Retreat,20000,13.79%
Moiodexno,Retreat,12000,8.28%


In [44]:
# Note that the original table is still intact!
minard

Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


### Discussion Question

|How would you calculate the average of the numbers in the `Survivors` column of `minard`?|
|---|
|`A. sum(minard.select('Survivors')) / minard.num_rows`|
|`B. sum(minard.column('Survivors')) / minard.num_rows`|
|`C.                                Both A and B work.`|
|`D.                            Neither A nor B works.`|

# Summary of Table methods

Description|datascience module methods
---|---
Creating and extending tables| `Table.read_table` and `Table().with_columns`
Finding the size| `num_rows` and `num_columns`
Referring to columns: labels, relabeling, and indices | `labels` and `relabeled`; column indices start at 0
Accessing data in a column|`column` takes a label or index and returns an array
Using array methods to work with data in columns|`item, sum, min, max`, and so on
Creating new tables containing some of the original columns| `select, drop`