In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

# Tables
Most data are arranged in a tabular format. Hence, it is useful to understand how to manipulate tables.

The standard data science library for Python is `pandas`, which stands for [Python Data Analysis Library](http://pandas.pydata.org/). This library is capable of many things, but at the same time it is difficult to learn. Because of this difficulty, Berkeley faculty developed a library called `tables` that is easier to learn, and we will use it in this class. Once we go beyond this class, it will be easy to transition from `tables` to a `pandas` table, also called a `pandas` data frame.

## Table Structure

A table is a sequence of labeled columns (think of a rectangular grid of data).
* Each column is labeled with a string
* Each column's data are stored in an array, and every column's array must have the same length
<img src = 'table_example.jpg' width = 500\>
We usually treat a row as a single entity or a single data point.
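
One way to picture this structure, outside of the `datascience` library, is as a plain Python dictionary that maps each column label to a list of values. The sketch below is only an illustration under that assumption; `columns` and `check_same_length` are hypothetical names, and this is not how `Table` is implemented.

```python
# A hypothetical stand-in for a table: each label (a string) maps to one column of data.
columns = {
    'Petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

def check_same_length(columns):
    """Return the shared column length, or raise ValueError if the lengths differ."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError('all columns must have the same length')
    return lengths.pop() if lengths else 0

num_rows = check_same_length(columns)

# A row is one data point: the first entry of every column forms the first row.
first_row = {label: values[0] for label, values in columns.items()}
print(num_rows, first_row)
```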

## Demo
Again, make sure to import the `datascience` package. The `tables` library is located within `datascience`.

After we import `datascience`, we can call `Table` to create a table.

In [2]:
Table()

Above is an empty table. Once we create an empty table, we can add columns into it. 

In [3]:
Table().with_columns('Petals', make_array(8, 34, 5))

Petals
8
34
5


In the `with_columns` method, the first argument is the label of the column, while the second argument is the data inside the column.

If we want to add more columns, we add more arguments within the `with_columns` method.

In [4]:
Table().with_columns(
    'Petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

Petals,Name
8,lotus
34,sunflower
5,rose


Above, we created a table but did not assign it to a name, so we won't be able to use that table later.

Now we will assign the table above to the name `flowers`.

In [5]:
flowers = Table().with_columns(
    'Petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)
flowers

Petals,Name
8,lotus
34,sunflower
5,rose


We can add another column to the newly created `flowers`.

In [6]:
flowers.with_columns('Color', make_array('pink', 'yellow', 'red'))

Petals,Name,Color
8,lotus,pink
34,sunflower,yellow
5,rose,red


Be careful! The output above is a new table! `flowers` itself does not change!

In [7]:
flowers

Petals,Name
8,lotus
34,sunflower
5,rose


This is similar to the following:

In [8]:
x = 3

In [9]:
x + 2

5

In [10]:
x

3

Above, after adding 2, `x` does not change! If we want to update `x`, we need to reassign it!

In [11]:
x = x + 2
x

5

And the same goes for `flowers`.

In [12]:
flowers = flowers.with_columns('Color', make_array('pink', 'yellow', 'red'))
flowers

Petals,Name,Color
8,lotus,pink
34,sunflower,yellow
5,rose,red


Be careful when re-running the `x = x + 2` cell above! Each time we run it, `x` increases by 2!

In [13]:
x = x + 2
x

7
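
The same pattern of computing a new value while leaving the original alone appears in built-in Python types, which may make it easier to remember. A minimal sketch using a string (the variable names are only for illustration):

```python
name = 'lotus'

# upper() returns a new string; `name` still refers to the original.
shouted = name.upper()
print(name)     # lotus
print(shouted)  # LOTUS

# To keep the result under the same name, reassign it,
# just like `flowers = flowers.with_columns(...)`.
name = name.upper()
print(name)     # LOTUS
```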

# Minard's Map
<img src = 'minard.jpg' width = 200\>
Charles Joseph Minard (1781-1870) was a French civil engineer who created one of the greatest graphs of all time.

Back in 1812, Napoleon, in his attempt to conquer the world, invaded Russia. The Russians, taking advantage of the weather, retreated deep into Russia. Unfortunately for Napoleon, most of his army did not survive the winter. Minard made a visualization of Napoleon's march into Russia and back out of Russia. Within the visualization, Minard was able to take into account several variables, including:
1. The number of soldiers
2. The direction of the march
3. The latitude and longitude of each city
4. The temperature on the return journey
5. Dates in November and December
<img src = 'minard_map.png' width =1200\>

In the map above, the gray line represents Napoleon's soldiers marching into Moscow, while the black line represents the return trip. As we can see, both lines become narrower along the trip, implying that the number of soldiers decreased over time. We can also see the latitude and longitude of each city and the temperature along the return journey.

## Different Types of Data
<img src = 'data.jpg' width = 500\>

Above is the data based on Minard's map. Notice that different columns can contain different types of data. We will obtain the data from a `csv` file and load it into the name `minard`.

`csv` stands for "comma-separated values". If we open the file, it will look like tabular data with the values separated by commas.
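
To make the format concrete, the sketch below parses a few lines of comma-separated text with Python's built-in `csv` module. The text is typed in by hand from the first rows of the table in this section, so the example does not depend on the actual `minard.csv` file; `csv_text` and `rows` are illustrative names.

```python
import csv
import io

# Hand-typed sample in CSV form: a header line, then one line per row.
csv_text = """Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
"""

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]

print(header)   # ['Longitude', 'Latitude', 'City', 'Direction', 'Survivors']
print(data[0])  # ['32.0', '54.8', 'Smolensk', 'Advance', '145000']
```

Note that `csv.reader` returns every value as a string, so numbers would still need to be converted before doing arithmetic with them.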

In [14]:
minard = Table.read_table('minard.csv')
minard

Longitude,Latitude,City,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


We can look up a table's number of columns and rows using the attributes `num_columns` and `num_rows`, respectively.

In [15]:
minard.num_columns

5

In [16]:
minard.num_rows

8

We can get the column labels of the `minard` table with the `labels` attribute.

In [17]:
minard.labels

('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')

We can also rename a label using the method `relabeled`. The first argument of this method is the string of the label to be replaced, while the second argument is the new label.

In [18]:
minard.relabeled('City', 'City Name')

Longitude,Latitude,City Name,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


This method comes in handy because some data come with unreadable or hard-to-read column labels.

Again, we did not change the `minard` table at all! If we want to make changes to the `minard` table, we need to reassign it.

In [19]:
minard = minard.relabeled('City', 'City Name')
minard

Longitude,Latitude,City Name,Direction,Survivors
32.0,54.8,Smolensk,Advance,145000
33.2,54.9,Dorogobouge,Advance,140000
34.4,55.5,Chjat,Advance,127100
37.6,55.8,Moscou,Advance,100000
34.3,55.2,Wixma,Retreat,55000
32.0,54.6,Smolensk,Retreat,24000
30.4,54.4,Orscha,Retreat,20000
26.8,54.3,Moiodexno,Retreat,12000


If we run the `relabeled` cell one more time, it will raise an error! This is because the label `City` has already been replaced, so Python cannot find the label that needs to be replaced.

In [20]:
minard = minard.relabeled('City', 'City Name')
minard

ValueError: Invalid labels. Column labels must already exist in table in order to be replaced.
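
One way to avoid this error when a cell may be re-run is to check whether the old label is still present before renaming. The sketch below shows the idea on a plain tuple of labels rather than on a real table; `rename_label` is a hypothetical helper, not part of `datascience`.

```python
def rename_label(labels, old, new):
    """Return a new tuple with `old` replaced by `new`; unchanged if `old` is absent."""
    if old not in labels:
        return labels  # nothing to rename, e.g. the cell already ran once
    return tuple(new if label == old else label for label in labels)

labels = ('Longitude', 'Latitude', 'City', 'Direction', 'Survivors')

once = rename_label(labels, 'City', 'City Name')
twice = rename_label(once, 'City', 'City Name')  # second run is a harmless no-op

print(once)
print(once == twice)  # True
```

With a real table, the analogous check would be a condition such as `'City' in minard.labels` before calling `relabeled`.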

We can get the values of a particular column using the `column` method. This method returns an array containing the data. We can specify the column using either its label or its index.

In [None]:
minard.column('Survivors')

In [None]:
minard.column(4)

If we want to take a single element out of a column, we can use the `item` method.

In [None]:
minard.column(4).item(0)

In [None]:
minard.column(4).item(1)
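
Since the cells above are shown without output, here is a small sketch of what to expect, assuming that `column` returns a NumPy array (as the array output later in this section suggests). The array is typed in by hand from the `Survivors` column so that the example runs without the table.

```python
import numpy as np

# Hand-typed copy of the Survivors column.
survivors = np.array([145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000])

# item(i) pulls out the element at index i; indices start at 0.
first = survivors.item(0)
second = survivors.item(1)
print(first, second)  # 145000 140000
```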

We will add another column to `minard` called `Percent Surviving`, which represents the proportion of soldiers that survived relative to the initial number of soldiers.

In [21]:
# initial is the starting number of soldiers
initial = minard.column(4).item(0)
minard = minard.with_columns(
    # Percent surviving is calculated by dividing each element of the `Survivors` column by initial
    'Percent Surviving', minard.column('Survivors') / initial
)
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,1.0
33.2,54.9,Dorogobouge,Advance,140000,0.965517
34.4,55.5,Chjat,Advance,127100,0.876552
37.6,55.8,Moscou,Advance,100000,0.689655
34.3,55.2,Wixma,Retreat,55000,0.37931
32.0,54.6,Smolensk,Retreat,24000,0.165517
30.4,54.4,Orscha,Retreat,20000,0.137931
26.8,54.3,Moiodexno,Retreat,12000,0.0827586


At the end of the return journey, Napoleon ended up with only 8% of his original army size! 
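
The values in `Percent Surviving` are proportions between 0 and 1 rather than percentages. If percentages are preferred, one option is to multiply by 100 and round. The sketch below does this on a hand-typed NumPy copy of the `Survivors` column; `percent` is an illustrative name.

```python
import numpy as np

# Hand-typed copy of the Survivors column.
survivors = np.array([145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000])
initial = survivors.item(0)

# Dividing an array by a number divides every element.
percent = np.round(100 * survivors / initial, 1)
print(percent)
```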

We can take columns from a table using the `select` method. As with the `column` method, we can use either the label or the index of each column. `select` returns a new table.

In [22]:
minard.select('Longitude', 'Direction')

Longitude,Direction
32.0,Advance
33.2,Advance
34.4,Advance
37.6,Advance
34.3,Retreat
32.0,Retreat
30.4,Retreat
26.8,Retreat


In [23]:
minard.select('Longitude', 4)

Longitude,Survivors
32.0,145000
33.2,140000
34.4,127100
37.6,100000
34.3,55000
32.0,24000
30.4,20000
26.8,12000


And we can chain calls to the `select` method!

In [24]:
minard.select('Longitude', 'Direction').select('Direction')

Direction
Advance
Advance
Advance
Advance
Retreat
Retreat
Retreat
Retreat


Note that the `select` method returns a table, while the `column` method returns an array. We can do arithmetic on an array, but not on a table!

In [26]:
# Can't add 1 to a table!
minard.select('Survivors') + 1

TypeError: unsupported operand type(s) for +: 'Table' and 'int'

In [25]:
minard.column('Survivors') + 1

array([145001, 140001, 127101, 100001,  55001,  24001,  20001,  12001])

If we want to remove one or more columns, we can use the `drop` method.

In [27]:
minard.drop('Longitude', 'Latitude', 'Direction')

City Name,Survivors,Percent Surviving
Smolensk,145000,1.0
Dorogobouge,140000,0.965517
Chjat,127100,0.876552
Moscou,100000,0.689655
Wixma,55000,0.37931
Smolensk,24000,0.165517
Orscha,20000,0.137931
Moiodexno,12000,0.0827586


As previously pointed out, this method does not change the original table `minard` either! We need to reassign it to change the table.