In [5]:
import numpy as np
np.set_printoptions(threshold=50)
path_data = '../../assets/data/'

# Working with Tabular Data

Tabular data is one of the most common and useful forms of data for analysis. Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
* a sequence of named columns that each describe a single aspect of all entries in a data set, or
* a sequence of rows that each contain all information about a single entry in a data set.
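
The two views can be sketched in plain Python, without the `datascience` package. This is only an illustration: the names `columns` and `rows` are assumptions made for this sketch, and the values are borrowed from the flower example used later in this section.

```python
# Column view: each label maps to the values of that attribute for every entry.
columns = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# Row view: one tuple per entry, built by pairing up the columns position by position.
rows = list(zip(*columns.values()))

print(rows[0])  # (8, 'lotus')
```

Both structures hold the same information; which one is more convenient depends on whether an analysis works attribute by attribute or entry by entry.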

The row and column are the two main components of a table.
To use tables, import everything from the `datascience` module, which was written by Berkeley professors John DeNero and David Culler together with students Sam Lau and Alvin Wan. The full documentation for the `datascience` package can be found [here](https://www.data8.org/datascience/), but students typically need only the [Python Reference Guide](https://www.data8.org/sp20/python-reference.html), which covers the functions used widely in CMPUT191. Let's begin by importing `datascience`:

In [8]:
from datascience import *

<h2>Creating Tables</h2> 
An empty table can be created by calling `Table()`. An empty table is useful because it can be extended to contain new rows and columns.

In [9]:
Table()

The `with_columns` method on a table constructs a new table with additional labeled columns. Each column of a table is an array. To add one new column to a table, call `with_columns` with a label and an array. (The `with_column` method can be used with the same effect.)

Below, we begin each example with an empty table that has no columns. 

In [10]:
Table().with_columns('Number of petals', make_array(8, 34, 5))

Number of petals
8
34
5


To add two (or more) new columns, provide the label and array for each column. All columns must have the same length, or an error will occur.

In [11]:
Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

Number of petals,Name
8,lotus
34,sunflower
5,rose
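
The equal-length requirement mentioned above follows from the row view of a table: every row needs one value from each column. The sketch below checks that condition in plain Python. The helper `has_equal_lengths` is hypothetical and is not part of `datascience`, nor is it a description of how that package performs its own check.

```python
def has_equal_lengths(columns):
    """Return True if every column in the dict has the same number of values."""
    lengths = {len(values) for values in columns.values()}
    return len(lengths) <= 1

columns_ok = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# The 'Name' column is one value short, so the third row cannot be formed.
columns_bad = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower'],
}

print(has_equal_lengths(columns_ok))   # True
print(has_equal_lengths(columns_bad))  # False
```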


We can give this table a name, and then extend the table with another column.

In [12]:
flowers = Table().with_columns(
    'Number of petals', make_array(8, 34, 5),
    'Name', make_array('lotus', 'sunflower', 'rose')
)

flowers.with_columns(
    'Color', make_array('pink', 'yellow', 'red')
)

Number of petals,Name,Color
8,lotus,pink
34,sunflower,yellow
5,rose,red


The `with_columns` method creates a new table each time it is called, so the original table is not affected. For example, the table `flowers` still has only the two columns that it had when it was created.

In [13]:
flowers

Number of petals,Name
8,lotus
34,sunflower
5,rose


<h2>Reading Tables</h2>
Creating tables in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use Python to read them into a table instead of typing them in cell by cell.

Often, tables are created from files that contain comma-separated values. Such files are called CSV files.
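
To make the format concrete, the sketch below parses a small piece of CSV text with Python's standard `csv` module. The variable names are assumptions for this illustration, and the text is held in a string rather than a file so that the example is self-contained. The header and the two records are taken from the movie table displayed later in this section.

```python
import csv
import io

# The first line holds the column labels; each later line is one row.
csv_text = """Title,Studio,Gross,Gross (Adjusted),Year
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)    # list of column labels
records = list(reader)   # one list of strings per remaining line

print(header)
print(records[0])
```

Note that `csv.reader` returns every value as a string, so numbers such as the gross would need to be converted before doing arithmetic with them.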

Below, we use the Table method `read_table` to read a CSV file that contains data about top-grossing movies. The data are placed in a table named `topmovies`.

In [16]:
topmovies = Table.read_table(path_data + 'top_movies_2017.csv')
topmovies

Title,Studio,Gross,Gross (Adjusted),Year
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
Titanic,Paramount,658672302,1204368000,1997
The Ten Commandments,Paramount,65500000,1164590000,1956
Jaws,Universal,260000000,1138620700,1975
Doctor Zhivago,MGM,111721910,1103564200,1965
The Exorcist,Warner Brothers,232906145,983226600,1973
Snow White and the Seven Dwarves,Disney,184925486,969010000,1937


We will use this table to demonstrate some useful methods. We will then develop other methods that are useful in data science on the same table.

<h2>The Size of the Table</h2>

The attribute `num_columns` gives the number of columns in the table, and `num_rows` gives the number of rows. Both are written without parentheses.

In [17]:
topmovies.num_columns

5

In [18]:
topmovies.num_rows

200
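
For intuition, the same two counts can be read off a plain-Python column dictionary. This is a sketch only: `columns` is a stand-in for a table and reuses the flower data from earlier in this section, not the movie data.

```python
columns = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# One dictionary key per column.
num_columns = len(columns)

# All columns have the same length, so any one of them gives the row count.
first_column = next(iter(columns.values()))
num_rows = len(first_column)

print(num_columns, num_rows)  # 2 3
```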

<h2>Column Labels</h2>

The attribute `labels` lists the labels of all the columns. With `topmovies` we don't gain much by this, but it can be very useful for tables that are so large that not all columns are visible on the screen.

In [19]:
topmovies.labels

('Title', 'Studio', 'Gross', 'Gross (Adjusted)', 'Year')

We can change column labels using the `relabeled` method. This creates a new table and leaves `topmovies` unchanged.

In [20]:
topmovies.relabeled('Year', 'Year Released')

Title,Studio,Gross,Gross (Adjusted),Year Released
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
Titanic,Paramount,658672302,1204368000,1997
The Ten Commandments,Paramount,65500000,1164590000,1956
Jaws,Universal,260000000,1138620700,1975
Doctor Zhivago,MGM,111721910,1103564200,1965
The Exorcist,Warner Brothers,232906145,983226600,1973
Snow White and the Seven Dwarves,Disney,184925486,969010000,1937


Displaying `topmovies` again confirms that the original table still has the label `Year`.

In [21]:
topmovies

Title,Studio,Gross,Gross (Adjusted),Year
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
Titanic,Paramount,658672302,1204368000,1997
The Ten Commandments,Paramount,65500000,1164590000,1956
Jaws,Universal,260000000,1138620700,1975
Doctor Zhivago,MGM,111721910,1103564200,1965
The Exorcist,Warner Brothers,232906145,983226600,1973
Snow White and the Seven Dwarves,Disney,184925486,969010000,1937


A common pattern is to assign the original name `topmovies` to the new table so that all future uses of `topmovies` will refer to the relabeled table.

In [22]:
topmovies = topmovies.relabeled('Year', 'Year Released')
topmovies

Title,Studio,Gross,Gross (Adjusted),Year Released
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
Titanic,Paramount,658672302,1204368000,1997
The Ten Commandments,Paramount,65500000,1164590000,1956
Jaws,Universal,260000000,1138620700,1975
Doctor Zhivago,MGM,111721910,1103564200,1965
The Exorcist,Warner Brothers,232906145,983226600,1973
Snow White and the Seven Dwarves,Disney,184925486,969010000,1937
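
The difference between creating a new table and reassigning a name can also be sketched in plain Python. The function `relabeled` below is a hypothetical stand-in that works on a dictionary of columns. It is not the `datascience` implementation, only an illustration of the same return-a-new-object pattern.

```python
def relabeled(columns, old_label, new_label):
    """Return a new dict with one label renamed; the input dict is left unchanged."""
    return {
        (new_label if label == old_label else label): values
        for label, values in columns.items()
    }

movies = {
    'Title': ['Gone with the Wind', 'Star Wars'],
    'Year': [1939, 1977],
}

# Calling the function alone leaves `movies` as it was.
renamed = relabeled(movies, 'Year', 'Year Released')
labels_before = list(movies)   # ['Title', 'Year']

# Reassigning the name makes later uses of `movies` refer to the new version.
movies = relabeled(movies, 'Year', 'Year Released')
labels_after = list(movies)    # ['Title', 'Year Released']

print(labels_before, labels_after)
```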
