In [33]:
import numpy as np
np.set_printoptions(threshold=50)
path_data = '../../assets/data/'

# Working with Tabular Data

Tabular data is one of the most common and useful forms of data for analysis. Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
* a sequence of named columns that each describe a single aspect of all entries in a data set, or
* a sequence of rows that each contain all information about a single entry in a data set.

The row and column are the two main components of a table.

To work with tables, we will use `pandas`, the standard Python library for tabular data and a tool of choice in both industry and academia. It is more important that you understand the types of useful operations on data than the exact details of `pandas` syntax. The full documentation for the library can be found in the [`pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/). Because we will cover only the most important `pandas` functions in the CMPUT 195 course, you should bookmark the full documentation for reference when you conduct your own data analyses. Let's begin by importing `pandas`:

In [6]:
# pd is a common shorthand for pandas
import pandas as pd

<h2>Creating DataFrame</h2>
A table created using `pandas` is called a DataFrame. An empty DataFrame can be created by calling the `DataFrame` constructor with no arguments. An empty DataFrame is useful because it can be extended to contain new rows and columns.

In [10]:
df = pd.DataFrame()
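
As a minimal sketch of how an empty DataFrame can be extended, the example below adds two columns by assigning lists to new column labels. The column values are chosen only for illustration.

```python
import pandas as pd

# Start from an empty DataFrame
df = pd.DataFrame()
print(df.empty)   # True

# Assigning a list to a new label adds a column
# (illustrative values; every later column must have the same length)
df['Name'] = ['lotus', 'sunflower']
df['Number of petals'] = [8, 34]

print(df.shape)   # (2, 2)
```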

To create a DataFrame with data, we pass in a Python dictionary whose keys are the column labels and whose values are lists of column entries. This dictionary structure is a common way to represent tabular data before converting it to a more convenient format like a DataFrame.

The call `pd.DataFrame(data)` converts the dictionary `data` into a DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. The resulting DataFrame `flowers` will have two columns (`'Number of petals'` and `'Name'`) and three rows, corresponding to the entries in the lists. You can create a DataFrame with one or multiple columns. Note that all columns must have the same length, or an error will occur.
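
To illustrate the equal-length requirement, the sketch below deliberately passes a dictionary (named `bad_data` here purely for illustration) whose lists have different lengths and catches the resulting error.

```python
import pandas as pd

# Illustrative data: 'Name' has two entries, 'Number of petals' has three
bad_data = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower'],
}

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    error_message = str(err)

print(error_message)
```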

Below, we create a DataFrame with some initial data.

In [27]:
data = {
    'Number of petals': [8, 34, 5],
    'Name':['lotus', 'sunflower', 'rose']
}
flowers = pd.DataFrame(data)
flowers


,Number of petals,Name
0,8,lotus
1,34,sunflower
2,5,rose


In the output above, there is an additional column on the left. This is the `index`, another main component of a DataFrame. It labels each row, enabling quick data retrieval and alignment. In the example above, the default index values are 0, 1, and 2. The index can be customized to fit the needs of the analysis, making it a powerful feature for managing and analyzing data in `pandas`.
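
As a short sketch of customizing the index, the example below rebuilds the flowers data with explicit row labels and, alternatively, promotes an existing column to the index. The labels `'a'`, `'b'`, `'c'` and the names `labeled` and `by_name` are chosen only for illustration.

```python
import pandas as pd

data = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose'],
}

# Default index: 0, 1, 2
flowers = pd.DataFrame(data)
print(list(flowers.index))                        # [0, 1, 2]

# Custom row labels supplied through the index parameter
labeled = pd.DataFrame(data, index=['a', 'b', 'c'])
print(labeled.loc['b', 'Name'])                   # sunflower

# Alternatively, use an existing column as the index
by_name = flowers.set_index('Name')
print(by_name.loc['rose', 'Number of petals'])    # 5
```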

We can extend the DataFrame with another column. To add a column to a DataFrame, use the `assign` method.

In [28]:
flowers.assign(Color = ['pink', 'yellow', 'red'])

,Number of petals,Name,Color
0,8,lotus,pink
1,34,sunflower,yellow
2,5,rose,red


The `assign` method creates a new DataFrame each time it is called, so the original DataFrame is not affected. For example, the DataFrame `flowers` still has only the two columns that it had when it was created.

In [29]:
flowers

,Number of petals,Name
0,8,lotus
1,34,sunflower
2,5,rose


To keep the new column, assign the result of `assign` back to the name `flowers`.

In [30]:
flowers = flowers.assign(Color = ['pink', 'yellow', 'red'])
flowers

,Number of petals,Name,Color
0,8,lotus,pink
1,34,sunflower,yellow
2,5,rose,red


<h2>Reading DataFrame</h2>
Creating DataFrames in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use `pandas` to read it into a DataFrame, instead of typing it all in cell by cell.

Often, DataFrames are created from files that contain comma-separated values. Such files are called CSV files.

Below, we use the `pandas` function `read_csv` to read a CSV file that contains the IMDb top 200 movies up to 2017. The data are placed in a DataFrame named `topmovies`.

In [35]:
topmovies = pd.read_csv(path_data + 'top_movies_2017.csv')
topmovies

,Title,Studio,Gross,Gross (Adjusted),Year
0,Gone with the Wind,MGM,198676459,1796176700,1939
1,Star Wars,Fox,460998007,1583483200,1977
2,The Sound of Music,Fox,158671368,1266072700,1965
3,E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
4,Titanic,Paramount,658672302,1204368000,1997
...,...,...,...,...,...
195,9 to 5,Fox,103290500,341357800,1980
196,Batman v Superman: Dawn of Justice,Warner Brothers,330360194,340137000,2016
197,The Firm,Paramount,158348367,340028200,1993
198,Suicide Squad,Warner Brothers,325100054,339411900,2016
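
To show what `read_csv` expects without needing the file on disk, the sketch below reads CSV text from an in-memory buffer. The three rows are copied from the output above; the assumption that the real file uses exactly this header line is ours.

```python
import io
import pandas as pd

# A few rows in CSV form: a header line, then one line per row
csv_text = """Title,Studio,Gross,Gross (Adjusted),Year
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)              # (3, 5)
print(sample['Year'].tolist())   # [1939, 1977, 1965]
```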


We will use this DataFrame to demonstrate some useful methods. We will then develop other methods useful in data science on the same DataFrame.

<h2>The Shape of the DataFrame</h2>

The `shape` attribute returns a tuple representing the dimensions of the DataFrame: (number of rows, number of columns).

In [37]:
num_rows, num_columns = topmovies.shape
print(f'The Top Movies Dataset has {num_rows} rows and {num_columns} columns.')

The Top Movies Dataset has 200 rows and 5 columns.


<h2>Column Labels</h2>

The attribute `columns` can be used to list the labels of all the columns. With `topmovies` we don't gain much by this, but it can be very useful for DataFrames that are so large that not all columns are visible on the screen.

**Note:** The columns attribute of a DataFrame in `pandas` returns an Index object, which is an immutable array implementing an ordered, sliceable set. You can explicitly convert it to a list.


In [41]:
list(topmovies.columns)

['Title', 'Studio', 'Gross', 'Gross (Adjusted)', 'Year']
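
The sketch below illustrates what "immutable" and "sliceable" mean for the Index object, using a small stand-in DataFrame (`df`, with made-up contents) rather than `topmovies`.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
df = pd.DataFrame({'Title': ['Star Wars'], 'Studio': ['Fox'], 'Year': [1977]})
cols = df.columns

print(type(cols).__name__)   # Index
print(cols[0])               # Title
print(list(cols[1:]))        # ['Studio', 'Year']

# Item assignment on an Index is not allowed
try:
    cols[0] = 'Name'
    mutated = True
except TypeError:
    mutated = False

print(mutated)               # False
```

To change labels, use a method such as `rename` instead of assigning into the Index.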

We can change column labels using the `rename` method. By default, `rename` returns a new DataFrame and leaves `topmovies` unchanged. Here we set `inplace = True`, so `topmovies` itself is modified.

In [47]:
topmovies.rename(columns = {'Year': 'Release Year'}, inplace = True)
topmovies

,Title,Studio,Gross,Gross (Adjusted),Release Year
0,Gone with the Wind,MGM,198676459,1796176700,1939
1,Star Wars,Fox,460998007,1583483200,1977
2,The Sound of Music,Fox,158671368,1266072700,1965
3,E.T.: The Extra-Terrestrial,Universal,435110554,1261085000,1982
4,Titanic,Paramount,658672302,1204368000,1997
...,...,...,...,...,...
195,9 to 5,Fox,103290500,341357800,1980
196,Batman v Superman: Dawn of Justice,Warner Brothers,330360194,340137000,2016
197,The Firm,Paramount,158348367,340028200,1993
198,Suicide Squad,Warner Brothers,325100054,339411900,2016
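
For contrast, here is a minimal sketch of `rename` without `inplace`, on a small stand-in DataFrame (`movies`, a name chosen for illustration). The original keeps its labels and the renamed copy is a separate object.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
movies = pd.DataFrame({'Title': ['Star Wars', 'Titanic'], 'Year': [1977, 1997]})

# Without inplace, rename returns a new DataFrame
renamed = movies.rename(columns={'Year': 'Release Year'})

print(list(renamed.columns))   # ['Title', 'Release Year']
print(list(movies.columns))    # ['Title', 'Year']
```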


<h2>Accessing the Data in a DataFrame</h2>

We often want to access certain subsets of a DataFrame. We can do this using rows, columns, or both. `loc` and `iloc` are two DataFrame indexing attributes used for this purpose.

<h3> loc </h3>

The `loc` attribute is used to access rows and columns by their label. `loc` slicing is right-inclusive, so the slice `0:9` includes the row labeled 9. Let's take the first 10 movies and their studios from the `topmovies` DataFrame.

**Note:** Remember that the index is also called the row labels.

In [56]:
topmovies.loc[0:9, 'Title':'Studio']

,Title,Studio
0,Gone with the Wind,MGM
1,Star Wars,Fox
2,The Sound of Music,Fox
3,E.T.: The Extra-Terrestrial,Universal
4,Titanic,Paramount
5,The Ten Commandments,Paramount
6,Jaws,Universal
7,Doctor Zhivago,MGM
8,The Exorcist,Warner Brothers
9,Snow White and the Seven Dwarves,Disney
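
The sketch below contrasts `loc` with `iloc`, which selects by integer position and, like ordinary Python slicing, excludes the right endpoint. It uses a small stand-in DataFrame (`movies`) with a few rows taken from the output above.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
movies = pd.DataFrame({
    'Title': ['Gone with the Wind', 'Star Wars', 'The Sound of Music', 'Titanic'],
    'Studio': ['MGM', 'Fox', 'Fox', 'Paramount'],
    'Year': [1939, 1977, 1965, 1997],
})

# loc: label-based, both endpoints included (rows 0, 1, 2)
by_label = movies.loc[0:2, 'Title':'Studio']

# iloc: position-based, right endpoint excluded (rows 0, 1)
by_position = movies.iloc[0:2, 0:2]

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```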


To access all the rows for specific columns, we can also pass a list of column labels to the `[]` operator, which is written `[[ ]]`.

In [67]:
topmovies[['Title','Studio']]

,Title,Studio
0,Gone with the Wind,MGM
1,Star Wars,Fox
2,The Sound of Music,Fox
3,E.T.: The Extra-Terrestrial,Universal
4,Titanic,Paramount
...,...,...
195,9 to 5,Fox
196,Batman v Superman: Dawn of Justice,Warner Brothers
197,The Firm,Paramount
198,Suicide Squad,Warner Brothers


If we want to access only one column, we can use the `[]` operator instead of the `[[]]` operator. The expression `topmovies['column_name']` returns a `pandas` Series, which can be converted to a DataFrame by using the `to_frame()` method.

In [69]:
topmovies['Title'].to_frame()

,Title
0,Gone with the Wind
1,Star Wars
2,The Sound of Music
3,E.T.: The Extra-Terrestrial
4,Titanic
...,...
195,9 to 5
196,Batman v Superman: Dawn of Justice
197,The Firm
198,Suicide Squad
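
The difference between the two operators can be checked by inspecting the type of each result. This sketch uses a small stand-in DataFrame (`movies`) rather than `topmovies`.

```python
import pandas as pd

# Small stand-in DataFrame for illustration
movies = pd.DataFrame({
    'Title': ['Star Wars', 'Titanic'],
    'Studio': ['Fox', 'Paramount'],
})

one_col_series = movies['Title']        # single label: Series
one_col_frame = movies[['Title']]       # list of labels: DataFrame
converted = movies['Title'].to_frame()  # Series converted to DataFrame

print(type(one_col_series).__name__)    # Series
print(type(one_col_frame).__name__)     # DataFrame
print(list(converted.columns))          # ['Title']
```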


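Columns can also be accessed by position. The remaining examples in this section use a table named `minard`, whose 5 columns are indexed 0, 1, 2, 3, and 4. Its column `Survivors` can be accessed by using its column index, 4. Note that `minard` is manipulated with methods such as `column`, which a `pandas` DataFrame does not provide.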

In [16]:
minard.column(4)

array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

The 8 items in the array are indexed 0, 1, 2, and so on, up to 7. The items in the column can be accessed using `item`, as with any array.

In [17]:
minard.column(4).item(0)

145000

In [18]:
minard.column(4).item(5)

24000
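
For comparison, a `pandas` DataFrame supports the same kind of positional access through `iloc`. The sketch below assumes a hypothetical DataFrame `minard_df` built by hand from the `minard` values shown in this section; it is not the `minard` table used in the cells.

```python
import pandas as pd

# Hypothetical pandas version of the minard data (values copied from this section)
minard_df = pd.DataFrame({
    'Longitude': [32.0, 33.2, 34.4, 37.6, 34.3, 32.0, 30.4, 26.8],
    'Latitude': [54.8, 54.9, 55.5, 55.8, 55.2, 54.6, 54.4, 54.3],
    'City Name': ['Smolensk', 'Dorogobouge', 'Chjat', 'Moscou',
                  'Wixma', 'Smolensk', 'Orscha', 'Moiodexno'],
    'Direction': ['Advance'] * 4 + ['Retreat'] * 4,
    'Survivors': [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000],
})

# Column at position 4, returned as a Series
survivors = minard_df.iloc[:, 4]
print(survivors.name)       # Survivors

# Individual items by position within that column
print(survivors.iloc[0])    # 145000
print(survivors.iloc[5])    # 24000
```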

<h2>Working with the Data in a Column</h2>

Because columns are arrays, we can use array operations on them to discover new information. For example, we can create a new column that contains, for each city, the number of survivors as a proportion of the number at the first city, Smolensk.

In [19]:
initial = minard.column('Survivors').item(0)
minard = minard.with_columns(
    'Percent Surviving', minard.column('Survivors')/initial
)
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,1.0
33.2,54.9,Dorogobouge,Advance,140000,0.965517
34.4,55.5,Chjat,Advance,127100,0.876552
37.6,55.8,Moscou,Advance,100000,0.689655
34.3,55.2,Wixma,Retreat,55000,0.37931
32.0,54.6,Smolensk,Retreat,24000,0.165517
30.4,54.4,Orscha,Retreat,20000,0.137931
26.8,54.3,Moiodexno,Retreat,12000,0.0827586


To make the proportions in the new column appear as percents, we can use the method `set_format` with the option `PercentFormatter`. The `set_format` method takes `Formatter` objects, which exist for dates (`DateFormatter`), currencies (`CurrencyFormatter`), numbers, and percentages.

In [20]:
minard.set_format('Percent Surviving', PercentFormatter)

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,100.00%
33.2,54.9,Dorogobouge,Advance,140000,96.55%
34.4,55.5,Chjat,Advance,127100,87.66%
37.6,55.8,Moscou,Advance,100000,68.97%
34.3,55.2,Wixma,Retreat,55000,37.93%
32.0,54.6,Smolensk,Retreat,24000,16.55%
30.4,54.4,Orscha,Retreat,20000,13.79%
26.8,54.3,Moiodexno,Retreat,12000,8.28%
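
In `pandas`, the same calculation could be sketched with `assign` and a string format. The DataFrame `minard_df` below is a hypothetical stand-in built from the values shown above. Mapping a format string over the column is one possible approach to percent display and not a counterpart of `set_format`, because it produces strings rather than changing how numbers are displayed.

```python
import pandas as pd

# Hypothetical pandas version of two minard columns (values copied from above)
minard_df = pd.DataFrame({
    'City Name': ['Smolensk', 'Dorogobouge', 'Chjat', 'Moscou',
                  'Wixma', 'Smolensk', 'Orscha', 'Moiodexno'],
    'Survivors': [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000],
})

# Proportion of the initial count remaining at each city
initial = minard_df['Survivors'].iloc[0]
# The label contains a space, so it is passed to assign via dictionary unpacking
minard_df = minard_df.assign(**{'Percent Surviving': minard_df['Survivors'] / initial})

# Render each proportion as a percent string with two decimals
formatted = minard_df['Percent Surviving'].map('{:.2%}'.format)
print(formatted.tolist())
# ['100.00%', '96.55%', '87.66%', '68.97%', '37.93%', '16.55%', '13.79%', '8.28%']
```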


<h2>Choosing Sets of Columns</h2>

The method `select` creates a new table that contains only the specified columns.

In [21]:
minard.select('Longitude', 'Latitude')

Longitude,Latitude
32.0,54.8
33.2,54.9
34.4,55.5
37.6,55.8
34.3,55.2
32.0,54.6
30.4,54.4
26.8,54.3


The same selection can be made using column indices instead of labels.

In [22]:
minard.select(0, 1)

Longitude,Latitude
32.0,54.8
33.2,54.9
34.4,55.5
37.6,55.8
34.3,55.2
32.0,54.6
30.4,54.4
26.8,54.3


The result of using `select` is a new table, even when you select just one column.

In [23]:
minard.select('Survivors')

Survivors
145000
140000
127100
100000
55000
24000
20000
12000


Notice that the result is a table, unlike the result of `column`, which is an array.

In [24]:
minard.column('Survivors')

array([145000, 140000, 127100, 100000,  55000,  24000,  20000,  12000])

Another way to create a new table consisting of a set of columns is to `drop` the columns you don't want.

In [25]:
minard.drop('Longitude', 'Latitude', 'Direction')

City Name,Survivors,Percent Surviving
Smolensk,145000,100.00%
Dorogobouge,140000,96.55%
Chjat,127100,87.66%
Moscou,100000,68.97%
Wixma,55000,37.93%
Smolensk,24000,16.55%
Orscha,20000,13.79%
Moiodexno,12000,8.28%


Neither `select` nor `drop` changes the original table. Instead, they create new smaller tables that share the same data. The fact that the original table is preserved is useful! You can generate multiple different tables that only consider certain columns without worrying that one analysis will affect the other.

In [26]:
minard

Longitude,Latitude,City Name,Direction,Survivors,Percent Surviving
32.0,54.8,Smolensk,Advance,145000,100.00%
33.2,54.9,Dorogobouge,Advance,140000,96.55%
34.4,55.5,Chjat,Advance,127100,87.66%
37.6,55.8,Moscou,Advance,100000,68.97%
34.3,55.2,Wixma,Retreat,55000,37.93%
32.0,54.6,Smolensk,Retreat,24000,16.55%
30.4,54.4,Orscha,Retreat,20000,13.79%
26.8,54.3,Moiodexno,Retreat,12000,8.28%
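
For readers working in `pandas`, column selection and removal could be sketched as follows. The DataFrame `minard_df` is a hypothetical stand-in holding a few of the columns shown above. As with the table methods, each operation returns a new object and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical pandas version of some minard columns (values copied from above)
minard_df = pd.DataFrame({
    'Longitude': [32.0, 33.2, 34.4, 37.6, 34.3, 32.0, 30.4, 26.8],
    'Latitude': [54.8, 54.9, 55.5, 55.8, 55.2, 54.6, 54.4, 54.3],
    'City Name': ['Smolensk', 'Dorogobouge', 'Chjat', 'Moscou',
                  'Wixma', 'Smolensk', 'Orscha', 'Moiodexno'],
    'Survivors': [145000, 140000, 127100, 100000, 55000, 24000, 20000, 12000],
})

# Choose columns by label, or by position with iloc
coords = minard_df[['Longitude', 'Latitude']]
same_coords = minard_df.iloc[:, [0, 1]]

# Remove the columns we don't want
without_coords = minard_df.drop(columns=['Longitude', 'Latitude'])

print(list(coords.columns))           # ['Longitude', 'Latitude']
print(list(without_coords.columns))   # ['City Name', 'Survivors']
print(list(minard_df.columns))        # original still has all four columns
```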


All of the methods that we have used above can be applied to any table.