In [33]:
import numpy as np
np.set_printoptions(threshold=50)
path_data = '../../assets/data/'

# Working with Tabular Data

Tabular data is one of the most common and useful forms of data for analysis. Tables are a fundamental object type for representing data sets. A table can be viewed in two ways:
* a sequence of named columns that each describe a single aspect of all entries in a data set, or
* a sequence of rows that each contain all information about a single entry in a data set.

Rows and columns are the two main components of a table.

To work with tables, we will use `pandas`, the standard Python library for tabular data and the current tool of choice in both industry and academia. Understanding the kinds of operations that are useful on data matters more than the exact details of `pandas` syntax. The full reference is the [`pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/). Because we will cover only the most important `pandas` functions in the CMPUT 195 course, you should bookmark the full documentation for reference when you conduct your own data analyses. Let's begin by importing `pandas`:

In [6]:
# pd is a common shorthand for pandas
import pandas as pd

<h2>Creating DataFrame</h2>
Tables created using `pandas` are called DataFrame objects. An empty DataFrame can be created by calling `pd.DataFrame()` with no arguments. An empty DataFrame is a useful starting point because it can be extended with new rows and columns.

In [10]:
df = pd.DataFrame()
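
As a minimal sketch of how an empty DataFrame can be extended, the cell below checks that a fresh DataFrame has no rows or columns and then adds one column with `assign`. The name `grown` is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame()
print(df.empty)     # True
print(df.shape)     # (0, 0)

# assign returns a new DataFrame; `grown` is a hypothetical name for it
grown = df.assign(Name=['lotus', 'sunflower', 'rose'])
print(grown.shape)  # (3, 1)
print(df.shape)     # (0, 0), the empty DataFrame itself is unchanged
```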

To create a DataFrame with data, we pass a Python dictionary whose keys are the column labels and whose values are lists of column entries. A dictionary like this is a common way to represent tabular data before converting it to a more convenient format such as a DataFrame.

The call `pd.DataFrame(data)` converts the dictionary `data` into a DataFrame, a two-dimensional labeled data structure whose columns can have different types. The resulting DataFrame `flowers` will have two columns (`'Number of petals'` and `'Name'`) and three rows, corresponding to the entries in the lists. You can create a DataFrame with one or more columns. Note that all columns must have the same length, or an error will occur.
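
The equal-length requirement can be checked with a small sketch. The dictionary `bad_data` is a hypothetical example whose second list is deliberately one entry short. The exact wording of the error message may differ between `pandas` versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical data: 'Name' has only two entries, 'Number of petals' has three
bad_data = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower'],
}

try:
    pd.DataFrame(bad_data)
    error_message = None
except ValueError as err:
    # pandas refuses to build a DataFrame from columns of unequal length
    error_message = str(err)

print(error_message)
```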

Below, we create a DataFrame with some initial data.

In [27]:
data = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose']
}
flowers = pd.DataFrame(data)
flowers


|   | Number of petals | Name |
|---|---|---|
| 0 | 8 | lotus |
| 1 | 34 | sunflower |
| 2 | 5 | rose |


In the output above, an extra column appears on the left that we did not supply. This is the index, another main component of a DataFrame. It labels each row (by default with unique integers), enabling quick data retrieval and alignment. In the example above, the default index values are 0, 1, and 2. The index can be customized to fit the needs of the analysis, which makes it a powerful feature for managing and analyzing data in `pandas`.
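
A minimal sketch of customizing the index, assuming the same flower data as above. The labels `'a'`, `'b'`, `'c'` and the names `default_index`, `labeled`, and `by_name` are hypothetical, chosen only for illustration.

```python
import pandas as pd

data = {
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose']
}

# Without an explicit index, rows are labeled 0, 1, 2
default_index = pd.DataFrame(data)
print(list(default_index.index))                # [0, 1, 2]

# Hypothetical custom labels passed through the index argument
labeled = pd.DataFrame(data, index=['a', 'b', 'c'])
print(labeled.loc['b', 'Name'])                 # sunflower

# An existing column can also serve as the index
by_name = default_index.set_index('Name')
print(by_name.loc['rose', 'Number of petals'])  # 5
```

Here `loc` looks rows up by index label, which is what makes a meaningful index convenient for retrieval.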

We can extend the DataFrame with another column by using the `assign` method.

In [28]:
flowers.assign(Color=['pink', 'yellow', 'red'])

|   | Number of petals | Name | Color |
|---|---|---|---|
| 0 | 8 | lotus | pink |
| 1 | 34 | sunflower | yellow |
| 2 | 5 | rose | red |


The `assign` method creates a new DataFrame each time it is called, so the original DataFrame is not affected. For example, the DataFrame `flowers` still has only the two columns that it had when it was created.

In [29]:
flowers

|   | Number of petals | Name |
|---|---|---|
| 0 | 8 | lotus |
| 1 | 34 | sunflower |
| 2 | 5 | rose |


To keep the new column, assign the DataFrame returned by `assign` back to the name `flowers`.

In [30]:
flowers = flowers.assign(Color=['pink', 'yellow', 'red'])
flowers

|   | Number of petals | Name | Color |
|---|---|---|---|
| 0 | 8 | lotus | pink |
| 1 | 34 | sunflower | yellow |
| 2 | 5 | rose | red |
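
As a further sketch, `assign` also accepts several keyword arguments in one call, and a value may be a function that computes the new column from the existing ones. The column `Many_petals`, the threshold of 10, and the name `extended` are hypothetical and serve only as illustration.

```python
import pandas as pd

flowers = pd.DataFrame({
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose']
})

# Two new columns in one call; the lambda receives the DataFrame being built
extended = flowers.assign(
    Color=['pink', 'yellow', 'red'],
    Many_petals=lambda d: d['Number of petals'] > 10,
)

print(list(extended.columns))  # ['Number of petals', 'Name', 'Color', 'Many_petals']
print(list(flowers.columns))   # ['Number of petals', 'Name']
```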


<h2>Reading DataFrame</h2>
Creating DataFrames in this way involves a lot of typing. If the data have already been entered somewhere, it is usually possible to use `pandas` to read them into a DataFrame instead of typing them in cell by cell.

Often, DataFrames are created from files that contain comma-separated values. Such files are called CSV files.

Below, we use the `pandas` function `read_csv` to read a CSV file that contains the IMDb top 200 movies up to 2017. The data are placed in a DataFrame named `topmovies`.

In [35]:
topmovies = pd.read_csv(path_data + 'top_movies_2017.csv')
topmovies

|   | Title | Studio | Gross | Gross (Adjusted) | Year |
|---|---|---|---|---|---|
| 0 | Gone with the Wind | MGM | 198676459 | 1796176700 | 1939 |
| 1 | Star Wars | Fox | 460998007 | 1583483200 | 1977 |
| 2 | The Sound of Music | Fox | 158671368 | 1266072700 | 1965 |
| 3 | E.T.: The Extra-Terrestrial | Universal | 435110554 | 1261085000 | 1982 |
| 4 | Titanic | Paramount | 658672302 | 1204368000 | 1997 |
| ... | ... | ... | ... | ... | ... |
| 195 | 9 to 5 | Fox | 103290500 | 341357800 | 1980 |
| 196 | Batman v Superman: Dawn of Justice | Warner Brothers | 330360194 | 340137000 | 2016 |
| 197 | The Firm | Paramount | 158348367 | 340028200 | 1993 |
| 198 | Suicide Squad | Warner Brothers | 325100054 | 339411900 | 2016 |
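
To experiment with `read_csv` without the data file, the sketch below reads CSV text held in memory. It assumes the file has a header row like the column labels shown above, and it reuses the first three rows of the output. The names `csv_text` and `sample` are hypothetical.

```python
import io

import pandas as pd

# A few rows copied from the output above; the first line is the header
csv_text = """Title,Studio,Gross,Gross (Adjusted),Year
Gone with the Wind,MGM,198676459,1796176700,1939
Star Wars,Fox,460998007,1583483200,1977
The Sound of Music,Fox,158671368,1266072700,1965
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 5)
print(sample.dtypes)
```

Printing `dtypes` shows that `read_csv` infers a type for each column, so the numeric columns can be used in calculations right away.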


We will use this DataFrame to demonstrate some useful methods. We will then develop other methods useful in data science on the same DataFrame.

<h2>The Shape of the DataFrame</h2>

The `shape` attribute returns a tuple representing the dimensions of the DataFrame: (number of rows, number of columns).

In [37]:
num_rows, num_columns = topmovies.shape
print(f'The Top Movies Dataset has {num_rows} rows and {num_columns} columns.')

The Top Movies Dataset has 200 rows and 5 columns.
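
The two dimensions can also be obtained separately. The sketch below uses the small flower DataFrame from earlier so that it runs without the data file.

```python
import pandas as pd

flowers = pd.DataFrame({
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose']
})

print(flowers.shape)         # (3, 2)
print(len(flowers))          # 3, the number of rows
print(len(flowers.columns))  # 2, the number of columns
```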


<h2>Column Labels</h2>

The attribute `columns` can be used to list the labels of all the columns. With `topmovies` we don't gain much by this, but it can be very useful for DataFrames that are so large that not all columns are visible on the screen.

**Note:** The `columns` attribute of a DataFrame in `pandas` returns an `Index` object, which is an immutable array implementing an ordered, sliceable set. You can explicitly convert it to a list with `list`.


In [41]:
list(topmovies.columns)

['Title', 'Studio', 'Gross', 'Gross (Adjusted)', 'Year']
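
A short sketch of what "immutable" and "sliceable" mean for the `Index` object, again using the small flower DataFrame. The names `cols` and `mutated` and the replacement label `'Petals'` are hypothetical.

```python
import pandas as pd

flowers = pd.DataFrame({
    'Number of petals': [8, 34, 5],
    'Name': ['lotus', 'sunflower', 'rose']
})

cols = flowers.columns
print(list(cols[:1]))  # ['Number of petals'], slicing works as for a list
print('Name' in cols)  # True, membership tests work as for a set

try:
    cols[0] = 'Petals'  # item assignment is not allowed on an Index
    mutated = True
except TypeError:
    mutated = False

print(mutated)         # False
```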

We can change column labels using the `rename` method. By default, `rename` returns a new DataFrame and leaves `topmovies` unchanged. Passing `inplace=True`, as in the cell below, modifies `topmovies` itself instead.

In [47]:
topmovies.rename(columns={'Year': 'Release Year'}, inplace=True)
topmovies

|   | Title | Studio | Gross | Gross (Adjusted) | Release Year |
|---|---|---|---|---|---|
| 0 | Gone with the Wind | MGM | 198676459 | 1796176700 | 1939 |
| 1 | Star Wars | Fox | 460998007 | 1583483200 | 1977 |
| 2 | The Sound of Music | Fox | 158671368 | 1266072700 | 1965 |
| 3 | E.T.: The Extra-Terrestrial | Universal | 435110554 | 1261085000 | 1982 |
| 4 | Titanic | Paramount | 658672302 | 1204368000 | 1997 |
| ... | ... | ... | ... | ... | ... |
| 195 | 9 to 5 | Fox | 103290500 | 341357800 | 1980 |
| 196 | Batman v Superman: Dawn of Justice | Warner Brothers | 330360194 | 340137000 | 2016 |
| 197 | The Firm | Paramount | 158348367 | 340028200 | 1993 |
| 198 | Suicide Squad | Warner Brothers | 325100054 | 339411900 | 2016 |
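
For comparison, here is a minimal sketch of `rename` without `inplace`, on a two-row DataFrame built from rows shown above. The names `movies` and `renamed` are hypothetical.

```python
import pandas as pd

movies = pd.DataFrame({
    'Title': ['Star Wars', 'Titanic'],
    'Year': [1977, 1997]
})

# Without inplace=True, rename returns a new DataFrame
renamed = movies.rename(columns={'Year': 'Release Year'})

print(list(renamed.columns))  # ['Title', 'Release Year']
print(list(movies.columns))   # ['Title', 'Year'], the original is unchanged
```

Keeping the result under a new name, or assigning it back to the old one, mirrors the pattern used with `assign` and leaves the original available if it is still needed.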
