# Arrays and dataframes: An introduction to tabular data

by Koenraad De Smedt at UiB

---
Many datasets have the shape of a table, i.e. an n-dimensional data structure of fixed size. Two-dimensional data have the following axes:

0.  *rows*, also called *records* (often given numerical indexes)
1.  *columns*, also called *fields*

| Row index | First column | Second column |
|-----------|--------------|---------------|
| 0 →       | Row 0, Col 0 | Row 0, Col 1  |
| 1 →       | Row 1, Col 0 | Row 1, Col 1  |

It is of course possible to have data structures with [only one dimension or more than two dimensions](https://www.awesomegrasp.com/wp-content/uploads/2019/10/python-array-and-axis-e1570454038223-768x405.png), but we will not give examples involving more than two dimensions.

---

This notebook provides an introduction to working with tabular data using these Python libraries:

1.   [*numpy*](https://numpy.org/doc/stable/) provides operations on n-dimensional *arrays* of items of the same type and size.
2.   [*pandas*](https://pandas.pydata.org/pandas-docs/stable/index.html) provides operations on *dataframes*, which are two-dimensional labeled data structures, and *series*, which have only one data column.

In [None]:
import numpy as np
import pandas as pd
pd.__version__

## Arrays with *numpy*

Here is a *numpy* example for a two-dimensional array (sometimes called a *matrix*) with 4 rows and 3 columns.
These numbers are copied by hand from a table in [SLP3, Ch. 6](https://web.stanford.edu/~jurafsky/slp3/6.pdf) (in turn based on Osgood, C. E., G. J. Suci, and P. H. Tannenbaum. 1957. *The Measurement of Meaning.* University of Illinois Press). In most cases, however, we would obtain values from other sources (such as a corpus search), rather than typing them in.

In [None]:
a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5], [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])
a

Display information about this array.

In [None]:
print(a.shape)
print(a.size)
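As a small sketch of what these attributes report, the cell below rebuilds the same array and unpacks its shape. The names `rows` and `cols` are chosen here for illustration only; `ndim` and `dtype` are two further standard attributes of a numpy array.

```python
import numpy as np

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])

# shape is a tuple: (number of rows, number of columns)
rows, cols = a.shape
print(rows, cols)

# size is the total number of items, i.e. rows * cols
print(a.size == rows * cols)

# ndim is the number of dimensions, dtype the common type of all items
print(a.ndim)
print(a.dtype)
```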

Parts of an array can be addressed by means of row and column indexes. In a two-dimensional array, the first index is for rows, the second one for columns. As with Python lists and strings, array indexes are zero-based.

Here are some examples addressing whole rows.

In [None]:
print(a[0])
print(a[1:3])

Whole columns are addressed by the second index. The first index cannot be left empty, so we use `:` in its place to select all rows.

In [None]:
print(a[:,1])
print(a[:,1:3])

[5.5  5.57 5.65 3.95]
[[5.5  7.38]
 [5.57 6.5 ]
 [5.65 3.58]
 [3.95 4.24]]


Here we address parts of an array by specifying both rows and columns. One row and one column gives us the value in a single cell. Multiple rows and one column gives us part of that column.

In [None]:
print(a[3,1])
print(a[1:3,2])

The *numpy* module supports many mathematical operations on arrays. One can, for instance, add two rows element by element.

For more information, start at https://numpy.org/doc/stable/user/absolute_beginners.html.

In [None]:
a[0] + a[1]
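Beyond adding individual rows, sums and means can be taken along a whole axis. The following is a minimal sketch using the same array; the variable names are illustrative. With `axis=0` the operation runs down the rows and gives one result per column, and with `axis=1` it runs across the columns and gives one result per row.

```python
import numpy as np

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])

# Element-wise sum of the first two rows (one value per column)
two_rows = a[0] + a[1]
print(two_rows)

# Mean of each column: collapse axis 0 (the rows)
col_means = a.mean(axis=0)
print(col_means)

# Sum of each row: collapse axis 1 (the columns)
row_sums = a.sum(axis=1)
print(row_sums)
```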

## Dataframes with *pandas*

The *pandas* module provides many ways of working with tabular data. A pandas *dataframe* is a two-dimensional *labeled* data structure. You can think of it as a two-dimensional array with labels on the rows and columns, much like a spreadsheet or a dataframe in [R](https://www.r-project.org/). Unlike an array, a dataframe may contain data of different types.

A dataframe can be created in many ways. The following creates a dataframe from the numpy array above. Column names and row names (*index*) can be set at creation time or later.

In [None]:
df = pd.DataFrame(a, columns=['Valence','Arousal','Dominance'])
df

Set the index (row names).

In [None]:
df.index = ['courageous', 'music', 'heartbreak', 'cub']
df
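Setting the labels at creation time instead could look like the sketch below, which passes both `index` and `columns` to the `pd.DataFrame` constructor. The name `df2` is used here only to keep it apart from `df`.

```python
import numpy as np
import pandas as pd

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])

# Row labels (index) and column labels given in one step
df2 = pd.DataFrame(
    a,
    index=['courageous', 'music', 'heartbreak', 'cub'],
    columns=['Valence', 'Arousal', 'Dominance'],
)
print(df2)
```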

A single column can be selected by its label. The result is a one-dimensional data structure known as a *Series*.

In [None]:
df['Valence']

In [None]:
type(df['Valence'])

Multiple columns can be selected by a list of labels. The result is again a DataFrame.

In [None]:
df[['Valence', 'Dominance']]

A single row can be selected by its index label by means of the `.loc` indexer. The result is a Series.

In [None]:
df.loc['cub']

Multiple rows can be selected by a list of index labels. The result is a DataFrame.

In [None]:
df.loc[['music', 'cub']]

A single value can be addressed by its column and row.

In [None]:
df['Arousal'].loc['music']
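The same cell can also be reached in one step by giving `.loc` the row label and the column label together, row first. This is a sketch of an alternative spelling, not a replacement for the form above; `.at` is a related pandas accessor meant for single values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])
df = pd.DataFrame(a, columns=['Valence', 'Arousal', 'Dominance'],
                  index=['courageous', 'music', 'heartbreak', 'cub'])

# Column first, then row (two separate selection steps)
chained = df['Arousal'].loc['music']

# Row label and column label in a single .loc call
direct = df.loc['music', 'Arousal']

# .at addresses exactly one cell by row and column label
single = df.at['music', 'Arousal']

print(chained, direct, single)
```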

Multiple columns and rows can be selected by specifying lists of columns and lists of rows.

In [None]:
df[['Arousal', 'Dominance']].loc[['music', 'heartbreak']]
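A list of rows and a list of columns can likewise be passed to `.loc` in one call, rows first and columns second. The sketch below checks that this gives the same dataframe as the two-step selection; the names `two_step` and `one_step` are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])
df = pd.DataFrame(a, columns=['Valence', 'Arousal', 'Dominance'],
                  index=['courageous', 'music', 'heartbreak', 'cub'])

# Select columns, then rows
two_step = df[['Arousal', 'Dominance']].loc[['music', 'heartbreak']]

# Select rows and columns in a single .loc call
one_step = df.loc[['music', 'heartbreak'], ['Arousal', 'Dominance']]

print(one_step)
print(two_step.equals(one_step))
```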

Rows can also be selected by their *numerical* position in the dataframe with `.iloc`. These positions must be integers.

In [None]:
df.iloc[1:3]
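Like `.loc`, `.iloc` accepts a second entry for columns. The sketch below also illustrates one difference worth keeping in mind: a positional slice such as `1:3` excludes its end position, whereas a label slice with `.loc` includes its end label. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[8.05, 5.5, 7.38], [7.67, 5.57, 6.5],
              [2.45, 5.65, 3.58], [6.71, 3.95, 4.24]])
df = pd.DataFrame(a, columns=['Valence', 'Arousal', 'Dominance'],
                  index=['courageous', 'music', 'heartbreak', 'cub'])

# Rows at positions 1 and 2, columns at positions 0 and 2
part = df.iloc[1:3, [0, 2]]
print(part)

# Positional slice 1:3 stops before position 3 ...
by_position = df.iloc[1:3]
# ... while the label slice includes the end label 'heartbreak'
by_label = df.loc['music':'heartbreak']
print(by_position.equals(by_label))
```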

## Styling dataframes with colors

Two-dimensional arrays and dataframes containing only numbers can be visualized with colors. Such visualizations are often used for [looking at the results of a classifier](https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea).
In the example below, we assume the following results of a model that classifies newspaper articles:

*  10 Sports articles are classified correctly and 5 are classified as Politics; 
*  2 Politics articles are classified as Sports and 13 are classified correctly.

In [None]:
class_labels = ['Sports', 'Politics']
cm = pd.DataFrame([[10, 5], [2, 13]], index=class_labels, columns=class_labels)
cm

Dataframes can be styled with a background gradient for each cell dependent on the cell’s value. This view, often called a *heatmap*, may help in getting a quick overview of contrasts in the values.

In [None]:
cm.style.background_gradient(axis=None, cmap='Greens')

Alternatively, one can make a downloadable heatmap picture with a legend and other options. The `annot` parameter controls whether the values are displayed in the cells. 

The bottom row has higher contrast, which highlights the fact that the Politics articles are classified correctly more often (13 of 15) than the Sports articles (10 of 15).

In [None]:
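import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.3)  # enlarge the fonts for readability
ax = sns.heatmap(cm, cmap="Greens", annot=True)  # annot=True shows each value in its cell
ax.set(ylabel='True class', xlabel='Assigned class')
plt.show()  # display the figure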
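The difference between the two classes can also be checked numerically. The following is a minimal sketch that assumes, as in the example above, that rows hold the true classes and columns the assigned classes. The proportion of correctly classified articles per true class is often called *recall*; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

class_labels = ['Sports', 'Politics']
cm = pd.DataFrame([[10, 5], [2, 13]], index=class_labels, columns=class_labels)

# Correct classifications sit on the diagonal of the table
correct = pd.Series(np.diag(cm.to_numpy()), index=cm.index)

# Number of articles per true class (sum over each row)
row_totals = cm.sum(axis=1)

# Proportion classified correctly, per true class
per_class = correct / row_totals
print(per_class)

# Overall proportion of correct classifications
accuracy = correct.sum() / cm.to_numpy().sum()
print(accuracy)
```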

For larger examples of heatmaps, see for instance [Measuring Harmful Representations in Scandinavian Language Models](https://arxiv.org/pdf/2211.11678.pdf).


Dataframes can be made by reading tabular data from several different file formats and sources, and can be processed and exported to other formats. This will be demonstrated in other notebooks.

## Exercises

1.  Make a new dataframe containing the *Valence* and *Dominance* columns and the *music* and *cub* rows.
2.  Show a heatmap for the new dataframe.