## What is a Polars `DataFrame`?
In this lecture we take a high-level look at a Polars `DataFrame` and learn:
- how to access important metadata
- how Polars stores data with Apache Arrow
- what happens when we modify a `DataFrame`

In [None]:
import polars as pl
import numpy as np

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)
df.head(3)

A Polars `DataFrame`:
- is a tabular dataset stored in an Arrow Table (see below)
- has a height and a width
- has unique string column names
- has a data type for each column
- has methods for transforming the data stored in the Arrow Table

We can get the height (number of rows) and width (number of columns) as attributes

In [None]:
df.width

In [None]:
df.height

## Data type schema

Every column in a `DataFrame` has a data type called a `dtype`.

We can get an `OrderedDict` that maps column names to dtypes with the `.schema` attribute

In [None]:
df.schema

There is also a `dtypes` attribute (as in Pandas). However, this gives a `list` of dtypes with no column names

In [None]:
df.dtypes

A `Series` also has a data type attribute

In [None]:
df['Name'].dtype

### Supertypes
We can gather the dtypes into broad groups:
- integers e.g. pl.Int8, pl.Int16 etc
- floats e.g. pl.Float32, pl.Float64
- string e.g. pl.Utf8
- boolean e.g. pl.Boolean
- datetime e.g. pl.Datetime, pl.Date etc

Polars also has a concept of supertypes. A supertype comes into play when an operation involves columns that have different dtypes. If the dtypes of these columns have a supertype, all of the columns are cast to that supertype before the operation is carried out.

Supertypes are defined on a given pair of dtypes rather than being universal. Here are some simple examples:
- pl.Int8 & pl.Int16 -> pl.Int16
- pl.Float32 & pl.Float64 -> pl.Float64

There are also rules in place for other combinations e.g.:
- pl.Int64 & pl.Boolean -> pl.Int64 (a boolean can be cast to 0 or 1 without losing information)
- pl.Int32 & pl.Float32 -> pl.Float64 (following a convention set by Numpy)
- any dtype & pl.Utf8 -> pl.Utf8 (any column can be cast to string)

## Apache Arrow

A classic Pandas `DataFrame` stores its data in Numpy arrays. In Polars the data is stored in an Arrow Table. 

We can see this Arrow Table by calling `to_arrow` - this is a cheap operation as it is just viewing the underlying data

In [None]:
df.to_arrow()

An Arrow Table is a collection of Arrow Arrays - these are one-dimensional vectors that are the fundamental data store. We can see the Arrow Array for a column by calling `to_arrow` on a `Series`

In [None]:
df["Age"].to_arrow()

### What is Apache Arrow?
Apache Arrow is an open source cross-language project to store tabular data in-memory. Apache Arrow is both:
- a specification for how data should be represented in memory
- a set of libraries in different languages that implement that specification

Polars uses the implementation of the Arrow specification from the Rust library [Arrow2](https://docs.rs/arrow2/latest/arrow2/)

### Why does `Polars` use `Apache Arrow`?
The Apache Arrow project developed when it became clear that Numpy arrays - designed for scientific computing - are not the optimal data store for tabular data.

Arrow allows for:
- sharing data without copying (known as "zero-copy")
- faster vectorised calculations
- working with larger-than-memory data in chunks
- consistent representation of missing data

Overall, Polars can process data more quickly and with less memory usage because of Arrow.

### What are the downsides of `Apache Arrow`?
The design of Arrow is optimised for operations on one-dimensional columns, whereas the design of Numpy is optimised for operations on multi-dimensional arrays. This tradeoff means some kinds of operations will be slower with Arrow data compared to Numpy:
- transposing a `DataFrame`
- doing matrix multiplication/linear algebra on a `DataFrame`

For this kind of use case - where calculations require accessing data by row and column - it may be faster to convert to a Numpy array (see the lecture on conversion in this Section).

### So what is the relationship between a Polars `DataFrame` and Arrow data?
A Polars `DataFrame` holds a reference to an Arrow Table, which in turn holds references to Arrow Arrays. We can think of a Polars `DataFrame` as a lightweight object that points to a lightweight Arrow Table, which points to the heavyweight Arrow Arrays (heavyweight because they hold the actual data).

This detached structure means we can make changes to the cheap `DataFrame` wrapper and copy none (or a minimal amount) of the data in the Arrow Arrays. We see examples of this below.

In [None]:
df_shape = (1_000_000,100)
df_polars = pl.DataFrame(
    np.random.standard_normal(df_shape)
)
df_polars.shape

### Dropping a column
We see how long it takes to drop a column from a Polars `DataFrame`. We use the IPython `%%timeit` cell magic to measure performance (we learn more about `timeit` later in the course).

In [None]:
%%timeit -n1 -r3
df_polars.drop("column_0")

Polars does this very fast (and much faster than classic Pandas). This is because Polars just creates a new `DataFrame` object (a cheap operation) that points to all the Arrow Arrays except `column_0`. Polars basically just loops through the list of column names for this operation!

### Renaming a column
We have a similar effect whenever we change some part of a `DataFrame` that does not affect the actual data in the columns. For example, if we rename a column...

In [None]:
%%timeit -n1 -r3
df_polars.rename({"column_0":"a"})

Polars again does this very fast because it just updates the column name and checks the column names are still unique.

### Cloning a `DataFrame`
Or if we create a new `DataFrame` by cloning

In [None]:
%%timeit -n1 -r3
df_polars.clone()

In this case Polars has created a new `DataFrame` object that points at the same Arrow Table.

### Updating a cloned `DataFrame`

Although the new and old `DataFrames` initially point at the same Arrow Table we do not need to worry about changes to one affecting the other.

If we make changes to a value in one of the `DataFrames` - say the new `DataFrame` - then the new `DataFrame` will:
- copy the data in **the column that has changed** to a new Arrow Array
- create a new Arrow Table that points to the updated Arrow Array along with the unchanged Arrow Arrays

So now we have:
- two `DataFrames` that point to:
- two Arrow Tables that point to:
- the same Arrow Arrays for the unchanged columns and different Arrow Arrays for the changed column

In this way we create a new `DataFrame` but **only ever have to copy data in columns that change**. We see how changes to the new `DataFrame` do not affect the old `DataFrame` in this example where we change the first value in the first row

In [None]:
df_polars2 = df_polars.clone()
df_polars2[0,0] = 1000
df_polars2[0,0]

In the original `DataFrame` we still have the original value

In [None]:
df_polars[0,0]

## Exercises
In the exercises you will develop your understanding of:
- getting the dtypes of a `DataFrame`
- getting the dtypes of a `Series`

### Exercise 1 

What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
df<blank>

### Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
# df<blank>

What is the dtype of `a`?
What is the dtype of `b`?

## Solutions

### Solution to Exercise 1
What are the dtypes of this `DataFrame`?

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
df.schema

### Solution to Exercise 2
Create a `Series` by selecting the `a` column of `df`

In [None]:
df = pl.DataFrame({'a':[0,1,2],'b':[0,1,2.0]})
s = df["a"]

In [None]:
s

`s` has 64-bit integer dtype 

In [None]:
s2 = df["b"]
s2

`s2` has 64-bit floating point dtype 