# Reading from and writing to a database
By the end of this lecture you will be able to:
- write to a SQL database
- read from a SQL database
- apply row and column filters
- pass data to and from DuckDB

In this example we will use a SQLite database saved in the data directory.

## ConnectorX
By default Polars uses the ConnectorX library to handle **reading** from databases. ConnectorX is fast because it:
- is written in Rust
- stores data in Apache Arrow, so Polars can access the data without copying

In [None]:
from pathlib import Path

import polars as pl

## Creating a SQLite database

For this lecture we first create a local database with SQLite. A SQLite database is simply a file on disk. 
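
To see what "a file on disk" means in practice, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than Polars. The temporary directory, the `demo.sqlite` file name and the `demo` table are hypothetical names chosen only for this illustration.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical location: a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.sqlite"
    existed_before = demo_path.exists()

    # Connecting and writing creates the database file
    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE demo (passenger_count INTEGER)")
    conn.execute("INSERT INTO demo VALUES (1), (2), (3)")
    conn.commit()
    conn.close()

    # The whole database now lives in this single file
    existed_after = demo_path.exists()
    size_in_bytes = demo_path.stat().st_size

print(existed_before, existed_after, size_in_bytes > 0)
```

Copying, moving or deleting that one file copies, moves or deletes the entire database.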

We create a `DataFrame` with 1000 rows of NYC taxi data

In [None]:
csv_file = "../data/nyc_trip_data_1k.csv"
df = pl.read_csv(csv_file)

Before we write to the database we need to create a directory to hold it.

First we set the path to the directory where we create the SQLite database file.

In [None]:
sqliteDBDirectory = Path("data_files/sqlite/nyc_data")
# Create the directory (and any missing parents); exist_ok=True makes this a no-op if it already exists
sqliteDBDirectory.mkdir(parents=True, exist_ok=True)

We set the path to the SQLite database file that we will create

In [None]:
sqliteDBPath = sqliteDBDirectory / "nyc_trip_data.sqlite"

### Engines for writing to a database
To work with a database we need to specify an engine to communicate between Polars and the database. The options are:
- SQLAlchemy
- Arrow Database Connectivity (ADBC)

#### SQLAlchemy
If we choose SQLAlchemy then Polars simply creates a pandas `DataFrame` backed by PyArrow instead of NumPy (a zero-copy operation).

Polars then calls the standard pandas `to_sql` method on that `DataFrame`.

You can do this yourself if you want full control over the operation:
```python
# `engine` is a SQLAlchemy engine; `table_name` and `if_exists` are supplied by the caller
df.to_pandas(use_pyarrow_extension_array=True).to_sql(
    name=table_name, con=engine, if_exists=if_exists
)
```
SQLAlchemy is a tried and tested approach that works with many different databases.

#### Arrow Database Connectivity (ADBC)
ADBC is a promising new approach built around Apache Arrow. It *should* have advantages over SQLAlchemy in terms of performance and memory usage. However, it is still early days for ADBC and its feature set is limited compared to SQLAlchemy. If ADBC doesn't work for your situation now then stick with SQLAlchemy and check back in a few months.

### Creating a database
In this example we create a SQLite database with ADBC.

To work with SQLite with ADBC we need to install an additional driver

In [None]:
%pip install adbc_driver_sqlite

The connection URI for a SQLite database must begin with `sqlite:///` followed by the path to the database file. We call `as_posix` on the `Path` object to extract the path as a string before writing the data to the database in a table called `records`

In [None]:
uri = "sqlite:///" + sqliteDBPath.as_posix()
uri
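
As a small illustration of why `as_posix` is used here, the sketch below builds the same URI from a POSIX-style path and from a Windows-style path. The `sqlite_uri` helper is a hypothetical name introduced for this example, and the pure path classes are used only so the result is the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath


def sqlite_uri(db_path) -> str:
    """Build a SQLite connection URI from a path object (hypothetical helper)."""
    # as_posix() always returns forward slashes, even for Windows paths
    return "sqlite:///" + db_path.as_posix()


posix_uri = sqlite_uri(
    PurePosixPath("data_files/sqlite/nyc_data/nyc_trip_data.sqlite")
)
windows_uri = sqlite_uri(
    PureWindowsPath(r"data_files\sqlite\nyc_data\nyc_trip_data.sqlite")
)

print(posix_uri)
print(windows_uri)
```

Both calls produce a URI with forward slashes, which is the form the connection string expects.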

In [None]:
uri = "sqlite:///" + sqliteDBPath.as_posix()
if not sqliteDBPath.exists():
    # If the database doesn't exist then create it
    (
        df.sort("passenger_count").write_database(
            table_name="records",
            connection=uri,
            if_exists="replace",
            engine="adbc",
        )
    )

## Reading from a database

We query the database with the connection string above and a SQL query.

In this example we select 3 rows from the records table

In [None]:
df = pl.read_database_uri("select * from records limit 3", uri=uri, engine="adbc")
df

Reading from a database is typically slower than reading the same data from a file, even if the file is in a relatively slow format such as CSV.

In [None]:
%timeit -n1 -r1 pl.read_csv(csv_file)
%timeit -n1 -r1 pl.read_database_uri("select * from records", uri=uri, engine="adbc")

## Reading from a client-server database
To read from a client-server database like Postgres, the connection string requires the standard connection and login details, such as
```python
uri = "postgresql://username:password@server:port/database"
pl.read_database_uri(query="select * from records", uri=uri)
```
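
One practical detail when assembling such a URI: characters like `@`, `:` or `/` in a username or password generally need to be percent-encoded so they are not mistaken for URI separators. The sketch below shows one way to do this with the standard library. All the credentials and server details are hypothetical values for illustration only.

```python
from urllib.parse import quote_plus

# Hypothetical credentials and server details, for illustration only
username = "analyst"
password = "p@ss:word/1"
server = "localhost"
port = 5432
database = "nyc"

# Percent-encode the parts that may contain reserved characters
postgres_uri = (
    f"postgresql://{quote_plus(username)}:{quote_plus(password)}"
    f"@{server}:{port}/{database}"
)
print(postgres_uri)
```

In real code it is usually better to read the password from an environment variable or a secrets manager than to write it in the notebook.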

## Filtering rows and selecting columns
At present `pl.read_database_uri` works only in eager mode. If you read from a database and then `select` a column or `filter` rows in Polars, the full result of the SQL query is read into memory before the `select` or `filter` is applied.

In [None]:
(
    pl.read_database_uri("select * from records", uri=uri, engine="adbc")
    .filter(pl.col("passenger_count") > 3)
    .head(3)
)

To apply the filters in the database you need to specify the filters in the SQL string using `where`

In [None]:
(
    pl.read_database_uri(
        "select * from records where passenger_count > 3", 
        uri=uri,
        engine="adbc"
    )
    .head(3)
)

To select columns, list them in the SQL string instead of using `*`.

In [None]:
(
    pl.read_database_uri(
        "select pickup,dropoff from records", 
        uri=uri,
        engine="adbc"
    )
    .head(3)
)
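
Row filters and column selection can be combined in a single SQL string, so that only the rows and columns you need leave the database. The sketch below illustrates this with Python's built-in `sqlite3` module and an in-memory table. The three rows are made up for the example and are not taken from the taxi data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (pickup TEXT, dropoff TEXT, passenger_count INTEGER)"
)
# Made-up rows, for illustration only
rows = [
    ("2022-01-01 00:05", "2022-01-01 00:20", 1),
    ("2022-01-01 00:10", "2022-01-01 00:45", 4),
    ("2022-01-01 00:15", "2022-01-01 00:30", 5),
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)

# The database applies both the column selection and the row filter
cursor = conn.execute(
    "SELECT pickup, dropoff FROM records WHERE passenger_count > 3"
)
column_names = [col[0] for col in cursor.description]
result = cursor.fetchall()
conn.close()

print(column_names, len(result))
```

The same query string can be passed to `pl.read_database_uri` so that the filtering happens before the data reaches Polars.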

The `pl.read_database_uri` function also has arguments to partition a query. However, I have not found that these help queries to run faster, so we won't cover them here. If you want your queries to run faster, the best thing is to define an appropriate index for your query in the database.
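
As a rough sketch of what defining an index looks like in SQLite, the example below uses the built-in `sqlite3` module with an in-memory table of made-up rows. The index name `idx_passenger_count` and the `query_plan` helper are hypothetical names chosen for this illustration. `EXPLAIN QUERY PLAN` is SQLite's way of reporting how it intends to run a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (passenger_count INTEGER, trip_distance REAL)")
# Made-up rows, for illustration only
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [(i % 6, float(i)) for i in range(1000)],
)

query = "SELECT * FROM records WHERE passenger_count > 3"


def query_plan(connection, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[3] for row in plan_rows)


# Without an index SQLite has to scan the whole table
plan_before = query_plan(conn, query)

# Hypothetical index on the column used in the WHERE clause
conn.execute("CREATE INDEX idx_passenger_count ON records (passenger_count)")
plan_after = query_plan(conn, query)
conn.close()

print(plan_before)
print(plan_after)
```

The second plan should mention the index, which indicates that SQLite can jump to the matching rows instead of reading every row. Whether an index pays off on your own data depends on the table size and on how selective the filter is.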

## DuckDB
DuckDB is like SQLite but optimised for analytics.  

Although DuckDB, unlike Polars, is not built on Arrow, it can work with Arrow data.

We can pass the Arrow Table from Polars to DuckDB for a query.

First we install DuckDB.

In [None]:
%pip install duckdb

We import DuckDB and read the data into a Polars `DataFrame`.

In [None]:
import duckdb

dfPolars = pl.read_csv(csv_file)

We first pass the Arrow data from Polars to DuckDB

In [None]:
nyc = duckdb.arrow(dfPolars.to_arrow())

We can then query the database and return the results as an Arrow Table

In [None]:
nyc.query(
    "nyc", "SELECT passenger_count,avg(trip_distance) FROM nyc group by passenger_count"
).to_arrow_table()

However, it is more useful to return the results as a Polars `DataFrame`

In [None]:
pl.from_arrow(
    nyc.query(
        "nyc",
        "SELECT passenger_count,avg(trip_distance) FROM nyc group by passenger_count",
    ).to_arrow_table()
)

## Exercises

In the exercises you will develop your understanding of:
- querying a database with `pl.read_database_uri`
- querying DuckDB via an Arrow Table

### Exercise 1
Get the maximum and average of the passenger count when the trip distance is greater than 5 km. Use the ADBC engine

### Exercise 2
Read the Titanic dataset into a `DataFrame`

In [None]:
titanic_csv_file = "../data/titanic.csv"

In [None]:
dfTitanic = <blank>

Read the data into DuckDB with `duckdb.arrow`

Get the average age in each passenger class and return the result as a Polars `DataFrame`

## Solutions

### Solution to exercise 1
Get the maximum and average of the passenger count when the trip distance is greater than 5 km

In [None]:
(
    pl.read_database_uri(
        "select max(passenger_count),avg(passenger_count) from records where trip_distance > 5",
        uri=uri,
        engine="adbc"
    )
)

### Solution to exercise 2
Read the Titanic dataset into a `DataFrame`

In [None]:
titanic_csv_file = "../data/titanic.csv"

In [None]:
dfTitanic = pl.read_csv(titanic_csv_file)

Read the data into DuckDB with `duckdb.arrow`

In [None]:
titanic = duckdb.arrow(dfTitanic.to_arrow())

Get the average age in each passenger class and return the result as a Polars `DataFrame`

In [None]:
(
    pl.from_arrow(
        titanic.query(
            "titanic",
            "select Pclass,avg(Age) from titanic group by Pclass",
        ).to_arrow_table()
    )
)