# Working with multiple files
By the end of this lecture you will be able to:
- read multiple files with a glob pattern
- read multiple files from a list
- read multiple files in lazy mode
- automate file discovery in sub-directories

We import Python's built-in `pathlib` module to work with multiple file paths and create sub-directories.

In the example below we use CSV files but the results also apply to Parquet files.

In [None]:
from pathlib import Path

import polars as pl

pl.Config.set_tbl_rows(6)

We need a dataset with multiple files that share the same schema for this notebook.

We create multiple CSV files from the Titanic dataset in a new directory.

We begin by reading in the full CSV

In [None]:
csv_file = "../data/titanic.csv"

In [None]:
df = pl.read_csv(csv_file)

We create a new sub-directory in this directory.

We use the `mkdir` method of a `Path` object to create this new sub-directory

In [None]:
# Path to the new directory
csv_directory = Path("data_files/csv/multiple_csv")
# Create the new directory if it doesn't already exist
csv_directory.mkdir(parents=True,exist_ok=True)

We split the `DataFrame` and write the two new files to the sub-directory

In [None]:
df[:700].write_csv(csv_directory / "train.csv")
df[700:].write_csv(csv_directory / "test.csv")

## Eager mode
### Reading multiple files with wildcard patterns

We can read multiple CSV files with the same schema using a wildcard `*` pattern

In [None]:
(
    pl.read_csv(csv_directory / "*.csv")
    .head(2)
)

The files are read in alphabetical order where `test` comes before `train`

#### What happens when we use the wildcard pattern `*`?
When we use the wildcard pattern `*` as above, Polars internally:
- makes a list of the files that match the pattern
- calls `scan_csv` on each file to make a list of `LazyFrames`
- does a vertical concatenation of the `LazyFrames`
- calls `collect` to return a `DataFrame`

Essentially using `read_csv` with `*` is an automated version of the lazy mode approach we see below.

### What happens if there is a potential optimisation?
In the query below we do `read_csv` followed by a `filter`

In [None]:
(
    pl.read_csv("data_files/csv/multiple_csv/*.csv")
    .filter(pl.col("Pclass") == 1)
    .head(2)
)

Although Polars uses `pl.scan_csv` internally, the overall query is eager, so the query optimiser is not used. If we follow `read_csv` with, for example, a `filter`, then each CSV is read in full into memory, the results are concatenated into a single `DataFrame` and only then is the `filter` applied.

### Reading from a list of file paths

If we have a list of file paths we can also read them manually with `pl.concat`

In [None]:
file_path_list = [csv_directory / "train.csv",csv_directory / "test.csv"]
(
    pl.concat(
        [pl.read_csv(csv_path) for csv_path in file_path_list]
    )
    .head(3)
)

## Scanning CSVs in lazy mode

### Scanning multiple files with a wildcard
We can scan multiple CSV files in a directory in lazy mode using a wildcard

In [None]:
print(
    pl.scan_csv(csv_directory / "*.csv")
    .filter(pl.col("Age") > 50)
    .explain()
)

The plan shows us that Polars:
- creates a plan for each file e.g. `PLAN 0`
- applies the `filter` on each file
- concatenates the output from each plan to a single `DataFrame` in `UNION`

Unlike the eager query above, the query optimiser is working here.

We evaluate this plan on all the CSVs with `collect`

In [None]:
(
    pl.scan_csv(csv_directory / "*.csv")
    .filter(pl.col("Age") > 50)
    .collect()
)

## Handling variations in column names
We cannot concatenate CSVs that have different column names with `pl.scan_csv`

In this example we write two `DataFrames` with slightly different column names to a new directory

In [None]:
df1 = pl.DataFrame(
    {
        'int_column':[0,1,2]
    }
)
df2 = pl.DataFrame(
    {
        'Int_Column':[3,4]
    }
)
# Create a sub-directory to hold the CSV for each DataFrame
mismatched_column_names_path = Path('data_files/csv/mismatched_column_names/')
mismatched_column_names_path.mkdir(parents=True, exist_ok=True)
# Write the DataFrames to a CSV
df1.write_csv(mismatched_column_names_path / 'df1.csv')
df2.write_csv(mismatched_column_names_path / 'df2.csv')

If we try to call `pl.scan_csv` with a `*` we get an `Exception` (commented out to allow my automated checks to run)

In [None]:
# (
#     pl.scan_csv(mismatched_column_names_path / 'df*.csv')
#     .collect()
# )

We handle this using the `with_column_names` argument to modify the column names before we concatenate the data from different files.

In this example we specify a function that converts the column names to lower case

In [None]:
(
    pl.scan_csv(
        mismatched_column_names_path / 'df*.csv',
        with_column_names=lambda cols: [col.lower() for col in cols]
    )
    .collect()
)

### Scanning from a list of file paths in lazy mode
We can also create a list of scanned CSV files in lazy mode

In [None]:
files_list = [
    'data_files/csv/multiple_csv/train.csv',
    'data_files/csv/multiple_csv/test.csv'
]
queries_list = [
    pl.scan_csv(csv_path) for csv_path in files_list
]
queries_list

The `queries_list` is a `list` of `LazyFrames`.

Polars can evaluate a `list` of `LazyFrames` with `pl.collect_all`. The output is a `list` of `DataFrames`.

To return the output as a single `DataFrame` we call:
- `pl.concat` to combine the `list` of `LazyFrames` to a single `LazyFrame`
- `collect` to evaluate the `LazyFrame`

In [None]:
(
    pl.concat(
        queries_list
    )
    .collect()
    .head(3)
)

For large datasets we can use streaming with `streaming=True` in `collect` (recent Polars releases use `engine="streaming"` instead).

If there are small differences in the dtypes (e.g. floats in one file and integers in another) we can reconcile these by passing `how="vertical_relaxed"` to `pl.concat` as shown in the Concatenation lecture. If the column names are in different orders, `how="diagonal_relaxed"` aligns the columns by name.

## Discovering file paths
In some cases we want an easy way to find all the CSVs in sub-directories.

We can use PyArrow in this case. While using PyArrow isn't necessary in this simple example, it is handy with more complicated directory structures

In [None]:
import pyarrow.dataset as ds

dataset = ds.dataset(
    csv_directory,
    format="csv"
)

We list the files that PyArrow has found

In [None]:
dataset.files

We can then read these files in eager mode by:
- letting PyArrow turn them into an Arrow table and
- creating a Polars `DataFrame` from the Arrow table with zero-copy

In [None]:
(
    pl.from_arrow(
        dataset.to_table()
    )
    .head(3)
)

With PyArrow we can do manual optimisations such as limiting the columns or applying a row filter in the arguments of `to_table`

In [None]:
(
    pl.from_arrow(
        dataset.to_table(
            columns=["Pclass", "Age"],
            filter=ds.field("Age") > 70,
        )
    )
    .head(3)
)

See the PyArrow docs for more info on the `dataset` object: https://arrow.apache.org/docs/python/dataset.html

## So which approach should you use?
Each of these approaches will work, but these are my opinions for general cases:
- If you want to read all files into memory with no query optimisations use `pl.read_csv`
- Use a wildcard if you can specify the files using a wildcard
- Use a list if you want more control over which files you read
- Use PyArrow if you have a more complicated directory structure

## Exercises
In the exercises you will develop your understanding of:
- reading multiple CSV files in eager mode
- reading multiple CSV files in lazy mode
- reading CSVs with PyArrow

### Exercise 1
The NYC taxi dataset CSV has 1000 rows containing records from different days.

### Set-up
We transform this CSV into a set of partitioned CSVs in sub-directories. 

We first set the path to the full CSV

In [None]:
nyccsv_file = "../data/nyc_trip_data_1k.csv"

We now:
- read the CSV
- add a column that records the date from the `pickup` datetime
- partition the `DataFrame` into a dictionary that maps dates to the `DataFrame` for that date

In [None]:
dailyDfDict = (
    pl.read_csv(nyccsv_file,try_parse_dates=True)
    .with_columns(
        pl.col("pickup").dt.truncate("1d").dt.strftime("%Y-%m-%d").alias("pickup_day")
    )
    .partition_by(by=["pickup_day"],as_dict=True)
)


The keys of `dailyDfDict` are one-element tuples that hold the date string for each day

In [None]:
dailyDfDict.keys()

The value for each key is a `DataFrame` for that date

In [None]:
dailyDfDict['2022-01-01',].head(3)

We now create a partitioned directory called `daily_nyc` for the data.

The name of each sub-directory is a date.

The content of each sub-directory is the CSV for that date

In [None]:
# Path to the new directory
nyccsv_directory = Path("data_files/csv/daily_nyc")

# Create the new directory if it doesn't already exist
nyccsv_directory.mkdir(parents=True,exist_ok=True)

# Loop through each date
for (day,), df in dailyDfDict.items():
    # Create a Path object for that date
    dailyDirectory = (nyccsv_directory / day)
    # Create the sub-directory for that date
    dailyDirectory.mkdir(parents=True,exist_ok=True)
    # Write a CSV called daily.csv
    df.write_csv(dailyDirectory / "daily.csv")


We list the contents of `daily_nyc` to see the sub-directories for each date

In [None]:
ls data_files/csv/daily_nyc/

We list the contents of one sub-directory to show the CSV

In [None]:
ls data_files/csv/daily_nyc/2022-01-01/

### Now on to the exercise!

Read all the CSV files in eager mode using a path with wildcards for the final directory name

In [None]:
(
    pl.read_csv(
        "data_files/csv/daily_nyc<blank>
    )
)

Read the CSV files in eager mode using:
- a `glob` and a `generator`
- a concatenation of the list of `DataFrames`

In [None]:
nycfile_paths_generator = nyccsv_directory<blank>

Read all the CSV files in lazy mode using a path with wildcards for the final directory name

Read all the CSVs in lazy mode **between 2022-01-01 and 2022-01-09** inclusive

- Scan the required CSV files by iterating through the generator
- Call `collect_all` to evaluate all the `LazyFrames`
- `concat` all the `DataFrames`

If you want a hint about filtering the dates expand the cell below

In [None]:
#Hint: in an `if` statement convert the `csv_path` to string with `csv_path.as_posix()` and check if 2022-01-0
# is in the string

### Exercise 2
Create a PyArrow `dataset` object with all the CSVs

List all the CSV files in the dataset

Read all the files into a Polars `DataFrame`

## Solutions

### Solution to exercise 1

Read all the CSV files in eager mode using a path with wildcards for the final directory name

In [None]:
pl.read_csv("data_files/csv/daily_nyc/*/daily.csv")

Read the CSV files in eager mode using:
- a `glob` and a `generator`
- a concatenation of the list of `DataFrames`

In [None]:
nycfile_paths_generator = nyccsv_directory.glob("*/*.csv")
(
    pl.concat(
        [pl.read_csv(csv_path) for csv_path in nycfile_paths_generator]
    )
).shape

Read all the CSV files in lazy mode using a path with wildcards for the final directory name

In [None]:
(
    pl.scan_csv("data_files/csv/daily_nyc/*/daily.csv")
    .collect()
)

Read all the CSVs in lazy mode **between 2022-01-01 and 2022-01-09** inclusive

- Scan the required CSV files by iterating through the generator
- Call `collect_all` to evaluate all the `LazyFrames`
- `concat` all the `DataFrames`

In [None]:
#Hint: in an `if` statement convert the `csv_path` to string with `csv_path.as_posix()` and check if 2022-01-0
# is in the string

In [None]:
nycfile_paths_generator = nyccsv_directory.glob("*/daily.csv")
(
    pl.concat(
        pl.collect_all(
            [pl.scan_csv(csv_path) for csv_path in nycfile_paths_generator if "2022-01-0" in csv_path.as_posix()]
        )
    )
).shape

### Solution to exercise 2
Create a PyArrow `dataset` object with all the CSVs

In [None]:
dataset = ds.dataset(nyccsv_directory,format="csv")

List all the CSV files in the dataset

In [None]:
dataset.files

Read all the files into a Polars `DataFrame`

In [None]:
(
    pl.from_arrow(
        dataset.to_table()
    )
    .head(3)
)