# Working with multiple files

On occasion, we will need to combine more than two files using some combination of `UNION` and `JOIN`. In this lecture, we will show a clean approach to scaling these operations up to any number of files. In the process, we will

1. Use `list` comprehensions to process and `UNION` many similar files.
2. Use `dict` comprehensions to store and access many tables by name.

In [28]:
import polars as pl

## The Basics of working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert all files to `pl.DataFrame`s
* Store the data frames in a `list` or `dict`

### What the heck is a `glob`?

`glob.glob`

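* Takes a path pattern with shell-style wildcards (not a regular expression)
* Returns a list of the paths that match the pattern
* Works with relative paths, resolved against the current working directory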

In [30]:
from glob import glob
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']
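One detail worth keeping in mind: `glob` returns matches in arbitrary order, so it is safer to sort when the order matters. The sketch below is a minimal illustration that builds two empty files in a temporary folder (the folder and the variable names are assumptions made for this example, not part of the lecture data).

```python
import tempfile
from glob import glob
from pathlib import Path

# Build a throwaway folder with two empty files so the sketch runs anywhere.
tmp_dir = Path(tempfile.mkdtemp())
for name in ("auto_sales_may.csv", "auto_sales_apr.csv"):
    (tmp_dir / name).touch()

# glob makes no promise about order, so sort when order matters.
pattern = str(tmp_dir / "auto_sales_*.csv")
found = sorted(glob(pattern))

# Keep only the file names for display.
file_names = [Path(p).name for p in found]
print(file_names)
```

Note that `sorted` gives alphabetical order, which happens to match calendar order for `apr` and `may` but would not for every pair of months.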

### Search tools for `glob`

* Use `*` to match any number of characters (including none),
* Use `?` to match exactly one character, and
* Use `[...]` to define character classes.
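These three tools can be tried without touching the disk by using the standard-library `fnmatch` module, which `glob` relies on for matching names. The file names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the pattern rules.
names = ["auto_sales_apr.csv", "auto_sales_may.csv", "auto_sales_q1.csv", "notes.txt"]

# `*` matches any run of characters.
star = [n for n in names if fnmatchcase(n, "auto_sales_*.csv")]

# Each `?` matches exactly one character, so `???` needs a three-letter tag.
three = [n for n in names if fnmatchcase(n, "auto_sales_???.csv")]

# `[am]` matches a single character that is either `a` or `m`.
a_or_m = [n for n in names if fnmatchcase(n, "auto_sales_[am]*.csv")]

print(star)
print(three)
print(a_or_m)
```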

### Using `polars` built-in `glob`

* `pl.read_csv(..., glob=True)` is the default,
* Will search for all matching files when the path contains a wildcard, single-character, or class pattern, and
* UNION the resulting tables.

**Note.** All information in the file name is lost!

In [33]:
sales_files = pl.read_csv('./data/auto_sales_*.csv')
sales_files

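shape: (8, 6)
┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
│     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
│ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
│ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
│ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
│ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
│ 2   ┆ Yolanda     ┆ 19      ┆ 8     ┆ 32  ┆ 15    │
│ 3   ┆ Xerxes      ┆ 12      ┆ 23    ┆ 18  ┆ 9     │
│ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
│ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
│ 2   ┆ Yolanda     ┆ 19      ┆ 10    ┆ 28  ┆ 17    │
│ 3   ┆ Xerxes      ┆ 11      ┆ 27    ┆ 17  ┆ 9     │
└─────┴─────────────┴─────────┴───────┴─────┴───────┘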


#### BEWARE - Files need to be UNIONABLE

In [34]:
pl.read_csv('./data/baseball/core/*.csv')

ComputeError: schema lengths differ

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    * Example: Lakes data and years are a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Example: Baseball files have no order and are easier to refer to by name

## Example - Using `glob` to read and UNION the sales data

Using `glob` with a `list` to automate reading and combining files.

#### Step 1 - Get the file names

In [35]:
from glob import glob
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']

#### Step 2 - Read the files into a list of data frames

In [36]:
sales_by_month = [pl.read_csv(f) for f in sales_files]
sales_by_month

[shape: (4, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
 │ 2   ┆ Yolanda     ┆ 19      ┆ 8     ┆ 32  ┆ 15    │
 │ 3   ┆ Xerxes      ┆ 12      ┆ 23    ┆ 18  ┆ 9     │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘,
 shape: (4, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
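 │ 2   ┆ Yolanda     ┆ 19      ┆ 10    ┆ 28  ┆ 17    │
 │ 3   ┆ Xerxes      ┆ 11      ┆ 27    ┆ 17  ┆ 9     │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘]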

#### Inspect each data frame with `head`

In [38]:
[df.head(2) for df in sales_by_month]

[shape: (2, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘,
 shape: (2, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘]

#### Inspecting the `shape`s

In [39]:
[df.shape for df in sales_by_month]

[(4, 6), (4, 6)]

#### Step 3 - Pull off the month from the file names with a RegEx

In [46]:
import re

MONTH_RE = re.compile(r'^\./data/auto_sales_([a-zA-Z_]*)\.csv$')
get_month = lambda p: MONTH_RE.match(p).group(1) if MONTH_RE.match(p) else None

In [47]:
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']

In [48]:
month_names = [get_month(p) for p in sales_files]
month_names

['apr', 'may']
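The pattern above is anchored to the `./data/` prefix, so a path written differently would return `None`. A possible alternative, sketched here with hypothetical names (`STEM_RE`, `month_from_path`), matches only the file stem so the folder part no longer matters.

```python
import re
from pathlib import Path

# Hypothetical variant: match the file stem only, ignoring the folder.
STEM_RE = re.compile(r"^auto_sales_([a-zA-Z_]+)$")


def month_from_path(path):
    """Return the month tag from a path like .../auto_sales_apr.csv, else None."""
    match = STEM_RE.match(Path(path).stem)
    return match.group(1) if match else None


example_paths = [
    "./data/auto_sales_apr.csv",
    "data/auto_sales_may.csv",   # no leading ./
    "./data/readme.csv",         # does not follow the naming scheme
]
months = [month_from_path(p) for p in example_paths]
print(months)
```

Returning `None` for a non-matching file makes it easy to filter out strays before reading.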

In [49]:
sales_dfs = [pl.read_csv(f) for f in sales_files]

sales_dfs

[shape: (4, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
 │ 2   ┆ Yolanda     ┆ 19      ┆ 8     ┆ 32  ┆ 15    │
 │ 3   ┆ Xerxes      ┆ 12      ┆ 23    ┆ 18  ┆ 9     │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘,
 shape: (4, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
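 │ 2   ┆ Yolanda     ┆ 19      ┆ 10    ┆ 28  ┆ 17    │
 │ 3   ┆ Xerxes      ┆ 11      ┆ 27    ┆ 17  ┆ 9     │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘]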

#### Combine month and tables in one `list` comprehension

Note that we will need the month name later, so we are storing it in a `tuple` with the data frame for now.

In [50]:
sales_by_month = [(get_month(f), pl.read_csv(f)) for f in sales_files]

sales_by_month

[('apr',
  shape: (4, 6)
  ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
  │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
  │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
  │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
  ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
  │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
  │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
  │ 2   ┆ Yolanda     ┆ 19      ┆ 8     ┆ 32  ┆ 15    │
  │ 3   ┆ Xerxes      ┆ 12      ┆ 23    ┆ 18  ┆ 9     │
  └─────┴─────────────┴─────────┴───────┴─────┴───────┘),
 ('may',
  shape: (4, 6)
  ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
  │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
  │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
  │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
  ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
  │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
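  │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
  │ 2   ┆ Yolanda     ┆ 19      ┆ 10    ┆ 28  ┆ 17    │
  │ 3   ┆ Xerxes      ┆ 11      ┆ 27    ┆ 17  ┆ 9     │
  └─────┴─────────────┴─────────┴───────┴─────┴───────┘)]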

#### Step 4 - Add a month column to each file

Notice that we need to put the `polars` dot-chain *inside* the `list` comprehension so that each month name is available when its column is added.

In [51]:
sale_files_with_month = [(df
                          .with_columns(month = pl.lit(mon))
                         )
                         for mon, df in sales_by_month
                        ]

In [26]:
[df.head(2) for df in sale_files_with_month]

[shape: (2, 7)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck ┆ month │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   ┆ str   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ apr   │
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    ┆ apr   │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┴───────┘,
 shape: (2, 7)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck ┆ month │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   ┆ str   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ may   │
 │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    ┆ may   │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┴───────┘]

#### Step 5 - Combine the files using `pl.concat`

In [52]:
combined_files = pl.concat(sale_files_with_month)
combined_files

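shape: (8, 7)
┌─────┬─────────────┬─────────┬───────┬─────┬───────┬───────┐
│     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck ┆ month │
│ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   ┆ ---   │
│ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   ┆ str   │
╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╪═══════╡
│ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ apr   │
│ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    ┆ apr   │
│ 2   ┆ Yolanda     ┆ 19      ┆ 8     ┆ 32  ┆ 15    ┆ apr   │
│ 3   ┆ Xerxes      ┆ 12      ┆ 23    ┆ 18  ┆ 9     ┆ apr   │
│ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ may   │
│ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    ┆ may   │
│ 2   ┆ Yolanda     ┆ 19      ┆ 10    ┆ 28  ┆ 17    ┆ may   │
│ 3   ┆ Xerxes      ┆ 11      ┆ 27    ┆ 17  ┆ 9     ┆ may   │
└─────┴─────────────┴─────────┴───────┴─────┴───────┴───────┘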


## <font color="red"> Exercise 3.1</font>

In the data folder, you will find 6 files, each containing a sample of 100,000 rows from the Uber data for the months apr14-sep14. Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from the file names.
3. Read the 6 data frames into a `list` of `tuples`, each containing the month name and the corresponding data frame.
4. Add the month column to each data frame using a pipe inside of a comprehension.
5. Use `pl.concat` to combine these 6 data frames into one combined `df`.

In [46]:
# Your code here