# Working with multiple files

On occasion, we will need to combine more than 2 files using some combination of `UNION` and `JOIN`.  In this lecture, we will show a clean approach to scaling up these operations up to any number of files.  In the process, we will

1. Use `list` comprehensions to process and `UNION` many similar files.
2. Use `dict` comprehensions to store and access many tables by name.

In [32]:
import polars as pl

## Baseball data

We will be using the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank), make sure you have these data cloned into `./data/baseball`.

In [12]:
!git clone https://github.com/chadwickbureau/baseballdatabank.git ./data/baseball

fatal: destination path './data/baseball' already exists and is not an empty directory.


## Working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert all files to `pd.DataFrames`
* Store the `df` in a list or dictionary

## What the heck is a `glob`

`glob.glob`

* Takes a path regular expression
* Returns a list of files that match the patterm
* Relative paths!

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    *  Example: Lakes data and years are a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Baseball files have no order and easier to refer to by name

## Example 1 - Using `glob` to read and combine the sales data

Using `glob` with a `list` to automate reading an combining files 

#### Step 1 - Get the file names

In [33]:
from glob import glob
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']

#### Step 2 - Read the files into a list of data frames

In [36]:
sales_by_month = [pl.read_csv(f) for f in sales_files]

 #### Inspect each data from with head

In [37]:
[df.head(2) for df in sales_by_month]

[shape: (2, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┘,
 shape: (2, 6)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
 ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24    │
 └─────┴─────────────┴─────────┴──

#### Step 3 - Pull off the month from the file names and repackage as a `dict`

In [38]:
import re

MONTH_RE = re.compile(r'^\./data/auto_sales_([a-zA-Z_]*)\.csv$')
get_month = lambda p: MONTH_RE.match(p).group(1) 
month_names = lambda files: [get_month(p) for p in files]
month_names(sales_files)

['apr', 'may']

In [39]:
month_name_and_file = list(zip(month_names(sales_files), sales_files))
month_name_and_file

[('apr', './data/auto_sales_apr.csv'), ('may', './data/auto_sales_may.csv')]

#### Now repackage with a `list` comprehension

Note that we will need the month name later, so we are storing it in a `tuple` with the data frame for now.

In [41]:
sales_by_month = [(mon,pl.read_csv(file)) for mon, file in month_name_and_file]

In [42]:
[(mon, df.head(2)) for mon, df in sales_by_month]

[('apr',
  shape: (2, 6)
  ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
  │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
  │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
  │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
  ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
  │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
  ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
  │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    │
  └─────┴─────────────┴─────────┴───────┴─────┴───────┘),
 ('may',
  shape: (2, 6)
  ┌─────┬─────────────┬─────────┬───────┬─────┬───────┐
  │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck │
  │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   │
  │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   │
  ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╡
  │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    │
  ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
  │ 1   ┆ Bob         ┆ 20      ┆ 14    ┆ 6   ┆ 24  

#### Step 4 - Add a month column to each file

Notice that we need to put the `dfply` pipe *inside* the `list` comprehension to allow access to the names.

In [43]:
sale_files_with_month = [(df
                          .with_columns(month = mon)
                         )
                         for mon, df in sales_by_month
                        ]

In [44]:
[df.head(2) for df in sale_files_with_month]

[shape: (2, 7)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck ┆ month │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   ┆ str   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ apr   │
 ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
 │ 1   ┆ Bob         ┆ 19      ┆ 12    ┆ 17  ┆ 20    ┆ apr   │
 └─────┴─────────────┴─────────┴───────┴─────┴───────┴───────┘,
 shape: (2, 7)
 ┌─────┬─────────────┬─────────┬───────┬─────┬───────┬───────┐
 │     ┆ Salesperson ┆ Compact ┆ Sedan ┆ SUV ┆ Truck ┆ month │
 │ --- ┆ ---         ┆ ---     ┆ ---   ┆ --- ┆ ---   ┆ ---   │
 │ i64 ┆ str         ┆ i64     ┆ i64   ┆ i64 ┆ i64   ┆ str   │
 ╞═════╪═════════════╪═════════╪═══════╪═════╪═══════╪═══════╡
 │ 0   ┆ Ann         ┆ 22      ┆ 18    ┆ 15  ┆ 12    ┆ may   │
 ├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌

#### Step 5 - Combine the files using `pl.concat`

In [45]:
combined_files = pl.concat(sale_files_with_month)
combined_files

Unnamed: 0_level_0,Salesperson,Compact,Sedan,SUV,Truck,month
i64,str,i64,i64,i64,i64,str
0,"""Ann""",22,18,15,12,"""apr"""
1,"""Bob""",19,12,17,20,"""apr"""
2,"""Yolanda""",19,8,32,15,"""apr"""
3,"""Xerxes""",12,23,18,9,"""apr"""
0,"""Ann""",22,18,15,12,"""may"""
1,"""Bob""",20,14,6,24,"""may"""
2,"""Yolanda""",19,10,28,17,"""may"""
3,"""Xerxes""",11,27,17,9,"""may"""


## <font color="red"> Exercise 2.9.1</font>

In the data folder, you will find 6 files that contain a sample 100,000 rows from the uber data for the month apr14-sep14.  Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from the files.
3. Read the 6 data frames into a `list` of `tuples` containing the month name corresponding data frame.
4. Add the month column each data frame using a pipe inside of a comprehension.
5. Use `pd.concat` to combine these 6 data frames into one combined `df`

In [46]:
# Your code here

## Example 2 - Reading and joining the baseball database using `dict`

**Task:** Collect the number of total hits for each batters in the 2010 season join on their first and last name.

In the second example, we will store the data frames in a `dict`, which will make it easier to join the files by ne

#### Step 1 - Get the files names

In [55]:
from glob import glob
files = glob('./data/baseball/core/*.csv')
files

['./data/baseball/core/AwardsManagers.csv',
 './data/baseball/core/Managers.csv',
 './data/baseball/core/AwardsPlayers.csv',
 './data/baseball/core/Fielding.csv',
 './data/baseball/core/Salaries.csv',
 './data/baseball/core/Parks.csv',
 './data/baseball/core/Schools.csv',
 './data/baseball/core/People.csv',
 './data/baseball/core/PitchingPost.csv',
 './data/baseball/core/Teams.csv',
 './data/baseball/core/Appearances.csv',
 './data/baseball/core/AwardsSharePlayers.csv',
 './data/baseball/core/TeamsFranchises.csv',
 './data/baseball/core/Batting.csv',
 './data/baseball/core/ManagersHalf.csv',
 './data/baseball/core/FieldingOF.csv',
 './data/baseball/core/Pitching.csv',
 './data/baseball/core/CollegePlaying.csv',
 './data/baseball/core/HomeGames.csv',
 './data/baseball/core/HallOfFame.csv',
 './data/baseball/core/AwardsShareManagers.csv',
 './data/baseball/core/BattingPost.csv',
 './data/baseball/core/TeamsHalf.csv',
 './data/baseball/core/SeriesPost.csv',
 './data/baseball/core/Fielding

* Only need the `Batting.csv` and `People.csv`.  
* Narrow with a RegEx

In [56]:
import re
needed_file = re.compile(r'./data/baseball/core/(Batting|People).csv')

needed_files = [f for f in files if needed_file.match(f)]
needed_files

['./data/baseball/core/People.csv', './data/baseball/core/Batting.csv']

#### Step 2 - Make helper functions to get the name from path

In [57]:
import re
FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
file_names(needed_files)

['People', 'Batting']

#### Step 3 - Use a comprehension to read in all files

**Note:** The data is small (< 10mb total) so it is safe to read all at once.

In [58]:
dfs = {name:pl.read_csv(path) for name, path in zip(file_names(needed_files), needed_files)}
dfs['Batting'].head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
str,i64,i64,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,str,i64
"""abercda01""",1871,1,"""TRO""","""NA""",1,4,0,0,0,0,0,0,0,0,0,0,,,,,0
"""addybo01""",1871,1,"""RC1""","""NA""",25,118,30,32,6,0,0,13,8,1,4,0,,,,,0
"""allisar01""",1871,1,"""CL1""","""NA""",29,137,28,40,4,5,0,19,3,1,2,5,,,,,1
"""allisdo01""",1871,1,"""WS3""","""NA""",27,133,28,44,10,2,2,27,1,1,0,2,,,,,0
"""ansonca01""",1871,1,"""RC1""","""NA""",25,120,29,39,11,3,0,16,6,2,2,1,,,,,0


In [59]:
dfs['People'].head()

playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
str,i64,i64,i64,str,str,str,i64,i64,i64,str,str,str,str,str,str,i64,i64,str,str,str,str,str,str
"""aardsda01""",1981,12,27,"""USA""","""CO""","""Denver""",,,,,,,"""David""","""Aardsma""","""David Allan""",215,75,"""R""","""R""","""2004-04-06""","""2015-08-23""","""aardd001""","""aardsda01"""
"""aaronha01""",1934,2,5,"""USA""","""AL""","""Mobile""",,,,,,,"""Hank""","""Aaron""","""Henry Louis""",180,72,"""R""","""R""","""1954-04-13""","""1976-10-03""","""aaroh101""","""aaronha01"""
"""aaronto01""",1939,8,5,"""USA""","""AL""","""Mobile""",1984.0,8.0,16.0,"""USA""","""GA""","""Atlanta""","""Tommie""","""Aaron""","""Tommie Lee""",190,75,"""R""","""R""","""1962-04-10""","""1971-09-26""","""aarot101""","""aaronto01"""
"""aasedo01""",1954,9,8,"""USA""","""CA""","""Orange""",,,,,,,"""Don""","""Aase""","""Donald William...",190,75,"""R""","""R""","""1977-07-26""","""1990-10-03""","""aased001""","""aasedo01"""
"""abadan01""",1972,8,25,"""USA""","""FL""","""Palm Beach""",,,,,,,"""Andy""","""Abad""","""Fausto Andres""",184,73,"""L""","""L""","""2001-09-10""","""2006-04-13""","""abada001""","""abadan01"""


#### Step 4 - Preprocess each file.

In [61]:
# Filter, select, and aggregate hits for 2010.
hits_in_2010_raw = (dfs['Batting']
                   .select(['yearID', 'playerID', 'H'])
                   .filter(pl.col('yearID') == 2010)
                   .groupby('playerID')
                   .agg(pl.col('H').mean().alias('Total Hits'))
                   )
hits_in_2010_raw.head(2)

playerID,Total Hits
str,f64
"""ramirma03""",15.0
"""romerni01""",0.0


In [62]:
# Grab the first and last names from People.

player_names = (dfs['People']
                .select(['playerID', 'nameFirst', 'nameLast'])
               )
player_names.head(2)

playerID,nameFirst,nameLast
str,str,str
"""aardsda01""","""David""","""Aardsma"""
"""aaronha01""","""Hank""","""Aaron"""


#### Step 4 -- Join the tables

In [73]:
hits_in_2010 = (hits_in_2010_raw 
                .join(player_names, on='playerID', how='left')
                .drop('playerID')
               )
hits_in_2010.head()

Total Hits,nameFirst,nameLast
f64,str,str
15.0,"""Max""","""Ramirez"""
0.0,"""Niuman""","""Romero"""
56.5,"""Jeff""","""Francoeur"""
78.0,"""Mark""","""Kotsay"""
12.0,"""Cory""","""Sullivan"""


## <font color="red"> Exercise 2.9.2 </font>

We want to get the total hits allowed for all pitchers during the 2000-2010 seasons.  Use `glob` and a `dict` to collect this information into a table that includes the players first and last names.

In [77]:
# Your code here