# Working with many and/or large files

This section covers techniques for working with many files at once, as well as with large files.

In [1]:
import pandas as pd
from dfply import *

## Baseball data

We will be using the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank). Make sure you have cloned the data into `./data/baseball`.

In [1]:
!git clone https://github.com/chadwickbureau/baseballdatabank.git ./data/baseball

fatal: destination path './data/baseball' already exists and is not an empty directory.


## Working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert all files to `pd.DataFrames`
* Store the `df` in a list or dictionary

## What the heck is a `glob`?

`glob.glob`

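* Takes a path pattern with shell-style wildcards (`*`, `?`, `[...]`), not a regular expression
* Returns a list of paths that match the pattern, in no guaranteed order
* Works with relative paths: a relative pattern is resolved against the current working directory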
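
To make the wildcard behavior concrete, here is a minimal sketch that matches patterns against a temporary directory. The file names are hypothetical stand-ins, not the real Baseball Databank files, and the results are sorted only because `glob` itself does not promise an order.

```python
import os
import tempfile
from glob import glob

# Build a throwaway directory holding a few hypothetical files
with tempfile.TemporaryDirectory() as tmp:
    for fname in ['Batting.csv', 'Pitching.csv', 'Parks.csv', 'readme.txt']:
        open(os.path.join(tmp, fname), 'w').close()

    # '*' matches any run of characters within a file name
    all_csv = sorted(glob(os.path.join(tmp, '*.csv')))
    # A literal prefix narrows the match to names starting with 'P'
    p_csv = sorted(glob(os.path.join(tmp, 'P*.csv')))
    # '[BP]' matches exactly one character from the set, '?' any one character
    ba_pa = sorted(glob(os.path.join(tmp, '[BP]a?????.csv')))

    csv_names = [os.path.basename(p) for p in all_csv]
    p_names = [os.path.basename(p) for p in p_csv]
    ba_pa_names = [os.path.basename(p) for p in ba_pa]

print(csv_names)
print(p_names)
print(ba_pa_names)
```

Note that `[BP]a?????.csv` matches only `Batting.csv`: the five `?` wildcards require exactly five more characters, which rules out `Parks.csv`.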

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    *  Example: Lakes data and years are a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Baseball files have no order and easier to refer to by name
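
The difference between the two containers can be sketched with small in-memory frames. All values and column names below are hypothetical and only illustrate the access pattern; they are not taken from the lakes or baseball data.

```python
import pandas as pd

# Hypothetical yearly tables: the years form a natural sequence, so use a list
yearly = [pd.DataFrame({'year': [y], 'depth': [d]})
          for y, d in [(2000, 1.5), (2001, 1.7), (2002, 1.6)]]
latest = yearly[-1]                               # position carries the meaning
combined = pd.concat(yearly, ignore_index=True)   # stacking in order is natural

# Hypothetical named tables with no inherent order, so use a dict
tables = {
    'People': pd.DataFrame({'playerID': ['a01', 'b02']}),
    'Teams': pd.DataFrame({'teamID': ['NY2', 'PH1', 'WS3']}),
}
n_teams = len(tables['Teams'])                    # the name carries the meaning
```

With a list, operations such as "the most recent year" or "stack everything" fall out of the ordering. With a dict, `tables['Teams']` reads better than remembering that teams happened to be at some index.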

## Example 1 - Reading the baseball database.

#### Step 1 - Get the file names

In [2]:
from glob import glob
files = glob('./data/baseball/core/*.csv')
files

['./data/baseball/core/AwardsManagers.csv',
 './data/baseball/core/Managers.csv',
 './data/baseball/core/AwardsPlayers.csv',
 './data/baseball/core/Fielding.csv',
 './data/baseball/core/Salaries.csv',
 './data/baseball/core/Parks.csv',
 './data/baseball/core/Schools.csv',
 './data/baseball/core/People.csv',
 './data/baseball/core/PitchingPost.csv',
 './data/baseball/core/Teams.csv',
 './data/baseball/core/Appearances.csv',
 './data/baseball/core/AwardsSharePlayers.csv',
 './data/baseball/core/TeamsFranchises.csv',
 './data/baseball/core/Batting.csv',
 './data/baseball/core/ManagersHalf.csv',
 './data/baseball/core/FieldingOF.csv',
 './data/baseball/core/Pitching.csv',
 './data/baseball/core/CollegePlaying.csv',
 './data/baseball/core/HomeGames.csv',
 './data/baseball/core/HallOfFame.csv',
 './data/baseball/core/AwardsShareManagers.csv',
 './data/baseball/core/BattingPost.csv',
 './data/baseball/core/TeamsHalf.csv',
 './data/baseball/core/SeriesPost.csv',
 './data/baseball/core/Fielding

#### Step 2 - Make helper functions to get the name from path

In [3]:
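import re
# Capture the table name: the part between the directory and the .csv extension
FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
# Extract the table name from a single path
file_name = lambda p: FILE_NAME_RE.match(p).group(1)
# Extract the table names from a list of paths
file_names = lambda files: [file_name(p) for p in files]
file_names(files)[:2]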

['AwardsManagers', 'Managers']
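
As an alternative to the regular expression, the standard library's `pathlib` can strip the directory and extension. This is a sketch of a different technique, not the approach used in this lecture; the helper below reuses the name `file_name` only for illustration.

```python
from pathlib import Path

def file_name(path):
    """Return the file name without its directory or extension."""
    return Path(path).stem

paths = ['./data/baseball/core/AwardsManagers.csv',
         './data/baseball/core/Managers.csv']
names = [file_name(p) for p in paths]
print(names)  # ['AwardsManagers', 'Managers']
```

The trade-off: `Path.stem` works for any directory, while the regular expression also validates that each path has the expected layout and fails loudly on anything else.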

#### Step 3 - Use a comprehension to read in all files

**Note:** The data is small (< 10 MB total), so it is safe to read all of it at once.

In [4]:
# Map each table name to the DataFrame read from its file
dfs = {name: pd.read_csv(path) for name, path in zip(file_names(files), files)}
dfs['Pitching'].head()

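| | playerID | yearID | stint | teamID | lgID | W | L | G | GS | CG | ... | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bechtge01 | 1871 | 1 | PH1 | | 1 | 2 | 3 | 3 | 2 | ... | | 7 | | 0 | 146.0 | 0 | 42 | | | |
| 1 | brainas01 | 1871 | 1 | WS3 | | 12 | 15 | 30 | 30 | 30 | ... | | 7 | | 0 | 1291.0 | 0 | 292 | | | |
| 2 | fergubo01 | 1871 | 1 | NY2 | | 0 | 0 | 1 | 0 | 0 | ... | | 2 | | 0 | 14.0 | 0 | 9 | | | |
| 3 | fishech01 | 1871 | 1 | RC1 | | 4 | 16 | 24 | 24 | 22 | ... | | 20 | | 0 | 1080.0 | 1 | 257 | | | |
| 4 | fleetfr01 | 1871 | 1 | NY2 | | 0 | 1 | 1 | 1 | 1 | ... | | 0 | | 0 | 57.0 | 0 | 21 | | | |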
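
Before reading every file in one go, it can be worth checking how much data is on disk. The sketch below sums file sizes with `os.path.getsize`; the helper name `total_size_mb` is hypothetical, and the demonstration uses small throwaway files rather than the real data.

```python
import os
import tempfile
from glob import glob

def total_size_mb(paths):
    """Sum the on-disk size of the given files, in megabytes (10**6 bytes)."""
    return sum(os.path.getsize(p) for p in paths) / 1e6

# Demonstrate on two small throwaway files of known size
with tempfile.TemporaryDirectory() as tmp:
    for fname, n_bytes in [('a.csv', 1_000), ('b.csv', 3_000)]:
        with open(os.path.join(tmp, fname), 'wb') as f:
            f.write(b'x' * n_bytes)
    size = total_size_mb(glob(os.path.join(tmp, '*.csv')))

print(size)
```

In the notebook, the same check would be `total_size_mb(files)` using the list from Step 1. Keep in mind that the size on disk is only a rough guide: a `DataFrame` can take more memory than the CSV it was read from.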


## <font color="red"> Exercise 1 </font>

Use `glob` to read the following files from `./data` into a `dict` keyed by file name: `Person.csv`, `Survey.csv`, `Site.csv`, `Visited.csv`

In [None]:
# Your code here

## <font color="blue"> Key </font>

In [32]:
import re
# Glob character class: files whose names start with P, S, or V
files = glob('./data/[PSV][a-z]*.csv')
# Escape the dot before "csv" so it matches a literal period
name_re = re.compile(r'\./data/([PSV][a-z]*)\.csv')
names = [name_re.match(f).group(1) for f in files]
dfs = {n: pd.read_csv(f) for n, f in zip(names, files)}
for name, df in dfs.items():
    print(name, '\n', df, '\n')

Site 
     name    lat     long
0   DR-1  -49.85 -128.57
1   DR-3  -47.15 -126.72
2  MSK-4  -48.87 -123.40 

Visited 
    id    site     dated
0  619   DR-1    2/8/27
1  622   DR-1   2/10/27
2  734   DR-3    1/7/30
3  735   DR-3   1/12/30
4  751   DR-3   2/26/30
5  752   DR-3       NaN
6  837  MSK-4   1/14/32
7  844   DR-1   3/22/32 

Person 
          id    personal     family
0      dyer     William       Dyer
1        pb       Frank    Pabodie
2      lake    Anderson       Lake
3       roe   Valentina    Roerich
4  danforth       Frank   Danforth 

Survey 
     taken   person  quant   reading
0      619    dyer    rad      9.82
1      619    dyer    sal      0.13
2      622    dyer    rad      7.80
3      622    dyer    sal      0.09
4      734      pb    rad      8.41
5      734    lake    sal      0.05
6      734      pb   temp    -21.50
7      735      pb    rad      7.22
8      735  -null-    sal      0.06
9      735  -null-   temp    -26.00
10     751      pb    rad      4.35
1

## Up Next

In [Lecture 3.2 - Aggregating Large Files with Pandas](./3_2_aggregating_large_files_in_pandas.ipynb), we will look at using `pandas` to read and aggregate chunks of a large file.