# Concatenating Tables with Set-Like Operations

One of the two ways to combine two tables is to stack one on top of the other.  When stacking two tables, we need to decide:

1. Whether to combine columns by position or by name (and, if combining by name, what to do with mismatches).
2. Which rows to keep.  Here we will take some guidance from SQL clauses.

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Keeps only the rows common to both tables.
* **Set Difference/Except:** Keeps rows from the left table *except* those in the right table.
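
As a minimal sketch of the three operations, each table below is modeled as a plain Python `set` of row tuples. The rows are made up for illustration and this is not how `dfply` works internally; it only shows which rows each operation keeps.

```python
# Each table is modeled as a list of row tuples (made-up rows for illustration).
left = [("Ann", 22), ("Bob", 20), ("Yolanda", 19)]
right = [("Ann", 22), ("Bob", 19)]

left_rows, right_rows = set(left), set(right)

union = left_rows | right_rows          # rows found in either table
intersection = left_rows & right_rows   # rows found in both tables
difference = left_rows - right_rows     # rows in left that are not in right

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```

Because a `set` holds each row once, these are the distinct versions of the operations.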

## Set Operations in Action 

<img src="./img/table_verbs_set.gif" width=800>

## All Operations Match by Position

All operations

* Match columns by position
* Require same number/type of columns
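
One caveat when doing this by hand in pandas: `pd.concat` aligns columns by *name*, not by position. The sketch below uses two hypothetical tables whose columns line up by position but have different names. It shows the name-based result and one way to get a positional stack, by copying the left table's column names onto the right table first.

```python
import pandas as pd

# Hypothetical tables whose columns line up by position but not by name.
left = pd.DataFrame({"name": ["Ann", "Bob"], "sedan": [18, 14]})
right = pd.DataFrame({"person": ["Cy"], "sedans": [7]})

# pd.concat aligns columns by NAME, so mismatched names give four NaN-padded columns.
by_name = pd.concat([left, right], ignore_index=True)

# To stack by POSITION, first give the right table the left table's column names.
right_renamed = right.copy()
right_renamed.columns = left.columns
by_position = pd.concat([left, right_renamed], ignore_index=True)

print(by_name)
print(by_position)
```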

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Only keeps distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keeps duplicate rows
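
To see how the `ALL` versions treat duplicates, here is a small sketch using `collections.Counter` as a multiset of row tuples. It assumes the usual SQL multiset semantics (counts add for `UNION ALL`, take the minimum for `INTERSECT ALL`, and subtract for `EXCEPT ALL`). The rows are invented for illustration.

```python
from collections import Counter

# Made-up rows; ("Ann", 22) appears twice on the left and once on the right.
left = Counter([("Ann", 22), ("Ann", 22), ("Bob", 20)])
right = Counter([("Ann", 22), ("Cy", 7)])

union_all = left + right       # counts add: every copy is kept
intersect_all = left & right   # minimum of the two counts
except_all = left - right      # left count minus right count, floored at zero

# The DISTINCT union keeps one copy of each row.
union_distinct = set(left) | set(right)

print(sorted(union_all.elements()))
print(sorted(intersect_all.elements()))
print(sorted(except_all.elements()))
print(sorted(union_distinct))
```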

In [1]:
import pandas as pd
from dfply import *

In [2]:
sales_may = pd.read_csv('./data/auto_sales_may.csv')
sales_may

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9


In [3]:
sales_apr = pd.read_csv('./data/auto_sales_apr.csv')
sales_apr

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


## Unions with `dfply`

Use `left_table >> union(right_table)`

In [4]:
sales_may >> union(sales_apr)



Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


## `dfply.union` is distinct

Since Ann has the same sales in both months, her row is included only once.  Note that we can use `keep='first'` or `keep='last'` to determine which of the duplicate rows is kept.

In [5]:
sales_may >> union(sales_apr, keep='last')



Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
0,0,Ann,22,18,15,12
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


In [8]:
sales_may >> union(sales_apr, keep='first')



Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9
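
The effect of `keep` can be reproduced in plain pandas with `pd.concat` followed by `drop_duplicates`. This is a sketch on small made-up tables, not necessarily how `dfply` implements `union`. It uses `pd.concat` because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Small made-up tables; Ann's row is identical in both.
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 14]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 12]})

stacked = pd.concat([may, apr])  # keeps the original row labels 0, 1, 0, 1

first = stacked.drop_duplicates(keep="first")  # Ann's copy from `may` survives
last = stacked.drop_duplicates(keep="last")    # Ann's copy from `apr` survives

print(first)
print(last)
```

Either way three rows remain. Only the position of Ann's row changes, which matches the two outputs above.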


## Making `union_all`

We can use `pd.concat` to perform a `UNION ALL`

In [10]:
from more_dfply import union_all
sales_may >> union_all(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
4,0,Ann,22,18,15,12
5,1,Bob,19,12,17,20
6,2,Yolanda,19,8,32,15
7,3,Xerxes,12,23,18,9
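
For readers without `more_dfply`, a `UNION ALL` can be sketched directly with `pd.concat`. The helper name `union_all_sketch` is hypothetical and is not the `more_dfply` implementation. `ignore_index=True` renumbers the rows 0, 1, 2, ... as in the output above.

```python
import pandas as pd

def union_all_sketch(left, right):
    """Stack two data frames and keep every row (hypothetical stand-in)."""
    return pd.concat([left, right], ignore_index=True)

# Small made-up tables; Ann's row is identical in both.
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 14]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 12]})

both = union_all_sketch(may, apr)
print(both)
```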


## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [12]:
sales_may >> mutate(month = 'May') >> union(sales_apr >> mutate(month = 'April'))



Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,0,Ann,22,18,15,12,May
1,1,Bob,20,14,6,24,May
2,2,Yolanda,19,10,28,17,May
3,3,Xerxes,11,27,17,9,May
0,0,Ann,22,18,15,12,April
1,1,Bob,19,12,17,20,April
2,2,Yolanda,19,8,32,15,April
3,3,Xerxes,12,23,18,9,April


## Finding common rows with `dfply.intersect`

In [13]:
sales_may >> intersect(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
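
A pandas-only sketch of the same idea uses an inner `merge`. With no `on=` argument, `merge` joins on every column the two frames share, so only rows present in both tables survive. The tables below are made up for illustration.

```python
import pandas as pd

# Small made-up tables; only Ann's row appears in both.
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 14]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 12]})

# Inner merge on all shared columns, then drop any repeated rows.
common = may.merge(apr, how="inner").drop_duplicates()
print(common)
```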


## Finding rows unique to the left table.

Use `left_table >> dfply.set_diff(right_table)`

In [10]:
sales_may >> set_diff(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
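
In plain pandas, one way to sketch a set difference is a left `merge` with `indicator=True`, which adds a `_merge` column marking each row as `both` or `left_only`. The tables are made up, and this is an illustration rather than the `dfply` implementation.

```python
import pandas as pd

# Small made-up tables; Bob's rows differ between the two.
may = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 14]})
apr = pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 12]})

# Left merge on all shared columns; `_merge` records where each row was found.
flagged = may.merge(apr.drop_duplicates(), how="left", indicator=True)

# Keep the rows found only in the left table, then drop the helper column.
only_in_may = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_may)
```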


# Working with many and/or large files

In this section, we will take a look at techniques for working with many files, as well as large files.

In [14]:
import pandas as pd
from dfply import *

## Baseball data

We will be using the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank).  Make sure you have these data cloned into `./data/baseball`.

In [15]:
!git clone https://github.com/chadwickbureau/baseballdatabank.git ./data/baseball

fatal: destination path './data/baseball' already exists and is not an empty directory.


## Working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert each file to a `pd.DataFrame`
* Store the data frames in a list or dictionary

## What the heck is a `glob`?

`glob.glob`

* Takes a path pattern with shell-style wildcards such as `*` (not a full regular expression)
* Returns a list of files that match the pattern
* Relative paths!
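
The sketch below builds a throwaway folder with a few empty files (the file names are invented for this demo) and shows two of the wildcards: `*` matches any run of characters and `?` matches exactly one character. The results are sorted because `glob` does not guarantee any particular order.

```python
import os
import tempfile
from glob import glob

demo_names = ["auto_sales_apr.csv", "auto_sales_may.csv",
              "auto_sales_summary.csv", "notes.txt"]

with tempfile.TemporaryDirectory() as demo_dir:
    # Create empty files so there is something to match.
    for name in demo_names:
        open(os.path.join(demo_dir, name), "w").close()

    # `*` matches any run of characters.
    star = sorted(os.path.basename(p)
                  for p in glob(os.path.join(demo_dir, "auto_sales_*.csv")))

    # `???` matches exactly three characters.
    three = sorted(os.path.basename(p)
                   for p in glob(os.path.join(demo_dir, "auto_sales_???.csv")))

print(star)
print(three)
```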

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    *  Example: Lakes data and years are a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Example: The baseball files have no natural order and are easier to refer to by name

## Example 1 - Using `glob` to read and combine the sales data

Using `glob` with a `list` to automate reading and combining files

#### Step 1 - Get the file names

In [16]:
from glob import glob
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']

#### Step 2 - Read the files into a list of data frames

In [17]:
sales_by_month = [pd.read_csv(f) for f in sales_files]

#### Inspect each data frame with `head`

In [18]:
[df.head(2) for df in sales_by_month]

[   Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
 0           0         Ann       22     18   15     12
 1           1         Bob       19     12   17     20,
    Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
 0           0         Ann       22     18   15     12
 1           1         Bob       20     14    6     24]

#### Step 3 - Pull the month from the file names and pair it with each file

In [27]:
import re
my_pattern = re.compile(r'\./data/auto_sales_([a-z]{3})\.csv')  # escaped dots match a literal "."
[my_pattern.match(f).group(1) for f in sales_files]

['apr', 'may']

In [26]:
my_pattern.match('./data/auto_sales_jun.csv').group(1)

'jun'

In [28]:
import re

MONTH_RE = re.compile(r'\./data/auto_sales_([a-z]{3})\.csv')  # escaped dots match a literal "."
get_month = lambda p: MONTH_RE.match(p).group(1) 
month_names = lambda files: [get_month(p) for p in files]
month_names(sales_files)

['apr', 'may']
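
One thing to watch: `match` returns `None` when a path does not fit the pattern, and calling `.group(1)` on `None` raises an `AttributeError`. The sketch below uses a hypothetical helper, `get_month_or_none`, that returns `None` instead. It also anchors the pattern with `^` and `$` and escapes the dots so they match only a literal `.`.

```python
import re

# Escaped dots match a literal "."; ^ and $ require the whole path to fit.
MONTH_RE = re.compile(r"^\./data/auto_sales_([a-z]{3})\.csv$")

def get_month_or_none(path):
    """Return the three-letter month in `path`, or None if the path does not fit."""
    found = MONTH_RE.match(path)
    return found.group(1) if found else None

print(get_month_or_none("./data/auto_sales_jun.csv"))
print(get_month_or_none("./data/auto_sales_summary.csv"))
```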

In [29]:
month_name_and_file = list(zip(month_names(sales_files), sales_files))
month_name_and_file

[('apr', './data/auto_sales_apr.csv'), ('may', './data/auto_sales_may.csv')]

#### Now repackage with a `list` comprehension

Note that we will need the month name later, so we are storing it in a `tuple` with the data frame for now.

In [30]:
sales_by_month = [(mon,pd.read_csv(file)) for mon, file in month_name_and_file]

In [31]:
[(mon, df.head(2)) for mon, df in sales_by_month]

[('apr',
     Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
  0           0         Ann       22     18   15     12
  1           1         Bob       19     12   17     20),
 ('may',
     Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
  0           0         Ann       22     18   15     12
  1           1         Bob       20     14    6     24)]

#### Step 4 - Add a month column to each file

Notice that we need to put the `dfply` pipe *inside* the `list` comprehension to allow access to the names.

In [32]:
sale_files_with_month = [(df
                          >> mutate(month = mon)
                         )
                         for mon, df in sales_by_month
                        ]

In [33]:
[df.head(2) for df in sale_files_with_month]

[   Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck month
 0           0         Ann       22     18   15     12   apr
 1           1         Bob       19     12   17     20   apr,
    Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck month
 0           0         Ann       22     18   15     12   may
 1           1         Bob       20     14    6     24   may]
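
The same step can be sketched without `dfply` using `DataFrame.assign`, which returns a new data frame with the extra column and leaves the original untouched. The `(month, data frame)` pairs below are made up to stand in for `sales_by_month`.

```python
import pandas as pd

# Made-up (month, data frame) pairs standing in for `sales_by_month`.
pairs = [
    ("apr", pd.DataFrame({"Salesperson": ["Ann"], "Sedan": [18]})),
    ("may", pd.DataFrame({"Salesperson": ["Ann"], "Sedan": [18]})),
]

# As with the pipe, the comprehension gives each step access to the month name.
with_month = [df.assign(month=mon) for mon, df in pairs]

print(with_month[0])
print(with_month[1])
```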

#### Step 5 - Combine the files using `pd.concat`

Note that `pd.concat` behaves like the `union_all` verb used earlier: it keeps every row, including duplicates.

In [34]:
?pd.concat

In [35]:
combined_files = pd.concat(sale_files_with_month)
combined_files

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,0,Ann,22,18,15,12,apr
1,1,Bob,19,12,17,20,apr
2,2,Yolanda,19,8,32,15,apr
3,3,Xerxes,12,23,18,9,apr
0,0,Ann,22,18,15,12,may
1,1,Bob,20,14,6,24,may
2,2,Yolanda,19,10,28,17,may
3,3,Xerxes,11,27,17,9,may
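
As an alternative sketch, `pd.concat` can attach the month itself when it is given a `dict` of data frames: the keys become an outer index level, which can then be moved into an ordinary column. The tables are made up for illustration. The final `reset_index(drop=True)` also replaces the repeated row labels (0, 1, 0, 1) with a fresh 0..n-1 index.

```python
import pandas as pd

# Made-up monthly tables keyed by month name.
frames = {
    "apr": pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 12]}),
    "may": pd.DataFrame({"Salesperson": ["Ann", "Bob"], "Sedan": [18, 14]}),
}

# The dict keys become the outer index level, named "month".
stacked = pd.concat(frames, names=["month"])

# Move the month level into a column, then renumber the rows.
combined = stacked.reset_index(level="month").reset_index(drop=True)
print(combined)
```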


## <font color="red"> Exercise 2.10.1</font>

In the data folder, you will find 6 files that contain a sample of 100,000 rows from the Uber data for the months apr14-sep14.  Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from the file names.
3. Read the 6 data frames into a `list` of `tuples`, each containing the month name and the corresponding data frame.
4. Add the month column to each data frame using a pipe inside of a comprehension.
5. Use `pd.concat` to combine these 6 data frames into one combined `df`.

In [51]:
# Your code here
uber_files = glob('./data/uber-*.csv')
uber_files

['./data/uber-raw-data-jun14-sample.csv',
 './data/uber-raw-data-may14-sample.csv',
 './data/uber-raw-data-aug14-sample.csv',
 './data/uber-raw-data-sep14-sample.csv',
 './data/uber-raw-data-apr14-sample.csv',
 './data/uber-raw-data-jul14-sample.csv']

In [None]:
uberf_by_month = [pd.read_csv(f) for f in uber_files]
[df.head(3) for df in uberf_by_month]

In [63]:
month_re = re.compile(r'\./data/uber-raw-data-([a-z]{3})\d+-sample\.csv')  # escaped dots match a literal "."
get_month = lambda p: month_re.match(p).group(1) 
month_names = lambda files: [get_month(p) for p in files]
month_name_and_file = list(zip(month_names(uber_files), uber_files))
month_name_and_file

[('jun', './data/uber-raw-data-jun14-sample.csv'),
 ('may', './data/uber-raw-data-may14-sample.csv'),
 ('aug', './data/uber-raw-data-aug14-sample.csv'),
 ('sep', './data/uber-raw-data-sep14-sample.csv'),
 ('apr', './data/uber-raw-data-apr14-sample.csv'),
 ('jul', './data/uber-raw-data-jul14-sample.csv')]
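
Notice that `glob` returned the files in no particular order. If calendar order matters, the pairs can be sorted with a lookup table, as in the sketch below. The name `MONTH_ORDER` is introduced here for illustration, and the pairs are copied from the output above.

```python
# Map each three-letter month to its position in the calendar.
MONTH_ORDER = {m: i for i, m in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

pairs = [
    ("jun", "./data/uber-raw-data-jun14-sample.csv"),
    ("may", "./data/uber-raw-data-may14-sample.csv"),
    ("aug", "./data/uber-raw-data-aug14-sample.csv"),
    ("sep", "./data/uber-raw-data-sep14-sample.csv"),
    ("apr", "./data/uber-raw-data-apr14-sample.csv"),
    ("jul", "./data/uber-raw-data-jul14-sample.csv"),
]

# Sort by the calendar position of the month in each pair.
ordered = sorted(pairs, key=lambda pair: MONTH_ORDER[pair[0]])
print([mon for mon, _ in ordered])
```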

In [66]:
uberf_by_month = [(mon,pd.read_csv(file)) for mon, file in month_name_and_file]
[(mon, df.head(2)) for mon, df in uberf_by_month]

[('jun',
              Date/Time      Lat      Lon    Base
  0  6/19/2014 16:49:00  40.7568 -73.9701  B02682
  1  6/12/2014 21:25:00  40.6463 -73.7768  B02598),
 ('may',
              Date/Time      Lat      Lon    Base
  0  5/31/2014 18:57:00  40.7660 -73.9714  B02682
  1  5/13/2014 21:19:00  40.7598 -73.9782  B02598),
 ('aug',
              Date/Time      Lat      Lon    Base
  0  8/12/2014 19:19:00  40.7062 -74.0145  B02598
  1  8/30/2014 17:39:00  40.6400 -73.9672  B02764),
 ('sep',
              Date/Time      Lat      Lon    Base
  0  9/29/2014 22:30:00  40.7848 -73.9540  B02682
  1  9/26/2014 10:41:00  40.7134 -73.9974  B02598),
 ('apr',
              Date/Time      Lat      Lon    Base
  0  4/18/2014 21:38:00  40.7359 -73.9852  B02682
  1  4/23/2014 15:19:00  40.7642 -73.9543  B02598),
 ('jul',
              Date/Time      Lat      Lon    Base
  0  7/29/2014 19:34:00  40.7140 -74.0144  B02682
  1  7/11/2014 10:24:00  40.7264 -73.9553  B02617)]

In [67]:
uber_f_with_month = [(df
                     >> mutate(Month = mon)
                     )
                    for mon, df in uberf_by_month
                    ]

[df.head(3) for df in uber_f_with_month]

[            Date/Time      Lat      Lon    Base Month
 0  6/19/2014 16:49:00  40.7568 -73.9701  B02682   jun
 1  6/12/2014 21:25:00  40.6463 -73.7768  B02598   jun
 2  6/15/2014 22:23:00  40.7205 -73.9575  B02512   jun,
             Date/Time      Lat      Lon    Base Month
 0  5/31/2014 18:57:00  40.7660 -73.9714  B02682   may
 1  5/13/2014 21:19:00  40.7598 -73.9782  B02598   may
 2  5/21/2014 18:19:00  40.7254 -73.9979  B02598   may,
             Date/Time      Lat      Lon    Base Month
 0  8/12/2014 19:19:00  40.7062 -74.0145  B02598   aug
 1  8/30/2014 17:39:00  40.6400 -73.9672  B02764   aug
 2   8/3/2014 12:21:00  40.7242 -73.9788  B02598   aug,
             Date/Time      Lat      Lon    Base Month
 0  9/29/2014 22:30:00  40.7848 -73.9540  B02682   sep
 1  9/26/2014 10:41:00  40.7134 -73.9974  B02598   sep
 2   9/9/2014 14:46:00  40.7174 -73.9584  B02598   sep,
             Date/Time      Lat      Lon    Base Month
 0  4/18/2014 21:38:00  40.7359 -73.9852  B02682   apr
 1  4/

In [72]:
uber_final = pd.concat(uber_f_with_month)
uber_final.sample(6)

Unnamed: 0,Date/Time,Lat,Lon,Base,Month
2888,6/19/2014 20:39:00,40.7466,-73.9934,B02598,jun
52359,9/7/2014 16:15:00,40.6682,-73.936,B02617,sep
67804,7/18/2014 23:52:00,40.7247,-73.9986,B02682,jul
82559,9/3/2014 13:01:00,40.7639,-73.9786,B02682,sep
51267,6/13/2014 23:33:00,40.7739,-73.8734,B02598,jun
95445,6/8/2014 0:47:00,40.7085,-73.9468,B02617,jun


## Example 2 - Reading and joining the baseball database using `dict`

**Task:** Collect the total number of hits for each batter in the 2010 season and join on their first and last names.

In the second example, we will store the data frames in a `dict`, which will make it easier to join the files by name.

#### Step 1 - Get the file names

In [73]:
from glob import glob
files = glob('./data/baseball/core/*.csv')
files

['./data/baseball/core/ManagersHalf.csv',
 './data/baseball/core/AwardsPlayers.csv',
 './data/baseball/core/CollegePlaying.csv',
 './data/baseball/core/FieldingOFsplit.csv',
 './data/baseball/core/AwardsManagers.csv',
 './data/baseball/core/FieldingPost.csv',
 './data/baseball/core/AwardsShareManagers.csv',
 './data/baseball/core/AwardsSharePlayers.csv',
 './data/baseball/core/People.csv',
 './data/baseball/core/Pitching.csv',
 './data/baseball/core/Salaries.csv',
 './data/baseball/core/BattingPost.csv',
 './data/baseball/core/AllstarFull.csv',
 './data/baseball/core/Parks.csv',
 './data/baseball/core/PitchingPost.csv',
 './data/baseball/core/TeamsFranchises.csv',
 './data/baseball/core/FieldingOF.csv',
 './data/baseball/core/SeriesPost.csv',
 './data/baseball/core/Batting.csv',
 './data/baseball/core/Fielding.csv',
 './data/baseball/core/TeamsHalf.csv',
 './data/baseball/core/Schools.csv',
 './data/baseball/core/HallOfFame.csv',
 './data/baseball/core/Managers.csv',
 './data/baseball/

* Only need the `Batting.csv` and `People.csv`.  
* Narrow with a RegEx

In [74]:
import re
needed_file = re.compile(r'\./data/baseball/core/(Batting|People)\.csv')  # escaped dots match a literal "."

needed_files = [f for f in files if needed_file.match(f)]
needed_files

['./data/baseball/core/People.csv', './data/baseball/core/Batting.csv']

#### Step 2 - Make helper functions to get the name from path

In [75]:
import re
FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
file_names(needed_files)

['People', 'Batting']
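
As an alternative to the regular expressions in Steps 1 and 2, the file name can be pulled out with `pathlib`. The sketch below uses four of the paths listed in Step 1. `Path(...).stem` is the file name without its folder or extension, so it doubles as the `dict` key.

```python
from pathlib import Path

# A few of the paths returned by glob in Step 1.
files = [
    "./data/baseball/core/People.csv",
    "./data/baseball/core/Batting.csv",
    "./data/baseball/core/BattingPost.csv",
    "./data/baseball/core/Pitching.csv",
]

wanted = {"Batting", "People"}

# Keep only the wanted files, keyed by file name without folder or extension.
needed = {Path(f).stem: f for f in files if Path(f).stem in wanted}
print(needed)
```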

#### Step 3 - Use a comprehension to read in all files

**Note:** The data is small (< 10 MB total), so it is safe to read all at once.

In [76]:
dfs = {name:pd.read_csv(path) for name, path in zip(file_names(needed_files), needed_files)}
dfs['Batting'].head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,...,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,...,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,...,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1871,1,WS3,,27,133,28,44,10,...,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1871,1,RC1,,25,120,29,39,11,...,16.0,6.0,2.0,2,1.0,,,,,0.0


In [77]:
{n:df.head(2) for n, df in dfs.items()}

{'People':     playerID  birthYear  birthMonth  birthDay birthCountry birthState  \
 0  aardsda01     1981.0        12.0      27.0          USA         CO   
 1  aaronha01     1934.0         2.0       5.0          USA         AL   
 
   birthCity  deathYear  deathMonth  deathDay  ... nameLast    nameGiven  \
 0    Denver        NaN         NaN       NaN  ...  Aardsma  David Allan   
 1    Mobile        NaN         NaN       NaN  ...    Aaron  Henry Louis   
 
   weight height bats throws       debut   finalGame   retroID    bbrefID  
 0  215.0   75.0    R      R  2004-04-06  2015-08-23  aardd001  aardsda01  
 1  180.0   72.0    R      R  1954-04-13  1976-10-03  aaroh101  aaronha01  
 
 [2 rows x 24 columns],
 'Batting':     playerID  yearID  stint teamID lgID   G   AB   R   H  2B  ...   RBI   SB  \
 0  abercda01    1871      1    TRO  NaN   1    4   0   0   0  ...   0.0  0.0   
 1   addybo01    1871      1    RC1  NaN  25  118  30  32   6  ...  13.0  8.0   
 
     CS  BB   SO  IBB  HBP

In [78]:
dfs['People'].head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


#### Step 4 - Preprocess each file.

In [80]:
# Filter, select, and aggregate hits for 2010.
hits_in_2010_raw = (dfs['Batting']
                   >> select(X.yearID, X.playerID, X.H)
                   >> filter_by(X.yearID == 2010)
                   >> group_by(X.playerID)
                   >> summarise(total_hits = X.H.sum())
                   )
                   
hits_in_2010_raw.head(3)

Unnamed: 0,playerID,total_hits
0,aardsda01,0
1,abadfe01,0
2,abreubo01,146


In [81]:
# Grab the first and last names from People.

player_names = (dfs['People']
         >> select(X.playerID, X.nameFirst, X.nameLast))
player_names.head(3)

Unnamed: 0,playerID,nameFirst,nameLast
0,aardsda01,David,Aardsma
1,aaronha01,Hank,Aaron
2,aaronto01,Tommie,Aaron


#### Step 5 - Join the tables

In [82]:
hits_in_2010 = (hits_in_2010_raw 
                >> left_join(player_names, by='playerID')
                >> drop(X.playerID)
               )
hits_in_2010.head()

Unnamed: 0,total_hits,nameFirst,nameLast
0,0,David,Aardsma
1,0,Fernando,Abad
2,146,Bobby,Abreu
3,45,Tony,Abreu
4,0,Jeremy,Accardo
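
For comparison, here is a pandas-only sketch of the same filter, aggregate, and join pipeline. The two tiny tables are invented for illustration (the player IDs and names are not real). They reuse the column names from `Batting` and `People`.

```python
import pandas as pd

# Tiny invented tables with the same column names as Batting and People.
batting = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01", "bbb01"],
    "yearID": [2010, 2010, 2010, 2009],
    "H": [10, 5, 7, 99],
})
people = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "nameFirst": ["Alex", "Blake"],
    "nameLast": ["Aaa", "Bbb"],
})

hits_2010 = (
    batting.loc[batting["yearID"] == 2010, ["playerID", "H"]]  # filter and select
    .groupby("playerID", as_index=False)["H"].sum()            # total hits per player
    .rename(columns={"H": "total_hits"})
    .merge(people, on="playerID", how="left")                  # attach the names
    .drop(columns="playerID")
)
print(hits_2010)
```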


## <font color="red"> Exercise 2.10.2 </font>

We want to get the total hits allowed for all pitchers during the 2000-2010 seasons.  Use `glob` and a `dict` to collect this information into a table that includes the players' first and last names.

In [105]:
# Your code here
files = glob('./data/baseball/core/*.csv')

needed_file = re.compile(r'\./data/baseball/core/(Pitching|People)\.csv')  # escaped dots match a literal "."
needed_files = [f for f in files if needed_file.match(f)]

FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]

dfs = {name:pd.read_csv(path) for name, path in zip(file_names(needed_files), needed_files)}
dfs['Pitching'].sample(4)

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,...,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP
18850,coxca01,1969,1,WS2,AL,12,7,52,13,4,...,7.0,1,1.0,1,719.0,11,62,,,
17281,kochal01,1964,1,DET,AL,0,0,3,0,0,...,0.0,1,0.0,0,19.0,1,3,,,
16211,barbest01,1961,1,BAL,AL,18,12,37,34,14,...,4.0,9,2.0,1,1040.0,1,102,,,
30384,thomami01,1995,1,ML4,AL,0,0,1,0,0,...,0.0,0,0.0,0,7.0,0,0,0.0,0.0,0.0


In [125]:
hits_allowed = (dfs['Pitching']
                >> select(X.yearID, X.playerID, X.H)
                >> filter_by(X.yearID >= 2000, X.yearID <= 2010)
                >> group_by(X.playerID)
                >> summarise(total_hits = X.H.sum())
                )
hits_allowed.sample(5)

Unnamed: 0,playerID,total_hits
1220,palacvi01,12
1756,wallaje01,85
575,glynnry01,209
1554,smithgr02,218
1063,middlja01,75
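
In plain pandas, the year-range filter can be sketched with `Series.between`, which includes both endpoints by default. That matches `X.yearID >= 2000, X.yearID <= 2010` in the pipe above. The table is made up for illustration.

```python
import pandas as pd

# Made-up rows spanning both edges of the 2000-2010 window.
pitching = pd.DataFrame({
    "yearID": [1999, 2000, 2005, 2010, 2011],
    "H": [1, 2, 3, 4, 5],
})

# Both endpoints are included, so 2000 and 2010 are kept.
in_range = pitching[pitching["yearID"].between(2000, 2010)]
print(in_range)
```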


In [126]:
player_names = (dfs['People']
         >> select(X.playerID, X.nameFirst, X.nameLast))
player_names.sample(5)

Unnamed: 0,playerID,nameFirst,nameLast
16872,smithge01,Germany,Smith
3675,coreyma01,Mark,Corey
773,banksbr01,Brian,Banks
14673,quinnfr02,Frank,Quinn
16050,scantpa01,Pat,Scantlebury


In [127]:
hits_from_00_to_10 = (hits_allowed 
                >> left_join(player_names, by='playerID')
                >> drop(X.playerID)
               )
hits_from_00_to_10.sample(5)

Unnamed: 0,total_hits,nameFirst,nameLast
990,245,Tom,Martin
1403,76,Frankie,Rodriguez
829,10,Brandon,Kintzler
42,89,Jose,Arredondo
104,532,Matt,Belisle
