## Introduction

pandas offers several ways to combine two or more DataFrames or Series.
The .append method is the least flexible and only allows new rows to be appended
to a DataFrame. The concat function is very versatile and can combine any number of
DataFrames or Series on either axis. The .join method provides fast lookups by aligning
a column of one DataFrame to the index of others. The .merge method provides SQL-like
capabilities to join two DataFrames together.

## Appending new rows to DataFrames

When performing data analysis, it is far more common to create new columns than new rows.
This is because a new row of data usually represents a new observation, and as an analyst,
it is typically not your job to continually capture new data. Data capture is usually left to other
platforms like relational database management systems. Nevertheless, it is a necessary
feature to know as it will crop up from time to time.
In this recipe, we will begin by appending rows to a small dataset with the .loc attribute and
then transition to using the .append method.

In [2]:
# Read in the names dataset, and output it:

import pandas as pd 
import numpy as np
names = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/names.csv')

In [5]:
names


Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2


In [6]:
# Let's create a list that contains some new data and 
# use the .loc attribute to set a single row label
# equal to this new data:

new_data_list = ['Aria', 1]

names.loc[4] = new_data_list

In [7]:
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1


In [8]:
# The .loc attribute uses labels to refer to the rows.
# In this case, the row labels exactly match the 
# integer location. It is possible to append
# more rows with non-integer labels:

names.loc['five'] = ['Zach', 3]

In [9]:
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3


In [11]:
# To be more explicit in associating variables to 
# values, you may use a dictionary. Also, in this step,
# we can dynamically choose the new index label to be 
# the length of the DataFrame
names.loc[len(names)] = {'Name': 'Zach', 'Age': 2}

In [12]:
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3
6,Zach,2


In [13]:
len(names)

7

In [14]:
# A Series can hold the new data as well and works
# exactly the same as a dictionary:

names.loc[len(names)] = pd.Series({'Age': 32, 'Name': 'Dean'})

In [15]:
names

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1
five,Zach,3
6,Zach,2
7,Dean,32
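
The row-enlargement pattern used in the cells above can be condensed into a short sketch. The DataFrame below is a hypothetical stand-in for names.csv (built inline so the example runs on its own), and only integer labels are used to keep the index uniform.

```python
import pandas as pd

# Hypothetical stand-in for the contents of names.csv
names = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope', 'Niko'],
                      'Age': [70, 69, 4, 2]})

# A list is matched to the columns by position
names.loc[len(names)] = ['Aria', 1]

# A Series is matched to the columns by label, so key order does not matter
names.loc[len(names)] = pd.Series({'Age': 32, 'Name': 'Dean'})

print(names)
```

Because len(names) is re-evaluated on each assignment, every new row receives the next integer label as long as the existing labels are 0 to n-1.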


The preceding operations all use the .loc attribute to make changes to the names
DataFrame in-place. There is no separate copy of the DataFrame that is returned.
In the next few steps, we will look at the .append method, which does not modify
the calling DataFrame. Instead, it returns a new copy of the DataFrame with the
appended row(s). Let's begin with the original names DataFrame and attempt to
append a row. The first argument to .append must be another DataFrame, a
Series, a dictionary, or a list of these, but not a plain list of values like new_data_list. Let's see
what happens when we attempt to use a dictionary with .append:

In [16]:
names = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/names.csv')

In [17]:
names.append({'Name': 'Aria', 'Age': 1})

TypeError: Can only append a dict if ignore_index=True

The error message tells us how to correct the call: we need
to pass the ignore_index=True parameter:

In [18]:
names.append({'Name': 'Aria', 'Age': 1}, ignore_index=True)

Unnamed: 0,Name,Age
0,Cornelia,70
1,Abbas,69
2,Penelope,4
3,Niko,2
4,Aria,1


This works but ignore_index is a sneaky parameter. When set to True, the old
index will be removed completely and replaced with a RangeIndex from 0 to n-1.
For instance, let's specify an index for the names DataFrame:

In [19]:
names.index = ['Canada', 'Canada', 'USA', 'USA']

names

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2


Rerun the .append call with ignore_index=True, and you will get the
same result as before. The original index is completely ignored.
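
The effect of ignore_index can also be seen with pd.concat, which accepts the same parameter. This sketch uses concat rather than .append because DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. The small names DataFrame is rebuilt inline as an assumption about the file contents.

```python
import pandas as pd

names = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope', 'Niko'],
                      'Age': [70, 69, 4, 2]},
                     index=['Canada', 'Canada', 'USA', 'USA'])

# One-row DataFrame holding the new observation
new_row = pd.DataFrame([{'Name': 'Aria', 'Age': 1}])

# ignore_index=True throws away every existing label
reset = pd.concat([names, new_row], ignore_index=True)

# Without it, the labels of both pieces are kept as they are
kept = pd.concat([names, new_row])

print(reset.index.tolist())  # [0, 1, 2, 3, 4]
print(kept.index.tolist())   # ['Canada', 'Canada', 'USA', 'USA', 0]
```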

In [20]:
# Let's continue with this names DataFrame with the
# country strings in the index. Let's append a Series 
# that has a name attribute with the .append method:
a = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))

In [21]:
a

Name    Zach
Age        3
Name: 4, dtype: object

In [22]:
names.append(a)

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2
4,Zach,3


In [23]:
# The .append method is more flexible than the .loc
# attribute. It supports appending multiple rows at the
# same time. One way to accomplish this is by passing 
# in a list of Series:

a1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
a2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')

In [25]:
names.append([a1, a2])

Unnamed: 0,Name,Age
Canada,Cornelia,70
Canada,Abbas,69
USA,Penelope,4
USA,Niko,2
4,Zach,3
USA,Zayd,2


For small DataFrames with only two columns, it is simple enough to manually write out
all the column names and values. As they get larger, this process becomes quite
painful. For instance, let's take a look at the 2016 baseball dataset:

In [28]:
pd.set_option('display.max_columns', 8, 'display.max_rows', 15)

bball_16 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/baseball16.csv')

In [29]:
bball_16

Unnamed: 0,playerID,yearID,stint,teamID,...,HBP,SH,SF,GIDP
0,altuvjo01,2016,1,HOU,...,7.0,3.0,7.0,15.0
1,bregmal01,2016,1,HOU,...,0.0,0.0,1.0,1.0
2,castrja01,2016,1,HOU,...,1.0,1.0,0.0,9.0
3,correca01,2016,1,HOU,...,5.0,0.0,3.0,12.0
4,gattiev01,2016,1,HOU,...,4.0,0.0,5.0,12.0
...,...,...,...,...,...,...,...,...,...
11,reedaj01,2016,1,HOU,...,0.0,0.0,1.0,1.0
12,springe01,2016,1,HOU,...,11.0,0.0,1.0,12.0
13,tuckepr01,2016,1,HOU,...,2.0,0.0,0.0,2.0
14,valbulu01,2016,1,HOU,...,1.0,3.0,2.0,5.0


This dataset contains 22 columns and it would be easy to mistype a column name or
forget one altogether if you were manually entering new rows of data. To help protect
against these mistakes, let's select a single row as a Series and chain the .to_dict
method to it to get an example row as a dictionary:

In [32]:
data_dict = bball_16.iloc[0].to_dict()

In [35]:
data_dict

{'playerID': 'altuvjo01',
 'yearID': 2016,
 'stint': 1,
 'teamID': 'HOU',
 'lgID': 'AL',
 'G': 161,
 'AB': 640,
 'R': 108,
 'H': 216,
 '2B': 42,
 '3B': 5,
 'HR': 24,
 'RBI': 96.0,
 'SB': 30.0,
 'CS': 10.0,
 'BB': 60,
 'SO': 70.0,
 'IBB': 11.0,
 'HBP': 7.0,
 'SH': 3.0,
 'SF': 7.0,
 'GIDP': 15.0}

Clear the old values with a dictionary comprehension assigning any previous string
value as an empty string and all others as missing values. This dictionary can now
serve as a template for any new data you would like to enter:

In [37]:
new_data_dict = {k: '' if isinstance(v, str) else
                np.nan for k, v in data_dict.items()}

In [38]:
new_data_dict

{'playerID': '',
 'yearID': nan,
 'stint': nan,
 'teamID': '',
 'lgID': '',
 'G': nan,
 'AB': nan,
 'R': nan,
 'H': nan,
 '2B': nan,
 '3B': nan,
 'HR': nan,
 'RBI': nan,
 'SB': nan,
 'CS': nan,
 'BB': nan,
 'SO': nan,
 'IBB': nan,
 'HBP': nan,
 'SH': nan,
 'SF': nan,
 'GIDP': nan}

In [40]:
{k: '_' if isinstance(v, str) else
                0 for k, v in data_dict.items()}

{'playerID': '_',
 'yearID': 0,
 'stint': 0,
 'teamID': '_',
 'lgID': '_',
 'G': 0,
 'AB': 0,
 'R': 0,
 'H': 0,
 '2B': 0,
 '3B': 0,
 'HR': 0,
 'RBI': 0,
 'SB': 0,
 'CS': 0,
 'BB': 0,
 'SO': 0,
 'IBB': 0,
 'HBP': 0,
 'SH': 0,
 'SF': 0,
 'GIDP': 0}

Appending a single row to a DataFrame is a fairly expensive operation and if you find yourself
writing a loop to append single rows of data to a DataFrame, then you are doing it wrong. Let's
first create 1,000 rows of new data as a list of Series:

In [42]:
random_data = []
for i in range(1000):
    d = dict()
    for k, v in data_dict.items():
        if isinstance(v, str):
            d[k] = np.random.choice(list('abcde'))
        else:
            d[k] = np.random.randint(10)
    random_data.append(pd.Series(d, name=i + len(bball_16)))
        

In [53]:
random_data

[playerID    d
 yearID      0
 stint       3
 teamID      c
 lgID        e
            ..
 IBB         6
 HBP         9
 SH          6
 SF          8
 GIDP        9
 Name: 16, Length: 22, dtype: object,
 playerID    b
 yearID      5
 stint       0
 teamID      c
 lgID        c
            ..
 IBB         4
 HBP         8
 SH          5
 SF          0
 GIDP        8
 Name: 17, Length: 22, dtype: object,
 playerID    d
 yearID      0
 stint       4
 teamID      e
 lgID        e
            ..
 IBB         7
 HBP         8
 SH          7
 SF          8
 GIDP        2
 Name: 18, Length: 22, dtype: object,
 playerID    e
 yearID      2
 stint       1
 teamID      c
 lgID        b
            ..
 IBB         2
 HBP         2
 SH          3
 SF          0
 GIDP        4
 Name: 19, Length: 22, dtype: object,
 playerID    c
 yearID      4
 stint       3
 teamID      e
 lgID        a
            ..
 IBB         4
 HBP         4
 SH          7
 SF          0
 GIDP        0
 Name: 20, Length: 22, 

Let's time how long it takes to loop through each
item making one append at a time:

In [52]:
%%timeit
bball_16_copy = bball_16.copy()
for row in random_data:
    bball_16_copy = bball_16_copy.append(row)

4.83 s ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


That took nearly five seconds for only 1,000 rows. If we instead pass in the entire list of Series,
we get an enormous speed increase:

In [55]:
%%timeit
bball_16_copy = bball_16.copy()
bball_16_copy = bball_16_copy.append(random_data)

91 ms ± 4.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Passing in the entire list of Series objects reduces the time to under one-tenth of a second. Internally, pandas converts the list of Series to a single DataFrame and then
appends the data.
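
The same idea, collecting rows first and combining once, can be sketched with pd.concat, which also works on pandas versions where .append is no longer available. The two-column base DataFrame and the random values are made up for illustration and are not the baseball data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so reruns give the same rows

# Hypothetical two-column stand-in for the baseball table
base = pd.DataFrame({'playerID': ['altuvjo01', 'bregmal01'],
                     'HR': [24, 8]})

# Collect plain dictionaries first; appending to a list is cheap
rows = []
for _ in range(1000):
    rows.append({'playerID': str(rng.choice(list('abcde'))),
                 'HR': int(rng.integers(10))})

# Build one DataFrame from the list and concatenate a single time
combined = pd.concat([base, pd.DataFrame(rows)], ignore_index=True)
print(combined.shape)  # (1002, 2)
```

Each row-by-row append has to copy the whole DataFrame, whereas this version copies the data only once at the end.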

## Concatenating multiple DataFrames together

The concat function enables concatenating two or more DataFrames (or Series) together,
both vertically and horizontally. As per usual, when dealing with multiple pandas objects
simultaneously, concatenation doesn't happen haphazardly but aligns each object by
its index.

In [62]:
stocks_2016 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/stocks_2016.csv')
stocks_2017 = pd.read_csv('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/stocks_2017.csv')

In [61]:
stocks_2016

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70


In [63]:
stocks_2017

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,50,120,140
1,GE,100,30,40
2,IBM,87,75,95
3,SLB,20,55,85
4,TXN,500,15,23
5,TSLA,100,100,300


In [64]:
# Place all the stock datasets into a single list, and 
# then call the concat function to concatenate them
# together along the default axis (0):
s_list = [stocks_2016, stocks_2017]

pd.concat(s_list)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70
0,AAPL,50,120,140
1,GE,100,30,40
2,IBM,87,75,95
3,SLB,20,55,85
4,TXN,500,15,23
5,TSLA,100,100,300


By default, the concat function concatenates DataFrames vertically, one on
top of the other. One issue with the preceding DataFrame is that there is no way
to identify the year of each row. The concat function allows each piece of the
resulting DataFrame to be labeled with the keys parameter. This label will appear
in the outermost index level of the concatenated frame and force the creation of a
MultiIndex. Also, the names parameter has the ability to rename each index level
for clarity:

In [69]:
pd.concat(s_list, keys=['2016', '2017'],
         names=['Year', 'Symbol'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Symbol,Shares,Low,High
Year,Symbol,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2016,0,AAPL,80,95,110
2016,1,TSLA,50,80,130
2016,2,WMT,40,55,70
2017,0,AAPL,50,120,140
2017,1,GE,100,30,40
2017,2,IBM,87,75,95
2017,3,SLB,20,55,85
2017,4,TXN,500,15,23
2017,5,TSLA,100,100,300
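
Once the pieces are labeled with keys, the outer index level can be used to pull a single year back out or be turned into an ordinary column. The following is a minimal sketch with cut-down, made-up versions of the stock tables. The level name 'Row' is an illustrative choice for the inner integer index.

```python
import pandas as pd

stocks_2016 = pd.DataFrame({'Symbol': ['AAPL', 'TSLA', 'WMT'],
                            'Shares': [80, 50, 40]})
stocks_2017 = pd.DataFrame({'Symbol': ['AAPL', 'GE'],
                            'Shares': [50, 100]})

labeled = pd.concat([stocks_2016, stocks_2017],
                    keys=['2016', '2017'], names=['Year', 'Row'])

# The outer level now identifies the source DataFrame
print(labeled.loc['2017'])

# Moving the Year level into a column gives a flat table again
flat = labeled.reset_index(level='Year')
print(flat)
```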


In [71]:
# It is also possible to concatenate horizontally by
# changing the axis parameter to columns or 1:
pd.concat(s_list, keys=['2016', '2017'],
         axis='columns', names=['Year', 'Symbol'])

Year,2016,2016,2016,2016,2017,2017,2017,2017
Symbol,Symbol,Shares,Low,High,Symbol,Shares,Low,High
0,AAPL,80.0,95.0,110.0,AAPL,50,120,140
1,TSLA,50.0,80.0,130.0,GE,100,30,40
2,WMT,40.0,55.0,70.0,IBM,87,75,95
3,,,,,SLB,20,55,85
4,,,,,TXN,500,15,23
5,,,,,TSLA,100,100,300


Notice that missing values appear wherever an index label is present in one
DataFrame but not the other. Because both DataFrames here use the default integer
index, the rows are aligned by those integer labels and not by stock symbol, which
is why TSLA from 2016 sits next to GE from 2017. The concat function, by default,
uses an outer join, keeping all rows from each DataFrame in the list. However, it
gives us an option to keep only rows that have the same index values in both
DataFrames. This is referred to as an inner join. We set the join parameter to
inner to change the behavior:

In [73]:
pd.concat(s_list, keys=['2016', '2017'], join='inner',
         axis='columns', names=['Year', 'Symbol'])

Year,2016,2016,2016,2016,2017,2017,2017,2017
Symbol,Symbol,Shares,Low,High,Symbol,Shares,Low,High
0,AAPL,80,95,110,AAPL,50,120,140
1,TSLA,50,80,130,GE,100,30,40
2,WMT,40,55,70,IBM,87,75,95
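
The difference between the two join types is easier to see when the index carries meaningful labels. This sketch assumes the stock symbol has been placed in the index. The values are a made-up subset of the tables above.

```python
import pandas as pd

s16 = pd.DataFrame({'Shares': [80, 50, 40]},
                   index=['AAPL', 'TSLA', 'WMT'])
s17 = pd.DataFrame({'Shares': [50, 100, 100]},
                   index=['AAPL', 'GE', 'TSLA'])

# Default outer join: every label from either DataFrame is kept
outer = pd.concat([s16, s17], axis='columns', keys=[2016, 2017])

# Inner join: only labels present in both DataFrames are kept
inner = pd.concat([s16, s17], axis='columns', keys=[2016, 2017],
                  join='inner')

print(outer)  # four symbols, NaN where a year has no row
print(inner)  # only AAPL and TSLA
```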


The .append method is a heavily watered-down version of concat that can only append new
rows to a DataFrame. Internally, .append just calls the concat function. For instance, the first
concat call in this recipe may be duplicated with the following:

In [74]:
stocks_2016.append(stocks_2017)

Unnamed: 0,Symbol,Shares,Low,High
0,AAPL,80,95,110
1,TSLA,50,80,130
2,WMT,40,55,70
0,AAPL,50,120,140
1,GE,100,30,40
2,IBM,87,75,95
3,SLB,20,55,85
4,TXN,500,15,23
5,TSLA,100,100,300


## Understanding the difference between concat, join, and merge 

The .merge and .join DataFrame (and not Series) methods and the concat function all
provide very similar functionality to combine multiple pandas objects together. As they are
so similar and they can replicate each other in certain situations, it can get very confusing
regarding when and how to use them correctly.

To help clarify their differences, take a look at the following outline:

concat:

- A pandas function
- Combines two or more pandas objects vertically or horizontally
- Aligns only on the index
- Can raise an error when it has to align on an index that contains duplicates
- Defaults to outer join with the option for inner join

.join:

- A DataFrame method
- Combines two or more pandas objects horizontally
- Aligns the calling DataFrame's column(s) or index with the other object's index (and not the columns)
- Handles duplicate values on the joining columns/index by performing a Cartesian product
- Defaults to left join with options for inner, outer, and right

.merge:

- A DataFrame method
- Combines exactly two DataFrames horizontally
- Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
- Handles duplicate values on the joining columns or index by performing a Cartesian product
- Defaults to inner join with options for left, outer, and right
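
The different default join types in the outline above can be checked on two tiny DataFrames with partially overlapping indexes. The names left and right and their values are purely illustrative.

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'y': [10, 20, 30]}, index=['b', 'c', 'd'])

# concat: outer join by default
via_concat = pd.concat([left, right], axis='columns')

# .join: left join by default
via_join = left.join(right)

# .merge: inner join by default (aligned on the index here)
via_merge = left.merge(right, left_index=True, right_index=True)

print(sorted(via_concat.index))  # ['a', 'b', 'c', 'd']
print(sorted(via_join.index))    # ['a', 'b', 'c']
print(sorted(via_merge.index))   # ['b', 'c']
```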
    
In this recipe, we will combine DataFrames. The first situation is simpler with concat while
the second is simpler with .merge.

In [82]:
# Let's read in stock data for 2016, 2017, and 2018
# into a list of DataFrames using a loop instead of
# three different calls to the read_csv function:

years = 2016, 2017, 2018
stock_tables  = [pd.read_csv(f'C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/stocks_{year}.csv', index_col='Symbol') for year in years]

In [83]:
stocks_2016, stocks_2017, stocks_2018 = stock_tables

In [84]:
stocks_2016

Unnamed: 0_level_0,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AAPL,80,95,110
TSLA,50,80,130
WMT,40,55,70


In [85]:
stocks_2017

Unnamed: 0_level_0,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AAPL,50,120,140
GE,100,30,40
IBM,87,75,95
SLB,20,55,85
TXN,500,15,23
TSLA,100,100,300


In [86]:
stocks_2018

Unnamed: 0_level_0,Shares,Low,High
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AAPL,40,135,170
AMZN,8,900,1125
TSLA,50,220,400


In [92]:
# Of concat, .join, and .merge, only the concat function
# can combine DataFrames vertically. Let's do this
# by passing it the list stock_tables:

pd.concat(stock_tables, keys=[2016, 2017, 2018], names=['Year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Shares,Low,High
Year,Symbol,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016,AAPL,80,95,110
2016,TSLA,50,80,130
2016,WMT,40,55,70
2017,AAPL,50,120,140
2017,GE,100,30,40
2017,IBM,87,75,95
2017,SLB,20,55,85
2017,TXN,500,15,23
2017,TSLA,100,100,300
2018,AAPL,40,135,170
2018,AMZN,8,900,1125
2018,TSLA,50,220,400


In [89]:
# It can also combine DataFrames horizontally by 
# changing the axis parameter to columns:
pd.concat(dict(zip(years, stock_tables)), axis='columns')

Unnamed: 0_level_0,2016,2016,2016,2017,2017,2017,2018,2018,2018
Unnamed: 0_level_1,Shares,Low,High,Shares,...,High,Shares,Low,High
Symbol,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
AAPL,80.0,95.0,110.0,50.0,...,140.0,40.0,135.0,170.0
TSLA,50.0,80.0,130.0,100.0,...,300.0,50.0,220.0,400.0
WMT,40.0,55.0,70.0,,...,,,,
GE,,,,100.0,...,40.0,,,
IBM,,,,87.0,...,95.0,,,
SLB,,,,20.0,...,85.0,,,
TXN,,,,500.0,...,23.0,,,
AMZN,,,,,...,,8.0,900.0,1125.0


In [94]:
e = ['abel', 'shade', 'mike', 'edu']
f = [2, 4, 5, 7]
dict(zip(e, f))

{'abel': 2, 'shade': 4, 'mike': 5, 'edu': 7}

Now that we have started combining DataFrames horizontally, we can use the .join
and .merge methods to replicate this functionality of concat. Here, we use the
.join method to combine the stocks_2016 and stocks_2017 DataFrames. By
default, the DataFrames align on their index. If any of the columns have the same
names, then you must supply a value to the lsuffix or rsuffix parameters to
distinguish them in the result:

In [97]:
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')

Unnamed: 0_level_0,Shares_2016,Low_2016,High_2016,Shares_2017,Low_2017,High_2017
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AAPL,80.0,95.0,110.0,50.0,120.0,140.0
GE,,,,100.0,30.0,40.0
IBM,,,,87.0,75.0,95.0
SLB,,,,20.0,55.0,85.0
TSLA,50.0,80.0,130.0,100.0,100.0,300.0
TXN,,,,500.0,15.0,23.0
WMT,40.0,55.0,70.0,,,


In [98]:
# To replicate the output of the horizontal concat call
# above, we can pass a list of DataFrames to
# the .join method:

other = [stocks_2017.add_suffix('_2017'),
        stocks_2018.add_suffix('_2018')]

In [99]:
stocks_2016.add_suffix('_2016').join(other, how='outer')

Unnamed: 0_level_0,Shares_2016,Low_2016,High_2016,Shares_2017,...,High_2017,Shares_2018,Low_2018,High_2018
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AAPL,80.0,95.0,110.0,50.0,...,140.0,40.0,135.0,170.0
TSLA,50.0,80.0,130.0,100.0,...,300.0,50.0,220.0,400.0
WMT,40.0,55.0,70.0,,...,,,,
GE,,,,100.0,...,40.0,,,
IBM,,,,87.0,...,95.0,,,
SLB,,,,20.0,...,85.0,,,
TXN,,,,500.0,...,23.0,,,
AMZN,,,,,...,,8.0,900.0,1125.0


In [100]:
# Let's check whether they are equal:
stock_join = stocks_2016.add_suffix('_2016').join(other, how='outer')

In [136]:
stock_concat = (
    pd.concat(
        dict(zip(years, stock_tables)), axis='columns')
    .swaplevel(axis=1)
    .pipe(lambda df_:
          df_.set_axis(df_.columns.to_flat_index(), axis=1))
    .rename(lambda label:
            "_".join([str(x) for x in label]), axis=1)
    )


In [107]:
stock_join.equals(stock_concat)

True

Now, let's turn to the .merge method that, unlike concat and .join, can only
combine two DataFrames together. By default, .merge attempts to align the values
in the columns that have the same name for each of the DataFrames. However, you
can choose to have it align on the index by setting the Boolean parameters
left_index and right_index to True. Let's merge the 2016 and 2017 stock data
together:

In [140]:
stocks_2016.merge(stocks_2017, left_index=True, right_index=True)

Unnamed: 0_level_0,Shares_x,Low_x,High_x,Shares_y,Low_y,High_y
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
AAPL,80,95,110,50,120,140
TSLA,50,80,130,100,100,300


By default, .merge uses an inner join and automatically supplies suffixes for
identically named columns. Let's change to an outer join and then perform another
outer join of the 2018 data to replicate the behavior of concat. Note that in pandas
1.0, the merge index will be sorted and the concat version won't be:

In [144]:
stock_merge = (stocks_2016
               .merge(stocks_2017, left_index=True,
                     right_index=True, how='outer',
                     suffixes=('_2016', '_2017'))
               .merge(stocks_2018.add_suffix('_2018'), 
                     left_index=True, right_index=True,
                     how='outer')
              )

In [145]:
stock_merge

Unnamed: 0_level_0,Shares_2016,Low_2016,High_2016,Shares_2017,...,High_2017,Shares_2018,Low_2018,High_2018
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
AAPL,80.0,95.0,110.0,50.0,...,140.0,40.0,135.0,170.0
AMZN,,,,,...,,8.0,900.0,1125.0
GE,,,,100.0,...,40.0,,,
IBM,,,,87.0,...,95.0,,,
SLB,,,,20.0,...,85.0,,,
TSLA,50.0,80.0,130.0,100.0,...,300.0,50.0,220.0,400.0
TXN,,,,500.0,...,23.0,,,
WMT,40.0,55.0,70.0,,...,,,,


In [146]:
stock_concat.sort_index().equals(stock_merge)

True

Now let's turn our comparison to datasets where we are interested in aligning
together the values of columns and not the index or column labels themselves. The
.merge method is built for this situation. Let's take a look at two new small datasets,
food_prices and food_transactions:

In [151]:
names = ['prices', 'transactions']
food_tables = [pd.read_csv(f'C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/food_{name}.csv') for name in names]

In [153]:
food_tables

[     item store  price  Date
 0    pear     A   0.99  2017
 1    pear     B   1.99  2017
 2   peach     A   2.99  2017
 3   peach     B   3.49  2017
 4  banana     A   0.39  2017
 5  banana     B   0.49  2017
 6   steak     A   5.99  2017
 7   steak     B   6.99  2017
 8   steak     B   4.99  2015,
    custid     item store  quantity
 0       1     pear     A         5
 1       1   banana     A        10
 2       2    steak     B         3
 3       2     pear     B         1
 4       2    peach     B         2
 5       2    steak     B         1
 6       2  coconut     B         4]

In [156]:
food_prices, food_transactions = food_tables

In [157]:
food_prices

Unnamed: 0,item,store,price,Date
0,pear,A,0.99,2017
1,pear,B,1.99,2017
2,peach,A,2.99,2017
3,peach,B,3.49,2017
4,banana,A,0.39,2017
5,banana,B,0.49,2017
6,steak,A,5.99,2017
7,steak,B,6.99,2017
8,steak,B,4.99,2015


In [158]:
food_transactions

Unnamed: 0,custid,item,store,quantity
0,1,pear,A,5
1,1,banana,A,10
2,2,steak,B,3
3,2,pear,B,1
4,2,peach,B,2
5,2,steak,B,1
6,2,coconut,B,4


In [159]:
# If we wanted to find the total amount of each
# transaction, we would need to join these tables on
# the item and store columns:

food_transactions.merge(food_prices, on=['item', 'store'])

Unnamed: 0,custid,item,store,quantity,price,Date
0,1,pear,A,5,0.99,2017
1,1,banana,A,10,0.39,2017
2,2,steak,B,3,6.99,2017
3,2,steak,B,3,4.99,2015
4,2,steak,B,1,6.99,2017
5,2,steak,B,1,4.99,2015
6,2,pear,B,1,1.99,2017
7,2,peach,B,2,3.49,2017


The price is now aligned correctly with its corresponding item and store, but there is
a problem. Customer 2 now has four steak rows instead of two. As the steak item appears
twice in each table for store B, a Cartesian product takes place between them,
resulting in four rows. Also, notice that the item, coconut, is missing because there
was no corresponding price for it. Let's fix both of these issues:


In [161]:
food_transactions.merge(food_prices.query('Date == 2017'), how='left')

Unnamed: 0,custid,item,store,quantity,price,Date
0,1,pear,A,5,0.99,2017.0
1,1,banana,A,10,0.39,2017.0
2,2,steak,B,3,6.99,2017.0
3,2,pear,B,1,1.99,2017.0
4,2,peach,B,2,3.49,2017.0
5,2,steak,B,1,6.99,2017.0
6,2,coconut,B,4,,
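
Both problems, the unintended Cartesian product and the silently dropped row, can be detected by .merge itself through its validate and indicator parameters. The following is a minimal sketch on made-up, shortened versions of the two food tables.

```python
import pandas as pd
from pandas.errors import MergeError

prices = pd.DataFrame({'item': ['pear', 'steak', 'steak'],
                       'store': ['A', 'B', 'B'],
                       'price': [0.99, 6.99, 4.99],
                       'Date': [2017, 2017, 2015]})
transactions = pd.DataFrame({'item': ['pear', 'steak', 'coconut'],
                             'store': ['A', 'B', 'B'],
                             'quantity': [5, 3, 4]})

# validate='many_to_one' requires the right-hand keys to be unique
try:
    transactions.merge(prices, on=['item', 'store'], validate='many_to_one')
    duplicate_keys = False
except MergeError:
    duplicate_keys = True
print(duplicate_keys)  # True: steak/B appears twice in prices

# indicator=True adds a _merge column showing where each row came from
checked = transactions.merge(prices.query('Date == 2017'),
                             on=['item', 'store'], how='left',
                             validate='many_to_one', indicator=True)
print(checked[['item', '_merge']])
```

Rows marked left_only in the _merge column, such as coconut here, are transactions with no matching price.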


In [178]:
# We can replicate this with the .join method,
# but we must first put the joining columns of the 
# food_prices DataFrame into the index:
food_prices_join = food_prices.query('Date == 2017') \
                   .set_index(['item', 'store'])

In [179]:
food_prices_join

Unnamed: 0_level_0,Unnamed: 1_level_0,price,Date
item,store,Unnamed: 2_level_1,Unnamed: 3_level_1
pear,A,0.99,2017
pear,B,1.99,2017
peach,A,2.99,2017
peach,B,3.49,2017
banana,A,0.39,2017
banana,B,0.49,2017
steak,A,5.99,2017
steak,B,6.99,2017


The .join method only aligns with the index of the passed DataFrame but can use
the index or the columns of the calling DataFrame. To use columns for alignment
on the calling DataFrame, you will need to pass them to the on parameter:

In [180]:
food_transactions.join(food_prices_join, on=['item', 'store'])

Unnamed: 0,custid,item,store,quantity,price,Date
0,1,pear,A,5,0.99,2017.0
1,1,banana,A,10,0.39,2017.0
2,2,steak,B,3,6.99,2017.0
3,2,pear,B,1,1.99,2017.0
4,2,peach,B,2,3.49,2017.0
5,2,steak,B,1,6.99,2017.0
6,2,coconut,B,4,,


The output matches the result from the left merge above. To replicate this with the concat
function, you would need to put the item and store columns into the index of
both DataFrames. However, in this particular case, an error would be produced as
a duplicate index value occurs in at least one of the DataFrames (with item steak
and store B):

In [181]:
pd.concat([food_transactions.set_index(['item', 'store']),
    food_prices.set_index(['item', 'store'])],
     axis='columns')

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
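
One way to anticipate this error is to check the index for duplicates before concatenating horizontally. This sketch uses a made-up, shortened prices table. The is_unique attribute and the duplicated method are standard Index features.

```python
import pandas as pd

prices = pd.DataFrame({'item': ['pear', 'steak', 'steak'],
                       'store': ['A', 'B', 'B'],
                       'price': [0.99, 6.99, 4.99]}
                      ).set_index(['item', 'store'])

# False here, because ('steak', 'B') occurs twice
print(prices.index.is_unique)

# The offending labels, useful for deciding how to deduplicate
dupes = prices.index[prices.index.duplicated(keep=False)].unique()
print(dupes.tolist())  # [('steak', 'B')]
```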

It is possible to read all files from a particular directory into DataFrames without knowing their
names. Python provides a few ways to iterate through directories, with the glob module being
a popular choice. The gas prices directory contains five different CSV files, each having
weekly prices of a particular grade of gas beginning from 2007. Each file has just two columns:
the date for the week and the price. This is a perfect situation to iterate through all the files,
read them into DataFrames, and combine them all together with the concat function.

The glob module has the glob function, which takes a pathname pattern as a string: the location of the
directory you would like to iterate through, followed by a wildcard. To get all the files in the directory, use
the string *. In this example, '*.csv' returns only files that end in .csv. The result from
the glob function is a list of string filenames, which can be passed to the read_csv function:

In [185]:
import glob
df_list = []

for filename in glob.glob('C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/gas prices/*.csv'):
    df_list.append(pd.read_csv(filename, index_col='Week', parse_dates=['Week']))


In [186]:
gas = pd.concat(df_list, axis='columns')

In [187]:
gas

Unnamed: 0_level_0,All Grades,Diesel,Midgrade,Premium,Regular
Week,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2017-09-25,2.701,2.788,2.859,3.105,2.583
2017-09-18,2.750,2.791,2.906,3.151,2.634
2017-09-11,2.800,2.802,2.953,3.197,2.685
2017-09-04,2.794,2.758,2.946,3.191,2.679
2017-08-28,2.513,2.605,2.668,2.901,2.399
...,...,...,...,...,...
2007-01-29,2.213,2.413,2.277,2.381,2.165
2007-01-22,2.216,2.430,2.285,2.391,2.165
2007-01-15,2.280,2.463,2.347,2.453,2.229
2007-01-08,2.354,2.537,2.418,2.523,2.306
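
The same pattern can be tried without the gas prices directory by writing a couple of small files to a temporary folder first. The file names and prices below are made up. Sorting the paths is an addition here, since glob does not promise any particular order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two small hypothetical files shaped like the gas price CSVs
    samples = {'Regular': [2.58, 2.63], 'Diesel': [2.79, 2.80]}
    for grade, prices in samples.items():
        pd.DataFrame({'Week': ['2017-09-25', '2017-09-18'],
                      grade: prices}
                     ).to_csv(os.path.join(tmp, f'{grade}.csv'), index=False)

    # glob returns paths in arbitrary order, so sort for a stable column order
    paths = sorted(glob.glob(os.path.join(tmp, '*.csv')))
    frames = [pd.read_csv(p, index_col='Week', parse_dates=['Week'])
              for p in paths]

# Every file shares the Week index, so the columns line up by date
gas = pd.concat(frames, axis='columns')
print(gas)
```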


## Connecting to SQL databases

In [36]:
# Before we can begin reading tables from the chinook
# database, we need to set up our SQLAlchemy engine:
pd.set_option('display.max_columns', 4, 'display.max_rows', 20)
from sqlalchemy import create_engine
engine = create_engine('sqlite:///C:/Users/justine.o_kobo360/Desktop/Pandas Workbook/Pandas CookBook 1.x/Data files/chinook.db')

We can now step back into the world of pandas and remain there for the rest of
the recipe. Let's read in the tracks table with the
read_sql_table function. The name of the table is the first argument and the
SQLAlchemy engine is the second:

In [3]:
tracks = pd.read_sql_table('tracks', engine)

In [6]:
tracks

Unnamed: 0,TrackId,...,UnitPrice
0,1,...,0.99
1,2,...,0.99
2,3,...,0.99
3,4,...,0.99
4,5,...,0.99
...,...,...,...
3498,3499,...,0.99
3499,3500,...,0.99
3500,3501,...,0.99
3501,3502,...,0.99


For the rest of the recipe, we will answer a couple of different specific queries with
help from the database diagram. To begin, let's find the average length of song per
genre:

In [44]:
(pd.read_sql_table('genres', engine)
 .merge(tracks[['GenreId', 'Milliseconds']],
        on='GenreId', how='left'
       ).drop('GenreId', axis=1)
)
 

Unnamed: 0,Name,Milliseconds
0,Rock,343719
1,Rock,342562
2,Rock,230619
3,Rock,252051
4,Rock,375418
...,...,...
3498,Classical,286741
3499,Classical,139200
3500,Classical,66639
3501,Classical,221331


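Now we can easily find the average length of each song per genre. To help ease
interpretation, we convert the mean Milliseconds values with pd.to_datetime. The
1970-01-01 date in each result is only the Unix epoch that the milliseconds are
counted from, so read the time portion as the length. (pd.to_timedelta with the
same unit would return a true timedelta data type instead.)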

In [46]:
(pd.read_sql_table('genres', engine)
 .merge(tracks[['GenreId', 'Milliseconds']],
        on='GenreId', how='left'
       ).drop('GenreId', axis=1)
 .groupby('Name')
 ['Milliseconds']
 .mean()
 .pipe(lambda s_: pd.to_datetime(s_, unit='ms')
      .rename('Length'))
 .dt.floor('s')
 .sort_values()

)

Name
Rock And Roll      1970-01-01 00:02:14
Opera              1970-01-01 00:02:54
Hip Hop/Rap        1970-01-01 00:02:58
Easy Listening     1970-01-01 00:03:09
Bossa Nova         1970-01-01 00:03:39
                           ...        
Comedy             1970-01-01 00:26:25
TV Shows           1970-01-01 00:35:45
Drama              1970-01-01 00:42:55
Science Fiction    1970-01-01 00:43:45
Sci Fi & Fantasy   1970-01-01 00:48:31
Name: Length, Length: 25, dtype: datetime64[ns]
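
If a duration without the epoch date is preferred, pd.to_timedelta accepts the same unit argument. The following is a minimal sketch on two made-up mean values, not the actual query result.

```python
import pandas as pd

# Hypothetical mean track lengths in milliseconds
mean_ms = pd.Series({'Rock And Roll': 134_600.0, 'Drama': 2_575_200.0},
                    name='Milliseconds')

length = (pd.to_timedelta(mean_ms, unit='ms')
          .dt.floor('s')       # drop the fractional seconds
          .rename('Length'))
print(length)
```

The result has the timedelta64[ns] data type, so values can be compared or added without any reference to a calendar date.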

Now let's find the total amount spent per customer. We will need the customers,
invoices, and invoice_items tables all connected to each other:

In [34]:
cust = pd.read_sql_table('customers', engine,
                        columns=['CustomerId', 'FirstName',
                                'LastName'])
invoice = pd.read_sql_table('invoices', engine,
                        columns=['InvoiceId','CustomerId'])
invoice_items = pd.read_sql_table('invoice_items', engine, 
                                 columns=['InvoiceId', 'UnitPrice', 'Quantity'])

In [37]:
(cust
 .merge(invoice, on='CustomerId')
 .merge(invoice_items, on='InvoiceId')
)

Unnamed: 0,CustomerId,FirstName,...,UnitPrice,Quantity
0,1,Luís,...,1.99,1
1,1,Luís,...,1.99,1
2,1,Luís,...,0.99,1
3,1,Luís,...,0.99,1
4,1,Luís,...,0.99,1
...,...,...,...,...,...
2235,59,Puja,...,0.99,1
2236,59,Puja,...,0.99,1
2237,59,Puja,...,0.99,1
2238,59,Puja,...,0.99,1


In [48]:
# We can now multiply the quantity by the unit price
# and then find the total amount spent per customer:

(cust
 .merge(invoice, on='CustomerId')
 .merge(invoice_items, on='InvoiceId')
 .assign(Total=lambda df_: df_.Quantity * df_.UnitPrice)
 .groupby(['CustomerId', 'FirstName', 'LastName'])
 ['Total']
 .sum()
 .sort_values(ascending=False)
)

CustomerId  FirstName  LastName  
6           Helena     Holý          49.62
26          Richard    Cunningham    47.62
57          Luis       Rojas         46.62
45          Ladislav   Kovács        45.62
46          Hugh       O'Reilly      45.62
                                     ...  
31          Martha     Silk          37.62
32          Aaron      Mitchell      37.62
33          Ellie      Sullivan      37.62
35          Madalena   Sampaio       37.62
59          Puja       Srivastava    36.64
Name: Total, Length: 59, dtype: float64

If you are adept with SQL, you can write a SQL query as a string and pass it to the
read_sql_query function. For example, the following will reproduce the average song length per genre computed earlier:

In [73]:
sql_string1 = '''
SELECT
Name,
time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
FROM (
      SELECT
      g.Name,
      t.Milliseconds
      FROM
        genres as g
      JOIN
         tracks as t on
         g.genreid = t.genreid
)

GROUP BY Name
ORDER BY avg_time

'''

In [74]:
pd.read_sql_query(sql_string1, engine)

Unnamed: 0,Name,avg_time
0,Rock And Roll,00:02:14
1,Opera,00:02:54
2,Hip Hop/Rap,00:02:58
3,Easy Listening,00:03:09
4,Bossa Nova,00:03:39
...,...,...
20,Comedy,00:26:25
21,TV Shows,00:35:45
22,Drama,00:42:55
23,Science Fiction,00:43:45
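
SQLite's 'unixepoch' modifier interprets its number as seconds, so a value stored in milliseconds has to be divided by 1000 before it is formatted. The sketch below checks this on a single made-up average using Python's built-in sqlite3 module and an in-memory database, so it does not need the chinook file.

```python
import sqlite3

con = sqlite3.connect(':memory:')

# A hypothetical average track length of 134,600 ms (2 min 14.6 s)
avg_ms = 134_600.0

correct, too_long = con.execute(
    "SELECT time(? / 1000, 'unixepoch'), time(? / 100, 'unixepoch')",
    (avg_ms, avg_ms)).fetchone()
con.close()

print(correct)   # 00:02:14
print(too_long)  # 00:22:26, ten times the real length
```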


In [59]:
sql_string2 = '''
SELECT
      c.customerid,
      c.FirstName,
      c.LastName,
      sum(ii.quantity * ii.unitprice) as Total
FROM

      customers as c
      
JOIN 
     invoices as i
     on c.customerid = i.customerid
     
JOIN 
     invoice_items as ii
     on i.invoiceid = ii.invoiceid
     
GROUP BY 
     c.customerid, c.FirstName, c.LastName
     
ORDER BY
      Total desc'''

In [60]:
pd.read_sql_query(sql_string2, engine)

Unnamed: 0,CustomerId,FirstName,LastName,Total
0,6,Helena,Holý,49.62
1,26,Richard,Cunningham,47.62
2,57,Luis,Rojas,46.62
3,45,Ladislav,Kovács,45.62
4,46,Hugh,O'Reilly,45.62
...,...,...,...,...
54,53,Phil,Hughes,37.62
55,54,Steve,Murray,37.62
56,55,Mark,Taylor,37.62
57,56,Diego,Gutiérrez,37.62
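
To compare the SQL and pandas routes side by side without the chinook file, the following sketch builds a tiny in-memory SQLite database with made-up rows that mimic the three tables. It passes a plain sqlite3 connection to read_sql_query, which pandas accepts in place of a SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE customers (CustomerId INTEGER, FirstName TEXT);
    CREATE TABLE invoices (InvoiceId INTEGER, CustomerId INTEGER);
    CREATE TABLE invoice_items (InvoiceId INTEGER, UnitPrice REAL,
                                Quantity INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO invoices VALUES (10, 1), (11, 2);
    INSERT INTO invoice_items VALUES (10, 1.5, 2), (10, 1.0, 1), (11, 2.0, 1);
''')

query = '''
SELECT c.CustomerId, c.FirstName,
       sum(ii.Quantity * ii.UnitPrice) AS Total
FROM customers AS c
JOIN invoices AS i ON c.CustomerId = i.CustomerId
JOIN invoice_items AS ii ON i.InvoiceId = ii.InvoiceId
GROUP BY c.CustomerId, c.FirstName
ORDER BY Total DESC
'''
from_sql = pd.read_sql_query(query, con)

# The same result computed in pandas from the raw tables
tables = {name: pd.read_sql_query(f'SELECT * FROM {name}', con)
          for name in ['customers', 'invoices', 'invoice_items']}
con.close()

from_pandas = (
    tables['customers']
    .merge(tables['invoices'], on='CustomerId')
    .merge(tables['invoice_items'], on='InvoiceId')
    .assign(Total=lambda df_: df_.Quantity * df_.UnitPrice)
    .groupby(['CustomerId', 'FirstName'], as_index=False)['Total'].sum()
    .sort_values('Total', ascending=False, ignore_index=True)
)

print(from_sql)
print(from_pandas)
```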
