<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-pandas-and-the-Cleveland-Museum-of-Art-(CMA)-collections-data" data-toc-modified-id="Load-pandas-and-the-Cleveland-Museum-of-Art-(CMA)-collections-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load pandas and the Cleveland Museum of Art (CMA) collections data</a></span></li><li><span><a href="#Look-at-the-citations-data" data-toc-modified-id="Look-at-the-citations-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Look at the citations data</a></span></li><li><span><a href="#Look-at-the-creators-data" data-toc-modified-id="Look-at-the-creators-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Look at the creators data</a></span></li><li><span><a href="#Show-duplicates-of-merge-by-values-in-the-citations-data" data-toc-modified-id="Show-duplicates-of-merge-by-values-in-the-citations-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Show duplicates of merge-by values in the citations data</a></span></li><li><span><a href="#Show-duplicates-of-the-merge-by-values-in-the-creators-data" data-toc-modified-id="Show-duplicates-of-the-merge-by-values-in-the-creators-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Show duplicates of the merge-by values in the creators data</a></span></li><li><span><a href="#Check-the-merge" data-toc-modified-id="Check-the-merge-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Check the merge</a></span></li><li><span><a href="#Show-a-merge-by-value-duplicated-in-both-DataFrames" data-toc-modified-id="Show-a-merge-by-value-duplicated-in-both-DataFrames-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Show a merge-by value duplicated in both DataFrames</a></span></li><li><span><a href="#Do-a-many-to-many-merge" data-toc-modified-id="Do-a-many-to-many-merge-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Do a many-to-many merge</a></span></li></ul></div>

# Load pandas and the Cleveland Museum of Art (CMA) collections data

In [1]:
import pandas as pd

In [2]:
import watermark
%load_ext watermark

%watermark -n -i -iv

watermark: 2.1.0
json     : 2.0.9
pandas   : 1.2.1



In [3]:
cmacitations = pd.read_csv('data/cmacitations.csv')
cmacreators = pd.read_csv('data/cmacreators.csv')

# Look at the citations data

In [4]:
cmacitations.head(10)

,id,citation
0,92937,"Milliken, William M. ""The Second Exhibition of..."
1,92937,"Glasier, Jessie C. ""Museum Gets Prize-Winning ..."
2,92937,"""Cleveland Museum Acquires Typical Pictures by..."
3,92937,"Milliken, William M. ""Two Examples of Modern P..."
4,92937,<em>Memorial Exhibition of the Work of George ...
5,92937,The Cleveland Museum of Art. <em>Handbook of t...
6,92937,"Cortissoz, Royal. ""Paintings and Prints by Geo..."
7,92937,"Isham, Samuel, and Royal Cortissoz. <em>The Hi..."
8,92937,"Mather, Frank Jewett, Charles Rufus Morey, and..."
9,92937,"""Un Artiste Americain."" <em>L'illustration.</e..."


In [5]:
cmacitations.shape

(11642, 2)

In [9]:
cmacitations['id'].nunique()

935

# Look at the creators data

In [7]:
cmacreators.loc[:, ['id', 'creator', 'birth_year']].head(10)

,id,creator,birth_year
0,92937,"George Bellows (American, 1882-1925)",1882
1,94979,"John Singleton Copley (American, 1738-1815)",1738
2,137259,"Gustave Courbet (French, 1819-1877)",1819
3,141639,"Frederic Edwin Church (American, 1826-1900)",1826
4,93014,"Thomas Cole (American, 1801-1848)",1801
5,110180,"Albert Pinkham Ryder (American, 1847-1917)",1847
6,135299,"Vincent van Gogh (Dutch, 1853-1890)",1853
7,125249,"Vincent van Gogh (Dutch, 1853-1890)",1853
8,126769,"Henri Rousseau (French, 1844-1910)",1844
9,135382,"Claude Monet (French, 1840-1926)",1840


In [8]:
cmacreators.shape

(737, 8)

In [10]:
cmacreators['id'].nunique()

654

# Show duplicates of merge-by values in the citations data

In [13]:
cmacitations['id'].value_counts().head(10)

148758    174
122351    116
92937      98
123168     94
149112     93
94979      93
124245     87
128842     86
102578     84
93014      79
Name: id, dtype: int64

# Show duplicates of the merge-by values in the creators data

In [14]:
cmacreators['id'].value_counts().head(10)

140001    4
149386    4
146797    3
146795    3
149041    3
142753    3
114538    3
140427    3
114537    3
142752    3
Name: id, dtype: int64
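
Since both frames repeat some `id` values, it can help to list the ids that are duplicated on both sides, because those are the ones a many-to-many merge will multiply. The following is a minimal sketch on hypothetical toy frames (`left`, `right`) rather than the CMA data; the helper name `duplicated_ids` is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for cmacitations and cmacreators
left = pd.DataFrame({'id': [1, 1, 2, 3, 3]})
right = pd.DataFrame({'id': [1, 3, 3, 4, 4]})

def duplicated_ids(df, idvar):
    """Return the set of idvar values that occur more than once in df."""
    return set(df.loc[df[idvar].duplicated(), idvar])

# Ids repeated in both frames are the many-to-many cases
dup_both = duplicated_ids(left, 'id') & duplicated_ids(right, 'id')
print(sorted(dup_both))
```

On the real data, the same helper could be called as `duplicated_ids(cmacitations, 'id') & duplicated_ids(cmacreators, 'id')`.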

# Check the merge

In [17]:
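def checkmerge(dfleft, dfright, idvar):
    """Cross-tabulate merged rows by whether idvar is in the left and right data.

    Adds flag columns to both inputs in place, so pass copies.
    """
    dfleft['inleft'] = 'Y'
    dfright['inright'] = 'Y'
    dfboth = pd.merge(dfleft[[idvar, 'inleft']],
                      dfright[[idvar, 'inright']],
                      on=[idvar],
                      how='outer')
    # After an outer join, a missing flag means the id was absent on that side
    dfboth.fillna('N', inplace=True)
    print(pd.crosstab(dfboth['inleft'], dfboth['inright']))
    # print(dfboth.loc[(dfboth['inleft'] == 'N') | (dfboth['inright'] == 'N')])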

In [18]:
checkmerge(cmacitations.copy(), cmacreators.copy(), 'id')

inright     N     Y
inleft             
N           0    46
Y        2579  9701
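
The same kind of check can also be done with the `indicator` parameter of `pd.merge`, which adds a `_merge` column marking each merged row as `left_only`, `right_only`, or `both`. This is an alternative sketch on hypothetical toy frames (`left`, `right`), not the CMA data. As with the crosstab above, the counts are merged rows, not unique ids.

```python
import pandas as pd

# Hypothetical toy data standing in for cmacitations and cmacreators
left = pd.DataFrame({'id': [1, 1, 2, 3], 'citation': ['a', 'b', 'c', 'd']})
right = pd.DataFrame({'id': [1, 3, 3, 4], 'creator': ['w', 'x', 'y', 'z']})

# indicator=True adds a categorical _merge column to the result
merged = pd.merge(left[['id']], right[['id']],
                  on='id', how='outer', indicator=True)

# Count merged rows by where their id was found
counts = merged['_merge'].value_counts()
print(counts)
```

Unlike `checkmerge`, this approach does not add flag columns to the input DataFrames, so no copies are needed.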


# Show a merge-by value duplicated in both DataFrames

In [19]:
cmacitations.loc[cmacitations['id'] == 124733]

,id,citation
8963,124733,"Weigel, J. A. G. <em>Catalog einer Sammlung vo..."
8964,124733,"Winkler, Friedrich. <em>Die Zeichnungen Albrec..."
8965,124733,"Francis, Henry S. ""Drawing of a Dead Blue Jay ..."
8966,124733,"Kurz, Otto. <em>Fakes: A Handbook for Collecto..."
8967,124733,Minneapolis Institute of Arts. <em>Watercolors...
8968,124733,"Pilz, Kurt. ""Hans Hoffmann: Ein Nürnberger Dür..."
8969,124733,"Koschatzky, Walter and Alice Strobl. <em>Düre..."
8970,124733,"Johnson, Mark M<em>. Idea to Image: Preparator..."
8971,124733,"Kaufmann, Thomas DaCosta. <em>Drawings from th..."
8972,124733,"Koreny, Fritz. <em>Albrecht Dürer and the ani..."


In [21]:
cmacreators.loc[cmacreators['id'] == 124733,
                ['id', 'creator', 'birth_year', 'title']]

,id,creator,birth_year,title
449,124733,"Albrecht Dürer (German, 1471-1528)",1471,Dead Blue Roller
450,124733,"Hans Hoffmann (German, 1545/50-1591/92)",1545/50,Dead Blue Roller
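
When an id is repeated on both sides like this, the merge is many-to-many. If a different relationship is expected, the `validate` parameter of `pd.merge` can confirm it, raising `pandas.errors.MergeError` when the merge keys are not unique where they should be. The sketch below uses hypothetical toy frames, and the wrapper `is_valid_merge` is a name invented here for illustration.

```python
import pandas as pd

# Hypothetical toy data: id 1 is repeated in both frames
left = pd.DataFrame({'id': [1, 1, 2], 'citation': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [1, 1, 3], 'creator': ['x', 'y', 'z']})

def is_valid_merge(dfleft, dfright, idvar, relationship):
    """Return True if the merge satisfies the stated key relationship."""
    try:
        pd.merge(dfleft, dfright, on=idvar, how='outer',
                 validate=relationship)
    except pd.errors.MergeError:
        # pandas raises MergeError when a side that must be unique is not
        return False
    return True

print(is_valid_merge(left, right, 'id', 'many_to_one'))
print(is_valid_merge(left, right, 'id', 'many_to_many'))
```

Note that `validate='many_to_many'` is accepted but performs no check, so it documents intent rather than enforcing anything.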


# Do a many-to-many merge

In [22]:
# Outer join on id: an id repeated in both DataFrames yields every pairing of its rows
cma = pd.merge(cmacitations, cmacreators, on=['id'], how='outer')

In [23]:
cma['citation'] = cma['citation'].str[0:20]

In [24]:
cma['creator'] = cma['creator'].str[0:20]

In [25]:
cma.loc[cma['id'] == 124733, ['id', 'creator', 'birth_year', 'title']]

,id,creator,birth_year,title
9457,124733,Albrecht Dürer (Germ,1471,Dead Blue Roller
9458,124733,Hans Hoffmann (Germa,1545/50,Dead Blue Roller
9459,124733,Albrecht Dürer (Germ,1471,Dead Blue Roller
9460,124733,Hans Hoffmann (Germa,1545/50,Dead Blue Roller
9461,124733,Albrecht Dürer (Germ,1471,Dead Blue Roller
9462,124733,Hans Hoffmann (Germa,1545/50,Dead Blue Roller
9463,124733,Albrecht Dürer (Germ,1471,Dead Blue Roller
9464,124733,Hans Hoffmann (Germa,1545/50,Dead Blue Roller
9465,124733,Albrecht Dürer (Germ,1471,Dead Blue Roller
9466,124733,Hans Hoffmann (Germa,1545/50,Dead Blue Roller
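
The repetition above follows from how a many-to-many merge works: for each merge-by value present on both sides, the result has one row per pairing of a left row with a right row, so the row count for that value is the product of its counts in the two DataFrames. With `how='outer'`, values found on only one side keep their original rows. A minimal sketch of that arithmetic on hypothetical toy frames, not the CMA data:

```python
import pandas as pd

# Hypothetical toy data: id 1 appears 3 times on the left, 2 on the right
left = pd.DataFrame({'id': [1, 1, 1, 2], 'citation': ['a', 'b', 'c', 'd']})
right = pd.DataFrame({'id': [1, 1, 2, 3], 'creator': ['x', 'y', 'z', 'w']})

# Per-id row counts, aligned on the union of ids (0 where an id is absent)
lc, rc = left['id'].value_counts().align(right['id'].value_counts(),
                                         fill_value=0)

# Matched ids contribute count_left * count_right rows;
# unmatched ids contribute their own rows unchanged
matched = (lc > 0) & (rc > 0)
expected_rows = int((lc * rc).where(matched, lc + rc).sum())

merged = pd.merge(left, right, on='id', how='outer')
print(expected_rows, len(merged))
```

Comparing the expected and actual row counts this way is a quick sanity check before relying on a many-to-many merge.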
