# Data Analysis

To understand the bibliographic catalogue data of the Swissbib platform, the project team has generated a sample .json file with a large number of representative records. This chapter analyses this data delivery.

## Table of Contents

- [Sample Records Analysis](#sample_records_analysis)
    - [Book](#book)
    - [Music](#music)
    - [Video Material](#video_material)
    - [Map](#map)
    - [Periodical](#periodical)
    - [Collection](#collection)
    - [Computer File](#computer_file)
- [General Observation](#general_observation)
- [Attribute Analysis](#attribute_analysis)

## Sample Records Analysis<a id='sample_records_analysis'/>

In this section, the data file is loaded and some sample data records are shown.

In [1]:
import os
import json

records = []
path_data = './data'

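# The file holds one JSON record per line; the context manager closes it after reading
with open(os.path.join(path_data, 'job7r4A1.json'), 'r', encoding='utf-8') as data_file:
    for line in data_file:
        records.append(json.loads(line))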

print('Number of data records loaded', len(records))

Number of data records loaded 183407


In [2]:
import pandas as pd

# Generate Pandas DataFrame object out of the raw data
df = pd.DataFrame(records)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df.columns)

df.head()

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes
0,"[(OCoLC)1085491204, (IDSBB)006899773]",1992,[],{},1992,554061449,[],,1992,[BK020000],[],[],,[S. 102-114],[],"{'100': ['SchluchterAndré'], '245c': ['André S...",[],[],1992,,"{'245': ['Die Feist von Kienberg', 'eine Wasen...","{'245': ['Die Feist von Kienberg', 'eine Wasen...",[S. 102-114]
1,"[(OCoLC)1085491341, (IDSBB)006899983]",1986,[],{},1986,554061481,[],,1986,[BK020000],[],[],,[43 S.],[],"{'100': ['NaegeliWerner'], '245c': ['von Werne...",[],[],1986,,{'245': ['Reimereien']},{'245': ['Reimereien']},[43 S.]
2,"[(OCoLC)1085491299, (IDSBB)006899959]",1923,[],{},1923,554061503,[],,1923,[BK020000],[],[],,[75 p.],[],"{'100': ['KellyM.V.'], '700': ['GeniesseJ.B.']...",[],[],1923,,{'245': ['Efficax antidotum ad matrimonia mixt...,{'245': ['Efficax antidotum ad matrimonia mixt...,[75 p.]
3,"[(OCoLC)1085491268, (IDSBB)006896614]",1947,[],{},1947,554061511,[],,1947,[BK020000],[],[],,[24 S.],[],"{'100': ['WegelinWalter'], '245c': ['']}",[],[],1947,,"{'245': ['Probleme der Inflationsbekämpfung', ...","{'245': ['Probleme der Inflationsbekämpfung', ...",[24 S.]
4,"[(OCoLC)1085491079, (IDSBB)006896866]",1991,[],{},1991,55406152X,[],,1991,[BK020000],[],[],,[p. 14-15],[],"{'100': ['OberlinUrs'], '245c': ['Urs Oberlin ...",[],[],1991,,{'245': ['[Poems]']},{'245': ['[Poems]']},[p. 14-15]


In [3]:
print('Number of records {:d}, number of attributes per record {:d}.\n'.format(
    len(df), len(df.columns)))

df.info()

Number of records 183407, number of attributes per record 23.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183407 entries, 0 to 183406
Data columns (total 23 columns):
035liste      183407 non-null object
century       183407 non-null object
coordinate    183407 non-null object
corporate     183407 non-null object
decade        183407 non-null object
docid         183407 non-null object
doi           183407 non-null object
edition       183407 non-null object
exactDate     183407 non-null object
format        183407 non-null object
isbn          183407 non-null object
ismn          183407 non-null object
musicid       183407 non-null object
pages         183407 non-null object
part          183407 non-null object
person        183407 non-null object
pubinit       183407 non-null object
pubword       183407 non-null object
pubyear       183407 non-null object
scale         183407 non-null object
ttlfull       183407 non-null object
ttlpart       183407 non-null object
volumes     

Swissbib data describes different kinds of bibliographic units, see [format](#format). The following subsections show one sample record for each kind of unit. They interpret the format code only roughly, by its first two characters; for the full definition, compare Swissbib's [format codes](http://www.swissbib.org/wiki/index.php?title=Filtering#format_codes).

In [4]:
df.format.str[0].str[:2].unique()

array(['BK', 'MU', 'VM', 'MP', 'CR', 'CL', 'CF', nan], dtype=object)
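
The rough interpretation by prefix can be illustrated on a small scale. The following is a minimal sketch, not part of the original analysis: the miniature DataFrame `sample` and the dictionary `labels` are assumptions made for illustration, with the labels taken from the subsection titles below.

```python
import pandas as pd

# Hypothetical miniature of the 'format' column: each entry is a list of codes.
sample = pd.DataFrame({
    'format': [['BK020000'], ['MU040100'], ['BK010053'], [], ['MP010300']]
})

# First code of each list, then its two-letter prefix; an empty list yields NaN.
prefix = sample['format'].str[0].str[:2]

# Labels as used in the subsections of this chapter (illustrative mapping).
labels = {
    'BK': 'Book', 'MU': 'Music', 'VM': 'Video Material', 'MP': 'Map',
    'CR': 'Periodical', 'CL': 'Collection', 'CF': 'Computer File',
}
kind = prefix.map(labels)

# Number of records per prefix, including records without any format code.
counts = prefix.value_counts(dropna=False)
print(counts)
```

Records without a format code end up as `NaN`, which matches the `nan` entry in the array of unique prefixes above.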

### Book<a id='book'/>

A format code starting with $\texttt{BK}$ stands for a bibliographic unit of a book or an article. A sample record is shown below.

In [5]:
df.loc[df[df.format.str[0].str[:2]=='BK'].index[0]]

035liste                  [(OCoLC)1085491204, (IDSBB)006899773]
century                                                    1992
coordinate                                                   []
corporate                                                    {}
decade                                                     1992
docid                                                 554061449
doi                                                          []
edition                                                        
exactDate                                              1992    
format                                               [BK020000]
isbn                                                         []
ismn                                                         []
musicid                                                        
pages                                              [S. 102-114]
part                                                         []
person        {'100': ['SchluchterAndré'

### Music<a id='music'/>

A format code starting with $\texttt{MU}$ stands for a bibliographic unit that is related to music. A sample record is shown below.

In [6]:
df.loc[df[df.format.str[0].str[:2]=='MU'].index[0]]

035liste                  [(OCoLC)1085495414, (IDSBB)007052696]
century                                                    2018
coordinate                                                   []
corporate                                                    {}
decade                                                     2018
docid                                                 554098806
doi                                                          []
edition                                                        
exactDate                                              20182018
format                                               [MU040100]
isbn                                                         []
ismn                                                         []
musicid                                              GCD 924201
pages                                            [1 CD (69:42)]
part                                                         []
person        {'100': ['BrahmsJohannes18

### Video Material<a id='video_material'/>

A format code starting with $\texttt{VM}$ represents some film material. A sample record is shown below.

In [7]:
df.loc[df[df.format.str[0].str[:2]=='VM'].index[0]]

035liste      [(OCoLC)1065768412, (IDSBB)007052702, (OCoLC)1...
century                                                    2018
coordinate                                                   []
corporate                                                    {}
decade                                                     2018
docid                                                 554098911
doi                                                          []
edition                                                        
exactDate                                              20182008
format                                               [VM010000]
isbn                         [978-3-946274-20-9, 3-946274-20-X]
ismn                                                         []
musicid                                                        
pages                                  [2 DVD-Videos (283 min)]
part                                                     [2004]
person        {'100': [], '700': ['Schli

### Map<a id='map'/>

A format code starting with $\texttt{MP}$ is a map. A sample record is shown below.

In [8]:
df.loc[df[df.format.str[0].str[:2]=='MP'].index[0]]

035liste                  [(OCoLC)1085495396, (IDSBB)007052708]
century                                                    1993
coordinate                                 [E0035400, N0532700]
corporate                                                    {}
decade                                                     1993
docid                                                 554099039
doi                                                          []
edition                                               Facsimile
exactDate                                              19931604
format                                               [MP010300]
isbn                                                         []
ismn                                                         []
musicid                                                        
pages                                                 [1 Karte]
part                                                         []
person        {'100': ['BlaeuWillem Jans

### Periodical<a id='periodical'/>

A format code starting with $\texttt{CR}$ is a bibliographic unit of a periodical. A sample record is shown below.

In [9]:
df.loc[df[df.format.str[0].str[:2]=='CR'].index[0]]

035liste                   [(OCoLC)699516877, (IDSBB)007052728]
century                                                    1812
coordinate                                                   []
corporate                                                    {}
decade                                                     1812
docid                                                 55409939X
doi                                                          []
edition                                                        
exactDate                                              18121826
format                                               [CR030653]
isbn                                                [1533-4406]
ismn                                                         []
musicid                                                        
pages                                        [Online-Ressource]
part                                                         []
person                                {'

### Collection<a id='collection'/>

A format code starting with $\texttt{CL}$ is a collection. A sample record is shown below.

In [10]:
df.loc[df[df.format.str[0].str[:2]=='CL'].index[0]]

035liste                  [(OCoLC)1085510940, (IDSBB)007052979]
century                                                    2017
coordinate                                                   []
corporate               {'110': ['St. Gallischer Hilfsverein']}
decade                                                     2017
docid                                                 554101610
doi                                                          []
edition                                                        
exactDate                                              20179999
format                                               [CL010000]
isbn                                                         []
ismn                                                         []
musicid                                                        
pages                                                 [1 Mappe]
part                                                         []
person                                {'

### Computer File<a id='computer_file'/>

A format code starting with $\texttt{CF}$ is a placeholder for a computer file on any kind of storage. A sample record is shown below.

In [11]:
df.loc[df[df.format.str[0].str[:2]=='CF'].index[0]]

035liste                   [(IDSBB)007008154, (RERO)R007245313]
century                                                    2012
coordinate                                                   []
corporate                                                    {}
decade                                                     2012
docid                                                 554144301
doi                                                          []
edition                                                        
exactDate                                              2012    
format                                               [CF010000]
isbn                                                         []
ismn                                                         []
musicid                                                        
pages                                         [1 Speicherkarte]
part                                                         []
person                                {'

## General Observation<a id='general_observation'/>

As can be observed in the sample records displayed above, the attributes of the records are stored in basic Python datatypes like strings, lists (of strings), and dictionaries. A look into the raw data file confirms this observation.

In [12]:
! head -n 2 ./data/job7r4A1.json

{"docid":"554061449","035liste":["(OCoLC)1085491204","(IDSBB)006899773"],"isbn":[],"ttlfull":{"245":["Die Feist von Kienberg","eine Wasenmeisterfamilie im Ancien Régime zwischen Ehrbarkeit und Delinquenz"]},"ttlpart":{"245":["Die Feist von Kienberg","eine Wasenmeisterfamilie im Ancien Régime zwischen Ehrbarkeit und Delinquenz"]},"person":{"100":["SchluchterAndré"],"245c":["André Schluchter"]},"corporate":{},"pubyear":"1992    ","decade":"1992","century":"1992","exactDate":"1992    ","edition":"","part":[],"pages":["S. 102-114"],"volumes":["S. 102-114"],"pubinit":[],"pubword":[],"scale":"","coordinate":[],"doi":[],"ismn":[],"musicid":"","format":["BK020000"]}
{"docid":"554061481","035liste":["(OCoLC)1085491341","(IDSBB)006899983"],"isbn":[],"ttlfull":{"245":["Reimereien"]},"ttlpart":{"245":["Reimereien"]},"person":{"100":["NaegeliWerner"],"245c":["von Werner Naegeli"]},"corporate":{},"pubyear":"1986    ","decade":"1986","century":"1986","exactDate":"1986    ","edition":"","part":[],"pa
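
The datatypes can also be checked programmatically instead of by eye. The sketch below is an illustration under assumptions: it uses a hypothetical two-record excerpt `records` with three of the attributes and collects the Python type names found in each column.

```python
import pandas as pd

# Hypothetical excerpt of two records with three of the attributes.
records = [
    {'docid': '554061449', 'isbn': [], 'person': {'100': ['SchluchterAndré']}},
    {'docid': '554061481', 'isbn': [], 'person': {'100': ['NaegeliWerner']}},
]
sample = pd.DataFrame(records)

# For each column, collect the names of the Python types that occur in it.
types_per_column = {
    column: sorted(sample[column].map(lambda value: type(value).__name__).unique())
    for column in sample.columns
}
print(types_per_column)
# {'docid': ['str'], 'isbn': ['list'], 'person': ['dict']}
```

Applied to the full DataFrame, a column that reports more than one type name would be a candidate for closer inspection.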

Each attribute, its meaning and its contents are analysed in detail in the next section. For that analysis, each attribute of the data records is assigned to a group according to its datatype. This is done with the following dictionary.

In [13]:
column_types_dict = {
    'strings_columns' : ['century', 'decade', 'docid', 'exactDate'],
    'list_columns' : ['volumes', '035liste'],
    'array_of_strings_columns' : []
}
# Note: the lists below are standalone variables and are not part of column_types_dict
strings_columns = ['doi', 'edition', 'format', 'isbn',
                   'ismn', 'musicid', 'pubinit', 'pubyear', 'scale', 'volumes']
list_columns = ['coordinate', 'corporate', 'pages', 'part', 'pubword',
                'ttlfull', 'ttlpart']
array_of_strings_columns = ['person']
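
How such a grouping can drive the analysis is sketched below. The function `is_empty_by_group` is a hypothetical stand-in written for this illustration; the actual helper in `data_analysis_funcs.py` is not shown in this chapter and may work differently. The sketch assumes that an empty string attribute is a blank string and an empty list attribute is a list of length zero.

```python
import pandas as pd

# Hypothetical stand-in; the real helper in data_analysis_funcs.py may differ.
def is_empty_by_group(df, column_types, column):
    """Return a boolean Series that marks the empty values of a column."""
    if column in column_types['strings_columns']:
        # Assumption: a string attribute is empty if it is blank.
        return df[column].str.strip() == ''
    if column in column_types['list_columns']:
        # Assumption: a list attribute is empty if it has no elements.
        return df[column].apply(len) == 0
    raise KeyError(f'{column} is not assigned to a known group')

column_types = {
    'strings_columns': ['docid'],
    'list_columns': ['volumes'],
}
sample = pd.DataFrame({
    'docid': ['554061449', '554061481'],
    'volumes': [['S. 102-114'], []],
})

empty_volumes = is_empty_by_group(sample, column_types, 'volumes')
print(empty_volumes.tolist())  # [False, True]
```

Keeping the group assignment in one dictionary means that the emptiness rule is chosen by lookup, so a new attribute only needs to be added to the right list.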

## Attribute Analysis<a id='attribute_analysis'/>

This section analyses the attributes provided by the Swissbib data extracts. The extracted data is used as a basis for the machine learning models in the capstone project. The attributes are based on the [MARC 21 Format for Bibliographic Data](https://www.loc.gov/marc/bibliographic/) and are documented on a [Swissbib Wiki page](http://www.swissbib.org/wiki/index.php?title=Features_Deduplication).

In [14]:
df.columns

Index(['035liste', 'century', 'coordinate', 'corporate', 'decade', 'docid',
       'doi', 'edition', 'exactDate', 'format', 'isbn', 'ismn', 'musicid',
       'pages', 'part', 'person', 'pubinit', 'pubword', 'pubyear', 'scale',
       'ttlfull', 'ttlpart', 'volumes'],
      dtype='object')

This section uses helper functions written to support the analysis of the attributes. They are defined in a separate code file [data_analysis_funcs](./data_analysis_funcs.py).

In [15]:
import data_analysis_funcs as daf

### Table of Contents of Attribute Analysis

- [035liste](#035liste)
- [century](#century)
- coordinate
- corporate
- [decade](#decade)
- [docid](#docid)
- doi
- edition
- [exactDate](#exactDate)
- format
- isbn
- ismn
- musicid
- pages
- part
- person
- pubinit
- pubword
- pubyear
- scale
- ttlfull
- ttlpart
- [volumes](#volumes)

### 035liste<a id='035liste'/>

Attribute $\texttt{035liste}$ holds a list of identifiers from the originating library of a bibliographic unit, see [Features Deduplication](http://www.swissbib.org/wiki/index.php?title=Features_Deduplication). Each record of the Swissbib data holds at least one identifier. Some examples are shown below.

In [16]:
_, _ = daf.find_empty_in_column(df, column_types_dict, '035liste')

Number of records with filled 035liste 183407, with missing 035liste 0 => 100.0%


In [17]:
df['035liste'].apply(len).sort_values().head(5)

83463     1
150669    1
150668    1
150667    1
150666    1
Name: 035liste, dtype: int64

In [18]:
df['035liste'].apply(len).sort_values().tail(10)

144185    20
54084     21
30793     21
136608    21
61603     21
14755     21
49974     22
139972    22
139191    23
124359    23
Name: 035liste, dtype: int64

In [19]:
print('Some sample identifiers:')
df['035liste'].sample(n=10)

Some sample identifiers:


133780    [(SERSOL)ssib030276503, (VAUD)9910211464862028...
46157     [(SERSOL)ssj0002088210, (NEBIS)011317504, (WaS...
85676                 [(OCoLC)1085874533, (NEBIS)011355398]
181834                [(OCoLC)1089704728, (NEBIS)011372106]
34825     [(NEBIS)011278797, (SGBN)001453809, (RERO)R008...
173281    [(RERO)R008340978, (VAUD)991021221895402852, R...
93294     [(OCoLC)1029742615, (ABN)000846532, (OCoLC)102...
156161                                         [(CEO)39992]
41845        [(SERSOL)ssj0000608226, (WaSeSS)ssj0000608226]
78253                 [(OCoLC)1085682628, (NEBIS)001410191]
Name: 035liste, dtype: object
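
The sample identifiers above appear to consist of a source code in parentheses followed by a local identifier. The following sketch splits an identifier into these two parts. The pattern is an assumption derived from the samples only, and the function name `split_identifier` is introduced here for illustration.

```python
import re

# Pattern assumed from the samples above: '(SOURCE)local-id'.
IDENTIFIER_PATTERN = re.compile(r'^\((?P<source>[^)]+)\)(?P<local_id>.+)$')

def split_identifier(identifier):
    """Split '(SOURCE)local-id' into (source, local_id).

    Identifiers that do not match the assumed pattern are returned
    unchanged with None as their source.
    """
    match = IDENTIFIER_PATTERN.match(identifier)
    if match is None:
        return None, identifier
    return match.group('source'), match.group('local_id')

sample_identifiers = ['(OCoLC)1085491204', '(IDSBB)006899773', '(CEO)39992']
for identifier in sample_identifiers:
    print(split_identifier(identifier))
```

Separating the source code from the local identifier makes it possible to count, for example, how many records carry an identifier of a given source.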

Attribute $\texttt{035liste}$ is the central attribute for finding duplicates in the training data of Swissbib's goldstandard. This process is explained and implemented in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).

### century<a id='century'/>

In [20]:
idx_century_filled, idx_century_empty = daf.find_empty_in_column(df, column_types_dict, 'century')

daf.two_examples(df, idx_century_filled, idx_century_empty)

Number of records with filled century 183407, with missing century 0 => 100.0%

EMPTY - None

FILLED - index 0 

035liste                  [(OCoLC)1085491204, (IDSBB)006899773]
century                                                    1992
coordinate                                                   []
corporate                                                    {}
decade                                                     1992
docid                                                 554061449
doi                                                          []
edition                                                        
exactDate                                              1992    
format                                               [BK020000]
isbn                                                         []
ismn                                                         []
musicid                                                        
pages                                              [S. 

In [21]:
df.century.value_counts(normalize=True).head()

2018    0.208389
2019    0.086785
2017    0.046350
uuuu    0.036236
2016    0.025223
Name: century, dtype: float64

### decade<a id='decade'/>

In [22]:
idx_decade_filled, idx_decade_empty = daf.find_empty_in_column(df, column_types_dict, 'decade')

Number of records with filled decade 183407, with missing decade 0 => 100.0%


In [23]:
df[df.decade != df.century]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes


Attribute $\texttt{decade}$ holds data identical to field [century](#century): the comparison above returns no differing records. Its MARC definition is the same, too.
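
The same comparison can be reduced to a single boolean, which is convenient as an automated check. This is a minimal sketch on a hypothetical three-row DataFrame `sample`, not on the Swissbib data.

```python
import pandas as pd

# Hypothetical miniature with the two attributes to compare.
sample = pd.DataFrame({
    'century': ['1992', '2018', 'uuuu'],
    'decade': ['1992', '2018', 'uuuu'],
})

# True only if both columns agree in every row.
columns_identical = bool((sample['decade'] == sample['century']).all())
print(columns_identical)  # True
```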

### docid<a id='docid'/>

In [24]:
idx_docid_filled, idx_docid_empty = daf.find_empty_in_column(df, column_types_dict, 'docid')

Number of records with filled docid 183407, with missing docid 0 => 100.0%


In [25]:
df.docid[0]

'554061449'

### exactDate<a id='exactDate'/>

In [26]:
idx_exactDate_filled, idx_exactDate_empty = daf.find_empty_in_column(df, column_types_dict, 'exactDate')

Number of records with filled exactDate 183407, with missing exactDate 0 => 100.0%


In [27]:
df[df.exactDate.str[0:4] != df.century]

Unnamed: 0,035liste,century,coordinate,corporate,decade,docid,doi,edition,exactDate,format,isbn,ismn,musicid,pages,part,person,pubinit,pubword,pubyear,scale,ttlfull,ttlpart,volumes


In line with the MARC description, the first 4 characters of exactDate hold data identical to field [century](#century).

In [28]:
print('Degree of non-blank filling of last 4 digits {:.1f}%'.format(
    df.exactDate[df.exactDate.str[4:] != '    '].count()/len(df)*100))
print('Degree of numerical filling of last 4 digits {:.1f}%'.format(
    df.exactDate[~df.exactDate.str[4:].isin(['    ', 'uuuu'])].count()/len(df)*100))

Degree of non-blank filling of last 4 digits 19.4%
Degree of numerical filling of last 4 digits 13.3%
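
The two halves of exactDate can be separated and the second half classified, as sketched below. The four values are taken from records shown in this chapter. The function `classify` and its three categories are assumptions for illustration. Note that it counts a value as numeric only if all four characters are digits, which is stricter than the check above that merely excludes blanks and `uuuu`.

```python
import pandas as pd

# Values of exactDate as they occur in records shown in this chapter.
sample = pd.Series(['1992    ', 'uuuuuuuu', '19241925', '20190414'],
                   name='exactDate')

first_part = sample.str[:4]   # compared with attribute century above
second_part = sample.str[4:]  # blank, 'uuuu', or four further characters

def classify(part):
    """Classify the last four characters of exactDate (illustrative)."""
    if part.strip() == '':
        return 'blank'
    if part.isdigit():
        return 'numeric'
    return 'other'  # for example 'uuuu'

kinds = second_part.map(classify)
print(kinds.tolist())  # ['blank', 'other', 'numeric', 'numeric']
```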


In [29]:
df.exactDate[df.exactDate.str[4:] != '    '].head()

12     uuuuuuuu
61     uuuuuuuu
62     uuuuuuuu
117    19241925
257    uuuuuuuu
Name: exactDate, dtype: object

In [30]:
df.loc[183319]

035liste                     [(ZORA)oai:www.zora.uzh.ch:169340]
century                                                    2019
coordinate                                                   []
corporate                                                    {}
decade                                                     2019
docid                                                 556987284
doi                                  [10.1093/eurheartj/ehz068]
edition                                                        
exactDate                                              20190414
format                                               [BK010053]
isbn                                                         []
ismn                                 [10.1093/eurheartj/ehz068]
musicid                                                        
pages                                                        []
part                                         [40(15):1183-1187]
person        {'100': [], '700': ['Templ

### volumes<a id='volumes'/>

In [31]:
idx_volumes_filled, idx_volumes_empty = daf.find_empty_in_column(df, column_types_dict, 'volumes')

Number of records with filled volumes 161471, with missing volumes 21936 => 88.0%


In [32]:
df.volumes.loc[[1810, 0, 73897, 80258]]

1810                                                    []
0                                             [S. 102-114]
73897                          [XI, 213 S., Portr., 23 cm]
80258    [1 Partitur (5 Seiten), 32 Stimmen, Wiedlisbac...
Name: volumes, dtype: object

In [35]:
import data_preparation_funcs as dpf

df = dpf.transform_list_to_string(df, 'volumes')

In [36]:
df.volumes.loc[[1810, 0, 73897, 80258]]

1810                                                      
0                                               S. 102-114
73897                            XI, 213 S., Portr., 23 cm
80258    1 Partitur (5 Seiten), 32 Stimmen, Wiedlisbach...
Name: volumes, dtype: object
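
The transformation from a list to a single string can be sketched as follows. The function `list_column_to_string` is a hypothetical stand-in for `dpf.transform_list_to_string`, whose implementation is not shown here and may differ. In particular, joining the list elements with `', '` is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for dpf.transform_list_to_string; the real function may differ.
def list_column_to_string(df, column, separator=', '):
    """Return a copy of df in which the lists of one column are joined to strings."""
    result = df.copy()
    # An empty list becomes an empty string.
    result[column] = result[column].apply(separator.join)
    return result

sample = pd.DataFrame({
    'volumes': [[], ['S. 102-114'], ['XI, 213 S.', '23 cm']]
})
converted = list_column_to_string(sample, 'volumes')
print(converted['volumes'].tolist())
# ['', 'S. 102-114', 'XI, 213 S., 23 cm']
```

Working on a copy keeps the original DataFrame unchanged, so the list representation remains available for comparison.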