# Data Analysis

To understand the bibliographic catalogue data of the Swissbib platform, the project team has generated a sample .json file with a large number of representative records. This chapter analyses this data delivery.

## Table of Contents

- [Sample Records Analysis](#Sample-Records-Analysis)
    - [Book](#Book)
    - [Music](#Music)
    - [Video Material](#Video-Material)
    - [Map](#Map)
    - [Periodical](#Periodical)
    - [Collection](#Collection)
    - [Computer File](#Computer-File)
- [General Observation](#General-Observation)
- [Attribute Analysis](#Attribute-Analysis)
- [Metadata Handover](#Metadata-Handover)
- [Summary](#Summary)

## Sample Records Analysis

In this section, the data file is loaded and some sample data records are shown.

In [1]:
import os
import json

records = []
path_data = './data'
path_goldstandard = './daten_goldstandard'

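# Read the file line by line (one JSON record per line) and close it afterwards
with open(os.path.join(path_data, 'job7r4A1.json'), 'r', encoding='utf-8') as file:
    for line in file:
        records.append(json.loads(line))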

print('Number of data records loaded', len(records))

Number of data records loaded 183407


In [2]:
import pandas as pd

# Generate Pandas DataFrame object out of the raw data
df = pd.DataFrame(records)

# Extend display to number of columns of DataFrame
pd.options.display.max_columns = len(df.columns)

df.head()

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,edition,part,pages,volumes,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format
0,554061449,"[(OCoLC)1085491204, (IDSBB)006899773]",[],"{'245': ['Die Feist von Kienberg', 'eine Wasen...","{'245': ['Die Feist von Kienberg', 'eine Wasen...","{'100': ['SchluchterAndré'], '245c': ['André S...",{},1992,1992,1992,1992,,[],[S. 102-114],[S. 102-114],[],[],,[],[],[],,[BK020000]
1,554061481,"[(OCoLC)1085491341, (IDSBB)006899983]",[],{'245': ['Reimereien']},{'245': ['Reimereien']},"{'100': ['NaegeliWerner'], '245c': ['von Werne...",{},1986,1986,1986,1986,,[],[43 S.],[43 S.],[],[],,[],[],[],,[BK020000]
2,554061503,"[(OCoLC)1085491299, (IDSBB)006899959]",[],{'245': ['Efficax antidotum ad matrimonia mixt...,{'245': ['Efficax antidotum ad matrimonia mixt...,"{'100': ['KellyM.V.'], '700': ['GeniesseJ.B.']...",{},1923,1923,1923,1923,,[],[75 p.],[75 p.],[],[],,[],[],[],,[BK020000]
3,554061511,"[(OCoLC)1085491268, (IDSBB)006896614]",[],"{'245': ['Probleme der Inflationsbekämpfung', ...","{'245': ['Probleme der Inflationsbekämpfung', ...","{'100': ['WegelinWalter'], '245c': ['']}",{},1947,1947,1947,1947,,[],[24 S.],[24 S.],[],[],,[],[],[],,[BK020000]
4,55406152X,"[(OCoLC)1085491079, (IDSBB)006896866]",[],{'245': ['[Poems]']},{'245': ['[Poems]']},"{'100': ['OberlinUrs'], '245c': ['Urs Oberlin ...",{},1991,1991,1991,1991,,[],[p. 14-15],[p. 14-15],[],[],,[],[],[],,[BK020000]


In [3]:
print('Number of records {:d}, number of attributes per record {:d}.\n'.format(
    len(df), len(df.columns)))

df.info()

Number of records 183407, number of attributes per record 23.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183407 entries, 0 to 183406
Data columns (total 23 columns):
docid         183407 non-null object
035liste      183407 non-null object
isbn          183407 non-null object
ttlfull       183407 non-null object
ttlpart       183407 non-null object
person        183407 non-null object
corporate     183407 non-null object
pubyear       183407 non-null object
decade        183407 non-null object
century       183407 non-null object
exactDate     183407 non-null object
edition       183407 non-null object
part          183407 non-null object
pages         183407 non-null object
volumes       183407 non-null object
pubinit       183407 non-null object
pubword       183407 non-null object
scale         183407 non-null object
coordinate    183407 non-null object
doi           183407 non-null object
ismn          183407 non-null object
musicid       183407 non-null object
format      

Swissbib data describes different kinds of bibliographic units, see [format](#format). The following subsections show one sample record for each kind. The format code is interpreted only roughly here; for details, compare Swissbib's [format codes](http://www.swissbib.org/wiki/index.php?title=Filtering#format_codes) [[FeatWiki](./A_References.ipynb#feature_deduplication_wiki)].

In [4]:
df.format.str[0].str[:2].unique()

array(['BK', 'MU', 'VM', 'MP', 'CR', 'CL', 'CF', nan], dtype=object)
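
The chained accessor in the cell above can be unpacked on a toy example. The following is a minimal sketch, assuming that each cell of the $\texttt{format}$ column holds a list of code strings; apart from `BK020000`, the codes are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the `format` column: each cell holds a list of format codes.
# Only 'BK020000' is taken from the sample data; 'MU000000' is made up.
formats = pd.Series([['BK020000'], ['MU000000'], []])

# Step 1: .str[0] picks the first list element (NaN for an empty list).
first_code = formats.str[0]

# Step 2: .str[:2] slices the first two characters of that string.
prefix = first_code.str[:2]

print(prefix.tolist())
```

An empty list yields a missing value, which would explain the `nan` entry in the array above.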

### Book

A format code starting with $\texttt{BK}$ stands for a bibliographic unit of a book or an article. A sample record is shown below.

In [5]:
df.loc[df[df.format.str[0].str[:2]=='BK'].index[0]]

docid                                                 554061449
035liste                  [(OCoLC)1085491204, (IDSBB)006899773]
isbn                                                         []
ttlfull       {'245': ['Die Feist von Kienberg', 'eine Wasen...
ttlpart       {'245': ['Die Feist von Kienberg', 'eine Wasen...
person        {'100': ['SchluchterAndré'], '245c': ['André S...
corporate                                                    {}
pubyear                                                1992    
decade                                                     1992
century                                                    1992
exactDate                                              1992    
edition                                                        
part                                                         []
pages                                              [S. 102-114]
volumes                                            [S. 102-114]
pubinit                                 

### Music

A format code starting with $\texttt{MU}$ stands for a bibliographic unit that is related to music. A sample record is shown below.

In [6]:
df.loc[df[df.format.str[0].str[:2]=='MU'].index[0]]

docid                                                 554098806
035liste                  [(OCoLC)1085495414, (IDSBB)007052696]
isbn                                                         []
ttlfull                             {'245': ['Violin sonatas']}
ttlpart                             {'245': ['Violin sonatas']}
person        {'100': ['BrahmsJohannes1833-1897(DE-588)11851...
corporate                                                    {}
pubyear                                                20182018
decade                                                     2018
century                                                    2018
exactDate                                              20182018
edition                                                        
part                                                         []
pages                                            [1 CD (69:42)]
volumes                                          [1 CD (69:42)]
pubinit                                 

### Video Material

A format code starting with $\texttt{VM}$ stands for film material. A sample record is shown below.

In [7]:
df.loc[df[df.format.str[0].str[:2]=='VM'].index[0]]

docid                                                 554098911
035liste      [(OCoLC)1065768412, (IDSBB)007052702, (OCoLC)1...
isbn                         [978-3-946274-20-9, 3-946274-20-X]
ttlfull       {'245': ['Kunst und Gemüse, A. Hipler', 'Art a...
ttlpart       {'245': ['Kunst und Gemüse, A. Hipler', 'Art a...
person        {'100': [], '700': ['SchlingensiefChristoph196...
corporate                                                    {}
pubyear                                                20182008
decade                                                     2018
century                                                    2018
exactDate                                              20182008
edition                                                        
part                                                     [2004]
pages                                  [2 DVD-Videos (283 min)]
volumes                                [2 DVD-Videos (283 min)]
pubinit                                 

### Map

A format code starting with $\texttt{MP}$ is a map. A sample record is shown below.

In [8]:
df.loc[df[df.format.str[0].str[:2]=='MP'].index[0]]

docid                                                 554099039
035liste                  [(OCoLC)1085495396, (IDSBB)007052708]
isbn                                                         []
ttlfull       {'245': ['Nova descriptio Comitatus Hollandiæ'...
ttlpart       {'245': ['Nova descriptio Comitatus Hollandiæ'...
person        {'100': ['BlaeuWillem Janszoon1571-1638(DE-588...
corporate                                                    {}
pubyear                                                19931604
decade                                                     1993
century                                                    1993
exactDate                                              19931604
edition                                               Facsimile
part                                                         []
pages                                                 [1 Karte]
volumes                                               [1 Karte]
pubinit                                 

### Periodical

A format code starting with $\texttt{CR}$ is a bibliographic unit of a periodical. A sample record is shown below.

In [9]:
df.loc[df[df.format.str[0].str[:2]=='CR'].index[0]]

docid                                                 55409939X
035liste                   [(OCoLC)699516877, (IDSBB)007052728]
isbn                                                [1533-4406]
ttlfull       {'245': ['The new England journal of medicine ...
ttlpart       {'245': ['The new England journal of medicine ...
person                                {'100': [], '245c': ['']}
corporate                                                    {}
pubyear                                                18121826
decade                                                     1812
century                                                    1812
exactDate                                              18121826
edition                                                        
part                                                         []
pages                                        [Online-Ressource]
volumes                                      [Online-Ressource]
pubinit                                 

### Collection

A format code starting with $\texttt{CL}$ is a collection. A sample record is shown below.

In [10]:
df.loc[df[df.format.str[0].str[:2]=='CL'].index[0]]

docid                                                 554101610
035liste                  [(OCoLC)1085510940, (IDSBB)007052979]
isbn                                                         []
ttlfull       {'245': ['[St. Gallischer Hilfsverein - Sankt ...
ttlpart       {'245': ['[St. Gallischer Hilfsverein - Sankt ...
person                                {'100': [], '245c': ['']}
corporate               {'110': ['St. Gallischer Hilfsverein']}
pubyear                                                20179999
decade                                                     2017
century                                                    2017
exactDate                                              20179999
edition                                                        
part                                                         []
pages                                                 [1 Mappe]
volumes                                               [1 Mappe]
pubinit                                 

### Computer File

A format code starting with $\texttt{CF}$ stands for a computer file on any kind of storage medium. A sample record is shown below.

In [11]:
df.loc[df[df.format.str[0].str[:2]=='CF'].index[0]]

docid                                                 554144301
035liste                   [(IDSBB)007008154, (RERO)R007245313]
isbn                                                         []
ttlfull       {'245': ['Ice Age 4', 'voll verschoben : die a...
ttlpart       {'245': ['Ice Age 4', 'voll verschoben : die a...
person                                {'100': [], '245c': ['']}
corporate                                                    {}
pubyear                                                2012    
decade                                                     2012
century                                                    2012
exactDate                                              2012    
edition                                                        
part                                                         []
pages                                         [1 Speicherkarte]
volumes                                       [1 Speicherkarte]
pubinit                                 

## General Observation

As can be observed in the sample records displayed above, the attributes of the records are stored in basic Python datatypes like strings, lists (of strings), and dictionaries. A look into the raw data file confirms this observation.

In [12]:
! head -n 2 ./data/job7r4A1.json

{"docid":"554061449","035liste":["(OCoLC)1085491204","(IDSBB)006899773"],"isbn":[],"ttlfull":{"245":["Die Feist von Kienberg","eine Wasenmeisterfamilie im Ancien Régime zwischen Ehrbarkeit und Delinquenz"]},"ttlpart":{"245":["Die Feist von Kienberg","eine Wasenmeisterfamilie im Ancien Régime zwischen Ehrbarkeit und Delinquenz"]},"person":{"100":["SchluchterAndré"],"245c":["André Schluchter"]},"corporate":{},"pubyear":"1992    ","decade":"1992","century":"1992","exactDate":"1992    ","edition":"","part":[],"pages":["S. 102-114"],"volumes":["S. 102-114"],"pubinit":[],"pubword":[],"scale":"","coordinate":[],"doi":[],"ismn":[],"musicid":"","format":["BK020000"]}
{"docid":"554061481","035liste":["(OCoLC)1085491341","(IDSBB)006899983"],"isbn":[],"ttlfull":{"245":["Reimereien"]},"ttlpart":{"245":["Reimereien"]},"person":{"100":["NaegeliWerner"],"245c":["von Werner Naegeli"]},"corporate":{},"pubyear":"1986    ","decade":"1986","century":"1986","exactDate":"1986    ","edition":"","part":[],"pa
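
The datatypes can also be checked programmatically. The following is a minimal sketch that parses a shortened copy of the second raw record (only five of its attributes are kept here) and lists the Python type of each value.

```python
import json

# Shortened copy of the second raw record; most attributes are left out.
raw_line = ('{"docid":"554061481",'
            '"035liste":["(OCoLC)1085491341","(IDSBB)006899983"],'
            '"isbn":[],'
            '"ttlfull":{"245":["Reimereien"]},'
            '"pubyear":"1986    "}')

record = json.loads(raw_line)

# Map each attribute name to the name of its Python type.
types = {key: type(value).__name__ for key, value in record.items()}
print(types)
```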

Each attribute, its meaning, and its contents are analysed in the next section. For this analysis, each attribute of the data records is assigned to a specific group. The assignment is kept in a global dictionary variable $\texttt{columns}\_\texttt{metadata}\_\texttt{dict}$.

In [13]:
columns_metadata_dict = {}
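
To illustrate the idea, the following sketch shows what such a dictionary might look like once a few attributes have been registered. The group names `list_columns` and `strings_columns` are the ones used in this chapter; the variable `columns_metadata_sketch` and the helper `groups_of` are hypothetical and only serve the illustration.

```python
# Hypothetical snapshot of the dictionary after some attributes were registered.
columns_metadata_sketch = {
    'list_columns': ['035liste', 'doi'],
    'strings_columns': ['century', 'decade', 'docid'],
}


def groups_of(column, metadata):
    """Return the names of all groups that contain the given column."""
    return [group for group, columns in metadata.items() if column in columns]


print(groups_of('doi', columns_metadata_sketch))
print(groups_of('isbn', columns_metadata_sketch))
```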

## Attribute Analysis

This section analyses the attributes provided by the Swissbib data extracts. The extracted data serves as the basis for the machine learning models in the capstone project. The attributes are based on the MARC 21 Format for Bibliographic Data [[MARC](./A_References.ipynb#marc21)] and are documented on a Swissbib wiki page [[FeatWiki](./A_References.ipynb#feature_deduplication_wiki)].

In [14]:
df.columns

Index(['docid', '035liste', 'isbn', 'ttlfull', 'ttlpart', 'person',
       'corporate', 'pubyear', 'decade', 'century', 'exactDate', 'edition',
       'part', 'pages', 'volumes', 'pubinit', 'pubword', 'scale', 'coordinate',
       'doi', 'ismn', 'musicid', 'format'],
      dtype='object')

This section uses helper functions that have been written to support the analysis of the attributes as well as the data preprocessing in the upcoming chapters. These functions are defined in separate code files.

- [data_analysis_funcs.py](./data_analysis_funcs.py)
- [data_preparation_funcs.py](./data_preparation_funcs.py)

In [15]:
import data_analysis_funcs as daf
import data_preparation_funcs as dpf

### Table of Contents of Attribute Analysis

- [035liste](#035liste)
- [century](#century)
- coordinate
- [corporate](#corporate)
- [decade](#decade)
- [docid](#docid)
- [doi](#doi)
- [edition](#edition)
- [exactDate](#exactDate)
- [format](#format)
- isbn
- ismn
- musicid
- [pages](#pages)
- part
- [person](#person)
- pubinit
- pubword
- [pubyear](#pubyear)
- scale
- [ttlfull](#ttlfull)
- ttlpart
- [volumes](#volumes)

### 035liste

In [16]:
columns_metadata_dict['list_columns'] = ['035liste']

Attribute $\texttt{035liste}$ holds a list of identifiers from the originating library of a bibliographic unit, see [[FeatWiki](./A_References.ipynb#feature_deduplication_wiki)]. Each record of the Swissbib data holds at least one identifier. Some examples are shown below.

In [17]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, '035liste')

Number of records with filled 035liste 183407, with missing 035liste 0 => 100.0%
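
The implementation of $\texttt{find}\_\texttt{empty}\_\texttt{in}\_\texttt{column()}$ lives in the separate code file and is not reproduced here. As a rough idea of what such a check might do, the following sketch uses a hypothetical stand-in `find_empty_sketch`, which assumes that a value counts as missing when its length is zero (empty string, list, or dictionary). The real function also takes the metadata dictionary, which this sketch leaves out.

```python
import pandas as pd


def find_empty_sketch(df, column):
    """Hypothetical stand-in: split the index into filled and empty rows."""
    # Assumption: a value is "missing" if it has length zero.
    is_empty = df[column].apply(lambda value: len(value) == 0).to_numpy()
    idx_filled = df.index[~is_empty]
    idx_empty = df.index[is_empty]
    share = 100 * len(idx_filled) / len(df)
    print(f'Number of records with filled {column} {len(idx_filled)}, '
          f'with missing {column} {len(idx_empty)} => {share:.1f}%')
    return idx_filled, idx_empty


# Toy data: two of four records carry an identifier.
toy = pd.DataFrame({'doi': [['10.5451/unibas-007052902'], [], [], ['04600317120499']]})
idx_filled, idx_empty = find_empty_sketch(toy, 'doi')
```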


In [18]:
df['035liste'].apply(lambda x : len(x)).sort_values().head(5)

83463     1
150669    1
150668    1
150667    1
150666    1
Name: 035liste, dtype: int64

In [19]:
df['035liste'].apply(lambda x : len(x)).sort_values().tail(10)

144185    20
54084     21
30793     21
136608    21
61603     21
14755     21
49974     22
139972    22
139191    23
124359    23
Name: 035liste, dtype: int64

In [20]:
print('Some sample identifiers:')
df['035liste'].sample(n=10)

Some sample identifiers:


37282        [(SERSOL)ssib033998482, (WaSeSS)ssib033998482]
140188    [(IDSBB)006836984, (NEBIS)011367367, (IDSLU)00...
66239                  [(OCoLC)603924741, (IDSBB)004803708]
118148       [(SERSOL)ssib013430430, (WaSeSS)ssib013430430]
157305                                         [(CEO)39577]
98815                 [(OCoLC)1086293570, (IDSBB)007064833]
46489                  [(NEBIS)010607095, (RERO)R008888519]
149923                                         [(CEO)67316]
155271                                         [(CEO)37207]
18011     [(NEBIS)011268958, (IDSBB)001177732, (OCoLC)10...
Name: 035liste, dtype: object

Attribute $\texttt{035liste}$ is the central attribute for finding duplicates in the training data of Swissbib's goldstandard. This process is explained and implemented in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).

### century

In [21]:
columns_metadata_dict['strings_columns'] = ['century']

In [22]:
idx_century_filled, idx_century_empty = daf.find_empty_in_column(df, columns_metadata_dict, 'century')

daf.two_examples(df, idx_century_filled, idx_century_empty)

Number of records with filled century 183407, with missing century 0 => 100.0%

EMPTY - None

FILLED - index 0 

docid                                                 554061449
035liste                  [(OCoLC)1085491204, (IDSBB)006899773]
isbn                                                         []
ttlfull       {'245': ['Die Feist von Kienberg', 'eine Wasen...
ttlpart       {'245': ['Die Feist von Kienberg', 'eine Wasen...
person        {'100': ['SchluchterAndré'], '245c': ['André S...
corporate                                                    {}
pubyear                                                1992    
decade                                                     1992
century                                                    1992
exactDate                                              1992    
edition                                                        
part                                                         []
pages                                              [S. 

Attribute $\texttt{century}$ holds information on the year of origin of the bibliographic unit [[FeatWiki](./A_References.ipynb#feature_deduplication_wiki)]. The attribute holds strings of length 4 that can predominantly be interpreted as years. Some examples and the ten most frequent values with their relative frequencies are shown below.

In [23]:
df['century'].sample(n=15)

155340    1950
7518      1990
159744    1988
142293    2016
127991    1984
118813    1953
74054     2017
80111     2018
110728    2018
7837      2017
175676    1988
90942     1964
150645    1948
156465    1979
106292    1979
Name: century, dtype: object

In [24]:
df.century.value_counts(normalize=True).head(10)

2018    0.208389
2019    0.086785
2017    0.046350
uuuu    0.036236
2016    0.025223
1999    0.020201
2015    0.019879
2014    0.018156
2012    0.016286
2013    0.015616
Name: century, dtype: float64

If the year is partly unknown or no year is registered for the unit, the letter 'u' is used as a placeholder for each unknown digit.

In [25]:
df.century[df.century.str.contains('u')].unique()

array(['uuuu', 'u826', '193u', '197u', '192u', '19uu', '200u', '198u',
       '189u', '188u', '18uu', '201u', '20uu', 'u829', '195u', '218u',
       '196u', '17uu', '1uuu', '199u', 'u611', '191u', '190u', 'u713',
       'u999', '194u', 'u693', '186u', '184u', '15uu', 'uuu1'],
      dtype=object)

The following regular expression returns the same array as the statement above.

In [26]:
df.century[df.century.str.contains('[^0-9]')].unique()

array(['uuuu', 'u826', '193u', '197u', '192u', '19uu', '200u', '198u',
       '189u', '188u', '18uu', '201u', '20uu', 'u829', '195u', '218u',
       '196u', '17uu', '1uuu', '199u', 'u611', '191u', '190u', 'u713',
       'u999', '194u', 'u693', '186u', '184u', '15uu', 'uuu1'],
      dtype=object)
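
One way to read the placeholder convention is as a range of possible years. The following sketch is only an illustration of that reading and not part of the project's processing; the helper `year_bounds` is hypothetical.

```python
def year_bounds(code):
    """Return (earliest, latest) possible year for a 4-character year code.

    Every 'u' stands for an unknown digit, so it is replaced by 0 for the
    lower bound and by 9 for the upper bound.
    """
    lower = int(code.replace('u', '0'))
    upper = int(code.replace('u', '9'))
    return lower, upper


for code in ['1992', '193u', '19uu', 'uuuu']:
    print(code, year_bounds(code))
```

A fully known year collapses to a single value, while `uuuu` spans the whole range and therefore carries no information.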

In [27]:
df.century[~df.century.str.contains('[u]')].unique()

array(['1992', '1986', '1923', '1947', '1991', '1967', '1950', '1985',
       '1983', '1942', '1883', '1990', '1984', '1989', '1993', '1961',
       '1940', '1981', '1988', '1858', '1978', '1977', '1880', '1945',
       '1963', '1912', '1884', '1937', '1956', '1943', '1916', '1960',
       '1980', '1936', '1906', '1987', '1955', '1944', '1953', '1930',
       '1903', '1913', '1938', '1895', '1905', '1920', '1918', '1857',
       '1928', '1881', '1932', '1900', '1924', '1915', '1931', '1927',
       '1919', '1896', '1922', '1871', '1926', '1939', '1907', '1776',
       '1873', '1893', '1968', '1975', '1946', '1833', '1855', '1949',
       '1962', '1971', '1959', '1966', '1862', '1898', '1901', '1904',
       '1902', '1951', '1909', '1929', '1921', '1882', '1964', '1911',
       '1957', '1897', '1910', '1867', '1914', '1908', '1958', '1934',
       '1933', '1872', '1845', '1979', '1954', '1885', '1810', '1891',
       '1869', '1876', '1889', '1836', '1952', '1935', '1948', '1917',
      

The attribute is carried over to the feature matrix without further processing in [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).

In [28]:
columns_metadata_dict['data_analysis_columns'] = ['century']

### corporate

In [29]:
columns_metadata_dict['strings_columns'].append('corporate')

Attribute $\texttt{corporate}$ is a collection of corporate names of the bibliographic unit, see [[FeatWiki](./A_References.ipynb#feature_deduplication_wiki)]. In the raw data, the attribute is a dictionary column in the DataFrame with up to three keys (110, 710, and 810). To ease processing, attribute $\texttt{corporate}$ is split into a separate attribute for each key of the dictionary. The values are lists, and the function $\texttt{transform}\_\texttt{list}\_\texttt{to}\_\texttt{string()}$ is used to join their elements into one single string in the new column.

In [30]:
df.corporate.sample(n=20)

138951                                                   {}
69906                                                    {}
162389                                                   {}
171882                                                   {}
158747    {'110': ['North Elba Park (District)'], '710':...
85432                                                    {}
174778                                                   {}
6369                                                     {}
33873                                                    {}
96646                                                    {}
108523                                                   {}
147998                                                   {}
14512                                                    {}
150218    {'110': ['Liechtensteinischer Olympischer Spor...
15859                                                    {}
131543                                                   {}
39625                                   

In [31]:
for ending in ['110', '710', '810']:
    df = dpf.transform_dictionary_to_list(df, 'corporate', ending)
    df = dpf.transform_list_to_string(df, 'corporate_'+ending)

    columns_metadata_dict['strings_columns'].append('corporate_'+ending)
    _, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'corporate_'+ending)

Number of records with filled corporate_110 11370, with missing corporate_110 172037 => 6.2%
Number of records with filled corporate_710 23123, with missing corporate_710 160284 => 12.6%
Number of records with filled corporate_810 57, with missing corporate_810 183350 => 0.0%
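
The two helper functions are defined in the separate code file and are not reproduced here. The following sketch shows one way such a split might work on toy data. It assumes, judging from the sample output in this chapter, that list elements are joined with a comma and lowercased; the function `split_key_sketch` is a hypothetical stand-in that combines both steps.

```python
import pandas as pd


def split_key_sketch(df, column, key):
    """Hypothetical stand-in: extract one dictionary key into a string column."""
    new_column = f'{column}_{key}'
    # A missing key becomes an empty list and therefore an empty string.
    df[new_column] = df[column].apply(
        lambda entry: ', '.join(entry.get(key, [])).lower())
    return df


toy = pd.DataFrame({'corporate': [
    {},
    {'110': ['St. Gallischer Hilfsverein']},
    {'810': ['Deutschland', 'Verteidigungsministerium']},
]})

for key in ['110', '710', '810']:
    toy = split_key_sketch(toy, 'corporate', key)

print(toy[['corporate_110', 'corporate_710', 'corporate_810']])
```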


The attribute is sparsely filled. This is because most of Swissbib's bibliographic units are produced by persons, not by corporate bodies.

In [32]:
df[['corporate_110', 'corporate_710']][
    (df.corporate_110!=df.corporate_710) &
    (df.corporate_110.apply(lambda x : len(x))!=0)].count()

corporate_110    11349
corporate_710    11349
dtype: int64

In [33]:
df[['corporate_110', 'corporate_710']][
    (df.corporate_110==df.corporate_710) &
    (df.corporate_110.apply(lambda x : len(x))!=0)].count()

corporate_110    21
corporate_710    21
dtype: int64

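The attribute holds different data under its dictionary key 110 than under its dictionary key 710: of the 11370 records with a filled key 110, 11349 differ from key 710 and only 21 are identical. Both keys of $\texttt{corporate}$ therefore seem to be relevant for the model. Some examples are shown below.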

In [34]:
df.corporate_110[df.corporate_110.apply(lambda x : len(x))!=0].sample(n=20)

157596    summer olympic games. organizing committee. 21...
150500                                  conseil de l'europe
160658    summer youth olympic games. organizing committ...
165106                          bulgarian olympic committee
152146                           japanese olympic committee
165453                      united states olympic committee
163221                     union internationale de patinage
27758     schweizerischer verein der freundinnen junger ...
141572     united nations, department of public information
161035                              comité olympique suisse
162839    olympic winter games. organizing committee. 15...
86087                                aktion saubere schweiz
158461    international olympic committee. session. 1982...
94184                                   map productions ltd
163366                  republic of china olympic committee
159593                          british olympic association
168600    international olympic committe

In [35]:
df.corporate_710[df.corporate_710.apply(lambda x : len(x))!=0].sample(n=20)

148640                tartuskij gosudarstvennyj universitet
17064                          malta philharmonic orchestra
175964    union nationale des clubs universitaires (fran...
170828                  deutsches olympia zentrum (münchen)
55753     österreichische akademie der wissenschaften, c...
37498                         springerlink (online service)
135042             state university of iowa, college of law
115304                                   aeschbach-stiftung
163851    international olympic committee. olympic solid...
102216                                   frauenmuseum meran
128137                                atelier otto rietmann
58888                               palace (groupe musical)
177094                                             mastodon
79739     vereinte nationen, vereinte nationen, vereinte...
175789    société internationale d'histoire de l'éducati...
50567     schweiz, ipso-sozial- und umfrageforschung, bu...
92070     museum brandhorst, museo d'art

In [36]:
df.corporate_810[df.corporate_810.apply(lambda x : len(x))!=0].sample(n=20)

113869                              peter-ochs-gesellschaft
15102               schweiz, bundesamt für landestopografie
86065                               hallwag kümmerly + frey
106150                              hallwag kümmerly + frey
106189                              hallwag kümmerly + frey
113870                              peter-ochs-gesellschaft
25428               schweiz, bundesamt für landestopografie
25264               schweiz, bundesamt für landestopografie
133739                deutschland, verteidigungsministerium
133706    carnegie endowment for international peace., d...
17859               schweiz, bundesamt für landestopografie
94674                                             hécatombe
110572                              hallwag kümmerly + frey
106173                              hallwag kümmerly + frey
61902                               hallwag kümmerly + frey
94680                                             hécatombe
110597                              hall

In [37]:
df.corporate[(df.corporate_110 != df.corporate_810) &
             (df.corporate_810.apply(lambda x : len(x))!=0)]

45080              {'810': ['Wirtschaftsuniversität Wien']}
47145     {'110': ['Biochemical Society (Great Britain)'...
52134     {'710': ['Kunsthandlung Helmut H. Rumbler'], '...
94673     {'710': ['Ensemble Batida', 'Hécatombe'], '810...
94674     {'710': ['Ensemble Batida', 'Hécatombe'], '810...
94678     {'710': ['Ensemble Batida', 'Hécatombe'], '810...
94679     {'710': ['Ensemble Batida', 'Hécatombe'], '810...
94680     {'710': ['Ensemble Batida', 'Hécatombe'], '810...
113869                 {'810': ['Peter-Ochs-Gesellschaft']}
113870                 {'810': ['Peter-Ochs-Gesellschaft']}
133706    {'110': ['Carnegie Endowment for International...
133739    {'810': ['Deutschland', 'Verteidigungsminister...
143892    {'110': ['Judge Advocate General's School (Uni...
Name: corporate, dtype: object

Only a few records hold data in attribute $\texttt{corporate}$ under key 810. Furthermore, the data under key 810 seems to be mostly redundant with the data under key 110: only the 13 records listed above differ, out of 57 records with a filled key 810. Therefore, attribute $\texttt{corporate}$ with key 810 will be omitted.

In [38]:
columns_metadata_dict['data_analysis_columns'].append('corporate_110')
columns_metadata_dict['data_analysis_columns'].append('corporate_710')

### decade

In [39]:
columns_metadata_dict['strings_columns'].append('decade')

In [40]:
idx_decade_filled, idx_decade_empty = daf.find_empty_in_column(df, columns_metadata_dict, 'decade')

Number of records with filled decade 183407, with missing decade 0 => 100.0%


In [41]:
df[df.decade != df.century]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,...,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format,corporate_110,corporate_710,corporate_810


The attribute holds data identical to attribute [century](#century), as the comparison above returns no differing records. Its MARC definition is the same, too.
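
The same check can be condensed into a single boolean. The following is a minimal sketch on toy data, with values chosen for illustration only.

```python
import pandas as pd

# Toy data: both columns carry the same year codes.
toy = pd.DataFrame({'century': ['1992', 'uuuu', '19uu'],
                    'decade': ['1992', 'uuuu', '19uu']})

# True only if both columns hold the same values in the same order.
columns_identical = toy['decade'].equals(toy['century'])

# Rows in which the two columns differ; empty if the columns are identical.
differing = toy[toy.decade != toy.century]

print(columns_identical, len(differing))
```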

### docid

In [42]:
columns_metadata_dict['strings_columns'].append('docid')

In [43]:
idx_docid_filled, idx_docid_empty = daf.find_empty_in_column(df, columns_metadata_dict, 'docid')

Number of records with filled docid 183407, with missing docid 0 => 100.0%


In [44]:
df.docid[0]

'554061449'

### doi

In [45]:
columns_metadata_dict['list_columns'].append('doi')

In [46]:
idx_doi_filled, idx_doi_empty = daf.find_empty_in_column(df, columns_metadata_dict, 'doi')

Number of records with filled doi 10114, with missing doi 173293 => 5.5%


In [47]:
df.doi[df.doi.apply(lambda x : len(x))>0].head(20)

1854                                     [00028947575214]
1899    [10.5451/unibas-007052902, urn:nbn:ch:bel-bau-...
1937    [10.5451/unibas-007052953, urn:nbn:ch:bel-bau-...
2046                                     [04600317120499]
2063                                     [00602567484134]
2207                                     [00039841539226]
2245                                     [00096802280399]
2286                                     [00761195120422]
2494                                     [04250095800740]
2779                                     [00888837038720]
2996                     [urn:nbn:de:101:1-2016111912809]
3087                                     [00605633131628]
3385                     [urn:nbn:de:101:1-2017040728657]
4094                                     [00602547324375]
4504                                     [00656605612812]
4710                             [10.14361/9783839445334]
6579                                     [00887254706021]
7624          

In [48]:
df.loc[1854]

docid                                                    554099918
035liste         [(OCoLC)71126385, (IDSBB)007052820, (OCoLC)711...
isbn                                                            []
ttlfull          {'245': ['Symphony no. 8'], '246': ['Symphony ...
ttlpart                                {'245': ['Symphony no. 8']}
person           {'100': ['MahlerGustav1860-1911(DE-588)1185762...
corporate        {'710': ['Konzertvereinigung Wiener Staatsoper...
pubyear                                                   20062006
decade                                                        2006
century                                                       2006
exactDate                                                 20062006
edition                                                           
part                                                            []
pages                                               [1 Cd (79:48)]
volumes                                             [1 Cd (79:

In [49]:
#columns_metadata_dict['data_analysis_columns'].append('doi')

### edition

In [50]:
columns_metadata_dict['strings_columns'].append('edition')

In [51]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'edition')

Number of records with filled edition 25352, with missing edition 158055 => 13.8%


Attribute $\texttt{edition}$ holds the edition statement [[FeatWiki](./A_References.ipynb/#feature_deduplication_wiki)]. The attribute is of data type string.

In [52]:
df.edition[df.edition.apply(lambda x : len(x)>0)].sample(n=10)

1359      2. vermehrte und verb. Aufl
104469                     1. Auflage
29366                  [Ausgabe] 2018
111892                     3. Auflage
121516                         1st ed
21150        First issued in hardback
50978                         12e éd.
31841                      2. Auflage
44916                 Second edition.
85597                      6. Auflage
Name: edition, dtype: object

The attribute is taken over into the feature matrix without any extra processing in [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).

In [53]:
columns_metadata_dict['data_analysis_columns'].append('edition')

### exactDate

In [54]:
columns_metadata_dict['strings_columns'].append('exactDate')

In [55]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'exactDate')

Number of records with filled exactDate 183407, with missing exactDate 0 => 100.0%


In [56]:
df[df.exactDate.str[0:4] != df.century]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,...,pubinit,pubword,scale,coordinate,doi,ismn,musicid,format,corporate_110,corporate_710,corporate_810


In line with the MARC description, the first 4 digits of $\texttt{exactDate}$ hold data identical to field [century](#century).
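
To make the structure of the 8 characters more tangible, the sketch below splits an $\texttt{exactDate}$ value into its first and last 4 characters and classifies the second half. The function name `classify_date_tail` and the category labels are hypothetical; the assumption is that the second half is either blank, filled with `u` for unknown, or numeric.

```python
import pandas as pd

def classify_date_tail(exact_date):
    """Classify the last 4 characters of an 8-character exactDate string."""
    tail = exact_date[4:]
    if tail.strip() == '':
        return 'blank'
    if tail == 'uuuu':
        return 'unknown'
    if tail.isdigit():
        return 'numeric'
    return 'other'

# Toy values (illustrative only)
toy_dates = pd.Series(['1992    ', 'uuuuuuuu', '19241925', '20190414'])

first_half = toy_dates.str[:4]
tail_class = toy_dates.apply(classify_date_tail)
print(pd.DataFrame({'first_half': first_half, 'tail_class': tail_class}))
```

Counting the resulting categories would give a breakdown comparable to the fill degrees of the last 4 digits.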

In [57]:
print('Degree of non-blank filling of last 4 digits {:.1f}%'.format(
    df.exactDate[df.exactDate.str[4:] != '    '].count()/len(df)*100))
print('Degree of numerical filling of last 4 digits {:.1f}%'.format(
    df.exactDate[~df.exactDate.str[4:].isin(['    ', 'uuuu'])].count()/len(df)*100))

Degree of non-blank filling of last 4 digits 19.4%
Degree of numerical filling of last 4 digits 13.3%


In [58]:
df.exactDate[df.exactDate.str[4:] != '    '].head()

12     uuuuuuuu
61     uuuuuuuu
62     uuuuuuuu
117    19241925
257    uuuuuuuu
Name: exactDate, dtype: object

In [59]:
df.loc[183319]

docid                                                    556987284
035liste                        [(ZORA)oai:www.zora.uzh.ch:169340]
isbn                                                            []
ttlfull          {'245': ['Altered limbic and autonomic process...
ttlpart          {'245': ['Altered limbic and autonomic process...
person           {'100': [], '700': ['TemplinChristianjoint aut...
corporate                                                       {}
pubyear                                                   20190414
decade                                                        2019
century                                                       2019
exactDate                                                 20190414
edition                                                           
part                                            [40(15):1183-1187]
pages                                                           []
volumes                                                       

### format

In [60]:
columns_metadata_dict['list_columns'].append('format')

Attribute $\texttt{format}$ describes the format of a bibliographic unit, see examples under section [Sample Records Analysis](#Sample-Records-Analysis).

In [61]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'format')

Number of records with filled format 179688, with missing format 3719 => 98.0%


In [62]:
df = dpf.transform_list_to_string(df, 'format')

df['format'][df.format.apply(lambda x : len(x))==8].sample(n=10)

175453    bk020000
13308     bk020000
147295    mu040100
105662    bk010000
102281    bk020300
176825    cr030600
102988    bk010053
50677     bk020000
139517    vm010300
67674     bk020000
Name: format, dtype: object

In [63]:
print('{:.1f}% of the records hold one single format.'.format(
    df['format'][df.format.apply(lambda x : len(x))==8].count()/len(df)*100))
print('{:.1f}% of the records hold more than one format.'.format(
    df['format'][df.format.apply(lambda x : len(x))>8].count()/len(df)*100))

96.4% of the records hold one single format.
1.6% of the records hold more than one format.
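
The length-based check above relies on every format code being exactly 8 characters long. As a cross-check, the number of codes per record can also be counted by splitting the string. The sketch assumes that the codes are joined with `', '`, as the displayed values suggest; `count_codes` is a hypothetical helper.

```python
import pandas as pd

def count_codes(value, sep=', '):
    """Count the non-empty format codes in one joined string."""
    return len([code for code in value.split(sep) if code])

# Toy values (illustrative only)
toy_format = pd.Series(['bk020000',
                        'mu010100, mu010000',
                        '',
                        'bk020300, bk020000, bk020500'])

n_codes = toy_format.apply(count_codes)
print(n_codes.value_counts().sort_index())
```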


In [64]:
df['format'][df.format.apply(lambda x : len(x))<8].sample(n=10)

150769    
150640    
154170    
149221    
155629    
158091    
171153    
167794    
169172    
153190    
Name: format, dtype: object

In [65]:
df['format'][df.format.apply(lambda x : len(x))>8].head()

1981    mu010100, mu010000
2000    mu010000, mu010100
2001    mu010100, mu010000
2003    mu010100, mu010000
2004    mu010200, mu010000
Name: format, dtype: object

In [66]:
df['format'][df.format.apply(lambda x : len(x))>18].head()

17607     bk020300, bk020000, bk020500
83055     bk020300, bk020800, bk020000
181090    cr030700, cr030600, cr030300
181376    bk020800, bk020400, bk020000
Name: format, dtype: object

The attribute seems to be very relevant for a basic identification of the bibliographical unit. Mainly the first two characters seem to be a reliable indicator for a rough classification. The remaining digits may be of lesser importance and lesser reliability due to freedom of interpretation. For this reason, the attribute is divided into two new attributes.

- New attribute $\texttt{format_prefix}$ will hold the first two characters of the first $\texttt{format}$ element.
- New attribute $\texttt{format_postfix}$ will hold the 6 subsequent digits of the first $\texttt{format}$ element.

The $\texttt{format}$ attribute will be dropped after this preprocessing step.

In [67]:
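# First two characters of the first format element
df['format_prefix'] = df.format.str[:2]
# Replace an empty prefix with blanks (.loc avoids chained assignment)
df.loc[df['format_prefix'] == '', 'format_prefix'] = '  '
# Six subsequent digits of the first format element
df['format_postfix'] = df.format.str[2:8]
df.loc[df['format_postfix'] == '', 'format_postfix'] = '      '
df[['format', 'format_prefix', 'format_postfix']].sample(n=15)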

Unnamed: 0,format,format_prefix,format_postfix
119024,bk020053,bk,020053
140221,bk020000,bk,020000
25274,bk020000,bk,020000
59756,mu010100,mu,010100
10939,bk020000,bk,020000
166920,bk020000,bk,020000
144938,bk020300,bk,020300
6030,mu040100,mu,040100
174987,bk020000,bk,020000
178726,bk020000,bk,020000


This preprocessing step is implemented in a separate function $\texttt{.split}\_\texttt{format()}$, which can be found in the code file [data_preparation_funcs.py](./data_preparation_funcs.py).
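
The function itself is not reproduced in this chapter. A minimal sketch of how such a step might look is given below; the name `split_format_sketch` is hypothetical, and the body follows only the description in this section, not the actual code in `data_preparation_funcs.py`.

```python
import pandas as pd

def split_format_sketch(df, column='format'):
    """Split a format string column into prefix and postfix columns.

    Assumption: the column already holds strings such as
    'bk020000' or 'mu010100, mu010000' (first element first).
    """
    out = df.copy()
    prefix = out[column].str[:2]
    postfix = out[column].str[2:8]
    # Pad records without a format with blanks of the expected width
    out['format_prefix'] = prefix.mask(prefix == '', '  ')
    out['format_postfix'] = postfix.mask(postfix == '', '      ')
    # The original attribute is dropped after the split
    return out.drop(columns=[column])

# Toy values (illustrative only)
toy = pd.DataFrame({'format': ['bk020053', 'mu010100, mu010000', '']})
toy_split = split_format_sketch(toy)
print(toy_split)
```

Working on a copy keeps the input DataFrame unchanged, which makes the step easy to test in isolation.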

In [68]:
columns_metadata_dict['data_analysis_columns'].append('format_prefix')
columns_metadata_dict['data_analysis_columns'].append('format_postfix')

### pages

In [69]:
columns_metadata_dict['list_columns'].append('pages')

In [70]:
idx_pages_filled, idx_pages_empty = daf.find_empty_in_column(df, columns_metadata_dict, 'pages')

Number of records with filled pages 161471, with missing pages 21936 => 88.0%


In [71]:
# Identical to volumes
#columns_metadata_dict['data_analysis_columns'].append('pages')

### person

In [72]:
columns_metadata_dict['strings_columns'].append('person')

Attribute $\texttt{person}$ is a collection of personal name statements of the bibliographical unit, see [[FeatWiki](./A_References.ipynb/#feature_deduplication_wiki)]. In the raw data, the attribute is a dictionary column of the DataFrame with four possible keys (100, 700, 800 and 245c). For processing, $\texttt{person}$ is split into one separate attribute per key. The values are lists, and the implemented function $\texttt{.transform}\_\texttt{list}\_\texttt{to}\_\texttt{string()}$ is used to merge their elements into one single string in the new column.
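
The split of a dictionary column into one column per key can be sketched as follows. The helper `dict_key_to_column` is hypothetical and only illustrates the idea; it is not the project's $\texttt{transform}\_\texttt{dictionary}\_\texttt{to}\_\texttt{list()}$ function, and the fallback to an empty list for a missing key is an assumption.

```python
import pandas as pd

def dict_key_to_column(df, column, key):
    """Add a column '<column>_<key>' holding the list stored under `key`.

    Assumption: a missing key is represented by an empty list.
    """
    out = df.copy()
    out[column + '_' + key] = out[column].apply(lambda d: d.get(key, []))
    return out

# Toy records (illustrative only)
toy = pd.DataFrame({'person': [
    {'100': ['ZakimEric'], '245c': ['Eric Zakim']},
    {'100': [], '700': ['LantolfJames P.'], '245c': ['']},
]})

for key in ['100', '700', '800', '245c']:
    toy = dict_key_to_column(toy, 'person', key)

print(toy[['person_100', 'person_700', 'person_800', 'person_245c']])
```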

In [73]:
df.person.sample(20)

104084    {'100': ['PauliGerhard1648-1715(DE-588)1329274...
98242     {'100': ['VandersteenWilly1913-1990(DE-588)128...
27494     {'100': ['SchmidtH.Verfasseraut'], '245c': ['H...
77286     {'100': ['FeuchtmayerJoseph Anton1696-1770(DE-...
100725    {'100': ['YosifonDavid G.1973-(DE-588)11631254...
70516        {'100': ['ZakimEric'], '245c': ['Eric Zakim']}
73023     {'100': [], '700': ['LantolfJames P.'], '245c'...
79212     {'100': ['MachoThomas'], '245c': ['Thomas Mach...
174648    {'100': ['LévyDidier(RERO)A003540910cre'], '70...
179183    {'100': ['AustruyAnna(RERO)A021930073aut'], '2...
100241    {'100': [], '700': ['HaldonJohn F.(DE-588)1044...
92340     {'100': ['TuorLeo1959-(DE-588)128982101Verfass...
66262     {'100': ['JouvePierre Jean'], '245c': ['Pierre...
108490                            {'100': [], '245c': ['']}
112373    {'100': ['RaoM. K. SuryanarayanaVerfasseraut']...
36643                             {'100': [], '245c': ['']}
36476     {'100': ['DevosRaymond'], '245

In [74]:
for ending in ['100', '700', '800', '245c']:
    df = dpf.transform_dictionary_to_list(df, 'person', ending)
    df = dpf.transform_list_to_string(df, 'person_'+ending)

    columns_metadata_dict['strings_columns'].append('person_'+ending)
    _, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'person_'+ending)

Number of records with filled person_100 115396, with missing person_100 68011 => 62.9%
Number of records with filled person_700 73254, with missing person_700 110153 => 39.9%
Number of records with filled person_800 1137, with missing person_800 182270 => 0.6%
Number of records with filled person_245c 158011, with missing person_245c 25396 => 86.2%


In [75]:
df.person_100.sample(20)

63024                                                      
176608       vuillardéric1968-(de-588)142262846verfasseraut
34900                      martinmélanie(rero)a024814137cre
97707                                        mayerhöffereva
89560                                   waltstephen m.1955-
87228       roseroevelio1958-(de-588)1025870123verfasseraut
25896                                                      
40957                                     russellaaron j.m.
97787                                         cencilodovico
152308                                            yuandaren
156841                                            mercerbob
78058                      rodolphe1948-(rero)a021636611aut
116551                            gleasonrobertverfasseraut
82487      lamunièresimon1961-(de-588)121752356verfasseraut
87895                                                      
51791                                       steinerphilippe
157261                                  

In [76]:
df.person_700.sample(20)

143434                                     marksedward1934-
6331                             josefbernhardübersetzertrl
134240                                                     
173405                                                     
167920                                                     
145413     congdonlisa1968-(de-588)1084606542illustratorill
57427                                 letao(rero)a005674005
85799                                                      
134955                                                     
95227               ducurtiladèleaut, brasier de thuyléaaut
4925      sterlingaaron, ahrensjonathan, polonskyjonny, ...
80085     messiaenolivier1908-1992réveil des oiseaux(de-...
94783                         cossettesylvie, aldersonmarie
131227    lécossaissarah1981-herausgeberedt, quemenernel...
24245                                                      
120862                              eliasfriederike, eggert
131116                                  

In [77]:
df.person_800.sample(20)

56820     
12843     
4485      
119605    
33717     
96871     
1993      
93746     
157118    
168501    
1277      
99336     
75434     
7691      
31272     
112497    
151870    
18151     
114211    
16489     
Name: person_800, dtype: object

In [78]:
df.person_245c.sample(20)

41964     herausgegeben von robert pfitzmann, peter neuh...
169337                      conseil de la ville de brisbane
36463                                     giampiero maspero
113549                           haydn ; sviatoslav richter
60601                           redaktion: claudia dziallas
115890                                        ogawa, ichijo
121599                                        schoder, jörg
86764                                               canal 9
67770                                               vercors
152350    org. by sasakawa sports foundation ; ed. kazun...
153543                                        gigliola gori
109356                                         anne gorrick
33964                                                      
44874                  herausgeberin, anita riecher-rössler
112478       diego caballero ... [und drei weitere autoren]
117161                                     reinhold niebuhr
116442                                  

Column $\texttt{person_245c}$ is identified as the most complete and useful personal name attribute and will be used as a basis for the data of the feature matrix.

In [79]:
#columns_metadata_dict['data_analysis_columns'].append('person_100')
#columns_metadata_dict['data_analysis_columns'].append('person_700')
#columns_metadata_dict['data_analysis_columns'].append('person_800')
columns_metadata_dict['data_analysis_columns'].append('person_245c')

### pubyear

In [80]:
columns_metadata_dict['strings_columns'].append('pubyear')

In [81]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'pubyear')

Number of records with filled pubyear 183407, with missing pubyear 0 => 100.0%


In [82]:
df[df.exactDate != df.pubyear]

Unnamed: 0,docid,035liste,isbn,ttlfull,ttlpart,person,corporate,pubyear,decade,century,exactDate,...,musicid,format,corporate_110,corporate_710,corporate_810,format_prefix,format_postfix,person_100,person_700,person_800,person_245c


All 8 digits of $\texttt{pubyear}$ hold identical data to field [exactDate](#exactDate). This observation corresponds to the expectation from the MARC description of the attribute.

### ttlfull

In [83]:
columns_metadata_dict['strings_columns'].append('ttlfull')

Attribute $\texttt{ttlfull}$ holds the full title of the bibliographical unit, see [[FeatWiki](./A_References.ipynb/#feature_deduplication_wiki)]. Some examples are shown below.

In [84]:
df.ttlfull.sample(20)

82535      {'245': ['Petite philosophie de la dépression']}
181587    {'245': ['Praxisbuch Kinderschutz interdiszipl...
134679            {'245': ['Indiana Corporate Procedures']}
85709     {'245': ['Ultrafast Bandgap Photonics III', '1...
179622    {'245': ['Why some things should not be for sa...
132460    {'245': ['Alienated wisdom', 'enquiry into Jew...
102857    {'245': ['Savage Kin', 'Indigenous Informants ...
154360    {'245': ['Jeux de la XXe Olympiade Munich 1972...
116101                            {'245': ['[Ordinarium]']}
39045     {'245': ['Erste, zweite und dritte Berathung d...
173382           {'245': ['The ethics of sports medicine']}
74075                        {'245': ['Les fleurs du mal']}
119201    {'245': ['Reinhold Niebuhr to William Scarlett...
79424                    {'245': ['The rector's daughter']}
78163     {'245': ['Checklisten Krankheiten im Alter', '...
53918     {'245': ['A literature of questions', 'nonfict...
41383     {'245': ['Mesozoic/Cenozoic ve

The attribute is a dictionary column in the DataFrame with two possible keys, 245 and 246. For processing, one new attribute per key is added to the DataFrame. The values are lists, and the implemented function $\texttt{.transform}\_\texttt{list}\_\texttt{to}\_\texttt{string()}$ is used to merge the list elements into one single string per column.

In [85]:
for ending in ['245', '246']:
    df = dpf.transform_dictionary_to_list(df, 'ttlfull', ending)
    df = dpf.transform_list_to_string(df, 'ttlfull_'+ending)

    columns_metadata_dict['strings_columns'].append('ttlfull_'+ending)
    _, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'ttlfull_'+ending)

Number of records with filled ttlfull_245 183407, with missing ttlfull_245 0 => 100.0%
Number of records with filled ttlfull_246 15897, with missing ttlfull_246 167510 => 8.7%


In [86]:
df.ttlfull_245.sample(20)

122358    validity and usability testing of a health sys...
87111     instrumentelle lackanalytik, das lehrbuch für ...
148136    corpus linguistics for pragmatics, a guide for...
49536     interprète grec - les mémoires de sherlock holmes
41242     mathematik 1 für nichtmathematiker, grundbegri...
125285    following, challenging, or shaping: can third ...
155626    mach mit bei der schülerolympiade, unterrichts...
28760     la maréchalerie, le ferrage est-il la meilleur...
81225                         bevölkerungstrends in hamburg
69272     essai sur les préjugés ; ou de l'influence de ...
177712      la recherche au musée du quai branly, 2012-2013
28844     plan directeur 1990-94, rapport des transports...
113076                                               slayer
135497    justice holmes to doctor wu, an intimate corre...
114410    suite orientale, pour flûte, violoncelle et piano
135782                     schiller-bibliographie 1959-1963
143317    union of the british provinces

In [87]:
df.ttlfull_246.sample(20)

122738                                                     
82375     gurmad, diiwaanka gabayada soomaaliyeed, colle...
20608                                                      
153448                                                     
124817                                                     
46393                                                      
44848                                                      
144293                                                     
111953                                                     
142751                                                     
10231                                                      
146268                                                     
96301                                                      
135372                                                     
47079                                                      
87766                                                      
1871                                    

The two new columns will be used as a basis for the data of the feature matrix.

In [88]:
columns_metadata_dict['data_analysis_columns'].append('ttlfull_245')
columns_metadata_dict['data_analysis_columns'].append('ttlfull_246')

### volumes

In [89]:
columns_metadata_dict['list_columns'].append('volumes')

In [90]:
_, _ = daf.find_empty_in_column(df, columns_metadata_dict, 'volumes')

Number of records with filled volumes 161471, with missing volumes 21936 => 88.0%


This attribute holds information on the number of physical pages, volumes, total playing time etc. of the bibliographic unit [[FeatWiki](./A_References.ipynb/#feature_deduplication_wiki)] depending on its format. Some examples are shown below.

In [91]:
df['volumes'].sample(n=15)

22594                                [462 p.]
72350                                [325 p.]
5366                         [1 Compact Disc]
129653                    [1 online resource]
70288                                 [58 p.]
148806                               [390 S.]
38090      [1 online resource (x, 200 pages)]
32480                      [Online-Ressource]
22631                             [21 Seiten]
145382                               [403 S.]
102526                       [xi, 309 Seiten]
11120                                 [52 S.]
166390                                     []
45047     [1 online resource (xxvii, 323 p.)]
150027                  [1 vol. (non paginé)]
Name: volumes, dtype: object

The attribute comes along as a list of one string element. A function for data preparation has been written to extract the element out of the list and store it as a single string of lowercase characters. This function will be used for preparing the data of the goldstandard in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).
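
A minimal sketch of such a list-to-string conversion is shown below. The name `list_to_string_sketch` is hypothetical, and the `', '` separator for lists with several elements is an assumption derived from the displayed values, not a confirmed detail of $\texttt{transform}\_\texttt{list}\_\texttt{to}\_\texttt{string()}$.

```python
import pandas as pd

def list_to_string_sketch(series, sep=', '):
    """Join the list elements of each record into one lowercase string."""
    return series.apply(lambda items: sep.join(items).lower())

# Toy values (illustrative only)
toy_volumes = pd.Series([['S. 102-114'], ['1 Compact Disc'], [], ['xi, 309 Seiten']])

toy_strings = list_to_string_sketch(toy_volumes)
print(toy_strings)
```

An empty list results in an empty string, which matches the blank entries visible in the processed column.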

In [92]:
df = dpf.transform_list_to_string(df, 'volumes')

In [93]:
df['volumes'].sample(n=15)

1168      255 s., frontispiz, + 3 angebunden.
86168                         xiv, 626 seiten
179513                                       
116002                                2 teile
17910                                  352 s.
7692                                  [28] s.
100007                1 compact disc (57 min)
112237                             16, 168 s.
31445                                        
154967                                 258 p.
145123                      1 online resource
108441                                       
42222                       1 online resource
100581                     1 online-ressource
38083        1 online resource (xvii, 294 p.)
Name: volumes, dtype: object

In [94]:
print('Array of unique attribute values\n', df.volumes.unique())
print('\nTotal number of unique values {:,d}'.format(len(df.volumes.unique())))

Array of unique attribute values
 ['s. 102-114' '43 s.' '75 p.' ... '232 s., 2 bl. taf.'
 '26 seiten, 7 ungezählte blätter bildtafeln' 'xli, 282 seiten']

Total number of unique values 36,466


In [95]:
columns_metadata_dict['data_analysis_columns'].append('volumes')

## Metadata Handover

To hand over the attributes dictionary of this chapter as metadata, the dictionary is saved into a pickle file that will be read in the next chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb) as input file.

In [96]:
import pickle as pk

# Binary intermediary metadata file
with open(os.path.join(path_goldstandard,
                       'culomns_metadata.pkl'), 'wb') as df_output_file:
    pk.dump(columns_metadata_dict, df_output_file)
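
The round trip of such a handover can be sketched as follows. The dictionary content and the file name `toy_metadata.pkl` are illustrative only; the sketch writes to a temporary directory so that it does not touch the project files.

```python
import os
import pickle as pk
import tempfile

# Toy metadata dictionary (illustrative content only)
toy_metadata = {
    'strings_columns': ['decade', 'docid'],
    'list_columns': ['doi'],
    'data_analysis_columns': ['edition'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_metadata.pkl')

    # Write the dictionary in binary mode
    with open(path, 'wb') as output_file:
        pk.dump(toy_metadata, output_file)

    # Read it back, as a subsequent chapter would do
    with open(path, 'rb') as input_file:
        restored = pk.load(input_file)

print('Round trip successful:', restored == toy_metadata)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.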

## Summary

The result of this chapter is an analysis of the attributes of the raw data [[FeatWiki](./A_References.ipynb/#feature_deduplication_wiki)] for the capstone project. During the analysis and discussion of the attributes, some functions have been written that will be used for data preprocessing in the upcoming chapters. As the next step, Swissbib's training and testing data will be analysed and processed in chapter [Goldstandard and Data Preparation](./2_GoldstandardDataPreparation.ipynb).