## Arken - Wikidata
version 1.1

WD property [Property:P8899](https://www.wikidata.org/wiki/Property:P8899)
* This [notebook](https://github.com/salgo60/open-data-examples/blob/master/Arken.ipynb)
* Task [T269064](https://phabricator.wikimedia.org/T269064)
----


#### Other sources we sync
* [Arken](https://github.com/salgo60/open-data-examples/blob/master/Arken.ipynb) 
  * WD [Property:P8899](https://www.wikidata.org/wiki/Property:P8899) 
* [Kulturpersoner Uppsalakyrkogård](https://github.com/salgo60/open-data-examples/blob/master/Check%20WD%20kulturpersoner%20uppsalakyrkogardar.ipynb)
* [Litteraturbanken](https://github.com/salgo60/open-data-examples/blob/master/Litteraturbanken%20Author.ipynb) 
  * WD [Property:P5101](https://www.wikidata.org/wiki/Property_talk:P5101), [Property:P5123](https://www.wikidata.org/wiki/Property_talk:P5123)
* [Nobelprize.org](https://github.com/salgo60/open-data-examples/blob/master/Nobel%20API.ipynb)
  * WD [Property:P8024](https://www.wikidata.org/wiki/Property:P8024)
* [SBL](https://github.com/salgo60/open-data-examples/blob/master/SBL.ipynb)
  * WD [Property:P3217](https://www.wikidata.org/wiki/Property:P3217)
* [SKBL](https://github.com/salgo60/open-data-examples/blob/master/Svenskt%20Kvinnobiografiskt%20lexikon%20part%203.ipynb)
  * WD [Property:P4963](https://www.wikidata.org/wiki/Property:P4963)
* [Svenska Akademien](https://github.com/salgo60/open-data-examples/blob/master/Svenska%20Akademien.ipynb)
  * WD [Property:P5325](https://www.wikidata.org/wiki/Property:P5325)


In [1]:
from datetime import datetime
now = datetime.now()
print("Last run: ", now)

Last run:  2021-02-13 12:51:22.112404


In [2]:
import urllib3, json
import pandas as pd   
from bs4 import BeautifulSoup
import sys
import pprint
from SPARQLWrapper import SPARQLWrapper, JSON
from tqdm.notebook import trange  
from wikidataintegrator import wdi_core, wdi_login

endpoint_url = "https://query.wikidata.org/sparql"

SparqlQuery = """SELECT ?item ?arkid WHERE {
?item wdt:P8899 ?arkid
}"""


http = urllib3.PoolManager()

# Query https://w.wiki/Vo5
def get_results(endpoint_url, query):
    user_agent = "user  salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

SparQlResults = get_results(endpoint_url, SparqlQuery)
bindings = SparQlResults["results"]["bindings"]
length = len(bindings)

# Collect the rows in a list and build the DataFrame once;
# DataFrame.append was removed in pandas 2.0.
rows = []
for r in trange(0, length):
    resultSparql = bindings[r]
    wd = resultSparql["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    # ?arkid is always bound by this query, but fall back to "" just in case
    wdArkid = resultSparql.get("arkid", {}).get("value", "")
    rows.append({'WD': wd, 'arkid': wdArkid})
dfWikidata = pd.DataFrame(rows, columns=['WD', 'arkid'])
  





In [3]:
dfWikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3383 entries, 0 to 3382
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   WD      3383 non-null   object
 1   arkid   3383 non-null   object
dtypes: object(2)
memory usage: 53.0+ KB


In [4]:
dfWikidata.head(200)

Unnamed: 0,WD,arkid
0,Q890742,"Anna,-prinsessa-av-Sverige"
1,Q1698723,aderne-john
2,Q1254,"Annan,-Kofi-A"
3,Q179025,"Anouilh,-Jean"
4,Q5557877,"Antoni,-Nils"
...,...,...
195,Q5695637,carlberg-b-w
196,Q793616,"prins-Carl,-hertig-av-Västergötland"
197,Q5602474,carlberg-gosta
198,Q5602592,carleson-carl
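
For readers unfamiliar with the SPARQL JSON result format, here is a minimal sketch of how the bindings map onto the `WD` and `arkid` columns. The `sample` dictionary is hand-written for illustration (its two rows reuse values from the output above), and `bindings_to_rows` is a hypothetical helper name, not part of SPARQLWrapper.

```python
# Hand-written sample in the shape returned by sparql.query().convert()
sample = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1254"},
                "arkid": {"type": "literal", "value": "Annan,-Kofi-A"},
            },
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1698723"},
                "arkid": {"type": "literal", "value": "aderne-john"},
            },
        ]
    }
}

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def bindings_to_rows(results):
    """Turn SPARQL JSON bindings into a list of {'WD', 'arkid'} dicts (hypothetical helper)."""
    rows = []
    for b in results["results"]["bindings"]:
        rows.append({
            "WD": b["item"]["value"].replace(ENTITY_PREFIX, ""),
            # a variable that is unbound in a row is simply absent from that binding
            "arkid": b.get("arkid", {}).get("value", ""),
        })
    return rows


rows = bindings_to_rows(sample)
print(rows)
```

Passing the resulting list to `pd.DataFrame(rows)` should give a frame with the same two columns as `dfWikidata`.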


In [5]:
import urllib.parse
urlbase = "https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page="
urlbase_entry = "https://arken.kb.se"
arken_rows = []  # collected as dicts; DataFrame.append was removed in pandas 2.0

#for i in range(1,80):
for i in range(1,90):
    url = urlbase + str(i)
    print(url)
    r = http.request('GET', url)
    soup = BeautifulSoup(r.data, "html.parser")
    for link in soup.select('div.search-result-description a[href]'):
        nameAuth = link.string
        # the first path segment of the href, percent-decoded, is used as the Arken ID
        urlAuth = urllib.parse.unquote(link['href'].split("/")[1])
        urlentry = urlbase_entry + link['href']
        try:
            r_entry = http.request('GET', urlentry)
            soup_entry = BeautifulSoup(r_entry.data, "html.parser")
            Auktoriserad = ""
            Datum = ""
            Auktoritetspost = ""
            # each div.field is read as an h3 label followed by a div holding the value
            for f in soup_entry.select('div.field'):
                h3 = f.select("h3")
                divText = f.select("div")
                if len(h3) > 0 and len(divText) > 0:
                    label = h3[0].getText()
                    value = divText[0].getText().strip()
                    if "Auktoriserad" in label:
                        Auktoriserad = value
                    if "Datum för verksamhetstid" in label:
                        Datum = value
                    if "Auktoritetspost" in label:
                        Auktoritetspost = value
            arken_rows.append({'nameAuth': nameAuth, 'urlAuth': urlAuth, 'Auktoriserad': Auktoriserad,
                               'Datum': Datum, 'Auktoritetspost': Auktoritetspost})
        except Exception as e:
            # report which entry failed instead of a bare "Error"
            print("Error", urlentry, e)

dfArken = pd.DataFrame(arken_rows, columns=['nameAuth', 'urlAuth', 'Auktoriserad', 'Datum', 'Auktoritetspost'])
                
 

https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=1
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=2
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=3
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=4
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=5
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=6
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=7
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=8
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=9
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=10
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=11
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=12
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=13
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=14
https://arken.kb.se/actor/browse?sort=alpha

In [6]:
dfArken.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6034 entries, 0 to 6033
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   nameAuth         6034 non-null   object
 1   urlAuth          6034 non-null   object
 2   Auktoriserad     6034 non-null   object
 3   Datum            6034 non-null   object
 4   Auktoritetspost  6034 non-null   object
dtypes: object(5)
memory usage: 235.8+ KB


In [7]:
dfArken.head(10)

Unnamed: 0,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost
0,"Abenius, Margit","Abenius,-Margit","Abenius, Margit",1899-1970,https://libris.kb.se/ljx00mt45v0dfx5#it
1,"Abenius, Vera","Abenius,-Vera","Abenius, Vera",1890-1967,ediffah:kb:636923:1147851925
2,"Aber, Erich","Aber,-Erich","Aber, Erich",1904-1995,ediffah:kb:294903:1160049953
3,"Abildgaard, Nicolai","Abildgaard,-Nicolai","Abildgaard, Nicolai",1743-1809,https://libris.kb.se/sq4671cb16gj9q4#it
4,"Abrahamson, August","Abrahamson,-August","Abrahamson, August",1817-1898,https://libris.kb.se/wt7bkc9f1h1tt4z#it
...,...,...,...,...,...
195,"Andersson, George",andersson-george,"Andersson, George",,
196,"Andersson, Gunder","Andersson,-Gunder","Andersson, Gunder",1943-,https://libris.kb.se/vs688w3d219c01z#it
197,"Andersson, Gunnar",andersson-gunnar,"Andersson, Gunnar",1925-1995,
198,"Andersson, Ingeborg","Andersson,-Ingeborg","Andersson, Ingeborg",,
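
The `urlAuth` column is the percent-decoded first path segment of each entry link, and it is the value the merge further down compares against Wikidata's `arkid`. A minimal sketch of that round trip, using only the standard library, is shown here. `arken_id_from_href` and `arken_url_from_id` are hypothetical helper names, and the assumption (taken from the scraping code above) is that an entry lives at `https://arken.kb.se/<id>`; whether the site itself percent-encodes commas in its links is not checked here.

```python
from urllib.parse import quote, unquote

ARKEN_BASE = "https://arken.kb.se"  # same base as urlbase_entry in the notebook


def arken_id_from_href(href):
    """'/Abenius%2C-Margit' -> 'Abenius,-Margit' (first path segment, percent-decoded)."""
    return unquote(href.split("/")[1])


def arken_url_from_id(arken_id):
    """Rebuild an entry URL; quote() percent-encodes commas and non-ASCII letters."""
    return ARKEN_BASE + "/" + quote(arken_id)


for arken_id in ["Abenius,-Margit", "andersson-george", "prins-Carl,-hertig-av-Västergötland"]:
    url = arken_url_from_id(arken_id)
    # decoding the path of the rebuilt URL gives the original ID back
    assert arken_id_from_href(url[len(ARKEN_BASE):]) == arken_id
    print(url)
```

Because `unquote` leaves already-decoded text unchanged, the same helper works whether the href contains `%2C` or a literal comma.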


In [8]:
dfArken.to_csv(r'Arken.csv')

## Compare Arken with Wikidata
* dfArken: the records scraped from Arken
* dfWikidata: the Wikidata items that have P8899

In [9]:
dfWikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3383 entries, 0 to 3382
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   WD      3383 non-null   object
 1   arkid   3383 non-null   object
dtypes: object(2)
memory usage: 53.0+ KB


In [10]:
dfArken.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6034 entries, 0 to 6033
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   nameAuth         6034 non-null   object
 1   urlAuth          6034 non-null   object
 2   Auktoriserad     6034 non-null   object
 3   Datum            6034 non-null   object
 4   Auktoritetspost  6034 non-null   object
dtypes: object(5)
memory usage: 235.8+ KB


In [11]:
dfArken.sample()

Unnamed: 0,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost
4222,"Pettersson, Agnes",pettersson-agnes,"Pettersson, Agnes",,


In [12]:
# Outer merge of Wikidata (arkid) with Arken (urlAuth); the indicator column shows where each row was found
mergeArkenWD = pd.merge(dfWikidata, dfArken,how='outer', left_on='arkid',right_on='urlAuth',indicator=True)   
mergeArkenWD.rename(columns={"_merge": "WD_Arken"},inplace = True)
mergeArkenWD['WD_Arken'] = mergeArkenWD['WD_Arken'].str.replace('left_only','WD_only').str.replace('right_only','Arken_only')
mergeArkenWD["WD_Arken"].value_counts()  


both          3327
Arken_only    2712
WD_only         56
Name: WD_Arken, dtype: int64
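
These counts are per merged row, not per distinct ID. Note that `both` + `Arken_only` is 6039, five more than the 6034 rows in `dfArken`. One possible explanation (an assumption, not verified here) is that a few Arken IDs are attached to more than one Wikidata item, so the same Arken row is matched more than once. A minimal sketch of a set-based cross-check on toy data follows; the IDs and the helper name `compare_ids` are made up for illustration.

```python
from collections import Counter


def compare_ids(wd_ids, arken_ids):
    """Classify distinct IDs and list IDs used more than once on the Wikidata side."""
    wd_set, arken_set = set(wd_ids), set(arken_ids)
    duplicates = sorted(i for i, n in Counter(wd_ids).items() if n > 1)
    return {
        "both": wd_set & arken_set,
        "WD_only": wd_set - arken_set,
        "Arken_only": arken_set - wd_set,
        "duplicated_in_WD": duplicates,
    }


# Toy data: "a-b" is used by two Wikidata items, "Frankrike" has no actor page
wd_ids = ["a-b", "a-b", "c-d", "Frankrike"]
arken_ids = ["a-b", "c-d", "e-f"]

report = compare_ids(wd_ids, arken_ids)
for key, value in report.items():
    print(key, sorted(value))
```

On the real data the call would be `compare_ids(dfWikidata['arkid'], dfArken['urlAuth'])`; a non-empty `duplicated_in_WD` list would point at candidate items to review on Wikidata.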

In [13]:
Arken_only = mergeArkenWD[mergeArkenWD["WD_Arken"] == "Arken_only"].copy() 
WD_only = mergeArkenWD[mergeArkenWD["WD_Arken"] == "WD_only"].copy() 
# could be places etc....
WD_only.head(10)

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
702,Q879471,Bjurholm,,,,,,WD_only
704,Q142,Frankrike,,,,,,WD_only
710,Q926728,Bromma,,,,,,WD_only
714,Q3388444,Holte,,,,,,WD_only
722,Q665230,Mondsee,,,,,,WD_only
729,Q1157266,"Edqvist, Dagmar",,,,,,WD_only
730,Q184719,antroposofi,,,,,,WD_only
736,Q10427902,SE-S-HS-Acc2012-107,,,,,,WD_only
2381,Q25425317,"Sundelin,-Nils-Johan",,,,,,WD_only
2591,Q1966481,Lövsta-bruk,,,,,,WD_only


In [14]:
pd.set_option('display.max_rows', None)  
Arken_only.sample(10)

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
3800,,,"Cervin, Claes","Cervin,-Claes","Cervin, Claes",,,Arken_only
5384,,,"Sahnenius, Anders Gustaf","Sahnenius,-Anders-Gustaf","Sahnenius, Anders Gustaf",,,Arken_only
5623,,,"Stuart, Harald",stuart-harald,"Stuart, Harald",,,Arken_only
3864,,,"De Geer, Louis","Geer,-Louis-de","De Geer, Louis",,,Arken_only
3512,,,"Aspner, Ludvig",aspner-ludvig,"Aspner, Ludvig",,,Arken_only
3726,,,"Brodd, Ranveig","Brodd,-Ranveig","Brodd, Ranveig",1926-1999,https://libris.kb.se/ljx181t43lp6hpw#it,Arken_only
4433,,,"Hjortsberg, Lovisa Karolina","Berg,-Lovisa-Karolina","Hjortsberg, Lovisa Karolina",1807-1868,https://libris.kb.se/64jmqf3q0pjkrw4#it,Arken_only
5173,,,"Ostwald, Lars Knut",ostwald-lars-knut,"Ostwald, Lars Knut",1899-1968,,Arken_only
4388,,,"Hellquist, Carl Gustaf Ingemar","Hellquist,-Carl-Gustaf-Ingemar","Hellquist, Carl Gustaf Ingemar",1896-1973,ediffah:kb:450489:1429185659,Arken_only
4504,,,"Isberg, Josephine","Isberg,-Josephine","Isberg, Josephine",,,Arken_only


In [None]:
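# Note: missing authority records are stored as empty strings, not NaN,
# so notnull() keeps every Arken_only row; the next cell filters on != '' instead.
mergewithLibris = Arken_only[Arken_only["Auktoritetspost"].notnull()].copy() 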
mergewithLibris.sample(10)

In [None]:
#mergewithLibris = Arken_only[Arken_only["Auktoritetspost"].notnull()].copy() 
ArkenOnlyWithAuthrec = Arken_only[Arken_only["Auktoritetspost"] != ''].copy() 
ArkenOnlyWithAuthrec.to_csv(r'ArkenOnlyWithAuthrec.csv')   
ArkenOnlyWithAuthrec

In [17]:
mergewithLibris
mergewithLibris.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2712 entries, 3383 to 6094
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   WD               0 non-null      object
 1   arkid            0 non-null      object
 2   nameAuth         2712 non-null   object
 3   urlAuth          2712 non-null   object
 4   Auktoriserad     2712 non-null   object
 5   Datum            2712 non-null   object
 6   Auktoritetspost  2712 non-null   object
 7   WD_Arken         2712 non-null   object
dtypes: object(8)
memory usage: 190.7+ KB
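
The `info()` output above shows 2712 non-null `Auktoritetspost` values, which means `notnull()` filtered nothing out: the scraper stores a missing field as an empty string. A minimal sketch of the difference on a toy frame (the three rows reuse values from the `Arken_only` sample output earlier; the Libris prefix check is one possible reading of what `mergewithLibris` is meant to hold, not something the notebook states):

```python
import pandas as pd

toy = pd.DataFrame({
    "nameAuth": ["Cervin, Claes", "Brodd, Ranveig", "Hellquist, Carl Gustaf Ingemar"],
    "Auktoritetspost": [
        "",
        "https://libris.kb.se/ljx181t43lp6hpw#it",
        "ediffah:kb:450489:1429185659",
    ],
})

# An empty string is still a value, so notnull() keeps all three rows
kept_by_notnull = toy[toy["Auktoritetspost"].notnull()]

# Comparing against '' drops the row that has no authority record
kept_by_nonempty = toy[toy["Auktoritetspost"] != ""]

# Only records whose authority record is a Libris URI
libris_only = toy[toy["Auktoritetspost"].str.startswith("https://libris.kb.se/")]

print(len(kept_by_notnull), len(kept_by_nonempty), len(libris_only))
```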


## Places

TBD

In [24]:
urlbase = "https://arken.kb.se/taxonomy/index/id/42?sort=alphabetic&sortDir=asc&page="
# dfp is not filled anywhere yet (this section is TBD)
dfp = pd.DataFrame(columns=['nameAuth', 'urlAuth', 'Auktoriserad', 'Datum', 'Auktoritetspost'])
def check_Taxonomy(url):
    # Fetch one entry page and read its "Taxonomi" field.
    # Placeholder: the value is not used yet and the function always returns True.
    r_entry = http.request('GET', url)
    soup_entry = BeautifulSoup(r_entry.data, "html.parser")
    fields = soup_entry.select('div.field')
    for f in fields:
        h3 = f.select("h3")
        divText = f.select("div")
        if len(h3) > 0:
            if "Taxonomi" in h3[0].getText():
                #print("\tTaxonomi: " + divText[0].getText().strip())
                Taxonomi = divText[0].getText().strip()
    return True
    
for i in range(1,10):
    url = urlbase + str(i)
    #print(url)
    r = http.request('GET', url)
    soup = BeautifulSoup(r.data, "html.parser")
    for link in soup.select('table  a[href]'):
        nameAuth = link.string
        urlAuth = urllib.parse.unquote(link['href'].split("/")[1])
        #print ("\t",urlAuth, nameAuth)    
        urlentry = urlbase_entry + link['href']
        #print ("\t\t",urlentry)
        if check_Taxonomy(urlentry):
            #print("True")
            pass
    