## Arken - Wikidata
version 1.1

WD property [Property:P8899](https://www.wikidata.org/wiki/Property:P8899)
* This [notebook](https://github.com/salgo60/open-data-examples/blob/master/Arken.ipynb)
* Task [T269064](https://phabricator.wikimedia.org/T269064)
----


#### Other sources we sync
* [Arken](https://github.com/salgo60/open-data-examples/blob/master/Arken.ipynb)
  * WD property [P8899](https://www.wikidata.org/wiki/Property:P8899)
* [Kulturpersoner Uppsalakyrkogård](https://github.com/salgo60/open-data-examples/blob/master/Check%20WD%20kulturpersoner%20uppsalakyrkogardar.ipynb)
* [Litteraturbanken](https://github.com/salgo60/open-data-examples/blob/master/Litteraturbanken%20Author.ipynb)
  * WD properties [P5101](https://www.wikidata.org/wiki/Property_talk:P5101), [P5123](https://www.wikidata.org/wiki/Property_talk:P5123)
* [Nobelprize.org](https://github.com/salgo60/open-data-examples/blob/master/Nobel%20API.ipynb)
  * WD property [P8024](https://www.wikidata.org/wiki/Property:P8024)
* [SBL](https://github.com/salgo60/open-data-examples/blob/master/SBL.ipynb)
  * WD property [P3217](https://www.wikidata.org/wiki/Property:P3217)
* [SKBL](https://github.com/salgo60/open-data-examples/blob/master/Svenskt%20Kvinnobiografiskt%20lexikon%20part%203.ipynb)
  * WD property [P4963](https://www.wikidata.org/wiki/Property:P4963)
* [Svenska Akademien](https://github.com/salgo60/open-data-examples/blob/master/Svenska%20Akademien.ipynb)
  * WD property [P5325](https://www.wikidata.org/wiki/Property:P5325)


In [1]:
from datetime import datetime
now = datetime.now()
print("Last run: ", now)

Last run:  2021-08-08 22:46:59.256845


In [2]:
import urllib3, json
import pandas as pd   
from bs4 import BeautifulSoup
import sys
import pprint
from SPARQLWrapper import SPARQLWrapper, JSON
from tqdm.notebook import trange  
from wikidataintegrator import wdi_core, wdi_login

endpoint_url = "https://query.wikidata.org/sparql"

SparqlQuery = """SELECT ?item ?arkid WHERE {
?item wdt:P8899 ?arkid
}"""


http = urllib3.PoolManager()

# Query https://w.wiki/Vo5
def get_results(endpoint_url, query):
    user_agent = "user  salgo60/%s.%s" % (sys.version_info[0], sys.version_info[1])
    sparql = SPARQLWrapper(endpoint_url, agent=user_agent)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

SparQlResults = get_results(endpoint_url, SparqlQuery)
bindings = SparQlResults["results"]["bindings"]
length = len(bindings)

# Collect rows in a list first; DataFrame.append was removed in pandas 2.0
rows = []
for r in trange(0, length):
    resultSparql = bindings[r]
    wd = resultSparql["item"]["value"].replace("http://www.wikidata.org/entity/", "")
    # ?arkid is bound by this query; the fallback only guards against a missing key
    wdArkid = resultSparql.get("arkid", {}).get("value", "")
    rows.append({'WD': wd, 'arkid': wdArkid})
dfWikidata = pd.DataFrame(rows, columns=['WD', 'arkid'])
  

  0%|          | 0/3390 [00:00<?, ?it/s]
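
For reference, the SPARQL JSON result that the loop above walks through has one binding per result row, with each variable wrapped in a dict that carries a `value` key. The sketch below flattens a hand-written sample of that shape into plain rows. `bindings_to_rows` and the sample data are illustrative and not part of the notebook; the second sample row deliberately omits `arkid` to show the empty-string fallback.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/"

def bindings_to_rows(results):
    """Flatten a SPARQL JSON result into a list of {'WD', 'arkid'} dicts."""
    rows = []
    for binding in results["results"]["bindings"]:
        qid = binding["item"]["value"].replace(ENTITY_PREFIX, "")
        # A variable that is unbound for a row is absent from that binding.
        arkid = binding.get("arkid", {}).get("value", "")
        rows.append({"WD": qid, "arkid": arkid})
    return rows

# Hand-written sample in the standard SPARQL JSON results shape (illustrative data).
sample = {
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": ENTITY_PREFIX + "Q1254"},
             "arkid": {"type": "literal", "value": "Annan,-Kofi-A"}},
            # Hypothetical row without an arkid binding, to exercise the fallback.
            {"item": {"type": "uri", "value": ENTITY_PREFIX + "Q1511"}},
        ]
    }
}

rows = bindings_to_rows(sample)
print(rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.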

In [3]:
dfWikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   WD      3390 non-null   object
 1   arkid   3390 non-null   object
dtypes: object(2)
memory usage: 53.1+ KB


In [4]:
dfWikidata.head(200)

Unnamed: 0,WD,arkid
0,Q1254,"Annan,-Kofi-A"
1,Q1149,"Gandhi,-Indira"
2,Q4441,"Dickinson,-Emily"
3,Q2677,vilhelm-b-ii-c-kejsare-av-tyskland
4,Q1511,wagner-richard
...,...,...
195,Q5907381,"Key-Åberg,-Sandro"
196,Q5909732,"Kindblad,-Karl-Eduard"
197,Q5916153,"Knös,-Anders-Erik"
198,Q5916306,"Kobb,-Gustaf"


In [5]:
import urllib.parse
urlbase = "https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page="
urlbase_entry = "https://arken.kb.se"
rows = []  # collected as dicts; DataFrame.append was removed in pandas 2.0

#for i in range(1,80):
for i in range(1,90):
    url = urlbase + str(i)
    print(url)
    r = http.request('GET', url)
    soup = BeautifulSoup(r.data, "html.parser")
    for link in soup.select('div.search-result-description a[href]'):
        nameAuth = link.string
        # href looks like "/Abenius,-Margit"; the first path segment is kept as urlAuth
        urlAuth = urllib.parse.unquote(link['href'].split("/")[1])
        urlentry = urlbase_entry + link['href']
        try:
            r_entry = http.request('GET', urlentry)
            soup_entry = BeautifulSoup(r_entry.data, "html.parser")
            Auktoriserad = ""
            Datum = ""
            Auktoritetspost = ""
            # Each metadata field is a div.field holding an <h3> label and a <div> value
            fields = soup_entry.select('div.field')
            for f in fields:
                h3 = f.select("h3")
                divText = f.select("div")
                if len(h3) > 0:
                    if "Auktoriserad" in h3[0].getText():
                        Auktoriserad = divText[0].getText().strip()
                    if "Datum för verksamhetstid" in h3[0].text:
                        Datum = divText[0].getText().strip()
                    if "Auktoritetspost" in h3[0].text:
                        Auktoritetspost = divText[0].getText().strip()

            rows.append({'nameAuth': nameAuth, 'urlAuth': urlAuth, 'Auktoriserad': Auktoriserad,
                         'Datum': Datum, 'Auktoritetspost': Auktoritetspost})

        except Exception as e:
            # Report which entry failed instead of a bare "Error"
            print("Error", urlentry, e)

dfArken = pd.DataFrame(rows, columns=['nameAuth', 'urlAuth', 'Auktoriserad', 'Datum', 'Auktoritetspost'])
                
 

https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=1
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=2
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=3
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=4
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=5
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=6
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=7
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=8
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=9
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=10
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=11
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=12
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=13
https://arken.kb.se/actor/browse?sort=alphabetic&sortDir=asc&page=14
https://arken.kb.se/actor/browse?sort=alpha
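
The page range in the cell above is hard-coded (`range(1,90)`), so it has to be adjusted by hand when the number of actors in Arken changes. One possible alternative, sketched here with a stand-in fetch function, is to keep requesting pages until one comes back without result links. `iter_pages`, `fake_fetch` and `FAKE_SITE` are hypothetical names; in the notebook the fetch step would be the `http.request` plus `soup.select(...)` pair, and whether Arken actually returns an empty result list past the last page has not been verified here.

```python
def iter_pages(fetch_links, first_page=1, max_pages=500):
    """Yield (page_number, links) until a page has no links or max_pages is reached."""
    for page in range(first_page, first_page + max_pages):
        links = fetch_links(page)
        if not links:
            break  # first empty page: assume we are past the end
        yield page, links

# Stand-in for the real HTTP + BeautifulSoup step: three pages, then nothing.
FAKE_SITE = {
    1: ["/Abenius,-Margit", "/Abenius,-Vera"],
    2: ["/Aber,-Erich"],
    3: ["/abramson-ernst"],
}

def fake_fetch(page):
    return FAKE_SITE.get(page, [])

pages = list(iter_pages(fake_fetch))
print([page for page, _ in pages])
```

The `max_pages` cap is a safety net so that a site which never returns an empty page cannot cause an endless loop.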

In [6]:
dfArken.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6079 entries, 0 to 6078
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   nameAuth         6079 non-null   object
 1   urlAuth          6079 non-null   object
 2   Auktoriserad     6079 non-null   object
 3   Datum            6079 non-null   object
 4   Auktoritetspost  6079 non-null   object
dtypes: object(5)
memory usage: 237.6+ KB


In [7]:
dfArken.head(10)

Unnamed: 0,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost
0,"Abenius, Margit","Abenius,-Margit","Abenius, Margit",1899-1970,https://libris.kb.se/ljx00mt45v0dfx5#it
1,"Abenius, Vera","Abenius,-Vera","Abenius, Vera",1890-1967,ediffah:kb:636923:1147851925
2,"Aber, Erich","Aber,-Erich","Aber, Erich",1904-1995,ediffah:kb:294903:1160049953
3,"Abildgaard, Nicolai","Abildgaard,-Nicolai","Abildgaard, Nicolai",1743-1809,https://libris.kb.se/sq4671cb16gj9q4#it
4,"Abrahamson, August","Abrahamson,-August","Abrahamson, August",1817-1898,https://libris.kb.se/wt7bkc9f1h1tt4z#it
5,"Abrahamson, Kjell Albin",abrahamson-kjell-albin,"Abrahamson, Kjell Albin",1945-2016,http://libris.kb.se/rp355s6942t756j
6,"Abrahamsson, Maggie","Abrahamsson,-Maggie","Abrahamsson, Maggie",,
7,"Abramson, August",abramson-august,"Abramson, August",,
8,"Abramson, Axel Nathanael","Abramson,-Axel-Nathanael","Abramson, Axel Nathanael",1855-1914,https://libris.kb.se/jgvz60z22gzmc7m#it
9,"Abramson, Ernst",abramson-ernst,"Abramson, Ernst",1896-1979,https://libris.kb.se/sq467r8b01553hh#it
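
The `Datum` column holds year ranges such as `1899-1970`, open ranges such as `1944-` (seen in later outputs), or an empty string. When comparing with Wikidata it can help to split these into separate start and end years. The following is a minimal sketch that assumes only those three shapes occur; `parse_datum` is a hypothetical helper, and any other format is returned as `(None, None)` rather than guessed at.

```python
import re

# Four-digit start year, a hyphen, and an optional four-digit end year.
_DATUM_RE = re.compile(r"^\s*(\d{4})\s*-\s*(\d{4})?\s*$")

def parse_datum(datum):
    """Return (start_year, end_year); missing or unparseable parts become None."""
    match = _DATUM_RE.match(datum or "")
    if not match:
        return (None, None)
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return (start, end)

for value in ["1899-1970", "1944-", ""]:
    print(repr(value), "->", parse_datum(value))
```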


In [8]:
dfArken.to_csv(r'Arken.csv')

## Check diff against Wikidata
* dfArken: records scraped from Arken
* dfWikidata: Wikidata items that have P8899
* see also 

In [9]:
dfWikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3390 entries, 0 to 3389
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   WD      3390 non-null   object
 1   arkid   3390 non-null   object
dtypes: object(2)
memory usage: 53.1+ KB


In [10]:
dfArken.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6079 entries, 0 to 6078
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   nameAuth         6079 non-null   object
 1   urlAuth          6079 non-null   object
 2   Auktoriserad     6079 non-null   object
 3   Datum            6079 non-null   object
 4   Auktoritetspost  6079 non-null   object
dtypes: object(5)
memory usage: 237.6+ KB


In [11]:
dfArken.sample()

Unnamed: 0,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost
958,Carlander (släkt),Carlander-släkt,Carlander (släkt),,https://libris.kb.se/r93b2cx32s33flg#it


In [12]:
# Outer merge of the Wikidata P8899 values with the scraped Arken records
mergeArkenWD = pd.merge(dfWikidata, dfArken,how='outer', left_on='arkid',right_on='urlAuth',indicator=True)   
mergeArkenWD.rename(columns={"_merge": "WD_Arken"},inplace = True)
mergeArkenWD['WD_Arken'] = mergeArkenWD['WD_Arken'].str.replace('left_only','WD_only').str.replace('right_only','Arken_only')
mergeArkenWD["WD_Arken"].value_counts()  


both          3331
Arken_only    2752
WD_only         59
Name: WD_Arken, dtype: int64
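
The counts can be cross-checked by hand. `both` + `WD_only` is 3331 + 59 = 3390, which matches the rows in `dfWikidata`, but `both` + `Arken_only` is 3331 + 2752 = 6083, four more than the 6079 rows in `dfArken`. One explanation consistent with these numbers is that a few Arken IDs are attached to more than one Wikidata item, so the same Arken row is matched more than once. The sketch below shows, on made-up sample keys, how such repeated IDs and the three-way split could be listed with plain sets; `compare_keys` is a hypothetical helper.

```python
from collections import Counter

def compare_keys(wd_ids, arken_ids):
    """Split IDs into both / WD_only / Arken_only and list IDs repeated on the WD side."""
    wd_set, arken_set = set(wd_ids), set(arken_ids)
    return {
        "both": sorted(wd_set & arken_set),
        "WD_only": sorted(wd_set - arken_set),
        "Arken_only": sorted(arken_set - wd_set),
        # IDs used by more than one Wikidata item inflate the "both" row count.
        "repeated_in_WD": sorted(k for k, n in Counter(wd_ids).items() if n > 1),
    }

# Made-up sample: "wagner-richard" is attached to two items here.
wd_ids = ["Annan,-Kofi-A", "wagner-richard", "wagner-richard", "Ystad"]
arken_ids = ["Annan,-Kofi-A", "wagner-richard", "Abenius,-Vera"]

report = compare_keys(wd_ids, arken_ids)
for name, keys in report.items():
    print(name, keys)
```

In the notebook the inputs would be `dfWikidata['arkid']` and `dfArken['urlAuth']`.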

In [13]:
Arken_only = mergeArkenWD[mergeArkenWD["WD_Arken"] == "Arken_only"].copy() 
WD_only = mergeArkenWD[mergeArkenWD["WD_Arken"] == "WD_only"].copy() 
# WD_only: P8899 values not found among the scraped Arken actors, e.g. places
WD_only.head(10)

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
17,Q28287,Ystad,,,,,,WD_only
26,Q64694,Dornach,,,,,,WD_only
50,Q208177,Birka,,,,,,WD_only
94,Q2167,Lund,,,,,,WD_only
217,Q842877,Furusund,,,,,,WD_only
219,Q848393,Ludvika,,,,,,WD_only
235,Q990076,Djursholm,,,,,,WD_only
279,Q2577744,Ingarö,,,,,,WD_only
282,Q90,Paris,,,,,,WD_only
360,Q1027830,Gotland,,,,,,WD_only
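
To inspect an unmatched ID by hand it helps to turn it back into an Arken URL. The scraping cell builds entry URLs as `urlbase_entry + link['href']` and derives `urlAuth` by unquoting the first path segment, so the inverse is to percent-encode the ID and append it to the base. This is a sketch under that assumption; `arken_url` and `href_to_id` are hypothetical helpers, and whether place IDs such as `Ystad` resolve under the same path is not confirmed here.

```python
from urllib.parse import quote, unquote

ARKEN_BASE = "https://arken.kb.se"

def arken_url(arkid):
    """Build an entry URL from an Arken ID (assumed layout: base + '/' + id)."""
    # Keep commas readable; other reserved or non-ASCII characters are percent-encoded.
    return ARKEN_BASE + "/" + quote(arkid, safe=",")

def href_to_id(href):
    """Inverse step used by the scraper: first path segment, percent-decoded."""
    return unquote(href.split("/")[1])

print(arken_url("Ystad"))
print(arken_url("Key-Åberg,-Sandro"))
```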


In [14]:
pd.set_option('display.max_rows', None)  
Arken_only.sample(10)

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
5329,,,"Remelin, Anton","Remelin,-Anton","Remelin, Anton",1882-1970,https://libris.kb.se/53hknbvp2pl045c#it,Arken_only
6052,,,"Wolfbrandt, Thore","Wolfbrandt,-Thore","Wolfbrandt, Thore",,,Arken_only
4190,,,"Geber, Hugo","Geber,-Hugo","Geber, Hugo",,,Arken_only
4346,,,Hammarskjöld (släkt),Hammarskjöld-släkt,Hammarskjöld (släkt),,http://libris.kb.se/resource/bib/17260844,Arken_only
3689,,,"Bonde, Ingeborg","Bonde,-Ingeborg","Bonde, Ingeborg",1882-1943,ediffah:kb:495327:1270823290,Arken_only
5461,,,"Sario, S.","Sario,-S","Sario, S.",,,Arken_only
4991,,,"Marcusdotter, Marianne","Marcusdotter,-Marianne","Marcusdotter, Marianne",,,Arken_only
4073,,,"Fischer, Hildur","Fischer,-Hildur","Fischer, Hildur",1843-1926,ediffah:kb:849663:1437035044,Arken_only
3880,,,"Del Pezzo, Gateano","del-Pezzo,-Gateano","Del Pezzo, Gateano",,,Arken_only
3398,,,"Adelsköld, Sofia Marie","Adelsköld,-Sofia-Marie","Adelsköld, Sofia Marie",,,Arken_only


In [15]:
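# Note: missing Auktoritetspost values are stored as '' (not NaN), so notnull()
# keeps every row; the next cell filters on the empty string instead
mergewithLibris = Arken_only[Arken_only["Auktoritetspost"].notnull()].copy()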
mergewithLibris.sample(10)

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
4985,,,"Malmström, Carl Gustav","Malmström,-Carl-Gustav","Malmström, Carl Gustav",1822-1912,ediffah:kb:736495:1364311334,Arken_only
5066,,,"Mörner, Nils C.",morner-nils-c,"Mörner, Nils C.",1849-1926,,Arken_only
5103,,,"Nilsén, Rolf",nilsen-rolf,"Nilsén, Rolf",1954-2014,,Arken_only
5895,,,"Umeå universitet, institutionen för litteratur...",umea-universitet-institutionen-for-litteraturv...,"Umeå universitet, institutionen för litteratur...",,,Arken_only
5472,,,"Scheringson, Reinhold",scheringson-reinhold,"Scheringson, Reinhold",1759-1849,https://libris.kb.se/97mqw59t1t62bmj#it,Arken_only
5729,,,Sverige. Statistiska centralbyrån,Statistiska-centralbyrån,Sverige. Statistiska centralbyrån,,https://libris.kb.se/vs68659d3hpd0xr#it,Arken_only
4955,,,"Löfgren, M.","Löfgren,-M","Löfgren, M.",,,Arken_only
4569,,,"Josephson, Erik Semmy","Josephson,-Erik-Semmy","Josephson, Erik Semmy",1864-1929,ediffah:kb:445895:1232374701,Arken_only
5201,,,Ostwald (släkt),ostwald,Ostwald (släkt),,,Arken_only
5906,,,"Unander-Scharin, Charlotte","Unander-Scharin,-Charlotte","Unander-Scharin, Charlotte",1952-,ediffah:kb:617663:1436872516,Arken_only


In [16]:
#mergewithLibris = Arken_only[Arken_only["Auktoritetspost"].notnull()].copy() 
ArkenOnlyWithAuthrec = Arken_only[Arken_only["Auktoritetspost"] != ''].copy() 
ArkenOnlyWithAuthrec.to_csv(r'ArkenOnlyWithAuthrec.csv')   
ArkenOnlyWithAuthrec

Unnamed: 0,WD,arkid,nameAuth,urlAuth,Auktoriserad,Datum,Auktoritetspost,WD_Arken
3394,,,Adams-Ray (släkt),Adams-Ray-släkt,Adams-Ray (släkt),,ediffah:kb:688055:1137069059,Arken_only
3396,,,Adelsköld (släkt),Adelsköld-släkt,Adelsköld (släkt),,ediffah:kb:482202:1457702780,Arken_only
3400,,,"Adlercreutz, Hedvig","Adlercreutz,-Hedvig","Adlercreutz, Hedvig",1832-1905,ediffah:kb:584691:1348770251,Arken_only
3408,,,"Aghnides, Thanassis","Aghnides,-Thanassis","Aghnides, Thanassis",1889-1984,ediffah:kb:285979:1415978596,Arken_only
3409,,,"Agrell, Carl Christian","Agrell,-Carl-Christian","Agrell, Carl Christian",1780-1838,ediffah:kb:587676:1189788900,Arken_only
3410,,,"Ahlbom, Frans Leonard","Ahlbom,-Frans-Leonard","Ahlbom, Frans Leonard",1816-1860,ediffah:kb:587676:1189788900,Arken_only
3411,,,"Ahlgren, Gerda","Ahlgren,-Gerda","Ahlgren, Gerda",1856-1945,ediffah:kb:671155:1436943082,Arken_only
3413,,,"Ahlström, Christine","Ahlström,-Christine","Ahlström, Christine",1807-1890,ediffah:kb:196469:1384421920,Arken_only
3414,,,"Ahnfelt-Rönne, Ragna","Ahnfelt-Rönne,-Ragna","Ahnfelt-Rönne, Ragna",1890-1951,ediffah:kb:963551:1415345577,Arken_only
3415,,,"Ahnlund, Mats",ahnlund-mats,"Ahnlund, Mats",1944-,http://libris.kb.se/resource/auth/306056,Arken_only
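
The `Auktoritetspost` values in these outputs come in a few visibly different shapes: Libris URIs (most ending in `#it`), older `libris.kb.se/resource/...` links, `ediffah:` identifiers, and empty strings. Grouping rows by shape is a plausible first step before matching against Libris. The sketch below classifies purely by string pattern, based on the examples visible in the outputs; the category names and `classify_authority` are made up for illustration.

```python
from urllib.parse import urlparse

def classify_authority(value):
    """Label an Auktoritetspost string by its visible shape."""
    value = (value or "").strip()
    if not value:
        return "none"
    if value.startswith("ediffah:"):
        return "ediffah"
    parsed = urlparse(value)
    if parsed.netloc == "libris.kb.se":
        # Older links use /resource/auth/... or /resource/bib/...
        if parsed.path.startswith("/resource/"):
            return "libris_resource"
        return "libris_uri"
    return "other"

samples = [
    "https://libris.kb.se/ljx00mt45v0dfx5#it",
    "http://libris.kb.se/resource/auth/306056",
    "ediffah:kb:636923:1147851925",
    "",
]
for s in samples:
    print(repr(s), "->", classify_authority(s))
```

In the notebook this could be applied with `ArkenOnlyWithAuthrec["Auktoritetspost"].map(classify_authority)` followed by `value_counts()`.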


In [17]:
mergewithLibris
mergewithLibris.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2752 entries, 3390 to 6141
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   WD               0 non-null      object
 1   arkid            0 non-null      object
 2   nameAuth         2752 non-null   object
 3   urlAuth          2752 non-null   object
 4   Auktoriserad     2752 non-null   object
 5   Datum            2752 non-null   object
 6   Auktoritetspost  2752 non-null   object
 7   WD_Arken         2752 non-null   object
dtypes: object(8)
memory usage: 193.5+ KB


## Places

TBD

In [18]:
urlbase = "https://arken.kb.se/taxonomy/index/id/42?sort=alphabetic&sortDir=asc&page="
dfp = pd.DataFrame(columns=['nameAuth', 'urlAuth', 'Auktoriserad', 'Datum', 'Auktoritetspost'])
def check_Taxonomy(url):
    # Return True if the entry page has a "Taxonomi" field, otherwise False
    r_entry = http.request('GET', url)
    soup_entry = BeautifulSoup(r_entry.data, "html.parser")
    fields = soup_entry.select('div.field')
    found = False
    for f in fields:
        h3 = f.select("h3")
        divText = f.select("div")
        if len(h3) > 0:
            if "Taxonomi" in h3[0].getText():
                #print("\tTaxonomi: " + divText[0].getText().strip())
                Taxonomi = divText[0].getText().strip()
                found = True
    return found
    
for i in range(1,10):
    url = urlbase + str(i)
    #print(url)
    r = http.request('GET', url)
    soup = BeautifulSoup(r.data, "html.parser")
    for link in soup.select('table  a[href]'):
        nameAuth = link.string
        urlAuth = urllib.parse.unquote(link['href'].split("/")[1])
        #print ("\t",urlAuth, nameAuth)    
        urlentry = urlbase_entry + link['href']
        #print ("\t\t",urlentry)
        if check_Taxonomy(urlentry):
            #print("True")
            pass
    