## COVID-19 Infographic

Here we focus on articles related to COVID-19 (Q84263196) and the 2019–20 COVID-19 pandemic (Q81068910), looking for relevant connections such as "Main Subject", "Part Of" or "Has cause".

In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
#https://w.wiki/KvX (Thanks User:Dipsacus_fullonum)
# All statements with item, property, value and rank with COVID-19 (Q84263196) as value for qualifier.

sparql.setQuery("""
SELECT ?item ?itemLabel ?property ?propertyLabel ?value ?valueLabel ?rank ?qualifier ?qualifierLabel
WHERE
{
  ?item ?claim ?statement.
  ?property wikibase:claim ?claim.
  ?property wikibase:statementProperty ?sprop.
  ?statement ?sprop ?value.
  ?statement wikibase:rank ?rank. 
  ?statement ?qprop wd:Q84263196. # COVID-19

  
  ?qualifier wikibase:qualifier ?qprop.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# pd.json_normalize (pandas >= 1.0) replaces the deprecated pd.io.json.json_normalize
allStatements = pd.json_normalize(results['results']['bindings'])
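
The column names used in the following cells, such as `item.value` and `itemLabel.value`, come from flattening the nested SPARQL JSON bindings. A minimal standard-library sketch of that flattening, applied to a hand-written sample response (the helper name `flatten_bindings` and the sample dict are illustrative assumptions, not part of the notebook):

```python
def flatten_bindings(results):
    """Flatten SPARQL JSON bindings into dicts keyed like 'item.value'."""
    rows = []
    for binding in results["results"]["bindings"]:
        row = {}
        for var, cell in binding.items():
            # Each cell is a small dict such as {'type': 'uri', 'value': '...'}
            for field, content in cell.items():
                row["%s.%s" % (var, field)] = content
        rows.append(row)
    return rows


# Hand-written sample shaped like the SPARQL JSON results format
sample = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri",
                         "value": "http://www.wikidata.org/entity/Q83873593"},
                "itemLabel": {"xml:lang": "en", "type": "literal",
                              "value": "2020 coronavirus pandemic in France"},
            }
        ]
    }
}

rows = flatten_bindings(sample)
print(sorted(rows[0].keys()))
```

Each SPARQL variable contributes one column per field, which is why labels end up under `itemLabel.value` rather than `itemLabel`.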

In [3]:
allStatements['valueLabel.value'].value_counts()


disease outbreak                 265
human                              6
treatment                          2
drug development                   1
mascot character                   1
moe anthropomorphic character      1
drug repositioning                 1
medical diagnosis                  1
diagnostic test                    1
vaccine                            1
pandemic                           1
Name: valueLabel.value, dtype: int64

In [4]:
allStatements[['item.value','itemLabel.value']].head(10)


Unnamed: 0,item.value,itemLabel.value
0,http://www.wikidata.org/entity/Q83873593,2020 coronavirus pandemic in France
1,http://www.wikidata.org/entity/Q83889294,2020 coronavirus pandemic in Germany
2,http://www.wikidata.org/entity/Q84166704,2020 coronavirus pandemic in Spain
3,http://www.wikidata.org/entity/Q84055415,2019–20 coronavirus outbreak in Finland
4,http://www.wikidata.org/entity/Q84081576,2020 coronavirus pandemic in Sweden
5,http://www.wikidata.org/entity/Q84098939,2020 coronavirus pandemic in Russia
6,http://www.wikidata.org/entity/Q84446340,2020 coronavirus pandemic in Belgium
7,http://www.wikidata.org/entity/Q86521237,2020 coronavirus outbreak in Asia
8,http://www.wikidata.org/entity/Q83872291,2019–20 coronavirus outbreak in Japan
9,http://www.wikidata.org/entity/Q83873057,2019–20 coronavirus outbreak in Vietnam


In [5]:
# All truthy statements with COVID-19 (Q84263196) as value.
#https://w.wiki/KvZ (Thanks User:Dipsacus_fullonum)

sparql.setQuery("""
SELECT ?item ?itemLabel ?property ?propertyLabel
WHERE
{
  ?item ?claim wd:Q84263196.
  ?property wikibase:directClaim ?claim.
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

truthy = pd.json_normalize(results['results']['bindings'])

In [6]:
truthy[['item.value','itemLabel.value','propertyLabel.value']].sample(10)


Unnamed: 0,item.value,itemLabel.value,propertyLabel.value
897,http://www.wikidata.org/entity/Q88627924,Jorge Alcibíades García Lara,medical condition
894,http://www.wikidata.org/entity/Q88503664,Harouna Kaboré,medical condition
640,http://www.wikidata.org/entity/Q87369505,The COVID‐19 epidemic,main subject
719,http://www.wikidata.org/entity/Q87648634,2020 coronavirus pandemic in Armenia,has cause
870,http://www.wikidata.org/entity/Q88090268,Jean-Paul Hamon,medical condition
203,http://www.wikidata.org/entity/Q13742419,Kees Bakker,cause of death
518,http://www.wikidata.org/entity/Q87070975,2020 coronavirus pandemic in Israel,has cause
1012,http://www.wikidata.org/entity/Q87865656,Detection of Covid-19 in Children in Early Jan...,main subject
265,http://www.wikidata.org/entity/Q19599015,Javier Jiménez,medical condition
780,http://www.wikidata.org/entity/Q87647368,The Incubation Period of Coronavirus Disease 2...,main subject


In [7]:
truthy['propertyLabel.value'].value_counts()


medical condition                454
main subject                     234
has cause                        197
cause of death                   123
research intervention             13
category combines topics           3
different from                     3
facet of                           3
has effect                         3
has immediate cause                2
named after                        2
category's main topic              1
vaccine for                        1
Wikimedia portal's main topic      1
interested in                      1
represents                         1
academic degree                    1
said to be the same as             1
medical condition treated          1
item for this sense                1
Name: propertyLabel.value, dtype: int64

In [8]:
truthy[truthy['propertyLabel.value'] == 'main subject'][['itemLabel.value','propertyLabel.value']].head(5)


Unnamed: 0,itemLabel.value,propertyLabel.value
444,Recent advances in the detection of respirator...,main subject
445,The continuing 2019-nCoV epidemic threat of no...,main subject
446,Clinical features of patients infected with 20...,main subject
447,"Early Transmission Dynamics in Wuhan, China, o...",main subject
448,"2019-nCoV, first death outside China",main subject


In [9]:
truthy[truthy['propertyLabel.value'] == 'has cause'][['itemLabel.value','propertyLabel.value']].head(5)


Unnamed: 0,itemLabel.value,propertyLabel.value
306,2019–20 COVID-19 pandemic,has cause
416,2019–20 coronavirus pandemic in mainland China,has cause
417,2019–20 coronavirus outbreak in Japan,has cause
418,2019–20 COVID-19 outbreak in South Korea,has cause
419,2019–20 coronavirus outbreak in Vietnam,has cause


In [10]:
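# ('facet of' or 'main subject') evaluates to 'facet of', so only that label was ever matched.
# To match several labels, use .isin(['facet of', 'main subject']) instead.
truthy[truthy['propertyLabel.value'] == 'facet of'][['itemLabel.value','propertyLabel.value']].head(5)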


Unnamed: 0,itemLabel.value,propertyLabel.value
490,timeline of the 2019–20 coronavirus outbreak,facet of
491,SARS-CoV-2 transmission,facet of
1043,"Ordinance of January 30, 2020",facet of
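
A pitfall when filtering on several labels: in Python, `('facet of' or 'main subject')` evaluates to its first truthy operand, so comparing against it only ever matches `'facet of'`. A small sketch of the difference, using a plain list rather than the notebook's DataFrame (the sample `labels` list is illustrative); in pandas the membership test is usually written with `Series.isin`:

```python
labels = ['main subject', 'facet of', 'has cause', 'facet of']

# `or` returns the first truthy operand, not a "match either" condition
target = ('facet of' or 'main subject')
only_facet = [label for label in labels if label == target]

# Membership in a set matches either label
wanted = {'facet of', 'main subject'}
both = [label for label in labels if label in wanted]

print(target)
print(len(only_facet), len(both))
```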


In [11]:
truthy[truthy['propertyLabel.value'] == 'medical condition'][['itemLabel.value','propertyLabel.value']].head(5)

Unnamed: 0,itemLabel.value,propertyLabel.value
16,Jackson Browne,medical condition
17,Greg Rikaart,medical condition
18,Albin Ekdal,medical condition
19,Brett Dean,medical condition
20,Shintaro Fujinami,medical condition


In [12]:
# Remove 'medical condition', 'cause of death' and 'medical condition treated' statements
mainSubject = truthy[truthy['propertyLabel.value'] != 'medical condition']
mainSubject = mainSubject[mainSubject['propertyLabel.value'] != 'cause of death']
mainSubject = mainSubject[mainSubject['propertyLabel.value'] != 'medical condition treated']

In [13]:
mainSubject[['itemLabel.value','propertyLabel.value']].sample(5)

Unnamed: 0,itemLabel.value,propertyLabel.value
859,The Cholera Epidemics in Hamburg and What to L...,main subject
450,Molecular Modeling Evaluation of the Binding A...,main subject
662,Coronavirus Disease 2019 (COVID-19): A critica...,main subject
466,Vitamin C Infusion for the Treatment of Severe...,main subject
962,2020 coronavirus pandemic in Benin,has cause


In [14]:
#All truthy statements with 2019–20 COVID-19 pandemic (Q81068910) as value.
#https://w.wiki/Kvd (Thanks User:Dipsacus_fullonum)

sparql.setQuery("""
SELECT ?item ?itemLabel ?property ?propertyLabel WHERE {
  ?item ?claim wd:Q81068910. #2019–20 COVID-19 pandemic
  ?property wikibase:directClaim ?claim.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

Q81068910 = pd.json_normalize(results['results']['bindings'])

In [15]:
Q81068910['propertyLabel.value'].value_counts()

part of                          427
main subject                     167
facet of                          31
category combines topics          20
has cause                         10
significant event                  5
has contributing factor            3
has effect                         3
field of work                      2
category contains                  2
different from                     2
notable works                      1
cause of death                     1
Wikimedia portal's main topic      1
interested in                      1
template's main topic              1
said to be the same as             1
category's main topic              1
Name: propertyLabel.value, dtype: int64

In [16]:
Q81068910[Q81068910['propertyLabel.value'] == 'part of'][['itemLabel.value','propertyLabel.value']].head(5)

Unnamed: 0,itemLabel.value,propertyLabel.value
1,2019–20 coronavirus pandemic in mainland China,part of
2,2019–20 coronavirus outbreak in Japan,part of
3,2019–20 COVID-19 outbreak in South Korea,part of
4,2019–20 coronavirus outbreak in Vietnam,part of
5,2019–20 coronavirus outbreak in Singapore,part of


In [18]:
# Remove weak connections such as 'field of work' or 'interested in'
Q81068910Strong = Q81068910[Q81068910['propertyLabel.value'] != 'field of work']
Q81068910Strong = Q81068910Strong[Q81068910Strong['propertyLabel.value'] != 'interested in']
Q81068910Strong = Q81068910Strong[Q81068910Strong['propertyLabel.value'] != 'notable works']
Q81068910Strong = Q81068910Strong[Q81068910Strong['propertyLabel.value'] != 'Wikimedia portal\'s main topic']
Q81068910Strong = Q81068910Strong[Q81068910Strong['propertyLabel.value'] != 'category combines topics']


In [19]:
# Extract the Q-ids from the entity URIs
mainSubjectQ = [ link.split('/')[-1] for link in mainSubject['item.value'].tolist()]
allStatementsQ = [ link.split('/')[-1] for link in allStatements['item.value'].tolist()]
Q81068910StrongQ = [ link.split('/')[-1] for link in Q81068910Strong['item.value'].tolist()]
# Merge the sets and add the seed items.
# Note: Q81068910StrongQ is computed above but is not part of this union.
strongQs = set(mainSubjectQ).union(allStatementsQ).union({'Q84263196','Q83741704'})
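
Taking the last path segment works when every URI points to an item. A slightly more defensive sketch that also checks the result looks like a Q-id (the helper `qid_from_uri` and the sample URIs are illustrative assumptions):

```python
import re

# Item ids are 'Q' followed by a positive integer
QID_PATTERN = re.compile(r'^Q[1-9][0-9]*$')


def qid_from_uri(uri):
    """Return the trailing Q-id of an entity URI, or None if it is not an item id."""
    candidate = uri.rstrip('/').split('/')[-1]
    return candidate if QID_PATTERN.match(candidate) else None


uris = [
    'http://www.wikidata.org/entity/Q83873593',
    'http://www.wikidata.org/entity/Q84263196',
    'http://www.wikidata.org/entity/P31',  # a property, not an item
]
qids = {q for q in map(qid_from_uri, uris) if q is not None}
print(sorted(qids))
```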

In [20]:
import pickle
with open('strongQsCovid-19_20200325.pickle','wb') as f:
    pickle.dump(strongQs,f)

In [22]:
# Fetch the sitelinks of every item in the strong set
import requests
sitelinks_base = 'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=sitelinks&ids=' 
sitelinks = []

for Q in strongQs:
    url = sitelinks_base + Q
    sitelinks.append(requests.get(url=url).json())

pagesPerProject = {}
for s in sitelinks:
    if 'entities' in s:
        for k,v in s['entities'].items():
            if 'sitelinks' in v:
                for wiki,data in v['sitelinks'].items():
                    page = data['title']
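                    # Note: this mapping is only right for Wikipedia sitelinks ('enwiki' -> 'en.wikipedia').
                    # Sister projects such as 'enwikiquote' become 'enquote.wikipedia', which is not a real host.
                    project ='%s.wikipedia' % wiki.replace('wiki','')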
                    pagesPerProject[project] = pagesPerProject.get(project,[])
                    pagesPerProject[project].append(page)
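
The loop above sends one request per item. The `wbgetentities` module accepts several ids separated by `|` (up to 50 per request for ordinary clients), so the ids could be sent in batches. A minimal sketch that only builds the URLs and makes no network calls; the helper names `chunked` and `build_urls` and the placeholder ids are assumptions for illustration:

```python
def chunked(items, size=50):
    """Yield consecutive batches of at most `size` items, in sorted order."""
    items = sorted(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_urls(qids, base, size=50):
    """Build one wbgetentities URL per batch of ids."""
    return [base + '|'.join(batch) for batch in chunked(qids, size)]


sitelinks_base = ('https://www.wikidata.org/w/api.php'
                  '?action=wbgetentities&format=json&props=sitelinks&ids=')

# Placeholder ids standing in for strongQs
example_ids = {'Q%d' % n for n in range(1, 121)}
urls = build_urls(example_ids, sitelinks_base)
print(len(urls))  # 120 ids in batches of 50 -> 3 requests
```

The response parsing in the cell above should not need to change, since it already iterates over every entry under `entities`.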

In [23]:
with open('pagesPerProjectStronglyRelated20200325.wikitext','w') as f:
    for project, pages in pagesPerProject.items():
        projectcode = project.split('.')[0]
        f.write('\n== %s == \n \n' % project )
        for page in pages:
            f.write('* [[%s:%s|%s]]\n' % (projectcode,page,page)) 

In [24]:
import pickle
with open('pagesPerProjectStronglyRelated20200325.pickle','wb') as f:
    pickle.dump(pagesPerProject,f)

In [25]:
import pickle 
from random import randint
with open('pagesPerProjectStronglyRelated20200325.pickle','rb') as f:
    pagesPerProject = pickle.load(f)

### Total number of edits to strongly-related COVID articles globally

In [27]:
## Function to count revisions

import mwapi

def countRevisions(page_name,project,date):
    """
    Count the revisions of a page made on or after a given timestamp.

    page_name: str, article title, e.g. 'COVID-19'
    project: str, project id, e.g. 'es.wikipedia'
    date: str, timestamp to start counting from, e.g. '2020-01-01T00:00:00Z'
    """
    counter = 0
    session = mwapi.Session("https://%s.org" % project, user_agent="dsaez@wikimedia.org - COVID-19 research")
    # continuation=True makes session.get yield one response document per batch
    for response_doc in session.get(action='query', prop='revisions', titles=page_name, 
                                    rvprop=['ids', 'timestamp'], rvlimit=100, rvdir="newer", 
                                    formatversion=2, rvstart=date, continuation=True):
        # A page without a 'revisions' key raises KeyError here
        counter += len(response_doc['query']['pages'][0]['revisions'])
    return counter
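
The counting step can be looked at separately from the API call. The sketch below counts revisions in a list of hand-written response documents shaped like `formatversion=2` output, and treats a page without a `revisions` key (for example a missing page) as zero revisions instead of raising. The function name `count_revisions_in_docs` and the sample documents are illustrative assumptions:

```python
def count_revisions_in_docs(response_docs):
    """Sum the revisions found across a sequence of API response documents."""
    total = 0
    for doc in response_docs:
        for page in doc.get('query', {}).get('pages', []):
            # Pages with no revisions in range may lack the key entirely
            total += len(page.get('revisions', []))
    return total


docs = [
    {'query': {'pages': [{'title': 'COVID-19',
                          'revisions': [{'revid': 1}, {'revid': 2}]}]}},
    {'query': {'pages': [{'title': 'COVID-19',
                          'revisions': [{'revid': 3}]}]}},
    {'query': {'pages': [{'title': 'No such page', 'missing': True}]}},
]

print(count_revisions_in_docs(docs))  # 3
```

Whether a page with zero edits should appear in the results with a count of 0 or be skipped is a design choice; it affects the article total but not the edit total.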

In [28]:
results = []
startDate = '2020-01-01T00:00:00Z'
for project, pages in pagesPerProject.items():
    print(project)
    for page in pages:
        try:
            c = countRevisions(page,project,startDate)
            results.append([project,page,c])
        except Exception:
            # Skip pages that fail (e.g. unknown host or no revisions); totals are a lower bound
            #print('error in %s %s' % (page,project))
            pass

hequote.wikipedia
itvoyage.wikipedia
ta.wikipedia
bn.wikipedia
cdo.wikipedia
uknews.wikipedia
species.wikipedia
als.wikipedia
nlnews.wikipedia
frp.wikipedia
jv.wikipedia
az.wikipedia
as.wikipedia
bg.wikipedia
fa.wikipedia
id.wikipedia
bat_smg.wikipedia
ti.wikipedia
tk.wikipedia
ja.wikipedia
yo.wikipedia
zh_classical.wikipedia
tg.wikipedia
simple.wikipedia
si.wikipedia
hi.wikipedia
mn.wikipedia
ug.wikipedia
eo.wikipedia
te.wikipedia
my.wikipedia
bh.wikipedia
en.wikipedia
mg.wikipedia
frnews.wikipedia
hesource.wikipedia
sq.wikipedia
vec.wikipedia
de.wikipedia
ang.wikipedia
mwl.wikipedia
af.wikipedia
su.wikipedia
ensource.wikipedia
itquote.wikipedia
bs.wikipedia
ga.wikipedia
ukquote.wikipedia
esvoyage.wikipedia
mk.wikipedia
sl.wikipedia
hu.wikipedia
sc.wikipedia
azb.wikipedia
ro.wikipedia
nrm.wikipedia
crh.wikipedia
hyw.wikipedia
th.wikipedia
enversity.wikipedia
et.wikipedia
finews.wikipedia
ban.wikipedia
am.wikipedia
ml.wikipedia
cv.wikipedia
commons.wikipedia
enquote.wikipedia
konews.wi
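
Several of the project names printed above, such as `hequote.wikipedia` or `commons.wikipedia`, do not correspond to real hosts, because stripping `wiki` from a sitelink key only works for Wikipedia editions. A hedged sketch of a mapping from sitelink keys to hostnames follows; the function name `sitelink_to_host` and both lookup tables are assumptions for illustration and are not exhaustive:

```python
# Keys that do not follow the <language><family> pattern (not exhaustive)
SPECIAL_HOSTS = {
    'commonswiki': 'commons.wikimedia.org',
    'specieswiki': 'species.wikimedia.org',
    'metawiki': 'meta.wikimedia.org',
    'wikidatawiki': 'www.wikidata.org',
}

# Longer suffixes are listed before the bare 'wiki' so they are tried first
SUFFIX_TO_FAMILY = [
    ('wikiversity', 'wikiversity'),
    ('wikivoyage', 'wikivoyage'),
    ('wikisource', 'wikisource'),
    ('wikiquote', 'wikiquote'),
    ('wikibooks', 'wikibooks'),
    ('wikinews', 'wikinews'),
    ('wiktionary', 'wiktionary'),
    ('wiki', 'wikipedia'),
]


def sitelink_to_host(key):
    """Map a sitelink key such as 'enwikiquote' to a hostname, or None if unknown."""
    if key in SPECIAL_HOSTS:
        return SPECIAL_HOSTS[key]
    for suffix, family in SUFFIX_TO_FAMILY:
        if key.endswith(suffix):
            # Sitelink keys use underscores where hostnames use hyphens
            lang = key[:-len(suffix)].replace('_', '-')
            return '%s.%s.org' % (lang, family)
    return None


for key in ['enwiki', 'hewikiquote', 'zh_classicalwiki', 'commonswiki']:
    print(key, '->', sitelink_to_host(key))
```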

In [29]:
import pandas as pd
df = pd.DataFrame(results)
df.rename(columns={0:'project',1:'article',2:'edits'},inplace=True)

### Results

#### Statistics from Jan 1st to March 30th, 2020


In [30]:
print('Total number of projects %s' % len(df.project.unique()))
print('Total number of Articles %s' % df.shape[0])
print('Total number of edits %s' % df.edits.sum())
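# Days covered: 31 (Jan) + 28 (Feb) + 19 (Mar) = 78. Note that 2020 was a leap year (29 days in February).
days = 31+28+19
avgPerDay = round(df.edits.sum()/days)
print('Avg Edits per Day %s' % avgPerDay )
avgPerHour = round(df.edits.sum()/(24*days))
print('Avg Edits per hour %s' % avgPerHour )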

Total number of projects 146
Total number of Articles 2558
Total number of edits 299342
Avg Edits per Day 3838.0
Avg Edits per hour 160.0
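
The divisor `31+28+19` hard-codes the number of days. A sketch of deriving the day count from calendar dates with the standard library, which also accounts for the leap day in February 2020. Both end dates below are assumptions chosen for illustration (one matches the 19 days of March in the divisor, the other the March 30th mentioned in the heading), and the helper names are hypothetical:

```python
from datetime import date


def days_inclusive(start, end):
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1


def average_per_day(total, start, end):
    """Average of `total` over the inclusive date range."""
    return total / days_inclusive(start, end)


start = date(2020, 1, 1)
for end in (date(2020, 3, 19), date(2020, 3, 30)):
    print(end.isoformat(), days_inclusive(start, end))
```

The same day count multiplied by 24 gives the divisor for the hourly average.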


In [None]:
#there are some errors/warnings. Don't worry, we are getting a lower-bound
results = []
startDate = '2020-03-01T00:00:00Z'
for project, pages in pagesPerProject.items():
    print(project)
    for page in pages:
        try:
            c = countRevisions(page,project,startDate)
            results.append([project,page,c])
        except Exception:
            # Skip pages that fail (e.g. unknown host or no revisions); totals are a lower bound
            #print('error in %s %s' % (page,project))
            pass
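
Since failed pages are skipped silently, it is hard to tell how far below the true value the lower bound is. A sketch of the same loop that records failures next to the results; the counting function is passed in as a parameter so the example runs without network access, and `count_all` and `fake_count` are hypothetical names:

```python
def count_all(pages_per_project, count_fn, start_date):
    """Return (results, failures) for every page in every project."""
    results, failures = [], []
    for project, pages in pages_per_project.items():
        for page in pages:
            try:
                results.append([project, page, count_fn(page, project, start_date)])
            except Exception as exc:
                # Keep the error type so failures can be summarised later
                failures.append((project, page, type(exc).__name__))
    return results, failures


def fake_count(page, project, start_date):
    """Stand-in for countRevisions: fails for one invalid project name."""
    if project == 'enquote.wikipedia':
        raise ConnectionError('unknown host')
    return 3


counted, failed = count_all(
    {'en.wikipedia': ['COVID-19'], 'enquote.wikipedia': ['COVID-19']},
    fake_count,
    '2020-03-01T00:00:00Z',
)
print(counted)
print(failed)
```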

In [31]:
import wmfdata as wmf
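# Rows are grouped by source title only, so the language pair shown for each title
# is one arbitrary example and frequency counts all language pairs for that title.
wmf.mariadb.run("""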
select distinct translation_source_language,translation_target_language, 
translation_source_title, count(*) as frequency from cx_translations 
where ( translation_source_title LIKE '%orona%' 
OR translation_source_title LIKE '%COVID%' ) AND translation_target_url is not NULL   
group by translation_source_title order by frequency desc
""", "wikishared")



Unnamed: 0,translation_source_language,translation_target_language,translation_source_title,frequency
0,en,am,Coronavirus,17
1,en,bg,Novel coronavirus (2019-nCoV),13
2,en,gu,Coronavirus disease 2019,10
3,en,fr,COVID-19 vaccine,10
4,en,ca,Betacoronavirus,6
...,...,...,...,...
444,en,es,2020 coronavirus pandemic in Japan,1
445,en,fa,2020 coronavirus outbreak in France,1
446,en,ar,2020 coronavirus pandemic in Nevada,1
447,en,ro,Murine coronavirus,1


In [32]:
wmf.mariadb.run("""
describe cx_translations
""", "wikishared")

Unnamed: 0,Field,Type,Null,Key,Default,Extra
0,translation_id,int(11),NO,PRI,,auto_increment
1,translation_source_title,varbinary(512),NO,MUL,,
2,translation_target_title,varbinary(512),NO,,,
3,translation_source_language,varbinary(36),NO,MUL,,
4,translation_target_language,varbinary(36),NO,,,
5,translation_source_url,blob,NO,,,
6,translation_target_url,blob,YES,,,
7,translation_status,"enum('draft','published','deleted')",YES,,,
8,translation_start_timestamp,varbinary(14),NO,,,
9,translation_last_updated_timestamp,varbinary(14),NO,,,


In [35]:
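# Note: there is no GROUP BY here, so count(*) runs over all published translations
# and a single row comes back; the title shown with it is just one arbitrary row.
wmf.mariadb.run("""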
select distinct translation_source_language,translation_target_language, 
translation_source_title, count(*) as frequency 
from cx_translations 
where translation_status='published'
order by frequency desc
limit 100""","wikishared")

Unnamed: 0,translation_source_language,translation_target_language,translation_source_title,frequency
0,es,pt,Palak paneer,560643
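
The single row above is what an aggregate query returns when it has no `GROUP BY`: one count over every matching row. A sketch of the grouped form of the query, run against an in-memory SQLite table as a stand-in for the MariaDB `cx_translations` table; the five sample rows are toy data for illustration only:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
CREATE TABLE cx_translations (
    translation_source_language TEXT,
    translation_target_language TEXT,
    translation_source_title TEXT,
    translation_status TEXT
)""")
conn.executemany(
    "INSERT INTO cx_translations VALUES (?, ?, ?, ?)",
    [
        ('en', 'fr', 'Coronavirus', 'published'),
        ('en', 'fr', 'Coronavirus', 'published'),
        ('en', 'es', 'Coronavirus', 'published'),
        ('es', 'pt', 'Palak paneer', 'published'),
        ('en', 'fr', 'Coronavirus', 'draft'),
    ],
)

# Without GROUP BY: one row holding the count of all published translations
total = conn.execute(
    "SELECT count(*) FROM cx_translations WHERE translation_status = 'published'"
).fetchall()

# With GROUP BY: one row per (source language, target language, title)
grouped = conn.execute("""
SELECT translation_source_language, translation_target_language,
       translation_source_title, count(*) AS frequency
FROM cx_translations
WHERE translation_status = 'published'
GROUP BY translation_source_language, translation_target_language,
         translation_source_title
ORDER BY frequency DESC, translation_source_title
LIMIT 100
""").fetchall()

print(total)
for row in grouped:
    print(row)
```

Grouping by all three selected columns makes each output row describe one language pair and title, so no column is left to an arbitrary choice by the database.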
