# Data analysis of the INSPIRE database
[INSPIRE HEP](http://inspirehep.net/) is the information system for research articles in the area of High Energy Physics. As almost all the articles in this area (around a million) are deposited in [arXiv.org](http://arxiv.org), the INSPIRE information system has wide coverage, with complete, well-normalized metadata and author disambiguation. Moreover, the system provides a [full API](http://inspirehep.net/info/hep/api) for automated searches that return machine-readable responses.

> A periodically updated snapshot of HEP record metadata in json format is also available at [http://inspirehep.net/hep_records.json.gz](http://inspirehep.net/hep_records.json.gz) (400 MB) with checksum.

Also available in [MARCXML](http://inspirehep.net/dumps/inspire-dump.html).

### Load web version
If `DUMP=True` below, the latest version is downloaded and loaded (__ <font color=red>WARNING: 1.5GB decompressed</font>,  10 GB of RAM recommended__):

In [26]:
DUMP=False
if DUMP:
    import gzip
    import pandas as pd
    import requests
    url = "http://inspirehep.net/hep_records.json.gz"
    df_full=pd.read_json(  gzip.decompress( requests.get(url).content  
                                      ).decode('utf8'), lines=True)
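Since the full dump needs a lot of RAM when loaded in one go, one option is to read a decompressed JSON-lines file in chunks and keep only the rows that are needed. The sketch below is a minimal illustration on an in-memory stand-in for the file (the three sample records are made up); `chunksize` is a standard `pandas.read_json` parameter for `lines=True` input.

```python
import io
import pandas as pd

# Tiny stand-in for hep_records.json: one JSON record per line (made-up data)
sample = io.StringIO(
    '{"recid": 1, "references": [10, 11]}\n'
    '{"recid": 2, "references": []}\n'
    '{"recid": 3, "references": [10]}\n'
)

# Read two records at a time and keep only the records that have references
chunks = []
for chunk in pd.read_json(sample, lines=True, chunksize=2):
    chunks.append(chunk[chunk.references.map(len) > 0])

df_refs = pd.concat(chunks, ignore_index=True)
print(df_refs)
```

With a real file, `sample` would be replaced by the path of the decompressed dump, and the chunk size would be tuned to the available memory.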

### Load local version
* If `FULL=True` below, a local version (March 22, 2018) is loaded (__ <font color=red>WARNING: 1.5GB decompressed</font>,  10 GB of RAM recommended__)
* If `FULL=False` below, a small sample with 1000 entries is loaded.

In [27]:
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth',200)

FULL=False
if FULL:
    # also at https://drive.google.com/file/d/1NtocdsxbTky5uLmKp3eDoGcQG7qsTEYv/view?usp=sharing
    df_all=pd.read_json('hep_records.json',lines=True)
else:
    url='http://fisica.udea.edu.co/downloads/nodes_small.json'
    df_all=pd.read_json(url)
    # df_all=pd.read_json( gzip.decompress( requests.get(url+'.gz').content  ) )

## Keep only entries with references

In [30]:
dfr=df_all[df_all.references.map(len)>0].reset_index(drop=True)

In [31]:
dfr.sample(2)

Unnamed: 0,abstract,authors,citations,co-authors,creation_date,free_keywords,list_rcid,recid,references,standardized_keywords,title
156,"This paper is devoted to a study of possible scaling laws, and their logarithmic corrections, occurring in deep inelastic electropion production. Both the exclusive and semiexclusive processes are...","[Calogeracos, A.]","[891896, 675459]","[Dombey, Norman, West, G.B.]",1995,[],"[405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851, 405851]",405851,"[98402, 73764, 52422, 1393354, 69100, 84526, 81012, 458293, 85303, 60858, 325723, 75069, 75194]","[electron p: deep inelastic scattering, deep inelastic scattering: electron p, pi: electroproduction, electroproduction: pi, pi: photoproduction, photoproduction: pi, current algebra, PCAC model, ...",PCAC and the possibility of scaling in electropion production
324,"When Beppo-SAX measured the 0.1--12 keV spectrum of RE J1034+396, observations in the optical, UV and EUV were also taken within a few weeks. This multiwavelength spectrum placed very strong const...","[Puchnarewicz, E.M.]","[615721, 929571]","[Soria, R.]",2002-02,[],"[596210, 596210, 596210]",596210,"[576561, 548722, 436660]",[],"Do nls1s and ultrasoft agn have irradiated, warped accretion disks?"


In [32]:
dfr=dfr.reset_index(drop=True)

In [33]:
dfr.shape

(1000, 11)

### Make a query

In [34]:
q=input('Search for author, e.g: restrepo, d:  ')

# `full_authors` is only created in the next section, so build the combined author list here
all_authors=(dfr['authors']+dfr['co-authors']).str.join('\n')
# use the typed query (case-insensitive, literal match) instead of a hard-coded name
rq=dfr[all_authors.str.lower().str.contains(q.lower(),regex=False)].reset_index(drop=True)
print('{} records found'.format(rq.shape[0])  )
rq

###  Convert to WOS-like format

In [35]:
dfr['full_authors']=dfr['authors']+dfr['co-authors']

In [36]:
dfrw=dfr.copy()
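# NOTE: '.$' is a regular expression, so it replaces the LAST character of each string with '\n'
# (hence the truncated names in the output below); use `+'\n'` instead to append without truncating
dfrw.full_authors=dfrw.full_authors.str.join('\n').str.replace('.$','\n',regex=True)
#convert integer list to string list: https://stackoverflow.com/a/3590175/2268280
dfrw['refs']=dfrw.references.map(lambda x: '\n'.join( map( str,x )   )).str.replace('.$','\n',regex=True)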

In [37]:
dfrw[['title','full_authors','refs']][:3]#.to_excel('inspire.xlsx',index=False)

Unnamed: 0,title,full_authors,refs
0,Canonical Methods in Nonabelian Gauge Theories,"Brandt, Richard A.\nNg, Wing-Chiu\nYoung, Kennet\n",87072\n1283\n86276\n1577\n95084\n54444\n99663\n58673\n67637\n93143\n7922\n
1,Scale Breaking and Anomalies in Deep Inelastic Lepton Scattering,"Shaw, Gordon L.\nFaroughy, Dara\nThomas, Pau\n",1953\n3746\n2179\n3393\n3304\n113705\n60999\n108589\n60846\n2255\n108040\n99698\n85251\n108861\n108241\n3067\n109116\n2717\n10044\n
2,EFFECTIVE QCD LAGRANGIAN AT FINITE TEMPERATURE,"Dittrich, Walter\nSchanbacher, Volke\n",112293\n119942\n145900\n130542\n7860\n6862\n140214\n11196\n
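Note that in the output above the last character of each author list and each reference list is missing (e.g. `Kennet`, `Volke`). This is what a `'.$'` pattern does when it is interpreted as a regular expression. The following minimal sketch, using two author names from the table above, contrasts that replacement with plain string concatenation; it is only an illustration, not a change to the notebook's cells.

```python
import pandas as pd

authors = pd.Series([['Brandt, Richard A.', 'Ng, Wing-Chiu'], ['Dittrich, Walter']])

# As a regular expression, '.$' matches the last character, which is then replaced by '\n'
replaced = authors.str.join('\n').str.replace('.$', '\n', regex=True)

# Plain concatenation appends the newline and keeps every character
appended = authors.str.join('\n') + '\n'

print(repr(replaced[0]))
print(repr(appended[0]))
```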


## Build the network of references


In [42]:
def check_intersect1d(x,y):
    return np.intersect1d(x,y)

def check_not_ref(x,ref):
    return x not in ref


def check_connections(dfi, Nmin=5, DEBUG=False, external_ed=None):
    """
    Build the edges DataFrame: a pair of records (source, target) of dfi is
    kept when the two records share at least Nmin references.
    Assumes that dfi has the default index 0, 1, ..., n-1.
    """
    edges = []  # partial edge DataFrames, concatenated at the end

    if DEBUG:
        # reset the log file
        with open('kk.log', 'w') as f:
            f.write('')

    for i in dfi.index:
        # TODO: check if the edge already exists in some external `external_ed` DataFrame
        if DEBUG:
            # log the row being processed
            with open('kk.log', 'a') as f:
                f.write('{}\n'.format(i))
        print(i, end='\r')
        # Drop the already analysed records (rows 0..i)
        df_not_i = dfi.drop(dfi.index[range(0, i+1)])
        # Prepare the edges DataFrame
        df_not_i = df_not_i.rename(columns={'recid': 'target'})
        # WARNING: limit the number of columns to free memory
        df_not_i = df_not_i[['target', 'references']].copy()
        # Intersect the references of record i with those of every remaining record
        refs_i = dfi.loc[i, 'references']
        df_not_i['intersect'] = df_not_i['references'].map(lambda x: check_intersect1d(x, refs_i))
        # IMPORTANT PART: restrict the number of edges
        df_not_i = df_not_i[df_not_i['intersect'].map(len) >= Nmin].copy()
        # Add the source column
        df_not_i['source'] = dfi.loc[i, 'recid']
        # Keep only the edge columns (this drops the references column)
        if len(df_not_i) > 0:
            edges.append(df_not_i[['source', 'target', 'intersect']])

    if not edges:
        return pd.DataFrame(columns=['source', 'target', 'intersect'])
    return pd.concat(edges).reset_index(drop=True)
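The core test in `check_connections` is whether two records share at least `Nmin` references. A minimal illustration of that single step, with two made-up reference lists:

```python
import numpy as np

Nmin = 5
# Made-up reference lists of two records
refs_a = [1, 2, 3, 4, 5, 6]
refs_b = [2, 3, 4, 5, 6, 7]

# Sorted array with the references that appear in both lists
common = np.intersect1d(refs_a, refs_b)

# The pair becomes an edge only if enough references are shared
is_edge = len(common) >= Nmin
print(common, is_edge)
```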

In [None]:
import time
s=time.time()

dfi=dfr.sample(1000).sort_values('recid').reset_index(drop=True)

ed=check_connections(dfi,Nmin=5,DEBUG=True)

print(time.time()-s)

In [527]:
edf=ed.copy()

In [528]:
edf.shape

(3271, 3)

In [536]:
edf=edf[edf.intersect.map(len)>=5]
edf.shape

(124, 3)

## Find the colegios (groups of records linked by shared references)

In [537]:
colegios=pd.DataFrame()
j=0
col={}
col[j]=[]

In [538]:
for i in edf.source.unique():
    if i not in col[j]:  # source not yet in the current colegio
        cl=edf[edf.source==i]
        # TODO: loop over all the targets, not only the first one
        if cl.target.values[0] not in col[j]: # neither source nor target in the current colegio: new colegio!
            j=i
            col[j] =list( np.unique( np.concatenate ( ( cl.source.values,cl.target.values )  ) )  )
        else:
            col[j].append(i)
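The loop above only compares each source with the colegio currently being built. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to treat the edges as a graph and take its connected components with a small union-find. The first two edges use record ids from this dataset; the pair `(1, 2)` is a made-up extra edge.

```python
from collections import defaultdict

def connected_components(edges):
    """Group node ids into connected components with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Merge the components of the two ends of every edge
    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t

    # Collect the nodes under their final root
    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return sorted(sorted(g) for g in groups.values())

edges = [(1644265, 1662910), (1644265, 1653123), (1, 2)]
components = connected_components(edges)
print(components)
```

With a DataFrame like `edf`, the edge list could be obtained with `zip(edf.source, edf.target)`.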

In [539]:
cdf=pd.DataFrame( { 'source': list( col.keys()),'colegios':  list( col.values() )  } )

In [540]:
cdf=cdf[cdf.source>0].reset_index(drop=True)

In [541]:
cdff=cdf.merge(edf,on='source',how='left').fillna('')

In [545]:
colg=cdff[['source','target','colegios','intersect']].sort_values(
     ['source','target'],ascending=False).reset_index(drop=True)
colg[:1]

Unnamed: 0,source,target,colegios,intersect
0,1644265,1662910,"[1644265, 1653123, 1662910]","[593382, 659055, 760769, 779080, 796887, 919443, 1124337, 1124338, 1203133, 1293923, 1321709, 1468068, 1605397]"


In [555]:
colg['len_col']=colg.colegios.map(len)
colg.len_col.max()

12

In [561]:
colg[colg.len_col>5].drop_duplicates('source')#.shape

Unnamed: 0,source,target,colegios,intersect,len_col
18,1260794,1662910,"[1260794, 1260873, 1296453, 1644265, 1653123, 1662910]","[494953, 593382, 599622, 659055, 760769, 779080, 796887, 919443]",6
26,1230004,1596723,"[1230004, 1260794, 1260873, 1296453, 1596723, 1235726]","[581996, 589478, 712925, 877524, 912611]",6
48,823215,1662910,"[823215, 871874, 883930, 1230004, 1260873, 1296453, 1296927, 1434935, 1631169, 1644265, 1653123, 1662910]","[12291, 12292, 40440, 50073, 593382, 700668, 796887]",12


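## Better efficiency
The current approach does not scale well: `check_connections` compares every record with all the remaining ones, so the number of comparisons grows quadratically with the number of records.

We follow here the usual approach of keeping a node file and building an edges file from it.
The edges file will be built in [pajek.ipynb](./pajek.ipynb).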
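To get a feeling for the scaling problem, the number of record pairs that a pairwise comparison has to visit can be counted directly. The helper name `n_pairs` is introduced here only for illustration.

```python
def n_pairs(n):
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2

small = n_pairs(1000)       # the 1000-entry sample
full = n_pairs(1_000_000)   # around a million records, as in the full database
print(small, full)
```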

In [562]:
def ones(x,y):
    return np.ones(len(x)).astype(int)*y

## Pajek-like DataFrame

### Save node file
The JSON format keeps the Python object structure (e.g. the lists of references).

In [563]:
SAVE=False
if SAVE:
    df.to_json('nodes_small.json')

In [564]:
df['list_rcid']=df.references.combine( df.recid, func=ones)

In [565]:
pj=pd.DataFrame( { 'recid': np.concatenate (  tuple( df.list_rcid.values ) ),
                   'references': np.concatenate (  tuple( df.references.values ) )} )
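The two cells above turn each list of references into one row per (`recid`, reference) pair. As a hedged alternative sketch, `DataFrame.explode` produces the same long format in one step. The two records below are made up, and the integer cast assumes that no record has an empty reference list, since those would become `NaN`.

```python
import pandas as pd

# Made-up node table: one row per record, with its list of references
df = pd.DataFrame({'recid': [100, 200], 'references': [[1, 2], [2]]})

# One row per (recid, reference) pair; explode leaves an object column, so cast it back to int
pj = (df[['recid', 'references']]
        .explode('references')
        .astype({'references': 'int64'})
        .reset_index(drop=True))
print(pj)
```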

In [566]:
pj=pj.sort_values(['references','recid'])

In [567]:
pj.shape

(23960, 2)

Remove the references that are cited by a single record (they cannot link two records):

In [568]:
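# pj is sorted by `references`, so a reference cited more than once equals its next or previous neighbour
# difference with the next row (the last row is compared with the previous one)
pj['diff_refs_1']=np.concatenate (  (  pj.references.values[1:]-pj.references.values[:-1],
                            [   pj.references.values[-1]-pj.references.values[-2]   ] ) )
# difference with the previous row (the first row is compared with the next one)
pj['diff_refs_2']=np.concatenate (  (  [   pj.references.values[0]-pj.references.values[1]   ],
                                         (pj.references.values[:-1]-pj.references.values[1:])
                                   ) )
# the product is zero when the reference is shared with at least one neighbour
pj['diff_refs']=pj.diff_refs_1*pj.diff_refs_2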

In [569]:
pj=pj[pj.diff_refs==0].drop(['diff_refs_1','diff_refs_2','diff_refs'],axis='columns').reset_index(drop=True)

In [570]:
pj.shape

(4915, 2)
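The neighbour-difference trick keeps the rows whose reference appears more than once. A shorter way to express the same filter, given here as a sketch on a toy subset of values, is `DataFrame.duplicated` with `keep=False`, which flags every member of a repeated group.

```python
import pandas as pd

# Toy subset: reference 113 is cited twice, 250 and 931 only once
pj = pd.DataFrame({'recid': [15541, 143630, 398379, 231044],
                   'references': [113, 113, 250, 931]})

# keep=False marks all the rows of a repeated reference, not only the later ones
shared = pj[pj.duplicated('references', keep=False)].reset_index(drop=True)
print(shared)
```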

In [572]:
pj[:8]

Unnamed: 0,recid,references
0,15541,113
1,143630,113
2,385487,113
3,468709,113
4,1323465,113
5,398379,250
6,641951,250
7,231044,931


### Automate the process

In [586]:
%%writefile nodes_to_pajek.py
import numpy as np
import pandas as pd

def ones(x,y):
    return np.ones(len(x)).astype(int)*y

def nodes_to_pajek(df):
    df['list_rcid']=df.references.combine( df.recid, func=ones)
    pj=pd.DataFrame( { 'recid': np.concatenate (  tuple( df.list_rcid.values ) ),
                   'references': np.concatenate (  tuple( df.references.values ) )} )
    pj=pj.sort_values(['references','recid'])
    pj['diff_refs_1']=np.concatenate (  (  pj.references.values[1:]-pj.references.values[:-1],
                            [   pj.references.values[-1]-pj.references.values[-2]   ] ) )
    pj['diff_refs_2']=np.concatenate (  (  [   pj.references.values[0]-pj.references.values[1]   ],
                                         (pj.references.values[:-1]-pj.references.values[1:])
                                   ) )
    pj['diff_refs']=pj.diff_refs_1*pj.diff_refs_2
    
    pj=pj[pj.diff_refs==0].drop(['diff_refs_1','diff_refs_2','diff_refs'],
                        axis='columns').reset_index(drop=True)
    return  pj

Overwriting nodes_to_pajek.py


In [584]:
from nodes_to_pajek import *

In [None]:
pj=nodes_to_pajek(df)

In [578]:
pj.shape

(4915, 2)

### We can now build the edges file
This is done in [pajek.ipynb](./pajek.ipynb).
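The actual construction lives in pajek.ipynb and may differ from what follows. As a rough sketch of how edges can be derived from a table shaped like `pj`, a self-join on the shared reference gives all the pairs of records that cite the same paper. The rows are the first ones of the `pj[:8]` output above; the column names `recid_source`, `recid_target` and `n_common` are chosen here for illustration.

```python
import pandas as pd

# (recid, reference) rows taken from the pj[:8] output
pj = pd.DataFrame({
    'recid':      [15541, 143630, 385487, 398379, 641951],
    'references': [113, 113, 113, 250, 250],
})

# Self-join on the shared reference, then keep each unordered pair only once
pairs = pj.merge(pj, on='references', suffixes=('_source', '_target'))
pairs = pairs[pairs.recid_source < pairs.recid_target]

# Count how many references each pair of records has in common
edges = (pairs.groupby(['recid_source', 'recid_target'])
              .size()
              .reset_index(name='n_common'))
print(edges)
```

Note that the self-join grows with the square of the number of citations of each reference, so very highly cited papers may need to be handled separately on the full dataset.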

In [587]:
def nodes_to_pajek_tmp(df):
    df['list_rcid']=df.references.combine( df.recid, func=ones)
    pj=pd.DataFrame( { 'recid': np.concatenate (  tuple( df.list_rcid.values ) ),
                   'references': np.concatenate (  tuple( df.references.values ) )} )
    pj=pj.sort_values(['references','recid'])
    pj['diff_refs_1']=np.concatenate (  (  pj.references.values[1:]-pj.references.values[:-1],
                            [   pj.references.values[-1]-pj.references.values[-2]   ] ) )
    pj['diff_refs_2']=np.concatenate (  (  [   pj.references.values[0]-pj.references.values[1]   ],
                                         (pj.references.values[:-1]-pj.references.values[1:])
                                   ) )
    pj['diff_refs']=pj.diff_refs_1*pj.diff_refs_2
    
    pj=pj[pj.diff_refs==0].drop(['diff_refs_1','diff_refs_2','diff_refs'],
                        axis='columns').reset_index(drop=True)
    return  pj

In [595]:
pjt=nodes_to_pajek_tmp(dfr)

In [596]:
pjt.shape

(21733815, 2)

In [597]:
pjt[:3]

Unnamed: 0,recid,references
0,76215,1
1,81651,1
2,128250,1


In [None]:
pjt.to_json('')

# Appendix
To generate the slides, use:

In [30]:
%%bash
jupyter nbconvert ALL_inspire.ipynb --to slides --reveal-prefix "https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.1.0"

[NbConvertApp] Converting notebook ALL_inspire.ipynb to slides
[NbConvertApp] Writing 273514 bytes to ALL_inspire.slides.html
