<a href="https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/Query_CTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WOS+SCI+SCP+PTJ+CTR queries for UdeA

Queries on the scientific articles of the UdeA across these bibliographic databases:
* Web of Science (WOS)
* Scielo (SCI)
* Scopus (SCP)
* Puntaje (UDEA)
* Center (CTR)

The database was created with:

[WOS_SCI_SCP_PTJ_CTR.ipynb](./WOS_SCI_SCP_PTJ_CTR.ipynb)

In [1]:
import os
VERSION='NEW'
if os.getcwd()=='/content':
    !pip install openpyxl xlrd wosplus fuzzywuzzy[speedup] > /dev/null

## functions

In [2]:
import pandas as pd
import wosplus as wp
pd.set_option('display.max_colwidth',200)
from venn import draw_venn, generate_colors
import numpy as np
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
idc='CÉDULA'

## Configure public links of files in Google Drive
* A Google Spreadsheet is downloaded as CSV
* An Excel, JSON, or text file is downloaded directly

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [3]:
%%writefile drive.cfg
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j

Overwriting drive.cfg
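
As a rough illustration of the file format, the `[FILES]` section maps a file label to a Google Drive file ID. The sketch below parses the same text with the standard-library `configparser`. It is only an assumption about how such a file can be read, not a description of what `wosplus` does internally; `CFG_TEXT` and `drive_ids` are names introduced here for illustration.

```python
import configparser

# Same content as the drive.cfg cell above, kept in memory for the example
CFG_TEXT = """[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
"""

parser = configparser.ConfigParser()
parser.optionxform = str          # keep the label's upper/lower case
parser.read_string(CFG_TEXT)

# label -> Google Drive file ID
drive_ids = dict(parser["FILES"])
print(drive_ids)
```

Setting `optionxform` matters here because `configparser` lowercases keys by default, which would change the file label.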


## Load databases

In [4]:
affil='Univ Antioquia'
drive_files=wp.wosplus('drive.cfg')

In [5]:
UDEAjsonfile='WOS_SCI_SCP_PTJ_CTR.json.gz'
tmp=drive_files.load_biblio(UDEAjsonfile,compression='gzip')
UDEA=drive_files.biblio['WOS'].copy().reset_index(drop=True)



In [6]:
#from check_quality import *
#check_quality(UDEA)

## Indices:
Information obtained from the column: `json_column='UDEA_authors'`

In [7]:
json_column='UDEA_authors'

This column contains lists of dictionaries with the UDEA author information:

`{'DEPARTAMENTO': 'Instituto de Biología',
  'FACULTAD': 'Facultad de Ciencias Exactas y Naturales',
  'GRUPO': 'Sin Grupo Asociado',
  'INICIALES': 'I.',
  'NOMBRE COMPLETO': 'Idalyd Fonseca Gonzalez',
  'NOMBRES': 'Idalyd',
  'PRIMER APELLIDO': 'Fonseca',
  'SEGUNDO APELLIDO': 'Gonzalez',
  'WOS_affiliation': ['Univ Antioquia, Colombia.'],
  'WOS_author': ['FONSECA, IDALYD',
   'FONSECA-GONZALEZ, IDALYD',
   'Fonseca-Gonzalez, Idalyd',
   'Fonseca-Gonzalez, I.'],
  'full_name': 'FONSECA GONZALEZ IDALYD'}`

Other columns: `['OA','Z9','SCP_Cited by']`, where `Z9` is the WOS "cited by" count.

See also [WOS field tags](https://images.webofknowledge.com/images/help/WOS/hs_wos_fieldtags.html)
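
To make the column layout concrete, here is a minimal pure-Python sketch of counting how many articles mention each value of a key in such a list-of-dictionaries column. The `count_key` helper and the `toy` records are hypothetical and only mimic the structure shown above; the notebook itself does this with pandas.

```python
from collections import Counter

def count_key(records, key):
    """Count each distinct value of `key` once per article."""
    counts = Counter()
    for authors in records:
        if not isinstance(authors, list):      # unidentified article, e.g. ''
            continue
        values = set()
        for d in authors:
            v = d.get(key)
            if not v:
                continue
            # a value may be a string or a list of strings
            values.update(v if isinstance(v, list) else [v])
        counts.update(values)
    return counts

# Toy data: two identified articles and one unidentified article
toy = [
    [{'FACULTAD': 'Facultad de Medicina'}, {'FACULTAD': 'Facultad de Medicina'}],
    [{'FACULTAD': 'Facultad de Artes'}, {'FACULTAD': 'Facultad de Medicina'}],
    '',
]
facultad_counts = count_key(toy, 'FACULTAD')
print(facultad_counts.most_common())
```

An article with two authors from the same faculty contributes only one count for that faculty.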

# Overall results

Unidentified articles:

In [159]:
UDEA_NOT=UDEA[UDEA[json_column]==''].reset_index(drop=True)
UDEA_NOT.shape[0]

3136

Identified articles:

In [160]:
UDEA_YES=UDEA[UDEA[json_column]!=''].reset_index(drop=True)
UDEA_YES.shape[0]

12564

### Analysis of identified articles

In [161]:
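def flatten_if_nested(l):
    # Flatten one level when any element of `l` is itself a list
    flatten=False
    for i in l:
        if isinstance(i,list):
            flatten=True
    if flatten:
        l=[item for sublist in l for item in sublist]
        l=np.array(l)  # `pd.np` is gone from recent pandas; use numpy directly
    return l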
def extract_key(df,key,json_column='UDEA_authors'):
    '''
    Extract all the unique key values of the list of dictionaries in 
    a json column when the key value is a string or another list
    '''
    ll=df[json_column].apply(lambda l: np.unique([ d.get(key) for d in l 
                                if d.get(key) ]) if type(l)==list else l)
    if ll.str[0].apply(lambda l: l if type(l)==list else None).dropna().shape[0]:
        ll=ll.apply(flatten_if_nested)
    ll=ll.apply(pd.Series).stack().values
    return pd.DataFrame( {key:list(ll)} ).groupby(key)[key].count().sort_values(ascending=False)

In [162]:
extract_key(UDEA_YES,'FACULTAD')

FACULTAD
Facultad de Medicina                        3385
Facultad de Ciencias Exactas y Naturales    2378
Facultad de Ingeniería                      1947
Facultad de Ciencias Agrarias                704
Facultad de Ciencias Sociales y Humanas      227
Facultad de Artes                             15
Name: FACULTAD, dtype: int64

In [163]:
extract_key(UDEA_YES,'DEPARTAMENTO')

DEPARTAMENTO
Departamento de Microbiología y Parasitología                   960
Instituto de Física                                             903
Instituto de Investigaciones Médicas                            783
Departamento de Medicina Interna                                703
Instituto de Química                                            693
Instituto de Biología                                           677
Departamento de  Producción Agropecuaria                        434
Departamento de Pediatría y Puericultura                        412
Departamento de Ingeniería Metalúrgica                          364
Departamento de Ingeniería Sanitaria  y Ambiental               357
Escuela de Medicina Veterinaria                                 325
Departamento de Ingeniería Mecánica                             297
Departamento de Ingeniería Quimica                              292
Departamento de Cirugía                                         228
Departamento de Fisiología         

In [164]:
extract_key(UDEA_YES,'GRUPO')

GRUPO
Sin Grupo Asociado                                                                                                                                                                                451
Grupo de Materia Condensada-UdeA                                                                                                                                                                  261
Inmunovirología                                                                                                                                                                                   251
Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección                                                                         244
Programa de Estudio y Control de Enfermedades Tropicales                                                                                                                                          239
Grup

In [165]:
extract_key(UDEA_YES,'full_name')

full_name
DUQUE ECHEVERRI CARLOS ALBERTO            261
VELEZ BERNAL IVAN DARIO                   128
BEDOYA BERRIO GABRIEL DE JESUS            127
CERON MUÑOZ MARIO FERNANDO                126
LOPERA RESTREPO FRANCISCO JAVIER          125
JAIMES BARRAGAN FABIAN ALBERTO            120
RUGELES LOPEZ MARIA TERESA                118
PEÑUELA MESA GUSTAVO ANTONIO              117
CARMONA FONSECA JAIME DE JESUS            114
OLIVERA ANGEL MARTHA EUFEMIA              102
AMARILES MUÑOZ PEDRO JOSE                  98
RESTREPO BETANCUR LUIS FERNANDO            97
ROBLEDO RESTREPO SARA MARIA                95
CARDONA MAYA WALTER DARIO                  94
RIOS LUIS ALBERTO                          94
CARDONA ARIAS JAIBERTH ANTONIO             93
CADAVID JARAMILLO ANGELA PATRICIA          88
ARDILA MEDINA CARLOS MARTIN                87
BLAIR TRUJILLO SILVIA VICTORIA             86
MORALES ARAMBURO ALVARO LUIS               86
MONDRAGON PEREZ FANOR                      84
RESTREPO COSSIO ALBEIRO 

# Queries

In [166]:
def extract_key_unique(*args,**kwargs):
    keys=extract_key(*args,**kwargs).keys()
    return [ k for k in keys if k]

def get_groups(l,g):
    for d in l:
        gt=d.get('GRUPO')
        if gt and type( gt )==str:
            gs=gt.replace(
                ', Grupo','; Grupo'
            ).split('; ')
            for gg in gs:
                if gg not in g:
                    g.append(gg)
    return g

facultades={'key':'FACULTAD',
            'values' : extract_key_unique(UDEA,'FACULTAD',json_column='UDEA_authors') }
departamentos={'key':'DEPARTAMENTO',
            'values' :extract_key_unique(UDEA,'DEPARTAMENTO',json_column='UDEA_authors')}
nombre_completo={'key'    : 'NOMBRE COMPLETO',
            'values' : extract_key_unique(UDEA,'NOMBRE COMPLETO',json_column='UDEA_authors')}
full_name={'key'    : 'full_name',
            'values' : extract_key_unique(UDEA,'full_name',json_column='UDEA_authors')}
udea_affiliations={'key'    : 'WOS_affiliation',
            'values' : extract_key_unique(UDEA,'WOS_affiliation',json_column='UDEA_authors')}
# dictionaries in 'authors_WOS' store the affiliation under the key 'affiliation'
wos_affiliations={'key'    : 'affiliation',
            'values' : extract_key_unique(UDEA,'affiliation',json_column='authors_WOS')}
udea_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='UDEA_authors')}
wos_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='authors_WOS')}


#.apply(....) is a loop!
g=[]
#append to g
tmp=UDEA.UDEA_authors.apply(lambda l: 
                        get_groups(l,g)
        if type(l)==list else None
                        )
grupos={'key':'GRUPO',
            'values' :g}
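
The splitting rule inside `get_groups` can be checked in isolation. The sketch below copies the same `replace`/`split` steps into a hypothetical helper, `split_groups`. The combined string `raw` is made up for illustration from group names that appear in this notebook.

```python
def split_groups(raw):
    """Split a GRUPO string into group names, as get_groups does."""
    # ', Grupo' marks the start of a new group, so turn it into the '; ' separator
    return raw.replace(', Grupo', '; Grupo').split('; ')

# Hypothetical multi-group value
raw = 'Grupo de Estado Sólido, Grupo de Materia Condensada-UdeA; Inmunovirología'
parts = split_groups(raw)
print(parts)

# Commas that are not followed by 'Grupo' stay inside the name
single = split_groups('Grupo Reproducción, Inmunovirología, Infección y Cáncer')
print(single)
```

Only a comma directly followed by ` Grupo` is treated as a separator, so names that contain ordinary commas are kept whole.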


## Query function

Searches the string or list values inside each dictionary of a list-of-dictionaries column, such as the `UDEA_authors` column of the `UDEA` DataFrame.

In [167]:
def query_json_column(q,df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=0):
    # Find the best fuzzy match for the query among the indexed values
    fchoices=fwp.extractOne(q,choices['values'],scorer=scorer,score_cutoff=score_cutoff)
    # Exact substring search in the indexed subcolumn converted to strings (e.g. list → string if necessary)
    if fchoices:
        fchoices=fchoices[0]
        dfF=df[df[json_column].apply(lambda l: True in [ str(d.get(choices['key'])).find(fchoices)>-1 
                                        for d in l if d.get(choices['key'])] if type(l)==list else False)]
        return dfF.reset_index(drop=True)
    else:
        return pd.DataFrame()
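
The two-step idea (fuzzy-match the query against an index of known values, then run an exact substring search with the matched value) can be sketched without pandas or fuzzywuzzy. The version below swaps in the standard-library `difflib.get_close_matches` as the fuzzy step, so its scores are not comparable to the `fuzz` scorers used above. `query_records` and the toy records are hypothetical.

```python
import difflib

def query_records(q, records, key, choices, cutoff=0.6):
    """Return the records whose `key` value contains the best fuzzy match of `q`."""
    # Step 1: best fuzzy match of the query among the known values
    best = difflib.get_close_matches(q, choices, n=1, cutoff=cutoff)
    if not best:
        return []
    target = best[0]
    # Step 2: exact substring search inside each list of dictionaries
    return [r for r in records
            if isinstance(r, list)
            and any(target in str(d.get(key)) for d in r if d.get(key))]

# Toy records with the same shape as the UDEA_authors column
records = [
    [{'NOMBRE COMPLETO': 'Idalyd Fonseca Gonzalez'}],
    [{'NOMBRE COMPLETO': 'Another Author'}],
    '',
]
choices = ['Idalyd Fonseca Gonzalez', 'Another Author']

hits = query_records('Idalyd Fonseca Gonzales', records, 'NOMBRE COMPLETO', choices)
misses = query_records('zzzz', records, 'NOMBRE COMPLETO', choices)
print(len(hits), len(misses))
```

Because the second step is a substring test, a short matched value also selects every record whose value merely contains it.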

### Author

In [168]:
r=query_json_column('Diego Alejandro Restrepo Quintero',df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [169]:
r.shape

(38, 181)

In [170]:
#r[['TI','AU','authors_WOS',json_column]].reset_index(drop=True)[5:7]

## Groups

Example

In [171]:
r=query_json_column('Grupo de Fenomenología de Interacciones Fundamentales',df=UDEA,json_column='UDEA_authors',
                        choices=grupos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [172]:
r.shape

(92, 181)

Search all groups

In [177]:
# DataFrame.append was removed in pandas 2.0: collect the rows first
rows=[]
for g in grupos['values']:
    r=query_json_column(g,df=UDEA,json_column='UDEA_authors',choices=grupos,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    rows.append( {'Group':g,'articles':r.shape[0]} )
gdf=pd.DataFrame(rows)
gdf['articles']=gdf['articles'].astype(int)

In [178]:
gdf.sort_values('articles',ascending=False).reset_index(drop=True)[:10]

Unnamed: 0,Group,articles
0,Sin Grupo Asociado,451
1,Inmunovirología,304
2,"Grupo Reproducción, Inmunovirología, Infección y Cáncer",304
3,Grupo de Materia Condensada-UdeA,302
4,Grupo de Estado Sólido,292
5,"Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección,",265
6,"Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección",265
7,Grupo Académico de Epidemiología Clínica,263
8,"Grupo Académico de Epidemiología Clínica, Nacer, Salud Sexual y Reproductiva",263
9,Grupo de Neurociencias de Antioquia,254


## Department

In [179]:
r=query_json_column('Instituto de Física',df=UDEA,json_column='UDEA_authors',
                        choices=departamentos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [180]:
r.shape

(903, 181)

## Center

Example

In [181]:
cen=query_json_column('Facultad de Ciencias Exactas y Naturales',df=UDEA,json_column='UDEA_authors',
                        choices=facultades,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [182]:
cen.shape

(2378, 181)

All

In [183]:
# DataFrame.append was removed in pandas 2.0: collect the rows first
rows=[]
for f in facultades['values']:
    r=query_json_column(f,df=UDEA,json_column='UDEA_authors',choices=facultades,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    rows.append( {'Facultad':f,'articles':r.shape[0]} )
fdf=pd.DataFrame(rows)
fdf['articles']=fdf['articles'].astype(int)

In [184]:
fdf.sort_values('articles',ascending=False)

Unnamed: 0,Facultad,articles
0,Facultad de Medicina,3385
1,Facultad de Ciencias Exactas y Naturales,2378
2,Facultad de Ingeniería,1947
3,Facultad de Ciencias Agrarias,704
4,Facultad de Ciencias Sociales y Humanas,227
5,Facultad de Artes,15


## Citations

In [30]:
UDEA_YES.sort_values('Z9',ascending=False)[['Z9','TI','SO','AU','PY']].reset_index(drop=True)[:10]

Unnamed: 0,Z9,TI,SO,AU,PY
0,3610,"An integrated map of genetic variation from 1,092 human genomes",NATURE,"Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLande...",2012
1,1526,Leishmaniasis Worldwide and Global Estimates of Its Incidence,PLOS ONE,"Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n",2012
2,1271,A global reference for human genetic variation,NATURE,"Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLande...",2015
3,901,Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study,LANCET ONCOLOGY,"de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nL...",2010
4,711,The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution,SCIENCE,"Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola,...",2009
5,601,GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n",1995
6,474,Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes,NATURE GENETICS,"Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinge...",2002
7,429,Leptogenesis,PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS,"Davidson, S\nNardi, E\nNir, Y\n",2008
8,410,Temperature sensitivity of drought-induced tree mortality portends increased regional die-off under global-change-type drought,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"Adams, HD\nGuardiola-Claramonte, M\nBarron-Gafford, GA\nVillegas, JC\nBreshears, DD\nZou, CB\nTroch, PA\nHuxman, TE\n",2009
9,376,Electron localization following attosecond molecular photoionization,NATURE,"Sansone, G\nKelkensberg, F\nPerez-Torres, JF\nMorales, F\nKling, MF\nSiu, W\nGhafur, O\nJohnsson, P\nSwoboda, M\nBenedetti, E\nFerrari, F\nLepine, F\nSanz-Vicario, JL\nZherebtsov, S\nZnakovskaya, ...",2010


In [31]:
UDEA_YES.Z9.sum()

75281

In [32]:
UDEA_YES.sort_values('SCP_Cited by',ascending=False)[[
    'SCP_Cited by','TI','SO','AU','PY']].reset_index(drop=True)[:10]

Unnamed: 0,SCP_Cited by,TI,SO,AU,PY
0,1586,Leishmaniasis Worldwide and Global Estimates of Its Incidence,PLOS ONE,"Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n",2012
1,1160,"Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): A randomised, placebo-controlled trial",The Lancet,"Olldashi F., Kerçi M., Zhurda T., Ruçi K., Banushi A., Traverso M.S., Jiménez J., Balbi J., Dellera C., Svampa S., Quintana G., Piñero G., Teves J., Seppelt I., Mountain D., Hunter J., Balogh Z., ...",2010
2,994,Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study,LANCET ONCOLOGY,"de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nL...",2010
3,626,The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution,SCIENCE,"Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola,...",2009
4,598,GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n",1995
5,485,Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes,NATURE GENETICS,"Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinge...",2002
6,439,The importance of early treatment with tranexamic acid in bleeding trauma patients: An exploratory analysis of the CRASH-2 randomised controlled trial,The Lancet,"Olldashi F., Kerçi M., Zhurda T., Ruçi K., Banushi A., Traverso M.S., Jiménez J., Balbi J., Dellera C., Svampa S., Quintana G., Piñero G., Teves J., Seppelt I., Mountain D., Balogh Z., Zaman M., D...",2011
7,432,Leptogenesis,PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS,"Davidson, S\nNardi, E\nNir, Y\n",2008
8,424,Temperature sensitivity of drought-induced tree mortality portends increased regional die-off under global-change-type drought,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"Adams, HD\nGuardiola-Claramonte, M\nBarron-Gafford, GA\nVillegas, JC\nBreshears, DD\nZou, CB\nTroch, PA\nHuxman, TE\n",2009
9,405,THE STRUCTURE OF THE PRESENILIN-1 (S182) GENE AND IDENTIFICATION OF 6 NOVEL MUTATIONS IN EARLY-ONSET AD FAMILIES,NATURE GENETICS,"CLARK, RF\nHUTTON, M\nFULDNER, RA\nFROELICH, S\nKARRAN, E\nTALBOT, C\nCROOK, R\nLENDON, C\nPRIHAR, G\nHE, C\nKORENBLAT, K\nMARTINEZ, A\nWRAGG, M\nBUSFIELD, F\nBEHRENS, MI\nMYERS, A\nNORTON, J\nMOR...",1995


In [33]:
UDEA_YES['SCP_Cited by'].sum()

78299

# Full-name lookup function using the WOS authors and the institutional metadata

In [34]:
aun=extract_key(UDEA_NOT,'WOS_author',json_column='authors_WOS')
aun[27:28]

WOS_author
Cerón-Muñoz, M. F.    6
Name: WOS_author, dtype: int64

In [35]:
aun=aun.keys()

In [36]:
posib=extract_key(UDEA_YES,'WOS_author',json_column='authors_WOS').keys()

### Good matches: i=2,3,4,6
### Bad matches: i=1,5

In [37]:
i=27
n=aun[i]
n

'Cerón-Muñoz, M. F.'

In [38]:
# if nold:
qq=query_json_column(n,df=UDEA_NOT,json_column='authors_WOS',
                        choices=wos_author,scorer=fuzz.ratio,score_cutoff=100)

In [39]:
qq.index

RangeIndex(start=0, stop=6, step=1)

In [40]:
for i in qq.index:
    print( [ d for d in qq.loc[i,'authors_WOS'] if n in d.get('WOS_author')] )

[{'i': 0, 'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia'], 'WOS_author': 'Cerón-Muñoz, M. F.'}]
[{'i': 0, 'affiliation': ['Grupo de Investigación en Genética, Mejoramiento y Modelación-GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia'], 'WOS_author': 'Cerón-Muñoz, M. F.'}]
[{'i': 0, 'affiliation': ['Grupo de investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 N°52-21, Medellín, Colombia'], 'WOS_author': 'Cerón-Muñoz, M. F.'}]
[{'i': 2, 'affiliation': ['Grupo de Investigación en Genética, Mejoramiento y Modelación Animal-GaMMA, Univ Antioquia, Calle 70 N° 52-21, Medellín, Colombia'], 'WOS_author': 'Cerón-Muñoz, M. F.'}]
[{'i': 0, 'affiliation': ['Grupo de investigación GaMMA, Univ Antioquia, Carrera 75 No. 65-87, Bloque, Medellín, Colombia'], 'WOS_author': 'Cerón-Muñoz, M. F.'}]
[{'i': 3, 'affiliation': ['Grupo de Investigación 

In [41]:
qq.SO.unique()

array(['Livestock Research for Rural Development'], dtype=object)

In [42]:
qq.shape

(6, 181)

In [43]:
extract_key(qq[:1],'WOS_author',json_column='authors_WOS')

WOS_author
Ramírez-Arias, J. P.    1
Cerón-Muñoz, M. F.      1
Name: WOS_author, dtype: int64

In [44]:
qq.loc[0,'authors_WOS']

[{'WOS_author': 'Cerón-Muñoz, M. F.',
  'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia'],
  'i': 0},
 {'WOS_author': 'Ramírez-Arias, J. P.',
  'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia'],
  'i': 1}]

In [45]:
qq.loc[0,'SO']

'Livestock Research for Rural Development'

## Include the journal name (SO)

In [46]:
aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json').reset_index(drop=True)

In [47]:
def build_institutional_authors(x,author_df,x_author_key='WOS_author',x_affiliation_key='affiliation',
                                        author_key='WOS_author',
                                        affiliation_key='WOS_affiliation'):
    if type(x)!=list:
        return None
    ll=[]
    for j in range(len(x)):
        
                                #author_WOS→affiliation always have single affiliation
        kk=find_author_affiliation(x[j].get(x_author_key),x[j].get(x_affiliation_key)[0],
                                        author_df=author_df,
                                        author_key=author_key,
                                        affiliation_key=affiliation_key,
                                        ratio=0.9 )
        if kk:
            ll.append(kk)
    if not ll:
        ll=None
    return ll

In [48]:
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
#UDEA_NOT=UDEA[UDEA['UDEA_authors'].isna()].reset_index(drop=True)
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author']#.astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation']#.astype(str)

# ==============

In [49]:
dfnot=qq#UDEA_NOT.copy()
dfnot=dfnot.reset_index(drop=True)

In [50]:
l=dfnot['authors_WOS'].loc[0]
so=dfnot['SO'].loc[0]

In [92]:
TEST=True
if TEST:
    l=[{'WOS_author': 'Ponce, W. A.',
        'affiliation': 
        ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
        'i': 0}]  
    so='Zeitschrift für Physik C Particles and Fields'

In [93]:
l

[{'WOS_author': 'Ponce, W. A.',
  'affiliation': ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
  'i': 0}]

In [94]:
so

'Zeitschrift für Physik C Particles and Fields'

In [95]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#def json_fuzzy_merge(l,UDEA,contents,right_target='UDEA_authors',
                       #left_on='WOS_author',extra_left_on='affiliation',
                       #right_on='WOS_author',extra_right_on='WOS_affiliation',
                       #cutoff=95,cutoff_extra=65,scorer=fuzz.partial_ratio):
if True:                
    right_target='UDEA_authors'
    left_on='WOS_author'
    extra_left_on='affiliation'
    right_on='WOS_author' 
    extra_right_on='WOS_affiliation'
    extra_extra_right_on='full_name'
    SO='SO'
    cutoff=92
    cutoff_extra=70
    scorer=fuzz.token_set_ratio
    DEBUG=False
    newl=[]
    for d in l:
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True
        dfraf=pd.DataFrame()        
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        break
        # extract best WOS author match
        #r=fwp.extractOne(au,contents[right_on],scorer=scorer)
        #if r[1]>=cutoff:
        #    raf=fwp.extractOne( aff, contents.loc[r[2],extra_right_on],scorer=scorer )
            #print(r[1],r[2],raf[1],aff,',',contents.loc[r[2],extra_right_on])
            #if raf[1]>=cutoff_extra:
            #    newl=newl+[  contents.loc[r[2],right_target]  ]
            #else:
                #check SO
                #newl=newl+[  contents.loc[r[2],right_target]  ]
        #break
    #if newl:
    #    return newl
    #else:
    #    return None

In [96]:
au

'Ponce, W. A.'

In [97]:
aff

'International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'

In [98]:
if True:
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
            if DEBUG: print(2,AUTHOR)            

In [99]:
rau

('Ponce, William A.', 88)

In [100]:
if True:
        #Try match author with less quality: Q
        #else:
        if rau[1]<cutoff:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(2.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True            

In [101]:
rau

('Ponce, William A.', 100)

In [102]:
if True:
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(3,raf)
            if raf and raf[1]>=cutoff_extra:
                AFFILIATION=True

In [103]:
raf

('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)

In [104]:
if True:
            #else:
            if raf[1]<cutoff_extra:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if raf and raf[1]>=cutoff_extra:
                    AFFILIATION=True

In [105]:
raf

('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
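
The gap between the two affiliation scores above has a simple reading: the WOS string lists two institutions, so a whole-string comparison is diluted, while a token-level comparison still sees the shared words. The sketch below shows this with standard-library tools only. `whole_ratio` and `shared_tokens` are hypothetical stand-ins and do not reproduce the fuzzywuzzy scorers or their scores.

```python
import difflib
import re

def whole_ratio(a, b):
    """Similarity of the full strings on a 0-100 scale (difflib, not fuzzywuzzy)."""
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

def shared_tokens(a, b):
    """Set of lowercase word tokens present in both strings."""
    tokens = lambda s: set(re.findall(r'\w+', s.lower()))
    return tokens(a) & tokens(b)

aff = ('International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, '
       'Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia')
ref = 'Univ Antioquia, Inst Fis, Medellin, Colombia.'

whole = whole_ratio(aff, ref)
common = shared_tokens(aff, ref)
print(whole, sorted(common))
```

The whole-string score stays below the `cutoff_extra=70` threshold used above, even though four of the six reference tokens appear in the WOS affiliation.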

# ================

Journal:

In [106]:
full_name=dfraf['UDEA_authors'].loc[0].get('full_name')

In [107]:
full_name

'PONCE GUTIERREZ WILLIAM ANTONIO'

In [108]:
if True:
        if Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty:
                full_name=dfraf[right_target].loc[0].get(
                        extra_extra_right_on)
                if full_name:
                    kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                    rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                    if not rso:
                        JOURNAL=False
                    elif rso[1]<cutoff_so:
                        JOURNAL=False

In [109]:
so

'Zeitschrift für Physik C Particles and Fields'

In [110]:
kkk.SO[:2]

0                             REVISTA MEXICANA DE FISICA
1    ACTA PHYSICA HUNGARICA NEW SERIES-HEAVY ION PHYSICS
Name: SO, dtype: object

In [111]:
rso

('PARTICLES AND FIELDS, PROCEEDINGS', 77, 2)

In [112]:
if True:            
            #else:
            if dfraf.empty:
                JOURNAL=False

In [113]:
JOURNAL

True

In [127]:
if True:
        if AUTHOR and AFFILIATION and JOURNAL:
            mthchedd=dfraf.loc[0,right_target]
            mthchedd['from_author_WOS_WOS_author']=au
            newl=newl+[  mthchedd  ]
            print('{} → {}'.format(au,newl[0][extra_extra_right_on]) ) 

Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO


In [123]:
mthchedd

Test the full function below.

# ================

In [128]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#93,70
from IPython.display import clear_output
def json_fuzzy_merge(l,so,UDEA,contents,right_target='UDEA_authors',
                       left_on='WOS_author',extra_left_on='affiliation',
                       right_on='WOS_author',extra_right_on='WOS_affiliation',
                       extra_extra_right_on='full_name',
                       cutoff=93,cutoff_extra=70,scorer=fuzz.token_set_ratio,
                       DEBUG=False):
    newl=[]
    for d in l:
        clear_output(wait=True)
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True

        dfraf=pd.DataFrame()
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
        #Try match author with less quality: Q
        else:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(1.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True
        if DEBUG: print(1.2,'AUTHOR:',AUTHOR)                            
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(2,raf)
            if raf and raf[1]>=cutoff_extra:
                AFFILIATION=True
            else:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if DEBUG: print(2.1,raf)
                if raf and raf[1]>=cutoff_extra:
                    AFFILIATION=True

        if DEBUG: print(2.2,'AFFILIATION:',AFFILIATION,'Q:',Q)                
        if Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty:
                full_name=dfraf[right_target].loc[0].get(
                        extra_extra_right_on)
                if full_name:
                    kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                    rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                    if not rso:
                        JOURNAL=False
                    elif rso[1]<cutoff_so:
                        JOURNAL=False
            else:
                JOURNAL=False
        if DEBUG: print(3,'JOURNAL',JOURNAL)                
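        if AUTHOR and AFFILIATION and JOURNAL:
            # copy, so the dictionary stored in `contents` is not modified in place
            mthchedd=dict(dfraf.loc[0,right_target])
            mthchedd['from_author_WOS_WOS_author']=au
            newl=newl+[  mthchedd  ]
            # report the current match, not the first one in the list
            print('{} → {}'.format(au,mthchedd[extra_extra_right_on]) ) 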
    if newl:
        return newl
    else:
        return None

In [129]:
dfraf=json_fuzzy_merge(l,so,UDEA,contents,DEBUG=True)

1 ('Ponce, William A.', 88)
1.1 ('Ponce, William A.', 100)
1.2 AUTHOR: True
2 ('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)
2.1 ('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
2.2 AFFILIATION: True Q: 0.8
3 JOURNAL True
Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO
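
The control flow of `json_fuzzy_merge` is a cascade: try a strict scorer, fall back to a more lenient one at the cost of lowering the quality flag `Q` by 0.1, and then pick the journal cutoff from `Q`. The sketch below isolates that logic with standard-library stand-ins. `strict_score`, `lenient_score`, `cascade` and `journal_cutoff` are hypothetical names, and the scorers are not the fuzzywuzzy ones, so the numeric scores differ from the debug output above.

```python
import difflib
import re

def strict_score(a, b):
    """Whole-string similarity on a 0-100 scale."""
    return round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio())

def lenient_score(a, b):
    """Percentage of tokens of `a` that are a prefix of some token of `b`."""
    ta = re.findall(r'\w+', a.lower())
    tb = re.findall(r'\w+', b.lower())
    if not ta:
        return 0
    hits = sum(any(y.startswith(x) for y in tb) for x in ta)
    return round(100 * hits / len(ta))

def best(query, choices, scorer):
    """Return (choice, score) for the highest-scoring choice."""
    return max(((c, scorer(query, c)) for c in choices), key=lambda t: t[1])

def cascade(query, choices, cutoff=93):
    """Strict match first; a lenient match costs 0.1 of quality."""
    q = 1.0
    name, score = best(query, choices, strict_score)
    if score >= cutoff:
        return name, q
    name, score = best(query, choices, lenient_score)
    if score >= cutoff:
        return name, q - 0.1
    return None, q

def journal_cutoff(q):
    """Journal threshold used in the notebook: 50 if Q<1, 60 if Q<0.9."""
    if q >= 1:
        return None               # journal is not checked for full-quality matches
    return 60 if q < 0.9 else 50

name, q = cascade('Ponce, W. A.', ['Ponce, William A.', 'Ramírez-Arias, J. P.'])
print(name, q, journal_cutoff(q))
```

In the debug run above both the author and the affiliation needed the lenient scorer, which is why `Q` ends at 0.8 and the stricter journal cutoff of 60 applies.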


In [147]:
%time kk=UDEA_NOT['authors_WOS'].combine(UDEA_NOT['SO'],func=lambda l,so: json_fuzzy_merge(l,so,UDEA,contents) if type(l)==list else None)

CPU times: user 1h 6min 46s, sys: 5.08 s, total: 1h 6min 51s
Wall time: 1h 6min 51s


In [151]:
kk.dropna().shape

(883,)

In [148]:
qq=UDEA_NOT.reset_index(drop=True)
# Assign by position: kk keeps the original UDEA_NOT index, while qq has a fresh one
qq['UDEA_authors']=kk.values

In [153]:
#pp=qq[qq['authors_WOS'].astype(str).str.contains('Ponce, W. A.')].reset_index(drop=True)#[['authors_WOS','UDEA_authors']]

In [154]:
#ppp=pp['authors_WOS'].combine(pp['SO'],func=lambda l,so: json_fuzzy_merge(l,so,UDEA,contents) 
#                              if type(l)==list else None)

In [None]:
#qq[['authors_WOS','UDEA_authors']]

In [155]:
UDEA_NOT.shape[0]+UDEA_YES.shape[0]

15700

In [156]:
qq['UDEA_authors'].dropna().shape

(883,)

In [157]:
qq=qq.fillna('')

In [158]:
UDEA_NOT=qq.reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
UDEA=pd.concat([UDEA_YES,UDEA_NOT],sort=False).reset_index(drop=True)

### Quality check

In [193]:
chk=pd.DataFrame( list(kk.dropna().str[0].values) )[['from_author_WOS_WOS_author','WOS_author','full_name']]

In [194]:
import unidecode

In [295]:
# WOS author string: lowercase, drop dots and commas, hyphens to spaces, strip accents
chk['simple_wos']=chk['from_author_WOS_WOS_author'].str.lower().str.replace(
    r'[.,]','',regex=True).str.replace('-',' ',regex=False).apply(unidecode.unidecode)
# Both surnames plus the initials of the given names
chk['short_name']=chk['full_name'].str.lower().str.replace(r'[.,\-]','',regex=True).str.replace(
            r'^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$',r'\1\2',regex=True).str.replace(
         r'^(\w+\s+\w+\s+\w)\w+$',r'\1',regex=True).apply(unidecode.unidecode)
# First surname plus first given name
chk['simple_name']=chk['full_name'].str.lower().str.replace(r'[.,\-]','',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+)\s+\w+$',r'\1\2',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+)$',r'\1\2',regex=True).apply(unidecode.unidecode)
# First surname plus both given names (only four-word names are changed)
chk['last_name']=chk['full_name'].str.lower().str.replace(r'[.,\-]','',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+\s+\w+)$',r'\1\2',regex=True)
# two surnames and one given name (sort)
# one surname and two given names (sort)
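
The regular expressions in the cell above can be easier to follow on a single name. The following is a minimal standard-library sketch, not the notebook's code: `strip_accents`, `short_name` and `simple_name` are hypothetical helpers, and `strip_accents` only approximates `unidecode.unidecode` for Latin accents.

```python
import re
import unicodedata

def strip_accents(text):
    """Rough stand-in for unidecode: drop combining accent marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

def short_name(full_name):
    """Keep both surnames and reduce the given names to initials."""
    s = re.sub(r'[.,\-]', '', full_name.lower())
    # Four words: surname surname given given -> surname surname g g
    s = re.sub(r'^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$', r'\1\2', s)
    # Three words: surname surname given -> surname surname g
    s = re.sub(r'^(\w+\s+\w+\s+\w)\w+$', r'\1', s)
    return strip_accents(s)

def simple_name(full_name):
    """Keep the first surname and the first given name."""
    s = re.sub(r'[.,\-]', '', full_name.lower())
    # Four words: drop the second surname and the second given name
    s = re.sub(r'^(\w+\s+)\w+\s+(\w+)\s+\w+$', r'\1\2', s)
    # Three words: drop the second surname
    s = re.sub(r'^(\w+\s+)\w+\s+(\w+)$', r'\1\2', s)
    return strip_accents(s)
```

Names with more than four words match none of the patterns and are only lowercased, so their short and simple forms stay identical to the full name.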

In [296]:
scorer=fuzz.token_set_ratio
# Normalized full name, computed once and reused for s1 and s1b
full_norm=chk['full_name'].str.lower().str.replace(
            r'[.,]','',regex=True).apply(unidecode.unidecode)
chk['s1']=chk['simple_wos'].combine(full_norm,
           func=fuzz.token_sort_ratio)
chk['s1b']=chk['simple_wos'].combine(full_norm,
           func=fuzz.partial_token_sort_ratio)
chk['s2']=chk['simple_wos'].combine(chk['short_name'],
           func=scorer)
chk['s3']=chk['simple_wos'].combine(chk['simple_name'],
           func=fuzz.ratio)
chk['s4']=chk['simple_wos'].combine(chk['last_name'],
           func=fuzz.token_sort_ratio)

In [297]:
# Best and worst score per row across the five comparisons
chk['max']=chk[['s1','s1b','s2','s3','s4']].max(axis=1)
chk['min']=chk[['s1','s1b','s2','s3','s4']].min(axis=1)

In [298]:
chk[['simple_wos','full_name','s1','s1b','short_name','s2','simple_name','s3','last_name','s4',
     'min','max']]

| | simple_wos | full_name | s1 | s1b | short_name | s2 | simple_name | s3 | last_name | s4 | min | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | restrepo sanchez nora e | RESTREPO COSSIO ALBEIRO ALONSO | 49 | 52 | restrepo cossio a a | 64 | restrepo albeiro | 62 | restrepo albeiro alonso | 57 | 49 | 64 |
| 1 | restrepo sanchez nora e | RESTREPO COSSIO ALBEIRO ALONSO | 49 | 52 | restrepo cossio a a | 64 | restrepo albeiro | 62 | restrepo albeiro alonso | 57 | 49 | 64 |
| 2 | restrepo sanchez nora e | RESTREPO COSSIO ALBEIRO ALONSO | 49 | 52 | restrepo cossio a a | 64 | restrepo albeiro | 62 | restrepo albeiro alonso | 57 | 49 | 64 |
| 3 | fredy ochoa gomez john | OCHOA GOMEZ JOHN FREDY | 100 | 100 | ochoa gomez j f | 85 | ochoa john | 62 | ochoa john fredy | 84 | 62 | 100 |
| 4 | arango arteaga myrtha | ARANGO ARTEAGA MYRTHA | 100 | 100 | arango arteaga m | 93 | arango myrtha | 76 | arango arteaga myrtha | 100 | 76 | 100 |
| 5 | callejas ricardo | CALLEJAS POSADA RICARDO DE LA MERCED | 62 | 62 | callejas posada ricardo de la merced | 100 | callejas posada ricardo de la merced | 62 | callejas posada ricardo de la merced | 62 | 62 | 100 |
| 6 | giraldo urrego laura maria | URREGO GIRALDO GERMAN ARTURO | 56 | 58 | urrego giraldo g a | 88 | urrego german | 51 | urrego german arturo | 57 | 51 | 88 |
| 7 | lemos j d | LEMOS DUQUE JUAN DIEGO | 58 | 67 | lemos duque j d | 100 | lemos juan | 74 | lemos juan diego | 72 | 58 | 100 |
| 8 | pavon j j | PAVON PALACIO JUAN JOSE | 56 | 67 | pavon palacio j j | 100 | pavon juan | 74 | pavon juan jose | 75 | 56 | 100 |
| 9 | lemos j d | LEMOS DUQUE JUAN DIEGO | 58 | 67 | lemos duque j d | 100 | lemos juan | 74 | lemos juan diego | 72 | 58 | 100 |
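
One possible way to use the `max` column is to flag rows whose best score stays below a cutoff, since those pairs look like different people. The sketch below uses only the distinct rows of the table above. `MAX_CUTOFF` and its value of 90 are assumptions for illustration, not a threshold taken from the notebook.

```python
# (simple_wos, full_name, max) copied from the distinct rows of the table above
rows = [
    ('restrepo sanchez nora e', 'RESTREPO COSSIO ALBEIRO ALONSO', 64),
    ('fredy ochoa gomez john', 'OCHOA GOMEZ JOHN FREDY', 100),
    ('arango arteaga myrtha', 'ARANGO ARTEAGA MYRTHA', 100),
    ('callejas ricardo', 'CALLEJAS POSADA RICARDO DE LA MERCED', 100),
    ('giraldo urrego laura maria', 'URREGO GIRALDO GERMAN ARTURO', 88),
    ('lemos j d', 'LEMOS DUQUE JUAN DIEGO', 100),
    ('pavon j j', 'PAVON PALACIO JUAN JOSE', 100),
]

MAX_CUTOFF = 90  # hypothetical threshold, chosen only for this example

# Keep the pairs whose best score across all scorers is still below the cutoff
suspicious = [(wos, full) for wos, full, best in rows if best < MAX_CUTOFF]

for wos, full in suspicious:
    print('{} → {}'.format(wos, full))
```

With this cutoff the two flagged pairs are the ones where only the surnames coincide. A pair such as `callejas ricardo` passes because one scorer reaches 100 even though the others stay at 62.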


In [229]:
scorer( 'lopez jaramillo c', 'lopez jaramillo c a' )

100
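
The score of 100 follows from how a token-set comparison works: when every token of one string also appears in the other, the shared tokens are compared with themselves. The sketch below illustrates that idea with the standard library only. `token_set_ratio_sketch` is a hypothetical helper that uses `difflib` as a stand-in for the string ratio, so its scores for partial overlaps may differ from those of `fuzz.token_set_ratio`.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

def token_set_ratio_sketch(s1, s2):
    """Compare the shared tokens against each string's sorted token set."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    common = ' '.join(sorted(tokens1 & tokens2))
    only1 = ' '.join(sorted(tokens1 - tokens2))
    only2 = ' '.join(sorted(tokens2 - tokens1))
    combined1 = (common + ' ' + only1).strip()
    combined2 = (common + ' ' + only2).strip()
    # If one token set contains the other, common equals one combined string
    return max(ratio(common, combined1),
               ratio(common, combined2),
               ratio(combined1, combined2))
```

Under this sketch, `'lopez jaramillo c'` against `'lopez jaramillo c a'` scores 100, and so does `'callejas ricardo'` against `'callejas posada ricardo de la merced'`. This is why a high `s2` alone does not confirm that two names refer to the same person.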