<a href="https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/Query_CTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WOS+SCI+SCP+PTJ+CTR searches for UdeA

Searches for the scientific articles of UdeA in the following bibliographic databases:
* Web of Science (WOS)
* Scielo (SCI)
* Scopus (SCP)
* Puntaje (UDEA)
* Center (CTR)

The database was created with:

[WOS_SCI_SCP_PTJ_CTR.ipynb](./WOS_SCI_SCP_PTJ_CTR.ipynb)

In [1]:
import os
VERSION='NEW'
if os.getcwd()=='/content':
    !pip install openpyxl xlrd wosplus fuzzywuzzy[speedup] > /dev/null

## Functions

In [1733]:
import pandas as pd
import wosplus as wp
pd.set_option('display.max_colwidth',200)
from venn import draw_venn, generate_colors
import numpy as np
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
idc='CÉDULA'

## Configure public links of files in Google Drive
* If it is a Google Spreadsheet, the corresponding file is downloaded as CSV
* If it is an Excel, JSON, or text file, the file is downloaded directly

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [3]:
%%writefile drive.cfg
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j

Overwriting drive.cfg
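
As a rough illustration of how such a file can be read, the sketch below parses the `[FILES]` section with Python's standard `configparser` and builds a direct-download link for each ID. The helper name `drive_download_url` and the URL pattern are assumptions made for this example; they are not taken from `wosplus`, whose internals may differ.

```python
import configparser

CFG_TEXT = """
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
"""

def drive_download_url(file_id):
    # Hypothetical helper: a commonly used direct-download pattern for public Drive files
    return f"https://drive.google.com/uc?export=download&id={file_id}"

parser = configparser.ConfigParser()
parser.optionxform = str  # keep file names case-sensitive (keys are lowercased by default)
parser.read_string(CFG_TEXT)

# Map every labeled file name to its Drive ID
files = dict(parser["FILES"])
for name, file_id in files.items():
    print(name, "->", drive_download_url(file_id))
```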


## Load databases

In [4]:
affil='Univ Antioquia'
drive_files=wp.wosplus('drive.cfg')

In [5]:
UDEAjsonfile='WOS_SCI_SCP_PTJ_CTR.json.gz'
tmp=drive_files.load_biblio(UDEAjsonfile,compression='gzip')
UDEA=drive_files.biblio['WOS'].copy().reset_index(drop=True)



In [6]:
#from check_quality import *
#check_quality(UDEA)

## Indices:
Information obtained from the column `json_column='UDEA_authors'`:

In [1605]:
json_column='UDEA_authors'

It contains lists of dictionaries with the information of each UdeA author:

`{'DEPARTAMENTO': 'Instituto de Biología',
  'FACULTAD': 'Facultad de Ciencias Exactas y Naturales',
  'GRUPO': 'Sin Grupo Asociado',
  'INICIALES': 'I.',
  'NOMBRE COMPLETO': 'Idalyd Fonseca Gonzalez',
  'NOMBRES': 'Idalyd',
  'PRIMER APELLIDO': 'Fonseca',
  'SEGUNDO APELLIDO': 'Gonzalez',
  'WOS_affiliation': ['Univ Antioquia, Colombia.'],
  'WOS_author': ['FONSECA, IDALYD',
   'FONSECA-GONZALEZ, IDALYD',
   'Fonseca-Gonzalez, Idalyd',
   'Fonseca-Gonzalez, I.'],
  'full_name': 'FONSECA GONZALEZ IDALYD'}`

Other columns: `['OA','Z9','SCP_Cited by']`, where `Z9` is the WOS cited-by count.

See also [WOS field tags](https://images.webofknowledge.com/images/help/WOS/hs_wos_fieldtags.html)
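
To make the structure concrete, here is a small sketch that walks one such cell in plain Python. The record is a shortened copy of the example above, and the variable names are chosen only for illustration.

```python
# One cell of the 'UDEA_authors' column: a list with one dictionary per UdeA author
cell = [
    {
        'DEPARTAMENTO': 'Instituto de Biología',
        'FACULTAD': 'Facultad de Ciencias Exactas y Naturales',
        'GRUPO': 'Sin Grupo Asociado',
        'WOS_affiliation': ['Univ Antioquia, Colombia.'],
        'WOS_author': ['FONSECA, IDALYD', 'Fonseca-Gonzalez, I.'],
        'full_name': 'FONSECA GONZALEZ IDALYD',
    }
]

# String-valued key: one value per author
faculties = [d.get('FACULTAD') for d in cell if d.get('FACULTAD')]

# List-valued key: flatten the name variants of every author
name_variants = [name for d in cell for name in d.get('WOS_author', [])]

print(faculties)
print(name_variants)
```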

# Overall results

Unidentified articles:

In [1616]:
UDEA_NOT=UDEA[UDEA[json_column]==''].reset_index(drop=True)
UDEA_NOT.shape[0]

3442

Identified articles:

In [1617]:
UDEA_YES=UDEA[UDEA[json_column]!=''].reset_index(drop=True)
UDEA_YES.shape[0]

12258
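
The two filters split the database into complementary subsets: an empty string in the JSON column marks an article with no identified UdeA author. A minimal sketch on invented toy data, assuming the same empty-string convention:

```python
import pandas as pd

json_column = 'UDEA_authors'
# Toy data: an empty string marks an article with no identified UdeA author
toy = pd.DataFrame({
    'TI': ['Article A', 'Article B', 'Article C'],
    json_column: [[{'full_name': 'PEREZ ANA'}], '', [{'full_name': 'GOMEZ LUIS'}]],
})

identified = toy[json_column] != ''
toy_yes = toy[identified].reset_index(drop=True)
toy_not = toy[~identified].reset_index(drop=True)

# Every article falls in exactly one of the two subsets
assert len(toy_yes) + len(toy_not) == len(toy)
print(len(toy_yes), len(toy_not))
```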

### Analysis of identified articles

In [1618]:
def flatten_if_nested(l):
    flatten=False
    for i in l:
        if type(i)==list:
            #return i
            flatten=True
    if flatten:
        l=[item for sublist in l for item in sublist]
        l=np.array(l)  # pd.np was removed from recent pandas versions; use numpy directly
    return l
def extract_key(df,key,json_column='UDEA_authors'):
    '''
    Extract all the unique key values of the list of dictionaries in 
    a json column when the key value is a string or another list
    '''
    ll=df[json_column].apply(lambda l: np.unique([ d.get(key) for d in l 
                                if d.get(key) ]) if type(l)==list else l)
    if ll.str[0].apply(lambda l: l if type(l)==list else None).dropna().shape[0]:
        ll=ll.apply(flatten_if_nested)
    ll=ll.apply(pd.Series).stack().values
    return pd.DataFrame( {key:list(ll)} ).groupby(key)[key].count().sort_values(ascending=False)
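
A minimal sketch of the same counting idea on toy data, using `Series.explode` and `value_counts` instead of `apply(pd.Series).stack()`. This is an alternative formulation for illustration, not the notebook's implementation; `count_key` is a hypothetical name and the toy records are invented. For string-valued keys it counts each value once per article, as `extract_key` does.

```python
import pandas as pd

def count_key(df, key, json_column='UDEA_authors'):
    """Count in how many articles each value of `key` appears (hypothetical helper)."""
    def values_in_article(cell):
        if not isinstance(cell, list):
            return []
        found = set()
        for d in cell:
            v = d.get(key)
            if not v:
                continue
            # A value may be a string or a list of strings
            found.update(v if isinstance(v, list) else [v])
        return sorted(found)

    per_article = df[json_column].apply(values_in_article)
    # explode: one row per (article, value); empty lists become NaN and are dropped
    return per_article.explode().dropna().value_counts()

toy = pd.DataFrame({'UDEA_authors': [
    [{'FACULTAD': 'Facultad de Medicina'}, {'FACULTAD': 'Facultad de Medicina'}],
    [{'FACULTAD': 'Facultad de Artes'}, {'FACULTAD': 'Facultad de Medicina'}],
    '',
]})
counts = count_key(toy, 'FACULTAD')
print(counts)
```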

In [1619]:
extract_key(UDEA_YES,'FACULTAD')

FACULTAD
Facultad de Medicina                        3289
Facultad de Ciencias Exactas y Naturales    2322
Facultad de Ingeniería                      1891
Facultad de Ciencias Agrarias                675
Facultad de Ciencias Sociales y Humanas      222
Facultad de Artes                             15
Name: FACULTAD, dtype: int64

In [1620]:
extract_key(UDEA_YES,'DEPARTAMENTO')

DEPARTAMENTO
Departamento de Microbiología y Parasitología                   939
Instituto de Física                                             879
Instituto de Investigaciones Médicas                            772
Departamento de Medicina Interna                                671
Instituto de Biología                                           670
Instituto de Química                                            670
Departamento de  Producción Agropecuaria                        421
Departamento de Pediatría y Puericultura                        404
Departamento de Ingeniería Metalúrgica                          352
Departamento de Ingeniería Sanitaria  y Ambiental               347
Escuela de Medicina Veterinaria                                 309
Departamento de Ingeniería Mecánica                             289
Departamento de Ingeniería Quimica                              285
Departamento de Cirugía                                         222
Departamento de Fisiología         

In [1621]:
extract_key(UDEA_YES,'GRUPO')

GRUPO
Sin Grupo Asociado                                                                                                                                                                                442
Grupo de Materia Condensada-UdeA                                                                                                                                                                  261
Inmunovirología                                                                                                                                                                                   244
Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección                                                                         240
Programa de Estudio y Control de Enfermedades Tropicales                                                                                                                                          238
Grup

In [1622]:
extract_key(UDEA_YES,'full_name')

full_name
DUQUE ECHEVERRI CARLOS ALBERTO       261
VELEZ BERNAL IVAN DARIO              127
CERON MUÑOZ MARIO FERNANDO           126
BEDOYA BERRIO GABRIEL DE JESUS       125
LOPERA RESTREPO FRANCISCO JAVIER     122
RUGELES LOPEZ MARIA TERESA           119
JAIMES BARRAGAN FABIAN ALBERTO       116
CARMONA FONSECA JAIME DE JESUS       114
PEÑUELA MESA GUSTAVO ANTONIO         114
OLIVERA ANGEL MARTHA EUFEMIA         101
RESTREPO BETANCUR LUIS FERNANDO       97
AMARILES MUÑOZ PEDRO JOSE             96
ROBLEDO RESTREPO SARA MARIA           95
CARDONA MAYA WALTER DARIO             94
RIOS LUIS ALBERTO                     94
CARDONA ARIAS JAIBERTH ANTONIO        93
ARDILA MEDINA CARLOS MARTIN           86
BLAIR TRUJILLO SILVIA VICTORIA        85
MORALES ARAMBURO ALVARO LUIS          85
CADAVID JARAMILLO ANGELA PATRICIA     83
MONDRAGON PEREZ FANOR                 83
AGUDELO SUAREZ ANDRES ALONSO          77
CORNEJO OCHOA JOSE WILLIAM            77
TRIANA CHAVEZ OMAR                    76
LOPEZ 

# Searches

In [1623]:
def extract_key_unique(*args,**kwargs):
    keys=extract_key(*args,**kwargs).keys()
    return [ k for k in keys if k]

def get_groups(l,g):
    for d in l:
        gt=d.get('GRUPO')
        if gt and type( gt )==str:
            gs=gt.replace(
                ', Grupo','; Grupo'
            ).split('; ')
            for gg in gs:
                if gg not in g:
                    g.append(gg)
    return g

facultades={'key':'FACULTAD',
            'values' : extract_key_unique(UDEA,'FACULTAD',json_column='UDEA_authors') }
departamentos={'key':'DEPARTAMENTO',
            'values' :extract_key_unique(UDEA,'DEPARTAMENTO',json_column='UDEA_authors')}
nombre_completo={'key'    : 'NOMBRE COMPLETO',
            'values' : extract_key_unique(UDEA,'NOMBRE COMPLETO',json_column='UDEA_authors')}
full_name={'key'    : 'full_name',
            'values' : extract_key_unique(UDEA,'full_name',json_column='UDEA_authors')}
udea_affiliations={'key'    : 'WOS_affiliation',
            'values' : extract_key_unique(UDEA,'WOS_affiliation',json_column='UDEA_authors')}
wos_affiliations={'key'    : 'affiliation',
            'values' : extract_key_unique(UDEA,'affiliation',json_column='authors_WOS')}
udea_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='UDEA_authors')}
wos_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='authors_WOS')}


#.apply(....) is a loop!
g=[]
#append to g
tmp=UDEA.UDEA_authors.apply(lambda l: 
                        get_groups(l,g)
        if type(l)==list else None
                        )
grupos={'key':'GRUPO',
            'values' :g}
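
The splitting rule inside `get_groups` can be seen in isolation with a short sketch. The input strings below are invented for illustration; only the `replace`/`split` logic is taken from the function above, and `split_groups` is a hypothetical name.

```python
def split_groups(gt):
    """Split a GRUPO string into individual group names (same rule as get_groups)."""
    # ', Grupo' is treated as a separator, so it is first rewritten as '; Grupo'
    return gt.replace(', Grupo', '; Grupo').split('; ')

# Invented example: three groups with mixed separators
raw = 'Inmunovirología; Grupo de Estado Sólido, Grupo de Neurociencias de Antioquia'
parts = split_groups(raw)
print(parts)

# A comma that is not followed by ' Grupo' stays inside the name
print(split_groups('Grupo Académico de Epidemiología Clínica, Nacer'))
```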


## Search function

Searches the string or list value stored under a given key in each dictionary of a list of dictionaries, such as the column `UDEA_authors` of the `UDEA` DataFrame.

In [1624]:
def query_json_column(q,df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=0):
    # Find the best match from the index
    fchoices=fwp.extractOne(q,choices['values'],scorer=scorer,score_cutoff=score_cutoff)
    # Exact search in the indexed subcolumn converted to strings (e.g. list → string if necessary)
    if fchoices:
        fchoices=fchoices[0]
        dfF=df[df[json_column].apply(lambda l: True in [ str(d.get(choices['key'])).find(fchoices)>-1 
                                        for d in l if d.get(choices['key'])] if type(l)==list else False)]
        return dfF.reset_index(drop=True)
    else:
        return pd.DataFrame()
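
The two steps of `query_json_column` (pick the best-matching indexed value, then filter rows by exact substring) can be sketched without external dependencies. Here `difflib.get_close_matches` from the standard library stands in for `fuzzywuzzy`'s `extractOne`; its similarity measure differs from `fuzz.partial_token_sort_ratio`, so results on real data would not be identical. `toy_query` and the records are hypothetical.

```python
import difflib

records = [
    {'TI': 'Paper 1', 'UDEA_authors': [{'NOMBRE COMPLETO': 'Ana Maria Perez Gomez'}]},
    {'TI': 'Paper 2', 'UDEA_authors': [{'NOMBRE COMPLETO': 'Luis Gomez Rios'}]},
    {'TI': 'Paper 3', 'UDEA_authors': ''},
]
choices = {'key': 'NOMBRE COMPLETO',
           'values': ['Ana Maria Perez Gomez', 'Luis Gomez Rios']}

def toy_query(q, records, choices, json_column='UDEA_authors', cutoff=0.6):
    # Step 1: best fuzzy match of the query among the indexed values
    best = difflib.get_close_matches(q, choices['values'], n=1, cutoff=cutoff)
    if not best:
        return []
    # Step 2: exact substring search of that value in every dictionary
    return [r for r in records
            if isinstance(r[json_column], list)
            and any(best[0] in str(d.get(choices['key'], ''))
                    for d in r[json_column])]

# The query contains a typo ('Gomes'); the fuzzy step still resolves it
hits = toy_query('Ana Maria Perez Gomes', records, choices)
print([r['TI'] for r in hits])
```

Separating the fuzzy step from the exact step keeps the row filter cheap: the fuzzy scorer runs once over the index of unique values rather than over every article.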

### Author

In [1625]:
r=query_json_column('Diego Alejandro Restrepo Quintero',df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [1626]:
r.shape

(38, 182)

In [1627]:
#r[['TI','AU','authors_WOS',json_column]].reset_index(drop=True)[5:7]

## Groups

Example

In [1628]:
r=query_json_column('Grupo de Fenomenología de Interacciones Fundamentales',df=UDEA,json_column='UDEA_authors',
                        choices=grupos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [1629]:
r.shape

(83, 182)

Search all

In [1630]:
rows=[]
for g in grupos['values']:
    r=query_json_column(g,df=UDEA,json_column='UDEA_authors',choices=grupos,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    # DataFrame.append was removed in pandas 2.0: collect the rows and build the frame once
    rows.append( {'Group':g,'articles':r.shape[0]} )
gdf=pd.DataFrame(rows)
gdf['articles']=gdf['articles'].astype(int)

In [1631]:
gdf.sort_values('articles',ascending=False).reset_index(drop=True)[:10]

| | Group | articles |
|---|---|---|
| 0 | Sin Grupo Asociado | 442 |
| 1 | Grupo de Materia Condensada-UdeA | 301 |
| 2 | Grupo Reproducción, Inmunovirología, Infección y Cáncer | 297 |
| 3 | Inmunovirología | 297 |
| 4 | Grupo de Estado Sólido | 282 |
| 5 | Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección | 260 |
| 6 | Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección, | 260 |
| 7 | Grupo Académico de Epidemiología Clínica, Nacer, Salud Sexual y Reproductiva | 256 |
| 8 | Grupo Académico de Epidemiología Clínica | 256 |
| 9 | Grupo de Neurociencias de Antioquia | 249 |


## Department

In [1632]:
r=query_json_column('Instituto de Física',df=UDEA,json_column='UDEA_authors',
                        choices=departamentos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [1633]:
r.shape

(879, 182)

## Center

Example

In [1634]:
cen=query_json_column('Facultad de Ciencias Exactas y Naturales',df=UDEA,json_column='UDEA_authors',
                        choices=facultades,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [1635]:
cen.shape

(2322, 182)

All

In [1636]:
rows=[]
for f in facultades['values']:
    r=query_json_column(f,df=UDEA,json_column='UDEA_authors',choices=facultades,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    # DataFrame.append was removed in pandas 2.0: collect the rows and build the frame once
    rows.append( {'Facultad':f,'articles':r.shape[0]} )
fdf=pd.DataFrame(rows)
fdf['articles']=fdf['articles'].astype(int)

In [1637]:
fdf.sort_values('articles',ascending=False)

| | Facultad | articles |
|---|---|---|
| 0 | Facultad de Medicina | 3289 |
| 1 | Facultad de Ciencias Exactas y Naturales | 2322 |
| 2 | Facultad de Ingeniería | 1891 |
| 3 | Facultad de Ciencias Agrarias | 675 |
| 4 | Facultad de Ciencias Sociales y Humanas | 222 |
| 5 | Facultad de Artes | 15 |


## Citations

In [1638]:
UDEA_YES.sort_values('Z9',ascending=False)[['Z9','TI','SO','AU','PY']].reset_index(drop=True)[:10]

Unnamed: 0,Z9,TI,SO,AU,PY
0,3610,"An integrated map of genetic variation from 1,092 human genomes",NATURE,"Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLander, ES\nLee, C\nLehrach, H\nMardis, ER\nMarth, GT\nMcVean, GA\nNickerson, DA\nSchmidt, JP\nSherry, ST\nWang, J\nWilson, RK\nGibbs, RA\nDinh, H\nKovar, C\nLee, S\nLewis, L\nMuzny, D\nReid, J\nWang, M\nWang, J\nFang, XD\nGuo, XS\nJian, M\nJiang, H\nJin, X\nLi, GQ\nLi, JX\nLi, YR\nLi, Z\nLiu, X\nLu, Y\nMa, XD\nSu, Z\nTai, SS\nTang, MF\nWang, B\nWang, GB\nWu, HL\nWu, RH\nYin, Y\nZhang, WW\nZhao, J\nZhao, MR\nZheng, XL\nZhou, Y\nLander, ES\nAltshuler, DM\nGabriel, SB\nGupta, N\nFlicek, P\nClarke, L\nLeinonen, R\nSmith, RE\nZheng-Bradley, X\nBentley, DR\nGrocock, R\nHumphray, S\nJames, T\nKingsbury, Z\nLehrach, H\nSudbrak, R\nAlbrecht, MW\nAmstislavskiy, VS\nBorodina, TA\nLienhard, M\nMertes, F\nSultan, M\nTimmermann, B\nYaspo, ML\nSherry, ST\nMcVean, GA\nMardis, ER\nWilson, RK\nFulton, L\nFulton, R\nWeinstock, GM\nDurbin, RM\nBalasubramaniam, S\nBurton, J\nDanecek, P\nKeane, TM\nKolb-Kokocinski, A\nMcCarthy, S\nStalker, J\nQuail, M\nSchmidt, JP\nDavies, CJ\nGollub, J\nWebster, T\nWong, B\nZhan, YP\nAuton, A\nGibbs, RA\nYu, F\nBainbridge, M\nChallis, D\nEvani, US\nLu, J\nMuzny, D\nNagaswamy, U\nReid, J\nSabo, A\nWang, Y\nYu, J\nWang, J\nCoin, LJM\nFang, L\nGuo, XS\nJin, X\nLi, GQ\nLi, QB\nLi, YR\nLi, ZY\nLin, HX\nLiu, BH\nLuo, RB\nQin, N\nShao, HJ\nWang, BQ\nXie, YL\nYe, C\nYu, C\nZhang, F\nZheng, HC\nZhu, HM\nMarth, GT\nGarrison, EP\nKural, D\nLee, WP\nLeong, WF\nWard, AN\nWu, JT\nZhang, MY\nLee, C\nGriffin, L\nHsieh, CH\nMills, RE\nShi, XH\nvon Grotthuss, M\nZhang, CS\nDaly, MJ\nDePristo, MA\nAltshuler, DM\nBanks, E\nBhatia, G\nCarneiro, MO\ndel Angel, G\nGabriel, SB\nGenovese, G\nGupta, N\nHandsaker, RE\nHartl, C\nLander, ES\nMcCarroll, SA\nNemesh, JC\nPoplin, RE\nSchaffner, SF\nShakir, K\nYoon, 
SC\nLihm, J\nMakarov, V\nJin, HJ\nKim, W\nKim, KC\nKorbel, JO\nRausch, T\nFlice...",2012
1,1526,Leishmaniasis Worldwide and Global Estimates of Its Incidence,PLOS ONE,"Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n",2012
2,1271,A global reference for human genetic variation,NATURE,"Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLander, ES\nLee, C\nLehrach, H\nMardis, ER\nMarth, GT\nMcVean, GA\nNickerson, DA\nSchmidt, JP\nSherry, ST\nWang, J\nWilson, RK\nGibbs, RA\nBoerwinkle, E\nDoddapaneni, H\nHan, Y\nKorchina, V\nKovar, C\nLee, S\nMuzny, D\nReid, JG\nZhu, YM\nWang, J\nChang, YQ\nFeng, Q\nFang, XD\nGuo, XS\nJian, M\nJiang, H\nJin, X\nLan, TM\nLi, GQ\nLi, JX\nLi, YR\nLiu, SM\nLiu, X\nLu, Y\nMa, XD\nTang, MF\nWang, B\nWang, GB\nWu, HL\nWu, RH\nXu, X\nYin, Y\nZhang, DD\nZhang, WW\nZhao, J\nZhao, MR\nZheng, XL\nLander, ES\nAltshuler, DM\nGabriel, SB\nGupta, N\nGharani, N\nToji, LH\nGerry, NP\nResch, AM\nFlicek, P\nBarker, J\nClarke, L\nGil, L\nHunt, SE\nKelman, G\nKulesha, E\nLeinonen, R\nMcLaren, WM\nRadhakrishnan, R\nRoa, A\nSmirnov, D\nSmith, RE\nStreeter, I\nThormann, A\nToneva, I\nVaughan, B\nZheng-Bradley, X\nBentley, DR\nGrocock, R\nHumphray, S\nJames, T\nKingsbury, Z\nLehrach, H\nSudbrak, R\nAlbrecht, MW\nAmstislavskiy, VS\nBorodina, TA\nLienhard, M\nMertes, F\nSultan, M\nTimmermann, B\nYaspo, ML\nMardis, ER\nWilson, RK\nFulton, L\nFulton, R\nSherry, ST\nAnaniev, V\nBelaia, Z\nBeloslyudtsev, D\nBouk, N\nChen, C\nChurch, D\nCohen, R\nCook, C\nGarner, J\nHefferon, T\nKimelman, M\nLiu, CL\nLopez, J\nMeric, P\nO'Sullivan, C\nOstapchuk, Y\nPhan, L\nPonomarov, S\nSchneider, V\nShekhtman, E\nSirotkin, K\nSlotta, D\nZhang, H\nMcVean, GA\nDurbin, RM\nBalasubramaniam, S\nBurton, J\nDanecek, P\nKeane, TM\nKolb-Kokocinski, A\nMcCarthy, S\nStalker, J\nQuail, M\nSchmidt, JP\nDavies, CJ\nGollub, J\nWebster, T\nWong, B\nZhan, YP\nAuton, A\nCampbell, CL\nKong, Y\nMarcketta, A\nGibbs, RA\nYu, FL\nAntunes, L\nBainbridge, M\nMuzny, D\nSabo, A\nHuang, ZY\nWang, J\nCoin, LJM\nFang, L\nGuo, XS\nJin, X\nLi, GQ\nLi, QB\nLi, YR\nLi, ZY\nLin, HX\nLiu, BH\nLuo, 
RB\nShao, HJ\nXie, YL\nYe, C\nYu, C\nZhang, F\nZheng, HC\nZh...",2015
3,901,Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study,LANCET ONCOLOGY,"de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nLlombart-Bosch, A\nCheng-Yang, C\nTatti, SA\nKasamatsu, E\nIljazovic, E\nOdida, M\nPrado, R\nSeoud, M\nGrce, M\nUsubutun, A\nJain, A\nSuarez, GAH\nLombardi, LE\nBanjo, A\nMenendez, C\nDomingo, EJ\nVelasco, J\nNessa, A\nChichareon, SCB\nQiao, YL\nLerma, E\nGarland, SM\nSasagawa, T\nFerrera, A\nHammouda, D\nMariani, L\nPelayo, A\nSteiner, I\nOliva, E\nMeijer, CJLM\nAl-Jassar, WF\nCruz, E\nWright, TC\nPuras, A\nLlave, CL\nTzardi, M\nAgorastos, T\nGarcia-Barriola, V\nClavel, C\nOrdi, J\nAndujar, M\nCastellsague, X\nSanchez, GI\nNowakowski, AM\nBornstein, J\nMunoz, N\nBosch, FX\n",2010
4,711,The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution,SCIENCE,"Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola, JM\nAstashyn, A\nBahadue, SM\nBaldwin, CL\nBarris, W\nBaxter, R\nBell, SN\nBennett, AK\nBennett, GL\nBiase, FH\nBoldt, CR\nBradley, DG\nBrinkman, FSL\nBrinkmeyer-Langford, CL\nBrown, WC\nBrownstein, MJ\nBuhay, C\nCaetano, AR\nCamara, F\nCarroll, JA\nCarvalho, WA\nCasey, T\nCervelatti, EP\nChack, J\nChacko, E\nChandrabose, MM\nChapin, JE\nChapple, CE\nChen, HC\nChen, L\nCheng, Y\nCheng, Z\nChilders, CP\nChitko-McKown, CG\nChiu, R\nChoi, JW\nChrast, J\nColley, AJ\nConnelley, T\nCree, A\nCurry, S\nDalrymple, B\nDiep Dao, M\nDavis, C\nde Oliveira, CJF\nde Miranda Santos, IKF\nde Campos, TA\nDeobald, H\nDevinoy, E\nDickens, CM\nDing, Y\nDinh, HH\nDe Donato, M\nDonohue, KE\nDonthu, R\nDovc, P\nDugan-Rocha, S\nDurbin, KJ\nEberlein, A\nEdgar, RC\nEgan, A\nEggen, A\nEichler, EE\nElhaik, E\nEllis, SA\nElnitski, L\nErmolaeva, O\nEyras, E\nFitzsimmons, CJ\nFowler, GR\nFranzin, AM\nFritz, K\nGabisi, RA\nGarcia, GR\nGarcia, JF\nGenini, S\nGerlach, D\nGerman, JB\nGilbert, JGR\nGill, CA\nGladney, CJ\nGlass, EJ\nGoodell, J\nGrant, JR\nGraur, D\nGreaser, ML\nGreen, JA\nGreen, RD\nGuan, L\nGuigo, R\nHadsell, DL\nHagen, DE\nHakimov, HA\nHalgren, R\nHamernik, DL\nHamilton, C\nHarhay, GP\nHarrow, JL\nHart, EA\nHastings, N\nHavlak, P\nHenrichsen, CN\nHernandez, J\nHernandez, M\nHerzig, CTA\nHiendleder, SG\nHines, S\nHitchens, ME\nHlavina, W\nHobbs, M\nHolder, M\nHolt, RA\nHu, ZL\nHume, J\nIivanainen, A\nIngham, A\nIso-Touru, T\nJamis, C\nJann, O\nJensen, K\nJhangiani, SN\nJiang, HY\nJohnson, AJ\nJones, SJM\nJoshi, V\nJunier, T\nKapetis, D\nKappes, SM\nKapustin, Y\nKeele, JW\nKent, MP\nKerr, T\nKhalil, SS\nKhatib, H\nKiryutin, B\nKitts, P\nKokocinski, F\nKolbehdari, D\nKovar, CL\nKriventseva, 
EV\nKumar, CG\nKumar, D\nLahmers, KK\nLandrum, M\nLarkin, DM\nLau, LPL\nLeach, R\nLee, JCM\nLee, S\nL...",2009
5,601,GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n",1995
6,474,Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes,NATURE GENETICS,"Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinger, HH\nLidral, AC\nPober, BR\nMoreno, L\nArcos-Burgos, M\nValencia, C\nHoudayer, C\nBahuau, M\nMoretti-Ferreira, D\nRichieri-Costa, A\nDixon, MJ\nMurray, JC\n",2002
7,429,Leptogenesis,PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS,"Davidson, S\nNardi, E\nNir, Y\n",2008
8,410,Temperature sensitivity of drought-induced tree mortality portends increased regional die-off under global-change-type drought,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"Adams, HD\nGuardiola-Claramonte, M\nBarron-Gafford, GA\nVillegas, JC\nBreshears, DD\nZou, CB\nTroch, PA\nHuxman, TE\n",2009
9,376,Electron localization following attosecond molecular photoionization,NATURE,"Sansone, G\nKelkensberg, F\nPerez-Torres, JF\nMorales, F\nKling, MF\nSiu, W\nGhafur, O\nJohnsson, P\nSwoboda, M\nBenedetti, E\nFerrari, F\nLepine, F\nSanz-Vicario, JL\nZherebtsov, S\nZnakovskaya, I\nL'Huillier, A\nIvanov, MY\nNisoli, M\nMartin, F\nVrakking, MJJ\n",2010


In [1639]:
UDEA_YES.Z9.sum()

76237

In [1640]:
UDEA_YES.sort_values('SCP_Cited by',ascending=False)[[
    'SCP_Cited by','TI','SO','AU','PY']].reset_index(drop=True)[:10]

Unnamed: 0,SCP_Cited by,TI,SO,AU,PY
0,1586,Leishmaniasis Worldwide and Global Estimates of Its Incidence,PLOS ONE,"Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n",2012
1,1160,"Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): A randomised, placebo-controlled trial",The Lancet,"Olldashi F., Kerçi M., Zhurda T., Ruçi K., Banushi A., Traverso M.S., Jiménez J., Balbi J., Dellera C., Svampa S., Quintana G., Piñero G., Teves J., Seppelt I., Mountain D., Hunter J., Balogh Z., Zaman M., Druwé P., Rutsaert R., Mazairac G., Pascal F., Yvette Z., Chancellin D., Okwen P., Djokam-Liapoe J., Jangwa E., Mbuagbaw L., Fointama N., Pascal N., Baillie F., Jiang J.-Y., Gao G.-Y., Bao Y.-H., Morales C., Sierra J., Naranjo S., Correa C., Gómez C., Herrera J., Caicedo L., Rojas A., Pastas H., Miranda H., Constaín A., Perdomo M., Muñoz D., Duarte A., Vásquez E., Ortiz C., Ayala B., Delgado H., Benavides G., Rosero L., Mejía-Mantilla J., Varela A., Calle M., Castillo J., García A., Ciro J., Villa C., Panesso R., Flórez L., Gallego A., Puentes-Manosalva F., Medina L., Márquez K., Romero A.R., Hernández R., Martínez J., Gualteros W., Urbina Z., Velandia J., Benítez F., Trochez A., Villarreal A., Pabón P., López H., Quintero L., Rubiano A., Tamayo J., Piñera M., Navarro Z., Rondón D., Bujan B., Palacios L., Martínez D., Hernández Y., Fernández Y., Casola E., Delgado R., Herrera C., Arbolaéz M., Domínguez M., Iraola M., Rojas O., Enseñat A., Pastrana I., Rodríguez D., De La Campa S.A., Fortún T., Larrea M., Aragón L., Madrazo A., Svoboda P., Izurieta M., Daccach A., Altamirano M., Ortega A., Cárdenas B., González L., Ochoa M., Ortega F., Quichimbo F., Guiñanzaca J., Zavala I., Segura S., Jerez J., Acosta D., Yánez F., Camacho R., Khamis H., Shafei H., Kheidr A., Nasr H., Mosaad M., Rizk S., El Sayed H., Moati T., Hokkam E., Amin M., Lowis H., Fawzy M., Bedir N., Aldars M., Rodríguez V., Tobar J., Alvarenga J., Shalamberidze B., Demuria E., Rtveliashvili N., Chutkerashvili G., Dotiashvili D., Gogichaishvili T., Ingorokva G., Kazaishvili D., Melikidze B., Iashvili 
N., Tomadze G., Chkhikvadze M., Khurtsidze L., Lomidze Z., Dzagania D., Kvachadze N., Gotsadze G., Kaloiani V., Kajaia N., Dakubo J., Naaeder S., Sowah P., Yusuf A., Ishak A., Selasi-Sefenu P., Sibiri B.,...",2010
2,994,Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study,LANCET ONCOLOGY,"de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nLlombart-Bosch, A\nCheng-Yang, C\nTatti, SA\nKasamatsu, E\nIljazovic, E\nOdida, M\nPrado, R\nSeoud, M\nGrce, M\nUsubutun, A\nJain, A\nSuarez, GAH\nLombardi, LE\nBanjo, A\nMenendez, C\nDomingo, EJ\nVelasco, J\nNessa, A\nChichareon, SCB\nQiao, YL\nLerma, E\nGarland, SM\nSasagawa, T\nFerrera, A\nHammouda, D\nMariani, L\nPelayo, A\nSteiner, I\nOliva, E\nMeijer, CJLM\nAl-Jassar, WF\nCruz, E\nWright, TC\nPuras, A\nLlave, CL\nTzardi, M\nAgorastos, T\nGarcia-Barriola, V\nClavel, C\nOrdi, J\nAndujar, M\nCastellsague, X\nSanchez, GI\nNowakowski, AM\nBornstein, J\nMunoz, N\nBosch, FX\n",2010
3,626,The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution,SCIENCE,"Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola, JM\nAstashyn, A\nBahadue, SM\nBaldwin, CL\nBarris, W\nBaxter, R\nBell, SN\nBennett, AK\nBennett, GL\nBiase, FH\nBoldt, CR\nBradley, DG\nBrinkman, FSL\nBrinkmeyer-Langford, CL\nBrown, WC\nBrownstein, MJ\nBuhay, C\nCaetano, AR\nCamara, F\nCarroll, JA\nCarvalho, WA\nCasey, T\nCervelatti, EP\nChack, J\nChacko, E\nChandrabose, MM\nChapin, JE\nChapple, CE\nChen, HC\nChen, L\nCheng, Y\nCheng, Z\nChilders, CP\nChitko-McKown, CG\nChiu, R\nChoi, JW\nChrast, J\nColley, AJ\nConnelley, T\nCree, A\nCurry, S\nDalrymple, B\nDiep Dao, M\nDavis, C\nde Oliveira, CJF\nde Miranda Santos, IKF\nde Campos, TA\nDeobald, H\nDevinoy, E\nDickens, CM\nDing, Y\nDinh, HH\nDe Donato, M\nDonohue, KE\nDonthu, R\nDovc, P\nDugan-Rocha, S\nDurbin, KJ\nEberlein, A\nEdgar, RC\nEgan, A\nEggen, A\nEichler, EE\nElhaik, E\nEllis, SA\nElnitski, L\nErmolaeva, O\nEyras, E\nFitzsimmons, CJ\nFowler, GR\nFranzin, AM\nFritz, K\nGabisi, RA\nGarcia, GR\nGarcia, JF\nGenini, S\nGerlach, D\nGerman, JB\nGilbert, JGR\nGill, CA\nGladney, CJ\nGlass, EJ\nGoodell, J\nGrant, JR\nGraur, D\nGreaser, ML\nGreen, JA\nGreen, RD\nGuan, L\nGuigo, R\nHadsell, DL\nHagen, DE\nHakimov, HA\nHalgren, R\nHamernik, DL\nHamilton, C\nHarhay, GP\nHarrow, JL\nHart, EA\nHastings, N\nHavlak, P\nHenrichsen, CN\nHernandez, J\nHernandez, M\nHerzig, CTA\nHiendleder, SG\nHines, S\nHitchens, ME\nHlavina, W\nHobbs, M\nHolder, M\nHolt, RA\nHu, ZL\nHume, J\nIivanainen, A\nIngham, A\nIso-Touru, T\nJamis, C\nJann, O\nJensen, K\nJhangiani, SN\nJiang, HY\nJohnson, AJ\nJones, SJM\nJoshi, V\nJunier, T\nKapetis, D\nKappes, SM\nKapustin, Y\nKeele, JW\nKent, MP\nKerr, T\nKhalil, SS\nKhatib, H\nKiryutin, B\nKitts, P\nKokocinski, F\nKolbehdari, D\nKovar, CL\nKriventseva, 
EV\nKumar, CG\nKumar, D\nLahmers, KK\nLandrum, M\nLarkin, DM\nLau, LPL\nLeach, R\nLee, JCM\nLee, S\nL...",2009
4,598,GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n",1995
5,485,Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes,NATURE GENETICS,"Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinger, HH\nLidral, AC\nPober, BR\nMoreno, L\nArcos-Burgos, M\nValencia, C\nHoudayer, C\nBahuau, M\nMoretti-Ferreira, D\nRichieri-Costa, A\nDixon, MJ\nMurray, JC\n",2002
6,439,The importance of early treatment with tranexamic acid in bleeding trauma patients: An exploratory analysis of the CRASH-2 randomised controlled trial,The Lancet,"Olldashi F., Kerçi M., Zhurda T., Ruçi K., Banushi A., Traverso M.S., Jiménez J., Balbi J., Dellera C., Svampa S., Quintana G., Piñero G., Teves J., Seppelt I., Mountain D., Balogh Z., Zaman M., Druwé P., Rutsaert R., Mazairac G., Pascal F., Yvette Z., Chancellin D., Okwen P., Djokam-Liapoe J., Jangwa E., Mbuagbaw L., Fointama N., Baillie F., Jiang J.-Y., Gao G.-Y., Bao Y.-H., Morales C., Sierra J., Naranjo S., Correa C., Gómez C., Herrera J., Caicedo L., Rojas A., Pastas H., Miranda H., Constaín A., Perdomo M., Muñoz D., Duarte Á., Vásquez E., Ortiz C., Ayala B., Delgado H., Benavides G., Rosero L., Mejía-Mantilla J., Varela A., Calle M., Castillo J., García A., Ciro J., Villa C., Panesso R., Flórez L., Gallego A., Puentes-Manosalva F., Medina L., Márquez K., Romero A.R., Hernández R., Martínez J., Gualteros W., Urbina Z., Velandia J., Benítez F., Trochez A., Villarreal A., Pabón P., López H., Quintero L., Rubiano A., Tamayo J., Piñera M., Martínez D., Martínez H., Casola E., Domínguez M., Herrera C., Iraola M., Rojas O., Pastrana I., Rodríguez D., De La Campa S.Á., Fortún T., Larrea M., Aragón L., Madrazo A., Svoboda P., Izurieta M., Daccach A., Altamirano M., Ortega A., Cárdenas B., González L., Ochoa M., Ortega F., Quichimbo F., Guiñanzaca J., Zavala I., Segura S., Jerez J., Acosta D., Yánez F., Camacho R., Khamis H., Shafei H., Kheidr A., Nasr H., Mosaad M., Rizk S., El Sayed H., Moati T., Hokkam E., Amin M., Lowis H., Fawzy M., Bedir N., Aldars M., Rodríguez V., Tobar J., Alvarenga J., Shalamberidze B., Demuria E., Rtveliashvili N., Chutkerashvili G., Dotiashvili D., Gogichaishvili T., Ingorokva G., Kazaishvili D., Melikidze B., Iashvili N., Tomadze G., Chkhikvadze M., Khurtsidze L., Lomidze Z., Dzagania D., Kvachadze N., Gotsadze G., Kaloiani V., Kajaia N., Dakubo J., Naaeder S., Sowah P., 
Yusuf A., Ishak A., Selasi-Sefenu P., Sibiri B., Sarpong-Peprah S., Boro T., Bopaiah K., Shetty K., Subbiah R., Mulla L., Doshi A., Dewan Y., Grewal S., Tripathy P., Ma...",2011
7,432,Leptogenesis,PHYSICS REPORTS-REVIEW SECTION OF PHYSICS LETTERS,"Davidson, S\nNardi, E\nNir, Y\n",2008
8,424,Temperature sensitivity of drought-induced tree mortality portends increased regional die-off under global-change-type drought,PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA,"Adams, HD\nGuardiola-Claramonte, M\nBarron-Gafford, GA\nVillegas, JC\nBreshears, DD\nZou, CB\nTroch, PA\nHuxman, TE\n",2009
9,405,THE STRUCTURE OF THE PRESENILIN-1 (S182) GENE AND IDENTIFICATION OF 6 NOVEL MUTATIONS IN EARLY-ONSET AD FAMILIES,NATURE GENETICS,"CLARK, RF\nHUTTON, M\nFULDNER, RA\nFROELICH, S\nKARRAN, E\nTALBOT, C\nCROOK, R\nLENDON, C\nPRIHAR, G\nHE, C\nKORENBLAT, K\nMARTINEZ, A\nWRAGG, M\nBUSFIELD, F\nBEHRENS, MI\nMYERS, A\nNORTON, J\nMORRIS, J\nMEHTA, N\nPEARSON, C\nLINCOLN, S\nBAKER, M\nDUFF, K\nZEHR, C\nPEREZTUR, J\nHOULDEN, H\nRUIZ, A\nOSSA, J\nLOPERA, F\nARCOS, M\nMADRIGAL, L\nCOLLINGE, J\nHUMPHREYS, C\nASHWORTH, A\nSARNER, S\nFOX, N\nHARVEY, R\nKENNEDY, A\nROQUES, P\nCLINE, RT\nPHILLIPS, CA\nVENTER, JC\nFORSELL, L\nAXELMAN, K\nLILIUS, L\nJOHNSTON, J\nCOWBURN, R\nVIITANEN, M\nWINBLAD, B\nKOSIK, K\nHALTIA, M\nPOYHONEN, M\nDICKSON, D\nMANN, D\nNEARY, D\nSNOWDEN, J\nLANTOS, P\nLANNFELT, L\nROSSOR, M\nROBERTS, GW\nADAMS, MD\nHARDY, J\nGOATE, A\n",1995


In [1641]:
UDEA_YES['SCP_Cited by'].sum()

79717
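
The ranking and totals above follow one pattern for both citation columns. A minimal sketch on invented numbers, assuming the counts may arrive as strings or be missing (hence `pd.to_numeric`) and that a missing count can be treated as zero; the toy values are not taken from the database.

```python
import pandas as pd

toy = pd.DataFrame({
    'TI': ['Paper A', 'Paper B', 'Paper C'],
    'Z9': [10, 250, 3],                 # WOS cited-by count
    'SCP_Cited by': ['12', None, '5'],  # Scopus cited-by count, possibly missing
})

for col in ['Z9', 'SCP_Cited by']:
    # Coerce to numbers; unparsable or missing entries count as zero citations
    toy[col] = pd.to_numeric(toy[col], errors='coerce').fillna(0).astype(int)

# Most cited articles according to WOS, and the total for each source
top = toy.sort_values('Z9', ascending=False)[['Z9', 'TI']].reset_index(drop=True)
totals = toy[['Z9', 'SCP_Cited by']].sum()
print(top.head(2))
print(totals)
```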

# Search function for full names using the WOS authors and the institutional metadata

In [1655]:
aun=extract_key(UDEA_NOT,'WOS_author',json_column='authors_WOS')
aun[27:28]

WOS_author
Ponce, W. A.    5
Name: WOS_author, dtype: int64

In [1656]:
aun=aun.keys()

In [1657]:
posib=extract_key(UDEA_YES,'WOS_author',json_column='authors_WOS').keys()

### Good: i=2,3,4,6
### Bad: i=1,5

In [1658]:
i=27
n=aun[i]
n

'Ponce, W. A.'

In [1659]:
# if nold:
qq=query_json_column(n,df=UDEA_NOT,json_column='authors_WOS',
                        choices=wos_author,scorer=fuzz.ratio,score_cutoff=100)

In [1660]:
qq.index

RangeIndex(start=0, stop=5, step=1)

In [1661]:
for i in qq.index:
    print( [ d for d in qq.loc[i,'authors_WOS'] if n in d.get('WOS_author')] )

[{'affiliation': ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'], 'i': 0, 'WOS_author': 'Ponce, W. A.'}]
[{'affiliation': ['Physics Department, Univ Antioquia, A.A. 1226, Medellín, Colombia'], 'i': 0, 'WOS_author': 'Ponce, W. A.'}]
[{'affiliation': ['Departamento de Física, Univ Antioquia, Medellín, Colombia'], 'i': 0, 'WOS_author': 'Ponce, W. A.'}]
[{'affiliation': ['Departamento de Física, Univ Antioquia, Medellín, Colombia'], 'i': 0, 'WOS_author': 'Ponce, W. A.'}]
[{'affiliation': ['Departamento de Física, Univ Antioquia, Medellín, Colombia'], 'i': 0, 'WOS_author': 'Ponce, W. A.'}]


In [1662]:
qq.SO.unique()

array(['Zeitschrift für Physik C Particles and Fields',
       'Physical Review D'], dtype=object)

In [1663]:
qq.shape

(5, 182)

In [1664]:
extract_key(qq[:1],'WOS_author',json_column='authors_WOS')

WOS_author
Ponce, W. A.    1
Name: WOS_author, dtype: int64

In [1665]:
qq.loc[0,'authors_WOS']

[{'WOS_author': 'Ponce, W. A.',
  'affiliation': ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
  'i': 0}]

In [1666]:
qq.loc[0,'SO']

'Zeitschrift für Physik C Particles and Fields'

## Include SO (journal name)

In [1667]:
aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json').reset_index(drop=True)

In [1668]:
def build_institutional_authors(x,author_df,x_author_key='WOS_author',x_affiliation_key='affiliation',
                                        author_key='WOS_author',
                                        affiliation_key='WOS_affiliation'):
    if not isinstance(x,list):
        return None
    ll=[]
    for j in range(len(x)):
        # Each authors_WOS entry is assumed to carry a single affiliation, hence the [0]
        kk=find_author_affiliation(x[j].get(x_author_key),x[j].get(x_affiliation_key)[0],
                                        author_df=author_df,
                                        author_key=author_key,
                                        affiliation_key=affiliation_key,
                                        ratio=0.9 )
        if kk:
            ll.append(kk)
    if not ll:
        ll=None
    return ll

In [1669]:
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
#UDEA_NOT=UDEA[UDEA['UDEA_authors'].isna()].reset_index(drop=True)
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author']#.astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation']#.astype(str)

# ==============

In [1748]:
dfnot=qq#UDEA_NOT.copy()
dfnot=dfnot.reset_index(drop=True)

In [1749]:
l=dfnot['authors_WOS'].loc[0]
so=dfnot['SO'].loc[0]

In [1776]:
TEST=True
if TEST:
    l=[{'WOS_author': 'Ponce, W. A.',
        'affiliation': 
        ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
        'i': 0}]  
    so='Zeitschrift für Physik C Particles and Fields'

In [1778]:
l

[{'WOS_author': 'Ponce, W. A.',
  'affiliation': ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
  'i': 0}]

In [1779]:
so

'Zeitschrift für Physik C Particles and Fields'

In [1780]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#def json_fuzzy_merge(l,UDEA,contents,right_target='UDEA_authors',
                       #left_on='WOS_author',extra_left_on='affiliation',
                       #right_on='WOS_author',extra_right_on='WOS_affiliation',
                       #cutoff=95,cutoff_extra=65,scorer=fuzz.partial_ratio):
if True:                
    right_target='UDEA_authors'
    left_on='WOS_author'
    extra_left_on='affiliation'
    right_on='WOS_author' 
    extra_right_on='WOS_affiliation'
    extra_extra_right_on='full_name'
    SO='SO'
    cutoff=92
    cutoff_extra=70
    scorer=fuzz.token_set_ratio
    DEBUG=False
    newl=[]
    for d in l:
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True
        dfraf=pd.DataFrame()        
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        break
        # extract best WOS author match
        #r=fwp.extractOne(au,contents[right_on],scorer=scorer)
        #if r[1]>=cutoff:
        #    raf=fwp.extractOne( aff, contents.loc[r[2],extra_right_on],scorer=scorer )
            #print(r[1],r[2],raf[1],aff,',',contents.loc[r[2],extra_right_on])
            #if raf[1]>=cutoff_extra:
            #    newl=newl+[  contents.loc[r[2],right_target]  ]
            #else:
                #check SO
                #newl=newl+[  contents.loc[r[2],right_target]  ]
        #break
    #if newl:
    #    return newl
    #else:
    #    return None

In [1781]:
au

'Ponce, W. A.'

In [1782]:
aff

'International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'

In [1783]:
if True:
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
            if DEBUG: print(2,AUTHOR)            

In [1784]:
rau

('Ponce, William A.', 88)

In [1785]:
if True:
        #Try match author with less quality: Q
        #else:
        if rau[1]<cutoff:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(2.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True            

In [1786]:
rau

('Ponce, William A.', 100)
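
The first scorer stays below the cutoff because the initial `W.` and the full given name `William` count as different tokens, while the partial scorer accepts the abbreviated name as a fragment of the full one. A stricter, library-free way to express the same intuition is to require that each initial be a prefix of the corresponding given name. The sketch below is only an illustration: `name_tokens` and `initials_compatible` are hypothetical helpers, not part of `fuzzywuzzy` or `wosplus`, and they assume the `Surname, Given names` layout seen in `WOS_author`.

```python
import re

def name_tokens(name):
    """Lower-case a name and split it into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[^\W\d_]+", name.lower())

def initials_compatible(short, full):
    """True if the surnames agree and every given-name token of `short`
    is a prefix of the token at the same position in `full`."""
    s_last, _, s_given = short.partition(',')
    f_last, _, f_given = full.partition(',')
    if name_tokens(s_last) != name_tokens(f_last):
        return False
    s_tok, f_tok = name_tokens(s_given), name_tokens(f_given)
    if not s_tok or len(s_tok) > len(f_tok):
        return False
    return all(f.startswith(s) for s, f in zip(s_tok, f_tok))

print(initials_compatible('Ponce, W. A.', 'Ponce, William A.'))  # True
print(initials_compatible('Ponce, W. A.', 'Ponce, Walter B.'))   # False
```

A check like this could serve as a sanity filter on top of the fuzzy score; it does not replace it, since it ignores typos and reordered names.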

In [1787]:
if True:
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(3,rau)
            if raf and raf[1]>=cutoff_extra:
                AFFILIATION=True

In [1788]:
raf

('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)

In [1789]:
if True:
            #else:
            if raf[1]<cutoff_extra:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if raf and raf[1]>=cutoff_extra:
                    AFFILIATION=True

In [1790]:
raf

('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
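
The WOS affiliation string here concatenates two institutions, so whole-string similarity against the short institutional address is low (43), whereas a partial token-set scorer can reach 100 from the shared tokens alone. To see how much of the reference address is actually present in the WOS string, a simple token-containment fraction can be computed with the standard library. This is a sketch, not the scorer used in this notebook; `token_containment` is a hypothetical helper.

```python
import re

def token_containment(reference, candidate):
    """Fraction of the distinct tokens of `reference` that also occur in `candidate`."""
    ref = set(re.findall(r"\w+", reference.lower()))
    cand = set(re.findall(r"\w+", candidate.lower()))
    if not ref:
        return 0.0
    return len(ref & cand) / len(ref)

aff = ('International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, '
       'Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia')
ref = 'Univ Antioquia, Inst Fis, Medellin, Colombia.'

# Shared tokens: univ, antioquia, medellin, colombia (4 of the 6 reference tokens)
print(round(token_containment(ref, aff), 2))  # 0.67
```

A value well below 1 alongside a partial score of 100 suggests the affiliation fallback is permissive, which is one reason the journal check that follows is useful.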

# ================

Journal:

In [1791]:
full_name=dfraf['UDEA_authors'].loc[0].get('full_name')

In [1792]:
full_name

'PONCE GUTIERREZ WILLIAM ANTONIO'

In [1793]:
if True:
        if Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty:
                full_name=dfraf[right_target].loc[0].get(
                        extra_extra_right_on)
                if full_name:
                    kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                    rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                    if not rso:
                        JOURNAL=False
                    elif rso[1]<cutoff_so:
                        JOURNAL=False

In [1794]:
so

'Zeitschrift für Physik C Particles and Fields'

In [1797]:
kkk.SO[:2]

0                             REVISTA MEXICANA DE FISICA
1    ACTA PHYSICA HUNGARICA NEW SERIES-HEAVY ION PHYSICS
Name: SO, dtype: object

In [1798]:
rso

('PARTICLES AND FIELDS, PROCEEDINGS', 77, 2)

In [1799]:
if True:            
            #else:
            if dfraf.empty:
                JOURNAL=False

In [1800]:
JOURNAL

True

In [1801]:
if True:
        if AUTHOR and AFFILIATION and JOURNAL:
            newl=newl+[  dfraf.loc[0,right_target]  ]
            print('{} → {}'.format(au,full_name) )   

Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO


Test the full function below

# ================

In [1817]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#93,70
from IPython.display import clear_output
def json_fuzzy_merge(l,so,UDEA,contents,right_target='UDEA_authors',
                       left_on='WOS_author',extra_left_on='affiliation',
                       right_on='WOS_author',extra_right_on='WOS_affiliation',
                       extra_extra_right_on='full_name',
                       cutoff=93,cutoff_extra=70,scorer=fuzz.token_set_ratio,
                       DEBUG=False):
    newl=[]
    for d in l:
        clear_output(wait=True)
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True

        dfraf=pd.DataFrame()
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
        #Try match author with less quality: Q
        else:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(1.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True
        if DEBUG: print(1.2,'AUTHOR:',AUTHOR)                            
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(2,raf)
            if raf and raf[1]>=cutoff_extra:
                AFFILIATION=True
            else:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if DEBUG: print(2.1,raf)
                if raf and raf[1]>=cutoff_extra:
                    AFFILIATION=True

        if DEBUG: print(2.2,'AFFILIATION:',AFFILIATION,'Q:',Q)                
        if Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty:
                full_name=dfraf[right_target].loc[0].get(
                        extra_extra_right_on)
                if full_name:
                    kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                    rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                    if not rso:
                        JOURNAL=False
                    elif rso[1]<cutoff_so:
                        JOURNAL=False
            else:
                JOURNAL=False
        if DEBUG: print(3,'JOURNAL',JOURNAL)                
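        if AUTHOR and AFFILIATION and JOURNAL:
            newl=newl+[  dfraf.loc[0,right_target]  ]
            # full_name is only assigned in the Q<1 branch above, so look it up here
            full_name=dfraf.loc[0,right_target].get(extra_extra_right_on)
            print('{} → {}'.format(au,full_name) ) 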
    if newl:
        return newl
    else:
        return None

In [1818]:
dfraf=json_fuzzy_merge(l,so,UDEA,contents,DEBUG=True)

1 ('Ponce, William A.', 88)
1.1 ('Ponce, William A.', 100)
1.2 AUTHOR: True
2 ('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)
2.1 ('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
2.2 AFFILIATION: True Q: 0.8
3 JOURNAL True
Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO
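
The debug trace shows the decision ladder: each fallback scorer lowers `Q` by 0.1, and a lower `Q` demands a stronger journal match (50 after one fallback, 60 after two). The sketch below restates that ladder with an integer fallback counter instead of repeated float subtraction, which avoids relying on comparisons such as `Q<0.9`. The names `journal_cutoff` and `accept` are hypothetical, and the sketch does not model the case where `full_name` is missing.

```python
def journal_cutoff(fallbacks):
    """Journal-score cutoff for a given number of fallback matches (0, 1 or 2)."""
    if fallbacks == 0:
        return None  # both first-pass scorers succeeded: journal check is skipped
    return 50 if fallbacks == 1 else 60

def accept(author_ok, affiliation_ok, fallbacks, journal_score=None):
    """Combine the flags the way AUTHOR, AFFILIATION and JOURNAL are combined."""
    if not (author_ok and affiliation_ok):
        return False
    cutoff = journal_cutoff(fallbacks)
    if cutoff is None:
        return True
    return journal_score is not None and journal_score >= cutoff

# Two fallbacks (Q: 0.8) and the journal score of 77 from the step-by-step run
print(accept(True, True, fallbacks=2, journal_score=77))  # True
```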


In [None]:
%time kk=UDEA_NOT['authors_WOS'].combine(UDEA_NOT['SO'],func=lambda l,so: json_fuzzy_merge(l,so,UDEA,contents) if isinstance(l,list) else None)

In [1599]:
kk.dropna().shape

(577,)

In [1600]:
qq=UDEA_NOT.reset_index(drop=True)
qq['UDEA_authors']=kk

In [1602]:
#qq[['authors_WOS','UDEA_authors']]

In [1608]:
UDEA_NOT.shape[0]+UDEA_YES.shape[0]

15700

In [1610]:
qq['UDEA_authors'].dropna().shape

(577,)

In [1613]:
qq=qq.fillna('')

In [1615]:
UDEA_NOT=qq.reset_index(drop=True)
UDEA=pd.concat([UDEA_YES,UDEA_NOT],sort=False).reset_index(drop=True)