<a href="https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/Query_CTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WOS+SCI+SCP+PTJ+CTR queries for UdeA

Queries for the scientific articles of UdeA in the following bibliographic databases:
* Web of Science (WOS)
* Scielo (SCI)
* Scopus (SCP)
* Puntaje (UDEA)
* Center (CTR)

The database was created with:

[WOS_SCI_SCP_PTJ_CTR.ipynb](./WOS_SCI_SCP_PTJ_CTR.ipynb)

In [1]:
import os
VERSION='NEW'
if os.getcwd()=='/content':
    !pip install openpyxl xlrd wosplus fuzzywuzzy[speedup] > /dev/null

## functions

In [2]:
import pandas as pd
import wosplus as wp
pd.set_option('display.max_colwidth',200)
from venn import draw_venn, generate_colors
import numpy as np
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
idc='CÉDULA'

## Configure public links of files in Google Drive
* If it is a Google Spreadsheet, the corresponding file is downloaded as CSV
* If it is an Excel, JSON, or text file, the file is downloaded directly

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [3]:
%%writefile drive.cfg
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j

Overwriting drive.cfg
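
As a minimal sketch of what the `[FILES]` section encodes, the snippet below parses the same text with the standard-library `configparser` and recovers the mapping from file label to Google Drive ID. How `wosplus` itself reads `drive.cfg` is not shown in this notebook, so this is only an illustration of the file format; `CFG_TEXT` and `file_ids` are names introduced here.

```python
import configparser

# Same content as the drive.cfg written above
CFG_TEXT = """
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
"""

parser = configparser.ConfigParser()
parser.optionxform = str          # keep the case of the file-name keys
parser.read_string(CFG_TEXT)

# label -> Google Drive file ID
file_ids = dict(parser['FILES'])
print(file_ids)
```

Each additional public file would be one more `label=ID` line in the same section.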


## Load databases

In [4]:
affil='Univ Antioquia'
drive_files=wp.wosplus('drive.cfg')

In [5]:
UDEAjsonfile='WOS_SCI_SCP_PTJ_CTR.json.gz'
tmp=drive_files.load_biblio(UDEAjsonfile,compression='gzip')
UDEA=drive_files.biblio['WOS'].copy().reset_index(drop=True)



In [6]:
#from check_quality import *
#check_quality(UDEA)

## Indices
Information obtained from the column `json_column='UDEA_authors'`:

In [7]:
json_column='UDEA_authors'

This column contains lists of dictionaries with the information of each UDEA author, for example:

`{'DEPARTAMENTO': 'Instituto de Biología',
  'FACULTAD': 'Facultad de Ciencias Exactas y Naturales',
  'GRUPO': 'Sin Grupo Asociado',
  'INICIALES': 'I.',
  'NOMBRE COMPLETO': 'Idalyd Fonseca Gonzalez',
  'NOMBRES': 'Idalyd',
  'PRIMER APELLIDO': 'Fonseca',
  'SEGUNDO APELLIDO': 'Gonzalez',
  'WOS_affiliation': ['Univ Antioquia, Colombia.'],
  'WOS_author': ['FONSECA, IDALYD',
   'FONSECA-GONZALEZ, IDALYD',
   'Fonseca-Gonzalez, Idalyd',
   'Fonseca-Gonzalez, I.'],
  'full_name': 'FONSECA GONZALEZ IDALYD'}`

Other columns: `['OA', 'Z9', 'SCP_Cited by']`, where `Z9` is the WOS cited-by count.

See also [WOS field tags](https://images.webofknowledge.com/images/help/WOS/hs_wos_fieldtags.html)
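
To make the layout of a `UDEA_authors` cell concrete, here is a small plain-Python sketch that walks one such list of dictionaries. The record is a shortened copy of the example above; `values_of` is a hypothetical helper introduced only for this illustration, not part of the notebook.

```python
# One cell of the 'UDEA_authors' column: a list with one dict per UDEA author
cell = [
    {'FACULTAD': 'Facultad de Ciencias Exactas y Naturales',
     'DEPARTAMENTO': 'Instituto de Biología',
     'full_name': 'FONSECA GONZALEZ IDALYD',
     'WOS_author': ['FONSECA, IDALYD', 'Fonseca-Gonzalez, I.']},
]

def values_of(cell, key):
    """Collect the values stored under `key`, flattening list values."""
    if not isinstance(cell, list):      # unidentified articles hold '' instead of a list
        return []
    out = []
    for d in cell:
        v = d.get(key)
        if v is None:
            continue
        out.extend(v if isinstance(v, list) else [v])
    return out

print(values_of(cell, 'full_name'))
print(values_of(cell, 'WOS_author'))
print(values_of('', 'full_name'))
```

Some keys hold a single string (`full_name`) and others hold a list (`WOS_author`), which is why any helper over this column has to handle both cases.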

# Overall results

Unidentified articles:

In [106]:
UDEA_NOT=UDEA[UDEA[json_column]==''].reset_index(drop=True)
UDEA_NOT.shape[0]

3387

Identified articles:

In [107]:
UDEA_YES=UDEA[UDEA[json_column]!=''].reset_index(drop=True)
UDEA_YES.shape[0]

12313
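
As a quick worked check, the two counts printed above give the share of articles with at least one identified UDEA author. The sketch only restates those printed numbers:

```python
identified = 12313     # UDEA_YES.shape[0] printed above
unidentified = 3387    # UDEA_NOT.shape[0] printed above

total = identified + unidentified
share = identified / total
print(total, f'{share:.1%}')   # 15700 78.4%
```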

### Analysis of the identified articles

In [108]:
def flatten_if_nested(l):
    flatten=False
    for i in l:
        if type(i)==list:
            #return i
            flatten=True
    if flatten:
        l=[item for sublist in l for item in sublist]
        l=np.array(l)  # pd.np was removed from pandas; use numpy directly
    return l
def extract_key(df,key,json_column='UDEA_authors'):
    '''
    Extract all the unique key values of the list of dictionaries in 
    a json column when the key value is a string or another list
    '''
    ll=df[json_column].apply(lambda l: np.unique([ d.get(key) for d in l 
                                if d.get(key) ]) if type(l)==list else l)
    if ll.str[0].apply(lambda l: l if type(l)==list else None).dropna().shape[0]:
        ll=ll.apply(flatten_if_nested)
    ll=ll.apply(pd.Series).stack().values
    return pd.DataFrame( {key:list(ll)} ).groupby(key)[key].count().sort_values(ascending=False)

In [109]:
extract_key(UDEA_YES,'FACULTAD')

FACULTAD
Facultad de Medicina                        3313
Facultad de Ciencias Exactas y Naturales    2327
Facultad de Ingeniería                      1885
Facultad de Ciencias Agrarias                693
Facultad de Ciencias Sociales y Humanas      225
Facultad de Artes                             15
Name: FACULTAD, dtype: int64
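
For string-valued keys such as `FACULTAD`, the same kind of count can be sketched more compactly with `Series.explode` (assuming pandas 0.25 or later). This is an alternative illustration, not the notebook's function: `count_key` and `toy` are hypothetical names, and list-valued keys such as `WOS_author` are not handled here.

```python
import pandas as pd

def count_key(df, key, json_column='UDEA_authors'):
    """Count articles per value of `key`, counting each value once per article."""
    per_article = df[json_column].apply(
        lambda l: sorted({d[key] for d in l if d.get(key)}) if isinstance(l, list) else []
    )
    # explode() gives one row per value; empty lists become NaN and are dropped
    return per_article.explode().dropna().value_counts()

# Hypothetical toy data with the same layout as UDEA['UDEA_authors']
toy = pd.DataFrame({'UDEA_authors': [
    [{'FACULTAD': 'Facultad de Medicina'}],
    [{'FACULTAD': 'Facultad de Medicina'}, {'FACULTAD': 'Facultad de Medicina'},
     {'FACULTAD': 'Facultad de Ingeniería'}],
    '',   # unidentified article
]})
counts = count_key(toy, 'FACULTAD')
print(counts)
```

In the toy data the second article has two authors from the same faculty, and it still contributes only one count to that faculty.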

In [110]:
extract_key(UDEA_YES,'DEPARTAMENTO')

DEPARTAMENTO
Departamento de Microbiología y Parasitología                   947
Instituto de Física                                             884
Instituto de Investigaciones Médicas                            771
Departamento de Medicina Interna                                689
Instituto de Biología                                           671
Instituto de Química                                            667
Departamento de  Producción Agropecuaria                        429
Departamento de Pediatría y Puericultura                        409
Departamento de Ingeniería Metalúrgica                          357
Departamento de Ingeniería Sanitaria  y Ambiental               333
Escuela de Medicina Veterinaria                                 319
Departamento de Ingeniería Quimica                              289
Departamento de Ingeniería Mecánica                             288
Departamento de Cirugía                                         223
Departamento de Fisiología         

In [112]:
extract_key(UDEA_YES,'GRUPO')

GRUPO
Sin Grupo Asociado                                                                                                                                                                                450
Grupo de Materia Condensada-UdeA                                                                                                                                                                  261
Inmunovirología                                                                                                                                                                                   246
Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección                                                                         244
Programa de Estudio y Control de Enfermedades Tropicales                                                                                                                                          234
Grup

In [113]:
extract_key(UDEA_YES,'full_name')

full_name
DUQUE ECHEVERRI CARLOS ALBERTO            261
VELEZ BERNAL IVAN DARIO                   126
CERON MUÑOZ MARIO FERNANDO                126
BEDOYA BERRIO GABRIEL DE JESUS            125
LOPERA RESTREPO FRANCISCO JAVIER          122
JAIMES BARRAGAN FABIAN ALBERTO            119
RUGELES LOPEZ MARIA TERESA                118
PEÑUELA MESA GUSTAVO ANTONIO              115
CARMONA FONSECA JAIME DE JESUS            114
OLIVERA ANGEL MARTHA EUFEMIA              102
AMARILES MUÑOZ PEDRO JOSE                  97
RESTREPO BETANCUR LUIS FERNANDO            95
ROBLEDO RESTREPO SARA MARIA                94
RIOS LUIS ALBERTO                          94
CARDONA MAYA WALTER DARIO                  94
CARDONA ARIAS JAIBERTH ANTONIO             93
CADAVID JARAMILLO ANGELA PATRICIA          87
ARDILA MEDINA CARLOS MARTIN                87
MORALES ARAMBURO ALVARO LUIS               86
BLAIR TRUJILLO SILVIA VICTORIA             85
MONDRAGON PEREZ FANOR                      84
CORNEJO OCHOA JOSE WILLI

# Queries

In [114]:
def extract_key_unique(*args,**kwargs):
    keys=extract_key(*args,**kwargs).keys()
    return [ k for k in keys if k]

def get_groups(l,g):
    for d in l:
        gt=d.get('GRUPO')
        if gt and type( gt )==str:
            gs=gt.replace(
                ', Grupo','; Grupo'
            ).split('; ')
            for gg in gs:
                if gg not in g:
                    g.append(gg)
    return g

facultades={'key':'FACULTAD',
            'values' : extract_key_unique(UDEA,'FACULTAD',json_column='UDEA_authors') }
departamentos={'key':'DEPARTAMENTO',
            'values' :extract_key_unique(UDEA,'DEPARTAMENTO',json_column='UDEA_authors')}
nombre_completo={'key'    : 'NOMBRE COMPLETO',
            'values' : extract_key_unique(UDEA,'NOMBRE COMPLETO',json_column='UDEA_authors')}
full_name={'key'    : 'full_name',
            'values' : extract_key_unique(UDEA,'full_name',json_column='UDEA_authors')}
udea_affiliations={'key'    : 'WOS_affiliation',
            'values' : extract_key_unique(UDEA,'WOS_affiliation',json_column='UDEA_authors')}
wos_affiliations={'key'    : 'affiliation',
            'values' : extract_key_unique(UDEA,'WOS_affiliation',json_column='authors_WOS')}
udea_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='UDEA_authors')}
wos_author={'key'    : 'WOS_author',
            'values' : extract_key_unique(UDEA,'WOS_author',json_column='authors_WOS')}


#.apply(....) is a loop!
g=[]
#append to g
tmp=UDEA.UDEA_authors.apply(lambda l: 
                        get_groups(l,g)
        if type(l)==list else None
                        )
grupos={'key':'GRUPO',
            'values' :g}
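
The splitting rule inside `get_groups` can be tried on its own. In this sketch `split_groups` is a hypothetical helper that repeats the same `replace(...).split(...)` expression; the first input string is made up for illustration, while the second is a group name that appears in this dataset.

```python
def split_groups(text):
    """Split a GRUPO string only where a comma is followed by ' Grupo'."""
    return text.replace(', Grupo', '; Grupo').split('; ')

# Hypothetical value listing two groups in one string
two_groups = 'Grupo de Estado Sólido, Grupo de Materia Condensada-UdeA'
print(split_groups(two_groups))

# A single group whose own name contains commas is left untouched
one_group = 'Grupo Académico de Epidemiología Clínica, Nacer, Salud Sexual y Reproductiva'
print(split_groups(one_group))
```

Because the rule keys on the literal text `', Grupo'`, group names that do not start with `Grupo` are never separated from the preceding one.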


## Query function

Searches the string or list value stored under a given key in each dictionary of a list-of-dictionaries column, such as the column `UDEA_authors` in the `UDEA` DataFrame.

In [115]:
def query_json_column(q,df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=0):
    # Find the best fuzzy match among the indexed choices
    fchoices=fwp.extractOne(q,choices['values'],scorer=scorer,score_cutoff=score_cutoff)
    # Exact substring search in the indexed subcolumn, converted to string (e.g. list → string if necessary)
    if fchoices:
        fchoices=fchoices[0]
        dfF=df[df[json_column].apply(lambda l: True in [ str(d.get(choices['key'])).find(fchoices)>-1 
                                        for d in l if d.get(choices['key'])] if type(l)==list else False)]
        return dfF.reset_index(drop=True)
    else:
        return pd.DataFrame()
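
The two steps of `query_json_column` (fuzzy choice of an index value, then exact substring filtering) can be sketched without any third-party package. Here the standard-library `difflib.SequenceMatcher` stands in for the fuzzywuzzy scorer, so the scores are not comparable with the ones in this notebook; `best_choice`, `rows_with_value`, and the toy rows are hypothetical.

```python
from difflib import SequenceMatcher

def best_choice(query, choices, cutoff=0.0):
    """Step 1: return the choice most similar to `query`, or None below `cutoff`."""
    scored = [(SequenceMatcher(None, query.lower(), c.lower()).ratio(), c) for c in choices]
    score, choice = max(scored)
    return choice if score >= cutoff else None

def rows_with_value(rows, key, value):
    """Step 2: keep rows where some dict has `value` as a substring of d[key]."""
    return [r for r in rows
            if isinstance(r, list) and any(value in str(d.get(key, '')) for d in r)]

# Toy stand-in for the 'UDEA_authors' column
rows = [
    [{'DEPARTAMENTO': 'Instituto de Física'}],
    [{'DEPARTAMENTO': 'Instituto de Biología'}, {'DEPARTAMENTO': 'Instituto de Física'}],
    '',
    [{'DEPARTAMENTO': 'Instituto de Química'}],
]
choices = ['Instituto de Física', 'Instituto de Biología', 'Instituto de Química']

picked = best_choice('instituto fisica', choices, cutoff=0.6)
hits = rows_with_value(rows, 'DEPARTAMENTO', picked)
print(picked, len(hits))
```

The fuzzy step tolerates a sloppy query (no accent, missing word), while the exact step guarantees that every returned row really contains the selected index value.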

### Author

In [116]:
r=query_json_column('Diego Alejandro Restrepo Quintero',df=UDEA,json_column='UDEA_authors',
                        choices=nombre_completo,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [117]:
r.shape

(37, 181)

In [118]:
#r[['TI','AU','authors_WOS',json_column]].reset_index(drop=True)[5:7]

## Groups

Example

In [119]:
r=query_json_column('Grupo de Fenomenología de Interacciones Fundamentales',df=UDEA,json_column='UDEA_authors',
                        choices=grupos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [120]:
r.shape

(89, 181)

Search all groups

In [121]:
# Collect one row per group and build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows=[]
for g in grupos['values']:
    r=query_json_column(g,df=UDEA,json_column='UDEA_authors',choices=grupos,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    rows.append( {'Group':g,'articles':r.shape[0]} )
gdf=pd.DataFrame(rows)
gdf['articles']=gdf['articles'].astype(int)

In [122]:
gdf.sort_values('articles',ascending=False).reset_index(drop=True)[:10]

| | Group | articles |
|---|---|---|
| 0 | Sin Grupo Asociado | 450 |
| 1 | Grupo de Materia Condensada-UdeA | 302 |
| 2 | Grupo Reproducción, Inmunovirología, Infección y Cáncer | 299 |
| 3 | Inmunovirología | 299 |
| 4 | Grupo de Estado Sólido | 281 |
| 5 | Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección | 264 |
| 6 | Centro de Investigación, Innovación y Desarrollo de Materiales - CIDEMAT - Anteriormente: Grupo de Corrosión y Protección, | 264 |
| 7 | Grupo Académico de Epidemiología Clínica, Nacer, Salud Sexual y Reproductiva | 258 |
| 8 | Grupo Académico de Epidemiología Clínica | 258 |
| 9 | Grupo de Neurociencias de Antioquia, SINAPSIS | 246 |


## Department

In [123]:
r=query_json_column('Instituto de Física',df=UDEA,json_column='UDEA_authors',
                        choices=departamentos,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [124]:
r.shape

(884, 181)

## Center

Example

In [125]:
cen=query_json_column('Facultad de Ciencias Exactas y Naturales',df=UDEA,json_column='UDEA_authors',
                        choices=facultades,scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)

In [126]:
cen.shape

(2327, 181)

All

In [127]:
# Collect one row per faculty and build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows=[]
for f in facultades['values']:
    r=query_json_column(f,df=UDEA,json_column='UDEA_authors',choices=facultades,
                        scorer=fuzz.partial_token_sort_ratio,score_cutoff=79)
    rows.append( {'Facultad':f,'articles':r.shape[0]} )
fdf=pd.DataFrame(rows)
fdf['articles']=fdf['articles'].astype(int)

In [128]:
fdf.sort_values('articles',ascending=False)

| | Facultad | articles |
|---|---|---|
| 0 | Facultad de Medicina | 3313 |
| 1 | Facultad de Ciencias Exactas y Naturales | 2327 |
| 2 | Facultad de Ingeniería | 1885 |
| 3 | Facultad de Ciencias Agrarias | 693 |
| 4 | Facultad de Ciencias Sociales y Humanas | 225 |
| 5 | Facultad de Artes | 15 |


## Citations

In [133]:
UDEA.sort_values('Z9',ascending=False)[['Z9','TI','SO','AU','PY']].reset_index(drop=True)[:10]

| | Z9 | TI | SO | AU | PY |
|---|---|---|---|---|---|
| 0 | 3610 | An integrated map of genetic variation from 1,092 human genomes | NATURE | Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLande... | 2012 |
| 1 | 1526 | Leishmaniasis Worldwide and Global Estimates of Its Incidence | PLOS ONE | Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n | 2012 |
| 2 | 1271 | A global reference for human genetic variation | NATURE | Altshuler, DM\nDurbin, RM\nAbecasis, GR\nBentley, DR\nChakravarti, A\nClark, AG\nDonnelly, P\nEichler, EE\nFlicek, P\nGabriel, SB\nGibbs, RA\nGreen, ED\nHurles, ME\nKnoppers, BM\nKorbel, JO\nLande... | 2015 |
| 3 | 901 | Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study | LANCET ONCOLOGY | de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nL... | 2010 |
| 4 | 731 | AN EVALUATION OF GENETIC DISTANCES FOR USE WITH MICROSATELLITE LOCI | GENETICS | GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n | 1995 |
| 5 | 711 | The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution | SCIENCE | Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola,... | 2009 |
| 6 | 622 | Expression of a noncoding RNA is elevated in Alzheimer's disease and drives rapid feed-forward regulation of beta-secretase | NATURE MEDICINE | Faghihi, MA\nModarresi, F\nKhalil, AM\nWood, DE\nSahagan, BG\nMorgan, TE\nFinch, CE\nLaurent, GS\nKenny, PJ\nWahlestedt, C\n | 2008 |
| 7 | 601 | GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS | PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA | GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n | 1995 |
| 8 | 495 | Microsatellite markers reveal a spectrum of population structures in the malaria parasite Plasmodium falciparum | MOLECULAR BIOLOGY AND EVOLUTION | Anderson, TJC\nHaubold, B\nWilliams, JT\nEstrada-Franco, JG\nRichardson, L\nMollinedo, R\nBockarie, M\nMokili, J\nMharakurwa, S\nFrench, N\nWhitworth, J\nVelez, ID\nBrockman, AH\nNosten, F\nFerrei... | 2000 |
| 9 | 474 | Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes | NATURE GENETICS | Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinge... | 2002 |


In [135]:
UDEA.Z9.sum()

94659

In [136]:
UDEA.sort_values('SCP_Cited by',ascending=False)[[
    'SCP_Cited by','TI','SO','AU','PY']].reset_index(drop=True)[:10]

| | SCP_Cited by | TI | SO | AU | PY |
|---|---|---|---|---|---|
| 0 | 1586 | Leishmaniasis Worldwide and Global Estimates of Its Incidence | PLOS ONE | Alvar, J\nVelez, ID\nBern, C\nHerrero, M\nDesjeux, P\nCano, J\nJannin, J\nden Boer, M\n | 2012 |
| 1 | 1160 | Effects of tranexamic acid on death, vascular occlusive events, and blood transfusion in trauma patients with significant haemorrhage (CRASH-2): A randomised, placebo-controlled trial | The Lancet | Olldashi F., Kerçi M., Zhurda T., Ruçi K., Banushi A., Traverso M.S., Jiménez J., Balbi J., Dellera C., Svampa S., Quintana G., Piñero G., Teves J., Seppelt I., Mountain D., Hunter J., Balogh Z., ... | 2010 |
| 2 | 994 | Human papillomavirus genotype attribution in invasive cervical cancer: a retrospective cross-sectional worldwide study | LANCET ONCOLOGY | de Sanjose, S\nQuint, WGV\nAlemany, L\nGeraets, DT\nKlaustermeier, JE\nLloveras, B\nTous, S\nFelix, A\nBravo, LE\nShin, HR\nVallejos, CS\nde Ruiz, PA\nLima, MA\nGuimera, N\nClavero, O\nAlejo, M\nL... | 2010 |
| 3 | 626 | The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution | SCIENCE | Elsik, CG\nTellam, RL\nWorley, KC\nGibbs, RA\nAbatepaulo, ARR\nAbbey, CA\nAdelson, DL\nAerts, J\nAhola, V\nAlexander, L\nAlioto, T\nAlmeida, IG\nAmadio, AF\nAnatriello, E\nAntonarakis, SE\nAnzola,... | 2009 |
| 4 | 603 | Expression of a noncoding RNA is elevated in Alzheimer's disease and drives rapid feed-forward regulation of beta-secretase | NATURE MEDICINE | Faghihi, MA\nModarresi, F\nKhalil, AM\nWood, DE\nSahagan, BG\nMorgan, TE\nFinch, CE\nLaurent, GS\nKenny, PJ\nWahlestedt, C\n | 2008 |
| 5 | 598 | GENETIC ABSOLUTE DATING BASED ON MICROSATELLITES AND THE ORIGIN OF MODERN HUMANS | PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF\nAMERICA | GOLDSTEIN, DB\nLINARES, AR\nCAVALLISFORZA, LL\nFELDMAN, MW\n | 1995 |
| 6 | 590 | A molecular sieve with eighteen-membered rings | Nature | Davis M.E., Saldarriaga C., Montes C., Garces J., Crowdert C. | 1988 |
| 7 | 496 | Microsatellite markers reveal a spectrum of population structures in the malaria parasite Plasmodium falciparum | MOLECULAR BIOLOGY AND EVOLUTION | Anderson, TJC\nHaubold, B\nWilliams, JT\nEstrada-Franco, JG\nRichardson, L\nMollinedo, R\nBockarie, M\nMokili, J\nMharakurwa, S\nFrench, N\nWhitworth, J\nVelez, ID\nBrockman, AH\nNosten, F\nFerrei... | 2000 |
| 8 | 485 | Mutations in IRF6 cause Van der Woude and popliteal pterygium syndromes | NATURE GENETICS | Kondo, S\nSchutte, BC\nRichardson, RJ\nBjork, BC\nKnight, AS\nWatanabe, Y\nHoward, E\nde Lima, RLLF\nDaack-Hirsch, S\nSander, A\nMcDonald-McGinn, DM\nZackai, EH\nLammer, EJ\nAylsworth, AS\nArdinge... | 2002 |
| 9 | 453 | Sustainability in the construction industry: A review of recent developments based on LCA | Construction and Building Materials | Ortiz O., Castells F., Sonnemann G. | 2009 |


In [137]:
UDEA['SCP_Cited by'].sum()

101356

# Function to search for full names using the WOS authors and the institutional metadata

In [34]:
aun=extract_key(UDEA_NOT,'WOS_author',json_column='authors_WOS')
aun[27:28]

WOS_author
Cerón-Muñoz, M. F.    6
Name: WOS_author, dtype: int64

In [35]:
aun=aun.keys()

In [36]:
posib=extract_key(UDEA_YES,'WOS_author',json_column='authors_WOS').keys()

### Goods: i=2,3,4,6
### Bad: 1,5

In [37]:
i=27
n=aun[i]
n

'Cerón-Muñoz, M. F.'

In [38]:
# if nold:
qq=query_json_column(n,df=UDEA_NOT,json_column='authors_WOS',
                        choices=wos_author,scorer=fuzz.ratio,score_cutoff=100)

In [39]:
qq.index

RangeIndex(start=0, stop=6, step=1)

In [40]:
for i in qq.index:
    print( [ d for d in qq.loc[i,'authors_WOS'] if n in d.get('WOS_author')] )

[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 0, 'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia']}]
[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 0, 'affiliation': ['Grupo de Investigación en Genética, Mejoramiento y Modelación-GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia UdeA, Calle 70 No. 52-21, Medellín, Colombia']}]
[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 0, 'affiliation': ['Grupo de investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 N°52-21, Medellín, Colombia']}]
[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 2, 'affiliation': ['Grupo de Investigación en Genética, Mejoramiento y Modelación Animal-GaMMA, Univ Antioquia, Calle 70 N° 52-21, Medellín, Colombia']}]
[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 0, 'affiliation': ['Grupo de investigación GaMMA, Univ Antioquia, Carrera 75 No. 65-87, Bloque, Medellín, Colombia']}]
[{'WOS_author': 'Cerón-Muñoz, M. F.', 'i': 3, 'aff

In [41]:
qq.SO.unique()

array(['Livestock Research for Rural Development'], dtype=object)

In [42]:
qq.shape

(6, 181)

In [43]:
extract_key(qq[:1],'WOS_author',json_column='authors_WOS')

WOS_author
Ramírez-Arias, J. P.    1
Cerón-Muñoz, M. F.      1
Name: WOS_author, dtype: int64

In [44]:
qq.loc[0,'authors_WOS']

[{'WOS_author': 'Cerón-Muñoz, M. F.',
  'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia'],
  'i': 0},
 {'WOS_author': 'Ramírez-Arias, J. P.',
  'affiliation': ['Grupo de Investigación GaMMA, Facultad de Ciencias Agrarias, Univ Antioquia, Calle 70 Nº 52 - 2, Medellín, Colombia'],
  'i': 1}]

In [45]:
qq.loc[0,'SO']

'Livestock Research for Rural Development'

## Include SO

In [46]:
aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json').reset_index(drop=True)

In [47]:
def build_institutional_authors(x,author_df,x_author_key='WOS_author',x_affiliation_key='affiliation',
                                        author_key='WOS_author',
                                        affiliation_key='WOS_affiliation'):
    if type(x)!=list:
        return None
    ll=[]
    for j in range(len(x)):
        
        # Each authors_WOS entry always has a single affiliation.
        # Note: find_author_affiliation is not defined in the cells shown here.
        kk=find_author_affiliation(x[j].get(x_author_key),x[j].get(x_affiliation_key)[0],
                                        author_df=author_df,
                                        author_key=author_key,
                                        affiliation_key=affiliation_key,
                                        ratio=0.9 )
        if kk:
            ll.append(kk)
    if not ll:
        ll=None
    return ll

In [48]:
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
#UDEA_NOT=UDEA[UDEA['UDEA_authors'].isna()].reset_index(drop=True)
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author']#.astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation']#.astype(str)

# ==============

In [49]:
dfnot=qq#UDEA_NOT.copy()
dfnot=dfnot.reset_index(drop=True)

In [50]:
l=dfnot['authors_WOS'].loc[0]
so=dfnot['SO'].loc[0]

In [51]:
TEST=True
if TEST:
    l=[{'WOS_author': 'Ponce, W. A.',
        'affiliation': 
        ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
        'i': 0}]  
    so='Zeitschrift für Physik C Particles and Fields'

In [52]:
l

[{'WOS_author': 'Ponce, W. A.',
  'affiliation': ['International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'],
  'i': 0}]

In [53]:
so

'Zeitschrift für Physik C Particles and Fields'

In [63]:
import re
import unidecode
def author_quality_match(x,y,scorer=fuzz.token_set_ratio):
    
    chk={}
    chk['simple_wos']=unidecode.unidecode(x).lower().replace(
                       '.','').replace(',','').replace('-',' ')
    chk['full_name']=unidecode.unidecode(y).lower().replace(
                       '.','').replace(',','').replace('-',' ')
    # Name variants built from a full name of up to four words;
    # raw strings avoid invalid-escape warnings for \w and \s
    sn=re.sub(r'^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$',r'\1\2',chk['full_name'])
    chk['short_name']=re.sub(r'^(\w+\s+\w+\s+\w)\w+$',r'\1',sn)
    sn=re.sub(r'^(\w+\s+)\w+\s+(\w+)\s+\w+$',r'\1\2',chk['full_name'])
    chk['simple_name']=re.sub(r'^(\w+\s+)\w+\s+(\w+)$',r'\1\2',sn)
    chk['simple_second_name']=re.sub(r'^(\w+\s+)\w+\s+\w+\s+(\w+)$',r'\1\2',chk['full_name'])
    chk['last_name']=re.sub( r'^(\w+\s+)\w+\s+(\w+\s+\w+)$',r'\1\2', chk['full_name'] )
    chk['last_names']=re.sub(r'^(\w+\s+\w+\s+\w+)\s+\w+$',r'\1',chk['full_name'])
    chk['second_name']=re.sub(r'^(\w+\s+\w+\s+)\w+\s+(\w+)$',r'\1\2',chk['full_name'])

    chk['s1']=fuzz.token_sort_ratio( chk['simple_wos'],chk['full_name'])
    chk['s1b']=fuzz.partial_token_sort_ratio( chk['simple_wos'],chk['full_name'])
    chk['s2']=scorer( chk['simple_wos'],chk['short_name'])
    chk['s3']=fuzz.ratio( chk['simple_wos'],chk['simple_name'])
    # Stored under its own key: it was also written to 's3', which discarded the score above
    chk['s3b']=fuzz.ratio( chk['simple_wos'],chk['simple_second_name'])
    chk['s4']=fuzz.token_sort_ratio( chk['simple_wos'],chk['last_name'])
    chk['s5']=fuzz.token_sort_ratio(chk['simple_wos'],chk['last_names'])
    chk['s6']=fuzz.token_sort_ratio(chk['simple_wos'],chk['second_name'])

    chk['max']=max( chk['s1'],chk['s1b'],chk['s2'],chk['s3'],chk['s3b'],chk['s4'],chk['s5'],chk['s6'])
    
    return chk
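
To see what the name variants look like, the sketch below applies three of the regular expressions from `author_quality_match` to the full name used later in this notebook. Accent stripping is done here with the standard-library `unicodedata` as a stand-in for `unidecode`, which is an assumption of this sketch; `simplify` is a hypothetical helper.

```python
import re
import unicodedata

def simplify(name):
    """Lowercase, strip accents, and drop '.', ',' and '-' as in the cell above."""
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    return ascii_name.lower().replace('.', '').replace(',', '').replace('-', ' ')

full = simplify('PONCE GUTIERREZ WILLIAM ANTONIO')
wos = simplify('Ponce, W. A.')

# both surnames plus the initials of the given names
short_name = re.sub(r'^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$', r'\1\2', full)
# first surname plus first given name
simple_name = re.sub(r'^(\w+\s+)\w+\s+(\w+)\s+\w+$', r'\1\2', full)
# first surname plus both given names
last_name = re.sub(r'^(\w+\s+)\w+\s+(\w+\s+\w+)$', r'\1\2', full)

print(wos)          # ponce w a
print(short_name)   # ponce gutierrez w a
print(simple_name)  # ponce william
print(last_name)    # ponce william antonio
```

Comparing the simplified WOS string against several such variants gives an abbreviated WOS name like `ponce w a` a fair chance of scoring high against at least one of them.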

In [64]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#def json_fuzzy_merge(l,UDEA,contents,right_target='UDEA_authors',
                       #left_on='WOS_author',extra_left_on='affiliation',
                       #right_on='WOS_author',extra_right_on='WOS_affiliation',
                       #cutoff=95,cutoff_affiliation=65,scorer=fuzz.partial_ratio):
if True:                
    right_target='UDEA_authors'
    left_on='WOS_author'
    extra_left_on='affiliation'
    right_on='WOS_author' 
    extra_right_on='WOS_affiliation'
    extra_extra_right_on='full_name'
    SO='SO'
    cutoff=92
    cutoff_author=90
    cutoff_affiliation=70
    scorer=fuzz.token_set_ratio
    DEBUG=False
    newl=[]
    for d in l:
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True
        dfraf=pd.DataFrame()        
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        break
        # extract best WOS author match
        #r=fwp.extractOne(au,contents[right_on],scorer=scorer)
        #if r[1]>=cutoff:
        #    raf=fwp.extractOne( aff, contents.loc[r[2],extra_right_on],scorer=scorer )
            #print(r[1],r[2],raf[1],aff,',',contents.loc[r[2],extra_right_on])
            #if raf[1]>=cutoff_affiliation:
            #    newl=newl+[  contents.loc[r[2],right_target]  ]
            #else:
                #check SO
                #newl=newl+[  contents.loc[r[2],right_target]  ]
        #break
    #if newl:
    #    return newl
    #else:
    #    return None

In [65]:
au

'Ponce, W. A.'

In [66]:
aff

'International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia'

In [67]:
if True:
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
            if DEBUG: print(2,AUTHOR)            

In [68]:
rau

('Ponce, William A.', 88)

In [69]:
if True:
        #Try match author with less quality: Q
        #else:
        if rau[1]<cutoff:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(2.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True            

In [70]:
rau

('Ponce, William A.', 100)

In [71]:
if True:
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            full_name=dfraf[right_target].loc[0].get(extra_extra_right_on)
            chk=author_quality_match(au,full_name,scorer=scorer)
            if DEBUG: print(1.3,'chk max:',chk['max'])
            if chk['max']<cutoff_author:
                AUTHOR=False

In [72]:
chk['max']#>90

100

In [73]:
if True:
    if AUTHOR:
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(2,raf)
            if raf and raf[1]>=cutoff_affiliation:
                AFFILIATION=True

In [74]:
raf

('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)

In [75]:
if True:
            #else:
            if raf[1]<cutoff_affiliation:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if raf and raf[1]>=cutoff_affiliation:
                    AFFILIATION=True

In [76]:
raf

('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
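
One plausible reading of the jump from 43 to 100 is that the relaxed scorer works on shared words rather than on the whole string, so a long affiliation that merely contains the institution's words can still match. The sketch below does not reproduce any fuzzywuzzy score; it only lists, with the standard library, which words the two affiliation strings above share. `tokens` is a hypothetical helper.

```python
import re

def tokens(text):
    """Set of lowercase alphanumeric words in `text`."""
    return set(re.findall(r'\w+', text.lower()))

aff = ('International Centre for Theoretical Physics, P.O.B. 586, Trieste, I-34100, Italy, '
       'Physics Depto., Univ Antioquia, A.A. 1226, Medellin, Colombia')
candidate = 'Univ Antioquia, Inst Fis, Medellin, Colombia.'

shared = tokens(aff) & tokens(candidate)
missing = tokens(candidate) - tokens(aff)
print(sorted(shared))    # ['antioquia', 'colombia', 'medellin', 'univ']
print(sorted(missing))   # ['fis', 'inst']
```

The institute part (`Inst Fis`) does not appear in the article affiliation at all, so a top score from the relaxed scorer should probably be read as "same institution", not "same department".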

# ================

Journal:

In [77]:
full_name=dfraf['UDEA_authors'].loc[0].get('full_name')

In [78]:
full_name

'PONCE GUTIERREZ WILLIAM ANTONIO'

In [79]:
if True:
        if Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty:
                full_name=dfraf[right_target].loc[0].get(
                        extra_extra_right_on)
                if full_name:
                    kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                    rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                    if not rso:
                        JOURNAL=False
                    elif rso[1]<cutoff_so:
                        JOURNAL=False

In [81]:
so

'Zeitschrift für Physik C Particles and Fields'

In [82]:
kkk.SO[:2]

0                             REVISTA MEXICANA DE FISICA
1    ACTA PHYSICA HUNGARICA NEW SERIES-HEAVY ION PHYSICS
Name: SO, dtype: object

In [83]:
rso

('PARTICLES AND FIELDS, PROCEEDINGS', 77, 2)

In [84]:
if True:            
            #else:
            if dfraf.empty:
                JOURNAL=False

In [85]:
JOURNAL

True

In [86]:
if True:
        if AUTHOR and AFFILIATION and JOURNAL:
            mthchedd=dfraf.loc[0,right_target]
            mthchedd['from_author_WOS_WOS_author']=au
            newl=newl+[  mthchedd  ]
            print('{} → {}'.format(au,newl[0][extra_extra_right_on]) ) 

Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO


Test of the full function below

# ================

In [88]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
#93,70
from IPython.display import clear_output

def json_fuzzy_merge(l,so,UDEA,contents,right_target='UDEA_authors',
                       left_on='WOS_author',extra_left_on='affiliation',
                       right_on='WOS_author',extra_right_on='WOS_affiliation',
                       extra_extra_right_on='full_name',
                       cutoff=93,
                       cutoff_author=90,
                       cutoff_affiliation=70,scorer=fuzz.token_set_ratio,
                       DEBUG=False):
    newl=[]
    for d in l:
        clear_output(wait=True)
        AUTHOR=False
        AFFILIATION=False
        JOURNAL=True

        dfraf=pd.DataFrame()
        au=d.get(left_on)
        aff=d.get(extra_left_on)[0]
        Q=1
        # Try match author to a good degree
        rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),scorer=scorer)
        if DEBUG: print(1,rau)
        if rau[1]>=cutoff:
            AUTHOR=True
        #Try match author with less quality: Q
        else:
            rau=fwp.extractOne(au,contents[right_on].apply(pd.Series).stack().unique(),
                       scorer=fuzz.partial_token_sort_ratio)
            if DEBUG: print(1.1,rau)            
            if rau and rau[1]>=cutoff:
                Q=Q-0.1
                AUTHOR=True
        if DEBUG: print(1.2,'AUTHOR:',AUTHOR)                            
        if AUTHOR:
            dfraf=contents[contents[right_on].apply( lambda l: rau[0] in l )
                                ].reset_index(drop=True)
            full_name=dfraf[right_target].loc[0].get(extra_extra_right_on)
            chk=author_quality_match(au,full_name,scorer=scorer)
            if DEBUG: print(1.3,'chk max:',chk['max'])
            if chk['max']<cutoff_author:
                AUTHOR=False
        if AUTHOR:
            raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],scorer=fuzz.ratio)
            if DEBUG: print(2,raf)
            if raf and raf[1]>=cutoff_affiliation:
                AFFILIATION=True
            else:
                Q=Q-0.1
                raf=fwp.extractOne(aff,dfraf[extra_right_on].loc[0],
                                   scorer=fuzz.partial_token_set_ratio)
                if DEBUG: print(2.1,raf)
                if raf and raf[1]>=cutoff_affiliation:
                    AFFILIATION=True

        if DEBUG: print(2.2,'AFFILIATION:',AFFILIATION,'Q:',Q)                
        if AUTHOR and Q<1:
            cutoff_so=50
            if Q<0.9:
                cutoff_so=60
            if not dfraf.empty and full_name:
                kkk=UDEA[UDEA['UDEA_nombre'].str.contains(full_name)
                                ].reset_index(drop=True)
                rso=fwp.extractOne( so,   kkk.SO, scorer=scorer)
                if not rso:
                    JOURNAL=False
                elif rso[1]<cutoff_so:
                    JOURNAL=False
            else:
                JOURNAL=False
        if DEBUG: print(3,'JOURNAL',JOURNAL)                
        if AUTHOR and AFFILIATION and JOURNAL:
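            # Copy the record so the dict stored in `contents` is not mutated by later matches
            mthchedd=dict(dfraf.loc[0,right_target])
            mthchedd['from_author_WOS_WOS_author']=au
            newl=newl+[  mthchedd  ]
            # Report the match just added (not always the first one in newl)
            print('{} → {}'.format(au,mthchedd[extra_extra_right_on]) )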
    if newl:
        return newl
    else:
        return None

In [89]:
dfraf=json_fuzzy_merge(l,so,UDEA,contents,DEBUG=True)

1 ('Ponce, William A.', 88)
1.1 ('Ponce, William A.', 100)
1.2 AUTHOR: True
1.3 chk max: 100
2 ('Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.', 43)
2.1 ('Univ Antioquia, Inst Fis, Medellin, Colombia.', 100)
2.2 AFFILIATION: True Q: 0.8
3 JOURNAL True
Ponce, W. A. → PONCE GUTIERREZ WILLIAM ANTONIO
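The debug trace above shows the cascade used for both author and affiliation: a first scorer is tried (88 < 93 for the author), then a more permissive one (100), and each fallback lowers the quality flag `Q` by 0.1. Below is a minimal, standard-library sketch of that pattern. The names `strict_score`, `token_sort_score` and `tiered_match` are hypothetical, and `difflib` is only a stand-in for the `fuzzywuzzy` scorers, so the scores will not match the notebook's exactly.

```python
from difflib import SequenceMatcher

def strict_score(a, b):
    """Character-level similarity in 0..100 (rough stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort_score(a, b):
    """Same score after sorting whitespace-separated tokens (order-insensitive)."""
    return strict_score(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def tiered_match(query, choices, scorers=(strict_score, token_sort_score), cutoff=93):
    """Try each scorer in turn; every fallback lowers the quality flag by 0.1.

    Returns (best_choice, score, quality) or None when no scorer reaches cutoff.
    """
    quality = 1.0
    for scorer in scorers:
        best = max(choices, key=lambda c: scorer(query, c), default=None)
        if best is not None and scorer(query, best) >= cutoff:
            return best, scorer(query, best), round(quality, 1)
        quality -= 0.1
    return None

# The strict scorer fails on reordered tokens, the fallback succeeds with quality 0.9
print(tiered_match("a. ponce william", ["ponce william a.", "perez juan"]))
```

In `json_fuzzy_merge` the lowered `Q` is then used to demand additional evidence (the journal match) before accepting the record.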


In [91]:
%time kk=UDEA_NOT['authors_WOS'].combine(UDEA_NOT['SO'],func=lambda l,so: json_fuzzy_merge(l,so,UDEA,contents) if type(l)==list else None)

CPU times: user 1h 6min 49s, sys: 5.64 s, total: 1h 6min 55s
Wall time: 1h 6min 54s


In [92]:
kk.dropna().shape

(632,)

In [93]:
qq=UDEA_NOT.reset_index(drop=True)
qq['UDEA_authors']=kk

In [None]:
#pp=qq[qq['authors_WOS'].astype(str).str.contains('Ponce, W. A.')].reset_index(drop=True)#[['authors_WOS','UDEA_authors']]

In [None]:
#ppp=pp['authors_WOS'].combine(pp['SO'],func=lambda l,so: json_fuzzy_merge(l,so,UDEA,contents) 
#                              if type(l)==list else None)

In [None]:
#qq[['authors_WOS','UDEA_authors']]

In [101]:
UDEA_NOT.shape[0]#+UDEA_YES.shape[0]

4019

In [102]:
qq['UDEA_authors'].dropna().shape

(632,)

In [104]:
qq=qq.fillna('')

In [105]:
UDEA_NOT=qq.reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
UDEA=pd.concat([UDEA_YES,UDEA_NOT],sort=False).reset_index(drop=True)

### Quality check

In [94]:
chk=pd.DataFrame( list(kk.dropna().str[0].values) )[['from_author_WOS_WOS_author','WOS_author','full_name']]

In [95]:
import unidecode

In [96]:
# regex=True is passed explicitly: recent pandas versions default to literal replacement
chk['simple_wos']=chk['from_author_WOS_WOS_author'].str.lower().str.replace(
    r'[\.,]','',regex=True).str.replace(r'\-',' ',regex=True).apply(unidecode.unidecode)
# Two surnames plus initials of the given names
chk['short_name']=chk['full_name'].str.lower().str.replace(r'[\.,\-]','',regex=True).str.replace(
            r'^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$',r'\1\2',regex=True).str.replace(
         r'^(\w+\s+\w+\s+\w)\w+$',r'\1',regex=True).apply(unidecode.unidecode)
# First surname plus first given name
chk['simple_name']=chk['full_name'].str.lower().str.replace(r'[\.,]','',regex=True).str.replace(
             r'\-',' ',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+)\s+\w+$',r'\1\2',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+)$',r'\1\2',regex=True).apply(unidecode.unidecode)
# First surname plus both given names
chk['last_name']=chk['full_name'].str.lower().str.replace(r'[\.,]','',regex=True).str.replace(
            r'\-',' ',regex=True).str.replace(
            r'^(\w+\s+)\w+\s+(\w+\s+\w+)$',r'\1\2',regex=True)
# Both surnames plus first given name
chk['last_names']=chk['full_name'].str.lower().str.replace(r'[\.,]','',regex=True).str.replace(
             r'\-',' ',regex=True).str.replace(
            r'^(\w+\s+\w+\s+\w+)\s+\w+$',r'\1',regex=True)
# two surnames and one given name (sort)
# one surname and two given names (sort)
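To make the regular expressions above easier to follow, here is a standard-library sketch that applies the same patterns to a single string. `strip_accents` and `name_variants` are hypothetical helpers: `unicodedata` stands in for `unidecode`, and hyphens are replaced by spaces in every variant, which differs slightly from the `short_name` cell.

```python
import re
import unicodedata

def strip_accents(text):
    """Drop combining marks (standard-library stand-in for unidecode.unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_variants(full_name):
    """Reduce a 'SURNAME SURNAME GIVEN [GIVEN]' string to the forms compared with WOS names."""
    base = strip_accents(re.sub(r"[.,]", "", full_name.lower()).replace("-", " "))
    # Two surnames plus initials: four-word names first, then three-word names
    short = re.sub(r"^(\w+\s+\w+\s+\w)\w+(\s+\w)\w+$", r"\1\2", base)
    short = re.sub(r"^(\w+\s+\w+\s+\w)\w+$", r"\1", short)
    # First surname plus first given name
    simple = re.sub(r"^(\w+\s+)\w+\s+(\w+)\s+\w+$", r"\1\2", base)
    simple = re.sub(r"^(\w+\s+)\w+\s+(\w+)$", r"\1\2", simple)
    # First surname plus both given names (four-word names only)
    last = re.sub(r"^(\w+\s+)\w+\s+(\w+\s+\w+)$", r"\1\2", base)
    return {"short_name": short, "simple_name": simple, "last_name": last}

print(name_variants("PONCE GUTIERREZ WILLIAM ANTONIO"))
```

The patterns assume the surnames come first and that each surname or given name is a single word; compound names such as "DE LA CRUZ" would not be reduced correctly.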

In [97]:
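scorer=fuzz.token_set_ratio
# s1, s1b: WOS name against the full name (token sort, then partial token sort)
chk['s1']=chk['simple_wos'].combine( 
            chk['full_name'].str.lower().str.replace(r'[\.,]','',regex=True).apply(unidecode.unidecode),
           func=fuzz.token_sort_ratio)
chk['s1b']=chk['simple_wos'].combine( 
            chk['full_name'].str.lower().str.replace(r'[\.,]','',regex=True).apply(unidecode.unidecode),
           func=fuzz.partial_token_sort_ratio)
# s2: WOS name against surnames plus initials
chk['s2']=chk['simple_wos'].combine(chk['short_name'],
           func=scorer)
# s3: WOS name against first surname plus first given name
chk['s3']=chk['simple_wos'].combine(chk['simple_name'],
           func=fuzz.ratio)
# s4: WOS name against first surname plus both given names
chk['s4']=chk['simple_wos'].combine(chk['last_name'],
           func=fuzz.token_sort_ratio)
# NOTE: s5 repeats s4 exactly; the otherwise unused 'last_names' column may have been intended
chk['s5']=chk['simple_wos'].combine(chk['last_name'],
           func=fuzz.token_sort_ratio)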

In [98]:
score_cols=['s1','s1b','s2','s3','s4','s5']
chk['rmax']=chk[score_cols].max(axis=1)
chk['rmin']=chk[score_cols].min(axis=1)

In [99]:
chk[['simple_wos','full_name','s1','s1b','short_name','s2','simple_name','s3','last_name','s4',
     'rmin','rmax']].drop_duplicates('simple_wos').reset_index(drop=True)[100:150]

| | simple_wos | full_name | s1 | s1b | short_name | s2 | simple_name | s3 | last_name | s4 | rmin | rmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | calderon j c | CALDERON VELEZ JUAN CAMILO | 53 | 75 | calderon velez j c | 100 | calderon juan | 80 | calderon juan camilo | 62 | 53 | 100 |
| 101 | pineda david a | PINEDA SALAZAR DAVID ANTONIO | 67 | 93 | pineda salazar d a | 73 | pineda david | 92 | pineda david antonio | 82 | 67 | 93 |
| 102 | isaza c v | ISAZA NARVAEZ CLAUDIA VICTORIA | 46 | 78 | isaza narvaez c v | 100 | isaza claudia | 64 | isaza claudia victoria | 58 | 46 | 100 |
| 103 | arzuaga maria angelica | ARZUAGA SALAZAR MARIA ANGELICA | 85 | 100 | arzuaga salazar m a | 63 | arzuaga maria | 74 | arzuaga maria angelica | 100 | 63 | 100 |
| 104 | gallardo c c | GALLARDO CABRERA CECILIA | 67 | 75 | gallardo cabrera c | 100 | gallardo cecilia | 79 | gallardo cabrera cecilia | 67 | 67 | 100 |
| 105 | quintero t m t | QUINTERO TOBON MARIA TERESA | 68 | 79 | quintero tobon m t | 100 | quintero maria | 71 | quintero maria teresa | 69 | 68 | 100 |
| 106 | uscategui r m | USCATEGUI PEÑUELA ROSA MAGDALENA | 58 | 77 | uscategui penuela r m | 100 | uscategui rosa | 81 | uscategui rosa magdalena | 70 | 58 | 100 |
| 107 | echeverria e f | ECHEVERRIA ECHEVERRIA FELIX | 68 | 93 | echeverria echeverria f | 100 | echeverria felix | 80 | echeverria echeverria felix | 68 | 68 | 100 |
| 108 | zapata jose edgar | ZAPATA MONTOYA JOSE EDGAR | 81 | 71 | zapata montoya j e | 63 | zapata jose | 79 | zapata jose edgar | 100 | 63 | 100 |
| 109 | betancur t | BETANCUR VARGAS TERESITA | 59 | 100 | betancur vargas t | 100 | betancur teresita | 74 | betancur vargas teresita | 59 | 59 | 100 |


In [None]:
scorer( 'lopez jaramillo c', 'lopez jaramillo c a' )

In [None]:
chk[chk['rmax']<=90][['simple_wos','full_name','s1','s1b','short_name','s2','simple_name','s3','last_name','s4',
     'rmin','rmax']].drop_duplicates('simple_wos').reset_index(drop=True)[:3]

In [None]:
chk[chk['rmax']>90][['simple_wos','full_name','s1','s1b','short_name','s2','simple_name','s3','last_name','s4',
     'rmin','rmax']].drop_duplicates('simple_wos').reset_index(drop=True)[:3]

## Implementation of the quality check
* Apply over `kk` to extract `full_name` and `from_author_WOS_WOS_author` from each match.
* With these two values as `x` and `y`, apply `author_quality_match` to obtain `rmax`.
* Cut on `rmax`.
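The three steps above could be sketched as follows. This is only an illustration under stated assumptions: `token_overlap` is a crude stand-in for `fuzz.token_set_ratio`, `initials_form`, `rmax` and `filter_matches` are hypothetical names, and the notebook's own `author_quality_match` (defined earlier) combines more scores than the two used here.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Character-level similarity in 0..100 (rough stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_overlap(a, b):
    """Share of the smaller token set found in the other (crude stand-in for token_set_ratio)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0
    return round(100 * len(ta & tb) / min(len(ta), len(tb)))

def initials_form(full_name):
    """'PONCE GUTIERREZ WILLIAM ANTONIO' -> 'ponce gutierrez w a' (surnames assumed first)."""
    tokens = full_name.lower().split()
    return " ".join(tokens[:2] + [t[0] for t in tokens[2:]])

def rmax(wos_author, full_name):
    """Best score of the WOS name against several reduced forms of the full name."""
    x = wos_author.lower().replace(".", "").replace(",", "")
    variants = [full_name.lower(), initials_form(full_name)]
    return max(score(x, v) for v in variants for score in (ratio, token_overlap))

def filter_matches(matches, cutoff=90):
    """Keep only the matched records whose rmax exceeds the cutoff."""
    return [m for m in matches
            if rmax(m["from_author_WOS_WOS_author"], m["full_name"]) > cutoff]

matches = [
    {"from_author_WOS_WOS_author": "Ponce, W. A.", "full_name": "PONCE GUTIERREZ WILLIAM ANTONIO"},
    {"from_author_WOS_WOS_author": "Perez, J.", "full_name": "PONCE GUTIERREZ WILLIAM ANTONIO"},
]
print(filter_matches(matches))  # only the first record survives the cut
```

Taking the maximum over several reduced forms is what lets an abbreviated WOS name such as "Ponce, W. A." pass even though its score against the unabbreviated full name is low.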

In [100]:
chk[chk.full_name.str.contains('PONCE')][['simple_wos','full_name','s1','s1b','short_name','s2','simple_name','s3','last_name','s4',
     'rmin','rmax']]

| | simple_wos | full_name | s1 | s1b | short_name | s2 | simple_name | s3 | last_name | s4 | rmin | rmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 509 | ponce w a | PONCE GUTIERREZ WILLIAM ANTONIO | 45 | 89 | ponce gutierrez w a | 100 | ponce william | 73 | ponce william antonio | 60 | 45 | 100 |
| 510 | ponce w a | PONCE GUTIERREZ WILLIAM ANTONIO | 45 | 89 | ponce gutierrez w a | 100 | ponce william | 73 | ponce william antonio | 60 | 45 | 100 |
| 514 | ponce w a | PONCE GUTIERREZ WILLIAM ANTONIO | 45 | 89 | ponce gutierrez w a | 100 | ponce william | 73 | ponce william antonio | 60 | 45 | 100 |
| 516 | ponce w a | PONCE GUTIERREZ WILLIAM ANTONIO | 45 | 89 | ponce gutierrez w a | 100 | ponce william | 73 | ponce william antonio | 60 | 45 | 100 |
| 517 | ponce w a | PONCE GUTIERREZ WILLIAM ANTONIO | 45 | 89 | ponce gutierrez w a | 100 | ponce william | 73 | ponce william antonio | 60 | 45 | 100 |


In [None]:
'A.-B'.replace('.','')

In [None]:
x='Ponce, W. A.'
y='PONCE GUTIERREZ WILLIAM ANTONIO'


In [None]:
%time chk=author_quality_match(x,y)

In [None]:
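# Incomplete fragment, kept commented out because it is not valid on its own:
# .str.lower().str.replace(r'[\.,\-]','',regex=True).apply(unidecode.unidecode)
# two surnames and one given name (sort)
# one surname and two given names (sort)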

In [152]:
UDEA_YES['authors_WOS'].apply(lambda l: l if type(l)==list and len(l)==0 else None).dropna().shape

(1164,)

In [153]:
UDEA_NOT['authors_WOS'].apply(lambda l: l if type(l)==list and len(l)==0 else None).dropna().shape

(861,)

In [157]:
UDEA_YES[ UDEA_YES['authors_WOS'].astype(str)=='[]' ].shape

(1164, 181)

In [159]:
UDEA_NOT[ UDEA_NOT['authors_WOS'].astype(str)=='[]' ].C1.loc[28] # Remove &

'[Morillo, V. K.] Univ Nacl Colombia, Appl Mineral & Bioproc Grp, Medellin, Colombia.\nUniv Nacl Colombia, Appl Mineral & Bioproc Grp, Medellin, Colombia.\n[Morales, A. L.] Antioquia Univ, Solid State Grp, Medellin, Colombia.\n'

In [163]:
UDEA_NOT[ UDEA_NOT['authors_WOS'].astype(str)=='[]' ].C1.loc[3053] # double name

'[Rossi, Francesco; Reklaitis, Gintaras] Purdue Univ, Sch Chem Engn, Forney Hall Chem Engn,480 Stadium Mall Dr, W Lafayette, IN 47907 USA.\n[Rossi, Francesco; Manenti, Flavio; Buzzi-Ferraris, Guido] Politecn Milan, Dipartimento Chim Mat & Ingn Chim Giulio Natta, Piazza Leonardo da Vinci 32, I-20123 Milan, Italy.\n[Casas-Orozco, Daniel] UdeA, Chem Engn Dept, Calle 70 52-21, Medellin, Colombia.\n'

In [173]:
pp=UDEA[UDEA['authors_WOS'].astype(str)=='[]']
pp.shape

(2025, 181)

In [174]:
pp[pp['C1'].astype(str).str.contains(r'\[')].shape

(260, 181)
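The two `C1` strings shown above use the bracketed form `[Author; Author] Affiliation.`, with the university written as "Antioquia Univ" and "UdeA" rather than "Univ Antioquia". As a possible starting point for recovering authors from the 260 records that still contain brackets, here is a small standard-library sketch. `parse_c1`, `udea_authors` and the alias list are hypothetical and cover only the spellings visible in these two examples.

```python
import re

# One C1 line: "[Author A; Author B] Affiliation text."
C1_LINE = re.compile(r"^\[(?P<authors>[^\]]+)\]\s*(?P<affiliation>.+)$")

def parse_c1(c1):
    """Split a WOS C1 string into (author, affiliation) pairs.

    Lines without a leading [..] block carry no author names and are skipped.
    """
    pairs = []
    for line in c1.splitlines():
        m = C1_LINE.match(line.strip())
        if not m:
            continue
        for author in m.group("authors").split(";"):
            pairs.append((author.strip(), m.group("affiliation").strip()))
    return pairs

def udea_authors(c1, aliases=("Univ Antioquia", "Antioquia Univ", "UdeA")):
    """Authors whose affiliation contains one of the assumed spellings of the university."""
    return [a for a, aff in parse_c1(c1) if any(alias in aff for alias in aliases)]

c1 = ("[Morillo, V. K.] Univ Nacl Colombia, Appl Mineral & Bioproc Grp, Medellin, Colombia.\n"
      "Univ Nacl Colombia, Appl Mineral & Bioproc Grp, Medellin, Colombia.\n"
      "[Morales, A. L.] Antioquia Univ, Solid State Grp, Medellin, Colombia.\n")
print(udea_authors(c1))
```

Authors listed under several institutions (the "double name" case) produce one pair per affiliation, so they would need deduplication before merging.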