<a href="https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/WOS_SCI_SCP_PTJ_CTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WOS+SCI+SCP+PTJ+CTR

Merge the bibliographic datasets of the scientific articles of Universidad de Antioquia:
* Web of Science (WOS)
* Scielo (SCI)
* Scopus (SCP)
* Puntaje (PTJ, loaded with the `UDEA` prefix)
* Center (CTR)

For details see [merge.ipynb in Colaboratory](https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/merge.ipynb)

Implementation:
The input is a raw or partially processed WOS-SCI-SCP database. It may already contain some UDEA entries from PTJ, plus Center (CTR) information with additional data on the full names of UDEA authors.

Additionally, UDEA entries can be captured from:
1. A previous WOS-SCI-SCP-UDEA database.
2. A database with a column of full names (FULL LAST NAMES NAMES, e.g. VALDEZ GÚZMAN JUAN ALBERTO), a list of author aliases in WOS format (Lastname, Name, e.g. Valdez-Gúzman, J.A.), and a list of registered affiliations. TODO: Test
3. The database from Puntaje (UDEA).
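
The two name formats in item 2 can be related with a small helper. This is only a sketch: `wos_alias` is a hypothetical function (not part of `wosplus`), and it assumes the full name always has exactly two surnames followed by the given names, which will not hold for compound surnames.

```python
def wos_alias(full_name):
    """Build a WOS-style alias from 'LAST1 LAST2 GIVEN...' (assumed layout)."""
    parts = full_name.split()
    if len(parts) < 3:
        raise ValueError('expected two surnames followed by at least one given name')
    # Join the two surnames with a hyphen, as in the WOS alias example
    surnames = '-'.join(p.capitalize() for p in parts[:2])
    # Keep only the initial of each given name
    initials = ''.join(p[0].upper() + '.' for p in parts[2:])
    return '{}, {}'.format(surnames, initials)


alias = wos_alias('VALDEZ GÚZMAN JUAN ALBERTO')
print(alias)
```

In practice an author has several aliases, so a generated alias like this would be one candidate to compare against the registered list, not a replacement for it.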

In [1]:
import os
VERSION='NEW'
if os.getcwd()=='/content':
    !pip install openpyxl xlrd wosplus fuzzywuzzy[speedup] > /dev/null

In [2]:
# Delete UDEA_columns and start from scratch
REBUILD=False
MERGE_WITH_TRAINED=True

## Functions

In [3]:
import pandas as pd
import wosplus as wp
pd.set_option('display.max_colwidth',200)

In [4]:
# %load wos_sci_scp_ptj_ctr.py

In [5]:
from wos_sci_scp_ptj_ctr import *

## Configure public links of files in Google Drive
* If the file is a Google Spreadsheet, it is downloaded as CSV
* If it is an Excel or text file, it is downloaded directly

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [6]:
%%writefile drive.cfg
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
WOS_SCP_UDEA_SJR_SIU.xlsx=0BxoOXsn2EUNIQ3R4WDhvSzVLQ2s
Base_de_datos_investigadores_Definitiva.csv=12oalgUeKhpvzkTPBP8pXCeHTrF-KO223dy9ov9w9QKs
UDEA_authors_with_WOS_info.json=1o1eVT4JD0FMMICq_oxrTJOzWh47veBMw
produccion_fecha_vig_2003_2018.xlsx=1WbtX4K__TTLxXRjuLvqUYz9tuHCIlS5v
UDEA_WOS_SCI_SCP_PTJ.json=1OkVytKbxJwGvXZDkynkSoUDtkUOTaT4A

Overwriting drive.cfg
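
A file like `drive.cfg` can be inspected with the standard library. The sketch below is not how `wosplus` reads it internally (that is not shown here); `read_drive_ids` and `download_url` are hypothetical helpers, and the two URL patterns are assumptions about the usual public Google Drive and Google Sheets export links.

```python
import configparser

CFG_TEXT = """
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
Base_de_datos_investigadores_Definitiva.csv=12oalgUeKhpvzkTPBP8pXCeHTrF-KO223dy9ov9w9QKs
"""


def read_drive_ids(text):
    """Return {file name: Google Drive ID} from the [FILES] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep file names case-sensitive
    parser.read_string(text)
    return dict(parser['FILES'])


def download_url(file_id, spreadsheet=False):
    """Assumed URL patterns for public files (not taken from wosplus)."""
    if spreadsheet:
        # Google Spreadsheets are exported as CSV
        return 'https://docs.google.com/spreadsheets/d/{}/export?format=csv'.format(file_id)
    # Excel or text files are downloaded directly
    return 'https://drive.google.com/uc?export=download&id={}'.format(file_id)


ids = read_drive_ids(CFG_TEXT)
for name, file_id in ids.items():
    print(name, '->', download_url(file_id, spreadsheet=name.endswith('.csv')))
```

Setting `optionxform = str` matters because `configparser` lowercases keys by default, which would no longer match the file names used elsewhere in the notebook.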


## Load databases

In [8]:
affil='Univ Antioquia'
drive_files=wp.wosplus('drive.cfg')

#### DEBUG: if False stop in UDEA_PTJ!!!!

if os.path.exists(UDEAjsonfile):
    UDEA=               pd.read_json(UDEAjsonfile,compression='gzip').reset_index(drop=True)
else:    
    UDEA=drive_files.read_drive_json(UDEAjsonfile,compression='gzip').reset_index(drop=True)

In [9]:
if REBUILD:
    !rm WOS_SCI_SCP_PTJ_CTR.json.gz

In [10]:
RECOVER=True #False for test purposes
UDEAjsonfile='WOS_SCI_SCP_PTJ_CTR.json.gz'
#Test purposes
#UDEAjsonfile='UDEA_WOS_SCI_SCP_PTJ.json'
if RECOVER:
    #Requires latest wosplus!
    tmp=drive_files.load_biblio(UDEAjsonfile,compression='gzip')# TODO CHANGE FOR LAST VERSION IN GOOGLE DRIVE
else:
    tmp=drive_files.load_biblio('UDEAtmp.json')
    #drive_files.load_biblio(
    #  'https://raw.githubusercontent.com/restrepo/medicion/master/cienciometria/data/UDEAtmp300.json'
    #    )#Test: 199+1=200 found
    
UDEA=drive_files.biblio['WOS'].reset_index(drop=True)
#DEBUG
#UDEA=UDEA.sample(300,replace=True).reset_index(drop=True) #Test: 77 found
#tmp=drive_files.load_biblio('Sample_WOS.xlsx')



In [46]:
df=UDEA
UDEA_authors='UDEA_authors'
kk=df[UDEA_authors].apply(lambda l:
             l if type(l)==list
             and len(l)>0 else None
                ).dropna().reset_index(drop=True)

kk[kk.apply(
    lambda l: len([1 for d in l if 
               d.get('full_name')])>0
       )].shape

(11681,)

In [47]:
kk[kk.apply(
    lambda l: len([1 for d in l if 
               d.get('NOMBRE COMPLETO')])>0
       )].shape

(7718,)

In [27]:
df=UDEA
Tipo='Tipo'
x=df[df[Tipo].str.contains(
      'UDEA')].shape[0]
print(x)

10312


In [80]:
%%writefile check_quality.py
def check_quality(df,
     authors_WOS='authors_WOS',
     Tipo='Tipo',
     UDEA_authors='UDEA_authors'
    ):
    import pandas as pd
    if authors_WOS in df.columns:
        print(authors_WOS)
        x=df[authors_WOS].apply(lambda l:
                 l if type(l)==list
                 and len(l)>0 else None
                    ).dropna().shape[0]
        print(x)
        kk=df[df['TI']=='Leptonic charged Higgs decays in the Zee model'].reset_index(drop=True)
        print(kk.loc[0,'TI'],'; authors_WOS:',kk.loc[0,authors_WOS],'; AU:',kk.loc[0,'AU'])
    if Tipo in df.columns:        
        print('Tipo contains UDEA')
        x=df[df[Tipo].str.contains(
             'UDEA')].shape[0]
        print(x)
    if UDEA_authors in df.columns:
        print(UDEA_authors)
        kk=df[UDEA_authors].apply(lambda l:
             l if type(l)==list
             and len(l)>0 else None
                ).dropna().reset_index(drop=True)
        print(kk.shape[0])

        print('UDEA_authors → full_names (Extrapolado puntaje)')
        x=kk[kk.apply(lambda l: len([1 for d in l if 
                   d.get('full_name')])>0
               )].shape[0]
        print(x)
    
        print('UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)')
        x=kk[kk.apply(lambda l: len([1 for d in l if 
                   d.get('NOMBRE COMPLETO')])>0
               )].shape[0]
        print(x)

Writing check_quality.py


In [59]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10312
UDEA_authors
11681
UDEA_authors → full_names (Extrapolado puntaje)
11681
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
7718
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


In [60]:
if REBUILD:
    UDEA=clean_institutional_columns(UDEA,prefix='UDEA',Tipo='Tipo')
    UDEA['UDEA_authors']=None


In [61]:
for t in UDEA.Tipo.unique():
    print( '{}:{}'.format( t, UDEA[ UDEA.Tipo==t].shape[0] ) )

WOS_SCP_UDEA:4468
SCI_UDEA:2083
WOS_UDEA:716
SCI_SCP_UDEA:1269
WOS_SCP:1352
WOS:1168
SCP_UDEA:1084
WOS_SCI_SCP:169
SCI_SCP:347
WOS_SCI:54
SCI:809
SCP:1489
WOS_SCI_SCP_UDEA:599
WOS_SCI_UDEA:93


In [62]:
UDEA.shape

(15700, 181)

## Load trained old data 

### Merge WOS_SCP_SCI with trained data set PTJ_CTR

The merge requires splitting the entries into those matched by DOI (`DI`) and those matched by title (`TI`).
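
A minimal sketch of such a split, assuming both tables carry `DI` and `TI` columns. The function `merge_by_doi_then_title` and the sample rows are illustrative only; the actual logic lives in `fill_trained_data`, which is not shown here and may differ (for instance, it may match titles approximately instead of exactly).

```python
import pandas as pd


def merge_by_doi_then_title(left, right, doi='DI', title='TI'):
    """Left-merge on DOI where available, otherwise on exact title."""
    has_doi = left[doi].notna() & (left[doi] != '')
    # Rows with a DOI: match on DI only
    by_doi = left[has_doi].merge(right.drop(columns=[title]), on=doi, how='left')
    # Rows without a DOI: fall back to the title
    by_title = left[~has_doi].merge(right.drop(columns=[doi]), on=title, how='left')
    return pd.concat([by_doi, by_title], ignore_index=True)


left = pd.DataFrame({'DI': ['10.1/a', None, ''],
                     'TI': ['Paper A', 'Paper B', 'Paper C']})
right = pd.DataFrame({'DI': ['10.1/a', '10.1/x'],
                      'TI': ['Paper A (variant title)', 'Paper B'],
                      'UDEA_nombre': ['AUTHOR ONE', 'AUTHOR TWO']})

merged = merge_by_doi_then_title(left, right)
print(merged)
```

The two pieces add up to the size of the input, which is the same kind of bookkeeping the shape printouts in this section report.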


15700 (15700, 152)
(7072, 169) (8628, 169)

In [63]:
SIU=drive_files.read_drive_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')

In [64]:
if MERGE_WITH_TRAINED:
    if os.path.exists('WOS_SCP_UDEA_SJR_SIU.xlsx'):
        SIU=pd.read_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')
    else:    
        SIU=drive_files.read_drive_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')
        
    UDEA,SIU=fill_trained_data(UDEA,SIU)#TODO: Remove SIU


15700 (15700, 152)
(7072, 168) (8628, 168)


In [65]:
if MERGE_WITH_TRAINED:
    UDEA.to_json('UDEAtmp.json')
    RECOVER=False
    if RECOVER:
        UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [66]:
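import numpy as np  # pd.np was removed in pandas 2.0
if 'UDEA_autores' in UDEA.columns and UDEA[UDEA['UDEA_autores']==''].shape[0]:
    # Replace empty strings with NaN so that dropna() ignores them
    UDEA['UDEA_autores']=UDEA['UDEA_autores'].apply(lambda s: np.nan if type(s)==str and s=='' else s)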

In [67]:
if 'UDEA_autores' in UDEA.columns:
    print(UDEA[UDEA['UDEA_autores']==''].shape[0],UDEA['UDEA_autores'].dropna().shape[0])

0 7072


In [68]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10312
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


# Puntaje

UDEA

In [69]:
qq=UDEA.copy()

In [70]:
drive_files.biblio['WOS']=qq
drive_files.biblio['WOS'].shape

(15700, 168)

In [71]:
tmp=drive_files.load_biblio('produccion_fecha_vig_2003_2018.xlsx',prefix='UDEA')

In [72]:
pp= drive_files.biblio['UDEA'].copy()

In [73]:
drive_files.biblio['UDEA']=pp

In [74]:
df=merge_puntaje(drive_files)

(32581, 24)
va1 0 0
........................................................................

7258 : 5806 + 2827 = 8633
va2 0 5806
.........................................................

7258 : 5453 + 353 = 5806
va3 0 5453
.......................................................

va4 0 5389
7258 : 5389 + 64 = 5453
(3239, 174) + (5389, 152) = 8628


In [75]:
#TODO: Check why not zero
if 'UDEA_autores' in df.columns:
    print(0,'=',df[df['UDEA_autores']==''].shape[0],'; found:',df['UDEA_autores'].dropna().shape[0])

0 = 0 ; found: 10311


In [76]:
#df['UDEA_autores'].apply(lambda s: pd.np.nan if type(s)==str and s=='' else s).dropna().shape

In [77]:
UDEA=df.copy()

In [78]:
UDEA.shape

(15700, 180)

In [79]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10311
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


## Fill C1 in WOS format for non-WOS entries and extract affiliations from C1

In [88]:
#Fill from SCI_C1
UDEA['C1']=SCI_C1_to_C1(UDEA)

In [89]:
#Fill from SCP_C1='SCP_Authors with affiliations'
UDEA['C1']=SCP_Authors_with_affiliations_to_C1(UDEA)

In [90]:
UDEA[UDEA['C1'].isnull()].shape

(0, 180)

In [91]:
UDEA[UDEA.Tipo=='WOS'].reset_index(drop=True).C1.loc[0]

'[Doddasomayajula, R.; Chung, B. J.; Mut, F.; Cebral, J. R.] George Mason Univ, Bioengn Dept, Fairfax, VA 22030 USA.\n[Jimenez, C. M.] Univ Antioquia, Neurosurg Dept, Medellin, Colombia.\n[Hamzei-Sichani, F.] Mt Sinai Med Ctr, Dept Neurosurg, New York, NY 10029 USA.\n[Putman, C. M.] Inova Fairfax Hosp, Intervent Neuroradiol, Falls Church, VA USA.\n'
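
The `C1` string above has one line per affiliation, with the authors in square brackets. A minimal sketch of how it can be parsed follows; `parse_c1` is a hypothetical helper written for illustration and is not the `get_author_info` function used later in this notebook.

```python
import re


def parse_c1(c1):
    """Return [(authors, affiliation), ...] from a WOS-style C1 string."""
    records = []
    for line in c1.strip().split('\n'):
        m = re.match(r'\[(.*?)\]\s*(.*)', line)
        if m:
            authors = [a.strip() for a in m.group(1).split(';')]
            records.append((authors, m.group(2).strip()))
        elif line.strip():
            # Line without an author block (affiliation only)
            records.append(([], line.strip()))
    return records


c1 = ('[Doddasomayajula, R.; Chung, B. J.] George Mason Univ, Bioengn Dept, Fairfax, VA 22030 USA.\n'
      '[Jimenez, C. M.] Univ Antioquia, Neurosurg Dept, Medellin, Colombia.\n')

# Keep only the lines that mention the institution of interest
udea = [(au, aff) for au, aff in parse_c1(c1) if 'Univ Antioquia' in aff]
print(udea)
```

Lines without a bracketed author block are kept with an empty author list, since some non-WOS sources only provide the affiliation.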

In [92]:
#WARNING: some C1 WOS entries are not normalized: Missing authors
UDEA['authors_WOS']=UDEA.C1.apply(lambda x: x.split('\n') if x else x).apply(
    lambda x:   [y.replace('[','').replace('] ','; ') for y in x if y.find(affil)>-1 ] if x else x ).apply(
     lambda x: get_author_info(x) if x else x)

# Improve normalization: remove C1s with only affiliation (from Scielo)
UDEA['authors_WOS']=UDEA['authors_WOS'].apply( 
    lambda x: [d for d in x if d.get('WOS_author').find(affil)==-1] if type(x)==list else x )

In [93]:
UDEA[UDEA.Tipo=='SCP'].reset_index(drop=True).loc[0].authors_WOS

[{'WOS_author': 'Gallo-Villegas, J. A.',
  'affiliation': ['Unidad de Gestión del Conocimiento, Centro Clínico y de Investigación SICOR, Soluciones Integrales en Riesgo Cardiovascular, Medellín, Colombia, Facultad de Medicina, Univ Antioquia, Medellín, Colombia'],
  'i': 0}]

## Prepare UDEA columns

In [95]:
#TODO: Remove from fill_trained_data(..)
import re
if 'UDEA_autores' in UDEA.columns:
    # Collapse repeated whitespace, then split the ';'-separated names into one dict per author
    UDEA['UDEA_autores']=UDEA['UDEA_autores'].apply(lambda s: re.sub(r'\s+',' ',s) if type(s)==str else s)
    UDEA['UDEA_authors']=UDEA['UDEA_autores'].apply(lambda s: s.split(';') if type(s)==str else s).apply(
                           lambda l: [{'full_name':y} for y in l ] if type(l)==list else l)

## Merge with official researcher list: PTJ

In [96]:
AU=drive_files.read_drive_excel('Base_de_datos_investigadores_Definitiva.csv')

In [97]:
UPDATE_UDEA_authors_with_AU=True
if (UDEA['UDEA_authors'].dropna().shape[0] and 
    UPDATE_UDEA_authors_with_AU):
    kkn=UDEA.copy()
    kkn=update_institutional_authors(kkn,AU)
    print(kkn.shape,UDEA.shape)
    UDEA=kkn.copy()

0
1
2
3
4
5
6
7
8
9
10
11
(15700, 181) (15700, 181)


Quality check

In [98]:
key_contains_in_list_of_dictionaries(UDEA,'Restrepo, D',column='authors_WOS',key='WOS_author').loc[1:2]

1    [{'affiliation': ['Univ Antioquia, Inst Fis, Calle 70 52-21 Medellin, Medellin, Colombia.'], 'i': 0, 'WOS_author': 'Restrepo, D.'}]
2                        [{'affiliation': ['Univ Antioquia, Inst Fis, Medellin 1226, Colombia.'], 'i': 0, 'WOS_author': 'Restrepo, D.'}]
Name: authors_WOS, dtype: object

In [99]:
if UPDATE_UDEA_authors_with_AU:
    UDEA.to_json('UDEAtmp.json')
    RECOVER=False
    if RECOVER:
        UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [100]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10311
UDEA_authors
10311
UDEA_authors → full_names (Extrapolado puntaje)
10311
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
6840
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


## Add `UDEA.authors_WOS` info* within `UDEA.UDEA_authors` data**
(\*) obtained from `UDEA.C1`

(\*\*) Obtained from [puntaje trained old UDEA data](./WOS_SCI_SCP_PTJ_GS_LNS.ipynb#Merge-with-trained-data-set) and the [official researcher list](./WOS_SCI_SCP_PTJ_GS_LNS.ipynb#Merge-with-official-researcher-list)

Obtain name parts and initials from full name in `UDEA_authors` dictionary and update `UDEA_authors` with them

In [101]:
import sys
if 'UDEA_authors' not in UDEA.columns and REBUILD==False:
    sys.exit('Make MERGE_WITH_TRAINED True and run again')

In [102]:
# Obtain Spanish name parts from full name
# (dict.update works in place, so the returned dictupdatetmp is only a by-product)
dictupdatetmp=UDEA['UDEA_authors'].apply(lambda x: [y.update( 
                split_full_names(y,full_name='full_name')  ) if not pd.isnull(
                y.get('full_name')) else y for y in x] 
                                   if type(x)==list 
                                   else x)
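
The cell above relies on `dict.update` modifying each author dictionary in place. The sketch below illustrates that pattern with a hypothetical `split_full_name_sketch`, which assumes two surnames followed by the given names; the real `split_full_names` is not shown here and may handle more cases.

```python
def split_full_name_sketch(full_name):
    """Split 'LAST1 LAST2 GIVEN...' into name parts (assumed layout)."""
    parts = full_name.split()
    if len(parts) < 3:
        return {}
    return {
        'PRIMER APELLIDO': parts[0].title(),
        'SEGUNDO APELLIDO': parts[1].title(),
        'NOMBRES': ' '.join(p.title() for p in parts[2:]),
        'INICIALES': ' '.join(p[0] + '.' for p in parts[2:]),
    }


authors = [{'full_name': 'VELILLA HERNANDEZ PAULA ANDREA'}]

# update() returns None, so the list only collects None values;
# the useful effect is the in-place change of each dictionary.
result = [a.update(split_full_name_sketch(a['full_name'])) for a in authors]

print(result)
print(authors[0])
```

This is why the value assigned in the cell is never used afterwards: the column `UDEA_authors` itself already holds the updated dictionaries.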

In [103]:
kk=UDEA['authors_WOS'].combine( UDEA['UDEA_authors'], func=combinewos )

In [104]:
UDEA['UDEA_authors'].loc[0]

[{'CÉDULA': 39357558.0,
  'DEPARTAMENTO': 'Departamento de Microbiología y Parasitología',
  'FACULTAD': 'Facultad de Medicina',
  'GRUPO': 'Inmunovirología',
  'INICIALES': 'P. A.',
  'NOMBRE COMPLETO': 'Paula Andrea Velilla Hernandez',
  'NOMBRES': 'Paula Andrea',
  'PRIMER APELLIDO': 'Velilla',
  'SEGUNDO APELLIDO': 'Hernandez',
  'WOS_affiliation': ['Univ Antioquia UdeA, Fac Med, Grp Inmunovirol, Medellin, Colombia.'],
  'WOS_author': ['Velilla-Hernandez, Paula Andrea'],
  'full_name': 'VELILLA HERNANDEZ PAULA ANDREA'}]

In [105]:
UDEA.to_json('UDEAtmp.json')

### Load output results of previous cell runs

In [106]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

## Build a single profile for all

### Fill UDEA_authors with WOS_author info

Obtain UDEA_authors DataFrame: `aunly`

In [107]:
aunly=DataFrame_authors(UDEA)

DELGADO LASTRA JUAN DE DIOS


In [108]:
if not aunly.empty:
    aunly.to_json('UDEA_authors_with_WOS_info.json')

In [109]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [110]:
UDEA.shape

(15700, 181)

In [111]:
if RECOVER:
    if os.path.exists('UDEA_authors_with_WOS_info.json' ):
        aunly=pd.read_json('UDEA_authors_with_WOS_info.json')
    else:
        aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json')

In [112]:
aunly.shape

(1273, 2)


In [113]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10311
UDEA_authors
10311
UDEA_authors → full_names (Extrapolado puntaje)
10311
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
6840
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


## Merge UDEA with authors

In [114]:
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l:fill_full_wos_author_info(l,aunly) )

In [115]:
if UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA.to_json('UDEAtmp.json')

In [116]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [117]:
UDEA.shape

(15700, 181)

In [118]:
kk=UDEA.authors_WOS.combine(UDEA.UDEA_authors,func=lambda x,y: get_UDEA_authors(x,y,aunly))

In [119]:
UDEA.UDEA_authors.dropna().shape

(10311,)

(7072,)

In [120]:
UDEA['UDEA_authors']=kk

In [121]:
UDEA.UDEA_authors.dropna().shape,UDEA.shape

((10901,), (15700, 181))

((10963,), (15704, 181))

In [122]:
aunly.shape

(1273, 2)

(1461, 2)

In [123]:
if not aunly.empty:
    print(aunly.drop_duplicates('tmp_author').shape)

(1273, 2)


In [124]:
if not aunly.empty:
    aunly.to_json('UDEA_authors_with_WOS_info.json')

In [125]:
RECOVER=False
if RECOVER:
    if os.path.exists('UDEA_authors_with_WOS_info.json' ):
        aunly=pd.read_json('UDEA_authors_with_WOS_info.json')
    else:
        aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json')

In [126]:
if UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA.to_json('UDEAtmp.json')

In [127]:
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [128]:
UDEA.to_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip')

In [129]:
if 'UDEA_autores' in UDEA.columns:
    print(UDEA[UDEA['UDEA_autores']==''].shape[0],UDEA['UDEA_autores'].dropna().shape[0])

0 10311


In [130]:
if 'UDEA_authors' in UDEA.columns:
    print(UDEA[UDEA['UDEA_authors']==''].shape[0],UDEA['UDEA_authors'].dropna().shape[0])

0 10901


In [None]:
print(1)

In [131]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10311
UDEA_authors
10901
UDEA_authors → full_names (Extrapolado puntaje)
10901
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
7303
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


## Add PTJ directly from `UDEA_authors` with `WOS_info` DataFrame

In [132]:
UDEA=pd.read_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip').reset_index(drop=True)

In [133]:
aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json').reset_index(drop=True)

In [134]:
def build_institutional_authors(x,author_df,x_author_key='WOS_author',x_affiliation_key='affiliation',
                                        author_key='WOS_author',
                                        affiliation_key='WOS_affiliation'):
    if type(x)!=list:
        return None
    ll=[]
    for j in range(len(x)):
        
                                #author_WOS→affiliation always have single affiliation
        kk=find_author_affiliation(x[j].get(x_author_key),x[j].get(x_affiliation_key)[0],
                                        author_df=author_df,
                                        author_key=author_key,
                                        affiliation_key=affiliation_key,
                                        ratio=0.9 )
        if kk:
            ll.append(kk)
    if not ll:
        ll=None
    return ll

In [135]:
if not UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA['UDEA_authors']=UDEA.authors_WOS.apply(lambda l: build_institutional_authors(l,aunly) )

## Experimental: Change similarity by merge search

In [136]:
UDEA_YES=UDEA[~UDEA['UDEA_authors'].isna()].reset_index(drop=True)

In [137]:
UDEA_YES.shape

(10901, 181)

In [138]:
import fuzzywuzzy.process as fwp
from fuzzywuzzy import fuzz
UDEA_NOT=UDEA[UDEA['UDEA_authors'].isna()].reset_index(drop=True)
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author'].astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation'].astype(str)

In [139]:
print( UDEA_NOT['authors_WOS'].loc[0][0].get('WOS_author'),
      fwp.extractOne(  UDEA_NOT['authors_WOS'].loc[0][0].get('WOS_author'),
                     contents['WOS_author'],scorer=fuzz.partial_ratio  ) )

Maria Carrillo-Bonilla, Lina ("['Carrillo-Bonilla, Lina M.', 'Carrillo Bonilla, Lina Maria', 'CARRILLO, Lina M.']", 86, 963)


In [140]:
dfnot=UDEA_NOT.copy()
dfnot=dfnot.reset_index(drop=True)

In [141]:
#for i in range(20):
#l=dfnot['authors_WOS'].loc[i]
def json_fuzzy_merge(l,contents,right_target='UDEA_authors',
                       left_on='WOS_author',extra_left_on='affiliation',
                       right_on='WOS_author',extra_right_on='WOS_affiliation',
                       cuttoff=95,cuttof_extra=65,scorer=fuzz.partial_ratio):
    newl=[]
    for d in l:
        au=d.get(left_on)
        aff=d.get(extra_left_on)
        # Do not need to be string
        r=fwp.extractOne(au,contents[right_on],scorer=scorer)
        if r[1]>=cuttoff:
            raf=scorer( aff, contents.loc[r[2],extra_right_on]  )
            if raf>=cuttof_extra:
                newl=newl+[  contents.loc[r[2],right_target]  ]
    if newl:
        return newl
    else:
        return None
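
The function above accepts a candidate only when the author name passes a strict cutoff and the affiliation passes a looser one. The same two-stage idea can be sketched with the standard library alone. Here `difflib.SequenceMatcher` is a stand-in for `fuzz.partial_ratio`, so the scores will differ from those of `fuzzywuzzy`, and `two_stage_match` and its sample candidates are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in the 0-100 range (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def two_stage_match(author, affiliation, candidates, cutoff=95, cutoff_extra=65):
    """candidates: list of dicts with 'WOS_author' and 'WOS_affiliation' keys."""
    if not candidates:
        return None
    # Stage 1: best candidate by author name, with a strict cutoff
    best = max(candidates, key=lambda c: ratio(author, c['WOS_author']))
    if ratio(author, best['WOS_author']) < cutoff:
        return None
    # Stage 2: the affiliation must also be reasonably similar
    if ratio(affiliation, best['WOS_affiliation']) < cutoff_extra:
        return None
    return best


candidates = [
    {'WOS_author': 'Restrepo, D.',
     'WOS_affiliation': 'Univ Antioquia, Inst Fis, Medellin, Colombia.'},
    {'WOS_author': 'Velilla, P. A.',
     'WOS_affiliation': 'Univ Antioquia, Fac Med, Medellin, Colombia.'},
]

hit = two_stage_match('Restrepo, D.',
                      'Univ Antioquia, Inst Fis, Medellin 1226, Colombia.',
                      candidates)
print(hit)
```

The second stage guards against homonyms: a perfect name match is still rejected when the affiliation points to a different institution.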

In [142]:
import pandas as pd
import swifter



In [None]:
%time kk=dfnot['authors_WOS'].swifter.apply(lambda l: json_fuzzy_merge(l,contents,cuttof_extra=65))

In [143]:
%time kk=dfnot['authors_WOS'].apply(lambda l: json_fuzzy_merge(l,contents,cuttof_extra=65))

CPU times: user 6min 10s, sys: 40 ms, total: 6min 10s
Wall time: 6min 10s


In [144]:
kk.dropna().shape

(780,)

In [145]:
dfnot['UDEA_authors']=kk

In [146]:
#dfnot[['UDEA_authors','authors_WOS']].dropna(subset=['UDEA_authors'])

In [147]:
UDEA_NOT=dfnot.reset_index(drop=True)

In [148]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
UDEA=pd.concat([UDEA_YES,UDEA_NOT]).reset_index(drop=True)

In [149]:
UDEA['UDEA_authors'].dropna().shape

(11681,)

Quality checks

UDEA_YES=UDEA[UDEA.UDEA_nombre!=''].reset_index(drop=True)
UDEA_NOT=UDEA[UDEA.UDEA_nombre==''].reset_index(drop=True)

UDEA_YES['Tipo']=UDEA_YES.Tipo.str.replace('([SW][CO][SIP])$',r'\1_UDEA')

UDEA=UDEA_YES.append(UDEA_NOT)
UDEA=UDEA.reset_index(drop=True)

UDEA[UDEA.Tipo.str.contains('UDEA')].shape
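
The regular expression in the snippet above appends `_UDEA` only when `Tipo` ends in one of the source labels. A minimal check with the standard `re` module, where `tag_udea` is a hypothetical wrapper used only for illustration:

```python
import re


def tag_udea(tipo):
    """Append '_UDEA' when the label ends in WOS, SCI or SCP."""
    return re.sub(r'([SW][CO][SIP])$', r'\1_UDEA', tipo)


for tipo in ['WOS', 'SCI_SCP', 'WOS_SCP_UDEA']:
    print(tipo, '->', tag_udea(tipo))
```

Labels that already end in `UDEA` are left unchanged, so applying the replacement twice does not duplicate the suffix.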

In [151]:
UDEA.to_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip')

In [150]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
10311
UDEA_authors
11681
UDEA_authors → full_names (Extrapolado puntaje)
11681
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
7717
TI: "The inert doublet model" check WOS_author vs UDEA_authors
...


In [None]:
print(1)

## Try other approaches

In [152]:
wp.merge_with_close_matches??

In [153]:
%%writefile test.cfg
[FILES]
Sample_WOS.xlsx = 1--LJZ4mYyQcaJ93xBdbnYj-ZzdjO2Wq2
Sample_SCI.xlsx = 1-3a-hguQTk5ko8JRLCx--EKaslxGVscf
Sample_SCP.xlsx = 1-IAWlMdp2U-9L2jvZUio04ub1Ym3PX-H

Overwriting test.cfg


In [154]:
cib=wp.wosplus('test.cfg')
#cib.Debug=True
cib.load_biblio('Sample_WOS.xlsx')
cib.load_biblio('Sample_SCI.xlsx',prefix='SCI')
cib.load_biblio('Sample_SCP.xlsx',prefix='SCP')

In [155]:
def get_close_matches_Levenshtein(
        word,
        possibilities,
        n=3,
        cutoff=0.6,
        full=False):
    '''Replaces difflib.get_close_matches with a faster algorithm based on
       Levenshtein.ratio.
       HINT: Similarity increases significantly after lower() and unidecode()

       Refs: https://en.wikipedia.org/wiki/Levenshtein_distance
    '''
    import pandas as pd
    import Levenshtein
    if isinstance(possibilities, str):
        possibilities = [possibilities]
    rs = pd.DataFrame()
    MATCH = False
    for p in possibilities:
        similarity = Levenshtein.ratio(word, p)
        # print(word,'::',p,similarity)
        # sys.exit()
        if similarity >= cutoff:
            MATCH = True
            rs = rs.append({'similarity': similarity,
                            'match': p}, ignore_index=True)

    if MATCH:
        rs = rs.sort_values(
            'similarity', ascending=False).reset_index(drop=True)
        if full:
            return list(rs['match'][:n].values), list(
                rs['similarity'][:n].values)
        else:
            return list(rs['match'][:n].values)
    else:
        if full:
            return ([], 0)
        else:
            return []

In [159]:
scorer=fuzz.ratio

In [160]:
def get_close_matches_Levenshtein_new(
        word,
        possibilities,
        n=1,
        cutoff=0.6,
        full=False,
        scorer=fuzz.ratio): #
    r=fwp.extract(word,possibilities,scorer=scorer,limit=n)
    
    if r[0][1]/100.>cutoff:
        if full:
            return [t[0] for t in r],[t[1]/100. for t in r]
        else:
            return [t[0] for t in r]
    else:
        if full:
            return ([], 0)
        else:
            return []        

In [163]:
def merge_with_close_matches_old(
        left,
        right,
        left_on='ST',
        right_on='UDEA_simple_título',
        left_extra_on='SO',
        right_extra_on='UDEA_nombre revista o premio',
        how='inner',
        n=1,
        cutoff=0.6,
        full=True,
        cutoff_extra=0.6):
    '''For each entry of the column: left_on of DataFrame left (cannot have empty fields),
       try to find the close match inside each row of right DataFrame, by comparing with
       the right_on entry of the row. When a row match is found, the full right row is appended
       to the matched row in the left DataFrame.
       If the similarity between the entries at left_on and right_on is less than 0.8,
       an additional check is performed between the entries left_extra_on and right_extra_on
       of the matched row.

       how implemented: inner and left (Default: inner)
    '''
    import numpy as np
    from unidecode import unidecode
    import pandas as pd
    # import sys #import globally
    # print(left[left_on][0])
    # sys.exit()
    words = left[left_on].str.lower().map(unidecode)
    possibilities = right[right_on].str.lower().map(unidecode)

    joined = pd.DataFrame()
    mi = np.array([])
    for i in left.index:
        if i % 100 == 0:
            print('.', end="")
        joined_series = left.loc[i]
        #joined_series=joined_series.append(pd.Series( {similarity_column:0} ))
        title, similarity = get_close_matches_Levenshtein(
            words[i], possibilities, n=n, cutoff=cutoff, full=full)
        # print(i,words[i],title,similarity) #cutuff 0.6 0.7 0.8 0.85 0.91 0.95
        # sys.exit()
        if title:
            mtch = right[possibilities == title[0]]
            # >=cutoff, e.g 0.65 0.95 0.81 0.86 0.9 0.96
            chk_cutoff = similarity[0]
            crosscheck = cutoff + 0.2  # 0.8 # e.g. 0.8 0.9 0.9 0.9 0.9 0.9
            if crosscheck >= 1:
                # force check if match worst than this (by experience)
                crosscheck = 0.95
            if chk_cutoff < crosscheck:  # e.g 0.65<0.8 0.95~<0.9 0.81~<0.0 0.86<0.9 0.91<~0.9 0.96~<0.9
                if get_close_matches_Levenshtein(unidecode(left[left_extra_on][i].lower()), [unidecode(
                        mtch[right_extra_on][mtch.index[0]].lower())], cutoff=cutoff_extra):  # cutoff=0.6
                    chk_cutoff = crosscheck + 0.1

            if chk_cutoff >= crosscheck:
                joined_series = joined_series.append(mtch.loc[mtch.index[0]])
                if how == 'outer':
                    mi = np.concatenate((mi, mtch.index.values))
                # joined_series[similarity_column]=similarity[0]

            #return joined_series
            if how == 'inner':
                joined = joined.append(joined_series, ignore_index=True)

        if how in ('left', 'outer'):  # the original `how == 'left' or 'outer'` was always True
            joined = joined.append(joined_series, ignore_index=True)
    if how == 'outer':
        joined = joined.append(right.drop(
            right.index[list(mi.astype(int))]).reset_index(drop=True))
    return joined

def merge_with_close_matches_new(
        left,
        right,
        left_on='ST',
        right_on='UDEA_simple_título',
        left_extra_on='SO',
        right_extra_on='UDEA_nombre revista o premio',
        how='inner',
        n=1,
        cutoff=0.6,
        full=True,
        cutoff_extra=0.7):
    '''For each entry of the column: left_on of DataFrame left (cannot have empty fields),
       try to find the close match inside each row of right DataFrame, by comparing with
       the right_on entry of the row. When a row match is found, the full right row is appended
       to the matched row in the left DataFrame.
       If the similarity between the entries at left_on and right_on is less than 0.8,
       an additional check is performed between the entries left_extra_on and right_extra_on
       of the matched row.

       how implemented: inner and left (Default: inner)
    '''
    import numpy as np
    from unidecode import unidecode
    import pandas as pd
    # import sys #import globally
    # print(left[left_on][0])
    # sys.exit()
    words = left[left_on].str.lower().map(unidecode)
    possibilities = right[right_on].str.lower().map(unidecode)

    joined = pd.DataFrame()
    mi = np.array([])
    for i in left.index:
        if i % 100 == 0:
            print('.', end="")
        joined_series = left.loc[i]
        #joined_series=joined_series.append(pd.Series( {similarity_column:0} ))
        title, similarity = get_close_matches_Levenshtein_new(
            words[i], possibilities, n=n, cutoff=cutoff, full=full)
        # print(i,words[i],title,similarity) #cutuff 0.6 0.7 0.8 0.85 0.91 0.95
        # sys.exit()
        if title:
            mtch = right[possibilities == title[0]]
            # >=cutoff, e.g 0.65 0.95 0.81 0.86 0.9 0.96
            chk_cutoff = similarity[0]
            crosscheck = cutoff + 0.2  # 0.8 # e.g. 0.8 0.9 0.9 0.9 0.9 0.9
            if crosscheck >= 1:
                # force check if match worst than this (by experience)
                crosscheck = 0.95
            if chk_cutoff < crosscheck:  # e.g 0.65<0.8 0.95~<0.9 0.81~<0.0 0.86<0.9 0.91<~0.9 0.96~<0.9
                if get_close_matches_Levenshtein_new(unidecode(left[left_extra_on][i].lower()), [unidecode(
                        mtch[right_extra_on][mtch.index[0]].lower())], cutoff=cutoff_extra):  # cutoff=0.6
                    chk_cutoff = crosscheck + 0.1

            if chk_cutoff >= crosscheck:
                joined_series = joined_series.append(mtch.loc[mtch.index[0]])
                if how == 'outer':
                    mi = np.concatenate((mi, mtch.index.values))
                # joined_series[similarity_column]=similarity[0]

            #return joined_series
            if how == 'inner':
                joined = joined.append(joined_series, ignore_index=True)

        if how in ('left', 'outer'):  # the original `how == 'left' or 'outer'` was always True
            joined = joined.append(joined_series, ignore_index=True)
    if how == 'outer':
        joined = joined.append(right.drop(
            right.index[list(mi.astype(int))]).reset_index(drop=True))
    return joined
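
The acceptance rule shared by both merge functions can be isolated as a pure function. The sketch below follows the docstring and the `crosscheck` logic above; `accept_match` is a hypothetical name, and it takes the two similarities as plain numbers between 0 and 1 instead of computing them.

```python
def accept_match(similarity, extra_similarity, cutoff=0.6, cutoff_extra=0.6):
    """Decide whether a close match is accepted.

    similarity:       score between the left_on and right_on entries
    extra_similarity: score between the left_extra_on and right_extra_on entries
    """
    crosscheck = cutoff + 0.2
    if crosscheck >= 1:
        # Same cap as in the functions above
        crosscheck = 0.95
    if similarity < cutoff:
        return False          # no candidate at all
    if similarity >= crosscheck:
        return True           # good enough on its own
    # Borderline match: require agreement on the extra column
    return extra_similarity >= cutoff_extra


print(accept_match(0.95, 0.0))   # strong title match
print(accept_match(0.70, 0.80))  # borderline title, journal agrees
print(accept_match(0.70, 0.30))  # borderline title, journal disagrees
```

Keeping the rule separate from the loop makes it easier to compare the Levenshtein-based and the `fuzzywuzzy`-based variants on the same thresholds.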

In [164]:
cib.biblio['WOS']=UDEA.sample(500).reset_index(drop=True).copy().fillna('')
cib.biblio['SCI']=SIU[0:100].copy().fillna('')

In [165]:
tmp=drive_files.load_biblio('produccion_fecha_vig_2003_2018.xlsx',prefix='UDEA')

In [166]:
drive_files.biblio['UDEA']

Unnamed: 0,UDEA_cedula,UDEA_nombre,UDEA_tipo mat,UDEA_descrip tipo mat,UDEA_tipo concepto,UDEA_tipo mov,UDEA_puntos,UDEA_nro autores,UDEA_año realiz,UDEA_fecha vig,UDEA_fecha aplica,UDEA_cod prod,UDEA_título,UDEA_cod rev-prem,UDEA_nombre revista o premio,UDEA_issn rev,UDEA_nro acta,UDEA_pais prod,UDEA_idioma,Tipo
0,10002787,URREA DUQUE JUAN PABLO,ARTREVC,Articulo en revista Tipo C,SA,NO,3.0,2,2008,2009-02-01,2009-03-01,32497,"""TECNICA NO PARAMETRICA PARA LA DETECCION DE EVENTOS DE ATENUACION EN FIBRA OPTICA"" (NON-PARAMETRIC TECHNIQUEFOR DETECTING ATTENUATION EVENTS IN A FEBER OPTIC CABLE)",1852.0,SCIENTIA ET TECHNICA,0122-1701,261,46,ESPA?,UDEA
1,10002787,URREA DUQUE JUAN PABLO,ARTREVC,Articulo en revista Tipo C,SA,NO,3.0,2,2004,2009-02-01,2009-03-01,32496,"""IMPLEMENTACION DE LA TRANSFORMADA DE HOUGH PARA LA DETECCION DE LINEAS PARA UN SISTEMA DE VISION DE BAJO NIVEL""",1852.0,SCIENTIA ET TECHNICA,0122-1701,261,46,ESPA?,UDEA
2,10002787,URREA DUQUE JUAN PABLO,DIRTGRAMAE,Direccion de trabajo de grado de maestria,BO,NO,36.0,1,2018,2018-09-01,2018-09-01,61060,ANALISIS DE PROTOCOLOS DE ENRUTAMIENTO PARA LA TRANSMISION DE VIDEO SOBRE UNA RED INALAMBRICA MULTISALTO DEFINIDA POR SOFTWARE,0.0,,,461,46,ESPA?,UDEA
3,10002787,URREA DUQUE JUAN PABLO,ARTREVA2,Articulo en revista Tipo A2,SA,NO,12.0,2,2015,2017-03-01,2017-03-01,53760,THROUGHPUT ANALYSIS OF P2P VIDEO STREAMING ON SINGLE-HOP WIRELESS NETWORKS.,3377.0,IEEE AMERICA LATINA,1548-0992,417,74,ESPA?,UDEA
4,10002787,URREA DUQUE JUAN PABLO,PONENINTER,Ponencia en extenso en evento internacional,BO,NO,84.0,2,2015,2016-03-01,2016-03-01,53303,STATISTICAL PERFORMANCE EVALUATION OF P2P VIDEO STREAMING ON MULTI-HOP WIRELESS NETWORKS.,0.0,,,413,46,INGLE,UDEA
5,10002787,URREA DUQUE JUAN PABLO,PONENINTER,Ponencia en extenso en evento internacional,BO,NO,84.0,2,2018,2018-09-01,2018-09-01,61043,THROUGHPUT AND DELAY EVALUATION FRAMEWORK INTEGRATING SDN AND IEEE 802.11S WMN,0.0,,,461,105,INGLE,UDEA
6,10002787,URREA DUQUE JUAN PABLO,PONENACION,Ponencia en extenso en evento nacional,BO,NO,48.0,2,2012,2013-03-01,2013-09-01,43904,"""QUALITY ASSESSMENT FOR VIDO STREAMING P2P APPLICATION OVER WIRELESS MESH NETWORK""",0.0,,,349,46,INGLE,UDEA
7,10002787,URREA DUQUE JUAN PABLO,PONENINTER,Ponencia en extenso en evento internacional,BO,NO,84.0,2,2015,2016-03-01,2016-03-01,53304,THROUGHPUT ANALYSIS OF P2P VIDEO STREAMING ON SINGLE-HOP WIRELESS NETWORKS.,0.0,,,413,46,INGLE,UDEA
8,10003895,GONZALEZ MARULANDA EDWIN ROLANDO,RESULATINO,Resumen de ponencia en evento latinoamerican o de 1 continet,BO,NO,20.0,3,2007,2007-09-01,2007-09-01,28941,"""FACTORES ASOCIADOS AL RETRASO DEL CRECIMIENTO EN NINOS MENORES DE 11 ANOS. ANTIOQUIA, COLOMBIA""",0.0,,,226,26,ESPA?,UDEA
9,10003895,GONZALEZ MARULANDA EDWIN ROLANDO,PONINTINT,Ponencia de extension intermedia en evento internacional (),BO,NO,62.0,1,2010,2012-03-01,2012-03-01,39481,"""HEALTH SERVICES PRIVATIZATION AND PREVENTIVE HEALTH PROGRAMMES""",0.0,,,320,217,INGLE,UDEA


In [170]:
%time kkold=merge_with_close_matches_old(cib.biblio['WOS'],cib.biblio['SCI'].drop('Tipo',axis='columns'),left_on='TI',right_on='SCI_TI',right_extra_on='SCI_SO',how='left')


.....CPU times: user 20 s, sys: 12 ms, total: 20 s
Wall time: 20 s


In [171]:
%time kknew=merge_with_close_matches_new(cib.biblio['WOS'],cib.biblio['SCI'].drop('Tipo',axis='columns'),left_on='TI',right_on='SCI_TI',right_extra_on='SCI_SO',how='left')


.....CPU times: user 20 s, sys: 12 ms, total: 20 s
Wall time: 20 s


In [172]:
kkold.shape,kknew.shape

((500, 181), (500, 181))

In [173]:
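import numpy as np  # `pd.np` is deprecated since pandas 1.0
# Count the rows with a non-empty SCI match in each result
(kkold['SCI_TI'].apply(lambda s: s if s else np.nan).dropna().shape,
 kknew['SCI_TI'].apply(lambda s: s if s else np.nan).dropna().shape)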

((155,), (155,))

In [174]:
# dropna() keeps unmatched rows here: their SCI_TI is an empty string, not NaN
kknew[['TI','SCI_TI']].dropna()

Unnamed: 0,TI,SCI_TI
0,"High initial multidrug-resistant tuberculosis rate in Buenaventura, Colombia: a public-private initiative",
1,4-Phenylphenalenones as a template for new photodynamic compounds against Mycosphaerella fijiensis,
2,Tratamiento conservador en pacientes con retinoblastoma bilateral,Tratamiento conservador en pacientes con retinoblastoma bilateral
3,Microsolvation of NO3-: structural exploration and bonding analysis,
4,Short communication: Increased expression of secretory leukocyte protease inhibitor in oral mucosa of Colombian HIV type 1-exposed seronegative individuals.,
5,"Validez y confiabilidad del 'Cuestionario de calidad de vida KIDSCREEN-27' versión padres, en Medellín, Colombia","Validez y confiabilidad del 'Cuestionario de calidad de vida KIDSCREEN-27' versión padres, en Medellín, Colombia"
6,Prevalencia de neurocisticercosis en individuos afectados de epilepsia,
7,Alternative Oxidase Mediates Pathogen Resistance in Paracoccidioides brasiliensis Infection,
8,"Efficacy and safety of two whole IgG polyvalent antivenoms, refined by caprylic acid fractionation with or without beta-propiolactone, in the treatment of Bothrops asper bites in Colombia",
9,Effect of a high-intensity intermittent exercise session on concentrations of circulating musclin in adults with metabolic syndrome and insulin resistance [Efecto de una sesión de ejercicio interm...,


In [175]:
# dropna() keeps unmatched rows here: their SCI_TI is an empty string, not NaN
kkold[['TI','SCI_TI']].dropna()

Unnamed: 0,TI,SCI_TI
0,"High initial multidrug-resistant tuberculosis rate in Buenaventura, Colombia: a public-private initiative",
1,4-Phenylphenalenones as a template for new photodynamic compounds against Mycosphaerella fijiensis,
2,Tratamiento conservador en pacientes con retinoblastoma bilateral,Tratamiento conservador en pacientes con retinoblastoma bilateral
3,Microsolvation of NO3-: structural exploration and bonding analysis,
4,Short communication: Increased expression of secretory leukocyte protease inhibitor in oral mucosa of Colombian HIV type 1-exposed seronegative individuals.,
5,"Validez y confiabilidad del 'Cuestionario de calidad de vida KIDSCREEN-27' versión padres, en Medellín, Colombia","Validez y confiabilidad del 'Cuestionario de calidad de vida KIDSCREEN-27' versión padres, en Medellín, Colombia"
6,Prevalencia de neurocisticercosis en individuos afectados de epilepsia,
7,Alternative Oxidase Mediates Pathogen Resistance in Paracoccidioides brasiliensis Infection,
8,"Efficacy and safety of two whole IgG polyvalent antivenoms, refined by caprylic acid fractionation with or without beta-propiolactone, in the treatment of Bothrops asper bites in Colombia",
9,Effect of a high-intensity intermittent exercise session on concentrations of circulating musclin in adults with metabolic syndrome and insulin resistance [Efecto de una sesión de ejercicio interm...,


In [176]:
import time 

In [177]:
s=time.time()
kkold=merge_with_close_matches_old(cib.biblio['WOS'][['TI','SO']],drive_files.biblio['UDEA'][['UDEA_título',
                                                            'UDEA_nombre revista o premio']],
                            left_on='TI',left_extra_on='SO',right_on='UDEA_título',
                            right_extra_on='UDEA_nombre revista o premio',how='left')
print(time.time()-s)

.....505.96520495414734


In [178]:
s=time.time()
kknew=merge_with_close_matches_new(cib.biblio['WOS'][['TI','SO']],drive_files.biblio['UDEA'][['UDEA_título',
                                                            'UDEA_nombre revista o premio']],
                            left_on='TI',left_extra_on='SO',right_on='UDEA_título',
                            right_extra_on='UDEA_nombre revista o premio',how='left')
print(time.time()-s)

.....781.4975967407227
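
The two timing cells above repeat the same start/stop pattern by hand. A small context manager can collect both measurements in one dictionary for comparison. This is only a sketch: the names `timed` and `timings` are hypothetical, and the loop stands in for a call to `merge_with_close_matches_old` or `merge_with_close_matches_new`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the `with` block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the timed block raises
        results[label] = time.perf_counter() - start


timings = {}
with timed('demo', timings):
    total = sum(range(100_000))  # stand-in for a merge_with_close_matches_* call

print(timings)
```

`time.perf_counter()` is a monotonic clock intended for measuring intervals, so it is not affected by system clock adjustments during a run that lasts several minutes.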


In [179]:
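import numpy as np  # `pd.np` is deprecated since pandas 1.0
# Count the rows with a non-empty UDEA match in each result
(kkold['UDEA_título'].apply(lambda s: s if s else np.nan).dropna().shape,
 kknew['UDEA_título'].apply(lambda s: s if s else np.nan).dropna().shape)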

((291,), (286,))
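
The counts above differ (291 matches for the old function, 286 for the new one), so it can help to list the rows where the two results disagree. The following is a minimal sketch under two assumptions: both results keep the left rows in the same order, and an unmatched row holds `''`, `None` or NaN. The helper names `is_empty` and `differing_rows` are hypothetical, and the sample lists are made-up stand-ins for the two title columns.

```python
def is_empty(value):
    """Treat None, '' and NaN (the only value not equal to itself) as 'no match'."""
    return value is None or value == '' or value != value


def differing_rows(old_values, new_values):
    """Return (position, old, new) for every row where the two results disagree."""
    diffs = []
    for pos, (old, new) in enumerate(zip(old_values, new_values)):
        old = '' if is_empty(old) else old
        new = '' if is_empty(new) else new
        if old != new:
            diffs.append((pos, old, new))
    return diffs


# Made-up sample values, standing in for the two matched-title columns
old = ['Title A', '', 'Title C', float('nan')]
new = ['Title A', 'Title B', '', '']
result = differing_rows(old, new)
print(result)
```

Because iterating over a pandas Series yields its values, the same helper could be called as `differing_rows(kkold['UDEA_título'], kknew['UDEA_título'])`, provided the row-order assumption holds.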