<a href="https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/WOS_SCI_SCP_PTJ_CTR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# WOS+SCI+SCP+PTJ+CTR

Merge the bibliographic datasets for the scientific articles of Universidad de Antioquia:
* Web of Science (WOS)
* Scielo (SCI)
* Scopus (SCP)
* Puntaje (UDEA)
* Center (CTR)

For details see [merge.ipynb in Colaboratory](https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/merge.ipynb)

Implementation:
The input is a raw or partially processed WOS-SCI-SCP database, which may already contain some UDEA entries from PTJ and Center information, with additional data about the full names of the UDEA authors.

Additionally, UDEA entries can be captured from:
1. A previous WOS-SCI-SCP-UDEA database.
2. A database with a column of full names (FULL LAST NAMES NAMES, e.g. VALDEZ GÚZMAN JUAN ALBERTO), a list of author aliases in WOS format (Lastname, Name, e.g. Valdez-Gúzman, J.A.), and a list of registered affiliations. TODO: Test
3. The database from Puntaje (UDEA).

Without PTJ? Check: https://www.kaggle.com/aminer/author-disambiguation

In [115]:
import os
VERSION='NEW'
if os.getcwd()=='/content':
    !pip install openpyxl xlrd wosplus fuzzywuzzy[speedup] swifter > /dev/null
    !wget https://raw.githubusercontent.com/restrepo/medicion/master/cienciometria/wos_sci_scp_ptj_ctr.py

**WARNING!**: `authors_WOS` and `UDEA_authors` will be deleted and rebuilt from scratch!

In [116]:
# Delete UDEA_columns and start from scratch
REBUILD=False
MERGE_WITH_TRAINED=True

## functions

In [117]:
import pandas as pd
import swifter
import wosplus as wp
pd.set_option('display.max_colwidth',200)

In [118]:
# %load wos_sci_scp_ptj_ctr.py

In [119]:
from wos_sci_scp_ptj_ctr import *

## Configure public links of files in Google Drive
* If the file is a Google Spreadsheet, it is downloaded as CSV
* If it is an Excel or text file, it is downloaded directly
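
As a rough sketch of the two cases above, a download URL for a public file could be built as follows. The helper name `drive_download_url` and its `is_spreadsheet` flag are hypothetical, and whether `wosplus` builds its URLs this way internally is an assumption; the two URL patterns are the commonly used public Google endpoints.

```python
def drive_download_url(file_id, is_spreadsheet=False):
    """Return a direct-download URL for a public Google Drive file ID."""
    if is_spreadsheet:
        # Google Spreadsheets are exported as CSV
        return 'https://docs.google.com/spreadsheets/d/{}/export?format=csv'.format(file_id)
    # Excel, JSON or text files are downloaded as stored
    return 'https://drive.google.com/uc?export=download&id={}'.format(file_id)


# 'FILE_ID' is a placeholder, not a real Drive ID
print(drive_download_url('FILE_ID', is_spreadsheet=True))
print(drive_download_url('FILE_ID'))
```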

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [120]:
%%writefile drive.cfg
[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
WOS_SCP_UDEA_SJR_SIU.xlsx=0BxoOXsn2EUNIQ3R4WDhvSzVLQ2s
Base_de_datos_investigadores_Definitiva.csv=12oalgUeKhpvzkTPBP8pXCeHTrF-KO223dy9ov9w9QKs
UDEA_authors_with_WOS_info.json=1o1eVT4JD0FMMICq_oxrTJOzWh47veBMw
produccion_fecha_vig_2003_2018.xlsx=1WbtX4K__TTLxXRjuLvqUYz9tuHCIlS5v
UDEA_WOS_SCI_SCP_PTJ.json=1OkVytKbxJwGvXZDkynkSoUDtkUOTaT4A

Overwriting drive.cfg
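
The `drive.cfg` file is a plain INI file, so its contents can be inspected with the standard library. The sketch below reads two of the entries above from a string. How `wosplus` parses the file internally is not shown in this notebook, so this is only one plausible way to read it.

```python
import configparser

# Two entries copied from the cell above
CFG_TEXT = """[FILES]
WOS_SCI_SCP_PTJ_CTR.json.gz=19E1C1kRk4I0V3uXojqko8-NEicWaPp1j
WOS_SCP_UDEA_SJR_SIU.xlsx=0BxoOXsn2EUNIQ3R4WDhvSzVLQ2s
"""

parser = configparser.ConfigParser()
# Keep file names case-sensitive (the default lowercases the keys)
parser.optionxform = str
parser.read_string(CFG_TEXT)

# Map each labeled file name to its Google Drive ID
file_ids = dict(parser['FILES'])
print(file_ids['WOS_SCP_UDEA_SJR_SIU.xlsx'])
```

Without the `optionxform` line, a lookup by the original mixed-case file name would fail, because `configparser` stores keys in lowercase by default.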


##  Load data bases

In [121]:
affil='Univ Antioquia'
drive_files=wp.wosplus('drive.cfg')

In [122]:
#if REBUILD:
#    !rm WOS_SCI_SCP_PTJ_CTR.json.gz

In [123]:
RECOVER=True #False for test purposes
UDEAjsonfile='WOS_SCI_SCP_PTJ_CTR.json.gz'
#Test purposes
#UDEAjsonfile='UDEA_WOS_SCI_SCP_PTJ.json'
if RECOVER:
    #Requires the latest wosplus!
    tmp=drive_files.load_biblio(UDEAjsonfile,compression='gzip')# TODO CHANGE FOR LAST VERSION IN GOOGLE DRIVE
else:
    tmp=drive_files.load_biblio('UDEAtmp.json')
    #drive_files.load_biblio(
    #  'https://raw.githubusercontent.com/restrepo/medicion/master/cienciometria/data/UDEAtmp300.json'
    #    )#Test: 199+1=200 found
    
UDEA=drive_files.biblio['WOS'].reset_index(drop=True)
#DEBUG
#UDEA=UDEA.sample(300,replace=True).reset_index(drop=True) #Test: 77 found
#tmp=drive_files.load_biblio('Sample_WOS.xlsx')



In [124]:
#TODO: In a future version these columns will only be updated (for now `UDEA_authors` is rebuilt)
if 'UDEA_authors' in UDEA:
    UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l: l if type(l)==list else None)
else:
    UDEA['UDEA_authors']=None
if 'authors_WOS' in UDEA:
    UDEA['authors_WOS']=UDEA['authors_WOS'].apply(lambda l: l if type(l)==list else [])
else:
    UDEA['authors_WOS']=None

In [125]:
check_quality(UDEA)

authors_WOS
14601
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239
UDEA_authors
12324
UDEA_authors → full_names (Extrapolado puntaje)
12324
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
8273


In [126]:
if REBUILD:
    UDEA=clean_institutional_columns(UDEA,prefix='UDEA',Tipo='Tipo')
    UDEA['UDEA_authors']=None

In [127]:
for t in UDEA.Tipo.unique():
    print( '{}:{}'.format( t, UDEA[ UDEA.Tipo==t].shape[0] ) )

WOS_SCP:5820
WOS:1884
WOS_SCI_SCP:768
SCI_SCP:1616
SCP:2573
WOS_SCI:147
SCI:2892
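
The counts above can be cross-checked against the total number of rows and regrouped per source. The sketch assumes that each `Tipo` label is an underscore-joined list of the sources in which the article was found; the numbers are copied from the output above.

```python
from collections import Counter

# Counts per `Tipo`, copied from the output above
tipo_counts = {
    'WOS_SCP': 5820, 'WOS': 1884, 'WOS_SCI_SCP': 768, 'SCI_SCP': 1616,
    'SCP': 2573, 'WOS_SCI': 147, 'SCI': 2892,
}

per_source = Counter()
for tipo, n in tipo_counts.items():
    for source in tipo.split('_'):  # e.g. 'WOS_SCI_SCP' -> WOS, SCI, SCP
        per_source[source] += n

total = sum(tipo_counts.values())
print(total)             # total number of articles
print(dict(per_source))  # articles indexed by each source
```

The total is 15700, and under the stated assumption 8619 articles are indexed in WOS, 5423 in SCI and 10777 in SCP.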


In [128]:
UDEA.shape

(15700, 153)

## Load trained old data 

### Merge WOS_SCP_SCI with trained data set PTJ_CTR

The merge requires splitting the entries into those matched by DOI (`DI`) and those matched by title (`TI`).

If `REBUILD` is `False`, only the entries with empty UDEA columns are searched for.
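
A minimal sketch of such a two-step match, in plain Python. It is not the implementation of `fill_trained_data`: the function names and the toy records are hypothetical, and the title normalization is an assumption.

```python
def normalize_title(title):
    """Lowercase and collapse whitespace so titles compare robustly."""
    return ' '.join(title.lower().split())


def match_trained(entries, trained):
    """Attach trained UDEA data to entries: first by DOI, then by title."""
    by_doi = {t['DI']: t for t in trained if t.get('DI')}
    by_title = {normalize_title(t['TI']): t for t in trained if t.get('TI')}
    merged = []
    for e in entries:
        hit = by_doi.get(e['DI']) if e.get('DI') else None
        if hit is None and e.get('TI'):
            hit = by_title.get(normalize_title(e['TI']))
        row = dict(e)
        # Only fill when the UDEA column is still empty (the REBUILD=False case)
        if hit is not None and not row.get('UDEA_autores'):
            row['UDEA_autores'] = hit['UDEA_autores']
        merged.append(row)
    return merged


# Made-up records, for illustration only
entries = [{'DI': '10.1000/a', 'TI': 'Paper A'},
           {'DI': None, 'TI': 'Paper  B'},
           {'DI': None, 'TI': 'Paper C'}]
trained = [{'DI': '10.1000/a', 'TI': 'Paper A', 'UDEA_autores': 'AUTHOR ONE'},
           {'DI': None, 'TI': 'paper b', 'UDEA_autores': 'AUTHOR TWO'}]
result = match_trained(entries, trained)
print(result)
```

The first entry is matched by DOI, the second by its normalized title, and the third stays without UDEA data.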

In [129]:
SIU=drive_files.read_drive_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')

In [130]:
if MERGE_WITH_TRAINED:
    if os.path.exists('WOS_SCP_UDEA_SJR_SIU.xlsx'):
        SIU=pd.read_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')
    else:    
        SIU=drive_files.read_drive_excel('WOS_SCP_UDEA_SJR_SIU.xlsx')
        
    UDEA,SIU=fill_trained_data(UDEA,SIU)#TODO: Remove SIU


15700 (15700, 152)
(7072, 168) (8628, 168)


In [131]:
if MERGE_WITH_TRAINED:
    UDEA.to_json('UDEAtmp.json')
    RECOVER=False
    if RECOVER:
        UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [132]:
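import numpy as np  # `pd.np` is deprecated and was removed in pandas 2.0
# Replace empty strings in 'UDEA_autores' with NaN
if 'UDEA_autores' in UDEA.columns and UDEA[UDEA['UDEA_autores']==''].shape[0]:
    UDEA['UDEA_autores']=UDEA['UDEA_autores'].apply(lambda s: np.nan if type(s)==str and s=='' else s)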

In [133]:
if 'UDEA_autores' in UDEA.columns:
    print(UDEA[UDEA['UDEA_autores']==''].shape[0],UDEA['UDEA_autores'].dropna().shape[0])

0 7072


In [134]:
check_quality(UDEA)

authors_WOS
14601
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
0


# Puntaje

Try to fill the missing UDEA columns with information from an updated Puntaje database.

In [135]:
qq=UDEA.copy()

In [136]:
drive_files.biblio['WOS']=qq
drive_files.biblio['WOS'].shape

(15700, 168)

In [137]:
tmp=drive_files.load_biblio('produccion_fecha_vig_2003_2018.xlsx',prefix='UDEA')

In [138]:
pp= drive_files.biblio['UDEA'].copy()

In [139]:
drive_files.biblio['UDEA']=pp

In [140]:
df=merge_puntaje(drive_files)

(32581, 24)
va1 0 0
........................................................................

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


7258 : 5806 + 2827 = 8633
va2 0 5806
.........................................................


7258 : 5453 + 353 = 5806
va3 0 5453
.......................................................


va4 0 5389
7258 : 5389 + 64 = 5453
(3239, 174) + (5389, 152) = 8628


In [141]:
#TODO: Check why not zero
if 'UDEA_autores' in df.columns:
    print(0,'=',df[df['UDEA_autores']==''].shape[0],'; found:',df['UDEA_autores'].dropna().shape[0])

0 = 0 ; found: 10311


In [142]:
#df['UDEA_autores'].apply(lambda s: pd.np.nan if type(s)==str and s=='' else s).dropna().shape

In [143]:
UDEA=df.copy()

In [144]:
UDEA.shape

(15700, 180)

In [145]:
check_quality(UDEA)

authors_WOS
14601
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239


## Create or update and fill the `'authors_WOS'` json column
A list of dictionaries with author and affiliation from the WOS `C1` column.

Update only: the list is kept if `'authors_WOS'` is already filled.

### Fill C1 for non-WOS entries in WOS format and extract the affiliation from C1

Updated only if not already filled.

In [146]:
#Fill from SCI_C1
UDEA['C1']=SCI_C1_to_C1(UDEA)

In [147]:
#Fill from SCP_C1='SCP_Authors with affiliations'
UDEA['C1']=SCP_Authors_with_affiliations_to_C1(UDEA)

In [148]:
UDEA[UDEA['C1'].isnull()].shape

(0, 180)

In [149]:
UDEA[UDEA.Tipo=='WOS'].reset_index(drop=True).C1.loc[0]

'[Torres, Ricardo A.; Petrier, Christian; Combet, Evelyne] Polytech Savoie Univ Savoie, Lab Chim Mol & Environm, F-73376 Le Bourget Du Lac, France.\n[Torres, Ricardo A.] Univ Antioquia, Fac Ciencias Exactas & Nat, Inst Quim, Grp Electroquim, Medellin 1226, Colombia.\n[Carrier, Marion] Univ Lyon 1, CNRS, Inst Rech Catalyse & Environm Lyon, UM 5256, F-69626 Villeurbanne, France.\n[Pulgarin, Cesar] Ecole Polytech Fed Lausanne, GGEC, Inst Chem Sci & Engn, CH-1015 Lausanne, Switzerland.\n'
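
To illustrate how a `C1` string like the one above maps to a list of author dictionaries, here is a minimal parser. It is only a sketch and not the `get_author_info` function used below; the name `parse_c1` is hypothetical. Lines without a bracketed author list, such as affiliation-only lines, are skipped.

```python
import re


def parse_c1(c1, affil='Univ Antioquia'):
    """Return [{'WOS_author', 'affiliation'}] for the C1 lines containing `affil`."""
    authors = []
    for line in c1.split('\n'):
        # Expected line format: '[Author 1; Author 2] Affiliation'
        m = re.match(r'\[(.*?)\]\s*(.*)', line)
        if not m or affil not in m.group(2):
            continue
        for name in m.group(1).split(';'):
            authors.append({'WOS_author': name.strip(),
                            'affiliation': [m.group(2).strip()]})
    return authors


# First two lines of the C1 entry shown above
c1 = ('[Torres, Ricardo A.; Petrier, Christian; Combet, Evelyne] Polytech Savoie Univ Savoie, '
      'Lab Chim Mol & Environm, F-73376 Le Bourget Du Lac, France.\n'
      '[Torres, Ricardo A.] Univ Antioquia, Fac Ciencias Exactas & Nat, Inst Quim, '
      'Grp Electroquim, Medellin 1226, Colombia.\n')
parsed = parse_c1(c1)
print(parsed)
```

Only the second line mentions `Univ Antioquia`, so a single author dictionary is returned.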

### Create the `'authors_WOS'` json column

In [150]:
#WARNING: some C1 WOS entries are not normalized: Missing authors
UDEA['authors_WOS']=UDEA.C1.apply(lambda x: x.split('\n') if x else x).apply(
    lambda x:   [y.replace('[','').replace('] ','; ') for y in x if y.find(affil)>-1 ] if x else x ).apply(
     lambda x: get_author_info(x) if x else x)

# Improve normalization: remove C1s with only affiliation (from Scielo)
UDEA['authors_WOS']=UDEA['authors_WOS'].apply( 
    lambda x: [d for d in x if d.get('WOS_author').find(affil)==-1] if type(x)==list else x )

In [151]:
UDEA[UDEA.Tipo=='SCP'].reset_index(drop=True).loc[0].authors_WOS

[{'WOS_author': 'Isaza, C. V.',
  'affiliation': ['Univ Antioquia, Dpto. de Ingeniería Electrónica, Grupo GEPAR, Calle 67 No 53-108 bl. 19, Medellín, Colombia'],
  'i': 0}]

## Create and fill the `'UDEA_authors'` json column
List of dictionaries with full author info, similar to the lens.org author json column
* Includes the full institutional database

### Include `'full_name'` from the UDEA database in `'UDEA_authors'`
Any old list in `'UDEA_authors'` will be deleted and created again.

TODO: Update only None entries

In [152]:
import re  # explicit import; do not rely on the star import above
#TODO: Remove from fill_trained_data(..)
if 'UDEA_autores' in UDEA.columns:
    UDEA['UDEA_autores']=UDEA['UDEA_autores'].apply(lambda s: s if s else None)    
    # Collapse repeated whitespace (raw string avoids an invalid-escape warning)
    UDEA['UDEA_autores']=UDEA['UDEA_autores'].apply(lambda s: re.sub(r'\s+',' ',s) if type(s)==str else s)
    UDEA['UDEA_authors']=UDEA['UDEA_autores'].apply(lambda s: s.split(';') if type(s)==str else s).apply(
                           lambda l: [{'full_name':y} for y in l ] if type(l)==list else l)

## Load the full official researcher database: PTJ into the `'UDEA_authors'` json column

In [None]:
AU=drive_files.read_drive_excel('Base_de_datos_investigadores_Definitiva.csv')

In [154]:
UPDATE_UDEA_authors_with_AU=True
if (UDEA['UDEA_authors'].dropna().shape[0] and 
    UPDATE_UDEA_authors_with_AU):
    kkn=UDEA.copy()
    kkn=update_institutional_authors(kkn,AU)
    print(kkn.shape,UDEA.shape)
    UDEA=kkn.copy()

0
1
2
3
4
5
6
7
8
9
10
11
(15700, 181) (15700, 181)


Quality check

In [155]:
key_contains_in_list_of_dictionaries(UDEA,'Restrepo, D',column='authors_WOS',key='WOS_author').loc[1:2]

1    [{'WOS_author': 'Restrepo, Diego', 'affiliation': ['Univ Antioquia, Inst Fis, Calle 70 52-21, Medellin, Colombia.'], 'i': 0}, {'WOS_author': 'Rivera, Andres', 'affiliation': ['Univ Antioquia, Inst...
2    [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin 1226, Colombia.'], 'i': 0}, {'WOS_author': 'Zapata, Oscar', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin ...
Name: authors_WOS, dtype: object

In [156]:
if UPDATE_UDEA_authors_with_AU:
    UDEA.to_json('UDEAtmp.json')
    RECOVER=False
    if RECOVER:
        UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [157]:
check_quality(UDEA)

authors_WOS
13645
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239
UDEA_authors
10311
UDEA_authors → full_names (Extrapolado puntaje)
10311
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
6934


## Add `'UDEA.authors_WOS'` info* to `UDEA.UDEA_authors` json column**
(\*) obtained from `UDEA.C1`

(\*\*) Obtained from [puntaje trained old UDEA data](./WOS_SCI_SCP_PTJ_GS_LNS.ipynb#Merge-with-trained-data-set) and the [official researcher list](./WOS_SCI_SCP_PTJ_GS_LNS.ipynb#Merge-with-official-researcher-list)

* Both WOS author alias and affiliation!

### Obtain name parts and initials from `'full_name'` in `UDEA_authors` dictionary and update `UDEA_authors` with them

In [158]:
UDEA['authors_WOS'].dropna().shape,UDEA[UDEA['authors_WOS'].apply(len)!=0]['authors_WOS'].shape

((15700,), (13645,))

In [159]:
UDEA[UDEA['authors_WOS'].apply(len)!=0]['authors_WOS'].shape

(13645,)

In [160]:
import sys
if 'UDEA_authors' not in UDEA.columns and REBUILD==False:
    sys.exit('Make MERGE_WITH_TRAINED True and run again')

### Get list of WOS author aliases from `'full name'`.
Example: Juan Alberto Restrepo Camargo: `['Restrepo, J.', 'Restrepo, Juan',...]`
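
A minimal sketch of how such a list of aliases could be generated from the name parts. The function name `wos_aliases` and the set of variants are assumptions, not the notebook's actual helper.

```python
def wos_aliases(nombres, primer_apellido, segundo_apellido=''):
    """Return candidate WOS-style aliases ('Lastname, Name') for one author."""
    names = nombres.split()
    initials = [n[0] + '.' for n in names]
    given_forms = {
        initials[0],          # 'J.'
        names[0],             # 'Juan'
        ' '.join(initials),   # 'J. A.'
        ''.join(initials),    # 'J.A.'
        ' '.join(names),      # 'Juan Alberto'
    }
    surnames = {primer_apellido}
    if segundo_apellido:
        # Hyphenated double surname, e.g. 'Restrepo-Camargo'
        surnames.add('{}-{}'.format(primer_apellido, segundo_apellido))
    return sorted('{}, {}'.format(s, g) for s in surnames for g in given_forms)


aliases = wos_aliases('Juan Alberto', 'Restrepo', 'Camargo')
print(aliases)
```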

Prepare list

In [161]:
# Obtain spanish name parts from full name
dictupdatetmp=UDEA['UDEA_authors'].apply(lambda x: [y.update( 
                split_full_names(y,full_name='full_name')  ) if not pd.isnull(
                y.get('full_name')) else y for y in x] 
                                   if type(x)==list 
                                   else x)

In [162]:
UDEA['authors_WOS'].dropna().shape,UDEA[UDEA['authors_WOS'].apply(len)==0]['authors_WOS'].shape

((15700,), (2055,))

In [163]:
UDEA.loc[3,'UDEA_authors']

[{'INICIALES': 'A. M.',
  'NOMBRES': 'Ana Maria',
  'PRIMER APELLIDO': 'Abreu',
  'SEGUNDO APELLIDO': 'Velez',
  'full_name': 'ABREU VELEZ ANA MARIA'},
 {'INICIALES': 'S. I.',
  'NOMBRES': 'Sergio Ivan',
  'PRIMER APELLIDO': 'Tobon',
  'SEGUNDO APELLIDO': 'Arroyave',
  'full_name': 'TOBON ARROYAVE SERGIO IVAN'}]
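
For reference, the name parts shown above can be reproduced with a naive rule: the first two tokens are the surnames and the rest are given names. This is only a sketch under that assumption, not the `split_full_names` function itself. Compound surnames and authors with a single surname would need extra handling.

```python
def split_spanish_full_name(full_name):
    """Split 'LASTNAME1 LASTNAME2 NAMES...' into parts (naive two-surname rule)."""
    tokens = full_name.title().split()
    nombres = tokens[2:]
    return {
        'PRIMER APELLIDO': tokens[0],
        'SEGUNDO APELLIDO': tokens[1] if len(tokens) > 1 else '',
        'NOMBRES': ' '.join(nombres),
        'INICIALES': ' '.join(n[0] + '.' for n in nombres),
        'full_name': full_name,
    }


parts = split_spanish_full_name('ABREU VELEZ ANA MARIA')
print(parts)
```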

### Fill empty `'authors_WOS'` entries list: `[]` by using: 
* `AU` WOS column, 
* Bad `C1` WOS column, and 
* `UDEA_authors`
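
The idea of the first and third ingredients can be sketched as follows: each `AU` line is compared with the surname and first initial of the known UDEA authors. This is not the `authors_Wos_from_bad_AU_and_bad_C1` function used below. The function name, the matching rule and the `udea_authors` record are hypothetical; the `AU` value and the `origin` label are taken from the quality-check output.

```python
def authors_from_au(au, udea_authors):
    """Pick the AU entries whose surname and first initial match a UDEA author."""
    found = []
    for entry in au.split('\n'):
        if ',' not in entry:
            continue
        surname, initials = [s.strip() for s in entry.split(',', 1)]
        for author in udea_authors:
            same_surname = author['PRIMER APELLIDO'].lower() == surname.lower()
            same_initial = author['INICIALES'][:1].lower() == initials[:1].lower()
            if same_surname and same_initial:
                # 'DA' -> 'D. A.', 'D' -> 'D.'
                alias = '{}, {}'.format(surname, '. '.join(initials) + '.')
                found.append({'WOS_author': alias, 'origin': 'from AU+UDEA_authors'})
    return found


au = 'Sierra, DA\nRestrepo, D'                                     # AU column (WOS format)
udea_authors = [{'PRIMER APELLIDO': 'Restrepo', 'INICIALES': 'D.'}]  # hypothetical record
recovered = authors_from_au(au, udea_authors)
print(recovered)
```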

In [164]:
#Replace NaN with None
# `pd.isnull` replaces `pd.np.isnan` (removed in pandas 2.0) and also accepts None
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l: None if type(l)!=list and pd.isnull(l) else l )

In [165]:
kk=UDEA.swifter.apply(lambda row: authors_Wos_from_bad_AU_and_bad_C1(row), axis=1)

HBox(children=(IntProgress(value=0, description='Pandas Apply', max=15700, style=ProgressStyle(description_wid…




In [166]:
UDEA['authors_WOS']=kk

In [167]:
UDEA[UDEA['authors_WOS'].apply(lambda l:len(l))==0].shape

(1078, 181)

### Check the list against the real WOS author aliases and pick the right one. Also capture the affiliation info

In [168]:
kk=UDEA['authors_WOS'].combine( UDEA['UDEA_authors'], func=combinewos )

In [169]:
kk.dropna().shape

(10311,)

In [170]:
UDEA['UDEA_authors']=kk

In [171]:
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l: l if type(l)==list else None)
UDEA_YES=UDEA[~UDEA['UDEA_authors'].isna()].reset_index(drop=True)
UDEA_NOT=UDEA[UDEA['UDEA_authors'].isna()].reset_index(drop=True)
UDEA_NOT['authors_WOS'].dropna().shape

(5389,)

In [172]:
UDEA.to_json('UDEAtmp.json')

Load output results of previous Cell runs

In [173]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

## Build a data base with the unique values of the `'UDEA_authors'` column

Obtain UDEA_authors DataFrame: `aunly`
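
Conceptually, this step collects one record per author from the list-of-dictionaries column. The sketch below shows the idea in plain Python. It is not the `DataFrame_authors` function: the name `unique_authors`, the merging rule and the toy rows are assumptions.

```python
def unique_authors(rows):
    """Collect one record per full_name, gathering all the WOS aliases seen."""
    table = {}
    for authors in rows:
        if not isinstance(authors, list):
            continue  # skip None / NaN rows
        for a in authors:
            rec = table.setdefault(a['full_name'],
                                   {'full_name': a['full_name'], 'WOS_author': []})
            alias = a.get('WOS_author')
            if alias and alias not in rec['WOS_author']:
                rec['WOS_author'].append(alias)
    return list(table.values())


# Made-up rows, for illustration only
rows = [
    [{'full_name': 'AUTHOR ONE', 'WOS_author': 'One, A.'}],
    None,
    [{'full_name': 'AUTHOR ONE', 'WOS_author': 'One, Author'},
     {'full_name': 'AUTHOR TWO', 'WOS_author': 'Two, A.'}],
]
authors = unique_authors(rows)
print(authors)
```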

In [174]:
aunly=DataFrame_authors(UDEA)

RABE ANA MARIA


In [175]:
if not aunly.empty:
    aunly.to_json('UDEA_authors_with_WOS_info.json')

In [176]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [177]:
UDEA.shape

(15700, 181)

In [178]:
if RECOVER:
    if os.path.exists('UDEA_authors_with_WOS_info.json' ):
        aunly=pd.read_json('UDEA_authors_with_WOS_info.json')
    else:
        aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json')

In [179]:
aunly.shape

(1342, 2)

(1273, 2)

In [180]:
check_quality(UDEA)

authors_WOS
14622
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239
UDEA_authors
10311
UDEA_authors → full_names (Extrapolado puntaje)
10311
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
6934


## Use the full list of WOS aliases and affiliations in `'UDEA_authors'` to:
### find and identify the WOS aliases of authors with full institutional info.

This means that some `None` `UDEA_authors` entries will be filled with the extrapolated information.
* The `'Tipo'` is not updated, so the new ones are not part of the PTJ-identified articles!

**NOTE** that there are already non-`None` `UDEA_authors` entries whose `'authors_WOS'` entry is an empty list: `[]`.

Matching uses exact author names and a high `lv.ratio` for the affiliation.
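
The matching criterion can be sketched as follows. `difflib.SequenceMatcher` from the standard library is used here as a stand-in for the Levenshtein ratio, so the scores are not identical to `lv.ratio`. The function name `same_author`, the `0.8` threshold and the third affiliation string are hypothetical.

```python
from difflib import SequenceMatcher


def same_author(candidate, known, min_ratio=0.8):
    """Exact alias match plus a high similarity ratio between affiliations."""
    if candidate['WOS_author'] != known['WOS_author']:
        return False
    ratio = SequenceMatcher(None, candidate['affiliation'].lower(),
                            known['affiliation'].lower()).ratio()
    return ratio >= min_ratio


known = {'WOS_author': 'Restrepo, D.',
         'affiliation': 'Univ Antioquia, Inst Fis, Medellin, Colombia.'}
close = {'WOS_author': 'Restrepo, D.',
         'affiliation': 'Univ Antioquia, Inst Fis, Medellin 1226, Colombia.'}
other_name = {'WOS_author': 'Other, A.',  # hypothetical alias
              'affiliation': 'Univ Antioquia, Inst Fis, Medellin, Colombia.'}

print(same_author(close, known), same_author(other_name, known))
```

A small difference in the affiliation, such as the extra postal code, is tolerated. A different alias is rejected even when the affiliation is identical.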

In [181]:
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l:fill_full_wos_author_info(l,aunly) )

In [182]:
if UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA.to_json('UDEAtmp.json')

In [183]:
RECOVER=False
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [184]:
UDEA.shape

(15700, 181)

In [185]:
kk=UDEA.authors_WOS.combine(UDEA.UDEA_authors,func=lambda x,y: get_UDEA_authors(x,y,aunly))

In [186]:
UDEA.UDEA_authors.dropna().shape

(10311,)

10311...(7072,)

In [187]:
UDEA['UDEA_authors']=kk

In [188]:
UDEA.UDEA_authors.dropna().shape,UDEA.shape

((10941,), (15700, 181))

((10963,), (15704, 181))

In [189]:
aunly.shape

(1342, 2)

(1461, 2)

In [190]:
if not aunly.empty:
    print(aunly.drop_duplicates('tmp_author').shape)

(1342, 2)


In [191]:
if not aunly.empty:
    aunly.to_json('UDEA_authors_with_WOS_info.json')

In [192]:
RECOVER=False
if RECOVER:
    if os.path.exists('UDEA_authors_with_WOS_info.json' ):
        aunly=pd.read_json('UDEA_authors_with_WOS_info.json')
    else:
        aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json')

In [193]:
if UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA.to_json('UDEAtmp.json')

In [194]:
if RECOVER:
    UDEA=pd.read_json('UDEAtmp.json').reset_index(drop=True)

In [195]:
UDEA.to_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip')

In [196]:
if 'UDEA_autores' in UDEA.columns:
    print(UDEA[UDEA['UDEA_autores']==''].shape[0],UDEA['UDEA_autores'].dropna().shape[0])

0 10311


In [197]:
if 'UDEA_authors' in UDEA.columns:
    print(UDEA[UDEA['UDEA_authors']==''].shape[0],UDEA['UDEA_authors'].dropna().shape[0])

0 10941


In [198]:
#print 1

In [199]:
check_quality(UDEA)

authors_WOS
14622
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239
UDEA_authors
10941
UDEA_authors → full_names (Extrapolado puntaje)
10941
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
7417


## Exactly as before: Used only if 'UDEA_authors' is empty! 

In [200]:
if RECOVER:
    UDEA=pd.read_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip').reset_index(drop=True)
    aunly=drive_files.read_drive_json('UDEA_authors_with_WOS_info.json').reset_index(drop=True)

Same function as `get_UDEA_authors`, but for use with `apply` instead of `combine`

In [202]:
if not UDEA['UDEA_authors'].dropna().shape[0]:
    UDEA['UDEA_authors']=UDEA.authors_WOS.apply(lambda l: build_institutional_authors(l,aunly) )

## Find new non-`None` `'UDEA_authors'` entries by similarity search in author info:
* names and aliases, with relaxed similarity in affiliations

### Use similarity in the full `'UDEA_authors'` converted to string
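
A minimal sketch of the idea: each record is flattened into one string and the closest known record above a cutoff is kept. It uses the standard-library `difflib.get_close_matches` rather than the fuzzy matcher called below. The helper names are hypothetical, and the `0.65` cutoff on a 0 to 1 scale is an assumed counterpart of `cutoff_extra=65`.

```python
import difflib


def to_text(record):
    """Join the values in key order so that equal records give equal strings."""
    return ' | '.join(str(record[k]) for k in sorted(record))


def best_match(query, candidates, cutoff=0.65):
    """Return the candidate whose string form is closest to `query`, or None."""
    texts = [to_text(c) for c in candidates]
    hits = difflib.get_close_matches(to_text(query), texts, n=1, cutoff=cutoff)
    return candidates[texts.index(hits[0])] if hits else None


candidates = [{'WOS_author': 'Restrepo, D.',
               'WOS_affiliation': 'Univ Antioquia, Inst Fis, Medellin, Colombia.'}]
query = {'WOS_author': 'Restrepo, D.',
         'WOS_affiliation': 'Univ Antioquia, Inst Fis, Medellin 1226, Colombia.'}
unrelated = {'WOS_author': 'Smith, J.',            # made-up record
             'WOS_affiliation': 'MIT, Cambridge, USA'}

print(best_match(query, candidates))
print(best_match(unrelated, candidates))
```

Only the values are joined, so punctuation and key names shared by every record do not inflate the similarity score.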

In [203]:
# Normalize
UDEA['authors_WOS']=UDEA['authors_WOS'].apply(lambda l: [] if not l else l)
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l: l if type(l)==list else None)

In [205]:
UDEA[~UDEA['UDEA_authors'].isna()].shape,UDEA['UDEA_authors'].dropna().shape

((10941, 181), (10941,))

In [206]:
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors','full_name']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author'].astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation'].astype(str)

In [208]:
import pandas as pd
import swifter

####  Find institutional author info with safe json column converted into string

In [209]:
%time kk=UDEA.swifter.apply(lambda r: json_fuzzy_merge_full(r,contents,cutoff_extra=65), axis=1)

HBox(children=(IntProgress(value=0, description='Pandas Apply', max=15700, style=ProgressStyle(description_wid…


CPU times: user 6min 21s, sys: 1.46 s, total: 6min 23s
Wall time: 6min 22s


In [210]:
#%time kk=dfnot['authors_WOS'].apply(lambda l: json_fuzzy_merge_full(l,contents,cutoff_extra=65) if type(l)==list else l)

In [211]:
kk.dropna().shape

(11689,)

In [212]:
UDEA['UDEA_authors']=kk

In [213]:
UDEA[~UDEA['UDEA_authors'].isna()].shape,UDEA['UDEA_authors'].dropna().shape

((11689, 181), (11689,))

### Find new `'UDEA_authors'` entries with similarity in specific keys of the json column
* Similarity in full names and WOS aliases, with relaxed similarity in affiliation but with extra information: names of the journals used by the author (WOS: `SO` column)

In [214]:
df2=aunly.copy()
df2=pd.DataFrame( list( df2['UDEA_authors'].values ) )
df2['UDEA_authors']=aunly['UDEA_authors']
contents=df2[['WOS_author','WOS_affiliation','UDEA_authors']].reset_index(drop=True)
contents['WOS_author']=contents['WOS_author']#.astype(str)
contents['WOS_affiliation']=contents['WOS_affiliation']#.astype(str)

In [215]:
UDEA['authors_WOS']=UDEA['authors_WOS'].apply(lambda l: [] if not l else l)
UDEA['UDEA_authors']=UDEA['UDEA_authors'].apply(lambda l: l if type(l)==list else None)

In [216]:
kkk=UDEA.swifter.apply(lambda row: json_fuzzy_merge(row,UDEA,contents),axis=1)

HBox(children=(IntProgress(value=0, description='Pandas Apply', max=15700, style=ProgressStyle(description_wid…




In [217]:
UDEA['UDEA_authors']=kkk

In [218]:
UDEA[~UDEA['UDEA_authors'].isna()].shape,UDEA['UDEA_authors'].dropna().shape

((12322, 181), (12322,))

(12322,)

In [219]:
UDEA.to_json('WOS_SCI_SCP_PTJ_CTR.json.gz',compression='gzip')

In [220]:
check_quality(UDEA)

authors_WOS
14622
Leptonic charged Higgs decays in the Zee model ; authors_WOS: [{'WOS_author': 'Restrepo, D.', 'affiliation': ['Univ Antioquia, Inst Fis, Medellin, Colombia.'], 'origin': 'from AU+UDEA_authors'}] ; AU: Sierra, DA
Restrepo, D

Tipo contains UDEA
3239
UDEA_authors
12322
UDEA_authors → full_names (Extrapolado puntaje)
12322
UDEA_authors → "NOMBRE COMPLETO" (Extrapolado CENTRO)
8276
