[View in Colaboratory](https://colab.research.google.com/github/restrepo/medicion/blob/master/cienciometria/merge.ipynb)

# WOSPlus
Manage Web of Science txt files and merge with other bibliographic datasets.

In [0]:
!pip install wosplus openpyxl xlrd > /dev/null

In [0]:
import wosplus as wp
import pandas as pd

##  Load full Google Drive, including your Team drives:

In [30]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


Access your Google Drive files from the sidebar, in the tab "Files → REFRESH".


##  Build or load WOS database

###  Configure public links of  files in Google Drive
* If the file is a Google Spreadsheet, it is downloaded as CSV
* If it is an Excel or text file, it is downloaded directly
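
As a rough illustration of the two cases above, the sketch below builds a download URL from a public file ID. The helper name `drive_download_url` and its `is_spreadsheet` flag are hypothetical, and the URL patterns are the commonly used public Google Drive and Google Sheets export endpoints. How `wosplus` builds its links internally is an assumption here and may differ.

```python
def drive_download_url(file_id, is_spreadsheet=False):
    """Return a download URL for a publicly shared file (hypothetical helper)."""
    if is_spreadsheet:
        # Google Spreadsheets are exported as CSV
        return "https://docs.google.com/spreadsheets/d/{}/export?format=csv".format(file_id)
    # Excel and text files are downloaded unchanged
    return "https://drive.google.com/uc?export=download&id={}".format(file_id)


xlsx_url = drive_download_url("1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t")
sheet_url = drive_download_url("1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t", is_spreadsheet=True)
print(xlsx_url)
print(sheet_url)
```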

To define your own labeled IDs for public Google Drive files, edit the next cell:

In [31]:
%%writefile drive.cfg
[FILES]
Sample_WOS.xlsx  = 0BxoOXsn2EUNIMldPUFlwNkdLOTQ
Sample_WOS.txt   = 12CtQ_SI2OHrvj_etKpqriGsGoVvv9zkL
savedrecs01.txt  = 1snzdsa-RLwYIf8MUffauaD2ZjNr1U2Os
UDEA_WOS.xlsx    = 1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t
UDEA_SCP.xlsx    = 1ulCsFHzDiTmuL9TH8F58ulh0u8Z2ylKh

Writing drive.cfg
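
To see how a file in this format maps file names to IDs, here is a minimal sketch using the standard-library `configparser`. It parses a shortened copy of the configuration from a string instead of reading `drive.cfg`, and it is only an illustration: whether `wosplus` parses the file this way is an assumption. Note that `configparser` lowercases keys by default, so `optionxform` is overridden to keep the file names as written.

```python
import configparser

# Shortened copy of the [FILES] section written above
cfg_text = """
[FILES]
Sample_WOS.xlsx  = 0BxoOXsn2EUNIMldPUFlwNkdLOTQ
UDEA_WOS.xlsx    = 1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep the original case of the file-name keys
parser.read_string(cfg_text)

# Plain dict: file name -> public Google Drive id
drive_ids = dict(parser["FILES"])
print(drive_ids["UDEA_WOS.xlsx"])
```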


In [0]:
WOS=wp.wosplus('drive.cfg')

Public `id`s for Google Drive files are labeled with the file names given in `drive.cfg`. To access them, use the corresponding key in `WOS.drive_file`, e.g.:

In [41]:
list( WOS.drive_file.keys() )

['Sample_WOS.xlsx',
 'Sample_WOS.txt',
 'savedrecs01.txt',
 'UDEA_WOS.xlsx',
 'UDEA_SCP.xlsx']

In [34]:
WOS.drive_file.get('UDEA_WOS.xlsx')

'1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t'

or the more error-prone version, which raises a `KeyError` if the key is missing:

In [0]:
WOS.drive_file['UDEA_WOS.xlsx']

'1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t'
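
The difference between the two access styles shows up only when a key is absent. The sketch below uses a plain `dict` as a stand-in for `WOS.drive_file` (assuming it behaves like a dictionary, as the `.keys()` and `.get()` calls above suggest); the key `Missing.xlsx` is made up for the example.

```python
# Stand-in for WOS.drive_file
drive_file = {"UDEA_WOS.xlsx": "1px2IcrjCrkyu7t78Q7PAE5nzV_yuPt9t"}

# .get() returns None (or a chosen default) for an unknown key
missing_id = drive_file.get("Missing.xlsx")
fallback_id = drive_file.get("Missing.xlsx", "no-id")

# Square brackets raise KeyError for an unknown key
try:
    drive_file["Missing.xlsx"]
    raised = False
except KeyError:
    raised = True

print(missing_id, fallback_id, raised)
```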

### (Re)Build WOS database from WOS txt files (500 records each)

In [0]:
REBUILD_WOS_TXT=False

In [0]:
if REBUILD_WOS_TXT:
    home='drive/Team Drives/Medición Capacidades Vinculación UDEA/06. Cienciometría/1. Descargas/1. Web of Science/1. UDEA/'
    # savedrecs.txt followed by savedrecs (1).txt ... savedrecs (17).txt
    filetxt=["{}savedrecs.txt".format(home)]+["{}savedrecs ({}).txt".format(home,i) for i in range(1,18)]

In [0]:
if REBUILD_WOS_TXT:
    print(filetxt[12])

drive/Team Drives/Medición Capacidades Vinculación UDEA/06. Cienciometría/1. Descargas/1. Web of Science/1. UDEA/savedrecs (12).txt


### Load WOS database

In [0]:
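if REBUILD_WOS_TXT:
    frames=[]
    for f in filetxt:
        WOS.load_biblio(f)          # the loaded records are stored in WOS.WOS
        frames.append(WOS.WOS)
    # DataFrame.append was removed in pandas 2.0: concatenate all the pieces at once
    UDEA_WOS=pd.concat(frames)
    UDEA_WOS.to_excel('UDEA_WOS.xlsx',index=False)
else:
    tmp=WOS.load_biblio('UDEA_WOS.xlsx')
    UDEA_WOS=WOS.WOS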

Some checks:

In [0]:
WOS.WOS.shape

(8625, 64)

In [0]:
UDEA_WOS[UDEA_WOS.AU.str.contains('Restrepo, D')].shape

(44, 64)

In [0]:
UDEA_WOS[UDEA_WOS.AB!=''].shape

(7402, 64)
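
The same kind of checks can be tried on a toy table. This is a minimal sketch with made-up rows, not the WOS data. It assumes that missing values could appear as `None`/`NaN` instead of empty strings, in which case `na=False` and `fillna('')` keep the filters well defined; whether `load_biblio` already fills missing fields with `''` is not confirmed here.

```python
import pandas as pd

# Made-up records using the WOS column names AU (authors) and AB (abstract)
toy = pd.DataFrame({
    "AU": ["Restrepo, D\nZapata, O", "Perez, A", None],
    "AB": ["An abstract", "", None],
})

# Rows with a given author; na=False treats missing author fields as "no match"
by_author = toy[toy["AU"].str.contains("Restrepo, D", na=False)]

# Rows with a non-empty abstract; missing values are treated as empty
with_abstract = toy[toy["AB"].fillna("") != ""]

print(by_author.shape, with_abstract.shape)
```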

## SCP database

In [0]:
REBUILD_SCP_CSV=False #must be local files not in drive.cfg

In [0]:
if REBUILD_SCP_CSV:          
    home='drive/Team Drives/Medición Capacidades Vinculación UDEA/06. Cienciometría/1. Descargas/3. Scopus/1. UDEA/'
    filecsv=["{}UDEA1.csv".format(home),"{}UDEA2.csv".format(home),"{}UDEA3.csv".format(home),
             "{}UDEA4.csv".format(home),"{}UDEA5.csv".format(home),"{}UDEA6.csv".format(home),
             "{}UDEA7.csv".format(home)]

In [0]:
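if REBUILD_SCP_CSV:
    frames=[]
    for f in filecsv:
        #print(f)
        # on_bad_lines needs pandas >= 1.3 (older versions: error_bad_lines=False)
        frames.append(pd.read_csv(f,on_bad_lines='warn'))
    # DataFrame.append was removed in pandas 2.0: concatenate all the pieces at once
    UDEA_SCP=pd.concat(frames)
    UDEA_SCP.to_excel('UDEA_SCP.xlsx',index=False)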

b'Skipping line 7: expected 43 fields, saw 44\nSkipping line 42: expected 43 fields, saw 44\nSkipping line 122: expected 43 fields, saw 44\nSkipping line 161: expected 43 fields, saw 44\nSkipping line 225: expected 43 fields, saw 44\nSkipping line 238: expected 43 fields, saw 44\nSkipping line 251: expected 43 fields, saw 45\nSkipping line 252: expected 43 fields, saw 44\nSkipping line 263: expected 43 fields, saw 44\nSkipping line 290: expected 43 fields, saw 44\nSkipping line 339: expected 43 fields, saw 44\nSkipping line 382: expected 43 fields, saw 44\nSkipping line 384: expected 43 fields, saw 44\nSkipping line 386: expected 43 fields, saw 44\nSkipping line 387: expected 43 fields, saw 46\nSkipping line 388: expected 43 fields, saw 45\nSkipping line 389: expected 43 fields, saw 44\nSkipping line 391: expected 43 fields, saw 44\nSkipping line 461: expected 43 fields, saw 44\nSkipping line 1034: expected 43 fields, saw 44\nSkipping line 1192: expected 43 fields, saw 44\n'
b'Skipping
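
The `Skipping line` messages above come from rows with more fields than the header. The following sketch reproduces that situation with two small in-memory CSV texts (made-up data) and then concatenates the results. It assumes pandas 1.3 or later, where the `on_bad_lines` option of `read_csv` is available.

```python
import io

import pandas as pd

# Two small in-memory "files"; the row for author B has one field too many
csv_parts = [
    "Authors,Title,Year\nA,First,2017\nB,Second,extra,2018\n",
    "Authors,Title,Year\nC,Third,2018\n",
]

# Malformed rows are dropped instead of stopping the whole load
frames = [pd.read_csv(io.StringIO(text), on_bad_lines="skip") for text in csv_parts]
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
```

Using `on_bad_lines="warn"` instead of `"skip"` also drops the malformed rows but reports each one.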

In [48]:
tmp=WOS.load_biblio('UDEA_SCP.xlsx',prefix='SCP')

  return df.rename_axis( dict( (key,prefix+'_'+key) for key in df.columns.values) , axis=1)


In [47]:
WOS.SCP.shape

(10624, 44)

In [53]:
WOS.SCP[WOS.SCP.SCP_Authors.str.contains('Restrepo D.')].shape

(49, 44)
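
One caveat about the author filter: `str.contains` treats its pattern as a regular expression by default, so the final `.` in `'Restrepo D.'` matches any character. The sketch below, on made-up author strings, shows how `regex=False` restricts the search to the literal text. Whether this changes the count for the real SCP data is not checked here.

```python
import pandas as pd

# Made-up author strings in the Scopus "Surname I." style
authors = pd.Series([
    "Restrepo D., Zapata O.",
    "Restrepo Duque J.",
    "Perez A.",
])

# Default: the pattern is a regex, so "." matches any character
regex_hits = authors.str.contains("Restrepo D.").sum()

# Literal match: only strings containing exactly "Restrepo D."
literal_hits = authors.str.contains("Restrepo D.", regex=False).sum()

print(regex_hits, literal_hits)
```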