Create a variable 'word_data', a Python list holding the text data for every article in your dataset.  It is ordered to match the doi_ls variable used elsewhere.

The script produces three outputs:
- Dataframe of DOI, WoS word_data, pub_yr, document_type
- doi_ls (identical to the DOI column of the above dataframe)
- word_data, i.e. the text data for each DOI in the list.
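
Because doi_ls and word_data are pickled as two separate lists, a later step has to rely on their shared ordering. The following is a minimal sketch of saving and reloading such a pair. The helper names `save_outputs` and `load_outputs`, the file names and the DOIs are hypothetical; the real paths come from config.py.

```python
import os
import pickle
import tempfile

def save_outputs(doi_ls, word_data, dois_path, words_path):
    """Pickle both lists, refusing to write if their lengths differ."""
    if len(doi_ls) != len(word_data):
        raise ValueError('doi_ls and word_data must have the same length')
    with open(dois_path, 'wb') as f:
        pickle.dump(doi_ls, f)
    with open(words_path, 'wb') as f:
        pickle.dump(word_data, f)

def load_outputs(dois_path, words_path):
    """Load both lists back and pair each DOI with its text by position."""
    with open(dois_path, 'rb') as f:
        doi_ls = pickle.load(f)
    with open(words_path, 'rb') as f:
        word_data = pickle.load(f)
    return list(zip(doi_ls, word_data))

# Toy data; the DOIs and file names are hypothetical.
tmp_dir = tempfile.mkdtemp()
dois_path = os.path.join(tmp_dir, 'dois.pkl')
words_path = os.path.join(tmp_dir, 'word_data.pkl')
save_outputs(['10.1000/a', '10.1000/b'], ['text a', 'text b'], dois_path, words_path)
pairs = load_outputs(dois_path, words_path)
```

Checking the lengths before writing is a cheap guard against the two lists drifting out of step.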

In [43]:
print('------------------------------------------------------')
print('Step 0:  Read in data from wos folder and build word_data')
from datetime import datetime as dt
t_start = dt.now()
print(t_start)
print('------------------------------------------------------')

------------------------------------------------------
Step 0:  Read in data from wos folder and build word_data
2018-02-21 15:55:02.004373
------------------------------------------------------


In [44]:
# imports 
import os
import csv
import pandas as pd
import numpy as np
import json
import pickle
from config import Config as c
# cr_db_p = c.cr_db_p
years = c.years
word_datapath = c.word_datapath
dois_p = c.dois_pkl
wos_word_data_p = c.wos_word_data_p

In [45]:
# ensure directories all exist:
input_dir = 'wos'
if not os.path.exists(input_dir):
    # Python 3 only allows raising exception instances, not bare strings.
    raise FileNotFoundError("There should be a folder called 'wos' with Web of Science tsv files in it")

dirs = ['data','outputs']
for d in dirs:
    if not os.path.exists(d):
        os.makedirs(d)


### Load WoS word_data
This is DOIs and word data from titles, abstracts and keywords on Web of Science.

Note that this is the source of our DOI list, which is the basis of all data throughout this program.  For example, if we build a set of tf-idf vectors later in processing, each vector will represent an article from our DOI list, and the vectors will have the same ordering as the DOI list.  This way, we always know which article an individual piece of data pertains to.
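
As a small illustration of this positional convention, the sketch below pairs a toy DOI list with toy vectors. The DOIs, the vector values and the helper `vector_for` are invented for the example and are not part of the script.

```python
# Hypothetical DOIs and toy stand-ins for tf-idf vectors, in the same order.
doi_ls = ['10.1000/a', '10.1000/b', '10.1000/c']
vectors = [[0.1, 0.0], [0.0, 0.7], [0.3, 0.3]]

def vector_for(doi, doi_ls, vectors):
    """Return the vector stored at the same position as doi in doi_ls."""
    return vectors[doi_ls.index(doi)]

# Pairing by position recovers which article each vector describes.
doi_to_vector = dict(zip(doi_ls, vectors))
```

The dictionary form is convenient for repeated lookups, while the two parallel lists remain the canonical ordering.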


In [46]:
def build_wos_data():
    filepaths = [os.path.join(os.path.abspath('wos'),p) for p in os.listdir('wos')]
    data = pd.DataFrame({})
    print('Reading files from WoS folder into dataframe...')
    for filepath in filepaths:
        df = pd.read_csv(filepath, 
                         sep='\t', 
                         quotechar='"',
                         quoting=csv.QUOTE_NONE,
                         index_col=False, 
#                          error_bad_lines=False,
                         encoding='utf-16')
        data = pd.concat([data,df])
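    # Join the text fields row by row; map(str) converts each value rather than the whole Series.
    data['WD'] = (data.TI.map(str) + ' ... ' + data.AB.map(str) + ' ... '
                  + data.DE.map(str) + ' ... ' + data.ID.map(str))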
    return data # [['DI','SO','SN','EI','PY','DT','WD']]
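
When several text columns are joined into a single field such as WD, the string conversion has to happen per row. The sketch below uses a toy dataframe with invented values to show the difference between `str(series)` and `series.map(str)`. Using `fillna('')` is an assumption about how missing values might be handled, not something the script does.

```python
import pandas as pd

toy = pd.DataFrame({
    'TI': ['Title one', 'Title two'],
    'AB': ['Abstract one', None],
})

# str() on a whole Series yields one string describing every row at once.
whole_series_text = str(toy.AB)

# map(str) converts each value, so the concatenation stays row-wise.
# fillna('') stops a missing abstract from turning into literal placeholder text.
row_wise = toy.TI.map(str) + ' ... ' + toy.AB.fillna('').map(str)
```

With the row-wise form, each entry of the combined column contains only the text of its own article.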
    

In [47]:
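try:
    # Reuse the cached word data if it has already been built.
    df = pd.read_csv(wos_word_data_p, index_col=0)
except FileNotFoundError:
    # No cached file yet, so build the dataframe from the raw WoS files.
    df = build_wos_data()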

Reading files from WoS folder into dataframe...


In [27]:
df[['WD','TI','AB','DE','ID']].sample(40)

Unnamed: 0,WD,TI,AB,DE,ID
429,Improvement of left ventricular mechanics afte...,Improvement of left ventricular mechanics afte...,,,
440,Automated external defibrillators should be ma...,Automated external defibrillators should be ma...,,,RESUSCITATION
168,Impact of extending device longevity on the lo...,Impact of extending device longevity on the lo...,To determine the long-term costs of extending ...,Budget; Cardiac resynchronization therapy; Cos...,CARDIAC-RESYNCHRONIZATION THERAPY; HEART-FAILU...
492,The Importance of Waist Circumference and Body...,The Importance of Waist Circumference and Body...,,Risk factors; Diabetes mellitus; Epidemiologic...,
216,Endovascular Treatment of AICA Flow Dependent ...,Endovascular Treatment of AICA Flow Dependent ...,Peripheral anterior inferior cerebellar artery...,posterior fossa aneurysms; flow dependent aneu...,INFERIOR CEREBELLAR ARTERY; DURAL ARTERIOVENOU...
284,CRANIOPLASTY EFFECT ON CEREBRAL HEMODYNAMICS I...,CRANIOPLASTY EFFECT ON CEREBRAL HEMODYNAMICS I...,,,
33,Health technology assessment on catheter ablat...,Health technology assessment on catheter ablat...,,,
306,Six-minute Walk Distance Predicts the Readmiss...,Six-minute Walk Distance Predicts the Readmiss...,,patients with chronic heart failure; Six-minut...,
308,Failed ureteric access during ureteroscopy - d...,Failed ureteric access during ureteroscopy - d...,,,
57,Targeted overexpression of endothelial nitric ...,Targeted overexpression of endothelial nitric ...,Reduced bioavailability of nitric oxide due to...,Cerebral blood flow; cerebrovascular dysfuncti...,CEREBRAL-BLOOD-FLOW; ISOFLURANE ANESTHESIA; OX...


In [28]:
df.shape # shape of our dataset in (rows , columns)

(87281, 69)

In [29]:
print('Dropping duplicate rows')
df.drop_duplicates(subset=['DI'], inplace=True)
print(df.shape[0], 'rows of data remaining')

Dropping duplicate rows
31190 rows of data remaining


Available article types in WoS data are:
        
        ['Article', 'Review', 'Letter', 'Editorial Material', 'Correction',
       'Article; Proceedings Paper', 'Article; Retracted Publication',
       'Retraction', 'Meeting Abstract', 'Biographical-Item', 'News Item',
       'Book Review', 'News Item; Retracted Publication', 'Reprint']
       
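We want only research papers, so we keep rows whose document type is exactly 'Article'.  This exact match also excludes compound types such as 'Article; Proceedings Paper'.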
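
A minimal sketch of how document types could be counted before filtering is shown below. It uses a toy dataframe: the 'DT' column name follows the WoS export, but the rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'DI': ['10.1000/a', '10.1000/b', '10.1000/c', '10.1000/d'],
    'DT': ['Article', 'Review', 'Article; Proceedings Paper', 'Article'],
})

# Count how many rows carry each document type.
type_counts = toy['DT'].value_counts()

# An exact comparison keeps plain articles and drops compound types.
articles_only = toy[toy['DT'] == 'Article']
```

Inspecting the counts first makes it easier to judge how much data an exact match on one type discards.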

In [30]:
print('Limit data to research articles')
df = df[df['DT']=='Article']
print(df.shape[0], 'rows of data remaining')

Limit data to research articles
19279 rows of data remaining


In [31]:
print('Limit to years selected in config.py \n', ', '.join(years))
df = df[df['PY'].isin(years)]
print(df.shape[0], 'rows of data remaining')

Limit to years selected in config.py 
 2012, 2013, 2014, 2015
12796 rows of data remaining


In [32]:
print('Limit to articles with abstracts in WoS data')
df = df[df.AB.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with abstracts in WoS data
12144 rows of data remaining


Remove rows where DOI is NaN

In [33]:
print('Limit to articles with DOIs in WoS data')
df = df[df.DI.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with DOIs in WoS data
12144 rows of data remaining


### Limit columns

In [34]:
print('Limit to useful columns')
print(', '.join(['DI','WD','AU','AF','SO','SN','EI','TC','SC','Z9','PY']))
df = df[['DI','PY','TI','AB','WD','AU','EM','AF','SO','SC','SN','EI','TC','Z9','C1']]
df.head()

Limit to useful columns
DI, WD, AU, AF, SO, SN, EI, TC, SC, Z9, PY


Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,Z9
481,10.1016/j.ahj.2015.09.007,2015,Assessment of the clinical effects of choleste...,"Nicholls, SJ; Lincoff, AM; Barter, PJ; Brewer,...","Nicholls, Stephen J.; Lincoff, A. Michael; Bar...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,37,38
482,10.1016/j.ahj.2015.09.012,2015,Paradigm shift in the intervention for secundu...,"Wu, MH; Chen, HC; Wang, JK; Kao, FY; Huang, SK","Wu, Mei-Hwan; Chen, Hui-Chi; Wang, Jou-Kou; Ka...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,1,1
483,10.1016/j.ahj.2015.09.017,2015,An international comparison of patients underg...,"Kohsaka, S; Miyata, H; Ueda, I; Masoudi, FA; P...","Kohsaka, Shun; Miyata, Hiroaki; Ueda, Ikuko; M...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,9,9
484,10.1016/j.ahj.2015.09.021,2015,Clinical characteristics and in hospital outco...,"Dasari, TW; Saucedo, JF; Krim, S; Alkhouli, M;...","Dasari, Tarun W.; Saucedo, Jorge F.; Krim, Sel...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0
485,10.1016/j.ahj.2015.08.018,2015,Perindopril and beta-blocker for the preventio...,"Bertrand, ME; Ferrari, R; Remme, WJ; Simoons, ...","Bertrand, Michel E.; Ferrari, Roberto; Remme, ...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,5,6


### Memory saving step
Keep only every nth row of the dataframe, which reduces the dataset to roughly 1/n of its size.

Note that 'start' is a random offset from 0 to n-1, so repeated runs on the same dataset can select different rows and give different results.  This is helpful for checking whether a particular sample is somehow biased.
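
The sampling idea can be sketched on a plain list. The helper `every_nth` below is hypothetical and not part of the script; it takes an optional fixed `start` so that a run can be made repeatable.

```python
from random import randint

def every_nth(rows, n, start=None):
    """Keep one row in every n, beginning at an offset in the range 0..n-1."""
    if start is None:
        start = randint(0, n - 1)  # random offset, so runs can differ
    return rows[start::n]

rows = list(range(10))
sample = every_nth(rows, 3, start=1)  # fixed offset for a repeatable result
```

Passing a fixed `start` plays the same role as pinning the offset to 0 while testing.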

In [35]:
from random import randint
n = c.every_nth
if n!=1:
    print('Limiting to 1 article in every {} according to every_nth setting in config.py'.format(n))
start = randint(1, n) - 1  # random offset in 0..n-1; the -1 prevents an off-by-one error
# start = 0 # Uncomment to avoid re-indexing citation data when testing
df = df.iloc[start::n, :]  # start at 'start' and step through selecting every nth row from there
df = df.reset_index(drop=True)  # renumber the rows without adding extra index columns
df.shape

(12144, 11)

### Define doi_ls here


In [36]:
doi_ls = list(df['DI'])

In [37]:
print('--------------------------------------------')
print('Filtering complete.')
print('{} articles in our dataset'.format(len(doi_ls)))
print('--------------------------------------------')

--------------------------------------------
Filtering complete.
12144 articles in our dataset
--------------------------------------------


### Build word_data

In [38]:
print('Building word_data')

Building word_data


In [39]:
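word_data = []

failures = 0
# Build the DOI -> text lookup once instead of rescanning the dataframe for every DOI.
# drop_duplicates keeps the first row per DOI, matching the earlier .iloc[0] behaviour.
wd_by_doi = df.drop_duplicates(subset=['DI']).set_index('DI')['WD'].to_dict()
for doi in doi_ls:
    text = wd_by_doi.get(doi, 'failure')
    word_data.append(text)
    if text == 'failure':
        failures += 1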
        
print('Failures: ',str(failures),'/',str(len(doi_ls)))    
print('No. articles with word data:', len(word_data))
print('No. DOIs in our dataset', len(doi_ls))

Failures:  0 / 12144
No. articles with word data: 12144
No. DOIs in our dataset 12144


In [40]:
import pickle
with open(word_datapath, 'wb') as f:
    pickle.dump(word_data, f)    
    
with open(dois_p, 'wb') as f:
    pickle.dump(doi_ls, f)  
    

In [41]:
df.to_csv(c.working_data)

In [42]:
print('Word data for {} files built in {}'.format(len(list(set(word_data))),
                                                 dt.now()-t_start))

Word data for 12140 files built in 0:02:30.000612
