Create a variable 'word_data', a Python list containing the text data for every article in your dataset. It must be in the same order as the doi_ls variable used elsewhere.

The script produces three outputs:
- A dataframe of DOI, WoS word_data, pub_yr, document_type
- doi_ls (identical to the DOI column of the above dataframe)
- word_data, i.e. the text data for each DOI in the list

In [54]:
print('------------------------------------------------------')
print('Step 0:  Read in data from wos folder and build word_data')
from datetime import datetime as dt
t_start = dt.now()
print(t_start)
print('------------------------------------------------------')

------------------------------------------------------
Step 0:  Read in data from wos folder and build word_data
2018-02-16 13:11:02.331206
------------------------------------------------------


In [55]:
# imports 
import os
import csv
import pandas as pd
import numpy as np
import json
import pickle
from config import Config as c
# cr_db_p = c.cr_db_p
years = c.years
word_datapath = c.word_datapath
dois_p = c.dois_pkl
wos_word_data_p = c.wos_word_data_p

In [None]:
# ensure directories all exist:
input_dir = 'wos'
if not os.path.exists(input_dir):
    # Raising a bare string is a TypeError in Python 3, so raise a real exception.
    raise FileNotFoundError("There should be a folder called 'wos' with Web of Science tsv files in it")

dirs = ['data','output']
for d in dirs:
    if not os.path.exists(d):
        os.makedirs(d)


### Load WoS word_data
This is DOIs and word data from titles, abstracts and keywords on Web of Science.

Note that this is the source of our DOI list, which is the basis of all data throughout this program. For example, if we have a set of tf-idf vectors later in processing, each vector will represent an article from our DOI list, and the vectors will have the same ordering as the DOI list. This way, we always know which article an individual piece of data pertains to.
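
The positional alignment described above can be illustrated with a small sketch. The DOIs, the `vectors` list and the helper `vector_for` are hypothetical stand-ins, not part of this program; the only point is that item i of any derived list belongs to `doi_ls[i]`.

```python
# Hypothetical toy data: three DOIs and one derived value per DOI, in the same order.
doi_ls = ['10.1000/a', '10.1000/b', '10.1000/c']
vectors = [[0.1, 0.0], [0.0, 0.7], [0.3, 0.3]]  # stand-in for tf-idf vectors

# Build the DOI -> position lookup once, rather than calling list.index repeatedly.
position = {doi: i for i, doi in enumerate(doi_ls)}

def vector_for(doi):
    """Return the derived vector that belongs to the given DOI."""
    return vectors[position[doi]]

print(vector_for('10.1000/b'))
```

Because every derived list shares the ordering of `doi_ls`, a single lookup table like `position` is enough to move between a DOI and any of its data.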


In [56]:
def build_wos_data():
    filepaths = [os.path.join(os.path.abspath('wos'),p) for p in os.listdir('wos')]
    data = pd.DataFrame({})
    print('Reading files from WoS folder into dataframe...')
    for filepath in filepaths:
        df = pd.read_csv(filepath, 
                         sep='\t', 
                         quotechar='"',
                         quoting=csv.QUOTE_NONE,
                         index_col=False, 
#                          error_bad_lines=False,
                         encoding='utf-16')
        data = pd.concat([data,df])
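    # Convert every column elementwise; str(data.AB) would turn the whole Series into one string.
    data['WD'] = data.TI.map(str)+' ... '+data.AB.map(str)+' ... '+data.DE.map(str)+' ... '+data.ID.map(str)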
    return data # [['DI','SO','SN','EI','PY','DT','WD']]
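
As a hedged illustration of how a combined text column such as WD can be assembled row by row, the sketch below uses a made-up two-row dataframe. Converting each column with `.map(str)` keeps the operation elementwise, whereas calling `str()` on a whole column produces a single string for the entire Series. The helper name `combine_text` and the choice to replace missing values with empty strings are assumptions made for this example.

```python
import pandas as pd

def combine_text(frame, cols=('TI', 'AB', 'DE', 'ID'), sep=' ... '):
    """Join the given text columns row by row, treating missing values as ''."""
    parts = [frame[col].fillna('').map(str) for col in cols]
    combined = parts[0]
    for part in parts[1:]:
        combined = combined + sep + part
    return combined

# Invented rows using the WoS column names: title, abstract, keywords, Keywords Plus.
toy = pd.DataFrame({
    'TI': ['Title one', 'Title two'],
    'AB': ['Abstract one', None],
    'DE': ['kw a; kw b', None],
    'ID': [None, 'KW PLUS'],
})
toy['WD'] = combine_text(toy)
print(toy['WD'].tolist())
```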
    

In [57]:
try:
    df = pd.read_csv(wos_word_data_p, index_col=0)
except FileNotFoundError:
    # No cached file yet, so rebuild the data from the raw WoS exports.
    df = build_wos_data()

Reading files from WoS folder into dataframe...


In [58]:
df[['WD','TI','AB','DE','ID']].sample(40)

Unnamed: 0,WD,TI,AB,DE,ID
365,Functional Reorganisation in Sensorimotor Cort...,Functional Reorganisation in Sensorimotor Cort...,,,
167,Exercising away the blues: can it help multipl...,Exercising away the blues: can it help multipl...,The present review focuses on exercise as a tr...,Depression; multiple sclerosis; treatment; exe...,QUALITY-OF-LIFE; RANDOMIZED CONTROLLED-TRIAL; ...
133,Online extraction and single trial analysis of...,Online extraction and single trial analysis of...,Understanding how the brain processes errors i...,Error; Feedback; FRN; Inverse solution; Motor ...,BRAIN-COMPUTER INTERFACES; MEDIAL FRONTAL-CORT...
298,C9orf72 ablation in mice does not cause motor ...,C9orf72 ablation in mice does not cause motor ...,ObjectiveHow hexanucleotide (GGGGCC) repeat ex...,,FRONTOTEMPORAL DEMENTIA; HEXANUCLEOTIDE REPEAT...
104,MIRROR THERAPY IN UNILATERAL NEGLECT AFTER STR...,MIRROR THERAPY IN UNILATERAL NEGLECT AFTER STR...,,,PHANTOM LIMB
492,Clinical and histopathological amieloration of...,Clinical and histopathological amieloration of...,,,
216,Implications of Hemisensory Stroke Symptoms in...,Implications of Hemisensory Stroke Symptoms in...,,,
432,Neurologic manifestations of E coli infection-...,Neurologic manifestations of E coli infection-...,Objective: To describe the neurologic and neur...,,CENTRAL-NERVOUS-SYSTEM; NECROSIS-FACTOR-ALPHA;...
476,Opsoclonus as a Manifestation of Hashimoto's E...,Opsoclonus as a Manifestation of Hashimoto's E...,,,
60,Increased Cell-Intrinsic Excitability Induces ...,Increased Cell-Intrinsic Excitability Induces ...,Electrical activity regulates the manner in wh...,,GENERATED GRANULE CELLS; ACTIVITY-DEPENDENT RE...


In [59]:
df.shape # shape of our dataset in (rows , columns)

(47872, 67)

In [60]:
print('Dropping duplicate rows')
df.drop_duplicates(subset=['DI'], inplace=True)
print(df.shape[0], 'rows of data remaining')

Dropping duplicate rows
28011 rows of data remaining


In [61]:
print('Limit data to research articles')
df = df[df['DT']=='Article']
print(df.shape[0], 'rows of data remaining')

Limit data to research articles
21342 rows of data remaining


In [62]:
print('Limit to years selected in config.py \n', ', '.join(years))
df = df[df['PY'].isin(years)]
print(df.shape[0], 'rows of data remaining')

Limit to years selected in config.py 
 2013, 2014
7755 rows of data remaining


In [63]:
print('Limit to articles with abstracts in WoS data')
df = df[df.AB.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with abstracts in WoS data
7710 rows of data remaining


Remove rows where DOI is NaN

In [64]:
print('Limit to articles with DOIs in WoS data')
df = df[df.DI.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with DOIs in WoS data
7710 rows of data remaining
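
The filtering steps above can be summarised in one function. This is a sketch on an invented dataframe, assuming the same WoS column names (DI, DT, PY, AB); the function name `filter_wos` and the toy values are hypothetical, and the years are given as integers here, which may differ from how config.py stores them.

```python
import pandas as pd

def filter_wos(frame, years):
    """Apply the notebook's filters in order and report the rows left after each."""
    steps = [
        ('drop duplicate DOIs', lambda d: d.drop_duplicates(subset=['DI'])),
        ('keep research articles', lambda d: d[d['DT'] == 'Article']),
        ('keep selected years', lambda d: d[d['PY'].isin(years)]),
        ('keep rows with abstracts', lambda d: d[d['AB'].notnull()]),
        ('keep rows with DOIs', lambda d: d[d['DI'].notnull()]),
    ]
    for label, step in steps:
        frame = step(frame)
        print('{}: {} rows remaining'.format(label, frame.shape[0]))
    return frame

# Invented rows: one duplicate DOI, one review, one out-of-range year,
# one missing DOI and one missing abstract.
toy = pd.DataFrame({
    'DI': ['10.1/a', '10.1/a', '10.1/b', '10.1/c', None, '10.1/d'],
    'DT': ['Article', 'Article', 'Review', 'Article', 'Article', 'Article'],
    'PY': [2013, 2013, 2014, 2012, 2014, 2014],
    'AB': ['text', 'text', 'text', 'text', 'text', None],
})
filtered = filter_wos(toy, years=[2013, 2014])
```

Keeping the steps in a list makes the order explicit and prints a running row count after each filter, much like the individual cells do.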


### Limit columns

In [65]:
print('Limit to useful columns')
print(', '.join(['DI','WD','AU','AF','SO','SN','EI','TC','SC','Z9','PY']))
df = df[['DI','PY','WD','AU','AF','SO','SC','SN','EI','TC','Z9']]
df.head()

Limit to useful columns
DI, WD, AU, AF, SO, SN, EI, TC, SC, Z9, PY


Unnamed: 0,DI,PY,WD,AU,AF,SO,SC,SN,EI,TC,Z9
386,10.1007/s00401-014-1349-0,2014,Primary age-related tauopathy (PART): a common...,"Crary, JF; Trojanowski, JQ; Schneider, JA; Abi...","Crary, John F.; Trojanowski, John Q.; Schneide...",ACTA NEUROPATHOLOGICA,Neurosciences & Neurology; Pathology,0001-6322,1432-0533,213,214
389,10.1007/s00401-014-1340-9,2014,A beta immunotherapy for Alzheimer's disease: ...,"Sakai, K; Boche, D; Carare, R; Johnston, D; Ho...","Sakai, Kenji; Boche, Delphine; Carare, Roxana;...",ACTA NEUROPATHOLOGICA,Neurosciences & Neurology; Pathology,0001-6322,1432-0533,12,12
390,10.1007/s00401-014-1342-7,2014,Experimental transmissibility of mutant SOD1 m...,"Ayers, JI; Fromholt, S; Koch, M; DeBosier, A; ...","Ayers, Jacob I.; Fromholt, Susan; Koch, Morgan...",ACTA NEUROPATHOLOGICA,Neurosciences & Neurology; Pathology,0001-6322,1432-0533,30,31
391,10.1007/s00401-014-1343-6,2014,Direct evidence of Parkinson pathology spread ...,"Holmqvist, S; Chutna, O; Bousset, L; Aldrin-Ki...","Holmqvist, Staffan; Chutna, Oldriska; Bousset,...",ACTA NEUROPATHOLOGICA,Neurosciences & Neurology; Pathology,0001-6322,1432-0533,89,97
392,10.1007/s00401-014-1344-5,2014,Zebrafish models of BAG3 myofibrillar myopathy...,"Ruparelia, AA; Oorschot, V; Vaz, R; Ramm, G; B...","Ruparelia, Avnika A.; Oorschot, Viola; Vaz, Ra...",ACTA NEUROPATHOLOGICA,Neurosciences & Neurology; Pathology,0001-6322,1432-0533,15,15


### Memory saving step
Keep only every nth row of the dataframe, which reduces the size of the dataset.

Note that 'start' is a random zero-based offset from 0 to n-1 (a random number from 1 to n, minus 1). This means that we can get different results with the same dataset, which is helpful if the sample is somehow biased.
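
A minimal sketch of this sampling scheme on a plain Python list; slicing with `[start::n]` selects the same positions as `df.iloc[start::n]`. The function name `every_nth` and the optional `seed` argument are assumptions added for illustration, so that a run can be repeated.

```python
import random

def every_nth(rows, n, seed=None):
    """Keep every nth item, starting from a random zero-based offset below n."""
    rng = random.Random(seed)
    start = rng.randint(0, n - 1)  # randint includes both endpoints
    return rows[start::n]

rows = list(range(10))
sample = every_nth(rows, n=3, seed=0)
print(sample)
```

With `n=1` the offset is always 0 and the full list is returned unchanged.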

In [66]:
from random import randint
n = c.every_nth
if n!=1:
    print('Limiting to 1 article in every {} according to every_nth setting in config.py'.format(n))
start = randint(1,n)-1  # zero-based offset in [0, n-1]; the -1 avoids an off-by-one error
# start = 0 # Uncomment to avoid re-indexing citation data when testing
df = df.reset_index(drop=True)
df = df.iloc[start::n, :] # start at 'start' and step through selecting every nth row from there.
df = df.reset_index(drop=True) # drop=True discards the old index rather than adding it as a column
df.shape

(7710, 11)

### Define doi_ls here


In [67]:
doi_ls = list(df['DI'])

In [68]:
print('--------------------------------------------')
print('Filtering complete.')
print('{} articles in our dataset'.format(len(doi_ls)))
print('--------------------------------------------')

--------------------------------------------
Filtering complete.
7710 articles in our dataset
--------------------------------------------


### Build word_data

In [69]:
print('Building word_data')

Building word_data


In [70]:
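# Build the DOI -> text lookup once instead of scanning the dataframe for every DOI.
wd_by_doi = df.drop_duplicates(subset=['DI']).set_index('DI')['WD'].to_dict()

# Preserve the order of doi_ls; DOIs without text are marked 'failure'.
word_data = [wd_by_doi.get(doi, 'failure') for doi in doi_ls]
failures = word_data.count('failure')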
        
print('Failures: ',str(failures),'/',str(len(doi_ls)))    
print('No. articles with word data:', len(word_data))
print('No. DOIs in our dataset', len(doi_ls))

Failures:  0 / 7710
No. articles with word data: 7710
No. DOIs in our dataset 7710


In [71]:
import pickle
with open(word_datapath, 'wb') as f:
    pickle.dump(word_data, f)    
    
with open(dois_p, 'wb') as f:
    pickle.dump(doi_ls, f)  
    

In [72]:
df.to_csv(c.working_data)

In [73]:
# Count every article, not just the distinct texts, so the total matches len(doi_ls).
print('Word data for {} files built in {}'.format(len(word_data),
                                                 dt.now()-t_start))

Word data for 7710 files built in 0:00:40.187414
