Create a variable 'word_data', a Python list holding the text data for every article in your dataset. It must follow the same order as the doi_ls variable used elsewhere, so that word_data[i] is the text for doi_ls[i].

The script produces three outputs:
- Dataframe of DOI, WoS word_data, pub_yr, document_type
- doi_ls (identical to the DOI column of the above dataframe)
- word_data, i.e. the text data for each DOI in the list.

In [1]:
print('------------------------------------------------------')
print('Step 0:  Read in data from wos folder and build word_data')
from datetime import datetime as dt
t_start = dt.now()
print(t_start)
print('------------------------------------------------------')

------------------------------------------------------
Step 0:  Read in data from wos folder and build word_data
2018-03-06 10:57:10.254986
------------------------------------------------------


In [2]:
# imports 
import os
import csv
import pandas as pd
import numpy as np
import json
import pickle
from config import Config as c
# cr_db_p = c.cr_db_p
years = c.years
word_datapath = c.word_datapath
dois_p = c.dois_pkl
wos_word_data_p = c.wos_word_data_p

In [3]:
# ensure directories all exist:
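input_dir = 'wos'
if not os.path.exists(input_dir):
    # Raising a plain string is a TypeError in Python 3, so raise a real exception.
    raise FileNotFoundError("There should be a folder called 'wos' with Web of Science tsv files in it")

dirs = ['data','outputs']
for d in dirs:
    os.makedirs(d, exist_ok=True)  # no-op if the directory already exists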


### Load WoS word_data
This is DOIs and word data from titles, abstracts and keywords on Web of Science.

Note that this is the source of our DOI list, which is the basis of all data throughout this program. For example, if we have a set of tf-idf vectors later in processing, each vector will represent an article from our DOI list, and the vectors will have the same ordering as the DOI list. This way, we always know which article an individual piece of data pertains to.
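
The ordering rule can be illustrated with a small sketch. The DOIs and texts below are made up, and `lookup` is a hypothetical helper, not part of this notebook; simple word counts stand in for the tf-idf vectors built later.

```python
from collections import Counter

# Hypothetical toy data: three made-up DOIs and their text, in the same order.
doi_ls = ['10.1000/a', '10.1000/b', '10.1000/c']
word_data = ['heart valve repair', 'statin therapy', 'aortic valve stenosis']

# Anything derived by iterating word_data in order keeps the same alignment.
term_counts = [Counter(text.split()) for text in word_data]

def lookup(doi):
    """Return the derived record for a DOI, found by its position in doi_ls."""
    return term_counts[doi_ls.index(doi)]

# The derived list must stay the same length as the DOI list.
assert len(term_counts) == len(doi_ls)
print(lookup('10.1000/c')['valve'])
```

As long as no step reorders or filters one list without the other, position alone is enough to tie a vector back to its article.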


In [26]:
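def build_wos_data():
    """Read every WoS export in the 'wos' folder into one dataframe and add the WD text column."""
    filepaths = [os.path.join(os.path.abspath('wos'),p) for p in os.listdir('wos')]
    frames = []
    print('Reading files from WoS folder into dataframe...')
    for filepath in filepaths:
        df = pd.read_csv(filepath, 
                         sep='\t', 
                         quotechar='"',
                         quoting=csv.QUOTE_NONE,  # treat quote characters as literal text
                         index_col=False, 
                         on_bad_lines='skip',     # pandas >= 1.3; older versions use error_bad_lines=False
                         encoding='utf-16')
        frames.append(df)
    # Concatenate once instead of growing the dataframe inside the loop.
    data = pd.concat(frames)
    # WD = title | abstract | author keywords | Keywords Plus
    data['WD'] = data.TI.map(str)+' | '+ data.AB.map(str) +' | '+ data.DE.map(str) + ' | ' + data.ID.map(str)
    return data # [['DI','SO','SN','EI','PY','DT','WD']]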
    

In [5]:
try:
    df = pd.read_csv(wos_word_data_p, index_col=0)
except FileNotFoundError:
    # No file at wos_word_data_p: build the dataframe from the raw WoS exports.
    df = build_wos_data()

Reading files from WoS folder into dataframe...


In [None]:
# checking text data came out right
# df = build_wos_data()
# df.WD.iloc[0]

In [27]:
# check cell
df[['WD','TI','AB','DE','ID']].sample(40)


Reading files from WoS folder into dataframe...


'Late outcome of percutaneous mitral commissurotomy: Randomized comparison of Inoue versus double-balloon technique | Background Late prognosis after successful percutaneous mitral commissurotomy (PMC) is unclear. We compared late results of PMC using Inoue versus double-balloon techniques up to 25 years in a randomized trial. Methods Between 1989 and 1995, 302 patients (77 men, 41 +/- 11 years) with severe mitral stenosis were randomly assigned to undergo PMC using Inoue (n = 152; group I) or double-balloon technique (n = 150; group D). The end points were the composite events of death, mitral surgery, repeat PMC, or deterioration of New York Heart Association (NYHA) class >= 3. Results During median follow-up of 20.7 years (maximum, 25.6), clinical events occurred in 82 (53.9%) patients in group I (37 deaths, 44 mitral surgeries, 9 repeat PMCs, 3 NYHA class >= 3) and in 79 (52.7%) patients in group D (34 deaths, 51 mitral surgeries, 5 repeat PMCs, 4 NYHA class >= 3). Event-free survi
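
The WD string above joins title, abstract and keyword fields with ' | '. Because each field is cast with `str`, a missing value ends up as the literal text 'nan' in the joined string. The sketch below shows one possible way to skip missing fields instead; `join_fields` is a hypothetical helper and not something this notebook defines.

```python
import math

def join_fields(*fields, sep=' | '):
    """Join text fields, skipping None, NaN and empty strings."""
    kept = []
    for field in fields:
        if field is None:
            continue
        if isinstance(field, float) and math.isnan(field):
            continue
        text = str(field).strip()
        if text:
            kept.append(text)
    return sep.join(kept)

fields = ['A title', 'An abstract', float('nan'), 'KEYWORD']

# Casting everything to str, as the notebook does, keeps the missing value as 'nan'.
naive = ' | '.join(str(f) for f in fields)
# Skipping missing values leaves only real text.
clean = join_fields(*fields)

print(naive)
print(clean)
```

The trade-off is that the cleaned string no longer has a fixed number of ' | ' separated parts, so this only suits code that treats WD as a single bag of words.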

In [28]:
df.shape # shape of our dataset in (rows , columns)

(87281, 69)

In [29]:
print('Dropping duplicate rows')
df.drop_duplicates(subset=['DI'], inplace=True)
print(df.shape[0], 'rows of data remaining')

Dropping duplicate rows
31190 rows of data remaining
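
One detail of this step: `drop_duplicates` treats missing values as equal to each other, so all rows without a DOI collapse into a single row here, and that row is only removed later by the DOI filter. A minimal sketch on made-up records, assuming it is acceptable to drop rows without a DOI first:

```python
import numpy as np
import pandas as pd

# Toy records; the DOIs and titles are made up for illustration.
toy = pd.DataFrame({
    'DI': ['10.1/a', '10.1/a', np.nan, np.nan, '10.1/b'],
    'TI': ['first a', 'second a', 'no doi 1', 'no doi 2', 'only b'],
})

# Keeps the first row per DOI; the two NaN rows count as duplicates of each other.
deduped = toy.drop_duplicates(subset=['DI'])

# Dropping missing DOIs first makes the intent explicit.
explicit = toy.dropna(subset=['DI']).drop_duplicates(subset=['DI'])

print(len(deduped), len(explicit))
```

Both orders give the same final dataset once the DOI filter has run; the explicit version only makes the intermediate row counts easier to interpret.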


Available article types in WoS data are:
        
        ['Article', 'Review', 'Letter', 'Editorial Material', 'Correction',
       'Article; Proceedings Paper', 'Article; Retracted Publication',
       'Retraction', 'Meeting Abstract', 'Biographical-Item', 'News Item',
       'Book Review', 'News Item; Retracted Publication', 'Reprint']
       
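We want only research papers, so we keep rows whose document type (DT) is exactly 'Article'. Note that this exact match also drops compound types such as 'Article; Proceedings Paper'.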
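
To make the effect of exact matching concrete, the sketch below compares it with a looser rule on a few of the document types listed above. `is_research_article` and its `exclude` parameter are hypothetical; the notebook itself uses the exact match only.

```python
doc_types = ['Article', 'Review', 'Article; Proceedings Paper',
             'Article; Retracted Publication', 'Letter']

# Exact match, as in the notebook: compound types are excluded.
exact = [t for t in doc_types if t == 'Article']

def is_research_article(doc_type, exclude=('Retracted Publication',)):
    """Hypothetical looser rule: 'Article' appears among the types and none are excluded."""
    parts = [p.strip() for p in doc_type.split(';')]
    return 'Article' in parts and not any(p in exclude for p in parts)

loose = [t for t in doc_types if is_research_article(t)]

print(exact)
print(loose)
```

Whether proceedings papers count as research articles is a judgment call for the dataset at hand; the exact match is the more conservative choice.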

In [30]:
print('Limit data to research articles')
df = df[df['DT']=='Article']
print(df.shape[0], 'rows of data remaining')

Limit data to research articles
19279 rows of data remaining


In [31]:
print('Limit to years selected in config.py \n', ', '.join(years))
df = df[df['PY'].isin(years)]
print(df.shape[0], 'rows of data remaining')

Limit to years selected in config.py 
 2012, 2013, 2014, 2015, 2016, 2017
19279 rows of data remaining
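
The `join` call in this cell implies that `years` holds strings, while the dtype of `PY` depends on how the files were read. If the two types differ, `isin` may match no rows at all rather than raise an error. A minimal sketch, assuming the years are stored as strings and `PY` is numeric, that normalizes both sides before filtering:

```python
import pandas as pd

years = ['2012', '2013']  # assumption: config stores years as strings
toy = pd.DataFrame({'PY': [2012.0, 2013.0, 2018.0, None]})  # made-up rows

# Bring both sides to numbers before comparing.
py = pd.to_numeric(toy['PY'], errors='coerce')  # unparseable values become NaN
wanted = [int(y) for y in years]

filtered = toy[py.isin(wanted)]
print(filtered.shape[0], 'rows of data remaining')
```

Comparing the row count before and after the filter, as the notebook does, is a quick way to notice a filter that matched nothing.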


In [32]:
print('Limit to articles with abstracts in WoS data')
df = df[df.AB.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with abstracts in WoS data
18355 rows of data remaining


Remove rows where DOI is NaN

In [33]:
print('Limit to articles with DOIs in WoS data')
df = df[df.DI.notnull()]
print(df.shape[0], 'rows of data remaining')

Limit to articles with DOIs in WoS data
18355 rows of data remaining


### Limit columns

In [34]:
print('Limit to useful columns')
cols = ['DI','PY','TI','AB','WD','AU','EM','AF','SO','SC','SN','EI','TC','Z9','C1']
print(', '.join(cols))  # print the columns that are actually kept
df = df[cols]
df.head()

Limit to useful columns
DI, PY, TI, AB, WD, AU, EM, AF, SO, SC, SN, EI, TC, Z9, C1


Unnamed: 0,DI,PY,TI,AB,WD,AU,EM,AF,SO,SC,SN,EI,TC,Z9,C1
0,10.1016/j.ahj.2017.04.004,2017,Late outcome of percutaneous mitral commissuro...,Background Late prognosis after successful per...,Late outcome of percutaneous mitral commissuro...,"Lee, S; Kang, DH; Kim, DH; Song, JM; Song, JK;...",dhkang@amc.seoul.kr; sjpark@amc.seoul.kr,"Lee, Sahmin; Kang, Duk-Hyun; Kim, Dae-Hee; Son...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0,"[Lee, Sahmin; Kang, Duk-Hyun; Kim, Dae-Hee; So..."
1,10.1016/j.ahj.2017.08.014,2017,A unique linkage of administrative and clinica...,"Background Large clinical, research, and admin...",A unique linkage of administrative and clinica...,"Godown, J; Thurm, C; Dodd, DA; Soslow, JH; Fei...",justin.godown@vanderbilt.edu,"Godown, Justin; Thurm, Cary; Dodd, Debra A.; S...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0,"[Godown, Justin; Dodd, Debra A.; Soslow, Jonat..."
2,10.1016/j.ahj.2017.08.004,2017,Contemporary risk model for inhospital major b...,Background Major bleeding is a frequent compli...,Contemporary risk model for inhospital major b...,"Desai, NR; Kennedy, KF; Cohen, DJ; Connolly, T...",robert.mcnamara@yale.edu,"Desai, Nihar R.; Kennedy, Kevin F.; Cohen, Dav...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0,"[Desai, Nihar R.; McNamara, Robert L.] Yale Un..."
3,10.1016/j.ahj.2017.08.013,2017,Contemporary rates and correlates of statin us...,Background Statin therapy ishighly efficacious...,Contemporary rates and correlates of statin us...,"Go, AS; Fan, DJ; Sung, SH; Inveiss, AI; Romo-L...",Alan.S.Go@kp.org,"Go, Alan S.; Fan, Dongjie; Sung, Sue Hee; Inve...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0,"[Go, Alan S.; Fan, Dongjie; Sung, Sue Hee; Inv..."
4,10.1016/j.ahj.2017.08.006,2017,Durability of quality of life benefits of tran...,Background For patients with severe aortic ste...,Durability of quality of life benefits of tran...,"Baron, SJ; Arnold, SV; Reynolds, MR; Wang, KJ;...",dcohen@saint-lukes.org,"Baron, Suzanne J.; Arnold, Suzanne V.; Reynold...",AMERICAN HEART JOURNAL,Cardiovascular System & Cardiology,0002-8703,1097-5330,0,0,"[Baron, Suzanne J.; Arnold, Suzanne V.; Wang, ..."


### Memory saving step
Keep only every nth row of the dataframe to reduce the size of the dataset. Here n is the every_nth setting in config.py, and n = 1 keeps every row.

Note that 'start' is a random offset from 0 to n-1, so repeated runs on the same dataset can select different rows. This is helpful for checking whether a particular sample is biased.
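
The sampling rule can be sketched in plain Python. `every_nth` and its `seed` parameter are hypothetical additions for illustration: a fixed seed makes the random offset repeatable, which the notebook does not do.

```python
import random

def every_nth(rows, n, seed=None):
    """Return (start, sample): every nth row beginning at a random offset in 0..n-1."""
    rng = random.Random(seed)  # a fixed seed makes the offset repeatable
    start = rng.randrange(n)
    return start, rows[start::n]

rows = list(range(10))
start, sample = every_nth(rows, n=3, seed=42)
print(start, sample)
```

With n = 1 the offset is always 0 and every row is kept, which matches the behaviour of the cell below when every_nth is set to 1.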

In [35]:
from random import randrange
n = c.every_nth
if n!=1:
    print('Limiting to 1 article in every {} according to every_nth setting in config.py'.format(n))
start = randrange(n)  # random offset in 0..n-1
# start = 0 # Uncomment to avoid re-indexing citation data when testing
# Start at 'start', take every nth row, then renumber rows from 0 without keeping the old index.
df = df.iloc[start::n, :].reset_index(drop=True)
df.shape

(18355, 15)

### Define doi_ls here


In [36]:
doi_ls = list(df['DI'])

In [37]:
print('--------------------------------------------')
print('Filtering complete.')
print('{} articles in our dataset'.format(len(doi_ls)))
print('--------------------------------------------')

--------------------------------------------
Filtering complete.
18355 articles in our dataset
--------------------------------------------


### Build word_data

In [38]:
print('Building word_data')

Building word_data


In [39]:
# DOIs are unique after the duplicate-dropping step, so a dict gives one lookup per DOI
# instead of re-scanning the whole dataframe for every DOI.
wd_by_doi = dict(zip(df['DI'], df['WD']))

word_data = []
failures = 0
for doi in doi_ls:
    text = wd_by_doi.get(doi, 'failure')  # 'failure' marks a DOI with no text
    word_data.append(text)
    if text == 'failure':
        failures+=1

print('Failures: ',str(failures),'/',str(len(doi_ls)))
print('No. articles with word data:', len(word_data))
print('No. DOIs in our dataset', len(doi_ls))

Failures:  0 / 18355
No. articles with word data: 18355
No. DOIs in our dataset 18355
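
Because word_data and doi_ls are written to separate pickle files in the next step, their alignment depends on both files coming from the same run. The sketch below shows one possible sanity check after loading; `check_aligned`, the toy data and the temporary file names are all hypothetical.

```python
import os
import pickle
import tempfile

def check_aligned(doi_ls, word_data):
    """Raise ValueError if the two lists cannot be position-aligned."""
    if len(doi_ls) != len(word_data):
        raise ValueError('doi_ls and word_data differ in length')
    if len(set(doi_ls)) != len(doi_ls):
        raise ValueError('doi_ls contains duplicate DOIs')

# Made-up toy data standing in for the real lists.
toy_dois = ['10.1000/a', '10.1000/b']
toy_words = ['text for a', 'text for b']

with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, obj in [('dois', toy_dois), ('words', toy_words)]:
        paths[name] = os.path.join(tmp, name + '.pkl')
        with open(paths[name], 'wb') as f:
            pickle.dump(obj, f)
    # Load both files back, as a later script would.
    with open(paths['dois'], 'rb') as f:
        loaded_dois = pickle.load(f)
    with open(paths['words'], 'rb') as f:
        loaded_words = pickle.load(f)

check_aligned(loaded_dois, loaded_words)
print('aligned:', len(loaded_dois), 'articles')
```

A length check cannot detect two lists of equal size from different runs, so it is a guard against the obvious mismatch rather than a proof of alignment.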


In [40]:
# pickle is already imported at the top of the notebook
with open(word_datapath, 'wb') as f:
    pickle.dump(word_data, f)    
    
with open(dois_p, 'wb') as f:
    pickle.dump(doi_ls, f)  
    

In [41]:
df.to_csv(c.working_data)

PermissionError: [Errno 13] Permission denied: 'C:\\Users\\aday\\OneDrive - SAGE Publishing\\Projects\\TFIDF_vis_Cardio\\data\\working_data.csv'

In [None]:
# Report the number of articles, not the number of distinct text strings.
print('Word data for {} articles built in {}'.format(len(word_data),
                                                    dt.now()-t_start))