## Co-Occurrence of Terms Analysis

Co-occurrence of terms analysis: check how often pre-selected cognitive terms appear in abstracts alongside ERP terms.

This analysis searches PubMed for papers that contain specified ERP and cognitive (COG) terms. The data extracted is the number of papers in which both terms appear. These counts are used to infer which cognitive terms each ERP is associated with.

NOTE:
- The COG terms used here are a somewhat arbitrary selection; a better, more principled set of terms is needed.
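
The counting step described above can be sketched roughly as follows. This is a minimal illustration, not the `erpsc` implementation: the helper names `build_cooc_url` and `parse_count` are hypothetical, the query format is borrowed from the test cell at the end of this notebook, and the sample response is a hand-written stand-in for what the E-utilities `esearch` endpoint returns (only the `Count` field is used).

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_cooc_url(erp, term):
    """Build an esearch URL for papers that mention both `erp` and `term`."""
    params = {
        'db': 'pubmed',
        'field': 'word',
        'term': '"{}"AND"{}"'.format(erp, term),
    }
    return ESEARCH_URL + '?' + urlencode(params)

def parse_count(xml_text):
    """Return the total number of matching papers from an esearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext('Count'))

# Hypothetical response body, trimmed to a few fields (values are made up)
sample = ('<eSearchResult><Count>12</Count><RetMax>2</RetMax>'
          '<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>')

url = build_cooc_url('N400', 'language')
n_papers = parse_count(sample)
```

In a live run, the body of the response fetched from `url` would take the place of `sample`.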

In [1]:
# Import required libraries
import sys
sys.path.append('/Users/thomasdonoghue/Documents/GitCode/ERP_SCANR/')

# Import custom code
from erpsc.count import Count

In [2]:
# Initialize object for term co-occurrence counts
counts = Count()

In [3]:
# Load ERPs and terms from file
counts.set_erps_file()
counts.set_terms_file('cognitive')

In [4]:
# OR: Set small set of ERPs and terms for tests

# Small test set of words
erps = ['N400', 'P600']
cog_terms = ['language', 'memory'] 

# Add ERPs and terms
counts.set_erps(erps)
counts.set_terms(cog_terms)

Unloading previous ERP words.
Unloading previous terms words.


In [5]:
# Scrape the co-occurrence of terms data
counts.scrape_data()


In [6]:
# Check the most commonly associated COG term for each ERP
counts.check_cooc_erps()

For the  N400  the most common association is 	 ['language'] with 	 %45.50
For the  P600  the most common association is 	 ['language'] with 	 %58.83


In [7]:
# Check the most commonly associated ERP for each term
counts.check_cooc_terms()

For  ['language']         the strongest associated ERP is 	 P600  with 	 %58.83
For  ['memory']           the strongest associated ERP is 	 N400  with 	 %21.94
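
The percentages printed above are not defined in this notebook. One plausible reading, sketched below, is that each co-occurrence count is divided by the total number of papers found for the ERP. This is an assumption, not a description of what `Count` actually does. The function names and all counts in the example are hypothetical.

```python
def cooc_percentages(cooc, erp_counts):
    """Normalize co-occurrence counts by the paper count of each ERP (assumed scheme)."""
    return {erp: {term: 100.0 * n / erp_counts[erp] for term, n in row.items()}
            for erp, row in cooc.items()}

def top_term_per_erp(pcts):
    """For each ERP, pick the term with the highest percentage."""
    return {erp: max(row, key=row.get) for erp, row in pcts.items()}

def top_erp_per_term(pcts):
    """For each term, pick the ERP with the highest percentage."""
    terms = {term for row in pcts.values() for term in row}
    return {term: max(pcts, key=lambda erp: pcts[erp].get(term, 0.0))
            for term in terms}

# Hypothetical counts, for illustration only
erp_counts = {'ERP_A': 200, 'ERP_B': 50}
cooc = {'ERP_A': {'term_x': 80, 'term_y': 40},
        'ERP_B': {'term_x': 30, 'term_y': 5}}

pcts = cooc_percentages(cooc, erp_counts)
best_terms = top_term_per_erp(pcts)
best_erps = top_erp_per_term(pcts)
```

Under this scheme a term that is widely studied overall can dominate every ERP, which is one reason the choice of normalization matters when interpreting the associations.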


In [8]:
# Check the terms with the most papers
counts.check_top()

The most studied ERP is  ['N400']  with     1901 papers
The most studied term is  ['memory']  with   224064  papers


In [9]:
# Check how many papers were found for each term - ERPs
counts.check_counts('erp')

N400  -     1901
P600  -      549


In [10]:
# Check how many papers were found for each term - COGs
counts.check_counts('term')

['language']       -     145436
['memory']         -     224064


In [None]:
# Save pickle file of results
counts.save_pickle('test')

In [None]:
# Load from pickle file
# NOTE: load_pickle_counts is not imported above; it must be imported before running this cell
counts = load_pickle_counts('test')

## Test Code

In [None]:
# Make a wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# NOTE: words_analysis is not defined in this notebook; it must be created or loaded first
wordcloud = WordCloud().generate_from_frequencies(words_analysis.results[0].freqs)

In [None]:
type(words_analysis.results[0].freqs)

words_analysis.results[0].freqs.plot(500)

In [None]:
words_analysis.results[0].freqs

## Test Code

In [None]:
# TEST IMPORTS (requests and BeautifulSoup are used in the cells below)
import requests
#import nltk
from bs4 import BeautifulSoup

In [None]:

# Query PubMed for papers containing both terms (straight quotes, not curly quotes)
page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="N270"AND"Language"')
#page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="P300"&retmax=10')
#page = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/27354714')

# Name the parser explicitly to avoid the BeautifulSoup parser warning
page_soup = BeautifulSoup(page.content, 'lxml')

#counts = page_soup.find_all('count')

#for i in range(0, len(counts)):
#    count = counts[i]
#    ext = count.text
#    print(int(ext))

#art_page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=' + id_strs)

#art_page_soup = BeautifulSoup(art_page.content, "xml")

In [None]:
ids = page_soup.find_all('id')
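
As a rough sketch of where this test code appears to be heading, the paper IDs from a search response can be joined into a comma-separated string and passed to `efetch`, as in the commented-out `art_page` line above. The example below uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra dependencies. The helper names and the sample response (including its IDs) are hypothetical.

```python
import xml.etree.ElementTree as ET

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def extract_ids(xml_text):
    """Collect the text of every <Id> element in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter('Id')]

def build_efetch_url(ids):
    """Build an efetch URL requesting the given PubMed IDs as XML."""
    return EFETCH_URL + '?db=pubmed&retmode=xml&id=' + ','.join(ids)

# Hypothetical esearch response with made-up IDs
sample = ('<eSearchResult><Count>2</Count>'
          '<IdList><Id>111</Id><Id>222</Id></IdList></eSearchResult>')

id_list = extract_ids(sample)
fetch_url = build_efetch_url(id_list)
```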