## ERP-SCANR


This notebook is the overview notebook for the ERP-SCANR project (erpsc).

ERPSC is an attempt to use automated web-scraping and text mining to summarize research on ERPs. 

Hopefully this project will serve as a type of automated meta-analysis, and also a way to pull out 

Currently, there are two approaches:
- Co-occurrence of terms analysis: checks how often pre-selected cognitive terms appear in abstracts together with ERP terms.
- Words analysis: scrapes for ERP papers and pulls out the words in their abstracts for analysis.

NOTE:
- Known issue: some ERP terms return papers in which the same name is used for something else. Some sort of quality control procedure is needed to check that the scraped papers actually refer to the intended ERP.
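
One possible shape for such a quality control step is a simple keyword filter on the abstract text. This is only a sketch, not part of the `erpsc` module: the function name `looks_like_erp_paper`, the pattern list, and the example abstracts are all hypothetical, and a real filter would need a hand-curated list of context phrases.

```python
import re

# Hypothetical context phrases; a real list would need to be curated by hand.
ERP_CONTEXT_PATTERNS = [
    r"event[- ]related potentials?",
    r"\bERPs?\b",
    r"\bEEG\b",
    r"electroencephalogra\w*",
    r"evoked potentials?",
]


def looks_like_erp_paper(abstract, patterns=ERP_CONTEXT_PATTERNS):
    """Return True if the abstract mentions at least one ERP-context phrase."""
    return any(re.search(p, abstract, flags=re.IGNORECASE) for p in patterns)


# Made-up example abstracts, for illustration only
abstracts = [
    "P50 sensory gating was measured with EEG in a paired-click paradigm.",
    "The p50 subunit of NF-kB regulates gene expression in macrophages.",
]

# Keep only the abstracts that appear to be about electrophysiology
kept = [a for a in abstracts if looks_like_erp_paper(a)]
```

A filter like this trades recall for precision: papers that discuss an ERP without any of the listed phrases would be dropped, so the pattern list would have to be checked against a sample of known-good papers.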

This notebook runs the code and displays the results. The actual code is in the custom 'erpsc' module.

In [2]:
## Imports

# Import custom code
from erpsc import *

# TEST IMPORTS
#import requests
#import nltk
#from bs4 import BeautifulSoup

In [7]:
## Things to search through - full set
erps = ['P50', 'P100', 'P200', 'P300', 'P3a', 'P3b', 'P400', 'P600', 'N100', 'N170', 'N200',
        'N270', 'N2pc', 'N400', 'MMN', 'LPC', 'CNV', 'ERN', 'ELAN', 'CPS', 'LRP', 'LDN', 'ORN', 'SEP']

cog_terms = ['language', 'memory', 'attention', 'motor', 'decision making', 'vision',
             'auditory', 'emotion', 'categorization', 'reward', 'spatial', 'somatosensory',
             'cognitive', 'awareness', 'tactile', 'pain', 'learning', 'reasoning', 'social', 'action']

In [5]:
# Small test set of words
erps = ['P50', 'P600']
cog_terms = ['language', 'memory']

## Co-Occurrence of Terms Analysis

This analysis searches through PubMed for papers that contain specified ERP and COG terms. The data extracted is the number of papers that contain both terms. This is used to infer which cognitive terms each ERP is associated with.

NOTE:
- The COG terms here are a somewhat arbitrary selection: a better, more systematically chosen set of terms is needed.
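
The counting step can be sketched roughly as follows. This is a minimal illustration, not the `erpsc` implementation: the helper names are hypothetical, the query format follows the commented-out esearch URLs in the test cell at the end of this notebook (with `https` substituted), and the normalization (co-occurrence count divided by the ERP's total paper count) is an assumption about how the reported percentages might be derived.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_cooc_url(erp, term):
    """Build an esearch URL asking how many PubMed papers mention both terms."""
    query = '"{}"AND"{}"'.format(erp, term)
    params = {"db": "pubmed", "field": "word", "term": query}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_count(xml_text):
    """Pull the total hit count out of an esearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext("Count"))


def cooc_percent(n_both, n_erp):
    """Assumed normalization: share of an ERP's papers that also mention the term."""
    return 100.0 * n_both / n_erp if n_erp else 0.0


url = build_cooc_url("N270", "language")

# Hand-written stand-in for an esearch response; the count is illustrative, not real data
sample_response = "<eSearchResult><Count>17</Count><RetMax>17</RetMax></eSearchResult>"
n_both = parse_count(sample_response)
```

In a live run, the response text would come from fetching `url` (for example with `requests.get(url).content`) rather than from the hard-coded sample.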

In [8]:
# Initialize object for term count co-occurrences
term_counts = ERPSCCount()

# Set erp and cog lists as terms to use
term_counts.set_erps(erps)
term_counts.set_terms(cog_terms)

In [9]:
# Scrape the co-occurrence of terms data
term_counts.scrape_data()

In [10]:
# Check the most commonly associated COG term for each ERP
term_counts.check_erps()

For the  P50   the most common association is 	 auditory   with 	 %06.69
For the  P100  the most common association is 	 vision     with 	 %09.66
For the  P200  the most common association is 	 auditory   with 	 %33.03
For the  P300  the most common association is 	 auditory   with 	 %20.96
For the  P3a   the most common association is 	 auditory   with 	 %51.67
For the  P3b   the most common association is 	 attention  with 	 %41.82
For the  P400  the most common association is 	 attention  with 	 %11.90
For the  P600  the most common association is 	 language   with 	 %56.15
For the  N100  the most common association is 	 auditory   with 	 %63.89
For the  N170  the most common association is 	 attention  with 	 %19.11
For the  N200  the most common association is 	 auditory   with 	 %34.27
For the  N270  the most common association is 	 attention  with 	 %30.36
For the  N2pc  the most common association is 	 attention  with 	 %89.38
For the  N400  the most common association is 	 lan

In [11]:
# Check the most commonly associated ERP for each term
term_counts.check_terms()

For  language             the strongest associated ERP is 	 P600  with 	 %56.15
For  memory               the strongest associated ERP is 	 N400  with 	 %22.10
For  attention            the strongest associated ERP is 	 N2pc  with 	 %89.38
For  motor                the strongest associated ERP is 	 MMN   with 	 %14.69
For  decision making      the strongest associated ERP is 	 ERN   with 	 %05.03
For  vision               the strongest associated ERP is 	 CNV   with 	 %10.12
For  auditory             the strongest associated ERP is 	 MMN   with 	 %68.45
For  emotion              the strongest associated ERP is 	 N170  with 	 %16.24
For  categorization       the strongest associated ERP is 	 N170  with 	 %07.43
For  reward               the strongest associated ERP is 	 ERN   with 	 %06.32
For  spatial              the strongest associated ERP is 	 N2pc  with 	 %37.19
For  somatosensory        the strongest associated ERP is 	 SEP   with 	 %34.67
For  cognitive            the strongest 

In [12]:
# Check the terms with the most papers
term_counts.check_top()

The most studied ERP is  P300    with    10266 papers
The most studied COG is  action  with   671741  papers


In [13]:
# Check how many papers were found for each term - ERPs
term_counts.check_counts('erp')

P50   -     8301
P100  -     2205
P200  -      866
P300  -    10266
P3a   -      809
P3b   -      801
P400  -      269
P600  -      545
N100  -      914
N170  -      942
N200  -      604
N270  -       56
N2pc  -      320
N400  -     1887
MMN   -     2301
LPC   -     2496
CNV   -     6455
ERN   -      855
ELAN  -      229
CPS   -     6385
LRP   -     3334
LDN   -      553
ORN   -     2010
SEP   -     7281


In [14]:
# Check how many papers were found for each term - COGs
term_counts.check_counts('term')

language           -     139451
memory             -     218308
attention          -     344657
motor              -     359396
decision making    -     153343
vision             -     138337
auditory           -     116152
emotion            -      27782
categorization     -      11693
reward             -      32414
spatial            -     229424
somatosensory      -      34840
cognitive          -     260918
awareness          -     113128
tactile            -      14325
pain               -     569009
learning           -     281786
reasoning          -      16757
social             -     608721
action             -     671741


In [15]:
# Save pickle file of results
term_counts.save_pickle()

In [16]:
# Load from pickle file
term_counts = load_pickle_counts()

## WORD ANALYSIS

This analysis searches through PubMed for papers that mention specific ERPs. It then scrapes the titles, abstract words, and publication years of all those papers so that this data can be used for further analysis.
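
As a rough sketch of what this extraction involves, the example below parses a PubMed efetch-style XML document and counts abstract words. It is not the `erpsc` code: the function name, the tiny stopword list, and the sample XML (hand-written, not real data) are assumptions for illustration, and it uses the standard library rather than the notebook's own dependencies.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written stand-in for an efetch response (not real data)
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2016</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>An example title about the P600</ArticleTitle>
        <Abstract><AbstractText>Syntactic violations elicit a P600. Syntactic processing is discussed.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()

# Small illustrative stopword list; a fuller list would be needed in practice
STOPWORDS = {"a", "an", "the", "is", "of", "about"}


def extract_articles(xml_text):
    """Return a list of dicts with title, year, and abstract words per article."""
    articles = []
    for art in ET.fromstring(xml_text).iter("PubmedArticle"):
        # Abstracts can be split over several AbstractText elements
        abstract = " ".join(t.text or "" for t in art.iter("AbstractText"))
        words = [w for w in re.findall(r"[a-z0-9\-]+", abstract.lower())
                 if w not in STOPWORDS]
        articles.append({
            "title": art.findtext(".//ArticleTitle"),
            "year": art.findtext(".//PubDate/Year"),
            "words": words,
        })
    return articles


articles = extract_articles(SAMPLE_XML)

# Pool the words from every article and count how often each occurs
freqs = Counter(w for a in articles for w in a["words"])
print(freqs.most_common(3))
```

Pooling the words across all articles for one ERP before counting corresponds to the combine-then-count order of the cells below.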

In [6]:
# Initialize words-analysis object
words_analysis = ERPSCWords()

In [7]:
# Set ERP terms
words_analysis.set_erps(erps)

In [8]:
# Scrape word data for all ERP abstracts
words_analysis.scrape_data()





In [9]:
# Combine words from each article together
words_analysis.combine_words()

In [10]:
# Compute frequency distributions for words for each ERP
words_analysis.freq_dists()

In [11]:
# Check which words are most frequent for each ERP
words_analysis.check_words(8)

P50 :  nf-κb , expression , study , sensory , results , increased , may , gating , 
P600 :  syntactic , processing , semantic , sentences , language , n400 , erp , results , 


In [23]:
# Save pickle of word object
words_analysis.save_pickle()

In [24]:
# Load word pickle object
words_analysis = load_pickle_words()

In [None]:
# Test Code
import requests
from bs4 import BeautifulSoup

#page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="N270"AND"Language"')
#page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="P300"&retmax=10')
page = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/27354714')

# Name the parser explicitly so bs4 does not have to guess one
page_soup = BeautifulSoup(page.content, "lxml")

#counts = page_soup.find_all('count')

#for count in counts:
#    print(int(count.text))

# id_strs was not defined in this cell; assumed here to be a comma-separated
# string of PubMed IDs (using the ID from the page fetched above)
id_strs = '27354714'
art_page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=' + id_strs)

art_page_soup = BeautifulSoup(art_page.content, "xml")

In [None]:
aa = ERPC()
aa.set_path('Users')