## ERP SCANR


This notebook is the overview notebook for the ERP-SCANR project (erpsc).

ERPSC is an attempt to use automated web-scraping and text mining to summarize research on event-related potentials (ERPs).

Hopefully this project will serve as a type of automated meta-analysis.

Currently, there are two approaches:
- Co-occurrence of terms analysis: checks how often pre-selected cognitive terms appear in abstracts together with ERP terms.
- Words analysis: scrapes for ERP papers and pulls out the words in their abstracts for analysis.

NOTE:
- Known issue: Some ERP terms often return papers in which the same name is used for something else. Some sort of quality-control procedure will be needed to check that the scraped papers actually refer to the intended ERP.
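
One possible shape for such a quality-control step, as a minimal sketch: keep a paper only if its abstract also mentions at least one EEG/ERP context term. The function name `looks_like_erp_paper` and the list of context terms are hypothetical illustrations, not part of `erpsc`, and the two abstracts are made-up examples.

```python
import re

# Hypothetical context terms; an illustrative assumption, not a list from erpsc
ERP_CONTEXT = ('event-related', 'erp', 'eeg', 'electroencephalography', 'evoked potential')


def looks_like_erp_paper(abstract, context_terms=ERP_CONTEXT):
    """Return True if the abstract mentions at least one EEG/ERP context term."""
    text = abstract.lower()
    for term in context_terms:
        # Match whole terms only, so that 'erp' does not match inside 'interpret'
        if re.search(r'\b' + re.escape(term) + r's?\b', text):
            return True
    return False


# Made-up abstracts: one about the ERP component, one about the protein of the same name
abstracts = [
    'The P300 amplitude of the event-related potential was reduced in patients.',
    'The histone acetyltransferase p300 regulates gene expression in these cells.',
]
kept = [a for a in abstracts if looks_like_erp_paper(a)]
```

A keyword filter like this is crude: it would drop genuine ERP papers whose abstracts never use any of the listed terms, so the term list would need tuning against real results.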

This notebook runs the code and displays the results. The actual code is in the custom 'erpsc' module.

In [1]:
## Imports

# Import custom code
from erpsc import *

# TEST IMPORTS
import requests
import nltk
from bs4 import BeautifulSoup

In [2]:
## Things to search through - full set
erps = ['P50', 'P100', 'P200', 'P300', 'P3a', 'P3b', 'P600', 'N100', 'N170', 'N200',
        'N270', 'N2pc', 'N400', 'MMN', 'LPC', 'CNV', 'ERN', 'ELAN', 'CPS', 'LRP', 'LDN']
cogs = ['language', 'memory', 'attention', 'motor', 'decision making', 'vision',
        'auditory', 'emotion', 'categorization', 'reward', 'spatial']

In [2]:
# Small test set of words
erps = ['P300', 'P600']
cogs = ['language', 'memory']

## Co-Occurrence of Terms Analysis

This analysis searches PubMed for papers that contain specified ERP and COG (cognitive) terms. The data extracted is the number of papers that contain both terms, which is used to infer which cognitive terms each ERP is associated with.

NOTE:
- The COG terms here are a somewhat arbitrary selection; a better, less arbitrarily selected set of terms is needed.
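
The counting step described above can be sketched as follows, assuming the query goes through the NCBI E-utilities `esearch` endpoint with `db=pubmed` and `field=word`, as in the commented test code at the end of this notebook. The helper names `build_cooc_url` and `parse_count` are hypothetical, the sample response is a trimmed, made-up body rather than real PubMed output, and the sketch is written for Python 3.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_cooc_url(erp, cog):
    """Build an esearch URL that asks for papers containing both terms."""
    query = {'db': 'pubmed', 'field': 'word', 'term': '"{}" AND "{}"'.format(erp, cog)}
    return ESEARCH + '?' + urlencode(query)


def parse_count(xml_text):
    """Pull the total hit count out of an esearch XML response."""
    root = ET.fromstring(xml_text)
    return int(root.findtext('Count'))


url = build_cooc_url('P600', 'language')

# Illustrative response body (made-up number, trimmed to the relevant fields)
sample = '<eSearchResult><Count>123</Count><RetMax>20</RetMax></eSearchResult>'
n_both = parse_count(sample)
```

In a live run, the XML text would come from something like `requests.get(url).text`, repeated once per ERP/COG pair.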

In [11]:
# Initialize object for term count co-occurrences
term_counts = ERPSC_Count()

# Set erp and cog lists as terms to use
term_counts.set_erps(erps)
term_counts.set_cogs(cogs)

In [12]:
# Scrape the co-occurrence of terms data
term_counts.scrape_data()

In [13]:
# Check the most commonly associated COG term for each ERP
term_counts.check_erps()

For the  P50   the most common association is 	 auditory   with 	 %06.72
For the  P100  the most common association is 	 vision     with 	 %09.69
For the  P200  the most common association is 	 auditory   with 	 %32.79
For the  P300  the most common association is 	 auditory   with 	 %21.00
For the  P3a   the most common association is 	 auditory   with 	 %51.98
For the  P3b   the most common association is 	 attention  with 	 %41.79
For the  P600  the most common association is 	 language   with 	 %55.22
For the  N100  the most common association is 	 auditory   with 	 %63.75
For the  N170  the most common association is 	 attention  with 	 %19.05
For the  N200  the most common association is 	 auditory   with 	 %34.69
For the  N270  the most common association is 	 attention  with 	 %30.36
For the  N2pc  the most common association is 	 attention  with 	 %88.74
For the  N400  the most common association is 	 language   with 	 %43.19
For the  MMN   the most common association is 	 aud

In [14]:
# Check the most commonly associated ERP term for each COG
term_counts.check_cogs()

For  language             the strongest associated ERP is 	 P600  with 	 %55.22
For  memory               the strongest associated ERP is 	 N400  with 	 %22.27
For  attention            the strongest associated ERP is 	 N2pc  with 	 %88.74
For  motor                the strongest associated ERP is 	 MMN   with 	 %14.50
For  decision making      the strongest associated ERP is 	 ERN   with 	 %05.15
For  vision               the strongest associated ERP is 	 CNV   with 	 %10.19
For  auditory             the strongest associated ERP is 	 MMN   with 	 %68.64
For  emotion              the strongest associated ERP is 	 N170  with 	 %15.91
For  categorization       the strongest associated ERP is 	 N170  with 	 %07.36
For  reward               the strongest associated ERP is 	 ERN   with 	 %06.47
For  spatial              the strongest associated ERP is 	 N2pc  with 	 %36.75
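
The same percentage (for example 55.22 for P600 and language) appears in both listings above, which suggests that co-occurrence counts are normalized by the number of papers for the ERP in both cases. The sketch below works under that assumption; the function names are hypothetical and the counts are toy numbers, not results from the scrape.

```python
# Toy counts for illustration only; not results from the scrape
erp_totals = {'P600': 500, 'N400': 2000}
cooc = {
    'P600': {'language': 250, 'memory': 50},
    'N400': {'language': 800, 'memory': 400},
}


def percent_assoc(cooc, erp_totals):
    """Share of each ERP's papers that also mention each cognitive term, in percent."""
    return {erp: {cog: 100.0 * n / erp_totals[erp] for cog, n in row.items()}
            for erp, row in cooc.items()}


def top_cog_per_erp(percents):
    """For each ERP, the cognitive term with the highest percentage."""
    return {erp: max(row, key=row.get) for erp, row in percents.items()}


def top_erp_per_cog(percents):
    """For each cognitive term, the ERP with the highest percentage."""
    cogs = {cog for row in percents.values() for cog in row}
    return {cog: max(percents, key=lambda erp: percents[erp].get(cog, 0.0))
            for cog in cogs}


percents = percent_assoc(cooc, erp_totals)
best_cog = top_cog_per_erp(percents)
best_erp = top_erp_per_cog(percents)
```

Normalizing by the ERP count keeps very common cognitive terms from dominating purely because they have more papers overall, though rare ERPs with few papers can still produce large percentages from small counts.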


In [15]:
# Check the terms with the most papers
term_counts.check_top()

The most studied ERP is  P300    with    10099 papers
The most studied COG is  motor   with   353054  papers


In [16]:
# Check how many papers were found for each term - ERPs
term_counts.check_counts('erp')

P50   -     8190
P100  -     2177
P200  -      854
P300  -    10099
P3a   -      783
P3b   -      773
P600  -      536
N100  -      902
N170  -      924
N200  -      588
N270  -       56
N2pc  -      302
N400  -     1850
MMN   -     2242
LPC   -     2438
CNV   -     6264
ERN   -      835
ELAN  -      225
CPS   -     6201
LRP   -     3287
LDN   -      536


In [17]:
# Check how many papers were found for each term - COGs
term_counts.check_counts('cog')

language           -     136063
memory             -     213658
attention          -     336831
motor              -     353054
decision making    -     149559
vision             -     136018
auditory           -     114602
emotion            -      26765
categorization     -      11404
reward             -      31530
spatial            -     224131


In [18]:
# Save pickle file of results
term_counts.save_pickle()

In [13]:
# Load from pickle file
term_counts = load_pickle_counts()

## Words Analysis

This analysis searches PubMed for papers that mention specific ERPs. It then scrapes the titles, abstract words, and publication years of all those papers so that the data can be used for further analysis.
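
The word-counting part of this analysis can be sketched with the standard library alone. This is a minimal illustration, not the `erpsc` implementation: `tokenize` and `word_freqs` are hypothetical names, the stopword list is a tiny stand-in, `collections.Counter` stands in for `nltk.FreqDist` (which is a `Counter` subclass), and the abstracts are made-up examples. It is written for Python 3.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a fuller list (e.g. from nltk) would be used in practice
STOPWORDS = {'the', 'of', 'and', 'in', 'to', 'a', 'was', 'were', 'for', 'with'}


def tokenize(abstract):
    """Lowercase an abstract and split it into words, dropping stopwords."""
    # Words start with a letter and may contain digits or hyphens (e.g. 'event-related', 'p600')
    words = re.findall(r"[^\W\d_][\w-]*", abstract.lower())
    return [w for w in words if w not in STOPWORDS]


def word_freqs(abstracts):
    """Combine the words from all abstracts for one ERP and count them."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(tokenize(abstract))
    return counts


# Made-up abstracts standing in for scraped P600 papers
abstracts = [
    'Syntactic violations in sentences elicited a P600.',
    'The P600 was larger for syntactic than semantic anomalies.',
]
freqs = word_freqs(abstracts)
top = freqs.most_common(2)
```

Calling `most_common(n)` on the per-ERP counter gives the kind of top-word listing that `check_words` prints.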

In [3]:
# Initialize words-analysis object
words_analysis = ERPSC_Words()

In [4]:
# Set ERP terms
words_analysis.set_erps(erps)

In [5]:
# Scrape word data for all ERP abstracts
words_analysis.scrape_data()

In [6]:
# Combine words from each article together
words_analysis.comb_words()

In [7]:
# Compute frequency distributions for words for each ERP
words_analysis.freq_dists()

In [12]:
# Check which words are most frequent for each ERP
words_analysis.check_words(9)

P50 :  nf-κb , expression , cells , cell , nuclear , study , increased , activation , protein , 
P100 :  visual , nf-κb , patients , processing , study , expression , results , cells , cell , 
P200 :  processing , study , results , auditory , patients , showed , potentials , compared , event-related , 
P300 :  expression , histone , cell , cells , results , study , protein , also , gene , 
P3a :  auditory , attention , event-related , processing , stimuli , task , study , p3b , results , 
P3b :  event-related , processing , task , study , results , erp , amplitude , p3a , cognitive , 
P600 :  syntactic , processing , semantic , results , sentences , language , erp , n400 , event-related , 
N100 :  auditory , study , amplitude , processing , potentials , patients , sensory , results , response , 
N170 :  face , faces , processing , visual , facial , early , emotional , event-related , results , 
N200 :  p300 , patients , potentials , event-related , amplitude , processing , cognitive , 

In [39]:
# Save pickle of word object
words_analysis.save_pickle()

In [2]:
# Load word pickle object
words_analysis = load_pickle_words()

In [9]:
# Test Code

#page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="N270"AND"Language"')
#page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&field=word&term="P300"&retmax=10')
page = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/27354714')

# Name the parser explicitly so the result does not depend on what is installed
page_soup = BeautifulSoup(page.content, 'html.parser')

#counts = page_soup.find_all('count')

#for i in range(0, len(counts)):
#    count = counts[i]
#    ext = count.text
#    print(int(ext))

# id_strs was not defined in this cell; assumed here to be the ID of the article fetched above
id_strs = '27354714'
art_page = requests.get('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=' + id_strs)

# The "xml" parser requires lxml to be installed
art_page_soup = BeautifulSoup(art_page.content, "xml")
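
Once an `efetch` response is in hand, the title, abstract, and year can be pulled out of the XML. The sketch below swaps in the standard library's `xml.etree.ElementTree` for BeautifulSoup so that it runs without extra dependencies. `parse_articles` is a hypothetical helper, and `SAMPLE` is a trimmed, made-up record that follows the layout of PubMed `efetch` XML rather than a real article.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up record that follows the layout of PubMed efetch XML
SAMPLE = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <Journal><JournalIssue><PubDate><Year>2016</Year></PubDate></JournalIssue></Journal>
        <ArticleTitle>An example title.</ArticleTitle>
        <Abstract><AbstractText>An example abstract about the P300.</AbstractText></Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(xml_text):
    """Extract PMID, title, abstract text, and year from each article in an efetch response."""
    articles = []
    for art in ET.fromstring(xml_text.strip()).iter('PubmedArticle'):
        # Structured abstracts have several AbstractText elements, so join them
        abstract = ' '.join(''.join(el.itertext()) for el in art.iter('AbstractText'))
        articles.append({
            'pmid': art.findtext('MedlineCitation/PMID'),
            'title': art.findtext('.//ArticleTitle'),
            'abstract': abstract,
            # None when the record has no PubDate/Year element
            'year': art.findtext('.//PubDate/Year'),
        })
    return articles


articles = parse_articles(SAMPLE)
```

With a live response, `parse_articles(art_page.text)` would be the equivalent call; the abstract strings can then be passed on to the word-counting step.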