## Counts Scraping

Co-Occurrence of Terms Analysis

This analysis checks how often pre-selected cognitive (COG) terms appear in abstracts together with ERP terms.

It searches PubMed for papers that contain both a specified ERP term and a specified COG term. The data extracted is the number of papers that contain both terms, which is used to infer which cognitive terms each ERP is associated with.
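
As a rough illustration of how raw co-occurrence counts could be turned into percentages like those reported later in this notebook, the sketch below divides each co-occurrence count by the total number of papers found for the ERP. The normalization choice, the `to_percent` helper, and all numbers are assumptions made for illustration; they are not taken from `erpsc` or from the actual scrape.

```python
# Hypothetical counts, for illustration only (not real scrape results)
erp_counts = {'ERP_A': 2000, 'ERP_B': 500}
cooc_counts = {
    'ERP_A': {'language': 100, 'memory': 400},
    'ERP_B': {'language': 150, 'memory': 50},
}

def to_percent(cooc, totals):
    """Normalize co-occurrence counts by the total paper count of each ERP."""
    return {erp: {term: 100 * n / totals[erp] for term, n in terms.items()}
            for erp, terms in cooc.items()}

percent = to_percent(cooc_counts, erp_counts)

# Print one line per ERP-term pair
for erp, terms in percent.items():
    for term, pct in terms.items():
        print(f'{erp} & {term}: {pct:.2f}%')
```

Normalizing by the ERP total makes rows comparable even when one ERP label matches far more papers than another.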

NOTE:
- The COG terms used here are a somewhat arbitrary selection; a better, more principled set of terms is needed.

In [1]:
# TODO:
# - add erp_keys and term_keys list to use for axes
# - fix duplicate erp terms
# - clustering: latent factors on ERPs?

In [2]:
%load_ext autoreload 

In [3]:
# Import custom code
%autoreload 2
from erpsc.count import Count
from erpsc.core.io import save_pickle_obj, load_pickle_obj

In [4]:
# Initialize object for term co-occurrence counts
counts = Count()

In [5]:
# Load ERPs and terms from file
counts.set_erps_file()
counts.set_terms_file('cognitive')

In [6]:
# OR: Set small set of ERPs and terms for tests

# Small test set of words
erps = [['P100', 'P1'], 'N400']
excludes = ['', ['protein', 'gene', 'cell']]
cog_terms = ['language', 'memory'] 

# Add ERPs and terms
counts.set_erps(erps)
counts.set_exclusions(excludes)
counts.set_terms(cog_terms)

Unloading previous ERP words.
Unloading previous terms words.
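
To make the role of synonyms and exclusions concrete, here is a minimal sketch of how a search string for one ERP-term pair might be assembled. The `as_list` and `build_query` helpers and the exact boolean syntax are hypothetical; the query format that `erpsc` actually sends is not shown in this notebook.

```python
def as_list(entry):
    """Wrap a bare string in a list so every entry is a list of synonyms."""
    return [entry] if isinstance(entry, str) else list(entry)

def build_query(erp, term, exclusions=()):
    """Join synonyms with OR, require both groups with AND, drop exclusions with NOT."""
    def group(words):
        return '(' + ' OR '.join(f'"{w}"' for w in words) + ')'

    query = group(as_list(erp)) + ' AND ' + group(as_list(term))

    # An empty string is treated as "no exclusions for this ERP"
    excl = [w for w in as_list(exclusions) if w]
    if excl:
        query += ' NOT ' + group(excl)
    return query

q_p100 = build_query(['P100', 'P1'], 'language', '')
q_n400 = build_query('N400', 'memory', ['protein', 'gene', 'cell'])
print(q_p100)
print(q_n400)
```

Treating each ERP as a list of synonyms means that `'N400'` and `['P100', 'P1']` can be handled by the same code path.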


In [7]:
# Scrape the co-occurrence of terms data
counts.scrape_data(db='pubmed', verbose=True)

Running counts for:  P100
Running counts for:  N400
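
For readers unfamiliar with how a paper count can be obtained, the sketch below builds an NCBI E-utilities `esearch` URL with `rettype=count` and parses the `<Count>` element from the XML reply. It is an assumption that `scrape_data` works along these lines. The `count_url` and `parse_count` helpers are hypothetical, the sample XML is a hand-written stand-in rather than a real response, and no request is sent.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_SEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def count_url(query, db='pubmed'):
    """Build an esearch URL that asks only for the number of matching records."""
    params = {'db': db, 'term': query, 'rettype': 'count'}
    return EUTILS_SEARCH + '?' + urlencode(params)

def parse_count(xml_text):
    """Read the integer inside the top-level <Count> element of an esearch reply."""
    return int(ET.fromstring(xml_text).findtext('Count'))

# Hand-written stand-in for a reply (not real data)
sample_reply = '<eSearchResult><Count>123</Count></eSearchResult>'

url = count_url('"N400" AND "language"')
n_papers = parse_count(sample_reply)
print(url)
print(n_papers)
```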


In [None]:
# Save pickle file of results
save_pickle_obj(counts, 'test2')

In [None]:
# Load from pickle file
counts = load_pickle_obj('CogScrape_counts')
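
The save and load helpers presumably wrap Python's `pickle` module; that is an assumption, since their implementation is not shown here. A minimal standard-library round trip, using a hypothetical stand-in object and a temporary file, looks like this:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a results object
results = {'erps': ['P100', 'N400'], 'terms': ['language', 'memory']}

# Write to a temporary directory so the example leaves the working directory untouched
path = os.path.join(tempfile.mkdtemp(), 'counts_example.p')

with open(path, 'wb') as f:
    pickle.dump(results, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == results)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.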

## Scrape Information

In [8]:
# Check database information
counts.db_info

{'count': '27035335',
 'dbbuild': 'Build170322-2207m.1',
 'dbname': 'pubmed',
 'description': 'PubMed bibliographic record',
 'lastupdate': '2017/03/23 02:08',
 'menuname': 'PubMed'}

In [9]:
# Check requester details
counts.req.check()

Requester object is active: 	 False
Number of requests sent: 	 11
Requester opened: 		 23:48 Wednesday 22 March
Requester closed: 		 23:48 Wednesday 22 March


## Check Counts

In [10]:
# Check the most commonly associated COG term for each ERP
counts.check_cooc_erps()

For the  P100  the most common association is 	 memory     with 	 %00.86
For the  N400  the most common association is 	 language   with 	 %31.33


In [11]:
# Check the most commonly associated ERP for each term
counts.check_cooc_terms()

For  language     the strongest associated ERP is 	 N400  with 	 %31.33
For  memory       the strongest associated ERP is 	 N400  with 	 %19.69
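
The two checks above can be thought of as taking the maximum along each row or column of the percentage matrix. The sketch below shows that idea with plain lists; the helper names and the matrix values are hypothetical, and it is an assumption that `check_cooc_erps` and `check_cooc_terms` work this way.

```python
erp_labels = ['ERP_A', 'ERP_B']
term_labels = ['language', 'memory']

# Hypothetical percentages; rows are ERPs, columns are terms
dat_percent = [[5.0, 20.0],
               [30.0, 10.0]]

def top_term_per_erp(mat, erps, terms):
    """For each row (ERP), return the term with the highest percentage."""
    out = {}
    for erp, row in zip(erps, mat):
        best = max(range(len(row)), key=row.__getitem__)
        out[erp] = (terms[best], row[best])
    return out

def top_erp_per_term(mat, erps, terms):
    """For each column (term), return the ERP with the highest percentage."""
    out = {}
    for j, term in enumerate(terms):
        col = [row[j] for row in mat]
        best = max(range(len(col)), key=col.__getitem__)
        out[term] = (erps[best], col[best])
    return out

by_erp = top_term_per_erp(dat_percent, erp_labels, term_labels)
by_term = top_erp_per_term(dat_percent, erp_labels, term_labels)
print(by_erp)
print(by_term)
```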


In [12]:
# Check the terms with the most papers
counts.check_top()

The most studied ERP is  P100    with    29386 papers
The most studied term is  memory  with   197776  papers


In [13]:
# Check how many papers were found for each term - ERPs
counts.check_counts('erp')

P100  -    29386
N400  -     1915


In [14]:
# Check how many papers were found for each term - COGs
counts.check_counts('term')

language           -     112090
memory             -     197776


## Viz / Exploration - Tests

In [None]:
# Create axis labels
counts.erp_labels = [erp[0] for erp in counts.erps]
counts.term_labels = [term[0] for term in counts.terms]

In [None]:
%autoreload 2
%matplotlib inline

import pandas as pd

import sklearn.metrics.pairwise as pp

import scipy.cluster.hierarchy as hier

import seaborn as sns

import matplotlib.pyplot as plt

In [None]:
# Plot heatmap of co-occurrence percentages (ERPs by terms)
f, ax = plt.subplots(figsize=(10, 12))
sns.heatmap(counts.dat_percent, square=False,
            xticklabels=counts.term_labels, 
            yticklabels=counts.erp_labels)
f.tight_layout()

## Similarity ERPs

In [None]:
# Calculate similarity between all ERPs
sim = pp.cosine_similarity(counts.dat_percent)
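
Cosine similarity compares the direction of two rows and ignores their overall magnitude, so an ERP with many papers and one with few can still come out as similar if their term profiles have the same shape. The following is a minimal pure-Python sketch of the computation on made-up vectors; it is meant to illustrate the formula, not to replace the scikit-learn call.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms.

    Raises ZeroDivisionError if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows):
    """Square matrix of cosine similarities between every pair of rows."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]

# Made-up vectors: the second is a scaled copy of the first, the third is orthogonal to it
rows = [[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]]
sim_example = pairwise_cosine(rows)

for row in sim_example:
    print([round(x, 3) for x in row])
```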

In [None]:
# Plot ERP similarities
f, ax = plt.subplots(figsize=(12, 14))
sns.heatmap(sim, square=True, 
            xticklabels=[term[0] for term in counts.erps],
            yticklabels=[term[0] for term in counts.erps])
f.tight_layout()

In [None]:
# Create dataframes for plotting clustermaps
dat_per_df = pd.DataFrame(counts.dat_percent,
                          index=[term[0] for term in counts.erps],
                          columns=[term[0] for term in counts.terms])
sim_df = pd.DataFrame(sim,
                      [term[0] for term in counts.erps],
                      [term[0] for term in counts.erps])

In [None]:
# Plot clustermap of co-occurrence percentages (terms as rows, ERPs as columns)
cg = sns.clustermap(dat_per_df.T, method='complete', metric='cosine', figsize=(12, 10))
_ = plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)

In [None]:
# Plot clustermap of the ERP similarity matrix
cg = sns.clustermap(sim_df, method='complete', metric='cosine')
_ = plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)

In [None]:
# Hierarchically cluster ERPs (complete linkage on cosine distance) and plot the dendrogram
Y = hier.linkage(counts.dat_percent,
                 method='complete',
                 metric='cosine')

plt.figure(figsize=(3,15))

Z = hier.dendrogram(Y, orientation='left',
                    labels=[term[0] for term in counts.erps],
                    color_threshold=0.25,
                    leaf_font_size=12)
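
A dendrogram shows the full merge tree; to get explicit group memberships, the same linkage can be cut at a distance threshold. The sketch below does this with `scipy.cluster.hierarchy.fcluster` on a made-up matrix, using the same 0.25 value as `color_threshold` above. The toy data is hypothetical, and treating the cut as roughly matching the colored branches is an assumption rather than something this notebook verifies.

```python
import numpy as np
import scipy.cluster.hierarchy as hier

# Made-up profiles: rows 0-1 point in one direction, rows 2-3 in another
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.1, 0.9]])

# Same settings as the notebook cell: complete linkage on cosine distance
link = hier.linkage(toy, method='complete', metric='cosine')

# Cut the tree so that members of a flat cluster are within distance 0.25 of each other
cluster_ids = hier.fcluster(link, t=0.25, criterion='distance')
print(cluster_ids)
```

The numeric cluster ids are arbitrary; only which rows share an id is meaningful.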

In [None]:
"""
Y = hier.linkage(sim,
                 method='complete',
                 metric='cosine')

plt.figure(figsize=(5,15))

Z = hier.dendrogram(Y, orientation='left',
                    labels=[term[0] for term in counts.erps],
                    color_threshold=0.25,
                    leaf_font_size=12)
"""

In [None]:
"""
# EXAMPLE CODE:
# Compute and plot dendrogram.
fig = pylab.figure()
axdendro = fig.add_axes([0.09,0.1,0.2,0.8])
Y = sch.linkage(D, method='centroid')
Z = sch.dendrogram(Y, orientation='right')
axdendro.set_xticks([])
axdendro.set_yticks([])

# Plot distance matrix.
axmatrix = fig.add_axes([0.3,0.1,0.6,0.8])
index = Z['leaves']
D = D[index,:]
D = D[:,index]
im = axmatrix.matshow(D, aspect='auto', origin='lower')
axmatrix.set_xticks([])
axmatrix.set_yticks([])
"""

## Similarity Terms

In [None]:
# Calculate similarity between all terms
sim_t = pp.cosine_similarity(counts.dat_percent.T)

In [None]:
# Plot term similarities
f, ax = plt.subplots(figsize=(12, 14))
sns.heatmap(sim_t, square=True, 
            xticklabels=[term[0] for term in counts.terms],
            yticklabels=[term[0] for term in counts.terms])
f.tight_layout()

In [None]:
# Create dataframe of term similarities for plotting clustermaps
sim_t_df = pd.DataFrame(sim_t,
                        [term[0] for term in counts.terms],
                        [term[0] for term in counts.terms])

In [None]:
sns.clustermap?

In [None]:
# Plot clustermap of the term similarity matrix
cg = sns.clustermap(sim_t_df, method='complete', metric='cosine')
_ = plt.setp(cg.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)

In [None]:
# Hierarchically cluster terms (complete linkage on cosine distance) and plot the dendrogram
Y = hier.linkage(counts.dat_percent.T,
                 method='complete',
                 metric='cosine')

plt.figure(figsize=(3,15))

Z = hier.dendrogram(Y, orientation='left',
                    labels=[term[0] for term in counts.terms],
                    color_threshold=0.25,
                    leaf_font_size=12)

In [None]:
"""
Y = hier.linkage(sim_t,
                 method='complete',
                 metric='cosine')

plt.figure(figsize=(5,15))

Z = hier.dendrogram(Y, orientation='left',
                    labels=[term[0] for term in counts.terms],
                    color_threshold=0.25,
                    leaf_font_size=12)
"""