Example Notebook
================

<h2>1. Run LDA the first time</h2>

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

In [None]:
fragment_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_fragments.csv'
neutral_loss_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_losses.csv'
mzdiff_filename = None
ms1_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms1.csv'
ms2_filename = basedir + 'input/final/Beer_3_full1_5_2E5_pos_ms2.csv'
ms2lda = Ms2Lda.lcms_data_from_R(fragment_filename, neutral_loss_filename, mzdiff_filename,
                                 ms1_filename, ms2_filename)

In [None]:
### all the parameters you need to specify to run LDA ###

n_topics = 300 # cross-validation suggests 300 - 400 topics
n_samples = 500 # 100 is probably okay for testing; for a manuscript, use at least 500 - 1000
n_burn = 250 # if 0 then we only use the last sample
n_thin = 5 # every n-th sample to use for averaging after burn-in
alpha = 50.0/n_topics # hyper-parameter for document-topic distributions
beta = 0.1 # hyper-parameter for topic-word distributions

ms2lda.run_lda(n_topics, n_samples, n_burn, n_thin, alpha, beta)
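To get a feel for what the sampling parameters above imply, the sketch below counts how many samples would be kept for averaging. It assumes a common convention (discard the first n_burn samples, then keep every n_thin-th one) plus the special case noted in the comment above, where n_burn = 0 means only the last sample is used. The helper `kept_sample_indices` is hypothetical and not part of Ms2Lda; the exact indexing inside run_lda() may differ.

```python
def kept_sample_indices(n_samples, n_burn, n_thin):
    """Return the indices of the samples kept for averaging.

    Assumed convention (not taken from the Ms2Lda source):
    samples are numbered 0..n_samples-1, the first n_burn are discarded,
    and every n_thin-th sample after that is kept. If n_burn is 0,
    only the last sample is used, as the parameter comment states.
    """
    if n_thin < 1:
        raise ValueError("n_thin must be at least 1")
    if n_burn == 0:
        return [n_samples - 1]
    return list(range(n_burn, n_samples, n_thin))

n_topics = 300
alpha = 50.0 / n_topics  # same formula as in the cell above
kept = kept_sample_indices(n_samples=500, n_burn=250, n_thin=5)

print(len(kept))        # how many samples are averaged under this convention
print(round(alpha, 4))  # value of the document-topic hyper-parameter
```

Under this assumed convention, raising n_thin reduces the number of averaged samples, while raising n_samples increases it.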

In [None]:
ms2lda.write_results('beer3_test_method3')
ms2lda.save_project('results/beer3pos.project')

<h2>2. Resuming from Previous Run</h2>

If you called save_project() above, you can resume directly from this step the next time you load the notebook.

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys
basedir = '../'
sys.path.append(basedir)

from lda_for_fragments import Ms2Lda

In [2]:
ms2lda = Ms2Lda.resume_from('results/beer3pos.project')

Project loaded from results/beer3pos.project time taken = 17.2980229855
 - input_filenames = 
	../input/final/Beer_3_full1_5_2E5_pos_fragments.csv
	../input/final/Beer_3_full1_5_2E5_pos_losses.csv
	../input/final/Beer_3_full1_5_2E5_pos_ms1.csv
	../input/final/Beer_3_full1_5_2E5_pos_ms2.csv
 - df.shape = (1588, 3171)
 - K = 300
 - alpha = 0.166666666667
 - beta = 0.1
 - last_saved_timestamp = Thu Aug  6 16:13:04 2015


<h2>3. Results</h2>

We need to threshold the document-topic and topic-word distributions produced by LDA, so that we can say which topics are used in which documents, and which words 'belong' to a topic.

In [3]:
# Fixed thresholding of 0.05 for the doc_topic and topic_word matrices
# NOTE: this is the setting we used previously
ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.05)

# Doc_topic matrix is thresholded at 0.05
# Topic_word matrix is thresholded by the smallest value in each row
# ms2lda.do_thresholding(th_doc_topic=0.05, th_topic_word=0.0)

# Both matrices are thresholded by the smallest value in each row 
# Seems a bit difficult to visualise the results effectively due to the very high number of MS1 peaks per topic?
# ms2lda.do_thresholding(th_doc_topic=0.0, th_topic_word=0.0)
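As a rough illustration of the fixed thresholding idea, the sketch below keeps, for each row of a probability matrix, only the columns whose value reaches the threshold. It uses a made-up toy matrix and a hypothetical helper `threshold_rows`; it is not the implementation behind do_thresholding(), and it does not cover the per-row thresholding that a value of 0.0 selects.

```python
def threshold_rows(matrix, threshold):
    """For each row, return the column indices whose value is >= threshold."""
    return [[j for j, value in enumerate(row) if value >= threshold]
            for row in matrix]

# Toy document-topic matrix: 2 documents x 3 topics (illustrative values only).
doc_topic = [[0.90, 0.07, 0.03],
             [0.02, 0.50, 0.48]]

topics_per_doc = threshold_rows(doc_topic, 0.05)
for d, topics in enumerate(topics_per_doc):
    print("document %d uses topics %s" % (d, topics))
```

The same helper applied to a topic-word matrix would list the words that 'belong' to each topic.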

Print the words in each topic.

In [4]:
ms2lda.print_topic_words()

Topic 0: fragment_176.87617 (0.305917347625), fragment_119.04873 (0.236429857941), loss_54.01014 (0.0593820676499), 

Topic 1: fragment_275.11062 (0.562559598374), fragment_159.02737 (0.190113652915), 

Topic 2: fragment_119.04988 (0.159031248426), fragment_272.14799 (0.120978493019), fragment_240.12245 (0.106465355378), loss_96.04228 (0.0558866742727), 

Topic 3: fragment_121.06488 (0.632489599641), fragment_103.05448 (0.112052512239), fragment_93.06981 (0.0846015130716), 

Topic 4: fragment_130.05044 (0.871631350453), 

Topic 5: fragment_53.00259 (0.49781597501), fragment_85.06476 (0.366667217937), 

Topic 6: loss_143.05788 (0.664422815706), fragment_215.13975 (0.117796346517), 

Topic 7: fragment_118.08616 (0.638593921988), fragment_132.11306 (0.0966651989898), 

Topic 8: fragment_153.06589 (0.191235471369), loss_119.06984 (0.126144913845), fragment_143.01705 (0.105913741549), loss_115.02683 (0.0635469443177), fragment_366.08148 (0.0608730622739), 

Topic 9: loss_53.04741 (0.2999990

Save the output CSV files.

In [5]:
ms2lda.write_results('beer3_test_method3')

Writing topics to results/beer3_test_method3/beer3_test_method3_topics.csv
Writing fragments x topics to results/beer3_test_method3/beer3_test_method3_all.csv
Writing topic docs to results/beer3_test_method3/beer3_test_method3_docs.csv
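Going by the log lines above, the three CSV files end up under results/<name>/ with the suffixes _topics, _all and _docs. The sketch below rebuilds those paths and reports whether each file exists. The helper `expected_result_paths` is hypothetical, and the naming pattern is inferred from this one log only, so treat it as an assumption.

```python
import os

def expected_result_paths(name, results_dir="results"):
    """Build the CSV paths that the log above suggests write_results() uses."""
    out_dir = os.path.join(results_dir, name)
    return {kind: os.path.join(out_dir, "%s_%s.csv" % (name, kind))
            for kind in ("topics", "all", "docs")}

paths = expected_result_paths("beer3_test_method3")
for kind in ("topics", "all", "docs"):
    status = "found" if os.path.exists(paths[kind]) else "missing"
    print("%s: %s (%s)" % (kind, paths[kind], status))
```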


In the list below, enter the MS1 peaks (and topics) that you want to colour differently in the graph page. You can find their names in the node labels in the graph page, or in the CSV matrices written above. In the graph page, you can also press the keyboard shortcuts 'C', 'S' and 'T' to hide all circles (topics), squares (documents) and triangles (special documents), respectively.

In [5]:
special_nodes = [
    'doc_372.18877_540.996',
    'doc_291.66504_547_239',
    'doc_308.17029_289.13',
    'topic_244',
    'topic_202'
]
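Before plotting, it can help to check that every entry follows one of the two label patterns used in the list above ('doc_...' for documents, 'topic_...' for topics). The sketch below groups the entries by prefix. The helper `split_special_nodes` is hypothetical and only inspects the strings; it does not check that the nodes exist in the project.

```python
def split_special_nodes(nodes):
    """Group node labels by prefix: documents, topics, and anything else."""
    docs = [n for n in nodes if n.startswith("doc_")]
    topics = [n for n in nodes if n.startswith("topic_")]
    unknown = [n for n in nodes
               if not n.startswith("doc_") and not n.startswith("topic_")]
    return docs, topics, unknown

special_nodes = [
    'doc_372.18877_540.996',
    'doc_291.66504_547_239',
    'doc_308.17029_289.13',
    'topic_244',
    'topic_202'
]

docs, topics, unknown = split_special_nodes(special_nodes)
print("documents: %s" % docs)
print("topics: %s" % topics)
print("unrecognised: %s" % unknown)
```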

If the 'interactive' parameter below is True, an interactive visualisation of the results is shown in a separate tab. Once you are done with it, you need to interrupt the kernel to stop it (from the menu above, Kernel > Interrupt).

In [None]:
ms2lda.plot_lda_fragments(consistency=0.0, sort_by="h_index", interactive=True, to_highlight=special_nodes)
# ms2lda.plot_lda_fragments(consistency=0.50, sort_by="in_degree")

Ranking topics ...
 - topic 0 h-index = 2
 - topic 1 h-index = 2
 - topic 2 h-index = 2
 - topic 3 h-index = 3
 - topic 4 h-index = 1
 - topic 5 h-index = 2
 - topic 6 h-index = 2
 - topic 7 h-index = 2
 - topic 8 h-index = 2
 - topic 9 h-index = 2
 - topic 10 h-index = 1
 - topic 11 h-index = 3
 - topic 12 h-index = 3
 - topic 13 h-index = 3
 - topic 14 h-index = 1
 - topic 15 h-index = 2
 - topic 16 h-index = 2
 - topic 17 h-index = 2
 - topic 18 h-index = 1
 - topic 19 h-index = 3
 - topic 20 h-index = 3
 - topic 21 h-index = 2
 - topic 22 h-index = 2
 - topic 23 h-index = 3
 - topic 24 h-index = 1
 - topic 25 h-index = 2
 - topic 26 h-index = 2
 - topic 27 h-index = 2
 - topic 28 h-index = 2
 - topic 29 h-index = 3
 - topic 30 h-index = 1
 - topic 31 h-index = 1
 - topic 32 h-index = 3
 - topic 33 h-index = 1
 - topic 34 h-index = 2
 - topic 35 h-index = 2
 - topic 36 h-index = 1
 - topic 37 h-index = 3
 - topic 38 h-index = 2
 - topic 39 h-index = 1
 - topic 40 h-index = 1
 - topi
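The ranking above sorts topics by h-index. For reference, the sketch below computes the generic h-index of a list of counts: the largest h such that at least h items have a count of at least h. The notebook does not say which counts Ms2Lda uses for a topic, so the input here is purely illustrative and `h_index` is a hypothetical helper, not the library's code.

```python
def h_index(counts):
    """Largest h such that at least h of the counts are >= h."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Illustrative counts only (not taken from the beer3 data).
example_counts = [5, 3, 3, 1]
print(h_index(example_counts))
```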

