# TF-IDF

The data are paragraphs (one per line) from one of my papers, after removing all the headers, figures, and references.<br>
https://doi.org/10.1073/pnas.1810316115

In [1]:
import numpy as np
import string

from tfidf import TFIDF
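
The internals of the `TFIDF` class are not shown in this notebook. As a reminder of what a TF-IDF score is, here is a minimal sketch that assumes the common textbook definition: term frequency is the word count divided by the document length, and inverse document frequency is `log(N / df)`. The function name `tf_idf_scores` and the toy documents are illustrative only, and the actual class may use a different normalization or smoothing.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return one {word: score} dict per tokenized document.

    Assumes tf = count / len(doc) and idf = log(N / df); the TFIDF class
    used in this notebook may define these terms differently.
    """
    n_docs = len(docs)
    # document frequency: in how many documents each word occurs
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# toy documents, already split into words
docs = [
    "the receptor binds the agonist".split(),
    "the energy landscape of the receptor".split(),
    "the energy barrier".split(),
]
scores = tf_idf_scores(docs)
print(scores[0])
```

Under this definition, a word that occurs in every document (here `the`) scores exactly zero, while a word unique to one document scores highest there.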

### Data

In [2]:
# read the file line by line; each line is one paragraph (document)
# the context manager closes the file once reading is done
with open("sample.txt", "r") as raw_data:
    data = raw_data.readlines()
# print 3 example paragraphs (lines)
data[:3]

['\ufeffG-protein coupled receptors (GPCRs) are a large group of membrane-bound receptor proteins that are involved in a plethora of diverse processes (e.g. vision, hormone response). In mammals, and particularly in humans, GPCRs are involved in many signal transduction pathways and as such are heavily studied for their immense pharmaceutical potential. Indeed, a large fraction of drugs target various GPCRs, and drug-development is often aimed at GPCRs. Therefore, understanding the activation of GPCRs is a challenge of a major importance both from fundamental and practical considerations. And yet, despite the remarkable progress in structural understanding, we still do not have a translation of the structural information to an energy-based picture. Here we use coarse grained (CG) modeling to chart the free energy landscape of the activation process of the β-2 adrenergic receptor (β2AR; a class A GPCR) as a representative GPCR. The landscape provides the needed tool for analyzing the pr

Note: the second paragraph is actually an empty line. This should not be an issue when working with a list of actual documents rather than lines from a single document.
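
If the empty lines are unwanted, they can be filtered out before fitting. The sketch below uses a hypothetical helper, `drop_empty_lines`, that is not part of the `tfidf` module. The rest of this notebook does not apply it, so the document indices shown in the later outputs refer to the unfiltered `data`.

```python
def drop_empty_lines(lines):
    """Strip surrounding whitespace and discard lines that are empty afterwards."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line]


# small stand-in for the lines read from sample.txt
example = ["First paragraph.\n", "\n", "Second paragraph.\n"]
cleaned = drop_empty_lines(example)
print(cleaned)
```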

### Fitting

In [4]:
tfidf = TFIDF()

In [14]:
tfidf.fit(data)
len(tfidf.unique_words)

1320

In [15]:
# non-default punctuation removal
# in this particular case, hyphens are important for context and should not be removed
tfidf.fit(data, remove_punctuation=string.punctuation.replace('-', ''))
len(tfidf.unique_words)

1325
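
To illustrate what a custom punctuation string does, here is a sketch of one way such stripping could be implemented with `str.translate`. The helper `strip_punctuation` is hypothetical; whether the `TFIDF` class deletes the characters or replaces them with spaces is not shown in this notebook.

```python
import string


def strip_punctuation(text, remove=string.punctuation):
    """Delete every character listed in `remove` from `text`."""
    return text.translate(str.maketrans("", "", remove))


sentence = "G-protein coupled receptors (GPCRs), e.g. the β2AR."

# default: all ASCII punctuation is removed, including hyphens
default = strip_punctuation(sentence)
# hyphens excluded from the removal set, as in the cell above
keep_hyphens = strip_punctuation(sentence, string.punctuation.replace("-", ""))

print(default)
print(keep_hyphens)
```

With the default set, `G-protein` collapses to `Gprotein`; excluding the hyphen keeps the word intact.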

In [16]:
# user can request case-insensitivity
tfidf.fit(data, remove_punctuation=string.punctuation.replace('-', ''), ignore_case=True)
len(tfidf.unique_words)

1275

In [17]:
# user can exclude words from the TF-IDF analysis

# arbitrary list of words for demonstration
skip = ['and', 'or', 'a', 'an', 'with', 'if', 'is', 'are', 'were', 'to', 'our', 'but', 'since', 'this', 'been']

tfidf.fit(data, remove_punctuation=string.punctuation.replace('-', ''), ignore_case=True, skip_words=skip)
len(tfidf.unique_words)

1260
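
The effect of `ignore_case` and `skip_words` on the vocabulary can be pictured with a small tokenizer. This is only a sketch under the assumption that words are split on whitespace; the function `tokenize` is hypothetical and is not the class's actual implementation.

```python
def tokenize(text, ignore_case=False, skip_words=()):
    """Split on whitespace, optionally lowercase, and drop skipped words."""
    if ignore_case:
        text = text.lower()
    skip = set(skip_words)  # a set gives fast membership tests
    return [word for word in text.split() if word not in skip]


tokens = tokenize(
    "The receptor and the G-protein are coupled",
    ignore_case=True,
    skip_words=["and", "are", "the"],
)
print(tokens)
```

Lowercasing merges variants such as `The` and `the` into one vocabulary entry, and every skipped word removes one more, which is consistent with the unique-word count shrinking in the cells above.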

### Lookup documents by word

In [19]:
tfidf.search_word('energy')

array([51, 53, 22, 24, 42, 21, 10, 23,  6, 41,  9, 19, 14, 40, 26, 25,  2,
        8, 49,  7, 20,  0, 12, 15, 13, 17, 18,  3, 11,  1,  4,  5, 16, 56,
       27, 39, 54, 52, 50, 48, 47, 46, 45, 44, 43, 38, 55, 37, 36, 35, 34,
       33, 32, 31, 30, 29, 28])

In [20]:
data[51]

'Figure 4 – Cross-sections of the CG energy landscape. The energies of the receptor and G-protein states, following several linear paths, are depicted by colored lines accompanied by cartoons adjacent to each point (shown as circles). The vertical black arrow denotes the difference in energy associated with binding of the adrenaline agonist. The purple and gray curves denote inactive receptor. The top of the black arrow denotes the energy of the agonist-free activated receptor and the bottom of the arrow, and all lines starting at that point, denote the agonist-bound receptor states.\n'

In [21]:
# user can view the actual TF-IDF values
tfidf.search_word('energy', return_tf_idf=True)

(array([51, 53, 22, 24, 42, 21, 10, 23,  6, 41,  9, 19, 14, 40, 26, 25,  2,
         8, 49,  7, 20,  0, 12, 15, 13, 17, 18,  3, 11,  1,  4,  5, 16, 56,
        27, 39, 54, 52, 50, 48, 47, 46, 45, 44, 43, 38, 55, 37, 36, 35, 34,
        33, 32, 31, 30, 29, 28]),
 array([0.02959425, 0.0259302 , 0.02544552, 0.02135428, 0.01930972,
        0.01762247, 0.01538232, 0.01452091, 0.01375086, 0.01148806,
        0.01019727, 0.0101403 , 0.00955323, 0.00935626, 0.00889762,
        0.00848184, 0.00762653, 0.00703533, 0.00546721, 0.00537016,
        0.00467813, 0.00451521, 0.00290884, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.

Many of the returned values are uninformative zeros.

In [22]:
# user can prune all zero values
tfidf.search_word('protein', prune=True, return_tf_idf=True)

(array([42, 10, 37, 14, 47, 26, 32, 34, 12]),
 array([0.03927291, 0.0312852 , 0.01963645, 0.01942975, 0.01845827,
        0.01809634, 0.01774833, 0.0138784 , 0.00591611]))
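
The two outputs above, a ranking of document indices and the same ranking with zero scores pruned, can be reproduced in spirit with a few lines. This sketch uses plain Python lists for clarity (the class itself returns NumPy arrays), and both the helper `rank_documents` and the scores are made up for illustration.

```python
def rank_documents(scores, prune=False):
    """Return (document indices, scores), sorted by descending score.

    With prune=True, documents whose score is zero are dropped.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if prune:
        order = [i for i in order if scores[i] > 0]
    return order, [scores[i] for i in order]


# made-up scores of one word across five documents
word_scores = [0.0, 0.02, 0.0, 0.05, 0.01]

full_order, full_scores = rank_documents(word_scores)
pruned_order, pruned_scores = rank_documents(word_scores, prune=True)
print(full_order, full_scores)
print(pruned_order, pruned_scores)
```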

In [23]:
data[42]

'The energy profile of the conformational transition, going from the inactive to the active β2AR conformation, was calculated using an under-development method which calculates the normal modes (NM) of the protein, and performs a MC simulation to sample the transition. See the SI appendix for more details.\n'

### Lookup words by document

In [24]:
tfidf.get_important_list()

[['gpcrs'],
 ['binding'],
 ['explore'],
 ['binding'],
 ['receptors'],
 ['agonists'],
 ['having'],
 ['the'],
 ['effective'],
 ['components'],
 ['importance'],
 ['binding'],
 ['based'],
 ['conformations'],
 ['dipole'],
 ['only'],
 ['binding'],
 ['binding'],
 ['the'],
 ['presents'],
 ['when'],
 ['barrier'],
 ['release'],
 ['the'],
 ['it'],
 ['pre-coupling'],
 ['structure'],
 ['binding'],
 ['binding'],
 ['gtp'],
 ['binding'],
 ['binding'],
 ['provide'],
 ['energetics'],
 ['effects'],
 ['over'],
 ['ternary'],
 ['gcprs'],
 ['binding'],
 ['binding'],
 ['3sn6'],
 ['snapshots'],
 ['transition'],
 ['binding'],
 ['national'],
 ['gtp'],
 ['binding'],
 ['text'],
 ['binding'],
 ['ii2'],
 ['binding'],
 ['arrow'],
 ['binding'],
 ['the'],
 ['binding'],
 ['structures'],
 ['binding']]

The output above lists the most relevant word for each document.

In [26]:
# user can request the N most relevant words per document (here N = 5)
# output truncated to the first 5 documents for readability
tfidf.get_important_list(5)[:5]

[['gpcrs', 'involved', 'large', 'the', 'processes'],
 ['binding', 'part', 'transduction', 'guide', 'bind'],
 ['explore', 'gpcr', 'systems', 'the', 'it'],
 ['binding', 'part', 'transduction', 'guide', 'bind'],
 ['receptors', 'knowledge', 'organisms', 'await', 'countless']]
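
Picking the most relevant words for a document amounts to sorting that document's words by score and keeping the first few. The sketch below assumes per-document scores stored in a dict; the helper `top_words` and the numbers are hypothetical and do not come from the fitted model.

```python
def top_words(doc_scores, k=1):
    """Return the k words with the highest score, best first."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# made-up scores for a single document
doc_scores = {"gpcrs": 0.09, "the": 0.01, "involved": 0.05, "large": 0.04}

print(top_words(doc_scores))
print(top_words(doc_scores, k=3))
```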