$\textbf{Text Vectorization}$
-

- A vector is a geometric object that has a magnitude and a direction.

- Text vectorization is the projection of words into a mathematical space while preserving information.
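
To make "magnitude and direction" concrete, here is a minimal sketch in plain Python. The helper names `magnitude` and `direction` are our own and are not part of any library used later in this notebook.

```python
import math

def magnitude(v):
    """Euclidean length of a vector given as a list of numbers."""
    return math.sqrt(sum(x * x for x in v))

def direction(v):
    """Unit vector pointing the same way as v (v must be non-zero)."""
    m = magnitude(v)
    return [x / m for x in v]

v = [3.0, 4.0]
print(magnitude(v))  # 5.0
print(direction(v))  # [0.6, 0.8]
```

Once a sentence is turned into a list of numbers, the same two quantities can be computed for it, which is what makes comparisons between texts possible.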

$\textbf{The Bag of Words Model}$
-

- BOW is a straightforward model for vectorizing sentences.

- BOW uses word frequencies to construct vectors.

- The BOW model is an orderless document representation: only the counts of the words matter.

- Because BOW does not take the position of words into account, we lose semantic information.

- The sentences are vectorized against a single vocabulary, built by joining the words of all sentences.

- The vocabulary acts as a reference for whether a specific word is present or absent in each sentence.

$EXAMPLE$

In [1]:
import re
import string

s1 = "dog sat mat."
s2 = "cat love dog."

def token_sentence(s):
    # Make a regular expression that matches all punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    # Use the regex
    res = regex.sub('', s)
    res = res.split()
    return res

new_s1 = token_sentence(s1)
new_s2 = token_sentence(s2)
vocabulary = list(set(new_s1 + new_s2))
vocabulary

['mat', 'sat', 'love', 'cat', 'dog']

In [2]:
new_s1

['dog', 'sat', 'mat']

In [3]:
BOW = [int(u in new_s1) for u in vocabulary]
BOW

[1, 1, 0, 0, 1]
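
The vector above only records presence (1) or absence (0). Since BOW is described in terms of word frequencies, the following sketch shows a count-based variant. The function `count_vector` and the toy token lists are hypothetical; they are not taken from the cells above.

```python
from collections import Counter

def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

tokens_a = ["dog", "sat", "mat", "dog"]  # hypothetical sentence with a repeated word
tokens_b = ["cat", "love", "dog"]

# Sorting gives a reproducible column order; list(set(...)) may differ between runs.
vocab = sorted(set(tokens_a + tokens_b))
print(vocab)                          # ['cat', 'dog', 'love', 'mat', 'sat']
print(count_vector(tokens_a, vocab))  # [0, 2, 0, 1, 1]
print(count_vector(tokens_b, vocab))  # [1, 1, 1, 0, 0]
```

Note that the order of the words inside each sentence plays no role in the result, which is exactly the loss of positional information mentioned earlier.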

$\textbf{Term Frequency Inverse Document Frequency (TF-IDF)}$
-

- A model largely used in search engines to query relevant documents.

- Two pieces of information are encoded: the term frequency and the inverse document frequency.

- The term frequency counts how often a term appears in a document.

- The inverse document frequency measures how informative a term is across the whole collection of documents.

- The inverse document frequency is calculated by logarithmically scaling the inverse fraction of the documents containing the word. This is obtained by dividing the total number of documents by the number of documents containing the term, followed by taking the logarithm of the ratio.

- The inverse document frequency measures how common or rare a term is among all documents.

The formulas are:
\begin{gather}
TF(t) = \frac{\text{number of times the term } t \text{ appears in a specific document}}{\text{total number of terms in the document}}
\end{gather}

\begin{gather}
IDF(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents with term } t}\right)
\end{gather}

\begin{gather}
\text{TF-IDF}(t) = TF(t) \cdot IDF(t)
\end{gather}

- TF-IDF carries more information than the plain BOW representation: instead of using raw word counts as BOW does, TF-IDF makes rare terms more prominent and down-weights common words such as the stopwords "is", "that", and "of".
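
The three formulas can be worked through by hand on a tiny corpus. The sketch below is a direct, unoptimized transcription using the natural logarithm. The function names and the toy documents are our own assumptions, and `idf` assumes the term occurs in at least one document.

```python
import math

def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(total documents / documents containing term); term must occur somewhere."""
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ["the", "dog", "sat"],
    ["the", "cat", "love", "dog"],
    ["the", "cat", "sat", "mat"],
]

print(tf_idf("the", docs[0], docs))            # 0.0, since "the" is in every document
print(round(tf_idf("dog", docs[0], docs), 3))  # 0.135
print(round(tf_idf("mat", docs[2], docs), 3))  # 0.275
```

A word that appears in every document gets an IDF of log(1) = 0, so its TF-IDF score vanishes no matter how often it occurs, while a word found in only one document keeps the highest IDF.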

In [9]:
!pip install PyPDF

Collecting PyPDF
  Downloading pypdf-4.0.1-py3-none-any.whl (283 kB)
Installing collected packages: PyPDF
Successfully installed PyPDF-4.0.1


$\text{Vectorization Using Gensim}$

In [10]:
from gensim import corpora
import spacy
from pypdf import PdfReader
nlp = spacy.load('en_core_web_sm')

documents = ["In the present contribution, we examine the link between societal crisis situations and belief in conspiracy theories. Contrary to common assumptions, belief in conspiracy theories has been prevalent throughout human history. We first illustrate historical incidents suggesting that societal crisis situations—defined as impactful and rapid societal change that calls established power structures, norms of conduct, or even the existence of specific people or groups into question—have stimulated belief in conspiracy theories. We then review the psychological literature to explain why this is the case. Evidence suggests that the aversive feelings that people experience when in crisis—fear, uncertainty, and the feeling of being out of control—stimulate a motivation to make sense of the situation, increasing the likelihood of perceiving conspiracies in social situations. We then explain that after being formed, conspiracy theories can become historical narratives that may spread through cultural transmission. We conclude that conspiracy theories originate particularly in crisis situations and may form the basis for how people subsequently remember and mentally represent a historical event"
             ,"What psychological factors drive the popularity of conspiracy theories, which explain important events as secret plots by powerful and malevolent groups? What are the psychological consequences of adopting these theories? We review the current research and find that it answers the first of these questions more thoroughly than the second. Belief in conspiracy theories appears to be driven by motives that can be characterized as epistemic (understanding one’s environment), existential (being safe and in control of one’s environment), and social (maintaining a positive image of the self and the social group). However, little research has investigated the consequences of conspiracy belief, and to date, this research does not indicate that conspiracy belief fulfills people’s motivations. Instead, for many people, conspiracy belief may be more appealing than satisfying. Further research is needed to determine for whom, and under what conditions, conspiracy theories may satisfy key psychological motives."
             ,"Belief in conspiracy theories—such as that the 9/11 terrorist attacks were an inside job or that the pharmaceutical industry deliberately spreads diseases—is a widespread and culturally universal phenomenon. Why do so many people around the globe believe conspiracy theories, and why are they so influential? Previous research focused on the proximate mechanisms underlying conspiracy beliefs but ignored the distal, evolutionary origins and functions. We review evidence pertaining to two competing evolutionary hypotheses: (a) conspiracy beliefs are a by-product of a suite of psychological mechanisms (e.g., pattern recognition, agency detection, threat management, alliance detection) that evolved for different reasons, or (b) conspiracy beliefs are part of an evolved psychological mechanism specifically aimed at detecting dangerous coalitions. This latter perspective assumes that conspiracy theories are activated after specific coalition cues, which produce functional counterstrategies to cope with suspected conspiracies. Insights from social, cultural and evolutionary psychology provide tentative support for six propositions that follow from the adaptation hypothesis. We propose that people possess a functionally integrated mental system to detect conspiracies that in all likelihood has been shaped in an ancestral human environment in which hostile coalitions—that is, conspiracies that truly existed—were a frequent cause of misery, death, and reproductive loss"
             ,"We consider the significance of belief in conspiracy theories for political ideologies. Although there is no marked ideological asymmetry in conspiracy belief, research indicates that conspiracy theories may play a powerful role in ideological processes. In particular, they are associated with ideological extremism, distrust of rival ideological camps, populist distrust of mainstream politics, and ideological grievances. The ‘conspiracy mindset’ characterizes the ideological significance of conspiracy belief, and is associated with measuring conspiracy belief by means of abstract propositions associated with aversion and distrust of powerful groups. We suggest that this approach does not pay sufficient attention to the nonrational character of specific conspiracy beliefs and thus runs the risk of mischaracterizing them, and mischaracterizing their ideological implications"
             ,"Although conspiracy theories have arguably always been an important feature of social life, they have only attracted the attention of social psychologists in recent years. The last decade, however, has seen an increase in social psychological research on this topic that has yielded many insights into the causes and consequences of conspiracy thinking. In this article, we draw on examples from our own programme of research to highlight how the methods and concepts of social psychology can be brought to bear on the study of conspiracy theories. Specifically, we highlight how basic social cognitive processes such as pattern perception, projection and agency detection predict the extent to which people believe in conspiracy theories. We then highlight the role of motivations such as the need for uniqueness, and the motivation to justify the system, in predicting the extent to which people adopt conspiracy explanations. We next discuss how conspiracy theories have important consequences for social life, such as decreasing engagement with politics and influencing people’s health and environmental decisions. Finally, we reflect on some of the limitations of research in this domain and consider some important avenues for future research."
             ,"To diagnose HAB-associated illnesses, providers need a basic awareness of HABs and the ability to identify clinical presentations and exposures.","HAB-associated illnesses are primarily a diagnosis of exclusion because clinical testing options for HAB toxins are lacking; ideally, providers have access to clinical diagnostic testing to rule out other possible causes. "
             ,"THE truth is out there”:1 conspiracy theories are all around us. In August City residents, with a margin of error of 3.5 percent, believed that officials 2004, a poll by Zogby International showed that 49 percent of New York of the U.S. government “knew in advance that attacks were planned on or around September 11, 2001, and that they consciously failed to act.”2 In a Scripps-Howard Poll in 2006, some 36 percent of respondents assented to the claim that “federal officials either participated in the attacks on the World Trade Center or took no action to stop them.”3 Sixteen percent said that it was either very likely or somewhat likely that “the collapse of the twin towers in New York was aided by explosives secretly planted in the two buildings.”4 Conspiracy theories can easily be found all over the world. Among sober-minded Canadians, a September 2006 poll found that 22 percent believed that “the attacks on the United States on September 11, 2001 had nothing to do with Osama Bin Laden and were actually a plot by influential Americans.”5 In a poll conducted in seven Muslim countries, 78 percent of respondents said that they do not believe the 9/11 attacks were carried out by Arabs.6 The most popular account, in these countries, is that 9/11 was the work of the U.S. or Israeligovernments.7 In China, a bestseller attributes various events (the rise of Hitler,the Asian financial crisis of 1997–1998, and environmental destruction in thedeveloping world) to the Rothschild banking dynasty; the analysis has been readand debated at high levels of business and government, and it appears to havehad an effect on discussions about currency policies.8 Throughout Americanhistory, race-related violence has often been spurred by false rumors, generallypointing to alleged conspiracies by one or another group.9What causes such theories to arise and spread? Are they important andperhaps even threatening, or merely trivial and even amusing? What can andshould government do about them? We aim here to sketch some psychologicaland social mechanisms that produce, sustain, and spread these theories; to showthat some of them are quite important and should be taken seriously; and to offersuggestions for governmental responses, both as a matter of policy and as amatter of law."]

In [11]:
texts = []
for document in documents:
    text = []
    doc = nlp(document)
    for w in doc:
        if not w.is_stop and not w.is_punct and not w.like_num:
            text.append(w.lemma_)
    texts.append(text)
# texts is the mini-corpus: one list of lemmas per document
print(texts)

[['present', 'contribution', 'examine', 'link', 'societal', 'crisis', 'situation', 'belief', 'conspiracy', 'theory', 'contrary', 'common', 'assumption', 'belief', 'conspiracy', 'theory', 'prevalent', 'human', 'history', 'illustrate', 'historical', 'incident', 'suggest', 'societal', 'crisis', 'situation', 'define', 'impactful', 'rapid', 'societal', 'change', 'call', 'establish', 'power', 'structure', 'norm', 'conduct', 'existence', 'specific', 'people', 'group', 'question', 'stimulate', 'belief', 'conspiracy', 'theory', 'review', 'psychological', 'literature', 'explain', 'case', 'evidence', 'suggest', 'aversive', 'feeling', 'people', 'experience', 'crisis', 'fear', 'uncertainty', 'feeling', 'control', 'stimulate', 'motivation', 'sense', 'situation', 'increase', 'likelihood', 'perceive', 'conspiracy', 'social', 'situation', 'explain', 'form', 'conspiracy', 'theory', 'historical', 'narrative', 'spread', 'cultural', 'transmission', 'conclude', 'conspiracy', 'theory', 'originate', 'particul

In [12]:
# Build the dictionary (word -> integer ID mapping) needed for the BOW representation of the mini-corpus
dictionary = corpora.Dictionary(texts)
print(dictionary.token2id)

{'assumption': 0, 'aversive': 1, 'basis': 2, 'belief': 3, 'call': 4, 'case': 5, 'change': 6, 'common': 7, 'conclude': 8, 'conduct': 9, 'conspiracy': 10, 'contrary': 11, 'contribution': 12, 'control': 13, 'crisis': 14, 'cultural': 15, 'define': 16, 'establish': 17, 'event': 18, 'evidence': 19, 'examine': 20, 'existence': 21, 'experience': 22, 'explain': 23, 'fear': 24, 'feeling': 25, 'form': 26, 'group': 27, 'historical': 28, 'history': 29, 'human': 30, 'illustrate': 31, 'impactful': 32, 'incident': 33, 'increase': 34, 'likelihood': 35, 'link': 36, 'literature': 37, 'mentally': 38, 'motivation': 39, 'narrative': 40, 'norm': 41, 'originate': 42, 'particularly': 43, 'people': 44, 'perceive': 45, 'power': 46, 'present': 47, 'prevalent': 48, 'psychological': 49, 'question': 50, 'rapid': 51, 'remember': 52, 'represent': 53, 'review': 54, 'sense': 55, 'situation': 56, 'social': 57, 'societal': 58, 'specific': 59, 'spread': 60, 'stimulate': 61, 'structure': 62, 'subsequently': 63, 'suggest': 6

$INSIGHTS$

- The dictionary holds every unique word in our corpus, which is focused on conspiracy theories plus two short passages on HAB-associated illnesses; `len(dictionary)` reports the exact count.

- Each word is indexed with an integer.

- The index is called a "word ID".

- The dictionary can now be used to map each word to its integer ID.

We use the doc2bow method, which, as the name suggests, converts each document to its bag-of-words representation.

In [13]:
corpus = [dictionary.doc2bow(text) for text in texts]
corpus

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 3),
  (4, 1),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 6),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 4),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1),
  (25, 2),
  (26, 2),
  (27, 1),
  (28, 3),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 1),
  (34, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 3),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1),
  (56, 5),
  (57, 1),
  (58, 3),
  (59, 1),
  (60, 1),
  (61, 2),
  (62, 1),
  (63, 1),
  (64, 2),
  (65, 5),
  (66, 1),
  (67, 1)],
 [(3, 4),
  (10, 6),
  (13, 1),
  (18, 1),
  (23, 1),
  (27, 2),
  (39, 1),
  (44, 2),
  (49, 3),
  (50, 1),
  (54, 1),
  (57, 2),
  (65, 4),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1),
  (73, 1),
  (74, 2),
  (75, 1),
  (76, 1),
  (77, 1),
  (78, 2)

- The output is a nested list.

- Each individual sublist is one document's bag-of-words representation.

- A reminder: you might see different numbers in your list, because the word-to-ID mapping can differ each time you create a dictionary.

- Unlike the example we demonstrated, where an absence of a word was a 0, we use tuples that represent (word_id, word_count).

- We can easily verify this by checking the original sentence, mapping each word to its integer ID and reconstructing our list.

- In this case several words occur more than once in a document (for example, word ID 10, "conspiracy", has a count of 6 in the first document). In smaller corpora made of short documents, most counts tend to be 1.
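
The (word_id, word_count) format can be reproduced without Gensim. The sketch below is a plain-Python stand-in: `build_token2id` and `to_bow` are hypothetical helpers, and the IDs they assign (order of first appearance) will generally not match the IDs Gensim assigns.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(text, token2id):
    """Return sorted (word_id, word_count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

toy_texts = [["dog", "sat", "mat", "dog"], ["cat", "love", "dog"]]
token2id = build_token2id(toy_texts)
print(token2id)                        # {'dog': 0, 'sat': 1, 'mat': 2, 'cat': 3, 'love': 4}
print(to_bow(toy_texts[0], token2id))  # [(0, 2), (1, 1), (2, 1)]
print(to_bow(toy_texts[1], token2id))  # [(0, 1), (3, 1), (4, 1)]

# Verify by mapping the IDs back to words.
id2token = {i: t for t, i in token2id.items()}
print([(id2token[i], n) for i, n in to_bow(toy_texts[0], token2id)])
# [('dog', 2), ('sat', 1), ('mat', 1)]
```

Words with a count of zero are simply left out of the list, which is why this sparse format stays compact even when the vocabulary is large.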

In [20]:
#storing your generated corpus

corpora.MmCorpus.serialize('/1_Conspiracy_Corpus.mm', corpus)

- It is more memory efficient to store your corpus on disk and load it later, because the stored corpus can be streamed so that at most one vector resides in RAM at a time.
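
The "one vector in RAM at a time" idea can be illustrated with a generator. The sketch below is not Gensim's Matrix Market format: it writes one JSON line per document to a temporary file (a path and format we made up for illustration) and reads the vectors back lazily. With Gensim itself, a serialized corpus is reopened with `corpora.MmCorpus(path)`.

```python
import json
import os
import tempfile

def save_vectors(path, vectors):
    """Write one sparse vector per line."""
    with open(path, "w", encoding="utf-8") as f:
        for vec in vectors:
            f.write(json.dumps(vec) + "\n")

def stream_vectors(path):
    """Yield vectors one by one; only the current line is held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield [tuple(pair) for pair in json.loads(line)]

toy_corpus = [[(0, 2), (1, 1)], [(1, 1), (2, 3)]]
path = os.path.join(tempfile.mkdtemp(), "toy_corpus.jsonl")  # hypothetical file name
save_vectors(path, toy_corpus)

for vec in stream_vectors(path):
    print(vec)
# [(0, 2), (1, 1)]
# [(1, 1), (2, 3)]
```

Because `stream_vectors` is a generator, a model that consumes it in a loop never needs the full corpus in memory.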

In [16]:
#Converting Bag-of-Words to TF-IDF representation
from gensim import models
tfidf = models.TfidfModel(corpus)

for document in tfidf[corpus]:
    print(document)

[(0, 0.09407283965287364), (1, 0.09407283965287364), (2, 0.09407283965287364), (3, 0.09407283965287364), (4, 0.09407283965287364), (5, 0.09407283965287364), (6, 0.09407283965287364), (7, 0.09407283965287364), (8, 0.09407283965287364), (9, 0.06271522643524909), (10, 0.07808751223917663), (11, 0.09407283965287364), (12, 0.09407283965287364), (13, 0.06271522643524909), (14, 0.25086090574099634), (15, 0.06271522643524909), (16, 0.09407283965287364), (17, 0.09407283965287364), (18, 0.04437219859082065), (19, 0.06271522643524909), (20, 0.09407283965287364), (21, 0.09407283965287364), (22, 0.09407283965287364), (23, 0.12543045287049817), (24, 0.09407283965287364), (25, 0.18814567930574727), (26, 0.18814567930574727), (27, 0.04437219859082065), (28, 0.2822185189586209), (29, 0.09407283965287364), (30, 0.06271522643524909), (31, 0.09407283965287364), (32, 0.09407283965287364), (33, 0.09407283965287364), (34, 0.06271522643524909), (35, 0.06271522643524909), (36, 0.09407283965287364), (37, 0.0940

- TF-IDF scores: The higher the score, the more important the word in the document.
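
The scores printed above do not follow the TF and IDF formulas given earlier term for term. As far as we can tell, `TfidfModel` with default settings multiplies the raw count by a base-2 logarithm IDF and then scales each document vector to unit length. The printed ratios agree with this reading for our eight documents: 0.0627 / 0.0941 is about 2/3, which equals log2(8/2) / log2(8/1). The sketch below re-implements that weighting in plain Python; treat it as an approximation under these assumptions, not as Gensim's actual code.

```python
import math

def tfidf_like_gensim_defaults(bow_corpus):
    """Weight = count * log2(N / df), then L2-normalize each document vector."""
    num_docs = len(bow_corpus)
    df = {}
    for bow in bow_corpus:
        for word_id, _ in bow:
            df[word_id] = df.get(word_id, 0) + 1

    result = []
    for bow in bow_corpus:
        weights = [(wid, cnt * math.log2(num_docs / df[wid])) for wid, cnt in bow]
        norm = math.sqrt(sum(w * w for _, w in weights))
        if norm == 0.0:
            result.append([])
            continue
        # Terms found in every document get weight 0 and are dropped.
        result.append([(wid, w / norm) for wid, w in weights if w > 0.0])
    return result

toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
print(tfidf_like_gensim_defaults(toy_bow))
# [[(1, 1.0)], [(2, 1.0)]]
```

Word 0 appears in both toy documents, so its weight is zero and it disappears from the output; the remaining word in each document carries the whole unit-length vector.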

$\textbf{N-Gramming}$
-

- Context is very important when working with text data.
- This context is lost during vector representation because only the word frequency is taken into account.
- An n-gram is a contiguous sequence of n items in the text. In our case the items are words, but depending on the use case they could also be letters, syllables, or, in the case of speech, phonemes.
- Mono-gram (unigram): n = 1
- Bi-gram: n = 2
- Tri-gram: n = 3
- An n-gram model is calculated through the conditional probability of a token given the preceding tokens.
- N-gramming can also be done by finding words that frequently appear close to each other.
- Bi-gramming is also called collocation: it locates pairs of words that are very likely to appear together.
- Example: "New Hampshire" is a single unit, not "New" and "Hampshire".
- Gensim approaches bigrams by simply combining the two high-probability tokens with an underscore. The tokens new and york will now become new_york instead. Similar to the TF-IDF model, bigrams can be created using another Gensim model: Phrases.
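
Before handing the work to Gensim, it may help to see n-grams and the conditional probability computed by hand. The helper names `ngrams` and `cond_prob` and the toy sentence are our own; `cond_prob` assumes the preceding word occurs at least once.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["new", "york", "is", "not", "new", "hampshire"]
print(ngrams(tokens, 2))
# [('new', 'york'), ('york', 'is'), ('is', 'not'), ('not', 'new'), ('new', 'hampshire')]

unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

def cond_prob(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(cond_prob("new", "york"))       # 0.5
print(cond_prob("new", "hampshire"))  # 0.5
print(cond_prob("york", "is"))        # 1.0
```

In this toy sentence "new" is followed by "york" half of the time and by "hampshire" the other half, which is the kind of statistic a collocation detector builds on.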

In [17]:
import gensim
bigram = gensim.models.Phrases(texts)
texts = [bigram[line] for line in texts]
texts

[['present',
  'contribution',
  'examine',
  'link',
  'societal',
  'crisis',
  'situation',
  'belief',
  'conspiracy_theory',
  'contrary',
  'common',
  'assumption',
  'belief',
  'conspiracy_theory',
  'prevalent',
  'human',
  'history',
  'illustrate',
  'historical',
  'incident',
  'suggest',
  'societal',
  'crisis',
  'situation',
  'define',
  'impactful',
  'rapid',
  'societal',
  'change',
  'call',
  'establish',
  'power',
  'structure',
  'norm',
  'conduct',
  'existence',
  'specific',
  'people',
  'group',
  'question',
  'stimulate',
  'belief',
  'conspiracy_theory',
  'review',
  'psychological',
  'literature',
  'explain',
  'case',
  'evidence',
  'suggest',
  'aversive',
  'feeling',
  'people',
  'experience',
  'crisis',
  'fear',
  'uncertainty',
  'feeling',
  'control',
  'stimulate',
  'motivation',
  'sense',
  'situation',
  'increase',
  'likelihood',
  'perceive',
  'conspiracy',
  'social',
  'situation',
  'explain',
  'form',
  'conspiracy_th

$\textbf{NOTE}:$ Because creating new phrases adds words to our vocabulary, this step must be done before we create our dictionary. Since we already built the dictionary earlier, we have to rebuild it and the corpus:

In [18]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

After we are done creating our bi-grams, we can create tri-grams and other n-grams by simply running the Phrases model multiple times on our corpus. Bi-grams still remain the most used n-gram model, though it is worth one's time to glance over the other uses and kinds of n-gram implementations.
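
Why a second pass yields tri-grams can be seen with a simplified merger. The function `merge_frequent_pairs` below is hypothetical and joins pairs by a raw count threshold; it does not reproduce the scoring used by Gensim's Phrases. With Gensim, the analogous step would be to train another Phrases model on the already bi-grammed `texts`.

```python
from collections import Counter

def merge_frequent_pairs(texts, min_count=2):
    """Join adjacent token pairs seen at least min_count times with an underscore."""
    pair_counts = Counter(p for text in texts for p in zip(text, text[1:]))
    merged_texts = []
    for text in texts:
        out, i = [], 0
        while i < len(text):
            if i + 1 < len(text) and pair_counts[(text[i], text[i + 1])] >= min_count:
                out.append(text[i] + "_" + text[i + 1])
                i += 2  # both tokens were consumed by the merge
            else:
                out.append(text[i])
                i += 1
        merged_texts.append(out)
    return merged_texts

toy = [["new", "york", "city", "subway"], ["visit", "new", "york", "city"]]
pass1 = merge_frequent_pairs(toy)
pass2 = merge_frequent_pairs(pass1)
print(pass1)  # [['new_york', 'city', 'subway'], ['visit', 'new_york', 'city']]
print(pass2)  # [['new_york_city', 'subway'], ['visit', 'new_york_city']]
```

The first pass turns "new york" into a single token, so on the second pass "new_york" followed by "city" is just another frequent pair.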

In [19]:
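# Removing both high-frequency and low-frequency words.
# Example: get rid of words that occur in fewer than 20 documents, or in more than 50% of the documents.
# Caution: our mini-corpus has only 8 documents, so no_below=20 would remove every token; use a lower value for small corpora.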
dictionary.filter_extremes(no_below=20, no_above=0.5)
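
The effect of the two thresholds can be checked on a toy corpus. The sketch below is a simplified stand-in for the filtering rule (keep a token if it occurs in at least `no_below` documents and in at most a fraction `no_above` of them). The function name and the data are hypothetical, and Gensim's own method has further options that are not modeled here.

```python
def filter_by_document_frequency(texts, no_below, no_above):
    """Drop tokens that occur in too few or too many documents."""
    num_docs = len(texts)
    df = {}
    for text in texts:
        for token in set(text):  # count each token once per document
            df[token] = df.get(token, 0) + 1
    keep = {t for t, n in df.items() if no_below <= n <= no_above * num_docs}
    return [[t for t in text if t in keep] for text in texts]

toy = [
    ["the", "dog", "sat"],
    ["the", "cat", "sat"],
    ["the", "cat", "love"],
    ["the", "mat"],
]

print(filter_by_document_frequency(toy, no_below=2, no_above=0.5))
# [['sat'], ['cat', 'sat'], ['cat'], []]

# A threshold larger than the number of documents removes everything.
print(filter_by_document_frequency(toy, no_below=20, no_above=0.5))
# [[], [], [], []]
```

Here "the" is dropped for being in every document, while "dog", "love" and "mat" are dropped for appearing in only one.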

$\textbf{Programming Assignment}$

Choose a topic that you will be using for your term paper in this subject. Collect articles, publications, stories, etc. on your chosen topic and develop your own mini-corpus using the required preprocessing steps. Be sure to print the output.

Note that this corpus will be used for the entire subject.