# Links
- [Kaggle](https://www.kaggle.com/code/balatmak/text-preprocessing-steps-and-universal-pipeline)

# To Do
Create a single function that runs every step of the pipeline

In [1]:
import nltk

In [2]:
example_text = """
An explosion targeting a tourist bus has injured at least 16 people near the Grand Egyptian Museum, 
next to the pyramids in Giza, security sources say E.U.

South African tourists are among the injured. Most of those hurt suffered minor injuries, 
while three were treated in hospital, N.A.T.O. say.

http://localhost:8888/notebooks/Text%20preprocessing.ipynb

@nickname of twitter user and his email is email@gmail.com . 

A device went off close to the museum fence as the bus was passing on 16/02/2012.
"""

# Tokenization

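Splitting text into **tokens** (words, sentences, etc.)

- spaCy tokenized whitespace such as `\n` and `\n\n` as separate tokens, but kept the email address and the Twitter-like mention intact and kept more of the URL together (`http://localhost:8888` stays one token). NLTK split the email and the mention apart and also cut the final `.` off abbreviations (`E.U`, `N.A.T.O`).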
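
To make the difference concrete, here is a minimal regex-based tokenizer that keeps URLs, emails, mentions and dotted abbreviations as single tokens. The pattern and the name `simple_tokenize` are assumptions made for illustration; this is not how NLTK or spaCy work internally.

```python
import re

# Order matters: the specific patterns come first so they win over the generic ones.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+                 # URLs
    | [\w.+-]+@[\w-]+\.[\w.-]+   # email addresses
    | @\w+                       # twitter-like mentions
    | (?:[A-Za-z]\.){2,}         # dotted abbreviations such as E.U. or N.A.T.O.
    | \d+(?:[/.,-]\d+)*          # numbers and dates such as 16/02/2012
    | \w+                        # plain words
    | [^\w\s]                    # any other single non-space symbol
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Return the list of tokens found by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

sample = "Sources say E.U. Mail email@gmail.com or @nickname on 16/02/2012, see http://localhost:8888/notebooks"
tokens = simple_tokenize(sample)
print(tokens)
```

Whitespace never becomes a token here, because every alternative requires at least one non-space character.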

In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize

nltk_words = word_tokenize(example_text) # use sent_tokenize for sentences (used in the BoW part)
display(f"Tokenized words: {nltk_words}")

"Tokenized words: ['An', 'explosion', 'targeting', 'a', 'tourist', 'bus', 'has', 'injured', 'at', 'least', '16', 'people', 'near', 'the', 'Grand', 'Egyptian', 'Museum', ',', 'next', 'to', 'the', 'pyramids', 'in', 'Giza', ',', 'security', 'sources', 'say', 'E.U', '.', 'South', 'African', 'tourists', 'are', 'among', 'the', 'injured', '.', 'Most', 'of', 'those', 'hurt', 'suffered', 'minor', 'injuries', ',', 'while', 'three', 'were', 'treated', 'in', 'hospital', ',', 'N.A.T.O', '.', 'say', '.', 'http', ':', '//localhost:8888/notebooks/Text', '%', '20preprocessing.ipynb', '@', 'nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email', '@', 'gmail.com', '.', 'A', 'device', 'went', 'off', 'close', 'to', 'the', 'museum', 'fence', 'as', 'the', 'bus', 'was', 'passing', 'on', '16/02/2012', '.']"

In [4]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

doc = nlp(example_text)
spacy_words = [token.text for token in doc]
display(f"Tokenized words: {spacy_words}")

"Tokenized words: ['\\n', 'An', 'explosion', 'targeting', 'a', 'tourist', 'bus', 'has', 'injured', 'at', 'least', '16', 'people', 'near', 'the', 'Grand', 'Egyptian', 'Museum', ',', '\\n', 'next', 'to', 'the', 'pyramids', 'in', 'Giza', ',', 'security', 'sources', 'say', 'E.U.', '\\n\\n', 'South', 'African', 'tourists', 'are', 'among', 'the', 'injured', '.', 'Most', 'of', 'those', 'hurt', 'suffered', 'minor', 'injuries', ',', '\\n', 'while', 'three', 'were', 'treated', 'in', 'hospital', ',', 'N.A.T.O.', 'say', '.', '\\n\\n', 'http://localhost:8888', '/', 'notebooks', '/', 'Text%20preprocessing.ipynb', '\\n\\n', '@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com', '.', '\\n\\n', 'A', 'device', 'went', 'off', 'close', 'to', 'the', 'museum', 'fence', 'as', 'the', 'bus', 'was', 'passing', 'on', '16/02/2012', '.', '\\n']"

In [6]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x17feef890>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x17feee990>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x17ff03a70>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x30229cdd0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x302297dd0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x17ff03bc0>)]

# Cleaning

## Lower Casing

In [7]:
example_text.lower()

'\nan explosion targeting a tourist bus has injured at least 16 people near the grand egyptian museum, \nnext to the pyramids in giza, security sources say e.u.\n\nsouth african tourists are among the injured. most of those hurt suffered minor injuries, \nwhile three were treated in hospital, n.a.t.o. say.\n\nhttp://localhost:8888/notebooks/text%20preprocessing.ipynb\n\n@nickname of twitter user and his email is email@gmail.com . \n\na device went off close to the museum fence as the bus was passing on 16/02/2012.\n'

## Punctuation Removal
+ Do after tokenization
+ Useful for TF-IDF, CountVectorization, BinaryVectorization etc.

In [8]:
import string

display(f"Punctuation symbols: {string.punctuation}")

'Punctuation symbols: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
text_with_punct = "@nickname of twitter user, and his email is email@gmail.com ."

In [10]:
text_without_punct = text_with_punct.translate(str.maketrans('', '', string.punctuation))
display(f"Text without punctuation: {text_without_punct}")

'Text without punctuation: nickname of twitter user and his email is emailgmailcom '

In [11]:
# Above, punctuation was stripped from the raw string, so the email address was destroyed.
# The tokenizer emits punctuation symbols as separate tokens, so it is better to tokenize first and then drop the punctuation tokens.

doc = nlp(text_with_punct)
tokens = [t.text for t in doc]
# Plain Python: drop tokens that are a single punctuation character
punctuation = set(string.punctuation)
tokens_without_punct_python = [t for t in tokens if t not in punctuation]
display(f"Python based removal: {tokens_without_punct_python}")

tokens_without_punct_spacy = [t.text for t in doc if t.pos_ != 'PUNCT']
display(f"Spacy based removal: {tokens_without_punct_spacy}")

"Python based removal: ['@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com']"

"Spacy based removal: ['@nickname', 'of', 'twitter', 'user', 'and', 'his', 'email', 'is', 'email@gmail.com']"

## Stopwords Removal
+ Words that usually do not add meaning (e.g. "is", "the", "just")
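
A minimal sketch of the idea, assuming a tiny hand-written stop word list (`STOP_WORDS` and `remove_stop_words` are hypothetical names; the cells in this section use the NLTK and spaCy lists instead). The comparison is done in lowercase because stop word lists are usually lowercase.

```python
# A tiny hand-written stop word list, used here only for illustration.
STOP_WORDS = {"this", "is", "just", "not"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words, comparing in lowercase so 'This' matches 'this'."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This movie is just not good enough".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['movie', 'good', 'enough']
```

Dropping "not" flips the sentiment of this sentence, so the stop word list is worth reviewing before using it for tasks such as sentiment analysis.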

In [12]:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [13]:
from nltk.corpus import stopwords

nltk.download('stopwords', download_dir = "/Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_data")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_dat
[nltk_data]     a...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
text = "This movie is just not good enough"

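# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = stopwords.words('english')

text_without_stop_words = [t.text for t in nlp(text) if not t.is_stop]
display(f"Spacy text without stop words: {text_without_stop_words}")


# NLTK's list is lowercase and this check is case-sensitive, so 'This' is kept
text_without_stop_words = [t for t in word_tokenize(text) if t not in stop_words]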
display(f"nltk text without stop words: {text_without_stop_words}")

"Spacy text without stop words: ['movie', 'good']"

"nltk text without stop words: ['This', 'movie', 'good', 'enough']"

## Other Cleaning
1. Removal of Emojis
1. Removal of Emoticons
1. Removal of URLs
1. Removal of HTML Tags


In [15]:
import re

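def remove_emojis(data):
    # NOTE: several ranges below are broad and overlap. In particular
    # U+24C2..U+1F251 also covers CJK and other non-emoji scripts,
    # so narrow the ranges before using this on multilingual text.
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range, see note above
        u"\U0001f926-\U0001f937"  # face palm through shrug
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # misc symbols through large circle
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)
    return emoj.sub('', data)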

def remove_emoticons(data):
    emoticons = re.compile(r'(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)')
    return re.sub(emoticons, '', data)

def remove_urls(data):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', data)

def remove_html(data):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', data)

def clean_text(data):
    data = remove_emojis(data)
    data = remove_emoticons(data)
    data = remove_urls(data)
    data = remove_html(data)
    return data

In [16]:
import random
import faker
from faker.providers import internet

fake = faker.Faker()

emojis = ["😀", "😃", "😄", "😁", "😆", "😅", "😂", "🤣", "😊", "😇"]
emoticons = [":)", ":(", ":D", ":O", ":P", ";)", ":*", ":/", ":|", ":$", ":^)", ":-)", ":=)", ":]", ":}", ":>", ":3"]
html_tags = ["<p>", "</p>", "<a>", "</a>", "<div>", "</div>", "<span>", "</span>", "<img>", "</img>", "<body>", "</body>", "<header>", "</header>", "<footer>", "</footer>"]


random_text = " ".join(fake.text().split()[:30])
random_text += " " + random.choice(emojis)
random_text += " " + random.choice(emoticons)
random_text += " " + fake.url()
random_text += " " + random.choice(html_tags)

print(random_text)


Write student within skin. Lay model stage discuss. Include understand high art remain face. Tend trip watch clear area want. 😃 :^) https://www.ho-johnson.com/ </p>


In [17]:
clean = clean_text(random_text)
print(clean)

Write student within skin. Lay model stage discuss. Include understand high art remain face. Tend trip watch clear area want.    


# Normalization
Normalization is the conversion of non-text information into its textual equivalent.


+ Converting dates to text
+ Converting numbers to text
+ Converting currency/percent signs to text
+ Expanding abbreviations (context-dependent)
+ Correcting spelling mistakes


In [18]:
# # This no longer works because scikit-learn has removed a function that normalise depends on

# from normalise import normalise

# text = """
# On the 13 Feb. 2007, Theresa May announced on MTV news that the rate of childhod obesity had 
# risen from 7.3-9.6% in just 3 years , costing the N.A.T.O £20m
# """

# user_abbr = {
#     "N.A.T.O": "North Atlantic Treaty Organization"
# }

# normalized_tokens = normalise(word_tokenize(text), user_abbrevs=user_abbr, verbose=False)
# display(f"Normalized text: {' '.join(normalized_tokens)}")
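
As a rough stand-in for the commented-out cell above, a few of the listed steps can be sketched with plain lookup tables. All tables and the names `normalize_token` and `normalize` are hypothetical and cover only a handful of cases; this is not the `normalise` package's behaviour.

```python
# Hypothetical lookup tables, for illustration only; real normalisation needs far more coverage.
ABBREVIATIONS = {"N.A.T.O": "North Atlantic Treaty Organization"}
MONTHS = {"Jan.": "January", "Feb.": "February", "Mar.": "March"}
SYMBOLS = {"%": "percent", "£": "pounds", "$": "dollars"}
DIGITS = {str(i): word for i, word in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"])}

def normalize_token(token):
    """Replace a token with its textual equivalent if one of the tables knows it."""
    for table in (ABBREVIATIONS, MONTHS, SYMBOLS, DIGITS):
        if token in table:
            return table[token]
    return token

def normalize(tokens):
    """Normalize every token in a list, leaving unknown tokens unchanged."""
    return [normalize_token(t) for t in tokens]

tokens = ["13", "Feb.", "2007", ",", "in", "just", "3", "years", ",",
          "costing", "the", "N.A.T.O", "£", "20m"]
normalized = normalize(tokens)
print(" ".join(normalized))
```

Multi-digit numbers, full dates and spelling correction are left untouched here; each would need its own rules or a dedicated library.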

# Stemming and Lemmatization
1. **Stemming** - Reduces inflected words to a stem by stripping suffixes
    + Not very reliable, as it just truncates words based on a set of rules (e.g. `announced` becomes `announc`).
2. **Lemmatization** - Reduces a word to its proper dictionary form (lemma)

In [19]:
text = """
On the 13 Feb. 2007, Theresa May announced on MTV news that the rate of childhod obesity had 
risen from 7.3-9.6% in just 3 years , costing the N.A.T.O £20m
"""

In [20]:
from nltk.stem import PorterStemmer
import numpy as np

normalized_tokens = word_tokenize(text)

text = ' '.join(normalized_tokens)
tokens = word_tokenize(text)

In [21]:
porter = PorterStemmer()
stem_words = np.vectorize(porter.stem)
stemmed_text = ' '.join(stem_words(tokens))
display(f"Stemmed text: {stemmed_text}")

'Stemmed text: on the 13 feb. 2007 , theresa may announc on mtv news that the rate of childhod obes had risen from 7.3-9.6 % in just 3 year , cost the n.a.t.o £20m'

In [22]:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
lemmatize_words = np.vectorize(wordnet_lemmatizer.lemmatize)
lemmatized_text = ' '.join(lemmatize_words(tokens))
display(f"nltk lemmatized text: {lemmatized_text}")

'nltk lemmatized text: On the 13 Feb. 2007 , Theresa May announced on MTV news that the rate of childhod obesity had risen from 7.3-9.6 % in just 3 year , costing the N.A.T.O £20m'

In [23]:
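# spaCy lemmatizes using part-of-speech information, so verbs are handled too (announced -> announce);
# NLTK's lemmatize() treats every word as a noun unless a pos argument is passed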

lemmas = [t.lemma_ for t in nlp(text)]
display(f"Spacy lemmatized text: {' '.join(lemmas)}")

'Spacy lemmatized text: on the 13 Feb. 2007 , Theresa may announce on MTV news that the rate of childhod obesity have rise from 7.3 - 9.6 % in just 3 year , cost the N.A.T.O £ 20 m'

---
---
---

# Embedding Techniques (Basic)
Converting words into vectors

## One-Hot Encoding
E.g. assume a corpus of "The Cat eats the mouse".

Unique words - ['the', 'cat', 'eats', 'mouse']

'the' is encoded as [1, 0, 0, 0]

Note that the encoding is applied word by word within each sentence; it does not produce one vector for the whole corpus.

It is rarely used for text nowadays.

**Advantages**
+ Very simple to implement

**Disadvantages**
+ Produces a sparse matrix that grows with the vocabulary: every new word adds a dimension
+ As a result, it is computationally expensive
+ If the sentence size (no. of words in each sentence) is not fixed, the inputs have different sizes, which causes issues
+ Creates the **OOV (Out of Vocabulary)** problem: new words in the text data cannot be handled
+ Semantic meaning is not captured; there is no relationship between the word vectors
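
The example corpus above can be encoded by hand in a few lines. This is a plain-Python sketch (the helper `one_hot` is a hypothetical name), separate from the scikit-learn cell that follows; it also shows the OOV problem directly.

```python
corpus = "The Cat eats the mouse"
tokens = corpus.lower().split()

# Vocabulary in order of first appearance: ['the', 'cat', 'eats', 'mouse']
vocab = list(dict.fromkeys(tokens))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for `word`; unknown words raise KeyError (the OOV problem)."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# One vector per word, so the sentence becomes a 5 x 4 matrix
encoded = [one_hot(w) for w in tokens]
print(vocab)
print(one_hot("the"))  # [1, 0, 0, 0]

try:
    one_hot("dog")
except KeyError:
    print("'dog' is out of vocabulary")
```

Every vector has exactly one non-zero entry, which is why the matrix becomes large and sparse as the vocabulary grows.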

In [24]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

sentence = np.random.choice(clean.split(), 5)

data = np.array(sentence).reshape(-1, 1)

encoder = OneHotEncoder()

onehot_encoded = encoder.fit_transform(data)

print("Original sentence: ", sentence)
print("One-hot encoded: \n", onehot_encoded)


Original sentence:  ['skin.' 'want.' 'student' 'model' 'art']
One-hot encoded: 
   (0, 2)	1.0
  (1, 4)	1.0
  (2, 3)	1.0
  (3, 1)	1.0
  (4, 0)	1.0


---

## Bag of Words (BoW) & N-Grams

Remove the stopwords, then count how often each remaining word occurs in the corpus. The vocabulary is ordered by descending frequency.

![Test](./resources/ss/ss1.png)

**Binary BoW**

Each document is encoded by which vocabulary words are present in it: 1 if the word occurs, 0 otherwise.

![Test](./resources/ss/ss2.png)

**In count-based BoW, if a word repeats in a sentence, its value increases from 1 to 2 (the number of occurrences).**

**Advantages**
+ Simple and Intuitive

**Disadvantages**
+ Sparsity
+ OOV
+ Word order is lost, which can change the meaning completely
+ Semantic meaning is mostly lost (word overlap captures it only to some extent)
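
The difference between count-based and binary BoW can be sketched in plain Python. The toy documents and the helper `bow` are assumptions for illustration; the notebook cells further down use `CountVectorizer` instead.

```python
from collections import Counter

# Hypothetical toy documents, already lowercased and without stop words
docs = ["good boy good", "good girl", "boy girl good"]

# Fixed, sorted vocabulary shared by all documents: ['boy', 'girl', 'good']
vocab = sorted({w for d in docs for w in d.split()})

def bow(doc, binary=False):
    """Return the BoW vector of `doc` over `vocab` (counts, or 0/1 if binary)."""
    counts = Counter(doc.split())
    if binary:
        return [int(counts[w] > 0) for w in vocab]
    return [counts[w] for w in vocab]

for d in docs:
    print(d, bow(d), bow(d, binary=True))
```

For "good boy good" the count vector is `[1, 0, 2]` while the binary vector is `[1, 0, 1]`; in both cases the word order is gone.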


*N-Grams* - In addition to single words, you also count combinations of consecutive words. Note that word order must be preserved here (e.g. Good Girl is not the same as Girl Good).

Eg : **BiGrams**

![Test](./resources/ss/ss3.png)
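
Generating n-grams only needs a sliding window over the tokens. A minimal sketch, with `make_ngrams` as a hypothetical helper name (NLTK's own `ngrams` utility is imported in a later cell):

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "he is a good boy".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Each n-gram then becomes one more entry in the BoW vocabulary, which is what `ngram_range` controls in `CountVectorizer`.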


In [25]:
paragraph = """
Narendra Damodardas Modi is the present and 15th Indian prime minister. He has been serving our nation since 26th May 2014. From the year 2001 to 2014, before taking over Delhi, he served the role of Honourable Chief Minister of Gujarat. He is a Member of the Parliament (MP), who represents the city of Varanasi.
"""

In [26]:
import nltk
from nltk.util import ngrams
from nltk.tokenize import sent_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [27]:
nltk.download('punkt', download_dir = "/Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_data")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/daver/Desktop/NLP_Lab_Exam_Codes/.venv/nltk_dat
[nltk_data]     a...
[nltk_data]   Package punkt is already up-to-date!


True

In [28]:
sentences = sent_tokenize(paragraph)

In [29]:
print(sentences)

['\nNarendra Damodardas Modi is the present and 15th Indian prime minister.', 'He has been serving our nation since 26th May 2014.', 'From the year 2001 to 2014, before taking over Delhi, he served the role of Honourable Chief Minister of Gujarat.', 'He is a Member of the Parliament (MP), who represents the city of Varanasi.']


In [30]:
corpus = []
for i in range(len(sentences)):
    # Lowercase each sentence. Punctuation is kept here; CountVectorizer's
    # default tokenizer ignores it when building the vocabulary.
    sentences[i] = sentences[i].lower()
    corpus.append(sentences[i])

In [31]:
# Join the sentences into a single document
corpus = ' '.join(corpus)

In [32]:
corpus = [corpus]

In [33]:
print(corpus)

['\nnarendra damodardas modi is the present and 15th indian prime minister. he has been serving our nation since 26th may 2014. from the year 2001 to 2014, before taking over delhi, he served the role of honourable chief minister of gujarat. he is a member of the parliament (mp), who represents the city of varanasi.']


In [34]:
# BAG OF WORDS PART
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer() # pass ngram_range=(2, 3) for bigrams + trigrams

In [35]:
x = cv.fit_transform(corpus)

In [36]:
cv.vocabulary_

{'narendra': 23,
 'damodardas': 9,
 'modi': 21,
 'is': 17,
 'the': 37,
 'present': 29,
 'and': 4,
 '15th': 0,
 'indian': 16,
 'prime': 30,
 'minister': 20,
 'he': 14,
 'has': 13,
 'been': 5,
 'serving': 34,
 'our': 26,
 'nation': 24,
 'since': 35,
 '26th': 3,
 'may': 18,
 '2014': 2,
 'from': 11,
 'year': 41,
 '2001': 1,
 'to': 38,
 'before': 6,
 'taking': 36,
 'over': 27,
 'delhi': 10,
 'served': 33,
 'role': 32,
 'of': 25,
 'honourable': 15,
 'chief': 7,
 'gujarat': 12,
 'member': 19,
 'parliament': 28,
 'mp': 22,
 'who': 40,
 'represents': 31,
 'city': 8,
 'varanasi': 39}

In [37]:
corpus[0]

'\nnarendra damodardas modi is the present and 15th indian prime minister. he has been serving our nation since 26th may 2014. from the year 2001 to 2014, before taking over delhi, he served the role of honourable chief minister of gujarat. he is a member of the parliament (mp), who represents the city of varanasi.'

In [38]:
# BoW counts for the first (and only) document in the corpus: the whole paragraph
x[0].toarray()

array([[1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 2, 1, 1, 2, 1,
        1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1]])

In [39]:
# Binary BoW
cv = CountVectorizer(binary = True)

In [40]:
x = cv.fit_transform(corpus)

In [41]:
x[0].toarray()

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

## TF-IDF
**Term Frequency (TF)** - 
+ Measures how often a word occurs within a sentence
+ No. of repetitions of the word in the sentence / No. of words in the sentence

**Inverse Document Frequency (IDF)** - 
+ Gives rare words a higher weight and common words a lower one
+ loge(No. of sentences / No. of sentences containing the word)

Multiply the two to get the TF-IDF value of a word in a sentence.
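
The two formulas translate directly into code. This sketch uses hypothetical toy sentences and assumes every queried word occurs in at least one sentence. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalisation by default, so its numbers will differ from these.

```python
import math

# Hypothetical tokenised sentences
docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

def tf(term, doc):
    """Occurrences of `term` in the sentence divided by the sentence length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of sentences / sentences containing `term`)."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["good", "boy"]:
    print(term, round(tfidf(term, docs[0], docs), 4))
```

"good" appears in every sentence, so its IDF is log(1) = 0 and it contributes nothing; "boy" appears in two of three sentences and gets a positive weight.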


In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer() #max_features = 3 keeps only the 3 most frequent terms
X = tf.fit_transform(corpus)

In [43]:
corpus[0]

'\nnarendra damodardas modi is the present and 15th indian prime minister. he has been serving our nation since 26th may 2014. from the year 2001 to 2014, before taking over delhi, he served the role of honourable chief minister of gujarat. he is a member of the parliament (mp), who represents the city of varanasi.'

In [45]:
# With a single document every idf equals 1, so each weight is the term count
# divided by the L2 norm of the count vector.
X.toarray()

# Better Embeddings (Based on DL)

Using deep learning (DL) to get better word embeddings.

## Word2Vec

The earlier techniques do not capture semantic meaning: ideally, words with similar meanings should have vectors that are close together. They also produce sparse matrices.

**Features**
+ Dimensions are reduced
+ Sparsity is reduced
+ Semantic Meaning is preserved in training

Feature representations - every word is described by a set of learned features, giving a dense vector with a fixed number of dimensions (here 300).


![Test](./resources/ss/ss4.png)


So, each word is represented as a 300-dimensional vector, which is learned by the model.

**Cosine Similarity** is used to measure how close two vectors are: values near 1 mean the vectors point in almost the same direction.
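
Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch with made-up 3-dimensional vectors (real Word2Vec vectors have 300 dimensions; the values below are assumptions for illustration):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings
king = [0.9, 0.8, 0.1]
queen = [0.9, 0.7, 0.2]
apple = [0.1, 0.0, 0.9]

print(cosine_similarity(king, queen))
print(cosine_similarity(king, apple))
```

With these toy values the first similarity is much higher than the second, which is the behaviour expected of related and unrelated words.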


### CBoW - Continuous BoW
One of the two ways to build training data for Word2Vec: the surrounding context words are the input and the centre word is the output.
![Test](./resources/ss/ss5.png)

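The hidden layer size sets the dimension of the learned word vectors (300 for the model used below); the window size controls how many neighbouring words form the context.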

![Test](./resources/ss/ss6.png)

### SkipGrams
Swap the input and output of CBoW: the centre word is the input and the surrounding context words are the output.
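
The training pairs for both schemes can be generated from the same sliding window. In this sketch `window` means the number of words taken on each side of the centre word, which is an assumption and may differ from how the figures define window size; `context_windows` is a hypothetical helper.

```python
def context_windows(tokens, window=2):
    """Yield (context_words, center_word) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

tokens = "the cat eats the mouse".split()

# CBoW: the context words are the input, the centre word is the target.
cbow_pairs = list(context_windows(tokens, window=1))

# Skip-gram: the centre word is the input, each context word is a target.
skipgram_pairs = [(center, ctx) for context, center in cbow_pairs for ctx in context]

print(cbow_pairs)
print(skipgram_pairs)
```

The same windows feed both models; only the direction of prediction changes.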

In [46]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

In [47]:
# Let's use a pretrained model for now
import gensim.downloader as api

wv = api.load('word2vec-google-news-300')



In [48]:
vec_king = wv['king']

In [49]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [50]:
vec_king.shape

(300,)

In [51]:
wv.most_similar('king')

[('kings', 0.7138046622276306),
 ('queen', 0.6510956287384033),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864824056625366),
 ('ruler', 0.5797566771507263),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422105193138123)]

In [52]:
# Cosine Similarity
wv.similarity('man', 'king')

0.2294267

In [53]:
vec = wv['king'] - wv['man'] + wv['woman']
wv.most_similar([vec])

[('king', 0.8449392318725586),
 ('queen', 0.7300516366958618),
 ('monarch', 0.6454660296440125),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676948547363),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376776456832886),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]

### Word2Vec from Scratch (Done in Spam Classification Application)

### AvgWord2Vec
In regular Word2Vec, every word gets its own 300-dimensional vector.

In AvgWord2Vec, each sentence is represented by a single vector: add up the vectors of its words dimension by dimension and divide by the number of words.
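
A minimal sketch of the averaging step, using made-up 3-dimensional vectors in place of a real 300-dimensional model. The dictionary `word_vectors` and the helper `average_vector` are hypothetical; skipping unknown words is one possible design choice, not the only one.

```python
# Hypothetical toy word vectors; a real model would supply 300-dimensional ones.
word_vectors = {
    "good": [0.8, 0.1, 0.3],
    "movie": [0.2, 0.9, 0.4],
    "bad": [-0.7, 0.1, 0.3],
}

def average_vector(tokens, vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dims)]

sentence_vec = average_vector("good movie".split(), word_vectors)
print(sentence_vec)
```

Every sentence ends up with a vector of the same size regardless of its length, which makes it usable as input to a standard classifier.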

## GloVe

## FastText

# Model Building

## RNNs