Up until now, we've been using `Bag of Words` and `TF-IDF` to vectorize our text data. But both approaches have some major problems:

* Although **TF-IDF** gives more importance to uncommon words, just like **Bag of Words** it does not store any semantic information.


* There's definitely a chance of overfitting with their usage.


* The **order of words** in a document -- which indicates the relationship between the words -- is not preserved at all.

**=>** These problems are what **`Word2Vec`** sets out to address.
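
The word-order problem listed above can be seen in a minimal sketch. It uses a hand-rolled bag of words built on `collections.Counter` rather than any particular vectorizer class, and the helper name `bag_of_words` and the two example sentences are assumptions made up for illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count how often each lowercase token occurs, ignoring its position."""
    return Counter(sentence.lower().split())


# Two sentences with opposite meanings but exactly the same words
first = bag_of_words("the dog bit the man")
second = bag_of_words("the man bit the dog")

# Both sentences collapse to the same counts, so the order is lost
print(first == second)  # True
```

Because both sentences map to identical counts, any model fed these features cannot tell them apart.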

### # Word2vec:

**i.** In this model, each word is represented as a dense vector of 32 or more dimensions (gensim uses 100 by default) instead of just a single number.

**ii.** Most importantly, the semantic information and the relations between words are preserved.

**For example:** With **Word2Vec** vectors trained on a large enough corpus, the following expression holds approximately.

`King - Man + Woman = Queen`
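
To make the arithmetic concrete, here is a small sketch on hand-made 2-D vectors. The numbers are assumptions chosen for illustration (first axis roughly "royalty", second axis roughly "gender"); they are not learned by Word2Vec, and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors (assumed values, NOT the output of a trained model)
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With a trained gensim model, the comparable query would be `wv.most_similar(positive=['king', 'woman'], negative=['man'])`, provided those words are in the vocabulary.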

In [1]:
%autosave 30

Autosaving every 30 seconds


In [2]:
import re

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from gensim.models import Word2Vec

In [3]:
para = """The best-selling book by Caleb Carr is the basis for "The Alienist," a psychological thriller set amidst the vast wealth, extreme poverty and technological innovation of 1896 New York. A never-before-seen ritualistic killer is responsible for the gruesome murders of boy prostitutes, and newly appointed police commissioner Theodore Roosevelt calls upon criminal psychologist Dr. Laszlo Kreizler, newspaper illustrator John Moore and police department secretary Sara Howard to conduct the investigation in secret. The brilliant, obsessive Kreizler is known as an alienist -- one who studies mental pathologies and the deviant behaviors of those who are alienated from themselves and society. His job, along with his controversial views, makes him a social pariah in some circles. But helped by a band of outsiders, Kreizler's tireless efforts eventually answer the question behind what makes a man into a murderer."""
para

'The best-selling book by Caleb Carr is the basis for "The Alienist," a psychological thriller set amidst the vast wealth, extreme poverty and technological innovation of 1896 New York. A never-before-seen ritualistic killer is responsible for the gruesome murders of boy prostitutes, and newly appointed police commissioner Theodore Roosevelt calls upon criminal psychologist Dr. Laszlo Kreizler, newspaper illustrator John Moore and police department secretary Sara Howard to conduct the investigation in secret. The brilliant, obsessive Kreizler is known as an alienist -- one who studies mental pathologies and the deviant behaviors of those who are alienated from themselves and society. His job, along with his controversial views, makes him a social pariah in some circles. But helped by a band of outsiders, Kreizler\'s tireless efforts eventually answer the question behind what makes a man into a murderer.'

In [4]:
## Clean the above doc

def clean_the_doc(doc):
    """Return one list of cleaned, lemmatized tokens per sentence of `doc`."""
    # Needs the NLTK tokenizer, stopwords and wordnet data (fetch them with nltk.download)
    # Build the stop-word set once instead of once per token
    stop_words = set(stopwords.words('english'))

    # Create a lemmatizer object
    lemma = WordNetLemmatizer()

    # Cleaned doc: one list of tokens per sentence
    corpus = []

    # Tokenize the doc into separate sentences
    for sentence in nltk.sent_tokenize(doc):
        # Keep only the letters and lowercase the sentence
        review = re.sub('[^a-zA-Z]', ' ', sentence).lower()

        # Remove the stop words, then lemmatize what remains
        words = nltk.word_tokenize(review)
        cleaned_words = [lemma.lemmatize(word) for word in words if word not in stop_words]

        # Word2Vec expects each sentence as a list of tokens, so append the list (not a joined string)
        corpus.append(cleaned_words)

    return corpus

In [5]:
corpus = clean_the_doc(doc=para)
corpus

[['best',
  'selling',
  'book',
  'caleb',
  'carr',
  'basis',
  'alienist',
  'psychological',
  'thriller',
  'set',
  'amidst',
  'vast',
  'wealth',
  'extreme',
  'poverty',
  'technological',
  'innovation',
  'new',
  'york'],
 ['never',
  'seen',
  'ritualistic',
  'killer',
  'responsible',
  'gruesome',
  'murder',
  'boy',
  'prostitute',
  'newly',
  'appointed',
  'police',
  'commissioner',
  'theodore',
  'roosevelt',
  'call',
  'upon',
  'criminal',
  'psychologist',
  'dr',
  'laszlo',
  'kreizler',
  'newspaper',
  'illustrator',
  'john',
  'moore',
  'police',
  'department',
  'secretary',
  'sara',
  'howard',
  'conduct',
  'investigation',
  'secret'],
 ['brilliant',
  'obsessive',
  'kreizler',
  'known',
  'alienist',
  'one',
  'study',
  'mental',
  'pathology',
  'deviant',
  'behavior',
  'alienated',
  'society'],
 ['job',
  'along',
  'controversial',
  'view',
  'make',
  'social',
  'pariah',
  'circle'],
 ['helped',
  'band',
  'outsider',
  'kreiz

In [6]:
## Word2Vec Vectorization

word2vec = Word2Vec(sentences=corpus, min_count=1)  # A term needs at least one occurrence to be kept in the vocabulary
word2vec

<gensim.models.word2vec.Word2Vec at 0x226a0b249d0>
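
Word2Vec learns its vectors from words that occur near each other inside a sliding window. The sketch below shows how (center, context) pairs could be generated in the skip-gram style. It is a simplified illustration, not gensim's actual implementation: the helper name `skipgram_pairs` is hypothetical, and as far as I know gensim defaults to CBOW (`sg=0`) with `window=5` and also varies the effective window size during training.

```python
def skipgram_pairs(sentences, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:  # a word is not its own context
                    pairs.append((center, tokens[j]))
    return pairs


# One cleaned sentence in the same list-of-tokens shape as `corpus`
toy_corpus = [["brilliant", "obsessive", "kreizler", "known", "alienist"]]

for pair in skipgram_pairs(toy_corpus, window=1):
    print(pair)
```

With `window=1`, each token is paired only with its direct neighbours, which gives 8 pairs for this 5-token sentence. Passing `sg=1` and `window=...` to `Word2Vec` selects the corresponding training mode and window size in gensim.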

In [7]:
## Prepare features with the vocab (Gensim 3.x API, raises an error on Gensim >= 4.0)

words = word2vec.wv.vocab
words

AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

In [8]:
## Approach in the newer versions (Gensim >= 4.0)

words = word2vec.wv.key_to_index
words

{'kreizler': 0,
 'alienist': 1,
 'make': 2,
 'police': 3,
 'upon': 4,
 'prostitute': 5,
 'killer': 6,
 'responsible': 7,
 'gruesome': 8,
 'murder': 9,
 'boy': 10,
 'psychologist': 11,
 'newly': 12,
 'call': 13,
 'appointed': 14,
 'seen': 15,
 'commissioner': 16,
 'theodore': 17,
 'criminal': 18,
 'roosevelt': 19,
 'ritualistic': 20,
 'york': 21,
 'never': 22,
 'set': 23,
 'selling': 24,
 'book': 25,
 'caleb': 26,
 'carr': 27,
 'basis': 28,
 'psychological': 29,
 'thriller': 30,
 'amidst': 31,
 'laszlo': 32,
 'vast': 33,
 'wealth': 34,
 'extreme': 35,
 'poverty': 36,
 'technological': 37,
 'innovation': 38,
 'new': 39,
 'dr': 40,
 'murderer': 41,
 'man': 42,
 'newspaper': 43,
 'society': 44,
 'job': 45,
 'along': 46,
 'controversial': 47,
 'view': 48,
 'social': 49,
 'pariah': 50,
 'circle': 51,
 'helped': 52,
 'band': 53,
 'outsider': 54,
 'tireless': 55,
 'effort': 56,
 'eventually': 57,
 'answer': 58,
 'question': 59,
 'behind': 60,
 'alienated': 61,
 'behavior': 62,
 'deviant': 63,
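
If the same notebook has to run on both Gensim 3.x and 4.x, one option is a small helper that tries the new attribute first and falls back to the old one. This is a sketch under assumptions: `vocab_words` is a hypothetical helper, and the `SimpleNamespace` objects below are stand-ins for gensim's `KeyedVectors`, used so the example runs without a trained model.

```python
from types import SimpleNamespace


def vocab_words(keyed_vectors):
    """Return the vocabulary as a list of words on either Gensim 3.x or 4.x."""
    # Check the Gensim 4.x attribute first: on 4.x, touching `.vocab` raises AttributeError
    if hasattr(keyed_vectors, "key_to_index"):
        return list(keyed_vectors.key_to_index)
    # Gensim 3.x: `.vocab` is a dict keyed by word
    return list(keyed_vectors.vocab)


# Stand-ins that mimic the two API shapes (not real gensim objects)
new_style = SimpleNamespace(key_to_index={"kreizler": 0, "alienist": 1})
old_style = SimpleNamespace(vocab={"kreizler": object(), "alienist": object()})

print(vocab_words(new_style))  # ['kreizler', 'alienist']
print(vocab_words(old_style))  # ['kreizler', 'alienist']
```

In this notebook the call would be `vocab_words(word2vec.wv)`.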


### # Finding a word's vector and its most similar words:

In [9]:
vec = word2vec.wv['kreizler']
vec

array([-5.3003023e-04,  2.6677107e-04,  5.0826091e-03,  8.9982525e-03,
       -9.2786001e-03, -7.1347621e-03,  6.4706388e-03,  8.9893667e-03,
       -5.0251847e-03, -3.7632736e-03,  7.3707411e-03, -1.5778933e-03,
       -4.5439592e-03,  6.5781437e-03, -4.8493599e-03, -1.8237075e-03,
        2.8823689e-03,  9.8521635e-04, -8.2716849e-03, -9.4813062e-03,
        7.3286290e-03,  5.0966865e-03,  6.7790011e-03,  7.6997356e-04,
        6.3372087e-03, -3.3905355e-03, -9.5822802e-04,  5.7557095e-03,
       -7.5211399e-03, -3.9286385e-03, -7.4986331e-03, -9.4132160e-04,
        9.5736273e-03, -7.3562195e-03, -2.3464924e-03, -1.9232829e-03,
        8.0974512e-03, -5.9350817e-03,  2.2046863e-05, -4.7575883e-03,
       -9.6042166e-03,  5.0135367e-03, -8.7609673e-03, -4.4046296e-03,
       -2.3056018e-05, -2.8432609e-04, -7.6990118e-03,  9.6035469e-03,
        5.0019673e-03,  9.2336470e-03, -8.1646452e-03,  4.4970601e-03,
       -4.1454998e-03,  8.1364741e-04,  8.5114492e-03, -4.4847853e-03,
      

**=>** The vector of the word `kreizler`.

In [10]:
## Most similar words to `kreizler`

word2vec.wv.most_similar('kreizler')

[('investigation', 0.2902636229991913),
 ('commissioner', 0.21957415342330933),
 ('murder', 0.21617160737514496),
 ('illustrator', 0.20418767631053925),
 ('basis', 0.19654962420463562),
 ('band', 0.17204968631267548),
 ('newspaper', 0.1694621592760086),
 ('thriller', 0.15191279351711273),
 ('wealth', 0.14258213341236115),
 ('eventually', 0.13999484479427338)]
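
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below reimplements that ranking in plain Python on hand-made vectors. The vectors are assumed values for illustration, not the ones learned above, and `cosine_similarity` and `rank_similar` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors, topn=10):
    """Return up to `topn` (other_word, similarity) pairs, most similar first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Hand-made 2-D vectors (assumed values, NOT taken from the trained model)
toy_vectors = {
    "kreizler": [1, 0],
    "investigation": [4, 3],
    "wealth": [0, 2],
    "band": [-3, 0],
}

print(rank_similar("kreizler", toy_vectors, topn=2))
```

A score near 1 means the two vectors point in nearly the same direction, while a score near 0 means they are unrelated. The scores in this notebook's outputs stay below 0.3, which is plausibly a consequence of training on only five sentences.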

In [11]:
## Most similar words to `thriller`

word2vec.wv.most_similar('thriller')

[('ritualistic', 0.18682411313056946),
 ('psychologist', 0.182087242603302),
 ('question', 0.18049369752407074),
 ('upon', 0.1755843162536621),
 ('prostitute', 0.1626976728439331),
 ('poverty', 0.15943913161754608),
 ('department', 0.1589145064353943),
 ('kreizler', 0.15191277861595154),
 ('illustrator', 0.14638391137123108),
 ('carr', 0.1425441950559616)]

In [12]:
## Most similar words to `alienist`

word2vec.wv.most_similar('alienist')

[('carr', 0.18908195197582245),
 ('never', 0.18853941559791565),
 ('circle', 0.1837977170944214),
 ('appointed', 0.16103841364383698),
 ('brilliant', 0.15977992117404938),
 ('criminal', 0.15935558080673218),
 ('moore', 0.15890660881996155),
 ('call', 0.13751624524593353),
 ('conduct', 0.13453452289104462),
 ('set', 0.1281122863292694)]