# <center> <font size = 24 color = 'steelblue'> <b> Pre-trained word embedding model from gensim

<div class="alert alert-block alert-info">
    
<font size = 4> 
    
**This notebook demonstrates representation of text using pre-trained word embedding models.**

# <a id= 'w0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#w1)<br>
[2. Model implementation](#w2)<br>
[3. Load the embedding model](#w3)<br>

    

<font size =5 color = 'seagreen'> 
    
Using a pre-trained word2vec model to find the most similar words.
    
<b>For this demonstration, the `Google News` vector embeddings are used.

##### <a id = 'w1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [None]:
!pip install scikit-learn
!pip install gensim
!pip install spacy
!python -m spacy download en_core_web_md

 <font size =5 color = 'seagreen'> <b> Import packages

In [None]:
import os
from gensim.models import KeyedVectors  # container for pre-trained word vectors

# Suppress warnings to keep the notebook output readable
import warnings  # standard-library module for controlling warning messages
warnings.filterwarnings("ignore")

import spacy

[top](#w0)

 ##### <a id = 'w2'>
<font size = 10 color = 'midnightblue'> <b>  Model implementation

<font size = 5 color = pwdrblue> <b> Get the word embeddings

In [None]:
# The pre-trained vectors file is expected in the current working directory
path = os.getcwd()
file_name = 'GoogleNews-vectors-negative300 (1).bin'
pretrained_path = os.path.join(path, file_name)

<font size = 5 color = pwdrblue> <b> Load the model

In [None]:
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True) #load the model

<font size = 5 color = pwdrblue> <b> Check number of words in vocabulary

In [None]:
print("Number of words in vocabulary:", len(w2v_model.index_to_key))

In [None]:
print(f"First few words of the vocabulary :\n{ w2v_model.index_to_key[:20]}")

<font size = 5 color = pwdrblue> <b> Examine the model to extract the most similar words for a given word, such as `joyful` or `Travel`

In [None]:
w2v_model.most_similar('joyful')

In [None]:
w2v_model.most_similar('Travel')
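
`most_similar` ranks the vocabulary by cosine similarity to the query word. The idea can be sketched with plain NumPy on a toy table; the words, the 3-dimensional values, and the helper names `cosine_similarity` and `toy_most_similar` are hypothetical and only illustrate the ranking step, not gensim's actual implementation.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
toy_vectors = {
    "joyful": np.array([0.9, 0.1, 0.0]),
    "happy": np.array([0.8, 0.2, 0.1]),
    "cheerful": np.array([0.7, 0.3, 0.0]),
    "solid": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_most_similar(word, vectors, topn=2):
    """Return the `topn` (word, score) pairs closest to `word`."""
    query = vectors[word]  # raises KeyError for unknown words
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word  # exclude the query word itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


neighbours = toy_most_similar("joyful", toy_vectors)
print(neighbours)
```

In this toy table, `happy` and `cheerful` point in nearly the same direction as `joyful`, so they rank ahead of `solid`.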

In [None]:
w2v_model.most_similar('real-world')

<div class="alert alert-block alert-success">
<font size = 4> 
    
<center><b> The error (a `KeyError`) occurs because the word `real-world` is not present in the vocabulary.</b>


<font size = 5 color = seagreen> <b> The snippet below handles the error and checks similarity for multiple words:

In [None]:
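inp = "y"
while inp.lower() == 'y':
    word = input("Enter a word to get similar words: ")
    try:
        # Look up first so the header is printed only for known words
        similar_words = w2v_model.most_similar(word)
        print(f"Most similar words to '{word}':\n")
        for t in similar_words:
            print(t)
        print('\n')
    except KeyError:
        # Raised when the word is not in the model's vocabulary
        print('Word does not exist in the vocabulary!')
    inp = input("Do you want to continue? (Y/N): ")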
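
An alternative to catching the error is to check the vocabulary before querying and to try a few spelling variants of the word. The sketch below is a minimal illustration: `resolve_key` and `toy_vocabulary` are hypothetical names, and the choice of variants (lowercase, hyphen replaced by underscore or space) is an assumption rather than a rule of the model.

```python
def resolve_key(word, vocabulary):
    """Return the first spelling variant of `word` found in `vocabulary`, else None."""
    variants = [
        word,                    # exact form
        word.lower(),            # lowercase form
        word.replace("-", "_"),  # hyphen replaced by underscore
        word.replace("-", " "),  # hyphen replaced by space
    ]
    for variant in variants:
        if variant in vocabulary:
            return variant
    return None


# Toy stand-in for a real vocabulary mapping (hypothetical entries).
toy_vocabulary = {"joyful": 0, "travel": 1, "real_world": 2}

print(resolve_key("real-world", toy_vocabulary))
print(resolve_key("Travel", toy_vocabulary))
print(resolve_key("unseen-term", toy_vocabulary))
```

With the loaded model, `w2v_model.key_to_index` could be passed as `vocabulary`. Whether a particular variant exists there depends on the model, so the result should still be checked for `None` before calling `most_similar`.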


<font size = 5 color = pwdrblue> <b>  Get the word vector of any term

In [None]:
w2v_model['beautiful']

<font size = 5 color = pwdrblue> <b>  Get the embeddings for a complete text

<div class="alert alert-block alert-success">
<font size = 4> 
    
- A simple approach is to sum or average the embeddings of the individual words.
- Let us look at a small example using another NLP library, spaCy.
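
The averaging idea can be sketched with NumPy on a toy embedding table. The words, the 2-dimensional values, and the helper name `average_embedding` are hypothetical; skipping out-of-vocabulary tokens and returning a zero vector when nothing is known are assumptions made for this illustration.

```python
import numpy as np

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_vectors = {
    "word": np.array([1.0, 0.0]),
    "embeddings": np.array([0.0, 1.0]),
    "work": np.array([1.0, 1.0]),
}


def average_embedding(tokens, vectors, dim):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # No token has a vector: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(known, axis=0)


tokens = "word embeddings really work".split()
sentence_vector = average_embedding(tokens, toy_vectors, dim=2)
print(sentence_vector)
```

Here `really` has no vector and is ignored, so the result is the mean of the remaining three vectors. Averaging keeps the sentence vector on the same scale as the word vectors regardless of sentence length, which a plain sum does not.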

[top](#w0)

 ##### <a id = 'w3'>
<font size = 10 color = 'midnightblue'> <b> Load the embedding model

In [None]:
# Load the English spaCy pipeline, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [None]:
# Process the text to create a spaCy Doc object
mydoc = nlp("Artificial intelligence revolutionizes industries by enhancing automation and decision-making.")

# Get the averaged vector for the entire sentence
print(mydoc.vector)