# Creating Word Embeddings

In [None]:
# need to install (used by wmdistance; recent gensim versions may ask for POT instead)
#!pip install pyemd

In [None]:
# Importing all the needed libraries for this project
import pandas as pd
import numpy as np
import gensim
from gensim.models import KeyedVectors

TASK 0: 

- load the 80s_detailed songs using pandas
- create a data series that contains only the lyrics of the songs by Madonna

In [None]:
df = pd.read_csv('../input/80s-songs/80s_detailed(1).csv', delimiter = ',')
inputdf = df.loc[df['artist'] == "Madonna", 'lyrics']
inputdf

TIPS:

- when creating word embeddings, we generally do not remove any tokens from our data. This means that we keep stopwords
- we work at token level, so "house" and "houses" will result in two embedding vectors (hopefully very similar)
- we can create lemma embeddings, nothing prevents us from doing this. It is simply rarely done in practice (since we need hundreds of millions of words of text to create good embeddings)
- in some cases, we can apply some pre-processing to avoid too much sparseness (i.e., too much variation in our data). For example, when working with Twitter data, you may want to substitute all URLs with a placeholder (e.g., URL). The same goes for all user mentions (e.g., @GRONlp --> @USER).
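The placeholder substitution from the last tip can be sketched with regular expressions. This is only an illustration: the function name `normalize_tweet` and the two patterns are assumptions made for this example, and the patterns are rough approximations rather than a complete definition of URLs or user handles.

```python
import re

# Hypothetical patterns: simple approximations, not a full URL or handle grammar.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def normalize_tweet(text):
    """Replace URLs with 'URL' and user mentions with '@USER'."""
    text = URL_PATTERN.sub("URL", text)        # URLs first, so an '@' inside a URL is not touched
    text = MENTION_PATTERN.sub("@USER", text)  # then user mentions
    return text


example = "Thanks @GRONlp for the slides: https://example.org/slides.pdf"
print(normalize_tweet(example))
```

After this step every URL and every mention maps to a single token, so the model learns one vector for each placeholder instead of thousands of near-unique ones.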

In our case, we can relax and take everything into account.

As with Topic Modeling, we will use `gensim`. The only requirement is that we pre-process the text and split it into tokens. We will use the function `simple_preprocess()` from `gensim.utils`, which lowercases the text and splits it into tokens.
This allows us to prepare our data.

Complete the code below:

In [None]:
# Checking how many songs we have:
print(len(inputdf))
# Apply gensim.utils.simple_preprocess to every song
madonna_songs_tokens = [gensim.utils.simple_preprocess(line) for line in inputdf]
# print the song at index 3 in the pre-processed list
print(madonna_songs_tokens[3])

In [None]:
print(madonna_songs_tokens)

## Time to create our word2vec model
Now that we have our songs, we are ready to create our word2vec model using `gensim`.
We can create (and train) a word2vec model with the code in the block below.

*IMPORTANT PARAMETERS* 

Changing any of these parameters affects your word embeddings (and their quality). What do they mean?

### window
The window parameter sets the maximum distance between the target word and a neighboring word. Specifying a higher window value brings in neighboring words that are less closely related to the target word. Vice versa, a lower window value yields words that are more closely related to the target word. Previous work has shown that a window between 2 and 5 gives very good results. Experiment with this value to see how it impacts the word embeddings.
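To make the effect of the window concrete, the sketch below lists which neighbors count as context for each target word. The helper `context_pairs` and the toy sentence are made up for this illustration. It is a simplification: it always uses the full window, whereas real implementations may use a smaller effective window for some targets.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs where the context word is at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


line = ["we", "keep", "every", "token", "here"]  # hypothetical toy input
narrow = context_pairs(line, window=1)
wide = context_pairs(line, window=3)

# Context words seen for the target "every" under each setting
print([c for t, c in narrow if t == "every"])
print([c for t, c in wide if t == "every"])
```

With the larger window, "every" is also paired with words two positions away, which is where the looser associations come from.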
### min_count
Minimum frequency count of words. The model ignores words that occur fewer than `min_count` times. Extremely infrequent words are usually unimportant, so it's best to get rid of those. With very small datasets/corpora (up to 10 million words or so, as a rule of thumb), you keep everything (i.e., `min_count=1`). With very large corpora (from 100 million words upward) you may want to use a larger minimum frequency (e.g., `min_count=10`). This parameter affects the coverage (i.e., how many tokens in a text your word embeddings can find), the memory usage, and the storage requirements of the model files.
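The trade-off between `min_count` and coverage can be illustrated without training anything. In this sketch the helpers `build_vocab` and `coverage` and the toy documents are invented for the example; they only mimic the frequency filter and are not gensim functions.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_count):
    """Keep only the tokens that occur at least `min_count` times in the corpus."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {tok for tok, n in counts.items() if n >= min_count}


def coverage(tokens, vocab):
    """Fraction of tokens in a text that are present in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


# Hypothetical toy corpus: three tokenized documents
docs = [["love", "is", "love"], ["home", "is", "home"], ["love", "and", "heartache"]]
new_text = ["love", "and", "home"]

for min_count in (1, 2):
    vocab = build_vocab(docs, min_count)
    print(min_count, len(vocab), coverage(new_text, vocab))
```

Raising `min_count` shrinks the vocabulary (less memory and storage) but leaves more tokens of a new text without a vector.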
### workers
`workers` specifies the number of independent threads used for training. With `workers=2` (or more), training runs in parallel threads; with `workers=1` there is no parallelization. Note that gensim's own default is `workers=3`. A higher number of workers can make training faster, provided your machine can handle it. For now, we set `workers=1`.
### sg
`sg` has only 2 values, 0 or 1. It specifies which approach to use when creating the word embeddings: 0 = continuous bag-of-words (CBOW); 1 = skip-gram.
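The difference between the two approaches lies in what is predicted from what. The sketch below is a conceptual illustration only: `training_examples` is a made-up helper that shows the shape of the prediction task, not how gensim implements training internally.

```python
def training_examples(target, context, sg):
    """Return (input, output) pairs for one target word and its context words."""
    if sg == 1:
        # skip-gram: the target word is the input, each context word is an output
        return [(target, c) for c in context]
    # CBOW: the context words together are the input, the target word is the output
    return [(tuple(context), target)]


print(training_examples("home", ["at", "now"], sg=1))
print(training_examples("home", ["at", "now"], sg=0))
```

Skip-gram produces one example per context word, while CBOW combines the whole context into a single example.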

In [None]:
# Training our word2vec model
# Note -> make sure that madonna_songs_tokens is a list of lists, with each song in its own list of tokens,
# i.e. madonna_songs_tokens = [['these', 'are', 'some', 'lyrics'], [...], ....]
w2v_model = gensim.models.Word2Vec(
        madonna_songs_tokens,
        window=5,
        min_count=1,
        workers=1,
        sg = 1)

## What does our word2vec model look like?

In [None]:
# printing the model
print(w2v_model)

Before moving forward, let's see some useful functions and best practices:

- once we have trained our model, we should save it. Normally, these models may take a very long time to train. To save the model use the function `.save()`, passing the file name of your model

- load a saved model: this can be done with the `.load()` function of the class that was saved (`gensim.models.Word2Vec.load()` for a full model, `KeyedVectors.load()` for vectors saved on their own), specifying the name of the model you are loading. If the model is stored in a different folder, also specify the full path to the model

- print the vocabulary of the model with `wv.index_to_key` (`wv.vocab` in gensim versions before 4.0). This is important because if a word is not in your vocabulary, the model will raise an error

- print the vector representation of a specific word (e.g. `home`) using `.wv['home']`
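Because looking up a word that is not in the vocabulary raises an error, it can help to check membership first. The helper `vector_or_none` below is a hypothetical convenience function, not part of gensim; a plain dictionary stands in for the model's `wv` object so that the sketch runs without a trained model.

```python
def vector_or_none(keyed_vectors, word):
    """Return the vector for `word`, or None when the word is out of vocabulary."""
    if word in keyed_vectors:
        return keyed_vectors[word]
    return None


# A plain dict stands in for a model's `wv` object (assumption for this sketch).
toy_vectors = {"home": [0.1, 0.2, 0.3]}

print(vector_or_none(toy_vectors, "home"))
print(vector_or_none(toy_vectors, "spaceship"))
```

The same two operations (`in` and `[...]`) are supported by the `wv` object of a gensim 4 model, so the helper should work unchanged when it is given `wv` instead of the dictionary.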

In [None]:
# save the model
w2v_model.save('madonna_50.model')

# load the model (a full model saved with .save() is loaded with Word2Vec.load)
my_model = gensim.models.Word2Vec.load("madonna_50.model")

# Printing the first 20 words that are in our vocabulary
words = list(my_model.wv.index_to_key)
print(words[0:20])

target_vector = my_model.wv['home']
print(target_vector)
# print the dimensions of the vector
print(len(target_vector))

## What can we do with our word2vec model?
As said before, word embeddings are created by looking at the context in which a word occurs.
This means that the meaning of a word can be obtained by the company it keeps (or its usage).

Word embeddings are then a tool to do automatic semantic analysis of your data/corpus.

Easy things we can do:
- find the top N most similar words of a target word (`.wv.most_similar(positive='string', topn=int)`)
- determine how similar two words are with respect to each other (`.wv.similarity('string1', 'string2')`)
- directly compare 2 sentences using the distance between the words that compose them (`.wv.wmdistance(sentence_1, sentence_2)`); this is a distance, so lower values mean more similar sentences
- find analogies (`.wv.most_similar_cosmul(positive=['word1', 'word2'], negative=['word3'])`)

`wv` = word vectors: it is the `KeyedVectors` object stored inside a trained model. It is needed only for a full model like this one; pre-trained vectors loaded later are already `KeyedVectors`, so the same functions are called on them directly.

*KEEP IN MIND* : all of your results depend on 1) the size of your corpus; 2) the content of your data!
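The word-to-word similarity used here is the cosine similarity between the two word vectors. As a minimal sketch of what that number means, the function below computes it by hand on made-up 2-dimensional vectors (real embeddings have many more dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors chosen to show the extremes of the range
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and vectors pointing in opposite directions score close to -1. Only the direction matters, not the length of the vectors.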

In [None]:
# retrieving 5 words that are similar to 'prayer' (according to our dataset!)
word = 'prayer'
top5_sim = my_model.wv.most_similar(positive = word, topn = 5)
print(top5_sim)

In [None]:
# similarity of the word 'prayer' and the word 'home'
# cosine similarity ranges from -1 (opposite directions) to 1 (same direction); values near 0 mean unrelated words
similarity = my_model.wv.similarity(word, 'home')
print(similarity)

In [None]:
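# Word Mover's Distance between 2 sentences (a distance: lower = more similar)
# simple_preprocess lowercases and strips punctuation, so 'love,' becomes 'love'
sentence_1 = gensim.utils.simple_preprocess('Look around everywhere you turn is heartache')
sentence_2 = gensim.utils.simple_preprocess('Gonna give you all my love, boy')

sentence_similarity = my_model.wv.wmdistance(sentence_1, sentence_2)
print(sentence_similarity)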

In [None]:
# analogies
analogies = my_model.wv.most_similar_cosmul(positive=['love', 'prayer'], negative=['virgin'])
most_similar_key, similarity = analogies[0]  # look at the first match
print(most_similar_key, similarity)

## Visualizing our model
For better understanding, it can be nice to visualize things!

In [None]:
# one row per vocabulary word, one column per embedding dimension
X = my_model.wv[my_model.wv.index_to_key]
df = pd.DataFrame(X)
print(df.shape)
df.head()

In [None]:
# Computing the correlation matrix
X_corr = df.corr()

# Computing eigenvalues and eigenvectors (eigh: the correlation matrix is symmetric)
values, vectors = np.linalg.eigh(X_corr)

# Sorting the eigenvectors by their eigenvalues in descending order
args = (-values).argsort()
values = values[args]
vectors = vectors[:, args]

# Taking the first 2 components, which explain the most variance, for projecting
new_vectors = vectors[:, :2]

# Projecting onto the new 2-dimensional space
neww_X = np.dot(X, new_vectors)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(13,7))
plt.scatter(neww_X[:,0],neww_X[:,1],linewidths=10,color='blue')
plt.xlabel("PC1",size=15)
plt.ylabel("PC2",size=15)
plt.title("Word Embedding Space",size=20)
vocab=list(my_model.wv.index_to_key)
for i, word in enumerate(vocab):
    plt.annotate(word,xy=(neww_X[i,0],neww_X[i,1]))

A last bit of info: people before you have created word embedding models using very large collections of textual data. This methodology is not free from ethical concerns about what these large models actually "learn" and how they keep promoting existing societal biases.

**How do we load existing models?**

Gensim comes with a bunch of models already available. This is a list of available models:
- 'fasttext-wiki-news-subwords-300': FastText (Facebook Inc.) embeddings; 300-dimensional vectors; Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens) - English.
- 'conceptnet-numberbatch-17-06-300': 
- 'word2vec-ruscorpora-300': trained on the full Russian National Corpus (about 250M words). The model contains 185K words. dimension - 300; window_size - 10 - Russian
- 'word2vec-google-news-300': Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. - English
- 'glove-wiki-gigaword-50': Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). dimension - 50 - English
- 'glove-wiki-gigaword-100': Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). dimension - 100 - English
- 'glove-wiki-gigaword-200': Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). dimension - 200 - English
- 'glove-wiki-gigaword-300': Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/). dimension - 300 - English
- 'glove-twitter-25': Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) dimension 25; English
- 'glove-twitter-50': Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) dimension 50; English
- 'glove-twitter-100': Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) dimension 100; English
- 'glove-twitter-200': Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) dimension 200; English

In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

If this command raises an error, try the following:
- go to https://github.com/RaRe-Technologies/gensim-data
- click on 'code' and download as zip
- put the zipped file in your home directory and unzip it
- copy the info in this file (https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json) and save it into a file called `information.json` in the same folder
- run the command

If you are on a Mac and you get an error of the kind "urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]", go to Finder --> Applications --> Python and double-click on the file "Install Certificates.command".

In [None]:
import gensim.downloader as api  # the downloader lets us fetch and load pre-trained models
# load the target model (it is downloaded on first use)
glove_vectors = api.load("glove-twitter-25")  # returns the GloVe vectors as KeyedVectors

Now you can do the same things we did with our Madonna model, this time using an existing model.

In [None]:
# retrieving 5 words that are similar to 'prayer' (according to the pre-trained Twitter vectors!)
word = 'prayer'
top5_sim = glove_vectors.most_similar(positive = word, topn = 5)
print(top5_sim)

In [None]:
similar = glove_vectors.similarity(word, 'home')
print(similar)

In [None]:
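# Word Mover's Distance between 2 sentences (a distance: lower = more similar)
# simple_preprocess lowercases and strips punctuation, so 'love,' becomes 'love'
sentence_1 = gensim.utils.simple_preprocess('Look around everywhere you turn is heartache')
sentence_2 = gensim.utils.simple_preprocess('Gonna give you all my love, boy')

sentence_similar = glove_vectors.wmdistance(sentence_1, sentence_2)
print(sentence_similar)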

In [None]:
# analogies
analogies = glove_vectors.most_similar_cosmul(positive=['love', 'prayer'], negative=['virgin'])
most_similar_key, similarity = analogies[0]  # look at the first match
print(most_similar_key, similarity)

In [None]:
# one row per vocabulary word, one column per embedding dimension
X = glove_vectors[glove_vectors.index_to_key]
df = pd.DataFrame(X)
print(df.shape)
df.head()