# **Learn-by-building Word2Vec Embeddings**

In a team of two, create a Word2Vec model using the `gensim` and/or `elang` package. You may use an existing dataset or gather your own to:

- Build a word embedding from text data (NLP), or
- Build a recommender system (Non-NLP), or
- Anything else that has sequential properties (Non-NLP).

<b>
Deadline: Monday, 4 May 2020

Submit: Google Colab Notebook and Pre-trained Model

Submission: My Trello Card "LBB Submission: Word2Vec Embeddings"
</b>

The rubric is as follows:

## Collect the Data

- What is the data about? Give a short explanation of the data.
- Where do you get the data from?

## Preprocessing

### NLP
Text data needs to be cleansed before it is fed into a model so that the model can capture the semantics of words correctly.
- Which cleansing steps need to be performed?
- Do you need to remove stopwords or perform stemming/lemmatization?
- Have you confirmed that the sentences are in the form of a "list of lists of words"?
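
As a rough illustration of these questions, the sketch below lowercases each sentence, strips non-letter characters, optionally drops stopwords, and returns the "list of lists of words" shape. The `STOPWORDS` set, the `clean_sentence` helper, and the sample sentences are all hypothetical; a real project would likely use a fuller stopword list and cleansing rules suited to its own data.

```python
import re

# Tiny illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"the", "a", "an", "is", "on", "and"}

def clean_sentence(text, remove_stopwords=True):
    """Turn one raw sentence into a list of lowercase word tokens."""
    text = text.lower()
    # Replace everything except ASCII letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

# Hypothetical raw sentences, used only to show the output shape.
raw_sentences = ["The cat sat on the mat!", "A dog, and a cat: 2 friends."]
corpus = [clean_sentence(s) for s in raw_sentences]

# Each element of `corpus` is one sentence represented as a list of words.
print(corpus)
```

Whether stopwords should be removed at all depends on the task, so the helper keeps that step behind a flag.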

### Non-NLP
This step may vary depending on the case, but in general:
- Are there any missing values? If yes, how do you handle them?
- Have you confirmed that the data types are appropriate?
- Do you need to perform a train-test split?
- Which feature do you use to get the embedding vectors?
- Have you confirmed that the feature is in the form of a "list of lists"?
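
For a transactional dataset, one possible way to reach the "list of lists" shape is to group item identifiers by transaction. The sketch below assumes records arrive as `(invoice, item)` pairs; the invoice numbers, item codes, and the `to_sequences` helper are made up for illustration. With a pandas DataFrame, grouping by the invoice column and collecting the item column into lists would give a comparable result.

```python
from collections import defaultdict

# Hypothetical (invoice, item) records; None marks a missing item code.
records = [
    ("INV001", "85123A"),
    ("INV001", "71053"),
    ("INV002", "84406B"),
    ("INV001", None),
    ("INV002", "85123A"),
    ("INV003", "22752"),
]

def to_sequences(rows, min_length=2):
    """Group items by invoice, skipping missing values and short baskets."""
    grouped = defaultdict(list)
    for invoice, item in rows:
        if item is None:
            continue  # one simple way to handle missing values: drop the row
        grouped[invoice].append(str(item))  # keep every item as a string token
    # A basket with a single item provides no context for Word2Vec.
    return [items for items in grouped.values() if len(items) >= min_length]

sequences = to_sequences(records)
print(sequences)
```

Casting every item to `str` keeps the tokens consistent, which makes later lookups in the trained vocabulary less error-prone.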

## Training

Consider the following parameters when training the model:
- What is the dimension of the embedding vectors?
- What is the maximum distance between the context and target feature?
- What is the minimum frequency for a feature to be included in the vocabulary?
- Are you using the Skip-Gram or CBOW architecture? Why?
- Which training optimization (hierarchical softmax or negative sampling) do you use?
- How many epochs do you let the model train for?

After training is done:
- What is the final size of your vocabulary? Does it differ much from the number of unique features in the original data?
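
To answer this, it can help to estimate how many features survive the minimum-frequency cut before looking at the model. The sketch below counts tokens with plain Python; `vocabulary_report` is a hypothetical helper and the sentences are toy data. With a trained gensim 4.x model, the kept count could then be compared against `len(model.wv)`.

```python
from collections import Counter

def vocabulary_report(sentences, min_count):
    """Return (number of unique tokens, number with frequency >= min_count)."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = [token for token, count in counts.items() if count >= min_count]
    return len(counts), len(kept)

# Toy data: "rug" and "dog" appear once and fall below the threshold.
sentences = [
    ["cat", "sat", "mat"],
    ["cat", "sat", "rug"],
    ["dog", "sat", "mat"],
]
unique_total, unique_kept = vocabulary_report(sentences, min_count=2)
print(unique_total, unique_kept)
```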

Tips on training a Word2Vec model:
- Use `logging` to monitor the training process.
- Use `Callback` to monitor the loss of each epoch.
- Exact reproducibility is hard to maintain, so always save your model once training is done.

## Visualize

It is often helpful to visualize the embeddings you have created. An embedding may have tens to hundreds of dimensions, whereas humans can see at most three, so the dimensionality of the vectors must be reduced first.
- Which dimensionality reduction algorithm do you use?
- Is there any interesting pattern that can be seen in the visualization? Please elaborate.
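
As one possible answer to the first question, the sketch below reduces vectors to two dimensions with PCA computed through NumPy's SVD. The random matrix is only a stand-in for a real embedding matrix (for example `model.wv.vectors` in gensim), and `pca_2d` is a hypothetical helper. Other algorithms such as t-SNE are common alternatives.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: 30 "words" with 50-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 50))

coords = pca_2d(embeddings)
print(coords.shape)
```

The resulting two columns can be passed to any scatter-plot function, with each point labeled by its word or item.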

Tips:
- You may use the `elang` package for this section.

## Use the Word Vectors

You can use several methods provided by `gensim` to see how your model is performing.

### NLP
- Choose any words from the vocabulary and list several similar words. Do they make sense semantically?
- Is your model able to pick out the one word in a list whose context differs from the others?
- Is your model able to capture semantic relations between words?

### Non-NLP
This step is loosely defined and depends on the case, but consider the following:
- Make sure to use the similarity score in your analysis.
- How can you use your model to compute similarity based on multiple features?
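
One common approach to the multiple-feature question is to average the vectors of the query items and rank the remaining items by cosine similarity to that average. gensim's `most_similar` accepts several keys in `positive` and works along similar lines. The sketch below spells the idea out with NumPy; the item names, vectors, and the `similar_to_many` helper are hypothetical.

```python
import numpy as np

# Illustrative item vectors; in practice they would come from the trained model.
item_vectors = {
    "bread": np.array([1.0, 0.2, 0.0]),
    "butter": np.array([0.9, 0.4, 0.0]),
    "jam": np.array([0.7, 0.7, 0.1]),
    "hammer": np.array([0.0, 0.1, 1.0]),
}

def unit(vector):
    """Scale a vector to length 1 so dot products become cosine similarities."""
    return vector / np.linalg.norm(vector)

def similar_to_many(query_items, vectors, topn=3):
    """Rank items by cosine similarity to the mean of the query item vectors."""
    query = unit(np.mean([unit(vectors[item]) for item in query_items], axis=0))
    scores = [
        (item, float(unit(vector) @ query))
        for item, vector in vectors.items()
        if item not in query_items  # do not recommend what is already chosen
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = similar_to_many(["bread", "butter"], item_vectors)
print(ranked)
```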

Tips:
- If the output doesn't make any sense at all, consider retraining your model with more data or tuning the parameters.

## Conclusion

### NLP
- How well does your model capture the semantics of words?
- What other tasks can be done once words are successfully represented as vectors?

### Non-NLP
- Does your model work according to your expectation?
- What might be the next step after the model is obtained?



# Dataset Reference


## NLP

**Data provided on `gensim`: https://github.com/RaRe-Technologies/gensim-data**

In [0]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # None = no truncation; -1 is deprecated since pandas 1.0

In [0]:
import gensim.downloader as api
info = api.info()
corpora_list = ["quora-duplicate-questions", "text8", "fake-news", "20-newsgroups"]
pd.DataFrame(info['corpora']).T.loc[corpora_list,:]

corpus,num_records,record_format,file_size,reader_code,license,fields,description,checksum,file_name,read_more,parts,checksum-0,checksum-1,checksum-2,checksum-3
quora-duplicate-questions,404290,dict,21684784,https://github.com/RaRe-Technologies/gensim-data/releases/download/quora-duplicate-questions/__init__.py,probably https://www.quora.com/about/tos,"{'question1': 'the full text of each question', 'question2': 'the full text of each question', 'qid1': 'unique ids of each question', 'qid2': 'unique ids of each question', 'id': 'the id of a training set question pair', 'is_duplicate': 'the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise'}","Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not.",d7cfa7fbc6e2ec71ab74c495586c6365,quora-duplicate-questions.gz,[https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs],1,,,,
text8,1701,list of str (tokens),33182058,https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py,not found,,"First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.",68799af40b6bda07dfa47a32612e5364,text8.gz,[http://mattmahoney.net/dc/textdata.html],1,,,,
fake-news,12999,dict,20102776,https://github.com/RaRe-Technologies/gensim-data/releases/download/fake-news/__init__.py,https://creativecommons.org/publicdomain/zero/1.0/,"{'crawled': 'date the story was archived', 'ord_in_thread': '', 'published': 'date published', 'participants_count': 'number of participants', 'shares': 'number of Facebook shares', 'replies_count': 'number of replies', 'main_img_url': 'image from story', 'spam_score': 'data from webhose.io', 'uuid': 'unique identifier', 'language': 'data from webhose.io', 'title': 'title of story', 'country': 'data from webhose.io', 'domain_rank': 'data from webhose.io', 'author': 'author of story', 'comments': 'number of Facebook comments', 'site_url': 'site URL from BS detector', 'text': 'text of story', 'thread_title': '', 'type': 'type of website (label from BS detector)', 'likes': 'number of Facebook likes'}","News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.",5e64e942df13219465927f92dcefd5fe,fake-news.gz,[https://www.kaggle.com/mrisdal/fake-news],1,,,,
20-newsgroups,18846,dict,14483581,https://github.com/RaRe-Technologies/gensim-data/releases/download/20-newsgroups/__init__.py,not found,"{'topic': 'name of topic (20 variant of possible values)', 'set': 'marker of original split (possible values 'train' and 'test')', 'data': '', 'id': 'original id inferred from folder name'}","The notorious collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups.",c92fd4f6640a86d5ba89eaad818a9891,20-newsgroups.gz,[http://qwone.com/~jason/20Newsgroups/],1,,,,


In [0]:
# How to use: load a corpus by name (downloaded and cached on the first call)
news = api.load("20-newsgroups")



**NLTK Corpus: https://www.nltk.org/book/ch02.html**


In [0]:
import nltk
nltk.download(['gutenberg', 'punkt'])
from nltk.corpus import gutenberg

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
def corporaStat(corpora_name):
    """Summarize every file of an NLTK corpus reader as a DataFrame."""
    file_list = []
    for fileid in corpora_name.fileids():
        words = corpora_name.words(fileid)  # read the tokens once, reuse below
        file_dict = {
            "Filename": fileid,
            "Character Count": len(corpora_name.raw(fileid)),
            "Word Count": len(words),
            "Sentence Count": len(corpora_name.sents(fileid)),
            # Vocabulary is case-insensitive: "The" and "the" count once
            "Vocabulary Count": len(set(w.lower() for w in words)),
        }
        file_list.append(file_dict)
    return pd.DataFrame(file_list)

In [0]:
corporaStat(gutenberg)

| | Filename | Character Count | Word Count | Sentence Count | Vocabulary Count |
|---|---|---|---|---|---|
| 0 | austen-emma.txt | 887071 | 192427 | 7752 | 7344 |
| 1 | austen-persuasion.txt | 466292 | 98171 | 3747 | 5835 |
| 2 | austen-sense.txt | 673022 | 141576 | 4999 | 6403 |
| 3 | bible-kjv.txt | 4332554 | 1010654 | 30103 | 12767 |
| 4 | blake-poems.txt | 38153 | 8354 | 438 | 1535 |
| 5 | bryant-stories.txt | 249439 | 55563 | 2863 | 3940 |
| 6 | burgess-busterbrown.txt | 84663 | 18963 | 1054 | 1559 |
| 7 | carroll-alice.txt | 144395 | 34110 | 1703 | 2636 |
| 8 | chesterton-ball.txt | 457450 | 96996 | 4779 | 8335 |
| 9 | chesterton-brown.txt | 406629 | 86063 | 3806 | 7794 |


## Non-NLP

1. Online retail dataset (the one we used in the internal training):
- http://archive.ics.uci.edu/ml/datasets/Online+Retail
- http://archive.ics.uci.edu/ml/datasets/Online+Retail+II

2. Instacart
- Dataset: https://www.kaggle.com/c/instacart-market-basket-analysis/overview
- Reference: https://omarito.me/word2vec-product-recommendations/

3. Music Listening History
- Full Dataset: http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html (Almost 2.5 GB)
- Subsetted Dataset (100 users): https://github.com/tomytjandra/word2vec-embeddings/tree/master/dataset/lastfm-dataset (289 MB)