# Understanding Word2vec Embedding in Practice
Reference: [Understanding Word2vec Embedding in Practice](https://towardsdatascience.com/understanding-word2vec-embedding-in-practice-3e9b8985953)

In this notebook we will:
* Use Gensim to train word2vec embeddings.
* Use NLTK and spaCy to pre-process the text.
* Use t-SNE to visualize the high-dimensional embeddings.

In [9]:
# Import the libraries
import re
import string
from collections import defaultdict

import nltk
import pandas as pd
import spacy
from gensim.models import Word2Vec
from nltk.corpus import stopwords
from sklearn.manifold import TSNE

%matplotlib inline

# Download the NLTK stop-word list and keep it as a set for fast lookups
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

# Load the BBC news dataset
path = 'https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv'
df = pd.read_csv(path)

def clean_text(text):
    '''Lowercase the text, remove text in square brackets, remove punctuation,
    remove words containing digits, and drop stopwords.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Skip texts of two characters or fewer (the check counts characters, not words)
    if len(text) <= 2:
        return None
    # Re-join the remaining words with single spaces, leaving out the stopwords
    return ' '.join(word for word in text.split() if word not in STOPWORDS)

# Apply clean_text to the text column only
df_clean = pd.DataFrame(df.text.apply(clean_text))

[nltk_data] Downloading package stopwords to C:\Users\Yuan
[nltk_data]     Tao\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
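
To make the regex steps in `clean_text` easier to follow, here is a minimal sketch that records the intermediate string after each step. The helper name `show_cleaning_steps` and the sample sentence are hypothetical and only for illustration; the four patterns are copied from `clean_text`, and stop-word filtering is left out so the block runs without NLTK.

```python
import re
import string

def show_cleaning_steps(text):
    """Return the intermediate strings produced by each regex step of clean_text.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    steps = {}
    text = text.lower()
    steps['lowercase'] = text
    # Remove anything enclosed in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    steps['no_brackets'] = text
    # Remove every punctuation character
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    steps['no_punctuation'] = text
    # Remove whole words that contain at least one digit
    text = re.sub(r'\w*\d\w*', '', text)
    steps['no_digit_words'] = text
    return steps

# Made-up sample sentence, not taken from the BBC dataset
sample = "Sales rose 5% in Q3 [source], analysts said."
for name, value in show_cleaning_steps(sample).items():
    print(f'{name:>15}: {value!r}')
```

The removals leave runs of spaces behind; in `clean_text` these disappear because the final `text.split()` and `' '.join(...)` re-join the words with single spaces.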


We use the Natural Language Toolkit (NLTK) to filter out [stop words](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/), because they carry little meaning for the analysis. Several libraries ship stop-word lists, including NLTK, spaCy, Gensim, and scikit-learn. A convenient feature of NLTK's list is that it includes contractions such as don't (the contraction of do not), so these low-information words can be filtered out as well. Note that `clean_text` strips punctuation before it filters, so a contraction reaches the filter without its apostrophe (don't becomes dont) and no longer matches the apostrophe form in the list.
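
The order of punctuation stripping and stop-word filtering matters for contractions. The sketch below compares the two orders on a made-up sentence. `MINI_STOPWORDS` is a hypothetical hand-copied subset of the NLTK list (an assumption that keeps the block runnable without downloading anything), and both function names are illustrative rather than part of the notebook.

```python
import re
import string

# Hypothetical mini stop-word set: a hand-copied subset of the NLTK English list
MINI_STOPWORDS = {"you're", "she's", "a", "the", "is"}

PUNCT_PATTERN = r'[%s]' % re.escape(string.punctuation)

def filter_then_strip(text):
    """Drop stop words first, then strip punctuation."""
    kept = [w for w in text.lower().split() if w not in MINI_STOPWORDS]
    return re.sub(PUNCT_PATTERN, '', ' '.join(kept))

def strip_then_filter(text):
    """Strip punctuation first (the order used in clean_text), then drop stop words."""
    stripped = re.sub(PUNCT_PATTERN, '', text.lower())
    return ' '.join(w for w in stripped.split() if w not in MINI_STOPWORDS)

sample = "You're right, she's a great player"
print(filter_then_strip(sample))   # contractions are matched and removed
print(strip_then_filter(sample))   # contractions lose the apostrophe and survive
```

If contractions matter for a corpus, one option is to filter stop words before removing punctuation. In the BBC file shown in this notebook the apostrophes already appear as spaces (for example `ocean s twelve`), so the effect here is likely small.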

In [18]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
df

| | category | text |
|---|---|---|
| 0 | tech | tv future in the hands of viewers with home th... |
| 1 | business | worldcom boss left books alone former worldc... |
| 2 | sport | tigers wary of farrell gamble leicester say ... |
| 3 | sport | yeading face newcastle in fa cup premiership s... |
| 4 | entertainment | ocean s twelve raids box office ocean s twelve... |
| ... | ... | ... |
| 2220 | business | cars pull down us retail figures us retail sal... |
| 2221 | politics | kilroy unveils immigration policy ex-chatshow ... |
| 2222 | entertainment | rem announce new glasgow concert us band rem h... |
| 2223 | politics | how political squabbles snowball it s become c... |


After cleaning, no rows are lost, and the `text` column now holds the cleaned content. Note that `df_clean` keeps only the `text` column; the `category` column stays in `df`.

In [6]:
df_clean

| | text |
|---|---|
| 0 | tv future hands viewers home theatre systems p... |
| 1 | worldcom boss left books alone former worldcom... |
| 2 | tigers wary farrell gamble leicester say rushe... |
| 3 | yeading face newcastle fa cup premiership side... |
| 4 | ocean twelve raids box office ocean twelve cri... |
| ... | ... |
| 2220 | cars pull us retail figures us retail sales fe... |
| 2221 | kilroy unveils immigration policy exchatshow h... |
| 2222 | rem announce new glasgow concert us band rem a... |
| 2223 | political squabbles snowball become commonplac... |
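
A quick sanity check after cleaning is to compare row counts and count the rows where cleaning returned nothing. The sketch below uses a hypothetical three-row frame (`toy`) and a simplified stand-in for `clean_text` (`toy_clean`), so it runs without the dataset or NLTK; the same three checks could be applied to `df` and `df_clean`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the BBC data
toy = pd.DataFrame({
    'category': ['tech', 'sport', 'business'],
    'text': ['tv future in the hands of viewers', 'ok', 'us retail sales fell'],
})

def toy_clean(text):
    """Simplified stand-in for clean_text: None for texts of two characters or fewer."""
    if len(text) > 2:
        return text.lower()
    return None

# Same construction as df_clean in the notebook
toy_clean_df = pd.DataFrame(toy.text.apply(toy_clean))

print(len(toy) == len(toy_clean_df))         # apply keeps one output row per input row
print(list(toy_clean_df.columns))            # only the text column is carried over
print(int(toy_clean_df.text.isna().sum()))   # rows where cleaning returned None
```

Rows that come back as missing would need to be dropped or filled before tokenizing, since a missing value cannot be split into words.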
