# GloVe: Global Vectors for Word Representation
The basic idea behind the GloVe word embedding is to derive the relationships between words from global corpus statistics.

![image](https://nlp.stanford.edu/projects/glove/images/man_woman.jpg)

## Main concept

But how can statistics represent meaning? Let me explain.

One of the simplest ways is to look at the co-occurrence matrix. A co-occurrence matrix tells us how often a particular pair of words occurs together: each entry is the count for one pair of words.

For example, consider a corpus: “I play cricket, I love cricket and I love football”. The co-occurrence matrix for the corpus looks like this:

*(Figure: co-occurrence matrix for the example corpus.)*

Now, we can easily compute the probabilities of a pair of words. Just to keep it simple, let’s focus on the word “cricket”:

![image](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/03/Screenshot-from-2020-03-14-13-27-54.png)

p(cricket | play) = 1

p(cricket | love) = 0.5

Next, let’s compute the ratio of probabilities:

p(cricket | play) / p(cricket | love) = 2

Because the ratio is greater than 1, we can infer that “play” is more relevant to “cricket” than “love” is. If the ratio were close to 1, both words would be about equally relevant to “cricket”.
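To make the arithmetic concrete, here is a minimal sketch that reproduces the numbers above. It assumes that two words "co-occur" when the second immediately follows the first, and it normalises each row of counts; the helper name `cond_prob` is made up for this illustration and the original article may count pairs differently.

```python
from collections import Counter, defaultdict

corpus = "I play cricket, I love cricket and I love football"
tokens = corpus.replace(",", "").split()

# Assumption: a pair co-occurs when the second word directly follows the first.
cooc = defaultdict(Counter)
for left, right in zip(tokens, tokens[1:]):
    cooc[left][right] += 1

def cond_prob(word, context):
    """p(word | context): share of context's co-occurrences that involve word."""
    total = sum(cooc[context].values())
    return cooc[context][word] / total if total else 0.0

p_play = cond_prob("cricket", "play")
p_love = cond_prob("cricket", "love")
ratio = p_play / p_love
print(p_play, p_love, ratio)  # 1.0 0.5 2.0
```

"play" is always followed by "cricket", while "love" is followed by "cricket" only half the time, which is where the ratio of 2 comes from.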

We are able to derive the relationship between the words using simple statistics. This is the idea behind the GloVe pretrained word embedding.

GloVe learns to encode the information of the probability ratio in the form of word vectors.
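
For reference, this is the weighted least-squares objective from the original GloVe paper (Pennington et al., 2014). The symbols are not used elsewhere in this notebook: $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

Because each dot product is fitted to a log count, the difference between two word vectors approximates the log of a ratio of co-occurrence probabilities, which is how the ratio ends up encoded in the vectors.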



For more details, please refer to this article: https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/

In [None]:


from keras.layers import Embedding
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer


import re

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip')

In [None]:
# Clean comments: none of the six toxicity labels is set
rslt_df = df[(df['toxic'] == 0) & (df['severe_toxic'] == 0) & (df['obscene'] == 0) & (df['threat'] == 0) & (df['insult'] == 0) & (df['identity_hate'] == 0)]
# Comments labelled 'toxic' only, with every other label unset
rslt_df2 = df[(df['toxic'] == 1) & (df['severe_toxic'] == 0) & (df['obscene'] == 0) & (df['threat'] == 0) & (df['insult'] == 0) & (df['identity_hate'] == 0)]
# Keep the first 23,000 clean and the first 900 toxic-only comments, then stack them
new1 = rslt_df[['id', 'comment_text', 'toxic']].iloc[:23000].copy()
new2 = rslt_df2[['id', 'comment_text', 'toxic']].iloc[:900].copy()
new = pd.concat([new1, new2], ignore_index=True)
new.head()

### Preprocessing
Now we will preprocess the data by tokenizing each comment and removing the stopwords.
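
As a quick illustration of what stopword removal does, here is a toy sketch. The stopword list and the helper `remove_stopwords` are made up for this example; the notebook itself uses NLTK's English list and the Treebank tokenizer.

```python
# Hypothetical mini stopword list, for illustration only.
toy_stopwords = {"the", "is", "a", "of"}

def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopword_set]

tokens = "the plot of the movie is great".split()
print(remove_stopwords(tokens, toy_stopwords))  # ['plot', 'movie', 'great']
```

Using a `set` keeps each membership test cheap, which matters once this runs over tens of thousands of comments.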

In [None]:
from nltk.corpus import stopwords
my_stopwords = stopwords.words('english')

In [None]:
import nltk
tk=nltk.tokenize.TreebankWordTokenizer()
comment_tokens = [tk.tokenize(sent) for sent in new['comment_text']]

In [None]:
type(comment_tokens)

In [None]:
comment_tokens[0]

In [None]:
len(comment_tokens)

In [None]:
from nltk.corpus import stopwords

# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_set = set(stopwords.words('english'))
comment_tokens = [[w for w in tokens if w not in stop_set] for tokens in comment_tokens]

In [None]:
#glove embeddings
from numpy import array
from numpy import asarray
from numpy import zeros

embeddings_dictionary = dict()

glove_file = open('/kaggle/input/nlpword2vecembeddingspretrained/glove.6B.100d.txt', encoding = "utf8")

In [None]:
for line in glove_file:
    records = line.split()
    word = records[0]
    vector_dimensions = asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vector_dimensions
glove_file.close()    

In [None]:
print(word)

In [None]:
print(records)

In [None]:
print(vector_dimensions)

In [None]:
print(embeddings_dictionary['hello'])

In [None]:
vocab = embeddings_dictionary.keys()

In [None]:
len(vocab)

In [None]:
# Let's find the 10 words that are closest to 'compute' by cosine similarity
u = embeddings_dictionary['compute']
norm_u = np.linalg.norm(u)
similarity = []

for word in embeddings_dictionary.keys():
    v = embeddings_dictionary[word]
    cosine = np.dot(u, v)/norm_u/np.linalg.norm(v)
    similarity.append((word, cosine))
print(len(similarity))

In [None]:
sorted(similarity, key=lambda x: x[1], reverse=True)[:10]

In [None]:
# Now let's do some vector algebra.
# First we subtract the vector for `france` from the vector for `paris`; the result can be
# imagined as a vector pointing from a country to its capital. Then we add the vector for
# `nepal` and check whether the result points to Nepal's capital.
output = embeddings_dictionary['paris'] - embeddings_dictionary['france'] + embeddings_dictionary['nepal']
norm_out = np.linalg.norm(output)

In [None]:
similarity = []
for word in embeddings_dictionary.keys():
    v = embeddings_dictionary[word]
    cosine = np.dot(output, v)/norm_out/np.linalg.norm(v)
    similarity.append((word, cosine))
    
print(len(similarity))

sorted(similarity, key=lambda x: x[1], reverse=True)[:7]    
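
The similarity loops above can also be written as one matrix operation. The sketch below is only an illustration: `most_similar` is a hypothetical helper, and the 3-d vectors are invented so that the analogy works exactly; they are not real GloVe values. The helper assumes no vector is all zeros.

```python
import numpy as np

def most_similar(query, embeddings, top_n=3, exclude=()):
    """Return the top_n (word, cosine similarity) pairs for a query vector."""
    words = [w for w in embeddings if w not in exclude]
    matrix = np.stack([embeddings[w] for w in words])      # shape (n_words, dim)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = matrix @ query / norms                          # cosine for every word at once
    best = np.argsort(-sims)[:top_n]                       # indices of the largest cosines
    return [(words[i], float(sims[i])) for i in best]

# Toy vectors, made up for illustration: capital = country + [0, 0, 1].
toy = {
    "france": np.array([1.0, 0.0, 0.0]),
    "paris": np.array([1.0, 0.0, 1.0]),
    "nepal": np.array([0.0, 1.0, 0.0]),
    "kathmandu": np.array([0.0, 1.0, 1.0]),
}
query = toy["paris"] - toy["france"] + toy["nepal"]
print(most_similar(query, toy, top_n=1, exclude=("paris", "france", "nepal")))
```

Excluding the three input words is a common convention for analogy queries, because the inputs themselves tend to rank near the top otherwise.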

In [None]:
documents = []
for x in comment_tokens:
    document = [word for word in x if word in vocab]
    documents.append(document)
# Each document now contains only the words that are present in the GloVe vocabulary
documents[1:5]   

In [None]:
documents[0]

In [None]:
len(documents)

In [None]:
#checking if there is any empty list inside documents
counter = 0
for i in range (0,len(documents)):
    if documents[i] == []:
        counter += 1
print(counter)

> The count printed by the cell above is the number of documents that became empty because none of their words are in the pretrained model's vocabulary. We will represent those documents with all-zero vectors.
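
The averaging with a zero-vector fallback can be sketched as a small helper. The name `document_vector` and the 2-d toy embeddings are assumptions made for this illustration; the notebook works with 100-d GloVe vectors.

```python
import numpy as np

def document_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-d embeddings, made up for illustration.
toy_embeddings = {"good": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}

print(document_vector(["good", "movie", "zzz"], toy_embeddings, dim=2))  # [0.5 0.5]
print(document_vector(["zzz"], toy_embeddings, dim=2))                   # [0. 0.]
```

Every document ends up with a vector of the same length, which is what the classifiers further down require.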

In [None]:
#document embeddings
list_v=[]
for i in range (0,len(documents)):
    if documents[i] == []:
        list_v.append(np.zeros(100,))
    else:
        vec = []
        for j in documents[i]:
            v = embeddings_dictionary[j]
            vec.append(v)
        list_v.append(np.mean(vec, axis=0))

In [None]:
len(documents[-1])

> So there are 20 words in the last document of the `documents` list.

In [None]:
len(list_v[0])

## SMOTE
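
SMOTE (Synthetic Minority Over-sampling Technique) balances the classes by creating new minority-class points on the line between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step; `smote_like_sample` is a hypothetical helper written for illustration, not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and a near neighbour."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))                     # pick a random minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]             # skip position 0, the sample itself
    j = rng.choice(neighbours)                          # pick one of its k nearest neighbours
    lam = rng.random()                                  # interpolation factor in [0, 1)
    return minority[i] + lam * (minority[j] - minority[i])

# Toy minority class in 2-d, made up for illustration.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote_like_sample(points, k=2, rng=np.random.default_rng(0)))
```

The synthetic point always lies on a segment joining two real minority samples, so it stays inside the region the minority class already occupies.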

In [None]:
from collections import Counter
from imblearn.over_sampling import SMOTE

print('Original dataset shape before SMOTE %s' % Counter(new['toxic']))
# Oversample the minority (toxic) class with synthetic samples
oversample = SMOTE()
X, y = oversample.fit_resample(list_v, new['toxic'])
print('Resampled dataset shape after SMOTE %s' % Counter(y))

In [None]:
#test-train split
from sklearn.model_selection import train_test_split
Xw_train, Xw_test, yw_train, yw_test = train_test_split(X,y, test_size=0.3, random_state=42)

## LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression(max_iter=1000)
clf.fit(Xw_train,yw_train)

In [None]:
predicted_res=clf.predict(Xw_test)
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(yw_test,predicted_res)
accuracy

In [None]:
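import numpy as np

# 95% normal-approximation confidence interval, computed from the measured accuracy
z = 1.96
interval = z * np.sqrt(accuracy * (1 - accuracy) / yw_test.shape[0])
accuracy - interval, accuracy + interval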

> 95% confidence interval for the accuracy (%): [80.22, 81.48]
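
The interval comes from the normal approximation to the binomial: accuracy ± z · sqrt(accuracy · (1 − accuracy) / n), with z = 1.96 for 95% confidence and n the number of test samples. A minimal sketch, where the helper name and the example values (90% accuracy on 100 samples) are made up for illustration:

```python
import math

def accuracy_confidence_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# Made-up example: 90% accuracy on 100 test samples.
low, high = accuracy_confidence_interval(0.90, 100)
print(f"{low:.4f} to {high:.4f}")  # 0.8412 to 0.9588
```

The half-width shrinks with the square root of n, so a test set four times larger gives an interval half as wide.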


## RANDOM FOREST

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf3 = RandomForestClassifier()  # default hyperparameters; tune as needed

# 5-fold cross-validation on the training split
scores = cross_val_score(clf3, Xw_train, yw_train, cv=5)

In [None]:
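# Fit on the training split and evaluate on the held-out test split
y_p3 = clf3.fit(Xw_train, yw_train).predict(Xw_test)
accuracy = accuracy_score(yw_test, y_p3)
print('Accuracy: %f' % accuracy)

# 95% normal-approximation confidence interval, computed from the measured accuracy
z = 1.96
interval = z * np.sqrt(accuracy * (1 - accuracy) / yw_test.shape[0])
accuracy - interval, accuracy + interval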

> 95% confidence interval for the accuracy (%): [97.05, 97.67]