In [None]:
#@title Run this to setup the environment and load the data
import re
import gdown
import seaborn as sns
import pandas as pd
import numpy as np
from torchtext.vocab import GloVe
from sklearn.model_selection import train_test_split
gdown.download('https://drive.google.com/uc?id=1umFXM7SvdBvTlHW0r0CXDcxNqL73jU8Z', 'disaster_data.csv', True)
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import matplotlib
import matplotlib.pyplot as plt
import requests, io, zipfile
# Download class resources...
r = requests.get("https://www.dropbox.com/s/2pj07qip0ei09xt/inspirit_fake_news_resources.zip?dl=1")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

## Instructor-Led Discussion: Modeling the Meaning of Tweets using Word Vectors

A shortcoming of our bag-of-words approach is that it only looks at the counts of words in each tweet. What if we had some way of capturing what those words actually mean?
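To make this shortcoming concrete, here is a minimal sketch. The two example tweets and the helper names `bag_of_words` and `overlap` are made up for illustration and are not part of the dataset or the earlier notebook code. Both tweets ask for roughly the same thing, but because they share no words, their bag-of-words representations have nothing in common.

```python
from collections import Counter

def bag_of_words(text):
    # Count how often each lowercase word appears in the text.
    return Counter(text.lower().split())

def overlap(counts_a, counts_b):
    # Dot product of the two count vectors: only shared words contribute.
    return sum(counts_a[w] * counts_b[w] for w in counts_a if w in counts_b)

# Two invented tweets with similar meaning but no words in common.
tweet_a = "we need clean water"
tweet_b = "people require drinking supplies"

bow_a = bag_of_words(tweet_a)
bow_b = bag_of_words(tweet_b)

print(overlap(bow_a, bow_b))  # no shared words
print(overlap(bow_a, bow_a))  # a tweet compared with itself
```

A count-based model sees these two tweets as completely unrelated, which is the gap that word vectors are meant to close.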

The idea of computationally extracting meaning from words is central to word vectors, which have become a cornerstone of modern deep learning on text. Word vectors are a mapping from words to vectors such that words that have similar meaning have similar word vectors.

For example, the words "good" and "great" have similar word vectors, and the words "good" and "planet" have different word vectors. Thus, word vectors provide us a way to account for the meanings of words with our machine learning models.

We will look at GloVe embeddings in this section.

###Load the Data

In [None]:
# Load the data.
disaster_tweets = pd.read_csv('disaster_data.csv',encoding ="ISO-8859-1")

In [None]:
disaster_tweets.head()

###Extract the tweets and the respective labels

In [None]:
#Read the tweet data and convert it to lowercase
tweets = disaster_tweets['text'].str.lower() 
tweets = tweets.apply(lambda x: re.sub(r'[^a-zA-Z0-9]+', ' ',x))

In [None]:
#Extract the labels from the csv
tweet_labels = disaster_tweets['category']

###Split the data into train and test set

In [None]:
#Split the Data into Training and Testing
X_train, X_test, y_train, y_test = train_test_split(tweets, tweet_labels, test_size=0.2, random_state=1,stratify = tweet_labels)

###Load the GloVe Embeddings

In [None]:
VEC_SIZE = 300
glove = GloVe(name='6B', dim=VEC_SIZE)

# Returns the word vector for a word if it exists, else returns None.
def get_word_vector(word):
    try:
        return glove.vectors[glove.stoi[word.lower()]].numpy()
    except KeyError:
        return None

We've included a handy helper function above, `get_word_vector`, which retrieves the word vector for a word.

##Exercise

Let's retrieve the word vector for "good" using the above get_word_vector function (~30 seconds).

In [None]:
### YOUR CODE HERE ###
good_vector = get_word_vector('good')
### END CODE HERE ###

print('Shape of good vector:', good_vector.shape)
print(good_vector)

There's not much to see here: each word vector is a list of 300 numbers, and it's hard to interpret them just by looking. Remember that the important property of word vectors is that words with similar meanings have similar word vectors. The magic happens when we compare word vectors.

Below, we have set up a demo that compares the word vectors for two words using a metric known as cosine similarity. Intuitively, cosine similarity measures the extent to which two vectors point in the same direction. It equals the cosine of the angle between the two vectors, so it ranges from -1 to 1: -1 means the vectors point in opposite directions, 0 means they are perpendicular, and 1 means they point in the same direction.
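The three cases can be checked on hand-picked 2-D vectors before moving to real word vectors. This is only a sketch: the vectors and the helper name `cosine_between` are invented for illustration and are not word vectors.

```python
import numpy as np

def cosine_between(vec1, vec2):
    # Dot product divided by the product of the two vector lengths.
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

right = np.array([1.0, 0.0])
also_right = np.array([3.0, 0.0])  # same direction as `right`, but longer
up = np.array([0.0, 2.0])          # perpendicular to `right`
left = np.array([-1.0, 0.0])       # opposite direction to `right`

print(cosine_between(right, also_right))  # 1.0
print(cosine_between(right, up))          # 0.0
print(cosine_between(right, left))        # -1.0
```

Note that `also_right` is three times as long as `right` yet still scores 1.0: cosine similarity depends only on direction, not on length.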



##Instructor-Led Discussion: Comparing Word Similarities

Try running the cell below to compare the vectors for "good" and "great", and then try other words, like "planet". What do you notice that's expected and unexpected? Play around for a couple of minutes, then discuss as a class.

Note that the demo runs automatically when you change either word1 or word2.

In [None]:
#@title Word Similarity { run: "auto", display-mode: "both" }

def cosine_similarity(vec1, vec2):    
  return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

word1 = "mumbai" #@param {type:"string"}
word2 = "delhi" #@param {type:"string"}

print('Word 1:', word1)
print('Word 2:', word2)

def cosine_similarity_of_words(word1, word2):
  vec1 = get_word_vector(word1)
  vec2 = get_word_vector(word2)
  
  if vec1 is None:
    print(word1, 'is not a valid word. Try another.')
  if vec2 is None:
    print(word2, 'is not a valid word. Try another.')
  if vec1 is None or vec2 is None:
    return None
  
  return cosine_similarity(vec1, vec2)
  

print('\nCosine similarity:', cosine_similarity_of_words(word1, word2))

We can see that word embeddings appear to capture the meaning of different words: when two words are similar, the cosine similarity score is higher, and when two words are dissimilar, the score is lower.

Word vectors are created by going over a large body of text (the vectors you are using were trained in part on Wikipedia) and noticing which words tend to occur near each other. If word A tends to co-occur with the same kinds of words as word B, then the word vectors for A and B are mathematically constrained to be similar. If you want to learn more about an algorithm for training word vectors, see this [helpful introduction to word2vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa).
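The "noticing which words occur near each other" step can be sketched as plain co-occurrence counting. This is not the GloVe training procedure itself, only the kind of statistic such methods start from. The toy corpus, the window size, and the helper names `cooccurrence_counts` and `neighbors` are assumptions made for this example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    # Count each (word, neighbor) pair that appears within `window`
    # positions of each other inside the same sentence.
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, words[j])] += 1
    return counts

# A tiny invented corpus; real word vectors are trained on far more text.
toy_corpus = [
    "the food was good",
    "the food was great",
    "the planet orbits the sun",
]
counts = cooccurrence_counts(toy_corpus)

def neighbors(word):
    # The set of words seen near `word` anywhere in the corpus.
    return {ctx for (w, ctx) in counts if w == word}

print(neighbors("good"))
print(neighbors("great"))
print(neighbors("planet"))
```

In this toy corpus "good" and "great" have identical neighbor sets, while "planet" shares none of them. A training algorithm turns that pattern into similar vectors for the first pair and a different vector for the third.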

Given word vectors that represent the meaning of words, what can we do with them? We could add word vectors to our feature vector, but which ones do we choose? It turns out that a solid approach is simply to average the word vectors for all the words in the description (here, each tweet). Averaging gives a natural way to build vectors for sentences and other collections of words, and it is the approach we will use.
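As a warm-up, here is a minimal sketch of averaging on a made-up 3-dimensional vocabulary. The `toy_vectors` table and the `average_vector` helper are hypothetical stand-ins for GloVe and are not part of the notebook's code. Words without a vector are simply skipped.

```python
import numpy as np

# Hypothetical 3-dimensional "word vectors", invented for this example.
toy_vectors = {
    "need": np.array([1.0, 0.0, 0.0]),
    "water": np.array([0.0, 1.0, 0.0]),
    "now": np.array([0.0, 0.0, 1.0]),
}

def average_vector(text, vectors, dim=3):
    # Collect vectors for the words we know; unknown words are skipped.
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        # No known words: fall back to the all-zeros vector.
        return np.zeros(dim)
    return np.mean(found, axis=0)

print(average_vector("need water", toy_vectors))
print(average_vector("need water asap", toy_vectors))  # "asap" is unknown
```

Both calls give the same result, because the unknown word contributes nothing to the sum or to the count we divide by.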

## Exercise 

We want to write a function that takes a list of descriptions and turns it into an array containing the average GloVe vector for each description. **Understand the code below and then increment found_words and add vec to X[i].**

In [None]:
def glove_transform_data_descriptions(descriptions):
    X = np.zeros((len(descriptions), VEC_SIZE))
    for i, description in enumerate(descriptions):
        found_words = 0.0
        description = description.strip()
        for word in description.split(): 
            vec = get_word_vector(word)
            if vec is not None:
                ### YOUR CODE HERE ###
                # Increment found_words and add vec to X[i].
                found_words += 1
                X[i] += vec
                
                ### END CODE HERE ###
        # We divide the sum by the number of words added, so we have the
        # average word vector.
        if found_words > 0:
            X[i] /= found_words
            
    return X
  
glove_train_X = glove_transform_data_descriptions(X_train)
glove_train_y = [label for label in y_train]

glove_test_X = glove_transform_data_descriptions(X_test)
glove_test_y = [label for label in y_test]

## Exercise 

Then, we can evaluate our approach as we have in the past. As before, fill in the code for fitting and evaluation (~8 minutes).

In [None]:
model = LogisticRegression()
### YOUR CODE HERE ###
model.fit(glove_train_X, glove_train_y)

glove_train_y_pred = model.predict(glove_train_X)
print('Train accuracy', accuracy_score(glove_train_y, glove_train_y_pred))

glove_test_y_pred = model.predict(glove_test_X)
print('Test accuracy', accuracy_score(glove_test_y, glove_test_y_pred))

print('Confusion matrix:')
print(confusion_matrix(glove_test_y, glove_test_y_pred))

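# Each entry of prf is an array with one value per class, in sorted label order.
# Index 1 below reports the second class only, not an average over classes.
prf = precision_recall_fscore_support(glove_test_y, glove_test_y_pred)

print('Precision:', prf[0][1])
print('Recall:', prf[1][1])
print('F-Score:', prf[2][1])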
### END CODE HERE ###

###Exercise(Discussion): Why do you think the accuracy didn't change much even though we introduced word embeddings as our features?

In [None]:
print(classification_report(y_test,glove_test_y_pred, target_names=['Energy', 'Food', 'Medical', 'None', 'Water']))

In [None]:
#@title Helper Function-Confusion Matrix
def plot_confusion_matrix(y_true,y_predicted):
  '''Plots the confusion matrix as a heatmap.'''
  cm = metrics.confusion_matrix(y_true, y_predicted)
  print ("Plotting the Confusion Matrix")
  labels = ['Energy', 'Food', 'Medical', 'None', 'Water']
  df_cm = pd.DataFrame(cm,index =labels,columns = labels)
  fig = plt.figure()
  res = sns.heatmap(df_cm, annot=True,cmap='Blues', fmt='g')
  plt.yticks([0.5,1.5,2.5,3.5,4.5], labels,va='center')
  plt.title('Confusion Matrix - TestData')
  plt.ylabel('True label')
  plt.xlabel('Predicted label')
  plt.show()
  plt.close()

 


In [None]:
plot_confusion_matrix(y_test,glove_test_y_pred)

###Evaluate

*Let's see how our classifier did! We trained our classifier on 80% of the dataset and tested it on the remaining 20%. This is called a train-test split and is the usual way to evaluate models. The cells below list the test tweets that the classifier got wrong.*

In [None]:
#@title Get the list of incorrect tweets
pd.set_option('max_colwidth', 500)
incorrect_tweets = []
incorrect_y_test = []
incorrect_y_pred = []
for (t,x,y) in zip(X_test,y_test,glove_test_y_pred):
  if x != y:
    incorrect_tweets.append(t)
    incorrect_y_test.append(x)
    incorrect_y_pred.append(y)

In [None]:
table=pd.DataFrame([incorrect_tweets,incorrect_y_pred,incorrect_y_test]).transpose()
table.columns = ['Tweet', 'Predicted Category', 'True Category']

In [None]:
table

###Exercise(Discussion): Can you figure out why some of these tweets were incorrectly classified?

###Visualizing Word Vectors with t-SNE

In this section, we train word embeddings on the tweets themselves and plot them in two dimensions to see which words appear in similar contexts.

In [None]:
#@title Helper Function to Visualize the Embeddings
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE

import re
import matplotlib.pyplot as plt

def clean(text):
    """Remove posting header, split by sentences and words, keep only letters"""
    # Raw strings keep backslash sequences such as \s and \d intact for the regex engine.
    lines = re.split(r'[?!.:]\s', re.sub(r'^.*Lines: \d+', '', re.sub('\n', ' ', text)))
    return [re.sub('[^a-zA-Z]', ' ', line).lower().split() for line in lines]

sentences = [line for text in tweets for line in clean(text)]

# min_count drops words that occur fewer than 30 times in the corpus.
# Note: in gensim 4.x the `size` argument is named `vector_size`.
model = Word2Vec(sentences, workers=4, size=100, min_count=30, window=10, sample=1e-3)


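def tsne_plot(model):
    "Creates a t-SNE model and plots it"
    labels = []
    tokens = []

    # Note: in gensim 4.x, iterate over model.wv.key_to_index instead of model.wv.vocab.
    for word in model.wv.vocab:
        # Look the vector up through model.wv; indexing the model directly is deprecated.
        tokens.append(model.wv[word])
        labels.append(word)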
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(16, 16)) 
    for i in range(len(x)):
        plt.scatter(x[i],y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()


In [None]:
tsne_plot(model)

###Exercise(Discussion): Do you notice that similar words are placed close by?

#Finish!