![](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTigQWzoYCNiDyrz1BN4WTf2X2k9OZ_yvW-FsmcIMsdS9fppNmh)


## <a id="competition-description"></a><span style="color:#328c4f">Competition Description</span>

<p style="font-size:19px">
    Twitter has become an important communication channel in times of emergency.
    The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.<br>
    But it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:<br>
    
</p>
<img src="https://storage.googleapis.com/kaggle-media/competitions/tweet_screenshot.png" alt="drawing" width="250"/><br>

<p style="font-size:19px">
    The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.
    <br>
In this competition, we have to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. The dataset has 10,000 tweets that were hand-classified.
    <br>
<b>Disclaimer</b> : The dataset for this competition contains text that may be considered profane, vulgar, or offensive.
    
</p>

## <a id="data-description"></a><span style="color:#328c4f">Data Description</span>

<p style="font-size:22px; color:#04661e">Data Format</p>
<p style="font-size:19px">Each sample in the train and test set has the following information:</p>

- <span style="font-size:19px">The text of a tweet </span>
- <span style="font-size:19px">A keyword from that tweet (although this may be blank!)</span>
- <span style="font-size:19px">The location the tweet was sent from (may also be blank)</span>

<p style="font-size:19px">You are predicting whether a given tweet is about a real disaster or not. If so, predict a <b>1</b>. If not, predict a <b>0</b>.</p>



<p style="font-size:22px; color:#04661e">Columns</p>
<ol>
  <li style="font-size:19px">id: a unique identifier for each tweet</li>
  <li style="font-size:19px">text: the text of the tweet</li>
  <li style="font-size:19px">location: the location the tweet was sent from (may be blank)</li>
  <li style="font-size:19px">keyword: a particular keyword from the tweet (may be blank)</li>
  <li style="font-size:19px">target: in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)</li>
</ol>


## <a id="import-library"></a><span style="color:#328c4f">Importing Libraries</span>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from collections import defaultdict
from collections import  Counter
plt.style.use('ggplot')
stop=set(stopwords.words('english'))
import re
from nltk.tokenize import word_tokenize
import gensim
import string
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
from keras.models import Sequential
from keras.layers import Embedding,LSTM,Dense,SpatialDropout1D
from keras.initializers import Constant
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
import os
import plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.offline as offline
import plotly.graph_objs as go
from sklearn.decomposition import PCA
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import pickle
from wordcloud import WordCloud
offline.init_notebook_mode(connected = True)

## <a id="data-load"></a><span style="color:#328c4f">Loading data and Basic Idea </span>

In [None]:
tweet= pd.read_csv('../input/nlp-getting-started/train.csv')
test=pd.read_csv('../input/nlp-getting-started/test.csv')
tweet.head(3)

In [None]:
print('There are {} rows and {} columns in train'.format(tweet.shape[0],tweet.shape[1]))
print('There are {} rows and {} columns in test'.format(test.shape[0],test.shape[1]))

## <a id="data-load"></a><span style="color:#328c4f">Class distribution</span>

<p style="font-size:19px">Before we begin with anything else, let's check the class distribution. There are only two classes, 0 and 1.</p>

In [None]:
x=tweet.target.value_counts().sort_index() #class frequencies as a Series, ordered 0 then 1
x_df = x.to_frame(name='target') #fix the column name; the default differs across pandas versions
x_df.index = ['No-Disaster', 'Disaster']
x_df

In [None]:
fig = px.bar(x_df, y = "target", color=x_df.index)
fig.show()

<p style="font-size:19px">There are more tweets with class 0 (no disaster) than with class 1 (disaster).</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Missing Value Analysis</span>

<p style="font-size:19px">Here, we analyse the number of missing values in the training tweet dataset</p>

In [None]:

null_value_df = tweet.isnull().sum(axis = 0).to_frame()
null_value_df.columns = ['Number of Null Values']
null_value_df['Percentage Null Values'] = null_value_df['Number of Null Values']*100/tweet.shape[0]
print(null_value_df)


In [None]:
ax = null_value_df.plot.bar( y='Percentage Null Values', rot=0)

<p style="font-size:19px">So, a large share of the location values is missing (more than 33%).</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Location Analysis of Tweets</span>

<p style="font-size:19px">Here, we analyse the location column of training tweets data</p>

In [None]:
location_df = tweet[['location']].dropna() ##remove rows with null values (returns a new frame, avoiding SettingWithCopyWarning)
location_df[:5]

<p style="font-size:19px">We create a wordcloud of the locations</p>

In [None]:
# Join all locations into one space-separated string
location_str = ' '.join(location_df['location'])

# Generate a word cloud image
wordcloud = WordCloud().generate(location_str)

fig, ax = plt.subplots(figsize = (10,10))
ax.imshow(wordcloud, interpolation="bilinear")
plt.show()
    

<p style="font-size:19px">A lot of the locations are from North America.</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Keyword Analysis of Tweets</span>

<p style="font-size:19px">Here, we analyse the keyword column of training tweets data</p>

In [None]:
keyword_df = tweet[['keyword']].dropna() ##remove rows with null values (returns a new frame, avoiding SettingWithCopyWarning)
keyword_df[:5]

<p style="font-size:19px">We create a wordcloud of the keywords</p>

In [None]:
# Join all keywords into one space-separated string
keyword_str = ' '.join(keyword_df['keyword'])

# Generate a word cloud image
wordcloud = WordCloud().generate(keyword_str)

fig, ax = plt.subplots(figsize = (10,10))
ax.imshow(wordcloud, interpolation="bilinear")
plt.show()

<p style="font-size:19px">The wordcloud for keywords shows that words like fatalities, armageddon, collide and siren appear quite a few times.</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Character Analysis of Tweets</span>

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
tweet_len=tweet[tweet['target']==1]['text'].str.len()
ax1.hist(tweet_len,color='cyan')
ax1.set_title('Disaster tweets')
ax1.set_xlabel('No. of Char./Tweet')
ax1.set_ylabel('No. of Tweets')
tweet_len=tweet[tweet['target']==0]['text'].str.len()
ax2.hist(tweet_len,color='magenta')
ax2.set_title('Not Disaster tweets')
ax2.set_xlabel('No. of Char./Tweet')
ax2.set_ylabel('No. of Tweets')
fig.suptitle('Characters in tweets')
plt.show()

<p style="font-size:19px">The two distributions look almost the same. Tweets of 120 to 140 characters are the most common in both classes.</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Basic Word Analysis of Tweets</span>

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
tweet_len=tweet[tweet['target']==1]['text'].str.split().map(lambda x: len(x))
ax1.hist(tweet_len,color='blue')
ax1.set_title('Disaster tweets')
ax1.set_xlabel('No. of Words/Tweet')
ax1.set_ylabel('No. of Tweets')
tweet_len=tweet[tweet['target']==0]['text'].str.split().map(lambda x: len(x))
ax2.hist(tweet_len,color='red')
ax2.set_title('Not Disaster tweets')
ax2.set_xlabel('No. of Words/Tweet')
ax2.set_ylabel('No. of Tweets')
fig.suptitle('Words in a tweet')
plt.show()


## <a id="character-analysis"></a><span style="color:#328c4f">Average word length in a tweet</span>

<p style="font-size:22px; color:#04661e">Kernel density estimation</p>
<p style="font-size:19px">A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate.</p>
<p style="font-size:22px; color:#04661e">Rug Plot</p>
<p style="font-size:19px">A rug plot shows marginal distributions by drawing ticks along the x and y axes. It is intended to complement other plots by showing the location of individual observations in an unobtrusive way.</p>
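
<p style="font-size:19px">To make the KDE idea concrete, here is a minimal sketch that builds a Gaussian kernel density estimate by hand. The sample values and the fixed bandwidth of 0.5 are assumptions chosen for illustration; <i>seaborn</i> picks its own bandwidth, so its curves will not match this one exactly.</p>

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump centred on each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical average word lengths for five tweets (illustrative values only).
avg_word_lengths = [4.2, 4.8, 5.1, 5.3, 6.0]
density = gaussian_kde(avg_word_lengths, bandwidth=0.5)

# Evaluate the estimate on a coarse grid.
for k in range(9):
    x = 3.0 + 0.5 * k
    print(f"{x:.1f} -> {density(x):.3f}")
```

<p style="font-size:19px">A smaller bandwidth gives a spikier curve that follows individual observations, while a larger one smooths them together.</p>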

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
word=tweet[tweet['target']==1]['text'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax1,color='red', rug=True)
ax1.set_title('Disaster')
word=tweet[tweet['target']==0]['text'].str.split().apply(lambda x : [len(i) for i in x])
sns.distplot(word.map(lambda x: np.mean(x)),ax=ax2,color='green', rug=True)
ax2.set_title('Not Disaster')
fig.suptitle('Average Word Length in each Tweet')

## <a id="character-analysis"></a><span style="color:#328c4f">Common Stopwords in Tweets</span>

<p style="font-size:19px">We count the frequencies of stop words for both classes of tweets. The list of stop words is loaded from the <i>nltk</i> library.</p>
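
<p style="font-size:19px">As a small illustration of the counting step, the sketch below tallies stop words in a single made-up sentence. The stop-word set and the sentence are assumptions for illustration only and are far smaller than the <i>nltk</i> list.</p>

```python
from collections import Counter

# Hypothetical miniature stop-word list (the nltk list is much longer).
toy_stop = {"the", "a", "in", "is"}

# Hypothetical tweet text, split on whitespace as in the notebook.
toy_tokens = "the fire is in the forest near a town".split()

# Count only the tokens that are stop words.
stop_counts = Counter(w for w in toy_tokens if w in toy_stop)
print(stop_counts.most_common(2))
```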


In [None]:
def create_corpus(target): #create a list of words from tweets under a class
    corpus=[]
    
    for x in tweet[tweet['target']==target]['text'].str.split():
        for i in x:
            corpus.append(i)
    return corpus

In [None]:
corpus_no_disaster=create_corpus(0) #create a corpus for No-Disaster tweets
corpus_disaster = create_corpus(1) #create a corpus for Disaster tweets

#loop to create a dict. with stopwords and their frequencies
dic_no_disaster=defaultdict(int)
for word in corpus_no_disaster:
    if word in stop:
        dic_no_disaster[word]+=1

dic_disaster=defaultdict(int)
for word in corpus_disaster:
    if word in stop:
        dic_disaster[word]+=1

        
#take the top 10 stop words
top_no_disaster=sorted(dic_no_disaster.items(), key=lambda x:x[1],reverse=True)[:10] 
top_disaster = sorted(dic_disaster.items(), key=lambda x:x[1],reverse=True)[:10] 
#plot the frequencies in bar plots
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
x_no_disaster,y_no_disaster=zip(*top_no_disaster)
x_disaster, y_disaster = zip(*top_disaster)
ax1.bar(x_no_disaster,y_no_disaster, color='green')
ax1.set_xlabel('No-Disaster')
#plt.show()
ax2.bar(x_disaster, y_disaster, color = 'blue')
ax2.set_xlabel('Disaster')
plt.show()

<p style="font-size:19px">In both classes, "the" dominates, followed by "a" in class 0 and "in" in class 1.</p>


## <a id="character-analysis"></a><span style="color:#328c4f">Analyzing punctuations</span>

In [None]:
import string
special = set(string.punctuation) #use a set so only single punctuation characters match, not substrings

#loop to create a dict. with punctuation marks and their frequencies
dic_no_disaster=defaultdict(int)
for word in corpus_no_disaster:
    if word in special:
        dic_no_disaster[word]+=1

dic_disaster=defaultdict(int)
for word in corpus_disaster:
    if word in special:
        dic_disaster[word]+=1
        
#take the top 10 punctuations
top_no_disaster=sorted(dic_no_disaster.items(), key=lambda x:x[1],reverse=True)[:10] 
top_disaster = sorted(dic_disaster.items(), key=lambda x:x[1],reverse=True)[:10] 

x_no_disaster,y_no_disaster=zip(*top_no_disaster)
x_disaster, y_disaster = zip(*top_disaster)

#plot the frequencies in bar plots
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))

ax1.bar(x_no_disaster,y_no_disaster, color='green')
ax1.set_xlabel('No-Disaster')
#plt.show()
ax2.bar(x_disaster, y_disaster, color = 'blue')
ax2.set_xlabel('Disaster')
plt.show()

<p style="font-size:19px">In both classes, the punctuation distributions are somewhat similar.</p>


## <a id="character-analysis"></a><span style="color:#328c4f">Common Words</span>

In [None]:
counter=Counter(corpus_no_disaster)
most=counter.most_common() #count the frequencies of most common words
x=[]
y=[]
for word,count in most[:40]:
    if (word not in stop) :
        x.append(word)
        y.append(count)


In [None]:
fig, ax = plt.subplots(figsize = (10,5))
sns.barplot(x=y,y=x,ax = ax)
ax.set_xlabel('Frequency')
ax.set_ylabel('Words')
ax.set_title('Frequency of Most Common Words')

<p style="font-size:19px">The data requires lots of cleaning.</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Ngram analysis</span>


<p style="font-size:19px">We will do a bigram <b>(n=2)</b> analysis over the tweets. Let's check the most common bigrams in tweets.</p>
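
<p style="font-size:19px">For intuition, a bigram is simply a pair of adjacent tokens. The sketch below counts bigrams in two made-up tweets using plain whitespace splitting. The helper name <i>bigrams</i> and the toy tweets are assumptions for illustration; <i>CountVectorizer</i> applies its own tokenization, so its counts on real data can differ.</p>

```python
from collections import Counter

def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical toy tweets, not taken from the dataset.
toy_tweets = [
    "forest fire near the town",
    "forest fire spreads fast",
]

counts = Counter()
for text in toy_tweets:
    counts.update(bigrams(text.lower().split()))

print(counts.most_common(1))
```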

In [None]:
def get_top_tweet_bigrams(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

In [None]:
top_tweet_bigrams=get_top_tweet_bigrams(tweet['text'], n=10) #top 10 bigrams
x,y=map(list,zip(*top_tweet_bigrams))
fig,ax = plt.subplots(figsize=(10,5)) #single figure; the extra plt.figure() call produced an empty plot
sns.barplot(x=y,y=x, ax = ax)
ax.set_ylabel('Bi-grams')
ax.set_xlabel('Frequency')
ax.set_title('Bi-gram Frequency Analysis')

<p style="font-size:19px">Given the prevalence of uninformative bi-grams, a lot of data cleaning is required.</p>

## <a id="character-analysis"></a><span style="color:#328c4f">Data Cleaning</span>

<p style="font-size:19px">Tweets have to be cleaned before working with them. So we do some basic cleaning such as removing URLs, HTML tags, emojis and punctuation, and correcting spelling.</p>
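
<p style="font-size:19px">The order of the cleaning steps matters: URLs should be removed before punctuation, otherwise the "://" is stripped and the URL pattern no longer matches. The sketch below chains three of the steps on one made-up string. The function name <i>clean</i>, the sample text and the final whitespace tidy-up are assumptions added for illustration.</p>

```python
import re
import string

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean(text):
    """Strip URLs, then HTML tags, then punctuation, and tidy whitespace."""
    text = URL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    text = text.translate(PUNCT_TABLE)
    # Collapsing repeated spaces is an extra step, not part of the notebook.
    return " ".join(text.split())

# Hypothetical sample text with a tag, a URL, punctuation and a hashtag.
sample = "Forest <b>fire</b> near La Ronge! More: https://example.com/news #wildfire"
print(clean(sample))
```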

In [None]:
df=pd.concat([tweet,test])
df.shape

### <a id="character-analysis"></a><span style="color:#328c4f">Removing Urls</span>

In [None]:
def remove_URL(text): #remove urls from text 
    url = re.compile(r'https?://\S+|www\.\S+') #regular expression for detecting urls
    return url.sub(r'',text)

In [None]:
df['text']=df['text'].apply(lambda x : remove_URL(x))

### <a id="character-analysis"></a><span style="color:#328c4f">Removing HTML tags</span>

In [None]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

In [None]:
df['text']=df['text'].apply(lambda x : remove_html(x))


### <a id="character-analysis"></a><span style="color:#328c4f">Removing Emojis</span>

In [None]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [None]:
df['text']=df['text'].apply(lambda x: remove_emoji(x))


### <a id="character-analysis"></a><span style="color:#328c4f">Removing punctuations</span>

In [None]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

In [None]:
df['text']=df['text'].apply(lambda x : remove_punct(x))

### <a id="character-analysis"></a><span style="color:#328c4f">Spelling Correction</span>

<p style="font-size:19px">Even if I'm not good at spelling, I can correct it with Python :) I will use <i>pyspellchecker</i> to do that.</p>


In [None]:
!pip install pyspellchecker

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
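            #fall back to the original word if no correction is returned (None would break the join below)
            corrected_text.append(spell.correction(word) or word)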
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

In [None]:
#df['text']=df['text'].apply(lambda x : correct_spellings(x)) #takes a long time to execute

## <a id="character-analysis"></a><span style="color:#328c4f">GloVe for Vectorization of Tweet Corpus</span>

<p style="font-size:19px">We use pretrained GloVe word vectors to represent our words. They are available here in 3 varieties: 50-, 100- and 200-dimensional. We will try the 100-dimensional vectors. Using these GloVe vectors, we are going to represent the corpus of tweets in vector format.</p>


In [None]:
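def create_corpus(df): #tokenize tweets, keeping lowercased alphabetic words that are not stop words
    corpus=[]
    for text in tqdm(df['text']):
        #lowercase before the stop-word check so capitalized stop words such as "The" are removed too
        words=[word.lower() for word in word_tokenize(text) if word.isalpha() and word.lower() not in stop]
        corpus.append(words)
    return corpus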

<p style="font-size:19px">Let's create a corpus from the input tweets by tokenizing the words from tweets. Stop words are not included.</p>


In [None]:
corpus=create_corpus(df)

<p style="font-size:19px">Let's load the word vectors from a pretrained GloVe file</p>


In [None]:
embedding_dict={} #load glove vectors in a dictionary
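with open('../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt','r',encoding='utf-8') as f:
    for line in tqdm(f):
        values=line.split()
        word=values[0] #first field is the word, the rest are its vector components
        vectors=np.asarray(values[1:],'float32')
        embedding_dict[word]=vectors
#no explicit f.close() is needed: the with block closes the file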

<p style="font-size:19px">Let's create a dictionary of words available in the corpus with each word assigned a unique index</p>
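
<p style="font-size:19px">The sketch below mimics this step in plain Python on a two-tweet toy corpus: each word gets an integer index (most frequent first, starting at 1), and each tweet becomes a fixed-length list padded with 0 on the right. The helper names <i>build_word_index</i> and <i>to_padded</i> and the toy corpus are assumptions for illustration; the notebook itself relies on the Keras <i>Tokenizer</i> and <i>pad_sequences</i>.</p>

```python
from collections import Counter

def build_word_index(corpus):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for tokens in corpus for word in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded(tokens, word_index, max_len):
    """Encode tokens as indices, then truncate or right-pad with 0."""
    seq = [word_index[w] for w in tokens if w in word_index][:max_len]
    return seq + [0] * (max_len - len(seq))

# Hypothetical tokenized tweets, not taken from the dataset.
toy_corpus = [["forest", "fire", "near", "town"], ["fire", "spreads"]]
toy_word_index = build_word_index(toy_corpus)
toy_padded = [to_padded(tokens, toy_word_index, max_len=5) for tokens in toy_corpus]

print(toy_word_index)
print(toy_padded)
```

<p style="font-size:19px">Index 0 is never assigned to a word, which is why it can safely serve as the padding value.</p>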

In [None]:
#Create arrays of token indices for tweets
MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences=tokenizer_obj.texts_to_sequences(corpus)
tweet_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')
word_index=tokenizer_obj.word_index #<word, word_index>
print('Number of unique words:',len(word_index))

<p style="font-size:19px">Extract GloVe vectors for the words available in the corpus</p>
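
<p style="font-size:19px">Not every word in the tweets has a pretrained vector, and such words end up as all-zero rows. The sketch below shows the lookup on toy data and reports how many words were covered. The 3-dimensional vectors, the toy index and the word "zzzunknown" are assumptions made up for illustration.</p>

```python
# Hypothetical 3-dimensional "pretrained" vectors standing in for GloVe.
toy_vectors = {
    "fire": [0.9, 0.1, 0.0],
    "forest": [0.2, 0.8, 0.1],
}
# Hypothetical word index; "zzzunknown" has no pretrained vector.
toy_index = {"fire": 1, "forest": 2, "zzzunknown": 3}
dim = 3

# Row 0 stays zero because index 0 is reserved for padding.
matrix = [[0.0] * dim for _ in range(len(toy_index) + 1)]
missing = []
for word, i in toy_index.items():
    vec = toy_vectors.get(word)
    if vec is None:
        missing.append(word)  # out-of-vocabulary words keep a zero row
    else:
        matrix[i] = list(vec)

coverage = 1 - len(missing) / len(toy_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

<p style="font-size:19px">Checking coverage like this is a quick way to judge whether more text cleaning would help before training.</p>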

In [None]:
num_words=len(word_index)+1 #+1 because index 0 is reserved for padding
embedding_matrix=np.zeros((num_words,100)) #rows stay zero for words without a GloVe vector

for word,i in tqdm(word_index.items()):
    if i >= num_words: #guard against indices outside the matrix
        continue
    
    emb_vec=embedding_dict.get(word)
    if emb_vec is not None:
        embedding_matrix[i]=emb_vec
            

In [None]:
word_index.keys()

## <a id="character-analysis"></a><span style="color:#328c4f">GloVe Visualization Using PCA</span>

In [None]:
filename = '../working/glove2word2vec_model.sav' #defined outside the if so the next cell can always use it
if not os.path.exists(filename):
    print('Glove Model loading and saving....')
    glove_model = KeyedVectors.load_word2vec_format('../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt', binary=False, no_header=True)
    with open(filename, 'wb') as f:
        pickle.dump(glove_model, f)

In [None]:
glove_model = pickle.load(open(filename, 'rb'))

In [None]:
def append_list(sim_words, words):
    
    list_of_words = []
    
    for i in range(len(sim_words)):
        
        sim_words_list = list(sim_words[i])
        sim_words_list.append(words)
        sim_words_tuple = tuple(sim_words_list)
        list_of_words.append(sim_words_tuple)
        
    return list_of_words

<p style="font-size:19px">Let's choose 3 random words from the dataset vocabulary and observe the most similar words</p>
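
<p style="font-size:19px">Similarity between word vectors is measured here with cosine similarity, which is what <i>most_similar</i> ranks by. The sketch below computes it by hand on made-up 3-dimensional vectors. The toy vectors and the local helper functions are assumptions for illustration and are unrelated to the real GloVe values.</p>

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to the query word."""
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors (not real GloVe values).
toy = {
    "iceberg": [0.9, 0.1, 0.2],
    "glacier": [0.8, 0.2, 0.3],
    "smog": [0.1, 0.9, 0.1],
    "shelter": [0.1, 0.2, 0.9],
}

for word, score in most_similar("iceberg", toy):
    print(f"{word}: {score:.3f}")
```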

In [None]:
input_word = 'pollution,iceberg,sanctuary'
user_input = [x.strip() for x in input_word.split(',')]
result_word = []
for words in user_input:
    sim_words = glove_model.most_similar(words, topn = 5)
    sim_words = append_list(sim_words, words)
    #print(sim_words)
    result_word.extend(sim_words)

#print(result_word)
similar_word = [word[0] for word in result_word]
similarity = [word[1] for word in result_word] 
similar_word.extend(user_input)
labels = [word[2] for word in result_word]
label_dict = dict([(y,x+1) for x,y in enumerate(set(labels))])
color_map = [label_dict[x] for x in labels]    


In [None]:


def display_pca_scatterplot_2D(model, user_input=None, words=None, label=None, color_map=None, topn=5, sample=10):

    if words is None:
        vocab = list(model.key_to_index) # gensim 4 replaced KeyedVectors.vocab with key_to_index
        if sample > 0:
            words = np.random.choice(vocab, sample)
        else:
            words = vocab
    
    word_vectors = np.array([model[w] for w in words])
    
    # Keep the first two principal components for the 2D plot
    two_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:2]

    data = []
    count = 0
    
    for i in range (len(user_input)):

                trace = go.Scatter(
                    x = two_dim[count:count+topn,0], 
                    y = two_dim[count:count+topn,1],  
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
                
                # Collect the trace and move on to the next block of topn similar words
            
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter(
                    x = two_dim[count:,0], 
                    y = two_dim[count:,1],  
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

    # Add the trace that holds the input words themselves
            
    data.append(trace_input)
    
# Configure the layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    


In [None]:


def display_pca_scatterplot_3D(model, user_input=None, words=None, label=None, color_map=None, topn=5, sample=10):

    if words is None:
        vocab = list(model.key_to_index) # gensim 4 replaced KeyedVectors.vocab with key_to_index
        if sample > 0:
            words = np.random.choice(vocab, sample)
        else:
            words = vocab
    
    word_vectors = np.array([model[w] for w in words])
    
    three_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:3]
    # For 2D, change the three_dim variable into something like two_dim like the following:
    # two_dim = PCA(random_state=0).fit_transform(word_vectors)[:,:2]

    data = []
    count = 0
    
    for i in range (len(user_input)):

                trace = go.Scatter3d(
                    x = three_dim[count:count+topn,0], 
                    y = three_dim[count:count+topn,1],  
                    z = three_dim[count:count+topn,2],
                    text = words[count:count+topn],
                    name = user_input[i],
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 0.8,
                        'color': 2
                    }
       
                )
                
                # For 2D, instead of using go.Scatter3d, we need to use go.Scatter and delete the z variable. Also, instead of using
                # variable three_dim, use the variable that we have declared earlier (e.g two_dim)
            
                data.append(trace)
                count = count+topn

    trace_input = go.Scatter3d(
                    x = three_dim[count:,0], 
                    y = three_dim[count:,1],  
                    z = three_dim[count:,2],
                    text = words[count:],
                    name = 'input words',
                    textposition = "top center",
                    textfont_size = 20,
                    mode = 'markers+text',
                    marker = {
                        'size': 10,
                        'opacity': 1,
                        'color': 'black'
                    }
                    )

    # For 2D, instead of using go.Scatter3d, we need to use go.Scatter and delete the z variable.  Also, instead of using
    # variable three_dim, use the variable that we have declared earlier (e.g two_dim)
            
    data.append(trace_input)
    
# Configure the layout

    layout = go.Layout(
        margin = {'l': 0, 'r': 0, 'b': 0, 't': 0},
        showlegend=True,
        legend=dict(
        x=1,
        y=0.5,
        font=dict(
            family="Courier New",
            size=25,
            color="black"
        )),
        font = dict(
            family = " Courier New ",
            size = 15),
        autosize = False,
        width = 1000,
        height = 1000
        )


    plot_figure = go.Figure(data = data, layout = layout)
    plot_figure.show()
    


In [None]:
display_pca_scatterplot_2D(glove_model, user_input, similar_word, labels, color_map)

In [None]:
display_pca_scatterplot_3D(glove_model, user_input, similar_word, labels, color_map)

<p style="font-size:19px">So for the 3 words chosen, we can visualize the words closest in meaning in the axis system. Iceberg is related to Titanic because the Titanic sank after striking an iceberg. Pollution is related to smog, emission, etc. Sanctuary is similarly related to caves, refuges, shelters, etc.</p>

<p style="font-size:19px">PCA can be a great tool for visualizing the relationships between related words in 2D or 3D, as can be observed from the above example.</p>
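
<p style="font-size:19px">For readers curious about what the projection does, here is a minimal sketch of PCA via the singular value decomposition in <i>numpy</i>. The random 10-dimensional vectors and the helper name <i>pca_project</i> are assumptions for illustration. The result can differ from <i>sklearn</i>'s <i>PCA</i> in the sign of each axis, which does not change the relative layout of the points.</p>

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical stand-in for word vectors: 6 "words" in 10 dimensions.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(6, 10))

coords = pca_project(toy_vectors, n_components=2)
print(coords.shape)
```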

*This notebook is for research/surveying, learning, experimenting, and reproducing existing literature found online*

**Reference:**
- https://www.kaggle.com/code/shahules/basic-eda-cleaning-and-glove