## Query Suggester

### Problem Statement: 

One of the first hurdles we discovered for lay users is not knowing what to query when searching Twitter. Often, especially in times of crisis, one types something general like, “fire info, near me”, but without the right hashtags, the query yields scattered results. Maybe the search results have info about the fire, maybe it’s not up to date, or maybe the disaster is so new there’s not info activity to show up in the top of searches. To tackle this issue we created a query suggester. The suggester takes in a query, pulls in tweets based on the specified query, cleans the data, then runs the hashtags through Countvectorizer, pulling out the top 50 most common hashtags. From there, the suggester would use Word2Vec’s “most similar” tool to find the most similar hashtags. Those suggested tweets would then be available to requery with. 

### This note book creates a query suggester using Word2Vec

We trained the Word2Vec on past fires and semi-informed queries. 

#### Importing Libraries

In [1]:
    #General
import pandas as pd
import numpy as np


    #Plotting
import matplotlib.pyplot as plt
import seaborn as sns

    #nltk and regex packages
import regex as re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

    #Word Vectors
import gensim
import gensim.downloader as api 
from gensim.models.word2vec import Word2Vec



    #Sklearn Packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import stop_words, text

%config InlineBackend.figure_format = 'retina'

  import pandas.util.testing as tm


#### Importing Cleaned Data

In [2]:
df = pd.read_csv("./Datasets/cleaned.csv")
df.head(2)

Unnamed: 0.1,Unnamed: 0,username,text,query,hashtags,is_road_closed
0,0,EPCF #GreenNewDeal #SunriseMovement,our hearts go out to californians affected by ...,saddleridgefire,"['wildfires', 'ClimateChange', 'ClimateCrisis'...",0
1,1,Jason Singson,"from the #saddleridgefire to the #kincadefire,...",saddleridgefire,"['SaddleridgeFire', 'KincadeFire']",0


#### Selecting Specific Columns to Shorten the DataFrame and saving as `text_df`

In [3]:
text_df = df[['text', 'query', 'hashtags']]
text_df.head()

Unnamed: 0,text,query,hashtags
0,our hearts go out to californians affected by ...,saddleridgefire,"['wildfires', 'ClimateChange', 'ClimateCrisis'..."
1,"from the #saddleridgefire to the #kincadefire,...",saddleridgefire,"['SaddleridgeFire', 'KincadeFire']"
2,our #saddleridgefire leaped a 12 lane major fr...,saddleridgefire,['SaddleridgeFire']
3,"good morning, sam!\nall is well here. however,...",saddleridgefire,['SaddleridgeFire']
4,was your property or home damaged by the #sadd...,saddleridgefire,['SaddleRidgeFire']


#### Confirming for nulls values/that it loaded correctly

In [4]:
text_df.isnull().sum()

text        0
query       0
hashtags    0
dtype: int64

#### Viewing the hashtags column

In [5]:
text_df['hashtags']

0       ['wildfires', 'ClimateChange', 'ClimateCrisis'...
1                      ['SaddleridgeFire', 'KincadeFire']
2                                     ['SaddleridgeFire']
3                                     ['SaddleridgeFire']
4                                     ['SaddleRidgeFire']
                              ...                        
9070                                                   []
9071                                                   []
9072                                                   []
9073                                                   []
9074                                                   []
Name: hashtags, Length: 9075, dtype: object

### Defining a tokenizing function

In [6]:
tokenizer = RegexpTokenizer(r'\s+', gaps=True)

def tokenizing_function(col):

    string = ''
    
    try:
    
        for post in col:
            string += ' ' + post
            
    except:
        pass
    
    all_tokens = tokenizer.tokenize(string)
    
    return all_tokens

#### Calling the tokenizing function on the hashtags column

In [7]:
hash_tokens = tokenizing_function(df['hashtags'])

### Creating a Corpus to Prep for Word2Vec

In [8]:
corpus = api.load('text8')

In [9]:
len(next(iter(corpus)))

10000

#### Prepping and Cleaning `text` Column For Word2Vec

In [10]:
# Converting 'text' column as strings, just as a precaution
text_df['text'] = text_df['text'].astype(str)

## Removing line breaks
text_df['text'] = text_df['text'].map(lambda x: re.sub('\/\/', ' ', x)) 
## Removing Apostrophes
text_df['text'] = text_df['text'].map(lambda x: re.sub('[\\][\']', '', x))
## Removing URLs
text_df['text'] = text_df['text'].map(lambda x: re.sub('http[s]?:\/\/[^\s]*', ' ', x))
## Removing # hashtag
text_df['text'] = text_df['text'].map(lambda x: x.replace('#',''))
# Only keep letters
text_df['text'] = text_df['text'].map(lambda x: re.sub('[^a-zA-Z]', ' ', x))

# replace double space with single space:
text_df['text'] = text_df['text'].map(lambda x: x.replace('  ',' '))

# change everything to lowercase":
text_df['text'] = text_df['text'].str.lower()

# text_df['text'] = text_df['text'].map(lambda x: x.lower())

#Split on ' '
text_df['text'] = text_df['text'].map(lambda x: x.split(' '))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/p

#### Viewing what a text would look like:

In [11]:
text_df['text'][0]

['our',
 'hearts',
 'go',
 'out',
 'to',
 'californians',
 'affected',
 'by',
 'wildfires',
 'climatechange',
 'is',
 'real',
 '',
 'scientists',
 'say',
 'the',
 'climatecrisis',
 'is',
 'fueling',
 'wildfires',
 'nationwide',
 'minesfire',
 'nytimes',
 'tickfire',
 'kincadefire',
 'saddleridgefire',
 'rawsonfire',
 'skyfire',
 'sawdayfire',
 'millerfire',
 'palisadesfirepic',
 'twitter',
 'com',
 'q',
 'nqaqpaec']

#### Declaring CVEC Function:

In [12]:
cvec = CountVectorizer(stop_words= 'english', min_df=5)

def cvec_function(df_col):
    cvec_matrix = cvec.fit_transform(df_col)

    #Convert to DataFrame
    cvec_df = pd.DataFrame(cvec_matrix.toarray(),
                          columns=cvec.get_feature_names())
    return cvec_df

#### Calling cvec function on `hashtags` column

In [13]:
hash_cvec_df = cvec_function(df['hashtags'])
hash_cvec_df

Unnamed: 0,101fwy,105fwy,10fwy,110fwy,118fwy,134fwy,14freeway,14fwy,1582,15fwy,...,westhollywood,westla,wevapewevote,wildfire,wildfirepic,wildfires,wolffire,woodlandhills,woolseyfire,yeswx
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9070,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9071,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9072,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9073,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Pulling out the 50 most common words

In [14]:
top_50_stats = hash_cvec_df.sum().sort_values(ascending=False).head(50)
top_50 = top_50_stats.index
top_50_stats

saddleridgefire        1390
tickfire                777
knxtraffic              377
latraffic               285
gettyfire               218
kincadefire             189
californiafires         112
easyfire                111
losangeles              101
rt                       96
sigalert                 94
knxtrafficpic            90
california               90
wildfires                81
5fwy                     76
santaclarita             72
sylmar                   71
lacofd                   67
palisadesfire            64
porterranch              63
ca25                     59
wildfire                 58
405fwy                   58
101fwy                   56
lafd                     53
firefighters             52
californiawildfires      50
pulsepointconnected      49
60fwy                    42
605fwy                   42
saddleridge              42
110fwy                   41
sandalwoodfire           39
kincaidfire              39
oakfire                  38
oldwaterfire        

#### Loading the corpus with the text column

In [15]:
corpus = []

for pos in range(len(text_df)):
    non_blank_text = []
    for word in text_df.loc[pos, 'text']:
        if word != '':
            non_blank_text.append(word)            
    corpus.append(non_blank_text)

#### Creating the word2vec model

In [16]:
model = Word2Vec(corpus,      # Corpus of data.
                 size=100,    # How many dimensions do you want in your word vector?
                 window=5,    # How many "context words" do you want?
                 min_count=1, # Ignores words below this threshold.
                 sg=0,        # SG = 1 uses SkipGram, SG = 0 uses CBOW (default).
                 workers=4)   # Number of "worker threads" to use (parallelizes process).

# Do what you'd like to do with your data!
print("done!")

done!


### Prepping and Cleaning Hashtag Column for Word2Vec

In [17]:
text_df['hashtags'][0].strip('[]').replace("'",'').replace(",",'')
text_df['hashtags'] = text_df['hashtags'].apply(lambda x: x.lower())
text_df['hashtags'] = text_df['hashtags'].apply(lambda x: x.replace(',', ''))
text_df['hashtags'] = text_df['hashtags'].apply(lambda x: x.replace("'", ''))
text_df['hashtags'] = text_df['hashtags'].apply(lambda x: x.replace("[", '').replace("]", ''))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_

#### Getting Simliar for top 50

In [18]:
top_50_similiar = []
for hashtag in top_50:
    try:
        print(model.most_similar(hashtag))
        top_50_similiar.append(hashtag)
    except:
        print(f'{hashtag} is not in vocabulary')

[('gettyfire', 0.9879131317138672), ('easyfire', 0.9834583401679993), ('construction', 0.9760892987251282), ('hillsidefire', 0.9756529331207275), ('per', 0.975062370300293), ('hillfire', 0.9748534560203552), ('simivalley', 0.9746370911598206), ('extinguished', 0.9738354086875916), ('behavior', 0.9735012650489807), ('saddlerridgefire', 0.9730632305145264)]
[('canyoncountry', 0.9847087860107422), ('porterranch', 0.9834420680999756), ('brushfire', 0.9781323075294495), ('scene', 0.9776748418807983), ('moonset', 0.9757790565490723), ('chatsworth', 0.9757435321807861), ('taboosefire', 0.974877119064331), ('granadahills', 0.9735186100006104), ('sawdayfire', 0.9733231067657471), ('caltranshq', 0.9730457663536072)]
[('latraffic', 0.9874911904335022), ('caltransdist', 0.9817007780075073), ('fondoknxtraffic', 0.9604529142379761), ('knx', 0.9499005079269409), ('octraffic', 0.9476281404495239), ('sepulvedapass', 0.9465002417564392), ('scottburtknxpic', 0.9442294836044312), ('tsjhgq', 0.921592593193

  after removing the cwd from sys.path.


In [19]:
for hashtag in top_50_similiar:
    print(f'{hashtag}')

saddleridgefire
tickfire
knxtraffic
latraffic
gettyfire
kincadefire
californiafires
easyfire
losangeles
rt
sigalert
knxtrafficpic
california
wildfires
santaclarita
sylmar
lacofd
palisadesfire
porterranch
wildfire
lafd
firefighters
californiawildfires
pulsepointconnected
saddleridge
sandalwoodfire
kincaidfire
oakfire
oldwaterfire
breaking
brushfire
cawx
calfire
la
granadahills
caltrans
socal
sawdayfire
simivalley
caplesfire
castaic
beworkzonealert


In [20]:

for hashtag in top_50_similiar:
    print(f'Most Similar words: {hashtag}')
    most_similar = model.most_similar(hashtag, topn = 5)
    most_similar = [tuples[0] for tuples in most_similar]
    print(most_similar)
    print('*************************************')
    print(' ')

Most Similar words: saddleridgefire
['gettyfire', 'easyfire', 'construction', 'hillsidefire', 'per']
*************************************
 
Most Similar words: tickfire
['canyoncountry', 'porterranch', 'brushfire', 'scene', 'moonset']
*************************************
 
Most Similar words: knxtraffic
['latraffic', 'caltransdist', 'fondoknxtraffic', 'knx', 'octraffic']
*************************************
 
Most Similar words: latraffic
['caltransdist', 'fondoknxtraffic', 'knxtraffic', 'knx', 'octraffic']
*************************************
 
Most Similar words: gettyfire
['easyfire', 'construction', 'hillsidefire', 'palisades', 'per']
*************************************
 
Most Similar words: kincadefire
['rawsonfire', 'millerfire', 'losangelesfire', 'mt', 'cawildfires']
*************************************
 
Most Similar words: californiafires
['muirfire', 'repkatiehill', 'fireseason', 'fbclid', 'pgeshutoff']
*************************************
 
Most Similar words: easyfi

  This is separate from the ipykernel package so we can avoid doing imports until


## This is different then above- Examine

In [21]:
for words in top_50:
    try:
        print("word:", words, model.most_similar(words))
        print("-------")
    except:
        pass

word: saddleridgefire [('gettyfire', 0.9879131317138672), ('easyfire', 0.9834583401679993), ('construction', 0.9760892987251282), ('hillsidefire', 0.9756529331207275), ('per', 0.975062370300293), ('hillfire', 0.9748534560203552), ('simivalley', 0.9746370911598206), ('extinguished', 0.9738354086875916), ('behavior', 0.9735012650489807), ('saddlerridgefire', 0.9730632305145264)]
-------
word: tickfire [('canyoncountry', 0.9847087860107422), ('porterranch', 0.9834420680999756), ('brushfire', 0.9781323075294495), ('scene', 0.9776748418807983), ('moonset', 0.9757790565490723), ('chatsworth', 0.9757435321807861), ('taboosefire', 0.974877119064331), ('granadahills', 0.9735186100006104), ('sawdayfire', 0.9733231067657471), ('caltranshq', 0.9730457663536072)]
-------
word: knxtraffic [('latraffic', 0.9874911904335022), ('caltransdist', 0.9817007780075073), ('fondoknxtraffic', 0.9604529142379761), ('knx', 0.9499005079269409), ('octraffic', 0.9476281404495239), ('sepulvedapass', 0.946500241756439

  This is separate from the ipykernel package so we can avoid doing imports until


word: kincadefire [('rawsonfire', 0.9935792684555054), ('millerfire', 0.9925931692123413), ('losangelesfire', 0.9920511245727539), ('mt', 0.9911202192306519), ('cawildfires', 0.9907109141349792), ('laweather', 0.9904063940048218), ('californiahttps', 0.9895683526992798), ('oldwaterfire', 0.9892898201942444), ('burrisfire', 0.9891973733901978), ('en', 0.9891489744186401)]
-------
word: californiafires [('muirfire', 0.9982755184173584), ('repkatiehill', 0.9980692267417908), ('fireseason', 0.9979773759841919), ('fbclid', 0.9978474378585815), ('pgeshutoff', 0.997686505317688), ('kincadefires', 0.9972562193870544), ('george', 0.9971100091934204), ('impeach', 0.9970910549163818), ('iwar', 0.9970659017562866), ('entertainment', 0.9969546794891357)]
-------
word: easyfire [('gettyfire', 0.9972451329231262), ('construction', 0.995810866355896), ('hillsidefire', 0.9958098530769348), ('per', 0.9956652522087097), ('multi', 0.9952682256698608), ('special', 0.9948781728744507), ('sanfernandovalley',

word: brushfire [('granadahills', 0.9928061962127686), ('porterranch', 0.9923161268234253), ('pacific', 0.9913698434829712), ('scene', 0.9912640452384949), ('ranch', 0.990086555480957), ('miller', 0.9898791313171387), ('senior', 0.988982617855072), ('porter', 0.9886443614959717), ('aviso', 0.9882814884185791), ('coordinating', 0.9879388213157654)]
-------
word: cawx [('santarosa', 0.9981982707977295), ('kincaidfire', 0.9981865286827087), ('usagov', 0.9979373216629028), ('briceburgfire', 0.9978317618370056), ('mu', 0.9975957870483398), ('oldwaterfire', 0.997540295124054), ('es', 0.9973127245903015), ('teamtrump', 0.9970971941947937), ('para', 0.9970541596412659), ('php', 0.996938169002533)]
-------
word: calfire [('lapdhq', 0.9957572221755981), ('pio', 0.9948968291282654), ('hamelkcrwhttps', 0.9938516616821289), ('pf', 0.993004560470581), ('lasdhq', 0.9927737712860107), ('mt', 0.9923200607299805), ('skyfire', 0.9920032024383545), ('scvsheriff', 0.9918946027755737), ('chpsouthern', 0.991

### Creating a new variable `all_hashtags` that combines the hashtag column and the text column as a string
to to compare if it gets better results than above

#### Creating the `all_hashtag` variable, which also takes the unique values (set) of the `all_hashtag` variable

In [22]:
all_hashtags = ''
for text in text_df['hashtags']:
    all_hashtags += text
    
all_hashtags = list(set(all_hashtags.split(' ')))

#### Getting the most similiar word to the `all_hashtags` variable:

In [23]:
similiar_all_hashtags = []
for hashtag in all_hashtags:
    try:
        print(model.most_similar(hashtag))
        similiar_all_hashtags.append(hashtag)
    except:
        print(f'{hashtag} is not in vocabulary')

[('briceburgfire', 0.9940375089645386), ('para', 0.9940099716186523), ('es', 0.9940029978752136), ('entertainment', 0.9938743710517883), ('te', 0.9932639598846436), ('santarosa', 0.9931746125221252), ('tower', 0.9931173920631409), ('kincaidfire', 0.9929680228233337), ('mu', 0.9923255443572998), ('php', 0.9922270774841309)]
knxtrafficpicbarbietraffic is not in vocabulary
[('cbsla', 0.9988309741020203), ('nbcla', 0.9974314570426941), ('kfiam', 0.9965604543685913), ('avpressnews', 0.9955005049705505), ('latimes', 0.9954866170883179), ('associatepress', 0.9942792654037476), ('sciencenews', 0.9931564331054688), ('ktla', 0.993129312992096), ('tt', 0.9924414753913879), ('svacorn', 0.9923928380012512)]
[('crashes', 0.9656695127487183), ('complete', 0.9655073881149292), ('congratulations', 0.9650665521621704), ('contemporary', 0.9649884700775146), ('roth', 0.9647583365440369), ('mystery', 0.9647120237350464), ('phos', 0.9646764397621155), ('chefs', 0.9643827080726624), ('deployed', 0.9640847444

  after removing the cwd from sys.path.


[('qb', 0.994373083114624), ('library', 0.9942383766174316), ('lapd', 0.9939656853675842), ('awue', 0.993539571762085), ('lebron', 0.9934566020965576), ('hall', 0.9932618737220764), ('savesfac', 0.9930840730667114), ('director', 0.992987334728241), ('usc', 0.9929561614990234), ('jserra', 0.9928200244903564)]
[('foxandfriends', 0.9920312762260437), ('sfv', 0.9915668964385986), ('omny', 0.991454005241394), ('google', 0.9910644292831421), ('sell', 0.9909424781799316), ('kusinews', 0.9905386567115784), ('whitehouse', 0.9903396368026733), ('housedemocrats', 0.990287184715271), ('football', 0.9902362823486328), ('gopleader', 0.9902141094207764)]
knxtrafficpicgoodmorning is not in vocabulary
[('cost', 0.9809430837631226), ('metrolosangeles', 0.9798597693443298), ('roll', 0.9795887470245361), ('lack', 0.9795877933502197), ('utilities', 0.9795767664909363), ('earthquake', 0.9795348644256592), ('play', 0.9795331358909607), ('vehicles', 0.979435384273529), ('wharton', 0.9793879985809326), ('testi

[('moreno', 0.992591142654419), ('locations', 0.9848480224609375), ('carson', 0.9792683124542236), ('valley', 0.9784637689590454), ('upland', 0.978212833404541), ('riverside', 0.9773004651069641), ('fountain', 0.977085530757904), ('gardena', 0.9753022789955139), ('bound', 0.9750088453292847), ('between', 0.9749372601509094)]
latrafficknxheroes is not in vocabulary
anfsaddleridge is not in vocabulary
spectrumnews1vapefam is not in vocabulary
[('ground', 0.9848113656044006), ('building', 0.9847612380981445), ('gas', 0.9839273691177368), ('lacdod', 0.9838981628417969), ('build', 0.983568549156189), ('feet', 0.9834685921669006), ('suspect', 0.982985258102417), ('white', 0.9829020500183105), ('went', 0.9828536510467529), ('moderate', 0.9828310012817383)]
[('qb', 0.9978084564208984), ('metrolink', 0.9975676536560059), ('pasadenafd', 0.9970225691795349), ('library', 0.996780276298523), ('base', 0.9963300228118896), ('crew', 0.9958920478820801), ('losangelesfiredept', 0.9956490993499756), ('sa

[('lv', 0.9933191537857056), ('investigators', 0.9931309223175049), ('breakingnews', 0.9927127361297607), ('ketla', 0.9927014112472534), ('sponsored', 0.992692232131958), ('heat', 0.9925951361656189), ('yusufdfi', 0.9925252199172974), ('bluegrass', 0.9925222396850586), ('californiafire', 0.9924724102020264), ('attack', 0.9923839569091797)]
starwars101fwy is not in vocabulary
[('suggest', 0.9988004565238953), ('transcript', 0.998731255531311), ('nifc', 0.9979814291000366), ('voters', 0.9979239106178284), ('calif', 0.9973475337028503), ('demonstrated', 0.9968225359916687), ('turkey', 0.9966726899147034), ('ir', 0.9966171979904175), ('cdc', 0.9965900778770447), ('steveknight', 0.9963436722755432)]
[('recuperaci', 0.8662978410720825), ('tehama', 0.8426327109336853), ('fridayfeelingpic', 0.8403034210205078), ('socalhttps', 0.8376435041427612), ('meros', 0.8375427722930908), ('mastergis', 0.8375320434570312), ('tragedia', 0.836212158203125), ('creek', 0.8357051610946655), ('incendiosaddlerid

[('si', 0.9991921186447144), ('iwar', 0.9988893866539001), ('fbclid', 0.9983824491500854), ('crowdstrike', 0.9982644319534302), ('pgeshutoff', 0.9982352256774902), ('features', 0.9982270002365112), ('notifyla', 0.9981107711791992), ('otj', 0.9981027841567993), ('su', 0.9980124235153198), ('cards', 0.9979614019393921)]
[('movie', 0.995604932308197), ('mureaufire', 0.9955145120620728), ('film', 0.9955064058303833), ('fed', 0.9954996109008789), ('actress', 0.9954837560653687), ('annual', 0.9951854348182678), ('lebron', 0.9950739741325378), ('democratic', 0.995019793510437), ('gives', 0.9950193166732788), ('artist', 0.9947686791419983)]
notabotmsabloodmoney is not in vocabulary
spectrumnews1cawx is not in vocabulary
sanfernandovalleysaddleridgefire is not in vocabulary
knxtrafficboycottdhvaniwear is not in vocabulary
2twitter is not in vocabulary
[('canyoncountry', 0.9847087860107422), ('porterranch', 0.9834420680999756), ('brushfire', 0.9781323075294495), ('scene', 0.9776748418807983), ('

[('heed', 0.9725066423416138), ('however', 0.9695598483085632), ('animals', 0.9691077470779419), ('firstresponders', 0.9690996408462524), ('truly', 0.9689680337905884), ('aircraft', 0.9685828685760498), ('true', 0.9682708978652954), ('appreciated', 0.9679309725761414), ('bless', 0.9676486849784851), ('men', 0.9673687219619751)]
[('disasters', 0.9986892938613892), ('unfortunately', 0.9985807538032532), ('color', 0.9985111951828003), ('act', 0.9984928369522095), ('using', 0.9984110593795776), ('himself', 0.998388946056366), ('jury', 0.9983402490615845), ('win', 0.9983288049697876), ('creating', 0.9983206987380981), ('scary', 0.998291015625)]
wildfire2019httpstickfiretickfire is not in vocabulary
cafirehttptickfiretickfire is not in vocabulary
californiamulhollandfire is not in vocabulary
[('lafdvalley', 0.998295783996582), ('vcfd', 0.9953962564468384), ('mayorofla', 0.9953563809394836), ('teamtrump', 0.9949026107788086), ('vickymoorenews', 0.9945147633552551), ('anaheimfire', 0.994038641

[('wendyfire', 0.9973995685577393), ('sepulvedafire', 0.9972833395004272), ('oldfire', 0.9971722960472107), ('oakfire', 0.9969887733459473), ('aqu', 0.9965028762817383), ('gal', 0.995852530002594), ('sylmarfire', 0.9957244992256165), ('sandalwoodfire', 0.9956704378128052), ('electricidad', 0.9956099390983582), ('wolffire', 0.9950792789459229)]
caltrans110fwy is not in vocabulary
lafdsaddleridgefirelalate is not in vocabulary
[('health', 0.9933373928070068), ('sense', 0.9932030439376831), ('customer', 0.9928395748138428), ('vote', 0.9926242828369141), ('photos', 0.9924993515014648), ('license', 0.9924849271774292), ('unhealthy', 0.9923704266548157), ('public', 0.9922953844070435), ('doesnt', 0.9919271469116211), ('money', 0.9919135570526123)]
[('fs', 0.9908061623573303), ('blk', 0.9853502511978149), ('advisory', 0.9833929538726807), ('pedley', 0.9812994599342346), ('tick', 0.9775078296661377), ('brushfire', 0.9763826131820679), ('ridge', 0.9753443002700806), ('jasonryanphoto', 0.9750368

[('wcuq', 0.9374926090240479), ('personas', 0.9322408437728882), ('kitwu', 0.9321703910827637), ('w', 0.9287402629852295), ('perryrsmith', 0.9287045001983643), ('kn', 0.9282410144805908), ('gracias', 0.9281539916992188), ('kh', 0.9280576705932617), ('mu', 0.9280123710632324), ('sharesocal', 0.927899956703186)]
servpropicrt is not in vocabulary
ladflafd is not in vocabulary
[('electrical', 0.9470575451850891), ('cultist', 0.9467405676841736), ('monitor', 0.9456631541252136), ('removing', 0.9442285299301147), ('force', 0.9442091584205627), ('damage', 0.9440405368804932), ('award', 0.9435075521469116), ('christened', 0.9432644844055176), ('players', 0.9431306719779968), ('outages', 0.9428973197937012)]
[('sce', 0.997799813747406), ('fast', 0.995955765247345), ('ups', 0.9955901503562927), ('plastic', 0.9955720901489258), ('next', 0.9953896999359131), ('asap', 0.9952243566513062), ('victims', 0.995205283164978), ('passengers', 0.9949451088905334), ('cut', 0.9947705268859863), ('hit', 0.9947

[('newhall', 0.8942159414291382), ('brookhurst', 0.8941529393196106), ('recreation', 0.8895995616912842), ('accident', 0.8892164826393127), ('lakewood', 0.888742983341217), ('garey', 0.8879929184913635), ('archibald', 0.8877224326133728), ('acton', 0.886806309223175), ('improving', 0.8866100311279297), ('weir', 0.8865662813186646)]
[('staffed', 0.9842566251754761), ('valencia', 0.9836115837097168), ('mysterious', 0.9824459552764893), ('onto', 0.9817318320274353), ('mission', 0.9815242290496826), ('firestorm', 0.9811151027679443), ('distribution', 0.980755090713501), ('ground', 0.980584979057312), ('vista', 0.980523407459259), ('westhollywood', 0.9803122282028198)]
[('porn', 0.9949126839637756), ('impeachmentinquiry', 0.9941181540489197), ('berggrueninst', 0.9938926696777344), ('prince', 0.9937146306037903), ('groups', 0.9936515688896179), ('incidents', 0.9936484694480896), ('clippers', 0.9936107397079468), ('faces', 0.9935867786407471), ('unseat', 0.9935862421989441), ('reporter', 0.99

[('effective', 0.9990866184234619), ('music', 0.9989362955093384), ('book', 0.9987109899520874), ('pbssocal', 0.9986475706100464), ('msabloodmoney', 0.9986293315887451), ('bill', 0.9984647035598755), ('laedc', 0.998418927192688), ('ubikeo', 0.9983993172645569), ('bakery', 0.9983160495758057), ('pressure', 0.9982829689979553)]
dow118fwy is not in vocabulary
[('thousand', 0.9828468561172485), ('upper', 0.9801907539367676), ('avenida', 0.9776046276092529), ('alert', 0.9753087162971497), ('united', 0.9736173748970032), ('gabriel', 0.9726377725601196), ('moorpark', 0.9705907702445984), ('ventura', 0.966240644454956), ('hayvenhurst', 0.9661837816238403), ('bernardino', 0.9658382534980774)]
fakenewsmediahttpslakers is not in vocabulary
41snoopdoggsnoopdogg4snoopdoggsnoopdogggettyfiremariafiresnoopdoggsnoopdogg4snoopdoggsnoopdogggettyfiremariafire1 is not in vocabulary
pleasesharesaddleridgefiresaddleridgefirewildfire is not in vocabulary
[('release', 0.9856276512145996), ('protestors', 0.9816

[('movie', 0.997595489025116), ('reports', 0.9972394704818726), ('issues', 0.9970582723617554), ('son', 0.9969902634620667), ('fed', 0.9969829320907593), ('factor', 0.996788501739502), ('accurate', 0.9966245889663696), ('strong', 0.9965204000473022), ('knight', 0.9964060187339783), ('halloweenie', 0.9963455200195312)]
[('kristenmlago', 0.9777885675430298), ('biased', 0.9769543409347534), ('takes', 0.9756754636764526), ('spreading', 0.9753924012184143), ('sweet', 0.9752455949783325), ('name', 0.9747301936149597), ('crap', 0.9743975400924683), ('scott', 0.97420334815979), ('helping', 0.9731997847557068), ('obviously', 0.9730734825134277)]
tickfiresigalert is not in vocabulary
saddleridgefiresaddleridgefiresaddleridgefiresaddleridgefiresaddleridgefiresaddleridgefiresaddleridgefire is not in vocabulary
californiafirespiccaliforniawildfires is not in vocabulary
wildfiressaddleridgefire is not in vocabulary
weatherabcnews is not in vocabulary
[('because', 0.9921631813049316), ('like', 0.9900

[('must', 0.9967670440673828), ('doesn', 0.996018648147583), ('questions', 0.9956905245780945), ('model', 0.9955968260765076), ('macys', 0.9952566027641296), ('help', 0.9951272010803223), ('hes', 0.994789719581604), ('protect', 0.9945911765098572), ('blame', 0.9945485591888428), ('them', 0.994483232498169)]
605fwy is not in vocabulary
fireupdate is not in vocabulary
[('oldwaterfire', 0.9991651773452759), ('santarosa', 0.9990360736846924), ('briceburgfire', 0.9983630180358887), ('cawx', 0.9981866478919983), ('para', 0.9979686737060547), ('usagov', 0.9977713227272034), ('muirfire', 0.9976954460144043), ('mu', 0.997490406036377), ('es', 0.9972543716430664), ('te', 0.9972189664840698)]
gettyfire2019 is not in vocabulary
lacountyfirefighterspicwednesdaythoughts is not in vocabulary
[('channel', 0.9951410293579102), ('clippers', 0.9945539236068726), ('greenhouse', 0.9945467710494995), ('suggests', 0.9945386648178101), ('porn', 0.9942337274551392), ('captain', 0.9941219091415405), ('prince', 

[('visiting', 0.9959160089492798), ('deserve', 0.9951772093772888), ('officials', 0.9937887191772461), ('instead', 0.9934178590774536), ('hunter', 0.9931067228317261), ('help', 0.9923341274261475), ('macys', 0.9923189282417297), ('insurance', 0.9920095205307007), ('consumers', 0.9920052289962769), ('must', 0.9916644096374512)]
[('hillfire', 0.9351779818534851), ('porterranch', 0.9351500272750854), ('firefight', 0.9339156150817871), ('granadahills', 0.9334271550178528), ('simivalley', 0.9319218397140503), ('littlemountainfire', 0.9310327768325806), ('gettyfire', 0.9296106696128845), ('santaclarita', 0.9295529723167419), ('special', 0.9287687540054321), ('easyfire', 0.9284443855285645)]
sepulvedapass46fireknx1070 is not in vocabulary
[('toward', 0.9953802824020386), ('accident', 0.9953505992889404), ('acton', 0.9953203797340393), ('burbank', 0.9934177398681641), ('encino', 0.9931828379631042), ('diamond', 0.9924647808074951), ('advisory', 0.9914301633834839), ('central', 0.99006372690200

[('sepulvedafire', 0.9976382851600647), ('sylmarfire', 0.9975606203079224), ('incendio', 0.9975404739379883), ('alameda', 0.996717095375061), ('aqu', 0.9964818954467773), ('thehill', 0.996309757232666), ('traditions', 0.9962714314460754), ('por', 0.996084451675415), ('oakfire', 0.9959595799446106), ('tanto', 0.9958005547523499)]
saddleridgefiresaddleridgefiresaddleridgefire is not in vocabulary
nifcsaddleridgefire is not in vocabulary
wildfirepicsaddleridgefiresaddleridgefire is not in vocabulary
risksaddleridgefire is not in vocabulary
[('gasbuddyguy', 0.9982684254646301), ('football', 0.9974462985992432), ('foxandfriends', 0.9973649978637695), ('tuckercarlson', 0.9965222477912903), ('disasterrecovery', 0.9964413046836853), ('si', 0.9964026212692261), ('pa', 0.996377170085907), ('iwar', 0.9963254928588867), ('fbclid', 0.9960930347442627), ('insiders', 0.9960715174674988)]
[('features', 0.9975515007972717), ('kelly', 0.9974432587623596), ('impeach', 0.9972845315933228), ('repkatiehill'

[('electric', 0.9685907363891602), ('wing', 0.9677281975746155), ('congratulations', 0.9677209258079529), ('nights', 0.9675100445747375), ('struggling', 0.9673700332641602), ('requests', 0.9670950174331665), ('large', 0.967018187046051), ('mine', 0.9667826890945435), ('head', 0.966779351234436), ('deployed', 0.9667172431945801)]
comments101fwy is not in vocabulary
[('important', 0.998817503452301), ('together', 0.9985668659210205), ('spend', 0.9982790350914001), ('doesnt', 0.9981469511985779), ('female', 0.9981120824813843), ('market', 0.9980514049530029), ('own', 0.9979767203330994), ('yourself', 0.9979154467582703), ('child', 0.9978809356689453), ('corrupt', 0.9978682398796082)]
[('lafires', 0.9908542633056641), ('pacific', 0.9903488755226135), ('coordinating', 0.9898641109466553), ('lafire', 0.9897902011871338), ('nws', 0.9875128269195557), ('brushfire', 0.9873755574226379), ('jasonryanphoto', 0.9862252473831177), ('aviso', 0.9862058162689209), ('pointe', 0.9861471652984619), ('rapi

[('jacknoyesknbc', 0.9861071705818176), ('desk', 0.9858971834182739), ('guide', 0.9845812320709229), ('cast', 0.9843640327453613), ('wp', 0.9834426045417786), ('knbc', 0.982462465763092), ('infinity', 0.9820433855056763), ('guid', 0.9818084239959717), ('publicsafety', 0.9817184209823608), ('readysetgo', 0.9812262058258057)]
besafepicbeworkzonealerthwy101 is not in vocabulary
climatecrisispicsaddleridgefire is not in vocabulary
wuisaddleridgepulsepointconnectedsaddleridgefiresaddleridgefire is not in vocabulary
douglashodge5fwy is not in vocabulary
caanf14fwy is not in vocabulary
jurupavalleynew is not in vocabulary
californiafireseasyfirediamondbar is not in vocabulary
[('church', 0.996241569519043), ('rapid', 0.9952186346054077), ('friday', 0.9947955012321472), ('sonomafire', 0.9943418502807617), ('warming', 0.994045615196228), ('press', 0.9939972162246704), ('sandcanyon', 0.9936547875404358), ('exhibit', 0.9935072660446167), ('tomorrow', 0.9934905767440796), ('pete', 0.99347758293151

[('active', 0.9857016205787659), ('hillsidefire', 0.98475182056427), ('easyfire', 0.9825868606567383), ('soil', 0.9822996854782104), ('farm', 0.9816824793815613), ('haze', 0.9814388751983643), ('extreme', 0.9812836647033691), ('special', 0.9811846017837524), ('oc', 0.9809830188751221), ('highways', 0.980693519115448)]
[('unfortunately', 0.9978808164596558), ('using', 0.9977971911430359), ('late', 0.9976939558982849), ('cost', 0.9975136518478394), ('title', 0.9974386692047119), ('wevapewevote', 0.9973610043525696), ('whose', 0.997220516204834), ('stage', 0.9971456527709961), ('taxi', 0.9971339702606201), ('roll', 0.9971332550048828)]
cawxcalifornication is not in vocabulary
tickfiretickfiretickfirecaliforniafires is not in vocabulary
[('vs', 0.9940627813339233), ('event', 0.9936144948005676), ('afectadas', 0.9931771755218506), ('coffee', 0.9931156635284424), ('robert', 0.9930127859115601), ('tr', 0.9929772615432739), ('gavinnewsom', 0.9928616881370544), ('daniellegersh', 0.9926650524139

[('rdpic', 0.9879671335220337), ('mu', 0.9868146181106567), ('dayofthedead', 0.9857913255691528), ('pioerikscott', 0.9857395887374878), ('palisadesfire', 0.9854142665863037), ('santarosa', 0.9851201772689819), ('cawx', 0.9842979907989502), ('lafdwest', 0.9842337369918823), ('californiawildfire', 0.984186589717865), ('alisonmartino', 0.9840637445449829)]
breederscup2019teddybear is not in vocabulary
lancasteriveseenit is not in vocabulary
[('venturacounty', 0.9943768978118896), ('simivalley', 0.9942296743392944), ('according', 0.9926544427871704), ('granadahills', 0.9926047325134277), ('oaks', 0.9921631813049316), ('special', 0.9920948147773743), ('dept', 0.9913673400878906), ('littlemountainfire', 0.990945041179657), ('multi', 0.9907240867614746), ('ontario', 0.9906700253486633)]
latrafficbreederscup2019 is not in vocabulary
chatsworthcopernicus is not in vocabulary
superherosfirstrespondersweekhttpscalfirecanadafirefirghter is not in vocabulary
fridayvibespicsaddleridgefire is not in 

[('thepigeonexpress', 0.9639313817024231), ('laist', 0.9615770578384399), ('prevention', 0.9614310264587402), ('sonomacounty', 0.9610564112663269), ('arcgis', 0.9607415199279785), ('nprnews', 0.9587309956550598), ('ocregister', 0.9586838483810425), ('socalregion', 0.9585633277893066), ('roadclosures', 0.9581156969070435), ('media', 0.9576553106307983)]
[('k', 0.9784777760505676), ('againflick', 0.9782280921936035), ('alexdatig', 0.9768868088722229), ('billfoxla', 0.976703405380249), ('teamworkpic', 0.9764537811279297), ('adamcarollashow', 0.9762930870056152), ('jon', 0.9761667251586914), ('y', 0.975981593132019), ('o', 0.9758532047271729), ('ai', 0.975156307220459)]
limesawdayfire is not in vocabulary
[('jnj', 0.7555649876594543), ('tl', 0.7442654371261597), ('michaelrgallas', 0.7411506175994873), ('carpet', 0.7389530539512634), ('sneaker', 0.7374600172042847), ('jgnjejhk', 0.7339864373207092), ('lkh', 0.7329692840576172), ('tz', 0.7316044569015503), ('keaq', 0.7306129932403564), ('iqj

#### Getting the top 50 most similiar for `similiar_all_hash`

In [24]:
requery_list = [] #list of lists

for hashtag in top_50_similiar:
    print(f'Most Similar words: {hashtag}')
    most_similar = model.most_similar(hashtag, topn = 2000)
    most_similar = [tuples[0] for tuples in most_similar if tuples[0] in similiar_all_hashtags][:5]
    print(most_similar)
    print('*************************************')
    print(' ')
    requery_list.append(most_similar)

Most Similar words: saddleridgefire
['gettyfire', 'easyfire', 'construction', 'hillsidefire', 'hillfire']
*************************************
 
Most Similar words: tickfire
['canyoncountry', 'porterranch', 'brushfire', 'moonset', 'chatsworth']
*************************************
 
Most Similar words: knxtraffic
['latraffic', 'sepulvedapass', 'buenapark', 'wordpress', 'boyleheights']
*************************************
 
Most Similar words: latraffic
['knxtraffic', 'buenapark', 'sepulvedapass', 'wordpress', 'em']
*************************************
 
Most Similar words: gettyfire
['easyfire', 'construction', 'hillsidefire', 'sanfernandovalley', 'saddleridgefire']
*************************************
 
Most Similar words: kincadefire


  """


['rawsonfire', 'millerfire', 'losangelesfire', 'cawildfires', 'laweather']
*************************************
 
Most Similar words: californiafires
['muirfire', 'fireseason', 'pgeshutoff', 'kincadefires', 'impeach']
*************************************
 
Most Similar words: easyfire
['gettyfire', 'construction', 'hillsidefire', 'sanfernandovalley', 'oc']
*************************************
 
Most Similar words: losangeles
['caplesfire', 'southfire', 'littlemountainfire', 'breaking', 'taboosefire']
*************************************
 
Most Similar words: rt
['soledadfire', 'verde', 'oes', 'fire', 'sanbernardino']
*************************************
 
Most Similar words: sigalert
['riverside', 'diamondbar', 'anaheim', 'fontana', 'ventura']
*************************************
 
Most Similar words: knxtrafficpic
['buenapark', 'sandiegocounty', 'latraffic', 'em', 'naturopathicmedicine']
*************************************
 
Most Similar words: california
['wildfires', 'tzuchi

#### Converting `requery_list` list of lists into one list called `requery`

In [25]:
requery = []

for i in requery_list:
    for item in i:
        requery.append(item)
requery      

['gettyfire',
 'easyfire',
 'construction',
 'hillsidefire',
 'hillfire',
 'canyoncountry',
 'porterranch',
 'brushfire',
 'moonset',
 'chatsworth',
 'latraffic',
 'sepulvedapass',
 'buenapark',
 'wordpress',
 'boyleheights',
 'knxtraffic',
 'buenapark',
 'sepulvedapass',
 'wordpress',
 'em',
 'easyfire',
 'construction',
 'hillsidefire',
 'sanfernandovalley',
 'saddleridgefire',
 'rawsonfire',
 'millerfire',
 'losangelesfire',
 'cawildfires',
 'laweather',
 'muirfire',
 'fireseason',
 'pgeshutoff',
 'kincadefires',
 'impeach',
 'gettyfire',
 'construction',
 'hillsidefire',
 'sanfernandovalley',
 'oc',
 'caplesfire',
 'southfire',
 'littlemountainfire',
 'breaking',
 'taboosefire',
 'soledadfire',
 'verde',
 'oes',
 'fire',
 'sanbernardino',
 'riverside',
 'diamondbar',
 'anaheim',
 'fontana',
 'ventura',
 'buenapark',
 'sandiegocounty',
 'latraffic',
 'em',
 'naturopathicmedicine',
 'wildfires',
 'tzuchi',
 'gis',
 'molinofire',
 'reporter',
 'california',
 'tzuchi',
 'molinofire',
 

#### converting `requery` into a set - a list with only the unique values of `requery` called `requery_results`

In [26]:
requery_results = set(requery)
requery_results

{'abc',
 'acton',
 'anaheim',
 'anf',
 'ans',
 'appreciate',
 'beverly',
 'boyleheights',
 'breaking',
 'breakingnews',
 'briceburgfire',
 'brushfire',
 'buenapark',
 'burbank',
 'burrisfire',
 'business',
 'ca',
 'cafires',
 'california',
 'californiafire',
 'californiawildfire',
 'canyoncountry',
 'caplesfire',
 'castaic',
 'cawildfires',
 'cawx',
 'cbsla',
 'chatsworth',
 'citycouncil',
 'construction',
 'corona',
 'dexterfire',
 'diablo',
 'diamondbar',
 'easyfire',
 'economics',
 'em',
 'emergency',
 'emergencymanagement',
 'energy',
 'fire',
 'fireseason',
 'firstresponders',
 'fontana',
 'gavinnewsom',
 'gettyfire',
 'gettyfires',
 'gis',
 'granadahills',
 'halloween',
 'helicopter',
 'hillfire',
 'hillsidefire',
 'hollywood',
 'hope',
 'impeach',
 'jasonryanphoto',
 'kincade',
 'kincadefires',
 'kincaidefire',
 'kincaidfire',
 'kindcadefire',
 'knxtraffic',
 'ktla',
 'lacofd',
 'lacofdpio',
 'lacountyfire',
 'lafd',
 'lafire',
 'lancaster',
 'lapdhq',
 'lasdhq',
 'latraffic',
 

### This list would then be used to requery twitter to hopefully get better results. Through looking the the query suggestions it's clear that the list could use more editing or cleaning before implementation. Also, this data was trained on past data and should be tested on live data before implementing.