# Retrieving words related to COVID-19
For this task we'll use the Google Trends API, which lets us retrieve the words most frequently used in search queries related to COVID-19. We will later use these same words to identify posts related to the COVID-19 outbreak.

To run the API it's necessary to install the "pytrends" package.

In [1]:
!pip install pytrends

Collecting pytrends
  Using cached pytrends-4.7.2.tar.gz (17 kB)
Building wheels for collected packages: pytrends
  Building wheel for pytrends (setup.py): started
  Building wheel for pytrends (setup.py): finished with status 'done'
  Created wheel for pytrends: filename=pytrends-4.7.2-py3-none-any.whl size=14265 sha256=4dca40a4e37d4d14e511403130370645411309fbb99181ce84231d86686c45ed
  Stored in directory: c:\users\stefyx\appdata\local\pip\cache\wheels\ba\e9\63\a7be983fdd9d25e31de75dd388b6f5ea8b5191c20396a6dc52
Successfully built pytrends
Installing collected packages: pytrends
Successfully installed pytrends-4.7.2


In [4]:
import pandas as pd                        
from pytrends.request import TrendReq
pytrend = TrendReq()

Let's see an example of how the API works. These lines of code retrieve the currently trending searches in Italy.

In [11]:
df = pytrend.trending_searches(pn='italy')
df.head()

Unnamed: 0,0
0,Meet
1,Intesa Sanpaolo
2,Poste Italiane
3,Microsoft Teams
4,Agenzia delle Entrate


Below are the trending searches shown on the Google Trends website, at https://trends.google.com/trends/trendingsearches/daily?geo=IT.
As you can see, the result of the previous query matches the website (this image refers to 20/04/2020).

![title](GoogleTrendsExample.png)

## Retrieving frequent search terms for COVID-19

Let's get to the real work now. First we'll look for queries related to the coronavirus. For this, we'll need a small list of keywords connected to the term 'coronavirus'.

In [41]:
keywords = ['Coronavirus', 'COVID-19', 'COVID', 'COVID19', 'China flu', 'wuhan', 'virus']
pytrend.build_payload(kw_list = keywords)
related_queries = pytrend.related_queries()
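
Google Trends normally compares at most five terms in a single request, so a longer keyword list may not always be accepted. A minimal sketch of splitting the list into batches, assuming that five-term limit applies; `chunk_keywords` is a hypothetical helper and not part of pytrends:

```python
def chunk_keywords(keywords, size=5):
    """Split a keyword list into consecutive batches of at most `size` items."""
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


keywords = ['Coronavirus', 'COVID-19', 'COVID', 'COVID19', 'China flu', 'wuhan', 'virus']

# The batch size of 5 is an assumption about the Google Trends limit
batches = chunk_keywords(keywords)
print(batches)
```

Each batch could then be passed to `pytrend.build_payload(kw_list=batch)` in turn, and the per-keyword results merged afterwards.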

In [74]:
# Collect the 'rising' and 'top' related queries of every keyword.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
frames = []
for word in keywords:
    frames.append(pd.DataFrame(related_queries[word]['rising']))
    frames.append(pd.DataFrame(related_queries[word]['top']))
df_queries = pd.concat(frames, ignore_index=True)
# Keep only the query text
df_queries = df_queries.drop(columns=['value'])

In [76]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_queries)

                                                 query
0                        thank you coronavirus helpers
1                                     coronavirus tips
2                                   coronavirus brasil
3                                             covid 19
4                              worldometer coronavirus
5                                          worldometer
6                                coronavirus australia
7                                  coronavirus numbers
8                                  coronavirus florida
9                                    coronavirus stats
10                                    mapa coronavirus
11                                   trump coronavirus
12                                     coronavirus hoy
13                                     nyc coronavirus
14   merci à tous ceux qui aident à combattre le co...
15                             coronavirus update live
16                                 coronavirus updates
17        

From this list, a few recurring words stand out:
 - 'updates' and 'news', people look for the latest information on the outbreak
 - 'death'
 - Country names
 - 'symptoms' and 'tips', too many people look for medical information online
 - 'thank you helpers', a bright note in all of this
 - 'WHO', which is an unfortunate acronym to search for
 - 'trump', the US president was widely criticized for his handling of the crisis
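
Recurring words like these can also be found programmatically rather than by eye. The following is only a sketch: it counts word occurrences over a handful of the queries printed above (copied by hand), and `count_words` and `seed_words` are hypothetical names introduced for illustration.

```python
from collections import Counter

# A few of the queries printed above, copied by hand for illustration
sample_queries = [
    "thank you coronavirus helpers",
    "coronavirus tips",
    "worldometer coronavirus",
    "worldometer",
    "coronavirus update live",
    "coronavirus updates",
    "trump coronavirus",
]

# The seed keywords are ignored so that they don't dominate the counts
seed_words = {"coronavirus", "covid", "covid-19", "covid19", "wuhan", "virus"}


def count_words(queries, ignore=frozenset()):
    """Count how often each lowercase word appears across all queries."""
    counts = Counter()
    for query in queries:
        for word in query.lower().split():
            if word not in ignore:
                counts[word] += 1
    return counts


word_counts = count_words(sample_queries, ignore=seed_words)
print(word_counts.most_common(5))
```

With the real data, `df_queries['query']` could be passed in place of `sample_queries`.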
 

In [102]:
# For some reason the API doesn't accept a list of keywords here, so we query them one at a time

frames = []

for word in keywords:
    pytrend.build_payload(kw_list=[word])
    topics = pytrend.related_topics()
    frames.append(pd.DataFrame(topics[word]['rising']))
    frames.append(pd.DataFrame(topics[word]['top']))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df_topics = pd.concat(frames, ignore_index=True)
# Drop the columns we don't need
df_topics = df_topics.drop(columns=['formattedValue', 'hasData', 'link', 'topic_mid', 'value'])

In [103]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_topics)

                                           topic_title  \
0                                           Statistics   
1                                               Brazil   
2                                         Worldometers   
3                                           Brazilians   
4                                          U.S. County   
5                                                Cubit   
6                             Johns Hopkins University   
7                                                Covid   
8                                                 Mask   
9                                           U.S. state   
10                                               Wuhan   
11                                            Populace   
12                                         Última Hora   
13                                            Lombardy   
14                                              Decree   
15                        Dow Jones Industrial Average   
16            

The related topics give us some interesting insights. Leaving aside the various country names and 'latest news', we find the words:
 - 'Statistics', many people want to know the real numbers of the outbreak
 - 'Mask', for several weeks the WHO wasn't able to give a definitive answer on its usefulness, so every country went its own way
 - 'Decree', governments have issued one decree after another during this period
 - 'Dow Jones Industrial Average', many people fear the collapse of the economy
 - 'Coronaviridae', this is an interestingly specific term. My guess is that many researchers around the world are using the Internet to quickly share discoveries about the virus
 - 'Vaccine', unfortunately no vaccine is available yet
 - 'Cubit', I had to look this one up. Apparently it's a modular way of decorating a house or an office; I guess companies want to partition their offices to create a safer workspace during the epidemic
 - 'Infection', 'Pandemic', 'Influenza'
 - 'Centers for Disease Control and Prevention', the US federal agency dedicated to preventing the spread of diseases
 - 'Johns Hopkins University', the university has been monitoring the outbreak since its beginning
 - 'African swine fever virus', 'Plague', 'Black Death', 'Influenza A virus subtype H2N2', people look to the past for information
 - 'Wet markets', 'bats', the suspected origins of the virus
 

## Conclusions
This API is a very powerful tool that lets us look through the immense amount of information Google receives every day. I will be using the following lists of terms (carefully refined from the ones observed above) to distinguish between the various Reddit posts.

These lists can be updated later to take into account the period of time when each term was most popular (for example, the word 'Lombardy').

In [104]:
#LIST OF TERMS RELATED TO CORONAVIRUS

#HIGH CORRELATION: basically synonyms; we can assume that a post
#containing one of these words is talking about covid-19
coronavirus_terms_high_corr = [
    'coronavirus',
    'corona',
    'virus',
    'covid',
    'covid19',
    'covid-19',
    'flu',
    'wuhan',
    'Coronaviridae'
]

#MEDIUM CORRELATION: words that have been widely used together with
#'Coronavirus' in the last few months, but are also common in everyday use.
#We can assume that a post from March/April 2020 containing one of these
#words is very likely talking about Coronavirus
coronavirus_terms_medium_corr = [
    'symptoms',
    'cases',
    'helpers',
    'death',
    'deaths',
    'test',
    'china',
    'mask',
    'lombardy',
    'infection',
    'vaccine',
    'pandemic',
    'cdc',
    'outbreak'
]

#LOW CORRELATION: words widely used with 'Coronavirus', which however
#are used in unrelated posts as well, even in this period. A post that
#contains only one of these words, and none from the lists above,
#has a high chance of being unrelated
coronavirus_terms_low_corr = [
    'updates',
    'update',
    'latest',
    'news',
    'tips',
    'statistics',
    'wet',
    'market',
    'bats'
]
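
As a rough illustration of how the three lists could be applied to a post, here is a minimal sketch. It assumes that the strongest matching list decides the label; `tokenize` and `classify_post` are hypothetical helpers, and the lists are shortened copies of the ones above so that the block runs on its own.

```python
import re

# Shortened copies of the three lists above, so the sketch runs on its own
high_corr = ['coronavirus', 'covid', 'covid19', 'covid-19', 'wuhan']
medium_corr = ['symptoms', 'cases', 'mask', 'vaccine', 'pandemic', 'outbreak']
low_corr = ['updates', 'update', 'latest', 'news', 'tips', 'statistics']


def tokenize(text):
    """Lowercase the text and split it into words, keeping inner hyphens (e.g. 'covid-19')."""
    return set(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def classify_post(text):
    """Return 'high', 'medium', 'low' or 'none', based on the strongest list matched."""
    words = tokenize(text)
    if words & set(high_corr):
        return 'high'
    if words & set(medium_corr):
        return 'medium'
    if words & set(low_corr):
        return 'low'
    return 'none'


print(classify_post("New COVID-19 cases in Italy"))
print(classify_post("Latest news from the stock market"))
```

Matching on whole words rather than substrings avoids false hits such as 'test' inside 'latest'; whether a 'low' match alone should count as related is a choice left to the later analysis.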

The API might also be useful later on, to track the spread of fake news online once we have collected enough examples.

## Acknowledgements

For this notebook I'd like to credit this tutorial: https://towardsdatascience.com/google-trends-api-for-python-a84bc25db88f, which helped me get a grasp of this technology.