# Google Trends
Leon Yin

This notebook walks through the Google Trends data we collected, giving example searches for each of the real-time trend categories available.

We also create samples of search terms that are unique to each category.

In [1]:
import pandas as pd
import s3
from tqdm import tqdm



In [2]:
# inputs
s3_pattern = 's3://markup-investigations-google/trends/combined_trends/*/*/*/*/combined_trends.json'

# outputs
fn_out = '../data/input/trending_searches_by_category.csv'

Here is a dictionary of the available Google Trends categories. It maps each category's shortcode to its name.

In [3]:
cat2desc = {
    'b' : 'Business',
    'e' : 'Entertainment',
    'm' : 'Health',
    't' : 'Science and Tech',
    's' : 'Sports',
    'h' : 'Top Stories'
}
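
As a small illustration of how this lookup can be used, the sketch below maps shortcodes in a toy `category` column to their names with `Series.map`. The toy frame, the `category_name` column, and the unknown shortcode `'x'` are made up for the example and are not part of the collected data.

```python
import pandas as pd

cat2desc = {
    'b': 'Business',
    'e': 'Entertainment',
    'm': 'Health',
    't': 'Science and Tech',
    's': 'Sports',
    'h': 'Top Stories',
}

# Hypothetical toy data; 'x' is deliberately not a known shortcode.
toy = pd.DataFrame({'category': ['b', 'h', 'x']})

# Shortcodes missing from the dictionary become NaN rather than raising.
toy['category_name'] = toy['category'].map(cat2desc)
print(toy)
```

Because unknown shortcodes map to `NaN`, a quick `toy['category_name'].isna().sum()` is one way to check that every category in the data is covered by the dictionary.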

In [4]:
files = s3.ls(s3_pattern)

In [5]:
len(files)

404

In [6]:
data = []
for fn in tqdm(files):
    df = s3.read_json(fn, lines=True)
    data.extend(df.to_dict(orient='records'))

100%|██████████| 404/404 [01:03<00:00,  6.38it/s]
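
The `s3` helper module is not shown in this notebook, so the following is only a sketch of the same accumulation pattern using plain pandas. It assumes each file holds line-delimited JSON (one record per line), which is what `lines=True` implies; the two in-memory strings below are hypothetical stand-ins for the S3 objects.

```python
import io

import pandas as pd

# Hypothetical JSON-lines payloads standing in for the files on S3.
raw_files = [
    '{"id": "a", "category": "b"}\n{"id": "b", "category": "e"}\n',
    '{"id": "c", "category": "s"}\n',
]

# Parse each payload into its own frame, then stack them into one.
frames = [pd.read_json(io.StringIO(raw), lines=True) for raw in raw_files]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating frames directly is an alternative to collecting dictionaries and rebuilding a frame afterwards; either approach should give the same rows.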


In [7]:
df = pd.DataFrame(data)

In [8]:
df['collection_datetime'] = pd.to_datetime(df['collection_datetime'], unit='s')
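
To make the `unit='s'` argument concrete, here is a minimal sketch that converts a single integer, read as seconds since the Unix epoch, into a timestamp. The value 1572842942 is the `session_id` that appears in this notebook's sample rows; treating it as an epoch timestamp is an assumption made only for this example.

```python
import pandas as pd

# Integer interpreted as seconds since 1970-01-01 00:00:00 UTC.
epoch_seconds = 1572842942

# The result is timezone-naive and represents UTC wall-clock time.
ts = pd.to_datetime(epoch_seconds, unit='s')
print(ts)  # 2019-11-04 04:49:02

# The same call works element-wise on a Series.
series = pd.to_datetime(pd.Series([0, epoch_seconds]), unit='s')
print(series)
```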

In [9]:
df_sample = df[df['collection_datetime'] <= '2020-01-07']

In [10]:
df_sample.head()

Unnamed: 0,image,shareUrl,articles,idsForDedup,id,title,entityNames,collection_datetime,category,session_id
0,{'newsUrl': 'https://www.reuters.com/article/u...,https://trends.google.com/trends/trendingsearc...,[{'articleTitle': 'McDonald&#39;s ousts CEO ov...,"[/m/012r4l7l /m/07gyp7, /m/012r4l7l /m/0dq_5, ...",US_lnk_A0qQ5gAwAACTHM_en,"McDonald's, Chief Executive, Steve Easterbrook","[McDonald's, Chief Executive, Steve Easterbrook]",2019-11-04 04:49:03,b,1572842942
1,{'newsUrl': 'https://www.bbc.com/news/business...,https://trends.google.com/trends/trendingsearc...,"[{'articleTitle': 'No-one understood our idea,...",[/g/11c3ypc1q3],US_lnk_uAGS5gAwAAAqVM_en,Airtable,[Airtable],2019-11-04 04:49:03,b,1572842942
2,{},https://trends.google.com/trends/trendingsearc...,[{'articleTitle': 'Oregon Public Employees Ret...,"[/m/01xm4b /m/02_7t, /m/01xm4b /m/03jzl9, /m/0...",US_lnk_Tr-O5gAwAADA6M_en,"NYSE:MAN, Share, Share price, Finance, Manpowe...","[NYSE:MAN, Share, Share price, Finance, Manpow...",2019-11-04 04:49:03,b,1572842942
3,{'newsUrl': 'https://rivertonroll.com/news/201...,https://trends.google.com/trends/trendingsearc...,[{'articleTitle': 'SFE Investment Counsel Buys...,"[/m/01prdc /m/03jzl9, /m/01prdc /m/05drh, /m/0...",US_lnk_KYtm5gAwAABP3M_en,"NYSE:PFE, Pfizer, New York Stock Exchange, Sha...","[NYSE:PFE, Pfizer, New York Stock Exchange, Sh...",2019-11-04 04:49:03,b,1572842942
4,{},https://trends.google.com/trends/trendingsearc...,[{'articleTitle': 'How Does ITD Cementation In...,"[/g/1dv25c4d /m/015nwb, /g/1dv25c4d /m/01xm4b,...",US_lnk_6z2S5gAwAAB5aM_en,"Share price, NSE:ITDCEM, Price–earnings ratio,...","[Share price, NSE:ITDCEM, Price–earnings ratio...",2019-11-04 04:49:03,b,1572842942


In [11]:
data = []
for _, row in tqdm(df_sample.iterrows()):
    category = row['category']
    for (_id, _ent) in zip(row['idsForDedup'], row['entityNames']):
        data.append({
            'category' : category,
            'id' : _id,
            'search' : _ent
        })

20787it [00:02, 8984.67it/s]


In [12]:
entitys = pd.DataFrame(data)
len(entitys)

74886

In [13]:
entitys.search.nunique()

16723

The almost 75K rows of entity and category data contain only about 16.7K unique searches, so many searches appear in more than one row.
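
The relationship between the row count and the number of unique searches can be illustrated on a toy frame. The sketch below uses made-up rows (not the collected data) and counts how many distinct categories each search appears in; the name `cats_per_search` is introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows: 'Pfizer' repeats within 'h' and also appears in 'b'.
toy = pd.DataFrame({
    'category': ['b', 'h', 'h', 's', 'e'],
    'search': ['Pfizer', 'Pfizer', 'Pfizer', 'Face mask', 'Gus Fring'],
})

n_rows = len(toy)                   # total rows, repeats included
n_unique = toy['search'].nunique()  # distinct search terms

# Number of distinct categories in which each search term occurs.
cats_per_search = toy.groupby('search')['category'].nunique()
print(n_rows, n_unique)
print(cats_per_search)
```

A search with a value above 1 in `cats_per_search` is shared between categories, while repeated rows within one category only inflate the row count.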

In [14]:
entitys.category.unique()

array(['b', 'm', 'e', 't', 's', 'h'], dtype=object)

In [15]:
entitys.category.value_counts()

h    14830
e    13847
s    11857
m    11854
t    11568
b    10930
Name: category, dtype: int64

Next, we keep only the search terms that appear in exactly one category.

In [16]:
entitys.drop_duplicates(subset=['category', 'search'],
                        inplace=True)

In [17]:
search2count = entitys.search.value_counts()

In [18]:
unique_searches = search2count[search2count == 1].index.tolist()
unique_searches[0]

'SIE San Diego Studio'
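
The three cells above can be traced on a toy frame to see why the de-duplication step has to come first. This is a sketch on made-up rows, with variable names (`deduped`, `counts`, `unique_terms`) chosen for the example.

```python
import pandas as pd

# Hypothetical rows: 'Pfizer' occurs twice in 'b' and once in 'h'.
toy = pd.DataFrame({
    'category': ['b', 'b', 'h', 's'],
    'search': ['Pfizer', 'Pfizer', 'Pfizer', 'Face mask'],
})

# Keep one row per (category, search) pair so repeats within a
# category do not count as separate categories.
deduped = toy.drop_duplicates(subset=['category', 'search'])

# After de-duplication, a count of 1 means the term appears in
# exactly one category.
counts = deduped['search'].value_counts()
unique_terms = counts[counts == 1].index.tolist()
print(unique_terms)  # ['Face mask']
```

Without the `drop_duplicates` step, a term that trends repeatedly in a single category would have a count above 1 and would be wrongly excluded.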

In [19]:
entitys.search.str.len().mean()

14.369490790899242

In [20]:
entitys.search.str.split(' ').str.len().mean()

2.141018418201517

In [21]:
cat_search = entitys[entitys.search.isin(unique_searches)]

What is the breakdown of unique searches per category?

In [22]:
cat_search.category.value_counts()

e    3284
b    2653
t    1964
m    1515
s    1141
h    1092
Name: category, dtype: int64

In [23]:
# examples per category
for cat in cat_search.category.unique():
    print(f"{cat} - {cat2desc.get(cat)}")
    print(
        cat_search[cat_search.category == cat]
            .sample(4, random_state=303)
            .search.tolist()
    )
    print('*' * 77)

b - Business
['Autoliv', 'Hang Seng Index', 'NASDAQ:EYPT', 'Non-stop flight']
*****************************************************************************
m - Health
['Cervical cancer', 'Alcoholic hepatitis', 'Kaiser Family Foundation', 'Shelby County']
*****************************************************************************
e - Entertainment
['WWE Live from Madison Square Garden', 'Gus Fring', 'Priscilla Presley', 'Jennifer Syme']
*****************************************************************************
t - Science and Tech
['Workplace by Facebook', 'Sindel', 'Phoenix Point', 'Shinra']
*****************************************************************************
s - Sports
['Cam Reddish', 'Taylor Townsend', 'Face mask', 'Ralf Rangnick']
*****************************************************************************
h - Top Stories
['York County', 'Houston Police Department', 'Mail carrier', 'Surfers Paradise']
*****************************************************************************

In [24]:
cat_search.to_csv(fn_out, index=False)

In [25]:
cat_search.search.str.len().mean()

14.62202764185767

In [26]:
# note: this splits on hyphens, whereas In [20] splits on spaces
cat_search.search.str.split('-').str.len().mean()

1.036913039745901
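
The last cell splits on `'-'`, while the earlier word-count cell splits on `' '`. It is not clear from the notebook whether the hyphen split is intended, so the sketch below only shows how the two delimiters measure different things, using three search terms taken from the examples above.

```python
import pandas as pd

terms = pd.Series(['Non-stop flight', 'Face mask', 'Airtable'])

# Splitting on spaces counts words.
words = terms.str.split(' ').str.len()

# Splitting on hyphens counts hyphen-separated parts, which is 1
# for any term that contains no hyphen.
hyphen_parts = terms.str.split('-').str.len()

print(words.tolist())         # [2, 2, 1]
print(hyphen_parts.tolist())  # [2, 1, 1]
```

A mean close to 1 for the hyphen split therefore indicates that few terms contain a hyphen; it does not describe the number of words per term.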