# Trending searches

The goal of this notebook is to explore whether there is a simple way to extract the trending searches of a given day.

During the exploration, we saw that some queries, such as python, are requested every day. In contrast, the query 'tes' was trending on the 14th of June but was not requested at all during the following month.

This matters because a trending query gives us a better grasp of what is happening on a given day that was not happening before.

# Loading the data

In [None]:
import collections
import json

import numpy as np
import pandas as pd
import cufflinks as cf
import plotly.graph_objs as go
import plotly.offline as py

# Run cufflinks/plotly in offline mode so that .iplot() renders inside the notebook
cf.go_offline()
cf.set_config_file(offline=True)

In [None]:
garbage=[] # lines that could not be parsed as JSON
records=[]
with open('./all.txt',encoding='utf-8',errors='ignore') as fp:
    for line in fp:
        try:
            records.append(json.loads(line.strip('\n')))
        except json.JSONDecodeError:
            garbage.append(line)
df=pd.DataFrame(records)

In [None]:
df['timestamp']=pd.to_datetime(df['timestamp'])
df.set_index('timestamp',inplace=True)
df.sort_index(inplace=True) # sort chronologically, now that the timestamp is the index
# We create a dataframe with just the query, and timestamp as index
df_query=pd.DataFrame(df['query'])

In [None]:
df_query=df_query[~(df_query['query']=='')] # We delete the empty queries

To get the trending queries per day, we start by resampling the previous dataframe on a daily basis and grouping each day's queries together as a list.

In [None]:
df_query_perday=df_query.resample('1D').apply(list)
df_query_perday.head()

We now have to define properly what a trending query is.
First of all, to be trending, a query must be among the most requested of the day.
But it must not be among the most requested every day.

In this light, this is similar to the TF-IDF score used in information retrieval.
Briefly, TF-IDF stands for Term Frequency - Inverse Document Frequency, a score that measures how important a word is to a document within a corpus of documents.
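
For reference, here is one way to write the score down, using the conventions of the cells in this notebook (base-10 logarithm, counts taken over each day's top N queries). The symbols $c(q,d)$ (number of times query $q$ was requested on day $d$) and $D$ (number of days in the corpus) are our own notation:

$$
\mathrm{tf}(q,d)=\frac{c(q,d)}{\sum_{q'}c(q',d)},\qquad
\mathrm{idf}(q)=\log_{10}\frac{D}{\left|\{d : c(q,d)>0\}\right|},\qquad
\mathrm{tfidf}(q,d)=\mathrm{tf}(q,d)\cdot\mathrm{idf}(q)
$$

A query requested every day has $\mathrm{idf}(q)=\log_{10}(1)=0$, so its score is 0 however often it is requested.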

If we consider each day as a single "document", then by resampling our dataframe on a daily basis we have actually created a corpus.

We can then calculate the "Term Frequency" by counting the occurrences of each query.
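
As a minimal sketch of the idea, the toy example below scores the queries of one day against a hypothetical three-day corpus. The data and the helper `tfidf_for_day` are made up for illustration and are not part of this notebook's pipeline; the base-10 logarithm is an assumption.

```python
import collections
import math

# Hypothetical toy corpus: each day is one "document" made of that day's queries.
days = {
    "2018-06-12": ["python", "pandas", "python"],
    "2018-06-13": ["python", "numpy"],
    "2018-06-14": ["python", "tes", "tes", "tes"],
}

def tfidf_for_day(days, target_day):
    """Return {query: tf-idf} for target_day, using every day in `days` as the corpus."""
    counts = {day: collections.Counter(queries) for day, queries in days.items()}
    n_days = len(counts)
    target = counts[target_day]
    total = sum(target.values())
    scores = {}
    for query, count in target.items():
        # Term frequency: share of the day's requests that went to this query
        tf = count / total
        # Inverse document frequency: penalise queries that appear on many days
        days_with_query = sum(1 for c in counts.values() if query in c)
        idf = math.log10(n_days / days_with_query)
        scores[query] = tf * idf
    return scores

scores = tfidf_for_day(days, "2018-06-14")
for query, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{query}: {score:.3f}")
```

Here 'python' gets a score of 0 because it appears on every day of the toy corpus, while 'tes', which appears on a single day, comes out on top.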

Let's look for searches that are trending on one day, but not on the other days of the data.

In [None]:
df_query_perday['counted']=df_query_perday['query'].apply(lambda x: collections.Counter(x).most_common())
df_query_perday.head()

So we have a list of tuples, each holding a query and its number of occurrences that day.

In [None]:
# df_query_pertimeframe was never defined in this notebook; the daily dataframe is df_query_perday
df_query_perday['counted']=df_query_perday['query'].apply(lambda x: collections.Counter(x).most_common())


To reduce the computation cost, we focus only on the top N queries of each day.

In [None]:
N=1000 # we keep the top 1000 queries of each day
dfs=[]
for day in df_query_perday.index:
    dfs.append(pd.DataFrame.from_records(df_query_perday.loc[day].counted[0:N],columns=['searchq' ,'count']).set_index('searchq'))
# One column per day, one row per query; NaN where a query is not in that day's top N
df_days = pd.concat(dfs,axis=1,sort=True)
# idf = log10(number of days / number of days in which the query appears)
df_days['idf']=np.log10(len(dfs)/df_days.count(axis='columns'))
# Let's calculate the trending queries of the last day of the data
last_day=dfs[-1].copy()
# tf = share of the day's (top N) requests that went to this query
last_day['tf'] = last_day['count'] / last_day['count'].sum()
last_day['idf']=df_days["idf"]
last_day['tfidf'] = last_day['idf'] * last_day['tf']

Now, let's plot the timeseries of the top 5 trending queries for the last day of the data

In [None]:
def plot_trending(last_day,num_trends):
    # Queries with the highest tf-idf score on the selected day
    top_queries=list(last_day.sort_values('tfidf', ascending=False).head(num_trends).index)
    # Daily number of requests for each of these queries
    plot_by_loc=[df_query[df_query['query']==search].resample('1D').count() for search in top_queries]
    trendings=pd.concat(plot_by_loc,axis=1)
    # Resample again so that the combined index covers every day without gaps
    trendings=trendings.resample('1D').sum()
    trendings=trendings.fillna(0)
    trendings.columns=top_queries
    trendings.iplot(subplots=True, shape=(num_trends,1), shared_xaxes=True, fill=True,title='Timeseries of top {} trending queries'.format(num_trends),yTitle=' ')

plot_trending(last_day,5)

We can easily see that these queries are indeed trending on that day, but never on any of the other days. This seems to be a working approach. However, some trends are cyclic, with events that recur once in a while. Let's rework the previous code a bit:


In [None]:
def get_trends_by_day(df_query_perday,date,N,n_previous):
    """Score the top N queries of `date` against a window of n_previous days ending on `date`."""
    dfs=[]
    index_date=df_query_perday.index.get_loc(date)+1
    # The window ends on `date` (included); clamp the start so it never wraps around the beginning of the data
    days=df_query_perday.iloc[max(index_date-n_previous,0):index_date].index
    for day in days:
        dfs.append(pd.DataFrame.from_records(df_query_perday.loc[day].counted[0:N],columns=['searchq' ,'count']).set_index('searchq'))
    df_days = pd.concat(dfs,axis=1,sort=True)
    df_days['idf']=np.log10(len(dfs)/df_days.count(axis='columns'))
    # Let's calculate the trending queries of the selected day (the last day of the window)
    last_day=dfs[-1].copy()
    last_day['tf'] = last_day['count'] / last_day['count'].sum()
    last_day['idf']=df_days["idf"]
    last_day['tfidf'] = last_day['idf'] * last_day['tf']

    return last_day

The code has now been wrapped into a function, so that we can get the trending queries of a specific day, compared with a given number of days before it. Note that the window of n_previous days includes the selected day itself.


In [None]:
trends=get_trends_by_day(df_query_perday,"2018-06-28",100,7)
plot_trending(trends,5)

By extension, and with only a handful of changes, this approach could be used with any time frame, for example to get new weekly or monthly trends, as it only depends on how we create the "documents" through time resampling. And, in case the number of search queries becomes too high (when grouping by month, for example), we could easily replace the standard counter with a probabilistic counter, for example a count-min sketch, in order to stay efficient.
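
To make the last point more concrete, here is a minimal count-min sketch in pure Python. It is an illustrative sketch, not a tuned implementation: the class `CountMinSketch`, its default `width` and `depth`, and the choice of salted `blake2b` hashes are all assumptions made for this example.

```python
import hashlib

class CountMinSketch:
    """Approximate counter in fixed memory: estimates never undercount,
    but may overcount when different items collide in every row."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One hash function per row, obtained by salting blake2b with the row number
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # The smallest cell across rows is the one least inflated by collisions
        return min(self.table[row][col] for row, col in self._indexes(item))

sketch = CountMinSketch()
for query in ["python", "tes", "tes", "pandas", "tes"]:
    sketch.add(query)
print(sketch.estimate("tes"))
```

Unlike `collections.Counter`, the sketch does not store the queries themselves, so extracting the top N of a period would still require keeping a separate list of candidate queries (for example a small heap of the current heavy hitters).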