<a href="https://colab.research.google.com/github/verralljellyfish/text-analysis/blob/master/Google_SERP_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SERP Analysis
### How to Analyze, Visualize and Summarise Google's SERP (NER + BART/T5 + Q&A)

This [pytude](https://github.com/norvig/pytudes), as *Peter Norvig* would call it (a simple Python program), is meant to help us "see" the terms and the concepts behind search intents.

👉Full blog post here: "[SERP Analysis with the help of AI](https://wordlift.io/blog/en/serp-analysis)" 🚀

This notebook looks at **what makes a SERP unique** and at how we can use semantic analysis to understand the characteristics of the results behind a search query on Google. At the end I also use BART to summarise the top results into one coherent text. BART is a machine learning model for text summarisation developed by [Lewis et al. 2019](https://arxiv.org/abs/1910.13461) and made easy to use by [@HuggingFace](https://twitter.com/huggingface).

<br/>
<table align="left">
  <td>
  <a href="https://wordlift.io">
    <img width=130px src="https://wordlift.io/wp-content/uploads/2018/07/logo-assets-510x287.png" />
    </a>
    </td>
    <td>
      by 
      <a href="https://wordlift.io/blog/en/entity/andrea-volpini">
        Andrea Volpini
      </a>
      <br/>
      <br/>
      MIT License
      <br/>
      <br/>
      <i>Last updated: <b>April 8th, 2020</b></i>
  </td>
</table>
<br/>
<br/>


## Getting Top Results from Google Search
#### Installing libraries


In [0]:
# Install the libraries for scraping, cleaning and text analysis only

!pip install google
!pip install -U git+https://github.com/adbar/trafilatura.git

import re
import pandas as pd
import numpy as np
import trafilatura
import pprint




In [0]:
# Install tensorflow + transformers + pipelines
# You need this to summarize the SERP and to run question-answering on the extracted corpus of text

!pip install transformers --upgrade # details https://github.com/huggingface/transformers/releases/tag/v2.6.0
!pip install tensorflow==2.1 # you need this to use T5
from transformers import pipeline




### Shooting the query

Here are the parameters that we can use:

* **query**: the query string that we want to search for.
* **tld**: the top-level domain, which decides whether we search on google.com, google.in or some other domain.
* **lang**: the language of the results.
* **num**: number of results per page.
* **start**: first result to retrieve.
* **stop**: last result to retrieve. Use None to keep searching forever.
* **pause**: time to wait between HTTP requests, in seconds. Too short a pause may cause Google to block your IP; a longer pause makes the program slower, but it is the safer option.

**Return**: a generator (iterator) that yields the URLs found. If the `stop` parameter is None, the iterator will loop forever.

Here is the documentation: https://python-googlesearch.readthedocs.io/en/latest/
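
Because the search function returns a generator that never ends when `stop` is None, it can help to cap it on the consumer side. The sketch below shows the idea with `itertools.islice`; `fake_search` is a hypothetical stand-in written for this example (it makes no network calls and is not part of the `googlesearch` library).

```python
from itertools import count, islice

def fake_search(query, stop=None):
    """Hypothetical stand-in for a search generator: endless when stop is None."""
    for i in count(1):
        if stop is not None and i > stop:
            return
        yield f"https://example.com/{query.replace(' ', '-')}/{i}"

# Take only the first five URLs from an otherwise endless generator.
first_five = list(islice(fake_search("jason barnard"), 5))
for url in first_five:
    print(url)
```

The same `islice` call should work on any generator, so the cap stays in place even if `stop` is left unset by mistake.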

In [0]:
from googlesearch import search # provided by the `google` package installed above

uQuery_1 = "Jason Barnard" # here is where everything begins: we choose the two queries that we want to compare
uQuery_2 = "Andrea Volpini"

uNum = 15 # number of results to request for each query

def getResults(uQuery, uTLD, uNum, uStart, uStop):
  # Collect the result URLs in a list
  d = []

  for j in search(uQuery, tld=uTLD, num=uNum, start=uStart, stop=uStop, pause=2):
      d.append(j)
      print(j)
  return d

results_1 = getResults(uQuery_1, "com", uNum, 1,uNum)
results_2 = getResults(uQuery_2, "com", uNum, 1,uNum)

https://jasonbarnard.com/
https://twitter.com/jasonmbarnard?lang=en
https://en.wikipedia.org/wiki/Jason_Barnard
https://www.searchenginejournal.com/author/jason-barnard/
https://www.semrush.com/user/146089169/
https://kalicube.pro/podcast/
https://kalicube.pro/about/
https://www.linkedin.com/today/author/jasonbarnard
https://www.linkedin.com/in/jasonbarnard/
https://www.facebook.com/public/Jason-Barnard
https://www.stitcher.com/podcast/jason-barnard/seoisaeo-podcast-expert-interviews-at-major-digital-marketing
https://wordlift.io/academy-entries/how-google-ranking-works/
https://blog.grade.us/jason-barnard/
https://www.youtube.com/watch?v=ijA1iFW5kXA
https://www.ggu.edu/graduate/faculty/bio/jason-barnard
https://twitter.com/cyberandy?lang=en
https://it.linkedin.com/in/volpini/it
https://2018.semantics.cc/users/andrea-volpini
https://www.semrush.com/user/151709297/
https://connected-data.london/speakers/andrea-volpini/
https://www.slideshare.net/cyberandy
https://www.atptour.com/en/play

### Scraping results with Trafilatura

The library can seamlessly download, parse and convert web documents: it scrapes the main body text while preserving part of the text formatting and page structure and converts to TXT, CSV, XML & TEI-XML.

Here is the documentation: https://trafilatura.readthedocs.io/



In [0]:
pd.set_option('display.max_colwidth', None) # make sure output is not truncated (cols width)
pd.set_option("display.max_rows", 100) # make sure output is not truncated (rows)

def readResults(urls, query):
    # Collect one (url, text, query, position) tuple per downloaded page
    x = []
    position = 0 # position on the SERP

    # Loop over the result URLs
    for page in urls:
        position += 1
        downloaded = trafilatura.fetch_url(page)
        if downloaded is not None: # skip pages that could not be downloaded
            result = trafilatura.extract(downloaded, include_tables=False, include_formatting=False, include_comments=False)
            x.append((page, result, query, position))
    return x

d = readResults(results_1, uQuery_1) # get results from the 1st query
e = readResults(results_2, uQuery_2) # get results from the 2nd query

df_1 = pd.DataFrame(d, columns=('url', 'result', 'query', 'position')) # store data in a data frame
df_2 = pd.DataFrame(e, columns=('url', 'result', 'query', 'position')) # store data in a data frame

df_final = pd.concat([df_1, df_2])
print("total number of articles (before filtering) ",len(df_final))

# Remove rows where result is empty
df_final['result'] = df_final['result'].replace(' ', np.nan)
df_final = df_final.dropna(subset=['result'])

# Keep only the articles that are longer than 200 characters
df_final = df_final[df_final['result'].apply(lambda x: len(str(x))>200)]


# Reindex df
df_final.index = range(len(df_final.index))

# Set the file name
uQuery = uQuery_1 + "_" + uQuery_2
cleanQuery = re.sub(r'\W+', '', uQuery) # raw string, so that \W is not read as a string escape
file_name = cleanQuery + ".csv"

# Store data to CSV
df_final.to_csv(file_name, encoding='utf-8', index=True)
print("total number of articles saved to", file_name, ":", len(df_final))
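
To see what the file-name step does to a pair of queries, here is a small standalone sketch. The helper name `csv_name` is made up for this example; the regular expression is the one used in the cell above.

```python
import re

def csv_name(query_1, query_2):
    """Join two queries with '_' and drop every run of non-word characters."""
    joined = query_1 + "_" + query_2
    # \W matches anything that is not a letter, digit or underscore,
    # so spaces and punctuation disappear while the '_' separator stays.
    return re.sub(r"\W+", "", joined) + ".csv"

print(csv_name("Jason Barnard", "Andrea Volpini"))  # JasonBarnard_AndreaVolpini.csv
```

This keeps the CSV name free of spaces and slashes, whatever the queries contain.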



## Analyze terms from the corpus of results 

We can now visualize how language differs among search results. [Scattertext](https://github.com/JasonKessler/scattertext) is a tool for finding distinguishing terms in small-to-medium-sized corpora like the one we're using here.

Scattertext presents terms/concepts in an interactive, HTML scatter plot. Points corresponding to terms are selectively labeled so that they don't overlap with other labels or points.

Here is the documentation: https://github.com/JasonKessler/scattertext

In [0]:
# Getting additional horsepower - adding more libraries
!pip install scattertext

%matplotlib inline
import scattertext as st
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # the old stop_words module was removed from recent scikit-learn releases

import io
from scipy.stats import rankdata, hmean, norm
import spacy
import os, pkgutil, json, urllib
from urllib.request import urlopen
from IPython.display import IFrame
from IPython.display import display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
display(HTML("<style>.container { width:98% !important; }</style>"))

nlp = spacy.load('en_core_web_sm') # the 'en' shortcut is gone in spaCy 3; make sure you load the right language model here



### Terms that characterize the SERP 

Corpus characteristicness is the difference in dense term ranks between the words in all web pages and a general English-language frequency list.
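
To make the idea of a "difference in dense term ranks" more concrete, here is a rough pure-Python sketch. It is an illustration under simplifying assumptions, not Scattertext's actual implementation: the scaling by the maximum rank, the rank 0 for terms missing from the background list, and all the counts are invented for this example.

```python
def dense_ranks(freqs):
    """Dense rank per term: 1 for the lowest frequency, ties share a rank, no gaps."""
    distinct = sorted(set(freqs.values()))
    rank_of = {value: i + 1 for i, value in enumerate(distinct)}
    return {term: rank_of[value] for term, value in freqs.items()}

def characteristic_scores(corpus_freqs, background_freqs):
    """Scaled corpus rank minus scaled background rank, for every corpus term."""
    corpus_rank = dense_ranks(corpus_freqs)
    background_rank = dense_ranks(background_freqs)
    max_corpus = max(corpus_rank.values())
    max_background = max(background_rank.values())
    scores = {}
    for term, rank in corpus_rank.items():
        bg = background_rank.get(term, 0)  # assumption: unseen terms get rank 0
        scores[term] = rank / max_corpus - bg / max_background
    return scores

# Toy counts, invented for illustration only.
corpus_freqs = {"boowa": 14, "the": 40, "serp": 12, "marketing": 19}
background_freqs = {"the": 1_000_000, "marketing": 5_000, "serp": 20}

scores = characteristic_scores(corpus_freqs, background_freqs)
for term in sorted(scores, key=scores.get, reverse=True):
    print(f"{term:10s} {scores[term]:+.2f}")
```

A term that ranks high in the pages but low (or not at all) in general English ends up at the top, which is the intuition behind the list produced below.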

In [0]:
df_final['index'] = df_final.index
df_final.groupby('query').apply(lambda x: x.result.apply(lambda x: len(x.split())).sum())
df_final['parsed'] = df_final.result.apply(nlp) # parse every page with spaCy (tokenization, tagging, NER)

# Turn it into a Scattertext corpus
corpus = (st.CorpusFromParsedDocuments(df_final, 
                                       category_col='query', 
                                       parsed_col='parsed')
          .build()
          .remove_terms(ENGLISH_STOP_WORDS, ignore_absences=True)) # remove stop words in English  

In [0]:
# Terms that appear more frequently in the result corpus than in general English

list(corpus.get_scaled_f_scores_vs_background().index[:15])

['volpini',
 'undoundo',
 'wordlift',
 'boowa',
 'retweeted',
 'threadthanks',
 'tweets',
 'kwala',
 'twitter',
 'semrush',
 'redlink',
 'dubut',
 'cyberandy',
 'seocampus',
 'retweetedwordlift']

### Most frequent terms related to Jason Barnard 


In [0]:
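term_freq_df = corpus.get_term_freq_df() # separate name, so that the df_final data frame is not overwritten
term_freq_df['query'] = corpus.get_scaled_f_scores(uQuery_1) # scaled F-score of each term for the 1st query
term_freq_df.sort_values('query', ascending=False).iloc[:15]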

| term | Jason Barnard freq | Andrea Volpini freq | query |
|---|---|---|---|
| brand | 27 | 0 | 1.0 |
| ^ | 21 | 0 | 1.0 |
| marketing | 19 | 0 | 1.0 |
| boowa | 14 | 0 | 1.0 |
| kwala | 12 | 0 | 0.999997 |
| serp | 12 | 0 | 0.999997 |
| boowa & | 11 | 0 | 0.99998 |
| digital marketing | 10 | 0 | 0.999893 |
| brand serps | 9 | 0 | 0.999506 |
| & kwala | 9 | 0 | 0.999506 |


### Most frequent terms related to Andrea Volpini


In [0]:
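term_freq_df = corpus.get_term_freq_df() # separate name, so that the df_final data frame is not overwritten
term_freq_df['query'] = corpus.get_scaled_f_scores(uQuery_2) # scaled F-score of each term for the 2nd query
term_freq_df.sort_values('query', ascending=False).iloc[:15]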

| term | Jason Barnard freq | Andrea Volpini freq | query |
|---|---|---|---|
| 0 | 0 | 10 | 1.0 |
| references | 0 | 10 | 1.0 |
| undoundo andrea | 0 | 10 | 1.0 |
| entries | 0 | 9 | 0.999979 |
| semantic | 0 | 9 | 0.999979 |
| 0 references | 0 | 8 | 0.999835 |
| knowledge | 0 | 7 | 0.999039 |
| andrea | 2 | 45 | 0.995953 |
| focusing | 0 | 6 | 0.995558 |
| artificial | 0 | 6 | 0.995558 |


In [0]:
html = produce_scattertext_explorer(corpus,
                                    category='Jason Barnard',
                                    category_name='Jason Barnard',
                                    not_category_name='Andrea Volpini',
                                    width_in_pixels=900,
                                    minimum_term_frequency=2,
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
                                    #transform=st.Scalers.scale)
open("SERP-Visualization.html", 'wb').write(html.encode('utf-8'))
HTML(html)
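
The plot above scores terms with a log-odds ratio under an uninformative Dirichlet prior. As a rough illustration of what that kind of score measures, here is a pure-Python sketch of the formulation usually attributed to Monroe et al. (2008). It is not Scattertext's code: the function name, the prior value `alpha=0.01` and the choice of terms are assumptions made for this example, and the counts are a handful of values borrowed from the term-frequency output earlier in this notebook.

```python
import math

def log_odds_z_scores(counts_a, counts_b, alpha=0.01):
    """z-scored log-odds ratio per term, using the same small prior for every term."""
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    alpha_0 = alpha * len(vocab)  # total prior mass over the vocabulary
    z = {}
    for term in vocab:
        y_a = counts_a.get(term, 0) + alpha  # smoothed count in category A
        y_b = counts_b.get(term, 0) + alpha  # smoothed count in category B
        delta = (math.log(y_a / (n_a + alpha_0 - y_a))
                 - math.log(y_b / (n_b + alpha_0 - y_b)))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of delta
        z[term] = delta / math.sqrt(variance)
    return z

jason_counts = {"brand": 27, "marketing": 19, "andrea": 2}
andrea_counts = {"andrea": 45, "semantic": 9, "knowledge": 7}

z = log_odds_z_scores(jason_counts, andrea_counts)
for term, score in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} {score:+.2f}")
```

Positive scores lean towards the first category and negative scores towards the second; dividing by the standard deviation keeps rare terms from dominating the ranking.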

### Run this to compare the top 3 results for Jason Barnard with all his other results

Run the cell below if you are interested in comparing the results in positions 1, 2 and 3 with all the other results.

This is helpful to understand which terms you need in order to rank higher for that intent.

In [0]:
df_1['top_result'] = ['Yes' if x <= 3 else 'No' for x in df_1['position']] # top_result = 'Yes' when position <= 3

# Remove rows where result is empty
df_1['result'] = df_1['result'].replace(' ', np.nan)
df_1 = df_1.dropna(subset=['result'])

df_1['index'] = df_1.index

df_1.groupby('top_result').apply(lambda x: x.result.apply(lambda x: len(x.split())).sum())
df_1['parsed'] = df_1.result.apply(nlp)

# Turn it into a Scattertext corpus
corpus_1 = (st.CorpusFromParsedDocuments(df_1, 
                                       category_col='top_result', 
                                       parsed_col='parsed')
          .build()
          .remove_terms(ENGLISH_STOP_WORDS, ignore_absences=True)) # remove stop words in English 



In [0]:
html_1 = produce_scattertext_explorer(corpus_1,
                                    category='Yes',
                                    category_name='Yes',
                                    not_category_name='No',
                                    width_in_pixels=900,
                                    minimum_term_frequency=3,
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
open("SERP-Visualization_top3.html", 'wb').write(html_1.encode('utf-8')) # save the top-3 plot (html_1), not the earlier one
HTML(html_1)

## Summarise Results with BART

### Summarize all content related to Jason

In [0]:
# Keep only the articles that are longer than 300 characters
df_1 = df_1[df_1['result'].apply(lambda x: len(str(x))>300)]


# getting text ready by merging all pages together (no index)
full_body = df_1[['result']].agg(''.join, axis=1).to_string(index=False).strip()

print(full_body) 

with open('output.txt', 'w') as text_file:
    text_file.write(full_body)

#from google.colab import files
#files.download('output.txt')

The Brand SERP Guy\nWhy “The Brand SERP Guy”? Because I’ve been studying, tracking and analysing Brand SERPs (what appears when someone Googles your name) since 2014…\nConclusion : Brand SERPs are your new business card, a reflection of your brand’s digital ecosystem and an honest critique of your online marketing strategy.\nThat should be enough to pique the interest of any marketer and any brand… in any industry :)\n2 Decades in Digital Marketing\nI have over 2 decades of experience in digital marketing. I started promoting my first website in the year Google was incorporated and built it up to become one of the top 10,000 most visited sites in the world (60 million visits in 2007).\nIn 2020\nToday I’m a fulltime 100% digital nomad, host and keynote speaker at conferences around the world, whilst interviewing industry experts for his podcast – “With Jason Barnard… The smartest people in marketing talk to Jason about topics they know inside out. The conversations are always intelligen

In [0]:
# documentation for summarizer: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
summarizer = pipeline('summarization')

# use t5 instead
#summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="tf")


Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/facebook/bart-large-cnn/modelcard.json' to download model card file.
Creating an empty model card.


In [0]:
# change min_ and max_length for different output
summary_result = summarizer(full_body, min_length=150, max_length=300)
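
Since `full_body` merges several pages, it can be much longer than what a summarization model reads in one pass (BART-large-CNN is limited to roughly 1,024 tokens, as far as I know). One possible workaround, sketched here, is to split the text into word-based chunks first. The helper `chunk_words` and the 400-word limit are assumptions made for this example; words are only a rough proxy for model tokens.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

sample = "word " * 1000  # stand-in for a long merged text
chunks = chunk_words(sample, max_words=400)
print([len(chunk.split()) for chunk in chunks])  # [400, 400, 200]
```

Each chunk could then be passed to `summarizer` on its own and the partial summaries joined afterwards; whether that gives a better summary for this corpus has not been tested here.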

##### Machine-generated summary - Web Results

In [0]:
pp = pprint.PrettyPrinter(indent=14)
pp.pprint(summary_result[0]['summary_text'])

('Jason M. Barnard is a Search engine marketing consultant, musician, '
 'cartoon-maker and voice actor. Previously, with his wife Véronique, he '
 'created and voiced the cartoon characters Boowa & Kwala. He also played '
 'double bass and sang with The Barking Dogs, playing over 600 concerts '
 'throughout Europe between 1989 and 1996. He is a fulltime 100% digital '
 'nomad, host and keynote speaker at conferences around the world, whilst '
 'interviewing industry experts for his podcast – “With Jason Barnard… The '
 'smartest people in marketing talk to Jason about topics they know inside '
 'out”. He has over 2 decades of experience in digital marketing. He started '
 'promoting his first website in the year Google was incorporated and built it '
 'up to become one of the top 10,000 most visited sites in the world.')


## Q&A on Results with BERT


In [0]:
from transformers import pipeline

# Test the default model for QA (Bert large finetuned on SQuAD 1.0)
qa = pipeline('question-answering') # named `qa` so that it does not overwrite the spaCy `nlp` object

qa(question="What musical instrument does Jason Barnard play?",
   context=full_body)


{'answer': 'MusicBrainz\\n-',
 'end': 6786,
 'score': 0.9965339225109986,
 'start': 6772}
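
The `start` and `end` values in the result are character offsets into the context, so the text around an answer can be inspected to judge whether it makes sense. Below is a minimal sketch of that check; the helper `answer_in_context`, the sample sentence and the result dictionary (including its score) are hand-written for this example and are not real model output.

```python
def answer_in_context(context, result, width=40):
    """Return the answer span in brackets with up to `width` characters on each side."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - width):start]
    right = context[end:end + width]
    return f"{left}[{context[start:end]}]{right}"

context = "He also played double bass and sang with The Barking Dogs."
result = {"answer": "double bass", "start": 15, "end": 26, "score": 0.9}

print(answer_in_context(context, result, width=10))
```

In the output above the answer contains a literal `\n`, which suggests that the merged text carries escaped line breaks; looking at the surrounding characters in this way is a quick means of confirming it.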

## Credits 

* Google Search library by Mario Vilas
* Trafilatura by Adrien Barbaresi
* Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. Link to preprint: arxiv.org/abs/1703.00565
* BERT with a pre-trained model for question answering by the Google Research Team, made easy to use by @HuggingFace 🤗
* BART developed by Lewis et al. 2019 and made easy to use by @HuggingFace 🤗

