<a href="https://colab.research.google.com/github/simodepth/Entities/blob/main/Benchmark_Entity_Opportunities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Run a Competitor Analysis by Entities with Google NLP


---

**Summary**

- Compare entities and their salience between two web pages
- Display missing entities between two pages


#Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Run on Google Colab
- Google Cloud Platform account
- [NLP API Enabled](https://cloud.google.com/natural-language/docs)
- Credentials created (service account) and JSON file downloaded
- The NLP API JSON key file is uploaded **every time you run this script**, because Colab does not keep uploaded files between sessions

#! Pip Install Missing Packages
- **fake_useragent**: for generating a user agent when making a request
- **pandas==1.1.2**: the pandas release this notebook was written against (a pinned version, not the newest one)

In [1]:
!pip install fake_useragent

!pip install pandas==1.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting fake_useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
Building wheels for collected packages: fake-useragent
  Building wheel for fake-useragent (setup.py) ... done
  Created wheel for fake-useragent: filename=fake_useragent-0.1.11-py3-none-any.whl size=13502 sha256=70f8c6cf3b1ae38bea8aa2ca46471a091af3de077ce9e7e865ffa839f1193701
  Stored in directory: /root/.cache/pip/wheels/ed/f7/62/50ab6c9a0b5567267ab76a9daa9d06315704209b2c5d032031
Successfully built fake-useragent
Installing collected packages: fake-useragent
Successfully installed fake-useragent-0.1.11
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pandas==1.1.2
  Downloading pandas-1.1.2-cp37-cp37m-manylinux1_x86_64.whl (10.5 MB)
     |████████████████████████████████| 10.5 MB 6.9 MB/s 
Installing collected packages: pandas
  Attempt

In [None]:
#@title Run Import Modules
import os

# The enums module is only available in google-cloud-language releases before 2.0
from google.cloud import language_v1
from google.cloud.language_v1 import enums

# Used to fetch the pages and tabulate the results
from fake_useragent import UserAgent
import requests
import pandas as pd

In [None]:
#@title Point the Google client at the JSON key file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/nlp-api-348917-9095c7f4e634.json"
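
Because the key file has to be uploaded again in every Colab session, a missing file is an easy mistake to make. A minimal sketch of a guard that fails early with a readable message; `set_credentials` is a hypothetical helper name, not part of any Google library, and the path is whatever you uploaded:

```python
import os


def set_credentials(key_path):
    """Point the Google client libraries at a service-account key file.

    Hypothetical helper: it only checks that the file exists and then sets
    the GOOGLE_APPLICATION_CREDENTIALS environment variable.
    """
    if not os.path.isfile(key_path):
        raise FileNotFoundError(
            f"Key file not found: {key_path}. Upload it to the session first."
        )
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = key_path
    return key_path
```

Calling it with the path used above would replace the bare `os.environ[...]` assignment and surface a forgotten upload before any request is sent.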


#Build NLP Function
Because both pages go through the same process, we can wrap it in a function and avoid redundant code. The function **processhtml()**, shown in the code below, will:

1. Create a new user agent for the request header
2. Make the request to the web page and store the HTML content
3. Initialize the Google NLP client
4. Tell Google that you are sending HTML rather than plain text
5. Send the request to Google NLP
6. Store the response
7. Convert the response into a Python dictionary of entities and salience scores (adjust rounding as needed)
8. Convert the keys to lower case (so the two pages can be compared)
9. Return the new dictionary to the main script


In [None]:
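def processhtml(url):
    """Fetch a page and return {lower-cased entity name: salience} from Google NLP."""
    # 1-2. Request the page with a browser-like user agent and keep the HTML
    ua = UserAgent()
    headers = {'User-Agent': ua.chrome}
    res = requests.get(url, headers=headers, timeout=30)
    res.raise_for_status()  # stop on 4xx/5xx instead of analysing an error page
    html_page = res.text

    # 3. Initialize the Google NLP client (reads GOOGLE_APPLICATION_CREDENTIALS)
    client = language_v1.LanguageServiceClient()

    # 4. Tell Google the content is HTML, not plain text
    type_ = enums.Document.Type.HTML
    language = "en"
    document = {"content": html_page, "type": type_, "language": language}
    encoding_type = enums.EncodingType.UTF8

    # 5-6. Send the request and keep the response
    response = client.analyze_entities(document, encoding_type=encoding_type)

    # 7-8. Map each lower-cased entity name to its rounded salience score
    url_dict = {}
    for entity in response.entities:
        url_dict[entity.name.lower()] = round(entity.salience, 4)

    # 9. Return the dictionary to the main script
    return url_dict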
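
One side effect of step 8 is worth knowing: because keys are lower-cased, entities that differ only by case (for example `SEO` and `seo`) collapse into a single key, and the one seen last wins. A minimal sketch of an alternative that keeps the highest salience instead; `merge_entities` and the sample pairs are hypothetical, and the function works on plain `(name, salience)` tuples rather than on the API response:

```python
def merge_entities(pairs, ndigits=4):
    """Lower-case entity names and keep the highest salience per name."""
    merged = {}
    for name, salience in pairs:
        key = name.lower()
        score = round(salience, ndigits)
        # Keep the larger score when two names collide after lower-casing
        if score > merged.get(key, -1.0):
            merged[key] = score
    return merged


# Hypothetical sample: "SEO" and "seo" collide once lower-cased
sample = [("SEO", 0.0123), ("seo", 0.0456), ("Strategy", 0.02)]
print(merge_entities(sample))  # {'seo': 0.0456, 'strategy': 0.02}
```

Whether to keep the maximum, the sum, or the last value is a design choice; the maximum is used here only because it is the easiest to reason about.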

#Process NLP Data and Calculate Salience Difference
Now that we have our function we can set the variables storing the web page URLs we want to compare and then send them to the function we have just created.

In [None]:
url1 = "https://fusionunlimited.co.uk/about-us/" #@param {type:"string"}
url2 = "https://wolfenden.agency/about-us/" #@param {type:"string"} 

url1_dict = processhtml(url1)
url2_dict = processhtml(url2)

Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.11
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/fake_useragent/utils.py", line 154, in load
    for item in get_browsers(verify_ssl=verify_ssl):
  File "/usr/local/lib/python3.7/dist-packages/fake_useragent/utils.py", line 99, in get_browsers
    html = html.split('<table class="w3-table-all notranslate">')[1]
IndexError: list index out of range
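
The traceback above was logged while `fake_useragent` tried to load its browser data and switched to its cache server. A minimal sketch of a guard that falls back to a fixed user agent string if the lookup raises; `pick_user_agent`, its `factory` parameter, and the fallback string are all hypothetical and not part of `fake_useragent`:

```python
# Hypothetical fallback value; any current browser user agent string would do
FALLBACK_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/100.0.0.0 Safari/537.36"
)


def _fake_useragent_chrome():
    # Imported lazily so that a missing package is handled like any other failure
    from fake_useragent import UserAgent
    return UserAgent().chrome


def pick_user_agent(factory=_fake_useragent_chrome, fallback=FALLBACK_UA):
    """Return factory() if it succeeds, otherwise the fallback string."""
    try:
        return factory()
    except Exception:
        return fallback
```

Inside `processhtml()`, `headers = {'User-Agent': pick_user_agent()}` would then keep the request working even when the user agent lookup fails.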


In [None]:
#@title Compare Entities between 2 Webpages 
rows = []

# Only entities found on both pages can be compared
for key in set(url1_dict) & set(url2_dict):
    url1_score = url1_dict[key]
    url2_score = url2_dict[key]

    # Compare as numbers (not strings); record the gap only when URL 2 scores higher
    if url2_score > url1_score:
        diff = round(url2_score - url1_score, 3)
    else:
        diff = 0.0

    rows.append({'Keyword': key, 'URL 1': url1_score, 'URL 2': url2_score, 'Difference': diff})

# Build the frame once; numeric columns sort correctly
df = pd.DataFrame(rows, columns=['URL 1', 'URL 2', 'Difference', 'Keyword'])

print(df.sort_values(by='Difference', ascending=False))

     URL 1   URL 2 Difference        Keyword
7   0.0023  0.0244      0.022       strategy
11  0.0028  0.0121      0.009            roi
10  0.0063  0.0092      0.003          teams
2   0.0184  0.0192      0.001           work
0   0.0069  0.0022          0         search
17  0.0714  0.0011          0        website
27     0.0     0.0          0           2020
26  0.0043  0.0013          0             pr
25     0.0     0.0          0           2022
24  0.0054  0.0025          0        clients
23  0.0187  0.0014          0  cookie policy
22  0.0184  0.0068          0           home
21   0.003  0.0023          0           team
20  0.0183  0.0012          0        careers
19  0.0023  0.0004          0       linkedin
18  0.0019  0.0011          0        content
14   0.005  0.0005          0    performance
16  0.0017  0.0005          0          touch
15  0.0177  0.0019          0     experience
1   0.0013  0.0004          0        twitter
13  0.0245  0.0013          0       services
12  0.0058

**Strategy** is an entity found on both pages, and Google NLP scores it as more important, relative to the rest of the text, on the competitor page (0.0244) than on yours (0.0023).

**This is a keyword you may want to investigate, and you may want to consider ways to communicate it better on your page.**

---



📔 The **URL 1** (benchmark) and **URL 2** (competitor) columns hold each page's **salience score** for the entity.
When the competitor's score for a keyword is greater than yours, the **Difference** column records the gap; otherwise it is 0.


---


❗ **"Salience score"** is a metric of calculated importance in relation to the rest of the text.
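
The rule described above can be checked by hand on a few rows. A minimal sketch in plain Python, using three shared entities from the output table plus `skin`, which only the competitor page has; `salience_gaps` is a hypothetical helper, not part of the notebook:

```python
def salience_gaps(yours, competitor, ndigits=3):
    """For entities on both pages, return the competitor's lead (0.0 if none)."""
    gaps = {}
    for entity in set(yours) & set(competitor):
        gap = competitor[entity] - yours[entity]
        gaps[entity] = round(gap, ndigits) if gap > 0 else 0.0
    return gaps


yours = {"strategy": 0.0023, "roi": 0.0028, "website": 0.0714}
competitor = {"strategy": 0.0244, "roi": 0.0121, "website": 0.0011, "skin": 0.0059}

gaps = salience_gaps(yours, competitor)
print(gaps["strategy"], gaps["roi"], gaps["website"])  # 0.022 0.009 0.0
```

`skin` does not show up in the result because it is missing from your page; entities like that are handled by the next step.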

In [None]:
#@title ⭐️ Find Entity Opportunities from Outranking pages ⭐️
# Entities on the competitor page (URL 2) that are absent from your page (URL 1)
final_diff = {key: url2_dict[key] for key in set(url2_dict) - set(url1_dict)}

df = pd.DataFrame(list(final_diff.items()), columns=['Keyword', 'Score'])

# Sort first, then keep the 25 highest-salience entities
print(df.sort_values(by='Score', ascending=False).head(25))

                 Keyword   Score
1                   skin  0.0059
8              visibilis  0.0057
6                 legacy  0.0037
17          pr executive  0.0027
10           opportunity  0.0027
11     account executive  0.0027
18       finance manager  0.0025
16  social media manager  0.0025
13    marketing director  0.0024
3     insight strategist  0.0023
2              marketing  0.0016
15     marketing cookies  0.0015
4                resolve  0.0015
19                 staff  0.0015
9               insights  0.0013
22       cystic fibrosis  0.0013
20        matthew larkin  0.0009
21           emma barnes  0.0009
0            tom corless  0.0009
12      sophie madgewick  0.0009
14     rhea jasmin zakir  0.0009
7            kim rushton  0.0009
5         stefano bianco  0.0009
24                series  0.0007
23                  2011  0.0000


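This list shows **up to 25 entities, ranked by salience, that Google NLP detected on the competitor page BUT NOT on your page**.

This is useful for finding entity opportunities, because it highlights topics your competitor covers and you do not. A high salience score does not by itself prove that an entity is why the page outranks yours.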
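
The same idea in plain Python, as a minimal sketch: take the entities that only the competitor has, sort them by salience, and keep the first few. `entity_opportunities` is a hypothetical helper, and the sample scores are taken from the tables above:

```python
def entity_opportunities(yours, competitor, top_n=25):
    """Return (entity, salience) pairs found only on the competitor page."""
    missing = {k: v for k, v in competitor.items() if k not in yours}
    # Sort by salience, highest first, before cutting to top_n
    return sorted(missing.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


yours = {"strategy": 0.0023}
competitor = {"strategy": 0.0244, "skin": 0.0059, "legacy": 0.0037, "visibilis": 0.0057}

print(entity_opportunities(yours, competitor, top_n=2))
# [('skin', 0.0059), ('visibilis', 0.0057)]
```

Sorting before truncating matters: cutting to 25 rows first and sorting afterwards would rank an arbitrary 25 entities rather than the 25 with the highest salience.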

---


**⚠️ Entity opportunities stem from the previous comparison**