<a href="https://colab.research.google.com/github/simodepth/Entities/blob/main/Find_Entity_Opportunities_from_Outranking_pages_and_Compare_Entities_between_Web_pages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Run a Competitor Analysis by Entities with Google NLP


---

**Summary**

- Compare entities and their salience between two web pages
- Display entities that appear on the competitor page but are missing from yours


#Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Access to a Linux installation (I recommend Ubuntu) or Google Colab
- Google Cloud Platform account
- [NLP API Enabled](https://cloud.google.com/natural-language/docs)
- Credentials created (service account) and JSON file downloaded
- The NLP API JSON key file must be uploaded **every time you run this script**

#! Pip Install Missing Packages
- **fake_useragent**: for generating a user agent when making a request
- **pandas==1.1.2**: pins pandas to a specific release (this is not the newest version; the install log below shows it replacing pandas 1.3.5)

In [1]:
!pip install fake_useragent

!pip install pandas==1.1.2

Collecting fake_useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
Building wheels for collected packages: fake-useragent
  Building wheel for fake-useragent (setup.py) ... done
  Created wheel for fake-useragent: filename=fake_useragent-0.1.11-py3-none-any.whl size=13502 sha256=838c67a4d3b4a7dd75b46d8a916b93f2e4f0e8e0f59dac933d2a02dd6c0f9eee
  Stored in directory: /root/.cache/pip/wheels/ed/f7/62/50ab6c9a0b5567267ab76a9daa9d06315704209b2c5d032031
Successfully built fake-useragent
Installing collected packages: fake-useragent
Successfully installed fake-useragent-0.1.11
Collecting pandas==1.1.2
  Downloading pandas-1.1.2-cp37-cp37m-manylinux1_x86_64.whl (10.5 MB)
     |████████████████████████████████| 10.5 MB 21.5 MB/s
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 1.3.5
    Uninstalling pandas-1.3.5:
      Successfully uninstalled pandas-1.3.5
Successfully installed pandas-1.1.2


In [2]:
#@title Run Import Modules
import os
from google.cloud import language_v1
from google.cloud.language_v1 import enums

from google.cloud import language
from google.cloud.language import types

import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

from fake_useragent import UserAgent
import requests
import pandas as pd
import numpy as np

In [3]:
#@title Point the client at the service-account JSON key
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/nlp-api-348917-9095c7f4e634.json"
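
Because the key file has to be uploaded again for every session, it can help to confirm that it is really there before calling the API. The following is a minimal sketch, not part of the original notebook: `check_service_account_key` is a hypothetical helper name, and it only assumes that a service-account key is a JSON file carrying the `type` and `project_id` fields.

```python
import json
import os


def check_service_account_key(path):
    """Return the project_id stored in a service-account key file.

    Hypothetical helper: raises a clear error when the file is missing
    or does not look like a service-account key.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Key file not found: {path}. Upload it to the runtime first."
        )
    with open(path, encoding="utf-8") as fh:
        key = json.load(fh)
    if key.get("type") != "service_account":
        raise ValueError("The JSON file does not look like a service-account key.")
    return key.get("project_id")


# Example usage (assumes the environment variable was set as in the cell above):
# check_service_account_key(os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
```

Failing early with a readable message is usually easier to debug than the authentication error the client raises later.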


#Build NLP Function
Because both pages go through the same process, we can wrap it in a function and avoid redundant code. The function **processhtml()**, shown in the code below, will:

1. Create a new user agent for the request header
2. Make the request to the web page and store the HTML content
3. Initialize the Google NLP
4. Communicate to Google that you are sending them HTML, rather than plain text
5. Send the request to Google NLP
6. Store the JSON response
7. Convert the JSON into a Python dictionary with the entities and salience scores (adjust rounding as needed)
8. Convert the keys to lower case (for comparing)
9. Return the new dictionary to the main script


In [4]:
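def processhtml(url):

    # 1. Create a new user agent for the request header
    ua = UserAgent()
    headers = {'User-Agent': ua.chrome}

    # 2. Request the web page and store the HTML content
    res = requests.get(url, headers=headers)
    html_page = res.text

    url_dict = {}

    # 3. Initialize the Google NLP client
    client = language_v1.LanguageServiceClient()

    # 4. Tell Google the content is HTML rather than plain text
    type_ = enums.Document.Type.HTML

    language = "en"
    document = {"content": html_page, "type": type_, "language": language}

    encoding_type = enums.EncodingType.UTF8

    # 5-6. Send the request to Google NLP and store the response
    response = client.analyze_entities(document, encoding_type=encoding_type)

    # 7. Map each entity name to its salience score (adjust rounding as needed)
    for entity in response.entities:
        url_dict[entity.name] = round(entity.salience, 4)

    # 8. Lower-case the keys so the two pages can be compared
    url_dict = {k.lower(): v for k, v in url_dict.items()}

    # 9. Return the new dictionary to the main script
    return url_dict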
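
The function above relies on the `enums` module, which exists only in `google-cloud-language` releases before 2.0. As a hedged sketch, assuming a 2.x release of the client library is installed, the same request could look roughly like this. `entities_to_dict` and `processhtml_v2` are hypothetical names introduced here, the fake user agent is left out for brevity, and the `timeout` parameter is an addition of this sketch rather than part of the original notebook.

```python
def entities_to_dict(entities, ndigits=4):
    """Map lower-cased entity names to rounded salience scores.

    Works on any iterable of objects with `name` and `salience` attributes.
    If two names differ only by case, the later entry overwrites the earlier one.
    """
    return {
        entity.name.lower(): round(entity.salience, ndigits)
        for entity in entities
    }


def processhtml_v2(url, timeout=30):
    """Fetch a page and return its entities, assuming google-cloud-language 2.x."""
    # Imported lazily so entities_to_dict can be used without these packages.
    import requests
    from google.cloud import language_v1

    html_page = requests.get(url, timeout=timeout).text

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=html_page,
        type_=language_v1.Document.Type.HTML,  # HTML rather than plain text
        language="en",
    )
    response = client.analyze_entities(
        request={
            "document": document,
            "encoding_type": language_v1.EncodingType.UTF8,
        }
    )
    return entities_to_dict(response.entities)
```

Keeping the response handling in its own small function makes that step easy to check without credentials or network access.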

#Process NLP Data and Calculate Salience Difference
Now that we have our function we can set the variables storing the web page URLs we want to compare and then send them to the function we just made.

In [8]:
url1 = "https://fusionunlimited.co.uk/about-us/" 
url2 = "https://www.twentysixdigital.com/our-services/" 

url1_dict = processhtml(url1)
url2_dict = processhtml(url2)

In [12]:
#@title Compare Entities between 2 Webpages
rows = []

# Only entities present on both pages can be compared.
for key in set(url1_dict) & set(url2_dict):
    url1_keywordnum = url1_dict[key]
    url2_keywordnum = url2_dict[key]

    # Compare the scores as numbers (not strings) and record the gap
    # only when the competitor page (URL 2) scores higher.
    if url2_keywordnum > url1_keywordnum:
        diff = round(url2_keywordnum - url1_keywordnum, 3)
    else:
        diff = 0.0

    rows.append({'URL 1': url1_keywordnum, 'URL 2': url2_keywordnum, 'Difference': diff, 'Keyword': key})

# Build the DataFrame once instead of appending row by row.
df = pd.DataFrame(rows, columns=['URL 1', 'URL 2', 'Difference', 'Keyword'])

print(df.sort_values(by='Difference', ascending=False))

     URL 1   URL 2 Difference      Keyword
11  0.0054  0.1121      0.107      clients
18  0.0053  0.0113      0.006  strategists
32  0.0023  0.0052      0.003     strategy
12  0.0016  0.0041      0.003        brand
31  0.0016  0.0036      0.002     audience
20  0.0013  0.0019      0.001        leeds
27  0.0037  0.0042        0.0     approach
16   0.005  0.0052        0.0  performance
2   0.0052  0.0055        0.0      experts
8   0.0028   0.002          0          ppc
21  0.0017  0.0015          0        touch
30  0.0042  0.0025          0         some
29  0.0043  0.0041          0         site
28  0.0183  0.0144          0      careers
3   0.0049  0.0034          0   affiliates
26  0.0177  0.0019          0   experience
25  0.0245  0.0178          0     services
24  0.0013   0.001          0    instagram
23  0.0184  0.0144          0         blog
22  0.0043  0.0034          0        blend
4   0.0246  0.0163          0      contact
9   0.0013   0.001          0     facebook
19  0.0013 

#*clients, strategists, strategy, brand, audience, leeds*
These entities appear on both pages, but Google NLP rates them as more important (relative to the whole text) on the competitor page. **These are keywords you may want to investigate and find ways to communicate better on your page.**

---



📔 The URL 1 (benchmark) and URL 2 (competitor) columns contain the **salience scores** for each entity on that URL.
If your competitor's salience score for a keyword is greater than yours, the difference is recorded; otherwise the difference is 0.
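
As a small worked example of that rule, here is a sketch with illustrative scores. The `clients` and `ppc` values are borrowed from the table above, the other entries are made up, and `salience_gaps` is a hypothetical helper name.

```python
def salience_gaps(benchmark, competitor, ndigits=3):
    """For entities on both pages, return how far the competitor leads.

    The gap is competitor minus benchmark when the competitor scores higher,
    and 0.0 otherwise. Entities found on only one page are ignored.
    """
    gaps = {}
    for key in set(benchmark) & set(competitor):
        lead = competitor[key] - benchmark[key]
        gaps[key] = round(lead, ndigits) if lead > 0 else 0.0
    return gaps


# Illustrative scores, not a real API response.
benchmark = {"clients": 0.0054, "ppc": 0.0028, "blog": 0.0184}
competitor = {"clients": 0.1121, "ppc": 0.002, "careers": 0.0144}

gaps = salience_gaps(benchmark, competitor)
print(sorted(gaps.items()))  # [('clients', 0.107), ('ppc', 0.0)]
```

`blog` and `careers` drop out because each appears on only one page, and `ppc` gets 0.0 because the benchmark page already scores higher.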


---


❗ **"Salience score"** measures how important an entity is relative to the rest of the text, on a scale from 0 to 1.

In [14]:
#@title ⭐️ Find Entity Opportunities from Outranking pages ⭐️
# Entities found on the competitor page (URL 2) but not on your page (URL 1).
diff_lists = set(url2_dict) - set(url1_dict)

# Keep the competitor's salience score for each missing entity.
final_diff = {key: url2_dict[key] for key in diff_lists}

df = pd.DataFrame(list(final_diff.items()), columns=['Keyword','Score'])

# Sort first, then keep the top 25, so the list holds the 25 most salient entities.
print(df.sort_values(by='Score', ascending=False).head(25))

                                              Keyword   Score
19                                          twentysix  0.0107
4                               performance marketing  0.0093
6                                           decisions  0.0063
13                                        researchers  0.0061
1                                 website development  0.0052
11                                   project managers  0.0037
15                                           opinions  0.0025
5                                              trends  0.0025
8                                                 all  0.0025
14                                              minds  0.0025
20                                        innovations  0.0025
3                                    sovereign street  0.0019
23                                          marketing  0.0019
10                                     privacy policy  0.0018
16                                            ls1 4ba  0.0015
18      

This list shows the **top 25 entities by salience that are on the competitor page but DO NOT appear on your page**.

This helps you find entity opportunities: entities that pages outranking you use but you do not.
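
The same idea can be sketched on plain dictionaries, without pandas. The scores below are illustrative values borrowed from the tables above, and `entity_opportunities` is a hypothetical helper name.

```python
def entity_opportunities(benchmark, competitor, top_n=25):
    """Return up to top_n (entity, score) pairs, most salient first.

    Only entities that are on the competitor page and absent from the
    benchmark page are included.
    """
    missing = set(competitor) - set(benchmark)
    ranked = sorted(
        ((key, competitor[key]) for key in missing),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_n]


# Illustrative scores, not a real API response.
benchmark = {"services": 0.0245, "blog": 0.0184}
competitor = {
    "services": 0.0178,
    "twentysix": 0.0107,
    "performance marketing": 0.0093,
    "decisions": 0.0063,
}

print(entity_opportunities(benchmark, competitor, top_n=2))
# [('twentysix', 0.0107), ('performance marketing', 0.0093)]
```

Sorting before truncating matters here: cutting the list first would keep an arbitrary subset rather than the most salient entities.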

---


**⚠️ Entity opportunities stem from the two-fold comparison above**