<a href="https://colab.research.google.com/github/simodepth/Entities/blob/main/Topical_coverage_and_Entity_Calculator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Keyword Density Calculator

---

For years, SEOs have clamored that keyword density is dead.

For keyword ranking, perhaps, but it is still a useful tool for determining a web page’s identity.

If your top ten most used keywords don’t align with your target identity then something is off. For pages I’ve never seen before, I’ll calculate the top ten most frequently used words and if I can’t quickly determine what the page is about, something is wrong.

#Requirements & Assumptions


---

- A [Google Knowledge Graph Search API](https://console.developers.google.com/apis/dashboard) key
- Either a list of URLs or an XLSX/CSV file with high-traffic landing pages, which you can export from the Performance tab in GSC


In [None]:
!pip install fake_useragent 
!pip install bs4 



In [None]:
import requests
from bs4 import BeautifulSoup
from collections import Counter #this is to count the number of each word
import pandas as pd 
import time #to delay scripts to prevent bottlenecks with the server
import io
import json
from fake_useragent import UserAgent
from google.colab import files
import numpy as np

#Load the URLs to Scan


---
**Choose ONE of the following**
- Load from a List
- Load from a local XLSX file

In [None]:
#@title Load Local XLSX
crawldf = pd.read_excel('/content/https___seodepths.com_-Performance-on-Search-2022-07-27.xlsx') #@param {type:"string"} 
addresses = crawldf['Address'].tolist()
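The cell above reads an XLSX export with `pd.read_excel`. If the export is a CSV instead, the same column can be read with Python's standard `csv` module. This is a minimal sketch, assuming the file has a header row with an `Address` column as in the cell above; `load_addresses` and the sample data are hypothetical names used only for illustration.

```python
import csv
import io

def load_addresses(csv_file, column="Address"):
    """Return the non-empty values of `column` from an open CSV file."""
    addresses = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            addresses.append(value)
    return addresses

# Hypothetical sample standing in for a real export
sample = io.StringIO(
    "Address,Clicks\n"
    "https://example.com/a,10\n"
    "https://example.com/b,7\n"
    ",3\n"
)
addresses_from_csv = load_addresses(sample)
print(addresses_from_csv)
```

With a real file, the function can be called on a handle opened with `open(path, newline="")`. In pandas, `pd.read_csv` plays the same role as `pd.read_excel`.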

In [None]:
#@title Load from a List (OPTIONAL)
addresses = ['https://www.johnniewalker.com/en-gb/our-whisky-gifts/engraving-tool/', 'https://www.johnniewalker.com/en-gb/whisky-cocktails/highball-cocktails/', 'https://www.johnniewalker.com/en-gb/whisky-guide/how-to-drink-whisky/', 'https://www.johnniewalker.com/en-gb/whisky-guide/how-whisky-is-made/','https://www.johnniewalker.com/en-gb/whisky-guide/types-of-whisky/','https://www.johnniewalker.com/en-gb/whisky-guide/the-history-of-whisky/','https://www.johnniewalker.com/en-gb/whisky-guide/the-johnnie-walker-story/']

###Make sure to run only one option from above to avoid confusing the crawler

#Set-Up the HTTP Request User Agent


---

`fake_useragent` generates a fake user agent string that we send with each web page request. Because the script is for personal use, a fake one will do.

In [None]:
ua = UserAgent()
 
headers = {
    'User-Agent': ua.chrome
}
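In the cell above, `ua.chrome` is read once, so every request reuses the same string. The sketch below wraps the header creation in a function that can be called per request and falls back to a fixed string if `fake_useragent` is missing or cannot load its data. `build_headers` and `FALLBACK_UA` are hypothetical names, and the fallback string is only a placeholder.

```python
# Placeholder user agent used when fake_useragent is unavailable (assumption)
FALLBACK_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def build_headers():
    """Return request headers with a Chrome-like User-Agent string."""
    try:
        from fake_useragent import UserAgent
        user_agent = UserAgent().chrome
    except Exception:  # package not installed or its data could not be loaded
        user_agent = FALLBACK_UA
    return {"User-Agent": user_agent}

example_headers = build_headers()
print(example_headers)
```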

#Call the Knowledge API Key with a Function


---
For the `url` variable, make sure to replace the `key` parameter with your API key.


In [None]:
def gkbAPI(keyword):
    # Replace the value of the `key` parameter with your own API key
    url = "https://kgsearch.googleapis.com/v1/entities:search?key=AIzaSyAzY_QmeuXffwF2FtWvi_cQf8LIzIys0X0&limit=1&indent=True"

    # Pass the keyword through `params` so that requests URL-encodes it
    params = {'query': keyword}

    response = requests.get(url, params=params) #this one makes the call and stores the response

    data = json.loads(response.text)

    try:
        getlabel = data['itemListElement'][0]['result']['@type']
    except (KeyError, IndexError): #no entity found, or the API returned an error
        getlabel = ["none"]
    return getlabel
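To make the two halves of `gkbAPI()` easier to check without calling the API, the URL building and the response parsing can be sketched separately. This is an illustration only: `build_kg_url`, `extract_types` and the sample response are hypothetical, and the sample is shaped after the fields that `gkbAPI()` reads rather than copied from a real response.

```python
from urllib.parse import urlencode

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"

def build_kg_url(keyword, api_key, limit=1):
    """Build the request URL, percent-encoding the keyword."""
    params = {"query": keyword, "key": api_key, "limit": limit, "indent": True}
    return KG_ENDPOINT + "?" + urlencode(params)

def extract_types(data):
    """Return the '@type' value of the first result, or ['none'] if absent."""
    try:
        return data["itemListElement"][0]["result"]["@type"]
    except (KeyError, IndexError, TypeError):
        return ["none"]

# Hypothetical response containing only the fields gkbAPI() reads
sample_response = {"itemListElement": [{"result": {"@type": ["Brand", "Thing"]}}]}

print(build_kg_url("red & black", "YOUR_API_KEY"))
print(extract_types(sample_response))
print(extract_types({"itemListElement": []}))
```

Encoding matters because scraped keywords can contain characters such as `&` or spaces, which would otherwise change the meaning of the query string.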

#Scrape the Webpages with Requests


---

- Create an empty list variable that we'll use to store the site-wide data

- Then we start our `for` loop of the URLs in the addresses list

In [None]:
fulllist = []
 
for row in addresses:
    time.sleep(1)
    url = row
    print(url)
 
    res = requests.get(url,headers=headers)
    html_page = res.content


https://www.johnniewalker.com/en-gb/our-whisky-gifts/engraving-tool/
https://www.johnniewalker.com/en-gb/whisky-cocktails/highball-cocktails/
https://www.johnniewalker.com/en-gb/whisky-guide/how-to-drink-whisky/
https://www.johnniewalker.com/en-gb/whisky-guide/how-whisky-is-made/
https://www.johnniewalker.com/en-gb/whisky-guide/types-of-whisky/
https://www.johnniewalker.com/en-gb/whisky-guide/the-history-of-whisky/
https://www.johnniewalker.com/en-gb/whisky-guide/the-johnnie-walker-story/
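Note that the loop above reassigns `html_page` on every iteration, so only the last page's HTML is left once the loop ends. One way to keep every page is to store the responses in a dictionary keyed by URL. The sketch below assumes a hypothetical `fetch_pages` helper that receives the download function as an argument, so it can be tried with a stand-in fetcher and no network access.

```python
import time

def fetch_pages(urls, fetch, delay=1.0):
    """Fetch every URL with `fetch(url)` and return {url: html}.

    Failed requests are stored as None instead of stopping the crawl.
    """
    pages = {}
    for i, url in enumerate(urls):
        if i:  # pause between requests, not before the first one
            time.sleep(delay)
        try:
            pages[url] = fetch(url)
        except Exception as exc:  # e.g. timeouts or connection errors
            print(f"Skipping {url}: {exc}")
            pages[url] = None
    return pages

# Stand-in fetcher so the sketch runs offline (hypothetical)
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("simulated failure")
    return f"<html><body><p>Content of {url}</p></body></html>"

pages = fetch_pages(
    ["https://example.com/a", "https://example.com/broken"],
    fake_fetch,
    delay=0,
)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, headers=headers, timeout=10).content`.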


#Parse the HTML of Each Page


---

Since we have the page contents, we can load them into a BS4 object that we'll name **soup**.

The `find_all()` function with the `text=True` parameter will extract only the text between the HTML tags.

In [None]:
soup = BeautifulSoup(html_page, 'html.parser')
text = soup.find_all(text=True) #scrape the text within the HTML from the above URLs

#Data Cleaning


---
- Remove stopwords, i.e. the pronouns, articles and other common words we don't want to count
- Filter out non-relevant HTML tags
- Filter out Special Characters


In [None]:
#@title Remove Stopwords
stopwords = ['get','ourselves', 'hers','us','there','you','for','that','as','between', 'yourself', 'but', 'again', 'there', 'about', 'once', 'during', 'out', 'very', 'having', 'with', 'they', 'own', 'an', 'be', 'some', 'for', 'do', 'its', 'yours', 'such', 'into', 'of', 'most', 'itself', 'other', 'off', 'is', 's', 'am', 'or', 'who', 'as', 'from', 'him', 'each', 'the', 'themselves', 'until', 'below', 'are', 'we', 'these', 'your', 'his', 'through', 'don', 'nor', 'me', 'were', 'her', 'more', 'himself', 'this', 'down', 'should', 'our', 'their', 'while', 'above', 'both', 'up', 'to', 'ours', 'had', 'she', 'all', 'no', 'when', 'at', 'any', 'before', 'them', 'same', 'and', 'been', 'have', 'in', 'will', 'on', 'does', 'yourselves', 'then', 'that', 'because', 'what', 'over', 'why', 'so', 'can', 'did', 'not', 'now', 'under', 'he', 'you', 'herself', 'has', 'just', 'where', 'too', 'only', 'myself', 'which', 'those', 'i', 'after', 'few', 'whom', 't', 'being', 'if', 'theirs', 'my', 'against', 'a', 'by', 'doing', 'it', 'how', 'further', 'was', 'here', 'than','its','(en)']


In [None]:
#@title Filter out non relevant HTML tags
output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script',
    'style',
    'en',
]


In [None]:
#@title Filter out Special Characters
ban_chars = ['|','/','&','()']

#Merge Keywords into a String


---
Time to merge the text scraped from the web page into one giant string.

Once we have our long string of text, we split it on spaces to create a list of words.


In [None]:
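for t in text:
    if t.parent.name not in blacklist:
        #swap line breaks and tabs for a space and add a space after each text node, so words don't get glued together
        output += t.replace("\n"," ").replace("\t"," ") + " "
output = output.split(" ")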
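The same idea (keep the text nodes whose direct parent tag is not blacklisted, then split the result into words) can be sketched with the standard library's `html.parser`. This is a swapped-in technique for illustration, not the notebook's BeautifulSoup approach; `VisibleTextParser`, `SKIP_PARENTS`, `VOID_TAGS`, `page_words` and the sample HTML are hypothetical. Text nodes are joined with a space so that words from adjacent elements stay separate.

```python
from html.parser import HTMLParser

SKIP_PARENTS = {"noscript", "header", "html", "head", "script", "style"}
# Void elements never get a closing tag, so they are not tracked as parents
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class VisibleTextParser(HTMLParser):
    """Collect text nodes whose direct parent tag is not in SKIP_PARENTS."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.chunks = []     # text nodes that were kept

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.open_tags[-1] if self.open_tags else "[document]"
        if parent not in SKIP_PARENTS and data.strip():
            self.chunks.append(data.strip())

def page_words(html):
    """Return the visible words of an HTML string as a list."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()

sample_html = (
    "<html><head><title>Whisky Guide</title><style>p{color:red}</style></head>"
    "<body><h1>How whisky is made</h1><script>var x = 1;</script>"
    "<p>Malted barley<br>and water</p></body></html>"
)
print(page_words(sample_html))
```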

#Apply the Filters previously declared for Data Cleaning

In [None]:
output = [x for x in output if not x=='' and not x[0] =='#' and x not in ban_chars] 
output = [x.lower() for x in output]
output = [word for word in output if word not in stopwords]
 
fulllist += output
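If the cleaning has to run for more than one page, the three filters can be wrapped in a function. The sketch below applies them in the same order as the cell above; `clean_words` and the sample input are hypothetical names, and the stopword list is shortened for readability.

```python
def clean_words(words, stopwords, ban_chars):
    """Drop empty strings, '#' tokens, banned characters and stopwords."""
    stop = set(stopwords)    # set lookups are faster than list lookups
    banned = set(ban_chars)
    kept = [w for w in words if w and not w.startswith("#") and w not in banned]
    return [w.lower() for w in kept if w.lower() not in stop]

sample_words = ["The", "Whisky", "|", "", "#menu", "guide", "to", "whisky"]
cleaned = clean_words(sample_words, ["the", "to"], ["|", "/", "&", "()"])
print(cleaned)
```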

#Get the Top 10 Keywords Count


---

Here is where the `collections` module comes into play. We pass our list of words to the `Counter()` function and keep only the most common ones, ideally **10-20**.

This keeps the output as manageable as possible.

In [None]:
counts = Counter(output).most_common(10)
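`most_common()` returns raw counts. If a relative figure is more useful, each count can be divided by the number of words left after cleaning. This is a minimal sketch, assuming that "density" means the share of the cleaned word list; `keyword_density` and the sample words are hypothetical.

```python
from collections import Counter

def keyword_density(words, top_n=10):
    """Return (word, count, share of all words in %) for the top_n words."""
    total = len(words)
    return [
        (word, count, round(100 * count / total, 1))
        for word, count in Counter(words).most_common(top_n)
    ]

sample_words = ["whisky"] * 5 + ["walker"] * 3 + ["label"] * 2
for word, count, pct in keyword_density(sample_words, top_n=3):
    print(word, count, pct)
```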

#Display the Top 10 N-Grams for the Page 

In [None]:
all_term_data = []
for key, value in counts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide(axis="index") #Styler.hide_index() was removed in pandas 2.0; hide(axis="index") needs pandas 1.4+


Topic,Density,Entity
walker,15,Thing
whisky,11,Thing
johnnie,10,"Brand, Thing"
(en),10,none
label,7,Thing
drink,5,Thing
history,4,Thing
red,3,"Thing, WebSite"
black,3,Thing
good,3,"Organization, Corporation, Thing"


In [None]:
#@title save the output
df.to_csv(r'iCloud Drive\Scrivania\topical_coverage.csv', index = False, header=True)

#Display the Top 10 N-Grams Site-Wide

---

We simply pool the words from all pages to get a more complete overview of the keyword density site-wide.
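For the site-wide figures to differ from the page-level ones, the cleaned words of every page have to end up in the combined list. The sketch below shows one way to tally per-page and site-wide counts in a single pass, assuming the cleaned words are already available per URL; `sitewide_counts` and the sample data are hypothetical.

```python
from collections import Counter

def sitewide_counts(words_by_page, top_n=10):
    """Return per-page and site-wide top_n word counts."""
    total = Counter()
    per_page = {}
    for url, words in words_by_page.items():
        page_counter = Counter(words)
        per_page[url] = page_counter.most_common(top_n)
        total.update(page_counter)  # add this page's counts to the site-wide tally
    return per_page, total.most_common(top_n)

words_by_page = {
    "https://example.com/a": ["whisky", "whisky", "label"],
    "https://example.com/b": ["whisky", "walker", "walker"],
}
per_page, sitewide = sitewide_counts(words_by_page, top_n=2)
print(per_page)
print(sitewide)
```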


In [None]:
print("------ AGGREGATE COUNT -------")


fullcounts = Counter(fulllist).most_common(10)

all_term_data = []
for key, value in fullcounts:
    labels = gkbAPI(key)
    term_data = {
        'Topic': key,
        'Density': value,
        'Entity': ', '.join(labels)
    }
    all_term_data.append(term_data)
df = pd.DataFrame(all_term_data)
selection = ['Topic','Density','Entity']
df = df[selection]
df.head(20).style.set_table_styles(
[{'selector': 'th',
  'props': [('background', '#7CAE00'), 
            ('color', 'white'),
            ('font-family', 'verdana')]},
 
 {'selector': 'td',
  'props': [('font-family', 'verdana')]},

 {'selector': 'tr:nth-of-type(odd)',
  'props': [('background', '#DCDCDC')]}, 
 
 {'selector': 'tr:nth-of-type(even)',
  'props': [('background', 'white')]},
 
]
).hide(axis="index") #Styler.hide_index() was removed in pandas 2.0; hide(axis="index") needs pandas 1.4+


------ AGGREGATE COUNT -------


Topic,Density,Entity
walker,15,Thing
whisky,11,Thing
johnnie,10,"Brand, Thing"
(en),10,none
label,7,Thing
drink,5,Thing
history,4,Thing
red,3,"Thing, WebSite"
black,3,Thing
good,3,"Thing, Corporation, Organization"


In [None]:
#@title save the output
df.to_csv(r'iCloud Drive\Scrivania\topical_coverage_sitewide.csv', index = False, header=True) #separate file name so the page-level export is not overwritten