<a href="https://colab.research.google.com/github/LanceMeister/N_Gram_Text_Analyzer_for_SEO/blob/main/N_Gram_Text_Analyzer_for_SEO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Requirements and Assumptions

- Python 3 is installed and you understand basic Python syntax
- Access to a Linux installation (I recommend Ubuntu) or Google Colab
- Google Knowledge Graph API key
- A body of text to analyze

## Import and Install Modules

- nltk: NLP module that handles the n-gramming and word type tagging
  - nltk.download('punkt'): functions for n-gramming
  - nltk.download('averaged_perceptron_tagger'): functions for word type tagging
  - nltk.download('stopwords'): stop word lists, used later to filter the German unigrams
- pandas: for storing and displaying the results
- requests: for making API calls to the Google Knowledge Graph API and the Keyword Surfer API
- json: for processing the API responses

First, let’s install the nltk module, which you likely won’t have already. The leading exclamation mark runs the command as a shell command in Google Colab; drop it if you are installing from a regular terminal.

In [None]:
#@title
!pip3 install nltk
!pip install --upgrade google-api-python-client

In [None]:
#@title
# -*- coding: utf-8 -*-
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
from nltk.util import ngrams
import requests
import json
import pandas as pd

# Build N-Grams from Provided Text

We’re going to start off with a few functions. I decided to use functions because my app processes 1-, 2-, and 3-gram tables, whereas this script, as is, processes only one n-gram size of your choosing per call. You can automate all three at the end with a simple loop. This first function breaks the text into n-grams of whatever size you choose later on.

1. Send the text to the nltk ngrams function, which returns the n-grams. It takes two parameters: the first is a tokenized version of the text, produced by the word_tokenize function; the second is the whole number n for the n-gram size you want.
2. We join the words of each n-gram with a space and collect the resulting strings in a single list.
3. For unigrams, we tag each tokenized word and filter out anything that isn’t a noun or a participle (present or past).
4. Return the list to the main script.
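
To make steps 1 and 2 concrete, here is a minimal, dependency-free sketch of the sliding window that produces n-grams. The helper name `sliding_ngrams` is hypothetical and is not part of nltk; it only mimics what the notebook gets from nltk's ngrams function followed by the join, and it uses a plain `split()` as a stand-in for word_tokenize.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    # zip over n staggered views of the token list; zip stops at the shortest one
    return [' '.join(gram) for gram in zip(*(tokens[i:] for i in range(n)))]


# split() is only a stand-in for nltk.word_tokenize in this sketch
tokens = "wir produzieren in Deutschland".split()

print(sliding_ngrams(tokens, 1))  # ['wir', 'produzieren', 'in', 'Deutschland']
print(sliding_ngrams(tokens, 2))  # ['wir produzieren', 'produzieren in', 'in Deutschland']
print(sliding_ngrams(tokens, 3))  # ['wir produzieren in', 'produzieren in Deutschland']
```

A text with fewer than n tokens simply yields an empty list, which is why very short inputs produce empty bigram or trigram tables.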

In [5]:
#@title
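def extract_ngrams(data, num):
  # Tokenize the text, then slide a window of num tokens across it
  n_grams = ngrams(nltk.word_tokenize(data), num)
  gram_list = [' '.join(grams) for grams in n_grams]

  # Only unigrams are filtered by word type; longer n-grams are returned as-is
  if num != 1:
    return gram_list

  # Keep nouns (NN*) and participles (VBG, VBN); pos_tag defaults to NLTK's English tagger
  query_tokens = nltk.pos_tag(gram_list)
  query_tokens = [x[0] for x in query_tokens if x[1] in ['NN','NNS','NNP','NNPS','VBG','VBN']]

  return query_tokens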

# Detect Entities from N-Grams

Next, we tackle the function that will take the list of n-grams and pass it to the Google Knowledge Graph (GKG) API to check if it’s an entity.

1. Create a variable to store the n-grams that end up as entities and a variable for your GKG API key.
2. Loop through all the n-grams in the list.
3. Make a request to GKG using the n-gram and the API key, limiting the response to the first result found (highest score, most likely match).
4. Load the API JSON response into a JSON object to parse.
5. Parse the JSON object for “@type”, aka the entity category, and resultScore, aka the relevancy confidence metric. Some n-grams are not entities and would break the script, so we use a try/except block. If the except is triggered, we set a score of 0 and a label of none.
6. In the JSON data, the entity labels are in a list because entities often have more than one label or category. We loop over this list and build a comma-delimited string. In the end, we remove the last hanging comma and put a space between each comma and the next label.
7. If the label doesn’t equal none and the resultScore is greater than 500, we build a list for that n-gram that stores the n-gram, its score, and its labels. Then we append that list to a master list that stores all the entity data, so we end up with a list of lists. We then return that list of lists to the main script.
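
Steps 5 to 7 can be tried offline with a small sketch. The function name `parse_entity` and the sample dictionary are hypothetical: the sample is hand-written in the shape the steps describe (an itemListElement list whose first item carries result, @type and resultScore) and is not real API output.

```python
def parse_entity(ngram, data, min_score=500):
    """Return [ngram, score, labels] for the top result, or None if it is filtered out."""
    try:
        top = data['itemListElement'][0]
        types = top['result']['@type']
        score = round(float(top['resultScore']))
    except (KeyError, IndexError, TypeError, ValueError):
        # Not an entity (or an unexpected response): nothing to keep
        return None
    if score <= min_score:
        return None
    # Join the type list into one readable string
    return [ngram, score, ', '.join(types)]


# Hand-written sample for illustration only, not a real API response
sample = {
    'itemListElement': [
        {'result': {'name': 'Duisburg', '@type': ['City', 'Place', 'Thing']},
         'resultScore': 1234.56}
    ]
}

print(parse_entity('Duisburg', sample))                   # ['Duisburg', 1235, 'City, Place, Thing']
print(parse_entity('Griff', {'itemListElement': []}))     # None
```

Catching only the lookup and conversion errors, rather than every exception, keeps genuine problems such as a network failure visible.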

In [6]:
#@title
def kg(keywords):

  kg_entities = []
  apikey=''  # paste your Google Knowledge Graph API key here

  for x in keywords:
    url = 'https://kgsearch.googleapis.com/v1/entities:search'
    # Let requests URL-encode the n-gram; limit=1 returns only the top result
    params = {'query': x, 'key': apikey, 'limit': 1, 'indent': True}
    response = requests.get(url, params=params)
    data = json.loads(response.text)

    try:
      getlabel = data['itemListElement'][0]['result']['@type']
      score = round(float(data['itemListElement'][0]['resultScore']))
    except (KeyError, IndexError, TypeError, ValueError):
      # No entity found (or an error response): score 0, label none
      score = 0
      getlabel = ['none']

    # Join the entity types into one comma-separated string
    labels = ", ".join(getlabel)

    # Compare against the string 'none'; labels is a string at this point, not a list
    if labels != 'none' and score > 500:
      kg_entities.append([x, score, labels])
  return kg_entities

# Gather Entity Search Metrics

For our final function, we build our API call to Keyword Surfer to get volume, CPC, and competition score. Then we build the dataframe with all of our information.

1. Create a list with only the entity types.
2. Create a list with only the entity names.
3. Convert the entity names list to a JSON string so we can pass it to the API in the URL.
4. Build the Keyword Surfer API call using that keyword string. Please do not abuse this API or we may find it blocked at some point.
5. Load the JSON response into the seo_data variable.
6. Build the empty dataframe, which we’ll use to store the data later.
7. Loop through the entities returned by the API.
8. Some entities won’t have CPC or competition data. We detect the empty values and set them to 0 manually.
9. We build a dictionary object using the data for each entity and then add it to the dataframe we created earlier.
10. We sort the dataframe by Volume and then return it to the main script.
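
Steps 7 to 10 can be sketched without pandas or a network call. Everything here is illustrative: `build_rows` is a hypothetical helper, the response shape is an assumption inferred from the fields the function reads (search_volume, cpc, competition, keyed by keyword), and the numbers are made up.

```python
def build_rows(seo_data, entity_types):
    """Turn a keyword -> metrics mapping into table rows sorted by volume."""
    rows = []
    for word, metrics in seo_data.items():
        # Empty strings mean "no data"; fall back to 0.0 as the steps describe
        cpc = metrics['cpc'] if metrics['cpc'] != '' else 0.0
        competition = metrics['competition'] if metrics['competition'] != '' else 0.0
        rows.append({
            'Keyword': word,
            'Volume': int(metrics['search_volume']),  # numeric, so sorting is not alphabetical
            'CPC': '$' + str(round(float(cpc), 2)),
            'Competition': str(round(float(competition), 4)),
            'Entity Types': entity_types.get(word, ''),
        })
    return sorted(rows, key=lambda row: row['Volume'])


# Made-up sample values in the assumed response shape
sample = {
    'duisburg': {'search_volume': 90500, 'cpc': 0.42, 'competition': 0.0312},
    'oberhausen': {'search_volume': 40500, 'cpc': '', 'competition': ''},
}
types = {'duisburg': 'City, Place, Thing', 'oberhausen': 'City, Place, Thing'}

for row in build_rows(sample, types):
    print(row)
```

Looking up the entity types by keyword instead of by position avoids mismatched rows if the API skips or reorders a keyword, though it assumes the keys come back spelled exactly as they were sent.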

In [7]:
#@title
def surfer(entities):
  entities_type = [x[2] for x in entities]
  entities = [x[0] for x in entities]
  keywords = json.dumps(entities)

  url2 = 'https://db2.keywordsur.fr/keyword_surfer_keywords'
  # requests URL-encodes the JSON keyword list; country=de asks for German metrics
  response2 = requests.get(url2, params={'country': 'de', 'keywords': keywords}, verify=True)
  seo_data = json.loads(response2.text)

  rows = []
  counter=0

  for word in seo_data:

    # Some entities have no CPC or competition data; default those to 0
    if seo_data[word]["cpc"] == '':
      seo_data[word]["cpc"] = 0.0

    if seo_data[word]["competition"] == '':
      seo_data[word]["competition"] = 0.0

    # Entity types are matched by position, which assumes the API answers in the order sent
    new = {"Keyword":word,"Volume":str(seo_data[word]["search_volume"]),"CPC":"$"+str(round(float(seo_data[word]["cpc"]),2)),"Competition":str(round(float(seo_data[word]["competition"]),4)),'Entity Types':entities_type[counter]}
    rows.append(new)
    counter +=1

  # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
  df = pd.DataFrame(rows, columns=['Keyword', 'Volume', 'CPC', 'Competition', 'Entity Types'])
  # Sort numerically rather than alphabetically; sort_values returns a new frame
  df['Volume'] = pd.to_numeric(df['Volume'], errors='coerce')
  df = df.sort_values(by=['Volume'], ascending=True)
  return df

# Provide Text and Call Functions

We arrive now at the actual start of the script, where the data variable should contain the text you want to analyze. We clean the text a little to strip some pesky punctuation that sometimes breaks the analysis. There is certainly a more efficient way of handling it, but this works fine for now. Feel free to add more replacements.
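
One possible "more efficient way" is a single translation table instead of chained replace calls. This is only a sketch: `strip_punctuation` is a hypothetical helper, and it removes every character in `string.punctuation` (including hyphens, which may be unwanted for compound words) plus a few extra characters passed in by the caller.

```python
import string


def strip_punctuation(text, extra='–„“'):
    """Delete ASCII punctuation and any extra characters in one pass."""
    # maketrans with a third argument maps each listed character to None (deletion)
    table = str.maketrans('', '', string.punctuation + extra)
    return text.translate(table)


print(strip_punctuation("Was bedeutet das für Sie? Expertise, Erfahrung; Technologie."))
# Was bedeutet das für Sie Expertise Erfahrung Technologie
```

Umlauts and other letters are left untouched because only the listed characters are deleted.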

In [8]:
spreadsheetName=" N-Gram Text Analyzer for SEO" #@param {type:"string"}
worksheetName='unigram'
data = "Mit zwei Betriebsstätten in Oberhausen und Duisburg gehören wir zu den wenigen Firmen, die noch in Deutschland produzieren. Was bedeutet das für Sie? Dank der strengen Gesetzgebungen in Deutschland schützen wir nicht nur die Umwelt, sondern auch unsere Mitarbeiter. Durch Expertise, erfahrene Mitarbeiter und moderne Technologie haben wir unsere Kosten im Griff – billige Arbeitskräfte und flexible Gesetze ferner Länder kommen für uns nicht in Frage."#@param {type:"string"}
data = data.replace(',','')
data = data.replace('.','')
data = data.replace(';','')


Here we send the data from above to the first function we created, which builds the n-grams. The second parameter is the n-gram size you want to process:

- 1 = unigram
- 2 = bigram
- 3 = trigram …

Then we send the n-gram list to the Google Knowledge Graph API for entity detection. Lastly, we send the entities to the Keyword Surfer API for metrics, and finally, we display the dataframe.
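
If you would rather run all three sizes in one go, the pipeline can be wrapped in a small loop. This is a sketch under assumptions: `analyze_all` is a hypothetical helper, and the stand-in callables below exist only so the example runs offline. In the notebook you would pass extract_ngrams, kg and surfer instead.

```python
def analyze_all(text, extract, detect, metrics, sizes=(1, 2, 3)):
    """Run extract -> detect -> metrics once per n-gram size and key the results by n."""
    return {n: metrics(detect(extract(text, n))) for n in sizes}


# Stand-in for extract_ngrams so the sketch needs no downloads or API keys
def fake_extract(text, n):
    words = text.split()
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]


# list and len stand in for the entity detection and metrics steps
tables = analyze_all("a b c", fake_extract, detect=list, metrics=len)
print(tables)  # {1: 3, 2: 2, 3: 1}
```

With the real functions, `tables[1]`, `tables[2]` and `tables[3]` would hold the unigram, bigram and trigram dataframes.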

### 1 = unigram 

In [None]:
#@title
from nltk.corpus import stopwords
stopWords = set(stopwords.words('german'))
keywords = extract_ngrams(data, 1)
entities = kg(keywords)
df = surfer(entities)
df


### 1 = unigram (German)

Run this code when analyzing German text to remove unwanted "stop words" from the table
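
The filter matches keywords exactly, so capitalization matters. As far as I know, NLTK's stop word lists are lowercase, which means a capitalized sentence-initial word such as "Mit" would slip through an exact match. A minimal sketch of a case-insensitive variant, using a hypothetical helper and a tiny hand-picked stop word set rather than the real NLTK list:

```python
def drop_stop_words(keywords, stop_words):
    """Remove keywords that are stop words, ignoring case."""
    stop = {w.lower() for w in stop_words}
    return [k for k in keywords if k.lower() not in stop]


# Tiny hand-picked subset for illustration, not the full NLTK German list
sample_stop_words = {'mit', 'in', 'und', 'wir'}

print(drop_stop_words(['Mit', 'Oberhausen', 'und', 'Duisburg'], sample_stop_words))
# ['Oberhausen', 'Duisburg']
```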

In [None]:
#@title
# print(str(stopWords))
df_de = df[~df.Keyword.isin(stopWords)]
df_de

### 2 = bigram 

In [None]:
#@title 
keywords = extract_ngrams(data, 2)
entities = kg(keywords)
df2 = surfer(entities)
df2

### 3 = trigram 

In [None]:
#@title 
keywords = extract_ngrams(data, 3)
entities = kg(keywords)
df3 = surfer(entities)
df3

# Populating our Google Sheet

In [None]:
#@title 
#https://pypi.org/project/gspread-pandas/


In [None]:
#@title 
!pip install gspread-pandas
!pip install --upgrade google-auth[reauth]
#https://github.com/aiguofer/gspread-pandas/pull/47

In [None]:
#@title 
from google.colab import auth
auth.authenticate_user()
import google.auth
import gspread

In [None]:
#@title 
from gspread_pandas import Spread, Client
import google.auth

creds, project = google.auth.default()


In [None]:
#@title 
spread = Spread(spreadsheetName, worksheetName, creds=creds)
spread


In [None]:
#@title 
#https://gspread-pandas.readthedocs.io/en/latest/getting_started.html

In [None]:
#@title 
# Display available worksheets
spread.sheets

In [None]:
#@title 
# Save each DataFrame to its own worksheet: 'unigram', 'bigram', 'trigram'
spread.df_to_sheet(df, index=False, sheet='unigram', start='A1', replace=True)
spread.df_to_sheet(df2, index=False, sheet='bigram', start='A1', replace=True)
spread.df_to_sheet(df3, index=False, sheet='trigram', start='A1', replace=True)

In [None]:
#@title 
# df.to_csv(r"/content/drive/MyDrive/Colab Notebooks/seoAudit/output/n-gramm.csv", index=False, sep='\t', encoding="utf-8") #write to csv file