Howdy Notebooker (or soon to be notebooker)!

This script automates Steve's interrogative Google Autosuggest strategy: you Google a keyword, delete it, and begin typing in "interrogative words," i.e. words that begin sentences.

All you need is a seed keyword like "best mortgage rates" and this Colab notebook does the rest!

Sign up for weekly notes [seonotebook.com](https://seonotebook.com)

Contact me for SEO consulting [steve@seonotebook.com](mailto:steve@seonotebook.com)

Gigantic thanks to Max Geraci for creating this script. Don't miss his awesome tools, [Entities Checker](https://entitieschecker.com) and [Stacking.Cloud](https://stacking.cloud).

Here's a rundown of how you can utilize this Colab notebook:

<!-- Logo Cell -->
<div align="left">
  <a href="https://seonotebook.com/">
    <img src="https://sp-ao.shortpixel.ai/client/to_auto,q_glossy,ret_img/https://seonotebook.com/wp-content/uploads/2020/07/seonotebook-logo-rgb.png" alt="SEO Notebook Logo" width="280" height="52">
  </a>
</div>


## **SEO Keyword Collection and Analysis Script Guide**

This script is designed to collect a broad set of keywords starting from a seed keyword. Keywords are scraped from Google Autosuggest using various techniques, including:

- **Alphabet technique**: Appending letters of the alphabet (or numbers 0-9) at different positions of the keyword (beginning, end, between words) to generate variations.
- **Asterisk technique**: Using an asterisk as a wildcard to find keyword variations.
- **Modifiers**: Incorporating over 200 interrogative and other modifiers to refine and expand the search.
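
The alphabet and asterisk techniques listed above can be sketched as a small query generator. This is only an illustration of the idea, not the script's own code: `build_variants` and its `position` and `chars` parameters are hypothetical names chosen for this example.

```python
import string


def build_variants(keyword, position="end", chars=string.ascii_lowercase + string.digits):
    """Combine the seed keyword with each character at the chosen position."""
    words = keyword.split()
    variants = []
    for char in chars:
        if position == "start":
            variants.append(f"{char} {keyword}")
        elif position == "end":
            variants.append(f"{keyword} {char}")
        elif position == "middle":
            # Insert the character into every gap between two words.
            for i in range(1, len(words)):
                variants.append(" ".join(words[:i] + [char] + words[i:]))
        else:
            raise ValueError(f"unknown position: {position}")
    return variants


# Alphabet technique at the end of the keyword (first three variants only).
print(build_variants("best mortgage rates", "end")[:3])
# Asterisk technique between words: pass "*" as the only character.
print(build_variants("best mortgage rates", "middle", chars="*"))
```

Each generated string would then be sent to Autosuggest as its own query.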

During execution, you will be prompted for the seed keyword and for a proxy in the format `http://username:password@host:port`, which reduces the risk of Google banning your IP. You will also need to upload the Excel file containing the list of modifiers.
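
Before scraping, it can help to check that a proxy string really follows the `http://username:password@host:port` format. The sketch below is one possible check using only the standard library; `parse_proxy` and `to_requests_proxies` are hypothetical helpers, and the address in the example is a placeholder, not a working proxy.

```python
from urllib.parse import urlsplit


def parse_proxy(proxy_url):
    """Split a proxy URL into its parts, raising ValueError if it is malformed."""
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https") or not parts.hostname or parts.port is None:
        raise ValueError(f"expected http://username:password@host:port, got {proxy_url!r}")
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }


def to_requests_proxies(proxy_url):
    """Build the mapping that the requests library accepts in its `proxies` argument."""
    parse_proxy(proxy_url)  # validate first
    return {"http": proxy_url, "https": proxy_url}


print(parse_proxy("http://user:secret@127.0.0.1:8080")["host"])
```

The dictionary returned by `to_requests_proxies` has the same shape as the `proxies` variable built inside the scraping functions.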

The scraping can be done in a single step or recursively to find even more long-tail keywords, gradually moving away from the relevance of the seed keyword.
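
The difference between single-step and recursive scraping can be illustrated with a depth-limited expansion. This is a sketch under stated assumptions: `expand` is a hypothetical function, and `fake_suggest` stands in for a real Autosuggest request with made-up suggestions so the example runs offline.

```python
def expand(keyword, suggest, max_depth, depth=0, seen=None):
    """Collect suggestions for `keyword`, recursing into each new one up to `max_depth`."""
    if seen is None:
        seen = set()
    collected = []
    for suggestion in suggest(keyword):
        if suggestion in seen:
            continue  # skip suggestions that were already expanded
        seen.add(suggestion)
        collected.append((depth, suggestion))
        if depth < max_depth:
            collected.extend(expand(suggestion, suggest, max_depth, depth + 1, seen))
    return collected


# Made-up suggestion data standing in for Google Autosuggest.
FAKE_SUGGESTIONS = {
    "mortgage rates": ["mortgage rates today", "mortgage rates calculator"],
    "mortgage rates today": ["mortgage rates today 30 year fixed"],
    "mortgage rates calculator": ["mortgage calculator with taxes"],
}


def fake_suggest(query):
    return FAKE_SUGGESTIONS.get(query, [])


single_step = expand("mortgage rates", fake_suggest, max_depth=0)
recursive = expand("mortgage rates", fake_suggest, max_depth=1)
print(len(single_step), len(recursive))
```

In the made-up data, the depth-1 result "mortgage calculator with taxes" no longer contains the word "rates", which is the kind of drift away from the seed keyword that deeper recursion produces.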

**Output of this phase includes**:
- `google_autocomplete_suggestions_output.csv`: This file contains the extracted keywords, organized into columns based on the scraping technique used.
- `scraped_seed_keywords_list.csv`: A deduplicated list that merges various lists from the scraping phase. This consolidated list serves as the input for subsequent analyses, including **Semantic Similarity against a Context** and **Topic Modeling**.
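
The merge-and-deduplicate step behind `scraped_seed_keywords_list.csv` can be sketched as follows. Note one deliberate difference: the script deduplicates with `set()`, which does not keep the original order, while this hypothetical `merge_and_deduplicate` helper keeps the first occurrence of each keyword in order.

```python
def merge_and_deduplicate(columns):
    """Flatten several keyword lists into one, dropping blanks and repeats."""
    seen = set()
    merged = []
    for column in columns:
        for keyword in column:
            cleaned = keyword.strip()
            if cleaned and cleaned not in seen:
                seen.add(cleaned)
                merged.append(cleaned)
    return merged


# Example lists, one per scraping technique (made-up values).
alphabet = ["mortgage rates a", "mortgage rates today"]
asterisk = ["mortgage rates today", "best mortgage rates", ""]
print(merge_and_deduplicate([alphabet, asterisk]))
```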

These outputs provide a comprehensive keyword set, paving the way for deeper analysis focused on identifying semantically relevant keywords and understanding user intent through topic trends.

### Trimming Down Keywords Through Semantic Similarity
To make the final list more manageable, a trim-down approach based on semantic similarity is applied. The starting point is the seed keyword and two expansion keywords (sub-topics or semantically related keywords that define the main topic but can be lexically distant). The script uses GPT-4 to generate a semantically rich text from these keywords. **The seed and expansion keywords are crucial** as they set the context for the entire analysis. The generated context text is then compared against the scraped keyword list from Autosuggest, and keywords whose semantic similarity to the context meets a threshold are kept as relevant.
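
The threshold step can be illustrated with plain cosine similarity. This is a toy sketch, not the script's implementation: the two-dimensional vectors are made up for readability (real embeddings have hundreds of dimensions), and `cosine` and `split_by_threshold` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_by_threshold(keyword_vectors, context_vector, threshold):
    """Separate keywords into kept and excluded lists by similarity to the context."""
    kept, excluded = [], []
    for keyword, vector in keyword_vectors.items():
        score = cosine(vector, context_vector)
        target = kept if score >= threshold else excluded
        target.append((keyword, score))
    return kept, excluded


# Made-up vectors: the first axis loosely means "about mortgages".
context = [1.0, 0.0]
toy_vectors = {
    "mortgage rates today": [0.9, 0.1],
    "banana bread recipe": [0.1, 0.9],
}
kept, excluded = split_by_threshold(toy_vectors, context, threshold=0.5)
print([k for k, _ in kept], [k for k, _ in excluded])
```

Raising the threshold keeps fewer, more closely related keywords; lowering it lets more distant ones through.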

### Key Components of the Script
The script uses:
- **KMeans** for clustering,
- **FastEmbed** for efficient vectorization,
- **All-MiniLM-L6-v2** Transformer model for the embeddings (this model is trained mainly on English text, so support for other languages is limited).

### Output and Downloads
The lists of clustered keywords and excluded keywords, including their similarity scores, are saved in Excel format in the newly created `output` folder (`clustered_keywords_<seed keyword>.xlsx` and `excluded_keywords_<seed keyword>.xlsx`).

### Topic Modeling and Insights
Additionally, the script performs topic modeling on the scraped keyword list using **LDA with HDBSCAN**. The results, available in both tabular Excel format and graphical formats like word clouds, provide immediate insights into the predominant topics reflected in user searches captured by Google Autosuggest. These topic aggregations serve as a valuable tool for guiding the creation of a content plan or a hub of semantically related articles to cover the topic of interest.

## **This section is to scrape Google Autocomplete Suggestions**

### **Installation**
Install all the necessary libraries


In [4]:
pip install -q selenium

Note: you may need to restart the kernel to use updated packages.


In [None]:
gsheet_filename = gc.create("scrape_google_suggestions")
sheet = gsheet_filename.sheet1

In [None]:
sheet.append_row(['keyword', 'keyword_modifier_before_after', 'keyword_interrogatives', 'keyword_with_alphabet', 'keyword_with_asterisk',  'modifiers'])

### **Import all the needed libraries**

In [43]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.keys import Keys
import time
import re
import json
import pandas as pd
import concurrent.futures
import urllib.parse
import requests
from csv import writer
from itertools import zip_longest

### Define variables and proxies

In [6]:
result_list = []
keyword_alphabet_asterisk = []
selenium_modifiers  =[]

### The output should be stored in a CSV file. Before storing the output, define column names

In [7]:
import pandas as pd

# Define the column names
columns = ['keyword', 'keyword_modifier_before_after', 'keyword_interrogatives', 'keyword_with_alphabet', 'keyword_with_asterisk', 'modifiers']

# Create an empty DataFrame
df = pd.DataFrame(columns=columns)

# Write the DataFrame to a CSV file
df.to_csv('google_autocomplete_suggestions_output.csv', index=False)
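
Because each technique returns a different number of keywords, the columns have unequal lengths when they are later appended to this file. The sketch below shows how `zip_longest` pads the shorter columns with empty strings; it writes to an in-memory buffer instead of a file, and the keywords and the `columns_to_rows` helper are made up for illustration.

```python
import csv
import io
from itertools import zip_longest

header = ["keyword", "keyword_with_alphabet", "keyword_with_asterisk"]
# One list per column; lengths differ on purpose.
data = [
    ["best mortgage rates"],
    ["best mortgage rates a", "best mortgage rates b"],
    ["best * mortgage rates"],
]


def columns_to_rows(column_values):
    """Turn column lists into row tuples, padding short columns with ''."""
    return list(zip_longest(*column_values, fillvalue=""))


buffer = io.StringIO()
csv_writer = csv.writer(buffer, lineterminator="\n")
csv_writer.writerow(header)
csv_writer.writerows(columns_to_rows(data))
print(buffer.getvalue())
```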



### The function below gets autocomplete suggestions for each modified keyword [modifier + keyword and keyword + modifier]

In [8]:
def scrape_using_request_modifiers(modified_keyword, proxy_list, language="en", country="us"):
    """Sends a GET request to Google's search suggest function and returns a list of suggestions (Simple version)"""
    time.sleep(1)
    headers = {'User-agent':'Mozilla/5.0'}
    for proxy in proxy_list:
        proxies = {"http": proxy, "https": proxy}
        try:
            encoded_keyword = urllib.parse.quote_plus(modified_keyword)

            base_url = f"http://google.com/complete/search?hl={language}&q={encoded_keyword}&json=t&client=serp"
            #response = requests.get(base_url, headers=headers, proxies=proxies)
            response = requests.get(base_url, headers=headers)
            if response.ok and response.content.strip():
                content = json.loads(response.content)
                if content and isinstance(content, list) and len(content) > 1:
                    return content[1]  # Return suggestions without recursion
            return []
        except requests.exceptions.RequestException as e:
            print(f"An error occurred while connecting to the proxy : {str(e)}")
        else:
            break
    else:
      print(f"All proxies failed for keyword '{modified_keyword}'.")
      return []
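
The URL construction and response handling in the function above can be isolated into two small helpers, which makes them easy to check without any network access. This is a sketch: `build_suggest_url` and `parse_suggestions` are hypothetical names, and the sample payload is made up, with its shape (a list whose second element holds the suggestions) inferred from the `content[1]` access in the function.

```python
import json
from urllib.parse import urlencode


def build_suggest_url(keyword, language="en"):
    """Build the suggest URL with the same query parameters as the scraper."""
    params = {"hl": language, "q": keyword, "json": "t", "client": "serp"}
    return "http://google.com/complete/search?" + urlencode(params)


def parse_suggestions(raw_body):
    """Return the suggestion list from a response body, or [] if it is unusable."""
    try:
        content = json.loads(raw_body)
    except json.JSONDecodeError:
        return []
    if isinstance(content, list) and len(content) > 1 and isinstance(content[1], list):
        return content[1]
    return []


print(build_suggest_url("best mortgage rates"))
# Made-up payload in the assumed shape: [query, [suggestions...]]
print(parse_suggestions('["seo", ["seo tools", "seo audit"]]'))
```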

In [10]:
def scrape_autocomplete_modifiers_kw(keyword, modifiers_list, proxy_list):
    keyword_modifier_before_after = []
    for modifier in modifiers_list:
        modified_kw1 = modifier+' '+keyword
        modified_kw2 = keyword+' '+modifier
        modified_kw1_suggestions = scrape_using_request_modifiers(modified_kw1,proxy_list )
        result_list.append({
        "keyword": modified_kw1,
        "autocomplete_suggestions": modified_kw1_suggestions
        })
        print("modifier ",modifier, ' ', modified_kw1_suggestions)
        modified_kw2_suggestions = scrape_using_request_modifiers(modified_kw2,proxy_list)
        result_list.append({
        "keyword": modified_kw2,
        "autocomplete_suggestions": modified_kw2_suggestions
        })
        print("modifier ",modifier, ' ', modified_kw2_suggestions)
        keyword_modifier_before_after.append(modified_kw1_suggestions)
        keyword_modifier_before_after.append(modified_kw2_suggestions)
    return keyword_modifier_before_after

### The function below gets autocomplete suggestions for the interrogatives [interrogative + keyword]

In [12]:
def scrape_autocomplete_interrogatives_kw(keyword, interrogatives_list, proxy_list, language="en"):
    """Requests Google suggestions for each interrogative + keyword query and returns one list of suggestions per query"""
    keyword_interrogatives = []
    for interrogatives in interrogatives_list:
        modified_keyword = interrogatives + ' '+keyword
        time.sleep(1)
        headers = {'User-agent':'Mozilla/5.0'}
        for proxy in proxy_list:
            proxies = {"http": proxy, "https": proxy}
            try:
                encoded_keyword = urllib.parse.quote_plus(modified_keyword)

                base_url = f"http://google.com/complete/search?hl={language}&q={encoded_keyword}&json=t&client=serp"
                #response = requests.get(base_url, headers=headers, proxies=proxies)
                response = requests.get(base_url, headers=headers)
                if response.ok and response.content.strip():
                    content = json.loads(response.content)
                    if content and isinstance(content, list) and len(content) > 1:
                        print("interrogatives ",interrogatives+' '+str(content[1]))
                        result_list.append({
                        "keyword": modified_keyword,
                        "autocomplete_suggestions": content[1]
                        })
                        # Read content[1] only after the response shape has been validated
                        keyword_interrogatives.append(content[1])
            except requests.exceptions.RequestException as e:
                print(f"An error occurred while connecting to the proxy : {str(e)}")
            else:
              break
        else:
          print(f"All proxies failed for keyword '{modified_keyword}'.")
          return []
    return keyword_interrogatives

### The functions below get autocomplete suggestions for the alphabet and asterisk techniques [alphabet + keyword and asterisk + keyword]. Letters and the asterisk can be placed at the start, middle, or end of the keyword, depending on the user's choice

In [13]:
def make_request_alphabet_asterisk(modified_keyword, depth, proxy_list, max_depth, language="en"):
    time.sleep(1)
    headers = {'User-agent':'Mozilla/5.0'}
    for proxy in proxy_list:
        proxies = {"http": proxy, "https": proxy}
        try:
            encoded_keyword = urllib.parse.quote_plus(modified_keyword)

            base_url = f"http://google.com/complete/search?hl={language}&q={encoded_keyword}&json=t&client=serp"
            #response = requests.get(base_url, headers=headers, proxies=proxies)
            response = requests.get(base_url, headers=headers)
            if response.ok and response.content.strip():
                content = json.loads(response.content)
                if content and isinstance(content, list) and len(content) > 1:
                    keyword_alphabet_asterisk.append(content[1])
                    result_list.append({
                        "keyword": modified_keyword,
                        "autocomplete_suggestions": content[1]
                    })
                    print("alphabet/digits/asterisk ",modified_keyword+' '+str(content[1]))
                    if depth < max_depth:
                            for suggestion in content[1]:
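                                # Pass proxy_list and language through so the recursive call matches the signature
                                make_request_alphabet_asterisk(suggestion, depth+1, proxy_list, max_depth, language=language)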

        except requests.exceptions.RequestException as e:
            print(f"An error occurred while connecting to the proxy : {str(e)}")
        else:
          break
    else:
      print(f"All proxies failed for keyword '{modified_keyword}'.")


In [14]:
def scrape_autocomplete_alphabet(keyword, max_depth, insert_option, proxy_list, language="en"):
    """Requests Google suggestions for the keyword combined with each letter and digit.

    Only the 'end' position is expanded recursively (up to max_depth); 'start' and 'middle' are single-step.
    """
    # Digits run 0-9, as described in the guide
    for char in 'abcdefghijklmnopqrstuvwxyz0123456789':
        depth = 0
        if insert_option in ['end', 'all']:
            modified_keyword = keyword +' '+char
            make_request_alphabet_asterisk(modified_keyword, depth,proxy_list, max_depth,language="en")
        if insert_option in ['start', 'all']:
            modified_keyword = char + ' ' + keyword
            make_request_alphabet_asterisk(modified_keyword, depth,proxy_list, max_depth=0, language="en")
        if insert_option in ['middle', 'all'] and ' ' in keyword:
            words = keyword.split()
            for i in range(1, len(words)):
                modified_keyword = ' '.join(words[:i]) + ' ' + char + ' ' + ' '.join(words[i:])
                make_request_alphabet_asterisk(modified_keyword, depth,proxy_list, max_depth=0, language="en")




def scrape_autocomplete_asterisk(keyword, max_depth, insert_option,proxy_list, language="en"):
    """Requests Google suggestions for the keyword combined with an asterisk wildcard.

    Only the 'end' position is expanded recursively (up to max_depth); 'start' and 'middle' are single-step.
    """
    depth=0
    if insert_option in ['end', 'all']:
        modified_keyword = keyword +' '+'*'
        make_request_alphabet_asterisk(modified_keyword, depth, proxy_list,max_depth, language="en")
    if insert_option in ['start', 'all']:
        modified_keyword = '*' + ' ' + keyword
        make_request_alphabet_asterisk(modified_keyword, depth, proxy_list,max_depth=0, language="en")
    if insert_option in ['middle', 'all'] and ' ' in keyword:
        words = keyword.split()
        for i in range(1, len(words)):
            modified_keyword = ' '.join(words[:i]) + ' ' + '*' + ' ' + ' '.join(words[i:])
            make_request_alphabet_asterisk(modified_keyword, depth,proxy_list, max_depth=0, language="en")

### The function below gets autocomplete suggestions for the seed keyword and for the modifiers alone, using Selenium.

In [15]:
def scrape_modifiers_suggestions_selenium(keyword, modifiers_list, index):

    options = webdriver.ChromeOptions()
    options.add_argument("--verbose")
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    options.add_argument("--window-size=1920,1200")
    options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=options)
    driver.get("https://www.google.com")


    search = driver.find_element(by=By.NAME, value="q")
    search.send_keys(keyword)
    search.send_keys(Keys.RETURN)

    # Click the search box so the suggestion list opens
    search_box = driver.find_element(by=By.TAG_NAME, value="textarea")
    search_box.click()


    driver.implicitly_wait(10)
    html_list = driver.find_element(by=By.ID, value="searchform")
    items = html_list.find_elements(by=By.TAG_NAME, value="li")

    if index == 0:
        keyword_autocomplete = []
        for item in items:
            if item.text!='':
                keyword_autocomplete.append(item.text)

        print(keyword_autocomplete)
        result_list.append({
            "keyword": keyword,
            "autocomplete_suggestions": keyword_autocomplete
        })
        selenium_modifiers.append(keyword_autocomplete)

    driver.implicitly_wait(2)

    ############################# *****modifier Technique***** #########################
    for modifier in modifiers_list:
        driver.find_element(by=By.TAG_NAME, value="textarea").clear()

        add_keyword = driver.find_element(by=By.NAME, value="q")
        add_keyword.send_keys(modifier)


        time.sleep(5)
        # The leading "." keeps the second lookup inside the listbox element
        items = driver.find_element(By.XPATH, "//ul[@role='listbox']").find_elements(By.XPATH, ".//li[@role='presentation']")

        keyword_autocomplete = []
        for item in items:
            if item.text!='':
                keyword_autocomplete.append(item.text)
        print("modifier list ",modifier, ' ',keyword_autocomplete)
        result_list.append({
            "keyword": modifier,
            "autocomplete_suggestions": keyword_autocomplete
        })
        selenium_modifiers.append(keyword_autocomplete)

    # Close the browser so each batch does not leave a Chrome process running
    driver.quit()

### Invoke all the functions above. The list of modifiers is sent to Selenium in batches because Google imposes a rate limit once a Selenium session exceeds 20-25 requests.

In [16]:
def scrape_autocomplete(keyword, modifiers_list, interrogatives_list, max_depth,insert_option, batch_size, proxy_list):

    keyword_modifiers_before_after = scrape_autocomplete_modifiers_kw(keyword, modifiers_list,proxy_list)
    keyword_interrogatives = scrape_autocomplete_interrogatives_kw(keyword, interrogatives_list,proxy_list)
    scrape_autocomplete_alphabet(keyword, max_depth, insert_option,proxy_list, language="en")
    keyword_alphabet = keyword_alphabet_asterisk.copy()
    keyword_alphabet_asterisk.clear()
    scrape_autocomplete_asterisk(keyword, max_depth, insert_option,proxy_list, language="en")
    keyword_asterisk = keyword_alphabet_asterisk.copy()
    keyword_alphabet_asterisk.clear()

    # Split the modifiers into batches of `batch_size`. Always run at least one batch
    # so that the seed keyword itself is scraped (done when the batch index is 0).
    num_batches = max(1, -(-len(modifiers_list) // batch_size))  # ceiling division
    for batch in range(num_batches):
        start_index = batch * batch_size
        end_index = start_index + batch_size
        # The final slice is simply shorter when the list does not divide evenly
        sub_modifiers = modifiers_list[start_index:end_index]
        #print("start index ",start_index, 'end index', end_index, sub_modifiers)
        scrape_modifiers_suggestions_selenium(keyword, sub_modifiers, batch)

    #to convert 2D list to 1D list and remove any duplicates if any
    kw_modifier_before_after_1d = list(set([keyword for sublist in keyword_modifiers_before_after for keyword in sublist]))
    kw_interrogatives_1d = list(set([keyword for sublist in keyword_interrogatives for keyword in sublist]))
    kw_alphabet_1d = list(set([keyword for sublist in keyword_alphabet for keyword in sublist]))
    kw_asterisk_1d = list(set([keyword for sublist in keyword_asterisk for keyword in sublist]))
    kw_selenium_modifiers_1d = list(set([keyword for sublist in selenium_modifiers for keyword in sublist]))

    all_combined = [[keyword], kw_modifier_before_after_1d, kw_interrogatives_1d, kw_alphabet_1d, kw_asterisk_1d, kw_selenium_modifiers_1d]
    export_data = zip_longest(*all_combined, fillvalue = '')
    #output= [keyword, kw_modifier_before_after_1d, kw_interrogatives_1d, kw_alphabet_1d, kw_asterisk_1d, kw_selenium_modifiers_1d]
    # newline='' prevents blank rows on Windows; the with block closes the file
    with open('google_autocomplete_suggestions_output.csv', 'a', newline='', encoding='utf-8') as f:
        writer_obj = writer(f)
        writer_obj.writerows(export_data)
    #sheet.append_row([keyword, str(keyword_modifiers_before_after), str(keyword_interrogatives), str(keyword_alphabet), str(keyword_asterisk), str(selenium_modifiers)])
    selenium_modifiers.clear()
    return result_list
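
The batching of modifiers for Selenium can be sketched on its own as a simple chunking helper. `make_batches` is a hypothetical name used only for this illustration; the modifier values are placeholders.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 25 placeholder modifiers split into batches of 10.
modifiers = [f"modifier_{n}" for n in range(25)]
for index, batch in enumerate(make_batches(modifiers, 10)):
    print(index, len(batch))
```

Each chunk would be handed to one Selenium session, which keeps the number of requests per session under the limit mentioned above.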

In [39]:
def read_modifiers_list(modifiersfilename,interrogativefilename):
    modifier_data = pd.read_csv(modifiersfilename)
    interrogatives = pd.read_csv(interrogativefilename)
    # return modifier_data["modifiers"].unique().tolist(), interrogatives["interrogatives"].tolist()
    return modifier_data.iloc[:, 0].tolist(), interrogatives.iloc[:, 0].tolist()

## Call scrape_autocomplete() for all the keywords

In [40]:
def get_autocomplete_results(seed_keywords,insert_option,modifiersfilename,interrogativefilename,max_depth,proxy_list,batch_size):
    modifier_list, interrogatives_list = read_modifiers_list(modifiersfilename,interrogativefilename)

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        for keyword in seed_keywords:
            result_list = executor.submit(scrape_autocomplete,keyword, modifier_list, interrogatives_list, max_depth,insert_option, batch_size,proxy_list).result()
            with open(keyword+".json", 'w') as f:
                json.dump(result_list, f)
            result_list.clear()
        executor.shutdown()

### Files from Drive should be mounted to the current directory in Google Colab.

In [44]:
def main():
    keyword = str(input("Please enter the keyword: "))

    # Show the instructions before asking for input
    print("Enter the proxy in the format: http://username:password@host:port")
    print("The proxy can be a single proxy or several. If several, separate them with spaces [e.g. proxy1 proxy2]")
    proxy = input("Proxy: ").strip()
    if proxy == '':
        proxy = 'http://vskol:hpap98q4@80.71.154.59:5432'
    proxy_list = proxy.split()
    print("Please choose an approach:")
    print("1. Simple")
    print("2. Recursive")
    approach_choice = int(input("Enter your choice: ")) - 1

    # New code block to get user's choice on character insertion position
    if approach_choice == 0:  # if 'Simple' approach is chosen
        print("Please choose where to insert the characters:")
        print("1. Start")
        print("2. End")
        print("3. Middle")
        print("4. All")
        insert_option_choice = int(input("Enter your choice: ")) - 1
        insert_options = ['start', 'end', 'middle', 'all']
        chosen_insert_option = insert_options[insert_option_choice]
    else:
        chosen_insert_option = 'end'  # default to 'end' for 'Recursive' approach

    if approach_choice==0:
        max_depth = 0
    else:
        max_depth = int(input("Enter max depth for the recursive approach (any value from 1 to 5): "))
        #max_depth=1

    # print('Upload query modifiers list')
    # uploaded = files.upload()
    # if not uploaded:
    #   print('This file is mandatory!')
    #   return

    # filename = next(iter(uploaded))
    filename=r"D:\360downloads\query modifiers list.xlsx"
    modifiersfilename=r"D:\360downloads\modifiers.csv"
    interrogativefilename=r"D:\360downloads\interrogative.csv"
    


    get_autocomplete_results([keyword], chosen_insert_option, modifiersfilename,interrogativefilename,max_depth, proxy_list, batch_size=10)

main()



## The code below further processes the output file for use as input to the semantic similarity analysis.

In [None]:
from itertools import chain
import ast

df = pd.read_csv("google_autocomplete_suggestions_output.csv")


In [None]:
df.head()

In [None]:
global_list = []
seed_keyword = df["keyword"][0]

In [None]:
# df = df.drop(["keyword"], axis=1)
for col in df.columns:
  rm_null = df[col].dropna()
  col_values = [elem for elem in rm_null]
  global_list.append(col_values)

In [None]:
global_list

In [None]:
#flatten 2D list to 1D list
scraped_keywords = [keyword for sublist in global_list for keyword in sublist]
scraped_keywords = list(set(scraped_keywords))
print("Scraped keywords \n", scraped_keywords)
output = pd.DataFrame({"scraped keywords": scraped_keywords})
output.to_csv("scraped_seed_keywords_list.csv", index=False)

In [None]:
len(scraped_keywords)

# **Semantic Similarity**

In [None]:
!pip install -q openai fastembed

The code below creates the directory 'output'.

In [None]:
import os
# Specify the directory path
directory = "output"
# Check if the directory already exists
if not os.path.exists(directory):
    # Create the directory
    os.makedirs(directory)
    print("Directory created successfully!")
else:
    print("Directory already exists!")

In [None]:
import openai
import pandas as pd
from sklearn.cluster import KMeans
from fastembed import TextEmbedding
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import silhouette_score
from openai import OpenAI
import time
import os
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import random
import ipywidgets as wg
from IPython.display import display

In [None]:
# Load keywords from a CSV file
def load_keywords(filename):
    df = pd.read_csv(filename)
    keywords_list = []
    for column in df.columns:
        keywords_list.extend(df[column].dropna().tolist())
    return keywords_list

In [None]:
def generate_context_gpt(keyword, expand_kws):
    prompt = "Generate an exploratory text that densely incorporates terminology and concepts related to {seed_KW}"
    if len(expand_kws) > 0:
        prompt += ", interweaving related topics such as {expand_KW1}"
    if len(expand_kws) > 1:
        prompt += " and {expand_KW2}"
    prompt += ". Aim to cover a broad spectrum of entities, terms, and sub-topics semantically connected to the primary topic, providing a rich set of data points for semantic analysis. The narrative is not crucial here. The goal is to include the 'Central Entities' and as many related terms, keywords, entities, and sub-topics as possible to create a rich dictionary. Generate the requested text without any unrelated comments."

    format_dict = {'seed_KW': keyword}
    if len(expand_kws) > 0:
        format_dict['expand_KW1'] = expand_kws[0]
    if len(expand_kws) > 1:
        format_dict['expand_KW2'] = expand_kws[1]

    prompt = prompt.format(**format_dict)

    try:
        client = OpenAI()
        result = client.chat.completions.create(
            model="gpt-4-1106-preview",
            messages=[
                {"role": "system",
                 "content": "You are an assistant and you should generate text for the provided keyword. You always follow the instructions."},
                {"role": "user", "content": prompt}],
            max_tokens=abs(4000 - len(prompt)),
            seed=123
        )
        generated_prompt = result.choices[0].message.content
        return generated_prompt
    except openai.RateLimitError as e:
        # In openai>=1.0 the HTTP headers are on the underlying response object
        sleep_duration = int(float(e.response.headers.get('retry-after', 30)))
        print(f"Rate limit exceeded. Sleeping for {sleep_duration} seconds; no context was generated.")
        time.sleep(sleep_duration)
        return None
    except openai.APIError as e:
        print(f"Failed to generate prompt due to error: {e}")
        return None


In [None]:
def save_context_to_file(context, filename):
  # Create the directory if it does not exist yet, then always write the file
  directory = "contents"
  os.makedirs(directory, exist_ok=True)
  filepath = os.path.join(directory, filename)
  with open(filepath, 'w', encoding='utf-8') as f:
      f.write(context)
  print(f"Context saved to {filepath}")

In [None]:
def load_context_gpt(seed_keyword, expand_kws):
    context = generate_context_gpt(seed_keyword, expand_kws)
    if context:
        context_filename = f"{seed_keyword}_context.txt"
        save_context_to_file(context, context_filename)
        return context
    else:
        return None

In [None]:
# Generate embeddings for a list of texts using FastEmbed
def generate_embeddings(texts):
    model = TextEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
    return list(model.embed(texts))

In [None]:
# Cluster embeddings using K-Means
def cluster_embeddings(embeddings, num_clusters=5):
    kmeans = KMeans(n_clusters=num_clusters)  # Adjust the number of clusters as appropriate
    return kmeans.fit_predict(embeddings)

In [None]:
def contextual_relevance_analysis(keyword_embeddings, context_embedding, clusters):
    # Calculate cosine similarity between keyword embeddings and the context embedding
    # Note: context_embedding should be reshaped to fit the expected dimensionality if it's a single vector
    similarities = cosine_similarity(keyword_embeddings, context_embedding.reshape(1, -1))
    relevance_scores = []

    # Iterate over unique cluster IDs to calculate mean relevance score per cluster
    for cluster_id in set(clusters):
        cluster_indices = [i for i, cluster in enumerate(clusters) if cluster == cluster_id]
        # Extract similarities for keywords within the current cluster
        cluster_similarities = similarities[cluster_indices, 0]  # Assuming similarities is a 2D array
        # Calculate mean similarity score for the current cluster
        cluster_relevance_score = np.mean(cluster_similarities)
        # Store the mean relevance score along with the cluster ID
        relevance_scores.append((cluster_id, cluster_relevance_score))

    # Sort the relevance scores list by the relevance score in descending order
    relevance_scores.sort(key=lambda x: x[1], reverse=True)

    return relevance_scores
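
The per-cluster ranking computed by `contextual_relevance_analysis` can be restated in plain Python, which may make the logic easier to follow. This is a simplified sketch: `rank_clusters` is a hypothetical name, and it takes precomputed similarity scores instead of embeddings.

```python
from collections import defaultdict


def rank_clusters(similarities, clusters):
    """Return (cluster_id, mean similarity) pairs, most relevant cluster first."""
    grouped = defaultdict(list)
    for score, cluster_id in zip(similarities, clusters):
        grouped[cluster_id].append(score)
    means = [(cluster_id, sum(scores) / len(scores)) for cluster_id, scores in grouped.items()]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


# Made-up scores: cluster 0 is close to the context, cluster 1 is not.
ranking = rank_clusters([0.9, 0.7, 0.2, 0.4], [0, 0, 1, 1])
print([cluster_id for cluster_id, _ in ranking])
```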

In [None]:
def optimal_cluster_number(embeddings):
    best_score = -1
    best_n_clusters = 2
    # silhouette_score needs fewer clusters than samples, so cap at len(embeddings) - 1
    for n_clusters in range(2, min(len(embeddings) - 1, 10) + 1):
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        labels = kmeans.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_score = score
            best_n_clusters = n_clusters
    return best_n_clusters

In [None]:
# Main script logic
def main():
  openai_api_key = str(input("Please enter Open AI API Key: "))
  os.environ["OPENAI_API_KEY"] = openai_api_key

  keywords_list = load_keywords('scraped_seed_keywords_list.csv')
  print(f"Loaded {len(keywords_list)} keywords.")

  seed_keyword = str(input("Please enter seed keyword: "))
  print("Please enter expand keywords related to the seed keyword. Enter it in the format below.")
  print("For example: expand_kw1,expand_kw2 ")
  expand_kws = str(input())
  expand_kws = list(map(str, expand_kws.split(',')))
  print("Expand keywords are ", expand_kws)

  similarity_threshold = float(input("Please enter threshold (range from min: 0.1 to max: 0.8) for semantic similarity: "))
  print("Similarity threshold: ",similarity_threshold)

  context_text = load_context_gpt(seed_keyword, expand_kws)
  if not context_text:
      print(f"No context text generated for {seed_keyword}. Skipping.")
      # Stop here: the embedding step below cannot run without a context text
      return

  context_embedding = generate_embeddings([context_text])[0]
  keyword_embeddings = generate_embeddings(keywords_list)

  # Calculate similarity for all keywords
  all_keywords_similarity = cosine_similarity(keyword_embeddings, context_embedding.reshape(1, -1)).flatten()

  # Determine the optimal number of clusters, then cluster the keywords with it
  optimal_n_clusters = optimal_cluster_number(keyword_embeddings)
  clusters = KMeans(n_clusters=optimal_n_clusters, random_state=42).fit_predict(keyword_embeddings)
  print(f"Optimal number of clusters determined to be {optimal_n_clusters}.")

  # Separate selected and excluded keywords based on the similarity threshold
  results = []
  excluded_keywords_with_scores = []

  for i, keyword in enumerate(keywords_list):
      similarity_score = all_keywords_similarity[i]
      if similarity_score >= similarity_threshold:
          results.append({
              'Keyword': keyword,
              'Cluster': clusters[i],
              'Similarity to Context': similarity_score
          })
      else:
          excluded_keywords_with_scores.append((keyword, similarity_score))

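  # Summary of the keyword selection
  print(f"Selected {len(results)} keywords; excluded {len(excluded_keywords_with_scores)} (below threshold).")

  # Make sure the output folder exists before writing the Excel files
  os.makedirs('./output', exist_ok=True)
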
  if results:
      results_df = pd.DataFrame(results)
      print(f"Results for '{seed_keyword}':", results_df)
      results_df.to_excel(f'./output/clustered_keywords_{seed_keyword}.xlsx', index=False)
  else:
      print(f"No results to save for '{seed_keyword}'.")

  if excluded_keywords_with_scores:
      excluded_kw_df = pd.DataFrame(excluded_keywords_with_scores, columns=['Excluded Keywords', 'Similarity Score'])
      # print(f"Excluded Keywords for '{seed_keyword}':", excluded_kw_df)
      excluded_kw_df.to_excel(f"./output/excluded_keywords_{seed_keyword}.xlsx", index=False)
  else:
      print(f"No excluded keywords to save for '{seed_keyword}'.")

if __name__ == "__main__":
    main()
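
The selection step in `main()` keeps a keyword when the cosine similarity between its embedding and the context embedding reaches the threshold. Here is a minimal sketch of that split, using tiny made-up vectors in place of real embeddings. The vectors, keyword strings, and the `cosine` and `split_by_similarity` helpers are illustrative assumptions, not the notebook's own functions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_by_similarity(keyword_vectors, context_vector, threshold):
    """Return (selected, excluded) lists of (keyword, score) pairs."""
    selected, excluded = [], []
    for keyword, vector in keyword_vectors.items():
        score = cosine(vector, context_vector)
        (selected if score >= threshold else excluded).append((keyword, score))
    return selected, excluded


# Toy 2-dimensional "embeddings"; real embeddings have far more dimensions
toy_vectors = {
    "mortgage rates today": [1.0, 0.0],
    "mortgage calculator": [1.0, 1.0],
    "banana bread recipe": [0.0, 1.0],
}
context = [1.0, 0.0]

selected, excluded = split_by_similarity(toy_vectors, context, threshold=0.5)
print("selected:", [kw for kw, _ in selected])
print("excluded:", [kw for kw, _ in excluded])
```

A higher threshold keeps only keywords that point in nearly the same direction as the context, while a lower one lets loosely related keywords through.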


# **LDA Topic Modeling**

In [None]:
!pip install -q hdbscan wordcloud

In [None]:
directory = "LDA_output_images"
# Check if the directory already exists
if not os.path.exists(directory):
    # Create the directory
    os.makedirs(directory)
    print("Directory created successfully!")
else:
    print("Directory already exists!")

In [None]:
directory = "LDA_output_files"
# Check if the directory already exists
if not os.path.exists(directory):
    # Create the directory
    os.makedirs(directory)
    print("Directory created successfully!")
else:
    print("Directory already exists!")

In [None]:
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import hdbscan
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tqdm import tqdm
import warnings
import nltk

In [None]:
# Download the NLTK data and build the helpers once, rather than on every call
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words('english'))

def preprocess_keyword(keyword):
    # Lowercase, drop stop words, and lemmatize the remaining words
    words = keyword.lower().split()
    return ' '.join(LEMMATIZER.lemmatize(word) for word in words if word not in STOP_WORDS)

In [None]:
def read_whole_list(filename):
    print(f"Reading whole list from {filename}")
    df = pd.read_csv(filename)
    # Skip empty cells and coerce everything to text before preprocessing
    return [preprocess_keyword(kw) for kw in df['scraped keywords'].dropna().astype(str)]

In [None]:
def cluster_keywords(whole_list, min_cluster_size, min_samples, cluster_selection_epsilon):
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(whole_list)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples,
                                cluster_selection_epsilon=cluster_selection_epsilon)
    cluster_labels = clusterer.fit_predict(tfidf_matrix)

    df_clusters = pd.DataFrame({'Keyword': whole_list, 'Cluster': cluster_labels})
    return df_clusters, tfidf_matrix, vectorizer

In [None]:
def topic_modeling(tfidf_matrix, n_components, max_iter, learning_method):
    lda = LatentDirichletAllocation(n_components=n_components, max_iter=max_iter,
                                    learning_method=learning_method, random_state=42)
    lda.fit(tfidf_matrix)
    return lda

In [None]:
def visualize_clusters(df_clusters, tfidf_matrix):
    tsne = TSNE(n_components=2, random_state=42)
    tsne_embeddings = tsne.fit_transform(tfidf_matrix.toarray())

    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(tsne_embeddings[:, 0], tsne_embeddings[:, 1], c=df_clusters['Cluster'], cmap='viridis')
    plt.colorbar(scatter)
    plt.xlabel('t-SNE Dimension 1')
    plt.ylabel('t-SNE Dimension 2')
    plt.title('Keyword Clusters')
    plt.savefig('./LDA_output_images/keyword_clusters.png')
    plt.close()

    cluster_sizes = df_clusters['Cluster'].value_counts()
    plt.figure(figsize=(12, 6))
    plt.bar(cluster_sizes.index, cluster_sizes.values)
    plt.xlabel('Cluster')
    plt.ylabel('Number of Keywords')
    plt.title('Keyword Cluster Sizes')
    plt.savefig('./LDA_output_images/cluster_sizes.png')
    plt.close()

In [None]:
def visualize_topics(lda, vectorizer, n_top_words=20):
    feature_names = vectorizer.get_feature_names_out()  # look up the vocabulary once
    for topic_idx, topic in enumerate(lda.components_):
        top_keywords = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
        wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(top_keywords))
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.title(f'Topic {topic_idx + 1}')
        plt.savefig(f'./LDA_output_images/topic_{topic_idx + 1}_wordcloud.png')
        plt.close()

In [None]:
def create_output_files(df_clusters, lda, vectorizer):
    n_top_words = 10

    # Assign topics to keywords
    topic_probs = lda.transform(vectorizer.transform(df_clusters['Keyword']))
    df_clusters['Topic'] = topic_probs.argmax(axis=1) + 1

    # Keyword Clusters worksheet
    df_keyword_clusters = df_clusters[['Keyword', 'Cluster', 'Topic']]
    df_keyword_clusters.to_excel('./LDA_output_files/output.xlsx', sheet_name='Keyword Clusters', index=False)

    # Cluster Summary worksheet
    cluster_summary = df_clusters.groupby('Cluster')['Keyword'].agg(['count', lambda x: ', '.join(x.head(5))]).reset_index()
    cluster_summary.columns = ['Cluster', 'Number of Keywords', 'Top Keywords']
    with pd.ExcelWriter('./LDA_output_files/output.xlsx', engine='openpyxl', mode='a') as writer:
        cluster_summary.to_excel(writer, sheet_name='Cluster Summary', index=False)

    # Topic Summary worksheet
    feature_names = vectorizer.get_feature_names_out()
    topic_keywords = pd.DataFrame({'Topic': range(1, lda.n_components + 1)})
    topic_keywords['Top Keywords'] = topic_keywords['Topic'].apply(
        lambda x: ', '.join(feature_names[i] for i in lda.components_[x - 1].argsort()[:-n_top_words - 1:-1]))
    # Topics with no assigned keywords are missing from value_counts(), so fill them with 0
    topic_counts = df_clusters['Topic'].value_counts().reindex(topic_keywords['Topic'], fill_value=0)
    topic_keywords['Number of Keywords'] = topic_counts.values
    with pd.ExcelWriter('./LDA_output_files/output.xlsx', engine='openpyxl', mode='a') as writer:
        topic_keywords.to_excel(writer, sheet_name='Topic Summary', index=False)

    # Cluster-Topic Overlap worksheet
    cluster_topic_overlap = pd.crosstab(df_clusters['Cluster'], df_clusters['Topic'])
    with pd.ExcelWriter('./LDA_output_files/output.xlsx', engine='openpyxl', mode='a') as writer:
        cluster_topic_overlap.to_excel(writer, sheet_name='Cluster-Topic Overlap')
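
Each keyword is assigned to the topic with the highest probability, and with many topics some of them can end up with no keywords at all. Below is a small standard-library sketch of that assignment, plus a per-topic count that keeps empty topics at zero. The probability rows and the `assign_topics` and `count_per_topic` names are made up for illustration.

```python
from collections import Counter


def assign_topics(topic_probs):
    """1-based index of the most probable topic for each row."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in topic_probs]


def count_per_topic(assignments, n_topics):
    """Keyword count for every topic 1..n_topics, including empty ones."""
    counts = Counter(assignments)
    return [counts.get(topic, 0) for topic in range(1, n_topics + 1)]


# Made-up topic probabilities for four keywords over three topics
topic_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1],
]

assignments = assign_topics(topic_probs)
print(assignments)
print(count_per_topic(assignments, n_topics=3))
```

Counting over the full topic range, rather than only over the topics that appear, keeps the summary aligned with the list of topics even when one of them is empty.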

In [None]:
def main():
    whole_list = read_whole_list('scraped_seed_keywords_list.csv')
    print(f"Read {len(whole_list)} keywords from the whole list")

    min_cluster_size = 6
    min_samples = 5
    cluster_selection_epsilon = 0.5

    print("Clustering keywords...")
    df_clusters, tfidf_matrix, vectorizer = cluster_keywords(whole_list, min_cluster_size, min_samples, cluster_selection_epsilon)

    n_components = 30
    max_iter = 100
    learning_method = 'online'

    print("Performing topic modeling...")
    lda = topic_modeling(tfidf_matrix, n_components, max_iter, learning_method)

    print("Visualizing clusters and topics...")
    visualize_clusters(df_clusters, tfidf_matrix)
    visualize_topics(lda, vectorizer)

    print("Creating output files...")
    create_output_files(df_clusters, lda, vectorizer)

    print("Done!")

if __name__ == '__main__':
    main()