# SEO Extraction
## Step 1: Reading content

In [1]:
!pip install --upgrade openai



In [2]:
from openai import OpenAI
import random
import os
from dotenv import load_dotenv
import json

In [3]:
def read_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return [json.loads(line) for line in file]

examples = read_jsonl("../example_docs/blog_posts.jsonl")

In [4]:
print(examples[0]['completion'])

Title: 2 Money Moves We’re Making TODAY to Prepare for a Potential Recession


Content: 




In This Article





Are we in a recession? Are we headed for a recession? No one knows for certain, but you can never be too prepared for an economic downturn. Are you saving money? Do you have a plan in the event you lose your job? In today’s episode, we’ll help prepare you for anything that might be thrown your way!
Welcome back to the BiggerPockets Money podcast! Amidst economic uncertainty, there are two steps you must take to weather tough times: build an emergency fund and brace for a potential layoff. Today, Mindy and guest co-host Amanda Wolfe are bringing you their best money tips for getting through a recession. First, they’ll show you how to pad your emergency fund by saving hundreds on groceries each month, negotiating your bills, and eliminating unnecessary expenses from your budget. Believe it or not, it might even be time to cut back on aggressive debt paydown or extra 401(k) co

## Step 2: Inputting to GPT

In [17]:
load_dotenv()
gpt = os.getenv('gpt_token')
org = os.getenv('gpt_org')
client = OpenAI(api_key=gpt, organization=org)

master_prompt = """
    You are a real estate blog post reader. However, your main goal is to figure out the SEO terms that were used in the real estate blogs
    that you read. Any time you see an SEO term or an SEO phrase, you will extract that and then place it into an array list. Also please ensure
    that you are ONLY extracting terms that are for real estate. The structure of your response should simply be:
    [...(seo terms you find)]
    """
user_prompt = f"""
    Hello there. I have a real estate blog post and I would like to figure out which SEO terms were used in this post. May you please extract
    those terms and deliver them to me in an array list. This is the blog post I have: {examples[0]['completion']}. Please tell me the SEO terms and 
    phrases that were used.
    """

response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": master_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                max_tokens=2000,
                temperature=0.0,
)
print(response)

ChatCompletion(id='chatcmpl-9r4f07yOnoercVLjq1KiFAmyUnIB7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='["real estate blog post", "BiggerPockets Money podcast", "economic downturn", "emergency fund", "job search", "market pay rate", "current company", "financial freedom", "financial house in order", "emergency fund calculator", "high yield savings account", "401(k) contributions", "job market", "resume writer", "salary transparency", "market research analysis", "Glassdoor", "LinkedIn", "job interview", "new job search", "BiggerPockets Money Facebook Group", "BiggerPockets Forums", "BiggerPockets Podcasts", "BiggerPockets partner"]', role='assistant', function_call=None, tool_calls=None))], created=1722436326, model='gpt-4o-2024-05-13', object='chat.completion', service_tier=None, system_fingerprint='fp_4e2b2da518', usage=CompletionUsage(completion_tokens=128, prompt_tokens=10759, total_tokens=10887))
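
The printed response carries `finish_reason='stop'`, which means the model ended the reply on its own. A value of `length` would instead mean the reply ran into `max_tokens`, so the list could be cut off mid-item and fail to parse later. A minimal sketch of a check follows; `reply_was_truncated` is a hypothetical helper name, and `SimpleNamespace` objects stand in for a real response so the example runs without an API call.

```python
from types import SimpleNamespace

def reply_was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit.

    Hypothetical helper (not part of the openai package); it only reads the
    `finish_reason` field that appears in the printed response above.
    """
    return response.choices[0].finish_reason == "length"

# Stand-in objects shaped like the printed ChatCompletion (assumption: only
# the fields used by the helper are modelled here).
complete = SimpleNamespace(choices=[SimpleNamespace(finish_reason="stop")])
cut_off = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])

print(reply_was_truncated(complete))  # False
print(reply_was_truncated(cut_off))   # True
```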


In [18]:
print(response.choices[0].message.content)

["real estate blog post", "BiggerPockets Money podcast", "economic downturn", "emergency fund", "job search", "market pay rate", "current company", "financial freedom", "financial house in order", "emergency fund calculator", "high yield savings account", "401(k) contributions", "job market", "resume writer", "salary transparency", "market research analysis", "Glassdoor", "LinkedIn", "job interview", "new job search", "BiggerPockets Money Facebook Group", "BiggerPockets Forums", "BiggerPockets Podcasts", "BiggerPockets partner"]


In [19]:
import ast

actual_list = ast.literal_eval(response.choices[0].message.content)
print(len(actual_list))

24
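
`ast.literal_eval` works here because the reply is a bare list literal. Replies are not guaranteed to stay that clean, so a more forgiving parser may be useful. The sketch below is one possible approach, not the notebook's method: `parse_term_list` is a hypothetical helper that strips an optional Markdown code fence, tries JSON first, falls back to a Python literal, and returns an empty list when neither works.

```python
import ast
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the source readable

def parse_term_list(reply_text):
    """Parse a model reply that should contain a list of strings.

    Hypothetical helper: returns [] instead of raising when the reply
    cannot be parsed as a list.
    """
    text = reply_text.strip()
    # Drop a surrounding code fence (with or without a language tag) if present.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            return []
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]

print(parse_term_list('["emergency fund", "job market"]'))
print(parse_term_list("['emergency fund', 'job market']"))
print(parse_term_list('["emergency fund", "job market"'))  # malformed reply
```

Returning an empty list keeps a batch run going when one reply is malformed; logging the raw text at that point would make such failures easier to inspect.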


In [20]:
user_prompt = f"""
    Hello there. I have a real estate blog post and I would like to figure out which SEO terms were used in this post. May you please extract
    those terms and deliver them to me in an array list. This is the blog post I have: {examples[1]['completion']}. Please tell me the SEO terms and 
    phrases that were used.
    """

response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": master_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                max_tokens=2000,
                temperature=0.0,
)
print(response)

ChatCompletion(id='chatcmpl-9r4fvauQi3tBr6HBQkKQZ13zItyXB', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='[\n    "real estate agents",\n    "Department of Justice (DOJ)",\n    "commission structures",\n    "homebuyers",\n    "agent fees",\n    "traditional commission rates",\n    "industry practices",\n    "real estate commissions",\n    "buyer’s agent",\n    "listing protocols",\n    "National Association of Realtors",\n    "California Association of Realtors",\n    "home sales",\n    "real estate companies",\n    "residential real estate",\n    "real estate business",\n    "real estate attorney",\n    "real estate investors",\n    "home prices",\n    "closing costs",\n    "MLS (Multiple Listing Service)",\n    "real estate market",\n    "real estate investing",\n    "investment strategies",\n    "real estate listings",\n    "real estate trends",\n    "home construction",\n    "loan modifications",\n    "real estate shows",\n    "

In [21]:
print(response.choices[0].message.content)

[
    "real estate agents",
    "Department of Justice (DOJ)",
    "commission structures",
    "homebuyers",
    "agent fees",
    "traditional commission rates",
    "industry practices",
    "real estate commissions",
    "buyer’s agent",
    "listing protocols",
    "National Association of Realtors",
    "California Association of Realtors",
    "home sales",
    "real estate companies",
    "residential real estate",
    "real estate business",
    "real estate attorney",
    "real estate investors",
    "home prices",
    "closing costs",
    "MLS (Multiple Listing Service)",
    "real estate market",
    "real estate investing",
    "investment strategies",
    "real estate listings",
    "real estate trends",
    "home construction",
    "loan modifications",
    "real estate shows",
    "luxury listings",
    "real estate income",
    "housing market",
    "social media",
    "selling homes",
    "buying homes",
    "Agent Finder",
    "investor-friendly real estate agents",


## Running the whole thing

In [5]:
from openai import OpenAI
import random
import os
from dotenv import load_dotenv
import json
import ast

load_dotenv()
gpt = os.getenv('gpt_token')
org = os.getenv('gpt_org')
client = OpenAI(api_key=gpt, organization=org)

def read_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return [json.loads(line) for line in file]

examples = read_jsonl("../example_docs/blog_posts.jsonl")

master_seo_list = []

master_prompt = """
    You are a real estate blog post reader. However, your main goal is to figure out the SEO terms that were used in the real estate blogs
    that you read. Any time you see an SEO term or an SEO phrase, you will extract that and then place it into an array list. Also please ensure
    that you are ONLY extracting terms that are for real estate. The structure of your response should simply be:
    [...(seo terms you find)]
    """
for i in range(0, len(examples)):
    print(i, end=" ")
    user_prompt = f"""
    Hello there. I have a real estate blog post and I would like to figure out which SEO terms were used in this post. May you please extract
    those terms and deliver them to me in an array list. This is the blog post I have: {examples[i]['completion']}. Please tell me the SEO terms and 
    phrases that were used.
    """
    print("talking...", end=" ")
    response = client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": master_prompt},
                        {"role": "user", "content": user_prompt},
                    ],
                    max_tokens=4096,
                    temperature=0.0,
    )
    
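    try:
        actual_list = ast.literal_eval(response.choices[0].message.content)
    except (ValueError, SyntaxError) as e:
        # Skip this post. Without the `continue`, the previous post's list
        # would still be in `actual_list` and would be appended a second time.
        print("Error:", e)
        continue
    print("responded")
    master_seo_list.extend(actual_list)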

print(len(master_seo_list))


0 talking... responded
1 talking... responded
2 talking... responded
3 talking... responded
4 talking... responded
5 talking... responded
6 talking... responded
7 talking... responded
8 talking... responded
9 talking... responded
10 talking... responded
11 talking... responded
12 talking... responded
13 talking... responded
14 talking... responded
15 talking... responded
16 talking... responded
17 talking... responded
18 talking... responded
19 talking... responded
20 talking... Error: invalid syntax. Perhaps you forgot a comma? (<unknown>, line 1)
21 talking... responded
22 talking... responded
23 talking... responded
24 talking... responded
25 talking... responded
26 talking... responded
27 talking... responded
28 talking... responded
29 talking... responded
30 talking... responded
31 talking... responded
32 talking... responded
33 talking... responded
34 talking... responded
35 talking... responded
36 talking... responded
37 talking... responded
38 talking... responded
39 talking...

In [6]:
print(master_seo_list)

['real estate blog post', 'BiggerPockets Money podcast', 'economic downturn', 'emergency fund', 'job search', 'market pay rate', 'current company', 'financial freedom', 'financial house in order', 'emergency fund calculator', 'high yield savings account', '401(k) contributions', 'layoff', 'job layoff', 'market value', 'salary transparency', 'job market', 'resume writer', 'career counseling', 'LinkedIn', 'job search tips', 'financial audit', 'no spend month', 'high yield savings account', 'credit card debt', 'mortgage payments', 'retirement contributions', 'job interview', 'Glassdoor', 'salary range', 'company culture', 'LinkedIn employees', 'job search advice', 'real estate agents', 'Department of Justice (DOJ)', 'commission structures', 'homebuyers', 'agent fees', 'traditional commission rates', 'industry practices', 'real estate commissions', 'buyer’s agent', 'listing protocols', 'National Association of Realtors', 'California Association of Realtors', 'real estate companies', 'home 
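
The combined list contains repeats (for example, 'high yield savings account' appears twice above), since each post is processed independently. A minimal sketch of order-preserving deduplication follows; `dedupe_terms` is a hypothetical helper, and treating terms that differ only in case or surrounding whitespace as duplicates is an assumption, not something the notebook specifies.

```python
def dedupe_terms(terms):
    """Remove repeated terms while keeping first-seen order.

    Hypothetical helper. Terms are compared case-insensitively after
    stripping whitespace (an assumption); the first spelling seen is kept.
    """
    seen = set()
    unique = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique

sample = [
    "emergency fund",
    "high yield savings account",
    "LinkedIn",
    "high yield savings account",
    "linkedin",
]
print(dedupe_terms(sample))
# ['emergency fund', 'high yield savings account', 'LinkedIn']
```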

In [12]:
import random

seo_topics = []
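for i in range(0, 10):
    # random.randint(0, len(...)) includes the upper bound and can raise IndexError;
    # random.choice always picks a valid element.
    seo_topics.append(random.choice(master_seo_list))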
print(seo_topics)

['housing rule', 'employment growth', 'Cape Coral, Florida', 'American investors', 'real estate politics', 'rental demand', 'unemployment rate', 'real estate prices', 'homeowners', 'rent increase']


In [13]:
with open('../example_docs/seo_tokens.txt', 'w', encoding='utf-8') as file:
    for item in master_seo_list:
        file.write(f"{item}\n")

## Importing from file and using for examples

In [18]:
seo_terms = []
with open('../example_docs/seo_tokens.txt', 'r', encoding='utf-8') as file:
    seo_terms = [line.strip() for line in file]

import random

seo_topics = []
for i in range(0, 10):
    # random.choice avoids the off-by-one in random.randint(0, len(seo_terms)),
    # whose upper bound is inclusive.
    seo_topics.append(random.choice(seo_terms))
print(seo_topics)

['Mortgage Refinances', 'home sales', 'supply and demand', 'city’s affordable housing', 'local property', 'real estate investors', 'growth', 'mortgage bankers', 'construction costs', 'housing market']
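
Picking one term at a time can return the same term more than once. If ten distinct topics are wanted, sampling without replacement is one option. The sketch below assumes that goal; `pick_topics` and its `seed` parameter are hypothetical names added for illustration.

```python
import random

def pick_topics(terms, k=10, seed=None):
    """Pick up to k distinct terms.

    Hypothetical helper. Blank entries and exact repeats are dropped first;
    passing a `seed` makes the draw repeatable.
    """
    unique_terms = list(dict.fromkeys(t for t in terms if t))
    rng = random.Random(seed)
    # sample() never returns the same element twice; cap k at the pool size.
    return rng.sample(unique_terms, min(k, len(unique_terms)))

terms = ["home sales", "housing market", "home sales", "", "construction costs", "rental demand"]
topics = pick_topics(terms, k=3, seed=0)
print(topics)
```

Using a dedicated `random.Random` instance keeps the seeded draw from affecting other code that relies on the module-level generator.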
