<a href="https://colab.research.google.com/github/searchsolved/search-solved-public-seo/blob/main/striking_distance_creator/striking_distance_creator_(csv_version).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Striking Distance Report Creator v1 (CSV Version)
October 2021


In [None]:
!pip install pandas



In [None]:
import pandas as pd
from pandas import DataFrame, Series
from typing import Union
from google.colab import files

# Set Variables

In [None]:
# set all variables here
min_volume = 1  # set the minimum search volume
min_position = 4  # keywords ranking at this position or better are excluded / default = 4
max_position = 20  # keywords ranking at this position or worse are excluded / default = 20
drop_all_true = True  # If all checks (h1/title/copy) are true, remove the recommendation (Nothing to do)
pagination_filters = "filterby|page|p="  # filter patterns used to detect and drop paginated pages
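The cleaning step later in this notebook drops rows where `Position <= min_position` or `Position >= max_position`, so with the defaults only positions 5 to 19 are kept. A small sketch of that window on made-up positions (`in_striking_distance` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

min_position = 4
max_position = 20


def in_striking_distance(positions: pd.Series) -> pd.Series:
    """True where a position lies strictly between min_position and max_position."""
    return (positions > min_position) & (positions < max_position)


# invented ranking positions for illustration
positions = pd.Series([1, 4, 5, 12, 19, 20, 35])
kept = positions[in_striking_distance(positions)].tolist()
print(kept)  # [5, 12, 19]
```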

# Upload the Keyword Export File from Ahrefs or SEMrush

*   This file should be a report of all the keywords a site is ranking for.
*   Ahrefs (v1 and v2) and SEMrush keyword exports can be uploaded without modification.
*   .csv file in UTF-8 Format

For any other keyword exports, your csv file needs to contain the following column names:
```URL```
```Keyword```
```Volume```
```Position```
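Before uploading a custom export, it can help to confirm that the four column names above are present. A minimal sketch, assuming the file has already been read into a dataframe; `missing_columns` and the sample data are hypothetical and not part of the notebook:

```python
import pandas as pd

REQUIRED_KEYWORD_COLUMNS = ["URL", "Keyword", "Volume", "Position"]


def missing_columns(df: pd.DataFrame, required: list) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# invented export that uses 'Searches' instead of 'Volume'
sample = pd.DataFrame(
    {
        "URL": ["https://example.com/a"],
        "Keyword": ["virtual office"],
        "Searches": [3500],
        "Position": [6],
    }
)

print(missing_columns(sample, REQUIRED_KEYWORD_COLUMNS))  # ['Volume']
```

An empty list means the file should pass through the cleaning cells without renaming.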

In [None]:
# upload the keyword export
upload = files.upload()
upload = list(upload.keys())[0]  # get the name of the uploaded file
df_keywords = pd.read_csv(
    upload,
    on_bad_lines="skip",  # skip malformed rows (replaces error_bad_lines=False, removed in pandas 2.0)
    low_memory=False,
    encoding="utf8",
    dtype={
        "URL": "str",
        "Keyword": "str",
        "Volume": "str",
        "Position": int,
        "Current URL": "str",
        "Search Volume": int,
    },
)

print("Uploaded Keyword CSV File Successfully!")
print(df_keywords)

Saving keywords.csv to keywords (4).csv
Uploaded Keyword CSV File Successfully!
                           Keyword Volume  Current position  \
0            office spaces near me   1000                 1   
1    office space for rent near me    450                 1   
2              office space london   1700                 4   
3          serviced office near me   1000                 2   
4                   virtual office   3500                 6   
..                             ...    ...               ...   
290               office cambridge    200                 3   
291           workspace farringdon    300                 9   
292              office colchester    250                 5   
293         office space guildford    100                 4   
294                office for rent    150                 6   

                                           Current URL  
0    https://www.instantoffices.com/en/gb/serviced-...  
1    https://www.instantoffices.com/en/gb/service





# Upload the Crawl Export

If you'd like to check whether keywords appear in the page copy (you should!), we recommend setting up a custom extraction as shown in the image below:

1.   The extractor MUST be named 'Copy', as below
2.   'Extract Text' must be chosen from the drop-down

![](https://drive.google.com/uc?export=view&id=16SVAm_k3QwYe9PuZGq3xywJ2smdeTMN0)

For any other crawler, your csv file needs to contain the following column names: `Address` `Title 1` `H1-1` `Copy 1` `Indexability` (Optional)
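For a crawler other than Screaming Frog, the export can be renamed to the expected headers before upload. A minimal sketch; the source column names (`url`, `page_title`, `h1`, `body_text`) are invented for illustration and will differ from tool to tool:

```python
import pandas as pd

# hypothetical mapping from another crawler's headers to the names this notebook expects
column_map = {
    "url": "Address",
    "page_title": "Title 1",
    "h1": "H1-1",
    "body_text": "Copy 1",
}

# invented one-row crawl export
other_crawl = pd.DataFrame(
    {
        "url": ["https://example.com/"],
        "page_title": ["Example Title"],
        "h1": ["Example Heading"],
        "body_text": ["Example body copy."],
    }
)

renamed = other_crawl.rename(columns=column_map)
print(list(renamed.columns))  # ['Address', 'Title 1', 'H1-1', 'Copy 1']
```

The renamed dataframe can then be saved with `renamed.to_csv(..., index=False)` and uploaded in the next cell.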


In [None]:
# upload the crawl export from Screaming Frog
upload = files.upload()
upload = list(upload.keys())[0]  # get the name of the uploaded file
df_crawl = pd.read_csv(
    upload,
    on_bad_lines="skip",  # skip malformed rows (replaces error_bad_lines=False, removed in pandas 2.0)
    low_memory=False,
    encoding="utf8",
    dtype="str",
)

print("Uploaded Crawl Dataframe Successfully!")
print(df_crawl.head())

Saving internal_html_2.csv to internal_html_2 (1).csv
Uploaded Crawl Dataframe Successfully!
                                             Address Indexability  \
0  https://www.instantoffices.com/blog/reports-an...    Indexable   
1  https://www.instantoffices.com/blog/business-g...    Indexable   
2  https://www.instantoffices.com/blog/featured/a...    Indexable   
3  https://www.instantoffices.com/blog/featured/n...    Indexable   
4  https://www.instantoffices.com/blog/featured/s...    Indexable   

                                             Title 1  \
0  The Changing Face of the Flexible Office Marke...   
1  Step-by-Step Office Relocation Checklist - Ins...   
2  Average Wage UK: What Salary Should You Be Ear...   
3  What Amenities Most Benefit Employee Experienc...   
4  Five signs your company has a toxic culture – ...   

                                                H1-1  \
0  Supply of Flexible Office Space in the UK Has ...   
1           The Step-by-Step Office Moving 





# Clean the Keyword Dataframe

In [None]:
print(df_crawl)

                                               Address Indexability  \
0    https://www.instantoffices.com/blog/reports-an...    Indexable   
1    https://www.instantoffices.com/blog/business-g...    Indexable   
2    https://www.instantoffices.com/blog/featured/a...    Indexable   
3    https://www.instantoffices.com/blog/featured/n...    Indexable   
4    https://www.instantoffices.com/blog/featured/s...    Indexable   
..                                                 ...          ...   
358  https://www.instantoffices.com/blog/business-g...    Indexable   
359  https://www.instantoffices.com/blog/instant-of...    Indexable   
360  https://www.instantoffices.com/blog/instant-of...    Indexable   
361  https://www.instantoffices.com/blog/featured/c...    Indexable   
362  https://www.instantoffices.com/blog/business-g...    Indexable   

                                               Title 1  \
0    The Changing Face of the Flexible Office Marke...   
1    Step-by-Step Office Reloca

In [None]:
# standardise the column names between the different input files
df_keywords.rename(
    columns={
        "Current position": "Position",
        "Current URL": "URL",
        "Search Volume": "Volume",
        "Volume (desc)": "Volume",
        "Volume (asc)": "Volume",
    },
    inplace=True,
)

# keep only the following columns from the keyword dataframe
cols = "URL", "Keyword", "Volume", "Position"
df_keywords = df_keywords.reindex(columns=cols)

try:
    # clean the data. (v1 of the ahrefs keyword export combines strings and ints in the volume column)
    df_keywords["Volume"] = df_keywords["Volume"].str.replace("0-10", "0")
except AttributeError:
    pass

# clean the keyword data
df_keywords = df_keywords[df_keywords["URL"].notna()]  # remove any missing values
df_keywords = df_keywords[df_keywords["Volume"].notna()]  # remove any missing values
df_keywords = df_keywords.astype({"Volume": int})  # change data type to int
df_keywords = df_keywords.sort_values(by="Volume", ascending=False)  # sort by highest vol to keep the top opportunity

# make new dataframe to merge search volume back in later
df_keyword_vol = df_keywords[["Keyword", "Volume"]]

# drop rows whose search volume is below min_volume
try:
    df_keywords.loc[df_keywords["Volume"] < min_volume, "Volume_Too_Low"] = "drop"
    df_keywords = df_keywords[~df_keywords["Volume_Too_Low"].isin(["drop"])]
except ValueError:
    print("Please Check if 'Volume' Column is Named Correctly!")

# drop rows that already rank at min_position or better
df_keywords.loc[df_keywords["Position"] <= min_position, "Position_Too_High"] = "drop"
df_keywords = df_keywords[~df_keywords["Position_Too_High"].isin(["drop"])]

# drop rows that rank at max_position or worse
df_keywords.loc[df_keywords["Position"] >= max_position, "Position_Too_Low"] = "drop"
df_keywords = df_keywords[~df_keywords["Position_Too_Low"].isin(["drop"])]

print(df_keywords)

                                                   URL  \
4    https://www.instantoffices.com/en/gb/serviced-...   
30   https://www.instantoffices.com/en/gb/serviced-...   
80   https://www.instantoffices.com/en/gb/serviced-...   
138  https://www.instantoffices.com/en/gb/serviced-...   
5    https://www.instantoffices.com/en/gb/serviced-...   
..                                                 ...   
211  https://www.instantoffices.com/en/gb/serviced-...   
202  https://www.instantoffices.com/en/gb/serviced-...   
254               https://www.instantoffices.com/en/gb   
294               https://www.instantoffices.com/en/gb   
118  https://www.instantoffices.com/en/gb/serviced-...   

                         Keyword  Volume  Position Volume_Too_Low  \
4                 virtual office    3500         6            NaN   
30            rent office london    2300        12            NaN   
80         london office to rent    2300        14            NaN   
138    rent office space ne
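The volume clean-up in the cell above can be seen on a small made-up sample: the Ahrefs v1 range string `0-10` becomes `0`, rows with a missing volume are removed, and the column is cast to integers. This is only a sketch of those three steps, not a replacement for the cell:

```python
import pandas as pd

# invented keyword rows; "0-10" mimics the Ahrefs v1 range value
sample = pd.DataFrame(
    {
        "Keyword": ["kw a", "kw b", "kw c"],
        "Volume": ["0-10", "250", None],
    }
)

# regex=False treats "0-10" as a literal string rather than a pattern
sample["Volume"] = sample["Volume"].str.replace("0-10", "0", regex=False)
sample = sample[sample["Volume"].notna()]  # remove missing volumes
sample = sample.astype({"Volume": int})  # strings -> integers

print(sample["Volume"].tolist())  # [0, 250]
```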

# Clean the Crawl Dataframe

In [None]:
# keep only the following columns from the crawl dataframe
cols = "Address", "Indexability", "Title 1", "H1-1", "Copy 1"
df_crawl = df_crawl.reindex(columns=cols)

# drop non-indexable rows
df_crawl = df_crawl[~df_crawl["Indexability"].isin(["Non-Indexable"])]

# drop pagination (na=False keeps the mask boolean if an address is missing)
df_crawl = df_crawl[~df_crawl["Address"].str.contains(pagination_filters, na=False)]

# standardise the column names
df_crawl = df_crawl.rename(columns={"Address": "URL", "Title 1": "Title", "H1-1": "H1", "Copy 1": "Copy"})
df_crawl.head()

Unnamed: 0,URL,Indexability,Title,H1,Copy
0,https://www.instantoffices.com/blog/reports-an...,Indexable,The Changing Face of the Flexible Office Marke...,Supply of Flexible Office Space in the UK Has ...,The flexible office industry was increasing at...
1,https://www.instantoffices.com/blog/business-g...,Indexable,Step-by-Step Office Relocation Checklist - Ins...,The Step-by-Step Office Moving Checklist,"\n\nThe way we work has forever changed, and m..."
2,https://www.instantoffices.com/blog/featured/a...,Indexable,Average Wage UK: What Salary Should You Be Ear...,Average Wage UK: What Salary Should You Be Ear...,\n\n \nYour age isn’t the only number that inc...
3,https://www.instantoffices.com/blog/featured/n...,Indexable,What Amenities Most Benefit Employee Experienc...,Nine amenities that will instantly improve emp...,"Between WFH, the great resignation and the eme..."
4,https://www.instantoffices.com/blog/featured/s...,Indexable,Five signs your company has a toxic culture – ...,Five Signs Your Company Has a Toxic Culture,\n\nOnline searches for ‘toxic workplace’ incr...
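`pagination_filters` is applied as a regular expression, so `filterby|page|p=` matches any address containing one of those three substrings. A sketch on invented URLs suggests that this can also catch non-paginated addresses such as `/landing-page/`, so the pattern may need tightening for some sites:

```python
import pandas as pd

pagination_filters = "filterby|page|p="

# invented addresses for illustration
urls = pd.Series(
    [
        "https://example.com/offices",
        "https://example.com/offices?page=2",
        "https://example.com/offices?filterby=price",
        "https://example.com/offices?p=3",
        "https://example.com/landing-page/",
    ]
)

# True where any of the three patterns occurs in the address
is_paginated = urls.str.contains(pagination_filters, regex=True, na=False)
remaining = urls[~is_paginated].tolist()
print(remaining)  # ['https://example.com/offices']
```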


# Group the Keywords

In [None]:
# group the keywords by URL (removes the duplicate URLs and combines their stats)
# work on a copy of the keywords dataframe so stats can be merged back in later from the original dataframe
df_keywords_group = df_keywords.copy()
df_keywords_group["KWs in Striking Dist."] = 1  # used to count the number of keywords in striking distance
df_keywords_group = (
    df_keywords_group.groupby("URL")
    .agg({"Volume": "sum", "KWs in Striking Dist.": "count"})
    .reset_index()
)
df_keywords_group.head()

Unnamed: 0,URL,Volume,KWs in Striking Dist.
0,https://www.instantoffices.com/en/gb,4850,15
1,https://www.instantoffices.com/en/gb/available...,350,1
2,https://www.instantoffices.com/en/gb/available...,200,1
3,https://www.instantoffices.com/en/gb/available...,250,1
4,https://www.instantoffices.com/en/gb/available...,1000,1
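To make the grouping concrete, here is a sketch on made-up rows: each URL collapses to a single row holding the summed volume and the number of keywords in striking distance.

```python
import pandas as pd

# invented keyword rows: two keywords for /a, one for /b
rows = pd.DataFrame(
    {
        "URL": ["/a", "/a", "/b"],
        "Keyword": ["kw one", "kw two", "kw three"],
        "Volume": [100, 50, 30],
    }
)
rows["KWs in Striking Dist."] = 1  # one marker per keyword row

grouped = (
    rows.groupby("URL")
    .agg({"Volume": "sum", "KWs in Striking Dist.": "count"})
    .reset_index()
)
print(grouped)
```

Here `/a` ends up with a volume of 150 across 2 keywords and `/b` with 30 across 1.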




# Display Keywords in Adjacent Columns, Grepwords Style


In [None]:
# create a new df that combines the grouped data with the original keywords, one keyword per adjacent column (Grepwords style)
df_merged_all_kws = df_keywords_group.merge(
    df_keywords.groupby("URL")["Keyword"]
    .apply(lambda x: x.reset_index(drop=True))
    .unstack()
    .reset_index()
)
#print(df_merged_all_kws)
# sort by biggest opportunity
df_merged_all_kws = df_merged_all_kws.sort_values(
    by="KWs in Striking Dist.", ascending=False
)

# reindex the columns to keep just the top five keywords
cols = "URL", "Volume", "KWs in Striking Dist.", 0, 1, 2, 3, 4
df_merged_all_kws = df_merged_all_kws.reindex(columns=cols)

# rename the columns (the Union annotation only documents the expected type)
df_striking: Union[Series, DataFrame, None] = df_merged_all_kws.rename(
    columns={
        "Volume": "Striking Dist. Vol",
        0: "KW1",
        1: "KW2",
        2: "KW3",
        3: "KW4",
        4: "KW5",
    }
)

# merge the striking distance df with the crawl df to bring in the title, h1 and copy
# (the inner join keeps only URLs that appear in both files)
df_striking = pd.merge(df_striking, df_crawl, on="URL", how="inner")
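The keyword columns in the cell above come from `groupby(...).apply(...).unstack()`. Another way to reach the same wide layout, shown here as an alternative sketch on made-up data rather than as the notebook's method, is `cumcount` plus `pivot`:

```python
import pandas as pd

# invented keyword rows: three keywords for /a, one for /b
kws = pd.DataFrame(
    {
        "URL": ["/a", "/a", "/a", "/b"],
        "Keyword": ["kw one", "kw two", "kw three", "kw four"],
    }
)

# number each keyword within its URL: 0, 1, 2, ...
kws["rank"] = kws.groupby("URL").cumcount()

# one row per URL, one column per keyword rank
wide = kws.pivot(index="URL", columns="rank", values="Keyword").reset_index()
wide = wide.rename(columns={0: "KW1", 1: "KW2", 2: "KW3"})
wide.columns.name = None
print(wide)
```

URLs with fewer keywords than columns are padded with `NaN`, as in the notebook's version.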

In [None]:
df_striking

Unnamed: 0,URL,Striking Dist. Vol,KWs in Striking Dist.,KW1,KW2,KW3,KW4,KW5,Indexability,Title,H1,Copy
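If the merged dataframe comes back with only a header row, the keyword export and the crawl export probably share no URLs (for example, one file covers `/blog/` pages and the other does not). A sketch of one way to check the overlap, using `indicator=True` on invented URLs; the frame names here are illustrative:

```python
import pandas as pd

# invented URL lists standing in for the two uploads
striking = pd.DataFrame({"URL": ["https://example.com/a", "https://example.com/b"]})
crawl = pd.DataFrame({"URL": ["https://example.com/b", "https://example.com/blog/c"]})

# an outer merge with indicator=True labels where each URL was found
check = pd.merge(striking, crawl, on="URL", how="outer", indicator=True)
print(check["_merge"].value_counts())

# URLs present in both files are the only ones an inner join would keep
matched = check[check["_merge"] == "both"]["URL"].tolist()
print(matched)  # ['https://example.com/b']
```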


In [None]:
# set the final column order and merge the keyword data in
cols = [
    "URL",
    "Title",
    "H1",
    "Copy",
    "Striking Dist. Vol",
    "KWs in Striking Dist.",
    "KW1",
    "KW1 Vol",
    "KW1 in Title",
    "KW1 in H1",
    "KW1 in Copy",
    "KW2",
    "KW2 Vol",
    "KW2 in Title",
    "KW2 in H1",
    "KW2 in Copy",
    "KW3",
    "KW3 Vol",
    "KW3 in Title",
    "KW3 in H1",
    "KW3 in Copy",
    "KW4",
    "KW4 Vol",
    "KW4 in Title",
    "KW4 in H1",
    "KW4 in Copy",
    "KW5",
    "KW5 Vol",
    "KW5 in Title",
    "KW5 in H1",
    "KW5 in Copy",
]
# re-index the columns to place them in a logical order + inserts new blank columns for kw checks.
df_striking = df_striking.reindex(columns=cols)

In [None]:
# merge in keyword data for each keyword column (KW1 - KW5)
for i in range(1, 6):
    df_striking = pd.merge(df_striking, df_keyword_vol, left_on=f"KW{i}", right_on="Keyword", how="left")
    df_striking[f"KW{i} Vol"] = df_striking["Volume"]
    df_striking.drop(["Keyword", "Volume"], axis=1, inplace=True)
df_striking

Unnamed: 0,URL,Title,H1,Copy,Striking Dist. Vol,KWs in Striking Dist.,KW1,KW1 Vol,KW1 in Title,KW1 in H1,...,KW4,KW4 Vol,KW4 in Title,KW4 in H1,KW4 in Copy,KW5,KW5 Vol,KW5 in Title,KW5 in H1,KW5 in Copy
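Because `df_keyword_vol` is taken straight from the keyword rows, a keyword that ranks for several URLs can appear in it more than once, and each left merge then repeats the matching row (the notebook removes the repeats afterwards by dropping duplicate URLs). A sketch of a lookup that avoids the repetition, assuming it is acceptable to keep the first volume seen for each keyword; the sample data is invented:

```python
import pandas as pd

# 'kw one' is listed twice, as it would be if it ranked for two URLs
keyword_vol = pd.DataFrame(
    {"Keyword": ["kw one", "kw one", "kw two"], "Volume": [100, 100, 50]}
)
striking = pd.DataFrame({"URL": ["/a", "/b"], "KW1": ["kw one", "kw two"]})

# a plain left merge repeats the '/a' row because 'kw one' is listed twice
merged = pd.merge(striking, keyword_vol, left_on="KW1", right_on="Keyword", how="left")
print(len(merged))  # 3

# de-duplicate the lookup, then map volumes onto the keyword column
lookup = keyword_vol.drop_duplicates(subset="Keyword").set_index("Keyword")["Volume"]
striking["KW1 Vol"] = striking["KW1"].map(lookup)
print(striking)
```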


In [None]:
# drop duplicate url rows
df_striking.drop_duplicates(subset="URL", inplace=True)

# replace nan values with empty strings
df_striking = df_striking.fillna("")

# convert the title, h1 and copy to lower case so kws can be matched to them
df_striking["Title"] = df_striking["Title"].str.lower()
df_striking["H1"] = df_striking["H1"].str.lower()
df_striking["Copy"] = df_striking["Copy"].str.lower()
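The keyword checks that follow are plain substring tests on the lowercased text, so `office` also matches inside `offices`. Where whole-word matching is preferred, a regular expression with word boundaries is one option; `contains_phrase` below is a hypothetical helper, not part of the notebook:

```python
import re


def contains_phrase(phrase: str, text: str) -> bool:
    """True if phrase occurs in text as whole words (case-insensitive)."""
    if not phrase:
        return False
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


title = "serviced offices to rent in london"

print("office" in title)  # True (substring match)
print(contains_phrase("office", title))  # False (whole-word match)
print(contains_phrase("Offices to rent", title))  # True
```

Whether the stricter match is desirable depends on how closely the on-page wording is expected to mirror the keyword.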

In [None]:
# check whether a keyword appears in title, h1 or copy
# a list comprehension is used instead of DataFrame.apply(axis=1) so the cell also runs when df_striking is empty;
# rows without a keyword get "" rather than True / False
for i in range(1, 6):
    kw = f"KW{i}"
    for field in ("Title", "H1", "Copy"):
        df_striking[f"{kw} in {field}"] = [
            (k in text) if k else ""
            for k, text in zip(df_striking[kw], df_striking[field])
        ]
df_striking.head()

In [None]:
# drops rows if all values evaluate to true. (nothing for the user to do).
def true_dropper(col1, col2, col3):
    # compare with == True because the check columns also hold "" where there is no keyword
    all_true = (
        (df_striking[col1] == True)
        & (df_striking[col2] == True)
        & (df_striking[col3] == True)
    )
    return df_striking[~all_true]

if drop_all_true:
    df_striking = true_dropper("KW1 in Title", "KW1 in H1", "KW1 in Copy")
    df_striking = true_dropper("KW2 in Title", "KW2 in H1", "KW2 in Copy")
    df_striking = true_dropper("KW3 in Title", "KW3 in H1", "KW3 in Copy")
    df_striking = true_dropper("KW4 in Title", "KW4 in H1", "KW4 in Copy")
    df_striking = true_dropper("KW5 in Title", "KW5 in H1", "KW5 in Copy")
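Note that the cell above drops a URL as soon as any one of its keywords passes all three checks, even if another keyword on the same row still needs work. The sketch below, on made-up data, contrasts that with a stricter rule that only drops a row when every keyword passes; the stricter rule is an assumption for illustration, not the notebook's behaviour:

```python
import pandas as pd

# invented check results: /a still misses KW2 in its title, /b passes everything
df = pd.DataFrame(
    {
        "URL": ["/a", "/b"],
        "KW1 in Title": [True, True],
        "KW1 in H1": [True, True],
        "KW1 in Copy": [True, True],
        "KW2 in Title": [False, True],
        "KW2 in H1": [True, True],
        "KW2 in Copy": [True, True],
    }
)

kw1_done = df[["KW1 in Title", "KW1 in H1", "KW1 in Copy"]].all(axis=1)
kw2_done = df[["KW2 in Title", "KW2 in H1", "KW2 in Copy"]].all(axis=1)

# rule used by the cell above: drop when ANY keyword passes all three checks
kept_any_rule = df[~(kw1_done | kw2_done)]["URL"].tolist()

# stricter alternative (hypothetical): drop only when ALL keywords pass
kept_all_rule = df[~(kw1_done & kw2_done)]["URL"].tolist()

print(kept_any_rule)  # []
print(kept_all_rule)  # ['/a']
```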

In [None]:
df_striking.to_csv('Keywords in Striking Distance.csv', index=False)
files.download("Keywords in Striking Distance.csv")