<a href="https://colab.research.google.com/github/simodepth/internal_linking/blob/main/Explore_Interlinking_Opportunities_Using_K_Means_and_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Automate Internal Linking Discovery with Python
Internal linking is a crucial asset in SEO, but it can be hard to spot the high-level topic clusters that pages should be connected to, especially on larger sites. Well-placed internal links improve both search engine crawlability and discovery by users.

One strategy is simply to look at the existing content clusters or categories on a website. The following framework explores such opportunities by clustering a site's content using **k-means** and **sentence transformers**.



#The Framework

This Python framework explores internal linking opportunities by clustering the pages of a website by topical relevance. The output is a DataFrame with cluster, page title (`Title 1`) and URL columns, ordered by topical cluster.

**Why we care** – the output provides a handy overview of pages that could link to one another because they share a salient topic.

✅ This can be applied to very large sites


#Requirements & Assumptions

- Install `sentence-transformers` as it is an external package
- Import the `Internal_html` export of a Screaming Frog crawl and make sure you ONLY include the **Address** and **Title 1** columns


In [None]:
#@title Install sentence-transformers
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYA

In [None]:
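import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.util import ngrams
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Tokenizer models and the stopword list used by the helper functions below
# (newer NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')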

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#Load SentenceTransformer
📘 This is a machine learning model that produces sentence embeddings, which makes it suitable for tasks like clustering or semantic search.

✅ Model trained on a large corpus of sentence pairs that includes UGC platforms such as Reddit and Yahoo Answers

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')



# Define the function to create N-Grams
📘 N-grams are "a contiguous sequence of n items from a given sample of text or speech". In other words: runs of n adjacent words, e.g. single words for n=1 or word pairs for n=2.

In [None]:
def extract_ngrams(data, num):
  n_grams = ngrams(nltk.word_tokenize(data), num)
  gram_list = [ ' '.join(grams) for grams in n_grams]
  return gram_list
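
To see what the function returns, here is a minimal sketch of the same idea in plain Python. It is an illustration only: `simple_ngrams` is a hypothetical helper that splits on whitespace, whereas the notebook relies on `nltk.word_tokenize`, which also separates punctuation into its own tokens.

```python
def simple_ngrams(text, n):
    """Return every run of n adjacent whitespace-separated words in text."""
    tokens = text.split()
    # Slide a window of width n over the token list
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

title = "Google removes right-hand side ads"
unigrams = simple_ngrams(title, 1)  # single words, as used for cluster naming
bigrams = simple_ngrams(title, 2)   # word pairs
print(unigrams)
print(bigrams)
```

With `n=1` each word is its own n-gram, which is the setting the notebook uses when it names the clusters.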

In [None]:
#@title Extract n-grams and filter out stopwords
def getname(cluster):
  # Join all titles in the cluster into one string and split it into single words (1-grams)
  data = ' '.join(cluster)
  keywords = extract_ngrams(data, 1)
  stop_words = set(stopwords.words('english'))
  # Lowercase, then drop stopwords and punctuation tokens
  cluster_name = [x.lower() for x in keywords]
  cluster_name = [x for x in cluster_name if x not in stop_words]
  cluster_name = [x for x in cluster_name if x not in string.punctuation]
  # Keep only the most frequent word, returned as [(word, count)]
  cluster_name = Counter(cluster_name).most_common(1)
  return cluster_name
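
The naming logic can be tried without NLTK. The sketch below is a simplified stand-in, not the notebook's function: `name_cluster` and the tiny `STOP_WORDS` set are hypothetical, and it returns `None` for a cluster with no usable words instead of an empty list.

```python
import string
from collections import Counter

# Tiny hypothetical stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "to", "a", "from", "we", "and", "of", "in"}

def name_cluster(titles):
    """Return the most frequent non-stopword across all titles, or None."""
    words = ' '.join(titles).lower().split()
    # Trim punctuation stuck to the edges of a word
    words = [w.strip(string.punctuation) for w in words]
    words = [w for w in words if w and w not in STOP_WORDS]
    top = Counter(words).most_common(1)
    return top[0][0] if top else None

titles = [
    "We went to Facebook Blueprint Live",
    "Facebook Announces Newsfeed Updates",
    "Instagram launches personalised news feed",
]
print(name_cluster(titles))  # facebook
```

Because the name is a single most-frequent word, a cluster is labelled by its dominant term even when some titles in it never mention that term.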

In [None]:
#@title Create an empty Dataframe  
df2 = pd.DataFrame(columns = ['cluster', 'title', 'url']) #@param {type:"string"}

In [None]:
#@title Upload your CSV OR Excel File
# For a CSV export, use pd.read_csv() with your own file path instead
df = pd.read_excel("/content/Internal_html.xlsx") #@param {type:"string"}

In [None]:
#@title Filter out branding from the titles and get the Outcome
df.dropna(inplace=True)
# Raw string so that \| reaches the regex engine as a literal pipe
df['Title 1'] = df['Title 1'].replace({r' \| Fusion Unlimited':''}, regex=True) #@param {type:"string"}

corpus = df["Title 1"].tolist()
df

###Route Tip 💡 
Calling `replace` on the `Title 1` column of `df` filters the branding out of the titles, so that the brand name does not skew the clustering.

**Do not skip this step and edit according to your title structure.** 
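
As a rough illustration of why the pipe in the pattern is escaped, the sketch below applies the same regular expression with the standard-library `re` module rather than pandas. `strip_brand` and the sample title are assumptions made for the example.

```python
import re

# Same pattern as in the notebook: a literal " | Fusion Unlimited"
BRAND = re.compile(r" \| Fusion Unlimited")

def strip_brand(title):
    """Remove the brand suffix from a page title."""
    return BRAND.sub("", title)

title = "Facebook Announces Newsfeed Updates | Fusion Unlimited"
cleaned = strip_brand(title)
print(cleaned)

# Without the backslash, "|" means "or": the pattern then matches a single
# space OR " Fusion Unlimited", so every space in the title gets removed.
unescaped = re.sub(" | Fusion Unlimited", "", title)
print(unescaped)
```

If your titles use a different separator, such as a hyphen, adapt the pattern accordingly and check a few cleaned titles before clustering.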

In [None]:
#@title Set the number of clusters you want to force with K-Means
corpus_embeddings = embedder.encode(corpus)

# adjust this as needed
num_clusters = 15
# fixed seed so the cluster assignments are reproducible between runs
clustering_model = KMeans(n_clusters=num_clusters, random_state=42)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])
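
To make the clustering step less of a black box, here is a minimal sketch of what K-Means does, written in plain Python on toy 2-D points instead of real sentence embeddings. The `kmeans` function and the sample points are hypothetical; scikit-learn's `KMeans` picks its starting centroids more carefully and is what the notebook actually runs.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The label list plays the same role as `cluster_assignment` in the notebook: position `i` holds the cluster id of item `i`, which is what lets the titles be grouped by cluster.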

# Find URLs and Store Data


The data now sits in the list `clustered_sentences`, so it's time to loop through it.

First we call the `getname()` function to name each cluster, then we grab the corresponding URLs. Finally, we store the cluster name, the title and the URL in `df2`, the dataframe we created earlier.


In [None]:
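rows = []
for i, cluster in enumerate(clustered_sentences):
    cluster_name = getname(cluster)
    # Fall back to a placeholder if every word was filtered out
    label = cluster_name[0][0] if cluster_name else 'unnamed'
    for x in cluster:
      # Titles are matched by text, so duplicate titles resolve to the first matching URL
      geturl = df[df['Title 1']==x]['Address'].values[0]
      rows.append({'cluster':label,'title':x,'url':geturl})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df2 = pd.DataFrame(rows, columns=df2.columns)
df2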


| | cluster | title | url |
|---|---|---|---|
| 0 | facebook | Could the ios 14 update affect your mobile tra... | https://fusionunlimited.co.uk/blog/could-the-i... |
| 1 | facebook | We went to Facebook Blueprint Live | https://fusionunlimited.co.uk/blog/went-facebo... |
| 2 | facebook | Facebook Announces Newsfeed Updates | https://fusionunlimited.co.uk/blog/facebook-an... |
| 3 | facebook | Instagram launches personalised news feed | https://fusionunlimited.co.uk/blog/were-going-... |
| 4 | facebook | Facebook Introduces New Video Features and Upd... | https://fusionunlimited.co.uk/blog/facebook-in... |
| ... | ... | ... | ... |
| 529 | google | Google removes right-hand side ads from result... | https://fusionunlimited.co.uk/blog/google-remo... |
| 530 | google | Google to Begin Favouring Mobile Friendly Resu... | https://fusionunlimited.co.uk/blog/google-to-b... |
| 531 | google | Google reports spike in “near me” searches | https://fusionunlimited.co.uk/blog/google-repo... |
| 532 | google | Google Announce Custom Audience Targeting Feature | https://fusionunlimited.co.uk/blog/google-anno... |


##✅ Consider boosting internal links by skimming through URLs corresponding to equivalent clusters

⚠️ Every cluster gets exactly one name, so you get as many cluster names as the `num_clusters` value you set earlier (names may repeat if two clusters share the same most frequent word)
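
Skimming the clusters by hand can be complemented with a list of candidate link pairs. The sketch below is one possible way to do that, not part of the original notebook: `link_candidates` is a hypothetical helper, and the rows use placeholder `example.com` URLs in place of real output.

```python
from collections import defaultdict
from itertools import permutations

# Placeholder rows shaped like the records of the final dataframe
rows = [
    {"cluster": "facebook", "url": "https://example.com/blog/facebook-a"},
    {"cluster": "facebook", "url": "https://example.com/blog/facebook-b"},
    {"cluster": "google", "url": "https://example.com/blog/google-a"},
]

def link_candidates(rows):
    """Return (cluster, source_url, target_url) for every ordered pair in a cluster."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row["url"])
    pairs = []
    for cluster, urls in by_cluster.items():
        # Ordered pairs, because a link from A to B differs from a link from B to A
        for source, target in permutations(urls, 2):
            pairs.append((cluster, source, target))
    return pairs

pairs = link_candidates(rows)
for cluster, source, target in pairs:
    print(cluster, source, "->", target)
```

With real data, `df2.to_dict("records")` would give rows in this shape. The number of pairs grows quadratically with cluster size, so large clusters still need manual review to pick the links that are worth adding.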