<a href="https://colab.research.google.com/github/simodepth/internal_linking/blob/main/Explore_Interlinking_Opportunities_Using_K_Means_and_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Automate Internal Linking Discovery with Python
Internal linking is a crucial asset in SEO because it improves crawlability for search engines and content discovery for users. On larger sites, however, it can be hard to spot the high-level topic clusters within which pages should link to each other.

One strategy is to look at the content clusters or categories that already exist on a website. The following framework explores such opportunities by clustering a site's content using **k-means** and **sentence transformers**.



#The Framework

This Python framework is designed to explore internal linking opportunities by clustering the pages of a website by topical relevance. The output is a dataframe with cluster, page title, and URL columns, ordered by topical cluster.

**Why we care** – the output provides a handy picture of the pages that could link to one another because they share a salient topic.

✅ This can be applied to very large sites


#Requirements & Assumptions

- Install `sentence-transformers`, since it is an external package
- Import the `Internal_html` export from a Screaming Frog crawl and make sure it includes ONLY the **Address** and **Title 1** columns
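
The second requirement can be checked before running anything else. The sketch below is a minimal, dependency-free example that assumes the export is a plain CSV with a header row. The `keep_address_and_title` helper and the sample rows are hypothetical and only used for illustration.

```python
import csv
import io

REQUIRED_COLUMNS = ("Address", "Title 1")


def keep_address_and_title(csv_text):
    """Return the rows reduced to the Address and Title 1 columns.

    Raises ValueError if either column is missing from the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return [{col: row[col] for col in REQUIRED_COLUMNS} for row in reader]


# Hypothetical export with one extra column that gets dropped.
sample = (
    "Address,Status Code,Title 1\n"
    "https://example.com/,200,Home\n"
    "https://example.com/blog/,200,Blog\n"
)
rows = keep_address_and_title(sample)
print(rows)
```

Failing early on a missing column is a design choice here: it avoids a less obvious `KeyError` later in the notebook.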


### 1.1 Checking the GPU 

With Colab Pro you get priority access to Colab's fastest GPUs. For example, you may get a T4 or P100 GPU at times when most users of standard Colab receive a slower K80 GPU. You can check which GPU you have been assigned at any time by running the following cell.
https://colab.research.google.com/notebooks/pro.ipynb#scrollTo=23TOba33L4qf

In [12]:
# See GPU information 
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Wed Jul  6 09:28:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    28W /  70W |   1390MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
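
The `!nvidia-smi` syntax only works inside IPython or Colab. Outside a notebook, a rough equivalent could look like the sketch below. `get_gpu_info` is a hypothetical helper name, and the sketch assumes that `nvidia-smi` is on the `PATH` whenever a GPU driver is installed.

```python
import shutil
import subprocess


def get_gpu_info():
    """Return the output of nvidia-smi, or None when it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return result.stdout


gpu_info = get_gpu_info()
if gpu_info is None:
    print("No usable GPU detected; enable a GPU runtime and try again.")
else:
    print(gpu_info)
```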

In [1]:
#@title Install sentence-transformers
!pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!

In [2]:
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.util import ngrams
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Tokenizer models and stopword lists used when naming the clusters
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#Load SentenceTransformer
📘 `all-MiniLM-L6-v2` is a machine learning model that produces sentence embeddings, which makes it suitable for tasks like clustering or semantic search.

✅ The model's training data includes UGC platforms such as Reddit and Yahoo Answers

In [3]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
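
Each title becomes a fixed-length vector, and titles about similar topics end up with vectors that point in similar directions. Semantic search typically compares such vectors with cosine similarity. The sketch below illustrates the idea with made-up 3-dimensional vectors (real `all-MiniLM-L6-v2` embeddings have 384 dimensions). The vectors and the `cosine_similarity` helper are illustrative assumptions, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three page titles.
seo_audit = [0.9, 0.1, 0.0]
seo_tips = [0.8, 0.2, 0.1]
privacy_policy = [0.0, 0.1, 0.9]

similar = cosine_similarity(seo_audit, seo_tips)
dissimilar = cosine_similarity(seo_audit, privacy_policy)
print(similar > dissimilar)
```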



# Define the function to create N-Grams
📘 N-grams are "a contiguous sequence of n items from a given sample of text or speech". In practice: runs of n adjacent words, such as single words (n=1) or word pairs (n=2).

In [4]:
def extract_ngrams(data, num):
  """Tokenize `data` and return its n-grams of size `num` as space-joined strings."""
  n_grams = ngrams(nltk.word_tokenize(data), num)
  gram_list = [ ' '.join(grams) for grams in n_grams]
  return gram_list
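
To see what the function returns without downloading NLTK data, the sketch below reproduces the sliding-window idea in plain Python. It assumes simple whitespace tokenisation, whereas `nltk.word_tokenize` also splits off punctuation. `simple_ngrams` is a hypothetical name.

```python
def simple_ngrams(text, n):
    """Return all runs of n adjacent whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


title = "Google Updates Penguin Algorithm"
unigrams = simple_ngrams(title, 1)
bigrams = simple_ngrams(title, 2)
print(unigrams)
print(bigrams)
```

The notebook only uses n=1 when naming clusters, so each cluster name is a single word.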

In [5]:
#@title Extract n-grams and filter out stopwords
def getname(cluster):
  """Return [(word, count)] for the most frequent non-stopword in a cluster's titles."""
  data = ' '.join(cluster)
  keywords = extract_ngrams(data, 1)  # unigrams, i.e. single tokens
  stop_words = set(stopwords.words('english'))
  cluster_name = [x.lower() for x in keywords]
  cluster_name = [x for x in cluster_name if x not in stop_words]
  # Drop punctuation tokens such as "|" or ":"
  cluster_name = [x for x in cluster_name if x not in string.punctuation]
  # most_common(1) returns a one-item list of (word, count) tuples
  cluster_name = Counter(cluster_name).most_common(1)
  return cluster_name
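
Because `getname()` returns the single most frequent word, a brand name that appears in every title wins whenever it is not stripped first. The sketch below shows the effect with a tiny hand-written stopword set standing in for NLTK's list. `most_common_word`, `STOP_WORDS`, and the example titles are illustrative assumptions.

```python
import string
from collections import Counter

# Small stand-in for NLTK's English stopword list (assumption).
STOP_WORDS = {"the", "a", "in", "for", "of", "and"}


def most_common_word(titles):
    """Return [(word, count)] for the most frequent non-stopword."""
    words = " ".join(titles).lower().split()
    words = [w.strip(string.punctuation) for w in words]
    words = [w for w in words if w and w not in STOP_WORDS]
    return Counter(words).most_common(1)


with_brand = [
    "SEO Audit Tips | Acme",
    "Technical SEO for Shops | Acme",
    "Link Building | Acme",
]
without_brand = ["SEO Audit Tips", "Technical SEO for Shops", "Link Building"]

print(most_common_word(with_brand))
print(most_common_word(without_brand))
```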

In [7]:
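df2 = pd.DataFrame(columns = ['cluster', 'title', 'url'])  # filled in after clustering
df = pd.read_csv("/content/internal_html.csv")
df.dropna(inplace=True)
# Strip the brand suffix from the titles; raw strings avoid invalid-escape warnings.
# Note: the patterns are case-sensitive, so titles ending in "| Fusion Unlimited" are left unchanged.
df['Title 1'] = df['Title 1'].replace({r' \ fusion':'', r' \| fusion unlimited':''}, regex=True)
corpus = df["Title 1"].tolist()
df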



Unnamed: 0,Address,Title 1
0,https://fusionunlimited.co.uk/,Fusion Unlimited - Award winning independent d...
1,https://fusionunlimited.co.uk/privacy-policy/,Privacy Policy | Fusion Unlimited
2,https://fusionunlimited.co.uk/our-work/bewonder/,Bewonder* | Fusion Unlimited
3,https://fusionunlimited.co.uk/about-us/,Behind the scenes | Fusion Unlimited
4,https://fusionunlimited.co.uk/our-work/,Our Work | Fusion Unlimited
...,...,...
313,https://fusionunlimited.co.uk/blog/social-medi...,Social Media Roundup: August | Fusion Unlimited
314,https://fusionunlimited.co.uk/blog/social-medi...,Social Media Roundup: December | Fusion Unlimited
315,https://fusionunlimited.co.uk/blog/social-medi...,Social Media Roundup: September | Fusion Unlim...
316,https://fusionunlimited.co.uk/blog/google-upda...,Google Updates Penguin Algorithm | Fusion Unli...
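
The patterns in the cell above are lowercase, while the titles in the output use `Fusion Unlimited`, so the suffix is still visible after the replacement. One possible fix is a case-insensitive pattern anchored to the end of the title. The sketch below uses the standard `re` module on titles that appear in this notebook's outputs. The pattern and the `strip_brand` helper are assumptions, not part of the original notebook.

```python
import re

# Matches " | Fusion Unlimited" at the end of a title, in any letter case.
BRAND_SUFFIX = re.compile(r"\s*\|\s*fusion unlimited\s*$", re.IGNORECASE)


def strip_brand(title):
    """Remove the trailing brand suffix, if present."""
    return BRAND_SUFFIX.sub("", title)


titles = [
    "Privacy Policy | Fusion Unlimited",
    "Our Work | Fusion Unlimited",
    "Social Media Roundup: October",
]
cleaned = [strip_brand(t) for t in titles]
print(cleaned)
```

With pandas, the same pattern could be applied to the whole column through `df['Title 1'].str.replace(...)` with `case=False` and `regex=True`.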


In [8]:
#@title Set the number of clusters you want to force with K-Means
# Turn each title into an embedding vector
corpus_embeddings = embedder.encode(corpus)

# adjust this as needed
num_clusters = 10
# random_state makes the cluster assignment reproducible between runs
clustering_model = KMeans(n_clusters=num_clusters, random_state=42)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_

# Group the titles by the cluster they were assigned to
clustered_sentences = [[] for _ in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(corpus[sentence_id])
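
For intuition about what k-means does with the embeddings, the sketch below runs the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its members) on a handful of 2-D points. It is a simplified illustration with hand-picked starting centroids, not scikit-learn's implementation, which by default uses k-means++ initialisation. The `toy_kmeans` name and the points are assumptions.

```python
import math


def toy_kmeans(points, centroids, iterations=10):
    """Run a fixed number of assignment/update rounds and return (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels, centroids = toy_kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of clusters is fixed up front, which is why the notebook asks you to choose `num_clusters` before fitting.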

# Find URLs and Store Data


The titles are now grouped in the list object `clustered_sentences`, so it's time to loop through it.

First we call the `getname()` function to name each cluster, then we grab the corresponding URLs. Finally, we store the cluster name, the title, and the URL in the `df2` dataframe we created earlier.


In [9]:
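# Collect rows in a list first: DataFrame.append was removed in pandas 2.0
urls = df['Address'].tolist()
rows = []
for i, cluster in enumerate(clustered_sentences):
    cluster_name = getname(cluster)
    for idx, cluster_id in enumerate(cluster_assignment):
        if cluster_id == i:
            # Look up by position so pages with identical titles keep their own URL
            rows.append({'cluster': cluster_name[0][0], 'title': corpus[idx], 'url': urls[idx]})
df2 = pd.DataFrame(rows, columns=['cluster', 'title', 'url'])
df2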


Unnamed: 0,cluster,title,url
0,fusion,Fusion Unlimited Announced as a 2022 Prolific ...,https://fusionunlimited.co.uk/blog/fusion-unli...
1,fusion,Northern Digital Awards 2019 Award Nominations...,https://fusionunlimited.co.uk/blog/fusion-nort...
2,fusion,Fusion Natural Edge Nominated for Northern Dig...,https://fusionunlimited.co.uk/blog/fusion-natu...
3,fusion,Fusion Unlimited nominated for 2 Northern Digi...,https://fusionunlimited.co.uk/blog/fusion-unli...
4,fusion,Fusion Unlimited Shortlisted In The 2015 PROLI...,https://fusionunlimited.co.uk/blog/fusion-unli...
...,...,...,...
268,fusion,Social Media Roundup: November | Fusion Unlimited,https://fusionunlimited.co.uk/blog/social-medi...
269,fusion,Social Media Roundup: October,https://fusionunlimited.co.uk/blog/social-medi...
270,fusion,Social Media Roundup: August | Fusion Unlimited,https://fusionunlimited.co.uk/blog/social-medi...
271,fusion,Social Media Roundup: December | Fusion Unlimited,https://fusionunlimited.co.uk/blog/social-medi...


In [11]:
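#@title Save the output as a CSV file
# Local Windows path from the author's machine; adjust it to your environment
# (on Mac the directory should be: iCloud Drive\Scrivania\cluster.csv)
df2.to_csv(r'C:\Users\simonedp\Desktop\cluster.csv', index=False, header=True)
print(df)  # prints the input crawl data; use df2 to inspect the clustered output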

                                               Address  \
0                       https://fusionunlimited.co.uk/   
1        https://fusionunlimited.co.uk/privacy-policy/   
2     https://fusionunlimited.co.uk/our-work/bewonder/   
3              https://fusionunlimited.co.uk/about-us/   
4              https://fusionunlimited.co.uk/our-work/   
..                                                 ...   
313  https://fusionunlimited.co.uk/blog/social-medi...   
314  https://fusionunlimited.co.uk/blog/social-medi...   
315  https://fusionunlimited.co.uk/blog/social-medi...   
316  https://fusionunlimited.co.uk/blog/google-upda...   
317  https://fusionunlimited.co.uk/blog/seo-market-...   

                                               Title 1  
0    Fusion Unlimited - Award winning independent d...  
1                    Privacy Policy | Fusion Unlimited  
2                         Bewonder* | Fusion Unlimited  
3                 Behind the scenes | Fusion Unlimited  
4                 

##✅ Consider adding internal links between URLs that belong to the same cluster

⚠️ Each cluster gets one name, so there are as many cluster names as the number of clusters you set earlier (`num_clusters`), although different clusters can end up sharing the same name
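
One way to turn the clustered output into a review list is to pair up the URLs that share a cluster. The sketch below is a minimal example that assumes rows shaped like the `cluster` and `url` columns of the output. The `link_candidates` helper and the example URLs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def link_candidates(rows):
    """Map each cluster name to every pair of URLs inside that cluster."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row["url"])
    return {name: list(combinations(urls, 2)) for name, urls in by_cluster.items()}


# Hypothetical rows mirroring the cluster and url columns of the output.
rows = [
    {"cluster": "seo", "url": "https://example.com/seo-audit/"},
    {"cluster": "seo", "url": "https://example.com/seo-tips/"},
    {"cluster": "seo", "url": "https://example.com/link-building/"},
    {"cluster": "social", "url": "https://example.com/social-roundup/"},
]
candidates = link_candidates(rows)
for name, pairs in candidates.items():
    print(name, len(pairs))
```

Each pair is only a candidate: whether a link makes sense on the page still needs a manual check. A pandas dataframe can be converted to this row format with `to_dict("records")`.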