<a href="https://colab.research.google.com/github/simodepth/Keyword-Research/blob/main/SEO_Keyword_Clustering_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [23]:
!pip install pyfiglet
import pyfiglet
font = pyfiglet.figlet_format('Keyword Clustering with Python')
print(font)

 _  __                                _ 
| |/ /___ _   ___      _____  _ __ __| |
| ' // _ \ | | \ \ /\ / / _ \| '__/ _` |
| . \  __/ |_| |\ V  V / (_) | | | (_| |
|_|\_\___|\__, | \_/\_/ \___/|_|  \__,_|
          |___/                         
  ____ _           _            _                        _ _   _     
 / ___| |_   _ ___| |_ ___ _ __(_)_ __   __ _  __      _(_) |_| |__  
| |   | | | | / __| __/ _ \ '__| | '_ \ / _` | \ \ /\ / / | __| '_ \ 
| |___| | |_| \__ \ ||  __/ |  | | | | | (_| |  \ V  V /| | |_| | | |
 \____|_|\__,_|___/\__\___|_|  |_|_| |_|\__, |   \_/\_/ |_|\__|_| |_|
                                        |___/                        
 ____        _   _                 
|  _ \ _   _| |_| |__   ___  _ __  
| |_) | | | | __| '_ \ / _ \| '_ \ 
|  __/| |_| | |_| | | | (_) | | | |
|_|    \__, |\__|_| |_|\___/|_| |_|
       |___/                       




# Requirements & Assumptions
- **Upload a keyword list from a file (`Query.xlsx` in the code below)**: The larger the keyword list, the better the clustering results tend to be.

- **Apply stemming to every word within the query**:
The code below uses the **Snowball Stemmer** from the Python NLTK module, which is language specific (set to English here). The **Porter Stemmer**, an English-only algorithm that is also available in NLTK, is left in the code as a commented-out alternative. Stemming reduces words to their root form, which helps group related queries together.

- **Use TfidfVectorizer to create a feature vector over all queries**:
Clustering algorithms work with numbers, so every keyword is transformed into a vector. Each vector has one dimension per stemmed term (single words and two-word phrases) that the vectorizer keeps from the input keyword set, and it holds the TF-IDF weight of that term in the keyword.

- **Run a clustering algorithm on top of the query vectors**:
In this script we use the DBSCAN implementation from scikit-learn to cluster the keywords.

- **Look at the results in clustered_queries.csv**:
Keywords that belong to the same cluster are joined with a pipe delimiter (` | `). When you run the script for the first time on a new keyword set, some of the clusters may not look very good.
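
To make the TF-IDF step from the list above more concrete, here is a minimal sketch that vectorizes three already-stemmed queries and prints the non-zero weights of each one. The names `demo_queries`, `demo_vectorizer`, `demo_matrix`, `terms` and `weights_per_query` are illustrative only, and the vectorizer uses scikit-learn's default settings rather than the tuned ones used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Already-stemmed toy queries (illustrative; taken from the stemmer output style below)
demo_queries = ["ucc coffe", "ucc coffe uk", "coffe price uk"]

# Default settings: unigrams only, no stop words, no min_df/max_df filtering
demo_vectorizer = TfidfVectorizer()
demo_matrix = demo_vectorizer.fit_transform(demo_queries)

# Column order of the matrix: terms sorted by their column index
terms = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)

# Collect the non-zero TF-IDF weights of every query
weights_per_query = []
for query, row in zip(demo_queries, demo_matrix.toarray()):
    weights = {term: round(float(w), 3) for term, w in zip(terms, row) if w > 0}
    weights_per_query.append(weights)
    print(query, "->", weights)
```

A term that occurs in every query (here `coffe`) gets a lower weight than a rarer term in the same query, which is what lets the clustering step focus on the distinguishing words.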


---



# ⚠️ **Try adjusting the SENSITIVITY and MIN_CLUSTERSIZE parameters.** This can improve your results.

In [1]:

import csv
import re
from collections import defaultdict

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Stemmers: Snowball is used below, Porter is kept as an alternative
snow_stemmer = SnowballStemmer(language='english')
porter_stemmer = PorterStemmer()


In [2]:
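def stemmList(queries):
    """Stem every word of every query and return the list of stemmed queries."""
    stemmed_list = []
    for query in queries:
        words = query.split(" ")
        stem_words = []
        print(query)  # original query
        for word in words:
            x = snow_stemmer.stem(word)
            # x = porter_stemmer.stem(word)  # alternative: Porter stemmer
            stem_words.append(x)
        key = " ".join(stem_words)
        print(key)  # stemmed query
        stemmed_list.append(key)
    return stemmed_list

textlist = []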
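
To compare the two stemmers before committing to one, a small side-by-side run can help. The sketch below is only an illustration: `stemmers`, `stem_query` and `sample_queries` are hypothetical names that are not part of the original script, and the sample queries are taken from the keyword list used in this notebook.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

# Both stemmers work without downloading any extra NLTK data
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer(language="english"),
}


def stem_query(query, stemmer):
    """Stem each whitespace-separated word of a query and join the stems again."""
    return " ".join(stemmer.stem(word) for word in query.split())


sample_queries = [
    "ucc coffee careers",
    "simply personnel login",
    "average coffee price uk",
]

# Print the stemmed form of every sample query for every stemmer
for query in sample_queries:
    for name, stemmer in stemmers.items():
        print(f"{name:>8}: {query} -> {stem_query(query, stemmer)}")
```

If the two stemmers produce the same stems for your keyword set, the choice will not change the clusters.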



In [5]:
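df = pd.read_excel('Query.xlsx')
textlist = df.iloc[:, 0].to_list()  # keywords are expected in the first column
labellist = textlist                # keep the original keywords as labels
textlist = stemmList(textlist)      # stemmed keywords used for clustering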


ucc coffee
ucc coffe
ucc coffee uk
ucc coffe uk
ucc coffee ireland
ucc coffe ireland
simply personnel login
simpli personnel login
ucc coffee jobs
ucc coffe job
ucc coffee careers
ucc coffe career
ucc coffee machine
ucc coffe machin
ucc
ucc
united coffee
unit coffe
ucc careers
ucc career
ucc coffee uk limited
ucc coffe uk limit
ucc uk
ucc uk
coffee prices uk
coffe price uk
ucc coffee dartford
ucc coffe dartford
ucc self service
ucc self servic
simply personnel
simpli personnel
ucc coffee milton keynes
ucc coffe milton keyn
ucc coffee uk ltd
ucc coffe uk ltd
average coffee price uk
averag coffe price uk
three sixty coffee
three sixti coffe
simply personnel employee login
simpli personnel employe login
thermoplan coffee machine
thermoplan coffe machin
orangutan coffee
orangutan coffe
grand cafe coffee
grand cafe coffe
ucc coffee contact number
ucc coffe contact number
360 coffee
360 coffe
average price of a cup of coffee uk 2021
averag price of a cup of coffe uk 2021
average price of cof

In [21]:
LANGUAGE = 'english'  # stop-word list used by TfidfVectorizer
SENSITIVITY = 0.8  # DBSCAN eps (neighborhood radius): lower values give tighter clusters and more noise
MIN_CLUSTERSIZE = 2  # DBSCAN min_samples
tfidf_vectorizer = TfidfVectorizer(max_df=0.2, max_features=10000, min_df=0.01, stop_words=LANGUAGE, use_idf=True, ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(textlist)
ds = DBSCAN(eps=SENSITIVITY, min_samples=MIN_CLUSTERSIZE).fit(tfidf_matrix)
clusters = ds.labels_.tolist()  # one label per keyword; -1 marks noise
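
Since the right SENSITIVITY value depends on the keyword set, it can help to scan a few values and watch how the number of clusters and unclustered keywords changes. The following is a minimal sketch on a toy list of stemmed queries. `toy_queries`, `toy_matrix` and `summarize` are hypothetical names, and the vectorizer uses default settings instead of the ones above.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Small, already-stemmed toy keyword set (illustrative only)
toy_queries = [
    "ucc coffe",
    "ucc coffe uk",
    "coffe price uk",
    "averag coffe price uk",
    "simpli personnel",
    "simpli personnel login",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_queries)


def summarize(eps, min_samples=2):
    """Run DBSCAN and return (number of clusters, number of noise keywords)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(toy_matrix).labels_
    n_clusters = len(set(labels) - {-1})  # -1 is DBSCAN's noise label
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


# Scan a range of eps values from very strict to very permissive
for eps in (0.01, 0.5, 0.8, 1.0, 1.5):
    n_clusters, n_noise = summarize(eps)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} unclustered keywords")
```

TF-IDF rows are L2-normalized by default, so the Euclidean distance between two keywords never exceeds the square root of 2. An `eps` above that value puts every keyword into a single cluster, while a value close to zero leaves every distinct keyword unclustered.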



In [22]:
cluster_df = pd.DataFrame(clusters, columns=['Cluster'])
keywords_df = pd.DataFrame(labellist, columns=['Keyword'])
# Align each cluster label with its original (unstemmed) keyword by row index
result = pd.merge(cluster_df, keywords_df, left_index=True, right_index=True)
# Join the keywords of each cluster with a pipe; cluster -1 holds the DBSCAN noise
grouping = result.groupby(['Cluster'])['Keyword'].apply(' | '.join).reset_index()
grouping.to_csv("clustered_queries.csv", index=False)
grouping

   Cluster  Keyword
0       -1  best coffee machine in the world | life 2 | ap...
1        0  ucc coffee | ucc coffee uk | ucc coffee irelan...
2        1  simply personnel login | united coffee | simpl...
3        2  ucc | ucc careers | ucc uk | ucc coffe | ucc c...
4        3  grand cafe coffee | grand cru coffee | grand c...
5        4  ucc coffee beans | orangutan coffee beans | gr...
6        5  private label coffee europe | private label co...
7        6  appia life 2 group compact | appia life | appi...
8        7  black and white coffee machine | black and whi...
9        8  coffee brands uk | british coffee brands | cof...
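
The row with cluster `-1` in the output above is not a real cluster: DBSCAN assigns the label `-1` to keywords it could not place in any group. A possible way to separate those keywords from the actual clusters is sketched below. The `demo_result` frame is a made-up miniature with the same `Cluster` and `Keyword` columns as `result`, and the `Size` column is an addition that the original script does not produce.

```python
import pandas as pd

# Made-up miniature of the `result` DataFrame (illustrative data only)
demo_result = pd.DataFrame({
    "Cluster": [0, 0, -1, 1, 1, -1],
    "Keyword": [
        "ucc coffee",
        "ucc coffee uk",
        "life 2",
        "simply personnel",
        "simply personnel login",
        "best coffee machine in the world",
    ],
})

# Keywords labelled -1 are noise: list them separately
is_noise = demo_result["Cluster"] == -1
unclustered = demo_result.loc[is_noise, "Keyword"].tolist()

# Group only the real clusters and record how many keywords each one holds
grouped = demo_result[~is_noise].groupby("Cluster")["Keyword"]
clustered = pd.DataFrame({
    "Keywords": grouped.apply(" | ".join),
    "Size": grouped.size(),
}).reset_index()

print(clustered)
print("Unclustered:", unclustered)
```

A long list of unclustered keywords can be a hint that SENSITIVITY is too low for the keyword set.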
