# Keyword Categorization

Following the classification scheme of NU.nl, I organize the keywords into the following categories:

- General
- Economy
- Sports
- Media and Culture
- Other

In addition, I have split technology, science, and animals out into separate categories, and I introduce four new categories: persons (names), politics, events, and entities (companies, countries, cities).

In [None]:
from openai import OpenAI
import json
import pandas as pd
import os
from dotenv import load_dotenv

load_dotenv()
apiKey = os.environ.get('OPENAI_API_KEY')

In [2]:
# List of test keywords
words = ['Lando Norris', '3400 pond', 'achterklap', 'Steven Spielberg', 'Timothée Chalamet', 'Kunstrijden', 'Boksers', 'NPO 3FM', 'Céline Dion', 'Tesla', 'Dilan Yesilgöz', 'Rami Malek', 'AI-test', 'Weer', 'Sprinttalent', 'München', 'Booking.com', 'Coronavirus', 'Yesilgöz', 'Deena & Jim', 'Formule 2', 'Griselda', 'Timothee Chalamet', 'Bulgarije', '3FM', 'Booking', 'Nijmegen']



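Function: send a keyword to the LLM and receive a category in return.
The expected outcome is one of these categories: General, Economy, Sports, Technology, Science, Media and Culture, Person, Company, Animal, Country, City, Other.
Because the system prompt is written in Dutch, the model answers with the Dutch names: Algemeen, Economie, Sport, Technologie, Wetenschap, Media en Cultuur, Persoon, Bedrijf, Dier, Land, Stad, Overig.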

In [8]:
client = OpenAI(api_key=apiKey)

# Ask the LLM for the category of a single keyword
def getKeywordCategory(keyword):
    # Create completion request
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "Je kiest 1 categorie waarbij het volgende keywoord het best bij past. Kies uit: Algemeen, Economie, Sport, Technologie, Wetenschap, Media en Cultuur, Persoon, Bedrijf, Dier, Land, Stad, Overig. Geef alleen de categorie. Voeg toe het voorvoegsel 'Category:'"},
            {"role": "user", "content": keyword}
        ]
    )
    keyword_content = completion.choices[0].message.content

    # Extract the text after 'Category:'; if the prefix is missing, keep the whole reply
    category = keyword_content.split('Category:')[-1].strip()
    return category

# Example return of function
print(getKeywordCategory("Kunstrijden"))

Sport
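The reply parsing can be tested without calling the API. The following is a minimal sketch, assuming a hypothetical helper `parse_category` and the twelve Dutch category names from the system prompt; it is not part of the notebook's pipeline. It returns `None` when the reply names a category outside the list, so the caller can decide to retry.

```python
# Hypothetical helper: validate an LLM reply against the categories from the prompt.
ALLOWED_CATEGORIES = {
    "Algemeen", "Economie", "Sport", "Technologie", "Wetenschap",
    "Media en Cultuur", "Persoon", "Bedrijf", "Dier", "Land", "Stad", "Overig",
}

def parse_category(reply, allowed=ALLOWED_CATEGORIES):
    """Return the category named in the reply, or None if it is not an allowed one."""
    # Take the text after the last 'Category:' prefix; without a prefix, use the whole reply.
    _, _, tail = reply.rpartition("Category:")
    # Drop surrounding whitespace and stray punctuation or quotes.
    candidate = tail.strip().strip(".'\"")
    return candidate if candidate in allowed else None

print(parse_category("Category: Sport"))   # Sport
print(parse_category("Sport"))             # Sport
print(parse_category("Category: Milieu"))  # None
```

Returning `None` instead of raising keeps the retry decision in the calling loop.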


Get all unique keywords from the articles.

In [41]:
df = pd.read_csv('nu-articles-v4.csv')

# Split each string by commas and concatenate all lists
all_keywords = []
for keywords_str in df['Keywords']:
    keywords_list = keywords_str.split(', ')
    all_keywords.extend(keywords_list)

# Print the combined list
# print(all_keywords)

# Convert the combined list to a set to remove duplicates, then convert it back to a list
unique_keywords = list(set(all_keywords))

# Print the unique keywords list
print(unique_keywords)

['Grace Jabbari', 'Martijn Lakemeier', 'Boek & Cultuur', 'Oostenrijkse Weissensee', 'Nuremberg', 'Angela Groothuizen', 'Gerard Depardieu', 'Europa', 'Jennifer Aniston', 'Memphis Depay', 'Wb Shorttrack', 'Liverpool', 'Sell', 'Volendam', 'Emmy', 'Britney Spears', 'Stad München', 'Wendy Van Dijk', 'Twitter', 'Paul Groot', 'Duitsland', 'Estaimpuis', 'Claudia De Breij', 'Delfina Chaves', "Auto's", 'Rob Roos', 'Kpn', 'Robert De Niro', 'Mathieu Van Der Poel', 'Schagen', 'Champions League Vrouwen', 'World Cup Schaatsen', 'Anthony Kiedis', 'Opvoeding', 'Recensieoverzicht Wonka', 'Kamer', 'Lufthansa Group', 'Duiken', 'Matthew Perry', 'Sergio Pérez', 'Jim Bakkum', 'Zussen-serie', 'Amstelveen', 'Serious Request', 'Lindsay Van Zundert', 'Bbb', 'Ajax', 'E3', 'Herten', 'Vvd', 'Afran Groenewoud', 'Bulgarije', 'Journalistiek', 'Kamers', 'Jennifer Hudson', 'Volkswagen', 'Luke Littler', 'Pieter Baan', 'Wk Handbal', 'Ternauwernood', 'Timothée Chalamet', 'Istanboel', 'Eddy Keur', 'Bob De Bouwer', 'Ns Publi
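Converting the list to a `set` removes duplicates but does not keep a stable order, so the order of the keywords can differ between runs. A minimal sketch of an order-preserving alternative follows; the helper name `collect_unique_keywords` is hypothetical, and it assumes each row is a comma-separated string (rows that are not strings, such as missing values, are skipped).

```python
def collect_unique_keywords(rows):
    """Return the unique keywords from comma-separated rows, in first-seen order."""
    seen = {}  # a dict keeps insertion order, so it doubles as an ordered set
    for row in rows:
        if not isinstance(row, str):
            continue  # skip missing values such as NaN or None
        for keyword in row.split(","):
            keyword = keyword.strip()
            if keyword:
                seen.setdefault(keyword, None)
    return list(seen)

rows = ["Ajax, Sport", float("nan"), "Sport,  Tesla ", ""]
print(collect_unique_keywords(rows))  # ['Ajax', 'Sport', 'Tesla']
```

With the DataFrame above, this could be called as `collect_unique_keywords(df['Keywords'])`.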

Place all keywords in a JSON file under the correct category.

Sometimes the LLM still returns a category that is not in my list. Resending the prompt a few times usually yields a valid category. After four failed attempts, the keyword is added to No_category.

In [63]:
# Read the JSON file
with open('Categories.json', 'r') as f:
    data = json.load(f)

# The prompt is in Dutch, so the LLM answers with the Dutch category names
valid_categories = ["Algemeen", "Economie", "Sport", "Media en Cultuur", "Technologie", "Wetenschap", "Persoon", "Bedrijf", "Politiek", "Evenement", "Dier", "Land", "Stad", "Overig"]

# Categorize each keyword, retrying up to four times on an invalid answer
for keyword in unique_keywords:
    for attempt in range(4):
        cat = getKeywordCategory(keyword)
        if cat in valid_categories:
            data[cat].append(keyword)
            print(keyword + " is in: " + cat)
            break
        print("Problem found with keyword: " + keyword)
    else:
        # No valid category after four attempts
        data["No_category"].append(keyword)

# Write the updated data back to the file
with open('Categories_allKeywords.json', 'w') as f:
    json.dump(data, f, indent=4)

Grace Jabbari is in: Persoon
Martijn Lakemeier is in: Persoon
Boek & Cultuur is in: Media en Cultuur
Oostenrijkse Weissensee is in: Stad
Nuremberg is in: Stad
Angela Groothuizen is in: Persoon
Gerard Depardieu is in: Persoon
Europa is in: Land
Jennifer Aniston is in: Persoon
Memphis Depay is in: Sport
Wb Shorttrack is in: Sport
Liverpool is in: Stad
Sell is in: Economie
Volendam is in: Stad
Emmy is in: Media en Cultuur
Britney Spears is in: Persoon
Stad München is in: Stad
Wendy Van Dijk is in: Persoon
Twitter is in: Technologie
Paul Groot is in: Persoon
Duitsland is in: Land
Estaimpuis is in: Stad
Claudia De Breij is in: Persoon
Delfina Chaves is in: Persoon
Auto's is in: Overig
Rob Roos is in: Persoon
Kpn is in: Bedrijf
Robert De Niro is in: Persoon
Mathieu Van Der Poel is in: Sport
Schagen is in: Stad
Champions League Vrouwen is in: Sport
World Cup Schaatsen is in: Sport
Anthony Kiedis is in: Persoon
Opvoeding is in: Persoon
Recensieoverzicht Wonka is in: Media en Cultuur
Kamer is i
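The retry rule (resend the prompt, give up after four attempts) can be separated from the API call so that it can be checked offline. This is a minimal sketch, assuming a hypothetical helper `classify_with_retries` that receives the classifier as a parameter; in the notebook, `getKeywordCategory` would be passed in. The stub `fake_classify` only imitates an LLM that answers correctly on the third try.

```python
def classify_with_retries(keyword, classify, allowed, max_attempts=4):
    """Call `classify` until it returns an allowed category, at most `max_attempts` times."""
    for _ in range(max_attempts):
        category = classify(keyword)
        if category in allowed:
            return category
    # No valid answer within the attempt budget
    return "No_category"

# Stub classifier (an assumption for this example): two invalid answers, then a valid one.
replies = iter(["Milieu", "Natuur", "Algemeen"])

def fake_classify(keyword):
    return next(replies)

result = classify_with_retries("Vervuiling", fake_classify, {"Algemeen", "Overig"})
print(result)  # Algemeen
```

Passing the classifier in as an argument means the loop never needs a live API key during testing.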

TODO: use GPT-4 for this step.
TODO: check in Python that the JSON format of the reply is correct.
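The JSON check noted above could look roughly like the sketch below. It assumes the reply should have the shape `{"keywords": {name: [synonym, ...]}}`, which matches the output later in this notebook; the helper name `parse_keyword_groups` is hypothetical.

```python
import json

def parse_keyword_groups(reply):
    """Return the keyword groups from an LLM reply, or None if the format is wrong."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    groups = data.get("keywords") if isinstance(data, dict) else None
    if not isinstance(groups, dict):
        return None  # missing 'keywords' key or wrong container type
    for synonyms in groups.values():
        # Every group must be a list of strings
        if not isinstance(synonyms, list) or not all(isinstance(s, str) for s in synonyms):
            return None
    return groups

good = '{"keywords": {"NPO 3FM": ["NPO 3FM", "3FM"], "Tesla": ["Tesla"]}}'
print(parse_keyword_groups(good))
print(parse_keyword_groups("keywords: [Tesla]"))  # None
```

A `None` result can be treated like an invalid category: resend the prompt or skip the batch.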

In [3]:
# Ask the LLM to group keywords that have the same meaning
client = OpenAI(api_key=apiKey)

def getDoubleKeywords(keywords):
    # Create completion request
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "follow the following steps: 1. connect keywords with the same meaning. 2. place it in a Json format: keywords: [‘Keyword1’: , 'Keyword2’: ]. 3. remove the empty keywords"},
            {"role": "user", "content": keywords}
        ]
    )
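    keyword_content = completion.choices[0].message.content
    print(keyword_content)

    # The reply is a string, so parse it before indexing it like a dictionary.
    # Try JSON first, then a Python-style literal (single-quoted keys).
    import ast
    try:
        parsed = json.loads(keyword_content)
    except json.JSONDecodeError:
        try:
            parsed = ast.literal_eval(keyword_content)
        except (ValueError, SyntaxError):
            print("Could not parse the reply")
            return keyword_content

    # Keep only the keywords that have at least one synonym
    groups = parsed.get("keywords") if isinstance(parsed, dict) else None
    if not isinstance(groups, dict):
        groups = {}
    filtered_keywords = {k: v for k, v in groups.items() if len(v) > 1}

    print(filtered_keywords)
    return keyword_content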



string_list = [str(element) for element in words]
delimiter = ", "
keywords_string = delimiter.join(string_list)
getDoubleKeywords(keywords_string)


{'keywords': {'Lando Norris': ['Lando Norris'], 'München': ['stad München', 'München'], 'Achterklap': ['achterklap'], 'Steven Spielberg': ['Steven Spielberg'], 'Timothée Chalamet': ['Timothée Chalamet', 'Timothee Chalamet'], 'Kunstrijden': ['Kunstrijden'], 'Boksen': ['Boksen'], 'NPO 3FM': ['NPO 3FM', '3FM', 'radio 3'], 'Céline Dion': ['Céline Dion'], 'Tesla': ['Tesla'], 'Dilan Yesilgöz': ['Dilan Yesilgöz', 'Yesilgöz'], 'Rami Malek': ['Rami Malek'], 'AI-test': ['AI-test'], 'Weer': ['Weer'], 'Sprinttalent': ['Sprinttalent'], 'Booking.com': ['Booking.com', 'Booking'], 'Coronavirus': ['Coronavirus'], 'Deena & Jim': ['Deena & Jim'], 'Formule 2': ['Formule 2'], 'Griselda': ['Griselda'], 'Bulgarije': ['Bulgarije'], 'Nijmegen': ['Nijmegen']}}


Here I walk through the output of the LLM and ask the user which keyword is the best one. The changes are saved in a JSON file.

In [4]:
GPToutputdata = {
    "keywords": {
        "Lando Norris": ["Lando Norris"],
        "München": ["stad München", "München"],
        "Achterklap": ["achterklap"],
        "Steven Spielberg": ["Steven Spielberg"],
        "Timothée Chalamet": ["Timothée Chalamet", "Timothee Chalamet"],
        "Kunstrijden": ["Kunstrijden"],
        "Boksen": ["Boksen"],
        "NPO 3FM": ["NPO 3FM", "3FM", "radio 3"],
        "Céline Dion": ["Céline Dion"],
        "Tesla": ["Tesla"],
        "Dilan Yesilgöz": ["Dilan Yesilgöz", "Yesilgöz"],
        "Rami Malek": ["Rami Malek"],
        "AI-test": ["AI-test"],
        "Weer": ["Weer"],
        "Sprinttalent": ["Sprinttalent"],
        "Booking.com": ["Booking.com", "Booking"],
        "Coronavirus": ["Coronavirus"],
        "Deena & Jim": ["Deena & Jim"],
        "Formule 2": ["Formule 2"],
        "Griselda": ["Griselda"],
        "Bulgarije": ["Bulgarije"],
        "Nijmegen": ["Nijmegen"]
    }
}
keywordchanges = {}
filtered_keywords = {}

# Keep only the groups in which GPT found more than one spelling of a keyword
for keyword, synonyms in GPToutputdata["keywords"].items():
    if len(synonyms) > 1:
        filtered_keywords[keyword] = synonyms

for key, value_list in filtered_keywords.items():
    dubbles = ""
    for item in value_list:
        dubbles += f"- {item} \n"
    # Prompt (Dutch): "Type the best keyword. Type '/next' if the keywords have a different meaning"
    response = input("Typ het beste keywoord. Typ '/next' als de keywoorden een andere betekenis hebben \n \n" + dubbles)

    # '/next' skips the group; any other answer must match one of the listed keywords
    if response != "/next":
        if response in value_list:
            # Map the chosen keyword to the other keywords it replaces
            keywordchanges[response] = [item for item in value_list if item != response]
        else:
            print(f"'{response}' is not one of the listed keywords")
    print(keywordchanges)


{'München': ['stad München']}
{'München': ['stad München'], 'Timothée Chalamet': ['Timothee Chalamet']}
{'München': ['stad München'], 'Timothée Chalamet': ['Timothee Chalamet'], '3FM': ['NPO 3FM', 'radio 3']}
{'München': ['stad München'], 'Timothée Chalamet': ['Timothee Chalamet'], '3FM': ['NPO 3FM', 'radio 3'], 'Dilan Yesilgöz': ['Yesilgöz']}
{'München': ['stad München'], 'Timothée Chalamet': ['Timothee Chalamet'], '3FM': ['NPO 3FM', 'radio 3'], 'Dilan Yesilgöz': ['Yesilgöz'], 'Booking.com': ['Booking']}


Apply the changes to the data and category list

In [57]:
keywordchanges = {
    'München': ['stad München'],
    'Timothée Chalamet': ['Timothee Chalamet'],
    '3FM': ['NPO 3FM', 'radio 3'],
    'Booking.com': ['Booking']
}

def removeKeywordsCategory(category, changelist):
    # Read the JSON file
    with open('Categories_allKeywordstest.json', 'r') as f:
        categoryData = json.load(f)

    # Get the list of keywords for the specified category
    keyword_list = categoryData[category]

    # Iterate through the change list
    for keyword, editlist in changelist.items():
        for editkeyword in editlist:
            print(editkeyword + "  " + keyword)
            try:
                # Remove the edited keywords from the category list
                keyword_list.remove(editkeyword)
            except ValueError:
                print(f"{editkeyword} is not found in {category}")

    # Replace the original list with the updated list
    categoryData[category] = keyword_list

    # Write the updated data back to the file
    with open('Categories_allKeywordstests.json', 'w') as f:
        json.dump(categoryData, f, indent=4)

def replaceKeywordsData(changelist):
    # Read the CSV file into a DataFrame
    df = pd.read_csv('nu-articles-v4.csv')

    # Iterate through the change list
    for keyword, editlist in changelist.items():
        for editkeyword in editlist:
            print(editkeyword + "  " + keyword)
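            # Keywords in the DataFrame have the first letter of every word capitalized
            # (e.g. 'Mathieu Van Der Poel'), so normalize both keywords the same way
            editkeyword = ' '.join(w.capitalize() for w in editkeyword.split())
            keyword = ' '.join(w.capitalize() for w in keyword.split())
            # Replace keywords in the 'Keywords' column (literal match, not a regex)
            df['Keywords'] = df['Keywords'].str.replace(editkeyword, keyword, regex=False)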

    # Save the updated DataFrame back to the CSV file
    df.to_csv('nu-articles-v5.csv', index=False)

removeKeywordsCategory('General', keywordchanges)
replaceKeywordsData(keywordchanges)

stad München  München
Timothee Chalamet  Timothée Chalamet
NPO 3FM  3FM
radio 3  3FM
Booking  Booking.com
stad München  München
Timothee Chalamet  Timothée Chalamet
NPO 3FM  3FM
radio 3  3fm
Booking  Booking.com
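One caveat with a plain string replacement: it also matches inside longer keywords, so replacing `Booking` with `Booking.com` turns an existing `Booking.com` into `Booking.com.com`. A minimal sketch of a whole-keyword replacement follows, assuming the keywords in a cell are separated by `', '` as in the CSV above; the helper name `replace_keyword` is hypothetical.

```python
def replace_keyword(cell, old, new):
    """Replace `old` by `new` in a comma-separated keyword string, whole keywords only."""
    result = []
    for keyword in cell.split(", "):
        if keyword == old:
            keyword = new
        # Drop empty entries and duplicates created by the merge
        if keyword and keyword not in result:
            result.append(keyword)
    return ", ".join(result)

cell = "Booking, Booking.com, Tesla"
print(cell.replace("Booking", "Booking.com"))           # Booking.com, Booking.com.com, Tesla
print(replace_keyword(cell, "Booking", "Booking.com"))  # Booking.com, Tesla
```

Passing an empty string as `new` removes the keyword entirely without leaving a dangling separator. On the DataFrame this could be applied per row with `df['Keywords'].apply(lambda c: replace_keyword(c, old, new))`.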


In [40]:
client = OpenAI(api_key=apiKey)

# Ask the LLM whether a keyword is relevant (1) or not relevant (0)
def get_unclear_keywords(keyword):
    # Create completion request
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "system", "content": "Voor een nieuws app kunnen mensen zich abonneren op keywoorden. Geef het cijfer 1 als het keywoord relevant is en het cijfer 0 als het keywoord niet relevant is. Please prefix ‘relevant’: Niet relevante keywoorden zijn: '0-2-nederlaag' > Dit is een sportuitslag. 'Europadeskundige' > Dit is een beroep. 'wij' > Dit is een voornaamwoord 'Januari' > Dit is een maand 'Wout' > Dit is alleen een voornaam 'Extreem Weer' > Dit is te specifiek '3400 pond' > dit is een geld bedrag 'Zwarte' > Dit is een bijvoeglijknaamwoord 'Handballers' > Dit is niet de sport zelf"},
            {"role": "user", "content": keyword}
        ]
    )
    keyword_message = completion.choices[0].message
    keyword_content = keyword_message.content
    # Extract the digit after 'relevant:'; if the prefix is missing, keep the whole reply
    number = keyword_content.split('relevant:')[-1].strip()
    return number

# Example call for a single keyword
print(get_unclear_keywords("voetballers"))

1
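The reply may vary in capitalization or quoting of the `relevant` prefix. A minimal sketch of a more tolerant parser follows, assuming the reply contains the word "relevant" followed by a single 0 or 1, or consists of the bare digit; the helper name `parse_relevance` is hypothetical.

```python
import re

def parse_relevance(reply):
    """Return True (relevant), False (not relevant), or None if the reply is unclear."""
    text = reply.strip()
    if text in ("0", "1"):
        return text == "1"
    # Accept variants such as "relevant: 1", "Relevant:0" or "'relevant': 0"
    match = re.search(r"relevant\W*([01])\b", text, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1) == "1"

print(parse_relevance("relevant: 1"))    # True
print(parse_relevance("'Relevant': 0"))  # False
print(parse_relevance("geen idee"))      # None
```

An unclear reply (`None`) can then be routed to the user instead of being counted as relevant.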


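Function: loop through the keywords and ask the LLM whether each one is relevant. When the LLM flags a keyword as not relevant, the user confirms whether it should be deleted.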


In [44]:
unclearlist = []

def unclearKeywordCheck(category):
    deletekeywords = []
    for keyword in category:
        if get_unclear_keywords(keyword) == '0':
            unclearlist.append(keyword)
            # Prompt (Dutch): "Is {keyword} a relevant keyword? Type 'Nee' (no) to delete the keyword"
            response = input(f"Is {keyword} een relevant keywoord? \n Typ: Nee om het keywoord te verwijderen")
            if response.lower() == "nee":
                deletekeywords.append(keyword)
        else:
            print(f"{keyword} is relevant")

    return deletekeywords

print(unclearKeywordCheck(words))



Lando Norris is relevant
Steven Spielberg is relevant
Timothée Chalamet is relevant
Kunstrijden is relevant
Boksers is relevant
NPO 3FM is relevant
Céline Dion is relevant
Tesla is relevant
Dilan Yesilgöz is relevant
Rami Malek is relevant
AI-test is relevant
Weer is relevant
Sprinttalent is relevant
München is relevant
Booking.com is relevant
Coronavirus is relevant
Yesilgöz is relevant
Deena & Jim is relevant
Formule 2 is relevant
Griselda is relevant
Timothee Chalamet is relevant
Bulgarije is relevant
3FM is relevant
Booking is relevant
Nijmegen is relevant
['3400 pond']


Get every keyword of each category.
Send the keywords to the unclearKeywordCheck function.
Remove the keywords the user rejected from both the JSON file and the dataset.

In [59]:
with open('Categories_allKeywordstests.json', 'r') as f:
    data = json.load(f)

# Iterate through the categories and their items
for category, items in data.items():
    print(f"Category: {category}")
    # Mapping the rejected keywords to '' makes replaceKeywordsData blank them out
    removelist = {
        '': unclearKeywordCheck(items)
    }
    removeKeywordsCategory(category, removelist)
    replaceKeywordsData(removelist)

Category: Algemeen
Binnen- En Buitenland is relevant
Vervuiling is relevant
Nieuws is relevant
Algemeen is relevant
Duurzaamheid is relevant
Dagelijks Leven is relevant
Brexit is relevant
Jeugdzorg is relevant
Postcovid is relevant
Feiten En Cijfers is relevant
Klimaattop is relevant
Privacy is relevant
Aanslag Tram Utrecht is relevant
Onderwijs is relevant
Coronavirus is relevant
Wonen is relevant
Oorlog is relevant
Goed Nieuws is relevant
Buitenland is relevant
Uitgelicht is relevant
Weer is relevant
Speelgoedbanken is relevant
Discriminatie En Racisme is relevant
Armoede is relevant
Doemdenken is relevant
Binnenland is relevant
January  
Januari  
January  
Januari  
Category: Economie
Sell is relevant
Beleggen is relevant
Stagevergoedingen is relevant
Basisbeurs is relevant
Mkb is relevant
Consument is relevant
Beurzen is relevant
May And Go Away is relevant
Werk is relevant
Economie is relevant
Zzp'ers is relevant
Arbeidsmarkt is relevant
Dieselschandaal is relevant
Geld is releva

KeyboardInterrupt: 

Get every keyword of each category and ask the LLM for its category twice.
If the stored category and both LLM answers are not all the same,
ask the user where the keyword fits best and record the change to that category.

In [10]:
with open('Categories_allKeywordstest.json', 'r') as f:
    data = json.load(f)

with open('CategoryChanges.json', 'r') as f:
    datachanges = json.load(f)

# Build the list of categories shown to the user once
categories = ["Algemeen", "Economie", "Sport", "Media en Cultuur", "Technologie", "Wetenschap", "Persoon", "Bedrijf", "Politiek", "Evenement", "Dier", "Land", "Stad", "Overig"]
catString = "".join(f"- {c} \n" for c in categories)

# Iterate through the categories and their items
for category, items in data.items():
    for item in items:
        cat1 = getKeywordCategory(item)
        cat2 = getKeywordCategory(item)
        print(f"{category} {cat1} {cat2}")
        if not category == cat1 == cat2:
            # Prompt (Dutch): "Current category: ... What is the correct category for this keyword? Type the category"
            newcat = input(f"Huidige categorie: {category}\nWat is de juiste categorie voor het volgende keywoord: {item} \n \n Typ de category over: \n {catString}")
            datachanges.setdefault(category, []).append([item, newcat])

# Write the collected changes back; the file must be opened for writing
with open('CategoryChanges.json', 'w') as f:
    json.dump(datachanges, f, indent=4)


Algemeen Algemeen Overig
Algemeen Milieu Algemeen
Algemeen Algemeen Algemeen
Algemeen Algemeen Algemeen
Algemeen Overig Algemeen
Algemeen Algemeen Algemeen
Algemeen Algemeen Algemeen
Algemeen Algemeen Algemeen
Algemeen Stad Stad
Algemeen Persoon Persoon
Algemeen Media en Cultuur Media en Cultuur
Algemeen Bedrijf Bedrijf
Algemeen Media en Cultuur Media en Cultuur
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Sport Sport
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Economie Economie Economie
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Sport Sport
Sport Bedrijf Economie
Sport Sport Persoon
Sport Sport Sport
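In several of the rows above, both LLM answers agree with each other but not with the stored category (for example `Algemeen Stad Stad`). One possible refinement, which is not what the notebook does, is a majority vote that proposes a category to the user. A minimal sketch follows, with the hypothetical helper `review_suggestion`:

```python
from collections import Counter

def review_suggestion(stored, answers):
    """Return (needs_review, suggestion) for one keyword.

    needs_review is True unless the stored category and all answers agree.
    suggestion is the category with a strict majority of the votes, else None.
    """
    votes = Counter([stored, *answers])
    top, count = votes.most_common(1)[0]
    total = len(answers) + 1
    needs_review = count < total
    suggestion = top if count > total / 2 else None
    return needs_review, suggestion

print(review_suggestion("Algemeen", ["Algemeen", "Algemeen"]))  # (False, 'Algemeen')
print(review_suggestion("Algemeen", ["Stad", "Stad"]))          # (True, 'Stad')
print(review_suggestion("Sport", ["Bedrijf", "Economie"]))      # (True, None)
```

The suggestion could be shown as a default in the `input` prompt, so the user only has to confirm it.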


Move a keyword to another category.

In [None]:
def changeCategory(changesFilename, applyFilename):
    with open(changesFilename, 'r') as f:
        data = json.load(f)

    with open(applyFilename, 'r') as f:
        applydata = json.load(f)

    for category, items in data.items():
        for item in items:
            keyword, newcat = item
            applydata[newcat].append(keyword)
            applydata[category].remove(keyword)

    with open('output.json', 'w') as f:
        json.dump(applydata, f, indent=4)


changeCategory('CategoryChanges.json', 'Categories_allKeywordstest.json')