In Block C, I have web-scraped the page and created the following dataset:

# Dataset:
The dataset has the following columns:
ID: A unique identifier for each article.
Title: The title of the article.
Description: A brief summary or description of the article.
Author: The author of the article.
Date: The publication date of the article.
Subject: The topic or category to which the article belongs.
Text: The full text of the article.
Keywords: Keywords describing the article.

Within the nu.nl website, there are 6 main categories:
1. Frontpage
2. Economy
3. Sports
4. Media and Culture
5. Gossip
6. Other

Each main category has further subcategories. For example, Economy contains Tech, and Sports contains Skating.

The dataset from nu.nl has several issues:
- Some keyword fields contain only the subject; keywords from the text need to be added to these.
- Other keyword fields lack entities mentioned in the article, such as the city or province.
- The main category is missing from some articles. For example, an article about skating is tagged as Skating but not as Sports.
- In my design, users should be able to follow topics, people, cities, and countries to see related news. The data is not currently structured to support this.
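
The missing main category could be filled in with a small lookup table. The sketch below is only an illustration: `SUBJECT_TO_MAIN` and `add_main_category` are hypothetical names, and only the Tech/Economy and Skating/Sports pairs come from the examples above. The subject values in the actual dataset may be spelled differently (for example in Dutch).

```python
# Hypothetical lookup from subcategory to main category.
# Only these two pairs are taken from the examples in the text;
# a complete table would have to be built from the nu.nl site structure.
SUBJECT_TO_MAIN = {
    "Tech": "Economy",
    "Skating": "Sports",
}


def add_main_category(subject, keywords):
    """Return a copy of the keyword list with the main category appended if missing."""
    main = SUBJECT_TO_MAIN.get(subject.strip())
    if main is None or main in keywords:
        # Unknown subcategory, or the main category is already present
        return list(keywords)
    return list(keywords) + [main]


print(add_main_category("Skating", ["Skating", "KNSB"]))
# ['Skating', 'KNSB', 'Sports']
```

Appending the main category at the end keeps the subject as the first keyword, which matches the ordering used later in this notebook.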

First, I will fill in the sparse keyword fields using spaCy.

In [2]:
# Import 
import spacy
import pandas as pd
from openai import OpenAI
import os
from dotenv import load_dotenv

load_dotenv()
apiKey = os.environ.get('OPENAI_API_KEY')

# Read the dataset
df = pd.read_csv('nu-articles-v2.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   ID           458 non-null    int64 
 1   Title        458 non-null    object
 2   Description  458 non-null    object
 3   Author       458 non-null    object
 4   Date         458 non-null    object
 5   Subject      458 non-null    object
 6   Text         458 non-null    object
 7   Keywords     458 non-null    object
dtypes: int64(1), object(7)
memory usage: 28.8+ KB


In [2]:
# load the Dutch language model from spaCy
nlp = spacy.load("nl_core_news_lg")

# Function to extract and format keywords from text
def extract_keywords(text):
    # Process the text using the loaded Dutch language model
    doc = nlp(text)
    # Keep only named entities of the selected types; a label the model does not
    # define (check nlp.get_pipe("ner").labels) simply never matches
    named_entities = set([entity.text for entity in doc.ents if entity.label_ in ["PERSON", "ORG", "LOC", "GPE", "ANIMAL"]])
    # Join the extracted entities into a comma-separated string
    return ', '.join(named_entities)


# Apply the function to the combined 'Title' and 'Description' text of each row
df['Extracted_Keywords'] = df[['Title', 'Description']].apply(lambda x: extract_keywords(' '.join(x)), axis=1)

# Display the DataFrame with 'Description', 'Keywords', and 'Extracted_Keywords' columns
print(df[['Description', 'Keywords', 'Extracted_Keywords']].head())


                                         Description  \
0  Werknemers krijgen geen wettelijk recht op thu...   
1  Common en Jennifer Hudson hebben laten weten d...   
2  Ondanks de afgesproken radiostilte tijdens de ...   
3  We vinden het heel belangrijk om te weten welk...   
4  De pepernoten... pardon, kruidnoten liggen alw...   

                                            Keywords  \
0  Politiek, Werk, Economie, Binnenland, NUjij, T...   
1                        Achterklap, Jennifer Hudson   
2                            Politiek, Formatie 2023   
3                                      Nieuws, NU.nl   
4                       Sinterklaas, Eten en Drinken   

                                  Extracted_Keywords  
0                                Eerste Kamer, NUjij  
1                            Jennifer Hudson, Hudson  
2  BBB, Geert Wilders, Dilan Yesilgöz, VVD, NSC, PVV  
3                                                     
4                                                  

The extracted keywords now contain the people, organizations, locations, countries, and animals that spaCy recognized. Because the current keyword list is messy, I want to impose the following structure: the subject comes first, followed by the keywords supplied by nu.nl and then my extracted keywords.
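
The intended ordering can be sketched in plain Python. `merge_keywords` is a hypothetical helper, not the code used in this notebook; it assumes keywords are stored as comma-separated strings, as in the dataset. The example values are loosely based on row 1 of the output above, and the subject value is an assumption.

```python
def merge_keywords(subject, site_keywords, extracted_keywords):
    """Put the subject first, then the nu.nl keywords, then the spaCy ones.

    Duplicates and empty entries are dropped; first-seen order is kept.
    """
    parts = [subject]
    parts += site_keywords.split(",")
    parts += extracted_keywords.split(",")
    cleaned = (p.strip() for p in parts)
    # dict.fromkeys keeps insertion order, so it deduplicates without reordering
    return ", ".join(dict.fromkeys(p for p in cleaned if p))


print(merge_keywords("Achterklap", "Achterklap, Jennifer Hudson", "Jennifer Hudson, Hudson"))
# Achterklap, Jennifer Hudson, Hudson
```

Exact duplicates disappear, but a partial name such as "Hudson" survives because it is a different string from "Jennifer Hudson".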

In [3]:
# Remove the subject from keywords and extracted_keywords if present
# Loop through 'Keywords' and 'Extracted_Keywords' columns
for col in ['Keywords', 'Extracted_Keywords']:
    # Apply lambda function to each row to remove subject from keywords
    df[col] = df.apply(lambda row: ', '.join([kw.strip() for kw in row[col].split(',') if kw.strip() != row['Subject'].strip()]), axis=1)

# Combine columns and remove duplicates
# Combine 'Subject', 'Keywords', and 'Extracted_Keywords' into 'Combined' column
df['Combined'] = df['Subject'].str.strip()  + ', ' + df['Keywords'] + ', ' + df['Extracted_Keywords']
# Drop empty entries and duplicates while keeping first-seen order, so the subject stays first
df['Combined'] = df['Combined'].apply(lambda x: ', '.join(dict.fromkeys(kw.strip() for kw in x.split(',') if kw.strip())))

# Remove any leading commas from the combined string
df['Combined'] = df['Combined'].str.strip(', ')

# Copy the content of the 'Combined' column to the 'Keywords' column
df['Keywords'] = df['Combined']

# Remove the 'Combined' and 'Extracted_Keywords' columns
df.drop(['Combined', 'Extracted_Keywords'], axis=1, inplace=True)



Partial names are not removed: "Selena" and "Gomez" still appear alongside "Selena Gomez".
#### Current Keywords:
Gossip, Selena, Selena Gomez, Gomez, Benny Blanco
#### Desired Keywords:
Gossip, Selena Gomez, Benny Blanco

#### Current Keywords:
Skating, Hein Otterspeer, Otterspeer, World Cup skating, Quebec, KNSB, Kjeld Nuis, Nuis
#### Desired Keywords:
Skating, Hein Otterspeer, World Cup skating, Quebec, KNSB, Kjeld Nuis

Sometimes the same place appears twice, once in English ("Paris") and once in Dutch ("Parijs").
#### Current Keywords:
Gossip, Paris-actress, Paris, Emily, Parijs, Ashley Park
#### Desired Keywords:
Gossip, Paris-actress, Emily, Paris, Ashley Park
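
The partial-name case on its own could also be approximated locally, without an API call. The sketch below is a hypothetical heuristic (`drop_partial_names` is not part of this notebook): it drops a keyword when all of its words also occur in a longer keyword. The comparison is case-sensitive on purpose, so "Skating" is not treated as part of "World Cup skating".

```python
def drop_partial_names(keywords):
    """Drop a keyword when all of its words occur in a longer keyword."""
    token_sets = [set(kw.split()) for kw in keywords]
    kept = []
    for i, kw in enumerate(keywords):
        # A proper subset means another keyword contains every word of this one, plus more
        is_partial = any(
            i != j and token_sets[i] < token_sets[j]
            for j in range(len(keywords))
        )
        if not is_partial:
            kept.append(kw)
    return kept


print(drop_partial_names(["Gossip", "Selena", "Selena Gomez", "Gomez", "Benny Blanco"]))
# ['Gossip', 'Selena Gomez', 'Benny Blanco']
```

This heuristic does not cover translations such as Paris and Parijs, and it can delete too much when a general term happens to be one word of a longer keyword.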

To fix this, I will use the OpenAI API.

In [4]:
client = OpenAI(api_key=apiKey)

# Define a function to get OpenAI completion for a given set of keywords
def get_openai_completion(keywords):
    # Create completion request
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
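        messages=[
            # System prompt (Dutch): "You remove duplicate keywords such as names and translations,
            # choose full names and the Dutch word, and return them as a comma-separated list."
            {"role": "system", "content": "Je verwijdert dubbele keywords zoals namen en vertalingen, kiest voor volledige namen en het Nederlandse woord en geeft ze terug in een lijst met komma's. Please prefix the keywords with 'Keywords:'"},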
            {"role": "user", "content": keywords}
        ]
    )
    keywords_message = completion.choices[0].message
    keywords_content = keywords_message.content

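    # Extract the words after 'Keywords:'; fall back to the full reply if the prefix is missing
    keywords_text = keywords_content.split('Keywords:', 1)[-1]
    keywords = [kw.strip() for kw in keywords_text.split(',') if kw.strip()]
    return keywords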

# Apply the function to each row of the 'Keywords' column
df['Keywords'] = df['Keywords'].apply(get_openai_completion)

In [6]:
# Join each keyword list back into a comma-separated string (removes the list brackets)
df['Keywords'] = df['Keywords'].apply(lambda x: ', '.join(x))

In [7]:
# Save the DataFrame as a CSV file
df.to_csv('nu-articles-v3.csv', index=False)


In [6]:
df = pd.read_csv('nu-articles-v3.csv')

# Split each string by commas and concatenate all lists
all_keywords = []
for keywords_str in df['Keywords']:
    keywords_list = keywords_str.split(', ')
    all_keywords.extend(keywords_list)

# Print the combined list
# print(all_keywords)

# Convert the combined list to a set to remove duplicates, then convert it back to a list
unique_keywords = list(set(all_keywords))

# Print the unique keywords list
print(unique_keywords)

['Lando Norris', 'achterklap', 'Steven Spielberg', 'Kunstrijden', 'Boksen', 'Armoede', 'Céline Dion', 'Tesla', 'Rami Malek', 'AI-test', 'Sprinttalent', 'Coronavirus', 'Deena & Jim', 'Formule 2', 'Griselda', 'Bulgarije', 'Transferblog', 'Darts', 'Muziek', 'Tim Prins', 'Thuis', 'Grace Jabbari', 'Ceylin del Carmen Alvarado', 'WK darts', 'Epic Games', 'Spanningen Midden-Oosten', 'Zonnepanelen', 'Selena Gomez', 'NTR', 'Algemeen', 'Politiek', 'CDA', 'Abdul Malak', 'Qbuzz', 'Japan', 'Vluchtelingen', 'Tom Holland', 'Michael Smith', 'Shimon', 'Zorg', 'Ammar', 'coronavaccins', 'India', 'Oorlog Israël en Hamas', 'Schaatsen', 'JAXA', 'JA21', 'Houthi-rebellen', 'Syrië', 'Lucinda Brand', 'Zuid', 'Crowe', "John van 't Schip", 'Michael Shannon', 'Kanker', 'Verenigd Koninkrijk', 'Sjoemelsoftware', 'Prinses Beatrix', 'Hamas', 'Oostenrijkse Weissensee', 'James Allison', 'Aflossingsploeg', 'Sint-Pietersberg', 'Edward Sturing', 'Rik van de Westelaken', 'Terschelling', "Jens van 't Wout", 'Tech en Wetenscha