# _NLP_ take home test

### Points to note:
* You must not use an LLM to generate all or parts of the code / comments
* Any code you copy / reuse must be attributed to the source
* Your code and any associated files should be able to be executed in a _clean_ python environment, i.e. you should provide appropriate files and instructions to add any additional packages required.
* You should limit any additional installation requirements to the minimum possible (outside of python packages) - this is not intended to be an infrastructure test.  If you would want to extend via infrastructure solutions, please use the 'Future work' section detailed below
* If you have python version requirements, please specify in a markdown block at the start of your answer
* If you choose to use an LLM please ensure it will run _locally_, with no paywall / api requirements, entirely within your script.  No efforts will be made to re-construct infrastructure dependencies.

### Problem:

You are working on a top secret news analysis project, and as part of this, the users need to be able to perform an open ended search within the associated user interface.  

For the initial release you have a set of news articles, split by category, and your job is to enrich this data / create a new dataset; plus an associated search method.

You have been given the key concepts that are relevant in the data and should be available to the UI (you do NOT need to build the user interface) via your search function:
* Overall / main topic - the key topic the article relates to
* Relevance - a measure of relevance to the set of search terms.  This is a float in the range 0 - 1.0 (1.0 indicating 100% relevance)
* Novelty - a measure of how different this article is to the others. This is a float in the range 0 - 1.0 (1.0 indicating a completely unique result)

I.e.: Given a search string, your function should return:  `{'topic':'your_topic_here', 'relevance': 0.1, 'novelty': 0.1, 'article': full_article_object}`

The project sponsor is keen to add additional features so would also like to add the following.  They have been clear these are stretch goals and will not be critical to the success of the initial release.  You can add extra elements to the result dictionary to support this functionality if you need to.
* 'similar articles' - an ability to find similar articles to the current _selection_ of articles 
* 'key terms' - the main terms and entities contained in the article (this will be used to enable a rich UI feature showing terms with context highlighting; you only need to tag the terms).
* 'user specific article ordering' - an ability to re-order the articles returned; the first version should be to put 'relevant' articles at the top of the list.

Whilst your solution will be delivered inside this notebook, you can make use of external files as you see fit. 

You should assume that your _code_ will be directly ported into the 'production' codebase on completion, so consider this in your submission.  _However_ please make sure you add documentation into the notebook to explain what you are doing and why

Include any areas for extension / further evaluation / things you wanted to try but didn't get time to in a final markdown section 'Future work' - there is no expectation you add anything here, but please use it to capture anything you think would be useful for future review.

We are looking to understand how you approach an open-ended NLP problem, and will not be considering answers using / not-using an LLM to be better/worse

_References_

*   Dataset: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006 - http://mlg.ucd.ie/datasets/bbc.html

In [1]:
# Importing the relevant packages
import os
root_dir = os.path.abspath(os.path.join(os.getcwd(), '', '')) + '//'
import sys 
sys.path.append(root_dir)
import urllib.request  # the submodule must be imported explicitly for urlretrieve
import zipfile
from typing import List, Dict

from src_code.data_manager.dm_load_data import DMLoadData
from src_code.data_manager.dm_user_profile import DMUserProfile

[nltk_data] Downloading package punkt to /home/zchiam002/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/zchiam002/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/zchiam002/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Download the data files referenced above 
url = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
extract_dir = root_dir + "data"

zip_path, _ = urllib.request.urlretrieve(url)
with zipfile.ZipFile(zip_path, "r") as f:
    f.extractall(extract_dir)

# Test Response

The following points encompass the primary objectives of the test response:

- Article Relevance and Novelty: The system calculates how relevant and novel an article is to a user's query.
- Content Understanding: The system extracts key terms from articles to understand their core topics and groups similar articles together.
- User Personalization: It creates a user-level profile to customize search results and recommendations based on an individual’s article history.

These capabilities are powered by a backend codebase organized into clear, reusable classes, such as ArticleRepository and Article, to keep the structure modular and maintainable. The included unit tests verify the correctness of the methods and help maintain code quality.

The response has the following structure: 
1. A walkthrough of the basic methods used to answer the test questions. 
2. A discussion of the limitations of the current implementation. 
3. A discussion of the future directions.

This implementation provides a foundational approach to text analysis. It highlights the core steps of the process and prioritizes conceptual clarity and correct execution, serving as a stepping stone for future accuracy improvements.

In [3]:
"""The ArticleRepository class is designed to manage a collection of articles. It includes the methods necessary to perform core functions such as text processing, 
vectorization, and similarity calculations. These capabilities enable the system to handle user queries and facilitate personalized recommendations.

The 'DMLoadData' class processes the news data and stores it in a repository. Each article within the repository is further split into 3 components: 
- category, 
- title, 
- content.

This structure allows for the application of component-level weights. A key design decision is to give the article's title a higher weight than its content when calculating relevance. 
The reasoning is that the title often encapsulates the main idea of the article more concisely, acting as its most critical component. While this is a reasonable assumption, we acknowledge 
that it may introduce biases and is a good topic for further evaluation."""

# Initialize the article repository - for this test, I have put it in a class. In production, it should be an interface with a vectorized/embedded database
articles = DMLoadData.load_from_directory(root_dir + "data//bbc//")
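
One way to apply the title-over-content weighting described above is to scale each component's term counts before vectorization. The sketch below is illustrative only: `TITLE_WEIGHT`, `CONTENT_WEIGHT`, and `weighted_term_counts` are hypothetical names and values, not taken from `DMLoadData` or the repository class, and the whitespace tokenization is a simplification.

```python
from collections import Counter

# Hypothetical weights (assumption): a title term counts three times as much
# as a content term. The actual weights in the codebase may differ.
TITLE_WEIGHT = 3.0
CONTENT_WEIGHT = 1.0


def weighted_term_counts(title: str, content: str) -> Counter:
    """Combine title and content term counts, up-weighting title terms."""
    counts: Counter = Counter()
    for token in title.lower().split():
        counts[token] += TITLE_WEIGHT
    for token in content.lower().split():
        counts[token] += CONTENT_WEIGHT
    return counts


counts = weighted_term_counts(
    "Arsenal through on penalties",
    "Arsenal win on penalties after a goalless draw",
)
print(counts.most_common(3))
```

Terms that appear in both the title and the content accumulate both weights, so they dominate the resulting vector. This is the source of the bias noted above, because a misleading title would skew the relevance score.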

In [None]:
"""The method 'get_top_n_articles' works by executing the following steps in order: 
1. basic preprocessing of the data 
    - conversion to lower case 
    - removal of punctuation and special characters 
    - tokenization and stop-word removal 
    - lemmatization, i.e., conversion to the root word
2. vectorization of the processed data 
    - converting words into a vector using the term frequency-inverse document frequency (TF-IDF) method
    - this weighs how often a word is used in an article against how rare it is across the repository before assigning a numerical representation
3. similarity calculation using cosine similarity 
    - it is selected because it measures the angle between vectors rather than their absolute magnitude
    - this makes it easier to compare texts of different lengths
4. novelty calculation 
    - for every article, calculate the similarity scores to the other articles
    - find the average similarity score of the repository 
    - compare each article's average similarity with the repository average
    - the size of the difference between the individual and repository averages is used as a measure of novelty
    - normalize this result
    - this method is similar to k-means with only 1 centroid. 
5. returning the top n key terms 
    - for simplicity, key words instead of terms are used
    - word frequency after removal of stop words is used to determine which words to return
6. the 'search_data' function 
    - this function processes the query in the same manner and compares it with each article individually
    - the top 10 articles most relevant to the query are returned alongside the relevance score, novelty score and key terms.
    """
# The number of relevant articles to return 
RETURN_TOP_N = 10
RETURN_TOP_N_KEY_TERMS = 5

def search_data(search_term: str) -> List[Dict]:
    ret_df = articles.get_top_n_articles(incoming_query=search_term, top_n=RETURN_TOP_N)

    # Reformat the output into a list of dictionaries
    ret_list = []
    for row in ret_df.index:
        repository_idx = int(ret_df['article_idx'][row])
        curr_article = articles.get_article(repository_idx)
        curr_key_terms = sorted(curr_article.get_key_terms(top_n=RETURN_TOP_N_KEY_TERMS))

        ret_list.append({'query': search_term, 
                         'relevance': float(ret_df['relevance_score'][row]), 
                         'novelty': float(ret_df['novelty_score'][row]), 
                         'article': {'category': ret_df['category'][row],
                                     'title': ret_df['title'][row], 
                                     'content': ret_df['content'][row],
                                     'key_terms': curr_key_terms,
                                     'index': repository_idx}})

    return ret_list
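
Steps 1 to 4 of the docstring above can be sketched with the standard library alone. This is a minimal illustration, not the code behind `get_top_n_articles`. The stop-word list, the smoothed IDF formula, and the helper names (`tokenize`, `fit_idf`, `tfidf`, `cosine`, `novelty_scores`) are all assumptions, and lemmatization is omitted. The novelty function is one possible reading of the description: the repository-average similarity minus each article's average similarity, min-max normalized.

```python
import math
import re
from collections import Counter
from typing import Dict, List

# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text: str) -> List[str]:
    """Lower-case, drop punctuation and stop words. Lemmatization is omitted."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency: rarer terms get larger weights."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + len(docs)) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Term count times IDF; terms not seen when fitting the IDF are ignored."""
    return {t: count * idf[t] for t, count in Counter(tokens).items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def novelty_scores(vectors: List[Dict[str, float]]) -> List[float]:
    """Repository-average similarity minus each article's average, scaled to [0, 1]."""
    n = len(vectors)
    if n < 2:
        return [1.0] * n  # a lone article has nothing to be similar to
    means = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    repo_mean = sum(means) / n
    raw = [repo_mean - m for m in means]
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]


corpus = [
    "Arsenal win on penalties",
    "Arsenal beat Chelsea in the cup",
    "Bank raises interest rates",
]
tokenized = [tokenize(doc) for doc in corpus]
idf = fit_idf(tokenized)
vectors = [tfidf(tokens, idf) for tokens in tokenized]

# The query is vectorized with the same IDF weights as the articles.
query_vector = tfidf(tokenize("arsenal football club"), idf)
relevance = [cosine(query_vector, vec) for vec in vectors]
novelty = novelty_scores(vectors)
```

On this toy corpus the query shares only the term "arsenal" with the first two articles, so both score above the finance article on relevance. The finance article shares no terms with the others and therefore receives the highest novelty.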

In [None]:
# An example query and the corresponding result
search_data('arsenal football club')

[{'query': 'arsenal football club',
  'relevance': 0.1674558214399129,
  'novelty': 0.8950634484833246,
  'article': {'category': 'sport',
   'title': 'Arsenal through on penalties',
   'content': "Arsenal win 4-2 on penalties\n\nThe Spanish goalkeeper saved from Alan Quinn and Jon Harley as Arsenal sealed a quarter-final trip to Bolton with a 4-2 victory on penalties. Lauren, Patrick Vieira, Freddie Ljungberg and Ashley Cole scored for Arsenal, while Andy Gray and Phil Jagielka were on target for the Blades. Michael Tonge and Harley wasted chances for the underdogs, but Paddy Kenny was inspired to keep Arsenal at bay. Arsenal, stripped of attacking talent such as Thierry Henry and Dennis Bergkamp, partnered 17-year-old Italian striker Arturo Lupoli with Ljungberg up front. It was a revamped Arsenal line-up, and they were almost a goal behind within seconds as Tonge wasted a glorious chance. Gray ran free down the right flank, and his cross left Tonge with the simplest of chances, but 

In [None]:
"""The following was done to address the 'user-specific article ordering' portion of the test:
1. Create a user profile class so that attributes specific to a user can be tracked
    - the user id
    - the latest n articles viewed
2. To demonstrate how this information will be used, a short series of queries and article selections is simulated to build a small user history."""

# Create a dummy user profile
user = DMUserProfile(user_id='dennis_bergkamp', history_limit=10)

# Simulate the generation of 3 queries and user choice
simulated_queries = ['am i still the best dribbler in the english premier league?', 
                     'how to overcome my fear of flying?', 
                     'who is the non-flying dutchman?']

simulated_article_choices = [2, 4, 1]

for query, choice in zip(simulated_queries, simulated_article_choices):
    curr_result = search_data(search_term=query)
    curr_chosen_article = curr_result[choice]

    # Updating the article history 
    user.append_article(curr_chosen_article['article']['index'])
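
The "latest n articles viewed" behaviour can be illustrated with a bounded queue. The class below is a hypothetical stand-in, not the actual `DMUserProfile` implementation. It assumes the history holds article indices, and that re-reading an article moves it to the most recent position.

```python
from collections import deque
from typing import Deque, List


class UserProfileSketch:
    """Hypothetical stand-in for a user profile that keeps the latest N article ids."""

    def __init__(self, user_id: str, history_limit: int = 10) -> None:
        self.user_id = user_id
        # deque(maxlen=...) silently discards the oldest entry once it is full.
        self.history: Deque[int] = deque(maxlen=history_limit)

    def append_article(self, article_idx: int) -> None:
        # Assumption: a repeat view refreshes the article's position
        # instead of creating a duplicate entry.
        if article_idx in self.history:
            self.history.remove(article_idx)
        self.history.append(article_idx)

    def recent_articles(self) -> List[int]:
        """Return the tracked article ids, oldest first."""
        return list(self.history)


profile = UserProfileSketch("demo_user", history_limit=3)
for idx in [5, 8, 5, 13, 21]:
    profile.append_article(idx)
print(profile.recent_articles())
```

Capping the history keeps the profile focused on recent interests and bounds the cost of rebuilding it after every selection.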

In [None]:
"""The user history is then used to generate an 'interest vector'. 
    - This vector is used to find the top 10 most similar unread articles, which are recommended to the user
    - This vector can also be used to tune the result order of the 'search_data' function at the user level
    - As a start, a simple weighted relevance vs interest score could be used, but this is not implemented in this code base
    - The following shows the working code of the relevant-articles generator using the above-mentioned method
"""

# Now let's use the small search history to recommend relevant articles
user_interest_vector = user.get_interest_vector(article_repository=articles, top_n_key_words=RETURN_TOP_N_KEY_TERMS)
ret_df = articles.get_top_n_recommended_articles(user_interest_vector=user_interest_vector, top_n_key_words=RETURN_TOP_N_KEY_TERMS, top_n_articles=RETURN_TOP_N)

relevant_articles = []
for row in range(ret_df.shape[0]):
    curr_article = articles.get_article(ret_df['article_idx'][row])
    relevant_articles.append({'category': curr_article.original_category,
                              'title': curr_article.original_title, 
                              'content': curr_article.original_content,
                              'key_terms': curr_article.key_words,
                              'index': int(ret_df['article_idx'][row])})

relevant_articles

[{'category': 'sport',
  'title': 'McClaren targets Champions League',
  'content': 'Middlesbrough boss Steve McClaren believes his side can clinch a top-four spot in the Premiership and secure qualification for the Champions League.\n\nAfter their 3-2 win over Manchester City, McClaren said: "We are playing exciting football, it\'s a magnificent result to keep us in the top five. "But how well we do depends how often we can get our best team out. "Once we got the third goal it should have been four or five but we nearly paid for it in the end." McClaren also praised winger Stewart Downing and strikers Jimmy Floyd Hasselbaink and Mark Viduka, who both ended barren runs in front of goal. He added: "If Stewart keeps playing like this Sven-Goran Eriksson has got to pick him. "And the strikers scored great goals, the combination play between them shows they want to play with each other and they are trying."',
  'key_terms': ['mcclaren', 'goal', 'champion', 'league', 'playing'],
  'index': 
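
The weighted relevance-versus-interest score, mentioned earlier as not implemented, could take a form like the sketch below. It is a hedged illustration only. `rerank`, `alpha`, and `interest_scores` are hypothetical names, the result dictionaries mirror only the `relevance` and `article['index']` fields returned by `search_data`, and the interest scores are assumed to be precomputed similarities between each article and the user's interest vector.

```python
from typing import Dict, List


def rerank(results: List[Dict], interest_scores: Dict[int, float], alpha: float = 0.7) -> List[Dict]:
    """Order results by alpha * relevance + (1 - alpha) * interest, highest first."""

    def blended(result: Dict) -> float:
        # Articles with no interest score (assumption) contribute 0.0 interest.
        interest = interest_scores.get(result["article"]["index"], 0.0)
        return alpha * result["relevance"] + (1.0 - alpha) * interest

    return sorted(results, key=blended, reverse=True)


# Made-up scores for illustration; not output from the notebook.
results = [
    {"relevance": 0.60, "article": {"index": 1}},
    {"relevance": 0.50, "article": {"index": 2}},
    {"relevance": 0.20, "article": {"index": 3}},
]
interest = {2: 0.9, 3: 0.1}

personalised = [r["article"]["index"] for r in rerank(results, interest)]
query_only = [r["article"]["index"] for r in rerank(results, interest, alpha=1.0)]
print(personalised, query_only)
```

With `alpha=1.0` the ordering falls back to pure query relevance, and lower values let the user's history pull matching articles towards the top of the list.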

# Limitations and Future Work
1. Creation of a 