# _NLP_ take home test

### Points to note:
* You must not use an LLM to generate all or parts of the code / comments
* Any code you copy / reuse must be attributed to the source
* Your code and any associated files should be able to be executed in a _clean_ python environment, i.e. you should provide appropriate files and instructions to add any additional packages required.
* You should keep any additional installation requirements (outside of python packages) to the minimum possible - this is not intended to be an infrastructure test. If you want to extend the solution with infrastructure, please use the 'Future work' section detailed below
* If you have python version requirements, please specify in a markdown block at the start of your answer
* If you choose to use an LLM please ensure it will run _locally_, with no paywall / api requirements, entirely within your script.  No efforts will be made to re-construct infrastructure dependencies.

### Problem:

You are working on a top secret news analysis project, and as part of this, the users need to be able to perform an open ended search within the associated user interface.  

For the initial release you have a set of news articles, split by category, and your job is to enrich this data / create a new dataset; plus an associated search method.

You have been given the key concepts that are relevant in the data and should be available to the UI (you do NOT need to build the user interface) via your search function:
* Overall / main topic - the key topic the article relates to
* Relevance - a measure of relevance to the set of search terms.  This is a float in the range 0 - 1.0 (1.0 indicating 100% relevance)
* Novelty - a measure of how different this article is to the others. This is a float in the range 0 - 1.0 (1.0 indicating a completely unique result)

I.e.: Given a search string, your function should return:  `{'topic':'your_topic_here', 'relevance': 0.1, 'novelty': 0.1, 'article': full_article_object}`
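The brief leaves the scoring method open. One possible reading, sketched below, treats relevance as the cosine similarity between TF-IDF vectors of the search string and the article, and novelty as 1 minus the article's similarity to its closest neighbour in the collection. This is a minimal illustration under those assumptions, not the method used by the submission; `tokenize`, `build_idf`, `vectorize`, `cosine` and `score_articles` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(tokenized_docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def vectorize(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector (term -> weight); terms unseen in the corpus are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors, clamped to 1.0; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return min(1.0, dot / norm) if norm else 0.0


def score_articles(query: str, docs: List[str]) -> List[Dict[str, float]]:
    """Return one {'relevance', 'novelty'} pair per document, both in [0, 1]."""
    tokenized = [tokenize(d) for d in docs]
    idf = build_idf(tokenized)
    doc_vecs = [vectorize(tokens, idf) for tokens in tokenized]
    query_vec = vectorize(tokenize(query), idf)

    scores = []
    for i, vec in enumerate(doc_vecs):
        # Assumed novelty definition: 1 minus similarity to the closest other article
        nearest = max((cosine(vec, other) for j, other in enumerate(doc_vecs) if j != i),
                      default=0.0)
        scores.append({"relevance": cosine(query_vec, vec), "novelty": 1.0 - nearest})
    return scores
```

Because all TF-IDF weights are non-negative, both scores stay within the 0 - 1.0 range the brief asks for. An exact duplicate of another article gets a novelty close to 0, and an article sharing no terms with the rest gets 1.0.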

The project sponsor is keen to add additional features so would also like to add the following.  They have been clear these are stretch goals and will not be critical to the success of the initial release.  You can add extra elements to the result dictionary to support this functionality if you need to.
* 'similar articles' - an ability to find similar articles to the current _selection_ of articles 
* 'key terms' - the main terms and entities contained in the article (this will be used to enable a rich UI feature showing terms with context highlighting; you only need to tag the terms).
* 'user specific article ordering' - an ability to re-order the articles returned; the first version should be to put 'relevant' articles at the top of the list.
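Two of these stretch goals can be illustrated with very little code. The sketch below assumes result dictionaries of the shape given above and uses hypothetical helper names (`reorder_by_relevance`, `tag_key_terms`). It shows one way to put relevant articles first and to tag key terms as character spans that a UI could highlight. Matching is a simple case-insensitive whole-word search, which is an assumption and would not handle inflected forms or terms ending in punctuation.

```python
import re
from typing import Dict, List, Tuple


def reorder_by_relevance(results: List[Dict]) -> List[Dict]:
    """Return a new list with the most relevant results first.

    sorted() is stable, so results with equal relevance keep their original order.
    """
    return sorted(results, key=lambda r: r["relevance"], reverse=True)


def tag_key_terms(content: str, key_terms: List[str]) -> List[Tuple[int, int, str]]:
    """Locate each key term in the text as (start, end, term) character spans."""
    spans = []
    for term in key_terms:
        # Whole-word, case-insensitive match; re.escape guards against regex metacharacters
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, content, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), term))
    return sorted(spans)


results = [
    {"topic": "sport", "relevance": 0.1, "novelty": 0.9, "article": "a"},
    {"topic": "sport", "relevance": 0.7, "novelty": 0.4, "article": "b"},
]
ordered = reorder_by_relevance(results)
spans = tag_key_terms("Arsenal beat Bolton. Arsenal won.", ["arsenal", "bolton"])
```

Returning spans rather than modified text keeps the article content untouched, so the UI can decide how to render the highlighting.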

Whilst your solution will be delivered inside this notebook, you can make use of external files as you see fit. 

You should assume that your _code_ will be directly ported into the 'production' codebase on completion, so consider this in your submission.  _However_, please make sure you add documentation into the notebook to explain what you are doing and why.

Include any areas for extension / further evaluation / things you wanted to try but didn't get time to in a final markdown section 'Future work' - there is no expectation you add anything here, but please use it to capture anything you think would be useful for future review.

We are looking to understand how you approach an open-ended NLP problem, and will not consider an answer better or worse for using, or not using, an LLM.

_References_

*   Dataset: D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006 - http://mlg.ucd.ie/datasets/bbc.html

In [1]:
# Importing the relevant packages
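import os
import sys
import urllib.request  # `import urllib` alone does not guarantee urllib.request is loaded
import zipfile
from typing import List, Dict

# Project root with a trailing separator, because later cells build paths by string concatenation
root_dir = os.path.join(os.getcwd(), '')
sys.path.append(root_dir)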

from src_code.data_manager.dm_load_data import DMLoadData

[nltk_data] Downloading package punkt to /home/zchiam002/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/zchiam002/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/zchiam002/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# Download the data files referenced above 
url = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
extract_dir = root_dir + "data"

zip_path, _ = urllib.request.urlretrieve(url)
with zipfile.ZipFile(zip_path, "r") as f:
    f.extractall(extract_dir)

In [3]:
# Initialize the article repository. For this test it is wrapped in a class;
# in production it should be an interface to a vector (embedding) database.
articles = DMLoadData.load_from_directory(root_dir + "data//bbc//")
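The comment above suggests that production code would talk to a vector database through an interface instead of an in-notebook class. A minimal sketch of that idea follows. It is an assumption about how such an interface might look, not the contents of `DMLoadData`: the method names mirror the calls used in this notebook, but the `ArticleRepository` protocol, the list-of-dictionaries return type, and the toy `InMemoryRepository` ranking are all hypothetical.

```python
from typing import Dict, List, Protocol


class ArticleRepository(Protocol):
    """Hypothetical interface; a vector database client could implement it."""

    def get_article(self, article_idx: int) -> Dict: ...

    def get_top_n_articles(self, incoming_query: str, top_n: int) -> List[Dict]: ...


class InMemoryRepository:
    """Toy stand-in that ranks articles by how many query words they contain."""

    def __init__(self, articles: List[Dict]) -> None:
        self._articles = articles

    def get_article(self, article_idx: int) -> Dict:
        return self._articles[article_idx]

    def get_top_n_articles(self, incoming_query: str, top_n: int) -> List[Dict]:
        words = set(incoming_query.lower().split())

        def overlap(article: Dict) -> int:
            # Count distinct query words that appear in the article body
            return len(words & set(article["content"].lower().split()))

        return sorted(self._articles, key=overlap, reverse=True)[:top_n]


def top_titles(repo: ArticleRepository, query: str, top_n: int = 3) -> List[str]:
    """Search code depends only on the interface, not on the storage backend."""
    return [article["title"] for article in repo.get_top_n_articles(query, top_n)]
```

With this shape, swapping the in-memory class for a database-backed one would not require changes to the search function, only a different object passed in.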

In [None]:
"""
You should include a working function implementation here. It takes in a search string and finds the most relevant articles,
returning a list of the form given in the task introduction 
{'topic':'your_topic_here', 'relevance': 0.1, 'novelty': 0.1, 'article': full_article_object}
e.g. 
:param search_term: string containing the search
:return:            list of result objects as detailed above
"""

# Number of relevant articles to return per search
RETURN_TOP_N = 10
# Number of key terms to report for each returned article
RETURN_TOP_N_KEY_TERMS = 5

def search_data(search_term: str) -> List[Dict]:
    # Retrieve the top-N articles for the query, along with their scores
    ret_df = articles.get_top_n_articles(incoming_query=search_term, top_n=RETURN_TOP_N)

    # Reformat the dataframe output into a list of result dictionaries
    ret_list = []
    for row in ret_df.index:
        repository_idx = int(ret_df.loc[row, 'article_idx'])
        curr_article = articles.get_article(repository_idx)
        curr_key_terms = sorted(curr_article.get_key_terms(top_n=RETURN_TOP_N_KEY_TERMS))

        ret_list.append({'topic': search_term,
                         'relevance': float(ret_df.loc[row, 'relevance_score']),
                         'novelty': float(ret_df.loc[row, 'novelty_score']),
                         'article': {'category': ret_df.loc[row, 'category'],
                                     'title': ret_df.loc[row, 'title'],
                                     'content': ret_df.loc[row, 'content'],
                                     'key_terms': curr_key_terms}})

    return ret_list


In [11]:
search_data('arsenal football club')

[{'topic': 'arsenal football club',
  'relevance': 0.1674558214399129,
  'novelty': 0.8950634484833246,
  'article': {'repository_index': 437,
   'category': 'sport',
   'title': 'Arsenal through on penalties',
   'content': "Arsenal win 4-2 on penalties\n\nThe Spanish goalkeeper saved from Alan Quinn and Jon Harley as Arsenal sealed a quarter-final trip to Bolton with a 4-2 victory on penalties. Lauren, Patrick Vieira, Freddie Ljungberg and Ashley Cole scored for Arsenal, while Andy Gray and Phil Jagielka were on target for the Blades. Michael Tonge and Harley wasted chances for the underdogs, but Paddy Kenny was inspired to keep Arsenal at bay. Arsenal, stripped of attacking talent such as Thierry Henry and Dennis Bergkamp, partnered 17-year-old Italian striker Arturo Lupoli with Ljungberg up front. It was a revamped Arsenal line-up, and they were almost a goal behind within seconds as Tonge wasted a glorious chance. Gray ran free down the right flank, and his cross left Tonge with t