## Part B: Information Retrieval (IR) System

##### Part 1: Importing Articles Data into Python

In [1]:
"""
Information Retrieval Model

This model is created to retrieve information from 1000 articles. There are:
100 business articles
100 entertainment articles
100 food articles
100 graphics articles
100 historical articles
100 medical articles
100 politics articles
100 space articles
100 sport articles
100 technologies articles

These are expected to be in the same folder. The folder path below is from
the writer's personal desktop account; change it to match the location of
the files on your own machine.
"""

# Use Path from the pathlib library to read the files in the correct
# folder - this is only to begin reading files
from pathlib import Path

# Create an object 'database' to link to the correct folder
database = Path("C:\\Users\\stanleytjandra.DESKTOP-OOIPU77\\Desktop\\Part B")

# Set up an empty dictionary to populate later. The dictionary should have
# the following format: {keys (doc title) : values (content of doc)}
articles_dict = {}

# Iterate through all .txt files using .glob and .stem
for txt_file in database.glob('*.txt'):    # .glob() returns every file path matching *.txt
    file_key = txt_file.stem    # .stem gives the file name without the extension
    with txt_file.open('r', encoding='utf-8') as content:    # use utf-8 so non-ASCII characters are read correctly
        txt_content = content.read()    # Read the whole file
        txt_content = txt_content.replace('\n', ' ')    # Replace newlines with spaces
        articles_dict[file_key] = txt_content    # Store the file content as the dictionary value

# Print how many articles are there to check:
print(f'Total articles imported: {len(articles_dict)}')

Total articles imported: 1000


In [2]:
# Test that the dictionary articles_dict is correctly populated
# Check for the first 20 articles and the first 30 characters
for key in list(articles_dict.keys())[:20]:
    print(f'{key}: {articles_dict[key][:30]}...')

business_1: Lufthansa flies back to profit...
business_10: Winn-Dixie files for bankruptc...
business_100: US economy still growing says ...
business_11: Saab to build Cadillacs in Swe...
business_12: Bank voted 8-1 for no rate cha...
business_13: Industrial revival hope for Ja...
business_14: Khodorkovsky ally denies charg...
business_15: China keeps tight rein on cred...
business_16: Verizon 'seals takeover of MCI...
business_17: Crossrail link 'to get go-ahea...
business_18: Small firms 'hit by rising cos...
business_19: Deutsche Boerse boosts dividen...
business_2: Japanese growth grinds to a ha...
business_20: Brewers' profits lose their fi...
business_21: Russia WTO talks 'make progres...
business_22: India's rupee hits five-year h...
business_23: Dollar drops on reserves conce...
business_24: India and Russia in energy tal...
business_25: Weak data buffets French econo...
business_26: Business fears over sluggish E...


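This confirms that the dictionary was populated correctly. The business articles appear first and the keys run business_1, business_10, business_100, business_11, so the files were read in lexicographic (string) order of their file names rather than in numeric order.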
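The order in which `Path.glob()` yields files can depend on the operating system, so it may be worth fixing it explicitly. The following is a minimal sketch, not part of the original notebook, of a natural sort that places `business_2` before `business_10`. The helper name `natural_key` and the example file names are assumptions made for illustration.

```python
import re
from pathlib import Path

def natural_key(path):
    """Split a stem such as 'business_10' into ['business_', 10, ''] so that
    the numeric part is compared as a number rather than as text."""
    parts = re.split(r'(\d+)', path.stem)
    return [int(part) if part.isdigit() else part for part in parts]

# Hypothetical file names that mirror the naming scheme seen in the output above
example_paths = [Path(name) for name in
                 ['business_10.txt', 'business_2.txt', 'business_1.txt', 'business_100.txt']]

ordered = sorted(example_paths, key=natural_key)
print([p.stem for p in ordered])
```

In the import loop, `sorted(database.glob('*.txt'), key=natural_key)` could take the place of `database.glob('*.txt')`. The retrieval results do not depend on this order; it only makes the listing easier to read and reproducible.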

##### Part 2: Tokenization, Stopwords Removal and Stemmer (Porter)

In [3]:
"""
This section preprocesses words in preparation for TF-IDF scoring
in the next section. The Natural Language Toolkit (NLTK) library is used
for word tokenization, stopword removal and Porter stemming.
"""

# Import library nltk and its related functions
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Prepare functions
stemmer = PorterStemmer()    # PorterStemmer is chosen for robustness
stop_words = set(stopwords.words('english'))    # All articles are in English

In [4]:
# Write a function that performs three steps: tokenization, stopword removal
# and stemming of all words

def preprocess_txt(i):
    
    tokens = word_tokenize(i)
    
    preprocessed_tokens = []    # Set up a list for all preprocessed word tokens to populate
    
    for token in tokens:
        token = token.lower()    # All preprocessing needs to be in lowercase
        
        if token not in stop_words and token.isalpha():    # Drop English stop words and keep only purely alphabetic tokens
            stemmedtokens = stemmer.stem(token)    # Use Porter Stemmer algorithm
            preprocessed_tokens.append(stemmedtokens)    # Add the stemmed token to the list
    
    return preprocessed_tokens

# Set up an empty dictionary to connect keys and values (the values being
# stemmed tokens related to their article titles - "keys")
preprocessed_articles = {}

# Then populate the dictionary
for key, txt_content in articles_dict.items():
    preprocessed_articles[key] = preprocess_txt(txt_content)    # Use the function above to populate the dictionary values

In [5]:
# Check if the preprocessed content is well-prepared
for key in list(preprocessed_articles.keys())[:15]:    # Check the first 15 articles, values will be shown as a list
    print(f'{key}: {preprocessed_articles[key][:25]}...')    # Check the first 25 word tokens of each article

business_1: ['lufthansa', 'fli', 'back', 'profit', 'german', 'airlin', 'lufthansa', 'return', 'profit', 'post', 'huge', 'loss', 'preliminari', 'report', 'airlin', 'announc', 'net', 'profit', 'euro', 'compar', 'loss', 'euro', 'oper', 'profit', 'euro']...
business_10: ['file', 'bankruptci', 'us', 'supermarket', 'group', 'file', 'bankruptci', 'protect', 'succumb', 'stiff', 'competit', 'market', 'domin', 'among', 'profit', 'us', 'grocer', 'said', 'chapter', 'protect', 'would', 'enabl', 'success', 'restructur', 'said']...
business_100: ['us', 'economi', 'still', 'grow', 'say', 'fed', 'area', 'us', 'saw', 'economi', 'continu', 'expand', 'decemb', 'earli', 'januari', 'us', 'feder', 'reserv', 'said', 'latest', 'beig', 'book', 'report', 'us', 'region']...
business_11: ['saab', 'build', 'cadillac', 'sweden', 'gener', 'motor', 'world', 'largest', 'car', 'maker', 'confirm', 'build', 'new', 'cadillac', 'bl', 'saab', 'factori', 'sweden', 'car', 'unveil', 'geneva', 'motor', 'show', 'intend', 'compet'

As seen in the above dictionary, word tokens have been stemmed and linked back to their article titles.
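One side effect of the preprocessing is visible in the output: the headline of business_10 starts with "Winn-Dixie", yet its token list starts with 'file' and 'bankruptci'. This is consistent with the `token.isalpha()` check, which rejects any token containing a character that is not a letter. The sketch below illustrates only that filter; the token list is hypothetical (it is not actual `word_tokenize` output), and stopword removal and stemming are left out.

```python
# Hypothetical tokens, similar to what a tokenizer might produce for a headline
tokens = ['Winn-Dixie', 'files', 'for', 'bankruptcy', ',', '2005', "'s", 'US']

# str.isalpha() is True only when every character is a letter
kept = [token.lower() for token in tokens if token.isalpha()]
dropped = [token for token in tokens if not token.isalpha()]

print('kept:   ', kept)
print('dropped:', dropped)
```

Hyphenated names, numbers and punctuation are all discarded, so queries containing figures or hyphenated terms lose those parts before matching.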

##### Part 3: TF-IDF Weighting All Words Tokens

In [6]:
"""
This section assigns TF-IDF weights to all word tokens in each article.
Because sklearn's TfidfVectorizer expects raw text, the word tokens of
each article must first be joined into a single string. The vectorizer's
fit_transform() function then works out each token weighting. To confirm,
the tf-idf vocabulary can be checked. The outcome is a sparse matrix of
token weightings, converted with .toarray() for display; however, due to
the vast amount of text data, the array cannot be fully displayed on a
screen.
"""

# First convert preprocessed articles values to strings
txt_str = {}    # Create a new dictionary to be populated with one string per article

for key, tokens in preprocessed_articles.items():    # Extract from previous dictionary
    txt_str[key] = ' '.join(tokens)    # Each value in txt_str is a single space-separated string of tokens

In [7]:
# Import sklearn library TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Prepare the function
vectorizer = TfidfVectorizer()

# Calculate the tf-idf weighting for each token in each article
tfidf_matrix = vectorizer.fit_transform(txt_str.values())

# Show the array of weightings (not all can be displayed due to the vast data)
tfidf_array = tfidf_matrix.toarray()
print("Tf-Idf representation:\n", tfidf_array)

# Just to check for word tokens in the tf-idf matrix
tfidf_vocabulary = vectorizer.get_feature_names_out()
print(f'Text vocabulary (word tokens): {tfidf_vocabulary[:100]}...')

Tf-Idf representation:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Text vocabulary (word tokens): ['aa' 'aaa' 'aaaa' 'aac' 'aan' 'aanerud' 'aangegeven' 'aantal' 'aao'
 'aaron' 'aarseth' 'aavso' 'ab' 'abacu' 'abadan' 'aban' 'abandon' 'abat'
 'abba' 'abbasi' 'abbey' 'abbott' 'abbrevi' 'abc' 'abd' 'abdel'
 'abdelaziz' 'abdelhafid' 'abdic' 'abdomen' 'abdomin' 'abdullah' 'abeb'
 'abel' 'aberr' 'aberystwyth' 'abeyi' 'abfp' 'abhin' 'abhorr' 'abi' 'abid'
 'abil' 'abington' 'abiyot' 'abl' 'ablaz' 'abn' 'abner' 'abnorm' 'aboard'
 'abolish' 'abort' 'abortionist' 'aboukir' 'abound' 'abraham' 'abraxi'
 'abroad' 'abrog' 'abrupt' 'abruptli' 'absenc' 'absent' 'absentia'
 'absolut' 'absorb' 'absorbt' 'absorpt' 'abstact' 'abstain' 'abstract'
 'absurd' 'abtahi' 'abu' 'abund' 'abundantli' 'abus' 'abut' 'abuzz'
 'abydo' 'abysm' 'abyss' 'ac' 'academ' 'academi' 'academia' 'acceler'
 'accept' 'acce

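As seen above, with 1000 articles the vocabulary runs to tens of thousands of word tokens, so neither the weight matrix nor the vocabulary can be displayed in full. The zeros shown are expected: each article contains only a small fraction of the vocabulary, so most entries in its row are 0.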
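To make the weights less opaque, the sketch below computes TF-IDF by hand on three tiny documents. It assumes the weighting that, to my understanding, `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The documents and the helper name `tfidf_vector` are hypothetical, and the numbers are for illustration only.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed documents (space-joined stems)
docs = {
    'doc_a': 'profit airlin profit loss',
    'doc_b': 'profit economi grow',
    'doc_c': 'economi grow fed',
}

n_docs = len(docs)
tokenized = {key: text.split() for key, text in docs.items()}

# Document frequency: the number of documents in which each term occurs
doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight ** 2 for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}

vectors = {key: tfidf_vector(tokens) for key, tokens in tokenized.items()}
for key, vector in vectors.items():
    print(key, {term: round(weight, 3) for term, weight in sorted(vector.items())})
```

In `doc_c`, the term 'fed' occurs in only one document and therefore receives a higher weight than 'economi', which occurs in two. For the real matrix, printing `tfidf_matrix.shape` gives a more informative summary than the mostly-zero array.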

##### Part 4: Query Function, Relevance Scoring and Ranking Retrievals

In [8]:
"""
To calculate the relevance and similarity between the query content
and the documents in the database, sklearn's cosine_similarity function
is used. Before any cosine similarity is computed, the query content must
go through the same preprocessing as the word tokens in the database,
to ensure consistency. The query is then vectorized using the fitted
TfidfVectorizer's .transform() function to prepare for cosine_similarity.

Three functions are written here:
- preprocess_query() preprocesses the query (tokenize, remove stopwords, stem)
- tfidf_query() vectorizes the query tokens and computes similarities
- rank_relevant_results() sorts the cosine similarity values in descending
order (using numpy's argsort function, reversed), which ranks the retrieval
results of the query

For the purposes of this assignment, the relevance score of each result
is displayed (a higher cosine_similarity score means a higher rank).

One issue in this project is that more than the top 3 results may be
needed to analyze the results properly. Since each article type consists of
100 articles, this project will show the top 100 results.
"""

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def preprocess_query(query):
    
    # Word tokenize all query content
    tokens = word_tokenize(query)
    
    # Stopword removal and stem tokens as previously in the database
    preprocessed_query_tokens = []    # Set up a list for all preprocessed word tokens to populate
    
    for token in tokens:
        token = token.lower()    # All preprocessing needs to be in lowercase
        
        if token not in stop_words and token.isalpha():    # Drop English stop words and keep only purely alphabetic tokens
            stemmedtokens = stemmer.stem(token)    # Use Porter Stemmer algorithm
            preprocessed_query_tokens.append(stemmedtokens)    # Add the stemmed token to the list
    
    preprocessed_query_tokens = ' '.join(preprocessed_query_tokens)    # In one function, prepare query tokens to be vectorized

    return preprocessed_query_tokens


def tfidf_query(query, vectorizer, tfidf_matrix, n=100):
    
    # Preprocess query content:
    tokenized_query = preprocess_query(query)
    
    # Transform query content using Tfidf vectorizer (.transform() function)
    vectorized_query = vectorizer.transform([tokenized_query])
    
    # Compute cosine similarities with documents in the database
    # .flatten() reduces the result to a 1-dimensional array, which makes the scores easier to sort and index
    cosine_similarities = cosine_similarity(vectorized_query, tfidf_matrix).flatten()
    
    return cosine_similarities


def rank_relevant_results(cosine_similarities, n=100):
    
    # Rank results in descending order of cosine similarity using np.argsort()
    top_hundred_relevance = np.argsort(cosine_similarities)[::-1][:n]
    
    # Build the title list once; its order matches the rows of tfidf_matrix
    titles = list(txt_str.keys())
    
    # Pair each of the top n article titles with its similarity score
    top_hundred_articles = [(titles[rank], cosine_similarities[rank]) for rank in top_hundred_relevance]
    
    return top_hundred_articles
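As a small illustration of what the scoring and ranking steps compute, the sketch below implements cosine similarity and a descending sort in plain Python. It is not the notebook's implementation (which relies on sklearn and numpy); the function name `cosine`, the document titles and all weights are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight ** 2 for weight in u.values()))
    norm_v = math.sqrt(sum(weight ** 2 for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0    # an empty query or document matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for three documents and one query
doc_vectors = {
    'sport_1': {'footbal': 0.8, 'athlet': 0.6},
    'business_1': {'profit': 0.9, 'airlin': 0.436},
    'medical_1': {'athlet': 0.5, 'injuri': 0.866},
}
query_vector = {'athlet': 0.707, 'footbal': 0.707}

scores = {title: cosine(query_vector, vector) for title, vector in doc_vectors.items()}

# Highest score first, which mirrors np.argsort(...)[::-1] in the function above
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for rank, (title, score) in enumerate(ranked, start=1):
    print(f'Article {rank}: {title}, Relevance: {score:.3f}')
```

A document that shares no terms with the query scores exactly 0, which is why results far down a top-100 list may not be matches at all.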

The three functions above will be called in one single function below. The function query_retrieval prompts users to enter search queries and the top 100 results will be listed.

In [9]:
"""
The function below aims to input search queries and run the search query
through the previous three functions to vectorize, compute similarities to
the database vectors, and display a ranked retrieval result to the user.
"""

def query_retrieval():
    # Input prompt for users
    while True:
        
        query = input("Enter your search terms here (or type 'end' to exit):")
        
        if query.lower() == 'end':
            break
        
        # Call the tfidf_query function above (which includes the preprocess_query function)
        top_cosine = tfidf_query(query, vectorizer, tfidf_matrix, n=100)
        
        # Call the rank_relevant_results function above
        top_relevant_articles = rank_relevant_results(top_cosine, n=100)
        
        # Print the top 100 relevant articles and their relevant scores
        for rank, (title, score) in enumerate(top_relevant_articles, start=1):    # enumerate is used to provide rank number based on the cosine similarity values
            print(f'Article {rank}: {title}, Relevance: {score}')

In [10]:
# Trial query 1: search for 'global economic prediction of the future'
query_retrieval()

Enter your search terms here (or type 'end' to exit):global economic prediction of the future
Article 1: business_48, Relevance: 0.15981872988731097
Article 2: business_95, Relevance: 0.11884338918950614
Article 3: business_56, Relevance: 0.1180562549235363
Article 4: politics_128, Relevance: 0.11640131812131319
Article 5: politics_21, Relevance: 0.11546592399094732
Article 6: business_71, Relevance: 0.11297651202713817
Article 7: business_96, Relevance: 0.10540175352589681
Article 8: politics_15, Relevance: 0.10323224801916313
Article 9: business_79, Relevance: 0.09581502925664231
Article 10: historical_89, Relevance: 0.087429942392281
Article 11: historical_27, Relevance: 0.08648856944078843
Article 12: business_29, Relevance: 0.08029804731964885
Article 13: politics_189, Relevance: 0.07604991694251638
Article 14: space_47, Relevance: 0.07460986811112091
Article 15: technologie_82, Relevance: 0.0742622290024628
Article 16: medical_145, Relevance: 0.0700790680552658
Article 17: sport_

As expected, many of the top results are business and politics articles, with a few historical and technology articles. The IR system has broadly captured the intent of the query.
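The category mix can be tallied rather than read by eye. The sketch below counts categories for the 16 complete titles shown in the output of trial query 1; it assumes that the part of each title before the last underscore is the article's category.

```python
from collections import Counter

# The 16 complete titles from the output of trial query 1 above
top_titles = [
    'business_48', 'business_95', 'business_56', 'politics_128',
    'politics_21', 'business_71', 'business_96', 'politics_15',
    'business_79', 'historical_89', 'historical_27', 'business_29',
    'politics_189', 'space_47', 'technologie_82', 'medical_145',
]

# Assumption: the text before the last underscore names the category
category_counts = Counter(title.rsplit('_', 1)[0] for title in top_titles)

for category, count in category_counts.most_common():
    print(f'{category}: {count}')
```

Business and politics together account for 11 of these 16 results.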

In [11]:
# Trial query 2: search for 'athlete performance improvements in football'
query_retrieval()

Enter your search terms here (or type 'end' to exit):athlete performance improvements in football
Article 1: sport_43, Relevance: 0.22714970579762211
Article 2: sport_46, Relevance: 0.2019393184742041
Article 3: sport_15, Relevance: 0.19117073944066915
Article 4: sport_27, Relevance: 0.17908106019875208
Article 5: sport_4, Relevance: 0.16966070624273114
Article 6: sport_16, Relevance: 0.1452495938960811
Article 7: sport_80, Relevance: 0.14456479318749943
Article 8: sport_30, Relevance: 0.1081868390021885
Article 9: medical_329, Relevance: 0.09738199845427029
Article 10: entertainment_4, Relevance: 0.0962061055882629
Article 11: sport_81, Relevance: 0.0895831963745092
Article 12: sport_92, Relevance: 0.08844889996298112
Article 13: sport_54, Relevance: 0.08843056317775762
Article 14: sport_59, Relevance: 0.0877641305717896
Article 15: medical_318, Relevance: 0.08736088595881253
Article 16: sport_45, Relevance: 0.08670311458892915
Article 17: entertainment_26, Relevance: 0.08160098680077

The results for search query 2 are dominated by sport articles, with some medical and entertainment articles mixed in, as sport may be linked to entertainment, business and medical topics.

##### Part 5: Evaluating the IR System

In [13]:
"""
To evaluate the IR system, first we use a search query example.
"""

# Trial query 3: search for 'medical'
query_retrieval()

Enter your search terms here (or type 'end' to exit):medical
Article 1: medical_121, Relevance: 0.36153929624303915
Article 2: medical_437, Relevance: 0.32813887217317683
Article 3: medical_102, Relevance: 0.20783802781687508
Article 4: medical_557, Relevance: 0.20143692673576621
Article 5: medical_468, Relevance: 0.19096313408410406
Article 6: medical_488, Relevance: 0.18831919202491498
Article 7: medical_327, Relevance: 0.18646907227514922
Article 8: medical_346, Relevance: 0.1688871996969261
Article 9: medical_300, Relevance: 0.1484292364094448
Article 10: medical_246, Relevance: 0.14424025071376423
Article 11: medical_186, Relevance: 0.14123612081695086
Article 12: medical_646, Relevance: 0.13937835533944043
Article 13: medical_67, Relevance: 0.13725260252158109
Article 14: medical_692, Relevance: 0.13659783558228641
Article 15: medical_608, Relevance: 0.129131425800171
Article 16: medical_319, Relevance: 0.1268857082332253
Article 17: medical_244, Relevance: 0.12445777743885442
Ar

In [16]:
"""
Adapted from Module 5, the code below is written with the intention of
creating a Confusion Matrix and reporting metrics such as Recall,
Precision and F1 Score.

This code analyzes the search query 'medical' above, which retrieves 74
articles, 64 of them relevant medical articles, out of a total of 100
relevant medical articles in the database. These counts were done
manually by the writer.
"""

# Given values
total_relevant_records = 100
retrieved_records = 74
relevant_retrieved_records = 64

# Calculate the number of relevant records not retrieved 
relevant_not_retrieved = total_relevant_records - relevant_retrieved_records

# Calculate the number of irrelevant records retrieved 
irrelevant_retrieved = retrieved_records - relevant_retrieved_records

# Confusion Matrix components
TP = relevant_retrieved_records # True Positives : Number of relevant documents retrieved
FN = relevant_not_retrieved # False Negatives : Number of relevant documents not retrieved
FP = irrelevant_retrieved # False Positives : Number of irrelevant documents retrieved

# Assuming total records in the database is 1000
total_records = 1000

# Calculate True Negatives (TN)
TN = total_records - TP - FN - FP

# Calculate recall
recall = (TP / (TP + FN)) * 100

# Calculate precision
precision = (TP / (TP + FP)) * 100

# Calculate F1-score
f1_score = 2 * (precision * recall) / (precision + recall)

# Display the confusion matrix and results
print(f"Confusion Matrix:")
print(f"                Predicted Positive   Predicted Negative")
print(f"Actual Positive    TP = {TP}                 FN = {FN}")
print(f"Actual Negative    FP = {FP}                 TN = {TN}")

print(f"\nRecall = {recall:.2f}%")
print(f"Precision = {precision:.2f}%")
print(f"F1-score = {f1_score:.2f}%")

Confusion Matrix:
                Predicted Positive   Predicted Negative
Actual Positive    TP = 64                 FN = 36
Actual Negative    FP = 10                 TN = 890

Recall = 64.00%
Precision = 86.49%
F1-score = 73.56%


Hence, for the search query 'medical', the IR system looks quite reliable, with 86.49% precision and 64.00% recall. One more search query, 'history', is tested on the IR system:

In [17]:
query_retrieval()

Enter your search terms here (or type 'end' to exit):history
Article 1: historical_69, Relevance: 0.29395066723774305
Article 2: historical_14, Relevance: 0.2547101286339654
Article 3: historical_75, Relevance: 0.11751422127418999
Article 4: historical_11, Relevance: 0.11751422127418999
Article 5: entertainment_49, Relevance: 0.0970021986364926
Article 6: historical_72, Relevance: 0.08740403946857904
Article 7: medical_13, Relevance: 0.08726601501575487
Article 8: historical_65, Relevance: 0.0864842272899049
Article 9: historical_83, Relevance: 0.08126581388069458
Article 10: historical_99, Relevance: 0.08002056334256068
Article 11: historical_5, Relevance: 0.0788452611660353
Article 12: historical_22, Relevance: 0.07451770708234658
Article 13: historical_47, Relevance: 0.07266774885258392
Article 14: politics_128, Relevance: 0.0649294019435453
Article 15: historical_80, Relevance: 0.06026484594401457
Article 16: historical_59, Relevance: 0.05910362685176081
Article 17: medical_401, Re

In [19]:
"""
This code analyzes the search query 'history' above, which retrieves 70
articles, 40 of them relevant historical articles, out of a total of 100
relevant historical articles in the database. These counts were done
manually by the writer.
"""

# Given values
total_relevant_records = 100
retrieved_records = 70
relevant_retrieved_records = 40

# Calculate the number of relevant records not retrieved 
relevant_not_retrieved = total_relevant_records - relevant_retrieved_records

# Calculate the number of irrelevant records retrieved 
irrelevant_retrieved = retrieved_records - relevant_retrieved_records

# Confusion Matrix components
TP = relevant_retrieved_records # True Positives : Number of relevant documents retrieved
FN = relevant_not_retrieved # False Negatives : Number of relevant documents not retrieved
FP = irrelevant_retrieved # False Positives : Number of irrelevant documents retrieved

# Assuming total records in the database is total_relevant_records + total_irrelevant_records
total_records = 1000

# Calculate True Negatives (TN)
TN = total_records - TP - FN - FP

# Calculate recall
recall = (TP / (TP + FN)) * 100

# Calculate precision
precision = (TP / (TP + FP)) * 100

# Calculate F1-score
f1_score = 2 * (precision * recall) / (precision + recall)

# Display the confusion matrix and results
print(f"Confusion Matrix:")
print(f"                Predicted Positive   Predicted Negative")
print(f"Actual Positive    TP = {TP}                 FN = {FN}")
print(f"Actual Negative    FP = {FP}                 TN = {TN}")

print(f"\nRecall = {recall:.2f}%")
print(f"Precision = {precision:.2f}%")
print(f"F1-score = {f1_score:.2f}%")

Confusion Matrix:
                Predicted Positive   Predicted Negative
Actual Positive    TP = 40                 FN = 60
Actual Negative    FP = 30                 TN = 870

Recall = 40.00%
Precision = 57.14%
F1-score = 47.06%


The results show much lower IR system reliability. This could be because the query term appears across a wider range of article types in the database, increasing the likelihood of search 'hits' that are not actually relevant.
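Since the two evaluation cells repeat the same arithmetic, it could be wrapped in one function. The sketch below is one possible refactoring, not the Module 5 code itself; the function name `retrieval_metrics` and its parameter names are assumptions. It is fed the manual counts reported above for both queries.

```python
def retrieval_metrics(total_relevant, retrieved, relevant_retrieved, total_records=1000):
    """Return confusion-matrix counts plus recall, precision and F1 (as fractions)."""
    tp = relevant_retrieved                    # relevant documents retrieved
    fn = total_relevant - relevant_retrieved   # relevant documents missed
    fp = retrieved - relevant_retrieved        # irrelevant documents retrieved
    tn = total_records - tp - fn - fp          # irrelevant documents not retrieved

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {'TP': tp, 'FN': fn, 'FP': fp, 'TN': tn,
            'recall': recall, 'precision': precision, 'f1': f1}

# Manual counts reported above for the two evaluated queries
medical = retrieval_metrics(total_relevant=100, retrieved=74, relevant_retrieved=64)
history = retrieval_metrics(total_relevant=100, retrieved=70, relevant_retrieved=40)

for name, result in (('medical', medical), ('history', history)):
    print(f"{name}: recall = {result['recall']:.2%}, "
          f"precision = {result['precision']:.2%}, F1 = {result['f1']:.2%}")
```

The guards against division by zero matter for queries that retrieve nothing, a case the original cells would not handle.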

##### Further Improvements

Further improvements can be researched by:
1. Using a different Stemmer (Lancaster or Snowball)
2. Evaluating the IR System with more complex queries
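For evaluating more queries without counting by hand, a rank-aware measure such as precision at k may help. The sketch below is a minimal example using the 16 complete titles from the 'history' output above. It assumes a result is relevant when its title prefix equals the target category, which is a simplification; the function name `precision_at_k` is hypothetical.

```python
# The 16 complete titles from the 'history' query output above, in rank order
ranked_titles = [
    'historical_69', 'historical_14', 'historical_75', 'historical_11',
    'entertainment_49', 'historical_72', 'medical_13', 'historical_65',
    'historical_83', 'historical_99', 'historical_5', 'historical_22',
    'historical_47', 'politics_128', 'historical_80', 'historical_59',
]

def precision_at_k(titles, category, k):
    """Fraction of the top-k titles whose prefix matches the target category."""
    top_k = titles[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for title in top_k if title.rsplit('_', 1)[0] == category)
    return hits / len(top_k)

for k in (5, 10, 15):
    print(f'P@{k} = {precision_at_k(ranked_titles, "historical", k):.2f}')
```

Measured this way, the top of the 'history' ranking is considerably more precise than the 57.14% obtained over all 70 retrieved articles, which suggests the irrelevant hits are concentrated lower in the list.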

##### End of Assignment 3 Part B