__Author:__ Soheil Esmaeilzadeh <br /> 
__Date:__ November 6th, 2020 <br /> 

_________________


## Code Exercise

#### Write a program that takes a Wikipedia article as input and recommends a sequence of 10 articles to read.

__Input:__ Weblink to a Wikipedia article. <br /> 
__Output:__ List of 10 weblinks to 10 wikipedia articles.

* Notes:  
 * You can download all the wikipedia articles as a file on web
 * The problem is intentionally vague, please use your judgment 
 * You can use any programming language, or library that you want 
 * You don’t need to design a GUI, an application in terminal is enough.
 * You have 3 days to work on but we expect you spend less than 4 hours on it
 
__Example:__ <br />
Input: https://en.wikipedia.org/wiki/Snake <br />
Output: 10 wikipedia links e.g. https://en.wikipedia.org/wiki/Green_anaconda and 9 more links <br />

_________________

## Text Recommendation Framework

### Recommendation Approach
Here I build a recommender system based on the basic principles of the content-based recommendation approach [[1](https://link.springer.com/chapter/10.1007/978-3-540-72079-9_10)]. In summary, a content-based recommender system uses the features of products to recommend other products similar to those a user has liked, purchased, or used.

### Dataset Gathering and Cleaning
Downloading all Wikipedia articles from the website, as you have suggested, comes to 78 GB of data when unzipped. So, as the source data, I instead gather a subset of articles from Wikipedia. To gather these articles I use the __wikipedia API__ for Python, which can be accessed through [this link](https://pypi.org/project/wikipedia/). It can be installed with pip, the standard package-management system for Python, using the command <code>pip install wikipedia</code>. In this work I gather a random set of __ONLY__ <code>num_random_articles=25000</code> Wikipedia articles. With more time and resources, I could have gathered more articles to enlarge the corpus pool further.

As a very simple cleaning step I get rid of non-contextual components in the text, e.g. '\n'. For each article I gather the whole summary provided by the wikipedia API. Such a summary usually consists of the topmost chunk of text that you see when you open a Wikipedia page.
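
The cleaning step can be sketched as a small helper. This is only a minimal sketch and not the exact code used in the cells below: the name `clean_summary` is hypothetical, and replacing line breaks with a space (instead of deleting them) is an assumption made here so that neighbouring sentences are not glued together.

```python
import re

def clean_summary(text: str) -> str:
    """Remove non-contextual characters from a summary (hypothetical helper)."""
    # Replace line breaks with a space so neighbouring sentences stay separated.
    text = text.replace("\n", " ")
    # Drop leftover section-heading markers such as "== Career ==".
    text = text.replace("=", "")
    # Collapse runs of whitespace into a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_summary("He was a businessman.\nBorn in  Queens.\n\n== Career =="))
```

Collapsing whitespace with a regular expression handles any run of spaces, whereas a single `replace("  ", " ")` only shortens each run by one pass.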

### Featurization

To generate an encoded, feature-like representation of each article, I create __vector__ representations using the __TF-IDF approach__. Calling each article a 'document' and the whole set of articles the 'corpus', TF-IDF provides a metric that captures both the Term Frequency (TF) within a document and the Inverse Document Frequency (IDF) across the corpus. TF is the frequency with which a word appears in a document, and IDF is the (typically log-scaled) inverse of the fraction of documents in the corpus that contain that word. In other words, TF measures how often a word appears within a specific document, while IDF measures how rare the word is across the corpus. In particular, TF-IDF suppresses the dominance (over-influence) of high-frequency words when determining word importance. Put simply, a word is considered important in a document if it occurs often in that document but rarely in the other documents of the corpus. For further reading about TF-IDF please refer to Refs. [[2](https://dl.acm.org/doi/abs/10.1145/1361684.1361686),[3](https://ieeexplore.ieee.org/abstract/document/7754750/)].
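
To make the TF and IDF definitions concrete, here is a small worked example in plain Python. It uses the textbook form `tf * log(N / df)` on a made-up three-document corpus; the corpus and the helper name `tf_idf` are illustrative assumptions. scikit-learn's vectorizer applies a smoothed and normalized variant, so its numbers will differ from the ones computed here.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from the gathered articles).
corpus = [
    "the snake eats the mouse",
    "the election was held in november",
    "the president won the election",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word: str, doc: list) -> float:
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = doc.count(word) / len(doc)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

# Print the score of each word in each of the three documents.
for word in ("the", "snake", "election"):
    print(word, [round(tf_idf(word, doc), 3) for doc in docs])
```

The word "the" occurs in every document, so its IDF is log(1) = 0 and it scores zero everywhere, while "snake" occurs in only one document and receives the highest score there.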

Accordingly, I create a feature vector for each Wikipedia article using the TF-IDF approach. The TF-IDF vectorizer is available in the scikit-learn Python package through [this link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). I convert all characters to lowercase before tokenizing, remove English stop words, use word-level n-grams (<code>n_gram_range=(1,5)</code>) rather than character n-grams, and keep the top <code>num_tfidf_features=1000</code> terms ordered by term frequency across the corpus as the TF-IDF features.
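
The vectorizer settings described above can be tried out on a toy corpus first. The sketch below is illustrative only: the three sentences are made up, and the small `ngram_range` and `max_features` values are chosen for readability, whereas the parameter cell further down sets `(1,5)` and `1000`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
toy_corpus = [
    "The president of the United States won the election.",
    "The senate election was held in the United States.",
    "The green anaconda is a snake found in South America.",
]

vectorizer = TfidfVectorizer(
    lowercase=True,        # convert to lowercase before tokenizing
    analyzer="word",       # word-level features, not character n-grams
    stop_words="english",  # drop common English words such as "the"
    ngram_range=(1, 2),    # unigrams and bigrams for this toy example
    max_features=10,       # keep only the 10 most frequent terms
)
matrix = vectorizer.fit_transform(toy_corpus)

print(matrix.shape)                    # (number of documents, number of kept terms)
print(sorted(vectorizer.vocabulary_))  # the terms that survived the max_features cut
```

Inspecting `vocabulary_` on a small corpus is a quick way to check that stop words are gone and that multi-word terms such as "united states" are picked up as features.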

### Measure of Similarity

To identify the articles within the gathered corpus that are similar to an input article, I use 'Cosine Similarity' as the similarity measure between the TF-IDF vectors that characterize each Wikipedia article. I compute the cosine similarity between the TF-IDF vector of the input article and the TF-IDF vectors of all the articles in the corpus, and then recommend the top <code>num_recommended_articles=10</code> articles to the user.
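
The ranking step can be sketched with NumPy alone. This is a minimal sketch under stated assumptions: the helper name `cosine_sim` and the toy vectors are hypothetical, and the notebook itself relies on scikit-learn's `cosine_similarity` on sparse TF-IDF matrices rather than on this function.

```python
import numpy as np

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of a matrix."""
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against all-zero vectors, whose similarity is defined here as 0.
    return np.divide(dots, norms, out=np.zeros_like(dots, dtype=float), where=norms > 0)

# Toy feature vectors (illustrative values, not real TF-IDF output).
corpus_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 0.1, 0.0],
])
query_vector = np.array([1.0, 0.0, 0.0])

scores = cosine_sim(query_vector, corpus_vectors)
top_k = 2
# argsort is ascending, so reverse it and keep the first top_k indices.
top_indices = np.argsort(scores)[::-1][:top_k]
print(top_indices, scores[top_indices])
```

Because cosine similarity depends only on the angle between vectors, a long and a short summary with the same word proportions score as identical, which is one reason it is a common choice for TF-IDF features.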

---

#### Importing/installing the required libraries and packages

In [1]:
# installing the required libraries and packages
# use the below command to install any missing packages, e.g. 'wikipedia' API package
# _=pip3 install wikipedia

# importing the required libraries and packages
import os
import pickle
import wikipedia
import numpy as np
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer as tfidf_vectorizer
from sklearn.metrics.pairwise import cosine_similarity
import warnings
warnings.filterwarnings('ignore') # warning suppression

#### Setting the input parameters provided by the user

In [33]:
num_random_articles       = 25000      # number of wikipedia articles to be gathered as the recommendation corpus pool
load_or_download_articles = 'download' # choices: {download,load}
data_folder               = "./data"   # directory folder for saving/loading the .pkl file of wikipedia articles
num_tfidf_features        = 1000        # number of tfidf features
input_article_link        = "https://en.wikipedia.org/wiki/Donald_Trump"  # input article for recommendation
num_recommended_articles  = 10
n_gram_range              = (1,5)


#### Gather random wikipedia articles (titles)

In [3]:
if not os.path.exists(data_folder):
    os.makedirs(data_folder)
        
if load_or_download_articles == 'download':
    random_articles_titles=[s for i in [wikipedia.random(500) for i in range(int(num_random_articles//500))] for s in i]

    with open(data_folder+'/random_articles_titles.pkl', 'wb') as file:
        pickle.dump(random_articles_titles, file)
        
elif load_or_download_articles == 'load':
    with open(data_folder+'/random_articles_titles.pkl', 'rb') as file:
        random_articles_titles = pickle.load(file)

print("number of random article titles:", len(set(random_articles_titles)))


number of random article titles: 25000


#### Read the summary text for the wikipedia titles randomly gathered above

In [4]:
all_titles = random_articles_titles

if load_or_download_articles == 'download':
    titles_content_map = dict()
    counter = 0
    for title_num in range(len(all_titles)):  
        title = all_titles[title_num]
        try:
            summary_text = wikipedia.summary(title)
        except Exception:  # skip titles whose summary cannot be fetched (e.g. disambiguation or missing pages)
            continue
        counter+=1
        if counter%500==0: 
            print("="*15)
            print("{} articles have been loaded!".format(counter))  
            print("most recent sample title is:",title)
            print("="*15)
        summary_text_processed = summary_text.replace("\n","").replace("=","").replace("/","").replace("  "," ")
        titles_content_map[title] = summary_text_processed
        
    with open(data_folder+'/titles_content_map.pkl', 'wb') as file:
        pickle.dump(titles_content_map, file)
    
elif load_or_download_articles == 'load':
    with open(data_folder+'/titles_content_map.pkl', 'rb') as file:
        titles_content_map = pickle.load(file)
        
        

500 articles have been loaded!
most recent sample title is: The Power of Buddhism
1000 articles have been loaded!
most recent sample title is: Amorbia effoetana
1500 articles have been loaded!
most recent sample title is: Gregory Fernando Pappas
2000 articles have been loaded!
most recent sample title is: Valdosta Millionaires
2500 articles have been loaded!
most recent sample title is: Marcello Visconti di Modrone
3000 articles have been loaded!
most recent sample title is: Édouard Cissé
3500 articles have been loaded!
most recent sample title is: Reign of Fire
4000 articles have been loaded!
most recent sample title is: Mahonia Hall
4500 articles have been loaded!
most recent sample title is: Cleora scriptaria
5000 articles have been loaded!
most recent sample title is: Rodney McLeod
5500 articles have been loaded!
most recent sample title is: Listed buildings in Jersey
6000 articles have been loaded!
most recent sample title is: William Wilson (artist)
6500 articles have been loaded

#### Creating the feature vectors of the corpus articles using TF-IDF approach

In [34]:
all_titles_contents = list(titles_content_map.values())
all_titles = list(titles_content_map.keys())

# input="content" (the default) means the items passed to fit_transform are the raw strings themselves
vectorizer = tfidf_vectorizer(input="content", lowercase=True, 
                              stop_words="english", ngram_range=n_gram_range, max_features=num_tfidf_features)

all_titles_contents_matrix = vectorizer.fit_transform(all_titles_contents)

print("number of random article titles after processing:", len(titles_content_map.keys()))
print("number of tfidf vectorized elements:", len(vectorizer.get_feature_names()))

sim_unigram=cosine_similarity(all_titles_contents_matrix)


number of random article titles after processing: 23578
number of tfidf vectorized elements: 1000


#### Featurizing the input article and recommending num_recommended_articles=10 articles similar to that

In [35]:
wiki_article_link_format = input_article_link.split("/").copy()
input_article_title = input_article_link.split('/')[-1]
print("="*100, "\nThe title of the input article is: {}".format(input_article_title))
print("-"*100, "\nThe link for the input article is:{}".format(input_article_link))
summary_text = wikipedia.summary(input_article_title)
summary_text_processed = summary_text.replace("\n","").replace("=","").replace("  "," ")
print("-"*100,"\nDownloaded summary of the input article is:\n\n{}".format(summary_text_processed))

summary_text_processed_vec = vectorizer.transform([summary_text_processed])
summary_text_processed_decoded = [vectorizer.get_feature_names()[i] 
                                  for i in range(len(vectorizer.get_feature_names())) 
                                  if summary_text_processed_vec.todense()[0,i]!=0.0]

print("-"*100,"\nDecoded TF-IDF form of the input article summary is:\n\n{}".format(summary_text_processed_decoded))

cos_sim = cosine_similarity(summary_text_processed_vec,all_titles_contents_matrix)
rec_titles_elems = cos_sim.argsort()[0][-num_recommended_articles:][::-1]
cos_sim.sort()
rec_titles_cos_sim = cos_sim[0][-num_recommended_articles:][::-1]
rec_titles = [all_titles[elem] for elem in rec_titles_elems]
print("="*100)
print("Titles of recommended articles: \n\n{}".format(rec_titles))

print("="*100)
for i in range(len(rec_titles_elems)):
    rec_title_elem = rec_titles_elems[i]
    print("Recommended article no. {} \n".format(i+1))
    print("Title:\n>> {}\n".format(rec_titles[i]))
    link = "/".join(wiki_article_link_format[:-1]+["_".join(rec_titles[i].split(" "))])
    print("Link:\n>> {}\n".format(link))
    rec_title_vec = all_titles_contents_matrix[rec_title_elem][:]
    rec_title_decoded = [vectorizer.get_feature_names()[i] 
                         for i in range(len(vectorizer.get_feature_names())) 
                         if rec_title_vec.todense()[0,i]!=0.0]
    print("Decoded TF-IDF form of the recommended article: \n>> {}\n".format(rec_title_decoded))
    # print("Cosine Similarity:\n>> {}\n".format(rec_titles_cos_sim[i]))
    print("="*100)




The title of the input article is: Donald_Trump
---------------------------------------------------------------------------------------------------- 
The link for the input article is:https://en.wikipedia.org/wiki/Donald_Trump
---------------------------------------------------------------------------------------------------- 
Downloaded summary of the input article is:

Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.Born and raised in Queens, New York City, Trump attended Fordham University for two years and received a bachelor's degree in economics from the Wharton School of the University of Pennsylvania. He became president of his father's real estate business in 1971, renamed it The Trump Organization, and expanded its operations to building or renovating skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licens

Decoded TF-IDF form of the recommended article: 
>> ['28', '30', 'administration', 'africa', 'american', 'announced', 'april', 'army', 'best', 'best known', 'border', 'born', 'britain', 'capital', 'city', 'congress', 'countries', 'county', 'defeated', 'democratic', 'died', 'election', 'empire', 'era', 'established', 'european', 'family', 'father', 'fifth', 'financial', 'florida', 'following', 'force', 'foreign', 'france', 'generally', 'george', 'good', 'governor', 'independent', 'james', 'john', 'joined', 'july', 'king', 'known', 'later', 'law', 'leader', 'left', 'member', 'named', 'national', 'new', 'new york', 'new york city', 'north', 'party', 'policy', 'political', 'president', 'republican', 'secretary', 'senate', 'served', 'signed', 'society', 'spain', 'spanish', 'special', 'state', 'states', 'thomas', 'united', 'united states', 'virginia', 'war', 'washington', 'western', 'won', 'york', 'york city']

Recommended article no. 5 

Title:
>> 1836 United States presidential election in

---

## Discussion

Above, as an example, I have considered the Wikipedia page of [Donald Trump](https://en.wikipedia.org/wiki/Donald_Trump) as the input. As printed above, terms such as "president, election, united states, america, new york city, democratic, political, government, senate, congress, elected" happen to characterize the feature vector of the input article 'Donald_Trump'.

Among the top 10 recommended articles we see, for instance, the Wikipedia pages *'Bobby Jindal 2016 presidential campaign', '1996 United States Senate special election in Kansas', 'June 2015 Justice and Development Party election campaign', 'James W. Monroe', '1836 United States presidential election in Missouri', '2002 United States Senate election in Minnesota', '1952 United States presidential election in Louisiana', 'Gary Warren', '1980 presidential election', '1992 United States presidential election in Idaho'*, which all relate to the United States, elections, US politics, and presidential matters. Moreover, for each recommended article I have also printed the corresponding (dense and decoded) feature vector.

Accordingly, the proposed recommendation system performs reasonably well. To improve it further and make it more robust, we can consider the options listed in the __Future Work__ section below.

---

## Future Work

#### There are ways to further improve the presented recommendation platform. Below I outline a few potential directions:

1. Increase the number of collected articles for the recommendation pool
 * Above I have used <code>num_random_articles=25000</code> Wikipedia articles as the corpus pool for my recommendation system. Adding more articles should make it more likely that close matches exist for a given input article.
 
 
2. Gather articles (topics) relevant to the recommendation context, e.g. whether the recommendation context is mostly politics, sports, food, etc.
 * This could be especially helpful if we are building the recommendation system for a specific application or intended audience. For instance, if we intend to build a recommendation system for chemists, we had better have as many chemistry-related articles as possible in our corpus pool.
 
 
3. Try alternative featurization techniques/variations. A couple of ideas could be:
   * Tuning the length of n-grams that we use for featurizing the articles - above I have picked n-grams of n=1 to n=5 as <code>n_gram_range=(1,5)</code>.
   * Above I have featurized each Wikipedia article using its summary text. An alternative would be to use pre-trained word vectors such as 'word2vec' to create vectorized features of only the __Titles__ instead of the body text, and to experiment to see whether any improvement can be achieved. 
   * Using abstractive text summarization models to summarize the Wikipedia articles first, and then using those summarized texts to create the feature vectors. See for instance my work on how text summarization could help with text understanding and classification - [[4.] Soheil Esmaeilzadeh et al. - Neural Abstractive Text Summarization and Fake News Detection](https://arxiv.org/abs/1904.00788).
   * Above I have used <code>num_tfidf_features=1000</code> features. Tuning the number of features affects the absolute value of the cosine similarities as well as the amount of information extracted from the articles. An application-specific tuning of this parameter could further improve the recommendation framework.
   
   
4. In addition to looking at the contents of Wikipedia pages, we can include some metadata associated with each article, e.g. topic category, year of publication, length, language, etc.


5. We can tune (limit) the number of sentences extracted from the summary of each Wikipedia page, and so avoid the potential bias that might arise if the summaries of some articles end up being longer (or shorter) than others.
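
As an illustration of the last idea, the sketch below truncates a summary to its first few sentences before featurization. It is a rough sketch only: the helper `first_sentences` is hypothetical, and the sentence splitting is deliberately naive. The `summary` function of the `wikipedia` package also accepts a `sentences` argument, which may be a simpler route, although I have not compared the two.

```python
import re

def first_sentences(summary: str, max_sentences: int = 3) -> str:
    """Keep only the first `max_sentences` sentences of a summary (hypothetical helper)."""
    # Naive split on ., ! or ? followed by whitespace. Abbreviations such as
    # "U.S. " are split incorrectly, so this is only a rough approximation.
    sentences = re.split(r"(?<=[.!?])\s+", summary.strip())
    return " ".join(sentences[:max_sentences])

text = "Snakes are reptiles. They have no legs. Many are harmless. Some are venomous."
print(first_sentences(text, max_sentences=2))
```

Capping every summary at the same number of sentences gives each article a comparable amount of text, at the cost of discarding detail from the longer summaries.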

