# Libraries
Let's import the libraries we will need in this project.

In [1]:
from huggingface_hub import hf_hub_download

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer



# Download the dataset
We can now download the dataset of Medium articles from the Hugging Face Hub.

In [2]:
# download the dataset of Medium articles from the Hugging Face Hub

df_articles = pd.read_csv(
    hf_hub_download(
        'fabiochiu/medium-articles',
        repo_type='dataset',
        filename='medium_articles.csv',
    )
)

There are 192,368 articles in total, but let's sample 10,000 of them to make computations faster.

In [3]:
df_articles = df_articles.sample(10000)

In [4]:
df_articles.head()

Unnamed: 0,title,text,url,authors,timestamp,tags
60716,We Mapped How the Coronavirus Is Driving New S...,"As of March 25, the Ministry of Health had bui...",https://onezero.medium.com/the-pandemic-is-a-t...,['Dave Gershgorn'],2020-05-22 15:10:30.160000+00:00,"['Privacy', 'Coronavirus', 'Surveillance', 'Fa..."
75637,My day with her ________________ At the early ...,My day with her\n\n________________\n\nAt the ...,https://medium.com/@sheriffdeenbnhamzah1999/my...,[],2020-12-24 06:04:27.124000+00:00,['Story']
55104,Cross-border arbitrage strategies,This time we continue our series of articles w...,https://medium.com/hyperquant/cross-border-arb...,[],2018-10-17 16:47:27.788000+00:00,"['Algorithmic Trading', 'Bitcoin', 'Cryptocurr..."
120691,Latest Youth STEM Competition 最新青少年科技竞赛信息 (11/22),有关青少年的最新科技竞赛信息 （K-12)，每周更新\n\n最新更新 2020年11月22日...,https://medium.com/@youthinnolab/latest-youth-...,['Youth Innovation Lab'],2020-11-23 04:44:38.491000+00:00,"['STEM', 'Coding', 'Youth', 'Competition']"
157713,5 Best Food Places In Italy | Viaggi Finti,Does eating brings you pleasure? Food is not o...,https://medium.com/@viaggifintishedirpharma/5-...,['Viaggi Finti Shedir Pharma'],2020-05-20 18:33:52.483000+00:00,"['Viaggi Finti', 'Viaggi Finti Shedirpharma', ..."


# Using the TfidfVectorizer

Let's create a TfidfVectorizer object and call its fit_transform method on our corpus. Fitting the vectorizer learns the vocabulary and the IDF weight of each token; the transform step then computes the TF-IDF score of each token in each article.

As a result, the corpus_vectorized variable is a SciPy sparse matrix with 10k rows (one row for each article) and roughly 120k columns (one column for each token found in the corpus). The exact number of columns depends on which articles were sampled.

In [None]:
# apply the TfidfVectorizer to corpus
vectorizer = TfidfVectorizer()
corpus = df_articles['text']

corpus_vectorized = vectorizer.fit_transform(corpus)
print(corpus_vectorized.shape) #(10000, 121820)

(10000, 121820)
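To see what fit_transform produces at a scale that can be checked by eye, here is a minimal sketch on a three-document toy corpus. The corpus and the toy_ variable names are made up for illustration and are not part of the Medium dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, only for illustration
toy_corpus = [
    "data science is fun",
    "learn data science online",
    "cooking is fun",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)

# one row per document, one column per distinct token
print(toy_matrix.shape)  # (3, 7)

# vocabulary_ maps each token to its column index
print(sorted(toy_vectorizer.vocabulary_))

# dense view of the TF-IDF scores (only sensible for tiny corpora)
print(toy_matrix.toarray().round(2))
```

On the real corpus the matrix stays sparse, because each article contains only a small fraction of the vocabulary.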


We can then reuse the vectorizer with the transform method to compute the TF-IDF values of the tokens in the query.

In [None]:
# vectorize the query

query = 'learn data science'

query_vectorized = vectorizer.transform([query])
print(query_vectorized.shape) #(1, 121820)



(1, 121820)
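One consequence of reusing the fitted vectorizer is that query tokens absent from the learned vocabulary are silently ignored. The sketch below illustrates this on a hypothetical two-document corpus; the corpus, the queries, and the variable names are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus, only for illustration
toy_corpus = ["learn data science", "data engineering basics"]
toy_vectorizer = TfidfVectorizer().fit(toy_corpus)

known = toy_vectorizer.transform(["learn data"])
partly_unknown = toy_vectorizer.transform(["learn quantum data"])
unknown = toy_vectorizer.transform(["quantum cryptography"])

# nnz is the number of stored non-zero entries in the sparse vector
print(known.nnz)           # 2
print(partly_unknown.nnz)  # 2, "quantum" is not in the vocabulary
print(unknown.nnz)         # 0, the query maps to an all-zero vector
```

A query made up entirely of unseen tokens therefore scores 0 against every article.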


Now, both the query and each article have been mapped to vectors of TF-IDF scores with the same dimensions.

# Compute Similarities between Queries and Articles

Next, we compute the similarity between the query vector and each article vector by multiplying query_vectorized with the transpose of corpus_vectorized. The result is an array of 10k elements, where each element is the score of one article.

In [7]:
# compute scores as the dot product between the query vector and the document vectors
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
print(scores_array.shape)

(10000,)


There are multiple measures to choose from for comparing two vectors, such as:
1. Dot product
2. Cosine similarity
3. Euclidean distance (a distance rather than a similarity, so smaller values mean closer vectors)
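With TfidfVectorizer's default settings (norm='l2'), every row is scaled to unit length, so the dot product used above coincides with cosine similarity, and Euclidean distance yields the same ranking. The following sketch checks this on a hypothetical toy corpus; the corpus and the toy_ names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# hypothetical toy corpus, only for illustration
toy_corpus = [
    "top free courses to learn data science",
    "a beginner guide to cooking pasta",
    "is data science a science",
]
toy_vectorizer = TfidfVectorizer()  # norm='l2' is the default
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)
toy_query = toy_vectorizer.transform(["learn data science"])

dot_scores = toy_query.dot(toy_matrix.transpose()).toarray()[0]
cos_scores = cosine_similarity(toy_query, toy_matrix)[0]
euclid = euclidean_distances(toy_query, toy_matrix)[0]

# for unit-length vectors: dot == cosine, and ||a - b||^2 == 2 - 2 * (a . b)
print(np.allclose(dot_scores, cos_scores))           # True
print(np.allclose(euclid ** 2, 2 - 2 * dot_scores))  # True
```

If the vectorizer were created with norm=None, the three measures would no longer agree, and longer articles would tend to get larger dot products.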

# Show Results

Now we just have to find the indices of scores_array with the highest scores, find their corresponding articles in df_articles and show them.

In [8]:
# retrieve the top_n articles with the highest scores and show them
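def show_best_results(df_articles, scores_array, top_n=10):
    # sort article positions by score, highest first
    sorted_indices = scores_array.argsort()[::-1]
    for position, idx in enumerate(sorted_indices[:top_n]):
        # idx is a row position in corpus_vectorized, so use positional iloc (not loc)
        row = df_articles.iloc[idx]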
        title = row['title']
        score = scores_array[idx]
        print(f"{position+1}. (score: {score:.4f}) {title}")

show_best_results(df_articles, scores_array, top_n=10)

1. (score: 0.6153) Top 8 free courses to learn data science
2. (score: 0.5970) How to Transition to Data Science from Computer Science?
3. (score: 0.5936) Is Data Science a science?
4. (score: 0.5175) 5 Data Science Podcasts You Should be Listening To
5. (score: 0.5016) Who is a Data Scientist?
6. (score: 0.4875) Data Science for Business
7. (score: 0.4592) Timeline for Data Science Competence
8. (score: 0.4493) AI+ Subscription Content Available Right Now
9. (score: 0.4487) Highly Effective Data Science Teams
10. (score: 0.4453) A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist
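Sorting all 10k scores is cheap here, but on a much larger corpus a full sort is more work than needed when only the top few results are shown. As an optional variation, the sketch below selects the top_n indices with np.argpartition and sorts only those. The helper name top_n_indices is hypothetical and not part of the original notebook.

```python
import numpy as np

def top_n_indices(scores, top_n=10):
    """Return the indices of the top_n highest scores, best first."""
    top_n = min(top_n, len(scores))
    if top_n == 0:
        return np.array([], dtype=int)
    # argpartition moves the top_n largest scores to the end, in arbitrary order
    candidates = np.argpartition(scores, -top_n)[-top_n:]
    # sort only those candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]

example_scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(top_n_indices(example_scores, top_n=3))  # [1 3 4]
```

The returned indices are row positions, so they can be passed to df_articles.iloc in the same way as sorted_indices above.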


The results are okay, but let's also try a query that contains some stopwords.

In [10]:
# try a different query

query = 'how to learn data science with nlp'

query_vectorized = vectorizer.transform([query])
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
show_best_results(df_articles, scores_array, top_n=10)

1. (score: 0.5943) What in the “Hello World” is Natural Language Processing (NLP)?
2. (score: 0.3998) How to Transition to Data Science from Computer Science?
3. (score: 0.3986) Top 8 free courses to learn data science
4. (score: 0.3899) Is Data Science a science?
5. (score: 0.3547) AI+ Subscription Content Available Right Now
6. (score: 0.3325) Data Science for Business
7. (score: 0.3271) 5 Data Science Podcasts You Should be Listening To
8. (score: 0.3264) Who is a Data Scientist?
9. (score: 0.3190) Timeline for Data Science Competence
10. (score: 0.3022) A Layman’s Guide to Data Science: How to Become a (Good) Data Scientist


The results are still relevant to the query. Even without filtering stopwords, the TF-IDF weighting gives low importance to common words and high importance to rare words, which produces a similar effect.
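This effect can be inspected directly through the fitted vectorizer's idf_ attribute. With scikit-learn's default smooth_idf=True, the weight of a token is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the token. The sketch below uses a hypothetical four-document corpus; the corpus and the idf_of helper are assumptions for illustration.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus: "how" and "to" appear in every document
toy_corpus = [
    "how to learn data science",
    "how to cook pasta",
    "how to use nlp for text",
    "how to travel cheaply",
]
toy_vectorizer = TfidfVectorizer().fit(toy_corpus)

def idf_of(token):
    """Look up the learned IDF weight of a token (hypothetical helper)."""
    return toy_vectorizer.idf_[toy_vectorizer.vocabulary_[token]]

for token in ["how", "to", "nlp", "pasta"]:
    print(f"{token}: {idf_of(token):.3f}")
# how: 1.000
# to: 1.000
# nlp: 1.916
# pasta: 1.916

# the same value computed by hand for "nlp" (n = 4 documents, df = 1)
expected_nlp = math.log((1 + 4) / (1 + 1)) + 1
```

Tokens that occur in every document get the minimum weight of 1, while tokens that occur in a single document weigh almost twice as much. If explicit stopword removal is still wanted, TfidfVectorizer also accepts stop_words='english'.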