**Zipf's law** describes the relationship between the frequency of a word in a corpus and its rank in the frequency table. The law states that a word's frequency is inversely proportional to its rank: the word at rank r occurs roughly 1/r as often as the most frequent word.

Example: Pokemon cards. Suppose we have 100 Pokemon cards and line them up from most common to least common. If the cards follow Zipf's law, how often a card appears depends on its rank: the most common card appears the most, the second most common appears about half as often, the third about one third as often, and so on.
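To make the "half, one third, and so on" pattern concrete, here is a minimal sketch of the counts an idealized Zipf distribution would give. The helper `zipf_expected_counts` and the top count of 1200 are made up for illustration; real data only follows this pattern approximately.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Expected count at each rank if count(rank) = top_count / rank."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# Hypothetical example: the most common item appears 1200 times.
expected = zipf_expected_counts(top_count=1200, n_ranks=5)
for rank, count in enumerate(expected, start=1):
    print(f"rank {rank}: {count:.0f}")
```

The second rank gets half of the top count, the third gets one third, and so on.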

In [None]:
from huggingface_hub import hf_hub_download
import pandas as pd
import matplotlib.pyplot as plt

# Download the dataset of Medium articles from Hugging Face Hub
df_articles = pd.read_csv(
    hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                    filename="medium_articles.csv")
)


df_articles = df_articles[:50000]

# Tokenize the text into individual words
tokenized_words = df_articles["text"].str.lower().str.split()

# Count the frequency of each word
word_counts = {}
for words in tokenized_words:
    for word in words:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

# Sort the words by their frequencies in descending order
sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

In [None]:
sorted_words
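One way to check whether a ranked frequency list looks Zipf-like is to fit a straight line to log(count) against log(rank): under the ideal law the slope is close to -1. The sketch below is only an illustration on synthetic counts, and `loglog_slope` is a hypothetical helper, not part of the notebook above.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` must be sorted in descending order and contain only positive values.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic counts that follow the ideal law exactly: count = 1000 / rank.
ideal_counts = [1000 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal_counts)
print(round(slope, 3))
```

With the notebook's data, the same function could be called as `loglog_slope([count for _, count in sorted_words])`. Real corpora usually deviate from -1, especially at the very top and in the long tail of rare words.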

**TF-IDF**

Term Frequency (TF) counts how many times a specific word appears in a single document. Inverse Document Frequency (IDF) measures how rare a word is across the whole corpus: the fewer documents contain the word, the higher its IDF.

Example: TF-IDF helps us find words that are important in a specific book but not very common across all the books.

1. Zipf's law explains the overall distribution of word frequencies in a corpus.
2. TF-IDF focuses on measuring the importance of words within individual documents.
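The idea can be sketched by hand on a toy corpus. This uses the textbook formula tf × log(N / df), which is an assumption for illustration: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ. The documents and the `tf_idf` helper are made up.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]


def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF; assumes `term` occurs in at least one document."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)        # share of the document
    df = sum(1 for tokens in all_docs if term in tokens)    # documents containing the term
    idf = math.log(len(all_docs) / df)                      # rarer term -> larger idf
    return tf * idf


score_the = tf_idf("the", tokenized[0], tokenized)  # "the" is in every document
score_mat = tf_idf("mat", tokenized[0], tokenized)  # "mat" is only in the first one
print(score_the, score_mat)
```

Although "the" is the most frequent word in the first document, it scores zero because it appears in every document, while the rarer "mat" gets a positive score.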

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# apply the TfidfVectorizer to the corpus
corpus = df_articles["text"]
vectorizer = TfidfVectorizer()
# Fit on the corpus and build the TF-IDF matrix: each row is an article,
# each column is a token, and each entry is that token's TF-IDF score in that article.
corpus_vectorized = vectorizer.fit_transform(corpus)
print(corpus_vectorized.shape)

(50000, 288140)


In [None]:
# vectorize query
query = "natural language processing"
# TF-IDF vector for the query. transform() reuses the vocabulary and IDF weights
# already learned from the corpus, so nothing is re-fitted.
query_vectorized = vectorizer.transform([query])
print(query_vectorized.shape)

(1, 288140)


In [None]:
(corpus_vectorized.transpose()).shape

(288140, 50000)

In [None]:
# compute scores as the dot product
# Multiply the query vector with every document vector in the corpus matrix.
# Each resulting score measures how similar the query is to one document.
# TfidfVectorizer L2-normalizes its rows by default, so these dot products are cosine similarities.
scores = query_vectorized.dot(corpus_vectorized.transpose())
scores_array = scores.toarray()[0]
print(scores_array.shape)

(50000,)
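Why a plain dot product works as a similarity score: once two vectors are scaled to unit length, their dot product equals their cosine similarity. The sketch below checks this on two small made-up vectors. The helper names are hypothetical and the vectors stand in for TF-IDF rows.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors standing in for a query row and a document row.
query_vec = [1.0, 2.0, 0.0]
doc_vec = [2.0, 1.0, 2.0]

dot_of_normalized = dot(l2_normalize(query_vec), l2_normalize(doc_vec))
cosine_similarity = cosine(query_vec, doc_vec)
print(dot_of_normalized, cosine_similarity)
```

Both values agree up to floating-point rounding, which is why no separate cosine step is needed when the TF-IDF rows are already unit length.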


In [None]:
# retrieve the top_n articles with the highest scores and show them
def show_best_results(df_articles, scores_array, top_n=10):
    sorted_indices = scores_array.argsort()[::-1]
    for position, idx in enumerate(sorted_indices[:top_n]):
        row = df_articles.iloc[idx]
        title = row["title"]
        score = scores_array[idx]
        print(f"{position + 1} [score = {score}]: {title}")

show_best_results(df_articles, scores_array)

1 [score = 0.29307717929917204]: What the “Women in Language” Conference 2020 Taught Me
2 [score = 0.2691499776299288]: Sentiment Analysis with Logistic Regression (Part 1)
3 [score = 0.2682546479961668]: The Story of how Natural Language Processing is changing Financial Services in 2020
4 [score = 0.2666674603843819]: List of free resources to learn Natural Language Processing
5 [score = 0.2659237610875087]: To Expand Your Horizons, You Must Grow Your Language
6 [score = 0.2538403402775664]: Popular Python Libraries in NLP: Dealing with Language Detection, Translation & Beyond!
7 [score = 0.25119371362258425]: Natural Gas Is Dirtier than Coal
8 [score = 0.24682913805000956]: How people wade through one of the top trending technology (artificial intelligence)
9 [score = 0.2399478036375346]: On (Programming) Language Design
10 [score = 0.2347842942649893]: What is Programming? How to get Started?


In [None]:
# The dot product measures the similarity between the query and each document in the corpus.