Given four labeled training documents:

D1 (sports): China soccer

D2 (sports): Japan baseball

D3 (politics): China trade 

D4 (politics): Japan Japan exports


### Solution

Calculate the prior probabilities of each class based on the training documents. 

In this case, we have 2 classes (Sports and Politics) and 4 training documents. Two of the documents are sports-related and two are politics-related, so the prior probabilities are:

P(Sports) = 2/4 = 0.5

P(Politics) = 2/4 = 0.5

In [1]:
# Import the required libraries

import numpy as np

### Procedure

##### Step 1: Preparing the training data

The training data consists of 4 documents, each labeled as either "Sports" or "Politics". We split the documents into their constituent words, and use these words to build a vocabulary.

In [18]:
# Training data
train_docs = [
    ("China soccer", "Sports"),
    ("Japan baseball", "Sports"),
    ("China trade", "Politics"),
    ("Japan Japan exports", "Politics")  # D4 repeats "Japan", as in the problem statement
]

# Vocabulary
vocab = set()
for doc, label in train_docs:
    vocab.update(doc.split())


##### Step 2: Calculating the prior probabilities

The prior probabilities of each class are simply the proportion of training documents that belong to each class.

In [19]:
# Counting the number of documents for each class
num_docs_sports = sum(1 for _, label in train_docs if label == "Sports")
num_docs_politics = sum(1 for _, label in train_docs if label == "Politics")

# Calculating the prior probabilities for each class
prior_sports = num_docs_sports / len(train_docs)
prior_politics = num_docs_politics / len(train_docs)


##### Step 3: Counting the number of words in each class

We count the number of occurrences of each word in each class, and use Laplace smoothing to avoid zero probability estimates.

In [21]:
# Laplace smoothing parameter
alpha = 1

In [22]:
# Counting the number of words in each class
word_counts_sports = {word: 0 for word in vocab}
word_counts_politics = {word: 0 for word in vocab}
for doc, label in train_docs:
    words = doc.split()
    for word in words:
        if label == "Sports":
            word_counts_sports[word] += 1
        else:
            word_counts_politics[word] += 1

In [23]:
# Applying Laplace smoothing to the word counts
for word in vocab:
    word_counts_sports[word] += alpha
    word_counts_politics[word] += alpha
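
Written as a formula, the smoothed counts above feed the usual add-α (Laplace) estimate of a word's likelihood. The symbols $N_c$ (total number of training tokens in class $c$) and $|V|$ (vocabulary size) are notation introduced here for illustration and do not appear in the code:

```latex
\[
P(w \mid c) = \frac{\mathrm{count}(w, c) + \alpha}{N_c + \alpha\,|V|}
\]
```

For example, with α = 1, |V| = 6, and the 4 tokens in the Sports documents, P(soccer | Sports) = (1 + 1) / (4 + 6) = 0.2. Summing the smoothed counts of a class, as Step 4 does, gives exactly the denominator N_c + α|V|.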

##### Step 4: Classifying the test documents

For each test document, we calculate the likelihood probabilities of each word given each class, and apply Bayes' rule to calculate the posterior probabilities of each class given the document. We then classify the document as the class with the highest posterior probability.

In [24]:
# Test data
test_docs = [
    "soccer",
    "Japan"
]

In [25]:
# Classifying the test documents
for test_doc in test_docs:
    # Calculating the likelihood probabilities for each class
    likelihood_sports = 1
    likelihood_politics = 1
    words = test_doc.split()
    for word in words:
        # Probability of the word given the Sports class
        prob_word_given_sports = (word_counts_sports.get(word, 0)) / (sum(word_counts_sports.values()))
        likelihood_sports *= prob_word_given_sports
        # Probability of the word given the Politics class
        prob_word_given_politics = (word_counts_politics.get(word, 0)) / (sum(word_counts_politics.values()))
        likelihood_politics *= prob_word_given_politics
    
    # Applying Bayes' rule: unnormalized posteriors (the evidence term is the same for both classes)
    posterior_sports = likelihood_sports * prior_sports
    posterior_politics = likelihood_politics * prior_politics
    
    # Classifying the test document
    if posterior_sports > posterior_politics:
        print(f"{test_doc}: Sports")
    else:
        print(f"{test_doc}: Politics")

soccer: Sports
Japan: Politics
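
The loop above multiplies probabilities directly, which is fine for one-word test documents. For longer documents, a common variant sums log-probabilities instead to avoid numerical underflow. The block below is a minimal, self-contained sketch of that variant, not the notebook's own method. The helper names `log_posteriors` and `classify` are hypothetical, and skipping words never seen in training is an assumption made here.

```python
import math
from collections import Counter

# Training data from the problem statement (D4 repeats "Japan")
train_docs = [
    ("China soccer", "Sports"),
    ("Japan baseball", "Sports"),
    ("China trade", "Politics"),
    ("Japan Japan exports", "Politics"),
]
alpha = 1  # Laplace smoothing parameter

# Per-class token counts, per-class document counts, and the shared vocabulary
word_counts = {}
doc_counts = Counter()
for doc, label in train_docs:
    word_counts.setdefault(label, Counter()).update(doc.split())
    doc_counts[label] += 1
vocab = {word for counts in word_counts.values() for word in counts}


def log_posteriors(doc):
    """Return {label: log prior + sum of log smoothed likelihoods} (unnormalized)."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        score = math.log(doc_counts[label] / len(train_docs))
        for word in doc.split():
            if word in vocab:  # assumption: words never seen in training are skipped
                score += math.log((counts[word] + alpha) / total)
        scores[label] = score
    return scores


def classify(doc):
    """Pick the label with the highest log posterior score."""
    scores = log_posteriors(doc)
    return max(scores, key=scores.get)


for test_doc in ["soccer", "Japan"]:
    print(f"{test_doc}: {classify(test_doc)}")
```

Because the logarithm is monotonic, comparing summed log scores picks the same class as comparing the products of probabilities.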


In [28]:
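# Calculate the smoothed likelihoods by hand: (count + alpha) / (class tokens + alpha * |V|)
# with |V| = 6, 4 tokens in Sports and 5 tokens in Politics (D4 is "Japan Japan exports")
p_soccer_sports = (1 + 1) / (4 + 6)    # 2/10 = 0.2
p_soccer_politics = (0 + 1) / (5 + 6)  # 1/11 ≈ 0.091
p_japan_sports = (1 + 1) / (4 + 6)     # 2/10 = 0.2
p_japan_politics = (2 + 1) / (5 + 6)   # 3/11 ≈ 0.273

With equal priors of 0.5, P("Sports" | "soccer") ∝ 0.5 × 0.2 = 0.1 exceeds P("Politics" | "soccer") ∝ 0.5 × 1/11 ≈ 0.045, so "soccer" is classified as "Sports". Likewise, P("Politics" | "Japan") ∝ 0.5 × 3/11 ≈ 0.136 exceeds P("Sports" | "Japan") ∝ 0.5 × 0.2 = 0.1, so we classify "Japan" as "Politics".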

Therefore, the final classifications for the test documents are:

soccer: Sports

Japan: Politics
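
The comparisons above use unnormalized scores. To report actual posterior probabilities, each score can be divided by the sum of the scores over both classes. The block below is a small sketch of that normalization using exact fractions. It assumes the problem statement's data (D4 is "Japan Japan exports", α = 1, |V| = 6), and the function name `normalized_posteriors` is hypothetical.

```python
from fractions import Fraction

# Smoothed likelihoods worked out by hand (alpha = 1, |V| = 6,
# 4 tokens in Sports, 5 tokens in Politics since D4 is "Japan Japan exports")
likelihoods = {
    "soccer": {"Sports": Fraction(2, 10), "Politics": Fraction(1, 11)},
    "Japan": {"Sports": Fraction(2, 10), "Politics": Fraction(3, 11)},
}
priors = {"Sports": Fraction(1, 2), "Politics": Fraction(1, 2)}


def normalized_posteriors(word):
    """Divide each prior * likelihood by their sum so the results add up to 1."""
    joint = {label: priors[label] * p for label, p in likelihoods[word].items()}
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


for word in likelihoods:
    posteriors = normalized_posteriors(word)
    print(word, {label: str(p) for label, p in posteriors.items()})
```

Under these assumptions, P(Sports | soccer) = 11/16 and P(Politics | Japan) = 15/26, which agrees with the classifications above.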