### **News Article Text Classification**
This project applies supervised machine learning to text classification, the task of assigning categories to text according to its content. More specifically, we train a Naive Bayes classifier that predicts the category of a news article from its title.

The dataset used for training and testing contains over 120,000 news article titles from more than 2,000 news sources. Each sample consists of an article title and a class label. Once the labels are shifted to start at zero, they range from 0 to 3 and correspond to the four main categories: world, sports, business, and science/tech.

**Setup code**

In [None]:
import pandas as pd
import numpy as np
import string
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score
np.random.seed(1)

**Read csv files**

In [None]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [None]:
df_train.head()

Unnamed: 0,label,title,description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


**Create training and test sets**

In [None]:
X_train = np.array(df_train['title'])
y_train = np.array(df_train['label'])
X_test = np.array(df_test['title'])
y_test = np.array(df_test['label'])

# shift every label down by one so that the classes start at 0
y_train -= 1
y_test -= 1
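
To make the shifted labels easier to read, they can be mapped back to category names. The sketch below assumes the raw CSV labels run from 1 to 4 (as the subtraction suggests) and that the categories follow the order given in the introduction; `LABEL_NAMES` and the toy array are illustrative and not part of the notebook.

```python
import numpy as np

# Assumed mapping, following the category order listed in the introduction.
LABEL_NAMES = {0: 'world', 1: 'sports', 2: 'business', 3: 'science/tech'}

# Toy labels in the 1-4 range, standing in for the CSV 'label' column.
raw_labels = np.array([3, 1, 4, 2])

# Vectorized shift to the 0-3 range.
shifted = raw_labels - 1

names = [LABEL_NAMES[int(label)] for label in shifted]
print(names)
```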

**Bag of words** 
- Convert title strings to lower case
- Remove punctuation
- Split string into individual words
- Count frequency of each word

In [None]:
def count_frequency(documents):
    """
    count occurrence of each word in the list.
    Inputs:
    - documents: list, each entity is a string type representing the title of an article
    Outputs:
    - frequency: a dictionary. The key is the unique words, and the value is the number of occurrences of the word
    """

    # convert to lower case
    lower_case_doc = []
    for s in documents:
      lower_case_doc.append(s.lower())

    # remove punctuation
    no_punc_doc = []
    for s in lower_case_doc:
      no_punc_doc.append(s.translate(str.maketrans('', '', string.punctuation)))

    # split strings into words
    words_doc = []
    for s in no_punc_doc:
      for w in s.split():
        words_doc.append(w)
    
    # count the frequency of words
    frequency = Counter(words_doc)

    return frequency
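
As a quick illustration of the four bag-of-words steps, here is a compact, self-contained sketch applied to a single made-up title. The helper name `bag_of_words` is hypothetical; it mirrors `count_frequency` for one string rather than replacing it.

```python
import string
from collections import Counter

def bag_of_words(title):
    """Hypothetical single-title helper: lower-case, strip punctuation, split, count."""
    lowered = title.lower()
    stripped = lowered.translate(str.maketrans('', '', string.punctuation))
    return Counter(stripped.split())

counts = bag_of_words("Oil prices soar; oil exports halted!")
print(counts)
```

Note that "Oil" and "oil" are merged into one key and that the trailing punctuation does not create separate tokens.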

**Training the Naive Bayes model**
- compute the prior probability of each label
- compute the conditional probability of words for each label
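
In symbols, the two quantities can be written as follows. The notation is introduced here for illustration: $N_c$ is the number of training titles with label $c$, $N$ is the total number of training titles, $\mathrm{count}(w, c)$ is the number of occurrences of word $w$ in titles of class $c$, and $V$ is the add-one smoothing constant, which the code below fixes at 20000.

$$
P(y = c) = \frac{N_c}{N},
\qquad
P(w \mid y = c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w'} \mathrm{count}(w', c) + V}
$$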

In [None]:
def prior_prob(y_train):
    """
    compute the prior probability
    Inputs:
    - y_train: an array that stores the ground-truth labels of the training data
    Outputs:
    - prior: a dictionary. key is the class label, value is the prior probability.
    """
    prior = {}
    world = 0
    sports = 0
    business = 0
    science = 0

    n = len(y_train)

    for i in y_train:
      if i==0:
        world += 1
      elif i==1:
        sports += 1
      elif i==2:
        business += 1
      elif i==3:
        science += 1
    
    prior[0] = world/n
    prior[1] = sports/n
    prior[2] = business/n
    prior[3] = science/n  
    
    return prior
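
The same priors can also be computed without one counter variable per class. The following is a minimal sketch on toy labels; `prior_from_counts` is a hypothetical name, and unlike `prior_prob` it only returns entries for labels that actually occur in the input.

```python
from collections import Counter

def prior_from_counts(labels):
    """Hypothetical helper: relative frequency of each label."""
    counts = Counter(int(label) for label in labels)
    n = len(labels)
    return {label: counts[label] / n for label in sorted(counts)}

priors = prior_from_counts([0, 0, 1, 2, 3, 3, 3, 1])
print(priors)
```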

In [None]:
def conditional_prob(X_train, y_train):
    """
    compute the conditional probability for a document set
    Inputs:
    - X_train: an array of shape (num_train,) which stores article titles. each entity is a string type.
    - y_train: an array of shape (num_train,). the ground-truth label for each training example.
    Outputs:
    - cond_prob: a dictionary. key is the class label, value is a dictionary in which the key is word, the value is the conditional probability of feature x_i given y.
    """
    
    cond_prob = {}
    words_cond_prob_label_0 = {}
    words_cond_prob_label_1 = {}
    words_cond_prob_label_2 = {}
    words_cond_prob_label_3 = {}

    label_0_sms = []
    label_1_sms = []
    label_2_sms = []
    label_3_sms = []
    
    for msg, label in zip(X_train, y_train):
      if label==0:
        label_0_sms.append(msg)
      elif label==1:
        label_1_sms.append(msg)
      elif label==2:
        label_2_sms.append(msg)
      elif label==3:
        label_3_sms.append(msg)
    
    label_0_words = count_frequency(label_0_sms)
    label_1_words = count_frequency(label_1_sms)
    label_2_words = count_frequency(label_2_sms)
    label_3_words = count_frequency(label_3_sms)

    sum_label_0_words = sum(label_0_words.values())
    sum_label_1_words = sum(label_1_words.values())
    sum_label_2_words = sum(label_2_words.values())
    sum_label_3_words = sum(label_3_words.values())

    for w in label_0_words:
      words_cond_prob_label_0[w] = (label_0_words[w]+1)/(sum_label_0_words+20000)

    for w in label_1_words:
      words_cond_prob_label_1[w] = (label_1_words[w]+1)/(sum_label_1_words+20000)
    
    for w in label_2_words:
      words_cond_prob_label_2[w] = (label_2_words[w]+1)/(sum_label_2_words+20000)
    
    for w in label_3_words:
      words_cond_prob_label_3[w] = (label_3_words[w]+1)/(sum_label_3_words+20000)

    cond_prob[0] = words_cond_prob_label_0 
    cond_prob[1] = words_cond_prob_label_1
    cond_prob[2] = words_cond_prob_label_2
    cond_prob[3] = words_cond_prob_label_3
    
    return cond_prob
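
To see the effect of the add-one smoothing used above, the sketch below applies the same formula to the word counts of a single toy class. `smoothed_probs` and the counts are hypothetical; the constant 20000 is the one hard-coded in `conditional_prob`.

```python
from collections import Counter

def smoothed_probs(word_counts, smoothing_size):
    """Hypothetical helper: add-one smoothing over the word counts of one class."""
    total = sum(word_counts.values())
    return {w: (c + 1) / (total + smoothing_size) for w, c in word_counts.items()}

sports_counts = Counter({'match': 3, 'goal': 1})
probs = smoothed_probs(sports_counts, smoothing_size=20000)

print(probs['match'])  # (3 + 1) / (4 + 20000)
print(probs['goal'])   # (1 + 1) / (4 + 20000)
```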

In [None]:
def train_NB_model(X_train, y_train):
    """
    training a naive bayes model from the training data.
    Inputs:
    - X_train: an array of shape (num_train,) which stores article titles. each entity is a string type.
    - y_train: an array of shape (num_train,). the ground-truth label for each training example.
    Output:
    - prior: a dictionary, whose key is the class label, and value is the prior probability.
    - conditional: a dictionary whose key is the class label y, and value is another dictionary.
    """
    # compute the prior probability
    prior = prior_prob(y_train)
    
    # compute the conditional probability
    conditional = conditional_prob(X_train, y_train)

    return prior, conditional

**Predicting class labels (used on test data)**

In [None]:
def compute_test_prob(word_count, prior_cat, cond_cat):
    """
    compute the log posterior score of one label for one test example
    Inputs:
    - word_count: a dictionary which stores the frequencies of each word in the title of an article. 
                  Key is the word, value is the number of its occurrence.
    - prior_cat: a scalar. prior probability of a specific label
    - cond_cat: a dictionary. conditional probability of a specific label
    Outputs:
    - prob: unnormalized log posterior of a specific label for the test example
    """
    
    sum_cond_probs = 0

    for w in word_count:
      if w in cond_cat:
        sum_cond_probs += word_count[w]*np.log(cond_cat[w])
      else:
        sum_cond_probs += word_count[w]*np.log(1/20000)
    
    prob = np.log(prior_cat) + sum_cond_probs
    
    return prob
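
The function sums logarithms instead of multiplying probabilities. A small sketch of why this matters: with many small factors, the direct product underflows to zero in floating point, while the sum of logs stays finite. The probabilities here are made up, and the document length is deliberately exaggerated.

```python
import numpy as np

# 200 hypothetical word probabilities of 1e-4 each.
word_probs = np.full(200, 1e-4)

direct_product = np.prod(word_probs)     # 1e-800 is not representable in float64
log_score = np.sum(np.log(word_probs))   # 200 * log(1e-4), a finite number

print(direct_product)
print(log_score)
```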

In [None]:
def predict_label(X_test, prior_prob, cond_prob):
    """
    predict the class labels for the testing set
    Inputs:
    - X_test: an array of shape (num_test,) which stores test data. 
              Each entity is a string type article title.
    - prior_prob: a dictionary which stores the prior probability for all labels
    - cond_prob: a dictionary whose key is the class label y, and value is another dictionary.
                   In the latter dictionary, the key is word w, and the value is the
                   conditional probability P(X_i = w | y).
    Outputs:
    - predict: an array that stores predicted labels
    - test_prob: an array of shape (num_test, num_classes) which stores the posterior probability of each class
    """
    
    test_prob = []

    for title in X_test:
      word_count = count_frequency([title])

      # log posterior score of each label, in the order the labels appear in prior_prob
      log_scores = np.array([compute_test_prob(word_count, prior_prob[label], cond_prob[label])
                             for label in prior_prob])

      # normalize with the maximum subtracted first so np.exp does not underflow
      exp_scores = np.exp(log_scores - np.max(log_scores))
      test_prob.append(exp_scores / np.sum(exp_scores))

    test_prob = np.array(test_prob)

    # the column index equals the class label because the labels are 0..3 in order
    predict = np.argmax(test_prob, axis=1)

    return predict, test_prob
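
The normalization step in `predict_label` turns log scores into probabilities by subtracting the largest score before exponentiating. Here is a minimal standalone sketch of that step; `normalize_log_scores` is a hypothetical helper and the scores are invented.

```python
import numpy as np

def normalize_log_scores(log_scores):
    """Hypothetical helper: convert log scores into probabilities that sum to 1."""
    log_scores = np.asarray(log_scores, dtype=float)
    shifted = log_scores - np.max(log_scores)   # the largest entry becomes 0
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

scores = [-1050.0, -1052.0, -1060.0, -1049.0]
probs = normalize_log_scores(scores)

print(probs)
print(probs.argmax())
```

Without the shift, `np.exp(-1050.0)` would already be 0.0 and the division would be 0/0. The shift leaves the ranking of the classes unchanged, so the predicted label is the same.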

**Predict label for test data**

In [None]:
# training naive Bayes model 
prior, cond = train_NB_model(X_train, y_train)

# evaluate on test set
y_pred, prob = predict_label(X_test, prior, cond)

**Compute performance**

In [None]:
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

print('Testing accuracy: ', acc)
print('F1 score: ', f1)

Testing accuracy:  0.7955263157894736
F1 score:  0.7951859451555987
