# CA#3 of AI Course
Rasta Tadayon
810196436 
***

<h1 align="center"> Classification of a News Dataset with Naive Bayes </h1>

<h2 align="center">Brief Description of the Project</h2> 

In this course project the objective is to predict the category of a news article from its headline and a short description. To do so, **Naive Bayes** is used. **Naive Bayes** classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with a strong (naive) assumption of independence between the features. 

The solution relies on the **bag of words** method. Each word in the news text is treated as a feature, and the number of times it occurs in each class is counted. From these counts, the probability of each word given each class is estimated. For a new article, the per-word probabilities are combined with Bayes' theorem to obtain a score for every class, and the class with the highest score is chosen as the article's category.
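
To make the bag of words idea concrete, here is a minimal sketch on made-up data. The texts and the names `training_texts` and `bags` are illustrative assumptions and are not part of the project code, which builds its bags from the dataset later on.

```python
from collections import Counter

# Toy, made-up training texts (already lower-cased and cleaned).
training_texts = {
    "TRAVEL": ["cheap flight island", "island hotel beach"],
    "BUSINESS": ["bank profit report", "profit tax bank bank"],
}

# One bag of words per class: word -> number of occurrences in that class.
bags = {
    label: Counter(word for text in texts for word in text.split())
    for label, texts in training_texts.items()
}

print(bags["TRAVEL"]["island"])   # 2
print(bags["BUSINESS"]["bank"])   # 3
print(bags["TRAVEL"]["bank"])     # 0, a Counter returns 0 for unseen words
```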

<h2 align="center">Project Steps</h2> 

The training and test datasets have <code>authors</code>, <code>category</code>, <code>date</code>, <code>headline</code>, <code>link</code> and <code>short_description</code> columns. Only the <code>headline</code> and <code>short_description</code> columns are used for the prediction. Before the text in these columns can be used, it has to be preprocessed.
    
After preprocessing, the dataset is divided into two datasets: a training set and an evaluation set. If the training set is not balanced, it is then balanced using the **oversampling method**, which is explained later.

Then for each category in the training set a **bag of words** is created which consists of the words that appeared in that category and the number of times those words were repeated.

Using the **bag of words** and Naive Bayes, the probability of the news article belonging to each category (the posterior probability) is calculated. The category with the largest posterior probability is selected as the article's category.

Evaluation metrics (recall, precision and accuracy) are then used to determine how well the model works. If the model meets the desired standard, it is applied to the test set.

In [1]:
import pandas as pd
import nltk
from nltk import RegexpTokenizer
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import re
import numpy as np
import random
import math
import operator

In [2]:
dataset = pd.read_csv('Attachment/data.csv')

<h2 align="center">Preprocess</h2> 

In the preprocessing phase a new column, <code>description_and_headline</code>, is created by concatenating the <code>headline</code> and <code>short_description</code> columns. The following steps are then applied to it:
- The text is converted to lower case.
- Numbers and URLs are removed.
- The text is tokenized, using non-word characters as delimiters.
- Stop words (commonly used English words such as "the" and "that") and words of two characters or fewer are removed.
- The remaining words are normalized using stemming or lemmatization.

In this project both stemming and lemmatization were used to preprocess the data, so that the two could be compared and the better one chosen.
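
The cleaning steps listed above (everything except the final stemming or lemmatization) can be sketched with the standard library alone. This is only an approximation of the notebook's pipeline: `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's English stop word list, and `clean_tokens` is an illustrative name.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the notebook uses NLTK's English stop word list instead.
STOP_WORDS = {"the", "that", "and", "for", "are", "now"}

def clean_tokens(text):
    """Lower-case, strip numbers and URLs, tokenize, then filter tokens."""
    text = str(text).lower()
    text = re.sub(r"[0-9]+", "", text)       # remove digits
    text = re.sub(r"http\S+", "", text)      # remove URLs
    tokens = re.findall(r"\w+", text)        # split on non-word characters
    # Keep words longer than two characters that are not stop words.
    return [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]

print(clean_tokens("Lawyers Are Now The Driving Force Behind 3 Mortgage Scams"))
# ['lawyers', 'driving', 'force', 'behind', 'mortgage', 'scams']
```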

**Stemming** is the process of reducing inflected words to a common stem, so that a group of related words maps to the same stem even if that stem is not a valid word in the language. Stems are created by removing the suffixes or prefixes used with a word.

**Lemmatization**, unlike stemming, reduces inflected words to a root that is guaranteed to belong to the language. This root word is called the lemma: the canonical, dictionary, or citation form of a set of words. Since lemmatization returns a valid word of the language, it is used where valid words are required.

In this project the <code>Preprocessor</code> class is used to normalize the text. Its <code>preprocess</code> method applies all the steps above. First, a new column is created in the dataset by concatenating the <code>headline</code> and <code>short_description</code> strings. The normalization is then applied to this new column.
After preprocessing, two new datasets are available: one normalized using *stemming* and the other using *lemmatization*.

In [3]:
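class Preprocessor():
    def __init__(self,dataset):
        self.initdf = dataset
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()
        # Load the stop word list once instead of once per token
        self.stopWords = set(stopwords.words('english'))
        self.preprocess()
    def normalizer(self,text):
        # Lower-case, strip numbers and URLs, tokenize, then filter tokens
        text = str(text)
        text = text.lower()
        remNumbers = re.sub('[0-9]+', '', text)
        remURL=re.sub(r'http\S+', '',remNumbers)
        tokens = self.tokenizer.tokenize(remURL)  
        # Keep words longer than two characters that are not stop words
        filteredWords = [w for w in tokens if len(w) > 2 and w not in self.stopWords]
        self.stemWords=[self.stemmer.stem(w) for w in filteredWords]
        # lemmatize() treats every word as a noun by default (pos='n'),
        # so verb forms such as 'driving' are left unchanged
        self.lemmaWords=[self.lemmatizer.lemmatize(w) for w in filteredWords]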

    def lemmatize(self, text):
        self.normalizer(text)
        return ' '.join(self.lemmaWords)
    def stem(self,text):
        self.normalizer(text)
        return ' '.join(self.stemWords)
    def preprocess(self):
        self.processedDf = self.initdf.copy()
        self.processedDf = self.processedDf.replace(np.nan, '', regex=True)

        self.stemmedDf = self.processedDf.copy()
        self.lemmatizedDf = self.processedDf.copy()
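        # NOTE: the two columns are joined without a separator, so the last word of
        # the headline fuses with the first word of the description
        # (e.g. 'stripbut' in the output below).
        self.stemmedDf['description_and_headline'] = self.stemmedDf['headline'] + self.stemmedDf['short_description']
        self.lemmatizedDf['description_and_headline'] = self.lemmatizedDf['headline'] + self.lemmatizedDf['short_description']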
        
        self.stemmedDf['description_and_headline'] = self.stemmedDf['description_and_headline'].apply(lambda x: self.stem(x))
        self.lemmatizedDf['description_and_headline'] = self.lemmatizedDf['description_and_headline'].apply(lambda x: self.lemmatize(x))
        

In [4]:
phase1Preprocessor = Preprocessor(dataset)

In [5]:
phase1Preprocessor.stemmedDf.head(5)

| | index | authors | category | date | headline | link | short_description | description_and_headline |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Katherine LaGrave, ContributorTravel writer an... | TRAVEL | 2014-05-07 | EccentriCities: Bingo Parties, Paella and Isla... | https://www.huffingtonpost.com/entry/eccentric... | Påskekrim is merely the tip of the proverbial ... | eccentr bingo parti paella island hop oslopåsk... |
| 1 | 1 | Ben Hallman | BUSINESS | 2014-06-09 | Lawyers Are Now The Driving Force Behind Mortg... | https://www.huffingtonpost.com/entry/mortgage-... | | lawyer drive forc behind mortgag scam |
| 2 | 2 | Jessica Misener | STYLE & BEAUTY | 2012-03-12 | Madonna 'Truth Or Dare' Shoe Line To Debut Thi... | https://www.huffingtonpost.com/entry/madonna-s... | Madonna is slinking her way into footwear now,... | madonna truth dare shoe line debut fall photo ... |
| 3 | 3 | Victor and Mary, Contributor\n2Sense-LA.com | TRAVEL | 2013-12-17 | Sophistication and Serenity on the Las Vegas S... | https://www.huffingtonpost.com/entry/las-vegas... | But what if you're a 30-something couple that ... | sophist seren la vega stripbut someth coupl sh... |
| 4 | 4 | Emily Cohn, Contributor | BUSINESS | 2015-03-19 | It's Still Pretty Hard For Women To Get Free B... | https://www.huffingtonpost.com/entry/free-birt... | Obamacare was supposed to make birth control f... | still pretti hard women get free birth control... |


In [6]:
phase1Preprocessor.lemmatizedDf.head(5)

| | index | authors | category | date | headline | link | short_description | description_and_headline |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Katherine LaGrave, ContributorTravel writer an... | TRAVEL | 2014-05-07 | EccentriCities: Bingo Parties, Paella and Isla... | https://www.huffingtonpost.com/entry/eccentric... | Påskekrim is merely the tip of the proverbial ... | eccentricity bingo party paella island hopping... |
| 1 | 1 | Ben Hallman | BUSINESS | 2014-06-09 | Lawyers Are Now The Driving Force Behind Mortg... | https://www.huffingtonpost.com/entry/mortgage-... | | lawyer driving force behind mortgage scam |
| 2 | 2 | Jessica Misener | STYLE & BEAUTY | 2012-03-12 | Madonna 'Truth Or Dare' Shoe Line To Debut Thi... | https://www.huffingtonpost.com/entry/madonna-s... | Madonna is slinking her way into footwear now,... | madonna truth dare shoe line debut fall photo ... |
| 3 | 3 | Victor and Mary, Contributor\n2Sense-LA.com | TRAVEL | 2013-12-17 | Sophistication and Serenity on the Las Vegas S... | https://www.huffingtonpost.com/entry/las-vegas... | But what if you're a 30-something couple that ... | sophistication serenity la vega stripbut somet... |
| 4 | 4 | Emily Cohn, Contributor | BUSINESS | 2015-03-19 | It's Still Pretty Hard For Women To Get Free B... | https://www.huffingtonpost.com/entry/free-birt... | Obamacare was supposed to make birth control f... | still pretty hard woman get free birth control... |


As you can see, the stemmed dataset contains words that do not exist in English (for example "eccentr" and "parti"), whereas the words in the lemmatized dataset are, for the most part, valid English words (for example "party" and "mortgage").

<h2 align="center">Dividing the Dataset into Training and Evaluation Set</h2> 

The given dataset is divided into two new datasets: a training set and an evaluation set. The training set is used to train the model and the evaluation set is used to evaluate how well the model works (using recall, precision and accuracy). 

Usually 80 percent of the dataset is used as the training set and the remaining 20 percent as the evaluation set.

The division is done by the <code>DatasetDivider</code> class, which splits each category separately and produces two new datasets: the training set (<code>trainDf</code>) and the evaluation set (<code>testDf</code>).
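
The idea of a per-category random split can be sketched without pandas as follows. The function name `stratified_split`, the `(label, text)` row format and the fixed seed are assumptions made for this illustration; the notebook's own class works on DataFrames and does not fix a seed.

```python
import random

def stratified_split(rows, train_fraction=0.8, seed=0):
    """Split (label, text) rows into train and evaluation lists, class by class."""
    rng = random.Random(seed)
    train, evaluation = [], []
    for label in sorted({label for label, _ in rows}):
        group = [row for row in rows if row[0] == label]
        for row in group:
            # Each row goes to the training set with probability train_fraction,
            # so the split is only approximately 80/20 within each class.
            (train if rng.random() < train_fraction else evaluation).append(row)
    return train, evaluation

rows = [("TRAVEL", f"travel text {i}") for i in range(50)] + \
       [("BUSINESS", f"business text {i}") for i in range(30)]
train, evaluation = stratified_split(rows)
print(len(train), len(evaluation))
```

Because every row is assigned independently, the resulting proportions vary slightly from run to run unless the seed is fixed.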

In [13]:
class DatasetDivider():
    def __init__(self,dataset):
        self.dataset = dataset.copy()
        self.dfs = []
        self.trainDf = pd.DataFrame()
        self.testDf = pd.DataFrame()
    def divide(self, percentage = 0.8):
        # Split every category separately so both sets contain all categories
        differentLabels = list(self.dataset.category.unique())
        for label in differentLabels:
            df =  self.dataset[self.dataset['category'] == label].copy()
            # Random mask: each row goes to the training set with probability `percentage`
            msk = np.random.rand(df.shape[0]) < percentage
            dfList = []
            dfList.append(df[msk])
            dfList.append(df[~msk])
            self.dfs.append(dfList)
        for df in self.dfs:
            self.trainDf = pd.concat([self.trainDf,df[0]]).sort_index()
            self.testDf = pd.concat([self.testDf, df[1]]).sort_index()


In [14]:
dividerLemmatized = DatasetDivider(phase1Preprocessor.lemmatizedDf)
dividerLemmatized.divide()
dividerStemmed = DatasetDivider(phase1Preprocessor.stemmedDf)
dividerStemmed.divide()

In [15]:
lemmatizedTrainDf = dividerLemmatized.trainDf
lemmatizedTestDf = dividerLemmatized.testDf
stemmedTrainDf = dividerStemmed.trainDf
stemmedTestDf = dividerStemmed.testDf

<h2 align="center">Oversampling</h2> 

According to Wikipedia, oversampling in data analysis is a technique used to adjust the class distribution of a data set, meaning the ratio between the different categories present in it. Oversampling is used in statistical sampling, survey design methodology and machine learning.

Oversampling involves introducing a bias to select more samples from one class than from another, to compensate for an imbalance that is either already present in the data, or likely to develop if a purely random sample were taken.

In this project the <code>Oversampler</code> class does the oversampling: rows of the smaller categories are replicated until every category has as many rows as the largest one.
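
A minimal, pandas-free sketch of this kind of oversampling is shown below. It is an illustration under stated assumptions rather than the notebook's implementation: the `oversample` function and the `(label, text)` rows are hypothetical, and it samples with replacement via `random.choices`, whereas the notebook's class draws the extra rows with `DataFrame.sample`.

```python
import random
from collections import Counter

def oversample(rows, seed=0):
    """Replicate rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for label, text in rows:
        by_label.setdefault(label, []).append((label, text))
    target = max(len(group) for group in by_label.values())
    balanced = list(rows)
    for group in by_label.values():
        # random.choices samples with replacement, so this also works when a
        # class is less than half the size of the largest one.
        balanced += rng.choices(group, k=target - len(group))
    return balanced

rows = [("TRAVEL", "t")] * 5 + [("BUSINESS", "b")] * 2
counts = Counter(label for label, _ in oversample(rows))
print(dict(counts))   # {'TRAVEL': 5, 'BUSINESS': 5}
```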

In [16]:
class Oversampler():
    def __init__(self,trainDf):
        self.trainDf = trainDf
    
    def oversample(self):
        labels = self.trainDf.category.unique()
        categorySizes = []
        for label in labels:
            df = self.trainDf[self.trainDf['category'] == label]
            categorySizes.append(df.shape[0])
        maxCategorySize = max(categorySizes)
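        for label in labels:
            df = self.trainDf[self.trainDf['category'] == label]
            if(df.shape[0] < maxCategorySize):
                # Draw the missing rows at random from the same category.
                # sample() draws without replacement by default, so this assumes the
                # shortfall is not larger than the category itself.
                tempdf = df.sample(maxCategorySize - df.shape[0])
                self.trainDf = pd.concat([self.trainDf, tempdf])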

In [17]:
oversamplerLemmatized = Oversampler(lemmatizedTrainDf)
oversamplerLemmatized.oversample()
oversamplerStemmed = Oversampler(stemmedTrainDf)
oversamplerStemmed.oversample()

In [18]:
sum(oversamplerLemmatized.trainDf['category'] == 'BUSINESS')

7096

In [19]:
sum(oversamplerLemmatized.trainDf['category'] == 'TRAVEL')

7096

In [20]:
sum(oversamplerLemmatized.trainDf['category'] == 'STYLE & BEAUTY')

7096

As can be seen, every category now has the same number of rows (7096).

<h2 align="center">Bayesian Rule</h2> 

Once the previous steps are done, the probability of each article belonging to each category is calculated, and the category with the largest probability is selected as the estimated category.
To calculate this probability, **Bayes' theorem** is used.

\begin{align}
P(c|X) & = \frac{P(X|c)P(c)}{P(X)} \\
P(c|X) & \propto P(x_1|c)P(x_2|c)P(x_3|c) \cdots P(x_n|c)P(c)
\end{align}

- Posterior probability ($P(c|X)$): The probability of an article belonging to class $c$, given that its text contains the words $X = (x_1, \dots, x_n)$.
- Likelihood ($P(X|c)$): The probability of observing the words $X$ in an article of class $c$. Under the independence assumption it is the product, over all words $x_i$ of the article, of the number of occurrences of $x_i$ in class $c$ divided by the total number of words in class $c$.
- Class prior probability ($P(c)$): The probability of a news article being of class $c$. It is equal to the number of documents that belong to class $c$ divided by the number of all documents.
- Predictor prior probability ($P(X)$): The probability of the words $X$ occurring in a news article. Since the label is decided by comparing the values of $P(c_j|X)$ across classes, and $P(X)$ is the same for every class, it can be ignored.

As stated before, in **Naive Bayes** the features are assumed to be independent given the class; therefore the probability of a document given a class is the product of the probabilities of each of its words given that class.
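
As a small worked example of this product form, the sketch below scores a two-word document against two classes. The counts, the priors and the function name `unnormalized_posterior` are made up for illustration; every word of the toy document occurs in both classes, so no special handling of unseen words is needed here.

```python
# Toy word counts per class (made-up numbers, not taken from the dataset).
word_counts = {
    "TRAVEL":   {"island": 3, "hotel": 2, "bank": 1},
    "BUSINESS": {"bank": 4, "profit": 3, "hotel": 1},
}
priors = {"TRAVEL": 0.5, "BUSINESS": 0.5}   # P(c), equal as after oversampling

def unnormalized_posterior(words, label):
    """P(c) times the product of P(x_i | c): the numerator of Bayes' theorem."""
    total = sum(word_counts[label].values())
    score = priors[label]
    for word in words:
        score *= word_counts[label][word] / total
    return score

document = ["hotel", "bank"]
scores = {label: unnormalized_posterior(document, label) for label in word_counts}
print(scores)                       # TRAVEL: 1/36, BUSINESS: 1/32
print(max(scores, key=scores.get))  # BUSINESS
```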

<h2 align="center">Phase 1</h2> 


The project is done in two phases. In the first phase predictions are made only between the two categories 'BUSINESS' and 'TRAVEL'; in the second phase predictions are made between all three categories ('STYLE & BEAUTY', 'BUSINESS' and 'TRAVEL').

The <code>BayesianClassifier</code> class does the classification. It takes the training data and is given a *bag of words* for each class (or category), built by the <code>createBagOfWords</code> function from the 'description_and_headline' column created in the previous steps. The <code>estimate</code> method then takes the test set and estimates the category of each article. To do so it first calculates, for every class, the posterior probability that the article belongs to that class, as follows:

\begin{align}
P(c|X) & \propto P(x_1|c)P(x_2|c)P(x_3|c) \cdots P(x_n|c)P(c)
\end{align}
The probability $P(x_i|c_j)$ is calculated as the number of times the word $x_i$ occurs in documents of class $c_j$, divided by the sum of the counts of all words in the vocabulary of class $c_j$.
This posterior probability is calculated for each class $c_j$, and the class with the largest probability is selected as the predicted class.

Since the product of many small probabilities can underflow to zero in floating-point arithmetic, the logarithms of the probabilities are summed instead, which is equivalent to taking the logarithm of the product of all $P(x_i|c_j)$. Because the logarithm is a monotonically increasing function, using it for the comparison does not affect the answer.

Note that when the word $x_i$ does not exist in the bag of words for class $c_j$, its probability is not set to zero, because that would make the whole posterior probability zero, which is not correct. Instead it is set to a very small number, here $10^{-8}$.
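
The underflow problem and the log-space workaround can be seen in a few lines. The values of `p` and `n_words` are arbitrary illustrative choices, and `UNSEEN_LOG10` is a name introduced here for the $10^{-8}$ fallback.

```python
import math

p = 1e-5           # a small per-word probability (illustrative value)
n_words = 100

product = p ** n_words                 # 1e-500 is below the float range
log_sum = n_words * math.log10(p)      # the same quantity in log space

print(product)              # 0.0
print(round(log_sum, 6))    # -500.0

# Fallback for a word that a class has never seen:
# adding log10(1e-8) to the sum is the same as multiplying the product by 1e-8.
UNSEEN_LOG10 = math.log10(1e-8)
print(round(UNSEEN_LOG10, 6))   # -8.0
```

The direct product is indistinguishable from zero, while the sum of logarithms still allows two classes to be compared.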

In [58]:
from collections import Counter

def createBagOfWords(trainDf):
    # Returns the total word count per label and a list of [label, {word: count}] pairs
    labels = trainDf.category.unique()
    vocabCount = {}
    bagOfWords = []
    
    for label in labels:
        df = trainDf[trainDf['category'] == label]

        # Collect every word of every document in this category
        description = list(df['description_and_headline']) 
        vocab = []
        for item in description:
            vocab += item.split()
        # Counter counts all words in one pass; calling vocab.count(word) per word is quadratic
        bagOfWord = dict(Counter(vocab))
        bagOfWords.append([label, bagOfWord])
        vocabCount[label] = sum(bagOfWord.values())
    
    return vocabCount, bagOfWords

In [70]:
class BayesianClassifier():
    def __init__(self, trainDf):
        self.trainDf = trainDf.copy()
        self.bagOfWords = []
        self.P_c = {}
        self.vocabCount = {}
        labels = self.trainDf.category.unique()
        for label in labels:
            df = self.trainDf[self.trainDf['category'] == label]
            self.P_c[label] = df.shape[0]/self.trainDf.shape[0]
    def createBagOfWords(self, vocabCount, bagOfWords):
        self.bagOfWords = bagOfWords
        self.vocabCount = vocabCount
    def label(self, document):
        # Score every class with log10 P(c) + sum of log10 P(x_i|c) and return the best one
        doc = document.split()
        probabilities = {}
        for label in self.bagOfWords:
            labelName = label[0]
            p = math.log10(self.P_c[labelName])
            for word in doc:
                if word in label[1]:
                    p += math.log10(label[1][word]/self.vocabCount[labelName])
                else:
                    # Unseen word: use log10(1e-8) = -8 instead of log10(0)
                    p += -8
            probabilities[labelName] = p
        return max(probabilities.items(), key=operator.itemgetter(1))[0]
    def estimate(self, testDf):
        self.testDf = testDf.copy()
        self.testDf['estimated_category'] = self.testDf['description_and_headline'].apply(lambda x: self.label(x))
        
    def accuracy(self):
        return sum(self.testDf['category'] == self.testDf['estimated_category'])/self.testDf.shape[0]
    
    def precision(self, label):
        TP = sum((self.testDf['estimated_category'] == label) & (self.testDf['category'] == label))
        TP_plus_FP = sum(list(self.testDf['estimated_category'] == label))
        return TP/TP_plus_FP
    
    def recall(self, label):
        TP = sum((self.testDf['estimated_category'] == label) & (self.testDf['category'] == label))
        FN = sum((self.testDf['estimated_category'] != label) & (self.testDf['category'] == label))
        return TP/(TP+FN)
    
    def evaluate(self):
        labels = list(self.trainDf.category.unique())
        if(len(labels) == 2):
            columnNames = ['Phase 1']
        elif(len(labels) == 3):
            columnNames = ['Phase 2']
        columnNames += labels
        
        outDf = pd.DataFrame(columns=columnNames)
        outDf[columnNames[0]] = ['Recall', 'Precision', 'Accuracy']
        for label in columnNames[1:]:
            outDf[label] = [self.recall(label), self.precision(label), self.accuracy()]
        
        return outDf

In [71]:
phase1DfLemmatized = oversamplerLemmatized.trainDf[oversamplerLemmatized.trainDf['category'] != 'STYLE & BEAUTY']
phase1DfStemmed = oversamplerStemmed.trainDf[oversamplerStemmed.trainDf['category'] != 'STYLE & BEAUTY']

In [72]:
phase1testDfLemmatized = lemmatizedTestDf[lemmatizedTestDf['category'] != 'STYLE & BEAUTY']
phase1testDfStemmed = stemmedTestDf[stemmedTestDf['category'] != 'STYLE & BEAUTY']

In [73]:
classifier1Lemmatized = BayesianClassifier(phase1DfLemmatized)
classifier1Stemmed = BayesianClassifier(phase1DfStemmed)

In [63]:
vocabCountLemmitized, bagOfWordsLemmitized = createBagOfWords(phase1DfLemmatized)
vocabCountStemmed, bagOfWordsStemmed = createBagOfWords(phase1DfStemmed)

In [74]:
classifier1Lemmatized.createBagOfWords(vocabCountLemmitized, bagOfWordsLemmitized)

In [75]:
classifier1Stemmed.createBagOfWords(vocabCountStemmed, bagOfWordsStemmed)

In [76]:
classifier1Lemmatized.estimate(phase1testDfLemmatized)

In [77]:
classifier1Stemmed.estimate(phase1testDfStemmed)

In [78]:
classifier1Lemmatized.evaluate()

| | Phase 1 | TRAVEL | BUSINESS |
|---|---|---|---|
| 0 | Recall | 0.95505 | 0.860921 |
| 1 | Precision | 0.923779 | 0.915625 |
| 2 | Accuracy | 0.921006 | 0.921006 |


In [79]:
classifier1Stemmed.evaluate()

| | Phase 1 | BUSINESS | TRAVEL |
|---|---|---|---|
| 0 | Recall | 0.854824 | 0.95403 |
| 1 | Precision | 0.921283 | 0.912595 |
| 2 | Accuracy | 0.915709 | 0.915709 |


It can be observed that the two methods do not differ significantly; however, lemmatization works slightly better (accuracy of about 0.921 versus 0.916). A likely reason is that stemming only strips prefixes and suffixes from the given word, whereas lemmatization maps the word to its dictionary form. Note that the two datasets were split independently at random, so part of this small difference may be due to the split. On the downside, lemmatization may take longer, since it has to look each word up rather than apply a fixed set of rules.

<h2 align="center">Phase 2 - Classification Between All Three Classes</h2> 


In this phase the classification is done on the whole dataset, between all three classes. The steps are the same as in the previous phase. Since lemmatization worked better in phase 1, the dataset normalized with this method is used from here on.

In [80]:
classifier2Lemmatized = BayesianClassifier(oversamplerLemmatized.trainDf)

In [81]:
phase2VocabCount, phase2BagOfWords = createBagOfWords(oversamplerLemmatized.trainDf)

In [85]:
classifier2Lemmatized.createBagOfWords(phase2VocabCount, phase2BagOfWords)

In [86]:
classifier2Lemmatized.estimate(lemmatizedTestDf)

In [87]:
classifier2Lemmatized.evaluate()

| | Phase 2 | TRAVEL | BUSINESS | STYLE & BEAUTY |
|---|---|---|---|---|
| 0 | Recall | 0.925638 | 0.838394 | 0.915264 |
| 1 | Precision | 0.886762 | 0.870804 | 0.937574 |
| 2 | Accuracy | 0.902112 | 0.902112 | 0.902112 |


<h2 align="center">Confusion Matrix</h2> 


According to the [GeeksforGeeks website](https://www.geeksforgeeks.org/confusion-matrix-machine-learning/), a confusion matrix is a table that describes the performance of a classifier on a set of test data for which the true values are known. It helps visualize the performance of a model.
Most performance measures, for example accuracy, recall and precision, are computed from the confusion matrix.
The confusion matrix for phase 2 is as follows:

In [90]:
def confusionMatrix(df):
    correctTravel = sum((df['category'] == 'TRAVEL') & (df['estimated_category'] == 'TRAVEL'))
    businessPredictedTravel = sum((df['category'] == 'BUSINESS') & (df['estimated_category'] == 'TRAVEL'))
    beautyPredictedTravel = sum((df['category'] == 'STYLE & BEAUTY') & (df['estimated_category'] == 'TRAVEL'))
    
    correctBusiness = sum((df['category'] == 'BUSINESS') & (df['estimated_category'] == 'BUSINESS'))
    travelPredictedBusiness = sum((df['category'] == 'TRAVEL') & (df['estimated_category'] == 'BUSINESS'))
    beautyPredictedBusiness = sum((df['category'] == 'STYLE & BEAUTY') & (df['estimated_category'] == 'BUSINESS'))
    
    travelPredictedBeauty = sum((df['category'] == 'TRAVEL') & (df['estimated_category'] == 'STYLE & BEAUTY'))
    businessPredictedBeauty = sum((df['category'] == 'BUSINESS') & (df['estimated_category'] == 'STYLE & BEAUTY'))
    correctBeauty = sum((df['category'] == 'STYLE & BEAUTY') & (df['estimated_category'] == 'STYLE & BEAUTY'))
    
    outDf = pd.DataFrame(columns=['-', 'Predicted: Travel', 'Predicted: Business', 'Predicted: Style & Beauty'])
    outDf['-'] = ['Actual: Travel', 'Actual: Business', 'Actual: Style & Beauty']
    outDf['Predicted: Travel'] = [correctTravel, businessPredictedTravel, beautyPredictedTravel]
    outDf['Predicted: Business'] = [travelPredictedBusiness, correctBusiness, beautyPredictedBusiness]
    outDf['Predicted: Style & Beauty'] = [travelPredictedBeauty, businessPredictedBeauty, correctBeauty]
    return outDf

In [91]:
confusionMatrix(classifier2Lemmatized.testDf)

| | - | Predicted: Travel | Predicted: Business | Predicted: Style & Beauty |
|---|---|---|---|---|
| 0 | Actual: Travel | 1668 | 77 | 57 |
| 1 | Actual: Business | 117 | 856 | 48 |
| 2 | Actual: Style & Beauty | 96 | 50 | 1577 |
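
As a cross-check, the recall, precision and accuracy reported for phase 2 can be recomputed directly from this confusion matrix. The snippet below is a stand-alone sketch that copies the counts from the table above; the variable names are illustrative.

```python
labels = ["TRAVEL", "BUSINESS", "STYLE & BEAUTY"]
# Rows are actual classes, columns are predicted classes (values from the table above).
matrix = [
    [1668,  77,   57],
    [ 117, 856,   48],
    [  96,  50, 1577],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total
print("accuracy", round(accuracy, 6))

recall, precision = {}, {}
for i, label in enumerate(labels):
    actual = sum(matrix[i])                       # TP + FN (row sum)
    predicted = sum(row[i] for row in matrix)     # TP + FP (column sum)
    recall[label] = matrix[i][i] / actual
    precision[label] = matrix[i][i] / predicted
    print(label, "recall", round(recall[label], 6), "precision", round(precision[label], 6))
```

Recall uses the row sums (all articles that actually belong to a class), while precision uses the column sums (all articles predicted as that class).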


<h2 align="center">Prediction on New News Articles</h2> 


In [93]:
test = pd.read_csv('Attachment/test.csv')

In [96]:
testPreprocessor = Preprocessor(test)

In [97]:
classifier2Lemmatized.estimate(testPreprocessor.lemmatizedDf)

In [106]:
predicted = classifier2Lemmatized.testDf.copy()
predicted = predicted.drop(columns = ['headline', 'authors', 'link', 'short_description', 'date','description_and_headline'])

In [108]:
predicted = predicted.rename(columns={"estimated_category": "category"})

In [110]:
predicted.to_csv('output.csv', index=False)

### Questions:

**1- Lemmatization or Stemming?**

The differences between the two methods were described before and are repeated here:
**Stemming** is the process of reducing inflected words to a common stem, so that a group of related words maps to the same stem even if that stem is not a valid word in the language. Stems are created by removing the suffixes or prefixes used with a word.

**Lemmatization**, unlike stemming, reduces inflected words to a root that is guaranteed to belong to the language. This root word is called the lemma: the canonical, dictionary, or citation form of a set of words. Since lemmatization returns a valid word of the language, it is used where valid words are required.

In this project the results of the two methods did not differ much; however, lemmatization worked slightly better.

**2- What is tf-idf and how would it be applied to this project?**

According to Wikipedia, TF-IDF (short for term frequency–inverse document frequency) is a numerical statistic that is intended to reflect how important a word is to a document in a collection. TF-IDF is calculated by multiplying two terms:
- **TF**: The term frequency, i.e. the number of times a term occurs in a document.
- **IDF**: The inverse document frequency of the term across a set of documents.

The TF-IDF equation is as follows:

$TFIDF(t,d,D)=tf(t,d) \cdot idf(t,D)$

For the term frequency $tf(t,d)$, the simplest choice is the raw count of a term in a document, i.e. the number of times that term $t$ occurs in document $d$. Raw counts are what was used in this project, but more advanced strategies are available as well.

The inverse document frequency is a measure of how much information the word provides, meaning how common or rare it is across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient:

$idf(t,D)=\log{\frac{N}{|\{d\in D:t\in d\}|}}$

where:
- $N$ is the total number of documents,
- $D$ is the set of all documents,
- $|\{d \in D: t \in d\}|$ is the number of documents in which the term $t$ appears.

Weighting the word counts with the idf term, as explained above, could potentially have improved the precision of the model.
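
The two formulas can be tried out on a toy corpus. This is a sketch using raw counts for tf and the natural logarithm for idf; the documents and the function names `tf`, `idf` and `tfidf` are made up for illustration and are not used elsewhere in the project.

```python
import math

# Toy corpus of already tokenized documents (made-up for illustration).
documents = [
    ["bank", "profit", "bank"],
    ["island", "hotel", "bank"],
    ["island", "beach"],
]

def tf(term, doc):
    return doc.count(term)                       # raw count of the term in the document

def idf(term, docs):
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)      # assumes the term occurs in at least one document

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("bank", documents[0], documents))    # 2 * log(3/2)
print(tfidf("beach", documents[2], documents))   # 1 * log(3/1)
```

The word "beach" appears in only one document, so it gets a larger idf than "bank", which appears in two of the three.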

**3- Why is a high precision alone not a good performance metric?**

Precision for a class is the fraction of the articles predicted as that class that really belong to it, $\frac{TP}{TP+FP}$. It says nothing about the articles of that class that the model failed to detect (the false negatives). A model can therefore reach a very high precision by predicting a class only when it is very sure, while missing most of the articles that actually belong to it, which gives a low recall.

For example, a spam detector that flags only a handful of the most obvious spam messages can have a **very** high precision, because almost everything it flags is spam, while most spam still goes undetected. For this reason precision should be reported together with recall and accuracy.
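
The spam example can be made concrete with a few made-up labels, taking precision in the classification sense used by the `precision` method of the classifier, $\frac{TP}{TP+FP}$. The data below is invented purely for illustration.

```python
# 1 = spam, 0 = not spam (made-up labels for ten messages).
actual    = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# A cautious detector that flags only one, very obvious, spam message.
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(precision)  # 1.0
print(recall)     # 0.2
print(accuracy)   # 0.6
```

The detector never raises a false alarm, so its precision is perfect, yet it misses four of the five spam messages.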


**4- What would happen if a word is only repeated once in one category?** 

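In this project a word that a category has never seen contributes the very small probability $10^{-8}$ to that category's product, whereas the category that has seen the word once contributes $\frac{1}{N_c}$, where $N_c$ is its total word count. As long as $N_c$ is much smaller than $10^{8}$, the second value is far larger, so there is a big chance that the category which has seen the word is selected for the prediction, even though a single occurrence is weak evidence.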
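
The size of this effect can be estimated in log space. The class size below is a hypothetical round number chosen for illustration, not a value measured from the dataset.

```python
import math

total_words_in_class = 100_000   # hypothetical class size, not measured from the data

seen_once = math.log10(1 / total_words_in_class)   # class that saw the word once
unseen = -8                                        # fallback used for the other classes

print(round(seen_once, 6))            # -5.0
print(round(seen_once - unseen, 6))   # 3.0, a factor of 1000 in favour of the first class
```

Under this assumption a single rare word shifts the score by three orders of magnitude, which can outweigh the evidence from the other words of a short text.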