# Artificial Intelligence - CA3
## Hossein Soltanloo - 810195407

#### Cleaning the Data
To clean the raw text used for training, we follow these steps:
1. Read the raw data file.
2. Lowercase all letters. Otherwise the same word would appear in the dictionary under several spellings (for example `Travel` and `travel`), which splits its count and distorts its probability.
3. Replace numbers with blank spaces.
4. Replace punctuation with blank spaces.
5. Tokenize the text into words and drop English stop words.
6. Lemmatize the tokens. Lemmatization groups the inflected forms of a word (for example "hotels" and "hotel") so that they are processed as one item. Stemming is a simpler alternative that chops off suffixes; lemmatization instead uses a lexical knowledge base (WordNet here) together with part-of-speech tags to find the correct base form. Either technique groups similar words together, which gives better probability estimates and eventually better accuracy.

These steps keep only the words and discard punctuation and numbers, because only the words carry value for classification.

In [1]:
import string
import nltk
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
import re
stop_words = set(stopwords.words("english"))

def tagger(nltk_tag):
  if nltk_tag.startswith('J'):
    return wordnet.ADJ
  elif nltk_tag.startswith('V'):
    return wordnet.VERB
  elif nltk_tag.startswith('N'):
    return wordnet.NOUN
  elif nltk_tag.startswith('R'):
    return wordnet.ADV
  else:
    return None

def preprocess(text):
    new_text = str(text).lower()
    new_text = new_text.strip()
    new_text = re.sub(r"\d+", " ", new_text)
    for ch in string.punctuation:
        new_text = new_text.replace(ch, " ")
    new_text = ' '.join(new_text.split())
    nt = [w for w in new_text.split() if w not in stop_words]
#     stemmer = PorterStemmer()
#     stemmed_nt = [stemmer.stem(w) for w in nt]
    nltk_tagged = nltk.pos_tag(nt)
    wn_tagged = map(lambda x: (x[0], tagger(x[1])), nltk_tagged)
    lemmatized_nt = []
    lemmatizer = WordNetLemmatizer()
    for w, tag in wn_tagged:
        if tag is None:
            lemmatized_nt.append(w)
        else:
            lemmatized_nt.append(lemmatizer.lemmatize(w, tag))
    return ' '.join(lemmatized_nt)

dataset = pd.read_csv('data.csv')
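# Fill missing descriptions with the headline (plain assignment instead of a chained inplace fillna)
dataset['short_description'] = dataset['short_description'].fillna(dataset['headline'])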
dataset['short_description'] = dataset['short_description'].apply(preprocess)
dataset['headline'] = dataset['headline'].apply(preprocess)
dataset['authors'] = dataset['authors'].apply(preprocess)
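
To make steps 2 to 5 concrete, here is a minimal standard-library sketch of the lowercasing, number and punctuation replacement, and tokenization. The function name `basic_clean` and the sample headline are made up for illustration. Stop-word removal and lemmatization are left out because they need the NLTK corpora.

```python
import re
import string


def basic_clean(text):
    """Lowercase, blank out digits and punctuation, then split on whitespace."""
    lowered = str(text).lower()
    no_digits = re.sub(r"\d+", " ", lowered)
    # Map every ASCII punctuation character to a space in one pass.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    no_punct = no_digits.translate(table)
    return no_punct.split()


# Hypothetical headline, not taken from the dataset.
tokens = basic_clean("Top 10 Hotels in Paris, France!")
print(tokens)  # ['top', 'hotels', 'in', 'paris', 'france']
```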

The next step is to split the dataset into a training set, used to build the model, and an evaluation set, used to check how well the model performs. We use 80% of the data for training and 20% for evaluation. The split is done within each category rather than on the dataset as a whole, so that each category keeps the same share in both sets.

In [2]:
def split_data(df, frac=0.8, seed=200):
    travel_df = df[df['category'] == "TRAVEL"]
    business_df = df[df['category'] == "BUSINESS"]
    style_df = df[df['category'] == "STYLE & BEAUTY"]
    travel_train = travel_df.sample(frac=frac,random_state=seed)
    travel_test = travel_df.drop(travel_train.index)
    business_train = business_df.sample(frac=frac,random_state=seed)
    business_test = business_df.drop(business_train.index)
    style_train = style_df.sample(frac=frac,random_state=seed)
    style_test = style_df.drop(style_train.index)
    return business_test, business_train, style_test, style_train, travel_test, travel_train
business_test, business_train, style_test, style_train, travel_test, travel_train = split_data(dataset)
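
The idea behind the per-category split can also be shown without pandas. The sketch below is an illustration under assumed names (`stratified_split` and the toy `labels` list are not part of the notebook): it samples the same fraction from each category and returns disjoint index lists.

```python
import random
from collections import defaultdict


def stratified_split(labels, frac=0.8, seed=200):
    """Return (train_idx, test_idx), sampling `frac` of each label's rows."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        n_train = round(len(indices) * frac)
        chosen = set(rng.sample(indices, n_train))
        train_idx.extend(i for i in indices if i in chosen)
        test_idx.extend(i for i in indices if i not in chosen)
    return train_idx, test_idx


# Toy data: 10 travel items and 5 business items.
labels = ["TRAVEL"] * 10 + ["BUSINESS"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 12 3
```

Both categories keep their 2:1 ratio in the training set (8 travel and 4 business items), which is the property the per-category split is meant to preserve.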

#### Phase 1
The initial step in this phase is to extract all the words in our dataset. I use a combination of `headline`, `authors`, and `short_description` in order to reach a better accuracy in the model.

In [3]:
all_words = set()

def extract_words(df):
    word_set = set()
    for index, row in df.iterrows():
        words = row['short_description'].split() + row['headline'].split() + row['authors'].split()
        for w in words:
            word_set.add(w)
    return word_set
all_words = all_words.union(extract_words(travel_train))
all_words = all_words.union(extract_words(business_train))

Then we count how many times each word in the dictionary occurs in each category. These counts are used to calculate the conditional probability of each word.

In [4]:
def extract_occurences(df):
    extraced = {}
    for index, row in df.iterrows():
        words = row['short_description'].split() + row['headline'].split() + row['authors'].split()
        for w in words:
            if w in extraced:
                extraced[w] = extraced[w] + 1
            else:
                extraced[w] = 1

    return extraced

travel_dict = extract_occurences(travel_train)
business_dict = extract_occurences(business_train)
travel_num_of_occurences = sum(travel_dict.values())
business_num_of_occurences = sum(business_dict.values())
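
The same counting can be written more compactly with `collections.Counter`. This is only a sketch on made-up documents; `count_words` and `toy_travel` are illustrative names, not part of the notebook.

```python
from collections import Counter


def count_words(documents):
    """Count how often each whitespace-separated token appears across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts


toy_travel = ["beach hotel paris", "cheap flight paris"]
toy_counts = count_words(toy_travel)
total_tokens = sum(toy_counts.values())
print(toy_counts["paris"], total_tokens)  # 2 6
```

A `Counter` returns 0 for words it has never seen, so no explicit membership check is needed when looking up a count.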

Here we calculate the class prior probabilities and the likelihoods. The priors, `P_travel` and `P_business`, are the probabilities that a news item in the dataset belongs to the Travel or the Business category. Each is calculated by dividing the number of training items in that category by the total number of training items.\
The likelihood is the conditional probability of a word occurring in a given category. It is calculated with add-one (Laplace) smoothing: the number of times the word occurs in the category plus one, divided by the sum of the total number of word occurrences in that category and the number of words in the dictionary.\
The predictor prior probability (the evidence) does not need to be calculated, because it is the same for every category and therefore does not change which posterior is largest.

In [5]:
P_travel = travel_train.shape[0] / (travel_train.shape[0] + business_train.shape[0])
P_business = business_train.shape[0] / (travel_train.shape[0] + business_train.shape[0])

P_words_in_travel = {}
P_words_in_business = {}
for word in all_words:
    P_words_in_travel[word] = 1 / (travel_num_of_occurences + len(all_words))
    P_words_in_business[word] = 1 / (business_num_of_occurences + len(all_words))

for key, value in travel_dict.items():
    P_words_in_travel[key] = (value + 1) / (travel_num_of_occurences + len(all_words))

for key, value in business_dict.items():
    P_words_in_business[key] = (value + 1) / (business_num_of_occurences + len(all_words))
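
A small worked example of the smoothed likelihood may help. The sketch below uses a made-up four-word vocabulary and made-up counts; `smoothed_likelihoods` is an illustrative helper that follows the same formula as the cell above.

```python
def smoothed_likelihoods(counts, vocabulary):
    """Add-one (Laplace) smoothed P(word | category) for every vocabulary word."""
    total = sum(counts.values())
    denom = total + len(vocabulary)
    return {w: (counts.get(w, 0) + 1) / denom for w in vocabulary}


# Toy vocabulary and counts, chosen only for illustration.
vocab = {"paris", "hotel", "market", "stock"}
travel_counts = {"paris": 3, "hotel": 2}
p_travel_words = smoothed_likelihoods(travel_counts, vocab)

print(p_travel_words["paris"])   # (3 + 1) / (5 + 4) = 4/9
print(p_travel_words["market"])  # (0 + 1) / (5 + 4) = 1/9
```

Words that never occur in the category still get a small non-zero probability, and the probabilities over the whole vocabulary still sum to 1.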

The final step is to calculate the posterior probabilities for each news item in the evaluation set. Naive Bayes assumes that all features are independent of each other given the category, so for each category we multiply the prior by the conditional probabilities of all the words in the item. The predicted class is the category with the highest resulting value.

In [6]:
def classify(row):
    text = row[1] + " " + row[4] + " " + row[6]
    known_tokens = [w for w in text.split() if w in all_words]
    travel_cat_prob = P_travel
    business_cat_prob = P_business
    for token in known_tokens:

        travel_cat_prob = travel_cat_prob * P_words_in_travel[token]
        business_cat_prob = business_cat_prob * P_words_in_business[token]

    if business_cat_prob > travel_cat_prob:
        return "BUSINESS"
    else:
        return "TRAVEL"

travel_test['classified_as'] = travel_test.apply(lambda x: classify(x), axis=1)
business_test['classified_as'] = business_test.apply(lambda x: classify(x), axis=1)
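
One caveat of multiplying many small probabilities is that the product may underflow to `0.0` on long texts, at which point the categories can no longer be compared. A common alternative, not used in this notebook, is to add logarithms instead. The sketch below uses made-up likelihoods and an illustrative helper `log_score` to show the difference.

```python
import math


def log_score(tokens, log_prior, log_likelihood):
    """Sum of log P(category) and log P(token | category) over known tokens."""
    return log_prior + sum(log_likelihood[t] for t in tokens if t in log_likelihood)


# Toy numbers chosen only for illustration.
likelihood_a = {"w": 1e-5}
likelihood_b = {"w": 2e-5}
tokens = ["w"] * 100

# Direct multiplication underflows to 0.0 for both categories.
prod_a = prod_b = 0.5
for t in tokens:
    prod_a *= likelihood_a[t]
    prod_b *= likelihood_b[t]

# The same comparison in log space stays finite and keeps the ordering.
log_a = log_score(tokens, math.log(0.5), {k: math.log(v) for k, v in likelihood_a.items()})
log_b = log_score(tokens, math.log(0.5), {k: math.log(v) for k, v in likelihood_b.items()})

print(prod_a, prod_b)  # 0.0 0.0
print(log_b > log_a)   # True
```

Because the logarithm is monotonic, the category with the largest log score is also the one with the largest product, so the decision rule is unchanged.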

All the evaluation metrics are calculated below:

In [7]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
classified_tests = pd.concat([travel_test, business_test])
correct_detected_travel = travel_test[travel_test['classified_as'] == travel_test['category']].shape[0]
correct_detected_business = business_test[business_test['classified_as'] == business_test['category']].shape[0]
travel_count = travel_test.shape[0]
business_count = business_test.shape[0]
detected_travel = classified_tests[classified_tests['classified_as'] == "TRAVEL"].shape[0]
detected_business = classified_tests[classified_tests['classified_as'] == "BUSINESS"].shape[0]

recall_travel = correct_detected_travel / travel_count
recall_business = correct_detected_business / business_count

precision_travel = correct_detected_travel / detected_travel
precision_business = correct_detected_business / detected_business

accuracy = (correct_detected_travel + correct_detected_business) / classified_tests.shape[0]
print(recall_travel, recall_business)
print(precision_travel, precision_business)
print(accuracy)

0.9758426966292135 0.9438727782974743
0.9666110183639399 0.9591254752851711
0.9638469638469639


<table align="center">
  <tr align="center">
    <th>Phase 1</th>
    <th>Travel</th>
    <th>Business</th>
  </tr>
  <tr align="center">
    <td>Recall</td>
    <td>0.9758426966292135</td>
    <td>0.9438727782974743</td>
  </tr>
  <tr align="center">
    <td>Precision</td>
    <td>0.9666110183639399</td>
    <td>0.9591254752851711</td>
  </tr>
  <tr align="center">
    <td>Accuracy</td>
    <td align="center" colspan="2">0.9638469638469639</td>
  </tr>
</table>


#### Phase 2
We repeat all the above steps in order to train a new model based on all three categories:

In [8]:
all_words = all_words.union(extract_words(style_train))
style_dict = extract_occurences(style_train)
style_num_of_occurences = sum(style_dict.values())

P_style = style_train.shape[0] / (style_train.shape[0] + business_train.shape[0] + travel_train.shape[0])
P_travel = travel_train.shape[0] / (style_train.shape[0] + business_train.shape[0] + travel_train.shape[0])
P_business = business_train.shape[0] / (style_train.shape[0] + business_train.shape[0] + travel_train.shape[0])

P_words_in_travel = {}
P_words_in_business = {}
P_words_in_style = {}
for word in all_words:
    P_words_in_travel[word] = 1 / (travel_num_of_occurences + len(all_words))
    P_words_in_business[word] = 1 / (business_num_of_occurences + len(all_words))
    P_words_in_style[word] = 1 / (style_num_of_occurences + len(all_words))

for key, value in travel_dict.items():
    P_words_in_travel[key] = (value + 1) / (travel_num_of_occurences + len(all_words))

for key, value in business_dict.items():
    P_words_in_business[key] = (value + 1) / (business_num_of_occurences + len(all_words))

for key, value in style_dict.items():
    P_words_in_style[key] = (value + 1) / (style_num_of_occurences + len(all_words))
    
def new_classify(row):
    text = row[1] + " " + row[4] + " " + row[6]
    known_tokens = [w for w in text.split() if w in all_words]
    travel_cat_prob = P_travel
    business_cat_prob = P_business
    style_cat_prob = P_style
    for token in known_tokens:
        style_cat_prob = style_cat_prob * P_words_in_style[token]
        travel_cat_prob = travel_cat_prob * P_words_in_travel[token]
        business_cat_prob = business_cat_prob * P_words_in_business[token]

    if business_cat_prob > travel_cat_prob and business_cat_prob > style_cat_prob:
        return "BUSINESS"
    if travel_cat_prob > business_cat_prob and travel_cat_prob > style_cat_prob:
        return "TRAVEL"
    if style_cat_prob > business_cat_prob and style_cat_prob > travel_cat_prob:
        return "STYLE & BEAUTY"

travel_test['classified_as'] = travel_test.apply(lambda x: new_classify(x), axis=1)
business_test['classified_as'] = business_test.apply(lambda x: new_classify(x), axis=1)
style_test['classified_as'] = style_test.apply(lambda x: new_classify(x), axis=1)
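
Note that `new_classify` uses strict `>` comparisons, so when the two highest scores are exactly equal none of the three branches fires and the function implicitly returns `None`. A minimal sketch of a tie-safe selection, using the illustrative helper `pick_category` and made-up scores, is:

```python
def pick_category(scores):
    """Return the label with the highest score; on ties, the first key wins."""
    return max(scores, key=scores.get)


# Toy scores, chosen only for illustration.
clear = pick_category({"TRAVEL": 0.2, "BUSINESS": 0.5, "STYLE & BEAUTY": 0.3})
tied = pick_category({"TRAVEL": 0.0, "BUSINESS": 0.0, "STYLE & BEAUTY": 0.0})
print(clear)  # BUSINESS
print(tied)   # TRAVEL
```

Which category should win a tie is a design choice; here it is simply the first key in the dictionary.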


Then the evaluation metrics are calculated:

In [9]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
classified_tests = pd.concat([travel_test, business_test, style_test])
correct_detected_travel = travel_test[travel_test['classified_as'] == travel_test['category']].shape[0]
correct_detected_business = business_test[business_test['classified_as'] == business_test['category']].shape[0]
correct_detected_style = style_test[style_test['classified_as'] == style_test['category']].shape[0]
travel_count = travel_test.shape[0]
business_count = business_test.shape[0]
style_count = style_test.shape[0]
detected_travel = classified_tests[classified_tests['classified_as'] == "TRAVEL"].shape[0]
detected_business = classified_tests[classified_tests['classified_as'] == "BUSINESS"].shape[0]
detected_style = classified_tests[classified_tests['classified_as'] == "STYLE & BEAUTY"].shape[0]

recall_travel = correct_detected_travel / travel_count
recall_business = correct_detected_business / business_count
recall_style = correct_detected_style / style_count

precision_travel = correct_detected_travel / detected_travel
precision_business = correct_detected_business / detected_business
precision_style = correct_detected_style / detected_style

accuracy = (correct_detected_travel + correct_detected_business + correct_detected_style) / classified_tests.shape[0]

print(recall_travel, recall_business, recall_style)
print(precision_travel, precision_business, precision_style)
print(accuracy)

0.9651685393258427 0.9242282507015903 0.9550949913644214
0.9403393541324576 0.940952380952381 0.970743124634289
0.9518098560837331


<table align="center">
  <tr align="center">
    <th>Phase 2</th>
    <th>Travel</th>
    <th>Business</th>
    <th>Style & Beauty</th>
  </tr>
  <tr align="center">
    <td>Recall</td>
    <td>0.9651685393258427</td>
    <td>0.9242282507015903</td>
    <td>0.9550949913644214</td>
  </tr>
  <tr align="center">
    <td>Precision</td>
    <td>0.9403393541324576</td>
    <td>0.940952380952381</td>
    <td>0.970743124634289</td>
  </tr>
  <tr align="center">
    <td>Accuracy</td>
    <td align="center" colspan="3">0.9518098560837331</td>
  </tr>
</table>

Note: As there has been no huge difference between the percentages, I decided not to do any oversampling.


#### Confusion Matrix


In [10]:
ss = style_test[style_test['classified_as'] == "STYLE & BEAUTY"].shape[0]
st = style_test[style_test['classified_as'] == "TRAVEL"].shape[0]
sb = style_test[style_test['classified_as'] == "BUSINESS"].shape[0]
tt = travel_test[travel_test['classified_as'] == "TRAVEL"].shape[0]
tb = travel_test[travel_test['classified_as'] == "BUSINESS"].shape[0]
ts = travel_test[travel_test['classified_as'] == "STYLE & BEAUTY"].shape[0]
bb = business_test[business_test['classified_as'] == "BUSINESS"].shape[0]
bt = business_test[business_test['classified_as'] == "TRAVEL"].shape[0]
bs = business_test[business_test['classified_as'] == "STYLE & BEAUTY"].shape[0]
print(tt, tb, ts)
print(bt, bb, bs)
print(st, sb, ss)

1718 37 25
56 988 25
53 25 1659


The confusion matrix is a simple way to inspect the accuracy and correctness of a model. It is not a performance measure in itself, but almost all performance metrics are based on the numbers inside it, and all of the metrics above can be calculated from it. In the matrix below, each row is an actual category and each column is a predicted category, following the order of the `print` calls. The diagonal holds the number of correct classifications, and every other cell counts a misclassification. For a given category, the diagonal cell holds its true positives, the other cells in its row are false negatives (items of that category that received another label), the other cells in its column are false positives (items of other categories that received its label), and all remaining cells are its true negatives.

<table>
    <tr align="center">
        <td></td>
        <td colspan="4">Predicted</td>
    </tr>
    <tr>
        <td rowspan="4">Actual</td>
        <td></td>
        <td>TRAVEL</td>
        <td>BUSINESS</td>
        <td>STYLE & BEAUTY</td>
    </tr>
    <tr>
        <td>TRAVEL</td>
        <td>1718</td>
        <td>37</td>
        <td>25</td>
    </tr>
    <tr>
        <td>BUSINESS</td>
        <td>56</td>
        <td>988</td>
        <td>25</td>
    </tr>
    <tr>
        <td>STYLE & BEAUTY</td>
        <td>53</td>
        <td>25</td>
        <td>1659</td>
    </tr>

</table>
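
As a check that the Phase 2 metrics really follow from this matrix, the sketch below recomputes them from the printed counts. It reads each row as an actual category and each column as a predicted category, matching the order of the `print` calls above. The helper names are illustrative.

```python
labels = ["TRAVEL", "BUSINESS", "STYLE & BEAUTY"]
# Rows are actual categories, columns are predicted categories,
# in the same order as the print calls above.
matrix = [
    [1718, 37, 25],
    [56, 988, 25],
    [53, 25, 1659],
]


def recall(m, i):
    """Correct predictions for class i divided by all actual items of class i."""
    return m[i][i] / sum(m[i])


def precision(m, i):
    """Correct predictions for class i divided by all items predicted as class i."""
    return m[i][i] / sum(row[i] for row in m)


def accuracy(m):
    """Sum of the diagonal divided by the total number of items."""
    correct = sum(m[i][i] for i in range(len(m)))
    return correct / sum(sum(row) for row in m)


for i, label in enumerate(labels):
    print(label, round(recall(matrix, i), 4), round(precision(matrix, i), 4))
print("accuracy", round(accuracy(matrix), 4))
```

The results agree with the Phase 2 table: for example, Travel recall is 1718 / 1780 and Travel precision is 1718 / 1827.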




The final classification of the test file is done with the following code:

In [11]:
test_dataset = pd.read_csv('test.csv')
test_dataset.dropna(subset=['short_description'], inplace=True)
test_dataset['short_description'] = test_dataset['short_description'].apply(preprocess)
test_dataset['headline'] = test_dataset['headline'].apply(preprocess)
test_dataset['authors'] = test_dataset['authors'].apply(preprocess)
test_dataset.head(10)

def final_classify(row):
    text = row[1] + " " + row[2] + " " + row[4]
    known_tokens = [w for w in text.split() if w in all_words]
    travel_cat_prob = P_travel
    business_cat_prob = P_business
    style_cat_prob = P_style
    for token in known_tokens:
        style_cat_prob = style_cat_prob * P_words_in_style[token]
        travel_cat_prob = travel_cat_prob * P_words_in_travel[token]
        business_cat_prob = business_cat_prob * P_words_in_business[token]

    if business_cat_prob > travel_cat_prob and business_cat_prob > style_cat_prob:
        return "BUSINESS"
    if travel_cat_prob > business_cat_prob and travel_cat_prob > style_cat_prob:
        return "TRAVEL"
    if style_cat_prob > business_cat_prob and style_cat_prob > travel_cat_prob:
        return "STYLE & BEAUTY"

test_dataset['category'] = test_dataset.apply(lambda x: final_classify(x), axis=1)
test_dataset[['index', 'category']].to_csv('output.csv')

#### Questions
1. Lemmatization groups the different inflected forms of a word so that they are processed as one item. Stemming does the same job more crudely: it simply chops off suffixes, whereas lemmatization uses a lexical knowledge base to find the correct base form of each word. Stemming is faster and simpler because it does not check whether the resulting string is a real word. Either technique is good practice, because grouping similar words together gives better probability estimates and eventually better accuracy. I also tried a stemmer and there was no large difference in accuracy; lemmatization worked slightly better (about 0.1%) for me.
2. A problem with raw word counts is that very frequent words dominate with large scores even though they may carry less information for the model than rarer, more domain-specific words. TF-IDF is a score that highlights how relevant each word is to a document within the whole collection. It is the product of two terms: TF = (number of occurrences of the word in a document) / (number of words in the document), and IDF = log\[(number of documents) / (number of documents containing the word)\]. If a word occurs often but is present in most documents, its IDF is low and it is not very important. If a word appears in few documents but occurs often in them, it has a high IDF and is likely to be more helpful. So instead of raw word counts we can use the TF-IDF values in the Naive Bayes calculations, which gives higher weight to the words that are more important for a certain category.
3. If we have high precision but low recall, we may be missing many items of the class we need to identify. There are few false positives, so most of the items we label as positive really are positive, but we miss many of the actual positives, which leads to low recall. For example, suppose it is important not to miss spam emails. A classifier can reach high precision by labeling only a few emails as spam, but it will then have low recall because of the large number of false negatives. This hurts whenever the classification must be sensitive to the positive class.
4. This way, there is no difference between a word that does not occur in a category but occurs in other categories and a word that has occurred in it only once. Tabriz is more related to the Travel category, but we would not be giving it any importance. Because I have used multinomial Naive Bayes, the two conditions are treated the same, so the word is handled as if it did not belong to the Travel category. But if we ignore that a word is not in a category and still multiply the Travel score by its probability, this lowers the Travel probability relative to the others, so Travel becomes less likely and the item ends up as a false negative.
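
The TF and IDF formulas from answer 2 can be sketched as follows. The three tiny documents are made up for illustration, the helper names `tf` and `idf` are hypothetical, and the natural logarithm is assumed as the base.

```python
import math


def tf(word, doc_tokens):
    """Term frequency: occurrences of `word` divided by the document length."""
    return doc_tokens.count(word) / len(doc_tokens)


def idf(word, docs):
    """Inverse document frequency: log(N / number of documents containing `word`).

    Assumes the word appears in at least one document.
    """
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)


# Toy corpus, chosen only for illustration.
docs = [
    ["new", "hotel", "paris"],
    ["new", "stock", "market"],
    ["new", "fashion", "paris"],
]
tfidf_new = tf("new", docs[0]) * idf("new", docs)
tfidf_hotel = tf("hotel", docs[0]) * idf("hotel", docs)
print(tfidf_new)    # 0.0, because "new" appears in every document
print(tfidf_hotel)  # (1/3) * log(3), roughly 0.366
```

The word "new" has the same term frequency as "hotel" in the first document, but its TF-IDF is zero because it appears everywhere and so does not help distinguish the categories.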