

---



## **Task: Classifying News Articles using Naive Bayes**

### **1. Create Dataset**

We will use a modified version of the [BBC news dataset](http://mlg.ucd.ie/files/datasets/bbcsport-fulltext.zip) for this task. The zip file containing the raw data is available on Canvas. Download it and make sure the file is accessible from your notebook session.

**Instructions:** 
* Read all the .txt files in the bbc-updated.zip file 
* Text files read from the zip file must be stored in a Pandas dataframe along with the category the news article belongs to
  * You can use the subdirectory names while reading the files to store category names as corresponding targets in the dataframe
* The dataframe should consist of two columns: 'Text' and 'Category'. The ***Text*** column is the attribute (the input) and ***Category*** is the corresponding target.
* Use ```pandas.DataFrame.shape``` to print the size of the dataframe after the dataframe is created (useful in verifying all the text files are read)
* You can also use ```pandas.DataFrame.head(n)``` to view the first 'n' examples in the dataframe (useful for verifying that the raw data has been processed as expected)

Store the full contents of each .txt file in the dataframe: 'Text' holds the article content and 'Category' holds the name of the subdirectory the file was read from.


In [17]:
# add your code below this comment
import glob
import os

import pandas as pd

path = 'D:/Github/5290/Lab 4/bbc'

# collect one record per article, then build the dataframe once
rows = []
for file in glob.glob(f"{path}/*/*.txt"):
    with open(file, 'r') as f:
        # join the paragraphs of the article into a single string
        whole_text = f.read().replace('\n\n', '. ')
    # the parent directory name is the category (works on any OS)
    category = os.path.basename(os.path.dirname(file))
    rows.append({'Text': whole_text, 'Category': category})
df = pd.DataFrame(rows, columns=['Text', 'Category'])

# shuffle the rows and discard the old index so only two columns remain
df = df.sample(frac=1)
df = df.reset_index(drop=True)
print(df.shape)
print(df.head())

(2224, 2)
                                                Text       Category
0  Mansfield 0-1 Leyton Orient. An second-half go...          sport
1  More reforms ahead says Milburn. Labour will c...       politics
2  Georgia plans hidden asset pardon. Georgia is ...       business
3  Gatlin and Hayes win Owen awards. American Oly...          sport
4  Charity single for quake relief. Singers inclu...  entertainment
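
The loop above reads from an already extracted folder. Since the instructions mention reading from the zip file itself, the following is a minimal sketch of that variant using the standard `zipfile` module. The helper name `read_zip_to_dataframe`, the in-memory demo archive, and the UTF-8 decoding with `errors='replace'` are assumptions made for illustration; they are not part of the assignment.

```python
import io
import zipfile
from pathlib import PurePosixPath

import pandas as pd


def read_zip_to_dataframe(zip_source):
    """Read every .txt member of a zip archive into a Text/Category dataframe."""
    rows = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith('.txt'):
                continue
            # zip member names always use '/', so the parent folder is the category
            category = PurePosixPath(name).parent.name
            # assumption: decode as UTF-8 and replace any undecodable bytes
            text = archive.read(name).decode('utf-8', errors='replace')
            rows.append({'Text': text.replace('\n\n', '. '), 'Category': category})
    return pd.DataFrame(rows, columns=['Text', 'Category'])


# Tiny in-memory archive with made-up contents, standing in for the real zip file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('bbc/sport/001.txt', 'Title one\n\nBody one')
    archive.writestr('bbc/tech/001.txt', 'Title two\n\nBody two')
    archive.writestr('bbc/README.TXT', 'not an article')

demo_df = read_zip_to_dataframe(buffer)
print(demo_df.shape)
print(demo_df)
```

With the real data, passing the path of the downloaded zip file instead of `buffer` should work the same way, provided the articles sit one folder below their category directory.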


**Instructions:**
* Once the dataframe is created, split the data into two sets: (1) train set and (2) test set
* Split 30% of the data as test set
* Use ```sklearn.model_selection.train_test_split()``` for easy splitting of data
    * Set ```random_state``` parameter value to **237** to ensure reproducible results
* Use ```pandas.DataFrame.shape``` to print the sizes of the train and test sets after splitting the data

Reference documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [22]:
# add your code below this comment
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, random_state=237, test_size=0.3, shuffle=True)
# report the sizes of the two splits
print(train.shape)
print(test.shape)

### **2. Feature Extraction (Prepare Inputs)**

**Question 1(a): Given the vocabulary *V*, what is the corresponding BoW representation for sentence *S* using the word absence/presence measure?**

***V: { fishing, likes, at, too, Beth, family, campfire, Adam, beach, the, vacation, lake, enjoys, likes }***

***S: "Adam enjoys fishing at the lake. Beth likes fishing too."***

**Instructions:** This question is to be answered without using any code. **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 1(a):** Type your answer here!

**Question 1(b): Given the same vocabulary *V*, what is the corresponding BoW representation for sentence *S* using the term frequency measure?**

***V: { fishing, likes, at, too, Beth, family, campfire, Adam, beach, the, vacation, lake, enjoys, likes }***

***S: "Adam enjoys fishing at the lake. Beth likes fishing too."***

**Instructions:** This question is to be answered without using any code. **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 1(b):** Type your answer here!



---



**Question 2: Given the same vocabulary (as in Questions 1(a) and 1(b)), write the BoW representations for the following sentence S using both measures (word presence/absence and term frequency):**

***S: He wanted to bring peace to his kingdom but his enemies killed him.***

**Instructions:** This question is to be answered without using any code. **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 2:** Type your answer here!



---



**Question 3: Consider the following code snippet:**

```
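from sklearn.feature_extraction.text import CountVectorizer

bow_vect = CountVectorizer()

some_corpus = [
     'It is raining heavily today.',
     'And the weather is unpredictable.',
     'Weather forecast for tomorrow says sunny.',
     'Is it raining heavily today?',
]

res_data = bow_vect.fit_transform(some_corpus)

# note: scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there
bow_vect.get_feature_names()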
>>> ['and', 'for', 'forecast', 'heavily', 'is', 'it', 'raining', 'says', 'sunny', 'the', 'today', 'tomorrow', 'unpredictable', 'weather']


res_data.toarray()
>>> array([ [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
            [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
            [0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
            [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0] ])

```

**In the above snippet, '>>>' indicates the output corresponding to the previous line of code. Something interesting happens when the sentences are converted to their BoW representations. Can you identify it? Once you have, describe what might be affected when the encoded representations shown in the snippet are used.**

**Instructions:** You do not have to run the above code again, since the outputs you need to answer the question are already provided.


**Answer for Question 3:** Type your answer here!



---



**Question 4: For the given task, use the term frequency measure to compute the BoW representations for the text documents in the modified version of the BBC dataset.**

**Instructions:** 
* This question is to be answered by using scikit-learn (similar to what was demonstrated in the tutorial section)
* Create BoW representations for train and test sets.
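
As a rough illustration of the steps above, here is a minimal sketch on a made-up mini corpus; it is not the solution for the BBC data. The texts and the variable names `train_texts`, `test_texts`, `X_train_demo`, and `X_test_demo` are assumptions for this example. The key point is that the vocabulary is learned from the training texts only and then reused for the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus standing in for the 'Text' column of the train/test splits
train_texts = ['the match ended in a draw', 'shares fell as the market closed']
test_texts = ['the market reacted to the match']

# With default settings, CountVectorizer stores raw term counts (term frequency)
vectorizer = CountVectorizer()

# Learn the vocabulary from the training texts only, then encode them
X_train_demo = vectorizer.fit_transform(train_texts)

# Reuse the same vocabulary for the test texts; words not seen in training are ignored
X_test_demo = vectorizer.transform(test_texts)

print(X_train_demo.shape, X_test_demo.shape)
print(X_test_demo.toarray())
```

Calling `fit_transform` on the test set instead would build a different vocabulary, so the columns of the two matrices would no longer line up.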

In [4]:
# add your code below this comment



### **3. Prepare Outputs/Labels**

**Instructions:**
* Check to see how the outputs are in the data
* If categories are non-numeric, then encode them as numeric labels (similar to what was discussed in the tutorial demonstration)
* You will have to perform encoding for targets in both train and test sets
* Make sure to perform the same encoding that was done for the targets in the training set when encoding the targets in the test set.
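
One way to keep the train and test encodings consistent is to fit a single `LabelEncoder` on the training targets and reuse it for the test targets. The sketch below uses made-up label lists (`train_labels`, `test_labels`) purely for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category labels standing in for the 'Category' column of each split
train_labels = ['sport', 'business', 'tech', 'sport']
test_labels = ['tech', 'business']

encoder = LabelEncoder()

# Learn the category-to-integer mapping from the training targets
y_train_demo = encoder.fit_transform(train_labels)

# Apply the same mapping to the test targets (no refitting)
y_test_demo = encoder.transform(test_labels)

print(list(encoder.classes_))
print(y_train_demo, y_test_demo)
```

`encoder.inverse_transform` maps the integer labels back to category names, which can help when reading a confusion matrix.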

In [5]:
# add your code below this comment



### **4. Model Training and Evaluation**

**Question 5: Train a Naive Bayes model using the training set. Once the model is trained, apply the model to the test set and evaluate its performance by calculating the accuracy and generating the confusion matrix.**

**Instructions:**
* Again, this section is very similar to what was demonstrated in the tutorial section.
* Create a Multinomial Naive Bayes classifier (model).
* Train the model using the data in the training set.
* Apply the model on the test set data to get the predictions made by the model.
* Once the predictions are generated, use ```sklearn.metrics.accuracy_score()``` and ```sklearn.metrics.confusion_matrix()``` as metrics to report your model performance. 
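
The workflow described above could look roughly like the following sketch. It runs on a handful of made-up sentences with two classes, so the numbers it prints say nothing about the BBC task; all texts, labels, and variable names here are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data: 0 = business, 1 = sport
train_texts = [
    'the striker scored a late goal',
    'the team won the cup final',
    'shares fell as profits dropped',
    'the bank raised interest rates',
]
train_labels = [1, 1, 0, 0]
test_texts = ['the team scored a goal', 'profits at the bank fell']
test_labels = [1, 0]

# Term-frequency BoW features, with the vocabulary learned from the training texts
vectorizer = CountVectorizer()
X_train_demo = vectorizer.fit_transform(train_texts)
X_test_demo = vectorizer.transform(test_texts)

# Train a Multinomial Naive Bayes classifier and predict on the test features
model = MultinomialNB()
model.fit(X_train_demo, train_labels)
predictions = model.predict(X_test_demo)

# Rows of the confusion matrix are true labels, columns are predicted labels
print(accuracy_score(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))
```

Both metric functions take the true labels first and the predictions second; swapping them leaves the accuracy unchanged but transposes the confusion matrix.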

In [6]:
# add your code below this comment



### **Establishing a baseline model**

Baseline models are helpful for easy comparison of the models you build. These models are trained using simple heuristics or rules.

**Instructions:**
* All you have to do is run the following block of code. Report the accuracy of your model (the Naive Bayes one) in comparison to the baseline model created in the following code block

In [7]:
# Baselines are simple heuristics to make predictions for a given task
# just execute this code block; nothing needs to be added/modified
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
# choose 'most-frequent class' as the baseline method
baseline_model = DummyClassifier(strategy="most_frequent")

# fit the baseline model on the training data
baseline_model.fit(X_train, y_train)

# make predictions on the test data using the created baseline model
baseline_preds = baseline_model.predict(X_test)

# compute the accuracy of the baseline model
print(accuracy_score(y_test, baseline_preds))

NameError: name 'X_train' is not defined

**Report the accuracies for the baseline and NB models here. Type your answer below! Indicate clearly the numbers corresponding to the models.**



---



# **References**

* D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
* [Datasets (scikit-learn)](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets)
* [All about Naive Bayes (from a scikit-learn perspective)](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [Multinomial Naive Bayes API](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn-naive-bayes-multinomialnb)
* [Evaluation metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)
* [Feature extraction module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction)
* [Transforming prediction targets using LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)


