# Homework 1: Preprocessing and Text Classification

Student Name: Yue Bao

Student ID: 1011641

# General Info

<b>Due date</b>: Sunday, 5 Apr 2020 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day (both week and weekend days counted)

<b>Marks</b>: 10% of mark for class (with 9% on correctness + 1% on quality and efficiency of your code)

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/17601/pages/using-jupyter-notebook-and-python?module_item_id=1678430) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

To familiarize yourself with NLTK, here is a free online book:  Steven Bird, Ewan Klein, and Edward Loper (2009). <a href=http://nltk.org/book>Natural Language Processing with Python</a>. O'Reilly Media Inc. You may also consult the <a href=https://www.nltk.org/api/nltk.html>NLTK API</a>.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.

You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

# Overview

In this homework, you'll be working with a collection of tweets. The task is to classify whether a tweet constitutes a rumour event. This homework involves writing code to preprocess data and perform text classification.

# 1. Preprocessing (5 marks)

**Instructions**: Run the code below to download the tweet corpus for the assignment. Note: the download may take some time. **No implementation is needed.**

In [2]:
import requests
import os
from pathlib import Path

fname = 'rumour-data.tgz'
data_dir = os.path.splitext(fname)[0] #'rumour-data'

my_file = Path(fname)
if not my_file.is_file():
    url = "https://github.com/jhlau/jhlau.github.io/blob/master/files/rumour-data.tgz?raw=true"
    r = requests.get(url)

    #Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)
        
print("Done. File downloaded:", my_file)


Done. File downloaded: rumour-data.tgz


**Instructions**: Run the code to extract the downloaded archive (a gzipped tarball). Note: the extraction may take a minute or two. **No implementation is needed.**

In [3]:
import tarfile

#decompress rumour-data.tgz
tar = tarfile.open(fname, "r:gz")
tar.extractall()
tar.close()

#remove superfluous files (e.g. .DS_store)
extra_files = []
for r, d, f in os.walk(data_dir):
    for file in f:
        if (file.startswith(".")):
            extra_files.append(os.path.join(r, file))
for f in extra_files:
    os.remove(f)

print("Extraction done.")

Extraction done.


### Question 1 (1.0 mark)

**Instructions**: The corpus data is in the *rumour-data* folder. It contains 2 sub-folders: *non-rumours* and *rumours*. As the names suggest, *rumours* contains all rumour-propagating tweets, while *non-rumours* has normal tweets. Within *rumours* and *non-rumours*, you'll find some sub-folders, each named with an ID. Each of these IDs constitutes an 'event', where an event is defined as consisting of a **source tweet** and its **reactions**.

An illustration of the folder structure is given below:

    rumour-data
        - rumours
            - 498254340310966273
                - reactions
                    - 498254340310966273.json
                    - 498260814487642112.json
                - source-tweet
                    - 498254340310966273.json
        - non-rumours

Now we need to gather the tweet messages for rumour and non-rumour events. As the individual tweets are stored in JSON format, we need to use a JSON parser to parse and collect the actual tweet message. The function `get_tweet_text_from_json(file_path)` is provided to do that.

**Task**: Complete the `get_events(event_dir)` function. The function should return **a list of events** for a particular class of tweets (e.g. rumours), and each event should contain the source tweet message and all reaction tweet messages.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
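
As a rough illustration of the folder traversal described above, the sketch below builds one throwaway event in a temporary directory and reads it back. The sub-folder names mirror the illustration; the tweet IDs, the tweet texts and the helper name `read_event` are made up for the example, and it is not the expected solution.

```python
import json
import os
import tempfile


def read_event(event_path):
    """Return the source tweet text followed by all reaction texts."""
    texts = []
    for sub_dir in ("source-tweet", "reactions"):
        folder = os.path.join(event_path, sub_dir)
        for file_name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, file_name)) as json_file:
                texts.append(json.load(json_file)["text"])
    return texts


# Build a hypothetical event with one source tweet and two reactions.
root = tempfile.mkdtemp()
event_path = os.path.join(root, "rumours", "1001")
toy_tweets = {
    "source-tweet": {"1001": "breaking news"},
    "reactions": {"1002": "is this true?", "1003": "source please"},
}
for sub_dir, tweets in toy_tweets.items():
    os.makedirs(os.path.join(event_path, sub_dir))
    for tweet_id, text in tweets.items():
        file_path = os.path.join(event_path, sub_dir, tweet_id + ".json")
        with open(file_path, "w") as json_file:
            json.dump({"text": text}, json_file)

event = read_event(event_path)
print(event)
```

Sorting the file names here only makes the toy output deterministic; the task itself does not prescribe an order for the reactions.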

In [4]:
import json

def get_tweet_text_from_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
        return data["text"]
    
def get_events(event_dir):
    event_list = []
    for event in sorted(os.listdir(event_dir)):
        if event == '.DS_Store':
            continue
        each_event = []
        # collect the source tweet first, then all of its reactions
        for sub_dir in ('source-tweet', 'reactions'):
            tweet_dir = os.path.join(event_dir, event, sub_dir)
            for json_file in os.listdir(tweet_dir):
                tweet = get_tweet_text_from_json(os.path.join(tweet_dir, json_file))
                each_event.append(tweet)
        event_list.append(each_event)
    return event_list
    
#a list of events, and each event is a list of tweets (source tweet + reactions)    
rumour_events = get_events(os.path.join(data_dir, "rumours"))
nonrumour_events = get_events(os.path.join(data_dir, "non-rumours"))

print("Number of rumour events =", len(rumour_events))
print("Number of non-rumour events =", len(nonrumour_events))

Number of rumour events = 500
Number of non-rumour events = 1000


**For your testing:**

In [5]:
assert(len(rumour_events) == 500)
assert(len(nonrumour_events) == 1000)

In [5]:
rumour_events

[['Michael Brown is the 17 yr old boy who was shot 10x &amp; killed by police in #Ferguson today. Media reports "police shoot man". #blackboysonly',
  '@TrueNameBrand @CyMadD0x @AmeenaGK \n\nYou must be too "racist" to notice how the #WarOnWomen is stopping us all from getting equal pay...',
  '@d_m_elms @jaythenerdkid or traces back some bogus allegations of something he did in kindergarten. All to justify that he was "dangerous"',
  '@AmeenaGK @RaavynnDigitaL how quaint. black men are "boys" until saying so would get police in trouble, then they\'re "men" and "thugs".',
  '@retrocombine @AmeenaGK was 18 and assaulted a police officer.',
  '@kevinwmiller @AmeenaGK ???',
  'I meant MSM report, not NYPD report, but point taken. @justinstoned @PsychicDogTalk @AmeenaGK',
  '@kevinwmiller @AmeenaGK @retrocombine Does not excuse that he was shot ten times. Ten. To protect what? The candy?',
  "@AmeenaGK He's a young boy.",
  '@AmeenaGK @retrocombine Was this "boy" over six feet and around 2

### Question 2 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); and (2) remove stopwords (based on NLTK `stopwords`).

**Task**: Complete the `preprocess_events(events)` function. The function takes **a list of events** as input, and returns **a list of preprocessed events**. Each preprocessed event should be a dictionary mapping words to their frequencies.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
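
To make the target representation concrete, here is a minimal sketch of a bag-of-words count for one event. It is only an illustration: `str.split` stands in for NLTK's `TweetTokenizer`, the three-word `TOY_STOPWORDS` set stands in for the NLTK stopword list, and the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

# Hypothetical stand-ins: NLTK's TweetTokenizer and stopword list are replaced
# by str.split and a three-word set to keep the sketch self-contained.
TOY_STOPWORDS = {"the", "is", "a"}


def bag_of_words(tweets):
    """Count the non-stopword tokens across all tweets of one event."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.split()
        counts.update(t for t in tokens if t.lower() not in TOY_STOPWORDS)
    return dict(counts)


toy_event = ["The rumour is spreading", "a rumour ? #speakup"]
toy_bow = bag_of_words(toy_event)
print(toy_bow)
```

Stopwords are compared in lowercase, while the surviving tokens keep their original casing.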

In [6]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from collections import Counter

tt = TweetTokenizer()
# use a separate name so the imported `stopwords` corpus is not overwritten
stopword_set = set(stopwords.words('english'))

def count_vocab(token_list):
    # map each token to its frequency
    return dict(Counter(token_list))

def preprocess_events(events):
    list_of_events = []
    for each_event in events:
        # tokenize all tweets of the event together, then drop stopwords
        merged_tweets = ' '.join(each_event)
        event_tokens = tt.tokenize(merged_tweets)
        filtered_tokens = [w for w in event_tokens if w.lower() not in stopword_set]
        list_of_events.append(count_vocab(filtered_tokens))
    return list_of_events

preprocessed_rumour_events = preprocess_events(rumour_events)
preprocessed_nonrumour_events = preprocess_events(nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))


Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


**For your testing**:

In [7]:
assert(len(preprocessed_rumour_events) == 500)
assert(len(preprocessed_nonrumour_events) == 1000)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [8]:
def get_all_hashtags(events):
    hashtags = set([])
    for event in events:
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(preprocessed_rumour_events + preprocessed_nonrumour_events)
print("Number of hashtags =", len(hashtags))
#hashtags

Number of hashtags = 1829


### Question 3 (2.0 marks)

**Instructions**: Our task here is to tokenize the hashtags by implementing a reversed version of the MaxMatch algorithm discussed in class, where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching; see the starter code below. Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching. When lemmatising a word, you also need to provide the part-of-speech tag of the word. You should use `nltk.tag.pos_tag` for doing part-of-speech tagging.

Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenized hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list.

For example, given "#speakup", the algorithm should produce: \["#", "speak", "up"\]. And note that you do not need to delete the hashtag symbol ("#") from the tokenised outputs.

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing a reversed MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of word tokens".

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
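
The right-to-left matching described above can be sketched as follows. This is a simplified illustration, not the required implementation: it uses a tiny made-up vocabulary instead of the NLTK word list, skips the lemmatisation step entirely, and the function name `reverse_max_match` is hypothetical.

```python
def reverse_max_match(text, vocabulary):
    """Greedily split text from the right, preferring the longest suffix."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest remaining suffix first; a single character
        # is always accepted as the fallback.
        for start in range(end):
            candidate = text[start:end]
            if len(candidate) == 1 or candidate.lower() in vocabulary:
                tokens.append(candidate)
                end = start
                break
    # Tokens were collected right to left, so restore reading order.
    return tokens[::-1]


# Toy vocabulary (an assumption); a set gives constant-time lookups.
toy_vocabulary = {"speak", "up", "peak", "cops"}

print(reverse_max_match("#speakup", toy_vocabulary))
print(reverse_max_match("#13cops", toy_vocabulary))
```

Keeping the vocabulary in a `set` is what makes each lookup cheap; testing membership in a list would scan the whole word list for every candidate suffix.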

In [9]:
#import nltk
#nltk.download('averaged_perceptron_tagger')
#nltk.download('words')
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet


lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK
# lowercase the word list once, so that every lookup is a constant-time set test
lowercase_words = {w.lower() for w in words}

def tokenize_hashtags(hashtags):
    tokenized = {}
    for hashtag in sorted(hashtags):
        # MaxMatch returns the tokens from right to left, so reverse them
        tokenized[hashtag] = MaxMatch(hashtag).split()[::-1]
    return tokenized

def MaxMatch(hashtag):
    if hashtag == '':
        return ''
    # try suffixes from longest to shortest; the leading '#' is never part of a word
    for i in range(1, len(hashtag) - 1):
        first_token = hashtag[i:]
        the_rest = hashtag[:i]

        lemma = lemmatizer.lemmatize(first_token.lower(),
                                     get_wordnet_pos(first_token.lower()))
        if lemma in lowercase_words:
            return first_token + " " + MaxMatch(the_rest)

    # no dictionary match: fall back to a single character
    first_token = hashtag[-1]
    the_rest = hashtag[:-1]
    return first_token + " " + MaxMatch(the_rest)

def get_wordnet_pos(word):
    # map the first letter of the Penn Treebank tag to a WordNet POS (default: noun)
    tag = nltk.tag.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

tokenized_hashtags = tokenize_hashtags(hashtags)

print(list(tokenized_hashtags.items())[:20])

[('#', ['#']), ('###ards', ['#', '#', '#', 'ar', 'ds']), ('#11', ['#', '1', '1']), ('#137GUNSHOTS', ['#', '1', '3', '7', 'GUNSHOTS']), ('#13RACISTCOPS', ['#', '1', '3', 'RACIST', 'COPS']), ('#13RacistCops', ['#', '1', '3', 'Racist', 'Cops']), ('#14Words', ['#', '1', '4', 'Words']), ('#152', ['#', '1', '5', '2']), ('#1A', ['#', '1', 'A']), ('#20', ['#', '2', '0']), ('#2014', ['#', '2', '0', '1', '4']), ('#28hours', ['#', '2', '8', 'hours']), ('#2A', ['#', '2', 'A']), ('#2UNARMEDBLACKS', ['#', '2', 'UNARMED', 'BLACKS']), ('#2UnarmedBLACKS', ['#', '2', 'Unarmed', 'BLACKS']), ('#498a', ['#', '4', '9', '8', 'a']), ('#4U2521', ['#', '4', 'U', '2', '5', '2', '1']), ('#4U925', ['#', '4', 'U', '9', '2', '5']), ('#4U9525', ['#', '4', 'U', '9', '5', '2', '5']), ("#4U9525's", ['#', '4', 'U', '9', '5', '2', '5', "'", 's'])]


**For your testing:**

In [10]:
assert(len(tokenized_hashtags) == len(hashtags))

### Question 4 (1.0 mark)

**Instructions**: Now that we have the tokenized hashtags, we need to go back and update the bag-of-words representation for each event.

**Task**: Complete the ``update_event_bow(events)`` function. The function takes **a list of preprocessed events**, and for each event, it looks for every hashtag it has and updates the bag-of-words dictionary with the tokenized hashtag tokens. Note: you do not need to delete the counts of the original hashtags when updating the bag-of-words (e.g., if a document has "#speakup":2 in its bag-of-words representation, you do not need to delete this hashtag and its counts).
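
A small sketch of the update on toy data follows. The dictionaries and the function name `add_hashtag_tokens` are hypothetical, and the sketch assumes each hashtag contributes its tokens once per event regardless of how often the hashtag itself occurs, which is one possible reading of the task.

```python
from collections import Counter

# Hypothetical tokenised hashtags and event, for illustration only.
toy_tokenized_hashtags = {"#speakup": ["#", "speak", "up"]}
toy_event_bow = {"#speakup": 2, "speak": 1}


def add_hashtag_tokens(event_bow, tokenized_hashtags):
    """Return a copy of event_bow with each hashtag's tokens counted once."""
    updated = Counter(event_bow)
    for word in event_bow:
        # Words that are not known hashtags contribute nothing.
        updated.update(tokenized_hashtags.get(word, []))
    return dict(updated)


updated_bow = add_hashtag_tokens(toy_event_bow, toy_tokenized_hashtags)
print(updated_bow)
```

The original `"#speakup": 2` entry is kept, as the note above allows, and the existing count for `speak` is incremented rather than overwritten.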

In [11]:
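def update_event_bow(events):
    updated_events = []

    for each_event in events:
        # hashtags that occur in this event
        event_hashtags = each_event.keys() & tokenized_hashtags.keys()
        hashtag_tokens = []
        for hashtag in event_hashtags:
            hashtag_tokens.extend(tokenized_hashtags[hashtag])
        # add every hashtag token to the event's bag-of-words (updated in place)
        for token in hashtag_tokens:
            each_event[token] = each_event.get(token, 0) + 1
        updated_events.append(each_event)

    return updated_events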
     
update_event_bow(preprocessed_rumour_events)
update_event_bow(preprocessed_nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


In [16]:
preprocessed_rumour_events

[{'Michael': 2,
  'Brown': 2,
  '17': 4,
  'yr': 2,
  'old': 4,
  'boy': 5,
  'shot': 7,
  '10x': 2,
  '&': 4,
  'killed': 2,
  'police': 10,
  '#Ferguson': 3,
  'today': 2,
  '.': 47,
  'Media': 2,
  'reports': 2,
  '"': 22,
  'shoot': 2,
  'man': 3,
  '#blackboysonly': 2,
  '@TrueNameBrand': 1,
  '@CyMadD0x': 6,
  '@AmeenaGK': 38,
  'must': 1,
  'racist': 1,
  'notice': 1,
  '#WarOnWomen': 1,
  'stopping': 1,
  'us': 1,
  'getting': 1,
  'equal': 1,
  'pay': 1,
  '...': 2,
  '@d_m_elms': 2,
  '@jaythenerdkid': 3,
  'traces': 1,
  'back': 6,
  'bogus': 1,
  'allegations': 1,
  'something': 1,
  'kindergarten': 1,
  'justify': 1,
  'dangerous': 1,
  '@RaavynnDigitaL': 2,
  'quaint': 1,
  'black': 8,
  'men': 3,
  'boys': 1,
  'saying': 1,
  'would': 2,
  'get': 1,
  'trouble': 1,
  ',': 17,
  "they're": 1,
  'thugs': 1,
  '@retrocombine': 16,
  '18': 1,
  'assaulted': 1,
  'officer': 2,
  '@kevinwmiller': 12,
  '?': 15,
  'meant': 1,
  'MSM': 1,
  'report': 2,
  'NYPD': 1,
  'point': 1

# 2. Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we want to perform text classification: given a tweet and its reactions, predict whether it is a rumour or not. The task here is to create training, development and test partitions from the preprocessed events and convert the bag-of-words representation into feature vectors.

**Task**: Create training, development and test partitions with a 60%/20%/20% ratio. Remember to preserve the ratio of rumour/non-rumour events for all your partitions. Next, turn the bag-of-words dictionary of each event into a feature vector, using scikit-learn `DictVectorizer`.
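
To show what preserving the class ratio means, here is a pure-Python sketch of a stratified 60%/20%/20% split on toy data. It is an illustration only: the function name `stratified_split` and the toy items are made up, and in the notebook itself the same effect is normally obtained with scikit-learn's `train_test_split` and its `stratify` argument.

```python
import random
from collections import Counter


def stratified_split(items, labels, ratios=(0.6, 0.2, 0.2), seed=1):
    """Split items into partitions, keeping each label's share in every one."""
    rng = random.Random(seed)
    partitions = [[] for _ in ratios]
    for label in sorted(set(labels)):
        members = [item for item, lab in zip(items, labels) if lab == label]
        rng.shuffle(members)
        start = 0
        cumulative = 0.0
        for index, ratio in enumerate(ratios):
            # Cut each class at the same cumulative fraction.
            cumulative += ratio
            stop = round(cumulative * len(members))
            partitions[index].extend((item, label) for item in members[start:stop])
            start = stop
    return partitions


# Toy data with the same 1:2 class ratio as the corpus (500 vs 1000 events).
toy_items = list(range(30))
toy_labels = ["rumour"] * 10 + ["non-rumour"] * 20
train_part, dev_part, test_part = stratified_split(toy_items, toy_labels)

for name, part in (("train", train_part), ("dev", dev_part), ("test", test_part)):
    print(name, dict(Counter(label for _, label in part)))
```

Every partition keeps one rumour for every two non-rumours, which is the property the task asks for.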

In [12]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

vectorizer = DictVectorizer()

# label events by the list they come from rather than by comparing dictionaries,
# which would mislabel a non-rumour event whose bag-of-words equals a rumour's
all_events = preprocessed_rumour_events + preprocessed_nonrumour_events
all_labels = (['rumour'] * len(preprocessed_rumour_events)
              + ['non-rumour'] * len(preprocessed_nonrumour_events))

# 60% training, then split the remaining 40% evenly into development and test
training_data, dev_and_test_data, training_labels, dev_and_test_labels = train_test_split(
    all_events, all_labels, stratify=all_labels, random_state=1, test_size=0.4)

dev_data, test_data, dev_labels, test_labels = train_test_split(
    dev_and_test_data, dev_and_test_labels, stratify=dev_and_test_labels,
    random_state=1, test_size=0.5)

training_set = vectorizer.fit_transform(training_data)
dev_set      = vectorizer.transform(dev_data)
test_set     = vectorizer.transform(test_data)


print("Vocabulary size =", len(vectorizer.vocabulary_))

Vocabulary size = 27852


In [13]:
dev_labels

['rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 'rumour',
 'rumour',
 'non-rumour',
 'non-rumour',
 'non-rumour',
 '

### Question 6 (2.0 marks)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance for different hyper-parameter settings.
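
For Multinomial Naive Bayes the smoothing constant `alpha` plays the regularising role, while for Logistic Regression it is `C`, the inverse of the regularisation strength. The pure-Python sketch below uses made-up word counts to show how additive smoothing changes a word's estimated class-conditional probability as `alpha` grows. It illustrates the formula only; the function name and counts are hypothetical, and it is not scikit-learn's implementation.

```python
def smoothed_probability(word, class_counts, vocabulary_size, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    total = sum(class_counts.values())
    return (class_counts.get(word, 0) + alpha) / (total + alpha * vocabulary_size)


# Hypothetical token counts for one class over a 4-word vocabulary.
toy_counts = {"police": 6, "shot": 3, "today": 1}
toy_vocabulary_size = 4  # the fourth word, "hoax", was never seen in this class

for alpha in (0.1, 0.5, 1.0):
    seen = smoothed_probability("police", toy_counts, toy_vocabulary_size, alpha)
    unseen = smoothed_probability("hoax", toy_counts, toy_vocabulary_size, alpha)
    print(f"alpha = {alpha}: P(police) = {seen:.3f}, P(hoax) = {unseen:.3f}")
```

A larger `alpha` moves probability mass from frequent words to unseen ones, pulling the estimates towards a uniform distribution, which is why it acts as a regulariser.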

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score


def print_accuracy(clf_list):
    # fit each classifier on the training set and report its dev-set accuracy
    for each_clf in clf_list:
        each_clf.fit(training_set, training_labels)
        predictions = each_clf.predict(dev_set)
        print('accuracy =', accuracy_score(dev_labels, predictions))
            
# Naive Bayes Method
print('Naive Bayes Method: ')

# one classifier per smoothing value alpha
NB_clfs = [MultinomialNB(alpha=alpha, fit_prior=True)
           for alpha in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)]

print_accuracy(NB_clfs)

# Here we choose the alpha parameter to be 0.5 for the Naive Bayes classifier. 


#Logistic regression with different parameter settings

# 'lbfgs' method
print('\nLogistic Regression with solver=\'lbfgs\': ')
Logistic_clfs_1 = []


clf1 = LogisticRegression(C= 1, solver='lbfgs', max_iter = 1000, 
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr')

clf2 = LogisticRegression(C= 1, solver='lbfgs', max_iter = 1000, 
                         fit_intercept=True, class_weight = {'rumour':2, 'non-rumour':1.5}, 
                         penalty='l2',dual =False, multi_class='ovr')


clf3 = LogisticRegression(C= 0.6, solver='lbfgs',max_iter = 1000,
                         fit_intercept=True, class_weight = {'rumour':2, 'non-rumour':1.5}, 
                         penalty='l2',dual =False, multi_class='ovr')

clf4 = LogisticRegression(C= 0.1, solver='lbfgs', max_iter = 1000,
                         fit_intercept=True, class_weight = {'rumour':2, 'non-rumour':1.5}, 
                         penalty='l2',dual =False, multi_class='ovr')

clf5 = LogisticRegression(C= 0.6, solver='lbfgs',max_iter = 1000, 
                         fit_intercept=True, class_weight = {'rumour':3, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 

clf6 = LogisticRegression(C= 0.6, solver='lbfgs',max_iter = 1000, 
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':21}, 
                         penalty='l2',dual =False, multi_class='ovr') 

Logistic_clfs_1.append(clf1)
Logistic_clfs_1.append(clf2)
Logistic_clfs_1.append(clf3)
Logistic_clfs_1.append(clf4)
Logistic_clfs_1.append(clf5)
Logistic_clfs_1.append(clf6)

print_accuracy(Logistic_clfs_1)

# For 'lbfgs', max accuracy occurs when the inverse of regularization strength is 0.6,
# and 'rumour':'non-rumour' = 2:1.5. Here the max accuracy is 82%. 


# 'liblinear' method

print('\nLogistic Regression with solver = \'liblinear\': ')

Logistic_clfs_2 = []
clf1 = LogisticRegression(C= 1, solver='liblinear',
                         fit_intercept=False, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr')
clf2 = LogisticRegression(C= 0.8, solver='liblinear',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr')

clf3 = LogisticRegression(C= 0.58, solver='liblinear',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 

clf4 = LogisticRegression(C= 0.58, solver='liblinear',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf5 = LogisticRegression(C= 0.3, solver='liblinear',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 

Logistic_clfs_2.append(clf1)
Logistic_clfs_2.append(clf2)
Logistic_clfs_2.append(clf3)
Logistic_clfs_2.append(clf4)
Logistic_clfs_2.append(clf5)

print_accuracy(Logistic_clfs_2)

# Here for solver = 'liblinear', I'd choose clf4, namely rumour:non-rumour = 1:1 
# with a stronger regularisation (C = 0.58). The max accuracy is 82.67%. 


# 'newton-cg' method

print('\nLogistic Regression with solver = \'newton-cg\': ')
Logistic_clfs_3 = []

clf1 = LogisticRegression(C= 1, solver='newton-cg',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour': 2}, 
                         penalty='l2',dual =False, multi_class='ovr')
clf2 = LogisticRegression(C= 0.5, solver='newton-cg',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour': 2}, 
                         penalty='l2',dual =False, multi_class='ovr')
clf3 = LogisticRegression(C= 0.5, solver='newton-cg',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour': 1}, 
                         penalty='l2',dual =False, multi_class='ovr')
clf4 = LogisticRegression(C= 0.2, solver='newton-cg',
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour': 1}, 
                         penalty='l2',dual =False, multi_class='ovr')
clf5 = LogisticRegression(C= 0.2, solver='newton-cg',
                         fit_intercept=True, class_weight = {'rumour':2, 'non-rumour':1.5}, 
                         penalty='l2',dual =False, multi_class='ovr')


Logistic_clfs_3.append(clf1)
Logistic_clfs_3.append(clf2)
Logistic_clfs_3.append(clf3)
Logistic_clfs_3.append(clf4)
Logistic_clfs_3.append(clf5)

print_accuracy(Logistic_clfs_3)

# Here the max accuracy is achieved when C = 0.5 with the two classes having the same weight. 
# 'newton-cg' method achieves a max accuracy of 82%. 

#'sag' method

print('\nLogistic Regression with solver = \'sag\': ')
Logistic_clfs_4 = []

clf1 = LogisticRegression(C= 1, solver='sag', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf2 = LogisticRegression(C= 0.67, solver='sag', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf3 = LogisticRegression(C= 0.2, solver='sag', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf4 = LogisticRegression(C= 1, solver='sag', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf5 = LogisticRegression(C= 1, solver='sag', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':1.5}, 
                         penalty='l2',dual =False, multi_class='ovr')

Logistic_clfs_4.append(clf1)
Logistic_clfs_4.append(clf2)
Logistic_clfs_4.append(clf3)
Logistic_clfs_4.append(clf4)
Logistic_clfs_4.append(clf5)

print_accuracy(Logistic_clfs_4)

# 'sag' method achieves its max accuracy at the default regularisation strength (C = 1)
# with equal class weights (rumour:non-rumour = 1:1). The max accuracy is 82.33%. 



#'saga' method

print('\nLogistic Regression with solver = \'saga\': ')
Logistic_clfs_5 = []
clf1 = LogisticRegression(C= 1, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf2 =LogisticRegression(C= 0.2, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr')  
clf3 = LogisticRegression(C= 0.1, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':2}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf4 = LogisticRegression(C= 0.1, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf5 = LogisticRegression(C= 0.1, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1.5, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 
clf6 = LogisticRegression(C= 0.05, solver='saga', max_iter = 5000,
                         fit_intercept=True, class_weight = {'rumour':1.8, 'non-rumour':1}, 
                         penalty='l2',dual =False, multi_class='ovr') 

Logistic_clfs_5.append(clf1)
Logistic_clfs_5.append(clf2)
Logistic_clfs_5.append(clf3)
Logistic_clfs_5.append(clf4)
Logistic_clfs_5.append(clf5)
Logistic_clfs_5.append(clf6)

print_accuracy(Logistic_clfs_5)
# In this method, a relatively strong regularisation and adjusted weights for the
# two classes combined give a better result. The max accuracy in this case is 81.67%. 

Naive Bayes Method: 
accuracy = 0.7966666666666666
accuracy = 0.8033333333333333
accuracy = 0.7966666666666666
accuracy = 0.7933333333333333
accuracy = 0.7933333333333333
accuracy = 0.7866666666666666
accuracy = 0.7933333333333333

Logistic Regression with solver='lbfgs': 
accuracy = 0.8166666666666667
accuracy = 0.8166666666666667
accuracy = 0.82
accuracy = 0.8133333333333334
accuracy = 0.8166666666666667
accuracy = 0.79

Logistic Regression with solver = 'liblinear': 
accuracy = 0.8233333333333334
accuracy = 0.8166666666666667
accuracy = 0.81
accuracy = 0.8266666666666667
accuracy = 0.8133333333333334

Logistic Regression with solver = 'newton-cg': 
accuracy = 0.8166666666666667
accuracy = 0.8133333333333334
accuracy = 0.82
accuracy = 0.81
accuracy = 0.8166666666666667

Logistic Regression with solver = 'sag': 
accuracy = 0.8166666666666667
accuracy = 0.8133333333333334
accuracy = 0.81
accuracy = 0.8233333333333334
accuracy = 0.8133333333333334

Logistic Regression with solver = 'sag

### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macro-averaged F-score for each classifier. Be sure to label your output.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using optimal hyper-parameter settings.
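
Macro-averaged F-score is the unweighted mean of the per-class F1 scores, so the smaller rumour class counts as much as the larger non-rumour class. The sketch below computes it by hand on made-up gold labels and predictions; the function name `macro_f1` and the toy data are hypothetical, and in the notebook the same quantity is normally obtained from `f1_score(..., average='macro')`.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions: 2 rumours, 4 non-rumours.
toy_gold = ["rumour", "rumour", "non-rumour", "non-rumour", "non-rumour", "non-rumour"]
toy_pred = ["rumour", "non-rumour", "non-rumour", "non-rumour", "non-rumour", "rumour"]

toy_accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)
print("Accuracy =", toy_accuracy)
print("Macro F1 =", macro_f1(toy_gold, toy_pred))
```

In this toy case the per-class F1 scores are 0.5 for rumour and 0.75 for non-rumour, so the macro average (0.625) is lower than the accuracy (about 0.67) because the weaker class is not down-weighted.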

In [15]:
# In the previous question, the best dev-set accuracy for logistic regression was 82.67%
# (solver='liblinear', C=0.58, equal class weights).
# Here we evaluate that classifier and the best Naive Bayes classifier (alpha=0.5) on the test set.

clfs = []
clfs.append(NB_clfs[1])
clfs.append(Logistic_clfs_2[3])

for each_clf in clfs:
    print(each_clf)
    each_clf.fit(training_set, training_labels)
    a = each_clf.predict(test_set)
    print ('Accuracy =', accuracy_score(test_labels, a))
    print ('f1_score =', f1_score(test_labels, a, average = 'macro'))
    print ('\n')

# We can conclude that, in this case, logistic regression with these parameter settings
# yields a higher accuracy and macro-averaged F-score than the Naive Bayes method. 
# The test accuracy of this classifier is 83%, slightly higher than its 82.67% on the dev set.

MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
Accuracy = 0.7866666666666666
f1_score = 0.770334928229665


LogisticRegression(C=0.58, class_weight={'non-rumour': 1, 'rumour': 1},
                   dual=False, fit_intercept=True, intercept_scaling=1,
                   l1_ratio=None, max_iter=100, multi_class='ovr', n_jobs=None,
                   penalty='l2', random_state=None, solver='liblinear',
                   tol=0.0001, verbose=0, warm_start=False)
Accuracy = 0.83
f1_score = 0.7982568335552949


