# Homework 1: Preprocessing and Text Classification

Student Name: Xinnan SHEN

Student ID: 1051380

# General Info

<b>Due date</b>: Sunday, 5 Apr 2020 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day (both week and weekend days counted)

<b>Marks</b>: 10% of mark for class (with 9% on correctness + 1% on quality and efficiency of your code)

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/17601/pages/using-jupyter-notebook-and-python?module_item_id=1678430) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

To familiarize yourself with NLTK, here is a free online book:  Steven Bird, Ewan Klein, and Edward Loper (2009). <a href=http://nltk.org/book>Natural Language Processing with Python</a>. O'Reilly Media Inc. You may also consult the <a href=https://www.nltk.org/api/nltk.html>NLTK API</a>.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.

You will be marked not only on the correctness of your methods, but also on the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.

# Overview

In this homework, you'll be working with a collection of tweets. The task is to classify whether a tweet constitutes a rumour event. This homework involves writing code to preprocess data and perform text classification.

# 1. Preprocessing (5 marks)

**Instructions**: Run the code below to download the tweet corpus for the assignment. Note: the download may take some time. **No implementation is needed.**

In [1]:
import requests
import os
from pathlib import Path

fname = 'rumour-data.tgz'
data_dir = os.path.splitext(fname)[0] #'rumour-data'

my_file = Path(fname)
if not my_file.is_file():
    url = "https://github.com/jhlau/jhlau.github.io/blob/master/files/rumour-data.tgz?raw=true"
    r = requests.get(url)

    #Save to the current directory
    with open(fname, 'wb') as f:
        f.write(r.content)
        
print("Done. File downloaded:", my_file)


Done. File downloaded: rumour-data.tgz


**Instructions**: Run the code below to extract the downloaded archive (a gzip-compressed tar file). Note: the extraction may take a minute or two. **No implementation is needed.**

In [2]:
import tarfile

#decompress rumour-data.tgz
tar = tarfile.open(fname, "r:gz")
tar.extractall()
tar.close()

#remove superfluous files (e.g. .DS_store)
extra_files = []
for r, d, f in os.walk(data_dir):
    for file in f:
        if (file.startswith(".")):
            extra_files.append(os.path.join(r, file))
for f in extra_files:
    os.remove(f)

print("Extraction done.")

Extraction done.


### Question 1 (1.0 mark)

**Instructions**: The corpus data is in the *rumour-data* folder. It contains 2 sub-folders: *non-rumours* and *rumours*. As the names suggest, *rumours* contains all rumour-propagating tweets, while *non-rumours* has normal tweets. Within *rumours* and *non-rumours*, you'll find some sub-folders, each named with an ID. Each of these IDs constitutes an 'event', where an event is defined as consisting of a **source tweet** and its **reactions**.

An illustration of the folder structure is given below:

    rumour-data
        - rumours
            - 498254340310966273
                - reactions
                    - 498254340310966273.json
                    - 498260814487642112.json
                - source-tweet
                    - 498254340310966273.json
        - non-rumours

Now we need to gather the tweet messages for rumours and non-rumour events. As the individual tweets are stored in json format, we need to use a json parser to parse and collect the actual tweet message. The function `get_tweet_text_from_json(file_path)` is provided to do that.

**Task**: Complete the `get_events(event_dir)` function. The function should return **a list of events** for a particular class of tweets (e.g. rumours), and each event should contain the source tweet message and all reaction tweet messages.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
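As a rough illustration of the folder layout described above, the sketch below builds a tiny throwaway corpus in a temporary directory and gathers each event's source tweet followed by its reactions. The helper name `collect_event_texts` and the tweet texts are made up for the example; only the `source-tweet`/`reactions` layout and the `"text"` field come from the assignment.

```python
import json
import tempfile
from pathlib import Path


def collect_event_texts(class_dir):
    """Return one list of tweet texts per event: source tweet first, then reactions."""
    events = []
    for event_dir in sorted(Path(class_dir).iterdir()):
        texts = []
        for sub_dir in ("source-tweet", "reactions"):
            for json_path in sorted((event_dir / sub_dir).glob("*.json")):
                with open(json_path, encoding="utf-8") as json_file:
                    texts.append(json.load(json_file)["text"])
        events.append(texts)
    return events


# Build a throwaway corpus with a single event (texts are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    event_dir = Path(tmp) / "rumours" / "498254340310966273"
    for sub_dir, name, text in [
        ("source-tweet", "498254340310966273", "source tweet text"),
        ("reactions", "498260814487642112", "a reaction"),
    ]:
        (event_dir / sub_dir).mkdir(parents=True, exist_ok=True)
        with open(event_dir / sub_dir / f"{name}.json", "w", encoding="utf-8") as f:
            json.dump({"text": text}, f)

    toy_events = collect_event_texts(Path(tmp) / "rumours")

print(toy_events)
```

Sorting the directory listings keeps the event order reproducible across machines, which is also why the cell below sorts the output of `os.listdir`.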

In [3]:
import json

def get_tweet_text_from_json(file_path):
    with open(file_path) as json_file:
        data = json.load(json_file)
        return data["text"]
    
def get_events(event_dir):
    event_list = []
    for event in sorted(os.listdir(event_dir)):
        ###
        # Your answer BEGINS HERE
        ###
        current_path=os.path.join(event_dir,event)
        reaction_path=os.path.join(current_path,"reactions")
        source_path=os.path.join(current_path,"source-tweet")
        list_temp=[]
        for json_file in sorted(os.listdir(source_path)):
            list_temp.append(get_tweet_text_from_json(os.path.join(source_path,json_file)))
        for json_file in sorted(os.listdir(reaction_path)):
            list_temp.append(get_tweet_text_from_json(os.path.join(reaction_path,json_file)))
        event_list.append(list_temp)
        ###
        # Your answer ENDS HERE
        ###
        
    return event_list
    
#a list of events, and each event is a list of tweets (source tweet + reactions)    
rumour_events = get_events(os.path.join(data_dir, "rumours"))
nonrumour_events = get_events(os.path.join(data_dir, "non-rumours"))

print("Number of rumour events =", len(rumour_events))
print("Number of non-rumour events =", len(nonrumour_events))

Number of rumour events = 500
Number of non-rumour events = 1000


**For your testing:**

In [4]:
assert(len(rumour_events) == 500)
assert(len(nonrumour_events) == 1000)

### Question 2 (1.0 mark)

**Instructions**: Next we need to preprocess the collected tweets to create a bag-of-words representation. The preprocessing steps required here are: (1) tokenize each tweet into individual word tokens (using NLTK `TweetTokenizer`); and (2) remove stopwords (based on NLTK `stopwords`).

**Task**: Complete the `preprocess_events(event)` function. The function takes **a list of events** as input, and returns **a list of preprocessed events**. Each preprocessed event should have a dictionary of words and frequencies.

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
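The two preprocessing steps can be sketched with `collections.Counter`. To keep the example runnable without NLTK data, a whitespace split and a three-word stopword set stand in for `TweetTokenizer` and the NLTK stopword list; both stand-ins, and the names `toy_tokenize` and `events_to_bows`, are assumptions made for this sketch.

```python
from collections import Counter

# Stand-ins for NLTK's TweetTokenizer and stopword list (assumptions for this sketch).
TOY_STOPWORDS = {"the", "is", "a"}


def toy_tokenize(tweet):
    """Split on whitespace; the assignment itself asks for TweetTokenizer."""
    return tweet.split()


def events_to_bows(events):
    """Turn each event (a list of tweet strings) into a word -> frequency dictionary."""
    bows = []
    for event in events:
        counts = Counter(
            token
            for tweet in event
            for token in toy_tokenize(tweet)
            if token not in TOY_STOPWORDS
        )
        bows.append(dict(counts))
    return bows


toy_bows = events_to_bows([["the plane is missing", "plane found ?"]])
print(toy_bows)
```

All tweets of one event feed a single counter, so each event ends up with exactly one dictionary regardless of how many reactions it has.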

In [5]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from collections import defaultdict

tt = TweetTokenizer()
stopwords = set(stopwords.words('english'))

def preprocess_events(events):
    ###
    # Your answer BEGINS HERE
    ###
    result_list=[]
    for ele in events:
        token_list=[]
        for doc in ele:
            token_list.extend(tt.tokenize(doc))
        result_event={}
        for t in token_list:
            if t not in stopwords:
                if t not in result_event:
                    result_event[t]=1
                else:
                     result_event[t]= result_event[t]+1
        result_list.append(result_event)    
    return result_list
    ###
    # Your answer ENDS HERE
    ###

preprocessed_rumour_events = preprocess_events(rumour_events)
preprocessed_nonrumour_events = preprocess_events(nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


**For your testing**:

In [6]:
assert(len(preprocessed_rumour_events) == 500)
assert(len(preprocessed_nonrumour_events) == 1000)

**Instructions**: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. Run the code below to collect all unique hashtags in the preprocessed data. **No implementation is needed.**



In [7]:
def get_all_hashtags(events):
    hashtags = set([])
    for event in events:
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
    return hashtags

hashtags = get_all_hashtags(preprocessed_rumour_events + preprocessed_nonrumour_events)
print("Number of hashtags =", len(hashtags))

Number of hashtags = 1829


### Question 3 (2.0 marks)

**Instructions**: Our task here is to tokenize the hashtags by implementing a reversed version of the MaxMatch algorithm discussed in class, where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching; see the starter code below. Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatizer before matching. When lemmatising a word, you also need to provide the part-of-speech tag of the word. You should use `nltk.tag.pos_tag` for doing part-of-speech tagging.

Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenized hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list.

For example, given "#speakup", the algorithm should produce: \["#", "speak", "up"\]. And note that you do not need to delete the hashtag symbol ("#") from the tokenised outputs.

**Task**: Complete the `tokenize_hashtags(hashtags)` function by implementing a reversed MaxMatch algorithm. The function takes as input **a set of hashtags**, and returns **a dictionary** where key="hashtag" and value="a list of word tokens".

**Check**: Use the assertion statements in *"For your testing"* below for the expected output.
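A minimal sketch of the reversed MaxMatch idea, under simplifying assumptions: it uses a small hypothetical vocabulary instead of the NLTK word list and skips the lemmatization step entirely. The function name `reverse_max_match` is made up for the example.

```python
def reverse_max_match(text, vocabulary):
    """Segment text from right to left, always taking the longest matching suffix."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate first; fall back to a single character.
        for start in range(end):
            candidate = text[start:end]
            if candidate.lower() in vocabulary or start == end - 1:
                tokens.append(candidate)
                end = start
                break
    tokens.reverse()  # tokens were collected back to front
    return tokens


toy_vocabulary = {"speak", "up", "pray", "for"}  # hypothetical word list
speakup_tokens = ["#"] + reverse_max_match("speakup", toy_vocabulary)
fallback_tokens = ["#"] + reverse_max_match("PrayForXY", toy_vocabulary)
print(speakup_tokens)
print(fallback_tokens)
```

Storing the vocabulary in a `set` keeps each membership test at roughly constant time, which matters because every hashtag triggers many candidate lookups.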

In [8]:
from nltk.corpus import wordnet

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = set(nltk.corpus.words.words()) #a list of words provided by NLTK

def get_pos(word):
    tag=nltk.tag.pos_tag([word])[0][1]
    if tag[0]=='N':
        return wordnet.NOUN
    elif tag[0]=='V':
        return wordnet.VERB
    elif tag[0]=='J':
        return wordnet.ADJ
    elif tag[0]=='R':
        return wordnet.ADV
    else:
        return ''


def tokenize_hashtags(hashtags):
    ###
    # Your answer BEGINS HERE
    ###
    # lowercase the word list once and keep it in a set for fast membership tests
    newwords={word.lower() for word in words}
    result_dict={}
    for ht in hashtags:
        origin_word=ht
        word_list=[]
        word_list.append(ht[0])#extract "#"
        ht=ht[1:len(ht)]
        i=len(ht)
        temp_list=[]
        while i>0:
            j=len(ht)-1
            temp=j
            while j>=0:
                word=ht[j:i]
                pos=get_pos(word)
                # lemmatize with the POS tag when one is available, then compare in lowercase
                if pos!='':
                    lemma=lemmatizer.lemmatize(word,pos).lower()
                else:
                    lemma=lemmatizer.lemmatize(word).lower()
                if lemma in newwords:#matched
                    temp=j
                    j=j-1
                else:
                    j=j-1
            temp_list.append(ht[temp:i])
            ht=ht[0:temp]
            i=temp
        temp_list.reverse()
        word_list.extend(temp_list)
        result_dict[origin_word]=word_list
    return result_dict
    
    ###
    # Your answer ENDS HERE
    ###
    

tokenized_hashtags = tokenize_hashtags(hashtags)

print(list(tokenized_hashtags.items())[:20])

[('#CanadastandswithIsrael', ['#', 'Canada', 'stand', 'swith', 'Israel']), ('#AKP', ['#', 'AK', 'P']), ('#hockey', ['#', 'hockey']), ('#murder', ['#', 'murder']), ('#FilthyCanadaUnderSiege', ['#', 'Filthy', 'Canada', 'Under', 'Siege']), ('#defend', ['#', 'defend']), ('#wth', ['#', 'w', 'th']), ('#jesuisjuif', ['#', 'jesu', 'is', 'ju', 'if']), ('#HANDSUP', ['#', 'HAND', 'SUP']), ('#fergusonthugs', ['#', 'ferguson', 'thugs']), ('#thewholeworldiswatching', ['#', 'the', 'who', 'lew', 'or', 'l', 'dis', 'watching']), ('#CharlieHebdo-attacken', ['#', 'Charlie', 'He', 'b', 'do', '-', 'atta', 'c', 'ken']), ('#PrayForAustralia', ['#', 'Pray', 'For', 'Australia']), ('#NOW', ['#', 'NOW']), ('#shooting', ['#', 'shooting']), ('#Angry', ['#', 'Angry']), ('#FascistPolice', ['#', 'Fascist', 'Police']), ('#Germany', ['#', 'Ger', 'many']), ('#fuckinghilarious', ['#', 'fu', 'c', 'king', 'hilarious']), ('#Backtothesixties', ['#', 'Back', 'to', 'the', 'sixties'])]


**For your testing:**

In [9]:
assert(len(tokenized_hashtags) == len(hashtags))

### Question 4 (1.0 mark)

**Instructions**: Now that we have the tokenized hashtags, we need to go back and update the bag-of-words representation for each event.

**Task**: Complete the ``update_event_bow(events)`` function. The function takes **a list of preprocessed events**, and for each event, it looks for every hashtag it has and updates the bag-of-words dictionary with the tokenized hashtag tokens. Note: you do not need to delete the counts of the original hashtags when updating the bag-of-words (e.g., if a document has "#speakup":2 in its bag-of-words representation, you do not need to delete this hashtag and its counts).
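One way to picture the update is sketched below for a single bag-of-words dictionary. The helper name `add_hashtag_tokens` is hypothetical, and counting each token once per distinct hashtag is only one possible reading of the task; weighting by the hashtag's own frequency would be another.

```python
from collections import Counter


def add_hashtag_tokens(bow, hashtag_tokens):
    """Add the word tokens of every hashtag in bow, keeping the original hashtag entry."""
    extra = Counter()
    for word in bow:
        if word.startswith("#"):
            for token in hashtag_tokens.get(word, []):
                if token != "#":
                    extra[token] += 1
    # Apply the new counts afterwards so the dictionary is not modified while iterating.
    for token, count in extra.items():
        bow[token] = bow.get(token, 0) + count


toy_bow = {"#speakup": 2, "up": 1}
add_hashtag_tokens(toy_bow, {"#speakup": ["#", "speak", "up"]})
print(toy_bow)
```

Looking tokens up in an already computed hashtag dictionary avoids re-running the tokenizer for every event.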

In [10]:
def update_event_bow(events):
    ###
    # Your answer BEGINS HERE
    ###
    for event in events:
        hashtags=set([])
        for word, frequency in event.items():
            if word.startswith("#"):
                hashtags.add(word)
        # reuse the global tokenized_hashtags computed in Question 3 instead of re-tokenizing per event
        for hashtag in hashtags:
            for token in tokenized_hashtags[hashtag]:
                if token=='#':
                    continue
                elif token in event:
                    event[token]=event[token]+1
                else:
                    event[token]=1
    return
    ###
    # Your answer ENDS HERE
    ###
            
update_event_bow(preprocessed_rumour_events)
update_event_bow(preprocessed_nonrumour_events)

print("Number of preprocessed rumour events =", len(preprocessed_rumour_events))
print("Number of preprocessed non-rumour events =", len(preprocessed_nonrumour_events))

Number of preprocessed rumour events = 500
Number of preprocessed non-rumour events = 1000


# 2. Text Classification (4 marks)

### Question 5 (1.0 mark)

**Instructions**: Here we are interested in text classification: given a tweet and its reactions, predict whether it is a rumour or not. The task here is to create training, development and test partitions from the preprocessed events and convert the bag-of-words representation into feature vectors.

**Task**: Using scikit-learn, create training, development and test partitions with a 60%/20%/20% ratio. Remember to preserve the ratio of rumour/non-rumour events for all your partitions. Next, turn the bag-of-words dictionary of each event into a feature vector, using scikit-learn `DictVectorizer`.
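A hedged sketch of a 60%/20%/20% split that preserves the class ratio, using the `stratify` argument of `train_test_split` on toy data. The toy dictionaries, the label strings and `random_state=0` are assumptions made for the example, not part of the assignment.

```python
from collections import Counter

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# Toy data: 10 rumour and 20 non-rumour bag-of-words dictionaries (illustrative only).
toy_bows = [{"claim": 1, f"r{i}": 1} for i in range(10)]
toy_bows += [{"news": 1, f"n{i}": 1} for i in range(20)]
toy_labels = ["rumour"] * 10 + ["non-rumour"] * 20

# Hold out 20% for testing, then 25% of the remainder (20% overall) for development.
bows_rest, bows_test, labels_rest, labels_test = train_test_split(
    toy_bows, toy_labels, test_size=0.2, stratify=toy_labels, random_state=0
)
bows_train, bows_dev, labels_train, labels_dev = train_test_split(
    bows_rest, labels_rest, test_size=0.25, stratify=labels_rest, random_state=0
)

# Fit the vocabulary on the training partition only, then reuse it for dev and test.
toy_vectorizer = DictVectorizer(sparse=False)
x_train = toy_vectorizer.fit_transform(bows_train)
x_dev = toy_vectorizer.transform(bows_dev)
x_test = toy_vectorizer.transform(bows_test)

for name, labels in [("train", labels_train), ("dev", labels_dev), ("test", labels_test)]:
    print(name, dict(Counter(labels)))
```

Without `stratify`, the class proportions in each partition depend on the random shuffle and can drift away from the overall 1:2 ratio.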

In [47]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

vectorizer = DictVectorizer(sparse=False)

###
# Your answer BEGINS HERE
###
events=preprocessed_rumour_events+preprocessed_nonrumour_events
#events=vectorizer.fit_transform(events)
results=(['rumor']*500)+(['non-rumour']*1000)
events_trainanddev,events_test,results_trainanddev,results_test=train_test_split(events, results, test_size=0.2)
events_train,events_dev,results_train,results_dev=train_test_split(events_trainanddev, results_trainanddev, test_size=0.25)
events_train=vectorizer.fit_transform(events_train)
events_dev=vectorizer.transform(events_dev)
events_test=vectorizer.transform(events_test)
###
# Your answer ENDS HERE
###
print("Vocabulary size =", len(vectorizer.vocabulary_))

Vocabulary size = 28626


### Question 6 (2.0 marks)

**Instructions**: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do **not** use cross-validation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output.

**Task**: Implement two text classifiers: Naive Bayes and Logistic Regression. Tune the hyper-parameters of these classifiers and print the task performance for different hyper-parameter settings.
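The tuning procedure can be sketched as a simple sweep over one hyper-parameter, scoring each setting on the development set only. The example below does this for the smoothing parameter `alpha` of `MultinomialNB` on a handful of made-up events; the data, the grid and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training and development sets (illustrative only).
train_bows = [
    {"breaking": 2, "unconfirmed": 1},
    {"weather": 1, "sunny": 2},
    {"unconfirmed": 2, "report": 1},
    {"sunny": 1, "park": 1},
]
train_labels = ["rumour", "non-rumour", "rumour", "non-rumour"]
dev_bows = [{"unconfirmed": 1, "breaking": 1}, {"park": 2, "weather": 1}]
dev_labels = ["rumour", "non-rumour"]

toy_vectorizer = DictVectorizer(sparse=False)
x_train = toy_vectorizer.fit_transform(train_bows)
x_dev = toy_vectorizer.transform(dev_bows)

# Train one model per setting and record its accuracy on the development set.
dev_scores = {}
for alpha in [0.01, 0.1, 1, 10]:
    model = MultinomialNB(alpha=alpha).fit(x_train, train_labels)
    dev_scores[alpha] = accuracy_score(dev_labels, model.predict(x_dev))
    print(f"Naive Bayes: alpha = {alpha}, dev accuracy = {dev_scores[alpha]:.3f}")

best_alpha = max(dev_scores, key=dev_scores.get)
print(f"Best alpha on the development set: {best_alpha}")
```

The same loop shape applies to the regularisation strength `C` of `LogisticRegression`; the test partition plays no part in choosing either value.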

In [45]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
###
# Your answer BEGINS HERE
###
model_NB=MultinomialNB()
score=0
set_alpha=-1
print("------------------------------------------------------Naive Bayes----------------------------------------------------------------")
for a in [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000,10000]:
    model_NB.set_params(alpha=a).fit(events_train,results_train)
    predictions=model_NB.predict(events_dev)
    this_score=accuracy_score(results_dev,predictions)
    print("alpha= ",a,",score=",round(this_score,3))
    if this_score>score:
        score=this_score
        set_alpha=a
model_NB.set_params(alpha=set_alpha).fit(events_train,results_train)
print("Best Score:",round(score,3),",where alpha=",set_alpha)
score=0
set_penalty=''
set_C=-1
print("--------------------------------------------------Logistic Regression------------------------------------------------------------")
model_LR=LogisticRegression()
for p in ['l1', 'l2']:
    for c in [0.00001,0.0001,0.001,0.01,0.1,1,10,100,1000,10000]:
        model_LR.set_params(penalty=p,C=c,solver='saga').fit(events_train,results_train)
        predictions=model_LR.predict(events_dev)
        this_score=accuracy_score(results_dev,predictions)
        print("penalty= ",p,"C= ",c,",score=",round(this_score,3))
        if this_score>score:
            score=this_score
            set_penalty=p
            set_C=c
model_LR.set_params(penalty=set_penalty,C=set_C,solver='saga').fit(events_train,results_train)
print("Best Score:",round(score,3),",where penalty= ",set_penalty,", and C= ",set_C)
###
# Your answer ENDS HERE
###

------------------------------------------------------Naive Bayes----------------------------------------------------------------
alpha=  1e-05 ,score= 0.77
alpha=  0.0001 ,score= 0.763
alpha=  0.001 ,score= 0.773
alpha=  0.01 ,score= 0.797
alpha=  0.1 ,score= 0.803
alpha=  1 ,score= 0.81
alpha=  10 ,score= 0.69
alpha=  100 ,score= 0.653
alpha=  1000 ,score= 0.65
alpha=  10000 ,score= 0.65
Best Score: 0.81 ,where alpha= 1
--------------------------------------------------Logistic Regression------------------------------------------------------------
penalty=  l1 C=  1e-05 ,score= 0.65
penalty=  l1 C=  0.0001 ,score= 0.65
penalty=  l1 C=  0.001 ,score= 0.65
penalty=  l1 C=  0.01 ,score= 0.65
penalty=  l1 C=  0.1 ,score= 0.707
penalty=  l1 C=  1 ,score= 0.733
penalty=  l1 C=  10 ,score= 0.747
penalty=  l1 C=  100 ,score= 0.747
penalty=  l1 C=  1000 ,score= 0.747
penalty=  l1 C=  10000 ,score= 0.747
penalty=  l2 C=  1e-05 ,score= 0.65
penalty=  l2 C=  0.0001 ,score= 0.65
penalty=  l2 C=  0.001 ,score= 0.697
penalty=  l2 C=  0.01 ,score= 0.74
penalty=  l2 C=  0.1 ,score= 0.743
penalty=  l2 C=  1 ,score= 0.743
penalty=  l2 C=  10 ,score= 0.747
penalty=  l2 C=  100 ,score= 0.743
penalty=  l2 C=  1000 ,score= 0.747
penalty=  l2 C=  10000 ,score= 0.747
Best Score: 0.747 ,where penalty=  l1 , and C=  10




### Question 7 (1.0 mark)

**Instructions**: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macro-averaged F-score for each classifier. Be sure to label your output.

**Task**: Compute test performance in terms of accuracy and macro-averaged F-score for both Naive Bayes and Logistic Regression, using optimal hyper-parameter settings.
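For reference, the two requested metrics can be computed directly with `accuracy_score` and `f1_score(average="macro")`. The gold and predicted labels below are invented for the example and are unrelated to the results reported in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up gold labels and predictions (illustrative only).
gold = ["rumour", "rumour", "non-rumour", "non-rumour", "non-rumour", "non-rumour"]
predicted = ["rumour", "non-rumour", "non-rumour", "non-rumour", "non-rumour", "rumour"]

accuracy = accuracy_score(gold, predicted)
# Macro-averaging takes the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(gold, predicted, average="macro")

print(f"Toy classifier: accuracy = {accuracy:.3f}, macro-averaged F1 = {macro_f1:.3f}")
```

Here the per-class F1 scores are 0.5 for the rumour class and 0.75 for the non-rumour class, so the macro average is 0.625 even though accuracy is about 0.667. With imbalanced classes the macro average gives the minority class the same weight as the majority class.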

In [46]:
###
# Your answer BEGINS HERE
###
from sklearn.metrics import classification_report
NB_pre=model_NB.predict(events_test)
LR_pre=model_LR.predict(events_test)
print("------------------------------------------------------Naive Bayes----------------------------------------------------------------")
print(classification_report(results_test, NB_pre, digits=3))
print("--------------------------------------------------Logistic Regression------------------------------------------------------------")
print(classification_report(results_test, LR_pre, digits=3))
###
# Your answer ENDS HERE
###

------------------------------------------------------Naive Bayes----------------------------------------------------------------
              precision    recall  f1-score   support

  non-rumour      0.860     0.898     0.878       205
       rumor      0.756     0.684     0.718        95

    accuracy                          0.830       300
   macro avg      0.808     0.791     0.798       300
weighted avg      0.827     0.830     0.828       300

--------------------------------------------------Logistic Regression------------------------------------------------------------
              precision    recall  f1-score   support

  non-rumour      0.762     0.966     0.852       205
       rumor      0.825     0.347     0.489        95

    accuracy                          0.770       300
   macro avg      0.793     0.657     0.670       300
weighted avg      0.782     0.770     0.737       300

