# Homework 1: Preprocessing and Text Classification

Student Name: Zihang Su

Student ID: 710118

Python version used: Python 3

## General info

<b>Due date</b>: 11pm, Sunday March 18th

<b>Submission method</b>: see LMS

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day

<b>Marks</b>: 5% of mark for class

<b>Overview</b>: In this homework, you'll be using a corpus of tweets to do tokenisation of hashtags and build polarity classifiers using bag-of-words (BOW) features.

<b>Materials</b>: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a few minutes, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Extra credit</b>: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

<b>Updates</b>: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

<b>Academic Misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


## Preprocessing

<b>Instructions</b>: For this homework we will be using the tweets in the <i>twitter_samples</i> corpus included with NLTK. You should start by accessing these tweets. Use the <i>strings</i> method included in the NLTK corpus reader for <i>twitter_samples</i> to access the tweets (as raw strings). Iterate over the full corpus, and print out the average length, in characters, of the tweets in the corpus. (0.5)


In [10]:
import nltk
nltk.download('twitter_samples')
from nltk.corpus import twitter_samples as ts

strings = ts.strings()
count = 0
total_length = 0
for tweet in strings:
    count += 1
    total_length += len(tweet)

average_length = total_length / count
print(average_length)

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
103.85176666666666


<b>Instructions</b>: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. You should use a regular expression to extract all hashtags of length 8 or longer which consist only of lower case letters (other than the # at the beginning, of course, though this should be stripped off as part of the extraction process). Do <b>not</b> tokenise the entire tweet as part of this process. The hashtag might occur at the beginning or the end of the tweet; you should double-check that you aren't missing any. After you have collected them into a list, print out the number of hashtags you have collected: for full credit, you must get the exact number that we expect.  (1.0)

In [11]:
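import re

# Match lowercase-only hashtags of 8+ letters that are delimited by whitespace
# or by the start/end of the tweet. A raw string stops Python from treating
# \s as a string escape.
hashtag_pattern = r'(?<=\s)#[a-z]{8,}(?=\s)|^#[a-z]{8,}(?=\s)|(?<=\s)#[a-z]{8,}$|^#[a-z]{8,}$'

hashtags = []
for string in strings:
    hashtags += re.findall(hashtag_pattern, string)

print(len(hashtags))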

1411
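
The boundary handling can be checked on a few made-up tweets. The sketch below is only an illustration: the sample strings and the name `HASHTAG_PATTERN` are assumptions, and the pattern is a compact variant that uses negative lookarounds ("not preceded/followed by a non-space character") plus a capture group that drops the `#`. It is intended to accept the same hashtags as the pattern in the cell above, but it is not the pattern that produced the count of 1411.

```python
import re

# Lowercase-only hashtags of 8+ letters, delimited by whitespace or string edges.
# The capture group returns the tag without the leading '#'.
HASHTAG_PATTERN = re.compile(r"(?<!\S)#([a-z]{8,})(?!\S)")

sample_tweets = [
    "#leadersdebate starts now",           # hashtag at the start of the tweet
    "watching the #leadersdebate",         # hashtag at the end of the tweet
    "#bedroomtax #disability",             # two hashtags separated by one space
    "too short #vote, mixed #BedroomTax",  # rejected: too short / contains capitals
    "email me at a#notahashtag",           # rejected: '#' not preceded by whitespace
]

found = [tag for tweet in sample_tweets for tag in HASHTAG_PATTERN.findall(tweet)]
print(found)
```

Using lookarounds rather than consuming the surrounding whitespace matters for the third sample: the single space between the two hashtags can serve as the right boundary of the first and the left boundary of the second.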


<b>Instructions</b>: Now, tokenise the hashtags you've collected. To do this, you should implement a reversed version of the MaxMatch algorithm discussed in class (and in the reading), where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching (see the starter code below). Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatiser before matching. Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenised hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list. (1.0)

In [12]:
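# nltk.corpus.words.words() returns a Python list; converting it to a set
# makes each "lemma in words" test a hash lookup instead of a list scan
words = set(nltk.corpus.words.words())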

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
max_reverse_matched = []

def reverse_max_match(string):
    length = len(string)
    i = 0
    while i < length:
        lemma = lemmatizer.lemmatize(string[i:])
        if lemma in words:
            matched = string[i:]   # keep the original word
            rest = string[0:i]
            return (matched, rest)
        i += 1
        
    # default to matching a single letter when there is no matched
    return (string[-1], string[0:-1])  

for word in hashtags:
    tokenised = []
    
    # remove the hashtag symbol
    word = re.sub("[#]", "", word).strip()
    while len(word) > 0:
        # match once per step, then continue with the still-unmatched prefix
        matched, word = reverse_max_match(word)
        tokenised.append(matched)
    max_reverse_matched.append(tokenised)
print(max_reverse_matched[-20:])

[['debate', 'leaders'], ['campaign', 'wow'], ['security', 'social'], ['lies', 'tory'], ['election'], ['c', 'b', 'b', 'd', 'ase', 'i', 'b'], ['doorstep', 'labour'], ['c', 'b', 'b', 'd', 'ase', 'i', 'b'], ['con', 'blab', 'li'], ['debate', 'c', 'b', 'b'], ['fandom', 'li', 'mi'], ['parliament', 'k', 'u'], ['tax', 'bedroom'], ['disability'], ['is', 'nab', 'can'], ['green', 'vote'], ['stings', 'u', 'h', 'li', 'el', 'lan', 'l'], ['tax', 'bedroom'], ['disability'], ['bankrupt']]


This cell took around 2 minutes to run on my laptop; the outcome is shown below. (You can look at the outcome while the program is running. Thank you for your patience.)

[['debate', 'leaders'], ['campaign', 'wow'], ['security', 'social'], ['lies', 'tory'], ['election'], ['c', 'b', 'b', 'd', 'ase', 'i', 'b'], ['doorstep', 'labour'], ['c', 'b', 'b', 'd', 'ase', 'i', 'b'], ['con', 'blab', 'li'], ['debate', 'c', 'b', 'b'], ['fandom', 'li', 'mi'], ['parliament', 'k', 'u'], ['tax', 'bedroom'], ['disability'], ['is', 'nab', 'can'], ['green', 'vote'], ['stings', 'u', 'h', 'li', 'el', 'lan', 'l'], ['tax', 'bedroom'], ['disability'], ['bankrupt']]
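
To make the right-to-left matching easier to follow, here is a minimal self-contained sketch of reverse MaxMatch on a toy vocabulary. It rests on several assumptions: `toy_vocabulary` and `reverse_max_match_tokens` are hypothetical names, the vocabulary is a hand-picked set rather than the NLTK word list, and lemmatisation is left out (so the inflected form "leaders" is simply listed in the toy vocabulary).

```python
def reverse_max_match_tokens(text, vocabulary):
    """Segment text right-to-left, always taking the longest matching suffix."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate suffix first (smallest start index).
        for start in range(end - 1):
            if text[start:end] in vocabulary:
                break
        else:
            # No multi-letter suffix matched: fall back to a single letter.
            start = end - 1
        tokens.append(text[start:end])
        end = start
    return tokens


# Hand-picked toy vocabulary (illustrative only); a set keeps lookups fast.
toy_vocabulary = {"leader", "leaders", "debate", "bed", "room", "bedroom", "tax"}

print(reverse_max_match_tokens("leadersdebate", toy_vocabulary))
print(reverse_max_match_tokens("bedroomtax", toy_vocabulary))
print(reverse_max_match_tokens("xqtax", toy_vocabulary))
```

As in the output above, tokens come out in the order they were matched, i.e. last word first. The third call shows the single-letter fallback for the unmatched prefix "xq".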

### Extra Credit (Optional)
<b>Instructions</b>: Implement the forward version of the MaxMatch algorithm as well, and print out all the hashtags which give different results for the two versions of MaxMatch. Your main task is to come up with a good way to select which of the two segmentations is better for any given case, and demonstrate that it works significantly better than using a single version of the algorithm for all hashtags. (1.0)

In [13]:
# find the hashtags that the two methods tokenise differently

def forward_max_match(string):
    # try the longest prefix first, shrinking until its lemma is in the word list
    length = len(string)
    i = length
    while i > 0:
        lemma = lemmatizer.lemmatize(string[0:i])
        if lemma in words:
            matched = string[0:i]
            rest = string[i:]
            return (matched, rest)
        i -= 1
    # default to matching a single letter when nothing matched
    return (string[0], string[1:])

def average_token_length(hashtag):
    return sum(len(token) for token in hashtag) / len(hashtag)

# get the list of tokenised hashtags
max_forward_matched = []
for word in hashtags:
    tokenised = []
    word = re.sub("[#]", "", word).strip()
    while len(word) > 0:
        # match once per step, then continue with the still-unmatched suffix
        matched, word = forward_max_match(word)
        tokenised.append(matched)
    max_forward_matched.append(tokenised)

# diff contains all the hashtags that the two methods tokenise differently
diff = []
optimal_match = []
for i in range(len(max_forward_matched)):
    forward = max_forward_matched[i]
    # reversed copy, so max_reverse_matched itself is left unchanged
    # and re-running this cell gives the same result
    reverse = max_reverse_matched[i][::-1]
    if forward != reverse:
        # the optimal match is the one with the larger average token length
        if average_token_length(forward) > average_token_length(reverse):
            optimal_match.append(forward)
        else:
            optimal_match.append(reverse)
        diff.append(hashtags[i])

print(optimal_match)

[['a', 'th', 'abas', 'ca'], ['explore', 'albe', 'r', 'ta'], ['ba', 'tal', 'lad', 'el', 'os', 'gall', 'os'], ['web', 'cam', 'sex'], ['ins', 'ta', 'gram'], ['add', 'm', 'eon', 'snap', 'chat'], ['k', 'i', 'k', 'sex', 'ting'], ['or', 'ca', 'love'], ['fresh', 'ers', 'to', 'finals'], ['undercover', 'bo', 'ss'], ['z', 'ay', 'nis', 'coming', 'back'], ['k', 'i', 'k', 'sex', 'ting'], ['g', 'i', 'ach', 'ie', 'tit', 'ti', 'wedding'], ['i', 'gers', 'of', 'the', 'day'], ['anyway', 'he', 'di', 'da', 'nice', 'job'], ['be', 'stof', 'the', 'day'], ['sa', 'bad', 'ode', 'ga', 'nar', 'seg', 'u', 'id', 'ores'], ['feels', 'li', 'k', 'ean', 'idiot'], ['matter', 'of', 'the', 'heart'], ['hot', 'f', 'm', 'noa', 'id', 'i', 'l', 'fora', 'ria', 'na'], ['han', 'ni', 'bal'], ['add', 'm', 'eon', 'snap', 'chat'], ['p', 'r', 'em', 'io', 'stum', 'undo'], ['a', 'us', 'fa', 'ilia'], ['k', 'i', 'k', 'sex', 'ting'], ['st', 'afford'], ['we', 'wan', 'tice', 'cream'], ['feel', 'good', 'f', 'rid', 'ay'], ['p', 'h', 'android'], [

Demonstration: For the problem above, I calculate the average token length of each segmentation and compare them. If the average length from forward MaxMatch is larger, the forward segmentation is used; otherwise the reverse segmentation is used. The chosen segmentations for the hashtags that the two methods tokenise differently are shown in the output above. (This might take around 5 minutes to run; please be patient.)
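
The selection rule can be sketched in isolation as follows. Both segmentations cover the same characters, so a larger average token length is the same as having fewer tokens. The two candidate segmentations below are hand-written for illustration and are not output of either MaxMatch version; `pick_segmentation` is a hypothetical helper name.

```python
def average_token_length(tokens):
    """Mean number of characters per token in one segmentation."""
    return sum(len(token) for token in tokens) / len(tokens)


def pick_segmentation(forward_tokens, reverse_tokens):
    """Prefer the forward result only when its average token is strictly longer."""
    if average_token_length(forward_tokens) > average_token_length(reverse_tokens):
        return forward_tokens
    return reverse_tokens


# Hand-written candidate segmentations of "undercoverboss" (illustrative only).
forward_candidate = ["undercover", "boss"]
reverse_candidate = ["under", "cover", "bo", "ss"]

chosen = pick_segmentation(forward_candidate, reverse_candidate)
print(chosen)
```

When the two averages tie, this rule falls back to the reverse segmentation, mirroring the `else` branch in the cell above.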

## Text classification (Not Optional)

<b>Instructions</b>: The twitter_samples corpus has two subcorpora corresponding to positive and negative tweets. You can access already tokenised versions using the <i>tokenized</i> method, as given in the code sample below. Iterate through these two corpora and build training, development, and test sets for use with Scikit-learn. You should exclude stopwords (from the built-in NLTK list) and tokens with non-alphabetic characters (it is very important that you do this: emoticons were used to build the corpus, so if you don't remove them, performance will be artificially high). You should randomly split each subcorpus, using 80% of the tweets for training, 10% for development, and 10% for testing; make sure you do this <b>before</b> combining the tweets from the positive/negative subcorpora, so that the sets are <i>stratified</i>, i.e. the exact ratio of positive and negative tweets is preserved across the three sets. (1.0)

In [14]:
import random
from nltk.corpus import stopwords

positive_tweets = nltk.corpus.twitter_samples.tokenized("positive_tweets.json")
negative_tweets = nltk.corpus.twitter_samples.tokenized("negative_tweets.json")

nltk.download("stopwords")
# use a separate name so the stopwords corpus reader is not overwritten
stop_words = set(stopwords.words("english"))

training = []
training_classification = []
development = []
development_classification = []
testing = []
testing_classification = []


def filter_tokens(tweet):
    # keep purely alphabetic tokens that are not stopwords
    return [token for token in tweet
            if token not in stop_words and re.match("^[A-Za-z]+$", token)]


def add_subcorpus(tweets, label):
    # shuffle a copy of the subcorpus, then cut it 80% / 10% / 10%
    shuffled = list(tweets)
    random.shuffle(shuffled)
    train_end = int(len(shuffled) * 0.8)
    dev_end = train_end + int(len(shuffled) * 0.1)
    test_end = dev_end + int(len(shuffled) * 0.1)
    parts = [
        (shuffled[:train_end], training, training_classification),
        (shuffled[train_end:dev_end], development, development_classification),
        (shuffled[dev_end:test_end], testing, testing_classification),
    ]
    for part, tweet_list, label_list in parts:
        for tweet in part:
            tweet_list.append(filter_tokens(tweet))
            label_list.append(label)


# split each subcorpus separately, so that the three sets stay stratified;
# each set holds its positive tweets first, followed by its negative tweets
add_subcorpus(positive_tweets, "positive")
add_subcorpus(negative_tweets, "negative")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\msi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
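
The effect of splitting each subcorpus before combining can be seen on toy data. In this sketch the item names, the helper `split_80_10_10`, and the fixed seed are all assumptions made for illustration; unlike the cell above, the test part here simply takes whatever remains after the training and development cuts.

```python
import random


def split_80_10_10(items, rng):
    """Shuffle a copy of items and cut it into 80% / 10% / remainder."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    train_end = int(len(shuffled) * 0.8)
    dev_end = train_end + int(len(shuffled) * 0.1)
    return shuffled[:train_end], shuffled[train_end:dev_end], shuffled[dev_end:]


rng = random.Random(0)  # fixed seed (an assumption) so reruns give the same split
toy_positive = [f"pos{i}" for i in range(20)]
toy_negative = [f"neg{i}" for i in range(20)]

splits = {"train": [], "dev": [], "test": []}
for label, subcorpus in (("positive", toy_positive), ("negative", toy_negative)):
    # Split each class on its own, then append to the shared sets.
    for name, part in zip(splits, split_80_10_10(subcorpus, rng)):
        splits[name].extend((item, label) for item in part)

for name, part in splits.items():
    positives = sum(1 for _, label in part if label == "positive")
    print(name, len(part), positives)
```

Every set ends up exactly half positive, which a single random split of the combined corpus would not guarantee.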


<b>Instructions</b>: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do <b>not</b> use crossvalidation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output. (1.0)

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report, f1_score
vectorizer = DictVectorizer()
nb_clf = MultinomialNB()
lr_clf = LogisticRegression()

# Get a bag of word from each tweet and count the word frequency
def get_feature(tweet):
    feature = {}
    for token in tweet:
        feature[token] = feature.get(token, 0) + 1
    return feature

# The feature space is defined by the training set only: the vectorizer is
# fitted on the training features, and transform() silently ignores any word
# that it did not see during fitting.
training_features = [get_feature(tweet) for tweet in training]
dev_test_features = [get_feature(tweet) for tweet in development + testing]
vectorizer.fit(training_features)

# Convert the lists of feature dictionaries to one sparse matrix whose rows
# are ordered training, development, testing
dataset = vectorizer.transform(training_features + dev_test_features)
dataset_length = dataset.shape[0]

# Row boundaries of the training and development sets inside dataset
# (slice indices must be integers)
train_end = len(training)
dev_end = train_end + len(development)

# for naive bayes: find the alpha which maximises the development accuracy
best_accuracy = 0
best_alpha = 0.1

# The settings of alpha are: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
# 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9
for alpha_setting in [x*0.1 for x in range(1, 10)] + list(range(1, 10)):
    nb_clf.set_params(alpha = alpha_setting)
    nb_clf.fit(dataset[0 : train_end], training_classification)
    predictions = nb_clf.predict(dataset[train_end : dev_end])
    accuracy = accuracy_score(development_classification, predictions)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_alpha = alpha_setting
        
# for LogisticRegression model: find the C which maximises the development accuracy
best_accuracy = 0
best_C = 0.1

# The settings of C are: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 
# 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9
for C_setting in [x*0.1 for x in range(1, 10)] + list(range(1, 10)):
    lr_clf.set_params(C = C_setting)
    lr_clf.fit(dataset[0 : train_end], training_classification)
    predictions = lr_clf.predict(dataset[train_end : dev_end])
    accuracy = accuracy_score(development_classification, predictions)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C_setting
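
To show what the smoothing parameter does and how a development set is used to choose it, here is a small pure-Python stand-in for the tuning loop. It is a sketch under stated assumptions, not scikit-learn's implementation: `train_naive_bayes`, `predict` and `accuracy` are hypothetical helpers, the tweets are made up, and `alpha` is plain add-alpha smoothing of the per-class word counts.

```python
import math
from collections import Counter


def train_naive_bayes(tweets, labels, alpha):
    """Fit multinomial Naive Bayes with add-alpha smoothing on token lists."""
    vocabulary = {token for tweet in tweets for token in tweet}
    token_counts = {label: Counter() for label in set(labels)}
    for tweet, label in zip(tweets, labels):
        token_counts[label].update(tweet)
    model = {}
    for label, counts in token_counts.items():
        log_prior = math.log(labels.count(label) / len(labels))
        denominator = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood = {
            token: math.log((counts[token] + alpha) / denominator)
            for token in vocabulary
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, tweet):
    def score(label):
        log_prior, log_likelihood = model[label]
        # Tokens never seen in training carry no evidence and are skipped.
        return log_prior + sum(log_likelihood[t] for t in tweet if t in log_likelihood)
    return max(sorted(model), key=score)


def accuracy(model, tweets, labels):
    correct = sum(predict(model, tweet) == label for tweet, label in zip(tweets, labels))
    return correct / len(labels)


# Toy data, made up for illustration.
train_tweets = [["good", "happy"], ["love", "good"], ["bad", "sad"], ["hate", "bad"]]
train_labels = ["positive", "positive", "negative", "negative"]
dev_tweets = [["happy", "love"], ["sad", "hate"], ["good", "unseen"]]
dev_labels = ["positive", "negative", "positive"]

best_alpha, best_accuracy = None, -1.0
for alpha in [0.1, 0.5, 1, 2, 5]:
    model = train_naive_bayes(train_tweets, train_labels, alpha)
    dev_accuracy = accuracy(model, dev_tweets, dev_labels)
    print(f"alpha={alpha}: development accuracy={dev_accuracy:.3f}")
    # Strict '>' keeps the first setting among ties.
    if dev_accuracy > best_accuracy:
        best_alpha, best_accuracy = alpha, dev_accuracy
```

On a toy set this small every setting classifies the development tweets correctly, so the loop keeps the first value it tried. On real data the printed accuracies are what show whether the chosen value is near-optimal.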


<b>Instructions</b>: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macroaveraged f-score for each classifier. Be sure to label your output. (0.5)

In [8]:
# Row boundaries of the training and development sets inside dataset
# (slice indices must be integers)
train_end = len(training)
dev_end = train_end + len(development)

nb_clf.set_params(alpha = best_alpha)
nb_clf.fit(dataset[0 : train_end], training_classification)
predictions = nb_clf.predict(dataset[dev_end : dataset_length])
accuracy = accuracy_score(testing_classification, predictions)
f_score = f1_score(testing_classification, predictions, average='macro')

print("Multinomial Naive Bayes\n")
print("accuracy: " + str(accuracy))
print("macroaveraged f_score: " + str(f_score))

print("\n")
lr_clf.set_params(C = best_C)
lr_clf.fit(dataset[0 : train_end], training_classification)
predictions = lr_clf.predict(dataset[dev_end : dataset_length])
accuracy = accuracy_score(testing_classification, predictions)
f_score = f1_score(testing_classification, predictions, average='macro')

print("Logistic Regression\n")
print("accuracy: " + str(accuracy))
print("macroaveraged f_score: " + str(f_score))

Multinomial Naive Bayes

accuracy: 0.737
macroaveraged f_score: 0.73651281219


Logistic Regression

accuracy: 0.747
macroaveraged f_score: 0.746866092163
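
For reference, the macro-averaged F-score reported above is the unweighted mean of the per-class F1 scores. The following sketch computes it by hand on made-up labels; `macro_f1` is a hypothetical helper written for illustration, not the scikit-learn function, and the labels are not taken from the test set.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        true_pos = sum(g == label and p == label for g, p in pairs)
        false_pos = sum(g != label and p == label for g, p in pairs)
        false_neg = sum(g == label and p != label for g, p in pairs)
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up gold labels and predictions (illustrative only).
gold = ["positive", "positive", "positive", "negative", "negative"]
predicted = ["positive", "positive", "negative", "negative", "negative"]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print("accuracy:", accuracy)
print("macroaveraged f_score:", round(macro_f1(gold, predicted), 3))
```

Here each class has an F1 of about 0.8 (precision 1 and recall 2/3 for positive, the reverse for negative), so the macro average is also about 0.8. With balanced classes, as in this corpus, accuracy and macro F1 tend to be close, which matches the results above.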
