# Homework 1: Preprocessing and Text Classification

Student Name: KangKe

Student ID: 745384

Python version used: 2.7

## General info

<b>Due date</b>: 5pm, Thursday March 16

<b>Submission method</b>: see LMS

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -20% per day

<b>Marks</b>: 5% of mark for class

<b>Overview</b>: In this homework, you'll be using a corpus of tweets to do tokenisation of hashtags and build polarity classifiers using bag-of-words (BOW) features.

<b>Materials</b>: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks.

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

<b>Extra credit</b>: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

<b>Updates</b>: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

<b>Academic Misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


## Preprocessing

<b>Instructions</b>: For this homework we will be using the tweets in the <i>twitter_samples</i> corpus included with NLTK. You should start by accessing these tweets. Use the <i>strings</i> method included in the NLTK corpus reader for <i>twitter_samples</i> to access the tweets (as raw strings). Iterate over the full corpus, and print out the average length, in characters, of the tweets in the corpus. (0.5)


In [2]:
import nltk
from nltk.corpus import twitter_samples

twitters = twitter_samples.strings()
# Average tweet length in characters; float() avoids Python 2 integer division.
average_length = sum(len(t) for t in twitters) / float(len(twitters))
print average_length

103.887266667


<b>Instructions</b>: Hashtags (i.e. topic tags which start with #) pose an interesting tokenisation problem because they often include multiple words written without spaces or capitalization. You should use a regular expression to extract all hashtags of length 8 or longer which consist only of lower case letters (other than the # at the beginning, of course, though this should be stripped off as part of the extraction process). Do <b>not</b> tokenise the entire tweet as part of this process. The hashtag might occur at the beginning or the end of the tweet; you should double-check that you aren't missing any. After you have collected them into a list, print out the number of hashtags you have collected: for full credit, you must get the exact number that we expect. (1.0)

In [2]:
import re
hashtags = []
for twitter in twitters:
    # Three alternatives: tag at the start of the tweet, in the middle, or at the end.
    hashtags.extend(re.findall(
        r"(?<=^#)[a-z]{8,}(?=\s)|(?<=\s#)[a-z]{8,}(?=\s)|(?<=\s#)[a-z]{8,}$",
        twitter
    ))
print len(hashtags)

1411
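
The three-way alternation in the cell above can be collapsed into a single pattern by using negative lookarounds that mean "not next to a non-space character". The following is a minimal sketch, not the expected solution: `HASHTAG_RE`, `extract_hashtags` and the toy tweets are made up for illustration. Unlike the pattern above, it also accepts a tweet that consists of nothing but a hashtag, so its count on the real corpus may differ from 1411.

```python
import re

# Hypothetical helper names; the pattern is a sketch, not the graded answer.
# (?<!\S)  the '#' must be at the start of the tweet or preceded by whitespace
# (?!\S)   the tag must be followed by whitespace or the end of the tweet
HASHTAG_RE = re.compile(r"(?<!\S)#([a-z]{8,})(?!\S)")


def extract_hashtags(tweets):
    """Return every all-lowercase hashtag of 8+ letters, without the '#'."""
    found = []
    for tweet in tweets:
        found.extend(HASHTAG_RE.findall(tweet))
    return found


toy_tweets = [
    "#throwbackthursday was fun",   # tag at the start
    "so happy #feelgoodfriday",     # tag at the end
    "#indiemusic",                  # tweet is only a tag
    "too #short here",              # fewer than 8 letters
    "mixed #CamelCaseTag x",        # contains upper case
    "#bestoftheday!",               # followed by punctuation
]
print(extract_hashtags(toy_tweets))
```

On the toy tweets only the first three tags are returned; the last three are rejected for the reasons given in the comments.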


<b>Instructions</b>: Now, tokenise the hashtags you've collected. To do this, you should implement a reversed version of the MaxMatch algorithm discussed in class (and in the reading), where matching begins at the end of the hashtag and progresses backwards. NLTK has a list of words that you can use for matching (see the starter code below). Be careful about efficiency with respect to doing word lookups. One extra challenge you have to deal with is that the provided list of words includes only lemmas: your MaxMatch algorithm should match inflected forms by converting them into lemmas using the NLTK lemmatiser before matching. Note that the list of words is incomplete, and, if you are unable to make any longer match, your code should default to matching a single letter. Create a new list of tokenised hashtags (this should be a list of lists of strings) and use slicing to print out the last 20 hashtags in the list. (1.0)

In [3]:
words = set(nltk.corpus.words.words()) # a set gives constant-time membership tests
wnl = nltk.WordNetLemmatizer()
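def MaxRevMatch(hashtags):
    """Segment one hashtag right to left, longest dictionary suffix first."""
    length = len(hashtags)
    if length == 0:
        return []
    # Try suffixes from longest (the whole string) to shortest (one letter).
    for i in range(0, length):
        firstword = hashtags[i:]
        remainder = hashtags[0:i]
        if wnl.lemmatize(firstword) in words:
            return MaxRevMatch(remainder) + [firstword]

    # No dictionary match: default to the final single letter.
    firstword = hashtags[-1]
    remainder = hashtags[0:-1]
    return MaxRevMatch(remainder) + [firstword]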

In [4]:
tokenised_hashtags = []
hashtags1 = hashtags[-20:]
for h in hashtags1:
    tokenised_hashtags = tokenised_hashtags + [MaxRevMatch(h)]
    
print tokenised_hashtags[-20:]

[[u'leaders', u'debate'], [u'wow', u'campaign'], [u'social', u'security'], [u'tory', u'lies'], [u'election'], [u'b', u'i', u'ase', u'd', u'b', u'b', u'c'], [u'labour', u'doorstep'], [u'b', u'i', u'ase', u'd', u'b', u'b', u'c'], [u'li', u'blab', u'con'], [u'b', u'b', u'c', u'debate'], [u'mi', u'li', u'fandom'], [u'u', u'k', u'parliament'], [u'bedroom', u'tax'], [u'disability'], [u'can', u'nab', u'is'], [u'vote', u'green'], [u'l', u'lan', u'el', u'li', u'h', u'u', u'stings'], [u'bedroom', u'tax'], [u'disability'], [u'bankrupt']]
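
Two things can slow down a recursive MaxMatch like the one above: membership tests against a plain Python list scan the whole list, and every recursive call builds a new list by concatenation. The sketch below is an iterative variant under stated assumptions: `reverse_max_match`, the toy vocabulary and the toy `strip_s` lemmatiser are hypothetical stand-ins for the NLTK word list and `WordNetLemmatizer`, and the vocabulary is a set so that each lookup is a hash lookup.

```python
def reverse_max_match(text, vocabulary, lemmatize=lambda word: word):
    """Segment text right to left, always taking the longest matching suffix."""
    tokens = []
    end = len(text)
    while end > 0:
        for start in range(end):           # longest candidate first
            if lemmatize(text[start:end]) in vocabulary:
                break
        else:
            start = end - 1                # no match: fall back to one letter
        tokens.append(text[start:end])
        end = start
    tokens.reverse()                       # tokens were collected back to front
    return tokens


def strip_s(word):
    """Toy lemmatiser (assumption): drop one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


toy_vocabulary = {"bedroom", "room", "tax", "tory", "lie", "debate"}
print(reverse_max_match("bedroomtax", toy_vocabulary))
print(reverse_max_match("torylies", toy_vocabulary, strip_s))
print(reverse_max_match("bbcdebate", toy_vocabulary))
```

The last call shows the single-letter fallback: `bbc` is not in the toy vocabulary, so it is emitted one character at a time.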


### Extra Credit (Optional)
<b>Instructions</b>: Implement the forward version of the MaxMatch algorithm as well, and print out all the hashtags which give different results for the two versions of MaxMatch. Your main task is to come up with a good way to select which of the two segmentations is better for any given case, and demonstrate that it works significantly better than using a single version of the algorithm for all hashtags. (1.0)

In [5]:
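def MaxMatch(hashtags):
    """Segment one hashtag left to right, longest dictionary prefix first."""
    length = len(hashtags)
    if length == 0:
        return []
    # Try prefixes from longest (the whole string) to shortest (one letter).
    for i in range(0, length):
        firstword = hashtags[0:length-i]
        remainder = hashtags[length-i:]
        if wnl.lemmatize(firstword) in words:
            return [firstword] + MaxMatch(remainder)

    # No dictionary match: default to the first single letter.
    firstword = hashtags[0]
    remainder = hashtags[1:]
    return [firstword] + MaxMatch(remainder)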

In [6]:
tokenised_hashtags1 = []
for h in hashtags1:
    tokenised_hashtags1 = tokenised_hashtags1 + [MaxMatch(h)]
    
print tokenised_hashtags1[-20:]

[[u'leaders', u'debate'], [u'wow', u'campaign'], [u'socials', u'e', u'cur', u'it', u'y'], [u'tory', u'lies'], [u'election'], [u'bias', u'e', u'd', u'b', u'b', u'c'], [u'labour', u'doorstep'], [u'bias', u'e', u'd', u'b', u'b', u'c'], [u'li', u'blab', u'con'], [u'b', u'b', u'c', u'debate'], [u'mil', u'if', u'and', u'om'], [u'u', u'k', u'parliament'], [u'bedroom', u'tax'], [u'disability'], [u'canna', u'b', u'is'], [u'vote', u'green'], [u'l', u'lane', u'l', u'li', u'husting', u's'], [u'bedroom', u'tax'], [u'disability'], [u'bankrupt']]
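
The forward variant mirrors the reverse one: take the longest matching prefix, then continue with the remainder. The sketch below uses a hypothetical `forward_max_match` helper and a toy vocabulary chosen so that it reproduces the `socialsecurity` split printed above. It omits lemmatisation for brevity, so `socials` is listed in the toy vocabulary directly.

```python
def forward_max_match(text, vocabulary):
    """Segment text left to right, always taking the longest matching prefix."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):   # longest candidate first
            if text[start:end] in vocabulary:
                break
        else:
            end = start + 1                       # no match: take one letter
        tokens.append(text[start:end])
        start = end
    return tokens


# Toy vocabulary (assumption), not the NLTK word list.
toy_vocabulary = {"social", "socials", "security", "cur", "it"}
print(forward_max_match("socialsecurity", toy_vocabulary))
```

Because the greedy prefix `socials` swallows the first letter of `security`, the rest of the hashtag can no longer be matched as a whole word, which is the kind of error the reverse direction avoids for this hashtag.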


In [7]:
print "Hashtags segmented differently by forward and reverse MaxMatch:"
for h in hashtags:
    forward_list = MaxMatch(h)
    reverse_list = MaxRevMatch(h)
    if forward_list != reverse_list:
        print h

Hashtags segmented differently by forward and reverse MaxMatch:
athabasca
explorealberta
batalladelosgallos
webcamsex
instagram
addmeonsnapchat
kiksexting
orcalove
fresherstofinals
undercoverboss
zayniscomingback
kiksexting
giachietittiwedding
igersoftheday
anywayhedidanicejob
bestoftheday
sabadodeganarseguidores
feelslikeanidiot
matteroftheheart
hotfmnoaidilforariana
hannibal
addmeonsnapchat
premiostumundo
ausfailia
kiksexting
stafford
wewanticecream
feelgoodfriday
phandroid
sexysasunday
zayniscomingback
kikmenow
sabadodeganarseguidores
cyclerevolution
bestoftheday
bestoftheday
sexysasunday
selfshot
whereisthesun
summerismissing
zayniscomingback
kikmenow
indiemusic
selfshot
elfindelmundo
webcamsex
mybrainneedstoshutoff
deathbybaconsmell
imintoher
photooftheday
indiemusic
indiemusic
indiemusic
indiemusic
indiemusic
hannibal
hotfmnoaidilforariana
photooftheday
indiemusic
justgotkanekified
indiemusic
indiemusic
kikmenow
kikmenow
louisiana
indiemusic
docopenhagen
phon

In [8]:
def compare_max_Maxtch(hashtags):
    forward_list = MaxMatch(hashtags)
    reverse_list = MaxRevMatch(hashtags)
    if len(forward_list) >= len(reverse_list):
        return reverse_list
    else:
        return forward_list

In [9]:
tokenised_hashtags = []
for h in hashtags1:
    tokenised_hashtags = tokenised_hashtags + [compare_max_Maxtch(h)]
    
print tokenised_hashtags

[[u'leaders', u'debate'], [u'wow', u'campaign'], [u'social', u'security'], [u'tory', u'lies'], [u'election'], [u'bias', u'e', u'd', u'b', u'b', u'c'], [u'labour', u'doorstep'], [u'bias', u'e', u'd', u'b', u'b', u'c'], [u'li', u'blab', u'con'], [u'b', u'b', u'c', u'debate'], [u'mi', u'li', u'fandom'], [u'u', u'k', u'parliament'], [u'bedroom', u'tax'], [u'disability'], [u'can', u'nab', u'is'], [u'vote', u'green'], [u'l', u'lane', u'l', u'li', u'husting', u's'], [u'bedroom', u'tax'], [u'disability'], [u'bankrupt']]



<b>As the results above show, 'llanellihustings' gets the shorter segmentation from the forward match, while 'milifandom' gets the shorter one from the reverse match. The compare_max_Maxtch function returns the shorter of the two in both cases.
The function compares the lengths of the two token lists. Whenever part of a hashtag cannot be matched, the algorithm falls back to single-character tokens, and each fallback adds an extra token to the list. More mismatches therefore mean more one-character tokens and a longer list, so choosing the shorter list tends to give the better segmentation. When both lists have the same length, the reverse result is returned.</b>
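
The selection rule described above (prefer the segmentation with fewer tokens, and keep the reverse result on a tie, as `compare_max_Maxtch` does) can be sketched as a small standalone function. `pick_segmentation` is a hypothetical name, and the token lists are copied from the outputs printed earlier in this notebook. Counting tokens is only a heuristic: it cannot separate two different splits that happen to have the same length.

```python
def pick_segmentation(forward_tokens, reverse_tokens):
    """Prefer the split with fewer tokens; on a tie keep the reverse split."""
    if len(forward_tokens) < len(reverse_tokens):
        return forward_tokens
    return reverse_tokens


# 'milifandom': the reverse split is shorter, so it is chosen.
forward = ["mil", "if", "and", "om"]
reverse = ["mi", "li", "fandom"]
print(pick_segmentation(forward, reverse))

# 'biasedbbc': the forward split is shorter, so it is chosen.
forward = ["bias", "e", "d", "b", "b", "c"]
reverse = ["b", "i", "ase", "d", "b", "b", "c"]
print(pick_segmentation(forward, reverse))
```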


## Text classification (Not Optional)

<b>Instructions</b>: The twitter_samples corpus has two subcorpora corresponding to positive and negative tweets. You can access already tokenised versions using the <i> tokenized </i> method, as given in the code sample below. Iterate through these two corpora and build training, development, and test sets for use with Scikit-learn. You should exclude stopwords (from the built-in NLTK list) and tokens with non-alphabetic characters (it is very important that you do this: emoticons were used to build the corpus, so if you don't remove them, performance will be artificially high). You should randomly split each subcorpus, using 80% of the tweets for training, 10% for development, and 10% for testing; make sure you do this <b>before</b> combining the tweets from the positive/negative subcorpora, so that the sets are <i>stratified</i>, i.e. the exact ratio of positive and negative tweets is preserved across the three sets. (1.0)

In [3]:
positive_tweets = nltk.corpus.twitter_samples.tokenized("positive_tweets.json")
negative_tweets = nltk.corpus.twitter_samples.tokenized("negative_tweets.json")

In [4]:
def get_BOW(text):
    BOW = {}
    for word in text:
        BOW[word] = BOW.get(word,0) + 1
    return BOW

In [5]:
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

pos_tweets = []
neg_tweets = []
pos_classification = []
neg_classification = []

for twtt in positive_tweets:
    new_twtt = []
    for w in twtt:
        if w.isalpha() and w not in stopwords:
            new_twtt.append(w)
    pos_tweets.append(get_BOW(new_twtt))
    pos_classification.append("positive")

for twtt in negative_tweets:
    new_twtt = []
    for w in twtt:
        if w.isalpha() and w not in stopwords:
            new_twtt.append(w)
    neg_tweets.append(get_BOW(new_twtt))
    neg_classification.append("negative")

In [13]:
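try:
    from sklearn.model_selection import train_test_split   # scikit-learn 0.18 and later
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases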
pos_train, pos_dev_test, pos_class, pos_dev_test_class  = train_test_split(pos_tweets, 
                                                                           pos_classification,
                                                                           test_size = 0.2
                                                                           )

pos_dev, pos_test, pos_dev_class, pos_test_class = train_test_split(pos_dev_test, 
                                                                    pos_dev_test_class,
                                                                    test_size = 0.5
                                                                    )



In [14]:
neg_train, neg_dev_test, neg_class, neg_dev_test_class  = train_test_split(neg_tweets, 
                                                                           neg_classification,
                                                                           test_size = 0.2
                                                                           )

neg_dev, neg_test, neg_dev_class, neg_test_class = train_test_split(neg_dev_test, 
                                                                    neg_dev_test_class,
                                                                    test_size = 0.5
                                                                    )

In [15]:
twitter_train = pos_train + neg_train
train_classes = pos_class + neg_class

twitter_dev = pos_dev + neg_dev
dev_classes = pos_dev_class + neg_dev_class

twitter_test = pos_test + neg_test
test_classes = pos_test_class + neg_test_class
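
As a compact illustration of the steps above (filter the tokens, count them, split each class 80/10/10, then combine), here is a sketch that uses only the standard library. The names `bag_of_words` and `split_80_10_10`, the toy stopword set and the toy tweets are assumptions for illustration; the notebook itself uses NLTK's stopword list and scikit-learn's `train_test_split`. A fixed seed makes the shuffle repeatable.

```python
import random
from collections import Counter


def bag_of_words(tokens, stopwords):
    """Count alphabetic, non-stopword tokens (same filter as the cells above)."""
    return Counter(t for t in tokens if t.isalpha() and t not in stopwords)


def split_80_10_10(items, seed=0):
    """Shuffle a copy of items and cut it into train/dev/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10      # integer arithmetic, no rounding issues
    n_dev = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


toy_stopwords = {"the", "is"}                          # stand-in for NLTK's list
toy_positive = [["the", "day", "is", "great", ":)"]] * 50
toy_negative = [["the", "day", "is", "awful", ":("]] * 50

# Split each class separately, then combine, so every set stays balanced.
splits = {}
for label, tweets in (("positive", toy_positive), ("negative", toy_negative)):
    bags = [(bag_of_words(t, toy_stopwords), label) for t in tweets]
    splits[label] = split_80_10_10(bags)

train, dev, test = (splits["positive"][i] + splits["negative"][i]
                    for i in range(3))
print((len(train), len(dev), len(test)))
```

Keeping each bag paired with its label until after the split means the features and classes cannot drift out of alignment when the lists are concatenated.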

<b>Instructions</b>: Now, let's build some classifiers. Here, we'll be comparing Naive Bayes and Logistic Regression. For each, you need to first find a good value for their main regularisation (hyper)parameters, which you should identify using the scikit-learn docs or other resources. Use the development set you created for this tuning process; do <b>not</b> use crossvalidation in the training set, or involve the test set in any way. You don't need to show all your work, but you do need to print out the accuracy with enough different settings to strongly suggest you have found an optimal or near-optimal choice. We should not need to look at your code to interpret the output. (1.0)

In [16]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.feature_extraction import DictVectorizer

In [17]:
vectorizer = DictVectorizer()
twitter = twitter_train + twitter_dev + twitter_test
twitter = vectorizer.fit_transform(twitter)

In [18]:
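# Slice by the actual set sizes rather than hard-coded row numbers.
n_train = len(train_classes)
n_dev = len(dev_classes)
twitter_train = twitter[:n_train]
twitter_dev = twitter[n_train:n_train + n_dev]
twitter_test = twitter[n_train + n_dev:]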
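
The cells above fit the `DictVectorizer` on the training, development and test tweets together and then slice the matrix by position. A common alternative is to learn the feature vocabulary from the training set only and reuse it for the other sets. The pure-Python sketch below imitates that fit/transform separation on toy data; `fit_vocabulary` and `to_row` are hypothetical helpers, not scikit-learn functions, and features that were not seen during fitting are simply dropped (scikit-learn's documentation describes the same behaviour for `DictVectorizer.transform`).

```python
def fit_vocabulary(bags):
    """Map every feature name seen in bags to a column index."""
    names = sorted(set(name for bag in bags for name in bag))
    return dict((name, column) for column, name in enumerate(names))


def to_row(bag, vocabulary):
    """Turn one bag of words into a dense count row; unseen words are dropped."""
    row = [0] * len(vocabulary)
    for name, count in bag.items():
        if name in vocabulary:
            row[vocabulary[name]] = count
    return row


# Toy bags of words (assumption), in the same dict format as get_BOW produces.
train_bags = [{"good": 2, "day": 1}, {"bad": 1, "day": 1}]
dev_bags = [{"good": 1, "unseen": 3}]

vocabulary = fit_vocabulary(train_bags)          # learned from training only
train_rows = [to_row(bag, vocabulary) for bag in train_bags]
dev_rows = [to_row(bag, vocabulary) for bag in dev_bags]
print(train_rows)
print(dev_rows)
```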

In [19]:
print "the (alpha, accuracy) pairs in MultinomialNB:"
best_accuracy = 0.0
best_alpha = 0.0
for alpha_value in range(1,30):
    clf = MultinomialNB(alpha=alpha_value*0.1)
    clf.fit(twitter_train, train_classes)
    prediciton = clf.predict(twitter_dev)
    accuracy = accuracy_score(dev_classes,prediciton)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_alpha = alpha_value*0.1
    print (alpha_value*0.1, accuracy)

the (alpha, accuracy) pairs in MultinomialNB:
(0.1, 0.751)
(0.2, 0.75700000000000001)
(0.30000000000000004, 0.76000000000000001)
(0.4, 0.755)
(0.5, 0.75)
(0.6000000000000001, 0.75600000000000001)
(0.7000000000000001, 0.754)
(0.8, 0.753)
(0.9, 0.754)
(1.0, 0.75600000000000001)
(1.1, 0.755)
(1.2000000000000002, 0.754)
(1.3, 0.752)
(1.4000000000000001, 0.753)
(1.5, 0.751)
(1.6, 0.752)
(1.7000000000000002, 0.751)
(1.8, 0.751)
(1.9000000000000001, 0.751)
(2.0, 0.75)
(2.1, 0.748)
(2.2, 0.746)
(2.3000000000000003, 0.746)
(2.4000000000000004, 0.746)
(2.5, 0.746)
(2.6, 0.747)
(2.7, 0.747)
(2.8000000000000003, 0.747)
(2.9000000000000004, 0.746)


In [20]:
clf = MultinomialNB(alpha=best_alpha)
clf.fit(twitter_train, train_classes)
prediciton = clf.predict(twitter_dev)
print "the best alpha value in MultinomialNB is "+str(best_alpha)+ " with accuracy:" 
print accuracy_score(dev_classes,prediciton)

the best alpha value in MultinomialNB is 0.3 with accuracy:
0.76


In [21]:
print "the (C, accuracy) pairs in LogisticRegression:"
best_accuracy = 0.0
best_C = 0.0
for C_value in range(1,30):
    clf1 = LogisticRegression(C=C_value*0.1)
    clf1.fit(twitter_train, train_classes)
    prediciton1 = clf1.predict(twitter_dev)
    accuracy = accuracy_score(dev_classes,prediciton1)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_C = C_value*0.1
    print (C_value*0.1, accuracy)


the (C, accuracy) pairs in LogisticRegression:
(0.1, 0.72599999999999998)
(0.2, 0.73099999999999998)
(0.30000000000000004, 0.73199999999999998)
(0.4, 0.73499999999999999)
(0.5, 0.73499999999999999)
(0.6000000000000001, 0.73799999999999999)
(0.7000000000000001, 0.73399999999999999)
(0.8, 0.73799999999999999)
(0.9, 0.73699999999999999)
(1.0, 0.73699999999999999)
(1.1, 0.73599999999999999)
(1.2000000000000002, 0.73599999999999999)
(1.3, 0.73699999999999999)
(1.4000000000000001, 0.73499999999999999)
(1.5, 0.73699999999999999)
(1.6, 0.73799999999999999)
(1.7000000000000002, 0.73899999999999999)
(1.8, 0.73799999999999999)
(1.9000000000000001, 0.73799999999999999)
(2.0, 0.73799999999999999)
(2.1, 0.73799999999999999)
(2.2, 0.74099999999999999)
(2.3000000000000003, 0.74099999999999999)
(2.4000000000000004, 0.74099999999999999)
(2.5, 0.74099999999999999)
(2.6, 0.73899999999999999)
(2.7, 0.73899999999999999)
(2.8000000000000003, 0.73899999999999999)
(2.9000000000000004, 0.73899999999999999)


In [22]:
clf1 = LogisticRegression(C = best_C)
clf1.fit(twitter_train, train_classes)
prediciton1 = clf1.predict(twitter_dev)
print "the best C value in LogisticRegression is "+str(best_C)+ " with accuracy:" 
print accuracy_score(dev_classes,prediciton1)

the best C value in LogisticRegression is 2.2 with accuracy:
0.741
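
Both sweeps above step linearly from 0.1 to 2.9. Regularisation strengths are often explored on a logarithmic grid first, since their effect tends to vary over orders of magnitude, and then refined around the best value. The sketch below shows only the search loop: `tune` and `toy_dev_score` are hypothetical, the scores are made up, and in the notebook the scoring function would fit a classifier on the training set and return its accuracy on the development set.

```python
def tune(candidates, score):
    """Return (best_value, best_score, all_results) for a scoring function."""
    results = [(value, score(value)) for value in candidates]
    best_value, best_score = max(results, key=lambda pair: pair[1])
    return best_value, best_score, results


def toy_dev_score(strength):
    """Made-up stand-in for 'fit on train, score on dev'; peaks at 1.0."""
    return 1.0 / (1.0 + abs(strength - 1.0))


log_grid = [10.0 ** exponent for exponent in range(-3, 3)]   # 0.001 ... 100
best_value, best_score, results = tune(log_grid, toy_dev_score)
for value, accuracy in results:
    print("%8.3f  score=%.3f" % (value, accuracy))
print("best value: %g (score %.3f)" % (best_value, best_score))
```

Formatting the values with `%` also avoids float artefacts such as `0.30000000000000004` in the printed table.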


<b>Instructions</b>: Using the best settings you have found, compare the two classifiers based on performance in the test set. Print out both accuracy and macroaveraged f-score for each classifier. Be sure to label your output. (0.5)

In [23]:
from sklearn.metrics import f1_score

In [24]:
clf = MultinomialNB(alpha=best_alpha)
clf.fit(twitter_train, train_classes)
prediciton = clf.predict(twitter_test)
print "the best MultinomialNB accuracy:"
print accuracy_score(test_classes,prediciton)
print "f1_score:"
print f1_score(test_classes,prediciton, average='macro')  

the best MultinomialNB accuracy:
0.737
f1_score:
0.736978695274


In [25]:
clf1 = LogisticRegression(C=best_C)
clf1.fit(twitter_train, train_classes)
prediciton1 = clf1.predict(twitter_test)
print "the best LogisticRegression accuracy:"
print accuracy_score(test_classes,prediciton1)
print "f1_score:"
print f1_score(test_classes,prediciton1, average='macro')  

the best LogisticRegression accuracy:
0.73
f1_score:
0.729723236594
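
For reference, the macro-averaged F-score is the unweighted mean of the per-class F1 values, so each class counts equally regardless of its size. The sketch below computes it by hand on toy labels; `macro_f1` is a hypothetical helper written for illustration, and the notebook itself relies on `f1_score(..., average='macro')` from scikit-learn.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2.0 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels (assumption): one positive tweet is misclassified as negative.
toy_true = ["positive", "positive", "negative", "negative"]
toy_pred = ["positive", "negative", "negative", "negative"]
matches = sum(1 for t, p in zip(toy_true, toy_pred) if t == p)
accuracy = matches / float(len(toy_true))
print("accuracy: %.3f" % accuracy)
print("macro F1: %.3f" % macro_f1(toy_true, toy_pred))
```

Here the positive class has F1 = 2/3 and the negative class has F1 = 0.8, so the macro average is slightly below the accuracy of 0.75.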
