In [54]:
from IPython.display import HTML
HTML('''<script>
    code_show=true; 
    function code_toggle() {
     if (code_show){
     $('div.input').hide();
     } else {
     $('div.input').show();
     }
     code_show = !code_show
    } 
    $( document ).ready(code_toggle);
    </script>
    The raw code for this IPython notebook is by default hidden for easier reading.
    To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')

# 1. Tokenization

## Analysis
1. CoreNLP separates punctuation into its own tokens, including commas, dollar signs, quotation marks, and double dashes.
2. CoreNLP keeps words joined by a single dash, such as buy-back, as one token, whereas the Lucene StandardTokenizer splits them into separate words.
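
As a rough illustration of the two behaviours described above, the sketch below imitates them with regular expressions. Neither function is the real CoreNLP or Lucene tokenizer; the patterns, the function names, and the sample sentence are assumptions made for this example only.

```python
import re

def ptb_like_tokens(text):
    # Imitation of the CoreNLP behaviour noted above: "--" is one token,
    # hyphenated words stay whole, every other punctuation mark is a token.
    return re.findall(r"--|\w+(?:-\w+)*|[^\w\s]", text)

def word_only_tokens(text):
    # Imitation of the StandardTokenizer behaviour noted above:
    # punctuation is dropped and hyphenated words are split.
    return re.findall(r"\w+", text)

sample = 'The buy-back, worth $5 -- was "big"'
print(ptb_like_tokens(sample))
# ['The', 'buy-back', ',', 'worth', '$', '5', '--', 'was', '"', 'big', '"']
print(word_only_tokens(sample))
# ['The', 'buy', 'back', 'worth', '5', 'was', 'big']
```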


**CODE IS IN JAVA CLASS "Tokenization"; RESULTS ARE SHOWN IN q1.txt**

# 2. Normalization

## Analysis

### lemmatizer and stemmer
1. The CoreNLP lemmatizer doesn't remove punctuation, while the stemmers do.
2. The CoreNLP lemmatizer does not change the plural of nouns (words vs. word), case (we vs. us), tense (said vs. say), or comparison (easy vs. easiest).
3. The CoreNLP lemmatizer does not change the capitalization of proper nouns.
4. CoreNLP changes "is/are" into "be".

### different stemmers:
1. KStemFilter strips punctuation and dollar signs.
2. PorterFilter transforms words by removing or replacing suffixes (temporarily -> temporarili).
3. EnglishFilter converts uppercase to lowercase and filters out some built-in stopwords, such as "is" and "the".

**CODE IS IN JAVA CLASS "Normalization"; RESULTS ARE SHOWN IN q2.txt**

# 3. Class Bio 
## File preprocessing
1. **removing "==> ... <==" header lines**
2. **split sentences with nlp ssplit**
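
A small sketch of the header-removal step, using the pattern `==>[\w\s]*<==` from this notebook on a made-up sample. The sample text and its header names are hypothetical. The last check shows a limitation of the pattern as written: a header containing a character outside `[\w\s]`, such as a dot, is not matched.

```python
import re

# Same pattern as in the notebook: "==>", then word/space characters, then "<=="
HEADER = re.compile(r'==>[\w\s]*<==')

# Hypothetical sample in the assumed header format
sample = ("==> bio one <==\nI like text analytics.\n"
          "==> bio two <==\nI work with data.")

cleaned = HEADER.sub('\n', sample)
# Keep only the non-empty lines that remain after the headers are replaced
lines = [ln for ln in cleaned.splitlines() if ln.strip()]
print(lines)  # ['I like text analytics.', 'I work with data.']

# A header with a dot in it is left untouched by this pattern
print(HEADER.search("==> bio.txt <=="))  # None
```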

## Tokenization and Normalization
I used Lucene's StandardTokenizer and EnglishAnalyzer for this purpose, for the following reasons:
1. The StandardTokenizer is grammar-based, so it tokenizes and splits sentences more sensibly.
2. The EnglishAnalyzer not only filters out some built-in stopwords but also normalizes the tokens based on English grammar.

**RESULTS ARE STORED IN classbio_norm.txt**

In [None]:
# importing CoreNLP and request from API
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

In [53]:
# read in the txt file, decoding it as utf-8
import codecs
import re

with codecs.open('classbios_unicode.txt', 'r', encoding='utf-8', errors='replace') as f:
    text = f.read()
# replace non-ASCII characters with '?', then decode so text is a string again
text = text.encode('ascii', 'replace').decode('ascii')
# replace every "==> ... <==" header with a newline
text1 = re.sub(r'==>[\w\s]*<==', '\n', text)

In [25]:
# split sentences
# use annotator 'ssplit' for splitting
ssplit = nlp.annotate(text1,properties={'annotators':'ssplit', 'outputFormat':'json'})
# number of sentences: 657
s_len = len(ssplit['sentences'])
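# write to file; the with-block closes it once all sentences are written
with open('classbio_clean.txt', 'w') as target:
    # join the tokens from sentences
    for i in range(s_len):
        sentence = [t['word'] for t in ssplit['sentences'][i]['tokens']]
        # sentence[:-1] drops the last token of each sentence
        target.write(' '.join(sentence[:-1]) + '\n\n')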

# 4. Basic frequency analysis. 
After the class bios are tokenized and normalized in Java with the StandardTokenizer and EnglishAnalyzer, the result is stored as classbio_norm.txt. I read it back into Python and run the frequency analysis in the following steps:
1. **Filter out stopwords**: the Lucene EnglishAnalyzer already filters out a built-in list of stopwords, but that list is not adequate, so I import the stopword list from the NLTK corpus for further filtering.
2. **Word count dictionary construction**: words and their counts are stored in a dictionary.
3. **Dict into DataFrame and sort**: get the top 20 by sorting on the count values.
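
The three steps above can also be sketched with the standard library alone. This is a minimal alternative using `collections.Counter`, not the pandas-based code in the cell below; the function name `top_words`, the tiny stopword set, and the sample string are assumptions for illustration.

```python
from collections import Counter

def top_words(text, stopwords, n=20):
    # Count every whitespace-separated token that is not a stopword,
    # then return the n most frequent (word, count) pairs.
    counts = Counter(w for w in text.split() if w not in stopwords)
    return counts.most_common(n)

# Hypothetical, already-stemmed sample text
sample = "analyt analyt analyt text text data the"
print(top_words(sample, stopwords={"the"}, n=3))
# [('analyt', 3), ('text', 2), ('data', 1)]
```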

## Analysis
The top-20 word list gives a reasonable amount of information about the corpus: the frequent stems "analyt" and "analysi" reflect the class's interests and background in analytics. Although our general interest is in data science, "data" did not show up at the top; instead, "scienc" appears. "work" and "compani" also rank near the top, reflecting general background in work experience and real-world applications.

In [29]:
# use java to normalize the class bio
# read in the normalized file
with open('classbio_norm.txt', 'r') as f:
    norm = f.read()
word_count = {}
# a list of initial stop words has been filtered out by Lucene English Analyzer;
# remove stop words further by using the stopwords in the nltk package
from nltk.corpus import stopwords
stopword = stopwords.words('english')
for word in norm.split():
    if word not in stopword:
        if word not in word_count:
            word_count[word] = 1
        else:
            word_count[word] += 1
import pandas as pd
word_count_pd = pd.DataFrame(list(word_count.items()), columns=['word', 'count'])
# sort by count, descending; note that [1:20] skips the first (highest-count) row
word_count_pd.sort_values('count', ascending=False)[1:20]

Unnamed: 0,word,count
723,analyt,137
76,work,102
1604,text,96
290,interest,70
938,program,64
1357,us,58
623,year,55
1021,learn,54
1682,scienc,51
585,graduat,48


# 5. Bigram Frequency Analysis
The same counting process is repeated. 
## Analysis
This time the results make more sense. "data scienc" and "data scientist" show up at the top, along with data science terminology such as "unstructur data", "text data", and "machin learn". Phrases for places and life stages, such as "new york" and "high school", indicate people's backgrounds and how they describe them.
This is more informative than the single-word frequency results.
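
A compact sketch of the bigram count, assuming the same convention as the cell below: stopwords are dropped first, so a bigram can bridge a removed stopword. The function name, stopword set, and sample string are made up for this example.

```python
from collections import Counter

def bigram_counts(text, stopwords):
    # Drop stopwords, then pair each remaining token with its successor
    tokens = [w for w in text.split() if w not in stopwords]
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

# Hypothetical, already-stemmed sample text
sample = "data scienc is the data scienc of text data"
counts = bigram_counts(sample, stopwords={"is", "the", "of"})
print(counts.most_common(1))  # [('data scienc', 2)]
# "scienc data" exists only because "is the" was removed between the two words
print(counts["scienc data"])  # 1
```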

In [39]:
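# count bigrams of consecutive non-stopword tokens
prword = ""
bigram_count = {}
for word in norm.split():
    if word not in stopword:
        # compare by value; "is not" tests identity and is unreliable for strings
        if prword != "":
            bigram = prword + " " + word
            if bigram not in bigram_count:
                bigram_count[bigram] = 1
            else:
                bigram_count[bigram] += 1
        prword = word
bigram_count_pd = pd.DataFrame(list(bigram_count.items()), columns=['bigram', 'count'])
# sort by count, descending; note that [1:20] skips the first (highest-count) row
bigram_count_pd.sort_values('count', ascending=False)[1:20]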

Unnamed: 0,bigram,count
1045,data scienc,31
175,msia program,21
4059,interest text,20
3095,data scientist,15
736,machin learn,15
4769,unstructur data,13
5747,text data,12
2753,new york,11
531,data analysi,10
2321,becam interest,10


# 6. Sentiment Analysis
## 1. baseline naive bayes classifier
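Classifier: 
P(class|words) = P(class, words)/P(words) = P(class)\*P(words|class)/P(words) 

P(pos) = P(neg), since each prior is the number of documents in the class divided by the total number of documents, and the two classes have the same number of documents.

P(words) is the same for both classes, so only P(words|class) needs to be compared. Under the naive Bayes assumption of independence, it equals P(word1|class)\*...\*P(wordn|class), and comparing these products is equivalent to comparing the sums log(P(word1|class))+...+log(P(wordn|class)).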
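
Written out as a formula, the decision rule with equal priors and add-one (Laplace) smoothing is shown below. The symbols are notation introduced here, not taken from the notebook: $w_1, \dots, w_n$ are the words of the review, $V$ is the vocabulary, and $\mathrm{count}(w, c)$ is the number of times word $w$ occurs in the training text of class $c$.

$$
\hat{c} = \arg\max_{c \in \{\mathrm{pos},\, \mathrm{neg}\}} \sum_{i=1}^{n} \log P(w_i \mid c),
\qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$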

Process: 
1. Build the vocabulary and count the words in each class.
2. Compute the log probabilities with Laplace smoothing.
3. For a test review, sum the log probabilities under each class and compare the two sums.
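
The steps above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the pandas implementation used in this notebook: the names `train` and `classify` and the toy reviews are assumptions, equal class priors are assumed, and the score adds a log probability once per token, whereas the notebook's `naive_bayes_test` adds each matching vocabulary word once.

```python
import math
from collections import Counter

def train(docs_by_class):
    # docs_by_class maps a class label to a list of token lists
    counts = {label: Counter(tok for doc in docs for tok in doc)
              for label, docs in docs_by_class.items()}
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    model = {}
    for label, c in counts.items():
        # Laplace smoothing: add 1 to every count, so add |V| to the total
        total = sum(c.values()) + len(vocab)
        model[label] = {w: math.log((c[w] + 1.0) / total) for w in vocab}
    return model

def classify(model, tokens):
    # Sum log P(word|class) per class; words outside the vocabulary are skipped
    scores = {label: sum(logp[t] for t in tokens if t in logp)
              for label, logp in model.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy training data
toy = {
    "positive": [["great", "film", "great", "cast"], ["love", "film"]],
    "negative": [["bad", "film", "bore"], ["bad", "plot"]],
}
model = train(toy)
label, scores = classify(model, ["great", "cast", "film"])
print(label)  # positive
```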

## Analysis
I used one of the positive reviews as a test review. The classifier correctly labels it as positive, because the sum of log probabilities is higher under the positive class.

In [109]:
# read in pos and neg:
import os
import math
pos, neg = "", ""
pos_path = 'review_polarity/txt_norm/pos/'
neg_path = 'review_polarity/txt_norm/neg/'
   
for filename in os.listdir(pos_path):
    with open(pos_path+filename, 'r') as f:
        content = f.read()
    pos+=content
for filename in os.listdir(neg_path):
    with open(neg_path+filename, 'r') as f:
        content = f.read()
    neg+=content

In [41]:
import glob
def get_text(path, namestr):
    text = ""
    for filename in glob.glob(os.path.join(path, namestr)):
        with open(filename, 'r') as f:
            content = f.read()
        text+=content
    return text
def get_textlist(path, namestr):
    textlist = []
    for filename in glob.glob(os.path.join(path, namestr)):
        with open(filename, 'r') as f:
            content = f.read()
        textlist.append(content)
    return textlist

from nltk.corpus import stopwords
stopword = stopwords.words('english')
import pandas as pd

def build_voc(pos, neg):
    pos_voc, neg_voc = set(pos.split()), set(neg.split())
    from nltk.corpus import stopwords
    stopword = stopwords.words('english')
    return list((pos_voc | neg_voc)-set(stopword))

def word_count(review, voc):
    word_count = {}
    for word in voc:
        word_count[word]=0    
    for word in review.split():
        if word not in stopword:
            word_count[word]+=1
    return word_count

def naive_bayes_train(pos, neg):
    voc = build_voc(pos, neg)
    word_count_pos, word_count_neg = word_count(pos, voc), word_count(neg, voc)
    word_count_pos_pd = pd.DataFrame(list(word_count_pos.items()), columns=['word','count'])
    word_count_neg_pd = pd.DataFrame(list(word_count_neg.items()), columns=['word','count'])
    # log(P(word|pos)) with Laplace (add-one) smoothing:
    total_count_pos = sum(word_count_pos_pd['count']+1)
    word_count_pos_pd['logprob'] = word_count_pos_pd.apply(lambda row: math.log((row['count']+1.0)/total_count_pos), axis = 1)
    # log(P(word|neg)) with Laplace (add-one) smoothing:
    total_count_neg = sum(word_count_neg_pd['count']+1)
    word_count_neg_pd['logprob'] = word_count_neg_pd.apply(lambda row: math.log((row['count']+1.0)/total_count_neg), axis = 1)
    return {'pos': word_count_pos_pd, 'neg': word_count_neg_pd}

def naive_bayes_test(train_dict, test):
    word_count_pos_pd = train_dict['pos']
    word_count_neg_pd = train_dict['neg']
    pos_prob = sum(word_count_pos_pd[word_count_pos_pd['word'].isin(test)]['logprob'])
    neg_prob = sum(word_count_neg_pd[word_count_neg_pd['word'].isin(test)]['logprob'])
    if (pos_prob>neg_prob):
        return "positive"
    else:
        return "negative"

from collections import Counter

def naive_bayes_validation(train_dict, testlist, actual_class):
    # predictions and labels must line up one-to-one
    if len(testlist) != len(actual_class):
        print("predict and actual not the same length")
        return None
    predict = [naive_bayes_test(train_dict, test.split()) for test in testlist]
    # count each (predicted, actual) pair
    count = Counter(zip(predict, actual_class))
    try:
        precision = count[('positive','positive')]/float(count[('positive','negative')]+count[('positive','positive')])
        recall = count[('positive','positive')]/float(count[('negative','positive')]+count[('positive','positive')])
        Fscore = 2.0*precision*recall/(precision+recall)
        print("precision "+str(precision)+", recall "+str(recall)+", Fscore "+str(Fscore))
        return (precision, recall, Fscore)
    except ZeroDivisionError:
        # a zero denominator means no positive prediction or no positive class
        print("no positive prediction or no positive class")
        print(count)

In [136]:
train_dict = naive_bayes_train(pos, neg)

In [137]:
# test review
with open('review_polarity/txt_norm/pos/cv000_29590.txt', 'r') as f:
    test = f.read().split()
naive_bayes_test(train_dict, test)

'positive'

## 2. Initial Evaluation
### train on cv0 and test on cv6&7

In [18]:
# read in pos and neg:
import os
import math
pos_path = 'review_polarity/txt_norm/pos/'
neg_path = 'review_polarity/txt_norm/neg/'

pos_training = get_text(pos_path, "cv0*.txt")
neg_training = get_text(neg_path, "cv0*.txt")
train_dict = naive_bayes_train(pos_training, neg_training)


In [42]:
textlist = get_textlist(pos_path, "cv[67]*.txt")+get_textlist(neg_path, "cv[67]*.txt")
actual_class = ['positive']*200 + ['negative']*200
result = naive_bayes_validation(train_dict, textlist, actual_class)
result

precision 0.8, recall 0.76, Fscore 0.779487179487


(0.8, 0.76, 0.7794871794871796)
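
For reference, the three scores printed above follow from the confusion counts. The counts used here are inferred from the reported ratios, not read from the notebook output: with 200 positive test reviews, a recall of 0.76 implies 152 true positives and 48 false negatives, and a precision of 0.8 then implies 38 false positives. The function name is an assumption for this sketch.

```python
def precision_recall_f(tp, fp, fn):
    # tp: predicted positive, actually positive
    # fp: predicted positive, actually negative
    # fn: predicted negative, actually positive
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f_score = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts inferred from the reported precision 0.8 and recall 0.76
p, r, f = precision_recall_f(tp=152, fp=38, fn=48)
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8 0.76 0.7795
```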

### train on cv0&1&2 and test on cv6&7

In [43]:
# read in pos and neg:
import os
import math
pos_path = 'review_polarity/txt_norm/pos/'
neg_path = 'review_polarity/txt_norm/neg/'

pos_training = get_text(pos_path, "cv[012]*.txt")
neg_training = get_text(neg_path, "cv[012]*.txt")
train_dict = naive_bayes_train(pos_training, neg_training)

textlist = get_textlist(pos_path, "cv[67]*.txt")+get_textlist(neg_path, "cv[67]*.txt")
actual_class = ['positive']*200 + ['negative']*200
result = naive_bayes_validation(train_dict, textlist, actual_class)
result

precision 0.835978835979, recall 0.79, Fscore 0.81233933162


(0.8359788359788359, 0.79, 0.8123393316195374)

### train on cv0&1&2&3&4 and test on cv6&7

In [44]:
# read in pos and neg:
import os
import math
pos_path = 'review_polarity/txt_norm/pos/'
neg_path = 'review_polarity/txt_norm/neg/'

pos_training = get_text(pos_path, "cv[01234]*.txt")
neg_training = get_text(neg_path, "cv[01234]*.txt")
train_dict = naive_bayes_train(pos_training, neg_training)

textlist = get_textlist(pos_path, "cv[67]*.txt")+get_textlist(neg_path, "cv[67]*.txt")
actual_class = ['positive']*200 + ['negative']*200
result = naive_bayes_validation(train_dict, textlist, actual_class)
result

precision 0.857894736842, recall 0.815, Fscore 0.835897435897


(0.8578947368421053, 0.815, 0.8358974358974358)

### train on cv0&1&2&3&4&5 and test on cv6&7

In [45]:
pos_training = get_text(pos_path, "cv[012345]*.txt")
neg_training = get_text(neg_path, "cv[012345]*.txt")
train_dict = naive_bayes_train(pos_training, neg_training)

textlist = get_textlist(pos_path, "cv[67]*.txt")+get_textlist(neg_path, "cv[67]*.txt")
actual_class = ['positive']*200 + ['negative']*200
result = naive_bayes_validation(train_dict, textlist, actual_class)
result

precision 0.865284974093, recall 0.835, Fscore 0.849872773537


(0.8652849740932642, 0.835, 0.8498727735368956)

### Analysis
Precision, recall, and F-score all increase as the training set grows, but the gains shrink quickly, so the marginal benefit of additional training data is very limited after training on cv0-5. It is therefore a fair judgement that the classifier is unlikely to improve much with even more training data.

## 4. Evaluation on cv8&9
### train on cv0&1&2&3&4&5&6&7 and test on cv8&9
The F-score, precision, and recall are all lower than in the last training run (train on cv0-5, test on cv6&7).

There are 2 possible reasons:
1. The sampling of cv8 and cv9 is biased, so their word distribution is substantially different.
2. There are overfitting issues in the training phase. More training data creates more features, which adds noise to the model and dilutes the informative features. The model therefore does not generalize well to the test data, which brings down precision and recall.

In [47]:
pos_training = get_text(pos_path, "cv[01234567]*.txt")
neg_training = get_text(neg_path, "cv[01234567]*.txt")
train_dict = naive_bayes_train(pos_training, neg_training)

textlist = get_textlist(pos_path, "cv[89]*.txt")+get_textlist(neg_path, "cv[89]*.txt")
actual_class = ['positive']*200 + ['negative']*200
result = naive_bayes_validation(train_dict, textlist, actual_class)
result

precision 0.837837837838, recall 0.775, Fscore 0.805194805195


(0.8378378378378378, 0.775, 0.8051948051948051)

In [52]:
predict = [naive_bayes_test(train_dict, test.split()) for test in textlist]
# list() is needed because zip returns an iterator in Python 3
result_class = pd.DataFrame(list(zip(predict, actual_class)), columns=['predicted','actual'])
result_class['filename'] = glob.glob(os.path.join(pos_path, "cv[89]*.txt")) + glob.glob(os.path.join(neg_path, "cv[89]*.txt"))
result_class

Unnamed: 0,predicted,actual,filename
0,positive,positive,review_polarity/txt_norm/pos/cv800_12368.txt
1,positive,positive,review_polarity/txt_norm/pos/cv801_25228.txt
2,positive,positive,review_polarity/txt_norm/pos/cv802_28664.txt
3,positive,positive,review_polarity/txt_norm/pos/cv803_8207.txt
4,positive,positive,review_polarity/txt_norm/pos/cv804_10862.txt
5,positive,positive,review_polarity/txt_norm/pos/cv805_19601.txt
6,positive,positive,review_polarity/txt_norm/pos/cv806_8842.txt
7,positive,positive,review_polarity/txt_norm/pos/cv807_21740.txt
8,positive,positive,review_polarity/txt_norm/pos/cv808_12635.txt
9,positive,positive,review_polarity/txt_norm/pos/cv809_5009.txt
