by Linh Van NGUYEN

Date: 16/5/2016

### Requirements:
    * sklearn
    * pandas
    * xgboost
    * BeautifulSoup
    * nltk
To install the required packages, run this command in the terminal:
    
    sudo pip install -r requirements.txt

The following two files should also be in the same directory:
    * utils.py - some useful functions
        read_textfile(filename, separate): read text from file with given separator
        score_model(model,X,t,cv,scoring): return score of a classifier using cross-validation
        text_to_wordlist(text, remove_stopwords=False): clean and return wordlist from given tweets
    * predict_sentiments.py - functions to predict sentiments from given tweets
        predict_singlemodel(tweetsTrain, labelsTrain, tweetsTest): predict using the best single model
        predict_ensemblemodels(tweetsTrain, labelsTrain, tweetsTest): predict using an ensemble of all models
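
The source of `utils.py` is not shown here, so the following is only a minimal sketch of what a cleaner like `text_to_wordlist` might do. The function name `text_to_wordlist_sketch`, the regular expressions, and the tiny `STOPWORDS` set are all assumptions made for illustration; the real helper presumably relies on BeautifulSoup and the nltk stop-word list from the requirements.

```python
import re

# Hypothetical stand-in for a full stop-word list such as nltk's English list.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it"}


def text_to_wordlist_sketch(text, remove_stopwords=False):
    """Return a list of lowercase words from a raw tweet (illustrative only)."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words


example = "The flight to <b>NYC</b> is delayed AGAIN! http://t.co/xyz"
print(text_to_wordlist_sketch(example))
print(text_to_wordlist_sketch(example, remove_stopwords=True))
```

Keeping stop-word removal optional mirrors the `remove_stopwords=False` default listed above, so the same cleaner can serve feature extractors that need the full word sequence.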
        
To run on real test data, comment out the virtual splitting block (Part 2) and set the input test file name in Part 1.

In [1]:
%matplotlib inline  
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from sklearn import cross_validation
from sklearn.metrics import classification_report

import utils as utils
import predict_sentiments as ps



## Part 1: Testing data
Set the test data file *xxxx.txt* below (a file in the same format as the training data).

In [2]:
fileTrain = 'train-tweets.txt'
fileTest = 'test-tweets.txt'

dfTrain = utils.read_textfile(fileTrain,',')
tweetsTrain = dfTrain.tweet.values
nTrain = len(tweetsTrain)

dfTest = utils.read_textfile(fileTest,',')
tweetsTest = dfTest.tweet.values
nTest = len(tweetsTest)

sentimentsAll = np.concatenate((dfTrain.sentiment,dfTest.sentiment),axis=0) # concatenate so train and test share the same label codes when factorized
labelsAll = pd.factorize(sentimentsAll)[0]
sentiments = pd.factorize(dfTrain.sentiment)[1]
labelsTrain = labelsAll[:nTrain]
labelsTest = labelsAll[nTrain:]

print ("\n\nTest file at \" %s \", containing %d tweets \n\n" % (fileTest, nTest))



Test file at " test-tweets.txt ", containing 1044 tweets 
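
The reason for concatenating before calling `pd.factorize` can be illustrated with a small example. `pd.factorize` assigns integer codes in order of first appearance, so factorizing the train and test labels separately may give the same sentiment different codes. The toy label arrays below are made up for illustration and are not taken from the tweet files.

```python
import numpy as np
import pandas as pd

# Toy labels (illustrative only): the test set starts with a different class.
train = np.array(["negative", "positive", "neutral"], dtype=object)
test = np.array(["positive", "negative"], dtype=object)

# Factorizing the test set alone: codes follow first appearance in `test`.
codes_test_alone, uniques_test_alone = pd.factorize(test)

# Factorizing the concatenation, then slicing, keeps one shared mapping.
codes_all, uniques_all = pd.factorize(np.concatenate((train, test), axis=0))
codes_train = codes_all[:len(train)]
codes_test = codes_all[len(train):]

print(list(uniques_all))       # mapping defined by the training order
print(list(codes_test_alone))  # "positive" is coded 0 here
print(list(codes_test))        # "positive" is coded 1, as in training
```

Because the training labels come first in the concatenation, the mapping matches the one returned by factorizing the training labels alone, provided every sentiment in the test file also occurs in the training file.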




In [3]:
# Single model
print("==============================================")
print("Report of prediction by the best single models")
print("==============================================")
labelsPred = ps.predict_singlemodel(tweetsTrain, labelsTrain, tweetsTest)
print(classification_report(labelsPred, labelsTest))
print("==============================================")


# Ensemble model
print("==============================================")
print("Report of prediction by the ensemble of models")
print("==============================================")
labelsPred = ps.predict_ensemblemodels(tweetsTrain, labelsTrain, tweetsTest)
print(classification_report(labelsPred, labelsTest))
print("==============================================")

Report of prediction by the best single models
Step 1: Cleaning text 

Step 2: Extracting features 

Step 3: Training classifiers 

stacked features ....
             precision    recall  f1-score   support

          0       0.59      0.68      0.63       335
          1       0.25      0.51      0.34        96
          2       0.82      0.62      0.71       613

avg / total       0.69      0.63      0.65      1044

Report of prediction by the best single models
Step 1: Cleaning text 

Step 2: Extracting features 

Step 3: Training classifiers 

For vectorized features ....
counting features ....
stacked features ....
and significant features ....
done! 

Final predictions and ensemble 

             precision    recall  f1-score   support

          0       0.57      0.74      0.65       296
          1       0.25      0.62      0.35        78
          2       0.89      0.62      0.73       670

avg / total       0.75      0.65      0.68      1044
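
A note on reading these tables: scikit-learn documents the signature as `classification_report(y_true, y_pred)`, while the cells above pass `labelsPred` first. With that argument order the precision and recall columns are exchanged relative to the usual convention, and the support column counts predicted labels rather than true ones (which is why the per-class support differs between the two reports on the same test file). The pure-Python sketch below shows the effect on made-up labels; `precision_recall` is a helper written for this illustration, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from scratch."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times the class was predicted
    actual = sum(1 for t in y_true if t == label)     # times the class truly occurs
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Made-up labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

conventional = precision_recall(y_true, y_pred, label=0)
swapped = precision_recall(y_pred, y_true, label=0)  # arguments exchanged

print(conventional)
print(swapped)
```

The F1 score is symmetric in precision and recall, so the per-class f1-score column is unaffected by the argument order.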



## Part 2: Virtual testing data by splitting from training data
(comment out this block when real testing data is given)

In [4]:
fileTrain = 'train-tweets.txt'
dfTrain = utils.read_textfile(fileTrain,',')
tweets = dfTrain.tweet.values
labels = pd.factorize(dfTrain.sentiment)[0]
sentiments = pd.factorize(dfTrain.sentiment)[1]

tweetsTrain, tweetsTest, labelsTrain, labelsTest = cross_validation.train_test_split(tweets, labels, 
                                                                      train_size=0.85, random_state=1234)
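
For readers unfamiliar with hold-out splitting, the idea behind the call above can be sketched in plain Python: shuffle the indices with a fixed seed, then cut at 85%. This is an illustration of the concept only, not scikit-learn's implementation, and `holdout_split_sketch` is a hypothetical name. Note also that the `sklearn.cross_validation` module was removed in scikit-learn 0.20; in newer versions `train_test_split` lives in `sklearn.model_selection`.

```python
import random


def holdout_split_sketch(items, labels, train_size=0.85, seed=1234):
    """Seeded random hold-out split that keeps items and labels aligned."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # same seed gives the same split
    n_train = round(len(indices) * train_size)
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data, for illustration only.
tweets_demo = ["tweet %d" % i for i in range(20)]
labels_demo = [i % 3 for i in range(20)]
tr_x, te_x, tr_y, te_y = holdout_split_sketch(tweets_demo, labels_demo)
print(len(tr_x), len(te_x))
```

Fixing the seed, as `random_state=1234` does in the notebook, makes the single-model and ensemble results below comparable because both are evaluated on the same held-out tweets.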

### By a single model (SVM with feature selection)

In [5]:
labelsPred = ps.predict_singlemodel(tweetsTrain, labelsTrain, tweetsTest)
print(classification_report(labelsPred, labelsTest))

Step 1: Cleaning text 

Step 2: Extracting features 

Step 3: Training classifiers 

stacked features ....
             precision    recall  f1-score   support

          0       0.62      0.74      0.68       299
          1       0.24      0.54      0.33        69
          2       0.84      0.65      0.73       621

avg / total       0.73      0.67      0.69       989



### By ensemble model

In [6]:
labelsPred = ps.predict_ensemblemodels(tweetsTrain, labelsTrain, tweetsTest)
print(classification_report(labelsPred, labelsTest))

Step 1: Cleaning text 

Step 2: Extracting features 

Step 3: Training classifiers 

For vectorized features ....
counting features ....
stacked features ....
and significant features ....
done! 

Final predictions and ensemble 

             precision    recall  f1-score   support

          0       0.60      0.75      0.67       280
          1       0.24      0.64      0.34        58
          2       0.88      0.65      0.74       651

avg / total       0.76      0.68      0.70       989
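
The notebook does not show how `predict_ensemblemodels` combines its classifiers. One common option is majority voting over the per-model predictions, sketched below as an assumption rather than as the method actually used; `majority_vote` and the three toy prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine equal-length label lists (one per model) by majority vote."""
    ensemble = []
    for votes in zip(*predictions_per_model):
        # On a tie, the label seen first among the models wins.
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble


# Toy predictions from three hypothetical models on four tweets.
model_a = [0, 1, 2, 2]
model_b = [0, 2, 2, 1]
model_c = [1, 1, 2, 0]
combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

Other combination schemes, such as averaging predicted probabilities or training a second-level classifier on stacked outputs, would fit the "stacked features" messages in the log equally well.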

