## Kaggle Challenge - Twitter Sentiment Analysis - Bag-Of-Words Model

In [2]:
# read train data
import pandas as pd
train = pd.read_csv("train.csv", header=0, delimiter=",", encoding='latin-1')

In [3]:
list(train.columns.values)

['ItemID', 'Sentiment', 'SentimentText']

In [28]:
train.head(5)

| | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


In [4]:
train.shape

(99989, 3)

In [5]:
for i in range(1,5):
    print(train["SentimentText"][i])
    print("NEXT---------------------")

                   I missed the New Moon trailer...
NEXT---------------------
              omg its already 7:30 :O
NEXT---------------------
          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
NEXT---------------------
         i think mi bf is cheating on me!!!       T_T
NEXT---------------------


### Clean the train data

In [7]:
from cleanData import twitts_to_words
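
The source of `cleanData.twitts_to_words` is not shown in this notebook. Judging only from the cleaned tweets printed later (lowercase, letters only, common words dropped), a helper of this kind might look roughly like the sketch below. The name `twitts_to_words_sketch` and the tiny `STOP_WORDS` set are assumptions made for illustration; the real helper probably uses a fuller stop-word list.

```python
import re

# Hypothetical stand-in for cleanData.twitts_to_words (the real source is not shown).
# A tiny illustrative stop-word set; the list actually used is an assumption.
STOP_WORDS = {"i", "the", "is", "on", "me", "its", "a", "an", "to", "for", "my", "so"}

def twitts_to_words_sketch(raw_tweet):
    """Return the tweet as a cleaned, space-joined string of words."""
    # Keep letters only; digits, punctuation and emoticons become spaces
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw_tweet)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Drop stop words and join the rest back into one string
    return " ".join(w for w in words if w not in STOP_WORDS)

print(twitts_to_words_sketch("   I missed the New Moon trailer..."))
# prints: missed new moon trailer
```

Returning a single string rather than a list of words keeps the result directly usable as input to a text vectorizer.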

In [8]:
num_twitts = train["SentimentText"].size

In [9]:
clean_train_twitts = []

In [10]:
for i in range(0, num_twitts):
    if( (i+1)%30000 == 0 ):
        print ("Review {} of {}\n".format(i+1, num_twitts))             # status updates
    clean_train_twitts.append(twitts_to_words(train["SentimentText"][i]))

Review 30000 of 99989

Review 60000 of 99989

Review 90000 of 99989



In [11]:
len(clean_train_twitts)

99989

In [13]:
for i in range(1,5):
    print(clean_train_twitts[i])
    print("NEXT---------------------")

missed new moon trailer
NEXT---------------------
omg already
NEXT---------------------
omgaga im sooo im gunna cry dentist since suposed get crown put mins
NEXT---------------------
think mi bf cheating
NEXT---------------------


### Creating Features from a Bag of Words (Using scikit-learn)
- Convert the training tweets to a numeric representation suitable for machine learning.

- Bag-of-Words approach:
The Bag of Words model learns a vocabulary from all of the documents, then represents each document by counting how many times each vocabulary word appears in it.
A vocabulary built from every tweet would be too big, so to limit the size of the feature vectors we cap the vocabulary at the 5000 most frequent words.
- sklearn's CountVectorizer converts a collection of text documents to a matrix of token counts.
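
To make the counting idea concrete, here is a minimal plain-Python sketch of a capped bag of words. It is not scikit-learn's implementation: the function name `bag_of_words` is hypothetical, and splitting on whitespace is a simplification (CountVectorizer's default tokenizer, for example, ignores single-character tokens).

```python
from collections import Counter

def bag_of_words(docs, max_features):
    """Count-vectorize docs, keeping only the max_features most frequent words."""
    tokenized = [doc.split() for doc in docs]
    # Corpus-wide word frequencies decide which words enter the vocabulary
    totals = Counter(word for tokens in tokenized for word in tokens)
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["sad sad sad day", "happy day", "happy happy happy"]
vocab, rows = bag_of_words(docs, max_features=2)
print(vocab)  # ['happy', 'sad']
print(rows)   # [[0, 3], [1, 0], [3, 0]]
```

The word "day" occurs only twice in the toy corpus, so it falls outside the two-word vocabulary and is simply not counted.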

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

In [16]:
train_data_features = vectorizer.fit_transform(clean_train_twitts)

In [17]:
type(train_data_features)

scipy.sparse.csr.csr_matrix

In [18]:
train_data_features.shape

(99989, 5000)

In [111]:
#train_data_features.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

The Bag of Words model is now fitted, so let's look at the vocabulary.

In [19]:
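# On scikit-learn 1.0+, use vectorizer.get_feature_names_out() instead;
# get_feature_names() was removed in 1.2
vocab = vectorizer.get_feature_names()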

In [20]:
print(vocab)



### Classify with the Random Forest
At this point we have numeric training features from the Bag of Words and the original sentiment label for each feature vector, so let's do some supervised learning!

In [21]:
from sklearn.ensemble import RandomForestClassifier


In [22]:
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)

In [23]:
# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
forest = forest.fit(train_data_features, train["Sentiment"])
print("Classifier is trained")

Classifier is trained


### Run the trained Random Forest Classifier on the test set (create a submission file) and predict sentiments on some test-cases (for fun) 

In [24]:
test = pd.read_csv("test.csv",header=0,delimiter=",",encoding='latin-1')

In [25]:
test.shape

(299989, 2)

In [26]:
test.columns.values

array(['ItemID', 'SentimentText'], dtype=object)

In [27]:
num_twitts = len(test["SentimentText"])
clean_test_twitts = []

In [29]:
for i in range(0,num_twitts):
    if((i+1) % 30000 == 0):
        print("Review {} of {}\n".format(i+1, num_twitts))
    clean_twitt = twitts_to_words(test["SentimentText"][i] )
    clean_test_twitts.append(clean_twitt)

Review 30000 of 299989

Review 60000 of 299989

Review 90000 of 299989

Review 120000 of 299989

Review 150000 of 299989

Review 180000 of 299989

Review 210000 of 299989

Review 240000 of 299989

Review 270000 of 299989



In [30]:
# Get a bag of words for the test set
test_data_features = vectorizer.transform(clean_test_twitts)

In [31]:
# Converting to a dense NumPy array would use too much memory on this hardware
# test_data_features = test_data_features.toarray()
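
As a rough check on why the dense conversion is skipped, the arithmetic below estimates the size of the test matrix if every entry were stored. It assumes 8 bytes per entry, matching the int64 dtype shown for the training matrix earlier; the variable names are illustrative.

```python
# Shape of the test feature matrix reported by the notebook
n_rows, n_cols = 299989, 5000
bytes_per_entry = 8  # assumption: int64 counts, as in the training matrix

dense_bytes = n_rows * n_cols * bytes_per_entry
dense_gib = dense_bytes / 2**30
print(f"{dense_bytes:,} bytes, about {dense_gib:.1f} GiB")
# prints: 11,999,560,000 bytes, about 11.2 GiB
```

The sparse matrix stores only the non-zero counts, and a short tweet touches just a handful of the 5000 columns, so it needs a small fraction of that.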

In [32]:
test_data_features.shape

(299989, 5000)

In [33]:
# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)

In [34]:
# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame(data={"id":test["ItemID"], "sentiment":result})

In [43]:
output.shape

(299989, 2)

In [44]:
# Use pandas to write the comma-separated output file (quoting=3 is csv.QUOTE_NONE)
output.to_csv("Bag_of_Words_model-Random Forest.csv", index=False, quoting=3)

In [63]:
# Prediction on an input string
input_string = "Ohhh, hell, yeah, I like ice-cream"
print(forest.predict(vectorizer.transform([twitts_to_words(input_string)])))

[1]


### The classifier is implemented and runs - TODO: measure how accurate it is
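
One common way to address this TODO is to hold out part of the labeled training data and score predictions on it, since the test set has no labels. The sketch below shows the idea on made-up toy data; `texts` and `labels` stand in for `clean_train_twitts` and `train["Sentiment"]`, and the `eval_` names are hypothetical, chosen so they do not overwrite the notebook's own `vectorizer` and `forest`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for clean_train_twitts and train["Sentiment"] (made-up data)
texts = ["love great day", "sad bad day", "happy love fun", "hate sad cry",
         "great fun happy", "bad hate cry", "love happy great", "cry sad bad"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the labeled tweets for validation, keeping class balance
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Fit the vocabulary on the training split only, then reuse it for validation
eval_vectorizer = CountVectorizer(max_features=5000)
train_features = eval_vectorizer.fit_transform(X_train)
val_features = eval_vectorizer.transform(X_val)

eval_forest = RandomForestClassifier(n_estimators=100, random_state=0)
eval_forest.fit(train_features, y_train)

# Fraction of held-out tweets whose sentiment was predicted correctly
accuracy = accuracy_score(y_val, eval_forest.predict(val_features))
print("Validation accuracy: {:.2f}".format(accuracy))
```

On the real data the same pattern would apply with the full tweet list and labels; the score on eight toy sentences says nothing about the actual model.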