## Kaggle Challenge - Twitter Sentiment Analysis - Bag-Of-Words Model

### Read the data

In [1]:
import pandas as pd
data = pd.read_csv("../dataset/train.csv", header=0, delimiter=",", encoding='latin-1')

In [2]:
list(data.columns.values)

['ItemID', 'Sentiment', 'SentimentText']

In [3]:
data.head(5)

|   | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


In [4]:
data.shape

(99989, 3)

### Clean data

In [5]:
from cleanData import twitts_to_words

In [6]:
# Apply the cleaning function to every tweet
data.SentimentText = data.SentimentText.apply(twitts_to_words)
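
The `cleanData` module is not shown in this notebook, so what `twitts_to_words` does exactly is unknown. Judging only from the cleaned tweets printed further down (lowercase words, no punctuation, no common filler words), a rough stand-in might look like the sketch below. The function name `clean_tweet_sketch`, the tiny stop-word list, and every regular expression here are assumptions made for illustration, not the original implementation.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption); the real module
# may use a much larger list or a library-provided one.
STOP_WORDS = {"a", "an", "the", "is", "it", "its", "so", "for", "my", "i", "on", "me", "to", "and"}

def clean_tweet_sketch(text):
    """Illustrative stand-in for twitts_to_words; NOT the original code."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop @mentions
    text = re.sub(r"[^a-z\s]", " ", text)       # keep letters and whitespace only
    # Drop stop words and single-letter leftovers, then re-join with spaces
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 1]
    return " ".join(words)

print(clean_tweet_sketch("OMG its already 7:30 :O @bob check http://example.com"))
```

Returning a single space-separated string (rather than a list of words) matches how the cleaned column is later passed to `CountVectorizer`.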

### Split the data into train - test data

In [7]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2, random_state=42)

In [8]:
train.shape

(79991, 3)

In [9]:
test.shape

(19998, 3)

In [10]:
train.head()

|   | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 58519 | 58531 | 1 | sleep suggest watching girlfriend experience |
| 38238 | 38250 | 1 | also school haha reminding case forgot |
| 3806 | 3807 | 1 | love country music |
| 27925 | 27937 | 1 | gmornin little madeleine cake gotta love chame... |
| 6006 | 6009 | 0 | cheapspeakers everybody rancho dancing infecti... |


In [11]:
test.head()

|   | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 33965 | 33977 | 1 | wheeee |
| 22853 | 22865 | 1 | thank good meet |
| 19448 | 19460 | 0 | electricbath eewwww gross sorry hayward hate like |
| 9732 | 9744 | 1 | followfriday little late special shoutout missus |
| 7129 | 7132 | 1 | icanhelp shopping deal personal assistant even... |


### Creating Features from a Bag of Words (Using scikit-learn)
- Convert the training tweets to a numeric representation that machine learning algorithms can work with.

- Bag-of-Words approach:
The Bag of Words model learns a vocabulary from all of the documents, then represents each document by counting the number of times each vocabulary word appears in it.
A vocabulary built from every word in every tweet would be too big, so to limit the size of the feature vectors we cap the vocabulary at the 5000 most frequent words.
- The `feature_extraction` module from scikit-learn creates the bag-of-words features.
`sklearn.feature_extraction.text.CountVectorizer` converts a collection of text documents to a matrix of token counts.
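
To make the idea concrete, here is a minimal pure-Python sketch of a bag of words with a capped vocabulary. It is only an illustration of the technique: the helper names `fit_vocabulary` and `to_count_vector` and the toy documents are made up, and scikit-learn's tokenization and tie-breaking rules may differ from the ones used here.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    """Return the max_features most frequent words, mapped to column indices."""
    totals = Counter(word for doc in documents for word in doc.split())
    # Sort by descending frequency; break ties alphabetically (an assumption
    # made here for determinism)
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    # Assign column indices in alphabetical order of the kept words
    return {word: idx for idx, word in enumerate(sorted(w for w, _ in top))}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:          # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

toy_docs = ["love country music", "love love cake", "hate gross music"]
toy_vocab = fit_vocabulary(toy_docs, max_features=3)
print(toy_vocab)
print([to_count_vector(d, toy_vocab) for d in toy_docs])
```

With `max_features=3` only `cake`, `love`, and `music` get a column, so the third toy document keeps just its `music` count; the same trade-off applies to the 5000-word cap used in this notebook.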

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
# The input to fit_transform should be a list of strings
train_tweets = train["SentimentText"].tolist()

In [14]:
# TODO: remove empty strings (e.g. with filter()). The matching sentiment labels
# must be removed too, otherwise features and labels no longer line up.
# train_tweets = list(filter(None, train_tweets))
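
One way the TODO above could be handled is to filter tweets and labels as pairs so they stay aligned. This is a sketch only: `drop_empty` is a hypothetical helper and the toy lists are made up. On the DataFrame itself, a boolean mask such as `train[train["SentimentText"].str.strip() != ""]` should achieve the same effect before calling `tolist()`.

```python
def drop_empty(tweets, labels):
    """Remove empty tweets together with their labels so both lists stay aligned."""
    kept = [(t, y) for t, y in zip(tweets, labels) if t.strip()]
    if not kept:
        return [], []
    kept_tweets, kept_labels = zip(*kept)
    return list(kept_tweets), list(kept_labels)

# Toy data (made up): two tweets became empty after cleaning
toy_tweets = ["love country music", "", "wheeee", "   "]
toy_labels = [1, 0, 1, 0]

toy_tweets, toy_labels = drop_empty(toy_tweets, toy_labels)
print(toy_tweets, toy_labels)
```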

In [15]:
len(train_tweets)

79991

In [16]:
# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool. 
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

In [17]:
# fit_transform() does two things:
# 1) fits the model and learns the vocabulary;
# 2) transforms the training data into feature vectors.
train_data_features = vectorizer.fit_transform(train_tweets)

In [18]:
train_data_features

<79991x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 418848 stored elements in Compressed Sparse Row format>

In [19]:
train_data_features.shape

(79991, 5000)

In [20]:
#train_data_features.toarray()
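
The numbers reported above already show why the dense conversion is left commented out. The short calculation below uses only the shape and stored-element count printed by the notebook; the 8 bytes per value follow from the `int64` dtype.

```python
rows, cols = 79991, 5000          # shape of train_data_features
stored = 418848                   # non-zero entries reported for the sparse matrix
bytes_per_value = 8               # int64

density = stored / (rows * cols)              # fraction of cells that are non-zero
avg_nonzero_per_tweet = stored / rows         # vocabulary words per tweet, on average
dense_bytes = rows * cols * bytes_per_value   # memory needed by toarray()

print(f"density: {density:.4%}")
print(f"avg non-zero per tweet: {avg_nonzero_per_tweet:.2f}")
print(f"dense array size: {dense_bytes / 1e9:.2f} GB")
```

Roughly one cell in a thousand is non-zero, so the sparse format stores only those entries, while a dense array would need about 3.2 GB.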

The Bag of Words model is now fitted, so let's look at the vocabulary.

In [21]:
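# get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, use get_feature_names() instead
vocab = vectorizer.get_feature_names_out()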

In [22]:
print(vocab)



### Classify with the Random Forest
At this point we have numeric training features from the Bag of Words and the original sentiment label for each feature vector, so we can do some supervised learning by applying classification algorithms.

In [23]:
from sklearn.ensemble import RandomForestClassifier


In [24]:
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100)

In [25]:
# Fit the forest to the training set, using the bag of words as 
# features and the sentiment labels as the response variable
forest = forest.fit(train_data_features, train["Sentiment"])
print("Classifier is trained")

Classifier is trained


### Run the trained Random Forest classifier on the test set, create a submission file, and predict the sentiment of a sample input

In [26]:
test.head()

|   | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 33965 | 33977 | 1 | wheeee |
| 22853 | 22865 | 1 | thank good meet |
| 19448 | 19460 | 0 | electricbath eewwww gross sorry hayward hate like |
| 9732 | 9744 | 1 | followfriday little late special shoutout missus |
| 7129 | 7132 | 1 | icanhelp shopping deal personal assistant even... |


In [29]:
testData = test["SentimentText"].tolist()

In [31]:
testSentiment = test["Sentiment"].tolist()

In [40]:
# Get a bag of words for the test set
test_data_features = vectorizer.transform(testData)
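
Note that the test set goes through `transform`, not `fit_transform`: the vocabulary learned from the training tweets is reused, and words that never occurred in training are simply dropped. The toy example below illustrates this with made-up sentences and a separate, hypothetical `toy_vectorizer`, so it does not touch the notebook's own `vectorizer`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["love country music", "hate sad music"]
toy_test = ["love jazz music"]          # "jazz" never appears in toy_train

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_train)           # vocabulary comes from the training data only

# Column order: vocabulary words sorted by their assigned column index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_test_features = toy_vectorizer.transform(toy_test).toarray()

print(columns)
print(toy_test_features)
```

The row for the test sentence has counts for `love` and `music` only; `jazz` has no column, which is also why train and test matrices end up with the same number of features.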

In [65]:
# Converting to a dense NumPy array is memory-heavy and not needed here:
# the classifiers below accept the sparse matrix directly
# test_data_features = test_data_features.toarray()

In [46]:
# Use the random forest to make sentiment label predictions on the test data
result_Forest = forest.predict(test_data_features)

In [60]:
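# Measure how good the forest classifier is by counting matching predictions
success = 0
failure = 0
# Note: the "- 1" leaves the last test example out of the comparison,
# so the percentages below are computed over len(testSentiment) - 1 tweets
length = len(testSentiment) - 1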
for i in range(0, length):
    if testSentiment[i] == result_Forest[i]:
        success += 1
    else:
        failure += 1

In [61]:
success

14462

In [63]:
(success / length)*100

72.32084812721908

In [64]:
(failure / length)*100

27.679151872780917
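
The counting loop above runs over `len(testSentiment) - 1` items, so it skips the last test tweet and divides by 19997 rather than 19998. A more compact way to compute accuracy over every example is sketched below. The `accuracy` helper and the toy labels are illustrative only, and no new figures for the real data are claimed here. In scikit-learn, `sklearn.metrics.accuracy_score(testSentiment, result_Forest)` computes the same fraction.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty list")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Toy labels (made up): 3 of the 5 predictions match
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
print(accuracy(toy_true, toy_pred))
```

The failure rate is then just `1 - accuracy(...)`, so it does not need a second counter.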

In [66]:
# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame(data={"id":test["ItemID"], "sentiment":result_Forest})

In [67]:
output.shape

(19998, 2)

In [68]:
# Use pandas to write the comma-separated output file
output.to_csv("Bag_of_Words_model-Random_Forest.csv", index=False, quoting=3)
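
The literal `quoting=3` in the call above is the numeric value of `csv.QUOTE_NONE` from the standard library, meaning no field is ever wrapped in quotes. The sketch below checks that value and writes two made-up rows with the plain `csv` module; passing `quoting=csv.QUOTE_NONE` to `to_csv` would be a more readable spelling of the same setting.

```python
import csv
import io

# The magic number used in to_csv(..., quoting=3)
assert csv.QUOTE_NONE == 3

# Write a header and one made-up row without any quoting
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONE)
writer.writerow(["id", "sentiment"])
writer.writerow([1, 0])

print(buffer.getvalue())
```

Since both columns are plain integers, there is nothing that would need quoting or escaping in this output.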

In [71]:
# Prediction on an input string
input_string = "Ooo hell yeah it is hot"
print(forest.predict(vectorizer.transform([twitts_to_words(input_string)])))

[1]


### Logistic Regression Classifier  

In [79]:
from sklearn.linear_model import LogisticRegression

In [80]:
logregression = LogisticRegression()

In [82]:
logregression = logregression.fit(train_data_features, train["Sentiment"])
print("Classifier is trained")

Classifier is trained


In [85]:
result_Logistic = logregression.predict(test_data_features)

In [88]:
# Measure how good the logistic regression classifier is
success = 0
failure = 0
# Note: the "- 1" leaves the last test example out of the comparison
length = len(testSentiment) - 1
for i in range(0, length):
    if testSentiment[i] == result_Logistic[i]:
        success += 1
    else:
        failure += 1

In [89]:
(success / length)*100

74.35115267290094

In [90]:
(failure / length)*100

25.648847327099066

In [108]:
outputLog = pd.DataFrame(data={"id":test["ItemID"], "sentiment":result_Logistic})

In [109]:
outputLog.to_csv("Bag_of_Words_model-Logistic_Regression.csv", index=False, quoting=3)

### SVM Classifier

In [93]:
from sklearn.svm import SVC

In [94]:
svm = SVC()

In [95]:
svm = svm.fit(train_data_features, train["Sentiment"])
print("Classifier is trained")

Classifier is trained


In [96]:
result_svm = svm.predict(test_data_features)

In [97]:
# Measure how good the SVM classifier is
success = 0
failure = 0
# Note: the "- 1" leaves the last test example out of the comparison
length = len(testSentiment) - 1
for i in range(0, length):
    if testSentiment[i] == result_svm[i]:
        success += 1
    else:
        failure += 1

In [98]:
(success / length)*100

56.42346351952793

In [99]:
(failure / length)*100

43.57653648047207

In [106]:
outputSVM = pd.DataFrame(data={"id":test["ItemID"], "sentiment":result_svm})

In [107]:
outputSVM.to_csv("Bag_of_Words_model-SVM.csv", index=False, quoting=3)