The goal of this notebook is to work through the steps needed to predict whether a tweet is positive, negative, or neutral based on the values in the Tweets.csv file.

In [27]:
import pandas as pd
import numpy as np
import os
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn import metrics 
from sklearn.metrics import classification_report

print(os.listdir("../input"))


In [28]:
tweets = pd.read_csv("../input/Tweets.csv")
tweets = tweets.reindex(np.random.permutation(tweets.index))

We've loaded the Tweets CSV file and shuffled its rows. Now let's take a look at the first few records. They show that we have columns we can use to predict the sentiment as well as columns that indicate what the actual sentiment of each tweet was. While our objective here is simply to predict sentiment, there are additional columns that we won't use but that could provide valuable insight into how sentiment varies by location or time of day.

In [29]:
tweets.head()

Let's start exploring the data by seeing which columns have data. We can see that airline_sentiment_gold and negativereason_gold don't have much data, so we'll be able to delete those columns. We can also see that we have 14,640 total records. 

In [30]:
tweets.count()

Let's delete the columns we won't need just to clean things up. Let's leave the airline name in the data to see if it adversely biases the prediction.

In [31]:
# Drop every column we won't use in a single call
tweets = tweets.drop(columns=[
    'airline_sentiment_confidence', 'negativereason_confidence',
    'airline_sentiment_gold', 'name', 'negativereason',
    'negativereason_gold', 'retweet_count', 'tweet_coord',
    'tweet_created', 'tweet_location', 'user_timezone',
])

In [32]:
tweets.head()

Now that the irrelevant data is gone, let's check to see if the data is roughly balanced (similar counts of airline_sentiment and airline).

We can see that while most of the airline_sentiment values are negative, the three classes are all of the same order of magnitude. Virgin America is poorly represented in the data set, which bears watching in case it influences the results. We won't use the airline column in training, but we'll want to check whether sentiment accuracy for Virgin America is reduced when we group the predictions by airline.

In [33]:
tweets['airline_sentiment'].value_counts(sort=False)

In [34]:
tweets['airline'].value_counts(sort=False)
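
To make "roughly balanced" a little more concrete, the raw counts can be turned into class shares and a largest-to-smallest ratio. The sketch below uses a made-up list of labels (an assumption for illustration, not the real Tweets.csv distribution) and only the standard library. In pandas, `Series.value_counts(normalize=True)` gives the same kind of shares.

```python
from collections import Counter

# Hypothetical toy labels, NOT the real Tweets.csv distribution
labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, plus the ratio of the largest class to the smallest
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in sorted(shares.items()):
    print(f"{label:>8}: {share:.0%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

A ratio well below 10 is one informal way to read "the same order of magnitude".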

As we'll be using the tweet text to predict sentiment, let's look at the text characteristics (average text length, variability of text length, word counts, lemmatized word count, capitalized words, etc.). An earlier line of investigation involved building a function that strips out everything but letters, converts the text to lowercase, removes stop words, and returns an array of single-word tokens. This approach ended up removing words that seem to be relevant ('not' and 'delayed'). Lemmatization also reduced some words with a negative context to a root form that had a positive sentiment.
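
The problem with stop-word removal is easy to reproduce. The function below is a minimal sketch of that kind of cleaner, not the original one. The name `simple_tokens` and the tiny `STOP_WORDS` set are assumptions made up for this example, although many common stop-word lists do include "not".

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"the", "a", "is", "was", "not", "my", "and", "to"}

def simple_tokens(text):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)
    return [w for w in letters_only.lower().split() if w not in STOP_WORDS]

# Two tweets with opposite meanings...
print(simple_tokens("My flight was NOT delayed!"))
print(simple_tokens("My flight was delayed!"))
```

Both calls return the same token list, so the negation that separates the two tweets is lost before the model ever sees it.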

Rather than pursuing the line of investigation above, let's jump into a word count vector structure.

Start off by creating a CountVectorizer object, then use the 'fit' method to learn the vocabulary. 

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
# create the transform
cv = CountVectorizer()
# tokenize and build vocab
cv.fit(tweets.text)


This shows that we have 15051 distinct words in the vocabulary.

In [36]:
len(cv.vocabulary_)


This shows that the sparse matrix housing our tweets has 14640 rows (one for each tweet) and 15051 columns (one for each word in the vocabulary).

In [44]:
docTerms = cv.fit_transform(tweets.text)
print(type(docTerms))
print(docTerms.shape)
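
To see where the "one row per tweet, one column per word" shape comes from, here is a hand-rolled word-count matrix on a three-document toy corpus. It is a standard-library illustration of the idea, not CountVectorizer itself. The corpus and the `tokenize` helper are assumptions for this sketch, and the regular expression only approximates CountVectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import re

# Hypothetical toy corpus standing in for tweets.text
docs = ["late flight, late bags", "great flight", "bags lost"]

def tokenize(text):
    # Lowercase and keep runs of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": collect the distinct words and give each one a column index
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocab = {word: index for index, word in enumerate(all_words)}

# "transform": one row per document, one column per vocabulary word
matrix = [[0] * len(vocab) for _ in docs]
for row, doc in enumerate(docs):
    for word in tokenize(doc):
        matrix[row][vocab[word]] += 1

print(vocab)
print(matrix)
print((len(matrix), len(vocab)))  # (documents, distinct words)
```

The real matrix is stored in a sparse format because most of its entries are zero. The dense list of lists here is only practical at toy scale.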

If we visualize the sparse matrix, we can see that there's a lot of commonality in vocabulary use.

In [38]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.spy(docTerms,markersize=.25)
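
The spy plot marks every non-zero entry, so words shared by many tweets show up as vertical streaks. A numeric complement is to compute the fraction of non-zero cells and the number of documents that use each word. The sketch below does this for a made-up 3 x 5 count matrix (an assumption for illustration). For a SciPy sparse matrix such as docTerms, the `nnz` and `shape` attributes provide the same ingredients for the density.

```python
# Hypothetical count matrix: rows = documents, columns = vocabulary words
matrix = [
    [1, 1, 0, 2, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
]

n_rows, n_cols = len(matrix), len(matrix[0])

# Fraction of cells that are non-zero
non_zero = sum(1 for row in matrix for value in row if value != 0)
density = non_zero / (n_rows * n_cols)

# Number of documents that use each word (column-wise non-zero count)
doc_frequency = [sum(1 for row in matrix if row[col] != 0) for col in range(n_cols)]

print(f"density: {density:.2f}")
print("documents per word:", doc_frequency)
```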


In [39]:
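# Note: test_size=.7 holds out 70% of the tweets for testing and trains on the remaining 30%
X_train, X_test, y_train, y_test = train_test_split(docTerms, tweets['airline_sentiment'].values, test_size=.7, random_state=25)
# Fit a logistic regression classifier on the word-count features
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)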

In [40]:
y_pred = LogReg.predict(X_test)

In [41]:
# Use a different name so the imported confusion_matrix function isn't overwritten
cm = confusion_matrix(y_test, y_pred)
cm

In [42]:
print(classification_report(y_test, y_pred))
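
Earlier we noted that Virgin America is under-represented and that accuracy should be checked per airline. One way to do that is sketched below with made-up (airline, true label, predicted label) triples. These are assumptions for illustration, not results from this model. In the notebook itself, the airline values would first have to be split alongside the features, for example by passing them to `train_test_split` as an additional array.

```python
from collections import defaultdict

# Hypothetical (airline, true label, predicted label) triples, not real results
rows = [
    ("United", "negative", "negative"),
    ("United", "positive", "positive"),
    ("United", "neutral", "negative"),
    ("Virgin America", "positive", "neutral"),
    ("Virgin America", "negative", "negative"),
]

correct = defaultdict(int)
total = defaultdict(int)
for airline, truth, predicted in rows:
    total[airline] += 1
    correct[airline] += int(truth == predicted)

# Accuracy per airline, reported with the group size for context
accuracy = {airline: correct[airline] / total[airline] for airline in total}
for airline, acc in accuracy.items():
    print(f"{airline}: {acc:.2f} ({total[airline]} tweets)")
```

Reporting the group size next to each accuracy matters here, since a small group such as Virgin America will give a noisier estimate.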

In [71]:
tweet = ["late flight rude dirty refund wait"]
t = cv.transform(tweet)
print(tweet, " is ", LogReg.predict(t))
tweet = ["great fun nice good"]
t = cv.transform(tweet)
print(tweet, " is ", LogReg.predict(t))
tweet = ["bad"]
t = cv.transform(tweet)
print(tweet, " is ", LogReg.predict(t))
tweet = ["bad bad bad bad bad "]
t = cv.transform(tweet)
print(tweet, " is ", LogReg.predict(t))

In [76]:
tweet = ["should have taken the train"]
t = cv.transform(tweet)
print(tweet, " is ", LogReg.predict(t))