# Sentiment Analysis

Let's make some money.

For our training set, we'll use the Rotten Tomatoes reviews from before. We'll start by using a linear support vector machine (SVM) as our classifier.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score

In [None]:
# load data
try:
    df = pd.read_csv('data/rt_critics.csv')
except IOError:
    print('cannot find file')

df.head()

If you want to challenge yourself, ignore the subsequent cells and create the classifier on your own with your favorite model right here.

See how stop words and tf-idf scoring help or hurt your model.

When you're done with that, skip to 'Next Steps'.

In [None]:
# play space for the bold.




In [None]:
# run this cell to examine data

vectorizer = CountVectorizer()

Xcv = vectorizer.fit_transform(df['quote'])

print('%d samples, %d features' % Xcv.shape)

In [27]:
# a helper function to train an SVM model and classify the test instances
def classify_svm(xtrain, xtest, ytrain, ytest):
    clf = svm.SVC(kernel='linear')
    clf.fit(xtrain, ytrain)
    ypredicted = clf.predict(xtest)
    # parenthesize the product so the accuracy is scaled before formatting
    print("Accuracy: %0.2f%%" % (100 * accuracy_score(ytest, ypredicted)))

But wait! We have more features than samples, which makes overfitting very likely. Let's trim the vocabulary down to the top 5000 terms, ranked by term frequency across all documents.
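
To see what that trimming does, here is a minimal sketch in plain Python rather than scikit-learn. The toy `docs` list and the helper names `top_k_vocabulary` and `count_vector` are made up for illustration, and the whitespace tokenizer is a simplification of what `CountVectorizer` really does.

```python
from collections import Counter

# Toy corpus standing in for the review quotes (hypothetical data).
docs = [
    "the great great movie",
    "the dull movie",
    "great acting and the great script",
]

def top_k_vocabulary(documents, k):
    """Keep the k terms with the highest total count across all documents."""
    totals = Counter()
    for doc in documents:
        totals.update(doc.lower().split())
    return [term for term, _ in totals.most_common(k)]

def count_vector(doc, vocabulary):
    """Count how often each vocabulary term appears in one document."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocab = top_k_vocabulary(docs, k=2)
matrix = [count_vector(d, vocab) for d in docs]

print(vocab)
print(matrix)
```

With `k=2`, only the two most frequent terms survive, so every document becomes a two-number vector. Rare words such as 'dull' are dropped, which is the price of a smaller feature space.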

In [None]:
# run this cell to vectorize our documents

# create vectorizer object
vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
Xcv = vectorizer.fit_transform(df['quote'])
Y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(Xcv, Y, random_state=1)

In [None]:
# Evaluate performance of models
classify_svm(xtrain, xtest, ytrain, ytest)

# Stop Words

The performance isn't horrible, but it's not great. Can we improve things by [using stop words](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)? See the linked documentation for how to tell CountVectorizer to skip stop words.

In [None]:
# edit this cell to include stopwords

# create vectorizer object
vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
Xcvs = vectorizer.fit_transform(df['quote'])
Y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtraincvs, xtestcvs, ytraincvs, ytestcvs = train_test_split(Xcvs, Y, random_state=1)

In [None]:
# Evaluate performance of models
classify_svm(xtraincvs, xtestcvs, ytraincvs, ytestcvs)
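
Once you've had a go at the exercise, this plain-Python sketch may help explain what you saw. It filters tokens against two tiny, made-up stop lists (real ones are much longer); the names `BASIC_STOP_WORDS`, `WITH_NEGATION`, and `remove_stop_words` are illustrative only. The second list adds negation words, which some stop lists include.

```python
# Two tiny, hypothetical stop lists; real lists hold hundreds of words.
BASIC_STOP_WORDS = {"the", "a", "is", "and", "of", "it"}
WITH_NEGATION = BASIC_STOP_WORDS | {"not", "no", "never"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop any token in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]

quote = "The plot is thin and the acting is not good"

kept_basic = remove_stop_words(quote, BASIC_STOP_WORDS)
kept_aggressive = remove_stop_words(quote, WITH_NEGATION)

print(kept_basic)
print(kept_aggressive)
```

With the second list, 'not good' collapses to 'good', so check what your stop list contains before assuming it is harmless for sentiment.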

# tf-idf

If that didn't work, how about using [tf-idf weighting](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)?

In [None]:
# edit this cell to create a TfidfVectorizer instead of a simple CountVectorizer

# create vectorizer object
vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
Xti = vectorizer.fit_transform(df['quote'])
Y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtrainti, xtestti, ytrainti, ytestti = train_test_split(Xti, Y, random_state=1)

In [None]:
# Evaluate performance of models
classify_svm(xtrainti, xtestti, ytrainti, ytestti)
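
For a feel of what tf-idf weighting computes, here is a hand-rolled sketch in plain Python. It assumes raw counts for term frequency, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2-normalized rows, which is how the scikit-learn documentation describes the `TfidfVectorizer` defaults. The toy corpus and the helper `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical data), tokenized on whitespace.
docs = ["good movie", "bad movie", "good good fun"]
tokenized = [d.split() for d in docs]

vocab = sorted(set(t for doc in tokenized for t in doc))
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed idf: terms that appear in fewer documents get a larger weight.
idf = {term: math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0 for term in vocab}

def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second document, 'bad' ends up with a larger weight than 'movie' even though each appears once, because 'movie' shows up in more documents and so carries less information.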

# tf-idf and stop words

Does using both together help?

In [None]:
# edit this cell to create a TfidfVectorizer that uses stop words

# create vectorizer object
vectorizer = CountVectorizer(max_features=5000)

# convert our documents and their labels into numpy arrays
Xtis = vectorizer.fit_transform(df['quote'])
Y = (df['fresh'] == 'fresh').values.astype(np.int8)

# split the converted data into training and test sets
xtraintis, xtesttis, ytraintis, ytesttis = train_test_split(Xtis, Y, random_state=1)

In [None]:
# Evaluate performance of models
classify_svm(xtraintis, xtesttis, ytraintis, ytesttis)

# Next steps

Are you satisfied with these results? Why might you be less than satisfied? How can you explain the observed behavior? What would you do next to improve this classifier? If you have time remaining, try a few strategies out below.

In [None]:
# continue playing here.
# did you finish all of the previous labs? How do your implementations compare?

# More Next Steps

We're not making any money with this classifier yet. If it were that easy, everyone would do it and there'd be no money in it. The hardest part of this problem is usually finding good training data. Googling 'sentiment analysis training data' or 'sentiment analysis test data' turns up a few freely available sources. Most of them are hosted by universities.

But notice that determining the judgment of a movie review isn't the same task as determining the emotional content of a tweet. And yet, it kind of is. The computer doesn't know anything about the nature of the text. All it knows is that there are documents with one label (fresh/happy) and documents with another label (rotten/sad), and it needs to fit a model to discriminate between the two. This can be extended to more classes (look into the 20 newsgroups dataset in scikit-learn) and to proprietary corpora.

One application you might use at work is classifying support emails from users. The classes might be 'ranting', 'mischarge', 'lost order', and 'gushing', or whatever categories are common. Even if the classifier isn't perfect, it could help streamline the process of getting the right emails to the right support personnel.
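
As a rough illustration of the multi-class case, here is a sketch that routes support emails into three of those classes. Note the swap: it uses a tiny hand-written multinomial naive Bayes with add-one smoothing, not the SVM from earlier, purely so it runs without any library. The training emails and the `train` and `predict` helpers are invented for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled support emails; a real corpus would be far larger.
training = [
    ("charged twice for one order", "mischarge"),
    ("refund the extra charge on my card", "mischarge"),
    ("my order never arrived", "lost order"),
    ("package lost in the mail", "lost order"),
    ("love the product great service", "gushing"),
    ("great support thank you", "gushing"),
]

def train(examples):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability; unseen words are ignored."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    tokens = [t for t in text.lower().split() if t in vocab]
    best_label, best_score = None, float("-inf")
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / float(n_docs))
        for t in tokens:
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            score += math.log((word_counts[label][t] + 1.0) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training)
for email in ["i was charged twice", "my package never arrived", "thank you great product"]:
    print("%s -> %s" % (email, predict(email, model)))
```

Nothing in the code knows what an email or a mischarge is; it only sees labelled bags of words, which is why the same recipe carries over from movie reviews to support queues.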