**Sentiment Analysis with Python and scikit-learn**

https://marcobonzanini.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/

Sentiment analysis is a field of study that analyses people’s opinions towards entities such as products, typically expressed in written form, for example in online reviews. Interest in the field has been driven in part by the massive popularity of social media, which provide a constant source of opinion-rich text to analyse.

This Jupyter notebook focuses on one particular application of sentiment analysis: **sentiment classification at the document level**. In other words, given a document (e.g. a review), the task is to determine whether it expresses a positive or a negative sentiment towards the product being discussed.

The data set we use is the well-known **Polarity Dataset v2.0**. It is a collection of **movie reviews** containing 2,000 documents, labelled and pre-processed. There are two labels, positive and negative, with 1,000 documents each. Every document has been tokenized and lowercased, and each line of a document represents one sentence. Real-world data are usually messy and need careful pre-processing before we can make good use of them; here that work has already been done, so we can focus on the classification problem. All we need to do is read the files and split the words on whitespace.
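
As a small illustration of that layout, the sketch below splits one review into sentences and tokens. The text in `sample_review` is made up for this example and is not taken from the dataset; it only mimics the format described above (lowercased, tokenized, one sentence per line).

```python
# A made-up snippet in the same layout as the dataset files:
# lowercased, tokenized, one sentence per line.
sample_review = (
    "the plot is thin , but the cast is great .\n"
    "i would watch it again .\n"
)

# Each non-empty line is one sentence. str.split() with no argument
# splits on any run of whitespace and drops empty strings.
sentences = [line.split() for line in sample_review.splitlines() if line.strip()]

# Flatten the sentences into a single token list for the document.
tokens = [tok for sent in sentences for tok in sent]

print(len(sentences), "sentences,", len(tokens), "tokens")
print(tokens[:5])
```

In the rest of the notebook this step is not done by hand: `TfidfVectorizer` applies its own tokenizer to the raw file contents, and with its default settings it ignores punctuation and single-character tokens.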


In [1]:
import sys
import os
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report

In [12]:
# Set data_dir to the location of the extracted dataset.
# It must contain two subfolders, "neg/" and "pos/".
data_dir = "txt_sentoken/"
classes = ['neg', 'pos']
train_data = []
train_labels = []
test_data = []
test_labels = []

In [13]:
# Read the data
for curr_class in classes:
    # Build the path to the class directory, e.g. txt_sentoken/neg
    dirname = os.path.join(data_dir, curr_class)
    # os.listdir returns the names of all entries in that directory
    for fname in os.listdir(dirname):
        # The "with" block closes the file automatically after reading
        with open(os.path.join(dirname, fname), 'r') as f:
            content = f.read()
            # Hold out files whose names start with 'cv9' as the test set;
            # everything else goes to the training set
            if fname.startswith('cv9'):
                test_data.append(content)
                test_labels.append(curr_class)
            else:
                train_data.append(content)
                train_labels.append(curr_class)
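
The prefix test in the loop above is a simple way to get a fixed hold-out split. The sketch below shows the effect on a list of hypothetical file names. It assumes, for illustration only, that each class folder holds 1,000 files numbered `cv000` to `cv999`; the numeric suffixes are invented.

```python
# Hypothetical file names for one class folder (suffixes are invented).
fnames = ["cv%03d_%05d.txt" % (i, 10000 + i) for i in range(1000)]

# Same rule as in the notebook: the 'cv9' prefix marks test files.
test_files = [f for f in fnames if f.startswith("cv9")]
train_files = [f for f in fnames if not f.startswith("cv9")]

print(len(train_files), "training files,", len(test_files), "test files")
```

Under that assumption the rule holds out 10% of each class, which agrees with the support of 100 documents per class in the classification reports.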

Words are used as features. scikit-learn provides several vectorizers that turn raw text into feature vectors; here we use TF-IDF, one of the most common weighting schemes. The vectorizer parameters used in this example are:

* min_df=5, discard words appearing in fewer than 5 documents
* max_df=0.8, discard words appearing in more than 80% of the documents
* sublinear_tf=True, use sublinear term-frequency scaling, i.e. replace tf with 1 + log(tf)
* use_idf=True, enable inverse-document-frequency (IDF) reweighting

More options are available and the best configuration might depend on your data or on the details of the task you’re facing.
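
To make the four parameters concrete, here is a minimal pure-Python sketch of the weighting on a toy corpus. It is not the library's implementation: it assumes the formulas given in the scikit-learn documentation for the defaults `smooth_idf=True` and `norm='l2'`, and the corpus, the helper names (`keep`, `idf`, `tfidf`) and the toy threshold `min_df=2` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for this example).
docs = [
    "good good film".split(),
    "bad film".split(),
    "good plot".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def keep(term, min_df=2, max_df=0.8):
    # min_df as an integer is an absolute count, max_df as a float
    # is a proportion of the documents.
    return df[term] >= min_df and df[term] <= max_df * n_docs

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0

def tfidf(doc):
    counts = Counter(t for t in doc if keep(t))
    # Sublinear TF: 1 + ln(tf) instead of the raw count.
    weights = {t: (1.0 + math.log(c)) * idf(t) for t, c in counts.items()}
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vocabulary = sorted(t for t in df if keep(t))
weights = tfidf(docs[0])
print(vocabulary)
print(weights)
```

In the first document "good" occurs twice and "film" once, yet the weight of "good" is less than twice that of "film". That damping is the effect of sublinear scaling.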

In [9]:
# Create feature vectors
vectorizer = TfidfVectorizer(min_df = 5, max_df = 0.8, sublinear_tf=True, use_idf=True)

train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

In [10]:
# Perform classification with SVM, using SVC's default kernel (rbf, i.e. Gaussian)
classifier_rbf = svm.SVC()
t0 = time.time()
classifier_rbf.fit(train_vectors, train_labels)
t1 = time.time()
prediction_rbf = classifier_rbf.predict(test_vectors)
t2 = time.time()
time_rbf_train = t1-t0
time_rbf_predict = t2-t1

In [6]:
# Perform classification with SVM, kernel=linear
classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, train_labels)
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1

In [7]:
# Print results in a nice table
print("Results for SVC(kernel=rbf)")
print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
print(classification_report(test_labels, prediction_rbf))
print("Results for SVC(kernel=linear)")
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
print(classification_report(test_labels, prediction_linear))

Results for SVC(kernel=rbf)
Training time: 8.020822s; Prediction time: 0.938096s
             precision    recall  f1-score   support

        neg       0.86      0.75      0.80       100
        pos       0.78      0.88      0.83       100

avg / total       0.82      0.81      0.81       200

Results for SVC(kernel=linear)
Training time: 7.193538s; Prediction time: 0.708705s
             precision    recall  f1-score   support

        neg       0.91      0.92      0.92       100
        pos       0.92      0.91      0.91       100

avg / total       0.92      0.92      0.91       200
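
The precision, recall and f1-score columns in the reports above are derived from counts of correct and incorrect predictions per class. The sketch below computes them by hand for a small example. The labels in `y_true` and `y_pred` are made up and are not the notebook's predictions, and `per_class_scores` is a hypothetical helper written for this illustration.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels: four documents per class.
y_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "neg", "pos", "pos", "pos", "neg", "neg"]

for label in ("neg", "pos"):
    p, r, f = per_class_scores(y_true, y_pred, label)
    print("%s  precision=%.2f  recall=%.2f  f1=%.2f" % (label, p, r, f))
```

Precision answers "of the documents predicted as this class, how many were right?", while recall answers "of the documents truly in this class, how many were found?". The f1-score is their harmonic mean.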

