# Text classification: sentiment analysis

### Load data 


[Sentiment Analysis Dataset](https://www.kaggle.com/sonaam1234/sentimentdata)

Alternative sources:
<br>
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)
<br>
[Movie Review Data](http://www.cs.cornell.edu/people/pabo/movie-review-data)

The data comes as two files, `rt-polarity.pos` and `rt-polarity.neg`. Each line in these files is a single snippet (usually about one sentence), and all snippets are lowercased.  
[More info about the dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)



In [1]:
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from nltk.classify.scikitlearn import SklearnClassifier
import re

In [2]:
fn = 'rt-polarity.neg'
# errors='ignore' skips bytes that are not valid UTF-8
with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_neg = content.splitlines()
print('len of texts_neg = {:,}'.format(len(texts_neg)))
for review in texts_neg[:5]:
    print('\n', review)

len of texts_neg = 5,331

 simplistic , silly and tedious . 

 it's so laddish and juvenile , only teenage boys could possibly find it funny . 

 exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable . 

 [garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation . 

 a visually flashy but narratively opaque and emotionally vapid exercise in style and mystification . 


In [3]:
fn = 'rt-polarity.pos'
# errors='ignore' skips bytes that are not valid UTF-8
with open(fn, 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_pos = content.splitlines()
print('len of texts_pos = {:,}'.format(len(texts_pos)))
for review in texts_pos[:5]:
    print('\n', review)

len of texts_pos = 5,331

 the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal . 

 the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth . 

 effective but too-tepid biopic

 if you sometimes like to go to the movies to have fun , wasabi is a good place to start . 

 emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one . 


In [4]:
X_pos_train, X_pos_test = train_test_split(texts_pos, test_size=0.2, random_state=42)

X_neg_train, X_neg_test = train_test_split(texts_neg, test_size=0.2, random_state=42)
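
As a quick check on the split sizes, the arithmetic below assumes that scikit-learn rounds the test share up to a whole sample and gives the remainder to training when only `test_size` is passed. It uses only the standard library; `split_sizes` is a hypothetical helper for illustration, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Each class file holds 5,331 snippets.
n_train, n_test = split_sizes(5331)
print("per class: train =", n_train, "test =", n_test)
print("combined test set:", 2 * n_test)
```

Under that assumption each class contributes 4,264 training and 1,067 test snippets, so the classes stay balanced and the combined test set holds 2,134 snippets.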


### Text preprocessing

In [5]:
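def preprocess_text(text):
    # Remove HTML-like tags such as <br>
    text = re.sub(r'<.*?>', '', text)
    # Remove URLs; \S+ stops at the next whitespace, so the rest of the line survives
    text = re.sub(r'https?://\S+', '', text)
    # Drop everything except letters, digits and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase (the snippets are already lowercased, so this is only a safeguard)
    text = text.lower()
    return text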
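
To see what these cleaning steps do to a snippet, here is a minimal, self-contained sketch. `clean` is a hypothetical restatement of the steps above (its URL pattern stops at the first whitespace, which is an assumption of this sketch), and the second input string is made up for illustration.

```python
import re


def clean(text):
    """Hypothetical restatement of the cleaning steps, for illustration only."""
    text = re.sub(r'<.*?>', '', text)           # drop HTML-like tags
    text = re.sub(r'https?://\S+', '', text)    # drop URLs up to the next whitespace (assumption)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # drop punctuation and other symbols
    return text.lower()


# A snippet from rt-polarity.neg
sample = "it's so laddish and juvenile , only teenage boys could possibly find it funny . "
print(clean(sample).split())

# Made-up input containing a tag and a URL
made_up = "see <b>this</b> at https://example.com now !"
print(clean(made_up).split())
```

Apostrophes and hyphens are deleted rather than replaced by spaces, so "it's" becomes "its" and "jean-claud" becomes "jeanclaud".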

### Apply text preprocessing to training and test data

In [6]:
X_pos_train = [preprocess_text(review) for review in X_pos_train]
X_pos_test = [preprocess_text(review) for review in X_pos_test]
X_neg_train = [preprocess_text(review) for review in X_neg_train]
X_neg_test = [preprocess_text(review) for review in X_neg_test]

### Definition of features and their impact on classification

In [7]:
tokenizer = TreebankWordTokenizer()
stop_words = set(stopwords.words('english'))

def find_features(review):
    words = tokenizer.tokenize(review)
    words = [word for word in words if word.lower() not in stop_words]
    return {word: True for word in words}
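
The feature extractor produces a "bag of words" that records only which words are present. The sketch below shows the shape of that output without requiring NLTK: `STOP_WORDS` is a small hypothetical subset (the notebook uses NLTK's full English list), `str.split` stands in for `TreebankWordTokenizer`, and `bag_of_words` is an illustrative name.

```python
# Hypothetical stop-word subset; the notebook uses NLTK's full English list.
STOP_WORDS = {"its", "so", "and", "it", "the", "a", "only", "could"}


def bag_of_words(review):
    """Map each token that is not a stop word to True (a presence feature)."""
    tokens = review.split()  # stand-in for TreebankWordTokenizer
    return {tok: True for tok in tokens if tok.lower() not in STOP_WORDS}


features = bag_of_words(
    "its so laddish and juvenile only teenage boys could possibly find it funny"
)
print(features)
```

Because the features only record presence, a word that occurs three times in a snippet counts the same as a word that occurs once.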

### Prepare data

In [8]:
X_y_train = [(find_features(review), 'pos') for review in X_pos_train] + [(find_features(review), 'neg') for review in X_neg_train]
X_y_test = [(find_features(review), 'pos') for review in X_pos_test] + [(find_features(review), 'neg') for review in X_neg_test]

### Training classifiers

In [9]:
# Wrap scikit-learn estimators so they accept NLTK-style (features, label) pairs
MNNB_classifier = SklearnClassifier(MultinomialNB())
lr_classifier = SklearnClassifier(LogisticRegression())
svc_clf = SklearnClassifier(SVC())
lin_svc_clf = SklearnClassifier(LinearSVC())
nu_svc_clf = SklearnClassifier(NuSVC())

# NLTK's own Naive Bayes classifier, for comparison
nb_classifier = nltk.NaiveBayesClassifier.train(X_y_train)

MNNB_classifier.train(X_y_train)
lr_classifier.train(X_y_train)
svc_clf.train(X_y_train)
lin_svc_clf.train(X_y_train)
nu_svc_clf.train(X_y_train)

<SklearnClassifier(NuSVC())>
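
The scikit-learn models expect numeric matrices rather than feature dictionaries, and `SklearnClassifier` takes care of that conversion. The sketch below illustrates the general idea in plain Python with toy data. The names `build_vocabulary` and `vectorize` are hypothetical, and this is not NLTK's actual implementation.

```python
def build_vocabulary(feature_dicts):
    """Assign a column index to every feature name seen in training."""
    names = sorted({name for feats in feature_dicts for name in feats})
    return {name: idx for idx, name in enumerate(names)}


def vectorize(features, vocabulary):
    """Turn one feature dict into a 0/1 row; names not in the vocabulary are ignored."""
    row = [0] * len(vocabulary)
    for name, present in features.items():
        if present and name in vocabulary:
            row[vocabulary[name]] = 1
    return row


# Toy training pairs in the same (features, label) shape as X_y_train
train = [
    ({"silly": True, "tedious": True}, "neg"),
    ({"gorgeously": True, "elaborate": True}, "pos"),
]
vocab = build_vocabulary(feats for feats, _ in train)
rows = [vectorize(feats, vocab) for feats, _ in train]
print(vocab)
print(rows)

# A test review with a word never seen in training ("fun")
print(vectorize({"silly": True, "fun": True}, vocab))
```

Words that appear only in the test set have no column, so they cannot influence the prediction.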

### Assessment of the accuracy of classifiers on the test set

In [10]:
print('Accuracy MNNB_classifier = {:.2f}%'.format(nltk.classify.accuracy(MNNB_classifier, X_y_test) * 100))
print('Accuracy nb_classifier = {:.2f}%'.format(nltk.classify.accuracy(nb_classifier, X_y_test) * 100))
print('Accuracy lr_classifier = {:.2f}%'.format(nltk.classify.accuracy(lr_classifier, X_y_test) * 100))
print('Accuracy svc_clf = {:.2f}%'.format(nltk.classify.accuracy(svc_clf, X_y_test) * 100))
print('Accuracy lin_svc_clf = {:.2f}%'.format(nltk.classify.accuracy(lin_svc_clf, X_y_test) * 100))
print('Accuracy nu_svc_clf = {:.2f}%'.format(nltk.classify.accuracy(nu_svc_clf, X_y_test) * 100))

Accuracy MNNB_classifier = 77.79%
Accuracy nb_classifier = 77.09%
Accuracy lr_classifier = 75.82%
Accuracy svc_clf = 74.46%
Accuracy lin_svc_clf = 74.23%
Accuracy nu_svc_clf = 75.07%
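
Accuracy here is the fraction of test snippets whose predicted label matches the true label, returned as a value between 0 and 1 and multiplied by 100 to get a percentage. A minimal sketch of that calculation follows; the `accuracy` helper and the toy labels are hypothetical and only illustrate the metric.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, not taken from the dataset
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]

score = accuracy(gold, predicted)  # a fraction in [0, 1]
print("Accuracy = {:.2f}%".format(score * 100))
```

Forgetting the multiplication by 100 prints the raw fraction with a percent sign attached, which is easy to misread as an accuracy below 1%.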
