<h2>Introduction</h2>

Welcome to this notebook, which covers text mining and Natural Language Processing (NLP).

The goal is to predict whether a movie review is positive or negative.

This is a binary classification problem. Let's dive in!

The data is available here: https://www.kaggle.com/c/word2vec-nlp-tutorial/data

<b>Importing stuff</b>

In [45]:
## Data processing libraries
import pandas as pd
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

## Machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

## Models evaluation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

<b>Reading data</b>

In [3]:
## Loading the datasets
## Note: the test set is loaded as well, so that its review column can be added to the global corpus
df_train = pd.read_csv("/Users/Yassine/Notebooks/Data/labeledTrainData.tsv",
                       delimiter="\t")

df_test = pd.read_csv("/Users/Yassine/Notebooks/Data/testData.tsv",
                      delimiter="\t")

<b>Quick look at the data</b>

In [4]:
df_train.head()

       id  sentiment                                             review
0  5814_8          1  With all this stuff going down at the moment w...
1  2381_9          1  \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3          0  The film starts with a manager (Nicholas Bell)...
3  3630_4          0  It must be assumed that those who praised this...
4  9495_8          1  Superbly trashy and wondrously unpretentious 8...


In [5]:
df_train.isnull().sum()

id           0
sentiment    0
review       0
dtype: int64

In [6]:
df_train['review']

0        With all this stuff going down at the moment w...
1        \The Classic War of the Worlds\" by Timothy Hi...
2        The film starts with a manager (Nicholas Bell)...
3        It must be assumed that those who praised this...
4        Superbly trashy and wondrously unpretentious 8...
5        I dont know why people think this is such a ba...
6        This movie could have been very good, but come...
7        I watched this video at a friend's house. I'm ...
8        A friend of mine bought this film for £1, and ...
9        <br /><br />This movie is full of references. ...
10       What happens when an army of wetbacks, towelhe...
11       Although I generally do not like remakes belie...
12       \Mr. Harvey Lights a Candle\" is anchored by a...
13       I had a feeling that after \Submerged\", this ...
14       note to George Litman, and others: the Mystery...
15       Stephen King adaptation (scripted by King hims...
16       `The Matrix' was an exciting summer blockbuste.

<b>The reviews are quite messy (HTML tags, stray backslashes, punctuation), so we need to clean and process them</b>

In [7]:
## Let's define a function to process reviews:
def process_review(review):
    
    ## Getting the English stopwords to remove them from the review
    stop_words = set(stopwords.words('english'))
    
    ## Removing backslashes and apostrophes
    review = review.replace('\\', '').replace("'", '')
    
    ## Removing any HTML tags
    review = BeautifulSoup(review, "lxml").text
    
    ## Word tokenizing the review
    review_words = word_tokenize(review)
    
    ## Lowercasing, then dropping tokens that contain no letter at all
    ## (pure punctuation or numbers); tokens mixing letters and digits are kept
    regex = re.compile('[a-zA-Z]+')
    review_words = filter(regex.search, (word.lower() for word in review_words))
    
    ## Removing stopwords
    review_words = [word for word in review_words if word not in stop_words]
    
    ## Stemming words
    sno = SnowballStemmer('english')
    review_words = [sno.stem(word) for word in review_words]
    
    return " ".join(review_words)
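
The token filter inside process_review is easy to misread, so here is a minimal standalone sketch of that single step. The token list is a made-up stand-in for the output of word_tokenize, not data from the reviews. It shows that `regex.search` keeps any token containing at least one letter, rather than stripping non-alphabetical characters out of each token.

```python
import re

# Same pattern as in process_review: matches a run of ASCII letters.
regex = re.compile('[a-zA-Z]+')

# Hypothetical tokens, standing in for the output of word_tokenize.
tokens = ["Superbly", "trashy", "8/10", "!", "2nd", "..."]

# search() succeeds as soon as the token contains one letter anywhere,
# so "2nd" survives while "8/10", "!" and "..." are dropped.
kept = list(filter(regex.search, (t.lower() for t in tokens)))
print(kept)  # ['superbly', 'trashy', '2nd']
```

If purely alphabetical tokens were wanted instead, `regex.fullmatch` would be the stricter choice. That is a possible variation, not what the notebook does.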

<b>Let's check the sentiment distribution</b>

In [8]:
df_train['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

<b>The two classes are perfectly balanced (12,500 reviews each), so no under- or oversampling is needed.
<br>
<br>
Let's now apply the process_review function to every review</b>

In [27]:
X_train = df_train['review'].apply(process_review)
X_test = df_test['review'].apply(process_review)

y = df_train['sentiment']

corpus = list(X_train) + list(X_test)

<b>Let's now vectorize our corpus using Scikit-learn's TF-IDF Vectorizer class</b>

In [40]:
tfidf_vectorizer = TfidfVectorizer()

In [41]:
X = tfidf_vectorizer.fit_transform(corpus)

<b>Now we have our vectorized features, as a sparse matrix</b>

In [42]:
X.shape

(50000, 83695)
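
To make the TF-IDF weighting more concrete, here is a rough pure-Python sketch of what is computed for each row. It assumes scikit-learn's documented defaults (smoothed idf, L2 normalisation of each row). The toy corpus and the helper name `tfidf_row` are invented for illustration, and tokenization is simplified to a whitespace split, so it is not a drop-in replacement for TfidfVectorizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-processed reviews (not from the dataset).
docs = ["good movi", "bad movi", "good good act"]

n_docs = len(docs)
tokenized = [d.split() for d in docs]

# Document frequency: number of documents in which each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF weights of one document as a dict."""
    tf = Counter(tokens)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in sorted(row.items())})
```

In the second toy document, "bad" gets a larger weight than "movi" because it appears in fewer documents. That is the effect TF-IDF is meant to produce.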

<b>Let's retrieve our training set, i.e. the first len(df_train) rows of X</b>

In [43]:
X_train_vect = X[:len(df_train)]

<b>Here is the fun part: let's try out some machine learning models and compare their cross-validated AUC</b>
<br>
<br>
<b>Let's first try a Logistic Regression Model</b>

In [44]:
for c in [1,5,10,20,30,50,100]:
    
    ## No random_state: the folds are not shuffled, so the split is already deterministic
    kfold = KFold(n_splits=10)
    lr_clf = LogisticRegression(C=c)
    scoring = 'roc_auc'
    results = cross_val_score(lr_clf, X_train_vect, y, cv=kfold, scoring=scoring)
    ## Mean AUC over the 10 folds
    print("[C = %d] AUC: %.6f" % (c, results.mean()))

[C = 1] AUC: 0.954347
[C = 5] AUC: 0.956133
[C = 10] AUC: 0.954825
[C = 20] AUC: 0.952763
[C = 30] AUC: 0.951440
[C = 50] AUC: 0.949818
[C = 100] AUC: 0.947875
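
The scores printed above are ROC AUC values. As a reminder of what that number measures, here is a small sketch that computes AUC as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The function name `roc_auc` and the labels and scores are hypothetical. This is only an illustration of the definition, not the routine scikit-learn uses internally.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(y_true, scores))  # 0.75: 3 of the 4 pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for reading the values above.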


<b>So the LR model performs best with C = 5 (mean AUC 0.956133)</b>
<br>
<b>Let's try a Decision Tree classifier</b>

In [37]:
## No random_state: the folds are not shuffled, so the split is already deterministic
kfold = KFold(n_splits=10)
dt_clf = DecisionTreeClassifier()
scoring = 'roc_auc'
results = cross_val_score(dt_clf, X_train_vect, y, cv=kfold, scoring=scoring)
print("AUC: %.6f" % results.mean())

AUC: 0.712001
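
Both experiments rely on 10-fold cross-validation without shuffling, which means each fold is a consecutive block of rows. The sketch below mimics that splitting scheme in plain Python. The helper name `kfold_indices` is hypothetical, and the block only illustrates the idea. It is not scikit-learn's implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# Tiny example: 10 samples split into 3 folds.
folds = list(kfold_indices(10, 3))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train size:", len(train_idx))
```

Every row is used exactly once for testing. With 25,000 training reviews and 10 splits, each model is fitted on 22,500 reviews and scored on the remaining 2,500.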


<h2>Conclusion</h2>
<br>
Logistic Regression performs far better than the Decision Tree classifier: its best mean cross-validated AUC is 0.956 (C = 5), against 0.712 for the tree.
<br>
<br>
<b>Next steps</b>
- Perform feature selection, to reduce the number of features after TF-IDF Vectorization
- Use a different word stemming/tokenization process 
- Try other machine learning models, but this may require heavy computation given the number of features
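
As one possible starting point for the feature selection step, here is a minimal sketch of document-frequency thresholding: terms that appear in fewer than `min_df` documents are dropped. The function `prune_vocabulary` and the toy reviews are hypothetical. In practice TfidfVectorizer's own `min_df` and `max_features` parameters might be a more convenient way to get a similar effect, and the right threshold would have to be found by experiment.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2):
    """Keep only the terms that occur in at least `min_df` documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    vocab = {term for term, count in doc_freq.items() if count >= min_df}
    pruned = [" ".join(t for t in doc.split() if t in vocab) for doc in docs]
    return pruned, sorted(vocab)

# Hypothetical processed reviews (not taken from the dataset).
docs = ["good movi great act", "bad movi", "good plot bad act"]
pruned_docs, vocab = prune_vocabulary(docs, min_df=2)
print(vocab)        # ['act', 'bad', 'good', 'movi']
print(pruned_docs)  # ['good movi act', 'bad movi', 'good bad act']
```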