# Text Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/
- Keras Documentation: https://keras.io


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Text classification

Our goal is to perform binary classification on text data. We will work through two examples, spam detection and sentiment analysis, and try three strategies:

1) build naive features based on our ideas

2) use a well-tested feature extraction technique

3) use deep learning and recurrent models on text

### 1. Spam detection on SMS messages

In [11]:
df = pd.read_csv('../data/sms.tsv', sep='\t')
df.head()

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df['label'].value_counts() / len(df)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

### Exercise 1: Encode Labels to 0 and 1

Create a variable called y that contains 0 for HAM messages and 1 for SPAM messages. There are several ways to do this.

In [19]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

le.fit(df['label'])
y = le.transform(df['label'])

# Alternative without LabelEncoder: map each label through a dictionary
# label_map = {"ham": 0, "spam": 1}
# y = df['label'].map(label_map).values

In [20]:
le.classes_

array(['ham', 'spam'], dtype=object)

In [21]:
y

array([0, 0, 1, ..., 0, 0, 0])
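Exercise 1 notes that there are several ways to build `y`. As a minimal sketch of two alternatives to `LabelEncoder`, assuming a small hand-made `labels` Series standing in for `df['label']` (the names `labels`, `label_map`, `y_map` and `y_bool` are illustrative and not part of the lab):

```python
import pandas as pd

# Toy stand-in for df['label']; these values are illustrative only.
labels = pd.Series(["ham", "ham", "spam", "ham"])

# Option A: explicit mapping, which makes the 0/1 assignment visible.
label_map = {"ham": 0, "spam": 1}
y_map = labels.map(label_map).to_numpy()

# Option B: boolean comparison cast to integers.
y_bool = (labels == "spam").astype(int).to_numpy()

print(y_map)   # [0 0 1 0]
print(y_bool)  # [0 0 1 0]
```

The explicit mapping is handy when you want to control which class becomes 1, since `LabelEncoder` assigns integers in sorted label order.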

### Exercise 2: Build naive features based on keywords

- turn all your sms messages to lowercase
- define a function to count occurrences of a single keyword with the following signature:

        def count_word(word, sentence):
            ....
            return count_word_in_sentence
            
            
- to test your function, try it on these examples and check that the results match:
   
        count_word("the", "quick brown fox") # -> 0
        count_word("fox", "quick brown fox") # -> 1
        count_word("a", "a b a abab") # -> 2
     

- using the function `count_word` you just wrote, create a feature matrix `X` using counts of some keywords of your choice. (This will be a bag-of-words representation.)
- create other similar features. You could use:
    - the length of the message
    - the presence of numbers
    - the presence of special characters
    - ...
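The "other similar features" listed above could be computed along these lines. This is only a sketch: `message_features` is a hypothetical helper, and treating `string.punctuation` as the set of "special characters" is an assumption.

```python
import string

def message_features(msg):
    """Return a few hand-crafted features for one message (hypothetical helper)."""
    return {
        # total number of characters in the message
        "length": len(msg),
        # 1 if the message contains at least one digit, else 0
        "has_digit": int(any(ch.isdigit() for ch in msg)),
        # number of ASCII punctuation characters
        "n_special": sum(ch in string.punctuation for ch in msg),
    }

features = message_features("Free entry in 2 a wkly comp!")
print(features)
```

Applied to every message, for example with a list comprehension over `df['msg']`, each key would become one extra column of the feature matrix.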

In [27]:
def count_word(word, sentence):
    i = 0
    for w in sentence.split(" "):
        if w == word:
            i += 1
    return i

print(count_word("the", "quick brown fox"))
print(count_word("fox", "quick brown fox"))
print(count_word("a", "a b a abab"))

0
1
2


In [28]:
df.head()

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [50]:
a = "a bee c"
a.count(" ")

2

In [52]:
# Lowercase each message once, then count a handful of hand-picked keywords
msgs_lower = [msg.lower() for msg in df['msg']]
for kw in ["so", "ok", "u", "free", "entry", "contact"]:
    df['count_' + kw] = [count_word(kw, msg) for msg in msgs_lower]

# Counts spaces, i.e. roughly the number of words minus one
df['word_count'] = [msg.count(" ") for msg in df['msg']]
df

Unnamed: 0,label,msg,test,count_so,count_ok,count_u,count_free,count_entry,count_contact,word_count
0,ham,"Go until jurong point, crazy.. Available only ...",0,0,0,0,0,0,0,19
1,ham,Ok lar... Joking wif u oni...,0,0,1,1,0,0,0,5
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,0,0,0,0,1,2,0,27
3,ham,U dun say so early hor... U c already then say...,1,1,0,2,0,0,0,10
4,ham,"Nah I don't think he goes to usf, he lives aro...",0,0,0,0,0,0,0,12
5,spam,FreeMsg Hey there darling it's been 3 week's n...,0,0,0,0,0,0,0,31
6,ham,Even my brother is not like to speak with me. ...,0,0,0,0,0,0,0,15
7,ham,As per your request 'Melle Melle (Oru Minnamin...,0,0,0,0,0,0,0,25
8,spam,WINNER!! As a valued network customer you have...,0,0,0,0,0,0,0,25
9,spam,Had your mobile 11 months or more? U R entitle...,0,0,0,1,1,0,0,28


### Exercise 3: Train first model and evaluate performance

- split the data into train and test sets with `test_size=0.3, random_state=0`. You can use the `train_test_split` function from sklearn, which we have used in previous labs
- train a model of your choice on these features
- evaluate performance on the training and test sets
- discuss with a classmate:
    - how did you evaluate performance?
    - is the model overfitting?
    - is the model better than the benchmark?
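One possible shape for this exercise is sketched below. It assumes logistic regression as the "model of your choice" and uses a synthetic feature matrix in place of the keyword counts, so the printed numbers say nothing about the SMS data. The benchmark here is a majority-class predictor.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the keyword-count features; NOT the SMS data.
rng = np.random.RandomState(0)
y = (rng.rand(500) < 0.13).astype(int)               # roughly 13% positives
X = rng.poisson(lam=0.2 + 1.5 * y[:, None], size=(500, 3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# Benchmark: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)
print(f"train={train_acc:.3f} test={test_acc:.3f} baseline={baseline_acc:.3f}")
```

On the real data, about 86.6% of the messages are ham, so always predicting ham already reaches roughly 0.87 accuracy. A large gap between training and test accuracy would point to overfitting.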

### Exercise 4: Cross Validation

- perform a 5-fold cross-validation on your model. You can refer back to lab 8 to refresh your memory on how to do this.
- print the confusion matrix and the classification report on the test data
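The two steps above can be sketched with scikit-learn as follows. The data is synthetic (a stand-in for your own `X` and `y`), and logistic regression is an assumed model choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the keyword-count features; NOT the SMS data.
rng = np.random.RandomState(0)
y = (rng.rand(500) < 0.13).astype(int)
X = rng.poisson(lam=0.2 + 1.5 * y[:, None], size=(500, 3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression()

# 5-fold cross-validation on the training portion; one accuracy per fold.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Fit once on the full training set, then report on the held-out test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted
print(cm)
print(classification_report(y_test, y_pred, target_names=["ham", "spam"]))
```

With imbalanced classes, the per-class precision and recall in the report are usually more informative than accuracy alone.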

### Exercise 5: Count Features

- use features based on word counts using the `CountVectorizer` class from Scikit Learn
- use the following function to simplify your code (it encapsulates model training and evaluation):


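    from sklearn.model_selection import train_test_split
    from keras.models import Sequential
    from keras.layers import Dense

    def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
        # Hold out a test set (train_test_split defaults to 25% of the data)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

        # Default model: a single sigmoid unit, i.e. logistic regression
        if model is None:
            model = Sequential()
            model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

        h = model.fit(X_train, y_train, epochs=epochs, verbose=1)

        # evaluate returns [loss, accuracy] because of metrics=['accuracy']
        train_loss, train_acc = model.evaluate(X_train, y_train)
        test_loss, test_acc = model.evaluate(X_test, y_test)

        return train_loss, train_acc, test_loss, test_acc, model, h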


- did you improve the performance?
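As a rough illustration of what `CountVectorizer` produces, here is a sketch on three made-up messages (toy text, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, illustrative only.
msgs = ["free entry to win", "ok see you later", "win a free prize"]

vect = CountVectorizer()
X = vect.fit_transform(msgs)      # sparse matrix: one row per message
print(X.shape)                    # (3, 9)
print(sorted(vect.vocabulary_))   # the learned vocabulary

# Dense copy, in case the downstream model does not accept sparse input.
X_dense = X.toarray()
```

Note that the default tokenizer ignores single-character tokens, so "a" does not appear in the vocabulary. Whether the Keras model in `split_fit_eval` accepts a sparse matrix directly may depend on your Keras version, so converting to a dense array is an assumption made here for safety.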

## Sentiment Analysis

The previous dataset was easy. Let's switch to a harder one and do sentiment analysis on it.

In [4]:
df = pd.read_csv('../data/rt_critics.csv')
df.head()

Unnamed: 0,critic,fresh,imdb,publication,quote,review_date,rtid,title
0,Derek Adams,fresh,114709.0,Time Out,"So ingenious in concept, design and execution ...",2009-10-04,9559.0,Toy story
1,Richard Corliss,fresh,114709.0,TIME Magazine,The year's most inventive comedy.,2008-08-31,9559.0,Toy story
2,David Ansen,fresh,114709.0,Newsweek,A winning animated feature that has something ...,2008-08-18,9559.0,Toy story
3,Leonard Klady,fresh,114709.0,Variety,The film sports a provocative and appealing st...,2008-06-09,9559.0,Toy story
4,Jonathan Rosenbaum,fresh,114709.0,Chicago Reader,"An entertaining computer-generated, hyperreali...",2008-03-10,9559.0,Toy story


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14072 entries, 0 to 14071
Data columns (total 8 columns):
critic         13382 non-null object
fresh          14072 non-null object
imdb           14072 non-null float64
publication    14072 non-null object
quote          14072 non-null object
review_date    14072 non-null object
rtid           14072 non-null float64
title          14072 non-null object
dtypes: float64(2), object(6)
memory usage: 879.6+ KB


In [6]:
df['fresh'].value_counts() / len(df)

fresh     0.612067
rotten    0.386299
none      0.001634
Name: fresh, dtype: float64

In [7]:
df = df[df.fresh != 'none'].copy()
df['fresh'].value_counts() / len(df)

fresh     0.613069
rotten    0.386931
Name: fresh, dtype: float64

In [8]:
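from sklearn.preprocessing import LabelEncoder

# Re-create the encoder: `le` from the spam section is not defined in this session.
# Classes are sorted alphabetically, so 'fresh' -> 0 and 'rotten' -> 1.
le = LabelEncoder()
le.fit(df['fresh'])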

In [None]:
y = le.transform(df['fresh'])

### Exercise 6: TFIDF

- Build features with TF-IDF weighted word frequencies (scikit-learn provides `TfidfVectorizer` for this)
- do train/test split
- train and evaluate a model
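A minimal sketch of these three steps is shown below. It assumes a pipeline of `TfidfVectorizer` and logistic regression, and uses eight invented quotes instead of the review data, so the score is meaningless beyond showing that the code runs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented quotes, illustrative only; 0 = fresh, 1 = rotten.
quotes = [
    "a delightful and inventive comedy",
    "charming, funny and moving",
    "a clever script with great performances",
    "wonderful from start to finish",
    "a dull and lifeless mess",
    "boring, predictable and far too long",
    "the plot makes no sense at all",
    "a tedious waste of a good cast",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# stratify keeps both classes in the train and test parts of this tiny set.
q_train, q_test, y_train, y_test = train_test_split(
    quotes, labels, test_size=0.25, random_state=0, stratify=labels)

# The pipeline fits the TF-IDF vocabulary on the training text only.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(q_train, y_train)
test_acc = pipe.score(q_test, y_test)
print(test_acc)
```

Splitting before vectorizing means the vocabulary and document frequencies are learned from the training set only, which keeps information about the test set out of the features.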

### Exercise 7: NLP with deep learning

- Use the Tokenizer from Keras to:
    - Create a vocabulary
    - Convert sentences to sequences of integers
- pad the sequences so that they look like a tensor using the `pad_sequences` function from Keras.
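To make the steps above concrete, here is a plain-Python sketch of what tokenizing and padding do conceptually. It is not the Keras API: `build_vocab`, `to_sequences` and `pad` are hypothetical helpers that imitate the usual Keras conventions (index 0 reserved for padding, more frequent words get smaller indices, zeros added at the front), and unlike the Keras Tokenizer they do not strip punctuation.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequences(texts, vocab):
    """Replace each known word by its index; unknown words are dropped."""
    return [[vocab[w] for w in t.lower().split() if w in vocab] for t in texts]

def pad(seqs, maxlen):
    """Keep the last maxlen items of each sequence and left-pad with zeros."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in seqs]

texts = ["a great film", "a dull film", "great"]
vocab = build_vocab(texts)
seqs = to_sequences(texts, vocab)
padded = pad(seqs, maxlen=3)
print(padded)  # [[1, 2, 3], [1, 4, 3], [0, 0, 2]]
```

After padding, every row has the same length, so the list can be turned into a 2D integer array and fed to an Embedding layer.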

### Train / Test split on sequences

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Exercise 8: Build recurrent neural network model
- use what you have learned to build a recurrent model that classifies the sentiment
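One possible starting point is an Embedding layer followed by a single LSTM and a sigmoid output. The sketch below is an assumption rather than the lab's reference solution: `build_lstm_classifier` is a hypothetical helper, the layer sizes are arbitrary, and the data is random integers standing in for padded review sequences.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

def build_lstm_classifier(vocab_size, embedding_dim=32, lstm_units=16):
    """Hypothetical helper: Embedding -> LSTM -> sigmoid output."""
    model = Sequential()
    # Turns each integer word index into a dense vector of size embedding_dim.
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    # Reads the sequence and returns only its final hidden state.
    model.add(LSTM(lstm_units))
    # Single probability for the positive class.
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

# Random integer sequences standing in for padded reviews (not real data).
rng = np.random.RandomState(0)
X_fake = rng.randint(1, 100, size=(8, 10))
y_fake = rng.randint(0, 2, size=(8,))

model = build_lstm_classifier(vocab_size=100)
model.fit(X_fake, y_fake, epochs=1, verbose=0)
preds = model.predict(X_fake, verbose=0)
print(preds.shape)  # (8, 1)
```

On the real data, `vocab_size` should be at least the largest word index plus one, since index 0 is used for padding.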

### Exercise 9

- Try changing the network architecture and re-train the model at each change. Can you avoid overfitting?
    - change the number of nodes in the LSTM layer
    - change the output dimension of the Embedding layer
    - add dropout and recurrent dropout to the LSTM
    - add a second LSTM layer
    - add kernel regularizers
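The changes listed above could be combined roughly as follows. This is a sketch only: `build_regularized_lstm` is a hypothetical helper, and the dropout rates and L2 strength are arbitrary starting values, not tuned recommendations.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
from keras.regularizers import l2

def build_regularized_lstm(vocab_size, embedding_dim=16, lstm_units=8):
    """Hypothetical helper: two stacked LSTMs with dropout and L2 regularization."""
    model = Sequential()
    model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    # return_sequences=True so the second LSTM receives the full sequence.
    model.add(LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2,
                   return_sequences=True))
    # Second LSTM, with an L2 penalty on its input weights.
    model.add(LSTM(lstm_units, dropout=0.2, recurrent_dropout=0.2,
                   kernel_regularizer=l2(1e-4)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

# Random integer sequences standing in for padded reviews (not real data).
X_fake = np.random.RandomState(0).randint(1, 50, size=(4, 6))
reg_model = build_regularized_lstm(vocab_size=50)
reg_preds = reg_model.predict(X_fake, verbose=0)
print(reg_preds.shape)  # (4, 1)
```

Changing one setting at a time and comparing training and validation accuracy after each run makes it easier to see which change actually reduces overfitting.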