# Text Classification

- Pandas Documentation: http://pandas.pydata.org/
- Scikit Learn Documentation: http://scikit-learn.org/stable/documentation.html
- Seaborn Documentation: http://seaborn.pydata.org/
- Keras Documentation: https://keras.io


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

## Text classification

Our goal is to perform binary classification on text data. We will work through both a spam detection example and a sentiment analysis example, trying 3 strategies:

1) build naive features based on our own ideas

2) use well-tested feature extraction techniques

3) use deep learning and recurrent models on text

### 1. Spam detection on SMS messages

In [2]:
df = pd.read_csv('../data/sms.tsv', sep='\t')
df.head()

Unnamed: 0,label,msg
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
df['label'].value_counts() / len(df)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

### Exercise 1: Encode Labels to 0 and 1

Create a variable called y that contains 0 for HAM messages and 1 for SPAM messages. There are several ways to do this.

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

y = le.fit_transform(df['label'])

In [5]:
le.classes_

array(['ham', 'spam'], dtype=object)

In [6]:
y

array([0, 0, 1, ..., 0, 0, 0])

In [None]:
(df['label'] == 'spam').astype(int)

In [None]:
df['label'].map({'ham': 0, 'spam': 1})

### Exercise 2: Build naive features based on keywords

- turn all your sms messages to lowercase
- define a function to count occurrences of a single keyword with the following signature:

        def count_word(word, sentence):
            ....
            return count_word_in_sentence
            
            
- to test your function, try it on these examples and check that the results match:
   
        count_word("the", "quick brown fox") # -> 0
        count_word("fox", "quick brown fox") # -> 1
        count_word("a", "a b a abab") # -> 2
     

- using the function `count_word` you just wrote, create a feature matrix `X` using counts of some keywords of your choice (this is a simple bag-of-words representation).
- create other similar features. You could use:
    - the length of the message
    - the presence of numbers
    - the presence of special characters
    - ...
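
The additional features suggested above could be sketched as follows. This is only a minimal illustration: the helper name `extra_features` and the definition of a "special" character (anything that is not a lowercase letter, digit or whitespace after lowercasing) are assumptions made for this example, not part of the lab.

```python
import re

def extra_features(sentence):
    """Return a few hand-crafted features for one message (hypothetical helper)."""
    lowered = sentence.lower()
    return {
        # total number of characters in the message
        'length': len(sentence),
        # number of digit characters
        'num_digits': len(re.findall(r'[0-9]', lowered)),
        # assumption: "special" = not a letter, digit or whitespace
        'num_special': len(re.findall(r'[^a-z0-9\s]', lowered)),
    }

example = extra_features("free entry!! text 87121 now")
print(example)
```

A list of such dictionaries, one per message, can be passed to `pd.DataFrame` to obtain extra columns alongside the keyword counts.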

In [7]:
docs_lower  = df['msg'].str.lower().values

In [8]:
df['msg'].apply(lambda x: x.lower()).values

array(['go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...',
       'ok lar... joking wif u oni...',
       "free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005. text fa to 87121 to receive entry question(std txt rate)t&c's apply 08452810075over18's",
       ..., 'pity, * was in mood for that. so...any other suggestions?',
       "the guy did some bitching but i acted like i'd be interested in buying something else next week and he gave it to us for free",
       'rofl. its true to its name'], dtype=object)

In [9]:
docs = df['msg'].values

In [10]:
docs_lower = [d.lower() for d in docs]

In [11]:
def count_word(word, sentence):
    tokens = sentence.split()
    return len([w for w in tokens if w == word])

In [14]:
X = pd.DataFrame([count_word('free', d) for d in docs_lower],
                 columns=['free'])

for keyword in ['win', 'discount', 'call']:
    X[keyword] = [count_word(keyword, d) for d in docs_lower]
    
X.head()

Unnamed: 0,free,win,discount,call
0,0,0,0,0
1,0,0,0,0
2,1,1,0,0
3,0,0,0,0
4,0,0,0,0


In [36]:
import re

In [37]:
def count_numbers(sentence):
    return len(re.findall('[0-9]', sentence))

In [38]:
X['num_char'] = [count_numbers(d) for d in docs_lower]

In [39]:
X.head()

Unnamed: 0,free,win,discount,call,num_char
0,0,0,0,0,0
1,0,0,0,0,0
2,1,1,0,0,25
3,0,0,0,0,0
4,0,0,0,0,0


### Exercise 3: Train first model and evaluate performance

- split the data into train and test sets with `test_size=0.3, random_state=0`. You can use the `train_test_split` function from sklearn, which we have used in previous labs
- train a model of your choice on these features
- evaluate performance on the training and test sets
- discuss with a classmate:
    - how did you evaluate performance?
    - is the model overfitting?
    - is the model better than the benchmark?
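
One common benchmark for an imbalanced binary problem is the accuracy of always predicting the most frequent class. A minimal sketch, using a made-up label list and a hypothetical helper name `majority_baseline`:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# toy labels for illustration only: 6 ham (0) and 2 spam (1)
toy_labels = [0, 0, 0, 1, 0, 0, 1, 0]
baseline_acc = majority_baseline(toy_labels)
print(baseline_acc)
```

A model is only useful if its test accuracy clearly exceeds this number.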

First model:

- 4 keywords: baseline: 0.868, train_score: 0.904, test_score: 0.906
- 4 keywords + digit count: baseline: 0.868, train_score: 0.977, test_score: 0.972

In [41]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
baseline = (pd.Series(y_test).value_counts() / len(y_test))[0]
print("baseline: {:0.3}, train_score: {:0.3}, test_score: {:0.3}".format(baseline, train_score, test_score))

baseline: 0.868, train_score: 0.977, test_score: 0.972


In [30]:
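from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Use a separate name so the random forest stored in `model`
# stays available for the cross validation cells below.
nn_model = Sequential([
    # infer the input size from the data instead of hard-coding 4 features
    Dense(10, input_shape=(X_train.shape[1], ), activation='relu'),
    Dense(1, activation='sigmoid')
])
nn_model.compile(optimizer='adam',
                 loss='binary_crossentropy',
                 metrics=['accuracy'])

nn_model.fit(X_train, y_train, epochs=10, verbose=0)

train_score = nn_model.evaluate(X_train, y_train, verbose=0)[1]
test_score = nn_model.evaluate(X_test, y_test, verbose=0)[1]
baseline = (pd.Series(y_test).value_counts() / len(y_test))[0]
print("baseline: {:0.3}, train_score: {:0.3}, test_score: {:0.3}".format(baseline, train_score, test_score))

In [23]:
# the sigmoid output is a probability: threshold at 0.5 to get class labels
y_pred = (nn_model.predict(X_test) > 0.5).astype(int).ravel()

In [24]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)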

array([[1375,   76],
       [  81,  140]])

### Exercise 4: Cross Validation

- perform a 5-fold cross validation on your model. You can refer back to lab 8 to refresh your memory on how to do this.
- print the confusion matrix and the classification report for the test data

In [42]:
from sklearn.model_selection import cross_val_score

In [43]:
scores = cross_val_score(model, X, y, cv = 5, n_jobs=-1)
scores

array([0.9793722 , 0.97399103, 0.96947935, 0.96858169, 0.97666068])

In [44]:
print("Average score: {:0.3} +/- {:0.3}".format(scores.mean(), scores.std()))

Average score: 0.974 +/- 0.00412


In [45]:
from sklearn.metrics import confusion_matrix, classification_report

In [46]:
y_pred = model.predict(X_test)

In [47]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      1451
           1       0.93      0.86      0.89       221

   micro avg       0.97      0.97      0.97      1672
   macro avg       0.95      0.92      0.94      1672
weighted avg       0.97      0.97      0.97      1672



In [48]:
cm = confusion_matrix(y_test, y_pred)

pd.DataFrame(cm,
             index=['ham', 'spam'],
             columns=['pred_ham', 'pred_spam'])

Unnamed: 0,pred_ham,pred_spam
ham,1436,15
spam,32,189
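
The spam row of the classification report can be recomputed by hand from the four counts in the confusion matrix above. The variable names below are chosen for this sketch only:

```python
# Counts copied from the confusion matrix above
# (rows: true class, columns: predicted class)
tn, fp = 1436, 15   # true ham:  predicted ham, predicted spam
fn, tp = 32, 189    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # of the messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision: {:0.2f}, recall: {:0.2f}, f1: {:0.2f}, accuracy: {:0.3f}".format(
    precision, recall, f1, accuracy))
```

The rounded values agree with the spam row of the report (0.93, 0.86, 0.89) and with the test score of 0.972.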


### Exercise 5: Count Features

- use features based on word counts using the `CountVectorizer` class from Scikit Learn
- use the following function to simplify your code (it encapsulates model training and evaluation):


    def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)

        if not model:
            model = Sequential()
            model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

            model.compile(loss='binary_crossentropy',
                          optimizer='adam',
                          metrics=['accuracy'])

        h = model.fit(X_train, y_train, epochs=epochs, verbose=1)

        train_loss, train_acc = model.evaluate(X_train, y_train)
        test_loss, test_acc = model.evaluate(X_test, y_test)

        return train_loss, train_acc, test_loss, test_acc, model, h


- did you improve the performance?
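
To get a feel for what a count-based feature matrix looks like, here is a simplified pure-Python sketch on three made-up messages. It only splits on whitespace; `CountVectorizer` additionally applies its own tokenization rules, stop word removal and vocabulary size limit, so this is an illustration of the idea rather than a reimplementation. All names (`toy_docs`, `toy_vocab`, `toy_counts`) are assumptions for this example.

```python
from collections import Counter

toy_docs = ["free entry to win", "ok see you later", "win free prize free"]

# split each message into lowercase tokens
toy_tokens = [d.lower().split() for d in toy_docs]

# the vocabulary is the sorted set of all tokens; each word gets one column
toy_vocab = sorted({w for tokens in toy_tokens for w in tokens})

# one row per message, one count per vocabulary word
toy_counts = [[Counter(tokens)[w] for w in toy_vocab] for tokens in toy_tokens]

print(toy_vocab)
for row in toy_counts:
    print(row)
```

Each row has the same length as the vocabulary, which is why the real matrix is stored in sparse form: most entries are zero.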

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

In [50]:
vocab_size = 3000

In [51]:
vect = CountVectorizer(decode_error='ignore',
                       stop_words='english',
                       lowercase=True,
                       max_features=vocab_size)

In [52]:
X = vect.fit_transform(docs)

In [53]:
# toarray() returns a plain ndarray (todense() returns np.matrix, which Keras may not accept)
Xd = X.toarray()

In [54]:
# note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vocab = vect.get_feature_names()

In [55]:
vocab[:10]

['00',
 '000',
 '02',
 '0207',
 '02073162414',
 '03',
 '04',
 '05',
 '06',
 '07123456789']

In [56]:
vocab[-10:]

['yogasana', 'yor', 'yr', 'yrs', 'yummy', 'yun', 'yunny', 'yuo', 'yup', 'zed']

In [57]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

In [58]:
def split_fit_eval(X, y, model=None, epochs=10, random_state=0):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
    
    if not model:
        model = Sequential()
        model.add(Dense(1, input_dim=X.shape[1], activation='sigmoid'))

        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
    
    h = model.fit(X_train, y_train, epochs=epochs, verbose=1)
    
    train_loss, train_acc = model.evaluate(X_train, y_train)
    test_loss, test_acc = model.evaluate(X_test, y_test)
    
    return train_loss, train_acc, test_loss, test_acc, model, h

In [63]:
res = split_fit_eval(Xd, y)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [60]:
res[1]

0.9820531

In [61]:
res[3]

0.97631013

In [67]:
w, b = res[4].get_weights()

In [76]:
w.ravel()

array([ 0.22206715,  0.2857848 ,  0.19852541, ..., -0.00284304,
       -0.34787667,  0.18753704], dtype=float32)

In [71]:
feature_weights = pd.Series(w.ravel(), index=vocab).sort_values()

In [74]:
feature_weights.head(10)

ok      -0.660934
ll      -0.565965
home    -0.564236
da      -0.549822
oh      -0.513154
going   -0.510647
sorry   -0.509732
come    -0.496596
later   -0.495573
got     -0.474354
dtype: float32

In [73]:
feature_weights.tail(10)

stop       0.449545
mobile     0.469803
service    0.472298
prize      0.489064
150p       0.516159
free       0.526045
claim      0.531227
txt        0.552158
www        0.556348
uk         0.561028
dtype: float32

## Sentiment Analysis

The previous dataset was easy. Let's switch to a harder one and do sentiment analysis on it.

In [None]:
df = pd.read_csv('../data/rt_critics.csv')
df.head()

In [None]:
df.info()

In [None]:
df['fresh'].value_counts() / len(df)

In [None]:
df = df[df.fresh != 'none'].copy()
df['fresh'].value_counts() / len(df)

In [None]:
y = le.fit_transform(df['fresh'])

### Exercise 6: TFIDF

- Build TF-IDF features (term frequency weighted by inverse document frequency). sklearn provides `TfidfVectorizer` for this.
- do train/test split
- train and evaluate a model
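
The weighting idea behind TF-IDF can be sketched in a few lines of plain Python. The formula below, smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, is our understanding of the `TfidfVectorizer` defaults; treat it as an assumption and check the scikit-learn documentation for the exact behavior. The toy documents and the names `toy_docs`, `doc_freq` and `tfidf` are made up for this example.

```python
import math
from collections import Counter

# three tiny pre-tokenized "reviews" (illustrative data only)
toy_docs = [["great", "movie"], ["bad", "movie"], ["great", "great", "acting"]]
n_docs = len(toy_docs)

# document frequency: in how many documents does each word appear?
doc_freq = Counter(w for doc in toy_docs for w in set(doc))

def tfidf(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    tf = Counter(doc)
    raw = {w: tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(toy_docs[1])
print(weights)
```

In the second document, "bad" appears in fewer documents than "movie", so it receives the larger weight even though both occur once.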

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vect = TfidfVectorizer(decode_error='ignore',
                       stop_words='english',
                       max_features=20000)

X = vect.fit_transform(df['quote'])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

In [None]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
model.score(X_test, y_test)

### Exercise 7: NLP with deep learning

- Use the `Tokenizer` from tensorflow.keras to:
    - Create a vocabulary
    - Convert sentences to sequences of integers
- pad the sequences to a common length, so that they can be stacked into a single 2D tensor, using the `pad_sequences` function from tensorflow.keras.
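
The two steps above can be illustrated without Keras. This sketch assigns integer indices by descending word frequency starting at 1, and pads with zeros at the front, which is our understanding of the default behavior of `Tokenizer` and `pad_sequences`. The helper `pad_pre` and the toy texts are assumptions for this example, and the real `Tokenizer` also strips punctuation.

```python
from collections import Counter

toy_texts = ["good movie", "bad movie really bad"]
toy_tokens = [t.lower().split() for t in toy_texts]

# most frequent word gets index 1; index 0 is reserved for padding
word_counts = Counter(w for tokens in toy_tokens for w in tokens)
word_index = {w: i + 1 for i, (w, _) in enumerate(word_counts.most_common())}

# replace each word by its integer index
toy_sequences = [[word_index[w] for w in tokens] for tokens in toy_tokens]

def pad_pre(seq, maxlen):
    """Left-pad with zeros (and keep only the last maxlen items) to reach maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_maxlen = max(len(s) for s in toy_sequences)
toy_padded = [pad_pre(s, toy_maxlen) for s in toy_sequences]

print(word_index)
print(toy_padded)
```

After padding, every row has the same length, so the list can be converted to a 2D array and fed to an `Embedding` layer.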

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
tokenizer = Tokenizer(num_words=20000)

docs = df['quote']
tokenizer.fit_on_texts(docs)
sequences = tokenizer.texts_to_sequences(docs)

- check max word index
- check max sequence length

In [None]:
max_features = max([max(seq) for seq in sequences if len(seq) > 0]) + 1
max_features

In [None]:
maxlen = max([len(seq) for seq in sequences])
maxlen

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
X = pad_sequences(sequences, maxlen=maxlen)

### Train / Test split on sequences

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

### Exercise 8: Build recurrent neural network model
- use what you have learned to build a recurrent model that classifies the sentiment

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Embedding
from tensorflow.keras.layers import LSTM, GRU

In [None]:
model = Sequential()
model.add(Embedding(input_dim=max_features,
                    output_dim=32,
                    input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
h = model.fit(X_train, y_train, batch_size=64, epochs=4, validation_split=0.1)

In [None]:
model.evaluate(X_train, y_train, batch_size=32)


In [None]:
loss, acc = model.evaluate(X_test, y_test, batch_size=32)
acc

In [None]:
pd.DataFrame(h.history).plot(ylim=(-0.05, 1.05))

### Exercise 9

- Try changing the network architecture and re-train the model at each change. Can you avoid overfitting?
    - change the number of nodes in the LSTM layer
    - change the output dimension of the Embedding layer
    - add dropout and recurrent dropout to the LSTM
    - add a second LSTM layer
    - add kernel regularizers

In [None]:
from tensorflow.keras import regularizers

model = Sequential()
model.add(Embedding(input_dim=max_features,
                    output_dim=32,
                    input_length=maxlen))
model.add(LSTM(16, return_sequences=True, dropout=0.1))
model.add(LSTM(8, activity_regularizer=regularizers.l2(0.01), kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.summary()

In [None]:
from tensorflow.keras import callbacks
h = model.fit(X_train, y_train, batch_size=64, epochs=4, validation_split=0.1, callbacks=[callbacks.EarlyStopping()])

In [None]:
model.evaluate(X_train, y_train, batch_size=64)

In [None]:
loss, acc = model.evaluate(X_test, y_test, batch_size=64)
acc