Following along with the popular kernel notebook: https://www.kaggle.com/artgor/movie-review-sentiment-analysis-eda-and-models

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from nltk.tokenize import TweetTokenizer
import datetime
import lightgbm as lgb
from scipy import stats
from scipy.sparse import hstack, csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score
from wordcloud import WordCloud
from collections import Counter
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
pd.set_option('max_colwidth',400)

In [None]:
train = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/train.tsv', sep="\t")
test = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/test.tsv', sep="\t")
sub = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/sampleSubmission.csv', sep=",")

In [None]:
train.head(10)

# Definition of features: checking the variable definitions

From the [data](https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data) page:

```
The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset.

The train/test split has been preserved for the purposes of benchmarking, 
but the sentences have been shuffled from their original order. 

Each Sentence has been parsed into many phrases by the Stanford parser. 
Each phrase has a PhraseId. 
Each sentence has a SentenceId. 
Phrases that are repeated (such as short/common words) are only included once in the data.

train.tsv contains the phrases and their associated sentiment labels. 
We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
   
test.tsv contains just phrases. 
You must assign a sentiment label to each phrase.

The sentiment labels are:
0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
```

That is, the data is organized as follows:
```
whole document
|
└ sentence <--> SentenceId
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
|
└ sentence <--> SentenceId
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
|
└ sentence <--> SentenceId
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
|   └ phrase <--> PhraseId, Phrase, Sentiment
...
```
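
The hierarchy can be tried out on a toy table. The rows below are invented for illustration (they are not taken from train.tsv); this is only a minimal sketch of how one SentenceId ties several phrases together.

```python
from collections import defaultdict

# Toy rows in the train.tsv column order: PhraseId, SentenceId, Phrase, Sentiment.
# The values are invented for illustration only.
rows = [
    (1, 1, "A charming film", 3),
    (2, 1, "A charming", 3),
    (3, 1, "film", 2),
    (4, 2, "dull and slow", 1),
    (5, 2, "dull", 1),
]

# Group the phrases under the sentence they were parsed from.
phrases_by_sentence = defaultdict(list)
for phrase_id, sentence_id, phrase, sentiment in rows:
    phrases_by_sentence[sentence_id].append((phrase_id, phrase, sentiment))

for sentence_id, phrases in phrases_by_sentence.items():
    print(sentence_id, phrases)
```

Each phrase carries its own sentiment label, so two phrases from the same sentence can be labeled differently.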

In [None]:
train.loc[train.SentenceId == 10]

In [None]:
# Average count of phrases per sentence in train is:
train.groupby('SentenceId')['Phrase'].count().mean()

In [None]:
# Average count of phrases per sentence in test is:
test.groupby('SentenceId')['Phrase'].count().mean()

In [None]:
# Number of phrases in train:
train.shape[0]

In [None]:
# Number of sentences in train:
len(train.SentenceId.unique())

In [None]:
# Number of phrases in test:
test.shape[0]

In [None]:
# Number of sentences in test:
len(test.SentenceId.unique())

In [None]:
# Average number of words per phrase in train:
train.Phrase.apply(lambda x: x.count(" ") + 1).mean()
# or train.Phrase.apply(lambda x: len(x.split())).mean()

In [None]:
# Average number of words per phrase in test:
test.Phrase.apply(lambda x : len(x.split())).mean()

# Most common trigrams for positive phrases

### Concatenate phrases

In [None]:
text = ' '.join(train.loc[train.Sentiment == 4, 'Phrase'].values)
text

In [None]:
ngrams(text.split(), 3)

In [None]:
list(ngrams(text.split(), 3))[:5]

### Most Common Trigrams

In [None]:
text_trigrams = list(ngrams(text.split(), 3))
text_trigrams

In [None]:
Counter(text_trigrams)

In [None]:
Counter(text_trigrams).most_common(10)
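
Joining all phrases into one long string before calling `ngrams` also yields trigrams that straddle the boundary between two neighbouring phrases. As a possible alternative, the sketch below counts trigrams inside each phrase separately, using only the standard library. The helper `phrase_trigrams` and the example phrases are made up for this illustration.

```python
from collections import Counter

def phrase_trigrams(phrase):
    """Return the word trigrams of one phrase (empty if it has fewer than 3 words)."""
    words = phrase.split()
    return list(zip(words, words[1:], words[2:]))

# Invented phrases standing in for train.loc[train.Sentiment == 4, 'Phrase']
phrases = ["one of the best", "one of the", "the best"]

counts = Counter()
for phrase in phrases:
    counts.update(phrase_trigrams(phrase))

print(counts.most_common(2))
```

With this variant a trigram such as `('the', 'one', 'of')`, which exists only because two phrases were glued together, is never counted.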

### excluding the stop words

In [None]:
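stop_words = set(stopwords.words('english'))  # build the set once; calling stopwords.words() for every token is very slow
text_ = [i for i in text.split() if i not in stop_words]
text_trigrams = list(ngrams(text_, 3))
Counter(text_trigrams).most_common(10)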

In [None]:
stopwords.words('english')

> The results show the main problem with this dataset: there are too many common words because sentences are split into phrases. As a result, stopwords shouldn't be removed from the text.

>>
So, we have only phrases as data, and a phrase can consist of a single word. A single punctuation mark can cause a phrase to receive a different sentiment, and the assigned sentiments can be strange. This means several things:
> -    removing stopwords can be a bad idea, especially when a phrase consists of a single stopword;
> -    punctuation could be important, so it should be kept;
> -    ngrams are necessary to get the most information from the data;
> -    features like word count or sentence length won't be useful.

or, in short:
- **Good ideas to focus on:**
    - punctuation
    - ngrams
- **Not good ideas to focus on:**
    - stopword removal
    - word counts, sentence lengths
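
To make the stopword point concrete, here is a minimal sketch of what stopword removal does to very short phrases. The tiny `stop_words` set and the phrases are invented for the example; the notebook itself uses `stopwords.words('english')` from nltk.

```python
# Hypothetical mini stopword list (an assumption for this sketch).
stop_words = {"the", "a", "is", "of", "and"}

# Invented phrases, including single-word ones like those found in the dataset.
phrases = ["the", "is", "a charming film", "!"]

def remove_stopwords(phrase):
    """Drop every whitespace-separated token that is a stopword."""
    return " ".join(w for w in phrase.split() if w.lower() not in stop_words)

cleaned = [remove_stopwords(p) for p in phrases]
print(cleaned)
```

The single-word phrases become empty strings, so a model would have nothing left to learn from them, while the lone punctuation mark survives as a usable token.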


### Introducing Tokenizer

In [None]:
tokenizer = TweetTokenizer()

In [None]:
tokenizer

In [None]:
tokenizer.tokenize

In [None]:
tokenizer.tokenize('Hello world, Mr. Smith! I am new to this field.')

### Introducing Vectorizer with TweetTokenizer

In [None]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), tokenizer=tokenizer.tokenize)
vectorizer

In [None]:
full_text = list(train.Phrase.values) + list(test.Phrase.values)
full_text

# QUESTION: Is it OK to fit the vectorizer on the test phrases as well?
# Does this preprocessing count as being outside the training?

In [None]:
vectorizer.fit(full_text)

In [None]:
train_vectorized = vectorizer.transform(train.Phrase)
test_vectorized = vectorizer.transform(test.Phrase)
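
Regarding the question of whether the vectorizer should see test phrases: the stricter option is to fit on train only and merely transform test. The sketch below shows what that looks like on invented phrases, with a default `TfidfVectorizer` (unigrams, default tokenizer) rather than the notebook's settings, to keep the toy small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented phrases standing in for train.Phrase and test.Phrase.
train_phrases = ["a charming film", "dull and slow"]
test_phrases = ["a charming disaster"]

vec = TfidfVectorizer()          # defaults only; an assumption for this sketch
vec.fit(train_phrases)           # vocabulary and IDF weights come from train only
X_test = vec.transform(test_phrases)

print(sorted(vec.vocabulary_))   # words learned from train
print(X_test.shape, X_test.nnz)  # unseen test words get no column
```

Words that appear only in test ("disaster" here) are silently ignored at transform time. Fitting on train plus test, as the notebook does, uses no labels, but it does let the vocabulary and IDF weights depend on the test phrases.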

In [None]:
y = train.Sentiment

### Approach 1. Logistic Regression

In [None]:
logreg = LogisticRegression()
ovr = OneVsRestClassifier(logreg) # LEARN

In [None]:
%%time
ovr.fit(train_vectorized, y)

### Scoring

In [None]:
scores = cross_val_score(ovr, train_vectorized, y, scoring='accuracy', n_jobs=-1, cv=3)
scores

In [None]:
# Cross-validation mean accuracy
scores.mean(), scores.std()

In [None]:
print('accuracy: {0:.2f}%, std: {1:.2f}%pt'.format(scores.mean() * 100, scores.std() * 100))

### Approach 2. Linear SVC

In [None]:
%%time
svc = LinearSVC(dual=False)
svc.fit(train_vectorized, y)
scores = cross_val_score(svc, train_vectorized, y, scoring='accuracy', n_jobs=-1, cv=3)
print('CV mean accuracy: {0:.2f}%, std: {1:.2f}%pt'.format(scores.mean()*100, scores.std()*100))

### Approach 3. Deep Learning

>>
And now let's try deep learning (DL). A multi-layer DL model should work better for text classification. I use an architecture similar to those used in the toxic comment classification competition.

In [None]:
# LEARN: toxic competition

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Conv1D, GRU, CuDNNGRU, CuDNNLSTM, BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPool1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
from keras.models import Model, load_model
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras import backend as K
from keras.engine import InputSpec, Layer
from keras.optimizers import Adam

from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

### Tokenize (keras way)

In [None]:
tk = Tokenizer(lower=True, filters='')
tk.fit_on_texts(full_text)
train_tokenized = tk.texts_to_sequences(train.Phrase)
test_tokenized = tk.texts_to_sequences(test.Phrase)

# NOTE: tokenizing converts each word into an integer index (1 = most frequent word); since num_words is not set, every word is kept, not only the frequent ones
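
As a rough mental model of what `fit_on_texts` and `texts_to_sequences` do, here is a simplified imitation in plain Python. It is not Keras's implementation, only a sketch under the assumption that words are lowercased and split on whitespace (which is what `lower=True, filters=''` amounts to). The example texts are invented.

```python
from collections import Counter

texts = ["a good film", "a bad film", "good"]  # invented examples

# Count words across all texts (lowercased, split on whitespace only).
counts = Counter(w for t in texts for w in t.lower().split())

# The most frequent word gets index 1; index 0 is left free for padding.
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

# Replace each word with its integer index.
sequences = [[word_index[w] for w in t.lower().split()] for t in texts]

print(word_index)
print(sequences)
```

The sequences have different lengths, which is why a padding step is needed before they can be stacked into one array.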

In [None]:
tk

In [None]:
full_text[:10]

In [None]:
len(train.Phrase)

In [None]:
len(train_tokenized), train_tokenized
# NOTE: the number of rows in train_tokenized equals that of train.Phrase

In [None]:
max_len = 50
X_train = pad_sequences(train_tokenized, maxlen=max_len) # NOTE: by default each row is right-aligned: zeros are padded on the left, and rows longer than maxlen are truncated from the left
X_test = pad_sequences(test_tokenized, maxlen=max_len)
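
The default behaviour of `pad_sequences` (`padding='pre'`, `truncating='pre'`) can be mimicked in a few lines of plain Python. `pad_pre` is a hypothetical helper written for this illustration, not part of Keras.

```python
def pad_pre(seq, maxlen):
    """Left-pad with zeros; if seq is too long, keep only its last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

print(pad_pre([5, 3], 4))           # short sequence: zeros are added on the left
print(pad_pre([1, 2, 3, 4, 5], 4))  # long sequence: the first item is dropped
```

Because index 0 is never assigned to a real word by the tokenizer, the padding zeros cannot be confused with vocabulary entries.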

In [None]:
X_train[:10]

### Create embeddings matrix for each word

In [None]:
embedding_path = "../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec"

# NOTE: text file with about 2M rows; each row holds a word followed by
# its 300-dimensional embedding vector

# LEARN: FastText, from where this embeddings data comes

In [None]:
embed_size = 300
max_features = 30000  # an int, because it is later used to size the embedding matrix (3e4 would be a float)

In [None]:
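from itertools import islice
list(islice(open(embedding_path), 3))  # peek at the first rows only; reading all ~2M lines into a list is slow and memory-hungry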

In [None]:
from itertools import islice
list(o.split()[0] for o in islice(open(embedding_path), 10))  # first token (the word) of the first 10 rows

In [None]:
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')
# NOTE: 'word, *arr' corresponds to each text row in the embedding file
# this function converts only the array part into numpy array
# in preparation for creating dictionary
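
To see `get_coefs` at work without loading the real file, the sketch below parses a tiny in-memory stand-in. The file content is invented (3 dimensions instead of 300), and the first line is a header with the row count and dimension. The sketch assumes the real `.vec` file has such a header too and skips it.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# Tiny stand-in for the .vec file: header line, then "word v1 v2 v3" per row.
fake_file = io.StringIO("2 3\nfilm 0.1 0.2 0.3\ngood -0.5 0.0 0.5\n")
next(fake_file)  # skip the header line (assumed to be present)

# Same dict-building pattern as in the notebook, on the toy data.
toy_index = dict(get_coefs(*line.rstrip().split(" ")) for line in fake_file)

print(toy_index["good"])
```

Splitting on a single space yields the word as the first item and the numbers as strings, which `np.asarray(..., dtype='float32')` converts to floats.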

In [None]:
%%time
embedding_index = dict(get_coefs(*o.strip().split(" ")) for o in open(embedding_path))

In [None]:
embedding_index # dict mapping each word in the FastText file to its embedding vector (not yet restricted to the words in this dataset)

In [None]:
word_index = tk.word_index
word_index

In [None]:
nb_words = min(max_features, len(word_index)) # NOTE: 'nb' means 'number'
embedding_matrix = np.zeros((nb_words + 1, embed_size)) 
embedding_matrix

In [None]:
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
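
The same loop on toy data shows how the matrix ends up laid out. All names and values below are invented for the illustration: row 0 stays zero because it belongs to the padding index, and a word without a pretrained vector also keeps a zero row.

```python
import numpy as np

embed_size = 3  # toy dimension; the notebook uses 300
word_index = {"film": 1, "good": 2, "zzzunknown": 3}  # as produced by a tokenizer
embedding_index = {  # pretrained vectors; "zzzunknown" is deliberately missing
    "film": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "good": np.array([-0.5, 0.0, 0.5], dtype="float32"),
}

# One row per word index, plus row 0 for the padding value.
embedding_matrix = np.zeros((len(word_index) + 1, embed_size))
for word, i in word_index.items():
    vector = embedding_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

print(embedding_matrix)
```

The row number is the tokenizer's integer code, so the Embedding layer can look up a word's vector directly from the padded sequences.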

In [None]:
embedding_matrix # array of embeddings for most frequent words in full_text

### One Hot Encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)
y_ohe = ohe.fit_transform(y.values.reshape(-1, 1)) 
# NOTE: fit/transform done together for one-hot encoding
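
For integer labels in a known range, the same encoding can be sketched with NumPy alone. This is an alternative illustration, not what the notebook runs, and the labels are invented.

```python
import numpy as np

y_toy = np.array([0, 4, 2, 2])   # invented sentiment labels in 0..4
y_onehot = np.eye(5)[y_toy]      # row i of the identity matrix encodes class i
recovered = y_onehot.argmax(axis=1)

print(y_onehot)
print(recovered)
```

One difference worth noting: `np.eye(5)` fixes the number of classes at 5, whereas `OneHotEncoder` infers the categories from the data it is fitted on.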

In [None]:
np.reshape?

In [None]:
y.values.reshape?

In [None]:
y.values.reshape(-1, 1)

In [None]:
y_ohe # one-hot encoded version of labels

In [None]:
def build_model1(lr=0.0, lr_d=0.0, units=0, spatial_dr=0.0, 
                 kernel_size1=3, kernel_size2=2, dense_units=128, dr=0.1, conv_size=32):
    file_path = 'best_model.hdf5'
    check_point = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, 
                                 save_best_only=True, mode='min')
    early_stop = EarlyStopping(monitor='val_loss', mode='min', patience=3)
    
    inp = Input(shape=(max_len, )) 
    
    
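    x = Embedding(embedding_matrix.shape[0], embed_size, weights=[embedding_matrix], trainable=False)(inp) # NOTE: the original hard-coded 19479, which must equal the number of rows in embedding_matrix (nb_words + 1: the vocabulary plus index 0 for padding)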
    x1 = SpatialDropout1D(spatial_dr)(x)
    
    x_gru = Bidirectional(CuDNNGRU(units, return_sequences=True))(x1)
    x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_gru)    
    avg_pool1_gru = GlobalAveragePooling1D()(x1)
    max_pool1_gru = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_gru)    
    avg_pool3_gru = GlobalAveragePooling1D()(x3)  # pool the kernel_size2 branch (x3), not x1
    max_pool3_gru = GlobalMaxPooling1D()(x3)
    
    x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences=True))(x1)
    
    x1 = Conv1D(conv_size, kernel_size=kernel_size1, padding='valid', kernel_initializer='he_uniform')(x_lstm)    
    avg_pool1_lstm = GlobalAveragePooling1D()(x1)
    max_pool1_lstm = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(conv_size, kernel_size=kernel_size2, padding='valid', kernel_initializer='he_uniform')(x_lstm)    
    avg_pool3_lstm = GlobalAveragePooling1D()(x3)  # pool the kernel_size2 branch (x3), not x1
    max_pool3_lstm = GlobalMaxPooling1D()(x3)
    
    
    x = concatenate([
        avg_pool1_gru, max_pool1_gru, avg_pool3_gru, max_pool3_gru,
        avg_pool1_lstm, max_pool1_lstm, avg_pool3_lstm, max_pool3_lstm,
    ])
    x = BatchNormalization()(x)
    x = Dense(dense_units, activation='relu')(x)
    x = Dropout(dr)(x)
    x = BatchNormalization()(x)
    x = Dense(int(dense_units / 2), activation='relu')(x)
    x = Dropout(dr)(x)
    x = Dense(5, activation='sigmoid')(x)
    
    model = Model(inputs=inp, outputs=x)
    model.compile(
        loss='binary_crossentropy', 
        optimizer=Adam(lr=lr, decay=lr_d), 
        metrics=['accuracy']
    )
    history = model.fit(X_train, y_ohe, 
        batch_size=128, epochs=20, validation_split=0.1, verbose=1,  # epochs=20 matches the "Epoch n/20" training log
        callbacks=[check_point, early_stop]
    ) # execute fitting
    model = load_model(file_path) # load best model obtained
    return model
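
The tensor sizes inside the model can be checked with a little arithmetic. The sketch assumes stride 1 (the Conv1D default) and uses the values passed to `build_model1` in this notebook. `conv_valid_length` is a helper invented for the illustration.

```python
def conv_valid_length(seq_len, kernel_size):
    """Output length of a stride-1 1D convolution with padding='valid'."""
    return seq_len - kernel_size + 1

max_len, conv_size = 50, 32

# GRU branch: the recurrent layer keeps all 50 time steps (return_sequences=True).
len_k3 = conv_valid_length(max_len, 3)
len_k2 = conv_valid_length(max_len, 2)

# Global pooling collapses the time axis, leaving conv_size features per pool.
n_pooled_outputs = 8
concat_features = n_pooled_outputs * conv_size

print(len_k3, len_k2, concat_features)
```

Whatever the sequence length after each convolution, every global pooling layer outputs `conv_size` values, so the concatenated vector fed to the dense layers has 8 × 32 features.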

### Build model 1

In [None]:
model1 = build_model1(lr=1e-3, lr_d=1e-10, units=64, spatial_dr=0.3, 
                      kernel_size1=3, kernel_size2=2, dense_units=32, dr=0.1, conv_size=32)

```
Train on 140454 samples, validate on 15606 samples
Epoch 1/20
140454/140454 [==============================] - 78s 556us/step - loss: 0.3517 - acc: 0.8377 - val_loss: 0.3242 - val_acc: 0.8485

Epoch 00001: val_loss improved from inf to 0.32423, saving model to best_model.hdf5
Epoch 2/20
140454/140454 [==============================] - 74s 524us/step - loss: 0.3103 - acc: 0.8585 - val_loss: 0.3098 - val_acc: 0.8547

Epoch 00002: val_loss improved from 0.32423 to 0.30981, saving model to best_model.hdf5
Epoch 3/20
140454/140454 [==============================] - 71s 507us/step - loss: 0.3004 - acc: 0.8627 - val_loss: 0.3084 - val_acc: 0.8581

Epoch 00003: val_loss improved from 0.30981 to 0.30841, saving model to best_model.hdf5
Epoch 4/20
140454/140454 [==============================] - 75s 531us/step - loss: 0.2924 - acc: 0.8669 - val_loss: 0.3162 - val_acc: 0.8562

Epoch 00004: val_loss did not improve from 0.30841
Epoch 5/20
140454/140454 [==============================] - 76s 545us/step - loss: 0.2858 - acc: 0.8697 - val_loss: 0.3043 - val_acc: 0.8584

Epoch 00005: val_loss improved from 0.30841 to 0.30430, saving model to best_model.hdf5
Epoch 6/20
140454/140454 [==============================] - 75s 537us/step - loss: 0.2800 - acc: 0.8725 - val_loss: 0.3053 - val_acc: 0.8574

Epoch 00006: val_loss did not improve from 0.30430
Epoch 7/20
140454/140454 [==============================] - 74s 524us/step - loss: 0.2757 - acc: 0.8741 - val_loss: 0.3031 - val_acc: 0.8579

Epoch 00007: val_loss improved from 0.30430 to 0.30311, saving model to best_model.hdf5
Epoch 8/20
140454/140454 [==============================] - 76s 541us/step - loss: 0.2719 - acc: 0.8764 - val_loss: 0.3013 - val_acc: 0.8610

Epoch 00008: val_loss improved from 0.30311 to 0.30131, saving model to best_model.hdf5
Epoch 9/20
140454/140454 [==============================] - 75s 531us/step - loss: 0.2687 - acc: 0.8783 - val_loss: 0.3037 - val_acc: 0.8597

Epoch 00009: val_loss did not improve from 0.30131
Epoch 10/20
140454/140454 [==============================] - 73s 523us/step - loss: 0.2658 - acc: 0.8795 - val_loss: 0.3026 - val_acc: 0.8611

Epoch 00010: val_loss did not improve from 0.30131
Epoch 11/20
140454/140454 [==============================] - 76s 544us/step - loss: 0.2636 - acc: 0.8809 - val_loss: 0.3044 - val_acc: 0.8588

Epoch 00011: val_loss did not improve from 0.30131
```
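
A note on reading the `acc` values in this log: with a 5-unit sigmoid output and `binary_crossentropy`, the accuracy Keras reports is, as far as I understand, a per-label binary accuracy (each of the 5 outputs compared against a 0.5 threshold), not the share of phrases whose predicted class is correct. The sketch below contrasts the two on invented numbers.

```python
import numpy as np

# Invented one-hot targets and sigmoid outputs for 2 phrases, 5 classes.
y_true = np.array([[0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0]])
y_prob = np.array([[0.1, 0.2, 0.40, 0.3, 0.1],   # argmax is 2, true class is 2
                   [0.1, 0.1, 0.45, 0.3, 0.1]])  # argmax is 2, true class is 3

# Per-label accuracy: threshold every output at 0.5 and compare element-wise.
binary_acc = np.mean((y_prob > 0.5) == y_true.astype(bool))

# Class accuracy: compare the most likely class with the true class.
class_acc = np.mean(y_prob.argmax(axis=1) == y_true.argmax(axis=1))

print(binary_acc, class_acc)
```

Since four of the five labels are always 0, the per-label figure is high even when the predicted class is wrong, so it should not be compared directly with the cross-validation accuracies of the linear models.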

In [None]:
pred1 = model1.predict(X_test, batch_size = 1024, verbose = 1)
pred1

In [None]:
pred = pred1

In [None]:
predictions = np.argmax(pred, axis=1)  # index of the highest-scoring class; argmax already returns integers
predictions

In [None]:
sub['Sentiment'] = predictions
sub.to_csv("blend.csv", index=False)