## DNN model for binary classification of movie reviews (positive/negative)
The main objective is to create a DNN model for binary classification of movie reviews (positive/negative) and to run predictions on the files in the test folder. Results should be in CSV format.
The first column is the file id; the second is the model's prediction.

#### The secondary objective is to create a model for movie rating prediction.
If you accomplish it, please add a third column to the resulting CSV file.
Please provide all the source code used to complete the task, as well as the resulting CSV file.
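
A minimal sketch of the expected result layout, using only the standard library. The task text does not specify a header row or column names, so `id`, `sentiment` and `rating` are assumptions, as are the example values.

```python
import csv
import io


def write_predictions(rows, include_rating=False):
    """Serialize (file_id, sentiment[, rating]) rows to CSV text."""
    # Column names are hypothetical; the task only fixes the column order.
    header = ["id", "sentiment"] + (["rating"] if include_rating else [])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Drop the rating value when only two columns are requested
        writer.writerow(row[: len(header)])
    return buffer.getvalue()


# Hypothetical predictions: (file id, predicted sentiment, predicted rating)
example_rows = [(9434, 0, 3), (8894, 1, 8)]
csv_text = write_predictions(example_rows, include_rating=True)
print(csv_text)
```

With `include_rating=False` the same rows produce the two-column file required by the main objective.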

In [None]:
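import sys
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Python 2 only: force UTF-8 as the default string encoding
if sys.version_info[0] == 2:
    reload(sys)  # `reload` is a built-in in Python 2
    sys.setdefaultencoding("utf-8")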

Dataset description
The dataset contains 2 folders:

1) train: folder with data you can use for training. It contains:
- Folders neg and pos. These folders contain text files with negative and positive movie reviews, respectively. A positive review has a rating >= 7, a negative one a rating <= 4. Files are named {id}.txt
- Files pos_rating.csv and neg_rating.csv. Both files contain raw movie ratings: the first column contains the file id, the second the rating value
- Unsup folder. This folder contains text files with movie reviews without any rating. You can use it or not, as you wish

2) test: folder containing text files named {id}.txt
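
The labelling rule from the description above can be sketched as a small function. The name `label_from_rating` is hypothetical, and returning `None` for ratings 5 and 6 is an assumption, since the description does not say what happens to them.

```python
def label_from_rating(rating):
    """Map a raw rating to a sentiment label: 1 = positive, 0 = negative."""
    if rating >= 7:
        return 1
    if rating <= 4:
        return 0
    # Ratings 5 and 6 are not covered by the description (assumption: unlabeled)
    return None


labels = [label_from_rating(r) for r in (1, 4, 5, 7, 10)]
print(labels)
```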


In [None]:
import pandas as pd

In [None]:
import os
positiveFiles = [x for x in os.listdir("movie_reviews/train/pos/") if x.endswith(".txt")]
negativeFiles = [x for x in os.listdir("movie_reviews/train/neg/") if x.endswith(".txt")]
testFiles = [x for x in os.listdir("movie_reviews/test/") if x.endswith(".txt")]

In [None]:
positiveReviews, negativeReviews, testReviews = [], [], []
for pfile in positiveFiles:
    with open("movie_reviews/train/pos/"+pfile, encoding="latin1") as f:
        positiveReviews.append(f.read())
for nfile in negativeFiles:
    with open("movie_reviews/train/neg/"+nfile, encoding="latin1") as f:
        negativeReviews.append(f.read())
for tfile in testFiles:
    with open("movie_reviews/test/"+tfile, encoding="latin1") as f:
        testReviews.append(f.read())

In [5]:
reviews = pd.concat([
    pd.DataFrame({"review":positiveReviews, "label":1, "file":positiveFiles}),
    pd.DataFrame({"review":negativeReviews, "label":0, "file":negativeFiles}),
    pd.DataFrame({"review":testReviews, "label":-1, "file":testFiles})
], ignore_index=True).sample(frac=1, random_state=1)
reviews.head()

Unnamed: 0,review,label,file
26247,"In my opinion, this is the best stand-up show ...",-1,14905.txt
35067,I would just like to point out (in addition to...,-1,3027.txt
34590,"Hmmm, not a patch on the original from Shaw Br...",-1,24849.txt
16668,"A pre-Nerd Robert Carradine, a pre-Automan Des...",0,75.txt
12196,Turning Isherwood's somewhat dark and utterly ...,1,2351.txt


In [6]:
reviews = reviews[["review", "label", "file"]].sample(frac=1, random_state=1)
train_data = reviews[reviews.label!=-1]
test_data = reviews[reviews.label==-1]

In [7]:
print(train_data.shape)
print(test_data.shape)

(25000, 3)
(25000, 3)


In [37]:
test_data.head()

Unnamed: 0,review,label,file,predicted_sentiment,id
29635,cute idea to have dionne warwick do the song v...,-1,9434.txt,0,9434
25884,"i don't have much to add to my summary, this f...",-1,8894.txt,1,8894
25558,"cinematography--compared to 'the wrestler,' a ...",-1,14694.txt,0,14694
46579,the duke is a very silly film--a dog becoming ...,-1,17584.txt,1,17584
47066,if your a hard core freddy fan then you might ...,-1,19963.txt,1,19963


In [8]:
print(train_data.review.iloc[0])

This movie is nothing more than Christian propaganda. It started off like a good sci-fi movie and then works a syrupy sweet Christian theme into the story which is totally unrelated. I had to turn it off half way through because I felt tricked into renting it. The catholic church has officially announced that aliens do NOT contradict belief in God.<br /><br />The movie is slightly entertaining despite this but the dialog is unbelievable, writing and acting is mostly rubbish and all in all, this movie is mostly a stinker to be avoided.<br /><br />There was obviously some research done into the phenomenon by the filmmakers, but then you quickly realize that it is only for the purpose to debunk and inject their own paranoid religious views into a valid interesting subject. If you are a zealous religious fanatic who believes in demons and angels , you will love this movie.


In [9]:
seed = 0

import random
import numpy as np
from tensorflow import set_random_seed

random.seed(seed)
np.random.seed(seed)
set_random_seed(seed)


from tensorflow.keras.preprocessing import sequence,text
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Embedding,LSTM,Conv1D,GlobalMaxPooling1D,Flatten,MaxPooling1D,GRU,SpatialDropout1D,Bidirectional
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.losses import categorical_crossentropy
from tensorflow.keras.optimizers import Adam
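
`set_random_seed` at the top level of `tensorflow` is the TensorFlow 1.x spelling. The sketch below is one way to seed the generators across versions; `seed_everything` is a hypothetical helper, and it simply skips NumPy or TensorFlow when they are not installed.

```python
import random


def seed_everything(seed):
    """Seed Python, NumPy and TensorFlow RNGs where they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import tensorflow as tf
    except ImportError:
        return
    # TensorFlow 2.x exposes tf.random.set_seed; 1.x uses tf.set_random_seed
    if hasattr(tf, "random") and hasattr(tf.random, "set_seed"):
        tf.random.set_seed(seed)
    else:
        tf.set_random_seed(seed)


seed_everything(0)
first_draw = random.random()
seed_everything(0)
second_draw = random.random()
```

Seeding does not by itself guarantee identical training runs, for example when GPU kernels are non-deterministic.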

In [10]:
train_data.review.str.len().max()

13704

In [11]:
train_data.review.str.len().mean()

1315.904

In [12]:
train_data.label.value_counts()

1    12500
0    12500
Name: label, dtype: int64

Formatting


In [13]:
def format_data(train, test, max_features, maxlen):
    """Tokenize and pad the reviews; one-hot encode the training labels."""
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.utils import to_categorical

    # Shuffle the training rows (unseeded, so the order differs between runs)
    train = train.sample(frac=1).reset_index(drop=True)
    # To lowercase (note: `test` is modified in place, so the caller sees lowercased reviews)
    train.review = train.review.apply(lambda x: x.lower())
    test.review = test.review.apply(lambda x: x.lower())

    X = train.review
    test_X = test.review
    # Labels to categorical (one column per class index from 0 to the largest label)
    Y = to_categorical(train.label.values)

    # Fit the tokenizer on the training text only
    tokenizer = Tokenizer(num_words=max_features)
    tokenizer.fit_on_texts(list(X))

    # Convert to integer sequences (format accepted by the network) and pad to maxlen
    X = tokenizer.texts_to_sequences(X)
    X = pad_sequences(X, maxlen=maxlen)
    test_X = tokenizer.texts_to_sequences(test_X)
    test_X = pad_sequences(test_X, maxlen=maxlen)

    return X, Y, test_X

In [85]:
maxlen = 500
max_features = 10000

X, Y, test_X = format_data(train_data, test_data, max_features, maxlen)

### Let's take a look at the data:

In [49]:
X, X.shape

(array([[   0,    0,    0, ...,   54,  577, 3235],
        [   0,    0,    0, ...,   28,   15,   22],
        [   0,    0,    0, ..., 2599,    7,    7],
        ...,
        [   0,    0,    0, ...,    2,   68,  221],
        [   0,    0,    0, ...,   19,   15,   22],
        [   0,    0,    0, ..., 1416,    8, 3664]], dtype=int32), (25000, 500))
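
The leading zeros in the array come from padding. The sketch below imitates, in plain Python, what the tokenizer and `pad_sequences` do with their default settings as I understand them: words are ranked by frequency starting at 1, ids at or above `num_words` are dropped, and sequences are truncated and padded at the front. The helper names and the toy texts are hypothetical; this is not the Keras implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequence(text, word_index, num_words):
    """Replace words by ids, dropping unknown words and ids >= num_words."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]


def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad with `value`."""
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated


texts = ["good movie", "bad movie", "good good plot"]
word_index = build_word_index(texts)
padded = [pad_pre(to_sequence(t, word_index, num_words=4), maxlen=3) for t in texts]
print(padded)
```

Id 0 is reserved for padding, which is why the `Embedding` layer can be given `mask_zero=True`.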

In [51]:
Y, Y.shape

(array([[0., 1.],
        [0., 1.],
        [0., 1.],
        ...,
        [0., 1.],
        [1., 0.],
        [1., 0.]], dtype=float32), (25000, 2))
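
Each row of `Y` is a one-hot vector: `[0., 1.]` stands for label 1 (positive) and `[1., 0.]` for label 0 (negative). A plain-Python sketch of that encoding follows; `one_hot` is a hypothetical stand-in for illustration, not the Keras function itself.

```python
def one_hot(labels, num_classes=None):
    """Encode integer labels as rows with a single 1.0 at the label's index."""
    if num_classes is None:
        num_classes = max(labels) + 1
    return [[1.0 if c == label else 0.0 for c in range(num_classes)] for label in labels]


encoded_labels = one_hot([1, 1, 0])
# The original label is the position of the 1.0 in each row
decoded_labels = [row.index(1.0) for row in encoded_labels]
print(encoded_labels, decoded_labels)
```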

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=seed)

In [18]:
model = Sequential()


model.add(Embedding(max_features, 500, mask_zero=True))
model.add(LSTM(64,dropout=0.3, recurrent_dropout=0.3,return_sequences=True))
model.add(LSTM(32,dropout=0.3, recurrent_dropout=0.3,return_sequences=False))
model.add(Dense(2, activation='softmax'))
model.compile(loss='binary_crossentropy',optimizer=Adam(lr=0.001),metrics=['accuracy'])
model.summary()

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 500)         5000000   
_________________________________________________________________
lstm (LSTM)                  (None, None, 64)          144640    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                12416     
_________________________________________________________________
dense (Dense)                (None, 2)                 66        
Total params: 5,157,122
Trainable params: 5,157,122
Non-trainable params: 0
_________________________________________________________________
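
The parameter counts in the summary can be checked by hand. The sketch below uses the usual formulas for these layer types (four gates per LSTM unit, each with input weights, recurrent weights and a bias); the function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One trainable vector of length `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


layer_counts = [
    embedding_params(10000, 500),  # embedding
    lstm_params(500, 64),          # lstm
    lstm_params(64, 32),           # lstm_1
    dense_params(32, 2),           # dense
]
total_params = sum(layer_counts)
print(layer_counts, total_params)
```

Almost all of the parameters sit in the embedding matrix, so `max_features` and the embedding width dominate the model size.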


In [19]:
epochs = 5
batch_size = 64

In [20]:
# Note: calling compile() again replaces the configuration set in the model definition cell
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=epochs, batch_size=batch_size, verbose=1)

Train on 18750 samples, validate on 6250 samples
Instructions for updating:
Use tf.cast instead.
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x13ca4a5f8>

In [48]:
test_X.shape

(25000, 500)

In [21]:
result = model.predict_classes(test_X, batch_size=batch_size, verbose=1)



In [52]:
result

array([0, 1, 0, ..., 0, 0, 0])
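
`predict_classes` returns the index of the largest softmax output per row. To my knowledge this method is no longer available on `Sequential` models in recent TensorFlow releases, where the same result is usually obtained by taking an argmax over the output of `model.predict`. A plain-Python sketch of that step, with hypothetical probabilities:

```python
def predict_class_ids(probabilities):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Hypothetical softmax outputs for three reviews: [P(negative), P(positive)]
example_probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
class_ids = predict_class_ids(example_probabilities)
print(class_ids)
```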

In [53]:
# Use item assignment: attribute assignment does not create a new DataFrame column
test_data['predicted_sentiment'] = result

In [54]:
test_data

Unnamed: 0,review,label,file,predicted_sentiment,id
29635,cute idea to have dionne warwick do the song v...,-1,9434.txt,0,9434
25884,"i don't have much to add to my summary, this f...",-1,8894.txt,1,8894
25558,"cinematography--compared to 'the wrestler,' a ...",-1,14694.txt,0,14694
46579,the duke is a very silly film--a dog becoming ...,-1,17584.txt,1,17584
47066,if your a hard core freddy fan then you might ...,-1,19963.txt,1,19963
47777,cinema's greatest period started in post-war e...,-1,18261.txt,1,18261
34318,i saw this in the market place at the cannes f...,-1,10732.txt,0,10732
46195,this is the single greatest movie i have ever ...,-1,21426.txt,1,21426
37590,i agree with the comments regarding the downwa...,-1,8021.txt,1,8021
41020,"way to go ace! you just made a chilling, gross...",-1,756.txt,0,756


In [71]:
test_rating = pd.read_csv("test_rating.csv")

In [72]:
test_rating['sentiment'] = test_rating['rating'].map(lambda rating: 0 if rating < 5 else 1)

In [73]:
test_rating.head()

Unnamed: 0,id,rating,sentiment
0,16909,4,0
1,4926,9,1
2,7979,8,1
3,14259,4,0
4,23960,1,0


In [74]:
test_data['id'] = test_data['file'].map(lambda file: file.split('.')[0])

In [75]:
test_data['id']=test_data['id'].astype(int)

In [76]:
test_rating.index

RangeIndex(start=0, stop=25000, step=1)

In [79]:
test_rating = test_rating.join(test_data.set_index('id'))

In [80]:
test_rating['sentiment_prediction_res'] =  test_rating.apply(lambda row: row.sentiment == row.predicted_sentiment, axis=1)

In [81]:
test_rating.head()

Unnamed: 0,id,rating,sentiment,review,label,file,predicted_sentiment,sentiment_prediction_res
0,16909,4,0,drug runner archie moses introduces his friend...,-1,0.txt,0,True
1,4926,9,1,before sunrise is romance for the slacker gene...,-1,1.txt,1,True
2,7979,8,1,"a three stooges short, this one featuring shem...",-1,2.txt,1,True
3,14259,4,0,i saw this film for one reason: the tagline is...,-1,3.txt,0,True
4,23960,1,0,a woman (sylvia kristel) seduces a 15 year old...,-1,4.txt,0,True


In [83]:
# Count the rows where the predicted sentiment matches the rating-derived sentiment
correct_ratings = test_rating.sentiment_prediction_res.sum()

In [84]:
print("Accuracy: {}".format(correct_ratings / len(test_rating)))

Accuracy: 0.8538


### See results in test_rating.csv

In [86]:
test_rating.to_csv('test_rating.csv')

# Part 2
The files pos_rating.csv and neg_rating.csv contain the raw movie ratings: the first column contains the file id, the second the rating value.

In [89]:
train_data.head()

Unnamed: 0,review,label,file
18005,This movie is nothing more than Christian prop...,0,10053.txt
13275,I found it a real task to sit through this fil...,0,1044.txt
16996,"The plot, character development, and gags in t...",0,8374.txt
57,This film is a brilliant retelling of Shakespe...,1,2572.txt
16237,I've given up trying to figure out what versio...,0,11822.txt


In [94]:
pos_ratings = pd.read_csv("movie_reviews/train/pos_rating.csv")

In [95]:
neg_ratings = pd.read_csv("movie_reviews/train/neg_rating.csv")

In [96]:
pos_ratings.head()

Unnamed: 0,id,rating
0,0,9
1,1,8
2,2,9
3,3,7
4,4,7


In [97]:
neg_ratings.head()

Unnamed: 0,id,rating
0,0,1
1,1,1
2,2,1
3,3,1
4,4,3


In [100]:
pos_ratings['review'] = positiveReviews
pos_ratings.head()


Unnamed: 0,id,rating,review
0,0,9,"In my opinion, this is a pretty good celebrity..."
1,1,8,Samuel Fuller's Pickup on South Street is anom...
2,2,9,"If you cannot enjoy a chick flick, stop right ..."
3,3,7,"I realize that alot of people hate this movie,..."
4,4,7,"""Sweeney Todd"" is in my opinion one of a few ""..."


In [101]:
neg_ratings['review'] = negativeReviews
neg_ratings.head()

Unnamed: 0,id,rating,review
0,0,1,If one wants to have a character in a movie ha...
1,1,1,I'm at a loss. This entire movie made absolute...
2,2,1,I got this in the DVD 10 pack CURSE OF THE DEA...
3,3,1,"Unlike many, I don't find the premise or theme..."
4,4,3,A woman in love with her husband (he's suicida...
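
Assigning the review lists directly assumes that `os.listdir` returned the files in the same order as the rows of the rating files, which is not guaranteed. A safer variant, sketched below with hypothetical data, looks each review up by the id parsed from its file name.

```python
def reviews_by_id(file_names, reviews):
    """Map the numeric id parsed from '{id}.txt' to the review text."""
    return {int(name.split(".")[0]): text for name, text in zip(file_names, reviews)}


# Hypothetical directory listing in arbitrary order, with matching texts
files = ["10.txt", "0.txt", "2.txt"]
texts = ["review ten", "review zero", "review two"]
# Hypothetical rows of a rating file
rating_rows = [{"id": 0, "rating": 9}, {"id": 2, "rating": 9}, {"id": 10, "rating": 7}]

lookup = reviews_by_id(files, texts)
aligned = [dict(row, review=lookup[row["id"]]) for row in rating_rows]
print(aligned)
```

With pandas, the same lookup could be applied with `pos_ratings['id'].map(lookup)`.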


In [109]:
rating_reviews = pd.concat([pos_ratings, neg_ratings])
rating_reviews.shape

(25000, 3)

In [110]:
rating_reviews.head()

Unnamed: 0,id,rating,review
0,0,9,"In my opinion, this is a pretty good celebrity..."
1,1,8,Samuel Fuller's Pickup on South Street is anom...
2,2,9,"If you cannot enjoy a chick flick, stop right ..."
3,3,7,"I realize that alot of people hate this movie,..."
4,4,7,"""Sweeney Todd"" is in my opinion one of a few ""..."


In [112]:
rating_reviews['label'] = rating_reviews['rating']
rating_reviews.head()

Unnamed: 0,id,rating,review,label
0,0,9,"In my opinion, this is a pretty good celebrity...",9
1,1,8,Samuel Fuller's Pickup on South Street is anom...,8
2,2,9,"If you cannot enjoy a chick flick, stop right ...",9
3,3,7,"I realize that alot of people hate this movie,...",7
4,4,7,"""Sweeney Todd"" is in my opinion one of a few ""...",7


In [113]:
rating_reviews_train = rating_reviews

In [115]:
rating_reviews_test = test_rating
rating_reviews_test.head()


Unnamed: 0,id,rating,sentiment,review,label,file,predicted_sentiment,sentiment_prediction_res
0,16909,4,0,drug runner archie moses introduces his friend...,-1,0.txt,0,True
1,4926,9,1,before sunrise is romance for the slacker gene...,-1,1.txt,1,True
2,7979,8,1,"a three stooges short, this one featuring shem...",-1,2.txt,1,True
3,14259,4,0,i saw this film for one reason: the tagline is...,-1,3.txt,0,True
4,23960,1,0,a woman (sylvia kristel) seduces a 15 year old...,-1,4.txt,0,True


In [138]:
maxlen = 400
max_features = 8000

X, Y, test_X = format_data(rating_reviews_train, rating_reviews_test, max_features, maxlen)

In [139]:
from sklearn.model_selection import train_test_split
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.25, random_state=seed)

In [140]:
model = Sequential()


model.add(Embedding(max_features, 300, mask_zero=True))
model.add(LSTM(32,dropout=0.2, recurrent_dropout=0.4,return_sequences=True))
model.add(LSTM(32,dropout=0.3, recurrent_dropout=0.3,return_sequences=False))
model.add(Dense(11, activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.002),metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, None, 300)         2400000   
_________________________________________________________________
lstm_10 (LSTM)               (None, None, 32)          42624     
_________________________________________________________________
lstm_11 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_5 (Dense)              (None, 11)                363       
Total params: 2,451,307
Trainable params: 2,451,307
Non-trainable params: 0
_________________________________________________________________


In [141]:
epochs = 5
batch_size = 32

In [142]:
Y_train.shape

(18750, 11)
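
The 11 columns appear because one-hot encoding allocates one column for every integer from 0 up to the largest label, here 10. If the training ratings only take the values allowed by the dataset description (<= 4 or >= 7) on a 1 to 10 scale, which is an assumption, columns 0, 5 and 6 are never used. One possible alternative is to remap the ratings to contiguous class ids first; the names below are hypothetical.

```python
# Assumption: training ratings only take these values (<= 4 or >= 7, scale 1-10)
RATING_VALUES = [1, 2, 3, 4, 7, 8, 9, 10]
RATING_TO_CLASS = {rating: idx for idx, rating in enumerate(RATING_VALUES)}
CLASS_TO_RATING = {idx: rating for rating, idx in RATING_TO_CLASS.items()}


def encode_ratings(ratings):
    """Map raw ratings to contiguous class ids 0..7."""
    return [RATING_TO_CLASS[r] for r in ratings]


def decode_classes(class_ids):
    """Map predicted class ids back to raw ratings."""
    return [CLASS_TO_RATING[c] for c in class_ids]


encoded = encode_ratings([9, 8, 1, 3, 7])
decoded = decode_classes(encoded)
print(encoded, decoded)
```

With this mapping the output layer would need 8 units instead of 11, and predictions must be decoded before they are written to the CSV file.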

In [None]:
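# Note: this second compile() replaces the Adam(lr=0.002) optimizer configured above with default Adam settings
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])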
model.fit(X_train, Y_train, validation_data=(X_val, Y_val), epochs=epochs, batch_size=batch_size, verbose=1)

Train on 18750 samples, validate on 6250 samples
Epoch 1/5