**Here I use a trainable word-embedding layer to create the word vectors. Word2Vec would be a better fit for this problem, but I wanted to work through the entire process manually to avoid logical errors on my side.
Since the assignment document states that "*The current problem for this assignment only looks at evaluating the essays based on Content; please feel free to ignore the Grammar and Flow modelling for now.*", I used a unidirectional LSTM to make the process faster.**

**The aim of this model is to evaluate each paragraph automatically, based on its content.**

# Importing the libraries

In [None]:
!pip install keras-tuner



In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout
import math
from keras_tuner.tuners import RandomSearch
from keras_tuner.tuners import Hyperband
from keras_tuner.engine.hyperparameters import HyperParameters
import warnings
warnings.simplefilter("ignore", UserWarning)

# Importing the dataset

In [None]:
#Importing the train dataset
train_dataset= pd.read_csv("/content/train.csv")

#importing the test dataset
test_dataset= pd.read_csv("/content/test.csv")

#importing the prompts
prompts= pd.read_csv("/content/all_prompts.csv")
prompts=prompts.iloc[:, 1:]

#separating the independent values
train_dataset=train_dataset.iloc[:, 1:]
test_dataset=test_dataset.iloc[:, 1:]

#here we store the dependent variable in y and drop the column
y=train_dataset.evaluator_rating.values
train_dataset=train_dataset.drop("evaluator_rating", axis=1)

#number of rows of training dataset
num_rows_train= len(y)

#number of rows of the test dataset
num_rows_test=len(test_dataset.promptId.values)


#Here I combine the independent values of the training and test datasets so that both share
#the same vocabulary for the word embeddings. The two datasets are separated again later
frame= [train_dataset, test_dataset]
dataset = pd.concat(frame)
dataset.reset_index(inplace=True)
total_rows= len(dataset.index.values)


#next I join the prompt question to the respective essays so
#as to preserve the context of both the question and answer
#(assigning the whole column at once avoids pandas' SettingWithCopyWarning)
prompt_lookup= dict(zip(prompts.promptId.values, prompts.prompt_question.values))
dataset["essay"]= dataset["promptId"].map(prompt_lookup)+" "+dataset["essay"]


In [None]:
train_dataset

Unnamed: 0,promptId,uniqueId,essay
0,1,1_323,"At present age, our education system is not go..."
1,1,1_238,I am agree the tightly defined curriculum of o...
2,1,1_212,I strongly agree with the statement that tight...
3,1,1_117,Our education system is nice quitely but i dis...
4,1,1_229,i am totally agree with the statement that tig...
...,...,...,...
1235,5,5_419,The entire world is in the race of producing a...
1236,5,5_420,The race in the development of weapons are pro...
1237,5,5_421,In an era where every second person hopes and ...
1238,5,5_422,INTRODUCTION :Since the beginning of the time ...


In [None]:
test_dataset

Unnamed: 0,promptId,uniqueId,essay
0,1,1_315,Curriculum has been adopted in many schools. T...
1,1,1_214,"I strongly agree with the statement , The tig..."
2,1,1_196,Imagination and creativity is the most importa...
3,1,1_178,In our eduction system leaves no room for imag...
4,1,1_201,"I will agree at some what extend, because if w..."
...,...,...,...
300,5,5_146,Earth is a creation of God and everything that...
301,5,5_65,production of arms and weapons in this present...
302,5,5_151,Race to become more powerful can destroy the e...
303,5,5_404,In its attempt to harness the power of the ato...


In [None]:
dataset

Unnamed: 0,index,promptId,uniqueId,essay
0,0,1,1_323,The tight curriculum of our education system l...
1,1,1,1_238,The tight curriculum of our education system l...
2,2,1,1_212,The tight curriculum of our education system l...
3,3,1,1_117,The tight curriculum of our education system l...
4,4,1,1_229,The tight curriculum of our education system l...
...,...,...,...,...
1540,300,5,5_146,"In the nuclear age, the production and develop..."
1541,301,5,5_65,"In the nuclear age, the production and develop..."
1542,302,5,5_151,"In the nuclear age, the production and develop..."
1543,303,5,5_404,"In the nuclear age, the production and develop..."


# Cleaning the paragraphs

In [None]:
#cleaning the dataset
import nltk
import re
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer  
lem = WordNetLemmatizer()

#building the stop word set once keeps the membership test fast
stop_words= set(stopwords.words('english'))

def cleaner(dataset):
    corpus=[]
    for essay in dataset["essay"]:
      #keep letters only, lowercase, drop stop words and lemmatize the rest
      review = re.sub('[^a-zA-Z]', ' ', essay)
      review = review.lower()
      review = review.split()
      review = [lem.lemmatize(word) for word in review if word not in stop_words]
      review = ' '.join(review)
      corpus.append(review)
    return corpus

#obtaining the final corpus
corpus= cleaner(dataset=dataset)


#creating a sorted vocabulary; its length is the size of the vocabulary
def vocabulary_array(corpus):
    vocabulary=set()
    for text in corpus:
      vocabulary.update(text.split())
    return sorted(vocabulary)

vocabulary= vocabulary_array(corpus=corpus)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
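
To make the effect of the cleaning step concrete, here is a minimal sketch applied to a single sentence. It is only an illustration: `clean_text`, `STOP_WORDS` and the sample sentence are hypothetical names and data, the stop-word set is a tiny stand-in for nltk's English list, and lemmatization is left out so the snippet runs without any downloads.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses nltk's full English list instead.
STOP_WORDS = {"the", "of", "is", "a", "and", "to", "in"}

def clean_text(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split on whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

sample = "The curriculum of 2021 is tight, and leaves no room!"
cleaned = clean_text(sample)
print(cleaned)  # curriculum tight leaves no room
```

Digits and punctuation disappear because every non-letter character is replaced by a space before the text is split into words.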


# Creating the word vectors

In [None]:
#making the embedded docs for the data
#note: one_hot hashes each word into the range [1, voc_size), so distinct words can share an index
voc_size= len(vocabulary)
onehot_repr=[one_hot(words,voc_size) for words in corpus]
sent_length= max(len(i) for i in onehot_repr)
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)

#as mentioned earlier, the embeddings of the training and test data are separated again
train_docs=embedded_docs[:num_rows_train]
test_docs=embedded_docs[num_rows_train:]
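
Keras `one_hot` assigns indices by hashing, so two different words can end up with the same integer. As a possible alternative, the sketch below builds an explicit word-to-index mapping from the vocabulary and pre-pads the sequences with zeros in plain Python. The helper names `build_index` and `encode_and_pad` and the toy corpus are assumptions made for this illustration, not part of the notebook.

```python
def build_index(corpus):
    """Map each distinct word to a unique integer, reserving 0 for padding."""
    vocabulary = sorted({word for doc in corpus for word in doc.split()})
    return {word: i for i, word in enumerate(vocabulary, start=1)}

def encode_and_pad(corpus, word_index):
    """Encode documents as integer lists and left-pad them with zeros."""
    encoded = [[word_index[w] for w in doc.split()] for doc in corpus]
    max_len = max(len(seq) for seq in encoded)
    return [[0] * (max_len - len(seq)) + seq for seq in encoded]

# Toy corpus standing in for the cleaned essays
toy_corpus = ["education system leaf room", "room imagination"]
word_index = build_index(toy_corpus)
padded = encode_and_pad(toy_corpus, word_index)
print(word_index)
print(padded)  # [[1, 5, 3, 4], [0, 0, 4, 2]]
```

With a mapping like this the largest index equals the vocabulary size, so an embedding layer would need `input_dim` set to `len(word_index) + 1`.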

# Using train test split
**This gives us two sets of data from the labelled essays: 60% for training and 40% for validation.**


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train_docs, y, test_size=0.4, random_state=42)

# Building our Model

In [None]:
#Hyperparameter tuning the model
def build_model_1(hp):
  model1 = Sequential()
  model1.add(Embedding(input_dim= voc_size, output_dim= 100, input_length=sent_length,))
  model1.add(LSTM(hp.Int('input_unit',min_value=32,max_value=512,step=32),return_sequences=True,))
  for i in range(hp.Int('n_layers', 1, 10)):
    model1.add(LSTM(hp.Int(f'lstm_{i}_units',min_value=32,max_value=512,step=32),return_sequences=True))
  model1.add(LSTM(hp.Int('lstm_last_unit',min_value=32,max_value=512,step=32)))
  model1.add(Dropout(hp.Float('Dropout_rate',min_value=0,max_value=0.5,step=0.1)))
  model1.add(Dense(1, activation=hp.Choice('dense_activation',values=['relu', 'sigmoid'],default='relu')))
  model1.compile(loss='mean_squared_error', optimizer='adam',metrics = ['mse'])
  return model1


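#note: objective='mse' ranks trials by training MSE; 'val_mse' would rank them by validation MSE
tuner1 = Hyperband(build_model_1,
                     objective='mse',
                     max_epochs=5,
                     factor=2,
                     directory='/content/',
                     project_name='automated_essay')

#validation_data is passed as a tuple, the form documented by Keras
tuner1.search(X_train, y_train, epochs=50, validation_data=(X_val, y_val))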

Trial 9 Complete [00h 01m 59s]
mse: 3.9139785766601562

Best mse So Far: 1.1089434623718262
Total elapsed time: 00h 16m 12s
INFO:tensorflow:Oracle triggered exit


In [None]:
# Get the optimal hyperparameters
best_hps1=tuner1.get_best_hyperparameters(num_trials=1)[0]
model1 = tuner1.hypermodel.build(best_hps1)
history1 = model1.fit(X_train, y_train, epochs=30)
#no validation data is passed to fit, so this is the training MSE per epoch
train_mse_per_epoch1 = history1.history['mse']
best_epoch1 = train_mse_per_epoch1.index(min(train_mse_per_epoch1)) + 1
hypermodel1 = tuner1.hypermodel.build(best_hps1)

# Retrain the model
hypermodel1.fit(X_train, y_train, epochs=best_epoch1)
hypermodel1.summary()

#hypermodel1.save("/content/models")
#from tensorflow import keras
#hypermodel1 = keras.models.load_model('/content/models')

#model evaluation
eval_result = hypermodel1.evaluate(X_val, y_val)
print("[loss, mse]:", eval_result)

Epoch 1/30
...
Epoch 30/30
Epoch 1/26
...
Epoch 26/26
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 340, 100)          1937700   
_________________________________________________________________
lstm (LSTM)                  (None, 340, 224)          291200 
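
The retraining step picks the epoch with the lowest value in the training history. The sketch below shows that selection on a made-up history dictionary and how the choice can differ when a validation metric is used instead. `best_epoch` and `toy_history` are hypothetical, and a `val_mse` entry is only present in a Keras history when validation data is passed to `fit`.

```python
def best_epoch(history, metric="val_mse"):
    """Return the 1-based epoch with the lowest value of `metric`."""
    values = history[metric]
    return values.index(min(values)) + 1

# Made-up numbers in the shape of a Keras History.history dict
toy_history = {
    "mse": [2.0, 1.5, 1.2, 1.0, 0.9],
    "val_mse": [2.1, 1.7, 1.6, 1.8, 1.9],
}
print(best_epoch(toy_history, "mse"))      # 5
print(best_epoch(toy_history, "val_mse"))  # 3
```

In this toy case the training MSE keeps falling until the last epoch, while the validation MSE is lowest at epoch 3, which is the usual sign that later epochs start to overfit.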

In [None]:
#now we make our predictions on the test dataset
pred= hypermodel1.predict(test_docs)
predictions=[i[0] for i in pred]

**To also assess Grammar and Flow, a bidirectional LSTM would be preferable; I tested this during development.**

In [None]:
#rounding each predicted score to one decimal place
prediction=[round(i, 1) for i in predictions]

In [None]:
prediction

In [None]:
test_dataset['predicted_score']=prediction
test_dataset

Unnamed: 0,promptId,uniqueId,essay,predicted_score
0,1,1_315,Curriculum has been adopted in many schools. T...,3.0
1,1,1_214,"I strongly agree with the statement , The tig...",3.2
2,1,1_196,Imagination and creativity is the most importa...,2.5
3,1,1_178,In our eduction system leaves no room for imag...,3.2
4,1,1_201,"I will agree at some what extend, because if w...",2.7
...,...,...,...,...
300,5,5_146,Earth is a creation of God and everything that...,2.2
301,5,5_65,production of arms and weapons in this present...,1.1
302,5,5_151,Race to become more powerful can destroy the e...,2.7
303,5,5_404,In its attempt to harness the power of the ato...,3.0


Writing the data with the predictions to the CSV file.

In [None]:
#to_csv opens the file in write mode, which already overwrites any existing content
test_dataset.to_csv('test.csv')