# Bag of Words and Word2Vec on CommonLit Readability Prize Competition
In this notebook, I will build two text representations of the excerpts, a Bag of Words and averaged word vectors from a Word2Vec model (trained here with gensim), and I will train multiple regression models on each.

Here are the basic imports.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from bs4 import BeautifulSoup # for removing HTML tags
import re # removing punctuation
import nltk # removing stop words
import gensim # word vectors (word2vec)
import cython # speeding up training


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Imports
First, I'll import the required data.

In [None]:
# reading in the train and test data
train_data = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test_data = pd.read_csv('../input/commonlitreadabilityprize/test.csv')
print(train_data.shape, test_data.shape)
print(train_data["excerpt"][0])
train_data.head()

In [None]:
# splitting data: the last 200 rows are held out for validation
# (not using train_test_split since it shuffles by default, which would break the index-based loops below)
n_val = 200
X_train = train_data["excerpt"].iloc[:-n_val]
X_val = train_data["excerpt"].iloc[-n_val:]
y_train = train_data["target"].iloc[:-n_val]
y_val = train_data["target"].iloc[-n_val:]

## Data Preprocessing for Bag of Words
Next, I'll preprocess the excerpts to get them ready for a Bag of Words (the same function is partially reused later for the Word2Vec model).
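
To illustrate the core cleaning step on its own, here is a minimal sketch that uses only the standard library. The sample sentence and the two-word stop list are made up for illustration; the actual preprocessing in this notebook uses NLTK's English stop word list plus stemming and lemmatization.

```python
import re

def simple_clean(text):
    """Toy cleaning step: keep letters and digits, lowercase, split into words."""
    # replace every character that is not a letter or digit with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # lowercase and split; split() also collapses repeated spaces and newlines
    return text.lower().split()

# hypothetical sample text, not taken from the competition data
sample = "The cat's 2 kittens\nran -- quickly!"
tokens = simple_clean(sample)

# hypothetical, tiny stop word list (NLTK's real list is much longer)
toy_stops = {"the", "s"}
filtered = [word for word in tokens if word not in toy_stops]

print(tokens)
print(filtered)
```

Note that the digit `2` survives, because the pattern only removes characters that are neither letters nor digits.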

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

def excerpt_to_wordlist(excerpt, remove_stop_words=False):
    '''Cleans an excerpt by removing HTML tags and punctuation and lowercasing it.
    Digits are kept. If remove_stop_words is True, stop words are also removed,
    the remaining words are stemmed and lemmatized, and a single space-joined
    string is returned; otherwise a list of words is returned.'''
    # remove HTML tags (naming the parser avoids BeautifulSoup's "no parser specified" warning)
    excerpt1 = BeautifulSoup(excerpt, "html.parser").get_text()
    # replace everything except letters and digits with spaces
    excerpt2 = re.sub("[^a-zA-Z0-9]", " ", excerpt1)
    # removing unnecessary newlines/spaces + lower case
    excerpt3 = excerpt2.lower().split()
    if remove_stop_words:
        # removing stop words
        stops = set(stopwords.words("english")) # faster search through set than list
        excerpt4 = [word for word in excerpt3 if word not in stops]
        # stemming, then lemmatization
        lemmatizer = WordNetLemmatizer()
        porterStemmer = PorterStemmer()
        excerpt5 = [lemmatizer.lemmatize(porterStemmer.stem(word)) for word in excerpt4]
        # return final excerpt (joined by spaces)
        return " ".join(excerpt5)
    else:
        return excerpt3

In [None]:
clean_train_excerpts = []
clean_valid_excerpts = []
clean_test_excerpts = []
for i in range(0, X_train.size):
    clean_train_excerpts.append(excerpt_to_wordlist(X_train[i], True))
    if (i+1)%500 == 0:
        print(f"{i+1} finished for train data")
for i in range(X_train.size, X_train.size+X_val.size):
    clean_valid_excerpts.append(excerpt_to_wordlist(X_val[i], True))
for i in range(0, test_data["excerpt"].size):
    clean_test_excerpts.append(excerpt_to_wordlist(test_data["excerpt"][i], True))

## Bag of Words Representation
Here, I will create a Bag of Words representation for the excerpts.
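
As a rough picture of what a Bag of Words is, the sketch below builds count vectors by hand with `collections.Counter`. It is not how `CountVectorizer` is implemented (tokenization and tie-breaking differ), and the three toy documents and the helper names `build_vocabulary` and `to_count_vector` are made up for illustration.

```python
from collections import Counter

def build_vocabulary(documents, max_features=None):
    """Map the most frequent words to column indices (alphabetical order)."""
    counts = Counter(word for doc in documents for word in doc.split())
    # most_common(None) returns every word, so max_features is optional
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

# hypothetical, already-cleaned documents
docs = ["cat sat mat", "dog sat log", "cat chase dog"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
unseen = to_count_vector("bird sat", vocab)

print(vocab)
print(vectors)
```

The vocabulary is learned from the training documents only; a word that first appears in validation or test text (such as `bird` here) simply contributes nothing to the vector.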

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# creating the vectorizer for creating the bag of words
# (no line-continuation backslashes are needed inside parentheses)
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

train_data_features = vectorizer.fit_transform(clean_train_excerpts)
train_data_features = train_data_features.toarray()
valid_data_features = vectorizer.transform(clean_valid_excerpts)
valid_data_features = valid_data_features.toarray()
test_data_features = vectorizer.transform(clean_test_excerpts)
test_data_features = test_data_features.toarray()

## Models for Bag of Words
Now, I will create the following models to test on the Bag of Words:
1. Random Forest Regressor
2. XGBoost Regressor

### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

modelbow1 = RandomForestRegressor(n_estimators=100, random_state=1)
modelbow1.fit(train_data_features, y_train)
val_predictions_bow1 = modelbow1.predict(valid_data_features)
print(mean_squared_error(y_val,val_predictions_bow1))

### XGBoost Regressor

In [None]:
from xgboost import XGBRegressor

modelbow2 = XGBRegressor(n_estimators=100, learning_rate=0.005, n_jobs=3, random_state=1)
modelbow2.fit(train_data_features, y_train)
val_predictions_bow2 = modelbow2.predict(valid_data_features)
print(mean_squared_error(y_val,val_predictions_bow2))

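Comparing the two validation MSE values printed above (lower is better), Random Forest did the best on the Bag of Words representation of the data. Keep in mind that XGBoost is run here with a very small learning rate (0.005) and only 100 trees, so it may simply be underfit; neither model has been tuned.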
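
The cells above print the mean squared error, while, to my knowledge, the competition leaderboard is scored with the root mean squared error (RMSE). The ranking of the models is the same either way, since RMSE is just the square root of MSE. Here is a minimal sketch of the relationship with made-up numbers; `mse` and `rmse` are helper names I introduce only for this example.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# hypothetical targets and predictions, not results from this notebook
toy_true = [0.0, -1.0, -2.0]
toy_pred = [0.5, -1.0, -1.5]

toy_mse = mse(toy_true, toy_pred)
toy_rmse = rmse(toy_true, toy_pred)
print(toy_mse, toy_rmse)
```

With the values from this notebook, the same conversion would be `np.sqrt(mean_squared_error(y_val, val_predictions_bow1))`.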

## Data Preprocessing for Word Embeddings
Next, I'll do the data preprocessing for word embeddings.

In [None]:
# Load punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Function to split excerpt into sentences
def excerpt_to_sentences(excerpt, tokenizer, remove_stopwords=False):
    '''Creates a clean text split into sentences'''
    # Using tokenizer to split paragraph into sentences
    sentences1 = tokenizer.tokenize(excerpt.strip())
    # go over each sentence
    sentences = []
    for sentence in sentences1:
        if len(sentence) > 0:
            # call excerpt_to_wordlist
            sentences.append(excerpt_to_wordlist(sentence, remove_stopwords))
    return sentences

In [None]:
sentences = []

# adding train sentences in a list for training the word2vec model (not removing stop words for better model training)
for i in range(0, X_train.size):
    sentences += excerpt_to_sentences(X_train[i], tokenizer)
    if (i+1)%500 == 0:
        print(f"{i+1} excerpts finished for train data")
    if (i+1) == X_train.size:
        print("All done")

## Word Embedding Model
Now, I will be using a Word2Vec model to create the word embeddings.

Word embeddings encode each word as a vector of numbers, placed so that words used in similar contexts end up close together. The vector effectively stores information about the word, which makes operations such as the following one possible:
> king - man + woman ≈ queen

This tends to work because the relation between a king and a man is very similar to the relation between a queen and a woman, so the two pairs differ by roughly the same offset.
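
To make the arithmetic concrete, here is a minimal sketch with hand-made 2-dimensional vectors (first coordinate roughly "royalty", second roughly "gender"). These numbers are invented for illustration and are not learned embeddings; `cosine_similarity` and `analogy` are helper names I made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical toy embeddings: [royalty, gender]
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

With trained gensim vectors, the analogous query is `most_similar(positive=["king", "woman"], negative=["man"])` on the `KeyedVectors` object, provided all three words are in the model's vocabulary. On a corpus as small as this one, the result may well not be "queen".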

First, I'll create the word embedding model.

In [None]:
# Import the built-in logging module; configure it for Word2Vec to create nice output messages (next three lines from Bag of Words meets Bags of Popcorn tutorial)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# Setting values for parameters
num_features = 300     # Word vector dimensionality                      
min_word_count = 20    # Minimum word count                        
num_workers = 4        # Number of threads to run in parallel
context = 20           # Context window size                                                                                    
downsampling = 0.0001  # Downsample setting for frequent words

# initializing and training the word2vec model
from gensim.models import word2vec
w2v_model = word2vec.Word2Vec(sentences, workers=num_workers, vector_size=num_features, min_count=min_word_count,
                              window=context, sample=downsampling)
# keep only the trained word vectors (KeyedVectors); the full model is not needed for lookups
wv = w2v_model.wv

## Creating Word Embeddings
Now, I'll create the word embeddings (average over the whole excerpt).
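
The idea of averaging can be sketched with plain Python lists. The toy vectors and the helper name `average_word_vectors` are made up for illustration; the notebook's own version works on NumPy arrays and the trained gensim vectors.

```python
def average_word_vectors(words, vectors, num_features):
    """Average the vectors of all words that are in the vocabulary."""
    total = [0.0] * num_features
    num_words = 0
    for word in words:
        if word in vectors:  # out-of-vocabulary words are skipped
            num_words += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if num_words == 0:
        # no known words: return the zero vector instead of dividing by zero
        return total
    return [t / num_words for t in total]

# hypothetical 2-dimensional embeddings
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

avg = average_word_vectors(["cat", "dog", "zebra"], toy_vectors, 2)
empty = average_word_vectors(["zebra"], toy_vectors, 2)
print(avg, empty)
```

Because `min_count` drops rare words from the Word2Vec vocabulary, an excerpt could in principle contain no known words at all, which is why the zero-word case deserves a guard.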

In [None]:
def excerpt_to_vector(excerpt, model, num_features):
    '''Function to average all word vectors in an excerpt'''
    # initializing an empty np array
    feature_vector = np.zeros((num_features,),dtype="float32")
    num_words = 0
    for word in excerpt:
        # key_to_index is a dict, so lookups are fast and no set has to be rebuilt on every call
        if word in model.key_to_index:
            num_words += 1
            feature_vector = np.add(feature_vector,model[word])
    # avoid dividing by zero (NaN features) when no word is in the vocabulary
    if num_words > 0:
        feature_vector = np.divide(feature_vector, num_words)
    return feature_vector

def get_feature_vectors(excerpts, model, num_features):
    '''Function to create feature vectors for all excerpts'''
    count = 0
    excerpt_vectors = np.zeros((len(excerpts),num_features),dtype="float32")
    for excerpt in excerpts:
        excerpt_vectors[count] = excerpt_to_vector(excerpt, model, num_features)
        count += 1
        if count%1000 == 0:
            print(f"{count} excerpts converted")
    return excerpt_vectors

In [None]:
clean_train_excerpts_vec = []
clean_valid_excerpts_vec = []
clean_test_excerpts_vec = []
for i in range(0, X_train.size):
    clean_train_excerpts_vec.append(excerpt_to_wordlist(X_train[i]))
    if (i+1)%500 == 0:
        print(f"{i+1} finished for train data")
for i in range(X_train.size, X_train.size+X_val.size):
    clean_valid_excerpts_vec.append(excerpt_to_wordlist(X_val[i]))
for i in range(0, test_data["excerpt"].size):
    clean_test_excerpts_vec.append(excerpt_to_wordlist(test_data["excerpt"][i]))

train_data_vectors = get_feature_vectors(clean_train_excerpts_vec, wv, num_features)
valid_data_vectors = get_feature_vectors(clean_valid_excerpts_vec, wv, num_features)
test_data_vectors = get_feature_vectors(clean_test_excerpts_vec, wv, num_features)

## Training Models
Now, I will train the following models for Word Embeddings and try to find the best one:
1. Random Forest Regressor
2. XGBoost Regressor

### Random Forest Regressor

In [None]:
modelvec1 = RandomForestRegressor(n_estimators=100, random_state=1)
modelvec1.fit(train_data_vectors, y_train)
val_predictions_vec1 = modelvec1.predict(valid_data_vectors)
print(mean_squared_error(y_val,val_predictions_vec1))

### XGBoost Regressor

In [None]:
modelvec2 = XGBRegressor(n_estimators=100, learning_rate=0.005, n_jobs=3, random_state=1)
modelvec2.fit(train_data_vectors, y_train)
val_predictions_vec2 = modelvec2.predict(valid_data_vectors)
print(mean_squared_error(y_val,val_predictions_vec2))

It seems like RandomForestRegressor also does the best for Word Embeddings. However, it does slightly better on the Bag of Words, so we will use the first model (Random Forest on the Bag of Words) for the final submission.

## Final Training

Here, we'll do the final training of the Bag of Words Random Forest on all of the labeled data (training and validation excerpts combined), and we will use this model for the final submission.

In [None]:
# creating final X and y
final_train_features = np.concatenate([train_data_features, valid_data_features], axis=0)
final_y_train = pd.concat([y_train, y_val], axis=0)

In [None]:
# final training
final_model = RandomForestRegressor(n_estimators=100, random_state=1)
final_model.fit(final_train_features, final_y_train)

# prediction and output to csv file
predictions = final_model.predict(test_data_features)
output = pd.DataFrame(data={"id":test_data["id"], "target": predictions})
output.to_csv("submission.csv", index=False)
print("Submission file created!")

## Thank you for reading! Any feedback on the notebook would be appreciated.