In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

Three files have been provided to us in this competition. Let's read them.

In [None]:
os.listdir("../input/commonlitreadabilityprize")

In [None]:
train = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test = pd.read_csv("../input/commonlitreadabilityprize/test.csv")
sub = pd.read_csv("../input/commonlitreadabilityprize/sample_submission.csv")

train.shape, test.shape, sub.shape

Test data seems to be hidden. Let's look at the train and test data.

In [None]:
train.head()

In [None]:
test.head()

Note that the target column is the ease-of-readability score, and standard_error is probably the corresponding standard error of the target measure across different rating scores. Now, let's look at the texts.

In [None]:
print("-"*25,"First example from train","-"*25)
print(train.excerpt[0])
print("-"*25,"First example from test","-"*25)
print(test.excerpt[0])

The texts look pretty clean so far. Let's look at the distribution of the target variable.

In [None]:
plt.hist(train.target, bins = 100)
plt.title("Histogram of target variable")
plt.show()
print(train.target.describe())
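
As a rough numerical complement to the histogram, sample skewness and excess kurtosis can indicate how close a variable is to a normal distribution (both are near zero for normal data). The sketch below is only an illustration: the helper name `skew_and_excess_kurtosis` is hypothetical, and the data is a synthetic stand-in rather than the competition's target column.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # variance
    m3 = np.mean(centered ** 3)  # third central moment
    m4 = np.mean(centered ** 4)  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Synthetic stand-in for train.target (NOT the competition data);
# the location and sample size are arbitrary assumptions.
rng = np.random.default_rng(0)
synthetic_target = rng.normal(loc=-1.0, scale=1.03, size=5000)

skewness, excess_kurtosis = skew_and_excess_kurtosis(synthetic_target)
print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

On the real data, the same helper could be called as skew_and_excess_kurtosis(train.target); values far from zero would suggest the distribution is less normal than the histogram makes it look.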

The target variable appears to be approximately normally distributed, ranging from -3.7 to 1.7 in the training data, with a standard deviation of 1.03. Let's also look at the distribution of the standard_error variable.

In [None]:
plt.hist(train.standard_error, bins = 100)
plt.title("Histogram of standard_error variable")
plt.show()
print(train.standard_error.describe())

At least the standard deviation of the standard_error variable is small: 0.03. A low standard error means that the rating systems mostly agreed on the ease-of-readability score, and a high standard error means that their ratings are scattered.

Note that there is one observation with zero standard error. This is probably either a data entry error, since the target is also zero, or every rating system produced the same score for the corresponding text.

In [None]:
train[train.standard_error == 0]

Now, let's look at the texts with the highest and lowest readability.

In [None]:
print("-"*25,"Text corresponding to highest target value in train","-"*25)
print(train.iloc[train.target.argmax(),:]["excerpt"],"\n")

print("-"*25,"Text corresponding to lowest target value in train","-"*25)
print(train.iloc[train.target.argmin(),:]["excerpt"])

Well, yes, I think we can see why the first text has the highest target value: it is easy to read, whereas the latter is more difficult to read and its ease-of-readability score is low. Features such as the number of words, the number of passive sentences, the number of complex sentences, and sentence length could be useful. However, what is difficult for me could be easy for someone else to read. Hence, let's look at the correlation between target and standard error.

In [None]:
plt.scatter(train.target, train.standard_error)
plt.title("Scatter plot between target and standard_error")
print("correlation between target and standard error is", np.corrcoef(train.target, train.standard_error)[1,0])

Well, although the standard error and target are not linearly related, there seems to be a dependency: standard errors are mostly high when the target values are either very high or very low. This means that the rating systems mostly disagreed when, on average, the texts were either very easy or very difficult to read.

Let's calculate the number of words in the data.
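
A U-shaped dependency like this is invisible to the Pearson correlation, but it shows up once the score is replaced by its distance from the centre. The sketch below illustrates this on synthetic data only; the variable names and the coefficients that generate the U shape are assumptions, not values from the competition data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins (NOT the competition data): the spread grows as the
# score moves away from the centre, which yields a U-shaped scatter plot.
score = rng.uniform(-3.0, 3.0, size=2000)
spread = 0.45 + 0.02 * score ** 2 + rng.normal(0.0, 0.01, size=2000)

# Plain Pearson correlation is close to zero for a symmetric U shape.
linear_corr = np.corrcoef(score, spread)[0, 1]

# Distance from the centre turns the U shape into a roughly monotone trend.
distance_from_centre = np.abs(score - score.mean())
distance_corr = np.corrcoef(distance_from_centre, spread)[0, 1]

print(f"corr(score, spread)              = {linear_corr:.3f}")
print(f"corr(|score - mean|, spread)     = {distance_corr:.3f}")
```

On the real data, the analogous check would correlate np.abs(train.target - train.target.mean()) with train.standard_error.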

In [None]:
train['word_count'] = train['excerpt'].apply(lambda x: len(str(x).split()))
test['word_count'] = test['excerpt'].apply(lambda x: len(str(x).split()))
print(train['word_count'].describe())
plt.hist(train.word_count, bins = 100)
plt.title("Distribution of number of words")
plt.show()

Is there a relationship between the number of words and the target variable? I do not see one.

In [None]:
plt.scatter(train.target, train.word_count)
plt.title("Scatter plot between target and word_count")

Now, let's try to engineer a few features from the texts. For feature engineering, we will use the textstat library. Note that this library is installed over the internet in the following code cell. However, we cannot access the internet while submitting to Kaggle in this competition, so we may need to add this library to the kernel as an external dataset. Using textstat, we can engineer a few features that could be related to text complexity. Read details about this library [here](https://github.com/shivam5992/textstat).

In [None]:
!pip install textstat
import textstat

For the sake of simplicity, we create only the following variables; however, the library also offers other indices and readability measures, such as the FOG index. We could create many more features from this library.

In [None]:
%%time

def feature_engineering(df):
    df['sentence_count'] = df['excerpt'].apply(lambda x: textstat.sentence_count(x))
    df['syllable_count'] = df['excerpt'].apply(lambda x: textstat.syllable_count(x,lang='en_US'))
    df['word_per_sentence'] = df.apply(lambda row: row.word_count/row.sentence_count, axis=1)
    df['syllable_per_sentence'] = df.apply(lambda row: row.syllable_count/row.sentence_count, axis=1)
    df['syllable_per_word'] = df.apply(lambda row: row.syllable_count/row.word_count, axis=1)

    df['flesch_reading_ease'] = df['excerpt'].apply(lambda x: textstat.flesch_reading_ease(x))
    df['automated_readability_index'] = df['excerpt'].apply(lambda x: textstat.automated_readability_index(x))
    df['linsear_write_formula'] = df['excerpt'].apply(lambda x: textstat.linsear_write_formula(x))
    
    return df

train = feature_engineering(train)
test = feature_engineering(test)
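
For intuition about what a score such as flesch_reading_ease captures, the standard Flesch Reading Ease formula combines words per sentence and syllables per word: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below is a simplified re-implementation, not textstat's code: the function names are hypothetical, and the vowel-group syllable counter is a crude assumption that will differ from textstat's counts.

```python
import re

def naive_syllable_count(word):
    """Rough heuristic: one syllable per group of consecutive vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease_sketch(text):
    """Flesch Reading Ease with naive sentence, word and syllable splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(naive_syllable_count(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Two made-up passages: short common words versus long abstract ones.
easy = "The cat sat on the mat. The dog ran to the park."
hard = ("Institutional heterogeneity complicates comparative evaluation "
        "of international educational methodologies.")

easy_score = flesch_reading_ease_sketch(easy)
hard_score = flesch_reading_ease_sketch(hard)
print(round(easy_score, 1), round(hard_score, 1))
```

Higher scores mean easier text, so the first passage should score well above the second. This is also why word_per_sentence and syllable_per_word are natural companion features.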

Now, let's look at the engineered features. We will also check how strongly these features correlate with our target variable.

In [None]:
cols = ["word_count", "sentence_count", "syllable_count" , 
        "word_per_sentence" , "syllable_per_sentence" , "syllable_per_word" ,
        "flesch_reading_ease", "automated_readability_index" , "linsear_write_formula" , "target" ]

temp = train[cols]
temp.describe()

In [None]:
temp.corr()

If we look at the rightmost column or the bottom row of the correlation matrix above, we see that the engineered features do have a strong correlation with the target. Now, let's draw scatter plots of these variables against the target variable.

In [None]:
columns = [c for c in cols if c not in ["word_count","target"]]

fig, ax = plt.subplots(1, 8, figsize = (30, 5))

for idx, col in enumerate(columns, 0):
    ax[idx].plot(train['target'], train[col], 'o')
    ax[idx].set_xlabel('target')
    ax[idx].set_title(col)

plt.show()

Now, let's do some basic text preprocessing, which is essential for text analysis. We start by removing stopwords. Stopwords are words that do not add much meaning to a sentence, so they can usually be ignored without sacrificing the meaning of the sentence. Examples are words like the, is, in, and for. We illustrate stopword removal using the first sample of our data.

In [None]:
from gensim.parsing.preprocessing import remove_stopwords
print("-"*10,"before removing stopwords","-"*10)
print(train.excerpt[0],"\n")
print("-"*10,"After removing stopwords","-"*10)
print(remove_stopwords(train.excerpt[0]))

As we can see, after removing stopwords, words like the and to are gone. Let's create a separate text column in the training data with these stopwords removed.

In [None]:
train['nostop_text'] = train['excerpt'].apply(lambda x: remove_stopwords(x))
train[['excerpt', 'nostop_text']].head()

One of the important and easy-to-use tools for text analysis is CountVectorizer from sklearn. It generates counts of words (or word combinations such as bigrams) from a collection of text documents. First, we put all the texts from the train data into a list, and then we apply the vectorizer. There are several parameters we can play with. Details can be checked [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). For our purpose, we use the following parameters:
1. lowercase = True --> convert all characters to lower case
2. ngram_range = (1,1) --> only consider unigrams
3. max_features = 10000 --> build the vocabulary from the 10000 most frequent words
4. min_df = 1 --> no words are ignored for appearing in too few documents
5. max_df = 0.8 --> ignore words that appear in more than 80% of the documents, which act as corpus-specific stopwords

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
train_docs = train["nostop_text"].tolist()
cv = CountVectorizer(lowercase = True, 
                     ngram_range = (1,1), 
                     max_features=10000, 
                     min_df = 1, 
                     max_df = 0.8) 
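
To see what max_df and max_features do in isolation, here is a toy pure-Python sketch of the two filters. It is not scikit-learn's implementation: the function name `build_vocabulary` is hypothetical, tokens are split on whitespace (a simplification of CountVectorizer's default tokenizer), and ties in frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

def build_vocabulary(docs, max_df=0.8, max_features=None):
    """Toy version of two CountVectorizer filters (not sklearn code)."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Term frequency across the whole corpus, used to rank tokens.
    term_freq = Counter(tok for toks in tokenized for tok in toks)
    # max_df: drop tokens present in more than this share of documents.
    max_docs = max_df * len(docs)
    kept = [tok for tok in term_freq if doc_freq[tok] <= max_docs]
    # max_features: keep only the most frequent remaining tokens.
    kept.sort(key=lambda tok: (-term_freq[tok], tok))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

toy_docs = [
    "the fox jumps over the fence",
    "the dog sleeps",
    "the fox and the dog play",
    "the bird sings",
]

vocab_all = build_vocabulary(toy_docs, max_df=1.0)
vocab_filtered = build_vocabulary(toy_docs, max_df=0.8)
vocab_top2 = build_vocabulary(toy_docs, max_df=0.8, max_features=2)
print(vocab_all)
print(vocab_filtered)
print(vocab_top2)
```

In this toy corpus, "the" occurs in every document, so max_df = 0.8 removes it even though it is not in any predefined stopword list.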

Now, we create a dataframe in which each column is a single word and each element is the count of that word in the corresponding row of the train data.

In [None]:
sparse_train = cv.fit_transform(train_docs)
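# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 onwards.
counts = pd.DataFrame(sparse_train.toarray(),
                      columns=cv.get_feature_names_out())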
print(counts.shape)
counts.head()

As expected, there are 10000 columns, since we set max_features to keep the 10000 most frequent words. However, some columns are purely numeric tokens. We may need to remove these numbers and prepare the dataframe again.

In [None]:
import re
train_docs_nonum = [re.sub(r'\d+', '', i) for i in train_docs]

cv = CountVectorizer(lowercase = True, 
                     ngram_range = (1,1), 
                     max_features=10000, 
                     min_df = 1, 
                     max_df = 0.8) 
sparse_train = cv.fit_transform(train_docs_nonum)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 onwards.
counts = pd.DataFrame(sparse_train.toarray(),
                      columns=cv.get_feature_names_out())
counts.head()

Well, now we see columns with only words. Note that there are 10000 columns corresponding to the 10000 most frequent words in the data. Now, let's see which words are the most frequent. We simply take the column sums of this data and sort them to find the most frequent words.

In [None]:
print(counts.sum().sort_values(ascending=False)[:20])
counts.sum().sort_values(ascending=False)[:20].plot.bar()

That was for the most frequent words, or unigrams. Let's also look at the most frequent bigrams (combinations of two consecutive words).

In [None]:
cv = CountVectorizer(lowercase = True, 
                     ngram_range = (2,2), ## only consider bigrams ##
                     max_features=10000,
                     min_df = 1, 
                     max_df = 0.8) 
sparse_train = cv.fit_transform(train_docs_nonum)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 onwards.
counts = pd.DataFrame(sparse_train.toarray(),
                      columns=cv.get_feature_names_out())

print(counts.sum().sort_values(ascending=False)[:20])
counts.sum().sort_values(ascending=False)[:20].plot.bar()

We can probably extract some features from this analysis. For example, one of the most frequent bigrams is "for example". We can create a count feature for this bigram, on the hypothesis that texts which give more examples tend to be easier to read. Who knows?

In this direction, another cool text analysis tool we can use is topic modeling, although we have to assume that the features generated from topic modeling generalize to both train and test data. Let's use the gensim library for topic modeling. We restrict the topics to unigrams only.
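
A phrase-count feature of this kind could be sketched as follows. The helper name `count_phrase` and the sample sentences are made up for illustration; whether such a feature actually helps is untested here.

```python
import re

def count_phrase(text, phrase="for example"):
    """Count case-insensitive, whole-word occurrences of `phrase` in `text`."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sample texts (NOT taken from the competition data).
samples = [
    "For example, a cat is a mammal. Dogs, for example, are mammals too.",
    "This passage gives no illustration at all.",
]

example_counts = [count_phrase(s) for s in samples]
print(example_counts)
```

On the dataframe this would look like train['excerpt'].apply(count_phrase), applied to whichever text column still contains the phrase.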

In [None]:
import gensim
from gensim.matutils  import Sparse2Corpus
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

In [None]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
cv = CountVectorizer(lowercase = True, ## setting all characters to lower case ##
                     ngram_range = (1,1), ## only consider unigrams ##
                     max_features=10000, ## creating the corpus using 10000 most frequent words ##
                     min_df = 1, ## no words are ignored for smaller number of appearance ##
                     max_df = 0.8) ## ignore words in more than 80% of documents (corpus specific stopwords) ##
sparse_train = cv.fit_transform(train_docs_nonum)
corpus_data_gensim = gensim.matutils.Sparse2Corpus(sparse_train, documents_columns=False)

vocabulary_gensim = {}
for key, val in cv.vocabulary_.items():
    vocabulary_gensim[val] = key
    
dict = Dictionary()
dict.merge_with(vocabulary_gensim)

Let's try 5 topics.

In [None]:
lda = LdaModel(corpus_data_gensim, num_topics = 5 )

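def document_to_lda_features(lda_model, document):
    # Use the model passed in as an argument rather than the global `lda`.
    topic_importances = lda_model.get_document_topics(document, minimum_probability=0)
    topic_importances = np.array(topic_importances)
    # Column 0 holds the topic ids, column 1 the topic probabilities.
    return topic_importances[:,1]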

lda_features = list(map(lambda doc:document_to_lda_features(lda, doc),corpus_data_gensim))

data_pd_lda_features = pd.DataFrame(lda_features)
data_pd_lda_features.columns = ["topic"+str(i) for i in range(5)]
data_pd_lda_features.head()
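
The function above assumes that every topic is returned for every document. Since get_document_topics discards topics below its probability threshold, a slightly more defensive variant fills a fixed-length vector from the (topic id, probability) pairs. This is only a sketch: the helper name `to_dense_topic_vector` is hypothetical and the example pairs are made up rather than produced by the model.

```python
import numpy as np

def to_dense_topic_vector(topic_pairs, num_topics):
    """Turn sparse (topic_id, probability) pairs into a fixed-length vector."""
    dense = np.zeros(num_topics, dtype=float)
    for topic_id, prob in topic_pairs:
        dense[int(topic_id)] = prob
    return dense

# Made-up output in the shape returned by get_document_topics,
# with topics 1, 2 and 4 missing.
pairs = [(0, 0.7), (3, 0.3)]
dense = to_dense_topic_vector(pairs, num_topics=5)
print(dense)
```

Missing topics simply stay at zero, so every document ends up with the same number of topic columns.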

Let's visualize the topics and see if we can make sense of any topic from its keywords. We also check the correlation of these topic variables with our target variable.

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { max-width:100% !important; }</style>"))
display(HTML("<style>.output_result { max-width:100% !important; }</style>"))
display(HTML("<style>.output_area { max-width:100% !important; }</style>"))
display(HTML("<style>.input_area { max-width:100% !important; }</style>"))
pyLDAvis.enable_notebook()
lda_viz = gensimvis.prepare(lda, corpus_data_gensim, dict)
lda_viz

In [None]:
for i in ["topic"+str(i) for i in range(5)]:
    print("correlation of ", i, "with target is", np.corrcoef(train.target, data_pd_lda_features[i])[1,0])

At first glance, it is difficult to tell what these topics are from the keywords. However, some topic features do seem to have some predictive potential. We need to be cautious about how we use these variables, though, as the topics may not generalize to the test data set.