# CommonLit Readability Challenge
***
Reading brings the joy of discovering new insights, but working out how to categorize reading materials can be a challenging process. CommonLit has given Kaggle competitors the opportunity to develop algorithms that help administrators, teachers, parents and students assign reading material at the appropriate skill level. Reading material should provide both enjoyment and challenge so that reading skills do not plateau. This project should encourage the development of Natural Language Processing techniques that can grade book excerpts and match each one to a reading level.

Let's begin!!

# 1. Import packages and Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Switch on setting to allow all outputs to be displayed
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Import the datasets
train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
test = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

# 1a. Exploratory Data Analysis (EDA)

In [None]:
# Perform EDA on the train
train.head()
train.shape
train.dtypes
train.describe(include='all')

Initial thoughts
***
* Extracting information from the "Excerpt" column will be key to this analysis
> * All of the values are unique
> * EDA required for text will help with future discoveries
* Target column shows a broad range of values
> * Distribution shows a larger proportion of negative values (plot a histogram and box plot to confirm)
> * There appears to be a negative skew with the mean value lower than the median (50th percentile)
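
A quick numeric check of the skew can complement the plots: compare the mean with the median and compute a skewness coefficient. The sketch below uses a small, made-up list of scores (not the competition data) and only the standard library. On the real data, `train['target'].skew()` would give a comparable, bias-adjusted pandas estimate.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population (Fisher-Pearson) skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Hypothetical target values with a long left tail (illustrative only)
scores = [-3.5, -2.0, -1.2, -1.0, -0.9, -0.8, -0.6, -0.3, 0.1, 0.4]

summary = {
    "mean": mean(scores),
    "median": median(scores),
    "skewness": skewness(scores),
}
print(summary)
```

A mean below the median together with a negative skewness coefficient is what a negatively skewed distribution would be expected to show.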

In [None]:
# Perform EDA on the test
test.head()
test.shape
test.dtypes
test.describe(include='all')

Initial thoughts of the test dataset
***
* Excerpt is the key variable
* The test set contains only seven rows, so validating on the training set will be key to optimising the model. Developing a hold-out sample from the training data could help
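
Because the visible test set is so small, the training rows have to supply the validation data. The sketch below shows one way to build k-fold splits from row positions using only the standard library. The function name `kfold_indices`, the row count and the seed are illustrative assumptions; in practice `sklearn.model_selection.KFold` is the usual tool for this.

```python
import random

def kfold_indices(n_rows, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs; every row is used for validation exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits roughly equal folds
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k, valid_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

# Illustrative row count only; in the notebook this would be len(train)
splits = list(kfold_indices(n_rows=20, n_splits=5))
for fold_no, (train_idx, valid_idx) in enumerate(splits):
    print(fold_no, len(train_idx), len(valid_idx))
```

Each pair of index lists could then be passed to `train.iloc[...]` to select the rows for fitting and for scoring.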

# 1b. Data Visualizations

In [None]:
# The histogram provides details on the distribution of the variable. Including the box plot shows key parameter summary values.
# By using plotly we are able to hover over the values and easily understand how the values compare
fig = px.histogram(train, x="target",
                   marginal="box")
fig.show()

In [None]:
# Perform quick analysis to review the distribution of the target variable
# The key variable is the excerpt, so have to extract as much information from this before running regression analysis
# AIM 1 : build a simple tokenization algorithm to create new features. then apply the different regression techniques and pipelines to help optimise
# the model build using sklearn

# Output file has to be called submission.csv

# 2. Data discovery

The key challenge is to understand what makes a text difficult to read. When reviewing how difficult a text is, there are a few key areas of interest:
* Word difficulty
> * Vocabulary lists : can be used to highlight the proportion of common words used. The less common a word is, the more difficult it is likely to be perceived as
> * Word length : longer words are usually seen as more difficult than short ones, so word length may correlate with text difficulty
* Sentence difficulty
> * Sentence length : longer sentences lead to more difficult text. Colons and semi-colons, as well as full stops, can affect how sentence length is measured
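
The three ideas above (common-word proportion, word length and sentence length) can be turned into numbers with a few lines of plain Python. This is a minimal sketch under stated assumptions: `COMMON_WORDS` is a tiny hypothetical stand-in for a real vocabulary list, words are found with a simple regular expression rather than a proper tokenizer, and colons and semi-colons are treated as sentence boundaries.

```python
import re

# Hypothetical, tiny "common word" list; a real analysis would load a published vocabulary list
COMMON_WORDS = {"the", "a", "and", "is", "was", "in", "on", "it", "to", "of", "cat", "sat", "mat"}

def readability_features(text, common_words=COMMON_WORDS):
    """Return simple word- and sentence-level difficulty proxies for one non-empty excerpt."""
    # Split on . ! ? ; and : so that colons and semi-colons also end a segment
    sentences = [s for s in re.split(r"[.!?;:]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),  # words per segment
        "common_word_ratio": sum(w in common_words for w in words) / len(words),
    }

easy = readability_features("The cat sat on the mat. It was a cat.")
hard = readability_features(
    "Paleontologists systematically excavate fossilised remains; "
    "interpretation requires considerable expertise."
)
print(easy)
print(hard)
```

On these two made-up texts the second one has longer words and no common words, which is the kind of signal these features are meant to capture.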

In [None]:
# Extract insights from the excerpt variable
import spacy

In [None]:
# Initialise spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Perform initial test on one excerpt
sample = train.head(5)
sample
sample1 = train.loc[0, 'excerpt']

In [None]:
# Create the spacy doc item for review
doc = nlp(sample1)
doc

In [None]:
# Reviewing the token, lemma, stop-word flag and length for each token (item)
print("Token\t\tLemma\t\tStopword\t\tLength")
print("-"*40)
# Review the first 20 values to test the output
for token in doc[:20]:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}\t\t{len(token)}")

## Review stop words

In [None]:
# A few different options for stopwords, spacy and nltk. Let's compare
import nltk
from nltk.corpus import stopwords

In [None]:
# Comparison of the stop words available
print(f"NLTK : {len(stopwords.words('english'))} \n {stopwords.words('english')}")
print(f"Spacy : {len(nlp.Defaults.stop_words)} \n {nlp.Defaults.stop_words}")

# Compare the differences
nltk_set = set(stopwords.words('english'))
spacy_set = set(nlp.Defaults.stop_words)

# Union - all values
union = nltk_set.union(spacy_set)
# Intersection - seen in both sets
inter = nltk_set.intersection(spacy_set)
print(f"Seen in both : {len(inter)} \n {inter}")
# Remainder - differences between sets
nltk_extra = nltk_set - inter
spacy_extra = spacy_set - inter
print(f"Extra NLTK : {len(nltk_extra)} \n {nltk_extra}")
print(f"Extra Spacy : {len(spacy_extra)} \n {spacy_extra}")

spaCy appears to cover a wider range of stopwords. Adding the 56 extra words from NLTK could broaden the set of stopwords available for use.
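
Combining the two lists is a set union, and the result can then be used to filter tokens. The sketch below uses two small made-up stand-in sets rather than the real NLTK and spaCy lists, and `remove_stopwords` is a hypothetical helper. In the notebook the inputs would be `nltk_set` and `spacy_set`.

```python
# Tiny made-up stand-in sets; in the notebook these would be nltk_set and spacy_set
nltk_like = {"the", "a", "is", "shan't", "ain"}
spacy_like = {"the", "a", "is", "whereas", "nevertheless"}

# The union keeps every stop word seen in either source
combined_stopwords = nltk_like | spacy_like

def remove_stopwords(tokens, stopwords):
    """Drop tokens whose lower-cased form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "sky", "is", "blue", "whereas", "grass", "ain", "blue"]
filtered = remove_stopwords(tokens, combined_stopwords)
print(len(combined_stopwords), filtered)
```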

# Review TfidfTransformer & TfidfVectorizer
***
Credit to https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.YOqsn-hKiCo for writing a great introductory article

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
#instantiate CountVectorizer() 
cv=CountVectorizer() 

# this step generates word counts for the words in the sample docs
word_count_vector=cv.fit_transform(sample.excerpt)

word_count_vector.shape

In [None]:
# Compute the IDF values
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True) 
tfidf_transformer.fit(word_count_vector)

In [None]:
# print idf values
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names())
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names_out(), columns=["idf_weights"])
 
# sort ascending 
df_idf.sort_values(by=['idf_weights'])

df_idf.describe()

The lower the IDF value, the more common the term is across the documents.
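
To make the relationship concrete, the smoothed IDF that scikit-learn documents for `smooth_idf=True` is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below evaluates that formula for a five-document corpus; the function name `smooth_idf` is an illustrative choice, not part of scikit-learn.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5  # same size as the five-excerpt sample
idf_by_doc_freq = {doc_freq: smooth_idf(n_docs, doc_freq) for doc_freq in range(1, n_docs + 1)}
for doc_freq, idf in idf_by_doc_freq.items():
    print(f"appears in {doc_freq} of {n_docs} docs -> idf = {idf:.3f}")
```

A term present in every document gets the minimum weight of 1, and the weight rises as the term appears in fewer documents.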

In [None]:
# Time to compute the TFIDF
# count matrix 
count_vector=cv.transform(sample.excerpt) 
 
# tf-idf scores 
tf_idf_vector=tfidf_transformer.transform(count_vector)

In [None]:
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
feature_names = cv.get_feature_names_out()
 
#get tfidf vector for first document 
first_document_vector=tf_idf_vector[0] 
 
#print the scores 
df = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"]) 
df.sort_values(by=["tfidf"],ascending=False)

TfidfVectorizer Usage

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# settings that you use for count vectorizer will go here 
tfidf_vectorizer=TfidfVectorizer(use_idf=True) 
 
# just send in all your docs here 
tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(sample.excerpt)

In [None]:
# get the first vector out (for the first document) 
first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0] 
 
# place tf-idf values in a pandas data frame 
df = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names_out(), columns=["tfidf"])
df.sort_values(by=["tfidf"],ascending=False)

In [None]:
tfidf_vectorizer=TfidfVectorizer(use_idf=True)
 
# just send in all your docs here
fitted_vectorizer=tfidf_vectorizer.fit(sample.excerpt)
tfidf_vectorizer_vectors=fitted_vectorizer.transform(sample.excerpt)
df = pd.DataFrame(tfidf_vectorizer_vectors.T.todense(), index=fitted_vectorizer.get_feature_names_out(), columns=sample['id'])
df.head()
df.columns
df.shape
df_out = df.loc[:, ['c12129c31']]
df_out.sort_values(by=['c12129c31'], ascending=False)

In [None]:
# Test on all training data
# just send in all your docs here
fitted_vectorizer=tfidf_vectorizer.fit(train.excerpt)
tfidf_vectorizer_vectors=fitted_vectorizer.transform(train.excerpt)
df = pd.DataFrame(tfidf_vectorizer_vectors.T.todense(), index=fitted_vectorizer.get_feature_names_out(), columns=train['id'])
df.head()
df.columns
df.shape
df_out = df.loc[:, ['c12129c31']]
df_out.sort_values(by=['c12129c31'], ascending=False)

In [None]:
# Let's create a dictionary to review the key phrase outputs
from collections import defaultdict, Counter

# Returns integers that map to parts of speech
counts_dict = doc.count_by(spacy.attrs.IDS['POS'])

# Print the human readable part of speech tags
for pos, count in counts_dict.items():
    human_readable_tag = doc.vocab[pos].text
    print(human_readable_tag, count)

In [None]:
pos_counts = defaultdict(Counter)
for token in doc:
    pos_counts[token.pos][token.orth] += 1
    
for pos_id, counts in sorted(pos_counts.items()):
    pos = doc.vocab.strings[pos_id]
    for orth_id, count in counts.most_common():
        print(pos, count, doc.vocab.strings[orth_id], len(doc.vocab.strings[orth_id]))

The SPACE tag appears to correspond to newline characters.
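
One way to test this observation without spaCy is to count the runs of newline characters in the raw text and compare the result with the SPACE count. The sketch below assumes, without verifying it here, that each run of consecutive newlines becomes a single whitespace token. The example string is made up and `count_newline_runs` is a hypothetical helper.

```python
import re

def count_newline_runs(text):
    """Count maximal runs of consecutive newline characters in a raw string."""
    return len(re.findall(r"\n+", text))

# Made-up excerpt with two paragraph breaks (one single, one double newline)
example = "First paragraph.\nSecond paragraph.\n\nThird paragraph."
newline_runs = count_newline_runs(example)
print(newline_runs)
```

In the notebook, `count_newline_runs(sample1)` could be compared with the SPACE entry printed above. If the two numbers agree, that would support the observation.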

In [None]:
# Expanding named entities
for entity in doc.ents:
    print(entity.text, entity.label_)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])
print("Number of sentences", len([*doc.sents]))
# doc.sentiment stays at its default of 0.0 unless a sentiment component is added to the pipeline
print("Sentiment", doc.sentiment)

# Understand the length of sentences
for sent in doc.sents:
    print(sent.start_char, sent.end_char, (sent.end_char - sent.start_char))

In [None]:
from spacy import displacy

# Display the entities within a sentence
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# Visualise the dependencies within a sentence
# displacy.render(doc, style='dep', jupyter=True)

# 3. Create the datasets for training the models

In [None]:
# Let's apply the nlp instance to each excerpt
train['excerpt_scy'] = train['excerpt'].apply(nlp)

In [None]:
# Check the data type for the updated column
type(train.loc[0, 'excerpt_scy'])
train.head()

In [None]:
# Reviewing a different row
train.loc[1,'excerpt_scy']

In [None]:
# Create the class methods required to run the analysis
class NLPMethods():
    # No constructor is needed: every method is a stateless helper that takes a spaCy Doc
    
    # Number of sentences
    def number_sentences(self, nlp_text):
        return len([*nlp_text.sents])
    
    # Average sentence length, measured in characters
    def average_sentence_length(self, nlp_text):
        return np.mean([sent.end_char - sent.start_char for sent in nlp_text.sents])
    
    # Part of speech tags
    def part_of_speech_tags(self, nlp_text):
        counts_dict = nlp_text.count_by(spacy.attrs.IDS['POS'])
        counts_dict1 = {}
        # Extract the text that matches to the POS value
        for k, v in counts_dict.items():
            counts_dict1[nlp_text.vocab[k].text] = v
        return counts_dict1
    
    # Number of whitespace tokens (tagged SPACE), defaulting to 0 when none are present
    def number_spaces(self, nlp_text):
        return self.part_of_speech_tags(nlp_text).get('SPACE', 0)
    
    # Part of speech tags - including the word counts
    def word_counts(self, nlp_text):
        pos_counts = defaultdict(Counter)
        for token in nlp_text:
            pos_counts[token.pos][token.orth] += 1
        
        # Create dictionary for the word counts
        word_counts_dict = {}
        for pos_id, counts in sorted(pos_counts.items()):
            pos = nlp_text.vocab.strings[pos_id]
            for orth_id, count in counts.most_common():
                word_counts_dict[nlp_text.vocab.strings[orth_id]] = {'count':count, 
                                                                     'length':len(nlp_text.vocab.strings[orth_id]), 
                                                                     'pos':pos}
        return word_counts_dict
    
    # Number of distinct token strings (punctuation and whitespace tokens are included)
    def number_words(self, nlp_text):
        return len(self.word_counts(nlp_text))
    
    # Length in characters of the longest token
    def longest_word(self, nlp_text):
        dict_word_counts = self.word_counts(nlp_text)
        return max(entry['length'] for entry in dict_word_counts.values())

In [None]:
# Add columns derived from the spaCy doc, reusing a single helper instance
nlp_methods = NLPMethods()
train['num_sentences'] = train['excerpt_scy'].apply(nlp_methods.number_sentences)
train['avg_sentence_length'] = train['excerpt_scy'].apply(nlp_methods.average_sentence_length)
train['pos_dict'] = train['excerpt_scy'].apply(nlp_methods.part_of_speech_tags)
train['num_space'] = train['excerpt_scy'].apply(nlp_methods.number_spaces)
train['wc_dict'] = train['excerpt_scy'].apply(nlp_methods.word_counts)
train['num_words'] = train['excerpt_scy'].apply(nlp_methods.number_words)
train['longest_word'] = train['excerpt_scy'].apply(nlp_methods.longest_word)

In [None]:
train.sample(5)

In [None]:
# Review the max value target variable
max_val = np.max(train['target'])
train_max = train.loc[(train['target']==max_val), :]
train_max

In [None]:
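# Check the reason for the largest target value
# (look the row up by idxmax() instead of hard-coding its index)
idx_max = train['target'].idxmax()
type(train.loc[idx_max, 'excerpt_scy'])
train.loc[idx_max, 'excerpt_scy']
train.loc[idx_max, 'excerpt']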

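Could the word "paleontologists" be driving this score? The direction of the target scale (whether a higher value means an easier or a harder text) should be checked before reading this as a sign of difficulty.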

In [None]:
# Review the min value target variable
min_val = np.min(train['target'])
train_min = train.loc[(train['target']==min_val), :]
train_min

In [None]:
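# Check the reason for the smallest target value
# (look the row up by idxmin() instead of hard-coding its index)
idx_min = train['target'].idxmin()
type(train.loc[idx_min, 'excerpt_scy'])
train.loc[idx_min, 'excerpt_scy']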

# EDA of new variables

In [None]:
# Import libraries
import seaborn as sns

In [None]:
# Correlation analysis
df = train.loc[:, ['id', 'target', 'num_sentences', 'avg_sentence_length', 'num_space', 'num_words', 'longest_word']]
df.head()

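# Drop the non-numeric 'id' column before computing correlations
# (recent pandas versions raise an error on string columns)
cor = df.drop(columns='id').corr()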
sns.heatmap(cor, annot=True)
plt.show()

In [None]:
X = df.drop('id', axis=1)
X.dtypes

In [None]:
# Review a scatter matrix
fig = px.scatter_matrix(X)
fig.show()