# CommonLit Readability Prize

What is CommonLit?
* CommonLit, Inc., is a nonprofit education technology organization. 
* They are serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. 
* They want to improve the readability rating methods of these lessons.

Problem Summary
* Identify the appropriate reading level of a passage of text by rating the complexity of reading passages for grade 3-12 classroom use.
* A dataset is provided that includes readers from a wide variety of age groups and a large collection of texts taken from various domains.

Current Gaps
* Today, most educational texts are matched to readers using traditional readability methods or commercially available formulas.
  Traditional readability formulas are not robust enough and are often inaccurate, whereas commercially available solutions are expensive, non-transparent, and lack evidence that supports their effectiveness.

Future Desired State
* Literacy curriculum developers and teachers who choose passages will be able to quickly and accurately evaluate works for their classrooms. 
* Rating algorithms will no longer be a black box and will be available to all.
* Students will benefit from feedback on the complexity and readability of their work, making it far easier to improve essential reading skills.


# Import Libraries

In [None]:
# Ignore Warnings.
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.filterwarnings('ignore', category=FutureWarning)

# Basic Libraries
import os
import numpy as np
import pandas as pd
from collections import defaultdict
import operator
import re

# import textstat
import gensim.downloader as api

# Import Data Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Import sklearn packages
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.manifold import TSNE
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split

# Import NLP packages.
import nltk
from textblob import TextBlob
from scipy.stats import probplot
from wordcloud import WordCloud, STOPWORDS
from nltk.tokenize import word_tokenize

# Specify print options.
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Utility Functions

In [None]:
# Evaluate Model Performance.
def model_performance(model, X, y):
    y_pred = model.predict(X)
    # mean_squared_error returns MSE; take the square root to get RMSE.
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    print(rmse)

# Cross Validation.    
def create_folds(data, num_splits):
    data["kfold"] = -1
    data = data.sample(frac=1).reset_index(drop=True)
    num_bins = int(np.floor(1 + np.log2(len(data))))
    data.loc[:, "bins"] = pd.cut(data["target"], bins=num_bins, labels=False)
    kf = model_selection.StratifiedKFold(n_splits=num_splits)
    for f, (t_, v_) in enumerate(kf.split(X=data, y=data.bins.values)):
        data.loc[v_, 'kfold'] = f
    data = data.drop("bins", axis=1)
    return data
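
RMSE is the square root of the mean squared error, so it is expressed in the same units as `target`. The following is a minimal pure-Python sketch of that definition; the function name `rmse` and the numbers are made up for illustration and are not taken from the competition data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and predictions, for illustration only.
y_true = [0.0, -1.0, -2.0]
y_pred = [0.0, -1.0, -4.0]
print(rmse(y_true, y_pred))
```

A single large miss dominates the score here, which is the usual reason RMSE is preferred when large errors should be penalized more than small ones.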

# Import Data 

In [None]:
PATH = '/kaggle/input/commonlitreadabilityprize/'

In [None]:
df_train = pd.read_csv(f'{PATH}train.csv')
df_test = pd.read_csv(f'{PATH}test.csv')

# Take a look at the Data & Summary Stats.

In [None]:
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_train.describe()

In [None]:
# Sentiment polarity of each excerpt (df_raw was undefined; the training frame is df_train).
df_train['polarity'] = df_train['excerpt'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Exploratory Data Analysis

In [None]:
# 1. What does the distribution of label look like? What does it convey?
# 2. How is standard error distributed? What does it mean?
# 3. What do low and high values of target mean?
# 4. Get the feel of a few high & low values excerpt and read them. 
# 5. How does word count, character length, avg. char. length per word, count of punctuation marks vary with target? 
# 6. Try to get the topic for each excerpt? 
# 7. Get top unigram, bigram & trigram. How do they vary by different target buckets?
# 8. Study other notebooks to get ideas for feature engineering. 

### The complexity of an excerpt decreases as the target value increases (a higher target means an easier passage).

In [None]:
# Distribution of target
sns.kdeplot(data=df_train, x='target', fill=True)

In [None]:
# Distribution of se
sns.kdeplot(data=df_train, x='standard_error', fill=True)

In [None]:
fig, axes = plt.subplots(ncols=2, figsize=(12,6), dpi=100)
sns.kdeplot(data=df_train, x='target', fill=True, ax=axes[0])
axes[0].axvline(df_train['target'].mean(), label='target Mean', color='r', linewidth=2, linestyle='--')
axes[0].axvline(df_train['target'].median(), label='target Median', color='b', linewidth=2, linestyle='--')
probplot(df_train['target'], plot=axes[1])
axes[0].legend(prop={'size': 10})

for i in range(2):
    axes[i].tick_params(axis='x', labelsize=12)
    axes[i].tick_params(axis='y', labelsize=12)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
axes[0].set_title('target Distribution in Training Set', fontsize=10, pad=12)
axes[1].set_title('target Probability Plot', fontsize=10, pad=12)

In [None]:
# Print top 2 excerpts.
top2 = df_train.nlargest(2, 'target').reset_index(drop=True)
for i in range(2):
    print(f'Excerpt with target value = {top2.loc[i, "target"]}')
    print(top2.loc[i, 'excerpt'])
    print('\n')

In [None]:
# Print bottom 2 excerpts.
bottom2 = df_train.nsmallest(2, 'target').reset_index(drop=True)
for i in range(2):
    print(f'Excerpt with target value = {bottom2.loc[i, "target"]}')
    print(bottom2.loc[i, 'excerpt'])
    print('\n')

In [None]:
# How does word count, character length, avg. char. length per word, count of punctuation marks vary with target? 

In [None]:
# Character length.
df_train['char_count'] = df_train['excerpt'].apply(lambda x: len(str(x)))

In [None]:
# Word Count - split on space.
df_train['word_count_sp'] = df_train.excerpt.str.split().apply(lambda x: len(x))

In [None]:
# Word Count tokenized.
#paragraphs = df_train["excerpt"]

# Tokenize each paragraph
df_train['tokenized_excerpt'] = [word_tokenize(p.lower()) for p in df_train["excerpt"]]

In [None]:
df_train['word_count_tk'] = df_train['tokenized_excerpt'].apply(lambda x: len(x))

In [None]:
# Avg. characters per word. char_count includes spaces and punctuation, so this slightly overestimates word length.
df_train['avg_word_length'] = df_train['char_count'] / df_train['word_count_sp']

In [None]:
# Punctuation count: number of characters in each excerpt that are ASCII punctuation marks.
import string
punct_marks = set(string.punctuation)
df_train['punct_count'] = df_train['excerpt'].apply(lambda excerpt: sum(char in punct_marks for char in excerpt))

In [None]:
# Digit count. 
df_train['num_digits'] = df_train['excerpt'].apply(lambda excerpt: sum(char.isdigit() for char in excerpt))

In [None]:
df_train.head()
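
The surface features above (character count, word count, average word length, punctuation and digit counts) can be checked on a single string before applying them to the whole frame. This is a rough sketch; `basic_text_features` is a hypothetical helper, and it strips leading and trailing punctuation from each word so that punctuation does not inflate the average word length.

```python
import string

def basic_text_features(text):
    """Return simple surface features for one excerpt."""
    words = text.split()
    # Strip leading/trailing punctuation so it does not inflate word length.
    stripped = [w.strip(string.punctuation) for w in words]
    stripped = [w for w in stripped if w]
    avg_len = sum(len(w) for w in stripped) / len(stripped) if stripped else 0.0
    return {
        "char_count": len(text),
        "word_count": len(words),
        "punct_count": sum(ch in string.punctuation for ch in text),
        "num_digits": sum(ch.isdigit() for ch in text),
        "avg_word_length": avg_len,
    }

features = basic_text_features("Hi, Bob! It is 2021.")
print(features)
```

Applied to the training data, the same helper could be mapped over `df_train['excerpt']` and each resulting column compared against `target`.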

# Building first models

In [None]:
X = df_train.loc[:, 'excerpt']
y = df_train.loc[:, 'target']

X_train, X_valid, y_train, y_valid = train_test_split(X.values, y, random_state=42, test_size=0.25, shuffle=True)

In [None]:
df_out = pd.DataFrame()
df_out['id'] = df_test.loc[:, 'id']
X_test = df_test.loc[:, 'excerpt']

In [None]:
# TFIDF for feature extraction.
tfv = TfidfVectorizer(
    min_df=3,
    max_features=None, 
    strip_accents='unicode', 
    analyzer='word',
    token_pattern=r'\w{1,}',
    ngram_range=(1, 3), 
    use_idf=True, smooth_idf=True, sublinear_tf=True,
    stop_words = 'english')

# Fit TF-IDF on the training and validation excerpts only (the test set is not used here). No labels are needed for this step.
tfv.fit(list(X_train) + list(X_valid))

X_train_tfv =  tfv.transform(X_train) 
X_valid_tfv = tfv.transform(X_valid)

In [None]:
X_test_tfv = tfv.transform(X_test)

In [None]:
# Build a linear model
# Fitting a simple Ridge regression on TF-IDF features

lr = Ridge()
lr.fit(X_train_tfv, y_train)

In [None]:
# Train performance
model_performance(lr, X_train_tfv, y_train)
# Validation performance
model_performance(lr, X_valid_tfv, y_valid)
# Test set
df_out['target'] = lr.predict(X_test_tfv)
# Submission
df_out.to_csv('submission.csv', index = False)

# Learning & Practice Concepts

### Tokenization: A process that splits an input sequence into so-called tokens
* Token can be thought of as a useful unit for semantic processing.
* A token can be a word, a sentence, or a paragraph.
* Examples of popular tokenizers are nltk.tokenize.WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer
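
To make the difference between tokenizers concrete, here is a small standard-library sketch that approximates two of them without NLTK. The regular expression is, to my understanding, the pattern behind NLTK's word/punctuation tokenizer, but treat that as an assumption rather than a guarantee.

```python
import re

text = "This is Andrew's text, isn't it?"

# Whitespace tokenization: punctuation stays attached to the words.
whitespace_tokens = text.split()

# Word/punctuation tokenization: runs of word characters and runs of
# non-space punctuation become separate tokens.
word_punct_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(word_punct_tokens)
```

Whitespace splitting keeps `text,` and `it?` as single tokens, while the second approach separates every punctuation mark, which matters if punctuation counts are used as a readability signal.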

In [None]:
excerpt = df_train['excerpt']

In [None]:
text = "This is Andrew's text, isn't it?"

In [None]:
tkz = nltk.tokenize.WhitespaceTokenizer()
tkz.tokenize(text)

In [None]:
# Punctuations might be useful to gauge reading difficulty.
tkz = nltk.tokenize.WordPunctTokenizer()
tkz.tokenize(text)

In [None]:
tkz = nltk.tokenize.TreebankWordTokenizer()
tkz.tokenize(text)

In [None]:
# Tokenization
tk_excerpt = excerpt.apply(word_tokenize)
tk_excerpt.head()

### Token Normalization
* We may want the same token for different forms of a word, e.g. wolf, wolves -> wolf.
* Stemming: A process of removing and replacing suffixes to get the root form of the word, known as the stem. Example: wolves -> wolv. Stemming can produce non-words.
* Lemmatization: Returns the base or dictionary form of the word, known as the lemma.
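
The contrast between the two approaches can be sketched with toy code. `toy_stem` applies only the plural-stripping rules from the first step of the Porter algorithm, so its output differs from a full stemmer, and `TOY_LEMMAS` is a hypothetical mini-dictionary standing in for a real lexicon such as WordNet.

```python
def toy_stem(word):
    """Apply only the plural-stripping rules (step 1a) of the Porter stemmer."""
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini-dictionary standing in for a real lemmatizer's lexicon.
TOY_LEMMAS = {"wolves": "wolf", "ponies": "pony", "feet": "foot"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

for w in ["wolves", "ponies", "caresses", "feet"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Rule-based stripping is cheap but yields non-words such as `poni` and cannot handle irregular forms such as `feet`, while dictionary lookup returns real words at the cost of needing a lexicon.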

In [None]:
# Stemming
porter = nltk.PorterStemmer()
tk_st_excerpt = tk_excerpt.apply(lambda x: [porter.stem(y) for y in x])
tk_st_excerpt.head()

In [None]:
# Stemming example
text1 = excerpt[0]
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text1)
print(tokens)

stemmer = nltk.stem.PorterStemmer()
stemmed = " ".join(stemmer.stem(token) for token in tokens )
print(stemmed)


In [None]:
# Lemmatization.
wnl = nltk.WordNetLemmatizer()
tk_lm_excerpt = tk_excerpt.apply(lambda x: [wnl.lemmatize(w) for w in x])
tk_lm_excerpt.head()

In [None]:
# Sentence Splitting
excerpt_sent_tokenized = excerpt.apply(lambda x: nltk.sent_tokenize(x))
excerpt_sent_tokenized.head()

In [None]:
print(excerpt_sent_tokenized)

### Feature Extraction from text

In [None]:
# Bag of words.
# Among medium-frequency n-grams, those with a smaller frequency can be more discriminating
# because they can capture a specific issue in a review.

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)
pd.DataFrame(
  features.todense(),
  columns = tfidf.get_feature_names_out()  # get_feature_names() was removed in newer scikit-learn releases
)
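
The effect of `min_df` and `max_df` in the cell above can be reproduced by hand by counting, for each n-gram, how many documents contain it. This is a standard-library sketch, not scikit-learn's implementation; it assumes tokens of two or more word characters, which is how I understand the vectorizer's default token pattern to behave.

```python
import re
from collections import Counter

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
min_df, max_df = 2, 0.5  # absolute count and fraction of documents

def ngrams(text, n_max=2):
    """Unigrams and bigrams over tokens of at least two word characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

# Document frequency: in how many texts does each n-gram occur?
doc_freq = Counter()
for t in texts:
    doc_freq.update(set(ngrams(t)))

kept = sorted(
    g for g, df in doc_freq.items()
    if df >= min_df and df <= max_df * len(texts)
)
print(kept)
```

Here `good` appears in three of five documents and is dropped as too frequent, while n-grams seen only once are dropped as too rare.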

### Semantic Similarity

In [None]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic
from nltk.collocations import *

In [None]:
## Path Similarity
# 1. Shortest path between two concepts in the hierarchy
# 2. similarity measure inversely related to this distance.

generator = wn.synset('generator.n.01')
coil = wn.synset('coil.n.01')
car = wn.synset('car.n.01')

print(generator.path_similarity(car))
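
Path similarity can be illustrated without WordNet: find the shortest path between two concepts and return 1 / (1 + path length). The hierarchy below is a hypothetical toy, not real WordNet data, so the numbers will not match the output of the cell above.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parent); not real WordNet data.
PARENT = {
    "generator": "device",
    "coil": "device",
    "device": "artifact",
    "car": "vehicle",
    "vehicle": "artifact",
}

def shortest_path_length(a, b):
    """Breadth-first search over the hierarchy treated as an undirected graph."""
    neighbours = {}
    for child, parent in PARENT.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected

def path_similarity(a, b):
    d = shortest_path_length(a, b)
    return None if d is None else 1.0 / (1 + d)

print(path_similarity("generator", "coil"))
print(path_similarity("generator", "car"))
```

Concepts that share a close ancestor score higher than concepts that only meet near the root of the hierarchy.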

In [None]:
## Lowest Common Subsumer
# 1. Find the lowest common ancestor of both concepts.
# 2. Calculate Lin Similarity: Similarity measure based on information contained in LCS of both concepts.

brown_ic = wordnet_ic.ic('ic-brown.dat')
print(generator.lin_similarity(coil, brown_ic))
print(generator.lin_similarity(car, brown_ic))
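
Lin similarity is defined as 2 * IC(LCS) / (IC(a) + IC(b)), where the information content IC(c) is -log P(c). The sketch below uses made-up concept probabilities and takes the lowest common subsumer as an argument, so it only illustrates the formula and will not reproduce the values computed from the Brown corpus.

```python
import math

# Hypothetical concept probabilities; a real run would estimate these from a corpus.
PROB = {
    "artifact": 0.5,
    "device": 0.1,
    "generator": 0.01,
    "coil": 0.01,
    "car": 0.02,
}

def information_content(concept):
    """IC(c) = -log P(c): rarer concepts carry more information."""
    return -math.log(PROB[concept])

def lin_similarity(a, b, lcs):
    """Lin similarity given the lowest common subsumer of a and b."""
    return 2 * information_content(lcs) / (
        information_content(a) + information_content(b)
    )

sim_coil = lin_similarity("generator", "coil", lcs="device")
sim_car = lin_similarity("generator", "car", lcs="artifact")
print(sim_coil, sim_car)
```

A specific shared ancestor (low probability, high information content) yields a higher score than a generic one.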

In [None]:
## Collocations and Distributional similarity.
# 1. Two words that frequently appear in similar contexts are more likely to be semantically related.
# 2. Words before, after, in a small window.
# 3. POS of words before, after, in a small window.
# 4. Compute strength of association between words

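# Tokenize first: BigramCollocationFinder.from_words expects a sequence of tokens, not a raw string.
text = ' '.join(df_train['excerpt'].to_list())
tokens = word_tokenize(text.lower())
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.nbest(bigram_measures.pmi, 10)

In [None]:
# Ignore bigrams seen fewer than 10 times, then rank the rest by PMI.
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 10)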
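
Pointwise mutual information compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below computes it by hand on a toy sentence; the normalization is a simple assumption and may differ slightly from what NLTK uses internally.

```python
import math
from collections import Counter

# Toy token sequence, for illustration only.
tokens = "the cat sat on the mat the cat ran".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)
n_bigrams = len(tokens) - 1

def pmi(pair):
    """log2 of observed bigram probability over the product of word probabilities."""
    x, y = pair
    p_xy = bigram_counts[pair] / n_bigrams
    p_x = word_counts[x] / n_words
    p_y = word_counts[y] / n_words
    return math.log2(p_xy / (p_x * p_y))

scores = {pair: pmi(pair) for pair in bigram_counts}
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

The pair `('sat', 'on')`, seen once, outranks `('the', 'cat')`, seen twice. PMI favours rare pairs, which is why a frequency filter is usually applied before ranking.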

### Topic Modelling: Generative Models and LDA
* A coarse level analysis of what's in a text collection.
* Topic: the subject/theme of a discourse.
* Topics are represented as word distribution.
* A document is assumed to be a mixture of topics.
* You're given a corpus and the number of topics; the topics themselves are learned from the data.
* Essentially it's a text clustering problem, documents & words clustered simultaneously.
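
The generative story behind these bullets can be sketched directly: pick a topic mixture for the document, then for each word pick a topic from the mixture and a word from that topic's distribution. The topics, vocabulary, and `alpha` value below are hypothetical, and this shows only the forward sampling process, not how LDA is fitted.

```python
import random

random.seed(0)

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "sports": {"ball": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "law": 0.3, "team": 0.1},
}

def sample_dirichlet(alpha, k):
    """Draw a k-dimensional mixture by normalizing independent gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    """Sample a topic mixture, then sample each word from a sampled topic."""
    names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(names))
    words = []
    for _ in range(n_words):
        topic = random.choices(names, weights=mixture)[0]
        vocab = TOPICS[topic]
        words.append(random.choices(list(vocab), weights=list(vocab.values()))[0])
    return mixture, words

mixture, words = generate_document(10)
print(mixture)
print(words)
```

Fitting a topic model runs this story in reverse: given only the observed words, it estimates the per-topic word distributions and the per-document mixtures.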

In [None]:
#doc_set = 'This is a kaggle notebook. I like kaggling'

In [None]:
#doc_set.split()

In [None]:
#import gensim
#from gensim import corpora, models

#texts = [doc_set.split()]  # Dictionary expects a list of tokenized documents
#dictionary = corpora.Dictionary(texts)
#corpus = [dictionary.doc2bow(doc) for doc in texts]
#lda_model = models.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=50)
#print(lda_model.print_topics(num_topics=4, num_words=5))