# Abstract
After building [the baseline model (version 0.0)](https://www.kaggle.com/lukeshrek/classify-toxic-question) with Logistic Regression and a TF-IDF bigram vectorizer, I found that the model is still very simple and can be improved to give better results.

In version 1.0, I go deeper into data analysis and visualization, improve the preprocessing steps, and build a model based on a bidirectional GRU network.

This notebook presents the data exploration and analysis work.

## Initial Configurations

In this notebook I use several packages to support data visualization and analysis.

* Textstat: Textstat is an easy to use library to calculate statistics from text. It helps determine readability, complexity, and grade level.
* Chart Studio: Chart Studio provides a web-service for hosting graphs. Graphs are saved inside online Chart Studio account.

In [None]:
!pip install textstat
!pip install chart_studio

These packages can be installed with pip.

The code cell below imports the modules and libraries used in this notebook:

* os: provides functions for interacting with the operating system. 
* json: built-in package which can be used to work with JSON data.
* string: provides additional tools to manipulate strings.
* math: built-in module for mathematical tasks.
* collections: implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers, dict, list, set, and tuple.
* warnings: controls warning messages; ```filterwarnings("ignore")``` suppresses all warnings.
* statistics: provides functions for calculating mathematical statistics of numeric (Real-valued) data.
* tqdm: output a progress bar by wrapping around any iterable.


* NumPy: NumPy is the fundamental package for array computing with Python.
* pandas: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
* Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
* Seaborn: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
* WordCloud: Word cloud is a technique for visualising frequent words in a text where the size of the words represents their frequency.
* plotly: An open-source, interactive data visualization library for Python.
* spaCy: spaCy is a free open-source library for Natural Language Processing in Python. It features NER, POS tagging, dependency parsing, word vectors and more.


In [None]:
import os
import json
import string
import math

from statistics import *
from tqdm import tqdm

# Factory function to supply missing values
from collections import defaultdict

# Warnings control
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999

# Textstat
import textstat

# Imports for visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
color = sns.color_palette()

# plotly based imports
from plotly import tools
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

# Wordcloud library
from wordcloud import WordCloud, ImageColorGenerator, STOPWORDS

# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# spaCy Parser for questions
punctuations = string.punctuation
stopwords = list(STOP_WORDS)
parser = English()

# 1. Introduction

## Problem Description
        
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions - those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. More scalable methods could be developed to detect toxic and misleading content.

## Data Description
In this competition the model should be able to detect whether a question asked on Quora is sincere or not. An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

* Has a non-neutral tone
    * Has an exaggerated tone to underscore a point about a group of people
    * Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory
    * Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    * Makes disparaging attacks/insults against a specific person or group of people
    * Based on an outlandish premise about a group of people
    * Disparages against a characteristic that is not fixable and not measurable
* Isn't grounded in reality
    * Based on false information, or contains absurd assumptions
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere (target = 1) or not (target = 0).

# 2. Data Exploration and Visualization

## Input

We will take a look at the input data directory

In [None]:
!ls ../input/quora-insincere-questions-classification

* train.csv - the training set
* test.csv - the test set
* sample_submission.csv - A sample submission in the correct format
* embeddings/ - folder containing word embeddings

External data sources are not allowed. Only the embeddings provided with the competition (listed in the Embeddings section below) can be used for building our models.

Let's load the training and test data and check their shapes.

In [None]:
train_df = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
test_df = pd.read_csv("../input/quora-insincere-questions-classification/test.csv")
print("Train shape: ", train_df.shape)
print("Test shape: ", test_df.shape)

In [None]:
train_df.head()

In [None]:
train_df.dtypes

* qid - unique question identifier
* question_text - Quora question text
* target - a question labeled "insincere" has a value of 1, otherwise 0

## Embeddings 

The embeddings folder is zipped, so we cannot list its contents with ```!ls```, but information about the allowed embeddings is available in the competition's data description.

In [None]:
!ls ../input/quora-insincere-questions-classification/embeddings.zip

* GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
* glove.840B.300d - https://nlp.stanford.edu/projects/glove/
* paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
* wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html
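Although ```!ls``` cannot look inside an archive, Python's built-in ```zipfile``` module can list the members of a zip file without extracting it. The sketch below is only an illustration: it builds a tiny in-memory archive with placeholder member names (they are assumptions, not the real contents of ```embeddings.zip```) and lists them. For the real file, the path shown in the cell above would be passed to ```zipfile.ZipFile``` instead of the buffer.

```python
import io
import zipfile

# Build a small in-memory archive with placeholder member names
# (hypothetical names, used only so that the example is self-contained).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("folder_a/vectors_a.txt", "placeholder")
    zf.writestr("folder_b/vectors_b.vec", "placeholder")

# List the archive members without extracting anything.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()

for name in names:
    print(name)
```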

## Target Count and Target Distribution

The dataset is clearly imbalanced: the vast majority of questions are sincere, and only a small number are insincere. Let us look at the distribution of the target variable with a bar chart and a pie chart to understand it better.

In [None]:
# Target count
cnt_srs = train_df['target'].value_counts()
trace = go.Bar(
    x=cnt_srs.index,
    y=cnt_srs.values,
    marker=dict(
        color=cnt_srs.values,
        colorscale = 'Picnic',
        reversescale = True
    ),
)

layout = go.Layout(
    title='Target Count',
    font=dict(size=18)
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="TargetCount")

# Target distribution
labels = (np.array(cnt_srs.index))
sizes = (np.array((cnt_srs / cnt_srs.sum())*100))

trace = go.Pie(labels=labels, values=sizes)
layout = go.Layout(
    title='Target distribution',
    font=dict(size=18),
    width=600,
    height=600,
)
data = [trace]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename="usertype")

So about 6% of the training questions are insincere (target=1) and the rest are sincere.
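The class shares can also be read off numerically rather than from the pie chart. The following is a minimal sketch on a toy label series (hypothetical data standing in for ```train_df["target"]```); it uses ```value_counts(normalize=True)``` to get the proportions directly.

```python
import pandas as pd

# Toy labels standing in for train_df["target"] (hypothetical data):
# 94 sincere questions (0) and 6 insincere questions (1).
toy_target = pd.Series([0] * 94 + [1] * 6, name="target")

# Absolute counts and percentage share of each class.
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True) * 100

summary = pd.DataFrame({"count": counts, "percent": shares.round(2)})
print(summary)
```

With an imbalance like this, plain accuracy is a misleading metric, because always predicting the majority class already scores about 94%.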

## WordCloud of questions

To find the most frequently occurring words in questions, I built word clouds from random samples of 1000 insincere and 1000 sincere questions taken from the ```question_text``` column.

First, the sampled questions are concatenated into a single string. Next, the string is split into words, and a dictionary records how many times each unique word appears. Finally, ```generate_from_frequencies``` from the WordCloud library plots the word cloud from the frequency dictionary.

In [None]:
from collections import Counter

# Split text into a dictionary of unique words and their frequencies
def word_freq_dict(text):
    # Convert text into word list
    wordList = text.split()
    # Count every word in a single pass (avoids calling list.count once per word)
    wordFreqDict = dict(Counter(wordList))
    return wordFreqDict

# Plot a wordcloud from a word frequency dictionary
# (uses the global `wordcloud` object created in the cells below)
def word_cloud_from_frequency(word_freq_dict, title, figure_size=(10,6)):
    wordcloud.generate_from_frequencies(word_freq_dict)
    plt.figure(figsize=figure_size)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title)
    plt.show()

In [None]:
# Wordcloud of a random sample of 1000 insincere questions
insincere_questions = train_df.question_text[train_df['target'] == 1]
insincere_sample = " ".join(insincere_questions.sample(1000, random_state=1).values)
insincere_word_freq = word_freq_dict(insincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Reds',
    background_color='white')

word_cloud_from_frequency(insincere_word_freq, "Most Frequent Words in a sample of 1000 raw questions flagged insincere")

In [None]:
# Wordcloud of a random sample of 1000 sincere questions
sincere_questions = train_df.question_text[train_df['target'] == 0]
sincere_sample = " ".join(sincere_questions.sample(1000, random_state=1).values)
sincere_word_freq = word_freq_dict(sincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Blues',
    background_color='white')

word_cloud_from_frequency(sincere_word_freq, "Most Frequent Words in a sample of 1000 raw questions flagged sincere")

Many words appear in both types of questions, as expected.
However, these two word clouds are not very useful, for two main reasons: (i) they are built from only 2,000 of more than 1,000,000 questions, so the overall picture could be quite different, and (ii) they contain too many noisy, uninformative words (such as 'what' and 'when', which appear in a question whether it is toxic or not).

A better idea may be to count the most frequent words in each class separately.

## Word n-grams Count Plot

In the following cells, I create count plots for 1-, 2- and 3-grams using the same process.
Before looking for the most frequent n-grams, all stopwords must be removed from the counter. Each question is lowercased and split into words, and consecutive words are then zipped into groups whose size is set by the ```n_gram``` parameter. Finally, a frequency dictionary counts how often each n-gram appears.
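The zipping step is easier to follow on a single sentence. Below is a minimal sketch of the same idea with a tiny, made-up stopword set; the function name ```ngrams_from_text``` and the example sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
def ngrams_from_text(text, n, stopwords=frozenset()):
    """Return the word n-grams of `text` after lowercasing and stopword removal."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    # zip(tokens, tokens[1:], ..., tokens[n-1:]) lines each token up
    # with its n-1 successors, producing one tuple per n-gram.
    shifted = [tokens[i:] for i in range(n)]
    return [" ".join(gram) for gram in zip(*shifted)]

# Tiny illustrative stopword set (hypothetical, far smaller than STOPWORDS).
toy_stopwords = {"is", "the", "to"}
sentence = "What is the best way to learn machine learning"

unigrams = ngrams_from_text(sentence, 1, toy_stopwords)
bigrams = ngrams_from_text(sentence, 2, toy_stopwords)
trigrams = ngrams_from_text(sentence, 3, toy_stopwords)

print(unigrams)
print(bigrams)
print(trigrams)
```

Note that stopwords are removed before zipping, so a bigram such as 'way learn' can join two words that were not adjacent in the original question.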

### Unigram

In [None]:
train_insincere_df = train_df[train_df["target"]==1]
train_sincere_df = train_df[train_df["target"]==0]

stopwords = set(STOPWORDS)
more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
stopwords = stopwords.union(more_stopwords)
    
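# N-gram generation
def generate_ngrams(text, n_gram=1):
    # Lowercase, split on spaces, and drop empty tokens and stopwords
    # (uses the extended `stopwords` set defined above, not the raw STOPWORDS)
    token = [token for token in text.lower().split(" ") if token != "" and token not in stopwords]
    # Zip the token list with shifted copies of itself to form n-grams
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]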

# Horizontal bar chart
def horizontal_bar_chart(df, color):
    trace = go.Bar(
        y=df["word"].values[::-1],
        x=df["wordcount"].values[::-1],
        showlegend=False,
        orientation = 'h',
        marker=dict(
            color=color,
        ),
    )
    return trace

In [None]:
# Get the bar chart from sincere questions #
freq_dict = defaultdict(int)
for sent in train_sincere_df["question_text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'blue')

# Get the bar chart from insincere questions #
freq_dict = defaultdict(int)
for sent in train_insincere_df["question_text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'red')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,
                          subplot_titles=["Frequent words of sincere questions", 
                                          "Frequent words of insincere questions"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Word Count Plots")
py.iplot(fig, filename='word-plots')

***Observations from unigram count plot***

* Some of the top words are common to both classes, such as 'people', 'will' and 'think'.
* Apart from these common words, many of the top words in sincere questions are used for description and comparison: 'best', 'good', 'much', etc.
* Apart from these common words, the top words in insincere questions often involve matters that may be sensitive or controversial: 'trump', 'women', 'white', etc.

### Bigram

As mentioned above, the bigram and trigram count plots are made with the same process as the unigram plot: lowercase the text, drop stopwords, zip the remaining words into n-grams, and count.

In [None]:
freq_dict = defaultdict(int)
for sent in train_sincere_df["question_text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'blue')


freq_dict = defaultdict(int)
for sent in train_insincere_df["question_text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'red')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04,horizontal_spacing=0.15,
                          subplot_titles=["Frequent bigrams of sincere questions", 
                                          "Frequent bigrams of insincere questions"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=900, paper_bgcolor='rgb(233,233,233)', title="Bigram Count Plots")
py.iplot(fig, filename='word-plots')

***Observations from bigram count plot***

* Some of the top bigrams common to both classes combine a top word shared by both classes with a top word from one class, especially from sincere questions, for example 'best way'. People looking for the best method to do something want useful answers, so they tend to ask informative questions.
* Several topics can be seen: 'computer science', 'machine learning', 'tv shows', etc.
* The top bigrams in insincere questions still involve matters that may be sensitive or controversial (and are strongly related to people, religion and politics), such as 'donald trump', 'white people', 'black people', etc. This observation is reasonable and matches the characteristics of insincere questions given in the competition description.

### Trigram

In [None]:
freq_dict = defaultdict(int)
for sent in train_sincere_df["question_text"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace0 = horizontal_bar_chart(fd_sorted.head(50), 'blue')


freq_dict = defaultdict(int)
for sent in train_insincere_df["question_text"]:
    for word in generate_ngrams(sent,3):
        freq_dict[word] += 1
fd_sorted = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted.columns = ["word", "wordcount"]
trace1 = horizontal_bar_chart(fd_sorted.head(50), 'red')

# Creating two subplots
fig = tools.make_subplots(rows=1, cols=2, vertical_spacing=0.04, horizontal_spacing=0.2,
                          subplot_titles=["Frequent trigrams of sincere questions", 
                                          "Frequent trigrams of insincere questions"])
fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig['layout'].update(height=1200, width=1200, paper_bgcolor='rgb(233,233,233)', title="Trigram Count Plots")
py.iplot(fig, filename='word-plots')

***Observations from trigram count plot***

* Many trigrams are simply combinations of popular unigrams and bigrams, so plotting counts for longer n-grams (```n>3```) is unnecessary.
* More detailed and informative phrases appear: 'black lives matter', 'gun control advocates', etc.
* Unexpectedly, people's ages are mentioned frequently in the questions of both categories. Perhaps people care a lot about the physical, psychological and cognitive characteristics of others, which are traits influenced by age.

## Meta Features

For further exploration, let's create some meta features and then look at how they are distributed between the classes. The ones that we will create are:

* Number of words in the text
* Number of unique words in the text
* Number of characters in the text
* Number of stopwords
* Number of punctuations
* Number of upper case words
* Number of title case words
* Average length of the words

In [None]:
# Number of words in the text
train_df["num_words"] = train_df["question_text"].apply(lambda x: len(str(x).split()))
test_df["num_words"] = test_df["question_text"].apply(lambda x: len(str(x).split()))

# Number of unique words in the text
train_df["num_unique_words"] = train_df["question_text"].apply(lambda x: len(set(str(x).split())))
test_df["num_unique_words"] = test_df["question_text"].apply(lambda x: len(set(str(x).split())))

# Number of characters in the text
train_df["num_chars"] = train_df["question_text"].apply(lambda x: len(str(x)))
test_df["num_chars"] = test_df["question_text"].apply(lambda x: len(str(x)))

# Number of stopwords in the text
train_df["num_stopwords"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
test_df["num_stopwords"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

# Number of punctuations in the text
train_df["num_punctuations"] =train_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )
test_df["num_punctuations"] =test_df['question_text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]) )

# Number of upper case words in the text
train_df["num_words_upper"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))
test_df["num_words_upper"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.isupper()]))

# Number of title case words in the text
train_df["num_words_title"] = train_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))
test_df["num_words_title"] = test_df["question_text"].apply(lambda x: len([w for w in str(x).split() if w.istitle()]))

# Average length of the words in the text
train_df["mean_word_len"] = train_df["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
test_df["mean_word_len"] = test_df["question_text"].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

### Plotting meta features

To get a better view of how these meta features are distributed, I use box plots.

In descriptive statistics, a box plot or boxplot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis. Box plots visually show the distribution and skewness of numerical data by displaying the data quartiles (or percentiles) and the median.

Box plots show the five-number summary of a set of data: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.

* Minimum Score: The lowest score, excluding outliers (shown at the end of the lower whisker, or the left whisker in a horizontal plot).
* Lower Quartile: Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).
* Median: The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.
* Upper Quartile: Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of data are above this value.
* Maximum Score: The highest score, excluding outliers (shown at the end of the upper whisker, or the right whisker in a horizontal plot).

Some other elements of the plot are:
* Whiskers: The upper and lower whiskers represent scores outside the middle 50% (i.e. the lower 25% of scores and the upper 25% of scores).
* The Interquartile Range (or IQR): This is the box itself, which covers the middle 50% of scores (i.e., the range between the 25th and 75th percentile).
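The quantities described above can be computed by hand with NumPy. The sketch below uses a small made-up sample and assumes the common convention, which is also the Matplotlib default, that points more than 1.5 × IQR beyond the box are drawn as outliers and the whiskers stop at the most extreme points inside that range.

```python
import numpy as np

# Hypothetical word counts for ten questions (illustrative data only).
values = np.array([3, 5, 7, 8, 9, 11, 12, 13, 15, 40])

# Quartiles and interquartile range (the box).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the box is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```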

In [None]:
# Truncate some extreme values for better visuals
# (a single .loc call avoids chained assignment, which may silently modify a copy)
train_df.loc[train_df['num_words'] > 60, 'num_words'] = 60
train_df.loc[train_df['num_punctuations'] > 10, 'num_punctuations'] = 10
train_df.loc[train_df['num_chars'] > 350, 'num_chars'] = 350

# Box plot making
f, axes = plt.subplots(3, 1, figsize=(10,20))
sns.boxplot(x='target', y='num_words', data=train_df, ax=axes[0])
axes[0].set_xlabel('Target', fontsize=12)
axes[0].set_title("Number of words in each class", fontsize=15)

sns.boxplot(x='target', y='num_chars', data=train_df, ax=axes[1])
axes[1].set_xlabel('Target', fontsize=12)
axes[1].set_title("Number of characters in each class", fontsize=15)

sns.boxplot(x='target', y='num_punctuations', data=train_df, ax=axes[2])
axes[2].set_xlabel('Target', fontsize=12)
#plt.ylabel('Number of punctuations in text', fontsize=12)
axes[2].set_title("Number of punctuations in each class", fontsize=15)
plt.show()

***Observations from Meta Features*** 

We can see that insincere questions tend to have more words and more characters than sincere questions, so these might be useful features in our model.

## Detailed Statistics for given data

The ```textstat``` package is used for this purpose.

Before any further analysis, all questions must be tokenized. I use the spaCy tokenizer for this task, together with ```tqdm``` to display progress bars.

In [None]:
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

tqdm.pandas()
sincere_questions = train_sincere_df["question_text"].progress_apply(spacy_tokenizer)
insincere_questions = train_insincere_df["question_text"].progress_apply(spacy_tokenizer)

To summarize each distribution, I create a table of statistics that is shown with every plot described below.

The measures used are Mean, Standard Deviation, Variance, Median, Maximum and Minimum. The first four come from the ```statistics``` module, and the last two are Python built-ins.
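As a quick check of what these measures return, here is a minimal sketch on a small made-up sample. It uses the population versions ```pstdev``` and ```pvariance```, the same functions called in the plotting helper below.

```python
from statistics import mean, median, pstdev, pvariance

# Small illustrative sample (hypothetical values).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

stats = {
    "Mean": mean(sample),
    "Standard Deviation": pstdev(sample),   # population standard deviation
    "Variance": pvariance(sample),          # population variance
    "Median": median(sample),
    "Maximum value": max(sample),
    "Minimum value": min(sample),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```

The population variants divide by N rather than N - 1, which is a reasonable choice here because each class is described in full rather than estimated from a sample.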

In [None]:
# One function for all plots
def plot_readability(a,b,title,bins=0.1,colors=['#0000FF', '#FF0000']):
    trace1 = ff.create_distplot([a,b], ["Sincere questions","Insincere questions"], bin_size=bins, colors=colors, show_rug=False)
    trace1['layout'].update(title=title)
    iplot(trace1, filename='Distplot')
    table_data= [["Statistical Measures","Sincere questions","Insincere questions"],
                ["Mean",mean(a),mean(b)],
                ["Standard Deviation",pstdev(a),pstdev(b)],
                ["Variance",pvariance(a),pvariance(b)],
                ["Median",median(a),median(b)],
                ["Maximum value",max(a),max(b)],
                ["Minimum value",min(a),min(b)]]
    trace2 = ff.create_table(table_data)
    iplot(trace2, filename='Table')

### Syllable Analysis

In [None]:
syllable_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.syllable_count))
syllable_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.syllable_count))
plot_readability(syllable_sincere,syllable_insincere,"Syllable Analysis",5)

### Lexicon Analysis

In [None]:
lexicon_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.lexicon_count))
lexicon_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.lexicon_count))
plot_readability(lexicon_sincere,lexicon_insincere,"Lexicon Analysis",4)

### Question Length

In [None]:
length_sincere = np.array(train_sincere_df["question_text"].progress_apply(len))
length_insincere = np.array(train_insincere_df["question_text"].progress_apply(len))
plot_readability(length_sincere,length_insincere,"Question Length",40)

### Average Syllables per Word

In [None]:
spw_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.avg_syllables_per_word))
spw_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.avg_syllables_per_word))
plot_readability(spw_sincere,spw_insincere,"Average syllables per word",0.2)

### Average Letter per Word

In [None]:
lpw_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.avg_letter_per_word))
lpw_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.avg_letter_per_word))
plot_readability(lpw_sincere,lpw_insincere,"Average letters per word",2)

### Readability Features

**Flesch reading ease**

In the Flesch reading-ease test, higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read. The formula for the Flesch reading-ease score (FRES) test is:

<img src="https://latex.codecogs.com/svg.image?206.835&space;-&space;1.015\left(\frac{total\;words}{total\;sentences}\right)&space;-&space;84.6\left(\frac{total\;syllables}{total\;words}\right)" title="206.835 - 1.015\left(\frac{total\;words}{total\;sentences}\right) - 84.6\left(\frac{total\;syllables}{total\;words}\right)" />

Scores can be interpreted as shown in the table below.

|     Score    |  School level (US) |                                  Notes                                  |
|:------------:|:------------------:|:-----------------------------------------------------------------------:|
| 100.0–90.0   | 5th grade          | Very easy to read. Easily understood by an average 11-year-old student. |
| 90.0–80.0    | 6th grade          | Easy to read. Conversational English for consumers.                     |
| 80.0–70.0    | 7th grade          | Fairly easy to read.                                                    |
| 70.0–60.0    | 8th & 9th grade    | Plain English. Easily understood by 13- to 15-year-old students.        |
| 60.0–50.0    | 10th to 12th grade | Fairly difficult to read.                                               |
| 50.0–30.0    | College            | Difficult to read.                                                      |
| 30.0–10.0    | College graduate   | Very difficult to read. Best understood by university graduates.        |
| 10.0–0.0     | Professional       | Extremely difficult to read. Best understood by university graduates.   |
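The formula can be evaluated directly once the three counts are known. The sketch below is a plain transcription of the formula with made-up counts; it is not how ```textstat``` is implemented internally, and the library's own word, sentence and syllable counting may give slightly different results on real text.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch reading-ease score computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical question: 10 words, 1 sentence, 14 syllables.
fre_score = flesch_reading_ease(total_words=10, total_sentences=1, total_syllables=14)
print(round(fre_score, 3))
```

A score in the 70s falls in the "fairly easy to read" band of the table above.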

In [None]:
fre_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.flesch_reading_ease))
fre_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.flesch_reading_ease))
plot_readability(fre_sincere,fre_insincere,"Flesch Reading Ease",20)

Generally speaking, questions given in the dataset are fairly easy to read (mean value is ~70-75).

**The Flesch-Kincaid Grade Level**

The "Flesch–Kincaid Grade Level Formula" presents a score as a U.S. grade level, making it easier to judge the readability level of texts. It can also mean the number of years of education generally required to understand this text, relevant when the formula results in a number greater than 10. The grade level is calculated with the following formula:

<img src="https://latex.codecogs.com/svg.image?0.39\left(\frac{total\;words}{total\;sentences}\right)&space;&plus;&space;11.8\left(\frac{total\;syllables}{total\;words}\right)&space;-&space;15.59" title="0.39\left(\frac{total\;words}{total\;sentences}\right) + 11.8\left(\frac{total\;syllables}{total\;words}\right) - 15.59" />

The result is a number that corresponds with a U.S. grade level. 
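As a worked example, the sketch below evaluates the grade level in its standard published form, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59, on made-up counts. It is an illustration only and does not reproduce the counting rules used inside ```textstat```.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Hypothetical question: 10 words, 1 sentence, 14 syllables.
fk_grade = flesch_kincaid_grade(total_words=10, total_sentences=1, total_syllables=14)
print(round(fk_grade, 2))
```

Unlike the reading-ease score, a higher value here means harder text, since both ratios enter with a positive sign.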

In [None]:
fkg_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.flesch_kincaid_grade))
fkg_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.flesch_kincaid_grade))
plot_readability(fkg_sincere,fkg_insincere,"Flesch Kincaid Grade",4)

**The Fog Scale (Gunning FOG Formula)**

The Gunning fog index is a readability test for English writing. The index estimates the years of formal education a person needs to understand the text on the first reading. 

The Gunning fog index is calculated with the following formula:

<img src="https://latex.codecogs.com/svg.image?0.4\left&space;[&space;\left&space;(&space;\frac{words}{sentences}&space;\right&space;)&plus;&space;100&space;&space;\left&space;(&space;\frac{complex\;words}{words}&space;\right)&space;\right]" title="0.4\left [ \left ( \frac{words}{sentences} \right )+ 100 \left ( \frac{complex\;words}{words} \right) \right]" />

The fog index is commonly used to confirm that text can be read easily by the intended audience. Texts for a wide audience generally need a fog index less than 12. Texts requiring near-universal understanding generally need an index less than 8.
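A short sketch of the calculation on made-up counts is shown below. Deciding which words are "complex" (commonly, words of three or more syllables) is the hard part in practice and is simply assumed as an input here.

```python
def gunning_fog(total_words, total_sentences, complex_words):
    """Gunning fog index computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    percent_complex = 100 * complex_words / total_words
    return 0.4 * (words_per_sentence + percent_complex)

# Hypothetical question: 10 words, 1 sentence, 2 complex words.
fog_index = gunning_fog(total_words=10, total_sentences=1, complex_words=2)
print(fog_index)
```

For single-sentence questions the index is very sensitive to the complex-word count, since one extra complex word in ten adds 4 points.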

In [None]:
fog_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.gunning_fog))
fog_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.gunning_fog))
plot_readability(fog_sincere,fog_insincere,"The Fog Scale (Gunning FOG Formula)",4)

**Automated Readability Index**

The automated readability index (ARI) is a readability test for English texts, designed to gauge the understandability of a text. Like the Flesch–Kincaid grade level, Gunning fog index, SMOG index, Fry readability formula, and Coleman–Liau index, it produces an approximate representation of the US grade level needed to comprehend the text.

The formula for calculating the automated readability index is given below:

<img src="https://latex.codecogs.com/svg.image?4.71\left(\frac{character}{words}\right)&space;+&space;0.5\left(\frac{words}{sentences}\right)&space;-&space;21.43" title="4.71\left(\frac{character}{words}\right) + 0.5\left(\frac{words}{sentences}\right) - 21.43" />

| Score | Age   | Grade Level        |
|-------|-------|--------------------|
| 1     | 5-6   | Kindergarten       |
| 2     | 6-7   | First/Second Grade |
| 3     | 7-9   | Third Grade        |
| 4     | 9-10  | Fourth Grade       |
| 5     | 10-11 | Fifth Grade        |
| 6     | 11-12 | Sixth Grade        |
| 7     | 12-13 | Seventh Grade      |
| 8     | 13-14 | Eighth Grade       |
| 9     | 14-15 | Ninth Grade        |
| 10    | 15-16 | Tenth Grade        |
| 11    | 16-17 | Eleventh Grade     |
| 12    | 17-18 | Twelfth Grade      |
| 13    | 18-24 | College student    |
| 14    | 24+   | Professor          |
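
A minimal sketch of the ARI arithmetic, assuming the character, word, and sentence counts are already known. The function name `ari_from_counts` and the counts below are hypothetical; `textstat.automated_readability_index` counts these quantities itself and may round differently.

```python
def ari_from_counts(characters, words, sentences):
    """Automated readability index from raw counts (illustrative helper)."""
    if words == 0 or sentences == 0:
        return 0.0  # assumption: treat empty text as score 0
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


# Hypothetical passage: 450 characters, 100 words, 5 sentences
# 4.71 * 4.5 + 0.5 * 20 - 21.43 = 9.765
example_ari = ari_from_counts(characters=450, words=100, sentences=5)
```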

In [None]:
ari_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.automated_readability_index))
ari_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.automated_readability_index))
plot_readability(ari_sincere,ari_insincere,"Automated Readability Index",10)

**The Coleman-Liau Index**

The Coleman–Liau index is a readability test designed by Meri Coleman and T. L. Liau to gauge the understandability of a text. Its output approximates the U.S. grade level thought necessary to comprehend the text.

Like the ARI but unlike most of the other indices, Coleman–Liau relies on characters instead of syllables per word.

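The Coleman–Liau index is calculated with the following formula, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words: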

<img src="https://latex.codecogs.com/svg.image?CLI&space;=&space;0.0588L&space;-&space;0.296S&space;-&space;15.8" title="CLI = 0.0588L - 0.296S - 15.8" />
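
The sketch below works through the CLI formula from raw counts, scaling both letters and sentences to a per-100-words basis. The name `coleman_liau_from_counts` and the example counts are hypothetical; `textstat.coleman_liau_index` derives the counts from the text itself.

```python
def coleman_liau_from_counts(letters, words, sentences):
    """Coleman-Liau index from raw counts (illustrative helper)."""
    if words == 0:
        return 0.0  # assumption: treat empty text as score 0
    letters_per_100_words = letters / words * 100      # L in the formula
    sentences_per_100_words = sentences / words * 100  # S in the formula
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


# Hypothetical passage: 450 letters, 100 words, 5 sentences
# 0.0588 * 450 - 0.296 * 5 - 15.8 = 9.18
example_cli = coleman_liau_from_counts(letters=450, words=100, sentences=5)
```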

In [None]:
cli_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.coleman_liau_index))
cli_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.coleman_liau_index))
plot_readability(cli_sincere,cli_insincere,"The Coleman-Liau Index",10)

**Linsear Write Formula**

The standard Linsear Write metric Lw is computed on a 100-word sample:

1. For each "easy word", defined as a word with 2 syllables or fewer, add 1 point.
2. For each "hard word", defined as a word with 3 syllables or more, add 3 points.
3. Divide the points by the number of sentences in the 100-word sample.
4. Adjust the provisional result r:
   * If r > 20, Lw = r / 2.
   * If r ≤ 20, Lw = r / 2 - 1.

The result is a "grade level" measure, reflecting the estimated years of education needed to read the text fluently.
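
The steps above can be sketched as follows, assuming the easy-word, hard-word, and sentence counts for the 100-word sample are already available. The helper `linsear_write_from_counts` and its inputs are hypothetical; `textstat.linsear_write_formula` does the syllable counting on its own.

```python
def linsear_write_from_counts(easy_words, hard_words, sentences):
    """Linsear Write score from counts in a 100-word sample (illustrative helper)."""
    if sentences == 0:
        return 0.0  # assumption: treat a sample with no sentences as score 0
    points = easy_words * 1 + hard_words * 3  # steps 1 and 2
    provisional = points / sentences          # step 3
    # step 4: adjust the provisional result r
    if provisional > 20:
        return provisional / 2
    return provisional / 2 - 1


# Hypothetical sample: 80 easy words, 20 hard words, 5 sentences
# points = 80 + 60 = 140, r = 28 > 20, so Lw = 14.0
example_hard_sample = linsear_write_from_counts(easy_words=80, hard_words=20, sentences=5)

# Hypothetical sample: 90 easy words, 10 hard words, 8 sentences
# points = 90 + 30 = 120, r = 15 <= 20, so Lw = 15 / 2 - 1 = 6.5
example_easy_sample = linsear_write_from_counts(easy_words=90, hard_words=10, sentences=8)
```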

In [None]:
lwf_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.linsear_write_formula))
lwf_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.linsear_write_formula))
plot_readability(lwf_sincere,lwf_insincere,"Linsear Write Formula",2)

**Dale–Chall Readability Formula**

The Dale–Chall readability formula is a readability test that provides a numeric gauge of the comprehension difficulty that readers come upon when reading a text. It uses a list of 3000 words that groups of fourth-grade American students could reliably understand, considering any word not on that list to be difficult.

The raw score of the Dale–Chall readability formula (1948) is calculated as follows:

<img src="https://latex.codecogs.com/svg.image?0.1579\left&space;(&space;&space;\frac{difficult\;words}{words}\ast&space;&space;100\right&space;)&space;&plus;&space;0.0496\left&space;(&space;\frac{words}{sentences}&space;\right&space;)" title="0.1579\left ( \frac{difficult\;words}{words}\ast 100\right ) + 0.0496\left ( \frac{words}{sentences} \right )" />

If the percentage of difficult words is above 5%, then add 3.6365 to the raw score to get the adjusted score, otherwise the adjusted score is equal to the raw score.

|     Score    |                                 Notes                                |
|:------------:|:--------------------------------------------------------------------:|
| 4.9 or lower | easily understood by an average 4th-grade student or lower           |
| 5.0–5.9      | easily understood by an average 5th or 6th-grade student             |
| 6.0–6.9      | easily understood by an average 7th or 8th-grade student             |
| 7.0–7.9      | easily understood by an average 9th or 10th-grade student            |
| 8.0–8.9      | easily understood by an average 11th or 12th-grade student           |
| 9.0–9.9      | easily understood by an average 13th to 15th-grade (college) student |
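
A minimal sketch of the raw and adjusted Dale–Chall scores, assuming the number of difficult words has already been determined against the word list. The helper `dale_chall_from_counts` and the counts are hypothetical; `textstat.dale_chall_readability_score` uses its own word list and tokenization.

```python
def dale_chall_from_counts(difficult_words, words, sentences):
    """Adjusted Dale-Chall score from raw counts (illustrative helper)."""
    if words == 0 or sentences == 0:
        return 0.0  # assumption: treat empty text as score 0
    percent_difficult = difficult_words / words * 100
    raw_score = 0.1579 * percent_difficult + 0.0496 * (words / sentences)
    # Add the 3.6365 adjustment only when difficult words exceed 5% of the text
    if percent_difficult > 5:
        return raw_score + 3.6365
    return raw_score


# Hypothetical passage: 100 words, 5 sentences, 10 difficult words (10% > 5%)
# 0.1579 * 10 + 0.0496 * 20 + 3.6365 = 6.2075
example_adjusted = dale_chall_from_counts(difficult_words=10, words=100, sentences=5)

# Same passage with only 4 difficult words (4% <= 5%), so no adjustment
# 0.1579 * 4 + 0.0496 * 20 = 1.6236
example_raw = dale_chall_from_counts(difficult_words=4, words=100, sentences=5)
```

The first example falls in the 6.0–6.9 band of the table above, and the second falls in the lowest band.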

In [None]:
dcr_sincere = np.array(train_sincere_df["question_text"].progress_apply(textstat.dale_chall_readability_score))
dcr_insincere = np.array(train_insincere_df["question_text"].progress_apply(textstat.dale_chall_readability_score))
plot_readability(dcr_sincere,dcr_insincere,"Dale-Chall Readability Score",1)

**Readability Consensus based upon all the above tests**

`textstat.text_standard` returns the estimated school grade level required to understand the text, based on all of the above tests.

In [None]:
def consensus_all(text):
    # float_output=True returns the consensus grade as a number rather than a descriptive string
    return textstat.text_standard(text, float_output=True)

con_sincere = np.array(train_sincere_df["question_text"].progress_apply(consensus_all))
con_insincere = np.array(train_insincere_df["question_text"].progress_apply(consensus_all))
plot_readability(con_sincere,con_insincere,"Readability Consensus based upon all the above tests",2)