<h1> Abstract </h1>
This notebook presents a model for the Quora Insincere Questions Classification competition. I use a supervised learning model based on Logistic Regression. My late submission to the competition scored approximately 0.54.

In [None]:
# Import needed libraries: Numpy, Pandas, Matplotlib, Seaborn
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib as mp
import matplotlib.pyplot as plt
import seaborn as sns

<h1> Introduction </h1>
    <h2> Problem Description </h2>
        
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions - those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. More scalable methods could be developed to detect toxic and misleading content.


<h2> Data Description </h2>
In this competition the model should be able to detect whether a question asked on Quora is sincere or not. An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

* Has a non-neutral tone
    * Has an exaggerated tone to underscore a point about a group of people
    * Is rhetorical and meant to imply a statement about a group of people
* Is disparaging or inflammatory
    * Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
    * Makes disparaging attacks/insults against a specific person or group of people
    * Based on an outlandish premise about a group of people
    * Disparages against a characteristic that is not fixable and not measurable
* Isn't grounded in reality
    * Based on false information, or contains absurd assumptions
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere (target = 1) or not (target = 0).

In [None]:
# Read data and show the first 5 rows:
data_raw = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/train.csv')
data_raw.head()

Let's take a look at the questions in this dataset and try finding out how they have been classified as sincere and insincere.

In [None]:
insincere_questions = data_raw[data_raw['target'] == 1].question_text
sincere_questions = data_raw[data_raw['target'] == 0].question_text

In [None]:
# Insincere questions example
insincere_questions.sample(3, random_state=1).values

In [None]:
# Sincere questions example
sincere_questions.sample(3, random_state=1).values

<h2> Related Works </h2>

* Improve your Score with some Text Preprocessing (v1 and v2) by @theoviel: 
https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing |
https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2
* More Text Cleaning To Increase Word Coverage by @sunnymarkliu:
https://www.kaggle.com/sunnymarkliu/more-text-cleaning-to-increase-word-coverage
* Baseline Model: Logistic Regression by @saket7788:
https://www.kaggle.com/saket7788/baseline-model-logistic-regression

I have also looked into other approaches such as GRU and LSTM, but have not applied them to this challenge.

<h1> Data Analysis and Visualization </h1>

<h2> Raw Data Analysis </h2>

The dataset is clearly imbalanced: the vast majority of questions are sincere, and only a small fraction are insincere.

In [None]:
# Count the values of sincere and insincere questions labeled
values = data_raw.target.value_counts()
print(values)

# Calculate the percentage of sincere and insincere questions labeled
sincere_q_pc = values[0]/values.sum()*100
insincere_q_pc = values[1]/values.sum()*100
print('\n{:.2f}% of questions are sincere while {:.2f}% are insincere'.format(sincere_q_pc, insincere_q_pc))

In [None]:
# Draw a graph to illustrate sincere and insincere questions
names = ['Sincere', 'Insincere']

plt.bar(names, values)
plt.suptitle('Number of Sincere and Insincere Questions')
plt.show()
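
Why the imbalance matters can be sketched with a toy example. The 94/6 split below is made up for illustration (the real ratio is the one printed by `value_counts()` above), and the "model" is a hypothetical baseline that always predicts the majority class.

```python
# Toy labels with a made-up 94/6 split (0 = sincere, 1 = insincere).
labels = [0] * 94 + [1] * 6

# A baseline "model" that always predicts the majority class.
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

# Number of insincere questions the baseline actually catches.
caught = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)

print(f"accuracy = {accuracy:.2f}, insincere questions caught = {caught}")
```

The baseline looks accurate while catching no insincere question at all, which is why accuracy alone is a weak yardstick on this kind of data.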

To find the most frequently occurring words in the questions, I built word clouds from random samples of 1000 insincere and 1000 sincere questions.

In [None]:
# Import the wordcloud library
from collections import Counter
from wordcloud import WordCloud, ImageColorGenerator

# Split text into a dictionary of unique words and their frequencies
def word_freq_dict(text):
    # Convert text into word list
    wordList = text.split()
    # Count every word in a single pass
    # (calling list.count() for each word would be quadratic in the text length)
    wordFreqDict = dict(Counter(wordList))
    return wordFreqDict

In [None]:
# Plot a wordcloud from a word frequency dictionary.
# Note: this relies on a module-level `wordcloud` object (a configured
# WordCloud instance) being defined before the function is called.
def word_cloud_from_frequency(word_freq, title, figure_size=(10,6)):
    wordcloud.generate_from_frequencies(word_freq)
    plt.figure(figsize=figure_size)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title)
    plt.show()

In [None]:
# Wordcloud of a random sample of 1000 insincere questions
insincere_questions = data_raw.question_text[data_raw['target'] == 1]
insincere_sample = " ".join(insincere_questions.sample(1000, random_state=1).values)
insincere_word_freq = word_freq_dict(insincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Reds',
    background_color='white')

word_cloud_from_frequency(insincere_word_freq, "Most Frequent Words in a sample of 1000 raw questions flagged insincere") 

In [None]:
# Wordcloud of a random sample of 1000 sincere questions
sincere_questions = data_raw.question_text[data_raw['target'] == 0]
sincere_sample = " ".join(sincere_questions.sample(1000, random_state=1).values)
sincere_word_freq = word_freq_dict(sincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Greens',
    background_color='white')

word_cloud_from_frequency(sincere_word_freq, "Most Frequent Words in a sample of 1000 raw questions flagged sincere") 

As expected, the word clouds are dominated by common words such as 'what', 'is', 'with', and 'are', which carry little information for the model; these common words (stopwords) therefore need to be filtered out.
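
The effect of stop-word filtering on word frequencies can be sketched on a single made-up sentence. The stop list here is a tiny hypothetical one chosen for illustration; the notebook itself uses NLTK's English stop-word list.

```python
from collections import Counter

# Hypothetical mini stop list, for illustration only.
toy_stopwords = {"what", "is", "the", "are", "with", "a", "of"}

text = "what is the best way to learn python and what are the best books"
tokens = text.split()

# Frequencies before and after dropping stop words.
raw_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in toy_stopwords)

print(raw_counts.most_common(3))
print(filtered_counts.most_common(3))
```

Before filtering, function words such as "what" and "the" tie for the top spot; afterwards the content word "best" stands alone at the top.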

<h2> Data Preprocessing </h2>

Preprocessing is a key step in every natural language processing problem, as it transforms raw data into a form that a machine can easily interpret.

As mentioned above, this step has to remove all the stopwords. Moreover, since the input is raw text from a website, it can contain noise that harms machine learning performance, such as special characters, spelling mistakes, and spacing errors.

In [None]:
import nltk
import sys
import spacy

#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string

To standardize the data and reduce noise, preprocessing is applied in the following steps:

• Step 1: Lowercase all characters.

• Step 2: Split the text into a list of words.

• Step 3: Remove punctuation (apostrophes are kept).

• Step 4: Remove all stop words.

• Step 5: Stem each word, i.e. reduce it to its base form by removing affixes.

• Step 6: Lemmatize the stemmed text.

The Natural Language Toolkit (NLTK) is used for stop-word removal and stemming thanks to its popularity and simplicity; spaCy is used for the final lemmatization step.

In [None]:
nlp = spacy.load("en_core_web_sm", disable=['parser','ner'])
stop = set(stopwords.words('english'))
punc = set(string.punctuation)
# Create the stemmer once instead of on every call
porter = PorterStemmer()

def clean_text(text):
    # Convert the text into lowercase
    text = text.lower()
    # Split into list
    wordList = text.split()
    # Remove punctuation, keeping apostrophes
    wordList = ["".join(x for x in word if x == "'" or x not in punc) for word in wordList]
    # Remove stop words
    wordList = [word for word in wordList if word not in stop]
    # Stem
    wordList = [porter.stem(word) for word in wordList]

    # Lemmatize the stemmed sentence with spaCy
    reformed_sentence = " ".join(wordList)
    doc = nlp(reformed_sentence)
    return " ".join([token.lemma_ for token in doc])

Let's check whether the proposed preprocessing works:

In [None]:
question = data_raw.question_text.sample(1, random_state=1).values[0]
question

In [None]:
clean_text(question)

Now we will clean every row of text data by running this function:

In [None]:
data_raw['clean_text'] = data_raw['question_text'].astype('str').apply(clean_text)

In [None]:
data_raw.clean_text.head()

Build word clouds to visualize the questions after cleaning, using the same code with slight adjustments:

In [None]:
# Wordcloud of a random sample of 1000 cleaned insincere questions
clean_insincere_questions = data_raw.clean_text[data_raw['target'] == 1]
clean_insincere_sample = " ".join(clean_insincere_questions.sample(1000, random_state=1).values)
clean_insincere_word_freq = word_freq_dict(clean_insincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Reds',
    background_color='white')

word_cloud_from_frequency(clean_insincere_word_freq, "Most Frequent Words in a sample of 1000 cleaned questions flagged insincere") 

In [None]:
# Wordcloud of a random sample of 1000 clean sincere questions
clean_sincere_questions = data_raw.clean_text[data_raw['target'] == 0]
clean_sincere_sample = " ".join(clean_sincere_questions.sample(1000, random_state=1).values)
clean_sincere_word_freq = word_freq_dict(clean_sincere_sample)
wordcloud = WordCloud(width= 5000,
    height=3000,
    max_words=200,
    colormap='Greens',
    background_color='white')

word_cloud_from_frequency(clean_sincere_word_freq, "Most Frequent Words in a sample of 1000 cleaned questions flagged sincere") 

<h1>TextToVec</h1>

We need to transform the text into a matrix of numeric feature vectors.

<h2> Bag of Words </h2>

Bag of words is a basic model used in natural language processing. It is called a bag of words because the order of the words in the document is discarded; it records only which words occur in the document and how many times.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow_converter = CountVectorizer()


In [None]:
sample_question_text = data_raw['clean_text'].sample(1, random_state= 1).values
sample_question_text

In [None]:
sample_count_vectorized_data = bow_converter.fit_transform(sample_question_text)
sample_count_vectorized_data.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
count_vectorized_data_feature_names = bow_converter.get_feature_names_out()
count_vectorized_data_feature_names
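
Fitting on a single question hides what the representation looks like across documents. The sketch below builds a small bag-of-words matrix by hand on three made-up documents. It is a pure-Python approximation of the idea, not `CountVectorizer` itself, and it ignores that class's tokenization rules.

```python
from collections import Counter

# Three made-up, already-cleaned documents.
docs = ["dog bite man", "man bite dog dog", "cat sleep"]

# Vocabulary: sorted unique tokens, one column per token.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one count per vocabulary entry.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The first two rows differ only in the count for "dog": the word order that distinguishes "dog bite man" from "man bite dog" is lost.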

<h2> TF-IDF </h2>

TF-IDF stands for Term Frequency-Inverse Document Frequency and measures how important a word is to a document within a corpus. It combines two concepts: Term Frequency (TF) and Inverse Document Frequency (IDF).

Term Frequency measures how often a word appears in a document. Because documents differ in length, a word is likely to occur more often in a long sentence than in a short one, so the raw count is divided by the document length:

TF = (Number of times the word appears in the document) / (Total number of words in the document)

Inverse Document Frequency measures how informative a word is. It is based on the idea that words appearing in fewer documents are more informative:

IDF = log10(Number of documents / Number of documents containing the word)

TF-IDF is the product of TF and IDF. It lowers the weight of common words that appear in many documents.
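
The two formulas can be worked through by hand on a made-up three-document corpus. This is a minimal sketch of the textbook definitions given above; the helper names `tf`, `idf`, and `tf_idf` are hypothetical. scikit-learn's `TfidfVectorizer` will produce different numbers, because by default it uses a smoothed natural-log IDF and L2-normalizes each row.

```python
import math

# Made-up corpus: each document is a list of tokens.
docs = [
    "how do i learn python".split(),
    "how do i learn to cook".split(),
    "why is the sky blue".split(),
]

def tf(word, doc):
    # Share of the document's tokens that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Assumes `word` occurs in at least one document (otherwise division by zero).
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log10(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

for word in ("how", "python"):
    print(word, round(tf_idf(word, docs[0], docs), 4))
```

Both words have the same term frequency in the first document, but "python" appears in only one document while "how" appears in two, so "python" receives the higher weight.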

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_converter = TfidfVectorizer(ngram_range=(1,1))

In [None]:
sample_tfidf_vectorized_data = tfidf_converter.fit_transform(sample_question_text)
sample_tfidf_vectorized_data.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_word_feature_names = tfidf_converter.get_feature_names_out()

In [None]:
tfidf_word_feature_names

In [None]:
len(tfidf_word_feature_names)

<h1> Model </h1>

Logistic Regression (LR) is one of the simplest ML algorithms: it is easy to implement and interpret, and very efficient to train. Training an LR model does not require much computation. LR is also less prone to overfitting on low-dimensional datasets, and on higher-dimensional datasets regularization can be used to avoid overfitting. Moreover, the model can be updated with new data using stochastic gradient descent (SGD).

LR also has limitations. It only handles linearly separable data, so non-linear problems require a feature transformation first. The features used for training must be extracted carefully; otherwise noise can make the probabilistic predictions incorrect. LR needs a large dataset with sufficient training examples for every category it has to identify. Lastly, training examples should be independent of one another, because related examples cause the model to give them disproportionate weight.
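
The SGD update mentioned above can be sketched in a few lines of plain Python. This is an illustrative toy on made-up, linearly separable data, not the solver scikit-learn's `LogisticRegression` uses; the names `sgd_step` and `predict`, the learning rate, and the optional L2 term are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, bias, x, y, lr=0.1, l2=0.0):
    """One SGD update of the log-loss for a single example (x, y)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = sigmoid(z) - y  # gradient of the log-loss with respect to z
    new_weights = [w - lr * (error * xi + l2 * w) for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias

# Made-up data: the label is 1 when the second feature dominates.
data = [([1.0, 0.0], 0), ([2.0, 0.5], 0), ([0.0, 1.0], 1), ([0.5, 2.0], 1)]

random.seed(0)
weights, bias = [0.0, 0.0], 0.0
for epoch in range(200):
    random.shuffle(data)
    for x, y in data:
        weights, bias = sgd_step(weights, bias, x, y)

def predict(x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(sigmoid(z) >= 0.5)

print([predict(x) for x, _ in data], [y for _, y in data])
```

Because each update needs only one example, a new observation can be folded into the model with a single extra call to `sgd_step` rather than a full retrain.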

<h2> Pipeline with LR and Count Vectorizer </h2>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

count_vectorizer = CountVectorizer()
model = LogisticRegression(C=1, random_state=0, max_iter=1000)

vectorize_logit_pipeline = Pipeline([
    ('count_vectorizer', count_vectorizer),
    ('logit', model)
])

Define input and target variables

In [None]:
# Input variable
X = data_raw['clean_text']
# Target variable
y = data_raw['target']

Split training dataset into train and test sets

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3)

Train the model using the feature and target training sets

In [None]:
vectorize_logit_pipeline.fit(train_X, train_y)

Get the predictions from the model

In [None]:
predictions = vectorize_logit_pipeline.predict(test_X)

Check the accuracy score

In [None]:
accuracy_score(test_y, predictions)

Check the f1 score

In [None]:
from sklearn.metrics import f1_score
f1_score(test_y, predictions)

Plot the confusion matrix

In [None]:
confusion_matrix_logit_cv = confusion_matrix(test_y, predictions)
sns.heatmap(confusion_matrix_logit_cv, annot=True, fmt='d', xticklabels=['sincere', 'insincere'], yticklabels=['sincere', 'insincere'])

In [None]:
from sklearn.metrics import classification_report
print(classification_report(test_y, predictions))
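
How the F1 score relates to the confusion matrix can be sketched with made-up counts. The numbers below are not results from this notebook, and `f1_from_confusion` is a hypothetical helper; the layout follows scikit-learn's `confusion_matrix` for labels `[0, 1]`, with true classes in rows and predicted classes in columns.

```python
def f1_from_confusion(tn, fp, fn, tp):
    """Return (precision, recall, f1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts, laid out as:
# [[tn, fp],
#  [fn, tp]]
tn, fp, fn, tp = 930, 10, 30, 30

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision, recall, f1 = f1_from_confusion(tn, fp, fn, tp)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these counts the accuracy is 0.96 while the F1 score for the insincere class is only 0.60, since half of the insincere questions are missed.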

<h2> Pipeline with LR and TF-IDF Bi-grams Vectorizer </h2>

In [None]:
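tfidf_ngrams_converter = TfidfVectorizer(ngram_range=(1,2))
# Use a fresh estimator so that fitting this pipeline does not overwrite
# the model already fitted inside vectorize_logit_pipeline
tfidf_ngrams_logit_pipeline = Pipeline([
    ('tfidf_vectorizer', tfidf_ngrams_converter),
    ('logit', LogisticRegression(C=1, random_state=0, max_iter=1000))
])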

In [None]:
tfidf_ngrams_logit_pipeline.fit(train_X, train_y)

In [None]:
new_predictions = tfidf_ngrams_logit_pipeline.predict(test_X)

In [None]:
accuracy_score(test_y, new_predictions)

In [None]:
f1_score(test_y, new_predictions)

In [None]:
confusion_matrix_logit_tfidf = confusion_matrix(test_y, new_predictions)
sns.heatmap(confusion_matrix_logit_tfidf, annot=True, fmt='d', xticklabels=['sincere', 'insincere'], yticklabels=['sincere', 'insincere'])

In [None]:
print(classification_report(test_y, new_predictions, target_names=['sincere', 'insincere']))

It can be observed that when bigram word features are included, the LR model gives better accuracy and F1 scores.
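
What `ngram_range=(1,2)` adds to the feature set can be sketched in plain Python. The helper `word_ngrams` is hypothetical and only splits on whitespace, so it approximates the idea rather than reproducing `TfidfVectorizer`'s tokenizer.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams("not good movie"))
# → ['not', 'good', 'movie', 'not good', 'good movie']
```

The bigram "not good" keeps a piece of word order that separate unigrams "not" and "good" cannot express, which is one plausible reason the bigram features help.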

In [None]:
test_data = pd.read_csv('/kaggle/input/quora-insincere-questions-classification/test.csv')
test_data.head()

In [None]:
test_data.info()

In [None]:
test_data['clean_text'] = test_data['question_text'].astype('str').apply(clean_text)

In [None]:
test_data.head()

In [None]:
X_final = test_data['clean_text']

In [None]:
y_final = tfidf_ngrams_logit_pipeline.predict(X_final)

In [None]:
y_final[:5]

In [None]:
test_data['target'] = y_final

In [None]:
result_df = test_data[['qid', 'target']]

In [None]:
# Rename and re-index without inplace operations to avoid modifying a slice of test_data
result_df = result_df.rename(columns={'target': 'prediction'}).set_index('qid')
result_df.head()

In [None]:
result_df.to_csv('submission.csv')
!head submission.csv

<h1> Conclusion and Future Works</h1>

<h2> Conclusion </h2>

This work uses Logistic Regression together with some simple text preprocessing techniques to address the Quora Insincere Questions Classification problem. The model is simple and still has room for further development.
The final score of this notebook is approximately 0.54 (the best score is around 0.7).

<h2> Limitations and Future Works </h2>

The preprocessing methods are usable, but take a long time to process more than 1 million rows. Furthermore, the available word embeddings (e.g. Google News, GloVe, wiki-news) have not been used yet. Last but not least, more effective models that achieve better scores should be trained and tested on this problem.

These weak points indicate the direction of my further research, which could be presented in an upgraded notebook:
* Develop more effective and less time-consuming preprocessing methods: expand acronyms to their full form based on context, trim unnecessary spaces caused by spacing errors, and remove special characters.
* Apply word embeddings
* Attempt different models: LSTM, GRU, etc.
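
A lighter preprocessing step along the lines of the first item could look like the sketch below. It is only an assumption of how such a function might be written: `fast_clean` and the `ACRONYMS` table are hypothetical, and a plain lookup table does not resolve acronyms from context.

```python
import re

# Hypothetical acronym table, for illustration only.
ACRONYMS = {"ml": "machine learning", "nlp": "natural language processing"}

# Compile the patterns once so they are reused for every row.
_SPECIAL = re.compile(r"[^a-z0-9'\s]")  # anything but letters, digits, apostrophes, whitespace
_SPACES = re.compile(r"\s+")

def fast_clean(text):
    text = text.lower()
    text = _SPECIAL.sub(" ", text)           # remove special characters
    text = _SPACES.sub(" ", text).strip()    # collapse repeated spaces
    # Expand known acronyms token by token.
    return " ".join(ACRONYMS.get(tok, tok) for tok in text.split())

print(fast_clean("Is  NLP harder than ML???"))
# → is natural language processing harder than machine learning
```

Precompiled regular expressions avoid running a full NLP pipeline for each row, which is where most of the current preprocessing time is likely spent.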