# Quora Insincere Questions Classification

![title](https://i.ibb.co/GpDpYcW/sincere.jpg)

Side note: this is the first of two parts; see the conclusion for a preview of the next one.

## Context
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.


## Goal
Predict whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:
- Has a non-neutral tone
- Is disparaging or inflammatory
- Isn't grounded in reality
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

## Dataset
The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.


Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of the combination of sampling procedures and sanitization measures that have been applied to the final dataset.

---

# Exploratory Data Analysis

Libraries import

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import numpy as np 
import pandas as pd 

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from string import punctuation 

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, classification_report

In [None]:
from xgboost import XGBClassifier
import lightgbm as lgb

__File descriptions__
- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - A sample submission in the correct format
- embeddings/ - (see below)

In [None]:
df = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
df.head()

__Data fields__
- qid - unique question identifier
- question_text - Quora question text
- target - a question labeled "insincere" has a value of 1, otherwise 0

In [None]:
pd.read_csv("../input/quora-insincere-questions-classification/test.csv").head()

## Basic information and analysis of the target

In [None]:
df.info()

No NaN values and no duplicated rows:

In [None]:
df.duplicated().sum()

Target analysis

In [None]:
df.target.value_counts()

In [None]:
df.target.describe()

In [None]:
plt.figure(figsize=(5, 4))
sns.countplot(x='target', data=df)
plt.title('Distribution of questions by sincerity (insincere = 1)');

In [None]:
print(f'{df.target.sum() / df.shape[0] * 100:.1f}% of the questions are insincere, which makes the dataset highly imbalanced.')

## Word clouds

Word clouds (also known as text clouds or tag clouds) work in a simple way: the more often a word appears in a source of textual data (such as a speech, blog post, or database), the bigger and bolder it appears in the word cloud.

Data scientists generally don't think much of word clouds, in large part because the placement of the words doesn't mean anything other than "here's some space where I was able to fit a word". Still, clouds come in handy for a first look at the most common words.
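
To make the "more frequent means bigger" idea concrete, here is a minimal sketch of the word counting that underlies a word cloud. The three questions are made up for illustration (they are not taken from the dataset), and the real `WordCloud` class does more than this (stop-word filtering, layout, font scaling).

```python
from collections import Counter

# Toy corpus: hypothetical questions, not taken from the dataset
questions = [
    "How do I learn Python quickly?",
    "Is Python better than Java?",
    "How do I cook rice?",
]

# Lowercase, strip the question marks, and split on whitespace
words = " ".join(questions).lower().replace("?", "").split()
word_counts = Counter(words)

# The words with the highest counts would be drawn largest in a cloud
top_words = word_counts.most_common(3)
print(top_words)
```

Notice that very common function words tend to dominate such raw counts, which is one reason stop words are usually filtered out first.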

In [None]:
from wordcloud import WordCloud, STOPWORDS
# named wc_stopwords so that it does not shadow nltk's `stopwords` module imported above
wc_stopwords = set(STOPWORDS)

In [None]:
print('Word cloud image generated from sincere questions')
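# join the full question texts: str(Series) would only give the truncated printout of the column
sincere_text = " ".join(df.loc[df["target"] == 0, "question_text"])
sincere_wordcloud = WordCloud(width=600, height=400, background_color='white', min_font_size=10).generate(sincere_text)
# Sincere word cloud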
plt.figure(figsize=(15,6), facecolor=None)
plt.imshow(sincere_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show();

In [None]:
print('Word cloud image generated from INsincere questions')
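# join the full question texts: str(Series) would only give the truncated printout of the column
insincere_text = " ".join(df.loc[df["target"] == 1, "question_text"])
insincere_wordcloud = WordCloud(width=600, height=400, background_color='white', min_font_size=10).generate(insincere_text)
# Insincere word cloud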
plt.figure(figsize=(15,6), facecolor=None)
plt.imshow(insincere_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show();

## Statistics from the question texts

The process of converting data to something a computer can understand is referred to as pre-processing. One of the major forms of pre-processing is to filter out useless data. In natural language processing, such useless words are referred to as stop words.

In [None]:
# if needed
# nltk.download('stopwords')

Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

We would not want these words taking up space in our database, or taking up valuable processing time. We can remove them easily by storing a list of the words we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships lists of stop words for 16 different languages. You can find them in the nltk_data directory.
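
As a quick illustration of the filtering described above, here is a minimal sketch that uses a tiny hand-written stop list. Both `toy_stop_words` and `remove_stop_words` are hypothetical names introduced for this example; the notebook itself relies on NLTK's much longer English list.

```python
# Tiny hand-written stop list, for illustration only
# (NLTK's English list is much longer)
toy_stop_words = {"the", "a", "an", "in", "is", "of", "what"}

def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words` (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop_words]

filtered = remove_stop_words("What is the capital of France", toy_stop_words)
print(filtered)
```

The comparison is done in lowercase because stop-word lists are usually stored in lowercase, while questions typically start with a capitalized word.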

In [None]:
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words

In [None]:
def create_features(df_):
    """Retrieve from the text column the nb of : words, unique words, characters, stopwords,
    punctuations, upper/lower case char, title..."""
    
    df_["nb_words"] = df_["question_text"].apply(lambda x: len(str(x).split()))
    df_["nb_unique_words"] = df_["question_text"].apply(lambda x: len(set(str(x).split())))
    # use the function argument df_, not the global df
    df_["nb_chars"] = df_["question_text"].apply(lambda x: len(str(x)))
    df_["nb_stopwords"] = df_["question_text"].apply(lambda x : len([nw for nw in str(x).split() if nw.lower() in stop_words]))
    # loop variable renamed from `np` so that it cannot be confused with numpy
    df_["nb_punctuation"] = df_["question_text"].apply(lambda x : len([ch for ch in str(x) if ch in punctuation]))
    df_["nb_uppercase"] = df_["question_text"].apply(lambda x : len([nu for nu in str(x).split() if nu.isupper()]))
    df_["nb_lowercase"] = df_["question_text"].apply(lambda x : len([nl for nl in str(x).split() if nl.islower()]))
    df_["nb_title"] = df_["question_text"].apply(lambda x : len([nl for nl in str(x).split() if nl.istitle()]))
    return df_

In [None]:
df = create_features(df)
df.sample(2)

Let's take a sample - because the data set is quite huge when run locally on a single node - and visualize pair plots :

In [None]:
num_feat = ['nb_words', 'nb_unique_words', 'nb_chars', 'nb_stopwords', \
            'nb_punctuation', 'nb_uppercase', 'nb_lowercase', 'nb_title', 'target'] 
# side note : remove target if needed later

df_sample = df[num_feat].sample(n=round(df.shape[0]/6), random_state=42)

plt.figure(figsize=(16,10))
sns.pairplot(data=df_sample, hue='target')
plt.show()

Basic stats comparison :

In [None]:
df_sample[df_sample['target'] == 0].describe()

In [None]:
df_sample[df_sample['target'] == 1].describe()

Generally speaking, insincere questions are written with more words.

Now let's focus on the distributions, because the peaks differ between sincere and insincere questions.

In [None]:
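plt.figure(figsize=(10, 10))

# one KDE panel per feature; 'target' is skipped because it is constant within each class
features = [c for c in num_feat if c != 'target']
for i, c in enumerate(features):
    plt.subplot(3, 3, i + 1)
    sns.kdeplot(df_sample[df_sample['target'] == 0][c], shade=True)   # sincere
    sns.kdeplot(df_sample[df_sample['target'] == 1][c], shade=False)  # insincere
    plt.title(c)

plt.tight_layout()
plt.show()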

The distributions lead to the same conclusion as the summary statistics.

Unsurprisingly, many of these indicators are highly correlated with each other, but not with the target:

In [None]:
sns.set(style="white")

# Compute the correlation matrix
corr = df_sample[num_feat].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(7, 6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .5})

What are the most frequent words for each type of question ?

In [None]:
class Vocabulary(object):
    # credits : Shankar G see https://www.kaggle.com/kaosmonkey/visualize-sincere-vs-insincere-words

    def __init__(self):
        self.vocab = {}
        self.STOPWORDS = set(stopwords.words('english'))

    def build_vocab(self, lines):
        """Count the occurrences of every non-stop word found in `lines`."""
        for line in lines:
            for word in line.split(' '):
                word = word.lower()
                if word in self.STOPWORDS:
                    continue
                if word not in self.vocab:
                    self.vocab[word] = 0
                self.vocab[word] += 1

    def generate_ngrams(self, text, n_gram=1):
        """Return the n-grams of `text`, ignoring empty tokens and stop words."""
        tokens = [token for token in text.lower().split(" ")
                  if token != "" and token not in self.STOPWORDS]
        ngrams = zip(*[tokens[i:] for i in range(n_gram)])
        return [" ".join(ngram) for ngram in ngrams]

    @staticmethod
    def horizontal_bar_chart(df, color):
        """Build a horizontal Plotly bar trace from the 'word' and 'wordcount' columns."""
        # imported here because Plotly is only needed by this helper (unused below)
        import plotly.graph_objects as go
        trace = go.Bar(
            y=df["word"].values[::-1],
            x=df["wordcount"].values[::-1],
            showlegend=False,
            orientation='h',
            marker=dict(color=color),
        )
        return trace

In [None]:
sincere_vocab = Vocabulary()
sincere_vocab.build_vocab(df[df['target'] == 0]['question_text'])
sincere_vocabulary = sorted(sincere_vocab.vocab.items(), reverse=True, key=lambda kv: kv[1])
    
df_sincere_vocab = pd.DataFrame(sincere_vocabulary, columns=['word_sincere', 'frequency'])
sns.barplot(y='word_sincere', x='frequency', data=df_sincere_vocab[:20])

In [None]:
insincere_vocab = Vocabulary()
insincere_vocab.build_vocab(df[df['target'] == 1]['question_text'])
insincere_vocabulary = sorted(insincere_vocab.vocab.items(), reverse=True, key=lambda kv: kv[1])

df_insincere_vocab = pd.DataFrame(insincere_vocabulary, columns=['word_insincere', 'frequency'])
sns.barplot(y='word_insincere', x='frequency', data=df_insincere_vocab[:20])

As we can clearly see, certain words (swear words, discriminatory terms, names of political figures, etc.) show up a lot in insincere questions.

---

# Text processing & model training

## Metric : F-score

The most appropriate metric here is the F1-score. Explanation from [Wikipedia](https://en.wikipedia.org/wiki/F1_score):

*"In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0."*

![title](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/350px-Precisionrecall.svg.png)
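
The definitions quoted above can be turned into a few lines of arithmetic. This is a minimal sketch with made-up counts of true positives, false positives and false negatives; `precision_recall_f1` is a hypothetical helper, and in the rest of the notebook the score is computed with scikit-learn's `f1_score`.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 60 insincere questions caught, 40 false alarms, 20 missed
p, r, f1 = precision_recall_f1(tp=60, fp=40, fn=20)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

True negatives do not appear anywhere in the formula, which is why F1 is a better fit than accuracy when the negative class dominates.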

We'll also print scikit-learn's classification report, whose per-class precision and recall summarize the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).

In [None]:
def get_fscore_matrix(fitted_clf, model_name):
    """Print the classification report and the F1-score on the current test split.

    Relies on the global X_test / y_test produced by the latest train_test_split.
    """
    print(model_name, ' :')

    # predict the classes of the held-out split for the classification report
    y_pred = fitted_clf.predict(X_test)
    print(classification_report(y_test, y_pred), '\n')

    # F1-score of the positive (insincere) class
    print(f'F1-score = {f1_score(y_test, y_pred):.2f}')

## Text processing

In [None]:
# if needed the first time  
# import nltk
# nltk.download('punkt')

__Process:__
- tokenization
- keeping only alphabetic tokens (this drops punctuation and numbers)
- removing stop words
- stemming or lemmatization

[source : blog.bitext.com](https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/)

__Tokenization :__
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
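
Here is a minimal sketch of that idea using a regular expression. `simple_tokenize` is a hypothetical helper written for this example; it is not how NLTK's `word_tokenize` (used below) works, which among other things keeps punctuation marks as separate tokens.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens, dropping whitespace and most punctuation."""
    # keep runs of letters, digits and apostrophes; everything else is a separator
    return re.findall(r"[A-Za-z0-9']+", text)

tokens = simple_tokenize("Why don't cats like water?")
print(tokens)
```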

__Stemming vs. lemmatization :__
The aim of both processes is the same: reducing the inflectional forms of each word into a common base or root. 

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful in some occasions, but not always, and that is why we affirm that this approach presents some limitations.

Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link the form back to its lemma. 
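
The difference can be sketched with two toy functions. Everything here is a deliberately crude assumption made for illustration: `toy_stem` strips a handful of suffixes and is far simpler than the Porter stemmer used below, and `toy_lemmas` is a three-entry stand-in for the detailed dictionaries a real lemmatizer needs.

```python
def toy_stem(word):
    """Crude stemming: strip the first matching suffix, if the word is long enough."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemma
toy_lemmas = {"studies": "study", "running": "run", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup: return the lemma if known, else the word unchanged."""
    return toy_lemmas.get(word, word)

words = ["studies", "running", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stems are not necessarily real words, whereas the lemmas are; the price of lemmatization is that it needs a dictionary covering the vocabulary.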

In [None]:
df = df[['question_text', 'target']].copy()  # copy so that new columns are not added to a slice

def text_processing(local_df):
    """Add a 'txt_processed' column: stemmed alphabetic tokens, without stop words."""
    stemmer = PorterStemmer()
    # tokenize each question
    local_df['txt_processed'] = local_df['question_text'].apply(word_tokenize)
    # keep alphabetic tokens only (drops punctuation and numbers)
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [item for item in x if item.isalpha()])
    # remove stop words; compare in lowercase because the stop-word list is lowercase
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [item for item in x if item.lower() not in stop_words])
    # reduce each remaining token to its stem
    local_df['txt_processed'] = local_df['txt_processed'].apply(lambda x: [stemmer.stem(item) for item in x])
    return local_df

In [None]:
df = text_processing(df)
df.tail(2)

## First method : text similarity using TF-IDF
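
As background for this section, here is a minimal sketch of how a TF-IDF weight can be computed by hand on three made-up tokenized documents. It uses the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which to my understanding is the default documented for `TfidfVectorizer`; the L2 normalization that scikit-learn applies afterwards is left out, so the numbers will not match its output exactly.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical, not taken from the dataset)
docs = [
    ["quora", "question", "answer"],
    ["question", "question", "troll"],
    ["answer", "knowledge"],
]

n_docs = len(docs)
# document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} for one document, with a smoothed idf."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(docs[1])
print(weights)
```

Per occurrence, a rare term such as "troll" gets a higher weight than "question", which appears in two of the three documents.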

In [None]:
vectorizer = TfidfVectorizer(lowercase=False, analyzer=lambda x: x, min_df=0.01, max_df=0.999)
# min_df & max_df param added for less memory usage

tf_idf = vectorizer.fit_transform(df['txt_processed']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(tf_idf, columns=vectorizer.get_feature_names_out()).head()

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(tf_idf, df['target'], test_size=0.2, random_state=42)

XGBoost Classifier without weights

In [None]:
model = XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
get_fscore_matrix(model, 'XGB Clf withOUT weights')

XGBoost Classifier with weights

In [None]:
# usual heuristic for scale_pos_weight: number of negatives / number of positives
ratio = (len(y_train) - y_train.sum()) / y_train.sum()
ratio
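
For reference, a commonly suggested starting value for `scale_pos_weight` is the number of negative samples divided by the number of positive samples. The sketch below computes it on made-up labels, together with the inverse-frequency weights that scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * count)`. The label counts are an assumption chosen for illustration, not the dataset's actual figures.

```python
# Toy labels: 1 = insincere (minority class), 0 = sincere
toy_labels = [0] * 94 + [1] * 6

n_pos = sum(toy_labels)
n_neg = len(toy_labels) - n_pos

# negatives / positives: candidate value for XGBoost's scale_pos_weight
neg_pos_ratio = n_neg / n_pos

# inverse-frequency weights, following the "balanced" formula with 2 classes
class_weights = {
    0: len(toy_labels) / (2 * n_neg),
    1: len(toy_labels) / (2 * n_pos),
}
print(neg_pos_ratio, class_weights)
```

Both approaches make an error on a minority-class sample cost more than an error on a majority-class sample.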

In [None]:
model = XGBClassifier(objective="binary:logistic", scale_pos_weight=ratio)
model.fit(X_train, y_train)
get_fscore_matrix(model, 'XGB Clf WITH weights')

Now LGBM with weights

In [None]:
model = lgb.LGBMClassifier(n_jobs = -1, class_weight={0:y_train.sum(), 1:len(y_train) - y_train.sum()})
model.fit(X_train, y_train)
get_fscore_matrix(model, 'LGBM weighted')

LogisticRegression

In [None]:
model = LogisticRegression(class_weight={0:y_train.sum(), 1:len(y_train) - y_train.sum()}, C=0.5, max_iter=100, n_jobs=-1)
model.fit(X_train, y_train)
get_fscore_matrix(model, 'LogisticRegression')

## Second approach : a CountVectorizer / Logistic Regression pipeline 

Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
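
To show what "a matrix of token counts" means for a single document, here is a minimal pure-Python sketch that counts unigrams and bigrams. `ngram_counts` is a hypothetical helper, not scikit-learn's implementation: `CountVectorizer` additionally builds a shared vocabulary across all documents, stores the result as a sparse matrix, and by default ignores one-character tokens.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    """Count all word n-grams of `text` with n between n_min and n_max."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # slide a window of n tokens over the text
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("why is the sky blue why")
print(counts)
```

With `ngram_range=(1, 4)` as in the pipeline below, the same idea extends to sequences of up to four words, which makes the vocabulary much larger.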

In [None]:
df['str_processed'] = df['txt_processed'].apply(lambda x: " ".join(x))
df.head(2)

In [None]:
pipeline = Pipeline([("cv", CountVectorizer(analyzer="word", ngram_range=(1,4), max_df=0.9)),
                     ("clf", LogisticRegression(solver="saga", class_weight="balanced", C=0.45, max_iter=250, verbose=1, n_jobs=-1))])

In [None]:
# random_state fixed for reproducibility, as in the first split
X_train, X_test, y_train, y_test = train_test_split(df['str_processed'], df.target, test_size=0.2, stratify=df.target.values, random_state=42)

In [None]:
lr_model = pipeline.fit(X_train, y_train)
lr_model

In [None]:
get_fscore_matrix(lr_model, 'lr_pipe')

# Conclusion, submission and opening

- First we used TF-IDF features, but the results are not convincing: the recall on insincere questions is poor, so this does not seem to be the right way to go.
- Using CountVectorizer with a Logistic Regression instead works better.

Now let's see what score the submission gets.

In [None]:
pd.read_csv("../input/quora-insincere-questions-classification/sample_submission.csv").head(2)

In [None]:
df_test = pd.read_csv("../input/quora-insincere-questions-classification/test.csv", index_col='qid')
df_test.tail(2)

In [None]:
df_test = text_processing(df_test)
df_test['str_processed'] = df_test['txt_processed'].apply(lambda x: " ".join(x))
df_test.head(2)

In [None]:
y_pred_final = lr_model.predict(df_test['str_processed'])
y_pred_final

In [None]:
df_submission = pd.DataFrame({"qid":df_test.index, "prediction":y_pred_final})
df_submission.head()

In [None]:
df_submission.to_csv('submission.csv', index=False)

__CREDITS__: everyone mentioned above, and especially [amokrane](https://github.com/atabti) & [moneynass](https://github.com/moneynass) for their inspiring work! Thanks :)

-> *In the second part, I'll use word embeddings!*

__if you appreciated my work, your vote is warmly welcome !__