# Quora Insincere Questions Classification
[Quora](https://www.quora.com/) is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.
[Competition: Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification)

## Problem statement:
Build a model to classify whether a question asked on Quora is sincere or not.

The goal is to help Quora uphold its policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.

## Overview of the data:

Quora provided a large training set and test set for identifying insincere questions. The training data consists of 1.3 million rows and 3 columns (`qid`, `question_text`, and `target`). The test data consists of 300K rows and 2 columns, since it has no `target`.

## Evaluation Metric:
The metric is the F1 score between the predicted and the observed targets. There are only two classes, and the positive (insincere) class makes up just over 6% of the total. Because the target is this imbalanced, plain accuracy would reward a model that predicts "sincere" for everything. F1 is a better fit because it is computed from both the precision and the recall of the positive class.
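
To make the point about imbalance concrete, here is a minimal sketch that computes precision, recall, and F1 from confusion-matrix counts. The helper name `f1_from_counts` and the toy counts (1,000 questions, 60 of them insincere) are assumptions chosen for illustration; they are not taken from the competition data.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy set: 1000 questions, 60 insincere (about 6%)
n_total, n_pos = 1000, 60

# Baseline that labels every question as sincere: no positives are predicted
tp, fp, fn = 0, 0, n_pos
tn = n_total - n_pos
accuracy_all_sincere = (tp + tn) / n_total
_, _, f1_all_sincere = f1_from_counts(tp, fp, fn)

# A model that catches 40 of the 60 insincere questions with 20 false alarms
precision, recall, f1 = f1_from_counts(tp=40, fp=20, fn=20)

print(f"all-sincere baseline: accuracy={accuracy_all_sincere:.2f}, F1={f1_all_sincere:.2f}")
print(f"toy model: precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

The all-sincere baseline scores high on accuracy but gets an F1 of zero, which is why accuracy alone is not a useful yardstick here.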


# 1. Data loading and exploration:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Load the data from CSV files into a pandas dataframe

In [None]:
train_data = pd.read_csv('../input/quora-insincere-questions-classification/train.csv')
test_data = pd.read_csv('../input/quora-insincere-questions-classification/test.csv')

In [None]:
train_data.head(10)

In [None]:
#test_data.head(10)

In [None]:
print("Train shape : ", train_data.shape)
print("Test shape : ", test_data.shape)

In [None]:
train_data.columns

In [None]:
# 'qid' is only a row identifier, so drop it from both sets
train_data = train_data.drop(['qid'], axis=1)
test_data = test_data.drop(['qid'], axis=1)

In [None]:
test_data.head(10)

In [None]:
train_data.isnull().sum()

In [None]:
test_data.isnull().sum()

# 2. Data Visualization:

## Target Count:

In [None]:
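# Pass the column by keyword; recent seaborn versions treat the first positional argument as `data`
sns.countplot(x='target', data=train_data)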

In [None]:
train_data['target'].value_counts()

## Target distribution:


In [None]:
# Share of each class as a percentage of all training questions
sincere_percent = (train_data['target'] == 0).mean() * 100
insincere_percent = (train_data['target'] == 1).mean() * 100

In [None]:
print(sincere_percent, insincere_percent)

In [None]:
# Data to plot
labels = 'Sincere', 'Insincere'
sizes = [sincere_percent, insincere_percent]
colors = ['lightskyblue', 'lightcoral']
explode = (0.1, 0)  # explode 1st slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=140)

plt.axis('equal')
plt.show()

## Word Frequency plot of sincere & insincere questions:
Let us look at the frequently occurring words in the data by creating a word cloud on the `question_text` column.

In [None]:
import nltk
from wordcloud import WordCloud, STOPWORDS

In [None]:
from collections import defaultdict
train1_data = train_data[train_data["target"]==1]
train0_data = train_data[train_data["target"]==0]

In [None]:
def cloud(text, title, size=(10, 7)):
    """Draw a word cloud from an iterable of strings."""
    # Join all questions into one string and build the cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white',
                          collocations=False
                         ).generate(" ".join(text))

    # Output visualization
    plt.figure(figsize=size, dpi=80)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=25, color='k')
    plt.tight_layout(pad=0)
    plt.show()


cloud(train_data['question_text'], title="Word Cloud of Questions")

In [None]:
cloud(train0_data["question_text"], title="Word Cloud of sincere Questions")

In [None]:
cloud(train1_data["question_text"], title="Word Cloud of insincere Questions")

## Build language model:

In [None]:
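## custom function for ngram generation ##
def generate_ngrams(text, n_gram=1):
    # Lowercase, split on spaces, and drop empty tokens and stop words
    tokens = [token for token in text.lower().split(" ") if token != "" and token not in STOPWORDS]
    # Zip the token list against shifted copies of itself to form n-grams
    ngrams = zip(*[tokens[i:] for i in range(n_gram)])
    return [" ".join(ngram) for ngram in ngrams]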
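
To see what the function above produces, here is a small self-contained sketch of the same zip-based technique. It uses a tiny hypothetical stop-word set (`TOY_STOPWORDS`) in place of `wordcloud.STOPWORDS` so that it runs without extra packages; the name `generate_ngrams_demo` and the sample question are likewise assumptions made for illustration.

```python
# Hypothetical stop-word set for illustration; the notebook itself uses wordcloud.STOPWORDS
TOY_STOPWORDS = {"is", "the", "a", "of", "to", "why", "do"}


def generate_ngrams_demo(text, n_gram=1, stopwords=TOY_STOPWORDS):
    # Lowercase, split on spaces, drop empty tokens and stop words
    tokens = [t for t in text.lower().split(" ") if t != "" and t not in stopwords]
    # tokens[0:], tokens[1:], ... zipped together give consecutive n-token windows
    shifted = [tokens[i:] for i in range(n_gram)]
    return [" ".join(ngram) for ngram in zip(*shifted)]


question = "Why is the sky blue at noon"
unigrams = generate_ngrams_demo(question, 1)
bigrams = generate_ngrams_demo(question, 2)

print(unigrams)  # ['sky', 'blue', 'at', 'noon']
print(bigrams)   # ['sky blue', 'blue at', 'at noon']
```

Note that stop words are removed before the n-grams are formed, so a bigram can join two words that were not adjacent in the original question. Because the text is split only on spaces, punctuation stays attached to its word.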


## Unigram model:

In [None]:
## Build the word-frequency table for sincere questions ##
freq_dict = defaultdict(int)
for sent in train0_data["question_text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted0 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted0.columns = ["word", "wordcount"]

## Build the word-frequency table for insincere questions ##
freq_dict = defaultdict(int)
for sent in train1_data["question_text"]:
    for word in generate_ngrams(sent):
        freq_dict[word] += 1
fd_sorted1 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted1.columns = ["word", "wordcount"]
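
The counting loop above is repeated for each class and each n-gram size. As one possible alternative, the sketch below wraps the same idea in a helper built on `collections.Counter`. The helper name `top_ngrams`, its parameters, and the toy questions are assumptions for illustration. It also splits on any whitespace with `split()`, which differs slightly from the `split(" ")` used in `generate_ngrams`.

```python
from collections import Counter


def top_ngrams(texts, n_gram=1, stopwords=frozenset(), k=40):
    """Return the k most frequent n-grams across texts as (ngram, count) pairs."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stopwords]
        # Zip shifted copies of the token list to form n-token windows
        windows = zip(*[tokens[i:] for i in range(n_gram)])
        counts.update(" ".join(window) for window in windows)
    return counts.most_common(k)


# Hypothetical toy questions, not taken from the competition data
toy_questions = [
    "How do I learn Python",
    "How do I learn pandas fast",
    "Is Python hard to learn",
]

top_unigrams = top_ngrams(toy_questions, n_gram=1, k=3)
top_bigrams = top_ngrams(toy_questions, n_gram=2, k=2)

print(top_unigrams)
print(top_bigrams)
```

The list of pairs can be passed straight to `pd.DataFrame(..., columns=["word", "wordcount"])` to get a table of the same shape as `fd_sorted0` and `fd_sorted1`.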


In [None]:
plt.figure(figsize=(11,10))
plt.title("Frequent words of sincere questions")
fd_sorted0_head = fd_sorted0.head(40)
sns.barplot(x=fd_sorted0_head['wordcount'], y=fd_sorted0_head['word'])

In [None]:
plt.figure(figsize=(11,10))
plt.title("Frequent words of insincere questions")
fd_sorted1_head = fd_sorted1.head(40)
sns.barplot(x=fd_sorted1_head['wordcount'], y=fd_sorted1_head['word'])

## Bigram model:

In [None]:
## Build the bigram-frequency table for sincere questions ##
freq_dict = defaultdict(int)
for sent in train0_data["question_text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted0 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted0.columns = ["word", "wordcount"]

## Build the bigram-frequency table for insincere questions ##
freq_dict = defaultdict(int)
for sent in train1_data["question_text"]:
    for word in generate_ngrams(sent,2):
        freq_dict[word] += 1
fd_sorted1 = pd.DataFrame(sorted(freq_dict.items(), key=lambda x: x[1])[::-1])
fd_sorted1.columns = ["word", "wordcount"]

In [None]:
plt.figure(figsize=(11,10))
plt.title("Frequent bigrams of sincere questions")
fd_sorted0_head = fd_sorted0.head(40)
sns.barplot(x=fd_sorted0_head['wordcount'], y=fd_sorted0_head['word'])

In [None]:
plt.figure(figsize=(11,10))
plt.title("Frequent bigrams of insincere questions")
fd_sorted1_head = fd_sorted1.head(40)
sns.barplot(x=fd_sorted1_head['wordcount'], y=fd_sorted1_head['word'])