# Completely lost? Annotated Questions Preprocessing 

Hi all, this kernel aims to help you guys really understand the nuts and bolts. I've shamelessly copied the code and some of the text from Siddharth Yadav's kernel - [Analyzing Quora for the Insinceres](https://www.kaggle.com/thebrownviking20/analyzing-quora-for-the-insinceres) and tried to explain every bit of the code: what it does and why it's there.

Let me know if you like it by leaving a comment and upvoting.

Have fun.




# About the dataset
An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here's your chance to combat online trolls at scale. Help Quora uphold their policy of “Be Nice, Be Respectful” and continue to be a place for sharing and growing the world’s knowledge.

###  Import Python libraries

In [None]:
# Usual imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from statistics import *
from sklearn.feature_extraction.text import CountVectorizer
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import textstat
import warnings
warnings.filterwarnings('ignore')

# Plotly based imports for visualization
from plotly import tools
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

###  Print the contents of the Kaggle input directory

In [None]:
%matplotlib inline
import os
print(os.listdir("../input"))

### Read Training Data set

In [None]:
quora_train = pd.read_csv("../input/train.csv")
quora_train.head()

## Pre-processing
If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? Which texts are similar to each other?
The process of converting data to something a computer can understand is referred to as pre-processing. 

In this case, we will tokenize the data, get rid of useless words, and standardize upper and lower case based on the LEMMA of each word. (The lemma is the base form of a word; for example, the lemma of "is" is "be". To understand this better, read more [here](https://spacy.io/usage/linguistic-features).)

###   Tokenization - Segmenting text into words, punctuation marks, etc.

Although it is possible to work with the questions as a whole, tokenization allows us to compare individual features, such as similar words. It also allows us to disregard punctuation and stop words.
The code below shows, step by step, how to parse the question text using the spaCy library.
To understand how this works, take a look [here](https://spacy.io/usage/spacy-101).

Here is an example of how the tokenization process works:

In [None]:
sentence="I love it, when David writes Great looking code"
parser = English() # Defines the parser spaCy will use
mytokens = parser(sentence)
print(mytokens)
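
To build intuition for what the parser just did, here is a rough stand-in that uses only Python's `re` module. It is a simplification for illustration, not spaCy's actual tokenizer (which handles many special cases this ignores), and the name `simple_tokenize` is made up for this sketch.

```python
import re

def simple_tokenize(sentence):
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs any single character that is neither a word
    # character nor whitespace (for example a comma).
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = simple_tokenize("I love it, when David writes Great looking code")
print(tokens)
```

Notice that the comma comes out as its own token, separate from "it". That separation is what lets us filter punctuation out later.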

The following code transforms each word in mytokens into the lemma of the word. There is an issue with creating the lemma of pronouns like "I" and "his": spaCy returns "-PRON-" for those words (read more about this [here](https://github.com/explosion/spaCy/issues/962)). As a workaround, we ask for the lemma of the word. If the lemma is "-PRON-", we take the word itself in lowercase instead.

In [None]:
mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
print(mytokens)
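
The one-line list comprehension above packs a loop and an if/else together. The sketch below spells the same logic out as an ordinary loop. So that it runs without spaCy, it uses a tiny hand-written lookup table (`TOY_LEMMAS`) and a helper (`toy_lemma`); both are hypothetical stand-ins for spaCy's `lemma_` attribute, not part of any library.

```python
# Hypothetical stand-in for spaCy's lemmatizer: a tiny lookup table.
TOY_LEMMAS = {"I": "-PRON-", "it": "-PRON-", "writes": "write", "is": "be"}

def toy_lemma(word):
    # Words missing from the table are returned unchanged.
    return TOY_LEMMAS.get(word, word)

words = ["I", "love", "it", "when", "David", "writes"]

cleaned = []
for word in words:
    lemma = toy_lemma(word)
    if lemma != "-PRON-":
        # Normal case: keep the lemma, lowercased and stripped of whitespace.
        cleaned.append(lemma.lower().strip())
    else:
        # Pronoun case: fall back to the original word in lowercase.
        cleaned.append(word.lower())

print(cleaned)
```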

Now, all that's left is to filter out stop words and punctuation. The idea is to use only words that carry semantic meaning in the analysis.

In [None]:
punctuations = string.punctuation  # gets a string of punctuation characters from the string library
stopwords = list(STOP_WORDS) # gets a list of stop words - words that usually carry little meaning in the phrase
mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
print(mytokens)
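
One subtlety worth knowing: `string.punctuation` is a single string, so `word not in punctuations` is a substring test rather than an exact match. The sketch below contrasts it with a test against a set of individual characters. It only illustrates how Python's `in` operator behaves and is not a change to the kernel's code; `punct_set` is a name introduced here.

```python
import string

punctuations = string.punctuation    # one long string of characters
punct_set = set(string.punctuation)  # a set of single characters

# Substring test: any run of characters found inside the string matches.
print("," in punctuations)    # True
print("<=>" in punctuations)  # True, because ";<=>?" is part of the string
print("" in punctuations)     # True, the empty string is inside every string

# Exact-match test: only single punctuation characters match.
print("," in punct_set)       # True
print("<=>" in punct_set)     # False
print("" in punct_set)        # False
```

A side effect of the substring test is that empty tokens (which may appear after `.strip()` is applied to a whitespace-only token) are dropped along with the punctuation.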

After all that, we join the tokens to recreate the question. For that, we will use the join function. 

In [None]:
mytokens = " ".join(mytokens) # glue the tokens in "mytokens" into one string, using a space (" ") to separate the words
print(mytokens)

That's it. We have a processed string that can be fed into a model. But... we did it only once, for one specific phrase. Now it's time to encapsulate all that into a function that can be reused to process the whole questions file.

In [None]:
# SpaCy Parser for questions
punctuations = string.punctuation
stopwords = list(STOP_WORDS)
parser = English()

def spacy_tokenizer(sentence):
    # Takes "sentence" as an argument and returns the processed sentence
    mytokens = parser(sentence)  # split the sentence into tokens
    # Lemmatize each token; for pronouns ("-PRON-") keep the lowercase word instead
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    # Join the remaining tokens back into a single string
    mytokens = " ".join(mytokens)
    return mytokens

The code below uses the function we just defined to process the text of all Quora questions.
The first line calls `tqdm.pandas()`, which adds a `progress_apply` method to pandas objects so that a progress bar is shown while the work runs. It helps to know that the computer is actually doing something and is not just stuck. Read more about tqdm [here](https://pypi.org/project/tqdm/).
The second line creates a pandas Series called sincere_questions that holds the processed result of all questions from the quora_train DataFrame where target=0 (sincere questions).
The third line creates a pandas Series called insincere_questions that holds the processed result of all questions from the quora_train DataFrame where target=1 (insincere questions).
For an annotated picture of this code, see [Understanding Python code - Spacy tokenizer](https://imgur.com/F2TEG96).


In [None]:
tqdm.pandas()
sincere_questions = quora_train["question_text"][quora_train["target"] == 0].progress_apply(spacy_tokenizer)
insincere_questions = quora_train["question_text"][quora_train["target"] == 1].progress_apply(spacy_tokenizer)
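
If the pandas indexing above looks dense, this plain-Python sketch mirrors its logic: pick the rows with a given target, then run a function over each question. The rows and the `toy_clean` function are invented for illustration (the real code uses `quora_train` and `spacy_tokenizer`), and no progress bar is shown.

```python
# Invented toy data standing in for quora_train.
rows = [
    {"question_text": "How do magnets work?", "target": 0},
    {"question_text": "Why is everyone WRONG?", "target": 1},
    {"question_text": "What is a lemma?", "target": 0},
]

def toy_clean(text):
    # Stand-in for spacy_tokenizer: just lowercase the text.
    return text.lower()

# Same idea as quora_train["question_text"][quora_train["target"] == 0].apply(...):
# keep only the rows whose target matches, then process each question.
sincere = [toy_clean(r["question_text"]) for r in rows if r["target"] == 0]
insincere = [toy_clean(r["question_text"]) for r in rows if r["target"] == 1]

print(sincere)
print(insincere)
```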

## Hasta La Vista, Baby