# Introduction

Initially, the end goal of this notebook was to preprocess data for topic detection and tag classification.  
I have tried to explain why I chose to diverge, or not, from a "classical" preprocessing pipeline in this particular case (see the steps marked optional & specific).

### Preprocess steps
 
**1) Noise removal**
   1. Removing html (specific)  
   2. Removing contractions  
   3. Spelling correction 
   4. Lowering the text
   
**2) Removing characters**   
   1. Removing punctuation, special characters and numbers  
   2. Removing single characters (optional & specific)

**3) Removing StopWords**  
   1. Removing the most frequent words  
   2. Removing certain types of words (optional & specific)  

**4) Stemming/Lemmatization**  
   1. Stemming   
   2. Lemmatization  

The advantage of this preprocessing is that it's really straightforward, but at the same time you may lose information that is needed for some analyses. It can be used for simple topic detection (LDA, NMF, etc.) or classification.

Most of these steps are **TASK-DEPENDENT**. In some cases you may choose not to remove stopwords or not to lemmatize your text.
The **ORDER** of these steps may also vary. Doing spelling correction or lemmatization BEFORE removing stopwords may change the result.
Also, note that these steps are far from optimised (I tokenize, untokenize, then tokenize again, etc.).  
You may also have heard the expression "Text normalization". This is another step in NLP, but it is redundant with Noise removal: the two are not well defined and they overlap. So I chose to organise my preprocessing around Noise removal only (which is completely arbitrary, and that's fine).  
Finally, I did not include the removal of punctuation, special characters and numbers in the Noise removal step, since it's more interesting to keep it as an independent step. The data are about programming, and in programming we use a lot of special characters, so you may not want to remove them. It also makes the preprocessing clearer: we begin with cleaning, then we remove characters (the most basic unit when working with text), after that we move to the word unit and, finally, we normalize the words.  
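
To make the point about **ORDER** concrete, here is a minimal sketch that runs the same two steps in both orders. The helper names (`to_lower`, `drop_stopwords`, `run_pipeline`) and the tiny stopword list are only illustrative assumptions; they are not the functions used later in this notebook.

```python
STOP_WORDS = {"the", "is", "a"}  # tiny illustrative list, not the NLTK one

def to_lower(text):
    return text.lower()

def drop_stopwords(text):
    # Case-sensitive lookup: "The" is not in STOP_WORDS, "the" is
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def run_pipeline(text, steps):
    """Apply each step in order; every step takes and returns a string."""
    for step in steps:
        text = step(text)
    return text

sample = "The parser is a Tool"
lower_first = run_pipeline(sample, [to_lower, drop_stopwords])
stopwords_first = run_pipeline(sample, [drop_stopwords, to_lower])

print(lower_first)      # parser tool
print(stopwords_first)  # the parser tool
```

Removing stopwords before lowering lets the capitalised "The" slip through, which is one reason to think about the order of the steps.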

### Vocabulary

If you are new to NLP, here is a small list of concepts that are used in this notebook.
- **Tokenize:** "Process of converting a string into a list of substrings, known as tokens."
- **Text normalization:** "Process of transforming text into a single canonical form that it might not have had before (e.g. lowering the text, removing contractions, spelling correction, stemming/lemmatization, etc.). Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure. "  
- **Noise removal:** "Process of removing anything that can interfere with your analysis (e.g. removing html, lowering the text, removing punctuation/special characters, etc.)."
- **Stemming:** "Process of reducing inflected words to their word stem, base or root form—generally a written word form ("fishing", "fished", and "fisher" to the stem "fish")."
- **Lemmatization:** "Process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form (e.g. "walking" to "walk", "better" to "good")."
- **StopWord:** "Words which are filtered out before or after processing of natural language data (text). Stop words usually refer to the most common words in a language (words like "The", "a", etc. in English)."

**Tag list**  
List of the tags used by the NLTK tagger (`pos_tag` function):
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

# Libraries and Dataset 

In [None]:
! pip install bs4
# ! pip install pycontractions # The package has a dependency that has not been updated, so I couldn't use it.
! pip install contractions
! pip install autocorrect 

In [None]:
# generic libraries
import time
import numpy as np
import pandas as pd
import gc

# Text libraries
import re
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import ToktokTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tag.util import untag
import contractions
# import pycontractions # Alternative better package for removing contractions
from autocorrect import Speller

In [None]:
# https://numpy.org/devdocs/user/basics.types.html

dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str', 'Body': 'str'}

In [None]:
%%time
df_questions = pd.read_csv('../input/pythonquestions/Questions.csv',
                           usecols=['Id', 'Score', 'Title', 'Body'], 
                           encoding = "ISO-8859-1",
                           dtype=dtypes_questions,
#                            nrows=100
                          )

In [None]:
df_questions[['Title', 'Body']] = df_questions[['Title', 'Body']].applymap(lambda x: str(x).encode("utf-8", errors='surrogatepass').decode("ISO-8859-1", errors='surrogatepass'))

In [None]:
# Remove all questions that have a negative score
df_questions = df_questions[df_questions["Score"] >= 0]

In [None]:
spell = Speller()
token = ToktokTokenizer()
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
charac = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789'
stop_words = set(stopwords.words("english"))
adjective_tag_list = {'JJ', 'JJR', 'JJS', 'RBR', 'RBS'} # Penn Treebank tags removed later: adjectives (JJ, JJR, JJS) plus comparative/superlative adverbs (RBR, RBS)

In [None]:
df_questions.info()

# 1) Noise removal

Noise removal is about removing anything that can interfere with your text analysis. It's like the data cleaning step of a classical ML project.

## 1. Removing html

In [None]:
df_questions['Body'][11]

In [None]:
%%time

# Parse question and title then return only the text
df_questions['Body'] = df_questions['Body'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
df_questions['Title'] = df_questions['Title'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())

As you can see, BeautifulSoup allows us to remove most of the html code effectively, but not all of it. 

In [None]:
df_questions['Body'][11]

So, we need to remove the rest here.

In [None]:
def clean_text(text):
    text = re.sub(r"\\'", "'", text) # replace escaped apostrophes (backslash + apostrophe) by a plain apostrophe
    text = re.sub(r"\n", " ", text) # replace line feeds (new lines) by a single whitespace
    text = re.sub(r"\xa0", " ", text) # replace non-breaking spaces by a single whitespace
    text = re.sub(r"\s+", " ", text) # collapse any run of whitespace into a single whitespace
    text = text.strip(' ')
    return text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: clean_text(x)) 
df_questions['Body'] = df_questions['Body'].apply(lambda x: clean_text(x))

In [None]:
df_questions['Body'][11]

## 2. Remove contractions

In [None]:
def expand_contractions(text):
    """expand shortened words, e.g. 'don't' to 'do not'"""
    text = contractions.fix(text)
    return text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: expand_contractions(x)) 
df_questions['Body'] = df_questions['Body'].apply(lambda x: expand_contractions(x))

In [None]:
df_questions['Body'][11]

## 3. Spelling correction

I include this step and its code here, but I did not actually run the correction (it is FAR too costly!). If you want to try it, there you go!
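
One possible way to reduce that cost, which I did not benchmark on this dataset, is to correct each distinct word only once and reuse the result. The sketch below assumes a hypothetical `slow_correct` function standing in for the real spell checker; only the caching pattern matters here.

```python
from functools import lru_cache

def slow_correct(word):
    """Hypothetical stand-in for an expensive spell checker call."""
    fixes = {"pyhton": "python", "fucntion": "function"}
    return fixes.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    # Each distinct word reaches slow_correct only once
    return slow_correct(word)

def autocorrect_cached(text):
    return " ".join(cached_correct(w) for w in text.split())

docs = ["pyhton fucntion", "a pyhton list", "pyhton fucntion call"]
corrected = [autocorrect_cached(d) for d in docs]

print(corrected)
print(cached_correct.cache_info().misses)  # 5 distinct words for 8 tokens
```

The gain depends on how often words repeat in the corpus, so treat it as an idea to test rather than a guaranteed speed-up.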

In [None]:
def autocorrect(text):
    words = token.tokenize(text)
    words_correct = [spell(w) for w in words]
    return ' '.join(map(str, words_correct)) # Return the untokenized text

In [None]:
# %%time

# df_questions['Title'] = df_questions['Title'].apply(lambda x: autocorrect(x)) 
# df_questions['Body'] = df_questions['Body'].apply(lambda x: autocorrect(x)) 

## 4. Lowering the text

I chose to lower the text here since the contractions package may put some capital letters back when expanding the contractions.
Lowering the text is a classical and useful step of Noise removal or Text normalization since it reduces the vocabulary, normalizes the text and costs almost nothing.
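
As a small illustration of the vocabulary reduction, the sketch below counts distinct whitespace-separated words before and after lowering. The three titles are made up for the example, and `vocabulary` is just an illustrative helper.

```python
titles = ["Python list to Dict", "python List comprehension", "PYTHON dict keys"]

def vocabulary(texts):
    """Return the set of distinct whitespace-separated words."""
    return {word for text in texts for word in text.split()}

vocab_raw = vocabulary(titles)
vocab_lower = vocabulary(t.lower() for t in titles)

print(len(vocab_raw), len(vocab_lower))  # 10 6
```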

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].str.lower()
df_questions['Body'] = df_questions['Body'].str.lower()

In [None]:
df_questions['Body'][11]

# 2) Removing characters

## 1. Removing all non-alphabetical characters

Note that I chose to remove ALL non-alphabetical characters (including punctuation, numbers and special characters). As a consequence, I lose important words that contain special characters (like "C#" in programming). Depending on your problem, you could choose to remove only punctuation and numbers, or not to remove anything at all!
But I recommend removing at least punctuation in most cases, since it can interfere with tokenisation, and numbers, since they are generally not useful.
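
If you do want to keep terms like "C#", one option is to map them to purely alphabetical aliases before stripping everything else. This is only a sketch: the `PROTECTED` mapping and the function name are assumptions made for the example, and the text is assumed to be lowercased already.

```python
import re

# Hypothetical whitelist; adapt it to the terms that matter in your data
PROTECTED = {"c#": "csharp", "c++": "cpp", ".net": "dotnet"}

def remove_non_alpha_keep_terms(text):
    """Map protected terms to alphabetical aliases, then strip non-letters."""
    for term, alias in PROTECTED.items():
        text = text.replace(term, f" {alias} ")
    text = re.sub(r"[^a-z]+", " ", text)  # every run of non-letters becomes one space
    return text.strip()

cleaned = remove_non_alpha_keep_terms("how to call c++ from c# in .net 4.5?")
print(cleaned)  # how to call cpp from csharp in dotnet
```

A plain `str.replace` is crude (it would also rewrite "c#" inside a longer token), so a real whitelist probably needs more careful matching.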

In [None]:
def remove_punctuation_and_number(text):
    """remove all punctuation and numbers"""
    return text.translate(str.maketrans("", "", charac)) # the third argument lists the characters to delete



def remove_non_alphabetical_character(text):
    """remove all non-alphabetical characters"""
    text = re.sub(r"[^a-z]+", " ", text) # replace every run of non-alphabetical characters by a whitespace
    text = re.sub(r"\s+", " ", text) # collapse the whitespace left after the last operation
    return text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_non_alphabetical_character(x)) 
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_non_alphabetical_character(x)) 

In [None]:
df_questions['Body'][11]

## 2. Removing single characters (optional)

I chose to remove single characters because, in programming, we often use a single letter as a variable name ("x", "y", "z", etc.). When I tried topic detection without removing them, they showed up in a lot of topics, and I even got a topic that I could have named "Variable name"...
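
Before removing them, you can check how much single-letter tokens actually weigh in your corpus. A minimal sketch on two made-up documents:

```python
from collections import Counter

docs = ["for x in range n print x", "let y equal x plus z"]

# Count only the tokens made of a single character
single_letters = Counter(w for d in docs for w in d.split() if len(w) == 1)
total_tokens = sum(len(d.split()) for d in docs)

print(single_letters.most_common(1))          # [('x', 3)]
print(sum(single_letters.values()), "of", total_tokens, "tokens")
```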

In [None]:
def remove_single_letter(text):
    """remove single alphabetical characters"""
    text = re.sub(r"\b\w\b", "", text) # remove every single-character word
    text = re.sub(r"\s+", " ", text) # collapse the whitespace left after the last operation
    text = text.strip(" ")
    return text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_single_letter(x)) 
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_single_letter(x)) 

In [None]:
df_questions['Body'][11]

# 3) Removing stopwords

## 1. Removing most frequent words

Removing the most frequent words is a classical step in NLP. In most cases, the most frequent words don't add much information (since they appear in almost every sentence). Removing them leaves more "space" for other words that may carry more useful information.   
You can use premade lists from libraries like scikit-learn, NLTK and others.
But be aware that those lists may be more problematic than useful (especially the scikit-learn list, see [Stop Word Lists in Free Open-source Software Packages](https://www.aclweb.org/anthology/W18-2502.pdf) for more information).
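
An alternative to a premade list is to derive the frequent words from your own corpus. The sketch below flags words that appear in a large share of the documents; the function name, the `min_doc_ratio` threshold and the sample documents are all assumptions for illustration, not something used in this notebook.

```python
from collections import Counter

docs = [
    "python list error",
    "python dict error",
    "python string format",
    "python list sort",
]

def frequent_words(texts, min_doc_ratio=0.75):
    """Return the words present in at least `min_doc_ratio` of the documents."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.split()))  # count each word once per document
    threshold = min_doc_ratio * len(texts)
    return {w for w, n in doc_freq.items() if n >= threshold}

custom_stop_words = frequent_words(docs)
print(sorted(custom_stop_words))  # ['python']
```

On a dataset of Python questions, a word like "python" is a good candidate: it is everywhere and does not help to tell topics apart.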

In [None]:
def remove_stopwords(text):
    """remove common English words by using nltk.corpus's list"""
    words = token.tokenize(text)
    filtered = [w for w in words if w not in stop_words]
    
    return ' '.join(map(str, filtered)) # Return the untokenized text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_stopwords(x))
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_stopwords(x)) 

In [None]:
df_questions['Body'][11]

## 2. Removing adjectives (optional)

Here, I chose to remove adjectives in addition to the NLTK list. Why? Simply because, when I first tried topic detection in the notebook that follows this one, it improved my results. I also thought that adjectives would not add any useful information.
I could have removed verbs with the same reasoning. But I did not, because the StackOverflow dataset is about programming questions, and in programming a lot of verbs, or words that may be interpreted as verbs, are important ("return", "get", "request", "replace", etc.).
You can use this type of reasoning to improve your preprocessing. It will also reduce the vocabulary and thus your computation time later on.
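
To see what the choice between adjectives and verbs changes, here is a sketch that filters hand-written `(word, tag)` pairs. The pairs stand in for the output of `nltk.pos_tag` so that the example runs without the tagger model; the tag sets use Penn Treebank tags, and `drop_tags` is an illustrative helper.

```python
# Hand-written pairs standing in for the output of nltk.pos_tag
tagged = [("fast", "JJ"), ("function", "NN"), ("return", "VB"),
          ("empty", "JJ"), ("list", "NN")]

ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def drop_tags(pairs, undesired):
    """Keep the words whose tag is not in `undesired`."""
    return [word for word, tag in pairs if tag not in undesired]

without_adjectives = drop_tags(tagged, ADJECTIVE_TAGS)
without_adjectives_and_verbs = drop_tags(tagged, ADJECTIVE_TAGS | VERB_TAGS)

print(without_adjectives)            # ['function', 'return', 'list']
print(without_adjectives_and_verbs)  # ['function', 'list']
```

Dropping verbs as well would lose "return", which is exactly the kind of programming word worth keeping.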

In [None]:

def remove_by_tag(text, undesired_tag):
    """remove all words carrying one of the given NLTK tags (adjectives, verbs, etc.)"""
    words = token.tokenize(text) # Tokenize the text
    words_tagged = nltk.pos_tag(tokens=words, tagset=None, lang='eng') # Tag each word and return a list of tuples (e.g. ("have", "VB"))
    filtered = [w[0] for w in words_tagged if w[1] not in undesired_tag] # Keep the words that don't have an undesired tag
    
    return ' '.join(map(str, filtered)) # Return the untokenized text

In [None]:
%%time
df_questions['Title'] = df_questions['Title'].apply(lambda x: remove_by_tag(x, adjective_tag_list))
df_questions['Body'] = df_questions['Body'].apply(lambda x: remove_by_tag(x, adjective_tag_list))

In [None]:
df_questions['Body'][11]

# 4) Stemming / Lemmatization

Stemming and Lemmatization are operations that:
- can improve your computation time later on by reducing your vocabulary
- help to generalize more easily by grouping words together (e.g. lemmatization transforms "am", "are", "be", etc. into "be")


## 1. Stemming

I did not use stemming here, but you should always consider this alternative since it's far less costly.

Stemming is the process of reducing inflected words to their word stem, base or root form, generally a written word form ("fishing", "fished", and "fisher" to the stem "fish"). It generally operates by removing the affixes of a word. An affix can be a suffix or a prefix (e.g. "-ed", "-ing", etc.). It's simple, but it will not work when the word is "irregular" ("ran" and "run"). Just think of it as a simpler operation than lemmatization, which can be enough in certain cases but can make too many mistakes in others. 
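
To show the idea of suffix stripping and its limit on irregular words, here is a toy stemmer. It is deliberately naive and is NOT the Porter algorithm used by `PorterStemmer`; the suffix list and the minimum stem length are arbitrary choices for the example.

```python
SUFFIXES = ("ing", "ed", "s")  # deliberately tiny; real stemmers have many more rules

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fish", "ran", "run"]:
    print(w, "->", toy_stem(w))
```

"fishing", "fished" and "fish" all end up as "fish", but "ran" and "run" stay different because no suffix rule relates them.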

In [None]:
words = ["program", "programs", "programer", "programing", "programers"]
  
for w in words:
    print(w, " : ", stemmer.stem(w))

In [None]:
def stem_text(text):
    """Stem the text"""
    words = nltk.word_tokenize(text) # tokenize the text and return a list of tokens
    stemmed_words = []
    for word in words:
        stemmed_words.append(stemmer.stem(word)) # Stem each word
    return " ".join(stemmed_words) # Return the untokenized text

In [None]:
# %%time

# df_questions['Title'] = df_questions['Title'].apply(lambda x: stem_text(x)) 
# df_questions['Body'] = df_questions['Body'].apply(lambda x: stem_text(x)) 

## 2. Lemmatization

As said in the beginning, Lemmatization is the process of replacing the inflected form of a word by its lemma (canonical form or dictionary form). But in some cases, a lemmatizer may not be able to find the right lemma if you don't specify the type of word, as you can see below.

In [None]:
print(lemmatizer.lemmatize("stripes", "v"))
print(lemmatizer.lemmatize("stripes", "n"))  
print(lemmatizer.lemmatize("are"))
print(lemmatizer.lemmatize("are", "v"))

A way to work around this problem is to use a tagger and pass the type of word to the lemmatize function. BUT it's really costly. In this regard, stemming or a simple lemmatization is far more efficient.
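
The translation from a tagger's Penn Treebank tag to the part-of-speech letter expected by the WordNet lemmatizer can be sketched on its own. `penn_to_wordnet` is a hypothetical helper name; the letters 'a', 'v', 'n' and 'r' are the WordNet codes for adjective, verb, noun and adverb, and the fallback to 'n' mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'n', 'r')."""
    first_letter_to_pos = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return first_letter_to_pos.get(tag[:1], default)

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

This only tidies the dispatch; the expensive part remains the tagging itself.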

In [None]:
def lemmatize_text(text):
    """Lemmatize the text by using tags"""
    
    tokens_tagged = nltk.pos_tag(nltk.word_tokenize(text))  # tokenize the text then return a list of tuples (token, nltk_tag)
    lemmatized_text = []
    for word, tag in tokens_tagged:
        if tag.startswith('J'):
            lemmatized_text.append(lemmatizer.lemmatize(word,'a')) # Lemmatize adjectives. Rarely triggered here since adjectives were removed earlier
        elif tag.startswith('V'):
            lemmatized_text.append(lemmatizer.lemmatize(word,'v')) # Lemmatize verbs
        elif tag.startswith('N'):
            lemmatized_text.append(lemmatizer.lemmatize(word,'n')) # Lemmatize nouns
        elif tag.startswith('R'):
            lemmatized_text.append(lemmatizer.lemmatize(word,'r')) # Lemmatize adverbs
        else:
            lemmatized_text.append(lemmatizer.lemmatize(word)) # For any other tag, fall back to the default lemmatization (noun)
    return " ".join(lemmatized_text) # Return the untokenized text

In [None]:
%%time

df_questions['Title'] = df_questions['Title'].apply(lambda x: lemmatize_text(x)) 
df_questions['Body'] = df_questions['Body'].apply(lambda x: lemmatize_text(x)) 

In [None]:
df_questions['Body'][11]

# Feature engineering

Just a little bit of FE. Using the title and the body together gives far better results for topic detection. 

In [None]:
df_questions['Text'] = df_questions['Title'] + ' ' + df_questions['Body']

# Data exportation

In [None]:
df_questions.to_csv('df_questions_fullclean.csv', encoding='utf-8', errors='surrogatepass')