## Text preprocessing

### This notebook is an exercise in applying the standard text preprocessing steps to a given text document using 3 different libraries

### Load and explore the data

In [15]:
data = ''
with open('data.txt','r') as inputfile:
    data = inputfile.read()

In [16]:
print(data)

Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s. NLP's creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life.[1][2] Bandler and Grinder also claim that NLP methodology can "model" the skills of exceptional people, allowing anyone to acquire those skills.[3][4] They claim as well that, often in a single session, NLP can treat problems such as phobias, depression, tic disorders, psychosomatic illnesses, near-sightedness,[5] allergy, common cold,[6] and learning disorders.[7][8]

There is no scientific evidence supporting the claims made by NLP advocates and it has been discredited as a pseudoscience.[9][10][11]

Scientific reviews state that NLP is based o

### Data cleaning

In [17]:
# Need to clean unneeded markings
import re

# Remove [NUM] citation tags. Matching digits only keeps the text between tags;
# a greedy "\[.+\]" would delete everything from the first "[" to the last "]" on a line.
data_clean = re.sub(r"\[\d+\]", "", data)

In [18]:
data_clean

'Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s. NLP\'s creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life. Bandler and Grinder also claim that NLP methodology can "model" the skills of exceptional people, allowing anyone to acquire those skills. They claim as well that, often in a single session, NLP can treat problems such as phobias, depression, tic disorders, psychosomatic illnesses, near-sightedness, allergy, common cold, and learning disorders.\n\nThere is no scientific evidence supporting the claims made by NLP advocates and it has been discredited as a pseudoscience.\n\nScientific reviews state that NLP is based on outdated metaphors of how the brain works that are inconsistent with current neurological theory and contain numerous factual errors.\n\nReviews also found that all of the supportive research on NLP contained significant methodological flaws and that there were three times as many studies of a much higher quality that failed to reproduce the "extraordinary claims" made by

#### We can still see unneeded new-line (\n) characters, but the tokenizer will take care of those
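The choice of pattern matters when several `[NUM]` tags sit on the same line. The following is a minimal sketch on a shortened, made-up sample string (`sample` is an assumption for illustration, not the notebook's data) comparing a greedy pattern with a digits-only one:

```python
import re

# Hypothetical shortened sample with two pairs of citation tags on one line
sample = "goals in life.[1][2] Bandler and Grinder claim more.[3][4]"

# Greedy: ".+" runs from the first "[" to the last "]" on the line,
# so the sentence between the tags is deleted as well.
greedy = re.sub(r"\[.+\]", "", sample)

# Digits only: each [NUM] tag is matched and removed on its own.
digits_only = re.sub(r"\[\d+\]", "", sample)

print(repr(greedy))       # 'goals in life.'
print(repr(digits_only))  # 'goals in life. Bandler and Grinder claim more.'
```

A non-greedy `\[.+?\]` would also work here, but it would remove any bracketed text, not just numeric citation tags.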

### Tokenization

In [19]:
from nltk.tokenize import word_tokenize

data_clean_word_tokenized = word_tokenize(data_clean)
data_clean_word_tokenized[:10]

['Neuro-linguistic',
 'programming',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'pseudoscientific',
 'approach',
 'to']

#### Side note: in some NLP tasks you need to preserve sentence boundaries. In that case, first split the text into sentences, and then split each sentence into tokens

In [20]:
from nltk.tokenize import sent_tokenize

data_clean_sent_tokenized = sent_tokenize(data_clean)
data_clean_sent_tokenized[:2]

['Neuro-linguistic programming (NLP) is a pseudoscientific approach to communication, personal development, and psychotherapy created by Richard Bandler and John Grinder in California, United States in the 1970s.',
 "NLP's creators claim there is a connection between neurological processes (neuro-), language (linguistic) and behavioral patterns learned through experience (programming), and that these can be changed to achieve specific goals in life."]

In [21]:
data_clean_word_sent_tokenized = [word_tokenize(sentence) for sentence in data_clean_sent_tokenized]
data_clean_word_sent_tokenized[0]

['Neuro-linguistic',
 'programming',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'pseudoscientific',
 'approach',
 'to',
 'communication',
 ',',
 'personal',
 'development',
 ',',
 'and',
 'psychotherapy',
 'created',
 'by',
 'Richard',
 'Bandler',
 'and',
 'John',
 'Grinder',
 'in',
 'California',
 ',',
 'United',
 'States',
 'in',
 'the',
 '1970s',
 '.']

### Remove punctuation and lowercase

In [22]:
data_clean_word_tokenized = [word.lower() for word in data_clean_word_tokenized if word.isalpha()]
data_clean_word_tokenized[:10]

['programming',
 'nlp',
 'is',
 'a',
 'pseudoscientific',
 'approach',
 'to',
 'communication',
 'personal',
 'development']
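Note that `'Neuro-linguistic'` is missing from the output above: `str.isalpha()` is false for any token containing a hyphen or a digit, so such tokens are dropped together with the punctuation. The sketch below, using a few tokens taken from the earlier outputs, contrasts that filter with a more lenient one that keeps any token containing at least one alphanumeric character. The lenient rule is one possible alternative, assumed here for illustration, and whether it is preferable depends on the task.

```python
# A few tokens as produced by word_tokenize in the cells above
tokens = ["Neuro-linguistic", "programming", "(", "NLP", ")", "1970s", "."]

# Strict filter (as used in the notebook): letters only
strict = [t.lower() for t in tokens if t.isalpha()]

# Lenient filter (assumed alternative): drop only tokens with no letter or digit
lenient = [t.lower() for t in tokens if any(ch.isalnum() for ch in t)]

print(strict)   # ['programming', 'nlp']
print(lenient)  # ['neuro-linguistic', 'programming', 'nlp', '1970s']
```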

### Removing stopwords

In [23]:
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))  # build the lookup set once

data_clean_word_tokenized = [word for word in data_clean_word_tokenized if word not in english_stopwords]
data_clean_word_tokenized[:10]

['programming',
 'nlp',
 'pseudoscientific',
 'approach',
 'communication',
 'personal',
 'development',
 'psychotherapy',
 'created',
 'richard']

### Lemmatization / POS tagging

In [24]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return ''
    
lemmatizer = WordNetLemmatizer()

data_clean_word_lemmatized = []

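for word in data_clean_word_tokenized:
    # Each word is tagged on its own, so the tagger sees no sentence context
    pos = get_wordnet_pos(pos_tag([word])[0][1])
    if pos:  # an empty string means no WordNet equivalent; keep the word as-is
        data_clean_word_lemmatized.append(lemmatizer.lemmatize(word, pos))
    else:
        data_clean_word_lemmatized.append(word)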

data_clean_word_lemmatized[:10]

['program',
 'nlp',
 'pseudoscientific',
 'approach',
 'communication',
 'personal',
 'development',
 'psychotherapy',
 'create',
 'richard']
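The `get_wordnet_pos` helper maps the first letter of a Penn Treebank tag to a WordNet part of speech. As a rough sketch, the same mapping can be written as a dictionary lookup. The function name `treebank_to_wordnet` is hypothetical, and the one-letter strings stand in for the values of NLTK's `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` constants so that the example runs without NLTK installed.

```python
# One-letter codes matching wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def treebank_to_wordnet(tag):
    """Return the WordNet POS code for a Treebank tag, or None if there is none."""
    return TREEBANK_PREFIX_TO_WORDNET.get(tag[:1])

print(treebank_to_wordnet("VBG"))  # v
print(treebank_to_wordnet("NNS"))  # n
print(treebank_to_wordnet("DT"))   # None
```

Returning `None` instead of an empty string lets the caller test the result with a plain `if pos:` check.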

## Now we have a dataset of words ready for training