# 1 - Preprocessing

Text preprocessing consists of transforming raw text into a form that suits your task. The most basic transformations include removing punctuation, lowercasing the text, and removing numbers. More advanced ones include removing stop words, lemmatizing, and part-of-speech tagging. There are many more.

It is important to choose the preprocessing steps that suit your task rather than applying all of them systematically. For example, if you are trying to predict the date of certain texts, numbers may be important.
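As a small illustration of that point, here is a minimal sketch showing how a blanket "remove numbers" step wipes out the signal a date-prediction task would rely on. The helper names `strip_digits` and `find_years` and the sample sentence are hypothetical, introduced only for this example.

```python
import re

def strip_digits(text):
    # Drop every digit character, as a blanket "remove numbers" step would
    return re.sub(r"\d", "", text)

def find_years(text):
    # Naive year detector (an assumption for this sketch): any standalone run of four digits
    return re.findall(r"\b\d{4}\b", text)

sample = "Signed in Paris on 02/12/2018."

years_before = find_years(sample)                # the year is still recoverable
years_after = find_years(strip_digits(sample))   # nothing is left to find
```

Here `years_before` contains the year while `years_after` is empty, so for this kind of task the digits should be kept.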

In this exercise you will learn some basic text preprocessing and apply it to the following sentence:

In [1]:
example = "02/12/2018. The world loves Zlatan Ibrahimovic. He is such an amaZing Footballer!!"

## Remove Punctuation

Complete the function to remove punctuation and apply it to the example.

In [63]:
import string  # Provides string.punctuation, a string containing all ASCII punctuation characters

def remove_punctuation(text):
    for punctuation in string.punctuation: 
        text = text.replace(punctuation, ' ') 
    return text

stepbystep = remove_punctuation(example) 

stepbystep  # We will apply each of the following functions to stepbystep to visualize the changes

'02 12 2018  The world loves Zlatan Ibrahimovic  He is such an amaZing Footballer  '
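The loop above calls `replace` once per punctuation character. A possible alternative, shown here only as a sketch, is to build a translation table once and apply it in a single pass with `str.translate`. The names `PUNCT_TO_SPACE` and `remove_punctuation_translate` are hypothetical and not part of the exercise.

```python
import string

# Map every ASCII punctuation character to a space (built once, reused on every call)
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_translate(text):
    # Replace each punctuation character with a space in one pass over the text
    return text.translate(PUNCT_TO_SPACE)

example = "02/12/2018. The world loves Zlatan Ibrahimovic. He is such an amaZing Footballer!!"
result = remove_punctuation_translate(example)
```

On the example sentence, `result` is the same string as the output of `remove_punctuation` shown above.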

## Lower Case

Complete the function to lowercase the text, then apply it to stepbystep.

In [64]:
def lowercase(text):
    lowercased = text.lower() 
    return lowercased

stepbystep = lowercase(stepbystep) #Remember, apply function to stepbystep

stepbystep

'02 12 2018  the world loves zlatan ibrahimovic  he is such an amazing footballer  '

## Tokenize 

Tokenizing splits a text into a list of individual words, also called tokens. Using word_tokenize, complete the function, then apply it to stepbystep.

In [65]:
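# word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk import word_tokenize

def tokenize(text):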
    tokenized = word_tokenize(text)
    return tokenized

stepbystep = tokenize(stepbystep)

stepbystep

['02',
 '12',
 '2018',
 'the',
 'world',
 'loves',
 'zlatan',
 'ibrahimovic',
 'he',
 'is',
 'such',
 'an',
 'amazing',
 'footballer']
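To see what tokenizing does without any NLTK data installed, here is a minimal sketch that uses plain `str.split()` as a stand-in. It is not what `word_tokenize` does internally. It only shows that splitting on whitespace is enough once punctuation is gone, and that it falls short on raw text. The variable names are hypothetical.

```python
cleaned = "02 12 2018  the world loves zlatan ibrahimovic  he is such an amazing footballer  "
raw = "He is such an amaZing Footballer!!"

# split() with no argument collapses runs of whitespace and drops empty strings
tokens_cleaned = cleaned.split()

# On raw text, punctuation stays glued to the neighbouring word
tokens_raw = raw.split()
```

On the already cleaned string, `tokens_cleaned` is the same list as the output shown above. On the raw sentence, the last token still carries its exclamation marks, which is one reason to prefer a real tokenizer or to remove punctuation first.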

## Remove Numbers

Complete the function to remove numbers using .isalpha(), which is true only for tokens made up entirely of letters.

In [66]:
def remove_numbers(text):
    words_only = [word for word in text if word.isalpha()]
    return words_only

stepbystep = remove_numbers(stepbystep)

stepbystep


['the',
 'world',
 'loves',
 'zlatan',
 'ibrahimovic',
 'he',
 'is',
 'such',
 'an',
 'amazing',
 'footballer']
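Filtering with `.isalpha()` removes more than pure numbers: any token containing a digit or another non-letter character is dropped as well. The sketch below uses a hypothetical token list to make that visible.

```python
# Hypothetical tokens chosen to show the edge cases of str.isalpha()
tokens = ["2018", "covid19", "2nd", "footballer", "e-mail"]

kept = [token for token in tokens if token.isalpha()]
dropped = [token for token in tokens if not token.isalpha()]
```

Only "footballer" ends up in `kept`. Mixed tokens such as "covid19" and hyphenated ones such as "e-mail" are discarded too, which may or may not be what your task needs.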

## Remove StopWords

"Stopwords" are words that are used so frequently that, for many tasks (but not all), they don't carry much information. Examples are "any", "all", "what"... NLTK has a built-in corpus of English stopwords that can be loaded and used.

Using that corpus, create a set of English stopwords. Then, complete the function to remove them.

In [67]:
from nltk.corpus import stopwords 

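# Create a set of English stopwords (a set makes membership tests fast)
# The corpus must be available locally, e.g. via nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Create function
def remove_stopwords(text):
    without_stopwords = [word for word in text if word not in stop_words]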
    return without_stopwords

stepbystep = remove_stopwords(stepbystep)

stepbystep

['world', 'loves', 'zlatan', 'ibrahimovic', 'amazing', 'footballer']
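The "but not all" caveat matters in practice. Stopword lists commonly include negations such as "not" (NLTK's English list does, though it is worth checking in your installed version), and removing them can erase the difference between two sentences with opposite meanings. The sketch below uses a tiny hypothetical stopword set, `demo_stop_words`, rather than the NLTK corpus.

```python
# Hypothetical mini stopword set for illustration only; NLTK's English list is much longer
demo_stop_words = {"the", "is", "was", "a", "not"}

def remove_demo_stopwords(tokens):
    # Keep only the tokens that are not in the demo stopword set
    return [token for token in tokens if token not in demo_stop_words]

positive = remove_demo_stopwords(["the", "match", "was", "good"])
negative = remove_demo_stopwords(["the", "match", "was", "not", "good"])
```

Both `positive` and `negative` end up as the same two tokens, so a sentiment model would no longer be able to tell them apart. For such tasks, consider keeping negations or skipping stopword removal.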

## Lemmatize

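Lemmatizing consists of reducing the inflected forms of a word to its dictionary form, or lemma. For example: studies & studying --> study. Note that WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech argument (for example pos="v" for verbs), so with the default settings "studying" is left unchanged.

Complete the two parts in the following function. First, instantiate a WordNetLemmatizer from the NLTK package. Then use it to lemmatize the text.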

In [68]:
from nltk.stem import WordNetLemmatizer

def lemmatize(text):
    lemmatizer = WordNetLemmatizer()  # Instantiate the lemmatizer
    lemmatized = [lemmatizer.lemmatize(word) for word in text]  # Lemmatize each token (as a noun by default)
    return lemmatized

stepbystep = lemmatize(stepbystep)

stepbystep

['world', 'love', 'zlatan', 'ibrahimovic', 'amazing', 'footballer']

## Combine steps

Combine all of the above steps into a single function, then apply it to the original sentence "example".

In [None]:
def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')  # Remove punctuation
    lowercased = text.lower()  # Lower case
    tokenized = word_tokenize(lowercased)  # Tokenize
    words_only = [word for word in tokenized if word.isalpha()]  # Remove numbers
    stop_words = set(stopwords.words('english'))  # Make stopword set
    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stopwords
    lemmatizer = WordNetLemmatizer()  # Instantiate the lemmatizer
    lemmatized = [lemmatizer.lemmatize(word) for word in without_stopwords]  # Lemmatize
    return lemmatized

cleaned_in_one_go = clean(example) # Apply the "clean" function to our original sentence "example"

cleaned_in_one_go