## Tokenization
Before we can classify any posts, we need to clean and tokenize the text data. Using what you learned in the last lesson on NLP, implement the function `tokenize`. This function should perform the following steps on the string `text`, using nltk:

1. Identify any URLs in `text` and replace each one with the word `"urlplaceholder"`.
2. Split `text` into tokens.
3. For each token: lemmatize, normalize case, and strip leading and trailing whitespace.
4. Return the tokens in a list!

For example, this:
```python
text = 'Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG'

tokenize(text)
```
should return this:
```txt
['barclays', 'ceo', 'stress', 'the', 'importance', 'of', 'regulatory', 'and', 'cultural', 'reform', 'in', 'financial', 'service', 'at', 'brussels', 'conference', 'urlplaceholder']
```
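
Step 1 can be tried on its own before bringing in nltk. The following is a minimal sketch using only the standard-library `re` module. The helper name `replace_urls` and the simplified pattern `https?://\S+` are assumptions made for illustration; they are not the names or the pattern the exercise prescribes.

```python
import re

# Simplified, illustrative pattern (an assumption): "http://" or "https://"
# followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_urls(text, placeholder="urlplaceholder"):
    """Return `text` with every URL-like substring replaced by `placeholder`."""
    return URL_PATTERN.sub(placeholder, text)


sample = "Brussels conference  http://t.co/Ge9Lp7hpyG"
print(replace_urls(sample))
```

Because the pattern stops at whitespace, punctuation that directly follows a URL would be swallowed into the match; a stricter pattern is needed if that matters for your data.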

Hint: You'll have to add an import statement to use the `re` package (which supports regular expressions) and two import statements to use the appropriate functions from `nltk`! Add them to this first code cell.

In [1]:
# import statements
import re

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download necessary NLTK data
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to C:\Users\Victor
[nltk_data]     Pontello\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Victor
[nltk_data]     Pontello\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
def load_data():
    df = pd.read_csv('../data/corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y

In [2]:
text = 'Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG'

In [16]:
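def clean_text(text):
    """Replace URLs, lowercase, tokenize and lemmatize `text`; return a list of tokens."""
    # raw string, so that `\w` and `\.` reach the regex engine unchanged
    url_str_pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"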

    # Identify any urls in `text`, and replace each one with the word, `"urlplaceholder"`.
    # Normalize case
    text = re.sub(url_str_pattern,'urlplaceholder',text.lower())
    # Split `text` into tokens.
    words = word_tokenize(text)
    # For each token: lemmatize, and strip leading and trailing white space.
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word.strip()) for word in words]
    
    return words


In [17]:
clean_text(text)

['barclays',
 'ceo',
 'stress',
 'the',
 'importance',
 'of',
 'regulatory',
 'and',
 'cultural',
 'reform',
 'in',
 'financial',
 'service',
 'at',
 'brussels',
 'conference',
 'urlplaceholder']

In [20]:
# test out function
X, y = load_data()
for message in X[:5]:
    tokens = clean_text(message)
    print(message)
    print(tokens, '\n')

Barclays CEO stresses the importance of regulatory and cultural reform in financial services at Brussels conference  http://t.co/Ge9Lp7hpyG
['barclays', 'ceo', 'stress', 'the', 'importance', 'of', 'regulatory', 'and', 'cultural', 'reform', 'in', 'financial', 'service', 'at', 'brussels', 'conference', 'urlplaceholder'] 

Barclays announces result of Rights Issue http://t.co/LbIqqh3wwG
['barclays', 'announces', 'result', 'of', 'right', 'issue', 'urlplaceholder'] 

Barclays publishes its prospectus for its å£5.8bn Rights Issue: http://t.co/YZk24iE8G6
['barclays', 'publishes', 'it', 'prospectus', 'for', 'it', 'å£5.8bn', 'right', 'issue', ':', 'urlplaceholder'] 

Barclays Group Finance Director Chris Lucas is to step down at the end of the week due to ill health http://t.co/nkuHoAfnSD
['barclays', 'group', 'finance', 'director', 'chris', 'lucas', 'is', 'to', 'step', 'down', 'at', 'the', 'end', 'of', 'the', 'week', 'due', 'to', 'ill', 'health', 'urlplaceholder'] 

Barclays announces that Ire