<div class="alert alert-danger" style="color:black"><b>Running ML-LV Jupyter Notebooks:</b><br>
    <ol>
        <li>Make sure you are running all notebooks using the <code>adv_ai</code> kernel.</li>
        <li><b>It is very important that you do not create any additional files within the weekly folders on CSCT cloud.</b> Any additional files, or editing the notebooks with a different environment, may prevent submission/marking of your work.</li>
            <ul>
                <li>NBGrader will automatically fetch and create the correct folders and files for you.</li>
                <li>All files that are not the Jupyter notebooks should be stored in the 'ML-LV/data' directory.</li>
            </ul>
        <li>Please <b>do not pip install</b> any python packages (or anything else). You should not need to install anything to complete these notebooks other than the packages provided in the Jupyter CSCT Cloud environment.</li>
    </ol>
    <b>If you would like to run this notebook locally you should:</b><br>
    <ol>
        <li>Create an environment using the requirements.txt file provided. <b>Any additional packages you install will not be accessible when uploaded to the server and may prevent marking.</b></li>
        <li>Download a copy of the notebook to your own machine. You can then edit the cells as you wish, and afterwards copy the code into (or edit in place) the cells of the notebook on the CSCT cloud.</li>
        <li><b>It is very important that you do not re-upload any notebooks that you have edited locally.</b> This is because NBGrader uses cell metadata to track marked tasks. <b>If you change this format it may prevent marking.</b></li>
    </ol>
</div>

# Practical 2: Text Pre-processing and Representation

In the previous practical we gathered movie reviews from IMDB and annotated them with sentiment. Data scraped from the web, or gathered from other sources, is rarely in a suitable format for NLP applications. So, now that we have some data, the next step is to clean and normalise the text. This reduces noise within the data and ensures a consistent set of input features for an ML model. Very frequent or infrequent words, punctuation and other characters, emojis, HTML tags, etc. all increase the number of features present within the data. Each of these may, or may not, be helpful when training a model for a given task.
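
To make the effect on feature count concrete, the sketch below counts the distinct whitespace-separated tokens in a string before and after a crude normalisation. The sample sentence and the `normalise` helper are invented for illustration and are not part of the practical's pipeline.

```python
import re

# Hypothetical sample text, invented for illustration
sample = "Great movie! GREAT cast, great <b>story</b>. Great... just great :)"

def normalise(text):
    """Crude normalisation: strip simple tags, lowercase, keep letters/digits only."""
    text = re.sub(r"<.*?>", " ", text)          # drop simple HTML tags
    text = text.lower()                         # case-folding
    text = re.sub(r"[^a-z0-9\s]+", " ", text)   # replace punctuation with a space
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

# Each distinct token would become a separate feature
raw_features = set(sample.split())
clean_features = set(normalise(sample).split())

print(f"Distinct raw tokens:   {len(raw_features)}")
print(f"Distinct clean tokens: {len(clean_features)}")
```

Variants such as `Great`, `GREAT` and `Great...` collapse into a single feature once the text is normalised.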

In the first part of this practical we will examine several text pre-processing and normalisation steps and the process of building a vocabulary. We will then develop a function to apply each of these steps to our IMDB review data.

In the second part of this practical we will look at several different methods of representing text in a format that is compatible with ML models, i.e. as numbers or vectors.

The objectives of this practical are:
1. Understand various text pre-processing options and determine which are appropriate for a given problem

2. Develop functions for text pre-processing and for creating a vocabulary

3. Explore vectorised language representations - BOW, One-hot and TF-IDF

4. Understand the benefit of word vectors and how to use them

# 1 Text Pre-processing

## 1.0 Import libraries

1. [spaCy](https://spacy.io/) - is a Python library for NLP. It's very efficient and has an excellent set of features.

2. [Natural Language Toolkit (NLTK)](https://www.nltk.org/) - is an older but more comprehensive NLP toolkit for Python.

3. [Unidecode](https://pypi.org/project/Unidecode/) - is a small Python package for stripping accents from letters.

4. [Contractions](https://github.com/kootenpv/contractions) - is a small Python package for expanding contractions.

In [9]:
import os
import re
import spacy
import unidecode
import contractions
import pandas as pd
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

# Get the status of NBgrader (for skipping cell execution while validating/grading)
grading = bool(os.getenv('NBGRADER_EXECUTION'))

# Get the project directory (should be in ML-LV)
path = ''
while os.path.basename(os.path.abspath(path)) != 'ML-LV':
    path = os.path.abspath(os.path.join(path, '..'))

# Set the directory to the data folder (should be in ML-LV/data/imdb)
data_dir = os.path.join(path, 'data', 'imdb')

# Set the directory to the shared dataset folder (should be in shared/datasets/imdb)
dataset_dir = os.path.join(path, '..', 'shared', 'datasets', 'imdb')

# Load the Spacy language model ('en_core_web_md' should be in shared/models/spacy)
nlp = spacy.load(os.path.join(path, '..', 'shared', 'models', 'spacy'))

## 1.1 Pre-processing options

The following cells demonstrate each of the pre-processing options discussed in the lecture. For most we use spaCy but several are possible using regular expressions or plain Python.

It is very unlikely that you would ever need to apply **all** of these steps. In fact, you probably wouldn't have much text left if you did! But it is important to understand what each step does and when it might be appropriate.

<div class="alert alert-success" style="color:black"><b>Note:</b> The <i>order</i> in which these steps are applied can sometimes make a big difference.<br>
For example, if you were to remove punctuation and replace with an empty string, then hyphenated words would be joined together. So, 'father-in-law' becomes 'fatherinlaw'.</div>
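
As a minimal sketch of this point using only Python's `re` module, the example below compares deleting punctuation with replacing it by a space, and then applies whitespace squeezing before and after punctuation removal. The helper names `strip_punct` and `squeeze_space` are hypothetical and chosen only for this illustration.

```python
import re

text = "Visit my father-in-law - he said 'it would be fun!'"

# Deleting punctuation outright fuses hyphenated words together
fused = re.sub(r"[^A-Za-z0-9\s]+", "", text)

def strip_punct(t):
    """Replace each run of punctuation with a single space."""
    return re.sub(r"[^A-Za-z0-9\s]+", " ", t)

def squeeze_space(t):
    """Collapse repeated whitespace and trim both ends."""
    return re.sub(r"\s+", " ", t).strip()

# Same two steps, applied in a different order
punct_then_space = squeeze_space(strip_punct(text))
space_then_punct = strip_punct(squeeze_space(text))

print(repr(fused))
print(repr(punct_then_space))
print(repr(space_then_punct))
```

Squeezing whitespace first leaves stray spaces behind, because the punctuation step introduces new ones afterwards.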

### Tokenisation and Segmentation

In [3]:
# Create spacy document object
raw_text = "Let's visit my father-in-law in St. Louis next year. He said 'it would be fun!'"
doc = nlp(raw_text)
print(f"Document: {doc}\n")

# Segment the text into sentences
sentences = list(doc.sents)
print(f"Sentences: {sentences}\n")

# Tokenise the sentences
tokens = []
for sent in doc.sents:
    tokens.append([token.text for token in sent])
print(f"Tokens: {tokens}\n")

Document: Let's visit my father-in-law in St. Louis next year. He said 'it would be fun!'

Sentences: [Let's visit my father-in-law in St. Louis next year., He said 'it would be fun!']

Tokens: [['Let', "'s", 'visit', 'my', 'father', '-', 'in', '-', 'law', 'in', 'St.', 'Louis', 'next', 'year', '.'], ['He', 'said', "'", 'it', 'would', 'be', 'fun', '!', "'"]]



### Stemming and Lemmatisation

In [4]:
# Create NLTK stemmer
stemmer = SnowballStemmer(language='english')

# Create spacy document object
raw_text = "studies studying cries cry automatic automation are is car cars am"
doc = nlp(raw_text)

# Print the stem and lemma for each token
print(f"{'Token:':20} {'Stem:':20} {'Lemma:':20}\n")
for token in doc:
    print(f"{token.text:20} {stemmer.stem(token.text):20} {token.lemma_:20}")

Token:               Stem:                Lemma:              

studies              studi                study               
studying             studi                study               
cries                cri                  cry                 
cry                  cri                  cry                 
automatic            automat              automatic           
automation           autom                automation          
are                  are                  be                  
is                   is                   be                  
car                  car                  car                 
cars                 car                  car                 
am                   am                   be                  


### Stop words, Case-folding and Punctuation

In [5]:
# Print Spacy's default stop words
print(f"List of stop words: {list(nlp.Defaults.stop_words)[:50]} \n")

# Create spacy document object
raw_text = "Let's visit my #father-in-law @ St. Louis next year.\n He said 'it would be fun!'"
doc = nlp(raw_text)
print(f"Document: {doc}")

# Remove stop words
print("\nRemoved stop words:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_stop]
    print(sent)

# Lowercase the tokens
# Python: text.lower()
print("\nLower-cased words:")
for sent in doc.sents:
    sent = [token.lower_ for token in sent]
    print(sent)

# Remove punctuation
# Regex: keep only letters and numbers
# re.sub('[^A-Za-z0-9]+', ' ', text)
print("\nRemoved punctuation:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_punct]
    print(sent)

List of stop words: ['amongst', '‘ll', '‘ve', 'had', 'on', 'bottom', 'would', 'whereby', 'after', 'therein', 'every', 'noone', 'became', 're', 'towards', 'hers', 'there', 'thru', 'show', 'itself', 'ten', 'become', 'formerly', 'therefore', 'afterwards', 'because', 'if', 'perhaps', 'whose', 'wherein', 'how', 'latter', 'such', 'somehow', 'four', 'was', 'out', 'twenty', "'re", 'across', 'hereby', 'whither', 'last', 'when', 'am', 'that', 'and', 'five', 'than', '‘re'] 

Document: Let's visit my #father-in-law @ St. Louis next year.
 He said 'it would be fun!'

Removed stop words:
[Let, visit, #, father, -, -, law, @, St., Louis, year, ., 
 ]
[said, ', fun, !, ']

Lower-cased words:
['let', "'s", 'visit', 'my', '#', 'father', '-', 'in', '-', 'law', '@', 'st.', 'louis', 'next', 'year', '.', '\n ']
['he', 'said', "'", 'it', 'would', 'be', 'fun', '!', "'"]

Removed punctuation:
[Let, 's, visit, my, father, in, law, St., Louis, next, year, 
 ]
[He, said, it, would, be, fun]


### Whitespace, Characters, Contractions, Accents, HTML tags and Emoji

<div class="alert alert-success" style="color:black"><b>Note:</b> The regex for removing emojis is from <a href=https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python>this Stack Overflow answer</a>.
</div>

<div class="alert alert-danger" style="color:black">
<b>Parsing HTML with regex:</b> The regular expression used here works reasonably well for simple HTML tags but is not foolproof, as jokingly outlined <a href=https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?noredirect=1&lq=1>in this well-known Stack Overflow answer</a>.<br>

Regex cannot reliably account for the complex structure of HTML, so if it is critical to correctly parse HTML you should use an XML parser or something like Beautiful Soup.
</div>
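
To illustrate one way the simple regex can fail, the sketch below compares it with a small text extractor built on Python's standard-library `html.parser`. This is offered as one possible parser-based alternative, alongside those mentioned above; the `TextExtractor` class, the `strip_html` helper and the `tricky` string are hypothetical and not part of this notebook.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags and their attributes are skipped
        self.parts.append(data)

def strip_html(html):
    """Return the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

# A '>' inside an attribute value ends the match too early for the regex
tricky = "<a href='site.com' title='a > b'>Let's visit</a> <b>St. Louis</b>"

print(repr(re.sub(r"<.*?>", "", tricky)))
print(repr(strip_html(tricky)))
```

The regex leaves part of the attribute behind, whereas the parser treats the quoted attribute value as belonging to the tag.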

In [6]:
# Create spacy document object
raw_text = u"<a href='site.com' class='link'> Let's visit mý \t #fáther-in-law @ St. Louis next year.\n <b>He said 'it would be fun!' \U0001f602 </b></a><br>"

doc = nlp(raw_text)
print(f"Document: {doc}")

# Remove whitespace
# Regex: remove 1 or more whitespace characters
# re.sub('\s+', ' ', text)
print("\nRemoved whitespace characters:")
for sent in doc.sents:
    sent = [token for token in sent if not token.is_space]
    print(sent)

# Remove specific characters
# Characters are specified inside the square brackets
print("\nRemoved specific characters:")
for sent in doc.sents:
    sent = re.sub('[@#$]', '', sent.text)
    print(sent)

# Remove accents
print("\nRemoved accents:")
for sent in doc.sents:
    sent = unidecode.unidecode(sent.text)
    print(sent)

# Expand contractions
print("\nExpanded contractions:")
for sent in doc.sents:
    sent = contractions.fix(sent.text)
    print(sent)

# Remove HTML tags
# Match 0 or more characters between < and >
print("\nRemoved HTML tags:")
for sent in doc.sents:
    sent = re.sub('<.*?>', '', sent.text)
    print(sent)

# Remove emojis
emoji_pattern = re.compile("[\U0001F100-\U0001FFFF]+", flags=re.UNICODE)
print("\nRemoved emoji:")
for sent in doc.sents:
    sent = emoji_pattern.sub(r'', sent.text)
    print(sent)

Document: <a href='site.com' class='link'> Let's visit mý 	 #fáther-in-law @ St. Louis next year.
 <b>He said 'it would be fun!' 😂 </b></a><br>

Removed whitespace characters:
[<, a, href='site.com, ', class='link, ', >, Let, 's, visit, mý, #, fáther, -, in, -, law, @, St., Louis, next, year, .]
[<, b, >]
[He, said, ', it, would, be, fun, !, ']
[😂, <, /b></a><br, >]

Removed specific characters:
<a href='site.com' class='link'> Let's visit mý 	 fáther-in-law  St. Louis next year.
 
<b>
He said 'it would be fun!'
😂 </b></a><br>

Removed accents:
<a href='site.com' class='link'> Let's visit my 	 #father-in-law @ St. Louis next year.
 
<b>
He said 'it would be fun!'
 </b></a><br>

Expanded contractions:
<a href='site.com' class='link'> Let us visit mý 	 #fáther-in-law @ St. Louis next year.
 
<b>
He said 'it would be fun!'
😂 </b></a><br>

Removed HTML tags:
 Let's visit mý 	 #fáther-in-law @ St. Louis next year.
 

He said 'it would be fun!'
😂 

Removed emoji:
<a href='site.com' class='li

## 1.2 Building a vocabulary

It is often helpful to create a vocabulary once the text has been processed. At a certain point words appear so infrequently they may have little impact on the model. So a vocabulary allows us to choose how many words (features) to keep and then discard those that are less frequently occurring.

A vocabulary also allows us to map word tokens to indices to perform simple text **vectorisation**. It also lets us add special tokens such as `<unk>` to replace unknown/out-of-vocabulary (OOV) words, and `<pad>` to pad inputs to a given length.


1. First pre-process/normalise the text.

2. Then use `Counter()` to create a dictionary of words and frequency counts.

3. Finally create a vocabulary (list) and add the `vocab_size` number of most frequently occurring words. Note we also added `<unk>` and `<pad>` at the beginning. We will use these in future weeks.
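
As a rough sketch of how the special tokens might be used once a vocabulary exists, the example below maps tokens to indices, replaces out-of-vocabulary tokens with `<unk>` and pads to a fixed length with `<pad>`. The toy corpus, the `token_to_index` dictionary and the `encode` helper are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical pre-tokenised corpus, invented for illustration
corpus = [
    ["the", "film", "was", "fun"],
    ["the", "plot", "was", "thin"],
    ["the", "end"],
]

special_tokens = ["<pad>", "<unk>"]
vocab_size = 4

# Count token frequencies and keep the most common ones
counts = Counter(token for sent in corpus for token in sent)
most_common = counts.most_common(vocab_size - len(special_tokens))
vocab = special_tokens + [word for word, _ in most_common]

# A dictionary gives constant-time lookups, unlike list.index()
token_to_index = {token: i for i, token in enumerate(vocab)}

def encode(tokens, max_len):
    """Map tokens to indices, using <unk> for OOV tokens and <pad> up to max_len."""
    unk = token_to_index["<unk>"]
    pad = token_to_index["<pad>"]
    ids = [token_to_index.get(token, unk) for token in tokens[:max_len]]
    return ids + [pad] * (max_len - len(ids))

print(vocab)
print(encode(["the", "ending", "was", "fun"], max_len=6))
```

Placing `<pad>` at index 0 and `<unk>` at index 1 is a common convention rather than a requirement.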

In [57]:
# Create spacy document object
raw_text = "Let's visit my father-in-law in St. Louis next year.\n He said 'it would be fun!'."
doc = nlp(raw_text)
print(f"Document: {doc}")

# Do some pre-processing
# Let's just lowercase the tokens and remove whitespace characters
corpus = []
for sent in doc.sents:
    sent = [token.lower_ for token in sent]
    sent = [token.strip() for token in sent]
    corpus.append(sent)

# Count the frequency of each token in the corpus
word_counter = Counter()
for sent in corpus:
    word_counter.update(sent)
print(word_counter)
# Note: len() of a Counter is the number of distinct tokens, not the total number of tokens
print(f"Total word count: {len(word_counter)}")

# Create a vocabulary of vocab_size, also include special tokens
vocab_size = 20
special_tokens = ['<pad>', '<unk>']
vocab = []

# Add the special tokens to the vocabulary
vocab.extend(special_tokens)

# Add the vocab_size most common tokens to the vocabulary
vocab.extend([word for word, count in word_counter.most_common(vocab_size - len(special_tokens))])
print(vocab)
print(f"Vocabulary size: {len(vocab)}")

# Now we can get the index for a token, or the token from an index
print("Vocab index and token:")
print(vocab.index('father'))
print(vocab[vocab.index('father')])

Document: Let's visit my father-in-law in St. Louis next year.
 He said 'it would be fun!'.
Counter({'-': 2, 'in': 2, '.': 2, "'": 2, 'let': 1, "'s": 1, 'visit': 1, 'my': 1, 'father': 1, 'law': 1, 'st.': 1, 'louis': 1, 'next': 1, 'year': 1, '': 1, 'he': 1, 'said': 1, 'it': 1, 'would': 1, 'be': 1, 'fun': 1, '!': 1})
Total word count: 22
['<pad>', '<unk>', '-', 'in', '.', "'", 'let', "'s", 'visit', 'my', 'father', 'law', 'st.', 'louis', 'next', 'year', '', 'he', 'said', 'it']
Vocabulary size: 20
Vocab index and token:
10
father


<div class="alert alert-info" style="color:black"><h2>1.3 Exercise: Pre-processing pipeline</h2>

We have now seen each of the various pre-processing options applied individually. However, several of these steps will usually need to be applied together. The appropriate steps are problem specific, and choosing an approach is part of developing an NLP project. At the very least you will probably need to remove extra whitespace and tokenise the text, but most likely case-folding and removing some special characters will be necessary too.

Libraries like NLTK, spaCy and textaCy can help you build a processing 'pipeline' but it is sometimes convenient to create a function or class to handle these steps for you.

1. In the following cell complete the `preprocess_text()` function. It should take a single string as input, apply a range of processing options and either return a list of tokens, if `tokenise=True`, or a string. Hint: to return a string you may use the `_join_punctuation()` function given.

2. The function should be able to apply case-folding, expand contractions, lemmatise, remove punctuation, whitespace, accents and basic HTML tags. It should also include arguments to select and apply each of these options separately e.g. `to_lower=False`.

3. You can use the `test_text` string to develop the function. Remember the *order* you apply different steps can make a big difference!

<b>MARKS AVAILABLE: 5</b>
<br>
<b>MO1</b>
</div>
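
As a loose illustration of the pipeline idea, separate from the spaCy-based function this exercise asks for, the sketch below chains small single-purpose functions in an explicit order using only the standard library. All helper names are hypothetical, and `strip_accents` uses `unicodedata` as a stand-in for Unidecode.

```python
import re
import unicodedata

def strip_tags(text):
    """Replace simple HTML tags with a space."""
    return re.sub(r"<.*?>", " ", text)

def strip_accents(text):
    """Decompose characters, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def strip_punct(text):
    """Replace anything that is not a letter, digit or whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]+", " ", text)

def squeeze_space(text):
    """Collapse repeated whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text, steps):
    """Apply each step in turn; the order of `steps` is part of the design."""
    for step in steps:
        text = step(text)
    return text

sample = "<b>The Café</b>\twas   well-known!"
good_order = [strip_tags, strip_accents, str.lower, strip_punct, squeeze_space]
bad_order = [strip_tags, strip_punct, strip_accents, str.lower, squeeze_space]

print(run_pipeline(sample, good_order))
print(run_pipeline(sample, bad_order))
```

With the second ordering the accented character is discarded as punctuation before it can be converted, so "Café" loses its final letter.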

In [176]:
def preprocess_text(text, tokenise=False, to_lower=False, remove_punct=False, remove_space=False, exp_contractions=False, remove_accents=False, remove_html=False, lemmatise=False):
    """Arguments:
        text (str): The text to be preprocessed
        tokenise (bool): Whether to return a list of tokens or a string
        to_lower (bool): Whether to convert text to lowercase
        remove_punct (bool): Whether to remove punctuation and other special characters
        remove_space (bool): Whether to remove whitespace characters
        exp_contractions (bool): Whether to expand contractions
        remove_accents (bool): Whether to remove accents
        remove_html (bool): Whether to remove HTML tags
        lemmatise (bool): Whether to lemmatise tokens
    """
    
    # Check if 'text' is not of type string
    if not isinstance(text, str): 
        # Return an empty string if an input is not valid
        return ""

    # Convert to lowercase before tokenisation if needed
    if to_lower:
        text = text.lower()

    # Expand contractions
    if exp_contractions:
        text = contractions.fix(text)

    # Remove accents
    if remove_accents:
        text = unidecode.unidecode(text)

    # Remove HTML tags
    if remove_html:
        text = re.sub(r'<.*?>', '', text)
        
    # Remove punctuation before tokenisation using regex
    if remove_punct:
        # Keep only letters, numbers and whitespace; everything else becomes a space
        text = re.sub(r'[^A-Za-z0-9\s]+', ' ', text)


    # Remove extra whitespace
    # Regex: remove 1 or more whitespace characters
    # re.sub('\s+', ' ', text)
    if remove_space:
        text = re.sub(r'\s+', ' ', text).strip()

    # Tokenisation (Using SpaCy)
    doc = nlp(text)
    tokens = [token.lemma_ if lemmatise else token.text 
              for token in doc if not token.is_space]


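    def _join_punctuation(tokens, characters=".,;?!"):
        characters = set(characters)
        tokens = iter(tokens)
        # Guard against an empty token list: a bare next() would raise
        # StopIteration, which becomes a RuntimeError inside a generator
        current = next(tokens, None)
        if current is None:
            return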

        for next_token in tokens:
            if any((char in characters) for char in next_token):
                current += next_token
            else:
                yield current
                current = next_token

        yield current

    # Return string or tokens
    if tokenise:
        return tokens
    else:
        return " ".join(_join_punctuation(tokens))
    
# Text for testing

test_text = "<a href='https://www.imdb.com/title/tt0000417/reviews/?ref_=tt_ql_urv'>I can now say that I've seen a movie that's over 100 years old</a> Georges Méliès's 1902 masterpiece is not just a science fiction movie. <br /><br />It's also a satire on nineteenth-century science.\t""Le Voyage dans la Lune"" (""A Trip to the Moon"") is also an indictment of colonialism.\nThe astronauts attack the Moon Men - called \"Selenites\" - and then bring one back to Earth, where they parade him around. """
print(f"Test text:\n {test_text}")

processed_text = preprocess_text(test_text, tokenise=True, to_lower=True, remove_punct=True, remove_space=True, exp_contractions=True, remove_accents=True, remove_html=True, lemmatise=True)
print(f"\nPreprocessed text:\n {processed_text}")

Test text:
 <a href='https://www.imdb.com/title/tt0000417/reviews/?ref_=tt_ql_urv'>I can now say that I've seen a movie that's over 100 years old</a> Georges Méliès's 1902 masterpiece is not just a science fiction movie. <br /><br />It's also a satire on nineteenth-century science.	Le Voyage dans la Lune (A Trip to the Moon) is also an indictment of colonialism.
The astronauts attack the Moon Men - called "Selenites" - and then bring one back to Earth, where they parade him around. 

Preprocessed text:
 ['I', 'can', 'now', 'say', 'that', 'I', 'have', 'see', 'a', 'movie', 'that', 'be', 'over', '100', 'year', 'old', 'george', 'melie', 's', '1902', 'masterpiece', 'be', 'not', 'just', 'a', 'science', 'fiction', 'movie', 'it', 'be', 'also', 'a', 'satire', 'on', 'nineteenth', 'century', 'science', 'le', 'voyage', 'dans', 'la', 'lune', 'a', 'trip', 'to', 'the', 'moon', 'be', 'also', 'an', 'indictment', 'of', 'colonialism', 'the', 'astronaut', 'attack', 'the', 'moon', 'man', 'call', 'selenite'

In [171]:
# Test cell (1 mark)

# Test tokenisation
assert preprocess_text("Text to #tokenise, but don't just split!", tokenise=True) == ['Text', 'to', '#', 'tokenise', ',', 'but', 'do', "n't", 'just', 'split', '!']

# Test lowercase
assert preprocess_text("LOWERCASE this Text.", to_lower=True) == "lowercase this text."

# Test remove punctuation
assert preprocess_text("Remove the punctuation, from this #text.", remove_punct=True) == "Remove the punctuation from this text"

# Test remove whitespace
assert preprocess_text("Remove     the extra\t, whitespace \n from this text.    ", remove_space=True) == "Remove the extra, whitespace from this text."

# Test expand contractions
assert preprocess_text("Don't forget you'll need to expand contractions.", exp_contractions=True) == "Do not forget you will need to expand contractions."

print('All tests passed!')

All tests passed!


In [172]:
# Test cell (2 marks)

# Test remove accents
assert preprocess_text("The Jalapeño Café has a strange façade.", remove_accents=True) == "The Jalapeno Cafe has a strange facade."

# Test remove HTML tags
assert preprocess_text("<a href='site.com'>Get the text from the link, <b>remove HTML tags!</b></a>", remove_html=True) == "Get the text from the link, remove HTML tags!"

# Test lemmatisation
assert preprocess_text("I am studying lemmatisation, it is useful.", lemmatise=True) == "I be study lemmatisation, it be useful."

# Test lowercase, remove punctuation but don't tokenise
assert preprocess_text("Tokenise, remove punc, order-matters and LOWERCASE", tokenise=False, remove_punct=True, to_lower=True) == "tokenise remove punc order matters and lowercase"

# Test all options
assert preprocess_text("<b>Tokenise, remove punc, order-matters LOWERCASE\t and and lemmatise it's useful at the Café.</b>\n", tokenise=True, remove_punct=True, to_lower=True, remove_space=True, exp_contractions=True, remove_accents=True, remove_html=True, lemmatise=True) == ['tokenise', 'remove', 'punc', 'order', 'matter', 'lowercase', 'and', 'and', 'lemmatise', 'it', 'be', 'useful', 'at', 'the', 'cafe']

print('All tests passed!')

All tests passed!


In [None]:
# Hidden test cell (2 marks)
# Tests a few different combinations


<div class="alert alert-info" style="color:black"><h2>1.4 Exercise: Pre-processing IMDB</h2>

Now that we have a pre-processing function, let's use it to process the IMDB reviews that you annotated! Once you are happy with the function, in the next cell:

1. First load your IMDB reviews from the `imdb_reviews_annot.csv` file. 

2. Create a new dataframe called `imdb_corpus` with two columns `review` and `sentiment` for the processed reviews and the sentiment labels.

3. Now apply the `preprocess_text()` function to each review. You should **NOT** tokenise at this stage (we will do this later), but at minimum you should remove extra whitespace, lowercase, remove punctuation, remove HTML and expand contractions. Otherwise you are welcome to use other pre-processing options for your reviews.

<b>MARKS AVAILABLE: 5</b>
<br>
<b>MO1</b>
</div>

In [173]:
# YOUR CODE HERE
reviews_df = pd.read_csv(os.path.join(data_dir, 'imdb_reviews_annot.csv'), index_col=0)

# Count the reviews per label and check that none are unlabelled
print(f"Number of positive reviews: {(reviews_df['sentiment'] == 'positive').sum()}")
print(f"Number of negative reviews: {(reviews_df['sentiment'] == 'negative').sum()}")
print(f"Number of unlabelled reviews: {reviews_df['sentiment'].isnull().sum()}")


# Create new dataframe for processed reviews
imdb_corpus = pd.DataFrame()

def process_review(text):
    processed_text = preprocess_text(
        text=text,
        tokenise=False,        
        to_lower=True,          
        remove_punct=True,      
        remove_space=True,      
        exp_contractions=True,
        remove_accents=True,  
        remove_html=True,       
        lemmatise=False  
    )

    processed_text = processed_text.strip()  
    processed_text = " ".join(processed_text.split()) 
    return processed_text

imdb_corpus['review'] = reviews_df['review'].astype(str).apply(process_review)
imdb_corpus['sentiment'] = reviews_df['sentiment']

Number of positive reviews: 56
Number of negative reviews: 44
Number of unlabelled reviews: 0


In [174]:
# Test cell (1 mark)

# Test the dataframe has correct number of rows and columns
assert imdb_corpus.shape == (100, 2)
assert list(imdb_corpus.columns) == ['review', 'sentiment']

# Test sentiment is only 'positive' or 'negative'
assert set(imdb_corpus['sentiment'].unique()) == {'positive', 'negative'}

# Test reviews are NOT tokenised
assert pd.api.types.infer_dtype(imdb_corpus['review']) == 'string'

print('All tests passed!')

All tests passed!


In [175]:
# Test cell (2 marks)

# Test reviews are lowercased
assert imdb_corpus['review'].str.islower().all()

# Test reviews have no punctuation
assert imdb_corpus['review'].str.match(r'^[A-Za-z0-9\s]+$').all()

# Test reviews do not start with whitespace (str.match anchors at the start of the string)
assert not imdb_corpus['review'].str.match(r'\s+').any()

print('All tests passed!')

All tests passed!


In [None]:
# Hidden tests (2 marks)
# Test remaining preprocessing steps


Make sure to save the dataframe to a new file called `imdb_reviews.csv`.

In [146]:
# Save the dataframe
imdb_corpus.to_csv(os.path.join(data_dir, 'imdb_reviews.csv'), index=False)

<div class="alert alert-success" style="color:black"><h3>Before you submit this notebook to NBGrader for marking:</h3> 

1. Make sure you have completed all exercises marked by <span style="color:blue">**blue cells**</span>.
2. For automatically marked exercises ensure you have completed any cells with `# YOUR CODE HERE`. Then click the 'Validate' button above, or ensure all cells run without producing an error.
3. For manually marked exercises ensure you have completed any cells with `"YOUR ANSWER HERE"`.
4. Ensure all cells are run with their output visible.
5. Fill in your student ID (**only**) below.
6. You should now **save and download** your work.

</div>

**Student ID:** 15006280