# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [8]:
# Run this code:

location = '58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [9]:
print("Original word count:", len(prophet))


Original word count: 13637


#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself.

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [13]:
prophet = prophet[568:]
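
The two approaches mentioned above give the same result. A minimal sketch on a made-up list follows; the names `words`, `kept`, and `trimmed` and the cut-off of 3 are illustrative assumptions standing in for `prophet` and 568.

```python
# Hypothetical toy list standing in for `prophet`; 3 stands in for 568.
words = ['header', 'license', 'title', 'Almustafa', 'the', 'chosen']

# Option 1: keep everything from index 3 onward (builds a new list).
kept = words[3:]

# Option 2: delete elements 0 through 2 in place.
trimmed = list(words)  # copy so the toy list itself is left untouched
del trimmed[:3]

print(kept)             # ['Almustafa', 'the', 'chosen']
print(kept == trimmed)  # True
```

Slicing leaves the original list intact until it is reassigned, which makes it the safer choice when a cell may be re-run.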



If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [14]:
# View sample words (words 1–10)
print("Sample words 1–10:", prophet[1:11])

Sample words 1–10: ['with', 'confidence?\n\nIf', 'this', 'is', 'my', 'day', 'of', 'harvest,', 'in', 'what\nfields']


#### The next step is to create a function that will remove references.

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [15]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed

    Example:
    Input: 'the{7}'
    Output: 'the'
    '''

    return x.split('{')[0]


Now that we have our function, use the `map()` function to apply it to every word in our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [16]:
prophet_reference = list(map(reference, prophet))
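
The `list(...)` wrapper matters because in Python 3 `map()` returns a lazy iterator rather than a list. The sketch below illustrates this on a few made-up words; `sample`, `lazy`, and `cleaned` are illustrative names, and `reference` is redefined only to keep the snippet self-contained.

```python
def reference(x):
    # Same logic as the function above: keep the text before the first '{'.
    return x.split('{')[0]

sample = ['the{7}', 'beloved', 'city{12}']  # made-up sample words

lazy = map(reference, sample)  # a map object; no word has been processed yet
cleaned = list(lazy)           # consuming the iterator applies the function

print(cleaned)     # ['the', 'beloved', 'city']
print(list(lazy))  # [] because the iterator is already exhausted
```

Words without a `{` pass through unchanged, since `split('{')` then returns a one-element list.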


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [17]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character

    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''

    return x.split('\n')


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [18]:
prophet_line = list(map(line_break, prophet_reference))


If you look at the elements of `prophet_line`, you will see that the function returned lists, not strings, so `prophet_line` is now a list of lists. Flatten it using a list comprehension and assign the new list to `prophet_flat`.

In [19]:
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

['dispense',
 'with',
 'confidence?',
 '',
 'If',
 'this',
 'is',
 'my',
 'day',
 'of',
 'harvest,',
 'in',
 'what',
 'fields',
 'have',
 'I',
 'sowed',
 'the',
 'seed,',
 'and',
 'in',
 'what',
 'unremembered',
 'seasons?',
 '',
 'If',
 'this',
 'indeed',
 'be',
 'the',
 'hour',
 'in',
 'which',
 'I',
 'lift',
 'up',
 'my',
 'lantern,',
 'it',
 'is',
 'not',
 'my',
 'flame',
 'that',
 'shall',
 'burn',
 'therein.',
 '',
 'Empty',
 'and',
 'dark',
 'shall',
 'I',
 'raise',
 'my',
 'lantern,',
 '',
 'And',
 'the',
 'guardian',
 'of',
 'the',
 'night',
 'shall',
 'fill',
 'it',
 'with',
 'oil',
 'and',
 'he',
 'shall',
 'light',
 'it',
 'also.',
 '',
 '*****',
 '',
 'These',
 'things',
 'he',
 'said',
 'in',
 'words.',
 'But',
 'much',
 'in',
 'his',
 'heart',
 'remained',
 'unsaid.',
 'For',
 '',
 'could',
 'not',
 'speak',
 'his',
 'deeper',
 'secret.',
 '',
 '*****',
 '',
 '[Illustration:',
 '0020]',
 '',
 'And',
 'when',
 'he',
 'entered',
 'into',
 'the',
 'city',
 'all',
 'the',
 ...

In [20]:
print("First 10 flat words:", prophet_flat[:10])


First 10 flat words: ['dispense', 'with', 'confidence?', '', 'If', 'this', 'is', 'my', 'day', 'of']
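
The empty strings in the output come from consecutive line breaks: splitting `'confidence?\n\nIf'` on `\n` yields an empty string between the two breaks. As a sketch, the flattening step can also be written with `itertools.chain.from_iterable` from the standard library, shown here as an alternative to the list comprehension and not as the exercise's required answer. The list `nested` is made-up sample data.

```python
from itertools import chain

# Splitting on '\n' keeps an empty string between two consecutive breaks.
parts = 'confidence?\n\nIf'.split('\n')
print(parts)  # ['confidence?', '', 'If']

# Made-up nested list shaped like `prophet_line`.
nested = [['dispense'], ['with'], parts, ['this']]

# Nested list comprehension, as used in the cell above.
flat_comprehension = [word for sub in nested for word in sub]

# Equivalent flattening with the standard library.
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension == flat_chain)  # True
```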


# Challenge 2 - Filtering

When printing out a few words from the book, we see words that we may not want to keep if we analyze the text as a corpus. Below is a list of words that we would like to remove. Create a function that returns `False` if its input is one of the words in the list and `True` otherwise.

In [21]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list
    and False if the word is in the list.

    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False

    Input: 'John'
    Output: True
    '''

    word_list = ['and', 'the', 'a', 'an']

    return x not in word_list


Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [22]:
prophet_filter = list(filter(word_filter, prophet_flat))
print("Filtered words (first 10):", prophet_filter[:10])


Filtered words (first 10): ['dispense', 'with', 'confidence?', '', 'If', 'this', 'is', 'my', 'day', 'of']
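
Because the test is an exact string match, empty strings and words with attached punctuation are kept. The sketch below shows this on made-up data; `STOP_WORDS` and `word_filter_set` are hypothetical names, and using a set instead of a list is an optional variation that gives faster membership checks without changing the result.

```python
# Hypothetical name; holds the same four words as `word_list` above.
STOP_WORDS = {'and', 'the', 'a', 'an'}

def word_filter_set(x):
    # True when the word should be kept, mirroring word_filter.
    return x not in STOP_WORDS

sample = ['the', 'seed,', 'and', '', 'the,', 'hour']  # made-up sample words
kept = list(filter(word_filter_set, sample))

print(kept)  # ['seed,', '', 'the,', 'hour']
```

Here `'the,'` survives because the trailing comma makes it a different string from `'the'`.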


# Bonus Challenge

Rewrite the `word_filter` function above so that it is case-insensitive.

In [23]:
def word_filter_case(x):
    '''
    Case-insensitive version of word_filter
    '''
    word_list = ['and', 'the', 'a', 'an']
    return x.lower() not in word_list


In [24]:
prophet_filter = list(filter(word_filter_case, prophet_flat))
print("Case-insensitive filter applied (first 10):", prophet_filter[:10])


Case-insensitive filter applied (first 10): ['dispense', 'with', 'confidence?', '', 'If', 'this', 'is', 'my', 'day', 'of']
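
The first ten words happen to look the same under both filters, so the difference is easier to see on a small made-up sample that contains capitalized stop words. In this sketch `WORD_LIST` and `sample` are illustrative names, and both functions are redefined only so the snippet runs on its own.

```python
WORD_LIST = ['and', 'the', 'a', 'an']  # same words as in the cells above

def word_filter(x):
    # Case-sensitive: only exact lowercase matches are removed.
    return x not in WORD_LIST

def word_filter_case(x):
    # Case-insensitive: lowercase the word before the membership test.
    return x.lower() not in WORD_LIST

sample = ['And', 'the', 'guardian', 'of', 'The', 'night', 'A', 'lantern']

sensitive = list(filter(word_filter, sample))
insensitive = list(filter(word_filter_case, sample))

print(sensitive)    # ['And', 'guardian', 'of', 'The', 'night', 'A', 'lantern']
print(insensitive)  # ['guardian', 'of', 'night', 'lantern']
```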


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces.

We will start by writing a function that takes two strings and concatenates them with a single space between them.

Then use that function to reduce the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [25]:
def concat_space(a, b):
    '''
    Concatenate two strings with a space
    '''
    return a + ' ' + b

# Reduce the list to a single string
prophet_string = reduce(concat_space, prophet_filter)
print("Reduced string (first 200 chars):", prophet_string[:200])

Reduced string (first 200 chars): dispense with confidence?  If this is my day of harvest, in what fields have I sowed seed, in what unremembered seasons?  If this indeed be hour in which I lift up my lantern, it is not my flame that 
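
The double spaces in the output appear wherever the list still contains an empty string, because `concat_space` adds a space on each side of it. As a sketch on made-up data, the reduction gives the same result as `' '.join(...)`, the more common idiom for this task. The names `words`, `reduced`, and `joined` are illustrative.

```python
from functools import reduce

def concat_space(a, b):
    # Same logic as the function above.
    return a + ' ' + b

words = ['dispense', 'with', 'confidence?', '', 'If', 'this']  # made-up sample

reduced = reduce(concat_space, words)
joined = ' '.join(words)

print(repr(reduced))      # 'dispense with confidence?  If this'
print(reduced == joined)  # True
```

Note that `reduce()` raises a `TypeError` when given an empty list and no initial value, whereas `' '.join([])` returns an empty string.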
