# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [3]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [4]:
prophet = prophet[568:]
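
The two approaches mentioned above (keeping the tail of the list, or deleting the leading elements) give the same result. A minimal sketch on a toy list; `words` and `n_header` are hypothetical names used only for illustration:

```python
# Toy stand-in for the book; `words` and `n_header` are illustrative names
words = ['title', 'author', 'license', 'first', 'real', 'words']
n_header = 3  # number of leading elements to drop

# Option 1: keep everything from index n_header onward (builds a new list)
kept = words[n_header:]

# Option 2: delete elements 0 through n_header - 1 in place
trimmed = list(words)  # copy so the original list stays intact
del trimmed[:n_header]

print(kept)
print(kept == trimmed)
```

Slicing leaves the original list untouched, while `del` modifies the list it is applied to.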

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 9.

In [5]:
print(prophet[1:10])

['the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [18]:
def remove_reference(word):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    # Keep only the text before the first '{'
    return word.split('{')[0]
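
Splitting on `{` and keeping the first piece also works for words that carry no reference, because `split` then returns a one-element list. The sketch below uses `str.partition` as an alternative that stops at the first `{`; the name `remove_reference_partition` is hypothetical and not part of the exercise:

```python
def remove_reference_partition(word):
    # partition returns (before, separator, after) for the first '{' only
    return word.partition('{')[0]

# Samples cover a reference, no reference, a leading reference and an empty string
for sample in ['the{7}', 'chosen', '{1}start', '']:
    print(repr(sample), '->', repr(remove_reference_partition(sample)))
```

With either approach, everything after the `{` is dropped, including any punctuation that follows the reference.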




Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [25]:
prophet_reference = list(map(remove_reference, prophet))
print(prophet_reference[1:10])

['the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']
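
The call to `list()` matters here: in Python 3, `map()` returns a lazy iterator that can be consumed only once. A small sketch on a toy list (`strip_reference` and `sample` are hypothetical names; the function repeats the split logic so the block runs on its own):

```python
def strip_reference(word):
    # Same idea as remove_reference, repeated so this block is self-contained
    return word.split('{')[0]

sample = ['the{7}', 'chosen', 'and', 'beloved{12}']

lazy = map(strip_reference, sample)  # nothing is computed yet
first_pass = list(lazy)              # consumes the iterator
second_pass = list(lazy)             # iterator is exhausted, so this is empty

# An equivalent list comprehension
comprehension = [strip_reference(w) for w in sample]

print(first_pass)
print(second_pass)
print(first_pass == comprehension)
```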


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [26]:
def line_break(word):
    '''
    Input: A string
    Output: A list with the pieces of the string split on line breaks
    '''
    return word.split('\n')

# Check the result on a single word
print(line_break('the\nbeloved'))


['the', 'beloved']


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [29]:
# line_break returns a list for every word, so the result is a list of lists
prophet_line = list(map(line_break, prophet_reference))
print(prophet_line[0:10])


[['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,'], ['who'], ['was'], ['a'], ['dawn'], ['unto']]


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [30]:
# Flatten exactly once: each sublist holds the pieces of one original word
prophet_flat = [word for sublist in prophet_line for word in sublist]

print(prophet_flat[0:10])

['PROPHET', '', '|Almustafa,', 'the', 'chosen', 'and', 'the', 'beloved,', 'who', 'was']
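
Flattening has to be applied exactly once. A string is itself iterable, so running the same comprehension over a list that is already flat splits every word into single characters. The sketch below illustrates this on a toy nested list and shows `itertools.chain.from_iterable` as an alternative to the comprehension; the variable names are illustrative:

```python
from itertools import chain

nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# Two equivalent ways to flatten one level
flat_comprehension = [word for sublist in nested for word in sublist]
flat_chain = list(chain.from_iterable(nested))

# Flattening a second time iterates over each string and yields characters
over_flattened = [ch for word in flat_comprehension for ch in word]

print(flat_comprehension)
print(flat_chain == flat_comprehension)
print(over_flattened[:7])
```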


# Challenge 2 - Filtering

When printing out a few words from the book, we see words that we may not want to keep if we analyze the text as a corpus. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in the list and `True` otherwise.

In [31]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # Keep the word only if it is not one of the unwanted words
    return x not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [32]:
prophet_filter = list(filter(word_filter, prophet_flat))
print(prophet_filter[:10])

['PROPHET', '', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his']
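
`filter()` keeps the elements for which the function returns `True` and, like `map()`, returns a lazy iterator. A minimal sketch on a toy list, assuming a set named `STOP_WORDS` (a hypothetical name holding the same four words) for the membership test:

```python
# Hypothetical name; contains the same words as word_list in word_filter
STOP_WORDS = {'and', 'the', 'a', 'an'}

def keep_word(word):
    # True when the word should be kept
    return word not in STOP_WORDS

sample = ['the', 'chosen', 'and', 'the', 'beloved,', 'a', 'dawn']

with_filter = list(filter(keep_word, sample))
with_comprehension = [w for w in sample if keep_word(w)]

print(with_filter)
print(with_filter == with_comprehension)
```

The comparison is an exact string match, so `'beloved,'` with its trailing comma is kept, and a capitalized `'The'` would be kept as well.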


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [None]:
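def word_filter_case(x):
    '''
    Input: A string
    Output: True if the lowercased word is not in the specified list
    and False if it is.
    '''
    word_list = ['and', 'the', 'a', 'an']
    
    # Lowercase the input so that 'The' and 'AND' are filtered out as well
    return x.lower() not in word_list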

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [None]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    return a + ' ' + b

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [35]:
# concat_space is already defined above, so it is not redefined here
prophet_string = reduce(concat_space, prophet_filter)

# Print only the opening of the string; the full string is the entire book
print(prophet_string[:54])

PROPHET  |Almustafa, chosen beloved, who was dawn unto
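
`reduce()` folds the list from left to right: it joins the first two words, then joins that result with the third word, and so on. For strings this gives the same result as `' '.join(...)`. The sketch below compares the two on a toy list; `join_with_space` is a hypothetical name that repeats the logic of `concat_space` so the block runs on its own:

```python
from functools import reduce

def join_with_space(a, b):
    # Same behavior as concat_space, repeated so this block is self-contained
    return a + ' ' + b

words = ['chosen', 'beloved,', 'who', 'was', 'dawn']

reduced = reduce(join_with_space, words)
joined = ' '.join(words)

print(reduced)
print(reduced == joined)

# On an empty list, join returns '' while reduce without an initial value raises
print(repr(' '.join([])))
try:
    reduce(join_with_space, [])
except TypeError as exc:
    print('TypeError:', exc)
```

For a long list of strings, `join` is usually the more efficient choice because `reduce` builds a new intermediate string at every step.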