# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.
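
As a quick reminder of how `map()` behaves, here is a minimal sketch on a small made-up list. The names `sample_words`, `lowered`, and `lowered_list` are illustrative only and are not part of the lab.

```python
# Hypothetical sample data, not taken from the book
sample_words = ['Hello,', 'WORLD', 'python']

# map(function, iterable) applies the function to each element lazily
lowered = map(str.lower, sample_words)

# Wrap the iterator in list() to materialize the results
lowered_list = list(lowered)
print(lowered_list)  # ['hello,', 'world', 'python']
```

Note that the `map` object is an iterator: once it has been consumed by `list()`, iterating over it again yields nothing.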

In the following cell, we will read a text file containing the book The Prophet by Kahlil Gibran.

In [56]:
# Run this code:

# Working with external resources such as files can fail while opening,
# reading, or writing. To make sure the file is always closed, you can use:
#    - try ... finally (optionally with an except clause in between)
#    - with (binds the open file to f and closes it when the block ends)
# https://realpython.com/python-with-statement/

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

# Opens the file, reads it, and splits the text into words on single spaces
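
The two cleanup options mentioned in the comments can be compared side by side. This is only a sketch: it writes a hypothetical scratch file (`sample_path`) so that it does not depend on the lab's data file, and the variable names are made up for illustration.

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not depend on the lab data
fd, sample_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(sample_path, 'w', encoding='utf8') as out:
    out.write('the chosen and the beloved')

# Option 1: try/finally closes the file even if read() raises
f = open(sample_path, 'r', encoding='utf8')
try:
    words_try = f.read().split(' ')
finally:
    f.close()

# Option 2: the with statement performs the same cleanup automatically
with open(sample_path, 'r', encoding='utf8') as g:
    words_with = g.read().split(' ')

os.remove(sample_path)
print(words_try == words_with, f.closed, g.closed)  # True True True
```

Both versions leave the file closed; the `with` form is shorter and harder to get wrong, which is why the cell above uses it.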

In [57]:
print(prophet[:10])

['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Prophet,', 'by', 'Kahlil', 'Gibran\n\nThis']


#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [58]:
prophet = prophet[568:]

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at the first 11 words.

In [59]:
print(prophet[0:11])

['PROPHET\n\n|Almustafa,', 'the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.
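
The splitting approach described above can be sketched as follows. The helper name `strip_reference` is hypothetical and is used here only to keep the sketch separate from the exercise. The second half shows why slicing with `str.find` needs care: `find` returns `-1` when the character is absent, and slicing up to `-1` drops the last character.

```python
def strip_reference(word):
    """Hypothetical helper: keep only the text before the first '{'."""
    # split() always returns at least one element, so words without
    # a reference come back unchanged
    return word.split('{')[0]

print(strip_reference('the{7}'))  # the
print(strip_reference('chosen'))  # chosen

# Pitfall when slicing with find(): -1 means "not found"
word = 'chosen'
not_found = word.find('{')
print(not_found)         # -1
print(word[:not_found])  # chose
```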

In [60]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    # Keep only the text before the first '{'. Slicing with x.find('{')
    # would cut the last character off words that have no reference,
    # because find() returns -1 when '{' is absent.
    return x.split('{')[0]

In [61]:
reference("the{7}")

'the'

Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Store the result in a new list called `prophet_reference`.

In [62]:
# map(function, iterable) returns an iterator, so wrap it in list()
prophet_reference = list(map(reference, prophet))

In [63]:
print(prophet_reference[:11])

['PROPHET\n\n|Almustafa,', 'the', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto', 'his']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [64]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    return x.split("\n")

In [65]:
line_break('the\nbeloved')

['the', 'beloved']

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [104]:
# Trial with a for loop
prophet_line = []
for words in prophet_reference:
    prophet_line.append(line_break(words))

In [111]:
# Same result with map()
prophet_line = list(map(line_break, prophet_reference))

In [112]:
# Print only the first 11 elements to keep the output readable
print(prophet_line[:11])

[['PROPHET', '', '|Almustafa,'], ['the'], ['chosen'], ['and'], ['the', 'beloved,'], ['who'], ['was'], ['a'], ['dawn'], ['unto'], ['his']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [119]:
# Trial with a for loop
text1 = []
for words in prophet_line:
    # Iterate over every word; range(len(words)-1) would skip the last one
    for word in words:
        text1.append(word)

print(text1[:11])

In [122]:
prophet_flat = [word for words in prophet_line for word in words]

In [53]:
prophet_flat[:11]

['PROPHET',
 '',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a']
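
Flattening a list of lists with a nested comprehension can be sketched on a small hypothetical sample. The names `nested`, `flat`, and `flat_chain` are illustrative. The standard-library alternative `itertools.chain.from_iterable` is shown only for comparison, since the exercise asks for a list comprehension.

```python
from itertools import chain

# Hypothetical sample in the same shape as a list of split words
nested = [['the', 'beloved'], ['who'], ['own', 'day']]

# Read the comprehension left to right, like nested for loops:
# for each sublist, for each word in that sublist, keep the word
flat = [word for words in nested for word in words]
print(flat)  # ['the', 'beloved', 'who', 'own', 'day']

# Equivalent result using the standard library
flat_chain = list(chain.from_iterable(nested))
print(flat == flat_chain)  # True
```

Iterating over the words directly, rather than over index ranges, avoids off-by-one mistakes such as skipping the last word of each sublist.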

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if the word it receives is in the specified list and `True` otherwise.

In [None]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.
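
One possible way to complete this step is sketched below. It is not the only solution, and it runs on a made-up list (`sample_words`) rather than on the book, so `sample_filtered` stands in for `prophet_filter`.

```python
def word_filter(x):
    """Return False for words in the unwanted list, True otherwise."""
    word_list = ['and', 'the', 'a', 'an']
    return x not in word_list

# Hypothetical sample data; in the lab this would be prophet_flat
sample_words = ['the', 'chosen', 'and', 'The', 'beloved', 'a', 'dawn']

# filter(function, iterable) keeps the elements for which function returns True
sample_filtered = list(filter(word_filter, sample_words))
print(sample_filtered)  # ['chosen', 'The', 'beloved', 'dawn']
```

The capitalized `'The'` survives because the comparison is case sensitive.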

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [None]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
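
A minimal sketch of a case-insensitive version, assuming that lowercasing the input before the lookup is acceptable for this corpus. The list `sample_words` is again made up for illustration.

```python
def word_filter_case(x):
    """Return False for unwanted words regardless of capitalization."""
    word_list = ['and', 'the', 'a', 'an']
    # Normalize the input so 'The', 'THE' and 'the' are treated alike
    return x.lower() not in word_list

# Hypothetical sample data
sample_words = ['The', 'chosen', 'AND', 'beloved', 'An', 'dawn']
print(list(filter(word_filter_case, sample_words)))  # ['chosen', 'beloved', 'dawn']
```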

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [None]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string made of the two inputs separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # your code here
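
The docstring example can be reproduced with a one-line body. This is a sketch of one possible solution, not the required one.

```python
def concat_space(a, b):
    """Join two strings with a single space between them."""
    return a + ' ' + b

print(concat_space('John', 'Smith'))  # John Smith
```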

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [None]:
# your code here
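
A sketch of the reduction on a small made-up list follows. `sample_filtered` stands in for `prophet_filter`, and `sample_string` for `prophet_string`. `concat_space` is redefined here only so that the block runs on its own.

```python
from functools import reduce

def concat_space(a, b):
    """Join two strings with a single space between them."""
    return a + ' ' + b

# Hypothetical sample data; in the lab this would be prophet_filter
sample_filtered = ['chosen', 'beloved', 'dawn']

# reduce folds the list from the left:
# concat_space(concat_space('chosen', 'beloved'), 'dawn')
sample_string = reduce(concat_space, sample_filtered)
print(sample_string)  # chosen beloved dawn

# For comparison, str.join produces the same string on a non-empty list
joined = ' '.join(sample_filtered)
print(sample_string == joined)  # True
```

Without an initial value, `reduce` raises a `TypeError` on an empty list, so passing `''` as a third argument may be worth considering if the filtered list could be empty. Note that doing so adds a leading space to the result.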