# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# import reduce from functools, numpy and pandas

from functools import reduce

import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the `map()` function to clean up the words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
# Run this code:

location = '../58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing elements 0 through 567 from the `prophet` list (equivalently, keep elements 568 through the last element).

In [3]:
# Your code here:

removed_book = prophet[568:]


# The commented-out code below does not do the job: map() calls remove() on each word
# separately, so x[568:] slices the characters of every string instead of the list itself.

# def remove(x):
#     return x[568:]

# removed_book = list(map(remove, prophet))

# removed_book


#prophet
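
The difference between slicing a list and mapping a slice over it can be reproduced on a small list. This is a minimal sketch with made-up sample data (`words` and `drop_first_two` are hypothetical names, not part of the exercise): `map()` calls the function once per element, so a slice inside the function cuts characters off each word rather than items off the list.

```python
# Hypothetical sample data standing in for the list of words
words = ['alpha', 'beta', 'gamma', 'delta']

def drop_first_two(x):
    # x is a single word here, so this slices characters, not list items
    return x[2:]

# map() applies the function to every element individually
mapped = list(map(drop_first_two, words))

# Slicing the list itself drops whole elements
sliced = words[2:]

print(mapped)  # ['pha', 'ta', 'mma', 'lta']
print(sliced)  # ['gamma', 'delta']
```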

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [4]:
# Your code here:

removed_book[0:10]

['PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [5]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # Your code here:
    
    return x.split('{')[0]



Now that we have our function, use the `map()` function to apply it to every word of our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [6]:
# Your code here:

prophet_reference = list(map(reference, removed_book))
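
As a quick check, the same kind of call can be tried on a few of the words shown earlier. The sketch below redefines the function under the hypothetical name `strip_reference` so that it runs on its own. It assumes references always start with `{`, as in `'the{7}'`.

```python
def strip_reference(word):
    # Keep only the text before the first '{'; words without '{' are unchanged
    return word.split('{')[0]

sample = ['the{7}', 'chosen', 'and', 'the\nbeloved,']

# map() is lazy in Python 3, so list() is needed to materialize the result
cleaned = list(map(strip_reference, sample))
print(cleaned)  # ['the', 'chosen', 'and', 'the\nbeloved,']
```

Line breaks are left untouched at this stage; only the reference markers are removed.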

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [7]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # Your code here:
    
    return x.split('\n')


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [8]:
# Your code here:

prophet_line = list(map(line_break, prophet_reference))

prophet_line[:9]

[['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn']]
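
The first element of the output above contains an empty string because `'PROPHET\n\n|Almustafa,'` has two consecutive line breaks. One possible alternative, which is not what the exercise asks for, is `str.split()` with no argument: it splits on any run of whitespace and does not produce empty strings. A small sketch of the difference:

```python
word = 'PROPHET\n\n|Almustafa,'

# Splitting on '\n' keeps the empty string between the two line breaks
print(word.split('\n'))  # ['PROPHET', '', '|Almustafa,']

# Splitting with no argument treats any run of whitespace as one separator
print(word.split())      # ['PROPHET', '|Almustafa,']
```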

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [9]:
# Your code here:

prophet_flat = [item for sublist in prophet_line for item in sublist]

prophet_flat[:20]


['PROPHET',
 '',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had',
 'waited',
 'twelve',
 'years']
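
For comparison, the standard library offers `itertools.chain.from_iterable` as another way to flatten one level of nesting. The sketch below uses a shortened, hand-copied sample (`nested` is a hypothetical name) and checks that both approaches give the same list.

```python
from itertools import chain

nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# Nested comprehension: outer loop over sublists, inner loop over items
flat_comprehension = [item for sublist in nested for item in sublist]

# chain.from_iterable lazily concatenates the sublists
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension == flat_chain)  # True
```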

# Challenge 2 - Filtering

When printing out a few words from the book, we see words that we may not want to keep when analyzing the corpus. Below is a list of words that we would like to remove. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [10]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list and False if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # Your code here:
    
    
    return x not in word_list


In [13]:
prophet_filter = list(filter(word_filter, prophet_flat))

prophet_filter[:9]

['PROPHET',
 '',
 '|Almustafa,',
 'chosen',
 'beloved,',
 'who',
 'was',
 'dawn',
 'unto']

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.
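
`filter()` keeps exactly those elements for which the function returns `True`. The sketch below illustrates this on a short sample; `STOP_WORDS` and `keep_word` are hypothetical names, and using a set instead of a list is an assumption made here for faster membership tests, not something the exercise requires.

```python
# Hypothetical stop-word set; a set gives fast membership tests
STOP_WORDS = {'and', 'the', 'a', 'an'}

def keep_word(word):
    # True means "keep this word"
    return word not in STOP_WORDS

sample = ['the', 'chosen', 'and', 'the', 'beloved,', 'who', 'was', 'a', 'dawn']

kept = list(filter(keep_word, sample))
print(kept)  # ['chosen', 'beloved,', 'who', 'was', 'dawn']

# An equivalent list comprehension gives the same result
assert kept == [w for w in sample if keep_word(w)]
```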

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [12]:

def word_filter_case(x):
    word_list = ['and', 'the', 'a', 'an']

    # Your code here:

    # Lower-case the input so that 'The' and 'AND' are filtered out as well
    return x.lower() not in word_list

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

Our last challenge will be to write a function that takes two strings and concatenates them together with a space between the two strings.

In [14]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string made of the two inputs separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # Your code here:

    return ' '.join([a, b])

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.
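
How `reduce()` folds a list with a two-argument function can be sketched on a short list first. The names `join_with_space` and `words` are hypothetical. Note that the result of `reduce()` here is already a string, not a list.

```python
from functools import reduce

def join_with_space(a, b):
    # Combine the running result with the next word
    return a + ' ' + b

words = ['chosen', 'beloved,', 'who', 'was']

# reduce() works left to right:
# (('chosen' + ' ' + 'beloved,') + ' ' + 'who') + ' ' + 'was'
sentence = reduce(join_with_space, words)
print(sentence)  # chosen beloved, who was

# For strings, ' '.join gives the same result in a single call
assert sentence == ' '.join(words)
```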

In [15]:
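# Your code here:

# reduce() already returns a single string, so it must not be wrapped in list()
prophet_string = reduce(concat_space, prophet_filter)


In [16]:
# Show the string character by character
list(prophet_string)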

['P',
 'R',
 'O',
 'P',
 'H',
 'E',
 'T',
 ' ',
 ' ',
 '|',
 'A',
 'l',
 'm',
 'u',
 's',
 't',
 'a',
 'f',
 'a',
 ',',
 ' ',
 'c',
 'h',
 'o',
 's',
 'e',
 'n',
 ' ',
 'b',
 'e',
 'l',
 'o',
 'v',
 'e',
 'd',
 ',',
 ' ',
 'w',
 'h',
 'o',
 ' ',
 'w',
 'a',
 's',
 ' ',
 'd',
 'a',
 'w',
 'n',
 ' ',
 'u',
 'n',
 't',
 'o',
 ' ',
 'h',
 'i',
 's',
 ' ',
 'o',
 'w',
 'n',
 ' ',
 'd',
 'a',
 'y',
 ',',
 ' ',
 'h',
 'a',
 'd',
 ' ',
 'w',
 'a',
 'i',
 't',
 'e',
 'd',
 ' ',
 't',
 'w',
 'e',
 'l',
 'v',
 'e',
 ' ',
 'y',
 'e',
 'a',
 'r',
 's',
 ' ',
 'i',
 'n',
 ' ',
 'c',
 'i',
 't',
 'y',
 ' ',
 'o',
 'f',
 ' ',
 'O',
 'r',
 'p',
 'h',
 'a',
 'l',
 'e',
 's',
 'e',
 ' ',
 'f',
 'o',
 'r',
 ' ',
 'h',
 'i',
 's',
 ' ',
 's',
 'h',
 'i',
 'p',
 ' ',
 't',
 'h',
 'a',
 't',
 ' ',
 'w',
 'a',
 's',
 ' ',
 't',
 'o',
 ' ',
 'r',
 'e',
 't',
 'u',
 'r',
 'n',
 ' ',
 'b',
 'e',
 'a',
 'r',
 ' ',
 'h',
 'i',
 'm',
 ' ',
 'b',
 'a',
 'c',
 'k',
 ' ',
 't',
 'o',
 ' ',
 'i',
 's',
 'l',
 'e',
 ' '