# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# import reduce from functools, numpy and pandas

from functools import reduce

import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the `map()` function to clean up the words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [2]:
# Run this code:

location = '../58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).
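The two approaches mentioned above can be sketched on a toy list. The names `words` and `n` are made up for this illustration and are not part of the exercise data.

```python
# Toy stand-in for the word list; `words` and `n` are illustrative only.
words = ['header', 'license', 'title', 'first', 'real', 'word']
n = 3  # number of leading elements to drop

# Option 1: slicing builds a new list and leaves the original untouched.
kept_by_slice = words[n:]

# Option 2: del removes the elements in place.
trimmed = list(words)  # copy so the original stays available for comparison
del trimmed[0:n]

print(kept_by_slice)             # ['first', 'real', 'word']
print(trimmed == kept_by_slice)  # True
```

Slicing is the safer choice when the cell might be re-run, since running `del` twice on the same list would remove a second batch of elements.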

In [3]:
# Your code here:

del prophet[0:568]
prophet

['PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own\nday,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city\nof',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to\nreturn',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of\nhis',
 'birth.\n\nAnd',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh\nday',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he\nclimbed',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls\nand',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his\nship',
 'coming',
 'with',
 'the',
 'mist.\n\nThen',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung\nopen,',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.\nAnd',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the\nsilences',
 'of',
 'his',
 'soul.\n\n*****\n\nBut',
 'as',
 'he',
 'descended',
 'the',
 'hill,',
 'a',
 'sadness\

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at the elements at indices 1 through 9 (the slice `prophet[1:10]`).

In [4]:
# Your code here:
prophet[1:10]

['the{7}', 'chosen', 'and', 'the\nbeloved,', 'who', 'was', 'a', 'dawn', 'unto']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [5]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    
    # Your code here:
    return x.split('{')[0]
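To see why splitting on `{` is safe for words that carry no reference, here is a minimal sketch. The name `strip_reference` is hypothetical and only mirrors the idea of the function above.

```python
def strip_reference(word):
    # Hypothetical helper: keep only the text before the first '{'.
    return word.split('{')[0]

with_ref = strip_reference('the{7}')
without_ref = strip_reference('chosen')

print(with_ref)     # the
print(without_ref)  # chosen  (no brace, so the word comes back unchanged)

# str.partition is an alternative that splits once, at the first '{'.
head, _, _ = 'the{7}'.partition('{')
print(head)         # the
```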

Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.
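As a rough illustration of how `map()` behaves, the sketch below runs it on four words taken from the output shown earlier. The names `strip_reference`, `sample`, and `cleaned` are assumptions made for the example.

```python
def strip_reference(word):
    # Hypothetical helper: keep only the text before the first '{'.
    return word.split('{')[0]

sample = ['the{7}', 'chosen', 'and', 'the\nbeloved,']

# map() returns a lazy iterator; nothing is computed until it is consumed.
lazy = map(strip_reference, sample)

# Wrapping it in list() runs the function on every element.
cleaned = list(lazy)
print(cleaned)  # ['the', 'chosen', 'and', 'the\nbeloved,']

# A list comprehension gives the same result.
same = [strip_reference(word) for word in sample]
print(same == cleaned)  # True
```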

In [6]:
# Your code here:

prophet_reference = list(map(reference,prophet))
prophet_reference

['PROPHET\n\n|Almustafa,',
 'the',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own\nday,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city\nof',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to\nreturn',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of\nhis',
 'birth.\n\nAnd',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh\nday',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he\nclimbed',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls\nand',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his\nship',
 'coming',
 'with',
 'the',
 'mist.\n\nThen',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung\nopen,',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.\nAnd',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the\nsilences',
 'of',
 'his',
 'soul.\n\n*****\n\nBut',
 'as',
 'he',
 'descended',
 'the',
 'hill,',
 'a',
 'sadness\nca

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [7]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    
    # Your code here:
    return x.split('\n')
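A quick sketch of what splitting on `\n` returns for a few of the words seen above. It also shows where empty strings come from: two consecutive line breaks leave an empty string between them.

```python
single_break = 'the\nbeloved,'.split('\n')
double_break = 'birth.\n\nAnd'.split('\n')
no_break = 'chosen'.split('\n')

print(single_break)  # ['the', 'beloved,']
print(double_break)  # ['birth.', '', 'And']
print(no_break)      # ['chosen']

# For comparison: split() with no argument splits on any run of whitespace
# and therefore produces no empty strings.
whitespace_split = 'birth.\n\nAnd'.split()
print(whitespace_split)  # ['birth.', 'And']
```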

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [8]:
# Your code here:

prophet_line = list(map(line_break,prophet_reference))
prophet_line

[['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn'],
 ['unto'],
 ['his'],
 ['own', 'day,'],
 ['had'],
 ['waited'],
 ['twelve'],
 ['years'],
 ['in'],
 ['the'],
 ['city', 'of'],
 ['Orphalese'],
 ['for'],
 ['his'],
 ['ship'],
 ['that'],
 ['was'],
 ['to', 'return'],
 ['and'],
 ['bear'],
 ['him'],
 ['back'],
 ['to'],
 ['the'],
 ['isle'],
 ['of', 'his'],
 ['birth.', '', 'And'],
 ['in'],
 ['the'],
 ['twelfth'],
 ['year,'],
 ['on'],
 ['the'],
 ['seventh', 'day'],
 ['of'],
 ['Ielool,'],
 ['the'],
 ['month'],
 ['of'],
 ['reaping,'],
 ['he', 'climbed'],
 ['the'],
 ['hill'],
 ['without'],
 ['the'],
 ['city'],
 ['walls', 'and'],
 ['looked'],
 ['seaward;'],
 ['and'],
 ['he'],
 ['beheld'],
 ['his', 'ship'],
 ['coming'],
 ['with'],
 ['the'],
 ['mist.', '', 'Then'],
 ['the'],
 ['gates'],
 ['of'],
 ['his'],
 ['heart'],
 ['were'],
 ['flung', 'open,'],
 ['and'],
 ['his'],
 ['joy'],
 ['flew'],
 ['far'],
 ['over'],
 ['the'],
 ['sea.',

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings, so we now have a list of lists. Flatten it using a list comprehension and assign the new list to `prophet_flat`.
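A minimal sketch of flattening on a small nested list. The variable names are illustrative, and `itertools.chain.from_iterable` is shown only as an alternative, not as what the exercise asks for.

```python
from itertools import chain

nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# Read the comprehension left to right like nested for-loops:
# for each sublist in nested, for each word in that sublist, keep the word.
flat = [word for sublist in nested for word in sublist]
print(flat)  # ['PROPHET', '', '|Almustafa,', 'the', 'the', 'beloved,']

# Standard-library alternative that yields the same elements.
flat_chain = list(chain.from_iterable(nested))
print(flat_chain == flat)  # True
```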

In [9]:
# Your code here:

prophet_flat = [word for sublist in prophet_line for word in sublist]
prophet_flat

['PROPHET',
 '',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city',
 'of',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to',
 'return',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of',
 'his',
 'birth.',
 '',
 'And',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh',
 'day',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he',
 'climbed',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls',
 'and',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his',
 'ship',
 'coming',
 'with',
 'the',
 'mist.',
 '',
 'Then',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung',
 'open,',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.',
 'And',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the',
 'silences',
 'of',
 'his',
 'soul.',
 '',
 '*****',
 '',
 'But',

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [11]:
# The word_list in the template is empty, so use the English stop words from the NLTK library instead.
import nltk
nltk.download()

from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
print(stops)

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
{'if', 'all', 'is', 'of', 'herself', 'against', "you'll", 'themselves', 'we', 'doing', 'no', "that'll", 'by', 'a', 'same', 'while', 'more', 'can', 'hasn', 'between', 'having', 'won', 'there', 'o', 'why', 'he', "mightn't", 'needn', 'below', 'off', 'myself', 'she', 'just', 'does', 'too', 'am', 't', 'are', 'where', 'was', 'very', 'had', 'after', 'himself', 'on', 'because', 'my', 'in', 'm', 'ain', 'wouldn', "she's", 'then', 'her', 'do', "weren't", 'an', 'their', 'isn', 'at', 'own', 's', "wouldn't", "it's", 'once', 'them', 'this', 'over', 'couldn', 'did', 'wasn', 'when', 'down', 'hadn', "haven't", 'before', 'so', 'above', 'yourself', "won't", 'you', "hadn't", 'some', 'theirs', 'each', 'it', 'yourselves', 'what', 'being', 'or', 'until', 'again', 'will', "you've", 'ma', 'whom', 'these', 'as', 'during', "hasn't", "shan't", "don't", 'y', 'his', 'but', 'ours', 'haven', 'don', 'your', 'been', 'both', "mustn't", 'him'

In [12]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list and False if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = []
    
    # Your code here:
    
    word_list = ["she's", "it's", 'themselves', 's', 't', 'didn', 'been', 'so', 'on', 'doing', 'why', 'll', 'they', 'its', 'each', 'should', "needn't", 'haven', 'an', "shouldn't", "don't", 'to', 'ma', 'those', 'theirs', "didn't", 'is', 'himself', 'herself', 'about', 'don', 'mightn', 'm', 'does', 'no', 'if', 'was', 'just', 'you', 'how', 'such', "couldn't", 'shan', 'isn', 'with', 'will', 'couldn', 'off', 'his', 'more', 'your', 'while', 'am', 'these', 'because', 'shouldn', 'at', 'now', 'him', 'she', 'where', 'further', 're', 'them', 'that', "you're", 'wouldn', 'before', 'yourself', 'which', 'he', 'were', 'into', "you've", 'wasn', 've', 'had', 'weren', 'we', 'very', "shan't", 'up', 'our', 'hers', 'nor', 'then', 'only', 'hasn', 'between', "mustn't", 'but', 'once', 'yourselves', 'out', 'can', 'yours', 'when', 'under', 'here', 'again', 'are', 'other', 'mustn', 'me', 'during', 'for', 'against', 'both', 'do', "weren't", 'have', 'not', 'i', 'most', 'ain', "aren't", "haven't", 'too', 'than', "hadn't", 'in', 'doesn', 'same', 'below', "you'd", 'down', "isn't", 'or', 'myself', 'has', 'a', 'having', 'did', 'the', "doesn't", "you'll", 'some', 'and', "that'll", 'o', 'it', 'being', 'from', 'this', 'whom', 'as', 'any', "wouldn't", 'aren', 'after', 'ourselves', 'of', 'all', 'd', 'through', 'there', 'their', 'over', "should've", 'itself', 'hadn', "mightn't", 'above', 'few', 'needn', 'won', 'her', "wasn't", 'my', 'who', 'until', "won't", 'ours', 'own', "hasn't", 'by', 'what', 'y', 'be']
    
    # Every entry in word_list is already lowercase, so lowercasing x is enough.
    return x.lower() not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.
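As a small sketch of how `filter()` uses a predicate, the example below works on a tiny made-up stop-word set. The names `stop_words`, `keep_word`, and `sample` are assumptions for the illustration.

```python
stop_words = {'and', 'the'}  # tiny illustrative set, not the full list

def keep_word(word):
    # Predicate for filter(): True means "keep this element".
    return word not in stop_words

sample = ['the', 'chosen', 'and', '', 'beloved,']

kept = list(filter(keep_word, sample))
print(kept)  # ['chosen', '', 'beloved,']

# The equivalent list comprehension.
equivalent = [word for word in sample if keep_word(word)]
print(equivalent == kept)  # True
```

The empty string survives because it is not in the stop-word set, which is why `''` entries can still appear in a filtered list.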

In [13]:
prophet_filter = list(filter(word_filter,prophet_flat))
prophet_filter

['PROPHET',
 '',
 '|Almustafa,',
 'chosen',
 'beloved,',
 'dawn',
 'unto',
 'day,',
 'waited',
 'twelve',
 'years',
 'city',
 'Orphalese',
 'ship',
 'return',
 'bear',
 'back',
 'isle',
 'birth.',
 '',
 'twelfth',
 'year,',
 'seventh',
 'day',
 'Ielool,',
 'month',
 'reaping,',
 'climbed',
 'hill',
 'without',
 'city',
 'walls',
 'looked',
 'seaward;',
 'beheld',
 'ship',
 'coming',
 'mist.',
 '',
 'gates',
 'heart',
 'flung',
 'open,',
 'joy',
 'flew',
 'far',
 'sea.',
 'closed',
 'eyes',
 'prayed',
 'silences',
 'soul.',
 '',
 '*****',
 '',
 'descended',
 'hill,',
 'sadness',
 'came',
 'upon',
 'him,',
 'thought',
 'heart:',
 '',
 'shall',
 'go',
 'peace',
 'without',
 'sorrow?',
 'Nay,',
 'without',
 'wound',
 'spirit',
 'shall',
 'leave',
 'city.',
 '',
 'days',
 'pain',
 'spent',
 'within',
 'walls,',
 'long',
 'nights',
 'aloneness;',
 'depart',
 'pain',
 'aloneness',
 'without',
 'regret?',
 '',
 'many',
 'fragments',
 'spirit',
 'scattered',
 'streets,',
 'many',
 'children',
 

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [14]:

def word_filter_case(x):
    # Your code here:

    # word_filter above already lowercases its input before the lookup,
    # so the case-insensitive version can simply delegate to it.
    return word_filter(x)
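To make the difference between a case-sensitive and a case-insensitive filter concrete, here is a hedged sketch on a made-up word set. `keep_exact` and `keep_any_case` are hypothetical names, and `str.casefold()` is used here as a stricter variant of `lower()`.

```python
stop_words = frozenset({'and', 'the'})  # illustrative, all lowercase

def keep_exact(word):
    # Case-sensitive: only an exact match is filtered out.
    return word not in stop_words

def keep_any_case(word):
    # Case-insensitive: normalise the input before the lookup.
    return word.casefold() not in stop_words

sample = ['And', 'THE', 'the', 'John']

exact = list(filter(keep_exact, sample))
any_case = list(filter(keep_any_case, sample))

print(exact)     # ['And', 'THE', 'John']
print(any_case)  # ['John']
```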

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

Our last challenge will be to write a function that takes two strings and concatenates them together with a space between the two strings.

In [15]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string made of the two inputs separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # Your code here:
    
    return a + ' ' + b

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.
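The sketch below traces `reduce()` on the first few filtered words. The name `join_with_space` is a stand-in for the function above, and `sample` is a short excerpt chosen for the illustration.

```python
from functools import reduce

def join_with_space(a, b):
    # Stand-in for the concatenation function above.
    return a + ' ' + b

sample = ['PROPHET', '', '|Almustafa,', 'chosen']

# reduce() folds the list from the left:
# step 1: 'PROPHET' + ' ' + ''  -> 'PROPHET '
# step 2: 'PROPHET ' + ' ' + '|Almustafa,' -> 'PROPHET  |Almustafa,'
# step 3: ... + ' ' + 'chosen'
joined = reduce(join_with_space, sample)
print(repr(joined))  # 'PROPHET  |Almustafa, chosen'

# str.join produces the same string in one call.
print(' '.join(sample) == joined)  # True
```

The double space after `PROPHET` comes from the empty string in the list. Outside of an exercise on `reduce()`, `' '.join(...)` is usually the more idiomatic way to build the string.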

In [16]:
# Your code here:

prophet_string = reduce(concat_space,prophet_filter)

In [17]:
prophet_string

'PROPHET  |Almustafa, chosen beloved, dawn unto day, waited twelve years city Orphalese ship return bear back isle birth.  twelfth year, seventh day Ielool, month reaping, climbed hill without city walls looked seaward; beheld ship coming mist.  gates heart flung open, joy flew far sea. closed eyes prayed silences soul.  *****  descended hill, sadness came upon him, thought heart:  shall go peace without sorrow? Nay, without wound spirit shall leave city.  days pain spent within walls, long nights aloneness; depart pain aloneness without regret?  many fragments spirit scattered streets, many children longing walk naked among hills, cannot withdraw without burden ache.  garment cast day, skin tear hands.  thought leave behind me, heart made sweet hunger thirst.  *****  Yet cannot tarry longer.  sea calls things unto calls me, must embark.  stay, though hours burn night, freeze crystallize bound mould.  Fain would take here. shall I?  voice cannot carry tongue  lips gave wings. Alone mus