# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [45]:
# import reduce from functools, numpy and pandas
from functools import reduce
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up a words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [3]:
# load text
filename = '58585-0.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
# split based on words only
import re
words = re.split(r'\W+', text)
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['', 'the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'prophet', 'by', 'kahlil', 'gibran', 'this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'www', 'gutenberg', 'org', 'if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'll', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook', 'title', 'the', 'prophet', 'author']


#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [10]:
prophet_raw = words[393:]

['the', 'prophet', 'almustafa', 'the', '7', 'chosen', 'and', 'the', 'beloved', 'who']


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [11]:
print(prophet_raw[0:10])

['the', 'prophet', 'almustafa', 'the', '7', 'chosen', 'and', 'the', 'beloved', 'who']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [20]:
prophet = [x for x in prophet_raw if not x.isdigit()]

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`

In [21]:
print(prophet[0:10])

['the', 'prophet', 'almustafa', 'the', 'chosen', 'and', 'the', 'beloved', 'who', 'was']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [8]:
# List Prophet already runs as a string-only one. 

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [None]:
# List Prophet already runs as a string-only one. 

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [11]:
# List Prophet already runs as a string-only one. 

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [28]:
from nltk.corpus import stopwords 

stop_words = set(stopwords.words('english')) 

prophet_filtered = [w for w in prophet if not w in stop_words]

print(prophet_filtered[0:20])

['prophet', 'almustafa', 'chosen', 'beloved', 'dawn', 'unto', 'day', 'waited', 'twelve', 'years', 'city', 'orphalese', 'ship', 'return', 'bear', 'back', 'isle', 'birth', 'twelfth', 'year']


Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [13]:
prophet_filtered = [w for w in prophet if not w in stop_words]

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [31]:
prophet_string = reduce(lambda x, y: x + y + ' ', prophet_filtered)

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [32]:
print(prophet_flattened)

esires lies silent knowledge beyond like seeds dreaming beneath snow heart dreams spring trust dreams hidden gate eternity fear death trembling shepherd stands king whose hand laid upon honour shepherd joyful beneath trembling shall wear mark king yet mindful trembling die stand naked wind melt sun cease breathing free breath restless tides may rise expand seek god unencumbered drink river silence shall indeed sing reached mountain top shall begin climb earth shall claim limbs shall truly dance evening almitra seeress said blessed day place spirit spoken answered spoke also listener descended steps temple people followed reached ship stood upon deck facing people raised voice said people orphalese wind bids leave less hasty wind yet must go wanderers ever seeking lonelier way begin day ended another day sunrise finds us sunset left us even earth sleeps travel seeds tenacious plant ripeness fullness heart given wind scattered brief days among briefer still words spoken voice fade ears l

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to use the apply function to a dataframe and transform all cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [35]:
pm25 = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')

Let's look at the data using the `head()` function.

In [37]:
pm25.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [42]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    # Your code here:
    return x/24

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [60]:
pm25_numerics = pm25[['Iws', 'Is', 'Ir']]

pm25_hourly = np.apply_along_axis(hourly, axis = 0, arr = pm25_numerics)

print(pm25_hourly[0:10])

[[0.07458333 0.         0.        ]
 [0.205      0.         0.        ]
 [0.27958333 0.         0.        ]
 [0.41       0.         0.        ]
 [0.54041667 0.         0.        ]
 [0.67083333 0.         0.        ]
 [0.80125    0.         0.        ]
 [0.87583333 0.         0.        ]
 [1.00625    0.         0.        ]
 [1.13666667 0.         0.        ]]


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the length of a column minus 1. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [66]:
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # Your code here:
    return (x.std() / pm25_numerics.count()) - 1
    

In [67]:
np.apply_along_axis(sample_sd, axis = 0, arr = pm25_numerics)

array([[-0.99885884, -0.99998265, -0.99996769],
       [-0.99885884, -0.99998265, -0.99996769],
       [-0.99885884, -0.99998265, -0.99996769]])