# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [1]:
# Import reduce from functools, plus numpy and pandas

from functools import reduce

import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up the words in a book.

In the following cell, we will read a text file containing the book The Prophet by Kahlil Gibran.

In [2]:
# Run this code:

location = '../58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')
prophet[:10]

['\ufeffThe',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'The',
 'Prophet,',
 'by',
 'Kahlil',
 'Gibran\n\nThis']

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [3]:
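# Your code here:
# Number each word from 1, then keep the pairs whose number exceeds 567.
# Note: with start=1 this keeps list position 567 onward, one element more
# than the slice prophet[568:] described above would keep.
list_index = [(index, item) for index, item in enumerate(prophet, start=1)]
prophet2 = list(filter(lambda items: items[0] > 567, list_index))
prophet2[:10]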



[(568, 'Farewell................92\n\n\n\n\nTHE'),
 (569, 'PROPHET\n\n|Almustafa,'),
 (570, 'the{7}'),
 (571, 'chosen'),
 (572, 'and'),
 (573, 'the\nbeloved,'),
 (574, 'who'),
 (575, 'was'),
 (576, 'a'),
 (577, 'dawn')]
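
For comparison, the instruction above (keep elements 568 through the last element) can also be met with a plain slice. This is a minimal sketch on a hypothetical stand-in list; `words` and `trimmed` are illustrative names and not part of the notebook.

```python
# Hypothetical stand-in for the word list: 600 numbered tokens.
words = [f"w{i}" for i in range(600)]

# Keep elements 568 through the last one, dropping positions 0..567.
trimmed = words[568:]

print(trimmed[0])    # w568
print(len(trimmed))  # 32
```

A slice avoids building the intermediate list of (number, word) pairs, so there is nothing to unzip afterwards.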

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [4]:
# Your code here:
# Unzip the (number, word) pairs; `prophet` becomes a tuple of words.
index, prophet = zip(*prophet2)
prophet[:10]

('Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn')

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [5]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    y = x
    if '{' in x:
        i = x.find('{')
        y = x[0:i]
    return y
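
The text above describes splitting on the `{` character, while the cell uses `find()` and slicing. As a hedged alternative, the same result can be sketched with `str.split`; `strip_reference` is a hypothetical name used only for this illustration.

```python
def strip_reference(word):
    """Return the part of `word` before the first '{' (illustrative helper)."""
    # maxsplit=1 stops after the first '{'; element 0 is the text before it,
    # or the whole word when no '{' is present.
    return word.split('{', 1)[0]

print(strip_reference('the{7}'))  # the
print(strip_reference('chosen'))  # chosen
```

Both versions leave words without a reference unchanged, so either can be passed to `map()`.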

Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Assign the resulting list to a new variable called `prophet_reference`.

In [6]:
# Your code here:
prophet_reference = list(map(reference, prophet))
prophet_reference[:10]


['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn']
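
In Python 3, `map()` returns a lazy iterator rather than a list, which is why the cell above wraps the call in `list()`. A small sketch on a hypothetical sample (`sample_words` is not part of the notebook):

```python
sample_words = ['the{7}', 'chosen', 'and{12}']

# map() only records the function and the iterable; nothing runs yet.
mapped = map(lambda w: w.split('{', 1)[0], sample_words)
print(type(mapped).__name__)  # map

# Wrapping the iterator in list() consumes it and materializes the results.
cleaned = list(mapped)
print(cleaned)  # ['the', 'chosen', 'and']
```

The iterator is exhausted after one pass, so keep the list if the results are needed more than once.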

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [7]:
def line_break(x):
    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character
        
    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    # Your code here:
    return x.split('\n')

print(line_break('PROPHET\n\n|Almustafa,'))

    

['PROPHET', '', '|Almustafa,']


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [8]:
# Your code here:
prophet_line = list(map(line_break, prophet_reference))
prophet_line[:10]


[['Farewell................92', '', '', '', '', 'THE'],
 ['PROPHET', '', '|Almustafa,'],
 ['the'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [9]:
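# Your code here:
# Flatten the nested lists and drop the empty strings left by consecutive
# line breaks. Testing `if element` keeps one-letter words such as 'a'.
prophet_flat = [element for lista in prophet_line for element in lista if element]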
prophet_flat[:10]


['Farewell................92',
 'THE',
 'PROPHET',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who']
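
The flattening step can also be sketched with `itertools.chain.from_iterable` from the standard library. This is an alternative to the nested list comprehension, shown on a hypothetical sample (`nested` and `flat` are illustrative names).

```python
from itertools import chain

# Hypothetical sample shaped like prophet_line: a list of lists of strings.
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# chain.from_iterable walks the inner lists one after another;
# the `if word` test drops the empty strings left by line breaks.
flat = [word for word in chain.from_iterable(nested) if word]

print(flat)  # ['PROPHET', '|Almustafa,', 'the', 'the', 'beloved,']
```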

# Challenge 2 - Filtering

When printing out a few words from the book, we see words that we may not want to keep if we analyze the text corpus. Below is a list of words that we would like to remove. Create a function that returns `False` if its input is one of the words in the list and `True` otherwise.

In [46]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list and False if the word is in the list
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    word_list = ['and', 'the', 'a', 'an']
    
    # Your code here:
    # Lower-case the input so that 'The' and 'the' are treated alike.
    return x.lower() not in word_list

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [59]:
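# Note: this filters the raw `prophet` tuple, so '{7}' references and line
# breaks survive below; pass `prophet_flat` to filter the cleaned words.
prophet_filter = list(filter(word_filter, prophet))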
prophet_filter[:10]

['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'the\nbeloved,',
 'who',
 'was',
 'dawn',
 'unto',
 'his']
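
As a sketch of the filtering step on already-cleaned words, the predicate can test membership in a set instead of a list. `STOP_WORDS`, `keep_word`, and `sample` are hypothetical names chosen for this illustration.

```python
# A frozenset gives constant-time membership tests and cannot be modified.
STOP_WORDS = frozenset({'and', 'the', 'a', 'an'})

def keep_word(word):
    """Return True when `word` is not a stop word, ignoring case."""
    return word.lower() not in STOP_WORDS

sample = ['The', 'chosen', 'and', 'the', 'beloved,', 'a', 'dawn']

# filter() keeps only the items for which the predicate returns True.
kept = list(filter(keep_word, sample))
print(kept)  # ['chosen', 'beloved,', 'dawn']
```

Words that still carry punctuation, such as `'beloved,'`, are compared as-is, so `'the,'` would not be removed by this predicate.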

# Bonus Challenge - Part 1

Rewrite the `word_filter` function above to not be case sensitive.

In [48]:
def word_filter_case(x):
    word_list = ['and', 'the', 'a', 'an']
    # Your code here:
    # Compare in lower case so that capitalization does not matter.
    return x.lower() not in word_list
    

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [69]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    
    # Your code here:
    return a + ' ' + b

# name = concat_space('Lucio','Gutierrez')    
# name

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [71]:
# Your code here:
# `reduce` was already imported in the first cell; pass the function directly.
prophet_string = reduce(concat_space, prophet_filter)
prophet_string[:100]


'Farewell................92\n\n\n\n\nTHE PROPHET\n\n|Almustafa, the{7} chosen the\nbeloved, who was dawn unto'
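
To illustrate what `reduce()` does here, the sketch below folds a short hypothetical word list from left to right and compares the result with `str.join`, which produces the same string. The `words` list is illustrative only.

```python
from functools import reduce

def concat_space(a, b):
    """Join two strings with a single space between them."""
    return a + ' ' + b

words = ['who', 'was', 'dawn', 'unto', 'his']

# reduce() combines the running result with the next item at each step:
# (((who + was) + dawn) + unto) + his
via_reduce = reduce(concat_space, words)
via_join = ' '.join(words)

print(via_reduce)              # who was dawn unto his
print(via_reduce == via_join)  # True
```

Without an initial value, `reduce()` raises a `TypeError` on an empty list, whereas `' '.join([])` returns an empty string.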

# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a DataFrame and transform all of its cells.

To do this, we will load a dataset below and then write a function that will perform the transformation.

In [5]:
# Run this code:
import pandas as pd
# The dataset below contains information about pollution from PM2.5 particles in Beijing 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
pm25 = pd.read_csv(url)

Let's look at the data using the `head()` function.

In [6]:
# Your code here:
pm25.head(10)


   No  year  month  day  hour  pm2.5  DEWP   TEMP    PRES  cbwd    Iws  Is  Ir
0   1  2010      1    1     0    NaN   -21  -11.0  1021.0    NW   1.79   0   0
1   2  2010      1    1     1    NaN   -21  -12.0  1020.0    NW   4.92   0   0
2   3  2010      1    1     2    NaN   -21  -11.0  1019.0    NW   6.71   0   0
3   4  2010      1    1     3    NaN   -21  -14.0  1019.0    NW   9.84   0   0
4   5  2010      1    1     4    NaN   -20  -12.0  1018.0    NW  12.97   0   0
5   6  2010      1    1     5    NaN   -19  -10.0  1017.0    NW  16.10   0   0
6   7  2010      1    1     6    NaN   -19   -9.0  1017.0    NW  19.23   0   0
7   8  2010      1    1     7    NaN   -19   -9.0  1017.0    NW  21.02   0   0
8   9  2010      1    1     8    NaN   -19   -9.0  1017.0    NW  24.15   0   0
9  10  2010      1    1     9    NaN   -20   -8.0  1017.0    NW  27.28   0   0


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [9]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    # Your code here:
    return x / 24

# hourly(100)

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [12]:
# Your code here:
# Select the three columns, then divide every cell by 24 with apply().
df = pm25[['Iws', 'Is', 'Ir']]
print(df[:10])
pm25_hourly = df.apply(hourly)
print(pm25_hourly[:10])

     Iws  Is  Ir
0   1.79   0   0
1   4.92   0   0
2   6.71   0   0
3   9.84   0   0
4  12.97   0   0
5  16.10   0   0
6  19.23   0   0
7  21.02   0   0
8  24.15   0   0
9  27.28   0   0
        Iws   Is   Ir
0  0.074583  0.0  0.0
1  0.205000  0.0  0.0
2  0.279583  0.0  0.0
3  0.410000  0.0  0.0
4  0.540417  0.0  0.0
5  0.670833  0.0  0.0
6  0.801250  0.0  0.0
7  0.875833  0.0  0.0
8  1.006250  0.0  0.0
9  1.136667  0.0  0.0
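
With its default settings, `DataFrame.apply` passes each column to the function as a Series, so `hourly` divides a whole column at once. The sketch below uses a small hypothetical frame (`sample` is not the Beijing dataset) and checks that plain vectorized division gives the same result.

```python
import pandas as pd

def hourly(x):
    """Divide a value (or a whole Series) by 24."""
    return x / 24

# Hypothetical three-row frame with the same column names as above.
sample = pd.DataFrame({
    'Iws': [1.79, 4.92, 24.0],
    'Is': [0, 0, 48],
    'Ir': [0, 24, 0],
})

applied = sample.apply(hourly)  # hourly() receives one column per call
vectorized = sample / 24        # same arithmetic without apply()

print(applied.equals(vectorized))  # True
print(applied.loc[2, 'Is'])        # 2.0
```

For simple arithmetic like this, the vectorized form is usually the more direct choice; `apply()` becomes useful when the function does more than one elementwise operation.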


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the number of elements in the column minus 1, that is, std / (n - 1). Since we are using pandas, do not use the `len()` function; one alternative is `count()`. Also, use the NumPy version of standard deviation (`np.std`).
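
A minimal sketch of the computation described above, assuming the intended formula is the NumPy (population) standard deviation divided by `count() - 1`. The name `sample_sd_sketch` is hypothetical, and dropping missing values before calling `np.std` is an assumption made so that the numerator and `count()` cover the same elements.

```python
import numpy as np
import pandas as pd

def sample_sd_sketch(series):
    """Population standard deviation divided by (number of non-null values - 1)."""
    # np.std uses ddof=0 by default, unlike pandas' Series.std (ddof=1).
    values = series.dropna().to_numpy()
    # The parentheses matter: without them, 1 would be subtracted from the
    # quotient instead of from the count.
    return np.std(values) / (series.count() - 1)

result = sample_sd_sketch(pd.Series([1, 2, 3, 4]))
print(round(result, 10))  # 0.3726779962
```

For `[1, 2, 3, 4]` the population standard deviation is about 1.118034, and dividing by 3 gives the value quoted in the docstring example.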

In [40]:
import numpy as np
import pandas as pd
def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # Your code here:
    # np.std defaults to the population standard deviation (ddof=0); the
    # parentheses make the division apply to (count - 1) as a whole.
    return np.std(x) / (x.count() - 1)
# print(sample_sd(pd.Series([1,2,3,4])))
    

In [41]:
df = pd.DataFrame(pm25, columns=['Iws','Is','Ir'])
print(df[:10])
print(df.apply(sample_sd))

     Iws  Is  Ir
0   1.79   0   0
1   4.92   0   0
2   6.71   0   0
3   9.84   0   0
4  12.97   0   0
5  16.10   0   0
6  19.23   0   0
7  21.02   0   0
8  24.15   0   0
9  27.28   0   0
Iws    0.001141
Is     0.000017
Ir     0.000032
dtype: float64
