# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [3]:
# Import reduce from functools, numpy and pandas
import pandas as pd
import numpy as np
from functools import reduce

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [4]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [9]:
# your code here
import itertools
# islice returns a lazy iterator over elements 567 onward; `prophet` itself is not modified
itertools.islice(prophet, 567, len(prophet))
# Element 567 is kept because it contains both information about the book and the start of the book itself

<itertools.islice at 0x1ec603401d8>
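
The `islice` call above builds a lazy iterator and leaves `prophet` itself unchanged. A minimal sketch of dropping leading elements with a plain slice follows; the short `words` list and the cutoff `n_front` are made-up stand-ins for `prophet` and 568.

```python
# Hypothetical stand-in for `prophet`: two front-matter words, then the text
words = ['Title', 'Contents', 'Almustafa,', 'the', 'chosen']
n_front = 2  # assumed number of leading words to drop (568 in the exercise)

# Slicing returns a new list that starts at index n_front
book_words = words[n_front:]
print(book_words)  # ['Almustafa,', 'the', 'chosen']
```

Reassigning the slice to the original name (for example `words = words[n_front:]`) would make every later cell see the shortened list.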

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [10]:
# your code here
list(itertools.islice(prophet,567,577))

['Farewell................92\n\n\n\n\nTHE',
 'PROPHET\n\n|Almustafa,',
 'the{7}',
 'chosen',
 'and',
 'the\nbeloved,',
 'who',
 'was',
 'a',
 'dawn']

#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [11]:
# your code here
import re
def reference(x):
    # Split on a single-digit reference such as {7}; returns a list of the pieces
    return re.split(r"\{[0-9]\}", x)
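
The instructions ask for only the part of the word before `{`. A minimal sketch of that approach, using the hypothetical name `strip_reference` so it does not clash with `reference` above:

```python
def strip_reference(word):
    # str.split always returns at least one piece, so [0] is safe
    # even when the word contains no '{'
    return word.split('{')[0]

print(strip_reference('the{7}'))   # the
print(strip_reference('chosen'))   # chosen
```

This version returns a string rather than a list, and it also handles references with more than one digit, such as `'self{12}'`.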

Now that we have our function, use the `map()` function to apply it to our book, The Prophet. Assign the result to a new variable called `prophet_reference`.

In [13]:
# your code here
prophet_reference = map(reference, prophet)
list(itertools.islice(prophet_reference,567,577))

[['Farewell................92\n\n\n\n\nTHE'],
 ['PROPHET\n\n|Almustafa,'],
 ['the', ''],
 ['chosen'],
 ['and'],
 ['the\nbeloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn']]
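
In Python 3, `map()` returns a one-shot iterator, not a list, so each slice taken with `islice` consumes part of it. A small sketch on made-up data shows the effect; wrapping the call in `list()` is one way to get a result that can be sliced repeatedly.

```python
import itertools

words = ['a{1}', 'b', 'c{2}', 'd']  # made-up sample data

lazy = map(lambda w: w.split('{')[0], words)
first_two = list(itertools.islice(lazy, 0, 2))  # consumes 'a' and 'b'
next_two = list(itertools.islice(lazy, 0, 2))   # continues from 'c'

# Materializing once gives a list that can be sliced any number of times
cleaned = list(map(lambda w: w.split('{')[0], words))
print(first_two, next_two, cleaned[0:2])
```

The second `islice` call starts where the first one stopped, which is why repeated previews of the same `map` object can show different words.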

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [14]:
def line_break(x):
    # Split the word on newline characters; returns a list of the pieces
    return re.split("\n", x)


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [16]:
# your code here
# Applied to `prophet` here, so references such as {7} are still present in the output
prophet_line = map(line_break, prophet)
list(itertools.islice(prophet_line,567,577))

[['Farewell................92', '', '', '', '', 'THE'],
 ['PROPHET', '', '|Almustafa,'],
 ['the{7}'],
 ['chosen'],
 ['and'],
 ['the', 'beloved,'],
 ['who'],
 ['was'],
 ['a'],
 ['dawn']]

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [18]:
# your code here
def flattened(items):
    # Flatten one level: collect every element of every item in `items`
    return [y for x in items for y in x]

# Mapping over `prophet_line` passes one inner list of words at a time,
# so each word is broken into its characters
prophet_flat = map(flattened, prophet_line)
list(itertools.islice(prophet_flat,567,577))

[['h', 'a', 'r', 'v', 'e', 's', 't', ','],
 ['i', 'n'],
 ['w', 'h', 'a', 't', 'f', 'i', 'e', 'l', 'd', 's'],
 ['h', 'a', 'v', 'e'],
 ['I'],
 ['s', 'o', 'w', 'e', 'd'],
 ['t', 'h', 'e'],
 ['s', 'e', 'e', 'd', ','],
 ['a', 'n', 'd'],
 ['i', 'n', 'w', 'h', 'a', 't']]
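
The output above shows individual characters because `flattened` is mapped over each inner list, so it iterates over the characters of every word. A minimal sketch of flattening one level with a single list comprehension, on made-up data shaped like `prophet_line`:

```python
# Made-up list of lists, shaped like the output of line_break
nested = [['PROPHET', '', '|Almustafa,'], ['the'], ['the', 'beloved,']]

# The outer loop walks the sublists, the inner loop walks the words in each
flat = [word for sublist in nested for word in sublist]
print(flat)
# ['PROPHET', '', '|Almustafa,', 'the', 'the', 'beloved,']
```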

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words we may not want to keep if we analyze the text as a corpus. Below is a list of words that we would like to remove. Create a function that returns `False` if its input is one of the words in that list and `True` otherwise.

In [19]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here
    return x not in word_list

# filter() is lazy: nothing is evaluated until the iterator is consumed
prophet_filter = filter(word_filter, prophet)
# Note: this previews `prophet_line`, not `prophet_filter`
list(itertools.islice(prophet_line,567,577))

[['you'],
 ['from'],
 ['your', 'husks.', '', 'He'],
 ['grinds'],
 ['you'],
 ['to'],
 ['whiteness.', '', 'He'],
 ['kneads'],
 ['you'],
 ['until']]

Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.
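
A minimal sketch of this step on a short made-up list (`sample` is an assumption; the exercise uses the full word list from the book). The helper is redefined here only so the block runs on its own.

```python
def word_filter(x):
    # True keeps the word, False drops it
    word_list = ['and', 'the', 'a', 'an']
    return x not in word_list

sample = ['the', 'chosen', 'and', 'beloved', 'The']

# filter() is lazy, so list() is used to materialize the result
kept = list(filter(word_filter, sample))
print(kept)  # ['chosen', 'beloved', 'The']
```

`'The'` survives because the comparison is case sensitive.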

# Bonus Challenge

Rewrite the `word_filter` function above so that it is not case sensitive.

In [20]:
def word_filter_case(x):
    word_list = ['and', 'the', 'a', 'an']
    # word_list is all lowercase, so lowercasing the input makes the
    # comparison treat uppercase and lowercase words the same way
    x_lower = x.lower()
    return x_lower not in word_list

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [21]:
def concat_space(a, b):
    '''
    Input: Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    # your code here
    # str() guards against non-string inputs before joining
    return " ".join([str(a), str(b)])

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [22]:
# your code here
# No initial value is passed, so the first word seeds the result
prophet_string = reduce(concat_space, prophet_filter)
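
`reduce` accepts an optional third argument that seeds the result; with a seed of `0`, the output string would begin with `'0 '`. A small sketch on made-up words compares the two calls. `concat_space` is redefined only to keep the block self-contained.

```python
from functools import reduce

def concat_space(a, b):
    # Join two values with a single space
    return " ".join([str(a), str(b)])

words = ['who', 'was', 'dawn']  # made-up sample

with_initial = reduce(concat_space, words, 0)
without_initial = reduce(concat_space, words)
print(repr(with_initial))     # '0 who was dawn'
print(repr(without_initial))  # 'who was dawn'
```

For plain string joining, `' '.join(words)` gives the same result as the unseeded call.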


# Challenge 4 - Applying Functions to DataFrames

#### Our next step is to apply a function to a dataframe with `apply()` and transform all cells.

To do this, we will connect to Ironhack's database and retrieve the data from the *pollution* database. Select the *beijing_pollution* table and retrieve its data.

In [225]:
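# your code here
import requests
import io
# Download the CSV copy of the data hosted by the UCI Machine Learning Repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00381/PRSA_data_2010.1.1-2014.12.31.csv"
s = requests.get(url).content
# Decode the raw bytes and parse them as CSV
beijing_pollution = pd.read_csv(io.StringIO(s.decode('utf-8')))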

Let's look at the data using the `head()` function.

In [228]:
# your code here
beijing_pollution.head()

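|   | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir |
|---|----|------|-------|-----|------|-------|------|------|------|------|-----|----|----|
| 0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
| 1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
| 2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
| 3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
| 4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |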


The next step is to create a function that divides a cell by 24 to produce an hourly figure. Write the function below.

In [253]:
def hourly(x):
    '''
    Input: A numerical value
    Output: The value divided by 24
        
    Example:
    Input: 48
    Output: 2.0
    '''
    
    # your code here
    # Divide numeric values (Python or NumPy scalars) by 24; return anything else unchanged
    if isinstance(x, (int, float, np.integer, np.floating)):
        return x / 24
    return x

Apply this function to the columns `Iws`, `Is`, and `Ir`. Store this new dataframe in the variable `pm25_hourly`.

In [254]:
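# your code here
# With axis=0, apply passes each whole column (a Series) to `hourly`, which returns
# non-scalar inputs unchanged; the result is also not assigned to `pm25_hourly`
beijing_pollution.apply(hourly, axis=0)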



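|   | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir |
|---|----|------|-------|-----|------|-------|------|------|------|------|-----|----|----|
| 0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
| 1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
| 2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
| 3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
| 4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 43819 | 43820 | 2014 | 12 | 31 | 19 | 8.0 | -23 | -2.0 | 1034.0 | NW | 231.97 | 0 | 0 |
| 43820 | 43821 | 2014 | 12 | 31 | 20 | 10.0 | -22 | -3.0 | 1034.0 | NW | 237.78 | 0 | 0 |
| 43821 | 43822 | 2014 | 12 | 31 | 21 | 10.0 | -22 | -3.0 | 1034.0 | NW | 242.70 | 0 | 0 |
| 43822 | 43823 | 2014 | 12 | 31 | 22 | 8.0 | -22 | -4.0 | 1034.0 | NW | 246.72 | 0 | 0 |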
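
The values above are unchanged from the original table. A minimal sketch of applying a per-cell function to selected columns and storing the result, on a tiny made-up frame (the numbers are illustrative, not taken from the dataset). `Series.map` is used so the function receives one scalar at a time, and `hourly` is a simplified stand-in for the function defined earlier.

```python
import pandas as pd

def hourly(x):
    # Simplified stand-in: divide a number by 24
    return x / 24

# Tiny made-up frame with the three column names from the exercise
df = pd.DataFrame({'cbwd': ['NW', 'NW'], 'Iws': [24.0, 48.0],
                   'Is': [0, 24], 'Ir': [0, 0]})

cols = ['Iws', 'Is', 'Ir']
# DataFrame.apply hands each column to the lambda; Series.map then calls hourly per cell
pm25_hourly = df[cols].apply(lambda col: col.map(hourly))
print(pm25_hourly)
```

Because the operation is plain division, `df[cols] / 24` would be an equivalent vectorized alternative.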


#### Our last challenge will be to create an aggregate function and apply it to a select group of columns in our dataframe.

Write a function that returns the standard deviation of a column divided by the number of values in the column minus 1, that is, `std / (n - 1)`. Since we are using pandas, do not use the `len()` function. One alternative is to use `count()`. Also, use the numpy version of standard deviation.

In [None]:
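def sample_sd(x):
    '''
    Input: A Pandas series of values
    Output: the standard deviation divided by the number of elements in the series minus 1
        
    Example:
    Input: pd.Series([1,2,3,4])
    Output: 0.3726779962
    '''
    
    # your code here
    # np.std uses the population formula (ddof=0) by default;
    # count() gives the number of non-null values
    return np.std(x) / (x.count() - 1)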