# SI 330: Homework 5: APIs on AWS


## Due: Friday, October 19, 2018, 11:59:00pm

### Submission instructions
After completing this homework, you will turn in two files via Canvas -> Assignments -> HW 5:
- your notebook, named `si330-hw5-YOUR_UNIQUE_NAME.ipynb`, and
- the HTML file, named `si330-hw5-YOUR_UNIQUE_NAME.html`.

### Name:  Samantha Cohen
### Uniqname: samcoh
### People you worked with: Rhea, Emil, and Will

## Top-Level Goal
To create a microservice that returns the counts of all bigrams in a text passage.



## Learning Objectives
After completing this Lab, you should know how to:
* create an AWS Lambda function that takes a string and returns the counts of all bigrams in that text
* write an AWS API Gateway integration which allows both GET and POST requests to access an AWS Lambda
* write documentation for the microservice that you've created



### Outline of Steps For Analysis
Here's an overview of the steps that you'll need to do to complete this lab.
1. Upload data to an S3 bucket
2. Create an AWS Lambda function that normalizes, tokenizes, and creates and counts bigrams from text, both via a POST request with the text and via a GET request to a URL that returns the text (e.g. an S3 bucket)
3. Create a Python code block in this notebook to demonstrate the functionality of your microservice

Each of these steps is detailed below.

## Part 1: Upload data to an S3 bucket
To get ready to test the GET functionality of the code you generate in the next step, you should upload a text file that is **500 or fewer lines** to an S3 bucket.  See the [description of cross-origin resource sharing (CORS)](https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS) for an explanation of why we want to put the data in the same domain (`amazonaws.com`) as the Lambda.

Follow the same approach that we used in the lab to upload a small text file to your S3 bucket, ensuring that the permissions are set to allow public access.

### Q1: Enter the URL of your text file


My URL: https://s3.amazonaws.com/si330f18-hw5-samcoh/write-up.txt

## Part 2: Create an AWS Lambda function that normalizes, tokenizes, and creates and counts bigrams of words from text

Similar to what we did in the lab, you're going to create a microservice that consists of two parts: an AWS Lambda and an API Gateway.  You can use exactly the same technique that we did in the lab to get started.

You will need to modify the code in the Lambda to handle two types of requests:
1. A GET request with a queryStringParameter of `url=http://some.url.goes.here/text.txt`, which specifies the location of the text to be processed; AND
2. A POST request with the text to be processed included as the `"text"` value in the body payload.
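
For local testing, the two request types can be mimicked with hand-built event dictionaries. The following is a minimal sketch, assuming the handler reads only `httpMethod`, `queryStringParameters`, and `body` from the event, as the starter code does. The helper names `make_get_event`, `make_post_event`, and `extract_source` are hypothetical, and a real API Gateway event carries many more fields.

```python
import json


def make_get_event(url):
    """Build a minimal stand-in for a GET event (hypothetical helper)."""
    return {"httpMethod": "GET", "queryStringParameters": {"url": url}, "body": None}


def make_post_event(text):
    """Build a minimal stand-in for a POST event (hypothetical helper)."""
    # The body arrives as a JSON string, not as a dictionary.
    return {"httpMethod": "POST", "queryStringParameters": None,
            "body": json.dumps({"text": text})}


def extract_source(event):
    """Return ('url', value) for GET, ('text', value) for POST, else (None, None)."""
    method = event.get("httpMethod")
    if method == "GET":
        params = event.get("queryStringParameters") or {}
        if "url" in params:
            return "url", params["url"]
    elif method == "POST":
        body = json.loads(event.get("body") or "{}")
        if "text" in body:
            return "text", body["text"]
    return None, None


get_event = make_get_event("http://some.url.goes.here/text.txt")
post_event = make_post_event("The quick brown fox")
print(extract_source(get_event))
print(extract_source(post_event))
```

Separating "where does the text come from" from "what do we do with the text" keeps the bigram logic identical for both request types.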

Both types of requests should return a response whose body contains a JSON-encoded dictionary with a key called `bigram_counts`, which should contain a dictionary mapping each bigram onto the count of how many times it appears in the text. **NOTE:** Because JSON objects can only have keys that are strings, you should encode bigrams as the two words separated by a single space, not as tuples.

For example, if the following text were given to this microservice:

> The quick brown fox jumped over the quick brown squirrel.

It should return a response whose `body` is a JSON-encoded version of this dictionary:

```javascript
{
  "bigram_counts": {
    "the quick": 2,
    "quick brown": 2,
    "brown fox": 1,
    "fox jumped": 1,
    "jumped over": 1,
    "over the": 1,
    "brown squirrel": 1
  }
}
```
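
As a rough illustration of the string-key encoding, the sketch below counts bigrams for the sentence with "jumped" (matching the dictionary above) and serializes the result. The function name `count_bigrams` and the `\w+` tokenization are assumptions chosen for brevity; other definitions of "word" are equally valid and may give slightly different counts.

```python
import json
import re
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs, keyed as 'first second' strings."""
    # Lowercase, then keep runs of word characters (one possible tokenization).
    words = re.findall(r"\w+", text.lower())
    pairs = zip(words[:-1], words[1:])
    # Joining each pair with a space gives a string key that JSON accepts.
    return dict(Counter(" ".join(pair) for pair in pairs))


counts = count_bigrams("The quick brown fox jumped over the quick brown squirrel.")
body = json.dumps({"bigram_counts": counts})
print(body)
```

Tuple keys such as `("the", "quick")` would make `json.dumps` raise a `TypeError`, which is why the pairs are joined into strings before serializing.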


### The following code block is a reasonable starting point for creating your Lambda.  Note that this code should not be run in this notebook but rather serve as the starting point for your work in the Lambda editor.

**NOTE** Please see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python for hints about how to create bigrams without NLTK.

In [None]:
"""
PUT SOME DOCUMENTATION HERE
"""
import json
import re
from botocore.vendored import requests # We've added this line in addition to what was in the lab.
# You'll need to figure out how to use this `requests` module yourself --- it works similar to the 
# `requests` module that you've used before. 
    
def lambda_handler(event, context):
    method = event['httpMethod']

    text = ''
    # d holds our response variables
    d = {"bigram_counts": {}}
    
    if method == 'GET':
        # handle GET method
        params = event['queryStringParameters']
        if params and 'url' in params:
            # retrieve the text from the URL passed in the query string
            text = requests.get(params['url']).text
    if method == 'POST':
        # handle POST method
        body = json.loads(event['body'])
        if 'text' in body:
            text = body['text']
    # split into sentences so that bigrams do not span sentence boundaries
    text = text.strip().split(".")
    normalized_text = [x.strip().lower() for x in text if x != ""]
    tokenized_words = [] 
    for word in normalized_text:
        w = re.findall(r'\w+', word)  # raw string avoids an invalid-escape warning
        tokenized_words.append(" ".join(w)) 
    #print(tokenized_words)
    bigrams = [b for l in tokenized_words for b in zip(l.strip().split(" ")[:-1], l.strip().split(" ")[1:])]
    for first,second in bigrams: 
        combined = first +" "+ second
        if combined not in d["bigram_counts"]:
            d["bigram_counts"][combined] = 0 
        d["bigram_counts"][combined] += 1 
    #print(bigrams)
    #normalized_text = [x.strip().lower() for x in text]
    #print(normalized_text)
    #text = url.strip().split(" ")
    #normalized_text = [x.strip().lower() for x in text if x != ""]
    #tokenized_words = [] 
    #for word in normalized_text:
        #w = re.compile('\w+').search(word).group(0)
        #tokenized_words.append(w) 
    
    #normalized_text = [x.strip().lower() for x in url.strip().split(".")]
            
    # 1. normalize
    
    # 2. tokenize
    # NOTE: there are many ways to do this depending on your definition of "word", 
    # they may yield slightly different results --- that is okay. The method below
    # has the nice property that words with apostrophes in them are not broken up.
    
    # 3. find bigrams
    # NOTE: see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python
    #       for hints about how to create bigrams without nltk
    
    # 4. count bigrams
    
    # 5. return response
    # Note the strict format of the return dictionary
    # It must contain these three elements, and the body
    # must be a stringified JSON object (i.e. you have to call 
    # json.dumps on the JSON structure you're returning)
    return { 
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
   }

### Q2.1: Enter the URL of your Lambda

Put your Lambda's URL here: https://dfnz7e4j1m.execute-api.us-east-1.amazonaws.com/default/count_bigrams

### Q2.2: Copy your final Lambda code into the following code block (but do not run it here)

In [None]:
"""
This module normalizes and tokenizes words and returns a dictionary of bigram counts. 
"""
import json
import re
from botocore.vendored import requests # We've added this line in addition to what was in the lab.
# You'll need to figure out how to use this `requests` module yourself --- it works similar to the 
# `requests` module that you've used before.

    
def lambda_handler(event, context):
    method = event['httpMethod']  # the event object always carries the HTTP method (GET or POST)

    text = ''
    # d holds our response variables
    d = {"bigram_counts": {}}
    
    if method == 'GET':
        # handle GET method
        params = event['queryStringParameters']  # may be None when no query string is sent
        if params and 'url' in params:
            # fetch the text from the URL passed in the query string
            text = requests.get(params["url"]).text
            
    if method == 'POST':
        # handle POST method
        body = json.loads(event['body'])  # the body arrives as a JSON string
    
        if 'text' in body:
            text = body["text"]  # the text to process is stored under the "text" key
    
    # 1. normalize      
    
    normalized_text = text.strip().lower().split()
    
    # 2. tokenize
    # NOTE: there are many ways to do this depending on your definition of "word", 
    # they may yield slightly different results --- that is okay. The method below
    # has the nice property that words with apostrophes in them are not broken up.
    
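    tokenized_words = [] 
    for word in normalized_text:
        # keep the first run of letters, digits, underscores, and hyphens
        w = re.findall(r'[\w-]+', word)
        if w:  # skip chunks with no word characters (avoids an IndexError)
            tokenized_words.append(w[0])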
    
    # 3. find bigrams
    # NOTE: see https://stackoverflow.com/questions/21844546/forming-bigrams-of-words-in-list-of-sentences-with-python
    #       for hints about how to create bigrams without nltk
    bigrams = list(zip(tokenized_words[:-1], tokenized_words[1:]))
    
    # 4. count bigrams
    for first,second in bigrams: 
        combined = first +" "+ second
        if combined not in d["bigram_counts"]:
            d["bigram_counts"][combined] = 0 
        d["bigram_counts"][combined] += 1 
        
    # 5. return response
    # Note the strict format of the return dictionary
    # It must contain these three elements, and the body
    # must be a stringified JSON object (i.e. you have to call 
    # json.dumps on the JSON structure you're returning)
    return { 
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(d),
   }

## Part 3: Demonstrate the GET and POST functionality of your Lambda

### Q3.1: Create a code block that uses `requests` to demonstrate the `GET` functionality of your Lambda.

In [None]:
import requests
url = "https://dfnz7e4j1m.execute-api.us-east-1.amazonaws.com/default/count_bigrams"
response = requests.get(url,params = {"url": "https://s3.amazonaws.com/si330f18-hw5-samcoh/write-up.txt"})

print(response.headers)
print()
print(response.text)


### Q3.2: Create a code block that uses `requests` to demonstrate the `POST` functionality of your Lambda.

In [1]:
import requests
import json 
url = "https://dfnz7e4j1m.execute-api.us-east-1.amazonaws.com/default/count_bigrams"
response = requests.post(url, data = json.dumps({"text": "The quick brown fox jumped over the quick brown squirrel."}))
# The Lambda's POST handler expects a JSON-formatted dictionary whose "text" value is the text to parse.
print(response.headers)
print()
print(response.text)

{'Date': 'Wed, 07 Nov 2018 18:10:08 GMT', 'Content-Type': 'application/json', 'Content-Length': '140', 'Connection': 'keep-alive', 'x-amzn-RequestId': '5c2fdb85-e2b8-11e8-b158-d5536e1ec0cb', 'x-amz-apigw-id': 'QAOUDFIuoAMFbVA=', 'X-Amzn-Trace-Id': 'Root=1-5be32a80-607fdc8498474378d97471c0;Sampled=0'}

{"bigram_counts": {"the quick": 2, "quick brown": 2, "brown fox": 1, "fox jumped": 1, "jumped over": 1, "over the": 1, "brown squirrel": 1}}
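
The response body is a JSON string, so it can be decoded back into a dictionary for further work. The sketch below is one possible way to rank the bigrams. It uses a literal copy of the POST output above in place of `response.text`, so it runs without calling the Lambda; the variable names are illustrative.

```python
import json

# Response text copied from the POST output above (stands in for response.text).
response_text = (
    '{"bigram_counts": {"the quick": 2, "quick brown": 2, "brown fox": 1, '
    '"fox jumped": 1, "jumped over": 1, "over the": 1, "brown squirrel": 1}}'
)

counts = json.loads(response_text)["bigram_counts"]

# Sort by count (descending), then alphabetically for a stable order.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for bigram, count in ranked[:3]:
    print(bigram, count)
```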


## BONUS

**BONUS.1** Break the microservice into three separate ones (normalizing, tokenizing, and counting bigrams). Paste your code for each one here (clearly labelled), followed by a code block that executes them in succession, passing the results from each previous step into the next step.
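
One way to think about the three-stage pipeline is as three functions that each accept and return a JSON string, so that the output of one stage is the input of the next. The sketch below uses plain local functions as stand-ins for the three Lambdas; it is not deployed code. The function names, the `"tokens"` key, and the tokenizing pattern are assumptions made for illustration.

```python
import json
import re
from collections import Counter


def normalize(payload):
    """Stage 1: lowercase and trim the text."""
    text = json.loads(payload)["text"]
    return json.dumps({"text": text.strip().lower()})


def tokenize(payload):
    """Stage 2: split normalized text into word tokens."""
    text = json.loads(payload)["text"]
    # "tokens" is a hypothetical key name for the intermediate result.
    return json.dumps({"tokens": re.findall(r"[\w'-]+", text)})


def count_bigrams(payload):
    """Stage 3: count adjacent token pairs."""
    tokens = json.loads(payload)["tokens"]
    pairs = zip(tokens[:-1], tokens[1:])
    counts = Counter(" ".join(pair) for pair in pairs)
    return json.dumps({"bigram_counts": dict(counts)})


request = json.dumps(
    {"text": "  The quick brown fox jumped over the quick brown squirrel. "}
)
# Each stage consumes the JSON string produced by the previous one.
result = json.loads(count_bigrams(tokenize(normalize(request))))
print(result["bigram_counts"]["the quick"])
```

With real Lambdas, each function call would become an HTTP request, and the JSON string would travel in the request and response bodies.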

In [None]:
# normalizing Lambda

In [None]:
# tokenizing Lambda

In [None]:
# counting bigrams Lambda

In [None]:
# example code using all three Lambdas

**BONUS.2** Re-write the last step in **BONUS.1** to generate arbitrary *n*-grams instead of bigrams. This microservice should take an additional parameter, `n`, which determines what size of *n*-gram it will count. Instead of returning a single value (`"bigram_counts"`), it should now return two values: `"counts"` (the dictionary of n-gram counts), and `"n"`, the value of `n` input into the microservice.
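
The bigram idiom `zip(tokens[:-1], tokens[1:])` generalizes to `n` staggered slices. The following is a minimal sketch of that idea as a local function, assuming the tokens have already been produced by an earlier step; the name `ngram_counts` is hypothetical, and wiring it into a Lambda handler is left out.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams in a token list; returns the 'counts' and 'n' values."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip over n staggered slices yields every run of n adjacent tokens.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return {"counts": dict(counts), "n": n}


tokens = "the quick brown fox jumped over the quick brown squirrel".split()
result = ngram_counts(tokens, 3)
print(result["n"], result["counts"]["the quick brown"])
```

Because `zip` stops at the shortest slice, a token list shorter than `n` simply produces an empty `counts` dictionary rather than an error.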

In [None]:
# ngram Lambda

In [None]:
# example code using ngram Lambda