# Data Visualization
## Lab 3: Acquiring and Processing Data

### Name: Rudd Fawcett

**Directions:** Throughout this notebook, add markdown cells as needed in order to document your process of trying to understand it.  I don't expect to just see three perfect answers and no explanations.  Tell me what you are trying!  In the end, make sure your cell and kernel linearities line up (that is, make sure that your final product can be followed from top to bottom).

Our goal is to gain a bit more familiarity with the way you might generate datasets from APIs or from a larger CSV.

In [3]:
import random
import json

import numpy as np
import pandas as pd
import requests
import re

### Google Maps Geocoding API

Below is a function that generates a random Latitude and Longitude in Wyoming (it's a [particularly square state](https://www.mapsofworld.com/usa/states/wyoming/wyoming-maps/wyoming-lat-long-map.jpg)).  **Your first goal** is to use the Google Maps Geocoding API to create a dataset of 10 random locations in Wyoming and the town and zip code they lie in.  For example:
```
[{"lat": "44.952055", "lon": "-107.67753", "town": "Parkman", "zip": "82838"}, ...]
```

To do this: 

1. [Get a Google Maps Geocoding API key](https://developers.google.com/maps/documentation/geocoding/get-api-key) -- It's free and quick, just tell them the name of your "app" (it can be anything)
2. [Take a look at how to use the geocoding API](https://developers.google.com/maps/documentation/geocoding/start) -- You're looking for the process they call "reverse geocoding"
3. Build your request
4. Interpret the result: it's a JSON response with tons of extra data, so let me help you: you'll need `your_request_response_object.json()`.  Just grab the first result's `address_components` and dive into that array, or look for the result's `formatted_address` and find the town and zip code in the resulting string (try `my_string.split(',')`).
5. Save the data as a [JSON file](https://docs.python.org/2/library/json.html) (or [use pandas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html), possibly easier)
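
Steps 4 and 5 can be sketched roughly as follows. The `sample_response` dictionary is a hand-written stand-in shaped like a reverse geocoding reply (it is not real API output), and `extract_town_and_zip` is a hypothetical helper name. The sketch looks components up by their `types` entry rather than by position, on the assumption that the number and order of components can differ from one result to the next.

```python
import json

def extract_town_and_zip(response_json):
    """Return (town, zip) from the first result, or (None, None) if absent."""
    results = response_json.get("results", [])
    if not results:
        return None, None
    # Map each component type (e.g. "locality") to its long name.
    by_type = {}
    for component in results[0]["address_components"]:
        for kind in component["types"]:
            by_type[kind] = component["long_name"]
    return by_type.get("locality"), by_type.get("postal_code")

# Hand-written sample, assumed to mirror the API's response shape.
sample_response = {
    "results": [{
        "address_components": [
            {"long_name": "Parkman", "short_name": "Parkman",
             "types": ["locality", "political"]},
            {"long_name": "Wyoming", "short_name": "WY",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "82838", "short_name": "82838",
             "types": ["postal_code"]},
        ],
    }],
}

town, zip_code = extract_town_and_zip(sample_response)
record = {"lat": "44.952055", "lon": "-107.67753", "town": town, "zip": zip_code}

# json.dumps gives the string that json.dump would write to a file.
print(json.dumps([record]))
```

A random point far from any town may have no `locality` or `postal_code` component at all, which is why the helper returns `None` instead of raising.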

In [106]:
def generate_lat_long():
    """ Generate a random latitude and longitude in Wyoming. 
    Bounds: 41°N to 45°N and 104.05°W to 111.05°W
    """
    return "{},{}".format(round(random.uniform(41, 45), 5), round(random.uniform(-111.05,-104.05), 5))

In [97]:
API_KEY = 'AIzaSyC-YrFjOAtNoqTFwdy9OXfoVp7V_TXEWts'

locations = []

# Loop through this 10 times
for number in range(10):
    location = {}
    # Generate a latitude and longitude
    lat_long = generate_lat_long()
    # String interpolation equivalent in Python
    path = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}&key={}'.format(lat_long, API_KEY)
    # GET request to Google API
    response = requests.get(path)
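    # Grab the first result
    data = response.json()['results'][0]
  
    # Index the address components by type rather than by position,
    # since the order and number of components vary between results.
    components = {t: c['long_name'] for c in data['address_components'] for t in c['types']}

    # Setting the keys and values on my location dictionary.
    location['lat'] = data['geometry']['location']['lat']
    location['lon'] = data['geometry']['location']['lng']
    # .get() returns None when a point has no town or zip code.
    location['town'] = components.get('locality')
    # Keep the zip code as a string, as in the example output.
    location['zip'] = components.get('postal_code')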
    
    # Add the location to the array
    locations.append(location)

# Open/create a locations.json file in write mode
with open('locations.json', 'w') as output:
    # Dump JSON into file
    json.dump(locations, output)
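
The request URL above is assembled with `str.format`. As a small sketch of an alternative, the query string can be percent-encoded with the standard library, so that characters such as the comma in `latlng` are escaped. `build_geocode_url` and the `YOUR_API_KEY` placeholder are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(lat_long, api_key):
    """Build a reverse geocoding URL with the query string percent-encoded."""
    query = urlencode({"latlng": lat_long, "key": api_key})
    return "{}?{}".format(BASE_URL, query)

url = build_geocode_url("44.95205,-107.67753", "YOUR_API_KEY")
print(url)
```

Passing the same dictionary as `params=` to `requests.get(BASE_URL, params=...)` has `requests` do this encoding itself.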

### Slicing a dataset using pandas

[This CSV](http://introcs.cs.princeton.edu/java/data/bnc-wordfreq.csv) contains word frequencies in a subset of the British National Corpus, a 100-million-word collection of written and spoken English.

Questions/tasks for you:
1. How many words are in this dataset?
2. Construct the dataset consisting of all nouns whose frequency is greater than 20000 and which contain an "`ag`" in them.  Some hints: use pandas [boolean slicing](https://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing); pandas can help you with [string containment](https://pandas.pydata.org/pandas-docs/stable/text.html#testing-for-strings-that-match-or-contain-a-pattern), too.
3. Construct the dataset of all the prime-number-indexed rows.  Use [`.loc`](https://pandas.pydata.org/pandas-docs/stable/indexing.html#basics)

Both of the datasets may be just constructed in memory, _i.e._ no need to save them to a file.
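
Tasks 2 and 3 might be sketched like this on a toy frame. The column names are assumed to match the CSV header, and the rows are illustrative only (the last one is invented). Each condition gets its own boolean mask, including one for the part of speech, and the masks are combined with `&`.

```python
import pandas as pd

# Toy rows using the assumed column layout of the CSV.
toy = pd.DataFrame({
    "RANK": [399, 428, 462, 9999],
    "FREQUENCY": [25340, 23497, 22117, 150],
    "WORD": ["age", "agree", "language", "sage"],
    "PART OF SPEECH": ["n", "v", "n", "n"],
})

# One boolean mask per condition; strip() guards against padded labels.
is_noun = toy["PART OF SPEECH"].str.strip() == "n"
is_frequent = toy["FREQUENCY"] > 20000
has_ag = toy["WORD"].str.contains("ag", regex=False)

frequent_ag_nouns = toy[is_noun & is_frequent & has_ag]
print(frequent_ag_nouns)

# .loc selects rows by index label; with the default index these are 0, 1, 2, ...
prime_rows = toy.loc[[2, 3]]
print(prime_rows)
```

Here "agree" is dropped because it is a verb and "sage" because its frequency is too low, so only the two frequent nouns containing "ag" remain.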

In [5]:
# Read file
word_freq = pd.read_csv('bnc-wordfreq.csv')

# Get total number of lines
count = len(word_freq)
print(count)

# Construct data set where frequency > 20000 and the word contains 'ag'
nouns_ag = word_freq[(word_freq['FREQUENCY'] > 20000) & (word_freq['WORD'].str.contains('ag'))]
print(nouns_ag)

# Prime method taken from: https://stackoverflow.com/a/1801446/6669540
def is_prime(n):
    """Returns True if n is prime."""
    # 0 and 1 are not prime (and negative numbers never are)
    if n < 2:
        return False
    if n == 2:
        return True
    if n == 3:
        return True
    if n % 2 == 0:
        return False
    if n % 3 == 0:
        return False

    i = 5
    w = 2

    while i * i <= n:
        if n % i == 0:
            return False

        i += w
        w = 6 - w

    return True

prime_nums = []

for n in range(count):
    if is_prime(n):
        prime_nums.append(n)


nouns_prime_idx = word_freq.loc[prime_nums]
print(nouns_prime_idx)

6318
      RANK  FREQUENCY        WORD PART OF SPEECH
135    160      59829       again            adv
136    169      56208     against           prep
137    399      25340         age              n
148    428      23497       agree              v
3118   462      22117    language              n
3354   470      21884  management              n
5301   503      20586       stage              n
      RANK  FREQUENCY          WORD PART OF SPEECH
2     5204       1110         abbey              n
3      966      10468       ability              n
5     6277        809      abnormal              a
7     5085       1154     abolition              n
11    3341       2139         above              a
13     786      12889         above           prep
17    4266       1504        absent              a
19    1651       5782    absolutely            adv
23    5655        966        absurd              a
29    5188       1114    accelerate       
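
As an alternative to testing each index separately, the prime indices can be generated in one pass with a sieve of Eratosthenes. This is a minimal sketch, and `primes_below` is a hypothetical helper name, not part of the lab.

```python
def primes_below(limit):
    """Return all primes p with 2 <= p < limit, using a sieve."""
    if limit < 3:
        return []
    is_candidate = [True] * limit
    is_candidate[0] = is_candidate[1] = False  # 0 and 1 are not prime
    p = 2
    while p * p < limit:
        if is_candidate[p]:
            # Cross out every multiple of p, starting at p * p.
            for multiple in range(p * p, limit, p):
                is_candidate[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_candidate) if flag]

print(primes_below(30))
```

Assuming the frame keeps its default integer index, `word_freq.loc[primes_below(len(word_freq))]` would then select the prime-indexed rows.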

### Text Analysis

The task of taking text data and making it usable is tricky and can sometimes be time consuming.  Today, we'll keep it pretty simple.

1. First, open the text file [`raven.txt`](https://cs.andover.edu/~nzufelt/dataviz/raven.txt), and copy its contents into a single string.
2. Then, remove any character that isn't a letter, `\n`, or (perhaps!) punctuation.  (hint: `"a" in "cat"` is `True` in Python, whereas `"&" in "cat"` is `False`.)
3. `split` the text by "sentence" (more likely by line for this particular text file).  The "sentences" will become the rows of your dataset, and the occurrence of certain words will be your columns.  It might help to further `split` your sentences by word.
4. Create a dataset from this based upon whether the words `of`, `nothing`, `raven`, and/or `chamber` appear in each sentence: each entry in your dataset will be `0` (this word/column **not** in this sentence/row) or `1` (this word/column appears in this sentence/row).
5. Output your dataset to a CSV.
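
The steps above can be sketched with only the standard library. `word_indicator_rows` is a hypothetical helper, and the sample text is two lines quoted from the poem plus one made-up line. Matching is done on whole lowercased words, an assumption made so that "Raven" counts as `raven` and "often" does not count as `of`.

```python
import csv
import io
import re

TARGET_WORDS = ["of", "nothing", "raven", "chamber"]

def word_indicator_rows(text):
    """Return one dict per non-empty line with 0/1 flags for each target word."""
    rows = []
    line_number = 0
    for line in text.split("\n"):
        # Keep only letters and whitespace, then lowercase and split into words.
        words = re.sub(r"[^a-zA-Z\s]", "", line).lower().split()
        if not words:
            continue  # skip stanza breaks and blank lines
        line_number += 1
        row = {"line": line_number}
        for target in TARGET_WORDS:
            row[target] = 1 if target in words else 0
        rows.append(row)
    return rows

# Two lines from the poem and one invented line for the substring case.
sample = 'Quoth the Raven, "Nevermore."\n\nOnly this and nothing more.\nHe often spoke.\n'
rows = word_indicator_rows(sample)

# Write the rows as CSV to an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["line"] + TARGET_WORDS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Swapping the `io.StringIO()` buffer for `open('some_file.csv', 'w', newline='')` would write the same rows to disk.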

In [107]:
with open("raven.txt", "r") as f:
    text = f.read()

# Remove every character that isn't a letter or whitespace
regex = re.compile(r'[^a-zA-Z\s]')
# Replace stanza breaks (blank lines) with single newlines
clean_text = regex.sub('', text).replace('\n\n', '\n')
# Split the text into a list of lines
lines = clean_text.split('\n')

rows = []

# Go through the lines and grab their index and value
for index,line in enumerate(lines):
    # Lowercase and split into words so that "Raven" matches 'raven'
    # and 'of' doesn't match inside longer words such as "often"
    words = line.lower().split()
    row = {
        # Make line standard line number (aka 1,2,3 rather than 0,1,2,etc.)
        'line': index+1,
        # If "of" occurs as a word in the line, set it to 1
        'of_occurs': 1 if 'of' in words else 0,
        'nothing_occurs': 1 if 'nothing' in words else 0,
        'raven_occurs': 1 if 'raven' in words else 0,
        'chamber_occurs': 1 if 'chamber' in words else 0
    }
    
    # Add the row to rows
    rows.append(row)

text_data = pd.DataFrame(rows)
# Set the CSV index to be line rather than pandas' default one
text_data.set_index('line', inplace=True)
# Dump data frame to CSV file
text_data.to_csv('raven-text-analysis.csv')