# GA Data Science 16 (DAT16) - Lab2
### The Beginning of the Data Pipeline

Justin Breucop

## Lab goals

- Flow Control: Mastering the Waves
- Python packages: writing your own functions and classes
- Kimono Labs: an external tool for acquiring data

## Flow Control: Catching your rhythm
To review: Python offers two keywords for building loops. `for` runs over an iterable data type (lists, dictionaries, etc.), while `while` keeps repeating as long as a condition holds.

Let's make sure we remember this process.

In [77]:
a = [5,9,3,10,1,2,27]
for val in a:
    print val

5
9
3
10
1
2
27


In [78]:
i = 0
while i < 5:
    i += 1
    print "iteration",i

iteration 1
iteration 2
iteration 3
iteration 4
iteration 5


In [79]:
for val in a:
    if val < 9:
        print val
    else:
        print "Too much!"

5
Too much!
3
Too much!
1
2
Too much!


If/else statements are a great way to insert logical checks or allow for alternatives.
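When there are more than two alternatives, `elif` chains extra checks between `if` and `else`. A minimal sketch, reusing the list `a` from above; the cutoffs 9 and 20 and the labels are made up for illustration, and `print` is called with parentheses so the snippet runs under Python 2 or 3.

```python
a = [5, 9, 3, 10, 1, 2, 27]
labels = []

for val in a:
    if val < 9:
        labels.append("small")
    elif val < 20:          # only checked when the first test fails
        labels.append("medium")
    else:                   # everything that failed both tests
        labels.append("large")

print(labels)
```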

In [80]:

for i,val in enumerate(a):

    if val < 9:
        print val
    
    if i == 3:
        break

5
3


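`break` exits the loop immediately. Here the loop stops at index 3, so only the first four values are checked and only 5 and 3 are printed.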
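To make the mechanics explicit: `enumerate` yields `(index, value)` pairs, and once `break` runs, the remaining items are never visited. A small sketch that records what the loop actually sees (the `visited` list is added only for illustration):

```python
a = [5, 9, 3, 10, 1, 2, 27]
visited = []

for i, val in enumerate(a):
    visited.append((i, val))   # remember each (index, value) pair we reach
    if i == 3:
        break                  # leave the loop; indices 4, 5 and 6 are skipped

print(visited)
```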

## Python Packages and the Magic Within
Python has a large number of packages, libraries, modules, functions, etc. and you'll hear these terms thrown around a lot. Here we'll define our vocabulary.

In [1]:
import numpy as np

Here, we are importing the package numpy, which lets us reference the modules underneath it. Modules are the text files that define the functions and classes we can call. The keyword `as` gives the numpy package a shortened alias. This is common practice, but ultimately your choice. By and large, the Python community sticks to standard aliases (because programmers are lazy, and that's a good thing).

Note: some people refer to packages in Python as "libraries". This is fine; don't judge them.
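The same import forms work for any package. A brief sketch using the standard-library `math` module, chosen here only because it ships with Python; the alias `m` is arbitrary.

```python
import math             # full name: math.sqrt
import math as m        # alias: m.sqrt
from math import sqrt   # pull a single function into the current namespace

# All three names point at the same function object
print(math.sqrt(16))
print(m.sqrt(16))
print(sqrt(16))
```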

Let's see what the documentation has to say about a specific function in np:

In [2]:
help(np.array)

Help on built-in function array in module numpy.core.multiarray:

array(...)
    array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)
    
    Create an array.
    
    Parameters
    ----------
    object : array_like
        An array, any object exposing the array interface, an
        object whose __array__ method returns an array, or any
        (nested) sequence.
    dtype : data-type, optional
        The desired data-type for the array.  If not given, then
        the type will be determined as the minimum type required
        to hold the objects in the sequence.  This argument can only
        be used to 'upcast' the array.  For downcasting, use the
        .astype(t) method.
    copy : bool, optional
        If true (default), then the object is copied.  Otherwise, a copy
        will only be made if __array__ returns a copy, if obj is a
        nested sequence, or if a copy is needed to satisfy any of the other
        requirements (`dtype`, `order`, etc.).

Note that this function lives in a module named `numpy.core.multiarray`. The periods indicate namespacing, much like the separators in a file path: `multiarray` sits inside `core`, which sits inside `numpy`.
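The same dotted naming shows up in the standard library. A short sketch using `json` as an example: the class `JSONDecoder` is defined in the module `decoder`, which lives inside the package `json`.

```python
import json.decoder

# Each dot steps one level down: package -> module -> class
decoder_class = json.decoder.JSONDecoder

# __module__ records the dotted name of the module that defines the class
print(decoder_class.__module__)   # json.decoder
```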

## Writing Functions and Classes: making your own tools

Functions are extremely important because they let us avoid repeating work, preserve code, and even build more complex functions out of simpler ones. We will explore the syntax here.

In [3]:
def square(x):
    result = x*x
    return result

square(9)

81

You can even put loops in your function:

In [6]:
def factorial(x):
    start = 1
    for value in range(1,x+1):
        start = start*value
    return start

factorial(10)

3628800
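As an example of functions serving as building blocks, here is a sketch of a hypothetical `choose` function (not part of the lab) that counts the ways to pick `k` items out of `n` by reusing `factorial`:

```python
def factorial(x):
    start = 1
    for value in range(1, x + 1):
        start = start * value
    return start

def choose(n, k):
    # n! / (k! * (n - k)!); floor division is safe because the result is a whole number
    return factorial(n) // (factorial(k) * factorial(n - k))

print(choose(5, 2))
```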

### Exercise 1.1
Write a function that subtracts two values.

In [8]:
print result

NameError: name 'result' is not defined
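The error above is about scope: a name assigned inside a function exists only while that function runs. A minimal sketch, assuming a `square` function like the one earlier; the variable is renamed `local_result` so it cannot clash with anything already defined in your session.

```python
def square(x):
    local_result = x * x    # local_result lives only inside square
    return local_result

answer = square(9)          # the returned value is what survives the call

try:
    local_result            # the name is not defined out here
    visible = True
except NameError:
    visible = False

print(answer)
print(visible)
```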

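A name like `result` that is assigned inside a function exists only within that function, which is why printing it here fails. But wait, you say: those functions are related! And what if I want to recall the result and interact with a changing object? Well, we'll group them into a class called `calculator`.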

BIG TAKEAWAY: Use a class for object-oriented programming. If you have an object whose state you want to preserve and modify with several functions, a class offers a compact way to do that.

In [52]:
class calculator():
    result = 0
    def square(self):
        val = self.result
        self.result = val*val
        return self.result
    
    def factorial(self):
        
        val = 1
        for n in range(2,self.result+1):
            val = val*n
        self.result = val
        return self.result
    
    def add(self,y):
        self.result += y
        return self.result
    


In [54]:
hermes = calculator()
hermes.add(10)
hermes.square()
hermes.add(3)
print "Hermes:",hermes.result

bob = calculator()
bob.add(5)
bob.factorial()
print "Bob:",bob.result

Hermes: 103
Bob: 120


### Bonus:
Modify the class to add an inversion function (1 divided by the current result).
Demonstrate its use. Is the returned value correct? (Hint: think of numerical data types.)

For more information, visit https://docs.python.org/2/tutorial/classes.html

If you find yourself writing classes, please see me in office hours to learn about `__init__`
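As a preview, `__init__` is the method Python calls when an instance is created, which makes it the usual place to set starting values. A minimal sketch, assuming a capitalized `Calculator` variant with a hypothetical `start` parameter (neither is part of the lab's class):

```python
class Calculator(object):
    def __init__(self, start=0):
        # Runs once per instance, so each calculator gets its own result
        self.result = start

    def add(self, y):
        self.result += y
        return self.result

hermes = Calculator()          # starts at 0
bob = Calculator(start=5)      # starts at 5
hermes.add(10)
bob.add(1)
print(hermes.result)
print(bob.result)
```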

## Kimono Labs



Web scraping with the Kimono Labs API

https://www.kimonolabs.com/

In [64]:

api_key = 'Your API Key'

In [61]:
import json
import urllib
import pandas as pd

def getMovies(year, api_key=api_key):
    """
    Creates list of top 50 movies by gross box office
    sales for a year with ratings and sales
    """
    
    movies, ratings, sales = [], [], []
    #Remember to replace this link with the link to your specific API
    url = "https://www.kimonolabs.com/api/eb81bu78?" + \
            "apikey={}".format(api_key) + \
            "&year={year}".format(year=str(year)) 
    
    data = json.load(urllib.urlopen(url))
    
    # Iterate through json object to collect data
    for n in xrange(data['count']):
        n_title = data['results']['collection1'][n]['title']['text']
        n_rating = data['results']['collection1'][n]['rating']
        n_sales = data['results']['collection1'][n]['sales']
        movies.append(n_title)
        ratings.append(n_rating)
        sales.append(n_sales)
    
    data = pd.DataFrame({'movie':movies,'rating':ratings,'sales':sales})
    
    return data
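The loop in `getMovies` relies on a particular JSON layout. The sketch below rebuilds that layout offline from a hand-written string, so it runs without an API key or network access. The shape is an assumption inferred from the keys the function reads, not taken from Kimono's documentation, and the two records reuse values from the sample output shown further down.

```python
import json

# Hand-written stand-in for an API response (assumed shape)
raw = """
{
  "count": 2,
  "results": {
    "collection1": [
      {"title": {"text": "Cast Away"}, "rating": 7.7, "sales": "$234M"},
      {"title": {"text": "Gladiator"}, "rating": 8.5, "sales": "$188M"}
    ]
  }
}
"""

data = json.loads(raw)

movies, sales = [], []
# Walk the records directly instead of indexing by position
for row in data["results"]["collection1"]:
    movies.append(row["title"]["text"])   # the title is nested one level deeper
    sales.append(row["sales"])

print(movies)
print(sales)
```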


In [62]:
movies_1990 = getMovies(1990)

In [63]:
# print movies_1990
# print movies_1990['count']
# print movies_1990['results']
# print movies_1990['results']['collection1']
# print movies_1990['results']['collection1'][0]
# print movies_1990['results']['collection1'][0]['title']
# print movies_1990['results']['collection1'][0]['title']['text']
movies_1990.head()

| | movie | rating | sales |
|---|---|---|---|
| 0 | How the Grinch Stole Christmas | 6.0 | $260M |
| 1 | Cast Away | 7.7 | $234M |
| 2 | Mission: Impossible II | 6.0 | $215M |
| 3 | Gladiator | 8.5 | $188M |
| 4 | Meet the Parents | 7.0 | $166M |
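Note that the `sales` column holds strings such as `$260M`, so it has to be converted to numbers before it can be averaged. A rough sketch, assuming every value has the form `$<number>M` as in the rows above; `parse_sales` is a hypothetical helper, not part of pandas or Kimono.

```python
def parse_sales(text):
    # "$260M" -> 260.0 (millions of dollars); assumes a leading "$" and a trailing "M"
    return float(text.strip().lstrip("$").rstrip("M"))

sample = ["$260M", "$234M", "$215M", "$188M", "$166M"]
values = [parse_sales(s) for s in sample]
average = sum(values) / len(values)
print(average)
```

With a DataFrame, the same helper could be applied to the whole column, for example with `movies_1990['sales'].apply(parse_sales)`.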


In [None]:
movies_2000 = getMovies(2000)
movies_2000

In [None]:
all_movies = pd.concat([movies_1990, movies_2000])

### Exercise 2.1
Build your own API with Kimono at https://www.kimonolabs.com/. Recreate the API we built in class and use it to retrieve the data.

What were the average sales in 1995?
What was the average rating?

### Bonus
What were the average sales for the 90s? How does that differ from the 2000s?