# APIs Lab
In this lab we will practice using APIs to retrieve and store data.

In [3]:
# Imports at the top
import json
import urllib
import pandas as pd
import numpy as np
import requests
import json
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: Get Data From Sheetsu

[Sheetsu](https://sheetsu.com/) is an online service that allows you to access any Google spreadsheet from an API. This can be a very powerful way to share a dataset with colleagues as well as to create a mini centralized data storage, that is simpler to edit than a database.

A Google Spreadsheet with Wine data can be found [here]().

It can be accessed through sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0/dab55afd

Questions:

1. Use the requests library to access the document. Inspect the response text. What kind of data is it?
> Answer: it's a JSON string
- Check the status code of the response object. What code is it?
> 200
- Use the appropriate libraries and read functions to read the response into a Pandas Dataframe
> Possible answers include: pd.read_json and json.loads + pd.Dataframe
- Once you've imported the data into a dataframe, check the value of the 5th line: what's the price?
> 6

### Exercise 2: Post Data to Sheetsu
Now that we've learned how to read data, it'd be great if we could also write data. For this we will need to use a _POST_ request.

1. Use the post command to add the following data to the spreadsheet:

In [14]:
post_data = {
'Grape' : ''
, 'Name' : 'RameshK'
, 'Color' : 'R'
, 'Country' : 'US'
, 'Region' : 'Sonoma'
, 'Vinyard' : ''
, 'Score' : '10'
, 'Consumed In' : '2015'
, 'Vintage' : '1973'
, 'Price' : '200'
}

1. What status did you get? How can you check that you actually added the data correctly?
- In this exercise, your classmates are adding data to the same spreadsheet. What happens because of this? Is it a problem? How could you mitigate it?

In [18]:
pd.read_json("https://sheetsu.com/apis/v1.0/a4b517d6852e")

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3,2013,Peyruchet
2,W,2015,Oregon,,,20,Oregon,3,2013,Abacela
3,W,2015,Spain,chardonay,,7,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6,,3,2012,Heartland
5,R,2015,US,cab,,13,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,21,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,12,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,11.99,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,13,Washington,4,2011,Hyatt


In [19]:
requests.post("https://sheetsu.com/apis/v1.0/a4b517d6852e",json=post_data)

<Response [201]>

## Exercise 3: Data munging

Get back to the dataframe you've created in the beginning. Let's do some data munging:

1. Search for missing data
    - Is there any missing data? How do you deal with it?
    - Is there any data you can just remove?
    - Are the data types appropriate?
- Summarize the data 
    - Try using describe, min, max, mean, var

In [20]:
data = pd.read_json("https://sheetsu.com/apis/v1.0/a4b517d6852e")

In [21]:
data

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3,2013,Peyruchet
2,W,2015,Oregon,,,20,Oregon,3,2013,Abacela
3,W,2015,Spain,chardonay,,7,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6,,3,2012,Heartland
5,R,2015,US,cab,,13,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,21,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,12,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,11.99,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,13,Washington,4,2011,Hyatt


In [22]:
data.isnull().any()

Color          False
Consumed In    False
Country        False
Grape          False
Name           False
Price          False
Region         False
Score          False
Vintage        False
Vinyard        False
dtype: bool

In [24]:
data.replace("",np.nan,inplace=True)
data.head()

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.8,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,20.0,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.0,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.0,,3.0,2012,Heartland


In [27]:
data = data.drop_duplicates()
len(data)

81

In [28]:
data.dtypes

Color          object
Consumed In    object
Country        object
Grape          object
Name           object
Price          object
Region         object
Score          object
Vintage        object
Vinyard        object
dtype: object

In [33]:
data['Score'] = pd.to_numeric(data["Score"],errors='coerce')
data['Price'] = pd.to_numeric(data["Price"],errors='coerce')
data
#data[["Score","Price"]]=data[["Score","Price"]].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,1.780000e+01,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,2.000000e+01,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.000000e+00,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.000000e+00,,3.0,2012,Heartland
5,R,2015,US,cab,,1.300000e+01,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,2.100000e+01,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,1.200000e+01,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,1.199000e+01,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,1.300000e+01,Washington,4.0,2011,Hyatt


In [34]:
data =data[~data['Score'].isnull()]
data

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,1.780000e+01,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,2.000000e+01,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.000000e+00,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.000000e+00,,3.0,2012,Heartland
5,R,2015,US,cab,,1.300000e+01,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,2.100000e+01,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,1.200000e+01,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,1.199000e+01,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,1.300000e+01,Washington,4.0,2011,Hyatt


In [36]:
data.dtypes

Color           object
Consumed In     object
Country         object
Grape           object
Name            object
Price          float64
Region          object
Score          float64
Vintage         object
Vinyard         object
dtype: object

In [37]:
data=data[data['Score']<=10]
data

Unnamed: 0,Color,Consumed In,Country,Grape,Name,Price,Region,Score,Vintage,Vinyard
0,W,2015,Portugal,,,,Portugal,4.0,2013,Vinho Verde
1,W,2015,France,,,17.80,France,3.0,2013,Peyruchet
2,W,2015,Oregon,,,20.00,Oregon,3.0,2013,Abacela
3,W,2015,Spain,chardonay,,7.00,Spain,2.5,2012,Ochoa
4,R,2015,US,"chiraz, cab",Spice Trader,6.00,,3.0,2012,Heartland
5,R,2015,US,cab,,13.00,California,3.5,2012,Crow Canyon
6,R,2015,US,,#14,21.00,Oregon,2.5,2013,Abacela
7,R,2015,France,"merlot, cab",,12.00,Bordeaux,3.5,2012,David Beaulieu
8,R,2015,France,"merlot, cab",,11.99,Medoc,3.5,2011,Chantemerle
9,R,2015,US,merlot,,13.00,Washington,4.0,2011,Hyatt


In [40]:
data['Consumed In'] = pd.to_numeric(data['Consumed In'], errors='coerce')
data['Vinatage'] = pd.to_numeric(data['Vintage'], errors='coerce')
data['Age'] = data['Consumed In']-data['Vinatage']
data['Age']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


0        2.0
1        2.0
2        2.0
3        3.0
4        3.0
5        3.0
6        2.0
7        3.0
8        4.0
9        4.0
10       3.0
11       2.0
12       5.0
13       3.0
14       1.0
15       1.0
16       1.0
17       2.0
18       2.0
19       2.0
21       1.0
22       1.0
23       1.0
24       2.0
25       3.0
26       2.0
27       3.0
28       3.0
29      42.0
31      42.0
       ...  
74      42.0
77      29.0
79      42.0
80      42.0
81      42.0
85      42.0
92      42.0
103     42.0
106     42.0
132     42.0
145     42.0
149     42.0
152     42.0
156    117.0
161     42.0
163     42.0
174     42.0
175     42.0
181     42.0
186      NaN
228     49.0
232     32.0
271     42.0
277     42.0
281     42.0
287     42.0
289     42.0
291      NaN
294     42.0
311     42.0
Name: Age, dtype: float64

## Exercise 4: Feature Extraction

We would like to use a regression tree to predict the score of a wine. In order to do that, we first need to select and engineer appropriate features.

- Set the target to be the Score column, drop the rows with no score
- Use pd.get_dummies to create dummy features for all the text columns
- Fill the nan values in the numerical columns, using an appropriate method
- Train a Decision tree regressor on the Score, using a train test split:
        X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.3, random_state=42)
- Plot the test values, the predicted values and the residuals
- Calculate R^2 score
- Discuss your findings


In [41]:
data.columns

Index([      u'Color', u'Consumed In',     u'Country',       u'Grape',
              u'Name',       u'Price',      u'Region',       u'Score',
           u'Vintage',     u'Vinyard',    u'Vinatage',         u'Age'],
      dtype='object')

In [42]:
import patsy
y,X = patsy.dmatrices('Score~C(Country)+C(Grape)+C(Region)+C(Score)+C(Vintage)+C(Vinyard)+Age',data)

In [48]:
from sklearn.cross_validation import train_test_split

In [52]:
from sklearn.tree import DecisionTreeRegressor

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=42)
dtr=DecisionTreeRegressor()
model=dtr.fit(X_train,y_train)
predictions=model.predict(X_test)
predictions

array([ 3.5,  3.5,  3. ,  4. ,  3. ,  4. ])

In [53]:
model.score(X_test,y_test)

0.45454545454545447

## Exercise 5: IMDB Movies

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.
Here we will use a combination of scraping and API calls to investigate the ratings and gross earnings of famous movies.

## 5.a Get top movies

The Internet Movie Database contains data about movies. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/top contains the list of the top 250 movies of all times. Retrieve the page using the requests library and then parse the html to obtain a list of the `movie_ids` for these movies. You can parse it with regular expression or using a library like `BeautifulSoup`.

**Hint:** movie_ids look like this: `tt2582802`

In [54]:
!pip install imdbpie

Collecting imdbpie
  Downloading imdbpie-4.2.0-py2.py3-none-any.whl
Collecting cachecontrol[filecache] (from imdbpie)
  Downloading CacheControl-0.11.7.tar.gz
Collecting lockfile>=0.9 (from cachecontrol[filecache]->imdbpie)
  Downloading lockfile-0.12.2-py2.py3-none-any.whl
Building wheels for collected packages: cachecontrol
  Running setup.py bdist_wheel for cachecontrol ... [?25l- \ | / done
[?25h  Stored in directory: /Users/samyuktha/Library/Caches/pip/wheels/9b/94/d2/1793b004461b5bc238a89e260cd2b9f770437c42424fdd0943
Successfully built cachecontrol
Installing collected packages: lockfile, cachecontrol, imdbpie
Successfully installed cachecontrol-0.11.7 imdbpie-4.2.0 lockfile-0.12.2


In [55]:
import imdbpie

In [56]:
imdb=imdbpie.Imdb()
imdb.top_250()

[{u'can_rate': True,
  u'image': {u'height': 1388,
   u'url': u'https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_.jpg',
   u'width': 933},
  u'num_votes': 1734132,
  u'rating': 9.3,
  u'tconst': u'tt0111161',
  u'title': u'The Shawshank Redemption',
  u'type': u'feature',
  u'year': u'1994'},
 {u'can_rate': True,
  u'image': {u'height': 1129,
   u'url': u'https://images-na.ssl-images-amazon.com/images/M/MV5BNTUxOTdjMDMtMWY1MC00MjkxLTgxYTMtYTM1MjU5ZTJlNTZjXkEyXkFqcGdeQXVyNTA4NzY1MzY@._V1_.jpg',
   u'width': 798},
  u'num_votes': 1184876,
  u'rating': 9.2,
  u'tconst': u'tt0068646',
  u'title': u'The Godfather',
  u'type': u'feature',
  u'year': u'1972'},
 {u'can_rate': True,
  u'image': {u'height': 2999,
   u'url': u'https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_.jpg',
   u'width': 2106},
  u'num_votes': 812632,
  u'rating': 9,
  u'

In [60]:
type(imdb.top_250())

list

In [62]:
df = pd.DataFrame(imdb.top_250())
df

Unnamed: 0,can_rate,image,num_votes,rating,tconst,title,type,year
0,True,{u'url': u'https://images-na.ssl-images-amazon...,1734132,9.3,tt0111161,The Shawshank Redemption,feature,1994
1,True,{u'url': u'https://images-na.ssl-images-amazon...,1184876,9.2,tt0068646,The Godfather,feature,1972
2,True,{u'url': u'https://images-na.ssl-images-amazon...,812632,9.0,tt0071562,The Godfather: Part II,feature,1974
3,True,{u'url': u'https://images-na.ssl-images-amazon...,1718813,9.0,tt0468569,The Dark Knight,feature,2008
4,True,{u'url': u'https://images-na.ssl-images-amazon...,463144,8.9,tt0050083,12 Angry Men,feature,1957
5,True,{u'url': u'https://images-na.ssl-images-amazon...,888278,8.9,tt0108052,Schindler's List,feature,1993
6,True,{u'url': u'https://images-na.ssl-images-amazon...,1358039,8.9,tt0110912,Pulp Fiction,feature,1994
7,True,{u'url': u'https://images-na.ssl-images-amazon...,1245440,8.9,tt0167260,The Lord of the Rings: The Return of the King,feature,2003
8,True,{u'url': u'https://images-na.ssl-images-amazon...,516031,8.9,tt0060196,"The Good, the Bad and the Ugly",feature,1966
9,True,{u'url': u'https://images-na.ssl-images-amazon...,1384565,8.8,tt0137523,Fight Club,feature,1999


In [63]:
imdb.get_title_reviews("tt0167260")

[<Review: u'\nPeter Jackson has d'>,
 <Review: u'\nSaying that this fi'>,
 <Review: u'I am, I admit, an un'>,
 <Review: u'\nAs a movie watcher,'>,
 <Review: u"Obviously, I'm aware">,
 <Review: u'\nFrodo and Sam conti'>,
 <Review: u'\nThis is the final m'>,
 <Review: u'\nFeeling weary and b'>,
 <Review: u'***SPOILERS*** ***SP'>,
 <Review: u'\nThousands of commen'>]

In [64]:
imdb.get_title_reviews("tt0167260")[0]

<Review: u'\nPeter Jackson has d'>

In [65]:
review=imdb.get_title_reviews("tt0167260")[0]

In [66]:
review.text

u'\nPeter Jackson has done it.  He has created an all-encompassing epic saga of Tolkien\'s Lord of the Rings books, and after coming away from the final chapter, how does this rate not only as a film on its own, but as a part of the whole?\n\nPerfect.\n\nI\'ve never seen a series like this.  A trilogy of movies created with such love and care and utter perfection of craft that you can\'t help but walk away and wonder how did Peter Jackson make this possible?  I have always loved the original "Star Wars" and "Indiana Jones" series for their epic storytelling, and just for just fitting in as a great moment in cinema. This should be, will be, remembered with as much revered fondness for generations to come.  They do not make films like these anymore.\n\nAs a stand alone film, it picks up immediately where "Two Towers" ends, so brush up before seeing it.  I\'ve read the books, and the anticipation of seeing some of the more profound moments in this film made me kind of view it with a rushe

In [72]:
from bs4 import BeautifulSoup
import requests

In [74]:
req = requests.get("http://www.imdb.com/chart/top")

In [76]:
soup = BeautifulSoup(req.content,"lxml")
soup.prettify()

u'<!DOCTYPE html>\n<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">\n <head>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>\n  <script type="text/javascript">\n   var ue_t0=window.ue_t0||+new Date();\n  </script>\n  <script type="text/javascript">\n   var ue_mid = "A1EVAM02EL8SFB"; \n                var ue_sn = "www.imdb.com";  \n                var ue_furl = "fls-na.amazon.com";\n                var ue_sid = "000-0000000-0000000";\n                var ue_id = "1M70EKBD5H3JM6HAWSXM";\n                (function(e){var c=e;var a=c.ue||{};a.main_scope="mainscopecsm";a.q=[];a.t0=c.ue_t0||+new Date();a.d=g;function g(h){return +new Date()-(h?0:a.t0)}function d(h){return function(){a.q.push({n:h,a:arguments,t:a.d()})}}function b(m,l,h,j,i){var k={m:m,f:l,l:h,c:""+j,err:i,fromOnError:1,args:arguments};c.ueLogError(k);return false}b

In [78]:
minisoup=soup.find("tbody",{"class":"lister-list"})
rowlist=minisoup.find_all("tr")
for rows in rowlist:
    print rows

<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.21478547000337" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="1734151" name="nv"></span>
<span data-value="-1.7852145299966296" name="ur"></span>
<a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=2398042102&amp;pf_rd_r=1M70EKBD5H3JM6HAWSXM&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=top&amp;ref_=chttp_tt_1"> <img height="67" src="https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=2398042102&amp;pf_rd_r=1M70EKBD5H3JM6HAWSXM&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=top&amp;ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
<td class=

## 5.b Get top movies data

Although the Internet Movie Database does not have a public API, an open API exists at http://www.omdbapi.com.

Use this API to retrieve information about each of the 250 movies you have extracted in the previous step.
- Check the documentation of omdbapi.com to learn how to request movie data by id
- Define a function that returns a python object with all the information for a given id
- Iterate on all the IDs and store the results in a list of such objects
- Create a Pandas Dataframe from the list

## 5.c Get gross data

The OMDB API is great, but it does not provide information about Gross Revenue of the movie. We'll revert back to scraping for this.

- Write a function that retrieves the gross revenue from the entry page at imdb.com
- The function should handle the exception of when the page doesn't report gross revenue
- Retrieve the gross revenue for each movie and store it in a separate dataframe

## 5.d Data munging

- Now that you have movie information and gross revenue information, let's clean the two datasets.
- Check if there are null values. Be careful they may appear to be valid strings.
- Convert the columns to the appropriate formats. In particular handle:
    - Released
    - Runtime
    - year
    - imdbRating
    - imdbVotes
- Merge the data from the two datasets into a single one

## 5.d Text vectorization

There are several columns in the data that contain a comma separated list of items, for example the Genre column and the Actors column. Let's transform those to binary columns using the count vectorizer from scikit learn.

Append these columns to the merged dataframe.

**Hint:** In order to get the actors name right, you'll have to modify the `token_pattern` in the `CountVectorizer`.

## Bonus:

- What are the top 10 grossing movies?
- Who are the 10 actors that appear in the most movies?
- What's the average grossing of the movies in which each of these actors appear?
- What genre is the oldest movie?
