### Importing flat files from the web via local storage and pandas

In [4]:
# Import a file from the web, save it locally and load it into a DataFrame. 

# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally (urlretrieve returns a (filename, headers) tuple)
local_path, headers = urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [2]:
!ls

100NumpyExercisesWithHintPractice.ipynb listsvsdict.png
DataScienceToolboxP1.ipynb              localvsglobalscope1.png
DataScienceToolboxP2.ipynb              localvsglobalscope2.png
ImportingDatainPythonPart1.ipynb        localvsglobalscope3.png
ImportingDatainPythonPart2.ipynb        localvsglobalscope4.png
IntermediatePythonForDataScience.ipynb  moby_dick.txt
README.MD                               nestedfunctions1.png
defaultarg.png                          rdbase.png
iteratingoverfileconnections.png        winequality-red.csv
legb.png


Note that urlretrieve actually made a copy of the csv in your working directory!
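
To see what `urlretrieve` hands back without touching the network, here is a minimal sketch that copies a throwaway file through a `file://` URL. The names `source.csv` and `copy.csv` and the temporary directory are placeholders invented for this example; the call has the same shape with an `http(s)` URL.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote file: a tiny semicolon-separated CSV on disk
    source = Path(tmp) / "source.csv"
    source.write_text("pH;quality\n3.51;5\n")

    # urlretrieve copies the resource to the given path and returns
    # a (local_filename, headers) tuple
    target = Path(tmp) / "copy.csv"
    saved_path, headers = urlretrieve(source.as_uri(), str(target))

    # Check the copy while the temporary directory still exists
    copy_exists = target.exists()
    copied_text = target.read_text()

print(saved_path)
print(copy_exists)
```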

If you just want to load a file from the web into a DataFrame without first saving it locally, pandas can do that directly: pass the URL as the first argument to pd.read_csv() and the separator as the sep keyword argument.

In [3]:
pd.read_csv(url, sep=';').head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
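
Why the sep argument matters can be checked offline. The sketch below assumes a small in-memory sample (three columns and two rows borrowed from the table above) instead of the real file, and reads it once with the default comma separator and once with sep=';'. The variable names are made up for the illustration.

```python
import io

import pandas as pd

# Small in-memory stand-in for the semicolon-separated wine file
sample = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n"

# Default separator is ',' so every line ends up in a single column
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';' the three columns are split correctly
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```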


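pd.read_excel() is the pd.read_csv() equivalent for importing Excel spreadsheets!

Note that pd.read_excel() returns a single DataFrame by default. When you ask for all sheets by passing None as the sheet name argument (sheet_name=None in current pandas), the output is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.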

In [5]:
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

In [6]:
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
# (the keyword is sheet_name in current pandas; older releases spelled it sheetname)
xl = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

odict_keys(['1700', '1900'])
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000


### HTTP requests to import files from the web


URL: Uniform Resource Locator (sometimes expanded as Universal Resource Locator)

URLs are references to web resources.

The vast majority of URLs are web addresses, but they can also refer to a few other things, such as files served over FTP (File Transfer Protocol) and database access.

We'll focus only on web addresses!

Ingredients:
- Protocol identifier: http:// (HyperText Transfer Protocol; https is the more secure form of http)
- Resource name: e.g. amazon.com
- Together, these specify a web address uniquely!
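
The two ingredients can be pulled apart with the standard library. This is a small sketch using `urllib.parse.urlsplit` on the wine-quality URL from earlier; the variable names `wine_url` and `parts` are chosen only for this example.

```python
from urllib.parse import urlsplit

wine_url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Split the URL into its components without contacting the server
parts = urlsplit(wine_url)

print(parts.scheme)  # protocol identifier
print(parts.netloc)  # resource name (the host)
print(parts.path)    # location of the file on that host
```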

Going to a website == sending an HTTP request, known as a GET request, to a server
- GET is by far the most common type of HTTP request

`urlretrieve()` under the hood actually performs a GET request.
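
One way to check which HTTP method `urllib` would use, without sending anything, is to build a `Request` object and ask it. In this sketch the form payload `b'q=python'` is made up purely to show the contrast with a request that carries data.

```python
from urllib.request import Request

# A request without a data payload defaults to GET
page_request = Request('https://www.wikipedia.org')
print(page_request.get_method())

# Attaching a data payload switches the default method to POST
form_request = Request('https://www.wikipedia.org', data=b'q=python')
print(form_request.get_method())
```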

#### Getting HTML data from the web using urllib and the requests package

In [7]:
# Traditional method
from urllib.request import urlopen, Request

url = 'https://www.wikipedia.org'

request = Request(url)
response = urlopen(request)
html = response.read()
response.close()

In [9]:
# print(html)
# to get the html string; it's huge so not printing here

In [11]:
# get requests using the Requests package (one of the most downloaded packages of all time)!
import requests

url = 'https://www.wikipedia.org'
r = requests.get(url)
text = r.text
# print(text)

#### Scraping the web in Python

HTML: mixture of structured and unstructured data!

BeautifulSoup: Parse and Extract structured data from HTML

In [12]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

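# Extract the response body as text (HTML): html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly: soup
# ('lxml' needs the lxml package; 'html.parser' ships with Python)
soup = BeautifulSoup(html_doc, 'lxml')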

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
  </p>
  <h3>
   <a href="http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg">
    Who
I Am
   </a>
  </h3>
  <p>
   Read
my
   <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
    "King's
Day Speech"
   </a>
   for some inspiration.
  </p>
  <p>
   I am the author of the
   <a href="http://www.python.org">
    Python
   </a>
   programming language.  See also my
   <a href="Resume.html">
    resume
   </a>
   and my
   <a href="Publications.html">
    publications list
   </a>
   , a
   <a href="bio.html">
    brief bio
   </a>
   , assor



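Note: if `BeautifulSoup` is called without naming a parser, it picks one itself and emits a warning. To avoid the warning, change `BeautifulSoup(YOUR_MARKUP)` to `BeautifulSoup(YOUR_MARKUP, "lxml")`.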


In [13]:
# Get the title of Guido's webpage: guido_title
guido_title = soup.title 
# Note that title is an attribute

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()
# Note that get_text() is a method and not an attribute

# Print Guido's text to the shell
print(guido_text)

<title>Guido's Personal Home Page</title>


Guido's Personal Home Page




Guido van Rossum - Personal Home Page
"Gawky and proud of it."
Who
I Am
Read
my "King's
Day Speech" for some inspiration.

I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter.  I
also have
a G+
profile.

In January 2013 I joined
Dropbox.  I work on various Dropbox
products and have 50% for my Python work, no strings attached.
Previously, I have worked for Google, Elemental Security, Zope
Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
my resume.)  I created Python while at CWI.

How to Reach Me
You can send email for me to guido (at) python.org.
I read everything sent there, but if you ask
me a question about using Python, it's likely that I won't have time
to answer it, and will instead ref

In [21]:
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all(name='a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
https://plus.google.com/u/0/115212051037621986145/posts
http://www.dropbox.com
Resume.html
http://groups.google.com/groups?q=comp.lang.python
http://stackoverflow.com
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
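
Several of the hrefs printed above (for example `pics.html`) are relative to the page they came from. A possible way to turn them into absolute URLs is `urllib.parse.urljoin`; the sketch below hard-codes three hrefs from the output rather than re-scraping the page, and the names `page_url`, `hrefs` and `absolute_links` are chosen for the example.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
page_url = 'https://www.python.org/~guido/'

# A few hrefs copied from the output above: two relative, one absolute
hrefs = ['pics.html', 'images/license.jpg', 'http://www.dropbox.com']

# Relative links are resolved against page_url; absolute links pass through unchanged
absolute_links = [urljoin(page_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```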


### Introduction to APIs and JSONs

In [None]:
# Loading JSON from a file that is stored locally in your working directory
import json

with open('file.json', 'r') as file:
    json_data = json.load(file)
    
type(json_data) 
# should give you a `dict`
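
The cell above assumes a `file.json` already exists in the working directory. For a self-contained run, this sketch first writes a tiny JSON file into a temporary directory and then loads it the same way. The two key-value pairs are taken from the OMDb response shown later in these notes; the temporary location is an assumption of the example.

```python
import json
import tempfile
from pathlib import Path

# Two fields borrowed from the OMDb response for The Social Network
movie = {"Title": "The Social Network", "Year": "2010"}

with tempfile.TemporaryDirectory() as tmp:
    # Write the JSON file so the example does not depend on an existing file
    path = Path(tmp) / "file.json"
    path.write_text(json.dumps(movie))

    # Load it back exactly as in the cell above
    with open(path, 'r') as file:
        json_data = json.load(file)

print(type(json_data))
for key, value in json_data.items():
    print(key + ':', value)
```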

### APIs and interacting with the world wide web

- The advantage of knowing how to work with JSON is that the majority of APIs return their data as JSON
- API is an acronym, short for Application Programming Interface
- In an API request URL, `?` marks the start of the query string (e.g. `?t=social+network`)
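
To make the query string concrete, here is a short sketch that takes apart the OMDb URL used in the next cell and then rebuilds it from a dictionary of parameters, using only `urllib.parse`. The variable names are illustrative.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

omdb_url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'

# Everything after the ? is the query string
query = urlsplit(omdb_url).query
print(query)

# parse_qs turns it into a dict of lists and decodes + back into a space
params = parse_qs(query)
print(params)

# urlencode goes the other way: from a dict of parameters to a query string
rebuilt = 'http://www.omdbapi.com/?' + urlencode({'apikey': 'ff21610b', 't': 'social network'})
print(rebuilt)
```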

In [22]:
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)

{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 165 wins & 168 nominations.","Poster":"https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"528,379","imdbID":"tt1285016","Type":"mo

In [25]:
r.json()

{'Actors': 'Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons',
 'Awards': 'Won 3 Oscars. Another 165 wins & 168 nominations.',
 'BoxOffice': '$96,400,000',
 'Country': 'USA',
 'DVD': '11 Jan 2011',
 'Director': 'David Fincher',
 'Genre': 'Biography, Drama',
 'Language': 'English, French',
 'Metascore': '95',
 'Plot': 'Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.',
 'Poster': 'https://images-na.ssl-images-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg',
 'Production': 'Columbia Pictures',
 'Rated': 'PG-13',
 'Ratings': [{'Source': 'Internet Movie Database', 'Value': '7.7/10'},
  {'Source': 'Rotten Tomatoes', 'Value': '96%'},
  {'Source': 'Metacritic', 'Value': '95/100'}],
 'Released': '01 Oct 2010',
 'Response': 'True',
 'Runtime': '120 min',
 'Title': 

In [5]:
# Query the Wikipedia API for the page -> Pizza!
import requests

url = 'https://en.wikipedia.org/w/api.php?action=query&titles=Pizza&prop=revisions&rvprop=content&format=json&formatversion=2'

r = requests.get(url)
json_wiki_data = r.json()

In [10]:
for k,v in json_wiki_data.items():
    print(k, v)

batchcomplete True
query {'pages': [{'pageid': 24768, 'ns': 0, 'title': 'Pizza', 'revisions': [{'contentformat': 'text/x-wiki', 'contentmodel': 'wikitext', 'content': '{{Other uses}}\n{{pp-semi-indef}}\n{{pp-move-indef}}\n{{Infobox prepared food\n| name             = Pizza\n| image            = Eq it-na pizza-margherita sep2005 sml.jpg\n| caption          = Pizza Margherita, the archetype of [[Neapolitan pizza]]\n| alternate_name   =\n| country          = [[Italy]]\n| region           = [[Campania]] ([[Naples]])\n| course           = Lunch or dinner\n| type             = [[Flatbread]]\n| served           = Hot or warm\n| main_ingredient  = Dough, often [[tomato sauce]], [[cheese]]\n| variations       = [[Calzone]], [[panzerotti]], [[Stromboli (food)|stromboli]]\n| calories         =\n| other            =\n| creator          = Raffaele Esposito\n}}\n{{pizza}}\n\'\'\'Pizza\'\'\' is a traditional [[Italian cuisine|Italian]] [[Dish (food)|dish]] consisting of a yeasted [[flatbread]] typica
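
The printed dictionary is deeply nested, so reaching the article text takes a chain of keys and list indices. The sketch below uses a heavily trimmed stand-in for the response (same structure as the output above, with the `content` string cut down to its first two lines) so it runs without a network call. The name `sample_wiki_data` is made up for the example.

```python
# Trimmed stand-in for the response above; the real 'content' string is far longer
sample_wiki_data = {
    'batchcomplete': True,
    'query': {
        'pages': [
            {
                'pageid': 24768,
                'ns': 0,
                'title': 'Pizza',
                'revisions': [
                    {
                        'contentformat': 'text/x-wiki',
                        'contentmodel': 'wikitext',
                        'content': '{{Other uses}}\n{{pp-semi-indef}}\n',
                    }
                ],
            }
        ],
    },
}

# 'pages' and 'revisions' are lists, so pick the first element of each
page = sample_wiki_data['query']['pages'][0]
wikitext = page['revisions'][0]['content']

print(page['title'])
print(wikitext.splitlines()[0])
```

The same two lines of indexing should work on the real `json_wiki_data`, as long as the response has the structure shown in the output above.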

### The Twitter API and Authentication
