# Importing Data in Python

## Context

Data science depends on a steady influx of data, so it is worth understanding how that data can be fetched. In this notebook, we will cover three different data sources:

1. Databases
2. CSV files
3. API calls

For each of them, we will see how to use Python libraries to fetch the data and convert it to a pandas DataFrame object. We will then join different DataFrames.

## Working with relational databases in Python
### Creating a database engine
The engine created below uses a Dialect object tailored towards **MySQL** (through the **PyMySQL** driver), as well as a Pool object which will establish a **DBAPI** connection to the host `dt5.ehb.be` when a connection is first requested. The Engine, once created, can be used directly to interact with the database. SQLAlchemy includes many Dialect implementations for various backends. Dialects for the most common databases are included with SQLAlchemy; a handful of others require an additional install of a separate dialect.

The `create_engine()` function produces an Engine object based on a URL. These URLs follow **RFC-1738**, and usually can include username, password, hostname, database name as well as optional keyword arguments for additional configuration. In some cases a file path is accepted, and in others a “data source name” replaces the “host” and “database” portions. The typical form of a database URL is:

```python
dialect+driver://username:password@host:port/database
```
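
To make the anatomy of such a URL concrete, the sketch below splits one into its parts using only the standard library, and percent-encodes a password that contains reserved characters. The user names, passwords, host and database name are placeholders chosen for illustration, not values used elsewhere in this notebook.

```python
from urllib.parse import quote_plus, urlsplit

# Placeholder URL of the form dialect+driver://username:password@host:port/database
db_url = "postgresql+psycopg2://scott:tiger@localhost:5432/mydatabase"

parts = urlsplit(db_url)
dialect, driver = parts.scheme.split("+")
database = parts.path.lstrip("/")

print(dialect, driver)                 # dialect and driver names
print(parts.username, parts.password)  # credentials
print(parts.hostname, parts.port)      # host and port
print(database)                        # database name

# Characters such as "@" or "/" in a password must be percent-encoded,
# otherwise they would be read as URL delimiters.
raw_password = "p@ss/word"  # hypothetical password
safe_url = f"mysql+pymysql://user:{quote_plus(raw_password)}@localhost:3306/mydatabase"
print(safe_url)
```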

In [1]:
# Import necessary modules
from sqlalchemy import create_engine, inspect

# Create engine: engine
engine = create_engine('mysql+pymysql://1920AIRAMJI:UCKBHDK@dt5.ehb.be/1920AIRAMJI')

# Save the table names to a list: table_names
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector is the replacement)
table_names = inspect(engine).get_table_names()

# Print the table names to the shell
print(table_names)

['credit']


Now, it's time for liftoff! In this exercise, you'll perform the Hello World of SQL queries, `SELECT`, in order to retrieve all columns of the table `credit` from the database. Recall that the query `SELECT *` selects all columns.

In [2]:
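import pandas as pd
from sqlalchemy import text

# Open engine connection
con = engine.connect()

# Perform query: rs
# (SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text())
rs = con.execute(text("SELECT * FROM credit"))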

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame df
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
0,14dd8831-6af5-400b-83ec-68e61888a048,981165ec-3274-42f5-a3b4-d104041a9ca9,Fully Paid,445412.0,Short Term,709.0,1167493.0,8 years,Home Mortgage,Home Improvements,5214.74,17.2,,6.0,1.0,228190.0,416746.0,1.0,0.0
1,4771cc26-131a-45db-b5aa-537ea4ba5342,2de017a3-2e01-49cb-a581-08169e83be29,Fully Paid,262328.0,Short Term,,,10+ years,Home Mortgage,Debt Consolidation,33295.98,21.1,8.0,35.0,0.0,229976.0,850784.0,0.0,0.0
2,4eed4e6a-aa2f-4c91-8651-ce984ee8fb26,5efb2b2b-bf11-4dfd-a572-3761a2694725,Fully Paid,99999999.0,Short Term,741.0,2231892.0,8 years,Own Home,Debt Consolidation,29200.53,14.9,29.0,18.0,1.0,297996.0,750090.0,0.0,0.0
3,77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a,e777faab-98ae-45af-9a86-7ce5b33b1011,Fully Paid,347666.0,Long Term,721.0,806949.0,3 years,Own Home,Debt Consolidation,8741.9,12.0,,9.0,0.0,256329.0,386958.0,0.0,0.0
4,d4062e70-befa-4995-8643-a0de73938182,81536ad9-5ccf-4eb8-befb-47a4d608658e,Fully Paid,176220.0,Short Term,,,5 years,Rent,Debt Consolidation,20639.7,6.1,,15.0,0.0,253460.0,427174.0,0.0,0.0


Congratulations on executing your first SQL query! Now you're going to figure out how to customize your query in order to:
- Select specified columns from a table;
- Select a specified number of rows;
- Import column names from the database table.

Recall that we performed a very similar query customization:

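```python
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('mysql+pymysql://1920AIRAMJI:UCKBHDK@dt5.ehb.be/1920AIRAMJI')

with engine.connect() as con:
    # Column names containing spaces must be quoted with backticks in MySQL
    rs = con.execute(text("SELECT `Loan ID`, `Customer ID` FROM credit"))
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()
```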

In [3]:
from sqlalchemy import text

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute(text("SELECT Term FROM credit"))
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
df.head()

3


Unnamed: 0,Term
0,Short Term
1,Short Term
2,Short Term
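
The same pattern (select specific columns, fetch a limited number of rows, take the column names from the result) can be tried without access to the remote server. The sketch below uses an in-memory SQLite database from the standard library as a stand-in. The `credit` table it creates is a toy table with made-up loan IDs, not the real one.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the remote MySQL server (assumption)
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE credit ("Loan ID" TEXT, "Term" TEXT)')
con.executemany(
    "INSERT INTO credit VALUES (?, ?)",
    [
        ("loan-1", "Short Term"),
        ("loan-2", "Short Term"),
        ("loan-3", "Long Term"),
        ("loan-4", "Short Term"),
    ],
)

# Column names containing spaces must be quoted
# (double quotes in SQLite, backticks in MySQL)
cur = con.execute('SELECT "Loan ID", "Term" FROM credit')

# Fetch only the first 3 rows and read the column names from the cursor metadata
rows = cur.fetchmany(3)
columns = [col[0] for col in cur.description]
sample = pd.DataFrame(rows, columns=columns)
con.close()

print(sample)
```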


Here, you'll take advantage of the power of `pandas` to write the results of your SQL query to a DataFrame in one swift line of Python code!

You'll first import `pandas`. Then you'll query the database to select all records from the `credit` table.

```python
df = pd.read_sql_query("SELECT * FROM credit", engine)
```

In [4]:
# Create engine: engine
engine = create_engine('mysql+pymysql://1920AIRAMJI:UCKBHDK@dt5.ehb.be/1920AIRAMJI')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM credit", engine)
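
`read_sql_query` also accepts a `params` argument, which lets the database driver substitute values into the query instead of building the SQL string by hand. The sketch below illustrates this on a toy in-memory SQLite table (a stand-in for the real `credit` table, with made-up rows). The `?` placeholder is SQLite's style; other drivers such as PyMySQL use a different placeholder syntax.

```python
import sqlite3

import pandas as pd

# Toy table standing in for the real credit table (assumption)
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE credit ("Loan ID" TEXT, "Term" TEXT)')
con.executemany(
    "INSERT INTO credit VALUES (?, ?)",
    [("loan-1", "Short Term"), ("loan-2", "Long Term"), ("loan-3", "Short Term")],
)

# The value is passed separately from the SQL text
short_term = pd.read_sql_query(
    'SELECT * FROM credit WHERE "Term" = ?', con, params=("Short Term",)
)
con.close()

print(short_term)
```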

## Retrieving data from an API
Now it's your turn to pull some movie data down from the Open Movie Database (OMDB) using their API. The movie you'll query the API about is The Social Network. As a reference, the query string for the movie Hackers is `'http://www.omdbapi.com/?t=hackers'`, which has a single argument `t=hackers`.

Note: recently, OMDB has changed their API: you now also have to specify an API key. This means you'll have to add another argument to the URL: `apikey=72bc447a`.
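
Rather than typing the query string by hand, it can be assembled from a dictionary of arguments. The sketch below uses `urlencode` from the standard library to show how the two arguments end up in the URL; `base_url` and `query` are names chosen for this illustration.

```python
from urllib.parse import urlencode

base_url = "http://www.omdbapi.com/"
query = {"apikey": "72bc447a", "t": "the social network"}

# urlencode joins the arguments with "&" and encodes spaces as "+"
url = base_url + "?" + urlencode(query)
print(url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the URL string yourself.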

In [5]:
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)

{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin, Ben Mezrich","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"United States","Awards":"Won 3 Oscars. 172 wins & 186 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.8/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.8","imdbVotes":"696,761","imdbID":"tt1285016","Type":"movie","DV

Wow, congrats! You've just queried your first API programmatically in Python and printed the text of the response to the shell. However, as you know, the response is actually JSON, so you can do one step better and decode it. You can then print the key-value pairs of the resulting dictionary. That's what you're going to do now!

In [6]:
# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for key, value in json_data.items():
    print(key + ': ', value)

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin, Ben Mezrich
Actors:  Jesse Eisenberg, Andrew Garfield, Justin Timberlake
Plot:  As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  United States
Awards:  Won 3 Oscars. 172 wins & 186 nominations total
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.8/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.8
imdbVotes:  696,761
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $
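
To see what the decoding step does without calling the API, the sketch below applies `json.loads` to a shortened copy of the response text shown above, then turns the nested `Ratings` list into a DataFrame. The variable names are chosen for this illustration only.

```python
import json

import pandas as pd

# Shortened copy of the response text printed earlier
response_text = (
    '{"Title": "The Social Network", "Year": "2010", "Ratings": ['
    '{"Source": "Internet Movie Database", "Value": "7.8/10"}, '
    '{"Source": "Rotten Tomatoes", "Value": "96%"}, '
    '{"Source": "Metacritic", "Value": "95/100"}]}'
)

# A JSON object becomes a dict, a JSON array becomes a list
movie = json.loads(response_text)
print(movie["Title"], movie["Year"])

# A list of dicts maps naturally to a DataFrame: one row per dict
ratings = pd.DataFrame(movie["Ratings"])
print(ratings)
```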

You're doing so well and having so much fun that we're going to throw one more API at you: the Wikipedia API (documented [here](https://www.mediawiki.org/wiki/API:Main_page)). You'll figure out how to find and extract information from the Wikipedia page for Pizza. What gets a bit wild here is that your query will return nested JSONs, that is, JSONs within JSONs, but Python can handle that because it will translate them into dictionaries within dictionaries.

The URL that requests the relevant query from the Wikipedia API is
`https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza`


In [7]:
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)

<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1033289096">
<p class="mw-empty-elt">
</p>
<p><b>Pizza</b> (<small>Italian: </small><span title="Representation in the International Phonetic Alphabet (IPA)" lang="it-Latn-fonipa">[ˈpittsa]</span>, <small>Neapolitan: </small><span title="Representation in the International Phonetic Alphabet (IPA)" lang="nap-Latn-fonipa">[ˈpittsə]</span>) is a dish of  Italian origin consisting of a usually round, flat base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as various types of sausage, anchovies, mushrooms, onions, olives, vegetables, meat, ham, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven. A small pizza is sometimes called a pizzetta. A person who makes pizza is known as a <b>pizzaiolo</b>.
</p><p>In Italy, pizza served in a restaurant is presented unsliced, and is eaten with the use of a knife and fork. In casual settings, ho
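
The page ID `'24768'` is hard-coded in the lookup above. A possible alternative, sketched below on a trimmed-down stand-in for the decoded response, is to take the first entry of the `pages` dictionary, whatever its key is. The `sample_response` dictionary is an abbreviated illustration, not the full API output.

```python
# Trimmed-down stand-in for the decoded response (extract text abbreviated)
sample_response = {
    "query": {
        "pages": {
            "24768": {
                "title": "Pizza",
                "extract": "<p><b>Pizza</b> is a dish of Italian origin</p>",
            }
        }
    }
}

# Walk down the nested dictionaries one level at a time
pages = sample_response["query"]["pages"]

# Take the first (and here only) page without hard-coding its ID
page = next(iter(pages.values()))
print(page["title"])
print(page["extract"])
```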

## Joining Data with pandas

Let’s first look at the data sets used, with a short explanation of each dataframe.
- **user_usage**: A first dataset containing users' monthly mobile usage statistics.
- **user_device**: A second dataset containing details of an individual “use” of the system, with dates and device information.
- **android_device**: A third dataset with device and manufacturer data, which lists all Android devices and their model code.

It’s important to note here that:
1. The column name `use_id` is shared between `user_usage` and `user_device`.
2. The `device` column of `user_device` and the `Model` column of `android_device` contain common codes.

In [8]:
# Import the data
user_usage = pd.read_csv("https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_usage.csv")
user_device = pd.read_csv("https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/user_device.csv")
android_device = pd.read_csv("https://raw.githubusercontent.com/shanealynn/Pandas-Merge-Tutorial/master/android_devices.csv")

# Preview the data (a notebook cell only displays its last expression, here android_device)
user_usage.head()
user_device.head()
android_device.head()


Unnamed: 0,Retail Branding,Marketing Name,Device,Model
0,,,AD681H,Smartfren Andromax AD681H
1,,,FJL21,FJL21
2,,,T31,Panasonic T31
3,,,hws7721g,MediaPad 7 Youth 2
4,3Q,OC1020A,OC1020A,OC1020A


### Left Merge
Keep every row in the left dataframe. Where there are missing values of the “on” variable in the right dataframe, add empty / NaN values in the result.



In [9]:
left_merge = pd.merge(user_usage, user_device, on = "use_id", how = "left")

With the operation above, `left_merge` has the same number of rows as `user_usage` (as long as each `use_id` appears at most once in `user_device`), since passing `how = "left"` keeps every row of the left dataframe.

In [10]:
left_merge.head()

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,user_id,platform,platform_version,device,use_type_id
0,21.97,4.82,1557.33,22787,12921.0,android,4.3,GT-I9505,1.0
1,1710.08,136.88,7267.55,22788,28714.0,android,6.0,SM-G930F,1.0
2,1710.08,136.88,7267.55,22789,28714.0,android,6.0,SM-G930F,1.0
3,94.46,35.17,519.12,22790,29592.0,android,5.1,D2303,1.0
4,71.59,79.26,1557.33,22792,28217.0,android,5.1,SM-G361F,1.0


As expected, the column `use_id` appears only once in the merged data. Rows of `user_usage` with no matching `use_id` in `user_device` get NaN in the columns that come from the right dataframe; none of the first five rows shown here is affected.
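
A small toy example makes the NaN behaviour easier to see. The two dataframes below are made up for illustration and only borrow the column names `use_id`, `monthly_mb` and `user_id`. Note that an integer column from the right dataframe is converted to float once it has to hold NaN, which is probably why `user_id` is displayed as `12921.0` in the output above.

```python
import pandas as pd

# Toy data: use_id 2 has no match on the right, use_id 4 has no match on the left
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 300.0]})
device = pd.DataFrame({"use_id": [1, 3, 4], "user_id": [10, 30, 40]})

left = pd.merge(usage, device, on="use_id", how="left")

print(left)
print(left.dtypes)  # user_id is now float64 because it contains NaN
```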

### Right Merge
To perform the right merge, we repeat the code above and simply change the value of the `how` parameter from `left` to `right`.

In [11]:
right_merge = pd.merge(user_usage, user_device, on = "use_id", how = "right")

With the operation above, `right_merge` has the same number of rows as `user_device` (as long as each `use_id` appears at most once in `user_usage`), since passing `how = "right"` keeps every row of the right dataframe.

In [12]:
right_merge.head()

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,user_id,platform,platform_version,device,use_type_id
0,,,,22782,26980,ios,10.2,"iPhone7,2",2
1,,,,22783,29628,android,6.0,Nexus 5,3
2,,,,22784,28473,android,5.1,SM-G903F,1
3,,,,22785,15200,ios,10.2,"iPhone7,2",3
4,,,,22786,28239,android,6.0,ONE E1003,1


This time, the rows without a match get NaN in the columns that come from the left dataframe `user_usage`.

### Inner Merge
Pandas uses “inner” merge by default. This keeps only the common values in both the left and right dataframes for the merged data.

In our case, only the rows that contain `use_id` values that are common between `user_usage` and `user_device` remain in the merged data `inner_merge`.

In [13]:
inner_merge = pd.merge(user_usage, user_device, on = "use_id", how = "inner")

Although pandas uses the “inner” merge by default, `how = "inner"` is passed above to be explicit.

With the operation above, the merged data `inner_merge` generally has a different number of rows than the original left and right dataframes (`user_usage` and `user_device`), as only the rows with common `use_id` values are kept.

In [14]:
inner_merge.head()

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,user_id,platform,platform_version,device,use_type_id
0,21.97,4.82,1557.33,22787,12921,android,4.3,GT-I9505,1
1,1710.08,136.88,7267.55,22788,28714,android,6.0,SM-G930F,1
2,1710.08,136.88,7267.55,22789,28714,android,6.0,SM-G930F,1
3,94.46,35.17,519.12,22790,29592,android,5.1,D2303,1
4,71.59,79.26,1557.33,22792,28217,android,5.1,SM-G361F,1
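
To compare how many rows each kind of merge keeps, the sketch below runs all four values of `how` on two small made-up dataframes that share only some `use_id` values.

```python
import pandas as pd

# Toy data: use_id 1 and 3 are common, 2 is left-only, 4 is right-only
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 300.0]})
device = pd.DataFrame({"use_id": [1, 3, 4], "platform": ["android", "ios", "android"]})

# Number of rows produced by each merge type
row_counts = {
    how: len(pd.merge(usage, device, on="use_id", how=how))
    for how in ["left", "right", "inner", "outer"]
}
print(row_counts)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}
```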


### Outer Merge
Finally, we have “outer” merge.

The “outer” merge combines all the rows of the left and right dataframes, filling in NaN where a row has no match in the other dataframe.

In [15]:
outer_merge = pd.merge(user_usage, user_device, on = "use_id", how = "outer")

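Note that `indicator` is not set here. Passing `indicator = True` to `pd.merge` would add a `_merge` column showing where each row originates from (`left_only`, `right_only` or `both`); without it, the origin can be read from which columns contain NaN.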

In [16]:
outer_merge.iloc[[0, 1, 200, 201, 350, 351]]

Unnamed: 0,outgoing_mins_per_month,outgoing_sms_per_month,monthly_mb,use_id,user_id,platform,platform_version,device,use_type_id
0,21.97,4.82,1557.33,22787,12921.0,android,4.3,GT-I9505,1.0
1,1710.08,136.88,7267.55,22788,28714.0,android,6.0,SM-G930F,1.0
200,28.79,29.42,3114.67,23988,,,,,
201,616.56,99.85,5414.14,24006,,,,,
350,,,,23050,29726.0,ios,10.2,"iPhone7,2",3.0
351,,,,23051,29726.0,ios,10.2,"iPhone7,2",3.0


To further illustrate how the “outer” merge works, we purposely specify certain rows of the `outer_merge` to understand where the rows originate from.

- For the 1st and 2nd rows, the rows come from both dataframes as they have the same values of `use_id` to be merged.
- For the 3rd and 4th rows, the rows come from the left dataframe as the right dataframe doesn’t have the common values of `use_id`.
- For the 5th and 6th rows, the rows come from the right dataframe as the left dataframe doesn’t have the common values of `use_id`.
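
The origin of each row can also be made explicit with the `indicator` parameter of `pd.merge`, which adds a `_merge` column. The sketch below shows this on two small made-up dataframes rather than on the real data.

```python
import pandas as pd

# Toy data: use_id 1 and 3 are common, 2 is left-only, 4 is right-only
usage = pd.DataFrame({"use_id": [1, 2, 3], "monthly_mb": [500.0, 1200.0, 300.0]})
device = pd.DataFrame({"use_id": [1, 3, 4], "platform": ["android", "ios", "android"]})

outer = pd.merge(usage, device, on="use_id", how="outer", indicator=True)

# _merge takes the values "left_only", "right_only" or "both"
print(outer[["use_id", "_merge"]])
print(outer["_merge"].value_counts())
```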

### Merge Dataframes with Different Column Names
So we’ve talked about how to merge data in different ways: left, right, inner, and outer. But the `on` parameter only works when the key column has the same name in the left and right dataframes. When the names differ, we use `left_on` and `right_on` instead of `on`, as shown below.

In [17]:
left_merge = pd.merge(user_device, android_device, left_on = "device", 
                   right_on = "Model", how = "left", indicator = True)

Here we’ve merged `user_device` with `android_device` since they both contain common codes in their columns `device` and `Model` respectively.

In [18]:
left_merge.head()

Unnamed: 0,use_id,user_id,platform,platform_version,device,use_type_id,Retail Branding,Marketing Name,Device,Model,_merge
0,22782,26980,ios,10.2,"iPhone7,2",2,,,,,left_only
1,22783,29628,android,6.0,Nexus 5,3,LGE,Nexus 5,hammerhead,Nexus 5,both
2,22784,28473,android,5.1,SM-G903F,1,Samsung,Galaxy S5 Neo,s5neolte,SM-G903F,both
3,22785,15200,ios,10.2,"iPhone7,2",3,,,,,left_only
4,22786,28239,android,6.0,ONE E1003,1,OnePlus,OnePlus,OnePlus,ONE E1003,both
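
As the output shows, both key columns (`device` and `Model`) are kept when `left_on` and `right_on` are used. The sketch below reproduces this on a made-up three-row example, using device codes taken from the output above, and drops the redundant key column afterwards. The dataframe names ending in `_toy` are chosen for this illustration.

```python
import pandas as pd

user_device_toy = pd.DataFrame(
    {"use_id": [1, 2, 3], "device": ["Nexus 5", "iPhone7,2", "SM-G903F"]}
)
android_toy = pd.DataFrame(
    {"Model": ["Nexus 5", "SM-G903F"], "Retail Branding": ["LGE", "Samsung"]}
)

merged = pd.merge(
    user_device_toy,
    android_toy,
    left_on="device",
    right_on="Model",
    how="left",
    indicator=True,
)

# Both key columns are present after the merge; drop the redundant one
merged = merged.drop(columns="Model")
print(merged)
```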
