# Databases
---
This notebook details how to use Python to interact with databases.

Its main focus is on connecting to databases and on methods for getting different types of data into them.

Four different databases are used in each section to show the similarities as well as the slight differences between handling each. 

## Index
---
1. Intro to Databases
2. Creating Databases  
3. Creating Tables
4. Manually Adding Data to a Table
5. Adding CSV & Excel File Data to a Table
6. Adding JSON File Data to a Table
7. Adding JSON data From Web API
8. Adding Data From the Web Using Web Scraping

## 1. Intro to Databases
---
There are two main types of databases: relational and NoSQL. **Relational databases** store their data in a tabular format and use SQL as their query language, while **NoSQL databases** cover several non-tabular models; document databases such as MongoDB store data in a JSON-like format and use their own query languages rather than SQL. 
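
To make the distinction concrete, here is a minimal sketch of the same movie stored both ways. It uses only the standard library: an in-memory SQLite table stands in for a relational database, and a plain Python dict serialized with `json` stands in for a document in a NoSQL store. The table layout and the nested `director` field are illustrative assumptions, not part of the notebook's schema.

```python
import json
import sqlite3

# Relational: columns are declared up front and every row follows them.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (title TEXT NOT NULL, year_ INTEGER NOT NULL)')
conn.execute('INSERT INTO movies VALUES (?, ?)', ('Jaws', 1975))
row = conn.execute('SELECT title, year_ FROM movies').fetchone()
conn.close()
print(row)

# Document-style: each record carries its own field names and may nest
# values, so two documents need not share the same fields.
document = {'title': 'Jaws',
            'year_': 1975,
            'director': {'first': 'Steven', 'last': 'Spielberg'}}
document_json = json.dumps(document)
print(document_json)
```

The row is only meaningful together with the table's column definitions, while the document is self-describing.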

In this notebook, 4 databases are used:

1. **SQLite** - A file-based relational database. Simple to use, but it has fewer features and no built-in user accounts or access control, so it is less secure than the other examples. SQLite is good for small projects as well as for development and testing.


2. **MySQL & PostgreSQL** - Both of these databases are relational and work on a 'client-server' model, which requires a DB server to be set up and run over the network. They have more features, are more secure, and are therefore good for production-level systems. 


3. **MongoDB** - The most popular NoSQL database. MongoDB performs on the same level as MySQL and PostgreSQL; however, the way in which it stores and interacts with data is different, as it does not use tabular data structures or SQL.  

Each section in this notebook will contain an example for every database above to show the similarities and differences between each. 

## 2. Creating Databases
---
**Important Note:**
Most production databases and their table structures are created outside of a Python program. 

### Creating Databases
To create a database, the database client/server software must already be installed and set up on your computer (except for SQLite, which is just a file).

While Python can be used to create databases on a server, it is much easier to do that part either with the database server's own software or through the command line. I'm showing this just so you can see that it can be done.

### Database Imports, Global Variables, and Global Functions

In [1]:
# Each db requires an import package to work (sqlite3 ships with Python, the
# others are external). There are multiple packages for each; in this notebook
# I have tried to use either the most popular one or the one created
# specifically for each db by its creators.

import requests
import sqlite3           # sqlite import
import mysql.connector   # MySQL import
import psycopg2          # postgreSQL import
import pymongo           # MongoDB import
import pandas as pd
from load_dbs import *

# Variables used in examples 
db = 'movies_db'
db_table = 'movies'

### SQLite
The first three database examples use relational databases. These databases have structured pre-defined table schemas and data is stored in a tabular (row/column) format.

SQLite is the simplest type of database as it is just a file and does not require a server to use. Because of this, SQLite databases are handled differently than MySQL and PostgreSQL in many cases, and I comment on this in the code whenever necessary. 

**The database interaction process in Python follows 3 basic steps for all the database systems:**

1. Connect to the Database (or database server if no database created)
2. Perform queries on the database
3. Close the database connection

This process is the same for all the examples so I only comment these three steps in the example below.

In [2]:
# sqlite3.connect() creates the database file if not already created,
# otherwise it simply connects to the database file.

# 1. Connect to the Database (SQLite is a file, so there is no server)
conn = sqlite3.connect(f'data/{db}')   
print(conn)                                 
                                       
# 2. Perform Queries (not necessary here as db already created in step 1)
print(f'\n{db} created successfully')

# 3. Close the Connection
conn.close()                            
print(f'\n{db} connection closed.\n')

<sqlite3.Connection object at 0x000001E9BB410650>

movies_db created successfully

movies_db connection closed.
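
The three steps can also be written so that closing happens automatically. The sketch below is one possible arrangement, not the notebook's method: it wraps the connection and cursor in `contextlib.closing` and uses an in-memory database (`':memory:'`) so that it does not touch the `data/` folder.

```python
import sqlite3
from contextlib import closing

# 1. Connect (':memory:' creates a throwaway database in RAM)
with closing(sqlite3.connect(':memory:')) as conn:
    # 2. Perform queries
    with closing(conn.cursor()) as cur:
        cur.execute('SELECT sqlite_version()')
        version = cur.fetchone()[0]

# 3. Both the cursor and the connection are closed at this point, even if
#    a query above had raised an exception.
print(f'SQLite version: {version}')
```

Note that `with sqlite3.connect(...) as conn:` on its own only commits or rolls back the transaction; it does not close the connection, which is why `closing` is used here.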



### MySQL
MySQL is handled differently than SQLite as it must connect to the MySQL database server in order to create the database. In order to connect to a db server you need to know the server's host, username, and password.

#### A note on cursors
In order for queries to be performed, a 'cursor' object must be created. All of the SQL database packages used here have methods for doing so, and they are all virtually the same, so this example is the only one where I comment when the cursor is created. 

In [3]:
conn = mysql.connector.connect(host='localhost',
                               user='root',
                               password='testin123!')
print(conn)

# Create cursor and use it to perform queries
cur = conn.cursor()
cur.execute(f'CREATE DATABASE IF NOT EXISTS {db};')

print(f'\n{db} created successfully')

cur.close()
conn.close()
print(f'\n{db} connection closed.\n')

<mysql.connector.connection.MySQLConnection object at 0x000001E9BB4384C8>

movies_db created successfully

movies_db connection closed.
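
As a small illustration of what a cursor does once it exists, the sketch below runs a query and reads the results back in two ways. It uses an in-memory SQLite database so it can run without a server; the MySQL and PostgreSQL cursors used in this notebook offer the same `execute`, `fetchone`, and `fetchall` methods. The two-column table is a simplified assumption.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE movies (title TEXT, year_ INTEGER)')
cur.executemany('INSERT INTO movies VALUES (?, ?)',
                [('Jaws', 1975), ('Joker', 2019), ('Beetlejuice', 1988)])

cur.execute('SELECT title, year_ FROM movies ORDER BY year_')
first = cur.fetchone()   # the next row as a tuple, or None when exhausted
rest = cur.fetchall()    # every remaining row as a list of tuples
print(first)
print(rest)

cur.close()
conn.close()
```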



### PostgreSQL
Connecting to the database server using PostgreSQL is very similar to MySQL, however there are a couple of important differences: 

1. MySQL has slightly different SQL syntax than PostgreSQL in some cases. In the MySQL example above, 'IF NOT EXISTS' checks whether the db has already been created and, if so, continues without re-creating it or throwing an error. PostgreSQL does not support 'IF NOT EXISTS' for CREATE DATABASE and will throw an error when attempting to create a database that already exists. Here I use a try/except clause to handle this problem. 


2. conn.autocommit is needed because PostgreSQL does not allow CREATE DATABASE to run inside a transaction block, and psycopg2 opens a transaction by default.

**Important Note:**
It is best practice to use a try/except clause whenever dealing with external connections of any type, so from here on out I will use this convention.

In [4]:
try:
    conn = psycopg2.connect(host='localhost',
                            user='postgres',
                            password='testin123!')
    conn.autocommit = True
    print(conn)

    # Perform Queries Using Cursor 
    cur = conn.cursor()
    cur.execute(f'CREATE DATABASE {db};')
    print(f'Database: {db} created successfully')
except psycopg2.Error:
    # Most likely cause here: the database already exists
    print(f'\nError: {db} already exists on PostgresSQL server')
    
cur.close()
conn.close()
print(f'\n{db} connection closed.')

<connection object at 0x000001E9BB37DE18; dsn: 'user=postgres password=xxx host=localhost', closed: 0>

Error: movies_db already exists on PostgresSQL server

movies_db connection closed.
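
The try/except convention can be tightened in two ways: catch a specific exception class rather than everything, and close the connection in a `finally` block so it is released whether or not the query failed. The sketch below shows this shape with an in-memory SQLite database, where creating the same table twice plays the role of re-creating an existing database; it is an illustration under those assumptions, not the code used in this notebook.

```python
import sqlite3

conn = None
messages = []
try:
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE movies (title TEXT)')
    # Creating the same table again fails, much like re-creating an
    # existing database does on a server.
    conn.execute('CREATE TABLE movies (title TEXT)')
except sqlite3.OperationalError as err:
    messages.append(f'Error: {err}')
finally:
    # Runs whether or not an exception occurred; the None check covers
    # the case where connect() itself failed.
    if conn is not None:
        conn.close()
        messages.append('connection closed.')

print('\n'.join(messages))
```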


### MongoDB
MongoDB is a NoSQL database which means it stores its data in an unstructured JSON-like data format and uses its own query syntax rather than SQL.

MongoDB handles the creation of databases very similarly to SQLite. All you have to do is connect to the desired database: if it exists, a connection is made; if not, the database is created and then the connection is made.

In [5]:
# Note: only the host and port number are required to connect
# to the database server, once database is created a username and 
# password can be created for that specific database. 
try:
    conn = pymongo.MongoClient('localhost', 27017)

    # Note: db won't show up on server until collection(table) 
    # is created and data is added. 
    mongo_db = conn.movies  # can also use conn['movies']
    print(mongo_db)
except:
    print('\nError: Could not connect to database.')

conn.close()

Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'movies')


## 3. Creating Tables
---
In all relational databases, a table schema (or blueprint) is used to set up all the column (or field) names as well as what types of data can go into each. 

Because this is not an in-depth database design tutorial, I only use one simple table for this tutorial.
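
To see what a schema actually enforces, the sketch below creates a cut-down version of the movies table in an in-memory SQLite database, reads the schema back, and then tries to insert a row that violates it. The reduced column list is an assumption made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE movies (
                    id_ INTEGER PRIMARY KEY,
                    title TEXT NOT NULL,
                    year_ INTEGER NOT NULL)''')

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default value, pk)
columns = [(name, col_type, bool(notnull))
           for _cid, name, col_type, notnull, _default, _pk
           in conn.execute('PRAGMA table_info(movies)')]
print(columns)

# A row that breaks the schema is rejected instead of being stored.
try:
    conn.execute('INSERT INTO movies (title, year_) VALUES (?, ?)', (None, 1975))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(f'NULL title rejected: {rejected}')
conn.close()
```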

### SQLite

In [3]:
#1. Connect to Database
conn = sqlite_conn(db)


#2. Perform Query
try:
    cur = conn.cursor()
    cur.execute('''CREATE TABLE IF NOT EXISTS movies (
                   id_ integer PRIMARY KEY,
                   title text NOT NULL,
                   year_ integer NOT NULL,
                   duration integer NOT NULL,
                   director text NOT NULL);
                ''')
    print(f'connected to the {db} database...')
except:
    print('\nError: could not create table')
    
# An empty pandas dataframe is returned as there are no movie entries yet.
# Note that pandas is discussed in detail later.
table = pd.read_sql_query(f'SELECT * from movies', conn)
print()
print(table)


# 3. Close Connection
close_db('sqlite', db, conn, cur)

connected to the movies_db database...

Empty DataFrame
Columns: [id_, title, year_, duration, director]
Index: []

movies_db database connection closed.



### MySQL

In [3]:
# 1. Connect to Database
try:
    conn = mysql.connector.connect(host='localhost',
                                   user='root',
                                   password='testin123!',
                                   database=db)
    print(f'connected to the {db} database...')
except:
    print(f'Error: could not connect to {db} database.')

    
# 2. Perform Query
try:
    cur = conn.cursor()
    cur.execute('''CREATE TABLE IF NOT EXISTS movies (
                   id_ INT AUTO_INCREMENT PRIMARY KEY,
                   title VARCHAR(75) NOT NULL,
                   year_ INT NOT NULL,
                   duration INT NOT NULL,
                   director VARCHAR(75) NOT NULL);
                ''')
except:
    print('\nError: could not create table.')
    
table = pd.read_sql_query(f'SELECT * FROM movies', conn)
print()
print(table)


# 3. Close Connection
close_db('mysql', db, conn, cur)

connected to the movies_db database...

Empty DataFrame
Columns: [id_, title, year_, duration, director]
Index: []

movies_db database connection closed.



### PostgreSQL
Note: Almost all the code below is identical to the MySQL example above, except for slight SQL syntax differences between the CREATE TABLE queries (for example, SERIAL is used here instead of INT AUTO_INCREMENT).

In [3]:
# 1. Connect to Database
try:
    conn = psycopg2.connect(host="localhost",
                            user="postgres",
                            password="testin123!",
                            database=db)
    print(f'connected to the {db} database...')
except:
    print(f'Error: could not connect to {db} database.')

    
# 2. Perform Query
try:
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute('''CREATE TABLE movies (
                   id_ SERIAL PRIMARY KEY,
                   title VARCHAR(75) NOT NULL,
                   year_ INTEGER NOT NULL,
                   duration INTEGER NOT NULL,
                   director VARCHAR(75) NOT NULL);
                ''')
except:
    print('\nError: could not create table, possibly already exists.')
    
table = pd.read_sql_query(f'SELECT * FROM movies', conn)
print()
print(table)


# 3. Close Connection
close_db('postgre', db, conn, cur)

connected to the movies_db database...

Empty DataFrame
Columns: [id_, title, year_, duration, director]
Index: []

movies_db database connection closed.



### MongoDB
MongoDB does not have tables like the above relational databases; instead it uses what are called 'collections' of documents stored in a JSON-like format (BSON).

In [3]:
# 1. Connect to Database
conn, movie_db = mongo_conn(db)
 

# 2. Perform Query
# Note that here we are trying to use the movies collection, if it does not
# exist it is created, if it does exist, then it is available to use. 
try:
    movies_table = movie_db.movies
    print(f'\nmovies collection created in the {db} database.')
except:
    print('\nError: could not create table, possibly already exists.')

# Note: find() is same as select * in sql
table = pd.DataFrame(list(movies_table.find()))
print()
print(table)        


# 3. Close Connection
close_db('mongo', db, conn)

connected to the movies_db database...

movies collection created in the movies_db database.

Empty DataFrame
Columns: []
Index: []

movies_db database connection closed.



## 4. Manually Adding Data to a Table
---
Up to this point we have just created the 'movies_db' database along with a single table named 'movies' to hold individual movie data. The rest of the sections in this tutorial focus on the many ways to insert data and retrieve data from a database. This next section shows how to do this in Python itself rather than from an external source.

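The examples in this section all add a single movie. Each example starts by loading that movie's values with the get_movie_data() helper (the first line of each cell), so MAKE SURE THE IMPORTS CELL AT THE TOP OF THE NOTEBOOK HAS BEEN RUN BEFORE PROCEEDING.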

### SQLite

In [6]:
# I created a special function to load the data
# note this was done to avoid using global vars
id_, title, year_, duration, director = get_movie_data()

#1. Connect to Database
conn = sqlite_conn(db)


#2. Perform Query
# Test if movie in db prior to inserting to prevent duplicate entries based
# on title and year. The ? placeholders let sqlite3 handle quoting, so a
# title that contains quote characters cannot break the query.
test_query = f'SELECT * FROM {db_table} WHERE title=? AND year_=?;'
cur = conn.cursor()
cur.execute(test_query, (title, year_))
query_results = cur.fetchall()

if len(query_results) == 0:
    try:
        cur.execute(f'INSERT INTO {db_table} VALUES (?,?,?,?,?)', 
                    (id_, title, year_, duration, director))
        conn.commit()
    except:
        print('\nError: could not insert data into database.')
else:
    print(f'\nError on insertion, {title}({year_}) already exists in the database.')

# Note the title is passed through params rather than quoted inside the SQL string
# Also note that the title is Jaws for this example
table = pd.read_sql_query(f'SELECT * FROM {db_table} WHERE title=?', conn, params=[title])
print()
print(table)


# 3. Close Connection
close_db('sqlite', db, conn, cur)


Error on insertion, Jaws(1975) already exists in the database.

   id_ title  year_  duration          director
0    1  Jaws   1975       124  Steven Spielberg

movies_db database connection closed.
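
The check-then-insert pattern above issues two queries per movie. An alternative, sketched here as an assumption rather than as the notebook's approach, is to let the database enforce uniqueness with a UNIQUE constraint on `(title, year_)` and use SQLite's `INSERT OR IGNORE`. Note that the `movies` table created earlier in this notebook does not declare such a constraint, so this example builds its own table in memory.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE movies (
                    id_ INTEGER PRIMARY KEY,
                    title TEXT NOT NULL,
                    year_ INTEGER NOT NULL,
                    duration INTEGER NOT NULL,
                    director TEXT NOT NULL,
                    UNIQUE (title, year_))''')

movie = ('Jaws', 1975, 124, 'Steven Spielberg')
insert = '''INSERT OR IGNORE INTO movies (title, year_, duration, director)
            VALUES (?, ?, ?, ?)'''

inserted = []
for _ in range(2):                  # the second attempt is a duplicate
    cur = conn.execute(insert, movie)
    inserted.append(cur.rowcount)   # rows changed by this statement
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM movies').fetchone()[0]
print(inserted, count)
conn.close()
```

MySQL and PostgreSQL have their own spellings of the same idea (`INSERT IGNORE` and `INSERT ... ON CONFLICT DO NOTHING` respectively).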



### MySQL

In [7]:
id_, title, year_, duration, director = get_movie_data()

# 1. Connect to Database
conn = mysql_conn(db)
 
    
# 2. Perform Query
# Test to see if movie already in db before insertion
# %s placeholders let the connector handle quoting of the values
test_query = f'SELECT * FROM {db_table} WHERE title=%s AND year_=%s;'
cur = conn.cursor()
cur.execute(test_query, (title, year_))
query_results = cur.fetchall()

if len(query_results) == 0:  
    try:
        cur.execute(f'INSERT INTO {db_table} VALUES (%s,%s,%s,%s,%s)',
                    (id_, title, year_, duration, director))
        conn.commit()
    except:
        print('\nError: could not insert data into database.')
else:
    print(f'\nError on insertion, {title}({year_}) already exists in the database.')
    
table = pd.read_sql_query(f'SELECT * FROM {db_table} WHERE title=%s', conn, params=[title])
print()
print(table)


# 3. Close Connection
close_db('mysql', db, conn, cur)


Error on insertion, Jaws(1975) already exists in the database.

   id_ title  year_  duration          director
0    1  Jaws   1975       124  Steven Spielberg

movies_db database connection closed.



### PostgreSQL

In [8]:
id_, title, year_, duration, director = get_movie_data()

# 1. Connect to Database
conn = postgres_conn(db)
 
    
# 2. Perform Query
# Test to see if movie already in db before insertion
test_query = f'SELECT * FROM {db_table} WHERE title= %s AND year_= %s;'
cur = conn.cursor()
cur.execute(test_query, (title, year_))
query_results = cur.fetchall()

if len(query_results) == 0:  
    try:
        query = '''INSERT INTO movies (title, year_, duration, director) 
                   VALUES (%s,%s,%s,%s);'''
        cur.execute(query,(title, year_, duration, director))
        conn.commit()
    except:
        print('\nError: could not insert data into database.')
else:
    print(f'\nError on insertion, {title}({year_}) already exists in the database.')

# Note that psycopg2 uses %s as the placeholder for every value (not only strings);
# passing the values separately guards against SQL injection.
# Therefore when using pandas, the values must be passed as params=[var1, var2, var3...]
table = pd.read_sql_query('SELECT * FROM movies WHERE title=%s', conn, params=[title])
print()
print(table)


# 3. Close Connection
close_db('postgre', db, conn, cur)


Error on insertion, Jaws(1975) already exists in the database.

   id_ title  year_  duration          director
0    1  Jaws   1975       124  Steven Spielberg

movies_db database connection closed.
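
The security point about placeholders can be demonstrated without a PostgreSQL server. The sketch below uses an in-memory SQLite database (whose placeholder is `?` rather than `%s`) and a made-up hostile "title" to show how building the query with an f-string changes its meaning, while a placeholder keeps the input as plain data. The input string and table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE movies (title TEXT, year_ INTEGER)')
conn.executemany('INSERT INTO movies VALUES (?, ?)',
                 [('Jaws', 1975), ('Joker', 2019)])

# Hypothetical hostile input pretending to be a movie title.
title = "' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, which becomes
#   ... WHERE title='' OR '1'='1' ...  and therefore matches every row.
unsafe = conn.execute(
    f"SELECT title FROM movies WHERE title='{title}' ORDER BY title").fetchall()

# Safe: the value is sent separately, so it is only ever compared as data.
safe = conn.execute(
    'SELECT title FROM movies WHERE title=? ORDER BY title', (title,)).fetchall()

print(unsafe)
print(safe)
conn.close()
```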



### MongoDB

In [9]:
id_, title, year_, duration, director = get_movie_data()

# 1. Connect to Database
conn, movie_db = mongo_conn(db)


# 2. Perform Query 
# Test to see if movie already in db before insertion
# Note: mongo uses a filter instead of sql, so the entire query is done here in one line
# it is converted to a list and the length is returned, length = 0 means no matches
test_query = len(list(movie_db.movies.find(filter={'title': title, 'year_': year_})))

# connect to collection
movies_table = movie_db.movies
if test_query == 0: 
    try:
        # add a movie
        movie = {'title' : title,
                 'year_' : year_, 
                 'duration' : duration,
                 'director' : director}

        movies_table.insert_one(movie)

        print(f'\n{title}({year_}) inserted into the movies collection.')
    except:
        print('\nError: could not insert data into database.')
else:
    print(f'\nError on insertion, {title}({year_}) already exists in the database.')  

# Here a filter is used to find the specific title, in this case 'Jaws'
table = pd.DataFrame(list(movies_table.find(filter={'title': title})))

print()
print(table)        

# 3. Close Connection
close_db('mongo', db, conn)


Error on insertion, Jaws(1975) already exists in the database.

                        _id title  year_  duration          director
0  5e710c8db08888178cf4fdb0  Jaws   1975       124  Steven Spielberg

movies_db database connection closed.



## 5. Adding CSV & Excel File Data to a Table
---
The manual approach above is usually used together with some type of external data source. When it comes to files, text, CSV, Excel, XML, and JSON files tend to be the most common. This section covers CSV and Excel files. 

### SQLite

In [10]:
#1. Connect to Database
conn = sqlite_conn(db)

    
#2. Perform Query
# Read the CSV with pandas, then convert it into a list of rows.
# A forward slash in the path works on Windows, macOS, and Linux.
df = pd.read_csv('data/movies.csv')
movies_dict = df.to_dict('split') # 'split' puts the rows under the 'data' key
movie_list = movies_dict['data']

for movie in movie_list:
    id_ = None  # None lets SQLite assign the next id_ automatically
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    # ? placeholders keep titles that contain quotes from breaking the query
    query = f'SELECT * FROM {db_table} WHERE title=? AND year_=?;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()
    
    if len(query_results) == 0:
        try:
            cur.execute(f'INSERT INTO {db_table} VALUES (?,?,?,?,?)', 
                        (id_, title, year_, duration, director))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {title}({year_}) already exists in the database.')

table = pd.read_sql_query(f'SELECT * from {db_table} WHERE id_ <= 3', conn)
print()
print('All movies in the movie_db:')
print(table)


# 3. Close Connection
close_db('sqlite', db, conn, cur)


Error on insertion, Joker(2019) already exists in the database.

Error on insertion, Beetlejuice(1988) already exists in the database.

All movies in the movie_db:
   id_        title  year_  duration          director
0    1         Jaws   1975       124  Steven Spielberg
1    2        Joker   2019       122     Todd Phillips
2    3  Beetlejuice   1988        92        Tim Burton

movies_db database connection closed.
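
For comparison, the same kind of load can be done with only the standard library. The sketch below is an alternative to the pandas route, not a replacement for it: it parses CSV text with `csv.DictReader` and inserts all rows in one call with `executemany`. The CSV content is an in-memory stand-in for `data/movies.csv`, and its header names are an assumption.

```python
import csv
import io
import sqlite3

# Stand-in for data/movies.csv; the header names are an assumption.
csv_text = '''title,year_,duration,director
Joker,2019,122,Todd Phillips
Beetlejuice,1988,92,Tim Burton
'''

reader = csv.DictReader(io.StringIO(csv_text))
# csv yields strings, so the numeric columns are converted explicitly.
rows = [(r['title'], int(r['year_']), int(r['duration']), r['director'])
        for r in reader]

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE movies (
                    id_ INTEGER PRIMARY KEY,
                    title TEXT NOT NULL,
                    year_ INTEGER NOT NULL,
                    duration INTEGER NOT NULL,
                    director TEXT NOT NULL)''')
# executemany runs the same parameterized INSERT once per row.
conn.executemany('''INSERT INTO movies (title, year_, duration, director)
                    VALUES (?, ?, ?, ?)''', rows)
conn.commit()

stored = conn.execute(
    'SELECT id_, title, year_ FROM movies ORDER BY id_').fetchall()
print(stored)
conn.close()
```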



### MySQL

In [11]:
# 1. Connect to Database
conn = mysql_conn(db)


# 2. Perform Query
df = pd.read_csv('data/movies.csv')  # forward slash works on every OS
movies_dict = df.to_dict('split')
movie_list = movies_dict['data']

for movie in movie_list:
    id_   = None  # None lets AUTO_INCREMENT assign the next id_
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    
    # %s placeholders let the connector handle quoting of the values
    query = f'SELECT * FROM {db_table} WHERE title=%s AND year_=%s;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()
    
    if len(query_results) == 0:
        try:
            cur.execute(f'INSERT INTO {db_table} VALUES (%s,%s,%s,%s,%s)',
                        (id_, title, year_, duration, director))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {title}({year_}) already exists in the database.')

        
table = pd.read_sql_query(f'SELECT * from {db_table} WHERE id_ <= 3', conn)
print()
print(table)


# 3. Close Connection
close_db('mysql', db, conn, cur)


Error on insertion, Joker(2019) already exists in the database.

Error on insertion, Beetlejuice(1988) already exists in the database.

   id_        title  year_  duration          director
0    1         Jaws   1975       124  Steven Spielberg
1    2        Joker   2019       122     Todd Phillips
2    3  Beetlejuice   1988        92        Tim Burton

movies_db database connection closed.



### PostgreSQL

In [12]:
# 1. Connect to Database
conn = postgres_conn(db)

    
# 2. Perform Query
df = pd.read_csv('data/movies.csv')  # forward slash works on every OS
movies_dict = df.to_dict('split')
movie_list = movies_dict['data']

for movie in movie_list:
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]

    query = f'SELECT * FROM {db_table} WHERE title= %s AND year_= %s;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()

    if len(query_results) == 0:  
        try:
            query = '''INSERT INTO movies (title, year_, duration, director) 
                       VALUES (%s,%s,%s,%s);'''
            cur.execute(query,(title, year_, duration, director))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')
    
table = pd.read_sql_query(f'SELECT * FROM {db_table} WHERE id_ <= 3', conn)
print()
print(table)


# 3. Close Connection
close_db('postgre', db, conn, cur)


Error on insertion, Joker(2019) already exists in the database.

Error on insertion, Beetlejuice(1988) already exists in the database.

   id_        title  year_  duration          director
0    1         Jaws   1975       124  Steven Spielberg
1    2        Joker   2019       122     Todd Phillips
2    3  Beetlejuice   1988        92        Tim Burton

movies_db database connection closed.



### MongoDB

In [13]:
# 1. Connect to Database
conn, movie_db = mongo_conn(db)


# 2. Perform Query
df = pd.read_csv('data/movies.csv')  # forward slash works on every OS
movies_dict = df.to_dict('split')
movie_list = movies_dict['data']

for movie in movie_list:
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    
    query = len(list(movie_db.movies.find(filter={'title': title, 'year_': year_})))

    if query == 0: 
        try:
            movies_table = movie_db.movies

            movie = {'title' : title,
                     'year_' : year_, 
                     'duration' : duration,
                     'director' : director}

            movies_table.insert_one(movie)
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')

        
# find() is the same as 'SELECT * FROM table' in SQL; note that here I slice out the first 3 entries
table = pd.DataFrame(list(movie_db.movies.find())[0:3])
print()
print(table)        

# 3. Close Connection
close_db('mongo', db, conn)


Error on insertion, Joker(2019) already exists in the database.

Error on insertion, Beetlejuice(1988) already exists in the database.

                        _id        title  year_  duration          director
0  5e710c8db08888178cf4fdb0         Jaws   1975       124  Steven Spielberg
1  5e710cb0b08888178cf4fdb2        Joker   2019       122     Todd Phillips
2  5e710cb0b08888178cf4fdb3  Beetlejuice   1988        92        Tim Burton

movies_db database connection closed.



## 6. Adding JSON File Data to a Table
---
JSON stands for 'JavaScript Object Notation' and is used extensively with web APIs. Working with JSON can be confusing when using only the barebones json module that comes with Python. But just like the above CSV examples, Pandas makes life easy when converting JSON files to a proper format for Python use. 

### SQLite

In [14]:
#1. Connect to Database
conn = sqlite_conn(db)

    
#2. Perform Query
df = pd.read_json('data/movies.json')  # forward slash works on every OS
movies_dict = df.to_dict('split') # 'split' puts the rows under the 'data' key
movie_list = movies_dict['data']

for movie in movie_list:
    id_ = None  # None lets SQLite assign the next id_ automatically
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    # ? placeholders keep titles that contain quotes from breaking the query
    query = f'SELECT * FROM {db_table} WHERE title=? AND year_=?;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()
    
    if len(query_results) == 0:
        try:
            cur.execute(f'INSERT INTO {db_table} VALUES (?,?,?,?,?)', 
                        (id_, movie[0], movie[1], movie[2], movie[3]))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')


table = pd.read_sql_query(f'SELECT * from {db_table} WHERE id_ <= 5', conn)
print()
print('All movies in the movie_db:')
print(table)


# 3. Close Connection
close_db('sqlite', db, conn, cur)


Error on insertion, Star Wars: Episode IV - A New Hope(1977) already exists in the database.

Error on insertion, Forrest Gump(1994) already exists in the database.

All movies in the movie_db:
   id_                               title  year_  duration          director
0    1                                Jaws   1975       124  Steven Spielberg
1    2                               Joker   2019       122     Todd Phillips
2    3                         Beetlejuice   1988        92        Tim Burton
3    4  Star Wars: Episode IV - A New Hope   1977       121      George Lucas
4    5                        Forrest Gump   1994       142   Robert Zemeckis

movies_db database connection closed.
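
The built-in `json` module is also workable when the file is a simple list of objects. The sketch below assumes that layout for `data/movies.json` (the real file's structure is not shown in this notebook) and uses an inline string in its place. Because `json.loads` returns a list of dicts, the rows can be inserted directly with SQLite's named placeholders.

```python
import json
import sqlite3

# Stand-in for data/movies.json; a list of objects is an assumed layout.
json_text = '''[
  {"title": "Star Wars: Episode IV - A New Hope", "year_": 1977,
   "duration": 121, "director": "George Lucas"},
  {"title": "Forrest Gump", "year_": 1994,
   "duration": 142, "director": "Robert Zemeckis"}
]'''
movies = json.loads(json_text)   # a list of dicts

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE movies (
                    id_ INTEGER PRIMARY KEY,
                    title TEXT NOT NULL,
                    year_ INTEGER NOT NULL,
                    duration INTEGER NOT NULL,
                    director TEXT NOT NULL)''')
# Named placeholders (:title, ...) are filled from each dict's keys.
conn.executemany('''INSERT INTO movies (title, year_, duration, director)
                    VALUES (:title, :year_, :duration, :director)''', movies)
conn.commit()

directors = [row[0] for row in
             conn.execute('SELECT director FROM movies ORDER BY year_')]
print(directors)
conn.close()
```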



### MySQL

In [15]:
# 1. Connect to Database
conn = mysql_conn(db)


# 2. Perform Query
df = pd.read_json('data/movies.json')  # forward slash works on every OS
movies_dict = df.to_dict('split') # 'split' puts the rows under the 'data' key
movie_list = movies_dict['data']

for movie in movie_list:
    id_ = None  # None lets AUTO_INCREMENT assign the next id_
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    
    # %s placeholders let the connector handle quoting of the values
    query = f'SELECT * FROM {db_table} WHERE title=%s AND year_=%s;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()
    
    if len(query_results) == 0:
        try:
            cur.execute(f'INSERT INTO {db_table} VALUES (%s,%s,%s,%s,%s)',
                        (id_, title, year_, duration, director))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')

        
table = pd.read_sql_query(f'SELECT * from {db_table} WHERE id_ <=5', conn)
print()
print(table)


# 3. Close Connection
close_db('mysql', db, conn, cur)


Error on insertion, Star Wars: Episode IV - A New Hope(1977) already exists in the database.

Error on insertion, Forrest Gump(1994) already exists in the database.

   id_                               title  year_  duration          director
0    1                                Jaws   1975       124  Steven Spielberg
1    2                               Joker   2019       122     Todd Phillips
2    3                         Beetlejuice   1988        92        Tim Burton
3    4  Star Wars: Episode IV - A New Hope   1977       121      George Lucas
4    5                        Forrest Gump   1994       142   Robert Zemeckis

movies_db database connection closed.



### PostgreSQL

In [16]:
# 1. Connect to Database
conn = postgres_conn(db)

    
# 2. Perform Query
df = pd.read_json('data/movies.json')  # forward slash works on every OS
movies_dict = df.to_dict('split')
movie_list = movies_dict['data']

for movie in movie_list:
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]

    query = f'SELECT * FROM {db_table} WHERE title= %s AND year_= %s;'
    cur = conn.cursor()
    cur.execute(query, (title, year_))
    query_results = cur.fetchall()

    if len(query_results) == 0:  
        try:
            query = '''INSERT INTO movies (title, year_, duration, director) 
                       VALUES (%s,%s,%s,%s);'''
            cur.execute(query,(title, year_, duration, director))
            conn.commit()
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')
    
table = pd.read_sql_query(f'SELECT * from {db_table} WHERE id_ <= 5', conn)
print()
print(table)


# 3. Close Connection
close_db('postgre', db, conn, cur)


Error on insertion, Star Wars: Episode IV - A New Hope(1977) already exists in the database.

Error on insertion, Forrest Gump(1994) already exists in the database.

   id_                               title  year_  duration          director
0    1                                Jaws   1975       124  Steven Spielberg
1    2                               Joker   2019       122     Todd Phillips
2    3                         Beetlejuice   1988        92        Tim Burton
3    4  Star Wars: Episode IV - A New Hope   1977       121      George Lucas
4    5                        Forrest Gump   1994       142   Robert Zemeckis

movies_db database connection closed.



### MongoDB

In [17]:
# 1. Connect to Database
conn, movie_db = mongo_conn(db)


# 2. Perform Query
df = pd.read_json('data/movies.json')  # forward slash works on every OS
movies_dict = df.to_dict('split')
movie_list = movies_dict['data']

for movie in movie_list:
    title = movie[0]
    year_ = movie[1]
    duration = movie[2]
    director = movie[3]
    
    query = len(list(movie_db.movies.find(filter={'title': title, 'year_': year_})))

    if query == 0: 
        try:
            movies_table = movie_db.movies

            movie = {'title' : title,
                     'year_' : year_, 
                     'duration' : duration,
                     'director' : director}

            movies_table.insert_one(movie)
        except:
            print('\nError: could not insert data into database.')
    else:
        print(f'\nError on insertion, {movie[0]}({movie[1]}) already exists in the database.')

        
table = pd.DataFrame(list(movie_db.movies.find())[0:5])  #.find(), same as 'select * from table in sql'
print()


pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)
print(table)        

# 3. Close Connection
close_db('mongo', db, conn)


Error on insertion, Star Wars: Episode IV - A New Hope(1977) already exists in the database.

Error on insertion, Forrest Gump(1994) already exists in the database.

                        _id                               title  year_  duration          director
0  5e710c8db08888178cf4fdb0                                Jaws   1975       124  Steven Spielberg
1  5e710cb0b08888178cf4fdb2                               Joker   2019       122     Todd Phillips
2  5e710cb0b08888178cf4fdb3                         Beetlejuice   1988        92        Tim Burton
3  5e710ccfb08888178cf4fdb5  Star Wars: Episode IV - A New Hope   1977       121      George Lucas
4  5e710ccfb08888178cf4fdb6                        Forrest Gump   1994       142   Robert Zemeckis

movies_db database connection closed.
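
The check-then-insert pattern above takes two round trips per movie. A possible alternative, sketched below, is a single upsert: `update_one` with `upsert=True` and `$setOnInsert` creates the document only when no match exists. `insert_if_missing` is a hypothetical helper name, and `collection` is assumed to be a pymongo collection such as `movie_db.movies`.

```python
def insert_if_missing(collection, movie):
    """Insert `movie` unless a document with the same title and year_ exists.

    `collection` is assumed to be a pymongo Collection (e.g. movie_db.movies).
    Returns True when a new document was created, False otherwise.
    """
    key = {'title': movie['title'], 'year_': movie['year_']}
    # $setOnInsert writes the fields only when the upsert creates a new document
    result = collection.update_one(key, {'$setOnInsert': movie}, upsert=True)
    return result.upserted_id is not None
```

With a live connection this might be called as `insert_if_missing(movie_db.movies, movie)` inside the loop, in place of the separate count and insert steps.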



## 7. Adding JSON data From Web API
---
Now that we've created databases and database tables and added data from various file sources, it's time to move on to pulling data from a web API. Since our database is a movie database, we will look up movie data for each title in a list of movie names. I am using the OMDb API (omdbapi.com), a free movie database API with a very simple JSON-based interface. You will need to apply for a key (they are free) for it to work.
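
As a rough illustration of the conversion step, the sketch below maps one JSON response to the five column values used in the tables. The sample payload is hand-written from the field names the code in this section reads (`Title`, `Year`, `Runtime`, `Director`); it is not a captured API response, and `parse_movie` is a hypothetical helper name.

```python
import json

def parse_movie(payload):
    """Map an OMDb-style JSON object to (id_, title, year_, duration, director)."""
    title = payload['Title']
    year_ = int(payload['Year'])
    # Runtime arrives as text such as '124 min'; keep only the leading number
    duration = int(payload['Runtime'].split()[0])
    director = payload['Director']
    # id_ is left as None so the database can assign it
    return (None, title, year_, duration, director)

# Hand-written sample, assumed to mirror the shape of a real response
sample = json.loads(
    '{"Title": "Jaws", "Year": "1975", "Runtime": "124 min", '
    '"Director": "Steven Spielberg"}'
)
row = parse_movie(sample)
print(row)  # (None, 'Jaws', 1975, 124, 'Steven Spielberg')
```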

In [18]:
import re 
import requests
import json
import pandas as pd
import time

### MySQL

In [19]:
key = '204cb77b'  # replace with your own API key

# 1. Convert CSV movie_list file into Python list using pandas
df = pd.read_csv('data/movie_list_small.csv', delimiter=r'\s,', engine='python')
titles = df.iloc[:,0].values.tolist()  # first column holds the movie titles

for title in titles:
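    # 2. Convert API JSON data into Python variables, converting types as needed
    try:
        # Passing params lets requests URL-encode titles containing spaces or symbols
        res = requests.get('http://www.omdbapi.com/', params={'t': title, 'apikey': key})
        res.raise_for_status()
        json_data = json.loads(res.text)
            
        id_ = None
        title = json_data['Title']
        year_  = int(json_data['Year'])
        duration = int(json_data['Runtime'].split()[0])  # e.g. '124 min' -> 124
        director = json_data['Director']
    except (requests.RequestException, KeyError, ValueError) as err:
        print(f'Error: error converting data from API: {err}')
        continue  # skip the database step so stale or missing values are not inserted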

    # 3. Check if movie is in db, if not add it. 
    conn, cur = None, None
    try:
        conn = mysql_conn(db)
        cur = conn.cursor()
        
        # Pass values as parameters so titles containing quotes cannot break the query
        query = f'SELECT * FROM {db_table} WHERE title=%s AND year_=%s;'
        cur.execute(query, (title, year_)) 
        query_results = cur.fetchall()
        
        if len(query_results) == 0:
            try:
                cur.execute(f'INSERT INTO {db_table} VALUES (%s,%s,%s,%s,%s)',
                            (id_, title, year_, duration, director))
                conn.commit()
            except Exception as err:
                print(f'\nError: could not insert data into database: {err}')
        else:
            print(f'\nError on insertion, {title}({year_}) already exists in the database.')   
    except Exception as err:
        print(f'\nError: test query did not work: {err}')
    finally:
        # Close only if the connection was actually opened
        if conn is not None:
            close_db('mysql', db, conn, cur)
    time.sleep(0.5)
    

# Database Visual
conn = mysql_conn(db)

table = pd.read_sql_query(f'SELECT * from {db_table}', conn)
print()
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 100)
print(table)

close_db('mysql', db, conn, cur)


Error on insertion, Leaving Las Vegas(1995) already exists in the database.

movies_db database connection closed.


Error on insertion, Othello(1995) already exists in the database.

movies_db database connection closed.


Error on insertion, Now and Then(1995) already exists in the database.

movies_db database connection closed.


   id_                               title  year_  duration             director
0    1                                Jaws   1975       124     Steven Spielberg
1    2                               Joker   2019       122        Todd Phillips
2    3                         Beetlejuice   1988        92           Tim Burton
3    4  Star Wars: Episode IV - A New Hope   1977       121         George Lucas
4    5                        Forrest Gump   1994       142      Robert Zemeckis
5   11                        Now and Then   1995       100  Lesli Linka Glatter
6   12                   Leaving Las Vegas   1995       111          Mike Figgis
7   13         
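
The duplicate check above runs a SELECT before every INSERT. Another option is to let the database enforce uniqueness itself. The sketch below uses SQLite from the standard library as a stand-in for MySQL so that it runs without a server: a `UNIQUE (title, year_)` constraint combined with `INSERT OR IGNORE` silently skips repeated rows. The table definition is an assumption modelled on the columns shown in this notebook; MySQL offers a comparable `INSERT IGNORE` statement.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
cur = conn.cursor()
cur.execute('''CREATE TABLE movies (
                   id_ INTEGER PRIMARY KEY,
                   title TEXT,
                   year_ INTEGER,
                   duration INTEGER,
                   director TEXT,
                   UNIQUE (title, year_))''')

rows = [(None, 'Jaws', 1975, 124, 'Steven Spielberg'),
        (None, 'Joker', 2019, 122, 'Todd Phillips'),
        (None, 'Jaws', 1975, 124, 'Steven Spielberg')]  # deliberate duplicate

inserted = 0
for row in rows:
    # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint
    cur.execute('INSERT OR IGNORE INTO movies VALUES (?,?,?,?,?)', row)
    inserted += cur.rowcount  # 1 when a row was added, 0 when it was skipped
conn.commit()

count = cur.execute('SELECT COUNT(*) FROM movies').fetchone()[0]
print(inserted, count)  # 2 2
conn.close()
```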

## 8. Adding Data From the Web Using Web Scraping
---

In [20]:
#--------------------------------------------------------------------------
# VII. Scrape data from web and append to db (using requests, bs4)
#--------------------------------------------------------------------------
# Beautiful Soup is a Python package specifically designed to allow
# for simple interaction with HTML web pages
import pprint as pp
import requests
import pandas as pd
import bs4

drudgeArticlesDic = {} 
drudgeArticlesDic['drudge_articles'] = []

res = requests.get('http://www.drudgereport.com/')
res.raise_for_status()
drudge = bs4.BeautifulSoup(res.text, 'html.parser')

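links = drudge.find_all("a")  # parse the anchor tags once instead of on every iteration

# Skip the first link, then look at the next 15
for i, link in enumerate(links[1:16]):
    article_id = 'drudge' + str(i+1)  # avoids shadowing the built-in id()
    url = link.get('href')
    title = link.getText().strip('\n')
    # Ignore anchors without an href and links back to the site itself
    if url is None or "drudgereport.com" in url:
        continue
        
    drudgeArticlesDic['drudge_articles'].append({'id' : article_id,
                                                 'title'  : title,
                                                 'url'    : url})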

    
# raw data (used to ensure data is pulled)
#pp.pprint(drudgeArticlesDic)

pd.DataFrame.from_dict(drudgeArticlesDic['drudge_articles']).head()

        id                                              title                                                url
0  drudge1      Layoffs Just Starting, and Forecasts Bleak...  https://dnyuz.com/2020/03/17/layoffs-are-just-...
1  drudge2   Majority with virus walking around undetected...  https://nypost.com/2020/03/17/86-of-people-wit...
2  drudge3  Is there right to anonymity for carriers in Am...  https://thehill.com/opinion/civil-rights/48817...
3  drudge4                    18 MONTHS of social distancing?  https://www.axios.com/coronavirus-report-us-uk...
4  drudge5         Stranded travelers struggle to get home...  https://apnews.com/15b99013e674de9ea83a707342e...
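
The cell above stops at a DataFrame. To actually store the scraped rows, one approach is `executemany` with named placeholders, which reads values straight from each dictionary. This is a minimal sketch using SQLite from the standard library; the `articles` table and its columns are hypothetical, and the two sample rows are copied from the truncated output above.

```python
import sqlite3

# Sample rows copied from the output above (titles and URLs are truncated there)
articles = [
    {'id': 'drudge1',
     'title': 'Layoffs Just Starting, and Forecasts Bleak...',
     'url': 'https://dnyuz.com/2020/03/17/layoffs-are-just-...'},
    {'id': 'drudge4',
     'title': '18 MONTHS of social distancing?',
     'url': 'https://www.axios.com/coronavirus-report-us-uk...'},
]

conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
cur = conn.cursor()
cur.execute('CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, url TEXT)')

# Named placeholders (:id, :title, :url) are filled from each dict's keys
cur.executemany('INSERT INTO articles VALUES (:id, :title, :url)', articles)
conn.commit()

stored = cur.execute('SELECT id, title FROM articles ORDER BY id').fetchall()
print(stored)
conn.close()
```

In the notebook itself, `drudgeArticlesDic['drudge_articles']` could be passed in place of the sample list, since it has the same list-of-dictionaries shape.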


In [2]:
import pandas as pd

df = pd.read_csv('data/movie_list_small.csv', delimiter=r'\s,', engine='python')


titles = df.iloc[:,0].values.tolist()  # first column holds the movie titles
print(titles)


# TODO: sets are faster than lists for membership checks and do not allow
# duplicates; try that when getting data into the db

# TODO: try MySQL's bulk file-loading syntax (e.g. LOAD DATA INFILE) for
# loading an entire CSV into the db at once

['Leaving Las Vegas', 'Othello', 'Now and Then']


[('Leaving Las Vegas',), ('Othello',), ('Now and Then',)]
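
The first TODO above can be sketched as follows: keep the `(title, year_)` pairs already stored in a set, so each membership test is a hash lookup rather than a scan, and repeated titles in the input are dropped along the way. The `existing` and `incoming` values are illustrative, built from the titles and years printed earlier in this notebook.

```python
# Keys assumed to be in the database already (illustrative values)
existing = {('Leaving Las Vegas', 1995), ('Othello', 1995)}

# Candidate rows, including a repeat within the batch
incoming = [('Othello', 1995), ('Now and Then', 1995), ('Now and Then', 1995)]

to_insert = []
for key in incoming:
    if key not in existing:      # average O(1) lookup in a set
        existing.add(key)        # remember it so later repeats are skipped
        to_insert.append(key)

print(to_insert)  # [('Now and Then', 1995)]
```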