<h1>Introduction to databases</h1>

Building pipelines to relational databases

Relational databases support more data, more simultaneous users, and stronger data quality controls than spreadsheets or flat files.

For this course we are going to use SQLite, in which each database is stored as a regular, self-contained file, just like a CSV or Excel file. This makes SQLite databases great for sharing data.

<h2>Connecting to databases using SQLAlchemy</h2>

The SQLAlchemy library has tools to work with many relational databases.

<h3>create_engine() function</h3>
It takes a string URL to a database and makes an engine that manages database connections.

SQLite URL format: sqlite:///filename.db

<h3>pd.read_sql() to query a database</h3>
After we have created the engine, we can use the pd.read_sql() function to load data from a database. This function needs two arguments:

1. query (the sql parameter): A SQL query, or a table name to load the whole table.
2. engine (the con parameter): A way to connect to the database. This is the engine that we created earlier.
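
To make the two kinds of first argument concrete, here is a minimal sketch. It assumes an in-memory SQLite database (the sqlite:// URL) and a made-up two-row weather table, neither of which is part of the course data, and it assumes recent, compatible versions of pandas and SQLAlchemy.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database (assumption: used here so no file is needed)
engine = create_engine("sqlite://")

# Hypothetical sample rows standing in for the real weather table
sample = pd.DataFrame({"date": ["2018-01-01", "2018-01-02"], "tmax": [19, 26]})
sample.to_sql("weather", engine, index=False)

# Passing a table name loads every row and column of that table
whole_table = pd.read_sql("weather", engine)

# Passing a SQL query loads only what the query returns
only_tmax = pd.read_sql("SELECT tmax FROM weather;", engine)

print(whole_table.shape)
print(only_tmax.columns.tolist())
```

Passing a table name is convenient for small tables. A query is usually the better choice when only some rows or columns are needed.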

<h4>Best practices for writing queries</h4>
Write SQL keywords in capital letters.
End each statement with a semicolon (;).

In [None]:
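import pandas as pd
# Import sqlalchemy's create_engine() and inspect() functions
from sqlalchemy import create_engine, inspect

# Create the database engine
engine = create_engine('sqlite:///data.db')

# View the tables in the database
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector replaces it)
print(inspect(engine).get_table_names())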

In [None]:
# Create a SQL query to load the entire weather table
query = """
SELECT * 
  FROM weather;
"""

# Load weather with the SQL query
weather = pd.read_sql(query, engine)

# View the first few rows of data
print(weather.head())

<h1>Refining imports with SQL queries</h1>

<h2>Filtering numbers</h2>

We can perform comparisons using mathematical operators and use the WHERE keyword to filter:
- = equal to
- > and >= for greater than (or equal to)
- < and <= for less than (or equal to)
- <> not equal to
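
As a quick illustration of these operators, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than SQLAlchemy and pandas so that it runs with no setup. The three weather rows and the dates_where() helper are made up for the example.

```python
import sqlite3

# In-memory database with a few hypothetical weather rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER, snow REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [("2018-01-01", 19, 0.0), ("2018-01-04", 29, 9.8), ("2018-01-12", 61, 0.0)],
)

def dates_where(condition):
    """Return the dates of rows matching a WHERE condition (illustrative helper).

    The condition is pasted into the query text, which is only acceptable
    here because every condition below is hard-coded.
    """
    query = f"SELECT date FROM weather WHERE {condition} ORDER BY date;"
    return [row[0] for row in conn.execute(query)]

freezing = dates_where("tmax <= 32")   # less than or equal to
exactly_61 = dates_where("tmax = 61")  # equal to
not_61 = dates_where("tmax <> 61")     # not equal to
above_29 = dates_where("tmax > 29")    # strictly greater than

print(freezing)    # ['2018-01-01', '2018-01-04']
print(above_29)    # ['2018-01-12']
```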

<h2>Using conditionals</h2>

We can combine conditions in our queries with AND and OR.
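
When AND and OR appear in the same WHERE clause, AND is evaluated before OR, so parentheses can change the result. The sketch below shows this with Python's built-in sqlite3 module and four made-up weather rows. It is an illustration, not part of the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER, snow REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("2018-01-01", 30, 0.0),
        ("2018-01-02", 30, 2.0),
        ("2018-01-03", 50, 2.0),
        ("2018-01-04", 50, 0.0),
    ],
)

# Without parentheses, AND binds first:
# tmax <= 32 OR (snow >= 1 AND date <> '2018-01-02')
query_without_parens = """
SELECT date
  FROM weather
 WHERE tmax <= 32
    OR snow >= 1
   AND date <> '2018-01-02'
 ORDER BY date;
"""

# With parentheses, the OR is evaluated first and the date filter
# applies to every row
query_with_parens = """
SELECT date
  FROM weather
 WHERE (tmax <= 32 OR snow >= 1)
   AND date <> '2018-01-02'
 ORDER BY date;
"""

without_parens = [row[0] for row in conn.execute(query_without_parens)]
with_parens = [row[0] for row in conn.execute(query_with_parens)]

print(without_parens)  # ['2018-01-01', '2018-01-02', '2018-01-03']
print(with_parens)     # ['2018-01-01', '2018-01-03']
```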

<h2>Best practices</h2>

As our queries get more complicated, we can write the query first and assign it to a variable.
We can write a query across multiple lines by wrapping it in triple quotes so it's easier to read.

In [None]:
# Write query to get date, tmax, and tmin from weather
query = """
SELECT date, 
       tmax, 
       tmin
  FROM weather;
"""

# Make a dataframe by passing query and engine to read_sql()
temperatures = pd.read_sql(query, engine)

# View the resulting dataframe
print(temperatures)

In [None]:
import matplotlib.pyplot as plt
# Create query to get hpd311calls records about safety
query = """
SELECT *
FROM hpd311calls
WHERE complaint_type = 'SAFETY';
"""

# Query the database and assign result to safety_calls
safety_calls = pd.read_sql(query, engine)

# Graph the number of safety calls by borough
call_counts = safety_calls.groupby('borough').unique_key.count()
call_counts.plot.barh()
plt.show()

In [None]:
# Create query for records with max temps <= 32 or snow >= 1
query = """
SELECT *
  FROM weather
  WHERE tmax <= 32
  OR snow >= 1;
"""

# Query database and assign result to wintry_days
wintry_days = pd.read_sql(query, engine)

# View summary stats about the temperatures
print(wintry_days.describe())

<h1>More complex SQL queries</h1>

We can get the unique values for one or more columns using the SELECT DISTINCT SQL keywords.

To remove duplicate records:

SELECT DISTINCT * FROM table_name;
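
Here is a minimal sketch of both uses of DISTINCT. It relies on Python's built-in sqlite3 module and a made-up calls table (a stand-in, not the real hpd311calls data) that contains one exact duplicate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("BRONX", "SAFETY"),
        ("BRONX", "SAFETY"),    # exact duplicate of the row above
        ("BRONX", "PLUMBING"),
        ("QUEENS", "SAFETY"),
    ],
)

# Unique values of a single column
boroughs = [
    row[0]
    for row in conn.execute("SELECT DISTINCT borough FROM calls ORDER BY borough;")
]

# Unique full records: the duplicated row is returned only once
unique_rows = conn.execute("SELECT DISTINCT * FROM calls;").fetchall()

print(boroughs)          # ['BRONX', 'QUEENS']
print(len(unique_rows))  # 3
```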

<h2>Aggregate functions</h2>

SUM, AVG, MAX and MIN each take a single column name in parentheses, for example MAX(tmax).

COUNT:

COUNT can take a single column name, in which case it counts the rows where that column is not NULL. COUNT(*) returns the number of rows that fit a query, and COUNT(DISTINCT column_name) returns the number of unique values in a column.
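
To see how the aggregate functions and the COUNT variants behave, here is a minimal sketch using Python's built-in sqlite3 module. The three weather rows are made up, and one of them deliberately has a missing (NULL) prcp value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (month TEXT, tmax INTEGER, prcp REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [
        ("January", 40, 0.5),
        ("January", 35, None),   # None is stored as NULL
        ("February", 45, 0.5),
    ],
)

query = """
SELECT COUNT(*),
       COUNT(prcp),
       COUNT(DISTINCT prcp),
       MAX(tmax),
       SUM(prcp)
  FROM weather;
"""

# A query made only of aggregates returns a single row
result = conn.execute(query).fetchone()

# COUNT(*) counts all 3 rows, COUNT(prcp) skips the NULL,
# and COUNT(DISTINCT prcp) sees only one unique value (0.5)
print(result)  # (3, 2, 1, 45, 1.0)
```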

<h2>Group By</h2>

GROUP BY summarizes data by categories.
It comes after the WHERE clause. In the SELECT clause, include the column we are grouping by along with an aggregate function.
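
The placement of GROUP BY after WHERE can be sketched as follows. The example uses Python's built-in sqlite3 module and a made-up calls table (a stand-in for hpd311calls). ORDER BY is added only to make the printed order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (borough TEXT, complaint_type TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("BRONX", "SAFETY"),
        ("BRONX", "SAFETY"),
        ("BRONX", "PLUMBING"),
        ("QUEENS", "SAFETY"),
        ("QUEENS", "PLUMBING"),
    ],
)

# WHERE filters the rows first, then GROUP BY summarizes what is left
query = """
SELECT borough,
       COUNT(*)
  FROM calls
 WHERE complaint_type = 'SAFETY'
 GROUP BY borough
 ORDER BY borough;
"""

safety_by_borough = conn.execute(query).fetchall()
print(safety_by_borough)  # [('BRONX', 2), ('QUEENS', 1)]
```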





In [None]:
# Create query for unique combinations of borough and complaint_type
query = """
SELECT DISTINCT borough, 
       complaint_type
  FROM hpd311calls;
"""

# Load results of query to a dataframe
issues_and_boros = pd.read_sql(query, engine)

# Check assumption about issues and boroughs
print(issues_and_boros)

In [None]:
# Create query to get call counts by complaint_type
query = """
SELECT complaint_type, 
     COUNT(*)
  FROM hpd311calls
  GROUP BY complaint_type;
"""

# Create dataframe of call counts by issue
calls_by_issue = pd.read_sql(query, engine)

# Graph the number of calls for each housing issue
calls_by_issue.plot.barh(x="complaint_type")
plt.show()

In [None]:
# Create query to get temperature and precipitation by month
query = """
SELECT month, 
        MAX(tmax) AS 'Max Temperature', 
        MIN(tmin) AS 'Min temperature',
        SUM(prcp) AS 'Total Precipitation'
  FROM weather 
 GROUP BY month;
"""

# Get dataframe of monthly weather stats
weather_by_month = pd.read_sql(query, engine)

# View weather stats by month
print(weather_by_month)

<hr>
<h1>Using joins</h1>

The JOIN keyword on its own performs an inner join: it only returns records whose key values appear in both tables.
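
Here is a minimal sketch of that inner join behavior, using Python's built-in sqlite3 module. The calls and weather tables are made up: one call has no matching weather record, and one weather record has no matching call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, prcp REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("2018-01-01", "WATER LEAK"),
        ("2018-01-02", "SAFETY"),
        ("2018-01-05", "WATER LEAK"),  # no weather record for this date
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [
        ("2018-01-01", 0.0),
        ("2018-01-02", 0.3),
        ("2018-01-03", 0.1),           # no call on this date
    ],
)

query = """
SELECT calls.created_date,
       calls.complaint_type,
       weather.prcp
  FROM calls
  JOIN weather
    ON calls.created_date = weather.date
 ORDER BY calls.created_date;
"""

joined = conn.execute(query).fetchall()

# Only the two dates present in both tables survive the join
print(joined)
```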

The key columns should hold the same data type in both tables, otherwise their values may fail to match.

We can combine joins with aggregate functions and conditionals.

SQL order of keywords:

1. SELECT
2. FROM
3. JOINS
4. WHERE
5. GROUP BY
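
The keyword order above can be sketched in a single query. The example uses Python's built-in sqlite3 module with made-up calls and weather tables. The result is sorted in Python because the query has no ORDER BY clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, prcp REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("2018-01-01", "WATER LEAK"),
        ("2018-01-01", "WATER LEAK"),
        ("2018-01-01", "SAFETY"),
        ("2018-01-02", "WATER LEAK"),
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 0.0), ("2018-01-02", 0.3)],
)

# SELECT, FROM, JOIN, WHERE, GROUP BY, in that order
query = """
SELECT calls.created_date,
       COUNT(*),
       weather.prcp
  FROM calls
  JOIN weather
    ON calls.created_date = weather.date
 WHERE calls.complaint_type = 'WATER LEAK'
 GROUP BY calls.created_date, weather.prcp;
"""

leaks_per_day = sorted(conn.execute(query).fetchall())
print(leaks_per_day)  # [('2018-01-01', 2, 0.0), ('2018-01-02', 1, 0.3)]
```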

If your tables are very big, you may decide to filter or aggregate the data before attempting a join.
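
One possible way to aggregate before joining is a subquery, sketched below with Python's built-in sqlite3 module and made-up tables. The daily alias and the n_calls column name are chosen for this illustration only. Whether this is faster than joining first depends on the database and the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (created_date TEXT, complaint_type TEXT)")
conn.execute("CREATE TABLE weather (date TEXT, tmax INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [
        ("2018-01-01", "HEAT/HOT WATER"),
        ("2018-01-01", "HEAT/HOT WATER"),
        ("2018-01-01", "SAFETY"),
        ("2018-01-02", "HEAT/HOT WATER"),
    ],
)
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2018-01-01", 19), ("2018-01-02", 26)],
)

# The subquery collapses calls to one row per date,
# so the join only has to match one row per day
query = """
SELECT daily.created_date,
       daily.n_calls,
       weather.tmax
  FROM (SELECT created_date,
               COUNT(*) AS n_calls
          FROM calls
         GROUP BY created_date) AS daily
  JOIN weather
    ON daily.created_date = weather.date;
"""

calls_with_tmax = sorted(conn.execute(query).fetchall())
print(calls_with_tmax)  # [('2018-01-01', 3, 19), ('2018-01-02', 1, 26)]
```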

In [None]:
# Query to join weather to call records by date columns
query = """
SELECT * 
  FROM hpd311calls
  JOIN weather 
  ON hpd311calls.created_date = weather.date;
"""

# Create dataframe of joined tables
calls_with_weather = pd.read_sql(query, engine)

# View the dataframe to make sure all columns were joined
print(calls_with_weather.head())

In [None]:
# Query to get water leak calls and daily precipitation
query = """
SELECT hpd311calls.*, weather.prcp
  FROM hpd311calls
  JOIN weather
    ON hpd311calls.created_date = weather.date
  WHERE hpd311calls.complaint_type = 'WATER LEAK';"""

# Load query results into the leak_calls dataframe
leak_calls = pd.read_sql(query, engine)

# View the dataframe
print(leak_calls.head())

In [None]:
# Modify query to join tmax and tmin from weather by date
query = """
SELECT hpd311calls.created_date, 
       COUNT(*),
       weather.tmax,
       weather.tmin
  FROM hpd311calls 
       JOIN weather
       ON hpd311calls.created_date = weather.date
 WHERE hpd311calls.complaint_type = 'HEAT/HOT WATER' 
 GROUP BY hpd311calls.created_date;
 """

# Query database and save results as df
df = pd.read_sql(query, engine)

# View first 5 records
print(df.head())