<a href="https://colab.research.google.com/github/christophermalone/DSCI325/blob/main/Module3_Part1_SQL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3 | Part 1 | SQL: SELECT()

Structured Query Language (SQL) is the most widely used language for working with relational databases.  A relational database is a system that allows many tables (datasets) to be interconnected by design.

Wiki for SQL: https://en.wikipedia.org/wiki/SQL 

A relational database includes a <strong>collection of tables</strong>, in contrast to a set of individual, unconnected datasets.  It contains substantial structure that permits the tables to be connected.  Well-designed relational databases offer considerable efficiency advantages when working with structured data.
<p align='center'><img src="https://drive.google.com/uc?export=view&id=146--_Xvpo7QNxPHCuyDdKJiyNWrNDRjM"></p>

Consider the following schematic of a possible database for the data being considered in this module. In this relational database, the employment information for each year is stored in its own table.


*   a <strong>primary key</strong> is an essential component, as it <i>connects</i> the information across the various tables
*   notice that the world region is not repeated in each of the year-to-year tables, but is instead placed in its own table
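
To make the role of a key concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names (`Countries`, `Rates_Year1`, `CountryID`, `Q1`) are hypothetical and are not taken from the schematic; only the two sample rows reuse values that appear in the data for this module. The world region is stored once, and the shared key is what allows it to be looked up from the yearly table.

```python
import sqlite3

# Hypothetical miniature schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Countries (
        CountryID   INTEGER PRIMARY KEY,
        Country     TEXT,
        WorldRegion TEXT
    );
    CREATE TABLE Rates_Year1 (
        CountryID INTEGER REFERENCES Countries(CountryID),
        Q1        REAL
    );
    INSERT INTO Countries VALUES (1, 'Australia', 'Oceania'), (2, 'Austria', 'Europe');
    INSERT INTO Rates_Year1 VALUES (1, 73.7), (2, 72.8);
""")

# The shared key lets us look up the region without storing it in the yearly table.
joined_rows = conn.execute("""
    SELECT c.Country, c.WorldRegion, r.Q1
    FROM Rates_Year1 AS r
    JOIN Countries AS c ON c.CountryID = r.CountryID
    ORDER BY c.Country
""").fetchall()
conn.close()

print(joined_rows)
```

Combining tables in this way is only previewed here; the rest of this part works with a single table.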





<p align='center'><img src="https://drive.google.com/uc?export=view&id=12c6K8an_4MslGqnSBnyYZrNxUWCigxA9"></p>

Typically, a server hosts a number of different databases.  Thus, the first step in working with a database is to make a <strong>connection</strong> to the database.




<p align='center'><img src="https://drive.google.com/uc?export=view&id=17KbuvfpVmNSrob2NJmTZILAeKXSd8TDz" width='50%' height='50%'></p>

## Making a Connection

Here, the sqlite3 package will be used to connect to the desired database.

In [7]:
import pandas as pd
import sqlite3

The following code can be used to establish a connection and then close it.  For security purposes, you want to limit the amount of time a connection is left open.

In [None]:
#Making a connection using sqlite3
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

# SQL CODE WILL GO HERE

#Closing the connection
connect_db.close()
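
One way to make sure the connection is always closed, even if the SQL code raises an error, is to wrap it in `contextlib.closing` from the Python standard library. This is a sketch of an alternative pattern, not the approach used in the rest of this module; `":memory:"` is a stand-in for the path to the `.db` file.

```python
import sqlite3
from contextlib import closing

# ":memory:" is a stand-in for the path to the .db file used in this module.
with closing(sqlite3.connect(":memory:")) as connect_db:
    # SQL CODE WILL GO HERE
    sqlite_version = connect_db.execute("SELECT sqlite_version()").fetchone()[0]

# Leaving the with-block called connect_db.close(), so further use raises an error.
try:
    connect_db.execute("SELECT 1")
    connection_is_closed = False
except sqlite3.ProgrammingError:
    connection_is_closed = True

print(connection_is_closed)
```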

## Understanding the Structure

The OECD_EmploymentRates database has the following structure.


*   1 table called EmploymentRates
*   14 fields - Country, WorldRegion, and three years' worth of employment rates, one field for each quarter





<p align='center'><img src="https://drive.google.com/uc?export=view&id=181KI7qN_RC08LJzeT03bTqHl7BGx-E7o"></p>

The following code lists the <strong>tables</strong> in your database.

In [9]:
#Making a connection
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

#SQL Statement
df = pd.read_sql_query(
                        "SELECT name FROM sqlite_master WHERE type='table'"
                          , connect_db)
#Closing the connection
connect_db.close()

#Using pandas to show output
df.head(20)

Unnamed: 0,name
0,EmploymentRates


The following code returns information about the fields in a particular table in your database.

In [11]:
#Making a connection
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

#SQL Statement
df = pd.read_sql_query(
                        "PRAGMA table_info(EmploymentRates)"
                          , connect_db)
#Closing the connection
connect_db.close()

#Using pandas to show output
df.head(15)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,Country,,0,,0
1,1,WorldRegion,,0,,0
2,2,Q1_Year1,,0,,0
3,3,Q2_Year1,,0,,0
4,4,Q3_Year1,,0,,0
5,5,Q4_Year1,,0,,0
6,6,Q1_Year2,,0,,0
7,7,Q2_Year2,,0,,0
8,8,Q3_Year2,,0,,0
9,9,Q4_Year2,,0,,0
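
To help read the columns of this output, the sketch below runs the same `PRAGMA table_info` statement against a hypothetical in-memory table named `Demo`, which, unlike EmploymentRates, declares column types, a NOT NULL constraint, and a primary key. The table definition is an assumption made purely for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with declared types and a primary key, for comparison.
conn.execute("CREATE TABLE Demo (Country TEXT PRIMARY KEY, Q1_Year1 REAL NOT NULL)")

# Each row is (cid, name, type, notnull, dflt_value, pk).
table_info = conn.execute("PRAGMA table_info(Demo)").fetchall()
conn.close()

for column in table_info:
    print(column)
```

Here the `type` column is filled in, `notnull` is 1 for the constrained field, and `pk` is 1 for the primary key field. In the EmploymentRates output above, the empty `type` values and the zeros under `pk` indicate that no types or primary key were declared for that table.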


## SELECT() in SQL


The following code is used to SELECT <strong>all</strong> columns in a particular table.

In [13]:
#Making a connection
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

#SQL Statement
df = pd.read_sql_query(
                          "SELECT * from EmploymentRates"
                          , connect_db)
#Closing the connection
connect_db.close()

#Using pandas to show output
df.head()

Unnamed: 0,Country,WorldRegion,Q1_Year1,Q2_Year1,Q3_Year1,Q4_Year1,Q1_Year2,Q2_Year2,Q3_Year2,Q4_Year2,Q1_Year3,Q2_Year3,Q3_Year3,Q4_Year3
0,Australia,Oceania,73.7,73.6,73.7,74.0,74.1,74.2,74.4,74.4,74.6,70.5,72.0,73.5
1,Austria,Europe,72.8,73.1,73.1,73.1,73.5,73.6,73.5,73.6,73.2,71.4,72.7,
2,Belgium,Europe,64.1,63.9,64.8,65.1,64.7,65.7,65.6,65.1,65.3,64.5,64.9,
3,Canada,North America,73.4,73.4,73.6,73.8,73.9,74.2,74.2,74.1,73.0,64.7,70.2,72.0
4,Chile,South/Central America,64.3,64.3,63.7,64.0,63.8,64.2,64.4,63.9,63.3,51.2,52.6,56.3
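
Often only a few columns are needed. As a sketch, the `*` can be replaced by a comma-separated list of column names. The example below builds a small in-memory stand-in for EmploymentRates (three columns and two rows copied from the output above) so that it runs without the module's `.db` file; it reads the result with the sqlite3 cursor directly, although the same SQL string could be passed to `pd.read_sql_query`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Small stand-in table using three columns and two rows from the output above.
conn.execute("CREATE TABLE EmploymentRates (Country, WorldRegion, Q1_Year1)")
conn.executemany(
    "INSERT INTO EmploymentRates VALUES (?, ?, ?)",
    [("Australia", "Oceania", 73.7), ("Austria", "Europe", 72.8)],
)

# List the wanted column names, separated by commas, instead of using *.
cursor = conn.execute("SELECT Country, Q1_Year1 FROM EmploymentRates")
column_names = [d[0] for d in cursor.description]
selected_rows = cursor.fetchall()
conn.close()

print(column_names)
print(selected_rows)
```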


The following can be used to rename a column with AS.  The renaming only affects the <strong>output</strong>, not the structure of the table within the database.

In [15]:
#Making a connection
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

#SQL Statement
df = pd.read_sql_query(
                         "SELECT WorldRegion AS WorldRegion2 from EmploymentRates"
                          , connect_db)
#Closing a connection
connect_db.close()

#Using pandas to show output
df.head()

Unnamed: 0,WorldRegion2
0,Oceania
1,Europe
2,Europe
3,North America
4,South/Central America
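
A quick way to confirm that an alias leaves the stored table untouched is to compare the column names of the query result with those reported by `PRAGMA table_info`. The sketch below does this on a small in-memory stand-in for EmploymentRates (an assumption, so that it runs without the module's `.db` file).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Small stand-in table; one row copied from the output above.
conn.execute("CREATE TABLE EmploymentRates (Country, WorldRegion)")
conn.execute("INSERT INTO EmploymentRates VALUES ('Australia', 'Oceania')")

# Column names as they appear in the query output.
cursor = conn.execute("SELECT WorldRegion AS WorldRegion2 FROM EmploymentRates")
output_columns = [d[0] for d in cursor.description]

# Column names as they are stored in the table itself (second field of each row).
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(EmploymentRates)")]
conn.close()

print(output_columns)  # the alias
print(table_columns)   # the original names, unchanged
```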


The following can be used to summarize a particular column in SQL, here by computing its average with Avg().

In [16]:
#Making a connection
connect_db = sqlite3.connect("/content/sample_data/OECD_EmploymentRates.db")

#SQL Statement
df = pd.read_sql_query(
                         "SELECT Avg(Q1_Year1) AS Q1_Avg from EmploymentRates"
                          , connect_db)
#Closing the connection
connect_db.close()

#Using pandas to show output
df.head()

Unnamed: 0,Q1_Avg
0,69.217949
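
Other aggregate functions such as COUNT, MIN, and MAX follow the same pattern, and several can be requested in one SELECT. The sketch below uses the five Q4_Year3 values shown in the earlier output, loaded into an in-memory stand-in table. It assumes the two blank entries are stored as NULL, which may not match how the actual database stores them; under that assumption, SQLite's aggregates skip those entries, while COUNT(*) still counts every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmploymentRates (Country, Q4_Year3)")
# Values from the output above; None becomes NULL (an assumption about the blanks).
conn.executemany(
    "INSERT INTO EmploymentRates VALUES (?, ?)",
    [("Australia", 73.5), ("Austria", None), ("Belgium", None),
     ("Canada", 72.0), ("Chile", 56.3)],
)

n_rows, n_values, q4_min, q4_max, q4_avg = conn.execute("""
    SELECT COUNT(*)        AS n_rows,
           COUNT(Q4_Year3) AS n_values,
           MIN(Q4_Year3)   AS Q4_Min,
           MAX(Q4_Year3)   AS Q4_Max,
           AVG(Q4_Year3)   AS Q4_Avg
    FROM EmploymentRates
""").fetchone()
conn.close()

print(n_rows, n_values, q4_min, q4_max, round(q4_avg, 2))
```

The average here is taken over the three non-missing values only, so it is worth checking how many values contributed before interpreting a summary.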
