In [1]:
import pandas as pd
import psycopg2

def execute_query(sql_query, dbname='temp', user='postgres', password='postgres', port='5432'):
    """Run sql_query against the PostgreSQL database and return the result as a DataFrame."""
    # Create a connection to the PostgreSQL database
    conn = psycopg2.connect(dbname=dbname, user=user, password=password, port=port)
    try:
        # Use read_sql to execute the query and load the results into a DataFrame
        return pd.read_sql(sql_query, conn)
    finally:
        # Close the database connection, even if the query fails
        conn.close()


# Explore table sizes

- `stackoverflow`: questions asked on Stack Overflow with certain tags
- `company`: information on companies related to tags in stackoverflow
- `tag_company`: links stackoverflow to company
- `tag_type`: type categories applied to tags in stackoverflow
- `fortune500`: information on top US companies

Count the number of rows in a table with

```
SELECT count(*)
  FROM tablename;
```

Count the number of columns in a table by selecting a few rows and manually counting the columns in the result.
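
Counting columns by hand gets tedious for wide tables. The sketch below shows one way to read both counts programmatically. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, and the `demo` table and its contents are made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the table and its rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER, title TEXT, tag TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "b", "y")],
)

# Row count: a single aggregate over the whole table.
row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# Column count: select one row and inspect the cursor metadata
# instead of counting the columns by eye.
cursor = conn.execute("SELECT * FROM demo LIMIT 1")
column_names = [col[0] for col in cursor.description]
column_count = len(column_names)

print(row_count, column_count, column_names)
conn.close()
```

With the `execute_query` helper from this notebook, the DataFrame's `shape` attribute gives the same information: `shape[1]` is the number of columns in the result.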

In [5]:

query_result = execute_query('SELECT COUNT(*) FROM stackoverflow')
query_result

Unnamed: 0,count
0,45238


# Count missing values

Which column of `fortune500` has the most missing values? To find out, you'll need to check each column individually, although here we'll check just three.

In [6]:

query_result = execute_query("""
-- Select the count of the number of rows
SELECT COUNT(*)
  FROM fortune500;                             
""")
query_result

Unnamed: 0,count
0,500


In [7]:

query_result = execute_query("""
-- Select the count of ticker, 
-- subtract from the total number of rows, 
-- and alias as missing
SELECT COUNT(*) - COUNT(ticker) AS missing
  FROM fortune500;                           
""")
query_result

Unnamed: 0,missing
0,32


In [8]:

query_result = execute_query("""
-- Select the count of profits_change, 
-- subtract from total number of rows, and alias as missing
SELECT COUNT(*) - COUNT(profits_change) AS missing
FROM fortune500                   
""")
query_result

Unnamed: 0,missing
0,63


In [9]:

query_result = execute_query("""
-- Select the count of industry, 
-- subtract from total number of rows, and alias as missing

SELECT COUNT(*) - COUNT(industry) AS missing
FROM fortune500           
""")
query_result

Unnamed: 0,missing
0,13
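
The three checks above repeat the same pattern, `COUNT(*) - COUNT(column)`, so they can be driven from a list of column names. The following is a minimal sketch of that idea, again using in-memory SQLite as a stand-in for PostgreSQL. The `demo` table and its values are invented and do not reproduce the `fortune500` numbers.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the data below is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (ticker TEXT, industry TEXT, profits_change REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?, ?)",
    [("AAA", "Retail", 1.5), (None, "Energy", None), ("CCC", None, None)],
)

columns = ["ticker", "industry", "profits_change"]
missing = {}
for col in columns:
    # COUNT(*) counts every row; COUNT(col) skips rows where col is NULL.
    # The column names come from a fixed list, not user input, so
    # formatting them into the SQL string is acceptable here.
    sql = f"SELECT COUNT(*) - COUNT({col}) FROM demo"
    missing[col] = conn.execute(sql).fetchone()[0]

conn.close()
print(missing)
```

The column with the largest value in `missing` is the one with the most `NULL`s.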


# Join tables

Part of exploring a database is figuring out how tables relate to each other. The `company` and `fortune500` tables don't have a formal relationship between them in the database, but this doesn't prevent you from joining them.

To join the tables, you need to find a column that they have in common where the values are consistent across the tables. Remember: just because two tables have a column with the same name, it doesn't mean those columns necessarily contain compatible data. If you find more than one pair of columns with similar data, you may need to try joining with each in turn to see if you get the same number of results.
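
To illustrate why a shared column name is not enough, the sketch below joins two small made-up tables on each candidate column and compares the number of matches. Tables `a` and `b`, their values, and the helper `count_matches` are all hypothetical, and SQLite is used only so the example runs without a PostgreSQL server.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; tables a and b are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (name TEXT, ticker TEXT);
    CREATE TABLE b (name TEXT, ticker TEXT);
    INSERT INTO a VALUES ('Acme Incorporated', 'ACM'), ('Globex Corp.', 'GLX');
    INSERT INTO b VALUES ('Acme', 'ACM'), ('Globex', 'GLX'), ('Initech', 'INI');
""")

def count_matches(column):
    """Count the rows produced by an inner join of a and b on the given column."""
    sql = f"SELECT COUNT(*) FROM a INNER JOIN b ON a.{column} = b.{column}"
    return conn.execute(sql).fetchone()[0]

# Both tables have name and ticker columns, but only ticker is written
# the same way in both, so only the ticker join finds matches.
matches = {col: count_matches(col) for col in ("name", "ticker")}
print(matches)
conn.close()
```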

In [10]:

query_result = execute_query("""
SELECT company.name
-- Table(s) to select from
  FROM company
       INNER JOIN fortune500
       ON company.ticker=fortune500.ticker;        
""")
query_result

Unnamed: 0,name
0,Apple Incorporated
1,Amazon.com Inc
2,Alphabet
3,Microsoft Corp.
4,International Business Machines Corporation
5,PayPal Holdings Incorporated
6,"eBay, Inc."
7,Adobe Systems Incorporated


# Foreign keys

Recall that foreign keys reference another row in the database via a unique ID. Values in a foreign key column are restricted to values in the referenced column OR NULL.

Using what you know about foreign keys, why can't the `tag` column in the `tag_type` table be a foreign key that references the `tag` column in the `stackoverflow` table?

In [18]:
query_result1 = execute_query("""
SELECT COUNT(tag) 
  FROM stackoverflow
""")
query_result1

Unnamed: 0,count
0,45238


In [19]:

query_result2 = execute_query("""
SELECT COUNT(DISTINCT tag) 
  FROM stackoverflow
""")
query_result1 - query_result2

Unnamed: 0,count
0,45185


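- `stackoverflow.tag` contains duplicate values: of its 45,238 values, only 53 are distinct (45,238 - 45,185). A foreign key can only reference a column whose values are unique.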
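
The same uniqueness check can be sketched on a toy table. Here `questions` and its rows are made up, and SQLite stands in for PostgreSQL. Comparing `COUNT(tag)` with `COUNT(DISTINCT tag)` shows whether the column could be the target of a foreign key, and a `GROUP BY ... HAVING` query lists the offending values.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (tag TEXT)")
conn.executemany(
    "INSERT INTO questions VALUES (?)",
    [("python",), ("python",), ("sql",), ("sql",), ("sql",)],
)

# Total non-NULL values versus distinct values in the column.
total, distinct = conn.execute(
    "SELECT COUNT(tag), COUNT(DISTINCT tag) FROM questions"
).fetchone()

# List every value that appears more than once.
duplicated = conn.execute(
    """
    SELECT tag, COUNT(*)
      FROM questions
     GROUP BY tag
    HAVING COUNT(*) > 1
     ORDER BY tag
    """
).fetchall()

# A column can only be referenced by a foreign key if its values are unique.
can_be_referenced = total == distinct
print(total, distinct, duplicated, can_be_referenced)
conn.close()
```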

# Read an entity relationship diagram

The information you need is sometimes split across multiple tables in the database.

What is the most common stackoverflow tag_type? What companies have a tag of that type?

In [20]:

query_result = execute_query("""
-- Count the number of tags with each type
SELECT type, COUNT(tag) AS count
  FROM tag_type
 -- To get the count for each type, what do you need to do?
 GROUP BY type
 -- Order the results with the most common
 -- tag types listed first
 ORDER BY count DESC;    
""")
query_result

Unnamed: 0,type,count
0,cloud,31
1,database,6
2,payment,5
3,mobile-os,4
4,api,4
5,company,4
6,storage,2
7,os,2
8,spreadsheet,2
9,identity,1


In [21]:

query_result = execute_query("""
-- Select the 3 columns desired
SELECT company.name, tag_type.tag, tag_type.type
  FROM company
       -- Join to the tag_company table
       INNER JOIN tag_company 
       ON company.id = tag_company.company_id
       -- Join to the tag_type table
       INNER JOIN tag_type
       ON tag_company.tag = tag_type.tag
  -- Filter to most common type
  WHERE type='cloud'; 
""")
query_result

Unnamed: 0,name,tag,type
0,Amazon Web Services,amazon-cloudformation,cloud
1,Amazon Web Services,amazon-cloudfront,cloud
2,Amazon Web Services,amazon-cloudsearch,cloud
3,Amazon Web Services,amazon-cloudwatch,cloud
4,Amazon Web Services,amazon-cognito,cloud
5,Amazon Web Services,amazon-data-pipeline,cloud
6,Amazon Web Services,amazon-dynamodb,cloud
7,Amazon Web Services,amazon-ebs,cloud
8,Amazon Web Services,amazon-ec2,cloud
9,Amazon Web Services,amazon-ecs,cloud


# Coalesce

The `coalesce()` function can be useful for specifying a default or backup value when a column contains `NULL` values.
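
`COALESCE` returns its first argument that is not `NULL`. A minimal sketch of the fallback behavior, assuming a made-up `firms` table and using SQLite (which also implements `COALESCE`) in place of PostgreSQL:

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firms (industry TEXT, sector TEXT)")
conn.executemany(
    "INSERT INTO firms VALUES (?, ?)",
    [
        ("Airlines", "Transportation"),  # industry present: use it
        (None, "Energy"),                # industry missing: fall back to sector
        (None, None),                    # both missing: use the literal default
    ],
)

labels = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(industry, sector, 'Unknown') FROM firms ORDER BY rowid"
    )
]
print(labels)
conn.close()
```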

In [22]:

query_result = execute_query("""
-- Use coalesce
SELECT COALESCE(industry, sector, 'Unknown') AS industry2,
       -- Don't forget to count!
       COUNT(*) 
  FROM fortune500 
-- Group by what? (What are you counting by?)
 GROUP BY industry2
-- Order results to see most common first
 ORDER BY COUNT DESC
-- Limit results to get just the one value you want
 LIMIT 1;
""")
query_result

Unnamed: 0,industry2,count
0,Utilities: Gas and Electric,22


# Coalesce with a self-join

You previously joined the `company` and `fortune500` tables to find out which companies are in both tables. Now also include companies from `company` that are subsidiaries of Fortune 500 companies.

In [23]:

query_result = execute_query("""
SELECT  company_original.name, title, rank
  -- Start with original company information
  FROM company AS company_original
       -- Join to another copy of company with parent
       -- company information
       LEFT JOIN company AS company_parent
       ON company_original.parent_id = company_parent.id
       -- Join to fortune500, only keep rows that match
       INNER JOIN fortune500
       -- Use the original ticker if there is one,
       -- otherwise the parent's ticker
       ON COALESCE(company_original.ticker,
                   company_parent.ticker) =
             fortune500.ticker
 -- For clarity, order by rank
 ORDER BY rank; 
""")
query_result

Unnamed: 0,name,title,rank
0,Apple Incorporated,Apple,3
1,Amazon.com Inc,Amazon.com,12
2,Amazon Web Services,Amazon.com,12
3,Alphabet,Alphabet,27
4,Google LLC,Alphabet,27
5,Microsoft Corp.,Microsoft,28
6,International Business Machines Corporation,IBM,32
7,PayPal Holdings Incorporated,PayPal Holdings,264
8,"eBay, Inc.",eBay,310
9,Adobe Systems Incorporated,Adobe Systems,443
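
How the self-join and `COALESCE` pick up subsidiaries can be traced on a toy example. The sketch below assumes simplified, made-up `company` and `fortune` tables in which the subsidiary has no ticker of its own. It runs on SQLite rather than PostgreSQL.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER, name TEXT, ticker TEXT, parent_id INTEGER);
    CREATE TABLE fortune (ticker TEXT, title TEXT);
    INSERT INTO company VALUES
        (1, 'Parent Co', 'PAR', NULL),
        (2, 'Child Co',  NULL,  1),
        (3, 'Other Co',  'OTH', NULL);
    INSERT INTO fortune VALUES ('PAR', 'Parent');
""")

rows = conn.execute("""
    SELECT c.name, f.title
      FROM company AS c
           -- Attach each company's parent row, if it has one
           LEFT JOIN company AS p
           ON c.parent_id = p.id
           -- Match on the company's own ticker, else the parent's ticker
           INNER JOIN fortune AS f
           ON COALESCE(c.ticker, p.ticker) = f.ticker
     ORDER BY c.id
""").fetchall()

print(rows)
conn.close()
```

`Child Co` has no ticker, so it matches through its parent's ticker, while `Other Co` has a ticker that is not in `fortune` and is dropped by the inner join.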


# Effects of casting

When you cast data from one type to another, information can be lost or changed. See how the casting changes values and practice casting data using the `CAST()` function and the `::` syntax.
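
Casting a decimal value to an integer does not behave the same way everywhere. Assuming `profits_change` is a `numeric` column, PostgreSQL rounds to the nearest integer with ties going away from zero, whereas Python's `int()` truncates toward zero. The sketch below emulates the PostgreSQL behavior with the `decimal` module. The helper name `round_like_numeric_cast` is made up for this illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_like_numeric_cast(value):
    """Emulate a numeric-to-integer cast: round to nearest, ties away from zero."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

values = ["-7.2", "-14.4", "-51.5", "53.0"]

# Rounding, as a numeric-to-integer cast is assumed to do.
rounded = [round_like_numeric_cast(v) for v in values]

# Truncation, as Python's int() does on a float.
truncated = [int(float(v)) for v in values]

print(rounded)
print(truncated)
```

The two lists differ only for `-51.5`, which rounds to `-52` but truncates to `-51`.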

In [24]:

query_result = execute_query("""
-- Select the original value
SELECT profits_change, 
       -- Cast profits_change
       CAST(profits_change AS INTEGER) AS profits_change_int
  FROM fortune500;
""")
query_result

Unnamed: 0,profits_change,profits_change_int
0,-7.2,-7.0
1,0.0,0.0
2,-14.4,-14.0
3,-51.5,-52.0
4,53.0,53.0
...,...,...
495,4.2,4.0
496,5.2,5.0
497,,
498,,


In [25]:

query_result = execute_query("""
-- Divide 10 by 3
SELECT 10/3, 
       -- Cast 10 as numeric and divide by 3
       10::numeric/3;
""")
query_result

Unnamed: 0,?column?,?column?.1
0,3,3.333333
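
The difference between the two columns comes from the operand types: dividing two integers discards the fractional part, while dividing a `numeric` keeps it. A rough Python analogue is sketched below. `pg_integer_divide` is a hypothetical helper that truncates toward zero the way PostgreSQL integer division does, which differs from Python's `//` for negative operands.

```python
from decimal import Decimal

def pg_integer_divide(a, b):
    """Integer division that truncates toward zero (Python's // floors instead)."""
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient

int_result = pg_integer_divide(10, 3)        # fractional part discarded
negative_result = pg_integer_divide(-10, 3)  # truncates toward zero
floor_result = -10 // 3                      # Python floors toward negative infinity

# Decimal plays the role of numeric: the fractional part is kept.
numeric_result = Decimal(10) / Decimal(3)

print(int_result, negative_result, floor_result, numeric_result)
```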


In [26]:

query_result = execute_query("""
SELECT '3.2'::numeric,
       '-123'::numeric,
       '1e3'::numeric,
       '1e-3'::numeric,
       '02314'::numeric,
       '0002'::numeric;
""")
query_result

Unnamed: 0,numeric,numeric.1,numeric.2,numeric.3,numeric.4,numeric.5
0,3.2,-123.0,1000.0,0.001,2314.0,2.0


# Summarize the distribution of numeric values

Was 2017 a good or bad year for revenue of Fortune 500 companies? Examine how revenue changed from 2016 to 2017 by first looking at the distribution of `revenues_change` and then counting companies whose revenue increased.
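
The two steps, binning values to integers and counting the positive ones, can be tried on a handful of made-up numbers first. The sketch below uses SQLite in place of PostgreSQL. Because SQLite's `CAST` truncates rather than rounds, it applies `ROUND()` before casting to approximate the PostgreSQL `::integer` behavior. The `changes` table and its values are invented.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (revenues_change REAL)")
conn.executemany(
    "INSERT INTO changes VALUES (?)",
    [(-51.4,), (-50.9,), (0.0,), (2.2,), (2.4,), (None,), (7.6,)],
)

# Bin to whole numbers so that nearby values are counted together.
bins = conn.execute("""
    SELECT CAST(ROUND(revenues_change) AS INTEGER) AS bin, COUNT(*)
      FROM changes
     WHERE revenues_change IS NOT NULL
     GROUP BY bin
     ORDER BY bin
""").fetchall()

# Count increases; rows with NULL never satisfy the comparison.
increased = conn.execute(
    "SELECT COUNT(*) FROM changes WHERE revenues_change > 0"
).fetchone()[0]

print(bins, increased)
conn.close()
```

Binning collapses `-51.4` and `-50.9` into a single group, which makes the shape of the distribution easier to read than one row per distinct value.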

In [27]:

query_result = execute_query("""
-- Select the count of each value of revenues_change
SELECT revenues_change, COUNT(*)
  FROM fortune500
GROUP BY revenues_change
 -- order by the values of revenues_change
 ORDER BY revenues_change;
""")
query_result

Unnamed: 0,revenues_change,count
0,-57.5,1
1,-53.3,1
2,-51.4,1
3,-50.9,1
4,-45.0,1
...,...,...
282,92.6,1
283,94.5,1
284,115.9,1
285,122.1,1


In [28]:

query_result = execute_query("""
-- Select the count of each revenues_change integer value
SELECT revenues_change::integer, COUNT(*)
  FROM fortune500
GROUP BY revenues_change::integer
 -- order by the values of revenues_change
 ORDER BY revenues_change::integer;
""")
query_result

Unnamed: 0,revenues_change,count
0,-58,1
1,-53,1
2,-51,2
3,-45,1
4,-42,1
...,...,...
78,93,1
79,94,1
80,116,1
81,122,1


In [29]:

query_result = execute_query("""
-- Count rows 
SELECT COUNT(*)
  FROM fortune500
 -- Where...
 WHERE revenues_change > 0;
""")
query_result

Unnamed: 0,count
0,298
