### JOINs & UNIONs

![title](./img/join_1.png)
![title](./img/join_2.png)
![title](./img/join_3.png)
![title](./img/join_4.png)

### EXAMPLE
We'll work with the Hacker News dataset. We begin by reviewing the first several rows of the comments table. 

In [3]:
# Point GOOGLE_APPLICATION_CREDENTIALS at a local copy of the
# Google Cloud service account's JSON key file
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'

In [4]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "comments" table
table_ref = dataset_ref.table("comments")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,by,author,time,time_ts,text,parent,deleted,dead,ranking
0,2701393,5l,5l,1309184881,2011-06-27 14:28:01+00:00,And the glazier who fixed all the broken windo...,2701243,,,0
1,5811403,99,99,1370234048,2013-06-03 04:34:08+00:00,Does canada have the equivalent of H1B/Green c...,5804452,,,0
2,21623,AF,AF,1178992400,2007-05-12 17:53:20+00:00,"Speaking of Rails, there are other options in ...",21611,,,0
3,10159727,EA,EA,1441206574,2015-09-02 15:09:34+00:00,Humans and large livestock (and maybe even pet...,10159396,,,0
4,2988424,Iv,Iv,1315853580,2011-09-12 18:53:00+00:00,I must say I reacted in the same way when I re...,2988179,,,0


**Stories** Table

In [11]:
# Construct a reference to the "stories" table
table_ref = dataset_ref.table("stories")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,by,score,time,time_ts,title,url,text,deleted,dead,descendants,author
0,6940813,sarath237,0,1387536270,2013-12-20 10:44:30+00:00,Sheryl Brindo Hot Pics,http://www.youtube.com/watch?v=ym1cyxneB0Y,Sheryl Brindo Hot Pics,,True,,sarath237
1,6991401,123123321321,0,1388508751,2013-12-31 16:52:31+00:00,Are you people also put off by the culture of ...,,They&#x27;re pretty explicitly &#x27;startup f...,,True,,123123321321
2,1531556,ssn,0,1279617234,2010-07-20 09:13:54+00:00,New UI for Google Image Search,http://googlesystem.blogspot.com/2010/07/googl...,Again following on Bing's lead.,,,0.0,ssn
3,5012398,hoju,0,1357387877,2013-01-05 12:11:17+00:00,Historic website screenshots,http://webscraping.com/blog/Generate-website-s...,Python script to generate historic screenshots...,,,0.0,hoju
4,7214182,kogir,0,1401561740,2014-05-31 18:42:20+00:00,Placeholder,,Mind the gap.,,,0.0,kogir


The query below pulls information from the stories and comments tables to create a table showing all stories posted on January 1, 2012, along with the corresponding number of comments. We use a LEFT JOIN so that the results include stories that didn't receive any comments.

In [10]:
# Query to select all stories posted on January 1, 2012, with number of comments
join_query = """
             WITH c AS
             (
             SELECT parent, COUNT(*) as num_comments
             FROM `bigquery-public-data.hacker_news.comments` 
             GROUP BY parent
             )
             SELECT s.id as story_id, s.by, s.title, c.num_comments
             FROM `bigquery-public-data.hacker_news.stories` AS s
             LEFT JOIN c
             ON s.id = c.parent
             WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
             ORDER BY c.num_comments DESC
             """

# Run the query, and return a pandas DataFrame
join_result = client.query(join_query).result().to_dataframe()
join_result.head()

Unnamed: 0,story_id,by,title,num_comments
0,3412900,whoishiring,Ask HN: Who is Hiring? (January 2012),154.0
1,3412901,whoishiring,Ask HN: Freelancer? Seeking freelancer? (Janua...,97.0
2,3412643,jemeshsu,Avoid Apress,30.0
3,3414012,ramanujam,Impress.js - a Prezi like implementation using...,27.0
4,3412891,Brajeshwar,"There's no shame in code that is simply ""good ...",27.0


Since the results are sorted by the num_comments column in descending order, stories without comments appear at the end of the DataFrame, with a missing num_comments value (NaN, which stands for "not a number").

In [11]:
# None of these stories received any comments
join_result.tail()

Unnamed: 0,story_id,by,title,num_comments
439,3412710,mmichael0070,Stoner Quotes,
440,3412846,jaaminul69,The year 2011 is fully for Kate Middeton.,
441,3413113,see_cloudtweaks,Infographic: Value of Cloud Computing Services...,
442,3412921,pgalih,Resep Cake Irit Telur,
443,3413296,cjstewart88,"Dec 2011, A Pretty Amazing Month for Tubalr… W...",
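
The behaviour of the LEFT JOIN above can be reproduced on a toy dataset. The sketch below uses Python's built-in sqlite3 module rather than BigQuery, and the two small tables are made-up stand-ins for stories and comments. It shows that a story with no matching comments gets a null count, and that COALESCE can turn that null into 0 if a number is preferred.

```python
import sqlite3

# Toy stand-ins for the stories and comments tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER, title TEXT);
    CREATE TABLE comments (id INTEGER, parent INTEGER);
    INSERT INTO stories VALUES (1, 'Popular story'), (2, 'Quiet story');
    INSERT INTO comments VALUES (10, 1), (11, 1);
""")

left_join_sql = """
    WITH c AS (
        SELECT parent, COUNT(*) AS num_comments
        FROM comments
        GROUP BY parent
    )
    SELECT s.id,
           c.num_comments,
           COALESCE(c.num_comments, 0) AS num_comments_filled
    FROM stories AS s
    LEFT JOIN c ON s.id = c.parent
    ORDER BY s.id
"""

# Story 2 has no comments, so the joined num_comments is NULL (None in Python).
left_join_rows = conn.execute(left_join_sql).fetchall()
conn.close()

print(left_join_rows)  # [(1, 2, 2), (2, None, 0)]
```

With an INNER JOIN in place of the LEFT JOIN, the row for story 2 would be absent from the result altogether.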


Next, we write a query to select all usernames corresponding to users who wrote stories or comments on January 1, 2014. We use UNION DISTINCT (instead of UNION ALL) to ensure that each user appears in the table at most once.

In [13]:
# Query to select all users who posted stories or comments on January 1, 2014
union_query = """
              SELECT c.by
              FROM `bigquery-public-data.hacker_news.comments` AS c
              WHERE EXTRACT(DATE FROM c.time_ts) = '2014-01-01'
              UNION DISTINCT
              SELECT s.by
              FROM `bigquery-public-data.hacker_news.stories` AS s
              WHERE EXTRACT(DATE FROM s.time_ts) = '2014-01-01'
              """

# Run the query, and return a pandas DataFrame
union_result = client.query(union_query).result().to_dataframe()
union_result.head()

Unnamed: 0,by
0,pkulak
1,drakaal
2,quinnchr
3,simoncion
4,JackMorgan


To get the number of users who posted on January 1, 2014, we need only take the length of the DataFrame.

In [14]:
# Number of users who posted stories or comments on January 1, 2014
len(union_result)

2282
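
As a small illustration of the difference between UNION ALL and UNION DISTINCT, here is a sketch on made-up data using Python's built-in sqlite3 module. Note one assumption about dialects: SQLite spells the de-duplicating form as plain `UNION`, whereas BigQuery requires the explicit `UNION DISTINCT`. The last query also shows that the count can be computed in SQL instead of taking the length of a DataFrame.

```python
import sqlite3

# Hypothetical author lists; 'alice' comments twice, 'bob' appears in both tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (author TEXT);
    CREATE TABLE stories (author TEXT);
    INSERT INTO comments VALUES ('alice'), ('bob'), ('alice');
    INSERT INTO stories VALUES ('bob'), ('carol');
""")

# UNION ALL keeps every row, duplicates included.
union_all_rows = conn.execute(
    "SELECT author FROM comments UNION ALL SELECT author FROM stories"
).fetchall()

# UNION (SQLite's spelling of UNION DISTINCT) keeps each author once.
union_distinct_rows = conn.execute(
    "SELECT author FROM comments UNION SELECT author FROM stories"
).fetchall()

# Counting the distinct authors directly in SQL.
num_users = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT author FROM comments
        UNION
        SELECT author FROM stories
    )
""").fetchone()[0]
conn.close()

print(len(union_all_rows), len(union_distinct_rows), num_users)  # 5 3 3
```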

### EXERCISE

The code cell below fetches the posts_questions table from the stackoverflow dataset. We also preview the first five rows of the table.

In [5]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "posts_questions" table
table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,65070674,NewRelic APM cpu usage shows incorrect values ...,<p>Here goes charts of CPU usage of same pod. ...,,0,0,NaT,2020-11-30 09:07:52.263000+00:00,,2020-11-30 09:07:52.263000+00:00,NaT,,,,1448545,,1,0,newrelic|kubernetes-pod,1
1,65070689,NewRelic APM cpu usage shows incorrect values ...,<p>Here goes charts of CPU usage of same pod. ...,,0,0,NaT,2020-11-30 09:09:16.030000+00:00,,2020-11-30 09:09:16.030000+00:00,NaT,,,,1448545,,1,0,newrelic|kubernetes-pod,1
2,64972916,Gitlab : How to batch modify the visibility of...,<p>I had to change the visibility of many proj...,,1,0,NaT,2020-11-23 17:05:43.130000+00:00,,2020-11-23 17:05:43.130000+00:00,NaT,,,,1416845,,1,0,gitlab|batch-processing,2
3,65007914,Upload internal Android application on WmWare(...,<p>I am trying to upload Android debug build o...,,0,0,NaT,2020-11-25 15:37:33.790000+00:00,,2020-11-25 15:37:33.790000+00:00,NaT,,,,8659884,,1,0,android,2
4,65023562,swig: how to add context manager methods to op...,"<p>when swigging opaque handles (in my case, p...",,1,0,NaT,2020-11-26 14:03:29.567000+00:00,,2020-11-26 14:03:29.567000+00:00,NaT,,,,1368566,,1,0,python|swig,2


In [6]:
# Construct a reference to the "posts_answers" table
table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the table
client.list_rows(table, max_results=5).to_dataframe()

Unnamed: 0,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,64548693,,<p>My workaround without ejecting:</p>\n<ol>\n...,,,0,NaT,2020-10-27 05:20:53.417000+00:00,,2020-10-27 05:20:53.417000+00:00,NaT,,,,9428719,55821078,2,0,,
1,64548694,,<p>The execution thread may be secluded on dif...,,,1,NaT,2020-10-27 05:21:15.397000+00:00,,2020-10-27 05:21:15.397000+00:00,NaT,,,,13927193,33876455,2,0,,
2,64548698,,"<p><code>vw</code> is well supported, so can b...",,,0,NaT,2020-10-27 05:22:37.873000+00:00,,2020-10-27 05:22:37.873000+00:00,NaT,,,,8942566,64548101,2,0,,
3,64548705,,<p>This could be simple. Please check that you...,,,0,NaT,2020-10-27 05:23:43.193000+00:00,,2020-10-27 05:23:43.193000+00:00,NaT,,,,13832463,64276190,2,0,,
4,64548730,,<p>Install the flutter plugin on android studi...,,,3,NaT,2020-10-27 05:26:53.640000+00:00,,2020-10-27 05:26:53.640000+00:00,NaT,,,,11211493,64443398,2,0,,


**1) How long does it take for questions to receive answers?**

You want to explore how long it generally takes for questions to receive answers, and you plan to use that knowledge to improve the order in which questions are presented to Stack Overflow users.

With this goal in mind, you write the query below, which focuses on questions asked in January 2018. It returns a table with two columns:

    q_id - the ID of the question
    time_to_answer - how long it took (in seconds) for the question to receive an answer


In [4]:
first_query = """
              SELECT q.id AS q_id,
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  INNER JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
              ON q.id = a.parent_id
              WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
              GROUP BY q_id
              ORDER BY time_to_answer
              """

first_result = client.query(first_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(first_result["time_to_answer"].notnull()) / len(first_result) * 100))
print("Number of questions:", len(first_result))
first_result.head()

Percentage of answered questions: 100.0%
Number of questions: 134577


Unnamed: 0,q_id,time_to_answer
0,48375126,0
1,48396661,0
2,48100614,0
3,48219063,0
4,48280201,0


In [6]:
# Your code here
correct_query = """
              SELECT q.id AS q_id,
                  MIN(TIMESTAMP_DIFF(a.creation_date, q.creation_date, SECOND)) as time_to_answer
              FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
              ON q.id = a.parent_id
              WHERE q.creation_date >= '2018-01-01' and q.creation_date < '2018-02-01'
              GROUP BY q_id
              ORDER BY time_to_answer
                """

# Run the query, and return a pandas DataFrame
correct_result = client.query(correct_query).result().to_dataframe()
print("Percentage of answered questions: %s%%" % \
      (sum(correct_result["time_to_answer"].notnull()) / len(correct_result) * 100))
print("Number of questions:", len(correct_result))

Percentage of answered questions: 82.89006873783538%
Number of questions: 162356
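
The drop from 100% to roughly 83% between the two cells above comes from the join type alone: an INNER JOIN only returns questions that have at least one matching answer, so every row it produces is "answered" by construction. A minimal sketch of the same effect, using Python's built-in sqlite3 module and made-up tables (the column names `asked_at` and `answered_at` are hypothetical, with times stored as plain seconds):

```python
import sqlite3

# Four toy questions; question 4 never receives an answer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER, asked_at INTEGER);
    CREATE TABLE answers (id INTEGER, parent_id INTEGER, answered_at INTEGER);
    INSERT INTO questions VALUES (1, 0), (2, 0), (3, 0), (4, 0);
    INSERT INTO answers VALUES (10, 1, 30), (11, 1, 90), (12, 2, 60), (13, 3, 45);
""")


def time_to_answer(join_type):
    """Return (question id, seconds to first answer) using the given join type."""
    sql = f"""
        SELECT q.id, MIN(a.answered_at - q.asked_at) AS time_to_answer
        FROM questions AS q
        {join_type} JOIN answers AS a ON q.id = a.parent_id
        GROUP BY q.id
        ORDER BY q.id
    """
    return conn.execute(sql).fetchall()


def pct_answered(rows):
    """Percentage of rows whose time_to_answer is not NULL."""
    return 100 * sum(t is not None for _, t in rows) / len(rows)


inner_rows = time_to_answer("INNER")
left_rows = time_to_answer("LEFT")
inner_pct = pct_answered(inner_rows)
left_pct = pct_answered(left_rows)
conn.close()

print(len(inner_rows), inner_pct)  # 3 100.0
print(len(left_rows), left_pct)    # 4 75.0
```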


**2) Initial questions and answers, Part 1**

You're interested in understanding the initial experiences that users typically have with the Stack Overflow website. Is it more common for users to first ask questions or provide answers? After signing up, how long does it take for users to first interact with the website? To explore this further, you draft the (partial) query in the code cell below.

The query returns a table with three columns:

    owner_user_id - the user ID
    q_creation_date - the first time the user asked a question
    a_creation_date - the first time the user contributed an answer

You want to keep track of users who have asked questions but have yet to provide answers. Your table should also include users who have answered questions but have yet to pose their own.

With this in mind, please fill in the appropriate JOIN (i.e., INNER, LEFT, RIGHT, or FULL) to return the correct information.

Note: You need only fill in the appropriate JOIN. All other parts of the query should be left as-is. (You also don't need to write any additional code to run the query, since the check() method will take care of this for you.)

To avoid returning too much data, we'll restrict our attention to questions and answers posed in January 2019. We'll amend the timeframe in Part 2 of this question to be more realistic!

In [8]:
# Your code here
q_and_a_query = """
                SELECT q.owner_user_id AS owner_user_id,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                    FULL JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                ON q.owner_user_id = a.owner_user_id 
                WHERE q.creation_date >= '2019-01-01' AND q.creation_date < '2019-02-01' 
                    AND a.creation_date >= '2019-01-01' AND a.creation_date < '2019-02-01'
                GROUP BY owner_user_id
                """

# Run the query, and return a pandas DataFrame
q_and_a_result = client.query(q_and_a_query).result().to_dataframe()  

q_and_a_result              

Unnamed: 0,owner_user_id,q_creation_date,a_creation_date
0,7415247,2019-01-15 10:53:07.370000+00:00,2019-01-29 09:10:32.643000+00:00
1,8380616,2019-01-30 16:57:52.703000+00:00,2019-01-30 16:59:24.783000+00:00
2,1157814,2019-01-19 10:33:39.277000+00:00,2019-01-22 16:11:24.680000+00:00
3,10912594,2019-01-15 07:37:21.147000+00:00,2019-01-15 12:14:51.683000+00:00
4,1266650,2019-01-30 18:44:27.493000+00:00,2019-01-19 20:27:44.990000+00:00
...,...,...,...
21712,267,2019-01-14 07:51:31.843000+00:00,2019-01-02 13:20:47.030000+00:00
21713,5552776,2019-01-16 11:07:29.533000+00:00,2019-01-16 11:47:17.133000+00:00
21714,7936642,2019-01-09 13:01:47.330000+00:00,2019-01-11 10:55:01.717000+00:00
21715,10712972,2019-01-09 12:09:53.810000+00:00,2019-01-15 08:38:03.943000+00:00
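
A side observation on the query above, which the exercise asks to leave as-is apart from the JOIN: its WHERE clause filters on columns from both tables, and a condition on a column that is null for an unmatched row removes that row after the join. The sketch below illustrates this with a LEFT JOIN on made-up data in Python's built-in sqlite3 module (the same reasoning would apply to each side of a FULL JOIN). Filtering each table in a subquery before joining is one possible alternative, shown here only as an assumption about what one might want, not as the exercise's intended answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (user_id INTEGER, created TEXT);
    CREATE TABLE answers (user_id INTEGER, created TEXT);
    -- user 1 asked and answered; user 2 only asked
    INSERT INTO questions VALUES (1, '2019-01-05'), (2, '2019-01-10');
    INSERT INTO answers VALUES (1, '2019-01-07');
""")

# Date filter applied after the join: a.created is NULL for user 2,
# so the comparison is not true and the row is dropped.
filter_after_join = conn.execute("""
    SELECT q.user_id, q.created, a.created
    FROM questions AS q
    LEFT JOIN answers AS a ON q.user_id = a.user_id
    WHERE q.created >= '2019-01-01' AND q.created < '2019-02-01'
      AND a.created >= '2019-01-01' AND a.created < '2019-02-01'
    ORDER BY q.user_id
""").fetchall()

# Date filter applied to each table before the join: user 2 survives
# with a NULL answer date.
filter_before_join = conn.execute("""
    SELECT q.user_id, q.created, a.created
    FROM (SELECT * FROM questions
          WHERE created >= '2019-01-01' AND created < '2019-02-01') AS q
    LEFT JOIN (SELECT * FROM answers
               WHERE created >= '2019-01-01' AND created < '2019-02-01') AS a
        ON q.user_id = a.user_id
    ORDER BY q.user_id
""").fetchall()
conn.close()

print(filter_after_join)   # [(1, '2019-01-05', '2019-01-07')]
print(filter_before_join)  # [(1, '2019-01-05', '2019-01-07'), (2, '2019-01-10', None)]
```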


**3) Initial questions and answers, Part 2**

![title](./img/join_5.png)

Write a query that returns the following columns:

    id - the IDs of all users who created Stack Overflow accounts in January 2019 (January 1, 2019, to January 31, 2019, inclusive)
    q_creation_date - the first time the user posted a question on the site; if the user has never posted a question, the value should be null
    a_creation_date - the first time the user posted an answer on the site; if the user has never posted an answer, the value should be null

Note that questions and answers posted after January 31, 2019, should still be included in the results. All users who joined the site in January 2019 should also be included, even if they have never posted a question or provided an answer.

The query from the previous question should be a good starting point for answering this question. You'll need to use the posts_answers and posts_questions tables, as well as the users table from the Stack Overflow dataset. The relevant columns from the users table are id (the ID of each user) and creation_date (when the user joined the Stack Overflow site, in DATETIME format).

In [9]:
# Your code here
three_tables_query = """
                SELECT u.id AS id,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.users` AS u
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                        ON u.id = a.owner_user_id
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
                        ON u.id = q.owner_user_id
                WHERE u.creation_date >= '2019-01-01' AND u.creation_date < '2019-02-01'
                GROUP BY id
                """

# Run the query, and return a pandas DataFrame
three_tables_result = client.query(three_tables_query).result().to_dataframe()  

three_tables_result 

Unnamed: 0,id,q_creation_date,a_creation_date
0,10982457,NaT,NaT
1,10993177,NaT,NaT
2,10860993,NaT,NaT
3,10964599,NaT,NaT
4,10952343,NaT,NaT
...,...,...,...
142155,10972773,NaT,NaT
142156,10890993,NaT,NaT
142157,10972309,2019-08-10 04:33:33.440000+00:00,NaT
142158,10937416,2020-09-23 12:09:14.613000+00:00,NaT
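
When two LEFT JOINs hang off the same users table, each user's answers are paired with each of that user's questions before the GROUP BY runs. MIN is unaffected by this repetition, which is why the query above still returns the right dates, but a plain COUNT(*) would be inflated. A small sketch on made-up data, using Python's built-in sqlite3 module (table and column names are simplified stand-ins, with times stored as small integers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER);
    CREATE TABLE questions (id INTEGER, owner INTEGER, created INTEGER);
    CREATE TABLE answers (id INTEGER, owner INTEGER, created INTEGER);
    -- user 1 has 2 questions and 3 answers; user 2 has neither
    INSERT INTO users VALUES (1), (2);
    INSERT INTO questions VALUES (100, 1, 5), (101, 1, 9);
    INSERT INTO answers VALUES (200, 1, 3), (201, 1, 7), (202, 1, 8);
""")

fan_out_rows = conn.execute("""
    SELECT u.id,
           COUNT(*) AS joined_rows,
           MIN(q.created) AS q_first,
           MIN(a.created) AS a_first,
           COUNT(DISTINCT q.id) AS num_questions,
           COUNT(DISTINCT a.id) AS num_answers
    FROM users AS u
    LEFT JOIN answers AS a ON u.id = a.owner
    LEFT JOIN questions AS q ON u.id = q.owner
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()
conn.close()

# User 1: 3 answers x 2 questions = 6 joined rows, yet the MIN values and
# the DISTINCT counts are still correct. User 2 keeps a single all-NULL row.
print(fan_out_rows)  # [(1, 6, 5, 3, 2, 3), (2, 1, None, None, 0, 0)]
```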


**4) How many distinct users posted on January 1, 2019?**

In the code cell below, write a query that returns a table with a single column:

* owner_user_id - the IDs of all users who posted at least one question or answer on January 1, 2019. Each user ID should appear at most once.

In the posts_questions (and posts_answers) tables, you can get the ID of the original poster from the owner_user_id column. Likewise, the date of the original posting can be found in the creation_date column.

In order for your answer to be marked correct, your query must use a UNION.

In [13]:
all_users_query = """
                  SELECT q.owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_questions` AS q
                  WHERE EXTRACT(DATE FROM q.creation_date) = '2019-01-01'
                  UNION DISTINCT
                  SELECT a.owner_user_id
                  FROM `bigquery-public-data.stackoverflow.posts_answers` AS a
                  WHERE EXTRACT(DATE FROM a.creation_date) = '2019-01-01'
                  """

# Run the query, and return a pandas DataFrame
all_users_result = client.query(all_users_query).result().to_dataframe()

all_users_result

Unnamed: 0,owner_user_id
0,1745073.0
1,6091102.0
2,10432674.0
3,6216358.0
4,10775599.0
...,...
4390,3266179.0
4391,1980359.0
4392,2225384.0
4393,4408508.0
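
The owner_user_id values above are displayed as floats, which is what pandas usually does when a numeric column contains missing values. Assuming that some posts have a null owner_user_id (an assumption, not something verified here), a de-duplicating union keeps a single null row, which would then be counted as one extra "user". The sketch below shows this on made-up data with Python's built-in sqlite3 module, where plain `UNION` plays the role of BigQuery's `UNION DISTINCT`, along with one way to exclude the nulls.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts_questions (owner_user_id INTEGER);
    CREATE TABLE posts_answers (owner_user_id INTEGER);
    INSERT INTO posts_questions VALUES (1), (NULL), (2);
    INSERT INTO posts_answers VALUES (2), (NULL), (3);
""")

# NULLs are treated as duplicates of each other, so exactly one NULL row remains.
with_nulls = conn.execute("""
    SELECT owner_user_id FROM posts_questions
    UNION
    SELECT owner_user_id FROM posts_answers
""").fetchall()

# Filtering out NULLs on both sides leaves only real user IDs.
without_nulls = conn.execute("""
    SELECT owner_user_id FROM posts_questions WHERE owner_user_id IS NOT NULL
    UNION
    SELECT owner_user_id FROM posts_answers WHERE owner_user_id IS NOT NULL
""").fetchall()
conn.close()

print(len(with_nulls), len(without_nulls))  # 4 3
```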
