### GroupBy, Having & Count

#### **COUNT()**

![title](./img/count_1.png)

#### **GROUP BY**

![title](./img/groupby_1.png)

#### **GROUP BY ... HAVING**

![title](./img/having_1.png)
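The three ideas above can be tried locally before touching BigQuery. The following is a minimal sketch, assuming a small made-up `comments` table held in an in-memory SQLite database (the rows and the names `ann`, `bob`, `cy` are invented for illustration and are not part of the Hacker News dataset). SQLite is only a stand-in here; `COUNT`, `GROUP BY`, and `HAVING` behave the same way for this simple case.

```python
import sqlite3

# Toy stand-in for a comments table; every row here is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, 100, "ann"),
        (2, 100, "bob"),
        (3, 100, "ann"),
        (4, 200, "cy"),
        (5, 200, "ann"),
        (6, 300, "bob"),
    ],
)

# COUNT(): a single number for the whole table
total = conn.execute("SELECT COUNT(id) FROM comments").fetchone()[0]

# GROUP BY: one count per distinct value of parent
per_parent = conn.execute(
    "SELECT parent, COUNT(id) AS NumPosts "
    "FROM comments GROUP BY parent ORDER BY parent"
).fetchall()

# HAVING: keep only the groups whose count satisfies the condition
busy_parents = conn.execute(
    "SELECT parent, COUNT(id) AS NumPosts "
    "FROM comments GROUP BY parent HAVING COUNT(id) > 1 ORDER BY parent"
).fetchall()

print(total)
print(per_parent)
print(busy_parents)
```

`HAVING` is evaluated after the groups are formed, which is why it can refer to `COUNT(id)` while a `WHERE` clause cannot.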

#### Example: Which Hacker News comments generated the most discussion?

Ready to see an example on a real dataset? The Hacker News dataset contains information on stories and comments from the Hacker News social networking site.

We'll work with the `comments` table and begin by printing its first few rows.

In [1]:
# Point GOOGLE_APPLICATION_CREDENTIALS at the local copy of the
# Google Cloud service account's JSON key file
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'key.json'

In [2]:
from google.cloud import bigquery

# Create a 'Client' object
client = bigquery.Client()

# Construct a reference to the 'hacker_news' dataset
dataset_ref = client.dataset('hacker_news', project='bigquery-public-data')

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the 'comments' table
table_ref = dataset_ref.table('comments')

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the 'comments' table
client.list_rows(table, max_results=5).to_dataframe()

| | id | by | author | time | time_ts | text | parent | deleted | dead | ranking |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2701393 | 5l | 5l | 1309184881 | 2011-06-27 14:28:01+00:00 | And the glazier who fixed all the broken windo... | 2701243 | | | 0 |
| 1 | 5811403 | 99 | 99 | 1370234048 | 2013-06-03 04:34:08+00:00 | Does canada have the equivalent of H1B/Green c... | 5804452 | | | 0 |
| 2 | 21623 | AF | AF | 1178992400 | 2007-05-12 17:53:20+00:00 | Speaking of Rails, there are other options in ... | 21611 | | | 0 |
| 3 | 10159727 | EA | EA | 1441206574 | 2015-09-02 15:09:34+00:00 | Humans and large livestock (and maybe even pet... | 10159396 | | | 0 |
| 4 | 2988424 | Iv | Iv | 1315853580 | 2011-09-12 18:53:00+00:00 | I must say I reacted in the same way when I re... | 2988179 | | | 0 |


In [3]:
# Query to select comments that received more than 10 replies.
# The table is addressed by its full name: project.dataset.table
query_popular = """
                SELECT parent, COUNT(1) AS NumPosts
                FROM `bigquery-public-data.hacker_news.comments`
                GROUP BY parent
                HAVING COUNT(1) > 10
                """

In [4]:
query_job = client.query(query_popular)

# API request - run the query, and convert the results to a pandas DataFrame
popular_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
popular_comments.head()

| | parent | NumPosts |
|---|---|---|
| 0 | 6427895 | 46 |
| 1 | 202918 | 44 |
| 2 | 8120079 | 148 |
| 3 | 9016949 | 38 |
| 4 | 7075537 | 51 |


![title](./img/grpby_1.png)

![title](./img/grpby2.png)

![title](./img/grpby_3.png)

### Exercises

**1] Prolific Commenters**

Write a query that returns every author with more than 10,000 posts, together with that author's post count. Name the post-count column `NumPosts`.

In [5]:
query = """
        SELECT author, COUNT(id) AS NumPosts
        FROM `bigquery-public-data.hacker_news.comments`
        GROUP BY author
        HAVING COUNT(id) > 10000
        """

In [6]:
query_job = client.query(query)

# API request - run the query, and convert the results to a pandas DataFrame
prolific_comments = query_job.to_dataframe()

# Print the first five rows of the DataFrame
prolific_comments.head()

| | author | NumPosts |
|---|---|---|
| 0 | eru | 10448 |
| 1 | rbanffy | 10557 |
| 2 | DanBC | 12902 |
| 3 | sp332 | 10882 |
| 4 | davidw | 10764 |
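The example earlier used `COUNT(1)` while this exercise uses `COUNT(id)`. The two agree whenever the counted column has no missing values, but `COUNT(column)` skips rows where that column is NULL. The sketch below illustrates the difference on a made-up table in an in-memory SQLite database (the rows are invented for illustration; SQLite is used only as a local stand-in, not as a claim about the contents of the real table).

```python
import sqlite3

# Toy table in which two of the four rows have no author (NULL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(1, "ann"), (2, None), (3, "bob"), (4, None)],
)

# COUNT(1) counts every row; COUNT(author) counts only rows where author is not NULL
count_rows, count_authors = conn.execute(
    "SELECT COUNT(1), COUNT(author) FROM comments"
).fetchone()

print(count_rows, count_authors)
```

If the goal is "number of rows per group", `COUNT(1)` is the safer choice; `COUNT(column)` is the one to reach for when missing values should be left out.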


**2] Deleted Comments**

How many comments have been deleted? (If a comment was deleted, the `deleted` column in the `comments` table has the value `True`.)

In [8]:
# Query to determine how many posts were deleted
deleted_posts_query = """
                      SELECT COUNT(1) AS num_deleted_posts
                      FROM `bigquery-public-data.hacker_news.comments`
                      WHERE deleted = True
                      """

# Set up the query
query_job = client.query(deleted_posts_query)

# API request - run the query, and return a pandas DataFrame
deleted_posts = query_job.to_dataframe()

# View results
print(deleted_posts)

   num_deleted_posts
0             227736
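The query above filters rows with `WHERE` before counting. An alternative that shows deleted and non-deleted totals side by side is to group on the flag itself. The following is a minimal sketch on a made-up table in an in-memory SQLite database; it assumes, as the preview of the real table suggests, that comments which were not deleted carry an empty (NULL) value rather than `False`. SQLite has no separate boolean type, so `1` stands in for `True` here.

```python
import sqlite3

# Toy table: deleted is 1 for deleted comments and NULL otherwise (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, deleted INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(1, 1), (2, None), (3, None), (4, 1), (5, None)],
)

# WHERE filters individual rows before any counting happens
num_deleted = conn.execute(
    "SELECT COUNT(1) FROM comments WHERE deleted = 1"
).fetchone()[0]

# GROUP BY on the flag returns one count per distinct value, NULL included
counts_by_flag = dict(
    conn.execute("SELECT deleted, COUNT(1) FROM comments GROUP BY deleted").fetchall()
)

print(num_deleted)
print(counts_by_flag)
```

Note that `WHERE deleted = 1` never matches the NULL rows, so those rows only show up in the grouped version, under the key `None`.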
