# HackerNews analysis with newspaper3k

- This notebook is a continuation of our Hacker News analysis.
- In the [previous installment](https://www.kaggle.com/michapaliski/hackernews-analysis-with-bigquery) we showed how to use BigQuery and basic SQL to retrieve HN stories and comments.
- In this tutorial we build on that and show how to **collect the articles** that HN stories link to.
- We will use a great Python library called [newspaper3k](https://github.com/codelucas/newspaper) for **scraping the articles and their metadata**.

### The first part shows how to:
- connect to BQ from the Kaggle kernel
- run basic SQL queries against the HN dataset

**Output: top 30 domains, i.e. the outlets publishing stories on online privacy that HN users found worth sharing**

### The second part focuses on:
- introducing the basic features of newspaper3k
- collecting the most popular HN stories related to online privacy issues

**Output: a collection of popular articles on online privacy and their metadata**

---

### We install newspaper3k with pip. For installation details see: https://newspaper.readthedocs.io/en/latest/.

In [None]:
%pip install newspaper3k

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd
from time import sleep
import newspaper
from newspaper import Article
from google.cloud import bigquery

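# word_tokenize and stopwords rely on NLTK data packages ('punkt' or, in recent
# NLTK releases, 'punkt_tab', plus 'stopwords'); if they are missing, fetch them
# with nltk.download(...)
import nltk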
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from nltk import FreqDist
from nltk import bigrams
from nltk.stem import SnowballStemmer
import itertools
import collections

# Basic features of newspaper3k

In [None]:
## Let's collect the most recent articles from Ars Technica

paper = newspaper.build('https://arstechnica.com/', memoize_articles=False)

In [None]:
len(paper.articles)

In [None]:
# Print the index and URL of the first 30 articles
for j, article in enumerate(paper.articles[:30]):
    print(j, article.url)

# Connect to BQ from the Kaggle kernel


In [None]:
# The client is needed for configuring API requests. Creating it without arguments uses Kaggle's public dataset BigQuery integration.
client = bigquery.Client()

In [None]:
# Construct a reference to the "hacker_news" dataset
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Basic SQL queries against the HN dataset

We start our analysis by investigating the top domains that HN users use as sources.

Steps:
1. Extract domains from the stories' URLs using a regular expression.
2. Exclude stories without URLs.
3. Include only stories published after '2018-01-01' that contain the word 'privacy' or 'Privacy' in their title or text.
4. `COUNT` the stories per domain, store the result in the column `c`, and keep the top 30 domains.
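
Steps 1, 2 and 4 can be mirrored locally in plain Python to see what the regular expression captures. This is a minimal sketch: the sample URLs are made up for illustration, and Python's `re` module stands in for BigQuery's `REGEXP_EXTRACT`, on the assumption that both treat this simple pattern the same way.

```python
import re
from collections import Counter

# Same pattern as in the SQL query: capture everything between '//' and the next '/'.
DOMAIN_RE = re.compile(r"//([^/]*)/?")

def extract_domain(url):
    """Return the host part of a URL, or None when the pattern does not match."""
    match = DOMAIN_RE.search(url)
    return match.group(1) if match else None

# Hypothetical sample URLs, not taken from the HN dataset.
sample_urls = [
    "https://www.example.com/privacy/story-1",
    "https://www.example.com/story-2",
    "http://blog.example.org",
    "",  # a story without a URL
]

# Skip empty URLs (step 2), then count stories per domain (step 4).
domain_counts = Counter(extract_domain(u) for u in sample_urls if u)
top_domains = domain_counts.most_common(30)
print(top_domains)
```

Note that the pattern keeps subdomains such as `www.`, so `www.example.com` and `example.com` would be counted as separate domains.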

In [None]:
# Let's create our first SQL query against the HN dataset.

query = """
    #standardSQL
    SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') domain, COUNT(*) c
    FROM `bigquery-public-data.hacker_news.full`
    WHERE url!='' AND (REGEXP_CONTAINS(text, r"(p|P)rivacy") OR REGEXP_CONTAINS(title, r"(P|p)rivacy")) AND timestamp > '2018-01-01' AND type='story' 
    GROUP BY domain ORDER BY c DESC LIMIT 30
"""

In [None]:
query

In [None]:
# For more details on using BQ see: https://www.kaggle.com/michapaliski/hackernews-analysis-with-bigquery

# Set up the query
query_job = client.query(query)
# API request - run the query, and return a pandas DataFrame
df = query_job.to_dataframe()

### Top 30 domains: outlets publishing stories on online privacy that HN users found worth sharing

In [None]:
df

### We build a similar SQL query, but this time we select all columns.

In [None]:
query = """
        SELECT *
        FROM `bigquery-public-data.hacker_news.full`
        WHERE (REGEXP_CONTAINS(text, r"(p|P)rivacy") OR REGEXP_CONTAINS(title, r"(P|p)rivacy")) AND timestamp > '2018-01-01' AND type='story'
        """
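
The `WHERE` clause in both queries keeps a story when its title or text contains `privacy` or `Privacy`. A minimal Python sketch of that filter follows; the sample rows are invented, `mentions_privacy` is a hypothetical helper, and `None` stands in for a SQL `NULL` field.

```python
import re

# Same pattern as in the query; equivalent to the shorter r"[Pp]rivacy".
PRIVACY_RE = re.compile(r"(p|P)rivacy")

def mentions_privacy(title, text):
    """True when the title or the text contains 'privacy' or 'Privacy'."""
    return any(
        field is not None and PRIVACY_RE.search(field) is not None
        for field in (title, text)
    )

# Invented sample rows: (title, text).
rows = [
    ("A new Privacy law", None),
    ("Show HN: my side project", "It respects your privacy."),
    ("PRIVACY MATTERS", None),   # all caps: not matched by this pattern
    ("Rust 2018 released", None),
]

matches = [title for title, text in rows if mentions_privacy(title, text)]
print(matches)
```

Because only the first letter may vary in case, all-caps titles are missed; a case-insensitive pattern such as `(?i)privacy` would be one way to include them.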

In [None]:
# Set up the query
query_job = client.query(query)
# API request - run the query, and return a pandas DataFrame
df = query_job.to_dataframe()

In [None]:
# Drop the stories without URLs and compare the shape before and after
print(df.shape)
df=df[~df['url'].isna()]
print(df.shape)

In [None]:
# Some users might refer to the same story. We get rid of duplicates.
df=df.drop_duplicates(subset=['url'])
# drop=True avoids keeping the old index as an extra column
df.reset_index(drop=True, inplace=True)
len(df)
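
`drop_duplicates` only removes rows whose URLs match exactly, so links that differ by a trailing slash, a fragment, or host capitalization survive as separate stories. The sketch below shows one possible normalization key built with the standard library only; `normalize_url` is a hypothetical helper that is not part of the notebook, and the sample URLs are invented.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Hypothetical helper: build a comparison key for near-identical URLs."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")          # ignore a trailing slash
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),              # host names are case-insensitive
        path,
        parts.query,                       # keep the query string as-is
        "",                                # drop the #fragment
    ))

urls = [
    "https://Example.com/story/",
    "https://example.com/story#comments",
    "https://example.com/other",
]

# Keep the first URL seen for each normalized key, preserving order.
seen = {}
for url in urls:
    seen.setdefault(normalize_url(url), url)
unique_urls = list(seen.values())
print(unique_urls)
```

In the notebook such a key could be stored in a helper column, for example `df['url_key'] = df['url'].map(normalize_url)`, and then passed to `drop_duplicates(subset=['url_key'])`.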

In [None]:
# An example story
df['url'][0]

In [None]:
# We point newspaper to the first story's url
article = Article(df['url'][0])
# Next, we download the source code
article.download()
# we parse the html
article.parse()

In [None]:
# now we can access different elements of the article like:

In [None]:
# publication date
article.publish_date

In [None]:
# or article's body
article.text[:1000]

In [None]:
# We can also use some cool NLP features provided by newspaper3k (nlp() needs the NLTK 'punkt' tokenizer data)
article.nlp()

In [None]:
# e.g. we can retrieve the keywords
article.keywords

In [None]:
# or use newspaper3k for creating an automated summary of the article for us
article.summary

# The most popular HN stories related to online privacy issues

### Now that you are familiar with the basic capabilities of the newspaper3k library, let's use it for a bulk request. Below we try to collect the top 1,000 stories.

In [None]:
# Sort the stories by score and keep the URLs of the 1,000 highest-scoring ones
df=df.sort_values('score',ascending=False)[:1000]
li=df['url'].tolist()

In [None]:
# # Collect articles and their metadata (authors, titles and publication dates).
# # The loop is left commented out; the next cell loads a previously saved CSV instead.

# date=[]
# auths=[]
# titles=[]
# text=[]

# for no, l in enumerate(li):
#     article = Article(l)
#     try:
#         article.download()
#         article.parse()
#         date.append(article.publish_date)
#         auths.append(article.authors)
#         titles.append(article.title)
#         text.append(article.text)
#     except:
#         date.append(np.nan)
#         auths.append(np.nan)
#         titles.append(np.nan)
#         text.append(np.nan)
#     if no%100==0:
#         print(no)
#     sleep(.5)

# res={
#     'title': titles,
#     'link':li,
#     'date':date,
#     'authors':auths,
#     'text':text
#     }
# df=pd.DataFrame(res)

# df.to_csv('./hn_newspaper.csv')
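
The commented-out loop above can also be written as a small function that records one row per URL, including failures, and pauses between requests. This is a sketch under assumptions: `collect_articles` and the `fetch` callable are hypothetical names. In the notebook `fetch` would wrap `Article(url)`, `download()` and `parse()`, while here a fake fetcher is used so the example runs offline.

```python
from time import sleep

def collect_articles(urls, fetch, delay=0.5, log_every=100):
    """Call fetch(url) for every URL and return one dict per URL.

    fetch is assumed to return a dict with 'title', 'date', 'authors'
    and 'text'; any exception is recorded as a row of None values.
    """
    fields = ("title", "date", "authors", "text")
    rows = []
    for no, url in enumerate(urls):
        try:
            data = fetch(url)
            row = {field: data.get(field) for field in fields}
        except Exception:
            # Keep the row so the result stays aligned with the URL list.
            row = {field: None for field in fields}
        row["link"] = url
        rows.append(row)
        if no % log_every == 0:
            print(no)  # simple progress indicator
        sleep(delay)   # pause between requests
    return rows

def fake_fetch(url):
    """Stand-in for a newspaper3k download; fails for one URL on purpose."""
    if "broken" in url:
        raise ValueError("download failed")
    return {"title": "Title of " + url, "date": None, "authors": [], "text": "body"}

rows = collect_articles(
    ["https://example.com/a", "https://example.com/broken"],
    fetch=fake_fetch,
    delay=0,
)
print(rows)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`.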

In [None]:
df=pd.read_csv('../input/hn-newspaper/hn_newspaper(1).csv')

In [None]:
df.head()

In [None]:
# Share of stories for which the article text was retrieved
retrieved_pct = df['text'].notna().mean() * 100
print(f"We have retrieved texts of: {retrieved_pct:.1f}% of the stories \n Not bad!")

In [None]:
df

# Data cleaning

In [None]:
text=df['text'][0]

In [None]:
text

In [None]:
text=word_tokenize(text)

In [None]:
text[:10]

In [None]:
text=[i.lower() for i in text]

In [None]:
stopword_list=stopwords.words('english')

In [None]:
stopword_list[:10]

In [None]:
text=[i for i in text if i not in stopword_list]

In [None]:
st = SnowballStemmer('english')

In [None]:
st.stem('exciting')

In [None]:
text=[st.stem(i) for i in text]

In [None]:
text[:10]

In [None]:
df=df.dropna(subset=['text'])

In [None]:
# Tokenize every article
df['text_token']=df['text'].apply(lambda x: word_tokenize(x))
# Lowercase all tokens
df['text_token']=df['text_token'].apply(lambda row: [i.lower() for i in row])
# Remove stopwords and single-character tokens
df['text_token']=df['text_token'].apply(lambda row: [i for i in row if i not in stopword_list and len(i)>1] )
# Stem the remaining tokens
df['text_token_st']=df['text_token'].apply(lambda row: [st.stem(i) for i in row ] )
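
The four cleaning steps (tokenize, lowercase, remove stopwords, stem) can be bundled into one function. The sketch below runs without any NLTK data: the regex tokenizer, the tiny stopword set and the suffix-stripping `toy_stem` are simplified stand-ins, assumed here only for illustration, for `word_tokenize`, `stopwords.words('english')` and `SnowballStemmer`.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def toy_stem(token):
    """Very rough stand-in for a stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def clean_text(text, stopword_set=STOPWORDS, stem=toy_stem):
    """Tokenize, lowercase, drop stopwords and 1-character tokens, then stem."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopword_set and len(t) > 1]
    return [stem(t) for t in tokens]

cleaned = clean_text("The regulators are tracking the apps and trackers.")
print(cleaned)
```

With NLTK available, `clean_text(x, set(stopword_list), st.stem)` would reuse the notebook's stopword list and stemmer while keeping the regex tokenizer.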

In [None]:
df.head()

In [None]:
# Sum the per-article token counts into one corpus-wide Counter
all_frequencies = Counter()
for i, row in df.iterrows():
    counts=Counter(row['text_token_st'])
    all_frequencies.update(counts)

In [None]:
all_frequencies.most_common()[:10]

In [None]:
# Equivalent approach: flatten all tokens into one list, then count them
temp=[]
for i, row in df.iterrows():
    for j in row['text_token_st']:
        temp.append(j)
Counter(temp).most_common()[:10]
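
Both counting cells produce the same corpus-wide frequencies. A more compact variant feeds all token lists into a single `Counter` with `itertools.chain`, and the same idea extends to adjacent word pairs. This sketch uses invented token lists in place of `df['text_token_st']`, and builds the pairs with `zip` rather than with the imported `nltk.bigrams`.

```python
from collections import Counter
from itertools import chain

# Invented stand-in for the df['text_token_st'] column.
token_lists = [
    ["data", "privaci", "law"],
    ["privaci", "law", "europ"],
    ["data", "breach"],
]

# Unigram counts over the whole corpus.
word_counts = Counter(chain.from_iterable(token_lists))

# Bigram counts: pair each token with its successor within one document.
bigram_counts = Counter(
    pair for tokens in token_lists for pair in zip(tokens, tokens[1:])
)

print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

In the notebook, `token_lists` would be replaced by `df['text_token_st']`.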