# Acquiring Article Data

In this notebook we'll go through the process of acquiring all the article data that will be used in this project. The steps outlined here document how the data was acquired, but new articles are published every day, which makes it difficult to fully reproduce the exact dataset used for exploration and model training. Nonetheless, the steps are outlined here for reference. For reproducibility, a .csv file containing the data used for analysis and modeling will be provided.

## Imports

These are all the modules that we'll need to run the code in this notebook.

In [1]:
# We'll need to add the modules directory path to sys.path since this is where all the .py files are located.

import sys
sys.path.append('../modules')

In [22]:
# The Medium API will be used to acquire data.
# Medium API Documentation: https://github.com/weeping-angel/medium-api

import pandas as pd
from medium_api import Medium

# We need requests and Beautiful Soup in order to scrape web data.
import requests
from bs4 import BeautifulSoup

# We need an API key to use the Medium API. This can be acquired from rapidapi.com.
# It is required to create an account and subscribe to the Medium API.
# https://rapidapi.com/nishujain199719-vgIfuFHZxVZ/api/medium2/pricing

from env import medium_api_key

## Acquire Article Data Using Medium API

In this section we'll cover some of the functionality provided by the Medium API. We'll discover how to use it and what can and can't be done.

In [6]:
# First we need to create the Medium object with our API key, this will allow us to make calls to the API.

medium = Medium(medium_api_key)

In [23]:
# Now let's try getting the latestposts for a given topic. The latestposts function only allows specifying a 
# topic.

latestposts = medium.latestposts(topic_slug = 'artificial-intelligence')

In [25]:
# Now let's look at the info for the first article in this list.

latestposts.articles[0].info

{'id': '14f1ac7ec113',
 'tags': ['smart-speaker',
  'smartspeakersmarketshare',
  'telegram',
  'telegram-bot',
  'voice-assistant'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:53:06',
 'published_at': '2022-07-04 20:53:06',
 'url': 'https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113',
 'image_url': 'https://miro.medium.com/1*TyS9LOsSoh-506qWQURRlw.jpeg',
 'lang': 'en',
 'publication_id': 'e9c948ff6ebd',
 'title': 'Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?',
 'word_count': 2285,
 'reading_time': 8.822641509434,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'An open letter to Pavel Durov, containing some change requests to enable voice integration into Telegram bots ecosystem',
 'author': '452c0445f9d5'}

In [11]:
# Now the second article.

latestposts.articles[1].info

{'id': 'fe99c68643a6',
 'tags': ['metaverse', 'machine-learning', 'strategy', 'marketing', 'law'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:05:43',
 'published_at': '2022-07-04 19:58:46',
 'url': 'https://medium.com/@MetaverseLaw/metaverse-law-trends-via-machine-learning-fe99c68643a6',
 'image_url': 'https://miro.medium.com/1*rta8P5wpi5-75cERsoJa2g.png',
 'lang': 'en',
 'publication_id': '*Self-Published*',
 'title': 'Metaverse Law Trends via Machine Learning',
 'word_count': 3814,
 'reading_time': 15.742452830189,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'If you’re interested only in the machine learning discussion, please skip to the Machine Learning section.',
 'author': '1506fccbdfe3'}

There was a brief delay when grabbing the info for the articles. This makes me wonder if the API is lazy and makes the call only when necessary (in this case when we call info). If this is the case then using the API, on the free tier at least, will not be possible since we will likely expend the allotted monthly API calls very rapidly. Looking at the API usage stats (at RapidAPI) it seems like this is in fact the case.

The documentation shows there is a fetch_articles function that may solve this issue. Let's try it.

In [12]:
# Grab the article data using fetch_articles, this function doesn't return anything, but will instead populate
# the latestposts.articles list.

medium.fetch_articles(latestposts.articles)

In [14]:
len(latestposts.articles)

25

In [15]:
# Now let's try getting the info for a few articles again.

latestposts.articles[0].info

{'id': '14f1ac7ec113',
 'tags': ['smart-speaker',
  'smartspeakersmarketshare',
  'telegram',
  'telegram-bot',
  'voice-assistant'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:53:06',
 'published_at': '2022-07-04 20:53:06',
 'url': 'https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113',
 'image_url': 'https://miro.medium.com/1*TyS9LOsSoh-506qWQURRlw.jpeg',
 'lang': 'en',
 'publication_id': 'e9c948ff6ebd',
 'title': 'Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?',
 'word_count': 2285,
 'reading_time': 8.822641509434,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'An open letter to Pavel Durov, containing some change requests to enable voice integration into Telegram bots ecosystem',
 'author': '452c0445f9d5'}

In [16]:
latestposts.articles[10].info

{'id': '65de5d04c475',
 'tags': ['artificial-intelligence',
  'machine-learning',
  'data-science',
  'automation',
  'future'],
 'claps': 250,
 'last_modified_at': '2022-07-04 13:10:28',
 'published_at': '2022-07-04 12:26:24',
 'url': 'https://medium.com/gdg-vit/ai-automation-and-the-future-of-workplaces-65de5d04c475',
 'image_url': 'https://miro.medium.com/1*b0ig8NZnu7GBr2tZG0xnsg.png',
 'lang': 'en',
 'publication_id': '7ebddf9721d',
 'title': 'AI, Automation, and the Future of Workplaces',
 'word_count': 1590,
 'reading_time': 6.8333333333333,
 'voters': 5,
 'topics': ['artificial-intelligence'],
 'subtitle': 'Introduction',
 'author': '15c03e4e1b20'}

There is no longer a delay when getting the article info, but the API usage stats (at RapidAPI) confirm that running the fetch_articles function made 25 API calls.
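To make the behavior observed above concrete, here is a toy model of lazy fetching with a call counter. `FakeAPI`, `LazyArticle`, and `fetch_all` are hypothetical stand-ins written for illustration; they are not the medium_api implementation, and the real library may count calls differently (for example, the listing call itself may also cost a request).

```python
class FakeAPI:
    """Hypothetical stand-in for a remote API that counts its requests."""

    def __init__(self):
        self.calls = 0

    def get_article_info(self, article_id):
        self.calls += 1
        return {'id': article_id}


class LazyArticle:
    """Fetches its info on first access and caches it afterwards."""

    def __init__(self, api, article_id):
        self._api = api
        self.article_id = article_id
        self._info = None

    @property
    def info(self):
        # Only hit the API if we have not fetched this article before.
        if self._info is None:
            self._info = self._api.get_article_info(self.article_id)
        return self._info


def fetch_all(articles):
    """Eagerly populate every article, one request each."""
    for article in articles:
        article.info


api = FakeAPI()
articles = [LazyArticle(api, str(i)) for i in range(25)]
calls_after_listing = api.calls    # 0: nothing fetched yet

fetch_all(articles)
calls_after_fetch = api.calls      # 25: one request per article

articles[0].info
articles[10].info
calls_after_reading = api.calls    # still 25: cached, so no delay and no new calls
```

In this model, fetching up front removes the delay on later reads but does not reduce the total number of requests, which matches what the usage stats showed.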

In [7]:
# One last thing, let's try getting the information for a specific article which we know the URL of.
# https://towardsdatascience.com/google-foobar-challenge-level-1-3487bb252780
# The article id is the series of characters at the end of the URL after the article title. 3487bb252780

article = medium.article(article_id = '3487bb252780')
article.info

{'id': '3487bb252780',
 'tags': ['coding-challenge',
  'google',
  'python',
  'getting-started',
  'programming'],
 'claps': 10,
 'last_modified_at': '2022-07-05 20:03:53',
 'published_at': '2022-07-05 19:19:47',
 'url': 'https://towardsdatascience.com/google-foobar-challenge-level-1-3487bb252780',
 'image_url': 'https://miro.medium.com/0*RX1cyqxIba1gKt1b',
 'lang': 'en',
 'publication_id': '7f60cf5620c9',
 'title': 'Google Foobar Challenge: Level 1',
 'word_count': 1243,
 'reading_time': 5.0738993710692,
 'voters': 2,
 'topics': ['data-science', 'programming'],
 'subtitle': 'An intro to the secretive coding challenge and a breakdown of the problems',
 'author': '94ed6e69690'}
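Since the article id is just the last hyphen-separated piece of the URL, it can be pulled out programmatically. The helper below is a small sketch; `article_id_from_url` is a hypothetical name, and it assumes every article URL ends with `-<id>` the way the examples in this notebook do.

```python
from urllib.parse import urlparse


def article_id_from_url(url):
    """Return the trailing id of a Medium-style article URL."""
    path = urlparse(url).path                    # ignores any ?source=... query string
    slug = path.rstrip('/').rsplit('/', 1)[-1]   # last path segment
    return slug.rsplit('-', 1)[-1]               # text after the final hyphen


article_id = article_id_from_url(
    'https://towardsdatascience.com/google-foobar-challenge-level-1-3487bb252780'
)
# article_id == '3487bb252780'
```

The result could then be passed to `medium.article(article_id = article_id)` as in the cell above.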

### Takeaways

Using the API would definitely make things easier, but it comes with some significant limitations. First, the work above shows that the API makes one call for each article we collect data on. If, for instance, we wanted the latest 25 articles for 5 different topics, this would amount to 125 API calls, which is half the monthly allotment of 250. Even with a smarter algorithm that avoids unnecessary calls, we would probably still expend the monthly free tier allotment within a week.

Switching to a paid tier is an option. The next tier up would cost 4.99 per month and give 1,250 API calls per month. This might be enough, but it comes with the downside of having to pay for the API, and it still might not be enough calls as the application scales. The free tier would probably be good enough if this application were kept to a small scope, such as only pulling article data for 2 or 3 different topics. That would not make for a very useful application though, so we may need to explore alternative solutions.
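The budget arithmetic above can be checked with a few lines. This is a rough sketch that assumes one API call per article and ignores any cost for the listing call itself; `runs_per_month` is a hypothetical helper.

```python
FREE_TIER_CALLS = 250    # monthly allotment on the free tier
PAID_TIER_CALLS = 1250   # monthly allotment on the next tier up


def runs_per_month(monthly_calls, topics, articles_per_topic):
    """How many full pulls fit in a monthly allotment, at one call per article."""
    calls_per_run = topics * articles_per_topic
    return monthly_calls // calls_per_run


calls_per_run = 5 * 25                                  # 125 calls for 5 topics
free_runs = runs_per_month(FREE_TIER_CALLS, 5, 25)      # 2 pulls per month
paid_runs = runs_per_month(PAID_TIER_CALLS, 5, 25)      # 10 pulls per month
```

Under these assumptions even the paid tier allows only 10 full pulls per month for 5 topics, well short of one pull per day.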

It might still be worthwhile to design and build a package that utilizes the API, in case the API later turns out to be necessary.

## Acquiring Article Data Through Web Scraping

There are a few things to keep in mind when scraping the data we need.

1. It is possible to structure the URL in such a way that we can get the latest articles, but without JavaScript enabled we only get 10 articles. A simple count of the articles published in a single day for a single topic suggests that most topics and publications will see more than 10 new articles per day. This means we would either need to run the web scraping script multiple times a day or use a dynamic web scraping library like Selenium. I know from experience that Selenium does not play well with Crontab, so this could be problematic.
2. Getting the latest articles for a publication shouldn't be too much of an issue. Getting the latest articles for a specific topic could be problematic. Searching for a specific topic through the main Medium site provides unusual results (the articles don't seem to be in order of most recent, articles on a variety of related topics seem to be returned, and I also seem to get different results in Chrome and Safari).

At this point I believe it is best to move forward with the solution that works best within these limitations. So we will only pull the latest articles for given publications and not for specific topics (at least not until this weird issue with the returned results is fixed). To accomplish this task we will need requests and Beautiful Soup.

In [3]:
# We'll attempt to get the HTML for the latest articles page from towardsdatascience.com

url = 'https://towardsdatascience.com/latest'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

For each article I would like to gather the following data:
- Title
- Subtitle
- Date
- Read Time
- Author
- Publication
- URL
- Article Intro

I'll start by first getting all the article URLs on the latest article page.

In [4]:
# Get the URLs for all the latest articles.

links = soup.find_all('a', title = 'Latest stories published on Towards Data Science')

In [5]:
len(links)

10

In [6]:
links[0]['href']

'https://towardsdatascience.com/navigating-mlops-dc2a242ef7ed?source=---------0'

In [7]:
links[1]['href']

'https://towardsdatascience.com/ai-can-now-play-minecraft-and-is-a-step-closer-to-navigate-the-world-1f19cfe37ef?source=---------1'

In [8]:
# Now let's put all the URLs into a list.
# I'll also strip the ?source= stuff at the end of the URL.

urls = [link['href'].split('?source')[0] for link in links]
urls

['https://towardsdatascience.com/navigating-mlops-dc2a242ef7ed',
 'https://towardsdatascience.com/ai-can-now-play-minecraft-and-is-a-step-closer-to-navigate-the-world-1f19cfe37ef',
 'https://towardsdatascience.com/calculating-closest-landmass-between-two-points-on-earth-214f73b48fdc',
 'https://towardsdatascience.com/visualizing-cpu-memory-and-gpu-utilities-with-python-8028d859c2b0',
 'https://towardsdatascience.com/choosing-neural-networks-over-n-gram-models-for-natural-language-processing-156ea3a57fc',
 'https://towardsdatascience.com/numpy-ufuncs-the-magic-behind-vectorized-functions-8cc3ba56aa2c',
 'https://towardsdatascience.com/block-recurrent-transformer-lstm-and-transformer-combined-ec3e64af971a',
 'https://towardsdatascience.com/5-ideas-to-create-new-features-from-polygons-f8f902f5ad8f',
 'https://towardsdatascience.com/how-to-use-customer-lifetime-value-ltv-for-data-driven-transformation-6f12596df943',
 'https://towardsdatascience.com/self-supervised-transformer-models-bert-g
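Splitting on `'?source'` works for the links above. As a slightly more general alternative, here is a sketch that drops whatever query string or fragment is present using the standard library; `strip_query` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


clean_url = strip_query(
    'https://towardsdatascience.com/navigating-mlops-dc2a242ef7ed?source=---------0'
)
# clean_url == 'https://towardsdatascience.com/navigating-mlops-dc2a242ef7ed'
```

This would keep working if the tracking parameter were ever renamed, at the cost of also removing any query parameters we might have wanted to keep.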

Next we can try to gather all the data we need from a given article by scraping the webpage for that URL.

In [9]:
response = requests.get(urls[1])
soup = BeautifulSoup(response.text, 'html.parser')

In [13]:
# We can get the author's name by searching for the class pw-author in the HTML. The author's name is then 
# three <div>s and an <a> below that.

author_name = soup.find('div', class_ = 'pw-author').div.div.div.a.text
author_name

'Alberto Romero'

In [15]:
# We can get the publishing date by searching for the class pw-published-date in the HTML. The publishing 
# date is then one <span> below that.

date = soup.find('p', class_ = 'pw-published-date').span.text
date

'Jul 6'

In [17]:
# We can get the read time by searching for the class pw-reading-time in the HTML.

read_time = soup.find('div', class_ = 'pw-reading-time').text
read_time

'7 min read'
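The date and read time come back as display strings, so they will probably need to be normalized before analysis. The helpers below are a sketch with hypothetical names. They assume the read time always looks like `'7 min read'`, and that articles from earlier years carry an explicit year such as `'Dec 12, 2021'`, a format that does not appear in this notebook and is an assumption here.

```python
from datetime import date, datetime


def parse_read_time(text):
    """Convert a string like '7 min read' to the integer 7."""
    return int(text.split()[0])


def parse_published_date(text, default_year):
    """Parse 'Jul 6' (no year) or 'Dec 12, 2021' (assumed format) into a date."""
    if ',' in text:
        return datetime.strptime(text, '%b %d, %Y').date()
    # No year in the string, so attach the one supplied by the caller.
    return datetime.strptime(f'{text} {default_year}', '%b %d %Y').date()


minutes = parse_read_time('7 min read')                         # 7
published = parse_published_date('Jul 6', default_year=2022)    # date(2022, 7, 6)
```

The year has to be supplied by the caller because the scraped string alone does not say which year the article was published in.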

In [18]:
# We can get the title by searching for the class pw-post-title in the HTML.

title = soup.find('h1', class_ = 'pw-post-title').text
title

'AI Can Now Play Minecraft — A Step Closer to Navigate the World'

In [20]:
# We can get the subtitle by searching for the class pw-subtitle-paragraph in the HTML.

subtitle = soup.find('h2', class_ = 'pw-subtitle-paragraph').text
subtitle

'The beginning of open-ended AI'

In [21]:
# We can get the article intro by searching for the class pw-post-body-paragraph in the HTML.

article_intro = soup.find('p', class_ = 'pw-post-body-paragraph').text
article_intro

'After building impressive models in language processing (GPT-3) and text-to-image generation (DALL·E 2), OpenAI is now facing an arguably greater challenge: open-ended action. In the great task of solving so-called artificial general intelligence (AGI), they realize language and vision aren’t the only domains in which AI should excel. GPT-3 and DALL·E 2 are extremely good at what they do, but as powerful as they are they remain constrained within the limited boundaries of their virtual worlds.'

Lastly, URL and publication are both values that we would already have. The URL would have been scraped from the latest articles page and the publication will have been provided with the original URL. Now we are able to put all this information in a dictionary and add it to a pandas dataframe. Let's try this out, but run it for all the links we scraped.

In [28]:
df = pd.DataFrame({
    'author' : [],
    'publication' : [],
    'title' : [],
    'subtitle' : [],
    'article_intro' : [],
    'date' : [],
    'read_time' : [],
    'url' : [],
    'publication_url' : []
})

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    article_info = {}
    
    article_info['author'] = [soup.find('div', class_ = 'pw-author').div.div.div.a.text]
    article_info['publication'] = ['Towards Data Science']
    article_info['title'] = [soup.find('h1', class_ = 'pw-post-title').text]
    article_info['article_intro'] = [soup.find('p', class_ = 'pw-post-body-paragraph').text]
    article_info['date'] = [soup.find('p', class_ = 'pw-published-date').span.text]
    article_info['read_time'] = [soup.find('div', class_ = 'pw-reading-time').text]
    article_info['url'] = [url]
    article_info['publication_url'] = ['https://towardsdatascience.com']
    
    # Sometimes articles don't have a subtitle so we must check if there is a subtitle and 
    # set the subtitle to None if one doesn't exist.
    if (subtitle := soup.find('h2', class_ = 'pw-subtitle-paragraph')) is None:
        article_info['subtitle'] = [None]
    else:
        article_info['subtitle'] = [subtitle.text]
    
    temp = pd.DataFrame(article_info)
    df = pd.concat([df, temp]).reset_index(drop = True)
    
df

| | author | publication | title | subtitle | article_intro | date | read_time | url | publication_url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Christian Freischlag | Towards Data Science | Navigating MLOps in 2022 | Data Science in production: experience, tools ... | MLOps has established itself as an independent... | Jul 6 | 6 min read | https://towardsdatascience.com/navigating-mlop... | https://towardsdatascience.com |
| 1 | Alberto Romero | Towards Data Science | AI Can Now Play Minecraft — A Step Closer to N... | The beginning of open-ended AI | After building impressive models in language p... | Jul 6 | 7 min read | https://towardsdatascience.com/ai-can-now-play... | https://towardsdatascience.com |
| 2 | Andrew Hershy | Towards Data Science | Calculating Closest Landmass Between Two Point... | Python script to identify midpoint and closest... | I made some significant updates to an existing... | Jul 6 | 5 min read | https://towardsdatascience.com/calculating-clo... | https://towardsdatascience.com |
| 3 | Bharath K | Towards Data Science | Visualizing CPU, Memory, And GPU Utilities wit... | Analyzing CPU, memory usage, and GPU component... | When you are indulged in programming, you are ... | Jul 6 | 7 min read | https://towardsdatascience.com/visualizing-cpu... | https://towardsdatascience.com |
| 4 | Benjamin McCloskey | Towards Data Science | Choosing Neural Networks over N-Gram Models fo... | Today we will look at the strengths of using R... | Traditional learning models transform text fro... | Jul 6 | 8 min read | https://towardsdatascience.com/choosing-neural... | https://towardsdatascience.com |
| 5 | Diego Barba | Towards Data Science | NumPy ufuncs — The Magic Behind Vectorized Fun... | Learn about NumPy universal functions (ufuncs)... | Have you ever wondered about the origin of Num... | Jul 6 | 7 min read | https://towardsdatascience.com/numpy-ufuncs-th... | https://towardsdatascience.com |
| 6 | Nikos Kafritsas | Towards Data Science | Block-Recurrent Transformer: LSTM and Transfor... | A powerful model that combines the best of bot... | The vanilla Transformer is no longer the all-m... | Jul 6 | 12 min read | https://towardsdatascience.com/block-recurrent... | https://towardsdatascience.com |
| 7 | Leonie Monigatti | Towards Data Science | 5 Ideas to Create New Features from Polygons | How to Get the Area and Other Features From a ... | Polygon data can be useful in various applicat... | Jul 6 | 6 min read | https://towardsdatascience.com/5-ideas-to-crea... | https://towardsdatascience.com |
| 8 | Lak Lakshmanan | Towards Data Science | How to use Customer Lifetime Value (LTV) for d... | Use LTV to set goals for your business and dev... | If your business is undertaking a data-driven ... | Jul 6 | 12 min read | https://towardsdatascience.com/how-to-use-cust... | https://towardsdatascience.com |
| 9 | Siwei Causevic | Towards Data Science | Self-supervised Transformer Models — BERT, GPT... | | In the early days, NLP systems are mostly rule... | Jul 6 | 6 min read | https://towardsdatascience.com/self-supervised... | https://towardsdatascience.com |


A quick check of these results shows that everything is correct with one exception: the date scraped for the last article is not the date shown on the article page, although it does match the page when JavaScript is disabled. So somehow JavaScript is changing the date after the page loads. A few checks of older articles show that this is not because the current date is shown by default. It's weird, but for now I will ignore it.

### Takeaways

Web scraping with requests and Beautiful Soup can be enough to get all the article information needed, with some limitations. Each scrape can get at most 10 articles per publication, and we are not able to get topic information for each article (we may need to use the API for this purpose and build a model to predict topics). Because of the 10-article limit, we may need to run the web scraping script multiple times a day in order to keep up with the latest articles. Alternatively, at some point in the future I can explore using Selenium to get around this limit.
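If the script does end up running several times a day, consecutive scrapes will overlap, so the results need to be de-duplicated. Below is a minimal sketch of one way to do that, keyed on the article URL. `merge_new_articles` and the example.com rows are hypothetical and only illustrate the idea.

```python
def merge_new_articles(seen, scraped):
    """Add rows whose URL is not already in `seen`; return how many were added."""
    added = 0
    for row in scraped:
        if row['url'] not in seen:
            seen[row['url']] = row
            added += 1
    return added


seen = {}  # url -> article row, accumulated across runs

# Two made-up scrapes that overlap on one article.
morning = [
    {'url': 'https://example.com/a-1', 'title': 'A'},
    {'url': 'https://example.com/b-2', 'title': 'B'},
]
evening = [
    {'url': 'https://example.com/b-2', 'title': 'B'},
    {'url': 'https://example.com/c-3', 'title': 'C'},
]

added_morning = merge_new_articles(seen, morning)   # 2
added_evening = merge_new_articles(seen, evening)   # 1, since b-2 was already seen
```

Keying on the URL with the `?source=` suffix already stripped matters here, since the same article could otherwise show up under slightly different links.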