# Acquiring Article Data

In this notebook we'll go through the process of acquiring all the article data used in this project. The steps outlined here document how the data was acquired, but because new articles are published every day, rerunning them will not reproduce the exact dataset used for exploration and model training. Nonetheless, the steps are outlined here for reference. For reproducibility, a .csv file containing the data used for analysis and modeling will be provided.

## Imports

These are all the modules that we'll need to run the code in this notebook.

In [2]:
# We'll need to add the modules directory path to sys.path since this is where all the .py files are located.

import sys
sys.path.append('../modules')

In [20]:
# The Medium API will be used to acquire data.
# Medium API Documentation: https://github.com/weeping-angel/medium-api

# import pandas as pd
from medium_api import Medium

# An API key is required to use the Medium API. To get one, create an account at rapidapi.com
# and subscribe to the Medium API.
# https://rapidapi.com/nishujain199719-vgIfuFHZxVZ/api/medium2/pricing

from env import medium_api_key

# For timing the execution of cells.
import time

## Acquire Article Data Using Medium API

In this section we'll cover some of the functionality provided by the Medium API. We'll discover how to use it and what can and can't be done.

In [4]:
# First we need to create the Medium object with our API key. This will allow us to make calls to the API.

medium = Medium(medium_api_key)

In [23]:
# Now let's try getting the latest posts for a given topic. The latestposts function takes only a
# topic as its argument.

latestposts = medium.latestposts(topic_slug = 'artificial-intelligence')

In [25]:
# Now let's look at the info for the first article in this list.

latestposts.articles[0].info

{'id': '14f1ac7ec113',
 'tags': ['smart-speaker',
  'smartspeakersmarketshare',
  'telegram',
  'telegram-bot',
  'voice-assistant'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:53:06',
 'published_at': '2022-07-04 20:53:06',
 'url': 'https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113',
 'image_url': 'https://miro.medium.com/1*TyS9LOsSoh-506qWQURRlw.jpeg',
 'lang': 'en',
 'publication_id': 'e9c948ff6ebd',
 'title': 'Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?',
 'word_count': 2285,
 'reading_time': 8.822641509434,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'An open letter to Pavel Durov, containing some change requests to enable voice integration into Telegram bots ecosystem',
 'author': '452c0445f9d5'}

In [11]:
# Now the second article.

latestposts.articles[1].info

{'id': 'fe99c68643a6',
 'tags': ['metaverse', 'machine-learning', 'strategy', 'marketing', 'law'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:05:43',
 'published_at': '2022-07-04 19:58:46',
 'url': 'https://medium.com/@MetaverseLaw/metaverse-law-trends-via-machine-learning-fe99c68643a6',
 'image_url': 'https://miro.medium.com/1*rta8P5wpi5-75cERsoJa2g.png',
 'lang': 'en',
 'publication_id': '*Self-Published*',
 'title': 'Metaverse Law Trends via Machine Learning',
 'word_count': 3814,
 'reading_time': 15.742452830189,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'If you’re interested only in the machine learning discussion, please skip to the Machine Learning section.',
 'author': '1506fccbdfe3'}

There was a brief delay when grabbing the info for each article. This makes me wonder if the API client is lazy and makes the call only when necessary (in this case, when we access info). If so, using the API will not be practical, on the free tier at least, since we would likely exhaust the allotted monthly API calls very quickly. The API usage stats (at RapidAPI) suggest that this is in fact the case.

The documentation shows there is a fetch_articles function that may solve this issue. Let's try it.

In [12]:
# Grab the article data using fetch_articles. This function doesn't return anything; instead it populates
# the articles in the latestposts.articles list.

medium.fetch_articles(latestposts.articles)

In [14]:
len(latestposts.articles)

25

In [15]:
# Now let's try getting the info for a few articles again.

latestposts.articles[0].info

{'id': '14f1ac7ec113',
 'tags': ['smart-speaker',
  'smartspeakersmarketshare',
  'telegram',
  'telegram-bot',
  'voice-assistant'],
 'claps': 0,
 'last_modified_at': '2022-07-04 20:53:06',
 'published_at': '2022-07-04 20:53:06',
 'url': 'https://convcomp.it/could-telegram-be-a-competitor-of-voice-assistants-like-amazon-alexa-or-google-assistant-14f1ac7ec113',
 'image_url': 'https://miro.medium.com/1*TyS9LOsSoh-506qWQURRlw.jpeg',
 'lang': 'en',
 'publication_id': 'e9c948ff6ebd',
 'title': 'Could Telegram be a competitor of voice assistants, like Amazon Alexa or Google Assistant?',
 'word_count': 2285,
 'reading_time': 8.822641509434,
 'voters': 0,
 'topics': ['artificial-intelligence'],
 'subtitle': 'An open letter to Pavel Durov, containing some change requests to enable voice integration into Telegram bots ecosystem',
 'author': '452c0445f9d5'}

In [16]:
latestposts.articles[10].info

{'id': '65de5d04c475',
 'tags': ['artificial-intelligence',
  'machine-learning',
  'data-science',
  'automation',
  'future'],
 'claps': 250,
 'last_modified_at': '2022-07-04 13:10:28',
 'published_at': '2022-07-04 12:26:24',
 'url': 'https://medium.com/gdg-vit/ai-automation-and-the-future-of-workplaces-65de5d04c475',
 'image_url': 'https://miro.medium.com/1*b0ig8NZnu7GBr2tZG0xnsg.png',
 'lang': 'en',
 'publication_id': '7ebddf9721d',
 'title': 'AI, Automation, and the Future of Workplaces',
 'word_count': 1590,
 'reading_time': 6.8333333333333,
 'voters': 5,
 'topics': ['artificial-intelligence'],
 'subtitle': 'Introduction',
 'author': '15c03e4e1b20'}

There is no longer a delay when getting the article info, but the API usage stats (at RapidAPI) confirm that running the fetch_articles function made 25 API calls.
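To make this behavior concrete, here is a toy model of lazy loading with a call counter. It is only a sketch: `CallCounter`, `LazyArticle`, and `fetch_all` are hypothetical stand-ins and not the medium_api implementation. The one assumption carried over from the observations above is that each article costs one call, whether that call is triggered by accessing info or by a bulk fetch. The toy also assumes that info is cached after the first access.

```python
class CallCounter:
    """Hypothetical stand-in for a remote API that counts simulated calls."""

    def __init__(self):
        self.calls = 0

    def get(self, article_id):
        # Each lookup counts as one API call against the quota.
        self.calls += 1
        return {"id": article_id, "title": f"Article {article_id}"}


class LazyArticle:
    """Toy article whose info is fetched on first access, then cached."""

    def __init__(self, article_id, api):
        self.article_id = article_id
        self._api = api
        self._info = None

    @property
    def info(self):
        # The simulated call happens only if the info has not been loaded yet.
        if self._info is None:
            self._info = self._api.get(self.article_id)
        return self._info


def fetch_all(articles):
    """Eagerly load every article; still one call per unloaded article."""
    for article in articles:
        article.info


api = CallCounter()
articles = [LazyArticle(str(i), api) for i in range(25)]

articles[0].info  # first access triggers one simulated call
articles[0].info  # second access is served from the cached value
calls_after_lazy_access = api.calls

fetch_all(articles)  # loads the remaining 24 articles
calls_after_bulk_fetch = api.calls

print(calls_after_lazy_access, calls_after_bulk_fetch)
```

In this toy model, fetching in bulk removes the per-access delay but does not reduce the number of calls, which matches what the usage stats showed.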

### Takeaways

Using the API would definitely make things easier, but it comes with significant limitations. First, the work above shows that the API makes one call for each article we collect data on. This means that if, for instance, we wanted the latest 25 articles for 5 different topics, that would amount to 125 API calls, which is half the monthly allotment of 250. Even with a smarter algorithm that avoids unnecessary calls, we would probably still exhaust the monthly free tier allotment within a week.
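The arithmetic above can be checked with a short calculation. This is a rough sketch that assumes exactly one call per article and ignores any calls spent on listing the posts themselves. The helper names `calls_needed` and `refreshes_per_month` are made up for illustration.

```python
FREE_TIER_QUOTA = 250  # monthly calls on the free tier, as noted above


def calls_needed(num_topics, articles_per_topic):
    """Calls for one full pull, assuming one call per article."""
    return num_topics * articles_per_topic


def refreshes_per_month(monthly_quota, num_topics, articles_per_topic):
    """How many full pulls fit into the monthly quota."""
    return monthly_quota // calls_needed(num_topics, articles_per_topic)


one_pull = calls_needed(num_topics=5, articles_per_topic=25)
quota_share = one_pull / FREE_TIER_QUOTA
full_refreshes = refreshes_per_month(FREE_TIER_QUOTA, 5, 25)

print(one_pull, quota_share, full_refreshes)
```

Under these assumptions, a single pull of 5 topics uses half the free quota, so the data could be refreshed only twice a month.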

Switching to a paid tier is an option. The next tier up would cost 4.99 per month and provide 1,250 API calls per month. This might be enough, but it means paying for the API, and it might still not be enough calls as the application scales. The free tier would probably suffice if the application were kept to a small scope, such as pulling article data for only 2 or 3 topics. That would not make for a very useful application, though, so we may need to explore alternative solutions.

It might be worthwhile to design and build a package that uses the API, in case it later turns out that the API is necessary.
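If such a package were built, one piece of it could be a cache keyed by article ID, so that an article is never fetched twice. The sketch below is one possible shape, not a finished design. `ArticleCache` and `fake_fetch_info` are hypothetical names, and the fetch function is passed in so that the real API call (whatever form it takes) can be swapped for the fake one used here. The overlap between the two topic lists is also invented for illustration.

```python
import json
from pathlib import Path


class ArticleCache:
    """Hypothetical cache that spends an API call only on unseen article IDs."""

    def __init__(self, fetch_info, path=None):
        self._fetch_info = fetch_info  # callable: article_id -> dict
        self._path = Path(path) if path else None
        self._store = {}
        self.api_calls = 0
        # Reload previously saved articles so calls are not repeated across runs.
        if self._path and self._path.exists():
            self._store = json.loads(self._path.read_text())

    def get(self, article_id):
        if article_id not in self._store:
            self._store[article_id] = self._fetch_info(article_id)
            self.api_calls += 1
        return self._store[article_id]

    def save(self):
        # Persist the cache to disk if a path was given.
        if self._path:
            self._path.write_text(json.dumps(self._store))


def fake_fetch_info(article_id):
    """Stand-in for a real API call; returns a minimal info dict."""
    return {"id": article_id}


cache = ArticleCache(fake_fetch_info)

# Two topic listings that happen to share one article (invented overlap).
topic_a = ["14f1ac7ec113", "fe99c68643a6"]
topic_b = ["fe99c68643a6", "65de5d04c475"]

for article_id in topic_a + topic_b:
    cache.get(article_id)

calls_made = cache.api_calls
print(calls_made)
```

Four lookups cost three calls here because the shared article is fetched only once. Passing a `path` would extend the same saving across notebook sessions.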

## Acquiring Article Data Through Web Scraping

Let's now try pulling the data we need with web scraping techniques. We'll use Beautiful Soup to accomplish this.