# Building a Labeled Dataset

In this notebook we'll walk through building the dataset that will be used to train and test the recommendation model. We want a wide variety of article data so that the models have representative examples to learn from.

## Imports

These are the modules we'll need to run the code in this notebook.

In [1]:
# We'll need to add the modules directory path to sys.path since this is where all the .py files are located.

import sys
sys.path.append('../modules')

In [2]:
import numpy as np
import pandas as pd
import requests

# The Medium API will be used to acquire data.
# Medium API Documentation: https://github.com/weeping-angel/medium-api

from medium_api import Medium

# The News API will be used to acquire data from a large variety of sources.
# News API Documentation: https://newsapi.org/docs

from newsapi import NewsApiClient

# We need an API key to use the Medium API. This can be acquired from rapidapi.com.
# You must create an account and subscribe to the Medium API to get a key.
# https://rapidapi.com/nishujain199719-vgIfuFHZxVZ/api/medium2/pricing

from env import medium_api_key, news_api_key
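The `env` module itself is not shown in this notebook. As a minimal sketch, assuming the keys are stored in environment variables, it could look like the following. The helper `read_key` and the variable names `MEDIUM_API_KEY` and `NEWS_API_KEY` are hypothetical; only `medium_api_key` and `news_api_key` come from the import above.

```python
# env.py (hypothetical layout; only the two key names come from the notebook)
import os


def read_key(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


# In env.py these would be module-level assignments, for example:
# medium_api_key = read_key("MEDIUM_API_KEY")
# news_api_key = read_key("NEWS_API_KEY")
```

Keeping the keys out of the notebook this way means the notebook can be shared without exposing credentials.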

## What Data do we Need?

At the time of writing I have two sources of data: Medium (web scraping and API) and the News API. We're going to build two separate models: one to label the topic of each article and another to recommend articles based on a user's preferences. The Medium API returns topic labels for each article, so we'll use it to build the dataset with topic labels. For the dataset labeled with user preferences we'll draw on both the Medium and News APIs, so that the articles come from a balanced mix of sources.

Labeling the user preferences is going to be a time-consuming process. The labels will be created manually by having different users go through the dataset and choose which articles they would find interesting to read. That process won't be shown here, but the final result can be seen in the exploration notebook.

## Building the Topic Label Dataset

In this section we'll use the Medium API to build a dataset of articles labeled with the topics each article is categorized under. First, though, we must decide on a list of broad topics to cover. We can't cover every possible topic, but we want to cover a wide range. Keep in mind that the topic labeling model will need to be re-trained periodically so that it can label topics it has not yet seen. For now we'll gather articles covering the following topics:

1. data science
1. machine learning
1. programming
1. python
1. javascript
1. social media
1. self
1. self-improvement
1. philosophy
1. gaming
1. politics
1. health
1. psychology
1. work
1. productivity
1. parenting
1. history
1. cryptocurrency
1. science
1. relationships
1. food
1. technology
1. business
1. astronomy
1. space

Again, this is not a comprehensive list of topic categories, but it provides a good starting point and, most importantly, will help during exploration to determine what relates the articles within each topic category to each other. Here we have 25 different topics. We'll grab the 10 latest published articles for each topic, for 250 articles in total.

In [16]:
# Let's start by creating a set for all the topics we'll be using.

topics = {
    "data-science",
    "machine-learning",
    "programming",
    "python",
    "javascript",
    "social-media",
    "self",
    "self-improvement",
    "philosophy",
    "gaming",
    "politics",
    "health",
    "psychology",
    "work",
    "productivity",
    "parenting",
    "history",
    "cryptocurrency",
    "science",
    "relationships",
    "food",
    "technology",
    "business",
    "astronomy",
    "space"
}
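The tag slugs in the set are the topic names from the list with spaces replaced by hyphens. As a small sketch, the set could also be derived from the readable names, which keeps the two in sync. The helper `to_slug` is hypothetical, and the lowercase-with-hyphens rule is an assumption based only on the slugs used in this notebook.

```python
# Readable topic names, copied from the list earlier in this notebook.
topic_names = [
    "data science", "machine learning", "programming", "python", "javascript",
    "social media", "self", "self-improvement", "philosophy", "gaming",
    "politics", "health", "psychology", "work", "productivity",
    "parenting", "history", "cryptocurrency", "science", "relationships",
    "food", "technology", "business", "astronomy", "space",
]


def to_slug(name):
    """Lowercase a topic name and replace spaces with hyphens."""
    return name.strip().lower().replace(" ", "-")


topic_slugs = {to_slug(name) for name in topic_names}

# A set silently drops duplicates, so confirm that all 25 topics survived.
assert len(topic_slugs) == len(topic_names) == 25
```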

In [3]:
# Create the Medium API client using the API key.

medium = Medium(medium_api_key)

In [18]:
# Now we'll loop through each topic and collect the article data for up to 10 articles per topic.
# We'll use the topfeeds endpoint, which lets us grab the latest articles for a given tag.

article_info = []

for topic in topics:
    latest_articles = medium.topfeeds(tag = topic, mode = 'new')

    # Slicing avoids an IndexError if a tag returns fewer than 10 articles.
    for article in latest_articles.articles[:10]:
        article_info.append(article.info)

In [23]:
len(article_info)

250
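A length of 250 confirms that every one of the 25 tags returned at least 10 articles on this run, but that is not guaranteed for every tag. The sketch below shows one way to make the collection step report short tags rather than fail. The function `collect_article_info` and its `fetch_articles` parameter are hypothetical names. With the real client, `fetch_articles` could be something like `lambda t: [a.info for a in medium.topfeeds(tag = t, mode = 'new').articles]`; here a stand-in function is used so the example runs without an API key.

```python
def collect_article_info(topics, fetch_articles, per_topic=10):
    """Collect up to `per_topic` article records for every topic.

    `fetch_articles(topic)` must return a list of article-info dicts.
    Returns the collected records and a dict of topics that came up short.
    """
    collected = []
    short_topics = {}
    # Sets are unordered, so sort the topics to get a reproducible row order.
    for topic in sorted(topics):
        batch = fetch_articles(topic)[:per_topic]
        if len(batch) < per_topic:
            short_topics[topic] = len(batch)
        collected.extend(batch)
    return collected, short_topics


def fake_fetch(topic):
    """Stand-in for the API call: 'space' returns 3 records, others return 12."""
    count = 3 if topic == "space" else 12
    return [{"id": f"{topic}-{i}"} for i in range(count)]


records, short = collect_article_info({"python", "space"}, fake_fetch)
print(len(records), short)
```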

In [24]:
article_info[0]

{'id': '4496b66cc0d9',
 'tags': ['programming',
  'software-development',
  'python',
  'javascript',
  'java'],
 'claps': 0,
 'last_modified_at': '2022-07-13 21:46:25',
 'published_at': '2022-07-13 21:46:25',
 'url': 'https://md-kamaruzzaman.medium.com/top-10-in-demand-programming-languages-to-learn-in-2022-4496b66cc0d9',
 'image_url': 'https://miro.medium.com/1*N0FoWAh_s1qr53MflgqJFg.jpeg',
 'lang': 'en',
 'publication_id': '*Self-Published*',
 'title': 'Top 10 In-Demand programming languages to learn in 2022',
 'word_count': 4208,
 'reading_time': 20.629245283019,
 'voters': 0,
 'topics': {},
 'subtitle': 'Python, Java, JavaScript, C, C#, C++, Swift, PHP, Go, Rust',
 'author': 'df4b39a6f082'}

So now we have the article info for all the articles we pulled from the API. Let's now put this all in a dataframe with the data we want so that this information can be saved to a file.

In [25]:
articles = pd.DataFrame({
    'author' : [],
    'publication' : [],
    'title' : [],
    'subtitle' : [],
    'date' : [],
    'word_count' : [],
    'read_time' : [],
    'url' : [],
    'tags' : [],
    'topics' : [],
    'lang' : []
})

for article in article_info:
    info = {}
    
    info['author'] = [article['author']]
    info['publication'] = [article['publication_id']]
    info['title'] = [article['title']]
    info['subtitle'] = [article['subtitle']]
    info['date'] = [article['published_at']]
    info['word_count'] = [article['word_count']]
    info['read_time'] = [round(article['reading_time'], 0)]
    info['url'] = [article['url']]
    info['tags'] = [article['tags']]
    info['topics'] = [article['topics']]
    info['lang'] = [article['lang']]
    
    temp = pd.DataFrame(info)
    articles = pd.concat([articles, temp]).reset_index(drop = True)

articles

| | author | publication | title | subtitle | date | word_count | read_time | url | tags | topics | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | df4b39a6f082 | \*Self-Published\* | Top 10 In-Demand programming languages to lear... | Python, Java, JavaScript, C, C#, C++, Swift, P... | 2022-07-13 21:46:25 | 4208.0 | 21.0 | https://md-kamaruzzaman.medium.com/top-10-in-d... | [programming, software-development, python, ja... | {} | en |
| 1 | 5015c1b2b3a | \*Self-Published\* | Cheat Sheet: Importing and exporting all file ... | Hi everyone! In this post we’ll review how to ... | 2022-07-13 21:39:33 | 47.0 | 0.0 | https://medium.com/@datawithdan/cheat-sheet-im... | [python, data, data-science, sql, pandas] | [programming] | en |
| 2 | c5f0ad9001fd | \*Self-Published\* | Python Packages That Apple Uses | Welcome back! Python is an awesome programming... | 2022-07-13 21:22:54 | 572.0 | 2.0 | https://preettheman.medium.com/python-packages... | [python, programming, coding, software-develop... | [programming] | en |
| 3 | 1253d637b2a0 | 7f60cf5620c9 | An Easy Introduction to NumPy Arrays | What, How, and Why. | 2022-07-13 20:51:17 | 866.0 | 3.0 | https://towardsdatascience.com/an-easy-introdu... | [numpy, arrays, python, data-science, programm... | [data-science, programming] | en |
| 4 | 23fec7404e6a | \*Self-Published\* | Alternative Method to Choose the Right Machine... | In this article, I will introduce you to a spe... | 2022-07-13 19:59:23 | 1421.0 | 6.0 | https://medium.com/@gromdimon/alternative-meth... | [machine-learning, python, computer-science, a... | [machine-learning] | en |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | 87d4eb341b4c | \*Self-Published\* | AI Art. The new kid on the block. | It sounds like science fiction, but it isn’t —... | 2022-07-13 22:04:08 | 257.0 | 1.0 | https://medium.com/@farhadtahmazovsoc_82646/ai... | [technology, business, entrepreneurship, desig... | [artificial-intelligence] | en |
| 246 | 79bd9219e935 | \*Self-Published\* | There Is More To Ideation Events Than Ideas | There is an immense amount of value and opport... | 2022-07-13 21:52:05 | 2411.0 | 9.0 | https://medium.com/@JPrefontaine/there-is-more... | [innovation, hackathons, business, ideas, cult... | [startups] | en |
| 247 | 782b5907b5bb | 29bf33e1ca19 | What is Standard Costing and How Does It Work | Standard costing uses predetermined costs for ... | 2022-07-13 21:51:53 | 1542.0 | 7.0 | https://medium.com/magnimetrics/what-is-standa... | [manufacturing, standard-costing, variance-ana... | [data-science] | en |
| 248 | 2e309864aa63 | \*Self-Published\* | 15 Proven Ways To Stop Your Self Sabotage | Thomas Edison failed more than 1000 times whil... | 2022-07-13 21:44:12 | 2376.0 | 9.0 | https://medium.com/@alux.com/15-proven-ways-to... | [self-improvement, productivity, mindfulness, ... | [self] | en |
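Calling `pd.concat` inside the loop copies the growing dataframe on every iteration. A common alternative, sketched here, is to flatten each record into a plain dict first and build the frame once at the end. The names `COLUMN_SOURCES` and `article_to_row` are hypothetical; the column names and API keys are the ones used in the cell above, and the sample record is a trimmed copy of `article_info[0]`.

```python
# Maps each dataframe column to the key it is read from in the API record.
COLUMN_SOURCES = {
    'author': 'author',
    'publication': 'publication_id',
    'title': 'title',
    'subtitle': 'subtitle',
    'date': 'published_at',
    'word_count': 'word_count',
    'read_time': 'reading_time',
    'url': 'url',
    'tags': 'tags',
    'topics': 'topics',
    'lang': 'lang',
}


def article_to_row(article):
    """Flatten one article-info dict into a row keyed by column name."""
    row = {column: article[key] for column, key in COLUMN_SOURCES.items()}
    # Round the reading time to the nearest whole minute.
    row['read_time'] = round(row['read_time'])
    return row


sample = {
    'author': 'df4b39a6f082',
    'publication_id': '*Self-Published*',
    'title': 'Top 10 In-Demand programming languages to learn in 2022',
    'subtitle': 'Python, Java, JavaScript, C, C#, C++, Swift, PHP, Go, Rust',
    'published_at': '2022-07-13 21:46:25',
    'word_count': 4208,
    'reading_time': 20.629245283019,
    'url': 'https://md-kamaruzzaman.medium.com/top-10-in-demand-programming-languages-to-learn-in-2022-4496b66cc0d9',
    'tags': ['programming', 'software-development', 'python', 'javascript', 'java'],
    'topics': {},
    'lang': 'en',
}

rows = [article_to_row(sample)]

# With pandas imported as pd, the whole frame is then built in one call:
# articles = pd.DataFrame(rows)
```

One difference to be aware of: built this way, `word_count` and `read_time` would likely stay integers, whereas the concatenation approach above shows them as floats.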


In [26]:
articles.to_csv('../data/articles-with-topic-label.csv', index = False)

### LatestPosts Error

This is an error being thrown by the Medium API.

In [20]:
latestposts = medium.latestposts(topic_slug = 'astronomy')

In [21]:
latestposts.articles[4].info

[ERROR]: Response: {'error': 'API is working but something unexpected occured. Please check your input!'}


KeyError: 'latestposts'

The error originates at line 82 of 'medium-api/src/medium_api/medium.py', which returns an empty dictionary as the response. Line 37 of 'medium-api/src/medium_api/_latestposts.py' then evaluates `resp['latestposts']`, and a KeyError is raised because the dictionary is empty. The HTTP status of the call is 200, but an error in the JSON data forces the `__get_resp()` function to retry the call, which is only allowed up to 3 times. To figure out what is going on, I'll look at the JSON data returned by the API using a plain HTTP request instead of the Python package.
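To make the failure mode concrete, here is a simplified sketch of the behavior described above. It is not the package's actual code: `get_with_retries` is a hypothetical stand-in for the retry logic, and treating "up to 3 times" as three attempts in total is an assumption.

```python
def get_with_retries(fetch, max_attempts=3):
    """Call `fetch()` until the payload has no 'error' key, at most `max_attempts` times.

    Returns an empty dict when every attempt fails, mirroring the empty
    response described above.
    """
    for _ in range(max_attempts):
        payload = fetch()
        if 'error' not in payload:
            return payload
    return {}


calls = []


def always_failing_fetch():
    """Stand-in for the HTTP call: status 200, but the JSON body is an error."""
    calls.append(1)
    return {'error': 'API is working but something unexpected occured. Please check your input!'}


resp = get_with_retries(always_failing_fetch)

try:
    resp['latestposts']
except KeyError as exc:
    # Indexing the empty dict is what surfaces as the KeyError.
    print('KeyError:', exc)
```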

In [14]:
url = 'https://medium2.p.rapidapi.com/latestposts/astronomy'
headers = {
    'X-RapidAPI-Key': medium_api_key,
    'X-RapidAPI-Host': 'medium2.p.rapidapi.com'
}

response = requests.get(url, headers = headers)

In [15]:
response.json()

{'error': 'API is working but something unexpected occured. Please check your input!'}

So the error message comes from the API itself rather than from the Python package. There is not much more I can do about this on my end, so I have left a message detailing the issue on the API discussion board on RapidAPI.
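Because the API reports failures inside the JSON body while still returning status 200, checking the status code alone is not enough. A minimal sketch of a guard that inspects the body before indexing into it follows; `unwrap_payload` is a hypothetical helper, and the only response shape assumed is the `'error'` key seen above.

```python
def unwrap_payload(payload, key):
    """Return payload[key], raising a descriptive error if the API reported a failure."""
    if 'error' in payload:
        raise RuntimeError(f"Medium API returned an error: {payload['error']}")
    if key not in payload:
        raise RuntimeError(f"Response has no '{key}' field; got keys {sorted(payload)}")
    return payload[key]


error_payload = {'error': 'API is working but something unexpected occured. Please check your input!'}

try:
    unwrap_payload(error_payload, 'latestposts')
except RuntimeError as exc:
    message = str(exc)
    print(message)
```

With the request above, this would be called as `unwrap_payload(response.json(), 'latestposts')`.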

#### Update

An [issue](https://github.com/weeping-angel/medium-api/issues/1) was submitted on the GitHub repo for the Medium API and has since been resolved, so this error no longer occurs. With the update, a more meaningful error is raised and it no longer breaks execution of the script.

### User ID Error

This is an error being thrown on the "user" endpoint of the API.

In [6]:
# The author column of the dataframe built above holds the Medium user ID.
articles.loc[0].author

'df4b39a6f082'

In [7]:
u = medium.user(user_id = 'df4b39a6f082')

KeyError: 'followers'

I'll look deeper into what exactly is causing this error another time.

The endpoint also returns an error for users whose accounts are under investigation:

In [8]:
url = 'https://medium2.p.rapidapi.com/user/3d738c2a082a'
headers = {
    'X-RapidAPI-Key': medium_api_key,
    'X-RapidAPI-Host': 'medium2.p.rapidapi.com'
}

requests.get(url, headers = headers).json()

{'error': 'API is working. Please check your path/query parameters. Invalid Input.'}

Again, I'll look deeper into what exactly is causing this error another time.

### Publication ID Error

This error occurs on the "publication" endpoint of the API.

In [9]:
p = medium.publication(publication_id = '519a291a42b5')

[ERROR]: Response: {'error': 'API is working but something unexpected occured. Please check your input!'}


KeyError: 'name'

Once again, I'll look into this more deeply another time. This seems to occur with publications that are under investigation.
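Until the root cause is known, one way to keep a long collection run going is to skip the IDs that fail. The sketch below wraps any lookup in a small helper; `lookup_or_none` is a hypothetical name, and it assumes the failure surfaces as a KeyError, as in the outputs above. With the real client it could be called as `lookup_or_none(lambda uid: medium.user(user_id = uid), 'df4b39a6f082')`; a stand-in function is used here so the example runs offline.

```python
def lookup_or_none(lookup, identifier):
    """Call `lookup(identifier)` and return None instead of raising on a KeyError."""
    try:
        return lookup(identifier)
    except KeyError as exc:
        print(f"Skipping {identifier}: response is missing {exc}")
        return None


def fake_user_lookup(user_id):
    """Stand-in for the user endpoint: one ID fails the way the real call did."""
    if user_id == 'df4b39a6f082':
        raise KeyError('followers')
    return {'id': user_id}


results = [lookup_or_none(fake_user_lookup, uid) for uid in ['df4b39a6f082', '1253d637b2a0']]
found = [r for r in results if r is not None]
```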