# Building a Labeled Dataset

In this notebook we'll walk through building the dataset that will be used to train and test the recommendation model. We want a wide variety of article data to give the model the best chance of producing good recommendations.

## Imports

These are the modules we'll need to run the code in this notebook.

In [1]:
# We'll need to add the modules directory path to sys.path since this is where all the .py files are located.

import sys
sys.path.append('../modules')

In [2]:
import pandas as pd

# The Medium API will be used to acquire data.
# Medium API Documentation: https://github.com/weeping-angel/medium-api

from medium_api import Medium

# The News API will be used to acquire data from a large variety of sources.
# News API Documentation: https://newsapi.org/docs

from newsapi import NewsApiClient

# We need requests and Beautiful Soup in order to scrape web data.
import requests
from bs4 import BeautifulSoup

# We need API keys for the Medium API and the News API; both are imported from env below.
# The Medium API key comes from rapidapi.com: create an account there and subscribe to the Medium API.
# https://rapidapi.com/nishujain199719-vgIfuFHZxVZ/api/medium2/pricing

from env import medium_api_key, news_api_key
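The `env` module itself is not shown in this notebook. As a minimal sketch of one way it could be laid out, the snippet below reads both keys from environment variables so they never end up in version control. The variable names `MEDIUM_API_KEY` and `NEWS_API_KEY` and the helpers `read_key` and `load_keys` are assumptions made for illustration, not the actual contents of `env.py`.

```python
import os


def read_key(name, environ=os.environ):
    """Return the value stored under `name`, or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return value


def load_keys(environ=os.environ):
    """Return (medium_api_key, news_api_key) from hypothetical variable names."""
    return read_key("MEDIUM_API_KEY", environ), read_key("NEWS_API_KEY", environ)


# In a hypothetical env.py this would be the only top-level statement:
# medium_api_key, news_api_key = load_keys()
```

Passing the mapping in as an argument keeps the helpers easy to try out with a plain dictionary before any real keys are configured.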

## What Data do we Need?

At the time of writing I have two sources of data: Medium (web scraping and API) and the News API. We're going to build two separate models: one to label the topic of each article and another to recommend articles based on a user's preferences. The Medium API returns the topic labels for each article, so we'll use it to build a dataset with topic labels. For a good balance of sources, we'll use both the Medium and News APIs to build a second dataset of articles that will be labeled with user preferences.

Labeling the user preferences is going to be a time-consuming process. These labels will be created manually by having different users go through the dataset and choose which articles they would find interesting to read. This process won't be shown here, but the final result can be seen in the exploration notebook.

## Building the Topic Label Dataset

In this section we'll go over the steps involved in using the Medium API to build a dataset of articles labeled with the topics each article is categorized under. First, though, we must decide on a list of broad topics to cover. We won't be able to cover every possible topic, but we want to cover a wide range. Keep in mind that the topic labeling model will need to be retrained periodically so that it can label topics it has not yet seen. For now we'll gather articles covering the following topics:

1. data science
1. machine learning
1. programming
1. social media
1. self
1. philosophy
1. gaming
1. politics
1. health
1. psychology
1. work
1. productivity
1. parenting
1. history
1. cryptocurrency
1. science
1. relationships
1. food
1. technology
1. business

Again, this is not a comprehensive list of topic categories, but it provides a good starting point and, most importantly, will help during exploration to determine what relates the articles within each topic category to each other. Here we have 20 different topics. The Medium API will allow me to gather data for 200 articles, so we'll grab 10 articles for each topic, taking the latest articles published.

In [57]:
# Let's start by creating a set for all the topics we'll be using.

topics = {
    "data-science",
    "machine-learning",
    "programming",
    "social-media",
    "self",
    "philosophy",
    "gaming",
    "politics",
    "health",
    "psychology",
    "work",
    "productivity",
    "parenting",
    "history",
    "cryptocurrency",
    "science",
    "relationships",
    "food",
    "technology",
    "business"
}
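One thing to be aware of: a Python set has no guaranteed iteration order, and string hashing is randomized between interpreter sessions by default, so the order in which topics are visited (and therefore which article ends up at a given position in the collected list) can change from run to run. A minimal sketch of one way to make the order reproducible and to check the article budget is shown below. The names `example_topics` and `articles_per_topic` are made up for illustration.

```python
# Illustrative subset of the topic slugs used in this notebook.
example_topics = {"data-science", "machine-learning", "programming", "self"}

# Sorting fixes the iteration order, so list positions are reproducible across runs.
ordered_topics = sorted(example_topics)


def articles_per_topic(total_budget, n_topics):
    """Split a fixed article budget evenly across topics (integer division)."""
    return total_budget // n_topics


print(ordered_topics)
print(articles_per_topic(200, 20))  # 10
```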

In [4]:
# Next we create the Medium API object using the API key.

medium = Medium(medium_api_key)

In [51]:
# Now we'll loop through each topic and build a list containing the article data for up to 10 articles per topic.

article_info = []

for topic in topics:
    latestposts = medium.latestposts(topic_slug = topic)
    
    # Slicing avoids an IndexError if a topic returns fewer than 10 articles.
    for article in latestposts.articles[:10]:
        article_info.append(article.info)
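The loop above assumes every topic request succeeds. Since the LatestPosts Error section of this notebook shows that a request can fail with a `KeyError`, here is a hedged sketch of a more defensive version of the same loop. `collect_article_info` and its `fetch_info` argument are hypothetical: `fetch_info` stands in for whatever function returns a list of article info dictionaries for a topic slug, and the fake fetcher below only imitates the API so the sketch runs without a key.

```python
def collect_article_info(topics, fetch_info, per_topic=10):
    """Gather up to `per_topic` info dicts per topic and record topics that fail.

    `fetch_info` is a hypothetical callable: given a topic slug it returns a
    list of article info dicts, or raises if the API call goes wrong.
    """
    collected, failed = [], []
    for topic in sorted(topics):  # sorted for a reproducible order
        try:
            infos = fetch_info(topic)
        except (KeyError, IndexError) as exc:
            failed.append((topic, repr(exc)))
            continue
        collected.extend(infos[:per_topic])
    return collected, failed


def fake_fetch(topic):
    """Stand-in for the real API: 'astronomy' fails, other topics return 12 items."""
    if topic == "astronomy":
        raise KeyError("latestposts")
    return [{"id": f"{topic}-{i}"} for i in range(12)]


info, failed = collect_article_info({"food", "astronomy"}, fake_fetch)
print(len(info), failed)
```

Recording the failed topics instead of stopping means one bad topic slug does not throw away the articles already collected for the others.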

In [56]:
article_info[80]

{'id': 'ad8abfe86025',
 'tags': ['parenting',
  'family',
  'relationships',
  'life-lessons',
  'mental-health'],
 'claps': 351,
 'last_modified_at': '2022-07-11 22:59:37',
 'published_at': '2022-07-11 22:59:37',
 'url': 'https://medium.com/bouncin-and-behavin-blogs/why-spanking-your-child-is-lazy-ineffective-parenting-ad8abfe86025',
 'image_url': 'https://miro.medium.com/0*un2EfqV458H7wQsl',
 'lang': 'en',
 'publication_id': '7debdac7d2c2',
 'title': 'Why Spanking Your Child Is Lazy, Ineffective Parenting',
 'word_count': 1879,
 'reading_time': 7.9238993710692,
 'voters': 8,
 'topics': ['family', 'parenting'],
 'subtitle': 'You will damage your child more than you’ll know by hitting them',
 'author': 'be7e9c3be38b'}

So now we have the article info for all the articles we pulled from the API. Let's now put this all in a dataframe with the data we want so that this information can be saved to a file.

In [61]:
# Collect one row per article, then build the dataframe in a single step.
# This avoids calling pd.concat inside the loop, which copies the whole frame on every iteration.

columns = ['author', 'publication', 'title', 'subtitle', 'date', 'read_time', 'url', 'topics']

rows = []

for article in article_info:
    rows.append({
        'author' : article['author'],
        'publication' : article['publication_id'],
        'title' : article['title'],
        'subtitle' : article['subtitle'],
        'date' : article['published_at'],
        'read_time' : round(article['reading_time'], 0),
        'url' : article['url'],
        'topics' : article['topics']
    })

articles = pd.DataFrame(rows, columns = columns)

articles

|  | author | publication | title | subtitle | date | read_time | url | topics |
|---|---|---|---|---|---|---|---|---|
| 0 | 77397c9e27 | *Self-Published* | Let’s implement a real-time package tracking a... | What is this app? | 2022-07-12 13:10:38 | 3.0 | https://abdulsamet-ileri.medium.com/lets-imple... | [javascript, programming] |
| 1 | 5d719eab784c | *Self-Published* | Coding Projects to Help You Learn | Tackle these beginner projects during your #10... | 2022-07-12 13:03:03 | 3.0 | https://foxandcrow.medium.com/coding-projects-... | [programming] |
| 2 | 9e065ebdbf01 | *Self-Published* | The proper and easiest way to set cron jobs. | Your search for THE article dealing with cron ... | 2022-07-12 13:29:39 | 3.0 | https://wbarillon.medium.com/the-proper-and-ea... | [programming] |
| 3 | 2adc5a07e772 | 7f60cf5620c9 | Pandas vs Dask vs Datatable: A Performance Com... | Pandas might not be the best option anymore | 2022-07-12 15:24:09 | 5.0 | https://towardsdatascience.com/pandas-vs-dask-... | [data-science, programming] |
| 4 | 60cad4f7b35a | *Self-Published* | My unsolicited tidbits of advice on software e... | I started writing this a couple months back, a... | 2022-07-12 01:04:44 | 12.0 | https://atherqureshi.medium.com/my-unsolicited... | [work, programming] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 195 | d8c51a1b6638 | *Self-Published* | Redefining Who You Are After the End of a Rela... | You can be anybody you want to be. | 2022-07-12 14:27:50 | 4.0 | https://markemilee.medium.com/redefining-who-y... | [relationships, self] |
| 196 | 22a7f265ec10 | e755eb736cc6 | The Horror of Bipolar Misdiagnosis as an Autis... | Bipolar disease is a common misdiagnosis in ne... | 2022-07-12 14:04:32 | 3.0 | https://medium.com/artfullyautistic/the-horror... | [mental-health, disability, self] |
| 197 | e567f693be3e | ca96e9e3978f | 6 Ways to Use Roam Research to Transform the W... | You haven’t used it right until you can’t live... | 2022-07-12 16:15:33 | 5.0 | https://medium.com/skilluped/6-ways-to-use-roa... | [self] |
| 198 | 20e9ad0b837b | *Self-Published* | 5 Things I wish I would have done more in my e... | And one I wish I hadn’t done at all. | 2022-07-12 13:36:00 | 4.0 | https://niallleah.medium.com/5-things-i-wish-i... | [self] |


In [62]:
articles.to_csv('../data/articles-with-topic-label.csv', index = False)
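A caveat when saving to CSV: the `topics` column holds Python lists, and a CSV file can only store them as text. When the file is read back, each value is a string such as `"['family', 'parenting']"` rather than a list. The sketch below shows one way to restore the lists with `ast.literal_eval`. It uses a tiny made-up frame and an in-memory buffer instead of the real file, so nothing on disk is touched.

```python
import ast
import io

import pandas as pd

# Made-up single-row frame standing in for the real articles dataframe.
frame = pd.DataFrame({
    "title": ["Example article"],
    "topics": [["family", "parenting"]],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "topics"]))  # the list comes back as a string

# Parse the string representation back into a real list.
reloaded["topics"] = reloaded["topics"].apply(ast.literal_eval)
print(reloaded.loc[0, "topics"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for text read from a file.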

### LatestPosts Error

The cells below reproduce an error thrown by the Medium API package when requesting the latest posts for the 'astronomy' topic.

In [20]:
latestposts = medium.latestposts(topic_slug = 'astronomy')

In [21]:
latestposts.articles[4].info

[ERROR]: Response: {'error': 'API is working but something unexpected occured. Please check your input!'}


KeyError: 'latestposts'

The error originates at line 82 of 'medium-api/src/medium_api/medium.py', which returns an empty dictionary as the response. Line 37 of 'medium-api/src/medium_api/_latestposts.py' then evaluates

```python
resp['latestposts']
```

and a KeyError is raised because the dictionary is empty. The call returns status 200, but an error in the JSON data forces the

```python
__get_resp()
```

function to retry the call, which is allowed at most 3 times. To figure out what is going on, I'll need to inspect the JSON data returned by calling the API with a plain HTTP request instead of the Python package.
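A rough sketch of how that raw request could look is given below. It relies on assumptions this notebook does not confirm: the RapidAPI host name `medium2.p.rapidapi.com`, the endpoint path `/latestposts/{topic_slug}`, and the helper names `build_request`, `describe_payload` and `fetch_raw`, which are made up for illustration. It uses only the standard library so it runs without extra packages; the `requests` library imported earlier would work just as well.

```python
import json
import urllib.request

# Assumed value, not confirmed by this notebook: the RapidAPI host for the Medium API.
RAPIDAPI_HOST = "medium2.p.rapidapi.com"


def build_request(topic_slug, api_key):
    """Build the HTTP request for a topic's latest posts (endpoint path is an assumption)."""
    url = f"https://{RAPIDAPI_HOST}/latestposts/{topic_slug}"
    headers = {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": RAPIDAPI_HOST}
    return urllib.request.Request(url, headers=headers)


def describe_payload(payload):
    """Summarize a decoded JSON body so a missing 'latestposts' key is easy to spot."""
    if "latestposts" in payload:
        return f"ok: {len(payload['latestposts'])} entries"
    return f"missing 'latestposts'; keys present: {sorted(payload)}"


def fetch_raw(topic_slug, api_key, timeout=10):
    """Send the request and return (HTTP status, decoded JSON body)."""
    request = build_request(topic_slug, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.load(response)


# Example call (needs a valid key and network access, so it is left commented out):
# status, payload = fetch_raw("astronomy", medium_api_key)
# print(status, describe_payload(payload))
```

Printing the status next to the summarized body should make it clear whether a 200 response is carrying an error message instead of the expected data.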