# Medium Scraping

This notebook shows how to collect article data from Medium, filtered by tag and release date, and save it to a CSV file.

It is based on an article by [Dorian Lazar](https://dorianlazar.medium.com/), which can be found
[here](https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5).


## Scraping stages

1. Create filters
2. Get all articles with the selected tags and release dates
3. Iterate over the articles, extracting useful information


In [1]:
# Import Tools

import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

## Create filters

We need three filters.

### tag_urls
A dictionary named tag_urls, where each key is a tag name and each value is the URL template for that tag's Medium archive. The placeholders {0}, {1:02d} and {2:02d} are filled with the year, month and day.


In [2]:
tag_urls = {
    'Data Science': 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}',
    'Machine Learning': 'https://medium.com/tag/machine-learning/archive/{0}/{1:02d}/{2:02d}',
    'Artificial Intelligence': 'https://medium.com/tag/artificial-intelligence/archive/{0}/{1:02d}/{2:02d}',
    'Deep Learning': 'https://medium.com/tag/deep-learning/archive/{0}/{1:02d}/{2:02d}',
    'Data': 'https://medium.com/tag/data/archive/{0}/{1:02d}/{2:02d}',
    'Big Data': 'https://medium.com/tag/big-data/archive/{0}/{1:02d}/{2:02d}',
    'Analytics': 'https://medium.com/tag/analytics/archive/{0}/{1:02d}/{2:02d}',
}
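
As a quick illustration of how these templates are used, the sketch below fills one of them with a year, month and day using `str.format`. The date is arbitrary, and only a single entry of tag_urls is reproduced so that the snippet runs on its own.

```python
# One entry copied from tag_urls above so this snippet is self-contained.
template = 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}'

# {0} is the year; {1:02d} and {2:02d} zero-pad the month and day to two digits.
archive_url = template.format(2020, 9, 7)
print(archive_url)
```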


### year

An integer that represents the release year of the articles.

### selected_days

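This one is a bit trickier: it is a list of integers, each representing a day of the year in sequential order.

Here 1 represents January 1 and 365 represents December 31, because convert_day (defined below) uses a non-leap month table. With the leap-year table, 366 would represent December 31.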

In [3]:
year = 2020
selected_days = list(range(1, 366))  # Days 1 to 365 (every day of a non-leap year)
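
Scraping every day of the year can take a long time. One possible way to shorten a run, sketched here as an assumption rather than as part of the original workflow, is to scrape a random sample of days. The sample size of 50 and the seed are arbitrary illustrative values, and `sampled_days` is a hypothetical name.

```python
import random

# Hypothetical settings: the sample size and seed are illustrative choices.
sample_size = 50
rng = random.Random(42)  # fixed seed so the same days are picked on every run

# Draw distinct day numbers from 1..365 and sort them into calendar order.
sampled_days = sorted(rng.sample(range(1, 366), sample_size))
print(sampled_days[:5])
```

Assigning `sampled_days` to selected_days would make the collection loop visit only those days.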


## Create support functions

In [4]:
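def convert_day(day):
    """Convert a 1-based day of the year into a (month, day_of_month) tuple."""
    # Non-leap month lengths. For a leap year (such as 2020) use
    # month_days = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    # with the table below, February 29 is never visited.
    month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    # Subtract whole months until the remainder falls inside month m.
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)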
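
The month table in convert_day assumes a non-leap year. A leap-aware alternative can be sketched with the standard `datetime` module, which knows the calendar for any year. The name `convert_day_calendar` and its `year` parameter are hypothetical and not part of the original notebook.

```python
from datetime import date, timedelta

def convert_day_calendar(day, year):
    """Convert a 1-based day of the year into (month, day_of_month) for the given year."""
    # Start at January 1 and step forward (day - 1) days; datetime handles leap years.
    target = date(year, 1, 1) + timedelta(days=day - 1)
    return (target.month, target.day)

print(convert_day_calendar(250, 2019))  # non-leap year
print(convert_day_calendar(250, 2020))  # leap year, one calendar day earlier
```

Note that the result for a leap year differs from convert_day for every day after February 28, so the two functions should not be mixed within one run.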

In [5]:
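def get_claps(claps_str):
    """Parse a clap count such as '278' or '1.2K' into an integer."""
    # BeautifulSoup may hand us None, an empty string, or a Tag instead of text
    # (a Tag resolves unknown attributes such as .split via find(), giving None).
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    # A trailing 'K' means thousands; round to avoid float truncation.
    claps = int(round(claps * 1000)) if len(split) == 2 else int(claps)
    return claps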
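
A slightly more defensive variant of this parsing is sketched below. It is an illustration only: the helper name `parse_count` and the `SUFFIXES` table are hypothetical, and the 'M' suffix is an assumption, since the original code handles only 'K'.

```python
# Hypothetical suffix table; 'M' is an assumption, the original handles only 'K'.
SUFFIXES = {'K': 1_000, 'M': 1_000_000}

def parse_count(text):
    """Parse strings such as '278' or '1.2K' into an int; return 0 if unparseable."""
    if not isinstance(text, str):
        return 0
    text = text.strip()
    if not text:
        return 0
    # Look at the last character to decide whether a multiplier applies.
    multiplier = SUFFIXES.get(text[-1].upper(), 1)
    number = text[:-1] if multiplier != 1 else text
    try:
        # round() avoids losing a unit to floating-point truncation.
        return int(round(float(number) * multiplier))
    except ValueError:
        return 0

print(parse_count('1.2K'), parse_count('278'), parse_count(''))
```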

## Collect Data

Collect Medium data, and put it into a list called articles_data.

In [6]:
articles_data = []
for d in selected_days:
    month, day = convert_day(d)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    print(date)
    for tag, url in tag_urls.items(): 
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        if not response.url.startswith(url.format(year, month, day)):
            continue
        page = response.content
        soup = BeautifulSoup(page, 'html.parser')
        articles = soup.find_all(
            "div",
            class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        
        for article in articles:
            
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            
            author = article.find_all("a")[0]['href'].split('?')[0].split('@')[1]
            author_url = article.find_all("a")[0]['href'].split('?')[0]
            
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = subtitle.contents[0] if subtitle is not None else ''
            
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            
            claps = get_claps(article.find_all("button")[1].contents[0])
            
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            
            responses = article.find_all("a")
            if len(responses) == 7:
                responses = responses[6].contents[0].split(' ')
                if len(responses) == 0:
                    responses = 0
                else:
                    responses = responses[0]
            else:
                responses = 0

            articles_data.append([article_url, title,
                         author, author_url,
                         subtitle, claps, responses,
                         reading_time, tag, date])

2020-09-07
2020-09-08
2020-09-09
2020-09-10
2020-09-11
2020-09-12
2020-09-13
2020-09-14
2020-09-15
2020-09-16
2020-09-17
2020-09-18
2020-09-19
2020-09-20
2020-09-21
2020-09-22
2020-09-23
2020-09-24
2020-09-25
2020-09-26
2020-09-27
2020-09-28
2020-09-29
2020-09-30
2020-10-01
2020-10-02
2020-10-03
2020-10-04
2020-10-05
2020-10-06
2020-10-07
2020-10-08
2020-10-09
2020-10-10
2020-10-11
2020-10-12
2020-10-13
2020-10-14
2020-10-15
2020-10-16
2020-10-17
2020-10-18
2020-10-19
2020-10-20
2020-10-21
2020-10-22
2020-10-23
2020-10-24
2020-10-25
2020-10-26
2020-10-27
2020-10-28
2020-10-29
2020-10-30
2020-10-31
2020-11-01
2020-11-02
2020-11-03
2020-11-04
2020-11-05
2020-11-06
2020-11-07
2020-11-08
2020-11-09
2020-11-10
2020-11-11
2020-11-12
2020-11-13
2020-11-14
2020-11-15
2020-11-16
2020-11-17
2020-11-18
2020-11-19
2020-11-20
2020-11-21
2020-11-22
2020-11-23
2020-11-24
2020-11-25
2020-11-26
2020-11-27
2020-11-28
2020-11-29
2020-11-30
2020-12-01
2020-12-02
2020-12-03
2020-12-04
2020-12-05
2020-12-06
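
The loop above sends one request per tag per day without a timeout or a pause. A minimal sketch of a more cautious fetch step follows, assuming a requests-style session object (for example `requests.Session()`). The helper name `fetch_archive_page` and the `timeout` and `delay` defaults are hypothetical choices, not part of the original notebook.

```python
import time

def fetch_archive_page(session, url, timeout=10, delay=1.0):
    """Fetch one archive URL; return its HTML bytes, or None if it should be skipped.

    `session` is any object with a requests-style get() method,
    e.g. requests.Session().
    """
    response = session.get(url, allow_redirects=True, timeout=timeout)
    time.sleep(delay)  # pause between requests to avoid hammering the server
    # Same redirect check as the loop above: skip pages that redirected elsewhere.
    if response.status_code != 200 or not response.url.startswith(url):
        return None
    return response.content

# Example use inside the loop (requires the requests package):
#   session = requests.Session()
#   page = fetch_archive_page(session, url.format(year, month, day))
#   if page is None:
#       continue
```

Reusing a single session lets the underlying connection be shared across requests, and the timeout keeps one stalled request from blocking the whole run.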

In [7]:
# Transform article data into a pandas DataFrame.
medium_df = pd.DataFrame(articles_data, columns=[
    'url', 'title', 'author', 'author_page',
    'subtitle', 'claps', 'responses', 'reading_time',
    'tag', 'date'])

In [8]:
convert_day(250)

(9, 7)

## Remove duplicated data

Because we search several related tags, the same article can appear under more than one tag, so we need to remove duplicates from the collected data.

We do this with the pandas function drop_duplicates.

In [9]:
medium_df.shape

(51869, 10)

In [10]:
medium_df = medium_df.drop_duplicates(subset=['url', 'title'], keep='first')
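
To make the effect of `keep='first'` with `subset=['url', 'title']` concrete, here is a plain-Python sketch of the same idea applied to row lists shaped like articles_data. It is an illustrative equivalent, not how pandas implements it. The helper name `drop_duplicate_rows` and the sample rows are made up.

```python
def drop_duplicate_rows(rows, key_indexes=(0, 1)):
    """Keep the first row for each distinct key; the key defaults to (url, title)."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows

# Made-up rows: the same article found under two different tags.
rows = [
    ['https://example.com/a', 'Post A', 'Data Science'],
    ['https://example.com/a', 'Post A', 'Machine Learning'],
    ['https://example.com/b', 'Post B', 'Data'],
]
unique = drop_duplicate_rows(rows)
print(len(unique))  # 2
```

As in the notebook, the surviving row keeps the tag under which the article was first seen.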

## The final data

Let's take a look at the collected data.

In [11]:
medium_df.shape

(34053, 10)

In [12]:
medium_df

| | url | title | author | author_page | subtitle | claps | responses | reading_time | tag | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://towardsdatascience.com/text-classifica... | BERT for Text Classification with NO model tra... | mdipietro09 | https://towardsdatascience.com/@mdipietro09 | Use BERT, Word Embedding, and Vector Similarity… | 278 | 6 | 14 | Data Science | 2020-09-07 |
| 1 | https://towardsdatascience.com/predictive-main... | Predictive maintenance of turbofan engines | kpeters_ | https://towardsdatascience.com/@kpeters_ | | 151 | 0 | 9 | Data Science | 2020-09-07 |
| 2 | https://towardsdatascience.com/minimal-pytorch... | The Most Complete Guide to PyTorch for Data Sc... | mlwhiz | https://towardsdatascience.com/@mlwhiz | All the PyTorch functionality you will ever… | 706 | 3 | 14 | Data Science | 2020-09-07 |
| 3 | https://towardsdatascience.com/using-genetic-a... | Using Genetic Algorithms to Train Neural Networks | vs1324 | https://towardsdatascience.com/@vs1324 | | 213 | 7 | 5 | Data Science | 2020-09-07 |
| 4 | https://towardsdatascience.com/compas-case-stu... | COMPAS Case Study: Fairness of a Machine Learn... | farhanrahman02 | https://towardsdatascience.com/@farhanrahman02 | | 56 | 0 | 8 | Data Science | 2020-09-07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51863 | https://medium.com/@vandomed/for-second-consec... | For Second Consecutive Presidential Election, ... | vandomed | https://medium.com/@vandomed | | 0 | 0 | 2 | Analytics | 2020-12-31 |
| 51864 | https://medium.com/@maxdymel/2019-vs-2020-the-... | 2019 vs. 2020 — The last Formula 1 seasons com... | maxdymel | https://medium.com/@maxdymel | The 2020 Formula 1 season is over. With… | 0 | 0 | 5 | Analytics | 2020-12-31 |
| 51865 | https://medium.com/@jmexclusives/why-are-uniqu... | Why are Unique Visitors so Important in Websit... | jmexclusives | https://medium.com/@jmexclusives | | 0 | 0 | 10 | Analytics | 2020-12-31 |
| 51866 | https://medium.com/@bryancjavier/periodismo-de... | Periodismo de Analytics | bryancjavier | https://medium.com/@bryancjavier | | 0 | 0 | 3 | Analytics | 2020-12-31 |


## Save Collected data into csv file

We save our data frame to a CSV file named medium_data.csv.

In [13]:
medium_df.to_csv('medium_data.csv', index=True)


Hope you enjoy this notebook. Feel free to give suggestions or submit PRs.

Made with love by @viniciusLambert