# Medium Scraping

This notebook teaches you how to collect article data from Medium, filtered by tag and release date, and save it to a CSV file.

It is based on an article by [Dorian Lazar](https://dorianlazar.medium.com/), which can be found
[here](https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5).


## Scraping stages

1. Create filters
2. Get all articles with the selected tags and release dates
3. Iterate over the articles, extracting useful information


In [8]:
# Import Tools

import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

## Create filters

We need three filters.

### tag_urls
A dictionary named 'tag_urls', where each key is a tag name and each value is a URL template for the Medium archive of that specific tag. The three placeholders in each template are filled with the year, month, and day.


In [9]:
tag_urls = {
    'Data Science': 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}',
    'Machine Learning': 'https://medium.com/tag/machine-learning/archive/{0}/{1:02d}/{2:02d}',
    'Artificial Intelligence': 'https://medium.com/tag/artificial-intelligence/archive/{0}/{1:02d}/{2:02d}',
    'Deep Learning': 'https://medium.com/tag/deep-learning/archive/{0}/{1:02d}/{2:02d}',
    'Data': 'https://medium.com/tag/data/archive/{0}/{1:02d}/{2:02d}',
    'Big Data': 'https://medium.com/tag/big-data/archive/{0}/{1:02d}/{2:02d}',
    'Analytics': 'https://medium.com/tag/analytics/archive/{0}/{1:02d}/{2:02d}',
}
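
As a quick illustration of how these templates are used, the sketch below fills one of them with a date. The template is copied from the dictionary above; the date (January 5, 2021) is an arbitrary example.

```python
# One template copied from tag_urls; the date values are arbitrary examples
template = 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}'

# {0} is the year; {1:02d} and {2:02d} zero-pad the month and day to two digits
archive_url = template.format(2021, 1, 5)
print(archive_url)  # https://medium.com/tag/data-science/archive/2021/01/05
```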


### year

An integer that represents the release year of the articles we want.

### selected_days

This one is a bit trickier: it is a list of integers, each representing a day of the year in sequential order.

Here 1 represents January 1 and 365 represents December 31 (366 in a leap year).

In [10]:
year = 2021
# selected_days = list(range(1, 366))  # every day of a non-leap year
selected_days = list(range(1, 8))  # first week of the year
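
If you would rather think in calendar dates than in day numbers, the standard library can compute the day of the year for you. The sketch below is one possible way to build selected_days for a date range; the helper name day_of_year_range is hypothetical and not part of the original notebook.

```python
from datetime import date

def day_of_year_range(start, end):
    """Return the day-of-year numbers from start to end, inclusive.

    Hypothetical helper: both dates are assumed to fall in the same year.
    """
    first = start.timetuple().tm_yday
    last = end.timetuple().tm_yday
    return list(range(first, last + 1))

# First week of 2021, equivalent to list(range(1, 8))
selected_days = day_of_year_range(date(2021, 1, 1), date(2021, 1, 7))
print(selected_days)  # [1, 2, 3, 4, 5, 6, 7]
```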

## Create support functions

In [11]:
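def convert_day(day):
    """Convert a 1-based day of the year into a (month, day_of_month) tuple."""
    # Non-leap year. For a leap year use:
    # month_days = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    # Subtract whole months until the remainder falls inside month m
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)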
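
The month_days table above has to be edited by hand for leap years. As an alternative sketch, the datetime module can do the conversion and handles leap years on its own. The function name convert_day_with_year and its extra year parameter are assumptions for illustration, not part of the original notebook.

```python
from datetime import date, timedelta

def convert_day_with_year(year, day):
    """Convert a 1-based day of the year into (month, day_of_month).

    Hypothetical variant of convert_day that takes the year so leap years work.
    """
    target = date(year, 1, 1) + timedelta(days=day - 1)
    return (target.month, target.day)

print(convert_day_with_year(2021, 32))  # (2, 1)
print(convert_day_with_year(2021, 60))  # (3, 1)
print(convert_day_with_year(2020, 60))  # (2, 29), because 2020 is a leap year
```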

In [12]:
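def get_claps(claps_str):
    """Convert a clap label such as '257' or '1.1K' into an integer."""
    # Guard against missing or non-string input: a bs4 Tag, for example,
    # returns None for an unknown attribute such as .split
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    # A trailing 'K' yields two parts and means thousands
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps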
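
To make the behavior concrete, here are a few example inputs. The function is repeated so the block runs on its own, and the input strings are made-up examples of the labels shown next to the clap button.

```python
def get_claps(claps_str):
    # Same logic as the cell above
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps

print(get_claps('257'))   # 257
print(get_claps('1.1K'))  # 1100
print(get_claps(''))      # 0
print(get_claps(None))    # 0
```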

## Collect Data

Collect Medium data, and put it into a list called articles_data.

In [13]:
articles_data = []
for d in selected_days:
    month, day = convert_day(d)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
    print(date)
    for tag, url in tag_urls.items():
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        # Skip this tag and day if we were redirected away from the requested archive page
        if not response.url.startswith(url.format(year, month, day)):
            continue
        page = response.content
        soup = BeautifulSoup(page, 'html.parser')
        articles = soup.find_all(
            "div",
            class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        
        for article in articles:
            
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            
            author = article.find_all("a")[0]['href'].split('?')[0].split('@')[1]
            author_url = article.find_all("a")[0]['href'].split('?')[0]
            
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = subtitle.contents[0] if subtitle is not None else ''
            
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            
            claps = get_claps(article.find_all("button")[1].contents[0])
            
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            
            responses = article.find_all("a")
            if len(responses) == 7:
                responses = responses[6].contents[0].split(' ')
                if len(responses) == 0:
                    responses = 0
                else:
                    responses = responses[0]
            else:
                responses = 0

            articles_data.append([article_url, title,
                         author, author_url,
                         subtitle, claps, responses,
                         reading_time, tag, date])

2021-01-01
2021-01-02
2021-01-03
2021-01-04
2021-01-05
2021-01-06
2021-01-07
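
The string handling inside the loop is easier to follow on a concrete value. The sketch below applies the same split calls to a made-up author link and a made-up responses label; both sample strings are assumptions chosen to resemble what the loop reads from the archive page.

```python
# Made-up sample values, shaped like the ones the loop reads from the page
href = 'https://medium.com/@amankharwal?source=tag_archive'
responses_label = '4 responses'

# Drop the query string, then take what follows '@' as the author handle
author_url = href.split('?')[0]
author = author_url.split('@')[1]

# The first space-separated token of the label is the response count
responses = responses_label.split(' ')[0]

print(author_url)  # https://medium.com/@amankharwal
print(author)      # amankharwal
print(responses)   # 4 (still a string; wrap it in int() if you need a number)
```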


In [14]:
# Transform the article data into a pandas DataFrame.
medium_df = pd.DataFrame(articles_data, columns=[
    'url', 'title', 'author', 'author_page',
    'subtitle', 'claps', 'responses', 'reading_time',
    'tag', 'date'])

## Remove duplicated data

Because similar tags overlap, the same article can be collected in more than one iteration, so we need to clean our collected data.

We do this with the pandas function drop_duplicates.

In [15]:
medium_df.shape

(3074, 10)

In [16]:
medium_df = medium_df.drop_duplicates(subset=['url', 'title'], keep='first')
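
As a small illustration of what this call does, the sketch below runs it on three made-up rows in which one article appears under two tags. The URLs and titles are placeholders, not scraped data.

```python
import pandas as pd

# Made-up rows: the first and third share url and title but differ in tag
rows = [
    ['https://example.com/a', 'Post A', 'Data Science'],
    ['https://example.com/b', 'Post B', 'Data Science'],
    ['https://example.com/a', 'Post A', 'Machine Learning'],
]
df = pd.DataFrame(rows, columns=['url', 'title', 'tag'])

# Rows count as duplicates when both url and title match; keep the first one seen
deduped = df.drop_duplicates(subset=['url', 'title'], keep='first')
print(deduped.shape)  # (2, 3)
```

Note that the surviving row keeps the tag under which the article was first collected, and the original index values are preserved.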

## The final data

Let's take a look at our collected data.

In [17]:
medium_df.shape

(2002, 10)

In [18]:
medium_df

| | url | title | author | author_page | subtitle | claps | responses | reading_time | tag | date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://medium.com/coders-camp/180-data-scienc... | 180 Data Science and Machine Learning Projects... | amankharwal | https://medium.com/@amankharwal | 180 Data Science and Machine Learning… | 1100 | 4 | 4 | Data Science | 2021-01-01 |
| 1 | https://towardsdatascience.com/7-most-recommen... | 7 Most Recommended Skills to Learn in 2021 to ... | terenceshin | https://towardsdatascience.com/@terenceshin | Recommended by some of the largest… | 1000 | 10 | 6 | Data Science | 2021-01-01 |
| 2 | https://towardsdatascience.com/implementing-vi... | Implementing VisualTtransformer in PyTorch | FrancescoZ | https://towardsdatascience.com/@FrancescoZ | Hi guys, happy new year! Today we are going to... | 257 | 4 | 6 | Data Science | 2021-01-01 |
| 3 | https://towardsdatascience.com/optimal-thresho... | Optimal Threshold for Imbalanced Classification | audhiaprilliant | https://towardsdatascience.com/@audhiaprilliant | How to choose the… | 223 | 3 | 7 | Data Science | 2021-01-01 |
| 4 | https://towardsdatascience.com/understanding-t... | Understanding the Confusion Matrix from Scikit... | samarthagrawal86 | https://towardsdatascience.com/@samarthagrawal86 | Clear representation of output of confusion… | 190 | 5 | 5 | Data Science | 2021-01-01 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3069 | https://medium.com/smartbug-media/5-steps-to-s... | 5 Steps to Successfully Measure Sales and Mark... | stephenlackeythemarketer | https://medium.com/@stephenlackeythemarketer | | 0 | 0 | 4 | Analytics | 2021-01-07 |
| 3070 | https://medium.com/@niefeld/privacy-cookies-f1... | Privacy ❤ cookies | niefeld | https://medium.com/@niefeld | | 0 | 0 | 2 | Analytics | 2021-01-07 |
| 3071 | https://medium.com/@logic2020/analytics-trends... | Analytics trends in 2021: what to expect | logic2020 | https://medium.com/@logic2020 | | 0 | 0 | 4 | Analytics | 2021-01-07 |
| 3072 | https://medium.com/@bit-team/bitcoin-price-bre... | Bitcoin price breaks the $ 37,000 resistance. ... | bit-team | https://medium.com/@bit-team | | 0 | 0 | 3 | Analytics | 2021-01-07 |


## Save Collected data into csv file

We save our DataFrame to a CSV file named medium-data-2021.csv.

In [19]:
medium_df.to_csv('medium-data-2021.csv', index=True)
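
Because index=True writes the row index as an unnamed first column, reading the file back with default settings produces a column called 'Unnamed: 0'. Here is a small sketch of the round trip, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

# Made-up rows standing in for medium_df
df = pd.DataFrame({'url': ['https://example.com/a', 'https://example.com/b'],
                   'claps': [10, 0]})

buffer = io.StringIO()
df.to_csv(buffer, index=True)

# Default read: the saved index comes back as a regular column
buffer.seek(0)
plain = pd.read_csv(buffer)
print(list(plain.columns))  # ['Unnamed: 0', 'url', 'claps']

# index_col=0 restores the saved index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))  # ['url', 'claps']
```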


Hope you enjoyed this notebook. Feel free to make suggestions or submit PRs.

Made with love by @viniciusLambert