# Medium Scraping

This notebook teach you how to collect articles data from Medium, filtering with tags and release date, and put it into a csv file.

It was constructed based on [Dorian Lazar](https://dorianlazar.medium.com/) article that can be found 
[Here](https://dorianlazar.medium.com/scraping-medium-with-python-beautiful-soup-3314f898bbf5)


## Scraping stages

1. Create Filters
2. Get all articles with selected tags and release date
3. Iterates articles getting usefull infos


In [1]:
# Import Tools

import requests
from bs4 import BeautifulSoup
import pandas as pd
import random

## Create filters

We need two three filtes.

### tag_urls
A dictionary named 'urls', where the key is the tag name and the values reference the url that access the Medium archives of an specific tag.


In [2]:
tag_urls = {
    'Data Science': 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}',
    'Machine Learning': 'https://medium.com/tag/machine-learning/archive/{0}/{1:02d}/{2:02d}',
    'Artificial Inteligence': 'https://medium.com/tag/artificial-intelligence/archive/{0}/{1:02d}/{2:02d}',
    'Deep Learning': 'https://medium.com/tag/deep-learning/archive/{0}/{1:02d}/{2:02d}',
    'Data': 'https://medium.com/tag/data/archive/{0}/{1:02d}/{2:02d}',
    'Big Data': 'https://medium.com/tag/big-data/archive/{0}/{1:02d}/{2:02d}',
    'Analytics': 'https://medium.com/tag/analytics/archive/{0}/{1:02d}/{2:02d}',
}


### year

A integer that represents our articles release year.

### selected_days

This one is a bit more trick, it is a list on integers that represents the day of the year in sequential order.

Where 1 represents january 1, and 366 represents December 31.

In [4]:
year = 2020
selected_days = [i for i in range(1, 366)] #All days

## Create support functions

In [6]:
def convert_day(day):
    # if it is a leap year use month_days = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month_days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    m = 0
    d = 0
    while day > 0:
        d = day
        day -= month_days[m]
        m += 1
    return (m, d)

In [7]:
def get_claps(claps_str):
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps

## Collect Data

Collect Medium data, and put it into a list called articles_data.

In [None]:
articles_data = []
article_id = 0
n = len(selected_days)
for d in selected_days:
    month, day = convert_day(d)
    date = '{0}-{1:02d}-{2:02d}'.format(year, month, day) 
    print(f'{date}')
    for tag, url in tag_urls.items(): 
        response = requests.get(url.format(year, month, day), allow_redirects=True)
        if not response.url.startswith(url.format(year, month, day)):
            continue
        page = response.content
        soup = BeautifulSoup(page, 'html.parser')
        articles = soup.find_all(
            "div",
            class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        
        for article in articles:
            
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            
            author = article.find_all("a")[0]['href'].split('?')[0].split('@')[1]
            author_url = article.find_all("a")[0]['href'].split('?')[0]
            
            subtitle = article.find("h4", class_="graf--subtitle")
            subtitle = subtitle.contents[0] if subtitle is not None else ''
            
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            
            claps = get_claps(article.find_all("button")[1].contents[0])
            
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            
            responses = article.find_all("a")
            if len(responses) == 7:
                responses = responses[6].contents[0].split(' ')
                if len(responses) == 0:
                    responses = 0
                else:
                    responses = responses[0]
            else:
                responses = 0

            articles_data.append([article_url, title,
                         author, author_url,
                         subtitle, claps, responses,
                         reading_time, tag, date])

2020-01-01
2020-01-02
2020-01-03
2020-01-04
2020-01-05
2020-01-06
2020-01-07
2020-01-08
2020-01-09
2020-01-10
2020-01-11
2020-01-12
2020-01-13
2020-01-14
2020-01-15
2020-01-16
2020-01-17
2020-01-18
2020-01-19
2020-01-20
2020-01-21
2020-01-22
2020-01-23
2020-01-24
2020-01-25
2020-01-26
2020-01-27
2020-01-28
2020-01-29
2020-01-30
2020-01-31
2020-02-01
2020-02-02
2020-02-03
2020-02-04
2020-02-05
2020-02-06
2020-02-07
2020-02-08
2020-02-09
2020-02-10
2020-02-11
2020-02-12
2020-02-13
2020-02-14
2020-02-15
2020-02-16
2020-02-17
2020-02-18
2020-02-19
2020-02-20
2020-02-21
2020-02-22
2020-02-23
2020-02-24
2020-02-25
2020-02-26
2020-02-27
2020-02-28
2020-03-01
2020-03-02
2020-03-03
2020-03-04
2020-03-05


In [None]:
# Transform article data into panda dataframe.
medium_df = pd.DataFrame(articles_data, columns=[
    'url', 'title', 'author', 'author_page',
    'subtitle', 'claps', 'responses', 'reading_time',
    'tag', 'date'])

## Remove duplicated data

As we can search about similar tags, it can bring the same articles in different iteration, so we need to clean our collected data.

We do it using the panda fucntion drop_duplicates.

In [45]:
medium_df.shape

(7, 11)

In [46]:
medium_df = medium_df.drop_duplicates(subset=['url', 'title'], keep='first')

## The final data

Lets take a look of how our collected data looks like.

In [7]:
medium_df.shape

NameError: name 'medium_df' is not defined

In [49]:
medium_df

Unnamed: 0,id,url,title,author,author_page,subtitle,claps,responses,reading_time,publication,date
0,1,https://medium.com/coders-camp/180-data-science-and-machine-learning-projects-with-python-6191bc7b9db9,180 Data Science and Machine Learning Projects with Python,amankharwal,https://medium.com/@amankharwal,180 Data Science and Machine Learning…,1100,4,4,Data Science,2021-01-01
3,4,https://towardsdatascience.com/natural-language-generation-part-2-gpt-2-and-huggingface-f3acb35bc86a,Natural Language Generation Part 2: GPT2 and Huggingface,georgedittmar,https://towardsdatascience.com/@georgedittmar,Learn to use Huggingface and GPT-2 to train a…,39,0,8,Deep Learning,2021-01-01
4,5,https://medium.com/@scribblr42/working-with-dataviz-in-2021-whats-next-and-what-matters-356db3aa0098,[Working with #DataViz in 2021 — What’s Next? (and what matters?)],scribblr42,https://medium.com/@scribblr42,,202,0,11,Data,2021-01-01
5,6,https://medium.com/@dharmeshpanchmatia/data-analytics-and-ai-ml-platform-for-ecommerce-68639df89c7f,Data Analytics and AI/ML platform for eCommerce,dharmeshpanchmatia,https://medium.com/@dharmeshpanchmatia,Improve user pr,30,0,5,Big Data,2021-01-01
6,7,https://towardsdatascience.com/understanding-the-confusion-matrix-from-scikit-learn-c51d88929c79,Understanding the Confusion Matrix from Scikit learn,samarthagrawal86,https://towardsdatascience.com/@samarthagrawal86,Clear representation of output of confusion…,190,5,5,Analytics,2021-01-01


## Save Collected data into csv file

We save our data frame into a csv file named medium_data.

In [48]:
medium_df.to_csv('medium_data.csv', index=True)


Hope you enjoy this notebook, feel free to give sugestion or submit PRs.

Made with love by @viniciusLambert