# Medium Article Webscraper
This notebook scrapes [medium.com](https://medium.com/) to collect key variables for articles published under the 'Data Science' tag. For each article, the scraper records the following:
- __title__: Title of the article
- __article_url__: URL link to the article
- __claps__: The number of claps
- __reading_time__: The time it takes to read the article in minutes
- __date__: The date that the article was published 
- __tag_list__: The tags attached to the article, including 'Data Science'



_Sources for this code include:_

(1) [Harrison Jansma](https://github.com/harrisonjansma/Medium_Scraper/blob/master/medium_scraper.py) (Apache License 2.0)

(2) [Dorian Lazar](https://github.com/lazuxd/medium-scraping/blob/master/medium_scraping.ipynb) 

For more information on how to use this scraper, refer to the [README.md](https://github.com/srpatel2000/DSA-Data-Vis.-Competition-Submission) file.

In [223]:
# import statements

import pandas as pd
import numpy as np
from bs4 import BeautifulSoup, SoupStrainer
import time
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

In [266]:
# this function was taken directly from Harrison Jansma w/ some modification to the executable path
def open_chrome():
    """Opens a chrome driver"""
    driver = webdriver.Chrome(executable_path='/Users/siddhipatel/Downloads/chromedriver_mac64/chromedriver')
    driver.implicitly_wait(30)
    return driver

In [267]:
# this function was taken directly from Dorian Lazar
def get_claps(claps_str):
    """Gets the number of claps in an article"""
    if (claps_str is None) or (claps_str == '') or (claps_str.split is None):
        return 0
    split = claps_str.split('K')
    claps = float(split[0])
    claps = int(claps*1000) if len(split) == 2 else int(claps)
    return claps
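
To make the conversion concrete, here is a small self-contained sketch of the same parsing idea. The name `parse_claps` is hypothetical, and it uses `round` instead of `int` truncation as a precaution against floating-point error, which is a deviation from the function above rather than part of the original scraper.

```python
def parse_claps(claps_str):
    """Convert a clap label such as '928' or '1.2K' to an integer."""
    # Missing, empty, or non-text labels are treated as zero claps
    if not isinstance(claps_str, str) or claps_str == '':
        return 0
    parts = claps_str.split('K')
    value = float(parts[0])
    # A trailing 'K' leaves an empty second element, so len(parts) == 2
    return round(value * 1000) if len(parts) == 2 else int(value)


for label in ['928', '3K', '1.2K', '', None]:
    print(repr(label), '->', parse_claps(label))
```

Labels without a `K` suffix are read as plain counts, and labels with one are scaled by 1000.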

In [299]:
# main script that scrapes the articles

data = [] # stores the data gathered by the scraper
year = 2020 # gathers data only in 2020
month = 1 # starts on the month of January 
num_days = [31, 29, 31, 30, 31, 30, 31, 27] # days per month, January through August 2020 (August stops at the 27th)
url = 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}' # base url 

for days in num_days:
    chrome_driver = open_chrome()
    start_date = np.random.randint(1, (days/2)-1) # random start date
    end_date = np.random.randint(days/2, days) # random end date
    for i in range(start_date, end_date):
        date = '{0}-{1:02d}-{2:02d}'.format(year, month, i)
        print(date + " " + str(start_date) + " " + str(end_date))
        
        # loads the archive page for this day (get() returns nothing, so no result is stored)
        chrome_driver.get(url.format(year, month, i))
        strainer = SoupStrainer('div')
        # gathers all the articles in a specific day of the month
        soup = BeautifulSoup(chrome_driver.page_source, 'lxml', parse_only=strainer)
        articles = soup.find_all(
            "div",
            class_="postArticle postArticle--short js-postArticle js-trackPostPresentation js-trackPostScrolls")
        
        for article in articles:
            # grabs the title of the article
            title = article.find("h3", class_="graf--title")
            if title is None:
                continue
            title = title.contents[0]
            
            # grabs the article url
            article_url = article.find_all("a")[3]['href'].split('?')[0]
            
            # grabs the number of claps in a given article
            claps = get_claps(article.find_all("button")[1].contents[0])
            
            # grabs the reading time of a given article
            reading_time = article.find("span", class_="readingTime")
            reading_time = 0 if reading_time is None else int(reading_time['title'].split(' ')[0])
            
            # grabs the related tags under an article
            chrome_driver.get(article_url)
            strainer = SoupStrainer('ul')
            tags_soup = BeautifulSoup(chrome_driver.page_source, 'lxml', parse_only=strainer)
            try:
                # reads the links inside the last <ul> on the article page
                tags = tags_soup.find_all("ul")[-1].findChildren("a")
            except IndexError:
                # the page has no <ul>, so there are no tags to read
                tags = None
            if tags is None:
                continue
            tag_list = [tag.contents[0] for tag in tags]
            # if Data Science isn't a tag that means the article was only for paid members
            # so the article was skipped and the data was not inputted into the table
            if 'Data Science' not in tag_list:
                continue
                
            time.sleep(2)
            
            data.append([title, article_url, claps, reading_time, date, tag_list])
    # closes this month's browser window before pausing
    chrome_driver.quit()
    time.sleep(10)
    # moves onto the next month
    month += 1

2020-01-03 3 16
2020-01-04 3 16
2020-01-05 3 16
2020-01-06 3 16
2020-01-07 3 16
2020-01-08 3 16
2020-01-09 3 16
2020-01-10 3 16
2020-01-11 3 16
2020-01-12 3 16
2020-01-13 3 16
2020-01-14 3 16
2020-01-15 3 16
2020-02-01 1 26
2020-02-02 1 26
2020-02-03 1 26
2020-02-04 1 26
2020-02-05 1 26
2020-02-06 1 26
2020-02-07 1 26
2020-02-08 1 26
2020-02-09 1 26
2020-02-10 1 26
2020-02-11 1 26
2020-02-12 1 26
2020-02-13 1 26
2020-02-14 1 26
2020-02-15 1 26
2020-02-16 1 26
2020-02-17 1 26
2020-02-18 1 26
2020-02-19 1 26
2020-02-20 1 26
2020-02-21 1 26
2020-02-22 1 26
2020-02-23 1 26
2020-02-24 1 26
2020-02-25 1 26
2020-03-03 3 30
2020-03-04 3 30
2020-03-05 3 30
2020-03-06 3 30
2020-03-07 3 30
2020-03-08 3 30
2020-03-09 3 30
2020-03-10 3 30
2020-03-11 3 30
2020-03-12 3 30
2020-03-13 3 30
2020-03-14 3 30
2020-03-15 3 30
2020-03-16 3 30
2020-03-17 3 30
2020-03-18 3 30
2020-03-19 3 30
2020-03-20 3 30
2020-03-21 3 30
2020-03-22 3 30
2020-03-23 3 30
2020-03-24 3 30
2020-03-25 3 30
2020-03-26 3 30
2020-03-
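
The day counts in `num_days` can also be derived from the standard library. The sketch below is one possible way to build the sampled archive URLs without opening a browser. `sample_archive_urls` and its `seed` parameter are hypothetical names, it samples with Python's `random` module rather than `np.random`, and it uses full calendar months, so it does not reproduce the August cutoff at the 27th or the exact windows logged above.

```python
import calendar
import random

ARCHIVE_URL = 'https://medium.com/tag/data-science/archive/{0}/{1:02d}/{2:02d}'


def sample_archive_urls(year, months, seed=None):
    """Return (date, url) pairs for one random window of days per month."""
    rng = random.Random(seed)
    pairs = []
    for month in months:
        days = calendar.monthrange(year, month)[1]  # number of days in this month
        half = days // 2
        start = rng.randrange(1, half)      # first day, drawn from 1 .. half - 1
        end = rng.randrange(half, days)     # stop day, drawn from half .. days - 1
        for day in range(start, end):       # the stop day itself is excluded
            date = '{0}-{1:02d}-{2:02d}'.format(year, month, day)
            pairs.append((date, ARCHIVE_URL.format(year, month, day)))
    return pairs


pairs = sample_archive_urls(2020, range(1, 4), seed=0)
print(pairs[0])
```

Fixing `seed` makes the sampled window repeatable, which helps when a run has to be resumed after an interruption.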

In [304]:
medium_data = pd.DataFrame(data, columns = ['title', 'article_url', 'claps', 'reading_time', 'date', 'tag_list'])
medium_data.head()

| | title | article_url | claps | reading_time | date | tag_list |
|---|---|---|---|---|---|---|
| 0 | Top 10 Technology Trends for 2020 | https://towardsdatascience.com/top-10-technolo... | 3000 | 10 | 2020-01-03 | [Technology, Trends, Artificial Intelligence, ... |
| 1 | Top 10 Skills for a Data Scientist | https://towardsdatascience.com/top-10-skills-f... | 2200 | 9 | 2020-01-03 | [Data Science, Technology, Business, Machine L... |
| 2 | ML Ops: Machine Learning as an Engineering Dis... | https://towardsdatascience.com/ml-ops-machine-... | 1300 | 10 | 2020-01-03 | [Data Science, Machine Learning, Data Engineer... |
| 3 | Organizing your Python Code | https://medium.com/@k3no/organizing-your-pytho... | 1200 | 13 | 2020-01-03 | [Python, Programming, Data Science, Coding, So... |
| 4 | How to be fancy with OOP in Python | https://towardsdatascience.com/how-to-be-fancy... | 928 | 3 | 2020-01-03 | [Programming, Python, Data Science, Coding] |


In [303]:
medium_data.to_csv('medium_articles.csv', header=True, index = False)
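
Because `tag_list` holds Python lists, `to_csv` writes each one as its string representation, so the column comes back as plain text when the file is reloaded. Below is a minimal sketch of recovering the lists with the standard library. The sample row is illustrative (its URL is a placeholder, not the real article link), `load_articles` is a hypothetical helper, and the sketch assumes each tag is written as a quoted string.

```python
import ast
import csv
import io

# Illustrative stand-in for medium_articles.csv; the URL is a placeholder
SAMPLE_CSV = '''title,article_url,claps,reading_time,date,tag_list
How to be fancy with OOP in Python,https://example.com/article,928,3,2020-01-03,"['Programming', 'Python', 'Data Science', 'Coding']"
'''


def load_articles(handle):
    """Read scraped rows and convert text columns back to Python types."""
    rows = []
    for row in csv.DictReader(handle):
        row['claps'] = int(row['claps'])
        row['reading_time'] = int(row['reading_time'])
        # literal_eval parses the "['a', 'b']" text back into a list without running code
        row['tag_list'] = ast.literal_eval(row['tag_list'])
        rows.append(row)
    return rows


articles = load_articles(io.StringIO(SAMPLE_CSV))
print(articles[0]['tag_list'])
```

With pandas, a similar effect should be achievable by passing `converters={'tag_list': ast.literal_eval}` to `pd.read_csv`.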