# Sample Code for Scraping GitHub.com Website
## The code in this notebook and repository is a companion for the tutorial entitled:
## The Challenges and Opportunities Mining GitHub

The notebook is split into sections that show how different data collection tasks can be performed.

Do not forget to look into utils.py as well.

# Scraping Websites

You will need to install the excellent [requests_html](https://html.python-requests.org) library for Python.

You will also need the selector path or XPath for the items you would like to extract from a website.

Using the browser's developer tools and their element selector, we identified the following selector paths:


In [1]:
# selector path for pulse in https://github.com/django/django/pulse
pulse_sp = "#js-repo-pjax-container > div.container.new-discussion-timeline.experiment-repo-nav > div.repository-content > div > div.col-9 > div.authors-and-code > div.section.diffstat-summary.v-align-top.pt-3.js-pulse-contribution-data > div"

# repo overview (top bar with contributor and license info) in https://github.com/django/django/
repo_overview_sp = "#js-repo-pjax-container > div.container.new-discussion-timeline.experiment-repo-nav > div.repository-content > div.overall-summary.overall-summary-bottomless"

# Scraping Example With JavaScript-Rendered Content

In [2]:
# libraries needed
import pandas as pd
from requests_html import HTMLSession

# nest_asyncio lets render() run inside Jupyter's already-running event loop
# install it first with: pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()

# fetch the html page
session = HTMLSession()
r = session.get("https://github.com/django/django/pulse")

# render the page's JavaScript, then sleep 2 seconds so the content can load
r.html.render(sleep=2)

# parse the HTML to get the selected part
html_data = r.html.find(pulse_sp, first=True)

# collect the text of every <strong> element inside the selection
# (using a Python list comprehension)
text_data = [x.text for x in html_data.find("strong")]

# show the selected data
print(text_data)

['14 authors have pushed', '28 commits to master and', '46 commits to all branches. On master,', '64 files have changed and there have been', '1,289', 'additions and', '303', 'deletions']


In [6]:
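# parse the leading number out of each fragment
# indices 4 and 6 hold the addition and deletion counts (index 5 is plain text)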
data = {
    "authors": int(text_data[0].split()[0]),
    "commits_master": int(text_data[1].split()[0]),
    "commits_all": int(text_data[2].split()[0]),
    "files_changed": int(text_data[3].split()[0]),
    "line_additions": int(text_data[4].replace(",","")),
    "line_deletions": int(text_data[6].replace(",","")),
}
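
The index-based parsing above depends on the exact order and wording of the `<strong>` fragments, which may change if GitHub changes the page. As a minimal sketch of a slightly more defensive approach, the snippet below pulls the first integer out of each fragment with a regular expression and checks that the expected number of values was found. The helper `first_int` and the variable `pulse` are hypothetical names introduced for illustration, and `text_data` is copied from the printed output above so the snippet runs on its own.

```python
import re

# Sample fragments copied from the printed output above.
text_data = [
    "14 authors have pushed",
    "28 commits to master and",
    "46 commits to all branches. On master,",
    "64 files have changed and there have been",
    "1,289",
    "additions and",
    "303",
    "deletions",
]


def first_int(text):
    """Return the first integer in text (thousands separators allowed), or None."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


# Keep only the fragments that contain a number, preserving page order.
numbers = [n for n in map(first_int, text_data) if n is not None]

keys = ["authors", "commits_master", "commits_all",
        "files_changed", "line_additions", "line_deletions"]

# Fail loudly if the page layout no longer matches what we expect.
if len(numbers) != len(keys):
    raise ValueError(f"expected {len(keys)} numbers, found {len(numbers)}")

pulse = dict(zip(keys, numbers))
print(pulse)
```

This still assumes the numbers appear in the same order as `keys`; it only removes the dependence on fixed list indices and turns a silent mismatch into an error.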
    

In [8]:
# data is now a plain Python dict
# it can easily be manipulated and converted into a DataFrame
# or stored as CSV for use with other data analysis software
data

{'authors': 14,
 'commits_master': 28,
 'commits_all': 46,
 'files_changed': 64,
 'line_additions': 1289,
 'line_deletions': 303}
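
To illustrate the CSV option mentioned in the cell above, here is a minimal sketch that uses only the standard library `csv` module. It writes to an in-memory buffer so the snippet is self-contained; the variable names `buffer` and `csv_text` are assumptions made for this example, and `data` is copied from the output above.

```python
import csv
import io

# The parsed values shown in the output above.
data = {
    "authors": 14,
    "commits_master": 28,
    "commits_all": 46,
    "files_changed": 64,
    "line_additions": 1289,
    "line_deletions": 303,
}

# Write a header row followed by a single data row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(data))
writer.writeheader()
writer.writerow(data)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the buffer could be swapped for a file opened with `open(path, "w", newline="")`. Since the notebook already imports pandas, `pd.DataFrame([data])` would give a one-row dataframe from the same dict.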

# Scraping Example Without JavaScript Rendering

In [9]:
# another example
# fetch the html page
session = HTMLSession()
r = session.get("https://github.com/django/django/")

# fetching repo overview

# parse the HTML to get the selected part
# no need to render as no javascript is involved 
# in page loading
html_data = r.html.find(repo_overview_sp, first=True)

In [12]:
# let's look at the text data
# IMPORTANT: available data may change from one project to another
# do not assume that structure is the same
html_data.text.split("\n")

['26,749 commits',
 '46 branches',
 '215 releases',
 '1,715 contributors',
 'View license',
 'Python 95.8%',
 'JavaScript 1.9%',
 'HTML 1.7%',
 'CSS 0.6%',
 'Shell 0.0%',
 'Smarty 0.0%']
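
Because the available lines may differ from one project to another, one option is to classify each line by its shape rather than by its position. The sketch below is one possible approach, not the tutorial's own method: it sorts lines into counts, language percentages, and everything else. The names `overview_lines`, `counts`, `languages`, and `other` are hypothetical, and the input is copied from the output above.

```python
import re

# Lines copied from the output above.
overview_lines = [
    "26,749 commits",
    "46 branches",
    "215 releases",
    "1,715 contributors",
    "View license",
    "Python 95.8%",
    "JavaScript 1.9%",
    "HTML 1.7%",
    "CSS 0.6%",
    "Shell 0.0%",
    "Smarty 0.0%",
]

counts = {}      # label -> integer count, e.g. "commits" -> 26749
languages = {}   # language name -> percentage, e.g. "Python" -> 95.8
other = []       # lines that match neither pattern, kept for inspection

for line in overview_lines:
    count_match = re.fullmatch(r"(\d[\d,]*)\s+(\w+)", line)
    lang_match = re.fullmatch(r"(.+?)\s+(\d+(?:\.\d+)?)%", line)
    if count_match:
        counts[count_match.group(2)] = int(count_match.group(1).replace(",", ""))
    elif lang_match:
        languages[lang_match.group(1)] = float(lang_match.group(2))
    else:
        other.append(line)

print(counts)
print(languages)
print(other)
```

Keeping unmatched lines in `other` instead of discarding them makes it easier to notice when a project exposes fields that these two patterns do not cover.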