# Top Repositories for Topics on GitHub - Web Scraping


![](https://i.imgur.com/ducpvjV.jpg)

## 1. Introduction : 


### 1.1. What is web scraping 

Web scraping is the process of collecting structured web data in an automated fashion. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

There are several tools and methods for web scraping. Inspecting a site's network traffic, Scrapy, Selenium and Beautiful Soup are among the most popular. Each method has its own advantages and drawbacks. This project uses Beautiful Soup.

### 1.2. Problem statement 

GitHub, Inc. is a provider of Internet hosting for software development and version control using Git. It offers the distributed version control and source code management (SCM) functionality of Git, plus its own features. It provides access control and several collaboration features such as bug tracking, feature requests, task management, continuous integration and wikis for every project. Headquartered in California, it has been a subsidiary of Microsoft since 2018.

It is commonly used to host open-source projects. As of January 2020, GitHub reports having over 40 million users and more than 190 million repositories (including at least 28 million public repositories). It is the largest source code host as of April 2020.


Given this brief introduction to GitHub, the site is a useful place to look for software development trends and to identify the most popular repositories, which in turn can support better decision making.

That is why, in this small project, I decided to scrape the GitHub topics page and then scrape each topic page to collect information on the most popular repositories of each topic.

## 2. Project Steps:


- We are going to scrape https://github.com/topics 
- We will get a list of topics. For each topic, we will get topic title, topic page URL and topic description. 
- For each topic, we will get the top 25 repositories in the topic from the topic page.
- For each repository, we will grab the repo name, username, stars and repository URL.
- For each topic, we will create a CSV file 

#### Scraping the list of topics from GitHub

- Use requests to download the page
- Use BS4 to parse and extract information
- Convert it to a Pandas dataframe


In [2]:
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [4]:
# Define a function that downloads the topics page and parses it into a BeautifulSoup object
def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        # Report the URL that could not be loaded
        raise Exception("Failed to load page {}".format(topics_url))
    doc = BeautifulSoup(response.text, "html.parser")
    return doc

In [7]:
#loading the result of our defined function to doc variable
doc = get_topics_page()

In [6]:
#checking if its a BS object
type(doc)

bs4.BeautifulSoup

First we will inspect the GitHub Topics page to find the right HTML tags,
and then we will define a function to extract that information automatically.


To get topic titles, we can pick "p" tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary"

![](https://i.imgur.com/EOqfQaQ.jpg)

In [12]:
# defining a function to get all the topic titles in the topics page
def get_topic_titles (doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all("p", {"class": selection_class })
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
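
To see how this class-based lookup behaves, the sketch below runs it on a small inline HTML snippet instead of the live page. The snippet and the variable names are made up for illustration; only the class string comes from the page inspected above. Passing the full class string to `find_all` matches tags whose `class` attribute equals that exact string, so a CSS selector through `select` is shown next to it as an alternative that does not depend on the order of the classes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fragment of the topics page.
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-text-secondary mb-0 mt-1">3D modeling ...</p>
  <p class="mb-0 mt-1 f3 lh-condensed Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact-string match: only tags whose class attribute is written in this order.
exact = sample_doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# CSS selector: any <p> carrying both classes, whatever their order.
by_selector = sample_doc.select("p.f3.Link--primary")
selector_titles = [tag.text for tag in by_selector]

print(exact_titles)
print(selector_titles)
```

In this made-up snippet the exact-string lookup finds only the first title, while the selector also finds the one whose classes are listed in a different order. Either approach still breaks if the site renames its classes, so the class strings are worth re-checking in the browser before each run.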

In [13]:
#Getting all the titles using our defined function
titles = get_topic_titles(doc)

In [14]:
# Check the number of titles to confirm we got all of them
len(titles)

30

Similarly, we will inspect the page to find the right HTML tags for the descriptions,
and then we will define a function to extract that information automatically.



To get topic descriptions, we can pick "p" tags with the class "f5 color-text-secondary mb-0 mt-1"

![](https://i.imgur.com/wX0AIiV.jpg)

In [16]:
#defining a function to get all the topic descriptions in the topics page

def get_topic_descs (doc):
    desc_selector = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = doc.find_all("p", {"class": desc_selector} )
    
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs 

In [18]:
#Getting all the descriptions using our defined function
descs = get_topic_descs(doc)

In [20]:
# Check the number of descriptions to confirm we got all of them
len(descs)

30

To get topic URLs, we can pick "a" tags with the class "d-flex no-underline"

![](https://i.imgur.com/2LhkWi0.jpg)

In [23]:
#defining a function to get all the topic urls in the topics page

def get_topic_urls (doc):
    topic_link_tags = doc.find_all ("a", {"class": "d-flex no-underline"})
    
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag["href"])  
    return topic_urls

In [24]:
#Getting all the urls using our defined function
urls = get_topic_urls(doc)

In [25]:
# Check the number of URLs to confirm we got all of them
len(urls)

30
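
The three lists are later combined column by column, which only works when they have the same length. The checks above do this by eye; a minimal sketch of an automated check follows. The helper name `check_equal_lengths` and the short sample lists are hypothetical and stand in for `titles`, `descs` and `urls`.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Stand-in lists; in the notebook these would be titles, descs and urls.
sample_titles = ["3D", "Ajax"]
sample_descs = ["3D modeling ...", "Ajax is ..."]
sample_urls = ["https://github.com/topics/3d", "https://github.com/topics/ajax"]

lengths = check_equal_lengths(
    title=sample_titles, description=sample_descs, url=sample_urls
)
print(lengths)
```

A mismatch usually means one of the class strings no longer matches every topic card, so failing early is easier to debug than a misaligned table.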

### Creating a single function for the above tasks

We are going to define a function that collects the information for all topics on the topics page.

In [29]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    # Parse this response instead of relying on a doc variable defined elsewhere
    doc = BeautifulSoup(response.text, "html.parser")
    topic_dict = {"title": get_topic_titles(doc), "description": get_topic_descs(doc), "url": get_topic_urls(doc)}
    return pd.DataFrame(topic_dict)

In [31]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


### Getting the top 25 repositories from the topic page

We are going to define a function that takes the URL of a topic and turns its page into a BeautifulSoup object.

In [45]:
# Define a function that downloads a topic page and turns it into a BeautifulSoup object
def get_topic_page (topic_url):
    #Download the page
    response = requests.get(topic_url)
    #Check successful response
    if response.status_code != 200:
        raise Exception ("Failed to load page {}".format(topic_url) )
    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup (response.text, "html.parser")
    return topic_doc

In [52]:
#checking with a sample link
doc_topic = get_topic_page("https://github.com/topics/3d")

First we will inspect the topic page to get the right HTML tags
and then we will define a function to get that information automatically


![](https://i.imgur.com/9sebTUx.jpg)

In [47]:
# defining a function to get all the info for each repository
base_url = "https://github.com/"

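def parse_star_count(stars_str):
    # Convert a star label such as "73.8k" or "512" to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == "k":
        # round() guards against float products that fall just below a whole number
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

def get_repo_info(h3_tag, star_tag):
    # The h3 tag holds two links: the first is the user, the second the repository
    a_tags = h3_tag.find_all("a")
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]["href"]
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url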
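
The star-count conversion is easy to check on its own, without downloading anything. The sketch below is a slightly more tolerant variant under stated assumptions: the name `parse_star_count_tolerant` is hypothetical, and the handling of thousands separators and empty strings is an assumption about labels the page might show, not something observed in this project.

```python
def parse_star_count_tolerant(stars_str):
    """Convert strings such as '73.8k', '1,234' or '512' to an integer."""
    # Assumption: commas only ever appear as thousands separators.
    cleaned = stars_str.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    if cleaned.endswith("k"):
        # round() avoids losing a unit when the float product lands
        # just below a whole number.
        return int(round(float(cleaned[:-1]) * 1000))
    return int(cleaned)

for text in ["73.8k", " 9.8k ", "1,234", "512"]:
    print(text, "->", parse_star_count_tolerant(text))
```

Because the abbreviated label keeps only one decimal, a value such as 73.8k is an approximation of the real count, accurate to the nearest hundred.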

Making a function to get all information for all 25 repositories

In [48]:
def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL and username
    h3_selection_class = "f3 color-text-secondary text-normal lh-condensed"
    repo_tags = topic_doc.find_all("h3", {"class": h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all("a", {"class": "social-count float-none"})
    # Collect repo info column by column
    topic_repos_dict = {"username": [], "repo_name": [], "stars": [], "repo_url": []}

    # Pair each repo tag with the star tag at the same position
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict["username"].append(username)
        topic_repos_dict["repo_name"].append(repo_name)
        topic_repos_dict["stars"].append(stars)
        topic_repos_dict["repo_url"].append(repo_url)
    return pd.DataFrame(topic_repos_dict)

In [50]:
#using the function for an example page
get_topic_repos(doc_topic)

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,73800,https://github.com//mrdoob/three.js
1,libgdx,libgdx,18800,https://github.com//libgdx/libgdx
2,pmndrs,react-three-fiber,14800,https://github.com//pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,14700,https://github.com//BabylonJS/Babylon.js
4,aframevr,aframe,13000,https://github.com//aframevr/aframe
5,ssloy,tinyrenderer,11100,https://github.com//ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,11000,https://github.com//lettier/3d-game-shaders-fo...
7,FreeCAD,FreeCAD,9800,https://github.com//FreeCAD/FreeCAD
8,metafizzy,zdog,8600,https://github.com//metafizzy/zdog
9,CesiumGS,cesium,7400,https://github.com//CesiumGS/cesium


In [51]:
# defining a function to scrape repos of a topic page and turn into CSV file
def scrape_topic (topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos (get_topic_page(topic_url))
    
    topic_df.to_csv(path, index = None)


### Putting it all together

- We have created a function to get the list of topics
- We also have created a function to create a CSV file for scraped repos from a topic page
- We are going to create a function to put them together automatically


In [53]:
def scrape_topics_repos():
    print ("scraping list of topics")
    topics_df = scrape_topics()

    os.makedirs("data", exist_ok = True)
    
    for index, row in topics_df.iterrows():
        print("scraping top repositories for {}".format(row["title"]))
        scrape_topic(row["url"], "data/{}.csv".format(row["title"]) )
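
The loop above uses the topic title directly as a file name. That works for titles such as "3D" or "ASP.NET", but a title containing a path separator or other special characters could produce an invalid path. A minimal sketch of a sanitizing helper follows; the name `topic_csv_path`, the set of allowed characters and the last sample title are assumptions made for illustration.

```python
import os
import re

def topic_csv_path(title, folder="data"):
    """Build a CSV path for a topic title, replacing unsafe characters."""
    # Keep letters, digits, dot, plus, hash, underscore and hyphen;
    # replace every other run of characters with a single underscore.
    safe = re.sub(r"[^A-Za-z0-9.+#_-]+", "_", title.strip())
    return os.path.join(folder, "{}.csv".format(safe))

# "Some/Topic Name" is a made-up title used to show the replacement.
for title in ["3D", "ASP.NET", "C++", "Some/Topic Name"]:
    print(topic_csv_path(title))
```

Such a helper could replace the `"data/{}.csv".format(row["title"])` expression, at the cost of file names that no longer match the titles exactly.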

We are going to run the defined function to scrape the top repos for all the topics on the first page of https://github.com/topics 

In [54]:
scrape_topics_repos ()

scraping list of topics


We have created a CSV file for every topic on the first topics page, each containing the top repositories of that topic.
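
A full run sends one request per topic in quick succession, and `get_topic_page` raises an exception on any non-200 response, which stops the whole run. As a hedged sketch, a small retry wrapper with a growing pause could make the run more tolerant of temporary failures. The names `fetch_with_retry` and `flaky_fetch` are hypothetical, and the stand-in fetch function only simulates a page that fails twice; no real request is made.

```python
import time

def fetch_with_retry(fetch, url, retries=3, delay_seconds=1.0):
    """Call fetch(url) until it succeeds or the retries are used up."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: the page loaders raise plain Exception
            last_error = error
            if attempt < retries:
                # Wait a little longer after each failed attempt.
                time.sleep(delay_seconds * attempt)
    raise last_error

# Stand-in for get_topic_page that fails twice before succeeding.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise Exception("Failed to load page {}".format(url))
    return "<html>ok</html>"

result = fetch_with_retry(flaky_fetch, "https://github.com/topics/3d", delay_seconds=0.01)
print(result, calls["count"])
```

In the notebook, the wrapper would be called as `fetch_with_retry(get_topic_page, topic_url)`. Suitable values for the number of retries and the delay are a judgment call and are not derived from this project.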

## 3. Ideas for future work

- Scraping the other pages of topics. This project only covers the topics on the first page; the same approach can be applied from the second page to the last.
- Scraping more than the top 25 repositories of each topic.
- Scraping the GitHub trending page over a period of time, which could give a picture of hot topics.
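
As a starting point for the first idea, the sketch below builds the list of topic-listing URLs to visit. It assumes that further pages are reachable through a `page` query parameter on the topics URL; this should be confirmed in the browser before relying on it. The function name `topics_page_urls` is hypothetical.

```python
def topics_page_urls(last_page, base="https://github.com/topics"):
    """Return listing URLs for pages 1..last_page, assuming a ?page=N parameter."""
    return ["{}?page={}".format(base, page) for page in range(1, last_page + 1)]

for url in topics_page_urls(3):
    print(url)
```

Each URL could then be downloaded and parsed with the same title, description and URL functions used for the first page.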