# Scraping Top Repositories for Topics on GitHub

- Introduction:
    Web scraping is a technique for extracting data from websites. It automates fetching web pages and then parsing, extracting, and structuring the information they contain.
- Problem statement:
    In this project I extract information from GitHub, a central place where code, project files, documentation, and other resources related to a project are stored. I scrape the top repositories for a set of topics and save the data to CSV files so that fields like username, topic name, and star count can be visualized.
- Tools I will use:
    I will use Python as my programming language, along with libraries such as BeautifulSoup, requests, and pandas.

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 20 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file.

## Scrape the list of topics from GitHub


- Use requests to download the page.
- Use Beautiful Soup (BS4) to parse the page and extract information.
- Convert the result to a pandas DataFrame.

Let's write a function to download the page.

In [1]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # download the page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url) 
    #check successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topics_url}')
    
    #parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

The function above returns the topics page as a BeautifulSoup object.

In [2]:
doc = get_topics_page()

In [3]:
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse information from the page.

In [4]:
def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
        
    return topic_titles

`get_topic_titles` can be used to retrieve the list of titles.
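
To see how the class-based lookup behaves, here is a minimal sketch that runs `find_all` on a small hand-written HTML snippet instead of the live page. The snippet is an assumption made for illustration (GitHub's real markup is larger and may change). It shows that passing the full class string matches the tag's `class` attribute exactly, while a CSS selector matches the listed classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating two topic entries (not GitHub's real markup).
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">About 3D graphics.</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the whole class attribute string, as in get_topic_titles.
title_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
exact_titles = [tag.text for tag in sample_doc.find_all('p', {'class': title_class})]

# CSS selector: matches any <p> carrying both classes, regardless of their order.
css_titles = [tag.text for tag in sample_doc.select('p.f3.Link--primary')]

print(exact_titles)
print(css_titles)
```

The selector form may be a little more tolerant if the order of the classes changes, but either approach breaks if GitHub renames the classes.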

In [5]:
titles = get_topic_titles(doc)

In [6]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we define functions for collecting the descriptions and URLs.

In [7]:
def get_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    
    return topic_desc

`get_topic_desc` collects the description of each topic and stores them in a list.

In [8]:
descs = get_topic_desc(doc)

In [9]:
descs[:2]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.']

In [10]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls    

Similarly, `get_topic_urls` gets the URL of each topic and stores them in a list.
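
String concatenation works here because every `href` on the page is a root-relative path. As a slightly more defensive alternative, the sketch below uses `urllib.parse.urljoin` from the standard library, which also copes with `href` values that are already absolute. The sample `href` values are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Hypothetical href values: two root-relative paths and one absolute URL.
sample_hrefs = ["/topics/3d", "/topics/ajax", "https://github.com/topics/android"]

# urljoin resolves relative paths against base_url and leaves absolute URLs untouched.
topic_urls = [urljoin(base_url, href) for href in sample_hrefs]

for url in topic_urls:
    print(url)
```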

In [11]:
urls = get_topic_urls(doc)

In [12]:
urls[:2]

['https://github.com/topics/3d', 'https://github.com/topics/ajax']

Now it's time to put them all together into a single function.

In [13]:
import pandas as pd

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    # parse the page we just downloaded instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)

`scrape_topics` combines the three helper functions we built earlier and, with the help of pandas, returns a DataFrame of titles, descriptions, and URLs.
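
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class selectors stops matching after a markup change. The sketch below is one possible guard to run before building the DataFrame. The helper name `check_equal_lengths` and the sample data are assumptions for illustration, not part of the original notebook.

```python
def check_equal_lengths(columns):
    """Raise a descriptive error if the column lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths

# Hypothetical scrape result in which one description is missing.
sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['Example description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

try:
    check_equal_lengths(sample_columns)
except ValueError as error:
    # The message names each column and its length, which helps locate the broken selector.
    print(error)
```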

In [14]:
topics_dataframe = scrape_topics()

In [15]:
topics_dataframe

|   | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Get the top 20 repositories for a topic from the topic page

- We will now go through each topic and collect the top 20 repositories from its page.

In [16]:
def get_topic_page(topic_url):
    #download the page
    response = requests.get(topic_url)
    
    #check successful response
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url}')
    
    #parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

This function downloads a topic page and returns it as a `BeautifulSoup` object.

In [17]:
topic_doc = get_topic_page('https://github.com/topics/3d')

Now, we have to collect repository information from the page we have downloaded. Let's define some functions for that.

In [18]:
base_url = "https://github.com"

def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

def get_repo_info(h3_tag, star_tag):
    #return all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


`get_repo_info` returns the username, repository name, star count, and repository URL of a single repository.
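
As a quick check of the star-count conversion, the sketch below re-implements the idea behind `parse_star_count` in a standalone form and runs it on a few sample strings. The name `parse_star_count_safe` and the handling of thousands separators such as `1,234` are assumptions added for illustration. It also rounds instead of truncating, to sidestep floating-point results that land just below the intended integer.

```python
def parse_star_count_safe(stars_str):
    """Convert strings such as '94k', '23.5k' or '1,234' to an integer."""
    cleaned = stars_str.strip().replace(',', '')
    if cleaned.lower().endswith('k'):
        # round() avoids truncating a value like 2299.9999999 down to 2299
        return round(float(cleaned[:-1]) * 1000)
    return int(cleaned)

for sample in ['94k', '23.5k', '987', '1,234']:
    print(sample, '->', parse_star_count_safe(sample))
```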

In [25]:
def get_topic_repos(topic_doc):
    # get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

    # get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'repo_url': [],
        'stars': []
    }

    # get repo info, pairing each repo heading with the star counter at the same position
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['repo_url'].append(repo_url)
        topic_repos_dict['stars'].append(stars)

    return pd.DataFrame(topic_repos_dict)

This function collects the username, repo name, repo URL, and stars of each repository into a dictionary and returns a DataFrame built from that dictionary.

In [26]:
repo_info = get_topic_repos(topic_doc)

In [27]:
repo_info

|   | username | repo_name | repo_url | stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | https://github.com/mrdoob/three.js | 94000 |
| 1 | pmndrs | react-three-fiber | https://github.com/pmndrs/react-three-fiber | 23500 |
| 2 | libgdx | libgdx | https://github.com/libgdx/libgdx | 21800 |
| 3 | BabylonJS | Babylon.js | https://github.com/BabylonJS/Babylon.js | 21200 |
| 4 | ssloy | tinyrenderer | https://github.com/ssloy/tinyrenderer | 17600 |
| 5 | lettier | 3d-game-shaders-for-beginners | https://github.com/lettier/3d-game-shaders-for... | 15900 |
| 6 | aframevr | aframe | https://github.com/aframevr/aframe | 15600 |
| 7 | FreeCAD | FreeCAD | https://github.com/FreeCAD/FreeCAD | 14900 |
| 8 | CesiumGS | cesium | https://github.com/CesiumGS/cesium | 10800 |
| 9 | metafizzy | zdog | https://github.com/metafizzy/zdog | 10000 |


## Let's put it all together.

In [28]:
import os

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
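
The `os.path.exists` check makes `scrape_topic` safe to re-run: topics whose CSV file is already on disk are skipped, so an interrupted run can resume without downloading everything again. Here is a minimal sketch of that pattern, using a temporary directory and a hypothetical `write_once` helper in place of the real download.

```python
import os
import tempfile

def write_once(path, content):
    """Write content to path unless the file already exists; report what happened."""
    if os.path.exists(path):
        return 'skipped'
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
    return 'written'

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    # The first call creates the file; the second finds it and leaves it alone.
    first = write_once(csv_path, 'username,repo_name\n')
    second = write_once(csv_path, 'this would overwrite the file\n')
    print(first, second)
```

One consequence of this design is that stale files are never refreshed; deleting the `data` directory forces a full re-scrape.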

In [29]:



def scrape_topic_repos():
    print('scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
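
The CSV path is built directly from the topic title, so titles containing spaces or characters such as `/` end up in the file name. One possible way to normalize them is sketched below. The helper `title_to_filename` is hypothetical and not part of the original notebook.

```python
import re

def title_to_filename(title):
    """Turn a topic title into a lowercase, filesystem-friendly CSV file name."""
    # Replace every run of characters other than letters, digits, '.', '_' or '-' with '-'
    slug = re.sub(r'[^A-Za-z0-9._-]+', '-', title.strip()).strip('-')
    return '{}.csv'.format(slug.lower())

for title in ['3D', 'Chrome extension', 'ASP.NET', 'Command line interface']:
    print(title, '->', title_to_filename(title))
```

Different titles can collapse to the same file name under this scheme, so it is a convenience rather than a guarantee of uniqueness.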

Let's run it to scrape the top repos for all the topics on the first page of github.com/topics.

In [30]:
scrape_topic_repos()

scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin
