# Scraping Top Repositories for Topics on GitHub



## Introduction to web scraping

![](https://i.imgur.com/6zM7JBq.png)


Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing [HTML documents](https://developer.mozilla.org/en-US/docs/Web/HTML), some platforms also offer [REST APIs](https://www.smashingmagazine.com/2018/01/understanding-using-rest-api/) to retrieve information in a machine-readable format like [JSON](https://www.digitalocean.com/community/tutorials/an-introduction-to-json). In this tutorial, we'll use web scraping and REST APIs to create a real-world dataset.
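
To make the distinction concrete, here is a minimal sketch that contrasts the two approaches on small in-memory samples. The HTML fragment and the JSON payload are made up for illustration and do not come from any real site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, standing in for a downloaded web page.
html = """
<ul>
  <li class="topic">3D</li>
  <li class="topic">Ajax</li>
</ul>
"""

# Scraping: parse the markup, then pull the text out of the matching tags.
soup = BeautifulSoup(html, "html.parser")
topics_from_html = [li.text for li in soup.find_all("li", class_="topic")]

# REST API: the (hypothetical) response body is already structured JSON.
payload = '{"topics": ["3D", "Ajax"]}'
topics_from_json = json.loads(payload)["topics"]

print(topics_from_html)
print(topics_from_json)
```

Both approaches end up with the same list; the difference is how much parsing work is needed to get there.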

## About GitHub

![](https://1000logos.net/wp-content/uploads/2021/05/GitHub-logo.png)


GitHub is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.

## Tools I have used

- Python
- Requests Library
- BeautifulSoup Library
- Pandas

## Steps

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Username,Reponame,Stars,URL
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
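
As a rough picture of the target format, the sketch below writes two sample rows with Python's built-in `csv` module. The column names follow the DataFrame built later in this tutorial. The tutorial itself writes the files with Pandas, so this is only an illustration of the output, not the method used.

```python
import csv
import io

# Sample rows based on the example above; the field names follow the
# DataFrame columns used later in the tutorial.
rows = [
    {"Username": "mrdoob", "Reponame": "three.js", "Stars": 69700,
     "URL": "https://github.com/mrdoob/three.js"},
    {"Username": "libgdx", "Reponame": "libgdx", "Stars": 18300,
     "URL": "https://github.com/libgdx/libgdx"},
]

# Write to an in-memory buffer so the sketch leaves no files behind.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Username", "Reponame", "Stars", "URL"],
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```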

## Scrape the list of topics from GitHub

We'll do this in three steps:

- use Requests to download the page
- use BeautifulSoup (BS4) to parse the page and extract information
- convert the extracted data to a Pandas DataFrame

Let's write a function to download the page.

In [40]:
import requests 
import pandas as pd
from bs4 import BeautifulSoup
import os
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

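`get_topic_page` downloads the page with `requests`, raises an exception if the status code is not 200, and returns the HTML parsed into a `BeautifulSoup` object. Let's use it to download the topics listing page.

In [41]:
doc = get_topic_page('https://github.com/topics')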

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/OnzIdyP.png)


In [42]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [43]:
titles = get_topic_titles(doc)

In [44]:
len(titles)

30

In [45]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
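
One thing to keep in mind about the lookup in `get_topic_titles`: when the `class` filter is a string containing several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so a change in the order of the classes on the page would break the match. The sketch below demonstrates this on a made-up snippet and shows a CSS selector as a possible alternative. The markup and the choice of `f3` and `Link--primary` for the selector are assumptions for illustration, not a statement about GitHub's current pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same classes, written in two different orders.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
doc = BeautifulSoup(html, "html.parser")

# A multi-class string only matches the exact attribute value.
exact = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# A CSS selector matches tags carrying both classes, in any order.
selected = doc.select("p.f3.Link--primary")
selected_titles = [tag.text for tag in selected]

print(exact_titles)
print(selected_titles)
```

The selector is less sensitive to class order, but it is also looser, so it could pick up unrelated tags if the chosen classes are reused elsewhere on the page.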

Similarly, we define functions for the descriptions and URLs.

In [47]:
def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p','f5 color-fg-muted mb-0 mt-1')
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [61]:
def get_topic_urls(doc):
    base_url = 'https://github.com'
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
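
String concatenation works here because the `href` values are assumed to be site-relative paths such as `/topics/3d`. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with `href` values that are already absolute. The sample paths below are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A site-relative path is resolved against the base URL.
relative = urljoin(base_url, "/topics/3d")

# An already-absolute URL is returned unchanged instead of being doubled.
absolute = urljoin(base_url, "https://github.com/topics/ajax")

print(relative)
print(absolute)
```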

Let's put this all together into a single function

In [49]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
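
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the selectors matches more or fewer tags than the others. A small guard like the following can report which column is off before the DataFrame is built. The helper name `check_equal_lengths` is hypothetical and not part of the tutorial's code.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns

# Matching lengths pass through unchanged.
ok = check_equal_lengths({'title': ['3D', 'Ajax'], 'url': ['/topics/3d', '/topics/ajax']})

# Mismatched lengths are reported with the count for each column.
try:
    check_equal_lengths({'title': ['3D', 'Ajax'], 'url': ['/topics/3d']})
except ValueError as error:
    message = str(error)
    print(message)
```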

## Get the top 25 repositories from a topic page



In [52]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


In [53]:
doc = get_topic_page('https://github.com/topics/3d')

In [54]:
def parse_star_count(star_text):
    star_text = star_text.strip()
    if star_text[-1]=='k':
        return int(float(star_text[:-1])*1000)
    return int(star_text)
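
As a possible variation, the sketch below rounds instead of truncating, which guards against floating-point products that land just under a whole number, and it also strips thousands separators. The name `parse_count` and the comma handling are assumptions added for illustration; the rest of the tutorial keeps using `parse_star_count`.

```python
def parse_count(text):
    """Convert strings like '92.1k', '1,234' or '847' to an integer."""
    text = text.strip().replace(',', '')
    if text.lower().endswith('k'):
        # round() avoids truncating a product that lands just below a whole number.
        return round(float(text[:-1]) * 1000)
    return int(text)

print(parse_count('92.1k'))
print(parse_count(' 1.5k '))
print(parse_count('1,234'))
print(parse_count('847'))
```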

In [55]:
def get_repo_info(h3_tag,star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text.strip())
    return user_name,repo_name,star_count,repo_url

In [57]:
def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    
    topic_repos_dict = {'Username':[],'Reponame':[],'Stars':[],'URL':[]}
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['Username'].append(repo_info[0])
        topic_repos_dict['Reponame'].append(repo_info[1])
        topic_repos_dict['Stars'].append(repo_info[2])
        topic_repos_dict['URL'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
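
The index-based loop assumes that every repository heading has a star counter at the same position. A hedged alternative is to pair the two lists with `zip`, which stops at the shorter list rather than raising an `IndexError`. The markup below is a simplified, made-up stand-in for a topic page, and `pair_repos_with_stars` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: two repo headings but only one counter.
html = """
<h3 class="repo"><a href="/mrdoob">mrdoob</a> / <a href="/mrdoob/three.js">three.js</a></h3>
<span class="Counter">92.1k</span>
<h3 class="repo"><a href="/libgdx">libgdx</a> / <a href="/libgdx/libgdx">libgdx</a></h3>
"""
doc = BeautifulSoup(html, "html.parser")

def pair_repos_with_stars(doc):
    """Pair each repo heading with a star counter, stopping at the shorter list."""
    repo_tags = doc.find_all("h3", {"class": "repo"})
    star_tags = doc.find_all("span", {"class": "Counter"})
    rows = []
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        a_tags = h3_tag.find_all("a")
        rows.append((a_tags[0].text.strip(), a_tags[1].text.strip(), star_tag.text.strip()))
    return rows

rows = pair_repos_with_stars(doc)
print(rows)
```

Note that `zip` silently drops unpaired items, so comparing the lengths of the two lists first may still be worthwhile.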

In [58]:
def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print('The file already exists {}'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)
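
The `os.path.exists` check makes the function safe to re-run: topics that already have a file are skipped. The same pattern is sketched here in isolation with a temporary directory. `write_once` is a hypothetical helper, not part of the tutorial's code.

```python
import os
import tempfile

def write_once(path, text):
    """Write `text` to `path` unless the file already exists."""
    if os.path.exists(path):
        return False
    with open(path, 'w') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, '3D.csv')
    # The first call creates the file; the second call leaves it untouched.
    first = write_once(path, 'Username,Reponame,Stars,URL\n')
    second = write_once(path, 'this text is never written\n')
    with open(path) as f:
        content = f.read()

print(first, second)
```

One consequence of this design is that stale files are never refreshed, so the existing CSVs have to be deleted to pick up new data.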

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together
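
Before combining them, one detail is worth a look: the function below names each CSV file after its topic title, and a title containing a character such as `/` would not be a valid file name. A minimal sketch of a sanitizing helper follows. `safe_filename` and the sample titles are assumptions for illustration and are not used by the tutorial's code.

```python
import re

def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r'[^A-Za-z0-9._ -]', '_', title).strip()
    # Fall back to a placeholder if nothing usable is left.
    return cleaned or 'untitled'

print(safe_filename('3D'))
print(safe_filename('ASP.NET'))
print(safe_filename('CI/CD'))
print(safe_filename(''))
```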

In [59]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [63]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for

We can check that the CSVs were created properly

In [64]:
import pandas as pd
df1 = pd.read_csv('data/3D.csv')
df1.head()

Unnamed: 0,Username,Reponame,Stars,URL
0,mrdoob,three.js,92100,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22600,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21500,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20700,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,17000,https://github.com/ssloy/tinyrenderer


In [66]:
df2 = pd.read_csv('data/Android.csv')
df2.head()

Unnamed: 0,Username,Reponame,Stars,URL
0,flutter,flutter,154000,https://github.com/flutter/flutter
1,facebook,react-native,110000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,103000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,84000,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,65300,https://github.com/Hack-with-Github/Awesome-Ha...
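
To check every file rather than a couple by hand, a loop over the output folder can confirm that each CSV has the expected header. The sketch below builds a throwaway folder with two sample files so that it runs on its own; in practice the `data` folder created above would be passed instead. `find_bad_csvs` is a hypothetical helper.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ['Username', 'Reponame', 'Stars', 'URL']

def find_bad_csvs(folder):
    """Return the names of CSV files whose header differs from EXPECTED_HEADER."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.csv'):
            continue
        with open(os.path.join(folder, name), newline='') as f:
            header = next(csv.reader(f), None)
        if header != EXPECTED_HEADER:
            bad.append(name)
    return bad

with tempfile.TemporaryDirectory() as folder:
    # One file with the expected header and one without.
    with open(os.path.join(folder, 'good.csv'), 'w') as f:
        f.write('Username,Reponame,Stars,URL\n'
                'mrdoob,three.js,92100,https://github.com/mrdoob/three.js\n')
    with open(os.path.join(folder, 'bad.csv'), 'w') as f:
        f.write('Name,Stars\n')
    bad_files = find_bad_csvs(folder)

print(bad_files)
```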

