# Scraping the Top Repositories for Topics on GitHub

Tarun Rama M

1. What is Web Scraping in short terms?

Web scraping is the automated process of extracting information from websites. It involves fetching the data displayed on a website and converting it into a structured format (e.g., a CSV file, spreadsheet, database, or JSON) for further analysis or use.

2. Introduction to GitHub

GitHub (https://github.com/) is a platform where developers and teams host, manage, and collaborate on projects using Git, a version control system. It provides tools for working on code together, tracking changes, and integrating with other development workflows.

3. Problem Statement

We are going to scrape the topics page, https://github.com/topics. From that page we will collect the list of topics, then find the top repositories for each topic and save them to a file.


4. Tools used in this project

- requests: a Python library for making HTTP requests to a specified URL. Each request returns a response object that holds the response data, such as the status code and the page content.

- Beautiful Soup: a Python library for extracting data from HTML and XML documents. It provides tools for navigating, searching, and modifying the parse tree of these documents in a simple and readable way.

- Pandas: a Python library widely used in web scraping workflows to structure, clean, and analyze the data extracted from websites. After the data has been scraped with requests and Beautiful Soup, Pandas helps organize it into tables that can be saved as CSV files or spreadsheets.
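
As a rough illustration of how these tools fit together, the sketch below parses a small hand-written HTML snippet with Beautiful Soup and loads the result into a Pandas DataFrame. The snippet stands in for the `response.text` that requests would return, and the HTML and the column names `title` and `path` are invented for this example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<ul>
  <li><a href="/topics/3d">3D</a></li>
  <li><a href="/topics/ajax">Ajax</a></li>
</ul>
"""

# Parse the markup and collect one dictionary per link.
soup = BeautifulSoup(html, 'html.parser')
rows = [{'title': a.text, 'path': a['href']} for a in soup.find_all('a')]

# Build a table and serialize it as CSV text (no file is written here).
df = pd.DataFrame(rows)
csv_text = df.to_csv(index=False)
print(csv_text)
```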

Steps we will follow:

- We'll scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file with the columns `username`, `repo_name`, `stars`, and `repo_url`.


## Scrape the list of topics from GitHub

- use requests to download the page
- use BeautifulSoup to parse and extract information
- convert it into a pandas DataFrame

In [105]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

- A function to download the topics page and parse it with BeautifulSoup

In [106]:
def get_topics_page():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load Page {}'.format(topics_url))
  doc = BeautifulSoup(response.text,'html.parser')
  return doc

In [107]:
doc = get_topics_page()

In [108]:
type(doc)
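
One possible hardening of the download step, shown only as a sketch: it sends a `User-Agent` header, sets a timeout, and lets requests raise on HTTP error codes. The function name `fetch_html`, the header value, and the 10-second timeout are assumptions made for illustration and are not part of the original notebook. `raise_for_status()` only raises for 4xx and 5xx responses, whereas the check in `get_topics_page` rejects anything other than 200.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Use the given session if there is one, otherwise the requests module itself.
    http = session if session is not None else requests
    # Hypothetical User-Agent value; nothing in the source says GitHub requires it.
    headers = {'User-Agent': 'topics-scraper-example/0.1'}
    response = http.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

The returned text could then be passed to `BeautifulSoup(text, 'html.parser')` in the same way as in `get_topics_page`.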

#### Creating Helper functions to parse information from the page.

- The function `get_topic_titles` retrieves the list of topic titles.
- To get the topic titles, we can pick `p` tags with the class `f3 lh-condensed mb-0 mt-1 Link--primary`.

In [109]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selection_class})
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

In [110]:
titles = get_topic_titles(doc)

In [111]:
len(titles)

30

In [112]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
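
When `find_all` is given a class string containing several class names, Beautiful Soup compares it against the tag's full `class` attribute, so the order of the class names matters. The sketch below uses invented markup to show the difference between that exact match and a CSS selector passed to `select`, which matches the classes in any order. The shortened selector is an assumption about which classes are enough to identify a title tag.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second tag lists the same classes in a different order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
"""
soup = BeautifulSoup(html, 'html.parser')

# Exact match on the whole class attribute: only the first tag qualifies.
exact = [p.text for p in soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})]

# CSS selector: every listed class must be present, in any order.
by_selector = [p.text for p in soup.select('p.f3.lh-condensed.Link--primary')]

print(exact)        # ['3D']
print(by_selector)  # ['3D', 'Ajax']
```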

- The function `get_topic_descriptions` retrieves the list of topic descriptions.
- To get the topic descriptions, we can pick `p` tags with the class `f5 color-fg-muted mb-0 mt-1`.

In [113]:
def get_topic_descriptions(doc):
  selection_class_two = 'f5 color-fg-muted mb-0 mt-1'
  topic_description_tags = doc.find_all('p',{'class': selection_class_two})
  topic_descriptions = []
  for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())
  return topic_descriptions

- The function `get_topic_urls` retrieves the list of topic page URLs.
- To get the topic URLs, we can pick `a` tags with the class `no-underline flex-1 d-flex flex-column` and read their `href` attribute.

In [114]:
def get_topic_urls(doc):
  selection_class_three = 'no-underline flex-1 d-flex flex-column'
  topic_link_tags = doc.find_all('a',{'class':selection_class_three})
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
  return topic_urls
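
Concatenating `base_url` and `href` works when every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with an `href` that is already absolute. The `href` values below are invented for the example.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
# Made-up href values: one relative path and one absolute URL.
hrefs = ['/topics/3d', 'https://github.com/topics/ajax']

# urljoin resolves relative paths against the base and leaves absolute URLs unchanged.
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
# ['https://github.com/topics/3d', 'https://github.com/topics/ajax']
```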

- Let's put this all together into a single function

In [115]:
def scrape_topics():

  # Download and parse the topics page with the helper defined earlier
  doc = get_topics_page()
  topics_dict = {
      'title':get_topic_titles(doc),
      'description':get_topic_descriptions(doc),
      'url': get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)
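
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if, for instance, one topic on the page had no description tag. The sketch below is an optional check that reports the length of each column before the DataFrame is built. The helper name `check_equal_lengths` is hypothetical. Equal lengths do not prove that the rows line up, only that the table can be constructed.

```python
def check_equal_lengths(columns):
    """Return the dict unchanged, or raise if its lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return columns


# Example with made-up values in the same shape as topics_dict.
topics = check_equal_lengths({
    'title': ['3D', 'Ajax'],
    'description': ['First description', 'Second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})
```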

## Get the top repositories from a topic page

In [116]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check whether the request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load the page! {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
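
Because `get_topic_page` raises as soon as a request fails, one temporary error stops the whole run. A minimal sketch of a retry wrapper with exponential backoff follows. The name `with_retries`, the three attempts, and the one-second base delay are all assumptions chosen for illustration.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() and retry on any exception, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call could look like `with_retries(lambda: get_topic_page(topic_url))`.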

In [117]:
doc = get_topic_page('https://github.com/topics/3d')

In [118]:
def get_repo_info(h3_tag,star_tag):
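  # Return the username, repo name, star count and URL for one repository.
  # parse_star_count is defined in a later cell and must exist before this function is called.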
  base_url = 'https://github.com'

  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [119]:
# get_repo_info(repo_tags[0],star_tags[0])

In [120]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo_name and repo_url and username
    selection_class_four = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': selection_class_four})
    # get the star tags
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

    topic_repos_dict = {
      'username': [],
      'repo_name': [],
      'stars':[],
      'repo_url':[]
    }

    # Get repo info; zip stops at the shorter list, so a missing star tag cannot raise an IndexError
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)

In [121]:
def scrape_topic(topic_url,path):

  if os.path.exists(path):
    print("The file {} already exists. Skipping...".format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  # index=False keeps the DataFrame's row index out of the CSV file
  topic_df.to_csv(path, index=False)

In [122]:
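def parse_star_count(stars_str):
  # Convert a star label such as "29.1k" or "870" into an integer
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    # round() avoids truncating a float result that falls just below the intended integer
    return int(round(float(stars_str[:-1]) * 1000))
  return int(stars_str)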

In [123]:
print(parse_star_count("29.1k"))  # testing func

29100
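
`parse_star_count` handles plain integers and the `k` suffix. If other label formats ever appear, a slightly more general parser might look like the sketch below. The `m` suffix and comma-separated digits are assumptions about formats that could appear and are not taken from the scraped pages. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Convert labels such as '29.1k', '1,234' or '870' into an integer."""
    text = text.strip().lower().replace(',', '')
    # Assumed suffixes; only 'k' appears in the original notebook.
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() returns an int and avoids truncating values like 29099.999...
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count('29.1k'))  # 29100
print(parse_count('1,234'))  # 1234
```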


## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file of the repos scraped from a topic page
- Let's create a function to put them together

In [124]:
def scrape_topics_repos():
  print("Scraping list of topics")
  topics_df = scrape_topics()
  os.makedirs('data',exist_ok=True)
  for index,row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

Run it to scrape the top repositories for all the topics on the first page of https://github.com/topics

In [125]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
The file data/Bitcoin.csv already exists. Skipping...
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for 
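
The file path for each topic is built directly from the topic title, so a title containing a character such as `/` would produce an invalid or unintended path. The sketch below shows one way the title could be cleaned first. The helper name `topic_filename` and the set of allowed characters are assumptions, and the title `A/B testing` is a made-up example.

```python
import re


def topic_filename(title, folder='data'):
    """Build a CSV path from a topic title, replacing risky characters."""
    # Keep letters, digits and a few safe symbols; replace every other run with '_'.
    safe = re.sub(r'[^A-Za-z0-9._+#-]+', '_', title).strip('_')
    return '{}/{}.csv'.format(folder, safe)


print(topic_filename('Chrome extension'))  # data/Chrome_extension.csv
print(topic_filename('A/B testing'))       # data/A_B_testing.csv
```

Using such a helper would change the names of files for titles that contain spaces, so files written earlier under the old names would no longer be skipped.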

Check that the CSVs were created properly

In [126]:
# Read & display the CSV file using pandas

In [127]:
pd.read_csv('data/Android.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,167000,https://github.com/flutter/flutter
1,facebook,react-native,120000,https://github.com/facebook/react-native
2,Genymobile,scrcpy,115000,https://github.com/Genymobile/scrcpy
3,justjavac,free-programming-books-zh_CN,112000,https://github.com/justjavac/free-programming-...
4,Hack-with-Github,Awesome-Hacking,87200,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,54200,https://github.com/Solido/awesome-flutter
6,tldr-pages,tldr,52400,https://github.com/tldr-pages/tldr
7,wasabeef,awesome-android-ui,51100,https://github.com/wasabeef/awesome-android-ui
8,google,material-design-icons,50900,https://github.com/google/material-design-icons
9,laurent22,joplin,46800,https://github.com/laurent22/joplin
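
Instead of opening one file by hand, every file in the folder could be checked in one pass. The sketch below uses the standard `csv` module to list files whose header row differs from the expected columns. It assumes the files were written without an index column, and the helper name `find_bad_csvs` is hypothetical.

```python
import csv
import glob
import os

# Column names used by get_topic_repos.
EXPECTED = ['username', 'repo_name', 'stars', 'repo_url']


def find_bad_csvs(folder='data'):
    """Return the paths of CSV files whose header row is not EXPECTED."""
    bad = []
    for path in sorted(glob.glob(os.path.join(folder, '*.csv'))):
        with open(path, newline='', encoding='utf-8') as f:
            # An empty file yields an empty header, which also counts as bad.
            header = next(csv.reader(f), [])
        if header != EXPECTED:
            bad.append(path)
    return bad
```

An empty list returned by `find_bad_csvs('data')` would mean that every file has the expected header.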

