<a href="https://colab.research.google.com/github/rwiddhi-b/Github-scraping/blob/main/webScraping_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scraping Top Repositories for Topics on GitHub
Tools used: Python, requests, Beautiful Soup, and pandas.



### Steps we'll follow:

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, the topic page URL, and the topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file.

## Scrape the list of topics from GitHub
Steps:

- Use requests to download the page.
- Use Beautiful Soup (bs4) to parse the page and extract information.
- Convert the result to a pandas DataFrame.

In [None]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    topic_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [None]:
topic_doc = get_topics_page()

In [None]:
type(topic_doc)

bs4.BeautifulSoup
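
By default, `requests.get` does not time out, so a stalled connection would hang the notebook. As a minimal sketch of a more defensive download step (the name `fetch_html`, the `session` parameter, and the 10-second timeout are illustrative assumptions, not part of the original notebook):

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download `url` and return the response body as text.

    `session` is a hypothetical hook: any object with a requests-style
    `get(url, timeout=...)` method. It defaults to the `requests` module.
    """
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Unlike the `status_code != 200` check above, `raise_for_status()` only raises for 4xx and 5xx codes, so the two are not strictly equivalent.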

## Steps to parse information from the page

`get_topic_titles` can be used to get the list of titles

In [None]:
def get_topic_titles(topic_doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = topic_doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [None]:
titles = get_topic_titles(topic_doc)
len(titles) #30
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
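
When `find_all` is given a string with several space-separated classes, Beautiful Soup compares it against the exact value of the tag's `class` attribute, so the match would stop working if the classes were reordered or one were added. A hedged alternative is a CSS selector via `select`, which requires each listed class to be present in any order. The HTML below is a made-up fragment for illustration, not GitHub's actual markup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: same classes as the selector above, in a different order
sample_html = """
<p class="mt-1 f3 Link--primary lh-condensed mb-0">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact-string match on the class attribute: misses the reordered tag
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: every listed class must be present, order does not matter
by_css = sample_doc.select('p.f3.lh-condensed.Link--primary')

sample_titles = [tag.text.strip() for tag in by_css]
```

Here `exact` comes back empty while `sample_titles` contains the one title, which is the trade-off to weigh when the page markup changes.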

Similarly, we define functions to get the topic descriptions and URLs.

In [None]:
def get_topic_descs(topic_doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = topic_doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [None]:
desc = get_topic_descs(topic_doc)
len(desc)

30

In [None]:
def get_topic_urls(topic_doc):
    topic_link_tags = topic_doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [None]:
urls = get_topic_urls(topic_doc)
len(urls)

30
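
Concatenating `base_url` and `href` works as long as every `href` is a site-relative path. A small sketch using the standard library's `urljoin`, which also copes with hrefs that are already absolute (the example hrefs are illustrative):

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A site-relative href is resolved against the base URL
relative = urljoin(base_url, '/topics/3d')

# An href that is already absolute is returned unchanged
absolute = urljoin(base_url, 'https://github.com/topics/ajax')
```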

In [None]:
import pandas as pd

In [None]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(topic_doc),
        'description': get_topic_descs(topic_doc),
        'url': get_topic_urls(topic_doc)
    }
    return pd.DataFrame(topics_dict)
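
`pd.DataFrame` built from a dict of lists expects all lists to have the same length, so the three helper functions need to return one entry per topic. An illustration on made-up values (not scraped data):

```python
import pandas as pd

# Made-up values standing in for the output of the three helper functions
sample_topics = {
    'title': ['3D', 'Ajax'],
    'description': ['First description', 'Second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
sample_df = pd.DataFrame(sample_topics)

# Lists of different lengths are rejected with a ValueError
try:
    pd.DataFrame({'title': ['3D', 'Ajax'], 'url': ['https://github.com/topics/3d']})
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```

If one selector silently stops matching, this is the error `scrape_topics` would surface.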

### Get the top 25 repositories from a topic page

In [None]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [None]:
doc = get_topic_page('https://github.com/topics/3d')

In [None]:
def parse_star_count(stars):
    # Convert a star label to an integer; a trailing 'k' means thousands
    stars = stars.strip()
    if stars[-1] == 'k':
        return int(float(stars[:-1]) * 1000)
    return int(stars)
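
`parse_star_count` goes through `float`, and `int()` truncates rather than rounds, so a product that landed just below a whole number would lose one star. A hedged sketch that sidesteps the question by multiplying in decimal arithmetic (the name `parse_star_count_exact` is hypothetical; like the original, it assumes counts are either plain integers or end in `k`):

```python
from decimal import Decimal


def parse_star_count_exact(stars):
    """Variant of parse_star_count that multiplies in decimal arithmetic."""
    stars = stars.strip()
    if stars.endswith('k'):
        # Decimal('12.3') * 1000 is exactly 12300.0, with no binary rounding
        return int(Decimal(stars[:-1]) * 1000)
    return int(stars)
```

For example, `parse_star_count_exact('12.3k')` gives `12300` and `parse_star_count_exact(' 987 ')` gives `987`.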

In [None]:
def get_repo_info(repo_tag, star_tag):
    # Return the username, repo name, star count and URL for one repository
    a_tags = repo_tag.find_all('a')
    base_url = 'https://github.com'
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [None]:
def get_topic_repos(topic_doc):
    # Get the article tags, each containing the repo title, repo URL and username
    repo_tags = topic_doc.find_all('article', {'class': 'border rounded color-shadow-small color-bg-subtle my-4'})

    # Get the star count tags
    star_tags = topic_doc.find_all('span', {'id': 'repo-stars-counter-star'})

    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get the repo info, pairing each article tag with its star tag by position
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)
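
`get_topic_repos` pairs `repo_tags[i]` with `star_tags[i]` by position, which assumes the page yields exactly one star counter per repository card. A minimal sketch of making that assumption explicit (the helper name `pair_tags` is hypothetical):

```python
def pair_tags(repo_tags, star_tags):
    """Pair each repository tag with its star tag, failing loudly on a mismatch."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            'Found {} repo tags but {} star tags'.format(len(repo_tags), len(star_tags))
        )
    return list(zip(repo_tags, star_tags))
```

The loop could then iterate over `pair_tags(repo_tags, star_tags)` instead of indexing both lists.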

In [None]:
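import os


def scrape_topic(topic_url, path):
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # index=False leaves the DataFrame row index out of the CSV file
    topic_df.to_csv(path, index=False)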

## Putting it all together
We have a function to get the list of topics.

We have a function to create a CSV file of the scraped repos for a topic page.

We now create a function that puts them together.

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
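
The CSV path is built directly from the topic title, which would misbehave if a title contained a path separator such as `/`. A hedged sketch of a helper that sanitizes the title first (the name `topic_csv_path` and the set of allowed characters are assumptions, not part of the original notebook):

```python
import os
import re


def topic_csv_path(title, folder='data'):
    """Build a CSV path from a topic title, replacing characters unsafe in file names."""
    # Keep letters, digits, '_', '.', '+', '#' and '-'; collapse anything else to '_'
    safe = re.sub(r'[^\w.+#-]+', '_', title.strip())
    return os.path.join(folder, safe + '.csv')
```

With this, `'Awesome Lists'` maps to `Awesome_Lists.csv` and `'ASP.NET'` stays `ASP.NET.csv` inside the `data` folder.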

We run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [None]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin