# Scrape the Top Repositories of Topics from GitHub

Steps to follow:

Scrape the URL https://github.com/topics

   1. First, we will get the list of topics from GitHub. For each topic, we will extract the topic title, topic description, and page URL.
   2. Then, for each topic, we will get the top 25 repositories from the topic page.
   3. For each repository, we will scrape the repository name, user name, number of stars, and repository URL.
   4. For each topic, we will create a CSV file in the format below:

         "Repository Name, User Name, Number of Stars & Repo URL"

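As a rough illustration of the target file layout, the sketch below writes one repository row with Python's standard csv module. The column names follow the DataFrame keys used later in this notebook (username, repo_name, stars, repo_url), and the sample row is taken from the "algorithm" topic output. The in-memory buffer and the variable names are assumptions made for the example; the notebook itself writes the files with Pandas.

```python
import csv
import io

# Column order as produced by get_topic_repos later in the notebook
fieldnames = ['username', 'repo_name', 'stars', 'repo_url']

# One sample row taken from the "algorithm" topic output shown later
rows = [
    {'username': 'CyC2018', 'repo_name': 'CS-Notes', 'stars': 132000,
     'repo_url': 'https://github.com/CyC2018/CS-Notes'},
]

# Write to an in-memory buffer instead of a real file (assumption for the demo)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the column order here matches the code (user name first), which differs slightly from the order in the format line above.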

# Prerequisites

The following libraries should be installed in the Jupyter Notebook environment:

    1. BeautifulSoup  (!pip install beautifulsoup4)
    2. Requests       (!pip install requests)
    3. Pandas         (!pip install pandas)

# Scrape the list of topics from GitHub

Steps to follow:

*  Download the page using the requests library
*  Extract the information using BeautifulSoup (bs4)
*  Convert the information to a DataFrame using Pandas


In [53]:
#import the required libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [15]:
#Function to download the Page

def get_topics_page():
    # Get the URL to scrape from Web
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check the status of the download
    if response.status_code != 200:
        raise Exception('Error occurred: unable to load page {}'.format(topics_url))
    # Parse the downloaded HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [3]:
doc=get_topics_page()

Next, create functions to parse information from the page.

To get the topic titles, we can pick the p tags with the class "f3 lh-condensed mb-0 mt-1 Link--primary".

In [4]:
#Function to get the list of titles

def get_topic_titles(doc):
    # Select the class of the P tag which contains the Topic title
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [5]:
topic_titles=get_topic_titles(doc)

In [6]:
topic_titles[:10]

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET']
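The class lookup used in get_topic_titles compares against the full class string, so it depends on the classes appearing in exactly that order. The sketch below shows this on a small inline fragment and contrasts it with a CSS selector, which matches on individual classes. The HTML fragment and the variable names are hypothetical and only serve the illustration; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment; the second tag lists the same classes in another order
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact match on the full class string, as in get_topic_titles
exact = [tag.text for tag in sample_doc.find_all(
    'p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})]

# CSS selector: matches any <p> carrying both classes, in any order
by_selector = [tag.text for tag in sample_doc.select('p.f3.lh-condensed')]

print(exact)
print(by_selector)
```

If GitHub reorders or renames these utility classes, the exact-string lookup returns an empty list, so a selector on one or two stable classes may be the less brittle choice.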

In [7]:
#Function to get the Topic Description
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

The function above takes the parsed HTML document as a parameter and returns the topic descriptions found on the page.

In [8]:
topic_desc= get_topic_descs(doc)

In [9]:
topic_desc[:10]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.']

In [11]:
#Function to extract the URL of each individual topic

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [12]:
Topic_URL=get_topic_urls(doc)

In [14]:
Topic_URL[:10]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet']
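get_topic_urls builds each address by concatenating the base URL with the tag's href. As an alternative sketch, urljoin from the standard library does the same for relative paths and also copes with an href that is already absolute. The href values below are examples chosen to mirror the output above, not values read from the live page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Relative hrefs as they might appear in the topic link tags
hrefs = ['/topics/3d', '/topics/ajax']
topic_urls = [urljoin(base_url, href) for href in hrefs]

# An href that is already absolute is returned unchanged
absolute = urljoin(base_url, 'https://github.com/topics/api')

print(topic_urls)
print(absolute)
```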

Combine all these functions to get the required information and convert it into a Pandas DataFrame.

In [76]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Check the status code of the response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [61]:
topic=scrape_topics()

In [25]:
topic[20:30]

|  | title | description | url |
| --- | --- | --- | --- |
| 20 | Chrome | Chrome is a web browser from the tech company ... | https://github.com/topics/chrome |
| 21 | Chrome extension | Google Chrome Extensions are add-ons that allo... | https://github.com/topics/chrome-extension |
| 22 | Command line interface | A CLI, or command-line interface, is a console... | https://github.com/topics/cli |
| 23 | Clojure | Clojure is a dynamic, general-purpose programm... | https://github.com/topics/clojure |
| 24 | Code quality | Automate your code review with style, quality,... | https://github.com/topics/code-quality |
| 25 | Code review | Ensure your code meets quality standards and s... | https://github.com/topics/code-review |
| 26 | Compiler | Compilers are software that translate higher-l... | https://github.com/topics/compiler |
| 27 | Continuous integration | Automatically build and test your code as you ... | https://github.com/topics/continuous-integration |
| 28 | COVID-19 | The coronavirus disease 2019 (COVID-19) is an ... | https://github.com/topics/covid-19 |
| 29 | C++ | C++ is a general purpose and object-oriented p... | https://github.com/topics/cpp |
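scrape_topics builds the DataFrame from three lists that are scraped independently. If one selector stops matching and its list comes back shorter, Pandas raises a ValueError about unequal lengths when the dictionary is converted. The sketch below shows one way to check the lengths first and report which list is off. The helper name build_topic_records is hypothetical, and the sample values are copied from the outputs above.

```python
def build_topic_records(titles, descriptions, urls):
    """Pair up the three lists, refusing to continue if their lengths differ."""
    lengths = {len(titles), len(descriptions), len(urls)}
    if len(lengths) != 1:
        raise ValueError(
            'Length mismatch: {} titles, {} descriptions, {} urls'.format(
                len(titles), len(descriptions), len(urls)))
    return [
        {'title': t, 'description': d, 'url': u}
        for t, d, u in zip(titles, descriptions, urls)
    ]

records = build_topic_records(
    ['3D', 'Ajax'],
    ['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
     'Ajax is a technique for creating interactive web applications.'],
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(records[0]['url'])
```

A list of dictionaries like this can be passed to pd.DataFrame directly, so the check can sit in front of the existing conversion.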


# Get the top 25 repositories from a topic page

In [29]:
def get_topic_page(topic_url):
    # Download the page from web
    response = requests.get(topic_url)
    # Check whether the response was successful
    if response.status_code != 200:
        raise Exception('Unable to load page {}'.format(topic_url))
    # Parse using Beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

In [30]:
topic_doc=get_topic_page('https://github.com/topics/algorithm')

In [35]:
len(topic_doc)

5

In [78]:
#Function to convert a star-count string (for example "54.9k") to an integer
def convert_star_count(star_str):
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        # round() guards against floating-point error before converting to int
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)
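convert_star_count handles plain integers and the "k" suffix. As a hedged sketch of a slightly more defensive variant, the version below uses Decimal instead of float so that values like "54.9k" are scaled exactly, and it also accepts thousands separators. The function name parse_star_count, the "m" suffix, and the comma handling are assumptions for illustration; the page may never produce such values.

```python
from decimal import Decimal

# Multipliers for the suffixes this sketch understands (the 'm' case is an assumption)
SUFFIXES = {'k': 1000, 'm': 1000000}

def parse_star_count(text):
    """Convert strings such as '54.9k' or '1,234' to an integer."""
    text = text.strip().lower().replace(',', '')
    if not text:
        raise ValueError('empty star count')
    if text[-1] in SUFFIXES:
        # Decimal avoids binary floating-point rounding before truncation
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(text)

for sample in ['178k', '54.9k', '1,234', '987']:
    print(sample, parse_star_count(sample))
```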

In [79]:
#function to get the repository information by passing the h1_tag and Star_tag as a parameter
def get_repo_info(h1_tag, star_tag):
    # returns username,repository_name,number of stars & repo_url about a repository
    base_url = 'https://github.com'
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  base_url + a_tags[1]['href']
    stars = convert_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [49]:
def get_topic_repos(topic_doc):
    # Get the h1 tags containing repo title, repo URL and username
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    # Get a tag containing star information
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    
    #Dictionary to store the required information
    topic_repos_dict = { 
        'username': [], 
        'repo_name': [], 
        'stars': [],
        'repo_url': []
        }

    # Get each of the repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [50]:
get_topic_repos(topic_doc)

|  | username | repo_name | stars | repo_url |
| --- | --- | --- | --- | --- |
| 0 | jwasham | coding-interview-university | 178000 | https://github.com/jwasham/coding-interview-un... |
| 1 | CyC2018 | CS-Notes | 132000 | https://github.com/CyC2018/CS-Notes |
| 2 | TheAlgorithms | Python | 110000 | https://github.com/TheAlgorithms/Python |
| 3 | trekhleb | javascript-algorithms | 110000 | https://github.com/trekhleb/javascript-algorithms |
| 4 | yangshun | tech-interview-handbook | 54900 | https://github.com/yangshun/tech-interview-han... |
| 5 | kdn251 | interviews | 51700 | https://github.com/kdn251/interviews |
| 6 | azl397985856 | leetcode | 42500 | https://github.com/azl397985856/leetcode |
| 7 | algorithm-visualizer | algorithm-visualizer | 33900 | https://github.com/algorithm-visualizer/algori... |
| 8 | crossoverJie | JCSprout | 26200 | https://github.com/crossoverJie/JCSprout |
| 9 | donnemartin | interactive-coding-challenges | 22800 | https://github.com/donnemartin/interactive-cod... |


In [72]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
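scrape_topic skips any topic whose CSV file already exists, which makes reruns cheap but also means that existing files are never refreshed. The sketch below isolates that skip-if-exists pattern with the standard library only. The helper name write_once and the temporary directory are assumptions for the demonstration.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists; report what happened."""
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Algorithm.csv')
    first = write_once(path, 'username,repo_name,stars,repo_url\n')
    second = write_once(path, 'this text is never written\n')
    with open(path, encoding='utf-8') as f:
        content = f.read()

print(first, second)
```

To pick up fresh star counts with this pattern, the existing file has to be deleted before the next run.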

# Putting all the functions together to get the required information
We have a function to get the list of topics.

We have a function to create a CSV file of the scraped repos from a topic page.

Let's create a function to put them together.

In [73]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [75]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre
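scrape_topics_repos uses the topic title directly as the file name. Titles such as "Command line interface" contain spaces, and a title containing a slash would be read as a subdirectory. The sketch below shows one possible way to derive a safer file name. The helper topic_filename and its replacement rules are hypothetical, and "CI/CD" is a made-up title used only to show the slash case.

```python
import re

def topic_filename(title):
    """Build a file-system-friendly CSV name from a topic title (hypothetical helper)."""
    # Replace anything other than letters, digits, '.', '+', '-' and '_' with '-'
    cleaned = re.sub(r'[^A-Za-z0-9.+_-]+', '-', title.strip())
    # Avoid leading or trailing separators
    cleaned = cleaned.strip('-')
    return '{}.csv'.format(cleaned or 'untitled')

for title in ['3D', 'ASP.NET', 'Command line interface', 'C++', 'CI/CD']:
    print(title, '->', topic_filename(title))
```

Switching to names like these changes where the files land, so files written earlier under the raw titles would no longer be detected as existing.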
