# Scraping Top Repositories for Topics on GitHub

### Introduction:
- #### What is web scraping?
Web scraping (also called web harvesting or web data extraction) is the automated extraction of data from websites.

- #### What do we want to scrape?
Our goal is to scrape the top repositories for each topic listed on the first page of https://github.com/topics

- #### Tools we're using:
Python, requests, Beautiful Soup, and pandas


### Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 30 repositories in the topic from the topic page
- For each repo, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
```

## Scrape the list of topics from GitHub

- Use requests to download the page
- Use Beautiful Soup to parse and extract information
- Convert to a pandas DataFrame

### Function to download the topics page

In [2]:
import requests
from bs4 import BeautifulSoup

# function to download the topics page

def get_topics_page():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        # the variable is topics_url; the original referenced an undefined name here
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [10]:
# download and parse the topics page
doc = get_topics_page()

### Helper functions to parse information from the topics page

In [25]:
def get_topic_titles(doc):    
    
    # selection_class refers to class of 'p' tag from which we can extract relevant information
    # relevant tags and classes for selection can be found by using inspect element on the webpage
    
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [16]:
topic_titles = get_topic_titles(doc)

In [17]:
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, functions are defined for descriptions and URLs


In [15]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

`get_topic_descs` is used to get the list of descriptions

In [19]:
topic_descs = get_topic_descs(doc)

In [20]:
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [21]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append("https://github.com" + tag['href'])
    return topic_urls

`get_topic_urls` is used to get the list of URLs

In [22]:
topic_urls = get_topic_urls(doc)

In [23]:
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
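
The three helpers above pass the complete class string to `find_all`. Beautiful Soup then compares it against the tag's whole `class` attribute, so the match depends on the exact set and order of classes, and the helpers return an empty list if GitHub changes its markup. The sketch below illustrates this on a small hand-written HTML snippet (the snippet and its class combinations are made up for illustration, not copied from GitHub) and shows a CSS selector via `select` as one possible, more order-tolerant alternative.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration only).
html = """
<p class="f3 lh-condensed">Python</p>
<p class="lh-condensed f3">Rust</p>
<p class="f5">A description</p>
"""
doc = BeautifulSoup(html, 'html.parser')

# A multi-class string is matched against the exact attribute value,
# so only the first <p> (same classes, same order) is found.
exact = [tag.text for tag in doc.find_all('p', {'class': 'f3 lh-condensed'})]

# A CSS selector matches any <p> that carries both classes, in any order.
by_css = [tag.text for tag in doc.select('p.f3.lh-condensed')]

print(exact)   # ['Python']
print(by_css)  # ['Python', 'Rust']
```

Whichever form is used, the class names still have to be checked against the live page with inspect element, as noted in `get_topic_titles`.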

Let's put this all together into a single function

In [37]:
import pandas as pd

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    if response.status_code != 200:
        # the variable is topics_url; the original referenced an undefined name here
        raise Exception('Failed to load page {}'.format(topics_url))
    
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)    

`scrape_topics` returns a pandas DataFrame containing the topic titles, descriptions, and URLs
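
`pd.DataFrame` requires the three lists to have the same length, so if one selector stops matching, `scrape_topics` fails with a `ValueError`. A minimal sketch of that constraint, using the first three titles, descriptions and URLs from the outputs above as stand-in data (the `build_topics_df` name is hypothetical and not part of the original code):

```python
import pandas as pd

def build_topics_df(titles, descs, urls):
    """Combine the three parallel lists, with a clearer error on mismatch."""
    lengths = {'title': len(titles), 'description': len(descs), 'url': len(urls)}
    if len(set(lengths.values())) != 1:
        raise ValueError('Selectors returned different counts: {}'.format(lengths))
    return pd.DataFrame({'title': titles, 'description': descs, 'url': urls})

# Stand-in data copied from the outputs shown earlier.
titles = ['3D', 'Ajax', 'Algorithm']
descs = [
    '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
    'Ajax is a technique for creating interactive web applications.',
    'Algorithms are self-contained sequences that carry out a variety of tasks.',
]
urls = [
    'https://github.com/topics/3d',
    'https://github.com/topics/ajax',
    'https://github.com/topics/algorithm',
]

topics_df = build_topics_df(titles, descs, urls)
print(topics_df.shape)  # (3, 3)
```

Reporting the three counts in the error message makes it easier to see which selector needs updating.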

## Get the Top Repositories from a topic page

In [26]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

`get_topic_page` downloads the page at the given topic URL and returns it as a parsed Beautiful Soup document
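
`get_topics_page` and `get_topic_page` differ only in the URL they fetch, so the download-and-parse step could live in one helper. A possible sketch, assuming a request timeout is desirable (the `download_page` name, the `get` parameter and the 10-second timeout are illustrative choices, not part of the original code). The `get` parameter exists only so the function can be exercised without network access:

```python
import requests
from bs4 import BeautifulSoup

def download_page(url, get=requests.get, timeout=10):
    """Fetch url and return it parsed as a BeautifulSoup document."""
    response = get(url, timeout=timeout)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    return BeautifulSoup(response.text, 'html.parser')

# Offline demonstration with a stand-in for requests.get (hypothetical).
class FakeResponse:
    status_code = 200
    text = '<h1>Topics</h1>'

def fake_get(url, timeout):
    return FakeResponse()

demo_doc = download_page('https://github.com/topics', get=fake_get)
print(demo_doc.h1.text)  # Topics
```

With the default `requests.get`, `download_page('https://github.com/topics')` would play the role of `get_topics_page()`.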

In [30]:
def get_topic_repos(topic_doc):
    
    h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

`get_topic_repos` is used to get the list of repos from a particular topic page
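
`get_topic_repos` calls a helper, `get_repo_info`, whose definition does not appear in this write-up. The sketch below is one plausible implementation, not necessarily the original: it assumes each `h3` tag contains two `a` tags (the owner, then the repository) and that star counts are rendered either as plain integers or with a `k` suffix such as `71.5k`. The `parse_star_count` helper and the sample markup are likewise assumptions made for illustration.

```python
from bs4 import BeautifulSoup

base_url = 'https://github.com'

def parse_star_count(stars_str):
    """Convert a star label such as '987' or '71.5k' to an integer."""
    stars_str = stars_str.strip()
    if stars_str[-1].lower() == 'k':
        # round() guards against floating-point results like 8699.999...
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str.replace(',', ''))

def get_repo_info(h3_tag, star_tag):
    """Return (username, repo_name, stars, repo_url) for one repository."""
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url

# Hand-written sample markup with placeholder names (not real GitHub output).
sample = BeautifulSoup(
    '<h3><a href="/example-user">example-user</a> / '
    '<a href="/example-user/example-repo">example-repo</a></h3>'
    '<a class="social-count float-none">1.2k</a>',
    'html.parser',
)
repo_info = get_repo_info(
    sample.find('h3'),
    sample.find('a', {'class': 'social-count float-none'}),
)
print(repo_info)
```

The tuple order (username, repo name, stars, URL) matches the indices `repo_info[0]` to `repo_info[3]` used in `get_topic_repos`.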

In [39]:
import os

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_doc = get_topic_page(topic_url)
    topic_df = get_topic_repos(topic_doc)    
    # index=False omits the DataFrame's row index from the CSV
    topic_df.to_csv(path, index=False)

`scrape_topic` scrapes the top repositories for a single topic and saves them as a CSV file, skipping the topic if the file already exists
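
For reference, the CSV written by `scrape_topic` has one row per repository and the four columns built in `get_topic_repos`. A small sketch with placeholder rows (the names and numbers are invented for illustration, not scraped results), written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Placeholder rows in the same shape as the dictionary in get_topic_repos.
topic_df = pd.DataFrame({
    'username': ['example-user', 'another-user'],
    'repo_name': ['example-repo', 'another-repo'],
    'stars': [1200, 987],
    'repo_url': [
        'https://github.com/example-user/example-repo',
        'https://github.com/another-user/another-repo',
    ],
})

buffer = io.StringIO()
topic_df.to_csv(buffer, index=False)  # index=False leaves out the row numbers
csv_text = buffer.getvalue()
print(csv_text)
```

The first line of `csv_text` is the header row, taken from the dictionary keys in the order they were defined.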

## Putting it all together

In [32]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))   

`scrape_topics_repos` scrapes the top repositories for every topic on the first GitHub topics page and writes a separate CSV file for each topic

Run `scrape_topics_repos` to start scraping
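
`scrape_topics_repos` uses the topic title directly as the file name. That works for titles like `3D` or `Android`, but a title containing a path separator or other special characters could produce an invalid or unexpected path. One possible safeguard, sketched under the assumption that replacing such characters with underscores is acceptable (the `topic_csv_path` helper is hypothetical and not part of the original code):

```python
import os
import re

def topic_csv_path(title, directory='data'):
    """Build a CSV path from a topic title, replacing risky characters."""
    # Keep ASCII letters, digits, dots, plus signs, hyphens and underscores;
    # every other run of characters becomes a single underscore.
    safe_title = re.sub(r'[^A-Za-z0-9.+_-]+', '_', title.strip())
    return os.path.join(directory, '{}.csv'.format(safe_title))

print(topic_csv_path('3D'))
# 'Node / JS' is a made-up title used only to show the replacement.
print(topic_csv_path('Node / JS'))
```

Note that this pattern also replaces non-ASCII characters, so it would need adjusting if titles in other scripts should be preserved.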

## Summary
- To scrape the top repositories for each topic on the first page of https://github.com/topics, we first downloaded the topics page using requests
- We parsed the topics page using Beautiful Soup and extracted the topic titles, descriptions and URLs
- We combined the topic information into a data frame using pandas
- We fetched the top repositories from each topic page
- Finally, we saved the top repositories for each topic as a separate CSV file