# Scraping Top Repositories for Topics on GitHub

TODO  (Intro): 
- Introduction about web scraping
- Introduction about GitHub and the problem statement
- Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)



Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Scrape the list of topics from GitHub

We'll do this in three steps:

- Use requests to download the page
- Use Beautiful Soup (BS4) to parse the page and extract information
- Convert the results to a Pandas dataframe
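
The three steps above can be sketched on a small inline HTML string, so nothing is downloaded. The markup and the class names `topic-name` and `topic-desc` are made up for illustration and are not GitHub's actual markup.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for a downloaded page; not GitHub's real markup.
sample_html = """
<div><p class="topic-name">3D</p><p class="topic-desc"> 3D modeling. </p></div>
<div><p class="topic-name">Ajax</p><p class="topic-desc"> Interactive web apps. </p></div>
"""

# Parse the HTML and pull out the text of the matching tags.
sample_doc = BeautifulSoup(sample_html, 'html.parser')
names = [tag.text.strip() for tag in sample_doc.find_all('p', {'class': 'topic-name'})]
descriptions = [tag.text.strip() for tag in sample_doc.find_all('p', {'class': 'topic-desc'})]

# Collect the columns into a DataFrame.
sample_df = pd.DataFrame({'title': names, 'description': descriptions})
print(sample_df)
```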

Let's write a function to download the page.

In [2]:
import requests
from bs4 import BeautifulSoup
import os
import pandas as pd

In [4]:
def get_topics_page():
    
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc
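
A possible variation on the download step, shown only as a sketch: it adds a timeout and uses `raise_for_status()` instead of comparing the status code by hand. The function name `download_page`, the injectable `get` parameter, and the `FakeResponse` class are assumptions introduced here so the example runs without network access.

```python
import requests

def download_page(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising on HTTP errors.

    `get` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for offline use."""
    text = '<html><body>ok</body></html>'

    def raise_for_status(self):
        pass


html = download_page('https://github.com/topics',
                     get=lambda url, timeout: FakeResponse())
print(html)
```

With the default `get`, the same function would perform a real request, and a timeout keeps a stalled connection from hanging the notebook.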

In [11]:
doc = get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the `class` ...

![](https://i.imgur.com/Yp7jt81.png)


In [9]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles:

In [12]:
titles = get_topic_titles(doc)

In [13]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
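
Passing the whole class string to `find_all` matches the `class` attribute as an exact string, so it depends on the classes appearing in that order. The sketch below contrasts this with a CSS selector via `select`, which matches regardless of order. The HTML snippet is hypothetical and only reuses the class names from `get_topic_titles`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: same classes as in get_topic_titles, in a different order.
snippet = '<p class="Link--primary f3 lh-condensed mb-0 mt-1">3D</p>'
snippet_doc = BeautifulSoup(snippet, 'html.parser')

# Exact-string match on the class attribute: no match, because the order differs.
exact = snippet_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: matches any <p> carrying both classes, in any order.
by_selector = snippet_doc.select('p.f3.Link--primary')

print(len(exact), [tag.text for tag in by_selector])
```

Either approach still breaks if GitHub renames its classes, so the selector is worth re-checking whenever a function starts returning an empty list.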

Similarly, we can define functions for descriptions and URLs.

In [14]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-text-secondary mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs



In [19]:
descs = get_topic_descs(doc)

In [20]:
descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [18]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


In [21]:
urls = get_topic_urls(doc)

In [22]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
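
`get_topic_urls` builds each URL by string concatenation, which assumes every `href` is a path starting with `/`. As an alternative sketch, the standard library's `urljoin` resolves relative paths and leaves absolute URLs untouched:

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# Relative hrefs are resolved against the base URL.
relative = urljoin(base_url, '/topics/3d')
# Absolute hrefs are returned unchanged instead of being prefixed twice.
absolute = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative)
print(absolute)
```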

Let's put this all together into a single function.

In [17]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [23]:
topics_df = scrape_topics()

In [25]:
topics_df[:5]

,title,description,url
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
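
`scrape_topics` builds its DataFrame from three lists that were collected independently, so the rows only line up if each selector matched the same number of tags. A minimal sketch of a guard, using a hypothetical helper `check_equal_lengths` that is not part of the original notebook:

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths


# Made-up data: one description is missing, so the columns do not line up.
columns = {
    'title': ['3D', 'Ajax'],
    'description': ['Only one description'],
}
try:
    check_equal_lengths(columns)
except ValueError as error:
    print(error)
```

Calling such a check on `topics_dict` before `pd.DataFrame(topics_dict)` would name the column that came up short.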


In [28]:
topics_df.to_csv('topics.csv', index=False)

## Get the top repositories from a topic page

In [40]:
def parse_count_star(stars_str):
    # Convert a star label such as '69.7k' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
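
A slightly more defensive variant of the star parser, offered as a sketch rather than a replacement. The function name `parse_star_count`, the `m` suffix, and the comma handling are assumptions; the notebook itself only deals with plain numbers and the `k` suffix.

```python
def parse_star_count(text):
    """Convert a star label such as '69.7k' or '1,234' to an int."""
    text = text.strip().lower().replace(',', '')
    # Assumed suffixes; only 'k' appears in the original notebook.
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against floating-point results such as 1099.9999...
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_star_count('69.7k'))
print(parse_star_count(' 842 '))
print(parse_star_count('1,234'))
```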

In [33]:
def get_topic_page(topic_url):
    
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

In [38]:
def get_repo_info(h3_tag, star_tag):
    
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    base_url = 'https://github.com'
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_count_star(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url
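
To see what `get_repo_info` relies on, the extraction can be tried on a small hand-written `h3` tag. The markup below is hypothetical, modeled on the structure the function expects (first link is the user, second link is the repository), and is not copied from GitHub.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: first <a> is the user, second <a> is the repository.
repo_html = """
<h3>
  <a href="/mrdoob">mrdoob</a> /
  <a href="/mrdoob/three.js">three.js</a>
</h3>
"""
h3_tag = BeautifulSoup(repo_html, 'html.parser').find('h3')

a_tags = h3_tag.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = 'https://github.com' + a_tags[1]['href']

print(username, repo_name, repo_url)
```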

In [35]:
def get_topic_repos(topic_doc):
    
    # Get the h3 tags containing repo title, repo URL and username
    h3_class_selector = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class_selector})
    # Get star tags
    star_tags = topic_doc.find_all('a', {'class' : 'social-count float-none'})
    
    topics_repos_dict = {
        'username' : [],
        'repo_name' :[],
        'stars' : [],
        'repo_url' : []
    }
    
    # Get repo info
    for i in range(len(repo_tags)) :
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topics_repos_dict['username'].append(repo_info[0])
        topics_repos_dict['repo_name'].append(repo_info[1])
        topics_repos_dict['stars'].append(repo_info[2])
        topics_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topics_repos_dict)
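
`get_topic_repos` pairs repositories with star counts by position, which assumes both selectors return the same number of tags. A sketch of making that assumption explicit, using a hypothetical helper `pair_repos_with_stars` and plain strings in place of tags:

```python
def pair_repos_with_stars(repo_items, star_items):
    """Pair each repo entry with its star entry, refusing mismatched lists."""
    if len(repo_items) != len(star_items):
        raise ValueError('Found {} repos but {} star counts'.format(
            len(repo_items), len(star_items)))
    return list(zip(repo_items, star_items))


# Plain strings stand in for the h3 and star tags.
pairs = pair_repos_with_stars(['three.js', 'libgdx'], ['69.7k', '18.3k'])
print(pairs)
```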

In [32]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
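
The skip-if-exists check in `scrape_topic` makes reruns cheap, because topics that already have a file are not downloaded again. The sketch below isolates that pattern with the standard `csv` module and a temporary directory. The helper `write_rows_once` is hypothetical and stands in for the pandas-based version above.

```python
import csv
import os
import tempfile

def write_rows_once(path, header, rows):
    """Write a CSV file unless `path` already exists. Return True if written."""
    if os.path.exists(path):
        return False
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, '3D.csv')
    header = ['username', 'repo_name', 'stars', 'repo_url']
    rows = [['mrdoob', 'three.js', 69700, 'https://github.com/mrdoob/three.js']]
    first = write_rows_once(path, header, rows)   # file is created
    second = write_rows_once(path, header, rows)  # file exists, so it is skipped

print(first, second)
```

One consequence of this design is that stale files are never refreshed, so deleting a file is the way to force a new scrape for that topic.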

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together
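
When the two pieces are combined, each topic needs its own output file. One option, sketched here with a hypothetical helper `topic_csv_path`, is to derive the file name from the last segment of the topic URL (for example `3d` or `amphp`) rather than from the title, on the assumption that URL segments are unique and contain no spaces.

```python
import posixpath
from urllib.parse import urlparse

def topic_csv_path(topic_url, directory='data'):
    """Build a CSV path from the last path segment of a topic URL."""
    slug = posixpath.basename(urlparse(topic_url).path.rstrip('/'))
    return '{}/{}.csv'.format(directory, slug)


print(topic_csv_path('https://github.com/topics/3d'))
print(topic_csv_path('https://github.com/topics/amphp'))
```

Titles such as `Chrome extension` contain spaces, which is why the URL segment may be the more convenient key for file names.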

In [36]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

In [41]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top repositories for Code rev