## Scraping Top Repos for Topics on GitHub

### Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
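
As a rough sketch of how rows in this format could be produced, the snippet below writes the two sample rows into an in-memory buffer with Python's standard `csv` module. The `rows_to_csv` helper and the `HEADER` constant are hypothetical names used only to illustrate the column order; the notebook itself writes its files with pandas.

```python
import csv
import io

# Column order taken from the sample file shown above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Using a CSV writer rather than joining strings by hand means a repo name containing a comma would be quoted automatically.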

## Scrape the list of topics from GitHub

We'll do this in three steps:

- use requests to download the page
- use BeautifulSoup (BS4) to parse the page and extract information
- convert the results to a Pandas DataFrame

Let's write a function to download the page.

In [78]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_topics_page(topics_url='https://github.com/topics'):
    # Download the page that lists GitHub topics
    response = requests.get(topics_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topics_url))
    # Parse the HTML with BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

#### get_topic_titles(doc) returns the list of topic titles

In [79]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':selection_class }) 
    topic_titles =[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles
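
The lookup above passes the full class string to `find_all`, which BeautifulSoup treats as an exact match on the `class` attribute. The sketch below illustrates that behaviour on a small, made-up HTML snippet (not the real GitHub markup) and compares it with a CSS selector that ignores class order. Which approach suits the live page is an assumption to verify against the current HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page's classes may differ.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact-string match: only tags whose class attribute equals this string.
exact = doc.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})
exact_titles = [tag.text for tag in exact]

# CSS selector: any <p> carrying both classes, regardless of their order.
flexible = doc.select("p.f3.Link--primary")
flexible_titles = [tag.text for tag in flexible]

print(exact_titles)
print(flexible_titles)
```

The exact-string form is stricter, so it can stop matching if the site reorders or adds classes; the selector form is looser and may match more tags than intended.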

#### Similarly, we define functions for descriptions and URLs.

In [80]:
def get_topic_description(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class':desc_selector })
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [81]:
def get_topic_url(doc):
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
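
String concatenation works here as long as every `href` is a site-relative path, which is what the code above assumes. As a minimal sketch of an alternative, `urllib.parse.urljoin` from the standard library handles both relative and already-absolute links. The sample `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# A site-relative href is resolved against the base URL.
relative = urljoin(base_url, "/topics/3d")

# An href that is already absolute is returned unchanged,
# whereas base_url + href would produce a malformed URL.
absolute = urljoin(base_url, "https://github.com/topics/ajax")

print(relative)
print(absolute)
```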

### Combining all the functions together

In [82]:
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # Each helper returns one list; together they form the DataFrame columns
    topics_dict = {
        'title'       : get_topic_titles(doc),
        'description' : get_topic_description(doc),
        'url'         : get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)
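
Building a DataFrame from a dict of lists requires all lists to have the same length; pandas raises a `ValueError` otherwise. If one selector matches fewer tags than the others, the cause can be hard to spot from that error alone. The sketch below shows one way to check the lengths first. The `check_equal_lengths` helper and the sample data are hypothetical and not part of the notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError naming each column's length if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: {}".format(lengths))
    return columns

# Made-up data: the second dict is missing one description.
good = {"title": ["3D", "Ajax"], "description": ["a", "b"], "url": ["u1", "u2"]}
bad = {"title": ["3D", "Ajax"], "description": ["a"], "url": ["u1", "u2"]}

check_equal_lengths(good)
try:
    check_equal_lengths(bad)
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```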

## Get the top repositories from a topic page

In [83]:
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topic_url))
    # Parse using Beautifulsoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [84]:
doc = get_topic_page('https://github.com/topics/3d')

#### Extracting tags for title, URLs, stars...

In [85]:
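def star_parser(star_num):
    # Convert a star count such as '69.7k' or '950' to an integer
    star_num = star_num.strip()
    if star_num[-1] == 'k':
        return int(round(float(star_num[:-1]) * 1000))
    return int(star_num)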
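
To make the conversion concrete, here is a small standalone sketch with worked inputs. It assumes star counts appear either as plain integers or with a lowercase `k` suffix for thousands, as in the sample CSV; `parse_star_count` is a hypothetical name used so the snippet runs on its own.

```python
def parse_star_count(text):
    """Convert strings such as '69.7k' or '950' to an integer star count."""
    text = text.strip()
    if text.endswith("k"):
        # Round before converting so floating-point noise cannot drop a star.
        return int(round(float(text[:-1]) * 1000))
    return int(text)

examples = ["69.7k", "18.3k", " 950 "]
parsed = [parse_star_count(value) for value in examples]
print(parsed)
```

Returning an integer in both branches keeps the `Stars` column free of values like `69700.0` when it is written to CSV.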

In [86]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required information about a repository
    base_url  = 'https://github.com'
    a_tags    = h3_tag.find_all('a')
    username  = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url  = base_url + a_tags[1]['href']
    stars     = star_parser(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [87]:
def get_topic_repo(topic_doc):
    # Get the h3 tags containing the repo title, username and URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    # Collect the repo info column by column
    topic_repos_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['repo_url'].append(repo_info[2])
        topic_repos_dict['stars'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

## Creating CSVs

In [88]:
def scrape_topic(topic_url, topic_name):
    # The output path is hard-coded to the D: drive; adjust it for your system
    fname = 'D:/' + topic_name + '.csv'
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(fname):
        print('this file already exists skipping.......')
        return
    topic_df = get_topic_repo(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=False)
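
The file name is built directly from the topic title, and titles can contain spaces or other characters (for example "Awesome Lists"). As a hedged sketch, the hypothetical helper below builds the path with `os.path.join`, creates the output directory if needed, and replaces characters outside a conservative set with underscores. The directory name and the character set are assumptions for illustration.

```python
import os
import re
import tempfile

def topic_csv_path(output_dir, topic_name):
    """Return a CSV path for a topic, creating output_dir if it is missing."""
    os.makedirs(output_dir, exist_ok=True)
    # Keep letters, digits, dot, underscore and hyphen; replace the rest.
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", topic_name.strip())
    return os.path.join(output_dir, safe_name + ".csv")

# A temporary directory stands in for the real output location.
output_dir = os.path.join(tempfile.gettempdir(), "github_topics_demo")
path = topic_csv_path(output_dir, "Awesome Lists")
print(path)
```

A path built this way could be used in place of the hard-coded drive prefix, so the same code runs on systems without a `D:` drive.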

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file of scraped repos from a topic page
- Let's create a function to put them together

In [89]:
def scrape_topics_repos():
    print('scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])

Run the function above to scrape the top repos for all the topics on the first page of https://github.com/topics

In [90]:
scrape_topics_repos()

scraping list of topics
scraping top repositories for "3D"
this file already exists skipping.......
scraping top repositories for "Ajax"
this file already exists skipping.......
scraping top repositories for "Algorithm"
this file already exists skipping.......
scraping top repositories for "Amp"
this file already exists skipping.......
scraping top repositories for "Android"
this file already exists skipping.......
scraping top repositories for "Angular"
this file already exists skipping.......
scraping top repositories for "Ansible"
this file already exists skipping.......
scraping top repositories for "API"
this file already exists skipping.......
scraping top repositories for "Arduino"
this file already exists skipping.......
scraping top repositories for "ASP.NET"
this file already exists skipping.......
scraping top repositories for "Atom"
this file already exists skipping.......
scraping top repositories for "Awesome Lists"
this file already exists skipping.......
scraping top re