## Scraping The Top Repositories for Topics on GitHub

- TODO :
1. Introduction about WebScraping
    - Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data     from the web

2. Introduction about github and problem statement
    - Github contains whole bunch of repositories.We are going to use topics page from topic page we are going to find list of topic repositories and download
    
3. Tools we used - python, requests, BeautifulSoup, pandas.
   

### Importants Links
- Github link - https://github.com/topics
- import request -hhttps://requests.readthedocs.io/en/latest/
- we can see response of status code -https://developer.mozilla.org/en-US/docs/Web/HTTP/Status 

### Project outline:
- we'r going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top respositories in the topic from the topic page
- For each respository, we'll grab the repo name, username, stars and repo URL

## Scrape the list of topics from github
- use request to download page
- use bs4 to parse and extract information
- convert to pandas data frame

Lets write function to download page

In [43]:
# Write function to download page

import requests
from bs4 import BeautifulSoup

def get_topic_page():
    topics_url = 'https://github.com/topics'
    
    #Download the page
    response = requests.get(topics_url)
    
    # check the sucessfull response
    if response.status_code != 200:
        raise Exception('failed to load {}'.format(topic_url))
        
    #parse using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [44]:
doc = get_topic_page()

In [45]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

Let's create helper function to parse information from the page.

To get topic titles, we can pick 'p' tags with the 'class' ..
![](https://i.imgur.com/B6Se4Wl.png)

In [46]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class' : selection_class})
    
    topic_titles = []

    for tags in topic_title_tags:
        topic_titles.append(tags.text)
    
    return topic_titles

get_topic_titles can be used to get the titles

In [47]:
titles = get_topic_titles(doc)


In [21]:
len(titles)

30

In [22]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly we have defined functions for descreptions and urls.

To get_topic_desc, we can pick 'p' tags with the 'class' ..
![](https://i.imgur.com/jkWPAxj.png)

In [48]:
def get_topic_desc(doc):
    selection_class1 = 'f5 color-fg-muted mb-0 mt-1'
    desc_tag = doc.find_all('p',{'class' : selection_class1})
    
    topic_desc = []

    for tags in desc_tag:
        topic_desc.append(tags.text.strip())
        
    return desc_tag

In [50]:
len(get_topic_desc(doc))

30

To get_topic_urls, we can pic 'a' tags with 'href'
![](https://i.imgur.com/AvXO0JD.png)

In [51]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
        
    topic_urls = []
    base_url = 'https://github.com'

    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

In [52]:
get_topic_urls(doc)

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

Let's put this all together into single function and convert single dataframe

In [53]:
import pandas as pd
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('failed to load {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topic_dict = {
        'title' : get_topic_titles(doc),
        'descrption' : get_topic_desc(doc),
        'url' : get_topic_urls(doc)
    }
    
    return pd.DataFrame(topic_dict)

In [54]:
scrape_topics()

Unnamed: 0,title,descrption,url
0,3D,[\n 3D refers to the use of three-dim...,https://github.com/topics/3d
1,Ajax,[\n Ajax is a technique for creating ...,https://github.com/topics/ajax
2,Algorithm,[\n Algorithms are self-contained seq...,https://github.com/topics/algorithm
3,Amp,[\n Amp is a non-blocking concurrency...,https://github.com/topics/amphp
4,Android,[\n Android is an operating system bu...,https://github.com/topics/android
5,Angular,[\n Angular is an open source web app...,https://github.com/topics/angular
6,Ansible,[\n Ansible is a simple and powerful ...,https://github.com/topics/ansible
7,API,[\n An API (Application Programming I...,https://github.com/topics/api
8,Arduino,[\n Arduino is an open source platfor...,https://github.com/topics/arduino
9,ASP.NET,[\n ASP.NET is a web framework for bu...,https://github.com/topics/aspnet


## Get the top respositories in the topic from the topic page

- Get the topic url from topic urls
- From topic url we can find username, repo name, url and Stars
- convert to pandas data frame

Let's write the function to get topic url

In [56]:
def get_topic_page(topic_url):
    
    # download the page
    response = requests.get(topic_url)
    
    # check the sucessfull response
    if response.status_code != 200:
        raise Exception('failed to load {}'.format(topic_url))
    
    #parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

In [58]:
doc = get_topic_page('https://github.com/topics/3d')

In [61]:
len(doc)

5

To get_repo_info, from 'h3' tag we get 'a' tags for username and repo_name
To get repo_url, We used base_url='https://github.com' with 'href'
![](https://i.imgur.com/0VOAj34.png)

In [63]:
def get_repo_info(h3_tags, star_tag):
    # returns all the requried information about repository
    a_tags = h3_tags.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    a_tags = h3_tags.find_all('span')
    stars = parse_star_count(star_tag.text.strip())
    return user_name, repo_name, stars, repo_url

From 'h3' tage will get 'a' tag for username and repo_name
![](https://i.imgur.com/xjqtXx1.png)

To get the stars, we need 'span' tag with the 'class'
![](https://i.imgur.com/i4FD8QA.png)

In [64]:
def get_topic_repos(topic_doc):
    
    
    # Get tags containing Username, repo_title, repo_url
    h3_selection = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selection})
    
    #Get star tags
    star_selection = 'Counter js-social-count'
    star_tag = topic_doc.find_all('span', {'class' : star_selection})
    
    topic_repos_dict = {
        'user_name':[],
        'repo_name':[],
        'repo_url':[],
        'stars':[]
    }
    
    #Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tag[i])
        topic_repos_dict['user_name'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[3])
        topic_repos_dict['repo_url'].append(repo_info[2])
    
    
    return pd.DataFrame(topic_repos_dict)


    

In [72]:
import os
def scrape_topic(topic_url, path):
  
    if os.path.exists(path):
        print('the file {} already exist, skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)

## Putting it all together
- We have a function to get the list of topics
- We have a function to get csv file for scraped repos from topics page
- Let's create function to put all together

In [1]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok = True)
    
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))

Let's run it to scrape the top repos for all the topics on the first page of 'https://github.com/topics'

In [41]:
scrape_topics_repos()

Scraping list of topics
scraping top repositories for "3D"
the file Data/3D.csv already exist, skipping...
scraping top repositories for "Ajax"
the file Data/Ajax.csv already exist, skipping...
scraping top repositories for "Algorithm"
the file Data/Algorithm.csv already exist, skipping...
scraping top repositories for "Amp"
the file Data/Amp.csv already exist, skipping...
scraping top repositories for "Android"
the file Data/Android.csv already exist, skipping...
scraping top repositories for "Angular"
the file Data/Angular.csv already exist, skipping...
scraping top repositories for "Ansible"
the file Data/Ansible.csv already exist, skipping...
scraping top repositories for "API"
the file Data/API.csv already exist, skipping...
scraping top repositories for "Arduino"
the file Data/Arduino.csv already exist, skipping...
scraping top repositories for "ASP.NET"
the file Data/ASP.NET.csv already exist, skipping...
scraping top repositories for "Atom"
the file Data/Atom.csv already exist,

In [42]:
#Read and display a csv using pandas
pd.read_csv('data/3d.csv')

Unnamed: 0,user_name,repo_name,repo_url,stars
0,mrdoob,three.js,95000,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,24000,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22000,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21000,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,18000,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16000,https://github.com/lettier/3d-game-shaders-for...
6,aframevr,aframe,15000,https://github.com/aframevr/aframe
7,FreeCAD,FreeCAD,15000,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,11000,https://github.com/CesiumGS/cesium
9,MonoGame,MonoGame,10000,https://github.com/MonoGame/MonoGame


## References and future work

- Summary
    - we'r going to scrape Github topics - https://github.com/topics.
    - From Github Will get list of topics.from each topic we can find Topic title, topic descrption and topic url
    for example topic title - 3d, descrption is title 3d's descrption and topic url- https://github.com/topics/3d.
    - In Each topic will find top repositories 
    - In Each repositories We can grab informaion like User Name, Repository Name, Repository Url and stars
    -amphp	amp	4000	https://github.com/amphp/amp
    
- References link found usefull
    - Github link - https://github.com/topics
    - import request -https://requests.readthedocs.io/en/latest/
    - import request -https://www.w3schools.com/python/module_requests.asp
    - we can see response of status code -https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

- Ideas for future work
    - Dataset of Books (Amazon): Create a dataset of popular books in different genres by scraping the site: https://www.amazon.in/gp/bestsellers/books/

    - Dataset of Quotes (BrainyQuote): Create a dataset of quotes for different tags/topics by scraping the site :https://www.brainyquote.com/topics

    - Dataset of Movies (TMDb): The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie.

    