### Top Repositories for GitHub Topics

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


### Project Outline

- We are going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 25 repositories from the topic page.
- For each repository, we'll get the repo name, username, stars and repo URL.
- For each topic, we'll create a CSV file in the following format:

Repo Name, Username, Stars, Repo Url
three.js, mrdoob, 96300, https://github.com/mrdoob/three.js
libgdx, libgdx, 22300, https://github.com/libgdx/libgdx
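
As a rough illustration of the target format, the sketch below writes the two sample rows above with Python's built-in `csv` module. The helper name `rows_to_csv_text` and the `rows` list are placeholders chosen for this example; the notebook itself builds its files with pandas later on.

```python
import csv
import io

# Sample rows taken from the format shown above.
header = ['Repo Name', 'Username', 'Stars', 'Repo Url']
rows = [
    ['three.js', 'mrdoob', 96300, 'https://github.com/mrdoob/three.js'],
    ['libgdx', 'libgdx', 22300, 'https://github.com/libgdx/libgdx'],
]

def rows_to_csv_text(header, rows):
    """Return the CSV text for a header and a list of rows (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv_text(header, rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the example free of side effects; passing an open file object to `csv.writer` instead would produce the file on disk.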






### Use the requests library to download web pages


In [9]:
!pip install requests --upgrade --quiet

In [11]:
import requests 

In [12]:
topics_url = 'https://github.com/topics'

In [13]:
response = requests.get(topics_url)

In [14]:
response.status_code

200
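
A status code of 200 means the page loaded. When a request occasionally fails, one option is to retry a few times before giving up. The sketch below is a minimal version of that idea; the helper name `fetch_with_retry`, the retry count and the delay are assumptions made for illustration, not part of the notebook.

```python
import time

def fetch_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns status 200 or the retries run out.

    `get` is any callable returning an object with a `status_code`
    attribute, for example `requests.get`. The defaults are
    illustrative, not values taken from the notebook.
    """
    last_status = None
    for attempt in range(retries):
        response = get(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < retries - 1:
            time.sleep(delay)  # wait before trying again
    raise RuntimeError('Failed to load {} (last status {})'.format(url, last_status))
```

In this notebook it could be called as `response = fetch_with_retry(requests.get, topics_url)`.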

In [15]:
len(response.text)

170766

In [16]:
page_contents = response.text 

In [20]:
page_contents[:1500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system" data-a11y-link-underlines="true">\n\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-38f1bf52eeeb.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-56010aa53a8f.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous"

In [25]:
with open('webpage.html', "w", encoding="utf-8") as f:
    f.write(page_contents)
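
Once the page is saved, it can be reused on later runs instead of being downloaded again. The following sketch shows one way to do that; `load_or_fetch` is a hypothetical helper, and `fetch` stands for any zero-argument function that returns the page text.

```python
import os

def load_or_fetch(path, fetch):
    """Return the text stored at `path`, calling fetch() only if the file is missing."""
    if os.path.exists(path):
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    text = fetch()
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return text
```

For example, `page_contents = load_or_fetch('webpage.html', lambda: requests.get(topics_url).text)` would hit the network only on the first run.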

### Use Beautiful Soup to parse and extract information

In [28]:
!pip install beautifulsoup4 --upgrade --quiet

In [29]:
from bs4 import BeautifulSoup 

In [30]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [31]:
type(doc)

bs4.BeautifulSoup

In [42]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class' : selection_class})

In [43]:
len(topic_title_tags)

30
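
Passing the full class string to `find_all` matches the `class` attribute exactly as written, so a tag whose classes appear in a different order would be missed. A CSS selector via `select` matches on individual classes instead. The sketch below compares the two on a small hand-written HTML sample (the sample markup is made up for illustration and is not GitHub's actual page).

```python
from bs4 import BeautifulSoup

# Hand-written sample: the second <p> lists the same classes in another order.
sample_html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
'''
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the whole class string, as used in the notebook.
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> that has both classes, regardless of order.
by_css = sample_doc.select('p.f3.Link--primary')

print(len(exact), len(by_css))
```

Selecting on fewer classes is less sensitive to small markup changes, at the cost of possibly matching unrelated tags, so the result count is worth checking either way.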

In [45]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [49]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class' : desc_selector})

In [50]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [51]:
topic_title_tag0 = topic_title_tags[0]

In [55]:
div_tag = topic_title_tag0.parent

In [56]:
topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-1 d-flex flex-column'})

In [57]:
len(topic_link_tags)

30

In [61]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)



https://github.com/topics/3d
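
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The helper name `absolute_url` is an assumption made for this example.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def absolute_url(href):
    """Resolve a link taken from the page against the site root."""
    return urljoin(base_url, href)

print(absolute_url('/topics/3d'))  # https://github.com/topics/3d
```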


In [63]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles) 

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [67]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())

topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [69]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags: 
    topic_urls.append(base_url + tag['href']) 
topic_urls 

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [71]:
!pip install pandas --quiet 

In [73]:
import pandas as pd 

In [75]:
topics_dict = { 
    'title': topic_titles, 
    'description': topic_descs, 
    'url': topic_urls
}
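
The three lists must have the same length, otherwise `pd.DataFrame` raises a `ValueError`. If one selector silently matches fewer tags than the others, a small check like the one sketched below gives a clearer message. `check_equal_lengths` is a hypothetical helper, not part of the notebook.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Small made-up example with matching lengths.
print(check_equal_lengths({'title': ['3D', 'Ajax'], 'url': ['/topics/3d', '/topics/ajax']}))
```

In the notebook it could be called as `check_equal_lengths(topics_dict)` just before building the DataFrame.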

In [76]:
topics_df = pd.DataFrame(topics_dict)

In [77]:
topics_df 

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## Creating a CSV from the information 

In [79]:
topics_df.to_csv('topic.csv', index=False)

## Getting information out of a topic page 

In [80]:
topic_page_url = topic_urls[0]

In [81]:
topic_page_url 

'https://github.com/topics/3d'

In [82]:
response = requests.get(topic_page_url) 

In [83]:
response.status_code 

200

In [84]:
len(response.text)

488672

In [85]:
topic_doc = BeautifulSoup(response.text, 'html.parser') 

In [142]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class}) 

In [143]:
len(repo_tags)

20

In [133]:
a_tags = repo_tags[0].find_all('a')

In [134]:
a_tags[0].text.strip()

'mrdoob'

In [135]:
a_tags[1].text.strip()

'three.js'

In [108]:
base_url = 'https://github.com'

repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [159]:
star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'}) 



In [160]:
len(star_tags)

20

In [210]:
star_tags[5]

<span aria-label="16456 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="16,456">16.5k</span>

In [212]:
star_tags[5].text.strip()

'16.5k'
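
The tag shown above also carries the unrounded count in its `title` attribute (`"16,456"`). Assuming that attribute is present on every star tag, which has not been verified here, the exact number can be read from it instead of the rounded text. `star_count_from_title` is a hypothetical helper for this sketch.

```python
def star_count_from_title(title):
    """Convert a title string such as '16,456' to the integer 16456."""
    return int(title.replace(',', '').strip())

print(star_count_from_title('16,456'))  # 16456
```

With the notebook's variables this would be `star_count_from_title(star_tags[5]['title'])`.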

In [204]:
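def parse_star_count(stars_str):
    # Convert a star label such as '16.5k' or '987' to an integer.
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # A trailing 'k' means thousands, e.g. '16.5k' -> 16500.
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)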

In [219]:
def get_repo_info(h3_tag, star_tag):
    # Returns (username, repo_name, stars, repo_url) for one repository.
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [214]:
parse_star_count(star_tags[5].text.strip())

16500
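
`parse_star_count` handles plain numbers and a `k` suffix. The sketch below is a slightly more defensive variant that also accepts thousands separators and an `m` suffix, and rounds instead of truncating so that floating-point error cannot drop a unit. Whether GitHub ever displays an `m` suffix is an assumption, and the name `parse_star_count_extended` is made up for this example.

```python
def parse_star_count_extended(stars_str):
    """Parse strings such as '96.3k', '1.2m' or '1,234' into an int."""
    text = stars_str.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}  # 'm' is an assumed suffix
    if text and text[-1] in multipliers:
        # round() guards against results like 1099.9999 being truncated
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)

print(parse_star_count_extended('16.5k'))  # 16500
```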

In [220]:
get_repo_info(repo_tags[0], star_tags[0]) 

('mrdoob', 'three.js', 96300, 'https://github.com/mrdoob/three.js')

In [222]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(star_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
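
The loop above indexes two lists in parallel and appends to four columns by position. A possible alternative is to collect one tuple per repository and convert the tuples to columns in a single helper. The sketch below uses two rows from this notebook's results as sample data; `records_to_columns` is a hypothetical name.

```python
def records_to_columns(records, column_names):
    """Turn a list of row tuples into a dict of column lists."""
    columns = {name: [] for name in column_names}
    for record in records:
        for name, value in zip(column_names, record):
            columns[name].append(value)
    return columns

# Sample tuples in the order returned by get_repo_info.
records = [
    ('mrdoob', 'three.js', 96300, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 22300, 'https://github.com/libgdx/libgdx'),
]
columns = records_to_columns(records, ['username', 'repo_name', 'stars', 'repo_url'])
print(columns['repo_name'])
```

In the notebook, `records` could be built with `[get_repo_info(h3, star) for h3, star in zip(repo_tags, star_tags)]`. Because `zip` stops at the shorter list, a mismatch between the two tag lists would not raise an `IndexError`, though it would also hide the mismatch.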
      


### Final Code 

In [399]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check that the request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # Returns (username, repo_name, stars, repo_url) for one repository.
    # Relies on base_url and parse_star_count defined earlier in the notebook.
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


import pandas as pd

def get_topic_repos(topic_doc): 
    # get the h3 tags 
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class}) 
    # get star tags 
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'}) 

    topic_repos_dict = { 
        'username' : [], 
        'repo_name' : [], 
        'stars' : [], 
        'repo_url' : []
    }

    for i in range(len(star_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3]) 

    return pd.DataFrame(topic_repos_dict)



def scrape_topic(topic_url, topic_name):
    # Save the top repositories of one topic to '<topic_name>.csv',
    # skipping topics whose file already exists.
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists. Skipping....".format(fname))
        return

    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=False)
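
Topic titles are used directly as file names, which works for the titles seen in this run but could fail for a title containing a character such as `/`. The sketch below shows one way to derive a safer file name; `topic_filename` and its set of allowed characters are assumptions chosen for illustration.

```python
import re

def topic_filename(topic_name, extension='.csv'):
    """Build a file name from a topic title, replacing unsafe characters."""
    # Keep letters, digits, dot, plus, hyphen and underscore; replace the rest.
    safe = re.sub(r'[^A-Za-z0-9.+_-]+', '_', topic_name.strip())
    return safe + extension

print(topic_filename('Awesome Lists'))  # Awesome_Lists.csv
```

Note that this would change some existing names (for example `Awesome Lists.csv` becomes `Awesome_Lists.csv`), so files written by earlier runs would no longer be detected by the "already exists" check.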

    










    
    
        

### Write a single function for:
1. Getting the list of topics from the topics page
2. Getting the list of top repos from the individual topic pages
3. For each topic, create a CSV of top repos for the topic 

In [388]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    
    return topic_titles


def get_topic_descs(doc): 
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    topic_descs = []

    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())

    return topic_descs


def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com' 

    for tag in topic_link_tags: 
        topic_urls.append(base_url + tag['href'])

    return topic_urls


    
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    
    if response.status_code != 200: 
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'title': get_topic_titles(doc), 
        'description': get_topic_descs(doc), 
        'url': get_topic_urls(doc)
    }

    return pd.DataFrame(topics_dict)
    

In [394]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])

In [401]:
scrape_topics_repos()

Scraping top repositories for "3D"
The file 3D.csv already exists. Skipping....
Scraping top repositories for "Ajax"
The file Ajax.csv already exists. Skipping....
Scraping top repositories for "Algorithm"
The file Algorithm.csv already exists. Skipping....
Scraping top repositories for "Amp"
The file Amp.csv already exists. Skipping....
Scraping top repositories for "Android"
The file Android.csv already exists. Skipping....
Scraping top repositories for "Angular"
The file Angular.csv already exists. Skipping....
Scraping top repositories for "Ansible"
The file Ansible.csv already exists. Skipping....
Scraping top repositories for "API"
The file API.csv already exists. Skipping....
Scraping top repositories for "Arduino"
The file Arduino.csv already exists. Skipping....
Scraping top repositories for "ASP.NET"
The file ASP.NET.csv already exists. Skipping....
Scraping top repositories for "Atom"
The file Atom.csv already exists. Skipping....
Scraping top repositories for "Awesome Lists

In [393]:
for index, row in topics_df.iterrows():
    print(row['title'], row['url'])

3D https://github.com/topics/3d
Ajax https://github.com/topics/ajax
Algorithm https://github.com/topics/algorithm
Amp https://github.com/topics/amphp
Android https://github.com/topics/android
Angular https://github.com/topics/angular
Ansible https://github.com/topics/ansible
API https://github.com/topics/api
Arduino https://github.com/topics/arduino
ASP.NET https://github.com/topics/aspnet
Atom https://github.com/topics/atom
Awesome Lists https://github.com/topics/awesome
Amazon Web Services https://github.com/topics/aws
Azure https://github.com/topics/azure
Babel https://github.com/topics/babel
Bash https://github.com/topics/bash
Bitcoin https://github.com/topics/bitcoin
Bootstrap https://github.com/topics/bootstrap
Bot https://github.com/topics/bot
C https://github.com/topics/c
Chrome https://github.com/topics/chrome
Chrome extension https://github.com/topics/chrome-extension
Command line interface https://github.com/topics/cli
Clojure https://github.com/topics/clojure
Code quality h

In [223]:
range(len(star_tags)) 

range(0, 20)

In [227]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [305]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,96300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,24700,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22300,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,21800,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,18500,https://github.com/ssloy/tinyrenderer
5,lettier,3d-game-shaders-for-beginners,16500,https://github.com/lettier/3d-game-shaders-for...
6,FreeCAD,FreeCAD,16000,https://github.com/FreeCAD/FreeCAD
7,aframevr,aframe,15900,https://github.com/aframevr/aframe
8,CesiumGS,cesium,11300,https://github.com/CesiumGS/cesium
9,MonoGame,MonoGame,10400,https://github.com/MonoGame/MonoGame
