# Top repositories for GitHub topics

# Pick a website and describe your objective

 - Browse through the different websites and pick one to scrape.
 - Identify the information you'd like to scrape from the site.
 - Decide the format of the output CSV file.
 - Summarize your project idea and outline your strategy in a Jupyter notebook.

### Strategy:

 - We're going to scrape https://github.com/topics.
 - We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
 - For each topic, we'll get the top repositories from the topic page (the first page listed 20 when this notebook was run).
 - For each repository, we'll grab the repo name, username, stars and repo URL.
 - For each topic, we'll create a CSV file.
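
The per-topic CSV layout described in the strategy can be sketched with the standard library's `csv` module. This is a minimal sketch, assuming the four repository columns named above; `REPO_COLUMNS` and `rows_to_csv` are hypothetical names used only for illustration, and the notebook itself writes its files with pandas.

```python
import csv
import io

# Assumed column order, taken from the strategy list.
REPO_COLUMNS = ["username", "repo_name", "stars", "repo_url"]

def rows_to_csv(rows, columns=REPO_COLUMNS):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

example_rows = [
    {
        "username": "mrdoob",
        "repo_name": "three.js",
        "stars": 88400,
        "repo_url": "https://github.com/mrdoob/three.js",
    }
]
csv_text = rows_to_csv(example_rows)
print(csv_text)
```

Fixing the column order up front keeps every topic file consistent, which makes it easier to combine the files later.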

# Step 1: Use the requests library to download web pages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)  # requests opens the URL and downloads the page

In [5]:
response.status_code

# HTTP response status codes indicate whether a specific HTTP request has been successfully completed. 

200
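
To make the meaning of a status code explicit, the standard library's `http.HTTPStatus` enum can translate a number into its reason phrase. This is a small sketch; `describe_status` and `is_success` are hypothetical helpers, not part of the notebook or of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code with its standard reason phrase, if it has one."""
    try:
        return "{} {}".format(code, HTTPStatus(code).phrase)
    except ValueError:
        # The code is not a registered HTTP status.
        return "{} (unknown status)".format(code)

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

print(describe_status(200))
print(describe_status(404))
print(is_success(200), is_success(429))
```

When working with a real response object, `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.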

In [6]:
response.text   # .text holds the page's HTML as a string

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [7]:
len(response.text)

152888

In [8]:
page_contents = response.text

In [9]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [10]:
with open("webpage2.html", "w") as f:
    f.write(page_contents)
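
The cell above relies on the platform's default text encoding, which can differ between machines. A hedged variant passes `encoding="utf-8"` explicitly so that non-ASCII characters in the page survive a round trip. `save_page` and `load_page` are hypothetical helper names, and the sample string stands in for a downloaded page.

```python
import os
import tempfile

def save_page(contents, path):
    """Write page text to disk as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(contents)

def load_page(path):
    """Read the page text back using the same encoding."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Stand-in for downloaded HTML containing non-ASCII characters.
sample = "<p>caf\u00e9 \u2713</p>"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.html")
    save_page(sample, path)
    restored = load_page(path)

print(restored == sample)
```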

# Use Beautiful Soup to parse and extract information

In [11]:
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
from bs4 import BeautifulSoup

In [13]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [14]:
type(doc)

bs4.BeautifulSoup

In [15]:
topic_title_tags = doc.find_all("p", 
                     {
                         "class": "f3 lh-condensed mb-0 mt-1 Link--primary"
                     }
                     )

In [16]:
len(topic_title_tags)

30
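
Passing the full class string to `find_all` matches only tags whose `class` attribute equals that exact string, so a reordered or extended class list would be missed. The sketch below contrasts this with a CSS selector via `select`, which matches on individual classes. The HTML snippet is made up for illustration, and the choice of `f3` and `Link--primary` as the identifying classes is an assumption about which classes matter.

```python
from bs4 import BeautifulSoup

# Made-up markup: the second <p> has the same classes in a different order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 mt-1 lh-condensed mb-0 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# Exact-string match on the whole class attribute.
exact = sample_doc.find_all(
    "p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"}
)

# CSS selector: any <p> carrying both classes, in any order.
flexible = sample_doc.select("p.f3.Link--primary")

exact_titles = [tag.text for tag in exact]
flexible_titles = [tag.text for tag in flexible]
print(exact_titles)
print(flexible_titles)
```

Either approach depends on GitHub's current markup, so the selectors may need updating if the page layout changes.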

In [17]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [18]:
desc_selector = "f5 color-fg-muted mb-0 mt-1"

topic_desc_tags = doc.find_all("p", {"class": desc_selector})

In [19]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [20]:
topic_link_tags = doc.find_all('a', {"class": "no-underline flex-grow-0"})

In [21]:
len(topic_link_tags)

30

In [22]:
topic_link_tags[:3]

[<a class="no-underline flex-grow-0" href="/topics/3d">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/ajax">
 <img alt="ajax" class="rounded mr-3" height="64" src="https://raw.githubusercontent.com/github/explore/8be26d91eb231fec0b8856359979ac09f27173fd/topics/ajax/ajax.png" width="64"/>
 </a>,
 <a class="no-underline flex-grow-0" href="/topics/algorithm">
 <div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
             #
           </div>
 </a>]

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]["href"]

In [24]:
print(topic0_url)

https://github.com/topics/3d
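
String concatenation works here because every `href` starts with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library also handles links that are already absolute. `absolute_url` and `BASE_URL` are hypothetical names for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```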


In [25]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [26]:
len(topic_titles)

30

In [27]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
print(topic_descs[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


In [28]:
topic_urls = []
base_url = "https://github.com"

for tag in topic_link_tags:
    topic_urls.append(base_url + tag["href"])
    
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [29]:
!pip install pandas --quiet

In [30]:
import pandas as pd

In [31]:
topics_dict = {
    "title": topic_titles,
    "description": topic_descs,
    "url": topic_urls
}
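
The three lists are matched up by position, so they need to have the same length; pandas raises a `ValueError` if they do not. A quick check before building the DataFrame, sketched below, gives a clearer message about which list is off. `check_equal_lengths` is a hypothetical helper and the sample data is abbreviated from the outputs above.

```python
def check_equal_lengths(columns):
    """Return each column's length, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

sample_columns = {
    "title": ["3D", "Ajax"],
    "description": ["3D modeling is ...", "Ajax is a technique ..."],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
column_lengths = check_equal_lengths(sample_columns)
print(column_lengths)
```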

In [32]:
topics_df = pd.DataFrame(topics_dict)

In [33]:
topics_df.head()

   title      description                                        url
0  3D         3D modeling is the process of virtually develo...  https://github.com/topics/3d
1  Ajax       Ajax is a technique for creating interactive w...  https://github.com/topics/ajax
2  Algorithm  Algorithms are self-contained sequences that c...  https://github.com/topics/algorithm
3  Amp        Amp is a non-blocking concurrency library for ...  https://github.com/topics/amphp
4  Android    Android is an operating system built by Google...  https://github.com/topics/android


### Create CSV file with the extracted information

In [34]:
topics_df.to_csv("topics.csv", index=None)

# Getting information out of a topic page

In [35]:
topic_page_url = topic_urls[0]

In [64]:
print(topic_page_url)

https://github.com/topics/3d


In [37]:
response2 = requests.get(topic_page_url)

In [38]:
response2.status_code

200

In [39]:
len(response2.text)

454925

In [40]:
topic_doc = BeautifulSoup(response2.text, "html.parser")

In [41]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"

repo_tags = topic_doc.find_all("h3", {"class": h3_selection_class})

In [88]:
repo_tags[1]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":45790596,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="14658fab6217ec4ba70f16dd98006d4334793fae49cc25ce2e1c0bb5a8950006" data-turbo="false" data-view-component="true" href="/pmndrs">
            pmndrs
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":172521926,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="629be4efc1260d27fe29201a1901eb808cbf995e4a51d877282b7164242dbadf" data-turbo="false" data-view-component="true" href="/pmndrs/re

In [42]:
len(repo_tags)

20

In [43]:
a_tags = repo_tags[0].find_all('a')

In [138]:
a_tags[0]

<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [139]:
a_tags[0].text

'\n            mrdoob\n'

In [44]:
a_tags[0].text.strip()

'mrdoob'

In [45]:
a_tags[1].text.strip()

'three.js'

In [46]:
repo_url = base_url + a_tags[1]["href"]
print(repo_url)

https://github.com/mrdoob/three.js


In [47]:

star_tags = topic_doc.find_all("span", {"class": "Counter js-social-count"})


In [48]:
len(star_tags)

20

In [51]:
star_tags[0].text

'88.4k'

In [59]:
type(star_tags[0].text)

str

In [60]:
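def star_count(stars_):
    # Convert a star label such as '88.4k' or '950' to an integer
    stars_ = stars_.strip()
    if stars_[-1] == "k":
        # round() guards against floating-point results just below the true value
        return round(float(stars_[:-1]) * 1000)
    else:
        return int(stars_)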
        

In [62]:
star_count(star_tags[0].text)

88400

In [140]:
star_count(star_tags[1].text)

21100
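
The star labels are abbreviated strings, so the conversion has to undo the abbreviation. The sketch below generalizes the same idea: it trims whitespace, drops thousands separators and handles an `m` suffix as well as `k`. Whether GitHub ever renders an `m` suffix or commas on this page is an assumption, and `parse_count` is a hypothetical name.

```python
def parse_count(label):
    """Convert labels such as '88.4k', '1.2m', '1,234' or '950' to an int."""
    text = label.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed suffix
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results from floating-point multiplication
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_count("88.4k"))
print(parse_count(" 21.1k\n"))
print(parse_count("950"))
```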

In [78]:
def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all("a")
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]["href"]
    stars = star_count(star_tag.text)
    return username, repo_name, stars, repo_url
    

In [77]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 88400, 'https://github.com/mrdoob/three.js')

In [141]:
get_repo_info(repo_tags[1], star_tags[1])

('pmndrs',
 'react-three-fiber',
 21100,
 'https://github.com/pmndrs/react-three-fiber')

In [84]:
topic_repos_dict = {
    "username": [],
    "repo_name": [],
    "stars": [],
    "repo_url": []
}





for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict["username"].append(repo_info[0])
    topic_repos_dict["repo_name"].append(repo_info[1])
    topic_repos_dict["stars"].append(repo_info[2])
    topic_repos_dict["repo_url"].append(repo_info[3])
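
The loop above pairs `repo_tags[i]` with `star_tags[i]` by index, which assumes both lists have the same length and order. A sketch of a stricter pairing follows; `pair_strict` is a hypothetical helper, and the two short lists are stand-ins for the parsed tags.

```python
def pair_strict(first, second):
    """Pair items by position, refusing to silently drop unmatched items."""
    if len(first) != len(second):
        raise ValueError(
            "Length mismatch: {} vs {}".format(len(first), len(second))
        )
    return list(zip(first, second))

# Stand-ins for the repository headings and their star labels.
names = ["three.js", "react-three-fiber"]
stars = ["88.4k", "21.1k"]

rows = [{"repo_name": n, "stars": s} for n, s in pair_strict(names, stars)]
print(rows)
```

Failing loudly on a mismatch is a design choice: if the page layout changes and one selector stops matching, the error surfaces immediately instead of producing misaligned rows.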
    


In [86]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [87]:
topic_repos_df

   username   repo_name                      stars  repo_url
0  mrdoob     three.js                       88400  https://github.com/mrdoob/three.js
1  pmndrs     react-three-fiber              21100  https://github.com/pmndrs/react-three-fiber
2  libgdx     libgdx                         21000  https://github.com/libgdx/libgdx
3  BabylonJS  Babylon.js                     19200  https://github.com/BabylonJS/Babylon.js
4  ssloy      tinyrenderer                   15800  https://github.com/ssloy/tinyrenderer
5  aframevr   aframe                         15000  https://github.com/aframevr/aframe
6  lettier    3d-game-shaders-for-beginners  14400  https://github.com/lettier/3d-game-shaders-for...
7  FreeCAD    FreeCAD                        13000  https://github.com/FreeCAD/FreeCAD
8  CesiumGS   cesium                         9800   https://github.com/CesiumGS/cesium
9  metafizzy  zdog                           9500   https://github.com/metafizzy/zdog


In [112]:
import os

In [135]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check the status of this request (not the earlier global response)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, "html.parser")
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    #returns all the required info about a repository
    a_tags = h3_tag.find_all("a")
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]["href"]
    stars = star_count(star_tag.text)
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # get h3 tags containing repo title, repo url and username
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all("h3", {"class": h3_selection_class})
    # get star tags
    star_tags = topic_doc.find_all("span", {"class": "Counter js-social-count"})

    topic_repos_dict = {
        "username": [],
        "repo_name": [],
        "stars": [],
        "repo_url": []
    }
    
         
    # get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict["username"].append(repo_info[0])
        topic_repos_dict["repo_name"].append(repo_info[1])
        topic_repos_dict["stars"].append(repo_info[2])
        topic_repos_dict["repo_url"].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)


def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping.......".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)
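
`scrape_topic` skips a topic when its CSV already exists, so an interrupted run can be restarted without downloading the same pages again. The sketch below isolates that pattern without any network access; `write_once` and `fake_scrape` are hypothetical stand-ins for the real download and parse steps.

```python
import os
import tempfile

def write_once(path, produce_text):
    """Write produce_text() to path unless the file exists. Return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(produce_text())
    return True

calls = []

def fake_scrape():
    # Stand-in for the expensive download-and-parse step.
    calls.append(1)
    return "username,repo_name,stars,repo_url\n"

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "3D.csv")
    first = write_once(target, fake_scrape)
    second = write_once(target, fake_scrape)

print(first, second, len(calls))
```

Because the producer is only called when the file is missing, the second call does no scraping work at all.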

In [101]:
url4 = topic_urls[4]

In [102]:
topic4_doc = get_topic_page(url4)

In [103]:
topic4_repos = get_topic_repos(topic4_doc)

In [104]:
topic4_repos

   username          repo_name                     stars   repo_url
0  flutter           flutter                       149000  https://github.com/flutter/flutter
1  facebook          react-native                  107000  https://github.com/facebook/react-native
2  justjavac         free-programming-books-zh_CN  98900   https://github.com/justjavac/free-programming-...
3  Genymobile        scrcpy                        75700   https://github.com/Genymobile/scrcpy
4  Hack-with-Github  Awesome-Hacking               60500   https://github.com/Hack-with-Github/Awesome-Ha...
5  google            material-design-icons         47200   https://github.com/google/material-design-icons
6  wasabeef          awesome-android-ui            45200   https://github.com/wasabeef/awesome-android-ui
7  Solido            awesome-flutter               45000   https://github.com/Solido/awesome-flutter
8  square            okhttp                        43400   https://github.com/square/okhttp
9  android           architecture-samples          42100   https://github.com/android/architecture-samples


In [130]:
#step 1


def get_topic_titles(doc):
    selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags = doc.find_all("p", {"class": selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_desc(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all("p", {"class": desc_selector})
    topic_desc = []
    for tag in topic_desc_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {"class": "no-underline flex-grow-0"})
    topic_urls=[]
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag["href"])
    return topic_urls

def scrape_topics():
    topics_url = "https://github.com/topics"
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(topics_url))
    # Parse the page just downloaded instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, "html.parser")
    topics_dict = {
        "title": get_topic_titles(doc),
        "description": get_topic_desc(doc),
        "url": get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [129]:
scrape_topics()

   title      description                                        url
0  3D         3D modeling is the process of virtually develo...  https://github.com/topics/3d
1  Ajax       Ajax is a technique for creating interactive w...  https://github.com/topics/ajax
2  Algorithm  Algorithms are self-contained sequences that c...  https://github.com/topics/algorithm
3  Amp        Amp is a non-blocking concurrency library for ...  https://github.com/topics/amphp
4  Android    Android is an operating system built by Google...  https://github.com/topics/android
5  Angular    Angular is an open source web application plat...  https://github.com/topics/angular
6  Ansible    Ansible is a simple and powerful automation en...  https://github.com/topics/ansible
7  API        An API (Application Programming Interface) is ...  https://github.com/topics/api
8  Arduino    Arduino is an open source hardware and softwar...  https://github.com/topics/arduino
9  ASP.NET    ASP.NET is a web framework for building modern...  https://github.com/topics/aspnet


In [136]:
# step 2

def scrape_topics_repos():
    print("Scraping list of topics")
    topics_df = scrape_topics()
    
    
    os.makedirs("data", exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row["title"]))
        scrape_topic(row["url"], 'data/{}.csv'.format(row["title"]))

In [137]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositori
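
The CSV file names are built directly from topic titles, some of which contain spaces or punctuation. A conservative sketch for turning a title into a file name follows; `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption chosen to keep titles such as "C++" and "ASP.NET" readable.

```python
import re

def safe_filename(title, extension=".csv"):
    """Replace runs of characters outside a small allowed set with '_'."""
    name = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    name = name.strip("_") or "untitled"
    return name + extension

print(safe_filename("Awesome Lists"))
print(safe_filename("C++"))
print(safe_filename("ASP.NET"))
```

Replacing path separators in particular matters, because a title containing `/` would otherwise be interpreted as a subdirectory.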