# Scraping Top Repositories for Topics on GitHub

TODO:

- Introduction about web scraping.
- Introduction about GitHub and the problem statement
- Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)

https://jovian.ai/aakashns/python-web-scraping-project-guide (steps for reference, short notes)

Extract the top 20 3D repositories from GitHub

## 1. Pick a website and describe your objective
- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


#### Project Outline (Objectives):

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description.
- For each topic, we'll get the top 20 repositories listed on the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a csv file in the following format:

```
Repo Name,Username,Stars,Repo URL
threejs,mrdoob,69700,https://github.com/mrdoob/three.js
```

## 2. Use the requests library to download web pages
- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
# 200 = request was successful (for other codes, search for 'HTTP response status codes')
response.status_code

200
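
Status codes are grouped by their first digit: 2xx means success, 3xx redirection, 4xx a client error and 5xx a server error. As a small sketch using only the standard library's `http.HTTPStatus`, a numeric code can be turned into a readable label. The helper name `describe_status` is made up for this note and is not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a label such as '200 OK (success)' for an HTTP status code."""
    categories = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes.
        phrase = 'Unknown'
    category = categories.get(code // 100, 'other')
    return '{} {} ({})'.format(code, phrase, category)


print(describe_status(200))
print(describe_status(404))
```

With a real response, `describe_status(response.status_code)` would give the same kind of label.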

In [5]:
len(response.text)

195267

In [6]:
page_contents = response.text

In [7]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  >\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-f552bab6ce72.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-4589f64a2275.css" /><link data-color-theme="dark_dimmed" crossorigin

In [8]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## 3. Use Beautiful Soup to parse and extract information
- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.

- Parsing involves reading the HTML code and breaking it down into its constituent parts to retrieve the desired data.
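
To make the idea of parsing concrete, here is a small sketch that uses Python's built-in `html.parser` module, which is the module Beautiful Soup builds on when `'html.parser'` is passed as the parser name. The class name `ParagraphCollector` and the sample HTML are made up for illustration. The class only collects the text inside every `<p>` tag; Beautiful Soup does far more than this.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag the parser encounters.
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        # Called for the text between tags; keep it only inside <p>.
        if self.in_paragraph:
            self.paragraphs[-1] += data


sample_html = '<div><p class="f3"> 3D </p><p class="f5">Three-dimensional graphics.</p></div>'
collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
paragraphs = [text.strip() for text in collector.paragraphs]
print(paragraphs)
```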

In [9]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [10]:
p_tags = doc.find_all('p')

In [11]:
len(p_tags)

69

In [12]:
p_tags[:15]

[<p>We read every piece of feedback, and take your input very seriously.</p>,
 <p class="text-small color-fg-muted">
             To see all available qualifiers, see our <a class="Link--inTextBlock" href="https://docs.github.com/search-github/github-code-search/understanding-github-code-search-syntax">documentation</a>.
           </p>,
 <p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Unity
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Unity is a game engine used to create 2D/3D video games, and simulations for computers, consoles, and mobile devices.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         MongoDB
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">MongoDB is an open source NoSQL document-oriented database.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Homebrew
       </p>,

In [13]:
topic_title_tags = doc.find_all('p', {'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [14]:
len(topic_title_tags)

30

In [15]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [16]:
# let's get the topic descriptions here
topic_title_descriptions = doc.find_all('p', {'class':'f5 color-fg-muted mb-0 mt-1'})

In [17]:
topic_title_descriptions[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
# get URLs
topic_title_tag0 = topic_title_tags[0]

In [19]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [20]:
topic_link_tags = doc.find_all('a', {'class':'no-underline flex-grow-0'})

In [21]:
len(topic_link_tags)

30

In [22]:
topic_link_tags[0]['href']

'/topics/3d'

In [23]:
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
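
String concatenation works here because the `href` values seen above start with `/`. As an alternative sketch, `urllib.parse.urljoin` from the standard library builds the same URL and also copes with links that are already absolute. The second `href` below is an illustrative value, not one taken from the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A relative link, as found in the topic tags.
relative_result = urljoin(base_url, '/topics/3d')
# An already absolute link is returned unchanged.
absolute_result = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative_result)
print(absolute_result)
```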


In [24]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [25]:
topic_descs = []

for tag in topic_title_descriptions:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [26]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [27]:
topics_dict = {
    'title' : topic_titles,
    'description' : topic_descs,
    'url' : topic_urls
}

In [28]:
topics_df = pd.DataFrame(topics_dict)

In [29]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## 4. Create CSV file(s) with the extracted information

In [30]:
topics_df.to_csv('topics.csv', index=False)

# Getting Information out of a topic page

In [31]:
topic_page_url = topic_urls[0]

In [32]:
topic_page_url

'https://github.com/topics/3d'

In [33]:
response = requests.get(topic_page_url)

In [34]:
response.status_code

200

In [35]:
len(response.text)

506395

In [36]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [37]:
# the repository headings on the topic page are h3 tags
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [38]:
len(repo_tags)

20

In [39]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href

In [40]:
a_tags = repo_tags[0].find_all('a')

In [41]:
a_tags[0]

<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>

In [42]:
a_tags[0].text.strip()

'mrdoob'

In [43]:
a_tags[1].text.strip()

'three.js'

In [44]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [45]:
base_url

'https://github.com'

In [46]:
# number of stars
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [47]:
len(star_tags)

20

In [48]:
star_tags[0].text

'99.8k'

In [49]:
stars_str = '99.8k'

In [50]:
stars_str[-1]

'k'

In [51]:
# convert to number
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [52]:
parse_star_count(star_tags[0].text.strip())

99800
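
`parse_star_count` covers the `k` suffix seen on this page. As a hedged sketch, the variant below (named `parse_count` to keep it separate from the notebook's function) also accepts an `m` suffix and comma separators such as `1,234`. Whether GitHub ever renders star counts in those forms on topic pages is an assumption, not something observed above. It uses `round` rather than `int` so that a floating-point result sitting just below a whole number is not truncated downwards.

```python
def parse_count(text):
    """Convert strings like '99.8k', '1.2m', '1,234' or '870' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        # Scale the numeric part by the suffix and round to the nearest integer.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count('99.8k'))
print(parse_count('1,234'))
```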

##### Gathering all findings in one place

In [53]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [54]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 99800, 'https://github.com/mrdoob/three.js')

In [55]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [56]:
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : [] 
    }
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=False)
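
`scrape_topic` uses the topic title directly as the file name. Titles such as `C++` or `Command line interface` are valid on most systems but awkward to work with, and a title containing `/` would be treated as a directory path. A hypothetical helper (`topic_to_filename` is not part of the original notebook) could normalise titles first:

```python
import re


def topic_to_filename(topic_name, extension='.csv'):
    """Turn a topic title into a lowercase, filesystem-friendly file name."""
    # Spell out '+' and '#' so 'C++' and 'C#' do not collapse to the same name as 'C'.
    name = topic_name.strip().lower().replace('+', 'p').replace('#', 'sharp')
    # Replace every run of other non-alphanumeric characters with one hyphen.
    name = re.sub(r'[^a-z0-9]+', '-', name).strip('-')
    # Fall back to a generic name if nothing usable is left.
    return (name or 'topic') + extension


for title in ['3D', 'ASP.NET', 'Command line interface', 'C', 'C++']:
    print(title, '->', topic_to_filename(title))
```

Passing `topic_to_filename(topic_name)` to `to_csv` instead of `topic_name + '.csv'` would be one way to use it.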

In [57]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [58]:
url4 = topic_urls[4]

In [59]:
topic4_doc = get_topic_page(url4)

In [60]:
topic4_repos = get_topic_repos(topic4_doc)

In [61]:
url4

'https://github.com/topics/android'

In [62]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,163000,https://github.com/flutter/flutter
1,facebook,react-native,117000,https://github.com/facebook/react-native
2,justjavac,free-programming-books-zh_CN,110000,https://github.com/justjavac/free-programming-...
3,Genymobile,scrcpy,104000,https://github.com/Genymobile/scrcpy
4,Hack-with-Github,Awesome-Hacking,78900,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,51800,https://github.com/Solido/awesome-flutter
6,google,material-design-icons,50000,https://github.com/google/material-design-icons
7,wasabeef,awesome-android-ui,49500,https://github.com/wasabeef/awesome-android-ui
8,tldr-pages,tldr,49000,https://github.com/tldr-pages/tldr
9,square,okhttp,45400,https://github.com/square/okhttp


In [63]:
topic_urls[5]

'https://github.com/topics/angular'

In [64]:
get_topic_repos(get_topic_page(topic_urls[5]))

Unnamed: 0,username,repo_name,stars,repo_url
0,justjavac,free-programming-books-zh_CN,110000,https://github.com/justjavac/free-programming-...
1,angular,angular,94900,https://github.com/angular/angular
2,storybookjs,storybook,83200,https://github.com/storybookjs/storybook
3,leonardomso,33-js-concepts,62400,https://github.com/leonardomso/33-js-concepts
4,ionic-team,ionic-framework,50600,https://github.com/ionic-team/ionic-framework
5,prettier,prettier,48500,https://github.com/prettier/prettier
6,Asabeneh,30-Days-Of-JavaScript,41400,https://github.com/Asabeneh/30-Days-Of-JavaScript
7,SheetJS,sheetjs,34600,https://github.com/SheetJS/sheetjs
8,angular,angular-cli,26600,https://github.com/angular/angular-cli
9,angular,components,24100,https://github.com/angular/components


In [65]:
topic_urls[6]

'https://github.com/topics/ansible'

In [66]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index=None)

In [67]:
get_topic_repos(get_topic_page(topic_urls[6]))

Unnamed: 0,username,repo_name,stars,repo_url
0,bregman-arie,devops-exercises,64400,https://github.com/bregman-arie/devops-exercises
1,ansible,ansible,61600,https://github.com/ansible/ansible
2,trailofbits,algo,28400,https://github.com/trailofbits/algo
3,MichaelCade,90DaysOfDevOps,26000,https://github.com/MichaelCade/90DaysOfDevOps
4,StreisandEffect,streisand,23100,https://github.com/StreisandEffect/streisand
5,kubernetes-sigs,kubespray,15500,https://github.com/kubernetes-sigs/kubespray
6,ansible,awx,13600,https://github.com/ansible/awx
7,easzlab,kubeasz,10200,https://github.com/easzlab/kubeasz
8,semaphoreui,semaphore,9500,https://github.com/semaphoreui/semaphore
9,netbootxyz,netboot.xyz,8300,https://github.com/netbootxyz/netboot.xyz


Write a single function to:

1. Get the list of topics from the topics page.
2. Get the list of top repos from the individual topic pages.
3. For each topic, create a CSV of top repos for the topic

In [68]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p', {'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_descs(doc):
    topic_title_descriptions = doc.find_all('p', {'class':'f5 color-fg-muted mb-0 mt-1'})
    topic_descs = []
    for tag in topic_title_descriptions:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class':'no-underline flex-grow-0'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
        
def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the downloaded page so the function does not depend on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

Putting it all together

- We have a function to get the list of topics.
- We have a function to create a CSV file of scraped repos for a topic page.
- Let's create a function to put them together

In [69]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], row['title'])
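
When the run is repeated, topics that already have a CSV can be skipped, and a short pause between requests keeps the load on the site low. The sketch below is a variation under stated assumptions rather than the notebook's method: `scrape_pending`, its parameters and the stand-in `fake_scrape` are hypothetical, and a thin wrapper around the real `scrape_topic` would take the place of `fake_scrape`.

```python
import os
import tempfile
import time


def scrape_pending(topics, scrape_one, out_dir='.', delay_seconds=1.0):
    """Call scrape_one(url, path) for each (title, url) whose CSV is missing.

    Returns the list of titles that were actually scraped.
    """
    scraped = []
    for title, url in topics:
        path = os.path.join(out_dir, title + '.csv')
        if os.path.exists(path):
            print('Skipping "{}", file already exists'.format(title))
            continue
        scrape_one(url, path)
        scraped.append(title)
        time.sleep(delay_seconds)  # pause before the next request
    return scraped


def fake_scrape(url, path):
    # Stand-in for the real scraper: writes only a header row, no network access.
    with open(path, 'w', encoding='utf-8') as f:
        f.write('username,repo_name,stars,repo_url\n')


demo_topics = [('3D', 'https://github.com/topics/3d'),
               ('Ajax', 'https://github.com/topics/ajax')]
with tempfile.TemporaryDirectory() as demo_dir:
    first_run = scrape_pending(demo_topics, fake_scrape, demo_dir, delay_seconds=0)
    second_run = scrape_pending(demo_topics, fake_scrape, demo_dir, delay_seconds=0)
print(first_run, second_run)
```

The first call scrapes both demo topics; the second finds their files and scrapes nothing.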

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [70]:
# the CSV files are written to the current working directory, one file per topic
scrape_topics_repos()

Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "

We can check that the CSVs were created properly.

In [71]:
# for the 3D topic, built from the dictionary filled in earlier
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [72]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,99800,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,26300,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,22800,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,22600,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,19600,https://github.com/ssloy/tinyrenderer
5,FreeCAD,FreeCAD,18000,https://github.com/FreeCAD/FreeCAD
6,lettier,3d-game-shaders-for-beginners,17300,https://github.com/lettier/3d-game-shaders-for...
7,aframevr,aframe,16300,https://github.com/aframevr/aframe
8,CesiumGS,cesium,12200,https://github.com/CesiumGS/cesium
9,blender,blender,11900,https://github.com/blender/blender
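
The cell above rebuilds the 3D table from the in-memory dictionary. To check a file on disk instead, a sketch along these lines could be used. `check_repo_csv` is a hypothetical helper, and the header it expects matches the column names produced by `get_topic_repos`. The demo writes a one-row sample file to a temporary directory so that it runs without any scraped data.

```python
import csv
import os
import tempfile

EXPECTED_HEADER = ['username', 'repo_name', 'stars', 'repo_url']


def check_repo_csv(path):
    """Return the number of data rows, raising ValueError if the file looks wrong."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_HEADER:
            raise ValueError('Unexpected header in {}: {}'.format(path, reader.fieldnames))
        rows = list(reader)
    for row in rows:
        int(row['stars'])  # raises ValueError if stars is not a whole number
    return len(rows)


with tempfile.TemporaryDirectory() as demo_dir:
    sample_path = os.path.join(demo_dir, '3D.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_HEADER)
        writer.writerow(['mrdoob', 'three.js', 99800, 'https://github.com/mrdoob/three.js'])
    row_count = check_repo_csv(sample_path)
print(row_count)
```

For the real files, `check_repo_csv('3D.csv')` would be expected to return the number of repositories scraped for that topic.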


## 5. Document and share your work

#### Reference link:

https://www.youtube.com/watch?v=RKsLLG-bzEY&t=184s