# Scraping Top Repositories for Topics on GitHub

**TODO (Intro):**

* Introduction about web scraping
* Introduction about GitHub and the problem statement
* Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)

---

**Here are the steps we'll follow:**

* We're going to scrape [https://github.com/topics](https://github.com/topics)
* We'll get a list of topics. For each topic, we'll get topic title, topic page URL, and topic description.
* For each topic, we'll get the top repositories listed on the topic page (20 per page in the run shown here).
* For each repository, we'll grab the repo name, username, stars, and repo URL.
* For each topic, we'll create a CSV file in the following format:

```csv
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## Use the requests library to download web pages

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
topics_url = "https://github.com/topics"

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200

In [5]:
response.text

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  data-css-features="prs_diff_containment"\n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-c59dc71e3a4c.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light_high_contrast-4bf0cb7

In [6]:
len(response.text)

216712

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  data-css-features="prs_diff_containment"\n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-c59dc71e3a4c.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light_high_contrast-4bf0cb7

In [9]:
with open('webpage.html','w',encoding = 'utf8') as f:
    f.write(page_contents)

## Use Beautiful Soup to parse and extract info

In [10]:
doc = BeautifulSoup(page_contents,'html.parser')

In [11]:
type(doc)

bs4.BeautifulSoup

In [12]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p',{'class': selection_class})

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
desc_selector = "f5 color-fg-muted mb-0 mt-1"

topic_desc_tags = doc.find_all('p',class_ = desc_selector)

In [16]:
len(topic_desc_tags)

30

In [17]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
topic_title_tag0 = topic_title_tags[0]

In [19]:
div_tag = topic_title_tag0.parent

In [20]:
topic_link_tags = doc.find_all('a',class_ = "no-underline flex-grow-0")

In [21]:
len(topic_link_tags)

30

In [22]:
topic_link_tags[0]['href']

'/topics/3d'

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [24]:
topic_title_tags[0].text

'3D'

In [25]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [26]:
topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip())

topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [27]:
topic_urls = []
base_url = "https://github.com"
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compiler',
 'https://github.com/topics/co

## Convert the lists to a DataFrame for the CSV file

In [28]:
import pandas as pd

In [29]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descriptions,
    'url': topic_urls
}

In [30]:
topics_df = pd.DataFrame(topics_dict)

In [31]:
topics_df

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## Create CSV file with the extracted info

In [32]:
topics_df.to_csv('topics.csv', index=False)

## Getting info out of a Topic page

In [33]:
topic_page_url = topic_urls[0]

In [34]:
topic_page_url

'https://github.com/topics/3d'

In [35]:
response = requests.get(topic_page_url)

In [36]:
response.status_code

200

In [37]:
len(response.text)

520748

In [38]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [39]:
h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"

repo_tags = topic_doc.find_all('h3', class_=h3_selection_class)


In [40]:
len(repo_tags)

20

In [41]:
a_tags = repo_tags[0].find_all('a')

In [42]:
a_tags[0].text

'mrdoob'

In [43]:
a_tags[1].text

'three.js'

In [44]:
a_tags[1]['href']

'/mrdoob/three.js'

In [45]:
# No trailing slash here: the href already starts with '/'
base_url = "https://github.com"
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js
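
Plain string concatenation depends on whether `base_url` ends with a slash. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a root-relative `href` against the site URL in both cases. The helper name `absolute_url` is an assumption made for this sketch and is not part of the notebook.

```python
from urllib.parse import urljoin


def absolute_url(href, base="https://github.com/"):
    """Resolve a root-relative href (e.g. '/mrdoob/three.js') against base."""
    return urljoin(base, href)


# Both calls yield the same URL, with or without a trailing slash on the base.
print(absolute_url("/mrdoob/three.js"))
print(absolute_url("/mrdoob/three.js", base="https://github.com"))
```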


In [46]:
star_tags = topic_doc.find_all('span',class_="Counter js-social-count")

In [47]:
len(star_tags)

20

In [48]:
star_tags[0].text

'107k'

In [89]:
def parse_star_count(stars_str):
    # Convert a star label such as '107k' or '817' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

In [90]:
parse_star_count(star_tags[0].text)

107000
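
A slightly more defensive variant of the star parser might look like the sketch below. It is illustrative only: the name `parse_count` is hypothetical, and the `m` suffix and comma handling are assumptions, since only the `k` suffix appears in the output above. `Decimal` is used so that values such as `'4.3k'` are scaled exactly before conversion to `int`.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption, only 'k' is seen in the notebook.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_count(label):
    """Convert a label such as '107k', '4.3k' or '1,234' to an int."""
    text = label.strip().lower().replace(",", "")
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        # Decimal avoids binary floating-point error before truncating to int.
        return int(Decimal(text[:-1]) * MULTIPLIERS[suffix])
    return int(text)


print(parse_count("107k"))
```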

In [91]:
def get_repo_info(h3_tag,star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text
    repo_name = a_tags[1].text
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url

In [92]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 107000, 'https://github.com/mrdoob/three.js')

In [93]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
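
The loop above pairs `repo_tags[i]` with `star_tags[i]` by position, so it silently relies on both lists having the same length. A minimal sketch of an explicit check, plus a row-oriented way to collect the results, is shown below. The names `pair_tags`, `to_rows`, and `COLUMNS` are hypothetical and exist only for this illustration.

```python
COLUMNS = ("username", "repo_name", "stars", "repo_url")


def pair_tags(repo_tags, star_tags):
    """Pair each repo tag with its star tag, refusing mismatched lengths."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            "expected equal counts, got {} repo tags and {} star tags".format(
                len(repo_tags), len(star_tags)))
    return list(zip(repo_tags, star_tags))


def to_rows(repo_infos):
    """Turn (username, repo_name, stars, repo_url) tuples into dict rows."""
    return [dict(zip(COLUMNS, info)) for info in repo_infos]


rows = to_rows([("mrdoob", "three.js", 107000,
                 "https://github.com/mrdoob/three.js")])
print(rows[0]["repo_name"], rows[0]["stars"])
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame`, which gives the same columns as the dict-of-lists approach.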
    

In [94]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [95]:
topic_repos_df

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,107000,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,29000,https://github.com/pmndrs/react-three-fiber
2,FreeCAD,FreeCAD,24900,https://github.com/FreeCAD/FreeCAD
3,BabylonJS,Babylon.js,24200,https://github.com/BabylonJS/Babylon.js
4,libgdx,libgdx,24100,https://github.com/libgdx/libgdx
5,ssloy,tinyrenderer,22000,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,18800,https://github.com/lettier/3d-game-shaders-for...
7,aframevr,aframe,17100,https://github.com/aframevr/aframe
8,blender,blender,15300,https://github.com/blender/blender
9,4ian,GDevelop,14500,https://github.com/4ian/GDevelop


In [128]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag,star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text
    repo_name = a_tags[1].text
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url
    

def get_topic_repos(topic_doc):
    
    h3_selection_class = "f3 color-fg-muted text-normal lh-condensed"
    repo_tags = topic_doc.find_all('h3', class_=h3_selection_class)

    star_tags = topic_doc.find_all('span',class_="Counter js-social-count")


    topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(topic_name + '.csv', index=False)
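
`scrape_topic` uses the topic title as the file name, and titles such as 'C++' or 'Command-line interface' contain characters or spaces that can be awkward in file names. One hedged option is to derive the name from the last segment of the topic URL instead. `topic_slug` is a hypothetical helper for this sketch, not something the notebook defines.

```python
from urllib.parse import urlparse


def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'chrome-extension'."""
    path = urlparse(topic_url).path
    return path.rstrip("/").split("/")[-1]


# The slug can then serve as a file name stem.
print(topic_slug("https://github.com/topics/chrome-extension") + ".csv")
```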

In [129]:
url4 = topic_urls[4]
url4

'https://github.com/topics/android'

In [130]:
topic4_doc = get_topic_page(url4)
#topic4_doc

In [131]:
topic4_repos = get_topic_repos(topic4_doc)

In [132]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,171000,https://github.com/flutter/flutter
1,Genymobile,scrcpy,123000,https://github.com/Genymobile/scrcpy
2,facebook,react-native,123000,https://github.com/facebook/react-native
3,justjavac,free-programming-books-zh_CN,114000,https://github.com/justjavac/free-programming-...
4,Hack-with-Github,Awesome-Hacking,93300,https://github.com/Hack-with-Github/Awesome-Ha...
5,Solido,awesome-flutter,56000,https://github.com/Solido/awesome-flutter
6,tldr-pages,tldr,55900,https://github.com/tldr-pages/tldr
7,wasabeef,awesome-android-ui,52700,https://github.com/wasabeef/awesome-android-ui
8,google,material-design-icons,51600,https://github.com/google/material-design-icons
9,appwrite,appwrite,50900,https://github.com/appwrite/appwrite


In [133]:
topic_urls[3]

'https://github.com/topics/amphp'

In [134]:
get_topic_repos(get_topic_page(topic_urls[3]))

Unnamed: 0,username,repo_name,stars,repo_url
0,amphp,amp,4300,https://github.com/amphp/amp
1,danog,MadelineProto,3100,https://github.com/danog/MadelineProto
2,amphp,parallel,817,https://github.com/amphp/parallel
3,unreal4u,telegram-api,795,https://github.com/unreal4u/telegram-api
4,amphp,http-client,719,https://github.com/amphp/http-client
5,amphp,byte-stream,380,https://github.com/amphp/byte-stream
6,amphp,mysql,374,https://github.com/amphp/mysql
7,php-service-bus,service-bus,349,https://github.com/php-service-bus/service-bus
8,amphp,parallel-functions,273,https://github.com/amphp/parallel-functions
9,xtrime-ru,TelegramRSS,272,https://github.com/xtrime-ru/TelegramRSS


In [135]:
topic_urls[6]

'https://github.com/topics/ansible'

In [136]:
get_topic_repos(get_topic_page(topic_urls[6])).to_csv('ansible.csv', index=False)

#### Write a single function to:

- Get the list of topics from the topics page
- Get the list of top repos from the individual topic pages
- For each topic, create a CSV of the top repos for the topic

In [137]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p',{'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags = doc.find_all('p',class_ = desc_selector)
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',class_ = "no-underline flex-grow-0")
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the freshly downloaded page instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [138]:
scrape_topics()

Unnamed: 0,title,description,url
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source platform for buildin...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [141]:
def scrape_topic_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()  
    
    for index, row in topics_df.iterrows():
        print("Scraping top repositories for {}".format(row['title']))
        scrape_topic(row['url'], row['title'])  
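
When a run like this covers many topics, it can help to write the files into one folder, skip topics that were already saved, and pause between requests. The sketch below is one possible shape for that loop, not the notebook's method. `scrape_all`, `out_dir`, and `delay_seconds` are assumed names, and `scrape_one` stands for any callable that takes a URL and an output path, such as a variant of `scrape_topic`.

```python
import os
import time


def scrape_all(topics, scrape_one, out_dir="data", delay_seconds=1.0):
    """Run scrape_one(url, path) for each (name, url) pair not yet saved.

    Returns the list of file paths written during this call.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, url in topics:
        path = os.path.join(out_dir, name + ".csv")
        if os.path.exists(path):
            # Already scraped in an earlier run; do not fetch again.
            continue
        scrape_one(url, path)
        written.append(path)
        # Pause between requests to avoid hammering the site.
        time.sleep(delay_seconds)
    return written
```

Because the function only receives `(name, url)` pairs, it can be fed from the `title` and `url` columns of the topics DataFrame.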

In [142]:
scrape_topic_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command-line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top repositories for Code review
Scrap