# Top Repositories for GitHub Topics


Tarun Rama M

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.


### Outline:

- We're going to scrape https://github.com/topics

- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the first page returned 20 when this notebook was run)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file with the columns username, repo_name, stars and repo_url.


#### 1. Use the requests library to download web pages

In [2]:
import requests

# The requests library makes HTTP requests to a given URL.
# Each request returns a Response object holding the status code, headers and body.


In [3]:
topics_url = 'https://github.com/topics'

In [4]:
response = requests.get(topics_url)

In [5]:
response.status_code  # 200 means the request was successful

200
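
The status check can also be delegated to requests itself. The following is a minimal sketch, not part of the original notebook: `download_page` is a hypothetical helper name and the 10-second timeout is an arbitrary assumption. `raise_for_status()` raises `requests.HTTPError` for 4xx and 5xx responses, and `timeout` keeps a stalled connection from hanging the notebook.

```python
import requests


def download_page(url, timeout=10):
    """Return the HTML of `url`, raising on network or HTTP errors.

    Hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.text
```

With this helper, the download step would read `page_contents = download_page(topics_url)`.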

In [6]:
len(response.text)

205141

In [7]:
page_contents = response.text

In [8]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html\n  lang="en"\n  \n  data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"\n  data-a11y-animated-images="system" data-a11y-link-underlines="true"\n  \n  >\n\n\n\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0cfd1fd8509e.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-d782f59290e2.css" /><link data-color-theme="dark_dimmed" cross

In [9]:
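# Write as UTF-8 explicitly so the result does not depend on the platform default encoding
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)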
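
Saving the page also makes it possible to reuse the local copy instead of downloading it again on every run. A minimal sketch under stated assumptions: `load_or_download` and its `download` parameter are hypothetical names, and `download` stands in for any function that takes a URL and returns HTML text (for example, a small wrapper around `requests.get`).

```python
from pathlib import Path


def load_or_download(url, path, download):
    """Return cached HTML from `path`, calling `download(url)` only on a miss."""
    cache_file = Path(path)
    if cache_file.exists():
        # Reuse the copy saved by an earlier run.
        return cache_file.read_text(encoding='utf-8')
    html = download(url)
    cache_file.write_text(html, encoding='utf-8')
    return html
```

The trade-off is that a cached file can go stale; deleting it forces a fresh download.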

#### 2. Use Beautiful Soup to parse and extract information

In [10]:
from bs4 import BeautifulSoup

In [11]:
doc = BeautifulSoup(page_contents,"html.parser")

In [12]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tags = doc.find_all('p', {'class': selection_class})

In [13]:
len(topic_title_tags)

30

In [14]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Bash</p>,
 <p class="f3 lh-condensed m

In [15]:
selection_class_two = 'f5 color-fg-muted mb-0 mt-1'
topic_description_tags = doc.find_all('p',{'class': selection_class_two})

In [16]:
len(topic_description_tags)

30

In [17]:
topic_description_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [18]:
topic_title_tag0 = topic_title_tags[0]

In [19]:
topic_title_tag0

<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>

In [20]:
div_tag =  topic_title_tag0.parent

In [21]:
selection_class_three = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a',{'class':selection_class_three})

In [22]:
len(topic_link_tags)

30

In [23]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
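
String concatenation works here because the `href` is a root-relative path. As a hedged alternative, the standard library's `urllib.parse.urljoin` also copes with an `href` that is already absolute. The second `href` below is an illustrative value, not one taken from the page.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A root-relative href, like the one on the topics page.
relative_result = urljoin(base_url, '/topics/3d')
# An href that is already absolute is returned unchanged.
absolute_result = urljoin(base_url, 'https://github.com/topics/ajax')

print(relative_result)  # https://github.com/topics/3d
print(absolute_result)  # https://github.com/topics/ajax
```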


In [24]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)

print(topic_titles)



['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command-line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'C++', 'Cryptocurrency', 'Crystal']


In [25]:
topic_descriptions = []

for tag in topic_description_tags:
  topic_descriptions.append(tag.text.strip())

print(topic_descriptions)

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source platform for building electronic devices.', 'ASP.NET is a web framework for building modern web apps and services.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud computing service created by Microsoft.', 'Babel is a compiler for w

In [26]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
  topic_urls.append(base_url + tag['href'])
print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/topics/continuous-integration', 'ht
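
The three lists are built from three independent `find_all` calls, so they only line up if every topic has a title, a description and a link. A more defensive sketch is to walk the link tags and look up the title and description inside each one. This assumes each topic link wraps its own title and description paragraphs; the sample HTML is a simplified stand-in, not a copy of GitHub's markup, and the second card's missing description is hypothetical. `extract_topics` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two topic cards; the nesting is an assumption.
sample_html = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">
    3D refers to the use of three-dimensional graphics.
  </p>
</a>
<a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</a>
"""


def extract_topics(doc, base_url='https://github.com'):
    """Return one dict per topic card, keeping each card's fields together."""
    topics = []
    link_class = 'no-underline flex-1 d-flex flex-column'
    for link in doc.find_all('a', {'class': link_class}):
        title_tag = link.find('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
        desc_tag = link.find('p', {'class': 'f5 color-fg-muted mb-0 mt-1'})
        topics.append({
            # A missing tag becomes None instead of shifting the other rows.
            'title': title_tag.text.strip() if title_tag else None,
            'description': desc_tag.text.strip() if desc_tag else None,
            'url': base_url + link['href'],
        })
    return topics


topics = extract_topics(BeautifulSoup(sample_html, 'html.parser'))
for topic in topics:
    print(topic)
```

A list of dicts like this can be passed directly to `pd.DataFrame`.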

Build a DataFrame from the extracted lists

In [27]:
import pandas as pd

In [28]:
topics_dict = {
    'title':topic_titles,
    'description':topic_descriptions,
    'url':topic_urls
}

In [29]:
topics_df = pd.DataFrame(topics_dict)

In [30]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


#### Create a CSV file with the extracted information

In [31]:
# topics_df.to_csv('topics.csv',index=None)
topics_df.to_csv('topics.csv')
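
Whether the index is written matters when the file is read back. A small sketch of the difference, using in-memory buffers and a one-row DataFrame as stand-ins for `topics.csv` and `topics_df`:

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    'title': ['3D'],
    'url': ['https://github.com/topics/3d'],
})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
sample_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)     # ['Unnamed: 0', 'title', 'url']
print(columns_without_index)  # ['title', 'url']
```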

#### Getting information from a GitHub topic page

In [32]:
topic_page_url = topic_urls[0]

In [33]:
topic_page_url

'https://github.com/topics/3d'

In [34]:
response = requests.get(topic_page_url)

In [35]:
response.status_code

200

In [36]:
len(response.text)

519802

In [37]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [38]:
selection_class_four = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': selection_class_four})

In [39]:
len(repo_tags)

20

In [40]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>          /
          <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/thre

In [41]:
a_tags = repo_tags[0].find_all('a')

In [42]:
a_tags

[<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>,
 <a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>]

In [43]:
a_tags[0]

<a class="Link" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="c72fbd5c69a8ee7c9c53a4e65de2b93c8fc7552dd793945819639bc165c0f0ba" data-turbo="false" data-view-component="true" href="/mrdoob">mrdoob</a>

In [44]:
a_tags[0].text.strip()

'mrdoob'

In [45]:
a_tags[1]

<a class="Link text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4a2667db3d63a1739c412e059e5da95afe419df83f70949b5d59dc3478f5c79a" data-turbo="false" data-view-component="true" href="/mrdoob/three.js">three.js</a>

In [46]:
a_tags[1].text.strip()

'three.js'

In [47]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [48]:
star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

In [49]:
len(star_tags)

20

In [50]:
star_tags[0]

<span aria-label="103545 users starred this repository" class="Counter js-social-count" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-turbo-replace="true" data-view-component="true" id="repo-stars-counter-star" title="103,545">104k</span>

In [51]:
star_tags[0].text.strip()

'104k'

In [52]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  return int(stars_str)

In [53]:
print(parse_star_count("29.1k"))  # testing func

29100


In [54]:
parse_star_count(star_tags[0].text.strip())

104000
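
The abbreviated text loses precision: `104k` becomes 104000. The star tag shown earlier also carries the exact count in its `title` attribute (`103,545`). A hedged sketch of reading that value instead, assuming the attribute is present and comma-formatted as in that tag; `parse_exact_star_count` is a hypothetical name and the HTML is a shortened copy of the tag.

```python
from bs4 import BeautifulSoup

# Shortened copy of the star tag, keeping only the attributes used here.
star_html = '<span class="Counter js-social-count" title="103,545">104k</span>'
sample_star_tag = BeautifulSoup(star_html, 'html.parser').find('span')


def parse_exact_star_count(tag):
    """Return the exact star count stored in the tag's title attribute."""
    # Drop the thousands separators before converting to int.
    return int(tag['title'].replace(',', ''))


exact_stars = parse_exact_star_count(sample_star_tag)
print(exact_stars)  # 103545
```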

In [55]:
def get_repo_info(h3_tag,star_tag):
  # returns all the required info about a repository
  a_tags = h3_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [56]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 104000, 'https://github.com/mrdoob/three.js')

In [106]:
import os
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check whether the request succeeded
    if response.status_code != 200:
        raise Exception('Failed to load the page! {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_topic_repos(topic_doc):
    # Get the h3 tags containing the username, repo name and repo URL
    selection_class_four = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': selection_class_four})
    # get the star tags
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})

    topic_repos_dict = {
      'username': [],
      'repo_name': [],
      'stars':[],
      'repo_url':[]
    }

    # Get repo info; zip pairs each repo heading with its star counter
    # and stops at the shorter list instead of raising an IndexError
    for repo_tag, star_tag in zip(repo_tags, star_tags):
        username, repo_name, stars, repo_url = get_repo_info(repo_tag, star_tag)
        topic_repos_dict['username'].append(username)
        topic_repos_dict['repo_name'].append(repo_name)
        topic_repos_dict['stars'].append(stars)
        topic_repos_dict['repo_url'].append(repo_url)
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,path):

  if os.path.exists(path):
    print("The file {} already exists. Skipping...".format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path, index=None)

In [103]:
url4 = topic_urls[4]

In [87]:
topic4_doc = get_topic_page(url4)

In [88]:
topic4_repos = get_topic_repos(topic4_doc)

In [89]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 167000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 120000 | https://github.com/facebook/react-native |
| 2 | Genymobile | scrcpy | 115000 | https://github.com/Genymobile/scrcpy |
| 3 | justjavac | free-programming-books-zh_CN | 112000 | https://github.com/justjavac/free-programming-... |
| 4 | Hack-with-Github | Awesome-Hacking | 87200 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | Solido | awesome-flutter | 54200 | https://github.com/Solido/awesome-flutter |
| 6 | tldr-pages | tldr | 52400 | https://github.com/tldr-pages/tldr |
| 7 | wasabeef | awesome-android-ui | 51100 | https://github.com/wasabeef/awesome-android-ui |
| 8 | google | material-design-icons | 50900 | https://github.com/google/material-design-icons |
| 9 | laurent22 | joplin | 46800 | https://github.com/laurent22/joplin |


In [90]:
get_topic_repos(get_topic_page(topic_urls[4]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 167000 | https://github.com/flutter/flutter |
| 1 | facebook | react-native | 120000 | https://github.com/facebook/react-native |
| 2 | Genymobile | scrcpy | 115000 | https://github.com/Genymobile/scrcpy |
| 3 | justjavac | free-programming-books-zh_CN | 112000 | https://github.com/justjavac/free-programming-... |
| 4 | Hack-with-Github | Awesome-Hacking | 87200 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 5 | Solido | awesome-flutter | 54200 | https://github.com/Solido/awesome-flutter |
| 6 | tldr-pages | tldr | 52400 | https://github.com/tldr-pages/tldr |
| 7 | wasabeef | awesome-android-ui | 51100 | https://github.com/wasabeef/awesome-android-ui |
| 8 | google | material-design-icons | 50900 | https://github.com/google/material-design-icons |
| 9 | laurent22 | joplin | 46800 | https://github.com/laurent22/joplin |


In [91]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv',index=None)

#### Putting it together with functions:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic page
3. For each topic, create a CSV of the top repos for the topic

In [112]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selection_class})
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_descriptions(doc):
  selection_class_two = 'f5 color-fg-muted mb-0 mt-1'
  topic_description_tags = doc.find_all('p',{'class': selection_class_two})
  topic_descriptions = []
  for tag in topic_description_tags:
    topic_descriptions.append(tag.text.strip())
  return topic_descriptions

def get_topic_urls(doc):
  selection_class_three = 'no-underline flex-1 d-flex flex-column'
  topic_link_tags = doc.find_all('a',{'class':selection_class_three})
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
  return topic_urls


def scrape_topics():

  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load Page {}'.format(topics_url))
  doc = BeautifulSoup(response.text,'html.parser')
  topics_dict = {
      'title':get_topic_titles(doc),
      'description':get_topic_descriptions(doc),
      'url': get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)

In [108]:
import os
help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.



In [109]:
def scrape_topics_repos():
  print("Scraping list of topics")
  topics_df = scrape_topics()
  os.makedirs('data',exist_ok=True)
  for index,row in topics_df.iterrows():
    print('Scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

In [110]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command-line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "Code quality"
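
The CSV file names come from the topic titles, which contain spaces and punctuation (for example "Command-line interface" or "ASP.NET"). One possible alternative, sketched here as an assumption rather than the notebook's approach, is to name each file after the last segment of the topic URL. `topic_csv_path` is a hypothetical helper.

```python
def topic_csv_path(topic_url, folder='data'):
    """Build a CSV path from the last segment of a topic URL."""
    # 'https://github.com/topics/cli' -> 'cli'
    slug = topic_url.rstrip('/').rsplit('/', 1)[-1]
    return '{}/{}.csv'.format(folder, slug)


cli_path = topic_csv_path('https://github.com/topics/cli')
print(cli_path)  # data/cli.csv
```

The URL segments are lowercase and hyphenated, which avoids spaces in file names.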

In [95]:
scrape_topics()

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
