# Top Repositories for GitHub Topics

### Pick a website and describe your objective
- Browse through different sites and pick one to scrape.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook.

#### Project Outline :
- We're going to scrape https://github.com/topics
- We will get a list of topics. For each topic we will get the topic title, topic page URL and topic description
- We will get the top repositories in each topic from the topic page (the page returned 30 repositories when this notebook was run)
- For each repository we will grab the repo name, user name, stars and repo URL
- For each topic we will create a CSV file in the following format:
```
Repo Name,User Name,Stars,Repo URL
```
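As a minimal sketch of that output format, the snippet below writes one topic's rows with Python's built-in `csv` module. The helper name `write_topic_csv` is an illustrative assumption and not part of this notebook, which uses pandas for the actual export further down; the sample row is the first repository scraped later in the notebook.

```python
import csv
import io

# Header row taken from the project outline above.
CSV_HEADER = ["Repo Name", "User Name", "Stars", "Repo URL"]


def write_topic_csv(rows, file_obj):
    """Write the header followed by one line per repository.

    `rows` is an iterable of (repo_name, user_name, stars, repo_url) tuples.
    `write_topic_csv` is a hypothetical helper used only for illustration.
    """
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(CSV_HEADER)
    writer.writerows(rows)


# Illustrative usage with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_topic_csv(
    [("three.js", "mrdoob", 76300, "https://github.com/mrdoob/three.js")],
    buffer,
)
print(buffer.getvalue())
```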

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topics_url='https://github.com/topics'

In [4]:
import datetime

In [5]:
import pandas as pd
import sys

In [6]:
try:
    response=requests.get(topics_url)
except Exception as e:
    error_type,error_object,error_info=sys.exc_info()
    print("Error in retrieving the url",topics_url)
    print(error_type," Error in line no ",error_info.tb_lineno)

In [7]:
response

<Response [200]>
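`requests.get` raises only for network-level problems; a 404 or 500 still comes back as a normal response object, so the `try`/`except` above would not notice it. The following is a hedged sketch of a helper that also checks the status code. The name `fetch_html`, the `timeout` value and the injectable `get` parameter are assumptions added for illustration and are not part of the notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page text for `url`, or None if the download fails.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access (an assumption made
    for this sketch).
    """
    try:
        response = get(url, timeout=timeout)
        # raise_for_status() turns 4xx/5xx responses into an HTTPError.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print("Error in retrieving the url", url, "->", error)
        return None
    return response.text
```

With such a helper, `page_contents = fetch_html(topics_url)` would either hold the HTML or be `None`, which can be checked before parsing.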

In [8]:
len(response.text)

175325

In [9]:
page_contents=response.text

In [10]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" /><link crossorigin="anonymous" media="all" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9IiFhZy+RDGg9Qn4Si1A97o0MlinlwFt3xAifvoLX0s7jH

In [11]:
with open('topics.html','w',encoding='utf-8') as f:
    f.write(page_contents)

### Use Beautiful Soup to parse and extract information

In [12]:
!pip install BeautifulSoup4 --upgrade --quiet

In [13]:
from bs4 import BeautifulSoup

In [14]:
doc=BeautifulSoup(page_contents,'html.parser')

In [15]:
type(doc)

bs4.BeautifulSoup

In [16]:
topic_title_tags=doc.find_all('p',attrs={'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [17]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [18]:
len(topic_title_tags)

30

In [19]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [20]:
topic_description_tags=doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')

In [21]:
len(topic_description_tags)

30

In [22]:
topic_description_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

In [23]:
topic_link_tags=doc.find_all('a',class_="d-flex no-underline",attrs={'data-ga-click':True})

In [24]:
len(topic_link_tags)

30

In [25]:
topic_link_tags[0].attrs

{'href': '/topics/3d',
 'class': ['d-flex', 'no-underline'],
 'data-ga-click': 'Explore, go to 3d, location:All featured topics'}

In [26]:
topic0_url='https://github.com'+topic_link_tags[0]['href']

In [27]:
print(topic0_url)

https://github.com/topics/3d


In [28]:
topic_titles=[]
for tag in topic_title_tags:
    topic_titles.append(tag.text)

In [29]:
topic_desc=[]
for tag in topic_description_tags:
    topic_desc.append(tag.text.strip())

In [30]:
topic_urls=[]
base_url='https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url+tag['href'])

In [31]:
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [32]:
topic_desc

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'A

In [33]:
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [34]:
topics_df=pd.DataFrame({'Title':topic_titles,'Description':topic_desc,'URL':topic_urls})

In [35]:
topics_df

| | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
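The titles, descriptions and URLs are paired purely by position, so it can help to confirm that the three lists have the same length before combining them. Below is a minimal sketch of such a guard; `build_topic_rows` is a hypothetical helper that does not appear in the notebook, and the sample values are the first two topics from the output above.

```python
def build_topic_rows(titles, descriptions, urls):
    """Pair up the three scraped lists, refusing to continue if they differ in length.

    `build_topic_rows` is a hypothetical helper used only for illustration.
    """
    if not (len(titles) == len(descriptions) == len(urls)):
        raise ValueError(
            f"Scraped lists are misaligned: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(urls)} URLs"
        )
    return [
        {"Title": title, "Description": description, "URL": url}
        for title, description, url in zip(titles, descriptions, urls)
    ]


# Illustrative usage with two topics taken from the output above.
rows = build_topic_rows(
    ["3D", "Ajax"],
    [
        "3D modeling is the process of virtually developing the surface and structure of a 3D object.",
        "Ajax is a technique for creating interactive web applications.",
    ],
    ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
)
print(rows[0]["URL"])
```

Passing the resulting list of dictionaries to `pd.DataFrame(rows)` yields the same `Title`, `Description` and `URL` columns.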


### Create CSV file(s) with the extracted information

In [36]:
# `index` expects a boolean; False leaves the row numbers out of the CSV.
topics_df.to_csv('topics.csv',index=False)

### Getting information out of a topic page

In [37]:
topic_page_url=topic_urls[0]

In [38]:
topic_page_url

'https://github.com/topics/3d'

In [39]:
try:
    response=requests.get(topic_page_url)
except Exception as e:
    error_type,error_object,error_info=sys.exc_info()
    print('Error occurred while retrieving the url',topic_page_url)
    print(error_type,' error in line number ',error_info.tb_lineno)

In [40]:
response.status_code

200

In [41]:
page_contents=response.text

In [42]:
doc2=BeautifulSoup(page_contents,'html.parser')

In [43]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-1G4rYJktwRTQKn7fVfJUxH8RRZFUJlGo77xMZfBfIhZPx4BHVrzPE1VgnafttXI8G3y/PywH3uXyhNkSLp3+oA==" rel="stylesheet" href="https://github.githubassets.com/assets/light-d46e2b60992dc114d02a7edf55f254c4.css" /><link crossorigin="anonymous" media="all" integrity="sha512-hI5b2oqTE9njfjYrfuzXqA4bSGSNrE5OMc9IiFhZy+RDGg9Qn4Si1A97o0MlinlwFt3xAifvoLX0s7jH

In [44]:
repo_tags=doc2.find_all('h3',class_="f3 color-fg-muted text-normal lh-condensed")

In [45]:
len(repo_tags)

30

In [46]:
a_tags=repo_tags[0].find_all('a')

In [47]:
a_tags[0].text.strip()

'mrdoob'

In [48]:
a_tags[1].text.strip()

'three.js'

In [49]:
base_url+a_tags[1]['href']

'https://github.com/mrdoob/three.js'

In [50]:
star_tags=doc2.find_all('a',class_="social-count js-social-count")

In [51]:
def parse_star_counts(star):
    star=star.strip()
    if(star[-1]=='k'):
        return int(float(star[:-1])*1000)
    return int(star)

In [52]:
parse_star_counts('55.3k')

55300
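`int(float(...) * 1000)` truncates, so a product that lands just below a whole number because of floating-point rounding could lose one star. A slightly more defensive variant is sketched below. The name `parse_star_count`, the handling of thousands separators and the `m` suffix are assumptions; the notebook's own function only handles plain numbers and the `k` suffix.

```python
def parse_star_count(text):
    """Convert a star label such as '55.3k' or '1,234' to an int.

    Support for ',' separators and an 'm' suffix is an assumption made
    for this sketch; the notebook only handles plain numbers and 'k'.
    """
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty star count")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() avoids losing one star to floating-point truncation.
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)


print(parse_star_count(" 55.3k "))  # 55300
print(parse_star_count("980"))      # 980
```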

In [53]:
def get_repo_info(h3_tag,star_tag):
    # Returns the repo name, user name, star count and URL of one repository.
    # The first <a> in the <h3> heading is the owner, the second is the repository.
    a_tags=h3_tag.find_all('a')
    user_name=a_tags[0].text.strip()
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    stars=parse_star_counts(star_tag.text)
    return repo_name,user_name,stars,repo_url

In [54]:
get_repo_info(repo_tags[0],star_tags[0])

('three.js', 'mrdoob', 76300, 'https://github.com/mrdoob/three.js')

In [55]:
repo_tags

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d8

In [56]:
topic_repos_dict={
    'Repo Name':[],
    'Username':[],
    'Stars':[],
    'URL':[]
}
for i in range(len(repo_tags)):
    repo_info=get_repo_info(repo_tags[i],star_tags[i])
    topic_repos_dict['Repo Name'].append(repo_info[0])
    topic_repos_dict['Username'].append(repo_info[1])
    topic_repos_dict['Stars'].append(repo_info[2])
    topic_repos_dict['URL'].append(repo_info[3])

In [57]:
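# Number the rows from 1 without hard-coding how many repositories the page returned.
topic_repos_df=pd.DataFrame(topic_repos_dict,index=range(1,len(repo_tags)+1))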

In [58]:
topic_repos_df

| | Repo Name | Username | Stars | URL |
|---|---|---|---|---|
| 1 | three.js | mrdoob | 76300 | https://github.com/mrdoob/three.js |
| 2 | libgdx | libgdx | 19300 | https://github.com/libgdx/libgdx |
| 3 | react-three-fiber | pmndrs | 15800 | https://github.com/pmndrs/react-three-fiber |
| 4 | Babylon.js | BabylonJS | 15300 | https://github.com/BabylonJS/Babylon.js |
| 5 | aframe | aframevr | 13300 | https://github.com/aframevr/aframe |
| 6 | tinyrenderer | ssloy | 11700 | https://github.com/ssloy/tinyrenderer |
| 7 | 3d-game-shaders-for-beginners | lettier | 11600 | https://github.com/lettier/3d-game-shaders-for... |
| 8 | FreeCAD | FreeCAD | 10200 | https://github.com/FreeCAD/FreeCAD |
| 9 | zdog | metafizzy | 8900 | https://github.com/metafizzy/zdog |
| 10 | cesium | CesiumGS | 8000 | https://github.com/CesiumGS/cesium |


In [59]:
topic_repos_df.to_csv(topic_titles[0]+'.csv')

## Final Code

### Write a single function to:
- Get the list of topics from the topics page
- Get the top repositories from each topic page
- Create a CSV file of the top repositories for each topic
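The cell that defines `scrape_topics_repos` is not shown here, so the following is only a sketch of how the steps above could be combined, reusing the selectors from the earlier cells. The helper names (`get_soup`, `parse_topics`, `parse_repos`), the `output_dir` parameter with its `data` default, and the use of `os` to create that folder are assumptions, not the notebook's actual code.

```python
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://github.com"
TOPICS_URL = BASE_URL + "/topics"


def get_soup(url):
    """Download a page and parse it; raises on network or HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def parse_star_counts(star):
    """Convert labels such as '55.3k' or '980' to an int."""
    star = star.strip()
    if star[-1] == "k":
        return round(float(star[:-1]) * 1000)
    return int(star)


def parse_topics(doc):
    """Return (title, url) pairs from a parsed topics page."""
    title_tags = doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
    link_tags = doc.find_all(
        "a", class_="d-flex no-underline", attrs={"data-ga-click": True}
    )
    return [
        (title.text.strip(), BASE_URL + link["href"])
        for title, link in zip(title_tags, link_tags)
    ]


def parse_repos(doc):
    """Return a DataFrame of the repositories listed on a parsed topic page."""
    repo_tags = doc.find_all("h3", class_="f3 color-fg-muted text-normal lh-condensed")
    star_tags = doc.find_all("a", class_="social-count js-social-count")
    rows = []
    for h3_tag, star_tag in zip(repo_tags, star_tags):
        a_tags = h3_tag.find_all("a")  # first link: owner, second link: repository
        rows.append({
            "Repo Name": a_tags[1].text.strip(),
            "Username": a_tags[0].text.strip(),
            "Stars": parse_star_counts(star_tag.text),
            "URL": BASE_URL + a_tags[1]["href"],
        })
    return pd.DataFrame(rows, columns=["Repo Name", "Username", "Stars", "URL"])


def scrape_topics_repos(output_dir="data"):
    """Write one CSV per topic into `output_dir` (an assumed default folder)."""
    os.makedirs(output_dir, exist_ok=True)
    for title, url in parse_topics(get_soup(TOPICS_URL)):
        print("Scraping the repositories of", title)
        path = os.path.join(output_dir, title + ".csv")
        parse_repos(get_soup(url)).to_csv(path, index=False)
```

Calling `scrape_topics_repos()` would then download the topics page and write one CSV per topic into `data/`. The CSS class names are tied to GitHub's markup at the time of scraping and may need updating.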

In [78]:
import os


In [77]:
scrape_topics_repos()

Scraping the repositories of 3D
Scraping the repositories of Ajax
Scraping the repositories of Algorithm
Scraping the repositories of Amp
Scraping the repositories of Android
Scraping the repositories of Angular
Scraping the repositories of Ansible
Scraping the repositories of API
Scraping the repositories of Arduino
Scraping the repositories of ASP.NET
Scraping the repositories of Atom
Scraping the repositories of Awesome Lists
Scraping the repositories of Amazon Web Services
Scraping the repositories of Azure
Scraping the repositories of Babel
Scraping the repositories of Bash
Scraping the repositories of Bitcoin
Scraping the repositories of Bootstrap
Scraping the repositories of Bot
Scraping the repositories of C
Scraping the repositories of Chrome
Scraping the repositories of Chrome extension
Scraping the repositories of Command line interface
Scraping the repositories of Clojure
Scraping the repositories of Code quality
Scraping the repositories of Code review
Scraping the reposit

### Document and share your work