# Scraping Top Repositories for Topics on GitHub

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top repositories listed on the topic page (the first page listed 20 repositories when this notebook was run)
- For each repository, we'll grab the repository name, username, stars, and repository URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
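
As a rough sketch of how rows in this format could be produced, the snippet below uses the standard library's `csv` module. The function name `rows_to_csv` and the sample rows are illustrative only; the notebook itself writes its files with pandas.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (repo name, username, stars, repo URL) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows copied from the format example above.
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
print(rows_to_csv(sample_rows))
```

The `csv` module also quotes fields that contain commas, which matters for free-text columns such as descriptions.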

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import os

In [4]:
topic_Url='https://github.com/topics'

In [5]:
response=requests.get(topic_Url)

In [6]:
response.status_code

200

In [7]:
# BeautifulSoup parses the HTML document so it can be searched from Python
soup=BeautifulSoup(response.content,'html.parser')

In [8]:
topic_name=soup.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [9]:
topic_name

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [10]:
len(topic_name)

30

In [11]:
topicName=[]
for i in topic_name:
    topicName.append(i.text)
print(topicName)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [12]:
# Get Description of Topics
topic_desc=soup.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})

In [13]:
desc_lis = [topic.text.strip() for topic in topic_desc]

In [14]:
desc_lis[:3]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [34]:
# Get URL of topics
link_tags=soup.find_all('a',{'class':"no-underline flex-grow-0"})

In [38]:
link_tags[0]

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [17]:
url_1 = link_tags[0]['href']
print(url_1)

/topics/3d


In [18]:
base_url='http://github.com'

In [19]:
print(base_url+url_1)

http://github.com/topics/3d


In [20]:
linkTags=[]
for i in link_tags:
    linkTags.append(base_url+ i['href'])
print(linkTags)

['http://github.com/topics/3d', 'http://github.com/topics/ajax', 'http://github.com/topics/algorithm', 'http://github.com/topics/amphp', 'http://github.com/topics/android', 'http://github.com/topics/angular', 'http://github.com/topics/ansible', 'http://github.com/topics/api', 'http://github.com/topics/arduino', 'http://github.com/topics/aspnet', 'http://github.com/topics/atom', 'http://github.com/topics/awesome', 'http://github.com/topics/aws', 'http://github.com/topics/azure', 'http://github.com/topics/babel', 'http://github.com/topics/bash', 'http://github.com/topics/bitcoin', 'http://github.com/topics/bootstrap', 'http://github.com/topics/bot', 'http://github.com/topics/c', 'http://github.com/topics/chrome', 'http://github.com/topics/chrome-extension', 'http://github.com/topics/cli', 'http://github.com/topics/clojure', 'http://github.com/topics/code-quality', 'http://github.com/topics/code-review', 'http://github.com/topics/compiler', 'http://github.com/topics/continuous-integration
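
String concatenation works here because every `href` starts with `/`. A slightly more defensive variant, sketched below, resolves each link with `urllib.parse.urljoin`, which also leaves already-absolute links unchanged. The helper name `absolute_url` and the `https` base are assumptions for illustration, not part of the notebook.

```python
from urllib.parse import urljoin

# Assumption: the scheme and host used as the base for all relative links.
BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the base URL."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```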

### Creating a Small CSV File

In [40]:
topicDict={
    'Title':topicName,
    'Description':desc_lis,
    'URL':linkTags
}
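
The three lists come from three separate `find_all` calls, so they only line up if each selector matched exactly once per topic. pandas raises a `ValueError` when the lists differ in length; the sketch below is one possible way to check earlier and report which list is off. The helper name `check_equal_lengths` and the sample data are hypothetical.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists, or raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Hypothetical stand-in for topicDict with two topics.
sample = {
    "Title": ["3D", "Ajax"],
    "Description": ["first description", "second description"],
    "URL": ["http://github.com/topics/3d", "http://github.com/topics/ajax"],
}
print(check_equal_lengths(sample))
```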

In [53]:
topicsDf=pd.DataFrame(topicDict)
topicsDf

|  | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


In [22]:
topicsDf.to_csv('Top 30 Topics on GitHub.csv',index=False)

# Extract Information from a Topic Page
#### Variables prefixed with `raw` hold tags collected from the whole page

### Let's Begin with One Repository

In [154]:
linkTags[0]

'http://github.com/topics/3d'

In [157]:
response = requests.get(linkTags[0])
soup1 = BeautifulSoup(response.content,'html.parser')

In [158]:
# Each h3 tag holds the owner name, repo name, and link (NRL) together
raw_name_repo_link = soup1.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})
len(raw_name_repo_link)

20

In [159]:
# Split NRL into owner name, repo name, and link
# Each h3 contains two 'a' tags: the owner name and the repository name
name_repo_link_0 = raw_name_repo_link[0].find_all('a')
name = name_repo_link_0[0].text.split()[0]
repo = name_repo_link_0[1].text.split()[0]
suffix_repo_url = name_repo_link_0[1]['href']

repo_url = base_url + suffix_repo_url
repo_url

'http://github.com/mrdoob/three.js'

In [160]:
print(f"Repo Owner : {name}\nRepo Name : {repo}\nRepo URL : {repo_url}")

Repo Owner : mrdoob
Repo Name : three.js
Repo URL : http://github.com/mrdoob/three.js
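
Since the repository link already has the form `/owner/repo`, the owner and repository name can also be read from the `href` alone. The sketch below shows that alternative; `split_repo_path` is a hypothetical helper, not something the notebook defines.

```python
def split_repo_path(href):
    """Split a path such as '/owner/repo' into (owner, repo)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"Expected '/owner/repo', got {href!r}")
    owner, repo = parts
    return owner, repo

print(split_repo_path("/mrdoob/three.js"))
```

This avoids depending on the link text, which may carry extra whitespace.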


In [145]:
# Get the star counts for the whole page
raw_star_count = soup1.find_all('span',{'id':'repo-stars-counter-star'})

In [153]:
raw_star_count[0].text

'96k'

In [88]:
# Convert the 'k' suffix to thousands (multiply before int() so decimals such as 24.6k are kept)
int(float((raw_star_count[0].text)[:-1])*1000)

96000

In [161]:
def starClean(star_count):
    star_count=star_count.strip()
    if star_count[-1]=='k':
        return int(float(star_count[:-1])*1000)
    return int(star_count)

In [164]:
starClean(raw_star_count[0].text)

96000
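
`starClean` multiplies a float by 1000 and truncates with `int`, which can in principle lose a unit to floating-point rounding. A possible variant, sketched below, uses `decimal.Decimal` for exact arithmetic and also accepts an `m` suffix and thousands separators. Those extra input formats are assumptions; the notebook only shows values such as `'96k'`. The name `parse_star_count` is hypothetical.

```python
from decimal import Decimal

# Assumption: counts may carry a 'k' or 'm' suffix, or thousands separators.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_star_count(text):
    """Convert strings such as '96k', '24.6k' or '1,234' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    multiplier = 1
    if cleaned and cleaned[-1] in SUFFIX_MULTIPLIERS:
        multiplier = SUFFIX_MULTIPLIERS[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal keeps '24.6' exact, so the product is exactly 24600.
    return int(Decimal(cleaned) * multiplier)

print(parse_star_count("96k"))
print(parse_star_count(" 24.6k\n"))
```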

### Merge into One Function

In [166]:
def getTagInOne(repo,startag): # Takes one repository h3 tag and its matching star tag
    aTags=repo.find_all('a')
    userName=aTags[0].text.strip()
    repoName=aTags[1].text.strip()
    repoUrl=base_url+aTags[1]['href'] # base_url was defined earlier
    starTag=starClean(startag.text)
    
    return userName,repoName,repoUrl,starTag
    

In [167]:
getTagInOne(raw_name_repo_link[0],raw_star_count[0])

('mrdoob', 'three.js', 'http://github.com/mrdoob/three.js', 96000)

In [168]:
repoDict={
    "Username":[],
    "Repository Name":[],
    "Repository Url":[],
    "Stars":[]
    
}


for i in range(len(raw_name_repo_link)):
    oneStatment=getTagInOne(raw_name_repo_link[i],raw_star_count[i])
    
    repoDict['Username'].append(oneStatment[0])
    repoDict['Repository Name'].append(oneStatment[1])
    repoDict['Repository Url'].append(oneStatment[2])
    repoDict['Stars'].append(oneStatment[3])

In [171]:
pd.DataFrame(repoDict).head()

|  | Username | Repository Name | Repository Url | Stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | http://github.com/mrdoob/three.js | 96000 |
| 1 | pmndrs | react-three-fiber | http://github.com/pmndrs/react-three-fiber | 24600 |
| 2 | libgdx | libgdx | http://github.com/libgdx/libgdx | 22200 |
| 3 | BabylonJS | Babylon.js | http://github.com/BabylonJS/Babylon.js | 21700 |
| 4 | ssloy | tinyrenderer | http://github.com/ssloy/tinyrenderer | 18400 |


# Get Remaining Repositories
###### The repositories are listed on each topic page. Each topic page lists 20 repositories.

In [43]:
def getSoup(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful Soup
    soup = BeautifulSoup(response.content,'html.parser')
    return soup


def starClean(star_count):
    star_count=star_count.strip()
    if star_count[-1]=='k':
        return int(float(star_count[:-1])*1000)
    return int(star_count)


def getNRLS_InOne(repo,startag): # Returns owner name, repo name, link, and stars (NRLS)
    aTags=repo.find_all('a')
    userName=aTags[0].text.strip()
    repoName=aTags[1].text.strip()
    repoUrl=base_url+aTags[1]['href']
    starTag=starClean(startag.text)
    
    return userName,repoName,repoUrl,starTag

def getRepoPD(soup):
    
    # Get Tags
    repoTags=soup.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    starTags=soup.find_all('span',{'class':'Counter js-social-count'})
    
    repoDict={
    "Username":[],
    "Repository Name":[],
    "Repository Url":[],
    "Stars":[]
    
    }
    
    # Get all the repositories listed on the topic page
    for i in range(len(repoTags)):
        oneStatment=getNRLS_InOne(repoTags[i],starTags[i])

        repoDict['Username'].append(oneStatment[0])
        repoDict['Repository Name'].append(oneStatment[1])
        repoDict['Repository Url'].append(oneStatment[2])
        repoDict['Stars'].append(oneStatment[3])
        
    # Convert into Pandas Dataframe
    return pd.DataFrame(repoDict)

In [61]:
def getCSVfile(topicUrl,path):
    if os.path.exists(path):
        print(f"File with name {path} already exists.\nSkipping......{path}")
        return
    
    df=getRepoPD(getSoup(topicUrl))
    df.to_csv(path,index=None)
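
The existence check above is what makes the scraper safe to rerun without overwriting earlier results. The same idea can be sketched with only the standard library by opening the file in exclusive-creation mode (`"x"`), which refuses to overwrite. `write_once` is a hypothetical helper, and the temporary folder is used only so the example cleans up after itself.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path unless the file already exists; return True if written."""
    try:
        # Mode "x" raises FileExistsError instead of overwriting.
        with open(path, "x", encoding="utf-8") as handle:
            handle.write(text)
    except FileExistsError:
        return False
    return True

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "3D.csv")
    print(write_once(target, "Username,Stars\n"))  # first call writes the file
    print(write_once(target, "Username,Stars\n"))  # second call skips it
```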

### Get Repositories of the Topic at Index 5 (Angular)

In [45]:
linkTags[5]

'http://github.com/topics/angular'

In [46]:
sop=getSoup(linkTags[5])

In [47]:
getRepoPD(sop)

|  | Username | Repository Name | Repository Url | Stars |
|---|---|---|---|---|
| 0 | justjavac | free-programming-books-zh_CN | http://github.com/justjavac/free-programming-b... | 86100 |
| 1 | angular | angular | http://github.com/angular/angular | 78500 |
| 2 | storybookjs | storybook | http://github.com/storybookjs/storybook | 67700 |
| 3 | ionic-team | ionic-framework | http://github.com/ionic-team/ionic-framework | 45800 |
| 4 | leonardomso | 33-js-concepts | http://github.com/leonardomso/33-js-concepts | 45700 |
| 5 | prettier | prettier | http://github.com/prettier/prettier | 41500 |
| 6 | SheetJS | sheetjs | http://github.com/SheetJS/sheetjs | 28600 |
| 7 | angular | angular-cli | http://github.com/angular/angular-cli | 25100 |
| 8 | angular | components | http://github.com/angular/components | 22400 |
| 9 | NativeScript | NativeScript | http://github.com/NativeScript/NativeScript | 20800 |


### Get Repositories of the Topic at Index 6 (Ansible)

In [48]:
getRepoPD(getSoup(linkTags[6]))

|  | Username | Repository Name | Repository Url | Stars |
|---|---|---|---|---|
| 0 | ansible | ansible | http://github.com/ansible/ansible | 51200 |
| 1 | trailofbits | algo | http://github.com/trailofbits/algo | 24300 |
| 2 | StreisandEffect | streisand | http://github.com/StreisandEffect/streisand | 22600 |
| 3 | bregman-arie | devops-exercises | http://github.com/bregman-arie/devops-exercises | 20000 |
| 4 | kubernetes-sigs | kubespray | http://github.com/kubernetes-sigs/kubespray | 11600 |
| 5 | ansible | awx | http://github.com/ansible/awx | 10500 |
| 6 | easzlab | kubeasz | http://github.com/easzlab/kubeasz | 7700 |
| 7 | geerlingguy | ansible-for-devops | http://github.com/geerlingguy/ansible-for-devops | 5300 |
| 8 | ansible-semaphore | semaphore | http://github.com/ansible-semaphore/semaphore | 4600 |
| 9 | rundeck | rundeck | http://github.com/rundeck/rundeck | 4400 |


## Let's Get into Our Top 30 Topics

In [49]:
topicsDf

|  | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


In [175]:
for index,row in topicsDf.iterrows():
    print(row['Title'],row['URL'])

3D http://github.com/topics/3d
Ajax http://github.com/topics/ajax
Algorithm http://github.com/topics/algorithm
Amp http://github.com/topics/amphp
Android http://github.com/topics/android
Angular http://github.com/topics/angular
Ansible http://github.com/topics/ansible
API http://github.com/topics/api
Arduino http://github.com/topics/arduino
ASP.NET http://github.com/topics/aspnet
Atom http://github.com/topics/atom
Awesome Lists http://github.com/topics/awesome
Amazon Web Services http://github.com/topics/aws
Azure http://github.com/topics/azure
Babel http://github.com/topics/babel
Bash http://github.com/topics/bash
Bitcoin http://github.com/topics/bitcoin
Bootstrap http://github.com/topics/bootstrap
Bot http://github.com/topics/bot
C http://github.com/topics/c
Chrome http://github.com/topics/chrome
Chrome extension http://github.com/topics/chrome-extension
Command line interface http://github.com/topics/cli
Clojure http://github.com/topics/clojure
Code quality http://github.com/topics/

In [176]:
data='Top 30 Topics on Github'
os.makedirs(data,exist_ok=True) # exist_ok=True avoids an error if the folder already exists


In [None]:
for index,row in topicsDf.iterrows():
    print(f"Scraping {row['Title']} Repository.........")
    
    getCSVfile(row['URL'],data+'/'+row['Title']+'.csv')
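
The loop builds each file path directly from the topic title. That works for the titles shown above, but a title containing a path separator would point at a different folder. The sketch below shows one way to guard against that; the helper `safe_csv_path`, the set of allowed characters, and the sample title `CI/CD` are all assumptions for illustration.

```python
import os
import re

def safe_csv_path(folder, title):
    """Build '<folder>/<title>.csv', replacing characters unsafe in file names."""
    # Assumption: keep letters, digits, '_', space, '.', '+', '#' and '-';
    # every other character becomes '_'.
    cleaned = re.sub(r"[^\w .+#-]", "_", title).strip()
    return os.path.join(folder, cleaned + ".csv")

print(safe_csv_path("Top 30 Topics on Github", "C++"))
print(safe_csv_path("Top 30 Topics on Github", "CI/CD"))  # hypothetical title
```

The result could be passed to `getCSVfile` in place of `data+'/'+row['Title']+'.csv'`.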