# Scraping Top Repositories for Topics on GitHub

Here are the steps we'll follow:

- We'll scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top repositories listed on the topic page (20 in the run below).
- For each repository, we'll grab the repository name, username, stars and repository URL.
- For each topic, we'll create a CSV file with the columns Username, Repository Name, Repository Url and Stars.


In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import os

In [2]:
topic_Url='https://github.com/topics'

In [3]:
response=requests.get(topic_Url)

In [4]:
response.status_code

200

In [5]:
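# Name the parser explicitly; otherwise Beautiful Soup guesses one and emits a warning
soup=BeautifulSoup(response.content,'html.parser')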

In [6]:
topic_name=soup.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [7]:
topic_name

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [8]:
len(topic_name)

30

In [9]:
topicName=[]
for i in topic_name:
    topicName.append(i.text)
print(topicName)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [10]:
# Get Description of Topics
topic_desc=soup.find_all('p',{'class':"f5 color-fg-muted mb-0 mt-1"})

In [11]:
desc_lis = [topic.text.strip() for topic in topic_desc]

In [12]:
desc_lis[:3]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.']

In [13]:
# Get URLs of topics
link_tags=soup.find_all('a',{'class':"no-underline flex-grow-0"})

In [14]:
link_tags[0]

<a class="no-underline flex-grow-0" href="/topics/3d">
<div class="color-bg-accent f4 color-fg-muted text-bold rounded mr-3 flex-shrink-0 text-center" style="width:64px; height:64px; line-height:64px;">
            #
          </div>
</a>

In [15]:
url_1 = link_tags[0]['href']
print(url_1)

/topics/3d


In [16]:
base_url='http://github.com'

In [17]:
print(base_url+url_1)

http://github.com/topics/3d
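
Plain concatenation works here because every `href` on the page starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a link against a base URL and leaves links that are already absolute untouched. `to_absolute` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import urljoin

base_url = 'http://github.com'  # same value as in the notebook

def to_absolute(href, base=base_url):
    # Hypothetical helper: resolve a (possibly relative) href against the base URL
    return urljoin(base, href)

print(to_absolute('/topics/3d'))
```

Either approach gives the same result for the links scraped above; `urljoin` only matters if the page ever mixes relative and absolute links.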


In [18]:
linkTags=[]
for i in link_tags:
    linkTags.append(base_url+ i['href'])
print(linkTags)

['http://github.com/topics/3d', 'http://github.com/topics/ajax', 'http://github.com/topics/algorithm', 'http://github.com/topics/amphp', 'http://github.com/topics/android', 'http://github.com/topics/angular', 'http://github.com/topics/ansible', 'http://github.com/topics/api', 'http://github.com/topics/arduino', 'http://github.com/topics/aspnet', 'http://github.com/topics/atom', 'http://github.com/topics/awesome', 'http://github.com/topics/aws', 'http://github.com/topics/azure', 'http://github.com/topics/babel', 'http://github.com/topics/bash', 'http://github.com/topics/bitcoin', 'http://github.com/topics/bootstrap', 'http://github.com/topics/bot', 'http://github.com/topics/c', 'http://github.com/topics/chrome', 'http://github.com/topics/chrome-extension', 'http://github.com/topics/cli', 'http://github.com/topics/clojure', 'http://github.com/topics/code-quality', 'http://github.com/topics/code-review', 'http://github.com/topics/compiler', 'http://github.com/topics/continuous-integration

### Creating CSV file for Topics

In [19]:
topicDict={
    'Title':topicName,
    'Description':desc_lis,
    'URL':linkTags
}
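
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class selectors matches fewer tags than the others. The sketch below checks the lengths first and reports which column is off. `check_same_length` and the sample values are hypothetical and only illustrate the idea.

```python
def check_same_length(columns):
    # Hypothetical helper: map each column name to its length and
    # raise early if the scraped lists are not aligned
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths

# Placeholder data standing in for topicName, desc_lis and linkTags
sample = {
    'Title': ['3D', 'Ajax'],
    'Description': ['first description', 'second description'],
    'URL': ['http://github.com/topics/3d', 'http://github.com/topics/ajax'],
}
print(check_same_length(sample))
```

In the notebook this could be called as `check_same_length(topicDict)` before building the DataFrame.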

In [20]:
topicsDf=pd.DataFrame(topicDict)
topicsDf

| | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


In [21]:
topicsDf.to_csv('Top 30 Topics on GitHub.csv',index=False)

# Extracting Repositories from a Topic Page

- Let's begin with a single topic page

In [22]:
linkTags[0]

'http://github.com/topics/3d'

In [23]:
response = requests.get(linkTags[0])
soup1 = BeautifulSoup(response.content,'html.parser')

In [27]:
raw_name_repo_link = soup1.find_all('h3',{'class':"f3 color-fg-muted text-normal lh-condensed"})
len(raw_name_repo_link)

20

In [30]:
# Get star count
raw_star_count = soup1.find_all('span',{'id':'repo-stars-counter-star'})

In [31]:
# Convert abbreviated counts such as '1k' into integers (1000)
def starClean(star_count):
    star_count=star_count.strip()
    if star_count[-1]=='k':
        return int(float(star_count[:-1])*1000)
    return int(star_count)
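
`int()` truncates toward zero, so a floating-point product that lands just below a whole number would come out one star short. The sketch below is a slightly more defensive variant that rounds instead. The name `parse_star_count`, the `m` suffix and the handling of thousands separators are assumptions added for illustration; the notebook itself only deals with a lowercase `k`.

```python
def parse_star_count(text):
    # Hypothetical variant of starClean: accepts 'k'/'m' suffixes in either
    # case, drops thousands separators, and rounds instead of truncating
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

for raw in ['96k', '24.6k', '1,234', '512']:
    print(raw, parse_star_count(raw))
```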

In [34]:
def getTagInOne(repo,startag):
    aTags=repo.find_all('a')
    userName=aTags[0].text.strip()
    repoName=aTags[1].text.strip()
    repoUrl=base_url+aTags[1]['href']
    starTag=starClean(startag.text)
    
    return userName,repoName,repoUrl,starTag
    

In [35]:
getTagInOne(raw_name_repo_link[0],raw_star_count[0])

('mrdoob', 'three.js', 'http://github.com/mrdoob/three.js', 96000)
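
The function returns a plain tuple, so callers have to remember which position holds which value. One possible alternative, sketched here with a hypothetical `RepoInfo` type, is a `NamedTuple`: it keeps tuple behaviour but lets each field be read by name.

```python
from typing import NamedTuple

class RepoInfo(NamedTuple):
    # Hypothetical record type; field order matches getTagInOne's return value
    username: str
    repo_name: str
    repo_url: str
    stars: int

# Values taken from the output of the cell above
info = RepoInfo('mrdoob', 'three.js', 'http://github.com/mrdoob/three.js', 96000)
print(info.repo_name, info.stars)

# It still unpacks like the original tuple
username, repo_name, repo_url, stars = info
```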

In [36]:
repoDict={
    "Username":[],
    "Repository Name":[],
    "Repository Url":[],
    "Stars":[]
    
}


for i in range(len(raw_name_repo_link)):
    oneStatment=getTagInOne(raw_name_repo_link[i],raw_star_count[i])
    
    repoDict['Username'].append(oneStatment[0])
    repoDict['Repository Name'].append(oneStatment[1])
    repoDict['Repository Url'].append(oneStatment[2])
    repoDict['Stars'].append(oneStatment[3])

In [37]:
pd.DataFrame(repoDict).head()

| | Username | Repository Name | Repository Url | Stars |
|---|---|---|---|---|
| 0 | mrdoob | three.js | http://github.com/mrdoob/three.js | 96000 |
| 1 | pmndrs | react-three-fiber | http://github.com/pmndrs/react-three-fiber | 24600 |
| 2 | libgdx | libgdx | http://github.com/libgdx/libgdx | 22200 |
| 3 | BabylonJS | Babylon.js | http://github.com/BabylonJS/Babylon.js | 21700 |
| 4 | ssloy | tinyrenderer | http://github.com/ssloy/tinyrenderer | 18400 |


**Merge everything into reusable functions**

In [38]:
def getSoup(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    # Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful soup
    soup = BeautifulSoup(response.content,'html.parser')
    return soup


def starClean(star_count):
    star_count=star_count.strip()
    if star_count[-1]=='k':
        return int(float(star_count[:-1])*1000)
    return int(star_count)


def getNRLS_InOne(repo,startag):
    # Extract username, repository name, repository URL and star count for one repository
    aTags=repo.find_all('a')
    userName=aTags[0].text.strip()
    repoName=aTags[1].text.strip()
    repoUrl=base_url+aTags[1]['href']
    starTag=starClean(startag.text)
    
    return userName,repoName,repoUrl,starTag

def getRepoPD(soup):
    
    # Get Tags
    repoTags=soup.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    starTags=soup.find_all('span',{'class':'Counter js-social-count'})
    
    repoDict={
    "Username":[],
    "Repository Name":[],
    "Repository Url":[],
    "Stars":[]
    
    }
    
    # Collect every repository listed on the topic page
    for i in range(len(repoTags)):
        repoInfo=getNRLS_InOne(repoTags[i],starTags[i])

        repoDict['Username'].append(repoInfo[0])
        repoDict['Repository Name'].append(repoInfo[1])
        repoDict['Repository Url'].append(repoInfo[2])
        repoDict['Stars'].append(repoInfo[3])
        
    # Convert into Pandas Dataframe
    return pd.DataFrame(repoDict)
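
`getSoup` raises an exception on any non-200 response, so a single transient failure would stop a loop over many topic pages. A minimal sketch of a retry wrapper follows. `fetch_with_retry`, its parameters and the `flaky_fetch` stand-in are all hypothetical; the demo avoids the network so it can run anywhere.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    # Hypothetical wrapper: call fetch(url) and retry on any exception,
    # doubling the wait between attempts
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay)
            delay *= 2

attempts_seen = []

def flaky_fetch(url):
    # Stand-in for getSoup that fails twice before succeeding
    attempts_seen.append(url)
    if len(attempts_seen) < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'parsed page'

# sleep is replaced with a no-op so the demo does not actually wait
result = fetch_with_retry(flaky_fetch, 'http://github.com/topics/3d',
                          sleep=lambda seconds: None)
print(result, len(attempts_seen))
```

In the notebook, the call would look like `fetch_with_retry(getSoup, topicUrl)`.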

In [39]:
def getCSVfile(topicUrl,path):
    if os.path.exists(path):
        print(f"File with name {path} already exists.\nSkipping......{path}")
        return
    
    df=getRepoPD(getSoup(topicUrl))
    df.to_csv(path,index=False)
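
The skip-if-exists pattern above does not depend on pandas. As an illustration only, the sketch below does the same with the standard library's `csv.DictWriter`. `write_rows` is a hypothetical helper, and the sample row reuses the values shown earlier for `mrdoob/three.js`.

```python
import csv
import os
import tempfile

def write_rows(rows, path, fieldnames):
    # Hypothetical stdlib counterpart of getCSVfile: skip existing files,
    # otherwise write a header row followed by one line per dictionary
    if os.path.exists(path):
        return False
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return True

fields = ['Username', 'Repository Name', 'Repository Url', 'Stars']
rows = [{'Username': 'mrdoob', 'Repository Name': 'three.js',
         'Repository Url': 'http://github.com/mrdoob/three.js', 'Stars': 96000}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, '3D.csv')
    first = write_rows(rows, path, fields)   # the file is created
    second = write_rows(rows, path, fields)  # the file exists, so it is skipped
    print(first, second)
```

Returning `True` or `False` instead of printing lets the caller decide how to report skipped files.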

#### Top 30 Topics 

In [40]:
topicsDf

| | Title | Description | URL |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | http://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | http://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | http://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | http://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | http://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | http://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | http://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | http://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | http://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | http://github.com/topics/aspnet |


In [41]:
for index,row in topicsDf.iterrows():
    print(row['Title'],row['URL'])

3D http://github.com/topics/3d
Ajax http://github.com/topics/ajax
Algorithm http://github.com/topics/algorithm
Amp http://github.com/topics/amphp
Android http://github.com/topics/android
Angular http://github.com/topics/angular
Ansible http://github.com/topics/ansible
API http://github.com/topics/api
Arduino http://github.com/topics/arduino
ASP.NET http://github.com/topics/aspnet
Atom http://github.com/topics/atom
Awesome Lists http://github.com/topics/awesome
Amazon Web Services http://github.com/topics/aws
Azure http://github.com/topics/azure
Babel http://github.com/topics/babel
Bash http://github.com/topics/bash
Bitcoin http://github.com/topics/bitcoin
Bootstrap http://github.com/topics/bootstrap
Bot http://github.com/topics/bot
C http://github.com/topics/c
Chrome http://github.com/topics/chrome
Chrome extension http://github.com/topics/chrome-extension
Command line interface http://github.com/topics/cli
Clojure http://github.com/topics/clojure
Code quality http://github.com/topics/

# Saving Data to CSV Files by Topic

In [176]:
data='Top 30 Topics on Github'
os.makedirs(data,exist_ok=True) # Create the folder only if it does not already exist


In [None]:
for index,row in topicsDf.iterrows():
    print(f"Scraping {row['Title']} Repository.........")
    
    getCSVfile(row['URL'],data+'/'+row['Title']+'.csv')
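
The titles scraped above, such as `C++` and `ASP.NET`, are valid file names, but a title containing `/` would be read as a subdirectory and the write would fail. The sketch below shows one way to guard against that. `topic_csv_path` and the set of characters it keeps are assumptions, not part of the notebook.

```python
import os
import re

def topic_csv_path(folder, title):
    # Hypothetical helper: replace characters outside a conservative
    # whitelist (letters, digits, space, '.', '_', '+', '#', '-') with '_'
    # and build the path with os.path.join
    safe_title = re.sub(r'[^A-Za-z0-9 ._+#-]', '_', title).strip()
    return os.path.join(folder, safe_title + '.csv')

print(topic_csv_path('Top 30 Topics on Github', 'C++'))
```

The loop could then call `getCSVfile(row['URL'], topic_csv_path(data, row['Title']))`.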