# Scraping Repositories for Topics on GitHub
- Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.
- GitHub, Inc. is a provider of Internet hosting for software development and version control using Git. It offers the distributed version control and source code management functionality of Git, plus its own features.
- We are going to scrape the top repositories for different topics on GitHub.
- The tools used are Python, Requests, BeautifulSoup4 and Pandas, plus the standard-library modules `sys` and `os`.

Here is the project outline:
- We're going to scrape https://github.com/topics.
- We will get the list of topics. For each topic we will get the topic title, topic page URL and topic description.
- We will get the top 25 repositories in each topic from the topic page.
- For each repository we will grab the repo name, username, stars and repo URL.
- For each topic we will create a CSV file in the following format:
```
Repo Name,Username,Stars,URL
```

## Scrape the list of topics from GitHub
- We use the Requests library to download the page.
- We use bs4 (BeautifulSoup) to parse the downloaded page.
- We use pandas to convert the information into a dataframe.

Function to download the page

In [60]:
import requests
from bs4 import BeautifulSoup
import sys
import os

def get_topics_page():
    # Return a BeautifulSoup object for the GitHub topics page
    url='https://github.com/topics'
    try:
        response=requests.get(url)
    except Exception as e:
        error_type,error_object,error_info=sys.exc_info()
        print('Error in retrieving the url',url)
        print(error_type," error in line number ",error_info.tb_lineno)
        # Re-raise: without a response there is nothing to check or parse
        raise
    if response.status_code!=200:
        raise Exception("Couldn't load the page")
    doc=BeautifulSoup(response.text,'html.parser')
    return doc


In [3]:
doc=get_topics_page()

In [4]:
type(doc)

bs4.BeautifulSoup
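
The download step can also be written more compactly. The following is a minimal sketch, not the notebook's own method: `fetch_page` is a hypothetical helper, the `get` parameter exists only so the function can be tried without network access, and the 10-second timeout is an arbitrary assumption. Note that `raise_for_status()` raises only for 4xx/5xx responses, which is slightly looser than the `status_code!=200` check used above.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, get=requests.get, timeout=10):
    """Download `url` and return it parsed as a BeautifulSoup object.

    `get` is injectable (hypothetical design choice) so the function
    can be exercised with a stand-in instead of a real HTTP request.
    """
    # A timeout stops the call from hanging on an unresponsive server
    response = get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this helper, the topics page would be loaded with `fetch_page('https://github.com/topics')`.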

Let's create some helper functions to parse information from the topics page.

In [5]:
def get_topic_titles(doc):
    topic_title_tags=doc.find_all('p',attrs={'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


`get_topic_titles` can be used to get the titles of the topics on the page
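
One caveat with `get_topic_titles`: when the class filter is a string containing several class names, BeautifulSoup compares it against the exact value of the `class` attribute, so the lookup stops matching if the classes appear in a different order. The sketch below illustrates this on an invented HTML fragment (the fragment and the shortened class list are assumptions, not GitHub's markup) and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the same classes in two different orders
html = """
<p class="f3 lh-condensed mb-0">3D</p>
<p class="lh-condensed f3 mb-0">Ajax</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match: only the tag whose class attribute equals this string
exact = [p.text for p in soup.find_all("p", class_="f3 lh-condensed mb-0")]

# CSS selector: any <p> carrying both classes, regardless of order
by_selector = [p.text for p in soup.select("p.f3.lh-condensed")]

print(exact)
print(by_selector)
```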

In [9]:
topic_titles=get_topic_titles(doc)

In [10]:
len(topic_titles)

30

In [11]:
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [23]:
def get_topic_description(doc):
    topic_description_tags=doc.find_all('p',class_='f5 color-fg-muted mb-0 mt-1')
    topic_desc=[]
    for tag in topic_description_tags:
        topic_desc.append(tag.text.strip())
    return topic_desc


`get_topic_description` can be used to get the description of each topic on the topics page

In [24]:
topic_desc=get_topic_description(doc)

In [25]:
len(topic_desc)

30

In [26]:
topic_desc[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency framework for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

Similarly, we can get the URLs of the topics using the `get_topic_urls` function.

In [52]:
base_url='https://www.github.com'
def get_topic_urls(doc):
    topic_link_tags=doc.find_all('a',class_="d-flex no-underline",attrs={'data-ga-click':True})
    topic_urls=[]
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls
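
String concatenation works here because every `href` is a path starting with `/`. As a hedged alternative, the standard library's `urljoin` handles both relative and absolute links; the `hrefs` list below is made-up sample input.

```python
from urllib.parse import urljoin

base_url = "https://www.github.com"

# Sample hrefs (assumed for illustration): one relative, one already absolute
hrefs = ["/topics/3d", "https://example.com/elsewhere"]

# urljoin resolves relative paths against base_url and leaves absolute URLs alone
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
```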

Let's put it all together into a single function

In [53]:
import pandas as pd

def scrape_topics(doc):
    topic_titles=get_topic_titles(doc)
    topic_desc=get_topic_description(doc)
    topic_urls=get_topic_urls(doc)
    topics_dict={
        'Topic Title':topic_titles,
        'Topic Description':topic_desc,
        'Topic URL':topic_urls
    }
    return pd.DataFrame(topics_dict)
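
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which is what happens if one of the selectors stops matching. A small sketch of an explicit check with a more descriptive message follows; `build_topics_df` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def build_topics_df(titles, descriptions, urls):
    """Build the topics DataFrame, failing loudly if the lists are misaligned."""
    lengths = {
        "titles": len(titles),
        "descriptions": len(descriptions),
        "urls": len(urls),
    }
    if len(set(lengths.values())) != 1:
        # A mismatch usually means one selector no longer matches the page
        raise ValueError(f"Scraped lists have different lengths: {lengths}")
    return pd.DataFrame({
        "Topic Title": titles,
        "Topic Description": descriptions,
        "Topic URL": urls,
    })
```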

In [54]:
topics_df=scrape_topics(doc)

Getting the first five records from the topics dataframe

In [55]:
topics_df.head()

Unnamed: 0,Topic Title,Topic Description,Topic URL


## Getting the top 25 repositories from a topic page

- Download the topic page
- Parse the information from the topic page
- Create a dataframe for the top repositories on the topic
- Store the dataframe in a CSV file

Function to download the topic page

In [41]:
def get_topic_page(topic_url):
    #Downloading the topic page
    try:
        response=requests.get(topic_url)
    except Exception as e:
        error_type,error_object,error_info=sys.exc_info()
        print("Error while retrieving the url ",topic_url)
        print(error_type,' error in line number ',error_info.tb_lineno)
        #Re-raise: without a response there is nothing to check or parse
        raise

    #Checking for valid response
    if(response.status_code!=200):
        raise Exception("Failed to load page")

    #Creating a soup object
    topic_doc=BeautifulSoup(response.text,'html.parser')

    return topic_doc


In [42]:
doc = get_topic_page('https://github.com/topics/3d')

In [44]:
type(doc)

bs4.BeautifulSoup

Each `h3` tag with the class `f3 color-fg-muted text-normal lh-condensed` contains two `a` tags: the first holds the username, and the second holds the repo name and the repo URL. The star tags (class `social-count js-social-count`) contain the star count of each repository.

In [56]:
def parse_star_counts(star):
    star=star.strip()
    if(star[-1]=='k'):
        return int(float(star[:-1])*1000)
    return int(star)

def get_repo_info(repo_tag,star_tag):
    a_tags=repo_tag.find_all('a')
    repo_name=a_tags[1].text.strip()
    repo_url=base_url+a_tags[1]['href']
    username=a_tags[0].text.strip()
    star=parse_star_counts(star_tag.text)
    return repo_name,username,star,repo_url

In [58]:
repo_info=get_repo_info(doc.find(class_='f3 color-fg-muted text-normal lh-condensed'),doc.find(class_='social-count js-social-count'))

In [59]:
repo_info

('three.js', 'mrdoob', 76400, 'https://www.github.com/mrdoob/three.js')
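
`parse_star_counts` covers plain numbers and the `k` suffix. The variant below is a sketch under assumptions: that counts may also contain thousands separators or an `m` suffix (neither is confirmed by the pages scraped here). It uses `round()` rather than `int()` on the product, since float multiplication can land slightly below a whole number and `int()` would then truncate.

```python
def parse_star_count(text):
    """Convert strings such as '76.4k', '1,234' or '987' to an int."""
    text = text.strip().lower().replace(",", "")
    # Multipliers for abbreviated counts; the 'm' suffix is an assumption
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() guards against float artefacts before the result becomes an int
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_star_count("76.4k"))
```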

Let's write a function to create a dataframe of the top repositories of a topic, with the following column names:
```
Repo Name, Username, Stars, URL
```

In [61]:
def get_topic_repos(topic_doc):
    """
        This function takes a parsed topic page (topic_doc) as input and returns a pandas dataframe containing the following information
        Repo Name, Username, Stars, URL
    """
    #Getting the repo tags to get reponame, username, url
    repo_tags=topic_doc.find_all('h3',class_="f3 color-fg-muted text-normal lh-condensed")
    
    #Getting the star rating of repository
    star_tags=topic_doc.find_all('a',class_="social-count js-social-count")
    
    #Creating a dictionary of topic repositories
    topic_repos_dict={
        'Repo Name':[],
        'Username':[],
        'Stars':[],
        'URL':[]
    }
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['Repo Name'].append(repo_info[0])
        topic_repos_dict['Username'].append(repo_info[1])
        topic_repos_dict['Stars'].append(repo_info[2])
        topic_repos_dict['URL'].append(repo_info[3])
        
    #Returning a data frame of repositories information
    return pd.DataFrame(topic_repos_dict)
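
The same parsing can be tried offline. The sketch below runs an equivalent loop on an invented HTML fragment that imitates the structure described earlier (the user, repo and star values are placeholders), pairing tags with `zip` and collecting one dictionary per repository. Unlike indexing, `zip` stops at the shorter list instead of raising an `IndexError`, which may or may not be the behavior you want.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment imitating one repository entry on a topic page
html = """
<h3 class="f3 color-fg-muted text-normal lh-condensed">
  <a href="/some-user">some-user</a> / <a href="/some-user/some-repo">some-repo</a>
</h3>
<a class="social-count js-social-count">1.5k</a>
"""
soup = BeautifulSoup(html, "html.parser")
repo_tags = soup.find_all("h3", class_="f3 color-fg-muted text-normal lh-condensed")
star_tags = soup.find_all("a", class_="social-count js-social-count")

rows = []
# Pair each repo heading with its star counter
for repo_tag, star_tag in zip(repo_tags, star_tags):
    user_link, repo_link = repo_tag.find_all("a")
    rows.append({
        "Repo Name": repo_link.text.strip(),
        "Username": user_link.text.strip(),
        "Stars": star_tag.text.strip(),  # raw text; convert separately
        "URL": "https://www.github.com" + repo_link["href"],
    })

df = pd.DataFrame(rows)
print(df)
```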


In [62]:
topic_df=get_topic_repos(doc)

Getting the first five records of the topic dataframe, i.e. for the topic 3D

In [65]:
topic_df.head()

Unnamed: 0,Repo Name,Username,Stars,URL
0,three.js,mrdoob,76400,https://www.github.com/mrdoob/three.js
1,libgdx,libgdx,19300,https://www.github.com/libgdx/libgdx
2,react-three-fiber,pmndrs,15900,https://www.github.com/pmndrs/react-three-fiber
3,Babylon.js,BabylonJS,15300,https://www.github.com/BabylonJS/Babylon.js
4,aframe,aframevr,13300,https://www.github.com/aframevr/aframe


Let's write a function that puts everything together and creates a CSV file of the top repositories of a topic

In [66]:
def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print('The file {} already exist'.format(path))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)

In [69]:
os.makedirs('Untitled',exist_ok=True)
scrape_topic('https://github.com/topics/3d','./Untitled/3D.csv')

## Putting it all together
- We have a function to scrape topics from GitHub
- We have a function to scrape the top repositories of a topic and save the information into a CSV file
- Let's create a function to put them together

In [74]:
def scrape_topics_repos():
    topics_df=scrape_topics(get_topics_page())
    os.makedirs('data2',exist_ok=True)
    for index,row in topics_df.iterrows():
        print(f"Scraping the repositories of {row['Topic Title']}")
        scrape_topic(row['Topic URL'],'data2/{}.csv'.format(row['Topic Title']))
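
`scrape_topics_repos` uses the topic title directly as the file name. That works for the titles seen here, but a title containing a path separator would not. A minimal sketch of a sanitizing step follows; `safe_filename` is a hypothetical helper and the set of allowed characters is an assumption.

```python
import re


def safe_filename(title, extension=".csv"):
    """Turn a topic title into a conservative file name."""
    # Replace every run of characters other than letters, digits,
    # '.', '_' or '-' with a single underscore
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title).strip("_")
    # Fall back to a fixed name if nothing usable remains
    return (cleaned or "untitled") + extension


print(safe_filename("Awesome Lists"))
```

The path could then be built as `os.path.join('data2', safe_filename(row['Topic Title']))`.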

Let's run it to scrape the top repos for all topics on the first page of github.com/topics

In [76]:
scrape_topics_repos()

Scraping the repositories of 3D
The file data2/3D.csv already exist
Scraping the repositories of Ajax
The file data2/Ajax.csv already exist
Scraping the repositories of Algorithm
The file data2/Algorithm.csv already exist
Scraping the repositories of Amp
The file data2/Amp.csv already exist
Scraping the repositories of Android
The file data2/Android.csv already exist
Scraping the repositories of Angular
The file data2/Angular.csv already exist
Scraping the repositories of Ansible
The file data2/Ansible.csv already exist
Scraping the repositories of API
The file data2/API.csv already exist
Scraping the repositories of Arduino
The file data2/Arduino.csv already exist
Scraping the repositories of ASP.NET
The file data2/ASP.NET.csv already exist
Scraping the repositories of Atom
The file data2/Atom.csv already exist
Scraping the repositories of Awesome Lists
The file data2/Awesome Lists.csv already exist
Scraping the repositories of Amazon Web Services
The file data2/Amazon Web Services.cs

Let's check that the CSVs were created properly

In [79]:
_3d=pd.read_csv('./data2/3D.csv')

In [80]:
_3d.head()

Unnamed: 0,Repo Name,Username,Stars,URL
0,three.js,mrdoob,76400,https://www.github.com/mrdoob/three.js
1,libgdx,libgdx,19300,https://www.github.com/libgdx/libgdx
2,react-three-fiber,pmndrs,15900,https://www.github.com/pmndrs/react-three-fiber
3,Babylon.js,BabylonJS,15300,https://www.github.com/BabylonJS/Babylon.js
4,aframe,aframevr,13300,https://www.github.com/aframevr/aframe
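
To inspect every file at once rather than one topic at a time, the CSVs can be combined into a single dataframe. This is a sketch only: `load_all_topics` and the added `Topic` column are hypothetical, and it assumes each file name (minus the extension) is the topic title.

```python
from pathlib import Path

import pandas as pd


def load_all_topics(directory):
    """Read every CSV in `directory` into one DataFrame with a 'Topic' column."""
    frames = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # The file name without its extension is taken as the topic title
        frame["Topic"] = csv_path.stem
        frames.append(frame)
    if not frames:
        # No CSV files found: return an empty DataFrame
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

For the files written above, this would be called as `load_all_topics('data2')`.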
