# Scraping the Top Repositories by Topic from GitHub

### Introduction:
#### Web Scraping 
- Web scraping is an automated method of collecting large amounts of data from websites.
- Most of this data is unstructured HTML, which is converted into structured data in a spreadsheet or a database so that it can be used in various applications.
#### GitHub
- GitHub is a web-based platform that hosts software development projects and uses Git for version control.
- Git is a distributed version control system that helps developers work together on the same software projects and keep track of the changes each of them makes to the code.

### Problem Statement:

- We need to scrape the top repositories for each topic on GitHub and save the collected data as CSV files in a directory named 'data'.

#### Technologies Used
- Python, requests, Beautiful Soup and pandas.


Here are the steps that we'll follow for our project:
##### Project Outline:
- We are going to scrape the GitHub topics page: https://github.com/topics
- We'll get a list of topics, and for each topic we'll get the topic title, topic page URL and topic description.
- For each topic, we'll get the top 30 repositories in the topic from the topic page.
- From each repository, we'll take the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,100000,https://github.com/mrdoob/three.js
react-three-fiber,pmndrs,26500,https://github.com/pmndrs/react-three-fiber
```
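
As a rough illustration of this file layout, the sketch below writes the two sample rows with Python's standard `csv` module into an in-memory buffer. It is only an assumption-level example: the project itself writes its files with pandas, and the variable names here are hypothetical.

```python
import csv
import io

# Sample rows copied from the format shown above
header = ["Repo Name", "Username", "Stars", "Repo URL"]
rows = [
    ["three.js", "mrdoob", 100000, "https://github.com/mrdoob/three.js"],
    ["react-three-fiber", "pmndrs", 26500, "https://github.com/pmndrs/react-three-fiber"],
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Each repository becomes one comma-separated line under a single header line, which is the shape every per-topic file is expected to have.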
    


# Scrape the list of Topics from GitHub

#### How we'll achieve it:
- First, we'll use the requests module to download the GitHub topics page and check the status code of the response. If it is 200, we are good to proceed.
- Then, we'll use the Beautiful Soup module to parse the web page.
- Finally, we'll convert the extracted data into DataFrames, which are then written as CSV files within a data directory.
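
The status-code check in the first step can be sketched on its own, without touching the network. In the example below, `ensure_ok` is a hypothetical helper and the two `SimpleNamespace` objects are stand-ins that only mimic the `status_code` attribute of a real requests response.

```python
from types import SimpleNamespace

def ensure_ok(response, url):
    """Raise an exception unless the response has status code 200."""
    if response.status_code != 200:
        raise Exception(
            "Failed to load page {} (status {})".format(url, response.status_code)
        )
    return response

# Stand-in objects that mimic the `status_code` attribute of a requests response
ok_response = SimpleNamespace(status_code=200, text="<html></html>")
missing_response = SimpleNamespace(status_code=404, text="")

# A 200 response passes through unchanged
ensure_ok(ok_response, "https://github.com/topics")

# Any other status code raises, so later parsing steps never see a bad page
try:
    ensure_ok(missing_response, "https://github.com/topics")
    error_message = None
except Exception as error:
    error_message = str(error)
print(error_message)
```

Failing early like this keeps an error page from being parsed as if it were the topics list.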

Let's write a function to download the page

In [1]:
import requests
from bs4 import BeautifulSoup
def get_topic_page():
    # Download the topics page; a status code of 200 means the request succeeded
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML into a BeautifulSoup document
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [2]:
doc= get_topic_page()

In [3]:
# The parsed doc is now a BeautifulSoup object
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse the information from the web page.

Our first function extracts the topic titles from the parsed web page.

In [4]:
def get_topic_titles(doc):
    selected_text="f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_title_tags= doc.find_all('p',class_=selected_text)
    topic_titles=[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

In [5]:
titles= get_topic_titles(doc)

In [6]:
len(titles)

30

In [7]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
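
One detail of this lookup is worth illustrating: when `class_` is given a string containing several class names, Beautiful Soup matches it against the exact value of the `class` attribute, so a reordered or extended class list on the page would no longer match. The sketch below uses a small hypothetical HTML snippet (not the real GitHub markup) to compare that exact match with a CSS selector, which only requires the listed classes to be present.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real GitHub page is larger and may change over time
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">Some description</p>
"""
sample_doc = BeautifulSoup(html, "html.parser")

# Exact class-string match, the same style of lookup as in get_topic_titles
exact = [
    p.text
    for p in sample_doc.find_all("p", class_="f3 lh-condensed mb-0 mt-1 Link--primary")
]

# CSS selector match: requires both classes, in any order, and allows extra ones
by_selector = [p.text for p in sample_doc.select("p.f3.lh-condensed")]

print(exact)
print(by_selector)
```

Both lookups return the same two titles on this snippet. The selector form may be somewhat more tolerant if the page's class list changes, though either approach depends on GitHub keeping these class names.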

Next, we fetch the description of each topic from the parsed web page.

In [8]:
def get_topic_descs(doc):
    desc_text="f5 color-fg-muted mb-0 mt-1"
    topic_desc_tags= doc.find_all('p',class_=desc_text)
    topic_desc=[]
    for tag0 in topic_desc_tags:
        topic_desc.append(tag0.text.strip())
    return topic_desc

Finally, we fetch the topic URLs from the parsed web page.

In [9]:
base_url='https://github.com'
def get_topic_urls(doc):
    link_text="no-underline flex-1 d-flex flex-column"
    topic_link_tags=doc.find_all('a',class_=link_text)
    topic_url=[]
    for tag1 in topic_link_tags:
        topic_url.append(base_url+tag1['href'])
    return topic_url
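
The function above builds each URL by concatenating `base_url` with the relative `href`. As an alternative sketch, the standard library's `urllib.parse.urljoin` does the same job and also copes with an `href` that is already absolute. The `hrefs` values below are hypothetical examples of the kind of links found on the topics page.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Hypothetical href values of the kind found on the topics page
hrefs = ["/topics/3d", "/topics/ajax"]

# Resolve each relative link against the base URL
topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)

# An href that is already absolute is returned unchanged
already_absolute = urljoin(base_url, "https://github.com/topics/3d")
print(already_absolute)
```

Plain string concatenation works as long as every `href` starts with a slash, which is the assumption the original function makes.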

Let's put it all together into a single function

In [10]:
import pandas as pd
def scrape_topics():
    topics_url= 'https://github.com/topics'
    response= requests.get(topics_url)
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc=BeautifulSoup(response.text,'html.parser')
    topics_dict={
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
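
Building a DataFrame from a dictionary of lists only works when all lists have the same length; if, say, one topic had no description tag, pandas would raise an error. The sketch below shows a pre-check that could run before the DataFrame is built. `check_equal_lengths` is a hypothetical helper and the sample values are placeholders, not scraped data.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# Placeholder values in which one description is missing
columns = {
    "title": ["3D", "Ajax"],
    "description": ["placeholder description"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}

try:
    check_equal_lengths(columns)
    length_error = None
except ValueError as error:
    length_error = str(error)
print(length_error)
```

A message that names each column and its length makes it easier to see which selector stopped matching.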

# For each topic, we'll get the top 30 repositories in the topic from the topic page.

In [11]:
def get_topic_page(topic_url):
    # Download the page (this definition replaces the earlier no-argument get_topic_page)
    response = requests.get(topic_url)
    #Check successful response 
    if response.status_code !=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful Soup
    topic_doc= BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [12]:
def parse_star_count(star_str):
    # Convert a star label such as '26.5k' or '950' to an integer
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)
def get_repo_info(h3_tag, star_tag):
    # Return all the info about a repository
    # The first link in the h3 tag is the username, the second is the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text)
    return username, repo_name, stars, repo_url
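
Because `int()` truncates and floating-point multiplication is not exact, a label such as '64.6k' can come out one below the intended value (64599 rather than 64600). The variant below is a minimal sketch, not the original method: `parse_star_count_rounded` is a hypothetical name, and it rounds the product and also tolerates surrounding whitespace and thousands separators.

```python
def parse_star_count_rounded(star_str):
    """Convert labels such as '64.6k' or '950' to an integer, rounding the result."""
    # Remove surrounding whitespace and any thousands separators
    star_str = star_str.strip().replace(",", "")
    if star_str[-1].lower() == "k":
        # round() avoids losing one star to floating-point error
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)

for label in ["64.6k", "26.5k", " 950 ", "1,234"]:
    print(repr(label), parse_star_count_rounded(label))
```

The counts are approximate either way, since GitHub itself abbreviates them to one decimal place of a thousand.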

In [13]:
def get_topic_info(topic_doc):
    
    #Get h3 tags containing repo_title,repo URL and username
    h3_selection_class="f3 color-fg-muted text-normal lh-condensed"
    repo_tags= topic_doc.find_all('h3',class_=h3_selection_class)   
    #Get star tags
    star_selection_text="Counter js-social-count"
    star_tags=topic_doc.find_all('span',class_=star_selection_text)
    #Get repo info
    topic_repo_dict={
    'username':[],
    'repo_name':[],
    'stars':[],
    'repo_url':[]
    }
    for i in range(len(repo_tags)):
        repo_info=get_repo_info(repo_tags[i],star_tags[i])
        topic_repo_dict['username'].append(repo_info[0])
        topic_repo_dict['repo_name'].append(repo_info[1])
        topic_repo_dict['stars'].append(repo_info[2])
        topic_repo_dict['repo_url'].append(repo_info[3])
        
    ## Return the desired things into a DataFrame
    return pd.DataFrame(topic_repo_dict)
    

In [14]:
import os
def scrape_topic(topic_url, path):
    # Build the CSV file name and skip topics that have already been scraped
    fname = path + '.csv'
    if os.path.exists(fname):
        print('The file {} already exists. Skipping...'.format(fname))
        return
    topic_df = get_topic_info(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=None)
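
The skip-if-exists behaviour only works when the existence check and the write use the same file name, including the `.csv` extension. The sketch below isolates that idea in a temporary directory; `csv_path` and `write_once` are hypothetical helpers, and the written text is just a header line.

```python
import os
import tempfile

def csv_path(directory, title):
    """Build the CSV file name for a topic title (hypothetical helper)."""
    return os.path.join(directory, title + ".csv")

def write_once(path, text):
    """Write `text` to `path` unless the file already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True

# Use a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = csv_path(tmp_dir, "Ansible")
    first = write_once(path, "username,repo_name,stars,repo_url\n")
    second = write_once(path, "username,repo_name,stars,repo_url\n")

print(first, second)
```

The first call writes the file and the second is skipped, which is what allows an interrupted scraping run to be restarted without downloading the same topics again.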

# Putting it all together
- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topic page 
- Let's create a function to put them together

In [15]:
def scrape_topics_repos():
    print('Scraping top list of topics from Github')
    topics_df= scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/'+row['title'])

Let's run it to grab the top repositories for the topics listed on the first page of https://github.com/topics

In [16]:
scrape_topics_repos()

Scraping top list of topics from Github
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scraping top repositories fo

We can check whether a CSV file was created properly:

In [17]:
pd.read_csv('data/Ansible.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,bregman-arie,devops-exercises,64599,https://github.com/bregman-arie/devops-exercises
1,ansible,ansible,61700,https://github.com/ansible/ansible
2,trailofbits,algo,28500,https://github.com/trailofbits/algo
3,MichaelCade,90DaysOfDevOps,26100,https://github.com/MichaelCade/90DaysOfDevOps
4,StreisandEffect,streisand,23100,https://github.com/StreisandEffect/streisand
5,kubernetes-sigs,kubespray,15600,https://github.com/kubernetes-sigs/kubespray
6,ansible,awx,13600,https://github.com/ansible/awx
7,easzlab,kubeasz,10200,https://github.com/easzlab/kubeasz
8,semaphoreui,semaphore,9600,https://github.com/semaphoreui/semaphore
9,netbootxyz,netboot.xyz,8400,https://github.com/netbootxyz/netboot.xyz


Hence, we can confirm that the CSV file was created properly, as shown above.
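
Beyond looking at the table by eye, a few basic checks can be automated. The sketch below is one possible approach using the standard `csv` module on an in-memory string. The two data rows are copied from the output above, while the specific checks (column names, numeric stars, GitHub URLs) are assumptions about what a well-formed file should contain.

```python
import csv
import io

# Two rows copied from the output above, in the column layout used by the scraper
csv_text = """username,repo_name,stars,repo_url
ansible,ansible,61700,https://github.com/ansible/ansible
ansible,awx,13600,https://github.com/ansible/awx
"""

expected_columns = ["username", "repo_name", "stars", "repo_url"]

reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# Check the header, then check that each row has a numeric star count and a GitHub URL
columns_ok = reader.fieldnames == expected_columns
stars_ok = all(row["stars"].isdigit() for row in rows)
urls_ok = all(row["repo_url"].startswith("https://github.com/") for row in rows)
print(columns_ok, stars_ok, urls_ok)
```

The same checks could be run over every file in the data directory by opening each file in place of the in-memory string.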

References for the project:
- https://www.youtube.com/watch?v=RKsLLG-bzEY
- https://www.geeksforgeeks.org/introduction-to-github/