# Scraping top repositories on GitHub

INTRO - 
- Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.
- GitHub:
 GitHub is a hosting site where developers can upload their code and work on it collaboratively. An important feature of GitHub is its version control system, which lets coders change software, for example to fix bugs or improve efficiency, without affecting the released version or the experience of current users.
- In this project we use web scraping to collect the top 30 repositories for each topic listed on GitHub.
- Tools used: Python, Beautiful Soup, Pandas

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description.
- For each topic, we'll get the top 30 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars and repo URL.
- For each topic we'll create a CSV file in the following format:
    Repo Name,Username,Stars,Repo URL
    - Example:

      three.js,mrdoob,180700,https://github.com/mrdoob/three.js
      
      libgdx,libgdx,20300,https://github.com/libgdx/libgdx
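
To make the target file layout concrete, here is a minimal sketch that writes the two example rows above in the `Repo Name,Username,Stars,Repo URL` format and reads them back. It uses the standard library's `csv` module and an in-memory buffer purely for illustration; the project itself builds its files with Pandas, and the names `rows` and `buffer` are assumptions of this sketch.

```python
import csv
import io

# Column order as described in the steps above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

# The two example rows from the text (illustrative values only).
rows = [
    ["three.js", "mrdoob", 180700, "https://github.com/mrdoob/three.js"],
    ["libgdx", "libgdx", 20300, "https://github.com/libgdx/libgdx"],
]

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back yields one dict per repository, keyed by column name.
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[0]["Username"])
```

Note that values read back from a CSV file are strings, so the star count would need converting with `int` before any numeric comparison.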

## Scraping the list of topics from GitHub
- Use requests to download the page
- Use Beautiful Soup to parse the HTML and extract the information we need
- Convert the result to a Pandas DataFrame
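
The parsing step can be tried without any network access. The sketch below feeds a small hand-written HTML fragment to Beautiful Soup and pulls out text by tag and class, which is the same approach the helper functions in this document take. The fragment and its class names (`topic-link`, `topic-title`, `topic-desc`) are made up for illustration and are not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (hypothetical markup).
html = """
<div>
  <a class="topic-link" href="/topics/3d">
    <p class="topic-title">3D</p>
    <p class="topic-desc">  First sample description.  </p>
  </a>
  <a class="topic-link" href="/topics/ajax">
    <p class="topic-title">Ajax</p>
    <p class="topic-desc">  Second sample description.  </p>
  </a>
</div>
"""

doc = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; .text gives the text inside it.
titles = [p.text.strip() for p in doc.find_all("p", {"class": "topic-title"})]
descs = [p.text.strip() for p in doc.find_all("p", {"class": "topic-desc"})]
# Attributes such as href are read with dictionary-style access.
links = [a["href"] for a in doc.find_all("a", {"class": "topic-link"})]

print(titles)
print(descs)
print(links)
```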

In [1]:
import requests
from bs4 import BeautifulSoup

def get_topics_page():
    # Downloading the required page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parsing using BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc


Next, we create some helper functions to parse information from the page.

In [2]:
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p',{"class":"f3 lh-condensed mb-0 mt-1 Link--primary"})
    topic_titles =[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` returns the list of topic titles.
#### Similarly, we define functions for the descriptions and URLs of the topics.

In [3]:
def get_topic_descs(doc):
    topic_desc_tags= doc.find_all("p",{"class":"f5 color-fg-muted mb-0 mt-1"})
    topic_descs=[]
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

In [4]:
def get_topic_urls(doc):
    topic_link_tags=doc.find_all("a",{"class":"no-underline flex-1 d-flex flex-column"})
    topic_urls=[]
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls
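
`get_topic_urls` prepends a base URL because it expects the `href` values on the page to be relative paths. As an alternative to string concatenation, here is a minimal sketch using the standard library's `urljoin`; the `hrefs` list is sample data assumed for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

# Sample relative links of the kind the code above expects (assumed data).
hrefs = ["/topics/3d", "/topics/ajax"]

# urljoin combines the base with each path and leaves
# already-absolute URLs untouched.
topic_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(topic_urls)

already_absolute = urljoin(BASE_URL, "https://github.com/topics/android")
print(already_absolute)
```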

##### Putting it all together in a single function--

In [5]:
import pandas as pd

def scrape_topics():
    topics_url ="https://github.com/topics"
    response = requests.get(topics_url)
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text,'html.parser')
    topics_dict = {
    'title':get_topic_titles(doc),
    'description':get_topic_descs(doc),
    'url':get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
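
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class selectors stops matching the page. A small check before building the DataFrame gives a clearer error message. This is a sketch only; `check_equal_lengths` is a hypothetical helper, not part of the original code.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns

# Consistent columns pass through unchanged.
ok = check_equal_lengths({
    "title": ["3D", "Ajax"],
    "description": ["first", "second"],
    "url": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
})

# A missing description is reported together with every column's length.
try:
    check_equal_lengths({"title": ["3D", "Ajax"], "description": ["first"]})
except ValueError as err:
    message = str(err)
    print(message)
```

In `scrape_topics`, the dictionary could be passed through such a check just before the `pd.DataFrame(...)` call.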

#### Now getting the top 30 repositories from a topic page--

In [6]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code!=200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [7]:
# Star counts appear in abbreviated form (for example 60.7k), so they need to be converted to integers
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() rather than int(): int() truncates, so floating-point error could drop a star
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)

# get_repo_info fetches all the information we need about a repo
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = "https://github.com" + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
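
To illustrate the star-count conversion, here is a slightly more general sketch that also accepts thousands separators and an `m` suffix. Whether GitHub ever renders those forms on topic pages is an assumption, and `parse_count` and the `SUFFIXES` table are hypothetical names. It uses `Decimal` so that the multiplication is exact.

```python
from decimal import Decimal

# Multipliers for abbreviated counts; the "m" entry is an assumption.
SUFFIXES = {"k": 1_000, "m": 1_000_000}

def parse_count(text):
    """Convert strings such as '60.7k' or '1,234' to an int."""
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    multiplier = SUFFIXES.get(cleaned[-1])
    if multiplier is not None:
        # Decimal keeps 60.7 * 1000 exact, so int() cannot lose a star.
        return int(Decimal(cleaned[:-1]) * multiplier)
    return int(cleaned)

print(parse_count("60.7k"))   # 60700
print(parse_count(" 987 "))   # 987
print(parse_count("1,234"))   # 1234
print(parse_count("1.2m"))    # 1200000
```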

In [8]:
def get_topic_repo(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    repo_tags = topic_doc.find_all('h3', {"class": "f3 color-fg-muted text-normal lh-condensed"})
    # Get star tags
    star_tags = topic_doc.find_all('span', {"class": "Counter js-social-count"})
    # Get repo info
    topic_repos_dict = {
        'username':[],
        'repo_name':[],
        'stars':[],
        'repo_url':[]
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

In [9]:
# Save the information fetched for a topic to a CSV file
import os

def scrape_topic(topic_url, topic_name):
    fname = topic_name + '.csv'
    # Skip topics that were already scraped in an earlier run
    if os.path.exists(fname):
        print('The file {} already exists. Skipping... '.format(fname))
        return
    topic_df = get_topic_repo(get_topic_page(topic_url))
    # index=False leaves the DataFrame's row index out of the file
    topic_df.to_csv(fname, index=False)
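
`scrape_topic` uses the topic title directly as the file name. Titles on the topics page can contain spaces or punctuation (for example "ASP.NET" or "Awesome Lists"), so it may help to normalize them first. The helper below is a hypothetical sketch of one way to do that; the replacement rules are an assumption, not something the original code does.

```python
import re

def safe_filename(topic_name, extension=".csv"):
    """Build a file name containing only letters, digits, '.', '_' and '-'."""
    # Replace every run of other characters (spaces, slashes, ...) with "_".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", topic_name.strip())
    # Avoid empty names and names that start or end with dots or underscores.
    stem = stem.strip("._") or "topic"
    return stem + extension

print(safe_filename("Awesome Lists"))   # Awesome_Lists.csv
print(safe_filename("ASP.NET"))         # ASP.NET.csv
print(safe_filename("C"))               # C.csv
print(safe_filename("a/b"))             # a_b.csv
```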

### Putting everything together--

- We have a function to get the list of topics.
- We have a function to create a CSV file of the repos scraped from a topic page.
- Let's create a function to put them together.

In [10]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [11]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapi
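
As written, `scrape_topics_repos` lets any exception from `scrape_topic` propagate, so a single topic page that fails to load ends the whole run. A minimal sketch of a more forgiving loop is shown below. It is not the original implementation: `scrape_all`, `fake_scrape`, and the `delay` parameter are hypothetical, and the scraping function is passed in as an argument so the example runs without network access.

```python
import time

def scrape_all(topics, scrape_one, delay=0.0):
    """Call scrape_one(url, title) for each topic and collect failures.

    `topics` is an iterable of (title, url) pairs. Returns a list of
    (title, error message) pairs for the topics that raised.
    """
    failures = []
    for title, url in topics:
        try:
            scrape_one(url, title)
        except Exception as err:
            failures.append((title, str(err)))
        # Optional pause between requests to be gentle on the server.
        if delay:
            time.sleep(delay)
    return failures

# Stand-in for scrape_topic that fails for one topic.
def fake_scrape(url, title):
    if title == "Ajax":
        raise Exception("Failed to load page {}".format(url))

topics = [
    ("3D", "https://github.com/topics/3d"),
    ("Ajax", "https://github.com/topics/ajax"),
]
failures = scrape_all(topics, fake_scrape)
print(failures)
```

With the real functions, the pairs could be taken from the `title` and `url` columns of the DataFrame returned by `scrape_topics`, with `scrape_topic` passed as `scrape_one`.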

## Summary--
- I successfully fetched data from https://github.com/topics and stored it in CSV files, one per topic.
## Future Scope--
- I plan to use the data fetched here for tree-based data perturbation, modifying it to reduce the squared deviation based on an attribute, although none of the attributes here is private or needs hiding.