# Scraping Top Repositories for Topics on GitHub

### Github

GitHub is an Internet hosting service for software development and version control using Git. It provides the distributed version control of Git plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project. Headquartered in California, it has been a subsidiary of Microsoft since 2018.

It is commonly used to host open source software development projects. As of January 2023, GitHub reported having over 100 million developers and more than 372 million repositories, including at least 28 million public repositories. It is the largest source code host as of November 2021.

### Web Scraping

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.


#### Steps to follow to complete the task

- We'll scrape the github repositories topic page: https://github.com/topics with the help of python requests and BeautifulSoup Library.
- We'll create a list of all the topics with description and their Url and save it into the pandas dataframe.
- Each topic is further scraped, and extracting the username, their repo title, repo urls and number of stars they recieved.
- Saving the whole data in data folder with topic_name.csv with the below format:

Repo Name,Username,Stars,Repo URL<br>
three.js,mrdoob,69700,https://github.com/mrdoob/three.js<br>
libgdx,libgdx,18300,https://github.com/libgdx/libgdx

In [1]:
#Importing Required Libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import os

In [2]:
#Base Url
base_url = 'https://github.com/topics'

`get_page_content(url)` will give us the content of the url in html format which can be scraped using BeautifulSoup

In [3]:
#Get the content of any webpage using BeautifulSoup
def get_page_content(url):
    response = requests.get(url)
    #Check if the request is successfully made or not
    if response.status_code != 200:
        raise Exception("Failed to load {}".format(url))
    #parse the content using beautifulsoup
    content = BeautifulSoup(response.text, 'html.parser')
    return content


`get_topic_dataframe(url)` function will scrape the content of the webpage with BeautifulSoup and save it into dataframe using Pandas. We can use the selectors to get the required tags from the webpage as shown in example below:

<img src="https://i.imgur.com/ThmwRM3.png" title="Selector"/>

In [4]:
#Get topic dataframe with topic name, description and the URL
def get_topic_dataframe(url):
    #Using selector to select the tag on a webpage
    topic_selector = 'no-underline flex-1 d-flex flex-column'
    topic_tags = get_page_content(url).find_all('a' , {'class': topic_selector})
    
    #Empty Dictionary to store the topic details
    topics_dict = {"Topic":[], 'Description':[], 'URL':[]}
    for i in range(len(topic_tags)):
        topics_dict['Topic'].append(topic_tags[i].contents[1].text.strip())
        topics_dict['Description'].append(topic_tags[i].contents[3].text.strip())
        topics_dict['URL'].append(f'https://github.com{topic_tags[i]["href"]}')
    
    #Creating DataFrame of the dictionary
    topics_df = pd.DataFrame(topics_dict, index=None)
    return topics_df


__Below Examples shows dataframe created using `get_topic_dataframe(url)` function__

In [5]:
#Example dataframe
get_topic_dataframe(base_url).head(5)

Unnamed: 0,Topic,Description,URL
0,3D,3D refers to the use of three-dimensional grap...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


`get_repo_info(topic_url)` this function will give us the username, repositories name, repo URLs and Stars they received in the form of dataframe which can later be saved into csv file.

In [6]:
#Function to get top repositories information for each topic in DataFrame
def get_repo_info(topic_url):

    #User and Star Selectors for selecting tags for scrapping.
    user_selector = 'f3 color-fg-muted text-normal lh-condensed'
    star_selector = 'repo-stars-counter-star'
    users = get_page_content(topic_url).find_all('h3', {'class': user_selector})
    stars = get_page_content(topic_url).find_all(id = star_selector)
    
    #User Dictionary to save the username, repository name, repository Urls and number of stars they receieved
    user_dict = {"Username":[], 'Repo Name':[], 'Repo URL':[], 'Stars': []}
    for i in range(len(users)):
        a_tags = users[i].find_all('a')
        user_dict['Username'].append(a_tags[0].text.strip())
        user_dict['Repo Name'].append(a_tags[1].text.strip())
        user_dict['Repo URL'].append(base_url + a_tags[1]['href']) 
        user_dict['Stars'].append(get_stars(stars[i]))
    return pd.DataFrame(user_dict)


- ##### Following function will give us the number of stars any repositoriy received in Interger number.

In [7]:
#Function to get the number of stars in Integer format
def get_stars(star_tag):
    if star_tag.text[-1]=='k':
        return int(float(star_tag.text[:-1]) * 1000)
    return int(star_tag.text[:-1])

__Below Examples shows dataframe created using `get_repo_info(topic_url)` function__

In [8]:
get_repo_info(f'{base_url}/3d').head(5)

Unnamed: 0,Username,Repo Name,Repo URL,Stars
0,mrdoob,three.js,https://github.com/topics/mrdoob/three.js,92600
1,pmndrs,react-three-fiber,https://github.com/topics/pmndrs/react-three-f...,22900
2,libgdx,libgdx,https://github.com/topics/libgdx/libgdx,21600
3,BabylonJS,Babylon.js,https://github.com/topics/BabylonJS/Babylon.js,20800
4,ssloy,tinyrenderer,https://github.com/topics/ssloy/tinyrenderer,17100


`scrape_each_topics(topic_url, path)` it saves the users dataframe content into CSV for each topic post checking the already existence of the file.

In [9]:
#Function to scrape each topic and saving it in CSV file
def scrape_each_topics(topic_url, path):
    
    #Checking whether file already exist or not.
    if os.path.exists(path):
        print("file {} already Exist, Skipping...".format(path))
        return
    
    #Getting user repository information using get_repo_info function
    user_repo_df = get_repo_info(topic_url)
    return user_repo_df.to_csv(path, index=None)

We have `get_repo_info(topic_url)` function to get dataframe of users repositotries and we have `scrape_each_topics(topic_url, path)` to save the data into csv, let's create another function to combine every function `scrape_github_topics()` into one to perform our final step.

In [10]:
#Final function to combine all the existing function
def scrape_github_topics():
    topics_df = get_topic_dataframe(base_url)
    
    #Creating directory to save all the topics CSVs
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['Topic']))
        scrape_each_topics(row['URL'], f'data/{row["Topic"]}.csv')
        

In [11]:
scrape_github_topics()

Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scraping top repositories for "

#### Lets check the content of the CSV matches with the dataframe or not

Importing CSV content into `content_csv_df` using pandas

In [12]:
content_csv_df = pd.read_csv('data/3d.csv')
content_csv_df.head(5)

Unnamed: 0,Username,Repo Name,Repo URL,Stars
0,mrdoob,three.js,https://github.com/topics/mrdoob/three.js,92600
1,pmndrs,react-three-fiber,https://github.com/topics/pmndrs/react-three-f...,22900
2,libgdx,libgdx,https://github.com/topics/libgdx/libgdx,21600
3,BabylonJS,Babylon.js,https://github.com/topics/BabylonJS/Babylon.js,20800
4,ssloy,tinyrenderer,https://github.com/topics/ssloy/tinyrenderer,17100


Getting dataframe using our function `get_repo_info(url)`

In [13]:
get_repo_info(f'{base_url}/3d').head(5)

Unnamed: 0,Username,Repo Name,Repo URL,Stars
0,mrdoob,three.js,https://github.com/topics/mrdoob/three.js,92600
1,pmndrs,react-three-fiber,https://github.com/topics/pmndrs/react-three-f...,22900
2,libgdx,libgdx,https://github.com/topics/libgdx/libgdx,21600
3,BabylonJS,Babylon.js,https://github.com/topics/BabylonJS/Babylon.js,20800
4,ssloy,tinyrenderer,https://github.com/topics/ssloy/tinyrenderer,17100


___Both the content matches which shows that our data is correct hence we can use the given data for further use___

## References


- [Building a Python Web Scraping Project From Scratch](https://jovian.com/aakashns/python-web-scraping-project-guide)
- [Github Topics](https://github.com/topics)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all)
- [Building Webscrapping project by Jovian](https://www.youtube.com/live/RKsLLG-bzEY?feature=share)