# Scraping Top Repositories for Topics on GitHub

## Introduction to web scraping
Web scraping is a technique used to automatically extract information from websites. It involves making HTTP requests to a website, retrieving the HTML content, and then parsing that content to extract the desired data. Web scraping is particularly useful when the information you need is spread across multiple web pages and not available through an official API. By automating the process of data collection, web scraping can save time and effort, allowing you to gather large datasets for analysis, research, or personal use.

## Introduction to GitHub
GitHub is one of the most widely used platforms for version control and collaboration, primarily among software developers. It allows users to host and share code repositories, contribute to open-source projects, and collaborate on software development. Given the vast amount of data hosted on GitHub, it can be challenging to keep track of trending repositories or find repositories on specific topics of interest.

## The problem statement
The problem we're addressing in this project is how to efficiently identify and analyze top repositories on GitHub based on specific topics. Scraping GitHub's topics pages allows us to gather a curated list of repositories that are categorized by various themes such as machine learning, web development, or data science. By analyzing these repositories, we can gain insights into the most popular projects, observe trends in technology, and even explore potential projects for personal learning or contribution.

## Tools Mentioned
In this project, we’ll be using Python along with a few powerful libraries:

* requests: This library sends HTTP requests to websites and retrieves their content. It is simple to use and returns the raw response (status code, headers, and body) without rendering the page or running JavaScript the way a browser would.

* BeautifulSoup: A library for parsing HTML and XML documents. It helps us navigate through the HTML content of a web page and extract the specific data we need. BeautifulSoup is particularly useful for web scraping because it simplifies the process of locating elements on a page by their tags, attributes, or hierarchy.

* pandas: A data manipulation library that allows us to organize, analyze, and manipulate the data we scrape. Once we have extracted the data from GitHub, we’ll use pandas to structure it into a DataFrame, making it easier to analyze and export, for example, as a CSV file.

These tools are chosen because they are widely used in the Python ecosystem for web scraping and data analysis. They offer the right balance of simplicity and power, making them ideal for this type of project.

#### Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

In [8]:
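# Install BeautifulSoup (published on PyPI as beautifulsoup4).
# The comment sits on its own line because an inline '#' after the command
# was passed to pip as a requirement and caused "Invalid requirement: '#'".
!pip install beautifulsoup4 --upgrade --quiet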


In [9]:
# Import the modules we will use in our project
import requests
import pandas as pd
import os
from bs4 import BeautifulSoup

#### The URLs we will use

In [11]:
github_topics_url = "https://github.com/topics"  # github topics page URL
git_base_url = "https://github.com" # github official base URL

## Scrape the list of topics from GitHub

To scrape the list of topics from a GitHub page, we'll:

* Use the requests library to download the page.
* Use BeautifulSoup (BS4) to parse and extract the information.
* Convert the extracted information to a Pandas DataFrame for easier manipulation.
  
Let's write a function to download the page.

In [13]:
def get_url_page(topic_url):
    # Download the web page
    response = requests.get(topic_url)
    # Raise an exception if the request did not succeed
    if response.status_code != 200:
        raise Exception(f'Failed to load page {topic_url} (status code {response.status_code}).')

    # Parse the page HTML with BeautifulSoup and return the result
    topic_soup = BeautifulSoup(response.text, 'html.parser')
    return topic_soup

The get_url_page function downloads the HTML content of a specified web page using its URL and parses it using BeautifulSoup. If the page fails to load, the function raises an exception. The result is a BeautifulSoup object that represents the parsed HTML, which can be used for further data extraction and manipulation.
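
As a quick illustration of what the returned object offers, here is a minimal sketch that parses a small hand-written HTML snippet instead of a live page. The markup, the class names `title` and `topic-link`, and the variable `sample_html` are made up for this example and do not reflect GitHub's actual page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written by hand for this example only
sample_html = """
<div>
  <a class="topic-link" href="/topics/3d"><p class="title">3D</p></a>
  <a class="topic-link" href="/topics/ajax"><p class="title">Ajax</p></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect the text of every <p> tag that has the class "title"
titles = [p.get_text(strip=True) for p in soup.find_all("p", class_="title")]

# Collect the href attribute of every <a> tag that has the class "topic-link"
links = [a.get("href") for a in soup.find_all("a", class_="topic-link")]

print(titles)
print(links)
```

The functions in the next cells follow the same pattern: find tags by name and class, then read their text or `href`.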

In [15]:
def get_topic_titles(soup):
    """
    Extracts the titles of topics from the parsed HTML content.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of topic titles as strings.
    """
    # CSS class for topic titles
    topics_selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    # Find all <p> tags with the specified class
    all_topics = soup.find_all('p', {'class': topics_selection_class})
    
    topics = []
    
    # Extract and clean text from each <p> tag
    for topic in all_topics:
        topics.append(topic.getText().strip())
    return topics

def get_topic_desc(soup):
    """
    Extracts the descriptions of topics from the parsed HTML content.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of topic descriptions as strings.
    """
    # CSS class for topic descriptions
    topic_desc_class = "f5 color-fg-muted mb-0 mt-1"
    # Find all <p> tags with the specified class
    all_topic_desc = soup.find_all('p', class_=topic_desc_class)

    topic_desc = []

    # Extract and clean text from each <p> tag
    for desc in all_topic_desc:
        topic_desc.append(desc.getText().strip())
    return topic_desc

def get_topic_url(soup):
    """
    Extracts the URLs of topics from the parsed HTML content.

    Parameters:
    soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content.

    Returns:
    list: A list of topic URLs as strings.
    """
    # GitHub base URL (the hrefs on the topics page are relative paths)
    git_base_url = 'https://github.com'
    
    # CSS class for topic URLs
    all_url_selector_class = "no-underline flex-grow-0"
    # Find all <a> tags with the specified class
    all_anchor_tags = soup.find_all('a', class_=all_url_selector_class)
    
    all_url = []
    
    # Construct full URLs from relative paths
    for tag in all_anchor_tags:
        all_url.append(git_base_url + tag.get('href'))
    return all_url

In [16]:
def scrape_topics_page(url):
    """
    Scrapes the topics page and returns a DataFrame with topic titles, URLs, and descriptions.

    Parameters:
    url (str): The URL of the topics page to scrape.

    Returns:
    pd.DataFrame: A DataFrame containing the topic titles, URLs, and descriptions.
    """
    # Get the BeautifulSoup object for the topics page
    topic_soup = get_url_page(url)
    
    # Create a dictionary with the scraped data
    my_dict = {
        'Title': get_topic_titles(topic_soup),       # List of topic titles
        'URLs': get_topic_url(topic_soup),           # List of topic URLs
        'Description': get_topic_desc(topic_soup)    # List of topic descriptions
    }
    
    # Convert the dictionary to a pandas DataFrame and return it
    return pd.DataFrame(my_dict)

In [17]:
df = scrape_topics_page(github_topics_url)
df.head()

|   | Title | URLs | Description |
|---|-------|------|-------------|
| 0 | 3D | https://github.com/topics/3d | 3D refers to the use of three-dimensional grap... |
| 1 | Ajax | https://github.com/topics/ajax | Ajax is a technique for creating interactive w... |
| 2 | Algorithm | https://github.com/topics/algorithm | Algorithms are self-contained sequences that c... |
| 3 | Amp | https://github.com/topics/amphp | Amp is a non-blocking concurrency library for ... |
| 4 | Android | https://github.com/topics/android | Android is an operating system built by Google... |


#### This is how the dataframe appears after the above operations
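
One thing to keep in mind: `pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if, say, one topic on the page had no description. Below is a minimal sketch of a pre-check. The helper `check_equal_lengths` and the sample data are hypothetical and not part of the original code.

```python
def check_equal_lengths(columns):
    """Return the common length of all lists in `columns`, or raise ValueError.

    `columns` maps column names to lists, like the dictionary built in
    scrape_topics_page. (Hypothetical helper, for illustration only.)
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dictionary has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)


# Made-up sample data in the same shape as the scraped dictionary
sample = {
    "Title": ["3D", "Ajax"],
    "URLs": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
    "Description": ["First description", "Second description"],
}
print(check_equal_lengths(sample))
```

Calling such a check before `pd.DataFrame(my_dict)` would turn a silent misalignment into an error message that names the column lengths.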

## Getting info from the topics page URLs

In [20]:
def get_stars_count(span):
    """
    Converts a star count string (e.g., '3.5k') to an integer.

    Parameters:
    span (Tag): The span tag containing the star count text.

    Returns:
    int: The star count as an integer.
    """
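    # Read the text, e.g. '3.5k' or '856'; a trailing 'k' means thousands
    text = span.getText().strip().lower()
    if text.endswith('k'):
        # round() avoids off-by-one results from floating-point multiplication
        return round(float(text[:-1]) * 1000)
    # Counts below 1,000 have no suffix and must not be multiplied
    return int(text.replace(',', ''))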
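
To make the arithmetic concrete, here is a small sketch of the conversion applied to plain strings rather than tags, including counts that have no `k` suffix. The helper name `parse_star_count` is hypothetical, and the assumption that counts appear either as a plain number or with a `k` suffix is based only on the values seen in this project.

```python
def parse_star_count(text):
    """Convert a star count string such as '65.8k' or '856' to an integer.

    Hypothetical helper for illustration; assumes the only suffix is 'k'.
    """
    text = text.strip().lower().replace(",", "")
    if text.endswith("k"):
        # '65.8k' -> 65.8 * 1000; round() guards against floating-point error
        return round(float(text[:-1]) * 1000)
    return int(text)


for raw in ["65.8k", "3.5k", "856", " 1,234 "]:
    print(raw.strip(), "->", parse_star_count(raw))
```

Note that multiplying every value by 1000 without checking for the suffix would turn `856` into `856000`.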

In [21]:
def get_repo_info(h3_tags, star_tags):
    """
    Extracts repository information including the username, repository name, URL, and star count.

    Parameters:
    h3_tags (Tag): The h3 tag containing the repository's username and name.
    star_tags (Tag): The span tag containing the star count for the repository.

    Returns:
    tuple: A tuple containing the username, repository name, repository URL, and star count.
    """
    # Extract all <a> tags within the h3 tag
    all_a = h3_tags.find_all('a')
    
    # Extract and clean the username and repository name
    username = all_a[0].text.strip()
    repo = all_a[1].text.strip()
    
    # Construct the full repository URL
    repo_url = git_base_url + all_a[1].get('href')
    
    # Get the star count for the repository
    stars = get_stars_count(star_tags)
    
    return username, repo, repo_url, stars

In [22]:
def get_topic_repos(topic_soup):
    """
    Extracts repository information for all repositories listed under a topic.

    Parameters:
    topic_soup (BeautifulSoup): The BeautifulSoup object containing the parsed HTML content of the topic page.

    Returns:
    pd.DataFrame: A DataFrame containing the repository information, including username, repository name, URL, and star count.
    """
    # CSS class for selecting <h3> tags that contain the repository names and usernames
    h3_selector = "f3 color-fg-muted text-normal lh-condensed"
    # Find all <h3> tags with the specified class
    all_h3 = topic_soup.find_all('h3', class_=h3_selector)
    
    # Find all <span> tags that contain the star count
    all_star_spans = topic_soup.find_all('span', id="repo-stars-counter-star")
    
    # Initialize a dictionary to store repository information
    repo_info = {
        'username': [],
        'reponame': [],
        'repoURL': [],
        'no_of_stars': []
    }
    
    # Loop through each repository, pairing its <h3> tag with its star-count <span>.
    # get_repo_info is called once per repository and its result is unpacked.
    for h3_tag, star_span in zip(all_h3, all_star_spans):
        username, reponame, repo_url, stars = get_repo_info(h3_tag, star_span)
        repo_info['username'].append(username)
        repo_info['reponame'].append(reponame)
        repo_info['repoURL'].append(repo_url)
        repo_info['no_of_stars'].append(stars)

    # Convert the dictionary to a pandas DataFrame and return it
    return pd.DataFrame(repo_info)

In [23]:
def scrape_topics(topic_url, path):
    """
    Scrapes repository information for a topic and saves it to a CSV file.

    Parameters:
    topic_url (str): The URL of the topic page to scrape.
    path (str): The file path where the CSV file will be saved.

    Returns:
    None
    """
    # Check if the CSV file already exists
    if os.path.exists(path):
        print(f"The {path} file already exists. Skipping.... ")
        return
    
    # Get the repository information and save it to a CSV file
    topic_df = get_topic_repos(get_url_page(topic_url))
    topic_df.to_csv(path, index=False)

In [24]:
soup = get_url_page(df['URLs'][12])
get_topic_repos(soup).head()

|   | username | reponame | repoURL | no_of_stars |
|---|----------|----------|---------|-------------|
| 0 | bregman-arie | devops-exercises | https://github.com/bregman-arie/devops-exercises | 65800 |
| 1 | microsoft | generative-ai-for-beginners | https://github.com/microsoft/generative-ai-for... | 61700 |
| 2 | pulumi | pulumi | https://github.com/pulumi/pulumi | 20900 |
| 3 | recommenders-team | recommenders | https://github.com/recommenders-team/recommenders | 18800 |
| 4 | danny-avila | LibreChat | https://github.com/danny-avila/LibreChat | 17300 |


In [25]:
df['URLs'].head() # Check the first five URLs

0           https://github.com/topics/3d
1         https://github.com/topics/ajax
2    https://github.com/topics/algorithm
3        https://github.com/topics/amphp
4      https://github.com/topics/android
Name: URLs, dtype: object

In [26]:
scrape_topics(df['URLs'][5],df['Title'][5])

The Angular file already exists. Skipping.... 


In [27]:
def scrape_topic_repos(url):
    """
    Scrapes the list of topics from the provided GitHub topics URL and their respective top repositories,
    saving the data to CSV files.

    Parameters:
    url (str): The URL of the GitHub topics page to scrape.

    Returns:
    None
    """
    # Inform the user that the scraping process has started
    print("Scraping list of topics from GitHub topics:")

    # Scrape the list of topics and store them in a DataFrame
    topics_df = scrape_topics_page(url)

    # Create a directory named 'Data' to store the CSV files, if it doesn't exist
    os.makedirs('Data', exist_ok=True)

    # Iterate through each row in the DataFrame (each topic)
    for index, row in topics_df.iterrows():
        # Inform the user about the topic currently being scraped
        print(f"Scraping top repositories for '{row['Title']}' ")

        # Scrape the repositories for the current topic and save them to a CSV file
        scrape_topics(row['URLs'], f"Data/{row['Title']}.csv" )

    # Inform the user that the scraping process is complete
    print("Done Scraping.")

## The function above is the main entry point of the project. It does the following:
* Starting message: notifies the user that the scraping process has started.
* Scraping topics: scrapes the list of topics and stores it in a DataFrame.
* Creating directory: creates the `Data` directory to store the CSV files, if it does not already exist.
* Iterating through topics: for each topic, scrapes its top repositories and saves them to a CSV file, skipping topics whose file already exists.
* Completion message: reports that the scraping process is complete.
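
One detail worth watching in the iteration step: the topic title is used directly as the file name, so a title containing a character such as `/` would produce an invalid or unintended path. Below is a minimal sketch of one way to guard against that. The helper `safe_filename` and the title `Made/Up Topic` are hypothetical and not part of the original code.

```python
import os
import re


def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores.

    Hypothetical helper for illustration; keeps letters, digits, dots,
    hyphens, underscores, plus signs and hash signs.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._+#-]+", "_", title.strip())
    # Fall back to a placeholder if nothing usable is left
    return cleaned.strip("_") or "untitled"


for title in ["ASP.NET", "C++", "Made/Up Topic"]:
    print(os.path.join("Data", safe_filename(title) + ".csv"))
```

In `scrape_topic_repos`, the path could then be built as `f"Data/{safe_filename(row['Title'])}.csv"`.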

In [48]:
scrape_topic_repos(github_topics_url) # An example of the execution

Scraping list of topics from GitHub topics:
Scraping top repositories for '3D' 
The Data/3D.csv file already exists. Skipping.... 
Scraping top repositories for 'Ajax' 
The Data/Ajax.csv file already exists. Skipping.... 
Scraping top repositories for 'Algorithm' 
The Data/Algorithm.csv file already exists. Skipping.... 
Scraping top repositories for 'Amp' 
The Data/Amp.csv file already exists. Skipping.... 
Scraping top repositories for 'Android' 
The Data/Android.csv file already exists. Skipping.... 
Scraping top repositories for 'Angular' 
The Data/Angular.csv file already exists. Skipping.... 
Scraping top repositories for 'Ansible' 
The Data/Ansible.csv file already exists. Skipping.... 
Scraping top repositories for 'API' 
The Data/API.csv file already exists. Skipping.... 
Scraping top repositories for 'Arduino' 
The Data/Arduino.csv file already exists. Skipping.... 
Scraping top repositories for 'ASP.NET' 
The Data/ASP.NET.csv file already exists. Skipping.... 
Scraping top r