# scraping-github-topics-repo

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

Here are the steps we'll follow:

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:<br>
Repo Name,Username,Stars,Repo URL<br>
three.js,mrdoob,69700,https://github.com/mrdoob/three.js<br>
libgdx,libgdx,18300,https://github.com/libgdx/libgdx<br>
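
As a minimal sketch of that output format, the standard-library `csv` module can write the header and rows shown above. The helper name `write_repo_csv` is hypothetical (it is not part of the original notebook), and the two rows are the sample rows from the list.

```python
import csv
import io

# Column order taken from the sample format above
CSV_HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]


def write_repo_csv(rows, file_obj):
    """Write repository rows (name, username, stars, url) to an open text file."""
    writer = csv.writer(file_obj, lineterminator="\n")
    writer.writerow(CSV_HEADER)
    writer.writerows(rows)


# The two sample rows from the format description
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

buffer = io.StringIO()  # in-memory stand-in for a real file
write_repo_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")` instead of the `StringIO` buffer.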

## Use the requests library to download web pages

- Inspect the website's HTML source and identify the right URLs to download.
- Download and save web pages locally using the requests library.
- Create a function to automate downloading for different topics/search queries.
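
The last step could look roughly like the sketch below. The names `topic_page_url` and `download_page` are hypothetical, as are the `timeout` default and the output path. The topic slug is assumed to be the last path segment of a topic URL (for example `3d` in https://github.com/topics/3d).

```python
from pathlib import Path

TOPICS_BASE_URL = "https://github.com/topics"


def topic_page_url(topic_slug):
    """Build the URL of one topic page from its slug, e.g. '3d'."""
    return f"{TOPICS_BASE_URL}/{topic_slug.strip('/')}"


def download_page(url, out_path, timeout=10):
    """Download url with requests and save the HTML to out_path."""
    import requests  # imported here so the URL helper runs without requests installed

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
    Path(out_path).write_text(response.text, encoding="utf-8")
    return response.text


print(topic_page_url("3d"))
# Example call (needs network access):
# html = download_page(topic_page_url("3d"), "3d.html")
```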

In [1]:
%pip install requests --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
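import requests

# Download the GitHub topics listing page
topics_url = "https://github.com/topics"
response = requests.get(topics_url)
response.raise_for_status()  # stop early if the request failed
page_contents = response.text
page_contents[:1000]  # preview the first 1000 characters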

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark"  data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-946902aac6a1.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-030e28cb8394.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="

Checkpoint: save page_contents to a JSON file so the page does not have to be downloaded again:

In [2]:
import json

# Define the file path to save the data
file_path = "page_contents.json"

# Save the page_contents as JSON
with open(file_path, 'w') as file:
    json.dump(page_contents, file)


To load the saved data and retrieve the page_contents:

In [8]:
import json

# Define the file path of saved data
file_path = "page_contents.json"

# Load the data from the JSON file
with open(file_path, 'r') as file:
    loaded_data = json.load(file)

# Retrieve the page_contents
page_contents = loaded_data
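
Because page_contents is a single string, the checkpoint could also be stored as a plain HTML file, which can then be opened directly in a browser. The following is a sketch of that alternative, not the notebook's method. The helper names and the file name `topics.html` are hypothetical, and a short stand-in string replaces the real page.

```python
import tempfile
from pathlib import Path


def save_html(html, path):
    """Write the page source to a UTF-8 text file."""
    Path(path).write_text(html, encoding="utf-8")


def load_html(path):
    """Read the page source back from disk."""
    return Path(path).read_text(encoding="utf-8")


# Round trip with a stand-in string; in the notebook this would be page_contents
sample_html = "<!DOCTYPE html>\n<html lang=\"en\"></html>"
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = Path(tmp_dir) / "topics.html"
    save_html(sample_html, checkpoint)
    restored = load_html(checkpoint)

print(restored == sample_html)
```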


## Use Beautiful Soup to parse and extract information

- Parse and explore the structure of downloaded web pages using Beautiful Soup.
- Use the right properties and methods to extract the required information.
- Create functions to extract from the page into lists and dictionaries.
- (Optional) Use a REST API to acquire additional information if required.
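
For the optional REST API step, one possibility is GitHub's repository endpoint, which returns JSON that includes a `stargazers_count` field. The sketch below uses the standard-library `urllib` rather than requests, and the helper names are hypothetical. Unauthenticated calls are rate-limited, so this is likely only practical for a small number of repositories.

```python
import json
from urllib.request import Request, urlopen


def repo_api_url(owner, name):
    """Build the GitHub REST API URL for one repository."""
    return f"https://api.github.com/repos/{owner}/{name}"


def fetch_repo_info(owner, name, timeout=10):
    """Fetch repository metadata as a dict (needs network access)."""
    request = Request(
        repo_api_url(owner, name),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def star_count(repo_info):
    """Read the exact star count from the API response."""
    return repo_info["stargazers_count"]


print(repo_api_url("mrdoob", "three.js"))
# Example call (needs network access):
# stars = star_count(fetch_repo_info("mrdoob", "three.js"))
```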

The code below parses an HTML page, extracts topic titles, descriptions, and links, and stores them in a DataFrame for further analysis or processing. It provides a structured representation of the topic information available on the page.

In [9]:
from bs4 import BeautifulSoup
import pandas as pd

# Parse the HTML contents
soup = BeautifulSoup(page_contents, 'html.parser')

# Find topic title tags
topic_title_tags = soup.find_all('p', class_='f3 lh-condensed mb-0 mt-1 Link--primary')

# Find topic description tags
topic_desc_tags = soup.find_all('p', class_="f5 color-fg-muted mb-0 mt-1")

# Find topic link tags
a_tags = soup.find_all('a', class_="no-underline flex-1 d-flex flex-column")
topic_link_tags = [a.get('href') for a in a_tags]

# Initialize lists to store topic details
topic_titles = []
topic_descriptions = []
topic_links = []

# Extract topic details and store them in lists
for title_tag, desc_tag, link_tag in zip(topic_title_tags, topic_desc_tags, topic_link_tags):
    topic_titles.append(title_tag.text)
    topic_descriptions.append(desc_tag.text.strip())
    topic_links.append("https://github.com" + link_tag)

# Create a dictionary to store the topic details
topics_dict = {
    'title': topic_titles,
    'description': topic_descriptions,
    'topic_url': topic_links
}

# Create a DataFrame from the topics dictionary
topics_df = pd.DataFrame(topics_dict)
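
One caveat with `zip` is that it stops at the shortest input, so if one of the CSS class selectors matches fewer tags than the others, topics are dropped silently. A minimal sketch of a guard follows. The function name `build_topic_records` is hypothetical, and the descriptions are placeholders rather than real page text.

```python
def build_topic_records(titles, descriptions, hrefs, base_url="https://github.com"):
    """Combine parallel lists into one dict per topic, refusing mismatched lengths."""
    lengths = {len(titles), len(descriptions), len(hrefs)}
    if len(lengths) != 1:
        raise ValueError(
            f"mismatched counts: {len(titles)} titles, "
            f"{len(descriptions)} descriptions, {len(hrefs)} links"
        )
    return [
        {"title": t.strip(), "description": d.strip(), "topic_url": base_url + h}
        for t, d, h in zip(titles, descriptions, hrefs)
    ]


# Placeholder inputs standing in for the text pulled from the parsed tags
records = build_topic_records(
    ["3D", "Ajax"],
    ["placeholder description 1", "placeholder description 2"],
    ["/topics/3d", "/topics/ajax"],
)
print(records[0]["topic_url"])
```

A list of dicts like this can be passed straight to `pd.DataFrame(records)`.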


In [27]:
topics_df

| | title | description | topic_url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


## Create a dataset of repositories for every topic

The code iterates over each topic URL, sends a GET request, and extracts information about the repositories such as owner, name, stars, and link. Then it creates a dataframe from the extracted data and appends it to the repo_dfs list.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

repo_dfs = []  # List to store the dataframes for each topic

# Iterate over each topic URL
for topic_url in topics_df['topic_url']:
    # Send a GET request to the topic URL
    response = requests.get(topic_url)
    topic_page_contents = response.text
    
    # Parse the HTML content using BeautifulSoup
    doc_repo = BeautifulSoup(topic_page_contents, 'html.parser')

    # Extract repository owner, name, and link
    repo_owner_nd_name = doc_repo.find_all('h3', class_="f3 color-fg-muted text-normal lh-condensed")
    repo_owner = []
    repo_name = []
    repo_link = []
    for repo in repo_owner_nd_name:
        text = repo.get_text(strip=True).split("/")
        repo_owner.append(text[0])
        repo_name.append(text[1])
        repo_link.append("https://github.com/" + text[0] + "/" + text[1])

    # Extract repository stars (e.g. "69.7k" -> 69700, "950" -> 950)
    star = doc_repo.find_all('span', class_="Counter js-social-count")
    repo_stars = []
    for s in star:
        text = s.get_text(strip=True).lower().replace(",", "")
        if text.endswith("k"):
            repo_stars.append(int(round(float(text[:-1]) * 1000)))
        else:
            repo_stars.append(int(text))

    # Create a dictionary with repository data
    repo_dict = {
        'repo_owner': repo_owner,
        'repo_name': repo_name,
        'repo_stars': repo_stars,
        'repo_link': repo_link
    }

    # Create a dataframe from the dictionary and append it to the list
    repo_df = pd.DataFrame(repo_dict)
    repo_dfs.append(repo_df)
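
The star conversion is easy to get wrong, so here it is in isolation. Removing the dot before multiplying would turn 69.7k into 697 × 1000 = 697000, so the sketch parses the number as a decimal first. The function name `parse_star_count` is hypothetical, and the sketch assumes labels are either plain integers or decimals with a trailing "k".

```python
def parse_star_count(label):
    """Convert a star label such as '69.7k' or '950' to an integer."""
    text = label.strip().lower().replace(",", "")
    if text.endswith("k"):
        # '69.7k' means 69.7 thousand; round() guards against float artifacts
        return round(float(text[:-1]) * 1000)
    return int(text)


for label in ["69.7k", "18.3k", "950"]:
    print(label, "->", parse_star_count(label))
```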


The code below iterates over each DataFrame in repo_dfs, each of which holds the repository information for one topic, and looks up the matching topic title in topics_df by position. It then writes each DataFrame to its own worksheet in repositories.xlsx, using the topic title as the sheet name.

In [26]:
import pandas as pd

# Write each topic's DataFrame to its own worksheet; the context manager
# saves and closes the file (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('repositories.xlsx', engine='xlsxwriter') as writer:
    # Iterate over each DataFrame in the repo_dfs list
    for index, df in enumerate(repo_dfs):
        topic_title = topics_df['title'][index]  # Get the corresponding topic title
        # Write the DataFrame to a worksheet with the topic title as the sheet name
        df.to_excel(writer, sheet_name=topic_title, index=False)
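
Excel restricts worksheet names: they may be at most 31 characters long and may not contain any of `[ ] : * ? / \`. Topic titles that break these rules would make the write fail, so a small sanitizer can help. This is a sketch covering only those main restrictions. The name `safe_sheet_name` and the `_` replacement character are hypothetical choices, and two different titles could still collapse to the same sheet name.

```python
import re

MAX_SHEET_NAME_LENGTH = 31  # Excel's limit for worksheet names
INVALID_SHEET_CHARS = re.compile(r"[\[\]:*?/\\]")  # characters Excel rejects


def safe_sheet_name(title):
    """Replace forbidden characters and truncate to Excel's length limit."""
    name = INVALID_SHEET_CHARS.sub("_", title).strip() or "Sheet"
    return name[:MAX_SHEET_NAME_LENGTH]


for title in ["ASP.NET", "a/b:c", "x" * 40]:
    print(repr(title), "->", repr(safe_sheet_name(title)))
```

In the loop that writes the worksheets, the result could be passed as `sheet_name=safe_sheet_name(topic_title)`.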
