# Scraping Top Repositories for Topics on GitHub

- Introduction to Web Scraping.
```
Web scraping is like having a digital detective that fetches information from websites. Imagine you want to gather a bunch of news articles about your favorite topic. Instead of reading them one by one, web scraping automates this process. It's like a smart robot that visits websites, collects specific details you want, and brings them back to you in a neat pile. This is useful for getting data from various websites quickly, like comparing prices or tracking changes. Just like you'd use a magnifying glass to inspect something up close, web scraping lets you zoom in on web content and organize it for your needs.
```
- Introduction to GitHub.
```
GitHub is like a digital playground where people work together on computer code. Think of it as a shared sandbox where each person can build and improve their own sandcastles, called projects. They can see what others are building, borrow ideas, and even help fix problems. GitHub tracks changes like a time machine, so if someone adds something cool or fixes a mistake, you can see exactly what they did. It's like having a team of friends working on a puzzle together. When the puzzle is complete, everyone gets to celebrate their masterpiece!
```


**Here are the steps we'll follow:**

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
username,repo_name,stars,repo_url
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
- The tools we'll use are:
  1. Python
  2. requests
  3. Beautiful Soup
  4. pandas
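
Before running the notebook, it can help to confirm that the three third-party packages are installed. The sketch below is one way to do that with the standard library. The helper name `missing_packages` is hypothetical, and the code assumes the usual import names `requests`, `bs4` and `pandas`.

```python
from importlib.util import find_spec

def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]

# Import names of the libraries used in this project (bs4 is Beautiful Soup)
required = ['requests', 'bs4', 'pandas']
missing = missing_packages(required)
if missing:
    print('Missing packages: ' + ', '.join(missing))
else:
    print('All required packages are available')
```

Note that Beautiful Soup is imported as `bs4` but is published on PyPI as `beautifulsoup4`.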

## Scrape the list of topics from GitHub

- Use requests to download the page
- Use Beautiful Soup to parse it and extract information
- Convert the result into a pandas DataFrame

Let's write a function to download the page.

In [6]:
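import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_topic_page():
    """Download the GitHub topics page and return it parsed by Beautiful Soup."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Anything other than HTTP 200 means the page did not load correctly
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')
    return doc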

In [7]:
# Download the topics page and keep the parsed document in a variable
doc = get_topic_page()

Let's create some helper functions to parse information from the page<br>
**To get topic titles, we can pick `p` tags with the `class`...**

<img src = 'h3_tag.png'>

In [9]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

`get_topic_titles` can be used to get the list of titles

In [15]:
titles = get_topic_titles(doc)

In [17]:
# Let's see first 5 titles
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
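
One detail worth knowing: when `find_all` is given a class string containing several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the order of the names matters. The sketch below illustrates this on a small hand-written HTML snippet (an assumption for illustration, not GitHub's real markup) and shows a CSS selector via `select` as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup for illustration only; GitHub's real HTML is larger
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="mt-1 f3 lh-condensed mb-0 Link--primary">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">A description</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact string match: only the tag whose class attribute is written in this order
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
exact_titles = [tag.text for tag in exact]

# CSS selector: matches any <p> carrying both classes, regardless of order
flexible = sample_doc.select('p.f3.lh-condensed')
flexible_titles = [tag.text for tag in flexible]

print(exact_titles)
print(flexible_titles)
```

If a helper suddenly returns an empty list, a changed class order or a renamed class on the live page is a likely cause.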

**To get topic description, we can pick `p` tags with the `class`...**
<img src = 'desc_tag.png'>

In [10]:
def get_topic_descriptions(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class': desc_selector})
    
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

`get_topic_descriptions` can be used to get the list of topic descriptions

In [18]:
descriptions = get_topic_descriptions(doc)

In [19]:
# Let's look at the first 5 topic descriptions
descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']
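
`strip()` removes whitespace only at the ends of a string. If a description ever contains line breaks or repeated spaces in the middle, a small normalising helper can tidy it. This is an optional sketch: `clean_text` is a hypothetical helper that is not part of the original notebook.

```python
def clean_text(raw_text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return ' '.join(raw_text.split())

sample = '\n      Ajax is a technique for\n   creating interactive web applications.\n'
print(clean_text(sample))
```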

**To get topic urls, we can pick `a` tags with the `class`...**
<img src = 'url_tag.png'>

In [11]:
def get_topic_urls(doc):
    topic_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': topic_selector})
    
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls

`get_topic_urls` can be used to get the URL of each topic

In [20]:
topic_urls = get_topic_urls(doc)

In [21]:
# Let's see first 5 urls
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']
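
String concatenation works as long as every `href` is a relative path such as `/topics/3d`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The helper name `to_absolute_url` is illustrative and not part of the original code.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    """Resolve a link found on the page against the site's base URL."""
    return urljoin(base_url, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```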

Let's put this all together into a single function

In [22]:
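def scrape_topics():
    """Scrape the topics page and return a DataFrame of titles, descriptions and URLs."""
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))

    doc = BeautifulSoup(response.text, 'html.parser')

    # Each helper returns one list; together they form the columns of the table
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descriptions(doc),
        'url': get_topic_urls(doc)
    }

    # pandas must already be imported as pd before this function is called
    return pd.DataFrame(topics_dict)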
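
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if one of the class selectors stops matching after a change to the page layout. Below is a minimal sketch of a pre-check that reports which column is off. `check_column_lengths` is a hypothetical helper, not part of the original code.

```python
def check_column_lengths(columns):
    """Raise a descriptive error if the lists in `columns` differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Example with one description deliberately left out
sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['Ajax is a technique for creating interactive web applications.'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

try:
    check_column_lengths(sample_columns)
except ValueError as error:
    print(error)
```

Calling it on `topics_dict` just before building the DataFrame would turn a generic pandas error into a message that names the short column.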

## Get the top 25 repositories from a topic page
- Use requests to download a topic page
- Use Beautiful Soup to parse it and extract the information

In [23]:
def get_topic_page(topic_url):
    # Note: this replaces the earlier zero-argument get_topic_page defined above
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
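
As written, a single failed request stops the whole run. One possible refinement, sketched below under assumptions, is to retry a few times with a growing pause. `fetch_with_retries` and its parameters are hypothetical. It accepts any function that downloads a page and raises an exception on failure, such as `get_topic_page(topic_url)`.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url), retrying on any exception with a doubling pause."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            if attempt == attempts:
                raise  # out of retries: let the caller see the last error
            print('Attempt {} failed ({}), retrying...'.format(attempt, error))
            time.sleep(delay_seconds)
            delay_seconds *= 2

# Demonstration with a stand-in fetch function that fails twice, then succeeds
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'page content for {}'.format(url)

result = fetch_with_retries(flaky_fetch, 'https://github.com/topics/3d', delay_seconds=0.01)
print(result)
```

In the scraper itself the call would look like `fetch_with_retries(get_topic_page, topic_url)`.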

In [24]:
doc = get_topic_page('https://github.com/topics/3d')

In [25]:
def get_repo_info(h3_tag, star_tag):
    # Return the username, repo name, star count and URL for one repository
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url
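
The username and repository name also appear in the repository link itself, which offers a simple cross-check on the values read from the two `a` tags. The sketch assumes hrefs of the form `/username/repo`, as the repo URLs in this notebook suggest. `split_repo_path` is a hypothetical helper.

```python
def split_repo_path(href):
    """Split a path like '/mrdoob/three.js' into (username, repo_name)."""
    parts = href.strip('/').split('/')
    if len(parts) != 2:
        raise ValueError('Expected /username/repo, got {!r}'.format(href))
    return parts[0], parts[1]

username, repo_name = split_repo_path('/mrdoob/three.js')
print(username, repo_name)
```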

In [28]:
# Convert a star count string such as '93.8k' into an integer
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() guards against floating-point error before converting to int
        return int(round(float(stars_str[:-1]) * 1000))

    return int(stars_str)
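
`parse_star_count` handles plain numbers and the `k` suffix. If other formats ever appear, for instance an `m` suffix or thousands separators (both are assumptions here, not something observed in this notebook), a table of multipliers keeps the logic compact. `parse_count` is a hypothetical variant. It uses `Decimal` so that values like `93.8k` convert without floating-point rounding.

```python
from decimal import Decimal

# Suffix multipliers; 'm' is an assumption and may never appear on GitHub
MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_count(count_str):
    """Convert strings such as '93.8k', '1,234' or '570' into an integer."""
    text = count_str.strip().lower().replace(',', '')
    multiplier = MULTIPLIERS.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting
    return int(Decimal(text) * multiplier)

for example in ['93.8k', '1,234', '570', '1.2m']:
    print(example, '->', parse_count(example))
```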

In [26]:
def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class' : 'Counter js-social-count'})
    
    # Get repo info
    
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

In [30]:
# Use pandas to make a dataframe for the given topic (3d) repos
import pandas as pd
get_topic_repos(doc)

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 93800 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 23500 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21800 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21200 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 17600 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 15900 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15600 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 14800 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 10800 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 10000 | https://github.com/metafizzy/zdog |


In [34]:
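import os

def scrape_topic(topic_url, path):
    # Skip topics whose CSV file was already written by an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # Download the topic page, extract its repositories and save them as CSV
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)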
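
The existence check makes the scraper resumable: if a run is interrupted, running it again only downloads the topics that are still missing. The sketch below demonstrates the same pattern in isolation with a temporary folder. `write_once` and `produce_text` are hypothetical names standing in for `scrape_topic` and the download step.

```python
import os
import tempfile

def write_once(path, produce_text):
    """Write produce_text() to path unless the file already exists."""
    if os.path.exists(path):
        return False  # skipped: an earlier run already created the file
    with open(path, 'w', encoding='utf-8') as output_file:
        output_file.write(produce_text())
    return True

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, '3D.csv')
    first = write_once(csv_path, lambda: 'username,repo_name,stars,repo_url\n')
    second = write_once(csv_path, lambda: 'this text is never written\n')
    print(first, second)
```

One consequence of this design is that stale files are never refreshed, so deleting the `data` folder is the way to force a full re-scrape.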

## Putting everything together
- We have a function to get the list of topics
- We have a function to create a CSV file for scraped repos from a topics page
- Let's create a function to put them together

In [35]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    # Create the output folder if it does not already exist
    os.makedirs('data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
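
File names here come from topic titles, which can contain spaces (for example `Awesome Lists`). One alternative, shown as a sketch, is to name each file after the last segment of the topic URL, which is already URL-safe. `topic_slug` is a hypothetical helper that is not part of the original notebook.

```python
from urllib.parse import urlparse

def topic_slug(topic_url):
    """Return the last path segment of a topic URL, e.g. 'amphp'."""
    path = urlparse(topic_url).path
    return path.rstrip('/').rsplit('/', 1)[-1]

for url in ['https://github.com/topics/3d', 'https://github.com/topics/amphp']:
    print('data/{}.csv'.format(topic_slug(url)))
```

Adopting this would change the file names, so any later `pd.read_csv` call would need the slug-based name instead of the title-based one.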

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [36]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top

In [37]:
# Let's check that the data was scraped and saved as CSV files
# Take the 3D topic as an example
pd.read_csv('data/3D.csv')

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 93800 | https://github.com/mrdoob/three.js |
| 1 | pmndrs | react-three-fiber | 23500 | https://github.com/pmndrs/react-three-fiber |
| 2 | libgdx | libgdx | 21800 | https://github.com/libgdx/libgdx |
| 3 | BabylonJS | Babylon.js | 21200 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 17600 | https://github.com/ssloy/tinyrenderer |
| 5 | lettier | 3d-game-shaders-for-beginners | 15900 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | aframevr | aframe | 15600 | https://github.com/aframevr/aframe |
| 7 | FreeCAD | FreeCAD | 14800 | https://github.com/FreeCAD/FreeCAD |
| 8 | CesiumGS | cesium | 10800 | https://github.com/CesiumGS/cesium |
| 9 | metafizzy | zdog | 10000 | https://github.com/metafizzy/zdog |


In [38]:
# Let's check another topic: Chrome
pd.read_csv('data/Chrome.csv')

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | electron | electron | 109000 | https://github.com/electron/electron |
| 1 | puppeteer | puppeteer | 84200 | https://github.com/puppeteer/puppeteer |
| 2 | microsoft | playwright | 53800 | https://github.com/microsoft/playwright |
| 3 | FiloSottile | mkcert | 42100 | https://github.com/FiloSottile/mkcert |
| 4 | iamadamdev | bypass-paywalls-chrome | 39600 | https://github.com/iamadamdev/bypass-paywalls-... |
| 5 | jaywcjlove | linux-command | 24600 | https://github.com/jaywcjlove/linux-command |
| 6 | ovity | octotree | 22500 | https://github.com/ovity/octotree |
| 7 | segment-boneyard | nightmare | 19400 | https://github.com/segment-boneyard/nightmare |
| 8 | checkly | headless-recorder | 14800 | https://github.com/checkly/headless-recorder |
| 9 | google | WebFundamentals | 13800 | https://github.com/google/WebFundamentals |
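
Once the files exist, they can also be inspected without pandas. The sketch below reads CSV text with the standard library's `csv.DictReader` and finds the most starred repository. To stay self-contained, it parses an in-memory string holding the first three rows of the Chrome output above instead of opening `data/Chrome.csv`.

```python
import csv
import io

# First three rows of the Chrome results shown above
csv_text = '''username,repo_name,stars,repo_url
electron,electron,109000,https://github.com/electron/electron
puppeteer,puppeteer,84200,https://github.com/puppeteer/puppeteer
microsoft,playwright,53800,https://github.com/microsoft/playwright
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))
# Values are read as strings, so convert stars before comparing
top_repo = max(rows, key=lambda row: int(row['stars']))
print(top_repo['repo_name'], top_repo['stars'])
```

Replacing `io.StringIO(csv_text)` with an opened file handle applies the same logic to a CSV file on disk.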


## References and Future work

**Summary:**
- We scraped https://github.com/topics.
- We used the requests and bs4 libraries to scrape the list of topics.
- From that page we extracted each topic's title, description and URL (https://github.com/topics/ followed by the topic name).
- We then used requests and Beautiful Soup again to extract the top repositories from each topic URL.
- We extracted the username, repo name, stars and repo URL for each repository in a topic.
- Finally, we saved the results for each topic as a CSV file.

**References:**
- Jovian web scraping tutorial: https://www.youtube.com/live/RKsLLG-bzEY?feature=share
- Jovian web scraping project guide: https://jovian.com/aakashns/python-web-scraping-project-guide

**Future Work:**
- **Weather Data Tracker:** Scrape weather data from various sources to create a weather tracker. Users can input a location, and the scraper fetches current weather conditions, forecasts, and historical data.
- **eCommerce Price Tracker:** Build a scraper that monitors the prices of products on eCommerce websites. Users can input products they're interested in, and the tool sends alerts when prices drop.
- **Social Media Sentiment Analyzer:** Scrape social media platforms for posts related to a specific topic or brand. Use sentiment analysis to gauge public sentiment and present the results visually.
- **Recipe Aggregator and Analyzer:** Scrape cooking websites to gather recipes. Users can search for recipes based on ingredients, dietary preferences, or cuisine types. Include features like nutritional information and user ratings.
- **Stock Market Data Collector:** Create a scraper that collects stock market data like stock prices, historical data, and news related to specific companies. Use this data to provide insights and trends.
- **Travel Destination Recommender:** Scrape travel websites for information on different travel destinations. Provide users with recommendations based on factors like budget, activities, and weather conditions.