## GitHub Topics Scrapper

In [166]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [167]:
topics_url = 'https://github.com/topics'

In [168]:
response = requests.get(topics_url)

In [169]:
page_contents = response.text

In [170]:
doc = BeautifulSoup(page_contents, 'html.parser')

## Scraping Titles

In [171]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'

topic_title_tag = doc.find_all('p', {'class': selection_class } )

In [172]:
len(topic_title_tag)

30

In [173]:
topic_title_tag[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

## Scraping Descriptions

In [174]:
selection_class2 = 'f5 color-text-secondary mb-0 mt-1'

topic_description_tag = doc.find_all('p', {'class': selection_class2})

In [175]:
topic_description_tag[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

## Scraping URLs

In [176]:
selection_class3 = 'd-flex no-underline'

topic_url_tag =  doc.find_all('a', {'class': selection_class3})

#To get Url:
#topic0_url = 'https://github.com' + topic_url[0]['href']
#print(topic0_url)

## Appending Tags to Lists

In [177]:
topic_title = []

for tag in topic_title_tag:
    topic_title.append(tag.text)
    
print(topic_title[0])

3D


In [178]:
topic_description = []

for tag in topic_description_tag:
    topic_description.append(tag.text.strip())
    
print(topic_description[0])

3D modeling is the process of virtually developing the surface and structure of a 3D object.


In [179]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_url_tag:
    topic_urls.append(base_url+tag['href'])
    

print(topic_urls[0])

https://github.com/topics/3d


## Transform Lists into Dictionary, then into DataFrames.

In [180]:
topics_dict = {'title': topic_title, 'description': topic_description, 'URL': topic_urls}

In [181]:
topics_df = pd.DataFrame(topics_dict)

topics_df

Unnamed: 0,title,description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


## Creating CSV file with the extracted information

In [182]:
topics_df.to_csv('topics.csv', index = None)

## Getting information out of Topic Page

In [183]:
topic_page_url = topic_urls[0]

In [184]:
response = requests.get(topic_page_url)

In [185]:
response.status_code

200

In [186]:
len(response.text)

619849

In [187]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [188]:
h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'

repo_tags = topic_doc.find_all('h3', {'class': h1_selection_class})

In [189]:
len(repo_tags)

30

In [190]:
repo_tags[0]

<h3 class="f3 color-text-secondary text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904

In [191]:
a_tags = repo_tags[0].find_all('a')

In [192]:
a_tags[0].text.strip()

'mrdoob'

In [193]:
a_tags[1].text.strip()

'three.js'

In [194]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [195]:
star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})

In [196]:
star_tags[0].text.strip()

'73.6k'

### Defining Function to Transform from 73.5k format to 73500

In [197]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

### Defining Function to get Information from the Repository 

In [198]:
def get_repo_info(h1_tag, star_tag):
    #Returns all required information about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [199]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 73600, 'https://github.com/mrdoob/three.js')

### Append the Information to Dictionary

In [200]:
topic_repos_dict = {'username': [],'repo_name': [],'stars': [],'repo_url': []}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

In [201]:
topic_repos_dict

{'username': ['mrdoob',
  'libgdx',
  'pmndrs',
  'BabylonJS',
  'aframevr',
  'ssloy',
  'lettier',
  'FreeCAD',
  'metafizzy',
  'CesiumGS',
  'timzhang642',
  'a1studmuffin',
  'isl-org',
  'spritejs',
  'tensorspace-team',
  'jagenjo',
  'YadiraF',
  'AaronJackson',
  'domlysz',
  'openscad',
  'ssloy',
  'mosra',
  'google',
  'blender',
  'gfxfundamentals',
  'cleardusk',
  'jasonlong',
  'rg3dengine',
  'antvis',
  'cnr-isti-vclab'],
 'repo_name': ['three.js',
  'libgdx',
  'react-three-fiber',
  'Babylon.js',
  'aframe',
  'tinyrenderer',
  '3d-game-shaders-for-beginners',
  'FreeCAD',
  'zdog',
  'cesium',
  '3D-Machine-Learning',
  'SpaceshipGenerator',
  'Open3D',
  'spritejs',
  'tensorspace',
  'webglstudio.js',
  'PRNet',
  'vrn',
  'BlenderGIS',
  'openscad',
  'tinyraytracer',
  'magnum',
  'model-viewer',
  'blender',
  'webgl-fundamentals',
  '3DDFA',
  'isometric-contributions',
  'rg3d',
  'L7',
  'meshlab'],
 'stars': [73600,
  18800,
  14700,
  14700,
  13000,
  1

### Transform into DataFrame

In [202]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [203]:
topic_repos_df

# These are the top 30 repositories for the topid 3D... Now I want do the same for all scraped topics

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,73600,https://github.com/mrdoob/three.js
1,libgdx,libgdx,18800,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,14700,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,14700,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13000,https://github.com/aframevr/aframe
5,ssloy,tinyrenderer,11100,https://github.com/ssloy/tinyrenderer
6,lettier,3d-game-shaders-for-beginners,10900,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,9700,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,8600,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,7400,https://github.com/CesiumGS/cesium


In [204]:
#
# CLEANING UP
#



# parse_star_count:
# Takes str input in the format 76.5K and returns int 765000
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1])*1000)
    return int(stars_str)

# get_topic_page:
# Given a Topic Url it's going to try to download the page. 
# Give me an error if it was not succesfull. 
# Parse Using BeautifulSoup
def get_topic_page(topic_url):
    # Download the Page
    response = requests.get(topic_url)
    
    # Check Download Response
    if response.status_code != 200:
        raise Exception(f'Falied to load page{topic_url}')
    
    #Parse Using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

# get_repo_info:
# Given h1_tag and star_tag returns Username, Repository Name, Repository Stars, and URL.
def get_repo_info(h1_tag, star_tag):
    #Returns all required information about a repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

# get_topic_repos:
# Gets the h3 tags that we need
# Creates Dictionary
# Inputs repository info into dictionary
# Returns dataframe
def get_topic_repos(topic_doc):  
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    
    # Get star tags
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})
    
    # Get repository info
    topic_repos_dict = {'username': [],'repo_name': [],'stars': [],'repo_url': []}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    #Transforming into dataframe
    return pd.DataFrame(topic_repos_dict)


### Testing

In [205]:
url4 = topic_urls[4]

In [206]:
topic4_doc = get_topic_page(url4)

In [207]:
topic4_repos = get_topic_repos(topic4_doc)

In [208]:
topic4_repos

Unnamed: 0,username,repo_name,stars,repo_url
0,flutter,flutter,128000,https://github.com/flutter/flutter
1,justjavac,free-programming-books-zh_CN,82200,https://github.com/justjavac/free-programming-...
2,Genymobile,scrcpy,53200,https://github.com/Genymobile/scrcpy
3,Hack-with-Github,Awesome-Hacking,45500,https://github.com/Hack-with-Github/Awesome-Ha...
4,google,material-design-icons,43600,https://github.com/google/material-design-icons
5,wasabeef,awesome-android-ui,41100,https://github.com/wasabeef/awesome-android-ui
6,square,okhttp,40600,https://github.com/square/okhttp
7,android,architecture-samples,39300,https://github.com/android/architecture-samples
8,square,retrofit,38500,https://github.com/square/retrofit
9,Solido,awesome-flutter,37000,https://github.com/Solido/awesome-flutter


In [209]:
# In one Line:

get_topic_repos(get_topic_page(topic_urls[5]))

Unnamed: 0,username,repo_name,stars,repo_url
0,justjavac,free-programming-books-zh_CN,82200,https://github.com/justjavac/free-programming-...
1,angular,angular,75800,https://github.com/angular/angular
2,storybookjs,storybook,64099,https://github.com/storybookjs/storybook
3,ionic-team,ionic-framework,45100,https://github.com/ionic-team/ionic-framework
4,leonardomso,33-js-concepts,43100,https://github.com/leonardomso/33-js-concepts
5,prettier,prettier,40300,https://github.com/prettier/prettier
6,SheetJS,sheetjs,27000,https://github.com/SheetJS/sheetjs
7,angular,angular-cli,24800,https://github.com/angular/angular-cli
8,angular,components,21700,https://github.com/angular/components
9,NativeScript,NativeScript,20400,https://github.com/NativeScript/NativeScript


In [210]:
topic_urls[5]

'https://github.com/topics/angular'

### Saving to CSV

In [211]:
get_topic_repos(get_topic_page(topic_urls[5])).to_csv('angular.csv', index = None)

## Writing function to :

### 1- Get the list of topics from the topics page
### 2- Get the list of top repos from the individual topic pages
### 3- For each topic, create a CSV of the top repos for the topic

In [212]:
def get_topic_titles(doc):
    # Selecting Topic Title 
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tag = doc.find_all('p', {'class': selection_class } )
    # Appending Topic Titles to List
    topic_title = []
    for tag in topic_title_tag:
        topic_title.append(tag.text)
    return topic_title

def get_topic_descriptions(doc):
    # Selecting Topic Description
    selection_class2 = 'f5 color-text-secondary mb-0 mt-1'
    topic_description_tag = doc.find_all('p', {'class': selection_class2})
    # Appending Topic Descriptions to List
    topic_description = []
    for tag in topic_description_tag:
        topic_description.append(tag.text.strip()) 
    return topic_description
    
def get_topic_urls(doc):
    #  Selecting Topic URL
    selection_class3 = 'd-flex no-underline'
    topic_url_tag =  doc.find_all('a', {'class': selection_class3})
    # Appending topic Url to List
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_url_tag:
        topic_urls.append(base_url+tag['href'])
    return topic_urls
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception(f'Failed to load page{topic_url}')
    # Transforming Lists into Dictionary
    topics_dict = {'title': get_topic_titles(doc), 
                   'description': get_topic_descriptions(doc), 
                   'URL': get_topic_urls(doc)}
    # Transforiming Dictionary into DF
    topics_df = pd.DataFrame(topics_dict)
    return pd.DataFrame(topics_dict)
    
    
    


In [213]:
scrape_topics()

Unnamed: 0,title,description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency framework fo...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet


In [217]:
# Tool to Write Topic URL, Topic Page, and Repositories to CSV file

#def scrape_topic(topic_url, topic_name):
#    # From topic Url we get Topic Page, From topic Page we get the repositories
#    topic_df = get_topic_repos(get_topic_page(topic_url))
#    # We then write all that in a CSV File
#    topic_df.to_csv(topic_name + '.csv', index=None)

In [225]:
# Now I'm adding a check on the scrape_topic function. it will not download csv files that already exist

import os

def scrape_topic(topic_url, topic_name):
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print(f'The file {fname} already exists. Skipping...')
        return
    # From topic Url we get Topic Page, From topic Page we get the repositories
    topic_df = get_topic_repos(get_topic_page(topic_url))
    # We then write all that in a CSV File
    topic_df.to_csv(fname, index=None)

In [222]:
def scrape_topics_repos():
    print('Scraping list of from GitHub')
    # we use the function scrape_topics to return the DataFrame with all our topics, their descriptions, and URL
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        title = row['title']
        URL = row['URL']
        print(f'Scraping top repositories for {title} from {URL}')
        scrape_topic(URL, f'data/{title}.csv')

In [226]:
scrape_topics_repos()

Scraping list of from GitHub
Scraping top repositories for 3D from https://github.com/topics/3d
Scraping top repositories for Ajax from https://github.com/topics/ajax
Scraping top repositories for Algorithm from https://github.com/topics/algorithm
Scraping top repositories for Amp from https://github.com/topics/amphp
The file data/Amp.csv.csv already exists. Skipping...
Scraping top repositories for Android from https://github.com/topics/android
The file data/Android.csv.csv already exists. Skipping...
Scraping top repositories for Angular from https://github.com/topics/angular
The file data/Angular.csv.csv already exists. Skipping...
Scraping top repositories for Ansible from https://github.com/topics/ansible
The file data/Ansible.csv.csv already exists. Skipping...
Scraping top repositories for API from https://github.com/topics/api
The file data/API.csv.csv already exists. Skipping...
Scraping top repositories for Arduino from https://github.com/topics/arduino
The file data/Arduino.