Here are the steps we'll follow:

- We're going to scrape https://github.com/topics.
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:

```
Username,Repo Name,Stars count,Repo URL
mrdoob,three.js,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
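
The CSV layout above can also be produced with Python's standard `csv` module. This is a minimal sketch rather than the approach used later in this notebook (which writes files with pandas). The helper name `rows_to_csv` is hypothetical, and the two rows are the sample values shown above.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (username, repo_name, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Username", "Repo Name", "Stars count", "Repo URL"])
    writer.writerows(rows)
    return buffer.getvalue()

# Sample values copied from the format example above
sample_rows = [
    ("mrdoob", "three.js", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```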

In [1]:
# Importing required dependencies

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [3]:
URL = 'https://github.com/topics'
response = requests.get(URL)
response.status_code

200
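
Checking `status_code` by eye works in a notebook, but a script should fail loudly when the page does not load. Here is a minimal sketch, assuming a requests-style client is passed in. The helper name `fetch_html` and its `timeout` default are hypothetical; `raise_for_status()` is the method on a requests response that raises an error for 4xx/5xx status codes.

```python
def fetch_html(url, client, timeout=10):
    """Fetch `url` and return its body as text, raising on HTTP errors.

    `client` is assumed to expose a requests-style `get(url, timeout=...)`,
    for example the `requests` module itself or a `requests.Session()`.
    """
    response = client.get(url, timeout=timeout)
    # Raise on 4xx/5xx instead of silently returning an error page
    response.raise_for_status()
    return response.text

# Usage (needs network access and the requests package):
# import requests
# html = fetch_html("https://github.com/topics", requests)
```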

In [4]:
len(response.text)

152245

In [5]:
soup = BeautifulSoup(response.text, 'html.parser')

In [6]:
# name of topic
name_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
topic_names_tag = soup.find_all('p', class_ = name_class)

In [7]:
topic_names = []
for tag in topic_names_tag:
    topic_names.append(tag.text)
    
print(topic_names)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [8]:
# Description of topic
desc_class = "f5 color-fg-muted mb-0 mt-1"
topic_desc_tag = soup.find_all('p', class_ = desc_class) 

In [9]:
topic_descs = []
for tag in topic_desc_tag:
    topic_descs.append(tag.text.strip())

topic_descs

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [10]:
# link for topic
link_class = "no-underline flex-1 d-flex flex-column"
topic_link_tags = soup.find_all('a', class_ = link_class)

In [11]:
topic_links = []
for link in topic_link_tags:
    topic_links.append("https://github.com" + link['href'])
    
topic_links

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil
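
The links above are built by concatenating the site root with each `href`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also leaves already-absolute links untouched. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, href)

print(absolute_url("/topics/3d"))
print(absolute_url("https://github.com/topics/ajax"))
```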

In [12]:
# create a df combining all above
topic_df = {'topic_names': topic_names,
            'topic_descs': topic_descs,
            'topic_links': topic_links}

In [13]:
df = pd.DataFrame(topic_df)
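
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if one of the class selectors above misses an element. A small pre-check makes the mismatch easier to diagnose. This is only a sketch: the helper name `check_same_length` is hypothetical, and the data below is a shortened stand-in for `topic_names`, `topic_descs`, and `topic_links`.

```python
def check_same_length(columns):
    """Raise ValueError listing each column's length if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return columns

# Stand-in data with the same keys as the dictionary built above
example = {
    "topic_names": ["3D", "Ajax"],
    "topic_descs": ["3D modeling ...", "Ajax is ..."],
    "topic_links": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
check_same_length(example)
```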

In [14]:
df.head()

Unnamed: 0,topic_names,topic_descs,topic_links
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android


### Popular Repos for particular topics.

In [15]:
# Let's check an example topic
first_topic = topic_links[0]

In [16]:
# loading html page 
response = requests.get(first_topic)

In [17]:
len(response.text)

451083

In [18]:
# parsing page
soup2 = BeautifulSoup(response.text, 'html.parser')

In [19]:
# print(soup2.prettify())

We can pick the `h3` tags with the `class` ...,
which hold the repo username, repo name, and URL.

In [20]:
repo_tags = soup2.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})

In [21]:
user_tags = repo_tags[0].find_all('a')

In [22]:
user_tags[0].text.strip()

'mrdoob'

In [23]:
user_tags[1].text.strip()

'three.js'

In [24]:
base_url = "https://github.com"
base_url + user_tags[1]['href']

'https://github.com/mrdoob/three.js'

A `span` tag with the class `Counter js-social-count` gives us the total star count.

In [25]:

stars_tag = soup2.find_all('span', {'class':'Counter js-social-count'})

In [26]:
def star_tag(stars_tag):
    star_count = stars_tag['title'].replace(',','')
    return int(star_count)

In [27]:
star_tag(stars_tag[0])

85695
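
The `title` attribute read above holds the full count with thousands separators. Here is a sketch of a more tolerant parser, under the assumption that a count might also arrive abbreviated (for example `69.7k`). The helper name `parse_star_count` is hypothetical, and nothing above confirms that the page uses the abbreviated form in this attribute.

```python
from decimal import Decimal

def parse_star_count(text):
    """Convert strings such as '85,695' or '69.7k' to an int."""
    cleaned = text.strip().replace(",", "").lower()
    multiplier = 1
    if cleaned.endswith("k"):
        multiplier = 1000
        cleaned = cleaned[:-1]
    elif cleaned.endswith("m"):
        multiplier = 1_000_000
        cleaned = cleaned[:-1]
    # Decimal keeps the arithmetic exact for inputs like '69.7'
    return int(Decimal(cleaned) * multiplier)

print(parse_star_count("85,695"))
print(parse_star_count("69.7k"))
```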

**Writing a function to combine all of the above**

In [28]:
def repo_info(h3_tags, str_tag):
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = star_tag(str_tag)
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_url
    

In [29]:
repo_info(repo_tags[0], stars_tag[0])

('mrdoob', 'three.js', 85695, 'https://github.com/mrdoob/three.js')

In [30]:
star_tag(stars_tag[0])

85695

In [31]:
# we can get the remaining repositories.
topic_repo_dict = {'username': [],
         'repo name': [],
         'stars count': [],
         'repo url':[]}


for i in range(len(repo_tags)):
    repo = repo_info(repo_tags[i], stars_tag[i])
    topic_repo_dict['username'].append(repo[0])
    topic_repo_dict['repo name'].append(repo[1])
    topic_repo_dict['stars count'].append(repo[2])
    topic_repo_dict['repo url'].append(repo[3])
    

In [32]:
topic_df = pd.DataFrame(topic_repo_dict)

In [33]:
topic_df

Unnamed: 0,username,repo name,stars count,repo url
0,mrdoob,three.js,85695,https://github.com/mrdoob/three.js
1,libgdx,libgdx,20533,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,19727,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,18442,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,14763,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,14584,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,13750,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,12261,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9395,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,9299,https://github.com/CesiumGS/cesium


We got the popular repositories for the first topic (3D).

### Now we can get top repositories for each topic

Let's write some more functions to get repos for different topics and store them in a file.

In [35]:

def topic_page(topic_url):
    # request page content
    response = requests.get(topic_url)
    # parse using bs4
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

def repo_info(h3_tags, str_tag):
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    stars = star_tag(str_tag)
    repo_url = base_url + a_tags[1]['href']
    return username, repo_name, stars, repo_url
    

def topic_content(topic_doc):

    repo_tags = topic_doc.find_all('h3', {'class':'f3 color-fg-muted text-normal lh-condensed'})
    
    stars_tag = topic_doc.find_all('span', {'class':'Counter js-social-count'})
    
    
    topic_repo_dict = {'username': [],
             'repo name': [],
             'stars count': [],
             'repo url':[]
    }


    for i in range(len(repo_tags)):
        repo = repo_info(repo_tags[i], stars_tag[i])
        topic_repo_dict['username'].append(repo[0])
        topic_repo_dict['repo name'].append(repo[1])
        topic_repo_dict['stars count'].append(repo[2])
        topic_repo_dict['repo url'].append(repo[3])
        
    return pd.DataFrame(topic_repo_dict)

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = topic_content(topic_page(topic_url))
    topic_df.to_csv(path, index=False)

In [36]:
def get_topic_titles(soup):
    name_class = "f3 lh-condensed mb-0 mt-1 Link--primary"
    topic_names_tag = soup.find_all('p', class_ = name_class)
    topic_titles = []
    for tag in topic_names_tag:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(soup):
    # Description of topic
    desc_class = "f5 color-fg-muted mb-0 mt-1"
    topic_desc_tag = soup.find_all('p', class_ = desc_class) 
    topic_descs = []
    for tag in topic_desc_tag:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(soup):
    # link for topic
    link_class = "no-underline flex-1 d-flex flex-column"
    topic_link_tags = soup.find_all('a', class_ = link_class)
    topic_links = []
    for link in topic_link_tags:
        topic_links.append("https://github.com" + link['href'])

    return topic_links
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the page fetched here instead of relying on the global `soup`
    soup = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(soup),
        'description': get_topic_descs(soup),
        'url': get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)

In [37]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
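
The function above uses the topic title directly as a file name. Titles such as "C++" or "ASP.NET" work, but a title containing a path separator would not. Here is a minimal sketch of a sanitizer, assuming a conservative whitelist of characters is acceptable. The helper name `safe_filename` and the example title "CI/CD" are hypothetical.

```python
import re

def safe_filename(title, fallback="untitled"):
    """Replace runs of characters outside a small whitelist with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9+._-]+", "_", title)
    # Drop leading/trailing dots and underscores so names like '..' cannot survive
    cleaned = cleaned.strip("._")
    return cleaned or fallback

print(safe_filename("Command line interface"))
print(safe_filename("CI/CD"))
```

With this helper, the path could be built as `'data/{}.csv'.format(safe_filename(row['title']))`.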

In [38]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin