# Top Repositories in different GitHub Topics

### Pick website and describe objective

Outline:

- Scrape https://github.com/topics
- Get a list of topics. For each topic, we'll get a topic title, topic page URL, and topic description
- Get the top repos for each topic (the first page of a topic lists 20 repos)
- For each repo, grab the repo name, username, stars, and repo URL
- For each topic, create a CSV file with the following format:

```
Repo Name, Username, Stars, Repo URL
```
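
As a minimal sketch of that file format, the snippet below writes one sample row with Python's standard `csv` module. The helper name `rows_to_csv` is hypothetical, and the sample row reuses the first repository that appears later in this notebook.

```python
import csv
import io

# Column order taken from the outline above.
HEADER = ["Repo Name", "Username", "Stars", "Repo URL"]

def rows_to_csv(rows):
    """Serialize (repo_name, username, stars, repo_url) tuples to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()

# Sample values only; a real run would fill these from the scraped page.
sample_rows = [("three.js", "mrdoob", 86300, "https://github.com/mrdoob/three.js")]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```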


### Use requests library to download webpages

In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'
response = requests.get(topics_url)
response.status_code
# HTTP status code 200 means the request was successful

200

In [3]:
#preview the html code for github
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

### Use Beautiful Soup to parse and extract information

In [4]:
from bs4 import BeautifulSoup

In [5]:
# Parse as html
doc = BeautifulSoup(page_contents, 'html.parser')

In [6]:
# Find all of the topic titles on the first page
# Topic titles are contained in p tags
topic_title_tags = doc.find_all('p', {'class': "f3 lh-condensed mb-0 mt-1 Link--primary"})

In [7]:
# Check that we grabbed the expected number of p tags
# There are 30 topics on the first page of github.com/topics
len(topic_title_tags)

30

In [8]:
topic_title_tags

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Angular</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ansible</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">API</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Arduino</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">ASP.NET</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Atom</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Awesome Lists</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amazon Web Services</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Azure</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Babel</p>,
 <p class="f3 lh-condensed m

In [9]:
# Find all of the topic descriptions on the first page
# Descriptions are contained in p tags
# Similar to topic title but with a different class
topic_desc_tags = doc.find_all('p', {'class' : "f5 color-fg-muted mb-0 mt-1"})

In [10]:
# Check that we grabbed the expected number of p tags
# There are 30 topics on the first page of github.com/topics
len(topic_desc_tags)

30

In [11]:
topic_desc_tags

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Angular is an open source web application platform.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ansible is a simple and powerful automation engine.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           An API (Applicati

In [12]:
# Find all of the URLs to each topic page
# URLs are contained in a tags
topic_url_tags = doc.find_all('a', {'class' : "no-underline flex-1 d-flex flex-column"})

In [13]:
# Check that we grabbed the expected number of a tags
# There are 30 topics on the first page of github.com/topics
len(topic_url_tags)

30

In [14]:
# Testing URL for first topic
topic0_url = "https://github.com" + topic_url_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
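
String concatenation works here because every `href` on the page is a root-relative path. As an alternative sketch, the standard library's `urllib.parse.urljoin` builds the same URL and also leaves an already absolute link untouched. The helper name `absolute_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def absolute_url(href, base=BASE_URL):
    """Resolve an href from the page against the site's base URL."""
    return urljoin(base, href)

# Root-relative path, as found in the topic a tags
print(absolute_url("/topics/3d"))
# An absolute link is returned unchanged
print(absolute_url("https://github.com/topics/ajax"))
```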


In [15]:
# Get the text from the topic title p tags
# The text is the name of each topic
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
topic_titles

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET',
 'Atom',
 'Awesome Lists',
 'Amazon Web Services',
 'Azure',
 'Babel',
 'Bash',
 'Bitcoin',
 'Bootstrap',
 'Bot',
 'C',
 'Chrome',
 'Chrome extension',
 'Command line interface',
 'Clojure',
 'Code quality',
 'Code review',
 'Compiler',
 'Continuous integration',
 'COVID-19',
 'C++']

In [16]:
# Get the text from the topic desc p tags
# The text is the description of each topic
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    # .strip() method of a string removes any whitespace from beginning and end
topic_descs

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source hardware and software company and maker community.',
 'ASP.NET is a web framework for building modern web apps and services.',
 'Atom is a open source text editor built with web technologies.',
 'An awesome list is a list of awesome things curated by the community.',
 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.',
 'Azu

In [17]:
# Get the href from the topic url a tags
# Concatenate that to github.com to form the URL
topic_urls = []

for tag in topic_url_tags:
    topic_urls.append('https://github.com' + tag['href'])
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

### Create CSV file(s) with the extracted information

In [18]:
import pandas as pd

In [19]:
# Create dictionary to turn into pandas df
topics_dict30 = {'Topic Name' : topic_titles, 
                 'Description' : topic_descs, 
                 'Topic URL' : topic_urls}
topics_df = pd.DataFrame(topics_dict30)
topics_df

Unnamed: 0,Topic Name,Description,Topic URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
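
Building a DataFrame from a dictionary of lists only works when all lists have the same length, which is why the counts were checked earlier. A small guard like the one below can make that check explicit before the DataFrame is created. This is a sketch; `check_same_length` is a hypothetical helper, and the sample lists are the first three topics shown above.

```python
def check_same_length(**columns):
    """Return the shared length of all columns, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    # With no columns at all, report a length of 0
    return next(iter(lengths.values()), 0)

sample_titles = ["3D", "Ajax", "Algorithm"]
sample_urls = [
    "https://github.com/topics/3d",
    "https://github.com/topics/ajax",
    "https://github.com/topics/algorithm",
]
print(check_same_length(titles=sample_titles, urls=sample_urls))
```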


In [20]:
# Create CSV file
topics_df.to_csv('topics.csv', index=None)

### Getting information from the topic pages

In [21]:
response = requests.get(topic_urls[0])

In [22]:
# Status code 200 is successful
response.status_code

200

In [23]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [24]:
# Get the repository information in the h3 tags
repo_tags = topic_doc.find_all('h3', {'class' : "f3 color-fg-muted text-normal lh-condensed"})

# There are 20 repositories on the first page
len(repo_tags)

20

In [25]:
# The h3 tag has two child a tags containing the username and the repo name
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [26]:
a_tags = repo_tags[0].find_all('a')

In [27]:
star_tags = topic_doc.find_all('span', {'class' : "Counter js-social-count"})
len(star_tags)

20

In [28]:
#This is the star counter for the first repo
star_tags[0].text

'86.3k'

In [29]:
def parse_star_count(stars_str):
    '''
    This function takes in a string of the stars count from a github repo 
    and parses it as an int
    '''
    stars_str = stars_str.strip()
    # Check to see if star count is in the thousands
    if stars_str[-1] == 'k':
        # Convert to thousands
        return int(float(stars_str[:-1]) * 1000)
    # If star count is not in thousands, just return as int
    return int(stars_str)
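
The function above only handles the `k` suffix. The variant below is a sketch of a slightly more defensive parser: it assumes counts might also use an `m` suffix or thousands separators (neither appears in the output shown in this notebook), and it uses `round` instead of `int` so that floating-point error cannot truncate the result. The name `parse_count` is hypothetical.

```python
def parse_count(text):
    """Parse a counter string such as '86.3k' or '1,234' into an int."""
    # Assumed suffixes; only 'k' is confirmed by the scraped output
    multipliers = {"k": 1_000, "m": 1_000_000}
    cleaned = text.strip().lower().replace(",", "")
    if not cleaned:
        raise ValueError("empty count string")
    suffix = cleaned[-1]
    if suffix in multipliers:
        # round() guards against results like 86299.999... being truncated
        return round(float(cleaned[:-1]) * multipliers[suffix])
    return int(cleaned)

print(parse_count("86.3k"))
```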

In [30]:
def get_repo_info(h3_tag, star_tag):
    '''
    Returns username, repo name, star count, and repo url
    '''
    # The h3 tag has two child a tags containing the username and the repo name
    a_tags = h3_tag.find_all('a')
    # Index 0 a tag is the username
    username = a_tags[0].text.strip()
    # Index 1 a tag is the repo name
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [31]:
import os

def get_topic_repos(topic_url):
    '''
    This function returns a pandas DataFrame with the username, repo name,
    star count, and repo URL of each repo on a GitHub topic page
    '''
    response = requests.get(topic_url)
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful soup    
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    #Get h3 tags containing username, repo name, and repo URL
    repo_tags = topic_doc.find_all('h3', {'class' : "f3 color-fg-muted text-normal lh-condensed"})
    #Get star tags
    star_tags = topic_doc.find_all('span', {'class' : "Counter js-social-count"})
    #Get repo info
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])    
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, topic_name):
    fname = topic_name + '.csv'
    if os.path.exists(fname):
        print("The file {} already exists. Skipping.".format(fname))
        return
    topic_df = get_topic_repos(topic_url)
    topic_df.to_csv(fname, index=None)
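
`scrape_topic` uses the topic name directly as the file name, so names with spaces such as "Command line interface" end up in the file name as-is, and a name containing a path separator would break the path. One possible approach, sketched below, is to normalize the name first. `safe_filename` is a hypothetical helper, and the set of allowed characters is an assumption rather than a requirement of any library.

```python
import re

def safe_filename(topic_name, extension=".csv"):
    """Turn a topic name into a lowercase, filesystem-friendly file name."""
    lowered = topic_name.strip().lower()
    # Replace each run of characters outside a-z, 0-9, '+', '.', '-' with '_'
    slug = re.sub(r"[^a-z0-9+.\-]+", "_", lowered).strip("_")
    # Fall back to a generic name if nothing usable is left
    return (slug or "topic") + extension

for name in ["Command line interface", "C++", "ASP.NET"]:
    print(safe_filename(name))
```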

In [32]:
def get_topic_titles(doc):
    # Find all of the topic titles on the page
    # Topic titles are contained in p tags
    topic_title_tags = doc.find_all('p', {'class': "f3 lh-condensed mb-0 mt-1 Link--primary"})
    # Get the text from the topic title p tags
    # The text is the name of each topic
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    # Find all of the topic descriptions on the page
    # Descriptions are contained in p tags
    # Similar to topic title but with a different class
    topic_desc_tags = doc.find_all('p', {'class' : "f5 color-fg-muted mb-0 mt-1"})
    # Get the text from the topic desc p tags
    # The text is the description of each topic
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    # Find all of the URLs to each topic page
    # URLs are contained in a tags
    topic_url_tags = doc.find_all('a', {'class' : "no-underline flex-1 d-flex flex-column"})
    # Get the href from the topic url a tags
    # Concatenate that to github.com to form the URL
    topic_urls = []
    for tag in topic_url_tags:
        topic_urls.append('https://github.com' + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the page that was just downloaded instead of relying on a global doc
    doc = BeautifulSoup(response.text, 'html.parser')

    topics_dict = {
        'Topic Name' : get_topic_titles(doc),
        'Description' : get_topic_descs(doc),
        'URL' : get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)


In [33]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['Topic Name']))
        scrape_topic(row['URL'], row['Topic Name'])
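
The loop above sends its requests back to back and stops at the first exception. A possible refinement, sketched here under the assumption that a short pause between requests is desirable, is to wait between items and collect failures instead of aborting. `run_with_delay` is a hypothetical helper; in the notebook, `action` would be a function that calls `scrape_topic`.

```python
import time

def run_with_delay(items, action, delay_seconds=1.0, sleep=time.sleep):
    """Call action(item) for each item, pausing between calls.

    Returns a list of (item, exception) pairs for the calls that failed.
    """
    failures = []
    for position, item in enumerate(items):
        if position > 0:
            # Pause before every call except the first
            sleep(delay_seconds)
        try:
            action(item)
        except Exception as error:
            failures.append((item, error))
    return failures

# Stand-in action for demonstration; no network access happens here
def demo_action(topic_name):
    print('Scraping top repositories for "{}"'.format(topic_name))

demo_failures = run_with_delay(["3D", "Ajax"], demo_action, delay_seconds=0)
print(demo_failures)
```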

# Scrape the top repositories for each topic and create CSV files with their information

In [34]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositories for "Clojure"
Scrapin