# scraping-github-topic-repositories

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


## Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format


```
   Repo Name,Username,Stars,Repo URL
   three.js,mrdoob,69700,https://github.com/mrdoob/three.js
   libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
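
The target format above can be produced with Python's standard `csv` module. The following is only a minimal sketch: `write_repo_rows` is a hypothetical helper that does not appear elsewhere in this notebook, and the two rows are the sample rows shown above rather than freshly scraped data.

```python
import csv
import io

def write_repo_rows(rows, out):
    """Write repository rows to an open text stream in the format shown above."""
    writer = csv.writer(out, lineterminator="\n")
    # Header row matches the planned CSV layout
    writer.writerow(["Repo Name", "Username", "Stars", "Repo URL"])
    writer.writerows(rows)

# Sample rows copied from the format example (illustrative only)
sample_rows = [
    ("three.js", "mrdoob", 69700, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 18300, "https://github.com/libgdx/libgdx"),
]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_repo_rows(sample_rows, buffer)
print(buffer.getvalue())
```

The rest of the notebook uses pandas and `DataFrame.to_csv` instead, which is more convenient once the data is already in a DataFrame.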

In [None]:
pip install requests --quiet

In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
len(response.text)

144372

In [5]:
page_contents = response.text

In [6]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-UXiu4O52iBFkqt6Kx5t+pqHYP2/LWWIw9+l5ia74TWw+xPzpH44BFfAQp7yzCe0XFGZa72Xiqyml6tox1KkUjw==" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" integrity="sha512-IX1PnI5wWBz8Kgb1JI0f2QFa/WuRQQHJHe0vkKinQzsxRlNb4b8NgODX5htSZVAAk

In [7]:
with open('webpage.html', "w", encoding="utf-8") as f:
    f.write(page_contents)

In [None]:
!pip install beautifulsoup4

In [8]:
from bs4 import BeautifulSoup

In [10]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [11]:
type(doc)

bs4.BeautifulSoup

In [13]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class':selection_class })

In [14]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [15]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'

topic_desc_tags = doc.find_all('p', {'class':desc_selector })

In [16]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [17]:
topic_title_tags0 = topic_title_tags[0]

In [18]:
topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})

In [19]:
topic0_url= 'https://github.com' + topic_link_tags[0]['href']

In [20]:
print(topic0_url)

https://github.com/topics/3d


In [31]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [37]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [39]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

In [None]:
pip install pandas 

In [46]:
import pandas as pd

In [48]:
topics_dict = {
'title' : topic_titles,
'description' : topic_descs,
'url' : topic_urls
}

In [50]:
topics_df = pd.DataFrame(topics_dict)

In [51]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
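
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, which can happen if one of the class selectors matches more or fewer tags than the others. A small check before building the DataFrame makes that failure easier to diagnose. This is a sketch only: `check_equal_lengths` is a hypothetical helper, and the lists below are short illustrative stand-ins for `topic_titles`, `topic_descs` and `topic_urls`.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('column lengths differ: {}'.format(lengths))
    return lengths

# Illustrative data only; in the notebook these would be the scraped lists
lengths = check_equal_lengths(
    title=['3D', 'Ajax'],
    description=['3D modeling ...', 'Ajax is a technique ...'],
    url=['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(lengths)
```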


In [73]:
topics_df.to_csv(r'D:\git_topic.csv', index=False)

## Getting info from a Topic Page

In [77]:
topic_page_url = topic_urls[0]

In [79]:
topic_page_url

'https://github.com/topics/3d'

In [80]:
response = requests.get(topic_page_url)

In [82]:
len(response.text)

649615

In [83]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [97]:
# Each repository heading on the topic page is an <h3> tag
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'

repo_tag = topic_doc.find_all('h3',{'class':h3_selection_class})


In [99]:
len(repo_tag)

30

In [102]:
a_tags = repo_tag[0].find_all('a')

In [111]:
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data

In [110]:
a_tags[0].text.strip()

'mrdoob'

In [112]:
a_tags[1].text.strip()

'three.js'

In [115]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [187]:
star_tag = topic_doc.find_all('span',{'class':"Counter js-social-count"})

In [188]:
star_tag[0].text.strip()

'83.9k'

In [150]:
def star_parser(star_num):
    if star_num[-1] =='k':
        return float(star_num[:-1]) * 1000
    return int(star_num)

In [152]:
star_parser(star_tag[0].text)

83900.0
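
`star_parser` returns a float for abbreviated counts and an int otherwise. If a single integer type is preferred for the CSV, a variant along the following lines could be used. This is a sketch, not the notebook's method: `parse_star_count` is a hypothetical name, and support for an `m` suffix and for thousands separators is an assumption about labels that do not appear in the output above.

```python
def parse_star_count(text):
    """Convert a star label such as '83.9k' or '512' to an int."""
    text = text.strip().lower().replace(',', '')
    # Assumed suffixes; only 'k' is confirmed by the output shown above
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids off-by-one results from floating-point error
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_star_count('83.9k'))
print(parse_star_count('512'))
```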

In [161]:
def get_repo_info(h1_tag,star_tag):
    # Returns all the required information about a repository
    a_tags    = h1_tag.find_all('a')
    username  = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url  = base_url + a_tags[1]['href']
    stars     = star_parser(star_tag.text.strip())
    return username, repo_name, repo_url, stars

In [163]:
get_repo_info(repo_tag[0],star_tag[0])

('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js', 83900.0)

In [181]:
topic_repos_dict = {
    'username' : [],
    'repo_name': [],
    'stars'    : [],
    'repo_url' : []
}

for i in range(len(repo_tag)):
    repo_info = get_repo_info(repo_tag[i], star_tag[i])
    # get_repo_info returns (username, repo_name, repo_url, stars)
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['repo_url'].append(repo_info[2])
    topic_repos_dict['stars'].append(repo_info[3])
    

In [320]:
import os
def get_topic_page(topic_url):
    #Download the page
    response = requests.get(topic_url)
    #check successful response
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topic_url))
    # Parse using Beautifulsoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h1_tag,star_tag):
    # Returns all the required information about a repository
    a_tags    = h1_tag.find_all('a')
    username  = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url  = base_url + a_tags[1]['href']
    stars     = star_parser(star_tag.text.strip())
    return username, repo_name, repo_url, stars



def get_topic_repo(topic_doc):
    # Get the h3 tags containing the repo title, username and URL
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tag = topic_doc.find_all('h3',{'class':h3_selection_class})
    # Get Star Tag
    star_tag = topic_doc.find_all('span',{'class':"Counter js-social-count"})
    # Get repo info
    
    topic_repos_dict = {'username' : [],'repo_name': [],'stars': [],'repo_url' : []    }
    for i in range(len((repo_tag))):
        repo_info = get_repo_info(repo_tag[i],star_tag[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['repo_url'].append(repo_info[2])   
        topic_repos_dict['stars'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,topic_name):
    fname = 'D:/' + topic_name + '.csv'
    if os.path.exists(fname):
        print('this file already exists skipping.......')
        return
    topic_df = get_topic_repo(get_topic_page(topic_url))
    topic_df.to_csv(fname, index=None)
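
`scrape_topic` builds the file name directly from the topic title, and titles such as "Awesome Lists" contain spaces or other characters that are awkward in file names. One possible way to build a safer path is sketched below. `topic_csv_path` and the default `out_dir` value are hypothetical and not part of the notebook, which writes to `D:/`.

```python
import os
import re

def topic_csv_path(topic_name, out_dir='data'):
    """Build a CSV path for a topic, replacing awkward characters with '_'."""
    # Keep letters, digits, dot, underscore, plus and hyphen; replace the rest
    safe_name = re.sub(r'[^A-Za-z0-9._+-]+', '_', topic_name.strip())
    return os.path.join(out_dir, safe_name + '.csv')

print(topic_csv_path('Awesome Lists'))
print(topic_csv_path('C++'))
```

The output directory would need to exist before writing, for example via `os.makedirs(out_dir, exist_ok=True)`.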



Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic



In [295]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class':selection_class }) 
    topic_titles =[]
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_description(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class':desc_selector })
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_url(doc):
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('failed to load page {}'.format(topics_url))
    # Parse the downloaded page before extracting titles, descriptions and URLs
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title'       : get_topic_titles(doc),
        'description' : get_topic_description(doc),
        'url'         : get_topic_url(doc)
    }
    return pd.DataFrame(topics_dict)



In [315]:
def scrape_topics_repos():
    print('scraping list of topics')
    topics_df = scrape_topics()
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],row['title'])

In [321]:
scrape_topics_repos()

scraping list of topics
scraping top repositories for "3D"
this file already exists skipping.......
scraping top repositories for "Ajax"
this file already exists skipping.......
scraping top repositories for "Algorithm"
this file already exists skipping.......
scraping top repositories for "Amp"
this file already exists skipping.......
scraping top repositories for "Android"
this file already exists skipping.......
scraping top repositories for "Angular"
this file already exists skipping.......
scraping top repositories for "Ansible"
this file already exists skipping.......
scraping top repositories for "API"
this file already exists skipping.......
scraping top repositories for "Arduino"
this file already exists skipping.......
scraping top repositories for "ASP.NET"
this file already exists skipping.......
scraping top repositories for "Atom"
this file already exists skipping.......
scraping top repositories for "Awesome Lists"
this file already exists skipping.......
scraping top re