# Top GitHub Repositories

### Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above

Note: This notebook follows a tutorial by Jovian.ai.

#### Project Outline:

- We are going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
    Repo Name,Username,Stars,Repo URL
    three.js,mrdoob,69700,https://github.com/mrdoob/three.js
    libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
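The CSV layout above can be sketched with Python's standard `csv` module. This is only an illustration of the target format, not the notebook's method (which uses pandas later on). The helper name `rows_to_csv` is hypothetical, and the two sample rows are the ones shown in the outline.

```python
import csv
import io

# Column order taken from the outline above.
FIELDNAMES = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Sample rows copied from the outline.
sample_rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

print(rows_to_csv(sample_rows))
```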

### Use the requests library to download web pages

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
topic_url = 'https://github.com/topics'

In [4]:
response = requests.get(topic_url)

In [5]:
response.status_code

200

In [6]:
len(response.text)

152471

In [13]:
page_contents = response.text

In [14]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [15]:
# write with an explicit encoding so non-ASCII characters are saved reliably
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)
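An explicit encoding matters when saving the page, because the platform default may not handle every character in the HTML. A minimal sketch of a UTF-8 round trip, using a temporary directory and a stand-in string instead of the real download (the helper names `save_page` and `load_page` are hypothetical):

```python
import os
import tempfile

def save_page(text, path):
    """Write page text to disk as UTF-8 (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def load_page(path):
    """Read the saved page back as UTF-8 (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Stand-in for response.text; includes a non-ASCII character on purpose.
sample_html = '<!DOCTYPE html>\n<html lang="en"><body>caf\u00e9</body></html>'

with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = os.path.join(tmp_dir, 'webpage.html')
    save_page(sample_html, page_path)
    restored = load_page(page_path)

print(restored == sample_html)
```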

### Use Beautiful Soup to parse and extract information

In [10]:
!pip install beautifulsoup4 --upgrade --quiet

In [11]:
from bs4 import BeautifulSoup

In [16]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [17]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [18]:
len(topic_title_tags)

30

In [19]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [21]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})

In [22]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [23]:
topic_title_tag0 = topic_title_tags[0]

In [25]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [26]:
len(topic_link_tags)

30

In [27]:
topic_link_tags[0]

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>
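To see how the title, description, and link relate inside one topic card, here is a sketch that pulls all three from the snippet above. It uses the standard library's `html.parser` instead of Beautiful Soup, a deliberate swap so the example has no third-party dependency. The class `TopicCardParser` is hypothetical, and it assumes the first `<p>` inside each link is the title and the second is the description, as in the output shown.

```python
from html.parser import HTMLParser

class TopicCardParser(HTMLParser):
    """Collect (title, description, href) from topic link tags (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.cards = []        # finished (title, description, href) tuples
        self._href = None      # href of the <a> currently open, if any
        self._paragraphs = []  # text of each <p> seen inside that <a>
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._paragraphs = []
        elif tag == 'p' and self._href is not None:
            self._in_p = True
            self._paragraphs.append('')

    def handle_data(self, data):
        if self._in_p:
            self._paragraphs[-1] += data

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_p = False
        elif tag == 'a' and self._href is not None:
            texts = [p.strip() for p in self._paragraphs]
            if len(texts) >= 2:
                self.cards.append((texts[0], texts[1], self._href))
            self._href = None

# The link tag printed above, used as sample input.
sample = '''<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>'''

parser = TopicCardParser()
parser.feed(sample)
parser.close()
print(parser.cards)
```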

In [28]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d
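String concatenation works here because every `href` on the page starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library, which also copes with an `href` that is already absolute. The helper name `absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the site root (hypothetical helper)."""
    return urljoin(base, href)

print(absolute_url('/topics/3d'))
print(absolute_url('https://github.com/topics/ajax'))
```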


In [29]:
topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [30]:
topic_descs = []

for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
    
topic_descs[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [31]:
topic_urls = []
base_url = 'https://github.com'

for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
    
topic_urls

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet',
 'https://github.com/topics/atom',
 'https://github.com/topics/awesome',
 'https://github.com/topics/aws',
 'https://github.com/topics/azure',
 'https://github.com/topics/babel',
 'https://github.com/topics/bash',
 'https://github.com/topics/bitcoin',
 'https://github.com/topics/bootstrap',
 'https://github.com/topics/bot',
 'https://github.com/topics/c',
 'https://github.com/topics/chrome',
 'https://github.com/topics/chrome-extension',
 'https://github.com/topics/cli',
 'https://github.com/topics/clojure',
 'https://github.com/topics/code-quality',
 'https://github.com/topics/code-review',
 'https://github.com/topics/compil

### Create CSV file(s) with the extracted information

In [32]:
!pip install pandas --quiet

In [33]:
import pandas as pd

In [34]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url': topic_urls
}
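`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if one selector matched more tags than the others. A small pre-check in plain Python can make that failure easier to read. `check_equal_lengths` is a hypothetical helper, and the sample lists are shortened stand-ins for the real ones.

```python
def check_equal_lengths(columns):
    """Return each column's length, or raise ValueError if they differ (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('column lengths differ: {}'.format(lengths))
    return lengths

# Shortened stand-ins for topic_titles, topic_descs and topic_urls.
sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': [
        '3D modeling is the process of virtually developing the surface and structure of a 3D object.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}

print(check_equal_lengths(sample_columns))
```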

In [35]:
topics_df = pd.DataFrame(topics_dict)

In [36]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [37]:
topics_df.to_csv('topics.csv', index=False)

### Getting information out of a topic page

In [38]:
topic_page_url = topic_urls[0]

In [39]:
topic_page_url

'https://github.com/topics/3d'

In [40]:
response = requests.get(topic_page_url)

In [41]:
response.status_code

200

In [42]:
len(response.text)

451243

In [43]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [48]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class} )

In [49]:
repo_tags[0]

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-turbo="false" data-view-component="true" href="/mrdoob">
            mrdoob
</a>          /
          <a class="text-bold wb-break-word" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-turbo="false" data-view-component="true" href="/mrdoob/three.js

In [50]:
len(repo_tags)

20

In [51]:
a_tags = repo_tags[0].find_all('a')

In [54]:
a_tags[0].text.strip()

'mrdoob'

In [56]:
a_tags[1].text.strip()

'three.js'

In [57]:

repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [60]:
star_tags = topic_doc.find_all('span', { 'class': 'Counter js-social-count'})

In [61]:
len(star_tags)

20

In [62]:
star_tags[0].text

'85.8k'

In [63]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [64]:
parse_star_count(star_tags[0].text.strip())

85800
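`int(float(...) * 1000)` truncates toward zero, so a product that lands a hair below a whole number because of binary rounding would lose one. A sketch of a variant that uses `decimal.Decimal` for exact arithmetic follows. The `m` suffix and the comma handling are assumptions about formats GitHub might display, not something observed in this notebook, and the name `parse_star_count_exact` is hypothetical.

```python
from decimal import Decimal

# Multipliers for suffixes; only 'k' appears in this notebook's output.
SUFFIXES = {'k': 1000, 'm': 1000000}

def parse_star_count_exact(stars_str):
    """Convert strings like '85.8k' or '950' to an int (hypothetical variant)."""
    cleaned = stars_str.strip().lower().replace(',', '')
    multiplier = 1
    if cleaned and cleaned[-1] in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    # Decimal keeps the decimal digits exact, so no rounding error is introduced.
    return int(Decimal(cleaned) * multiplier)

print(parse_star_count_exact('85.8k'))
```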

In [65]:
def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    # h3_tag holds two <a> tags: the owner first, then the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [66]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 85800, 'https://github.com/mrdoob/three.js')

In [68]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}


for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
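The index-based loop assumes `repo_tags` and `star_tags` line up one to one. A sketch of the same pairing with `zip` and an explicit length check, using plain tuples as stand-ins for the parsed tags (the helper `build_rows` is hypothetical; the sample values are taken from this notebook's output):

```python
def build_rows(repo_infos, star_counts):
    """Pair (username, repo_name, repo_url) tuples with star counts (hypothetical helper)."""
    if len(repo_infos) != len(star_counts):
        raise ValueError('got {} repos but {} star counts'.format(
            len(repo_infos), len(star_counts)))
    rows = []
    for (username, repo_name, repo_url), stars in zip(repo_infos, star_counts):
        rows.append({
            'username': username,
            'repo_name': repo_name,
            'stars': stars,
            'repo_url': repo_url,
        })
    return rows

# Stand-ins for the values extracted from the h3 and star tags.
sample_infos = [
    ('mrdoob', 'three.js', 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 'https://github.com/libgdx/libgdx'),
]
sample_stars = [85800, 20600]

for row in build_rows(sample_infos, sample_stars):
    print(row)
```

A list of dictionaries like this can be passed to `pd.DataFrame` directly.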

In [69]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [70]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 85800 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 20600 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 19800 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 18500 | https://github.com/BabylonJS/Babylon.js |
| 4 | ssloy | tinyrenderer | 14800 | https://github.com/ssloy/tinyrenderer |
| 5 | aframevr | aframe | 14600 | https://github.com/aframevr/aframe |
| 6 | lettier | 3d-game-shaders-for-beginners | 13800 | https://github.com/lettier/3d-game-shaders-for... |
| 7 | FreeCAD | FreeCAD | 12300 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9400 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 9300 | https://github.com/CesiumGS/cesium |


## Final code

In [106]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def get_repo_info(h3_tag, star_tag):
    # returns all the required info about a repository
    # h3_tag holds two <a> tags: the owner first, then the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url



def get_topic_repos(topic_doc):
    # grab the h3 tags containing the repo name, URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # get the star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    # collect the repo info column by column
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)


## added later
# takes the output path directly (an earlier version built the file name from the topic name)
def scrape_topic(topic_url, path):
    # skip topics whose CSV already exists, before downloading anything
    if os.path.exists(path):
        print('The file {} already exists. Skipping...'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
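Checking for the file before doing the work avoids a download whose result would be thrown away. A sketch of that pattern in isolation, with a callback standing in for the download-and-parse step (`write_if_missing` and `produce_text` are hypothetical names):

```python
import os
import tempfile

def write_if_missing(path, produce_text):
    """Call produce_text() and save its result only if path does not exist yet.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(produce_text())
    return True

calls = []

def produce_text():
    # Stand-in for the expensive download-and-parse step; records each call.
    calls.append(1)
    return 'username,repo_name,stars,repo_url\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    first = write_if_missing(csv_path, produce_text)   # file is created
    second = write_if_missing(csv_path, produce_text)  # file exists, so skipped

print(first, second, len(calls))
```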

In [72]:
topic_urls[4]

'https://github.com/topics/android'

In [79]:
get_topic_repos(get_topic_page(topic_urls[4]))

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 146000 | https://github.com/flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 96000 | https://github.com/justjavac/free-programming-... |
| 2 | Genymobile | scrcpy | 71200 | https://github.com/Genymobile/scrcpy |
| 3 | Hack-with-Github | Awesome-Hacking | 56700 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 4 | google | material-design-icons | 46600 | https://github.com/google/material-design-icons |
| 5 | wasabeef | awesome-android-ui | 44100 | https://github.com/wasabeef/awesome-android-ui |
| 6 | Solido | awesome-flutter | 43100 | https://github.com/Solido/awesome-flutter |
| 7 | square | okhttp | 43000 | https://github.com/square/okhttp |
| 8 | android | architecture-samples | 41600 | https://github.com/android/architecture-samples |
| 9 | square | retrofit | 40600 | https://github.com/square/retrofit |


In [81]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv', index=False)

Write a single function to:

1. Get the list of topics from the topics page
2. Get the list of top repos from each individual topic page
3. Create a CSV of top repos for each topic

In [86]:
topic_url

'https://github.com/topics'

In [103]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


def get_topic_desc(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
    


def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # parse the downloaded page here instead of relying on a global `doc`
    doc = BeautifulSoup(response.text, 'html.parser')
    topic_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_desc(doc),
        'urls': get_topic_urls(doc)
    }
    return pd.DataFrame(topic_dict)
    

    

In [89]:
scrape_topics()

| | title | description | urls |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [107]:
def scrape_topics_repos():
    print('Scraping list of topics from GitHub')
    topics_df = scrape_topics()

    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {} topic'.format(row['title']))
        scrape_topic(row['urls'], 'data/{}.csv'.format(row['title']))

In [108]:
scrape_topics_repos()

Scraping list of topics from GitHub
Scraping top repositories for 3D topic
Scraping top repositories for Ajax topic
Scraping top repositories for Algorithm topic
Scraping top repositories for Amp topic
Scraping top repositories for Android topic
Scraping top repositories for Angular topic
Scraping top repositories for Ansible topic
Scraping top repositories for API topic
Scraping top repositories for Arduino topic
Scraping top repositories for ASP.NET topic
Scraping top repositories for Atom topic
Scraping top repositories for Awesome Lists topic
Scraping top repositories for Amazon Web Services topic
Scraping top repositories for Azure topic
Scraping top repositories for Babel topic
Scraping top repositories for Bash topic
Scraping top repositories for Bitcoin topic
Scraping top repositories for Bootstrap topic
Scraping top repositories for Bot topic
Scraping top repositories for C topic
Scraping top repositories for Chrome topic
Scraping top repositories for Ch
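Topic titles are used directly as file names in the loop above. That works for the titles listed here, but a title containing a path separator would not. A sketch of a conservative file-name helper follows. The name `title_to_filename` is hypothetical, the set of allowed characters is an assumption rather than something the notebook specifies, and `CI/CD` is a made-up title used only to show the separator case.

```python
import re

def title_to_filename(title, extension='.csv'):
    """Turn a topic title into a file name without separators or spaces (hypothetical)."""
    # Keep letters, digits, dot, plus, underscore and hyphen;
    # replace any run of other characters with a single '-'.
    safe = re.sub(r'[^A-Za-z0-9.+_-]+', '-', title.strip()).strip('-')
    return (safe or 'untitled') + extension

# 'CI/CD' is a made-up title, not one from the scraped list.
for title in ['3D', 'Amazon Web Services', 'C++', 'ASP.NET', 'CI/CD']:
    print(title_to_filename(title))
```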

In [105]:
import os

help(os.makedirs)

Help on function makedirs in module os:

makedirs(name, mode=511, exist_ok=False)
    makedirs(name [, mode=0o777][, exist_ok=False])
    
    Super-mkdir; create a leaf directory and all intermediate ones.  Works like
    mkdir, except that any intermediate path segment (not just the rightmost)
    will be created if it does not exist. If the target directory already
    exists, raise an OSError if exist_ok is False. Otherwise no exception is
    raised.  This is recursive.
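The `exist_ok` flag described in the help text can be seen in a short sketch run inside a temporary directory, so nothing is left behind on disk:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    data_dir = os.path.join(tmp_dir, 'data', 'nested')

    # First call creates both 'data' and 'data/nested'.
    os.makedirs(data_dir, exist_ok=True)
    # Second call does nothing because the directory exists and exist_ok=True.
    os.makedirs(data_dir, exist_ok=True)
    created = os.path.isdir(data_dir)

    # Without exist_ok, an existing target directory raises FileExistsError.
    try:
        os.makedirs(data_dir)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```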

