# Top Repositories for GitHub Topics

## Pick a website and describe your objective

* Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
* Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
* Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.


#### Project Outline

- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top repositories listed on the topic page (the plan was 25; the first page returned 20 in the run below)
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:

```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
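The layout above can also be produced with the standard library alone. This is a minimal sketch, assuming each repository is held as a `(repo_name, username, stars, repo_url)` tuple; the `write_repos_csv` helper is hypothetical, and the notebook itself uses pandas for this step.

```python
import csv
import io

# Column order taken from the sample CSV in the project outline
HEADER = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def write_repos_csv(rows, fileobj):
    """Write (repo_name, username, stars, repo_url) tuples as CSV with a header row."""
    writer = csv.writer(fileobj, lineterminator='\n')
    writer.writerow(HEADER)
    writer.writerows(rows)

rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

# An in-memory buffer stands in for a real file opened with open(path, 'w', newline='')
buffer = io.StringIO()
write_repos_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file object instead of a path keeps the helper easy to check without touching the disk.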

### Use the requests library to download web pages

In [1]:
import requests

In [2]:
topics_url = 'https://github.com/topics'

In [3]:
response = requests.get(topics_url)

In [4]:
response.status_code

200

In [5]:
len(response.text)  # optional

164889

In [6]:
# Write the web page to a file (optional)
# An explicit encoding avoids errors on platforms whose default is not UTF-8
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(response.text)

### Use Beautiful Soup to parse and extract information

In [7]:
from bs4 import BeautifulSoup

In [8]:
doc = BeautifulSoup(response.text, 'html.parser')

In [9]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [10]:
len(topic_title_tags)

30
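Passing the full class string to `find_all` matches only tags whose `class` attribute is exactly that string, so the lookup is sensitive to class order. A CSS selector is one possible alternative. The sketch below assumes `bs4` is installed; the second `<p>` tag is a hypothetical variant with the same classes reordered, not markup taken from the page.

```python
from bs4 import BeautifulSoup

# First tag copies the markup seen on the page; the second is a hypothetical
# variant with the same classes in a different order.
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact class-string lookup, as used in the notebook
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: requires both classes to be present, in any order
by_css = sample_doc.select('p.f3.Link--primary')

exact_titles = [tag.text for tag in exact]
css_titles = [tag.text for tag in by_css]
print(exact_titles, css_titles)
```

Selecting on fewer classes is less brittle when the order changes, but it may also match tags the exact string would have excluded, so the result count is worth checking either way.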

In [11]:
# let's see first 3
topic_title_tags[:3]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>]

In [12]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p',{'class': desc_selector})

In [13]:
len(topic_desc_tags)

30

In [14]:
# First 3
topic_desc_tags[:3]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>]

In [15]:
#topic_url
topic_selector = 'no-underline flex-1 d-flex flex-column'
topic_link_tags = doc.find_all('a', {'class': topic_selector})

In [16]:
len(topic_link_tags)

30

In [17]:
# first 3
topic_link_tags[:3]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>]

In [18]:
topic_link_tags[0]['href']

'/topics/3d'

In [19]:
print('https://github.com' + topic_link_tags[0]['href'])

https://github.com/topics/3d
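String concatenation works here because every `href` starts with `/`. A slightly more defensive option is `urllib.parse.urljoin`, which also leaves an already absolute link unchanged. This is a small sketch; the `absolute_url` helper name is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def absolute_url(href, base=BASE_URL):
    """Resolve a relative or absolute href against the site's base URL."""
    return urljoin(base, href)

# Relative href, as found in the topic link tags
topic_url = absolute_url('/topics/3d')

# An href that is already absolute is returned as-is
already_absolute = absolute_url('https://github.com/topics/ajax')

print(topic_url)
print(already_absolute)
```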


In [20]:
# Now let's put all titles in a list

topic_titles = []

for tag in topic_title_tags:
    topic_titles.append(tag.text)
    
topic_titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

In [21]:
# Now let's put all topic descriptions in a list

topic_descriptions = []

for tag in topic_desc_tags:
    topic_descriptions.append(tag.text.strip()) # use strip to remove unwanted space
    
topic_descriptions[:5]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

In [22]:
# let's put all topic urls in a list

topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url+tag['href'])
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

In [23]:
import pandas as pd

In [24]:
# Putting all lists in a dictionary
topics_dict = {
    'Title': topic_titles,
    'Description' : topic_descriptions,
    'Topic Url' : topic_urls
}

In [25]:
topics_df = pd.DataFrame(topics_dict)

In [26]:
topics_df.head()

| | Title | Description | Topic Url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |


In [27]:
len(topics_df)

30

In [28]:
# Create a CSV file for this dataframe
topics_df.to_csv('topics.csv', index=None)

## Getting information out of a topic page (DAY-2)

In [29]:
# ALL THIS DATA IS BEING EXTRACTED FOR SINGLE TOPIC URL 'https://github.com/topics/3d'
topic_page_url = topic_urls[0]
topic_page_url

'https://github.com/topics/3d'

In [30]:
response = requests.get(topic_page_url)

In [31]:
response.status_code

200

In [32]:
len(response.text)

476509

In [33]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [34]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})

In [35]:
len(repo_tags)

20

In [36]:
a_tags = repo_tags[0].find_all('a')

In [37]:
a_tags[0].text.strip() # USERNAME

'mrdoob'

In [38]:
a_tags[1].text.strip() # REPOSITORY NAME

'three.js'

In [39]:
a_tags[1]['href'] # REPOSITORY URL

'/mrdoob/three.js'

In [40]:
# LINK BASE URL WITH REPOSITORY URL
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
repo_url

'https://github.com/mrdoob/three.js'

In [41]:
star_tags = topic_doc.find_all('span',{'class' : 'Counter js-social-count'})

In [42]:
len(star_tags)

20

In [43]:
star_tags[0].text # RATING STARS

'93.8k'

In [44]:
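def parse_star_count(stars_str):
    # Convert a star label such as '93.8k' or '512' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        # round() guards against float products landing just below the integer
        return int(round(float(stars_str[:-1]) * 1000))
    
    return int(stars_str)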

In [45]:
parse_star_count('93.7k')

93700
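Only the `k` suffix appears in the output above. As a hedged sketch, the same idea can be extended to other abbreviations using `Decimal`, which avoids binary floating-point rounding altogether. The `m` suffix and the comma handling are assumptions, not formats observed on the page, and `parse_count` is a hypothetical name.

```python
from decimal import Decimal

# 'k' is seen in the scraped data; 'm' is an assumed extension
SUFFIX_MULTIPLIERS = {'k': 1_000, 'm': 1_000_000}

def parse_count(text):
    """Convert labels such as '93.8k', '1.2m' or '512' to an integer."""
    text = text.strip().lower().replace(',', '')
    if text and text[-1] in SUFFIX_MULTIPLIERS:
        # Decimal arithmetic keeps '93.8' exact before scaling
        return int(Decimal(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]])
    return int(text)

print(parse_count('93.8k'))
print(parse_count(' 1.2m '))
print(parse_count('512'))
```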

In [46]:
# h3_tag is the <h3> heading of one repository; star_tag is its star counter <span>
def get_repo_info(h3_tag, star_tag):
    #returns all required info about a repository
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

In [47]:
get_repo_info(repo_tags[0],star_tags[0])

('mrdoob', 'three.js', 93800, 'https://github.com/mrdoob/three.js')

In [48]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
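The index-based loop relies on `repo_tags` and `star_tags` having the same length and on the tuple positions staying in sync with the dictionary keys. One way to make that explicit is sketched below, using plain tuples in place of parsed tags; `rows_to_columns` is a hypothetical helper, and the sample rows are the two repositories shown in this notebook.

```python
FIELDS = ('username', 'repo_name', 'stars', 'repo_url')

def rows_to_columns(rows):
    """Turn (username, repo_name, stars, repo_url) tuples into a dict of column lists."""
    columns = {field: [] for field in FIELDS}
    for row in rows:
        if len(row) != len(FIELDS):
            raise ValueError('Expected {} values, got {}'.format(len(FIELDS), len(row)))
        # zip pairs each field name with the value in the same position
        for field, value in zip(FIELDS, row):
            columns[field].append(value)
    return columns

rows = [
    ('mrdoob', 'three.js', 93800, 'https://github.com/mrdoob/three.js'),
    ('pmndrs', 'react-three-fiber', 23500, 'https://github.com/pmndrs/react-three-fiber'),
]

columns = rows_to_columns(rows)
print(columns['repo_name'])
```

In the notebook, the rows could come from `get_repo_info(h3_tag, star_tag)` called over `zip(repo_tags, star_tags)`, after checking that both lists have the same length.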

In [49]:
# Convert dictionary into dataframe
topic_repos_df = pd.DataFrame(topic_repos_dict)
topic_repos_df.head(2), len(topic_repos_df)

(  username          repo_name  stars  \
 0   mrdoob           three.js  93800   
 1   pmndrs  react-three-fiber  23500   
 
                                       repo_url  
 0           https://github.com/mrdoob/three.js  
 1  https://github.com/pmndrs/react-three-fiber  ,
 20)

# Final Code

In [75]:
import os

def get_topic_page(topic_url):
    # download the page
    response = requests.get(topic_url)
    # Check successful response   
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using beautiful soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
    
def get_repo_info(h3_tag, star_tag):
    #returns all required info about a repository
    base_url = 'https://github.com'
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get the h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span',{'class' : 'Counter js-social-count'})
    
    # Get repo info
    
    topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
    }

    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index = None)
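`get_topic_page` raises on any non-200 response, so a single failed request stops the whole run. Assuming some failures are transient (the source does not say so), a small retry wrapper is one way to soften this. The sketch below is generic: `call_with_retries` and `flaky_fetch` are hypothetical names, and the fake fetcher stands in for a real network call.

```python
import time

def call_with_retries(func, *args, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with a growing pause between tries."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer after each failure
                sleep(delay_seconds * attempt)
    raise last_error

# Stand-in for a page download that fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'page for {}'.format(url)

# The no-op sleep keeps the demonstration instant
result = call_with_retries(flaky_fetch, 'https://github.com/topics/3d', sleep=lambda seconds: None)
print(result, calls['count'])
```

In the notebook this could wrap the existing function, for example `call_with_retries(get_topic_page, topic_url)`.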

In [51]:
# Let's choose another url 
topic_urls[:4]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp']

In [52]:
topic_doc = get_topic_page('https://github.com/topics/ajax')

In [53]:
get_topic_repos(topic_doc)

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | ljianshu | Blog | 7700 | https://github.com/ljianshu/Blog |
| 1 | metafizzy | infinite-scroll | 7300 | https://github.com/metafizzy/infinite-scroll |
| 2 | developit | unfetch | 5600 | https://github.com/developit/unfetch |
| 3 | olifolkerd | tabulator | 5600 | https://github.com/olifolkerd/tabulator |
| 4 | jquery-form | form | 5200 | https://github.com/jquery-form/form |
| 5 | Studio-42 | elFinder | 4500 | https://github.com/Studio-42/elFinder |
| 6 | elbywan | wretch | 4100 | https://github.com/elbywan/wretch |
| 7 | dwyl | learn-to-send-email-via-google-script-html-no-... | 3000 | https://github.com/dwyl/learn-to-send-email-vi... |
| 8 | ded | reqwest | 2900 | https://github.com/ded/reqwest |
| 9 | wendux | ajax-hook | 2400 | https://github.com/wendux/ajax-hook |


Write a function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for that topic

In [78]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descriptions(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p',{'class': desc_selector})
    
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions

def get_topic_urls(doc):
    topic_selector = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': topic_selector})
    
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url+tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
        
    doc = BeautifulSoup(response.text, 'html.parser')
    
    
    topics_dict = {
        'title' : get_topic_titles(doc),
        'description' : get_topic_descriptions(doc),
        'url' : get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [76]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    # create a folder
    os.makedirs('data', exist_ok = True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
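The file name is built directly from the topic title. None of the titles in the output shown here contain a path separator, but a title that did would point outside the `data` folder or fail to save. A cautious sketch for sanitizing titles follows; `safe_filename` is a hypothetical helper and the `'A/B testing'` title is an invented example.

```python
import re

def safe_filename(title, extension='.csv'):
    """Replace characters outside letters, digits, dot, underscore and hyphen with '_'."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip()).strip('_')
    # Fall back to a placeholder if nothing usable is left
    return (cleaned or 'untitled') + extension

print(safe_filename('Chrome extension'))
print(safe_filename('ASP.NET'))
print(safe_filename('A/B testing'))
```

Distinct titles can map to the same file name after cleaning, so a real run might also need a collision check before writing.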

In [77]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
Scraping top repositories for Ajax
Scraping top repositories for Algorithm
Scraping top repositories for Amp
Scraping top repositories for Android
Scraping top repositories for Angular
Scraping top repositories for Ansible
Scraping top repositories for API
Scraping top repositories for Arduino
Scraping top repositories for ASP.NET
Scraping top repositories for Atom
Scraping top repositories for Awesome Lists
Scraping top repositories for Amazon Web Services
Scraping top repositories for Azure
Scraping top repositories for Babel
Scraping top repositories for Bash
Scraping top repositories for Bitcoin
Scraping top repositories for Bootstrap
Scraping top repositories for Bot
Scraping top repositories for C
Scraping top repositories for Chrome
Scraping top repositories for Chrome extension
Scraping top repositories for Command line interface
Scraping top repositories for Clojure
Scraping top repositories for Code quality
Scraping top