# Web Scraping Top GitHub Repositories from the Topics Page (https://github.com/topics)

#### TODO:
    - Introduction about WEB SCRAPING
    - Introduction about GitHub and the PROBLEM STATEMENT
    - Tools used (Python, Requests, Beautiful Soup, Pandas)

## OUTLINE:

- Scrape https://github.com/topics
- Get the list of topics. For each topic, we extract the topic title, the topic page URL and the topic description
- For each topic, get the top 25 repositories from the topic page
- For each repository, grab the repo name, username, stars and repo URL
- For each topic, create a CSV file in the following format:
```
  Repo Name,Username,Stars,Repo URL
  three.js,mrdoob,69700,https://github.com/mrdoob/three.js
  libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```

## 1. Scrape https://github.com/topics

- Use the Python `requests` module to download the page
- Use the Beautiful Soup (`bs4`) library to parse the page and extract data
- Use `pandas` to convert the extracted data into a DataFrame

#### Installing Modules

In [1]:
!pip install requests --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet



#### Importing Modules

In [2]:
import os
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Function to Download the Page

In [3]:
def get_topics_page():
    
    # Topics URL Page
    topic_url = 'https://github.com/topics'
    
    # Download the page
    response = requests.get(topic_url)
    
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc

#### Get the topics doc:

In [4]:
doc = get_topics_page()

#### Checking the type of doc:

In [5]:
type(doc)

bs4.BeautifulSoup

#### Find some values from the doc:

In [6]:
doc.find('a')

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

### Creating Helper Functions: To parse information from the page

To get the topic titles, we pick the `p` tags with the matching `class` value. Inspecting the https://github.com/topics page in the browser shows which tag and class to use:



![title](images/Topics-P_Tag.PNG)


In [7]:
def get_topic_titles(doc):
    title_selector = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class' : title_selector})
    
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    
    return topic_titles

`get_topic_titles` returns the list of topic titles.

In [8]:
titles = get_topic_titles(doc)

In [9]:
len(titles)

30

In [10]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
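
The selector in `get_topic_titles` is the full `class` string. As far as I understand Beautiful Soup, a multi-class string passed this way is compared against the attribute value as written, so it stops matching if GitHub reorders or adds a class. The sketch below contrasts that with a CSS selector that only requires some of the classes. The HTML snippet is made up for illustration and is not copied from GitHub.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two "title" tags with the same classes in a different order.
html = """
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="Link--primary f3 lh-condensed mb-0 mt-1">Ajax</p>
<p class="f5 color-fg-muted mb-0 mt-1">Not a title</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Full class string: only matches when the attribute is written exactly like this.
exact_tags = soup.find_all("p", {"class": "f3 lh-condensed mb-0 mt-1 Link--primary"})

# CSS selector: matches any <p> that has both classes, in any order.
flexible_tags = soup.select("p.f3.Link--primary")

exact_titles = [tag.get_text(strip=True) for tag in exact_tags]
flexible_titles = [tag.get_text(strip=True) for tag in flexible_tags]
```

Which classes are stable enough to select on is a judgment call; `f3` and `Link--primary` are used here only because they appear in the selector above.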

`get_topic_descs` similarly returns the list of topic descriptions.

In [11]:
def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector})
    
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
        
    return topic_descs

In [13]:
descriptions = get_topic_descs(doc)

In [15]:
descriptions[:5]

['3D modeling is the process of virtually developing the surface and structure of a 3D object.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.']

`get_topic_urls` returns the list of topic page URLs.

In [16]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class' : 'no-underline flex-1 d-flex flex-column'})
    
    topic_urls = []
    base_url = "https://github.com"
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    
    return topic_urls
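
`get_topic_urls` builds each URL by string concatenation, which works as long as every `href` is a path starting with `/`. A slightly more defensive alternative, sketched here with the standard library's `urljoin`, also copes with an `href` that is already absolute. The helper name `build_topic_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def build_topic_url(href):
    """Resolve an href from the topics page against the GitHub base URL."""
    return urljoin(BASE_URL, href)

# A relative href, as seen on the topics page.
relative_result = build_topic_url("/topics/3d")

# An already absolute href is returned unchanged instead of being doubled up.
absolute_result = build_topic_url("https://github.com/topics/ajax")
```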

In [18]:
urls = get_topic_urls(doc)

In [20]:
urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

Let's use all these together in one function i.e. `scrape_topics`

In [21]:
def scrape_topics():
    topic_url = "https://github.com/topics"
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    
    doc = BeautifulSoup(response.text, 'html.parser')
    
    topics_dict = {
        'Title' : get_topic_titles(doc),
        'Description' : get_topic_descs(doc),
        'URL' : get_topic_urls(doc)
    }
    
    return pd.DataFrame(topics_dict)

In [22]:
scrape_topics()

,Title,Description,URL
0,3D,3D modeling is the process of virtually develo...,https://github.com/topics/3d
1,Ajax,Ajax is a technique for creating interactive w...,https://github.com/topics/ajax
2,Algorithm,Algorithms are self-contained sequences that c...,https://github.com/topics/algorithm
3,Amp,Amp is a non-blocking concurrency library for ...,https://github.com/topics/amphp
4,Android,Android is an operating system built by Google...,https://github.com/topics/android
5,Angular,Angular is an open source web application plat...,https://github.com/topics/angular
6,Ansible,Ansible is a simple and powerful automation en...,https://github.com/topics/ansible
7,API,An API (Application Programming Interface) is ...,https://github.com/topics/api
8,Arduino,Arduino is an open source hardware and softwar...,https://github.com/topics/arduino
9,ASP.NET,ASP.NET is a web framework for building modern...,https://github.com/topics/aspnet
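
`scrape_topics` relies on the three helper functions returning lists of the same length; if one selector stops matching, pandas refuses to build a DataFrame from columns of different lengths. A minimal sketch of a check that could run before `pd.DataFrame(...)` to give a clearer error message follows. The helper name `check_equal_lengths` and the sample values are assumptions for illustration.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the column lists differ in length; return the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths

# All three columns line up.
good_columns = {
    "Title": ["3D", "Ajax"],
    "Description": ["first description", "second description"],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
good_lengths = check_equal_lengths(good_columns)

# One selector returned fewer items than the others.
bad_columns = dict(good_columns, Description=["first description"])
try:
    check_equal_lengths(bad_columns)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```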


## 2. Get the top 25 repositories from each topic page

- Download a single topic page
- From that page, extract the `h3` tags
- Extract the repositories from the topic page
- Save those repositories to CSV files

### Downloading a topic page from its URL

In [23]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    
    #Check successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
        
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    
    return topic_doc
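
`get_topic_page` gives up on the first non-200 response. When many topic pages are fetched in a row, a temporary failure is plausible, so one option is to retry a few times with a pause in between. The sketch below is a generic retry loop, not part of the original notebook: `fetch_with_retry` and its parameters are hypothetical, and `fetch` stands for any callable returning a `(status_code, text)` pair.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) until it returns status 200 or the attempts run out."""
    last_status = None
    for attempt in range(attempts):
        status, text = fetch(url)
        if status == 200:
            return text
        last_status = status
        if attempt < attempts - 1:
            # Wait a little longer after each failed attempt.
            time.sleep(delay_seconds * (attempt + 1))
    raise RuntimeError(
        "Failed to load page {} (last status {})".format(url, last_status)
    )

# Stand-in for a real download: fails twice, then succeeds.
calls = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return 429, ""
    return 200, "<html>ok</html>"

page = fetch_with_retry(fake_fetch, "https://github.com/topics/3d", delay_seconds=0)
```

With `requests`, `fetch` could be a small wrapper that calls `requests.get(url)` and returns `(response.status_code, response.text)`.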


In [24]:
doc = get_topic_page('https://github.com/topics/3d')

### Extracting repository info from the `h3` tags

In [26]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
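
`parse_star_count` handles plain numbers and a trailing `k`. A slightly more general sketch is shown below; it also accepts thousands separators and an `m` suffix, and rounds before converting to guard against floating-point truncation. The name `parse_count` and the `m` suffix are assumptions, since the pages scraped here only show `k`.

```python
def parse_count(text):
    """Convert strings such as '79.4k', '1,234' or '950' to an int."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() avoids losing a unit to floating-point error before int().
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)

parsed = [parse_count(s) for s in ["79.4k", "1,234", "950", " 1.2M "]]
```

An empty string still raises `ValueError`, the same error `int('')` would raise.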

In [27]:
def get_repo_info(h3_tag, star_tag):
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    base_url = "https://github.com"
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
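
`get_repo_info` takes the username and repo name from the text of the two links. As a cross-check, the same two values can be derived from the repository link itself. This sketch assumes the link has the form `/<username>/<repo>`, which matches the repo URLs produced in this notebook; `split_repo_url` is a hypothetical helper.

```python
from urllib.parse import urlparse

def split_repo_url(repo_url):
    """Return (username, repo_name) from a repository URL or href."""
    parts = [part for part in urlparse(repo_url).path.split("/") if part]
    if len(parts) != 2:
        raise ValueError("Expected /<username>/<repo>, got {!r}".format(repo_url))
    return parts[0], parts[1]

from_href = split_repo_url("/mrdoob/three.js")
from_url = split_repo_url("https://github.com/libgdx/libgdx")
```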

In [28]:
repo_tags = doc.find_all('h3', {'class' : 'f3 color-fg-muted text-normal lh-condensed'})

In [29]:
repo_tags[:2]

[<h3 class="f3 color-fg-muted text-normal lh-condensed">
 <a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>          /
           <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d8

In [30]:
star_tags = doc.find_all('span', {'class' : 'Counter js-social-count'})

In [31]:
star_tags[:3]

[<span aria-label="79356 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="79,356">79.4k</span>,
 <span aria-label="19670 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="19,670">19.7k</span>,
 <span aria-label="16900 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="16,900">16.9k</span>]

### Extracting Repositories from the Topics Page

In [32]:
def get_topics_repos(topic_doc):
        
    # get h3 tags containing repo title, repo URL and username
    h3_selector = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class' : h3_selector})
    
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class' : 'Counter js-social-count'})
    
    topic_repos_dict = {
        'Username' : [],
        'Repo_Name' : [],
        'Stars' : [],
        'Repo_URL' : []
    }
    
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['Username'].append(repo_info[0])
        topic_repos_dict['Repo_Name'].append(repo_info[1])
        topic_repos_dict['Stars'].append(repo_info[2])
        topic_repos_dict['Repo_URL'].append(repo_info[3])
        
    return pd.DataFrame(topic_repos_dict)
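
`get_topics_repos` appends to four parallel lists. An equivalent layout, sketched below, keeps each repository together as one record (a dict per row), which makes it harder for the columns to drift out of step. `to_records` is a hypothetical helper, and the sample tuples stand in for the values returned by `get_repo_info`.

```python
COLUMNS = ("Username", "Repo_Name", "Stars", "Repo_URL")

def to_records(repo_infos):
    """Turn (username, repo_name, stars, repo_url) tuples into one dict per row."""
    return [dict(zip(COLUMNS, info)) for info in repo_infos]

# Sample tuples in the order returned by get_repo_info.
sample_infos = [
    ("mrdoob", "three.js", 79400, "https://github.com/mrdoob/three.js"),
    ("libgdx", "libgdx", 19700, "https://github.com/libgdx/libgdx"),
]
records = to_records(sample_infos)
```

Passing `records` to `pd.DataFrame(records)` gives a table with the same four columns.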

In [33]:
get_topics_repos(doc)

,Username,Repo_Name,Stars,Repo_URL
0,mrdoob,three.js,79400,https://github.com/mrdoob/three.js
1,libgdx,libgdx,19700,https://github.com/libgdx/libgdx
2,pmndrs,react-three-fiber,16900,https://github.com/pmndrs/react-three-fiber
3,BabylonJS,Babylon.js,16000,https://github.com/BabylonJS/Babylon.js
4,aframevr,aframe,13800,https://github.com/aframevr/aframe
5,lettier,3d-game-shaders-for-beginners,12300,https://github.com/lettier/3d-game-shaders-for...
6,ssloy,tinyrenderer,12100,https://github.com/ssloy/tinyrenderer
7,FreeCAD,FreeCAD,10800,https://github.com/FreeCAD/FreeCAD
8,metafizzy,zdog,9000,https://github.com/metafizzy/zdog
9,CesiumGS,cesium,8300,https://github.com/CesiumGS/cesium
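
The star counts above are rounded to the nearest hundred because they come from the visible text (`79.4k`). The star tags printed earlier also carry a `title` attribute with the exact figure (`title="79,356"`), so the exact count could be read from there instead. The sketch below shows this on a shortened copy of one of those tags; `exact_star_count` is a hypothetical helper, and whether GitHub keeps this attribute is not guaranteed.

```python
from bs4 import BeautifulSoup

# Shortened copy of a star tag from the output above.
html = '<span class="Counter js-social-count" title="79,356">79.4k</span>'
star_tag = BeautifulSoup(html, "html.parser").find("span")

def exact_star_count(tag):
    """Return the exact count from the title attribute, or None if it is missing."""
    title = tag.get("title")
    if not title:
        return None
    return int(title.replace(",", ""))

exact_count = exact_star_count(star_tag)
```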


### Creating Topic-Wise CSV Files

In [34]:
def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping.....".format(path))
        return
    topic_df = get_topics_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=None)

## 3. Create a CSV file in the following format:
```
  Repo Name,Username,Stars,Repo URL
  three.js,mrdoob,69700,https://github.com/mrdoob/three.js
  libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
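
Note that the DataFrame built by `get_topics_repos` uses the columns `Username, Repo_Name, Stars, Repo_URL`, so files written with `to_csv` carry that header rather than the one shown here. As an illustration of producing exactly the header and column order above, here is a small sketch using the standard library's `csv` module and an in-memory buffer. The rows are the two sample rows from the format block.

```python
import csv
import io

FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]

rows = [
    {"Repo Name": "three.js", "Username": "mrdoob", "Stars": 69700,
     "Repo URL": "https://github.com/mrdoob/three.js"},
    {"Repo Name": "libgdx", "Username": "libgdx", "Stars": 18300,
     "Repo URL": "https://github.com/libgdx/libgdx"},
]

# Write to a string buffer so the result can be inspected without touching disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```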

## Final Function: `scrape_topics_repo`

- First, we have a function to get the list of topics from the topics page on GitHub
- Then we have `scrape_topic` to create a CSV File for scraped repos
- Then we have `scrape_topics_repo()` function to put everything together
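
The CSV file name is derived from the topic title, and titles can contain spaces or other characters that are awkward in file names. A conservative way to clean them up is sketched below. `safe_filename` is a hypothetical helper, and `'C/C++'` is a made-up title used only to show what happens to a slash.

```python
import re

def safe_filename(title, extension=".csv"):
    """Turn a topic title into a conservative file name."""
    # Replace each run of characters other than letters, digits, '.', '_' and '-'.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", title.strip())
    # Avoid names that are empty or made up only of dots and underscores.
    cleaned = cleaned.strip("._") or "untitled"
    return cleaned + extension

names = [
    safe_filename(title)
    for title in ["3D", "ASP.NET", "Command line interface", "C/C++"]
]
```

One caveat of this approach is that different titles can collapse to the same name (for example, titles that differ only in punctuation), so a real run may need an extra uniqueness check.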

In [36]:
def scrape_topics_repo():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('Repo-Dataset', exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['Title']))
        scrape_topic(row['URL'], 'Repo-Dataset/{}.csv'.format(row['Title']))

Let's run the above code to scrape all the top repos for all the topics on the first page of https://github.com/topics/

In [37]:
scrape_topics_repo()

Scraping list of topics
Scraping top repositories for "3D"
Scraping top repositories for "Ajax"
Scraping top repositories for "Algorithm"
Scraping top repositories for "Amp"
Scraping top repositories for "Android"
Scraping top repositories for "Angular"
Scraping top repositories for "Ansible"
Scraping top repositories for "API"
Scraping top repositories for "Arduino"
Scraping top repositories for "ASP.NET"
Scraping top repositories for "Atom"
Scraping top repositories for "Awesome Lists"
Scraping top repositories for "Amazon Web Services"
Scraping top repositories for "Azure"
Scraping top repositories for "Babel"
Scraping top repositories for "Bash"
Scraping top repositories for "Bitcoin"
Scraping top repositories for "Bootstrap"
Scraping top repositories for "Bot"
Scraping top repositories for "C"
Scraping top repositories for "Chrome"
Scraping top repositories for "Chrome extension"
Scraping top repositories for "Command line interface"
Scraping top repositori

To check whether the CSVs were created properly, read one back and display it using pandas:

In [38]:
pd.read_csv("Repo-Dataset/Bot.csv")

,Username,Repo_Name,Stars,Repo_URL
0,ccxt,ccxt,23400,https://github.com/ccxt/ccxt
1,python-telegram-bot,python-telegram-bot,17700,https://github.com/python-telegram-bot/python-...
2,discordjs,discord.js,17500,https://github.com/discordjs/discord.js
3,hubotio,hubot,16100,https://github.com/hubotio/hubot
4,InstaPy,InstaPy,13900,https://github.com/InstaPy/InstaPy
5,RasaHQ,rasa,13600,https://github.com/RasaHQ/rasa
6,wechaty,wechaty,12100,https://github.com/wechaty/wechaty
7,gunthercox,ChatterBot,12000,https://github.com/gunthercox/ChatterBot
8,howdyai,botkit,10600,https://github.com/howdyai/botkit
9,botpress,botpress,9600,https://github.com/botpress/botpress
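
Looking at one file by eye works for a spot check. For all the generated files, a small automated check may be more practical. The sketch below, using only the standard library, verifies the header written by `to_csv` and checks that `Stars` is a whole number and `Repo_URL` points at GitHub. `find_csv_problems` is a hypothetical helper, and the in-memory samples stand in for real files.

```python
import csv
import io

EXPECTED_COLUMNS = ["Username", "Repo_Name", "Stars", "Repo_URL"]

def find_csv_problems(csv_file):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return ["unexpected header: {}".format(reader.fieldnames)]
    problems = []
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        if not (row["Stars"] or "").isdigit():
            problems.append("line {}: Stars is not a whole number".format(line_number))
        if not (row["Repo_URL"] or "").startswith("https://github.com/"):
            problems.append("line {}: Repo_URL is not a GitHub URL".format(line_number))
    return problems

good_csv = io.StringIO(
    "Username,Repo_Name,Stars,Repo_URL\n"
    "ccxt,ccxt,23400,https://github.com/ccxt/ccxt\n"
    "hubotio,hubot,16100,https://github.com/hubotio/hubot\n"
)
bad_csv = io.StringIO(
    "Username,Repo_Name,Stars,Repo_URL\n"
    "ccxt,ccxt,23.4k,https://github.com/ccxt/ccxt\n"
)

good_problems = find_csv_problems(good_csv)
bad_problems = find_csv_problems(bad_csv)
```

For a real file, the same function can be called on `open('Repo-Dataset/Bot.csv', newline='')`.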


## Summary and References

Summary of what I did:

- Scraped the GitHub topics page
- Created a CSV file of top repositories for each topic available on the first page


References to links I found useful:

- https://github.com/topics/
- https://www.youtube.com/
