# Building a Python Web Scraping Project From Scratch


**Steps:**
- Scrape https://github.com/topics
- Get a list of all topics. For each topic, get the topic title, topic page URL, and topic description
- For each topic, get the top 25 repositories from the topic page
- For each repository, grab the repo name, username, stars, and repo URL
- Finally, create a CSV file by compiling all the scraped data

## Scraping Popular GitHub Topics
- `requests` to download the page
- `BeautifulSoup` to parse the HTML and extract information
- `pandas` to convert the extracted data into a DataFrame

Format of the final topics DataFrame

Attributes:
1. `title` - Name of the topic. Example: 3D
2. `description` - Description of the topic. Example: 3D modeling uses specialized software to create a digital model of a physical object. It is an aspect of 3D computer graphics, used for video games, 3D printing, and VR, among other applications.
3. `url` - URL of the topic page. Example: https://github.com/topics/3d

#### Importing required libraries

In [138]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

#### Function to Download the Topics Page From GitHub Using the `requests` Library

In [139]:
def get_topics_page(topics_url):
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    page = BeautifulSoup(response.text, 'html.parser')
    return page

#### Function to Get the Topic Titles

In [140]:
def get_topic_titles(page):
    topic_title_tags = page.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text.strip())
    return topic_titles

#### Function to Get the Topic Descriptions

In [141]:
def get_topic_descs(page):
    topic_desc_tags = page.find_all('p', {'class': 'f5 color-text-secondary mb-0 mt-1'})
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

#### Function to Get Topic's URL

In [142]:
def get_topic_urls(page):
    topic_link_tags = page.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append('https://github.com' + tag['href'])
    return topic_urls
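
The three helpers above pass a full class string to `find_all`. BeautifulSoup compares such a string against the exact value of the `class` attribute, so a reordered or added class on the page makes the lookup return an empty list. The following is a minimal sketch of a more tolerant lookup using CSS selectors. `SAMPLE_HTML` and the `_css` function names are made up for illustration and are not GitHub's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same class names as above, but in a different order
# on the <a> tag, to show how the two lookup styles differ.
SAMPLE_HTML = """
<a class="no-underline d-flex" href="/topics/example">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Example</p>
  <p class="f5 mb-0 mt-1">An example description.</p>
</a>
"""


def get_topic_titles_css(page):
    # A CSS selector matches when the listed classes are present,
    # regardless of their order or of any extra classes.
    return [tag.get_text(strip=True) for tag in page.select("p.f3.lh-condensed")]


def get_topic_urls_css(page):
    links = page.select("a.d-flex.no-underline[href]")
    return ["https://github.com" + tag["href"] for tag in links]


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(get_topic_titles_css(page))
print(get_topic_urls_css(page))
# The exact-string lookup finds nothing here because the class order differs.
print(page.find_all("a", {"class": "d-flex no-underline"}))
```

Listing fewer classes in the selector makes it less sensitive to styling changes, at the cost of possibly matching unrelated tags, so the selector still needs checking against the live page.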

#### Scraping All Data and Collecting It Into a DataFrame

In [143]:
def scrape_topics(topics_url):
    page = get_topics_page(topics_url)
    topics_dict = {
        'title': get_topic_titles(page),
        'description': get_topic_descs(page),
        'url': get_topic_urls(page)
    }
    return pd.DataFrame(topics_dict)
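
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary differ in length, which happens here if one of the three lookups finds fewer tags than the others. A minimal sketch of an explicit check that reports which column came up short; `check_column_lengths` is a hypothetical helper, not part of the notebook.

```python
def check_column_lengths(columns):
    """Return `columns` unchanged if every list has the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Include the per-column counts so the failing selector is obvious.
        raise ValueError("Scraped columns differ in length: {}".format(lengths))
    return columns


# Illustrative values only, not scraped data.
ok = check_column_lengths({"title": ["a", "b"], "url": ["/a", "/b"]})
print(sorted(ok))

try:
    check_column_lengths({"title": ["a", "b"], "description": ["only one"]})
except ValueError as error:
    print(error)
```

Equal lengths do not prove that row `i` of each list belongs to the same topic; they only rule out the most visible kind of mismatch.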

##### Scraping All Topic Pages and Compiling Them Into a Single DataFrame

In [144]:
pages = [1, 2, 3, 4, 5, 6, 7]
page_frames = []
for i in pages:
    page_frames.append(scrape_topics("https://github.com/topics?page=" + str(i)))
# DataFrame.append was removed in pandas 2.0, so concatenate the per-page frames instead
topics_df = pd.concat(page_frames, ignore_index=True)
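
The loop above requests the seven pages back to back. A minimal sketch of the same loop with a pause between requests, written so the scraping function is passed in as a parameter. `build_page_urls`, `scrape_pages`, and the one-second default delay are assumptions made for illustration.

```python
import time


def build_page_urls(base_url, last_page):
    # Pages are numbered from 1, matching the ?page= query used above.
    return ["{}?page={}".format(base_url, n) for n in range(1, last_page + 1)]


def scrape_pages(urls, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url) for each URL, sleeping between calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(scrape_one(url))
    return results


urls = build_page_urls("https://github.com/topics", 3)
# A stand-in for the real scraper so this sketch runs without network access.
print(scrape_pages(urls, scrape_one=len, delay_seconds=0))
```

With the notebook's own function, the call would look like `scrape_pages(build_page_urls("https://github.com/topics", 7), scrape_topics)`, and the resulting list of DataFrames could be combined with `pd.concat`.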

In [145]:
topics_df[:10]

|   | title | description | url |
|---|-------|-------------|-----|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency framework fo... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [146]:
len(topics_df)

181



---



## Get the Top Repositories Under Each Topic
For each repository, we'll grab the repo name, username, stars, and repo URL.

Format of the final repositories DataFrame. Attributes:
- `repo_name` - Name of the repository
- `username` - Owner of that repository
- `stars` - Stars on that repository
- `repo_url` - URL of that repository

In [154]:
def get_topic_repos(topic_doc):
    h1_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h1', {'class': h1_selection_class} )
    star_tags = topic_doc.find_all('a', { 'class': 'social-count float-none'})
    topic_repos_dict = { 'username': [], 'repo_name': [], 'stars': [],'repo_url': []}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])   
    return pd.DataFrame(topic_repos_dict)

In [155]:
def get_repo_info(h1_tag, star_tag):
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url =  'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
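
`get_repo_info` relies on the heading containing exactly two links, in owner then repository order. Assuming the repository link's `href` has the form `/<username>/<repo_name>` (consistent with how `repo_url` is built above, but not verified here), both names could also be derived from that single attribute. `split_repo_href` is a hypothetical helper, and the path in the example is illustrative.

```python
def split_repo_href(href):
    """Split '/<username>/<repo_name>' into (username, repo_name, repo_url)."""
    parts = href.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        # Fail loudly instead of silently mislabeling a row.
        raise ValueError("Unexpected repository path: {!r}".format(href))
    username, repo_name = parts
    repo_url = "https://github.com/{}/{}".format(username, repo_name)
    return username, repo_name, repo_url


print(split_repo_href("/octocat/hello-world"))
```

Raising on an unexpected path makes a markup change visible at the row where it occurs, rather than later in the CSV.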

In [156]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)
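
`parse_star_count` handles plain integers and a lowercase `k` suffix, and it truncates the result of a floating-point multiplication. A minimal sketch of a more tolerant variant follows, assuming the page might also show thousands separators or an `m` suffix (neither is confirmed by the notebook). It uses `decimal.Decimal` so that the multiplication is exact. `parse_star_count_tolerant` is a hypothetical name.

```python
from decimal import Decimal

# Assumed suffixes; only "k" appears in the notebook above.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_star_count_tolerant(stars_str):
    text = stars_str.strip().lower().replace(",", "")
    if not text:
        raise ValueError("Empty star count")
    multiplier = SUFFIX_MULTIPLIERS.get(text[-1], 1)
    if multiplier != 1:
        text = text[:-1]  # drop the suffix before converting the number
    # Decimal keeps "1.2" exact, so 1.2k is exactly 1200 before int() truncates.
    return int(Decimal(text) * multiplier)


for sample in ["987", "1.2k", "12,345", "1.5M"]:
    print(sample, parse_star_count_tolerant(sample))
```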

In [157]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append('https://github.com' + tag['href'])
    return topic_urls

In [158]:
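def scrape_topic(topic_url, path):
    # Skip topics whose CSV was already written by an earlier run
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # get_topics_page (defined earlier) downloads and parses any URL, including a single topic page
    topic_df = get_topic_repos(get_topics_page(topic_url))
    topic_df.to_csv(path, index=False)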
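
The `path` given to `scrape_topic` is built from the topic title (`'data/{}.csv'.format(row['title'])`), so a title containing a slash or other characters that are invalid in file names would produce a path that cannot be written. A minimal sketch of a sanitizing step follows. `safe_topic_filename` and its set of allowed characters are assumptions, not part of the notebook.

```python
import re


def safe_topic_filename(title):
    """Turn a topic title into a file name such as 'ASP.NET.csv'."""
    # Replace each run of characters outside the allowed set with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._+-]+", "_", title.strip())
    # Drop leading/trailing dots and underscores (avoids names like '..').
    cleaned = cleaned.strip("._")
    return (cleaned or "topic") + ".csv"


for title in ["3D", "ASP.NET", "C++", "Node / JS"]:
    print(title, "->", safe_topic_filename(title))
```

Two different titles can map to the same file name under this scheme, in which case the existing-file check in `scrape_topic` would skip the second one.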

In [159]:
def scrape_topics(topic_page):
    topic_url = topic_page
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [160]:
def scrape_topics_repos(i):
    print('Scraping list of topics')
    # Topics listed on page i only; looping over the global topics_df here
    # would re-check every topic on each call
    page_topics_df = scrape_topics("https://github.com/topics?page=" + str(i))
    os.makedirs('data', exist_ok=True)
    for index, row in page_topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
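
The code above leaves one CSV per topic in the `data` directory, while the plan at the top ends with a single compiled CSV. A minimal sketch of that last step using only the standard library follows, assuming each file has the `username`, `repo_name`, `stars`, and `repo_url` columns described earlier. `combine_topic_csvs` and the added `topic` column are hypothetical.

```python
import csv
import glob
import os


def combine_topic_csvs(data_dir, output_path):
    """Merge every CSV in data_dir into output_path; return the row count."""
    fieldnames = ["topic", "username", "repo_name", "stars", "repo_url"]
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for csv_path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
            if os.path.abspath(csv_path) == os.path.abspath(output_path):
                continue  # do not read the file that is being written
            # The file name (without .csv) is the topic title used when saving.
            topic = os.path.splitext(os.path.basename(csv_path))[0]
            with open(csv_path, newline="", encoding="utf-8") as in_file:
                for row in csv.DictReader(in_file):
                    row["topic"] = topic
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

A call such as `combine_topic_csvs("data", "all_topics.csv")` would then produce one file with a `topic` column identifying the source of each row; the output file name is only an example.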

In [161]:
pages = [1, 2, 3, 4, 5, 6, 7]
for i in pages:
    scrape_topics_repos(i)

Scraping list of topics
Scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
Scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
Scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
Scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
Scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
Scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
Scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
Scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
Scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
Scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
Scraping top repositories for "Atom"
The file data/Atom.csv alre