## This is the rough code. 
- Please read it thoroughly to understand the whole process

### PROJECT OUTLINE:
- I am going to scrape the https://github.com/topics website.
- From this page, get the different topics. For each topic, gather the topic title, the topic page URL, and
  the topic description.
- For each topic, get the titles of the top 25 repositories from the
  topic page.
- For each repository, gather the repository name, the username that posted the
  repository, the number of stars, and the repository URL.
- For each topic, create a CSV file.

### 2. Use the requests library to download web pages

In [2]:
!pip install requests --upgrade --quiet


In [3]:
import requests

In [4]:
topics_url = 'https://github.com/topics'

In [5]:
response = requests.get(topics_url)      # Downloads the webpage and saves it

In [6]:
response.status_code               # A status code in the range 200-299 means the request was successful

200
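
The 200-299 rule from the comment above can be wrapped in a small helper. This is a minimal sketch: `is_success` is a hypothetical name, not part of `requests`, and the status codes are sample values, so no request is made. `requests` itself also offers `response.raise_for_status()`, which raises an error for 4xx and 5xx responses.

```python
def is_success(status_code: int) -> bool:
    """Return True for any 2xx HTTP status code (hypothetical helper)."""
    return 200 <= status_code < 300


# Sample status codes; no network request is made here.
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```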

In [76]:
len(response.text)

129374

In [9]:
page_contents = response.text
page_contents[:500]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n\n\n\n  <link crossorigin="anonymous" media="all" integrity="sha512-B/'

In [10]:
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)           # Writes the HTML data to a file
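
Passing an explicit encoding when saving the page keeps the file independent of the platform's default encoding. The sketch below is only an illustration under assumptions: the short `page_contents` string stands in for `response.text`, and a temporary directory is used so that no real file is touched.

```python
import tempfile
from pathlib import Path

# Stand-in for response.text; the real page is much longer.
page_contents = '<!DOCTYPE html>\n<html lang="en"><body>caf\u00e9</body></html>'

# A temporary directory keeps the example from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / 'webpage.html'
    # An explicit encoding avoids depending on the platform default.
    html_path.write_text(page_contents, encoding='utf-8')
    restored = html_path.read_text(encoding='utf-8')

print(restored == page_contents)
```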

### 3. Use Beautiful Soup to parse and extract information

In [11]:
!pip install beautifulsoup4 --upgrade --quiet    # Installing beautiful soup

In [12]:
from bs4 import BeautifulSoup

In [14]:
soup = BeautifulSoup(page_contents, 'html.parser')   # Create an instance of BeautifulSoup class

Inspecting the page source in the browser shows that:
- Each topic title sits inside a '<p class="...">topic-name</p>' tag.
- So we find every p tag and check whether it holds a topic name.
- To keep only the topic titles, we search for class="f3 lh-condensed mb-0 mt-1 Link--primary"
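
How the class filter behaves can be checked on a small snippet first. This is a sketch on hypothetical, simplified markup (the second title deliberately lists the same classes in a different order). As far as I know, a multi-class string passed to `find_all` matches only the exact attribute value, while a CSS selector passed to `select` ignores the order.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has many more elements.
html = '''
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-text-secondary mb-0 mt-1">A description</p>
<p class="mb-0 f3 lh-condensed mt-1 Link--primary">Ajax</p>
'''
sample_soup = BeautifulSoup(html, 'html.parser')

# A multi-class string matches only the exact attribute value, in order.
exact = sample_soup.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# A CSS selector matches the listed classes in any order.
any_order = sample_soup.select('p.f3.lh-condensed.Link--primary')

print([tag.text for tag in exact])
print([tag.text for tag in any_order])
```

The exact-string form is what this notebook uses; it is stricter, so it stops matching if GitHub reorders the classes.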

In [25]:
topic_title_tags = soup.find_all('p')    # Finding all occurrences of p tags

In [30]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = soup.find_all('p', {'class' : selection_class}) # Filtering and searching for class as key

In [31]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

- Getting Topic descriptions

In [34]:
desc_selector = "f5 color-text-secondary mb-0 mt-1"
topic_desc_tags = soup.find_all('p', {'class' : desc_selector})

In [35]:
topic_desc_tags[:5]

[<p class="f5 color-text-secondary mb-0 mt-1">
               3D modeling is the process of virtually developing the surface and structure of a 3D object.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Ajax is a technique for creating interactive web applications.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Algorithms are self-contained sequences that carry out a variety of tasks.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Amp is a non-blocking concurrency framework for PHP.
             </p>,
 <p class="f5 color-text-secondary mb-0 mt-1">
               Android is an operating system built by Google designed for mobile devices.
             </p>]

- Finding the URL of each topic and storing it
- The link for each topic is in an <a>...</a> tag

In [36]:
topic_link_tags = soup.find_all('a', class_ = 'd-flex no-underline')

In [243]:
topic_link_tags[0]['href']

'/topics/3d'

In [43]:
topic_0_link = "https://github.com" + topic_link_tags[0]['href']
print(topic_0_link)

https://github.com/topics/3d
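
Joining the base URL and the href with `+` works because every href here starts with `/`. As an alternative sketch, the standard library's `urljoin` does the same join and also copes with an href that is already absolute; the second URL below is only a sample input.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

# A site-relative href, as found in the topic link tags.
print(urljoin(base_url, '/topics/3d'))

# An href that is already absolute is returned unchanged.
print(urljoin(base_url, 'https://github.com/topics/ajax'))
```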


- Getting only Title Text from Tags, and storing it in a list

In [46]:
topic_title_tags[0].text

'3D'

In [51]:
topic_titles = []
for tags in topic_title_tags:
    topic_titles.append(tags.text)
print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


- Getting the description text from the description tags

In [57]:
topic_descs = []
for tags in topic_desc_tags:
    topic_descs.append(tags.text.strip())
print(topic_descs[:5])

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency framework for PHP.', 'Android is an operating system built by Google designed for mobile devices.']


- Getting the full URL from each link tag

In [165]:
topic_urls = []
base_url = 'https://github.com'
for tags in topic_link_tags:
    topic_urls.append(base_url + tags['href'])
topic_urls[:5]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android']

### 4. Create CSV file(s) with the extracted information


- Using the pandas library to create a table and write it to a CSV file

In [68]:
import pandas as pd

In [216]:
topics_dict = {
    'title': topic_titles,
    'descriptions' : topic_descs,
    'topic_url' : topic_urls,
}

In [260]:
topics_df = pd.DataFrame(topics_dict)
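
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which would happen if one of the three selectors matched more or fewer tags than the others. Below is a minimal sketch of a check that could run before building the table. `check_same_length` is a hypothetical helper, and the sample dictionary uses shortened placeholder values.

```python
def check_same_length(columns: dict) -> int:
    """Return the shared length of all lists, or raise ValueError."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)


# Shortened placeholder data in the same shape as topics_dict.
sample_dict = {
    'title': ['3D', 'Ajax'],
    'descriptions': ['First description', 'Second description'],
    'topic_url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_same_length(sample_dict))
```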


In [79]:
topics_df.to_csv('github_topics.csv', index=False)   # index=False leaves out the row-number column

### 5. Getting info out of a topic page

In [156]:
topic_page_url = topic_urls[4]
topic_page_url

'https://github.com/topics/android'

In [157]:
response = requests.get(topic_page_url)

In [158]:
response.status_code

200

In [159]:
len(response.text)

581100

In [89]:
subtopic_soup = BeautifulSoup(response.text, 'html.parser')

- Since the username and the repository name are both 'a' tags nested inside the same parent 'h1' tag, I'll search for the h1 tags first

In [131]:

h1_repo_tags = subtopic_soup.find_all('h1', {'class': 'f3 color-text-secondary text-normal lh-condensed' } )

In [132]:
a_tags = h1_repo_tags[0].find_all('a')


In [133]:
a_tags[0].text.strip()

'mrdoob'

In [134]:
a_tags[1].text.strip()

'three.js'

In [123]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


- Now for number of Stars:

In [126]:
star_finder_class = 'social-count float-none'
star_tags = subtopic_soup.find_all('a', class_ = star_finder_class)

In [127]:
len(star_tags)

30

In [130]:
star_tags[0].text.strip()

'71.8k'

In [135]:
def parse_star_count(ele):
    ele = ele.strip()
    if ele[-1] == 'k':
        return int(float(ele[:-1])*1000)
    return int(ele)   

In [136]:
parse_star_count(star_tags[0].text.strip())

71800
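
`parse_star_count` handles plain numbers and a 'k' suffix. The variant below is a hedged sketch, not the function used in this notebook: `parse_count` is a hypothetical name, and the 'm' suffix and the thousands separator are assumptions about formats the page might show. It uses `round()` instead of `int()` so that a floating-point product that lands just below a whole number is not truncated.

```python
def parse_count(text: str) -> int:
    """Convert strings such as '71.8k', '1.2m' or '1,234' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    cleaned = text.strip().lower().replace(',', '')
    if cleaned and cleaned[-1] in multipliers:
        # round() avoids losing a unit to floating-point truncation.
        return round(float(cleaned[:-1]) * multipliers[cleaned[-1]])
    return int(cleaned)


for sample in ['71.8k', ' 532 ', '1,234', '1.2m']:
    print(repr(sample), parse_count(sample))
```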

- Creating a function that takes:
   - an h1 tag
   - a star tag
- and returns the username, repo name, number of stars and repo URL

- To summarise:
- There are 30 h1 tags
- Each h1 tag has one 'a' tag for the username and one 'a' tag for the repo name and URL
- There are 30 star tags
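
Matching each h1 tag with its star tag by position relies on both lists having the same length. A minimal sketch of a guard for that assumption follows. `pair_strict` is a hypothetical helper, and the plain strings stand in for the tag objects (the second entry is made up).

```python
def pair_strict(first: list, second: list) -> list:
    """Pair items by position, refusing lists of different lengths."""
    if len(first) != len(second):
        raise ValueError(
            f'Length mismatch: {len(first)} items vs {len(second)} items'
        )
    return list(zip(first, second))


# Stand-ins for h1_repo_tags and star_tags.
names = ['mrdoob/three.js', 'example/repo']
stars = ['71.8k', '12']
for name, star in pair_strict(names, stars):
    print(name, star)
```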

In [139]:
def get_repo_info(h1_tag, star_tag):
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [140]:
get_repo_info(h1_repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 71800, 'https://github.com/mrdoob/three.js')

- Getting the values for all the repositories on the topic page:

In [142]:
topic_repos_dict = {
    'username' : [],
    'repo_name' : [],
    'stars' : [],
    'repo_url' : [],
}
for i in range(len(h1_repo_tags)):
    repo_info = get_repo_info(h1_repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])    

In [145]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

## Now, collecting the code into a few functions:

In [194]:
def get_subtopic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code not in range(200,300):
        raise Exception("Failed to load page {}".format(topic_url))
    # Parse the page with Beautiful Soup
    subtopic_soup = BeautifulSoup(response.text, 'html.parser')
    return subtopic_soup

def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_subtopic_repos(subtopic_soup):
    # gets the h1-tags containing username, repo_name & repo_url
    h1_finder_class = 'f3 color-text-secondary text-normal lh-condensed'
    h1_repo_tags = subtopic_soup.find_all('h1', {'class': h1_finder_class})
    # gets the star_tags containing the no of stars in each repos
    star_finder_class = 'social-count float-none'
    star_tags = subtopic_soup.find_all('a', class_ = star_finder_class)
    
    # Gets required data for each repository under the main topic and stores
    # it as a dictionary
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : [],
    }
    
    for i in range(len(h1_repo_tags)):
        repo_info = get_repo_info(h1_repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)   

def scrape_subtopic(topic_url, topic_name):
    # Use the topic_url argument, not the global topic_urls list
    subtopic_df = get_subtopic_repos(get_subtopic_page(topic_url))
    subtopic_df.to_csv(topic_name + '.csv', index=None)
    

- Get the list of the top 30 topics from the GitHub site
- Get the top 30 repos from each topic page
- For each topic, create a CSV file of the top repos

### FINAL CODE :

In [255]:
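import os
import pandas as pd
from bs4 import BeautifulSoup
import requests

base_url = 'https://github.com'   # Used by get_repo_info to build repository URLs

def parse_star_count(ele):
    # Converts a star-count string such as '71.8k' to an integer
    ele = ele.strip()
    if ele[-1] == 'k':
        return int(float(ele[:-1])*1000)
    return int(ele)
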
def get_subtopic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code not in range(200,300):
        raise Exception("Failed to load page {}".format(topic_url))
    # Parse the page with Beautiful Soup
    subtopic_soup = BeautifulSoup(response.text, 'html.parser')
    return subtopic_soup

def get_repo_info(h1_tag, star_tag):
    # Returns all the required info about the repository
    a_tags = h1_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url


def get_subtopic_repos(subtopic_soup):
    # gets the h1-tags containing username, repo_name & repo_url
    h1_finder_class = 'f3 color-text-secondary text-normal lh-condensed'
    h1_repo_tags = subtopic_soup.find_all('h1', {'class': h1_finder_class})
    # gets the star_tags containing the no of stars in each repos
    star_finder_class = 'social-count float-none'
    star_tags = subtopic_soup.find_all('a', class_ = star_finder_class)
    
    # Gets required data for each repository under the main topic and stores
    # it as a dictionary
    topic_repos_dict = {
        'username' : [],
        'repo_name' : [],
        'stars' : [],
        'repo_url' : [],
    }
    
    for i in range(len(h1_repo_tags)):
        repo_info = get_repo_info(h1_repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    
    return pd.DataFrame(topic_repos_dict)   

def scrape_subtopic(topic_url, path):
    if os.path.exists(path):
        print(f"The file {path} already exists. Skipping...")
        return
    subtopic_df = get_subtopic_repos(get_subtopic_page(topic_url))
    subtopic_df.to_csv(path, index=None)
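
The `os.path.exists` check in `scrape_subtopic` makes the scraper safe to re-run: topics that already have a file are skipped instead of being downloaded again. The sketch below isolates that pattern. `write_once` is a hypothetical helper, and the file lives in a temporary directory so nothing real is written.

```python
import os
import tempfile


def write_once(path: str, text: str) -> bool:
    """Write text to path unless it already exists; return True if written."""
    if os.path.exists(path):
        return False
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'Android.csv')
    first = write_once(csv_path, 'username,repo_name\n')
    second = write_once(csv_path, 'this text is never written\n')

print(first, second)
```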

In [256]:
def get_topic_titles(soup):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = soup.find_all('p', {'class' : selection_class})
    topic_titles = []
    for tags in topic_title_tags:
        topic_titles.append(tags.text)
    return topic_titles
def get_topic_descs(soup):
    desc_selector = "f5 color-text-secondary mb-0 mt-1"
    topic_desc_tags = soup.find_all('p', {'class' : desc_selector})
    topic_descs = []
    for tags in topic_desc_tags:
        topic_descs.append(tags.text.strip())
    return topic_descs
def get_topic_urls(soup):
    topic_link_tags = soup.find_all('a', class_ = 'd-flex no-underline')
    topic_urls = []
    base_url = 'https://github.com'
    for tags in topic_link_tags:
        topic_urls.append(base_url + tags['href'])
    return topic_urls  
        
    
def scrape_topics():
    response = requests.get(topics_url)
    if response.status_code not in range(200,300):
        raise Exception("Failed to load page {}".format(topics_url))
    page_contents = response.text
    soup = BeautifulSoup(page_contents, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(soup),
        'descriptions' : get_topic_descs(soup),
        'topic_url' : get_topic_urls(soup),
    }
    return pd.DataFrame(topics_dict)

In [257]:
topics_url = 'https://github.com/topics'
def scrape_topics_repos():
    # Creating a separate folder = 'scraped_data(github)'
    os.makedirs('scraped_data(github)', exist_ok = True)
    print(f"Scraping list of topics from {topics_url}")
    topics_df = scrape_topics()
    for index,row in topics_df.iterrows():
        print(f"Scraping top repositories for {row['title']}")
        scrape_subtopic(row['topic_url'], f"scraped_data(github)/{row['title']}.csv")

In [None]:
scrape_topics_repos()
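
Because each CSV file is named after the topic title, a title containing a character such as `/` would break the path. A minimal sketch of a clean-up step follows. `safe_filename` is a hypothetical helper, the character set is an assumption covering common file-system restrictions, and 'CI/CD' is a made-up title used only to show the replacement.

```python
import re


def safe_filename(title: str) -> str:
    """Replace characters that are unsafe in file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'


# 'CI/CD' is a made-up title; the others come from the topic list.
for title in ['C++', 'ASP.NET', 'Command line interface', 'CI/CD']:
    print(safe_filename(title) + '.csv')
```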