## A Web Scraping Tool to Scrape Top GitHub Repositories for Different Topics

Description of Project: 
- Web scraping is the extraction of data from a website. The collected data is then exported into a format that is more useful to the user, such as a spreadsheet or an API.
- I scraped https://github.com/topics. I got a list of topics, and for every topic I got the topic title, topic page URL and topic description. For each topic, I got the top repositories listed on the topic page. For each repository, I collected the repo name, username, stars and repo URL. Then I created a CSV file for each topic in the following format: Repo Name, Username, Stars, Repo URL
- I used the following tools to create the project: Python, Beautiful Soup, requests, and Pandas.


## Project Outline:
- We are going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format: Repo Name, Username, Stars, Repo URL
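
To make the target CSV layout concrete, here is a minimal sketch using Python's standard `csv` module. The row is a made-up placeholder rather than scraped data, and the helper name `rows_to_csv` is hypothetical; the notebook itself writes its files with Pandas.

```python
import csv
import io

# Column order taken from the outline above.
FIELDNAMES = ["Repo Name", "Username", "Stars", "Repo URL"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row for illustration only (not a real scraped value).
example_rows = [
    {
        "Repo Name": "example-repo",
        "Username": "example-user",
        "Stars": 1200,
        "Repo URL": "https://github.com/example-user/example-repo",
    },
]

csv_text = rows_to_csv(example_rows)
print(csv_text)
```

Each topic would get one such file, with one line per repository under the header.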

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="web-scraping-github-repositories")

## 1. Use the requests library to download web pages

In [None]:
!pip install requests --upgrade --quiet

In [None]:
import requests

In [None]:
topics_url = 'https://github.com/topics'

In [None]:
response = requests.get(topics_url)

In [None]:
response.status_code

In [None]:
len(response.text)

In [None]:
page_contents = response.text

In [None]:
page_contents[:1000]

In [None]:
# Save the page locally; an explicit encoding avoids platform-dependent defaults
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

## 2. Use Beautiful Soup to parse and extract information


In [None]:
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
from bs4 import BeautifulSoup

In [None]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [None]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [None]:
len(topic_title_tags)

In [None]:
topic_title_tags[:5]

In [None]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags =  doc.find_all('p', {'class' : desc_selector })

In [None]:
topic_desc_tags[:5]

In [None]:
topic_title_tag0 = topic_title_tags[0]

In [None]:
div_tag = topic_title_tag0.parent

In [None]:
topic_link_tags = doc.find_all('a' , {'class': 'no-underline flex-1 d-flex flex-column'})

In [None]:
len(topic_link_tags)

In [None]:
topic0_url = "https://github.com" + topic_link_tags[0]['href']
print(topic0_url)

In [None]:
topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

In [None]:
topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
topic_descs[:5]

In [None]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags =  doc.find_all('p', {'class' : desc_selector })
topic_link_tags = doc.find_all('a' , {'class': 'no-underline flex-1 d-flex flex-column'})

topic_titles = []
for tag in topic_title_tags:
    topic_titles.append(tag.text)
print(topic_titles)

topic_descs = []
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
topic_descs[:5]

topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
topic_urls

In [None]:
!pip install pandas --quiet

In [None]:
import pandas as pd

In [None]:
topics_dict = {
    'title': topic_titles,
    'description': topic_descs,
    'url':topic_urls
}

In [None]:
topics_df = pd.DataFrame(topics_dict)

In [None]:
topics_df

## 3. Create CSV file(s) with the extracted information

In [None]:
topics_df.to_csv('topics.csv', index=False)

## Getting information out of a topic page

In [None]:
topic_page_url = topic_urls[0]

In [None]:
topic_page_url

In [None]:
response = requests.get(topic_page_url)

In [None]:
response.status_code

In [None]:
len(response.text)

In [None]:
topic_doc = BeautifulSoup(response.text,'html.parser')

In [None]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class } )

In [None]:
len(repo_tags)

In [None]:
a_tags = repo_tags[0].find_all('a')

In [None]:
a_tags[0].text.strip()

In [None]:
a_tags[1].text.strip()

In [None]:
repo_url = base_url + a_tags[1]['href']
print(repo_url)

In [None]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [None]:
len(star_tags)

In [None]:
star_tags[0].text.strip()

In [None]:
def parse_star_count(stars_str):
    # Convert a star count such as '69.5k' or '812' to an integer
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    return int(stars_str)

In [None]:
parse_star_count(star_tags[0].text.strip())
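
As a worked example of the conversion above, the following sketch shows a slightly more defensive variant. The name `parse_count` is hypothetical, and the handling of thousands separators and an `m` suffix is an assumption about formats the page might show, not something observed in this notebook.

```python
def parse_count(text):
    """Convert a display count such as '1,234', '69.5k' or '1.5m' to an int."""
    text = text.strip().lower().replace(",", "")
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text and text[-1] in multipliers:
        # round() rather than int(), so a product that lands a hair below
        # a whole number because of float arithmetic is not truncated.
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


print(parse_count("69.5k"))
print(parse_count("1,234"))
```

An empty string still raises `ValueError` here, which seems preferable to silently recording a wrong star count.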

In [None]:
def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # Use the star tag passed in for this repository, not the first tag in the global list
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
    

In [None]:
get_repo_info(repo_tags[0], star_tags[0])

In [None]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}
for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])
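
The index-based loop above assumes the repository tags and star tags line up one to one. As a sketch of the same pairing idea without indexing, the example below uses `zip()` on plain Python lists. The values are placeholders standing in for parsed tags, and `build_columns` is a hypothetical helper, not part of the notebook.

```python
def build_columns(records, column_names):
    """Convert row tuples into a dict of column lists (a shape pd.DataFrame accepts)."""
    columns = {name: [] for name in column_names}
    for record in records:
        for name, value in zip(column_names, record):
            columns[name].append(value)
    return columns


# Placeholder values standing in for parsed tags (not real scraped data).
repo_names = [("user-a", "repo-a"), ("user-b", "repo-b")]
star_counts = [1200, 950, 10]  # deliberately one entry longer

# zip() pairs items only while both sequences have entries,
# so the unmatched extra count is dropped.
records = [
    (user, repo, stars, "https://github.com/{}/{}".format(user, repo))
    for (user, repo), stars in zip(repo_names, star_counts)
]

table = build_columns(records, ("username", "repo_name", "stars", "repo_url"))
print(table["stars"])
```

Dropping unmatched items silently is a design choice; comparing the two lengths first and raising an error would be the stricter alternative.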
    

## 4. FINAL CODE

In [None]:
import os

def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Page cannot be loaded {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc

def get_repo_info(h3_tag, star_tag):
    # Returns all the required info about a repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # Use the star tag passed in for this repository, not the first tag in the global list
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL, and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3',{'class':h3_selection_class } )
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
    
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    # Get repo info
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
    if os.path.exists(path):
        print("The file {} already exists. Skipping......".format(path))
        return 
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
    
    

I then wrote a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic, create a CSV of the top repos for the topic


In [None]:
def get_topic_titles(doc):
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class' : desc_selector })
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a' , {'class': 'no-underline flex-1 d-flex flex-column'})
    topic_urls = []
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Page cannot be loaded {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)

In [None]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    os.makedirs('data', exist_ok=True)
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'], 'data/{}.csv'.format(row['title'])) 
    

In [None]:
scrape_topics_repos()
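
Because each CSV path is built directly from the topic title, a title containing spaces or path separators could produce an awkward or invalid file name. The sketch below shows one way to sanitize titles first. The helper `topic_filename` and the set of allowed characters are assumptions for illustration, not part of the original notebook.

```python
import re


def topic_filename(title, directory="data"):
    """Build a CSV path from a topic title, replacing risky characters."""
    # Keep letters, digits, dot, underscore, plus, hash and hyphen;
    # collapse every other run of characters into a single underscore.
    safe = re.sub(r"[^A-Za-z0-9._+#-]+", "_", title.strip()).strip("_")
    return "{}/{}.csv".format(directory, safe or "untitled")


print(topic_filename("Amazon Web Services"))
print(topic_filename("C++"))
```

Inside `scrape_topics_repos`, the path passed to `scrape_topic` could then come from `topic_filename(row['title'])` instead of the raw format string.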

In [None]:
import jovian 

In [None]:
jovian.commit()