
# Scraping Top Repositories for Topics on GitHub




Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning.

GitHub is a code hosting platform for collaboration and version control.
GitHub lets you (and others) work together on projects.

This project is written in Python and uses the Requests, Beautiful Soup, and pandas libraries.

Here are the steps we will follow:

- We are going to scrape https://github.com/topics.
- We will get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description.
- For each topic, we'll get the top 25 repositories in the topic from the topic page.
- For each repository, we'll grab the repo name, username, stars, and repo URL.
- For each topic, we'll create a CSV file in the following format:
  - username,repo_name,stars,repo_url
  - mrdoob,three.js,90000,https://github.com/mrdoob/three.js
  - pmndrs,react-three-fiber,22100,https://github.com/pmndrs/react-three-fiber


## Scraping the list of topics from GitHub

We will do this in three steps:

- Use Requests to download the page.
- Use Beautiful Soup (BS4) to parse the page and extract information.
- Convert the results to a pandas DataFrame.

Let's write a function to download the page:

In [None]:
import requests
from bs4 import BeautifulSoup

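def get_topics_page():
    # URL of the page that lists GitHub topics
    topics_url = 'https://github.com/topics'
    # Download the page
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc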

In [None]:
doc=get_topics_page()

Let's create some helper functions to parse information from the page.

To get topic titles, we can pick `p` tags with the class...

In [None]:
def get_topic_titles(doc):
    # Class attribute shared by the <p> tags that hold topic titles
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    # Collect the text of each matching tag
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles


`get_topic_titles` can be used to get the list of titles:

In [None]:
titles=get_topic_titles(doc)

In [None]:
len(titles)

30

For example:

In [None]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']
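
To see how this class-based lookup behaves without downloading anything, here is a minimal sketch that runs the same `find_all` call on a small hand-written HTML snippet. The snippet is made up for illustration (it is not copied from GitHub), and the `select` variant is an alternative shown for comparison, not what the functions in this project use.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption for illustration, not real GitHub HTML)
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">A short description.</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Same approach as get_topic_titles: match the full class string
title_tags = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})
sample_titles = [tag.text for tag in title_tags]

# Alternative: a CSS selector that only requires some of the classes
selected_titles = [tag.text for tag in sample_doc.select('p.f3.lh-condensed')]

print(sample_titles)
print(selected_titles)
```

Matching the full class string depends on the exact order and set of class names, so the selectors may need updating whenever GitHub changes its markup.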

Similarly, we define functions for descriptions and URLs.

In [None]:
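def get_topic_descs(doc):
    # Class attribute shared by the <p> tags that hold topic descriptions
    desc_selector = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': desc_selector})
    # Strip surrounding whitespace from each description
    topic_descriptions = []
    for tag in topic_desc_tags:
        topic_descriptions.append(tag.text.strip())
    return topic_descriptions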

To get topic URLs, we can pick `a` tags with the class...

In [None]:
def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-grow-0'})
    topic_urls = []
    # The href values are relative, so prepend the site's base URL
    base_url = 'https://github.com'
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls
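
As an alternative to string concatenation, relative links can be resolved with `urljoin` from the standard library. This is a minimal sketch, not what `get_topic_urls` does; the helper name `to_absolute_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://github.com'

def to_absolute_url(href, base_url=BASE_URL):
    # Resolve a relative href such as '/topics/3d' against the base URL;
    # an href that is already absolute is returned unchanged
    return urljoin(base_url, href)

print(to_absolute_url('/topics/3d'))
print(to_absolute_url('https://github.com/topics/ajax'))
```

The practical difference is that an already absolute link is not prefixed a second time.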

Let's put it all together into a single function.

In [None]:
import pandas as pd

def scrape_topics():
    # Download and parse the topics page using the helper defined earlier
    doc = get_topics_page()
    # Build one column per piece of information extracted from the page
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topic_descs(doc),
        'url': get_topic_urls(doc)
    }
    return pd.DataFrame(topics_dict)
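
`pd.DataFrame` expects the lists in the dictionary to have the same length, so if one of the selectors stops matching, the call fails with a `ValueError`. A small check before building the DataFrame can make it easier to see which column is off. This is only a sketch; `check_column_lengths` is a hypothetical helper, not part of pandas, and the sample data is made up.

```python
def check_column_lengths(columns):
    # columns maps a column name to a list of values, like topics_dict
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Made-up sample data for illustration
sample_columns = {
    'title': ['3D', 'Ajax'],
    'description': ['First description', 'Second description'],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_column_lengths(sample_columns))
```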

## Getting the top 25 repositories from a topic page


In [None]:
import pandas as pd
import os

In [None]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    # Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc
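
Downloading many topic pages in a row can occasionally fail, for example if the site throttles requests. One option is to retry a failed download after a short pause. The sketch below is an assumption about how this could be done, not part of the original project; `with_retries` is a hypothetical helper that wraps any download function, and `flaky_fetch` is a stand-in used only so the example runs without network access.

```python
import time

def with_retries(fetch, url, attempts=3, delay=1.0):
    # Call fetch(url), retrying up to `attempts` times if it raises
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            print('Attempt {} for {} failed: {}'.format(attempt, url, error))
            if attempt < attempts:
                # Wait a little longer after each failure
                time.sleep(delay * attempt)
    raise last_error

# Stand-in download function that fails once, then succeeds
demo_calls = []

def flaky_fetch(url):
    demo_calls.append(url)
    if len(demo_calls) < 2:
        raise Exception('temporary failure')
    return 'page for {}'.format(url)

demo_result = with_retries(flaky_fetch, 'https://github.com/topics/3d', delay=0)
print(demo_result)
```

With the functions in this notebook, the call could look like `with_retries(get_topic_page, 'https://github.com/topics/3d')`.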

In [None]:
doc=get_topic_page('https://github.com/topics/3d')

In [None]:
def parse_star_count(star_str):
    # Convert strings such as '91.3k' or '987' to an integer count
    star_str = star_str.strip()
    if star_str[-1] == 'k':
        return int(float(star_str[:-1]) * 1000)
    return int(star_str)
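
Because `int()` truncates, a product that lands just below a whole number due to floating-point rounding would lose one star. A slightly more defensive variant rounds instead of truncating. This is a sketch only; `parse_star_count_rounded` is a hypothetical name, and the handling of thousands separators and an uppercase `K` is an assumption about inputs the page might contain.

```python
def parse_star_count_rounded(star_str):
    # Remove surrounding whitespace and any thousands separators
    star_str = star_str.strip().replace(',', '')
    if star_str.lower().endswith('k'):
        # '91.3k' -> 91.3 * 1000, rounded to the nearest integer
        return round(float(star_str[:-1]) * 1000)
    return int(star_str)

for text in ['91.3k', ' 22.4k ', '987', '1,234']:
    print(repr(text), parse_star_count_rounded(text))
```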

In [None]:
def get_repo_info(h3_tag, star_tag):
    # The first <a> tag holds the username, the second the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = 'https://github.com' + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url

In [None]:
def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

    topic_repo_dict = {'username': [], 'repo_name': [], 'stars': [], 'repo_url': []}

    # Get repo info for each repository on the page
    for i in range(len(repo_tags)):
        username, repo_name, stars, repo_url = get_repo_info(repo_tags[i], star_tags[i])
        topic_repo_dict['username'].append(username)
        topic_repo_dict['repo_name'].append(repo_name)
        topic_repo_dict['stars'].append(stars)
        topic_repo_dict['repo_url'].append(repo_url)

    return pd.DataFrame(topic_repo_dict)


In [None]:
def scrape_topic(topic_url, path):
    # The CSV is written to path + '.csv', so check for that file
    csv_path = path + '.csv'
    if os.path.exists(csv_path):
        print("The file {} already exists. Skipping...".format(csv_path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(csv_path, index=False)


### Putting it all together
- We have a function to get the list of topics.
- We have a function to create a CSV file of scraped repos from a topic page.
- Let's create a function to put them together.


In [None]:
def scrape_topics_repos():
    print('scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}'.format(row['title']))
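
Topic titles are used as file names here, which works for the titles on this page but could be a problem for a title containing a character such as `/`. One alternative is to name each file after the last segment of the topic URL. This is a sketch of that idea, not what `scrape_topics_repos` does; `topic_slug` is a hypothetical helper.

```python
def topic_slug(topic_url):
    # Take the last path segment of a topic URL, ignoring a trailing slash
    return topic_url.rstrip('/').rsplit('/', 1)[-1]

print(topic_slug('https://github.com/topics/3d'))
print('data/{}'.format(topic_slug('https://github.com/topics/ajax/')))
```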

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics:

In [None]:
scrape_topics_repos()

scraping list of topics
scraping top repositories for "3D"
scraping top repositories for "Ajax"
scraping top repositories for "Algorithm"
scraping top repositories for "Amp"
scraping top repositories for "Android"
scraping top repositories for "Angular"
scraping top repositories for "Ansible"
scraping top repositories for "API"
scraping top repositories for "Arduino"
scraping top repositories for "ASP.NET"
scraping top repositories for "Atom"
scraping top repositories for "Awesome Lists"
scraping top repositories for "Amazon Web Services"
scraping top repositories for "Azure"
scraping top repositories for "Babel"
scraping top repositories for "Bash"
scraping top repositories for "Bitcoin"
scraping top repositories for "Bootstrap"
scraping top repositories for "Bot"
scraping top repositories for "C"
scraping top repositories for "Chrome"
scraping top repositories for "Chrome extension"
scraping top repositories for "Command line interface"
scraping top repositories for "Clojure"
...

We can check that the CSVs were created properly:

In [None]:
# Read and display a CSV using pandas

In [None]:
pd.read_csv('data/3D.csv')

Unnamed: 0,username,repo_name,stars,repo_url
0,mrdoob,three.js,91300,https://github.com/mrdoob/three.js
1,pmndrs,react-three-fiber,22400,https://github.com/pmndrs/react-three-fiber
2,libgdx,libgdx,21500,https://github.com/libgdx/libgdx
3,BabylonJS,Babylon.js,20500,https://github.com/BabylonJS/Babylon.js
4,ssloy,tinyrenderer,16800,https://github.com/ssloy/tinyrenderer
5,aframevr,aframe,15300,https://github.com/aframevr/aframe
6,lettier,3d-game-shaders-for-beginners,15100,https://github.com/lettier/3d-game-shaders-for...
7,FreeCAD,FreeCAD,13900,https://github.com/FreeCAD/FreeCAD
8,CesiumGS,cesium,10300,https://github.com/CesiumGS/cesium
9,metafizzy,zdog,9700,https://github.com/metafizzy/zdog
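
To check every file rather than just one, the row counts of all CSVs in a folder can be collected. The sketch below uses only the standard library and writes a small made-up file to a temporary folder so that it runs on its own; `count_csv_rows` is a hypothetical helper, and in this project the folder would be `data`.

```python
import csv
import os
import tempfile

def count_csv_rows(folder):
    # Map each CSV file name in the folder to its number of data rows
    counts = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.csv'):
            continue
        with open(os.path.join(folder, name), newline='', encoding='utf-8') as f:
            rows = list(csv.reader(f))
        # Subtract the header row, but never go below zero for an empty file
        counts[name] = max(len(rows) - 1, 0)
    return counts

# Demo with made-up data in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, 'sample.csv')
    with open(sample_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['username', 'repo_name', 'stars', 'repo_url'])
        writer.writerow(['mrdoob', 'three.js', 91300, 'https://github.com/mrdoob/three.js'])
    demo_counts = count_csv_rows(folder)

print(demo_counts)
```

A file with a count of zero would suggest that the selectors matched nothing on that topic page.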


## Future work


Ideas for future work:

- Expand the scope of scraping to other GitHub categories, such as trending and collections.

- Scrape more topics than those listed on the first page of https://github.com/topics.

- Use other tools, such as Scrapy and Selenium.