# Scraping Top Repositories from GitHub

TODO (Intro):
- Introduction about WebScraping
- Introduction about the GitHub and the problem statement
- Mention the tools you're using (Python, requests, Beautiful Soup, Pandas)

### Project Outline

Here are the steps we'll follow:
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx

## Scrape the list of topics from GitHub

Explain how you'll do it

- Using requests to download the page
- Using bs4 to parse and extract information
- Convert to a Pandas dataframe

Let's write a function to download the page:

In [1]:
import os
import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
def get_topic_page():
    # Download the GitHub topics listing page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    # Stop early if the request did not succeed
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    # Parse the HTML so it can be searched with Beautiful Soup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [3]:
doc = get_topic_page()

Add some explanation

Let's create some helper functions to parse information from the page.



To get topic titles, we can pick `p` tags with the `class`...

![](https://i.imgur.com/nR2YFka.png)
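
As a minimal sketch of how this kind of class-based lookup behaves, the snippet below runs Beautiful Soup on a small hand-written HTML fragment. The fragment is made up for illustration and is not GitHub's real markup; only the class string is taken from this notebook. When `class` is given as a string with several names, `find_all` compares it against the exact value of the attribute, while a CSS selector passed to `select` matches the listed classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, written by hand for this example
sample_html = """
<div>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
  <p class="f5 mb-0 mt-1">Description text</p>
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary"> Ajax </p>
</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the full class string
title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
exact_tags = sample_doc.find_all('p', {'class': title_class})
exact_titles = [tag.text.strip() for tag in exact_tags]

# CSS selector: matches any <p> carrying both classes, in any order
css_tags = sample_doc.select('p.f3.Link--primary')
css_titles = [tag.text.strip() for tag in css_tags]

print(exact_titles)
print(css_titles)
```

The exact-string form is the one used throughout this notebook. It stops matching if the page adds, removes or reorders a class, so the selector is worth rechecking whenever a function starts returning an empty list.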

In [4]:
def get_topic_titles(doc):
    # Each topic title is a <p> tag with this exact class string
    selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': selection_class})
    # Keep only the stripped text of each tag
    return [tag.text.strip() for tag in topic_title_tags]


`get_topic_titles` can be used to get the list of titles

In [5]:
titles = get_topic_titles(doc)

In [6]:
len(titles)

30

In [7]:
titles[:5]

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']

Similarly, we have defined functions for descriptions and URLs.

In [8]:
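def get_topics_desc(doc):
    # NOTE: this selector is carried over unchanged from the original cell;
    # confirm it against the topics page markup before relying on it
    desc_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    topic_desc_tags = doc.find_all('h3', {'class': desc_selection_class})
    # Keep only the stripped text of each description tag
    return [tag.text.strip() for tag in topic_desc_tags]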


TODO - example and explanation

In [9]:
def get_topics_urls(doc):
    # Each topic card links to its page through a relative href
    topic_link_tags = doc.find_all('a', {'class': 'd-flex no-underline'})
    base_url = 'https://github.com'
    # Prefix the relative href with the site root to get a full URL
    return [base_url + tag['href'] for tag in topic_link_tags]
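
`get_topics_urls` builds each URL by concatenating the site root and the link's `href`. As an alternative sketch, the standard library's `urljoin` resolves an href against a base URL and leaves already-absolute links untouched. The helper name `to_absolute_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'

def to_absolute_url(href):
    # Hypothetical helper: resolve a (possibly relative) href against the site root
    return urljoin(base_url, href)

relative_result = to_absolute_url('/topics/3d')
absolute_result = to_absolute_url('https://github.com/topics/ajax')

print(relative_result)
print(absolute_result)
```

Plain concatenation is fine as long as every `href` starts with `/`; `urljoin` only matters if the page ever mixes relative and absolute links.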

Let's put this all together into a single function

In [10]:
def scrape_topics():
    # Download and parse the topics listing page
    topics_url = 'https://github.com/topics'
    response = requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topics_url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # One column per field; the three lists must have the same length
    topics_dict = {
        'title': get_topic_titles(doc),
        'description': get_topics_desc(doc),
        'url': get_topics_urls(doc)
    }
    return pd.DataFrame(topics_dict)
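
`pd.DataFrame` raises a `ValueError` when a dictionary of lists has columns of different lengths, which can happen here if one selector matches fewer tags than the others. The sketch below checks the lengths first so the error names the column that came up short. The helper `build_topics_frame` and the sample rows are hypothetical, used only for illustration.

```python
import pandas as pd

def build_topics_frame(titles, descriptions, urls):
    # Hypothetical helper: fail with a clear message if the lists differ in length
    lengths = {
        'title': len(titles),
        'description': len(descriptions),
        'url': len(urls),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return pd.DataFrame({
        'title': titles,
        'description': descriptions,
        'url': urls,
    })

sample_df = build_topics_frame(
    ['3D', 'Ajax'],
    ['Description of 3D', 'Description of Ajax'],  # placeholder text
    ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
)
print(sample_df.shape)
```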

## Get the top repositories from the topic page

TODO - explanation and steps

In [11]:
def get_topic_page(topic_url):
    # Download the page
    response = requests.get(topic_url)
    # Check successful response
    if response.status_code != 200:
        raise Exception('Fail to load page {}'.format(topic_url))
    # Parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

In [12]:
doc = get_topic_page('https://github.com/topics/3d')

TODO - talk about the h3 tags 

In [13]:
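def get_repo_info(h3_tag, star_tag):
    # Returns the username, repo name, star count and URL for one repository
    base_url = 'https://github.com'
    # The h3 tag holds two links: the owner first, then the repository
    a_tags = h3_tag.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    # parse_star_count is a helper that turns the star label into an integer
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url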
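
`get_repo_info` calls `parse_star_count`, which is not defined anywhere in this notebook. Here is a minimal sketch of what it might look like, assuming the star label is either a plain integer such as `850` or a number with a `k` suffix such as `69.7k`. That label format is an assumption; the sample CSV rows in the outline (69700 and 18300) are at least consistent with it.

```python
def parse_star_count(stars_str):
    # Assumed label formats: "850", "1,234" or "69.7k" (k = thousands)
    stars_str = stars_str.strip().replace(',', '')
    if stars_str.lower().endswith('k'):
        # round() guards against floating-point error before truncating
        return int(round(float(stars_str[:-1]) * 1000))
    return int(stars_str)

print(parse_star_count('69.7k'))
print(parse_star_count('18.3k'))
print(parse_star_count('850'))
```

The function must be defined in an earlier cell than the first call to `get_repo_info`.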

TODO - show an example about get_repo_info

In [14]:
def get_topic_repos(topic_doc):
    # Get h3 tags containing repo title, repo URL and username
    h3_selection_class = 'f3 color-text-secondary text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
    # Get star tags
    star_tags = topic_doc.find_all('a', {'class': 'social-count float-none'})

    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }

    # Get repo info, pairing each repo tag with the star tag at the same position
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])

    return pd.DataFrame(topic_repos_dict)

TODO - Show an example

In [19]:
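def scrape_topic(topic_url, path):
    # Skip topics whose CSV file has already been written
    if os.path.exists(path):
        print("The file {} already exists. Skipping...".format(path))
        return
    # Download the topic page, extract its repositories and save them
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)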

TODO - Show an example

## Putting it all together

- We have a function to get the list of topics
- We have a function to create a CSV file of scraped repos for a topic page
- Let's create a function to put them together

In [20]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()
    
    os.makedirs('data',exist_ok=True)
    
    for index, row in topics_df.iterrows():
        print('Scraping top repositories for "{}"'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']) )

Let's run it to scrape the top repos for all the topics on the first page of https://github.com/topics

In [23]:
# scrape_topics_repos()

We can check that the CSVs were created properly

In [24]:
# read a CSV using pandas and show the data collected
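
A minimal sketch of that check, assuming a CSV laid out like the sample in the project outline. To stay runnable without a prior scrape, it writes the two sample rows from the outline to a temporary file and reads them back; the file name `3D.csv` is only an example. Note that `get_topic_repos` as written produces the column names `username`, `repo_name`, `stars` and `repo_url`, so files from a real run would carry those headers instead.

```python
import os
import tempfile
import pandas as pd

# Sample rows copied from the format shown in the project outline
sample_csv = (
    'Repo Name,Username,Stars,Repo URL\n'
    'three.js,mrdoob,69700,https://github.com/mrdoob/three.js\n'
    'libgdx,libgdx,18300,https://github.com/libgdx/libgdx\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, '3D.csv')
    with open(csv_path, 'w') as f:
        f.write(sample_csv)
    # Load the file into a DataFrame while the temporary directory still exists
    repos_df = pd.read_csv(csv_path)

print(repos_df.head())
```

After an actual run of `scrape_topics_repos`, the same `pd.read_csv` call would point at one of the files under `data/`.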

## References and Future Work

Summary of what we did

- We scraped the GitHub topics page to get the 30 topics listed on its first page
- For each topic, we collected its top repositories with their names, owners, star counts and URLs

References to links you found useful

- https://www.youtube.com/watch?v=RKsLLG-bzEY&t=555s
- https://github.com/topics

Ideas for future work

- Based on the data we extracted, we can analyse it with pandas DataFrames to learn more about the topics on GitHub
- We can also extract data from different websites in almost the same way
- Filmography of Actors/Directors (Wikipedia)
- Discography of an Artist (Wikipedia)
- Dataset of Movies (TMDb)
- Dataset of TV Shows (TMDb)
- Collections of Popular Repositories (GitHub)
- Dataset of Books (BooksToScrape)
- Dataset of Quotes (QuotesToScrape)
- Bibliography of an Author (Wikipedia)
- Country Demographics (Wikipedia)
- Stocks Prices (Yahoo Finance)
- Create a Dataset of YouTube Videos (YouTube)
- Songs Dataset (AZLyrics)
- Scrape a Popular Blog
- Weekly Top Songs (Top 40 Weekly)
- Video Games Dataset (Steam)