# Webscraping-github-topics

## Pick a website and describe your objective

- Browse through different sites and pick one to scrape. Check the "Project Ideas" section for inspiration.
- Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.
- Summarize your project idea and outline your strategy in a Jupyter notebook. Use the "New" button above.

## Project Outline
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get the topic title, topic page URL, and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic we'll create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
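
The per-topic CSV layout above can be produced with Python's standard `csv` module. The following is a minimal sketch, not the method this notebook uses (it relies on pandas further down). The helper name `repos_to_csv` is an assumption for illustration, and the sample rows are copied from the example above.

```python
import csv
import io

# Column order taken from the CSV format shown in the outline above
FIELDNAMES = ['Repo Name', 'Username', 'Stars', 'Repo URL']

def repos_to_csv(rows):
    """Return CSV text for a list of dicts keyed by FIELDNAMES (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sample_rows = [
    {'Repo Name': 'three.js', 'Username': 'mrdoob', 'Stars': 69700,
     'Repo URL': 'https://github.com/mrdoob/three.js'},
    {'Repo Name': 'libgdx', 'Username': 'libgdx', 'Stars': 18300,
     'Repo URL': 'https://github.com/libgdx/libgdx'},
]

print(repos_to_csv(sample_rows))
```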

## Use the requests library to download web pages

In [74]:
!pip install requests --upgrade --quiet

In [75]:
import requests

In [76]:
topic_url = 'https://github.com/topics'

In [77]:
response = requests.get(topic_url)

In [78]:
response.status_code

200
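
A status code of 200 means the request succeeded. The check can be wrapped in a small helper so that later cells fail early on an error page. This is only a sketch: `ensure_success` is a hypothetical name, and treating every 2xx code as success is an assumption. The `requests` library also provides `response.raise_for_status()` for a similar purpose.

```python
def ensure_success(status_code, url):
    """Raise if status_code is outside the 2xx range (hypothetical helper)."""
    if not 200 <= status_code < 300:
        raise RuntimeError(
            'Failed to load page {} (status {})'.format(url, status_code))
    return status_code

# Usage with the values from the cells above
ensure_success(200, 'https://github.com/topics')
```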

In [79]:
len(response.text)

155309

In [80]:
page_contents = response.text
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n  \n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-0946cdc16f15.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-3946c959759a.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="h

In [81]:
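# Write with an explicit encoding so non-ASCII characters are saved consistently across platforms
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)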
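
Saving the page makes it possible to parse it again later without another request. A minimal sketch of that round trip follows. The helper names `save_page` and `load_page` are hypothetical, and UTF-8 is assumed because the downloaded page declares `<meta charset="utf-8">`.

```python
from pathlib import Path

def save_page(path, contents):
    """Write page HTML to disk as UTF-8 (hypothetical helper)."""
    Path(path).write_text(contents, encoding='utf-8')

def load_page(path):
    """Read previously saved page HTML back into a string."""
    return Path(path).read_text(encoding='utf-8')
```

With these in place, `page_contents = load_page('webpage.html')` would restore the text saved in the cell above.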

## Use Beautiful Soup to parse and extract information

In [82]:
!pip install beautifulsoup4 --upgrade --quiet

In [83]:
from bs4 import BeautifulSoup

In [84]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [85]:
type(doc)

bs4.BeautifulSoup

In [86]:
p_tags = doc.find_all('p')

In [87]:
p_tags[:10]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Scala
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Scala is an object-oriented programming language.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Ruby
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Ruby is a scripting language designed for simplified object-oriented programming.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Express
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Express is a minimal Node.js framework for web and mobile applications.</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f3 lh-condensed mb-0 mt

In [88]:
topic_title_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})

In [89]:
len(topic_title_tags)

30

In [90]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]
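
Passing a full class string to `find_all` matches tags whose `class` attribute is exactly that string. That appears to be why the results start at 3D and leave out the first few topics in the `p_tags` output (Scala, Ruby, Express), whose classes are listed in a different order. The sketch below illustrates the behaviour on a small hand-written HTML fragment (an assumption, not GitHub's real markup) and contrasts it with a CSS selector passed to `select`, which ignores class order.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the two class orderings seen above
html = '''
<a href="/topics/scala">
  <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">Scala</p>
</a>
<a href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
</a>
'''
sample_doc = BeautifulSoup(html, 'html.parser')

# Exact class-string match: only the tag with this precise attribute value
exact = sample_doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# CSS selector: any <p> carrying both classes, regardless of order
by_css = sample_doc.select('p.f3.Link--primary')

print([tag.text.strip() for tag in exact])
print([tag.text.strip() for tag in by_css])
```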

In [91]:
topic_desc_tags = doc.find_all('p','f5 color-fg-muted mb-0 mt-1')

In [92]:
len(topic_desc_tags)

30

In [93]:
topic_desc_tags[:5]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>,
 <p class="f5 color-fg-muted mb-0 mt-1">
           Android is an operating system built by Google designed for mobile devices.
         </p>]

In [94]:
topic_link_tags = []
for x in topic_title_tags:
    topic_link_tags.append(x.parent)

In [95]:
topic_link_tags[:5]

[<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/ajax">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/algorithm">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>
 <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>
 </a>,
 <a class="no-underline flex-1 d-flex flex-column" href="/topics/amphp">
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>
 <p cl

In [96]:
topic_titles = []
topic_descs = []
topic_urls = []
base_url = 'https://github.com'
for tag in topic_title_tags:
    topic_titles.append(tag.text)
for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

In [97]:
topic_titles[:10]

['3D',
 'Ajax',
 'Algorithm',
 'Amp',
 'Android',
 'Angular',
 'Ansible',
 'API',
 'Arduino',
 'ASP.NET']

In [98]:
topic_descs[:10]

['3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
 'Ajax is a technique for creating interactive web applications.',
 'Algorithms are self-contained sequences that carry out a variety of tasks.',
 'Amp is a non-blocking concurrency library for PHP.',
 'Android is an operating system built by Google designed for mobile devices.',
 'Angular is an open source web application platform.',
 'Ansible is a simple and powerful automation engine.',
 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.',
 'Arduino is an open source platform for building electronic devices.',
 'ASP.NET is a web framework for building modern web apps and services.']

In [99]:
topic_urls[:10]

['https://github.com/topics/3d',
 'https://github.com/topics/ajax',
 'https://github.com/topics/algorithm',
 'https://github.com/topics/amphp',
 'https://github.com/topics/android',
 'https://github.com/topics/angular',
 'https://github.com/topics/ansible',
 'https://github.com/topics/api',
 'https://github.com/topics/arduino',
 'https://github.com/topics/aspnet']
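
The URLs above are built by concatenating `base_url` with each tag's relative `href`. An alternative, sketched here with a few sample `href` values copied from the output, is `urllib.parse.urljoin` from the standard library, which also copes with a trailing slash on the base URL.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
# Sample href values taken from the link tags shown earlier
hrefs = ['/topics/3d', '/topics/ajax', '/topics/algorithm']

topic_urls = [urljoin(base_url, href) for href in hrefs]
print(topic_urls)
```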

In [100]:
import pandas as pd

In [101]:
topics_dict = {
    'title':topic_titles,
    'description':topic_descs,
    'url':topic_urls
}

In [102]:
topics_df = pd.DataFrame(topics_dict)
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D refers to the use of three-dimensional grap... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source platform for buildin... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |
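
`pd.DataFrame` requires the three lists to have the same length, so a mismatch between titles, descriptions, and URLs (for example after a change in the page markup) surfaces as an error at this step. A small pre-check can make that failure easier to read. This is a sketch; `check_equal_lengths` is a hypothetical helper and not part of pandas.

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in the dict differ in length (hypothetical helper)."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Two sample rows copied from the outputs above
sample_topics = {
    'title': ['3D', 'Ajax'],
    'description': [
        '3D refers to the use of three-dimensional graphics, modeling, and animation in various industries.',
        'Ajax is a technique for creating interactive web applications.',
    ],
    'url': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
}
print(check_equal_lengths(sample_topics))
```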


## Create CSV file(s) with the extracted information

In [103]:
topics_df.to_csv('topics.csv', index=False)

## Getting Information out of a Topic page

## Putting it all together

In [104]:
def parse_star_count(star_text):
    star_text = star_text.strip()
    if star_text[-1]=='k':
        return int(float(star_text[:-1])*1000)
    return int(star_text)
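
`parse_star_count` converts text such as `69.7k` into an integer. Because `int()` truncates, floating-point rounding could in principle lose a unit for some inputs, and labels with thousands separators or an `m` suffix would raise an error. The variant below is a sketch that covers those cases. Whether GitHub ever renders such labels is not confirmed here, and `parse_count` is a hypothetical name.

```python
def parse_count(text):
    """Parse labels like '987', '1,234', '69.7k' or '1.2m' into an int (hypothetical variant)."""
    text = text.strip().lower().replace(',', '')
    multipliers = {'k': 1_000, 'm': 1_000_000}
    if text and text[-1] in multipliers:
        # round() rather than int() so float error cannot truncate the result
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

for label in ['987', '1,234', '69.7k', '1.2m']:
    print(label, '->', parse_count(label))
```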

In [105]:
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_topic_page(topic_url):
    response = requests.get(topic_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc

def get_repo_info(h3_tag,star_tag):
    a_tags = h3_tag.find_all('a')
    user_name = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    star_count = parse_star_count(star_tag.text.strip())
    return user_name,repo_name,star_count,repo_url

def get_topic_repos(topic_doc):
    repo_tags = topic_doc.find_all('h3',{'class':'f3 color-fg-muted text-normal lh-condensed'})
    star_tags = topic_doc.find_all('span',{'class':'Counter js-social-count'})
    topic_repos_dict = {'Username':[],'Reponame':[],'Stars':[],'URL':[]}
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i],star_tags[i])
        topic_repos_dict['Username'].append(repo_info[0])
        topic_repos_dict['Reponame'].append(repo_info[1])
        topic_repos_dict['Stars'].append(repo_info[2])
        topic_repos_dict['URL'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url,path):
    if os.path.exists(path):
        print('The file already exists {}'.format(path))
        return
    topic_df = get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path, index=False)
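
`scrape_topic` skips the download when the CSV already exists, so a rerun only fetches what is missing. The same pattern, separated from the network code, can be sketched as follows. `write_if_missing` and its `produce` callback are hypothetical names used only for illustration.

```python
import os

def write_if_missing(path, produce):
    """Write produce() to path unless the file exists; return True if written."""
    if os.path.exists(path):
        print('The file already exists {}'.format(path))
        return False
    # produce() is only called when the file has to be created
    with open(path, 'w', encoding='utf-8') as f:
        f.write(produce())
    return True
```

Passing the expensive step as a callback means it runs only when the file is absent, which mirrors the early `return` in `scrape_topic`.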

In [106]:
base_url = 'https://github.com'
def get_topic_titles(doc):
    topic_title_tags = doc.find_all('p',{'class':'f3 lh-condensed mb-0 mt-1 Link--primary'})
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    return topic_titles

def get_topic_descs(doc):
    topic_desc_tags = doc.find_all('p','f5 color-fg-muted mb-0 mt-1')
    topic_descs = []
    for tag in topic_desc_tags:
        topic_descs.append(tag.text.strip())
    return topic_descs

def get_topic_urls(doc):
    topic_link_tags = doc.find_all('a',{'class':'no-underline flex-grow-0'})
    topic_urls = []
    for tag in topic_link_tags:
        topic_urls.append(base_url + tag['href'])
    return topic_urls

def scrape_topics():
    main_url = 'https://github.com/topics'
    response = requests.get(main_url)
    if response.status_code != 200:
        # Report the URL that was actually requested (topic_url is not defined in this function)
        raise Exception('Failed to load page {}'.format(main_url))
    doc = BeautifulSoup(response.text,'html.parser')
    topics_dict = {'title':get_topic_titles(doc),
                   'description':get_topic_descs(doc),
                   'url':get_topic_urls(doc)}
    return pd.DataFrame(topics_dict)

In [107]:
def scrape_topics_repos():
    print('Scraping list of topics')
    topics_df = scrape_topics()

    os.makedirs('data',exist_ok=True)
    for index,row in topics_df.iterrows():
        print('Scraping top repositories for {}'.format(row['title']))
        scrape_topic(row['url'],'data/{}.csv'.format(row['title']))

In [108]:
scrape_topics_repos()

Scraping list of topics
Scraping top repositories for 3D
The file already exists data/3D.csv
Scraping top repositories for Ajax
The file already exists data/Ajax.csv
Scraping top repositories for Algorithm
The file already exists data/Algorithm.csv
Scraping top repositories for Amp
The file already exists data/Amp.csv
Scraping top repositories for Android
The file already exists data/Android.csv
Scraping top repositories for Angular
The file already exists data/Angular.csv
Scraping top repositories for Ansible
The file already exists data/Ansible.csv
Scraping top repositories for API
The file already exists data/API.csv
Scraping top repositories for Arduino
The file already exists data/Arduino.csv
Scraping top repositories for ASP.NET
The file already exists data/ASP.NET.csv
Scraping top repositories for Atom
The file already exists data/Atom.csv
Scraping top repositories for Awesome Lists
The file already exists data/Awesome Lists.csv
Scraping top repositories for Amazon
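
The output path is built directly from the topic title. That works for names such as `ASP.NET` or `Awesome Lists`, but it would fail for a title containing a path separator. A sketch of one possible guard follows. `safe_filename` is a hypothetical helper, the underscore replacement is an assumption, and the title `C/C++` is a made-up input used only to show the effect.

```python
import re

def safe_filename(title):
    """Replace characters that are unsafe in file names with underscores (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned or 'untitled'

for title in ['3D', 'ASP.NET', 'Awesome Lists', 'C/C++']:
    print('data/{}.csv'.format(safe_filename(title)))
```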

## Document and share your work