# Pick a website and describe your objective
- **Browse through different sites and pick one to scrape.**
- **Identify the information you'd like to scrape from the site. Decide the format of the output CSV file.**
- **Summarize your project idea and outline your strategy in a Jupyter Notebook.**

## Project Outline
- We're going to scrape https://github.com/topics
- We'll get a list of topics. For each topic, we'll get a topic title, topic page URL and topic description
- For each topic, we'll get the top 25 repositories in the topic from the topic page
- For each repository, we'll grab the repo name, username, stars and repo URL
- For each topic, we'll create a CSV file in the following format:
```
Repo Name,Username,Stars,Repo URL
three.js,mrdoob,69700,https://github.com/mrdoob/three.js
libgdx,libgdx,18300,https://github.com/libgdx/libgdx
```
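
The format above can be produced with Python's built-in `csv` module. This is a minimal sketch, assuming the rows have already been scraped into tuples; the `write_repos_csv` helper and the `rows` list are illustrative names, not part of the notebook, which uses pandas for this step later on.

```python
import csv
import io

def write_repos_csv(rows, out_file):
    """Write (repo_name, username, stars, repo_url) tuples as CSV with a header row."""
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(['Repo Name', 'Username', 'Stars', 'Repo URL'])
    writer.writerows(rows)

# Sample rows copied from the format example above.
rows = [
    ('three.js', 'mrdoob', 69700, 'https://github.com/mrdoob/three.js'),
    ('libgdx', 'libgdx', 18300, 'https://github.com/libgdx/libgdx'),
]

# StringIO stands in for a real file opened with open(path, 'w', newline='').
buffer = io.StringIO()
write_repos_csv(rows, buffer)
print(buffer.getvalue())
```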


# Use the requests library to download web pages

In [4]:
!pip install requests --upgrade --quiet

[?25l[K     |█████▏                          | 10 kB 22.8 MB/s eta 0:00:01[K     |██████████▍                     | 20 kB 30.1 MB/s eta 0:00:01[K     |███████████████▋                | 30 kB 25.8 MB/s eta 0:00:01[K     |████████████████████▊           | 40 kB 14.8 MB/s eta 0:00:01[K     |██████████████████████████      | 51 kB 10.8 MB/s eta 0:00:01[K     |███████████████████████████████▏| 61 kB 9.6 MB/s eta 0:00:01[K     |████████████████████████████████| 63 kB 1.2 MB/s 
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests~=2.23.0, but you have requests 2.27.1 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.[0m
[?25h

In [5]:
import requests

In [6]:
topics_url = 'https://github.com/topics'

In [7]:
# creating a response object
response = requests.get(topics_url)

In [8]:
response.status_code

200

In [9]:
page_contents = response.text

In [10]:
page_contents[:200]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubasse'

In [11]:
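# write with an explicit encoding so non-ASCII characters in the page are saved correctly
with open('webpage.html', 'w', encoding='utf-8') as f:
  f.write(page_contents)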

# Use BeautifulSoup to parse and extract information


In [12]:
!pip install beautifulsoup4 --upgrade --quiet

[?25l[K     |███▍                            | 10 kB 21.6 MB/s eta 0:00:01[K     |██████▊                         | 20 kB 26.3 MB/s eta 0:00:01[K     |██████████                      | 30 kB 29.1 MB/s eta 0:00:01[K     |█████████████▌                  | 40 kB 32.5 MB/s eta 0:00:01[K     |████████████████▉               | 51 kB 11.1 MB/s eta 0:00:01[K     |████████████████████▏           | 61 kB 12.6 MB/s eta 0:00:01[K     |███████████████████████▌        | 71 kB 10.8 MB/s eta 0:00:01[K     |███████████████████████████     | 81 kB 9.6 MB/s eta 0:00:01[K     |██████████████████████████████▎ | 92 kB 10.5 MB/s eta 0:00:01[K     |████████████████████████████████| 97 kB 3.6 MB/s 
[?25h

In [13]:
from bs4 import BeautifulSoup


In [14]:
doc = BeautifulSoup(page_contents, 'html.parser')

In [15]:
type(doc)

bs4.BeautifulSoup

Next, query the parsed page for specific tags.

In [16]:
p_tags = doc.find_all('p')

In [17]:
len(p_tags)

67

In [18]:
p_tags[:5]

[<p class="f4 color-fg-muted col-md-6 mx-auto">Browse popular topics on GitHub.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Ajax
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Ajax is a technique for creating interactive web applications.</p>,
 <p class="f3 lh-condensed text-center Link--primary mb-0 mt-1">
         Rust
       </p>,
 <p class="f5 color-fg-muted text-center mb-0 mt-1">Rust is a systems programming language created by Mozilla.</p>]

To narrow the search, pass the attribute name and the value it must have, here the `class` of the `p` tag.

In [19]:
selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
topic_title_tags = doc.find_all('p', {'class': selection_class})

In [20]:
len(topic_title_tags)

30

In [21]:
topic_title_tags[:5]

[<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Ajax</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Algorithm</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Amp</p>,
 <p class="f3 lh-condensed mb-0 mt-1 Link--primary">Android</p>]

In [22]:
desc_selector = 'f5 color-fg-muted mb-0 mt-1'
topic_desc_tags = doc.find_all('p', {'class': desc_selector})


In [23]:
topic_desc_tags[:4]

[<p class="f5 color-fg-muted mb-0 mt-1">
           3D modeling is the process of virtually developing the surface and structure of a 3D object.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Ajax is a technique for creating interactive web applications.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Algorithms are self-contained sequences that carry out a variety of tasks.
         </p>, <p class="f5 color-fg-muted mb-0 mt-1">
           Amp is a non-blocking concurrency library for PHP.
         </p>]

In [24]:
topic_title_tag0 = topic_title_tags[0]

In [25]:
topic_title_tag0.parent

<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
<p class="f3 lh-condensed mb-0 mt-1 Link--primary">3D</p>
<p class="f5 color-fg-muted mb-0 mt-1">
          3D modeling is the process of virtually developing the surface and structure of a 3D object.
        </p>
</a>

In [26]:
topic_link_tags = doc.find_all('a', {'class': 'no-underline flex-1 d-flex flex-column'})

In [27]:
len(topic_link_tags)

30

In [28]:
topic_link_tags[0]['href']

'/topics/3d'

In [29]:
topic0_url = 'https://github.com' + topic_link_tags[0]['href']
print(topic0_url)

https://github.com/topics/3d


In [30]:
topic_title_tags[0].text

'3D'

In [31]:
topic_titles = []

for tag in topic_title_tags:
  topic_titles.append(tag.text)

print(topic_titles)

['3D', 'Ajax', 'Algorithm', 'Amp', 'Android', 'Angular', 'Ansible', 'API', 'Arduino', 'ASP.NET', 'Atom', 'Awesome Lists', 'Amazon Web Services', 'Azure', 'Babel', 'Bash', 'Bitcoin', 'Bootstrap', 'Bot', 'C', 'Chrome', 'Chrome extension', 'Command line interface', 'Clojure', 'Code quality', 'Code review', 'Compiler', 'Continuous integration', 'COVID-19', 'C++']


In [32]:
topic_descs = []

for tag in topic_desc_tags:
  topic_descs.append(tag.text.strip())

print(topic_descs)

['3D modeling is the process of virtually developing the surface and structure of a 3D object.', 'Ajax is a technique for creating interactive web applications.', 'Algorithms are self-contained sequences that carry out a variety of tasks.', 'Amp is a non-blocking concurrency library for PHP.', 'Android is an operating system built by Google designed for mobile devices.', 'Angular is an open source web application platform.', 'Ansible is a simple and powerful automation engine.', 'An API (Application Programming Interface) is a collection of protocols and subroutines for building software.', 'Arduino is an open source hardware and software company and maker community.', 'ASP.NET is a web framework for building modern web apps and services.', 'Atom is a open source text editor built with web technologies.', 'An awesome list is a list of awesome things curated by the community.', 'Amazon Web Services provides on-demand cloud computing platforms on a subscription basis.', 'Azure is a cloud

In [33]:
topic_urls = []
base_url = 'https://github.com'
for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])

print(topic_urls)

['https://github.com/topics/3d', 'https://github.com/topics/ajax', 'https://github.com/topics/algorithm', 'https://github.com/topics/amphp', 'https://github.com/topics/android', 'https://github.com/topics/angular', 'https://github.com/topics/ansible', 'https://github.com/topics/api', 'https://github.com/topics/arduino', 'https://github.com/topics/aspnet', 'https://github.com/topics/atom', 'https://github.com/topics/awesome', 'https://github.com/topics/aws', 'https://github.com/topics/azure', 'https://github.com/topics/babel', 'https://github.com/topics/bash', 'https://github.com/topics/bitcoin', 'https://github.com/topics/bootstrap', 'https://github.com/topics/bot', 'https://github.com/topics/c', 'https://github.com/topics/chrome', 'https://github.com/topics/chrome-extension', 'https://github.com/topics/cli', 'https://github.com/topics/clojure', 'https://github.com/topics/code-quality', 'https://github.com/topics/code-review', 'https://github.com/topics/compiler', 'https://github.com/t

In [34]:
!pip install pandas --upgrade --quiet

# Create CSV files with the extracted information

In [35]:
import pandas as pd

In [36]:
topic_df = pd.DataFrame({
    'topic title': topic_titles,
    'topic description': topic_descs,
    'topic_url': topic_urls
})

In [37]:
topic_df

| | topic title | topic description | topic_url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [38]:
topic_df.to_csv('topics.csv', index=None)

# Getting information out of a topic page

In [39]:
topic_page_url = topic_urls[0]

In [40]:
topic_page_url

'https://github.com/topics/3d'

In [41]:
response = requests.get(topic_page_url)

In [42]:
response.status_code

200

In [43]:
len(response.text)

676192

In [44]:
topic_doc = BeautifulSoup(response.text, 'html.parser')

In [45]:
h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})

In [46]:
len(repo_tags)

30

The page lists 30 repositories, so `find_all` returns 30 `h3` tags, one per repository.

In [47]:
a_tags = repo_tags[0].find_all('a') # there are two a tags in first repo tag.

In [48]:
a_tags

[<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":97088,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="4bdbc49d3c05ae7f70b531fbce709a384200b0768554e0172950286a8db30940" data-view-component="true" href="/mrdoob">
             mrdoob
 </a>,
 <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":576201,"originating_url":"https://github.com/topics/3d","user_id":null}}' data-hydro-click-hmac="517d3d5cb9d89752156923904a4238816bc9b51ab7772f3e3644ce897d8dd4e5" data-view-component="tr

In [49]:
a_tags[0].text.strip()

'mrdoob'

In [50]:
a_tags[1].text.strip()

'three.js'

In [51]:
base_url = 'https://github.com'
repo_url = base_url + a_tags[1]['href']
print(repo_url)

https://github.com/mrdoob/three.js


In [52]:
star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})

In [53]:
len(star_tags)

30
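
The repository headings and the star counters come from two separate `find_all` calls, so they are matched purely by position, and both lists must have the same length (30 here). The following is a small sketch of a guard for that assumption; the `pair_by_position` helper is hypothetical, and plain strings stand in for the tag objects.

```python
def pair_by_position(repo_items, star_items):
    """Pair the i-th repo item with the i-th star item, refusing mismatched lengths."""
    if len(repo_items) != len(star_items):
        raise ValueError(
            'expected one star counter per repo, got {} repos and {} counters'.format(
                len(repo_items), len(star_items)))
    return list(zip(repo_items, star_items))

# Placeholder values; in the notebook these would be repo_tags and star_tags.
pairs = pair_by_position(['three.js', 'libgdx'], ['79.6k', '19.7k'])
print(pairs)
```

Failing loudly on a length mismatch is one possible design choice; it avoids silently attaching a star count to the wrong repository if the page layout changes.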

In [54]:
star_tags[0].text.strip()

'79.6k'

In [55]:
def parse_star_count(stars_str):
  stars_str = stars_str.strip()
  if stars_str[-1] == 'k':
    return int(float(stars_str[:-1]) * 1000)
  # no 'k' suffix: parse the number as-is instead of returning None
  return int(stars_str)

In [56]:
parse_star_count(star_tags[0].text.strip())

79600
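
Star counts arrive as display strings such as `'79.6k'`, so the conversion has to undo the abbreviation. Below is a slightly more general sketch; the `'m'` suffix and thousands separators are assumptions about formats that might appear, not something observed in this notebook, and `parse_star_count_general` is a hypothetical name.

```python
def parse_star_count_general(stars_str):
    """Convert strings such as '79.6k', '1.2m', '1,234' or '987' to an int."""
    multipliers = {'k': 1_000, 'm': 1_000_000}
    text = stars_str.strip().lower().replace(',', '')
    if text and text[-1] in multipliers:
        # round() guards against float artifacts before truncating to int
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)

for sample in ['79.6k', ' 987 ', '1,234', '1.2m']:
    print(sample, parse_star_count_general(sample))
```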

In [57]:
def get_repo_info(repo_tag, star_tag):
  # returns all the required info about a repo
  a_tags = repo_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url

In [58]:
get_repo_info(repo_tags[0], star_tags[0])

('mrdoob', 'three.js', 79600, 'https://github.com/mrdoob/three.js')

In [59]:
topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
}

for i in range(len(repo_tags)):
  repo_info = get_repo_info(repo_tags[i], star_tags[i])
  topic_repos_dict['username'].append(repo_info[0])
  topic_repos_dict['repo_name'].append(repo_info[1])
  topic_repos_dict['stars'].append(repo_info[2])
  topic_repos_dict['repo_url'].append(repo_info[3])

In [60]:
topic_repos_df = pd.DataFrame(topic_repos_dict)

In [61]:
topic_repos_df

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | mrdoob | three.js | 79600 | https://github.com/mrdoob/three.js |
| 1 | libgdx | libgdx | 19700 | https://github.com/libgdx/libgdx |
| 2 | pmndrs | react-three-fiber | 17000 | https://github.com/pmndrs/react-three-fiber |
| 3 | BabylonJS | Babylon.js | 16000 | https://github.com/BabylonJS/Babylon.js |
| 4 | aframevr | aframe | 13900 | https://github.com/aframevr/aframe |
| 5 | lettier | 3d-game-shaders-for-beginners | 12300 | https://github.com/lettier/3d-game-shaders-for... |
| 6 | ssloy | tinyrenderer | 12200 | https://github.com/ssloy/tinyrenderer |
| 7 | FreeCAD | FreeCAD | 10800 | https://github.com/FreeCAD/FreeCAD |
| 8 | metafizzy | zdog | 9000 | https://github.com/metafizzy/zdog |
| 9 | CesiumGS | cesium | 8300 | https://github.com/CesiumGS/cesium |


In [62]:
import os

In [77]:
def get_topic_page(topic_url):
  # download the page
  response = requests.get(topic_url)
  # check if the response is successful
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topic_url))
  # parse using BeautifulSoup
  topic_doc = BeautifulSoup(response.text, 'html.parser')

  return topic_doc

def get_repo_info(repo_tag, star_tag):
  # returns all the required info about a repo
  a_tags = repo_tag.find_all('a')
  username = a_tags[0].text.strip()
  repo_name = a_tags[1].text.strip()
  repo_url = base_url + a_tags[1]['href']
  stars = parse_star_count(star_tag.text.strip())
  return username, repo_name, stars, repo_url


def get_topic_repos(topic_doc):
  # get the h3 tags containing repo title, repo url and username
  h3_selection_class = 'f3 color-fg-muted text-normal lh-condensed'
  repo_tags = topic_doc.find_all('h3', {'class': h3_selection_class})
  # get star tags
  star_tags = topic_doc.find_all('span', {'class': 'Counter js-social-count'})
  
  # topic repos dict
  topic_repos_dict = {
    'username': [],
    'repo_name': [],
    'stars': [],
    'repo_url': []
  }


  # get repo info
  for i in range(len(repo_tags)):
    repo_info = get_repo_info(repo_tags[i], star_tags[i])
    topic_repos_dict['username'].append(repo_info[0])
    topic_repos_dict['repo_name'].append(repo_info[1])
    topic_repos_dict['stars'].append(repo_info[2])
    topic_repos_dict['repo_url'].append(repo_info[3])

  return pd.DataFrame(topic_repos_dict)

def scrape_topic(topic_url, path):
  if os.path.exists(path):
    print('The file {} already exists. Skipping...'.format(path))
    return
  topic_df = get_topic_repos(get_topic_page(topic_url))
  topic_df.to_csv(path, index=None)
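
`get_topic_page` raises as soon as a request returns a status other than 200. When many topic pages are fetched in a row, an occasional failure may be temporary, so one option is to retry a few times before giving up. This is a sketch under that assumption; `get_with_retry` and `FakeResponse` are hypothetical names, and a fake fetch function replaces `requests.get` so the example runs without network access.

```python
import time

def get_with_retry(fetch, url, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or the attempts run out.

    fetch is any callable returning an object with a status_code attribute,
    for example requests.get.
    """
    last_status = None
    for attempt in range(1, attempts + 1):
        response = fetch(url)
        if response.status_code == 200:
            return response
        last_status = response.status_code
        if attempt < attempts:
            sleep(delay_seconds * attempt)  # wait a little longer after each failure
    raise RuntimeError('Failed to load page {} (last status {})'.format(url, last_status))

class FakeResponse:
    """Stand-in for a requests response, for demonstration only."""
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([429, 200])  # first call fails, second succeeds

def fake_fetch(url):
    return FakeResponse(next(statuses))

response = get_with_retry(fake_fetch, 'https://github.com/topics/3d', sleep=lambda s: None)
print(response.status_code)
```

In the notebook, `get_with_retry(requests.get, topic_url)` could take the place of the direct `requests.get(topic_url)` call.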

In [64]:
url4 = topic_urls[4]

In [65]:
url4

'https://github.com/topics/android'

In [66]:
topic4_doc = get_topic_page(url4)

In [67]:
topic4_repos = get_topic_repos(topic4_doc)

In [68]:
topic4_repos

| | username | repo_name | stars | repo_url |
|---|---|---|---|---|
| 0 | flutter | flutter | 137000 | https://github.com/flutter/flutter |
| 1 | justjavac | free-programming-books-zh_CN | 87800 | https://github.com/justjavac/free-programming-... |
| 2 | Genymobile | scrcpy | 62000 | https://github.com/Genymobile/scrcpy |
| 3 | Hack-with-Github | Awesome-Hacking | 49600 | https://github.com/Hack-with-Github/Awesome-Ha... |
| 4 | google | material-design-icons | 45200 | https://github.com/google/material-design-icons |
| 5 | wasabeef | awesome-android-ui | 42200 | https://github.com/wasabeef/awesome-android-ui |
| 6 | square | okhttp | 41700 | https://github.com/square/okhttp |
| 7 | android | architecture-samples | 40300 | https://github.com/android/architecture-samples |
| 8 | square | retrofit | 39600 | https://github.com/square/retrofit |
| 9 | Solido | awesome-flutter | 39600 | https://github.com/Solido/awesome-flutter |


In [69]:
get_topic_repos(get_topic_page(topic_urls[4])).to_csv('android.csv', index=None)

Write a single function to:
1. Get the list of topics from the topics page
2. Get the list of top repos from the individual topic pages
3. For each topic create a CSV file of the top repos for the topic
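
For step 3, each topic title becomes part of a file name, and titles such as "Command line interface" contain spaces. Here is a hedged sketch of a helper that maps a title to a file-system-friendly path; the `topic_csv_path` function and its replacement rule are assumptions, since the notebook itself uses the raw title.

```python
import os
import re

def topic_csv_path(title, folder='data'):
    """Build '<folder>/<safe title>.csv', replacing characters that are awkward in file names."""
    # Keep letters, digits, dot, underscore, plus and hyphen; collapse everything else to '_'.
    safe = re.sub(r'[^A-Za-z0-9._+-]+', '_', title.strip()).strip('_')
    return os.path.join(folder, (safe or 'topic') + '.csv')

for title in ['3D', 'Command line interface', 'C++', 'ASP.NET']:
    print(topic_csv_path(title))
```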

In [70]:
def get_topic_titles(doc):
  selection_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
  topic_title_tags = doc.find_all('p', {'class': selection_class})
  topic_titles = []
  for tag in topic_title_tags:
    topic_titles.append(tag.text)
  return topic_titles

def get_topic_descs(doc):
  desc_selector = 'f5 color-fg-muted mb-0 mt-1'
  topic_desc_tags = doc.find_all('p', {'class': desc_selector})
  topic_descs = []
  for tag in topic_desc_tags:
    topic_descs.append(tag.text.strip())
  return topic_descs

def get_topic_urls(doc):
  topic_link_tags = doc.find_all('a', {'class':'no-underline flex-1 d-flex flex-column' }) 
  topic_urls = []
  base_url = 'https://github.com'
  for tag in topic_link_tags:
    topic_urls.append(base_url + tag['href'])
  return topic_urls


def scrape_topics():
  topics_url = 'https://github.com/topics'
  response = requests.get(topics_url)
  if response.status_code != 200:
    raise Exception('Failed to load page {}'.format(topics_url))
  doc = BeautifulSoup(response.text, 'html.parser')
  topics_dict = {
      'title': get_topic_titles(doc),
      'description': get_topic_descs(doc),
      'url': get_topic_urls(doc)
  }
  return pd.DataFrame(topics_dict)

In [78]:
def scrape_topics_repos():
  print('Scraping list of topics')
  topics_df = scrape_topics()

  os.makedirs('data', exist_ok=True)
  for index, row in topics_df.iterrows():
    print('scraping top repositories for "{}"'.format(row['title']))
    scrape_topic(row['url'], 'data/{}.csv'.format(row['title']))
  

In [72]:
topics_df = scrape_topics()

In [73]:
topics_df

| | title | description | url |
|---|---|---|---|
| 0 | 3D | 3D modeling is the process of virtually develo... | https://github.com/topics/3d |
| 1 | Ajax | Ajax is a technique for creating interactive w... | https://github.com/topics/ajax |
| 2 | Algorithm | Algorithms are self-contained sequences that c... | https://github.com/topics/algorithm |
| 3 | Amp | Amp is a non-blocking concurrency library for ... | https://github.com/topics/amphp |
| 4 | Android | Android is an operating system built by Google... | https://github.com/topics/android |
| 5 | Angular | Angular is an open source web application plat... | https://github.com/topics/angular |
| 6 | Ansible | Ansible is a simple and powerful automation en... | https://github.com/topics/ansible |
| 7 | API | An API (Application Programming Interface) is ... | https://github.com/topics/api |
| 8 | Arduino | Arduino is an open source hardware and softwar... | https://github.com/topics/arduino |
| 9 | ASP.NET | ASP.NET is a web framework for building modern... | https://github.com/topics/aspnet |


In [81]:
scrape_topics_repos()

Scraping list of topics
scraping top repositories for "3D"
The file data/3D.csv already exists. Skipping...
scraping top repositories for "Ajax"
The file data/Ajax.csv already exists. Skipping...
scraping top repositories for "Algorithm"
The file data/Algorithm.csv already exists. Skipping...
scraping top repositories for "Amp"
The file data/Amp.csv already exists. Skipping...
scraping top repositories for "Android"
The file data/Android.csv already exists. Skipping...
scraping top repositories for "Angular"
The file data/Angular.csv already exists. Skipping...
scraping top repositories for "Ansible"
The file data/Ansible.csv already exists. Skipping...
scraping top repositories for "API"
The file data/API.csv already exists. Skipping...
scraping top repositories for "Arduino"
The file data/Arduino.csv already exists. Skipping...
scraping top repositories for "ASP.NET"
The file data/ASP.NET.csv already exists. Skipping...
scraping top repositories for "Atom"
The file data/Atom.csv alre

In [82]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
