<a href="https://colab.research.google.com/github/roopadm/WebScraping/blob/main/python_web_scraping_and_rest_api_Jovian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Web Scraping and REST APIs 

![](https://i.imgur.com/6zM7JBq.png)


Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing [HTML documents](https://developer.mozilla.org/en-US/docs/Web/HTML), some platforms also offer [REST APIs](https://www.smashingmagazine.com/2018/01/understanding-using-rest-api/) to retrieve information in a machine-readable format like [JSON](https://www.digitalocean.com/community/tutorials/an-introduction-to-json). In this tutorial, we'll use web scraping and REST APIs to create a real-world dataset.


This tutorial covers the following topics: 

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Writing parsed information into CSV files
* Using a REST API to retrieve data as JSON
* Combining data from multiple sources
* Using links on a page to crawl a website


### How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable [Jupyter notebook](https://jupyter.org). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.





## Problem 

Over the course of this tutorial, we'll solve the following problem to learn the tools and techniques used for web scraping:


> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. You can view the top repositories for the topic `machine-learning` on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL. 


 <a href="https://github.com/topics/machine-learning"><img src="https://i.imgur.com/5V1HGLs.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;"></a>
 
 
How would you go about solving this problem in Python? Explore the web page and take a couple of minutes to come up with an approach before proceeding further. How many lines of code do you think the solution will require?

## Downloading a web page using `requests`

When you access a URL like https://github.com/topics/machine-learning using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. Before we can extract information from a web page, we need to download the page using Python.

We'll use a library called [`requests`](https://docs.python-requests.org/en/master/) to download web pages from the internet. Let's begin by installing and importing the library.

In [None]:
# Install the library
!pip install requests --upgrade --quiet


In [None]:
# Import the library
import requests

We can download a web page using the `requests.get` function.

In [None]:
topic_url = 'https://github.com/topics/machine-learning'

In [None]:
response = requests.get(topic_url)

In [None]:
type(response)

requests.models.Response

`requests.get` returns a response object with the page contents and some information indicating whether the request was successful, using a status code. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status. 

 If the request was successful, `response.status_code` is set to a value between 200 and 299. 
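As a quick illustration of that range check, here is a minimal sketch that uses only the standard library. The helper name `describe_status` is hypothetical (it is not part of `requests`); it reports whether a code falls in the 200 to 299 range and looks up the standard reason phrase.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return (is_success, reason_phrase) for an HTTP status code.

    Hypothetical helper for illustration; not part of requests.
    """
    is_success = 200 <= status_code <= 299
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = "Unknown"
    return is_success, phrase


for code in (200, 204, 404, 500):
    print(code, describe_status(code))
```

With a real response object you would pass `response.status_code` to such a helper. Note that `response.ok` from `requests` is a looser check: it is `True` for any status code below 400.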

In [None]:
response.status_code

200

The contents of the web page can be accessed using the `.text` property of the `response`. 

In [None]:
page_contents = response.text

In [None]:
len(page_contents)

424155

The page contains over 400,000 characters! Let's view the first 1000 characters of the web page.

In [None]:
page_contents[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

What you see above is the *source code* of the web page. It is written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML), which defines the content and structure of the web page.

Let's save the contents to a file with the `.html` extension.

In [None]:
with open('machine-learning-topics.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)

You can now view the file using the "File > Open" menu option within Jupyter and clicking on *machine-learning-topics.html* in the list of files displayed. Here's what you'll see when you open the file:

<img src="https://i.imgur.com/8gEbT1P.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

While this looks similar to the original web page, note that it's simply a copy. You will notice that none of the links or buttons work. To view or edit the source code of the file, click "File > Open" within Jupyter, then select the file *machine-learning-topics.html* from the list and click the "Edit" button.

<img src="https://i.imgur.com/JG7Q8CK.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

As you might expect, the source code looks something like this:

<img src="https://i.imgur.com/6ynXNdz.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Try scrolling through the source code. Can you make sense of it? Can you see how the information on the page is organized within the file? We'll learn more about it in the next section.

> **EXERCISE**: Download the web page for a different topic, e.g., https://github.com/topics/data-analysis using `requests` and save it to a file, e.g., `data-analysis.html`. View the page and compare it with the previously downloaded page. How are the two different? Can you spot the differences in the source code?

In [None]:
url_DA = 'https://github.com/topics/data-analysis'

In [None]:
response_DA = requests.get(url_DA)

In [None]:
type(response_DA)

requests.models.Response

In [None]:
response_DA.status_code

200

In [None]:
page_contents_DA=response_DA.text

In [None]:
page_contents_DA[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-719f1193e0c0.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-0c343b529849.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [None]:
len(page_contents_DA)

465553

In [None]:
# Write the downloaded page contents (not the variable's name) to the file
with open('data-analysis.html', 'w', encoding='utf-8') as file1:
    file1.write(page_contents_DA)

Let's save our work using `jovian` before continuing.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
jovian.commit(project='python-web-scraping-and-rest-api')

<IPython.core.display.Javascript object>

[jovian] Updating notebook "samanvitha/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/samanvitha/python-web-scraping-and-rest-api


'https://jovian.ai/samanvitha/python-web-scraping-and-rest-api'

## Inspecting the HTML source code of a web page

![](https://i.imgur.com/mvBpQIP.png)

As mentioned earlier, web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language composed of *tags* (also called *nodes* or *elements*), e.g., `<a href="https://jovian.ai" target="_blank">Go to Jovian</a>`. An HTML tag has three parts:

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of the tag used by the browser to customize how the tag is displayed and to decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
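
The three parts can be seen by feeding the example link to Python's built-in `html.parser` module. This is only a sketch to make the name/attributes/children split concrete; the class name `TagPartsCollector` is made up for illustration, and the rest of this tutorial uses Beautiful Soup rather than this module.

```python
from html.parser import HTMLParser


class TagPartsCollector(HTMLParser):
    """Collect the name, attributes, and text children of the tags it sees."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # `tag` is the tag name, `attrs` is a list of (name, value) pairs
        self.parts.append(("name", tag))
        self.parts.append(("attributes", dict(attrs)))

    def handle_data(self, data):
        # Text found between the opening and closing segments
        self.parts.append(("children", data))


parser = TagPartsCollector()
parser.feed('<a href="https://jovian.ai" target="_blank">Go to Jovian</a>')
parser.close()

for label, value in parser.parts:
    print(label, "->", value)
```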


### Inside an HTML Document

Here's a simple HTML document that uses many commonly used tags:

```html
<html>
  <head>
    <title>All About Python</title>
  </head>
  <body>
    <div style="width: 640px; margin: 40px auto">
      <h1 style="text-align:center;">Python - A Programming Language</h1>
      <img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" alt="python-logo" style="width:240px;margin:0 auto;display:block;">
      <div>
        <h2>About Python</h2>
        <p>
          Python is an <span style="font-style: italic">interpreted, high-level and general-purpose</span> programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Visit the <a href="https://docs.python.org/3/">official documentation</a> to learn more.
        </p>
      </div>
      <div>
        <h2>Some Python Libraries</h2>
        <ul id="libraries">
          <li>Numpy</li>
          <li>Pandas</li>
          <li>PyTorch</li>
          <li>Scikit Learn</li>
        </ul>
      </div>
      <div>
        <h2>Recent Python Versions</h2>
        <table id="versions-table">
          <tr>
            <th class="bordered-table">Version</th>
            <th class="bordered-table">Released on</th>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.8</td>
            <td class="bordered-table">October 2019</td>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.7</td>
            <td class="bordered-table">June 2018</td>
          </tr>
        </table>
          <style>
              .bordered-table { 
                  border: 1px solid black; padding: 8px;
              }
          </style>
      </div>
    </div>
  </body>
</html>

```

> **EXERCISE**: Copy the above HTML code and paste it into a new file called `webpage.html`. To create a new file,  select "File > Open" from the menu bar, then select "New > Text" file. View the saved file. Can you see how the different tags are displayed in different ways by the browser?


<img src="https://i.imgur.com/lcSHz5V.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

> **EXERCISE**: Make some changes to the code inside `webpage.html`. Save the file and view it again. Do you see your changes reflected? Play with the structure of the file. Try to break things and fix them!

### Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)




> **EXERCISE**: Complete this tutorial on HTML: https://www.htmldog.com/guides/html/ . Once done, try describing what the above tags and attributes are used for. Try creating a new HTML page using the tags you find most interesting. 
> 
> To learn how to style HTML tags, check out this tutorial on CSS: https://www.htmldog.com/guides/css/



### Inspecting HTML in the Browser

You can view the source code of any webpage right within your browser by right-clicking anywhere on a page and selecting the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what it looks like on the Chrome browser:


<img src="https://i.imgur.com/jCA1T6Z.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">


> **EXERCISE**: Explore the source code of the web page https://github.com/topics/machine-learning . Try to find the portions in the source code corresponding to the repository name, owner's username, and the number of stars for each repository listed on the page.

Let's save our work before continuing.

In [None]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "samanvitha/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/samanvitha/python-web-scraping-and-rest-api


'https://jovian.ai/samanvitha/python-web-scraping-and-rest-api'

## Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [None]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
# Import the library
from bs4 import BeautifulSoup

In [None]:
?BeautifulSoup

Next, let's read the contents of the file `machine-learning-topics.html` and create a `BeautifulSoup` object to parse the content.

In [None]:
with open('machine-learning-topics.html', 'r', encoding='utf-8') as f:
    html_source = f.read()

In [None]:
html_source[:1000]

'\n\n<!DOCTYPE html>\n<html lang="en" data-color-mode="auto" data-light-theme="light" data-dark-theme="dark" data-a11y-animated-images="system">\n  <head>\n    <meta charset="utf-8">\n  <link rel="dns-prefetch" href="https://github.githubassets.com">\n  <link rel="dns-prefetch" href="https://avatars.githubusercontent.com">\n  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">\n  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">\n  <link rel="preconnect" href="https://github.githubassets.com" crossorigin>\n  <link rel="preconnect" href="https://avatars.githubusercontent.com">\n\n\n\n  <link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" /><link crossorigin="anonymous" media="all" rel="stylesheet" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" /><link data-color-theme="dark_dimmed" crossorigin="anonymous" media="all" rel="stylesheet" data-href="htt

In [None]:
doc = BeautifulSoup(html_source, 'html.parser')

In [None]:
type(doc)

bs4.BeautifulSoup

The `doc` object contains several properties and methods for extracting information from the HTML document. Let's look at a few examples below.

**NOTE**: You don't need to remember all (or any) of the properties/methods. You can look up [the documentation of BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [search online](https://www.google.co.in/search?q=beautifulsoup+how+to+get+href+of+link) to find what you need when you need it.

### Accessing a tag

> **QUESTION**: Find the title of the page represented by `doc`.

The title of the page is contained within the `<title>` tag. We can access the title tag using `doc.title`.

In [None]:
title_tag = doc.title

In [None]:
title_tag

<title>machine-learning · GitHub Topics · GitHub</title>

In [None]:
type(title_tag)

bs4.element.Tag

We can access a tag's name using the `.name` property.

In [None]:
title_tag.name

'title'

The text within a tag can be accessed using `.text`.

In [None]:
title_tag.text

'machine-learning · GitHub Topics · GitHub'

> **EXERCISE**: Explore the `html`, `body`, and `head` tags of `doc`. Do you see what you expect to see?

If a tag occurs more than once in a document e.g. `<a>` (which represents links), then `doc.a` finds the first `<a>` tag.

In [None]:
first_link = doc.a

In [None]:
first_link

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [None]:
first_link.text

'Skip to content'

> **EXERCISE**: Find the first occurrence of each of these tags in `doc`: `div`, `img`, `span`, `p`, etc.

### Finding all tags of the same type

To find all the occurrences of a tag, use the `find_all` method.

> **QUESTION**: Find all the link tags on the page. How many links does the page contain?

In [None]:
all_link_tags = doc.find_all('a')

In [None]:
len(all_link_tags)

417

In [None]:
all_link_tags[:3]

[<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>,
 <a aria-label="Homepage" class="mr-4 color-fg-inherit" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
 <svg aria-hidden="true" class="octicon octicon-mark-github" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" width="32">
 <path d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

### Accessing attributes

The attributes of a tag can be accessed using the indexing notation, e.g., `first_link['href']`

In [None]:
first_link

<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>

In [None]:
first_link['href']

'#start-of-content'

In [None]:
first_link['class']

['px-2',
 'py-4',
 'color-bg-accent-emphasis',
 'color-fg-on-emphasis',
 'show-on-focus',
 'js-skip-to-content']

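Note that the `class` attribute is automatically split into a list of classes. Beautiful Soup does this only for a small set of multi-valued attributes (such as `class` and `rel`); most attributes, like `href` or `id`, are returned as plain strings. The list form makes it convenient to check for a specific class within a tag.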
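
To see how attribute values come back, here is a small sketch on a made-up tag (the HTML string and variable names are illustrative and not taken from the GitHub page). It shows `class` as a list, `href` and `id` as plain strings, and `.get` as a way to read an attribute that may be missing.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not from the GitHub topic page
sample_tag_html = '<a class="btn btn-primary" href="/docs" id="docs link">Docs</a>'
link = BeautifulSoup(sample_tag_html, 'html.parser').a

classes = link['class']                # list of class names
href = link['href']                    # plain string
link_id = link['id']                   # plain string, even though it contains a space
has_primary = 'btn-primary' in classes # membership test on the class list
missing = link.get('title')            # .get returns None instead of raising KeyError

print(classes, href, link_id, has_primary, missing)
```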

You can use the `.attrs` property to view all the attributes as a dictionary.

In [None]:
first_link.attrs

{'href': '#start-of-content',
 'class': ['px-2',
  'py-4',
  'color-bg-accent-emphasis',
  'color-fg-on-emphasis',
  'show-on-focus',
  'js-skip-to-content']}

> **EXERCISE**: Find the 5th image tag on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

### Searching by Attribute Value

> **QUESTION**: Find the `img` tag(s) on the page with the `alt` attribute set to `transformers`.

We can provide a dictionary of attributes as the second argument to `find_all`:

In [None]:
doc.find_all('img', { 'alt': 'transformers'})

[<img alt="transformers" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e"/>]

If we're just interested in the first element, we can use the `find` method. Keep in mind that `find` returns `None` if no matching tag is found.

In [None]:
doc.find('img', { 'alt': 'transformers'})

<img alt="transformers" class="d-block width-full" loading="lazy" src="https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e"/>
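
Because `find` can return `None`, indexing its result directly can fail when nothing matches. The sketch below guards against that case on a tiny made-up document; `get_image_src`, `demo_html`, and `demo_doc` are hypothetical names used only for this example.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not from the GitHub topic page
demo_html = '<div><img alt="transformers" src="/img/transformers.png"/></div>'
demo_doc = BeautifulSoup(demo_html, 'html.parser')


def get_image_src(doc, alt_text):
    """Return the src of the first img with the given alt text, or None."""
    img_tag = doc.find('img', {'alt': alt_text})
    if img_tag is None:
        # No matching tag: avoid calling methods on None
        return None
    return img_tag.get('src')


print(get_image_src(demo_doc, 'transformers'))
print(get_image_src(demo_doc, 'julia'))
```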

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

### Searching by Class

The `class` attribute is one of the most frequently used attributes on HTML tags (used for layout and styling). We can search for tags containing a class using the `class_` argument in `find_all` (note that `class` is a reserved keyword in Python, hence the underscore in the argument name).

> **QUESTION**: Find all the tags containing the class `HeaderMenu-link`. 

In [None]:
matching_tags = doc.find_all(class_='HeaderMenu-link')

In [None]:
matching_tags

[<summary class="HeaderMenu-summary HeaderMenu-link px-0 py-3 border-0 no-wrap d-block d-lg-inline-block">
         Product
         <svg class="icon-chevon-down-mktg position-absolute position-lg-relative" fill="none" viewbox="0 0 14 8" x="0" xml:space="preserve" y="0"><path d="M1,1l6.2,6L13,1"></path></svg>
 </summary>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Team","label":"ref_page:/topics/machine-learning;ref_cta:Team;"}' href="/team">Team</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Enterprise","label":"ref_page:/topics/machine-learning;ref_cta:Enterprise;"}' href="/enterprise">Enterprise</a>,
 <summary class="HeaderMenu-summary HeaderMenu-link px-0 py-3 border-0 no-wrap d-block d-lg-inline-block">
         Explore
         <svg cl

We can also search for a specific type of tag, e.g., `<a>`, matching the given class.

In [None]:
header_link_tags = doc.find_all('a', class_='HeaderMenu-link')

In [None]:
header_link_tags

[<a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Team","label":"ref_page:/topics/machine-learning;ref_cta:Team;"}' href="/team">Team</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Enterprise","label":"ref_page:/topics/machine-learning;ref_cta:Enterprise;"}' href="/enterprise">Enterprise</a>,
 <a class="HeaderMenu-link no-underline py-3 d-block d-lg-inline-block" data-analytics-event='{"category":"Header menu top item (logged out)","action":"click to go to Marketplace","label":"ref_page:/topics/machine-learning;ref_cta:Marketplace;"}' href="/marketplace">Marketplace</a>,
 <a class="HeaderMenu-link flex-shrink-0 no-underline" data-ga-click="(Logged out) Header, clicked Sign in, text:sign-in" data-hydro-click='{"event_type":"authentication.click","pa

### Parsing Information from Tags

Once we have a list of tags matching some criteria, it's easy to extract information and convert it to a more convenient format.

> **QUESTION**: Find the link text and URL of all the links within the page header on https://github.com/topics/machine-learning .

We'll create a list of dictionaries containing the required information. We'll add the base URL https://github.com as a prefix because the `href` attribute only contains the relative path e.g. `/explore`.

In [None]:
header_link_tags[0]['href']

'/team'

In [None]:
header_links = []
base_url = 'https://github.com'

for tag in header_link_tags:
    header_links.append({ 'title': tag.text.strip(), 'url': base_url + tag['href']})
    
header_links

[{'title': 'Team', 'url': 'https://github.com/team'},
 {'title': 'Enterprise', 'url': 'https://github.com/enterprise'},
 {'title': 'Marketplace', 'url': 'https://github.com/marketplace'},
 {'title': 'Sign in',
  'url': 'https://github.com/login?return_to=https%3A%2F%2Fgithub.com%2Ftopics%2Fmachine-learning'},
 {'title': 'Sign up',
  'url': 'https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2Ftopics%2Fmachine-learning&source=header'}]

We have successfully extracted the required information about links in the page header. This is precisely what web scraping is: downloading a webpage, parsing the HTML, and extracting useful information.
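
A list of dictionaries like this maps naturally onto rows of a CSV file. As a rough sketch of that last step, the snippet below writes two of the rows shown above into an in-memory buffer using the standard `csv` module; `sample_links` and `csv_text` are illustrative names.

```python
import csv
import io

# Two rows copied from the header links extracted above
sample_links = [
    {'title': 'Team', 'url': 'https://github.com/team'},
    {'title': 'Enterprise', 'url': 'https://github.com/enterprise'},
]

# Write to an in-memory buffer so the example leaves no file behind
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'url'])
writer.writeheader()
writer.writerows(sample_links)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, 'w', newline='', encoding='utf-8')` in place of the buffer.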

> **EXERCISE**: Find the list of all the images matching the class `d-block width-full`. Each list element should be a dictionary containing two keys, `"username"` and `"url"`. You can obtain the username using the `alt` attribute of a tag and the URL using the `src` attribute.

### Elements inside a tag

> **QUESTION**: Find the `li` tags that are direct children of `ul` tag with the class `top-list` in the sample HTML document below.


In [None]:
sample_html = """
<html>
    <body>
        <ul class="top-list">
            <li>Item 1</li>
            <li>Item 2</li>
            <li>
                <ul>
                    <li>Item 3.1</li>
                    <li>Item 3.2</li>
                    <li>Item 3.3</li>
                </ul> 
            </li>
        </ul>
    </body>
</html>"""

In [None]:
sample_doc = BeautifulSoup(sample_html, 'html.parser')

In [None]:
list_tag = sample_doc.find('ul', class_='top-list')

We can use the `find_all` method on the tag, and set `recursive=False` to find just the direct children.

In [None]:
list_item_tags = list_tag.find_all('li', recursive=False)

In [None]:
list_item_tags

[<li>Item 1</li>,
 <li>Item 2</li>,
 <li>
 <ul>
 <li>Item 3.1</li>
 <li>Item 3.2</li>
 <li>Item 3.3</li>
 </ul>
 </li>]

Without `recursive=False`, the inner list items are also included in the result.

In [None]:
list_tag.find_all('li')

[<li>Item 1</li>,
 <li>Item 2</li>,
 <li>
 <ul>
 <li>Item 3.1</li>
 <li>Item 3.2</li>
 <li>Item 3.3</li>
 </ul>
 </li>,
 <li>Item 3.1</li>,
 <li>Item 3.2</li>,
 <li>Item 3.3</li>]
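
As an alternative sketch, Beautiful Soup's `select` method accepts CSS selectors, where `>` means "direct child" and a space means "any descendant". The HTML and names below (`nested_html`, `nested_doc`) are made up for this example.

```python
from bs4 import BeautifulSoup

# Illustrative nested list, smaller than the sample document above
nested_html = """
<ul class="top-list">
  <li>Item 1</li>
  <li>Item 2
    <ul><li>Item 2.1</li></ul>
  </li>
</ul>"""
nested_doc = BeautifulSoup(nested_html, 'html.parser')

# Direct children only (similar in effect to recursive=False)
direct_items = nested_doc.select('ul.top-list > li')

# All descendants, including the inner list item
all_items = nested_doc.select('ul.top-list li')

print(len(direct_items), len(all_items))
```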

Keep in mind that you don't need to remember all (or any) of the methods or properties offered by Beautiful Soup documents and tags. You should be able to figure out what you need to do, when you need to do it. Here's how:

* Look up the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Google what you're trying to do: https://www.google.co.in/search?q=beautiful+soup+get+href
* Ask a question on StackOverflow: https://stackoverflow.com/questions/tagged/beautifulsoup



Let's save our work before continuing.

In [None]:
jovian.commit()

<IPython.core.display.Javascript object>

### Top Repositories for a Topic

Let's return to our original problem statement of finding the top repositories for a given topic. Before we parse a page and find the top repositories, let's define a helper function to get the web page for any topic.

> **QUESTION**: Define a function `get_topic_page` that downloads the GitHub web page for a given topic and returns a beautiful soup document representing the page.

In [None]:
def get_topic_page(topic):
    # Construct the URL
    topic_repos_url = 'https://github.com/topics/' + topic
    
    # Get the HTML page content using requests
    response = requests.get(topic_repos_url)
    
    # Ensure that the response is valid
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + topic_repos_url)
    
    # Construct a beautiful soup document
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

In [None]:
doc = get_topic_page('machine-learning')

In [None]:
doc.title.text

'machine-learning · GitHub Topics · GitHub'

Getting the topic page for another topic is now as simple as invoking the function with a different argument.

In [None]:
doc2 = get_topic_page('data-analysis')

In [None]:
doc2.title.text

'data-analysis · GitHub Topics · GitHub'
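
Here is one possible variation on `get_topic_page`, shown only as a sketch. It assumes you want a request timeout and prefer `response.raise_for_status()` over a manual status check; the names `build_topic_url` and `fetch_topic_page` are hypothetical, and percent-encoding the topic with `quote` is a defensive assumption rather than something GitHub is known to require.

```python
from urllib.parse import quote

import requests
from bs4 import BeautifulSoup


def build_topic_url(topic):
    """Build the topic page URL, percent-encoding unusual characters."""
    return 'https://github.com/topics/' + quote(topic, safe='')


def fetch_topic_page(topic, timeout=10):
    """Download a topic page and return it as a BeautifulSoup document."""
    response = requests.get(build_topic_url(topic), timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')


print(build_topic_url('machine-learning'))
```

Calling `fetch_topic_page('machine-learning')` performs the actual download, so it needs network access.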

> **QUESTION**: Develop an approach to find the repository name, owner's username, no. of stars, and repository link for the repositories listed on a topic page.

<img src="https://i.imgur.com/szL76cU.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Upon inspecting the box containing the information for a repository, you will find an `article` tag for each repository, with the `class` attribute set to `border rounded color-shadow-small color-bg-subtle my-4`.

Let's find all the `article` tags matching this class.


In [None]:
article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-subtle my-4')

In [None]:
len(article_tags)

30

There are 30 repositories listed on the page, and our query resulted in 30 article tags. It looks like we've found the enclosing tag for each repository. 
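
One caveat worth knowing: when `class_` is given a string with several class names, Beautiful Soup compares it against the exact value of the `class` attribute, so the same classes in a different order do not match. The sketch below demonstrates this on made-up HTML and shows a CSS selector as an order-independent alternative; `cards_html` and `cards_doc` are illustrative names.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not from the GitHub topic page
cards_html = """
<article class="border rounded my-4">Repo A</article>
<article class="rounded border my-4">Repo B</article>
<article class="border">Not a card</article>
"""
cards_doc = BeautifulSoup(cards_html, 'html.parser')

# Matches only the tag whose class attribute is exactly this string
exact_matches = cards_doc.find_all('article', class_='border rounded my-4')

# Matches every article that has all three classes, in any order
selector_matches = cards_doc.select('article.border.rounded.my-4')

print(len(exact_matches), len(selector_matches))
```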

We need to extract the following information from each tag:

1. Repository name
2. Owner's username
3. Number of stars
4. Repository link

Look at the source of any of the article tags. You will notice that the repository name, owner's username, and the repository link are all part of an `h3` tag.

In [None]:
article_tag = article_tags[4]

In [None]:
# Uncomment to view
# article_tag

Let's retrieve the first `h3` inside an article.

In [None]:
h3_tag = article_tag.find('h3')
h3_tag

<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"OWNER","click_visual_representation":"REPOSITORY_OWNER_HEADING","actor_id":null,"record_id":8401422,"originating_url":"https://github.com/topics/machine-learning","user_id":null}}' data-hydro-click-hmac="c14e22ad4cc65812caacdf8b76899e4e4979ea7d56890715930cef3abddde9c5" data-view-component="true" href="/tesseract-ocr">
            tesseract-ocr
</a>          /
          <a class="text-bold wb-break-word" data-ga-click="Explore, go to repository, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click_target":"REPOSITORY","click_visual_representation":"REPOSITORY_NAME_HEADING","actor_id":null,"record_id":22887094,"originating_url":"https://github.com/topics/machine-learning","user_id":nu

The `h3` has two `a` tags inside it, the first containing the owner's username and the second containing the repository title. The `href` of the second tag also includes the relative path of the repository. Let's extract this information from the `a` tags.

In [None]:
a_tags = h3_tag.find_all('a', recursive=False)

In [None]:
username = a_tags[0].text
username

'\n            tesseract-ocr\n'

Looks like the username contains some leading and trailing whitespace. We can get rid of it using `strip`.

In [None]:
username = a_tags[0].text.strip()
username

'tesseract-ocr'

We can get the repository name and repository path in the same fashion.

In [None]:
repo_name = a_tags[1].text.strip()
repo_name

'tesseract'

In [None]:
repo_path = a_tags[1]['href'].strip()
repo_path

'/tesseract-ocr/tesseract'

To get the full URL to the repository, we can append the base URL `https://github.com` at the beginning of the path.

In [None]:
base_url = 'https://github.com'
repo_url = base_url + repo_path 
repo_url

'https://github.com/tesseract-ocr/tesseract'
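String concatenation works here because the path always starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a relative path against a base URL. The sketch below reuses the `base_url` and `repo_path` values from the cells above.

```python
from urllib.parse import urljoin

base_url = 'https://github.com'
repo_path = '/tesseract-ocr/tesseract'  # value extracted from the href above

# urljoin resolves the relative path against the base URL
repo_url = urljoin(base_url, repo_path)
print(repo_url)
```

Either approach gives the same URL for this page; `urljoin` is mainly useful when a site mixes absolute and relative links.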


Next, to get the number of stars, we notice that it is contained within a `span` tag which has the class `Counter js-social-count`.


In [None]:
article_tags[4]

<article class="border rounded color-shadow-small color-bg-subtle my-4">
<div class="px-3">
<div class="d-flex flex-justify-between my-3">
<div class="d-flex flex-auto">
<span style="margin-top:2px">
<svg aria-hidden="true" class="octicon octicon-repo color-fg-muted mr-2" data-view-component="true" height="16" version="1.1" viewbox="0 0 16 16" width="16">
<path d="M2 2.5A2.5 2.5 0 014.5 0h8.75a.75.75 0 01.75.75v12.5a.75.75 0 01-.75.75h-2.5a.75.75 0 110-1.5h1.75v-2h-8a1 1 0 00-.714 1.7.75.75 0 01-1.072 1.05A2.495 2.495 0 012 11.5v-9zm10.5-1V9h-8c-.356 0-.694.074-1 .208V2.5a1 1 0 011-1h8zM5 12.25v3.25a.25.25 0 00.4.2l1.45-1.087a.25.25 0 01.3 0L8.6 15.7a.25.25 0 00.4-.2v-3.25a.25.25 0 00-.25-.25h-3.5a.25.25 0 00-.25.25z" fill-rule="evenodd"></path>
</svg>
</span>
<h3 class="f3 color-fg-muted text-normal lh-condensed">
<a data-ga-click="Explore, go to repository owner, location:explore feed" data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"REPOSITORY_CARD","click

In [None]:
a_star_tag = article_tags[4].find('span', class_='Counter js-social-count')

In [None]:
a_star_tag

<span aria-label="43557 users starred this repository" class="Counter js-social-count" data-pjax-replace="true" data-plural-suffix="users starred this repository" data-singular-suffix="user starred this repository" data-view-component="true" id="repo-stars-counter-star" title="43,557">43.6k</span>

Let's extract the star count from the `span` tag.

In [None]:
a_star_tag.text.strip()

'43.6k'

The `k` at the end indicates `1000`. Let's write a helper function which can convert strings like `40.3k` into the number `40,300`.

In [None]:
def parse_star_count(stars_str):
    stars_str = stars_str.strip()
    if stars_str[-1] == 'k':
        return int(float(stars_str[:-1]) * 1000)
    else:
        return int(stars_str)

In [None]:
parse_star_count('40.3k')

40300

In [None]:
parse_star_count('991')

991
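One caveat with the helper above: multiplying a `float` by 1000 and truncating with `int` can lose a unit, because a value like `32.3` is stored as slightly less than 32.3. The sketch below is one possible variant that uses `Decimal` instead. The name `parse_star_count_exact` is made up for this illustration, and the comma handling assumes inputs like the `title` attribute (`43,557`) seen on the `span` tag earlier.

```python
from decimal import Decimal

def parse_star_count_exact(stars_str):
    """Variant of parse_star_count that avoids float rounding (illustrative)."""
    # Drop surrounding whitespace and thousands separators such as '43,557'
    stars_str = stars_str.strip().replace(',', '')
    if stars_str.lower().endswith('k'):
        # Decimal keeps '32.3' exact, so multiplying by 1000 gives exactly 32300
        return int(Decimal(stars_str[:-1]) * 1000)
    return int(stars_str)

print(parse_star_count_exact('32.3k'))
print(parse_star_count_exact('43,557'))
```

The rest of this tutorial keeps using `parse_star_count`, so later outputs reflect the original helper.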

We can now determine the star count as a number.

In [None]:
star_count = parse_star_count(a_star_tag.text.strip())

In [None]:
star_count

43600

Perfect, we've extracted all the information we were interested in.

In [None]:
print('Repository name:', repo_name)
print("Owner's username:", username)
print('Stars:', star_count)
print('Repository URL:', repo_url)

Repository name: tesseract
Owner's username: tesseract-ocr
Stars: 43600
Repository URL: https://github.com/tesseract-ocr/tesseract


Let's extract the logic for parsing the required information from an article tag into a function.

> **QUESTION**: Write a function `parse_repository` that returns a dictionary containing the repository name, owner's username, number of stars, and repository URL by parsing a given `article` tag representing a repository.

In [None]:
def parse_repository(article_tag):
    # <a> tags containing username, repository name and URL
    a_tags = article_tag.h3.find_all('a')
    # Owner's username
    username = a_tags[0].text.strip()
    # Repository name
    repo_name = a_tags[1].text.strip()
    # Repository URL
    repo_url = base_url + a_tags[1]['href'].strip()
    # Star count
    stars_tag = article_tag.find('span', class_='Counter js-social-count')
    star_count = parse_star_count(stars_tag.text.strip())
    # Return a dictionary
    return {
        'repository_name': repo_name,
        'owner_username': username,        
        'stars': star_count,
        'repository_url': repo_url
    }

We can now use the function to parse any `article` tag.

In [None]:
parse_repository(article_tags[0])

{'repository_name': 'tensorflow',
 'owner_username': 'tensorflow',
 'stars': 162000,
 'repository_url': 'https://github.com/tensorflow/tensorflow'}

In [None]:
parse_repository(article_tags[10])

{'repository_name': '100-Days-Of-ML-Code',
 'owner_username': 'Avik-Jain',
 'stars': 34200,
 'repository_url': 'https://github.com/Avik-Jain/100-Days-Of-ML-Code'}

We can use a list comprehension to parse all the `article` tags in one go.

In [None]:
top_repositories = [parse_repository(tag) for tag in article_tags]

In [None]:
len(top_repositories)

30

In [None]:
top_repositories[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 162000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 53700,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 53300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 48500,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'tesseract',
  'owner_username': 'tesseract-ocr',
  'stars': 43600,
  'repository_url': 'https://github.com/tesseract-ocr/tesseract'}]



> **QUESTION**: Write a function that takes a `BeautifulSoup` object representing a topic page and returns a list of dictionaries containing information about the top repositories for the topic.


In [None]:
def get_top_repositories(doc):
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-subtle my-4')
    topic_repos = [parse_repository(tag) for tag in article_tags]
    return topic_repos

We can now use the functions we've defined to get the top repositories for any topic.

In [None]:
topic_page_ml = get_topic_page('machine-learning')
top_repos_ml = get_top_repositories(topic_page_ml)
top_repos_ml[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 162000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 53700,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 53300,
  'repository_url': 'https://github.com/pytorch/pytorch'},
 {'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 48500,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'tesseract',
  'owner_username': 'tesseract-ocr',
  'stars': 43600,
  'repository_url': 'https://github.com/tesseract-ocr/tesseract'}]

Here are the top repositories for the keyword `data-analysis`.

In [None]:
topic_page_da = get_topic_page('data-analysis')
top_repos_da = get_top_repositories(topic_page_da)
top_repos_da[:5]

[{'repository_name': 'scikit-learn',
  'owner_username': 'scikit-learn',
  'stars': 48500,
  'repository_url': 'https://github.com/scikit-learn/scikit-learn'},
 {'repository_name': 'superset',
  'owner_username': 'apache',
  'stars': 43500,
  'repository_url': 'https://github.com/apache/superset'},
 {'repository_name': 'pandas',
  'owner_username': 'pandas-dev',
  'stars': 32299,
  'repository_url': 'https://github.com/pandas-dev/pandas'},
 {'repository_name': 'metabase',
  'owner_username': 'metabase',
  'stars': 27100,
  'repository_url': 'https://github.com/metabase/metabase'},
 {'repository_name': 'streamlit',
  'owner_username': 'streamlit',
  'stars': 17300,
  'repository_url': 'https://github.com/streamlit/streamlit'}]

Here are the top repositories for the keyword `python`.

In [None]:
get_top_repositories(get_topic_page('python'))[:5]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 162000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'system-design-primer',
  'owner_username': 'donnemartin',
  'stars': 158000,
  'repository_url': 'https://github.com/donnemartin/system-design-primer'},
 {'repository_name': 'CS-Notes',
  'owner_username': 'CyC2018',
  'stars': 145000,
  'repository_url': 'https://github.com/CyC2018/CS-Notes'},
 {'repository_name': 'Python',
  'owner_username': 'TheAlgorithms',
  'stars': 127000,
  'repository_url': 'https://github.com/TheAlgorithms/Python'},
 {'repository_name': 'awesome-python',
  'owner_username': 'vinta',
  'stars': 113000,
  'repository_url': 'https://github.com/vinta/awesome-python'}]

Do you see the power of defining functions and using libraries? With just one line of code, we can scrape GitHub and find the top repositories for any topic.

Let's save our work before continuing.

In [None]:
jovian.commit()

[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Writing information to CSV files

Let's create a helper function which takes a list of dictionaries and writes them to a CSV file.

The input to our function will be a list of dictionaries of the form:

```
[
  {'key1': 'abc', 'key2': 'def', 'key3': 'ghi'},
  {'key1': 'jkl', 'key2': 'mno', 'key3': 'pqr'},
  {'key1': 'stu', 'key2': 'vwx', 'key3': 'yza'}
  ...
]
```

The function will create a file with a given name containing the following data:

```
key1,key2,key3
abc,def,ghi
jkl,mno,pqr
stu,vwx,yza

```

In [None]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
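The function above joins values with commas directly, so a value that itself contains a comma or a quote would produce a malformed row. As a hedged alternative, the standard library's `csv.DictWriter` takes care of quoting. The name `write_csv_safe`, the sample data, and the temporary file path below are made up for this sketch.

```python
import csv
import os
import tempfile

def write_csv_safe(items, path):
    """Write a list of dictionaries to a CSV file using the csv module."""
    if len(items) == 0:
        return  # nothing to write, so no file is created
    headers = list(items[0].keys())
    # newline='' lets the csv module control line endings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=headers, restval='', extrasaction='ignore')
        writer.writeheader()
        writer.writerows(items)

# Made-up rows; the first name contains a comma on purpose
sample = [
    {'repository_name': 'demo, with comma', 'stars': 1200},
    {'repository_name': 'plain', 'stars': 15},
]
sample_path = os.path.join(tempfile.gettempdir(), 'sample-repos.csv')
write_csv_safe(sample, sample_path)

# Read the file back to confirm the comma survived the round trip
with open(sample_path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['repository_name'])
```

For the repository data in this tutorial the two approaches should produce the same file, since none of the fields contain commas.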

Let's write the data stored in `top_repos_ml` into a CSV file.

In [None]:
len(top_repos_ml)

30

In [None]:
top_repos_ml[:3]

[{'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 162000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 53700,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 53300,
  'repository_url': 'https://github.com/pytorch/pytorch'}]

In [None]:
write_csv(top_repos_ml, 'machine-learning.csv')

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.

In [None]:
with open('machine-learning.csv', 'r') as f:
    print(f.read())

repository_name,owner_username,stars,repository_url
tensorflow,tensorflow,162000,https://github.com/tensorflow/tensorflow
keras,keras-team,53700,https://github.com/keras-team/keras
pytorch,pytorch,53300,https://github.com/pytorch/pytorch
scikit-learn,scikit-learn,48500,https://github.com/scikit-learn/scikit-learn
tesseract,tesseract-ocr,43600,https://github.com/tesseract-ocr/tesseract
face_recognition,ageitgey,42800,https://github.com/ageitgey/face_recognition
TensorFlow-Examples,aymericdamien,41600,https://github.com/aymericdamien/TensorFlow-Examples
faceswap,deepfakes,40100,https://github.com/deepfakes/faceswap
julia,JuliaLang,37800,https://github.com/JuliaLang/julia
awesome-scalability,binhnguyennus,37000,https://github.com/binhnguyennus/awesome-scalability
100-Days-Of-ML-Code,Avik-Jain,34200,https://github.com/Avik-Jain/100-Days-Of-ML-Code
caffe,BVLC,32200,https://github.com/BVLC/caffe
DeepFaceLab,iperov,30700,https://github.com/iperov/DeepFaceLab
d2l-zh,d2l-ai,29700,https://github

Perfect! We've created a CSV containing the information about the top GitHub repositories for the topic `machine-learning`. We can now put together everything we've done so far to solve the original problem.

> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. The top repositories for the topic `machine-learning` can be found on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL. 



In [None]:
import requests
from bs4 import BeautifulSoup
base_url = 'https://github.com'

def scrape_topic_repositories(topic, path=None):
    """Get the top repositories for a topic and write them to a CSV file"""
    if path is None:
        path = topic + '.csv'
    topic_page_doc = get_topic_page(topic)
    topic_repositories = get_top_repositories(topic_page_doc)
    write_csv(topic_repositories, path)
    print('Top repositories for topic "{}" written to file "{}"'.format(topic, path))
    return path

def get_top_repositories(doc):
    """Parse the top repositories for a topic given a Beautiful Soup document"""
    article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-subtle my-4')
    topic_repos = [parse_repository(tag) for tag in article_tags]
    return topic_repos

def get_topic_page(topic):
    """Get the web page containing the top repositories for a topic as a Beautiful Soup document"""
    topic_repos_url = 'https://github.com/topics/' + topic
    response = requests.get(topic_repos_url)
    if response.status_code != 200:
        print('Status code:', response.status_code)
        raise Exception('Failed to fetch web page ' + topic_repos_url)
    return BeautifulSoup(response.text, 'html.parser')

def parse_repository(article_tag):
    """Parse information about a repository from an <article> tag"""
    a_tags = article_tag.h3.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href'].strip()
    stars_tag = article_tag.find('span', class_='Counter js-social-count')
    star_count = parse_star_count(stars_tag.text.strip())
    return {'repository_name': repo_name, 'owner_username': username, 'stars': star_count, 'repository_url': repo_url}

def parse_star_count(stars_str):
    """Parse strings like 40.3k and get the no. of stars as a number"""
    stars_str = stars_str.strip()
    return int(float(stars_str[:-1]) * 1000) if stars_str[-1] == 'k' else int(stars_str)

def write_csv(items, path):
    """Write a list of dictionaries to a CSV file"""
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

The entire code of this problem is only about 50 lines long. Isn't that neat? 

Put another way, if you understand these 50 lines of code, you understand the core workflow of web scraping. Use the interactive nature of Jupyter to experiment with each function and add print statements wherever required to display intermediate output. Reading and understanding code is an essential skill for programmers.

In [None]:
scrape_topic_repositories('machine-learning')

Top repositories for topic "machine-learning" written to file "machine-learning.csv"


'machine-learning.csv'

Now that we have a CSV file, we can use the `pandas` library to view its contents.

In [None]:
import pandas as pd

In [None]:
pd.read_csv('machine-learning.csv')

| | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | tensorflow | tensorflow | 162000 | https://github.com/tensorflow/tensorflow |
| 1 | keras | keras-team | 53700 | https://github.com/keras-team/keras |
| 2 | pytorch | pytorch | 53300 | https://github.com/pytorch/pytorch |
| 3 | scikit-learn | scikit-learn | 48500 | https://github.com/scikit-learn/scikit-learn |
| 4 | tesseract | tesseract-ocr | 43600 | https://github.com/tesseract-ocr/tesseract |
| 5 | face_recognition | ageitgey | 42800 | https://github.com/ageitgey/face_recognition |
| 6 | TensorFlow-Examples | aymericdamien | 41600 | https://github.com/aymericdamien/TensorFlow-Exa... |
| 7 | faceswap | deepfakes | 40100 | https://github.com/deepfakes/faceswap |
| 8 | julia | JuliaLang | 37800 | https://github.com/JuliaLang/julia |
| 9 | awesome-scalability | binhnguyennus | 37000 | https://github.com/binhnguyennus/awesome-scalab... |


In [None]:
scrape_topic_repositories('data-analysis')

Top repositories for topic "data-analysis" written to file "data-analysis.csv"


'data-analysis.csv'

In [None]:
pd.read_csv('data-analysis.csv')

| | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | scikit-learn | scikit-learn | 48500 | https://github.com/scikit-learn/scikit-learn |
| 1 | superset | apache | 43500 | https://github.com/apache/superset |
| 2 | pandas | pandas-dev | 32299 | https://github.com/pandas-dev/pandas |
| 3 | metabase | metabase | 27100 | https://github.com/metabase/metabase |
| 4 | streamlit | streamlit | 17300 | https://github.com/streamlit/streamlit |
| 5 | AI-Expert-Roadmap | AMAI-GmbH | 15600 | https://github.com/AMAI-GmbH/AI-Expert-Roadmap |
| 6 | goaccess | allinurl | 14200 | https://github.com/allinurl/goaccess |
| 7 | CyberChef | gchq | 13900 | https://github.com/gchq/CyberChef |
| 8 | OpenRefine | OpenRefine | 8600 | https://github.com/OpenRefine/OpenRefine |
| 9 | pandas-profiling | pandas-profiling | 8400 | https://github.com/pandas-profiling/pandas-prof... |


In [None]:
scrape_topic_repositories('python')

Top repositories for topic "python" written to file "python.csv"


'python.csv'

In [None]:
pd.read_csv('python.csv')

| | repository_name | owner_username | stars | repository_url |
|---|---|---|---|---|
| 0 | tensorflow | tensorflow | 162000 | https://github.com/tensorflow/tensorflow |
| 1 | system-design-primer | donnemartin | 158000 | https://github.com/donnemartin/system-design-pr... |
| 2 | CS-Notes | CyC2018 | 145000 | https://github.com/CyC2018/CS-Notes |
| 3 | Python | TheAlgorithms | 127000 | https://github.com/TheAlgorithms/Python |
| 4 | awesome-python | vinta | 113000 | https://github.com/vinta/awesome-python |
| 5 | free-programming-books-zh_CN | justjavac | 86400 | https://github.com/justjavac/free-programming-b... |
| 6 | thefuck | nvbn | 66100 | https://github.com/nvbn/thefuck |
| 7 | django | django | 61700 | https://github.com/django/django |
| 8 | project-based-learning | practical-tutorials | 61100 | https://github.com/practical-tutorials/project-... |
| 9 | flask | pallets | 57600 | https://github.com/pallets/flask |


Of course, we can go even further and write a function that scrapes top repositories for several topics.

> **EXERCISE**: Write a function `scrape_topics` which takes a list of topics and creates a CSV file containing the top repositories for each topic. Test it out using the empty cells below.

Let's save our work before continuing.

In [None]:
jovian.commit(files=['machine-learning.csv', 'python.csv', 'data-analysis.csv'])

[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Using a REST API to retrieve data as JSON

Not all URLs point to an HTML page. Consider this URL for example: https://api.github.com/repos/octocat/hello-world . It points to a JSON document, which has a structure like this:


```json
{
  "name": "Hello-World",
  "full_name": "octocat/Hello-World",
  "private": false,
  "owner": {
    "login": "octocat",
    "id": 583231
  },
  "html_url": "https://github.com/octocat/Hello-World"
}
```

It's quite similar to a Python dictionary. In fact, you can use Python's built-in `json` module to convert a JSON document into a Python dictionary.

In [None]:
response = requests.get('https://api.github.com/repos/octocat/hello-world')

In [None]:
import json

data_dict = json.loads(response.text)

In [None]:
data_dict

{'id': 1296269,
 'node_id': 'MDEwOlJlcG9zaXRvcnkxMjk2MjY5',
 'name': 'Hello-World',
 'full_name': 'octocat/Hello-World',
 'private': False,
 'owner': {'login': 'octocat',
  'id': 583231,
  'node_id': 'MDQ6VXNlcjU4MzIzMQ==',
  'avatar_url': 'https://avatars.githubusercontent.com/u/583231?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/octocat',
  'html_url': 'https://github.com/octocat',
  'followers_url': 'https://api.github.com/users/octocat/followers',
  'following_url': 'https://api.github.com/users/octocat/following{/other_user}',
  'gists_url': 'https://api.github.com/users/octocat/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/octocat/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/octocat/subscriptions',
  'organizations_url': 'https://api.github.com/users/octocat/orgs',
  'repos_url': 'https://api.github.com/users/octocat/repos',
  'events_url': 'https://api.github.com/users/octocat/events{/privacy}',
  'received
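To see the conversion in isolation, the sketch below parses a shortened JSON string by hand instead of fetching it. The names `sample_text` and `sample_dict` are made up for this illustration, and the string is a trimmed copy of the structure shown earlier.

```python
import json

# A shortened JSON document with the same shape as the API response above
sample_text = '''
{
  "name": "Hello-World",
  "full_name": "octocat/Hello-World",
  "private": false,
  "owner": {"login": "octocat", "id": 583231},
  "html_url": "https://github.com/octocat/Hello-World"
}
'''

# json.loads turns the JSON text into nested Python dictionaries
sample_dict = json.loads(sample_text)

# Nested objects become nested dictionaries; JSON false becomes Python False
print(sample_dict['owner']['login'])
print(sample_dict['private'])
```

The `requests` library also provides `response.json()`, which parses the response body in one step.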

Unlike HTML, JSON is easy to work with in Python: simply fetch the contents of the URL and convert them to a dictionary. Such URLs are often called **REST APIs** or REST API endpoints. Many websites offer well-documented REST APIs to access data from the site in JSON format:

* GitHub: https://docs.github.com/en/rest/reference/repos
* Facebook: https://developers.facebook.com/docs/groups-api/reference
* Twitter: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline
* Reddit: https://www.reddit.com/dev/api/

Using an API is the *officially supported* way of extracting information from a website. To use an API, you will often need to register as a developer on the platform and generate an API key, which you'll need to send with every request to authenticate yourself. 

Since GitHub offers a public API, we can use it without an API key to fetch information about public repositories, although unauthenticated requests are rate limited.


> **QUESTION**: Write a function `get_repo_details` to find the following information about a repository: description, watcher count, fork count, open issues count, created at time and updated at time.



In [None]:
def get_repo_details(username, repo_name):
    print('Fetching information for {}/{}'.format(username, repo_name))
    repo_details_url = "https://api.github.com/repos/" + username + "/" + repo_name
    response = requests.get(repo_details_url)
    if not response.ok:
        print("Failed to fetch!")
        return {}
    repo_data = json.loads(response.text)
    return {
        'description': repo_data['description'],
        'watchers': repo_data['watchers_count'],
        'open_issues': repo_data['open_issues_count'],
        'created_at': repo_data['created_at'],
        'updated_at': repo_data['updated_at']
    }

In [None]:
get_repo_details('octocat', 'hello-world')

Fetching information for octocat/hello-world


{'description': 'My first repository on GitHub!',
 'watchers': 1748,
 'open_issues': 752,
 'created_at': '2011-01-26T19:01:12Z',
 'updated_at': '2022-01-13T04:09:25Z'}

In [None]:
get_repo_details('tensorflow', 'tensorflow')

Fetching information for tensorflow/tensorflow


{'description': 'An Open Source Machine Learning Framework for Everyone',
 'watchers': 162000,
 'open_issues': 2544,
 'created_at': '2015-11-07T01:19:20Z',
 'updated_at': '2022-01-13T09:01:41Z'}
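The question above also asks for the fork count, which `get_repo_details` does not return. The sketch below shows one way the field selection could be written so that it includes forks, assuming the response carries a `forks_count` key alongside the keys already used. The helper name `extract_repo_details` and the sample values are made up for this illustration and are not real API output.

```python
def extract_repo_details(repo_data):
    """Pick the fields of interest from a parsed repository response (hypothetical helper)."""
    # Maps the name we want in our dataset to the key used in the API response
    fields = {
        'description': 'description',
        'watchers': 'watchers_count',
        'forks': 'forks_count',
        'open_issues': 'open_issues_count',
        'created_at': 'created_at',
        'updated_at': 'updated_at',
    }
    # .get returns None instead of raising KeyError when a key is missing
    return {name: repo_data.get(key) for name, key in fields.items()}

# Made-up values for illustration only
sample_repo_data = {
    'description': 'Example repository',
    'watchers_count': 10,
    'forks_count': 3,
    'open_issues_count': 1,
    'created_at': '2020-01-01T00:00:00Z',
    'updated_at': '2020-06-01T00:00:00Z',
}
details = extract_repo_details(sample_repo_data)
print(details['forks'])
```

In `get_repo_details`, a helper like this could replace the hand-written dictionary after `json.loads(response.text)`.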

> **QUESTION**: Augment the list of top repositories for a topic with the repository description, watcher count, fork count, open issues count, created at time and updated at time.



In [None]:
def add_repo_details(repos):
    return [dict(**get_repo_details(repo['owner_username'], repo['repository_name']), **repo) for repo in repos]

In [None]:
add_repo_details(top_repositories[:5])

Fetching information for tensorflow/tensorflow
Fetching information for keras-team/keras
Fetching information for pytorch/pytorch
Fetching information for scikit-learn/scikit-learn
Fetching information for tesseract-ocr/tesseract


[{'description': 'An Open Source Machine Learning Framework for Everyone',
  'watchers': 162000,
  'open_issues': 2544,
  'created_at': '2015-11-07T01:19:20Z',
  'updated_at': '2022-01-13T09:01:41Z',
  'repository_name': 'tensorflow',
  'owner_username': 'tensorflow',
  'stars': 162000,
  'repository_url': 'https://github.com/tensorflow/tensorflow'},
 {'description': 'Deep Learning for humans',
  'watchers': 53666,
  'open_issues': 253,
  'created_at': '2015-03-28T00:35:42Z',
  'updated_at': '2022-01-13T05:22:16Z',
  'repository_name': 'keras',
  'owner_username': 'keras-team',
  'stars': 53700,
  'repository_url': 'https://github.com/keras-team/keras'},
 {'description': 'Tensors and Dynamic neural networks in Python with strong GPU acceleration',
  'watchers': 53268,
  'open_issues': 11047,
  'created_at': '2016-08-13T05:26:41Z',
  'updated_at': '2022-01-13T09:03:50Z',
  'repository_name': 'pytorch',
  'owner_username': 'pytorch',
  'stars': 53300,
  'repository_url': 'https://github.

You may get rate limited if you attempt to make more than 60 requests per hour. To get a higher limit, authenticate with a GitHub OAuth token as described here: https://towardsdatascience.com/all-the-things-you-can-do-with-github-api-and-python-f01790fca131

Note: Never publish your GitHub API token publicly, as it can be used to access your GitHub account. To read your API token without displaying it on the screen, use `getpass`.

In [None]:
from getpass import getpass

token = getpass()

········
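Once you have a token, it is typically sent in the `Authorization` header of each request. The sketch below only builds the headers. The helper name `build_github_headers` and the placeholder token are made up for this illustration, and the commented-out call shows where the headers would be passed to `requests`.

```python
def build_github_headers(token):
    """Build request headers carrying a GitHub token (hypothetical helper)."""
    return {
        'Authorization': 'Bearer ' + token,
        'Accept': 'application/vnd.github+json',
    }

headers = build_github_headers('example-token')  # placeholder, not a real token

# The headers can then be passed along with each request, for example:
# response = requests.get('https://api.github.com/repos/octocat/hello-world', headers=headers)
print(sorted(headers))
```

In the notebook, you would pass the `token` variable read with `getpass` instead of the placeholder, and avoid printing the header values.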


> **EXERCISE**: Augment the list of top repositories for a topic with some additional information about the user/organization the repository belongs to: name, description, GitHub URL, no. of repositories, type (user or organization) etc.

### Acronyms

In case you're feeling overwhelmed by all the acronyms, here are their expansions:
- **REST**: Representational State Transfer
- **API**: Application Programming Interface
- **JSON**: JavaScript Object Notation
- **URL**: Uniform Resource Locator

Don't worry, you needn't remember any of them!


Let's save our work before continuing.

In [None]:
jovian.commit()

[jovian] Updating notebook "aakashns/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/aakashns/python-web-scraping-and-rest-api


'https://jovian.ai/aakashns/python-web-scraping-and-rest-api'

## Crawling Websites by Parsing Links on a Page

When you scrape a web page, you are likely to find several links on the page. For example, on the page https://github.com/topics, you will find links to several topic pages. You can parse all the topic page links from this page, and scrape those pages to get the top repositories for each topic. Further, you can parse all the repository links from a topic page and scrape individual repository pages, and so on.

The process of scraping a page, parsing its links, and then using those links to scrape other pages on the same site is called **web crawling**. It's how search engines like Google are able to index and search data from millions of websites on the internet. Python offers libraries like [Scrapy](https://scrapy.org) for crawling websites easily.
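The crawling loop described above can be sketched without any network access. In the sketch below, the function `crawl`, its `get_links` parameter, and the `fake_site` dictionary are all made up for illustration: `get_links` stands in for "download a page and parse its links", and `fake_site` is a tiny in-memory stand-in for real pages.

```python
from collections import deque

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl: visit a page, collect its links, then visit those links."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # avoid visiting the same page twice
                seen.add(link)
                queue.append(link)
    return visited

# A tiny in-memory "site" standing in for real pages (made-up structure)
fake_site = {
    '/topics': ['/topics/python', '/topics/machine-learning'],
    '/topics/python': ['/topics', '/vinta/awesome-python'],
    '/topics/machine-learning': ['/tensorflow/tensorflow'],
}

pages = crawl('/topics', lambda url: fake_site.get(url, []))
print(pages)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, and `max_pages` caps the total number of pages visited.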

You can do some basic crawling with `requests`, Beautiful Soup, and a few simple `for` loops in Python. Here's an exercise to get you started:


> **EXERCISE**: Get the top 100 repositories for all the featured topics on GitHub. You might find these URLs useful:
> 
> * Eighth page of featured topics: https://github.com/topics/?page=8  
> * Second page of top repositories for a topic: https://github.com/topics/machine-learning?page=2 

In [None]:
jovian.commit()

[jovian] Updating notebook "samanvitha/python-web-scraping-and-rest-api" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/samanvitha/python-web-scraping-and-rest-api


'https://jovian.ai/samanvitha/python-web-scraping-and-rest-api'

## Summary and Further Reading

We've covered the following topics in this tutorial:

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Writing parsed information into CSV files
* Using a REST API to retrieve data as JSON
* Combining data from multiple sources
* Using links on a page to crawl a website


Here are some things to keep in mind w.r.t. web scraping:

* Many websites disallow web scraping for commercial purposes
* Prefer using web scraping only for learning and research purposes
* Some websites may block your IP or stop sending valid information if you send too many requests
* Review the terms and conditions of a website before scraping data from it
* Remove sensitive and personally identifiable information before publishing a dataset online
* Use official REST APIs wherever available, with proper API keys
* Scraping data that you see after logging in is harder (it requires special cookies and headers)
* Websites change their HTML layout frequently, which may cause your scraping scripts to break
* Content that is loaded dynamically with JavaScript cannot be scraped using `requests` and Beautiful Soup alone. One way to scrape dynamic websites is to use Selenium
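One simple way to avoid sending too many requests in a short time is to pause between them. The sketch below is a minimal illustration, not a complete rate limiter. The generator name `throttled` and its parameters are made up here, and the loop body only prints instead of scraping.

```python
import time

def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # wait before handing out the next item
        yield item

topics = ['machine-learning', 'data-analysis', 'python']
for topic in throttled(topics, delay_seconds=0.01):
    # A real script would call something like scrape_topic_repositories(topic) here
    print('Would scrape topic:', topic)
```

The `sleep` parameter exists only so the pause can be swapped out when experimenting. A delay of a second or more between requests is a common starting point, though the right value depends on the site.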


Here are some more examples of scraping:

* https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
* https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5
* https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961
* https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
* https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
* https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852
* https://www.analyticsvidhya.com/blog/2020/10/web-scraping-selenium-in-python/
* https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747

### Project Ideas

Here are some project ideas if you're looking to work on a web scraping project. You can work on one of these ideas, or pick something entirely different.

1. **Dataset of Books (Amazon)**: Create a dataset of popular books in different genres by scraping the site: https://www.amazon.in/gp/bestsellers/books/ 


2. **Dataset of Quotes (BrainyQuote)**: Create a dataset of quotes for different tags/topics by scraping the site: https://www.brainyquote.com/topics


3. **Dataset of Movies (TMDb)**: The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie . Can you scrape the site to create a dataset of movies containing information like title, release date, cast, etc.? You can also create datasets of movie actors/actresses/directors using this site.


4. **Dataset of TV Shows (TMDb)**: The Movie Database (TMDb) contains information about thousands of TV shows from around the world: https://www.themoviedb.org/tv . Can you scrape the site to create a dataset of TV shows containing information like title, release date, cast, crew, etc.? You can also create datasets of TV actors/actresses/directors using this site.


5. **Collections of Popular Repositories (GitHub)**: Scrape GitHub collections ( https://github.com/collections ) to create a dataset of popular repositories organized by different use cases.


6. **Dataset of Books (BooksToScrape)**: Create a dataset of popular books in different genres by scraping the site *Books To Scrape*: http://books.toscrape.com


7. **Dataset of Quotes (QuotesToScrape)**: Create a dataset of popular quotes for different tags by scraping the site *Quotes To Scrape*: http://quotes.toscrape.com


8. **Scrape a User's Repositories (GitHub)**: Given someone's GitHub username, can you scrape their GitHub profile to create a list of their repositories with information like repository name, no. of stars, no. of forks, etc.?


9. **Scrape User Reviews (ConsumerAffairs)**: ConsumerAffairs contains reviews of thousands of brands: https://www.consumeraffairs.com/ . Can you scrape any category from the site to create a dataset of reviews containing information like title, rating, reviews, toll-free number, etc.?


10. **Songs Dataset (AZLyrics)**: Create a dataset of songs by scraping AZLyrics: https://www.azlyrics.com/f.html . Capture information like song title, artist name, year of release and lyrics URL. 


11. **Scrape a Popular Blog**: Create a dataset of blog posts on a popular blog e.g. https://m.signalvnoise.com/search/ . The dataset can contain information like the blog title, published date, tags, author, link to blog post, etc.


12. **Weekly Top Songs (Top 40 Weekly)**: Create a dataset of the top 40 songs of each week in a given year by scraping the site https://top40weekly.com . Capture information like song title, artist, weekly rank, etc.

## Questions for Revision
1. Why do we need to scrape websites? 
2. What different tools can we use to scrape websites?
3. What are the applications of web scraping?
4. What are the steps involved in web scraping?
5. What are the techniques to get data from websites?
6. What technique is used to retrieve data in a machine-readable format in Python?
7. How can one download a webpage from the internet using Python?
8. What library do we need for downloading the webpage in Python?
9. What function from the library do we need for downloading the webpage?
10. How do we make sure that the webpage is downloaded successfully?
11. How can we access the content of the downloaded webpage?
12. What function do we need to find out the total number of characters in the downloaded webpage?
13. What defines the content and structure of the downloaded webpage?
14. What is source code? In what language is it usually written?
15. How do the original webpage and the scraped webpage differ?
16. How many parts does an HTML tag have? What are they?
17. Is it possible to be blocked by a website when you scrape many pages? If yes, how can one avoid this?
18. How do we get the information we need from the downloaded website?
19. What library do we need to install to extract information from HTML source code?
20. What is the `doc` object?
21. How can we access attributes of a tag?
22. How do we find the direct children of the tag?
23. What is the purpose of strip()?
24. How can we write the extracted information into CSV files?
25. What are REST APIs? How are they different from usual URLs?
26. What is the official way to extract information from a website? What do we need for that? How does it help one in extracting information?
27. What websites offer public APIs?
28. Can we extract data from all the websites on the web? If not, why?
29. What is getpass()?
30. What is web crawling and how is it different from web scraping?
31. What are the applications of web crawling?
32. What does Python offer for crawling websites?
33. How do we extract data from dynamic websites?

## Solutions for Exercises

> **EXERCISE**: Find the first occurrence of each of these tags in `doc`: `div`, `img`, `span`, `p`, etc.

In [None]:
# find_all returns every match, so index 0 is the first <div>
first_div=doc.find_all('div')[0]
# Equivalent alternatives: doc.find('div') or doc.div
first_div

<div class="position-relative js-header-wrapper">
<a class="px-2 py-4 color-bg-accent-emphasis color-fg-on-emphasis show-on-focus js-skip-to-content" href="#start-of-content">Skip to content</a>
<span class="progress-pjax-loader js-pjax-loader-bar Progress position-fixed width-full" data-view-component="true">
<span class="Progress-item progress-pjax-loader-bar left-0 top-0 color-bg-accent-emphasis" data-view-component="true" style="width: 0%;"></span>
</span>
<header class="Header-old header-logged-out js-details-container Details position-relative f4 py-2" role="banner">
<div class="container-lg d-lg-flex flex-items-center p-responsive">
<div class="d-flex flex-justify-between flex-items-center">
<a aria-label="Homepage" class="mr-4 color-fg-inherit" data-ga-click="(Logged out) Header, go to homepage, icon:logo-wordmark" href="https://github.com/">
<svg aria-hidden="true" class="octicon octicon-mark-github" data-view-component="true" height="32" version="1.1" viewbox="0 0 16 16" widt

In [None]:
first_img=doc.find('img')
#first_img=doc.find_all('img')[0]
#first_img=doc.img
first_img

<img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>

In [None]:
first_span=doc.span
#first_span=doc.find_all('span')[0]
#first_span=doc.find('span')
first_span

<span class="progress-pjax-loader js-pjax-loader-bar Progress position-fixed width-full" data-view-component="true">
<span class="Progress-item progress-pjax-loader-bar left-0 top-0 color-bg-accent-emphasis" data-view-component="true" style="width: 0%;"></span>
</span>

In [None]:
first_p=doc.find_all('p')
first_p[0]

<p>Machine learning is the practice of teaching a computer to learn. The concept uses pattern recognition, as well as other forms of predictive algorithms, to make judgments on incoming data. This field is closely related to artificial intelligence and computational statistics.</p>

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

In [None]:
all_images=doc.find_all('img')
len(all_images)

16

> **EXERCISE**: Find the 5th image tag on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

In [None]:
fifth_image=all_images[5]
fifth_image

<img alt="janeyx99" class="avatar avatar-user avatar-small" height="32" src="https://avatars.githubusercontent.com/u/31798555?v=4" width="32"/>

In [None]:
fifth_image['src']

'https://avatars.githubusercontent.com/u/31798555?v=4'

In [None]:
fifth_image['alt']

'janeyx99'

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

In [None]:
doc.find('img',{'alt':'julia'})['src']

'https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72'

> **EXERCISE**: Find the list of all the images matching the class `d-block width-full`. Each list element should be a dictionary containing two keys, `"username"` and `"url"`. You can obtain the username using the `alt` attribute of a tag and the URL using the `src` attribute.

In [None]:
image_link_tags = doc.find_all('img', class_='d-block width-full')
avatar_users = []
for tag in image_link_tags:
    avatar_users.append({
        'username' : tag['alt'],
        'url': tag['src']
        })
avatar_users

[{'username': 'transformers',
  'url': 'https://repository-images.githubusercontent.com/155220641/a16c4880-a501-11ea-9e8f-646cf611702e'},
 {'username': 'ML-For-Beginners',
  'url': 'https://repository-images.githubusercontent.com/343965132/549b1a80-c897-11eb-9436-918072d2e0f8'},
 {'username': 'awesome-scalability',
  'url': 'https://repository-images.githubusercontent.com/115478820/109a8e00-283a-11ea-8891-ad7215b06a4c'},
 {'username': 'julia',
  'url': 'https://repository-images.githubusercontent.com/1644196/ddfc1e00-6638-11e9-9b80-0fe7b9aedd72'},
 {'username': 'Made-With-ML',
  'url': 'https://repository-images.githubusercontent.com/156157055/5d88d9c5-1030-4da2-a153-d2836b299eac'},
 {'username': 'yolov5',
  'url': 'https://repository-images.githubusercontent.com/264818686/40f8c2c3-7919-4652-b278-ec6a7fb06a53'}]

> **EXERCISE**: Write a function `scrape_topics` which takes a list of topics and creates a CSV file containing the top repositories for each topic. Test it out using the empty cells below.

In [None]:
topics=['data-analysis','python','deep-learning']

In [None]:
def scrape_topics(topics):
    for topic in topics:
        scrape_topic_repositories(topic)

In [None]:
scrape_topics(topics)

Top repositories for topic "data-analysis" written to file "data-analysis.csv"
Top repositories for topic "python" written to file "python.csv"
Top repositories for topic "deep-learning" written to file "deep-learning.csv"


In [None]:
pd.read_csv('data-analysis.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,scikit-learn,scikit-learn,48500,https://gitub.com/scikit-learn/scikit-learn
1,superset,apache,43500,https://gitub.com/apache/superset
2,pandas,pandas-dev,32299,https://gitub.com/pandas-dev/pandas
3,metabase,metabase,27100,https://gitub.com/metabase/metabase
4,streamlit,streamlit,17300,https://gitub.com/streamlit/streamlit
5,AI-Expert-Roadmap,AMAI-GmbH,15600,https://gitub.com/AMAI-GmbH/AI-Expert-Roadmap
6,goaccess,allinurl,14200,https://gitub.com/allinurl/goaccess
7,CyberChef,gchq,13900,https://gitub.com/gchq/CyberChef
8,OpenRefine,OpenRefine,8600,https://gitub.com/OpenRefine/OpenRefine
9,pandas-profiling,pandas-profiling,8400,https://gitub.com/pandas-profiling/pandas-prof...


In [None]:
pd.read_csv('python.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,tensorflow,tensorflow,162000,https://gitub.com/tensorflow/tensorflow
1,system-design-primer,donnemartin,158000,https://gitub.com/donnemartin/system-design-pr...
2,CS-Notes,CyC2018,145000,https://gitub.com/CyC2018/CS-Notes
3,Python,TheAlgorithms,127000,https://gitub.com/TheAlgorithms/Python
4,awesome-python,vinta,113000,https://gitub.com/vinta/awesome-python
5,free-programming-books-zh_CN,justjavac,86400,https://gitub.com/justjavac/free-programming-b...
6,thefuck,nvbn,66100,https://gitub.com/nvbn/thefuck
7,django,django,61700,https://gitub.com/django/django
8,project-based-learning,practical-tutorials,61100,https://gitub.com/practical-tutorials/project-...
9,flask,pallets,57600,https://gitub.com/pallets/flask


In [None]:
pd.read_csv('deep-learning.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,tensorflow,tensorflow,162000,https://gitub.com/tensorflow/tensorflow
1,opencv,opencv,59100,https://gitub.com/opencv/opencv
2,keras,keras-team,53700,https://gitub.com/keras-team/keras
3,pytorch,pytorch,53300,https://gitub.com/pytorch/pytorch
4,TensorFlow-Examples,aymericdamien,41600,https://gitub.com/aymericdamien/TensorFlow-Exa...
5,faceswap,deepfakes,40100,https://gitub.com/deepfakes/faceswap
6,100-Days-Of-ML-Code,Avik-Jain,34200,https://gitub.com/Avik-Jain/100-Days-Of-ML-Code
7,Real-Time-Voice-Cloning,CorentinJ,32700,https://gitub.com/CorentinJ/Real-Time-Voice-Cl...
8,caffe,BVLC,32200,https://gitub.com/BVLC/caffe
9,Deep-Learning-Papers-Reading-Roadmap,floodsung,31500,https://gitub.com/floodsung/Deep-Learning-Pape...


> **EXERCISE**: Get the top 100 repositories for all the featured topics on GitHub. You might find these URLs useful:
> 
> * Eighth page of featured topics: https://github.com/topics/?page=8  
> * Second page of top repositories for a topic: https://github.com/topics/machine-learning?page=2 

In [None]:
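def get_feature_page(n):
    """Download pages 1 to n-1 of GitHub's featured topics and return them as parsed documents"""
    docs=[]
    # Note: range(1, n) stops at n-1, so n=2 fetches only the first page
    for i in range(1,n):
        topic_repos_url = 'https://github.com/topics/?page=' + str(i)
        response = requests.get(topic_repos_url)
        if response.status_code != 200:
            print('Status code:', response.status_code)
            raise Exception('Failed to fetch web page ' + topic_repos_url)
        # Name the parser explicitly so Beautiful Soup does not have to guess one
        doc = BeautifulSoup(response.text, 'html.parser')
        docs.append(doc)
    
    return docs 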

In [None]:
def get_featured_topics(docs):
    hrefs=[]
    for doc in docs:
        href=doc.find_all('a',class_='no-underline d-flex flex-column flex-justify-center')
        for a in href:
            link=a['href']
            hrefs.append(link)
    return hrefs

In [None]:
def scrape_featured_repositories(topic):
    """Get the top repositories for a topic and write each page to its own CSV file"""
    # `topic` is a path like '/topics/matlab'; keep only the last segment for file names.
    # (str.strip('/topics') removes characters, not a prefix, and would turn 'covid-19' into 'vid-19'.)
    topic_name = topic.split('/')[-1]
    paths = []
    # 5 pages of 20 repositories each
    for i in range(1,6):
        path = topic_name + str(i) + '.csv'
        # `topic` already starts with '/', so no extra slash is needed after the domain
        topic_repos_url = 'https://github.com' + topic + '?page=' + str(i)
        response = requests.get(topic_repos_url)
        if response.status_code != 200:
            print('Status code:', response.status_code)
            raise Exception('Failed to fetch web page ' + topic_repos_url)
        topic_page_doc = BeautifulSoup(response.text, 'html.parser')
        topic_repositories = get_top_repositories(topic_page_doc)
        write_csv(topic_repositories, path)
        paths.append(path)
        print('Top repositories for topic "{}" written to file "{}"'.format(topic, path))
    return paths

In [None]:
def get_top_featured_repositories(n):
    docs=get_feature_page(n)
    topics=get_featured_topics(docs)
    for topic in topics:
        scrape_featured_repositories(topic)

In [None]:
# Scraping the featured topics on the first page only. You can go ahead and scrape as many pages as you'd like.
get_top_featured_repositories(2)

Top repositories for topic "/topics/matlab" written to file "matlab1.csv"
Top repositories for topic "/topics/matlab" written to file "matlab2.csv"
Top repositories for topic "/topics/matlab" written to file "matlab3.csv"
Top repositories for topic "/topics/matlab" written to file "matlab4.csv"
Top repositories for topic "/topics/matlab" written to file "matlab5.csv"
Top repositories for topic "/topics/covid-19" written to file "covid-191.csv"
Top repositories for topic "/topics/covid-19" written to file "covid-192.csv"
Top repositories for topic "/topics/covid-19" written to file "covid-193.csv"
Top repositories for topic "/topics/covid-19" written to file "covid-194.csv"
Top repositories for topic "/topics/covid-19" written to file "covid-195.csv"
Top repositories for topic "/topics/r" written to file "r1.csv"
Top repositories for topic "/topics/r" written to file "r2.csv"
Top repositories for topic "/topics/r" written to file "r3.csv"
Top repositories for topic "/topics/r" written to file "r4
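
A note on deriving file names from topic paths such as `/topics/covid-19`: `str.strip` treats its argument as a set of characters to trim from both ends, not as a prefix to remove. The short sketch below contrasts it with splitting on `/`. The helper name `topic_name` is hypothetical and used only for illustration.

```python
def topic_name(topic_path):
    """Return the last path segment, e.g. '/topics/covid-19' -> 'covid-19'."""
    return topic_path.rstrip('/').split('/')[-1]

path = '/topics/covid-19'

# strip() trims any of the characters '/', 't', 'o', 'p', 'i', 'c', 's' from both ends,
# so the leading 'c' and 'o' of 'covid-19' are removed as well
print(path.strip('/topics'))   # vid-19

# Splitting on '/' keeps the topic name intact
print(topic_name(path))        # covid-19
```

On Python 3.9 or later, `path.removeprefix('/topics/')` is another option.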

In [None]:
pd.read_csv('matlab1.csv')

Unnamed: 0,repository_name,owner_username,stars,repository_url
0,PRMLT,PRML,5400,https://gitub.com/PRML/PRMLT
1,Realtime_Multi-Person_Pose_Estimation,ZheC,4800,https://gitub.com/ZheC/Realtime_Multi-Person_P...
2,a32nx,flybywiresim,3900,https://gitub.com/flybywiresim/a32nx
3,MathModel,zhanwen,3900,https://gitub.com/zhanwen/MathModel
4,Lenia,Chakazul,2800,https://gitub.com/Chakazul/Lenia
5,Bilibili-plus,19PDP,2000,https://gitub.com/19PDP/Bilibili-plus
6,arl,kaxap,1600,https://gitub.com/kaxap/arl
7,Algorithms_MathModels,HuangCongQing,1300,https://gitub.com/HuangCongQing/Algorithms_Mat...
8,matlab2tikz,matlab2tikz,1200,https://gitub.com/matlab2tikz/matlab2tikz
9,facerec,bytefish,934,https://gitub.com/bytefish/facerec


Try combining the per-page CSV files for each topic into a single file ;)
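
One possible approach to combining the per-page files, sketched with only the standard library `csv` module. `combine_topic_csvs` is a hypothetical helper, and it assumes the per-page files are named `<topic><page>.csv` (for example `matlab1.csv` to `matlab5.csv`) and share the same header row.

```python
import csv

def combine_topic_csvs(topic, num_pages, out_path=None):
    """Concatenate <topic>1.csv ... <topic>N.csv into one CSV with a single header."""
    out_path = out_path or topic + '_all.csv'
    header_written = False
    with open(out_path, 'w', newline='', encoding='utf-8') as out_file:
        writer = csv.writer(out_file)
        for page in range(1, num_pages + 1):
            page_path = topic + str(page) + '.csv'
            with open(page_path, newline='', encoding='utf-8') as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    # Write the header once, taken from the first non-empty file
                    writer.writerow(header)
                    header_written = True
                # Copy the remaining data rows unchanged
                writer.writerows(reader)
    return out_path
```

For example, `combine_topic_csvs('matlab', 5)` would write `matlab_all.csv`. The rows are copied as-is, so any index column saved in the per-page files restarts on each page. With pandas, `pd.concat` over the per-page dataframes with `ignore_index=True` is an alternative that renumbers the rows.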