<a href="https://colab.research.google.com/github/tavi1402/Data_Science_bootcamp/blob/main/3_1_python_web_scraping_and_rest_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Web Scraping and REST APIs

![](https://i.imgur.com/6zM7JBq.png)


Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing [HTML documents](https://developer.mozilla.org/en-US/docs/Web/HTML), some platforms also offer [REST APIs](https://www.smashingmagazine.com/2018/01/understanding-using-rest-api/) to retrieve information in a machine-readable format like [JSON](https://www.digitalocean.com/community/tutorials/an-introduction-to-json). In this tutorial, we'll use web scraping and REST APIs to create a real-world dataset.


This tutorial covers the following topics:

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Writing parsed information into CSV files
* Using a REST API to retrieve data as JSON
* Combining data from multiple sources
* Using links on a page to crawl a website


### How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable [Jupyter notebook](https://jupyter.org). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.





## Problem

Over the course of this tutorial, we'll solve the following problem to learn the tools and techniques used for web scraping:


> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. You can view the top repositories for the topic `machine-learning` on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL.


 <a href="https://github.com/topics/machine-learning"><img src="https://i.imgur.com/5V1HGLs.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;"></a>


How would you go about solving this problem in Python? Explore the web page and take a couple of minutes to come up with an approach before proceeding further. How many lines of code do you think the solution will require?

## Downloading a web page using `requests`

When you access a URL like https://github.com/topics/machine-learning using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. Before we can extract information from a web page, we need to download the page using Python.

We'll use a library called [`requests`](https://docs.python-requests.org/en/master/) to download web pages from the internet. Let's begin by installing and importing the library.

We can download a web page using the `requests.get` function.

`requests.get` returns a response object with the page contents and some information indicating whether the request was successful, using a status code. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

If the request was successful, `response.status_code` is set to a value between 200 and 299.

The contents of the web page can be accessed using the `.text` property of the `response`.
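
The steps described so far can be sketched as a single helper. This is a minimal sketch, not the notebook's original cell: the names `is_success` and `download_page` and the `timeout` value are assumptions made for illustration, while `requests.get`, `status_code`, and `text` are the pieces described above. If the library isn't installed yet, `pip install requests` should add it.

```python
import requests


def is_success(status_code):
    """Return True for HTTP status codes in the 200-299 range."""
    return 200 <= status_code <= 299


def download_page(url):
    """Download a web page and return its contents as a string."""
    response = requests.get(url, timeout=30)  # the timeout is an illustrative choice
    if not is_success(response.status_code):
        raise RuntimeError(f'Failed to load {url} (status {response.status_code})')
    return response.text
```

Calling `download_page('https://github.com/topics/machine-learning')` should return the page source as a string. Applying `len` to the result gives the number of characters, and slicing it with `[:1000]` shows the first 1000 characters.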

The page contains over 60,000 characters! Let's view the first 1000 characters of the web page.

What you see above is the *source code* of the web page. It is written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML), which defines the content and structure of the web page.

Let's save the contents to a file with the `.html` extension.
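
Here's a minimal sketch of saving page contents to a file. The helper name `save_page` is an assumption, and a short placeholder string stands in for `response.text` so that the example runs without a network connection.

```python
import os
import tempfile


def save_page(contents, path):
    """Write page contents to a file using UTF-8 encoding."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(contents)


# Placeholder for response.text (hypothetical contents, not the real page).
page_contents = '<html><head><title>Sample</title></head><body></body></html>'

# A temporary directory keeps this demo from cluttering the working directory.
output_path = os.path.join(tempfile.mkdtemp(), 'machine-learning.html')
save_page(page_contents, output_path)

# Read the file back to confirm that it was written correctly.
with open(output_path, encoding='utf-8') as f:
    saved_contents = f.read()
```

In the notebook, you would pass `response.text` and the path `'machine-learning.html'` directly, so that the file shows up next to the notebook.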

You can now view the file using the "File > Open" menu option within Jupyter and clicking on *machine-learning.html* in the list of files displayed. Here's what you'll see when you open the file:

<img src="https://i.imgur.com/8gEbT1P.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

While this looks similar to the original web page, note that it's simply a copy. You will notice that none of the links or buttons work. To view or edit the source code of the file, click "File > Open" within Jupyter, then select the file *machine-learning.html* from the list and click the "Edit" button.

<img src="https://i.imgur.com/JG7Q8CK.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

As you might expect, the source code looks something like this:

<img src="https://i.imgur.com/6ynXNdz.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Try scrolling through the source code. Can you make sense of it? Can you see how the information on the page is organized within the file? We'll learn more about it in the next section.

> **EXERCISE**: Download the web page for a different topic, e.g., https://github.com/topics/data-analysis using `requests` and save it to a file, e.g., `data-analysis.html`. View the page and compare it with the previously downloaded page. How are the two different? Can you spot the differences in the source code?

Let's save our work using `jovian` before continuing.

## Inspecting the HTML source code of a web page

![](https://i.imgur.com/mvBpQIP.png)

As mentioned earlier, web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language composed of *tags* (also called *nodes* or *elements*), e.g., `<a href="https://jovian.ai" target="_blank">Go to Jovian</a>`. An HTML tag has three parts:

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of the tag used by the browser to customize how the tag is displayed and to decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.


### Inside an HTML Document

Here's a simple HTML document that uses many commonly used tags:

```html
<html>
  <head>
    <title>All About Python</title>
  </head>
  <body>
    <div style="width: 640px; margin: 40px auto">
      <h1 style="text-align:center;">Python - A Programming Language</h1>
      <img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" alt="python-logo" style="width:240px;margin:0 auto;display:block;">
      <div>
        <h2>About Python</h2>
        <p>
          Python is an <span style="font-style: italic">interpreted, high-level and general-purpose</span> programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Visit the <a href="https://docs.python.org/3/">official documentation</a> to learn more.
        </p>
      </div>
      <div>
        <h2>Some Python Libraries</h2>
        <ul id="libraries">
          <li>Numpy</li>
          <li>Pandas</li>
          <li>PyTorch</li>
          <li>Scikit Learn</li>
        </ul>
      </div>
      <div>
        <h2>Recent Python Versions</h2>
        <table id="versions-table">
          <tr>
            <th class="bordered-table">Version</th>
            <th class="bordered-table">Released on</th>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.8</td>
            <td class="bordered-table">October 2019</td>
          </tr>
          <tr>
            <td class="bordered-table">Python 3.7</td>
            <td class="bordered-table">June 2018</td>
          </tr>
        </table>
          <style>
              .bordered-table {
                  border: 1px solid black; padding: 8px;
              }
          </style>
      </div>
    </div>
  </body>
</html>

```

> **EXERCISE**: Copy the above HTML code and paste it into a new file called `webpage.html`. To create a new file,  select "File > Open" from the menu bar, then select "New > Text" file. View the saved file. Can you see how the different tags are displayed in different ways by the browser?


<img src="https://i.imgur.com/lcSHz5V.png" width="480" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

> **EXERCISE**: Make some changes to the code inside `webpage.html`. Save the file and view it again. Do you see your changes reflected? Play with the structure of the file. Try to break things and fix them!

### Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)




> **EXERCISE**: Complete this tutorial on HTML: https://www.htmldog.com/guides/html/ . Once done, try describing what the above tags and attributes are used for. Try creating a new HTML page using the tags you find most interesting.
>
> To learn how to style HTML tags, check out this tutorial on CSS: https://www.htmldog.com/guides/css/



### Inspecting HTML in the Browser

You can view the source code of any webpage right within your browser by right-clicking anywhere on a page and selecting the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what it looks like on the Chrome browser:


<img src="https://i.imgur.com/jCA1T6Z.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">


> **EXERCISE**: Explore the source code of the web page https://github.com/topics/machine-learning . Try to find the portions in the source code corresponding to the repository name, owner's username, and the number of stars for each repository listed on the page.

Let's save our work before continuing.

## Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

Next, let's read the contents of the file `machine-learning.html` and create a `BeautifulSoup` object to parse the content.

The `doc` object contains several properties and methods for extracting information from the HTML document. Let's look at a few examples below.

**NOTE**: You don't need to remember all (or any) of the properties/methods. You can look up [the documentation of BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) or [search online](https://www.google.co.in/search?q=beautifulsoup+how+to+get+href+of+link) to find what you need when you need it.

### Accessing a tag

> **QUESTION**: Find the title of the page represented by `doc`.

The title of the page is contained within the `<title>` tag. We can access the title tag using `doc.title`.

We can access a tag's name using the `.name` property.

The text within a tag can be accessed using `.text`.
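
Here's a minimal sketch of these three steps. It parses a small inline HTML string (`sample_html`, a hypothetical stand-in for the contents of `machine-learning.html`) using `'html.parser'`, the parser that ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of machine-learning.html
sample_html = """
<html>
  <head><title>Sample Topic Page</title></head>
  <body>
    <a href="/explore">Explore</a>
    <a href="/topics">Topics</a>
  </body>
</html>
"""

doc = BeautifulSoup(sample_html, 'html.parser')

title_tag = doc.title
print(title_tag)       # the full <title> tag
print(title_tag.name)  # the name of the tag
print(title_tag.text)  # the text inside the tag
```

For the downloaded page, you would read the file with `open('machine-learning.html', encoding='utf-8')` and pass its contents to `BeautifulSoup` instead of `sample_html`.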

> **EXERCISE**: Explore the `html`, `body`, and `head` tags of `doc`. Do you see what you expect to see?

If a tag occurs more than once in a document e.g. `<a>` (which represents links), then `doc.a` finds the first `<a>` tag.

> **EXERCISE**: Find the first occurrence of each of these tags in `doc`: `div`, `img`, `span`, `p`, etc.

### Finding all tags of the same type

To find all the occurrences of a tag, use the `find_all` method.

> **QUESTION**: Find all the link tags on the page. How many links does the page contain?
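
The difference between `doc.a` and `find_all` can be sketched on a small document. The HTML below is a hypothetical sample, not the real GitHub page, so the counts will differ from what you see for `machine-learning.html`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with three links
sample_html = """
<body>
  <a href="/explore">Explore</a>
  <a href="/topics">Topics</a>
  <a href="/trending">Trending</a>
</body>
"""
doc = BeautifulSoup(sample_html, 'html.parser')

first_link = doc.a              # only the first <a> tag
all_links = doc.find_all('a')   # every <a> tag in the document
link_count = len(all_links)     # number of links found
```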

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

### Accessing attributes

The attributes of a tag can be accessed using the indexing notation, e.g., `first_link['href']`

Note that the `class` attribute is automatically split into a list of classes. Beautiful Soup does this only for a handful of multi-valued attributes such as `class` and `rel`; other attributes are returned as plain strings. This makes it easy to check for a specific class within a tag.

You can use the `.attrs` property to view all the attributes as a dictionary.
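
Here's a short sketch of attribute access on a single hypothetical link tag (the attribute values are made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical link tag with three attributes
sample_html = '<a href="/explore" class="HeaderMenu-link no-underline" id="explore-link">Explore</a>'
doc = BeautifulSoup(sample_html, 'html.parser')
first_link = doc.a

href = first_link['href']       # a plain string
classes = first_link['class']   # a list, because class is multi-valued
attributes = first_link.attrs   # all attributes as a dictionary
```

Indexing with a missing attribute name raises a `KeyError`, just like a dictionary. Use `first_link.get('title')` if you'd rather get `None` back.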

> **EXERCISE**: Find the 5th image tag on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

### Searching by Attribute Value

> **QUESTION**: Find the `img` tag(s) on the page with the `alt` attribute set to `transformers`.

We can provide a dictionary of attributes as the second argument to `find_all`.

If we're just interested in the first element, we can use the `find` method. Keep in mind that `find` returns `None` if no matching tag is found.
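
Both approaches can be sketched on a small hypothetical document (the image paths and `alt` values below are made up):

```python
from bs4 import BeautifulSoup

# Hypothetical sample with two images
sample_html = """
<div>
  <img src="/images/transformers.png" alt="transformers">
  <img src="/images/pytorch.png" alt="pytorch">
</div>
"""
doc = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list of every matching tag
matching_imgs = doc.find_all('img', {'alt': 'transformers'})

# find returns just the first match...
first_match = doc.find('img', {'alt': 'transformers'})

# ...or None when nothing matches
no_match = doc.find('img', {'alt': 'does-not-exist'})
```

Since `find` can return `None`, it's worth checking the result before accessing attributes such as `first_match['src']`.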

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

### Searching by Class

The `class` attribute is one of the most frequently used attributes on HTML tags (used for layout and styling). We can search for tags containing a class using the `class_` argument in `find_all` (note that `class` is a reserved keyword in Python, hence the underscore in the argument name).

> **QUESTION**: Find all the tags containing the class `HeaderMenu-link`.

We can also search for a specific type of tag, e.g., `<a>`, matching the given class.
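
A minimal sketch of both kinds of class search, using a hypothetical header (only the class name `HeaderMenu-link` comes from the question above):

```python
from bs4 import BeautifulSoup

# Hypothetical header markup
sample_html = """
<header>
  <a class="HeaderMenu-link no-underline" href="/explore">Explore</a>
  <span class="HeaderMenu-link">Pricing</span>
  <a class="Footer-link" href="/about">About</a>
</header>
"""
doc = BeautifulSoup(sample_html, 'html.parser')

# Any tag that has the class, regardless of its other classes
any_tag_matches = doc.find_all(class_='HeaderMenu-link')

# Only <a> tags that have the class
link_matches = doc.find_all('a', class_='HeaderMenu-link')
```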

### Parsing Information from Tags

Once we have a list of tags matching some criteria, it's easy to extract information and convert it to a more convenient format.

> **QUESTION**: Find the link text and URL of all the links within the page header on https://github.com/topics/machine-learning .

We'll create a list of dictionaries containing the required information. We'll add the base URL https://github.com as a prefix because the `href` attribute only contains the relative path e.g. `/explore`.
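
Here's one way this could look. It's a sketch on hypothetical header markup, and the dictionary keys `'title'` and `'url'` are names chosen for illustration.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

# Hypothetical header markup with relative links
sample_html = """
<header>
  <a class="HeaderMenu-link" href="/explore">
    Explore
  </a>
  <a class="HeaderMenu-link" href="/topics">Topics</a>
</header>
"""
doc = BeautifulSoup(sample_html, 'html.parser')

# One dictionary per link: cleaned-up text plus the full URL
header_links = [
    {'title': tag.text.strip(), 'url': BASE_URL + tag['href']}
    for tag in doc.find_all('a', class_='HeaderMenu-link')
]
```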

We have successfully extracted the required information about links in the page header. This is precisely what web scraping is: downloading a webpage, parsing the HTML, and extracting useful information.

> **EXERCISE**: Find the list of all the images matching the class `d-block width-full`. Each list element should be a dictionary containing two keys, `"username"` and `"url"`. You can obtain the username using the `alt` attribute of a tag and the URL using the `src` attribute.

### Elements inside a tag

> **QUESTION**: Find the `li` tags that are direct children of `ul` tag with the class `top-list` in the sample HTML document below.


We can use the `find_all` method on the tag, and set `recursive=False` to find just the direct children.

Without `recursive=False`, the inner list items are also included in the result.
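
The effect of `recursive=False` can be sketched with a nested list. The sample document below is hypothetical; only the class name `top-list` comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical document: a top-level list with one nested list
sample_html = """
<ul class="top-list">
  <li>Item 1</li>
  <li>Item 2
    <ul class="inner-list">
      <li>Item 2.1</li>
      <li>Item 2.2</li>
    </ul>
  </li>
  <li>Item 3</li>
</ul>
"""
doc = BeautifulSoup(sample_html, 'html.parser')
top_list = doc.find('ul', class_='top-list')

# Only the <li> tags that are direct children of the top-level list
direct_items = top_list.find_all('li', recursive=False)

# All <li> tags at any depth, including those in the inner list
all_items = top_list.find_all('li')
```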

Keep in mind that you don't need to remember all (or any) of the methods or properties offered by Beautiful Soup documents and tags. You should be able to figure out what you need to do, when you need to do it. Here's how:

* Look up the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* Google what you're trying to do: https://www.google.co.in/search?q=beautiful+soup+get+href
* Ask a question on StackOverflow: https://stackoverflow.com/questions/tagged/beautifulsoup



Let's save our work before continuing.

### Top Repositories for a Topic

Let's return to our original problem statement of finding the top repositories for a given topic. Before we parse a page and find the top repositories, let's define a helper function to get the web page for any topic.

> **QUESTION**: Define a function `get_topic_page` that downloads the GitHub web page for a given topic and returns a beautiful soup document representing the page.
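
A minimal sketch of such a function follows. The helper `get_topic_url`, the `timeout` value, and the exact error handling are assumptions made for illustration.

```python
import requests
from bs4 import BeautifulSoup


def get_topic_url(topic):
    """Construct the URL of the GitHub page for a topic."""
    return 'https://github.com/topics/' + topic


def get_topic_page(topic):
    """Download a topic page and return it as a BeautifulSoup document."""
    topic_url = get_topic_url(topic)
    response = requests.get(topic_url, timeout=30)  # illustrative timeout
    # Stop early if the request was not successful
    if not 200 <= response.status_code <= 299:
        raise RuntimeError(f'Failed to load {topic_url} (status {response.status_code})')
    return BeautifulSoup(response.text, 'html.parser')
```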

Getting the topic page for another topic is now as simple as invoking the function with a different argument.

> **QUESTION**: Develop an approach to find the repository name, owner's username, no. of stars, and repository link for the repositories listed on a topic page.

<img src="https://i.imgur.com/szL76cU.png" width="640" style="box-shadow:rgba(52, 64, 77, 0.2) 0px 1px 5px 0px;border-radius:4px;">

Upon inspecting the box containing the information for a repository, you will find an `article` tag for each repository, with `class` attribute set to  `border rounded color-shadow-small color-bg-secondary my-4`.

Let's find all the `article` tags matching this class.


There are 30 repositories listed on the page, and our query resulted in 30 article tags. It looks like we've found the enclosing tag for each repository.

We need to extract the following information from each tag:

1. Repository name
2. Owner's username
3. Number of stars
4. Repository link

Look at the source of any of the article tags. You will notice that the repository name, owner's username, and the repository link are all part of an `h1` tag.

Let's retrieve the first `h1` inside an article.

The `h1` has `a` tags inside it, one containing the owner's username and the second containing the repository title. The `href` of the second tag also includes the relative path of the repository. Let's extract this information from the `a` tags.

Looks like the username contains some leading and trailing whitespace. We can get rid of it using `strip`.

We can get the repository name and repository path in the same fashion.

To get the full URL to the repository, we can append the base URL `https://github.com` at the beginning of the path.


Next, to get the number of stars, we notice that it is contained within a `span` tag which has the class `Counter js-social-count`.


Let's extract the star count from the `span` tag.
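
The extraction steps above can be sketched end to end on a single stand-in `article` tag. The markup below is hypothetical and only mimics the structure described in the text (GitHub's real markup is larger and changes over time); the class names are the ones mentioned above.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'

# Hypothetical markup mimicking one repository box
sample_html = """
<article class="border rounded color-shadow-small color-bg-secondary my-4">
  <h1>
    <a href="/example-user">
      example-user
    </a>
    /
    <a href="/example-user/example-repo">
      example-repo
    </a>
  </h1>
  <span class="Counter js-social-count">40.3k</span>
</article>
"""
doc = BeautifulSoup(sample_html, 'html.parser')

# A class string with spaces matches the exact value of the class attribute
article_tags = doc.find_all('article', class_='border rounded color-shadow-small color-bg-secondary my-4')
article = article_tags[0]

# The first <a> holds the owner's username, the second the repository
a_tags = article.h1.find_all('a')
username = a_tags[0].text.strip()
repo_name = a_tags[1].text.strip()
repo_url = BASE_URL + a_tags[1]['href']

# The star count is still a string at this point
stars_tag = article.find('span', class_='Counter js-social-count')
stars_text = stars_tag.text.strip()
```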

The `k` at the end indicates `1000`. Let's write a helper function which can convert strings like `40.3k` into the number `40,300`.
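
Here's one possible version of the helper. The name `parse_star_count` is an assumption, and the sketch only handles plain integers and the `k` suffix.

```python
def parse_star_count(stars_str):
    """Convert a string like '40.3k' or '857' into an integer."""
    stars_str = stars_str.strip()
    if stars_str.endswith('k'):
        # round() avoids off-by-one results caused by floating-point error
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)


print(parse_star_count('40.3k'))
print(parse_star_count('857'))
```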

We can now determine the star count as a number.

Perfect, we've extracted all the information we were interested in.

Let's extract the logic for parsing the required information from an article tag into a function.

> **QUESTION**: Write a function `parse_repository` that returns a dictionary containing the repository name, owner's username, number of stars, and repository URL by parsing a given `article` tag representing a repository.

We can now use the function to parse any `article` tag.

We can use a list comprehension to parse all the `article` tags in one go.



> **QUESTION**: Write a function that takes a `BeautifulSoup` object representing a topic page and returns a list of dictionaries containing information about the top repositories for the topic.


We can now use the functions we've defined to get the top repositories for any topic.

Here are the top repositories for the keyword `data-analysis`.

Here are the top repositories for the keyword `python`.

Do you see the power of defining functions and using libraries? With just one line of code, we can scrape GitHub and find the top repositories for any topic.

Let's save our work before continuing.

## Writing information to CSV files

Let's create a helper function which takes a list of dictionaries and writes them to a CSV file.

The input to our function will be a list of dictionaries of the form:

```
[
  {'key1': 'abc', 'key2': 'def', 'key3': 'ghi'},
  {'key1': 'jkl', 'key2': 'mno', 'key3': 'pqr'},
  {'key1': 'stu', 'key2': 'vwx', 'key3': 'yza'},
  ...
]
```

The function will create a file with a given name containing the following data:

```
key1,key2,key3
abc,def,ghi
jkl,mno,pqr
stu,vwx,yza

```
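
Here's a minimal sketch of such a helper. It uses `csv.DictWriter` from Python's standard library, which is one of several ways to do this (writing the lines by hand with `f.write` works too). The name `write_csv` is an assumption, and the demo writes to a temporary directory.

```python
import csv
import os
import tempfile


def write_csv(items, path):
    """Write a list of dictionaries to a CSV file, using the keys as headers."""
    if not items:
        return  # nothing to write
    headers = list(items[0].keys())
    # newline='' lets the csv module control line endings
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(items)


items = [
    {'key1': 'abc', 'key2': 'def', 'key3': 'ghi'},
    {'key1': 'jkl', 'key2': 'mno', 'key3': 'pqr'},
    {'key1': 'stu', 'key2': 'vwx', 'key3': 'yza'},
]

csv_path = os.path.join(tempfile.mkdtemp(), 'items.csv')
write_csv(items, csv_path)

# Read the file back to check the result
with open(csv_path, newline='', encoding='utf-8') as f:
    rows_read_back = [dict(row) for row in csv.DictReader(f)]
```

A benefit of `csv.DictWriter` is that it quotes values containing commas, which matters for fields such as repository descriptions.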

Let's write the data stored in `top_repos_ml` into a CSV file.

We can now read the file and inspect its contents. The contents of the file can also be inspected using the "File > Open" menu option within Jupyter.

Perfect! We've created a CSV containing the information about the top GitHub repositories for the topic `machine-learning`. We can now put together everything we've done so far to solve the original problem.

> **QUESTION**: Write a Python function that creates a CSV file (comma-separated values) containing details about the 25 top GitHub repositories for any given topic. The top repositories for the topic `machine-learning` can be found on this page: [https://github.com/topics/machine-learning](https://github.com/topics/machine-learning). The output CSV should contain these details: repository name, owner's username, no. of stars, repository URL.
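
Here's a sketch of how the pieces could fit together. It is not necessarily the notebook's original solution: the function names, the dictionary keys, the `limit` parameter, and the use of `csv.DictWriter` are assumptions, and the class names are the ones observed on the page at the time of writing, so they may need updating when GitHub changes its layout.

```python
import csv

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'
ARTICLE_CLASS = 'border rounded color-shadow-small color-bg-secondary my-4'


def get_topic_page(topic):
    """Download the page for a topic and parse it with Beautiful Soup."""
    response = requests.get(BASE_URL + '/topics/' + topic, timeout=30)
    if not 200 <= response.status_code <= 299:
        raise RuntimeError(f'Failed to load the topic page for {topic}')
    return BeautifulSoup(response.text, 'html.parser')


def parse_star_count(stars_str):
    """Convert a string like '40.3k' or '857' into an integer."""
    stars_str = stars_str.strip()
    if stars_str.endswith('k'):
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)


def parse_repository(article_tag):
    """Extract repository details from one article tag."""
    a_tags = article_tag.h1.find_all('a')
    stars_tag = article_tag.find('span', class_='Counter js-social-count')
    return {
        'repository_name': a_tags[1].text.strip(),
        'owner_username': a_tags[0].text.strip(),
        'stars': parse_star_count(stars_tag.text),
        'repository_url': BASE_URL + a_tags[1]['href'].strip(),
    }


def get_top_repositories(doc, limit=25):
    """Parse up to `limit` repositories from a topic page."""
    article_tags = doc.find_all('article', class_=ARTICLE_CLASS)
    return [parse_repository(tag) for tag in article_tags[:limit]]


def write_csv(items, path):
    """Write a list of dictionaries to a CSV file."""
    if not items:
        return
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        writer.writeheader()
        writer.writerows(items)


def scrape_topic_repositories(topic, path=None):
    """Scrape the top repositories for a topic into a CSV file."""
    if path is None:
        path = topic + '.csv'
    top_repos = get_top_repositories(get_topic_page(topic))
    write_csv(top_repos, path)
    return path
```

Calling `scrape_topic_repositories('machine-learning')` should then create the file `machine-learning.csv` in the current directory.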



The entire code of this problem is only about 50 lines long. Isn't that neat?

Put another way, if you understand these 50 lines of code, you know pretty much all there is to know about web scraping. Use the interactive nature of Jupyter to experiment with each function and add print statements wherever required to display intermediate output. Reading and understanding code is an essential skill for programmers.

Now that we have a CSV file, we can use the `pandas` library to view its contents.

Of course, we can go even further and write a function that scrapes top repositories for several topics.

> **EXERCISE**: Write a function `scrape_topics` which takes a list of topics and creates CSV files containing top repositories for a list of topics. Test it out using the empty cells below.

Let's save our work before continuing.

## Using a REST API to retrieve data as JSON

Not all URLs point to an HTML page. Consider this URL for example: https://api.github.com/repos/octocat/hello-world . It points to a JSON document, which has a structure like this:


```json
{
  "name": "Hello-World",
  "full_name": "octocat/Hello-World",
  "private": false,
  "owner": {
    "login": "octocat",
    "id": 583231
  },
  "html_url": "https://github.com/octocat/Hello-World"
}
```

It's quite similar to a Python dictionary. In fact, you can use Python's built-in `json` module to convert a JSON document into a Python dictionary.
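
Here's a short sketch of that conversion using `json.loads`. The JSON text is an abbreviated version of the document shown above, not the full API response.

```python
import json

# Abbreviated JSON document (the real response has many more fields)
json_text = """
{
  "name": "Hello-World",
  "full_name": "octocat/Hello-World",
  "private": false,
  "owner": {
    "login": "octocat",
    "id": 583231
  },
  "html_url": "https://github.com/octocat/Hello-World"
}
"""

repo = json.loads(json_text)

print(type(repo))              # a Python dictionary
print(repo['owner']['login'])  # nested objects become nested dictionaries
print(repo['private'])         # JSON false becomes Python False
```

When the data comes from `requests`, calling `response.json()` performs the same conversion on the response body.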

Unlike HTML, JSON is easy to work with in Python: simply fetch the contents of the URL and convert them to a dictionary. Such URLs are often called **REST APIs** or REST API endpoints. Many websites offer well-documented REST APIs to access data from the site in JSON format:

* GitHub: https://docs.github.com/en/rest/reference/repos
* Facebook: https://developers.facebook.com/docs/groups-api/reference
* Twitter: https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/api-reference/get-statuses-user_timeline
* Reddit: https://www.reddit.com/dev/api/

Using an API is the *officially supported* way of extracting information from a website. To use an API, you will often need to register as a developer on the platform and generate an API key, which you'll need to send with every request to authenticate yourself.

Since GitHub offers a public API, we can fetch information about public repositories without registering or generating an API key, as long as we stay within the rate limit for unauthenticated requests.


> **QUESTION**: Write a function `get_repo_details` to find the following information about a repository: description, watcher count, fork count, open issues count, created at time and updated at time.
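
One way to approach this is sketched below. The helper names and the output keys are assumptions; the input keys (`description`, `watchers_count`, `forks_count`, `open_issues_count`, `created_at`, `updated_at`) are fields of GitHub's repository JSON.

```python
import requests


def get_repo_api_url(username, repo_name):
    """Construct the REST API URL for a repository."""
    return 'https://api.github.com/repos/' + username + '/' + repo_name


def extract_repo_details(repo_json):
    """Pick the fields of interest from the repository JSON."""
    return {
        'description': repo_json['description'],
        'watchers': repo_json['watchers_count'],
        'forks': repo_json['forks_count'],
        'open_issues': repo_json['open_issues_count'],
        'created_at': repo_json['created_at'],
        'updated_at': repo_json['updated_at'],
    }


def get_repo_details(username, repo_name):
    """Fetch repository details from the GitHub REST API."""
    repo_url = get_repo_api_url(username, repo_name)
    response = requests.get(repo_url, timeout=30)  # illustrative timeout
    if not 200 <= response.status_code <= 299:
        raise RuntimeError(f'Failed to load {repo_url} (status {response.status_code})')
    return extract_repo_details(response.json())
```

Keeping `extract_repo_details` separate from the network call makes it easy to try out on a saved response without sending a request each time.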



> **QUESTION**: Augment the list of top repositories for a topic with the repository description, watcher count, fork count, open issues count, created at time and updated at time.



You may get rate limited if you make more than 60 unauthenticated requests per hour. To overcome the rate limit, use a GitHub OAuth token as described here: https://towardsdatascience.com/all-the-things-you-can-do-with-github-api-and-python-f01790fca131

Note: Never publish your GitHub API token publicly, as it can be used to access your GitHub account. To enter your API token without displaying it on the screen, use `getpass`.
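
Here's a minimal sketch of how a token could be read and attached to requests. The helper names are assumptions; the token is sent in an `Authorization` header, which GitHub's API accepts.

```python
from getpass import getpass


def build_auth_headers(token):
    """Build the request headers that carry a GitHub token."""
    return {'Authorization': 'token ' + token}


def prompt_for_headers():
    """Ask for the token without echoing it on the screen."""
    token = getpass('Enter your GitHub token: ')
    return build_auth_headers(token)
```

You can then pass the result to `requests`, e.g., `requests.get(url, headers=prompt_for_headers())`, and reuse the same headers for subsequent requests.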

> **EXERCISE**: Augment the list of top repositories for a topic with some additional information about the user/organization the repository belongs to: name, description, GitHub URL, no. of repositories, type (user or organization), etc.

### Acronyms

In case you're feeling overwhelmed by all the acronyms, here are their expansions:
- **REST**: Representational State Transfer
- **API**: Application Programming Interface
- **JSON**: JavaScript Object Notation
- **URL**: Uniform Resource Locator

Don't worry, you needn't remember any of them!


Let's save our work before continuing.

## Crawling Websites by Parsing Links on a Page

When you scrape a web page, you are likely to find several links on the page. For example, on the page https://github.com/topics, you will find links to several topic pages. You can parse all the topic page links from this page, and scrape those pages to get the top repositories for each topic. Further, you can parse all the repository links from a topic page and scrape individual repository pages, and so on.

The process of scraping a page, parsing links, and then using the links to parse other pages on the same site is called **web crawling**. It's how search engines like Google are able to index and search data from millions of websites on the internet. Python offers libraries like [Scrapy](https://scrapy.org) for crawling websites easily.

You can do some basic crawling with `requests`, Beautiful Soup, and a few simple `for` loops in Python. Here's an exercise to get you started:


> **EXERCISE**: Get the top 100 repositories for all the featured topics on GitHub. You might find these URLs useful:
>
> * Eighth page of featured topics: https://github.com/topics/?page=8  
> * Second page of top repositories for a topic: https://github.com/topics/machine-learning?page=2
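
As a starting point, the link-parsing part of a crawler can be sketched as follows. The helper names are hypothetical, the sample HTML is made up, and the rule that topic links start with `/topics/` is an assumption you should verify by inspecting the page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://github.com'


def get_page_url(path, page=1):
    """Build the URL for a given page number of a paginated listing."""
    return urljoin(BASE_URL, path) + '?page=' + str(page)


def parse_topic_links(doc):
    """Collect the unique topic page URLs linked from a document."""
    urls = []
    for tag in doc.find_all('a', href=True):
        href = tag['href']
        # Assumption: topic links look like /topics/<topic-name>
        if href.startswith('/topics/') and len(href) > len('/topics/'):
            url = urljoin(BASE_URL, href)
            if url not in urls:
                urls.append(url)
    return urls


# Hypothetical sample standing in for https://github.com/topics
sample_html = """
<div>
  <a href="/topics/python">Python</a>
  <a href="/topics/python">Python (again)</a>
  <a href="/explore">Explore</a>
  <a href="/topics/3d">3D</a>
</div>
"""
doc = BeautifulSoup(sample_html, 'html.parser')
topic_urls = parse_topic_links(doc)
second_page_url = get_page_url('/topics/machine-learning', page=2)
```

A crawler would loop over `topic_urls`, download each page, and parse the repositories from it, ideally pausing between requests to avoid overloading the site.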

## Summary and Further Reading

We've covered the following topics in this tutorial:

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Writing parsed information into CSV files
* Using a REST API to retrieve data as JSON
* Combining data from multiple sources
* Using links on a page to crawl a website


Here are some things to keep in mind w.r.t. web scraping:

* Most websites disallow web scraping for commercial purposes
* Prefer using web scraping only for learning and research purposes
* Some websites may block your IP or stop sending valid information if you send too many requests
* Review the terms and conditions of a website before scraping data from it
* Remove sensitive and personally identifiable information before publishing a dataset online
* Use official REST APIs wherever available, with proper API keys
* Scraping data that you see after logging in is harder (it requires special cookies and headers)
* Websites change their HTML layout frequently, which may cause your scraping scripts to break
* Websites with dynamic content, i.e., content loaded by JavaScript after the initial page load, often cannot be scraped using Beautiful Soup alone; one way to scrape such websites is to use Selenium


Here are some more examples of scraping:

* https://medium.com/@msalmon00/web-scraping-job-postings-from-indeed-96bd588dcb4b
* https://medium.com/the-innovation/scraping-medium-with-python-beautiful-soup-3314f898bbf5
* https://medium.com/brainstation23/how-to-become-a-pro-with-scraping-youtube-videos-in-3-minutes-a6ac56021961
* https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
* https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
* https://towardsdatascience.com/web-scraping-yahoo-finance-477fe3daa852
* https://www.analyticsvidhya.com/blog/2020/10/web-scraping-selenium-in-python/
* https://medium.com/ml-book/web-scraping-using-selenium-python-3be7b8762747

### Project Ideas

Here are some project ideas if you're looking to work on a web scraping project. You can work on one of these ideas, or pick something entirely different.

1. **Dataset of Books (Amazon)**: Create a dataset of popular books in different genres by scraping the site: https://www.amazon.in/gp/bestsellers/books/


2. **Dataset of Quotes (BrainyQuote)**: Create a dataset of quotes for different tags/topics by scraping the site: https://www.brainyquote.com/topics


3. **Dataset of Movies (TMDb)**: The Movie Database (TMDb) contains information about thousands of movies from around the world: https://www.themoviedb.org/movie . Can you scrape the site to create a dataset of movies containing information like title, release date, cast, etc.? You can also create datasets of movie actors/actresses/directors using this site.


4. **Dataset of TV Shows (TMDb)**: The Movie Database (TMDb) contains information about thousands of TV shows from around the world: https://www.themoviedb.org/tv . Can you scrape the site to create a dataset of TV shows containing information like title, release date, cast, crew, etc. ? You can also create datasets of TV actors/actresses/directors using this site.


5. **Collections of Popular Repositories (GitHub)**: Scrape GitHub collections ( https://github.com/collections ) to create a dataset of popular repositories organized by different use cases.


6. **Dataset of Books (BooksToScrape)**: Create a dataset of popular books in different genres by scraping the site *Books To Scrape*: http://books.toscrape.com


7. **Dataset of Quotes (QuotesToScrape)**: Create a dataset of popular quotes for different tags by scraping the site *Quotes To Scrape*: http://quotes.toscrape.com


8. **Scrape a User's Repositories (GitHub)**: Given someone's GitHub username, can you scrape their GitHub profile to create a list of their repositories with information like repository name, no. of stars, no. of forks, etc.?


9. **Scrape User Reviews (ConsumerAffairs)**: ConsumerAffairs contains reviews about thousands of brands: https://www.consumeraffairs.com/ . Can you scrape any category from the site to create a dataset of reviews containing information like title, rating, review text, toll-free number, etc.?


10. **Songs Dataset (AZLyrics)**: Create a dataset of songs by scraping AZLyrics: https://www.azlyrics.com/f.html . Capture information like song title, artist name, year of release and lyrics URL.


11. **Scrape a Popular Blog**: Create a dataset of blog posts on a popular blog e.g. https://m.signalvnoise.com/search/ . The dataset can contain information like the blog title, published date, tags, author, link to blog post, etc.


12. **Weekly Top Songs (Top 40 Weekly)**: Create a dataset of the top 40 songs of each week in a given year by scraping the site https://top40weekly.com . Capture information like song title, artist, weekly rank, etc.

## Questions for Revision
1. Why do we need to scrape websites?
2. What different tools can we use to scrape websites?
3. What are the applications of web-scraping?
4. What are the steps involved in web-scraping?
5. What are the techniques to get data from websites?
6. What technique is used to retrieve data in a machine-readable format in Python?
7. How can one download a webpage from the internet using Python?
8. What library do we need for downloading the webpage in Python?
9. What function from the library do we need for downloading the webpage?
10. How do we make sure that the webpage is downloaded successfully?
11. How can we access the content of the downloaded webpage?
12. What function do we need to find out the total number of characters in the downloaded webpage?
13. What defines the content and structure of the downloaded webpage?
14. What is source code? In what language is a web page's source code usually written?
15. How different are the original webpage and scraped webpage?
16. How many parts does an HTML tag have? What are they?
17. Is it possible to be blocked by a website when you scrape many pages? If yes, how can one avoid this?
18. How do we get the information we need from the downloaded website?
19. What library do we need to install to extract information from HTML source code?
20. What is the `doc` object?
21. How can we access attributes of a tag?
22. How do we find the direct children of the tag?
23. What is the purpose of strip()?
24. How can we write the extracted information into CSV files?
25. What are REST APIs? How are they different from usual URLs?
26. What is the official way to extract information from a website? What do we need for that? How does it help one in extracting information?
27. What websites offer public APIs?
28. Can we extract data from all the websites on the web? If not, why?
29. What is getpass()?
30. What is web crawling and how is it different from web scraping?
31. What are the applications of web crawling?
32. What does Python offer for crawling websites?
33. How do we extract data from dynamic websites?

## Solutions for Exercises

> **EXERCISE**: Find the first occurrence of each of these tags in `doc`: `div`, `img`, `span`, `p`, etc.

> **EXERCISE**: Get a list of all the `img` tags on the page. How many images does the page contain?

> **EXERCISE**: Find the 5th image tag on the page (counting from 0). Which attributes does the tag contain? Find the values of the `src` and `alt` attributes of the tag.

> **EXERCISE**: Find the `src` attribute of the first `img` tag with the `alt` attribute set to `julia`. Visit the link and check what the image represents.

> **EXERCISE**: Find the list of all the images matching the class `d-block width-full`. Each list element should be a dictionary containing two keys, `"username"` and `"url"`. You can obtain the username using the `alt` attribute of a tag and the URL using the `src` attribute.

> **EXERCISE**: Write a function `scrape_topics` which takes a list of topics and creates CSV files containing top repositories for a list of topics. Test it out using the empty cells below.

> **EXERCISE**: Get the top 100 repositories for all the featured topics on GitHub. You might find these URLs useful:
>
> * Eighth page of featured topics: https://github.com/topics/?page=8  
> * Second page of top repositories for a topic: https://github.com/topics/machine-learning?page=2

Try to combine the CSVs from all the pages of each topic into a single file ;)