# Module 10: Mission to Mars - Web Scraping with HTML/CSS


## 10.0.1: Web Scraping to Extract Online Data

## 10.0.2: Module 10 Roadmap

### Looking Ahead
![Module 10 Roadmap Image](https://courses.bootcampspot.com/courses/691/files/583632/preview)

In this module, you'll automate a web browser to visit different websites to extract data about the Mission to Mars. You'll store it in a NoSQL database, and then render the data in a web application created with Flask. The completed work will be displayed in your portfolio, which you will also create.

Web scraping is a method used by organizations worldwide to extract online data for analysis. Large companies employ web scraping to assess their reputations or track their competitors' online presence.

On a smaller scale, web scraping automates tedious tasks for personal projects. For example, if you're collecting current news on a specific subject, web scraping can make it a simple process. Instead of visiting each website and copying an article, a web scraping script will perform those actions and save the scraped data for later analysis.

### What You Will Learn
By the end of this module, you will be able to: 

- Gain familiarity with and use HTML elements, as well as class and id attributes, to identify content for web scraping.
- Use BeautifulSoup and Splinter to automate a web browser and perform a web scrape.
- Create a MongoDB database to store data from the web scrape.
- Create a web application with Flask to display the data from the web scrape.
- Create an HTML/CSS portfolio to showcase projects.
- Use Bootstrap components to polish and customize the portfolio.

### Planning Your Schedule
Here's a quick look at the lessons and assignments you'll cover in this module. You can use the time estimates to help pace your learning and plan your schedule.

- Introduction (15 minutes)
- Get Started Using Web Scraping Tools (1 hour)
- Open the Window to the Internet (2 hours)
- Automate a Web Browser and Perform a Web Scrape (3 hours)
- Access Data in MongoDB (1 hour)
- Display Data With Flask (3 hours)
- Make It Pretty (3 hours)
- Show It Off (2 hours)
- Application (5 hours)

## 10.0.3: Getting Ready for Virtual Class

## 10.0.4: Tools for Scraping

## 10.1.1: Install Your Tools
<i> Robin is pretty excited about putting together this web-scraping project. Being able to get the latest news and updates with the click of a button? That's a really useful tool for someone who wants to keep up with the Mission to Mars.

First things first, though—preparation. Robin needs to download a few libraries and tools that she'll need when she's ready to start scraping data: Splinter to automate a web browser, BeautifulSoup to parse and extract the data, and MongoDB to hold the data that has been gathered.</i>

With all of the information available on the web, people are able to stay up-to-date with almost every subject out there. What if a person wants to narrow their focus to a single topic? Are there tools that would make gathering the latest data easier?

Robin, who loves astronomy and wants to work for NASA one day, has decided to use a specific method of gathering the latest data: web scraping. Using this technique, she can pull data from multiple websites, store it in a database, and then present the collected data in a central location: a webpage.

![flask](https://courses.bootcampspot.com/courses/691/files/604614/preview)

<b>Before installing new tools, open your terminal and make sure your Python coding environment is active.</b>

### Splinter
Splinter is the tool that will automate our web browser as we begin scraping. This means that it will open the browser, visit a webpage, and then interact with it (such as logging in or searching for an item). To do all of this, we'll need to install Splinter and a driver for Chrome.
- To install Splinter, open your terminal and make sure your coding environment is active. Then, run the command <code>pip install splinter</code>. Once that installation is complete, we'll set up the driver.
    - IMPORTANT: To use ChromeDriver and scrape website data, a separate package must also be installed into our virtual environment.

### Web-Driver Manager
The web driver manager package lets us use a driver to scrape websites without going through the complicated process of installing the standalone ChromeDriver.
- To install the manager, make sure you are still in your active coding environment and run the command <code>pip install webdriver_manager</code>.

### BeautifulSoup
To install BeautifulSoup, run the command <code>pip install bs4</code> in your terminal. Make sure the environment you plan to work from is active first.

### MongoDB
MongoDB (also known as Mongo) is a document database that thrives on chaos. Well, maybe it's not that extreme, but it is far more flexible when it comes to storing data than a relational (SQL) database. It's able to handle smaller, more personal projects as well as larger-scale projects that a company might require. For this module, Mongo is a better choice than a SQL database because the data we'll scrape from the web isn't going to be uniform. For example, how would we break down an image into rows and columns? We can't. But Mongo will store and access it as a document instead.
- To install PyMongo, first open a terminal window (make sure your virtual environment is active) and run the command <code>pip install pymongo</code>.

     <b>PyMongo is the tool that allows developers to use Python with Mongo.</b>
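To make the "document instead of rows and columns" idea concrete, here is a minimal sketch using plain Python dictionaries. The field names, values, and the example.com URL are hypothetical placeholders, not data from the module; the point is only that two scraped records can have different shapes and still be stored side by side.

```python
import json

# Two hypothetical scraped records with different shapes.
# In a document database, each one can be stored as-is.
news_item = {
    "type": "news",
    "title": "Example headline",
    "summary": "Example summary sentence.",
}
image_item = {
    "type": "image",
    "url": "https://example.com/mars.jpg",
    "tags": ["mars", "surface"],
}

documents = [news_item, image_item]

# The records do not share a fixed set of columns.
shared_keys = set(news_item) & set(image_item)
all_keys = set(news_item) | set(image_item)
print("Shared fields:", sorted(shared_keys))
print("All fields:", sorted(all_keys))

# Each record serializes to a JSON-like document on its own.
for doc in documents:
    print(json.dumps(doc, indent=2))
```

With PyMongo and a running Mongo instance, dictionaries like these could be passed to a collection's `insert_one` method; that step is left out here so the sketch runs without a database.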


#### Installation
To install Mongo on your macOS computer, follow the instructions in the official documentation. Be sure to follow all of the steps listed for installing the MongoDB Community Edition.

Here are a few important tips during this installation:
- It's best to use the Homebrew package manager (<code>brew</code>) for this installation. In your terminal, the installation command will start with <code>brew tap</code>.
- Once Mongo is installed, we want to run it as a macOS service because doing so will automatically set system ulimit values correctly.
    - Mac users will start a database instance with <code>brew services start mongodb-community@4.4</code> instead of running <code>mongod</code> directly.

##### GITHUB
Navigate to GitHub and create a new repository to hold the code for this module. Name the new repo "Mission-to-Mars" and clone it into your class folder. Remember to add, commit, and push your code as you work through the module.

### Flask-PyMongo
To bridge Flask and Mongo, you'll also want to install the [Flask-PyMongo](https://flask-pymongo.readthedocs.io/en/latest/) library.
- This library can be installed using pip and the following command from your terminal: <code>pip install Flask-PyMongo</code>.

#### Additional Libraries
There are two final Python libraries required to run scraping code successfully: <b>html5lib</b> and <b>lxml</b>. Both packages are used to parse HTML in Python, which will be important as you traverse through different web pages to find and collect information.

To install these libraries, first make sure your coding environment is active. Then, type the following commands in your terminal to install them:
1. <code>pip install html5lib</code>
2. <code>pip install lxml</code>
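Once everything is installed, a quick check can confirm that the active environment sees each package. The sketch below is one possible way to do this with the standard library; the helper name `check_modules` is hypothetical, and the list uses import names, which sometimes differ from the names passed to pip (for example, `Flask-PyMongo` is imported as `flask_pymongo`).

```python
from importlib.util import find_spec

# Import names (not pip names) for the tools installed in this section.
REQUIRED_MODULES = [
    "splinter",
    "webdriver_manager",
    "bs4",
    "pymongo",
    "flask_pymongo",
    "html5lib",
    "lxml",
]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED_MODULES).items():
        print(f"{name:20} {'OK' if found else 'MISSING'}")
```

Any package reported as missing can be installed with the matching pip command from this section while the same environment is active.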

# Opening the Window to the Internet
## 10.2.1: Use HTML Elements
<i>Robin has gotten all of her tools installed and tested Mongo to make sure it's ready for data, but before she can really start pulling it off of the web, she needs to be able to identify where the data is stored within the HTML code.</i>

Every webpage is built using hypertext markup language, more commonly known as HTML. 
- Some sites are more sophisticated than others, but they all have the same basic structure. 
- Each element of a page, such as a title or a paragraph, is wrapped in a tag. 
    - Each tag is specific to the element it's holding, and there are many different types of tags.

### HTML tags
Think of a webpage as a window into the internet. HTML is the glass, boards, and blinds on that window. Just as windows come in many sizes and shapes, each webpage has been customized to present users with a view into a different topic (e.g., a weather report on a weather site, a news source, or a social media platform). Each of these examples is built using custom HTML.

Our first step will be to explore that design so that we can write a script that knows what it's looking at when it interacts with a webpage.
- Open VS Code and create a file named index.html. This file can be saved to your desktop because it's just for practice.
    - In this blank HTML file put an exclamation point on the first line and press Enter. This should autofill the editor to contain everything we need for a basic HTML page.

            <!DOCTYPE html>
            <html lang="en">
            <head>
             <meta charset="UTF-8">
             <meta name="viewport" content="width=device-width, initial-scale=1.0">
             <meta http-equiv="X-UA-Compatible" content="ie=edge">
             <title>Document</title>
            </head>
            <body>
            </body>
            </html>
 
In this code, each element is wrapped in a tag, such as `<title>`. Let's take a closer look at this tag to get a feel for HTML syntax.


- HTML tags always begin with a left angle bracket (<code>&lt;</code>) followed by the name of the tag (in our case, "title"). Once the name has been entered, the tag is then closed with a right angle bracket (<code>&gt;</code>). This is only the first half of the completed tag, or the opening tag. We'll also need to add the closing tag.

- A closing HTML tag is very similar to the opening tag; the only difference is a single character, the forward slash immediately after the left angle bracket: `</title>`. Now that there are both opening and closing HTML title tags, you can add the title of the document.

    For example, if you wanted your webpage title to be "Math Is Fun!" then the entire line of HTML code would look like this: `<title>Math Is Fun!</title>` .

  **NOTE:**
  Sometimes you'll see tags like this: `<title/>`. This isn't a new-fangled way of presenting HTML code; it's just a way of summarizing the tags and their contents. This is more commonly seen in written descriptions of HTML than in live HTML code.

You'll see different tags referred to in the same manner as you learn how to create and customize webpages.

#### REWIND
HTML is a coding language used for creating webpages. It’s built using specific tags and arranging them in a nested order, a bit like building blocks. 
- For example, if we wanted a header and a paragraph in the same section of a webpage, we would nest `<h1/>` and `<p/>` tags inside a `<div/>` tag, with the `<div/>` tag acting as a box to hold the other pieces.

        <div>
          <h1>Hello, world!</h1>
          <p>This is a great beginning.</p>
        </div>

Most elements have opening and closing tags, which are identical except for the forward slash that begins the closing tag. The closing tags represent the end of that HTML element.
These tags are what define each element of this webpage. We can open this page right now, but it will be blank because we haven't added anything to it yet. Let's take a closer look at how these different elements work together.


`<!DOCTYPE html>` is a declaration, not a tag. It tells web browsers which HTML version the document is written in. **This should always be the first line in an HTML document.**

`<head>` is the opening tag that serves as a container for the setup elements. Just as imports go in the top cell of a Jupyter notebook or at the top of a Python script, HTML imports (e.g., a stylesheet or a library) go within the `<head>`.

`<meta>` is short for "metadata" and tells the web browser basic information, such as page width.

`<title>` and `</title>` are the opening and closing tags that serve as a container for the page title displayed on the tab at the top of your web browser. In the example above, the title is "Document."

`</head>` is the closing tag for the `<head>` tag, much like the end of a code block in Python.

`<body>` and `</body>` are opening and closing tags. They also serve as a container, but for data we can see (navigation menus, lists, and paragraphs).

`<html lang="en">` and `</html>` are opening and closing tags that serve as a container for all elements within an HTML page.

**Nesting** is when HTML elements are contained within other elements. Picture a set of nesting dolls with each nested in proper order, by design, into the largest doll. It is the same for HTML tags—they must be in the correct order to not break the design of the webpage.

An easy way to keep the tags in visual order is by using indentation. 
- Nested HTML code appears indented so it is easier to understand. 
- Containers nested within other containers are indented by **two to four** spaces. This helps to keep our code clean and easy to understand.
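Nesting can also be seen programmatically. The sketch below uses Python's built-in `html.parser` module (a stand-in chosen so the example runs without extra installs, not the scraping tool used later in the module) to print each opening tag indented by its depth. The class name `NestingPrinter` is hypothetical, and the simple depth counter assumes every tag in the snippet has a matching closing tag.

```python
from html.parser import HTMLParser


class NestingPrinter(HTMLParser):
    """Collect one indented line per opening tag to show nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Indent by two spaces per level, then step one level deeper.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag ends the current element, so step back out.
        self.depth -= 1


# The div example from the REWIND section, written on one line.
snippet = "<div><h1>Hello, world!</h1><p>This is a great beginning.</p></div>"

printer = NestingPrinter()
printer.feed(snippet)
print("\n".join(printer.lines))
```

The `<h1>` and `<p>` lines come out indented one level under `<div>`, which matches the hand-indented HTML shown earlier.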

Let's take another look at this webpage, only with a few more elements added to it:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="UTF-8" />
      <meta name="viewport" content="width=device-width, initial-scale=1.0" />
      <meta http-equiv="X-UA-Compatible" content="ie=edge" />
      <title>Document</title>
    </head>
    <body>
      <h1>Hello, world!</h1>
      <p>
        Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin aliquet
        iaculis lorem non sollicitudin. Fusce elementum ac elit finibus auctor.
        Curabitur orci sem, accumsan a diam sit amet, efficitur tristique velit.
      </p>
      <ul>
        <li>First list item</li>
        <li>Second list item</li>
        <li>Third list item</li>
      </ul>
    </body>
    </html>

There are several more tags within the `<body/>` container. 
1. Add this new code to your index.html file and save it. 
2. Then, open the file by navigating to it and double-clicking it. 

Now you have a simple static webpage open in your browser, built from scratch. It's not super exciting yet, but that's okay. It's the innards of the page we're focusing on right now.

Let's review the new tags:

`<h1/>` is a first-level header. The text in this tag will be displayed bigger and bolder than the rest of the page's text. There are six header levels available, from h1 to h6, with h1 displaying the largest text.

`<p/>` is a paragraph tag, currently holding lorem ipsum sentences. (Lorem ipsum is dummy text used to stage websites; more can be read about it on the Lorem Ipsum reference website.)

`<ul/>` is an unordered list.

`<li/>` is a list item.

This is only a small taste of how many tags exist out there. Remember, these tags are all part of website customization. Without the variety available to use, websites would look plain and uninspired. The sites that Robin intends to scrape data from are far more sophisticated, using many more combinations of tags than what we've discussed here. Understanding the basic layout and how nesting and containers work is an important part of successful web scraping.

We know that when we scrape data from the web, we're simply pulling specific data from websites we've chosen. How do we specify the data? Let's say we want the latest news article from a Mars website. Before we can program our script to pull that data, we have to tell it where to look. Basically, our script would say, "look in this `<div/>` tag, then look inside that for a `<p/>` tag."

For example, if the webpage were a window, we would use our script to direct it to a certain pane.
- The script specifies that data should be pulled from the bottom-center pane of the webpage.

Once it has found that pane, we can also tell it to look even closer, at a selected spot within the pane.
- The script specifies that data should be pulled from a selected spot within the bottom-center pane of the webpage.

That's a simple way of putting it, but we'll dive more deeply into how web scraping works soon. Visit W3Schools' developer site for an extensive list of [HTML tags](https://www.w3schools.com/tags/tag_comment.asp).
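The instruction "look in this `<div/>` tag, then look inside that for a `<p/>` tag" can be sketched with BeautifulSoup, assuming the `bs4` package from the installation section is available. The HTML string is the small nested example from earlier in this section rather than a real Mars page, and the variable names are placeholders.

```python
from bs4 import BeautifulSoup

# The small nested example from earlier in this section.
html = """
<div>
  <h1>Hello, world!</h1>
  <p>This is a great beginning.</p>
</div>
"""

page = BeautifulSoup(html, "html.parser")

# "Look in this <div> tag, then look inside that for a <p> tag."
container = page.find("div")
paragraph = container.find("p")

print(paragraph.get_text())
```

Each `find` call returns the first matching element, so chaining two calls narrows the search from the container to the element nested inside it.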

## 10.2.2: Using Chrome Developer Tools
<em>Robin has installed all of her tools and researched different HTML tags in preparation for her web-scraping project. She's really ready to jump in and start gathering data, but even with her initial research, she realized she wasn't quite ready to start scraping. The HTML components on the first site she visited quickly became more complex than she expected.

Instead, Robin has decided to practice identifying specific data using Chrome Developer Tools (also known as DevTools). This tool allows developers to look at the structure of any webpage. Not only that, but there's a search function as well. This should help make more sense of the tags and components that hold the data she's looking for.

Let's visit one of the websites Robin plans to use and take a peek at its structure, then practice finding different components.</em>

Robin wants to be kept up to date with different Mars news, and she's enjoyed the articles published on the NASA news website. For her project specifically she would like to extract the most recently published article's title and summary. Let's find the HTML components in the page so we can help her with that.

1. Start by opening the news site in a new browser window.
   - At first glance, we can see that there are article titles and a quick sentence describing each article.
   - Open the DevTools by right-clicking anywhere on the page, then click "Inspect" from the pop-up menu.

2. After clicking "Inspect," a new window should open under the webpage. 
   - This new window is docked to the webpage itself—it's part of the webpage, it's attached to the webpage, but it has a different job.
   - There is a lot going on in this site. What we're currently looking at is how this news site is assembled. The `<html lang="en">` line should look familiar, as well as the `<head/>` and `<body/>` tags, but what is all of this other stuff? And the stuff inside the familiar tags? It's a good thing Robin wanted to take a deeper look at this webpage before trying to extract the data from it.


Let's start breaking this down a bit. Remember how we spoke about containers? For example, the `<body>` tag is a container for every visual component of a webpage, such as headers and paragraphs. Inside that `<body/>` tag are other containers, which are nested much like a nesting doll. In the case of this website (and most websites), these other containers inside the body are `<div/>` tags.
- There can be multiple levels of nesting, depending on how elaborate the website is.
- Another perk of the DevTools is that if you hover over the code displayed in the window below, the connected visual is highlighted in the page above at the same time. This is helpful because it shows us which code is specifically tied to features of the website above.

There is a lot of custom code included in this website, so instead of scrolling through all of it to find a certain element, we will search for it instead. 

3. In your DevTools, press "ctrl + f" or "command + f" to bring up the search function. 
   1. Input "gallery_header" into the search bar then press enter. 
   2. Make sure the line `<header class="gallery_header">` is selected, then hover over it with your mouse pointer.
   3. This will highlight the header section of the page: the title and its container element.


4. Hover over the next line of code, `<h2 class="module-title">News</h2>`.
   - If the header isn't displaying the nested contents, click the arrow beside it to expand it, and then hover over the line `<h2 class="module-title">News</h2>`.
   - Notice how the highlighted portion of the site above has become smaller? That's because we're now looking at an element that is nested inside of a container, instead of the full container.
   - This is a great way to pinpoint where on the website we want our web scraping code to pull data from. We can't just tell the code to grab a div or a header though, because there could be many of these on the website when we only want one. This is where the class and id attributes come into play.

### HTML Classes and IDs
Because of how quickly HTML code can get bloated and confusing, it's important to keep specific containers unique. With everything contained within HTML code, it can be really difficult to find what we're looking for.

But how are developers able to distinguish one `<div />` from another? 
- By adding attributes unique to each container or element. 
  
  That's another reason to practice using DevTools. We can use it to search for these attributes. How exactly do they work?

  Think of it like a litter of puppies. They all look pretty similar, but they each have a personality quirk or trait that makes them act a little differently from their siblings. By adding a different color collar to each puppy, we can now tell them apart just by looking. HTML class and id attributes are like those collars.

Robin knows that she will want to pull the top article and summary sentence. How do we identify those components, though? Let's look at our DevTools again. This time, let's drill further down into the nested components—we want to find the element that highlights only the top article on the page.

The first `<li />` element with a class of "slide" highlights the top article on the page.
- Using the DevTools to select a list element with the class of "slide" will also highlight the article image, title, and paragraph on the NASA news webpage.

The section we're aiming for (the article title and text) is nested further in, and there are quite a few steps we'll need to take to get there.
- First, click the drop-down arrow on the `<li class="slide">` element (if it isn't already open).
- From there, we're directed to another element: a div with the class of "image_and_description_container".
- Click that drop-down arrow as well.
- Within that, we have another element, `<div class="list_text">`.
  - Using DevTools to select an element with a class of "list_text" will only highlight the article title and paragraph on the NASA news webpage.

Maneuvering around these nested elements is called "drilling down," and it's a skill you'll encounter and employ fairly often as you continue to work with HTML.

This final container holds the information Robin will want: the article title and summary. With the use of DevTools, we've been able to sift through the nested HTML code to find the exact tags we'll need to reference in our scraping script. 
- This process is something we'll be following with each additional webpage we want to scrape: visit the page, identify the data, then sift through the HTML code to pinpoint its location on the webpage.
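The drilling-down path traced in DevTools can be written as a single CSS selector. The sketch below assumes `bs4` is installed and uses a hypothetical, simplified HTML string that only mirrors the three class names seen above; the inner `<h3>` and `<p>` tags and their text are placeholders, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mirrors the class names seen in DevTools.
html = """
<ul>
  <li class="slide">
    <div class="image_and_description_container">
      <div class="list_text">
        <h3>First example title</h3>
        <p>First example summary.</p>
      </div>
    </div>
  </li>
  <li class="slide">
    <div class="image_and_description_container">
      <div class="list_text">
        <h3>Second example title</h3>
        <p>Second example summary.</p>
      </div>
    </div>
  </li>
</ul>
"""

page = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector, so this
# drills down li.slide -> div.image_and_description_container -> div.list_text.
first_text = page.select_one(
    "li.slide div.image_and_description_container div.list_text"
)

title = first_text.find("h3").get_text()
summary = first_text.find("p").get_text()
print(title)
print(summary)
```

Because only the first match is returned, the second `<li class="slide">` is ignored, which is the behavior wanted when only the top article matters.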

That is a lot of clicking to get to one single section of a webpage. With technology as advanced as it is, isn't there a faster way? You bet! Let's condense the steps above.

Go ahead and close your DevTools window, then take another look at the webpage.
- Locate the first article's title and summary, and right-click the space below them. This time, click "Inspect" from the pop-up menu.
- The DevTools window automatically opens again, but this time the highlighted section is already close to the element you want to view, if that element isn't already selected.
  - You can tell by mousing over the highlighted element—it will simultaneously highlight the corresponding location on the webpage.

**Using this method will reduce your time spent sifting through the different elements on the page.**



### Mobile Device Preview
DevTools also comes with a feature that allows us to view webpages as we would if using a phone or tablet. Not only that, but there are specific device models we can use to test the page. Let's look at the DevTools again—this time at the Device icon.

Click on the Device icon to preview the webpage for mobile devices. This button toggles the device selector. When clicked, the webpage we're viewing automatically adjusts to the height and width of a responsive mobile device. When in mobile mode, there is a drop-down menu at the top left of the screen; this menu provides a selection of devices to choose from and to view the site with.


Switching between devices alters the webpage to reflect how it interacts with each device.

When you're ready to view the webpage as it normally is shown on your computer, you can toggle the responsive view with the same button (the device icon).

We'll use this later in the module when we build a portfolio.

# Automate a Web Browser and Perform a Web Scrape
## 10.3.1: Use Splinter

<em>Robin is a bit more familiar with HTML tags and how they fit together to create a webpage, which is a great first step. She also has the necessary tools installed to get started with the scraping, so she's eager to dive in.

The next part is to use Splinter to automate a browser—this is pretty fun because we'll actually be able to watch a browser work without us clicking anywhere or typing in fields, such as using a search bar or next button.

After we help Robin get Splinter rolling, we'll actually scrape data using BeautifulSoup. This is where our practice with HTML tags comes in. To scrape the data we want, we'll have to tell BeautifulSoup which HTML tag is being used and if it has an attribute such as a specific class or id.</em>

One of the fun things about web scraping is the automation—watching your script at work.
1. Once you execute your completed scraping script, a new Chrome web browser will pop up with a banner across the top that says "Chrome is being controlled by automated test software." 
2. This message lets you know that your Python script is directing the browser. The browser will visit websites and interact with them on its own.
3. Depending on how you've programmed your script, your browser will click buttons, use a search bar, or even log in to a website.
   
Navigate to your Mission-to-Mars folder using the terminal. Then go ahead and launch Jupyter Notebook. Create a new `.ipynb` file to get started; this is where we'll begin our web scraping work. Let's name it "Practice." It can be deleted when we're done, or used as a reference later on. It's not necessary, but you can keep it in your local repo folder and list it in your `.gitignore` file so that it isn't pushed to GitHub for public view.

#### REWIND
`.gitignore` is a text file that lists the files and folders Git should not track, so they are never pushed to your public repo. Typical entries are configuration files, or files that aren't necessary for the completed project but that you want to keep for reference.

In the very first cell, we'll import our scraping tools: `Browser` from Splinter, `BeautifulSoup` from bs4, and the driver manager for Chrome, `ChromeDriverManager`.

    from splinter import Browser
    from bs4 import BeautifulSoup as soup
    from webdriver_manager.chrome import ChromeDriverManager
    
We're using an alias, "soup," to simplify our code a bit when we reference it later.

Next, we'll set the executable path and initialize a browser:

```
# Set up Splinter  
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)
```
    
With these two lines of code, we are creating an instance of a Splinter browser. This means that we're prepping our automated browser. We're also specifying that we'll be using Chrome as our browser. 
- `**executable_path` is unpacking the dictionary we've stored the path in – think of it as unpacking a suitcase. 
- `headless=False` means that all of the browser's actions will be displayed in a Chrome window so we can see them.
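The "unpacking a suitcase" behavior of `**` is plain Python and can be tried without opening a browser. In the sketch below, `open_browser` is a hypothetical stand-in function that only reports the arguments it receives, and the driver path is a made-up placeholder.

```python
def open_browser(name, executable_path=None, headless=True):
    """Hypothetical stand-in for a browser factory; it only reports its arguments."""
    return {"name": name, "executable_path": executable_path, "headless": headless}


# A dictionary whose key matches a keyword argument name.
options = {"executable_path": "/hypothetical/path/to/chromedriver"}

# These two calls are equivalent: ** spreads the dictionary into key=value arguments.
unpacked = open_browser("chrome", **options, headless=False)
explicit = open_browser(
    "chrome",
    executable_path="/hypothetical/path/to/chromedriver",
    headless=False,
)

print(unpacked == explicit)
```

Storing the path in a dictionary first keeps the call short and makes it easy to add more keyword arguments later without rewriting the call itself.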

You should have three cells ready to be run; go ahead and execute them. The third cell that initiates a Splinter browser may take a couple of seconds to finish, but an empty webpage should automatically open, ready for instructions. You'll know that it's an automated browser because it'll have a special message stating so, right under the tab.

This browser now belongs to Splinter (for the duration of our coding, anyway). It's a lot of fun to watch Splinter do its thing and navigate through webpages without us physically interacting with any components. It's also a great way to make sure our code is working as we want it to. While the window can be closed at any time, it's generally not a good idea to shut down the browser without ending the session properly; there's an excellent chance your code will fail or an error will be generated.

**NOTE:**
Splinter provides us with many ways to interact with webpages. It can input terms into a Google search bar for us and click the Search button, or even log us into our email accounts by inputting a username and password combination.

## 10.3.2: Practice with Splinter and BeautifulSoup

<em>Robin is feeling more comfortable with the different HTML components used to build webpages, and she knows that the data she wants to scrape will be nested within different HTML tags. An HTML page can get very confusing very quickly, so Robin would like to practice on a less sophisticated site first. There are several sites available specifically for newly minted web scrapers to practice and hone their skills with Splinter and BeautifulSoup. These practice sites contain several different components that we'll encounter out in the wild: buttons to navigate, search bars, and nested HTML tags. It's a great introduction to how the tools we'll use work together to gather the data we want.</em>

Before we start scraping things directly from a Mars website, let's practice on another, less-involved site first. 
- Your Jupyter Notebook cells should reflect these activities: importing Splinter and BeautifulSoup; creating a path to ChromeDriver; and setting the executable path and initializing the Chrome browser in Splinter.



In the fourth cell, we'll scrape data from a website specifically created for practicing our skills: [Quotes to Scrape](http://quotes.toscrape.com/).
- Open the website in a browser and familiarize yourself with the page layout.

### Scrape the Top 10 Tags
Interacting with webpages is Splinter's specialty, and there are lots of things to interact with on this one, such as the login button and tags. Our goal for this practice is to scrape the "Top Ten tags" text.


Before we start with the code, we'll want to use the DevTools to look at the details of this line. 
- Right-click the webpage and select "Inspect." From the DevTools window, we can actually select an element on the page instead of searching through the tags.

   1. First, select the inspect icon (the one to the far left).
   2. Then, click the element you want to select on the page. For this practice, click the "Top Ten tags" text; any other element, such as the humor tag, can be selected the same way. This will direct your DevTools to the line of code that the selected element is nested in.

That was really quick. Sometimes we'll still have to dig through the tags to find the ones we want, but being able to select items directly from the webpage cuts down the time immensely. So with this shortcut, we've been able to select the `<h2 />` tag holding the text we want.


With this, we know that our data is in an `<h2 />` tag, and that's great! We've narrowed down where our data is hanging out. But what if there is more than one `<h2 />` tag on the page? When scraping one particular item, we will often need to be more specific in choosing the tag we're scraping from. We can narrow this down even further by using the search function in our DevTools.



### Search for Elements
Searching within the HTML code is another useful way to quickly find items we're looking for. Earlier, we were able to select a particular component from the page with the select tool. But there are times where we need to know how many of a certain type of tag are in the page. 
For example, the title we want to scrape is in an `<h2 />` tag, but there could be several others on the page as well. If so, we would need to tailor our code to pull only the `<h2 />` tag we want. First, let's practice searching in the HTML.

1. With the DevTools still active, press Command + F if you use a Mac, or CTRL + F if you use a Windows computer. 
   - This activates the search functionality, only instead of searching the webpage, we're searching the HTML of the webpage. So if we search for all of the `<h2 />` tags in the document, we'll know if we need to make our search more specific by adding attributes such as a class name or an id.

2. In the search bar that we just activated, type h2 and then press Enter on your keyboard.
   - The result of our search immediately shows us two things: 
      i. That the first tag we've searched for is highlighted.
      ii. The number of those tags in the document.
   - We receive our first results matching "h2." The total tags with "h2" appear in the lower right.

Because the search shows "1 of 1," there is only one `<h2 />` tag in the document, so we know that we can scrape for an `<h2 />` without being more specific. In most other cases, we'll need to include a class or id, so we'll practice that in a little bit.

### Scrape the Title

1. In the next cell in Jupyter Notebook, type the following:

   ```
   # Visit the Quotes to Scrape site
   url = 'http://quotes.toscrape.com/'
   browser.visit(url)
   ```

- This code tells Splinter which site we want to visit by assigning the link to the `url` variable. After executing the cell above, we will use BeautifulSoup to parse the HTML. 
  
2. In the next cell, we'll add two more lines of code:
   ```
   # Parse the HTML
   html = browser.html
   html_soup = soup(html, 'html.parser')
   ```

- Now we've parsed all of the HTML on the page. That means that BeautifulSoup has taken a look at the different components and can now access them. Specifically, BeautifulSoup parses the HTML text and then stores it as an object.
- In our code, we're using `'html.parser'` to parse the information, but there are other options available as well.

3. In our next cell, we will find the title and extract it.
   ```
   # Scrape the Title
   title = html_soup.find('h2').text
   title
   ```

What we've just done in the last two lines of code is:

- We used our `html_soup` object we created earlier and chained `find()` to it to search for the `<h2 /> ` tag.
- We've also extracted only the text within the HTML tags by adding `.text` to the end of the code.
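
To see `find()` and `.text` in isolation, here is a minimal sketch that parses a small hand-written HTML string instead of the live page. The `sample_html` fragment is an assumption made up for illustration; it only imitates the structure of the Quotes to Scrape page.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical HTML fragment standing in for browser.html
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
</div>
"""

html_soup = soup(sample_html, 'html.parser')

# find() returns the first matching element; .text keeps only its text
title = html_soup.find('h2').text
print(title)

# find() returns None when nothing matches, so chaining .text would fail here
missing = html_soup.find('h3')
print(missing)
```

Because `find()` returns `None` when there is no match, chaining `.text` onto a failed search raises an `AttributeError`. That is worth keeping in mind whenever a page's layout might change.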

We've completed our first actual scrape. Let's practice again, this time scraping the actual tags to go with the title we just pulled.

### Scrape All of the Tags

Using our DevTools again, look at the code for the tags. We want all of the tags instead of just one, so we want to first use our select tool to highlight the `<div />` container that holds all of the tags.


Notice that the `<div />` container holding all of the tags has two classes. 
1. The `col-md-4` class is a Bootstrap feature.   
   - **Bootstrap** is an HTML and CSS framework that simplifies adding functional components that look nice by default. 
     - In this case, `col-md-4` means that this webpage is using a grid layout, and it's a common class that many webpages use. 


2. The other class, tags-box, looks custom, though. Let's make sure first by searching for it using our search box.
   - After searching for tags-box, we can see that only one result is returned. This means that it's unique in the HTML and can be used to locate specific data. Next, expand the tags-box div to take a look at the contents.
     - From here, we can see a list of `<span />` elements, each with a class of tag-item. 
       - Open some of the `<span />` elements to see what they contain; if you see `<a />` elements with the names in the list that we're targeting, then we're in the right place.
   - Since there are 10 items in the list displayed in the browser, let's use the dev tools' search function to verify the list item count. 
     - Search for `tag-item` and note the number of returned results. If there are 10, then we're ready to go.


3. In the next cell of your Jupyter Notebook, type the following:

   ```
   # Scrape the top ten tags
   tag_box = html_soup.find('div', class_='tags-box')
   # tag_box
   tags = tag_box.find_all('a', class_='tag')

   for tag in tags:
      word = tag.text
      print(word)
   ```

This code looks really similar to our last, though we've increased the difficulty a bit by incorporating a for loop. Let's start at the beginning.

- The first line, `tag_box = html_soup.find('div', class_='tags-box')`, creates a new variable `tag_box`, which will be used to store the results of a search. 
  
  - In this case, we're looking for `<div />` elements with a class of tags-box, and we're searching for it in the HTML we parsed earlier and stored in the `html_soup` variable.

- The second line, `tags = tag_box.find_all('a', class_='tag')`, is similar to the first but with a few tweaks to make the search more specific. 
  - The new "tags" variable will hold the results of a `find_all`, but this time we're searching through the parsed results stored in our `tag_box` variable to find `<a />` elements with a tag class.
  - We used `find_all` this time because we want to capture all results, instead of a single or specific one.

- Next, we've added a for loop. 
  - This for loop cycles through each tag in the tags variable, strips the HTML code out of it, and then prints only the text of each tag.
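
The same pattern can be tried without a browser. The sketch below runs the two searches against a made-up HTML string (`sample_html` is an assumption, not the real page) and collects the text with a list comprehension instead of printing it.

```python
from bs4 import BeautifulSoup as soup

# Hypothetical HTML: two tags inside the tags-box, one outside it
sample_html = """
<div class="col-md-4 tags-box">
  <h2>Top Ten tags</h2>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/humor/">humor</a></span>
</div>
<a class="tag" href="/tag/life/">life</a>
"""

html_soup = soup(sample_html, 'html.parser')

# class_ matches even though the div carries two classes
tag_box = html_soup.find('div', class_='tags-box')

# Searching inside tag_box ignores the <a> element outside the container
tags = tag_box.find_all('a', class_='tag')
words = [tag.text for tag in tags]
print(words)

# Searching the whole document would also pick up the outside link
all_tags = html_soup.find_all('a', class_='tag')
print(len(all_tags))
```

Narrowing the search to `tag_box` first is what keeps unrelated links with the same class out of the results.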

### Scrape Across Pages
Now that we've practiced scraping items from a single page, we're going to up the ante by scraping items that span multiple pages. Our next section of code will scrape the quotes on the first page, click the "Next" button, then scrape more quotes and so on until we have scraped the quotes on five pages.

We have already created the Browser instance and navigated to the `http://quotes.toscrape.com/` page with the `visit()` method. But, if you've navigated away and need to return to the first page, run the following code in a new cell.
   ```
   url = 'http://quotes.toscrape.com/'
   browser.visit(url)
   ```

In the next cell, we'll create a for loop that will do the following:
1. Create a BeautifulSoup object
2. Find all the quotes on the page
3. Print each quote from the page
4. Click the "Next" button at the bottom of the page

We'll use range(1, 6) in our for loop to visit the first five pages of the website.
```
# Iterate through pages 1-5
for x in range(1, 6):
   # Get the HTML of the current page
   html = browser.html
   # Parse the HTML using BeautifulSoup
   quote_soup = soup(html, 'html.parser')
   quotes = quote_soup.find_all('span', class_='text')
   # Print each quote parsed by BeautifulSoup
   for quote in quotes:
      print('page:', x, '----------')
      print(quote.text)
   # Click the "Next" button using Splinter
   browser.links.find_by_partial_text('Next').click()
```


It's important to note that there are many ways that BeautifulSoup can search for text, but the syntax is typically the same: **we look for a tag first, then an attribute.** 
- We can search for items using only a tag, such as a `<span />` or `<h1 />`, but a class or id attribute makes the search that much more specific.



By including an attribute, we have a far better chance of scraping the data we want.

Go ahead and run the code in this cell. Thanks to our print statements, five pages worth of quotes should be right at our fingertips.


#### SKILL DRILL
Stretch your scraping skills by visiting [Books to Scrape](http://books.toscrape.com/) and scraping the book URL list on the first page.



## Access Data in MongoDB
## 10.4.1: Store the Data
<em>Robin is pretty excited about all of the data we've managed to scrape. And the code is designed to grab the most recent data, so if it's run at a later time, all of the results will have been updated—without us needing to alter the code.

Now that she has the results she wants, she needs to store them in a spot where they can be easily accessed and retrieved as needed. SQL isn't a good option because it works with tabular data, and only one of the items we scraped is presented in that format. Even then, it's all condensed into a block of HTML code.

What Robin will need to use is a database that works differently from SQL with its neatly ordered tables and relationships. Mongo, a NoSQL database, is designed for exactly this task. While Robin is familiar with SQL databases, Mongo is completely new and we'll need to practice with it before loading in our scraped data.</em>

The data Robin has gathered for her web app is great. She's been able to pull a great image, the most recent news article summary, and even an HTML table. Each data type is different, though, with text and images and HTML all together. Compared to SQL's orderly relational system, where each table is linked to at least one other by a key, the data we've helped Robin gather is a bit chaotic. This is where a non-relational database comes in.

**MongoDB (Mongo for short) is a non-relational database that stores data in BSON (Binary JSON), a binary-encoded form of JavaScript Object Notation (JSON).** We'll access data stored in Mongo much the same way we access data stored in JSON files. This method of data storage is far more flexible than SQL's model. If you'd like to dig into MongoDB at a deeper level, check out the [official documentation](https://docs.mongodb.com/).

- **JSON, JavaScript Object Notation**, is a format that stores and presents data in the form of key:value pairs. It looks much like a Python dictionary and can be traversed using key and index notation.

A Mongo database contains collections. These collections contain documents, and each document contains fields, and fields are where the data is stored.
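
To make the document and field idea concrete, here is a small sketch using Python's built-in `json` module. The sample document is made up for illustration; it shows how a JSON document becomes a dictionary whose fields are read by key and whose lists are read by index.

```python
import json

# A hypothetical document, written as JSON text
json_text = '''
{
  "name": "Cleo",
  "species": "jaguar",
  "age": 12,
  "hobbies": ["sleeping", "eating", "climbing"]
}
'''

# json.loads turns the JSON text into a Python dictionary
document = json.loads(json_text)

# Fields are accessed by key
species = document["species"]

# A field holding a list is accessed by key, then by index
first_hobby = document["hobbies"][0]

print(species, first_hobby)
```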

While Mongo and SQL are both databases, that's where the similarities end. They handle documents differently, the storage model isn't even close, and we even interact with them in very different ways.

1. To get started with Mongo, first open a new terminal window, but make sure your working environment is activated. 


2. Then, to start an instance, type `mongod` into the first line of your terminal and press return or enter on your keyboard. Some Mac users may not need to run this command as Mongo is already running in the background.
   - We need to keep this tab open and active so that the Mongo instance continues to run. While Mongo does have a GUI, similar to pgAdmin for Postgres, we'll be using a command line interface (CLI) to make connections within the database.


3. In our terminal, create a second window or tab to use for working in Mongo. Again, make sure your environment is active.
   - On the first line of this new window, type "mongo." This is done in a new window because, after you execute the command, you cannot use the terminal for other tasks—only to send information to and from the database.

   After executing the command, your terminal will show a right angle bracket and a blinking cursor. This indicates that the database is active and ready for use.

### Create a Database
We don't have a fancy GUI to use while navigating through creating a database and inserting data, but that doesn't mean that our commands will be very complex. Let's create a new practice database to get used to some of the more common commands.

In the terminal where Mongo is active and awaiting instruction, type **"use practicedb"** and then press Enter. 
- This creates a new database named "practicedb" and makes it our active database.

If you're not sure which database you're using, type **"db"** in the terminal and press Enter.
- After typing "db" into the terminal and pressing Enter, the name of the current active database is returned. This is a quick check to make sure we'll be saving data to the right spot.

You can also see how many databases are stored locally by typing **"show dbs"** in your terminal. 
- There should be a few already there by default, so don't be alarmed if more than one appears that you didn't create yourself.

There is also a way to check to see what data, or collections, are already in the database. Type **"show collections"** into the shell, or terminal, then press Enter.
- Nothing came up after that, right? That's a good thing. We haven't entered any data yet. We'll practice doing that next.


### Insert Data
Now that we've confirmed we're in the right database, we can practice the commands to insert data or a document.

The syntax follows: `db.collectionName.insert({key:value})`. Its components do the following:

- `db` refers to the active database, practicedb.
- `collectionName` is the name of the new collection we're creating (we'll customize it when we practice).
- `.insert({ })` is how MongoDB knows we're inserting data into the collection.
- `key:value` is the format into which we're inserting our data; its construction is very similar to a Python dictionary.

In short, we're saying, "Hey, Mongo, use the database we've already specified, and insert a document into this collection. If there's not a collection named that, then create one."

Let's explore how this works a bit by adding some zoo animals to our collection.

In the shell, type:

`db.zoo.insert({name: 'Cleo', species: 'jaguar', age: 12, hobbies: ['sleeping', 'eating', 'climbing']})`

After pressing Enter, the next line in your terminal should read `WriteResult({ "nInserted" : 1 })`. This means that we've successfully inserted Cleo into the database.

Now let's add another animal. In your shell, type the following:

`db.zoo.insert({name: 'Banzai', species: 'fox', age: 1, hobbies: ['sleeping', 'eating', 'playing']})`

This time we've added a fox to our collection, but the code is very similar to when we added Cleo.

#### SKILL DRILL
Add three more animals to your database, then type "show collections" in your shell.

Now that we've added data to our collection, when we type "show collections" we'll actually see a result: zoo. The name of our new collection is returned. We can also view what's inside a collection with the `find()` command. Executing `db.zoo.find()` in the terminal will return each of the documents we've already added.

A MongoDB shell with "db.zoo.find()" executed, returning the two documents already added: an entry for Cleo and an entry for Banzai.

Documents can also be deleted or dropped. The syntax to do so follows: 
   
`db.collectionName.remove({})`

So, if we wanted to remove Cleo from the database, we would update that line of code to:

`db.zoo.remove({name: 'Cleo'})`

We can also empty the collection at once, instead of one document at a time. For example, to empty our zoo collection, we would type: `db.zoo.remove({})`. Because the inner curly brackets are empty, Mongo will assume that we want everything in our zoo collection to be removed.

Additionally, to remove a collection altogether, we would use `db.zoo.drop()`. After running that line in the shell, our zoo collection will no longer exist at all.

And to remove the test database, we will use this line of code: `db.dropDatabase()`.

#### SKILL DRILL
Remove the animals you added earlier with the `remove({})` method, then drop the database.

You can quit the Mongo shell by using a keyboard command: Control + C on both Mac and Windows. This stops the processes that are actively running and frees up your terminal. Remember to quit both the server and the shell when you're done practicing. Otherwise, they'll continue to run in the background and use system resources, such as memory, and slow down the response time of your computer.

**Create a new database named "mars_app" to hold the Mars data we scrape. It is essential that this database exists before we start running our web app outside of Jupyter Notebook, otherwise we'll encounter errors.**



## Display Data with Flask
## 10.5.1: Use Flask to Create a Web App

<em>We've really come a long way in helping Robin prepare to build her web application. After familiarizing ourselves with HTML and its attributes, we've created code to scrape live data from scraping-friendly websites. Once the application is complete, we'll get the latest featured image, news article and its summary, and fact table at the push of a button.

Robin has also studied and practiced with Mongo, a NoSQL database that she'll be using to display the scraped data we've pulled. The next part is actually building the framework for the app using Flask and Mongo together.</em>

One really great part about how we interact with Mongo through the terminal is that it works really well with Python script and Flask.


**Flask is a web microframework that helps developers build a web application.** 
- The Pythonic tools and libraries it comes with provide the means to create anything from a small webpage or blog to something large enough for commercial use.

In your code editor, first make sure you're in your Mission-to-Mars directory, then create a new .py file named app.py. This is where we'll use Flask and Mongo to begin creating Robin's web app. Let's begin by importing our tools. In our new Python file, add the following lines of code:

```
from flask import Flask, render_template, redirect, url_for
from flask_pymongo import PyMongo
import scraping
```

Let's break down what this code is doing.

- The first line says that we'll use Flask to render a template, redirect to another URL, and create a URL.
- The second line says we'll use PyMongo to interact with our Mongo database.
- The third line imports our scraping code, which we will convert from a Jupyter notebook to a Python script.

Under these lines, let's add the following to set up Flask:

`app = Flask(__name__)`

We also need to tell Python how to connect to Mongo using PyMongo. Next, add the following lines:

```
# Use flask_pymongo to set up mongo connection
app.config["MONGO_URI"] = "mongodb://localhost:27017/mars_app"
mongo = PyMongo(app)
```

- `app.config["MONGO_URI"]` tells Python that our app will connect to Mongo using a URI, a uniform resource identifier similar to a URL.
- `"mongodb://localhost:27017/mars_app"` is the URI we'll be using to connect our app to Mongo. This URI is saying that the app can reach Mongo through our localhost server, using port 27017, using a database named "mars_app".
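
The parts of that URI can be inspected with Python's standard library. This is only an illustrative sketch for reading the string; Flask-PyMongo does its own parsing, and the variable names here are assumptions.

```python
from urllib.parse import urlsplit

mongo_uri = "mongodb://localhost:27017/mars_app"

# Split the URI into its components
parts = urlsplit(mongo_uri)

scheme = parts.scheme               # protocol prefix
host = parts.hostname               # server the app connects to
port = parts.port                   # port Mongo listens on
database = parts.path.lstrip("/")   # database name after the slash

print(scheme, host, port, database)
```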


### Set Up App Routes
The code we create next will set up our Flask routes: 

1. One for the main HTML page everyone will view when visiting the web app.
2. One to actually scrape new data using the code we've written.

#### REWIND

Flask routes bind URLs to functions. 

- For example, the URL `"ourpage.com/"` brings us to the homepage of our web app. The URL `"ourpage.com/scrape"` will activate our scraping code.

These routes can be embedded into our web app and accessed via links or buttons.
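
One way to picture "binding URLs to functions" is a dictionary that maps each path to a function, filled in by a decorator. The sketch below is a simplified stand-in written in plain Python, not Flask's actual implementation; `route`, `routes`, and `dispatch` are hypothetical names.

```python
# Maps a URL path to the function that handles it
routes = {}

def route(path):
    """Register the decorated function as the handler for path."""
    def decorator(func):
        routes[path] = func
        return func
    return decorator

@route("/")
def index():
    return "home page"

@route("/scrape")
def scrape():
    return "scraping started"

def dispatch(path):
    """Look up the handler for path and call it."""
    handler = routes.get(path)
    if handler is None:
        return "404 Not Found"
    return handler()

print(dispatch("/"))
print(dispatch("/scrape"))
print(dispatch("/missing"))
```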

First, let's define the route for the HTML page. In our script, type the following:

```
@app.route("/")
def index():
   mars = mongo.db.mars.find_one()
   return render_template("index.html", mars=mars)
```

- This route, `@app.route("/")`, tells Flask what to display when we're looking at the home page, index.html (index.html is the default HTML file that we'll use to display the content we've scraped). This means that when we visit our web app's HTML page, we will see the home page.

- Within the `def index():` function the following is accomplished:

  - `mars = mongo.db.mars.find_one()` uses PyMongo to retrieve a document from the "mars" collection in our database, which we will create when we convert our Jupyter scraping code to Python script. We also assign that document to the `mars` variable for use later.

  - `return render_template("index.html"` tells Flask to return an HTML template using an index.html file. We'll create this file after we build the Flask routes.

  - `, mars=mars)` passes that document to the template, so the page can display the data stored in the "mars" collection.

**This function is what links our visual representation of our work, our web app, to the code that powers it.**

Our next function will set up our scraping route. This route will be the "button" of the web application, the one that will scrape updated data when we tell it to from the homepage of our web app. It'll be tied to a button that will run the code when it's clicked.

Let's add the next route and function to our code. In the editor, type the following:

```
@app.route("/scrape")
def scrape():
   mars = mongo.db.mars
   mars_data = scraping.scrape_all()
   mars.update_one({}, {"$set": mars_data}, upsert=True)
   return redirect('/', code=302)
```

Let's look at these six lines a little closer.

- The first line, `@app.route("/scrape")`, defines the route that Flask will be using. 
  - This route, `"/scrape"`, will run the function that we create just beneath it.

- The next lines allow us to access the database, scrape new data using our `scraping.py` script, update the database, and redirect to the home page when successful. Let's break it down.

   - First, we define it with `def scrape():`.

   - Then, we assign a new variable that points to the "mars" collection in our Mongo database: `mars = mongo.db.mars`.

   - Next, we create a new variable to hold the newly scraped data: `mars_data = scraping.scrape_all()`.
     - In this line, we're referencing the `scrape_all` function in the scraping.py file exported from Jupyter Notebook.

   - Now that we've gathered new data, we need to update the database using `.update_one()`. (Older examples use `.update()`, which was removed in PyMongo 4.) Let's take a look at the syntax we'll use, as shown below:

      `.update_one(query_parameter, {"$set": data}, options)`

     - We want to update the first document in the collection, whatever it contains, so we'll add an empty JSON object with `{}` in place of the `query_parameter`.

     - Next, `{"$set": mars_data}` tells Mongo to modify the document with the data we have stored in `mars_data`.

     - Finally, the option we'll include is `upsert=True`.
       - This tells Mongo to create a new document if one doesn't already exist, so new data will always be saved (even if we haven't already created a document for it).

     - The entire line of code looks like this:
         `mars.update_one({}, {"$set": mars_data}, upsert=True)`.

   - Finally, we will add a redirect after successfully scraping the data: `return redirect('/', code=302)`. 
     - This will navigate our page back to `/` where we can see the updated content.
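
The effect of `upsert=True` with an empty query can be pictured with a plain-Python stand-in, where a list of dictionaries plays the role of the collection. This is only an illustration of the behavior, not PyMongo code, and `upsert_first` is a hypothetical helper.

```python
def upsert_first(collection, new_data):
    """Update the first document if one exists; otherwise insert new_data."""
    if collection:
        # A document already exists: overwrite its fields with the new values
        collection[0].update(new_data)
        return "updated"
    # The collection is empty: insert a new document
    collection.append(dict(new_data))
    return "inserted"

mars_collection = []

# The first scrape finds no document, so one is created
first = upsert_first(mars_collection, {"news_title": "Old headline"})

# Later scrapes overwrite that same document instead of adding more
second = upsert_first(mars_collection, {"news_title": "New headline"})

print(first, second, mars_collection)
```

Either way, the collection ends up holding a single document with the latest scrape, which is what the home page route reads with `find_one()`.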

The final bit of code we need for Flask is to tell it to run. Add these two lines to the bottom of your script and save your work:

```
if __name__ == "__main__":
   app.run()
```

## 10.5.2: Update the Code
<em>Robin's almost ready to launch her web app. She's created scraping code and a template, and she's defined the two Flask routes her web app will be using. She's completed a ton of work and made it a really long way—her dreams of working at NASA seem that much closer with all she's accomplished.

Before her code is ready for deployment, she'll need to integrate her scraping code in a way that Flask can handle. That means updating it to include functions and even some error handling. This will help with our app's performance and add a level of professionalism to the end product.</em>

We've already downloaded our Jupyter Notebook code and converted it to a Python script, but it's not quite ready to be used as part of our Flask app yet. The bulk of our code will remain the same—we know it works and will successfully pull the data we need. 

There are two big things we want to update in our code: 
1. We want to refactor it to include functions.
2. We will be adding some error handling into the mix.


**Functions are a very necessary part of programming.** 

- They allow developers to create code that will be reused as needed, instead of needing to rewrite the same code repeatedly.

In our case, we want our code to be reused, and often, to pull the most recent data. That's what web scraping is all about, right? Pulling in the live data at the click of a button. Functions enable this capability by bundling our code into something that is easy for us (and once it's deployed, whoever else we share the web app with) to use and reuse as needed.

Also, because the intention is to reuse this code often, we need to update our `scraping.py` script to use functions. Each major scrape, such as the news title and paragraph or featured image, will be divided into a self-contained, reusable function. 


### News Title and Paragraph

Our first scrape, the news title and paragraph summary, currently looks like this:

```
# Visit the mars nasa news site
url = 'https://redplanetscience.com/'
browser.visit(url)

# Optional delay for loading the page
browser.is_element_present_by_css('div.list_text', wait_time=1)

# Convert the browser html to a soup object and then quit the browser
html = browser.html
news_soup = soup(html, 'html.parser')

slide_elem = news_soup.select_one('div.list_text')
slide_elem.find('div', class_='content_title')

# Use the parent element to find the first 'a' tag and save it as 'news_title'
news_title = slide_elem.find('div', class_='content_title').get_text()
news_title

# Use the parent element to find the paragraph text
news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
news_p
```

Next, we will revisit that code and insert it into a function. Let's call it `mars_news`. Begin the function by defining it, then indent the code as needed to adhere to function syntax. It should look like the code below:

```
def mars_news():

   # Visit the mars nasa news site
   url = 'https://redplanetscience.com/'
   browser.visit(url)

   # Optional delay for loading the page
   browser.is_element_present_by_css('div.list_text', wait_time=1)

   # Convert the browser html to a soup object and then quit the browser
   html = browser.html
   news_soup = soup(html, 'html.parser')

   slide_elem = news_soup.select_one('div.list_text')
   slide_elem.find('div', class_='content_title')

   # Use the parent element to find the first <a> tag and save it as  `news_title`
   news_title = slide_elem.find('div', class_='content_title').get_text()
   news_title

   # Use the parent element to find the paragraph text
   news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
   news_p
```

To complete the function, we need to add a return statement.


Instead of having our title and paragraph printed within the function, we want to return them from the function so they can be used outside of it. We'll adjust our code to do so by deleting the standalone `news_title` and `news_p` lines and including both variables in the return statement instead, as shown below.

```
def mars_news():

   # Visit the mars nasa news site
   url = 'https://redplanetscience.com/'
   browser.visit(url)

   # Optional delay for loading the page
   browser.is_element_present_by_css('div.list_text', wait_time=1)

   # Convert the browser html to a soup object and then quit the browser
   html = browser.html
   news_soup = soup(html, 'html.parser')

   slide_elem = news_soup.select_one('div.list_text')

   # Use the parent element to find the first <a> tag and save it as `news_title`
   news_title = slide_elem.find('div', class_='content_title').get_text()

   # Use the parent element to find the paragraph text
   news_p = slide_elem.find('div', class_='article_teaser_body').get_text()

   return news_title, news_p
```

This function is looking really good. There are two things left to do. First, we need to add an argument to the function.

Update your function like this:

`def mars_news(browser):`
- When we add `browser` as a parameter, we're telling Python that the function will use the browser object we pass in when calling it, which is the same automated **browser** we defined outside the function. All of our scraping code utilizes an automated browser, and without this argument, our function wouldn't work.

The finishing touch is to add error handling to the mix. This is to address any potential errors that may occur during web scraping. Errors can pop up from anywhere, but in web scraping the most common cause of an error is when the webpage's format has changed and the scraping code no longer matches the new HTML elements.

We're going to add a try and except clause addressing `AttributeErrors`. By adding this error handling, we are able to continue with our other scraping portions even if this one doesn't work.

In our code, we're going to add the `try` portion right before the scraping:

```
    # Add try/except for error handling
    try:
        slide_elem = news_soup.select_one('div.list_text')
        # Use the parent element to find the first 'a' tag and save it as 'news_title'
        news_title = slide_elem.find('div', class_='content_title').get_text()
        # Use the parent element to find the paragraph text
        news_p = slide_elem.find('div', class_='article_teaser_body').get_text()
```

After adding the `try` portion of our error handling, we need to add the `except` part. After these lines, we'll immediately add the following:

```
    except AttributeError:
        return None, None
```

By wrapping the scraping code in `try:`, we're telling Python to attempt to find these elements.
- If there's no error, Python will continue to run the remainder of the code and return the title and paragraph. 
- If it runs into an `AttributeError`, however, the function returns `None` for both the title and the paragraph instead.
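
The reason `AttributeError` is the error to catch is that a failed search returns `None`, and `None` has no `get_text()` method. The sketch below shows this with a stand-in object; `FakeElement` and `extract_text` are hypothetical names used only for illustration.

```python
class FakeElement:
    """Hypothetical stand-in for a parsed HTML element."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text

def extract_text(element):
    """Return the element's text, or None if the element was not found."""
    try:
        return element.get_text()
    except AttributeError:
        # element is None, so it has no get_text() method
        return None

found = extract_text(FakeElement("Mars headline"))
not_found = extract_text(None)

print(found, not_found)
```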

The complete function should look as follows:

```
def mars_news(browser):

    # Scrape Mars News
    # Visit the mars nasa news site
    url = 'https://redplanetscience.com/'
    browser.visit(url)

    # Optional delay for loading the page
    browser.is_element_present_by_css('div.list_text', wait_time=1)

    # Convert the browser html to a soup object and then quit the browser
    html = browser.html
    news_soup = soup(html, 'html.parser')

    # Add try/except for error handling
    try:
        slide_elem = news_soup.select_one('div.list_text')
        # Use the parent element to find the first 'a' tag and save it as 'news_title'
        news_title = slide_elem.find('div', class_='content_title').get_text()
        # Use the parent element to find the paragraph text
        news_p = slide_elem.find('div', class_='article_teaser_body').get_text()

    except AttributeError:
        return None, None

    return news_title, news_p
```

Let's update our featured image the same way.

### Featured Image

The code to scrape the featured image will be updated in almost the exact same way we just updated the `mars_news` section. We will:

1. Declare and define our function.

    `def featured_image(browser):`

2. Remove the displayed output and return the value instead.

    In our Jupyter Notebook version of the code, we printed the results of our scraping by simply stating the variable (e.g., after assigning data to the `img_url` variable, we simply put `img_url` on the next line to view the data). We still want to view the data output in our Python script, but we want to see it at the end of our function instead of within it.

    `return img_url`

3. Add error handling for AttributeError.

    ```
    try:
        # Find the relative image url
        img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')

    except AttributeError:
        return None
    ```

All together, this function should look as follows:

```
def featured_image(browser):
    # Visit URL
    url = 'https://spaceimages-mars.com'
    browser.visit(url)

    # Find and click the full image button
    full_image_elem = browser.find_by_tag('button')[1]
    full_image_elem.click()

    # Parse the resulting html with soup
    html = browser.html
    img_soup = soup(html, 'html.parser')

    # Add try/except for error handling
    try:
        # Find the relative image url
        img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')

    except AttributeError:
        return None

    # Use the base url to create an absolute url
    img_url = f'https://spaceimages-mars.com/{img_url_rel}'

    return img_url
```
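
As an aside, the standard library's `urljoin` is an alternative way to build the absolute URL. It is not what this module uses, and the relative paths below are made-up examples. One small advantage is that it produces the same result whether or not the relative path starts with a slash.

```python
from urllib.parse import urljoin

base_url = 'https://spaceimages-mars.com/'

# Hypothetical relative paths, with and without a leading slash
rel_without_slash = 'image/featured/mars1.jpg'
rel_with_slash = '/image/featured/mars1.jpg'

# urljoin resolves the relative path against the base URL
url_a = urljoin(base_url, rel_without_slash)
url_b = urljoin(base_url, rel_with_slash)

print(url_a)
print(url_b)
```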

### Mars Facts

Code for the facts table will be updated in a similar manner to the other two. This time, though, we'll be adding `BaseException` to our except block for error handling.

- `BaseException` is a catchall when it comes to error handling. It is the root class of Python's exception hierarchy, so `except BaseException` catches every exception, built-in or user-defined, including `KeyboardInterrupt` and `SystemExit`.
  - We're using it here because we're using Pandas' `read_html()` function to pull data, instead of scraping with BeautifulSoup and Splinter. The data is returned a little differently and can result in errors other than `AttributeError`, which is what we've been addressing so far.
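
For context on how broad this catchall is: in Python's built-in hierarchy, `BaseException` sits above `Exception`, and `KeyboardInterrupt` derives from `BaseException` directly. The sketch below checks this with `issubclass`. `ScrapeError`, `describe`, and `safe_call` are hypothetical names used only here. `except Exception` is shown as a narrower alternative that still catches ordinary errors but lets Ctrl+C through; it is a suggestion, not what the module's code uses.

```python
class ScrapeError(Exception):
    """Hypothetical user-defined exception, used only for this illustration."""


def describe(exc_type):
    """Report which catchall clauses would catch an exception of this type."""
    return {
        "except Exception": issubclass(exc_type, Exception),
        "except BaseException": issubclass(exc_type, BaseException),
    }


def safe_call(func):
    """Run func and return None if it raises any ordinary error."""
    try:
        return func()
    except Exception:
        return None


ordinary = describe(ValueError)
custom = describe(ScrapeError)
interrupt = describe(KeyboardInterrupt)
```

Here `interrupt["except Exception"]` is `False`, which is why the narrower clause does not swallow a keyboard interrupt while `except BaseException` does.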

Let's first define our function:

`def mars_facts():`

Next, we'll update our code by adding the try and except block.

```
try:
    # Use 'read_html' to scrape the facts table into a dataframe
    df = pd.read_html('https://galaxyfacts-mars.com')[0]
except BaseException:
    return None
```

As before, we've removed the print statements. Now that we know this code is working correctly, we don't need to view the DataFrame that's generated.

The code to assign columns and set the index of the DataFrame will remain the same, so the last update we need to complete for this function is to add the return statement.

   `return df.to_html()`

The full mars_facts function should look like this:

```
def mars_facts():
    # Add try/except for error handling
    try:
        # Use 'read_html' to scrape the facts table into a dataframe
        df = pd.read_html('https://galaxyfacts-mars.com')[0]

    except BaseException:
        return None

    # Assign columns and set index of dataframe
    df.columns=['Description', 'Mars', 'Earth']
    df.set_index('Description', inplace=True)

    # Convert dataframe into HTML format, add bootstrap
    return df.to_html()
```

Now you're ready to integrate Mongo.

## 10.5.3: Integrate MongoDB Into the Web App

<em>And now to add the very last bit of code before the coat of HTML paint. Robin has refactored her code so that each scraping section sits in its own function, which will make reusing the code a much simpler task. She has already built out the Flask routes as well, which are an integral part of the web app: without the routes, it simply wouldn't function.

Robin has also set up a Mongo database to hold the data that gets scraped. The next step is to integrate Mongo into the web app. She wants the script to update the data stored in Mongo each time it's run. We need to add just a little bit more code to our `scraping.py` script to establish the link between scraped data and the database.</em>

Before we make our website look pretty (you never know when NASA is looking for its new analyst), we need to connect to Mongo and establish communication between our code and the database we're using. We'll add this last bit of code to our `scraping.py` script.

At the top of our `scraping.py` script, just after importing the dependencies, we'll add one more function. This function differs from the others in that it will:

1. Initialize the browser.
2. Create a data dictionary.
3. End the WebDriver and return the scraped data.

Let's define this function as `scrape_all` and then initiate the browser.

```
def scrape_all():
    # Initiate headless driver for deployment
    executable_path = {'executable_path': ChromeDriverManager().install()}
    browser = Browser('chrome', **executable_path, headless=True)
```

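The word "browser" appears twice in the last line of this code, but the two are different things: lowercase `browser` is the variable we create, which we'll pass into our scraping functions, and capitalized `Browser` is the Splinter class that starts the automated browser.
- Coding guidelines do not require the variable name to match the `browser` parameter name in those functions' definitions, even though they do match in our current code.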

When we were testing our code in Jupyter, **headless** was set as False so we could see the scraping in action. Now that we are deploying our code into a usable web app, we don't need to watch the script work (though it's totally okay if you still want to).

**NOTE:**
When scraping, a "headless" browsing session is one in which the browser runs without the user seeing it at all. So, when `headless=True` is declared as we initiate the browser, we are telling it to run in headless mode. All of the scraping will still be accomplished, but behind the scenes.

Next, we're going to set our news title and paragraph variables (remember, `mars_news` returns two values).

`news_title, news_paragraph = mars_news(browser)`

This line of code tells Python that we'll be using our `mars_news` function to pull this data.
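
A small sketch of how the two returned values are unpacked, including the failure case where the function returns `None, None`. `mars_news_stub` and its `page_loaded` flag are hypothetical stand-ins so the example runs without a browser. The sample strings are made up.

```python
def mars_news_stub(page_loaded):
    """Hypothetical stand-in for mars_news(browser); no browser is involved."""
    if not page_loaded:
        # Mirrors the except branch: one None per expected value
        return None, None
    return "Sample headline", "Sample teaser paragraph"


# Each name on the left receives one of the two returned values
news_title, news_paragraph = mars_news_stub(page_loaded=True)
failed_title, failed_paragraph = mars_news_stub(page_loaded=False)
```

Returning `None, None` rather than a single `None` matters here: unpacking a single `None` into two names would raise a `TypeError`.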

Now that we have our browser ready for work, we need to create the data dictionary. Add the following code to our `scrape_all()` function:

```
    # Run all scraping functions and store results in dictionary
    data = {
        "news_title": news_title,
        "news_paragraph": news_paragraph,
        "featured_image": featured_image(browser),
        "facts": mars_facts(),
        "last_modified": dt.datetime.now()
    }
```

This dictionary does two things: 

1. It runs all of the functions we've created (`featured_image(browser)`, for example) and stores all of the results. When we create the HTML template, we'll create paths to the dictionary's values, which lets us present our data on our template.
   
2. We're also adding the date the code was run last by adding `"last_modified": dt.datetime.now()`. For this line to work correctly, we'll also need to add `import datetime as dt` to our imported dependencies at the beginning of our code.
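
To make the first point concrete, here is a minimal sketch showing that each function is called once, at the moment the dictionary is built, and that only the return values are stored. The `_stub` functions, `call_log`, and the placeholder values are hypothetical and stand in for the real scraping functions.

```python
import datetime as dt

call_log = []  # records which stub functions have run


def featured_image_stub():
    """Hypothetical stand-in for featured_image(browser)."""
    call_log.append("featured_image")
    return "https://example.com/image.jpg"  # placeholder URL


def mars_facts_stub():
    """Hypothetical stand-in for mars_facts()."""
    call_log.append("facts")
    return "<table></table>"


# Building the dictionary calls each function once, in order,
# and stores the return values under the given keys.
data = {
    "featured_image": featured_image_stub(),
    "facts": mars_facts_stub(),
    "last_modified": dt.datetime.now(),
}

# Reading a value later does not call the function again.
image_url = data["featured_image"]
```

After this runs, `call_log` holds one entry per function, even though `data["featured_image"]` has been read afterward.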

Just to double-check that all imports are captured, here are the dependencies we're using:

```
from splinter import Browser
from bs4 import BeautifulSoup as soup
import pandas as pd
import datetime as dt
from webdriver_manager.chrome import ChromeDriverManager
```

To see how the dictionary's values are used, consider the featured image: we collect the path to the image, store it in our database, and then place that link on our web application for everyone to see. We're basically finding the link to the image and then reusing it on our own page.

![flowchart](https://lh3.googleusercontent.com/ZQFnLWFgP2pLdt4I2-Zx_NJk1xc6Sj4Hic5ra7a_cD3A4Zzo1ceP_Tr6dbEIV0uQxjL2KnU=s170)

The flowchart shows how the code works:

1. Retrieve an image link from one website.
2. Store it with Mongo.
3. Place it on the web app.

To finish up the function, there are two more things to do. The first is to end the WebDriver using the line `browser.quit()`. 

- You can quit the automated browser by physically closing it, but there's a chance it won't fully quit in the background. 
- By using code to exit the browser, you'll know that all of the processes have been stopped.

Second, the return statement needs to be added. This is the final line that will signal that the function is complete, and it will be inserted directly beneath `browser.quit()`. We want to return the data dictionary created earlier, so our return statement will simply read `return data`.


    # Stop webdriver and return data
    browser.quit()
    return data
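
One caveat: if a scraping function raises an unexpected error, the `browser.quit()` line above is never reached. A minimal sketch of one way to guard against that uses `try`/`finally`. `FakeBrowser`, `scrape_with_cleanup`, and `broken_scraper` are hypothetical names that make the example runnable without Splinter; the module's own code does not include this guard.

```python
class FakeBrowser:
    """Hypothetical stand-in for a Splinter browser; only tracks quit()."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


def scrape_with_cleanup(browser, scrapers):
    """Run each scraper, then quit the browser even if one of them raises."""
    try:
        return {name: scrape(browser) for name, scrape in scrapers.items()}
    finally:
        browser.quit()


def broken_scraper(browser):
    raise RuntimeError("simulated scraping failure")


ok_browser = FakeBrowser()
data = scrape_with_cleanup(ok_browser, {"news_title": lambda b: "Sample headline"})

failed_browser = FakeBrowser()
try:
    scrape_with_cleanup(failed_browser, {"news_title": broken_scraper})
except RuntimeError:
    pass  # the error still propagates, but the browser was closed first
```

In both runs the browser's `closed` flag ends up `True`, which is the guarantee the `finally` clause adds.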


The last step we need to add is similar to the last code block in our `app.py` file. At the bottom of our `scraping.py` script, add the following:

```
if __name__ == "__main__":
    # If running as script, print scraped data
    print(scrape_all())
```

This last block of code tells Python to run `scrape_all()` only when `scraping.py` is executed directly as a script, and not when it is imported by another file such as `app.py`. The print statement will print out the results of our scraping to our terminal after executing the code.

After fine-tuning our `scraping.py` script, the complete code should look like this:

```
# Import Splinter, BeautifulSoup, and Pandas
from splinter import Browser
from bs4 import BeautifulSoup as soup
import pandas as pd
import datetime as dt
from webdriver_manager.chrome import ChromeDriverManager


def scrape_all():
    # Initiate headless driver for deployment
    executable_path = {'executable_path': ChromeDriverManager().install()}
    browser = Browser('chrome', **executable_path, headless=True)

    news_title, news_paragraph = mars_news(browser)

    # Run all scraping functions and store results in a dictionary
    data = {
        "news_title": news_title,
        "news_paragraph": news_paragraph,
        "featured_image": featured_image(browser),
        "facts": mars_facts(),
        "last_modified": dt.datetime.now()
    }

    # Stop webdriver and return data
    browser.quit()
    return data


def mars_news(browser):

    # Scrape Mars News
    # Visit the mars nasa news site
    url = 'https://data-class-mars.s3.amazonaws.com/Mars/index.html'
    browser.visit(url)

    # Optional delay for loading the page
    browser.is_element_present_by_css('div.list_text', wait_time=1)

    # Convert the browser html to a soup object
    html = browser.html
    news_soup = soup(html, 'html.parser')

    # Add try/except for error handling
    try:
        slide_elem = news_soup.select_one('div.list_text')
        # Use the parent element to find the title 'div' and save its text as 'news_title'
        news_title = slide_elem.find('div', class_='content_title').get_text()
        # Use the parent element to find the paragraph text
        news_p = slide_elem.find('div', class_='article_teaser_body').get_text()

    except AttributeError:
        return None, None

    return news_title, news_p


def featured_image(browser):
    # Visit URL
    url = 'https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/index.html'
    browser.visit(url)

    # Find and click the full image button
    full_image_elem = browser.find_by_tag('button')[1]
    full_image_elem.click()

    # Parse the resulting html with soup
    html = browser.html
    img_soup = soup(html, 'html.parser')

    # Add try/except for error handling
    try:
        # Find the relative image url
        img_url_rel = img_soup.find('img', class_='fancybox-image').get('src')

    except AttributeError:
        return None

    # Use the base url to create an absolute url
    img_url = f'https://data-class-jpl-space.s3.amazonaws.com/JPL_Space/{img_url_rel}'

    return img_url

def mars_facts():
    # Add try/except for error handling
    try:
        # Use 'read_html' to scrape the facts table into a dataframe
        df = pd.read_html('https://data-class-mars-facts.s3.amazonaws.com/Mars_Facts/index.html')[0]

    except BaseException:
        return None

    # Assign columns and set index of dataframe
    df.columns=['Description', 'Mars', 'Earth']
    df.set_index('Description', inplace=True)

    # Convert dataframe into HTML format, add bootstrap
    return df.to_html(classes="table table-striped")

if __name__ == "__main__":

    # If running as script, print scraped data
    print(scrape_all())

```

It's also a good idea at this point to run your code and check it for errors. Even though the Jupyter Notebook cells have already been tested and bugs were addressed, because we made some slight updates and fine-tuned the converted Python code, it's possible a new bug could have popped up.
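
Because each scraping function returns `None` when its `try` block fails, one quick check is to look for `None` values in the returned dictionary. The helper below is a hypothetical sketch, not part of the module's script, and the sample dictionary is made up.

```python
def missing_fields(data):
    """Return the keys whose scrape produced None."""
    return [key for key, value in data.items() if value is None]


# Hypothetical result in which the featured image scrape failed
sample = {
    "news_title": "Sample headline",
    "news_paragraph": "Sample teaser paragraph",
    "featured_image": None,
    "facts": "<table></table>",
}
failed = missing_fields(sample)
```

Running the same helper on the output of `scrape_all()` would point to the scraping function that needs a closer look.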

**NOTE:**
In your terminal, make sure you're in the correct directory with the `ls` command (if you don't see the files you've been working on, then navigate to the folder you're storing them in). Make sure you have the correct environment activated, then type `python app.py` into your terminal.

Next, your terminal should show a message that the Flask application is running on localhost. Enter that address (usually http://127.0.0.1:5000/) into the address bar of your web browser.

If you don't see that message on your terminal, you likely have a bug in your script. Thankfully, error messages will help you pinpoint where and why an error is occurring.


# Make it Pretty

# Show it Off

# Application
