# Web Scraping Homework - Mission to Mars

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

In [2]:
#import dependencies
from splinter import Browser
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests
import pymongo
from flask import Flask, render_template, redirect
from flask_pymongo import PyMongo
import time

In [3]:
# Setup config variables to enable Splinter interaction with browser
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)



Current google-chrome version is 93.0.4577
Get LATEST driver version for 93.0.4577
Driver [C:\Users\jchan\.wdm\drivers\chromedriver\win32\93.0.4577.63\chromedriver.exe] found in cache


<strong>Hint:</strong> Use Splinter to navigate the sites when needed and BeautifulSoup to help find and parse out the necessary data.

In [4]:
# Create dictionary to store news
scraped_data = {}

## NASA Mars News

Scrape the [NASA Mars News Site](https://mars.nasa.gov/news/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.

In [5]:
# Visit the Mars news site with Splinter
nasa_url = "https://redplanetscience.com"
browser.visit(nasa_url)
# Wait 5 seconds so the page has time to load before its HTML is read
time.sleep(5)
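
A fixed five-second pause works, but it always waits the full time and can still be too short on a slow connection. As a rough alternative, the sketch below polls a condition until it becomes true or a timeout expires. The helper name `wait_until` and its parameters are hypothetical, not part of Splinter. Splinter also offers its own waits, such as `browser.is_element_present_by_css(selector, wait_time=...)`, which may be the simpler choice in the notebook itself.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.25):
    """Poll predicate() until it is truthy or `timeout` seconds have passed.

    Returns True if the condition was met, False if the timeout expired.
    (Hypothetical helper for illustration only.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Possible use with the notebook's browser object (assumed to be open):
# wait_until(lambda: "content_title" in browser.html, timeout=5)
```

The loop returns as soon as the page is ready, so the common case is faster than a fixed sleep, while the timeout still bounds the worst case.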

In [6]:
html = browser.html
# Create BeautifulSoup object; parse with 'html.parser'
soup = bs(html, 'html.parser')

In [7]:
# Get news title and news text by searching for appropriate div class 
news_title = soup.find('div', class_='content_title').text
news_p = soup.find('div', class_='article_teaser_body').text
print(news_title)
print(news_p)

NASA Wins 4 Webbys, 4 People's Voice Awards
Winners include the JPL-managed "Send Your Name to Mars" campaign, NASA's Global Climate Change website and Solar System Interactive.
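
`soup.find(...)` returns `None` when nothing matches, so calling `.text` on the result raises an `AttributeError` if the page layout changes or has not finished loading. The sketch below shows one way to guard against that. It parses a small made-up HTML snippet rather than the live site, and the helper name `first_text` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the class names used in the cell above.
SAMPLE_HTML = """
<div class="list_text">
  <div class="content_title">Sample headline</div>
  <div class="article_teaser_body">Sample teaser text.</div>
</div>
"""


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    node = soup.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title = first_text(soup, "div", "content_title")
teaser = first_text(soup, "div", "article_teaser_body")
missing = first_text(soup, "div", "no_such_class")  # None instead of an exception
```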


In [8]:
# Create dictionary to store data and save entries
scrape_nasa_news={"Title":news_title, "Paragraph":news_p}
scrape_nasa_news

{'Title': "NASA Wins 4 Webbys, 4 People's Voice Awards",
 'Paragraph': 'Winners include the JPL-managed "Send Your Name to Mars" campaign, NASA\'s Global Climate Change website and Solar System Interactive.'}

In [9]:
# Save scraped data as a new entry in the dictionary
scraped_data["Title"] = news_title
scraped_data["Paragraph"] = news_p

## JPL Mars Space Images - Featured Image

- Visit the URL for the JPL Featured Space Image [here](https://spaceimages-mars.com).
- Use Splinter to navigate the site, find the image URL for the current Featured Mars Image, and assign the URL string to a variable called featured_image_url.
- Find the image URL of the full-size .jpg image. Make sure to save a complete URL string for this image.
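
One way to build the complete URL string is `urllib.parse.urljoin` from the standard library, which handles leading and trailing slashes and leaves already-absolute URLs alone. This is a minimal sketch: the relative path is the one shown in this notebook's output, and the `example.com` address is only a placeholder.

```python
from urllib.parse import urljoin

base_url = "https://spaceimages-mars.com"
relative_src = "image/featured/mars1.jpg"  # value of the img tag's src attribute

# Joins base and relative path with exactly one slash between them.
featured_image_url = urljoin(base_url, relative_src)

# A src that is already absolute is returned unchanged (placeholder URL).
already_absolute = urljoin(base_url, "https://example.com/other.jpg")
```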

In [10]:
mars_url = "https://spaceimages-mars.com"
browser.visit(mars_url)
image_html = browser.html

# Create BeautifulSoup object; parse with 'html.parser'
soup = bs(image_html, "html.parser")

In [11]:
featured_image = soup.find_all("img", class_="headerimage fade-in")[0]["src"]
featured_image_url = mars_url + "/" + featured_image
print(featured_image_url)

https://spaceimages-mars.com/image/featured/mars1.jpg


In [12]:
# Create dictionary to store data and save entries
jpl = {"img_url":featured_image_url}
jpl

{'img_url': 'https://spaceimages-mars.com/image/featured/mars1.jpg'}

In [13]:
# Save scraped data as a new entry in the dictionary
scraped_data["img_url"] = featured_image_url

In [14]:
browser.quit()

## Mars Facts

Visit the Mars Facts webpage [here](https://galaxyfacts-mars.com/) and use Pandas to scrape the table containing facts about the planet.
Use Pandas to convert the data to an HTML table string.

In [15]:
facts_url = "https://galaxyfacts-mars.com/"
facts_data = pd.read_html(facts_url)[0]
facts_data

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Mars - Earth Comparison | Mars | Earth |
| 1 | Diameter: | 6,779 km | 12,742 km |
| 2 | Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| 3 | Moons: | 2 | 1 |
| 4 | Distance from Sun: | 227,943,824 km | 149,598,262 km |
| 5 | Length of Year: | 687 Earth days | 365.24 days |
| 6 | Temperature: | -87 to -5 °C | -88 to 58°C |


In [16]:
facts_data.columns=["Description", "Mars", "Earth"]
facts_data.set_index("Description", inplace=True)
facts_data

| Description | Mars | Earth |
|---|---|---|
| Mars - Earth Comparison | Mars | Earth |
| Diameter: | 6,779 km | 12,742 km |
| Mass: | 6.39 × 10^23 kg | 5.97 × 10^24 kg |
| Moons: | 2 | 1 |
| Distance from Sun: | 227,943,824 km | 149,598,262 km |
| Length of Year: | 687 Earth days | 365.24 days |
| Temperature: | -87 to -5 °C | -88 to 58°C |


In [17]:
# Note: index=False omits the Description index, so only the Mars and Earth columns are rendered
facts_table = facts_data.to_html(index=False)
facts_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>\n</table>'

In [18]:
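# Check out table
# Note: str.replace returns a new string; the result is not assigned here,
# so facts_table still contains its newlines when printed.
facts_table.replace("\n", "")
print(facts_table)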

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>Mars</th>
      <th>Earth</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Mars</td>
      <td>Earth</td>
    </tr>
    <tr>
      <td>6,779 km</td>
      <td>12,742 km</td>
    </tr>
    <tr>
      <td>6.39 × 10^23 kg</td>
      <td>5.97 × 10^24 kg</td>
    </tr>
    <tr>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <td>227,943,824 km</td>
      <td>149,598,262 km</td>
    </tr>
    <tr>
      <td>687 Earth days</td>
      <td>365.24 days</td>
    </tr>
    <tr>
      <td>-87 to -5 °C</td>
      <td>-88 to 58°C</td>
    </tr>
  </tbody>
</table>
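
The printed table has no row labels because `to_html` was called with `index=False` after `Description` had been made the index. The sketch below uses a small hand-built DataFrame (two rows copied from the facts above, not scraped) to show the difference between the two settings.

```python
import pandas as pd

# Tiny stand-in for facts_data, with Description as the index.
facts = pd.DataFrame(
    {"Mars": ["6,779 km", "2"], "Earth": ["12,742 km", "1"]},
    index=pd.Index(["Diameter:", "Moons:"], name="Description"),
)

with_index = facts.to_html()                # row labels rendered as <th> cells
without_index = facts.to_html(index=False)  # row labels dropped from the output
```

If the descriptions should appear on the final page, keeping the default `index=True` (or not setting the index at all) would be one way to retain them.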


In [19]:
# Create dictionary to store data and save entries
# Store the HTML string (facts_table), not the DataFrame itself
mars_facts = {"htmlTable": facts_table}

## Mars Hemispheres:

Visit the USGS Astrogeology site [here](https://marshemispheres.com/) to obtain high resolution images for each of Mars's hemispheres.

You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image.

Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys img_url and title.

Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

<strong>Example:</strong> hemisphere_image_urls = [ {"title": "Valles Marineris Hemisphere", "img_url": "..."}, {"title": "Cerberus Hemisphere", "img_url": "..."}, {"title": "Schiaparelli Hemisphere", "img_url": "..."}, {"title": "Syrtis Major Hemisphere", "img_url": "..."}, ]
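
The list-of-dictionaries shape described above can be built by pairing two parallel lists with `zip`. This is a minimal sketch with two hemispheres. The image URLs are placeholders, not the real image locations.

```python
titles = ["Cerberus Hemisphere", "Schiaparelli Hemisphere"]
img_urls = [
    "https://example.com/cerberus.jpg",      # placeholder URL
    "https://example.com/schiaparelli.jpg",  # placeholder URL
]

# One dictionary per hemisphere, using the keys required by the assignment.
hemisphere_image_urls = [
    {"title": title, "img_url": url} for title, url in zip(titles, img_urls)
]
```

Note that `zip` stops at the shorter list, so it is worth checking that both lists have the same length before pairing them.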

In [20]:
# Visit hemisphere url through splinter module
#Mars Hemispheres
executable_path = {'executable_path': ChromeDriverManager().install()}
browser = Browser('chrome', **executable_path, headless=False)

hem_url = 'https://marshemispheres.com/'
browser.visit(hem_url)



Current google-chrome version is 93.0.4577
Get LATEST driver version for 93.0.4577
Driver [C:\Users\jchan\.wdm\drivers\chromedriver\win32\93.0.4577.63\chromedriver.exe] found in cache


In [21]:
html = browser.html
# Create BeautifulSoup object; parse with 'html.parser'
soup = bs(html, 'html.parser')

In [22]:
items = soup.find_all('div', class_='item')

In [23]:
hemi_urls = []
hemi_title = []

# Collect the detail-page URL and the title of each hemisphere
# into two parallel lists.
for item in items:
    hemi_urls.append(hem_url + item.find('a')['href'])
    hemi_title.append(item.find('h3').text.strip())

print(hemi_urls)
hemi_title

['https://marshemispheres.com/cerberus.html', 'https://marshemispheres.com/schiaparelli.html', 'https://marshemispheres.com/syrtis.html', 'https://marshemispheres.com/valles.html']


['Cerberus Hemisphere Enhanced',
 'Schiaparelli Hemisphere Enhanced',
 'Syrtis Major Hemisphere Enhanced',
 'Valles Marineris Hemisphere Enhanced']

In [24]:
hemi_img_urls = []

for url in hemi_urls:
    browser.visit(url)
    html = browser.html
    soup = bs(html, 'html.parser')

    # Find image urls and append to list
    source_url = hem_url + soup.find('img', class_='wide-image')['src']
    hemi_img_urls.append(source_url)

hemi_img_urls

['https://marshemispheres.com/images/f5e372a36edfa389625da6d0cc25d905_cerberus_enhanced.tif_full.jpg',
 'https://marshemispheres.com/images/3778f7b43bbbc89d6e3cfabb3613ba93_schiaparelli_enhanced.tif_full.jpg',
 'https://marshemispheres.com/images/555e6403a6ddd7ba16ddb0e471cadcf7_syrtis_major_enhanced.tif_full.jpg',
 'https://marshemispheres.com/images/b3c7c6c9138f57b4756be9b9c43e3a48_valles_marineris_enhanced.tif_full.jpg']

In [25]:
# Create one dictionary per hemisphere and collect them in a list
usgs = [{'title': title, 'img_url': img_url}
        for title, img_url in zip(hemi_title, hemi_img_urls)]

usgs

[{'title': 'Cerberus Hemisphere Enhanced',
  'img_url': 'https://marshemispheres.com/images/f5e372a36edfa389625da6d0cc25d905_cerberus_enhanced.tif_full.jpg'},
 {'title': 'Schiaparelli Hemisphere Enhanced',
  'img_url': 'https://marshemispheres.com/images/3778f7b43bbbc89d6e3cfabb3613ba93_schiaparelli_enhanced.tif_full.jpg'},
 {'title': 'Syrtis Major Hemisphere Enhanced',
  'img_url': 'https://marshemispheres.com/images/555e6403a6ddd7ba16ddb0e471cadcf7_syrtis_major_enhanced.tif_full.jpg'},
 {'title': 'Valles Marineris Hemisphere Enhanced',
  'img_url': 'https://marshemispheres.com/images/b3c7c6c9138f57b4756be9b9c43e3a48_valles_marineris_enhanced.tif_full.jpg'}]

In [26]:
browser.quit()

In [27]:
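# Define mars dictionary; store the list of hemisphere dictionaries (usgs)
# rather than the site URL so that all scraped data is included
mars_dict = {"news_title": news_title, "news_p": news_p, "featured_image_url": featured_image_url,
             "facts_table": facts_table, "hemisphere_image_urls": usgs}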
mars_dict

{'news_title': "NASA Wins 4 Webbys, 4 People's Voice Awards",
 'news_p': 'Winners include the JPL-managed "Send Your Name to Mars" campaign, NASA\'s Global Climate Change website and Solar System Interactive.',
 'featured_image_url': 'https://spaceimages-mars.com/image/featured/mars1.jpg',
 'facts_table': '<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th>Mars</th>\n      <th>Earth</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <td>Mars</td>\n      <td>Earth</td>\n    </tr>\n    <tr>\n      <td>6,779 km</td>\n      <td>12,742 km</td>\n    </tr>\n    <tr>\n      <td>6.39 × 10^23 kg</td>\n      <td>5.97 × 10^24 kg</td>\n    </tr>\n    <tr>\n      <td>2</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>227,943,824 km</td>\n      <td>149,598,262 km</td>\n    </tr>\n    <tr>\n      <td>687 Earth days</td>\n      <td>365.24 days</td>\n    </tr>\n    <tr>\n      <td>-87 to -5 °C</td>\n      <td>-88 to 58°C</td>\n    </tr>\n  </tbody>

## MongoDB and Flask Application

- Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.
- Start by converting your Jupyter notebook into a Python script called scrape_mars.py with a function called scrape that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data.
- Next, create a route called /scrape that will import your scrape_mars.py script and call your scrape function.
- Store the return value in Mongo as a Python dictionary.
- Create a root route / that will query your Mongo database and pass the mars data into an HTML template to display the data.
- Create a template HTML file called index.html that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.
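
The two routes described above could be sketched as follows. This is only an outline under several assumptions: `create_app` and `INDEX_TEMPLATE` are hypothetical names, the template is inlined with `render_template_string` so the example does not need a `templates/index.html` file, and the Mongo collection and the `scrape` function are passed in as arguments rather than imported.

```python
from flask import Flask, redirect, render_template_string

# Inline stand-in for templates/index.html (assumption, for self-containment).
INDEX_TEMPLATE = """
<h1>Mission to Mars</h1>
<h2>{{ mars.news_title }}</h2>
<p>{{ mars.news_p }}</p>
<img src="{{ mars.featured_image_url }}">
"""


def create_app(collection, scrape):
    """Build the app around a Mongo-style collection and a scrape callable."""
    # A fixed import name keeps the sketch independent of how it is run;
    # a real app would typically pass __name__ here.
    app = Flask("mars_app")

    @app.route("/")
    def index():
        # Before the first scrape there is no document, so fall back to {}.
        mars = collection.find_one() or {}
        return render_template_string(INDEX_TEMPLATE, mars=mars)

    @app.route("/scrape")
    def run_scrape():
        data = scrape()
        # Upsert keeps a single document that each new scrape overwrites.
        collection.update_one({}, {"$set": data}, upsert=True)
        return redirect("/")

    return app


# Possible wiring, assuming a local MongoDB and a scrape_mars.py module:
# from pymongo import MongoClient
# import scrape_mars
# coll = MongoClient("mongodb://localhost:27017").mars_mission_scraping.mars_data
# create_app(coll, scrape_mars.scrape).run(debug=True)
```

Using `update_one` with `upsert=True` avoids dropping the collection before each insert. The HTML facts table would need the `| safe` filter in the template, because Jinja escapes HTML strings by default.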

In [31]:
# Use pymongo to set up the Mongo connection
from pymongo import MongoClient
conn = "mongodb://localhost:27017/mars_mission_scraping"
client = MongoClient(conn)

In [32]:
# Get collection and drop existing data for this application
db = client.mars_mission_scraping
db.mars_data.drop()

In [33]:
db.mars_data.insert_one(scraped_data)

In [34]:
# find() returns a lazy cursor; iterate over it or wrap it in list() to see the documents
query_result = db.mars_data.find()
query_result

<pymongo.cursor.Cursor at 0x1be601bb7f0>