# Question for the TA
Regarding the **Mars Hemispheres** section below: To obtain each image URL, I went to the applicable hemisphere web page by using Splinter to click the image in the parent web page (https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars).  To get the next image, I had to reload the parent web page and then click the next image.  It is sort of a ping-pong approach.  Is there a better way, such as going directly from one hemisphere page to the next without returning to the parent?  How?

In [1]:
from splinter import Browser
from bs4 import BeautifulSoup
import requests
import pandas as pd

nasa_mars_news_url = 'https://mars.nasa.gov/news'
jpl_mars_site_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
twitter_mars_weather_url = 'https://twitter.com/marswxreport?lang=en'
mars_facts_site_url = 'https://space-facts.com/mars/'
usgs_astrogeology_site_url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'

## Scrape dictionary
The results of scraping each web site are stored in a single dictionary.

In [2]:
scrape_data = {}

### Caveat - the above web pages must be downloaded with Splinter, not Requests
It appears that Requests does not collect the correct web page content. This can be seen by comparing the content that Requests returns with the content displayed by the browser inspector.  The content obtained with Splinter does agree with the browser inspector.  The opposite is true for the Twitter site, where Requests retrieved the complete content and Splinter did not.

#### Caveat to the caveat
**Sometimes code that downloads a web page must be repeated to ensure that the proper content is received.**
A snippet of the content is printed for verification.

In [3]:
# initiate splinter
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=False)

### Browser display
The Chrome browser will render the target web page when Splinter obtains it

## NASA Mars News
Inspecting https://mars.nasa.gov/news (with the browser inspector), the structures that contain news titles and paragraphs are in a list item having 'slide' class. 

In [4]:
# render the web page content as a BeautifulSoup object and archive as a txt file, which 
# can then be inspected with an editor to verify correctness (i.e. matches the content shown
# in the browser inspector)
browser.visit(nasa_mars_news_url)
nasa_mars_news_soup = BeautifulSoup(browser.html, 'html.parser')
with open("html_txt/nasa_mars_news.txt", "w") as file:
    file.write(nasa_mars_news_soup.prettify())

# it seems to need a repeat to ensure the entire web page is loaded
browser.visit(nasa_mars_news_url)
nasa_mars_news_soup = BeautifulSoup(browser.html, 'html.parser')
with open("html_txt/nasa_mars_news.txt", "w") as file:
    file.write(nasa_mars_news_soup.prettify())
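
Rather than repeating the cell by hand, the reload could be wrapped in a small retry loop that stops as soon as the expected content shows up. The following is only a sketch under stated assumptions: `fetch_until_ready`, its parameters, and the readiness check in the usage comment are hypothetical names chosen for illustration and are not part of Splinter.

```python
import time

def fetch_until_ready(fetch, is_ready, attempts=3, delay=1.0):
    """Call fetch() until is_ready(result) is true or the attempts run out.

    fetch    -- zero-argument callable that returns the page HTML as a string
    is_ready -- callable that returns True when the HTML looks complete
    Returns the last HTML fetched, whether or not it passed the check.
    """
    html = ""
    for attempt in range(attempts):
        html = fetch()
        if is_ready(html):
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give the page time to finish loading
    return html

# Hypothetical use in this notebook (assumes the loaded page contains class="slide"):
# def load_news():
#     browser.visit(nasa_mars_news_url)
#     return browser.html
# news_html = fetch_until_ready(load_news, lambda h: 'class="slide"' in h)
```

Because the helper returns the last HTML even when the check never passes, the caller should still inspect the result, as the next cell does.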

#### Inspect a single slide object
Identify the applicable elements in a slide object that may be used to extract the news title and paragraph.  
**Note:** This can also be used to verify the correct html was downloaded, if this notebook is run again.

In [5]:
# parse the soup object for the first slide class content and further examine
nasa_mars_news_element = nasa_mars_news_soup.find('li', class_='slide')
print(nasa_mars_news_element.prettify())

<li class="slide">
 <div class="image_and_description_container">
  <a href="/news/8613/a-year-of-surprising-science-from-nasas-insight-mars-mission/" target="_self">
   <div class="rollover_description">
    <div class="rollover_description_inner">
     A batch of new papers summarizes the lander's findings above and below the surface of the Red Planet.
    </div>
    <div class="overlay_arrow">
     <img alt="More" src="/assets/overlay-arrow.png"/>
    </div>
   </div>
   <div class="list_image">
    <img alt="In this artist's concept of NASA's InSight lander on Mars, layers of the planet's subsurface can be seen below and dust devils can be seen in the background." src="/system/news_items/list_view_images/8613_InSight-Nature-papers-320x240.jpg"/>
   </div>
   <div class="bottom_gradient">
    <div>
     <h3>
      A Year of Surprising Science From NASA's InSight Mars Mission
     </h3>
    </div>
   </div>
  </a>
  <div class="list_text">
   <div class="list_date">
    February 24, 

In [6]:
# verify extraction of the title and paragraph
print(nasa_mars_news_element.find('h3').text)                            # news_title
print(nasa_mars_news_element.find(class_="article_teaser_body").text)    # news_p

A Year of Surprising Science From NASA's InSight Mars Mission
A batch of new papers summarizes the lander's findings above and below the surface of the Red Planet.


In [7]:
# add to the scrape dictionary
scrape_data["nasa_mars_news_title"] = nasa_mars_news_element.find('h3').text
scrape_data["nasa_mars_news_p"] = nasa_mars_news_element.find(class_="article_teaser_body").text

## JPL Mars Space Images 
Inspecting https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars (with the browser inspector), the element that contains the featured image has class="carousel_container".

In [8]:
# render the web page content as a BeautifulSoup object and archive as a txt file, which 
# can then be inspected with an editor to verify correctness (i.e. matches the content shown
# in the browser inspector)
browser.visit(jpl_mars_site_url)
jpl_mars_site_soup = BeautifulSoup(browser.html, 'html.parser')
with open("html_txt/jpl_mars_site.txt", "w") as file:
    file.write(jpl_mars_site_soup.prettify())
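
The render-and-archive step is repeated for every site, and `open()` fails if the `html_txt` folder does not exist yet. A small helper could cover both points. This is a sketch only: `archive_html` and its parameters are hypothetical names, not something defined elsewhere in this notebook.

```python
from pathlib import Path

def archive_html(name, html, directory="html_txt"):
    """Write html to <directory>/<name>.txt and return the path written."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)   # create the folder on first use
    path = folder / f"{name}.txt"
    path.write_text(html, encoding="utf-8")     # explicit encoding for non-ASCII pages
    return path

# Hypothetical use in this notebook:
# archive_html("jpl_mars_site", jpl_mars_site_soup.prettify())
```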

#### Inspect the carousel_container object
Identify the applicable element to extract the image url.  
**Note:** This can also be used to verify the correct html was downloaded, if this notebook is run again.

In [9]:
# parse the soup object for the carousel_container class content and further examine
jpl_mars_site_element = jpl_mars_site_soup.find(class_='carousel_container')
print(jpl_mars_site_element.prettify())

<div class="carousel_container">
 <div class="carousel_items">
  <article alt="NEOWISE: Back to Hunt More Asteroids (Artist Concept)" class="carousel_item" style="background-image: url('/spaceimages/images/wallpaper/PIA17254-1920x1200.jpg');">
   <div class="default floating_text_area ms-layer">
    <h2 class="category_title">
    </h2>
    <h2 class="brand_title">
     FEATURED IMAGE
    </h2>
    <h1 class="media_feature_title">
     NEOWISE: Back to Hunt More Asteroids (Artist Concept)
    </h1>
    <div class="description">
    </div>
    <footer>
     <a class="button fancybox" data-description="This artist's concept shows the NASA's WISE spacecraft, in its orbit around Earth. In September of 2013, engineers will attempt to bring the mission out of hibernation to hunt for more asteroids and comets in a project called NEOWISE." data-fancybox-group="images" data-fancybox-href="/spaceimages/images/mediumsize/PIA17254_ip.jpg" data-link="/spaceimages/details.php?id=PIA17254" data-title

In [10]:
# extract the local URL link 
jpl_mars_site_local_link = jpl_mars_site_element.find("a")['data-fancybox-href']

# append the host URL link
jpl_mars_site_link = 'https://www.jpl.nasa.gov' + jpl_mars_site_local_link
print(jpl_mars_site_link)

https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA17254_ip.jpg
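
As an alternative to concatenating the host string by hand, the standard library's `urllib.parse.urljoin` can resolve the local link against the page URL. The sketch below reuses the two values shown above; `local_link` and `full_link` are illustrative names.

```python
from urllib.parse import urljoin

jpl_mars_site_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
local_link = '/spaceimages/images/mediumsize/PIA17254_ip.jpg'  # value extracted above

# urljoin keeps the scheme and host of the page URL and swaps in the new path
full_link = urljoin(jpl_mars_site_url, local_link)
print(full_link)  # prints: https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA17254_ip.jpg
```

One benefit of this choice is that a link that is already absolute passes through `urljoin` unchanged, so the same line works in either case.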


In [11]:
# add to the scrape dictionary
scrape_data["jpl_mars_site_link"] = jpl_mars_site_link

## Mars Weather
Inspecting https://twitter.com/marswxreport?lang=en (with the browser inspector), the element that contains the latest Mars weather is a \<p> element with class="TweetTextSize"

In [12]:
# retrieve page with the requests module 
# instead of Splinter, which did not get the complete content
twitter_mars_weather_response = requests.get(twitter_mars_weather_url)

# render the web page content as a BeautifulSoup object and archive as a txt file, which 
# can then be inspected with an editor to verify correctness (i.e. matches the content shown
# in the browser inspector)
twitter_mars_weather_soup = BeautifulSoup(twitter_mars_weather_response.text, 'html.parser')
with open("html_txt/twitter_mars_weather_site.txt", "w") as file:
    file.write(twitter_mars_weather_soup.prettify())

#### Inspect \<p> object
Identify the applicable element to extract the weather text.  
**Note:** This can also be used to verify the correct html was downloaded, if this notebook is run again.

In [13]:
# parse the soup object for the <p> content and further examine
twitter_mars_weather_element = twitter_mars_weather_soup.find('p', class_="TweetTextSize")
print(twitter_mars_weather_element.prettify())

<p class="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text" data-aria-label-part="0" lang="en">
 InSight sol 444 (2020-02-25) low -93.8ºC (-136.8ºF) high -12.0ºC (10.5ºF)
winds from the SSW at 6.2 m/s (13.9 mph) gusting to 21.2 m/s (47.4 mph)
pressure at 6.30 hPa
 <a class="twitter-timeline-link u-hidden" data-pre-embedded="true" dir="ltr" href="https://t.co/UeOmoDjhf3">
  pic.twitter.com/UeOmoDjhf3
 </a>
</p>



#### Observation
The first \<p> element with class="TweetTextSize" contains the latest weather.

In [14]:
# extract the weather text
twitter_mars_weather = twitter_mars_weather_element.text
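
The \<p> element shown above also contains an \<a> child whose text (`pic.twitter.com/...`) becomes part of `.text`, and the report spans several lines. If a clean one-line string is wanted, something like the following could be applied. It is a sketch: `clean_weather_text` is a hypothetical helper, and it assumes the picture link always starts with `pic.twitter.com/`.

```python
import re

def clean_weather_text(raw_text):
    """Drop the pic.twitter.com link and collapse all whitespace to single spaces."""
    without_link = re.sub(r"pic\.twitter\.com/\S+", "", raw_text)
    return " ".join(without_link.split())

sample = "pressure at 6.30 hPapic.twitter.com/UeOmoDjhf3"
print(clean_weather_text(sample))  # prints: pressure at 6.30 hPa
```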

In [15]:
# add to the scrape dictionary
scrape_data["twitter_mars_weather"] = twitter_mars_weather

## Mars Facts 
Inspecting https://space-facts.com/mars/ (with the browser inspector), the facts are laid out in table rows (\<tr> tags, each with a class attribute), which supports scraping with Pandas.

In [16]:
# render the web page content as a BeautifulSoup object and archive as a txt file, which 
# can then be inspected with an editor to verify correctness (i.e. matches the content shown
# in the browser inspector)
browser.visit(mars_facts_site_url)
mars_facts_site_soup = BeautifulSoup(browser.html, 'html.parser')
with open("html_txt/mars_facts_site.txt", "w") as file:
    file.write(mars_facts_site_soup.prettify())

In [17]:
mars_facts_site_tables = pd.read_html(mars_facts_site_url)
mars_facts_site_tables

[                      0                              1
 0  Equatorial Diameter:                       6,792 km
 1       Polar Diameter:                       6,752 km
 2                 Mass:  6.39 × 10^23 kg (0.11 Earths)
 3                Moons:            2 (Phobos & Deimos)
 4       Orbit Distance:       227,943,824 km (1.38 AU)
 5         Orbit Period:           687 days (1.9 years)
 6  Surface Temperature:                   -87 to -5 °C
 7         First Record:              2nd millennium BC
 8          Recorded By:           Egyptian astronomers,
   Mars - Earth Comparison             Mars            Earth
 0               Diameter:         6,779 km        12,742 km
 1                   Mass:  6.39 × 10^23 kg  5.97 × 10^24 kg
 2                  Moons:                2                1
 3      Distance from Sun:   227,943,824 km   149,598,262 km
 4         Length of Year:   687 Earth days      365.24 days
 5            Temperature:    -153 to 20 °C      -88 to 58°C,
           

In [18]:
# the output above shows two tables (Mars facts and the Mars - Earth comparison);
# confirm that read_html really returned a list of tables
type(mars_facts_site_tables)

list

In [19]:
# extract the first table (Mars facts) from the list of tables and name the columns
mars_facts_site_table_df = mars_facts_site_tables[0]
mars_facts_site_table_df.columns = ['Item', 'Value']
mars_facts_site_table_df.head()

                   Item                          Value
0  Equatorial Diameter:                       6,792 km
1       Polar Diameter:                       6,752 km
2                 Mass:  6.39 × 10^23 kg (0.11 Earths)
3                Moons:            2 (Phobos & Deimos)
4       Orbit Distance:       227,943,824 km (1.38 AU)


In [20]:
# convert the data frame to an html string
mars_facts_site_html_table = mars_facts_site_table_df.to_html()
mars_facts_site_html_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Item</th>\n      <th>Value</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <th>6</th>\n      <td>Surface Temperature:</td>\n      <td>-87 to -5 °C</td>\n    </tr>\n    <tr>\n      <th>7</th>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>

In [21]:
# add to the scrape dictionary
scrape_data["mars_facts_html"] = mars_facts_site_html_table

## Mars Hemispheres
Inspecting https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars (with the browser inspector), each hemisphere is presented as a "product". The hemisphere title is the text of an \<h3> tag.  The link to the hemisphere page is in an element having the "thumb" class. 

Inspecting the linked page, the link for the high resolution image is in the \<a> element having text = "Original".

With this click-based approach, the parent page must be revisited after each linked page is consumed, so that the next hemisphere's link can be clicked.

In [22]:
# render the web page content as a BeautifulSoup object and archive as a txt file, which 
# can then be inspected with an editor to verify correctness (i.e. matches the content shown
# in the browser inspector)
browser.visit(usgs_astrogeology_site_url)
usgs_astrogeology_site_soup = BeautifulSoup(browser.html, 'html.parser')
with open("html_txt/usgs_astrogeology_site.txt", "w") as file:
    file.write(usgs_astrogeology_site_soup.prettify())

In [23]:
# build list of products
# the webpage identifies each link as part of a "product" 
astrogeology_products_list = []
astrogeology_products = usgs_astrogeology_site_soup.find_all('h3')
for product in astrogeology_products:
    title = product.text
    astrogeology_products_list.append(title)
print(astrogeology_products_list)

['Cerberus Hemisphere Enhanced', 'Schiaparelli Hemisphere Enhanced', 'Syrtis Major Hemisphere Enhanced', 'Valles Marineris Hemisphere Enhanced']


In [24]:
# extract the images from the child sites
hemisphere_image_urls = []

# click the link of each product (on the parent page) and get its image
for image_idx in range(len(astrogeology_products_list)):
    # revisit the parent page (return from the image page) and rebuild the list of buttons to be clicked
    # the list is rebuilt on every pass because element references from the earlier page load
    # go stale once the browser navigates away
    # the term "button" refers to the element that contains the link to the page with the high
    # resolution image 
    browser.visit(usgs_astrogeology_site_url)
    usgs_astrogeology_site_soup = BeautifulSoup(browser.html, 'html.parser')
    buttons = browser.find_by_css('.thumb')
    
    # click the applicable button, per the loop count, which is image_idx
    print(f"Button #{image_idx} = {buttons[image_idx]}")
    buttons[image_idx].click()
    
    # obtain linked webpage content and save to a text file
    soup = BeautifulSoup(browser.html, 'html.parser')
    title = astrogeology_products_list[image_idx]
    with open("html_txt/" + title + ".txt", "w") as file:
        file.write(soup.prettify())
    
    # obtain the image URL from the product's webpage, which is the href of the link having text = 'Original'
    img_url = None  # reset so a missing link does not reuse the previous hemisphere's URL
    links = soup.find_all('a')
    for link in links:
        if link.text == 'Original':
            img_url = link['href']
            break

    # append the title and image url as a dictionary to the image list
    hemisphere_image_urls.append({"title": title, "img_url": img_url})

    # end of for loop

# print the list of dictionaries
print("---------------")
for hemisphere in hemisphere_image_urls:
    print(hemisphere)

Button #0 = <splinter.driver.webdriver.WebDriverElement object at 0x120bfd950>
Button #1 = <splinter.driver.webdriver.WebDriverElement object at 0x1208aa2d0>
Button #2 = <splinter.driver.webdriver.WebDriverElement object at 0x120cbd350>
Button #3 = <splinter.driver.webdriver.WebDriverElement object at 0x121416290>
---------------
{'title': 'Cerberus Hemisphere Enhanced', 'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif'}
{'title': 'Schiaparelli Hemisphere Enhanced', 'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif'}
{'title': 'Syrtis Major Hemisphere Enhanced', 'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif'}
{'title': 'Valles Marineris Hemisphere Enhanced', 'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif'}
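
One possible answer to the question at the top of this notebook: collect the link of every hemisphere page from the parent page once, then call `browser.visit()` on each link directly, so the parent page is loaded a single time. The sketch below shows the idea. `hemisphere_page_urls`, `scrape_hemisphere_pages`, and the `extract` callback are hypothetical names, and the usage comment assumes each `.thumb` image sits inside an \<a> element whose `href` points to the hemisphere page.

```python
from urllib.parse import urljoin

def hemisphere_page_urls(parent_url, hrefs):
    """Resolve each href against the parent URL, dropping duplicates in order."""
    urls = []
    for href in hrefs:
        url = urljoin(parent_url, href)
        if url not in urls:
            urls.append(url)
    return urls

def scrape_hemisphere_pages(browser, page_urls, extract):
    """Visit each hemisphere page directly and collect extract(html) results.

    browser -- object with visit(url) and an html attribute (e.g. Splinter's Browser)
    extract -- callable that turns one page's HTML into a result dictionary
    """
    results = []
    for url in page_urls:
        browser.visit(url)  # no return trip to the parent page
        results.append(extract(browser.html))
    return results

# Hypothetical use in this notebook (assumes each .thumb image is wrapped in an <a>):
# hrefs = [img.find_parent('a')['href']
#          for img in usgs_astrogeology_site_soup.find_all(class_='thumb')]
# page_urls = hemisphere_page_urls(usgs_astrogeology_site_url, hrefs)
```

The `extract` callback would hold the same "find the link whose text is Original" logic as the loop above. Keeping it separate means the navigation can be exercised without a real browser.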


In [25]:
# add to the scrape dictionary
scrape_data["hemisphere_image_urls"] = hemisphere_image_urls
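
Before printing the results, a quick structural check can flag any entry that was never filled in. This is a minimal sketch: `EXPECTED_KEYS` and `missing_scrape_keys` are hypothetical names, and the key list simply mirrors the keys assigned in the cells above.

```python
EXPECTED_KEYS = [
    "nasa_mars_news_title",
    "nasa_mars_news_p",
    "jpl_mars_site_link",
    "twitter_mars_weather",
    "mars_facts_html",
    "hemisphere_image_urls",
]

def missing_scrape_keys(data, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent from data or hold an empty value."""
    return [key for key in expected if not data.get(key)]

# Hypothetical use in this notebook:
# print(missing_scrape_keys(scrape_data))  # an empty list means every entry is present
```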

In [26]:
# test the scrape dictionary
print("latest Martian news title and paragraph")
print("---------------------------------------")
news_title = scrape_data["nasa_mars_news_title"]
news_p = scrape_data["nasa_mars_news_p"]
print(news_title)
print(news_p)

print("\nfeatured Mars image URL")
print("-----------------------")
jpl_mars_url = scrape_data["jpl_mars_site_link"]
print(jpl_mars_url)

print("\nMars weather")
print("------------")
twitter_mars_weather = scrape_data["twitter_mars_weather"]
print(twitter_mars_weather)

print("\nMars facts")
print("----------")
mars_facts_html = scrape_data["mars_facts_html"]
print(mars_facts_html)

print("\nMars hemispheres image urls")
print("---------------------------")
for hemisphere in scrape_data["hemisphere_image_urls"]:
    print(f"title = {hemisphere['title']}")
    print(f"url = {hemisphere['img_url']}")


latest Martian news title and paragraph
---------------------------------------
A Year of Surprising Science From NASA's InSight Mars Mission
A batch of new papers summarizes the lander's findings above and below the surface of the Red Planet.

featured Mars image URL
-----------------------
https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA17254_ip.jpg

Mars weather
------------
InSight sol 444 (2020-02-25) low -93.8ºC (-136.8ºF) high -12.0ºC (10.5ºF)
winds from the SSW at 6.2 m/s (13.9 mph) gusting to 21.2 m/s (47.4 mph)
pressure at 6.30 hPapic.twitter.com/UeOmoDjhf3

Mars facts
----------
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Equatorial Diameter:</td>
      <td>6,792 km</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Polar Diameter:</td>
      <td>6,752 km</td>
    </tr>
    <tr>
      <th>2</th>
      <td