# Web Scraping Homework - Mission to Mars
----
### Web Scraping Homework (web-scraping-challenge)    |    by: Shane Gatenby
----

### Import Dependencies

In [1]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup
from splinter import Browser

### Chrome Driver

#### Mac Users

In [2]:
# Identify chromedriver & location
# https://splinter.readthedocs.io/en/latest/drivers/chrome.html
!which chromedriver

/usr/local/bin/chromedriver


In [3]:
# Create executable_path and open chrome browser window
executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=False)

#### PC/Windows Users

In [4]:
# executable_path = {'executable_path': 'chromedriver.exe'}
# browser = Browser('chrome', **executable_path, headless=False)
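
The two platform-specific cells above could be folded into a single lookup. The following is a minimal sketch, assuming `chromedriver` is installed somewhere on the `PATH`; `find_chromedriver` and its `fallback` parameter are hypothetical names introduced for illustration and are not part of splinter.

```python
import shutil


def find_chromedriver(fallback='chromedriver'):
    """Hypothetical helper: return an executable_path dict for splinter's Browser().

    shutil.which() searches the PATH on macOS, Linux and Windows
    (on Windows it also tries executable extensions such as .exe).
    """
    located = shutil.which('chromedriver')
    # If nothing is found on the PATH, fall back to the caller-supplied value
    return {'executable_path': located or fallback}


executable_path = find_chromedriver()
print(executable_path)
```

The resulting dictionary can be unpacked the same way as in the cells above: `Browser('chrome', **executable_path, headless=False)`.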

### Step 1 - Scraping

#### NASA Mars News
* Scrape the NASA Mars News Site and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.



In [5]:
# URL of page to be scraped
mars_news_url = 'https://mars.nasa.gov/news/'
browser.visit(mars_news_url)

# Wait 3 seconds before proceeding to give browser time to fully load
time.sleep(3)
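
A fixed three-second pause can be too short on a slow connection and wasteful on a fast one. One alternative is to poll for a condition until it holds. This is a sketch only: `wait_until`, `page_ready`, and the timeout values are hypothetical, and the stand-in predicate replaces a real browser check so the example runs on its own.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or timeout elapses.

    Returns True as soon as the predicate succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in predicate for demonstration: succeeds on the third check
checks = {'count': 0}


def page_ready():
    checks['count'] += 1
    return checks['count'] >= 3


print(wait_until(page_ready, timeout=5.0, interval=0.01))
```

In this notebook the predicate might be something like `lambda: browser.is_element_present_by_css('div.list_text')`, which checks for the element scraped in the following cells.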

In [6]:
# Create BeautifulSoup object & parse with 'html.parser'
mars_news_html_text = browser.html
mars_news_soup = BeautifulSoup(mars_news_html_text, 'html.parser')


In [7]:
# Print formatted version of BeautifulSoup object to examine
# print(mars_news_soup.prettify())

In [8]:
# Mars News Headline Title
# news_hl_title = mars_news_soup.find('div', class_='list_text').find('div', class_='content_title').text
news_hl_title = mars_news_soup.find_all('div', class_='list_text')[0].find('div', class_='content_title').text

# Mars News Headline Teaser Paragraph
# news_teaser_p = mars_news_soup.find('div', class_='list_text').find('div', class_='article_teaser_body').text
news_teaser_p = mars_news_soup.find_all('div', class_='list_text')[0].find('div', class_='article_teaser_body').text

In [9]:
# Visualize
print(news_hl_title)
print('----------')
print(news_teaser_p)

Virginia Middle School Student Earns Honor of Naming NASA's Next Mars Rover
----------
NASA chose a seventh-grader from Virginia as winner of the agency's "Name the Rover" essay contest. Alexander Mather's entry for "Perseverance" was voted tops among 28,000 entries. 


#### JPL Mars Space Images - Featured Image

* Visit the url for JPL Featured Space Image here: https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars.
* Use splinter to navigate the site and find the image url for the current Featured Mars Image and assign the url string to a variable called featured_image_url.
* Make sure to find the image url to the full size .jpg image.
* Make sure to save a complete url string for this image.

In [10]:
# URL of page to be scraped
jpl_base_url = 'https://www.jpl.nasa.gov'
jpl_image_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
browser.visit(jpl_image_url)

# Wait 3 seconds before proceeding to give browser time to fully load
time.sleep(3)

In [11]:
# Create BeautifulSoup object & parse with 'html.parser'
jpl_image_html_text = browser.html
jpl_image_soup = BeautifulSoup(jpl_image_html_text, 'html.parser')

In [12]:
# Print formatted version of BeautifulSoup object to examine
# print(jpl_image_soup.prettify())

In [13]:
# Find featured image url path within the article tag
jpl_image_soup.find('article', class_="carousel_item")['style']

"background-image: url('/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg');"

In [14]:
# Use split() to extract the portion of the url needed, setting to variable for later use
jpl_image_extra_path = jpl_image_soup.find('article', class_="carousel_item")['style'].\
                        split("url('")[1].\
                        split("');")[0]
print(jpl_image_extra_path)

/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg


In [15]:
# Another way: using Splinter and browser.find_by_xpath:

# Set jpl_featured_image_xpath to XPath copied from Chrome inspector tool
jpl_featured_image_xpath = '//*[@id="page"]/section[1]/div/div/article'

# Pass variable through to splinter's browser.find_by_xpath method, traverse to ['style']
browser.find_by_xpath(jpl_featured_image_xpath)['style']

'background-image: url("/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg");'

In [16]:
# Use split to extract the portion of the url needed, setting to variable for later use
jpl_image_extra_path2 = browser.find_by_xpath(jpl_featured_image_xpath)['style'].\
                        split('url("')[1].\
                        split('");')[0]
print(jpl_image_extra_path2)

/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg


In [17]:
# Set image path to be a concat of jpl_base_url + jpl_image_extra_path
featured_image_url = f'{jpl_base_url}{jpl_image_extra_path}'
featured_image_url2 = f'{jpl_base_url}{jpl_image_extra_path2}'
print(featured_image_url)
print(featured_image_url2)

https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg
https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg
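
The two approaches above return the same `style` value with different quote characters (single quotes from BeautifulSoup, double quotes from splinter), which is why each needed its own pair of `split()` calls. A single regular expression can cover both, and `urllib.parse.urljoin` can build the complete URL. This is a sketch rather than the notebook's method; `STYLE_URL_PATTERN` and `extract_style_url` are hypothetical names.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with single, double, or no quotes around the path
STYLE_URL_PATTERN = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""")


def extract_style_url(style):
    """Hypothetical helper: return the path inside a CSS url(...) value, or None."""
    match = STYLE_URL_PATTERN.search(style)
    return match.group(1) if match else None


# The two style strings shown in the outputs above
soup_style = "background-image: url('/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg');"
splinter_style = 'background-image: url("/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg");'

for style in (soup_style, splinter_style):
    path = extract_style_url(style)
    # urljoin resolves the site-relative path against the base URL
    print(urljoin('https://www.jpl.nasa.gov', path))
```

Both iterations print the same complete URL as `featured_image_url` above.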


In [18]:
# Find featured image title within the article tag, setting it to a variable for later use
featured_image_title = jpl_image_soup.find('article', class_="carousel_item")['alt']
print(featured_image_title)

Dusty Space Cloud


#### Mars Weather

* Visit the Mars Weather Twitter account here (https://twitter.com/marswxreport?lang=en) and scrape the latest Mars weather tweet from the page. Save the tweet text for the weather report as a variable called mars_weather.
* Note: Be sure you are not signed in to twitter, or scraping may become more difficult.
* Note: Twitter frequently changes how information is presented on their website. If you are having difficulty getting the correct html tag data, consider researching Regular Expression Patterns and how they can be used in combination with the .find() method.



In [19]:
# URL of page to be scraped
mars_wx_twitter_url = 'https://twitter.com/marswxreport?lang=en'
browser.visit(mars_wx_twitter_url)

# Wait 3 seconds before proceeding to give browser time to fully load
time.sleep(3)

In [20]:
# Create BeautifulSoup object & parse with 'html.parser'
mars_wx_twitter_html_text = browser.html
mars_wx_twitter_soup = BeautifulSoup(mars_wx_twitter_html_text, 'html.parser')

In [21]:
# Print formatted version of BeautifulSoup object to examine
# print(mars_wx_twitter_soup.prettify())

In [22]:
# Using Splinter and browser.find_by_xpath:

# Set latest_mars_wx_tweet_xpath to XPath copied from Chrome inspector tool
latest_mars_wx_tweet_xpath = '//*[@id="react-root"]/div/div/div/main/div/div/div/div[1]/div/div/div/div/div[2]/section/div/div/div/div[1]/div/article/div/div[2]/div[2]/div[2]/span'

# Pass variable through to splinter's browser.find_by_xpath method
latest_mars_wx_tweet = browser.find_by_xpath(latest_mars_wx_tweet_xpath).text
print(latest_mars_wx_tweet)

InSight sol 453 (2020-03-05) low -95.1ºC (-139.1ºF) high -10.8ºC (12.6ºF)
winds from the SSW at 6.0 m/s (13.3 mph) gusting to 21.4 m/s (47.9 mph)
pressure at 6.30 hPa


In [23]:
# mars_wx_tweets = mars_wx_twitter_soup.find_all('div', class_="css-901oao r-hkyrab r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-bnwqim r-qvutc0")
# mars_wx_tweets2 = mars_wx_twitter_soup2.find_all('div', class_="js-tweet-text-container")

# print(len(mars_wx_tweets))
# print(len(mars_wx_tweets2))

In [24]:
# Using browser.visit()
mars_wx_tweets = mars_wx_twitter_soup.find_all('div', class_="css-901oao r-hkyrab r-1qd0xha r-a023e6 r-16dba41 r-ad9z0x r-bcqeeo r-bnwqim r-qvutc0")

# Iterate through Mars Weather tweets one by one, searching for <span> holding tweet text
for mars_wx_tweet in mars_wx_tweets:
    
    # Set variable to tweet text
    mars_wx_wx_tweet = mars_wx_tweet.find('span', class_='css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0').text
    
    # Remove unwanted characters (a pic link, if one exists at the end of the tweet), using split() to pull it off
    mars_wx_wx_tweet = mars_wx_wx_tweet.split('pic.twitter.com/')[0]
    mars_wx_wx_tweet = mars_wx_wx_tweet.replace('\n',' ')
    print(mars_wx_wx_tweet)
    
    # Determine if tweet is weather related and exit, else pass and move on to next tweet
    if 'InSight sol ' in mars_wx_wx_tweet:
        break
    else:
        pass

[]

In [25]:
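# Set first weather related tweet to required variable
# NOTE: find_all() in the previous cell matched no tweets, so the loop body never ran
# and mars_wx_wx_tweet was never assigned; this cell therefore raises the NameError below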
mars_weather = mars_wx_wx_tweet
print(mars_weather)

NameError: name 'mars_wx_wx_tweet' is not defined

In [27]:
# Using requests.get 
mars_wx_response2 = requests.get(mars_wx_twitter_url)
mars_wx_twitter_soup2 = BeautifulSoup(mars_wx_response2.text, 'html.parser')

# Print formatted version of BeautifulSoup object to examine
# print(mars_wx_twitter_soup2.prettify())

mars_wx_tweets2 = mars_wx_twitter_soup2.find_all('div', class_="js-tweet-text-container")

# Iterate through Mars Weather tweets one by one, searching for <p> holding tweet text
for mars_wx_tweet2 in mars_wx_tweets2:
    
    # Set variable to tweet text
    mars_wx_wx_tweet2 = mars_wx_tweet2.find('p', class_='TweetTextSize TweetTextSize--normal js-tweet-text tweet-text').text
    
    # Remove unwanted characters (a pic link, if one exists at the end of the tweet), using split() to pull it off
    mars_wx_wx_tweet2 = mars_wx_wx_tweet2.split('pic.twitter.com/')[0]
    mars_wx_wx_tweet2 = mars_wx_wx_tweet2.replace('\n',' ')
    print(mars_wx_wx_tweet2)
    
    # Determine if tweet is weather related and exit, else pass and move on to next tweet
    if 'InSight sol ' in mars_wx_wx_tweet2:
        break
    else:
        pass

InSight sol 453 (2020-03-05) low -95.1ºC (-139.1ºF) high -10.8ºC (12.6ºF) winds from the SSW at 6.0 m/s (13.3 mph) gusting to 21.4 m/s (47.9 mph) pressure at 6.30 hPa


In [None]:
# # Set first weather related tweet to required variable
# mars_weather = mars_wx_wx_tweet2
# print(mars_weather)
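
Both loops above leave their result in a loop variable, which is never assigned when the tweet list is empty (the cause of the `NameError` earlier). The assignment also suggests regular expressions for this step. The sketch below combines the two ideas; `first_weather_tweet` and `WEATHER_PATTERN` are hypothetical names, and the first sample tweet is made up for illustration.

```python
import re

# Weather reports in the outputs above start with "InSight sol <number>"
WEATHER_PATTERN = re.compile(r'InSight sol \d+')


def first_weather_tweet(tweet_texts):
    """Hypothetical helper: return the first cleaned weather tweet, or None."""
    for text in tweet_texts:
        # Drop a trailing picture link, then collapse newlines and extra spaces
        cleaned = text.split('pic.twitter.com/')[0]
        cleaned = ' '.join(cleaned.split())
        if WEATHER_PATTERN.search(cleaned):
            return cleaned
    return None


sample_tweets = [
    'A tweet that is not a weather report',  # made-up non-weather tweet
    'InSight sol 453 (2020-03-05) low -95.1ºC (-139.1ºF) high -10.8ºC (12.6ºF)\n'
    'winds from the SSW at 6.0 m/s (13.3 mph) gusting to 21.4 m/s (47.9 mph)\n'
    'pressure at 6.30 hPa',
]

mars_weather = first_weather_tweet(sample_tweets)
print(mars_weather)
print(first_weather_tweet([]))  # an empty list gives None instead of a NameError
```

Returning `None` lets the caller decide what to do when no weather tweet is found, for example falling back to the text located with XPath.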

#### Mars Facts

* Visit the Mars Facts webpage here (https://space-facts.com/mars/) and use Pandas to scrape the table containing facts about the planet including Diameter, Mass, etc.
* Use Pandas to convert the data to a HTML table string.

In [28]:
# URL of page to be scraped
mars_facts_url = 'https://space-facts.com/mars/'
browser.visit(mars_facts_url)

# Wait 3 seconds before proceeding to give browser time to fully load
time.sleep(3)

In [29]:
# Create BeautifulSoup object & parse with 'html.parser'
mars_facts_html_text = browser.html
mars_facts_soup = BeautifulSoup(mars_facts_html_text, 'html.parser')

In [30]:
# Print formatted version of BeautifulSoup object to examine
# print(mars_facts_soup.prettify())

In [31]:
# Locate and set variable for the mars_facts_table
# mars_facts_table = mars_facts_soup.find('table', id='tablepress-p-mars-no-2').find('tbody')
# print(mars_facts_table)

In [70]:
# Using Splinter and browser.find_by_xpath:

# Set mars_facts_table_xpath to XPath copied from Chrome inspector tool
mars_facts_table_xpath = '//*[@id="tablepress-p-mars-no-2"]'
# mars_facts_table_xpath = '//*[@id="tablepress-p-mars-no-2"]/tbody'

# Pass variable through to splinter's browser.find_by_xpath method
mars_facts_table_html = browser.find_by_xpath(mars_facts_table_xpath).html

# Add <table> tag around mars_facts_table_html to use in pd.read_html()
mars_facts_table_html = f'<table> {mars_facts_table_html} </table>'
# print(mars_facts_table_html)
# mars_facts_table_html

In [71]:
# Read in the mars_facts_table_html html table into a pandas DataFrame
mars_facts_df = pd.read_html(mars_facts_table_html)[0]

# Set the first column as the DataFrame index, remove the index name, and rename the second column
mars_facts_df = mars_facts_df.set_index(0)
mars_facts_df = mars_facts_df.rename_axis('')
mars_facts_df = mars_facts_df.rename(columns={1:'Value'})

# Visualize the DataFrame
mars_facts_df


|  | Value |
|---|---|
| Equatorial Diameter: | 6,792 km |
| Polar Diameter: | 6,752 km |
| Mass: | 6.39 × 10^23 kg (0.11 Earths) |
| Moons: | 2 (Phobos & Deimos) |
| Orbit Distance: | 227,943,824 km (1.38 AU) |
| Orbit Period: | 687 days (1.9 years) |
| Surface Temperature: | -87 to -5 °C |
| First Record: | 2nd millennium BC |
| Recorded By: | Egyptian astronomers |


In [76]:
# Use to_html method to create html table from mars_facts_df DataFrame
mars_facts_html_table = mars_facts_df.to_html()

# Remove unwanted characters ('\n')
# mars_facts_html_table = mars_facts_html_table.replace('\n','')
# print(mars_facts_html_table)
# mars_facts_html_table

#### Mars Hemispheres

* Visit the USGS Astrogeology site here (https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars) to obtain high resolution images for each of Mars's hemispheres.
* You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image.
* Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys img_url and title.
* Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

In [77]:
# URL of page to be scraped
mars_hemispheres_base_url = 'https://astrogeology.usgs.gov'
mars_hemispheres_url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'
browser.visit(mars_hemispheres_url)

# Wait 3 seconds before proceeding to give browser time to fully load
time.sleep(3)

In [78]:
# Create BeautifulSoup object & parse with 'html.parser'
mars_hemispheres_html_text = browser.html
mars_hemispheres_soup = BeautifulSoup(mars_hemispheres_html_text, 'html.parser')

In [79]:
# Print formatted version of BeautifulSoup object to examine
# print(mars_hemispheres_soup.prettify())

In [80]:
# Create list to hold one dictionary per Mars hemisphere, each with title and img_url keys
hemisphere_image_urls = []

# Iterate through the mars_hemispheres_soup html to find needed elements
for hemisphere in mars_hemispheres_soup.find_all('div', class_='item'):
    # Locate and set variable for Hemisphere Name/title and use .split() to remove ' Enhanced'
    # print(hemisphere.find('div', class_='description').find('h3').text.split(' Enhanced')[0])
    title = hemisphere.find('div', class_='description').find('h3').text.split(' Enhanced')[0]
    
    # Locate and set variable for Hemisphere's individual url (concat to mars_hemispheres_base_url)
    # print(hemisphere.find('div', class_='description').find('a')['href'])
    hemisphere_url_extra_path = hemisphere.find('div', class_='description').find('a')['href']
    hemisphere_url = f'{mars_hemispheres_base_url}{hemisphere_url_extra_path}'
    # print(hemisphere_url)
    
    # Navigate splinter's browser to each hemisphere's url to scrape that hemisphere's full resolution image url
    browser.visit(hemisphere_url)

    # Wait 3 seconds before proceeding to give browser time to fully load
    time.sleep(3)
    
    # Create BeautifulSoup object & parse with 'html.parser'
    hemisphere_html_text = browser.html
    hemisphere_soup = BeautifulSoup(hemisphere_html_text, 'html.parser')
    
    # Locate and set variable for each hemisphere's individual full resolution image url
    # print(hemisphere_soup.find('div', class_='downloads').find('li').find('a')['href'])
    hemisphere_img_url = hemisphere_soup.find('div', class_='downloads').find('li').find('a')['href']
    
    # Create a dictionary holding the Hemisphere's Name/title & its full resolution image url
    hemisphere_image_dict = {'title': title, 'img_url': hemisphere_img_url}
    # print(hemisphere_image_dict)
    
    # Append the dictionary (Name/title and full resolution image url) to the hemisphere_image_urls list
    hemisphere_image_urls.append(hemisphere_image_dict)
    # print(hemisphere_image_urls)
    # print(f'hemisphere_image_urls appended with: {hemisphere_image_dict}')
    # print('---------')

In [81]:
# Visualize the hemisphere_image_urls list of dictionaries
hemisphere_image_urls

[{'title': 'Cerberus Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
 {'title': 'Schiaparelli Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/schiaparelli_enhanced.tif/full.jpg'},
 {'title': 'Syrtis Major Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/syrtis_major_enhanced.tif/full.jpg'},
 {'title': 'Valles Marineris Hemisphere',
  'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/valles_marineris_enhanced.tif/full.jpg'}]
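
The loop above builds each record by splitting on `' Enhanced'` and concatenating the base URL with a relative path. A slightly more defensive way to build one record is sketched below; `make_hemisphere_record` and `USGS_BASE_URL` are hypothetical names, and the raw title used in the demonstration is assumed from the `split(' Enhanced')` call rather than copied from the page.

```python
from urllib.parse import urljoin

USGS_BASE_URL = 'https://astrogeology.usgs.gov'


def make_hemisphere_record(raw_title, img_href, base_url=USGS_BASE_URL):
    """Hypothetical helper: build one {'title': ..., 'img_url': ...} dictionary."""
    # Remove the trailing ' Enhanced' only if it is actually there
    suffix = ' Enhanced'
    title = raw_title.strip()
    if title.endswith(suffix):
        title = title[:-len(suffix)]
    # urljoin leaves absolute hrefs untouched and resolves relative ones
    return {'title': title, 'img_url': urljoin(base_url, img_href)}


record = make_hemisphere_record(
    'Cerberus Hemisphere Enhanced',  # assumed raw <h3> text
    'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg',
)
print(record)
```

Because `urljoin` accepts both absolute and site-relative links, the same helper works whether the download link is a full URL (as in the output above) or only a path.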

In [82]:
# Close the browser window opened by splinter
browser.quit()

### Step 2 - MongoDB and Flask Application
----
#### Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.
----
* Start by converting your Jupyter notebook into a Python script called scrape_mars.py with a function called scrape that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data.
* Next, create a route called /scrape that will import your scrape_mars.py script and call your scrape function.
* Store the return value in Mongo as a Python dictionary.
* Create a root route / that will query your Mongo database and pass the mars data into an HTML template to display the data.
* Create a template HTML file called index.html that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.
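
The first bullet asks for one dictionary holding everything scraped in Step 1. The sketch below shows one possible shape for that return value, not the contents of scrape_mars.py. The key names reuse this notebook's variable names and are an assumption (only `title` and `img_url` are named by the assignment); `assemble_mars_data` is a hypothetical helper, and the facts table string is a placeholder.

```python
def assemble_mars_data(news_hl_title, news_teaser_p, featured_image_url,
                       mars_weather, mars_facts_html_table, hemisphere_image_urls):
    """Hypothetical helper: bundle the scraped values into a single dictionary."""
    # Each hemisphere entry must carry the two keys named in Step 1
    for hemisphere in hemisphere_image_urls:
        missing = {'title', 'img_url'} - set(hemisphere)
        if missing:
            raise ValueError(f'hemisphere entry is missing keys: {sorted(missing)}')
    return {
        'news_hl_title': news_hl_title,
        'news_teaser_p': news_teaser_p,
        'featured_image_url': featured_image_url,
        'mars_weather': mars_weather,
        'mars_facts_html_table': mars_facts_html_table,
        'hemisphere_image_urls': hemisphere_image_urls,
    }


# Sample values taken from the outputs earlier in this notebook
mars_data = assemble_mars_data(
    news_hl_title="Virginia Middle School Student Earns Honor of Naming NASA's Next Mars Rover",
    news_teaser_p='NASA chose a seventh-grader from Virginia as winner of the agency\'s "Name the Rover" essay contest.',
    featured_image_url='https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA15254-1920x1200.jpg',
    mars_weather='InSight sol 453 (2020-03-05) low -95.1ºC (-139.1ºF) high -10.8ºC (12.6ºF)',
    mars_facts_html_table='<table></table>',  # placeholder for the to_html() string
    hemisphere_image_urls=[
        {'title': 'Cerberus Hemisphere',
         'img_url': 'http://astropedia.astrogeology.usgs.gov/download/Mars/Viking/cerberus_enhanced.tif/full.jpg'},
    ],
)
print(sorted(mars_data))
```

A flat dictionary of strings and lists like this is made up only of types that can be stored as a single MongoDB document, and a template can read each value by key.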

In [None]:
# !jupyter nbconvert --to script mission_to_mars_v1.ipynb

### NOTE: See scrape_mars.py for the Python script converted from this Jupyter Notebook for use by the app