## This project builds a web application that scrapes several websites for data related to the Mission to Mars and displays the results on a single HTML page.


### The following libraries are used for the initial scraping:
**BeautifulSoup**<br>
A Python library for pulling data out of HTML and XML files. It works with a parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parse tree.
<br>
**Pandas**<br>
Python's data analysis library, which can be used alongside BeautifulSoup for web scraping. Its read_html function reads HTML table data into a DataFrame, which can then be converted to other formats such as JSON.
<br>
**Requests**<br>
A Python library for sending HTTP requests. Headers, form data, multipart files, and parameters are passed as simple Python dictionaries, and the response data is accessed in the same way.
<br>
**Splinter**<br>
An open source tool for testing web applications using Python. It lets you automate browser actions, such as visiting URLs and interacting with their items.

In [1]:
# Importing Dependencies
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests

# import Browser class from splinter
from splinter import Browser

# To use Google Chrome with Splinter, set up the Chrome WebDriver, which relies on Selenium: pip install selenium
# Download the ChromeDriver build for Windows, extract the zip file, and add the .exe file to the PATH.
# Create a Browser instance
executable_path = {'executable_path': 'chromedriver.exe'}
browser = Browser('chrome', **executable_path, headless=False)

### NASA Mars News
Scraping the NASA Mars News site and collecting the latest News Title and Paragraph text.

**Note on why Splinter is used instead of the requests library in this case:**<br>
<br>
When a browser wants to access a webpage, it sends a request to the server that hosts the files making up the page. The server responds with the page's source code. The browser then interprets the HTML, CSS, etc. in that source code, lets any JavaScript run, and displays the page.<br>
<br>
The NASA Mars News site relies heavily on JavaScript. When a Python library such as "urllib" or "requests" sends a request to a server, it receives the source code of the webpage, just as a browser does. However, these libraries do not execute JavaScript, so the elements that the scripts would create to hold the content never come into existence; the most Python can do is parse the source code it received. When the content you need is loaded dynamically by JavaScript and is not present in the page source, Splinter comes in handy because it drives a real browser that runs the scripts.
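
The point can be illustrated with a minimal sketch. The HTML string below is a made-up stand-in for a page whose content is built by JavaScript; it is not the actual source of the NASA page. Parsing it with BeautifulSoup finds the script tag but not the element the script would create, because nothing executes the script.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the <div class="content_title"> would only exist
# after a browser runs the script, so it is absent from the parsed tree.
raw_source = """
<html>
  <body>
    <div id="news"></div>
    <script>
      document.getElementById("news").innerHTML =
        '<div class="content_title">Latest headline</div>';
    </script>
  </body>
</html>
"""

static_soup = BeautifulSoup(raw_source, "html.parser")

# The script tag is present, but the element it would create is not.
script_found = static_soup.find("script") is not None
title_found = static_soup.find("div", class_="content_title") is not None

print(f"script tag found: {script_found}")
print(f"content_title div found: {title_found}")
```

A browser driven by Splinter would run the script first, so browser.html would contain the generated element.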

In [2]:
# URL of NASA Mars News site
nasa_news_url = 'https://mars.nasa.gov/news/'

# Navigating with the browser.visit method of Splinter
browser.visit(nasa_news_url)

# Create BeautifulSoup object; parse with 'html.parser'
html_nasa = browser.html
soup = bs(html_nasa, 'html.parser')

In [3]:
news_title = soup.find("div", class_="content_title").text
news_p = soup.find("div", class_="article_teaser_body").text

In [4]:
print(f'News Title: {news_title}')
print(f'News Paragraph: {news_p}')

News Title: NASA to Host Media Call on Next Mars Landing Site
News Paragraph: NASA will host a media teleconference at 9 a.m. PST (noon EST) Monday, Nov. 19, to provide details about the Mars 2020 rover’s landing site on the Red Planet.
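
If the page layout changes, soup.find returns None and calling .text on it raises an AttributeError. A small guard can make that failure easier to handle. The sketch below is only an illustration: the helper name text_or_none and the sample HTML are assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = soup.find(tag, class_=css_class)
    return element.get_text(strip=True) if element is not None else None


# Made-up snippet that contains only one of the two class names used above.
sample_html = """
<div class="content_title"> Example title </div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_title = text_or_none(sample_soup, "div", "content_title")
sample_teaser = text_or_none(sample_soup, "div", "article_teaser_body")

print(sample_title)
print(sample_teaser)
```

The missing teaser comes back as None instead of raising, so the caller can decide whether to retry, skip, or report the problem.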


### JPL Mars Space Images - Featured Image
Scraping the JPL Featured Space Images website to get the image url for the current Featured Mars Image.

In [5]:
# URL of JPL Mars Space Images site
jpl_base_url = 'https://www.jpl.nasa.gov'
jpl_mars_url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'

# Navigating with the browser.visit method of Splinter
browser.visit(jpl_mars_url)

# Create BeautifulSoup object; parse with 'html.parser'
html_jpl = browser.html
soup = bs(html_jpl, 'html.parser')

In [6]:
# Using Splinter - find_by_xpath method to find and click on the full size featured Mars image

img_xpath = '//*[@id="page"]/section[3]/div/ul/li[1]/a'

# Finding the full size image by xpath
find_img = browser.find_by_xpath(img_xpath)
image = find_img[0]
image.click()

In [7]:
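# Using BeautifulSoup to read the featured Mars image url from the FULL IMAGE button
# (this soup was parsed before the click above, and the href points to the medium size file)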
link = soup.find('a',class_="fancybox")
print(link)
featured_image_url = jpl_base_url + link['data-fancybox-href']
print(f"featured_image_url = {featured_image_url}")

<a class="button fancybox" data-description="This image from NASA's Curiosity Mars rover shows Curiosity at the 'Rocknest' site where the rover scooped up samples of windblown dust and sand." data-fancybox-group="images" data-fancybox-href="/spaceimages/images/mediumsize/PIA16919_ip.jpg" data-link="/spaceimages/details.php?id=PIA16919" data-title="Billion-Pixel View From Curiosity at Rock Nest, Raw Color" id="full_image">
					FULL IMAGE
				  </a>
featured_image_url = https://www.jpl.nasa.gov/spaceimages/images/mediumsize/PIA16919_ip.jpg
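
The base URL and the relative href are joined above by string concatenation. As an alternative sketch, the standard library's urllib.parse.urljoin handles leading slashes and already-absolute links. The example.com link below is a placeholder for illustration only.

```python
from urllib.parse import urljoin

jpl_base_url = "https://www.jpl.nasa.gov"
relative_href = "/spaceimages/images/mediumsize/PIA16919_ip.jpg"

# Join the site root with the relative path taken from data-fancybox-href.
joined_url = urljoin(jpl_base_url, relative_href)
print(joined_url)

# An href that is already absolute is returned unchanged.
absolute_href = "https://example.com/image.jpg"  # placeholder URL
unchanged_url = urljoin(jpl_base_url, absolute_href)
print(unchanged_url)
```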


### Mars Weather
Scraping the Mars Weather Twitter account to collect the latest Mars weather tweet.

In [8]:
# URL of Mars weather twitter account
twitter_mars_url = 'https://twitter.com/marswxreport?lang=en'

# Navigating with the browser.visit method of Splinter
browser.visit(twitter_mars_url)

# Create BeautifulSoup object; parse with 'html.parser'
html_twitter = browser.html
soup = bs(html_twitter, 'html.parser')

In [9]:
mars_weather = soup.find("p", class_="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text").text
print(f"Mars Weather = {mars_weather}")

Mars Weather = Sol 2230 (2018-11-14), high -5C/23F, low -72C/-97F, pressure at 8.59 hPa, daylight 06:22-18:39
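
Passing the full class string to find only matches when the class attribute is exactly that string, so the lookup breaks if a class is added or reordered. A possibly sturdier variant, sketched here on made-up markup, uses a CSS selector on a single class with select_one.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the class list used in the cell above.
sample_tweet_html = """
<p class="TweetTextSize TweetTextSize--normal js-tweet-text tweet-text">
Sol 2230 (2018-11-14), high -5C/23F, low -72C/-97F
</p>
"""
tweet_soup = BeautifulSoup(sample_tweet_html, "html.parser")

# "p.tweet-text" matches any <p> whose class list contains "tweet-text",
# regardless of the other classes or their order.
tweet = tweet_soup.select_one("p.tweet-text")
sample_weather = tweet.get_text(strip=True) if tweet is not None else None
print(sample_weather)
```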


### Mars Facts
Using Pandas to scrape the table of facts about the planet (Diameter, Mass, etc.) from the Mars Facts webpage and to convert the data to an HTML table string.

In [10]:
# URL of Mars Facts webpage
mars_facts_url = 'https://space-facts.com/mars/'

# Navigating with the browser.visit method of Splinter
browser.visit(mars_facts_url)

# Creating BeautifulSoup object; parse with 'html.parser'
html_marsfacts = browser.html
soup = bs(html_marsfacts, 'html.parser')

In [11]:
# Use the read_html function to automatically scrape any tabular data from a page.
# read_html fetches the URL itself, so it does not depend on the browser or soup objects above.
tables = pd.read_html(mars_facts_url)
mars_facts_df = tables[0]
mars_facts_df.columns = ["Description", "Value"]
mars_facts_df.set_index("Description", inplace=True)
mars_facts_df.head()

| Description | Value |
| --- | --- |
| Equatorial Diameter: | 6,792 km |
| Polar Diameter: | 6,752 km |
| Mass: | 6.42 x 10^23 kg (10.7% Earth) |
| Moons: | 2 (Phobos & Deimos) |
| Orbit Distance: | 227,943,824 km (1.52 AU) |


In [12]:
# generate HTML table from DataFrame
html_table = mars_facts_df.to_html()
html_table

'<table border="1" class="dataframe">\n  <thead>\n    <tr style="text-align: right;">\n      <th></th>\n      <th>Value</th>\n    </tr>\n    <tr>\n      <th>Description</th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>Equatorial Diameter:</th>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <th>Polar Diameter:</th>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <th>Mass:</th>\n      <td>6.42 x 10^23 kg (10.7% Earth)</td>\n    </tr>\n    <tr>\n      <th>Moons:</th>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <th>Orbit Distance:</th>\n      <td>227,943,824 km (1.52 AU)</td>\n    </tr>\n    <tr>\n      <th>Orbit Period:</th>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <th>Surface Temperature:</th>\n      <td>-153 to 20 °C</td>\n    </tr>\n    <tr>\n      <th>First Record:</th>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <th>Recorded By:</th>\n      <td>Egyptian astronomers</td>\n    </tr>

In [13]:
# Clean the table by stripping unwanted newlines
# (str.replace returns a new string, so the result is assigned back)
html_table = html_table.replace('\n', '')
html_table

'<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Value</th>    </tr>    <tr>      <th>Description</th>      <th></th>    </tr>  </thead>  <tbody>    <tr>      <th>Equatorial Diameter:</th>      <td>6,792 km</td>    </tr>    <tr>      <th>Polar Diameter:</th>      <td>6,752 km</td>    </tr>    <tr>      <th>Mass:</th>      <td>6.42 x 10^23 kg (10.7% Earth)</td>    </tr>    <tr>      <th>Moons:</th>      <td>2 (Phobos &amp; Deimos)</td>    </tr>    <tr>      <th>Orbit Distance:</th>      <td>227,943,824 km (1.52 AU)</td>    </tr>    <tr>      <th>Orbit Period:</th>      <td>687 days (1.9 years)</td>    </tr>    <tr>      <th>Surface Temperature:</th>      <td>-153 to 20 °C</td>    </tr>    <tr>      <th>First Record:</th>      <td>2nd millennium BC</td>    </tr>    <tr>      <th>Recorded By:</th>      <td>Egyptian astronomers</td>    </tr>  </tbody></table>'

In [14]:
# Save the table
mars_facts_df.to_html('mars_facts_table.html')
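
The library overview mentions that a DataFrame can also be converted to JSON. A minimal sketch of that conversion follows; it rebuilds two rows of the table by hand rather than scraping them, and the variable names are chosen for illustration.

```python
import json

import pandas as pd

# Two rows copied from the table above; the rest are omitted for brevity.
sample_facts_df = pd.DataFrame(
    {
        "Description": ["Equatorial Diameter:", "Polar Diameter:"],
        "Value": ["6,792 km", "6,752 km"],
    }
).set_index("Description")

# Series.to_json() serialises index -> value pairs by default.
facts_json = sample_facts_df["Value"].to_json()

# Parse the JSON string back into a dictionary to inspect it.
facts_dict = json.loads(facts_json)
print(facts_dict)
```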

### Mars Hemispheres
Scraping the USGS Astrogeology site to obtain high resolution images for each of Mars' hemispheres.<br>

Each hemisphere link is visited in order to find the URL of the full resolution image.

Both the image URL string for the full resolution hemisphere image and the title containing the hemisphere name are saved in a Python dictionary with the keys img_url and title.

Each dictionary is appended to a list, so the list contains one dictionary per hemisphere.

In [15]:
# URL of USGS Astrogeology site
usgs_url = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'

# Navigating with the browser.visit method of Splinter
browser.visit(usgs_url)

# Creating BeautifulSoup object; parse with 'html.parser'
usgs_html = browser.html
soup = bs(usgs_html, 'html.parser')

In [17]:
usgs_base_url = 'https://astrogeology.usgs.gov'

results = soup.find_all("div", class_="description")

hemisphere_image_urls = []

for result in results:
    # Read the heading and drop its trailing word (the last space-separated token)
    title = result.find("h3").text
    title = ' '.join(title.split(' ')[:-1])
    
    button = result.find("a", class_="itemLink product-item")['href']
    url = usgs_base_url + button
    
    # Using browser.visit method to navigate to each link
    browser.visit(url)
    
    # Creating a separate BeautifulSoup object for the navigated page; parse with 'html.parser'
    link_html = browser.html
    link_soup = bs(link_html, 'html.parser')
    
    # Finding the url for the full resolution hemisphere image on the navigated page
    link = link_soup.find("img", class_="wide-image")["src"]
    
    # Creating a dictionary to store title and url for each hemisphere
    # (named hemisphere so the built-in dict is not shadowed)
    hemisphere = {'title': title, 'img_url': usgs_base_url + link}
    
    # Appending the dictionary to the list
    hemisphere_image_urls.append(hemisphere)

In [18]:
hemisphere_image_urls

[{'title': 'Cerberus Hemisphere',
  'img_url': 'https://astrogeology.usgs.gov/cache/images/cfa62af2557222a02478f1fcd781d445_cerberus_enhanced.tif_full.jpg'},
 {'title': 'Schiaparelli Hemisphere',
  'img_url': 'https://astrogeology.usgs.gov/cache/images/3cdd1cbf5e0813bba925c9030d13b62e_schiaparelli_enhanced.tif_full.jpg'},
 {'title': 'Syrtis Major Hemisphere',
  'img_url': 'https://astrogeology.usgs.gov/cache/images/ae209b4e408bb6c3e67b6af38168cf28_syrtis_major_enhanced.tif_full.jpg'},
 {'title': 'Valles Marineris Hemisphere',
  'img_url': 'https://astrogeology.usgs.gov/cache/images/7cf2da4bf549ed01c17f206327be4db7_valles_marineris_enhanced.tif_full.jpg'}]
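
The loop above shortens each heading by dropping its last word unconditionally. A slightly more defensive sketch removes the word only when it is actually present. The helper names, the assumption that the raw heading ends in "Enhanced", and the placeholder image path are illustrative and not taken from the site.

```python
USGS_BASE_URL = "https://astrogeology.usgs.gov"


def clean_title(raw_title, suffix="Enhanced"):
    """Strip whitespace and remove a trailing suffix word if it is present."""
    title = raw_title.strip()
    if title.endswith(suffix):
        title = title[: -len(suffix)].rstrip()
    return title


def build_hemisphere(raw_title, relative_src):
    """Build one entry in the same shape as the hemisphere list above."""
    return {
        "title": clean_title(raw_title),
        "img_url": USGS_BASE_URL + relative_src,
    }


# Assumed raw heading text and a placeholder image path.
example = build_hemisphere("Cerberus Hemisphere Enhanced", "/cache/images/example_full.jpg")
print(example)

# A heading without the suffix is left untouched.
untouched = clean_title("Syrtis Major Hemisphere")
print(untouched)
```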