<h1>Web Scraping Project</h1>
<h3>Author: Tim Lucas</h3>
<p>This notebook is a playground for practicing web scraping for the web-scraping-challenge. We'll visit several Mars sites to scrape specific sets of data, then use those scripts to build a Python Flask app that scrapes the data and stores it in a MongoDB database.</p>

In [19]:
# Import Dependencies
from bs4 import BeautifulSoup as bs
from splinter import Browser
import pandas as pd

In [2]:
# Initialize Browser
executable_path = {"executable_path": "chromedriver.exe"}
browser = Browser("chrome", **executable_path, headless=False)

<h3>Scraping the NASA Mars News</h3>

In [3]:
# Parse site HTML and put into Beautiful Soup object
url = 'https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest'
browser.visit(url)
html = browser.html
soup = bs(html, "html.parser")

In [4]:
# Find the title and paragraph text for the latest item
news_title = soup.find("div", class_="list_text").find("div", class_="content_title").find("a").get_text()
news_p = soup.find("div", class_="article_teaser_body").get_text()

In [5]:
print(f'Title: {news_title}')
print(f'Text: {news_p}')

Title: How NASA's Perseverance Mars Team Adjusted to Work in the Time of Coronavirus 
Text: Like much of the rest of the world, the Mars rover team is pushing forward with its mission-critical work while putting the health and safety of their colleagues and community first.


<h3>Scraping the JPL Mars Space Images</h3>

In [6]:
# Parse site HTML and put into Beautiful Soup object
url_jpl = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'
browser.visit(url_jpl)
html_jpl = browser.html
soup_jpl = bs(html_jpl, "html.parser")

In [7]:
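# The featured image path is embedded in the carousel item's inline style as url('...')
style = soup_jpl.find("article", class_="carousel_item")['style']
# Slice out the text between "url('" and "');"
start = style.find("url('")
end = style.find("');")
# The path is site-relative, so prepend the JPL domain
feature_image_url = 'https://www.jpl.nasa.gov' + style[start+len("url('"):end]
print(f'URL: {feature_image_url}')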

URL: https://www.jpl.nasa.gov/spaceimages/images/wallpaper/PIA18914-1920x1200.jpg
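
<p>The string slicing above depends on the style attribute containing exactly <code>url('</code> and <code>');</code>. As a hedged alternative, a regular expression can tolerate single, double, or missing quotes. This is a minimal sketch: the helper name <code>extract_style_url</code> and the sample style string are assumptions for illustration, not part of the original notebook.</p>

```python
import re
from typing import Optional

# Hypothetical helper: pull the URL out of a CSS declaration such as
# "background-image: url('/path/to/image.jpg');".
_URL_PATTERN = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def extract_style_url(style: str) -> Optional[str]:
    """Return the first url(...) value found in a style string, or None."""
    match = _URL_PATTERN.search(style)
    return match.group(1) if match else None


# Assumed sample value; the real attribute may carry extra declarations.
sample_style = "background-image: url('/spaceimages/images/wallpaper/PIA18914-1920x1200.jpg');"
path = extract_style_url(sample_style)
if path is not None:
    print('https://www.jpl.nasa.gov' + path)
```

<p>Returning <code>None</code> when nothing matches lets the caller decide how to handle a page layout change instead of silently building a malformed URL.</p>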


<h3>Scraping the Mars Weather Twitter Feed</h3>

In [39]:
# Parse site HTML and put into Beautiful Soup object
url_twt = 'https://twitter.com/marswxreport?lang=en'
browser.visit(url_twt)
html_twt = browser.html
soup_twt = bs(html_twt, "html.parser")

In [40]:
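# Look up the weather text by Twitter's generated CSS classes.
# Other spans share these classes, and find() returns only the first match on the page.
mars_weather = soup_twt.find("span", {"class": "css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"})
mars_weather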

<span class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0">Log in</span>
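
<p>The output above shows that the class-based lookup matched the "Log in" label rather than a weather report. One possible workaround is to filter candidate span texts by their content instead of their classes. The sketch below is illustrative only: the helper <code>first_matching_text</code>, the sample strings, and the assumption that weather tweets begin with "InSight sol" are not taken from the notebook. In the notebook, the candidate list could come from something like <code>[span.get_text() for span in soup_twt.find_all("span")]</code>.</p>

```python
import re
from typing import Iterable, Optional


def first_matching_text(texts: Iterable[str], pattern: str) -> Optional[str]:
    """Return the first stripped text that matches `pattern`, or None."""
    regex = re.compile(pattern)
    for text in texts:
        cleaned = text.strip()
        if regex.search(cleaned):
            return cleaned
    return None


# Made-up sample texts standing in for the span texts scraped from the page.
sample_texts = ["Log in", "Sign up", "InSight sol 500 (placeholder weather text)"]

# Assumption: weather reports start with "InSight sol".
mars_weather_text = first_matching_text(sample_texts, r"^InSight sol")
print(mars_weather_text)
```

<p>If the page served to a logged-out browser contains no tweets at all, this returns <code>None</code>, which is easier to detect than an unrelated span.</p>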

<h3>Scraping Mars Facts</h3>

In [41]:
# Parse site HTML and put into Beautiful Soup object
url_facts = 'https://space-facts.com/mars/'
browser.visit(url_facts)
html_facts = browser.html
soup_facts = bs(html_facts, "html.parser")

In [42]:
# Put into Pandas dataframe
mars_facts = soup_facts.find("table", {"id": "tablepress-p-mars"})
mars_facts_df = pd.read_html(str(mars_facts))
mars_facts_df = mars_facts_df[0]

In [43]:
# Use Pandas to create new HTML Table
mars_facts_html = mars_facts_df.to_html(index=False, header=False)

In [44]:
mars_facts_html

'<table border="1" class="dataframe">\n  <tbody>\n    <tr>\n      <td>Equatorial Diameter:</td>\n      <td>6,792 km</td>\n    </tr>\n    <tr>\n      <td>Polar Diameter:</td>\n      <td>6,752 km</td>\n    </tr>\n    <tr>\n      <td>Mass:</td>\n      <td>6.39 × 10^23 kg (0.11 Earths)</td>\n    </tr>\n    <tr>\n      <td>Moons:</td>\n      <td>2 (Phobos &amp; Deimos)</td>\n    </tr>\n    <tr>\n      <td>Orbit Distance:</td>\n      <td>227,943,824 km (1.38 AU)</td>\n    </tr>\n    <tr>\n      <td>Orbit Period:</td>\n      <td>687 days (1.9 years)</td>\n    </tr>\n    <tr>\n      <td>Surface Temperature:</td>\n      <td>-87 to -5 °C</td>\n    </tr>\n    <tr>\n      <td>First Record:</td>\n      <td>2nd millennium BC</td>\n    </tr>\n    <tr>\n      <td>Recorded By:</td>\n      <td>Egyptian astronomers</td>\n    </tr>\n  </tbody>\n</table>'

<h3>Scraping Mars Hemisphere Data</h3>

In [52]:
# Parse site HTML and put into Beautiful Soup object
url_hem = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'
browser.visit(url_hem)
html_hem = browser.html
soup_hem = bs(html_hem, "html.parser")

In [53]:
# Get list of all image links
image_links = soup_hem.find_all("div", {"class": "description"})

In [54]:
# Get each link text from list of links
links = []
for item in image_links:
    links.append(item.find("a").find("h3").get_text())

In [60]:
# For each link: click through, scrape the title and image URL, then navigate back
hemisphere_image_urls = []
base_url = 'https://astrogeology.usgs.gov/'
for link in links:
    browser.click_link_by_partial_text(link)
    new_html = browser.html
    new_soup = bs(new_html, "html.parser")
    title = new_soup.find("h2", {"class": "title"}).get_text()
    img_url = base_url + new_soup.find("img", {"class": "wide-image"})['src']
    hemisphere_image_urls.append({"title": title, "img_url": img_url})
    browser.back()

In [61]:
hemisphere_image_urls

[{'title': 'Cerberus Hemisphere Enhanced',
  'img_url': 'https://astrogeology.usgs.gov//cache/images/cfa62af2557222a02478f1fcd781d445_cerberus_enhanced.tif_full.jpg'},
 {'title': 'Schiaparelli Hemisphere Enhanced',
  'img_url': 'https://astrogeology.usgs.gov//cache/images/3cdd1cbf5e0813bba925c9030d13b62e_schiaparelli_enhanced.tif_full.jpg'},
 {'title': 'Syrtis Major Hemisphere Enhanced',
  'img_url': 'https://astrogeology.usgs.gov//cache/images/ae209b4e408bb6c3e67b6af38168cf28_syrtis_major_enhanced.tif_full.jpg'},
 {'title': 'Valles Marineris Hemisphere Enhanced',
  'img_url': 'https://astrogeology.usgs.gov//cache/images/7cf2da4bf549ed01c17f206327be4db7_valles_marineris_enhanced.tif_full.jpg'}]
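
<p>The image URLs above contain a double slash (<code>.gov//cache</code>) because <code>base_url</code> ends with <code>/</code> and the scraped <code>src</code> value begins with one. A small sketch of one way to avoid this is <code>urllib.parse.urljoin</code> from the standard library. The <code>src</code> value below is inferred from the first result in the output, not scraped live.</p>

```python
from urllib.parse import urljoin

base_url = 'https://astrogeology.usgs.gov/'
# Site-relative path inferred from the first result above
src = '/cache/images/cfa62af2557222a02478f1fcd781d445_cerberus_enhanced.tif_full.jpg'

# urljoin resolves the path against the base, so the slashes are not doubled
img_url = urljoin(base_url, src)
print(img_url)
# → https://astrogeology.usgs.gov/cache/images/cfa62af2557222a02478f1fcd781d445_cerberus_enhanced.tif_full.jpg
```

<p>In the loop, this would amount to replacing the string concatenation with <code>urljoin(base_url, ...)</code> around the scraped <code>src</code> attribute.</p>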