## Scraping and analysis

In [1]:
# Dependencies
from bs4 import BeautifulSoup
from splinter import Browser
import pandas as pd

executable_path = {'executable_path': '/usr/local/bin/chromedriver'}
browser = Browser('chrome', **executable_path, headless=False)

### 1. Scraping

#### 1.1. NASA Mars News

* Scrape the [NASA Mars News Site](https://mars.nasa.gov/news/) and collect the latest News Title and Paragraph Text. Assign the text to variables that you can reference later.

From URL to Beautiful Soup object

In [2]:
# URL of page to be scraped
url = 'https://mars.nasa.gov/news/?page=0&per_page=40&order=publish_date+desc%2Ccreated_at+desc&search=&category=19%2C165%2C184%2C204&blank_scope=Latest'

# Browser visit
browser.visit(url)

# Create a Beautiful Soup object
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

Extract news title and paragraph text

In [3]:
latest_news = soup.find("li", class_="slide")
news_title = latest_news.find("h3").text
news_p = latest_news.find(class_="article_teaser_body").text

#### 1.2. JPL Mars Space Images - Featured Image

* Visit the URL for the JPL Featured Space Image [here](https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars).

* Use splinter to navigate the site and find the image url for the current Featured Mars Image and assign the url string to a variable called `featured_image_url`.

* Make sure to find the image url to the full size `.jpg` image.

* Make sure to save a complete url string for this image.

From URL to Beautiful Soup object

In [4]:
# URL of page to be scraped
url = 'https://www.jpl.nasa.gov/spaceimages/?search=&category=Mars'

# Browser visit
browser.visit(url)

# Create a Beautiful Soup object
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

Find image URL for current featured Mars image

In [5]:
# soup.find(class_="main_feature").footer.a["data-fancybox-href"]
base_url = "https://www.jpl.nasa.gov"
style = soup.find(class_="main_feature").find(class_="carousel_items").article["style"]
featured_image_url = base_url + style.split("url")[1].strip(";(')")
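
The `split`/`strip` chain above relies on the exact punctuation of the inline `style` attribute. As a minimal sketch of a more tolerant alternative, the snippet below pulls the path out of a CSS `url(...)` value with a regular expression and joins it to the base URL with `urljoin`. The helper name `image_url_from_style` and the sample style string are illustrative assumptions, not values taken from the live page.

```python
import re
from urllib.parse import urljoin


def image_url_from_style(style, base_url="https://www.jpl.nasa.gov"):
    """Return the absolute URL found in a CSS url(...) value, or None."""
    # Accept url('...'), url("...") and url(...) with optional whitespace
    match = re.search(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", style)
    if match is None:
        return None
    # urljoin handles both relative paths and already-absolute URLs
    return urljoin(base_url, match.group(1))


# Hypothetical style value, for illustration only
sample_style = "background-image: url('/spaceimages/images/wallpaper/EXAMPLE-1920x1200.jpg');"
print(image_url_from_style(sample_style))
```

Returning `None` when no `url(...)` is present makes a layout change on the page easier to detect than a malformed string would.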

#### 1.3. Mars Weather

* Visit the Mars Weather Twitter account [here](https://twitter.com/marswxreport?lang=en) and scrape the latest Mars weather tweet from the page. Save the tweet text for the weather report as a variable called `mars_weather`.

From URL to Beautiful Soup object

In [6]:
# URL of page to be scraped
url = 'https://twitter.com/marswxreport?lang=en'

# Browser visit
browser.visit(url)

# Create a Beautiful Soup object
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

Retrieve the latest Mars weather tweet

In [7]:
mars_weather = soup.find("li", class_="js-stream-item").find("p", class_="tweet-text").text

#### 1.4. Mars Facts

* Visit the Mars Facts webpage [here](http://space-facts.com/mars/) and use Pandas to scrape the table containing facts about the planet including Diameter, Mass, etc.

* Use Pandas to convert the data to an HTML table string.

From URL to pandas dataframe

In [8]:
# URL of page to be scraped
url = 'https://space-facts.com/mars/'

# pd.read_html returns a list of dataframes, one per table found on the page; keep the first
table = pd.read_html(url)[0]

# Rename table columns
table.rename(columns={0:"metric", 1:"value"}, inplace=True)

Convert dataframe to HTML table string

In [9]:
table_html = table.to_html(index=False)

# Strip unwanted newlines to clean up the table
table_html = table_html.replace('\n', '')

In [10]:
# Strip table tag for easier html formatting
table_html = table_html.replace("<table border=\"1\" class=\"dataframe\">", "").replace("</table>", "").strip()
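
The exact-string replacement above only matches when the opening tag is precisely `<table border="1" class="dataframe">`; if `to_html` is called with different `border` or `classes` arguments, the replacement silently does nothing. A minimal sketch of an attribute-independent alternative follows. The helper name `strip_table_wrapper` is an assumption, and the sample string is a single hand-written row standing in for the real table.

```python
import re


def strip_table_wrapper(table_html):
    """Remove the outer <table ...> and </table> tags, keeping the inner markup."""
    # Drop the opening tag whatever attributes it carries
    inner = re.sub(r"^\s*<table\b[^>]*>", "", table_html, count=1)
    # Drop the closing tag at the end of the string
    inner = re.sub(r"</table>\s*$", "", inner, count=1)
    return inner.strip()


# Hand-written sample row, for illustration only
sample = (
    '<table border="1" class="dataframe">  '
    "<tbody><tr><td>Diameter:</td><td>6,779 km</td></tr></tbody></table>"
)
print(strip_table_wrapper(sample))
```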

#### 1.5. Mars Hemispheres

* Visit the USGS Astrogeology site [here](https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars) to obtain high resolution images for each of Mars's hemispheres.

* You will need to click each of the links to the hemispheres in order to find the image url to the full resolution image.

* Save both the image url string for the full resolution hemisphere image, and the Hemisphere title containing the hemisphere name. Use a Python dictionary to store the data using the keys `img_url` and `title`.

* Append the dictionary with the image url string and the hemisphere title to a list. This list will contain one dictionary for each hemisphere.

Get child website links from parent website

In [11]:
# URL of page to be scraped
url_parent = 'https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars'

# Browser visit
browser.visit(url_parent)

# Create a Beautiful Soup object
html = browser.html
soup = BeautifulSoup(html, 'html.parser')

# Child website links for each hemisphere
base_url = "https://astrogeology.usgs.gov"
links = [base_url + item.find(class_="description").a["href"] for item in soup.find_all("div", class_="item")]

Extract hemisphere title and image url from each child website

In [12]:
hemisphere_image_urls = []

for url in links:
    
    # from url to soup
    browser.visit(url)
    html = browser.html
    soup = BeautifulSoup(html, 'html.parser')
    
    # Extract data
    title = soup.find("div", class_="content").find("h2", class_="title").text.replace(" Enhanced", "")
    img_url = base_url + soup.find("img", class_="wide-image")["src"]
    
    # Store in list
    hemisphere_image_urls.append({"title": title, "img_url": img_url})
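
The title clean-up and URL construction in the loop can be isolated in a small helper, which makes them checkable without a browser session. The sketch below is one possible shape, not the notebook's own method: the name `hemisphere_record` and the sample values are hypothetical. It removes " Enhanced" only when it is a trailing suffix and uses `urljoin` so that an already-absolute `src` is left intact.

```python
from urllib.parse import urljoin

BASE_URL = "https://astrogeology.usgs.gov"


def hemisphere_record(raw_title, img_src, base_url=BASE_URL):
    """Build one {"title", "img_url"} dictionary from scraped values."""
    title = raw_title.strip()
    suffix = " Enhanced"
    # Remove the suffix only when it ends the title
    if title.endswith(suffix):
        title = title[: -len(suffix)]
    return {"title": title, "img_url": urljoin(base_url, img_src)}


# Hypothetical scraped values, for illustration only
record = hemisphere_record("Cerberus Hemisphere Enhanced", "/cache/images/EXAMPLE_full.jpg")
print(record)
```

Inside the loop, the append would then read `hemisphere_image_urls.append(hemisphere_record(title_text, img_src))`, where `title_text` and `img_src` stand for the values extracted from the soup object.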

Quit browser