# Data Visualization Capstone Project

## Project Introduction

This notebook contains the work for the capstone project named above. The requirement is to select a dataset or visualization from the [Makeover Monday](https://www.makeovermonday.co.uk/data/) site and improve on it by creating a different dashboard, data presentation, or animated story.

It will be accompanied by a blog post that reflects on what improvements were made.

The required steps are:
- Step 1
    - Choose a dataset from the Makeover Monday site
    - Capture both the source article/visualization and the source data
- Step 2
    - Explore the data to identify any limitations and biases that can occur in collection, processing and insights
    - Document these in the blog
- Step 3
    - Dashboard: Define what questions the dashboard user will be able to answer and include in the blog
- Step 4
    - Complete the analysis and visualization
    - Include a link in your animated data story
- Step 5
    - Explain why your visualization is unique and improves on the original
    - Describe how annotations, chart choice, alignment, and layout make your data visualization better than the current version

## Dataset Introduction

The dataset I will be using, including the original visualization, can be found [here](https://www.nasa.gov/mission_pages/station/spacewalks/).

The current visualization looks like this:

!['image'](nasa_spacewalks.PNG)

The data is found in text form on the site and will be gathered by scraping the page.

## Gathering

In [2]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [6]:
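# Download the spacewalks page and parse the returned HTML
r = requests.get('https://www.nasa.gov/mission_pages/station/spacewalks/')
soup = BeautifulSoup(r.text, 'lxml')
# Optional: drop <script> and <style> elements before extracting the text
#for script in soup(['script', 'style']):
#    script.decompose()

# Print all of the text found in the page
print(soup.get_text())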

{
    "@context": "http://schema.org",
    "@graph": [
        {
            "@type": "Article",
            "headline": "Space Station Spacewalks",
            "name": "Space Station Spacewalks",
            "description": "There have been 217 spacewalks at the International Space Station.",
            "author": {
                "@type": "Person",
                "name": "Mark Garcia"
            },
            "publisher": {
                "@type": "Organization",
                "@id": "https://www.nasa.gov",
                "name": "NASA",
                "url": "https://www.nasa.gov",
                "sameAs": "https://twitter.com/nasa,https://www.facebook.com/nasa,https://instagram.com/nasa,https://plus.google.com/+NASA",
                "logo": {
                    "@type": "ImageObject",
                    "url": "https://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg",
                    "width": "110",
                 

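The original plan was to scrape the data directly from the website. However, the page is built in such a way that the text returned by a simple request (see the output above) does not contain the spacewalk data in a usable form, so scraping was not possible.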
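
One way to see what the downloaded HTML actually offers is to pull out the embedded metadata on its own. The sketch below is a minimal example using only the standard library, not the approach taken in this notebook. It assumes the JSON shown in the output above sits in a `<script type="application/ld+json">` element, and `sample_html` is a hypothetical, heavily shortened stand-in for `r.text`; the `JsonLdCollector` class is an illustrative helper.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON-LD script element opens
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script element closes
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._chunks)))
            self._in_json_ld = False


# Hypothetical stand-in for the downloaded page (assumption: JSON-LD script tag)
sample_html = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org",
 "@graph": [{"@type": "Article",
             "headline": "Space Station Spacewalks",
             "description": "There have been 217 spacewalks at the International Space Station."}]}
</script>
</head><body></body></html>
"""

collector = JsonLdCollector()
collector.feed(sample_html)
article = collector.blocks[0]["@graph"][0]
print(article["headline"])
print(article["description"])
```

On the real page, `collector.feed(r.text)` would be the equivalent call. If only summary metadata like this comes back, the detailed spacewalk listings have to be gathered some other way.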

Instead I:
1. Copied and pasted the page data into Word
2. Deleted all of the pictures and their associated text
3. Extracted all of the hyperlinks ([Reference](https://answers.microsoft.com/en-us/msoffice/forum/all/how-do-i-extract-all-hyperlinks-from-word-document/cb1a57aa-a79f-40a3-a42c-5217433ce746)) and made sure they were in the correct order ([Reference](https://wordribbon.tips.net/T004803_Reversing_All_the_Paragraphs_in_a_Document.html))
4. Collected the hyperlinks and the text information into two CSV files

In [7]:
walks = pd.read_csv('spacewalks.csv', header=None)
walks.head()

Unnamed: 0,0,1,2
0,2019.0,Mission: Expedition 59,
1,2019.0,"Date: April 8, 2019",
2,2019.0,"Duration: 6 hours, 29 minutes",
3,2019.0,"Spacewalkers: Anne McClain, David Saint-Jacques",
4,2019.0,Mission: Expedition 59,
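
The rows above follow a repeating `Key: value` pattern, so a natural next step is to fold them into one record per spacewalk. The following is a minimal sketch using only the standard library. It assumes every spacewalk occupies exactly four consecutive rows (Mission, Date, Duration, Spacewalkers), which is only confirmed for the rows shown here; `rows_to_records` and `duration_to_minutes` are hypothetical helper names.

```python
import re
from datetime import datetime

# Text column (column 1) of the first four rows shown by walks.head()
rows = [
    "Mission: Expedition 59",
    "Date: April 8, 2019",
    "Duration: 6 hours, 29 minutes",
    "Spacewalkers: Anne McClain, David Saint-Jacques",
]


def duration_to_minutes(text):
    """Convert a string such as '6 hours, 29 minutes' to total minutes."""
    hours = re.search(r"(\d+)\s*hour", text)
    minutes = re.search(r"(\d+)\s*minute", text)
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


def rows_to_records(lines, fields_per_walk=4):
    """Group consecutive 'Key: value' lines into one dict per spacewalk.

    Assumption: each spacewalk uses exactly `fields_per_walk` rows.
    """
    records = []
    for start in range(0, len(lines), fields_per_walk):
        record = {}
        for line in lines[start:start + fields_per_walk]:
            key, _, value = line.partition(": ")
            record[key.lower()] = value
        records.append(record)
    return records


records = rows_to_records(rows)
walk = records[0]
# Convert the text fields into typed values
walk["date"] = datetime.strptime(walk["date"], "%B %d, %Y").date()
walk["duration_minutes"] = duration_to_minutes(walk["duration"])
walk["spacewalkers"] = walk["spacewalkers"].split(", ")
print(walk)
```

With the full DataFrame, `walks[1].tolist()` could supply `lines`, provided the four-row assumption holds for the whole file.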


In [8]:
links = pd.read_csv('hyperlinks.csv', header=None)
links.head()

Unnamed: 0,0
0,http://www.nasa.gov/mission_pages/station/expe...
1,https://go.nasa.gov/2Df8Zbh
2,https://www.nasa.gov/astronauts/biographies/an...
3,http://www.asc-csa.gc.ca/eng/astronauts/canadi...
4,http://www.nasa.gov/mission_pages/station/expe...
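
Because the links are stored as one flat list, they need to be matched back to the spacewalks they belong to. The sketch below shows one possible way to do that. It assumes each spacewalk contributes two fixed links followed by one link per spacewalker, a layout inferred only from the five rows shown above. The URLs are hypothetical placeholders, and `split_links` is an illustrative helper.

```python
def split_links(links, spacewalker_counts, fixed_links_per_walk=2):
    """Slice a flat list of links into one chunk per spacewalk.

    Assumption: each walk has `fixed_links_per_walk` links followed by
    one link per spacewalker.
    """
    chunks = []
    position = 0
    for count in spacewalker_counts:
        size = fixed_links_per_walk + count
        chunks.append(links[position:position + size])
        position += size
    # Fail loudly if the assumed layout does not account for every link
    if position != len(links):
        raise ValueError("link count does not match the expected layout")
    return chunks


# Hypothetical placeholder URLs standing in for the rows of hyperlinks.csv
links = [
    "https://example.com/mission-a",
    "https://example.com/details-a",
    "https://example.com/astronaut-1",
    "https://example.com/astronaut-2",
    "https://example.com/mission-b",
    "https://example.com/details-b",
    "https://example.com/astronaut-3",
]

chunks = split_links(links, spacewalker_counts=[2, 1])
print(chunks[0])
print(chunks[1])
```

The length check is deliberate: if any spacewalk deviates from the assumed layout, the function raises an error instead of silently misaligning every later record.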
