# CS 109A/AC 209A/STAT 121A Data Science: Midterm I
**Harvard University**<br>
**Fall 2016**<br>
**Instructors: W. Pan, P. Protopapas, K. Rader**

---

### Basic Information

**Name:** Farmer, Rick

**Course Number:** CS 109a

**Note:** 

- _All data sets can be found in the `datasets` folder_

---

Import libraries

In [178]:
import time
import random
import numpy as np
import pandas as pd
import scipy as sp
from sklearn.linear_model import LinearRegression as Lin_Reg
from sklearn.linear_model import Ridge as Ridge_Reg
from sklearn.linear_model import Lasso as Lasso_Reg
from statsmodels.regression.linear_model import OLS
import sklearn.preprocessing as Preprocessing
import itertools as it
from itertools import combinations

from bs4 import BeautifulSoup
import urllib
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries
import requests

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cmx
import matplotlib.colors as colors

from mpl_toolkits.mplot3d import Axes3D

from __future__ import division # Python 2.7 uses integer division by default, so that 5 / 2 equals 2; this fixes that

%matplotlib inline

![Asteroid](https://github.com/cs109Alabs/lab_files/blob/master/astroid.gif?raw=true)

# Save the World! 



It is Oct 13, 2016. NASA's radars have discovered a small, 3-meter, iron-based meteorite that has just entered the Earth's atmosphere. A small meteorite will not cause widespread devastation, but it will still be dangerous for citizens. Local authorities would like to know the location of the impact point so that they can warn residents and allocate resources based on the population affected.


The Governor has sought out the best data scientist in the state - you - to help save the day!

You are given two data sets:


1.  Radar position estimates (x, y, z - coordinates; z being the altitude) of the meteorite at various times are available here (https://cs109alabs.github.io/lab_files/). x, y, z coordinates are in kilometers and time is in seconds. 

2. Locations and other details of every dwelling in the town are provided here.


I. Use methods you learned in class to estimate the expected point of impact, along with the region that contains the impact point with 90% certainty.


II. Using the dwelling database, estimate the total number of people that will most likely be affected within this region.


AC209 students only: Additional measurements from another radar are available here. The accuracy of this radar is approximately 5 times higher than that of the first radar. Your model should take into account both radar data sets.
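One way to approach task I, sketched under the assumption that the meteorite moves roughly linearly in each coordinate over the observation window: fit a straight line to each coordinate as a function of time, solve for the time at which the fitted altitude reaches zero, and evaluate the fitted x and y lines at that time. The track below is synthetic, and `estimate_impact` is a hypothetical helper, not part of the exam materials. It yields only a point estimate; the 90% region still requires an uncertainty analysis on top of it.

```python
import numpy as np

def estimate_impact(t, x, y, z):
    """Fit a line to each coordinate vs. time and extrapolate to z = 0.

    Hypothetical helper: assumes linear motion in x, y, and z.
    """
    # np.polyfit with degree 1 returns (slope, intercept)
    vx, x0 = np.polyfit(t, x, 1)
    vy, y0 = np.polyfit(t, y, 1)
    vz, z0 = np.polyfit(t, z, 1)
    # Time at which the fitted altitude line crosses zero
    t_impact = -z0 / vz
    return t_impact, x0 + vx * t_impact, y0 + vy * t_impact

# Synthetic, noise-free track (not the radar data): starts at altitude 1000
# and descends at 10 units per second
t = np.arange(0.0, 50.0, 10.0)
x = 5.0 * t
y = 2.0 * t + 1.0
z = 1000.0 - 10.0 * t

t_hit, x_hit, y_hit = estimate_impact(t, x, y, z)
print((float(t_hit), float(x_hit), float(y_hit)))
```

With real, noisy radar measurements the fitted slopes and intercepts carry uncertainty, which is what the 90% region has to reflect.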

## Load and Examine the Population Data

In [179]:
# Open the file containing the population data
population_df = pd.read_csv('datasets/pop_data.csv')

# Display the dimensions of the data with a pretty format
print "Population data dimensions:"
print population_df.shape
print "\n"

# Display the first five rows of data
print "First five rows:"
population_df.head()


Population data dimensions:
(2417, 5)


First five rows:


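| Unnamed: 0 | residents | bed | bath | x | y |
|---|---|---|---|---|---|
| 0 | 3.0 | 4 | 3 | 7201.6 | 6752.56 |
| 1 | 2.0 | 2 | 1 | 7079.68 | 6622.32 |
| 2 | 4.0 | 2 | 1 | 7154.4 | 6683.28 |
| 3 | 2.0 | 1 | 2 | 7093.44 | 6680.56 |
| 4 | 1.0 | 2 | 2 | 7198.72 | 6674.96 |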
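For task II, once an impact point and region are available, the dwelling table can be filtered by distance. The following is a minimal sketch, assuming a circular region and assuming that `x` and `y` in the population data share the radar's coordinate system and units (the exam text confirms neither). The helper `residents_within` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def residents_within(df, cx, cy, radius):
    """Sum `residents` over dwellings within `radius` of (cx, cy)."""
    dist = np.sqrt((df['x'] - cx) ** 2 + (df['y'] - cy) ** 2)
    # Series.sum() skips missing resident counts (NaN) by default
    return df.loc[dist <= radius, 'residents'].sum()

# Made-up dwellings using the same column names as pop_data.csv
sample = pd.DataFrame({
    'residents': [3.0, 2.0, 4.0],
    'x': [0.0, 3.0, 10.0],
    'y': [0.0, 4.0, 10.0],
})

total = residents_within(sample, 0.0, 0.0, 5.0)
print(total)
```

Only the first two sample dwellings lie within the radius of 5, so the third is excluded from the sum.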


## Load and Examine the Radar Position Data

In [199]:
# Get the Asteroid radar data from the lab repository on GitHub
radar = requests.get("https://raw.githubusercontent.com/cs109Alabs/lab_files/master/index.html")

# Use Beautiful Soup to parse the html text into a DOM object
soup = BeautifulSoup(radar.text, 'html.parser')

# Prettify the parse tree returned by Beautiful Soup
parse_tree = soup.prettify()

# Print the first 101 characters of the parse tree to see if it is there
print(parse_tree[:101])

<html>
 <head>
  <title>
   Save the World!
  </title>
 </head>
 <style type="text/css">
  .tg  {bord


In [200]:
# Let's make sure that the HTML elements parsed correctly and that we are able to access them

# Display the text of the title tag
print "\n# Text of the <title> tag is:\n"
print soup.html.head.title.text

# Display the children of the table tag (first 201 characters)
print "\n\n# Each child of the <table> tag:"
print ''.join(map(str, soup.html.table.children))[:201]


# Text of the <title> tag is:

Save the World!


# Each child of the <table> tag:

<tr>
<th class="tg-yw4l">Time</th>
<th class="tg-yw4l">X-Coord</th>
<th class="tg-yw4l">Y-Coord</th>
<th class="tg-yw4l">Z-Coord</th>
</tr>
<tr>
<td class="tg-yw4l">0.000000000000000000e+00</td>
<td c


In [205]:
# Find the first "table" element, then collect all of its "tr" rows;
# the [1:] slice skips the header row
rows = soup.find("table").find_all("tr")[1:]

# Define a function that takes the list of "td" cells in one row
# and converts the text of each of the four cells to a float
def cleaner(r):
    time = float(r[0].get_text())
    x = float(r[1].get_text())
    y = float(r[2].get_text())
    z = float(r[3].get_text())
    return [time, x, y, z]

# Next we'll create a list of names that will be used as dictionary keys.
fields = ["time", "x", "y", "z"]

# The zip function creates a list of pairs; which the dict function then uses
# to create a dictionary, using the first element of the pair as the key and the second as
# the value; and finally, the list comprehension iterates over each row element, and puts
# the result of each iteration on a list, which is then bound to the radar variable.
radar = [dict(zip(fields, cleaner(row.find_all("td")))) for row in rows]

# Print the first five elements
radar[:5]

[{'time': 0.0,
  'x': 48.829114978261245,
  'y': 8.475320943737545,
  'z': 17005.09768524586},
 {'time': 10.0,
  'x': 69.15037477111318,
  'y': 69.21075417425614,
  'z': 16941.295532073807},
 {'time': 20.0,
  'x': 177.87772942855491,
  'y': 134.0117758000764,
  'z': 16831.33032964538},
 {'time': 30.0,
  'x': 199.73400225858478,
  'y': 220.7435235846259,
  'z': 16569.07798159552},
 {'time': 40.0,
  'x': 278.017525735722,
  'y': 221.40312090873607,
  'z': 16849.67301245895}]
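A list of dictionaries like the one above converts directly to a pandas DataFrame, which may be more convenient for later regression steps. The following is a small sketch using the first two records from the output above; the name `radar_df` is an assumption and is not defined elsewhere in the notebook.

```python
import pandas as pd

fields = ["time", "x", "y", "z"]
records = [
    {'time': 0.0, 'x': 48.829114978261245,
     'y': 8.475320943737545, 'z': 17005.09768524586},
    {'time': 10.0, 'x': 69.15037477111318,
     'y': 69.21075417425614, 'z': 16941.295532073807},
]

# Passing columns= fixes the column order regardless of dictionary ordering
radar_df = pd.DataFrame(records, columns=fields)
print(radar_df.shape)
```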

In [207]:
# We want to keep the radar data safe in case it becomes unavailable online at some point
import json

# Write the radar data to the local file system as a JSON file
with open("datasets/radar.json", "w") as fd:
    json.dump(radar, fd)
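To confirm that a saved file can be restored, the records can be read back with `json.load`. The following is a minimal round-trip sketch that writes to a temporary file rather than `datasets/radar.json` and uses a one-record stand-in for the radar list.

```python
import json
import os
import tempfile

records = [{'time': 0.0, 'x': 48.829114978261245,
            'y': 8.475320943737545, 'z': 17005.09768524586}]

# Write to a temporary directory so the sketch does not touch the datasets folder
path = os.path.join(tempfile.mkdtemp(), "radar.json")
with open(path, "w") as fd:
    json.dump(records, fd)

# Read the file back and compare it with what was written
with open(path) as fd:
    restored = json.load(fd)

print(restored == records)
```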

### Part (c): Simple Data Visualization

Visualize the data using a 3-D scatter plot. How does your visual analysis compare with the stats you've computed in Part (b)?

In [90]:
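# Assumes the undefined `data` was meant to be the radar records parsed above
x = [r["x"] for r in radar]
y = [r["y"] for r in radar]
z = [r["z"] for r in radar]
fig = plt.figure()
# A single subplot: 1 row, 1 column, first position
axes = fig.add_subplot(1, 1, 1, projection='3d')
axes.scatter(x, y, z)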

### Part (d): Simple Data Visualization (Continued)

Visualize two data attributes at a time,

1. maternal age against birth weight
2. maternal age against femur length
3. birth weight against femur length

using 2-D scatter plots.

Compare your visual analysis with your analysis from Part (b) and (c).

### Part (e): More Data Visualization

Finally, we want to visualize the data by maternal age group. Plot the data again using a 3-D scatter plot, this time, color the points in the plot according to the age group of the mother (e.g. use red, blue, green to represent group I, II and III respectively).

Compare your visual analysis with your analysis from Part (a) - (c).