# Exploring and Visualizing Data: Scatter Plots
In this exercise, we will explore a set of New York City train data. We will then use Python to create a **scatter plot** to visualize that data. 

A scatter plot is used for data with two numeric variables, one plotted on each axis. For example, you could use a scatter plot to show the height (x-axis) and weight (y-axis) of individuals:

![sp]

[sp]: https://chartio.com/images/tutorials/scatter-plot/Scatter-Plot-Weight-and-Height-Scatter-Plot-Trendline.png "Scatter Plot Example"

A scatter plot shows how the two variables are related. To continue with the example above, the scatter plot shows that, in general, weight increases as height increases. The correlation isn’t perfect; some taller people weigh less than some shorter people.
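
To put a number on "in general, weight increases as height increases", the following sketch computes a correlation coefficient. The heights and weights are made-up values generated only for illustration; they are not the data behind the image above, and the slope and noise level are assumptions.

```python
import numpy as np

# Hypothetical data: heights in cm and weights in kg, generated for illustration only.
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=170, scale=10, size=200)
# Weight rises with height on average, plus random scatter around the trend.
weights = 0.9 * heights - 85 + rng.normal(loc=0, scale=8, size=200)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between the two inputs.
r = np.corrcoef(heights, weights)[0, 1]
print(f"Correlation coefficient: {r:.2f}")

# Count ordered pairs in which the taller person is also the lighter one.
taller = heights[:, None] > heights[None, :]
lighter = weights[:, None] < weights[None, :]
exceptions = int(np.sum(taller & lighter))
print(f"Pairs where the taller person weighs less: {exceptions}")
```

A coefficient close to 1 means the points lie near a rising line. A value well below 1, together with a nonzero count of exceptions, corresponds to the imperfect trend described above.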

This exercise has three parts.
1. Create a demo scatter plot using random data.
2. Create a scatter plot that uses train pickup location data.
3. Zoom in on the scatter plot to examine the data more closely.

In this exercise, you will use the following elements. For more information about these elements, see the **Python Documentation** section at the end of the exercise.
* `figure` class 
* `circle` method
* `p.xaxis.axis_label`
* `p.yaxis.axis_label`


## Part 1. Create a Demo Scatter Plot
This demo has two steps.
1. Generate a set of 100 random points and create a table that lists two sample points.
2. Generate a scatter plot based on the random data points.

### Step 1. Generate Points and Create a Table
The following code cell:

* Generates 100 random points, each with an x coordinate and y coordinate.
* Classifies each point as above or below the x-axis (the line y = 0).
* Prints out a table that lists two sample generated points.

This code imports `figure` from the Bokeh plotting library; Step 2 uses it to create the plot. For more information, see **Python Documentation** at the end of the exercise.

To generate random points and create a table that shows two sample points, run the following code cell.

In [None]:
# Scatter Plot Demo
# Import libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
# Bokeh is another plotting library for Python. It renders interactive
# plots in the browser through its JavaScript component, BokehJS.
# It improves aesthetics, simplifies the code, and allows us to plot large
# amounts of data.
from bokeh.plotting import figure, output_notebook, show; output_notebook()

# We create an easy way to return a set of n random numbers.
def rup(n):
    # Returns an array of n random numbers, each drawn uniformly from [-1, 1).
    return np.random.uniform(low=-1, high=1, size=n)
# We initialize 100 random points in the [-1,1] x [-1,1] plane.
data = pd.DataFrame({"X":rup(100), "Y":rup(100)})
# This list will hold our labels for points above and below y=0.
label=[]
# We iterate over the y-values.
for yval in data["Y"]:
    if yval > 0:           # The point is above y=0 if its y-value is greater than 0.
        label.append('+')  # Denote points above the line y=0 as '+'.
    else:
        label.append('-')  # Denote points on or below the line y=0 as '-'.
# Add a column to our data with labels.
data['Label'] = label
# The '+' labels will be grouped, followed by '-' labels.
data.sort_values(by=["Label"], inplace=True)
# This resets the index from disordered (34, 3, 8) to ordered (0, 1, 2).
data.reset_index(drop=True, inplace=True)

# Find the index where the sorted labels switch from '+' to '-'.
ind = 0
# enumerate(['a', 'b', 'c']) yields the pairs (0, 'a'), (1, 'b'), (2, 'c').
for i, val in enumerate(data['Label']):
    # The first '-' label marks the transition.
    if val == '-':
        ind = i
        # 'break' exits the innermost enclosing 'for' loop,
        # even though it sits inside an 'if' block.
        break

# Show the two rows on either side of the switch: the last '+' and the first '-'.
data.iloc[ind-1:ind+1, :]
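
The labeling loop and the search for the transition index can also be written without explicit loops. The following is a minimal sketch of that alternative, not the approach the exercise requires. It builds its own stand-in data, and the names `demo` and `first_minus` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the demo data: 100 random points in [-1, 1).
rng = np.random.default_rng(seed=1)
demo = pd.DataFrame({"X": rng.uniform(-1, 1, 100), "Y": rng.uniform(-1, 1, 100)})

# np.where evaluates the condition for every row at once.
demo["Label"] = np.where(demo["Y"] > 0, "+", "-")

# Sort so that '+' rows come first, then renumber the rows from 0.
demo = demo.sort_values(by="Label").reset_index(drop=True)

# The number of '+' rows is also the index of the first '-' row.
first_minus = int((demo["Label"] == "+").sum())
print(demo.iloc[first_minus - 1:first_minus + 1])
```

Counting the `'+'` rows works here only because the data is already sorted by label, so every `'+'` row precedes every `'-'` row.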

### Step 2. Generate the Scatter Plot
Run the following code cell to create a scatter plot based on the random data that you just generated.

In all likelihood, the points will be randomly distributed and will not show a correlation (as height and weight typically would).

Note the `circle` method in the code cell (the line that begins with `p.circle`). When you create a scatter plot in the exercise below, you will add your own parameters for this method. (We also provide a sample that you can use.)

For more information about the circle method, see **Python Documentation** at the end of the exercise.

In [None]:
# Size of the resulting image, in pixels.
plot_width, plot_height = int(500), int(500)

# These tools will allow you to move around within the plot, zoom in, or reset 
# to the original image.
tools='pan, wheel_zoom, reset'
p = figure(title = '100 Random Points',
           tools=tools, plot_width=plot_width, plot_height=plot_height,
           x_range=(-1,1), y_range=(-1,1))


options1 = dict(line_color=None, fill_color='blue', size=5)
# The circle method plots points as circles; here, the '+' points (rows before ind).
# **options1 passes each dictionary entry as a keyword argument.
p.circle(x=data.iloc[:ind, 0], y=data.iloc[:ind, 1], **options1)

options2 = dict(line_color=None, fill_color='red', size=5)
# The square method plots points as squares; here, the '-' points (rows from ind on).
# **options2 passes each dictionary entry as a keyword argument.
p.square(x=data.iloc[ind:, 0], y=data.iloc[ind:, 1], **options2)

# Creates a NumPy array of evenly spaced values from -1 up to (but not including) 1, in steps of 0.05.
t=np.arange(-1,1,.05)

# This plots y=0 at each point defined in t (-1, -.95, -.90, ...).
p.line(t, np.zeros(len(t)))

# This displays the plot.
show(p)
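
The `**options1` and `**options2` arguments in the cell above rely on Python's dictionary unpacking. The following sketch shows the mechanism in isolation. `describe_marker` is a hypothetical function invented for this illustration; it stands in for a plotting method and is not part of Bokeh.

```python
# Hypothetical function standing in for a plotting method such as p.circle.
def describe_marker(x, y, line_color="black", fill_color="gray", size=1):
    return f"marker at ({x}, {y}): line={line_color}, fill={fill_color}, size={size}"

options = dict(line_color=None, fill_color="blue", size=5)

# These two calls are equivalent: ** expands the dictionary
# into one keyword argument per key.
unpacked = describe_marker(0.5, -0.25, **options)
explicit = describe_marker(0.5, -0.25, line_color=None, fill_color="blue", size=5)
print(unpacked)
print(unpacked == explicit)
```

Keeping the styling in one dictionary makes it easy to reuse the same look across several plotting calls and to change it in a single place.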

## Part 2. Create a Scatter Plot of Train Pickup Location Data

In this part, we will use train data to create a scatter plot of pickup locations. We will add pickup points to a graph, with the longitude as x and latitude as y.

This part has two steps.

1. Load the train data.
2. Create a scatter plot of train pickup location data.

### Step 1. Load the Train Data

Run the following code cell. The code loads the train data into the `df` variable, and then prints a summary of longitude and latitude data.

In [None]:
# Scatter Plot Exercise
# Import libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
# output_notebook() is similar to %matplotlib inline.
# It embeds the plot within the notebook.
from bokeh.plotting import figure, output_notebook, show; output_notebook()

df = pd.read_csv('train.csv')
# describe() prints summary statistics (count, mean, std, min, quartiles, max),
# which gives us a better look at each variable.
print(df['pickup_longitude'].describe(),'\n')
print(df['pickup_latitude'].describe())
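
As a rough guide to reading the `describe()` output, the following sketch runs it on a tiny, made-up series. The five longitude values are hypothetical and chosen so that one of them is an obvious outlier.

```python
import pandas as pd

# Hypothetical longitudes: four clustered values and one distant outlier.
longitudes = pd.Series([-73.99, -73.98, -73.97, -73.96, -121.93],
                       name="pickup_longitude")

stats = longitudes.describe()
print(stats)

# Compare the distance from the minimum to the 25th percentile
# with the spread of the middle half of the data.
gap_low = stats["25%"] - stats["min"]
spread_mid = stats["75%"] - stats["25%"]
print(f"25% - min: {gap_low:.2f}, interquartile range: {spread_mid:.2f}")
```

When the minimum or maximum sits much farther from the quartiles than the quartiles sit from each other, the column probably contains outliers, which is worth knowing before you plot it.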

### Step 2. Create a Scatter Plot of Train Pickup Location Data

To create a scatter plot of train pickup location data, follow these steps.

1. Specify that you want to plot each point as a small circle by adding parameters to the `circle` method. To do this, replace the `p.circle()` line in the code cell with your own code. 
   
   Note: For the `circle` method parameters, use the pickup longitude and latitude of the data sample (`sample['pickup_longitude']` and `sample['pickup_latitude']`).

   To see the code that we used, see **Answer Code** below the code cell.

   For more information about the `circle` method and its parameters, see **Python Documentation** at the end of the exercise.

2. Add labels for the x and y axes. To do this, replace the `p.xaxis.axis_label =` and `p.yaxis.axis_label =` lines in the code cell with your own code.
   
   Note: Remember to enclose the label name in quotation marks (").

   To see the code that we used, see **Answer Code** below the code cell.
   
In the resulting scatter plot, you will see that most pickup locations are closely grouped, with some outliers.

In [None]:
# Size of the resulting image, in pixels.
plot_width, plot_height = int(500), int(500)
p = figure(tools='pan,wheel_zoom,reset', plot_width=plot_width, plot_height=plot_height)

# Turn off gridlines.
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None

# There are more than one million rows in our data.
# Plotting a random sample of 10,000 rows is much faster for your
# computer to display and loses little of the overall pattern.
# Use 'sample' (not 'df') when you're plotting.
sample = df.sample(n=10000)

options = dict(line_color=None, fill_color='blue', size=5)
#
#---------------Enter your code here------------------------#
# Plot each point as a small circle.
p.circle()
# Label the axes.
p.xaxis.axis_label = 
p.yaxis.axis_label = 
show(p)
#-----------------------------------------------------------#


#### Answer Code
We used the following code in the code cell.

```python
#---------------Enter your code here------------------------#
# Plot each point as a small circle.
p.circle(x=sample['pickup_longitude'], y=sample['pickup_latitude'], **options)
# Label the axes.
p.xaxis.axis_label="Longitude"
p.yaxis.axis_label="Latitude"
show(p)
#-----------------------------------------------------------#
```
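
Because `df.sample(n=10000)` picks different rows on every run, your plot will differ slightly each time. The following sketch shows one way to make the draw repeatable with the `random_state` parameter of `DataFrame.sample`. The `full` DataFrame is a hypothetical stand-in built from random numbers, not the train data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full data set: 1,000,000 random coordinates.
rng = np.random.default_rng(seed=2)
full = pd.DataFrame({
    "pickup_longitude": rng.normal(-73.97, 0.04, 1_000_000),
    "pickup_latitude": rng.normal(40.75, 0.03, 1_000_000),
})

# random_state fixes the random draw, so the same rows are picked every run.
subset = full.sample(n=10_000, random_state=42)
repeat = full.sample(n=10_000, random_state=42)

print(len(subset), subset.index.equals(repeat.index))
# The sample mean stays close to the mean of the full data.
print(full["pickup_longitude"].mean(), subset["pickup_longitude"].mean())
```

A fixed seed is useful when you want to compare two plots of the same points; leave it out when you want a fresh sample each time.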

## Part 3. Zoom In on the Scatter Plot

Now that we've created a scatter plot, we can zoom in on the data to get a better view. Your sample may contain an outlier, which makes this especially relevant.

To zoom in on the scatter plot, enter the x and y range that you want in the following code cell, and then run the code.

**Note**: For the `x` and `y` variables, enter two numbers separated by a comma, such as `(1, 2)`. You can also use negative numbers. Examine the ranges on the x and y axes in the scatter plot above, and use values for the `x` and `y` variables below that are within those ranges. You can also try different ranges to zoom in on different parts of the data.

To see the code that we used, see **Answer Code** below the code cell.

In [None]:

#----------------Enter your code here-------------#
x=()
y=()
#-------------------------------------------------#
c = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
           x_range=x, y_range=y)

sample = df.sample(n=10000)

options = dict(line_color=None, fill_color='blue', size=5)
# Plot each point as a small circle.
c.circle(x=sample['pickup_longitude'], y=sample['pickup_latitude'], **options)
c.xaxis.axis_label="Longitude"
c.yaxis.axis_label="Latitude"
show(c)

#### Answer Code
We used the following code in the code cell.

```python
#----------------Enter your code here-------------#
x=(-74.1,-73.9)
y=(40.7, 40.8)
#-------------------------------------------------#
```
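
If you would rather not find the ranges by trial and error, one option is to derive them from percentiles of the sampled coordinates, so that a few extreme points do not stretch the axes. The following is a sketch under that assumption. The helper `percentile_range`, its 2% and 98% cutoffs, and the made-up coordinates are all hypothetical.

```python
import pandas as pd

# Hypothetical sample: 99 nearby points plus a single distant outlier.
lons = pd.Series([-74.00 + 0.001 * k for k in range(99)] + [-121.93])
lats = pd.Series([40.70 + 0.001 * k for k in range(99)] + [37.39])

def percentile_range(values, lower=0.02, upper=0.98):
    """Return (low, high) bounds that drop the most extreme values."""
    return (float(values.quantile(lower)), float(values.quantile(upper)))

# Each result is a (low, high) tuple, the same shape as the x and y
# variables entered in the exercise cell.
x = percentile_range(lons)
y = percentile_range(lats)
print(x, y)
```

With the real data, you would pass `sample['pickup_longitude']` and `sample['pickup_latitude']` to the helper instead of the made-up series.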

## Python Documentation
Run the following cell to access documentation for the `figure` class.

In [None]:
from bokeh.plotting import figure
?figure

Run the following cell to access documentation for the `circle` method.

In [None]:
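# Omit the parentheses: we look up the method itself, not the result of calling it.
# This requires that `p` was created in an earlier cell.
?p.circle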
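
The `?` syntax works only in IPython and Jupyter. In a plain Python session, the built-in `help` function and the standard `inspect` module provide similar information. The following sketch uses standard-library objects as examples, so it runs without Bokeh; you could pass `figure` or `p.circle` in the same way.

```python
import inspect

# help(obj) prints the documentation interactively;
# inspect.getdoc returns the same docstring as a string.
doc = inspect.getdoc(sorted)
print(doc.splitlines()[0])

# inspect.signature shows the parameters that a function accepts.
signature = inspect.signature(inspect.getdoc)
print(signature)
```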