#**Module 2: Distance**
In this module, you will learn how to

* Describe what Euclidean and Manhattan Distance are
* Compute Euclidean Distance between two data series
* Compute Manhattan Distance between two data series
* Use them to interpret data in the adult dataset

**Be sure to expand all the hidden cells, run all the code, and do all the exercises--you will need the techniques for the lesson lab!**

#What is Distance?#
Let's say you want to buy a car. You probably know what you're looking for already: You know the color, the make, and the model year you want, and the budget you have available. What you don't know just yet is, of course, the exact price--that's why you're shopping around. 

Let's say you are looking for a red 2020 Mercedes for less than $5,000. Here are your shopping notes:

* Dealer 1: Red 2012 Audi $4,500

* Dealer 2: Silver 2008 Mercedes $8,000

* Dealer 3: Red 2020 Chevrolet $4,999

Which car are you going to buy?

If you were going to use Distance (i.e. **difference**) and a simple boolean algorithm to match these criteria, here is what this would look like:

* **Color**: Dealers 1 and 3 have red cars, so they get a 0 for red; Dealer 2 gets a 1 because their car is silver.

* **Year**: Dealers 1 and 2 get a 1 because their cars are older; only Dealer 3 gets a 0 because their car is from 2020.

* **Make**: Dealers 1 and 3 get a 1 because you were not looking for an Audi or a Chevrolet; you were looking for a Mercedes, which you found at Dealer 2 (who gets a 0).

* **Price**: Dealer 2 is too expensive (1); only Dealers 1 and 3 match your criteria (both get a 0). 

Now we total up the points: Dealer 1 gets 2 points (Year and Make are different); Dealer 2 gets 3 points (Color, Year, and Price are different); Dealer 3 gets 1 point (only the Make is different)--so, the red 2020 Chevy for $4,999 from Dealer 3 is **CLOSEST** to what you were looking for. For all the others, the Distance from your original search is bigger.

This example is obviously a gross simplification, but it illustrates one thing: **The closer two data points, or even two data series, are--that is, the more they are alike--the smaller the distance between the two.** 
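The dealer scoring above can be sketched in a few lines of Python. This is only an illustration of the counting idea: the dictionaries `wanted` and `dealers` and the helper `mismatch_count` are made-up names, and the rule "less than $5,000" is taken from the shopping notes.

```python
# What we are looking for (price must be LESS than max_price)
wanted = {"color": "red", "year": 2020, "make": "Mercedes", "max_price": 5000}

# What each dealer offers, copied from the shopping notes
dealers = {
    "Dealer 1": {"color": "red", "year": 2012, "make": "Audi", "price": 4500},
    "Dealer 2": {"color": "silver", "year": 2008, "make": "Mercedes", "price": 8000},
    "Dealer 3": {"color": "red", "year": 2020, "make": "Chevrolet", "price": 4999},
}

def mismatch_count(car, wanted):
    """Add 1 for every criterion the car does not meet (0 = match, 1 = mismatch)."""
    return (
        int(car["color"] != wanted["color"])
        + int(car["year"] != wanted["year"])
        + int(car["make"] != wanted["make"])
        + int(car["price"] >= wanted["max_price"])
    )

# Score every dealer and pick the one with the smallest distance
scores = {name: mismatch_count(car, wanted) for name, car in dealers.items()}
closest = min(scores, key=scores.get)

print(scores)
print("Closest match:", closest)
```

The smallest score wins, which is exactly the "smaller distance means more alike" idea.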

There are two major mathematical ways to measure distance in a two-dimensional plane: Euclidean and Manhattan. There are also measures that generalize the two, such as the Minkowski distance. For now, we're keeping things simple.

Before we get started, let's set up our environment, though.

#0. Preparation and Setup#
First, we need to import our basic packages again: pandas, numpy, and matplotlib. Then we'll read in our data file.

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

Now, we are going to read in the adult dataset again from the instructor's GitHub. When you work with your own dataset, the name of the dataframe and the URL will change based on what you choose to call your own dataframe and what you have named your own repository on GitHub (required: yourlastname_IT533)

In [2]:
adult = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/adult.data.simplified.csv")

#1. Euclidean Distance#
The Euclidean distance between two points in either the plane or 3-dimensional space measures the length of a segment connecting the two points. It is the most obvious way of [representing distance between two points](https://github.com/shstreuber/Data-Mining/blob/master/images/Euclidean_distance.png):

<img src="https://github.com/shstreuber/Data-Mining/blob/master/images/Euclidean_distance.png" width=200 height=200>

As you can see, in this graphic, the Pythagorean theorem leads us to calculate side d like this:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
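To see the formula at work on a single pair of points, here is a small worked example. The two points (1, 2) and (4, 6) are made up for illustration; they are not from the graphic.

```python
import math

# Two made-up points in the plane
x1, y1 = 1, 2
x2, y2 = 4, 6

# Pythagorean theorem: square the differences, add them, take the square root
d = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

print("Euclidean distance:", d)  # 5.0
```

The horizontal difference is 3 and the vertical difference is 4, so the direct connection has length 5.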

##And how does this work in Python?##
It's not that hard to compute the Euclidean distance with straight-up Python math, as long as you take your time and go stepwise. Pandas and numpy help us with that. The basic principle is shown below, first with dummy data and then with the adult dataset (explanations are in comments). 

It's ok to skip ahead to the section with the adult dataset, if you are out of time.

##1.1 Euclidean distance using straightforward math##
We will set up a quick dummy dataset and do the math as the second step.

In [35]:
# Here, I am creating a quick dummy dataset. You won't need to do this with the adult dataset
# or with any data series that is already formatted as a series.

# We build two series called x and y
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([6, 7, 8, 9, 10])

# Let's check what x and y contain:
print("Series 1:")
print(x)
  
print("Series 2:")
print(y)

Series 1:
0    1
1    2
2    3
3    4
4    5
dtype: int64
Series 2:
0     6
1     7
2     8
3     9
4    10
dtype: int64


Alright. The dataset is in place and correctly formatted. Now we start setting up the Euclidean distance formula. 

In [36]:
# p1 is the sum of the squared values in series x (each value is called a)
# p2 is the sum of the squared values in series y (each value is called b)
# Note: p1 and p2 are shown for illustration only; the distance below does not use them
p1 = np.sum([(a * a) for a in x])
p2 = np.sum([(b * b) for b in y])

# Now we can build the formula from the squared differences
# Python's built-in zip() function makes it easy to iterate through x and y in pairs
dist = np.sqrt(np.sum([(a-b)*(a-b) for a, b in zip(x, y)])) 

# So, what's the distance?:
print("Euclidean distance between our two series is:", dist)

Euclidean distance between our two series is: 11.180339887498949


The entire operation took 3 lines of code. That was ... fun? Maybe? Is there an easier way?

##1.2 Euclidean distance using dot product##
Now let's do this math differently, not with pandas, but with numpy. For this, we will need our data to look like an **array** and not like a series. **Arrays are one of the most common ways to work with data in Python.**

In [37]:
# First, we build our data again, but this time as arrays. This will allow us to do simple vector math!

point1 = np.array((1, 2, 3, 4, 5))
point2 = np.array((6, 7, 8, 9, 10))

# Let's check what point1 and point2 contain:
print("Array 1:")
print(point1)
  
print("Array 2:")
print(point2)

Array 1:
[1 2 3 4 5]
Array 2:
[ 6  7  8  9 10]


In [39]:
# Now we subtract point 2 from point 1.
temp = point1 - point2
  
# Then we use the dot product to find the sum of squares
sum_sq = np.dot(temp.T, temp)
  
# All we need now is to take the square root of the sum of squares
print("Euclidean distance between our two arrays is:",np.sqrt(sum_sq))

Euclidean distance between our two arrays is: 11.180339887498949


Still three lines. We can do better.

##1.3 Euclidean distance using linalg.norm##
You didn't sign up for this course to program straightforward math, right? Numpy contains the very convenient [linalg.norm function](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html):

In [40]:
dist = np.linalg.norm(point1 - point2)
  
# printing Euclidean distance
print("Euclidean distance between our two arrays is:",dist)

Euclidean distance between our two arrays is: 11.180339887498949


One line (and one to print the result)--now we're talking!
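If you want to convince yourself that the three approaches really compute the same number, a quick check like the following can help. This is just a sketch on the same dummy data; the variable names `d_loop`, `d_dot`, and `d_norm` are made up for this comparison.

```python
import numpy as np

point1 = np.array((1, 2, 3, 4, 5))
point2 = np.array((6, 7, 8, 9, 10))

# 1.1: loop over the pairs and sum the squared differences
d_loop = np.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

# 1.2: dot product of the difference vector with itself
diff = point1 - point2
d_dot = np.sqrt(np.dot(diff, diff))

# 1.3: numpy's built-in vector norm
d_norm = np.linalg.norm(point1 - point2)

# np.isclose compares floating point numbers with a small tolerance
print(np.isclose(d_loop, d_dot) and np.isclose(d_dot, d_norm))
```

All three are the square root of 125, so the check prints `True`.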

##1.4 Euclidean Distance and the adult dataset##
Now let's try this on the adult dataset.

Our question to ask: Does incomeUSD depend more on a person's age or on their educationyears? In other words, we want to find out which of these two attributes is closer to incomeUSD.

**First, age**:

In [13]:
point1 = np.array((adult.age))
point2 = np.array((adult.incomeUSD))
  
# Let's check what point1 and point2 contain:
print("Array 1:")
print(point1)
  
print("Array 2:")
print(point2)

Array 1:
[39 50 38 ... 58 22 52]
Array 2:
[ 43747  38907  25055 ...  46073  29618 196782]


In [5]:
dist = np.linalg.norm(point1 - point2)
  
# printing Euclidean distance
print("Euclidean distance between age and incomeUSD is:", dist)

Euclidean distance between age and incomeUSD is: 12969807.476741742
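If you find yourself repeating the "convert to arrays, then call linalg.norm" steps, you could wrap them in a small helper. The sketch below uses a made-up dataframe `toy` with made-up columns `hours` and `score`, and the function name `euclidean_between` is hypothetical as well; none of these come from the adult dataset.

```python
import numpy as np
import pandas as pd

# A tiny made-up dataframe, only for illustration
toy = pd.DataFrame({"hours": [1, 2, 3], "score": [4, 6, 3]})

def euclidean_between(df, col_a, col_b):
    """Return the Euclidean distance between two numeric columns of df."""
    a = df[col_a].to_numpy()  # column -> numpy array
    b = df[col_b].to_numpy()
    return np.linalg.norm(a - b)

toy_dist = euclidean_between(toy, "hours", "score")
print("Euclidean distance between hours and score:", toy_dist)
```

Here the differences are -3, -4, and 0, so the distance is the square root of 9 + 16 + 0, which is 5.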


##**Your Turn**
In the space below, do the same transformation (that's a data-sciency way to say "ensure your data is in the right format, here an array") and calculation, but for the educationyears attribute in comparison to incomeUSD.

First, transform the data into array format:

Second, use linalg (just like I did above) or dot product (just for fun) to calculate the Euclidean distance:

Third, use your good judgment and compare the Euclidean distance I have calculated comparing age and incomeUSD with the Euclidean distance you have calculated comparing educationyears and incomeUSD. A smaller distance number shows you that the two attributes are more closely related. A larger distance number shows you that they are not as closely related.

**Now you can answer our question**: Which distance is smaller? In other words, what attribute determines a person's incomeUSD more--age or educationyears? Type your answer in the text box below:

#2. Manhattan Distance#
The Manhattan distance, also often called rectilinear or city block distance, between two points is measured along axes at right angles. In a plane with p1 at (x1, y1) and p2 at (x2, y2), it is 

$$|x_1 - x_2| + |y_1 - y_2|$$

In our [graphic](https://github.com/shstreuber/Data-Mining/blob/master/images/Euclidean_distance.png), the measurement goes at an angle through the lower right-hand point.
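Here is the Manhattan formula on one pair of points, next to the Euclidean distance for comparison. The points (1, 2) and (4, 6) are made up for illustration.

```python
import math

# Two made-up points in the plane
x1, y1 = 1, 2
x2, y2 = 4, 6

# Manhattan: walk along the axes, so add up the absolute differences
manhattan = abs(x1 - x2) + abs(y1 - y2)

# Euclidean: the direct connection, for comparison
euclidean = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

print("Manhattan distance:", manhattan)  # 7
print("Euclidean distance:", euclidean)  # 5.0
```

Going around the corner (3 + 4 = 7) is longer than the direct connection (5), so for the same two points the Manhattan distance is never smaller than the Euclidean distance.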






##2.1 Math? Not again, please!
Ok, ok, I get it: If you wanted to calculate things with complicated formulas, you would have taken a Math class and not a class in which we should be using easy programming methods. Fine, then. BUT we'll need a different Python library to accomplish "no math": 

**The [SciPy library](https://www.scipy.org/) does all the math for us.**

In [9]:
import scipy.spatial.distance as dist

Now, we set up our data. You already know this from section 1 above.

In [16]:
point1 = np.array((1, 2, 3, 4, 5))
point2 = np.array((6, 7, 8, 9, 10))

print ('Here is our sample data')
print ('------------------------')
print ("Array 1:", point1)
print ("Array 2:", point2)

Here is our sample data
------------------------
Array 1: [1 2 3 4 5]
Array 2: [ 6  7  8  9 10]


Time to program our algorithm!

In [30]:
print ("Manhattan Distance:", dist.cityblock(point1,point2))

Manhattan Distance: 25


**Wait, WHAT?** That was quick! Can we do that with Euclidean Distance, too?

In [31]:
print ("Euclidean Distance:", dist.euclidean(point1,point2))

Euclidean Distance: 11.180339887498949


One line! No math! So short! So elegant! And, of course, the Manhattan distance (25) is different from the Euclidean distance (about 11.18) because Manhattan does not measure the direct connection; it goes around the "cityblocks" to get from our first to our second point.

**We have reached the goal!**
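If you are curious what happens under the hood, the same Manhattan number can be reproduced with numpy alone. This is a sketch on the same sample data; the names `manhattan_sum` and `manhattan_norm` are made up. It relies on `np.linalg.norm` accepting `ord=1` for vectors, which sums the absolute values.

```python
import numpy as np

point1 = np.array((1, 2, 3, 4, 5))
point2 = np.array((6, 7, 8, 9, 10))

# Option 1: absolute differences, then sum them up
manhattan_sum = np.sum(np.abs(point1 - point2))

# Option 2: the 1-norm of the difference vector (returns a float)
manhattan_norm = np.linalg.norm(point1 - point2, ord=1)

print("Manhattan distance (sum of absolute differences):", manhattan_sum)
print("Manhattan distance (1-norm):", manhattan_norm)
```

Each of the five differences is 5, so both options give 25.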

##2.2 Manhattan Distance and the adult dataset
Let's take the show on the road again, with our adult dataset. To show you the entire process, we'll walk through the array conversion step first and then display the Manhattan Distance.

I'm again working with the age and incomeUSD attributes. Your job is to compute the distance between the educationyears and incomeUSD attributes.

In [33]:
point1 = np.array((adult.age))
point2 = np.array((adult.incomeUSD))

print ("Manhattan Distance between age and incomeUSD:", dist.cityblock(point1,point2))

Manhattan Distance between age and incomeUSD: 1841172130


##**Your Turn**
Now, do the same thing I did above with age and incomeUSD, but with educationyears and incomeUSD. You'll need to set up your arrays and then use the print function to display the Manhattan distance.

Compare your results about educationyears and incomeUSD to my results about age and incomeUSD. Which attribute is closer to incomeUSD--age or educationyears? Type your answer below:

#3. Why does this matter?#
Honestly, for our purposes, the kind of distance calculation that you're using doesn't matter as long as you use one of the formulas above in order to calculate distance. That said, the most popular distance formula I have seen is the **Euclidean distance** because it measures the most direct connection between two points.

**Want more information?**

* Towards Data Science has [a great blog entry](https://towardsdatascience.com/a-short-introduction-to-distance-measures-in-machine-learning-886fb579d148) about distance.

* If you are interested in what else the SciPy package has to offer, check out [this post on KDnuggets](https://www.kdnuggets.com/2017/08/comparing-distance-measurements-python-scipy.html).