#### Intro to Modeling Data

before building models, use exploratory data analysis, such as visualization and descriptive statistics, to characterize the data to be modeled
next, build the models and use them to make predictions
quantify the confidence you have in those predictions
this chapter will also show how linear regression relates to inferential statistics by introducing model parameter estimation

example applications of linear models
linear models can be used to interpolate (predict a value for a time in between the times that were actually measured) or extrapolate (predict a value for a time outside the range of measured times)
modeling can help you compare two data sets by building a model for each and then comparing the models, for example working out the fuel efficiency of two cars on the same road trip given gas consumption and refueling opportunities every 50 miles
visualization methods are a great first step to seeing trends that may be harder to find or interpret if you jump straight to quantitative methods
descriptive statistics will help you prepare a more quantitative basis for building a model 
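
a minimal sketch of comparing two data sets by building a model for each, assuming made-up readings for two hypothetical cars (the arrays `miles`, `gallons_a`, and `gallons_b` are invented for illustration and are not from the course data); `np.polyfit` with degree 1 returns the slope and intercept of the least-squares line

```python
import numpy as np

# hypothetical odometer readings at each refueling stop (every 50 miles)
miles = np.array([0.0, 50.0, 100.0, 150.0, 200.0])

# hypothetical cumulative gallons used by each car at those stops
gallons_a = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
gallons_b = np.array([0.0, 2.5, 5.0, 7.5, 10.0])

# fit a line to each data set: gallons = slope * miles + intercept
slope_a, intercept_a = np.polyfit(miles, gallons_a, 1)
slope_b, intercept_b = np.polyfit(miles, gallons_b, 1)

# the slope is gallons per mile, so miles per gallon is its reciprocal
mpg_a = 1.0 / slope_a
mpg_b = 1.0 / slope_b

print("car A: {:.1f} mpg, car B: {:.1f} mpg".format(mpg_a, mpg_b))
```

comparing the two fitted slopes, rather than the raw readings, is what makes the comparison a modeling step: with these made-up numbers car A uses less fuel per mile than car B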


In [None]:
# an example of a model as a python expression
# hours must be defined before the expression runs, 6 is an arbitrary example value
hours = 6
miles = 50 * hours

# model predicts distance is 300 miles at 6 hours
time = 6
distance = 50 * time
# this model predicts that you would travel 300 miles in 6 hours, 1500 miles in 30 hours, and so on

# models can also be expressed as functions
def model(time):
    return 50 * time

predicted_distance = model(time=10)
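
a small sketch of the difference between interpolating and extrapolating with the model function above, assuming a hypothetical list of hourly measurement times (`measured_times` and the helper `kind_of_prediction` are illustrative names, not part of the course code)

```python
# hypothetical measurement times, one reading per hour
measured_times = [0, 1, 2, 3, 4, 5, 6]

def model(time):
    # the same linear model as above: 50 miles per hour
    return 50 * time

def kind_of_prediction(time, times=measured_times):
    # inside the measured range is interpolation, outside is extrapolation
    if min(times) <= time <= max(times):
        return "interpolation"
    return "extrapolation"

for t in (2.5, 30):
    print(t, model(t), kind_of_prediction(t))
```

the model gives a number either way, the helper only flags whether that number is backed by nearby measurements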

#### Visualizing Linear Relationships

before building models it's useful to explore your data
visualization is an important part of exploring data because it can detect qualities of the data that summary statistics might miss
visualization is also great for communicating your data and modeling results to others 

In [None]:
# quick plot, the data is stored as numpy arrays x and y and we want to plot it 
import matplotlib.pyplot as plt

# pass the data into the function plt.plot(), the string sets the plot style (r for red, - for solid line, o for round data point marker)
plt.plot(x, y, "r-o")
plt.show()

In [None]:
# another way to use matplotlib, this is more object oriented, it's more customizable and easier to use for complex plots
import matplotlib.pyplot as plt

# construct two new objects, the figure object and the axis object
# plt.subplots() is a function because it belongs to the pyplot module, calls like axis.plot() are methods because they belong to an object
fig, axis = plt.subplots()

# for ease of reuse, create a dictionary to store some of the style options
options = dict(marker='o', color='blue')
# could also add stuff like label="time", marker=None

# call the plot method on the axis object
# this uses ** unpacking to transform the dictionary key-value pairs into keyword arguments
line = axis.plot(x, y, **options)

# add text labels to the axis object
# assigning to _ discards the return value we don't need
_ = axis.set_ylabel("Times")
_ = axis.set_xlabel('Distances')

# add grid lines and a legend (the legend only shows entries for lines plotted with a label=... option)
axis.grid(True)
axis.legend(loc="best")

# display the figure
plt.show()

# once the data is plotted, you might see a linear relationship 
# how can you connect the plot to the ranges of values?

# use two points
# start with the point (x1, y1) at (0, 0)
# move right and up to a second point (x2, y2) at (2, 3)

# change in x and y
# dy = (y2 - y1) = 3 - 0
# dx = (x2 - x1) = 2 - 0

# slope, rise over run, the ratio of the increase in y divided by the increase in x
# slope = dy/dx = 3/2

# intercept, the y-intercept of the line is the y value where x=0
# at x=0, y1=0, so the intercept is 0
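
the slope and intercept worked out in the comments above can be checked in a few lines of code; this is only a sketch using the same two points, and the helper name `line` is an illustrative choice

```python
# the two points from the notes above
x1, y1 = 0, 0
x2, y2 = 2, 3

# change in y and change in x
dy = y2 - y1
dx = x2 - x1

# slope is rise over run
slope = dy / dx

# the line passes through (x1, y1), so solve y1 = slope * x1 + intercept
intercept = y1 - slope * x1

def line(x):
    # the straight line through both points
    return slope * x + intercept

print(slope, intercept, line(4))
```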

#### Quantifying Linear Relationships

so far we've used data visualization to explore the relationship between two variables 
now we'll look at methods from descriptive statistics, including correlation, which is a way of quantifying linear trends in the data: the correlation value measures how strong a linear relationship there is between the two variables in your data

single variable statistics
** mean, a measure of central tendency, describes the center of the data, mean = sum(x)/len(x)
** deviation, a measure of spread, subtract the mean from every data point and the results are the deviations, dx = x - np.mean(x), if these are averaged they cancel out to 0 (up to rounding) since the positive and negative deviations balance
** variance, the result of squaring the deviations and averaging, to avoid the cancellation issue with deviations we square them first and then average, variance = np.mean(dx*dx), variance measures how a single variable varies
** standard deviation, square root of the variance, describes the spread of the data, the variance is in squared units of the data so take the square root to get back to the data's units, stdev = np.sqrt(variance)

** covariance is a measure of whether two variables change (vary) together

correlation always ranges from -1 to 1
correlation has a magnitude from 0 to 1 and a direction: - (one goes up as the other goes down) or + (both go up together)
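
a short sketch of the single variable statistics listed above, assuming a small made-up array `x` (the values are chosen only so the arithmetic comes out clean)

```python
import numpy as np

# made-up data for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# mean: the center of the data
mean = np.sum(x) / len(x)

# deviations: distance of each point from the mean
dx = x - mean

# variance: average of the squared deviations
variance = np.mean(dx * dx)

# standard deviation: square root of the variance, same units as the data
stdev = np.sqrt(variance)

print("mean of deviations:", np.mean(dx))
print("mean, variance, stdev:", mean, variance, stdev)
```

the hand-computed values should agree with `np.var(x)` and `np.std(x)`, which also divide by the number of points by default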

In [None]:
# covariance measures how two variables vary together, compute the deviation arrays (dx and dy) from each of the two arrays x and y
import numpy as np

dx = x - np.mean(x)
dy = y - np.mean(y)

# get the product of each pair of deviations
deviation_products = dx * dy

# average all those products, covariance as the mean
covariance = np.mean(deviation_products)
# for each deviation product, if both x and y are varying in the same direction the result will be positive, 
# if they vary in opposite directions the product will be negative, therefore the average of those products will be larger if 
# both variables change in the same direction more often than not

# as with variance, covariance can be difficult to interpret and compare
# if we divide each deviation by the variable's standard deviation, the result is the covariance of the normalized deviations
# aka the correlation

# divide deviations by the standard deviation 
zx = dx/np.std(x)
zy = dy/np.std(y)

# correlation, mean of the products of the normalized deviations
correlation = np.mean(zx * zy)

# why do we normalize?
# if you're comparing two variables you'll run into trouble because each has a different center and spread, so covariance
# is harder to interpret and harder to compare to other data sets
# after normalization, both variables will have a mean of 0 and a standard deviation of 1
# so neither variable is weighted more heavily than the other in the product of deviations
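
a sketch of why normalizing helps, assuming made-up `hours` and `miles` arrays: changing the units of one variable rescales the covariance but leaves the correlation unchanged, and the hand-computed correlation can be compared with NumPy's `np.corrcoef`

```python
import numpy as np

# made-up data: hours driven and miles traveled
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
miles = np.array([52.0, 98.0, 151.0, 203.0, 246.0])

def covariance(x, y):
    # mean of the products of the deviations
    return np.mean((x - np.mean(x)) * (y - np.mean(y)))

def correlation(x, y):
    # mean of the products of the normalized deviations
    zx = (x - np.mean(x)) / np.std(x)
    zy = (y - np.mean(y)) / np.std(y)
    return np.mean(zx * zy)

# the same measurements expressed in different units
minutes = hours * 60

print("covariance:", covariance(hours, miles), covariance(minutes, miles))
print("correlation:", correlation(hours, miles), correlation(minutes, miles))
print("np.corrcoef:", np.corrcoef(hours, miles)[0, 1])
```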

In [None]:
# example exercise 
# in this exercise you will compare the 3 data sets by computing correlation and determining which data set has the most strongly correlated variables x and y
# Complete the function that will compute correlation.
def correlation(x, y):
    x_dev = x - np.mean(x)
    y_dev = y - np.mean(y)
    x_norm = x_dev / np.std(x)
    y_norm = y_dev / np.std(y)
    return np.mean(x_norm * y_norm)

# Compute and store the correlation for each data set in the dictionary.
for name, data in data_sets.items():
    data['correlation'] = correlation(data['x'], data['y'])
    print('data set {} has correlation {:.2f}'.format(name, data['correlation']))

# Assign the data set with the best correlation.
best_data = data_sets['A']
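
instead of reading the printed values and hard-coding the key, the strongest correlation could be picked programmatically; this is a sketch that assumes a small made-up stand-in for the exercise's `data_sets` dictionary (the real exercise data is not shown here) and uses the absolute value so that strong negative correlations also count

```python
import numpy as np

def correlation(x, y):
    # mean of the products of the normalized deviations
    x_norm = (x - np.mean(x)) / np.std(x)
    y_norm = (y - np.mean(y)) / np.std(y)
    return np.mean(x_norm * y_norm)

# hypothetical stand-in for the exercise's data_sets dictionary
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
data_sets = {
    'A': {'x': x, 'y': np.array([2.0, 4.0, 6.0, 8.0, 10.0])},
    'B': {'x': x, 'y': np.array([2.0, 5.0, 5.0, 9.0, 9.0])},
    'C': {'x': x, 'y': np.array([5.0, 1.0, 4.0, 2.0, 3.0])},
}

# compute and store the correlation for each data set
for name, data in data_sets.items():
    data['correlation'] = correlation(data['x'], data['y'])

# pick the key whose correlation has the largest magnitude
best_name = max(data_sets, key=lambda name: abs(data_sets[name]['correlation']))
best_data = data_sets[best_name]

print("most strongly correlated data set:", best_name)
```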
