#### Modeling Real Data

modeling real data (not just example datasets) can require more powerful tools
sometimes scikit-learn will be more convenient, other times statsmodels will be your tool of choice; it's good to know how to use both because no single tool is ideal for every problem

when using real data, it's very important to consider the uncertainty in the model parameters and predictions

REMEMBER: Even our best model is never perfect :)



In [None]:
# using Scikit-Learn

# import
from sklearn.linear_model import LinearRegression

# initialize a general model
model = LinearRegression(fit_intercept=True)

# load and shape the data
x_raw, y_raw = load_data()
x_data = x_raw.reshape(len(y_raw), 1)
y_data = y_raw.reshape(len(y_raw), 1)

# fit the model to the data, this finds optimal values for a0 and a1 so that the model fits the data
model_fit = model.fit(x_data, y_data)

In [None]:
# predictions and parameters

# you don't need them to make predictions, but you can access the fitted model parameters with array-style indexing
# extract the linear model parameters
intercept = model.intercept_[0]
slope = model.coef_[0, 0]

# use the model to make predictions; predict() expects a 2D input of shape (n_samples, n_features)
future_x = [[1200]]
future_y = model.predict(future_x)

In [None]:
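# another great tool for linear models in python is statsmodels
import pandas as pd
from statsmodels.formula.api import ols

# start by loading the data into numpy arrays and then repack them into a pandas DataFrame
# (the DataFrame columns need the 1D arrays, not the reshaped 2D ones used for scikit-learn)
x, y = load_data()
df = pd.DataFrame(dict(times=x, distances=y))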

# plot to preview (visualize) the data
fig = df.plot('times', 'distances')

# use the ols() and fit() methods to build the model from the data
model_fit = ols(formula="distances ~ times", data=df).fit()

In [None]:
# uncertainty

# extract the optimal values for the parameters: a0 (intercept) and a1 (slope)
a0 = model_fit.params['Intercept']
a1 = model_fit.params['times']

# also extract an error/uncertainty value for each parameter, this lets us quantify the uncertainty to expect from the model
e0 = model_fit.bse['Intercept']
e1 = model_fit.bse['times']

intercept = a0
slope = a1
uncertainty_in_intercept = e0
uncertainty_in_slope = e1

#### The Limits of Predictions

the most common source of large errors is trying to make predictions with a model far outside its **domain of validity**

this lesson shows two ways predictions get misused, interpolation and extrapolation, and the errors that can arise from each

for a graph that looks like a linear trend with some random noise, pay attention to the step size (how far apart the sampled points are along the x-axis); if it's large you may be missing something. consider monthly data versus the same data on a daily scale: the daily scale shows many more features, and you'd realize the data is not simply a linear trend with random noise
if you fit a model to the undersampled (monthly) data, the fit might look pretty good, but if you use it to interpolate (make predictions between the monthly points) and compare against the daily data, the result can be really bad
it could be better to fit the linear model only over a limited range (domain of times), for example from March to August of the same year
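
a minimal sketch of the undersampling problem, using made-up data rather than the course dataset: the synthetic "daily" series below (a linear trend plus a 30-day oscillation) and the names `days`, `daily`, and `rss` are assumptions for illustration only. sampling every 30th day lands exactly where the oscillation is zero, so the monthly view looks perfectly linear.

```python
import numpy as np

# Synthetic "daily" series (an assumption for illustration): a linear trend
# plus a 30-day oscillation that monthly sampling cannot see.
days = np.arange(0, 361)
daily = 2.0 * days + 50.0 * np.sin(2 * np.pi * days / 30)

# "Monthly" view: keep every 30th day, where the oscillation is zero.
days_monthly = days[::30]
monthly = daily[::30]

# Fit a straight line to the undersampled (monthly) data only.
slope, intercept = np.polyfit(days_monthly, monthly, 1)

def rss(x, y):
    """Residual sum of squares of the monthly-fit line against (x, y)."""
    return np.sum(np.square(y - (slope * x + intercept)))

rss_monthly = rss(days_monthly, monthly)
rss_daily = rss(days, daily)
print('RSS monthly = {:0.2f}, RSS daily = {:0.2f}'.format(rss_monthly, rss_daily))
```

the monthly RSS should come out essentially zero while the daily RSS is very large, even though the same line is used for both.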

domain of validity: the further away you try to extrapolate, the less accurate your model may become. if you go outside the domain of the known data, the residuals can become huge, so it's usually not a good idea to extrapolate too far unless you have other experience or domain knowledge to guide you

you could extrapolate for just a segment that's linear, but then you need to figure out the smallest and largest values of the independent variable x that you'll allow your model to be applied to
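
one way to find those smallest and largest x values is to set a tolerance on the residuals. the sketch below uses made-up data (linear up to x = 50, then curving upward); the data and the cutoff are assumptions for illustration, not the course dataset.

```python
import numpy as np

# Synthetic data (an assumption): linear for x <= 50, then curving upward.
x_data = np.arange(0, 101, dtype=float)
y_data = 10.0 * x_data + np.where(x_data > 50, 2.0 * (x_data - 50) ** 2, 0.0)

# Build the model only from the portion that looks linear.
linear_part = x_data <= 50
slope, intercept = np.polyfit(x_data[linear_part], y_data[linear_part], 1)
y_model = slope * x_data + intercept

# Keep only the x values where the model stays within the tolerance.
tolerance = 100
residuals = np.abs(y_data - y_model)
x_good = x_data[residuals < tolerance]
print('Domain of validity: {} to {}'.format(np.min(x_good), np.max(x_good)))
```

the model was built on x up to 50, but the tolerance lets it be applied a little past that point before the residuals grow too large.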

In [None]:
# extrapolation example exercise
# one portion of the data looks linear and was used to build the model but extrapolating all the way will result in some huge residuals
# you can set a tolerance to figure out where the domain should be (the smallest and largest values)

# Compute the residuals, "data - model", and determine where [residuals < tolerance]
residuals = np.abs(y_data - y_model)
tolerance = 100
x_good = x_data[residuals < tolerance]

# Find the min and max of the "good" values, and plot y_data, y_model, and the tolerance range
print('Minimum good x value = {}'.format(np.min(x_good)))
print('Maximum good x value = {}'.format(np.max(x_good)))
fig = plot_data_model_tolerance(x_data, y_data, y_model, tolerance)

In [None]:
# interpolation example exercise
# the monthly data looks linear but the daily data shows different, non-linear trends, undersampling can result in missed data
# in this case, using it to interpolate the daily data results in large residuals (the RSS is like 30 times bigger!!)

# build and fit a model to the df_monthly data
model_fit = ols(formula="Close ~ DayCount", data=df_monthly).fit()

# Use the model FIT to the MONTHLY data to make predictions for both the monthly and daily data
df_monthly['Model'] = model_fit.predict(df_monthly.DayCount)
df_daily['Model'] = model_fit.predict(df_daily.DayCount)

# Plot the monthly and daily data and model, compare the RSS values seen on the figures
fig_monthly = plot_model_with_data(df_monthly)
fig_daily = plot_model_with_data(df_daily)

#### Goodness-of-Fit

goodness-of-fit is different from, but related to, RSS

there are 3 different R's in linear models:
* RSS is for building models: it helps you find the optimal values for the model parameters and thus the best model
* RMSE is for evaluating models: it's the most common way of quantifying the goodness-of-fit
* R-squared tells you how much of the variation in the data is due to the linear trend and how much is not; it's a quantitative measure of that ratio

there's no perfect model, so there will always be some residuals, but how good is the model? RMSE and R-squared are two common ways to quantify the goodness-of-fit for a linear model

RMSE is how much variation is residual
R-squared is what fraction of variation is linear 

when the variation due to linear trend is larger than the variation due to residuals, the model is better

the randomness of the residuals can completely mask the linear dependence for a small slope but be relatively unimportant for a large slope; R-squared captures this effect but RMSE does not

an R-squared value of 0 would mean none of the variation in the data is predicted by the model; if all the data points are close to the line then R-squared will be closer to 1
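
a small sketch of that effect, assuming synthetic data: the same random noise is added to a small-slope line and a large-slope line, and both are fit with `np.polyfit`. the helper name `fit_metrics`, the slopes, and the noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
noise = rng.normal(0, 5, size=x.size)  # same random noise for both datasets

def fit_metrics(y):
    """Fit a line to (x, y) and return (RMSE, R-squared)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = (slope * x + intercept) - y
    deviations = np.mean(y) - y
    rss = np.sum(np.square(residuals))
    var = np.sum(np.square(deviations))
    return np.sqrt(rss / len(y)), 1 - rss / var

rmse_small, r2_small = fit_metrics(0.1 * x + noise)   # small slope
rmse_large, r2_large = fit_metrics(10.0 * x + noise)  # large slope
print('small slope: RMSE = {:0.2f}, R-squared = {:0.2f}'.format(rmse_small, r2_small))
print('large slope: RMSE = {:0.2f}, R-squared = {:0.2f}'.format(rmse_large, r2_large))
```

the RMSE should be the same for both fits because the leftover noise is the same, while R-squared is much closer to 1 for the large slope.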

In [None]:
# RMSE
# start with the residuals and compute RSS as before
residuals = y_model - y_data
RSS = np.sum(np.square(residuals))

# divide RSS by the number of residuals, as if to normalize, and you'll get the mean of the squared
# residuals instead of the sum
mean_squared_residuals = np.sum(np.square(residuals)) / len(residuals)

# the residuals can be thought of as modeling errors, so this is called the mean-squared-error, MSE
MSE = np.mean(np.square(residuals))
# this has the form of the variance of the residuals (when their mean is zero)

# the square root of MSE gets us the root-mean-squared error, RMSE
RMSE = np.sqrt(np.mean(np.square(residuals)))
# equivalently, the standard deviation of the residuals; the two match only when the residuals
# have zero mean, which holds for a least-squares fit that includes an intercept
RMSE = np.std(residuals)
# RMSE is a measure of how much the model deviates from the data

In [None]:
# R-Squared
# deviations are the difference between the data points and the data mean
deviations = np.mean(y_data) - y_data
# if you square and then sum the deviations you'll get the total variation (the variance times the number of points)
VAR = np.sum(np.square(deviations))
# VAR captures all variation in the data, both the linear trend and the randomness

# residuals are the difference between the data points and the model
residuals = y_model - y_data
# squaring and summing the residuals gives RSS
RSS = np.sum(np.square(residuals))
# RSS only captures the variation left over after the modeled linear trend is subtracted

# R-Squared is 1 minus the ratio of RSS divided by VAR
r_squared = 1 - (RSS / VAR)
# for a least-squares linear fit it can also be computed as the square of the correlation between the data and the model
r = np.corrcoef(np.ravel(y_data), np.ravel(y_model))[0, 1]
r_squared = r ** 2

In [None]:
# exercise code, some of the RMSE code simplified
# Build the model and compute the residuals "model - data"
y_model = model_fit_and_predict(x_data, y_data)
residuals = y_model - y_data
# Compute the RSS, MSE, and RMSE and print the results
RSS = np.sum(np.square(residuals))
MSE = RSS/len(residuals)
RMSE = np.sqrt(MSE)
print('RMSE = {:0.2f}, MSE = {:0.2f}, RSS = {:0.2f}'.format(RMSE, MSE, RSS))

In [None]:
# exercise code, r-squared
# Compute the residuals and the deviations
residuals = y_model - y_data
deviations = np.mean(y_data) - y_data

# Compute the sum of squares of the residuals (RSS) and of the deviations (total variation)
var_residuals = np.sum(np.square(residuals))
var_deviations = np.sum(np.square(deviations))

# Compute r_squared as 1 - the ratio of RSS/Variance
r_squared = 1 - (var_residuals / var_deviations)
print('R-squared is {:0.2f}'.format(r_squared))

#### Standard Error 

so far you computed quantitative measures of the variation and goodness of the model *predictions*; now we'll look at the variation or errors in the model *parameters*
How accurate are the model parameters? Is there variation in those parameters? How much of the variation is due to deterministic trends versus inherent randomness?

instead of using a single value like RMSE that summarizes the entire model prediction, we'll compute the standard error of each of the model parameters separately

standard error is a measure of the uncertainty in the model parameter values computed in the least-squares process; think of the parameter value as the center and the standard error as the spread around it (e.g. the slope is the mean speed traveled during the marathon)
you can now think of those optimal parameter values not as the one true answer but as the best estimate

computing the standard error by hand can be tough, sometimes not even possible; for now we'll use statsmodels to compute it easily

to truly understand these uncertainties in our models we have to stop thinking of them as errors and begin thinking of them as probability distributions 
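
for the simple one-variable straight-line case the calculation is still manageable by hand. the sketch below uses the textbook closed-form expressions for ordinary least squares on a tiny made-up dataset (not the marathon data); the results should agree with what statsmodels reports in `bse` for the same kind of fit, though that comparison is not run here.

```python
import numpy as np

# Tiny illustrative dataset (an assumption, not the marathon data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
n = len(x)

# Least-squares estimates of the slope (a1) and intercept (a0).
sxx = np.sum(np.square(x - np.mean(x)))
a1 = np.sum((x - np.mean(x)) * (y - np.mean(y))) / sxx
a0 = np.mean(y) - a1 * np.mean(x)

# Residual variance, using n - 2 degrees of freedom (two fitted parameters).
residuals = y - (a0 + a1 * x)
s2 = np.sum(np.square(residuals)) / (n - 2)

# Standard errors of the slope (e1) and intercept (e0).
e1 = np.sqrt(s2 / sxx)
e0 = np.sqrt(s2 * (1.0 / n + np.mean(x) ** 2 / sxx))
print('slope = {:0.3f} +/- {:0.3f}'.format(a1, e1))
print('intercept = {:0.3f} +/- {:0.3f}'.format(a0, e0))
```

note that both standard errors scale with the residual variance `s2`, so noisier data means less certain parameters.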

In [None]:
# computing standard errors
# build and fit a model and then extract model parameters
# start with a pandas dataframe
df = pd.DataFrame(dict(times=x_data, distances=y_data))
# ...and pass it into ols() from statsmodels, calling fit
model_fit = ols(formula="distances ~ times", data=df).fit()

# use dictionary-style keyed indexing to get the wanted params
a1 = model_fit.params['times']
a0 = model_fit.params["Intercept"]
slope = a1
intercept = a0

# the standard error params
e0 = model_fit.bse["Intercept"]
e1 = model_fit.bse["times"]
standard_error_of_intercept = e0
standard_error_of_slope = e1

In [None]:
# exercise example 
# the standard error is the same for both models, but the r-squared changes.
# The uncertainty in the estimates of the model parameters is independent of R-squared because
# that uncertainty is being driven not by the linear trend, but by the inherent randomness in the data.