## Week 4: Optimization/Canonical Formulation/Loss Function/Maximum Likelihood

### Version 1.1

This assignment is designed to help you gain some practical knowledge of how optimization works. You should be able to describe how optimization works in everyday life, mathematics, and data science. At the end of this assignment you should be able to identify canonical formulations for linear and logistic regression and express the intuition behind a loss function and how it relates to optimization. Please read the directions carefully, as we want to avoid submissions that are marked incorrect due to formatting mistakes. You will be using sympy, numpy, and scipy for this assignment.

Please enter your name: ""

## Part 1: Optimization Problem

<b>1.</b> Below is the function f(x) that you are trying to minimize. 

![function plotted on xy axis graph](assets/optimization.png)

<strong>1.1</strong> \[1 pt\] Where is the global maximum?

Please store the answer in the form of a single-character string named <strong>ANS11</strong>. 

In [None]:
ANS11="A"


In [None]:
assert type(ANS11) == str, "Problem 1.1, testing ANS11, type of value stored in variable does not match the expected type. Expecting String."

<strong>1.2</strong> \[1 pt\] Where is the global minimum?

Please store the answer in the form of a single-character string named <strong>ANS12</strong>.

In [None]:
ANS12= "F"


In [None]:
assert type(ANS12) == str,  "Problem 1.2, testing ANS12, type of value stored in variable does not match the expected type. Expecting String."

## Part 2: Linear Regression Implementation on Data

In this problem, you will implement linear regression models with the scikit-learn library, as well as an OLS implementation with the statsmodels library.

Difference between Statsmodels and Scikit-learn libraries: 

Scikit-learn: Simple and efficient tools for data mining and data analysis

Statsmodels: a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Sources:
<ul>
    <li>https://towardsdatascience.com/introduction-to-linear-regression-in-python-c12a072bedf0</li>
    <li>https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76</li>
    <li>https://www.statsmodels.org/stable/index.html</li>
    <li>https://scikit-learn.org/stable/</li>
</ul>

### 2. Simple Regression with Scikit-learn 

In [2]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import pandas as pd

In [3]:
#Let's read in nyc flight csv file 
nyc = pd.read_csv('assets/nyc.csv')

In [4]:
##Let's explore what is inside the data 
nyc.head()

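| Unnamed: 0.1 | Unnamed: 0 | X | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2013 | 1 | 1 | 517.0 | 2.0 | 830.0 | 11.0 | UA | N14228 | 1545 | EWR | IAH | 227.0 | 1400 | 5.0 | 17.0 |
| 1 | 2 | 2 | 2013 | 1 | 1 | 533.0 | 4.0 | 850.0 | 20.0 | UA | N24211 | 1714 | LGA | IAH | 227.0 | 1416 | 5.0 | 33.0 |
| 2 | 3 | 3 | 2013 | 1 | 1 | 542.0 | 2.0 | 923.0 | 33.0 | AA | N619AA | 1141 | JFK | MIA | 160.0 | 1089 | 5.0 | 42.0 |
| 3 | 4 | 4 | 2013 | 1 | 1 | 544.0 | -1.0 | 1004.0 | -18.0 | B6 | N804JB | 725 | JFK | BQN | 183.0 | 1576 | 5.0 | 44.0 |
| 4 | 5 | 5 | 2013 | 1 | 1 | 554.0 | -6.0 | 812.0 | -25.0 | DL | N668DN | 461 | LGA | ATL | 116.0 | 762 | 5.0 | 54.0 |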


In [5]:
## Data Cleaning: Drop all null values
nyc=nyc.dropna()

In [6]:
## Convert the distance column to a 2-D NumPy array (one column), as scikit-learn expects
x = nyc['distance'].values.reshape((-1, 1))

In [7]:
## model building 
model = LinearRegression().fit(x, nyc['arr_delay'])

In [8]:
## Show the slope of the model
print('slope:', model.coef_)

slope: [-0.00483183]


In [9]:
## Show the intercept of the model 
print('intercept:', model.intercept_)

intercept: 11.027244871764514


In [None]:
## predict arr_delay time when distance is 20
model.predict([[20]])

#### An example of simple linear regression with Scikit-learn

Please apply simple linear regression in Scikit-learn to build a model called <strong>arr_dep_model</strong> that captures the relationship between arrival delay (dependent variable) and departure delay (independent variable). This will not be directly tested, but you will need it for later questions.

In [10]:
x2 = nyc['dep_delay'].values.reshape((-1, 1))
arr_dep_model = LinearRegression().fit(x2,nyc['arr_delay'])

<strong>2.1</strong> \[1 pt\] What is the slope of the arr_dep_model?
Please store the answer into variable <strong>ANS21</strong> as a float.

In [13]:
ANS21= arr_dep_model.coef_[0]

1.020178295133761

In [12]:
import numbers
assert isinstance(ANS21, numbers.Number),  "Problem 2.1, testing ANS21, type of value stored in variable does not match the expected type. Expecting a number"

<strong>2.2</strong> \[1 pt\] What is the intercept of the arr_dep_model?

Please store the answer into variable <strong>ANS22</strong>.  

In [17]:
ANS22 = arr_dep_model.intercept_

In [18]:
assert isinstance(ANS22, numbers.Number), "Problem 2.2, testing ANS22, type of value stored in variable does not match the expected type. Expecting a number."

<strong>2.3</strong> \[1 pt\] Combine the answers from 2.1 and 2.2 into slope-intercept form and store the result in a variable called <strong>ANS23</strong>. Store the equation as a string that is valid Python code. Omit the "Y=" or "f(x)=" prefix. Use a lowercase “x” to represent the independent variable. For example: 27.003 * x + 40.221. You may round numbers to the nearest thousandth.

In [19]:
ANS23 = "{} * x + {}".format(round(ANS21,3),round(ANS22,3))
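
Because the string is expected to be valid Python, one quick sanity check is to evaluate it with a value bound to `x`. The sketch below uses the example string from the prompt rather than real model output, and `evaluate_line` is a hypothetical helper that is not part of the assignment.

```python
def evaluate_line(expr: str, x: float) -> float:
    """Evaluate a slope-intercept string such as '27.003 * x + 40.221' at x."""
    # Built-in functions are not exposed by name; only `x` is defined.
    return eval(expr, {"__builtins__": {}}, {"x": x})


example = "27.003 * x + 40.221"   # example string from the problem statement
value_at_2 = evaluate_line(example, 2.0)
print(value_at_2)
```

If the string contained a typo such as `2x` instead of `2 * x`, the call would raise a `SyntaxError`, which is a likely cause of a failed hidden test.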

In [20]:
#hidden tests for problem 2.3 are within this cell
assert type(ANS23) == str, "Problem 2.3, testing ANS23, type of value stored in variable does not match the expected type. Expecting String"

<strong>2.4</strong> \[1 pt\] If a flight’s departure is delayed for 15 minutes, what is its predicted arrival delay? Please store the answer into variable <strong>ANS24</strong>. Your answer should be a number.  

In [22]:
ANS24 = ANS21 * 15.0 + ANS22


In [None]:
assert isinstance(ANS24, numbers.Number),  "Problem 2.4, testing ANS24, type of value stored in variable does not match the expected type. Expecting Float"

### 3. Multiple Regression with Scikit-learn
We are going to run through an example of using scikit-learn to do a multiple linear regression model. 

**Begin example:**

In [23]:
# Build linear regression model using dep_delay and distance as predictors
# Split data into predictors X and output Y
predictors = ['dep_delay', 'distance']
X = nyc[predictors]
y = nyc['arr_delay']

# Initialise and fit model
lm = LinearRegression()
model = lm.fit(X, y)

In [24]:
print(f'intercept = {model.intercept_}')
print(f'coefficient = {model.coef_}')

intercept = -1.4987042957768262
coefficient = [ 1.01791457 -0.00250182]


In [25]:
model.predict(X)

array([ -2.96542015,  -0.9696201 ,  -2.1873548 , ..., 260.29482685,
        90.42100499, 179.45209055])

In [26]:
new_X = [[30, 200]]
print(model.predict(new_X))

[28.53836913]


**Example end**
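
To see what `model.predict` is doing in the example, the prediction for `new_X` can be reproduced by hand. This is a minimal sketch that copies the rounded intercept and coefficients from the printed output above, so it only matches the library result up to rounding.

```python
# Values copied from the printed output of the example model.
intercept = -1.4987042957768262
coefficients = [1.01791457, -0.00250182]   # dep_delay, distance
new_row = [30, 200]                        # dep_delay = 30, distance = 200

# A linear model's prediction is intercept + sum(coefficient * feature).
manual_prediction = intercept + sum(c * v for c, v in zip(coefficients, new_row))
print(round(manual_prediction, 3))  # 28.538
```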

Now, it is your turn! 

Please apply multiple linear regression in Scikit-learn to build a model called arr_del_model to see how air_time and distance influence arrival delays. You will use the results from this model to answer questions 3.1 - 3.4.

In [27]:
x3 = nyc[['air_time','distance']]
y3 = nyc['arr_delay']
arr_del_model = LinearRegression().fit(x3,y3)

<strong>3.1</strong> \[1 pt\] Please identify the coefficient value for air_time. Assign the coefficient to <strong>ANS31</strong>. The value of <strong>ANS31</strong> should be a number.


In [28]:
ANS31 =  arr_del_model.coef_[0]

In [29]:
import numbers
assert isinstance(ANS31, numbers.Number), "Problem 3.1, testing ANS31, type of value stored in variable does not match the expected type. Expecting a number."

<strong>3.2</strong> \[1 pt\] Please identify the coefficient value for distance. Assign the coefficient to <strong>ANS32</strong>. The value of <strong>ANS32</strong> should be a number.

In [30]:
ANS32 = arr_del_model.coef_[1]

In [31]:
assert isinstance(ANS32, numbers.Number), "Problem 3.2, testing ANS32, type of value stored in variable does not match the expected type. Expecting a number."

<strong>3.3</strong> \[1 pt\] Please identify the intercept for the relationship between arr_delay, air_time, and distance. Assign the intercept to <strong>ANS33</strong>. The value of <strong>ANS33</strong> should be a number.

In [32]:
ANS33 = arr_del_model.intercept_

In [33]:
assert isinstance(ANS33, numbers.Number), "Problem 3.3, testing ANS33, type of value stored in variable does not match the expected type. Expecting a number."


<strong>3.4</strong> \[1 pt\] If a flight’s air_time is 175 and distance is 1300, what is its predicted arrival delay? Please store your answer into variable <strong>ANS34</strong> as a number. 

In [34]:
ANS34 = arr_del_model.predict([[175,1300]])[0]

In [35]:
assert isinstance(ANS34, numbers.Number), "Problem 3.4, testing ANS34, type of value stored in variable does not match the expected type. Expecting a number."

### 4. Linear Regression with statsmodels 

First, we will run through an example of linear regression using statsmodels.

**Begin example:**

Please implement linear regression on the dataset.

Linear regression refers to any linear relationship between the independent and dependent variables; the term does not say how the model is fitted. As a result, there are several different approaches to linear regression problems. Here we introduce one common method, ordinary least squares (OLS). The purpose of OLS is to find the best-fit line by minimizing the sum of the squared vertical distances (residuals) between the data points and the line.

Below are the mathematical formulas for ordinary least squares. Assume we have a set of data pairs (x1, y1), (x2, y2), ..., and we are trying to find the best-fit line for the data.

(1) Find the mean of the x values and the y values.

![the equation of the means for X and Y](assets/Mean.png)

(2) Calculate the slope of the best-fit line.

![the equation for the slope](assets/slope.png)

(3) Compute the y-intercept of the line.

![the equation for computing the y-intercept](assets/y-intercept.png)

(4) The result is the regression line with the smallest sum of squared distances from the data points to the line.

![a graph showing the regression line with the least square distance](assets/regression_line.png)
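
The steps above can be sketched in plain Python. This is a minimal illustration that assumes the images show the standard closed-form OLS formulas (slope equals the sum of (x - x_mean)(y - y_mean) divided by the sum of (x - x_mean)^2, and intercept equals y_mean - slope * x_mean). The helper `ols_fit` and the sample points are made up for the example.

```python
from statistics import mean


def ols_fit(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    x_bar, y_bar = mean(xs), mean(ys)                          # step (1)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx                                          # step (2)
    intercept = y_bar - slope * x_bar                          # step (3)
    return slope, intercept


# Made-up points lying exactly on y = 2x + 1, so the fit should recover that line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = ols_fit(xs, ys)
print(slope, intercept)
```

On real data the points will not lie exactly on a line, and the same formulas return the line with the smallest sum of squared residuals.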

In [36]:
# To run OLS on the data, we need to import the modules below:
import math
import numbers
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import csv
import pandas as pd 


statsmodels.formula.api allows you to use R-Style formulas: y ~ x1 + x2 + x3 + ...

    1.y represents the outcome/dependent variable
    2.x1, x2, x3, etc represent explanatory/independent variables

What is the relationship between arrival delay and distance?

In [37]:

model1 = smf.ols('arr_delay ~ distance', data=nyc).fit()
model1.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | arr_delay | R-squared: | 0.007 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 198.0 |
| Date: | Mon, 27 Jun 2022 | Prob (F-statistic): | 8.25e-45 |
| Time: | 18:58:36 | Log-Likelihood: | -135020.0 |
| No. Observations: | 26398 | AIC: | 270000.0 |
| Df Residuals: | 26396 | BIC: | 270100.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 11.0272 | 0.427 | 25.808 | 0.000 | 10.190 | 11.865 |
| distance | -0.0048 | 0.000 | -14.072 | 0.000 | -0.006 | -0.004 |

| | | | |
|---|---|---|---|
| Omnibus: | 28994.599 | Durbin-Watson: | 1.591 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 7569174.795 |
| Skew: | 5.203 | Prob(JB): | 0.0 |
| Kurtosis: | 85.3 | Cond. No. | 2140.0 |


Based on the fitted coefficients, the estimated model is

    arr_delay = -0.0048*distance + 11.0272

**Example end**

Now it is your turn!

<strong>4.1</strong> Build an OLS model called model2 that shows the relationship between arr_delay and dep_delay. You will not submit anything for this question, but you need the output to answer 4.2 through 4.4.

In [50]:
model2 =smf.ols('arr_delay ~ dep_delay', data=nyc).fit()
model2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | arr_delay | R-squared: | 0.84 |
| Model: | OLS | Adj. R-squared: | 0.84 |
| Method: | Least Squares | F-statistic: | 138200.0 |
| Date: | Mon, 27 Jun 2022 | Prob (F-statistic): | 0.0 |
| Time: | 19:04:09 | Log-Likelihood: | -110950.0 |
| No. Observations: | 26398 | AIC: | 221900.0 |
| Df Residuals: | 26396 | BIC: | 221900.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -4.0570 | 0.103 | -39.264 | 0.000 | -4.260 | -3.854 |
| dep_delay | 1.0202 | 0.003 | 371.798 | 0.000 | 1.015 | 1.026 |

| | | | |
|---|---|---|---|
| Omnibus: | 3397.727 | Durbin-Watson: | 1.604 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9831.122 |
| Skew: | 0.695 | Prob(JB): | 0.0 |
| Kurtosis: | 5.647 | Cond. No. | 39.1 |


<strong>4.2</strong> \[1 pt\] Please identify the coefficient value (slope) for dep_delay. Assign the coefficient to <strong>ANS42</strong> and round to the thousandths place (e.g. 1.001). The value of <strong>ANS42</strong> should be a Python float.

In [100]:
ANS42 = round(model2.params['dep_delay'],3)  # look up by label; positional [1] on a labeled Series is deprecated in recent pandas

1.02

In [None]:
assert isinstance(ANS42, numbers.Number), "Problem 4.2, testing ANS42, type of value stored in variable does not match the expected type. Expecting a number."

<strong>4.3</strong> \[1 pt\] Please identify the intercept for the relationship between arr_delay and dep_delay. Assign the intercept to <strong>ANS43</strong> and round to the thousandths place (e.g. 1.001). The value of <strong>ANS43</strong> should be a Python float.

In [45]:
ANS43 = round(model2.params['Intercept'],3)  # look up by label; positional [0] on a labeled Series is deprecated in recent pandas

In [None]:
assert isinstance(ANS43, numbers.Number), "Problem 4.3, testing ANS43, type of value stored in variable does not match the expected type. Expecting a number."

<strong>4.4</strong> \[1 pt\] If a flight's departure was delayed by 10 minutes, what is its predicted arrival delay based on model2? Assign your answer to the variable <strong>ANS44</strong> and round to the thousandths place (e.g. 1.001). The value of <strong>ANS44</strong> should be a Python float. The slope and intercept used as inputs should be rounded to the thousandths place as well.

In [104]:
ANS44= round(ANS42*10+ANS43,3)

In [105]:
assert isinstance(ANS44, numbers.Number), "Problem 4.4, testing ANS44, type of value stored in variable does not match the expected type. Expecting python float."

### 5. Gradient Descent Computation 

<strong>5.1</strong> \[1 pt\] Assume we have a function f(x, y) = 5x + 4xy - y^2. What is its gradient? Please store the answer in a string variable <strong>ANS51</strong> that is valid Python code (e.g. `ANS51 = "[3*x + y, 3*x - 2*y]"`).

In [None]:
ANS51 = "[4*y + 5,4*x - 2*y]"

In [None]:
#hidden tests for problem 5.1 are within this cell

<strong>5.2</strong> \[1 pt\] Evaluate the gradient at the point (3,-1). Your solution should be stored into variable <strong>ANS52</strong> as an array (e.g. `ANS52 = np.array([1, 1])`).

In [52]:
ANS52 = np.array([1,14])
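
One way to double-check a hand-derived gradient is to compare it with a finite-difference approximation. The sketch below does this for f(x, y) = 5x + 4xy - y^2 at (3, -1). The helper `numerical_gradient` and the step size `h` are choices made for this illustration, not part of the assignment.

```python
import math


def f(x, y):
    return 5 * x + 4 * x * y - y ** 2


def numerical_gradient(func, x, y, h=1e-6):
    """Central-difference approximation of the gradient of func at (x, y)."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return [df_dx, df_dy]


approx = numerical_gradient(f, 3.0, -1.0)
analytic = [4 * (-1.0) + 5, 4 * 3.0 - 2 * (-1.0)]   # [4*y + 5, 4*x - 2*y] at (3, -1)
magnitude = math.hypot(analytic[0], analytic[1])     # Euclidean norm of the gradient
print(approx, analytic, magnitude)
```

The two gradients should agree to several decimal places; a large gap usually points to a sign or coefficient mistake in the analytic derivative.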


In [None]:
#hidden tests for problem 5.2 are within this cell
assert type(ANS52) == np.ndarray, "Problem 5.2, expecting array type."
assert ANS52.size == 2

<strong>5.3</strong> \[1 pt\] What is the magnitude of the gradient at the point (3,-1)? Please store your answer into variable <strong>ANS53</strong> as a number.

In [56]:
ANS53 = math.sqrt(1**2 + 14**2)

In [57]:
#hidden tests for problem 5.3 are within this cell

<strong>5.4</strong> \[1 pt\] Starting from the given point (3,-1) and using the letter `a` as a symbol for your learning rate, find a new point $(x_{n},y_{n})$ along the negative gradient direction such that $f(x_{n},y_{n}) < f(3,-1)$ for sufficiently small `a`. Store your answer into variable <strong>ANS54</strong> as a Python string containing valid Python code (e.g. ``ANS54 = "[1 * a, 2-10 * a]"``).

In [None]:
ANS54 = "[3-1 * a, -1-14 * a]"
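
A candidate answer can be checked numerically by plugging in a few small learning rates and confirming that the function value drops below f(3, -1). This is a minimal sketch; `descent_step` is a hypothetical helper and the three learning rates are arbitrary illustrative values.

```python
def f(x, y):
    return 5 * x + 4 * x * y - y ** 2


def descent_step(a):
    """Move from (3, -1) against the gradient [1, 14] with learning rate a."""
    return 3 - 1 * a, -1 - 14 * a


start_value = f(3, -1)
for a in (0.1, 0.01, 0.001):   # illustrative learning rates
    x_n, y_n = descent_step(a)
    print(a, f(x_n, y_n), f(x_n, y_n) < start_value)
```

Stepping in the opposite direction (adding `a` times the gradient) would increase the function value instead, which is a quick way to catch a sign error.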

In [None]:
#hidden tests for problem 5.4 are within this cell

### 6. Maximum Likelihood in Linear Regression

Let's say we have several data points: (2, 4.5, 8), and we know they are from a normal (Gaussian) distribution.

6.1 [1 pt] Let's start by assuming some values for  𝜇  and  𝜎 . What is the likelihood of guess1:  𝑁(𝜇=4.5,𝜎=4) ? Store your answer in the variable <strong>ANS61</strong>.

In [59]:
from scipy.stats import norm

In [70]:
mu = 4.5
std = 4
ANS61= norm.pdf(2,loc=mu, scale=std) * norm.pdf(4.5,loc=mu, scale=std) * norm.pdf(8,loc=mu, scale=std)

0.0005565109655370626

In [None]:
assert isinstance(ANS61, numbers.Number), "Problem 6.1, testing ANS61, type of value stored in variable does not match the expected type. Expecting a numpy float."

6.2 [1 pt] Let's change our assumptions regarding the values of  𝜇  and  𝜎 . What is the likelihood of guess2:  𝑁(𝜇=6,𝜎=8) ? Store your answer in the variable <strong>ANS62</strong>.

In [73]:
mu = 6
std = 8
ANS62= norm.pdf(2,loc=mu, scale=std) * norm.pdf(4.5,loc=mu, scale=std) * norm.pdf(8,loc=mu, scale=std)

0.00010422397698910738

In [72]:
assert isinstance(ANS62, numbers.Number), "Problem 6.2, testing ANS62, type of value stored in variable does not match the expected type. Expecting a numpy float."

6.3 [1 pt] Which are the more likely estimates? Store your answer as either "guess1" or "guess2" into a variable called <strong>ANS63</strong>.

In [None]:
ANS63 = "guess1"

In [None]:
assert ((ANS63.strip().replace(" ", "").lower() == "guess1") or (ANS63.strip().replace(" ", "").lower() == "guess2")), "Problem 6.3, testing ANS63, value was not 'guess1' or 'guess2'. Please choose 'guess1' or 'guess2' as your answer."
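
The likelihood comparison in this section can also be reproduced with the standard library alone. The sketch below treats the three points as independent draws and multiplies their densities. As an extra step that the assignment does not ask for, it also evaluates the likelihood at the sample mean and population standard deviation, which are the maximum likelihood estimates for a normal distribution. The helper name `likelihood` is made up for this example.

```python
from math import prod
from statistics import NormalDist, fmean, pstdev

data = [2, 4.5, 8]


def likelihood(mu, sigma, points=data):
    """Product of normal densities, treating the points as independent draws."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return prod(dist.pdf(p) for p in points)


guess1 = likelihood(4.5, 4)
guess2 = likelihood(6, 8)

# Maximum likelihood estimates for a normal: sample mean and population std.
mle = likelihood(fmean(data), pstdev(data))
print(guess1, guess2, mle)
```

Neither guess can beat the likelihood at the maximum likelihood estimates, which is what makes them the "best" parameters for these data under the normal assumption.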

### 7. Likelihood calculation based on a line

We have calculated the forecasted probability of winning according to some logistic regression line for some actual wins and losses (there are no ties). Here are those forecasted probabilities:

Wins: P(winning)-- 0.52, 0.76, 0.89, 0.91, 0.95, 0.95, 0.98, 0.98

Losses: P(winning)-- 0.08, 0.08, 0.08, 0.15, 0.2, 0.25, 0.6, 0.7

7.1 [1 pt] What is the likelihood of the line given the data? (You do not need the line or the data, only the resulting probabilities.) Please store your answer into <strong>ANS71</strong>.

In [83]:
ANS71= 0.52*0.76*0.89*0.91*0.95*0.95*0.98*0.98*(1-.08)*(1-.08)*(1-.08)*(1-.15)*(1-.2)*(1-.25)*(1-.6)*(1-.7)

In [84]:
assert isinstance(ANS71, numbers.Number), "Problem 7.1, testing ANS71, type of value stored in variable does not match the expected type. Expecting Float."

7.2 [1 pt] What is the log likelihood (ln) of the line given the data? Please store your answer into the variable <strong>ANS72</strong>.

In [87]:
ANS72 = math.log(0.52)+math.log(0.76)+math.log(0.89)+math.log(0.91)+math.log(0.95)+math.log(0.95)+math.log(0.98)+math.log(0.98)+math.log(1-.08)+math.log(1-.08)+math.log(1-.08)+math.log(1-.15)+math.log(1-.2)+math.log(1-.25)+math.log(1-.6)+math.log(1-.7)

In [88]:
assert isinstance(ANS72, numbers.Number), "Problem 7.2, testing ANS72, type of value stored in variable does not match the expected type. Expecting Float."
assert (ANS72 > 2) == False, "Problem 7.2, testing ANS72, value is larger than expected. Which log did you use (log10 or ln)? Which are you supposed to use?"
assert (ANS72 < -4.5) == False, "Problem 7.2, testing ANS72, value is smaller than expected. Check your math. Did you account for wins/losses?"
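
The long products and sums above can be written more compactly with lists. The sketch below is one possible restructuring, not the required solution: each observation contributes the probability the model assigned to the outcome that actually happened, so losses contribute 1 - P(winning). The variable names are chosen for this illustration.

```python
import math

win_probs = [0.52, 0.76, 0.89, 0.91, 0.95, 0.95, 0.98, 0.98]   # P(winning) for actual wins
loss_probs = [0.08, 0.08, 0.08, 0.15, 0.2, 0.25, 0.6, 0.7]     # P(winning) for actual losses

# Probability assigned to the observed outcome of each game.
observed_probs = win_probs + [1 - p for p in loss_probs]

likelihood = math.prod(observed_probs)
log_likelihood = sum(math.log(p) for p in observed_probs)
print(likelihood, log_likelihood)
```

Exponentiating the log likelihood recovers the likelihood, which is why maximizing either one selects the same line. The log form is usually preferred because sums of logs avoid the numerical underflow that long products of small probabilities can cause.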

Sources:
<ul>
    <li>https://www.statsmodels.org/stable/index.html</li>
    <li>https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/</li>
    <li>https://towardsdatascience.com/linear-regression-simplified-ordinary-least-square-vs-gradient-descent-48145de2cf76</li>
</ul>