## Load the dataset
The first few lines of code set up the coding environment and load the data. As you may already know, you can use the **`import`** statement to import any necessary packages, using conventional aliases as needed. The example below uses a dataset on penguins available through the seaborn package.

In [3]:
# Import packages
import pandas as pd
import seaborn as sns

# Load dataset
penguins = sns.load_dataset("penguins")

# Examine first 5 rows of dataset
penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Clean data
After the data is loaded, it is cleaned to create a subset for the purposes of this course. The example isolates just the Chinstrap penguins from the dataset and drops rows with missing data.

The index of the dataframe is reset using the **`reset_index()`** method. When you subset a dataframe, the original row indices are retained. For example, suppose there were Adelie or Gentoo penguins in rows 2 and 3. After subsetting the data for just Chinstrap penguins, your new dataframe would list row 1 and then row 4, because rows 2 and 3 were removed. After resetting the index, the rows are numbered consecutively starting from 0 (0, 1, 2, etc.). Passing **`drop=True`** discards the old index instead of keeping it as a new column. The dataframe becomes easier to work with in the future.
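
To see the effect on a small scale, here is a minimal sketch using a hypothetical four-row toy dataframe (not the real penguins data). The names `toy`, `subset`, and `reset` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the real penguins dataset
toy = pd.DataFrame({
    "species": ["Chinstrap", "Adelie", "Gentoo", "Chinstrap"],
    "bill_depth_mm": [17.9, 18.7, 14.2, 19.5],
})

# Subsetting keeps the original row labels of the rows that survive
subset = toy[toy["species"] == "Chinstrap"]
original_labels = list(subset.index)

# drop=True discards the old labels instead of storing them in a new column
reset = subset.reset_index(drop=True)
new_labels = list(reset.index)

print(original_labels)  # [0, 3]
print(new_labels)       # [0, 1]
```

The gap in the first list is what the reset removes. Without `drop=True`, the old labels would be kept in an extra column named `index`.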

Review the code below. You are encouraged to run the code in your own notebook.

In [9]:
# Subset just Chinstrap penguins from dataset and drop rows with missing data
chinstrap_penguins = penguins[penguins["species"] == "Chinstrap"].dropna()

# Reset index of dataframe
chinstrap_penguins = chinstrap_penguins.reset_index(drop=True)
chinstrap_penguins

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Chinstrap,Dream,46.5,17.9,192.0,3500.0,Female
1,Chinstrap,Dream,50.0,19.5,196.0,3900.0,Male
2,Chinstrap,Dream,51.3,19.2,193.0,3650.0,Male
3,Chinstrap,Dream,45.4,18.7,188.0,3525.0,Female
4,Chinstrap,Dream,52.7,19.8,197.0,3725.0,Male
...,...,...,...,...,...,...,...
63,Chinstrap,Dream,55.8,19.8,207.0,4000.0,Male
64,Chinstrap,Dream,43.5,18.1,202.0,3400.0,Female
65,Chinstrap,Dream,49.6,18.2,193.0,3775.0,Male
66,Chinstrap,Dream,50.8,19.0,210.0,4100.0,Male


## Setup for model construction
Now that the data is clean, you can plot the data and construct a linear regression model. First, extract the one X variable, **`bill_depth_mm`**, and the one Y variable, **`flipper_length_mm`**, that you are targeting.

In [10]:
# Subset Data
ols_data = chinstrap_penguins[["bill_depth_mm", "flipper_length_mm"]]
ols_data

,bill_depth_mm,flipper_length_mm
0,17.9,192.0
1,19.5,196.0
2,19.2,193.0
3,18.7,188.0
4,19.8,197.0
...,...,...
63,19.8,207.0
64,18.1,202.0
65,18.2,193.0
66,19.0,210.0


Because this example uses statsmodels, save the ordinary least squares formula as a string so that statsmodels knows which regression to run. The Y variable, **`flipper_length_mm`**, comes first, followed by a tilde and the name of the X variable, **`bill_depth_mm`**.

In [15]:
# Write out formula
ols_formula = "flipper_length_mm ~ bill_depth_mm"

## Construct the model
In order to construct the model, you’ll first need to import the **`ols`** function from the **`statsmodels.formula.api`** interface.

In [16]:
# Import ols function
from statsmodels.formula.api import ols

Next, pass the formula and the saved data to the **`ols`** function. Then, use the **`fit`** method to fit the model to the data. Lastly, use the **`summary`** method to get the results from the regression model.

In [17]:
# Build OLS, fit model to data
OLS = ols(formula=ols_formula, data=ols_data)
model = OLS.fit()
model.summary()

Dep. Variable:,flipper_length_mm,R-squared:,0.337
Model:,OLS,Adj. R-squared:,0.327
Method:,Least Squares,F-statistic:,33.48
Date:,"Fri, 21 Jul 2023",Prob (F-statistic):,2.16e-07
Time:,23:01:21,Log-Likelihood:,-215.62
No. Observations:,68,AIC:,435.2
Df Residuals:,66,BIC:,439.7
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,128.6967,11.623,11.073,0.000,105.492,151.902
bill_depth_mm,3.6441,0.630,5.786,0.000,2.387,4.902

Omnibus:,1.35,Durbin-Watson:,1.994
Prob(Omnibus):,0.509,Jarque-Bera (JB):,0.837
Skew:,-0.255,Prob(JB):,0.658
Kurtosis:,3.19,Cond. No.,303.0
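
Individual values from the summary can also be read directly off the fitted results object instead of being copied from the table by hand. The sketch below assumes a hypothetical toy dataframe with columns `x` and `y` that follow an exact straight line, so the recovered numbers are easy to check. It is not the penguins model.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical toy data that follows y = 1 + 2x exactly (not the penguins data)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["y"] = 1.0 + 2.0 * toy["x"]

# Same pattern as above: Y variable, tilde, X variable
toy_model = ols(formula="y ~ x", data=toy).fit()

# params is a pandas Series indexed by term name
intercept = toy_model.params["Intercept"]
slope = toy_model.params["x"]

# rsquared is the same quantity reported as "R-squared" in the summary
r_squared = toy_model.rsquared

print(round(intercept, 4), round(slope, 4), round(r_squared, 4))
```

Applied to the penguins model, the same attributes would be read from `model`, with `"bill_depth_mm"` as the slope's term name.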


<center>
$
\text{flipper_length_mm} = 128.7 + 3.6 \times \text{bill_depth_mm}
$</center>
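
As a worked example of reading this equation, the sketch below plugs a bill depth into the fitted line using the unrounded coefficients from the summary table. The function name `predict_flipper_length` and the 18.0 mm input are hypothetical, chosen only for illustration. A prediction like this is only reasonable for bill depths similar to those in the Chinstrap data.

```python
# Coefficients copied from the regression summary above
INTERCEPT = 128.6967
SLOPE = 3.6441  # change in flipper length (mm) per 1 mm of bill depth


def predict_flipper_length(bill_depth_mm: float) -> float:
    """Return the flipper length (mm) that the fitted line predicts."""
    return INTERCEPT + SLOPE * bill_depth_mm


# 18.0 mm is a hypothetical bill depth chosen for illustration
predicted = predict_flipper_length(18.0)
print(f"{predicted:.1f} mm")  # prints 194.3 mm
```

In other words, each additional millimeter of bill depth is associated with about 3.6 mm of additional flipper length in this model.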
