linear regression is always a straight line
straight lines have an intercept (y-coord when x is zero)
straight lines have a slope (steepness of y, amount y increases if x is increased by one)
equation is y=intercept+slope*x
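
The intercept and slope can also be worked out by hand from the least-squares formulas. This is a minimal sketch using NumPy only; the toy data is invented for illustration and lies exactly on y = 2 + 3x, so the recovered values should match those two numbers.

```python
import numpy as np

# hypothetical toy data, invented for illustration: points exactly on y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# least-squares slope: co-variation of x and y divided by the variation of x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# the fitted line always passes through the point (mean of x, mean of y)
intercept = y.mean() - slope * x.mean()

print(intercept, slope)
```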

sns.regplot() displays a linear regression trend line but has limitations:
* you can't access the intercept and slope as variables
* you can't work with the model results as variables

this is why sometimes you have to manually run a linear regression

### Fitting a Linear Regression Steps
<ol>
    <li>import the ols function</li>
    <li>create the model object</li>
    <li>fit the model</li>
    <li>print the parameters of the fitted model</li>
</ol>

#### Import ols from statsmodels

In [None]:
from statsmodels.formula.api import ols 
#ols is ordinary least squares, a type of regression that is commonly used

#### Create the Model Object

In [None]:
# ols takes 2 arguments: a formula and the data
# in the formula, the response variable goes on the left side of the tilde and the explanatory variable on the right
# the data argument takes the DataFrame containing the variables

price_vs_bedrooms = ols("response_variable ~ explanatory_variable", data=dataframe)
# this would be "price ~ bedrooms"

#### Fit the Model

In [None]:
# add the .fit() method to your freshly created model object
price_vs_bedrooms = price_vs_bedrooms.fit()

#### Print the Parameters of the Fitted Model

In [None]:
# print the model result
# use the .params attribute because it contains the model's parameters
# the result will be 2 coefficients: the intercept and slope of the line
print(price_vs_bedrooms.params)
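
The four steps above can be run end to end on made-up data. This is a sketch, not the original dataset: the `houses` DataFrame, its column names, and the numbers used to generate it are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols  # step 1: import ols

# hypothetical data, invented for illustration (not a real housing dataset)
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 6, size=200)
price = 50_000 + 30_000 * bedrooms + rng.normal(0, 5_000, size=200)
houses = pd.DataFrame({"bedrooms": bedrooms, "price": price})

# steps 2 and 3: create the model object, then fit it
mdl_price_vs_bedrooms = ols("price ~ bedrooms", data=houses).fit()

# step 4: .params is a pandas Series, so the coefficients can be stored as variables
print(mdl_price_vs_bedrooms.params)
intercept = mdl_price_vs_bedrooms.params["Intercept"]
slope = mdl_price_vs_bedrooms.params["bedrooms"]
```

Because the data was generated with an intercept of 50,000 and a slope of 30,000, the fitted coefficients should land close to those values.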

#### Interpreting the Results

In [None]:
# based on these results, we expect the price to be:
# price = intercept + slope * bedrooms
# this means that for every additional bedroom, we expect the house price to increase by the slope

the intercept means that on average, a house with 0 bedrooms had a price of (intercept)

the explanatory coefficient (slope) means that if you increase the number of bedrooms
by 1, then the expected increase in house price is (slope)
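
The interpretation can be checked with plain arithmetic. The coefficients below are hypothetical values picked only to illustrate the calculation, and `expected_price` is a helper name introduced here, not part of any library.

```python
# hypothetical coefficients, chosen only to illustrate the arithmetic
intercept = 50_000.0  # expected price of a house with 0 bedrooms
slope = 30_000.0      # expected change in price per additional bedroom


def expected_price(bedrooms):
    """Return the price the fitted line predicts for a given number of bedrooms."""
    return intercept + slope * bedrooms


price_3 = expected_price(3)
price_4 = expected_price(4)

# the difference between two houses one bedroom apart is exactly the slope
print(price_3, price_4, price_4 - price_3)
```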

### Categorical Explanatory Variable

In [None]:
# a histogram for each category is used to visualize a categorical explanatory variable
# visualize numeric versus categorical
sns.displot(data=dataset, x="the numeric variable to plot, something like price", col="the category you want to split the plot by", bins=10)
plt.show()

In [None]:
# calculate means by category
mean_price_by_age = dataset.groupby("the grouping you want")["what you want the mean of"].mean()
print(mean_price_by_age)

In [None]:
# linear regression with a categorical explanatory variable
# same code as numeric explanatory variables, the returned coefficients are a little different though
# create the model and fit it
something_vs_something = ols("response variable ~ explanatory variable", data=dataset).fit()
print(something_vs_something.params)
# the results show the first group's mean as the intercept, then the other groups' coefficients relative to it
# this is useful when dealing with multiple explanatory variables, but if you just have one, add "+ 0" to the formula
# this removes the intercept, so each coefficient is that group's own mean instead of a difference from the first group
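
The difference between the two formulas is easier to see side by side. This is a sketch on a tiny invented dataset (the `fish` values are assumptions, not real measurements); it compares the coefficients against the group means from `groupby`.

```python
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical data, invented for illustration
fish = pd.DataFrame({
    "species": ["bream", "bream", "koi", "koi", "perch", "perch"],
    "mass": [600.0, 640.0, 300.0, 340.0, 380.0, 400.0],
})

group_means = fish.groupby("species")["mass"].mean()
print(group_means)

# default formula: the intercept is the mean of the first category,
# and the other coefficients are differences from that mean
mdl_relative = ols("mass ~ species", data=fish).fit()
print(mdl_relative.params)

# "+ 0" drops the intercept, so each coefficient is that category's own mean
mdl_absolute = ols("mass ~ species + 0", data=fish).fit()
print(mdl_absolute.params)
```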

### Making Predictions
data can be used to make predictions: for example, focus on one species of fish (koi), model mass against length, and then predict the mass of a fish given its length

#### The Principle of Predicting
If I set the explanatory variables to these values, what value would the response variable have?

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols

# the specific category can be isolated 
koi = fish[fish["species"] == "koi"]
print(koi.head())

# start out by plotting mass versus length with a scatterplot
sns.regplot(x="length", y="mass", data=koi, ci=None)
plt.show()

# before making predictions, you need to fit the model
mass_vs_length = ols("mass ~ length", data=koi).fit()
# view the coefficients of the model
print(mass_vs_length.params)

# choose some values for the explanatory variables and store them in a pandas dataframe
# in this case the only explanatory variable is the length
# specify an interval of values with np.arange() with the start and end of the interval as arguments
explanatory_data = pd.DataFrame({"length": np.arange(20, 41)})

# next do the predictions, prints one prediction for each row of the explanatory data
print(mass_vs_length.predict(explanatory_data))

# a single column of predictions isn't that useful on its own, so put them in a DataFrame
# alongside the explanatory variables with assign()
explanatory_data = pd.DataFrame({"length": np.arange(20, 41)})
prediction_data = explanatory_data.assign(mass=mass_vs_length.predict(explanatory_data))
print(prediction_data)

# now you can answer questions like how heavy a koi that is 20 cm long would be

# include these predictions on the scatterplot from before, you can use figure() for this
fig = plt.figure()
sns.regplot(x="length", 
            y="mass", 
            data=koi, 
            ci=None)
sns.scatterplot(x="length", 
                y="mass",
                data=prediction_data,
                color="red",
                marker="s"
               )
plt.show()

# Extrapolating: making predictions outside the range of observed data
# all the fish were between 23 and 38cm but the linear model lets us make predictions outside that range
# sometimes it doesn't make sense to extrapolate because the value is extreme and the model will perform badly
# for example, if we try to check a teeny, tiny fish:
teeny_koi = pd.DataFrame({"length": [10]}) # a dataframe with a single observation of 10cm
pred_teeny_koi = teeny_koi.assign(mass=mass_vs_length.predict(teeny_koi))
print(pred_teeny_koi)
# result is almost -500 grams and this is not physically possible 
# think of the context of your data to figure out if it's sensible to extrapolate 
# linear models don't know what's possible in real life so sometimes you get results that don't make any sense
# like if you tried to predict a price for -1 grocery stores or 2.5 grocery stores
# you need to understand what the data means in order to determine whether a prediction is nonsense or not
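
The prediction and extrapolation workflow can be tried without the fish dataset. This is a sketch on simulated data: the `koi` values are generated from made-up coefficients, and the range check at the end is just one simple way to flag extrapolated rows, not something statsmodels does for you.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical data, invented for illustration: lengths in cm, masses in g
rng = np.random.default_rng(1)
length = rng.uniform(23, 38, size=50)
mass = -1000 + 55 * length + rng.normal(0, 30, size=50)
koi = pd.DataFrame({"length": length, "mass": mass})

mass_vs_length = ols("mass ~ length", data=koi).fit()

# predictions for lengths inside the observed range
inside = pd.DataFrame({"length": np.arange(25, 36, 5)})
print(inside.assign(mass=mass_vs_length.predict(inside)))

# extrapolation: a length far below anything the model has seen
outside = pd.DataFrame({"length": [10]})
pred_outside = outside.assign(mass=mass_vs_length.predict(outside))
print(pred_outside)

# simple guard: flag explanatory values outside the observed range
is_extrapolated = (outside["length"] < koi["length"].min()) | (
    outside["length"] > koi["length"].max()
)
print(is_extrapolated)
```

With this simulated data the 10 cm prediction comes out negative, which is the same kind of physically impossible result described above.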


### Model Objects
The model objects created by ols contain a lot of information. Working with individual pieces of the model is often more useful than working with the whole model object at once.

you can extract the coefficients or parameters from your fitted model with the **.params** attribute:
    from statsmodels.formula.api import ols
    mass_vs_length = ols("mass ~ length", data=koi).fit()
    print(mass_vs_length.params)
    
you can use the **.fittedvalues** attribute to access the predictions on the original dataset used to create the model; this is a shortcut for taking the explanatory variables from the dataset and feeding them to .predict():
    print(mass_vs_length.fittedvalues)
    
residuals are a measure of inaccuracy in the model fit (actual response values - predicted response values), access them with the **.resid** attribute:
    print(mass_vs_length.resid)
    
use the **.summary()** method for a printout of the details of the model:
    mass_vs_length.summary()
    
even though you should use .predict(), you could manually calculate the predictions from the model coefficients, response = intercept + slope * explanatory data:
    # get the coefficients/parameters
    coeffs = mass_vs_length.params
    # get the intercept and slope by label (positional access like coeffs[0] is deprecated in recent pandas)
    intercept = coeffs["Intercept"]
    slope = coeffs["length"]
    # manually calculate the predictions
    mass_manual = intercept + slope * explanatory_data
    # compare to the results from .predict()
    print(mass_manual.assign(predictions_auto=mass_vs_length.predict(explanatory_data)))
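
The relationships between these attributes can be verified directly. This is a sketch on simulated data (the `koi` values are invented for illustration); it checks that `.fittedvalues`, `.resid`, and a manual calculation from `.params` all agree with `.predict()`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical data, invented for illustration
rng = np.random.default_rng(2)
length = rng.uniform(23, 38, size=40)
koi = pd.DataFrame({
    "length": length,
    "mass": -1000 + 55 * length + rng.normal(0, 30, size=40),
})

mass_vs_length = ols("mass ~ length", data=koi).fit()

# .fittedvalues equals .predict() applied to the original explanatory data
fitted_matches = np.allclose(mass_vs_length.fittedvalues, mass_vs_length.predict(koi))

# .resid equals the actual response minus the fitted response
resid_matches = np.allclose(
    mass_vs_length.resid, koi["mass"] - mass_vs_length.fittedvalues
)

# a manual prediction from the coefficients equals .predict()
intercept = mass_vs_length.params["Intercept"]
slope = mass_vs_length.params["length"]
new_lengths = pd.DataFrame({"length": [25, 30, 35]})
manual = intercept + slope * new_lengths["length"]
manual_matches = np.allclose(manual, mass_vs_length.predict(new_lengths))

print(fitted_matches, resid_matches, manual_matches)
```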

### Regression to the Mean
regression to the mean is a property of the data, not a type of model
linear regression can be used to quantify its effect
eventually extreme cases will look like average cases

the response value = fitted value + residual (how much the model missed by), in other words the parts you could explain with the model + the parts you couldn't explain

residuals exist due to problems in the model and because of fundamental randomness
extreme cases are often due to randomness, but randomness/luck doesn't persist and will run out
you don't want a perfect model because the real world has randomness and you don't want your model to capture that


In [None]:
# first make a scatterplot of the data
fig = plt.figure() # enables plot layering
sns.scatterplot(x="father_height", 
                y="son_height", 
                data=father_son_data)

In [None]:
# add a line where the father height and son height are equal
# xy1 is a point the line passes through (not an intercept), and slope=1 makes it the y = x line
plt.axline(xy1=(150, 150),
           slope=1, 
           linewidth=2,
           color="green")

In [None]:
# make it so one cm on the x-axis is the same as on the y-axis
plt.axis("equal")
plt.show()
# if sons always had the same height as their fathers, all the points would lie on this green line

In [None]:
# add a linear regression line to the plot
# to layer this on the scatterplot and green line, run it in the same cell as those calls, before plt.show()
sns.regplot(x="father_height", 
            y="son_height", 
            data=father_son_data,
            ci=None, 
            line_kws={"color": "black"})
# the black line above the green line means that for very short fathers, their sons are taller than them on average
# the black line below the green line means that for very tall fathers, their sons are shorter than them on average

In [None]:
# running a model quantifies the predictions of how much taller or shorter the sons will be
# run a regression
# sons' heights are the response variable and the fathers' heights are the explanatory variable
mdl_son_vs_father = ols("son_height ~ father_height", 
                        data=father_son_data).fit()
print(mdl_son_vs_father.params)

In [None]:
# now we can make predictions
# what is the predicted height of a son if the father is 190cm?
tall_dad = pd.DataFrame({"father_height": [190]}) # more values could be added
mdl_son_vs_father.predict(tall_dad)
# you could also try it with a short dad that's 150cm
# in both cases you will see that the extreme value became less extreme in the next generation
# this is a perfect example of regression to the mean
# for example winning teams will probably do worse the next year and losing ones better
# a player with many home runs in a season will likely decline the next because the elevated numbers are difficult to sustain
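
Regression to the mean can be reproduced with a small simulation. This is a sketch, not the father and son dataset: the mean, spread, and correlation of 0.5 are assumptions, and `np.polyfit` is used here in place of `ols` to get the least-squares line with NumPy alone.

```python
import numpy as np

# hypothetical simulation, not real height data: fathers and sons share the same
# mean and spread, and their heights have a correlation of about 0.5
rng = np.random.default_rng(3)
n = 10_000
mean_height, sd_height, corr = 175.0, 7.0, 0.5
father = rng.normal(mean_height, sd_height, size=n)
son = (
    mean_height
    + corr * (father - mean_height)
    + rng.normal(0, sd_height * np.sqrt(1 - corr ** 2), size=n)
)

# least-squares line for son ~ father; np.polyfit returns the highest degree first
slope, intercept = np.polyfit(father, son, deg=1)
print(intercept, slope)

# predictions for a very short and a very tall father
for dad in (150, 190):
    predicted_son = intercept + slope * dad
    print(dad, round(predicted_son, 1))
```

A slope between 0 and 1 is what pulls the predictions for extreme fathers back toward the average height.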

### Transforming Variables
sometimes the relationship between the explanatory variable and the response variable might not be a straight line, and to fit the linear regression model you might need to transform the explanatory and/or response variable
for example, the data might look curved: a fish grows in 3 directions at once, so the length cubed might give a better, more linear fit

In [None]:
# transform variable with CUBES
# plot mass vs. length cubed
# first create an additional column where you calculate the length cubed 
koi["length_cm_cubed"] = koi["length_cm"] ** 3
# next, replace this newly created column in the regplot call
sns.regplot(x="length_cm_cubed", 
            y="mass_g", 
            data=koi,
            ci=None)
plt.show()

# the data points will fit the line much better now so run the model next
# to do this, replace the original length variable with the cubed version
# then fit the model and extract its coefficients
mdl_koi = ols("mass_g ~ length_cm_cubed", data=koi).fit()
mdl_koi.params

# create the explanatory DataFrame in the same way as usual
# insert the cubed length, the untransformed length can be added as reference
explanatory_data = pd.DataFrame({"length_cm_cubed": np.arange(10, 41, 5) ** 3,
                                 "length_cm": np.arange(10, 41, 5)})
# assign/predict to add predictions
prediction_data = explanatory_data.assign(
    mass_g=mdl_koi.predict(explanatory_data))
print(prediction_data)
# these points will be added to the plot and be red squares
fig = plt.figure()
sns.regplot(x="length_cm_cubed", 
            y="mass_g", 
            data=koi,
            ci=None)
sns.scatterplot(data=prediction_data,
                x="length_cm_cubed", 
                y="mass_g", 
                color="red", 
                marker="s")
# you can run this plot again but with x="length_cm" for both to see the original x-axis
# the linear model will have non-linear predictions after the transformation is undone
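
Whether the cubed variable really fits better can be checked by comparing residuals. This is a sketch on simulated data where mass is generated from the cube of length (the `fish` values and the 0.02 factor are assumptions); it compares the sum of squared residuals of both models and shows that the predictions curve when read against the original length.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical data: mass grows with the cube of length, plus noise
rng = np.random.default_rng(4)
length_cm = rng.uniform(10, 40, size=100)
mass_g = 0.02 * length_cm ** 3 + rng.normal(0, 20, size=100)
fish = pd.DataFrame({"length_cm": length_cm, "mass_g": mass_g})
fish["length_cm_cubed"] = fish["length_cm"] ** 3

mdl_raw = ols("mass_g ~ length_cm", data=fish).fit()
mdl_cubed = ols("mass_g ~ length_cm_cubed", data=fish).fit()

# smaller squared residuals mean the line sits closer to the points
sse_raw = float((mdl_raw.resid ** 2).sum())
sse_cubed = float((mdl_cubed.resid ** 2).sum())
print(sse_raw, sse_cubed)

# predictions are linear in length cubed, so they curve when viewed against length
explanatory_data = pd.DataFrame({"length_cm": np.arange(10, 41, 10)})
explanatory_data["length_cm_cubed"] = explanatory_data["length_cm"] ** 3
prediction_data = explanatory_data.assign(mass_g=mdl_cubed.predict(explanatory_data))
print(prediction_data)
```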

In [None]:
# transform variables with SQUARE ROOT
# sometimes the points in a plot are cramped together, like in the bottom left
# this makes it difficult to assess whether there is a good fit
# square roots are a common transformation for data with a right-skewed distribution like this
sns.regplot(x="spent_usd", 
            y="n_impressions",
            data=ad_conversion,
            ci=None)
# if you transform both variables by square roots, the data are more spread out 
# throughout the plot and you can see the points follow the line fairly closely
ad_conversion["sqrt_spent_usd"] = np.sqrt(ad_conversion["spent_usd"])
ad_conversion["sqrt_n_impressions"] = np.sqrt(ad_conversion["n_impressions"])
sns.regplot(x="sqrt_spent_usd", 
            y="sqrt_n_impressions",
            data=ad_conversion,
            ci=None)
# modeling is the same as normal
mdl_ad = ols("sqrt_n_impressions ~ sqrt_spent_usd", data=ad_conversion).fit()
explanatory_data = pd.DataFrame({"sqrt_spent_usd": np.sqrt(np.arange(0, 601, 100)),
                                 "spent_usd": np.arange(0, 601, 100)})
# prediction takes an extra step, because we used the square root of the 
# response variable, not just the explanatory one, the predict function will 
# predict the square root of the number of impressions, so undo with squaring
# this is called "back transformation"
prediction_data = explanatory_data.assign(
    sqrt_n_impressions=mdl_ad.predict(explanatory_data),
    n_impressions=mdl_ad.predict(explanatory_data) ** 2)
print(prediction_data)
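
Back transformation can be tested on simulated data where the true relationship is known. This is a sketch, not the ad conversion dataset: the `ad` DataFrame is generated so that the square root of impressions is roughly 30 times the square root of spend, which is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# hypothetical right-skewed data, invented for illustration
rng = np.random.default_rng(5)
ad = pd.DataFrame({"spent_usd": rng.exponential(100, size=300)})
ad["n_impressions"] = (
    30 * np.sqrt(ad["spent_usd"]) + rng.normal(0, 15, size=300)
) ** 2

# transform both variables, then model on the square-root scale
ad["sqrt_spent_usd"] = np.sqrt(ad["spent_usd"])
ad["sqrt_n_impressions"] = np.sqrt(ad["n_impressions"])
mdl_ad = ols("sqrt_n_impressions ~ sqrt_spent_usd", data=ad).fit()

explanatory_data = pd.DataFrame({"spent_usd": [0, 100, 400]})
explanatory_data["sqrt_spent_usd"] = np.sqrt(explanatory_data["spent_usd"])

# the model predicts on the square-root scale; squaring undoes the transformation
sqrt_pred = mdl_ad.predict(explanatory_data)
prediction_data = explanatory_data.assign(
    sqrt_n_impressions=sqrt_pred,
    n_impressions=sqrt_pred ** 2,
)
print(prediction_data)
```

For a spend of 400 the generating relationship gives about 600 on the square-root scale, so the back-transformed prediction should be in the region of 360,000 impressions.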