# Building a regression model

## Make sure you have downloaded the [supermarket_marketing Google Sheet](https://drive.google.com/drive/u/1/folders/1UDhRY1XZ1y0H3jHckdRvImAaJozq3gup) as an Excel file and uploaded `supermarket_marketing.xlsx` to this directory

## Install libraries

One of the new libraries we will be using is `statsmodels`, created and given away as open-source software by Skipper Seabold (American U), Josef Perktold (UNC), Chad Fulton (Federal Reserve), Kevin Sheppard (Oxford), and many others.

We will also be using `seaborn` for drawing graphs, also an open-source project, by Michael Waskom (NYU, Flatiron Health).

In [None]:
!pip install pandas openpyxl statsmodels seaborn

## Import libraries -- a pink box with `FutureWarning` is normal and OK

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import numpy as np

## Load and explore data

In [None]:
data = pd.read_excel('supermarket_marketing.xlsx')

In [None]:
data.sample(5).T

## Creating a new variable based on existing variables

In [None]:
data['kids_teens_at_home'] = data['kids_at_home'] + data['teens_at_home']
data.sample(5)

## Regressions
For regressions, we need one output column and at least one input column. You cannot include the output column in the list of input columns.

### __For lab: change the output column to another column in this dataset you want to predict__

In [None]:
output_column = 'kids_teens_at_home'

### __For lab: change the input columns so you remove at least 3 of the existing items in this list, then add at least 3 new items from other number columns (no text columns) to build the model. Pick ones that make sense for what you are predicting.__

In [None]:
input_columns = ['birthyear','wine','fruit','sweets','num_web_orders','complained']

In [None]:
if output_column in input_columns:
    print("ERROR! You cannot include the output column in the input columns")
else:
    print("OK")

In [None]:
data[input_columns]

In [None]:
all_relevant_columns = input_columns.copy()
all_relevant_columns.append(output_column)
all_relevant_columns

In [None]:
data_cleaned = data[all_relevant_columns].dropna()

output = data_cleaned[output_column]
inputs = data_cleaned[input_columns]

## Training the model

In [None]:
model = sm.OLS(output, inputs).fit()

### The model summary gives us a lot of information -- too much! We want to focus on two aspects: 
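- R-squared, which is 0.816 here: roughly, the model accounts for about 81.6% of the variation in `kids_teens_at_home`. It does not mean predictions are within 81.6% of the correct values. Because this model has no intercept, statsmodels reports the uncentered version of R-squared, which measures variation around zero instead of around the average.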
- The second table with all of our columns
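  - Focus on the `coef` column: the coefficient or weight of that variable in the formula. It is in the column's original units (not normalized): the amount the predicted output changes when that input goes up by 1 and the other inputs stay the same.
  - Focus on the `P>|t|` column: the p-value, a probability (between 0 and 1) of seeing a weight at least this far from zero if this column were actually unrelated to the output column
      - The closer to 0, the stronger the case for keeping this column in your final formula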
      - People disagree over how large is too large: some remove everything over 0.01, 0.05, or 0.1
      - We will use 0.1, so remove columns from the list of inputs if they are __over__ 0.1
- Note that `e` means "times 10 to the power of", which tells you how many places to move the decimal point. A negative number after `e` moves it to the left: 2.23e-07 is 0.000000223, with one fewer zero after the decimal point than the number after `e` (here, 6 zeros). A positive number moves it to the right: 2.23e+07 is 22300000.
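
As a quick check of the `e` notation, Python itself reads and writes numbers this way. This small sketch uses only built-in functions (the `small` and `large` names are just for illustration):

```python
# Python understands "e" notation directly
small = float("2.23e-07")
large = float("2.23e+07")

# Format without "e" notation to see every digit
print(f"{small:.9f}")   # 0.000000223
print(f"{large:,.0f}")  # 22,300,000
```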

In [None]:
model.summary()
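
To see where an R-squared number comes from, here is a sketch that computes it by hand on made-up data with `numpy` (the `x`, `y`, and `w` names are hypothetical). It fits a formula with no intercept, then compares the leftover error against the variation around zero (uncentered) and around the average (centered):

```python
import numpy as np

# Made-up data for illustration only: y sits near 5 and barely depends on x
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 5 + 0.1 * x + rng.normal(0, 1, size=100)

# Fit y = w * x with no intercept using least squares
X = x.reshape(-1, 1)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w

# R-squared is 1 minus (leftover error / total variation)
ss_resid = np.sum(residuals ** 2)
r2_uncentered = 1 - ss_resid / np.sum(y ** 2)             # variation around zero
r2_centered = 1 - ss_resid / np.sum((y - y.mean()) ** 2)  # variation around the average

print(r2_uncentered, r2_centered)
```

For the same predictions, the uncentered value is never lower than the centered one, so a high R-squared from a model without an intercept deserves a second look.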

## Weights for all columns with p-values under 0.1:

In [None]:
model.params[model.pvalues < 0.1]

## Weights for all columns with p-values greater than or equal to 0.1:

In [None]:
model.params[model.pvalues >= 0.1]
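
The two filters above use a true/false mask on `model.pvalues`, which is a pandas Series labeled by column name. As a sketch of how the same idea could build the next list of input columns, here it is with made-up p-values (the `example_pvalues` numbers are hypothetical, not results from this dataset):

```python
import pandas as pd

# Hypothetical p-values for illustration; in the notebook this would be model.pvalues
example_pvalues = pd.Series(
    {'birthyear': 0.0001, 'wine': 0.03, 'fruit': 0.2, 'complained': 0.65}
)

threshold = 0.1

# Keep only the column names whose p-value is under the threshold
columns_to_keep = example_pvalues[example_pvalues < threshold].index.tolist()
columns_to_drop = example_pvalues[example_pvalues >= threshold].index.tolist()

print(columns_to_keep)  # ['birthyear', 'wine']
print(columns_to_drop)  # ['fruit', 'complained']
```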

## __For lab: if you have columns with p-values greater than or equal to 0.1, remove them from the list of input columns__

In [None]:
input_columns = ['birthyear','wine','fruit','sweets','num_web_orders']

In [None]:
if output_column in input_columns:
    print("ERROR! You cannot include the output column in the input columns")
else:
    print("OK")

In [None]:
data[input_columns]

In [None]:
all_relevant_columns = input_columns.copy()
all_relevant_columns.append(output_column)
all_relevant_columns

In [None]:
data_cleaned = data[all_relevant_columns].dropna()

output = data_cleaned[output_column]
inputs = data_cleaned[input_columns]

## Training the model again with the reduced list of input columns

In [None]:
model = sm.OLS(output, inputs).fit()

In [None]:
model.summary()

## Weights for all columns with p-values under 0.1:

In [None]:
model.params[model.pvalues < 0.1]

## Weights for all columns with p-values greater than or equal to 0.1:

In [None]:
model.params[model.pvalues >= 0.1]

## __For lab: If you have no p-values greater than or equal to 0.1, move to the next section. If you do, edit the last section to remove the input columns with high p-values and run it again.__

## Make predictions using this formula on our same dataset

In [None]:
output_predicted_name = output_column + "_predicted"

In [None]:
data_cleaned[output_predicted_name] = model.predict(inputs)
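
For a model fit without an intercept, like the one above, each prediction is the sum of every input value times its weight. A sketch with made-up weights and rows (all numbers and the `example_` names are hypothetical):

```python
import pandas as pd

# Hypothetical weights, standing in for model.params
example_weights = pd.Series({'wine': 0.002, 'fruit': -0.01})

# Two hypothetical rows of input data
example_inputs = pd.DataFrame({'wine': [100, 250], 'fruit': [10, 0]})

# Multiply each column by its weight and add across the row
example_predictions = example_inputs.dot(example_weights)

# The same calculation for the first row, written out by hand
first_row_by_hand = 100 * 0.002 + 10 * -0.01

print(example_predictions.tolist())
print(first_row_by_hand)
```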

In [None]:
data_cleaned.sample(5).T

## Visualize the actual and predicted columns in a scatterplot:

In [None]:
sns.scatterplot(data=data_cleaned, x=output_column, y=output_predicted_name, alpha=0.1)

## Save to an Excel file

In [None]:
data_cleaned.to_excel("supermarket_predictions.xlsx")