In [1]:
import matplotlib as mpl
import pandas as pd
import pyrsm as rsm
import statsmodels.formula.api as smf
import seaborn as sns

# increase plot resolution
# mpl.rcParams["figure.dpi"] = 150

In [2]:
# load data
diamonds = pd.read_pickle("data/diamonds.pkl")
click = pd.read_pickle("data/click.pkl")

In [3]:
# review the data
rsm.describe(diamonds)

## Diamond prices

Prices of 3,000 round cut diamonds

### Description

A dataset containing the prices and other attributes of a sample of 3,000 diamonds. The variables are as follows:

### Variables

- price = price in US dollars ($338--$18,791)
- carat = weight of the diamond (0.2--3.00)
- clarity = a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- cut = quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color = diamond color, from J (worst) to D (best)
- depth = total depth percentage = z / mean(x, y) = 2 * z / (x + y) (54.2--70.80)
- table = width of top of diamond relative to widest point (50--69)
- x = length in mm (3.73--9.42)
- y = width in mm (3.71--9.29)
- z = depth in mm (2.33--5.58)
- date = shipment date

### Additional information

<a href="http://www.diamondse.info/diamonds-clarity.asp" target="_blank">Diamond search engine</a>


In [4]:
rsm.describe(click)

## Click ballpoint pens

### Description

The data represent annual sales for 40 territories for Click, a national manufacturer of ballpoint pens. The company uses regional wholesalers to distribute its products. In addition, it uses company sales representatives and TV advertising.

### Variables

A data frame with 40 observations on 4 variables

- sales = Sales of Click products in a territory in $1000
- advertising = Number of spots purchased in a territory
- salesreps = Total number of sales representatives assigned to a territory
- wholesaler_eff = A measure of wholesaler efficiency based on a survey. A factor with four levels: Poor, Fair, Good, and Excellent

### Source

Marketing Research: Methodological Foundations by Iacobucci and Churchill, Cengage Learning

Run a linear regression on the `diamonds` data with `price` as the response variable and `clarity` as the only explanatory variable.

In [None]:
smf.ols(...).fit().summary()

Run a linear regression on the `diamonds` data with `clarity` and `carat` as the only explanatory variables. Does the effect of `clarity` change? If so, why?
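One possible reason a coefficient changes when a second variable is added is that the two explanatory variables are correlated. The sketch below illustrates this on synthetic data, not on the real `diamonds` file: the `toy` data frame, its two clarity levels, and all numeric values are assumptions chosen so that clearer stones tend to be smaller.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1234)
n = 2000

# Synthetic stand-in for `diamonds` (hypothetical values, NOT the real data):
# high-clarity stones are assumed to be smaller on average
level = rng.choice(["low", "high"], size=n)
carat = np.where(level == "high", 0.6, 1.0) + rng.normal(0, 0.2, size=n)
# price rises with carat AND with clarity
price = 1000 + 8000 * carat + 1000 * (level == "high") + rng.normal(0, 500, size=n)

toy = pd.DataFrame(
    {
        "price": price,
        "carat": carat,
        # "low" is listed first so it becomes the reference level
        "clarity": pd.Categorical(level, categories=["low", "high"]),
    }
)

fit_clarity = smf.ols("price ~ clarity", data=toy).fit()
fit_both = smf.ols("price ~ clarity + carat", data=toy).fit()

# effect of high vs. low clarity, without and with carat in the model
print(fit_clarity.params["clarity[T.high]"])
print(fit_both.params["clarity[T.high]"])
```

In this toy setup the clarity coefficient is negative on its own because it also picks up the size difference, and it turns positive once `carat` is held constant. Whether the real data behave the same way is something to check with the actual regressions.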

Calculate model fit measures, variable importance, and prediction plots.

Check the correlations between `carat` and the dummy-coded levels of `clarity`.

In [None]:
pd.get_dummies(diamonds[["carat", "clarity"]]).corr().round(3)
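To see what the cell above produces, here is a minimal sketch on a six-row toy data frame (made-up values, with only three of the clarity levels). `pd.get_dummies` turns each level of the categorical column into its own 0/1 column, so the correlation matrix shows how `carat` relates to each level separately.

```python
import pandas as pd

# Hypothetical mini data set; values are made up for illustration
toy = pd.DataFrame(
    {
        "carat": [0.3, 0.5, 0.9, 1.2, 1.5, 2.0],
        "clarity": pd.Categorical(
            ["IF", "IF", "VS1", "VS1", "I1", "I1"],
            categories=["I1", "VS1", "IF"],  # worst to best
        ),
    }
)

# one 0/1 column per clarity level; numeric columns pass through unchanged
dummies = pd.get_dummies(toy, dtype=float)
corr = dummies.corr().round(3)
print(corr)
```

A positive entry in the `carat` row means stones of that clarity level tend to be heavier than average in the sample, and a negative entry means they tend to be lighter.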

Run a linear regression on the `click` data with `sales` as the response variable and `advertising` as the only explanatory variable. Provide a full interpretation of the regression coefficient.

In [None]:
smf.ols(...).fit().summary()

Run a linear regression on the `click` data with `sales` as the response variable and `salesreps` as the only explanatory variable. Provide a full interpretation of the regression coefficient.

Run a linear regression on the `click` data with `sales` as the response variable and `advertising` and `salesreps` as the explanatory variables. Provide a full interpretation of the regression coefficients. Are the coefficients different compared to the previous models? If so, why do you think that is?
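When two explanatory variables are correlated, the slope from a one-variable regression equals the slope from the two-variable regression plus the other variable's coefficient times the slope of that variable regressed on the first. The sketch below checks this identity numerically on synthetic data. The variable names mirror `click`, but the values and the `ols` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 40

# Synthetic stand-in for `click` (NOT the real data): territories with more
# advertising are assumed to also have more sales reps
advertising = rng.normal(10, 3, n)
salesreps = 2 + 0.4 * advertising + rng.normal(0, 1, n)
sales = 50 + 3 * advertising + 10 * salesreps + rng.normal(0, 5, n)


def ols(y, *xs):
    """Hypothetical helper: least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = ols(sales, advertising)[1]                   # sales ~ advertising
_, b_adv, b_reps = ols(sales, advertising, salesreps)   # sales ~ advertising + salesreps
delta = ols(salesreps, advertising)[1]                  # salesreps ~ advertising

# simple slope = partial slope + (omitted coefficient * auxiliary slope)
print(b_simple, b_adv + b_reps * delta)
```

The two printed numbers agree up to floating-point error. If `delta` were zero, meaning the variables were uncorrelated in the sample, the simple and multiple regression slopes would coincide.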

Calculate model fit measures, variable importance, and prediction plots.
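As a reference point for the fit measures, the sketch below computes R-squared, adjusted R-squared, and RMSE by hand and compares the first two with the values statsmodels reports. It uses synthetic data with column names borrowed from `click`. RMSE is taken here as the square root of the mean squared residual, which is one common definition and may differ from what other tools report.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 40

# Synthetic stand-in for `click` (hypothetical values, NOT the real data)
toy = pd.DataFrame(
    {
        "advertising": rng.normal(10, 3, n),
        "salesreps": rng.normal(5, 1.5, n),
    }
)
toy["sales"] = 50 + 3 * toy["advertising"] + 10 * toy["salesreps"] + rng.normal(0, 5, n)

fit = smf.ols("sales ~ advertising + salesreps", data=toy).fit()

resid = toy["sales"] - fit.fittedvalues
sse = float((resid**2).sum())                                 # unexplained variation
sst = float(((toy["sales"] - toy["sales"].mean()) ** 2).sum())  # total variation
k = 2                                                         # number of explanatory variables

r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
rmse = np.sqrt(sse / n)

print(r2, fit.rsquared)
print(adj_r2, fit.rsquared_adj)
print(rmse)
```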

Check the correlation between `sales`, `advertising`, and `salesreps`.

In [None]:
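# restrict to the numeric columns; `wholesaler_eff` is categorical and
# makes DataFrame.corr() raise in pandas 2.x
click[["sales", "advertising", "salesreps"]].corr().round(3)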

In [None]:
rsm.correlation(click).summary()