# Project 3: Getting Started 

This notebook is intended to help you get off to a flying start with the cars dataset. You do not have to use it, and you can discard any parts you do not like; they are purely intended to help you get started.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_theme()
import clogit_project3
import estimation as est
from numpy import linalg as la
from scipy import optimize
import LinearModels as lm

import statsmodels.formula.api as smf

# Read in data

The dataset, `cars.csv`, contains cleaned and processed data. If you want to make changes, the notebook `materialize.ipynb` creates the data from the raw source datasets.

In [2]:
cars = pd.read_csv('cars.csv')
lbl_vars = pd.read_csv('labels_variables.csv')
lbl_vals = pd.read_csv('labels_values.csv')

# convert from dataframe to dict
lbl_vals = {c: lbl_vals[c].dropna().to_dict() for c in lbl_vals.columns}
lbl_vars.set_index('variable', inplace=True)

# Set up for analysis

In [3]:
price_var = 'princ'
cars['logp'] = np.log(cars[price_var])
# new variable: interaction that lets the log-price coefficient differ in the home market
cars['logp_x_home'] = cars['logp'] * cars['home']

In [4]:
categorical_var = 'brand' # name of categorical variable
dummies = pd.get_dummies(cars[categorical_var]) # one dummy column for each value of categorical_var
x_vars_dummies = list(dummies.columns[1:].values) # omit a reference category, here it is the first (hence columns[1:])

# add dummies to the dataframe 
assert dummies.columns[0] not in cars.columns, 'It looks like you have already added these dummies to the dataframe. Avoid duplicates!'
cars = pd.concat([cars,dummies], axis=1)

# x_vars
x_vars = ['logp', 'home', 'cy', 'hp', 'we', 'li'] # <--- !!! choose your preferred variables here 
print(f'K = {len(x_vars)} variables selected.')

K = len(x_vars)
N = cars.ma.nunique() * cars.ye.nunique()
J = 40
x = cars[x_vars].values.reshape((N,J,K))
y = (cars['s'].values.reshape((N,J)))

# standardize x: demean and scale each regressor over all N*J rows
x = (x - x.mean(axis=(0, 1))) / x.std(axis=(0, 1))
# Motivation: guard against any single variable dominating because of its scale.
# A "singular matrix" error during estimation points to collinearity among the regressors.

K = 6 variables selected.
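
The reshape to `(N,J,K)` only works if the rows are ordered by market-period first and by car second. The sketch below checks this on a toy dataset and shows one way to standardize each regressor over all `N*J` rows. The column names and numbers are made up for illustration and are not part of `cars.csv`.

```python
import numpy as np
import pandas as pd

# Toy long-format data: N=3 markets, J=2 cars per market, K=2 regressors.
# (Hypothetical values; 'market', 'car', 'a' and 'b' are illustrative names.)
N, J, K = 3, 2, 2
toy = pd.DataFrame({
    'market': np.repeat(np.arange(N), J),
    'car': np.tile(np.arange(J), N),
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'b': [10.0, 0.0, 10.0, 0.0, 20.0, 20.0],
})

# The reshape assumes rows are sorted by market first, then by car.
toy = toy.sort_values(['market', 'car'])
x_toy = toy[['a', 'b']].to_numpy().reshape((N, J, K))

# x_toy[i, j, :] is the K-vector for car j in market i.
assert np.array_equal(x_toy[1, 0], [3.0, 10.0])

# Standardize each regressor over axes 0 and 1 jointly (all N*J rows).
x_std = (x_toy - x_toy.mean(axis=(0, 1))) / x_toy.std(axis=(0, 1))
print(x_std.mean(axis=(0, 1)).round(12), x_std.std(axis=(0, 1)))
```

After this transformation every regressor has mean zero and unit standard deviation, so coefficient magnitudes are comparable across variables.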


### `x_vars`: List of regressors to be used 

In [5]:
x_vars = ['logp', 'home', 'cy', 'hp', 'we', 'li'] # <--- !!! choose your preferred variables here 
print(f'K = {len(x_vars)} variables selected.')

K = 6 variables selected.


# Towards logit 

In order to work with the logit model, you have to be able to compute the utility indices, which typically take the form of an inner product of an $x$-vector and a $\theta$-vector. This is illustrated for you below. Since `x` is `(N,J,K)` (i.e. `x[i,j,:]` gives the $K$-vector of regressors for car `j` in market-period `i`), we just have to form the matrix product `x @ theta`: NumPy sums over the third dimension of `x` and returns an `(N,J)` array of utilities.

In [8]:
#v = x @ theta0 # how to multiply a trial value with the matrix of regressors 
#np.exp(v) / np.sum(np.exp(v), 1, keepdims=True) # choice probabilities 
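
As a runnable illustration of the two commented lines, here is a minimal sketch on random toy data. The dimensions and `theta_toy` are arbitrary assumptions, and the max-rescaling step is an addition for numerical stability that does not change the probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 4, 3, 2                      # toy dimensions, not those of the cars data
x_toy = rng.normal(size=(N, J, K))
theta_toy = np.array([0.5, -1.0])      # arbitrary trial value

v = x_toy @ theta_toy                  # (N, J): sums over the last axis of x
v = v - v.max(axis=1, keepdims=True)   # max-rescaling; probabilities are unchanged
ccp = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)  # choice probabilities

print(v.shape, ccp.sum(axis=1))
```

Each row of `ccp` holds the choice probabilities for one market-period and sums to one.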

In [9]:
theta0 = clogit_project3.starting_values(y, x)

In [27]:
pairs = [
    ('Nelder-Mead', 'Outer Product')
    , ('BFGS', 'Hessian')
    , ('BFGS', 'Sandwich')
]

list_of_dfs = []

for method, cov_type in pairs:
    
    res = est.estimate(clogit_project3.q
                       , theta0
                       , y
                       , x
                       , method=method
                       , cov_type=cov_type
                      )
    
    temp = pd.DataFrame({v:res[v] for v in ['theta', 'se', 't']})
    temp['method'] = method        # a scalar is broadcast to every row
    temp['cov_type'] = cov_type
    
    list_of_dfs.append(temp)

res_df = pd.concat(list_of_dfs, ignore_index=True)

Optimization terminated successfully.
         Current function value: 3.501650
         Iterations: 449
         Function evaluations: 709
Optimization terminated successfully.
         Current function value: 3.501647
         Iterations: 11
         Function evaluations: 126
         Gradient evaluations: 18
Optimization terminated successfully.
         Current function value: 3.501647
         Iterations: 11
         Function evaluations: 126
         Gradient evaluations: 18
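
To make the three covariance labels concrete, the sketch below computes textbook Hessian, outer-product and sandwich standard errors for a conditional logit on simulated data. It is written under several assumptions: the helper functions are hypothetical stand-ins rather than the contents of `clogit_project3` or `estimation`, the simulated `y_sim` is a one-hot choice indicator rather than a market share, and whether `est.estimate` uses exactly these formulas is not confirmed by this notebook.

```python
import numpy as np
from scipy import optimize

def ccp(theta, x):
    """Choice probabilities, shape (N, J)."""
    v = x @ theta
    ev = np.exp(v - v.max(axis=1, keepdims=True))
    return ev / ev.sum(axis=1, keepdims=True)

def q(theta, y, x):
    """Negative log-likelihood contribution per observation, shape (N,)."""
    return -(y * np.log(ccp(theta, x))).sum(axis=1)

def scores(theta, y, x):
    """Gradient of q_i w.r.t. theta, shape (N, K); assumes rows of y sum to one."""
    return -np.einsum('nj,njk->nk', y - ccp(theta, x), x)

def hessian(theta, x):
    """Hessian of the average criterion, shape (K, K); assumes rows of y sum to one."""
    p = ccp(theta, x)
    xc = x - np.einsum('nj,njk->nk', p, x)[:, None, :]   # deviations from the p-weighted mean
    return np.einsum('nj,njk,njl->kl', p, xc, xc) / x.shape[0]

# Simulated data (hypothetical): one-hot choices drawn from a known theta.
rng = np.random.default_rng(1)
N, J, K = 500, 4, 2
x_sim = rng.normal(size=(N, J, K))
theta_true = np.array([1.0, -0.5])
p_sim = ccp(theta_true, x_sim)
choice = np.array([rng.choice(J, p=p_sim[i]) for i in range(N)])
y_sim = np.eye(J)[choice]

res = optimize.minimize(lambda t: q(t, y_sim, x_sim).mean(), np.zeros(K),
                        jac=lambda t: scores(t, y_sim, x_sim).mean(axis=0),
                        method='BFGS')
theta_hat = res.x

A = hessian(theta_hat, x_sim)              # average Hessian
s = scores(theta_hat, y_sim, x_sim)
B = s.T @ s / N                            # average outer product of the scores
A_inv = np.linalg.inv(A)
cov = {
    'Hessian': A_inv / N,
    'Outer Product': np.linalg.inv(B) / N,
    'Sandwich': A_inv @ B @ A_inv / N,
}
se = {name: np.sqrt(np.diag(c)) for name, c in cov.items()}
for name, val in se.items():
    print(name, val)
```

Under a correctly specified likelihood the three estimators target the same asymptotic variance, so large gaps between them can be a hint of misspecification.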


In [29]:
#res_df

In [33]:
df_hess = res_df[(res_df.method == 'BFGS') & (res_df.cov_type == 'Hessian')].copy()

In [34]:
df_sand = res_df[(res_df.method == 'BFGS') & (res_df.cov_type == 'Sandwich')].copy()

In [36]:
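# Subtract by position: the two slices carry different index labels, so a plain
# Series subtraction would align on labels and return only NaN.
df_hess['diff_in_se'] = df_hess.se.values - df_sand.se.values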

In [39]:
type(df_hess.se[6])

numpy.float64
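
When comparing standard errors across slices of `res_df`, keep in mind that pandas aligns Series on their index labels before doing arithmetic. The toy example below (made-up numbers, hypothetical column names) shows how two slices of one stacked frame produce only `NaN` when subtracted directly, and how subtracting the underlying arrays avoids this.

```python
import pandas as pd

# Two slices of one stacked frame keep their original, non-overlapping row labels.
stacked = pd.DataFrame({'group': ['a', 'a', 'b', 'b'],
                        'se': [0.10, 0.20, 0.11, 0.19]})
a = stacked[stacked.group == 'a']      # index labels 0, 1
b = stacked[stacked.group == 'b']      # index labels 2, 3

aligned = a.se - b.se                              # aligns on labels: none match
positional = a.se.to_numpy() - b.se.to_numpy()     # subtracts row by row

print(aligned.isna().all())
print(positional)
```

An alternative with the same effect is to call `reset_index(drop=True)` on both slices before subtracting.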

### Partial effect
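$$
    \frac{\partial}{\partial x_{kl}} \Pr(j) = \Pr(j) \left[\boldsymbol{1}_{k=j} \beta_{l} - \Pr(k) \beta_{l} \right]
$$
where $x_{kl}$ is regressor $l$ of car $k$, $\beta_l$ is the corresponding coefficient (element $l$ of $\theta$), and $\boldsymbol{1}_{k=j}$ equals one when $k = j$ and zero otherwise.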
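
The standard logit result $\partial \Pr(j) / \partial x_{kl} = \Pr(j)\left[\boldsymbol{1}_{k=j} - \Pr(k)\right]\beta_l$ can be computed for all pairs $(j,k)$ at once and checked against a finite-difference derivative. The sketch below does this for a single market-period with random toy inputs. The function names and values are illustrative assumptions, and `theta[l]` plays the role of $\beta_l$.

```python
import numpy as np

def ccp(theta, x_i):
    """Choice probabilities for one market; x_i has shape (J, K)."""
    v = x_i @ theta
    ev = np.exp(v - v.max())
    return ev / ev.sum()

def partial_effects(theta, x_i, l):
    """J x J matrix with entry [j, k] = dPr(j)/dx_{kl}."""
    p = ccp(theta, x_i)
    return theta[l] * (np.diag(p) - np.outer(p, p))

rng = np.random.default_rng(2)
J, K = 5, 3
x_i = rng.normal(size=(J, K))          # toy regressors for one market
theta = np.array([0.8, -0.4, 0.2])     # arbitrary coefficients
l = 0                                  # regressor whose effect we study
pe = partial_effects(theta, x_i, l)

# Central finite differences: perturb x_{kl} for one car k at a time.
h = 1e-6
fd = np.empty((J, J))
for k in range(J):
    up, down = x_i.copy(), x_i.copy()
    up[k, l] += h
    down[k, l] -= h
    fd[:, k] = (ccp(theta, up) - ccp(theta, down)) / (2 * h)

print(np.abs(pe - fd).max())   # discrepancy between analytic and numerical derivatives
print(pe.sum(axis=0))          # each column sums to (numerically) zero
```

The columns sum to zero because the probabilities always add up to one: whatever probability one car gains from a change in $x_{kl}$, the other cars lose.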