# "Dummy Variable Regression"

- author: "<a href='https://www.linkedin.com/in/aneeshdata/'>Aneesh R</a>"
- toc: false
- comments: true
- categories: [Linear Algebra Tutorial]
- badges: false

In [1]:
import pandas as pd
import numpy as np
# Load the data
csv='https://raw.githubusercontent.com/varadan13/Data_Bank/main/Econ%20data/dummy_reg_gujar.csv'
data = pd.read_csv(csv).drop(['Unnamed: 4','Spending'],axis=1)

In [2]:
data.head(5)

,Salary,D2,D3
0,19583.0,1,0
1,20263.0,1,0
2,20325.0,1,0
3,26800.0,1,0
4,29470.0,1,0


The data contain the average salary (in dollars) of public school teachers in the 50 states and
the District of Columbia for the year 1985. These 51 areas are classified into three geographical
regions: (1) Northeast and North Central (21 states in all), (2) South (17 states in
all), and (3) West (13 states in all).

$D_2$ and $D_3$ are dummy variables, where    
$D_2 = 1$ when the area is in the Northeast and North Central region and 0 otherwise,   
$D_3 = 1$ when the area is in the South and 0 otherwise.   
When both $D_2$ and $D_3$ equal 0, the area is in the West, which makes the West the reference category.    
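
The coding scheme above can be sketched in a few lines. This is only an illustration: the `region` array and its labels are made up, since the CSV used here already ships with `D2` and `D3`.

```python
import numpy as np

# Hypothetical region labels for five areas (not taken from the CSV)
region = np.array(["North", "South", "West", "North", "West"])

# One dummy per non-reference region; West is the reference category
D2 = (region == "North").astype(int)
D3 = (region == "South").astype(int)

# West rows are exactly those where both dummies are 0
is_west = (D2 == 0) & (D3 == 0)
print(D2, D3, is_west)
```

Three regions need only two dummies; a third dummy for the West would be perfectly collinear with the intercept.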

Let's compute the mean of the average salary for each region.

In [3]:
North_Avg = data.loc[data['D2']==1].Salary.mean()
np.round(North_Avg,2) # = mean average salary in the northern region

24424.14

In [4]:
South_Avg = data.loc[data['D3']==1].Salary.mean()
South_Avg # = mean average salary in the southern region

22894.0

In [5]:
West_Avg = data.loc[(data['D3']==0) & (data['D2']==0)].Salary.mean()
np.round(West_Avg,2) # = mean average salary in the western region

26158.62

We can see that the means differ, but are the differences statistically significant?

ANOVA is the obvious tool, but let's use regression to find the answer.

The model we are going to fit here is as follows:

$avgSalary = \beta_0+\beta_1D_2+\beta_2D_3$

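Let's interpret the betas.

$E(avgSalary|D_2=0,D_3=0)=\beta_0$

$E(avgSalary|D_2=1,D_3=0)=\beta_0+\beta_1$

$E(avgSalary|D_2=0,D_3=1)=\beta_0+\beta_2$

$\beta_0$ is the mean of avgSalary in the West. The slope parameters are differences from that baseline: $\beta_1$ is the North mean minus the West mean, and $\beta_2$ is the South mean minus the West mean.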
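
To make this interpretation concrete, here is a minimal sketch on made-up numbers (not the salary data). It assumes plain least squares via `np.linalg.lstsq` rather than the `udregression` helper used in this post, and checks that the intercept matches the reference-group mean while the slopes match the differences in means.

```python
import numpy as np

# Made-up salaries for three regions, two areas each (not the real data)
salary = np.array([10.0, 12.0, 7.0, 9.0, 4.0, 6.0])
D2 = np.array([0, 0, 1, 1, 0, 0])  # 1 = North
D3 = np.array([0, 0, 0, 0, 1, 1])  # 1 = South

# Design matrix: intercept column, then the two dummies
X = np.column_stack([np.ones(len(salary)), D2, D3])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)

west_mean = salary[(D2 == 0) & (D3 == 0)].mean()
north_mean = salary[D2 == 1].mean()
south_mean = salary[D3 == 1].mean()

print(beta)  # intercept, then the two slope coefficients
print(west_mean, north_mean - west_mean, south_mean - west_mean)
```

Both printed lines should show the same three numbers, up to floating-point error.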

Let's fit the model and check whether these differences are statistically significant.

In [9]:
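X=data.drop(['Salary'],axis=1).to_numpy() # regressors: D2 and D3
Y=data.Salary.to_numpy()                  # response: average salary
reg=udregression(X,Y) # udregression is defined in the last cell; run that cell first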

In [10]:
reg.print_result() # x1 represents D2 and x2 is D3

var        coef    st error    t values    p values        LL         RL
-----  --------  ----------  ----------  ----------  --------  ---------
const  26158.6      1128.52    23.1795   1.0227e-27  23889.6   28427.7
x1     -1734.47     1435.95    -1.20789  0.233007    -4621.65   1152.7
x2     -3264.62     1499.15    -2.17764  0.0343794   -6278.87   -250.363

  rsquared    rsquared_Adj    fvalue
----------  --------------  --------
 0.0900828       0.0521696   2.37603


We can see that the p-value for D2 is not significant at the 5% level. What does that mean?

It means we cannot reject the hypothesis that the mean average salary in the Northeast and North Central region equals the mean in the West. This is not proof that the two means are equal.

Since the p-value for D3 is significant at the 5% level, the mean average salary in the South differs from that in the West.
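
The p-value for D3 can be reproduced by hand from the coefficient and standard error in the table. The sketch below assumes the usual two-sided t test with 51 - 3 = 48 residual degrees of freedom; the numbers are copied from the regression output above.

```python
from scipy.stats import t

coef, se = -3264.62, 1499.15   # D3 row of the regression table
df_resid = 51 - 3              # 51 areas minus 3 estimated parameters

t_stat = coef / se
p_value = 2 * t.sf(abs(t_stat), df_resid)  # two-sided p-value
print(round(t_stat, 3), round(p_value, 4))
```

The result should agree with the `t values` and `p values` columns for `x2` up to rounding.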

What exactly are the values of avg salary from this model?

In [11]:
params=reg.fit_model().params # fit once and reuse the coefficients

avg_west=params[0]            # intercept = mean average salary in the West
avg_north=params[0]+params[1] # West mean plus the D2 coefficient
avg_south=params[0]+params[2] # West mean plus the D3 coefficient

In [12]:
print("average salary in the west = ", np.round(avg_west,2))
print("average salary in the north = ", np.round(avg_north,2))
print("average salary in the south = ", np.round(avg_south,2))

average salary in the west =  26158.62
average salary in the north =  24424.14
average salary in the south =  22894.0


Let's run an ANOVA on the same data.

In [14]:
from statsmodels.formula.api import ols

def create_treatments_col(row):
  # Map the dummy coding of a single row back to a region label
  if row.D2==1 and row.D3==0:
    return 'NORTH'
  elif row.D2==0 and row.D3==1:
    return 'SOUTH'
  else:
    return 'WEST'

data['treatments']=data.apply(create_treatments_col,axis=1)
data_ANOVA=data.drop(['D2','D3'],axis=1)
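
The row-wise `apply` is fine for 51 rows. As an alternative, the same labels can be built in one vectorized call with `np.select`. The sketch below uses made-up dummy arrays rather than the real columns, so treat it as an illustration only.

```python
import numpy as np

# Hypothetical dummy columns for four areas (not the real data)
D2 = np.array([1, 0, 0, 1])
D3 = np.array([0, 1, 0, 0])

# Conditions are checked in order; rows matching none fall back to the default
treatments = np.select(
    [(D2 == 1) & (D3 == 0), (D2 == 0) & (D3 == 1)],
    ["NORTH", "SOUTH"],
    default="WEST",
)
print(treatments)
```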

In [15]:
data_ANOVA.head(5)

,Salary,treatments
0,19583.0,NORTH
1,20263.0,NORTH
2,20325.0,NORTH
3,26800.0,NORTH
4,29470.0,NORTH


In [16]:
import statsmodels.api as sm # needed for sm.stats.anova_lm

model = ols('Salary ~ C(treatments)', data=data_ANOVA).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table

,sum_sq,df,F,PR(>F)
C(treatments),78676550.0,2.0,2.376027,0.103764
Residual,794703700.0,48.0,,


In [22]:
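from scipy.stats import f

pvalue=f.sf(2.376027,2,48) # upper-tail probability, same as 1-f.cdf(...)
pvalue<0.05 # False, so the overall F test is not significant at the 5% level

# The F test asks whether all three regional means are equal (beta1 = beta2 = 0 jointly),
# while the t test on D3 compares only the South with the West.
# The two tests answer different questions, so they need not agree.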

False
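
The ANOVA F statistic matches the `fvalue` reported by the regression because both test whether all regional means are equal. As a quick check, the sketch below rebuilds F two ways from the rounded values printed in the tables above, so small rounding differences are expected.

```python
# Values copied from the tables above
ss_between, df_between = 78676550.0, 2
ss_resid, df_resid = 794703700.0, 48
r_squared = 0.0900828

# ANOVA form: ratio of mean squares
f_anova = (ss_between / df_between) / (ss_resid / df_resid)

# Regression form: the same ratio written in terms of R^2
f_reg = (r_squared / df_between) / ((1 - r_squared) / df_resid)

print(round(f_anova, 4), round(f_reg, 4))
```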

udregression code (run this cell before In [9])

In [23]:
import statsmodels.api as sm 
import tabulate as tb # to draw tables

class udregression(object):
  """User-defined (ud) helper class that runs an OLS regression with statsmodels."""
  def __init__(self,X,Y,intercept=True):
    self.X=X
    self.Y=Y
    self.intercept=intercept
  
  def fit_model(self):
    if self.intercept:
      # Prepend a column of ones so the model has an intercept
      X1=np.column_stack([np.ones(self.X.shape[0]),self.X])
      model=sm.OLS(self.Y,X1)
    else:
      model=sm.OLS(self.Y,self.X)
    return model.fit()
  
  def print_result(self):
    fitted_model=self.fit_model()
    length=len(fitted_model.summary().tables[1].data)
    var=[i[0] for i in fitted_model.summary().tables[1].data[1:length]]
    prtmp1={'var':var,'coef':fitted_model.params,
          'st error':fitted_model.bse,'t values':fitted_model.tvalues,
          'p values':fitted_model.pvalues,'LL':fitted_model.conf_int()[:,0],
          'RL':fitted_model.conf_int()[:,1]}
    
    prtmp2={'rsquared':[fitted_model.rsquared],
          'rsquared_Adj':[fitted_model.rsquared_adj],
          'fvalue':[fitted_model.fvalue]}
    
    print(tb.tabulate(prtmp1,headers="keys"))
    print()
    print(tb.tabulate(prtmp2,headers="keys"))

  def __repr__(self): return "a custom regression runner has been initialised"