# Introduction

Understanding the factors that influence crime rates is crucial for effective crime prevention and the 
development of evidence-based policies. One key area of research focuses on investigating the relationship between the probabilities of apprehension and punishment and crime participation rates. 
By studying the deterrence effect of the criminal justice system, we can gain valuable insights into the mechanisms behind criminal decision-making and inform policy interventions aimed at reducing crime.
Researching the impact of these probabilities on crime participation rates has significant implications for policy and the allocation of resources within the criminal justice system. By understanding how 
variations in these probabilities influence criminal behaviour, policymakers can develop targeted 
strategies to enhance deterrence and reduce crime rates.

### ***RESEARCH QUESTION: Do the probabilities of apprehension and punishment affect crime participation rate?***

# About the Data

The dataset covers crime in 90 North Carolina counties, observed annually from 1981 through 1987. With one observation per county per year, it forms a balanced panel of 90 × 7 = 630 observations.

Total Number of Observations: 630

Features:
1. county => county identifier
2. year =>  81 to 87
3. crmrte =>  crimes committed per person
4. prbarr =>  'probability' of arrest
5. prbconv =>  'probability' of conviction
6. prbpris =>  'probability' of prison sentence
7. avgsen =>  avg. sentence, days
8. polpc =>  police per capita
9. density =>  people per sq. mile.
10. taxpc =>  tax revenue per capita
11. west => =1 if in western N.C.
12. central => =1 if in central N.C.
13. urban => =1 if in SMSA
14. pctmin80 => perc. minority, 1980
15. wcon => weekly wage, construction
16. wtuc => wkly wge, trns, util, commun
17. wtrd => wkly wge, whlesle, retail trade
18. wfir => wkly wge, fin, ins, real est
19. wser => wkly wge, service industry
20. wmfg => wkly wge, manufacturing
21. wfed => wkly wge, fed employees
22. wsta => wkly wge, state employees
23. wloc => wkly wge, local gov emps
24. mix => offense mix: face-to-face/other
25. pctymle => percent young male
26. d82 => =1 if year == 82
27. d83 => =1 if year == 83
28. d84 => =1 if year == 84
29. d85 => =1 if year == 85
30. d86 => =1 if year == 86
31. d87 => =1 if year == 87
32. lcrmrte => log(crmrte)
33. lprbarr => log(prbarr)
34. lprbconv => log(prbconv)
35. lprbpris => log(prbpris)
36. lavgsen => log(avgsen)
37. lpolpc => log(polpc)
38. ldensity => log(density)
39. ltaxpc => log(taxpc)
40. lwcon => log(wcon)
41. lwtuc => log(wtuc)



# Methodology

The dataset is panel data: it contains information on crime in 90 North Carolina counties for each year from 1981 through 1987. We start from the following OLS model.

***Model***: Ln(crmrte)= β_0+ β_1*Ln(prbarr)+ β_2*Ln(prbconv)+ β_3 prbpris+µ

***crmrte*** is the number of crimes committed per person. This is our dependent variable.

***prbarr*** is the 'probability' of arrest, the likelihood that a person who commits an offense is apprehended by law enforcement. We use its log because the distribution is right-skewed (approximately log-normal): most values are concentrated at the low end with a long right tail, and taking the log brings the distribution closer to normal.

***prbconv*** is the 'probability' of conviction, the likelihood that someone who has been arrested is found guilty in court and held legally responsible for the offense. We use its log for the same reason as prbarr: the distribution is right-skewed with a long tail, and the log brings it closer to normal.

***prbpris*** is the 'probability' of a prison sentence, the likelihood of being punished with incarceration if convicted rather than with alternative forms of punishment, such as fines or probation.

***We expect these independent variables to be negatively related to crime rate or in other words we expect the coefficient of these independent variables to be less than zero.***
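The effect of the log transformation on a skewed variable can be illustrated with a small sketch. It uses simulated log-normal draws as a hypothetical stand-in for prbarr (the parameters and the `skewness` helper are assumptions made for illustration, not values from the dataset) and compares the skewness before and after taking logs.

```r
# Simulated stand-in for a right-skewed 'probability' variable (assumed parameters)
set.seed(1)
p <- rlnorm(630, meanlog = -1.2, sdlog = 0.5)

# Sample skewness: mean of the cubed standardised values (hypothetical helper)
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(p)        # clearly positive: long right tail
skew_log <- skewness(log(p))   # close to zero: roughly symmetric

cat("skewness of raw values:", round(skew_raw, 2), "\n")
cat("skewness of log values:", round(skew_log, 2), "\n")
```

On the real data the same comparison could be made by passing `df$prbarr` and `df$prbconv` to the helper.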

Because this is panel data, a simple OLS model is very likely to suffer from endogeneity caused by unobserved heterogeneity. “The unobserved dependency of other independent variable(s) is called unobserved heterogeneity and the correlation between the independent variable(s) and the error term (i.e. the unobserved independent variables) is called endogeneity.”
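To make the endogeneity problem concrete, here is a minimal simulation sketch in base R. Everything in it is an assumption chosen for illustration (the panel dimensions mimic this dataset, the true slope is set to -0.5, and the unobserved unit effect `alpha` is deliberately correlated with the regressor). It compares the pooled OLS slope with the slope obtained after subtracting each unit's own mean.

```r
# Simulated panel; all names and numbers are illustrative assumptions
set.seed(42)
n_units <- 90
n_years <- 7
unit <- rep(seq_len(n_units), each = n_years)

# Unobserved unit effect, deliberately correlated with the regressor x
alpha <- rnorm(n_units)
x <- 0.8 * alpha[unit] + rnorm(n_units * n_years)
y <- -0.5 * x + 2 * alpha[unit] + rnorm(n_units * n_years, sd = 0.5)  # true slope: -0.5

# Pooled OLS ignores alpha, so its slope absorbs part of the unit effect
pooled_slope <- unname(coef(lm(y ~ x))["x"])

# Within estimator: demean x and y by unit, then regress without an intercept
x_within <- x - ave(x, unit)
y_within <- y - ave(y, unit)
within_slope <- unname(coef(lm(y_within ~ x_within - 1))["x_within"])

cat("pooled OLS slope:", round(pooled_slope, 3), "\n")
cat("within slope:    ", round(within_slope, 3), "\n")
```

In this setup the pooled slope is pushed away from the true value, far enough to change sign, while the demeaned regression recovers a slope close to -0.5.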

There are five assumptions in simple linear regression:
1. Linearity
2. Exogeneity
3. (a) Homoskedasticity, and (b) Non-autocorrelation
4. Independent variables are not stochastic
5. No Multicollinearity

Two of these assumptions help us decide between Pooled OLS and the Fixed Effect or Random Effect models. If assumption 2, assumption 3, or both are violated, the Fixed Effect and Random Effect models are more suitable than Pooled OLS.
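One way to see why these two assumptions matter for panel data is to split the error term of the model into a county-specific part and an idiosyncratic part. The indices i (county) and t (year) and the symbols α_i and u_it are notation introduced here for illustration; they do not appear in the original model statement.

```latex
\ln(crmrte_{it}) = \beta_0 + \beta_1 \ln(prbarr_{it}) + \beta_2 \ln(prbconv_{it})
                 + \beta_3\, prbpris_{it} + \mu_{it},
\qquad \mu_{it} = \alpha_i + u_{it}
```

If α_i is correlated with the regressors, exogeneity (assumption 2) fails. Because α_i is the same in every year for a given county, it also makes the errors µ_it of that county correlated over time, which shows up as autocorrelation (assumption 3b).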


# 1. Pooled OLS
We start by importing the required libraries and loading the dataset. We then fit the Pooled OLS model and validate the required assumptions.

In [2]:
# importing libraries

library(lmtest)
library(readxl)
library(estimatr)
library(plm)

In [3]:
# loading data

df <- read_excel("/kaggle/input/crime-in-north-carolina-1981-to-1987/Crime_NorthCarolina.xlsx")
head(df,n=3)

county,year,crmrte,prbarr,prbconv,prbpris,avgsen,polpc,density,taxpc,⋯,lpctymle,lpctmin,clcrmrte,clprbarr,clprbcon,clprbpri,clavgsen,clpolpc,cltaxpc,clmix
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,81,0.0398849,0.289696,0.402062,0.472222,5.61,0.0017868,2.307159,25.69763,⋯,-2.43387,3.006608,.,.,.,.,.,.,.,.
1,82,0.0383449,0.338111,0.433005,0.506993,5.59,0.0017666,2.330254,24.87425,⋯,-2.449038,3.006608,-3.9376300000000003E-2,0.15454219999999999,7.4143000000000001E-2,7.1048E-2,-3.5714000000000002E-3,-1.1364000000000001E-2,-3.2565400000000001E-2,3.0857300000000001E-2
1,83,0.0303048,0.330449,0.525703,0.479705,5.8,0.0018358,2.341801,26.45144,⋯,-2.464036,3.006608,-0.23531560000000001,-2.2922000000000001E-2,0.1939871,-5.5325800000000001E-2,3.6878599999999997E-2,3.8413000000000003E-2,6.1477400000000001E-2,-0.2447317


In [4]:
# pooled OLS model

model <- lm_robust(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, data = df)
summary(model)


Call:
lm_robust(formula = log(crmrte) ~ log(prbarr) + log(prbconv) + 
    prbpris, data = df)

Standard error type:  HC2 

Coefficients:
             Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
(Intercept)   -5.0056    0.11804 -42.404 3.392e-186 -5.23739  -4.7738 626
log(prbarr)   -0.6833    0.05406 -12.640  8.844e-33 -0.78948  -0.5772 626
log(prbconv)  -0.4644    0.05450  -8.520  1.186e-16 -0.57141  -0.3573 626
prbpris        0.4791    0.21618   2.216  2.705e-02  0.05454   0.9036 626

Multiple R-squared:  0.4606 ,	Adjusted R-squared:  0.458 
F-statistic: 73.72 on 3 and 626 DF,  p-value: < 2.2e-16

All the coefficients are negative except the one on the probability of a prison sentence (prbpris), which is 0.4791. According to the Pooled OLS model, a higher probability of a prison sentence is associated with a higher crime rate, which goes against our initial expectation that all three independent variables are negatively related to the crime rate. A likely explanation is that the Pooled OLS estimates are inconsistent because of unobserved heterogeneity.

# 2. Validating LR Assumptions

## Assumption 3(a)
We check ***assumption (3a)***, homoskedasticity, using the Breusch-Pagan test (also known as the Breusch-Pagan-Godfrey test), a statistical test for heteroscedasticity in a regression model. Heteroscedasticity means that the variability of the error terms (residuals) is not constant across levels of the independent variables. In other words, the spread of the residuals is not the same throughout the range of the predictors.

We run the test on our model with H0: the residuals are homoskedastic. The p-value (6.044e-11) is less than 0.05, so we ***reject H0*** and conclude that ***heteroscedasticity is present in the regression model and assumption 3(a) is violated.***

In [21]:
# Perform Breusch-Pagan test
bptest(model)


	studentized Breusch-Pagan test

data:  model
BP = 50.569, df = 3, p-value = 6.044e-11
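As a rough sketch of what the studentized Breusch-Pagan statistic measures, it can be reproduced by hand in base R: regress the squared residuals on the regressors and multiply the R-squared of that auxiliary regression by the sample size. The data below are simulated with an error spread that grows with x by construction (an assumption for illustration, not the crime data).

```r
# Simulated data with heteroskedastic errors by construction (illustrative only)
set.seed(7)
n <- 500
x <- runif(n, 0.5, 3)
y <- 1 + 2 * x + rnorm(n, sd = x)   # error standard deviation increases with x

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressors
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) form of the statistic: n * R^2, chi-squared with
# as many degrees of freedom as there are regressors (here 1)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), " p-value:", signif(bp_pval, 3), "\n")
```

A small p-value means the squared residuals are systematically related to the regressors, which is the pattern the test output above reports for the crime model.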


## Assumption 3(b)

Now we check ***assumption (3b)***, non-autocorrelation, using the Durbin-Watson test, a statistical test for autocorrelation in the residuals of a regression model. Autocorrelation is correlation between the error terms (residuals) of different time points or observations.

We run dwtest in R and observe that the ***p-value is less than 0.05, so we reject H0 (no autocorrelation) in favour of H1: true autocorrelation is greater than 0.***

The test statistic ranges from 0 to 4: values near 2 indicate no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. Our DW = 0.5298 suggests positive autocorrelation, so we conclude that ***assumption 3(b) is violated.***

Since both parts of assumption 3 are violated, we conclude that ***we should use a Fixed Effect or Random Effect model for our regression.***

In [5]:
dwtest(model)
# assumptions 3(a) and 3(b) are both violated, so we move to the fixed and random effect models


	Durbin-Watson test

data:  model
DW = 0.5298, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
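The Durbin-Watson statistic itself is simple to compute by hand, which helps in reading the value above. The sketch below uses simulated data with AR(1) errors (the autocorrelation of 0.7 is an assumption chosen for illustration) and computes the statistic as the sum of squared differences of consecutive residuals divided by the sum of squared residuals.

```r
# Simulated regression with positively autocorrelated errors (illustrative only)
set.seed(3)
n <- 300
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # AR(1) errors
x <- rnorm(n)
y <- 1 + 0.5 * x + e

res <- residuals(lm(y ~ x))

# Durbin-Watson statistic: roughly 2 * (1 - rho), so values well below 2
# point to positive autocorrelation
dw <- sum(diff(res)^2) / sum(res^2)
rho_hat <- cor(res[-1], res[-n])   # lag-1 correlation of the residuals

cat("DW:", round(dw, 3), " lag-1 correlation:", round(rho_hat, 3), "\n")
```

Note that the statistic depends on the row order. In a stacked panel, consecutive rows at a county boundary belong to different counties, so the plain test is only an approximation here; the plm package provides panel-aware alternatives such as `pdwtest` and `pbgtest` that may be worth comparing.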


# 3. Fixed Effect Model and Random Effect Model 

Now we can proceed with our Fixed effect model and Random effect model.

***Fixed Effect Model:*** The fixed effects model, also known as the "within-effects" model, is used to control for unobserved individual-specific characteristics that remain constant over time. These characteristics are often considered as entity-specific "fixed" factors that influence the dependent variable.

***Random Effect Model:*** The random effects model accounts for unobserved entity-specific characteristics that are assumed to be random and uncorrelated with the independent variables. Unlike the fixed effects model, it uses both the within-entity and the between-entity variation in the data. These characteristics might represent factors that differ across entities but are not directly controlled or measured in the study.

In [6]:
# fixed effect model and random effect model using plm
library(plm)

fixed <- plm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, data=df, index=c("county", "year"), model="within")
random <- plm(log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, data=df, index=c("county", "year"), model="random")

In [7]:
# fixed effect model summary
summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, 
    data = df, model = "within", index = c("county", "year"))

Balanced Panel: n = 90, T = 7, N = 630

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.9966468 -0.0829767 -0.0053309  0.0788693  1.0325647 

Coefficients:
              Estimate Std. Error t-value  Pr(>|t|)    
log(prbarr)  -0.220902   0.037815 -5.8417 8.962e-09 ***
log(prbconv) -0.135169   0.022356 -6.0463 2.777e-09 ***
prbpris      -0.302413   0.104317 -2.8990  0.003897 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    17.991
Residual Sum of Squares: 16.531
R-Squared:      0.081137
Adj. R-Squared: -0.076285
F-statistic: 15.8059 on 3 and 537 DF, p-value: 7.3095e-10

In [8]:
# random effect model summary
summary(random)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, 
    data = df, model = "random", index = c("county", "year"))

Balanced Panel: n = 90, T = 7, N = 630

Effects:
                  var std.dev share
idiosyncratic 0.03078 0.17545 0.207
individual    0.11805 0.34359 0.793
theta: 0.8105

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-1.161902 -0.092732  0.012528  0.101142  1.184746 

Coefficients:
              Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept)  -4.006002   0.074844 -53.5249 < 2.2e-16 ***
log(prbarr)  -0.312448   0.036947  -8.4567 < 2.2e-16 ***
log(prbconv) -0.189534   0.022196  -8.5390 < 2.2e-16 ***
prbpris      -0.311846   0.108345  -2.8783  0.003999 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    24.756
Residual Sum of Squares: 21.431
R-Squared:      0.13432
Adj. R-Squared: 0.13017
Chisq: 97.133 on
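The `theta` value in the random effect summary can be checked by hand from the two variance components it reports. The sketch below assumes the usual balanced-panel formula for the quasi-demeaning weight, theta = 1 - sqrt(σ²_idiosyncratic / (σ²_idiosyncratic + T·σ²_individual)); the variable names are made up for illustration.

```r
# Variance components copied from the random effect summary above
sigma2_idios <- 0.03078   # idiosyncratic variance
sigma2_indiv <- 0.11805   # individual (county) variance
n_periods    <- 7         # T, years per county

# Quasi-demeaning weight: 0 reproduces pooled OLS, 1 reproduces the within estimator
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_periods * sigma2_indiv))

# Share of the total error variance attributable to the county effect
share_indiv <- sigma2_indiv / (sigma2_idios + sigma2_indiv)

cat("theta:", round(theta, 4), " individual share:", round(share_indiv, 3), "\n")
```

A theta this close to 1 means the random effect estimator subtracts most of each county's mean, so it behaves much more like the fixed effect estimator than like pooled OLS.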

# 4. Selecting Final Model with Hausman test

To choose between the two models we perform the Hausman test, a statistical test that compares two estimators, here the fixed effects (FE) and random effects (RE) estimators, to determine which is more appropriate for a given panel dataset.

The null hypothesis is that the unobserved entity-specific effects are uncorrelated with the independent variables. Under the null, both estimators are consistent and the random effects estimator is efficient, so the random effects model is preferred. Under the alternative, only the fixed effects estimator remains consistent. In simple words, if the null hypothesis is rejected (the test statistic exceeds the critical value), the two sets of estimates differ systematically, which points to endogeneity that violates the random effects assumptions, and the fixed effects estimator is more appropriate.

***We observe that in our results the p-value is less than 0.05 hence we select the fixed effect model. We will discuss the results of our fixed effect model in the next section.***

In [9]:
hausman_test <- phtest(fixed, random)
hausman_test


	Hausman Test

data:  log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris
chisq = 113.33, df = 3, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent
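For reference, the statistic reported above has the following standard form. The notation is introduced here for illustration: the hatted β vectors are the FE and RE slope estimates (the intercept is excluded), and k is the number of slope coefficients being compared.

```latex
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})^{\top}
    \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1}
    (\hat{\beta}_{FE} - \hat{\beta}_{RE})
\;\sim\; \chi^2_{k} \quad \text{under } H_0
```

Here k = 3, which matches the df = 3 in the output. The 5% critical value of a chi-squared distribution with 3 degrees of freedom is about 7.81, so the reported chisq = 113.33 lies far inside the rejection region.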


# 5. Results

Fixed effect regression model:

***Ln(crmrte)= -0.2209*Ln(prbarr)- 0.1352*Ln(prbconv)- 0.3024*prbpris+µ***


In [10]:
# fixed model regression summary

summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(crmrte) ~ log(prbarr) + log(prbconv) + prbpris, 
    data = df, model = "within", index = c("county", "year"))

Balanced Panel: n = 90, T = 7, N = 630

Residuals:
      Min.    1st Qu.     Median    3rd Qu.       Max. 
-0.9966468 -0.0829767 -0.0053309  0.0788693  1.0325647 

Coefficients:
              Estimate Std. Error t-value  Pr(>|t|)    
log(prbarr)  -0.220902   0.037815 -5.8417 8.962e-09 ***
log(prbconv) -0.135169   0.022356 -6.0463 2.777e-09 ***
prbpris      -0.302413   0.104317 -2.8990  0.003897 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    17.991
Residual Sum of Squares: 16.531
R-Squared:      0.081137
Adj. R-Squared: -0.076285
F-statistic: 15.8059 on 3 and 537 DF, p-value: 7.3095e-10

1. To check the significance of our coefficients we compare each t-value with the critical t at alpha = 0.01 and 537 residual degrees of freedom (the value reported with the F-statistic of the fixed effect model). The two-sided critical value is approximately ±2.585, and all our coefficients are significant because their t-values (-5.84, -6.05 and -2.90) lie outside the acceptance region.

2. R-squared is 0.081, which indicates that the model explains 8.11% of the within-county variation in our dependent variable.

3. The coefficient β_1 is -0.2209, which implies that a 1% increase in the probability of arrest is associated with a decrease of about 0.22% in the crime rate, holding all other variables constant.

4. The coefficient β_2 is -0.1352, which implies that a 1% increase in the probability of conviction is associated with a decrease of about 0.14% in the crime rate, holding all other variables constant.

5. The coefficient β_3 is -0.3024. Because prbpris enters the model in levels, a 0.01 (one percentage point) increase in the probability of a prison sentence is associated with a decrease of about 0.30% in the crime rate, holding all other variables constant. A full 1 unit increase would mean moving the probability from 0 to 1 and corresponds to a change of exp(-0.3024) - 1 ≈ -26.1% rather than -30.24%. Because β_3 is measured per unit of probability while β_1 and β_2 are elasticities, its magnitude is not directly comparable with the other two.

6. Our initial expectation of a negative coefficient on each independent variable holds: all three are negatively related to the crime rate.

7. The coefficient on the probability of a prison sentence, which was positive in our pooled OLS model, is now negative in our fixed effect model, which aligns with our expectation.
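The arithmetic behind these interpretations can be reproduced with a few lines of R. This is a sketch that copies the estimates and the residual degrees of freedom from the fixed effect summary; the variable names and the 10% and one-percentage-point scenarios are chosen here for illustration.

```r
# Estimates and residual degrees of freedom copied from the fixed effect summary
b_arr    <- -0.220902   # log(prbarr)
b_conv   <- -0.135169   # log(prbconv)
b_pris   <- -0.302413   # prbpris (in levels)
df_resid <- 537

# Two-sided critical t at alpha = 0.01
t_crit <- qt(1 - 0.01 / 2, df = df_resid)

# Log-log terms: exact % change in the crime rate for a 10% increase in the regressor
pct_arr_10  <- (1.10^b_arr  - 1) * 100
pct_conv_10 <- (1.10^b_conv - 1) * 100

# Log-level term: % change for a 0.01 increase and for a full 0 -> 1 change
pct_pris_1pp  <- (exp(b_pris * 0.01) - 1) * 100
pct_pris_unit <- (exp(b_pris) - 1) * 100

cat("critical t:", round(t_crit, 3), "\n")
cat("10% higher prbarr:  ", round(pct_arr_10, 2), "% change in crime rate\n")
cat("10% higher prbconv: ", round(pct_conv_10, 2), "% change in crime rate\n")
cat("prbpris + 0.01:     ", round(pct_pris_1pp, 3), "% change in crime rate\n")
cat("prbpris 0 -> 1:     ", round(pct_pris_unit, 1), "% change in crime rate\n")
```

The coefficient times 100 is a good approximation only for small changes in prbpris; for large changes the exponential form should be used.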


# 6. Conclusion

For this project we were provided with a dataset on crime in 90 counties in North Carolina for the years 1981 through 1987. The research question was “Do the probabilities of apprehension and punishment affect crime participation rate?”. Because the data is panel data, we started with a simple pooled OLS model and observed that the coefficient on the probability of a prison sentence is positive, which does not align with our expectation. With panel data, a simple OLS model can be inconsistent and biased because of endogeneity caused by unobserved heterogeneity.

To assess whether pooled OLS is appropriate, we checked assumption (3a), homoskedasticity, using the Breusch-Pagan test and assumption (3b), non-autocorrelation, using the Durbin-Watson test. Both were violated, so we concluded that a Fixed Effect or Random Effect model should be used. To choose between them we used the Hausman test, which compares the two estimators, and selected the Fixed Effect estimator. Its R-squared of 0.0811 indicates that we explain 8.11% of the within-county variation in our dependent variable.

The coefficients from our fixed effect model are all negative, confirming our initial expectation that each independent variable is negatively related to the crime rate: if any of them increases, the crime rate decreases. The coefficient β_3 is -0.3024, which implies that a one percentage point (0.01) increase in the probability of a prison sentence is associated with a decrease of about 0.30% in the crime rate, a result that could be of use to policy makers. We tested the significance of our coefficients and found that all the independent variables are significant at the 1% level, which supports the conclusion that the probabilities of apprehension and punishment do affect the crime participation rate.

Although our results are statistically significant, we acknowledge that the R-squared value is on the small end. The model could describe the crime rate better with data on unemployment and education levels and, for comparisons with recent years, on COVID-19 cases. A limitation of the fixed effect model is that it is not suitable for analyzing the effects of variables that do not change over time within entities (individuals, groups, etc.). These time-invariant variables are absorbed by the fixed effects, and their impact cannot be estimated in the model.
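The last limitation is easy to see with a toy example. The sketch below builds a made-up three-county panel (the values are hypothetical) with a time-invariant dummy similar to `west`, and applies the same within-county demeaning that the fixed effect model uses.

```r
# Toy panel: 3 counties observed for 4 years each (hypothetical values)
county <- rep(1:3, each = 4)
west   <- rep(c(1, 0, 0), each = 4)   # time-invariant: constant within each county

# Within transformation: subtract each county's own mean
west_within <- west - ave(west, county)

print(west_within)
```

After demeaning, the variable is zero for every observation, so there is no variation left from which a coefficient could be estimated.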