In [36]:
# import statements
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV

In [3]:
# read in preprocessed data
data = pd.read_csv("../data/preprocessed/preprocessed_crime_data.csv")

### Regression Models

The initial goal is to predict `ViolentCrimesPerPop`, which is a continuous variable, so we start with regression. A Linear Regression model is a reasonable baseline. However, because the dataset likely contains correlated features (multicollinearity), regularized variants such as Ridge or Lasso are more suitable.

- Ridge: Linear model that penalizes large coefficients using L2 regularization, reducing the impact of multicollinearity.

- Lasso: Linear model that uses L1 regularization, which can shrink some coefficients exactly to zero and so acts as a form of feature selection. This is useful if we suspect some features are irrelevant (likely, given how large the dataset is).

We will implement both and compare their performance to select the better model.
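
As a rough sketch of how the Ridge and Lasso comparison could be set up, the example below tunes the regularization strength `alpha` for each model with `GridSearchCV`. It runs on synthetic data from `make_regression` rather than the crime data, and the alpha grid, the 5-fold cross-validation, the scaling step, and the RMSE scoring are all illustrative assumptions, not choices made in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the crime data (assumption): 20 features, 5 informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=7
)

# Hypothetical alpha grid; a real grid should be chosen for the actual data
alpha_grid = [0.01, 0.1, 1.0, 10.0]

searches = {}
for name, model in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    # Scale features so the penalty treats all coefficients comparably
    pipe = make_pipeline(StandardScaler(), model)
    param_name = f"{name}__alpha"  # make_pipeline names steps after the class
    search = GridSearchCV(
        pipe,
        param_grid={param_name: alpha_grid},
        cv=5,
        scoring="neg_root_mean_squared_error",
    )
    search.fit(X_demo, y_demo)
    searches[name] = search

    # Inspect the tuned model: Lasso may set some coefficients exactly to zero
    coefs = search.best_estimator_.named_steps[name].coef_
    best_alpha = search.best_params_[param_name]
    print(
        f"{name}: best alpha={best_alpha}, "
        f"CV RMSE={-search.best_score_:.2f}, "
        f"zero coefficients={(coefs == 0).sum()}"
    )
```

Comparing the cross-validated RMSE of the two tuned models gives one basis for choosing between them. The count of zeroed coefficients shows how aggressively Lasso is pruning features.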

### Classification Models

In [19]:
# summary statistics for target column 
data['ViolentCrimesPerPop'].describe()

count    1009.000000
mean        0.177017
std         0.162876
min         0.000000
25%         0.060000
50%         0.120000
75%         0.230000
max         0.910000
Name: ViolentCrimesPerPop, dtype: float64

We will first discretize the target variable `ViolentCrimesPerPop`, converting it into a categorical variable suitable for a classification task. To make this a binary classification problem, we need to set a threshold. The goal is a model that predicts which communities have significant crime rates. A model that can identify communities at high risk for crime could be used in a variety of policy contexts, such as deciding where to increase police presence or where to direct other targeted interventions.

For the binary variable, we will create an indicator for whether `ViolentCrimesPerPop` falls above a threshold. We define 1 as communities with significant crime, meaning those above the 75th percentile (0.23 in the summary statistics above), and 0 otherwise.

#### Discretization & Splitting Data

In [26]:
# set threshold 
threshold = data['ViolentCrimesPerPop'].quantile(0.75)

# create target variable 
data['target'] = (data['ViolentCrimesPerPop'] > threshold).astype(int)

In [32]:
data['target'].value_counts()

target
0    759
1    250
Name: count, dtype: int64

In [38]:
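# split data
# drop the target and the column it was derived from, so the label does not leak into the features
X = data.drop(columns=['target', 'ViolentCrimesPerPop'])
y = data['target']

# stratified 70/15/15 split into train, validation, and test sets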
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, stratify=y, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=7)
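
A quick sanity check on a split like the one above is to confirm the subset sizes and that stratification kept the positive-class share roughly constant. The sketch below is self-contained: it rebuilds only the label vector from the class counts reported earlier (759 zeros, 250 ones) and uses a placeholder feature matrix, so `X_demo` and `y_demo` are stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the reported class counts; features are a placeholder (assumption)
y_demo = np.array([0] * 759 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same two-step stratified split as in the cell above
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=7
)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7
)

# Report the size and positive-class share of each subset
splits = {"train": y_tr, "val": y_va, "test": y_te}
for name, labels in splits.items():
    print(f"{name}: n={len(labels)}, positive share={labels.mean():.3f}")
```

Each subset should show a positive share close to 250/1009, about 0.25. A noticeably different share in any subset would suggest the `stratify` argument was not applied as intended.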

#### Model Selection