## Step 1: Install and Import Libraries
- Installed required libraries using pip.
- Imported necessary modules for data loading, preprocessing, feature engineering, model training, and evaluation.
- Libraries used: pandas, numpy, scikit-learn, xgboost, lightgbm, imbalanced-learn, boto3 (for S3 integration).


In [5]:
# -----------------------------------------------
# STEP 1: Import Libraries
# -----------------------------------------------

!pip install pandas numpy scikit-learn xgboost lightgbm imbalanced-learn boto3

import boto3
import pandas as pd
import numpy as np
import io
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt



Note: You have installed the 'manylinux2014' variant of XGBoost. Certain features such as GPU algorithms or federated learning are not available. To use these features, please upgrade to a recent Linux distro with glibc 2.28+, and install the 'manylinux_2_28' variant.


## Step 2: Load Dataset from S3
- Connected to AWS S3 using boto3 client.
- Loaded the LAPD Crime Dataset directly into a pandas DataFrame.
- Parsed date columns to datetime format to enable temporal feature extraction.


In [2]:
# -----------------------------------------------
# STEP 2: Load Data
# -----------------------------------------------
s3 = boto3.client('s3')
bucket_name = 'mlc-project-1'
file_key = 'Crime_Data_from_2020_to_Present.csv'

response = s3.get_object(Bucket=bucket_name, Key=file_key)
crime_df = pd.read_csv(io.BytesIO(response['Body'].read()))
crime_df['DATE OCC'] = pd.to_datetime(crime_df['DATE OCC'], format='%m/%d/%Y', errors='coerce')

required_cols = [
    'Crm Cd Desc', 'TIME OCC', 'LAT', 'LON', 'AREA NAME',
    'Vict Age', 'Vict Sex', 'Vict Descent', 'Premis Desc'
]
crime_df = crime_df.dropna(subset=required_cols)
print("Step 2: Data loaded. Shape:", crime_df.shape)

Step 2: Data loaded. Shape: (859829, 28)
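
Because `errors='coerce'` silently converts every string that does not match the given `format` into `NaT`, it can be worth counting the missing dates right after parsing. The following is a minimal sketch on made-up values (the toy frame and the `n_bad` name are illustrative assumptions, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the real CSV; the values are invented for illustration.
df = pd.DataFrame(
    {"DATE OCC": ["01/15/2020", "03/01/2020 12:00:00 AM", "not a date"]}
)

# With an explicit format, any string that does not match it exactly
# (including one that carries a trailing time part) is coerced to NaT.
df["DATE OCC"] = pd.to_datetime(df["DATE OCC"], format="%m/%d/%Y", errors="coerce")

# Count how many rows failed to parse before deriving Month or DayOfWeek.
n_bad = int(df["DATE OCC"].isna().sum())
print("Unparseable dates:", n_bad)
```

If the count is unexpectedly high on the real data, the `format` string probably does not match the file.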


## Step 3: Feature Engineering
- Extracted important time-based features (Hour, Day of Week, Month).
- Created binary indicators (Night Time, Weekend).
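- Calculated the distance from downtown LA (34.05, -118.25) as a straight-line distance in degrees of latitude and longitude.
- Grouped crime types into three categories to simplify the multi-class classification problem: Assault (`BATTERY - SIMPLE ASSAULT`), Burglary (`BURGLARY FROM VEHICLE`), and Other (every remaining crime description).
- Added a Victim Age and Weekend interaction feature and applied KMeans clustering (15 clusters) to the geolocation (LAT, LON) data.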


In [3]:
# -----------------------------------------------
# STEP 3: Feature Engineering
# -----------------------------------------------
crime_df['TIME OCC'] = crime_df['TIME OCC'].astype(str).str.zfill(4)
crime_df['Hour'] = crime_df['TIME OCC'].str[:2].astype(int)
crime_df['DayOfWeek'] = crime_df['DATE OCC'].dt.dayofweek
crime_df['Month'] = crime_df['DATE OCC'].dt.month
crime_df['Is_Night'] = crime_df['Hour'].apply(lambda x: 1 if x >= 22 or x <= 5 else 0)
crime_df['Is_Weekend'] = crime_df['DayOfWeek'].apply(lambda x: 1 if x in [5, 6] else 0)
crime_df['DistanceFromDowntown'] = ((crime_df['LAT'] - 34.05)**2 + (crime_df['LON'] + 118.25)**2)**0.5

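def group_crime(desc):
    """Map a crime description string to one of three target groups."""
    if desc == 'BATTERY - SIMPLE ASSAULT':
        return 'Assault'
    elif desc == 'BURGLARY FROM VEHICLE':
        return 'Burglary'
    else:
        return 'Other'

# apply() on a Series passes each description string, not a full row.
crime_df['Crime_Group'] = crime_df['Crm Cd Desc'].apply(group_crime)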

crime_df['VictAge_Weekend'] = crime_df['Vict Age'] * crime_df['Is_Weekend']
crime_df['Cluster'] = KMeans(n_clusters=15, random_state=42).fit_predict(crime_df[['LAT', 'LON']])

print("Step 3: Feature Engineering completed.")


Step 3: Feature Engineering completed.
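
`DistanceFromDowntown` is computed in raw degrees, where one degree of longitude covers less ground than one degree of latitude at LA's latitude. If a distance in kilometres is preferred, the haversine formula is one common option. This is a sketch under that assumption; `haversine_km` and the sample coordinates are hypothetical and not part of the notebook:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Works on scalars and on NumPy arrays or pandas columns alike.
lats = np.array([34.05, 35.05])
lons = np.array([-118.25, -118.25])
dist_km = haversine_km(lats, lons, 34.05, -118.25)
print(dist_km)
```

On the real frame the call would take `crime_df['LAT']` and `crime_df['LON']` as the first two arguments.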


## Step 4: Data Preparation
- Selected key predictive features related to time, location, demographics, and premises.
- Performed one-hot encoding on categorical variables.
- Ensured feature names are clean and compatible for ML modeling.


In [4]:
# -----------------------------------------------
# STEP 4: Prepare Features
# -----------------------------------------------
features = [
    'Hour', 'LAT', 'LON', 'AREA NAME', 'Vict Age', 'Vict Sex', 'Vict Descent', 
    'Premis Desc', 'DayOfWeek', 'Month', 'Is_Night', 'Is_Weekend', 
    'DistanceFromDowntown', 'VictAge_Weekend', 'Cluster'
]
X = crime_df[features]
y = crime_df['Crime_Group'].astype('category')
y_cat = y.cat.codes

X_encoded = pd.get_dummies(X, drop_first=True)
X_encoded.columns = X_encoded.columns.str.replace('[^A-Za-z0-9_]', '_', regex=True)

print("Step 4: Features prepared. Shape:", X_encoded.shape)

Step 4: Features prepared. Shape: (859829, 359)
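
To make the encoding step concrete, here is a small sketch on a toy frame (the values are invented). It also checks that the cleaned column names are still unique, since replacing every special character with `_` could in principle make two different names collide:

```python
import pandas as pd

# Toy stand-in for the selected features; values are illustrative only.
X_demo = pd.DataFrame(
    {
        "Hour": [1, 13, 22],
        "AREA NAME": ["77th Street", "N Hollywood", "77th Street"],
    }
)

# One-hot encode the text column; drop_first removes the first category level.
X_demo_enc = pd.get_dummies(X_demo, drop_first=True)

# Replace anything that is not a letter, digit, or underscore.
X_demo_enc.columns = X_demo_enc.columns.str.replace("[^A-Za-z0-9_]", "_", regex=True)

# Guard against two original names collapsing into the same cleaned name.
assert X_demo_enc.columns.is_unique
print(list(X_demo_enc.columns))
```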


## Step 5: Dimensionality Reduction
- Handled missing values using SimpleImputer (mean strategy).
- Applied TruncatedSVD to reduce the 359 encoded features to 100 components, trading some information for a much smaller feature matrix.
- This shortens model training and lowers memory use in AWS SageMaker.


In [5]:
# -----------------------------------------------
# STEP 5: Imputation + Dimensionality Reduction
# -----------------------------------------------
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X_encoded)

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X_imputed)

print("Step 5: SVD Dimensionality Reduction done. Shape:", X_reduced.shape)



Step 5: SVD Dimensionality Reduction done. Shape: (859829, 100)
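
How much information the reduction keeps can be estimated from the fitted model's `explained_variance_ratio_` attribute. The sketch below uses random data and a hypothetical `svd_demo` object purely for illustration; on the notebook's data the same sum would be taken over the fitted `svd`:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random matrix standing in for the encoded features (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.random((200, 30))

svd_demo = TruncatedSVD(n_components=10, random_state=42)
svd_demo.fit(X_demo)

# Fraction of the total variance captured by the retained components.
retained = float(svd_demo.explained_variance_ratio_.sum())
print(f"Variance retained by 10 components: {retained:.1%}")
```

A low value would suggest raising `n_components`; a value near 1 would suggest that fewer components might be enough.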


## Step 6: Data Balancing with SMOTE
- Addressed class imbalance using the Synthetic Minority Oversampling Technique (SMOTE).
- The synthetic samples give the model more examples of the under-represented crime categories.
- Result: a balanced dataset of 2,164,524 rows (721,508 per class).


In [6]:
# -----------------------------------------------
# STEP 6: SMOTE
# -----------------------------------------------
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_reduced, y_cat)

print("Step 6: SMOTE balancing done. Shape:", X_balanced.shape)

Step 6: SMOTE balancing done. Shape: (2164524, 100)
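
In this notebook the resampling runs before the train-test split, so the test set also contains synthetic rows. A frequently used alternative is to split first and oversample only the training portion, which keeps the test set made of real observations. The sketch below illustrates that ordering on toy data. Plain random oversampling with NumPy stands in for SMOTE here so that the example has no extra dependencies; the `oversample` helper and all variable names are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy data: 200 / 70 / 30 rows for classes 0 / 1 / 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = np.array([0] * 200 + [1] * 70 + [2] * 30)

# 1) Split first, so the test set keeps the original class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

def oversample(X, y, rng):
    """Randomly duplicate minority rows until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c in classes:
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - members.size, replace=True)
        parts.append(np.concatenate([members, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

# 2) Balance the training portion only.
X_tr_bal, y_tr_bal = oversample(X_tr, y_tr, rng)

train_counts = np.bincount(y_tr_bal)
test_counts = np.bincount(y_te)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)
```

With imbalanced-learn, the same pattern would call `SMOTE(random_state=42).fit_resample(X_tr, y_tr)` in place of the `oversample` helper.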


## Step 7: Train-Test Split
- Split the data into training and testing sets with stratified sampling to preserve class distributions.
- 80% for training and 20% for testing.


In [7]:
# -----------------------------------------------
# STEP 7: Train-Test Split
# -----------------------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, stratify=y_balanced, test_size=0.2, random_state=42
)
print("Step 7: Train-test split complete.")

Step 7: Train-test split complete.


## Step 8: Model Training
- Trained three different models:
  - Random Forest Classifier
  - LightGBM Classifier
  - XGBoost Classifier
- Tuned hyperparameters manually.
- Applied early stopping in XGBoost to avoid overfitting.
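
Early stopping needs an evaluation set to monitor. When that set is the test set, the reported test score can come out optimistic, because the number of boosting rounds was chosen on the same data. One way around this, sketched here on toy data with hypothetical variable names, is a three-way split that reserves a separate validation set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Balanced toy data standing in for the resampled features (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = np.repeat([0, 1, 2], 100)

# Hold out 20% as the final test set.
X_trval, X_te, y_trval, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42
)

# Carve a validation set out of the remainder (25% of 80% = 20% overall).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, stratify=y_trval, test_size=0.25, random_state=42
)

# The validation pair would then be passed as eval_set=[(X_val, y_val)]
# when fitting, and X_te / y_te used only for the final report.
print(len(y_tr), len(y_val), len(y_te))
```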


## Random Forest 

In [8]:
# -----------------------------------------------
# STEP 8: Train Individual Models
# -----------------------------------------------

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    class_weight='balanced',
    n_jobs=-1,
    random_state=42
)
rf.fit(X_train, y_train)


print("Step 8: Random Forest model trained.")


Step 8: Random Forest model trained.


In [9]:
# -----------------------------------------------
# Evaluate Random Forest Model
# -----------------------------------------------

def evaluate_model(name, model):
    y_pred = model.predict(X_test)
    print(f"\n{name} Evaluation:")
    print(classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
   
evaluate_model("Random Forest", rf)



Random Forest Evaluation:
              precision    recall  f1-score   support

           0       0.68      0.68      0.68    144301
           1       0.68      0.85      0.75    144302
           2       0.76      0.57      0.65    144302

    accuracy                           0.70    432905
   macro avg       0.71      0.70      0.69    432905
weighted avg       0.71      0.70      0.69    432905

Accuracy: 0.6991


## Random Forest Accuracy ~ 70%

Summary:
Quick to train, but the model struggles with complex crime patterns. Recall is highest for Burglary (class 1, 0.85) and lowest for Other (class 2, 0.57). A good baseline, but limited for deeper insights.
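
The report above shows integer class codes because the target was converted with `.cat.codes`. Pandas assigns those codes in the sorted order of the category labels, so the mapping back to names can be recovered as in this small sketch (the toy series and the `code_to_label` name are illustrative):

```python
import pandas as pd

# Toy target with the same three group names used in the notebook.
y_demo = pd.Series(["Other", "Assault", "Burglary", "Other"]).astype("category")

# Codes follow the order of .cat.categories (alphabetical for strings).
code_to_label = dict(enumerate(y_demo.cat.categories))
print(code_to_label)
```

Passing `target_names=y.cat.categories` to `classification_report` prints these names directly.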

## LightGBM

In [10]:
lgbm = LGBMClassifier(
    n_estimators=300,
    learning_rate=0.03,
    max_depth=8,
    random_state=42,
    n_jobs=-1
)
lgbm.fit(X_train, y_train)

print("Step 8: LGBMClassifier trained.")

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 1.032473 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 1731619, number of used features: 100
[LightGBM] [Info] Start training from score -1.098611
[LightGBM] [Info] Start training from score -1.098613
[LightGBM] [Info] Start training from score -1.098613
Step 8: LGBMClassifier trained.
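
The three `Start training from score` values in the log appear to be the natural logarithm of each class's share of the training rows, which is close to ln(1/3) when the classes are balanced. The check below reproduces them. The per-class training counts are not printed by the notebook; they are inferred here from 721,508 rows per class after SMOTE minus the test supports reported in the evaluation cells:

```python
import numpy as np

# Inferred training rows per class: 721508 minus test supports (144301, 144302, 144302).
train_counts = np.array([577207, 577206, 577206])
assert train_counts.sum() == 1731619  # matches the LightGBM log

# Log of the class priors.
log_priors = np.log(train_counts / train_counts.sum())
print(np.round(log_priors, 6))
```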


In [11]:
# -----------------------------------------------
# Evaluate LightGBM Model
# -----------------------------------------------
# evaluate_model is already defined in the Random Forest evaluation cell.
evaluate_model("LightGBM", lgbm)





LightGBM Evaluation:
              precision    recall  f1-score   support

           0       0.74      0.69      0.72    144301
           1       0.71      0.85      0.78    144302
           2       0.76      0.67      0.71    144302

    accuracy                           0.74    432905
   macro avg       0.74      0.74      0.74    432905
weighted avg       0.74      0.74      0.74    432905

Accuracy: 0.7375


## LightGBM Accuracy ~ 74%
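More efficient than Random Forest, with accuracy rising from 0.70 to 0.74. Handles Burglary crimes well (recall 0.85) and lifts recall for Other from 0.57 to 0.67, but Burglary precision (0.71) sits well below its recall, so the model still over-predicts that class. Faster training, moderate performance improvement.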
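
A confusion matrix makes this kind of over-prediction visible: the column sums show how often each class was predicted, and the row sums show how often it actually occurred. A minimal sketch with made-up labels follows (on the notebook's data the inputs would be `y_test` and `lgbm.predict(X_test)`):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: class 1 is predicted more often than it occurs.
y_true_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 1, 2, 1, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)

actual_counts = cm.sum(axis=1)     # how often each class occurs
predicted_counts = cm.sum(axis=0)  # how often each class is predicted
print(cm)
print("Actual:   ", actual_counts)
print("Predicted:", predicted_counts)
```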

## XGBoost Model

In [12]:
xgb = XGBClassifier(
    n_estimators=1000,
    max_depth=12,
    learning_rate=0.015,
    subsample=0.85,
    colsample_bytree=0.75,
    gamma=0.2,
    reg_alpha=0.8,
    reg_lambda=1.2,
    objective='multi:softprob',
    eval_metric='mlogloss',
    early_stopping_rounds=30,
    use_label_encoder=False,
    n_jobs=-1,
    random_state=42
)

xgb.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=True
)

Parameters: { "use_label_encoder" } are not used.

  self.starting_round = model.num_boosted_rounds()


[0]	validation_0-mlogloss:1.09092
[1]	validation_0-mlogloss:1.08327
[2]	validation_0-mlogloss:1.07538
[3]	validation_0-mlogloss:1.06788
[4]	validation_0-mlogloss:1.06078
[5]	validation_0-mlogloss:1.05350
[6]	validation_0-mlogloss:1.04645
[7]	validation_0-mlogloss:1.03931
[8]	validation_0-mlogloss:1.03254
[9]	validation_0-mlogloss:1.02613
[10]	validation_0-mlogloss:1.01997
[11]	validation_0-mlogloss:1.01375
[12]	validation_0-mlogloss:1.00742
[13]	validation_0-mlogloss:1.00124
[14]	validation_0-mlogloss:0.99535
[15]	validation_0-mlogloss:0.98923
[16]	validation_0-mlogloss:0.98320
[17]	validation_0-mlogloss:0.97754
[18]	validation_0-mlogloss:0.97219
[19]	validation_0-mlogloss:0.96689
[20]	validation_0-mlogloss:0.96175
[21]	validation_0-mlogloss:0.95691
[22]	validation_0-mlogloss:0.95178
[23]	validation_0-mlogloss:0.94673
[24]	validation_0-mlogloss:0.94173
[25]	validation_0-mlogloss:0.93682
[26]	validation_0-mlogloss:0.93215
[27]	validation_0-mlogloss:0.92742
[28]	validation_0-mlogloss:0.9

In [13]:
# -----------------------------------------------
# Evaluate XGBoost Model
# -----------------------------------------------
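# Redefines the helper so the report shows class names instead of integer codes.
# The argument order (name, model) matches the earlier evaluation cells.
def evaluate_model(name, model):
    y_pred = model.predict(X_test)
    print(f"\nModel: {name}")
    print(classification_report(y_test, y_pred, target_names=y.cat.categories))
    print("Accuracy:", accuracy_score(y_test, y_pred))

evaluate_model("XGBoost", xgb)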



Model: XGBoost
              precision    recall  f1-score   support

     Assault       0.85      0.77      0.81    144301
    Burglary       0.81      0.88      0.85    144302
       Other       0.83      0.83      0.83    144302

    accuracy                           0.83    432905
   macro avg       0.83      0.83      0.83    432905
weighted avg       0.83      0.83      0.83    432905

Accuracy: 0.8287176170291404


## XGBoost Accuracy ~ 83%

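Best-performing model overall, with accuracy of 0.83 against 0.74 for LightGBM and 0.70 for Random Forest. Precision and recall are balanced across all crime types, with per-class F1 scores between 0.81 and 0.85. Captures more complex patterns than the other two models on this dataset.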