# Crime Hotspot Prediction in India - Complete Analysis

This notebook demonstrates the complete workflow for crime hotspot prediction and analysis using machine learning and data visualization techniques.

## Table of Contents
1. [Data Preprocessing](#data-preprocessing)
2. [Exploratory Data Analysis](#exploratory-data-analysis)
3. [Hotspot Detection](#hotspot-detection)
4. [Predictive Modeling](#predictive-modeling)
5. [Time Series Forecasting](#time-series-forecasting)
6. [Conclusions and Recommendations](#conclusions-and-recommendations)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Import project modules
import sys
import os
sys.path.append('..')

from src.utils import create_sample_crime_data, create_sample_demographic_data
from src.data_preprocessing import CrimeDataPreprocessor
from src.eda import CrimeDataAnalyzer
from src.hotspot_detection import CrimeHotspotDetector
from src.predictive_modeling import CrimePredictiveModel
from src.time_series_forecasting import CrimeTimeSeriesForecaster

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")

## Data Preprocessing

First, let's create sample data and preprocess it for analysis.

In [None]:
# Create sample datasets
print("Creating sample crime and demographic data...")
crime_data = create_sample_crime_data(1500)  # Create larger dataset for better analysis
demographic_data = create_sample_demographic_data()

print(f"Crime data shape: {crime_data.shape}")
print(f"Demographic data shape: {demographic_data.shape}")

# Display sample data
print("\n📊 Sample Crime Data:")
display(crime_data.head())

print("\n📊 Sample Demographic Data:")
display(demographic_data.head())

In [None]:
# Initialize preprocessor and clean the data
preprocessor = CrimeDataPreprocessor()

print("Starting data preprocessing pipeline...")
processed_data = preprocessor.preprocess_pipeline(crime_data, demographic_data)

# Get preprocessing summary
summary = preprocessor.get_preprocessing_summary(crime_data, processed_data)

print("\n📈 Preprocessing Summary:")
for key, value in summary.items():
    print(f"  {key.replace('_', ' ').title()}: {value}")

print(f"\n📋 Final processed data shape: {processed_data.shape}")
print(f"📋 Total features: {processed_data.shape[1]}")

# Display processed data sample
display(processed_data.head())

## Exploratory Data Analysis

Let's explore the crime data to understand patterns and trends.

In [None]:
# Initialize EDA analyzer
analyzer = CrimeDataAnalyzer()

# Get data overview
overview = analyzer.get_data_overview(processed_data)

print("📊 Data Overview:")
print(f"  Shape: {overview['shape']}")
print(f"  Memory Usage: {overview['memory_usage_mb']:.2f} MB")
print(f"  Missing Values: {sum(overview['missing_values'].values())}")
print(f"  Numerical Columns: {len(overview['numeric_columns'])}")
print(f"  Categorical Columns: {len(overview['categorical_columns'])}")

In [None]:
# Analyze crime trends by year
analyzer.plot_crime_trends_by_year(processed_data)
plt.show()

In [None]:
# Analyze state-wise crime trends
analyzer.plot_crime_trends_by_state(processed_data, top_n=10)
plt.show()

In [None]:
# Crime intensity heatmap
analyzer.plot_crime_intensity_heatmap(processed_data)
plt.show()

In [None]:
# Analyze crime categories
crime_analysis = analyzer.analyze_crime_categories(processed_data)
plt.show()

print("\n🎯 Crime Category Analysis Results:")
print(f"  Total Categories: {crime_analysis['total_categories']}")
print(f"  Most Common Crime: {crime_analysis['most_common_crime']} ({crime_analysis['most_common_count']} cases)")
print(f"  Least Common Crime: {crime_analysis['least_common_crime']} ({crime_analysis['least_common_count']} cases)")

In [None]:
# Temporal patterns analysis
analyzer.plot_temporal_patterns(processed_data)
plt.show()

## Hotspot Detection

Now let's identify crime hotspots using clustering algorithms.

In [None]:
# Initialize hotspot detector
detector = CrimeHotspotDetector()

# Prepare features for clustering
features = detector.prepare_clustering_features(processed_data, "location")

print(f"📍 Prepared {features.shape[1]} features for clustering")
print(f"📍 Using {features.shape[0]} data points")

In [None]:
# Optimize K-Means clusters
optimization_results = detector.optimize_kmeans_clusters(features, max_clusters=10)

# Plot optimization results
detector.plot_clustering_optimization(optimization_results)
plt.show()

print(f"\n🎯 Optimal number of clusters: {optimization_results['best_k_silhouette']}")
print(f"🎯 Best silhouette score: {optimization_results['best_silhouette_score']:.3f}")

In [None]:
# Perform K-Means clustering
optimal_k = optimization_results['best_k_silhouette']
kmeans_data = detector.detect_hotspots(processed_data, method="kmeans", 
                                       feature_set="location", n_clusters=optimal_k)

# Plot K-Means results
detector.plot_hotspots_2d(kmeans_data, "kmeans")
plt.show()

In [None]:
# Perform DBSCAN clustering
dbscan_data = detector.detect_hotspots(processed_data, method="dbscan", 
                                       feature_set="location", eps=0.3, min_samples=10)

# Plot DBSCAN results
detector.plot_hotspots_2d(dbscan_data, "dbscan")
plt.show()

In [None]:
# Analyze hotspot characteristics
kmeans_analysis = detector.analyze_hotspot_characteristics(kmeans_data, "kmeans")
dbscan_analysis = detector.analyze_hotspot_characteristics(dbscan_data, "dbscan")

print("🗺️ K-Means Hotspot Analysis:")
print(f"  Total Clusters: {kmeans_analysis['total_clusters']}")
print(f"  Clustered Crimes: {kmeans_analysis['clustered_crimes']}/{kmeans_analysis['total_crimes']}")

if kmeans_analysis.get('largest_hotspot'):
    largest = kmeans_analysis['largest_hotspot']
    print(f"  Largest Hotspot: {largest['most_affected_state']} ({largest['crime_count']} crimes)")

print("\n🗺️ DBSCAN Hotspot Analysis:")
print(f"  Total Clusters: {dbscan_analysis['total_clusters']}")
print(f"  Clustered Crimes: {dbscan_analysis['clustered_crimes']}/{dbscan_analysis['total_crimes']}")

## Predictive Modeling

Let's build machine learning models to predict crime types.

In [None]:
# Initialize predictive model
predictor = CrimePredictiveModel()

# Prepare features and target
X, y = predictor.prepare_features_target(processed_data)

print(f"🎯 Features prepared: {X.shape}")
print(f"🎯 Target classes: {len(np.unique(y))}")
print(f"🎯 Feature columns: {len(predictor.feature_columns)}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = predictor.split_data(X, y)

print(f"📊 Training set: {X_train.shape}")
print(f"📊 Test set: {X_test.shape}")

In [None]:
# Train Random Forest model
rf_model = predictor.train_random_forest(X_train, y_train)

# Evaluate Random Forest
rf_metrics = predictor.evaluate_model(rf_model, X_test, y_test, "Random Forest")

# Plot feature importance
predictor.plot_feature_importance(rf_model, "Random Forest", top_n=15)
plt.show()

In [None]:
# Train XGBoost model
xgb_model = predictor.train_xgboost(X_train, y_train)

# Evaluate XGBoost
xgb_metrics = predictor.evaluate_model(xgb_model, X_test, y_test, "XGBoost")

# Plot feature importance
predictor.plot_feature_importance(xgb_model, "XGBoost", top_n=15)
plt.show()

In [None]:
# Compare model performance
metrics_comparison = {'Random Forest': rf_metrics, 'XGBoost': xgb_metrics}
predictor.plot_model_comparison(metrics_comparison)
plt.show()

# Print detailed comparison
print("🏆 Model Performance Comparison:")
print("\nRandom Forest:")
for metric, value in rf_metrics.items():
    print(f"  {metric}: {value:.3f}")

print("\nXGBoost:")
for metric, value in xgb_metrics.items():
    print(f"  {metric}: {value:.3f}")

In [None]:
# Make a sample prediction
sample_features = {
    'latitude': 28.6139,  # Delhi coordinates
    'longitude': 77.2090,
    'month': 6,  # June
    'weekday': 1,  # Tuesday
    'population_normalized': 0.8,  # High population
    'literacy_rate_normalized': 0.9,  # High literacy
    'unemployment_rate_normalized': 0.3,  # Moderate unemployment
    'urban_population_pct_normalized': 0.9  # Highly urban
}

# Predict with Random Forest
rf_prediction = predictor.predict_crime_type(rf_model, sample_features, "Random Forest")

print("🔮 Sample Prediction (Delhi, June, Tuesday):")
print(f"  Predicted Crime Type: {rf_prediction['predicted_crime_type']}")
print(f"  Confidence: {rf_prediction['confidence']:.3f}")

if rf_prediction['probabilities']:
    print("\n  Top 3 Probabilities:")
    sorted_probs = sorted(rf_prediction['probabilities'].items(), 
                         key=lambda x: x[1], reverse=True)[:3]
    for crime_type, prob in sorted_probs:
        print(f"    {crime_type}: {prob:.3f}")

## Time Series Forecasting

Let's build time series models to forecast future crime trends.

In [None]:
# Initialize time series forecaster
forecaster = CrimeTimeSeriesForecaster()

# Prepare time series data
time_series = forecaster.prepare_time_series(processed_data, freq="M")

print(f"📈 Time series prepared: {len(time_series)} periods")
print(f"📈 Date range: {time_series.index[0]} to {time_series.index[-1]}")
print(f"📈 Mean crime count: {time_series.mean():.2f}")

# Display time series
display(time_series.head(10))

In [None]:
# Analyze time series characteristics
ts_analysis = forecaster.analyze_time_series(time_series)
plt.show()

print("📊 Time Series Analysis:")
for key, value in ts_analysis.items():
    if key not in ['length']:
        print(f"  {key.replace('_', ' ').title()}: {value}")

In [None]:
# Split time series for training and testing
split_point = int(len(time_series) * 0.8)
ts_train = time_series[:split_point]
ts_test = time_series[split_point:]

print(f"📊 Training periods: {len(ts_train)}")
print(f"📊 Testing periods: {len(ts_test)}")

In [None]:
# Auto-select best ARIMA parameters
best_order = forecaster.auto_arima_selection(ts_train, max_p=2, max_d=2, max_q=2)

# Train ARIMA model
arima_model = forecaster.train_arima_model(ts_train, best_order)

# Generate ARIMA forecast
arima_forecast, arima_conf = forecaster.forecast_arima(arima_model, len(ts_test))

print(f"📈 ARIMA{best_order} model trained")
print(f"📈 Forecast generated for {len(arima_forecast)} periods")

In [None]:
# Prepare LSTM data
X_train_lstm, X_test_lstm, y_train_lstm, y_test_lstm, scaler = forecaster.prepare_lstm_data(
    ts_train, lookback=6, test_size=0.2
)

if X_train_lstm is not None:
    # Train LSTM model
    lstm_model = forecaster.train_lstm_model(X_train_lstm, y_train_lstm, 
                                            X_test_lstm, y_test_lstm, epochs=30)
    
    # Generate LSTM forecast
    last_sequence = X_train_lstm[-1].flatten()
    lstm_forecast = forecaster.forecast_lstm(lstm_model, scaler, last_sequence, len(ts_test))
    
    print(f"📈 LSTM model trained")
    print(f"📈 LSTM forecast generated for {len(lstm_forecast)} periods")
else:
    print("⚠️ Insufficient data for LSTM training")
    lstm_forecast = None

In [None]:
# Plot forecasts comparison
forecaster.plot_forecasts()
plt.show()

# Evaluate forecasts
if 'arima' in forecaster.forecasts:
    arima_metrics = forecaster.evaluate_forecasts(ts_test, arima_forecast, "ARIMA")
    print("\n📊 ARIMA Forecast Evaluation:")
    for metric, value in arima_metrics.items():
        print(f"  {metric}: {value:.2f}")

if lstm_forecast is not None:
    lstm_metrics = forecaster.evaluate_forecasts(ts_test, lstm_forecast, "LSTM")
    print("\n📊 LSTM Forecast Evaluation:")
    for metric, value in lstm_metrics.items():
        print(f"  {metric}: {value:.2f}")

## Conclusions and Recommendations

Based on our comprehensive analysis, here are the key findings and recommendations:

In [None]:
# Generate comprehensive insights
print("🎯 KEY FINDINGS AND RECOMMENDATIONS")
print("=" * 50)

print("\n📊 Data Analysis Insights:")
print(f"  • Analyzed {processed_data.shape[0]:,} crime records across {processed_data['state'].nunique()} states")
print(f"  • Most common crime type: {crime_analysis['most_common_crime']}")
print(f"  • Geographic coverage: {processed_data['district'].nunique()} districts")

print("\n🗺️ Hotspot Detection Results:")
print(f"  • K-Means identified {kmeans_analysis['total_clusters']} distinct hotspots")
print(f"  • DBSCAN found {dbscan_analysis['total_clusters']} dense crime clusters")
if kmeans_analysis.get('largest_hotspot'):
    largest = kmeans_analysis['largest_hotspot']
    print(f"  • Priority hotspot: {largest['most_affected_state']} ({largest['crime_count']} crimes)")

print("\n🤖 Predictive Modeling Performance:")
print(f"  • Random Forest Accuracy: {rf_metrics['accuracy']:.3f}")
print(f"  • XGBoost Accuracy: {xgb_metrics['accuracy']:.3f}")
best_model = "Random Forest" if rf_metrics['f1_weighted'] > xgb_metrics['f1_weighted'] else "XGBoost"
print(f"  • Best performing model: {best_model}")

print("\n📈 Time Series Forecasting:")
print(f"  • Analyzed {len(time_series)} months of crime data")
print(f"  • Seasonal patterns detected: {ts_analysis.get('seasonality_detected', False)}")
print(f"  • Best ARIMA model: ARIMA{best_order}")

print("\n🎯 STRATEGIC RECOMMENDATIONS:")
print("  1. Deploy additional police patrols in identified K-Means hotspots")
print("  2. Implement targeted crime prevention programs in high-density areas")
print(f"  3. Focus on {crime_analysis['most_common_crime']} prevention strategies")
print("  4. Use machine learning models for predictive policing")
print("  5. Monitor seasonal crime patterns for resource allocation")
print("  6. Integrate demographic factors in crime prevention planning")
print("  7. Establish real-time monitoring systems in top hotspots")
print("  8. Regular model retraining with new crime data")

print("\n✅ ANALYSIS COMPLETED SUCCESSFULLY!")
print("All models, visualizations, and insights are ready for deployment.")

## Next Steps

1. **Deploy the Streamlit Dashboard**: Run `streamlit run streamlit_app/app.py` to access the interactive dashboard
2. **Model Deployment**: Use the saved models for real-time crime prediction
3. **Data Integration**: Connect with real crime databases for live analysis
4. **Continuous Monitoring**: Set up automated model retraining pipelines
5. **Stakeholder Engagement**: Share insights with law enforcement agencies

---

*This analysis demonstrates a complete end-to-end machine learning pipeline for crime hotspot prediction and analysis. The models and visualizations can be adapted for different regions and crime datasets.*