# Bangladesh Crime Data from 2010 - 2019

## Dataset
[Crime Bangladesh](https://www.kaggle.com/datasets/talhabu/bangladesh-crime-data-from-2010-2019?select=crime_data_bangladesh.csv), from Kaggle.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

## Read and verify dataset

In [2]:
# path to the dataset file
file = Path(r"../files/CrimeDataBD.csv")

# read the CSV file into a DataFrame
df = pd.read_csv(file)

In [3]:
# verify data loading
df.head()

Unnamed: 0,area_name,year,dacoity,robbery,murder,speedy_trial,riot,woman_child_Repression,kidnapping,police_assault,burglary,theft,other_cases,recovery_cases_arms_act,recovery_cases_explosive,recovery_cases_narcotics,recovery_cases_smuggling
0,dhaka metropolitan,2010,47,220,245,363,3,1370,139,155,555,1915,7228,518,82,10535,144
1,chittagong metropolitan,2010,16,108,94,31,7,455,37,31,123,314,1831,51,0,866,99
2,khulna metropolitan,2010,3,9,29,25,0,153,11,4,65,91,551,19,2,792,13
3,rajshahi metropolitan,2010,4,20,21,9,15,157,9,12,53,106,578,3,4,332,248
4,barisal metropolitan,2010,8,12,19,21,0,112,6,8,24,83,557,17,0,155,117


# Understanding Dataset

In [4]:
# shape of the dataset
row, col = df.shape
print("Number of Rows:", row)
print("Number of Columns:", col)

Number of Rows: 146
Number of Columns: 17


In [5]:
# dataset overview
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   area_name                 146 non-null    object
 1   year                      146 non-null    int64 
 2   dacoity                   146 non-null    int64 
 3   robbery                   146 non-null    int64 
 4   murder                    146 non-null    int64 
 5   speedy_trial              146 non-null    int64 
 6   riot                      146 non-null    int64 
 7   woman_child_Repression    146 non-null    int64 
 8   kidnapping                146 non-null    int64 
 9   police_assault            146 non-null    int64 
 10  burglary                  146 non-null    int64 
 11  theft                     146 non-null    int64 
 12  other_cases               146 non-null    int64 
 13  recovery_cases_arms_act   146 non-null    int64 
 14  recovery_cases_explosive  

In [6]:
# dataset description
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,146.0,2014.746575,2.900206,2010.0,2012.0,2015.0,2017.0,2019.0
dacoity,146.0,32.143836,42.779787,0.0,3.0,16.5,44.75,184.0
robbery,146.0,56.232877,62.354718,0.0,11.0,30.0,73.75,294.0
murder,146.0,248.856164,307.630342,0.0,24.25,142.5,348.0,1395.0
speedy_trial,146.0,93.589041,118.006351,0.0,9.25,51.5,131.75,563.0
riot,146.0,5.342466,8.88116,0.0,0.0,1.0,7.0,56.0
woman_child_Repression,146.0,1199.321918,1374.987965,0.0,124.25,600.0,1904.0,5115.0
kidnapping,146.0,46.260274,52.556709,0.0,7.0,20.5,68.75,204.0
police_assault,146.0,42.808219,55.310823,0.0,8.0,20.5,46.75,336.0
burglary,146.0,163.80137,189.12319,0.0,32.25,77.5,214.0,686.0


# Sanity Check

In [7]:
df.columns

Index(['area_name', 'year', 'dacoity', 'robbery ', 'murder', 'speedy_trial',
       'riot ', 'woman_child_Repression', 'kidnapping', 'police_assault',
       'burglary', 'theft', 'other_cases', 'recovery_cases_arms_act',
       'recovery_cases_explosive', 'recovery_cases_narcotics',
       'recovery_cases_smuggling'],
      dtype='object')

## Standardize Column Names

In [8]:
# remove leading and trailing spaces
df.columns = df.columns.str.strip()

df.columns

Index(['area_name', 'year', 'dacoity', 'robbery', 'murder', 'speedy_trial',
       'riot', 'woman_child_Repression', 'kidnapping', 'police_assault',
       'burglary', 'theft', 'other_cases', 'recovery_cases_arms_act',
       'recovery_cases_explosive', 'recovery_cases_narcotics',
       'recovery_cases_smuggling'],
      dtype='object')
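
Stray spaces such as the ones in `'robbery '` and `'riot '` are easy to miss by eye. As a minimal sketch, the check below lists every column name that changes when stripped. The `sample` frame is a small stand-in built only for this illustration, not the real dataset.

```python
import pandas as pd

# Stand-in frame (assumption): column names mimic the stray spaces seen above.
sample = pd.DataFrame(columns=["area_name", "robbery ", "murder", "riot "])

# A name that differs from its stripped form has leading or trailing whitespace.
untidy = [name for name in sample.columns if name != name.strip()]
print(untidy)

# After stripping, the same check should come back empty.
sample.columns = sample.columns.str.strip()
remaining = [name for name in sample.columns if name != name.strip()]
print(remaining)
```

Running the check before and after the strip gives a quick confirmation that no padded names are left.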

In [9]:
# clean column names
df.columns = df.columns.str.replace("_", " ").str.title()

df.columns

Index(['Area Name', 'Year', 'Dacoity', 'Robbery', 'Murder', 'Speedy Trial',
       'Riot', 'Woman Child Repression', 'Kidnapping', 'Police Assault',
       'Burglary', 'Theft', 'Other Cases', 'Recovery Cases Arms Act',
       'Recovery Cases Explosive', 'Recovery Cases Narcotics',
       'Recovery Cases Smuggling'],
      dtype='object')

In [10]:
# rename some columns to clearer, shorter names
df.rename(columns={
    "Area Name" : "Area",
    "Riot" : "Violence",
    "Woman Child Repression" : "Woman and Child Abuse",
}, inplace=True)
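
One thing to keep in mind with `rename`: by default it silently skips keys that match no column, so a typo or a leftover trailing space goes unnoticed. A small sketch of the stricter `errors="raise"` option follows. The `sample` frame and the misspelled key `"Riots"` are made up for the illustration.

```python
import pandas as pd

# Stand-in frame (assumption) with two of the title-cased column names.
sample = pd.DataFrame({"Area Name": ["dhaka metropolitan"], "Riot": [3]})

# Default behaviour (errors="ignore"): the unknown key is skipped without warning.
unchanged = sample.rename(columns={"Riots": "Violence"})

# errors="raise" turns the same mistake into a KeyError.
try:
    sample.rename(columns={"Riots": "Violence"}, errors="raise")
    raised = False
except KeyError:
    raised = True

# With correct keys the strict version behaves like the default one.
renamed = sample.rename(
    columns={"Area Name": "Area", "Riot": "Violence"}, errors="raise"
)
print(list(renamed.columns))
```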

In [11]:
# tidy the column names

# drop the standalone word "Cases", then collapse the leftover whitespace and trim the ends
df.columns = df.columns.str.replace(r"\bCases\b", "", regex=True).str.replace(r"\s+", " ", regex=True).str.strip()

In [12]:
# verify the column names
df.columns

Index(['Area', 'Year', 'Dacoity', 'Robbery', 'Murder', 'Speedy Trial',
       'Violence', 'Woman and Child Abuse', 'Kidnapping', 'Police Assault',
       'Burglary', 'Theft', 'Other', 'Recovery Arms Act', 'Recovery Explosive',
       'Recovery Narcotics', 'Recovery Smuggling'],
      dtype='object')

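**EXPLANATION:**
- **`r"\bCases\b"`**: matches "Cases" only as a whole word. The `\b` word boundaries keep it from matching inside a longer word such as "ShowCases", and the match is case-sensitive.

- **`.str.replace(r"\s+", " ", regex=True)`**: replaces every run of one or more whitespace characters with a single space, which collapses the double space left behind after removing "Cases".

- **`.str.strip()`**: removes any leading or trailing whitespace that remains, such as the trailing space in "Other " after "Cases" is removed.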
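
The three steps can be traced one at a time on a few names. This is only a sketch: `"ShowCases Total"` is a hypothetical name added to show the word-boundary behaviour and does not exist in the dataset.

```python
import pandas as pd

# Two real column names plus one hypothetical name for the word-boundary case.
names = pd.Index(["Other Cases", "Recovery Cases Arms Act", "ShowCases Total"])

# Step 1: remove the standalone word "Cases"; "ShowCases" is left alone.
step1 = names.str.replace(r"\bCases\b", "", regex=True)

# Step 2: collapse each run of whitespace into a single space.
step2 = step1.str.replace(r"\s+", " ", regex=True)

# Step 3: trim leading and trailing whitespace.
step3 = step2.str.strip()

for label, step in [("step1", step1), ("step2", step2), ("step3", step3)]:
    print(label, list(step))
```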

## Check Duplicates

In [13]:
df.duplicated().sum()

0
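
`df.duplicated()` only flags rows that are identical in every column. For this kind of data it may also be worth checking whether the same area and year appear more than once with different counts. The sketch below assumes `Area` plus `Year` should identify a row uniquely; the second Dhaka row and its `Murder` value are invented for the illustration.

```python
import pandas as pd

# Stand-in frame (assumption): the second row repeats the key with a different count.
sample = pd.DataFrame({
    "Area": ["dhaka metropolitan", "dhaka metropolitan", "khulna metropolitan"],
    "Year": [2010, 2010, 2010],
    "Murder": [245, 250, 29],
})

# Full-row comparison: no two rows are identical, so nothing is flagged.
full_row_dupes = int(sample.duplicated().sum())

# Key-based comparison: the repeated (Area, Year) pair is flagged once.
key_dupes = int(sample.duplicated(subset=["Area", "Year"]).sum())

print(full_row_dupes, key_dupes)
```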

## Check Missing Values

In [14]:
df.isnull().any()

Area                     False
Year                     False
Dacoity                  False
Robbery                  False
Murder                   False
Speedy Trial             False
Violence                 False
Woman and Child Abuse    False
Kidnapping               False
Police Assault           False
Burglary                 False
Theft                    False
Other                    False
Recovery Arms Act        False
Recovery Explosive       False
Recovery Narcotics       False
Recovery Smuggling       False
dtype: bool
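
`isnull().any()` answers yes or no per column. When a column does contain gaps, `isnull().sum()` gives the count instead, which is often more useful for deciding what to do next. A minimal sketch on a stand-in frame with gaps inserted on purpose:

```python
import numpy as np
import pandas as pd

# Stand-in frame (assumption) with one missing value in each column.
sample = pd.DataFrame({
    "Area": ["dhaka metropolitan", "khulna metropolitan", None],
    "Murder": [245, np.nan, 21],
})

# Number of missing values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Total number of missing cells in the frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```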

## Transformation
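Add a new column holding the total number of cases in each row, that is, per area and year. The sum covers every column from `Dacoity` onward, including the recovery columns.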

In [15]:
df["Total Cases"] = df[df.columns[2:]].sum(axis=1)
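
Selecting by position with `df.columns[2:]` works on a fresh run, but if the cell is executed a second time the slice also picks up `Total Cases` and the totals double. A possible alternative, sketched below, selects the count columns by name. The helper `add_total_cases` is a hypothetical name, and the two rows are the first two rows shown in `df.head()`.

```python
import pandas as pd

columns = [
    "Area", "Year", "Dacoity", "Robbery", "Murder", "Speedy Trial", "Violence",
    "Woman and Child Abuse", "Kidnapping", "Police Assault", "Burglary", "Theft",
    "Other", "Recovery Arms Act", "Recovery Explosive", "Recovery Narcotics",
    "Recovery Smuggling",
]
rows = [
    ["dhaka metropolitan", 2010, 47, 220, 245, 363, 3, 1370, 139, 155, 555,
     1915, 7228, 518, 82, 10535, 144],
    ["chittagong metropolitan", 2010, 16, 108, 94, 31, 7, 455, 37, 31, 123,
     314, 1831, 51, 0, 866, 99],
]
sample = pd.DataFrame(rows, columns=columns)


def add_total_cases(frame):
    """Hypothetical helper: sum every count column, excluding keys and the total."""
    excluded = {"Area", "Year", "Total Cases"}
    count_columns = [name for name in frame.columns if name not in excluded]
    frame["Total Cases"] = frame[count_columns].sum(axis=1)
    return frame


# Calling the helper twice gives the same totals as calling it once.
add_total_cases(sample)
add_total_cases(sample)
print(sample["Total Cases"].tolist())
```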

## Final Check

In [16]:
df.head()

Unnamed: 0,Area,Year,Dacoity,Robbery,Murder,Speedy Trial,Violence,Woman and Child Abuse,Kidnapping,Police Assault,Burglary,Theft,Other,Recovery Arms Act,Recovery Explosive,Recovery Narcotics,Recovery Smuggling,Total Cases
0,dhaka metropolitan,2010,47,220,245,363,3,1370,139,155,555,1915,7228,518,82,10535,144,23519
1,chittagong metropolitan,2010,16,108,94,31,7,455,37,31,123,314,1831,51,0,866,99,4063
2,khulna metropolitan,2010,3,9,29,25,0,153,11,4,65,91,551,19,2,792,13,1767
3,rajshahi metropolitan,2010,4,20,21,9,15,157,9,12,53,106,578,3,4,332,248,1571
4,barisal metropolitan,2010,8,12,19,21,0,112,6,8,24,83,557,17,0,155,117,1139


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146 entries, 0 to 145
Data columns (total 18 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Area                   146 non-null    object
 1   Year                   146 non-null    int64 
 2   Dacoity                146 non-null    int64 
 3   Robbery                146 non-null    int64 
 4   Murder                 146 non-null    int64 
 5   Speedy Trial           146 non-null    int64 
 6   Violence               146 non-null    int64 
 7   Woman and Child Abuse  146 non-null    int64 
 8   Kidnapping             146 non-null    int64 
 9   Police Assault         146 non-null    int64 
 10  Burglary               146 non-null    int64 
 11  Theft                  146 non-null    int64 
 12  Other                  146 non-null    int64 
 13  Recovery Arms Act      146 non-null    int64 
 14  Recovery Explosive     146 non-null    int64 
 15  Recovery Narcotics     

# Save The Output

In [20]:
file = Path(r"../files/CrimeDataBD_Cleaned.csv")
df.to_csv(file)
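
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If that column is not wanted, passing `index=False` is one way to avoid it. The sketch below uses an in-memory buffer and a small stand-in frame instead of the real file, so the path above is left untouched.

```python
import io

import pandas as pd

# Stand-in frame (assumption) with a few of the cleaned columns.
sample = pd.DataFrame({
    "Area": ["dhaka metropolitan"],
    "Year": [2010],
    "Total Cases": [23519],
})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded.columns))
```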