## Import required libraries

In [19]:
import pandas as pd
import numpy as np


## Import Dataset from File System

In [20]:
# pd.read_csv can open the path directly with the given encoding
df = pd.read_csv('data/crime.csv', encoding='latin-1')

#### Test Imports

In [22]:
df.head()

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I182070945,619,Larceny,LARCENY ALL OTHERS,D14,808,,2018-09-02 13:00:00,2018,9,Sunday,13,Part One,LINCOLN ST,42.357791,-71.139371,"(42.35779134, -71.13937053)"
1,I182070943,1402,Vandalism,VANDALISM,C11,347,,2018-08-21 00:00:00,2018,8,Tuesday,0,Part Two,HECLA ST,42.306821,-71.0603,"(42.30682138, -71.06030035)"
2,I182070941,3410,Towed,TOWED MOTOR VEHICLE,D4,151,,2018-09-03 19:27:00,2018,9,Monday,19,Part Three,CAZENOVE ST,42.346589,-71.072429,"(42.34658879, -71.07242943)"
3,I182070940,3114,Investigate Property,INVESTIGATE PROPERTY,D4,272,,2018-09-03 21:16:00,2018,9,Monday,21,Part Three,NEWCOMB ST,42.334182,-71.078664,"(42.33418175, -71.07866441)"
4,I182070938,3114,Investigate Property,INVESTIGATE PROPERTY,B3,421,,2018-09-03 21:05:00,2018,9,Monday,21,Part Three,DELHI ST,42.275365,-71.090361,"(42.27536542, -71.09036101)"



## Dealing with Missing Values

#### Dropping data with missing values

Dropping all the rows that contain at least one missing value:

In [23]:
new_df = df.copy() # create a copy of the data set
new_df = new_df.dropna(axis = 0)

Dropping all the columns that contain at least one missing value:

In [27]:
new_df = df.copy() # create a copy of the data set
new_df.dropna(axis = 1, inplace = True)

Dropping only the rows whose value is missing in the `OFFENSE_CODE` column:

In [29]:
new_df = df.copy() # create a copy of the data set
new_df.dropna(subset=['OFFENSE_CODE'], axis = 0, inplace = True)
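The three `dropna` variants above can be compared on a small frame. This is a minimal sketch: the `toy` data below is hypothetical and only borrows column names from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from crime.csv)
toy = pd.DataFrame({
    "OFFENSE_CODE": [619, np.nan, 3410, 3114],
    "DISTRICT": ["D14", "C11", None, "D4"],
    "YEAR": [2018, 2018, 2018, 2018],
})

# keep only the rows that have no missing value in any column
rows_dropped = toy.dropna(axis=0)

# keep only the columns that have no missing value in any row
cols_dropped = toy.dropna(axis=1)

# drop a row only when OFFENSE_CODE itself is missing
subset_dropped = toy.dropna(subset=["OFFENSE_CODE"])

print(rows_dropped.shape)
print(cols_dropped.columns.tolist())
print(subset_dropped.shape)
```

Dropping by column is the most aggressive option here, since a single missing cell removes the whole column.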

#### Replacing missing values (or any other value) with newly calculated values

In [33]:
new_df = df.copy() # create a copy of the data set

# mean of all the values in the 'OFFENSE_CODE' column (NaN values are skipped)
mean = new_df['OFFENSE_CODE'].mean()

# fill every NaN in the 'OFFENSE_CODE' column with this mean value
new_df['OFFENSE_CODE'] = new_df['OFFENSE_CODE'].fillna(mean)
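Mean imputation can be checked on a few hand-picked numbers. The sketch below uses a hypothetical `toy` series and `Series.fillna`, which has the same effect as replacing `np.nan` with the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from crime.csv)
toy = pd.Series([10.0, np.nan, 30.0, np.nan, 20.0], name="value")

# Series.mean() skips NaN by default, so only 10, 30 and 20 are averaged
mean_value = toy.mean()

# fill every missing entry with that mean
filled = toy.fillna(mean_value)

print(mean_value)
print(filled.tolist())
```

For an identifier-like column such as `OFFENSE_CODE`, the mean is only an illustration of the mechanics; it makes more sense for a genuinely numeric measurement.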


## Data Formatting

Renaming columns so that their names describe their contents correctly:

In [39]:
new_df = df.copy() # create a new copy of the dataset
new_df.rename(columns = {'INCIDENT_NUMBER' : 'TEMP_NUMBER'}, inplace = True)

Converting the units of a whole column, e.g. from seconds to minutes (`OFFENSE_CODE` is not a duration; it only stands in for a numeric column here):

In [40]:
new_df = df.copy()  # create a new copy of the dataset
new_df['OFFENSE_CODE'] = new_df['OFFENSE_CODE'] / 60

Changing the data type of a column:

In [45]:
new_df = df.copy()  # create a new copy of the dataset

# currently, the 'OFFENSE_CODE' column has an integer dtype
new_df['OFFENSE_CODE'].dtypes

# convert it to a floating-point dtype
new_df['OFFENSE_CODE'] = new_df['OFFENSE_CODE'].astype('float')
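A short sketch of `astype` on hypothetical toy data follows. It also shows one related point: an integer column that contains missing values is stored as float by default, and pandas' nullable `Int64` dtype is one way to keep it integer-typed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from crime.csv)
codes = pd.Series([619, 1402, 3410])

as_float = codes.astype("float")   # integer -> floating point
as_text = codes.astype(str)        # integer -> string

# A NaN forces a float dtype; the nullable "Int64" dtype keeps integers
with_missing = pd.Series([619, np.nan, 3410])
nullable = with_missing.astype("Int64")

print(codes.dtype, as_float.dtype, with_missing.dtype, nullable.dtype)
```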


## Data Normalization

Normalization rescales all the values in a column so that their range is consistent. The relative differences between the row values are preserved.

There are three common ways of achieving this:

- Simple Feature Scaling Method
- Min-Max Method
- Z-Score or Standard Score
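Written as formulas, the three methods look roughly as follows. The symbols are notation assumed here, not taken from the original: $x$ is a value in the column, $x_{\min}$ and $x_{\max}$ are the column minimum and maximum, $\mu$ is the column mean, and $\sigma$ is its standard deviation.

$$
x_{\text{simple}} = \frac{x}{x_{\max}}, \qquad
x_{\text{min-max}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad
x_{\text{z}} = \frac{x - \mu}{\sigma}
$$

Min-max scaling maps the column onto the interval $[0, 1]$, while z-scores have mean 0 and standard deviation 1. Note that pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default.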

#### Simple Feature Scaling Method

In [57]:
new_df = df.copy()  # create a new copy of the dataset

# compute the maximum value of the column
max_val = new_df['OFFENSE_CODE'].max()

# normalize the data
new_df['OFFENSE_CODE'] = new_df['OFFENSE_CODE'] / max_val

#### Min-Max Method

In [55]:
new_df = df.copy()  # create a new copy of the dataset

# compute values
min_val = new_df['OFFENSE_CODE'].min()
max_val = new_df['OFFENSE_CODE'].max()

# normalize the data
new_df['OFFENSE_CODE'] = (new_df['OFFENSE_CODE'] - min_val) / (max_val - min_val)

#### Z-Score or Standard Score

In [56]:
new_df = df.copy()  # create a new copy of the dataset

# compute values
mean = new_df['OFFENSE_CODE'].mean()
standard_deviation = new_df['OFFENSE_CODE'].std()

# normalize the data
new_df['OFFENSE_CODE'] = (new_df['OFFENSE_CODE'] - mean) / standard_deviation
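To see how the three methods differ, they can be applied side by side to the same numbers. This is a minimal sketch on a hypothetical `values` series, not on the crime dataset.

```python
import pandas as pd

# Hypothetical toy data (not taken from crime.csv)
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# simple feature scaling: divide by the maximum
simple_scaled = values / values.max()

# min-max: shift by the minimum, divide by the range
min_max_scaled = (values - values.min()) / (values.max() - values.min())

# z-score: subtract the mean, divide by the (sample) standard deviation
z_scored = (values - values.mean()) / values.std()

summary = pd.DataFrame({
    "original": values,
    "simple": simple_scaled,
    "min_max": min_max_scaled,
    "z_score": z_scored,
})
print(summary)
```

The ordering of the rows and the ratios between their gaps stay the same in every column; only the scale changes.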


## Binning

Grouping values into ranges is called `Binning`, e.g. grouping ages into (0-5), (6-10), etc.
Binning can sometimes increase the accuracy of predictive models, and it is also really useful for **Data Visualization**.

Let's say that you want to categorize 'OFFENSE_CODE' into 3 categories. First of all, you need the dividers (the bin edges).

In [65]:
new_df = df.copy()  # create a new copy of the dataset

# Get 4 dividers (4 dividers would be required to categorize into 3 groups)

# compute values
dividers_count = 4
min_val = new_df['OFFENSE_CODE'].min()
max_val = new_df['OFFENSE_CODE'].max()

# get the dividers
bins = np.linspace(min_val, max_val, dividers_count)

In [70]:
# create group names for the bins
group_name = ['Low', 'Medium', 'High']

In [72]:
# categorize all the data
new_df['OFFENSE_CODE-binned'] = pd.cut(new_df['OFFENSE_CODE'], bins, labels=group_name, include_lowest = True)

# test
new_df['OFFENSE_CODE-binned'].head()

0       Low
1    Medium
2      High
3      High
4      High
Name: OFFENSE_CODE-binned, dtype: category
Categories (3, object): [Low < Medium < High]
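The same equal-width binning can be verified on a handful of values where the bin edges are easy to compute by hand. This is a sketch on a hypothetical `ages` series; the names `edges`, `labels`, and `binned` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from crime.csv)
ages = pd.Series([2, 17, 35, 48, 63, 80])

# 4 equally spaced edges between min and max give 3 equal-width bins
edges = np.linspace(ages.min(), ages.max(), 4)   # 2, 28, 54, 80
labels = ["Low", "Medium", "High"]

# include_lowest=True makes the first bin closed on the left,
# so the minimum value itself is not left out as NaN
binned = pd.cut(ages, edges, labels=labels, include_lowest=True)

# number of values per bin, in category order
counts = binned.value_counts(sort=False)

print(binned.tolist())
print(counts)
```

`pd.cut(ages, bins=3)` would compute equal-width edges on its own, and `pd.qcut` is an alternative when each bin should hold roughly the same number of rows.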