## Deep Neural Networks Project

In this project, you will be working with a real-world dataset from the Las Vegas Metropolitan Police Department. The dataset contains information about reported incidents, including the time and location of the crime, the type of incident, and the number of persons involved.

The dataset is downloaded from the public docket at: 
https://opendata-lvmpd.hub.arcgis.com

Let's read the CSV file and transform the data:

In [1]:
import torch
import pandas as pd
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [4]:
orig_df = pd.read_csv('C:/Users/viswe/Documents/CSULB_3rd_Sem/Pattern recognition/Assignments/HW3/datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'])

In [5]:
df = pd.read_csv('C:/Users/viswe/Documents/CSULB_3rd_Sem/Pattern recognition/Assignments/HW3/datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'],
                 usecols = ['X', 'Y', 'ReportedOn',
                            'Area_Command','NIBRSOffenseCode',
                            'VictimCount' ] )

df['DayOfWeek'] = df['ReportedOn'].dt.day_name()
df['Time' ]     = df['ReportedOn'].dt.hour
df.drop(columns = 'ReportedOn', inplace=True)

In [6]:

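# X and Y (the coordinates) are kept as-is.
# pd.factorize replaces each distinct value with an integer code,
# assigned in order of first appearance (not in sorted order).
df['Time'] = pd.factorize(df['Time'])[0]
df['DayOfWeek'] = pd.factorize(df['DayOfWeek'])[0]
df['Area_Command'] = pd.factorize(df['Area_Command'])[0]
df['VictimCount'] = pd.factorize(df['VictimCount'])[0]
df['NIBRSOffenseCode'] = pd.factorize(df['NIBRSOffenseCode'])[0]
# Note: pd.factorize encodes missing values as -1, so at this point
# dropna() only removes rows with a missing X or Y.
df.dropna(inplace=True)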

In [7]:
df= df[['X', 'Y', 'Area_Command', 'NIBRSOffenseCode',
       'DayOfWeek', 'Time','VictimCount']]

In [8]:
df.values.shape

(275, 7)

## Goal
The goal is to build a predictive model that is trained on the following data:
* latitude and longitude (location)
* Hour of the day
* Day of the week
* Area-of-command code: The police designation of the bureau of the operation.
* Classification code for the crime committed
  
The predicted variable is the number of persons involved in the incident.


## Task 1
* Print a few rows of the values in the dataframe ``df`` and explain what each column of data means.
* Identify the input and target variables.
* What is the range of values in each column? Do you need to scale, shift, or normalize your data?


In [9]:
# print few rows
print(df.head())

            X          Y  Area_Command  NIBRSOffenseCode  DayOfWeek  Time  \
0 -115.087518  36.216702             0                 0          0     0   
1 -115.240172  36.189693             1                 1          1     1   
2 -115.143088  36.181329             2                 1          2     0   
3 -115.225014  36.117633             3                 1          1     2   
4 -115.176708  36.095967             4                 1          1     3   

   VictimCount  
0            0  
1            0  
2            1  
3            2  
4            0  


In [10]:
# Identify the input and target variables.
# Input variables (features, often denoted X): the columns the model uses to make predictions.
#   Here: 'X', 'Y', 'Area_Command', 'NIBRSOffenseCode', 'DayOfWeek', 'Time'.
# Target variable (label or output, often denoted y): the column the model is trained to predict.
#   Here: 'VictimCount'.

In [12]:
# checking range of values
min_values = df.min()
max_values = df.max()
print(min_values,max_values)

X                  -116.000000
Y                    35.068419
Area_Command          0.000000
NIBRSOffenseCode      0.000000
DayOfWeek             0.000000
Time                  0.000000
VictimCount           0.000000
dtype: float64 X                  -114.62557
Y                    37.00000
Area_Command         11.00000
NIBRSOffenseCode      2.00000
DayOfWeek             6.00000
Time                 23.00000
VictimCount           6.00000
dtype: float64


In [13]:
print("Longitude (X) Range: {} to {}".format(df['X'].min(), df['X'].max()))
print("Latitude (Y) Range: {} to {}".format(df['Y'].min(), df['Y'].max()))
print("Time Range: {} to {}".format(df['Time'].min(), df['Time'].max()))
print("DayOfWeek Range: {} to {}".format(df['DayOfWeek'].min(), df['DayOfWeek'].max()))
print("Area_Command Range: {} to {}".format(df['Area_Command'].min(), df['Area_Command'].max()))
print("NIBRSOffenseCode Range: {} to {}".format(df['NIBRSOffenseCode'].min(), df['NIBRSOffenseCode'].max()))
print("VictimCount Range: {} to {}".format(df['VictimCount'].min(), df['VictimCount'].max()))

Longitude (X) Range: -116.0 to -114.6255705
Latitude (Y) Range: 35.0684190000001 to 37.0000000000001
Time Range: 0 to 23
DayOfWeek Range: 0 to 6
Area_Command Range: 0 to 11
NIBRSOffenseCode Range: 0 to 2
VictimCount Range: 0 to 6
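
The ranges above differ by orders of magnitude (coordinates near -115 and 36, codes between 0 and 23), which is one reason to consider scaling the inputs. The following is a minimal sketch of z-score standardization. It runs on `demo_df`, a synthetic stand-in that only assumes the same column names and value ranges as `df`; it is not the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `df` (assumption: same column names and value ranges).
rng = np.random.default_rng(0)
n = 275
demo_df = pd.DataFrame({
    "X": rng.uniform(-116.0, -114.6, n),
    "Y": rng.uniform(35.0, 37.0, n),
    "Area_Command": rng.integers(0, 12, n),
    "NIBRSOffenseCode": rng.integers(0, 3, n),
    "DayOfWeek": rng.integers(0, 7, n),
    "Time": rng.integers(0, 24, n),
    "VictimCount": rng.integers(0, 7, n),
})

feature_cols = ["X", "Y", "Area_Command", "NIBRSOffenseCode", "DayOfWeek", "Time"]

# Z-score each input column: subtract its mean, divide by its standard deviation.
features = demo_df[feature_cols].astype("float64")
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.describe().loc[["mean", "std", "min", "max"]].round(2))
```

When a train/test split is involved, the mean and standard deviation would normally be computed on the training portion only and then reused for the test portion. The categorical codes could alternatively be one-hot encoded rather than standardized.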


## Task 2 

* Create two `DataLoader` objects for training and testing based on the input and output variables. Pick a reasonable batch size and verify the shape of the data by iterating over one of the datasets and printing the shape of the batched data.
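
One possible way to set this up is sketched below, using `TensorDataset` and `random_split` from `torch.utils.data`. The arrays `inputs` and `targets` are random placeholders for the six feature columns and `VictimCount`; the 80/20 split and the batch size of 32 are assumptions, not requirements of the task.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Random placeholders (assumption): 275 rows, 6 features, VictimCount-like targets.
rng = np.random.default_rng(0)
n = 275
inputs = rng.normal(size=(n, 6)).astype(np.float32)
targets = rng.integers(0, 7, size=n).astype(np.float32)

dataset = TensorDataset(torch.from_numpy(inputs), torch.from_numpy(targets))

# 80/20 train/test split with a fixed seed so the split is reproducible.
n_train = int(0.8 * n)
train_set, test_set = random_split(
    dataset, [n_train, n - n_train], generator=torch.Generator().manual_seed(0)
)

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

# Verify the batch shapes by pulling a single batch from the training loader.
for x_batch, y_batch in train_loader:
    print(x_batch.shape, y_batch.shape)
    break
```

With the real data, `inputs` would come from something like `df[feature_columns].values` and `targets` from `df['VictimCount'].values`.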

## Task 3
In this task you will try to predict the number of crime victims as a **real number**. Therefore the machine learning problem is a **regression** problem.

* Define the proper loss function for this task.
* What should the size of the predicted output be?
* Explain your choice of architecture, including how many layers you will be using.
* Define an optimizer for training this model and choose a proper learning rate.
* Write a training loop that obtains a batch from the training data and calculates the forward and backward passes over the neural network. Call the optimizer to update the weights of the neural network.
* Write a for loop that continues the training over a number of epochs. At the end of each epoch, calculate the ``MSE`` on the test data and print it.
* Is your model training well? Adjust the learning rate and the hidden size of the network, and try different activation functions and numbers of layers to achieve the lowest test error, and report it.
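
The steps above can be sketched roughly as follows. This is an illustration on random placeholder tensors, not a tuned solution: the two hidden layers of width 32, the ReLU activations, the Adam optimizer, the learning rate of `1e-3`, and the 5 epochs are all assumptions to be adjusted.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random placeholders (assumption): 6 input features, real-valued targets.
x_train, x_test = torch.randn(220, 6), torch.randn(55, 6)
y_train = torch.randint(0, 7, (220,)).float()
y_test = torch.randint(0, 7, (55,)).float()
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(x_test, y_test), batch_size=32)

model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),  # regression: one real-valued output per example
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def evaluate_mse(model, loader):
    """Mean squared error over every example served by `loader`."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            pred = model(xb).squeeze(1)  # (batch, 1) -> (batch,)
            total += nn.functional.mse_loss(pred, yb, reduction="sum").item()
            count += yb.numel()
    return total / count


test_mse_history = []
for epoch in range(5):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()                     # clear old gradients
        loss = loss_fn(model(xb).squeeze(1), yb)  # forward pass
        loss.backward()                           # backward pass
        optimizer.step()                          # weight update
    test_mse_history.append(evaluate_mse(model, test_loader))
    print(f"epoch {epoch + 1}: test MSE = {test_mse_history[-1]:.4f}")
```

The `squeeze(1)` matters: without it the prediction has shape `(batch, 1)` while the target has shape `(batch,)`, and `MSELoss` would broadcast the two against each other instead of comparing them element by element.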

## Task 4 

In this task, you will try to predict the number of crime victims as a **class number**. Therefore the machine learning problem is a **classification** problem. 

* Repeat all the steps in Task 3. Specifically, pay attention to the differences from regression.
* How would you find the number of classes on the output data?
* How is the architecture different?
* How is the loss function different?
* Calculate the accuracy on the test data in each epoch as the number of correctly classified outputs divided by the total number of test examples. Report it at the end of each epoch.
* Try a few variations of learning rate, hidden dimensions, layers, etc. What is the best accuracy that you can get? 
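
A minimal sketch of the classification variant is shown below, again on random placeholder tensors. It assumes the labels are 0-based integer codes (as `pd.factorize` produces), so the class count is the largest label plus one; the layer sizes, optimizer, and learning rate are illustrative choices only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random placeholders (assumption): 6 features, integer class labels in 0..6.
x_train, x_test = torch.randn(220, 6), torch.randn(55, 6)
y_train = torch.randint(0, 7, (220,))  # int64 class indices, as CrossEntropyLoss expects
y_test = torch.randint(0, 7, (55,))
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(x_test, y_test), batch_size=32)

# Labels are 0-based codes, so the class count is the largest label plus one.
num_classes = int(torch.cat([y_train, y_test]).max()) + 1

model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, num_classes),  # one logit per class; no softmax layer here
)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax to the logits internally
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)


def evaluate_accuracy(model, loader):
    """Fraction of examples whose highest-scoring class equals the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
            total += yb.numel()
    return correct / total


accuracy_history = []
for epoch in range(5):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    accuracy_history.append(evaluate_accuracy(model, test_loader))
    print(f"epoch {epoch + 1}: test accuracy = {accuracy_history[-1]:.3f}")
```

Compared with the regression setup, the output layer has `num_classes` units instead of one, the targets are integer class indices instead of floats, and the loss is cross-entropy instead of MSE.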

## Task 5

### Reflect on your results

* Write a paragraph about your experience with Tasks 3 and 4. How do the results compare? Which one worked better? Why?
* Write a piece of code that finds an example of a misclassification. Calculate the probabilities for the output classes and plot them in a bar chart. Also indicate the correct class label.
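
The misclassification search could look something like the sketch below. The `model` here is an untrained placeholder network and `x_test`/`y_test` are random tensors, both hypothetical stand-ins for a trained classifier and the real test set. Class probabilities are obtained by applying softmax to the logits.

```python
import matplotlib.pyplot as plt
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins: an untrained classifier and random test data.
num_classes = 7
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, num_classes))
x_test = torch.randn(55, 6)
y_test = torch.randint(0, num_classes, (55,))

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(x_test), dim=1)  # each row sums to 1
pred = probs.argmax(dim=1)

# Indices of the test examples where the predicted class differs from the label.
wrong = (pred != y_test).nonzero(as_tuple=True)[0]

if len(wrong) > 0:
    i = wrong[0].item()
    true_label = y_test[i].item()
    # Highlight the bar of the correct class in a different color.
    colors = ["tab:green" if c == true_label else "tab:blue" for c in range(num_classes)]
    fig, ax = plt.subplots()
    ax.bar(range(num_classes), probs[i].numpy(), color=colors)
    ax.set_xlabel("class")
    ax.set_ylabel("predicted probability")
    ax.set_title(f"example {i}: predicted {pred[i].item()}, correct {true_label}")
else:
    print("no misclassified examples found")
```

In a notebook the figure renders inline; in a script, `plt.show()` would be needed to display it.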

## Task 6: Exploring the patterns in raw data

* Plot the crime incidents as a `scatter` plot using the coordinates. Use the color property of each data point to indicate the day of the week. Is there a pattern in the plot?
* Now make a new scatter plot and use the color property of each datapoint to indicate the number of persons involved in the incident. Is there a pattern here?
* Use NumPy (or pandas if you like) to count the number of crimes reported on each day of the week and sort the counts. On which days are crimes reported most frequently?
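
As a rough sketch of these three steps, the code below draws both scatter plots side by side and counts incidents per day with pandas `value_counts`. It uses `demo_df`, a synthetic stand-in that only assumes the column names `X`, `Y`, `DayOfWeek`, and `VictimCount`, so any pattern (or lack of one) in its output says nothing about the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for `df` (assumption: same column names and value ranges).
rng = np.random.default_rng(0)
n = 275
demo_df = pd.DataFrame({
    "X": rng.uniform(-116.0, -114.6, n),
    "Y": rng.uniform(35.0, 37.0, n),
    "DayOfWeek": rng.integers(0, 7, n),
    "VictimCount": rng.integers(0, 7, n),
})

# One scatter plot per coloring variable, sharing the same coordinates.
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, column in zip(axes, ["DayOfWeek", "VictimCount"]):
    points = ax.scatter(demo_df["X"], demo_df["Y"], c=demo_df[column], cmap="viridis", s=15)
    fig.colorbar(points, ax=ax, label=column)
    ax.set_xlabel("X (longitude)")
    ax.set_ylabel("Y (latitude)")
    ax.set_title(f"Incidents colored by {column}")

# value_counts() returns the counts sorted from most to least frequent.
day_counts = demo_df["DayOfWeek"].value_counts()
print(day_counts)
```

Because `DayOfWeek` holds factorized codes rather than names, mapping the codes back to day names requires the second return value of `pd.factorize`, which lists the original values in code order.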
