# Data exploration of the Boston crime dataset using plotly.express

    * In this kernel, we will learn how to use plotly.express and compare it with the plotly.graph_objs library. We will see that most graphs can be made with the express library using much simpler code. 

    * Plotly library: plotly.py is an interactive, open-source graphing library for Python. It is built on top of plotly.js, a high-level, declarative JavaScript charting library that includes over 30 chart types, including scientific charts, 3D graphs, statistical charts, SVG maps, financial charts, and more. The ultimate responsibility of plotly.py is to produce Python dictionaries that can be serialized into a JSON data structure that represents a valid figure.
    
    * What is plotly.express and plotly.graph_objs?
        * plotly.graph_objs provides a hierarchy of classes called "graph objects" that may be used to construct figures.
        * Plotly Express is a terse, consistent, high-level wrapper around plotly.graph_objects for rapid data exploration and figure generation. Most plots are made with just one function call that accepts a tidy Pandas data frame, and a simple description of the plot you want to make.
        
<br>Content:
1. [Loading Data and Data Preprocessing](#1)
1. [Line Charts](#2)
    1. Use plotly.graph_objs
        1. Draw single line 
        2. Draw multiple lines
    2. Use plotly.express
        1. Draw single line
        2. Style line
        3. Draw multiple lines
1. [Scatter Charts](#3)
    1. Use plotly.graph_objs draw scatter and line plot
    2. Use plotly.express 
        1. Scatter plot with color
        2. Scatter plot with facet
        3. Scatter Plot with categorical size and hover
        4. Scatter plot matrix
1. [Bar Charts](#4)
1. [Histogram](#5)
1. [Box Plot](#6)
1. [Heatmap](#7)
1. [3D Plot](#8)





# Install packages
* To install this package with conda run the following:
    * plotly.express is now a part of plotly package in anaconda
    * conda install -c plotly/label/test plotly
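
After installing, a quick check like the following can confirm that the active environment sees plotly and its bundled express module. This is only a sketch using the standard library's `importlib`; it does not depend on how the package was installed.

```python
import importlib.util

# find_spec returns None when a top-level package is not installed
has_plotly = importlib.util.find_spec("plotly") is not None
print("plotly installed:", has_plotly)

if has_plotly:
    import plotly
    print("plotly version:", plotly.__version__)
    # plotly.express is expected to live inside the plotly package
    has_express = importlib.util.find_spec("plotly.express") is not None
    print("plotly.express available:", has_express)
```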

In [1]:
import sys
print(sys.executable)

/Users/anna/anaconda3/envs/bulb/bin/python


In [2]:
import numpy as np
import pandas as pd
import datetime as dt
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
go.__path__

['/Users/anna/anaconda3/envs/bulb/lib/python3.7/site-packages/plotly/graph_objs']

In [3]:
import plotly.express as px

<a id="1"></a> <br>
# Loading Data and Data preprocessing

In [6]:
#Loading data
df = pd.read_csv('/Users/anna/Desktop/project/crimes-in-boston/crime.csv',header=0,encoding = 'unicode_escape')

In [7]:
#This dataset contains 17 columns and their datatypes are listed below
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319073 entries, 0 to 319072
Data columns (total 17 columns):
INCIDENT_NUMBER        319073 non-null object
OFFENSE_CODE           319073 non-null int64
OFFENSE_CODE_GROUP     319073 non-null object
OFFENSE_DESCRIPTION    319073 non-null object
DISTRICT               317308 non-null object
REPORTING_AREA         319073 non-null object
SHOOTING               1019 non-null object
OCCURRED_ON_DATE       319073 non-null object
YEAR                   319073 non-null int64
MONTH                  319073 non-null int64
DAY_OF_WEEK            319073 non-null object
HOUR                   319073 non-null int64
UCR_PART               318983 non-null object
STREET                 308202 non-null object
Lat                    299074 non-null float64
Long                   299074 non-null float64
Location               319073 non-null object
dtypes: float64(2), int64(4), object(11)
memory usage: 41.4+ MB


In [8]:
df.head(5)

Unnamed: 0,INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,Lat,Long,Location
0,I182070945,619,Larceny,LARCENY ALL OTHERS,D14,808,,2018-09-02 13:00:00,2018,9,Sunday,13,Part One,LINCOLN ST,42.357791,-71.139371,"(42.35779134, -71.13937053)"
1,I182070943,1402,Vandalism,VANDALISM,C11,347,,2018-08-21 00:00:00,2018,8,Tuesday,0,Part Two,HECLA ST,42.306821,-71.0603,"(42.30682138, -71.06030035)"
2,I182070941,3410,Towed,TOWED MOTOR VEHICLE,D4,151,,2018-09-03 19:27:00,2018,9,Monday,19,Part Three,CAZENOVE ST,42.346589,-71.072429,"(42.34658879, -71.07242943)"
3,I182070940,3114,Investigate Property,INVESTIGATE PROPERTY,D4,272,,2018-09-03 21:16:00,2018,9,Monday,21,Part Three,NEWCOMB ST,42.334182,-71.078664,"(42.33418175, -71.07866441)"
4,I182070938,3114,Investigate Property,INVESTIGATE PROPERTY,B3,421,,2018-09-03 21:05:00,2018,9,Monday,21,Part Three,DELHI ST,42.275365,-71.090361,"(42.27536542, -71.09036101)"


In [9]:
#checking missing and empty data in the dataframe
#replacing empty string with np.nan
df = df.replace('', np.nan)
df.isnull().sum()

INCIDENT_NUMBER             0
OFFENSE_CODE                0
OFFENSE_CODE_GROUP          0
OFFENSE_DESCRIPTION         0
DISTRICT                 1765
REPORTING_AREA              0
SHOOTING               318054
OCCURRED_ON_DATE            0
YEAR                        0
MONTH                       0
DAY_OF_WEEK                 0
HOUR                        0
UCR_PART                   90
STREET                  10871
Lat                     19999
Long                    19999
Location                    0
dtype: int64
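
The reason for replacing empty strings before counting can be seen on a toy frame (the values below are hypothetical): `isnull()` only counts real missing values, so an empty string goes unnoticed until it is converted to `np.nan`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one empty string and two real NaNs
toy = pd.DataFrame({
    "DISTRICT": ["D14", "", "C11"],
    "SHOOTING": [np.nan, "Y", np.nan],
})

# Empty strings are not counted as missing until they are converted to NaN
before = toy.isnull().sum()
after = toy.replace("", np.nan).isnull().sum()

print(before)
print(after)
```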

<a id="2"></a> <br>
# Line charts

In [10]:
#Total number of crimes in our dataset
df['INCIDENT_NUMBER'].nunique()

282517

## Using plotly.graph_objs library

#### Draw single line using plotly.graph_objs

In [20]:
month_crime = df.groupby(['MONTH'])['INCIDENT_NUMBER']\
    .nunique()\
    .apply(lambda x: x/282517)\
    .reset_index()
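
The cell above divides by 282517, the distinct incident count printed earlier. A possible variation, sketched here on toy data with hypothetical values, computes that total in code so the share stays correct if the data changes.

```python
import pandas as pd

# Toy incident table (hypothetical values); incident I1 appears on two rows
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I1", "I2", "I3", "I4"],
    "MONTH": [1, 1, 1, 2, 2],
})

# Count distinct incidents once, instead of typing the number by hand
total_incidents = toy["INCIDENT_NUMBER"].nunique()

month_share = (
    toy.groupby("MONTH")["INCIDENT_NUMBER"]
    .nunique()
    .div(total_incidents)
    .reset_index()
)
print(month_share)
```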

In [52]:
trace1 = go.Scatter(
                    x = month_crime.MONTH,
                    y = month_crime.INCIDENT_NUMBER,
                    mode = "lines",
                    line=dict(color='firebrick', width=4,dash='dash'))
data = [trace1]

layout = dict(title = 'Percentage of crimes across a year',
              width=600,
              height=400,
              xaxis= dict(title= 'MONTH',ticklen= 1,zeroline= False),
              yaxis= dict(title= 'Percentage of crimes')
             )
fig = dict(data = data, layout = layout)
iplot(fig)

In [11]:
# plotly.offline.iplot(fig, filename='linePlot')
# plotly.offline.plot(fig, include_plotlyjs=False, output_type='div')

#### Draw multiple lines using plotly.graph_objs

In [53]:
crime_TimePerYear = df.groupby(['YEAR','HOUR'])['INCIDENT_NUMBER']\
                    .nunique()\
                    .groupby('YEAR')\
                    .apply(lambda x: x/x.sum())\
                    .reset_index()
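
How `groupby(...).apply(...)` labels its result has changed between pandas releases, so the chain above may behave differently depending on the installed version. One alternative for the same per-year normalization is `transform`, sketched below on toy counts. The values and the `Share` column name are assumptions for illustration.

```python
import pandas as pd

# Toy counts per (YEAR, HOUR) (hypothetical values)
counts = pd.DataFrame({
    "YEAR": [2015, 2015, 2016, 2016],
    "HOUR": [0, 1, 0, 1],
    "INCIDENT_NUMBER": [30, 10, 5, 15],
})

# transform("sum") returns, for every row, the total of that row's year
year_total = counts.groupby("YEAR")["INCIDENT_NUMBER"].transform("sum")
counts["Share"] = counts["INCIDENT_NUMBER"] / year_total
print(counts)
```

Because `transform` keeps the original row order and index, no `reset_index()` is needed afterwards.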

In [54]:
crime_TimePerYear.head()

Unnamed: 0,YEAR,HOUR,INCIDENT_NUMBER
0,2015,0,0.051136
1,2015,1,0.029029
2,2015,2,0.024152
3,2015,3,0.014291
4,2015,4,0.009882


In [55]:
# Use each hour once: the full HOUR column repeats 0-23 for every year
x = sorted(crime_TimePerYear['HOUR'].unique())
trace1=crime_TimePerYear[crime_TimePerYear['YEAR']==2015]['INCIDENT_NUMBER']
trace2=crime_TimePerYear[crime_TimePerYear['YEAR']==2016]['INCIDENT_NUMBER']
trace3=crime_TimePerYear[crime_TimePerYear['YEAR']==2017]['INCIDENT_NUMBER']
trace4=crime_TimePerYear[crime_TimePerYear['YEAR']==2018]['INCIDENT_NUMBER']

In [56]:
year_2015 = go.Scatter(
                    x = x,
                    y = trace1,
                    name="Crime in 2015",
                    mode = "lines",
                    line=dict(color='firebrick', width=4,dash='dash'))
year_2016 = go.Scatter(
                    x = x,
                    y = trace2,
                    name="Crime in 2016",
                    mode = "lines",
                    line=dict(color='blue', width=4,dash='dot'))
year_2017 = go.Scatter(
                    x = x,
                    y = trace3,
                    name="Crime in 2017",
                    mode = "lines",
                    line=dict(color='pink', width=4,dash='dashdot'))
year_2018 = go.Scatter(
                    x = x,
                    y = trace4,
                    name="Crime in 2018",
                    mode = "lines",
                    line=dict(color='purple', width=4,dash='dash'))

data = [year_2015,year_2016,year_2017,year_2018]

layout = dict(title = 'Percentage of crimes across a day for each year',
              width=600,
              height=400,
              xaxis= dict(title= 'HOUR',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Percentage of crimes'))

fig = dict(data = data, layout = layout)
iplot(fig)

## Using plotly.express library

#### Draw single line using plotly.express

In [58]:
month_crime.head()

Unnamed: 0,MONTH,INCIDENT_NUMBER
0,1,0.07385
1,2,0.067518
2,3,0.07561
3,4,0.075574
4,5,0.082522


In [65]:
month_crime=month_crime.rename({'INCIDENT_NUMBER':'Percentage'},axis=1)

In [67]:
fig = px.line(month_crime, x="MONTH", y="Percentage", title='Percentage of crimes across a year')
fig.update_layout(
    width=600,
    height=400,
)
fig.show()

#### Draw multiple lines using plotly.express

In [62]:
crime_TimePerYear=crime_TimePerYear.rename({'INCIDENT_NUMBER':'Percentage'},axis=1)

In [64]:
fig = px.line(crime_TimePerYear, x="HOUR", y='Percentage',color='YEAR',
             title='Percentage of crimes across a day for each year')
fig.update_layout(
    width=600,
    height=400,
)
fig.show()

<a id="3"></a> <br>
# Scatter Plot

#### Line and scatter plot using plotly.graph_objs library
* go.Scatter can be used for plotting both points (markers) and lines, depending on the value of mode. The different options of go.Scatter are documented in its reference page.


In [69]:
df['SHOOTING'] =df['SHOOTING'].replace(np.nan,'N')

In [70]:
df_shotting = df[df['SHOOTING']=='Y']

In [71]:
hour_shotting= df_shotting.groupby('HOUR')['INCIDENT_NUMBER']\
        .nunique()\
        .reset_index()

In [73]:
# Creating trace1
trace1 = go.Scatter(
                    x = hour_shotting.HOUR,
                    y = hour_shotting.INCIDENT_NUMBER,
                    mode = "lines+markers") 
# dash options include 'dash', 'dot', and 'dashdot'

data = [trace1]

layout = dict(title = 'Number of shootings across a day',
              width=600,
              height=400,
              xaxis= dict(title= 'HOUR',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Number of shootings'))
fig = dict(data = data, layout = layout)
iplot(fig)

#### Scatter plot using plotly.express

In [84]:
shotting_timeDist = df_shotting.groupby(['YEAR','HOUR'])['INCIDENT_NUMBER']\
                    .nunique()\
                    .reset_index()

In [106]:
fig = px.scatter(shotting_timeDist, x="HOUR", y="INCIDENT_NUMBER", color='YEAR',
                title='Number of shootings across a day for each year')

fig.update_layout(
    width=600,
    height=400,
)
fig.update_xaxes(nticks=20,tick0=0, dtick=0,tickwidth=2) #Set number of tick marks and style
fig.update_yaxes(title_text='Number of shootings')#Set axis title

fig.show()

#### Scatter plot with facet using plotly.express

In [145]:
fig = px.scatter(shotting_timeDist, x="HOUR", y="INCIDENT_NUMBER", facet_row="YEAR", 
                 title='Number of shootings at different times',
           color_continuous_scale=px.colors.sequential.Viridis, render_mode="webgl")
fig.update_layout(
    width=600,
    height=500,
)
fig.update_xaxes(nticks=20,tickwidth=5,title_text='') #Set number of tick marks and style
fig.update_yaxes(title_text='')#Set axis title

fig.show()

#### Scatter Plot with more style

In [151]:
fig = px.scatter(shotting_timeDist, x="HOUR", y="INCIDENT_NUMBER", color="YEAR",size='INCIDENT_NUMBER',
                 title='Number of shootings at different times',
                 hover_name='YEAR',size_max=20,
           color_continuous_scale=px.colors.sequential.Viridis, render_mode="webgl")
fig.update_layout(
    width=600,
    height=400,
)
fig.update_xaxes(nticks=20,tickwidth=5,title_text='') #Set number of tick marks and style
fig.update_yaxes(title_text='Number of shootings')#Set axis title

fig.show()

#### Scatter plot with matrix

In [165]:
crime_matrix = df.groupby(['SHOOTING','YEAR','DAY_OF_WEEK','HOUR'])['INCIDENT_NUMBER']\
                      .nunique()\
                      .apply(lambda x: x/282517)\
                      .reset_index()

crime_matrix['DAY'] = crime_matrix['DAY_OF_WEEK'].apply(lambda x:x[0:3])

In [168]:
# trendline="ols" requires the statsmodels package
fig = px.scatter(crime_matrix, x="HOUR", y="INCIDENT_NUMBER",facet_row="YEAR", facet_col="DAY",
                 color="SHOOTING", 
                 trendline="ols",
                 category_orders={"DAY": ["Mon", "Tue", "Wed", "Thu","Fri","Sat","Sun"], "YEAR": [2015, 2016,2017,2018]})

fig.update_xaxes(nticks=10,tickwidth=5,title_text='') #Set number of tick marks and style
fig.update_yaxes(title_text='')#Set axis title

fig.show()
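
The labels passed to `category_orders` need to match the values that actually occur in the `DAY` column. This short sketch shows what slicing with `x[0:3]` yields for each weekday name. The ordered categorical at the end is an optional pandas-side way to keep weekday order and is an illustrative addition, not part of the original notebook.

```python
import pandas as pd

day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Slicing the first three characters, as done for the DAY column
abbreviations = [name[0:3] for name in day_names]
print(abbreviations)

# An ordered categorical keeps weekday order when sorting
toy = pd.DataFrame({"DAY": ["Sun", "Wed", "Mon"]})
toy["DAY"] = pd.Categorical(toy["DAY"], categories=abbreviations, ordered=True)
sorted_days = toy.sort_values("DAY")["DAY"].tolist()
print(sorted_days)
```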


<a id="4"></a> <br>
# Bar chart
* A bar chart presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent.

#### Simple bar chart

In [172]:
df_bar = df.groupby(['DISTRICT','UCR_PART'])['INCIDENT_NUMBER'].nunique().reset_index()

In [173]:
df_bar.head()

Unnamed: 0,DISTRICT,UCR_PART,INCIDENT_NUMBER
0,A1,Other,69
1,A1,Part One,8272
2,A1,Part Three,15762
3,A1,Part Two,9025
4,A15,Other,23


In [175]:
fig = px.bar(df_bar, x="UCR_PART", y="INCIDENT_NUMBER", color="DISTRICT", barmode="group",
            title='Number of crimes by UCR part and district')
fig.update_layout(
    width=600,
    height=400,
)
fig.show()
# can add facet_col or facet_row 

#### Bar chart with facet

In [183]:
fig = px.bar(df_bar, x="DISTRICT", y="INCIDENT_NUMBER", facet_col="UCR_PART", barmode="group",
            title='Number of crimes by UCR part and district')
fig.update_layout(
    width=800,
    height=400,
)

fig.update_xaxes(nticks=10,tickwidth=5) #Set number of tick marks and style

fig.show()

<a id="5"></a> <br>
# Histogram
* A histogram presents the distribution of a continuous variable.

In [185]:
hist_df = df.groupby(['SHOOTING','YEAR','HOUR'])['INCIDENT_NUMBER']\
                    .nunique()\
                    .reset_index()

In [186]:
hist_df.head()

Unnamed: 0,SHOOTING,YEAR,HOUR,INCIDENT_NUMBER
0,N,2015,0,2389
1,N,2015,1,1352
2,N,2015,2,1123
3,N,2015,3,669
4,N,2015,4,461


In [211]:
fig = px.histogram(hist_df, x="HOUR", y="INCIDENT_NUMBER",histfunc="avg", barmode="group",
                   facet_col='SHOOTING'
                   ,nbins=100
                   ,title='Distribution of the number of shootings across a day')
fig.update_layout(
    width=600,
    height=400,
    yaxis_type="log",
)
fig.show()
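
With `histfunc="avg"`, each bar shows the average of the `y` values that fall into a bin rather than a row count. Assuming each bin ends up holding a single hour, the bar heights should correspond to a plain pandas group mean, as in this sketch on toy data (hypothetical values).

```python
import pandas as pd

# Toy counts per (YEAR, HOUR) (hypothetical values)
toy = pd.DataFrame({
    "YEAR": [2015, 2016, 2015, 2016],
    "HOUR": [0, 0, 1, 1],
    "INCIDENT_NUMBER": [10, 20, 4, 8],
})

# Average count per hour across years: the per-bin value that an "avg"
# aggregation would show if every bin contained exactly one hour
avg_per_hour = toy.groupby("HOUR")["INCIDENT_NUMBER"].mean()
print(avg_per_hour)
```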

<a id="6"></a> <br>
# Boxplot

In [194]:
hist_df.head()

Unnamed: 0,SHOOTING,YEAR,HOUR,INCIDENT_NUMBER
0,N,2015,0,2389
1,N,2015,1,1352
2,N,2015,2,1123
3,N,2015,3,669
4,N,2015,4,461


In [209]:
fig = px.box(hist_df, x="YEAR", y="INCIDENT_NUMBER", color="SHOOTING", notched=True)
fig.update_layout(
    width=600,
    height=400,
    yaxis_type="log"
)
fig.show()

<a id="7"></a> <br>
# Heatmap

In [205]:
crime_matrix.head()

Unnamed: 0,SHOOTING,YEAR,DAY_OF_WEEK,HOUR,INCIDENT_NUMBER,DAY
0,N,2015,Friday,0,0.001126,Fri
1,N,2015,Friday,1,0.000676,Fri
2,N,2015,Friday,2,0.000531,Fri
3,N,2015,Friday,3,0.000276,Fri
4,N,2015,Friday,4,0.000223,Fri


In [210]:
fig = px.density_heatmap(crime_matrix, x="HOUR", y="YEAR", marginal_x="rug", marginal_y="histogram")
fig.update_layout(
    width=600,
    height=400,
)
fig.show()

<a id="8"></a> <br>
# 3D Plot

In [208]:
fig = px.scatter_3d(hist_df, x="YEAR", y="INCIDENT_NUMBER", z="HOUR",color='SHOOTING')
fig.show()