## SF crime data analysis and modeling

### In this notebook, you will learn how to use Spark SQL for big data analysis on SF crime data (https://data.sfgov.org/Public-Safety/sf-data/skgt-fej3/data).
The first part of the homework is OLAP for crime data analysis (80 credits).  
The second part is unsupervised learning for spatial data analysis (20 credits).   
The optional part is time series data analysis (50 credits).  
**Note**: you can download a small sample (one month, e.g. 2018-10) for debugging, then download the data from 2013 to 2018 for testing and analysis.

### How to submit the report for grading?
Publish your notebook and send it to mike@laioffer.com, with an email title in this format: Laidata190221_Spark_Hw1_Yourname  
Your report has to contain your data analysis insights.

In [3]:
from csv import reader
from pyspark.sql import Row 
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from ggplot import *
import warnings

import os
os.environ["PYSPARK_PYTHON"] = "python3"


In [4]:
# download the data from the SF government's official website
# execute only once
# if this cell runs a second time, comment out the download and move lines below
import urllib.request
urllib.request.urlretrieve("https://data.sfgov.org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD", "/tmp/sf_03_18.csv")
dbutils.fs.mv("file:/tmp/sf_03_18.csv", "dbfs:/laioffer/spark_hw1/data/sf_03_18.csv")
display(dbutils.fs.ls("dbfs:/laioffer/spark_hw1/data/"))
## or download the file yourself
# https://data.sfgov.org/api/views/tmnf-yvry/rows.csv?accessType=DOWNLOAD


path,name,size
dbfs:/laioffer/spark_hw1/data/sf_03_18.csv,sf_03_18.csv,544214906


In [5]:
data_path = "dbfs:/laioffer/spark_hw1/data/sf_03_18.csv"
# use this file name later

In [6]:
# read data from the data storage
# upload your data to Databricks Community Edition first
crime_data_lines = sc.textFile(data_path)
# prepare data: parse each line as CSV and strip double quotes from the fields
df_crimes = crime_data_lines.map(lambda line: [x.strip('"') for x in next(reader([line]))])
# get the header
header = df_crimes.first()
print(header)

# remove the header line from the data
crimes = df_crimes.filter(lambda x: x != header)

# get the first three rows of data
display(crimes.take(3))

# get the total number of rows
print(crimes.count())


_1,_2,_3,_4,_5,_6,_7,_8,_9,_10,_11,_12,_13,_14,_15,_16,_17,_18,_19,_20,_21,_22,_23,_24,_25,_26,_27,_28,_29,_30,_31,_32,_33
180362289,VEHICLE THEFT,STOLEN MOTORCYCLE,Tuesday,05/15/2018,10:30,SOUTHERN,NONE,700 Block of TEHAMA ST,-122.41191202732875,37.77520656149669,"(37.77520656149669, -122.41191202732877)",18036228907023,32,1,10,34,8,2,9,28853,34,,1.0,,1,,,2,,,1,
180360948,NON-CRIMINAL,"AIDED CASE, MENTAL DISTURBED",Tuesday,05/15/2018,04:14,SOUTHERN,NONE,MARKET ST / SOUTH VAN NESS AV,-122.41925789481355,37.77514629165388,"(37.77514629165388, -122.41925789481357)",18036094864020,32,1,10,20,8,2,9,28853,19,,1.0,,1,,8.0,2,1.0,1.0,1,
180360879,OTHER OFFENSES,PAROLE VIOLATION,Tuesday,05/15/2018,02:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.41781255878657,37.75710057964282,"(37.757100579642824, -122.41781255878655)",18036087926150,53,3,2,20,2,4,7,28859,19,13.0,,15.0,3,15.0,,2,,,3,
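
The cell above parses each line with `csv.reader` instead of splitting on commas, because several fields (for example `Descript` and `Location`) contain commas inside double quotes. A minimal sketch of the difference, using one row abbreviated from the output above (the variable names are illustrative):

```python
from csv import reader

# One raw line, abbreviated from the sample rows above; two fields contain commas inside quotes.
line = '180360948,NON-CRIMINAL,"AIDED CASE, MENTAL DISTURBED",Tuesday,"(37.77514629165388, -122.41925789481357)"'

naive_fields = line.split(",")          # splits inside the quoted fields as well
parsed_fields = next(reader([line]))    # respects the quoting

print(len(naive_fields), len(parsed_fields))
print(parsed_fields[2])
```

Because `sc.textFile` splits the file on newlines, this per-line approach assumes that no quoted field contains a line break.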


### Solve big data issues via Spark
approach 1: use RDD (not recommended)  
approach 2: use DataFrame, register the RDD as a DataFrame (recommended for DE)  
approach 3: use SQL (recommended for data analysis or DS, and for students with a weaker background)  
***note***: you only need to choose one of the approaches introduced above

#### We provide 3 options to transform distributed data into a DataFrame and SQL table; you can choose any one of them to practice

In [9]:

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("crime analysis") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Load .csv
df_opt1 = spark.read.format("csv").option("header", "true").load(data_path)
# display data
display(df_opt1)
# Create temporary table called "sf_crime"
df_opt1.createOrReplaceTempView("sf_crime")

IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,:@computed_region_yftq_j783,:@computed_region_p5aj_wyqh,:@computed_region_rxqg_mtj9,:@computed_region_bh8s_q3mv,:@computed_region_fyvs_ahh9,:@computed_region_9dfj_4gjx,:@computed_region_n4xg_c4py,:@computed_region_4isq_27mq,:@computed_region_fcz8_est8,:@computed_region_pigm_ib2e,:@computed_region_9jxd_iqea,:@computed_region_6pnf_4xz7,:@computed_region_6ezc_tdp2,:@computed_region_h4ep_8xdi,:@computed_region_nqbw_i6c3,:@computed_region_2dwj_jsy4
180362289,VEHICLE THEFT,STOLEN MOTORCYCLE,Tuesday,05/15/2018,10:30,SOUTHERN,NONE,700 Block of TEHAMA ST,-122.41191202732875,37.77520656149669,"(37.77520656149669, -122.41191202732877)",18036228907023,32.0,1,10,34,8.0,2,9,28853,34,,1.0,,1.0,,,2,,,1.0,
180360948,NON-CRIMINAL,"AIDED CASE, MENTAL DISTURBED",Tuesday,05/15/2018,04:14,SOUTHERN,NONE,MARKET ST / SOUTH VAN NESS AV,-122.41925789481355,37.77514629165388,"(37.77514629165388, -122.41925789481357)",18036094864020,32.0,1,10,20,8.0,2,9,28853,19,,1.0,,1.0,,8.0,2,1.0,1.0,1.0,
180360879,OTHER OFFENSES,PAROLE VIOLATION,Tuesday,05/15/2018,02:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.41781255878657,37.75710057964282,"(37.757100579642824, -122.41781255878655)",18036087926150,53.0,3,2,20,2.0,4,7,28859,19,13.0,,15.0,3.0,15.0,,2,,,3.0,
180360879,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Tuesday,05/15/2018,02:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.41781255878657,37.75710057964282,"(37.757100579642824, -122.41781255878655)",18036087965010,53.0,3,2,20,2.0,4,7,28859,19,13.0,,15.0,3.0,15.0,,2,,,3.0,
180360879,OTHER OFFENSES,TRAFFIC VIOLATION,Tuesday,05/15/2018,02:01,MISSION,"ARREST, BOOKED",CAPP ST / 21ST ST,-122.41781255878657,37.75710057964282,"(37.757100579642824, -122.41781255878655)",18036087965015,53.0,3,2,20,2.0,4,7,28859,19,13.0,,15.0,3.0,15.0,,2,,,3.0,
180360829,OTHER OFFENSES,"DRIVERS LICENSE, SUSPENDED OR REVOKED",Tuesday,05/15/2018,01:27,MISSION,NONE,700 Block of SHOTWELL ST,-122.41561725232026,37.75641376904809,"(37.75641376904809, -122.41561725232026)",18036082965016,53.0,3,2,20,2.0,4,7,28859,19,,,,3.0,,,2,,,3.0,
180360835,ROBBERY,"ROBBERY, BODILY FORCE",Tuesday,05/15/2018,01:25,SOUTHERN,"ARREST, BOOKED",0 Block of 6TH ST,-122.41004163181596,37.78195365372572,"(37.781953653725715, -122.41004163181597)",18036083503074,32.0,5,10,34,14.0,2,9,28853,34,17.0,1.0,18.0,1.0,18.0,7.0,2,1.0,1.0,1.0,
180360835,DRUG/NARCOTIC,POSSESSION OF NARCOTICS PARAPHERNALIA,Tuesday,05/15/2018,01:25,SOUTHERN,"ARREST, BOOKED",0 Block of 6TH ST,-122.41004163181596,37.78195365372572,"(37.781953653725715, -122.41004163181597)",18036083516710,32.0,5,10,34,14.0,2,9,28853,34,17.0,1.0,18.0,1.0,18.0,7.0,2,1.0,1.0,1.0,
180360794,LIQUOR LAWS,MISCELLANEOUS LIQOUR LAW VIOLATION,Tuesday,05/15/2018,00:19,PARK,"ARREST, BOOKED",1500 Block of HAIGHT ST,-122.44776112231956,37.76984648754153,"(37.76984648754153, -122.44776112231955)",18036079417030,25.0,7,11,3,15.0,5,11,29492,9,22.0,,24.0,,25.0,,1,,,,
180360794,WARRANTS,ENROUTE TO OUTSIDE JURISDICTION,Tuesday,05/15/2018,00:19,PARK,"ARREST, BOOKED",1500 Block of HAIGHT ST,-122.44776112231956,37.76984648754153,"(37.76984648754153, -122.44776112231955)",18036079462050,25.0,7,11,3,15.0,5,11,29492,9,22.0,,24.0,,25.0,,1,,,,


#### Q1 question (OLAP): 
##### Write a Spark program that counts the number of crimes for each category.

Below is some example code that demonstrates how to use Spark RDD, DF, and SQL to work with big data. You can follow this example to finish the other questions.

In [11]:
q1_result = df_opt1.groupBy('category').count().orderBy('count', ascending=False)

# display result
display(q1_result)

category,count
LARCENY/THEFT,480448
OTHER OFFENSES,309358
NON-CRIMINAL,238323
ASSAULT,194694
VEHICLE THEFT,126602
DRUG/NARCOTIC,119628
VANDALISM,116059
WARRANTS,101379
BURGLARY,91543
SUSPICIOUS OCC,80444


In [12]:
# Spark SQL based
## in descending order
crimeCategory_desc = spark.sql("SELECT category, COUNT(*) AS Count FROM sf_crime GROUP BY category ORDER BY Count DESC")

# display result
display(crimeCategory_desc)

## in ascending order
#crimeCategory_asc = spark.sql("SELECT category, COUNT(*) AS Count FROM sf_crime GROUP BY category ORDER BY Count ASC")
#display(crimeCategory_asc)

category,Count
LARCENY/THEFT,480448
OTHER OFFENSES,309358
NON-CRIMINAL,238323
ASSAULT,194694
VEHICLE THEFT,126602
DRUG/NARCOTIC,119628
VANDALISM,116059
WARRANTS,101379
BURGLARY,91543
SUSPICIOUS OCC,80444


In [13]:
# important hints: 
## first step: use Spark DataFrame or SQL to compute the statistical result 
## second step: export your result to a pandas DataFrame. 

## show data
#crimeCategory_desc.show()

## convert to pandas dataframe
crimes_pd_df = crimeCategory_desc.toPandas()

## Spark itself has no plotting function; please refer to https://matplotlib.org/ for visualization. On Databricks Community, use display() to show the figure. 

## display the result as a pandas DataFrame
display(crimes_pd_df)

# Conclusion:
## the LARCENY/THEFT category has the highest count, meaning that this type of crime is reported most often in SF,
## whereas TREA has the lowest count and is observed the least often

category,Count
LARCENY/THEFT,480448
OTHER OFFENSES,309358
NON-CRIMINAL,238323
ASSAULT,194694
VEHICLE THEFT,126602
DRUG/NARCOTIC,119628
VANDALISM,116059
WARRANTS,101379
BURGLARY,91543
SUSPICIOUS OCC,80444


#### Q2 question (OLAP)
Count the number of crimes for each district, and visualize your results

In [15]:
# Spark SQL based - Solution for Q2
## count the number of crimes for each district using column PdDistrict

## in descending order
crimePdDistrict_desc = spark.sql("SELECT PdDistrict, COUNT(*) AS Count FROM sf_crime GROUP BY PdDistrict ORDER BY Count DESC")

# display result
display(crimePdDistrict_desc)

## in ascending order
#crimePdDistrict_asc = spark.sql("SELECT PdDistrict, COUNT(*) AS Count FROM sf_crime GROUP BY PdDistrict ORDER BY Count ASC")
#display(crimePdDistrict_asc)

PdDistrict,Count
SOUTHERN,399785
MISSION,300076
NORTHERN,272713
CENTRAL,226255
BAYVIEW,221000
INGLESIDE,194180
TENDERLOIN,191746
TARAVAL,166971
PARK,125479
RICHMOND,116818


In [16]:
# visualize the result
# pandas dataframe

crimes2_pd_df = crimePdDistrict_desc.toPandas()

# display result
display(crimes2_pd_df)

# Conclusion:
## the Southern PdDistrict has the highest number of crimes, whereas Richmond has the lowest,
## we can conclude that Richmond, Park, and Taraval are the top-3 safest districts, and Southern is the most dangerous district

PdDistrict,Count
SOUTHERN,399785
MISSION,300076
NORTHERN,272713
CENTRAL,226255
BAYVIEW,221000
INGLESIDE,194180
TENDERLOIN,191746
TARAVAL,166971
PARK,125479
RICHMOND,116818
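
Q2 also asks for a visualization. One possible sketch, assuming matplotlib is available on the cluster: the helper names `sort_counts` and `plot_district_counts` are hypothetical, and the rows are copied from the output above.

```python
# District counts copied from the Q2 output above.
district_counts = [
    ("SOUTHERN", 399785), ("MISSION", 300076), ("NORTHERN", 272713),
    ("CENTRAL", 226255), ("BAYVIEW", 221000), ("INGLESIDE", 194180),
    ("TENDERLOIN", 191746), ("TARAVAL", 166971), ("PARK", 125479),
    ("RICHMOND", 116818),
]

def sort_counts(rows):
    """Return (labels, values) sorted by value, largest first."""
    ordered = sorted(rows, key=lambda row: row[1], reverse=True)
    return [name for name, _ in ordered], [count for _, count in ordered]

def plot_district_counts(rows):
    """Draw a bar chart of crime counts per district and return the figure."""
    import matplotlib.pyplot as plt  # imported here so sort_counts works without matplotlib

    labels, values = sort_counts(rows)
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(labels, values)
    ax.set_xlabel("PdDistrict")
    ax.set_ylabel("Number of crimes")
    ax.set_title("SF crimes by police district")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig
```

In the notebook this could be shown with `display(plot_district_counts(district_counts))`. The rows could also be built from the pandas result, for example with `list(crimes2_pd_df.itertuples(index=False, name=None))`.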


#### Q3 question (OLAP)
Count the number of crimes each "Sunday" at "SF downtown".   
hints: SF downtown is defined via a range of spatial locations. For example, you can use a rectangle to define SF downtown, or you can define a circle with a center as well. Thus, you need to write your own UDF to filter the data located inside a certain spatial range. You can follow the example here: https://changhsinlee.com/pyspark-udf/

In [18]:
## a Google search gives the coordinates of SF downtown (Financial District) as 37.7946° N, 122.3999° W
## therefore, we assume SF downtown is centered at 37.7946° N, 122.3999° W
## 'SF downtown' is taken to be a circle of area 1.19 square kilometers (approx. 119 ha) around that point, so its radius is (1.19/3.14)**0.5 = 0.616 km
## assume 1° ≈ 111 km for both longitude and latitude (a rough approximation: at this latitude one degree of longitude is closer to 88 km)

# obtain longitude & latitude
crimesSFdt_x = spark.sql("SELECT DayOfWeek, Date, Address, X FROM sf_crime")
crimesSFdt_y = spark.sql("SELECT DayOfWeek, Date, Address, Y FROM sf_crime")
# crimesSFdt_x.show()
# crimesSFdt_y.show()

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def x_square_float(x):
    return ((float(x)-(-122.3999))*111)**2 

x_square_udf_float = udf(lambda z: x_square_float(z), FloatType())
crimes3_df = crimesSFdt_x.select('DayOfWeek', 'Date', 'Address', 'X', x_square_udf_float('X').alias('x_float_squared'))
#crimes3_df.show()

# convert to Pandas dataframe
#crimes3_pd_df = crimes3_df.toPandas()
#display(crimes3_pd_df)

def y_square_float(y):
    return ((float(y)-37.7946)*111)**2 

y_square_udf_float = udf(lambda z: y_square_float(z), FloatType())
crimes4_df = crimesSFdt_y.select('DayOfWeek', 'Date', 'Address', 'Y', y_square_udf_float('Y').alias('y_float_squared'))
#crimes4_df.show()

# convert to Pandas dataframe
#crimes4_pd_df = crimes4_df.toPandas()
#display(crimes4_pd_df)

# Create temporary table called df3, df4
crimes3_df.createOrReplaceTempView("df3")
crimes4_df.createOrReplaceTempView("df4")

# write spark.SQL to retrieve distance
crimesSFdt = spark.sql("SELECT dt.Date, COUNT(distance) AS num_of_crimes FROM (SELECT df3.DayOfWeek, df3.Date, df3.Address, SQRT(df3.x_float_squared + df4.y_float_squared) as distance FROM df3 INNER JOIN df4 on df3.address = df4.address WHERE df3.DayOfWeek = 'Sunday') AS dt WHERE dt.distance < 0.616 GROUP BY dt.Date ORDER BY num_of_crimes DESC")

# display result
display(crimesSFdt)

# Conclusion:
## according to this query, among all Sundays the count is at its highest level on 01/01/2006
## 01/01/2006 falls in the New Year holidays, which could be a major reason for the high count
## note: the join on Address pairs each Sunday incident with every incident recorded at the same address, so num_of_crimes overstates the number of distinct incidents
## additional information is needed

Date,num_of_crimes
01/01/2006,14227
04/13/2014,12163
05/25/2014,10763
11/23/2014,9488
11/24/2013,9212
01/16/2011,9116
04/21/2013,9109
10/27/2013,8890
08/03/2014,8874
08/24/2003,8771
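
The query above treats one degree as 111 km on both axes and joins two tables on `Address`. An alternative sketch computes the great-circle (haversine) distance from each row's own `X` and `Y`, so no join is needed. The center and the 0.616 km radius are the assumptions made in the cell above; the function names and the Earth radius of 6371 km are illustrative choices. The filter accepts strings because the CSV loader reads `X` and `Y` as strings.

```python
from math import asin, cos, radians, sin, sqrt

CENTER_LON, CENTER_LAT = -122.3999, 37.7946   # downtown center assumed in the cell above
RADIUS_KM = 0.616                             # radius assumed in the cell above
EARTH_RADIUS_KM = 6371.0                      # mean Earth radius (assumption)

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def in_downtown(x, y):
    """True if the point (X = longitude, Y = latitude) lies inside the downtown circle."""
    return haversine_km(float(x), float(y), CENTER_LON, CENTER_LAT) < RADIUS_KM

# Two incidents from the sample rows shown earlier in the notebook.
print(in_downtown("-122.41004163181596", "37.78195365372572"))   # 0 Block of 6TH ST
print(in_downtown("-122.44776112231956", "37.76984648754153"))   # 1500 Block of HAIGHT ST
```

In Spark, `in_downtown` could be wrapped with `udf(in_downtown, BooleanType())` and applied in a filter together with `DayOfWeek = 'Sunday'`, followed by a `groupBy('Date').count()`. Rows with missing coordinates would need to be handled before calling `float`.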


#### Q4 question (OLAP)
Analyze the number of crimes in each month of 2015, 2016, 2017, 2018. Then, give your insights on the output results. What is the business impact of your result?

In [20]:
# Spark SQL based - Solution for Q4
# counts the number of crimes in each month of 2015, 2016, 2017, 2018

crimeInYearMonth = spark.sql("SELECT year(to_date(date, 'MM/dd/yyyy')) as year, month(to_date(date, 'MM/dd/yyyy')) as month, COUNT(*) as num_of_crimes FROM sf_crime GROUP BY year, month HAVING year in (2015, 2016, 2017, 2018) ORDER BY num_of_crimes DESC")
                             
# display result
display(crimeInYearMonth)


# using extract() function:
#crimeInYearMonth = spark.sql("SELECT EXTRACT(YEAR FROM dt.date) as year, EXTRACT(MONTH FROM dt.date) AS month, COUNT(*) as num_of_crimes FROM (SELECT TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP)) AS date FROM sf_crime) AS dt GROUP BY year, month HAVING year in (2015, 2016, 2017, 2018) ORDER BY num_of_crimes DESC")

# Subquery: format the date to timestamp YYYY-MM-DD 
#crimeInYearMonth = spark.sql("SELECT TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate FROM sf_crime GROUP BY 1 ORDER BY 1 DESC")


# Conclusion:
## the number of crimes hit its highest level in March 2015
## looking at the data from 2015-2018, the year 2015 also appears to have the most crime cases, because the top-3 months are all in 2015

# business impact of the result:
## there might have been some incidents in 2015 (political or financial) 
## therefore, companies in SF might have been impacted to some extent
## in these particular months and years, business activities (in the goods and services sector) might have been slightly affected as well 
## due to safety concerns expressed by people

year,month,num_of_crimes
2015,3,13929
2015,8,13730
2015,5,13729
2017,3,13711
2015,1,13606
2016,10,13388
2015,7,13365
2017,10,13355
2015,6,13304
2017,5,13267


#### Q5 question (OLAP)
Analyze the number of crimes w.r.t. the hour on certain days like 2015/12/15, 2016/12/15, 2017/12/15. Then, give your travel suggestions for visiting SF.

In [22]:
# Spark SQL based - Solution for Q5
# counts the number of crimes per time-of-day period on 12/15/2015, 12/15/2016, and 12/15/2017

crimesDate = spark.sql("SELECT res.date, res.hours, COUNT(*) as num_of_crimes FROM (SELECT date, CASE WHEN time BETWEEN '00:00' AND '06:00:00' THEN 'early morning' WHEN time BETWEEN '06:00' AND '12:00:00' THEN 'morning' WHEN time BETWEEN '12:00' AND '18:00:00' THEN 'afternoon' WHEN time BETWEEN '18:00' AND '24:00:00' THEN 'night' END AS hours FROM sf_crime WHERE date IN ('12/15/2015', '12/15/2016', '12/15/2017')) AS res GROUP BY 1,2 ORDER BY num_of_crimes DESC")

# display result
display(crimesDate)

# Conclusion:
## the number of crimes is analyzed with respect to the time of day on 12/15/2015, 12/15/2016, and 12/15/2017: 
## 00:00-06:00 is 'early morning'; 06:00-12:00 is 'morning'; 12:00-18:00 is 'afternoon'; and lastly 18:00-00:00 is 'night'.

# travel tips for visiting SF:
## in general, afternoon and night (between 12:00 and 00:00) have higher crime counts, whereas early morning and morning hours appear to be much safer
## my advice: if you plan to visit San Francisco, try not to arrive in the city later than 12 PM
## instead, prepare one day earlier and set out in the early morning or in the morning
## so that after you check in at your hotel, you can relax a bit and explore the city during the daytime
## remember to go back early (no later than 18:00), because there is a higher chance of being involved in a crime in SF at night.

date,hours,num_of_crimes
12/15/2017,night,154
12/15/2016,night,135
12/15/2017,afternoon,133
12/15/2015,afternoon,124
12/15/2017,morning,118
12/15/2016,afternoon,116
12/15/2016,morning,99
12/15/2015,morning,93
12/15/2015,night,88
12/15/2017,early morning,66


#### Q6 question (OLAP)
(1) Step 1: find the top-3 most dangerous districts  
(2) Step 2: find the crime events w.r.t. category and time (hour) in the districts from step 1  
(3) Step 3: give your advice on distributing the police based on your analysis results.

In [24]:
# (1) Step 1: find the top-3 most dangerous districts  
#crimePdDistrict_desc = spark.sql("SELECT PdDistrict, COUNT(*) AS Count FROM sf_crime GROUP BY PdDistrict ORDER BY Count DESC")
#display(crimePdDistrict_desc.take(3))

# Top-3 most dangerous districts: Southern, Mission, and Northern


# (2) Step 2: find the crime events w.r.t. category and time (hour) in the districts from step 1  

## the three districts are counted separately:
crimes_in_top3 = spark.sql("SELECT res.category, res.PdDistrict, res.hours, COUNT(*) as num_of_crimes FROM (SELECT category, PdDistrict, CASE WHEN time BETWEEN '00:00' AND '06:00:00' THEN 'early morning' WHEN time BETWEEN '06:00' AND '12:00:00' THEN 'morning' WHEN time BETWEEN '12:00' AND '18:00:00' THEN 'afternoon' WHEN time BETWEEN '18:00' AND '24:00:00' THEN 'night' END AS hours FROM sf_crime WHERE PdDistrict IN ('SOUTHERN', 'MISSION', 'NORTHERN')) AS res GROUP BY 1,2,3 ORDER BY num_of_crimes DESC")

# display result
display(crimes_in_top3)

## three districts treated as one area:
#crimes_in_top3 = spark.sql("SELECT res.category, res.hours, COUNT(*) as num_of_crimes FROM (SELECT category, PdDistrict, CASE WHEN time BETWEEN '00:00' AND '06:00:00' THEN 'early morning' WHEN time BETWEEN '06:00' AND '12:00:00' THEN 'morning' WHEN time BETWEEN '12:00' AND '18:00:00' THEN 'afternoon' WHEN time BETWEEN '18:00' AND '24:00:00' THEN 'night' END AS hours FROM sf_crime WHERE PdDistrict IN ('SOUTHERN', 'MISSION', 'NORTHERN')) AS res GROUP BY 1,2 ORDER BY num_of_crimes DESC")
#display(crimes_in_top3)

# (3) give your advice on distributing the police based on your analysis results. 
## looking at the top-3 most dangerous districts, the police force should be distributed primarily in the Southern district, then the Northern district, followed by the Mission district
## police on-duty time is suggested to be at night for LARCENY/THEFT, and in the afternoon for OTHER OFFENSES
## residents in the city should cooperate with the police and stay at home during the dangerous hours from 12 PM to midnight

category,PdDistrict,hours,num_of_crimes
LARCENY/THEFT,SOUTHERN,afternoon,41261
LARCENY/THEFT,SOUTHERN,night,39937
LARCENY/THEFT,NORTHERN,night,30390
LARCENY/THEFT,NORTHERN,afternoon,26616
LARCENY/THEFT,SOUTHERN,morning,21001
LARCENY/THEFT,MISSION,night,18147
NON-CRIMINAL,SOUTHERN,afternoon,17250
OTHER OFFENSES,SOUTHERN,afternoon,17156
OTHER OFFENSES,MISSION,afternoon,15244
LARCENY/THEFT,MISSION,afternoon,15153


#### Q7 question (OLAP)
For each category of crime, find the percentage of resolution. Based on the output, give your hints for adjusting the policy.

In [26]:
# write subqueries tb1 & tb2 to count the rows where Resolution is 'NONE' ("unresolved") or anything else ("resolved")
# left join the two tables together
# calculate the percentage of resolution
# cnt_un/(cnt_un+cnt_re) is the fraction of unresolved crimes, saved as cnt_un_pc
# cnt_re/(cnt_un+cnt_re) is the fraction of resolved crimes, saved as cnt_re_pc

crimeCategoryPercentage = spark.sql("SELECT tb3.Category, cnt_un/(cnt_un+cnt_re) as cnt_un_pc, cnt_re/(cnt_un+cnt_re) as cnt_re_pc FROM (SELECT tb1.Category, cnt_un, cnt_re from (SELECT Category, COUNT(resolved_or_not) as cnt_un FROM (SELECT Category, CASE WHEN Resolution LIKE 'NONE' THEN 'unresolved' else 'resolved' end as resolved_or_not FROM sf_crime) WHERE resolved_or_not = 'unresolved' GROUP BY Category) tb1 LEFT JOIN (SELECT Category, COUNT(resolved_or_not) as cnt_re FROM (SELECT Category, CASE WHEN Resolution LIKE 'NONE' THEN 'unresolved' else 'resolved' end as resolved_or_not FROM sf_crime) WHERE resolved_or_not = 'resolved' GROUP BY Category) tb2 ON tb1.Category = tb2.Category) tb3 ORDER BY cnt_un_pc DESC")

# display result
display(crimeCategoryPercentage)


# hints for adjusting the policy:
## police officers should focus more on crimes like RECOVERED VEHICLE, VEHICLE THEFT, and LARCENY/THEFT, because these three categories have the highest 'unresolved' percentage.
## crimes related to vehicles require additional attention from the police department
## police should take action to raise vehicle owners' awareness of vehicle safety/security issues.

Category,cnt_un_pc,cnt_re_pc
RECOVERED VEHICLE,0.9308168884809546,0.0691831115190454
VEHICLE THEFT,0.916099271733464,0.083900728266536
LARCENY/THEFT,0.9114971859597708,0.0885028140402291
SUSPICIOUS OCC,0.8824275272239073,0.1175724727760926
VANDALISM,0.8779155429565997,0.1220844570434003
BURGLARY,0.8373441989010629,0.1626558010989371
ARSON,0.8074281353345205,0.1925718646654795
BAD CHECKS,0.8043243243243243,0.1956756756756756
EXTORTION,0.7786774628879892,0.2213225371120108
NON-CRIMINAL,0.7756616021114202,0.2243383978885798


### Conclusion. 
Use four sentences to summarize your work: what you have done, how you did it, what the technical steps were, and what the business impact is. 
More details are appreciated. You can think of this as a report for your manager. Then, you need to use this experience to prove that you have a strong background in big data analysis.  
Point 1:  what is your story, and why did you do this work?   
Point 2:  how did you do it?  keywords: Spark, Spark SQL, DataFrame, data cleaning, data visualization, data size, clustering, OLAP   
Point 3:  what did you learn from the data?  keywords: crime, trend, advising, conclusion, runtime

In [28]:
# HW Conclusion:
## I analyzed a large data set of SF crimes using Spark and Spark SQL on Databricks
## the goal was to provide some safety suggestions (travel tips) to residents living in SF, and to travelers who plan to visit the city
## prior to starting the homework, we need to create a cluster to run the code
## I did some data cleaning and data processing by importing packages and writing lines of code, such as applying the map() & filter() functions
## I mainly used Spark SQL in this homework to answer the OLAP questions
## some useful findings include the top-3 most-committed crimes in SF, the top-3 most dangerous districts, the number of crimes around downtown SF, crime counts by time of day on particular dates, and resolved & unresolved percentages for different crime categories, etc.
## safety campaigns are needed to raise awareness among residents in the SOUTHERN, MISSION, and NORTHERN districts
## also, police officers should focus more on crimes like RECOVERED VEHICLE, VEHICLE THEFT, and LARCENY/THEFT, because these three categories have the highest 'unresolved' percentage
## the runtime for question 3 is relatively long, because the result (dataframe) we want to get is large

## overall, my suggestion would be to avoid visiting the city or walking around the streets at night or in the afternoon
## because the chances of being involved in a crime are higher than during other hours
## in addition, pay extra attention during holidays, because crime counts tend to be higher at these times (for example, see Q3 -> 01/01/2006 Sunday num_of_crimes)

### Optional part: Time series analysis
This part is not based on Spark; it only uses the pandas time series package.   
Note: I am not familiar with time series models, so please refer to the ARIMA model introduced by the other teacher.   
process:  
1. visualize the time series  
2. plot the ACF and find the optimal parameters  
3. train ARIMA  
4. prediction 
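
Step 2 relies on the autocorrelation function (ACF). As a rough sketch of what an ACF plot shows, the sample autocorrelation at lag k can be computed directly. The function name and the toy series are illustrative; in practice a library such as statsmodels would be used for ACF plots and ARIMA fitting.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation r_k = c_k / c_0 for k = 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    c0 = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        ck = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
        acf.append(ck / c0)
    return acf

# Toy series with a period of 4 (illustrative, not crime data).
toy = [1, 2, 3, 2] * 6
for lag, value in enumerate(sample_acf(toy, 4)):
    print(lag, round(value, 3))
```

A strong positive value at a lag equal to the period (here lag 4) is the kind of pattern that suggests seasonality when choosing model parameters.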

References:   
https://zhuanlan.zhihu.com/p/35282988  
https://zhuanlan.zhihu.com/p/35128342  
https://www.statsmodels.org/dev/examples/notebooks/generated/tsa_arma_0.html  
https://www.howtoing.com/a-guide-to-time-series-forecasting-with-arima-in-python-3  
https://www.joinquant.com/post/9576?tag=algorithm  
https://blog.csdn.net/u012052268/article/details/79452244