## Important
- remove links

## Tasks

- [x] pyspark with mapreduce
    - do these using pyspark
        - [ ] parallelize
        - [ ] filter nan values
        - [x] groupby column
- [ ] visualization improvements
- [ ] so the independent variable would be the thing you're testing for (i.e. number of libraries in the community) and the dependent variable would be the number of crimes in the community
- [ ] read from hdfs into a pyspark dataframe
- [ ] ratio - (community pop for a specific community / total community population for all communities the zip code is in)
- [ ] Find the total amount of x for the zip code. If there’s a zipcode that is shared by multiple communities, find the ratio based on the population of each community, and split the amount of x for the zip code based on this ratio.
- [ ] I need to first create a table of the number of public schools per zip code
    - [ ] then create a new table with number of public schools per community area
    - [ ] This will help - Team created csv - PopulationPerCommunity.csv
- [ ] So we need every factor corresponding to each community area, and the crime rate per community area to do the regression model
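
The population-ratio split described in the tasks above can be sketched in plain Python as follows. This is only an illustration of the arithmetic: `split_by_population` is a hypothetical helper name, and the zip codes, community labels, populations, and totals are made-up values, not figures from PopulationPerCommunity.csv.

```python
from collections import defaultdict


def split_by_population(zip_totals, zip_to_communities, community_pop):
    """Distribute a per-zip-code total across communities by population share.

    zip_totals:         {zip_code: amount of x in that zip code}
    zip_to_communities: {zip_code: [community, ...]} communities the zip code is in
    community_pop:      {community: population}
    """
    per_community = defaultdict(float)
    for zip_code, total in zip_totals.items():
        communities = zip_to_communities[zip_code]
        # Total population of all communities the zip code is in
        pop_sum = sum(community_pop[c] for c in communities)
        for c in communities:
            ratio = community_pop[c] / pop_sum
            per_community[c] += total * ratio
    return dict(per_community)


# Hypothetical illustration values (assumptions, not project data)
zip_totals = {"60601": 10, "60602": 4}
zip_to_communities = {"60601": ["A", "B"], "60602": ["B"]}
community_pop = {"A": 3000, "B": 1000}

result = split_by_population(zip_totals, zip_to_communities, community_pop)
print(result)
```

Because every zip code's total is split with ratios that sum to 1, the per-community values add up to the same overall total as the per-zip values.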
 


# Crime rate per zip code

In [None]:
!wget https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv

### **Running Pyspark in Colab**

To run Spark in Colab, we first need to install the dependencies in the Colab environment: Apache Spark 3.0.1 with Hadoop 2.7, Java 8, and findspark to locate Spark on the system. The installation can be carried out inside the Colab notebook itself. If you are new to Spark, it is better to avoid Spark 2.4.0, since some users have reported compatibility issues with Python.
Follow these steps to install the dependencies:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
!pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment paths so that PySpark can run in the Colab environment. Set the location of Java and Spark by running the following code:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

Run a local spark session to test your installation:

In [None]:
import findspark
findspark.init()

In [None]:
findspark.find()

'/content/spark-3.0.1-bin-hadoop2.7'

The entry point of a PySpark program is a SparkContext object, which connects to a Spark cluster and creates RDDs; the `SparkSession` built below wraps a SparkContext and is the entry point for DataFrames. The `local[*]` string is a special value denoting a local cluster, which is another way of saying you are running in single-machine mode. The `*` tells Spark to create as many worker threads as there are logical cores on your machine.

In [None]:
import pyspark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
findspark.init()
spark = SparkSession.builder.master("local[*]").getOrCreate()

Creating a SparkContext can be more involved when you’re using a cluster. To connect to a Spark cluster, you might need to handle authentication and a few other pieces of information specific to your cluster. You can set up those details similarly to the following:

In [None]:
# conf = pyspark.SparkConf()
# conf.setMaster('spark://head_node:56887')
# conf.set('spark.authenticate', True)
# conf.set('spark.authenticate.secret', 'secret-key')
# sc = pyspark.SparkContext(conf=conf)

In [None]:
from pyspark.sql.functions import to_timestamp,col,lit
rc = spark.read.csv('rows.csv',header=True).withColumn('Date',to_timestamp(col('Date'),'MM/dd/yyyy hh:mm:ss a'))
rc.show(50)

+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+--------+---------+--------+
|      ID|Case Number|               Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|Latitude|Longitude|Location|
+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+--------+---------+--------+
|11034701|   JA366925|2001-01-01 11:00:00|     016XX E 86TH PL|1153|  DECEPTIVE PRACTICE|FINANCIAL IDENTIT...|           RESIDENCE| false|   false|0412|     004|   8|            45|      

## Data Exploration using pyspark

In [None]:
rc.count(), len(rc.columns)

(7236960, 22)

In [None]:
rc.show(5)

+--------+-----------+-------------------+--------------------+----+-------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+--------+---------+--------+
|      ID|Case Number|               Date|               Block|IUCR|       Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|Latitude|Longitude|Location|
+--------+-----------+-------------------+--------------------+----+-------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+--------+---------+--------+
|11034701|   JA366925|2001-01-01 11:00:00|     016XX E 86TH PL|1153| DECEPTIVE PRACTICE|FINANCIAL IDENTIT...|           RESIDENCE| false|   false|0412|     004|   8|            45|      11| 

In [None]:
# Count the rows per Community Area (the next cell does the same for Primary Type, i.e. crime type)
rc.groupBy('Community Area').count().orderBy("count", ascending=False).show(20)

+--------------+------+
|Community Area| count|
+--------------+------+
|          null|613484|
|            25|419167|
|             8|229888|
|            43|217107|
|            23|209242|
|            28|196881|
|            24|194510|
|            29|193492|
|            67|193384|
|            71|188082|
|            49|176795|
|            68|175584|
|            69|164009|
|            66|163057|
|            32|161172|
|            44|144074|
|            22|138671|
|            61|135354|
|             6|132097|
|            26|125490|
+--------------+------+
only showing top 20 rows



In [None]:
rc.groupBy('Primary Type').count().orderBy("count", ascending=False).show(20)

+--------------------+-------+
|        Primary Type|  count|
+--------------------+-------+
|               THEFT|1526994|
|             BATTERY|1325767|
|     CRIMINAL DAMAGE| 824263|
|           NARCOTICS| 735018|
|             ASSAULT| 458272|
|       OTHER OFFENSE| 449036|
|            BURGLARY| 407222|
| MOTOR VEHICLE THEFT| 333281|
|  DECEPTIVE PRACTICE| 300548|
|             ROBBERY| 271991|
|   CRIMINAL TRESPASS| 204880|
|   WEAPONS VIOLATION|  85132|
|        PROSTITUTION|  69365|
|PUBLIC PEACE VIOL...|  50713|
|OFFENSE INVOLVING...|  50130|
| CRIM SEXUAL ASSAULT|  28059|
|         SEX OFFENSE|  27624|
|INTERFERENCE WITH...|  17451|
|            GAMBLING|  14594|
|LIQUOR LAW VIOLATION|  14450|
+--------------------+-------+
only showing top 20 rows



In [None]:
# Print the schema to check the data type of Arrest: it is a string
rc.printSchema() # so a filter must compare against the string 'true'; a boolean column would use True

root
 |-- ID: string (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: string (nullable = true)
 |-- Domestic: string (nullable = true)
 |-- Beat: string (nullable = true)
 |-- District: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- Community Area: string (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: string (nullable = true)
 |-- Y Coordinate: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- Location: string (nullable = true)



In [None]:
rc.select('Year').show(15)

+----+
|Year|
+----+
|2001|
|2017|
|2017|
|2017|
|2017|
|2015|
|2017|
|2017|
|2017|
|2017|
|2017|
|2012|
|2017|
|2017|
|2017|
+----+
only showing top 15 rows



In [None]:
rc.where(col("Location").isNotNull()).show()

+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|11665567|   JC234307|2019-04-10 16:37:00|  102XX S VERNON AVE|1562|         SEX OFFENSE|AGGRAVATED CRIMIN...|SCHOOL - PUBLIC B

In [None]:
# Count the records that led to an arrest and divide by the total
rc.filter(col('Arrest') == 'true').count() / rc.select('Arrest').count()

0.27183250978311335

In [None]:
tot = rc.count()

com_per = rc.groupBy("Community Area") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_community') \
  .withColumn('perc_of_count_total', (F.col('cnt_per_community') / tot) * 100 ) \
  .orderBy('Community Area') \
  .toPandas()

In [None]:
com_per

Unnamed: 0,Community Area,cnt_per_community,perc_of_count_total
0,,613484,8.477095
1,0,76,0.001050
2,1,101563,1.403393
3,10,28249,0.390343
4,11,26225,0.362376
...,...,...,...
74,75,52879,0.730680
75,76,39526,0.546169
76,77,65273,0.901939
77,8,229888,3.176582
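
Because `Community Area` was read as a string column, `orderBy('Community Area')` sorts the labels lexicographically, which is why `10` and `11` follow `1` and `8` comes last in the output above. A minimal sketch of the difference, using a handful of labels taken from that output (the `None` entry stands in for the null group). In Spark, one option would be to cast the column to an integer before ordering.

```python
labels = [None, "0", "1", "10", "11", "75", "76", "77", "8"]

# Lexicographic order of the non-null string labels (what the output above shows)
string_order = sorted(v for v in labels if v is not None)

# Numeric order: the null group first, then labels by integer value
numeric_order = sorted(
    labels,
    key=lambda v: (v is not None, int(v) if v is not None else -1),
)

print(string_order)
print(numeric_order)
```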


In [None]:
import plotly.graph_objs as go

https://plotly.com/python/v3/apache-spark/

https://community.plotly.com/t/hiding-labels-in-px-pie-chart-python/40168/3

### Interactive plotly charts
- you can interact with the legends and pie chart

In [None]:
fig = go.Figure(data=[go.Pie(labels=com_per['Community Area'], values=com_per['perc_of_count_total'])])
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']
fig.update_traces(textposition='inside', hoverinfo='label+percent',
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title_text='Crime rate per Community Area')
fig.show()

In [None]:
tot = rc.count()

crime_per = rc.groupBy("Primary Type") \
  .count() \
  .withColumnRenamed('count', 'cnt_per_type') \
  .withColumn('perc_of_count_total', (F.col('cnt_per_type') / tot) * 100 ) \
  .orderBy('Primary Type') \
  .toPandas()

In [None]:
fig = go.Figure(data=[go.Pie(labels=crime_per['Primary Type'], values=crime_per['perc_of_count_total'])])
colors = ['gold', 'mediumturquoise', 'darkorange', 'lightgreen']
fig.update_traces(textposition='inside', hoverinfo='label+percent',
                  marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(title_text='Crime types')
fig.show()

In [None]:
import plotly.graph_objects as go
import plotly.express as px
data = [go.Bar(x=crime_per['Primary Type'], y=crime_per['perc_of_count_total'],
               text='crime percentage',
               # hover text: each bar's own percentage, rounded to two decimals
               hovertext=crime_per['perc_of_count_total'].round(2).astype(str),
               marker={'color': crime_per['perc_of_count_total'], 'colorscale': 'YlOrRd', "showscale": True},
               )]
layout = go.Layout(xaxis=dict(type='category'))
fig = go.Figure(data=data, layout=layout)
fig.update_xaxes(
        tickangle = 90,
        title_text = "Crime Type")
fig.update_yaxes(
        title_text = "Crime rate")
fig.update_layout(title="Crime rate by Type", hovermode='x')
fig.show()

In [None]:
import plotly.graph_objects as go
import plotly.express as px
data = [go.Bar(x=com_per['Community Area'], y=com_per['perc_of_count_total'],
               text='crime percentage',
               # hover text: each bar's own percentage, rounded to two decimals
               hovertext=com_per['perc_of_count_total'].round(2).astype(str),
               marker={'color': com_per['perc_of_count_total'], 'colorscale': 'YlOrRd', "showscale": True},
               )]
layout = go.Layout(xaxis=dict(type='category'))
fig = go.Figure(data=data, layout=layout)
fig.update_xaxes(
        tickangle = 90,
        title_text = "Community Area",)
fig.update_yaxes(
        title_text = "Crime rate")
fig.update_layout(title="Crime rate by Community Area", hovermode='x')
fig.show()

### Choropleth map

https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6

In [None]:
!pip install geopandas



In [None]:
import folium
import geopandas

df = geopandas.read_file("Boundaries - Community Areas (current).geojson")
df

Unnamed: 0,community,area,shape_area,perimeter,area_num_1,area_numbe,comarea_id,comarea,shape_len,geometry
0,DOUGLAS,0,46004621.1581,0,35,35,0,0,31027.0545098,"MULTIPOLYGON (((-87.60914 41.84469, -87.60915 ..."
1,OAKLAND,0,16913961.0408,0,36,36,0,0,19565.5061533,"MULTIPOLYGON (((-87.59215 41.81693, -87.59231 ..."
2,FULLER PARK,0,19916704.8692,0,37,37,0,0,25339.0897503,"MULTIPOLYGON (((-87.62880 41.80189, -87.62879 ..."
3,GRAND BOULEVARD,0,48492503.1554,0,38,38,0,0,28196.8371573,"MULTIPOLYGON (((-87.60671 41.81681, -87.60670 ..."
4,KENWOOD,0,29071741.9283,0,39,39,0,0,23325.1679062,"MULTIPOLYGON (((-87.59215 41.81693, -87.59215 ..."
...,...,...,...,...,...,...,...,...,...,...
72,MOUNT GREENWOOD,0,75584290.0209,0,74,74,0,0,48665.1305392,"MULTIPOLYGON (((-87.69646 41.70714, -87.69644 ..."
73,MORGAN PARK,0,91877340.6988,0,75,75,0,0,46396.419362,"MULTIPOLYGON (((-87.64215 41.68508, -87.64249 ..."
74,OHARE,0,371835607.687,0,76,76,0,0,173625.98466,"MULTIPOLYGON (((-87.83658 41.98640, -87.83658 ..."
75,EDGEWATER,0,48449990.8397,0,77,77,0,0,31004.8309456,"MULTIPOLYGON (((-87.65456 41.99817, -87.65456 ..."


In [None]:
com_per_null_removed = com_per[com_per['Community Area'].notnull()]
com_per_null_removed

Unnamed: 0,Community Area,cnt_per_community,perc_of_count_total
1,0,76,0.001050
2,1,101563,1.403393
3,10,28249,0.390343
4,11,26225,0.362376
5,12,11984,0.165594
...,...,...,...
74,75,52879,0.730680
75,76,39526,0.546169
76,77,65273,0.901939
77,8,229888,3.176582


In [None]:
# map center and path to the community-area boundary file
Chicago_COORDINATES = (41.895140898, -87.624255632)
district_geo = r'Boundaries - Community Areas (current).geojson'
 
# create a choropleth map of the crime share per Chicago community area
map1 = folium.Map(location=Chicago_COORDINATES, zoom_start=11)
folium.Choropleth(geo_data = district_geo,
                data = com_per_null_removed,
                columns = ['Community Area', 'perc_of_count_total'],
                key_on = 'feature.properties.area_numbe',
                fill_color='YlOrRd',
                fill_opacity = 0.7, 
                line_opacity = 0.2,
                legend_name = 'Crime rate',
                highlight = True,
                overlay = True).add_to(map1)
map1
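
The choropleth joins the data to the GeoJSON through `key_on`, so it can help to check beforehand that the `Community Area` labels and the `area_numbe` values actually line up. For example, community area `0` appears in the crime counts and may have no counterpart in the boundary file. A small sketch of such a check follows; `unmatched_keys` is a hypothetical helper, and the GeoJSON fragment is a made-up stand-in that only reuses the property name from the boundary file.

```python
def unmatched_keys(data_keys, geojson, prop="area_numbe"):
    """Return (keys only in the data, keys only in the GeoJSON), both sorted."""
    geo_keys = {str(f["properties"][prop]) for f in geojson["features"]}
    data_keys = {str(k) for k in data_keys}
    return sorted(data_keys - geo_keys), sorted(geo_keys - data_keys)


# Tiny made-up GeoJSON fragment (assumption: same property name as the boundary file)
geojson = {
    "features": [
        {"properties": {"area_numbe": "35"}},
        {"properties": {"area_numbe": "36"}},
    ]
}

only_in_data, only_in_geo = unmatched_keys(["0", "35"], geojson)
print("only in data:", only_in_data)
print("only in GeoJSON:", only_in_geo)
```

With the real files, `data_keys` would be `com_per_null_removed['Community Area']` and `geojson` the parsed boundary file.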


## Additional Data Exploration Using pandas

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv('rows.csv')



In [None]:
data.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,412,4.0,8.0,45.0,11,,,2001,08/05/2017 03:50:08 PM,,,
1,11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,False,False,2222,22.0,21.0,73.0,2,,,2017,02/11/2018 03:57:41 PM,,,
2,11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,620,BURGLARY,UNLAWFUL ENTRY,OTHER,False,False,835,8.0,18.0,70.0,5,,,2017,02/11/2018 03:57:41 PM,,,
3,11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,810,THEFT,OVER $500,RESIDENCE,False,False,313,3.0,20.0,42.0,6,,,2017,02/11/2018 03:57:41 PM,,,
4,11227634,JB147599,08/26/2017 10:00:00 AM,001XX W RANDOLPH ST,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,HOTEL/MOTEL,False,False,122,1.0,42.0,32.0,2,,,2017,02/11/2018 03:57:41 PM,,,


In [None]:
data.tail()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
7236955,11700926,JC279725,05/26/2019 05:13:00 PM,036XX W DOUGLAS BLVD,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,APARTMENT,False,True,1011,10.0,24.0,29.0,26,1152126.0,1893208.0,2019,06/30/2019 03:56:27 PM,41.86283,-87.71704,"(41.862830429, -87.717040084)"
7236956,24560,JC279072,05/26/2019 06:48:00 AM,013XX W HASTINGS ST,110,HOMICIDE,FIRST DEGREE MURDER,CHA PARKING LOT,True,False,1233,12.0,25.0,28.0,01A,1167752.0,1893853.0,2019,06/20/2020 03:48:45 PM,41.864278,-87.65966,"(41.864278228, -87.659660218)"
7236957,11707734,JC287730,07/01/2014 07:30:00 AM,063XX S NORMAL BLVD,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,722,7.0,20.0,68.0,11,,,2014,06/02/2019 04:09:42 PM,,,
7236958,11707239,JC287563,11/30/2017 09:00:00 AM,022XX S KOSTNER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1013,10.0,22.0,29.0,11,,,2017,06/02/2019 04:09:42 PM,,,
7236959,24559,JC278908,05/26/2019 02:11:00 AM,013XX W HASTINGS ST,110,HOMICIDE,FIRST DEGREE MURDER,STREET,False,False,1233,12.0,25.0,28.0,01A,1167746.0,1893853.0,2019,06/20/2020 03:48:45 PM,41.864278,-87.659682,"(41.864278357, -87.659682244)"


In [None]:
data.columns

Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
       'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
       'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude',
       'Location'],
      dtype='object')

In [None]:
data['Primary Type'].unique()

array(['DECEPTIVE PRACTICE', 'CRIM SEXUAL ASSAULT', 'BURGLARY', 'THEFT',
       'OFFENSE INVOLVING CHILDREN', 'CRIMINAL DAMAGE', 'OTHER OFFENSE',
       'SEX OFFENSE', 'CRIMINAL SEXUAL ASSAULT', 'BATTERY', 'ASSAULT',
       'NARCOTICS', 'MOTOR VEHICLE THEFT', 'ROBBERY', 'CRIMINAL TRESPASS',
       'WEAPONS VIOLATION', 'OBSCENITY', 'LIQUOR LAW VIOLATION',
       'PROSTITUTION', 'NON-CRIMINAL', 'PUBLIC PEACE VIOLATION',
       'INTIMIDATION', 'ARSON', 'INTERFERENCE WITH PUBLIC OFFICER',
       'GAMBLING', 'STALKING', 'KIDNAPPING', 'OTHER NARCOTIC VIOLATION',
       'CONCEALED CARRY LICENSE VIOLATION', 'HOMICIDE', 'RITUALISM',
       'HUMAN TRAFFICKING', 'PUBLIC INDECENCY', 'NON - CRIMINAL',
       'NON-CRIMINAL (SUBJECT SPECIFIED)', 'DOMESTIC VIOLENCE'],
      dtype=object)

In [None]:
# percentage of crime types
data['Primary Type'].value_counts()/data.shape[0]*100

THEFT                                21.099937
BATTERY                              18.319391
CRIMINAL DAMAGE                      11.389630
NARCOTICS                            10.156447
ASSAULT                               6.332383
OTHER OFFENSE                         6.204760
BURGLARY                              5.626976
MOTOR VEHICLE THEFT                   4.605262
DECEPTIVE PRACTICE                    4.152959
ROBBERY                               3.758360
CRIMINAL TRESPASS                     2.831023
WEAPONS VIOLATION                     1.176350
PROSTITUTION                          0.958483
PUBLIC PEACE VIOLATION                0.700750
OFFENSE INVOLVING CHILDREN            0.692694
CRIM SEXUAL ASSAULT                   0.387718
SEX OFFENSE                           0.381707
INTERFERENCE WITH PUBLIC OFFICER      0.241137
GAMBLING                              0.201659
LIQUOR LAW VIOLATION                  0.199669
ARSON                                 0.167253
HOMICIDE     

In [None]:
data.describe()

Unnamed: 0,ID,Beat,District,Ward,Community Area,X Coordinate,Y Coordinate,Year,Latitude,Longitude
count,7236960.0,7236960.0,7236913.0,6622127.0,6623476.0,7144470.0,7144470.0,7236960.0,7144470.0,7144470.0
mean,6604332.0,1188.438,11.29448,22.71369,37.54913,1164547.0,1885724.0,2009.061,41.84202,-87.67169
std,3259084.0,702.9817,6.945797,13.83226,21.53549,17121.39,32645.64,5.566692,0.08983008,0.06197428
min,634.0,111.0,1.0,1.0,0.0,0.0,0.0,2001.0,36.61945,-91.68657
25%,3585881.0,622.0,6.0,10.0,23.0,1152947.0,1859115.0,2004.0,41.76879,-87.71379
50%,6594214.0,1111.0,10.0,22.0,32.0,1166041.0,1890652.0,2008.0,41.85574,-87.66603
75%,9433263.0,1731.0,17.0,34.0,57.0,1176360.0,1909224.0,2013.0,41.90669,-87.6283
max,12231880.0,2535.0,31.0,50.0,77.0,1205119.0,1951622.0,2020.0,42.02291,-87.52453


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7236960 entries, 0 to 7236959
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

In [None]:
data.isnull().any()

ID                      False
Case Number              True
Date                    False
Block                   False
IUCR                    False
Primary Type            False
Description             False
Location Description     True
Arrest                  False
Domestic                False
Beat                    False
District                 True
Ward                     True
Community Area           True
FBI Code                False
X Coordinate             True
Y Coordinate             True
Year                    False
Updated On              False
Latitude                 True
Longitude                True
Location                 True
dtype: bool

In [None]:
data.isnull().sum()

ID                           0
Case Number                  4
Date                         0
Block                        0
IUCR                         0
Primary Type                 0
Description                  0
Location Description      7614
Arrest                       0
Domestic                     0
Beat                         0
District                    47
Ward                    614833
Community Area          613484
FBI Code                     0
X Coordinate             92490
Y Coordinate             92490
Year                         0
Updated On                   0
Latitude                 92490
Longitude                92490
Location                 92490
dtype: int64

In [None]:
# percentage of null values in columns with null values
data[data.columns[data.isnull().any()]].isnull().sum() * 100 / data.shape[0]

Case Number             0.000055
Location Description    0.105210
District                0.000649
Ward                    8.495736
Community Area          8.477095
X Coordinate            1.278023
Y Coordinate            1.278023
Latitude                1.278023
Longitude               1.278023
Location                1.278023
dtype: float64

In [None]:
# percentage of all recorded crimes per community area
data['Community Area'].value_counts()/data.shape[0]*100

25.0    5.792031
8.0     3.176582
43.0    2.999975
23.0    2.891297
28.0    2.720493
          ...   
55.0    0.198232
12.0    0.165594
47.0    0.137157
9.0     0.088891
0.0     0.001050
Name: Community Area, Length: 78, dtype: float64

In [None]:
import tqdm
from tqdm.notebook import tqdm_notebook

In [None]:
# !pip install geopy
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="vishal", timeout=10)
# The public Nominatim service allows at most one request per second
rgeocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)

In [None]:
!pip install pygeocoder
from pygeocoder import Geocoder



In [None]:
# Geocoder.reverse_geocode(data2['Latitude'][100000], data2['Longitude'][100000])

In [None]:
data2['Latitude']

100000    41.750934
100001    42.009258
100002    41.880490
100003    41.706987
100004    41.723928
Name: Latitude, dtype: float64

In [None]:
data2 = data[100000:100005]
data2 = data2[data2.Location.notnull()].copy()  # copy so a new column can be added without a SettingWithCopyWarning
data2

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
100000,11737995,JC324957,06/27/2019 08:57:00 PM,0000X W 79TH ST,1812,NARCOTICS,POSS: CANNABIS MORE THAN 30GMS,PARKING LOT/GARAGE(NON.RESID.),True,False,623,6.0,17.0,44.0,18,1177172.0,1852624.0,2019,07/04/2019 04:09:46 PM,41.750934,-87.626325,"(41.750934055, -87.626325022)"
100001,11739119,JC326051,06/27/2019 06:00:00 PM,070XX N KEDZIE AVE,890,THEFT,FROM BUILDING,APARTMENT,False,False,2411,24.0,50.0,2.0,06,1153867.0,1946582.0,2019,07/04/2019 04:09:46 PM,42.009258,-87.709223,"(42.009258281, -87.709222657)"
100002,11738105,JC325021,06/27/2019 10:10:00 PM,048XX W MADISON ST,460,BATTERY,SIMPLE,GAS STATION,True,False,1533,15.0,28.0,25.0,08B,1144213.0,1899588.0,2019,07/04/2019 04:09:46 PM,41.88049,-87.745928,"(41.880490151, -87.745927891)"
100003,11737879,JC324642,06/27/2019 05:15:00 PM,103XX S MICHIGAN AVE,460,BATTERY,SIMPLE,DRUG STORE,True,False,512,5.0,9.0,49.0,08B,1178964.0,1836624.0,2019,07/04/2019 04:09:46 PM,41.706987,-87.620244,"(41.706987456, -87.620244029)"
100004,11738599,JC325531,06/27/2019 07:45:00 PM,093XX S GREEN ST,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE-GARAGE,False,False,2223,22.0,21.0,73.0,14,1172293.0,1842743.0,2019,07/04/2019 04:09:46 PM,41.723928,-87.644494,"(41.72392788, -87.644493825)"


In [None]:
tqdm_notebook.pandas()
data2['coords'] = data2['Location'].progress_apply(rgeocode)
data2.head()

HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))




Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,coords
100000,11737995,JC324957,06/27/2019 08:57:00 PM,0000X W 79TH ST,1812,NARCOTICS,POSS: CANNABIS MORE THAN 30GMS,PARKING LOT/GARAGE(NON.RESID.),True,False,623,6.0,17.0,44.0,18,1177172.0,1852624.0,2019,07/04/2019 04:09:46 PM,41.750934,-87.626325,"(41.750934055, -87.626325022)","(Falcon, 33, West 79th Street, Chatham, Chicag..."
100001,11739119,JC326051,06/27/2019 06:00:00 PM,070XX N KEDZIE AVE,890,THEFT,FROM BUILDING,APARTMENT,False,False,2411,24.0,50.0,2.0,06,1153867.0,1946582.0,2019,07/04/2019 04:09:46 PM,42.009258,-87.709223,"(42.009258281, -87.709222657)","(7033-7053, North Kedzie Avenue, West Ridge, L..."
100002,11738105,JC325021,06/27/2019 10:10:00 PM,048XX W MADISON ST,460,BATTERY,SIMPLE,GAS STATION,True,False,1533,15.0,28.0,25.0,08B,1144213.0,1899588.0,2019,07/04/2019 04:09:46 PM,41.88049,-87.745928,"(41.880490151, -87.745927891)","(4812, West Madison Street, Austin, Chicago, C..."
100003,11737879,JC324642,06/27/2019 05:15:00 PM,103XX S MICHIGAN AVE,460,BATTERY,SIMPLE,DRUG STORE,True,False,512,5.0,9.0,49.0,08B,1178964.0,1836624.0,2019,07/04/2019 04:09:46 PM,41.706987,-87.620244,"(41.706987456, -87.620244029)","(White Castle, 10301, South Michigan Avenue, H..."
100004,11738599,JC325531,06/27/2019 07:45:00 PM,093XX S GREEN ST,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE-GARAGE,False,False,2223,22.0,21.0,73.0,14,1172293.0,1842743.0,2019,07/04/2019 04:09:46 PM,41.723928,-87.644494,"(41.72392788, -87.644493825)","(9331, South Green Street, Washington Heights,..."


In [None]:
data2.coords[100001]

Location(7033-7053, North Kedzie Avenue, West Ridge, Lincolnwood, Niles Township, Cook County, Illinois, 60645, United States of America, (42.00906205, -87.70871007953437, 0.0))

In [None]:
def str_to_loc(loc_str):
    """Parse a '(lat, lon)' string into a (lat, lon) tuple of floats."""
    la, lo = loc_str.split(',')
    la = float(la.split('(')[1])
    lo = float(lo.split(')')[0])
    return (la, lo)

In [None]:
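def loc_to_zip(loc_str):
    """Reverse-geocode a '(lat, lon)' string and return {loc_str: zip code}."""
    loc = str_to_loc(loc_str)
    location = geolocator.reverse(loc)
    # Raises KeyError if the geocoder result has no postcode
    return {loc_str:int(location.raw['address']['postcode'])}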
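
`loc_to_zip` assumes that every reverse-geocoding result carries a plain five-digit `postcode`. A hedged sketch of a more defensive extraction, operating on a dictionary shaped like `location.raw`. `extract_zip` is a hypothetical helper, and the sample dictionaries in the usage lines are made up to mimic that shape rather than real geocoder responses.

```python
import re


def extract_zip(raw):
    """Return the 5-digit ZIP code from a geocoder result dict, or None.

    `raw` is expected to look like `location.raw`: a dict with an optional
    'address' dict that may contain a 'postcode' string.
    """
    postcode = (raw or {}).get("address", {}).get("postcode")
    if not postcode:
        return None
    # Accept plain ZIP codes ("60608") and ZIP+4 ("60608-1234")
    match = re.match(r"\s*(\d{5})(?:-\d{4})?\s*$", postcode)
    return int(match.group(1)) if match else None


# Made-up inputs that mimic the shape of location.raw
print(extract_zip({"address": {"postcode": "60608"}}))
print(extract_zip({"address": {"postcode": "60645-1234"}}))
print(extract_zip({"address": {}}))
```

Returning `None` instead of raising lets a batch run continue and leaves the missing zip codes to be handled afterwards.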

In [None]:
str_to_loc(unique_locations[-1])

In [None]:
data2 = data[0:100]

In [None]:
loc_to_zip(data.iloc[-1]['Location'])

{'(41.864278357, -87.659682244)': 60608}

In [None]:
unique_locations = tuple(data['Location'].value_counts().keys())

In [None]:
unique_locations[1:4]

('(41.754592961, -87.741528537)',
 '(41.883500187, -87.627876698)',
 '(41.897895128, -87.624096605)')

In [None]:
%time list(map(loc_to_zip, unique_locations[0:10]))

CPU times: user 30.6 ms, sys: 9.97 ms, total: 40.6 ms
Wall time: 9.21 s


[{'(41.976290414, -87.905227221)': 60666},
 {'(41.754592961, -87.741528537)': 60456},
 {'(41.883500187, -87.627876698)': 60602},
 {'(41.897895128, -87.624096605)': 60611},
 {'(41.896888586, -87.628203192)': 60611},
 {'(41.909664252, -87.742728815)': 60302},
 {'(41.885487535, -87.726422045)': 60624},
 {'(41.788987036, -87.74147999)': 60638},
 {'(41.88233367, -87.627841791)': 60602},
 {'(41.904192368, -87.647000785)': 60610}]
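
The timing above works out to roughly 0.9 s of wall time per lookup, and the result is a list of single-entry dicts. A possible next step, sketched below, is to collect the results in one dict and skip any location string that has already been resolved. `build_zip_lookup` and `fake_lookup` are hypothetical names; the stand-in lookup only returns values copied from the output above so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, Optional


def build_zip_lookup(
    locations: Iterable[str],
    lookup: Callable[[str], Optional[int]],
    cache: Optional[Dict[str, Optional[int]]] = None,
) -> Dict[str, Optional[int]]:
    """Map each location string to a ZIP code, calling `lookup` once per new string."""
    cache = {} if cache is None else cache
    for loc_str in locations:
        if loc_str in cache:
            continue  # already resolved, so skip the slow call
        try:
            cache[loc_str] = lookup(loc_str)
        except Exception:
            # Record the failure (for example a missing 'postcode' key) and keep going.
            cache[loc_str] = None
    return cache


# Stand-in for the real geocoder (assumption: used here only to run offline).
_known = {
    "(41.883500187, -87.627876698)": 60602,
    "(41.897895128, -87.624096605)": 60611,
}
calls = []


def fake_lookup(loc_str: str) -> int:
    calls.append(loc_str)
    return _known[loc_str]


zip_by_location = build_zip_lookup(
    [
        "(41.883500187, -87.627876698)",
        "(41.897895128, -87.624096605)",
        "(41.883500187, -87.627876698)",  # repeated: served from the cache
        "(0.0, 0.0)",                     # unknown to the stand-in: stored as None
    ],
    fake_lookup,
)
```

With the notebook's own function, the `lookup` argument could be `lambda s: loc_to_zip(s)[s]`, and the returned dict could be passed back in as `cache` to resume an interrupted run.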

In [None]:
unique_locations[0]

'(41.976290414, -87.905227221)'

In [None]:
data.iloc[-1]['Location']

'(41.864278357, -87.659682244)'

In [None]:
loc_to_zip(unique_locations[-1])

{'(41.881555171, -87.626197878)': 60603}

In [None]:
!pip install pgeocode

Collecting pgeocode
  Downloading https://files.pythonhosted.org/packages/35/88/688b8550ceeed59a4b0be3f54d2ad8075ce708b5d5c0b0ec1a3abe58d4cb/pgeocode-0.3.0-py3-none-any.whl
Installing collected packages: pgeocode
Successfully installed pgeocode-0.3.0


In [None]:
!pip install reverse_geocoder

Collecting reverse_geocoder
  Downloading https://files.pythonhosted.org/packages/0b/0f/b7d5d4b36553731f11983e19e1813a1059ad0732c5162c01b3220c927d31/reverse_geocoder-1.5.1.tar.gz (2.2MB)
Building wheels for collected packages: reverse-geocoder
  Building wheel for reverse-geocoder (setup.py) ... done
  Created wheel for reverse-geocoder: filename=reverse_geocoder-1.5.1-cp36-none-any.whl size=2268090 sha256=8adc838239ab406935d522af0d866dbe32d11bf3853b2f956ba446a1ae4dc1d2
  Stored in directory: /root/.cache/pip/wheels/47/05/50/b1350ff094ef91e082665b4a2f9ca551f8acea4aa55d796b26
Successfully built reverse-geocoder
Installing collected packages: reverse-geocoder
Successfully installed reverse-geocoder-1.5.1


In [None]:
!pip install uszipcode

Collecting uszipcode
  Downloading https://files.pythonhosted.org/packages/bc/94/1b908c6fe2008f0e913b0b2d97951aa76e00ec1044883c012afb2e477b4a/uszipcode-0.2.4-py2.py3-none-any.whl (378kB)
Collecting pathlib-mate
  Downloading https://files.pythonhosted.org/packages/ee/90/b414af97dea2b4f98b0cebaa69ec02eacca82e6b1ba18632c5927f01591a/pathlib_mate-1.0.0-py2.py3-none-any.whl (77kB)
Collecting autopep8
  Downloading https://files.pythonhosted.org/packages/94/37/19bc53fd63fc1caaa15ddb695e32a5d6f6463b3de6b0922ba2a3cbb798c8/autopep8-1.5.4.tar.gz (121kB)
Collecting pycodestyle>=2.6.0
  Downloading https://files.pythonhosted.org/packages/10/5b/88879fb861ab79aef45c7e199cae3ef7af487b5603dcb363517a50602dd7/pycodestyle-2.6.0-py2.py3-none-any.whl (41kB)

In [None]:
from uszipcode import SearchEngine, Zipcode

search = SearchEngine(simple_zipcode=True)  # set simple_zipcode=False to use the rich info database
result = search.by_coordinates(41.976290414, -87.905227221, radius=20)
len(result)  # 5 results are returned by default

5
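
To get a feel for what `radius=20` covers, the sketch below computes the great-circle distance between the query point and another coordinate taken from the geocoding output earlier in the notebook. It assumes the radius is measured in miles, as the uszipcode documentation describes. It does not reproduce the library's internals, and `haversine_miles` is an illustrative helper name.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


query = (41.976290414, -87.905227221)   # point passed to by_coordinates
other = (41.883500187, -87.627876698)   # location geocoded to 60602 earlier
distance = haversine_miles(*query, *other)
within_radius = distance <= 20
print(f"{distance:.1f} miles, within radius: {within_radius}")
```

The two points are a little under 16 miles apart, so a 20-mile radius around the first one reaches well into the downtown ZIP codes.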

In [None]:
data.head(100)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11034701,JA366925,01/01/2001 11:00:00 AM,016XX E 86TH PL,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,412,4.0,8.0,45.0,11,,,2001,08/05/2017 03:50:08 PM,,,
1,11227287,JB147188,10/08/2017 03:00:00 AM,092XX S RACINE AVE,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,RESIDENCE,False,False,2222,22.0,21.0,73.0,02,,,2017,02/11/2018 03:57:41 PM,,,
2,11227583,JB147595,03/28/2017 02:00:00 PM,026XX W 79TH ST,0620,BURGLARY,UNLAWFUL ENTRY,OTHER,False,False,835,8.0,18.0,70.0,05,,,2017,02/11/2018 03:57:41 PM,,,
3,11227293,JB147230,09/09/2017 08:17:00 PM,060XX S EBERHART AVE,0810,THEFT,OVER $500,RESIDENCE,False,False,313,3.0,20.0,42.0,06,,,2017,02/11/2018 03:57:41 PM,,,
4,11227634,JB147599,08/26/2017 10:00:00 AM,001XX W RANDOLPH ST,0281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,HOTEL/MOTEL,False,False,122,1.0,42.0,32.0,02,,,2017,02/11/2018 03:57:41 PM,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,11243493,JB169102,08/04/2017 03:00:00 PM,003XX E RANDOLPH ST,0810,THEFT,OVER $500,PARK PROPERTY,False,False,114,1.0,42.0,32.0,06,,,2017,03/01/2018 03:54:55 PM,,,
96,11243514,JB168799,11/12/2014 12:00:00 AM,107XX S PEORIA ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,2233,22.0,34.0,75.0,11,,,2014,03/01/2018 03:54:55 PM,,,
97,11037536,JA371293,07/31/2017 08:05:00 AM,004XX N MC CLURG CT,1310,CRIMINAL DAMAGE,TO PROPERTY,BOAT/WATERCRAFT,False,False,1834,18.0,42.0,8.0,14,,,2017,08/07/2017 03:52:24 PM,,,
98,11243040,JB167177,12/12/2017 12:00:00 PM,033XX W 47TH ST,0890,THEFT,FROM BUILDING,WAREHOUSE,False,False,821,8.0,14.0,58.0,06,,,2017,03/01/2018 03:54:55 PM,,,


In [None]:
!pip install -U googlemaps

Collecting googlemaps
  Downloading https://files.pythonhosted.org/packages/00/fa/508909813a3f0ff969d341695ee0b90cb0e954b4b536f17f15cc19b5c304/googlemaps-4.4.2.tar.gz
Building wheels for collected packages: googlemaps
  Building wheel for googlemaps (setup.py) ... done
  Created wheel for googlemaps: filename=googlemaps-4.4.2-cp36-none-any.whl size=37858 sha256=c5c17890145b8daaac2365ae50db69f45946bdc14e95854403864b2f58d6407c
  Stored in directory: /root/.cache/pip/wheels/f4/21/41/0c84572e21d52bb322f6c299f38ac7cd8ad6d4d6ce23dc3631
Successfully built googlemaps
Installing collected packages: googlemaps
Successfully installed googlemaps-4.4.2


In [None]:
from pygeocoder import Geocoder

ModuleNotFoundError: ignored

https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/

# abc