## Interactive Crime Map Setup Code by Abu Nayeem
### Table of Contents

* Introduction
* Data Setup
    



### Intro   <a name="Intro"></a>

This notebook documents the steps needed to prepare interactive crime data for Saint Paul, and how to repeat those steps for other areas or additional data. I also explain how I use my proxy algorithm, which accounts for about 90% of the workload. One caveat: unlike the R environment, a Python workspace becomes hard to manage as the number of dataframes in it grows.

The [Crime Incident Report - Dataset](https://information.stpaul.gov/Public-Safety/Crime-Incident-Report-Dataset/gppb-g9cg) was obtained from the Saint Paul Website. It is publicly available. The report contains incidents from Aug 14 2014 through the most recent date, as released by the Saint Paul Police Department.

### Proxy Algorithm

**Challenge:**
* How can we find the geo-coordinates of a masked column?

**Values of Column:**
* '45X University Ave', i.e. a masked address
* 'Victoria Street and Avon Avenue', i.e. an intersection

**Strategy:** The algorithm treats the two cases separately: I split the data into intersections and masked addresses, process each, and then combine them again. (More details as we code!)
* PreCoding: For the __Intersection Key__, we find the geocoordinates of all potential intersections of interest and save them as a key table
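As a rough illustration of the split described above, the sketch below sorts raw `Block` strings into the two cases. The helper name `classify_block` is hypothetical. It assumes an `&` marks an intersection, which is the rule this notebook applies to the real data further down.

```python
def classify_block(block):
    """Label a raw Block value as 'intersection' or 'masked_address'."""
    text = str(block).lower()
    # Assumption: intersections are written with '&' in the raw data.
    return "intersection" if "&" in text else "masked_address"


samples = ["45X UNIVERSITY AV W", "DALE ST N & EDMUND"]
labels = {s: classify_block(s) for s in samples}
print(labels)
```

Each case then gets its own lookup: intersections go through the hand-built key table, and masked addresses go through a geocoder.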



#### Hardcoding Intersection Key <a name="Intro"></a>

0) Create a data sheet (e.g. Excel or Google Sheets [preferred])
    * Set up four columns: IntersectionID (used as index), IntersectionName1, IntersectionName2, and the combined geocoordinates (latitude and longitude in one cell)
    * A grid column is not included because that data can be messy, and it is not clear which grid a boundary intersection belongs to
    * Do not worry about lowercase and uppercase
    * The actual location data is not consistent in the order it names an intersection, so I have written post-processing code so you don't need to enter each pair a second time
    * Avoid double entries when entering the data! Don't worry, we will perform some debugging and error checks
    
1) List all possible intersections of interest. Use the [police grid boundaries] when selecting intersections of interest.
    * To address the boundary problem, I've included the police grids neighboring Frogtown to ensure all relevant points are mapped. However, the boundary problem still exists at the outer boundaries.

2) Strategy for setup: Urban areas are often organized in a grid, so intersections follow a pattern where two avenues have almost the same intersection pairs. You can copy and paste some of the columns, etc. See picture below
    * Name the intersections in the order in which you actually scroll down the map while entering data. (Saves a lot of time in data entry)
    * Go to Google Maps in a web browser and point to an intersection until some geocoordinates come up, then click the geocoordinates hyperlink. From that window, copy and paste the geocoordinates. You could enter the latitude and longitude manually as separate values, but that makes the process much more tedious. The post-processing code handles the combined value readily

3) Since the intersection key is a static document, I recommend exporting it as CSV and loading it from your machine
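One of the error checks mentioned in step 0 could look like the sketch below, which flags intersections entered twice regardless of name order. The three-row CSV text is a made-up stand-in for the exported sheet, and the `pair` column is a hypothetical helper. The other column names follow the key file loaded later in this notebook.

```python
import io

import pandas as pd

# Tiny stand-in for the exported key sheet; the real file has many more rows.
csv_text = """IntersectionID,Intersection1,Intersection2,Coordinates
1,Lexington,Front,"44.970295, -93.146572"
2,lexington,stinson,"44.969316, -93.146529"
3,front,lexington,"44.970295, -93.146572"
"""
key = pd.read_csv(io.StringIO(csv_text))

# Normalize case and spaces, then build an order-independent label so that
# 'front/lexington' and 'lexington/front' compare equal.
names = key[["Intersection1", "Intersection2"]].apply(
    lambda col: col.str.lower().str.replace(" ", "", regex=False)
)
key["pair"] = ["_".join(sorted(p)) for p in names.values.tolist()]

# Rows whose pair already appeared earlier are double entries.
duplicates = key[key.duplicated("pair", keep="first")]
print(duplicates[["IntersectionID", "pair"]])
```

Here row 3 is reported because it repeats row 1 with the names swapped.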

#### Data Setup

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline 
import plotly
from pygeocoder import Geocoder #GeoCoding Algorithm
import folium
from IPython.display import HTML
from IPython.display import display
import json # library to handle JSON files
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import since pandas 1.0)

df_crime = pd.read_csv('Datasets/Crime_Incident_Report_-_Dataset.csv')
#print(df_crime.head())

#rename columns
cols= ['Case','Date','Time','Code','IncType','Incident','Grid','NNum','Neighborhood','Block','CallDispCode','CallDisposition', 'Count']
df_crime.columns= cols

#selection for Frogtown and nearby area
df=df_crime.query('Grid in [66,67, 68, 86, 87,88,89, 90, 91, 92,106,107,108,109,110,]')
print(df.shape)
print(df.dtypes)
df.head(4)



(26802, 13)
Case                 int64
Date                object
Time                object
Code                 int64
IncType             object
Incident            object
Grid               float64
NNum                 int64
Neighborhood        object
Block               object
CallDispCode        object
CallDisposition     object
Count                int64
dtype: object


Unnamed: 0,Case,Date,Time,Code,IncType,Incident,Grid,NNum,Neighborhood,Block,CallDispCode,CallDisposition,Count
17,19088980,04/30/2019,2019-04-30T22:48:00.000,2619,"Weapons, Discharging a Firearm in the City Limits",Discharge,90.0,7,7 - Thomas/Dale(Frogtown),19X SHERBURNE AV,RR,Report Written,1
24,19088940,04/30/2019,2019-04-30T22:06:00.000,9954,Proactive Police Visit,Proactive Police Visit,109.0,8,8 - Summit/University,43X DALE ST N,A,Advised,1
42,19088880,04/30/2019,2019-04-30T20:30:00.000,710,"Motor Vehicle Theft, Automobile",Auto Theft,89.0,7,7 - Thomas/Dale(Frogtown),59X THOMAS AV,RR,Report Written,1
52,19089126,04/30/2019,2019-04-30T20:00:00.000,693,"Theft, All Other, Over $1000",Theft,110.0,8,8 - Summit/University,24X AURORA AV,RR,Report Written,1


### Create New Variables

In [2]:
#Add Time Variables
df= df[df.Case != 18254093] #messed up time variable

#Convert Date to Datetime!
from datetime import datetime

df['DateTime']= pd.to_datetime(df['Date']) # Create new column called DateTime
df['Year']= df['DateTime'].dt.year #create year column
df['DayofWeek']=df['DateTime'].dt.dayofweek #create day of the week column where default 0=Monday
df['Weekend'] = df['DayofWeek'].apply(lambda x: 1 if (x>4)  else 0) #Create a weekend category
df['Month'] = df['DateTime'].dt.month # Create Month Category
df['Day'] = df['DateTime'].dt.day #Create Day of the Current month
df['DayYear'] = df['DateTime'].dt.dayofyear  #Create Day of the year (1-366)
df['Day_Max'] = df.iloc[0,-1] #day of year of the most recent record (first row, last column = DayYear); NOTE: relies on the data being sorted with the newest record first

#Hour Data
df['TimeHour']= pd.to_datetime(df['Time'])
df['Hour'] = df['TimeHour'].dt.hour.astype(int) #Create Hour Column
df['LateNight'] = df['Hour'].apply(lambda x: 1 if (x>21 or x<5)  else 0) #Late-night designation: hours 22 through 4, i.e. 10 PM to 4:59 AM

#Creating the intersection Column. Note: the Block column has the address information
df.Block = df.Block.astype(str) #first change the type to string
df['Block']= df['Block'].str.lower() #lowercase string to create uniformity

#While scanning the data I noticed that all intersections had "&" 
df['Intersection'] = df['Block'].apply(lambda x: 1 if '&' in x else 0) #intersection

df.head(5)

Unnamed: 0,Case,Date,Time,Code,IncType,Incident,Grid,NNum,Neighborhood,Block,...,DayofWeek,Weekend,Month,Day,DayYear,Day_Max,TimeHour,Hour,LateNight,Intersection
17,19088980,04/30/2019,2019-04-30T22:48:00.000,2619,"Weapons, Discharging a Firearm in the City Limits",Discharge,90.0,7,7 - Thomas/Dale(Frogtown),19x sherburne av,...,1,0,4,30,120,120,2019-04-30 22:48:00,22,1,0
24,19088940,04/30/2019,2019-04-30T22:06:00.000,9954,Proactive Police Visit,Proactive Police Visit,109.0,8,8 - Summit/University,43x dale st n,...,1,0,4,30,120,120,2019-04-30 22:06:00,22,1,0
42,19088880,04/30/2019,2019-04-30T20:30:00.000,710,"Motor Vehicle Theft, Automobile",Auto Theft,89.0,7,7 - Thomas/Dale(Frogtown),59x thomas av,...,1,0,4,30,120,120,2019-04-30 20:30:00,20,0,0
52,19089126,04/30/2019,2019-04-30T20:00:00.000,693,"Theft, All Other, Over $1000",Theft,110.0,8,8 - Summit/University,24x aurora av,...,1,0,4,30,120,120,2019-04-30 20:00:00,20,0,0
66,19088803,04/30/2019,2019-04-30T18:40:32.000,9954,Proactive Police Visit,Proactive Police Visit,90.0,7,7 - Thomas/Dale(Frogtown),17x charles av,...,1,0,4,30,120,120,2019-04-30 18:40:32,18,0,0


# Setup Intersection Key

In this step, I load the intersection key file and prepare it for joining with the primary dataset. The key is the name of the intersection, in the format 'name1_name2'. However, 'name2_name1' is equally valid and must map to the same coordinates. The final dataframe has an IndexKey column used to join with the primary dataset, plus an OutputKey column holding one canonical name per intersection.

**Note:** I discovered the bugs of the code during the data exploration phase.

In [3]:
#Setting up the Coordinate Key

#Prep the appropriate key (note this process can be done in Excel as well)
df_key= pd.read_csv('Datasets/Frog_key - Sheet1.csv')

#convert to lowercase
df_key['Intersection1']= df_key['Intersection1'].str.lower()
df_key['Intersection2']= df_key['Intersection2'].str.lower()
#remove empty space; found out when debugging!
df_key['Intersection1']= df_key['Intersection1'].str.replace(' ', '', regex=True)
df_key['Intersection2']= df_key['Intersection2'].str.replace(' ', '', regex=True)

df_key.head(2)


Unnamed: 0,IntersectionID,Intersection1,Intersection2,Coordinates
0,1,lexington,front,"44.970295, -93.146572"
1,2,lexington,stinson,"44.969316, -93.146529"


In [4]:
#Create a dataframe and new columns on potential mapping
A=df_key[['Intersection1','Intersection2', 'Coordinates']].copy() #copy so the new columns are not written to a slice of df_key
A['Int1_2']= A['Intersection1']+ '_' + A['Intersection2'] #int1_int2
A['Int2_1']= A['Intersection2']+ '_' + A['Intersection1'] #int2_int1
A['OutputKey']= A['Int1_2'] #create an output key based on one ordering of the intersection pair
A.head(2)

#A.query('Int=="marshall_victoria"')
#A.query('Intersection1=="marshall"')
#Intersection_key.query('IndexKey=="marshall_victoria"')


Unnamed: 0,Intersection1,Intersection2,Coordinates,Int1_2,Int2_1,OutputKey
0,lexington,front,"44.970295, -93.146572",lexington_front,front_lexington,lexington_front
1,lexington,stinson,"44.969316, -93.146529",lexington_stinson,stinson_lexington,lexington_stinson


In [5]:
# Take a subset of data and rename the Int columns to IndexKey
H1=A[['Int1_2','Coordinates','OutputKey']]
H1.columns= ['IndexKey','Coordinates','OutputKey']
H2=A[['Int2_1','Coordinates','OutputKey']]
H2.columns= ['IndexKey','Coordinates','OutputKey']

#Finally, stack the two tables (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
Intersection_key=pd.concat([H1, H2], ignore_index=True)
Intersection_key.tail(2)


Unnamed: 0,IndexKey,Coordinates,OutputKey
816,pennsylvania_rice,"44.960366, -93.105941",rice_pennsylvania
817,sherburne_cedar,"44.956855, -93.100879",cedar_sherburne


### Setup Intersection DataTable

We have prepared the key, but we still need to prepare the data table so it can be matched on IndexKey. This requires several string splits.

In [6]:
# Create a new dataframe containing only intersections
dfI=df.query('Intersection ==1')
print('The intersection table dimension are ' + str(dfI.shape))
print(dfI.Block.head(10))

The intersection table dimension are (4948, 25)
182              dale st n & edmund
257            milton st n & thomas
258           thomas av  & stalbans
266             energy la  & norris
315        arundel st  & university
368         hubbard av  & syndicate
416             thomas av  & milton
467             thomas av  & milton
539    chatsworth st n & university
601            dale st n & marshall
Name: Block, dtype: object


There is a clear pattern above.

**Strategy**
1) Split the string in two at the first ' '; the first part is the first street name
2) Split the string in two at '& '; the second part is the second street name
3) Note: The street type (avenue, street) and direction do not matter for our purposes; it is unlikely that a same-named street and avenue would share the same cross street
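The two splits can be tried on a pair of values before running them on the full table. This is a minimal sketch on a hand-typed `pd.Series` copied from the printout above. The variable names are illustrative only.

```python
import pandas as pd

blocks = pd.Series(["dale st n & edmund", "thomas av  & stalbans"])

# Step 1: everything before the first space is the first street name.
inter1 = blocks.str.split(" ", n=1, expand=True)[0]
# Step 2: everything after '& ' is the second street name.
inter2 = blocks.str.split("& ", n=1, expand=True)[1]

index_key = inter1 + "_" + inter2
print(index_key.tolist())
```

The first split only works because the data writes multi-word street names as a single token (for example `piercebutler`), so the text before the first space is the whole name.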

In [7]:
#Split the strings
new=dfI['Block'].str.split("& ", n = 1, expand = True) 
dfI['Inter2']= new[1]
new=dfI['Block'].str.split(" ", n = 1, expand = True) #n=1 splits only at the first space
dfI['Inter1']=new[0]

#Create the IndexKey; recall we prepared Intersection_key so that either name order matches
dfI['IndexKey']= dfI['Inter1']+ '_' + dfI['Inter2']
dfI=dfI.reset_index(drop=True) #assign the result; reset_index() on its own returns a copy and changes nothing
dfI=pd.merge(dfI, Intersection_key, on='IndexKey', how='left')
dfI.head(3)



Unnamed: 0,Case,Date,Time,Code,IncType,Incident,Grid,NNum,Neighborhood,Block,...,Day_Max,TimeHour,Hour,LateNight,Intersection,Inter2,Inter1,IndexKey,Coordinates,OutputKey
0,19088395,04/30/2019,2019-04-30T08:00:00.000,861,"Assault, Domestic, Opposite Sex",Simple Asasult Dom.,89.0,7,7 - Thomas/Dale(Frogtown),dale st n & edmund,...,120,2019-04-30 08:00:00,8,0,1,edmund,dale,dale_edmund,"44.958439, -93.126376",dale_edmund
1,19088075,04/29/2019,2019-04-29T21:40:24.000,9954,Proactive Police Visit,Proactive Police Visit,87.0,7,7 - Thomas/Dale(Frogtown),milton st n & thomas,...,120,2019-04-29 21:40:24,21,0,1,thomas,milton,milton_thomas,"44.959361, -93.139031",milton_thomas
2,19088071,04/29/2019,2019-04-29T21:39:13.000,9954,Proactive Police Visit,Proactive Police Visit,88.0,7,7 - Thomas/Dale(Frogtown),thomas av & stalbans,...,120,2019-04-29 21:39:13,21,0,1,stalbans,thomas,thomas_stalbans,"44.959350, -93.128908",stalbans_thomas


### Algorithm Sanity Check

Check which intersections did not match, and how many incidents each one accounts for.
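One way to run this check is a left merge with pandas' `indicator=True` option followed by `value_counts()`. This is a sketch on toy data, not the notebook's exact cell. The `unknown_street` key is invented to stand for an intersection missing from the key table.

```python
import pandas as pd

# Toy stand-ins for the incident table and the intersection key.
incidents = pd.DataFrame(
    {"IndexKey": ["dale_edmund", "dale_unknown_street", "dale_unknown_street", "milton_thomas"]}
)
key = pd.DataFrame(
    {
        "IndexKey": ["dale_edmund", "milton_thomas"],
        "Coordinates": ["44.958439, -93.126376", "44.959361, -93.139031"],
    }
)

merged = incidents.merge(key, on="IndexKey", how="left", indicator=True)

# 'left_only' rows found no partner in the key table.
unmatched = merged.loc[merged["_merge"] == "left_only", "IndexKey"].value_counts()
print(unmatched)
```

Sorting the misses by count shows which intersections are worth adding to the key sheet first.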

In [8]:
#find null subset
B= dfI[dfI['Coordinates'].isnull()]
C=B[['Neighborhood','IndexKey']]
#C=C.query('Neighborhood != "7 - Thomas/Dale(Frogtown)"')
#C.groupby(['Neighborhood','IndexKey']).sum()
#C.IndexKey.value_counts()


#dfI=dfI.Coordinates.fillna('Mi')
#dfI=dfI.query('Coordinates != "Mi"')

In [9]:
#Drop rows with missing coordinates
dfI=dfI[dfI['Coordinates'].notnull()]

# Separate Latitude and Longitude 
new=dfI['Coordinates'].str.split(",", n = 1, expand = True) 
# store each half of the split as its own column
dfI['Latitude']= pd.to_numeric(new[0]) #pd.to_numeric convert it to float
dfI['Longitude']= pd.to_numeric(new[1])

dfI.columns

Index(['Case', 'Date', 'Time', 'Code', 'IncType', 'Incident', 'Grid', 'NNum',
       'Neighborhood', 'Block', 'CallDispCode', 'CallDisposition', 'Count',
       'DateTime', 'Year', 'DayofWeek', 'Weekend', 'Month', 'Day', 'DayYear',
       'Day_Max', 'TimeHour', 'Hour', 'LateNight', 'Intersection', 'Inter2',
       'Inter1', 'IndexKey', 'Coordinates', 'OutputKey', 'Latitude',
       'Longitude'],
      dtype='object')

In [158]:
#Final Touch
dfI['Block']=dfI['OutputKey'] #use the canonical intersection name as the Block label
Drop_col=['Inter2','Inter1', 'IndexKey', 'Coordinates', 'OutputKey']
dfI_Final=dfI.drop(Drop_col, axis=1)
dfI_Final.head(5)

Unnamed: 0,Case,Date,Time,Code,IncType,Incident,Grid,NNum,Neighborhood,Block,...,Month,Day,DayYear,Day_Max,TimeHour,Hour,LateNight,Intersection,Latitude,Longitude
0,19078070,04/17/2019,15:05,9954,Proactive Police Visit,Proactive Police Visit,109.0,8,8 - Summit/University,arundel_central,...,4,17,107,107,2019-04-30 15:05:00,15,0,1,44.953081,-93.118654
1,19078068,04/17/2019,15:03,9954,Proactive Police Visit,Proactive Police Visit,110.0,8,8 - Summit/University,farrington_fuller,...,4,17,107,107,2019-04-30 15:03:00,15,0,1,44.953989,-93.113264
2,19078110,04/17/2019,16:08,9954,Proactive Police Visit,Proactive Police Visit,89.0,7,7 - Thomas/Dale(Frogtown),mackubin_university,...,4,17,107,107,2019-04-30 16:08:00,16,0,1,44.955842,-93.121236
3,19078182,04/17/2019,17:22,9954,Proactive Police Visit,Proactive Police Visit,87.0,7,7 - Thomas/Dale(Frogtown),lexington_university,...,4,17,107,107,2019-04-30 17:22:00,17,0,1,44.955826,-93.146539
4,19078441,04/17/2019,23:48,9954,Proactive Police Visit,Proactive Police Visit,87.0,7,7 - Thomas/Dale(Frogtown),milton_thomas,...,4,17,107,107,2019-04-30 23:48:00,23,1,1,44.959361,-93.139031


### Figuring out the Address Key

So how do we get geocoordinates from a masked address?

The intended strategy was to fill in the masked digits with numbers and have a geocoder convert the result to coordinates. First, I tried Nominatim, the geocoder available through geopy. It failed quite badly, even for actual addresses. Thankfully, Google's Geocoding API is very good at approximating addresses, including ones that don't necessarily exist. It was not perfect, but it had a success rate of around 96%. The drawback of the Google API is that the process is not fully automated, or at least I am not aware of how to automate it.
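The digit-filling step might be sketched as follows. The function name `unmask_block` and its parameters are hypothetical. The replacement values (`05` for a double mask, `0` for a single one) are the ones this notebook uses in its first pass.

```python
def unmask_block(block, single="0", double="05"):
    """Replace masked digits in a lowercase Block string with stand-in digits."""
    # Handle 'xx' first, otherwise the single-'x' rule would eat part of it.
    filled = block.replace("xx", double)
    # An 'x' directly before a space is treated as the masked last digit.
    return filled.replace("x ", single + " ")


print(unmask_block("45x university av w"))
print(unmask_block("1xx rice st"))
print(unmask_block("45x ravoux st"))
```

The last call shows the weakness of a plain string replace: a street name that ends in `x` gets altered too, so names like `ravoux` need a special-case correction afterwards.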

Note: If I truly wanted to go overkill, I could build a boundary (for example, around the centroid of the intersections previously mapped out) and require each geocoded address to fall inside it. A result outside the boundary would be flagged as incorrectly matched.
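A much simpler stand-in for that idea is a padded bounding box around the known intersection coordinates. This is an assumption on my part, not a centroid method, and the `margin` of 0.002 degrees is an arbitrary illustrative value.

```python
def bounding_box(points, margin=0.002):
    """Return (min_lat, max_lat, min_lon, max_lon) around points, padded by margin degrees."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (min(lats) - margin, max(lats) + margin,
            min(lons) - margin, max(lons) + margin)


def inside(box, lat, lon):
    """True if the point lies within the box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Three intersection coordinates that appear in this notebook's tables.
known = [(44.970295, -93.146572), (44.953081, -93.118654), (44.956855, -93.100879)]
box = bounding_box(known)

in_area = inside(box, 44.95909, -93.129816)  # a geocoded Thomas Ave address
far_away = inside(box, 44.9805, -93.0255)    # made-up point well outside the area
print(in_area, far_away)
```

A geocoded address that falls outside the box would be sent back for manual review rather than mapped.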

#### Basic Setup

In [None]:



#dfW['Block1']= dfW.Block.str.replace(' n',' north')
#dfW['Block1']= dfW.Block.str.replace(' s',' south')
#dfW['Block1']= dfW.Block.str.replace(' w',' west')
#dfW['Block1']= dfW.Block.str.replace(' e',' east')
#dfW['Block1']= dfW.Block.str.replace(' n',' north')
#dfW['Block1']= dfW.Block.str.replace(' s',' south')

In [13]:
#from geopy.geocoders import Nominatim
#nom=Nominatim()
#pd.set_option('display.max_colwidth', -1)

import geocoder 
import requests
#geocoder.google("1022 edmund avenue west, St. Paul, MN, 55104", key=API_KEY)


def get_google_results(address, api_key='YOUR_API_KEY', return_full_response=False):  # replace the placeholder with your own key
    """
    Get geocode results from Google Maps Geocoding API.
    
    Note that in the case of multiple Google geocode results, this function returns details of the FIRST result.
    
    @param address: String address as accurate as possible. For Example "18 Grafton Street, Dublin, Ireland"
    @param api_key: String API key if present from google. 
                    If supplied, requests will use your allowance from the Google API. If not, you
                    will be limited to the free usage of 2500 requests per day.
    @param return_full_response: Boolean to indicate if you'd like to return the full response from google. This
                    is useful if you'd like additional location details for storage or parsing later.
    """
    # Set up the Geocoding request; passing params lets requests URL-encode the address
    geocode_url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {"address": address}
    if api_key is not None:
        params["key"] = api_key
        
    # Ping Google for the results:
    results = requests.get(geocode_url, params=params)
    # Results will be in JSON format - convert to dict using requests functionality
    results = results.json()
#    results =results['formatted_address']
    
    # if there's no results or an error, return empty results.
    if len(results['results']) == 0:
        output = {
            "formatted_address" : None,
            "latitude": None,
            "longitude": None
        }
    else:    
        answer = results['results'][0]
        output = {
            "formatted_address" : answer.get('formatted_address'),
            "latitude": answer.get('geometry').get('location').get('lat'),
            "longitude": answer.get('geometry').get('location').get('lng')
        }
        
    # Append some other details:    
    output['input_string'] = address
    output['number_of_results'] = len(results['results'])
    output['status'] = results.get('status')
    if return_full_response is True:
        output['response'] = results
    
    return output


A limitation of the Google API is that requests time out when I submit more than about 50 entries at once, so I run the algorithm in batches of 50. The API also returns the formatted address it matched alongside the coordinates, which can be used to judge whether the match is good.
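Judging a match by eye gets tedious, so a crude automatic filter can help narrow down what to inspect. The heuristic below is an assumption of mine rather than the method used in this notebook: it requires the same house number and checks that the street token from the query appears in the formatted street.

```python
def looks_like_match(query_address, formatted_address):
    """Crude check: same house number, and the query's street token appears in the result."""
    if not formatted_address:
        return False
    q = query_address.lower().split(",")[0].split()
    f = formatted_address.lower().split(",")[0].split()
    if len(q) < 2 or not f:
        return False
    # Join the street words so 'White Bear' compares equal to 'whitebear' and an
    # inserted direction ('710 W Minnehaha Ave') does not break the comparison.
    street = "".join(f[1:])
    return q[0] == f[0] and q[1] in street


good = looks_like_match("710 minnehaha av w, Saint Paul, MN, 55103",
                        "710 W Minnehaha Ave, St Paul, MN 55104, USA")
review = looks_like_match("180 oldsixth st, Saint Paul, MN, 55103",
                          "180 Old 6th St W, St Paul, MN 55102, USA")
print(good, review)
```

A `False` does not prove the match is wrong (the second example is a spelling difference), it only marks the row for a manual look.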


In [160]:
dfW=df.query('Intersection==0')
dfW=dfW.query('Grid in [66.0]') #Perform algorithm separately for each grid

#I tried different specifications to improve accuracy. In retrospect, I should have given the xx mask a different stand-in value, just for graphical purposes
dfW['Block1']= dfW.Block.str.replace('xx','05')
dfW['Block1']= dfW.Block1.str.replace('x ','0 ') #notice the space
#dfW['Block1']= dfW.Block1.str.replace(' pa ' ,' parkway ')
#dfW['Block1']= dfW.Block1.str.replace(' w',' west')
#dfW['Block1']= dfW.Block.str.replace(' e',' east')

#special Case!
dfW['Block1']= dfW.Block1.str.replace('ravou0','ravoux')

In [162]:
# Set it up in address format; use a ZIP code that is roughly right for the dataset
dfW['Address']= dfW['Block1'] + ', Saint Paul, MN, 55103'

# Create a datatable showing original, transformed, and full address
A=dfW[['Block','Block1','Address']].sort_values('Address')
#print(A)
A=A.groupby(['Address','Block']).count()
A=A.reset_index()
A.shape

#Create batches of about 50 rows each (.loc slicing is inclusive, so the first batch has 51 rows)
B=A.loc[0:50,:]
C=A.loc[51:100,:]
D=A.loc[101:150,:]
E=A.loc[151:200,:]
F=A.loc[201:250,]
G=A.loc[251:,]
A.head(2)

Unnamed: 0,Address,Block,Block1
0,"1090 piercebutler rd, Saint Paul, MN, 55103",109x piercebutler rd,2
1,"1100 hubbard av, Saint Paul, MN, 55103",110x hubbard av,2


When running the algorithm, I wait until one group finishes before starting the next.
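The manual batching could in principle be wrapped in a small loop. The sketch below is hypothetical: `run_in_batches` is not part of this notebook, the pause length is a guess, and `fake_geocode` stands in for `get_google_results` so the example runs without an API key.

```python
import time


def run_in_batches(items, func, batch_size=50, pause_seconds=1.0):
    """Apply func to items in consecutive batches, pausing between batches."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(func(item) for item in batch)
        if start + batch_size < len(items):
            time.sleep(pause_seconds)  # assumed pause; tune it to the API's limits
    return results


def fake_geocode(address):
    """Stand-in for the real geocoding call so the sketch runs offline."""
    return {"input_string": address}


addresses = ["{}0 thomas av".format(n) for n in range(120)]
out = run_in_batches(addresses, fake_geocode, batch_size=50, pause_seconds=0)
print(len(out))
```

With 120 addresses and a batch size of 50, the loop makes three passes (50, 50 and 20 rows) and returns the results in the original order.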

In [26]:

#allows to display
#B['Coordinates']= B['Address'].apply(get_google_results)
#C['Coordinates']= C['Address'].apply(get_google_results)
#D['Coordinates']= D['Address'].apply(get_google_results)
#E['Coordinates']= E['Address'].apply(get_google_results)
#F['Coordinates']= F['Address'].apply(get_google_results)
G['Coordinates']= G['Address'].apply(get_google_results)

G

Unnamed: 0,Address,Block,Block1,Coordinates,For_Address,Latitude,Longitude
151,"700 thomas av, Saint Paul, MN, 55103",70x thomas av,4,"{'formatted_address': '700 Thomas Ave W, St Pa...","700 Thomas Ave W, St Paul, MN 55104, USA",44.95909,-93.129816
152,"700 university av w, Saint Paul, MN, 55103",70x university av w,10,"{'formatted_address': '700 University Ave W, S...","700 University Ave W, St Paul, MN 55104, USA",44.955653,-93.129695
153,"700 vanburen av, Saint Paul, MN, 55103",70x vanburen av,2,"{'formatted_address': '700 Van Buren Ave, St P...","700 Van Buren Ave, St Paul, MN 55104, USA",44.961815,-93.129843
154,"710 avon st n, Saint Paul, MN, 55103",71x avon st n,2,"{'formatted_address': '710 Avon St N, St Paul,...","710 Avon St N, St Paul, MN 55103, USA",44.9702,-93.133415
155,"710 blair av, Saint Paul, MN, 55103",71x blair av,1,"{'formatted_address': '710 Blair Ave, St Paul,...","710 Blair Ave, St Paul, MN 55104, USA",44.96096,-93.13021
156,"710 charles av, Saint Paul, MN, 55103",71x charles av,16,"{'formatted_address': '710 Charles Ave, St Pau...","710 Charles Ave, St Paul, MN 55104, USA",44.957336,-93.130187
157,"710 dale st n, Saint Paul, MN, 55103",71x dale st n,1,"{'formatted_address': '710 Dale St N, St Paul,...","710 Dale St N, St Paul, MN 55103, USA",44.962641,-93.126031
158,"710 edmund av, Saint Paul, MN, 55103",71x edmund av,19,"{'formatted_address': '710 Edmund Ave W, St Pa...","710 Edmund Ave W, St Paul, MN 55104, USA",44.958265,-93.130191
159,"710 lafond av, Saint Paul, MN, 55103",71x lafond av,4,"{'formatted_address': '710 Lafond Ave, St Paul...","710 Lafond Ave, St Paul, MN 55104, USA",44.960049,-93.130229
160,"710 minnehaha av w, Saint Paul, MN, 55103",71x minnehaha av w,1,"{'formatted_address': '710 W Minnehaha Ave, St...","710 W Minnehaha Ave, St Paul, MN 55104, USA",44.96273,-93.130266


In [15]:
B['For_Address'] = B['Coordinates'].apply(lambda x: x['formatted_address']) #formatted address returned by Google
B['Latitude'] = B['Coordinates'].apply(lambda x: x['latitude']) 
B['Longitude'] = B['Coordinates'].apply(lambda x: x['longitude']) 
B[['For_Address', 'Address', ]]

Unnamed: 0,For_Address,Address
0,"1010 N Cypress St, St Paul, MN 55106, USA","1010 cypress st, Saint Paul, MN, 55103"
1,"1370 Coach Rd, St Paul, MN 55108, USA","1370 coach rd, Saint Paul, MN, 55103"
2,"1440 Minnehaha Ave E, St Paul, MN 55106, USA","1440 minnehaha av e, Saint Paul, MN, 55103"
3,"1450 University Ave W, St Paul, MN 55104, USA","1450 university av w, Saint Paul, MN, 55103"
4,"1480 7th St E, St Paul, MN 55106, USA","1480 7 st e, Saint Paul, MN, 55103"
5,"1660 Barclay St, St Paul, MN 55106, USA","1660 barclay st, Saint Paul, MN, 55103"
6,"1660 White Bear Ave, St Paul, MN 55106, USA","1660 whitebear av n, Saint Paul, MN, 55103"
7,"1690 Nebraska Ave E, St Paul, MN 55106, USA","1690 nebraska av e, Saint Paul, MN, 55103"
8,"180 Old 6th St W, St Paul, MN 55102, USA","180 oldsixth st, Saint Paul, MN, 55103"
9,"1830 Ames Ave E, St Paul, MN 55119, USA","1830 ames av, Saint Paul, MN, 55103"


In [28]:
G['For_Address'] = G['Coordinates'].apply(lambda x: x['formatted_address']) #formatted address returned by Google
G['Latitude'] = G['Coordinates'].apply(lambda x: x['latitude']) 
G['Longitude'] = G['Coordinates'].apply(lambda x: x['longitude']) 
G[['For_Address', 'Address', 'Block' ]]

Unnamed: 0,For_Address,Address,Block
251,"820 Blair Ave, St Paul, MN 55104, USA","820 blair av, Saint Paul, MN, 55103",82x blair av
252,"820 Charles Ave, St Paul, MN 55104, USA","820 charles av, Saint Paul, MN, 55103",82x charles av
253,"820 Edmund Ave W, St Paul, MN 55104, USA","820 edmund av, Saint Paul, MN, 55103",82x edmund av
254,"820 Lafond Ave, St Paul, MN 55104, USA","820 lafond av, Saint Paul, MN, 55103",82x lafond av
255,"820 W Minnehaha Ave, St Paul, MN 55104, USA","820 minnehaha av w, Saint Paul, MN, 55103",82x minnehaha av w
256,"820 Sherburne Ave, St Paul, MN 55104, USA","820 sherburne av, Saint Paul, MN, 55103",82x sherburne av
257,"820 Thomas Ave W, St Paul, MN 55104, USA","820 thomas av, Saint Paul, MN, 55103",82x thomas av
258,"820 University Ave W, St Paul, MN 55104, USA","820 university av w, Saint Paul, MN, 55103",82x university av w
259,"820 Van Buren Ave, St Paul, MN 55104, USA","820 vanburen av, Saint Paul, MN, 55103",82x vanburen av
260,"830 Blair Ave, St Paul, MN 55104, USA","830 blair av, Saint Paul, MN, 55103",83x blair av


In [397]:
#Address_Clean= pd.concat([B_Clean, C_Clean, D_Clean, E_Clean, F_Clean], ignore_index=True)
#Address_Clean
#Address_Mess= pd.concat([B_Mess, C_Mess, D_Mess, E_Mess, F_Mess], ignore_index=True)
#Address_Mess

In [99]:
B= df_C.drop(df_C.index[528])
D1=B.groupby('Block').count().reset_index()
D1.query('Address > 1')

Unnamed: 0,Block,Address,Latitude,Longitude
836,42x rice st,2,2,2


In [398]:
#second iteration: fill the masked digits with 1 instead of 0

M =Address_Mess
M['Block1']= M.Block.str.replace('xx','01')
M['Block1']= M.Block1.str.replace('x ','1 ')

M['Address']= M['Block1'] + ', Saint Paul, MN, 55104'
#print(dfW['Address'].unique())
#print(dfW['Block1'].unique())

A=M[['Block','Block1','Address']].sort_values('Address')
#print(M)
A=A.groupby(['Address','Block']).count()
A=A.reset_index()
A.shape
B=A.loc[0:50,:]
C=A.loc[51:100,:]
D=A.loc[101:,:]
#E=A.loc[151:200,:]
#F=A.loc[201:,]


Location(1000, Sherburne Avenue, St. Paul, Ramsey County, Minnesota, 55104, USA, (44.956552, -93.144261, 0.0))

In [381]:
import geocoder 
API_KEY = 'YOUR_API_KEY' #placeholder: supply your own key and never publish a real one
RETURN_FULL_RESULTS = False

g = geocoder.google("1022 edmund avenue west, St. Paul, MN, 55104", key=API_KEY)
#g = geocoder.google('Mountain View, CA', key='puting my key here')
#result = geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")

g.housenumber

'1022'

# Setup Address Key

In [10]:
dfW=df.query('Intersection==0')
#print(dfW.shape)
#df_key2 = pd.read_csv('Address_KeySing.csv')
df_keymess= pd.read_csv('Datasets/AddressKeyMess.csv')

#Separate Coordinates
new=df_keymess['Coordinates'].str.split(",", n = 1, expand = True) 
# store each half of the split as its own column
df_keymess['Latitude']= pd.to_numeric(new[0]) #pd.to_numeric convert it to float
df_keymess['Longitude']= pd.to_numeric(new[1])
df_keymess= df_keymess[['Block','Address','Latitude','Longitude']]

D1=df_keymess.groupby('Block').count().reset_index()
D1.query('Address > 1')
#df_key2.columns= ['Block','Address','Latitude','Longitude']

#Merge Both Dataframes
#df_C=df_key2.append(df_keymess, ignore_index=True)


Unnamed: 0,Block,Address,Latitude,Longitude


In [11]:
dfW=df.query('Intersection==0')
df_C= pd.read_csv('SemiKey.csv')
df_C= df_C[['Block','Latitude','Longitude']]

# Merge with the split dataset
dC=pd.merge(dfW, df_C, on='Block', how='left')
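dC=dC.fillna('Mi') #mark every missing value (in any column) with the sentinel 'Mi'
dC=dC.query('Latitude != "Mi"') #keep only the rows that found coordinates in the key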


In [12]:
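fg= pd.concat([dfI, dC], ignore_index=True, sort=True) #pd.concat replaces DataFrame.append (removed in pandas 2.0); sort=True gives the alphabetical column order shown in the output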
print(fg.shape)
print(fg.columns)

#Few Quick Edits (fg.loc[mask, column] avoids chained assignment, which newer pandas versions may not apply)
fg.loc[fg['CallDisposition'] == 'G - Gone on Arrival', 'CallDisposition'] = 'Gone on Arrival'
fg.loc[fg['CallDisposition'] == 'A - Advised', 'CallDisposition'] = 'Advised'
fg.loc[fg['CallDisposition'] == 'RR - Report Written', 'CallDisposition'] = 'Report Written'
fg.loc[fg['Incident'] == 'Simple Asasult Dom.', 'Incident'] = 'Simple Assault Dom.'
fg.loc[fg['Incident'] == 'Graffiti', 'Incident'] = 'Vandalism'
fg.loc[fg['Incident'].isin(["Rape","Agg. Assault",'Homicide']), 'Incident'] = 'Violent'
fg.loc[fg['Incident'].isin(["Simple Assault Dom.","Agg. Assault Dom."]), 'Incident'] = 'Domestic Assault'

#[fg["Incident"].isin(["Simple Assault Dom.", "Rape"])


#Create a dummy for each crime category
fg= pd.concat([fg,pd.get_dummies(fg['Incident'])], axis=1)
fg= pd.concat([fg,pd.get_dummies(fg['CallDisposition'])], axis=1)

fg.to_csv('FGCrime_Final.csv', encoding='utf-8', index=False)
#fg.DayYear.tail(5)

(22684, 32)
Index(['Block', 'CallDispCode', 'CallDisposition', 'Case', 'Code',
       'Coordinates', 'Count', 'Date', 'DateTime', 'Day', 'DayYear', 'Day_Max',
       'DayofWeek', 'Grid', 'Hour', 'IncType', 'Incident', 'IndexKey',
       'Inter1', 'Inter2', 'Intersection', 'LateNight', 'Latitude',
       'Longitude', 'Month', 'NNum', 'Neighborhood', 'OutputKey', 'Time',
       'TimeHour', 'Weekend', 'Year'],
      dtype='object')
