In [1]:
import datetime
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

### Introduction

This project analyzes crime and house prices in Chicago. Using house prices and crime counts for individual neighborhoods, our goal is to determine whether crime rates and house prices are correlated at the neighborhood level. We analyzed Chicago crime data and the median sale price of houses in different Chicago neighborhoods. We calculated the correlation between the change in house prices and the number of crimes in each neighborhood from 2012 to 2018, and drew a conclusion from the available data on whether the two are connected. Furthermore, we predicted house prices and crime counts for two more years and show that the predicted data also supports our conclusion.

### Data Cleaning

We have used two different datasets for this project:

1. Chicago crime data from the City of Chicago website: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
2. Chicago house price data from Redfin: https://www.redfin.com/blog/data-center

Since our house price data for Chicago neighborhoods covers 2012-2018, we excluded all crimes that occurred before 2012 or after 2018, and we also excluded rows with null values. There is no separate script for cleaning the crime data: records outside 2012-2018 were filtered out when we downloaded the dataset, and we checked for null values before every operation, ignoring any row that had one.
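The year filter was applied at download time, so the notebook has no code for it. As a minimal sketch of how the same cleaning could be done in pandas, the helper below keeps rows whose `Year` falls in 2012-2018 and drops rows with nulls in the columns an operation needs. The function name `filter_crimes` and the sample rows are made up for illustration; `Date` and `Year` are columns that appear in the crime data.

```python
import pandas as pd

def filter_crimes(df, first_year=2012, last_year=2018, required=("Date", "Year")):
    """Keep rows with Year in [first_year, last_year] and no nulls in `required`."""
    in_range = df["Year"].between(first_year, last_year)  # inclusive on both ends
    return df[in_range].dropna(subset=list(required))

# Made-up sample rows, only to show the behavior.
sample = pd.DataFrame({
    "Date": ["03/18/2011 07:44:00 PM", "03/18/2015 11:00:00 PM", None, "01/02/2019 10:00:00 AM"],
    "Year": [2011, 2015, 2016, 2019],
})
cleaned = filter_crimes(sample)
print(cleaned)
```

Only the 2015 row survives here: the 2011 and 2019 rows fall outside the range, and the 2016 row has a null `Date`.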

After obtaining the house price data, we removed the columns for which no values were available. Because we also planned to work with price per square foot, we kept the price per square foot and sale price columns and filtered out the rest.
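A sketch of this column trimming, under stated assumptions: the column names in `KEEP` are hypothetical stand-ins, since the exact headers of the Redfin export are not listed here, and `trim_house_prices` is an illustrative name rather than a function from the project.

```python
import pandas as pd

# Hypothetical column names; the real Redfin export may label them differently.
KEEP = ["Region", "Period End", "Median Sale Price", "Median Ppsf"]

def trim_house_prices(df, keep=KEEP):
    """Drop columns that hold no values, then keep only the columns of interest."""
    non_empty = df.dropna(axis="columns", how="all")
    return non_empty[[c for c in keep if c in non_empty.columns]]

# Made-up sample rows, only to show the behavior.
sample_prices = pd.DataFrame({
    "Region": ["A", "B"],
    "Period End": ["2015-03-31", "2015-03-31"],
    "Median Sale Price": [200000, 250000],
    "Median Ppsf": [150.0, 180.0],
    "Inventory": [None, None],   # entirely empty, so it is dropped
    "Homes Sold": [10, 12],      # has values but is not in KEEP
})
trimmed = trim_house_prices(sample_prices)
print(trimmed.columns.tolist())
```

Filtering `keep` against the columns that actually exist lets the helper tolerate an export in which one of the expected columns is missing or empty.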

### Exploratory Data Analysis

In our project we are using two datasets: Chicago Crime Data and House Prices Data.

#### Chicago Crime Data: 
This dataset contains the criminal offenses that occurred in Chicago from 2001 to 2019; we use the crime data from 2012-2019. Each record represents the details of one crime, including its time, type, description, and location. We use only the time and location of each crime. The dataset does not give the neighborhood or zip code of the crime location. Since our goal is to analyze crime in Chicago by area and its relationship with the economy of those areas, we had to extract the neighborhoods from the locations, which are given as coordinates. We used the ArcGIS API for this, but extracting the neighborhoods this way takes a very long time. Although the whole dataset contains more than 2 million records, we are currently experimenting with half a million. We will use the whole dataset before our final submission. The function used for extracting the neighborhood data is given below.
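Separately from the project's own extraction function, the sketch below illustrates one way the lookup cost might be reduced: call the geocoder once per distinct coordinate pair and merge the results back onto the full table. It assumes that many crimes share the same coordinates, which the data would need to confirm. The `lookup` argument is a hypothetical stand-in for whatever function wraps the ArcGIS call; no ArcGIS API is used or implied here, and the choice of `X Coordinate` / `Y Coordinate` as inputs is also an assumption.

```python
import pandas as pd

def add_neighborhood(df, lookup, x_col="X Coordinate", y_col="Y Coordinate"):
    """Call `lookup(x, y)` once per distinct coordinate pair and map results back."""
    pairs = df[[x_col, y_col]].dropna().drop_duplicates()
    pairs = pairs.assign(
        Neighborhood=[lookup(x, y) for x, y in zip(pairs[x_col], pairs[y_col])]
    )
    # Left merge keeps every crime row; rows without coordinates get NaN.
    return df.merge(pairs, on=[x_col, y_col], how="left")

# Stand-in lookup that records its calls; a real one would query a geocoder.
calls = []
def fake_lookup(x, y):
    calls.append((x, y))
    return f"area-{int(x)}"

sample_coords = pd.DataFrame({
    "X Coordinate": [1.0, 1.0, 2.0, None],
    "Y Coordinate": [5.0, 5.0, 6.0, 7.0],
})
located = add_neighborhood(sample_coords, fake_lookup)
print(len(calls), "lookups for", len(sample_coords), "rows")
```

In this made-up sample, four rows need only two lookups, because two rows share a coordinate pair and one row has no usable coordinates.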

#### House Prices Data from Redfin: 
Redfin provides home sale price data from 2012-2019. Each record represents a 90-day period and includes the median sale price, median price per square foot, median listed price, and number of houses sold within that period in a particular neighborhood. We relate our two datasets by neighborhood.
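Relating the two datasets by neighborhood could look roughly like the sketch below: count crimes per neighborhood and year, collapse the 90-day price records to one yearly median, and join on the shared keys. The column names `Neighborhood`, `Crime Count`, and `Median Sale Price`, and the assumption that both frames already carry a `Year` column, are illustrative rather than taken from the project's files.

```python
import pandas as pd

def crimes_vs_prices(crimes, prices):
    """Join yearly crime counts with yearly median sale prices per neighborhood."""
    counts = (
        crimes.groupby(["Neighborhood", "Year"])
        .size()
        .reset_index(name="Crime Count")
    )
    # Collapse the 90-day records to a single median per neighborhood and year.
    yearly = prices.groupby(["Neighborhood", "Year"], as_index=False)[
        "Median Sale Price"
    ].median()
    # Inner join keeps only neighborhood-years present in both datasets.
    return counts.merge(yearly, on=["Neighborhood", "Year"], how="inner")

# Made-up sample rows, only to show the shape of the result.
sample_crimes = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "A"],
    "Year": [2015, 2015, 2015, 2016],
})
sample_sales = pd.DataFrame({
    "Neighborhood": ["A", "A", "B"],
    "Year": [2015, 2015, 2015],
    "Median Sale Price": [200000, 220000, 150000],
})
joined = crimes_vs_prices(sample_crimes, sample_sales)
print(joined)
```

An inner join silently drops neighborhood-years that appear in only one dataset, so it is worth checking how many rows survive before computing correlations.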

Some of the rows from our datasets are shown below:

In [2]:
CrimeFile = 'D:/CS 418/Project/Merged/0To2043066.csv'
CrimeDF = pd.read_csv(CrimeFile)
HousePriceFile = 'databyneighborhood.csv'
HousePriceDF = pd.read_csv(HousePriceFile)

# Preview the first five rows of the crime data (all columns)
CrimeDF.head()

Unnamed: 0.2,Unnamed: 0,Address,Arrest,Beat,Block,Case Number,Community Area,Date,Description,District,...,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,Unnamed: 0.1.1.1.1,Updated On,Ward,X Coordinate,Y Coordinate,Year,Zipcode
0,0,{'address': {'Match_addr': '4700-4728 W Ohio S...,False,1111,047XX W OHIO ST,HY189866,25.0,03/18/2015 07:44:00 PM,AGGRAVATED: HANDGUN,11.0,...,0,0.0,0.0,0.0,02/10/2018 03:50:01 PM,28.0,1144606.0,1903566.0,2015,60644.0
1,1,{'address': {'Match_addr': '6601-6699 S Marshf...,True,725,066XX S MARSHFIELD AVE,HY190059,67.0,03/18/2015 11:00:00 PM,PAROLE VIOLATION,7.0,...,1,1.0,1.0,1.0,02/10/2018 03:50:01 PM,15.0,1166468.0,1860715.0,2015,60636.0
2,2,{'address': {'Match_addr': '4431-4477 S Lake P...,False,222,044XX S LAKE PARK AVE,HY190052,39.0,03/18/2015 10:45:00 PM,DOMESTIC BATTERY SIMPLE,2.0,...,2,2.0,2.0,2.0,02/10/2018 03:50:01 PM,4.0,1185075.0,1875622.0,2015,60653.0
3,3,{'address': {'Match_addr': '5100-5198 S Michig...,False,225,051XX S MICHIGAN AVE,HY190054,40.0,03/18/2015 10:30:00 PM,SIMPLE,2.0,...,3,3.0,3.0,3.0,02/10/2018 03:50:01 PM,3.0,1178033.0,1870804.0,2015,60615.0
4,4,{'address': {'Match_addr': '4700-4798 W Adams ...,False,1113,047XX W ADAMS ST,HY189976,25.0,03/18/2015 09:00:00 PM,ARMED: HANDGUN,11.0,...,4,4.0,4.0,4.0,02/10/2018 03:50:01 PM,28.0,1144920.0,1898709.0,2015,60644.0
