# Read data sets and visualize

# NOTE: I had to do the following to get the GeoPandas join working
- Download libspatialindex with curl and install it
- Use the commands from the cell below
- Then open up autogen.sh and add `--force --install` on line 45
- Then re-run


In [5]:
# %%bash

# curl -L https://github.com/libspatialindex/libspatialindex/archive/1.8.5.tar.gz | tar xz
# cd libspatialindex-1.8.5
# ./autogen.sh
# ./configure
# make
# sudo make install
# sudo ldconfig

In [6]:
# # conda install -c conda-forge osmnx 
# !pip install rtree
# #!pip install descartes
# # !pip install libspatialindex

Let's plot this out!

# Another approach
Another approach is to determine how much the temperature varies across Chicago on a single day. If the variation is small, and if it does not appear to be correlated with crime, then we can simply take one city-wide temperature value for each day and assign it to the crime records. Let's check it out.

Out of 365 days per year, we have 13 stations with roughly 250 weather points for the year. Let's see if that means we don't have weather points for the weekends and holidays. This would be a major blocker for this project, because in reality, most of the crime in Chicago happens on extremely hot weekends.

This is going to be tricky. We need a function that takes a date, and tells us the day of the week it refers to. One way to do that might be to use a Python iterator that creates all of the days of the year in 2018, store it as a data structure, and then create a hash map that maps from our dates in the dataframe to Python days of the week.
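Here is a minimal sketch of that idea using only the standard library. The helper name `build_weekday_lookup` and the `WEEKDAY_NAMES` list are illustrative assumptions, not names used elsewhere in this notebook.

```python
from datetime import date, timedelta

# Index matches date.weekday(): 0 is Monday, 6 is Sunday.
WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def build_weekday_lookup(start, end):
    """Map every 'YYYY-MM-DD' string from start to end (inclusive) to a weekday name."""
    lookup = {}
    current = start
    while current <= end:
        lookup[current.strftime("%Y-%m-%d")] = WEEKDAY_NAMES[current.weekday()]
        current += timedelta(days=1)
    return lookup

# Build the hash map once for all of 2018, then look dates up as needed.
weekday_lookup = build_weekday_lookup(date(2018, 1, 1), date(2018, 12, 31))
print(weekday_lookup["2018-01-01"])
```

Building the dictionary once and reusing it keeps each per-row lookup cheap, which matters when the same date shows up many times in the dataframe.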

Great! Now let's combine that with our weather data and see where we stand.
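One way to see where we stand is to count the usable temperature readings per day of the week. The sketch below uses a tiny made-up table in place of the real weather file; the `DATE` and `TMAX` column names match the ones used later in this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the weather table (values are made up).
weather = pd.DataFrame({
    "DATE": ["2018-01-05", "2018-01-05", "2018-01-06", "2018-01-07", "2018-01-08"],
    "TMAX": [11.0, 12.0, 15.0, None, 36.0],
})

# Derive the weekday name directly from the parsed date.
weather["Day of Week"] = pd.to_datetime(weather["DATE"]).dt.day_name()

# count() skips missing values, so this is the number of usable
# TMAX readings for each weekday.
readings_per_weekday = weather.groupby("Day of Week")["TMAX"].count()
print(readings_per_weekday)
```

A weekday with a count near zero would signal the gap we are worried about.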

Great!! So we are still in luck. Even though there are still more missing values on the weekends, we have enough observations in those categories to continue our project. Now we need to compute the variation in temperature across all stations by day, and plot that out over time.
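A rough sketch of the per-day variation, again on made-up data. The `STATION` column name is an assumption about the weather file; `DATE` and `TMAX` are the columns used in this notebook.

```python
import pandas as pd

# Toy stand-in: three stations reporting on two days (values are made up).
weather = pd.DataFrame({
    "STATION": ["A", "B", "C", "A", "B", "C"],
    "DATE": ["2018-01-01"] * 3 + ["2018-01-02"] * 3,
    "TMAX": [2.0, 3.0, 4.0, 6.0, 8.0, 10.0],
})

# One row per day: mean, sample standard deviation, min and max across stations.
daily_spread = weather.groupby("DATE")["TMAX"].agg(["mean", "std", "min", "max"])

# The range is a second, easier-to-read measure of spread.
daily_spread["range"] = daily_spread["max"] - daily_spread["min"]
print(daily_spread)
```

Calling `daily_spread["std"].plot()` on the result would give the variation over time, assuming matplotlib is installed.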

At this point, I'm OK with just keeping the average temperature. If we run into accuracy problems with our model later, we can circle back and re-think how we're computing the temperature. Let's just settle on the average over those 13 stations, and use that for each crime. This means that rather than using the GeoPandas data frames and joining the 2 sets, we can just append temperature data to the crimes data.
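As a sketch of what appending the temperature could look like, the snippet below averages `TMAX` per day and maps it onto the crime rows by date. The two small tables are made up; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the weather and crime tables (values are made up).
weather = pd.DataFrame({
    "DATE": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "TMAX": [2.0, 4.0, 8.0],
})
crimes = pd.DataFrame({
    "ID": [101, 102, 103],
    "Reformatted Date": ["2018-01-02", "2018-01-01", "2018-01-03"],
})

# Average TMAX across all stations for each day (a Series indexed by date).
daily_avg = weather.groupby("DATE")["TMAX"].mean()

# Look up each crime's date in the daily averages; dates with no
# weather reading come back as NaN instead of raising an error.
crimes["AVG TEMP"] = crimes["Reformatted Date"].map(daily_avg)
print(crimes)
```

Using `map` avoids a row-by-row loop, and leaving unmatched dates as NaN makes any gaps in the weather data easy to spot afterwards.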

We should NOT write over the old date. It contains information about the time of day that the crime was committed, which we will want to keep for other modeling purposes. We'll just add a new column for reformatted dates so we can join it with the weather averages. This function will take a few minutes to run.
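If the run time becomes a problem, a vectorized version is one option. This sketch assumes the crime timestamps look like `MM/DD/YYYY hh:mm:ss AM/PM`; that exact format is an assumption, so adjust the `format` string to match the real file.

```python
import pandas as pd

# Toy stand-in for the crime table (timestamps are made up and the
# format is an assumption about the real data).
crimes = pd.DataFrame({
    "Date": ["01/15/2018 11:30:00 PM", "09/02/2018 01:05:00 AM"],
})

# Parse the whole column at once instead of looping over rows.
parsed = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")

# Write the result to a NEW column so the original timestamp,
# including the time of day, is left untouched.
crimes["Reformatted Date"] = parsed.dt.strftime("%Y-%m-%d")
print(crimes)
```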

# New Main Function

In [34]:
import pandas as pd
from datetime import date
from dateutil.rrule import rrule, DAILY

def reformat_crime_date(date_string):

    items = date_string.split("/")
    year = items[-1]
    month = items[0]
    day = items[1]    
    
    new_date = "{}-{}-{}".format(year, month, day)
        
    return new_date

def get_mapper(crime_days):

    d = {}

    for each in crime_days:
        old_date = each.split()[0]
        if old_date not in d:
            new_date = reformat_crime_date(old_date[:])
            d[old_date] = new_date

    return d

def get_average_by_day(weather_df):

    days_in_2018 = get_days_from_range([2018, 1, 1], [2018, 9, 30])
    
    rt = {}
    
    for d in days_in_2018:

        # now, for each day, grab every TMAX
        rows = weather_df.loc[ weather_df["DATE"] == d]
        rt[d] = rows["TMAX"].mean() 
    return rt


def get_mapping_dict():

    rt = {}

    a = date(2018, 1, 1)
    b = date(2018, 9, 30)

    day_of_week = {0: "Monday", 1:"Tuesday", 2:"Wednesday",3:"Thursday",4:"Friday",5:"Saturday",6:"Sunday"}

    for dt in rrule(DAILY, dtstart=a, until=b):
        day_of_year = dt.strftime("%Y-%m-%d")
        n = dt.weekday()
        day = day_of_week[n]
        rt[day_of_year] = day
        
    return rt

def add_day_of_week(df):

    # build the "YYYY-MM-DD" -> weekday-name lookup once, up front
    m = get_mapping_dict()

    for idx, row in df.iterrows():
        day_of_year = row["DATE"]
        day_of_week = m[day_of_year]
        df.loc[idx, "Day of Week"] = day_of_week
        
    return df


def get_days_from_range(day_1, day_2):

    rt = []
    
    a = date(day_1[0], day_1[1], day_1[2])
    b = date(day_2[0], day_2[1], day_2[2])

    for dt in rrule(DAILY, dtstart=a, until=b):
        day_of_year = dt.strftime("%Y-%m-%d")
        rt.append(day_of_year)
    return rt


In [22]:
def get_updated_data(weather_file, crime_file):
    
    # read into memory
    weather_df = pd.read_csv(weather_file)
    crime_df = pd.read_csv(crime_file, index_col = "ID")
    
    # add new columns
    weather_df["Day of Week"] = [" " for i in range(weather_df.shape[0])]
    crime_df["Reformatted Date"] = [" " for i in range(crime_df.shape[0])]

    # get day of week for weather
    weather_df = add_day_of_week(weather_df)
    
    # compute averages by day
    
    return weather_df, crime_df


In [None]:
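def modify_crime_dates(df):
    # build the old-date -> new-date lookup once, from the unique timestamps
    mapper = get_mapper(df["Date"].unique().tolist())

    for idx, row in df.iterrows():

        # keep only the date portion of the timestamp before looking it up
        new_date = mapper[row["Date"].split()[0]]

        df.loc[idx, "Reformatted Date"] = new_date

    return df
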

In [29]:
def main(weather_file, crime_file):
    '''
    The primary function for execution. This does the following:
        (1) Reads weather and crime data into memory
        (2) Adds a day of the week to the weather data set
        (3) Builds a mapper to reformat the crime dates (slow, so currently commented out)
        (4) Loads the previously reformatted crime data from disk instead

    '''
    weather_df, crime_df = get_updated_data(weather_file, crime_file)

#     crime_df = modify_crime_dates(crime_df)
    
    crime_df = pd.read_csv("../Data/crimes_reformatted.csv")

    # return both frames so the cells below can use them
    return weather_df, crime_df

    
weather_df, crime_df = main("../Data/cleaned_weather_2018.csv", "../Data/crimes_2018_reduced.csv")

In [27]:
# crime_df.to_csv("../Data/crimes_reformatted.csv")

In [41]:
# def add_avg_temp_to_crime(weather_df, crime_df):

averages = get_average_by_day(weather_df)

crime_df["AVG TEMP"] = [0.0 for i in range(crime_df.shape[0])]

for idx, row in crime_df.iterrows():
    day = row["Reformatted Date"]
    temp = averages[day]
    crime_df.loc[idx, "AVG TEMP"] = temp

    
    
    
# add_avg_temp_to_crime(weather_df, crime_df)

In [37]:
# 

{'2018-01-01': 3.111111111111111,
 '2018-01-02': 7.777777777777778,
 '2018-01-03': 16.555555555555557,
 '2018-01-04': 12.625,
 '2018-01-05': 11.11111111111111,
 '2018-01-06': 15.0,
 '2018-01-07': 31.0,
 '2018-01-08': 36.666666666666664,
 '2018-01-09': 35.22222222222222,
 '2018-01-10': 53.77777777777778,
 '2018-01-11': 59.888888888888886,
 '2018-01-12': 28.125,
 '2018-01-13': 17.22222222222222,
 '2018-01-14': 21.0,
 '2018-01-15': 23.555555555555557,
 '2018-01-16': 24.444444444444443,
 '2018-01-17': 21.22222222222222,
 '2018-01-18': 31.666666666666668,
 '2018-01-19': 39.666666666666664,
 '2018-01-20': 45.333333333333336,
 '2018-01-21': 44.888888888888886,
 '2018-01-22': 53.44444444444444,
 '2018-01-23': 39.44444444444444,
 '2018-01-24': 29.555555555555557,
 '2018-01-25': 41.44444444444444,
 '2018-01-26': 50.55555555555556,
 '2018-01-27': 50.0,
 '2018-01-28': 41.111111111111114,
 '2018-01-29': 30.666666666666668,
 '2018-01-30': 30.0,
 '2018-01-31': 41.333333333333336,
 '2018-02-01': 35.77