# Structure of the notebook
Welcome to our notebook. 

# 1. Motivation
**What is your dataset?**

Our main dataset is the New York Police Department's (NYPD) Motor Vehicle Collisions dataset from New York City Open Data. It provides details of motor vehicle collisions in the city, with information on where and when they happened, which vehicles were involved in them and which factors contributed to them. To supplement our main dataset with additional contextual information, we also draw on weather data from the National Oceanic and Atmospheric Administration and population data from New York City Open Data. The links to the datasets can be found below.

- Link to data on vehicle collisions (NYC Open Data): https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
- Link to data on weather conditions (National Oceanic and Atmospheric Administration): http://www.noaa.gov
- Link to data on New York City's population by boroughs (NYC Open Data): https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Boroughs/9mhd-na2n

**Why did you choose this/these particular datasets?**

We chose the data for two reasons.

1. Vehicle accidents and traffic incidents are of political importance. According to the organisation 'Transportation Alternatives' and the U.S. Department of Transportation, traffic accidents in New York City cost the city an estimated four billion dollars a year. The costs range from medical treatment to property damage and emergency services. On top of this, 229 people were killed and around 42,000 injured in New York City traffic in 2016. Traffic accidents thus have monetary as well as human costs, which raises questions of better traffic regulation and safety. We believe this points to the importance of investigating patterns in traffic accidents: if the city is to regulate better and make traffic safer, information on where and when accidents occur is necessary. We have therefore chosen to pay particular attention to accidents causing injuries, since, given the human costs, we find these to be among the most important for local government to target efficiently. With the help of data, we hope to see better regulation. With better regulation, we hope to see fewer people injured. This is the main motivation behind our final project and our choice of this dataset as our main research concern.

2. Technically, the dataset offers a good amount of data with a little over a million rows and is very granular. It is thus possible to track and trace each single accident and its context.  

**What was your goal for the end user's experience?**

Our goal has been to shed light on vehicle accidents causing injuries in NYC. It has been our ambition to raise awareness, stimulate curiosity, produce new hypotheses, and illustrate important patterns in relation to traffic safety. On our site, we have tried to target several subpublics within the greater public by allowing users to engage with our visualizations in different ways through details on demand. Apart from this, one of our goals has also been to raise questions for further exploration, including questions which may not be answered best by the methods used in our process. Thus we have suggested the need for qualitative studies of specific traffic intersections, whether in the form of observation or video material, if one is to dive deeper into the factors contributing to injuries.
In the end, it is our hope that the user leaves our site with a greater knowledge of vehicle accidents in NYC.

# 2. Basic statistics - understanding the data
**Write about your choices in data cleaning and preprocessing**

Our data cleaning and preprocessing has mainly consisted of removing observations that do not meet the criteria relevant to our analysis and descriptive overviews (for example, observations with missing values) and of reformatting the data. An example may illustrate this. When working with the decision tree algorithm, observations without a zip code were removed, as the algorithm does not allow missing values (NaN). As for reformatting, we split several columns into new ones, especially temporal variables such as date, and made several dummy variables (see the multiple regression analysis). We also constructed several subsets of the original dataset. For example, in the section displaying some basic distributions of the accident data, we relied on a dataset containing only accidents with injuries.
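
As a rough illustration of the steps described above (a sketch, not the exact notebook code), the following drops rows without a zip code, splits the date and time into temporal columns and builds dummy variables. The tiny DataFrame and its values are made up for the example; only the column names follow the collisions dataset.

```python
import pandas as pd

# Toy rows standing in for the collisions data (values are invented).
toy = pd.DataFrame({
    'DATE': ['01/15/2013', '07/04/2014', '12/31/2015'],
    'TIME': ['8:30', '17:05', '23:59'],
    'BOROUGH': ['BRONX', 'QUEENS', 'BRONX'],
    'ZIP CODE': [10451, None, 10452],
})

# Remove observations with a missing zip code.
clean = toy.dropna(subset=['ZIP CODE']).copy()

# Split the date and time strings into separate temporal columns.
parsed = pd.to_datetime(clean['DATE'], format='%m/%d/%Y')
clean['YEAR'] = parsed.dt.year
clean['MONTH'] = parsed.dt.month
clean['WEEKDAY'] = parsed.dt.dayofweek  # Monday = 0
clean['HOUR'] = clean['TIME'].str.split(':').str.get(0).astype(int)

# One indicator (dummy) column per borough, e.g. for a regression.
dummies = pd.get_dummies(clean['BOROUGH'], prefix='BOROUGH')
clean = pd.concat([clean, dummies], axis=1)
```

Parsing the date once with `pd.to_datetime` is only one option; splitting the string on "/" gives the same year and month for dates in this format.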

**Write a short section that discusses the dataset stats.** 

Our main dataset was collected between July 2012 and April 2017. The size of the dataset is 225 MB. It has a little over a million rows (1,002,838 to be specific), each of which corresponds to a specific vehicle collision incident. Further, the dataset has 29 columns, each representing a variable. The dataset contains the following information:
- Temporal data: Information on when the accident took place. Access to year, month, weekday, and hour.
- Spatial data: Information on where the accident took place. Access to borough, zipcode, coordinates (latitude/longitude), and streetnames. 
- Damage data: Information on how many people were injured or killed in each accident and whether they were pedestrians, cyclists or motorists. 
- Causes of incident: Information on the factors contributing to the accidents, i.e. why they took place. Examples are driver inattention, steering failure or driving under the influence of alcohol.   
- Type of vehicle: Information on what types of vehicle were involved in the incident (e.g. station wagon, passenger vehicle).

Apart from the main dataset, secondary contextual datasets have been incorporated. This was necessary as the original dataset lacked several important components of the context of the accidents. Most importantly, these were weather conditions (accidents happen more frequently under specific weather conditions) and population information. The latter was included in order to check whether a high frequency of accidents in one area was simply the result of a large number of people living there. The two supplementary datasets contain the following data: 
- Weather conditions: Information on temperature and precipitation on specific dates. 
- Population data: New York City's population by boroughs. 

**Scatterplot for D3 Visualization.**
In the following lines of code, you can find the data preparation for the interactive scatterplot on the site under the page 'Statistics'.  

In [None]:
#Importing packages and data
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
df = pd.read_csv('traffic_data.csv')

In [None]:
#Splitting data
df['YEAR'] = df['DATE'].str.split("/").str.get(2).astype(int)

In [None]:
#Making four different datasets. Each representing a year. 
df_2013 = df[df['YEAR'] ==2013]
df_2014 = df[df['YEAR'] ==2014]
df_2015 = df[df['YEAR'] ==2015]
df_2016 = df[df['YEAR'] ==2016]

In [None]:
#Counting the occurrences of total accidents. 
pd.value_counts(df_2013['BOROUGH'].values, sort=False)
pd.value_counts(df_2014['BOROUGH'].values, sort=False)
pd.value_counts(df_2015['BOROUGH'].values, sort=False)
pd.value_counts(df_2016['BOROUGH'].values, sort=False)

In [None]:
totalaccidents2013 = pd.value_counts(df_2013['BOROUGH'].values, sort=False)
totalaccidents2013 = totalaccidents2013.sort_index()
totalaccidents2013_test = np.array([totalaccidents2013])

totalaccidents2014 = pd.value_counts(df_2014['BOROUGH'].values, sort=False)
totalaccidents2014 = totalaccidents2014.sort_index()
totalaccidents2014_test = np.array([totalaccidents2014])

totalaccidents2015 = pd.value_counts(df_2015['BOROUGH'].values, sort=False)
totalaccidents2015 = totalaccidents2015.sort_index()
totalaccidents2015_test = np.array([totalaccidents2015])

totalaccidents2016 = pd.value_counts(df_2016['BOROUGH'].values, sort=False)
totalaccidents2016 = totalaccidents2016.sort_index()
totalaccidents2016_test = np.array([totalaccidents2016])

In [None]:
totalaccidents2013_test = totalaccidents2013_test[0]
totalaccidents2014_test = totalaccidents2014_test[0]
totalaccidents2015_test = totalaccidents2015_test[0]
totalaccidents2016_test = totalaccidents2016_test[0]

In [None]:
df_2013_accidents_test = pd.DataFrame(totalaccidents2013_test)
df_2014_accidents_test = pd.DataFrame(totalaccidents2014_test)
df_2015_accidents_test = pd.DataFrame(totalaccidents2015_test)
df_2016_accidents_test = pd.DataFrame(totalaccidents2016_test)

In [None]:
df_2013_accidents_test[1] = df_2013_accidents_test[0]
df_2013_accidents_test[0] = totalaccidents2013.keys()

df_2014_accidents_test[1] = df_2014_accidents_test[0]
df_2014_accidents_test[0] = totalaccidents2014.keys()

df_2015_accidents_test[1] = df_2015_accidents_test[0]
df_2015_accidents_test[0] = totalaccidents2015.keys()

df_2016_accidents_test[1] = df_2016_accidents_test[0]
df_2016_accidents_test[0] = totalaccidents2016.keys()

In [None]:
#Adding columns
df_2013_accidents_test.columns =['BOROUGH', 'TOTAL ACCIDENTS']
df_2014_accidents_test.columns =['BOROUGH', 'TOTAL ACCIDENTS']
df_2015_accidents_test.columns =['BOROUGH', 'TOTAL ACCIDENTS']
df_2016_accidents_test.columns =['BOROUGH', 'TOTAL ACCIDENTS']

In [None]:
#Adding a dataset only with injuries
df_injured = df[df['NUMBER OF PERSONS INJURED']!=0]

In [None]:
#Making four different injury datasets. Each representing a year.
df_2013_focus = df_injured[df_injured['YEAR'].isin([2013])]
df_2014_focus = df_injured[df_injured['YEAR'].isin([2014])]
df_2015_focus = df_injured[df_injured['YEAR'].isin([2015])]
df_2016_focus = df_injured[df_injured['YEAR'].isin([2016])]

In [None]:
#Counting instances of injuries. 
pd.value_counts(df_2013_focus['BOROUGH'].values, sort=False)
pd.value_counts(df_2014_focus['BOROUGH'].values, sort=False)
pd.value_counts(df_2015_focus['BOROUGH'].values, sort=False)
pd.value_counts(df_2016_focus['BOROUGH'].values, sort=False)

In [None]:
totalinjuries2013 = pd.value_counts(df_2013_focus['BOROUGH'].values, sort=False)
totalinjuries2013 = totalinjuries2013.sort_index()
totalinjuries2013_test = np.array([totalinjuries2013])

totalinjuries2014 = pd.value_counts(df_2014_focus['BOROUGH'].values, sort=False)
totalinjuries2014 = totalinjuries2014.sort_index()
totalinjuries2014_test = np.array([totalinjuries2014])

totalinjuries2015 = pd.value_counts(df_2015_focus['BOROUGH'].values, sort=False)
totalinjuries2015 = totalinjuries2015.sort_index()
totalinjuries2015_test = np.array([totalinjuries2015])

totalinjuries2016 = pd.value_counts(df_2016_focus['BOROUGH'].values, sort=False)
totalinjuries2016 = totalinjuries2016.sort_index()
totalinjuries2016_test = np.array([totalinjuries2016])

In [None]:
df_2013_accidents_test['TOTAL INJURIES'] = totalinjuries2013.values
df_2014_accidents_test['TOTAL INJURIES'] = totalinjuries2014.values
df_2015_accidents_test['TOTAL INJURIES'] = totalinjuries2015.values
df_2016_accidents_test['TOTAL INJURIES'] = totalinjuries2016.values

In [None]:
#Making the ratio: How many injuries out of total accidents?
df_2013_accidents_test['RATIO'] = df_2013_accidents_test['TOTAL INJURIES']/df_2013_accidents_test['TOTAL ACCIDENTS']
df_2014_accidents_test['RATIO'] = df_2014_accidents_test['TOTAL INJURIES']/df_2014_accidents_test['TOTAL ACCIDENTS']
df_2015_accidents_test['RATIO'] = df_2015_accidents_test['TOTAL INJURIES']/df_2015_accidents_test['TOTAL ACCIDENTS']
df_2016_accidents_test['RATIO'] = df_2016_accidents_test['TOTAL INJURIES']/df_2016_accidents_test['TOTAL ACCIDENTS']

In [None]:
#Adding population: Making sure to control for population, the amount of people living there
POPULATION = np.array([1385108, 2504700, 1585873, 2230722, 468730]) #This data is from NYC Open data. See link in the beginning of the notebook
df_2013_accidents_test['POPULATION'] = POPULATION
df_2014_accidents_test['POPULATION'] = POPULATION
df_2015_accidents_test['POPULATION'] = POPULATION
df_2016_accidents_test['POPULATION'] = POPULATION

In [None]:
#Injuries and accidents per citizen in each borough. 
df_2013_accidents_test['Accident per 100000 inhabitant'] = df_2013_accidents_test['TOTAL ACCIDENTS']/df_2013_accidents_test['POPULATION']*100000
df_2013_accidents_test['Injury per 100000 inhabitant'] = df_2013_accidents_test['TOTAL INJURIES']/df_2013_accidents_test['POPULATION']*100000

df_2014_accidents_test['Accident per 100000 inhabitant'] = df_2014_accidents_test['TOTAL ACCIDENTS']/df_2014_accidents_test['POPULATION']*100000
df_2014_accidents_test['Injury per 100000 inhabitant'] = df_2014_accidents_test['TOTAL INJURIES']/df_2014_accidents_test['POPULATION']*100000

df_2015_accidents_test['Accident per 100000 inhabitant'] = df_2015_accidents_test['TOTAL ACCIDENTS']/df_2015_accidents_test['POPULATION']*100000
df_2015_accidents_test['Injury per 100000 inhabitant'] = df_2015_accidents_test['TOTAL INJURIES']/df_2015_accidents_test['POPULATION']*100000

df_2016_accidents_test['Accident per 100000 inhabitant'] = df_2016_accidents_test['TOTAL ACCIDENTS']/df_2016_accidents_test['POPULATION']*100000
df_2016_accidents_test['Injury per 100000 inhabitant'] = df_2016_accidents_test['TOTAL INJURIES']/df_2016_accidents_test['POPULATION']*100000

In [None]:
#Exporting the dataframes to csv files. These are further used in D3
df_2013_accidents_test.to_csv('trafficdata2013.csv')
df_2014_accidents_test.to_csv('trafficdata2014.csv')
df_2015_accidents_test.to_csv('trafficdata2015.csv')
df_2016_accidents_test.to_csv('trafficdata2016.csv')
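
The cells above repeat the same steps once per year. As a sketch only, the same table can be built in one pass with `groupby`. Here `borough_year_table` is a hypothetical helper name, the population figures are the ones used above, and the input is assumed to contain the `YEAR`, `BOROUGH` and `NUMBER OF PERSONS INJURED` columns. The toy rows at the end are invented.

```python
import pandas as pd

# Borough populations as used in the notebook (NYC Open Data).
POPULATION = pd.Series({
    'BRONX': 1385108, 'BROOKLYN': 2504700, 'MANHATTAN': 1585873,
    'QUEENS': 2230722, 'STATEN ISLAND': 468730,
})

def borough_year_table(df):
    """Count accidents and injury accidents per (YEAR, BOROUGH)."""
    df = df.dropna(subset=['BOROUGH'])
    injured = df['NUMBER OF PERSONS INJURED'] != 0
    table = pd.DataFrame({
        'TOTAL ACCIDENTS': df.groupby(['YEAR', 'BOROUGH']).size(),
        'TOTAL INJURIES': df[injured].groupby(['YEAR', 'BOROUGH']).size(),
    }).fillna(0).reset_index()
    # Share of accidents that involve at least one injury.
    table['RATIO'] = table['TOTAL INJURIES'] / table['TOTAL ACCIDENTS']
    # Control for the number of people living in each borough.
    table['POPULATION'] = table['BOROUGH'].map(POPULATION)
    table['Accident per 100000 inhabitant'] = (
        table['TOTAL ACCIDENTS'] / table['POPULATION'] * 100000)
    table['Injury per 100000 inhabitant'] = (
        table['TOTAL INJURIES'] / table['POPULATION'] * 100000)
    return table

# Invented rows, only to show the shape of the result.
toy = pd.DataFrame({
    'YEAR': [2013, 2013, 2013, 2014],
    'BOROUGH': ['BRONX', 'BRONX', 'QUEENS', 'BRONX'],
    'NUMBER OF PERSONS INJURED': [0, 2, 0, 1],
})
table = borough_year_table(toy)
```

Because the borough name is carried along as a key, the population is matched by name rather than by position, which avoids relying on the boroughs being in the same order in every year. One CSV file per year could then be written by looping over `table.groupby('YEAR')`.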

# 3. Theory and theoretical tools
**Describe which machine learning tools you use and why the tools you've chosen are right for the problem you're solving.**

We used support vector machines (SVM), decision trees, and multiple linear regression for our data analysis. In the following, we will go through each method one by one. 

**Support Vector Machines**

In order to classify whether or not an accident would involve one or more injured persons, we chose the commonly used supervised classifier Support Vector Machines (SVM) for binary classification. SVM works well with datasets of lower dimensionality, and with a radial basis function kernel it is able to learn a non-linear classification.

In order to guard against overfitting, we trained the SVM with different hyperparameters and tested and validated it on datasets with different compositions of accidents.

We used the Python library sklearn's built-in support vector classifier to perform the classification.
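
To make the radial basis function kernel mentioned above concrete, here is a minimal plain-Python sketch of the kernel value for two feature vectors. The function name `rbf_kernel` and the example points are ours, chosen for illustration; `gamma` corresponds to the hyperparameter of the same name in sklearn's `SVC`, which computes the kernel internally.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical points have the maximal similarity of 1.
same = rbf_kernel([40.7, -73.9], [40.7, -73.9])

# Points at squared distance 25; a smaller gamma makes similarity decay slower.
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1)
```

A larger `gamma` makes each training point's influence more local, which tends to fit the training data more closely, so it is one of the settings worth varying when guarding against overfitting.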

**Decision Trees**

In order to construct the best model for binary classification, we supplement the Support Vector Machine with Decision Trees. It is here our ambition to see which of the models does the best job of predicting whether an accident causes an injury or not. Decision trees are a non-parametric supervised learning method used for classification and regression. The idea behind decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features (Grus 2015:201f). 
- Why: We have chosen decision trees as they are, first of all, fairly easy to implement, understand and interpret. In relation to this, Grus (2015:202) points out that the process by which they reach their prediction is quite transparent. Secondly, they can easily handle both numeric and categorical features. This is important in our dataset, where we have many categorical/qualitative variables. 

The main pitfall when working with decision trees is that they are very prone to **overfitting**: they don’t generalize well to unseen data (Ibid.:202). One of the reasons is that decision trees are very good at fitting themselves to the training data. In order to address this problem, we apply ‘random forest’ classification, a technique used to reduce overfitting when working with decision trees. The idea is that instead of building one decision tree, multiple trees are built on different samples of the data, and their predictions are combined into the final classification (Ibid.:211). 
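
The voting idea behind a random forest can be sketched in a few lines of plain Python. This is an illustration of the principle only, not sklearn's implementation: the three "trees" are invented one-rule classifiers on a hypothetical feature vector `(hour, multiple_vehicles)`, and all function names are ours.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree trains on one such sample."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Return the most common label among the trees' predictions."""
    return Counter(labels).most_common(1)[0][0]

def forest_predict(trees, x):
    """Each tree is any callable mapping a feature vector to a label."""
    return majority_vote([tree(x) for tree in trees])

# Three hypothetical 'trees' (invented rules, not results from our data).
trees = [
    lambda x: 1 if x[0] >= 22 else 0,
    lambda x: 1 if x[1] == 0 else 0,
    lambda x: 1 if x[0] >= 20 and x[1] == 0 else 0,
]

prediction = forest_predict(trees, (23, 0))
```

Because every tree sees a slightly different sample, the individual trees overfit in different ways, and the vote tends to average those errors out.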

Below, we have included the code for the decision tree analysis. 

In [1]:
#Importing data and packages for the decision tree analysis. 
import pandas as pd
import matplotlib.pyplot as plt
import sys 
import matplotlib
from pandas import DataFrame, read_csv
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import datetime as dt
from datetime import date

df = pd.read_csv('traffic_data.csv')


In [2]:
#Setting up the data: Splitting the data into smaller parts (years, weekdays, months etc.) and creating variables for the analysis
df["DATETIME"] = pd.to_datetime(df['DATE'])
df['HOUR'] = df['TIME'].str.split(":").str.get(0).astype(int)
df['YEAR'] = df['DATE'].str.split("/").str.get(2).astype(int)
df['MONTH'] = df['DATE'].str.split("/").str.get(0).astype(int)
df['WEEKDAY'] = df['DATETIME'].dt.dayofweek
df = df.rename(columns={'CONTRIBUTING FACTOR VEHICLE 1': 'WHY'})
df['Multiple_Vehicles'] = df['CONTRIBUTING FACTOR VEHICLE 2'].fillna(0)
df.loc[df['Multiple_Vehicles']!=0, 'Multiple_Vehicles'] =1
df.loc[df['NUMBER OF PERSONS INJURED']!= 0, 'NUMBER OF PERSONS INJURED'] = 1 

In [3]:
#We exclude missing observations from the dataset
df = df.dropna(subset=['BOROUGH','HOUR', 'ZIP CODE', 'VEHICLE TYPE CODE 1', 'WHY', 'MONTH', 'DATE'])

In [4]:
#Specific datapreperation for the Decision Tree Algorithm. It needs numerical versions of categorical data. 
stacked_borough = df[['BOROUGH']].stack()
stacked_zipcode = df[['ZIP CODE']].stack()
stacked_vehicle = df[['VEHICLE TYPE CODE 1']].stack()
stacked_factor = df[['WHY']].stack()
stacked_date = df[['DATE']].stack()
df[['BOROUGH_number']] = pd.Series(stacked_borough.factorize()[0], index=stacked_borough.index).unstack()
df[['ZIPCODE_number']] = pd.Series(stacked_zipcode.factorize()[0], index=stacked_zipcode.index).unstack()
df[['VEHICLE_number']] = pd.Series(stacked_vehicle.factorize()[0], index=stacked_vehicle.index).unstack()
df[['WHY_number']] = pd.Series(stacked_factor.factorize()[0], index=stacked_factor.index).unstack()
df[['DATE_number']] = pd.Series(stacked_date.factorize()[0], index=stacked_date.index).unstack() 
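
For a single column, the same kind of integer coding can be obtained more directly with `pd.factorize`. The sketch below uses an invented toy column to show what the codes look like; it is offered as a simpler alternative, not as the code that produced our results.

```python
import pandas as pd

toy = pd.DataFrame({'BOROUGH': ['QUEENS', 'BRONX', 'QUEENS', 'BROOKLYN']})

# factorize returns (codes, uniques); the codes number the categories
# in order of first appearance.
codes, uniques = pd.factorize(toy['BOROUGH'])
toy['BOROUGH_number'] = codes
```

Note that such codes are arbitrary labels. A model that treats them as ordered numbers may read structure into them that is not there, which is one reason to consider dummy variables instead.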

In [5]:
#We sample randomly for the full dataset. 
df_1 = df.sample(n=100000) #This is the dataframe from which we train the model.
df_2 = df.sample(n=100000) #This is one of the dataframes on which we test the model.
df_3 = df.sample(n=100000) #This is another one of the dataframes on which we test the model.

In [7]:
#Making the balanced dataset for testing the model. 50% injuries, 50% non-injuries
df_injured_1 = df_2[df_2['NUMBER OF PERSONS INJURED'] >0] #Dataset only with injuries
df_notinjured = df_2[df_2['NUMBER OF PERSONS INJURED'] == 0] #Dataset without injuries

len(df_injured_1) #18607 injured in our sample; the count varies between runs

df_notinjured_balances = df_notinjured.sample(n=len(df_injured_1)) #Sampling an equal number of non-injury observations
df_balanced = pd.concat([df_injured_1, df_notinjured_balances], ignore_index=True) #Merging the two datasets.

In [None]:
#Making a dataset only with injuries for testing the model. 
df_injured = df_3[df_3['NUMBER OF PERSONS INJURED'] >0]

In [None]:
#Splitting the datasets into test/train using sklearn's train_test_split. 

#Dataframe with balanced dataset
df_train_balanced, df_test_balanced = train_test_split(df_balanced, test_size = 0.1, random_state=42)

#Dataframe with injuries 
df_train_injury, df_test_injury = train_test_split(df_injured, test_size = 0.1, random_state=42)

#Dataframe with the whole dataset
df_train, df_test = train_test_split(df_1, test_size = 0.1, random_state=42)

In [None]:
#Making arrays that the decision tree function supports. We train on the full dataset. 
X_accidents = np.array([df_train['BOROUGH_number'], df_train['WHY_number'], df_train['VEHICLE_number'], df_train['Multiple_Vehicles'], df_train['HOUR'], df_train['WEEKDAY'], df_train['ZIPCODE_number'], df_train['DATE_number']]).T
Y_accidents = np.array([df_train['NUMBER OF PERSONS INJURED']])

In [None]:
#Fitting the model to the data
model = RandomForestClassifier(n_estimators=100)
model.fit(X_accidents, Y_accidents.T)

In [None]:
#Evaluation: How well did our model do?
X_accidents_full_test = np.array([df_test['BOROUGH_number'], df_test['WHY_number'], df_test['VEHICLE_number'], df_test['Multiple_Vehicles'], df_test['HOUR'], df_test['WEEKDAY'], df_test['ZIPCODE_number'], df_test['DATE_number']]).T
X_accidents_balanced_test = np.array([df_test_balanced['BOROUGH_number'], df_test_balanced['WHY_number'], df_test_balanced['VEHICLE_number'], df_test_balanced['Multiple_Vehicles'], df_test_balanced['HOUR'], df_test_balanced['WEEKDAY'], df_test_balanced['ZIPCODE_number'], df_test_balanced['DATE_number']]).T
X_accidents_injuries_test = np.array([df_test_injury['BOROUGH_number'], df_test_injury['WHY_number'], df_test_injury['VEHICLE_number'], df_test_injury['Multiple_Vehicles'], df_test_injury['HOUR'], df_test_injury['WEEKDAY'], df_test_injury['ZIPCODE_number'], df_test_injury['DATE_number']]).T

Y_accidents_test_full = np.array([df_test['NUMBER OF PERSONS INJURED']])
Y_accidents_test_balanced = np.array([df_test_balanced['NUMBER OF PERSONS INJURED']])
Y_accidents_test_injuries = np.array([df_test_injury['NUMBER OF PERSONS INJURED']])

model.predict(X_accidents_full_test)
model.predict(X_accidents_balanced_test)
model.predict(X_accidents_injuries_test)

print("Fraction of correct predictions for the models")
print("Accuracy for the full dataset: %0.4f " % model.score(X_accidents_full_test, Y_accidents_test_full.T))
print("Accuracy for the balanced dataset: %0.4f " % model.score(X_accidents_balanced_test, Y_accidents_test_balanced.T))
print("Accuracy for the dataset with only injuries: %0.4f " % model.score(X_accidents_injuries_test, Y_accidents_test_injuries.T))

In [None]:
#Evaluating the model on the training data
model.predict(X_accidents)
print("Fraction of correct predictions on the training data for the model")
print("Accuracy_full: %0.4f " % model.score(X_accidents, Y_accidents.T))

**Talk about your model selection. How did you split the data in to test/training. Did you use cross validation?**

In order to simulate out-of-sample data in an effort to prevent or discover overfitting, we made three datasets for training and testing our two classifiers.

- The first dataset is the regular dataset with missing zip codes removed.
- The second dataset is the first dataset resized to have the same number of accidents without injured persons as with injured persons.
- The third dataset is the first dataset with all but the accidents containing injured persons removed.

Each of these three datasets was sampled separately.
For the Decision Tree classifier we used the train_test_split() function to split these samples into training and test sets.
For the SVM classifier we used the np.split() function to split these samples into training, test and validation sets.
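
A three-way split of the kind described above amounts to shuffling the rows and cutting them at two points. The sketch below shows this with plain Python lists standing in for arrays; `three_way_split` and the 60/20/20 proportions are assumptions made for the example and not necessarily the proportions we used.

```python
import random

def three_way_split(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and cut them into train / validation / test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    first_cut = int(len(rows) * train)
    second_cut = first_cut + int(len(rows) * validation)
    return rows[:first_cut], rows[first_cut:second_cut], rows[second_cut:]

train_rows, validation_rows, test_rows = three_way_split(range(100))
```

Hyperparameters are compared on the validation part, and the test part is only used once at the end, so that the reported accuracy is not influenced by the tuning.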

For the Decision Tree classifier we used the random forest classifier, which trains each tree on a different bootstrap sample of the training data and in that sense performs a form of cross-validation in itself.
For the SVM classifier we used three different hyperparameter settings. We tried performing cross-validation using sklearn's GridSearchCV, but it did not finish even after 24 hours, so we gave up on it.

**Explain the model performance. How did you measure it? Are your results what you expected?**

We were not able to reliably predict whether an accident causes an injury using either the decision tree or the SVM classifier. The SVM classifier did the better job of the two, but even so we observe only around 50% accuracy on the balanced (50/50) dataset for all three hyperparameter settings, which suggests that the model is no better than random guessing. 
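
Why the balanced dataset is the informative benchmark can be shown with a small calculation: a classifier that always predicts the most frequent class already scores well on an unbalanced sample. The sketch below uses the class counts from the sample above (18607 injury accidents out of 100000); the function names are ours.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class (labels are 0/1)."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / float(len(y_true))

# Class counts taken from the sample above.
full_sample = [1] * 18607 + [0] * 81393
balanced_sample = [1] * 18607 + [0] * 18607

full_baseline = majority_baseline(full_sample)          # 0.81393
balanced_baseline = majority_baseline(balanced_sample)  # 0.5
```

An accuracy of around 81% on the full sample would therefore say little on its own, whereas on the balanced sample anything close to 50% is what guessing would achieve.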

We were hoping for better results, but given the lack of numeric independent variables we are not entirely surprised that we could not find a good predictor. We would guess that with more detailed (and quantified) information on alcohol consumption, driver attention, road conditions, etc., we might be able to make better predictions.

# 4. Visualizations
**Explain the visualizations you've chosen.**

We're working with three different kinds of visualizations on the site: scatterplots, geomaps (in the form of heatmaps and dot/point maps) and more ordinary barcharts. The visualizations and their design have been chosen based on some specific considerations. Some are common to them all, some apply to specific visualizations. 

First of all, an important reflection has been about the audience. If visualizations are partly about communicating results (Healy & Moody 2014:106; Murray 2013:1), problems arise when a visual is perceived differently by an audience from the intent of the creator (Rougier et al. 2014:1). Because of this, we identified our audience as non-scientific readers early on, and designed our visualizations with this in mind. However, designing for a more general public may be a tricky case, as Rougier et al. (2014:1) point out: it demands simple figures that reveal only the most salient parts of the findings. It was thus important for us to design simple graphs that did not contain overwhelming and complex information on academic ratios or parameters, but rather sought to express relatively simple patterns in a clear and clean way. 

Secondly, we wanted some of our visualizations to be interactive, allowing the audience to toggle between different years and hover the mouse over specific datapoints to get details on demand. This is because we want to make the data open to different audiences within the general public (Murray 2013:3), and encourage engagement with the topic by allowing exploration. The latter seems easier when the images are interactive rather than static.

Thirdly, and more briefly, all of the visualizations were chosen with specific questions and messages in mind. We found that some encodings helped visualize interesting patterns in the data better than others. Implicit in these considerations was a balance between the readability of the visualizations and the inclusion of complexity (Venturini 2010:803). A good graph seems to be one which is loyal to the complexity of the data while being able to communicate it in a clear and clean manner.
  
As noted, we designed scatterplots, geomaps and barcharts for our website. Why these encodings? 
- **Scatterplots** were chosen to illustrate the development of accidents and injuries across the period 2013-2016. We found the visualization to be good at communicating the difference across the years, as it was easy to follow the movement of one point to another place in the graph when toggling. It is possible to toggle between the years and hover one's mouse over the datapoints to gain access to specific details. The scatterplots were thus designed with details-on-demand in mind. We chose to keep the axes of the scatterplot constant in order to secure comparison and allow the user to visually understand the yearly changes. 
 
- **Geomaps** were used mainly for explorative purposes. They were chosen to visualize which intersections in New York City have the most accidents and injuries. We thus found them good at approaching the question of where injuries most frequently take place. We used three distinct colors to represent what was going on in the specific locations. It was also possible for the reader to acquire extra details by hovering the mouse over the datapoints. In the databoxes, we included the addresses of the points, thus making it possible for the reader to go on Google Street View to explore the intersection in question further. Our geomaps were designed with specific color contrasts in mind in order to make the findings stand out (e.g. red/yellow/orange on black). Although our geomaps have mostly been explorative, they have also been explanatory: it sometimes seems difficult to separate the two tasks of visualization in practice.   

- **Barcharts** were used for explorative and explanatory reasons. We used them to give the user a sense of some fundamental distributions concerning time and place of the injuries. This was more explorative. We also used them for reporting some of the more salient features of our multiple regression analysis. This was more explanatory - we wanted to communicate some concrete results - although we also allowed exploration and details on demand by making the barchart interactive.    

**Why are they right for the story you want to tell?**

We found our visualizations to be right for our questions and purposes as they were good at creating an overview of specific parameters (place, time, relevant factors). They served as good tools for illustrating some of the findings of our exploratory data approach, and tell a good story of the data. To take an example, on the page with the geomaps, we 'zoomed' in with the heatmap, which later led us to produce the 'top 25' maps, which then became a reference point for the choice of a specific intersection to look into. 


# 5. Discussion
**What went well?**

Overall, we found our exploratory approach to the data to be successful. We started out with next to no knowledge of traffic accidents in NYC, and ended up with an understanding of when injuries tend to happen more frequently (as illustrated in the descriptive statistics), where they happen most often (as shown with the geomaps), and which factors are of significant importance (shown with the multiple regression and binary classifiers). Our aim was to shed some light on whether there exist any patterns in the traffic accident dataset, and we found this to be the case. That said, not every pattern has been surprising. To take an example, it was not shocking to find that injuries tend to happen more frequently in the rush hour periods. However, we also found several interesting patterns. Our geographical analysis pointed out that the intersections causing many injuries are not always the intersections with the most accidents. Some places are thus more dangerous to humans. Also, based on our regression analysis, solo accidents are more likely to cause an injury than accidents with multiple vehicles, which may seem a bit counter-intuitive at first.    
All in all, we had some interesting findings and have localized some areas which would be interesting to look into further in another study. 

**What is still missing? What could be improved? Why?**

Several things are still missing and could be improved. We'll point to a few areas that need more light shed on them. 

**1. Data problems**. In the dataset, we found a large number of missing values in specific variables such as 'vehicle type' and 'contributing factor'. There is thus a lack of information on why these accidents are happening. This, we believe, causes imprecision and uncertainty, as we either refrain from trying to explain what goes on in specific intersections or attempt hypothetical guesses.

**2. Qualitative assessments**. As a consequence of the first point, it could be interesting to see qualitative studies of the intersections causing injuries which we have pointed to in this study. This may be done through observation studies or the monitoring of video material. We believe this will give a deeper understanding of the circumstances surrounding the injuries.    

**3. Contextual information**: So far, we've pointed to a few research designs that can illustrate other aspects of the traffic safety environment in NYC. We would also like to highlight some extra variables or datasets that may increase the accuracy of our estimates and improve our models. These are mostly contextual. It could thus be interesting to see how factors such as road conditions and speed limits correlate with injuries. Personal data on the persons involved in the accidents could also be informative, as could more detailed descriptions of the vehicles. On our site, we merely scratched the surface concerning weather conditions as a source of contextual data. It could be interesting to see how more detailed weather datasets would correlate with accidents.  

**4. Machine Learning**: As a last point, we would like to discuss the potential for machine learning. First of all, with the dataset, we had a difficult time finding out exactly what to predict: Location? Time of the day? We ended up predicting whether an accident caused an injury or not. This scenario is, however, a bit arbitrary, as it is unclear whether the emergency agencies already know if there is an injury when the accident is reported. Secondly, as for obtaining better accuracy from our models, there is still much that can be done and improved. As we argued above, several variables and features should be added to the dataset. This rise in dimensionality could, we hope, lead to more grounded models with a higher accuracy. Thirdly, following on from this, there is a need for more quantitative variables, which may also increase the accuracy of our predictions. Many of the variables in the dataset are qualitative, and these may successfully be supplemented by more quantitative variables.  


# Bibliography
Clowes, E. S. (1913): Street Accidents, New York City. Publications of the American Statistical Association, 13(102), 449-456.

Grus, Joel (2015): Data Science from Scratch. Sebastopol: O'Reilly Media. 

Healy, K., & Moody, J. (2014): Data Visualization in Sociology. Annual Review of Sociology, 40, 105-128.

Murray, Scott (2013): Interactive Data Visualization for the Web. Sebastopol: O'Reilly Media. 

Rougier NP, Droettboom M, Bourne PE (2014): Ten Simple Rules for Better Figures. PLoS Comput Biol 10(9): e1003833. doi:10.1371/journal.pcbi.1003833

Venturini, Tommaso (2010): “Building on Faults: How to Represent Controversies with Digital Methods”, Public Understanding of Science 21(7): 796-812. 

Wooldridge, Jeffrey M. (2013): Introductory Econometrics: A Modern Approach. 5th ed. South-Western Cengage Learning. 