# Part 1: Questions to text and lectures.

## A) Segal and Heer paper

### What is the *Oxford English Dictionary's* definition of a narrative?

The Oxford English Dictionary defines narrative as “an account of a series of events, facts, etc., given in order and with the establishing of connections between them.”

### What is your favorite visualization among the examples in section 3? Explain why in a few words.

My favorite visualization from section 3 is “Gapminder Human Development Trends”. The data is presented clearly and never becomes unmanageable or confusing. The visualizations cover different sections of the presentation and are split across multiple screens, which can all be accessed through the progress bar. In each section, the user is guided through the visualization, and key observations are made clear with annotations, highlighting and animated transitions. The user is also free to stop, look around, and inspect more specific observations using the timeline or details-on-demand functionality.

### What's the point of Figure 7?

The goal of the paper is to generalize from examples to identify salient design dimensions and in the process clarify how narrative visualization differs from other forms of storytelling. Figure 7 is made to give an overview of the most commonly used design principles and depicts the unique patterns for narrative visualizations. 

### Use Figure 7 to find the most common design choice within each category for the Visual narrative and Narrative structure (the categories within visual narrative are 'visual structuring', 'highlighting', etc).

There are three categories within “Visual Narrative”, namely “Visual structuring”, “Highlighting” and “Transition Guidance”. Their most common design choices are, respectively, “Consistent Visual Platform”, “Feature Distinction” and “Object Continuity”, as seen in the table below:

| Category            | Most common choice         |
|---------------------|----------------------------|
| Visual structuring  | Consistent Visual Platform |
| Highlighting        | Feature Distinction        |
| Transition Guidance | Object Continuity          |

There are also three categories within “Narrative Structure”, namely “Ordering”, “Interactivity” and “Messaging”. Their most common design choices are, respectively, “User Directed Path”, “Filtering / Selection / Search” and “Captions / Headlines”, as seen below:

| Category      | Most common choice             |
|---------------|--------------------------------|
| Ordering      | User Directed Path             |
| Interactivity | Filtering / Selection / Search |
| Messaging     | Captions / Headlines           |

### Check out Figure 8 and section 4.3. What is your favorite genre of narrative visualization? Why? What is your least favorite genre? Why?

We prefer the slideshow genre, because the visualizations can be split up and displayed on separate screens, each under a captivating title. It gives a clear overview of the visualizations and their ordering, and makes it easy to investigate a specific category or visualization. In the slideshow genre an order is suggested, but the reader is not limited to it.

Our least favorite genre is the magazine style, which can be categorized as having author-driven design properties. Everything is displayed on a single page with heavy messaging, there is typically no interactivity, and the reader is restricted to the defined linear order. This style relies on the author knowing what the reader wants to see, which results in a very constrained view for the reader.

## B) Explanatory data visualisation

### What are the three key elements to keep in mind when you design an explanatory visualization?

The three key points are “Start with a question” (what do we wish to communicate through our results?), “Allow exploration” (we let the users investigate the results themselves, by implementing visualizations following D3) and “Know your readers” (we cater to the audience).

* In the video I talk about (1) *overview first*,  (2) *zoom and filter*,  (3) *details on demand*.
  - Go online and find a visualization that follows these principles (don't use one from the video).
  - Explain how it achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.

### Example of visualisation
The visualization at https://worldpoverty.io/map clearly demonstrates how to implement a visualization following D3.
The overview depicts the countries currently dealing with poverty, clearly marked on a world map. Above and below the map, it is clear which factors can be used for filtering, and a timeline makes it possible to investigate previous years and forecasts of the future following the SDG target.

![Overview](./images/world_overview.png)

The zoom functionality can be seen when clicking on a country.

![Zoom](./images/world_zoom.png)

The filtering functionality can be experienced by for example choosing an age interval.

![Filtering](./images/world_filtering.png)

The details-on-demand functionality can be seen when hovering the mouse over a country.

![Details-on-demand](./images/world_demand.png)

### How is explanatory data analysis different from exploratory data analysis?
In short, an exploratory analysis is about exploring the data to find insights, while explanatory is about sharing those insights with others.

To elaborate, an **exploratory** analysis gives insight into the dataset, helps us understand it better, and gives us ideas of what to investigate further. Typically, we make an exploratory analysis hoping that the visualization can confirm or disprove a working hypothesis, or simply point us in the direction to go next.

An **explanatory** analysis, on the other hand, is tailored to a specific audience to whom we want to convey our findings.

# Part 2: Random forest and weather

## A) Random forest binary classification

Below is the code for Part 2A, followed by an answer to the questions.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

raw_crimes = pd.read_csv("Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv")

In [2]:
import numpy as np 
from datetime import datetime

focuscrimes = ["VEHICLE THEFT", "FRAUD"]

crimes = raw_crimes[raw_crimes["Category"].isin(focuscrimes)].copy()

In [None]:
# Features
cat = "category"
dt = "datetime"
year = "year"
month = "month"
day = "day"
hour = "hour"
hour_of_month = "hour_of_month"
hour_of_week = "hour_of_week"
day_of_week = "dayofweek"
pddistrict = "pddistrict"

crimes.columns = crimes.columns.str.lower()
crimes[dt] = pd.to_datetime(crimes["date"] + " " + crimes["time"])

crimes[year] = crimes[dt].dt.year
crimes[month] = crimes[dt].dt.month
crimes[day] = crimes[dt].dt.day
crimes[hour] = crimes[dt].dt.hour

# Vectorized: computed on whole columns instead of row by row with apply
crimes[hour_of_month] = crimes[dt].dt.day * 24 + crimes[hour]
crimes[hour_of_week] = crimes[dt].dt.dayofweek * 24 + crimes[hour]

In [None]:
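# Series.between includes both endpoints by default
crimes_in_range = crimes[crimes[year].between(2012, 2017)]
# Note: despite its name, `burglary` holds the VEHICLE THEFT rows (focuscrimes[0])
burglary = crimes_in_range[crimes_in_range[cat] == focuscrimes[0]]
fraud = crimes_in_range[crimes_in_range[cat] == focuscrimes[1]]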

print(burglary.shape)
print(fraud.shape)

In [None]:
sample_size = 15000

# Create balanced data set
type1 = burglary.sample(sample_size)
type2 = fraud.sample(sample_size)

crime_df = pd.concat([type1, type2], ignore_index=True).copy()

In [None]:
crime_features = [cat, day_of_week, month, hour, pddistrict]
crime_dummies = [day_of_week, pddistrict]

# Let's make the crime features
Xc = crime_df[crime_features].copy()

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
Xc[cat] = le.fit_transform(Xc[cat])

# One-hot encode the categorical data
Xc = pd.get_dummies(Xc, columns=crime_dummies)

# Labels will be the values we want to predict
yc = np.array(Xc[cat])

# We remove the labels from the crime dataframe to get all the values we need for the features
Xc = Xc.drop(cat, axis=1)

# We save the feature names for later
Xc_list = list(Xc.columns)

# Convert the dataframe to a numpy array so we can work with the features
Xc = np.array(Xc)

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size = 0.25, random_state = 42)

print('Training Features Shape:', Xc_train.shape)
print('Training Labels Shape:', yc_train.shape)
print('Testing Features Shape:', Xc_test.shape)
print('Testing Labels Shape:', yc_test.shape)
print('\n')

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

cclf_max = RandomForestClassifier(n_estimators=99, random_state=42)
cclf_max.fit(Xc_train, yc_train)

print("Average Tree Depth:", round(np.mean([estimator.get_depth() for estimator in cclf_max.estimators_]), 2))

ctrain_pred = cclf_max.predict(Xc_train)
print("Train")
# With 0/1 labels, the mean absolute error equals the misclassification rate
print('- Mean Absolute Error:', round(mean_absolute_error(yc_train, ctrain_pred), 2))
print("- Accuracy: ", round(100 * cclf_max.score(Xc_train, yc_train)), "%")

ctest_pred = cclf_max.predict(Xc_test)
print("Test")
print('- Mean Absolute Error:', round(mean_absolute_error(yc_test, ctest_pred), 2))
print("- Accuracy: ", round(100 * cclf_max.score(Xc_test, yc_test)), "%\n")


cclf = RandomForestClassifier(n_estimators=99, random_state=42, max_depth=3)
cclf.fit(Xc_train, yc_train)

print("Max Depth 3")

ctrain_pred = cclf.predict(Xc_train)
print("Train")
# With 0/1 labels, the mean absolute error equals the misclassification rate
print('- Mean Absolute Error:', round(mean_absolute_error(yc_train, ctrain_pred), 2))
print("- Accuracy: ", round(100 * cclf.score(Xc_train, yc_train)), "%")

ctest_pred = cclf.predict(Xc_test)
print("Test")
print('- Mean Absolute Error:', round(mean_absolute_error(yc_test, ctest_pred), 2))

# Crime accuracy (reused later for the comparison with the weather model)
cacc = cclf.score(Xc_test, yc_test)
print("- Accuracy: ", round(100 * cacc), "%")

## Part 2A

**Did you balance the training data? What are the pros/cons of balancing?**

The dataset is balanced by drawing 15000 random samples from each crime category (`sample_size` in the code above), so that both crimes are equally represented and the classifier has no reason to favor one over the other.
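
As a rough illustration of the balancing step, the sketch below downsamples every class to the same size using only the standard library. The helper `balance_by_downsampling` and the toy rows are hypothetical and not part of the notebook, which uses `DataFrame.sample` instead.

```python
import random
from collections import Counter


def balance_by_downsampling(rows, label_key, per_class, seed=42):
    """Draw `per_class` rows at random (without replacement) from every class."""
    rng = random.Random(seed)

    # Group the rows by their class label
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    balanced = []
    for label, members in by_class.items():
        if len(members) < per_class:
            raise ValueError(f"class {label!r} has only {len(members)} rows")
        balanced.extend(rng.sample(members, per_class))

    # Shuffle so the classes are not stored in blocks
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 8 vehicle thefts and 3 frauds
toy_rows = (
    [{"category": "VEHICLE THEFT", "hour": h} for h in range(8)]
    + [{"category": "FRAUD", "hour": h} for h in range(3)]
)

balanced_rows = balance_by_downsampling(toy_rows, "category", per_class=3)
class_counts = Counter(row["category"] for row in balanced_rows)
print(class_counts)
```

Downsampling keeps the 50% baseline easy to interpret, at the cost of discarding rows from the larger class.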

**Do you think your model is overfitting?**

Initially, when the classifier had no maximum depth, the training accuracy was near 89% while the test accuracy was 52-53%. Together with an average tree depth of around 45 on a dataset with 18 features, it seems safe to assume that the model was overfitting: it clearly did not generalize well from the training data to the test data.

With a maximum depth of 3, however, a higher accuracy is reached with drastically smaller trees, which could indicate a better-fitted model.

**Did you choose to do cross-validation?**

To estimate the classifier's error, the holdout method is used: the data is split into a training set and a test set (75/25), and the test set is then used to calculate the mean accuracy of the classifier. No k-fold cross-validation was performed.
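
The holdout method evaluates on a single split. For comparison, here is a minimal sketch of how k-fold cross-validation would partition the data, assuming no shuffling. `k_fold_indices` is a hypothetical helper written for illustration; in practice, scikit-learn's `KFold` and `cross_val_score` cover the same idea.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs so that each sample is tested exactly once."""
    indices = list(range(n_samples))

    # The first (n_samples % n_folds) folds get one extra sample
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=4))
for train_idx, test_idx in folds:
    print("test:", test_idx, "train:", train_idx)
```

Averaging the accuracy over all folds gives an estimate that depends less on one particular split than a single holdout set does.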

**Which specific features did you end up using? Why?**

The features used are "DayOfWeek" and "PdDistrict", together with the month and hour derived from "Date" and "Time", because they describe the time and place of the crime.

**Which features (if any) did you one-hot encode? Why ... or why not?**

The one-hot encoded features were "DayOfWeek" and "PdDistrict", whereas the crime category was only label encoded. Both "DayOfWeek" and "PdDistrict" are categorical variables that should be converted to binary columns, which the model can use without implying an order or preferring one value over another. This is why Pandas' get_dummies function is used.

Because the crime category is what should be predicted, these are not converted to binary data, but are just given a numeric representation using Sklearn's LabelEncoder.
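
To make the difference between the two encodings concrete, here is a small pure-Python sketch. The functions `label_encode` and `one_hot_encode` are hypothetical stand-ins that only mimic the idea behind `LabelEncoder` and `get_dummies`, and the district values are toy data.

```python
def label_encode(values):
    """Map each distinct value to an integer code (classes sorted alphabetically)."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in values], classes


def one_hot_encode(values):
    """Turn each value into a row of 0/1 flags, one column per distinct value."""
    classes = sorted(set(values))
    rows = [[1 if value == cls else 0 for cls in classes] for value in values]
    return rows, classes


districts = ["MISSION", "BAYVIEW", "MISSION", "TARAVAL"]

codes, classes = label_encode(districts)
one_hot, columns = one_hot_encode(districts)

print(classes)  # the column order shared by both encodings
print(codes)    # one integer per row: implies an order between districts
print(one_hot)  # one flag per district: no order implied
```

The integer codes suggest that one district is "larger" than another, which is harmless for a target label but misleading for an input feature. The one-hot rows avoid that at the cost of one column per category.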

The "Date" and "Time" features are also included, in a sense. When the raw crime data is loaded into a dataframe, these columns are treated as plain strings. We want to use them to determine how time influences the crimes. To make this usable by the model, the columns are merged into a datetime column "Date_Time" where the dates are converted to their ordinal numeric values.
At the time this seemed smart, but on reflection it would probably have been better to split the datetimes into components such as year, month, day, hour and minute, since humans, and therefore crimes, follow the patterns of our Gregorian calendar more than those of a UNIX timestamp.
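
A minimal sketch of what such a split could look like with the standard library is shown below. The helper `calendar_features` is hypothetical, and the `"%m/%d/%Y %H:%M"` input format is an assumption about how the raw strings look.

```python
from datetime import datetime


def calendar_features(timestamp):
    """Split a timestamp into calendar components a tree model can branch on."""
    return {
        "year": timestamp.year,
        "month": timestamp.month,
        "day": timestamp.day,
        "hour": timestamp.hour,
        "dayofweek": timestamp.weekday(),  # Monday = 0, Sunday = 6
        "hour_of_week": timestamp.weekday() * 24 + timestamp.hour,
    }


# Assumed input format; adjust to the actual "Date" and "Time" strings
incident_time = datetime.strptime("02/14/2015 23:30", "%m/%d/%Y %H:%M")

features = calendar_features(incident_time)
print(features)

# For contrast: a single ordinal number hides the weekly and daily cycles
print(incident_time.toordinal())
```

With separate components, a split such as "hour greater than 20" applies to every day, whereas the same pattern in an ordinal value would need a separate split for each date.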

**Report accuracy. Discuss the model performance.**

The accuracy is around 57% for the Random Forest classifier, which is 7 percentage points (14% in relative terms) above the 50% baseline of the balanced dataset. This is probably not good enough for any practical application.
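
The comparison against the 50/50 baseline can be expressed either in percentage points or as a relative improvement. The short sketch below shows both, using the approximate accuracy reported above; `improvement_over_baseline` is a hypothetical helper.

```python
def improvement_over_baseline(accuracy, baseline):
    """Return the gain in percentage points and as a relative percentage."""
    absolute_points = (accuracy - baseline) * 100
    relative_percent = (accuracy / baseline - 1) * 100
    return absolute_points, relative_percent


points, relative = improvement_over_baseline(accuracy=0.57, baseline=0.50)
print(f"{points:.0f} percentage points, {relative:.0f}% relative improvement")
```

The relative figure uses the same formula as the `(wcacc / cacc) - 1` comparison in the weather cell, so the two numbers should not be confused when reporting results.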

In [None]:
weather = pd.read_csv("weather_data.csv")

# Format date and time for easy processing and training
weather[dt] = pd.to_datetime(weather["date"])
weather[year] = weather[dt].dt.year
weather[month] = weather[dt].dt.month
weather[day] = weather[dt].dt.day
weather[hour] = weather[dt].dt.hour

# Vectorized: computed on whole columns instead of row by row with apply
weather[hour_of_month] = weather[dt].dt.day * 24 + weather[hour]
weather[hour_of_week] = weather[dt].dt.dayofweek * 24 + weather[hour]

In [None]:
# Let's merge the weather and crime dataframes together!

merged_df = pd.merge(crime_df, weather, how="left", on=[year, month, day, hour, hour_of_month, hour_of_week])
# Rows without matching weather data are dropped further down via Xwc.dropna()

weather_features = ["weather", "humidity", "temperature", "pressure", "wind_speed"]
weather_dummies = ["weather"]

concatted_features = weather_features + crime_features
concatted_dummies = weather_dummies + crime_dummies

Xwc = merged_df[concatted_features].copy()
Xwc = Xwc.dropna()

# Label encode the categories
Xwc[cat] = le.fit_transform(Xwc[cat])

# One-hot encode the categorical data
Xwc = pd.get_dummies(Xwc, columns=concatted_dummies)

# Labels will be the values we want to predict
ywc = np.array(Xwc[cat])

# We remove the labels from the crime dataframe to get all the values we need for the features
Xwc = Xwc.drop(cat, axis=1)

# We save the feature names for later
Xwc_list = list(Xwc.columns)

# Convert the dataframe to a numpy array so we can work with the features
Xwc = np.array(Xwc)

# Using scikit-learn to split data into training and testing sets
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
Xwc_train, Xwc_test, ywc_train, ywc_test = train_test_split(Xwc, ywc, test_size = 0.25, random_state = 42)

clf = RandomForestClassifier(n_estimators=99, random_state=42, max_depth=7)
clf.fit(Xwc_train, ywc_train)

preds = clf.predict(Xwc_test)
print("Test")
# With 0/1 labels, the mean absolute error equals the misclassification rate
print('- Mean Absolute Error:', round(mean_absolute_error(ywc_test, preds), 2))

# Accuracy of the crime model with weather features
wcacc = clf.score(Xwc_test, ywc_test)
print("- Accuracy: ", round(100 * wcacc), "%\n")

print("Using weather data, the predictions become", round(((wcacc / cacc) - 1) * 100, 2), "% better")
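
The left join on hourly keys used in the merge cell can be sketched in plain Python as follows. The helpers `hour_key` and `left_join_weather` and the toy rows, including the temperature value, are hypothetical. The sketch only shows why crimes without a matching weather reading end up with missing values that must be dropped.

```python
def hour_key(row):
    """Join key: one calendar hour."""
    return (row["year"], row["month"], row["day"], row["hour"])


def left_join_weather(crime_rows, weather_rows):
    """Attach the weather of the matching hour to every crime (None if missing)."""
    weather_by_hour = {hour_key(w): w for w in weather_rows}
    merged = []
    for crime in crime_rows:
        match = weather_by_hour.get(hour_key(crime))
        temperature = match["temperature"] if match else None
        merged.append({**crime, "temperature": temperature})
    return merged


# Hypothetical toy rows: only the first crime has a weather reading
toy_crimes = [
    {"category": "FRAUD", "year": 2015, "month": 2, "day": 14, "hour": 23},
    {"category": "VEHICLE THEFT", "year": 2015, "month": 2, "day": 15, "hour": 1},
]
toy_weather = [
    {"year": 2015, "month": 2, "day": 14, "hour": 23, "temperature": 12.5},
]

merged_rows = left_join_weather(toy_crimes, toy_weather)

# Equivalent of dropna(): keep only rows that received a weather value
complete_rows = [row for row in merged_rows if row["temperature"] is not None]
print(len(merged_rows), len(complete_rows))
```

Because rows can be lost in this step, the weather model may be trained on fewer samples than the crime-only model, which is worth keeping in mind when comparing the two accuracies.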

## Part 2B

**Report accuracy**

The accuracy hasn't really improved much.

**Discuss how the model performance changes relative to the version with no weather data**

Not much, between 3 and 5 percentage points.

**Discuss what you have learned about crime from including weather data in your model**

That weather is probably not a good indicator for predicting crimes. Especially burglary might not be the best fit, as it

# Part 3: Data visualization

* Create the Bokeh visualization from Part 2 of the Week 8 Lecture, displayed in a beautiful `.gif` below. 
* Provide nice comments for your code. Don't just use the `# inline comments`, but the full Notebook markdown capabilities and explain what you're doing.

Initially, we import the relevant libraries and load the dataset.

In [None]:
from bokeh.models import ColumnDataSource, FactorRange, Legend
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.palettes import Category20
from bokeh.io import output_notebook, push_notebook, show

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import folium
import datetime

df = pd.read_csv("Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv")
df['Date'] = pd.to_datetime(df['Date'])
df = df.loc[df.Date > datetime.datetime(year=2010, month=1, day=1)]
df.head(3)

Then, we group by category and hours and pick out a few relevant columns.

In [None]:
df['Time'] = pd.to_datetime(df['Time'])
df["Hours"] = df.Time.dt.strftime('%H')
df_hours = df.groupby(["Category", "Hours"]).agg('count')
df_hours = df_hours.reset_index()[["Category", "Hours", "IncidntNum"]]
df_hours.head(3)

Afterwards, the sum of each crime is calculated and added to a column.

In [None]:
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])
crimes = df_hours.Category.unique()

for crime in df_hours.Category.unique():
    df_hours[crime] = df_hours.loc[df_hours.Category==crime].IncidntNum
    
# Drop the text column first so that only numeric columns are summed per hour
result_df = df_hours.drop(columns="Category").groupby("Hours").sum()
result_df.drop('IncidntNum', inplace=True, axis=1)
result_df.head(3)

Then, we normalize the dataset.

In [None]:
# Divide each crime column by its total so that every column sums to 1
# (DataFrame.iteritems was removed in pandas 2.0)
result_df = result_df / result_df.sum()
    
result_df.head(3)
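
To spell out what the normalization does, here is a small pure-Python sketch with hypothetical toy counts. Each crime's hourly counts are divided by that crime's total, so the values become shares of the day that are comparable across crimes of very different volume.

```python
def normalize_columns(table):
    """Scale every column (a list of counts) so that its entries sum to 1."""
    normalized = {}
    for name, counts in table.items():
        total = sum(counts)
        if total == 0:
            raise ValueError(f"column {name!r} has no counts to normalize")
        normalized[name] = [count / total for count in counts]
    return normalized


# Hypothetical counts for three hours of the day
hourly_counts = {"ROBBERY": [10, 30, 60], "ASSAULT": [1, 1, 2]}

shares = normalize_columns(hourly_counts)
print(shares["ASSAULT"])
print(sum(shares["ROBBERY"]))
```

Without this step, a frequent crime such as larceny would dominate the y-axis, and the hourly pattern of rarer crimes would be hard to see.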

In [None]:
source = ColumnDataSource(result_df)
hours = list(result_df.index)  # hour labels '00' through '23'

p = figure(x_range = FactorRange(factors=hours), plot_width=1200, plot_height=500)

bar = {}
for indx,name in enumerate(focuscrimes):
    bar[name] = p.vbar(x='Hours',  top=name, source=source, 
                 legend_label=name, muted_color="white", muted_alpha=0, color=Category20[14][indx], alpha=0.8) 
    
p.legend.click_policy="mute" # clicking a legend entry mutes its bars (you can also try "hide")

We also need to move the legend to the right of the figure to avoid overlaying the data. First, we remove the existing legend.

In [None]:
items = []
for indx,name in enumerate(focuscrimes):
    items.append((name, [bar[name]]))
    
p.legend.items = []

Then, we add the vbars to a list of items together with the name of the crime. Finally, the legend is added to the figure.

In [None]:
legend = Legend(items=items, location=(0,0))
legend.click_policy="mute"
p.add_layout(legend, 'right')
show(p)