# Goal

The attached data files are part of the UK Road Safety Data for 2019, available from [the UK Department for Transport](https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data). The dataset contains three types of files: one with records of accidents, one with vehicles, and one with casualties.

We will create a new dataset from these that records, for each vehicle, the age of the vehicle together with the severity, weather conditions, and date of the accident it was involved in. The relevant information is kept in the first two files:

`Road Safety Data - Accidents_2019.csv`:

* `Accident_Severity`
* `Date`
* `Weather_Conditions`

`Road Safety Data - Vehicles_2019.csv`:

* `Age_of_Vehicle`

To link vehicles to accidents, we will also need the `Accident_Index` column, present in both files.

We will also perform several cleaning steps, check the quality of the data, and plot some of the variables.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(palette="Set2")

# 1. Accidents data

## 1.1 Load the data

Read the Accidents file into a dataframe, find out its shape (the number of rows and columns) and the data type of each column.

In [None]:
???
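
As a rough illustration of what this step inspects, the sketch below reads a small, made-up CSV from memory rather than the real file. The three rows and the variable names (`csv_text`, `toy`) are assumptions for demonstration only.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the accidents file (not real data)
csv_text = """Accident_Index,Accident_Severity,Date,Weather_Conditions
A001,3,18/02/2019,1
A002,2,15/01/2019,2
A003,3,01/01/2019,-1
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one data type per column
```

Note that `read_csv` does not parse dates unless asked to, so the `Date` column in this toy frame is read as text.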

## 1.2. Select relevant columns

We are going to use "Accident_Index", "Accident_Severity", "Date", "Time" and "Weather_Conditions", so drop the remaining columns for convenience.

In [None]:
???
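
One possible way to keep a subset of columns is to index with a list of names. The sketch below uses a made-up frame; the `Speed_limit` column and the names `toy`, `keep` and `toy_small` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with one column we do not need
toy = pd.DataFrame({
    "Accident_Index": ["A001", "A002"],
    "Accident_Severity": [3, 2],
    "Date": ["18/02/2019", "15/01/2019"],
    "Speed_limit": [30, 60],  # stand-in for an irrelevant column
})

keep = ["Accident_Index", "Accident_Severity", "Date"]

# Indexing with a list returns only those columns; .copy() makes the
# result independent of the original, so later assignments are unambiguous
toy_small = toy[keep].copy()
print(toy_small.columns.tolist())
```

Selecting the columns to keep is usually shorter than listing every column to delete when the file has many columns.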

## 1.3. Convert columns to correct data types

1. Convert `Date` to datetime.

In [None]:
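# assign the whole column, so that it takes the datetime dtype
# (in recent pandas versions, df1.loc[:, 'Date'] = ... may keep the old dtype)
df1['Date'] = pd.to_datetime(df1['Date'], format="???")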
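
To show how the `format` argument works, here is a minimal sketch on made-up strings. It assumes day/month/year text; check the actual file before reusing the format string.

```python
import pandas as pd

# Hypothetical date strings, assumed to be day/month/year
toy = pd.DataFrame({"Date": ["18/02/2019", "03/01/2019"]})

# %d = day of month, %m = month number, %Y = four-digit year
toy["Date"] = pd.to_datetime(toy["Date"], format="%d/%m/%Y")

print(toy["Date"].dtype)
print(toy["Date"].dt.month.tolist())
```

An explicit format avoids ambiguity: without it, a string such as "03/01/2019" could be read as either 3 January or 1 March.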

2. Replace the numeric codes in `Weather_Conditions` with their nominal values (the replacement values are in "variable lookup.xls"):

In [None]:
# use nominal values for Weather conditions
replacement_dict = {
    1: "Fine no high winds",
    2: "Raining no high winds",
    3: "Snowing no high winds",
    4: "Fine + high winds",
    5: "Raining + high winds",
    6: "Snowing + high winds",
    7: "Fog or mist",
    8: "Other",
    9: "Unknown",
    -1: "Data missing or out of range"
}
# assign the whole column, so that its dtype can change from integer to text
df1['Weather_Conditions'] = df1['Weather_Conditions'].map(replacement_dict)

Looking at the values in `Weather_Conditions`, we see that each cell encodes two categories: precipitation and wind. So let's separate them, i.e., create a new column with only wind information ("True" for wind, "False" for no wind):

In [None]:
# create a column called high winds
def func(row):
    """Return True if high winds, False otherwise
    """
    ???
    return result

df1["high_winds"] = df1.apply(func, axis=1)
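
As an alternative to a row-wise `apply`, the same kind of flag can be computed with a vectorised string method. The sketch below is one possible approach on made-up values, not necessarily the intended solution; `toy` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical values, taken from the labels in the lookup dictionary
toy = pd.DataFrame({
    "Weather_Conditions": [
        "Fine no high winds",
        "Raining + high winds",
        "Fog or mist",
        "Snowing + high winds",
    ]
})

# regex=False treats "+" as a literal character, not a regex quantifier
toy["high_winds"] = toy["Weather_Conditions"].str.contains(
    "+ high winds", regex=False
)

print(toy)
```

In this sketch, labels that say nothing about wind (such as "Fog or mist") end up as `False`; whether that is appropriate depends on how the column is used later.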

In [None]:
# remove "high winds" from Weather_Conditions
df1["Weather_Conditions"] = df1["Weather_Conditions"].str.replace(" no high winds", "", regex=False)
df1["Weather_Conditions"] = df1["Weather_Conditions"].str.replace(" + high winds", "", regex=False)

# rename Weather_Conditions to "precipitation"
df1 = df1.rename(columns={"Weather_Conditions": "precipitation"})

Replace "Unknown" and "Data missing or out of range" with NaN, so later on we can deal with missing values.

In [None]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df1['precipitation'] = df1['precipitation'].replace("Unknown", np.nan)
df1['precipitation'] = df1['precipitation'].replace("Data missing or out of range", np.nan)

3. Replace the numeric codes in `Accident_Severity` with nominal values:

In [None]:
replacement_dict = {
    1: "Fatal",
    2: "Serious",
    3: "Slight"
}

???

# 2. Vehicle data

## 2.1. Load data

Extract the vehicle age from the vehicle file. This information will then be linked with the data on accidents from the accidents dataframe.

In [None]:
# We need only two columns: accident index and vehicle age, so we can use the usecols argument:
df2 = pd.read_csv(??? + "/Road Safety Data - Vehicles 2019.csv",
                  usecols=["Accident_Index", "Age_of_Vehicle"])

df2.head()

Note there are "-1" values. Most likely, they indicate missing values, so replace them with `np.nan`:

In [None]:
df2["Age_of_Vehicle"] = ???
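
A minimal sketch of swapping a sentinel value for NaN, using a made-up column. `Series.replace` is one of several ways to do this, and the `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, where -1 marks a missing value
toy = pd.DataFrame({"Age_of_Vehicle": [3, -1, 12, -1]})

toy["Age_of_Vehicle"] = toy["Age_of_Vehicle"].replace(-1, np.nan)

print(toy["Age_of_Vehicle"].isna().sum())  # number of missing values
print(toy["Age_of_Vehicle"].dtype)
```

Because NaN is a floating-point value, the integer column becomes a float column after the replacement.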

# 3. Join the two dataframes

The first df contains unique accidents as rows, while the second df contains all vehicles involved in the accidents as rows, i.e. multiple vehicles can map to the same accident. The `Accident_Index` column is present in both dataframes, and can help us link vehicles to accidents.

So we need to create a new dataframe where each row is a vehicle, the first column is its age (from the vehicles dataframe), and the rest of the columns are taken from the accidents dataframe, containing the severity, precipitation, wind, and date of the accident.

In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
# we can confirm that the number of unique accident indices in df2 is the same as the number of
# unique accidents in df1
df2['Accident_Index'].nunique()

To merge two dataframes on a column, we can use `pd.merge`. It works as follows:

`result = pd.merge(left, right, on='key')`

<img src="https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key.png">

That is, given two tables, "left" and "right", we merge them on the column called `key`. The result is a third dataframe.

There is another important argument for `pd.merge`: `how`. If the keys do not correspond exactly between the two dataframes, it specifies how the merge should occur. It takes the following values:
* `left`: use keys in the first dataframe only and attach records from the second dataframe only for matching keys
* `right`: use keys in the second dataframe only and attach records from the first dataframe only for matching keys
* `inner`: use the intersection of the keys in the two dataframes (this is the default value)
* `outer`: use the union of the keys in the two dataframes

More details are in the pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html); see the section "Brief primer on merge methods (relational algebra)".
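
The effect of the four `how` values can be seen on two small made-up tables. In this sketch, accident "A3" has no vehicle and vehicle row "A4" has no accident record; all names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical accidents: one row per accident
accidents = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3"],
    "Accident_Severity": ["Slight", "Serious", "Fatal"],
})

# Hypothetical vehicles: several rows may share an accident
vehicles = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2", "A4"],
    "Age_of_Vehicle": [2.0, 7.0, 1.0, 5.0],
})

row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(accidents, vehicles, on="Accident_Index", how=how)
    row_counts[how] = len(merged)

print(row_counts)
# {'left': 4, 'right': 4, 'inner': 3, 'outer': 5}
```

With `how="right"` every vehicle row is kept, and a vehicle without a matching accident ("A4" here) gets NaN in the accident columns.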

We need to attach columns from accidents (df1) to the vehicle dataframe (df2). The "Accident_Index" column from df2 is the basis for the merge (the column is indicated with the "on" argument):

In [None]:
df = pd.merge(df1, df2, on="Accident_Index", how="right")

In [None]:
# the number of rows in the new dataframe is the same as in df2
df.shape

In [None]:
df.head()

# 4. Drop missing values

Drop the rows where at least one column has a NaN value (this is the default behavior of `dropna`, so we don't need to specify any arguments):

In [None]:
???

# The number of rows now
df.shape
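
For reference, the default behaviour of `dropna` can be checked on a small made-up frame. The sketch below is illustrative only; `toy` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the second and third rows each miss one value
toy = pd.DataFrame({
    "Age_of_Vehicle": [2.0, np.nan, 7.0],
    "precipitation": ["Fine", "Raining", np.nan],
})

# dropna returns a new dataframe; the original is left unchanged
cleaned = toy.dropna()

print(toy.shape, cleaned.shape)
```

Since the result is a new object, it has to be assigned back to a variable for the rows to stay dropped.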

# 5. Display unique values

We can print unique values (e.g., using `value_counts`) or plot them:

In [None]:
# Accident Severity
df["Accident_Severity"].value_counts().plot(kind="bar", rot=0)

In [None]:
# Precipitation
???

In [None]:
# High winds
???

In [None]:
# Age of vehicle in a histogram
# Using logarithmic scale for the y-axis, as there are very many new cars
df["Age_of_Vehicle"].plot.hist(logy=True)

# 6. Daily counts

Count the number of vehicles involved in accidents per day. We can take the counts from any column, e.g. "Accident_Index":

In [None]:
counts = df.groupby("Date")["Accident_Index"].count()
counts
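
A small sketch of counting rows per day on made-up data. It also shows `size`, which counts all rows in a group, whereas `count` skips missing values in the selected column. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical vehicle rows: two vehicles on 1 January, one on 2 January
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02"]),
    "Accident_Index": ["A1", "A1", "A2"],
})

per_day = toy.groupby("Date")["Accident_Index"].count()
per_day_size = toy.groupby("Date").size()

print(per_day)
```

After missing values have been dropped, the two approaches give the same numbers.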

In [None]:
# plot it, use a wider figure
counts.plot(figsize=(12, 5))

# 7. Convert nominal values to numerical

To use a nominal variable in, e.g., a linear regression model, it needs to be converted to numerical values. This can be done with `pd.get_dummies`. Let's convert the precipitation column (we use `drop_first=True` to avoid perfect multicollinearity):

In [None]:
pd.get_dummies(df['precipitation'], drop_first=True)
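
To see what `drop_first=True` changes, compare the two encodings on a made-up column. The category values below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical precipitation values
toy = pd.DataFrame({"precipitation": ["Fine", "Raining", "Snowing", "Fine"]})

full = pd.get_dummies(toy["precipitation"])
reduced = pd.get_dummies(toy["precipitation"], drop_first=True)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # the first category is dropped
```

In the full encoding, each row has exactly one indicator set, so the columns always sum to one and any one of them is implied by the others. Dropping the first category ("Fine" here) removes that redundancy: a row with all remaining indicators unset belongs to the dropped category.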

In [None]:
# assign these as columns in df
df_tmp = pd.get_dummies(df['precipitation'], drop_first=True)
for c in df_tmp.columns:
    df[c] = df_tmp[c]
    
# delete the precipitation column
del df['precipitation']

In [None]:
# inspect the result
df.head()

# 8. Convert numerical values to nominal

Create a nominal category, holding the age group of the vehicle. We'd like 3 age groups, named "new", "medium" and "old".

In [None]:
age_groups = pd.cut(df['Age_of_Vehicle'], bins=3, labels=["new", "medium", "old"])
age_groups
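
With an integer `bins` argument, `pd.cut` splits the observed range into intervals of equal width, so the groups can be very uneven in size. The sketch below contrasts this with explicit bin edges on made-up ages; the cut points 3 and 10 are arbitrary assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical vehicle ages in years
ages = pd.Series([1, 2, 3, 10, 20, 30])
labels = ["new", "medium", "old"]

# Three equal-width intervals spanning the range 1..30
equal_width = pd.cut(ages, bins=3, labels=labels)

# Explicit edges: [0, 3], (3, 10], (10, inf)
explicit = pd.cut(ages, bins=[0, 3, 10, np.inf], labels=labels,
                  include_lowest=True)

print(equal_width.tolist())
print(explicit.tolist())
```

Here the equal-width version labels a 10-year-old vehicle as "new", because the first interval covers roughly a third of the full range. Explicit edges give direct control over where the groups begin and end.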

In [None]:
df["age_group"] = age_groups

In [None]:
# inspect the result
df.head()