# Data Wrangling: Berlin Traffic Accidents 2019, Part 1

The goal of this analysis is to explore the official statistics for road traffic accidents in Berlin in 2019.

The dataset is publicly available through the Berlin Open Data project:  
(https://daten.berlin.de/datensaetze/strassenverkehrsunf%C3%A4lle-nach-unfallort-berlin-2019)

The first part of the data wrangling involves the following tasks:

- Changing the column names from German to English
- Dropping unnecessary columns

In [None]:
#import necessary libraries
import pandas as pd
import numpy as np
import os

# Data Wrangling

In [207]:
#assigning path
path = '/Users/satoruteshima/Documents/CareerFoundry/06 Date Immersion 6/Scripts'

In [211]:
df_berlin = pd.read_csv(os.path.join(path, 'Raw', 'berlinaccident2019.csv'), index_col = False, delimiter = ';')

In [209]:
df_berlin.shape

(13390, 1)
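The single column in the shape above is what pandas typically reports when a semicolon-separated file is parsed with the default comma delimiter. The cell counters also show that `shape` was run ([209]) before the read that passes `delimiter = ';'` ([211]). The following is a minimal sketch of that effect; the two-row `sample` string is a hypothetical stand-in for the raw file, not the actual data.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-separated layout as the raw file
sample = "OBJECTID;LAND;BEZ\n49090;11;12\n49091;11;3\n"

# Default delimiter is a comma, so every line is parsed as a single field
df_default = pd.read_csv(io.StringIO(sample))

# Passing the semicolon splits each line into separate columns
df_semicolon = pd.read_csv(io.StringIO(sample), delimiter=';')

print(df_default.shape)    # one column
print(df_semicolon.shape)  # three columns
```

Re-running `df_berlin.shape` after the read with the semicolon delimiter should report the full column count.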

In [212]:
df_berlin.head()

| |OBJECTID|LAND|BEZ|LOR|STRASSE|UJAHR|UMONAT|USTUNDE|UWOCHENTAG|UKATEGORIE|...|IstPKW|IstFuss|IstKrad|IstGkfz|IstSonstige|USTRZUSTAND|LINREFX|LINREFY|XGCSWGS84|YGCSWGS84|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|49090|11|12|12301203|Wittenau Süd|2019|1|13|6|3|...|1|0|0|0|0|1|7940622837|5835083823|1334146|5258609|
|1|49091|11|3|3040818|Pankow Süd|2019|1|9|5|3|...|1|0|0|0|0|0|7991304007|5832327415|1341356|5255862|
|2|49093|11|12|12103115|Breitkopfbecken|2019|3|21|6|3|...|0|0|0|0|0|0|795437613|5833549454|1336034|5257159|
|3|49096|11|6|6040703|Nikolassee|2019|1|7|6|2|...|1|1|0|0|0|1|7867143754|5817042137|1321777|5242825|
|4|49097|11|7|7030303|Grazer Platz|2019|2|15|3|3|...|1|0|0|0|0|0|7960743342|5822724905|1336007|5247421|


In [221]:
#Change name of the columns from German to English

#location and time columns; 'UKATEGORIE' is renamed in the next cell
df_berlin.rename(columns = {
    'LAND' : 'city',
    'BEZ' : 'district',
    'LOR' : 'sub district key',
    'STRASSE' : 'street',
    'UJAHR' : 'year',
    'UMONAT' : 'month',
    'USTUNDE' : 'hour of day',
}, inplace = True)

In [228]:
#Change name of the columns from German to English

#accident attributes and the types of road users involved
df_berlin.rename(columns = {
    'UWOCHENTAG' : 'weekday',
    'UKATEGORIE' : 'category',
    'UART' : 'accident type',
    'UTYP1' : 'accident type 2',
    'ULICHTVERH' : 'light situation',
    'IstRad' : 'bike involved',
    'IstPKW' : 'car involved',
    'IstFuss' : 'pedestrian involved',
    'IstGkfz' : 'big truck involved',
    'IstKrad' : 'motor bike involved',
    'IstSonstige' : 'other vehicle involved',
    'USTRZUSTAND' : 'road condition',
    'OBJECTID' : 'object id',
}, inplace = True)
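One thing to watch when renaming: by default `rename` silently skips any key that does not match an existing column, so a misspelled German name goes unnoticed. The sketch below illustrates this on a hypothetical one-row frame (`df_demo` and its values are placeholders, not the real data) and shows how the `errors='raise'` option surfaces the problem.

```python
import pandas as pd

# Hypothetical one-row frame using two of the original German column names
df_demo = pd.DataFrame({'UKATEGORIE': [3], 'UMONAT': [1]})

# 'UKATAGORIE' is misspelled on purpose; only 'UMONAT' matches a column
mapping = {'UKATAGORIE': 'category', 'UMONAT': 'month'}

# Default behaviour: the unmatched key is skipped without any warning
renamed = df_demo.rename(columns=mapping)
print(list(renamed.columns))

# errors='raise' turns the unmatched key into a KeyError
try:
    df_demo.rename(columns=mapping, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```

Checking `df_berlin.columns` after the rename cells is another simple way to confirm that no German names remain.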

# District Key

|District key|Name|
|---|---|
|1|Mitte|
|2|Friedrichshain-Kreuzberg|
|3|Pankow|
|4|Charlottenburg-Wilmersdorf|
|5|Spandau|
|6|Steglitz-Zehlendorf|
|7|Tempelhof-Schöneberg|
|8|Neukölln|
|9|Treptow-Köpenick|
|10|Marzahn-Hellersdorf|
|11|Lichtenberg|
|12|Reinickendorf|
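The key table above could be applied to the data with a dictionary lookup. This is only a sketch: the column name `district name` and the small `df_demo` frame are assumptions for illustration, and the notebook itself keeps the numeric keys.

```python
import pandas as pd

# District keys and names taken from the table above
district_names = {
    1: 'Mitte',
    2: 'Friedrichshain-Kreuzberg',
    3: 'Pankow',
    4: 'Charlottenburg-Wilmersdorf',
    5: 'Spandau',
    6: 'Steglitz-Zehlendorf',
    7: 'Tempelhof-Schöneberg',
    8: 'Neukölln',
    9: 'Treptow-Köpenick',
    10: 'Marzahn-Hellersdorf',
    11: 'Lichtenberg',
    12: 'Reinickendorf',
}

# Hypothetical frame holding only the 'district' keys of the first five rows
df_demo = pd.DataFrame({'district': [12, 3, 12, 6, 7]})

# Series.map looks up each key; keys missing from the dict would become NaN
df_demo['district name'] = df_demo['district'].map(district_names)
print(df_demo)
```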

In [226]:
#drop reference columns

df_berlin = df_berlin.drop(columns = ['LINREFX', 'LINREFY', 'XGCSWGS84', 'YGCSWGS84'])
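Notebook cells are often re-run out of order, and `drop` raises a `KeyError` once the columns are already gone. A minimal sketch of a re-runnable variant, assuming that silently skipping missing columns is acceptable here; `df_demo` and its zero values are placeholders, not the real data.

```python
import pandas as pd

# Hypothetical frame that contains only two of the four reference columns
df_demo = pd.DataFrame({'OBJECTID': [49090], 'LINREFX': [0], 'LINREFY': [0]})

reference_cols = ['LINREFX', 'LINREFY', 'XGCSWGS84', 'YGCSWGS84']

# errors='ignore' skips labels that are not present instead of raising KeyError
df_demo = df_demo.drop(columns=reference_cols, errors='ignore')

# Running the same line again is now a harmless no-op
df_demo = df_demo.drop(columns=reference_cols, errors='ignore')
print(list(df_demo.columns))
```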


In [229]:
df_berlin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13390 entries, 0 to 13389
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   object id               13390 non-null  int64 
 1   city                    13390 non-null  int64 
 2   district                13390 non-null  int64 
 3   sub district key        13390 non-null  int64 
 4   street                  13390 non-null  object
 5   year                    13390 non-null  int64 
 6   month                   13390 non-null  int64 
 7   hour of day             13390 non-null  int64 
 8   weekday                 13390 non-null  int64 
 9   category                13390 non-null  int64 
 10  accident type           13390 non-null  int64 
 11  accident type 2         13390 non-null  int64 
 12  light situation         13390 non-null  int64 
 13  bike involved           13390 non-null  int64 
 14  car involved            13390 non-null  int64 
 15  pe

In [232]:


#export the cleaned dataframe to the 'Clean' folder
df_berlin.to_csv(os.path.join(path, 'Clean', 'df_berlin.csv'))
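By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below shows the difference using an in-memory buffer; `df_demo` is a hypothetical two-row frame, and whether to pass `index=False` in the export above is left as a choice for the next part of the analysis.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
df_demo = pd.DataFrame({'object id': [49090, 49091], 'district': [12, 3]})

# Default export: the index is written as an unnamed first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = list(pd.read_csv(buf_with_index).columns)

# index=False leaves the index out of the file
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = list(pd.read_csv(buf_no_index).columns)

print(cols_with_index)
print(cols_no_index)
```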