# Great Britain Road Casualties 2015-2019 EDA

**1. Importing the required libraries for EDA**

Below are the libraries used to perform EDA (exploratory data analysis) in this project.

In [84]:
import pandas as pd
import numpy as np

**2. Loading the data into the data frame**


Loading the data into a pandas data frame is one of the most important steps in EDA, since every later step works on this data frame.

In [65]:
df = pd.read_csv(r'C:\Cardiff University\CMT218 Data Visualisation\Assessment and Feedback\CW2\Project_Live\Great Britain Road Casualties 2015-2019.csv')
# Display the top 5 rows
df.head(5)

Unnamed: 0,Accident year,Country,Ons code,Road user,Casualty class,Casualty sex,Casualty age,All casualties
0,2015,England,E92000001,Pedestrian,Pedestrian,,,3
1,2015,England,E92000001,Pedestrian,Pedestrian,Male,,310
2,2015,England,E92000001,Pedestrian,Pedestrian,Male,0.0,10
3,2015,England,E92000001,Pedestrian,Pedestrian,Male,1.0,26
4,2015,England,E92000001,Pedestrian,Pedestrian,Male,2.0,93


In [66]:
df.tail(5)                        # Display the bottom 5 rows

Unnamed: 0,Accident year,Country,Ons code,Road user,Casualty class,Casualty sex,Casualty age,All casualties
20584,2019,Scotland,S92000003,Other vehicle,Passenger,Female,34.0,1
20585,2019,Scotland,S92000003,Other vehicle,Passenger,Female,53.0,1
20586,2019,Scotland,S92000003,Other vehicle,Passenger,Female,65.0,1
20587,2019,Scotland,S92000003,Other vehicle,Passenger,Female,68.0,1
20588,2019,Scotland,S92000003,Other vehicle,Passenger,Female,69.0,1


**3. Checking the types of data**

Here we check the data types because columns such as Casualty age or All casualties are sometimes stored as strings. In that case the strings have to be converted to numbers before the data can be plotted. Here both columns are already numeric, so no conversion is needed: All casualties is int64, and Casualty age is float64, which is how pandas stores a numeric column that contains missing values.

In [67]:
df.dtypes

Accident year       int64
Country            object
Ons code           object
Road user          object
Casualty class     object
Casualty sex       object
Casualty age      float64
All casualties      int64
dtype: object
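
If a numeric column had been loaded as strings, the conversion could look roughly like the sketch below. The `toy` data frame and its values are made up for illustration and are not taken from the casualty data set.

```python
import pandas as pd

# Hypothetical toy column: ages stored as text, with one unparseable entry.
toy = pd.DataFrame({"Casualty age": ["0", "17", "unknown", "42"]})

# errors="coerce" turns anything that cannot be parsed into NaN.
toy["Casualty age"] = pd.to_numeric(toy["Casualty age"], errors="coerce")

# A column holding NaN is float64; the nullable Int64 dtype keeps
# whole numbers and shows the gap as <NA> instead.
toy["Casualty age int"] = toy["Casualty age"].astype("Int64")

print(toy.dtypes)
```

Whether to keep `float64` or switch to `Int64` is a matter of taste; plotting libraries generally handle `float64` with fewer surprises.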

**4. Dropping irrelevant columns**

This step is needed in most EDA work, because a data set often contains columns that are never used, and dropping them keeps the data frame easier to work with. Here the 'Ons code' column is not useful to me for this analysis, so I dropped it.

In [68]:
df = df.drop(['Ons code'], axis=1)
df.head(5)

Unnamed: 0,Accident year,Country,Road user,Casualty class,Casualty sex,Casualty age,All casualties
0,2015,England,Pedestrian,Pedestrian,,,3
1,2015,England,Pedestrian,Pedestrian,Male,,310
2,2015,England,Pedestrian,Pedestrian,Male,0.0,10
3,2015,England,Pedestrian,Pedestrian,Male,1.0,26
4,2015,England,Pedestrian,Pedestrian,Male,2.0,93
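
Before dropping a column it can help to confirm that it carries no extra information. The sketch below checks, on a made-up `toy` frame, whether each country maps to exactly one code. The variable names are hypothetical, and the check was not part of the original analysis.

```python
import pandas as pd

# Toy rows for illustration only; the two codes are the ones shown in the output above.
toy = pd.DataFrame({
    "Country": ["England", "England", "Scotland"],
    "Ons code": ["E92000001", "E92000001", "S92000003"],
})

# Count the distinct codes seen for each country.
codes_per_country = toy.groupby("Country")["Ons code"].nunique()

# If every country has exactly one code, the column duplicates 'Country'.
is_redundant = bool((codes_per_country == 1).all())
print(is_redundant)
```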


**5. Renaming the columns**

In this instance the column names contain spaces, which makes them awkward to read and to use in code, so I replaced the spaces with underscores. This improves the readability of the data set.

In [69]:
df = df.rename(columns={
    "Accident year": "Accident_year",
    "Country": "Country",
    "Road user": "Road_user",
    "Casualty class": "Casualty_class",
    "Casualty sex": "Casualty_sex",
    "Casualty age": "Casualty_age",
    "All casualties": "All_casualties",
})
df.head(5)

Unnamed: 0,Accident_year,Country,Road_user,Casualty_class,Casualty_sex,Casualty_age,All_casualties
0,2015,England,Pedestrian,Pedestrian,,,3
1,2015,England,Pedestrian,Pedestrian,Male,,310
2,2015,England,Pedestrian,Pedestrian,Male,0.0,10
3,2015,England,Pedestrian,Pedestrian,Male,1.0,26
4,2015,England,Pedestrian,Pedestrian,Male,2.0,93
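
Since every rename here swaps a space for an underscore, the same result can be sketched without listing each column. The `toy` frame below is an empty stand-in that uses three of the column names, purely for illustration.

```python
import pandas as pd

# Empty stand-in frame; only the column labels matter here.
toy = pd.DataFrame(columns=["Accident year", "Road user", "All casualties"])

# Replace every space in every column label with an underscore.
toy.columns = toy.columns.str.replace(" ", "_", regex=False)

print(list(toy.columns))
```

An explicit `rename` mapping is still the safer choice when only some columns should change.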



**6. Dropping the duplicate rows**

Large data sets often contain duplicate rows, which would distort the totals, so it is worth checking for them before going further. This data set has 20589 rows, and the check below looks for exact duplicates among them.

In [71]:
df.shape

(20589, 7)

In [72]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)

number of duplicate rows:  (0, 7)


No duplicate rows were found in this data set.
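
For a data set that does contain duplicates, counting and removing them might look like the following sketch. The `toy` frame is invented so that it contains exactly one repeated row.

```python
import pandas as pd

# Toy frame in which the second row repeats the first.
toy = pd.DataFrame({
    "Country": ["England", "England", "Wales"],
    "All_casualties": [3, 3, 5],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```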

**7. Dropping the missing or null values**

This step is similar to the previous one, but here the missing values are detected and the rows containing them are then dropped. Dropping is not always the best approach: a common alternative is to replace the missing values with the mean of the column. In this case, though, I simply dropped the rows with missing values.

In [73]:
df.count()

Accident_year     20589
Country           20589
Road_user         20589
Casualty_class    20589
Casualty_sex      20384
Casualty_age      20343
All_casualties    20589
dtype: int64

In [74]:
print(df.isnull().sum())

Accident_year       0
Country             0
Road_user           0
Casualty_class      0
Casualty_sex      205
Casualty_age      246
All_casualties      0
dtype: int64


In [75]:
df = df.dropna(axis=0, how='any')   # Drop every row that has at least one missing value
df.head(5)

Unnamed: 0,Accident_year,Country,Road_user,Casualty_class,Casualty_sex,Casualty_age,All_casualties
2,2015,England,Pedestrian,Pedestrian,Male,0.0,10
3,2015,England,Pedestrian,Pedestrian,Male,1.0,26
4,2015,England,Pedestrian,Pedestrian,Male,2.0,93
5,2015,England,Pedestrian,Pedestrian,Male,3.0,167
6,2015,England,Pedestrian,Pedestrian,Male,4.0,165
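
For comparison, the replacement approach mentioned above could be sketched as follows. The `toy` frame, the use of the median, and the `"Unknown"` label are all assumptions made for illustration. Whether imputation is appropriate depends on what a missing value means in the data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column.
toy = pd.DataFrame({
    "Casualty_sex": ["Male", None, "Female", "Male"],
    "Casualty_age": [10.0, 20.0, np.nan, 30.0],
})

imputed = toy.copy()

# Numeric column: fill the gap with the column median (the mean works the same way).
imputed["Casualty_age"] = imputed["Casualty_age"].fillna(imputed["Casualty_age"].median())

# Categorical column: fill the gap with an explicit placeholder label.
imputed["Casualty_sex"] = imputed["Casualty_sex"].fillna("Unknown")

print(imputed)
```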


Dropping rows is not the right fix when NaN makes up 50% or more of a column. In that case, drop the column instead.

In [76]:
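def drop_cols_na(df, threshold=0.5):
    """Drop, in place, every column whose share of missing values is at least `threshold`."""
    min_missing = len(df.index) * threshold
    cols = [c for c in df.columns if df[c].isnull().sum() >= min_missing]
    # axis=1 targets columns; axis=0 would look for row labels instead
    df.drop(cols, axis=1, errors='ignore', inplace=True)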

In [77]:
drop_cols_na(df)

In [78]:
df.count()

Accident_year     20183
Country           20183
Road_user         20183
Casualty_class    20183
Casualty_sex      20183
Casualty_age      20183
All_casualties    20183
dtype: int64

In [80]:
print(df.isnull().sum())   # Null counts after dropping the missing values

Accident_year     0
Country           0
Road_user         0
Casualty_class    0
Casualty_sex      0
Casualty_age      0
All_casualties    0
dtype: int64
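
The column-level rule used in this step can also be written without modifying the data frame in place. The sketch below uses an invented `toy` frame, and the 0.5 cut-off mirrors the default threshold above.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is 75% missing, the other 25% missing.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Keep only the columns that are less than 50% missing.
kept = toy.loc[:, missing_fraction < 0.5]

print(missing_fraction)
print(list(kept.columns))
```

Returning a new frame instead of using `inplace=True` leaves the original available for comparison.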


**8. Exporting the pandas data frame to a CSV file**

In [81]:
df.to_csv(r'C:\Cardiff University\CMT218 Data Visualisation\Assessment and Feedback\CW2\Project_Live\Great Britain Road Casualties 2015-2019 Cleaned data1.csv', index = False)
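
A quick way to confirm an export like this is to read the file back and compare it with the data frame in memory. The sketch below does this with an invented `toy` frame and a temporary directory, so the file name `cleaned_toy.csv` is hypothetical and nothing is written to the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the cleaned data.
toy = pd.DataFrame({"Accident_year": [2015, 2019], "All_casualties": [10, 1]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "cleaned_toy.csv"
    # index=False keeps the row index out of the file, as in the export above.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# equals() compares values, column labels and dtypes.
print(reloaded.equals(toy))
```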