# Business problem - Introduction


## 1. A description of the problem and a discussion of the background

Traffic accidents are one of the leading causes of death worldwide and a major source of economic expenditure. Despite the many measures and campaigns deployed every year to raise awareness of the problem, accidents remain frequent. Their impact on society and the economy is high: human losses are compounded by large expenditures on health care, awareness campaigns, mobilization of specialized personnel, and so on. The WHO puts the economic impact of road accidents in a developed country at 2 to 3% of GDP, a significant figure for any country. Reducing these losses has become a matter of general interest.

Defining the problem:

What are the factors that have a high impact on road accidents?

Is there a pattern to them?

Are these factors correlated with accident severity?

We will have to analyze the data to get a clearer picture and draw conclusions.

### Introduction

Note that this work is the final project of the IBM certification course, for which we have been provided with the data used to develop the project.

These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.

The data cover the period from 2004 to the present and record information on the severity of each traffic accident, its location, the type of collision, weather and road conditions, visibility, the number of people involved, etc.

The objective is to define the problem and to find the factors that carry relevant weight in the number and severity of accidents, so that any organization or company interested in reducing these figures can focus its resources on the points where these conditions converge.

To provide greater clarity, I will analyze the data and look for relationships or patterns, especially in high-impact accidents, so that preventive measures can focus on these points as a first prevention strategy.

## Data to be used

### 2. A description of the data and how it will be used to solve the problem

Accurately predicting the magnitude of the damage caused by accidents requires a large number of accident reports with accurate data to train prediction models. The data set provided for this work contains a record of roughly 200,000 accidents (194,673 rows) in the city of Seattle, from 2004 to the date of issue. It records 37 attributes or variables and encodes the type of accident, grouped according to 84 codes. The following information can be extracted from it:

- speed information
- information on road conditions and visibility
- type of collision
- affected persons, etc

The data will be used so that we can determine which attributes are most common in traffic accidents in order to target prevention at these high-incidence points.

### Data Source

- Data Source: These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.

- Data Location: Coursera_Capstone/Data assets

- Data set name: Data-Collisions (1)_shaped.csv

## Methodology

Objective: The objective of this project is to predict the severity of a traffic accident based on the other characteristics contained in the report.

Packages and libraries: We will use pandas and NumPy for data manipulation and scikit-learn for modeling. SciPy, Matplotlib and Seaborn are the intended tools for statistics and visualization, although the cells below do not import them.

An exploratory data analysis will be performed to determine which machine learning method is most appropriate, and to get a first look at the data we find most relevant to this project.

### Obtaining and cleaning data

#### Importing libraries and packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
print('imported')

imported


#### Uploading the data

In [2]:
# The code was removed by Watson Studio for sharing.



|   | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | REPORTNO | STATUS | ADDRTYPE | INTKEY | ... | ROADCOND | LIGHTCOND | PEDROWNOTGRNT | SDOTCOLNUM | SPEEDING | ST_COLCODE | ST_COLDESC | SEGLANEKEY | CROSSWALKKEY | HITPARKEDCAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | -122.323148 | 47.70314 | 1 | 1307 | 1307 | 3502005 | Matched | Intersection | 37475.0 | ... | Wet | Daylight | NaN | NaN | NaN | 10 | Entering at angle | 0 | 0 | N |
| 1 | 1 | -122.347294 | 47.647172 | 2 | 52200 | 52200 | 2607959 | Matched | Block | NaN | ... | Wet | Dark - Street Lights On | NaN | 6354039.0 | NaN | 11 | From same direction - both going straight - bo... | 0 | 0 | N |
| 2 | 1 | -122.33454 | 47.607871 | 3 | 26700 | 26700 | 1482393 | Matched | Block | NaN | ... | Dry | Daylight | NaN | 4323031.0 | NaN | 32 | One parked--one moving | 0 | 0 | N |
| 3 | 1 | -122.334803 | 47.604803 | 4 | 1144 | 1144 | 3503937 | Matched | Block | NaN | ... | Dry | Daylight | NaN | NaN | NaN | 23 | From same direction - all others | 0 | 0 | N |
| 4 | 2 | -122.306426 | 47.545739 | 5 | 17700 | 17700 | 1807429 | Matched | Intersection | 34387.0 | ... | Wet | Daylight | NaN | 4028032.0 | NaN | 10 | Entering at angle | 0 | 0 | N |
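The loading cell was removed before sharing, so the exact call is not known. As a minimal sketch, assuming the file is a plain CSV read with pandas, a load might look like the following. The helper name `load_collisions` and the tiny in-memory sample are hypothetical and only stand in for the real file.

```python
import io

import pandas as pd


def load_collisions(source):
    """Read a collisions CSV from a file path or a file-like object (hypothetical helper)."""
    # low_memory=False lets pandas infer each column's dtype from the whole
    # file instead of chunk by chunk, which avoids mixed-type warnings.
    return pd.read_csv(source, low_memory=False)


# Small in-memory stand-in for the real file; only a few columns are shown.
sample_csv = io.StringIO(
    "SEVERITYCODE,ADDRTYPE,ROADCOND,SPEEDING\n"
    "2,Intersection,Wet,\n"
    "1,Block,Wet,\n"
    "1,Block,Dry,\n"
)
sample = load_collisions(sample_csv)
print(sample.shape)
print(sample.head())
```

With the real data, the same helper would be called with the path to the CSV file instead of the in-memory buffer.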


In [4]:
#checking datatype
df_data_1.dtypes

SEVERITYCODE        int64
X                 float64
Y                 float64
OBJECTID            int64
INCKEY              int64
COLDETKEY           int64
REPORTNO           object
STATUS             object
ADDRTYPE           object
INTKEY            float64
LOCATION           object
EXCEPTRSNCODE      object
EXCEPTRSNDESC      object
SEVERITYCODE.1      int64
SEVERITYDESC       object
COLLISIONTYPE      object
PERSONCOUNT         int64
PEDCOUNT            int64
PEDCYLCOUNT         int64
VEHCOUNT            int64
INCDATE            object
INCDTTM            object
JUNCTIONTYPE       object
SDOT_COLCODE        int64
SDOT_COLDESC       object
INATTENTIONIND     object
UNDERINFL          object
WEATHER            object
ROADCOND           object
LIGHTCOND          object
PEDROWNOTGRNT      object
SDOTCOLNUM        float64
SPEEDING           object
ST_COLCODE         object
ST_COLDESC         object
SEGLANEKEY          int64
CROSSWALKKEY        int64
HITPARKEDCAR       object
dtype: object

In [5]:
# statistical summary
df_data_1.describe()

|   | SEVERITYCODE | X | Y | OBJECTID | INCKEY | COLDETKEY | INTKEY | SEVERITYCODE.1 | PERSONCOUNT | PEDCOUNT | PEDCYLCOUNT | VEHCOUNT | SDOT_COLCODE | SDOTCOLNUM | SEGLANEKEY | CROSSWALKKEY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 194673.0 | 189339.0 | 189339.0 | 194673.0 | 194673.0 | 194673.0 | 65070.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 194673.0 | 114936.0 | 194673.0 | 194673.0 |
| mean | 1.298901 | -122.330518 | 47.619543 | 108479.36493 | 141091.45635 | 141298.811381 | 37558.450576 | 1.298901 | 2.444427 | 0.037139 | 0.028391 | 1.92078 | 13.867768 | 7972521.0 | 269.401114 | 9782.452 |
| std | 0.457778 | 0.029976 | 0.056157 | 62649.722558 | 86634.402737 | 86986.54211 | 51745.990273 | 0.457778 | 1.345929 | 0.19815 | 0.167413 | 0.631047 | 6.868755 | 2553533.0 | 3315.776055 | 72269.26 |
| min | 1.0 | -122.419091 | 47.495573 | 1.0 | 1001.0 | 1001.0 | 23807.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1007024.0 | 0.0 | 0.0 |
| 25% | 1.0 | -122.348673 | 47.575956 | 54267.0 | 70383.0 | 70383.0 | 28667.0 | 1.0 | 2.0 | 0.0 | 0.0 | 2.0 | 11.0 | 6040015.0 | 0.0 | 0.0 |
| 50% | 1.0 | -122.330224 | 47.615369 | 106912.0 | 123363.0 | 123363.0 | 29973.0 | 1.0 | 2.0 | 0.0 | 0.0 | 2.0 | 13.0 | 8023022.0 | 0.0 | 0.0 |
| 75% | 2.0 | -122.311937 | 47.663664 | 162272.0 | 203319.0 | 203459.0 | 33973.0 | 2.0 | 3.0 | 0.0 | 0.0 | 2.0 | 14.0 | 10155010.0 | 0.0 | 0.0 |
| max | 2.0 | -122.238949 | 47.734142 | 219547.0 | 331454.0 | 332954.0 | 757580.0 | 2.0 | 81.0 | 6.0 | 2.0 | 12.0 | 69.0 | 13072020.0 | 525241.0 | 5239700.0 |
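The `count` row differs between columns (for example, `X` has 189,339 non-null values out of 194,673 rows), which signals missing values. `describe()` also covers only numeric columns, so gaps in text columns such as `SPEEDING` do not show up there. A minimal sketch of counting missing values per column, on a toy frame whose values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_data_1; the values are illustrative only.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1],
    "X": [-122.32, np.nan, -122.33, -122.30],
    "SPEEDING": [np.nan, "Y", np.nan, np.nan],
})

# Number of missing entries per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
# The same counts as a share of all rows.
missing_share = (missing / len(toy)).round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Columns with a large share of missing values need an explicit decision: fill them with a default, as is done for `SPEEDING` and `ROADCOND` further down, or leave them out.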


In [16]:
# choosing the columns we will work with
selected_columns = ['SEVERITYCODE', 'SPEEDING', 'ROADCOND']
df_data_1 = df_data_1[selected_columns].copy()  # copy, so later edits do not act on a view

# listing the distinct values of each feature
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())

['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
 'Standing Water' 'Oil']


In [17]:
# SPEEDING: replace NaN with 'N' (not speeding)
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].fillna('N')

# ROADCOND: replace NaN with 'Unknown'
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].fillna('Unknown')

# checking the values once again
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())

['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
 'Standing Water' 'Oil']


In [19]:
# Group the road conditions into two classes: 'Dangerous' and 'Normal'
road_classes = {'Wet': 'Dangerous', 'Dry': 'Normal', 'Unknown': 'Normal', 'Snow/Slush': 'Dangerous', 'Ice': 'Dangerous', 'Other': 'Normal', 'Sand/Mud/Dirt': 'Dangerous', 'Standing Water': 'Dangerous', 'Oil': 'Dangerous'}
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].replace(road_classes)

In [20]:
# Encode both features as integers: SPEEDING N=0, Y=1; ROADCOND Dangerous=0, Normal=1
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].map({'N': 0, 'Y': 1})
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].map({'Dangerous': 0, 'Normal': 1})
test_condition = df_data_1[['SPEEDING', 'ROADCOND']]
test_condition.head()

|   | SPEEDING | ROADCOND |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |
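The two-step recoding of `ROADCOND` (first to 'Dangerous'/'Normal', then to 0/1) can also be written as a single lookup that fails loudly on an unexpected label. The sketch below is one possible way to do it, not the notebook's method; the names `ROAD_RISK` and `encode_roadcond` are hypothetical, and the grouping simply mirrors the one used above.

```python
import pandas as pd

# 0 = dangerous surface, 1 = normal surface (same grouping as in the cells above).
ROAD_RISK = {
    "Wet": 0, "Snow/Slush": 0, "Ice": 0, "Sand/Mud/Dirt": 0,
    "Standing Water": 0, "Oil": 0,
    "Dry": 1, "Unknown": 1, "Other": 1,
}


def encode_roadcond(series):
    """Map road-condition labels to 0/1, treating missing values as 'Unknown'."""
    filled = series.fillna("Unknown")
    unexpected = set(filled.unique()) - set(ROAD_RISK)
    if unexpected:
        # Fail early instead of silently producing NaN for unmapped labels.
        raise ValueError(f"Unmapped ROADCOND labels: {sorted(unexpected)}")
    return filled.map(ROAD_RISK).astype(int)


print(encode_roadcond(pd.Series(["Wet", "Dry", None, "Ice"])).tolist())
```

Checking for unmapped labels matters because `Series.map` returns NaN for any value that is missing from the dictionary.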


In [21]:
speed_analysis = df_data_1.groupby(['SPEEDING'])['SEVERITYCODE'].value_counts(normalize=True)
speed_analysis

SPEEDING  SEVERITYCODE
0         1               0.705099
          2               0.294901
1         1               0.621665
          2               0.378335
Name: SEVERITYCODE, dtype: float64

In [22]:
road_analysis = df_data_1.groupby(['ROADCOND'])['SEVERITYCODE'].value_counts(normalize=True)
road_analysis

ROADCOND  SEVERITYCODE
0         1               0.674176
          2               0.325824
1         1               0.710389
          2               0.289611
Name: SEVERITYCODE, dtype: float64
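The grouped proportions show that the share of severity code 2 differs between groups. One way to check whether such a difference is larger than chance would suggest is a chi-square test of independence on the contingency table. The sketch below uses SciPy on a small made-up sample; the counts are illustrative and are not taken from the data set.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up sample: 10 collisions without speeding, 10 with speeding.
toy = pd.DataFrame({
    "SPEEDING": [0] * 10 + [1] * 10,
    "SEVERITYCODE": [1] * 7 + [2] * 3 + [1] * 6 + [2] * 4,
})

# Rows: SPEEDING, columns: SEVERITYCODE, cells: number of collisions.
table = pd.crosstab(toy["SPEEDING"], toy["SEVERITYCODE"])

# Null hypothesis: severity is independent of speeding.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

A small p-value would indicate that severity and the feature are unlikely to be independent. With roughly 195,000 rows even small differences can come out as significant, so the size of the difference still has to be judged separately.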

### Training the model

In [23]:
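# Features and target; the severity code is cast to string so it is treated as a class label
x = test_condition
y = df_data_1['SEVERITYCODE'].values.astype(str)
# Standardize the features (note: the scaler is fitted on all rows, before the split)
x = preprocessing.StandardScaler().fit(x).transform(x)
# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)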

# obtaining data dimensions
print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)

Training set:  (155738, 2) (155738,)
Testing set:  (38935, 2) (38935,)
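The cell above fits the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative, sketched here on random toy data rather than the real features, is to split first and wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is fitted on the training rows only. The toy arrays and the `stratify` argument are assumptions of this sketch, not part of the original run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random binary features and labels standing in for SPEEDING/ROADCOND and SEVERITYCODE.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 2)).astype(float)
y_toy = rng.integers(1, 3, size=200).astype(str)

# Split first; stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# The pipeline fits the scaler on X_tr only and reuses those statistics for X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.01, solver="liblinear"))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape, test_accuracy)
```

With two binary features the practical effect of this change is likely small, but the pattern carries over unchanged when more features are added.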




### Selecting the methods: Tree model, Logistic Regression and KNN methodology

In [None]:
# Decision tree: entropy split criterion, depth limited to 4
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
# Weighted F1 averages the per-class F1 scores by class frequency
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)

In [None]:
# Logistic regression with strong regularization (C=0.01)
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)

In [29]:
# K-nearest neighbors with k=4
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)
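Weighted F1 and accuracy summarize each model in a single number, which can hide how the model behaves on the less frequent class. `classification_report` is imported in the first cell but not used; a minimal sketch of what it and a confusion matrix show, on made-up labels rather than the real predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: 7 collisions of severity 1 and 3 of severity 2.
y_true = ["1"] * 7 + ["2"] * 3
# Made-up predictions: one severity-1 case and one severity-2 case are labeled "2".
y_pred = ["1"] * 6 + ["2"] + ["1"] * 2 + ["2"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["1", "2"])
print(cm)

# Precision, recall and F1 for each class separately.
report = classification_report(y_true, y_pred, labels=["1", "2"], zero_division=0)
print(report)
```

In the notebook, the same two calls could be applied to `y_test` and each model's `predicted` array.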

## Results 

### Comparing the results obtained 

In [31]:
results = {
    "Method of Analysis": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}

results = pd.DataFrame(results)
results

|   | Method of Analysis | F1-score | Accuracy |
|---|---|---|---|
| 0 | KNN | 0.591378 | 0.696751 |
| 1 | Decision Tree | 0.576051 | 0.699679 |
| 2 | LogisticRegression | 0.576051 | 0.699679 |
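The accuracies in the table are all close to 0.70. To put such numbers in context, it can help to compare them with a baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on toy labels with an assumed 70/30 class split; it does not use the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels with an assumed 70/30 split between severity 1 and severity 2.
y_toy = np.array(["1"] * 70 + ["2"] * 30)
# The baseline ignores the features, so a constant column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_pred = baseline.predict(X_toy)

baseline_acc = accuracy_score(y_toy, baseline_pred)
baseline_f1 = f1_score(y_toy, baseline_pred, average="weighted", zero_division=0)
print(baseline_acc, baseline_f1)
```

On this toy split the baseline reaches an accuracy of 0.70 and a weighted F1 of about 0.58. A model whose scores are close to these values adds little beyond predicting the majority class.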


In [32]:
# Inspecting the logistic regression intercept and coefficients
results = {
    "Intercept": LR_model.intercept_,
    "SPEEDING": LR_model.coef_[:,0],
    "ROADCOND": LR_model.coef_[:,1],
}

results = pd.DataFrame(results)
results

|   | Intercept | SPEEDING | ROADCOND |
|---|---|---|---|
| 0 | -0.853729 | 0.067702 | -0.068295 |


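According to the logistic regression, speeding is associated with higher odds of severity code 2 (positive coefficient), and normal road conditions with lower odds (negative coefficient, since ROADCOND is coded 1 for normal roads). Both coefficients are small, however, and the decision tree and logistic regression reach identical scores, with an accuracy of about 0.70, which is close to the share of severity code 1 in the data (about 70% according to the statistical summary). The two attributes alone are therefore weak predictors, and more factors must be taken into account.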
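Logistic regression coefficients are easier to read as odds ratios. The sketch below converts the values from the table above by hand. It assumes that the coefficients refer to severity code 2 (scikit-learn reports binary coefficients for the second class in sorted order) and that one unit of each feature is one standard deviation, because the features were standardized.

```python
import numpy as np

# Values copied from the coefficient table above.
intercept = -0.853729
coef = {"SPEEDING": 0.067702, "ROADCOND": -0.068295}

# exp(coefficient) is the multiplicative change in the odds of severity code 2
# for a one-unit (here: one standard deviation) increase in the feature.
odds_ratios = {name: float(np.exp(value)) for name, value in coef.items()}

# Probability of severity code 2 when both standardized features are 0 (their mean).
baseline_prob = float(1.0 / (1.0 + np.exp(-intercept)))

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
print(f"baseline probability of code 2: {baseline_prob:.3f}")
```

Odds ratios close to 1 indicate weak effects. The baseline probability should land near the overall share of severity code 2, which is about 0.30 according to the mean of `SEVERITYCODE` in the statistical summary.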

## Conclusion and recommendations

Knowing that speed and road conditions are factors that influence traffic accidents and their severity, recommendations for reducing accidents could include:

- Limiting speed in areas where road conditions are worse

- Installing surveillance and radar cameras

- Maintaining traffic control routes on these roads

- Enabling emergency exit areas and rescue stations

- Improving the maintenance, signaling and lighting of these streets