# Pipeline accidents in the US, 2010-

This database includes a record for each oil pipeline leak or spill reported to the Pipeline and Hazardous Materials Safety Administration since 2010. These records include the incident date and time, operator and pipeline, cause of incident, type of hazardous liquid and quantity lost, injuries and fatalities, and associated costs.

Eivind Strømsvåg 

Wednesday, 26. September

### In this kernel I will:

- Clean up missing data and drop a few columns
- Show my findings visually
- Find where the most accidents happen
- Look into the time and date to maybe find correlations
- Find correlations using different heatmaps
- Look into the accidents that happen above ground
- Use linear regression to find the R² score
- Analyse the score
- Evaluate the model with MAE, MSE and RMSE
- Also interpret the coefficients
- Then analyse with linear regression
- Different mean errors
- Then use logistic regression
- KNN
- SVC
- Random forest



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import cufflinks as cf
cf.go_offline()

In [None]:
acc = pd.read_csv('../input/database.csv')

In [None]:
acc.head()

### First, let's clean up the data.

Since I don't want to completely drop the rows with NaN, I replace each NaN with the value 0.


In [None]:
sns.heatmap(acc.isnull(),yticklabels=False,cbar=False,cmap='viridis')

First I drop the injury and fatality columns, along with a few columns we don't need (the report number and the operator ID).

In [None]:
cl_acc = acc.drop(['Operator Contractor Injuries',
                   'Other Injuries', 'All Injuries',
                   'Emergency Responder Injuries',
                   'Operator Contractor Fatalities',
                   'Public Injuries',
                   'Emergency Responder Fatalities',
                   'Public Fatalities',
                   'Report Number',
                   'Operator ID',
                   'Operator Employee Fatalities',
                   'Other Fatalities', 'All Fatalities',
                   'Operator Employee Injuries'], axis=1)

Filling in all the NaN values with 0, because I want to keep the columns.
Then checking whether there is still any missing data.

In [None]:
fin_acc = cl_acc.fillna(0.00)
sns.heatmap(fin_acc.isnull(),yticklabels=False,cbar=False,cmap='viridis')
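The heatmap gives a visual impression of the gaps; the share of missing values per column can also be read as numbers. The sketch below uses a small made-up frame (the values and the name `toy` are assumptions, not rows from the dataset) and also shows one side effect of filling with 0: the zeros are then treated as real measurements, which pulls the column mean down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the accident data
toy = pd.DataFrame({
    "All Costs": [100.0, np.nan, 250.0, 40.0],
    "Public Evacuations": [np.nan, np.nan, 12.0, np.nan],
})

# Share of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Mean before filling: NaN entries are skipped
mean_before = toy["All Costs"].mean()

# Filling with 0 removes every NaN ...
filled = toy.fillna(0)
remaining = int(filled.isnull().sum().sum())

# ... but the zeros now count as observations and lower the mean
mean_after = filled["All Costs"].mean()
print(remaining, mean_before, mean_after)
```

Whether 0 is a sensible fill value depends on the column: for a cost column it may mean "no cost reported", which is not necessarily the same as "no cost".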

Alright, now the data looks clean.

In [None]:
fin_acc.head()

### Top 5 operators by number of accidents

In [None]:
fin_acc['Operator Name'].value_counts().head(5)

### How many years of data? 

We have 8 different years.

In [None]:
fin_acc['Accident Year'].nunique()

### The most common place for accidents is by far above ground, followed by underground. I will illustrate with a countplot

In [None]:
fin_acc['Pipeline Type'].value_counts().head(5)

In [None]:
sns.countplot(x='Pipeline Type',data=fin_acc,hue='Pipeline Type',palette='coolwarm',
             dodge=False,order = fin_acc['Pipeline Type'].value_counts().index)
plt.tight_layout()

In [None]:
fin_acc['Pipeline Location'].iplot(kind='hist',
                                   bins=70,color='#FFBAD2',
                                   xTitle='Location',yTitle='Amount',
                                   title='Pipeline Location')

In [None]:
fin_acc.head(3)

### Let's look at the time/dates.

As the check below shows, the time/date is still a string,
not a float, even though it consists of numbers. That makes it easier for us.

In [None]:
type(fin_acc['Accident Date/Time'].iloc[0])

In [None]:
fin_acc['Accident Date/Time'] = pd.to_datetime(fin_acc['Accident Date/Time'])

I converted the strings to DateTime objects with `pd.to_datetime`; the first value serves as an example.

In [None]:
time = fin_acc['Accident Date/Time'].iloc[0]

In [None]:
#example
time

In [None]:
fin_acc['Hour'] =  fin_acc['Accident Date/Time'].apply(lambda time: time.hour)

fin_acc['Month'] =  fin_acc['Accident Date/Time'].apply(lambda time: time.month)

fin_acc['Day of Week'] =  fin_acc['Accident Date/Time'].apply(lambda time: time.dayofweek)

In [None]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}

In [None]:
fin_acc['Day of Week'] = fin_acc['Day of Week'].map(dmap)
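The same three features can also be derived without `apply` and without a hand-written day map, using the pandas `.dt` accessor. This is a minimal sketch on made-up timestamps (the frame `toy` and its values are assumptions); `dt.day_name()` returns full English day names, so slicing the first three letters reproduces the labels used in `dmap`.

```python
import pandas as pd

# Hypothetical timestamps, not taken from the dataset
toy = pd.DataFrame({
    "Accident Date/Time": ["2010-01-01 07:15", "2012-03-14 23:30", "2015-07-04 12:00"]
})
toy["Accident Date/Time"] = pd.to_datetime(toy["Accident Date/Time"])

# Vectorised extraction through the .dt accessor
toy["Hour"] = toy["Accident Date/Time"].dt.hour
toy["Month"] = toy["Accident Date/Time"].dt.month

# 'Friday' -> 'Fri', 'Wednesday' -> 'Wed', and so on
toy["Day of Week"] = toy["Accident Date/Time"].dt.day_name().str[:3]

print(toy)
```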

### Let's find out which year suffers the most accidents.

January seems to be the month with the most accidents. I like to use cufflinks because we can hover over a point to read the exact value.

In [None]:
fin_acc['Accident Year'].iplot(kind='hist',bins=70,color='green',xTitle='Year',yTitle='Amount')


Let's use groupby to see how many records we have for each Month value, per column.

In [None]:
by = fin_acc.groupby('Month').count()

In [None]:
by.head()

The choice of column doesn't matter; the important part is the total number,
which is 275, as shown above.

In [None]:
by['Accident Date/Time'].plot()
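The columns only agree with each other because the NaN values were filled earlier: `count()` skips missing entries, while `size()` counts rows regardless. A small sketch on made-up data (the frame `toy` is an assumption) shows the difference.

```python
import pandas as pd

# Hypothetical data with two missing cost values
toy = pd.DataFrame({
    "Month": [1, 1, 2, 3, 3, 3],
    "All Costs": [10.0, None, 5.0, 7.0, 8.0, None],
})

# Rows per month, independent of any missing values
per_month = toy.groupby("Month").size()

# Non-missing 'All Costs' entries per month
counted = toy.groupby("Month").count()["All Costs"]

print(per_month.to_dict())
print(counted.to_dict())
```

When the goal is simply the number of accidents per month, `size()` expresses that directly and does not depend on which column is picked.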

### Let's analyse the accidents that happen above ground.

In [None]:
fin_acc['Date']=fin_acc['Accident Date/Time'].apply(lambda t: t.date())

In [None]:
fin_acc[fin_acc['Pipeline Type']=='ABOVEGROUND'].groupby('Date').count()['Accident Date/Time'].plot()
plt.title('Aboveground')
plt.tight_layout()


In [None]:
fin_acc[fin_acc['Pipeline Type']=='UNDERGROUND'].groupby('Date').count()['Accident Date/Time'].plot()
plt.title('Underground')
plt.tight_layout()

# Some correlations

We are now going to look for correlations between the accident variables.
Can we find any interesting numbers?
It is easy to see all the black parts, which indicate no correlation at all. I find the lower-right region interesting: there are a lot of lighter colors, so next I will analyse that section.

In [None]:
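# Correlate numeric columns only; newer pandas versions raise an error on text columns
sns.heatmap(fin_acc.select_dtypes(include='number').corr())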

I will see if the Month column I made has any correlation with the rest.

In [None]:
from sklearn.preprocessing import LabelEncoder
labe = LabelEncoder()
dic = {}

labe.fit(fin_acc.Month.drop_duplicates())
dic['Month'] = list(labe.classes_)
fin_acc.Month = labe.transform(fin_acc.Month)

There is a lot of interesting data. Some obvious, some not so much. 
Note: this is just an example of how to do correlation tests.  

In [None]:
g = ['Public/Private Property Damage Costs',
     'Emergency Response Costs',
     'Environmental Remediation Costs',
     'Other Costs',
     'All Costs',
     'Net Loss (Barrels)',
     'Unintentional Release (Barrels)']
# Pearson correlation matrix of the selected cost and volume columns
t = np.corrcoef(fin_acc[g].values.T)
sns.set(font_scale=1.0)
# Named hm rather than map so the built-in map() is not shadowed
hm = sns.heatmap(t,
                 cbar=True,
                 cmap="YlGnBu",
                 annot=True,
                 square=True,
                 fmt='.1f',
                 annot_kws={'size': 10},
                 yticklabels=g,
                 xticklabels=g)
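As a cross-check of the correlation test, `DataFrame.corr()` computes the same Pearson matrix as `np.corrcoef` and keeps the column labels attached. The sketch below uses synthetic numbers (the frame `toy`, the seed, and the 0.8 slope are assumptions chosen only to produce one strongly correlated pair).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
release = rng.uniform(0, 100, size=50)

# Synthetic columns: net loss follows the release, other costs are unrelated
toy = pd.DataFrame({
    "Unintentional Release (Barrels)": release,
    "Net Loss (Barrels)": 0.8 * release + rng.normal(0, 5, size=50),
    "Other Costs": rng.uniform(0, 1000, size=50),
})

# Labelled correlation matrix straight from pandas
corr_labelled = toy.corr()

# Same numbers from NumPy, without labels
corr_raw = np.corrcoef(toy.values.T)

print(corr_labelled.round(2))
```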


# Linear Regression 
    

I want to explore the same columns by splitting the data into training and testing sets with scikit-learn.

In [None]:
fin_acc.info()

In [None]:
target = fin_acc['Accident Year']
feat = fin_acc[[
     'All Costs',
     'Other Costs',
     'Net Loss (Barrels)',
     'Unintentional Release (Barrels)']]

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train,test,train_label,test_label=train_test_split(feat,target,test_size=0.33,random_state=222)

In [None]:
from sklearn.linear_model import LinearRegression

lin = LinearRegression(fit_intercept=True)

model = lin.fit(train,train_label)

pred = model.predict(test)

As we can see, the R² score is negative, which means the chosen model does not follow the trend of the data: it fits worse than a horizontal line at the mean of the target.
 

In [None]:
from sklearn.metrics import r2_score
print(r2_score(test_label,pred))
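To make the meaning of a negative score concrete, here is a small sketch with made-up numbers (the arrays are assumptions, unrelated to the dataset). Predicting the mean of the target for every sample gives an R² of 0; predictions that are worse than that baseline push the score below 0.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the target (a horizontal line)
mean_pred = np.full_like(y_true, y_true.mean())

# Predictions that run against the trend of the data
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])

r2_mean = r2_score(y_true, mean_pred)
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```

R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares, so it has no lower bound.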

When we look at the lmplot, we can see that there is no clear linear trend.

In [None]:
sns.lmplot(x='Accident Latitude',y='Public/Private Property Damage Costs',data=fin_acc)

As we saw in the heatmap earlier, there was a strong correlation between the net loss of barrels and the unintentional release. Since the lmplot shows a more diagonal line, the R² score for this pair would probably be positive.

In [None]:
sns.lmplot(x='Unintentional Release (Barrels)',y='Net Loss (Barrels)',data=fin_acc)

### Evaluating the Model

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(test_label, pred))
print('MSE:', metrics.mean_squared_error(test_label, pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(test_label, pred)))

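MAE (Mean Absolute Error) is the mean of the absolute value of the errors:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

MSE (Mean Squared Error) is the mean of the squared errors:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

RMSE (Root Mean Squared Error) is the square root of the mean of the squared errors:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$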
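The three error measures can be worked through by hand with NumPy. The numbers below are made up for illustration (they are assumptions, not model output); the results match what `metrics.mean_absolute_error` and `metrics.mean_squared_error` would return for the same inputs.

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred          # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))     # mean of absolute errors
mse = np.mean(errors ** 2)        # mean of squared errors
rmse = np.sqrt(mse)               # back in the units of the target

print(mae, mse, rmse)
```

MSE punishes large errors more than MAE because of the squaring, and RMSE is usually the easiest to read since it is expressed in the same units as the target.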

In [None]:
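# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar figure
sns.histplot(test_label - pred, bins=50, kde=True)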

### Evaluating and interpreting the coefficients

In [None]:
print(lin.intercept_)

In [None]:
co = pd.DataFrame(lin.coef_,feat.columns,columns=['Coefficient'])
co
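Each coefficient is the expected change in the target for a one-unit increase in that feature, with the other features held fixed. The sketch below illustrates this on synthetic, noise-free data (the names `x1`, `x2`, `X_toy`, `toy_lin` and the formula are assumptions), where the fitted coefficients recover the numbers used to generate the target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Hypothetical features
X_toy = pd.DataFrame({
    "x1": rng.uniform(0, 10, size=100),
    "x2": rng.uniform(0, 10, size=100),
})

# Target built from a known formula: intercept 1, slopes 2 and -3
y_toy = 1.0 + 2.0 * X_toy["x1"] - 3.0 * X_toy["x2"]

toy_lin = LinearRegression().fit(X_toy, y_toy)

# Same table layout as the coefficient frame above
coef_table = pd.DataFrame(toy_lin.coef_, X_toy.columns, columns=["Coefficient"])
print(toy_lin.intercept_)
print(coef_table)
```

On real data the coefficients also depend on the units of each feature, so a small coefficient on a cost column measured in dollars does not by itself mean the feature is unimportant.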

# Logistic Regression

In [None]:
y = fin_acc['Cause Category']
# 'Net Loss (Barrels)' is listed once; a repeated column adds no information
X = fin_acc[['All Costs',
             'Other Costs',
             'Net Loss (Barrels)',
             'Unintentional Release (Barrels)',
             'Accident Latitude',
             'Accident Longitude',
             'Liquid Recovery (Barrels)',
             'Public Evacuations']]

In [None]:
fin_acc.info()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
log = LogisticRegression()

In [None]:

log.fit(X_train,y_train)

In [None]:
pred = log.predict(X_test)

In [None]:
from sklearn.metrics import classification_report

In [None]:
# Terrible score. Let's see if KNeighbors works better.
print(classification_report(y_test,pred))

# Using KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train,y_train)

In [None]:
predi = knn.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

Printing out the confusion matrix.

In [None]:
print(confusion_matrix(y_test,predi))

In [None]:
# Before tuning the K value
print(classification_report(y_test,predi))

In [None]:
err = []

for i in range(1, 30):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    predi_i = knn.predict(X_test)
    err.append(np.mean(predi_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,30),err, color='blue',linestyle='--',marker='o',
        markerfacecolor ='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
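Instead of reading the best K off the plot, it can be picked programmatically as the K with the lowest error rate. The sketch below uses a made-up error list (the values in `err_toy` are assumptions, not results from this dataset).

```python
import numpy as np

# Hypothetical error rates for K = 1..7, in the same layout as the err list above
ks = list(range(1, 8))
err_toy = [0.62, 0.60, 0.58, 0.55, 0.57, 0.56, 0.59]

# Index of the smallest error, mapped back to its K value
best_k = ks[int(np.argmin(err_toy))]
best_err = min(err_toy)

print(best_k, best_err)
```

Choosing K on the test set tends to give an optimistic error estimate; a separate validation split or cross-validation would be a more careful way to tune it.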

In [None]:
# We get slightly better precision when we use the K value with the lowest error rate. Still a terrible score.
knn = KNeighborsClassifier(n_neighbors=14)

knn.fit(X_train,y_train)
predi = knn.predict(X_test)
print('New version(14)')
print('\n')
print(confusion_matrix(y_test,predi))
print('\n')
# Report on the KNN predictions (predi), not the logistic regression ones (pred)
print(classification_report(y_test,predi))

# SVC

In [None]:
fin_acc.keys()
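A support vector classifier could be fitted on the same kind of train/test split as the other models. The following is only a sketch on synthetic blobs, not on the accident data: the data, the variable names, and the choice of an RBF kernel with default-like settings are all assumptions. The features are standardised first because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data with well-separated centers (hypothetical)
X_toy, y_toy = make_blobs(n_samples=300,
                          centers=[[-5, -5], [0, 5], [5, -5]],
                          cluster_std=1.0,
                          random_state=101)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.4,
                                          random_state=101)

# Scale the features, then fit an RBF-kernel SVC
svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svc.fit(X_tr, y_tr)

svc_pred = svc.predict(X_te)
svc_accuracy = accuracy_score(y_te, svc_pred)
print(svc_accuracy)
```

On the real features, whose scales range from coordinates to costs in dollars, the scaling step is likely to matter much more than on this toy data.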

# Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=200)

In [None]:
rfc.fit(X_train,y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rfc_pred))
print('\n')
print(classification_report(y_test,rfc_pred))