# QF 627 Programming and Computational Finance
## Lesson 01 | Revisit NumPy, Pandas, & Matplotlib (feat. `bokeh` & `seaborn`)

> Now that you are sufficiently familiar with the basics of data cleaning and analysis in pandas, we're going to take it up a notch. 

> Previously, the datasets were in relatively clean and straightforward formats. 

> However, in many cases, the data you analyze can be extremely messy and difficult to manage.

> That's why we're going to practice with a more unwieldy dataset. 

> You'll notice that it's quite a big file – about 1.7 million rows! 

> These are reports from accidents in New Jersey between 2008 and 2013 from the New Jersey Department of Transportation. 

> The data was scraped from [PDFs of crash reports](http://www.state.nj.us/transportation/refdata/accident/) filled out by clerks.

### Import pandas and let's load in our new and very messy data

In [1]:
import pandas as pd

In [2]:
accidents = pd.read_csv("accidents.csv", encoding = "ISO-8859-1")


In [3]:
# import pandas as pd

In [4]:
# accidents = pd.read_csv("accidents.csv", 
#                         encoding = "ISO-8859-1")

> You may notice that you get this warning.

`DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.`

> This dtype warning appears when a column contains both string and integer values. 
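> As a minimal sketch of what "specify dtype option on import" can look like, the example below reads a tiny made-up CSV (the `csv_text` sample is an assumption for illustration, not the real file) and asks pandas to read the code column as strings from the start.

```python
import io

import pandas as pd

# Hypothetical miniature of the real file: one code column with a blank entry
csv_text = "case code,Police Dept Code\nA1,1\nA2,99\nA3,  \nA4,2\n"

# dtype maps a column name to the type pandas should use when parsing it
df = pd.read_csv(io.StringIO(csv_text), dtype={"Police Dept Code": str})

codes = df["Police Dept Code"].tolist()
print(codes)
```

> With the column read as strings, every value has the same type, so there is nothing mixed for pandas to warn about.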

> You can ignore this for now because we'll fix it soon. Open up the first few rows of our dataframe.

In [5]:
accidents.head()

Unnamed: 0,case code,County Name,Municipality Name,Crash Date,Crash Day Of Week,Crash Time,Police Dept Code,Police Department,Police Station,Total Killed,...,Is Ramp,Ramp To/From Route Name,Ramp To/From Route Direction,Posted Speed,Posted Speed Cross Street,Latitude,Longitude,Cell Phone In Use Flag,Other Property Damage,Reporting Badge No.
0,2008010108-026816,ATLANTIC,ABSECON CITY,3/4/08,TU,1539,1,ATLANTIC CITY,AIU,0,...,,,,50,,39.41158,74.49162,N,NONE ...,384
1,2008010108-163190,ATLANTIC,ABSECON CITY,12/19/08,F,1114,1,ATLANTIC CITY,TRAFFIC,0,...,,,,50,,39.39231,74.48952,N,NONE ...,739
2,2008010108-24779,ATLANTIC,ABSECON CITY,11/25/08,TU,345,99,NJ TRANSIT P.D.,ATLANTIC CITY,0,...,,,,10,25.0,,,N,? ...,53
3,2008010108-3901,ATLANTIC,ABSECON CITY,3/31/08,M,105,1,EAST WINDSOR,TRAFFIC UNIT,0,...,,,,0,,,,N,NONE ...,551
4,2008010108-5016,ATLANTIC,ABSECON CITY,1/25/08,F,942,1,EGG HARBOR TWP,HQ,0,...,,,,50,40.0,39.43036,74.52469,N,NONE ...,1571


In [6]:
# accidents.head()

> Let's find out what we're working with, and get the column headers for all of the columns.

In [7]:
accidents.columns

Index(['case code', ' County Name', ' Municipality Name', ' Crash Date',
       ' Crash Day Of Week', ' Crash Time', ' Police Dept Code',
       ' Police Department', ' Police Station', ' Total Killed',
       ' Total Injured', ' Pedestrians Killed', ' Pedestrians Injured',
       ' Severity', ' Intersection', ' Alcohol Involved', ' HazMat Involved',
       ' Crash Type Code', ' Total Vehicles Involved', ' Crash Location',
       ' Location Direction', ' Route', ' Route Suffix',
       ' SRI (Std Rte Identifier)', ' MilePost  ', ' Road System',
       ' Road Character', ' Road Surface Type', ' Surface Condition',
       ' Light Condition', ' Environmental Condition', ' Road Divided By',
       ' Temporary Traffic Control Zone', ' Distance To Cross Street',
       ' Unit Of Measurement', ' Directn From Cross Street',
       ' Cross Street Name', ' Is Ramp', ' Ramp To/From Route Name',
       ' Ramp To/From Route Direction', ' Posted Speed',
       ' Posted Speed Cross Street', ' Latitud

In [8]:
# accidents.columns

> Bummer. There's our first problem. Notice that there's a leading space in almost every column header (and trailing spaces after `MilePost`). We should take them out.

In [9]:
accidents.rename(columns = lambda x: x.strip(), inplace = True)

In [10]:
# accidents.rename(columns = lambda x: x.strip(), inplace = True) # will address empty spaces on column headers

> Remember where we renamed the columns in our dataframe previously? 

> This time, we're using the same rename function to take out all of the leading and trailing spaces using `strip()`. 

> Pythonistas will notice that we're using a `lambda` expression to apply `strip()` to every single column header.
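> Here is a small, self-contained sketch of the same idea on a made-up dataframe (the `demo` frame is an assumption for illustration). It also shows `columns.str.strip()`, an alternative that should give the same result.

```python
import pandas as pd

# Hypothetical frame with the same kind of padded headers
demo = pd.DataFrame({"case code": [1],
                     " County Name": ["ATLANTIC"],
                     " MilePost  ": [0.5]})

# Option 1: rename() calls the lambda once per column header
renamed = demo.rename(columns=lambda name: name.strip())

# Option 2: the vectorized string accessor on the column index
demo.columns = demo.columns.str.strip()

print(list(renamed.columns))
print(list(demo.columns))
```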

In [11]:
accidents.columns

Index(['case code', 'County Name', 'Municipality Name', 'Crash Date',
       'Crash Day Of Week', 'Crash Time', 'Police Dept Code',
       'Police Department', 'Police Station', 'Total Killed', 'Total Injured',
       'Pedestrians Killed', 'Pedestrians Injured', 'Severity', 'Intersection',
       'Alcohol Involved', 'HazMat Involved', 'Crash Type Code',
       'Total Vehicles Involved', 'Crash Location', 'Location Direction',
       'Route', 'Route Suffix', 'SRI (Std Rte Identifier)', 'MilePost',
       'Road System', 'Road Character', 'Road Surface Type',
       'Surface Condition', 'Light Condition', 'Environmental Condition',
       'Road Divided By', 'Temporary Traffic Control Zone',
       'Distance To Cross Street', 'Unit Of Measurement',
       'Directn From Cross Street', 'Cross Street Name', 'Is Ramp',
       'Ramp To/From Route Name', 'Ramp To/From Route Direction',
       'Posted Speed', 'Posted Speed Cross Street', 'Latitude', 'Longitude',
       'Cell Phone In Use Flag', '

In [12]:
# accidents.columns

> Good job :)

> Let's describe() the dataframe.

In [13]:
accidents.describe()

Unnamed: 0,Total Killed,Total Injured,Pedestrians Killed,Pedestrians Injured,Total Vehicles Involved,Road System,Posted Speed
count,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0,1048575.0
mean,0.001965525,0.311951,0.0004854207,0.01735117,1.875997,5.199488,31.20711
std,0.04694568,0.701534,0.02211335,0.134448,0.5416507,2.49148,17.90289
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,2.0,2.0,25.0
50%,0.0,0.0,0.0,0.0,2.0,5.0,25.0
75%,0.0,0.0,0.0,0.0,2.0,7.0,45.0
max,5.0,42.0,2.0,10.0,20.0,10.0,99.0


In [14]:
# accidents.describe()

> But let's see if we could describe() a column. Let's use the describe() function for the `County Name` column header.

In [15]:
accidents["County Name"].describe()

count          1048575
unique              21
top       MIDDLESEX   
freq            115760
Name: County Name, dtype: object

In [16]:
# accidents["County Name"].describe()

> So there are 21 unique values in the `County Name` column (for the 21 counties in New Jersey). 

> We can see that the county with the most rows is Middlesex County with 115,760 crashes. 

> What are the names of the counties in New Jersey? Let's find out by using the unique() function on our `County Name` column.

In [17]:
accidents["County Name"].unique()

array(['ATLANTIC    ', 'BERGEN      ', 'BURLINGTON  ', 'CAMDEN      ',
       'CAPE MAY    ', 'CUMBERLAND  ', 'ESSEX       ', 'GLOUCESTER  ',
       'HUDSON      ', 'HUNTERDON   ', 'MERCER      ', 'MIDDLESEX   ',
       'MONMOUTH    ', 'MORRIS      ', 'OCEAN       ', 'PASSAIC     ',
       'SALEM       ', 'SOMERSET    ', 'SUSSEX      ', 'UNION       ',
       'WARREN      '], dtype=object)

In [18]:
# accidents["County Name"].unique()

> Looks like we're going to need to strip the spaces out of these county values. 

> This time we'll use the column's `map()` method, which applies `str.strip` to every string in the column and so removes the surrounding white space.

In [19]:
accidents["County Name"] = accidents["County Name"].map(str.strip)
accidents["County Name"].unique()

array(['ATLANTIC', 'BERGEN', 'BURLINGTON', 'CAMDEN', 'CAPE MAY',
       'CUMBERLAND', 'ESSEX', 'GLOUCESTER', 'HUDSON', 'HUNTERDON',
       'MERCER', 'MIDDLESEX', 'MONMOUTH', 'MORRIS', 'OCEAN', 'PASSAIC',
       'SALEM', 'SOMERSET', 'SUSSEX', 'UNION', 'WARREN'], dtype=object)

In [20]:
# accidents["County Name"] = accidents["County Name"].map(str.strip) # will address empty spaces on each cell
# accidents["County Name"].unique()
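> One caveat worth sketching: `map(str.strip)` expects every value to be a string, so a missing value would raise an error. The example below uses a made-up series (an assumption for illustration) and the `.str.strip()` accessor, which leaves missing values as they are.

```python
import numpy as np
import pandas as pd

# Hypothetical column with padded names and one missing value
counties = pd.Series(["ATLANTIC    ", "CAPE MAY    ", np.nan])

# .str.strip() trims each string and passes missing values through
cleaned = counties.str.strip()

print(cleaned.tolist())
```

> In this dataset `County Name` has no missing values, so both approaches behave the same here.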

> Python's built-in `map()` function (distinct from the pandas `Series.map()` method used above) returns a map object, which is an iterator over the results of applying the given function to each item of an iterable (list, tuple, etc.).

#### map(function, iterables)

* function : The function that `map()` applies to each element of the given iterable.
* iterables : One or more iterables to be mapped.

> NOTE : You can pass one or more iterables to the map() function.

* Returns a map object (an iterator) that yields the results of applying the given function to each item of the given iterable (list, tuple, etc.).
 
> NOTE : The map object returned by map() can then be passed to functions like list() (to create a list) or set() (to create a set).
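> Because the map object is an iterator, it can only be consumed once. A minimal sketch (variable names are illustrative):

```python
# map() is lazy: nothing is computed until the iterator is consumed
squares = map(lambda x: x * x, [1, 2, 3])

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # the iterator is now exhausted

print(first_pass)   # [1, 4, 9]
print(second_pass)  # []
```

> If you need the results more than once, store the list rather than the map object.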

In [21]:
def addition(q):
    return q + q

In [22]:
numbers = (1,2,3,4)
result = map(addition, numbers)
print(list(result))

[2, 4, 6, 8]


In [23]:
# def addition(q):
#     return q + q

In [24]:
# We double all numbers using map()
# numbers = (1, 2, 3, 4)
# result = map(addition, numbers)
# print(list(result)
#      )

> You can also use lambda expressions with map to achieve above result.

In [25]:
numbers = (1,2,3,4)
result = map(lambda x: x+x, numbers)
print(list(result))

[2, 4, 6, 8]


In [26]:
# numbers = (1, 2, 3, 4)
# result = map(lambda x: x + x, numbers)
# print(list(result))

In [27]:
list1 = [1,2,3,4]
list2 = [6,7,8,9,10]
result = map(lambda x, y: x+y, list1, list2)
print(list(result))

[7, 9, 11, 13]


In [28]:
# Add two lists using map and lambda
  
# numbers1 = [1, 2, 3]
# numbers2 = [4, 5, 6]
  
# result = map(lambda x, y: x + y, numbers1, numbers2)
# print(list(result))

In [29]:
l = ['qf', '627', 'lots', 'of', 'work']
result = map(list, l)
print(list(result))

[['q', 'f'], ['6', '2', '7'], ['l', 'o', 't', 's'], ['o', 'f'], ['w', 'o', 'r', 'k']]


In [30]:
# #List of strings
# l = ['qf', '627', "lovin'", 'it']
  
# # map() can listify the list of strings individually
# test = list(map(list, l))
# print(test)

> Good :) Speaking of strings, let's fix that dtype error we got at the beginning of the exercise. 

> Add `.dtypes` to the end of our dataframe.

In [31]:
accidents.dtypes

case code                         object
County Name                       object
Municipality Name                 object
Crash Date                        object
Crash Day Of Week                 object
Crash Time                        object
Police Dept Code                  object
Police Department                 object
Police Station                    object
Total Killed                       int64
Total Injured                      int64
Pedestrians Killed                 int64
Pedestrians Injured                int64
Severity                          object
Intersection                      object
Alcohol Involved                  object
HazMat Involved                   object
Crash Type Code                   object
Total Vehicles Involved            int64
Crash Location                    object
Location Direction                object
Route                             object
Route Suffix                      object
SRI (Std Rte Identifier)          object
MilePost        

In [32]:
# accidents.dtypes

> This shows us the data type (or dtype) of the values in every column. `object` usually means strings. `int64` columns hold integers. `float64` columns hold floats.

> The warning at the beginning said it was column 6 that had mixed dtypes. If you look at your column list and count to column 6 (remember to count from zero!), you'll see that it's the `Police Dept Code` column. Let's look at every unique value in that column.

In [33]:
accidents["Police Dept Code"].unique()

array(['1', '99', '  ', '2', '3', '4', 1, 99, 2, 3, 4], dtype=object)

In [34]:
# accidents["Police Dept Code"].unique()

> And there it is! As you can see, there are strings and integers mixed together in the same column.

In [35]:
accidents["Crash Type Code"].unique()

array(['2', '8', '1', '6', '11', '5', '3', '13', '7', '99', '15', '14',
       '10', '4', '12', '9', '16', '  ', '0'], dtype=object)

In [36]:
# accidents["Crash Type Code"].unique()

> Column 17, the `Crash Type Code` column, likewise holds codes, including a blank `'  '` entry. 

> Let's fix that by changing every value in both columns to a string using the `astype()` function.

In [37]:
accidents["Police Dept Code"] = accidents["Police Dept Code"].astype(str)
accidents["Crash Type Code"] = accidents["Crash Type Code"].astype(str)

In [38]:
# accidents["Police Dept Code"] = accidents["Police Dept Code"].astype(str)

> We're changing it to a string because these numbers are codes: we don't need to do math with them, so it's more useful to treat them as objects. 

> If you wanted to change something to an integer or a float, you'll need to use astype(int) and astype(float) respectively.
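> Note that `astype(int)` would fail on a column like this one because of the blank `'  '` entries. As a hedged sketch on a made-up series (an assumption for illustration), `pd.to_numeric` with `errors="coerce"` converts what it can and turns the rest into missing values.

```python
import pandas as pd

# Hypothetical codes, including a blank entry like the one in the real column
codes = pd.Series(["1", "99", "  ", "2"])

# errors="coerce" replaces unparseable values with NaN instead of raising
numeric = pd.to_numeric(codes, errors="coerce")

print(numeric)
print(numeric.isna().sum())  # number of values that could not be converted
```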

In [39]:
accidents["Police Dept Code"].unique()

array(['1', '99', '  ', '2', '3', '4'], dtype=object)

In [40]:
# accidents["Police Dept Code"]

> That took care of that :)

> Let's make our dataframe a little bit more manageable by weeding out some unnecessary columns. 

> Let's also create a new dataframe called `crash_info`.

In [41]:
accidents.columns

Index(['case code', 'County Name', 'Municipality Name', 'Crash Date',
       'Crash Day Of Week', 'Crash Time', 'Police Dept Code',
       'Police Department', 'Police Station', 'Total Killed', 'Total Injured',
       'Pedestrians Killed', 'Pedestrians Injured', 'Severity', 'Intersection',
       'Alcohol Involved', 'HazMat Involved', 'Crash Type Code',
       'Total Vehicles Involved', 'Crash Location', 'Location Direction',
       'Route', 'Route Suffix', 'SRI (Std Rte Identifier)', 'MilePost',
       'Road System', 'Road Character', 'Road Surface Type',
       'Surface Condition', 'Light Condition', 'Environmental Condition',
       'Road Divided By', 'Temporary Traffic Control Zone',
       'Distance To Cross Street', 'Unit Of Measurement',
       'Directn From Cross Street', 'Cross Street Name', 'Is Ramp',
       'Ramp To/From Route Name', 'Ramp To/From Route Direction',
       'Posted Speed', 'Posted Speed Cross Street', 'Latitude', 'Longitude',
       'Cell Phone In Use Flag', '

In [42]:
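# .copy() makes crash_info an independent dataframe, so later column
# assignments do not trigger a SettingWithCopyWarning
crash_info = accidents[["County Name", "Municipality Name", "Crash Date",
               "Crash Day Of Week", "Crash Time", "Total Killed",
               "Total Injured", "Pedestrians Killed", "Pedestrians Injured",
               "Total Vehicles Involved", "Alcohol Involved", "Cell Phone In Use Flag"]].copy()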

In [43]:
accidents.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 47 columns):
 #   Column                          Non-Null Count    Dtype 
---  ------                          --------------    ----- 
 0   case code                       1048575 non-null  object
 1   County Name                     1048575 non-null  object
 2   Municipality Name               1048575 non-null  object
 3   Crash Date                      1048575 non-null  object
 4   Crash Day Of Week               1048575 non-null  object
 5   Crash Time                      1048575 non-null  object
 6   Police Dept Code                1048575 non-null  object
 7   Police Department               1048575 non-null  object
 8   Police Station                  1048575 non-null  object
 9   Total Killed                    1048575 non-null  int64 
 10  Total Injured                   1048575 non-null  int64 
 11  Pedestrians Killed              1048575 non-null  int64 
 12  Pedestrians In

In [44]:
# crash_info = accidents[["County Name", "Municipality Name", "Crash Date",
#                "Crash Day Of Week", "Crash Time", "Total Killed",
#                "Total Injured", "Pedestrians Killed", "Pedestrians Injured",
#                "Total Vehicles Involved", "Alcohol Involved", "Cell Phone In Use Flag"]]

In [45]:
# accidents.info()

### How many car accidents had alcohol involved?

Let's find out the unique values that come up in the column `Alcohol Involved`.

In [46]:
# crash_info["Alcohol Involved"].unique()

In [47]:
crash_info["Alcohol Involved"].unique()

array(['N', 'Y'], dtype=object)

> We have only two unique values in the column: `N` for `no` and `Y` for `yes`.

> Let's find out how many incidents had Ns and how many had Ys. 

> We're going to use the function value_counts() on the column 'Alcohol Involved'. 

> We're also going to put the list in a new dataframe called `alcohol` so that it will look nicer in our notebook.

In [48]:
alcohol = pd.DataFrame(accidents["Alcohol Involved"].value_counts())
alcohol

Alcohol Involved,count
N,1017766
Y,30809


In [49]:
# alcohol = pd.DataFrame(accidents["Alcohol Involved"].value_counts())
# alcohol

> A lot more Ns than Ys. But just what percentage are the Ys compared to the Ns? 

> First, let's get the total number of crashes in our data frame.

In [50]:
crash_count = crash_info["Alcohol Involved"].count()
crash_count

1048575

In [51]:
# crash_count = crash_info["Alcohol Involved"].count()
# crash_count

> `Be careful`. 

> The `count()` function doesn't count `NAs` or `null` values. 

> Always make sure to check for those using the `isnull()` function, followed by `sum()`

In [52]:
crash_info["Alcohol Involved"].isnull().sum()

0

In [53]:
# crash_info["Alcohol Involved"].isnull().sum()
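> To see the difference concretely, here is a minimal sketch on a made-up series with one missing value (the `flags` data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical flag column with one missing entry
flags = pd.Series(["N", "Y", None, "N"])

non_null = flags.count()         # counts only non-missing values
total_rows = len(flags)          # counts every row
missing = flags.isnull().sum()   # counts only the missing values

print(non_null, total_rows, missing)  # 3 4 1
```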

> Let's create a new column named `Percentage`: divide every count in the `alcohol` dataframe by the total number of crashes in `crash_count`, which we created above, and then multiply by 100.

In [54]:
alcohol

Alcohol Involved,count
N,1017766
Y,30809


In [55]:
# value_counts() in pandas 2.0+ names its counts column "count"
alcohol["Percentage"] = alcohol["count"] / crash_count * 100
alcohol

In [None]:
# alcohol

In [None]:
# alcohol["Percentage"] = alcohol["count"]/crash_count * 100
# alcohol

> Mystery solved. Only 2.9 percent.
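> A shorter route to the same kind of figure is `value_counts(normalize=True)`, which returns proportions instead of counts. The sketch below uses made-up data (97 `N` and 3 `Y`, an assumption for illustration), not the real column.

```python
import pandas as pd

# Hypothetical flags: 97 crashes without alcohol, 3 with
flags = pd.Series(["N"] * 97 + ["Y"] * 3)

# normalize=True divides each count by the number of non-missing rows
share = flags.value_counts(normalize=True) * 100

print(share)
```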

### How many total people were killed in every county?

> Let's first use the `value_counts()` function to find out how many accidents were reported in each county.

In [None]:
crash_info["County Name"].value_counts()

In [None]:
# crash_info["County Name"].value_counts()

> So let's split up every incident that happened in every county by using the `groupby()` function.

In [None]:
crash_info.groupby("County Name")

In [None]:
# crash_info.groupby("County Name")

> That looks like it did nothing, but it actually DID split up the counties into their own separate groups. 

> We just need to now perform an action. 

> If you notice, there are columns like `Total Killed`, `Total Injured`, `Pedestrians Killed`, etc. that have numbers or integers that can be summed up. 

> Basically, we're going to add them all up by using the `sum()` function and make it into a new dataframe called `county_level_crash_info`.

In [None]:
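# numeric_only = True sums only the numeric columns and skips the text columns
county_level_crash_info = crash_info.groupby("County Name").sum(numeric_only = True)
county_level_crash_info

In [None]:
# county_level_crash_info = crash_info.groupby("County Name").sum(numeric_only = True)
# county_level_crash_info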
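> Here is a minimal sketch of split-then-sum on a made-up frame (the `demo` data is an assumption for illustration). Passing `numeric_only=True` keeps text columns such as the day of the week out of the result.

```python
import pandas as pd

# Hypothetical crashes: two in ESSEX, one in OCEAN
demo = pd.DataFrame({
    "County Name": ["ESSEX", "ESSEX", "OCEAN"],
    "Crash Day Of Week": ["M", "TU", "F"],
    "Total Killed": [1, 0, 2],
    "Total Injured": [0, 3, 1],
})

# Group rows by county, then add up the numeric columns within each group
totals = demo.groupby("County Name").sum(numeric_only=True)

print(totals)
```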

> Well, that's grim. 

> Let's just take out the `Total Killed` column using `iloc`, which slices data by integer position. 

> The first value represents the rows and is separated by a comma from the second value, which represents the columns. 

> Therefore, if we want all of the rows, we put a colon. We then separate using a comma. Then, because `Total Killed` is the first column of the summed result, we can slice it by putting in a zero. 

> We will also sort it by using `sort_values` and adding the option `ascending=False` because we want the values to descend. 

> Let's store the result in a new variable called `county_level_total_death`.

In [None]:
# numeric_only = True keeps Total Killed as the first column of the summed result
county_level_total_death = crash_info.groupby("County Name").sum(numeric_only = True).iloc[:, 0].sort_values(ascending = False)
county_level_total_death

In [None]:
# county_level_total_death = crash_info.groupby("County Name").sum(numeric_only = True).iloc[ : , 0].sort_values(ascending = False)
# county_level_total_death
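> Slicing by position depends on the column order, so selecting by label can be more robust. A minimal sketch on a made-up summary table (the `totals` frame is an assumption for illustration) shows that both give the same result when `Total Killed` is in position 0.

```python
import pandas as pd

# Hypothetical county-level totals
totals = pd.DataFrame(
    {"Total Killed": [3, 7, 1], "Total Injured": [10, 20, 5]},
    index=pd.Index(["ESSEX", "OCEAN", "SALEM"], name="County Name"),
)

# Select the first column by position, then by name, and sort descending
by_position = totals.iloc[:, 0].sort_values(ascending=False)
by_label = totals["Total Killed"].sort_values(ascending=False)

print(by_label)
```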

> What would be the `type` of `county_level_total_death`?

In [None]:
type(county_level_total_death)

In [None]:
county_level_total_death.dtype

In [None]:
# type(county_level_total_death)

In [None]:
# county_level_total_death.dtype

Let's make `county_level_total_death` into a dataframe.

In [None]:
pd.DataFrame(county_level_total_death)

In [None]:
# pd.DataFrame(county_level_total_death)

### What about dates?

In [None]:
crash_info

In [None]:
# crash_info

> You may have noticed that the dates in the `Crash Date` column are strings and not Python date objects. 

> This will be inconvenient because if you sort them you'll get '01/01/2008, 01/01/2009, 01/01/2010' etc. 

> We want them to sort by date correctly, and in order to do that, we need to turn them into the Python date format.

> ***We will need to `import datetime` first.*** 

In [None]:
from datetime import datetime

In [None]:
# from datetime import datetime

> Then we will use `apply()` along with a lambda function to parse every string in that column with the format `"%m/%d/%y"` (note the lowercase `y` for a two-digit year), turning it into a Python date.

In [None]:
Date = pd.Series(["3/4/08"])
Date

In [None]:
crash_info["Crash Date"]

In [None]:
crash_info["Crash Date"] = crash_info["Crash Date"].apply(lambda x: datetime.strptime(x, "%m/%d/%y").date())

In [None]:
# crash_info["Crash Date"] = crash_info["Crash Date"].apply(lambda x: datetime.strptime(x, "%m/%d/%y").date())
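> As an alternative sketch, `pd.to_datetime` can parse the whole column at once with the same format string. The sample dates below are taken from the first rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(["3/4/08", "12/19/08", "1/25/08"])

# Sorting the raw strings compares characters, not calendar dates
as_text = sorted(dates)

# Parsing first gives real timestamps that sort chronologically
parsed = pd.to_datetime(dates, format="%m/%d/%y")
chronological = parsed.sort_values()

print(as_text)
print(chronological.tolist())
```

> The text sort puts December before March, while the parsed dates sort in calendar order.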

> Now we're ready to `groupby()` the `Crash Date` column, which groups the rows by date, and count how many accidents happened every day. 

> And then we will slice the first column which is how many crashes happened each day using iloc. (Colon for all rows, comma, then 0 for the first column)

In [None]:
crash_info["Crash Date"]

In [None]:
# crash_info["Crash Date"]

> Now let's count the crashes per date and sort.

In [None]:
crash_by_date = crash_info.groupby("Crash Date").count().iloc[:,0]
crash_by_date

In [None]:
crash_by_date.sort_values(ascending = False)

In [None]:
# crash_by_date = crash_info.groupby("Crash Date").count().iloc[:,0]
# crash_by_date

In [None]:
# crash_by_date.sort_values(ascending = False)
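> One subtlety: `count()` skips missing values in each column, so `count().iloc[:, 0]` only equals the number of rows when the first column has no gaps. The sketch below, on a made-up frame (an assumption for illustration), compares it with `size()`, which counts rows regardless.

```python
import pandas as pd

# Hypothetical crashes: two on one date (one missing a county), one on another
demo = pd.DataFrame({
    "Crash Date": ["2008-02-12", "2008-02-12", "2008-03-04"],
    "County Name": ["ESSEX", None, "OCEAN"],
})

via_count = demo.groupby("Crash Date").count().iloc[:, 0]  # non-missing County Name per date
via_size = demo.groupby("Crash Date").size()               # rows per date

print(via_count)
print(via_size)
```

> In this lesson's data `County Name` has no missing values, so the two approaches agree.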

Looks like [February 12, 2008 was a busy day for New Jersey](https://www.weather.gov/media/phi/StormReports/February12-132008.pdf), with 3,050 accidents reported that day.

> Let's now save the following dataframes as CSV files.

In [None]:
county_level_total_death

In [None]:
crash_by_date

In [None]:
county_level_crash_info

In [None]:
crash_by_date.to_csv("linechart.csv")
county_level_total_death.to_csv("barchart.csv")
county_level_crash_info.to_csv("scatterplot.csv")

In [None]:
# county_level_total_death

In [None]:
# crash_by_date

In [None]:
# county_level_crash_info

In [None]:
# crash_by_date.to_csv("linechart.csv")

# county_level_total_death.to_csv("barchart.csv")

# county_level_crash_info.to_csv("scatterplot.csv")

### IMPORT

In [None]:
crash_by_date_line = pd.read_csv("linechart.csv")

county_death_bar = pd.read_csv("barchart.csv")

county_crash_scatter = pd.read_csv("scatterplot.csv")

In [None]:
# crash_by_date_line = pd.read_csv("linechart.csv")

# county_death_bar = pd.read_csv("barchart.csv")

# county_crash_scatter = pd.read_csv("scatterplot.csv")
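> Note what happens on the round trip: a Series written with `to_csv()` comes back from `read_csv()` as a two-column dataframe, with the old index as an ordinary column. A minimal sketch using an in-memory buffer instead of a file (the data and names are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical county totals as a Series with a named index
deaths = pd.Series([7, 3],
                   index=pd.Index(["OCEAN", "ESSEX"], name="County Name"),
                   name="Total Killed")

buffer = io.StringIO()
deaths.to_csv(buffer)   # writes the index and the values, with a header row
buffer.seek(0)

restored = pd.read_csv(buffer)  # the index comes back as a regular column

print(restored)
```

> This is why the plotting code can refer to `County Name` as a column after the import.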

> Another great feature of doing Python analysis in a Jupyter notebook is the ability to visualize the data using the [Bokeh visualization library](http://bokeh.pydata.org/en/latest/). 

> We won't go into great detail on the step-by-step process of creating beautiful graphics in your notebook, but you can see what's possible below. 

> You can read more documentation on Bokeh [here](http://bokeh.pydata.org/en/latest/docs/user_guide.html#userguide)

#### Let's upload the datasets we'll use which we created above.

In [None]:
%pip install bokeh

In [None]:
from bokeh.plotting import figure, show, output_file, output_notebook
from bokeh.models import HoverTool

In [None]:
# %pip install bokeh

In [None]:
# from bokeh.plotting import figure, show, output_file, output_notebook
# from bokeh.models import HoverTool

### `Bar plot`

> Let's have a look at `Total Killed` in **each county**

In [None]:
county_death_bar

In [None]:
county_death_bar.sort_values(by = "Total Killed", ascending = False)

In [None]:
output_notebook()

county_name = county_death_bar["County Name"]

# Bokeh 3.0+ uses width/height (plot_width/plot_height were removed)
bar = figure(title = "Total Death by County",
            x_range = county_name,
            width = 800,
            height = 600,
            toolbar_location = None,
            tools = "")

bar.vbar(x = "County Name",
        top = "Total Killed",
        source = county_death_bar,
        width = 0.8)

bar.xaxis.major_label_orientation = "vertical"
bar.y_range.start = 0
bar.xgrid.grid_line_color = None

output_file("Your_First_Bokeh_Barplot.html")

show(bar)

In [None]:
# county_death_bar

In [None]:
# county_death_bar.sort_values(by = "Total Killed", ascending = False)

In [None]:
# output_notebook()

# county_name = county_death_bar["County Name"]

# bar = figure(title = "Total Death by County",
#              x_range = county_name,
#              width = 800,   # Bokeh 3.0+ uses width/height
#              height = 600,
#              toolbar_location = None,
#              tools = "")

# bar.vbar(x = "County Name",
#          top = "Total Killed",
#          source = county_death_bar, # this is where you input your DF
#          width = 0.8)

# bar.xaxis.major_label_orientation = "vertical"
# bar.y_range.start = 0
# bar.xgrid.grid_line_color = None

# output_file("Your_First_Bokeh_Barplot.html")

# show(bar)

### `Scatter plot`

> Let's take a look at the relationship between `Total Killed` and `Pedestrians Killed` in each county.

In [None]:
county_crash_scatter

In [None]:
scatterplot = figure(title = "The Relationships between Total Death and Pedestrians Killed in Each County",
                     x_axis_label = "Total Killed",
                     y_axis_label = "Pedestrians Killed")

scatterplot.circle("Total Killed",
                   "Pedestrians Killed",
                   source = county_crash_scatter) # Again, this is where you input your DF

output_file("Your_First_Scatter_with_Bokeh.html")

show(scatterplot)

In [None]:
# county_crash_scatter

In [None]:
# scatterplot = figure(title = "The Relationships between Total Death and Pedestrians Killed in Each County",
#                      x_axis_label = "Total Killed",
#                      y_axis_label = "Pedestrians Killed")

# scatterplot.circle("Total Killed",
#                    "Pedestrians Killed",
#                    source = county_crash_scatter) # Again, this is where you input your DF

# output_file("Your_First_Scatter_with_Bokeh.html")

# show(scatterplot)

> You might want to create a `regression line` :)

> As you will learn later in the course, you can use the `seaborn` library.  

> `seaborn` is a Python data visualization library based on `matplotlib`. 

> It provides a high-level interface for drawing attractive and informative statistical graphics.

In [None]:
# %pip install seaborn

In [None]:
# import seaborn as sns

In [None]:
# sns.lmplot(x = "Total Killed",
#            y = "Pedestrians Killed",
#            data = county_crash_scatter)

In [None]:
# import matplotlib.pyplot as plt

# sns.jointplot(x = "Total Killed",
#               y = "Pedestrians Killed",
#               data = county_crash_scatter,
#               kind = "reg",
#               joint_kws = {"color":"red"})
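> For intuition about what a regression line is, here is a minimal NumPy sketch of a straight-line least-squares fit. The numbers are made up for illustration (they are not the county data), and `np.polyfit` is a swapped-in technique, not necessarily how `seaborn` computes its fit internally.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Degree-1 polynomial fit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

# Predicted y values along the fitted line
fitted = slope * x + intercept

print(slope, intercept)
```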

### `Line Chart`

> Let's see the number of New Jersey car crashes over time (2008-2013)

In [None]:
# crash_by_date_line

In [None]:
# crash_by_date_line["Crash Date"] = pd.to_datetime(crash_by_date_line["Crash Date"])
# crash_by_date_line

In [None]:
# line = figure(title = "The Number of New Jersey Car Crashes Over Time (2008-2013)",
#               x_axis_type = "datetime",
#               width = 800,   # Bokeh 3.0+ uses width/height instead of plot_width/plot_height
#               height = 600)

# The "County Name" column holds the per-day crash counts produced by count().iloc[:,0]
# line.line(crash_by_date_line["Crash Date"],
#           crash_by_date_line["County Name"],
#           line_width = 1,
#           line_color = "purple")

# line.yaxis.axis_label = "Number of Crashes"
# line.xaxis.axis_label = "Crash Date"
# line.xaxis.major_label_orientation = "vertical"

# output_file("line_timeseries.html")
# show(line)

> `Thank you for working with the script :)`