# Exploring a CSV file – dates, categories, and places

In [24]:
import pandas as pd
import altair as alt

> Information about mining accidents published annually in the Queensland Legislative Assembly Votes and Proceedings (later known as Queensland Parliamentary Papers) from 1882 to 1945.

In [2]:
# Queensland Mining accidents from SLQ
csv_url = 'https://www.data.qld.gov.au/dataset/2e5b65d7-09d5-403f-a5d5-a552410f2d5d/resource/35ea936d-083e-4ad6-beab-e0fede2cd3a6/download/slqqldminingaccidents.csv'

Trying to open this file with `pd.read_csv(csv_url)` results in an encoding error. By default, Pandas expects files to use `utf-8` encoding, but if that doesn't work you can ask it to try another encoding, such as `ISO-8859-1`. In some cases it takes a bit of trial and error to find the right one. The GLAM CSV Explorer tries `utf-8`, `ISO-8859-1`, and `latin-1` before giving up.
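The trial-and-error approach can be wrapped in a small helper. This is only a sketch: `read_csv_with_fallback` is a hypothetical function name, not part of Pandas or the GLAM CSV Explorer, and the list of encodings is an assumption you would adjust for your own files.

```python
import pandas as pd


def read_csv_with_fallback(source, encodings=('utf-8', 'ISO-8859-1')):
    """Try each encoding in turn; return the DataFrame and the encoding that worked.

    Hypothetical helper for illustration only.
    """
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(source, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            # Remember the failure and move on to the next candidate
            last_error = error
    raise ValueError(f'None of {encodings} could decode {source!r}') from last_error
```

It could be called as `df, used_encoding = read_csv_with_fallback(csv_url)`. Note that `ISO-8859-1` maps every possible byte to a character, so it never raises a decoding error. A successful read therefore doesn't prove the encoding is right, and it's still worth checking the text for garbled characters.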

In [6]:
df = pd.read_csv(csv_url, encoding='ISO-8859-1')
df.head()

Unnamed: 0,'ID,Name,Field2,Year,Session,Page,Date,District,Geo-subject,Name Of Mine,Latitude,Longitude,Nature of Injuries,Remarks,Field1,Newspapers,Continue
0,2652,"Weir, W.",,1908,3.0,573.0,(May 1907),Charters Towers,Charters Towers (Qld.),Mills United,"-20.079251,146.257961","-20.079251,146.257961",,Injured. Fell off ladder in stopes and broke ...,,,
1,2659,"Martin, F.",,1908,3.0,574.0,(May 1907),Charters Towers,Charters Towers (Qld.),Mills United,"-20.079251,146.257961","-20.079251,146.257961",`,Injured. Foot jammed while riding on tank in ...,,,
2,2683,"Williams, O.",,1908,3.0,575.0,(May 1907),Charters Towers,Charters Towers (Qld.),Brilliant St. George G.M.,"-20.075472,146.269151","-20.075472,146.269151",,Injured. Cut back by fall of stone.,,,
3,2778,"Morris, Edwd.",,1908,3.0,581.0,(May 1907),Charters Towers,Charters Towers (Qld.),Mills United,"-20.079251,146.257961","-20.079251,146.257961",,Injured. Received cut head through falling of...,,,
4,2750,"Bryden, J. H.",,1908,3.0,579.0,(Jun 1907),Charters Towers,Charters Towers (Qld.),Mills United,"-20.079251,146.257961","-20.079251,146.257961",,Injured. Lost top of finger against chute in l...,,,


## Number of records

Use `.shape` to find the number of rows and columns in the dataset.

In [14]:
df.shape

(8903, 17)

So there are 8,903 rows in this CSV file.

## Examining dates

It looks like there are two fields containing dates. `Year` seems to be the year the accident was reported to Parliament, while `Date` looks like the actual date of the accident.

The first question we might ask is: does every record have a date? By using `.dropna()` we can exclude rows without a value and count what's left.

In [64]:
df['Year'].dropna().shape

(8903,)

If we compare this to the total number of rows above, we see that every row has a `Year`.
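The same question can be asked for several columns at once. The sketch below uses a few made-up rows rather than the real dataset (the `sample` DataFrame is purely illustrative) to show `.count()`, which reports non-missing values, and `.isna().sum()`, which reports missing ones.

```python
import pandas as pd

# Made-up sample rows standing in for the real dataset
sample = pd.DataFrame({
    'Year': [1908, 1908, 1901],
    'Date': ['(May 1907)', None, '13/01/1900'],
})

# count() returns the number of non-missing values in each column
non_missing = sample[['Year', 'Date']].count()

# isna().sum() returns the number of missing values in each column
missing = sample[['Year', 'Date']].isna().sum()

print(non_missing)
print(missing)
```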

In [63]:
df['Date'].dropna().shape

(7255,)

But not every row has a `Date`. Let's have a look at the values in the `Date` field.

In [65]:
df['Date'].dropna()

0       (May 1907)
1       (May 1907)
2       (May 1907)
3       (May 1907)
4       (Jun 1907)
           ...    
7250    13/01/1900
7251    12/01/1900
7252    11/01/1900
7253     7/01/1900
7254     5/01/1900
Name: Date, Length: 7255, dtype: object

So not only are some dates missing, but the format in which they're recorded varies. Let's try to normalise the values by extracting a year from each date. We're going to use a regular expression – `\d{4}` – to look for a run of four digits in the date string. We'll save the extracted year to a new column – `accident_year`.

In [66]:
# expand=False returns a Series rather than a one-column DataFrame
df['accident_year'] = df['Date'].dropna().str.extract(r'(\d{4})', expand=False)
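To see what the pattern does with each date format, here is a small sketch using a handful of values copied from the `Date` column plus one missing value. The `dates` and `years` names are just for illustration.

```python
import pandas as pd

# Values in the two formats seen in the Date column, plus a missing value
dates = pd.Series(['(May 1907)', '13/01/1900', '7/01/1900', None])

# Capture the first run of four digits in each string
years = dates.str.extract(r'(\d{4})', expand=False)

print(years)
```

The first three values yield `'1907'`, `'1900'`, and `'1900'`, and the missing value stays missing. The extracted years are strings, not numbers, so comparisons need to use strings as well.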

By comparing the number of `accident_year` values to the number of `Date` values we can see how many dates we managed to extract years from.

In [69]:
df['accident_year'].dropna().shape

(7255,)

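It's the same as the number of `Date` values – so we seem to have extracted a year from every available date. That doesn't mean every extracted year is correct, though. The dataset covers 1882 to 1945, but one record has a date in 2012.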
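A quick range check can show whether the extracted years are plausible. The sketch below uses made-up sample values and flags any year outside the 1882 to 1945 span given in the dataset description. Treat those bounds as an assumption: accidents may have happened a little before they were reported, so the limits might need loosening.

```python
import pandas as pd

# Made-up sample of extracted years (strings, as str.extract returns them)
extracted = pd.Series(['1907', '1900', '2012', None], name='accident_year')

# Convert to numbers; missing or unparseable values become NaN
years = pd.to_numeric(extracted, errors='coerce')

# Keep only years outside the assumed 1882-1945 range
# (comparisons with NaN are False, so missing values are not flagged)
out_of_range = years[(years < 1882) | (years > 1945)]

print(out_of_range)
```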

In [23]:
df.loc[df['accident_year'] == '2012']

Unnamed: 0,'ID,Name,Field2,Year,Session,Page,Date,District,Geo-subject,Name Of Mine,Latitude,Longitude,Nature of Injuries,Remarks,Field1,Newspapers,Continue,accident_year
7,8423,"Mann, H.",,1946,2.0,160.0,22/05/2012,Charters Towers,Charters Towers (Qld.),Black Jack,"-20.150396,146.221676","-20.150396,146.221676",,"Injured. Lacerated L hand, caused while break...",,,,2012


In [27]:
accidents_by_year = df['accident_year'].value_counts().to_frame().reset_index()
accidents_by_year.columns = ['year', 'accidents']
accidents_by_year.head()

Unnamed: 0,year,accidents
0,1935,384
1,1934,360
2,1912,348
3,1938,329
4,1932,239
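`value_counts()` orders its results by frequency, which is why the rows above aren't in date order. If a chronological table is more useful, the counts can be sorted by year instead. This is a sketch on made-up values, and the `by_year` name is only for illustration.

```python
import pandas as pd

# Made-up sample of extracted years
years = pd.Series(['1935', '1934', '1935', '1912', '1934', '1935'])

by_year = (
    years.value_counts()              # count occurrences of each year
    .sort_index()                     # order by year rather than by count
    .rename_axis('year')              # name the index so it becomes a column
    .reset_index(name='accidents')    # turn the counts into a DataFrame
)

print(by_year)
```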


In [42]:
alt.Chart(accidents_by_year).mark_bar(size=4).encode(
    x='year:T',
    y='accidents:Q'
).properties(width=600)

In [70]:
df['Nature of Injuries'].value_counts()[:20]

Killed              100
Scalp wound          68
Broken leg           43
Bruised              41
Crushed to death     33
Bruises              30
Fatal                23
Severe bruises       20
Fractured skull      19
Bruised foot         19
Slight bruises       19
Leg broken           17
Sprained ankle       16
Scalp wounds         16
Contusions           16
Slightly bruised     15
Drowned              14
Hand injured         13
Broken arm           13
Fatally injured      12
Name: Nature of Injuries, dtype: int64

In [71]:
df['Name Of Mine'].value_counts()[:20]

Mount Morgan                    602
Mount Isa Mines                 354
Bowen State                     343
Mount Morgan Mine               240
State Smelters                  145
Brilliant Extended              120
Bowen Consolidated              116
Chillagoe State Smelters        114
Mount Mulligan                  114
Aberdare Extended No. 1         108
New Aberdare                    107
Tannymorel                       84
Mount Mulligan Colliery          76
Noblevale No. 1                  76
Scottish Gympie                  74
Mount Morgan Reduction Works     70
Blackheath                       70
Mount Coolon                     69
Mount Isa                        66
Mills United                     64
Name: Name Of Mine, dtype: int64