## Exploring Mammal Species Records within the North East of England ##

#### Initial exploration and data cleaning - to-do: ####
* ~~Check fields, remove any which are not useful~~
* Check for N/As - do any need removing
* Check for datetime fields, change any if necessary - combine Start Date and End Date into a single 'Date' where necessary
* Check for duplicates
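
The duplicate check in the list above could be sketched roughly as follows. This is a minimal example on invented data: using 'Occurrence ID' as the identifying column is an assumption, and `count_duplicates` is a hypothetical helper, not something the notebook has settled on.

```python
import pandas as pd

def count_duplicates(df, subset=None):
    """Count rows that repeat an earlier row, optionally judged on a subset of columns."""
    return int(df.duplicated(subset=subset).sum())

# Toy data standing in for the real records (values are invented)
toy = pd.DataFrame({
    "Occurrence ID": ["A1", "A2", "A2", "A3"],
    "Common name": ["Stoat", "Brown Hare", "Brown Hare", "Stoat"],
})

n_full = count_duplicates(toy)                              # rows identical in every column
n_by_id = count_duplicates(toy, subset=["Occurrence ID"])   # rows sharing an ID (assumed key)

# Keep the first occurrence of each ID and drop the repeats
deduped = toy.drop_duplicates(subset=["Occurrence ID"], keep="first")
```

Whether exact-row or ID-based matching is more appropriate depends on how the data providers assign IDs, so both counts are worth inspecting before dropping anything.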

#### Objectives: ####
* To understand recording of mammals within the North East of England, attempt to highlight any species biases within recording, and highlight any geographical biases or less-recorded areas.
* Understand how recent events, such as the COVID-19 pandemic, have affected mammal recording, and whether we need to 'promote' submitting records or mammal surveying to better understand the populations and ecologies of mammals in the North East.

#### Exploratory plots: ####
* Data quality - check for verification status - use this as a consideration as we explore the rest of the data
* Most common taxa - may present a good taxon to focus on for some further exploration (e.g. bats - what other detection has been used?)
* Most common species
* Time series - how count of records has changed over time - did COVID-19 have an effect?
* How species record counts changed over time - dynamic graph where some of the top species can be selected
* People - who has submitted the most records?
* Detection - how have most of the records been detected? (likely human obs)
* Look at observation remarks - use some NLP and pick out top keywords - does this vary between taxa?
* Geographical variation - heatmap of record locations - remove sensitive species for this as noise will have been added
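
As a starting point for the time-series idea above, record counts per year might be computed along these lines. This is only a sketch on invented dates: 'Start date' is a column from the dataset, while `toy` and `records_per_year` are illustrative names.

```python
import pandas as pd

# Invented dates standing in for the real 'Start date' column
toy = pd.DataFrame(
    {"Start date": ["2019-05-01", "2019-07-12", "2020-03-30", None, "2021-06-02"]}
)

# Parse to datetime; missing or unparseable values become NaT
dates = pd.to_datetime(toy["Start date"], errors="coerce")

# Drop missing dates, then count records per calendar year in chronological order
records_per_year = dates.dropna().dt.year.value_counts().sort_index()
```

The resulting Series can be passed straight to a line or bar plot to look for a change around 2020.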

#### Packages needed ####

In [1]:
import pandas as pd

#### Importing and merging datasets ####

In [4]:
#importing initial csv files
north_nland = pd.read_csv("data/North_Nland/North_Nland.csv")
south_nland = pd.read_csv("data/South_Nland/South_Nland.csv")
durham = pd.read_csv("data/Durham/Durham.csv")



In [18]:
#checking shape of files
#all have 56 attributes

print("Number of attributes:", north_nland.shape[1])

print("North Northumberland:", north_nland.shape[0]) 
print("South Northumberland:", south_nland.shape[0]) 
print("Durham:", durham.shape[0])
print("Total records should be:", (north_nland.shape[0] + south_nland.shape[0] + durham.shape[0]))

Number of attributes: 56
North Northumberland: 12923
South Northumberland: 55230
Durham: 55081
Total records should be: 123234


In [21]:
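#Need to join datasets
#ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's row labels

mammals = pd.concat([north_nland, south_nland, durham], ignore_index=True)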
print(mammals.shape)

(123234, 56)


In [22]:
# Number of rows is correct, this is the dataset we will be using from here.

#### Checking columns that can be removed ####

In [30]:
#First checking column names to see what we can remove and the first 5 rows for context
print(mammals.columns)

mammals.head()

Index(['NBN Atlas record ID', 'Occurrence ID', 'Licence', 'Rightsholder',
       'Scientific name', 'Taxon author', 'Name qualifier', 'Common name',
       'Species ID (TVK)', 'Taxon Rank', 'Occurrence status', 'Start date',
       'Start date day', 'Start date month', 'Start date year', 'End date',
       'End date day', 'End date month', 'End date year', 'Locality', 'OSGR',
       'Latitude (WGS84)', 'Longitude (WGS84)', 'Coordinate uncertainty (m)',
       'Verbatim depth', 'Recorder', 'Determiner', 'Individual count',
       'Abundance', 'Abundance scale', 'Organism scope', 'Organism remarks',
       'Sex', 'Life stage', 'Occurrence remarks',
       'Identification verification status', 'Basis of record', 'Survey key',
       'Dataset name', 'Dataset ID', 'Data provider', 'Data provider ID',
       'Institution code', 'Kingdom', 'Phylum', 'Class', 'Order', 'Family',
       'Genus', 'OSGR 100km', 'OSGR 10km', 'OSGR 2km', 'OSGR 1km', 'Country',
       'State/Province', 'Vitality'],
 

Unnamed: 0,NBN Atlas record ID,Occurrence ID,Licence,Rightsholder,Scientific name,Taxon author,Name qualifier,Common name,Species ID (TVK),Taxon Rank,...,Order,Family,Genus,OSGR 100km,OSGR 10km,OSGR 2km,OSGR 1km,Country,State/Province,Vitality
0,fffcd4ac-6a6b-4a86-888a-131e7995a5a0,SR0001360004BM04,CC-BY-NC,Environmental Records Information Centre North...,Sciurus vulgaris,"Linnaeus, 1758",,Eurasian Red Squirrel,NBNSYS0000005108,species,...,Rodentia,Sciuridae,Sciurus,NU,NU03,,,United Kingdom,England,
1,fe6d6b95-4986-4c93-9c37-8958324b7d14,SR0001360004BJIG,CC-BY-NC,Environmental Records Information Centre North...,Sciurus carolinensis,"Gmelin, 1788",,Eastern Grey Squirrel,NHMSYS0000332764,species,...,Rodentia,Sciuridae,Sciurus,NT,NT92,,,United Kingdom,England,
2,fe157059-a0dd-4cf3-982e-001846e68afa,1618708,CC-BY-NC,The Mammal Society and Biological Records Centre,Mustela erminea,"Linnaeus, 1758",,Stoat,NBNSYS0000005127,species,...,Carnivora,Mustelidae,Mustela,NU,NU22,NU22E,NU2128,United Kingdom,England,
3,fd3d9fae-b3b0-43c5-bcd4-f3ef4dc9370b,SR0001360004BGG1,CC-BY-NC,Environmental Records Information Centre North...,Sciurus vulgaris,"Linnaeus, 1758",,Eurasian Red Squirrel,NBNSYS0000005108,species,...,Rodentia,Sciuridae,Sciurus,NU,NU21,,,United Kingdom,England,
4,fd386f4c-009f-41c7-a112-7a073b728b2e,18542451,CC-BY,"The Mammal Society, and Biological Records Centre",Lepus europaeus,"Pallas, 1778",,Brown Hare,NHMSYS0000080218,species,...,Lagomorpha,Leporidae,Lepus,NU,NU13,NU13A,NU1130,United Kingdom,England,


In [34]:
mammals['Occurrence status'].value_counts()
#shows all occurrences are 'present' so no point in keeping this in.

present    123234
Name: Occurrence status, dtype: int64

In [33]:
#Using this to understand the date system
mammals[['Common name', 'Occurrence status', 'Start date', 'Start date day', 'Start date month']].head()

Unnamed: 0,Common name,Occurrence status,Start date,Start date day,Start date month
0,Eurasian Red Squirrel,present,2014-12-19,19.0,12.0
1,Eastern Grey Squirrel,present,2015-12-20,20.0,12.0
2,Stoat,present,2014-12-14,14.0,12.0
3,Eurasian Red Squirrel,present,2014-12-05,5.0,12.0
4,Brown Hare,present,2020-12-31,31.0,12.0


In [47]:
mammals[['Start date', 'End date']].isna().value_counts()
# True = value is missing (NaN)
# 98,007 records have a start date but no end date (logical as usually just the sighting date is recorded)
# 20,150 records have neither a start date nor an end date - these will have to be removed
# 5,077 records have both a start date and an end date
# No records have an end date without a start date, so the start date alone can serve as a single 'Date' column

Start date  End date
False       True        98007
True        True        20150
False       False        5077
dtype: int64
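
The True/False pairs in this output are easy to misread, so one option is to label each missingness pattern explicitly. A minimal sketch on invented data, where the label strings and variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Invented rows covering three of the four possible patterns
toy = pd.DataFrame({
    "Start date": ["2014-12-19", None, "2020-12-31", "2015-01-01"],
    "End date": [None, None, "2021-01-02", None],
})

has_start = toy["Start date"].notna()
has_end = toy["End date"].notna()

# Map each combination of present/missing to a readable label
pattern = np.select(
    [has_start & has_end, has_start & ~has_end, ~has_start & has_end],
    ["start and end", "start only", "end only"],
    default="no date",
)
pattern_counts = pd.Series(pattern).value_counts()
```

Patterns that never occur simply do not appear in `pattern_counts`, which makes it obvious when a fallback such as the end date would not recover any records.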

In [78]:
mammals[['Class']].value_counts()
#Checking all records are definitely for mammals
# Will delete Kingdom, Phylum, Class attributes as all will be the same.
# Likewise for Country and State/Province

# Deleting all extra OSGR columns as I don't think they will be useful - the lat/long 
# will be more useful using Folium

Class   
Mammalia    123234
dtype: int64

In [82]:
mammals['Vitality'].value_counts()

alive    21423
dead      1394
Name: Vitality, dtype: int64

In [83]:
#Final list of columns to keep in data

cols_to_keep = ['Scientific name', 'Common name', 'Species ID (TVK)', 'Taxon Rank', 
                'Start date', 'End date', 
                'OSGR', 'Latitude (WGS84)', 'Longitude (WGS84)','Recorder', 'Determiner',
                'Occurrence remarks',
                'Identification verification status', 'Basis of record',
                'Data provider',
                'Order', 'Family', 'Genus', 'Vitality']

In [84]:
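#.copy() makes mammals_df independent of mammals, so later edits do not raise SettingWithCopyWarning
mammals_df = mammals[cols_to_keep].copy()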

In [87]:
#Checking it worked. New shape 123,234 x 19 
mammals_df.shape

(123234, 19)

#### Checking for NAs that may need to be removed ####

NaNs in the 'Scientific name' column - whole record should be removed.

NaNs in 'Common name' can be filled in. 

Records with NaNs in both 'Start date' and 'End date' should be removed, as an undated record is not technically valid. 

Check for NaNs in Recorder - keep for now but consider this if looking at Recorder further down the line.

Other NaNs should be okay. 

In [100]:
variable_list = ['Scientific name', 'Common name', 'Start date', 'End date', 'Recorder']

for i in variable_list: 
    print(mammals_df[i].isna().value_counts(), "\n")

#mammals_df['Scientific name'].isna().value_counts()
#no scientific names missing

False    123234
Name: Scientific name, dtype: int64 

False    123102
True        132
Name: Common name, dtype: int64 

False    103084
True      20150
Name: Start date, dtype: int64 

True     118157
False      5077
Name: End date, dtype: int64 

True     69860
False    53374
Name: Recorder, dtype: int64 
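
A more compact view of the same check would put the missing count and percentage for each column in one table. A sketch on invented rows, where `na_summary` is an illustrative name:

```python
import pandas as pd

# Invented rows; the column names match the dataset
toy = pd.DataFrame({
    "Scientific name": ["Sciurus vulgaris", "Lepus europaeus", "Mustela erminea"],
    "Common name": ["Eurasian Red Squirrel", None, "Stoat"],
    "Recorder": [None, None, "A. Recorder"],
})

# One row per column: number of NaNs and their share of all records
na_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})
```

On the real data this would be applied to `mammals_df[variable_list]` rather than the toy frame.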



#### To-Do Next ####
* Collect scientific names of records missing common names so they can be 'translated'
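* Sort out the dates - create a single 'Date' column, which uses the start date initially and the end date as a backup option (the counts above show no record has an end date without a start date, so the backup is only a safeguard). Any records still NaN, i.e. the 20,150 with no date at all, can be removed.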
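
The two to-do items above could be sketched as follows. This is a minimal example on invented rows, not the final cleaning step: `common_name_lookup` is a hypothetical dictionary that would need to be built from the scientific names that are actually missing a common name.

```python
import pandas as pd

# Invented rows; the column names match the dataset
toy = pd.DataFrame({
    "Scientific name": ["Sciurus vulgaris", "Mustela erminea", "Lepus europaeus"],
    "Common name": ["Eurasian Red Squirrel", None, "Brown Hare"],
    "Start date": ["2014-12-19", None, None],
    "End date": [None, "2015-01-03", None],
})

# Hypothetical lookup from scientific name to common name
common_name_lookup = {"Mustela erminea": "Stoat"}

start = pd.to_datetime(toy["Start date"], errors="coerce")
end = pd.to_datetime(toy["End date"], errors="coerce")

cleaned = toy.copy()
# Use the start date, falling back to the end date where the start is missing
cleaned["Date"] = start.fillna(end)
# Fill missing common names from the lookup, leaving existing names untouched
cleaned["Common name"] = cleaned["Common name"].fillna(
    cleaned["Scientific name"].map(common_name_lookup)
)
# Remove records that still have no date, then drop the original date columns
cleaned = cleaned.dropna(subset=["Date"]).drop(columns=["Start date", "End date"])
```

Scientific names absent from the lookup stay NaN in 'Common name', so any gaps remain visible rather than being silently filled.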