In [1]:
import pandas as pd
import numpy as np

#### Basic Dataframe Examination

Start with a basic examination of the dataframes using standard pandas EDA functions.

In [15]:
species = pd.read_csv('data/species_info.csv')
observations = pd.read_csv('data/observations.csv')

In [3]:
species.head(3)

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,
1,Mammal,Bos bison,"American Bison, Bison",
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",


In [4]:
observations.head(3)

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138


In [5]:
species.shape, observations.shape

((5824, 4), (23296, 3))

In [6]:
species.info(), print("\n"), observations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5824 entries, 0 to 5823
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   category             5824 non-null   object
 1   scientific_name      5824 non-null   object
 2   common_names         5824 non-null   object
 3   conservation_status  191 non-null    object
dtypes: object(4)
memory usage: 182.1+ KB


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23296 entries, 0 to 23295
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   scientific_name  23296 non-null  object
 1   park_name        23296 non-null  object
 2   observations     23296 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 546.1+ KB


(None, None, None)

<font size="4">The only null values in the species dataframe are in the `conservation_status` column. It is worth getting more familiar with the species data before deciding how to handle them.</font>


In [7]:
# species.conservation_status.value_counts()
species.conservation_status.value_counts(dropna=False)

NaN                   5633
Species of Concern     161
Endangered              16
Threatened              10
In Recovery              4
Name: conservation_status, dtype: int64

In [8]:
species.category.value_counts()

Vascular Plant       4470
Bird                  521
Nonvascular Plant     333
Mammal                214
Fish                  127
Amphibian              80
Reptile                79
Name: category, dtype: int64

In [9]:
species.scientific_name.nunique(), observations.scientific_name.nunique(), species.common_names.nunique()

(5541, 5541, 5504)

<font size="3">The `observations` dataframe contains only the scientific name, not the common name. The `species` dataframe has 37 fewer unique common names than unique scientific names (5504 vs. 5541), indicating that some species share a common name. It also has 5824 rows but only 5541 unique scientific names, so 283 rows repeat a scientific name that already appears.<br><br>The `observations` dataframe also contains 5541 unique scientific names. To make sure every species appears in both, the next cell checks that the unique names from the two datasets are equal.</font>

In [10]:
# Casting the list to a set could also be used here
sorted(observations.scientific_name.unique()) == sorted(species.scientific_name.unique())

True
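
As the comment in the cell notes, a set comparison expresses the same check without sorting. A minimal sketch on toy data follows; the two small frames and their contents are made up for illustration and only stand in for `species` and `observations`.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only).
species_toy = pd.DataFrame({"scientific_name": ["Bos bison", "Canis lupus", "Canis lupus"]})
observations_toy = pd.DataFrame({"scientific_name": ["Canis lupus", "Bos bison", "Bos bison"]})

species_names = set(species_toy["scientific_name"])
observed_names = set(observations_toy["scientific_name"])

# Equal sets mean every name appears in both tables.
same_names = species_names == observed_names

# The symmetric difference lists any names found in only one table.
only_in_one = species_names ^ observed_names
print(same_names, only_in_one)
```

If the check ever failed, the symmetric difference would show which names to investigate.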

<font size="3">The vast majority of species have **NaN** for conservation status, and every scientific name in the species data also appears in the observation data. Since all of these species are being observed, it is reasonable to assume that a missing conservation status means the species is not under conservation watch.<br><br>The null values will therefore be changed to a new category, *"Not in Danger"*, after some further analysis.</font>
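
The replacement described above could be sketched as follows, assuming only the `conservation_status` column should be filled. The toy frame is hypothetical; the column name and the *"Not in Danger"* label come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the species table (values are illustrative only).
species_toy = pd.DataFrame({
    "scientific_name": ["Bos bison", "Canis lupus", "Oncorhynchus mykiss"],
    "conservation_status": [np.nan, "Endangered", np.nan],
})

# Fill only the status column so that missing values elsewhere would stay visible.
species_toy["conservation_status"] = species_toy["conservation_status"].fillna("Not in Danger")
print(species_toy["conservation_status"].value_counts())
```

Filling a single column, rather than calling `fillna` on the whole frame, avoids labelling an unrelated missing value as a conservation status.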

In [12]:
# species.conservation_status.isna().sum()

In [13]:
# species.fillna('Not in Danger', inplace=True)
# species

In [21]:
species.conservation_status.value_counts().sum()

191

From the value counts, only 191 entries in the species table are _Under Watch_, which I will define as having any listed conservation status.

Let's look at the fraction of species under watch in each class (the `category` column in the data), broken down by status.

In [23]:
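# Total number of rows per category, including rows with no conservation status
cat_counts = species.category.value_counts().to_dict()
# Rows per (category, status); groupby drops rows whose status is NaN
under_watch = species.groupby(['category', 'conservation_status'])['scientific_name'].count().reset_index()

def percentages(row):
    # Fraction (0-1) of the category's rows that carry this status
    return row['scientific_name'] / cat_counts[row['category']]

under_watch['percentage'] = under_watch.apply(percentages, axis=1)
under_watch.sort_values(by='percentage', ascending=False)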

Unnamed: 0,category,conservation_status,scientific_name,percentage
5,Bird,Species of Concern,72,0.138196
11,Mammal,Species of Concern,28,0.130841
14,Reptile,Species of Concern,5,0.063291
1,Amphibian,Species of Concern,4,0.05
9,Mammal,Endangered,7,0.03271
7,Fish,Species of Concern,4,0.031496
8,Fish,Threatened,4,0.031496
2,Amphibian,Threatened,2,0.025
6,Fish,Endangered,3,0.023622
13,Nonvascular Plant,Species of Concern,5,0.015015


The table shows that most species are not under watch. This is especially true of plants: about 1.5% of nonvascular plants are Species of Concern, and no vascular plant row appears among the ten largest shares shown. Animals have higher shares: roughly 13-14% of birds and mammals are Species of Concern, and 3-6% of reptiles, amphibians, and fish fall under a given status.
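
The same shares can be computed without a row-wise `apply`. The sketch below is one possible vectorized version on toy data; the frame, the placeholder species names, and the `n_species` and `share` column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the species table (values are illustrative only).
species_toy = pd.DataFrame({
    "category": ["Bird", "Bird", "Bird", "Bird", "Mammal", "Mammal"],
    "scientific_name": ["a", "b", "c", "d", "e", "f"],
    "conservation_status": ["Species of Concern", np.nan, np.nan, np.nan, "Endangered", np.nan],
})

# Rows per category, including rows with no status.
cat_counts = species_toy["category"].value_counts()

# Rows per (category, status); groupby drops NaN statuses by default.
under_watch = (
    species_toy.groupby(["category", "conservation_status"])["scientific_name"]
    .count()
    .reset_index(name="n_species")
)

# Divide each count by its category total; map() looks up the total by category.
under_watch["share"] = under_watch["n_species"] / under_watch["category"].map(cat_counts)
print(under_watch.sort_values("share", ascending=False))
```

The result is a fraction between 0 and 1, as in the table above; multiply by 100 for a percentage.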

Let's move on to investigate duplicate entries.

In [24]:
species.duplicated().sum()

0

There are no fully duplicated rows, so any duplication comes from the scientific name or the common name alone, consistent with the earlier counts.

I suspect that some scientific names are repeated with different common names, and that some common names are shared across species.

Let's take a look at how many there are of each.

In [26]:
species.duplicated(subset=['scientific_name']).sum(), species.duplicated(subset=['common_names']).sum()

(283, 320)
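
Note that `duplicated` with its default `keep='first'` does not flag the first occurrence, so the counts above are the number of extra rows, not the number of rows involved. A small sketch on a hypothetical toy frame shows the difference from `keep=False`, which the following cells use.

```python
import pandas as pd

# Toy stand-in for the species table (values are illustrative only).
species_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis lupus", "Bos bison"],
    "common_names": ["Gray Wolf", "Gray Wolf, Wolf", "Gray Wolf, Wolf", "American Bison"],
})

# Default keep='first': the first occurrence is not flagged, so this counts the extra rows.
extra_rows = species_toy.duplicated(subset=["scientific_name"]).sum()

# keep=False flags every row that belongs to a repeated name.
all_rows = species_toy.duplicated(subset=["scientific_name"], keep=False).sum()

print(extra_rows, all_rows)
```

This is why the count here is 283 for scientific names, while the `keep=False` selection later in the notebook returns 557 rows.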

Let's see which rows share both a scientific name and a common name.

In [36]:
species[species.duplicated(subset=['scientific_name','common_names'], keep=False)].sort_values(by='common_names')

Unnamed: 0,category,scientific_name,common_names,conservation_status
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
560,Fish,Oncorhynchus mykiss,Rainbow Trout,
3283,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened


Only two species have rows that share both the scientific name and the common names, and in both cases the rows carry different conservation statuses.
Let's look at species with duplicated common names first to see their conservation status.

In [27]:
# pd.set_option('display.max_rows', 30)
duplicated_common_names = species[species.duplicated(subset=['common_names'], keep=False)]
duplicated_common_names.sort_values(by='common_names')#[~duplicated_common_names.conservation_status.isna()]

Unnamed: 0,category,scientific_name,common_names,conservation_status
2730,Nonvascular Plant,Dichodontium pellucidum,A Moss,
2822,Nonvascular Plant,Cirriphyllum piliferum,A Moss,
2022,Vascular Plant,Carex normalis,"A Sedge, Sedge",
1971,Vascular Plant,Carex bromoides,"A Sedge, Sedge",
1960,Vascular Plant,Carex annectens,"A Sedge, Sedge",
...,...,...,...,...
250,Bird,Dendroica coronata,Yellow-Rumped Warbler,
252,Bird,Dendroica dominica,Yellow-Throated Warbler,
3206,Bird,Setophaga dominica,Yellow-Throated Warbler,
2957,Nonvascular Plant,Zygodon viridissimus var. rupestris,Zygodon Moss,


In [30]:
duplicated_common_names.conservation_status.value_counts(dropna=False)

NaN                   556
Species of Concern      9
In Recovery             1
Threatened              1
Endangered              1
Name: conservation_status, dtype: int64

The majority of the entries with a duplicated common name are not in danger. Below, I list those that are under watch.

In [32]:
# Filter first, then sort, so the boolean mask and the frame share the same index order
duplicated_common_names[duplicated_common_names.conservation_status.notna()].sort_values(by='common_names')


Unnamed: 0,category,scientific_name,common_names,conservation_status
2929,Nonvascular Plant,Bazzania nudicaulis,Bazzania,Species of Concern
185,Bird,Guiraca caerulea,Blue Grosbeak,Species of Concern
2292,Vascular Plant,Poa paludigena,Bog Bluegrass,Species of Concern
3102,Bird,Stellula calliope,Calliope Hummingbird,Species of Concern
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
3288,Fish,Cottus bairdii,Mottled Sculpin,Species of Concern
284,Bird,Vermivora ruficapilla,Nashville Warbler,Species of Concern
310,Bird,Contopus cooperi,Olive-Sided Flycatcher,Species of Concern
3283,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened


In [33]:
mask = duplicated_common_names.conservation_status == 'Species of Concern'
duplicated_common_names[mask]

Unnamed: 0,category,scientific_name,common_names,conservation_status
185,Bird,Guiraca caerulea,Blue Grosbeak,Species of Concern
284,Bird,Vermivora ruficapilla,Nashville Warbler,Species of Concern
287,Bird,Wilsonia pusilla,Wilson's Warbler,Species of Concern
310,Bird,Contopus cooperi,Olive-Sided Flycatcher,Species of Concern
2292,Vascular Plant,Poa paludigena,Bog Bluegrass,Species of Concern
2929,Nonvascular Plant,Bazzania nudicaulis,Bazzania,Species of Concern
3102,Bird,Stellula calliope,Calliope Hummingbird,Species of Concern
3288,Fish,Cottus bairdii,Mottled Sculpin,Species of Concern
4879,Vascular Plant,Plagiobothrys torreyi var. diffusus,Torrey's Popcornflower,Species of Concern


In [37]:
spec_of_concern = duplicated_common_names[mask].common_names.to_list()
duplicated_common_names[duplicated_common_names.common_names.isin(spec_of_concern)].sort_values(by='common_names')

Unnamed: 0,category,scientific_name,common_names,conservation_status
2930,Nonvascular Plant,Bazzania tricrenata,Bazzania,
2929,Nonvascular Plant,Bazzania nudicaulis,Bazzania,Species of Concern
2928,Nonvascular Plant,Bazzania denudata,Bazzania,
4525,Bird,Passerina caerulea,Blue Grosbeak,
185,Bird,Guiraca caerulea,Blue Grosbeak,Species of Concern
2292,Vascular Plant,Poa paludigena,Bog Bluegrass,Species of Concern
5628,Vascular Plant,Poa leptocoma ssp. leptocoma,Bog Bluegrass,
3102,Bird,Stellula calliope,Calliope Hummingbird,Species of Concern
4506,Bird,Selasphorus calliope,Calliope Hummingbird,
563,Fish,Cottus bairdi,Mottled Sculpin,


The initial suspicion appears to be correct. Some plants and animals that share a common name are different species or variants, each with its own scientific name. For all intents and purposes, they are separate entities. Most of them are not in danger, while 12 entries have a conservation status. All will be kept as separate entities since the scientific names are different.
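
To see how many distinct scientific names sit behind each shared common name, a grouped `nunique` is one option. This is a sketch on toy data; the frame below reuses a few names from the tables above but is otherwise illustrative.

```python
import pandas as pd

# Toy stand-in for the species table (a few names borrowed from the tables above).
species_toy = pd.DataFrame({
    "scientific_name": ["Guiraca caerulea", "Passerina caerulea",
                        "Canis lupus", "Canis lupus", "Bos bison"],
    "common_names": ["Blue Grosbeak", "Blue Grosbeak",
                     "Gray Wolf, Wolf", "Gray Wolf, Wolf", "American Bison"],
})

# Number of distinct scientific names listed under each common name.
names_per_common = species_toy.groupby("common_names")["scientific_name"].nunique()

# Common names used by more than one scientific name.
shared = names_per_common[names_per_common > 1]
print(shared)
```

A common name that maps to a single scientific name, like the wolf rows here, is a repeated entry rather than a shared name.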

Further investigation of the Gray Wolf and Rainbow Trout will be necessary.

Next, let's look at the duplicated scientific names.

In [38]:
duplicated_scientific_names = species[species.duplicated(subset=['scientific_name'], keep=False)].sort_values(by='scientific_name')
duplicated_scientific_names

Unnamed: 0,category,scientific_name,common_names,conservation_status
5553,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",
2132,Vascular Plant,Agrostis capillaris,Rhode Island Bent,
2134,Vascular Plant,Agrostis gigantea,Redtop,
5554,Vascular Plant,Agrostis gigantea,"Black Bent, Redtop, Water Bentgrass",
4178,Vascular Plant,Agrostis mertensii,"Arctic Bentgrass, Northern Bentgrass",
...,...,...,...,...
5643,Vascular Plant,Vulpia myuros,"Foxtail Fescue, Rattail Fescue, Rat-Tail Fescu...",
2331,Vascular Plant,Vulpia octoflora,Annual Fescue,
4290,Vascular Plant,Vulpia octoflora,"Eight-Flower Six-Weeks Grass, Pullout Grass, S...",
3347,Vascular Plant,Zizia aptera,"Heartleaf Alexanders, Heart-Leaf Alexanders, M...",


In [45]:
duplicated_scientific_names[['conservation_status']].value_counts(dropna=False)

conservation_status
NaN                    534
Species of Concern      19
Endangered               2
In Recovery              1
Threatened               1
dtype: int64

Some species are in the table more than once because they are listed under different common names. As with the duplicated common names, the majority of them are not in danger. These entries may need to be treated as different entities, since data collection might cover different groups of these species, but that is not yet clear.

Let's focus on the 23 entries that are under watch.

In [47]:
print(duplicated_scientific_names.groupby(['conservation_status', 'scientific_name'])['common_names'].count().sum())
duplicated_scientific_names.groupby(['conservation_status', 'scientific_name'])['common_names'].count()

23


conservation_status  scientific_name          
Endangered           Canis lupus                  2
In Recovery          Canis lupus                  1
Species of Concern   Eptesicus fuscus             2
                     Gavia immer                  2
                     Lasionycteris noctivagans    2
                     Myotis californicus          2
                     Myotis lucifugus             3
                     Nycticorax nycticorax        2
                     Pandion haliaetus            2
                     Riparia riparia              2
                     Taxidea taxus                2
Threatened           Oncorhynchus mykiss          1
Name: common_names, dtype: int64

In [49]:
duplicated_scientific_names.scientific_name.value_counts().value_counts()

2    265
3      9
Name: scientific_name, dtype: int64

Only 9 scientific names appear three times; the other 265 appear twice. Canis lupus (the gray wolf) and Myotis lucifugus (the little brown bat) each have three entries with an under-watch status. We already know this about the wolf, and we also know that the rainbow trout has one _threatened_ entry and one _not in danger_ entry.

Let's look more closely at the species on this list to see whether any name with two entries has mixed statuses, and to confirm the statuses of the names with three entries.

Myotis lucifugus         3
Puma concolor            3
Castor canadensis        3
Procyon lotor            3
Streptopelia decaocto    3
Canis lupus              3
Columba livia            3
Hypochaeris radicata     3
Holcus lanatus           3
Name: scientific_name, dtype: int64

In [51]:
# Count the entries per scientific name once and reuse the result
name_counts = duplicated_scientific_names.scientific_name.value_counts()
list_of_3_entries = name_counts[name_counts == 3].index
list_of_2_entries = name_counts[name_counts == 2].index

In [52]:
duplicated_scientific_names[duplicated_scientific_names.scientific_name.isin(list_of_3_entries)]

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
49,Mammal,Castor canadensis,"American Beaver, Beaver",
4475,Mammal,Castor canadensis,Beaver,
3050,Mammal,Castor canadensis,American Beaver,
4513,Bird,Columba livia,Rock Pigeon,
3138,Bird,Columba livia,"Common Pigeon, Rock Dove, Rock Pigeon",
156,Bird,Columba livia,Rock Dove,
4236,Vascular Plant,Holcus lanatus,"Common Velvetgrass, Yorkshire-Fog",


Since the list of names with two entries is large (265), let's create a mask that filters out any rows whose status is _not in danger_ (null). We will test the mask first on the names with three entries.

In [53]:
# duplicated_scientific_names[~(duplicated_scientific_names.conservation_status == "Not in Danger") & duplicated_scientific_names.scientific_name.isin(list_of_3_entries)]
mask = ~(duplicated_scientific_names.conservation_status.isna()) & duplicated_scientific_names.scientific_name.isin(list_of_3_entries)
duplicated_scientific_names[mask]#.scientific_name.nunique()

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
4467,Mammal,Myotis lucifugus,Little Brown Myotis,Species of Concern
3042,Mammal,Myotis lucifugus,"Little Brown Bat, Little Brown Myotis, Little ...",Species of Concern
37,Mammal,Myotis lucifugus,"Little Brown Bat, Little Brown Myotis",Species of Concern


In [81]:
# duplicated_scientific_names[duplicated_scientific_names.scientific_name.isin(list_of_3_entries)]

In [54]:
# duplicated_scientific_names[~(duplicated_scientific_names.conservation_status == "Not in Danger") & duplicated_scientific_names.scientific_name.isin(list_of_2_entries)]
mask = ~(duplicated_scientific_names.conservation_status.isna()) & duplicated_scientific_names.scientific_name.isin(list_of_2_entries)
duplicated_scientific_names[mask]
# list_of_2_entries.shape[0]
# (duplicated_scientific_names[duplicated_scientific_names.conservation_status.isna()].shape[0] + duplicated_scientific_names[mask].shape[0]) / 2

Unnamed: 0,category,scientific_name,common_names,conservation_status
29,Mammal,Eptesicus fuscus,Big Brown Bat,Species of Concern
3035,Mammal,Eptesicus fuscus,"Big Brown Bat, Big Brown Bat",Species of Concern
3150,Bird,Gavia immer,"Common Loon, Great Northern Diver, Great North...",Species of Concern
172,Bird,Gavia immer,Common Loon,Species of Concern
30,Mammal,Lasionycteris noctivagans,Silver-Haired Bat,Species of Concern
3037,Mammal,Lasionycteris noctivagans,"Silver-Haired Bat, Silver-Haired Bat",Species of Concern
4465,Mammal,Myotis californicus,California Myotis,Species of Concern
3039,Mammal,Myotis californicus,"California Myotis, California Myotis, Californ...",Species of Concern
337,Bird,Nycticorax nycticorax,Black-Crowned Night-Heron,Species of Concern
4564,Bird,Nycticorax nycticorax,Black-Crowned Night Heron,Species of Concern


Every duplicated scientific name appears either 2 or 3 times. Almost all species with multiple entries have the same conservation status on every entry. The majority are 'Not in Danger', and 9 species are 'Species of Concern' on all of their entries. Only two species, the gray wolf and the rainbow trout, have more than one status.
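
The mixed-status check can also be done in one step by counting distinct statuses per scientific name. The sketch below runs on toy data and assumes a missing status should count as its own value; the frame is illustrative, with statuses taken from the tables above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the duplicated entries (statuses taken from the tables above).
species_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Canis lupus",
                        "Oncorhynchus mykiss", "Oncorhynchus mykiss",
                        "Gavia immer", "Gavia immer"],
    "conservation_status": ["Endangered", "In Recovery", "Endangered",
                            np.nan, "Threatened",
                            "Species of Concern", "Species of Concern"],
})

# Count distinct statuses per name; dropna=False treats a missing status as its own value.
status_counts = species_toy.groupby("scientific_name")["conservation_status"].nunique(dropna=False)

# Names whose entries do not all agree on the status.
mixed = status_counts[status_counts > 1].index.tolist()
print(mixed)
```

With the default `dropna=True`, the trout would not be flagged, because its only other status is missing.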

This suggests that the duplicate entries for these two species are most likely not mistakes, but were included because of a change in status.

As for the others, it is crucial to look at the observations table to get a sense of what is going on with the multiple entries.
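
One caveat when lining up the two tables: a key-based merge on `scientific_name` pairs every species entry with every observation row of the same name. The toy example below (hypothetical frames, with a few values borrowed from the wolf rows) illustrates how the row count multiplies.

```python
import pandas as pd

# Two species entries and two observation rows for the same scientific name.
species_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus"],
    "conservation_status": ["Endangered", "In Recovery"],
})
observations_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus"],
    "park_name": ["Bryce National Park", "Yosemite National Park"],
    "observations": [27, 35],
})

# Each species row is paired with each observation row of the same name.
merged = species_toy.merge(observations_toy, on="scientific_name")
print(len(merged))
```

Nothing in the two tables says which observation row belongs to which species entry, so the merged rows show possible pairings, not confirmed ones.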

In [136]:
# duplicated_scientific_names_2 = species[species.duplicated(subset=['scientific_name'], keep='first')].sort_values(by='scientific_name')

In [None]:
# duplicated_scientific_names[duplicated_scientific_names.conservation_status.isna()]

In [101]:
duplicated_mask = species.duplicated(subset=['scientific_name'], keep=False)
trout_keep_NA_mask = species.scientific_name == species.loc[560, 'scientific_name']
duplicated_mask_NA = species.conservation_status.isna() & ~trout_keep_NA_mask

# The assignment target was missing in the original cell; `species_filtered` is a placeholder name.
species_filtered = species[~(duplicated_mask & duplicated_mask_NA)]
# species[species.conservation_status.isna()].drop_duplicates(subset=['scientific_name'], keep='first')

In [None]:
species.shape

In [55]:
duplicated_scientific_names = species[species.duplicated(subset=['scientific_name'], keep=False)].sort_values(by='scientific_name')
duplicated_scientific_names

Unnamed: 0,category,scientific_name,common_names,conservation_status
5553,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger
2132,Vascular Plant,Agrostis capillaris,Rhode Island Bent,Not in Danger
2134,Vascular Plant,Agrostis gigantea,Redtop,Not in Danger
5554,Vascular Plant,Agrostis gigantea,"Black Bent, Redtop, Water Bentgrass",Not in Danger
4178,Vascular Plant,Agrostis mertensii,"Arctic Bentgrass, Northern Bentgrass",Not in Danger
...,...,...,...,...
5643,Vascular Plant,Vulpia myuros,"Foxtail Fescue, Rattail Fescue, Rat-Tail Fescu...",Not in Danger
2331,Vascular Plant,Vulpia octoflora,Annual Fescue,Not in Danger
4290,Vascular Plant,Vulpia octoflora,"Eight-Flower Six-Weeks Grass, Pullout Grass, S...",Not in Danger
3347,Vascular Plant,Zizia aptera,"Heartleaf Alexanders, Heart-Leaf Alexanders, M...",Not in Danger


In [59]:
duplicated_scientific_names_mixed_status = duplicated_scientific_names[~duplicated_scientific_names.conservation_status.isna()]['scientific_name'].unique().tolist()

In [62]:
species[species.scientific_name.isin(duplicated_scientific_names_mixed_status)]

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
29,Mammal,Eptesicus fuscus,Big Brown Bat,Species of Concern
30,Mammal,Lasionycteris noctivagans,Silver-Haired Bat,Species of Concern
37,Mammal,Myotis lucifugus,"Little Brown Bat, Little Brown Myotis",Species of Concern
104,Bird,Pandion haliaetus,Osprey,Species of Concern
172,Bird,Gavia immer,Common Loon,Species of Concern
226,Bird,Riparia riparia,Bank Swallow,Species of Concern
337,Bird,Nycticorax nycticorax,Black-Crowned Night-Heron,Species of Concern
560,Fish,Oncorhynchus mykiss,Rainbow Trout,
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery


In [105]:
species[species.scientific_name == 'Oncorhynchus mykiss']

Unnamed: 0,category,scientific_name,common_names,conservation_status
560,Fish,Oncorhynchus mykiss,Rainbow Trout,
3283,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened


In [106]:
species[~duplicated_mask].conservation_status.value_counts()

Species of Concern    142
Endangered             14
Threatened              9
In Recovery             3
Name: conservation_status, dtype: int64

In [107]:
# # pd.set_option('display.max_rows', 30)
duplicated_scientific_names[duplicated_scientific_names.scientific_name.apply(lambda species: species in duplicated_scientific_names_mixed_status)]

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
29,Mammal,Eptesicus fuscus,Big Brown Bat,Species of Concern
3035,Mammal,Eptesicus fuscus,"Big Brown Bat, Big Brown Bat",Species of Concern
3150,Bird,Gavia immer,"Common Loon, Great Northern Diver, Great North...",Species of Concern
172,Bird,Gavia immer,Common Loon,Species of Concern
30,Mammal,Lasionycteris noctivagans,Silver-Haired Bat,Species of Concern
3037,Mammal,Lasionycteris noctivagans,"Silver-Haired Bat, Silver-Haired Bat",Species of Concern
4465,Mammal,Myotis californicus,California Myotis,Species of Concern


In [60]:
# obsv_counts is only defined in a later cell of the original notebook; define it here so this cell runs on its own
obsv_counts = observations.scientific_name.value_counts()
duplicated_scientific_names[duplicated_scientific_names.scientific_name.isin(obsv_counts[:9].index)].sort_values(by='scientific_name')

In [61]:
observations[observations.scientific_name.isin(obsv_counts[:9].index)].sort_values(by=['scientific_name', 'park_name', 'observations']).head(20)

In [None]:
back_up = species.copy()

In [None]:
species = back_up.copy()  # copy so the in-place sort below does not modify back_up
species.sort_values(by=['scientific_name', 'common_names'], inplace=True)
species.reset_index(drop=True, inplace=True)

In [62]:
species[species.scientific_name == 'Canis lupus']

Unnamed: 0,category,scientific_name,common_names,conservation_status
8,Mammal,Canis lupus,Gray Wolf,Endangered
3020,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
4448,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered


In [63]:
species_df = pd.concat([species, species, species, species], ignore_index=True)

In [64]:
species_df = species_df.sort_values(by=['scientific_name', 'common_names']).reset_index(drop=True)#.drop(columns='scientific_name')
# species_df.drop(columns='scientific_name', inplace=True)

In [65]:
species_df[species_df.scientific_name == 'Canis lupus']

Unnamed: 0,category,scientific_name,common_names,conservation_status
3352,Mammal,Canis lupus,Gray Wolf,Endangered
3353,Mammal,Canis lupus,Gray Wolf,Endangered
3354,Mammal,Canis lupus,Gray Wolf,Endangered
3355,Mammal,Canis lupus,Gray Wolf,Endangered
3356,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
3357,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
3358,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
3359,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered
3360,Mammal,Canis lupus,"Gray Wolf, Wolf",In Recovery
3361,Mammal,Canis lupus,"Gray Wolf, Wolf",Endangered


In [66]:
observations_df = observations.sort_values(by=['scientific_name']).reset_index(drop=True)

In [67]:
observations[observations.scientific_name == 'Canis lupus'].sort_values(by=['park_name', 'observations'])

Unnamed: 0,scientific_name,park_name,observations
1766,Canis lupus,Bryce National Park,27
7346,Canis lupus,Bryce National Park,29
9884,Canis lupus,Bryce National Park,74
17756,Canis lupus,Great Smoky Mountains National Park,14
10190,Canis lupus,Great Smoky Mountains National Park,15
20353,Canis lupus,Great Smoky Mountains National Park,30
10268,Canis lupus,Yellowstone National Park,60
10907,Canis lupus,Yellowstone National Park,67
13427,Canis lupus,Yellowstone National Park,203
1294,Canis lupus,Yosemite National Park,35


In [68]:
species_df = species_df.drop(columns='scientific_name')
df = pd.concat([species_df, observations_df], axis=1)
df

Unnamed: 0,category,common_names,conservation_status,scientific_name,park_name,observations
0,Vascular Plant,Rocky Mountain Alpine Fir,Not in Danger,Abies bifolia,Yellowstone National Park,215
1,Vascular Plant,Rocky Mountain Alpine Fir,Not in Danger,Abies bifolia,Bryce National Park,109
2,Vascular Plant,Rocky Mountain Alpine Fir,Not in Danger,Abies bifolia,Great Smoky Mountains National Park,72
3,Vascular Plant,Rocky Mountain Alpine Fir,Not in Danger,Abies bifolia,Yosemite National Park,136
4,Vascular Plant,"Balsam Fir, Colorado Fir, Concolor Fir, Silver...",Not in Danger,Abies concolor,Great Smoky Mountains National Park,101
...,...,...,...,...,...,...
23291,Nonvascular Plant,Zygodon Moss,Not in Danger,Zygodon viridissimus,Great Smoky Mountains National Park,71
23292,Nonvascular Plant,Zygodon Moss,Not in Danger,Zygodon viridissimus var. rupestris,Bryce National Park,102
23293,Nonvascular Plant,Zygodon Moss,Not in Danger,Zygodon viridissimus var. rupestris,Yosemite National Park,210
23294,Nonvascular Plant,Zygodon Moss,Not in Danger,Zygodon viridissimus var. rupestris,Yellowstone National Park,237


In [69]:
df[df.scientific_name == 'Canis lupus'].sort_values(by=['common_names', 'conservation_status'])

Unnamed: 0,category,common_names,conservation_status,scientific_name,park_name,observations
3352,Mammal,Gray Wolf,Endangered,Canis lupus,Great Smoky Mountains National Park,30
3353,Mammal,Gray Wolf,Endangered,Canis lupus,Yosemite National Park,35
3354,Mammal,Gray Wolf,Endangered,Canis lupus,Yellowstone National Park,60
3355,Mammal,Gray Wolf,Endangered,Canis lupus,Bryce National Park,27
3357,Mammal,"Gray Wolf, Wolf",Endangered,Canis lupus,Yosemite National Park,44
3359,Mammal,"Gray Wolf, Wolf",Endangered,Canis lupus,Bryce National Park,74
3361,Mammal,"Gray Wolf, Wolf",Endangered,Canis lupus,Great Smoky Mountains National Park,14
3363,Mammal,"Gray Wolf, Wolf",Endangered,Canis lupus,Bryce National Park,29
3356,Mammal,"Gray Wolf, Wolf",In Recovery,Canis lupus,Yellowstone National Park,67
3358,Mammal,"Gray Wolf, Wolf",In Recovery,Canis lupus,Yosemite National Park,117


In [71]:
observations.groupby(['park_name', 'scientific_name']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,observations
park_name,scientific_name,Unnamed: 2_level_1
Bryce National Park,Abies bifolia,109
Bryce National Park,Abies concolor,83
Bryce National Park,Abies fraseri,109
Bryce National Park,Abietinella abietina,101
Bryce National Park,Abronia ammophila,92
...,...,...
Yosemite National Park,Zonotrichia leucophrys gambelii,169
Yosemite National Park,Zonotrichia leucophrys oriantha,135
Yosemite National Park,Zonotrichia querula,160
Yosemite National Park,Zygodon viridissimus,159


In [72]:
observations.columns

Index(['scientific_name', 'park_name', 'observations'], dtype='object')

In [73]:
obsv_counts = observations.scientific_name.value_counts()
obsv_counts

Myotis lucifugus                        12
Puma concolor                           12
Hypochaeris radicata                    12
Holcus lanatus                          12
Streptopelia decaocto                   12
                                        ..
Packera dimorphophylla var. paysonii     4
Smilax bona-nox                          4
Chondestes grammacus                     4
Leymus triticoides                       4
Dichanthelium depauperatum               4
Name: scientific_name, Length: 5541, dtype: int64

In [74]:
all(obsv_counts[obsv_counts.values > 4].index.sort_values() == duplicated_scientific_names.scientific_name.unique())

True

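This suggests that the repeated entries are not accidental duplicates: the species with more than 4 observation rows are exactly the species whose scientific name is repeated in the species table. One plausible explanation is that the repeated entries correspond to separate sets of observations, perhaps recorded at different times.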
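
A more direct version of this comparison divides the number of observation rows per species by the number of species entries. The sketch below uses hypothetical toy frames, with shortened park names, and assumes one observation row per species entry per park.

```python
import pandas as pd

# Shortened, illustrative park names.
parks = ["Bryce", "Great Smoky Mountains", "Yellowstone", "Yosemite"]

# Toy stand-ins: the wolf has two species entries, the bison has one.
species_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus", "Canis lupus", "Bos bison"],
})
# One observation row per species entry per park.
observations_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus"] * 8 + ["Bos bison"] * 4,
    "park_name": parks * 3,
})

species_rows = species_toy["scientific_name"].value_counts()
observation_rows = observations_toy["scientific_name"].value_counts()

# Ratio of observation rows to species rows, aligned on scientific name.
rows_per_entry = observation_rows / species_rows
print(rows_per_entry)
```

A ratio of exactly 4 for every name would be consistent with each species entry contributing one row per park.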

In [75]:
# duplicated_scientific_names_mixed_status.sort()
# duplicated_scientific_names_mixed_status


In [76]:
obsv_counts[obsv_counts.values == 8]

Lactuca biennis           8
Digitaria ischaemum       8
Brassica rapa             8
Linaria vulgaris          8
Eragrostis cilianensis    8
                         ..
Lanius excubitor          8
Anthemis cotula           8
Bidens tripartita         8
Hieracium caespitosum     8
Polygonum convolvulus     8
Name: scientific_name, Length: 265, dtype: int64

This also shows that each species has a number of observation rows that is a multiple of 4, which suggests that a species is recorded at each of the 4 parks either 1, 2, or 3 times.
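
The per-park claim can be checked by counting rows per species and park rather than inferring it from the totals. A minimal sketch on toy data follows; the frame is illustrative, with shortened park names, and only some of its numbers are borrowed from the wolf rows.

```python
import pandas as pd

# Toy stand-in for the observations table (two parks only, values illustrative).
observations_toy = pd.DataFrame({
    "scientific_name": ["Canis lupus"] * 6 + ["Bos bison"] * 2,
    "park_name": ["Bryce", "Bryce", "Bryce", "Yosemite", "Yosemite", "Yosemite",
                  "Bryce", "Yosemite"],
    "observations": [27, 29, 74, 35, 44, 117, 80, 120],
})

# Number of observation rows for each species in each park.
rows_per_park = observations_toy.groupby(["scientific_name", "park_name"]).size()

# unstack() puts parks in columns, so unequal counts across parks are easy to spot.
print(rows_per_park.unstack())
```

If a species had, say, 3 rows in one park and 1 in another, the total could still be a multiple of 4, which is why the per-park view is the stronger check.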

It appears that the Gray Wolf was endangered and is now in recovery, while the Rainbow Trout was threatened and is no longer under threat or of concern, at least in the 4 national parks under study.

In [77]:
observations[observations.scientific_name == 'Canis lupus'].sort_values(by='observations')

Unnamed: 0,scientific_name,park_name,observations
17756,Canis lupus,Great Smoky Mountains National Park,14
10190,Canis lupus,Great Smoky Mountains National Park,15
1766,Canis lupus,Bryce National Park,27
7346,Canis lupus,Bryce National Park,29
20353,Canis lupus,Great Smoky Mountains National Park,30
1294,Canis lupus,Yosemite National Park,35
19987,Canis lupus,Yosemite National Park,44
10268,Canis lupus,Yellowstone National Park,60
10907,Canis lupus,Yellowstone National Park,67
9884,Canis lupus,Bryce National Park,74


In [78]:
observations[observations.scientific_name == 'Oncorhynchus mykiss'].sort_values(by='observations')

Unnamed: 0,scientific_name,park_name,observations
15020,Oncorhynchus mykiss,Great Smoky Mountains National Park,39
925,Oncorhynchus mykiss,Bryce National Park,59
15239,Oncorhynchus mykiss,Yosemite National Park,59
3354,Oncorhynchus mykiss,Great Smoky Mountains National Park,61
11893,Oncorhynchus mykiss,Bryce National Park,105
167,Oncorhynchus mykiss,Yosemite National Park,118
4649,Oncorhynchus mykiss,Yellowstone National Park,119
8682,Oncorhynchus mykiss,Yellowstone National Park,253


In [79]:
trout_ob = observations[observations.scientific_name == 'Oncorhynchus mykiss'].sort_values(by='observations')
trout_sp = species[species.scientific_name == 'Oncorhynchus mykiss']

In [80]:
trout_sp

Unnamed: 0,category,scientific_name,common_names,conservation_status
560,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger
3283,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened


In [81]:
trout_ob

Unnamed: 0,scientific_name,park_name,observations
15020,Oncorhynchus mykiss,Great Smoky Mountains National Park,39
925,Oncorhynchus mykiss,Bryce National Park,59
15239,Oncorhynchus mykiss,Yosemite National Park,59
3354,Oncorhynchus mykiss,Great Smoky Mountains National Park,61
11893,Oncorhynchus mykiss,Bryce National Park,105
167,Oncorhynchus mykiss,Yosemite National Park,118
4649,Oncorhynchus mykiss,Yellowstone National Park,119
8682,Oncorhynchus mykiss,Yellowstone National Park,253


In [82]:
trout = pd.merge(trout_sp, trout_ob, how='cross')

In [83]:
# (isna & obs > 200) | notna reduces to: obs > 200 | notna
trout[(trout.observations > 200) | trout.conservation_status.notna()]


Unnamed: 0,category,scientific_name_x,common_names,conservation_status,scientific_name_y,park_name,observations
0,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Great Smoky Mountains National Park,39
1,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Bryce National Park,59
2,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Yosemite National Park,59
3,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Great Smoky Mountains National Park,61
4,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Bryce National Park,105
5,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Yosemite National Park,118
6,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Yellowstone National Park,119
7,Fish,Oncorhynchus mykiss,Rainbow Trout,Not in Danger,Oncorhynchus mykiss,Yellowstone National Park,253
8,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened,Oncorhynchus mykiss,Great Smoky Mountains National Park,39
9,Fish,Oncorhynchus mykiss,Rainbow Trout,Threatened,Oncorhynchus mykiss,Bryce National Park,59


In [84]:
df = duplicated_scientific_names.merge(observations, on='scientific_name')

In [85]:
df.drop_duplicates(subset=['scientific_name', 'observations']).head(20)

Unnamed: 0,category,scientific_name,common_names,conservation_status,park_name,observations
0,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Great Smoky Mountains National Park,84
1,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Bryce National Park,103
2,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Bryce National Park,105
3,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Yellowstone National Park,241
4,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Yosemite National Park,182
5,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Yellowstone National Park,267
6,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Great Smoky Mountains National Park,97
7,Vascular Plant,Agrostis capillaris,"Colonial Bent, Colonial Bentgrass",Not in Danger,Yosemite National Park,140
16,Vascular Plant,Agrostis gigantea,Redtop,Not in Danger,Yellowstone National Park,253
17,Vascular Plant,Agrostis gigantea,Redtop,Not in Danger,Yellowstone National Park,235


In [86]:
print(f'mode\t{observations.observations.mode()}')
print(observations.observations.agg(['max', 'min', 'median', 'mean']))

mode	0    84
dtype: int64
max       321.000000
min         9.000000
median    124.000000
mean      142.287904
Name: observations, dtype: float64


In [87]:
pd.set_option('display.max_rows', 30)
# `name` avoids shadowing the `species` dataframe inside the lambda
observations[observations.scientific_name.apply(lambda name: name in duplicated_scientific_names_mixed_status)].sort_values(by=['scientific_name', 'observations']).head(30)



Unnamed: 0,scientific_name,park_name,observations
792,Agrostis capillaris,Great Smoky Mountains National Park,84
17428,Agrostis capillaris,Great Smoky Mountains National Park,97
3993,Agrostis capillaris,Bryce National Park,103
4864,Agrostis capillaris,Bryce National Park,105
17735,Agrostis capillaris,Yosemite National Park,140
7750,Agrostis capillaris,Yosemite National Park,182
6166,Agrostis capillaris,Yellowstone National Park,241
10379,Agrostis capillaris,Yellowstone National Park,267
11602,Agrostis gigantea,Great Smoky Mountains National Park,57
7763,Agrostis gigantea,Great Smoky Mountains National Park,93


Once a species exceeds 250 observations, it no longer appears to be listed as in danger or of concern.

Threatened species appear to have under 100 observations in three or more parks, while Species of Concern seem to have under 100 in two or more parks.
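
A rough way to test this kind of threshold guess is to count, per species, the number of distinct parks with a low observation count. The sketch below uses only the rainbow trout rows shown earlier. The `LOW` cut-off and the `summary` column names are assumptions made for illustration, not values taken from the data set.

```python
import pandas as pd

# Rainbow trout rows copied from the output earlier in the notebook.
# In the notebook itself, the full `observations` dataframe would be used.
obs = pd.DataFrame({
    "scientific_name": ["Oncorhynchus mykiss"] * 8,
    "park_name": [
        "Great Smoky Mountains National Park", "Bryce National Park",
        "Yosemite National Park", "Great Smoky Mountains National Park",
        "Bryce National Park", "Yosemite National Park",
        "Yellowstone National Park", "Yellowstone National Park",
    ],
    "observations": [39, 59, 59, 61, 105, 118, 119, 253],
})

LOW = 100  # hypothetical cut-off for a "low" observation count

# Rows below the cut-off, then the number of distinct parks they cover.
below = obs[obs["observations"] < LOW]
summary = pd.DataFrame({
    "max_observations": obs.groupby("scientific_name")["observations"].max(),
    "parks_below_low": below.groupby("scientific_name")["park_name"].nunique(),
}).fillna({"parks_below_low": 0})

print(summary)
```

Joining `summary` to the species table on `scientific_name` would then let the guess be compared against `conservation_status` for every species rather than a handful.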

In [88]:
# observations[observations.duplicated(subset=['scientific_name'], keep=False)].sort_values(by='scientific_name')

The names are equal in both data sets, so no species appears in one but not the other.

Next, look at the duplicated rows and figure out whether the duplicates are errors.
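
One way to confirm that two tables share the same names is to compare the sets of `scientific_name` values in both directions. This is a minimal sketch on toy frames. `species_toy` and `observations_toy` are hypothetical stand-ins for the notebook's `species` and `observations` dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's dataframes.
species_toy = pd.DataFrame(
    {"scientific_name": ["Canis lupus", "Zizia aptera", "Canis lupus"]}
)
observations_toy = pd.DataFrame(
    {"scientific_name": ["Zizia aptera", "Canis lupus", "Zizia aptera"]}
)

species_names = set(species_toy["scientific_name"])
observation_names = set(observations_toy["scientific_name"])

# Names present on one side only; both sets are empty when the tables agree.
only_in_species = species_names - observation_names
only_in_observations = observation_names - species_names
names_match = not only_in_species and not only_in_observations

print(names_match, only_in_species, only_in_observations)
```

Checking both differences, rather than only comparing the number of unique names, shows which names are missing if the tables ever disagree.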

In [89]:
observations[observations.duplicated(keep=False)].sort_values(by='scientific_name')

Unnamed: 0,scientific_name,park_name,observations
513,Arctium minus,Yosemite National Park,162
10674,Arctium minus,Yosemite National Park,162
4527,Botrychium virginianum,Yellowstone National Park,232
20294,Botrychium virginianum,Yellowstone National Park,232
19392,Cichorium intybus,Yellowstone National Park,266
14142,Cichorium intybus,Yellowstone National Park,266
7263,Echinochloa crus-galli,Great Smoky Mountains National Park,62
1454,Echinochloa crus-galli,Great Smoky Mountains National Park,62
1020,Eleocharis palustris,Great Smoky Mountains National Park,62
12381,Eleocharis palustris,Great Smoky Mountains National Park,62


In [90]:
duplicated_species = observations[observations.duplicated()]['scientific_name']

In [91]:
# duplicated_species

In [92]:
sorted_species_park = observations.sort_values(by=['scientific_name', 'park_name'])
sorted_species_park.tail(50)

Unnamed: 0,scientific_name,park_name,observations
9204,Zigadenus venenosus var. venenosus,Yellowstone National Park,243
17255,Zigadenus venenosus var. venenosus,Yosemite National Park,150
10913,Zizia aptera,Bryce National Park,105
13904,Zizia aptera,Bryce National Park,112
7753,Zizia aptera,Great Smoky Mountains National Park,105
...,...,...,...
5539,Zygodon viridissimus,Yosemite National Park,159
12775,Zygodon viridissimus var. rupestris,Bryce National Park,102
20040,Zygodon viridissimus var. rupestris,Great Smoky Mountains National Park,102
6879,Zygodon viridissimus var. rupestris,Yellowstone National Park,237


Filter the data to see whether the duplicated data is legitimate. From the output above, each species has at least four sets of observations, one from each park. However, some species have multiple sets of observations.

In [93]:
sorted_species_park[['scientific_name', 'park_name']].value_counts()

scientific_name                      park_name                          
Procyon lotor                        Yellowstone National Park              3
Hypochaeris radicata                 Great Smoky Mountains National Park    3
Castor canadensis                    Great Smoky Mountains National Park    3
                                     Yellowstone National Park              3
                                     Yosemite National Park                 3
                                                                           ..
Equisetum scirpoides                 Great Smoky Mountains National Park    1
                                     Bryce National Park                    1
Equisetum laevigatum                 Yosemite National Park                 1
                                     Yellowstone National Park              1
Zygodon viridissimus var. rupestris  Yosemite National Park                 1
Length: 22164, dtype: int64

This shows that a species can be observed multiple times in one park, presumably in different time frames or by different observers.

So, let's check whether the species on the duplicated list are species with multiple observations.

In [94]:
# pd.set_option('display.max_rows', 30)
sorted_species_park[sorted_species_park.scientific_name.isin(duplicated_species)]

Unnamed: 0,scientific_name,park_name,observations
2054,Arctium minus,Bryce National Park,66
16394,Arctium minus,Bryce National Park,142
18320,Arctium minus,Great Smoky Mountains National Park,76
19102,Arctium minus,Great Smoky Mountains National Park,63
4017,Arctium minus,Yellowstone National Park,234
...,...,...,...
22131,Trifolium campestre,Great Smoky Mountains National Park,72
18435,Trifolium campestre,Yellowstone National Park,239
21151,Trifolium campestre,Yellowstone National Park,239
6478,Trifolium campestre,Yosemite National Park,130


The data above suggests that the duplicated rows are not errors, because each species that contained a duplicate had multiple observations from each of the four parks. I conclude that a duplicate is just an observation recorded at a different time or by a different observer that happened to have the same value.
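
The same check can be made without reading the table by eye: count the rows per species and park for every species that has a fully duplicated row. The sketch below uses made-up data. The species names, shortened park names, and counts in `toy` are hypothetical and only illustrate the shape of the check.

```python
import pandas as pd

# Hypothetical data: "Species A" has two rows per park (one exact duplicate),
# "Species B" has a single row per park.
toy = pd.DataFrame({
    "scientific_name": ["Species A"] * 8 + ["Species B"] * 4,
    "park_name": [
        "Bryce", "Bryce", "Smoky", "Smoky",
        "Yellowstone", "Yellowstone", "Yosemite", "Yosemite",
        "Bryce", "Smoky", "Yellowstone", "Yosemite",
    ],
    "observations": [66, 142, 76, 63, 234, 250, 162, 162, 90, 80, 200, 150],
})

# Species that own at least one fully duplicated row.
dup_species = toy.loc[toy.duplicated(), "scientific_name"].unique()

# Rows per (species, park) for those species, one column per park.
rows_per_park = (
    toy[toy["scientific_name"].isin(dup_species)]
    .groupby(["scientific_name", "park_name"])
    .size()
    .unstack(fill_value=0)
)

# True where a species has two or more rows in every park.
all_parks_repeated = (rows_per_park >= 2).all(axis=1)
print(all_parks_repeated)
```

If `all_parks_repeated` were `False` for some species in the real data, that species would be a better candidate for a genuine data-entry duplicate.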

In [95]:
observations

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85
...,...,...,...
23291,Croton monanthogynus,Yosemite National Park,173
23292,Otospermophilus beecheyi,Bryce National Park,130
23293,Heterotheca sessiliflora ssp. echioides,Bryce National Park,140
23294,Dicranella rufescens,Yosemite National Park,171


In [96]:
multiple_species = species.scientific_name.value_counts()

In [97]:
multiple_species.where(multiple_species > 1, inplace=True)

In [98]:
multiple_species.dropna(inplace=True)

In [99]:
multiple_species.value_counts()

2.0    265
3.0      9
Name: scientific_name, dtype: int64

In [100]:
species_changed_status = multiple_species.index.tolist()

In [101]:
species_df = species[~species.scientific_name.isin(species_changed_status)]

In [102]:
species_df.shape

(5267, 4)

In [103]:
species_df.duplicated().sum()

0

In [104]:
(species_df.groupby('scientific_name').count()['conservation_status'] == 1).sum()

5267

In [105]:
# species_df[~(species_df.conservation_status == "Not in Danger")]
species_df.head()

Unnamed: 0,category,scientific_name,common_names,conservation_status
0,Mammal,Clethrionomys gapperi gapperi,Gapper's Red-Backed Vole,Not in Danger
1,Mammal,Bos bison,"American Bison, Bison",Not in Danger
2,Mammal,Bos taurus,"Aurochs, Aurochs, Domestic Cattle (Feral), Dom...",Not in Danger
3,Mammal,Ovis aries,"Domestic Sheep, Mouflon, Red Sheep, Sheep (Feral)",Not in Danger
7,Mammal,Canis latrans,Coyote,Species of Concern


In [106]:
observations.head()

Unnamed: 0,scientific_name,park_name,observations
0,Vicia benghalensis,Great Smoky Mountains National Park,68
1,Neovison vison,Great Smoky Mountains National Park,77
2,Prunus subcordata,Yosemite National Park,138
3,Abutilon theophrasti,Bryce National Park,84
4,Githopsis specularioides,Great Smoky Mountains National Park,85


In [107]:
observations.shape

(23296, 3)

With the species that have more than one row set aside, the data can be merged safely from here.
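
As a hedged sketch of what "safely" could mean in practice, `DataFrame.merge` accepts `validate` to assert that the species side has unique keys, and `indicator` to flag observation rows that found no species match. The toy frames and their observation counts below are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for `species_df` (one row per scientific name).
species_toy = pd.DataFrame({
    "scientific_name": ["Canis latrans", "Bos bison"],
    "conservation_status": ["Species of Concern", "Not in Danger"],
})

# Hypothetical stand-in for `observations`; "Canis lupus" has no species row.
observations_toy = pd.DataFrame({
    "scientific_name": ["Canis latrans", "Canis latrans", "Bos bison", "Canis lupus"],
    "park_name": ["Bryce National Park", "Yosemite National Park",
                  "Bryce National Park", "Bryce National Park"],
    "observations": [85, 120, 68, 27],
})

merged = species_toy.merge(
    observations_toy,
    on="scientific_name",
    how="right",
    validate="one_to_many",  # raises MergeError if a species name repeats on the left
    indicator=True,          # adds a `_merge` column: "both" or "right_only"
)

# Observation rows whose species was excluded from the left table.
unmatched = merged[merged["_merge"] == "right_only"]
print(unmatched[["scientific_name", "park_name", "observations"]])
```

The `right_only` rows correspond to the all-NaN species columns that a right merge produces for names missing from the species table.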

In [108]:
right = species_df.merge(observations, on='scientific_name', how='right').sort_values(by=['scientific_name', 'observations']).reset_index(drop=True)

In [109]:
left = species_df.join(observations.set_index('scientific_name'), on='scientific_name', how='left').sort_values(by=['scientific_name', 'observations']).reset_index(drop=True)

In [110]:
right[right.scientific_name.isin(species_changed_status)]

Unnamed: 0,category,scientific_name,common_names,conservation_status,park_name,observations
416,,Agrostis capillaris,,,Great Smoky Mountains National Park,84
417,,Agrostis capillaris,,,Great Smoky Mountains National Park,97
418,,Agrostis capillaris,,,Bryce National Park,103
419,,Agrostis capillaris,,,Bryce National Park,105
420,,Agrostis capillaris,,,Yosemite National Park,140
...,...,...,...,...,...,...
23251,,Zizia aptera,,,Bryce National Park,112
23252,,Zizia aptera,,,Yosemite National Park,123
23253,,Zizia aptera,,,Yosemite National Park,129
23254,,Zizia aptera,,,Yellowstone National Park,257


In [111]:
len(species_changed_status)

274