# Biological invasion: a starter analysis
Hello Kagglers! Welcome to this starter notebook. Here I present one of the most troubling issues of recent years: biological invasion.

First of all, the [original data](https://catalog.data.gov/dataset/trends-in-non-native-aquatic-species-richness-in-the-united-states-reveal-shifting-pattern) was obtained from the U.S. Environmental Protection Agency, whose mission is to protect human health and the environment. Credit goes to that institution and its work.

The [preprocessed data](https://www.kaggle.com/lazaro97/biological-invasions) is available on Kaggle.

As indicated in its description:

> Nonindigenous aquatic species introductions are widely recognized as major stressors to freshwater ecosystems, threatening native endemic biodiversity and causing negative impacts to ecosystem services as well as damaging local and regional economies. So, it's thus necessary to monitor the spatial and temporal trends and spread in order to guide prevention and control efforts and to develop effective policy aimed at mitigating impacts. (Michael J. Mangiante, 2018)

From our perspective, these new species bring about a number of changes in ecosystems: they alter the structure and composition of plant communities; reduce agricultural productivity, wildlife, biodiversity, and fodder availability; change soil structure; and affect human and livestock health.

You could say, "Well, invasion is not a novel phenomenon; it has always happened." However, biological invasion has increased tremendously in the past few years because of rapidly expanding trade and transport among countries, and it will grow even more with globalization.

![](http://muwo1.unibo.it/steamgreenuniboit/wp-content/uploads/sites/6/2017/07/inva.jpg)

I think there are two possible solutions:

1. Large corporations should take preventive measures in any transfer of resources between regions
2. Search for a spatiotemporal pattern and anticipate a possible new introduction

Each one has its pros and cons. However, bird and aquatic species are a distinct case: they arrive on their own, which complicates the first solution. So, given the information available in the dataset, the second is more feasible. It is difficult too, as there may be several cyclical events, atypical data, or data lost during collection. Establishing forecasts is therefore a challenge.

Next I'll show you some basic ideas for this type of analysis. Below is a descriptive analysis that looks for spatial or temporal patterns. If you see any improvement to this code, do not hesitate to comment.

Additional references are listed at the end of the notebook.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

#Spatial libraries
import geopandas as gpd
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster

# Exploratory analysis

First, we explore the dataset. In this step we check that the information is sound and look for any visible pattern. To do this we need to 1) make visualizations, 2) integrate all sources of data, 3) clean the data, and 4) consider adding new features.

The dataset was preprocessed beforehand, so this part is more direct.

In [None]:
#Get the data
spat=pd.read_csv('../input/biological-invasions/dat_spatial.csv')
spec=pd.read_csv('../input/biological-invasions/dat_species.csv')

In [None]:
#How many null values are there?
f, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
sns.heatmap(spec.isna(),cmap='cubehelix',yticklabels=False,cbar=False,ax=ax[0])
sns.heatmap(spat.isna(),cmap='cubehelix',yticklabels=False,cbar=False,ax=ax[1])

The graph shows where null values occur in each variable. The x-axis lists the variables and the y-axis the rows; white marks a missing value and black a present one.
* Georeferenced information in dat_spatial has no null values.
* For group, family, and dateobserved, note that there are two large disjoint sets of rows. This suggests that each set has some special characteristic (or that the information could not be obtained for it).
* The most noticeable thing is the recordedby variable. According to the data description, this variable indicates the discoverer, which is closely associated with the source where the observation was collected, so there is no drawback to removing it. However, a study that looks deeper into the collection methodology may want to keep it.
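
The heatmap gives a visual impression; a numeric summary can back it up. The sketch below uses a small made-up DataFrame (the values and the subset of columns are illustrative assumptions, not the real data) and reports the share of missing values per column with `isna().mean()`.

```python
import pandas as pd

# Toy stand-in for dat_species.csv; values are invented for illustration
toy = pd.DataFrame({
    "sciname": ["Salmo trutta", "Cyprinus carpio", "Salmo trutta", "Phalaris arundinacea"],
    "family": ["Salmonidae", None, "Salmonidae", None],
    "recordedBy": [None, None, None, "someone"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call would be `spec.isna().mean()`; a column near 1.0 is a natural candidate for removal.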

In [None]:
#Drop recordedBy, the column that stands out in the null heatmap
del spec['recordedBy']

In [None]:
#Analyze numeric features
spec['occurrence']=spec['occurrence'].astype('string') #It is an ID, not a float, although some IDs carry information
spec.select_dtypes(include=['float64','int64']).hist(color='blue',figsize=(10,8),bins=12)

* Latitude and longitude look fine. Their distributions are plausible and suggest that observations cluster in a specific region.
* Year is concentrated toward recent decades, with a long tail toward the early years. Either less information was recorded in the early years or very few species had arrived. For more information, review the methodology used by the institution.
* For the huc features, it makes sense that almost all areas are close to zero while a few are very large.

In [None]:
#Analyze 'text' features
spec.nativeregion.value_counts() #Looking at this variable, I define two values: possibly native or not native
def native_or_foreign(x):
    #Flag a record as 'possibly native' only when the text says so
    if "possibly native" in str(x): return 'possibly native'
    else: return 'no native'
origin_counts = spec.nativeregion.apply(native_or_foreign).value_counts()
print(origin_counts)
#Take the labels from the counts so that each slice gets its own name
plt.pie(origin_counts, autopct="%.1f%%",pctdistance=0.5,shadow=True,colors=['darkblue','blue'],labels=origin_counts.index)

Only 10.7% of the records are claimed to be foreign. What about the rest? More generally, which foreign (or native) area does each species come from?

One way to address this is to parse the text, define groups, and plot them. For example, define dichotomous variables (Europe, South America, Africa, Asia, Australia, ...) that take the value 1 if the native region matches and 0 otherwise.
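
The dichotomous variables could be sketched as follows. The strings and the keyword list are invented for illustration; the real `nativeregion` values may be worded differently, so treat this as a starting point rather than a finished parser.

```python
import pandas as pd

# Toy native-region strings; the real column values may be worded differently
nativeregion = pd.Series(["Eurasia", "South America", "possibly native", "Asia; Europe", None])

# Hypothetical keyword list; extend it after inspecting value_counts()
regions = ["europe", "asia", "south america", "africa", "australia"]

# Lowercase once, treat missing values as empty text
text = nativeregion.fillna("").str.lower()

# One 0/1 column per region keyword (plain substring match)
indicators = pd.DataFrame(
    {region: text.str.contains(region, regex=False).astype(int) for region in regions}
)
print(indicators.sum())
```

Note that a plain substring match is crude: "Eurasia" matches "asia" but not "europe", so composite regions may need their own rules.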

In [None]:
#Analyze categorical features
animal=spec[spec.kingdom=='Animal']
#sns.countplot(x=animal.group)

print("What is the most common animal group?")
print(animal.group.value_counts()[:4]) #Fishes, unsurprisingly
animal=animal[animal.group=='Fishes']
#What are the most common fish families?
an=animal.family.value_counts()[:4].keys()
sns.countplot(x=animal[animal['family'].isin(an)].family, palette="Blues_r") #Pass the column as x= so it is not read as the data argument

Note that each observation is located in a spatiotemporal plane, so these graphs may be skewed. Ideally, you should choose an intersection of space and time and then select a particular species (or family). In any case, the graphs above give a global view of what happened.

# Temporal features

In [None]:
#A paper suggests this separation too
def gyear(x): #Bin the year into periods
    if x<=1900: return('1600-1900')
    elif x<=1930: return('1901-1930')
    elif x<=1950: return('1931-1950')
    elif x<=1970: return('1951-1970')
    elif x<=1990: return('1971-1990')
    else: return('1991-2016')
spec['gyear']=spec['year'].apply(gyear)
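
As an alternative sketch, the same periods can be produced with `pd.cut`. The sample years below are made up. One difference to keep in mind: `gyear()` sends anything that fails all comparisons (including missing years) to '1991-2016', while `pd.cut` leaves missing or out-of-range years as NaN.

```python
import numpy as np
import pandas as pd

# Toy years, including a missing value
years = pd.Series([1676, 1900, 1901, 1950, 1990, 2016, np.nan])

# Right-closed bins reproduce the <= comparisons in gyear()
edges = [1599, 1900, 1930, 1950, 1970, 1990, 2016]
labels = ["1600-1900", "1901-1930", "1931-1950", "1951-1970", "1971-1990", "1991-2016"]
gyear_cut = pd.cut(years, bins=edges, labels=labels)
print(gyear_cut.tolist())
```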

In [None]:
print('Arrival of species from 1600 to 2016')
print(spec.gyear.value_counts()) #The records from 1676 to 1900 could arguably be treated as outliers
print('Which species came first?')
print(spec[spec.year<1800].sciname.value_counts().keys())
spec=spec[spec.gyear!='1600-1900'] #Remove those records

In [None]:
#Time series of occurrence from 1901 to 2016
from collections import Counter
sns.set_style("whitegrid")
def temporal_trend(text,col):  #Plot the yearly occurrences of one species. The color argument could be dropped
    specie=spec[(spec.sciname==text)]
    temporal=Counter(specie.year) #Temporal variation. For the plant kingdom there are months too
    #spatial=Counter(specie.state) #Spatial variation
    sns.lineplot(x=list(temporal.keys()),y=list(temporal.values()),color=col) #Lists and x=/y= keywords work in recent seaborn
temporal_trend('Phalaris arundinacea','red')
temporal_trend('Cyprinus carpio','blue')
temporal_trend('Salmo trutta','green')
temporal_trend('Phragmites australis','black')
# All species
#for i in spec.sciname.unique()[:10]:
#    temporal_trend(i,'red')

There is an anomaly between 2005 and 2015. That is odd: I am not sure whether to treat it as a cyclical variation or as a genuine part of the trend.

In [None]:
#Another option: all species together
temporal=Counter(spec.year)
sns.lineplot(x=list(temporal.keys()),y=list(temporal.values()),sort=True,color='magenta')
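
One detail worth checking before reading these lines: `Counter` only holds years that have at least one record, so a line plot silently connects across empty years. A minimal sketch (with made-up years) of a yearly series where empty years appear as explicit zeros:

```python
import pandas as pd

# Toy observation years; 2003 and 2004 have no records
years = pd.Series([2001, 2001, 2002, 2005, 2005, 2005])

# Count per year, ordered by year
counts = years.value_counts().sort_index()
full_range = range(int(counts.index.min()), int(counts.index.max()) + 1)

# Years without observations become explicit zeros instead of being skipped
yearly = counts.reindex(full_range, fill_value=0)
print(yearly.tolist())
```

With the real data, `spec.year.dropna().astype(int)` would take the place of the toy `years` (assuming the year column is numeric).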

In [None]:
#How to process the date features. In the original dataset the format differed between items

def preprocess_date(x): #The format varies with the species chosen; always return the same datetime type
    if len(str(x))<11:return pd.to_datetime(x, format='%m-%d-%Y')
    else: return pd.to_datetime(str(x)[:10], format='%Y-%m-%d')
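
If other species bring in layouts that `preprocess_date` does not expect, a more forgiving variant may help. The sketch below is an assumption about how that could look, using invented strings: it tries each layout on the whole column and turns anything unparseable into `NaT` instead of raising.

```python
import pandas as pd

# Toy date strings in the two layouts handled by preprocess_date, plus a bad value
raw = pd.Series(["07-15-2001", "2003-09-30T00:00:00", "not a date"])

# Try each layout; values that do not match become NaT instead of raising
us_style = pd.to_datetime(raw, format="%m-%d-%Y", errors="coerce")
iso_style = pd.to_datetime(raw.str[:10], format="%Y-%m-%d", errors="coerce")

# Prefer the first layout and fall back to the second
parsed = us_style.fillna(iso_style)
print(parsed)
```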

In [None]:
specie=spec[spec.sciname=='Phalaris arundinacea'].copy() #Select a single species (all species is possible too). copy() avoids SettingWithCopyWarning
specie['date']=specie['dateobserved'].apply(preprocess_date) 
from statsmodels.tsa.seasonal import seasonal_decompose
#Note: the input here is the sequence of observation dates in row order, not a series of counts
decomp = seasonal_decompose(specie.date, period=300, model='additive', extrapolate_trend='freq')
specie['trend'] = decomp.trend
specie["seasonal"] = decomp.seasonal
fig,ax=plt.subplots(2,1)
sns.lineplot(x=specie.year,y=specie.trend,sort=True,color='darkred',ax=ax[0])
sns.lineplot(x=specie.year,y=specie.seasonal,sort=True,color='darkblue',ax=ax[1])
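
The cell above feeds the observation dates themselves to the decomposition. A count series (observations per month) is probably a more natural input. Here is a minimal pandas-only sketch with invented dates; the 3-month window is an arbitrary choice for the toy data, not a recommendation.

```python
import pandas as pd

# Toy observation dates for one species (invented for illustration)
dates = pd.to_datetime(
    ["2001-01-05", "2001-01-20", "2001-03-02", "2001-04-11", "2001-04-12", "2001-04-30"]
)

# One row per observation -> number of observations per calendar month
monthly = pd.Series(1, index=dates).resample("MS").sum()

# Centered 3-month moving average as a crude trend estimate
trend = monthly.rolling(window=3, center=True).mean()
print(monthly.tolist())
```

A series like `monthly` could then be passed to `seasonal_decompose`; for monthly data a period of 12 would be the usual assumption.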

In [None]:
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf

f, ax = plt.subplots(nrows=2, ncols=1, figsize=(9, 4.5))

plot_acf(specie['date'], lags=100, ax=ax[0]) #Heavy decay
plot_pacf(specie['date'], lags=100, ax=ax[1])

plt.show() 
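
To make clear what the ACF plot measures, here is a small hand-rolled version of the sample autocorrelation (the usual biased estimator), applied to an invented count series. It is only a sketch for intuition, not a replacement for `plot_acf`.

```python
import numpy as np

def sample_acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()          # center the series
    denom = np.dot(x, x)      # lag-0 sum of squares
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)])

# Toy yearly counts with a period-2 pattern (invented for illustration)
counts = [5, 1, 5, 1, 5, 1, 5, 1]
acf = sample_acf(counts, max_lag=2)
print(acf.round(3))
```

The alternating series gives a strongly negative value at lag 1 and a positive one at lag 2, which is the kind of signature a cyclical pattern leaves in the plot.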

# Spatial features

In [None]:
# Build a GeoDataFrame to handle the spatial features
from shapely import wkt
spat['geometry'] = spat['geometry'].apply(wkt.loads)
spat = gpd.GeoDataFrame(spat, geometry = 'geometry')

In [None]:
#Snapshot at a given time
spec_2015=spec[spec.year==2015] #Just this year
points=spec_2015[['decimalLat', 'decimalLon']]
points = gpd.GeoDataFrame(points, geometry=gpd.points_from_xy(points.decimalLon, points.decimalLat)) #Points to overlay
#ax = spat.plot(figsize=(20,20), color='none', edgecolor='black', zorder=1) #boundary
#points.plot(color='blue', ax=ax) #This plot is hard to read..

In [None]:
#..How to fix it?
#The USA controls many unincorporated territories (https://en.wikipedia.org/wiki/Unincorporated_territories_of_the_United_States)
#Some states also include many islands (https://en.wikipedia.org/wiki/List_of_islands_of_the_United_States_by_area), so remove them
spatg=spat #Keep a reference to the full table
spat=spat[~spat.name.isin(['Hawaii','Alaska'])] #~ negates the boolean mask. sum(spec.state=='alaska') also shows no records there
#Now plot the map
ax = spat.plot(figsize=(20,20), color='none', edgecolor='black', zorder=1) #boundary
points.plot(color='blue', ax=ax) #Points to overlay

The map shows the areas where most aquatic species were recorded in 2015.
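
Removing states from the boundary table cleans up the outline, but points outside the mainland would still be drawn. As a complementary sketch, the points can be filtered with a rough bounding box. The box limits and the toy coordinates below are my own approximations, not values from the dataset.

```python
import pandas as pd

# Toy occurrence points (coordinates are rough, for illustration only)
pts = pd.DataFrame({
    "place": ["Ohio", "Florida", "Hawaii", "Alaska"],
    "decimalLat": [40.4, 27.8, 20.8, 64.2],
    "decimalLon": [-82.9, -81.7, -156.3, -149.5],
})

# Approximate bounding box of the contiguous United States (an assumption)
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0

# Keep only the points that fall inside the box
inside = pts["decimalLat"].between(LAT_MIN, LAT_MAX) & pts["decimalLon"].between(LON_MIN, LON_MAX)
contiguous = pts[inside]
print(contiguous["place"].tolist())
```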

In [None]:
#Now add temporal variation too
spec_t1=spec[spec.gyear=='1600-1900']
spec_t2=spec[spec.gyear=='1901-1930']
spec_t3=spec[spec.gyear=='1931-1950']
spec_t4=spec[spec.gyear=='1951-1970']
spec_t5=spec[spec.gyear=='1971-1990']
spec_t6=spec[spec.gyear=='1991-2016']
#for i in spec['gyear'].values: lst.append(spec[spec.year==i])
def plot_map(df):
    points = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.decimalLon, df.decimalLat))
    ax = spat.plot(figsize=(20,20), color='none',edgecolor='black', zorder=1) #Same boundary
    points.plot(color='blue', ax=ax) #On the same boundary
#plot_map(spec_t1) #Empty if the 1600-1900 records were removed earlier
#plot_map(spec_t6)
#These maps may be hard to interpret, but they can show how many more species there are today than before
#I think a map per year (or per day) is better than one per long period. I suggest building a dynamic plot


In [None]:
#Define the observations and features that you want to analyze
#Species: it is convenient to build the plot for one specific species
pl_2001=spec[(spec.sciname=='Phalaris arundinacea')&(spec.year==2001)]
salmo_2001=spec[(spec.sciname=='Salmo trutta')&(spec.year==2001)]
#Boundary
spat_s = spat[["name", "geometry"]].set_index("name")

In [None]:
#Build an interactive map
#Spatial point process
m_point = folium.Map(location=[41,-82], tiles='cartodbpositron', zoom_start=4.2) #Define the map
 
for i in range(0,len(pl_2001)):  #Add points
    folium.Circle(
      location=[pl_2001.iloc[i]['decimalLat'], pl_2001.iloc[i]['decimalLon']],
      radius=pl_2001.iloc[i]['huc12skm'],  #I use the HUC12 area as the bubble size. Note that folium reads radius in meters
      color='green',
      fill=True,
      fill_color='green').add_to(m_point)
    
for i in range(0,len(salmo_2001)): #It is also possible to add other species
    folium.Circle(
      location=[salmo_2001.iloc[i]['decimalLat'], salmo_2001.iloc[i]['decimalLon']],
      radius=salmo_2001.iloc[i]['huc12skm'],
      color='blue',
      fill=True,
      fill_color='blue').add_to(m_point)

m_point

In [None]:
m_heatmap = folium.Map(location=[41,-82], tiles='cartodbpositron', zoom_start=4) #Again, define the map
HeatMap(data=salmo_2001[['decimalLat', 'decimalLon']], radius=10).add_to(m_heatmap) #A heatmap is another option
#HeatMap(data=pl_2001[['decimalLat', 'decimalLon']], radius=10).add_to(m_heatmap)
m_heatmap

In [None]:
#For the discrete case it is necessary to count the number of events per region
plot_dict = salmo_2001.state.value_counts() 
s=pd.DataFrame(plot_dict).reset_index()
s.columns=['name','count']
#Join tables
dat=pd.merge(s,spat,on='name',how='inner')
dat=dat.fillna(0)

In [None]:
m_choropleth = folium.Map(location=[41,-82], tiles='cartodbpositron', zoom_start=4) #map
Choropleth(geo_data=spat_s.__geo_interface__, #In general, add layers (or features)
           data=plot_dict, 
           key_on="feature.id", 
           fill_color='YlGn', 
           legend_name='Occurrences of Salmo trutta (Jan-Dec 2001)' #plot_dict was built from salmo_2001
          ).add_to(m_choropleth)
m_choropleth

In [None]:
#Build a dynamic map
spec_e=spec[(spec.year>1900)&(spec.year<1971)]  #Only until 1970. I couldn't optimize this part of the code and it took a long time with more years. If you include later dates, remember the sharp rise between 2005 and 2015
#spec_e=spec_e[spec_e.sciname=='Salmo trutta'] #You can choose a specific species
plot_dict = spec_e[['state','year']].value_counts() #Transform to the discrete case, indexed by (state, year)
s=pd.DataFrame(plot_dict).reset_index()
s.columns=['name','year','events'] #Same column names as the spatial table

spatg = spatg[["iso_code","name", "geometry"]].copy() #copy() avoids SettingWithCopyWarning
spatg['iso_code']=spatg.iso_code.apply(lambda x: str(x)[3:]) #Drop the first three characters; plotly's 'USA-states' mode expects two-letter state codes

#Join tables
dat=pd.merge(s,spatg,on='name',how='inner')
dat=dat.sort_values(by='year', ascending=True).reset_index()
del dat['index']
dat

In [None]:
df=pd.merge(spatg[['name','iso_code']],dat[['year']].drop_duplicates(),how='cross') #All (state, year) pairs, so the map gets zero values. Unique years keep the cross join small
datf=pd.merge(dat,df,on=['name','year'],how='outer')
datf['events']=datf['events'].fillna(0) #Regions without species should be 0
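
An alternative way to add the zero rows, sketched on invented counts: build the full (state, year) grid with `MultiIndex.from_product` and reindex onto it. The column names mirror the ones used above; everything else is illustrative.

```python
import pandas as pd

# Toy event counts per state and year; missing combinations mean "no records"
events = pd.DataFrame({
    "name": ["Ohio", "Ohio", "Maine"],
    "year": [1901, 1902, 1902],
    "events": [3, 1, 2],
})

# Build every (state, year) pair once, then fill the gaps with zero
grid = pd.MultiIndex.from_product(
    [sorted(events["name"].unique()), sorted(events["year"].unique())],
    names=["name", "year"],
)
full = events.set_index(["name", "year"]).reindex(grid, fill_value=0).reset_index()
print(full)
```

This only covers states that appear at least once in the counts; to include states with no events at all, the state list would have to come from the boundary table instead.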

In [None]:
!pip install plotly

In [None]:
import plotly.express as px
fig = px.choropleth(datf,                          
                     locations='iso_code_y', locationmode='USA-states',     # column with the state codes
                     color="events",                     # column that drives the color
                     animation_frame="year",        # column that drives the animation
                     scope='usa',                # show only this country
                     color_continuous_scale= 'Ylgn', 
                     range_color=[0,datf.events.max()])             
fig.write_html("historic-invasion.html") 
fig.show() #It could look better. A point process may be a better choice

# Conclusions

* The areas that receive the most new species were identified. As expected, occurrences concentrate along the maritime boundaries, but the maps also show which of these border regions received the most species: California, Florida, Maine, New Jersey, and Wisconsin.
* California consistently hosts several aquatic species.
* The number of species in the USA grows over time, but the change is slight.
* The dynamic graph covers all species together; it may be more useful to take a single species. Note the ups and downs in some states on the map.
* A next step would be to build a model that predicts what the choropleth map will look like a few years from now.
* Last but not least, the maps and graphs can be adapted to any phenomenon. For example, you could analyze human invasion, the spread of covid19, the occurrence of criminal acts, etc.

# Additional resources

**Notebooks:**
* https://www.kaggle.com/andreshg/timeseries-analysis-a-complete-guide
* https://www.kaggle.com/umerkk12/geo-animation-of-vaccination
* https://www.kaggle.com/learn/geospatial-analysis
* https://www.kaggle.com/dbennett/test-map

**Papers:**
* http://www.aquaticinvasions.net/2018/issue3.html
* https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6209226/

## Any task for this dataset?

* **Define some insights:** For example, the rate of increase per year, the neighborhood effect, the first appearance, the foreign area of origin, etc.
* **Make an interactive dashboard:** Showing results visually is more eye-catching and easier to interpret, which makes decisions easier.
* **Build a spatiotemporal model:** The main objective is to predict the appearance of new species based on historical and spatial data. You can use a neural network or Bayesian methods.
* **Carry out a spatial (or temporal) analysis:** Without combining the two, so it is more practical.
* **Suggest some ideas for data staging:** More technical. I have always seen drawbacks in big data management. Adding other data sources may give a more accurate analysis.
* **Cycle variation from 2005 to 2015:** Honestly, I'm not sure how to interpret the sharp rise in 2005. If that cyclical variation can be accounted for, is it necessary to remove it from the analysis?

That's all! Thanks for reading this notebook. Now it's your turn. Choose a task and have fun!

**P.S.:** Don't forget to review and support my [kaggle dataset](https://www.kaggle.com/lazaro97/biological-invasions). That motivates me to keep doing similar projects.