# Analysis of projects in Chile

The Environmental Assessment Service (SEA, by its Spanish acronym) is the institution responsible for authorizing the operation of projects in Chile that could affect public health or the environment. When a company wants to carry out a relatively large project, it must submit a request to the SEA, which evaluates whether the project can operate correctly and safely. If a project is found to be harmful to the environment or the population, the service can deny the environmental permit and thereby prevent the project from starting. Since the SEA began operating in 1997, it has evaluated more than 15 thousand projects, so its database contains a large number of records that are useful to analyze.

<img src="https://sea.gob.cl/sites/default/files/styles/noticia_portada/public/imagenes/bloque-estilo-pagina-estatica/sea_2d8d4.png?itok=ojZ7om3H">

Note that a project can be submitted as an environmental impact statement (DIA, by its Spanish acronym) or an environmental impact study (EIA). The choice depends on the magnitude of the potential impacts on the environment or public health. A DIA is a simpler evaluation of impacts, while an EIA is a more complex assessment. With this information, several relevant questions can be answered from the data.

* How long does a project take, on average, to get the final response from the service?
* What is the most important industry in Chile?
* Which regions have the most projects?
* Is investment correlated with obtaining the permit?
* What is the proportion of EIA and DIA?
* Are EIAs more difficult to get approved?

In [None]:
# Import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
from datetime import datetime

# Set matplotlib config
%matplotlib inline
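# The 'seaborn' matplotlib style name was deprecated in matplotlib 3.6;
# sns.set_theme() applies the seaborn default theme instead
sns.set_theme()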

In [None]:
# Load data
data = pd.read_csv('../input/chilean-projects/projects.csv',parse_dates = ["entry_date","qualification_date"])
data.head()

The dataset contains 26953 projects. We can now start with a general analysis.

In [None]:
# Extract the year of entry
data['entry_year'] = data['entry_date'].dt.year

In [None]:
# What is the proportion of EIA and DIA?
data_type = data.groupby(["type"],as_index = False).size()

# We can see the evolution of request
data_type_year = data.groupby(["type","entry_year"],as_index = False).size()

# We plot data
fig, ax = plt.subplots(1,2, figsize=(20, 6))

# First plot
f1 = sns.barplot(x = 'type',y = 'size',data = data_type,ax = ax[0]);
f1.set_title("Number of projects based on evaluation type", fontsize=24)
f1.set_xlabel('Type of evaluation')
f1.set_ylabel('Frequency')

# Second plot
f2 = sns.lineplot(x='entry_year', y="size", hue = 'type', data=data_type_year,ax = ax[1]);
f2.set_title("Evolution of number of environmental analysis", fontsize=24)
f2.set_xlabel('Year')
f2.set_ylabel('Frequency')
# Annotate the end of the line to mark that something unusual happened there
f2.annotate('Social crisis and COVID?', xy=(2020, 200), xytext=(2010, 200), arrowprops={'facecolor':'red', 'shrink':0.05}); 
fig.show()

We observe that the most used evaluation type is the DIA. This may be because most projects are not large or do not imply high impacts. On the other hand, we need to consider that many projects have been split into smaller parts, which makes the environmental permit easier to obtain. In the right plot, we see that the number of DIA projects increased between 2005 and 2012. Since then, the trend has been downward. In the last two years there is evidence of a sharp decrease, which may be associated with the **social crisis** that occurred in the country, combined with **COVID**.
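To put a number on the proportion of each evaluation type rather than reading it off the bar heights, a minimal sketch could look like this. It uses a small toy frame (`toy`, with made-up rows) in place of the notebook's `data` and relies only on the `type` column used above.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (made-up rows, not real SEA records)
toy = pd.DataFrame({"type": ["DIA", "DIA", "DIA", "EIA"]})

# Share of each evaluation type as a fraction of all projects
type_share = toy["type"].value_counts(normalize=True)
print(type_share)
```

On the real dataset, the same call on `data["type"]` would give the DIA and EIA shares directly.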

In [None]:
# What is the state of projects?
data_state = data.groupby(["state","type"],as_index = False).size()

# We plot data
fig, ax = plt.subplots(1,2, figsize=(20, 6))

# First plot (DIA)
f1 = sns.barplot(y = 'state',x = 'size', data = data_state.loc[data_state['type'] == 'DIA'],ax = ax[0]);
f1.set_title("State of evaluation processes (DIA)", fontsize=24)
f1.set_xlabel('Frequency')
f1.set_ylabel('State')

# Second plot (EIA)
f2 = sns.barplot(y='state', x="size", data=data_state.loc[data_state['type'] == 'EIA'],ax = ax[1]);
f2.set_title("State of evaluation processes (EIA)", fontsize=24)
f2.set_xlabel('Frequency')
f2.set_ylabel('State')
fig.show()

If we compare the state of evaluations by EIA and DIA, we find a similar proportion of approvals ("Aprobado"). A small share of evaluations are rejected ("Rechazado"). It is also interesting that a large share of evaluation processes are withdrawn. Another notable value is "No Admitido a Tramitación" (not admitted for processing), which covers processes that do not meet the requirements imposed by the SEA and environmental regulations. The "expired" status indicates that the permit was approved but the project did not come to fruition within the stipulated period. This is relevant because an environmental permit has a limited duration, after which it expires.
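Because the two bar charts have very different scales, comparing approval shares by eye is difficult. One way to compare them directly is a row-normalized cross-tabulation. This is a sketch on a toy frame: the rows are made up, and the state label "Desistido" is an assumed value used only for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (made-up rows;
# "Desistido" is an assumed label for withdrawn processes)
toy = pd.DataFrame({
    "type":  ["DIA", "DIA", "DIA", "DIA", "EIA", "EIA"],
    "state": ["Aprobado", "Aprobado", "Rechazado", "Desistido",
              "Aprobado", "Rechazado"],
})

# Share of each state within each evaluation type (each row sums to 1)
state_share = pd.crosstab(toy["type"], toy["state"], normalize="index")
print(state_share.round(2))
```

Reading the "Aprobado" column of the resulting table would show whether EIAs are approved less often than DIAs.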

In [None]:
# What is the most important industry in Chile?
data_typology = data.groupby(["typology","typology_des"],as_index = False).size()

# Create a plot
plt.figure(figsize=(20,6))
sns.barplot(x = 'typology',y = 'size', data = data_typology,ci = None);
plt.title("Projects by typology", fontsize=24)
plt.xlabel('Sector')
plt.ylabel('Frequency');

We see that the SEA uses a large number of typologies. Using the field "typology_des", we can identify the meaning of the most relevant ones.

In [None]:
# We identify the five most recurrent typologies
top_typologies = data_typology.sort_values("size",ascending = False)[0:5][["typology_des"]]

# We can see the entire description as a list
list(top_typologies["typology_des"])

As we can see, the main typologies are:
* Fish and seafood production
* Power plants
* Building projects
* Waste treatment and / or disposal systems
* Mining projects

We can also group these typologies into major categories, based on the letter of each typology. [This page](https://www.sea.gob.cl/sea/proyectos-actividades-sometidos-eia) lists the general categories.

In [None]:
# Group typologies in major categories
data["major_typology"] = data["typology"].str.extract(r'(\D)')

# Create a plot
plt.figure(figsize=(20,6))
sns.barplot(x = 'major_typology',y = 'size', data = data.groupby("major_typology",as_index = False).size(),ci = None);
plt.title("Projects by major typologies", fontsize=24)
plt.xlabel('Sector')
plt.ylabel('Frequency');

Now it is easier to identify the main typologies. If we review the [SEA page](https://www.sea.gob.cl/sea/proyectos-actividades-sometidos-eia), we can note that the main major typologies are:

* n: Projects of intensive exploitation, cultivation, and hydrobiological resource processing plants
* o: Environmental sanitation projects, such as sewage and drinking water systems, water or solid waste treatment plants of domestic origin, sanitary landfills, underwater outfalls, liquid or solid industrial waste treatment and disposal systems

In [None]:
# What are the regions with most number of projects?
data_regions = data.groupby(["region","type"],as_index = False).size()

# Create a plot
plt.figure(figsize=(20,6))
sns.barplot(x = 'size',y = 'region', hue = "type",data = data_regions);
plt.title("Projects by region", fontsize=24)
plt.xlabel('Frequency')
plt.ylabel('Region');


RM, Décima and Undécima are the regions with the most projects.
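Since the plot splits each region by evaluation type, a ranked table of totals can make the ordering easier to check. The sketch below uses a toy frame with made-up rows and the same `region` and `type` column names as above.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (made-up rows)
toy = pd.DataFrame({
    "region": ["RM", "RM", "RM", "Décima", "Décima", "Undécima"],
    "type":   ["DIA", "DIA", "EIA", "DIA", "DIA", "DIA"],
})

# Total projects per region (both evaluation types), largest first
region_totals = toy.groupby("region").size().sort_values(ascending=False)
print(region_totals)
```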

In [None]:
# How long is the average time it takes a project to get the final response from the service?
data["processing_days"] = data["qualification_date"] - data["entry_date"]
data["processing_days"] = data["processing_days"].dt.days

# We create a histogram and a boxplot to know the distribution
fig, ax = plt.subplots(1,2, figsize=(20, 6))

# Create a histogram
f1 = sns.histplot(x = data["processing_days"],kde = True,bins = 30,ax = ax[0]);
f1.set_title("Amount of days of process", fontsize=24)
f1.set_xlabel('Days')
f1.set_ylabel('Frequency')

# Second plot
f2 = sns.boxplot(x='type', y="processing_days", data=data,ax = ax[1]);
f2.set_title("Days of evaluation process", fontsize=24)
f2.set_xlabel('Type of evaluation')
f2.set_ylabel('Days')
fig.show()

print("Average amount of days of EIA: {}".format(round(data.loc[data["type"] == "EIA","processing_days"].mean())))
print("Average amount of days of DIA: {}".format(round(data.loc[data["type"] == "DIA","processing_days"].mean())))


In [None]:
# Inspect the outliers: projects whose evaluation took more than 5000 days
data.loc[data["processing_days"] > 5000,["name","processing_days","type","investment","typology_des"]]

These appear to be projects with no distinctive characteristics. Now we can look at the duration by major typology.

In [None]:
# Create a plot
plt.figure(figsize=(20,6))
sns.boxplot(x='major_typology', y="processing_days",hue="type", data=data);
plt.title("Days of evaluation process", fontsize=24)
plt.xlabel('Major typology')
plt.ylabel('Days');

In general terms, we cannot observe a clear difference in duration between categories. The only sharper difference we can identify is between EIA and DIA. Apart from that, some categories show a wider range of days, such as **n** and **u**.
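Given that a few processes lasted thousands of days, the mean duration can be pulled upward by those outliers, so it may help to report the median next to it. This is a sketch on a toy frame with made-up durations, using the `type` and `processing_days` column names from above.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (made-up durations,
# including one very long DIA process to mimic an outlier)
toy = pd.DataFrame({
    "type": ["DIA", "DIA", "DIA", "EIA", "EIA"],
    "processing_days": [100, 150, 1100, 400, 600],
})

# Mean and median processing time per evaluation type
summary = toy.groupby("type")["processing_days"].agg(["mean", "median"])
print(summary)
```

In this toy example the DIA mean (450) is three times its median (150), which shows how a single long process can distort the average.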

In [None]:
# Is investment correlated with obtaining the permit?

# Create a plot
plt.figure(figsize=(20,6))
sns.scatterplot(data=data.loc[data["processing_days"] < 2000], x="processing_days", y="investment", hue="state");
plt.title("Relation between processing days and investment with state", fontsize=24)
plt.xlabel('Processing days')
plt.ylabel('Investment');

We cannot identify a clear trend between these features. Apparently, investment does not determine whether a permit is obtained. On the other hand, we can observe that many evaluation processes are withdrawn or not admitted ("No Admitido a Tramitación") within the first days. We only plot rows with processing_days below 2000 so that most points are easier to see.
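The visual impression from the scatter plot can be backed with two simple numbers: the median investment per state, and a rank (Spearman-style) correlation between investment and processing days. The sketch below runs on a toy frame with made-up values. The rank correlation is computed as the Pearson correlation of the ranks, which is a choice made here for illustration and is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (made-up values;
# "Desistido" is an assumed label for withdrawn processes)
toy = pd.DataFrame({
    "investment": [1, 5, 10, 50, 200, 2],
    "processing_days": [30, 200, 120, 400, 900, 60],
    "state": ["No Admitido a Tramitación", "Aprobado", "Aprobado",
              "Rechazado", "Aprobado", "Desistido"],
})

# Median investment for each final state
median_by_state = toy.groupby("state")["investment"].median()
print(median_by_state)

# Rank correlation: Pearson correlation computed on the ranks
rho = toy["investment"].rank().corr(toy["processing_days"].rank())
print(round(rho, 3))
```

If the medians are similar across states on the real data, that would support the reading that investment alone does not explain the outcome.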

In [None]:
# Is the number of participatory activities or documents related to obtaining the permit?

# Create a plot
plt.figure(figsize=(20,6))
sns.scatterplot(data=data.loc[data["n_docs"] < 5000], x="n_docs", y="n_participatory", hue="state");
plt.title("Relation between number of docs and participatory activities with state", fontsize=24)
plt.xlabel('Number of docs')
plt.ylabel('Number of participatory activities');

We can observe that there are processes with many activities and documents. If we review the main URL associated with each row, we can see some projects with thousands of documents, although most of these are only observations. This plot shows only rows with fewer than 5000 documents. We can see great heterogeneity in the data.

In [None]:
# Now we see the spatial distribution of projects
projects_spatial = data.dropna(subset=['latitude','longitude']).copy()

# Create a base map
map_p = folium.Map(location=[-37.027811,-72.082867], tiles='cartodbpositron', zoom_start=4);

# Add a heatmap to the base map
HeatMap(data=projects_spatial[['latitude', 'longitude']], radius=10).add_to(map_p);

# Display the map
map_p

We can note that there are a large number of projects throughout the Chilean territory, especially in the Metropolitan Region (in the center of the country). Clearly, one row contains an error, since its location appears in Argentina. We can also look at specific territories. For instance, it is known that a large number of projects related to the extraction and production of marine resources take place in the south of the country.

In [None]:
# Create a new dataset. We know that typology "n" is related to extraction and production of marine resources
aqua_projects = projects_spatial[projects_spatial["typology"].str.contains("n")]

# Create a base map
map_aqua = folium.Map(location=[-42.637024,-73.445172], tiles='cartodbpositron', zoom_start=9);

def color_producer(val):
    # Green for EIA, dark red for any other evaluation type (DIA)
    if val == "EIA":
        return 'forestgreen'
    else:
        return 'darkred'

# Add a bubble map to the base map
for i in range(0,len(aqua_projects)):
    Circle(
        location=[aqua_projects.iloc[i]['latitude'], aqua_projects.iloc[i]['longitude']],
        radius=20,
        color=color_producer(aqua_projects.iloc[i]['type'])).add_to(map_aqua)

# Display the map
map_aqua

We observe that almost all projects are based on a DIA (dark red). If we move the map, we see that the other regions are not as saturated as this one (Décima region).