# Theory:



## Motivation

A key challenge in accumulation management for property insurance is to identify and draw boundaries around areas of exposure, in order to find and understand concentrations of risk. This is most commonly achieved through accumulation management, that is, the monitoring of insured values (i.e. limits excluding all deductibles) in any one area to avoid an over-concentration of risk. Accumulation management evaluates the impact on property by assessing concentrations in worst-case scenarios, assuming 100% damage. Because all limits and deductibles are applied to the total sum insured of the concentrated risks, it is crucial to know which risks are correlated with each other.

In the traditional risk management approach, the area of concentration can be a city, state, CRESTA zone, country, or any user-defined geographic grid. For terrorism risk, however, it is not safe to treat these political-geographic boundaries as the concentration of risk, for the following two reasons:

* The scope of the risk goes beyond any political-geographic boundary.

* The area of accumulation is very limited: it is smaller than the area of a city or even a postal code.
    Accumulation of terrorism risk is usually measured in terms of a radius.
    For example, a potential bomb can affect at most a 250 meter radius (e.g. how much exposure is within 250 meters of point x?).
    Hence, we need a different approach to accumulate terrorism risk.
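The radius question above ("how much exposure is within 250 meters of point x?") can be sketched in a few lines. This is a minimal illustration, not part of the pipeline below: the helper names `haversine_m` and `exposure_within`, the spherical-Earth approximation, and the three sample locations are all assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def exposure_within(center, locations, radius_m):
    """Sum the insured value of every location within radius_m of center."""
    return sum(
        value
        for lat, lon, value in locations
        if haversine_m(center[0], center[1], lat, lon) <= radius_m
    )

# Hypothetical locations: (latitude, longitude, insured value)
locations = [
    (24.4174, 54.4411, 10_000_000),
    (24.4180, 54.4420, 5_000_000),  # roughly 110 m from the first location
    (24.4300, 54.4600, 7_000_000),  # more than 2 km away
]

# Exposure within 250 meters of the first location
total = exposure_within((24.4174, 54.4411), locations, 250)
```

Here only the first two locations fall inside the 250 meter circle, so the third contributes nothing to the accumulation.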

## Vendor Model (RMS) Approach

**Spider analysis**

In general, the goal of this analysis is to identify the “worst” areas in terms of accumulation, as defined by the analysis type. A Spider analysis identifies these areas by itself. The area type can be a geographic area such as a CRESTA zone, county, or postal code. The area type can also be a simple damage circle, which assumes 100% ground-up damage.

The user selects a method of attack (the maximum area of effect, expressed as the radius of a circle). The user also specifies a threshold when running a Spider analysis. This threshold limits the number of accumulation areas identified. For example, suppose you want to know where the top 50 areas of greatest exposure concentration lie, assuming accumulation within a 250 meter radius circle; this can be accomplished by performing a Spider analysis. The user can also specify a monetary threshold, such as $500 million in an area.

RiskLink uses a fixed grid system that blankets the earth. For a 100% simple damage circle, the default grid size is 25% of the radius selected. For example, if a simple damage circle is defined by a radius of 400 meters, the grid size would be 100 meters by 100 meters (400 * 25%).

For each cell in the grid system, a circle is drawn that represents the accumulation area. The centroid of each circle is placed at the center of its cell. As each circle is drawn, the exposure inside it is accumulated. This process continues until a circle has been drawn for every cell. Once all circles have been drawn, the circle with the highest accumulation is identified, and any circles that overlap with it are removed. This is called event generation, and in this process RiskLink discards some exposure.
The final step in the Spider analysis, after eliminating overlapping circles, is to give the accumulations a descriptive name. All events are named after the location closest to the center of the circle. Let's say the location closest to the center of an accumulated circle is in the country "US", state "California", city "Foothill Ranch" and postal code "92610". The name of the event would then be “Foothill Ranch, CA 92610, US”.
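The grid-and-circle procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not RiskLink's implementation: it works on planar `(x, y)` coordinates in meters rather than on a global grid, it assumes that "keep the highest circle, drop the overlapping ones" is repeated until `top_n` circles are found, and the function name `spider` and the sample data are hypothetical.

```python
from math import floor, hypot

def spider(locations, radius_m, top_n):
    """Greedy grid-based accumulation on planar (x, y, value) locations."""
    cell = radius_m * 0.25  # grid size: 25% of the selected radius
    reach = int(radius_m // cell) + 1

    # Candidate cells: every cell close enough to a location to possibly contain it
    candidates = set()
    for x, y, _ in locations:
        ix, iy = floor(x / cell), floor(y / cell)
        for dx in range(-reach, reach + 1):
            for dy in range(-reach, reach + 1):
                candidates.add((ix + dx, iy + dy))

    # Draw one circle per cell, centered on the cell, and accumulate its exposure
    circles = []
    for ix, iy in candidates:
        cx, cy = (ix + 0.5) * cell, (iy + 0.5) * cell
        total = sum(v for x, y, v in locations if hypot(x - cx, y - cy) <= radius_m)
        if total > 0:
            circles.append((total, cx, cy))

    # Keep the highest circle, skip every circle that overlaps a kept one, repeat
    circles.sort(reverse=True)
    selected = []
    for total, cx, cy in circles:
        if all(hypot(cx - sx, cy - sy) >= 2 * radius_m for _, sx, sy in selected):
            selected.append((total, cx, cy))
        if len(selected) == top_n:
            break
    return selected

# Hypothetical exposure: (x in meters, y in meters, insured value)
sample = [(0, 0, 10), (50, 50, 20), (1000, 1000, 5)]
result = spider(sample, radius_m=400, top_n=2)
totals = [total for total, _, _ in result]
```

With a 400 meter radius the grid cells are 100 meters wide. The first two locations end up in the top circle, and the isolated third location forms the second one.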

## Our Approach (TRAM)

This method is very similar to RiskLink's, except for the methodology used to find the accumulated circles. While RMS uses the Spider algorithm to do so, we use an unsupervised algorithm called agglomerative hierarchical clustering. Hierarchical clustering creates clusters that have a predetermined ordering from top to bottom.

For example, all files and folders on a hard disk are organized in a hierarchy. The agglomerative (bottom-up) clustering method works as follows:

1. Assign each location to its own cluster.
2. Compute the similarity (e.g., distance) between each pair of clusters.
3. Join the two most similar clusters, i.e. the two clusters with the minimum distance to each other.
4. Repeat steps 2 and 3 until only a single cluster is left.

In theory, the hierarchy can also be built top-down, by initially grouping all observations into one cluster and then successively splitting it.
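The bottom-up procedure can be illustrated with a deliberately naive loop. This sketch is only meant to make the steps concrete: it uses hypothetical one-dimensional points, measures the distance between two clusters by their farthest pair of points (one possible choice), and is far slower than the SciPy routines used later in this notebook.

```python
from itertools import combinations

def agglomerate(points, cluster_distance):
    """Merge clusters bottom-up until one remains; return the merge history."""
    # Every point starts in its own cluster
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance between them
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        # Join the two closest clusters and record the merge
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

def farthest_pair(a, b):
    """Largest distance between a point of a and a point of b (1-D points)."""
    return max(abs(p - q) for p in a for q in b)

# Hypothetical positions along a line
history = agglomerate([0, 1, 5, 6, 20], farthest_pair)
merge_distances = [d for _, d in history]
```

Five points produce four merges. The recorded merge distances never decrease, which is what makes it possible to cut the resulting tree at a chosen height.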

Before any clustering is performed, a proximity matrix is computed that contains the distance between each pair of points, using a distance function. The matrix is then updated after each merge to hold the distances between clusters. Three common linkage methods (single, average, and complete linkage) differ in how the distance between two clusters is measured. The one we use here is called "complete linkage".
In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between any point in one cluster and any point in the other.
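
Written as a formula, with $A$ and $B$ two clusters and $d(a, b)$ the distance function behind the proximity matrix (the symbols are notation introduced here for illustration):

$$
D_{\text{complete}}(A, B) = \max_{a \in A,\; b \in B} d(a, b)
$$

For comparison, single linkage replaces the maximum with a minimum, so it measures the closest pair of points instead of the farthest pair.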

Finally, the algorithm produces an agglomeration tree, also called a dendrogram (as shown below). In this tree, the clusters are shown on the x axis and the iteration number on the y axis.
Additionally, we know the distance between any two clusters at the Nth iteration.

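Because we use the complete linkage method, cutting this tree at height X yields clusters in which the distance between any two locations is at most X. Cutting at the blast diameter (twice the blast radius) therefore keeps every pair of locations in a cluster within one blast diameter of each other.
This approach also ensures that every location is part of exactly one cluster.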
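
The effect of the cut can be checked on a small example. The sketch below uses the same SciPy calls as the notebook code (`pdist`, `linkage`, `fcluster`), but on hypothetical planar coordinates in meters with Euclidean distance instead of geodesic distance. It is an illustration of the property, not the production code.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Hypothetical planar coordinates in meters (not real exposure data)
points = np.array([
    [0.0, 0.0],
    [140.0, 0.0],
    [300.0, 0.0],
    [450.0, 0.0],
    [2000.0, 0.0],
])
blast_radius = 200
cut_height = 2 * blast_radius  # blast diameter

# Condensed matrix of pairwise Euclidean distances
condensed = pdist(points)
# Complete linkage: cluster distance is the largest pairwise distance
link_matrix = linkage(condensed, method="complete")
# Cut the tree at the blast diameter
labels = fcluster(link_matrix, cut_height, criterion="distance")

# Largest pairwise distance inside each resulting cluster
square = squareform(condensed)
max_within = {
    int(label): float(square[np.ix_(labels == label, labels == label)].max())
    for label in np.unique(labels)
}
```

Every value in `max_within` stays at or below `cut_height`, and each point receives exactly one label, including the isolated point, which ends up in a cluster of its own.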

###### Load sample data

In [3]:
# File location and type
file_location = "/FileStore/tables/data.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

In [4]:
from pyspark.sql.functions import col, lit, concat_ws, pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType
from math import ceil
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from geopy import distance

###### Input from user: blast zone radius, peril & portfolio

In [6]:
blast_radius = sc.broadcast(200)
peril = sc.broadcast("202")
portfolio = sc.broadcast(["3"])
max_items_for_clustering = sc.broadcast(10000)

###### If the geocoding is coarser than ZIP level, it doesn't make sense to include the location in the clustering process. Like RMS, we also ignore address matches greater than 5 (i.e. we consider only up to postal code granularity). We do not consider "number of buildings" as a measure of aggregated location.

In [8]:
encoding_type_list = sc.broadcast(["EXACT", "EXACT_STREET_ADDRESS", "ZIP"])
# encoding_type_list = ["ADMIN0", "ADMIN1", "ADMIN2", "CRESTA", "EXACT", "EXACT_STREET_ADDRESS", "PLACE_CITY", "SUB_CRESTA", "ZIP"]

###### Filtering dataframe for the user input peril & portfolio

In [10]:
df = df\
  .filter((col("pf_version_id").isin(portfolio.value)) & (col("peril_type_cd") == peril.value) & (col("encoding_quality_desc").isin(encoding_type_list.value))) \
  .select("pf_version_id", "admin_0_name", "ins_itm_id", "latitude", "longitude") \
  .dropDuplicates() \
  .withColumn("identifier", concat_ws("_", col("pf_version_id"), col("admin_0_name"))) \
  .withColumn("cluster", lit(None).cast(IntegerType())) \
  .orderBy(["identifier", "latitude", "longitude"]) \
  .repartitionByRange(100, "identifier")

display(df)
df.persist()

pf_version_id,admin_0_name,ins_itm_id,latitude,longitude,identifier,cluster
3,AE,907737,24.219509,55.782955,3_AE,
3,AE,907715,24.312007,54.620806,3_AE,
3,AE,855455,24.414582,54.490653,3_AE,
3,AE,902301,24.417363,54.441111,3_AE,
3,AE,859525,24.419955,54.441376,3_AE,
3,AE,861046,24.421239,54.456802,3_AE,
3,AE,859550,24.428011,54.641286,3_AE,
3,AE,859571,24.433241,54.645691,3_AE,
3,AE,962490,24.466666,54.366666,3_AE,
3,AE,962479,24.466666,54.366666,3_AE,


###### Pandas grouped map UDF

In [12]:
@pandas_udf("pf_version_id int, admin_0_name string, ins_itm_id int, latitude double, longitude double, identifier string, cluster int", PandasUDFType.GROUPED_MAP)
def create_cluster(data):
  def cluster_chunk(chunk):
    # A single location forms its own cluster (linkage needs at least two points)
    if chunk.shape[0] == 1:
      return [1]
    # Condensed distance matrix of geodesic distances (meters) between (latitude, longitude) pairs
    distanceMatrix = pdist(chunk[["latitude","longitude"]], lambda u, v: distance.geodesic(u, v).meters)
    # Complete linkage: the distance d(s,t) between two clusters is their largest pairwise distance
    linkMatrix = linkage(distanceMatrix, method='complete')
    # Cutting the dendrogram tree from leaf (i.e. distance 0) to blast diameter
    return fcluster(linkMatrix, blast_radius.value * 2, "distance").tolist()

  chunk_size = max_items_for_clustering.value
  clust = list()
  offset = 0
  # Groups larger than chunk_size are clustered one chunk at a time
  for start in range(0, data.shape[0], chunk_size):
    labels = cluster_chunk(data.iloc[start:start + chunk_size, :])
    # Offset the labels so that cluster ids from different chunks do not collide
    clust.extend(label + offset for label in labels)
    offset += max(labels)
  return data.assign(cluster = clust)

In [13]:
from time import time
st = time()
df_cluster = df.groupBy("identifier").apply(create_cluster)
display(df_cluster)
df_cluster.persist()
en = time()

In [14]:
print("time = {}".format(en-st))