# Data Exploration

## Background

When a disaster strikes, quick and accurate situational information is critical to an effective response. Before responders can act in the affected area, they need to know the location, cause and severity of damage. But disasters can strike anywhere, disrupting local communication and transportation infrastructure, making the process of assessing specific local damage difficult, dangerous, and slow. 

**Problem**: Disaster recovery efforts are currently slow because analysts have to manually filter through the imagery to assess the damage and forward the information to recovery teams on the ground.

**Objective**: Our project aims to automate the process of assessing building damage after a natural disaster.

A high-resolution, labeled dataset of raw satellite imagery showing ground conditions in areas hit by natural disasters is available.

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Pre and post-disaster (Tubbs Fire) imagery of a residential subdivision in Santa Rosa, Calif. [Source: xview2.org]</center>

-------

## About the Dataset

The dataset consists primarily of paired images of a region taken before and after a natural disaster. The xBD dataset currently covers 15 countries and 6 types of disasters. Each image comes with manually annotated polygons outlining buildings and other structures, along with a damage level for each on a scale of 0 to 3. The labels also contain image metadata such as disaster type, image resolution, date, and sensor.

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Damage scales. [Source: xview2.org]</center>
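To make the label structure described above concrete, here is a minimal sketch of tallying building polygons by damage level. The field names (`features`, `xy`, `properties`, `subtype`, `wkt`), the class names in `DAMAGE_SCALE`, and the sample record itself are assumptions made for illustration; check them against the actual label files before relying on them.

```python
from collections import Counter

# Hypothetical label record. The layout and field names are assumptions,
# not a verified description of the xBD label files.
SAMPLE_LABEL = {
    "metadata": {"disaster_type": "fire", "sensor": "example-sensor"},
    "features": {
        "xy": [
            {"wkt": "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))",
             "properties": {"feature_type": "building", "subtype": "no-damage"}},
            {"wkt": "POLYGON ((20 20, 30 20, 30 30, 20 30, 20 20))",
             "properties": {"feature_type": "building", "subtype": "destroyed"}},
            {"wkt": "POLYGON ((40 40, 50 40, 50 50, 40 50, 40 40))",
             "properties": {"feature_type": "building", "subtype": "destroyed"}},
            {"wkt": "POLYGON ((60 60, 70 60, 70 70, 60 70, 60 60))",
             "properties": {"feature_type": "building", "subtype": "un-classified"}},
        ]
    },
}

# Assumed mapping from class name to the 0-3 damage scale.
DAMAGE_SCALE = {"no-damage": 0, "minor-damage": 1, "major-damage": 2, "destroyed": 3}


def damage_counts(label):
    """Count building polygons per damage level.

    Polygons whose class name is not in DAMAGE_SCALE are counted under None.
    """
    counts = Counter()
    for feature in label["features"]["xy"]:
        props = feature["properties"]
        if props.get("feature_type") != "building":
            continue  # ignore non-building annotations
        counts[DAMAGE_SCALE.get(props.get("subtype"))] += 1
    return counts


counts = damage_counts(SAMPLE_LABEL)
print(dict(counts))
```

With real data, the record would typically be read from a label file with `json.load` instead of being written inline.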

Here are some insights:
- The data has been sourced from the Maxar/DigitalGlobe Open Data Program, which releases imagery for major crisis events.
- The xBD dataset contains 22,068 images pertaining to 19 natural disasters.
- There are 850,736 annotated polygons of buildings.
- The imagery covers around 45,361.73 km² of area.
- The targeted Ground Sample Distance (GSD) is below 0.8 m. However, images of the same geographic region still differ from one another due to several factors.
- Environmental factors such as flood water have been annotated.
- Post-disaster images have been altered slightly to account for re-projection issues since the paired images were taken at different times.
- Some post-disaster images contain no polygons, either because the buildings were built after the disaster or because of other factors such as haze and cloud obstruction.
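The headline figures in the list above can be combined into a few dataset-wide averages. This is a back-of-the-envelope sketch only: it treats every image as covering distinct ground, which may not hold for pre/post pairs, so the per-image numbers should be read as rough orientation rather than exact statistics.

```python
# Figures quoted in the list above.
NUM_IMAGES = 22_068
NUM_POLYGONS = 850_736
TOTAL_AREA_KM2 = 45_361.73

# Simple dataset-wide averages (no per-disaster breakdown).
area_per_image_km2 = TOTAL_AREA_KM2 / NUM_IMAGES
polygons_per_image = NUM_POLYGONS / NUM_IMAGES
polygons_per_km2 = NUM_POLYGONS / TOTAL_AREA_KM2

print(f"Average area per image:     {area_per_image_km2:.2f} km²")
print(f"Average polygons per image: {polygons_per_image:.1f}")
print(f"Average polygons per km²:   {polygons_per_km2:.1f}")
```

Under that simplification, the averages come out at about 2 km² and roughly 39 polygons per image, or about 19 polygons per km².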

<div>
<img src="attachment:image.png" width="300" height="100"/>
</div>
<center>Annotated image by damage scale. [Source: xview2.org]</center>

-----

## Exploratory Data Analysis

The xBD paper includes a section detailing statistics about the imagery in the dataset. Some key points:

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Area of imagery (in km²) per disaster event. [Source: xBD Paper]</center>

The imagery is highly unbalanced across disaster events. While the Portugal Wildfire and Pinery Bushfire each cover around 8,000 km², the Mexico Earthquake and Palu Tsunami each cover less than 1,000 km². The Mexico Earthquake and Palu Tsunami, however, make up for this in the number of polygon annotations: each contains around 100,000 labeled polygons.

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Polygons in xBD per disaster event. [Source: xBD Paper]</center>

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Positive and negative imagery per disaster. [Source: xBD Paper]</center>

We should note that the pre- and post-disaster imagery is also unbalanced. As the diagram above shows, most of the dataset consists of positive imagery (post-disaster imagery). Only a few disaster events have a balanced set of positive and negative imagery, e.g., SoCal Fire, Portugal Wildfire, and Woolsey Fire.
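A balance check like the one described above can be reproduced on a local copy of the data by tallying images per event. The sketch below assumes file names of the form `<event>_<id>_<pre|post>_disaster.<ext>`; both that pattern and the example file names are hypothetical and should be adapted to the actual directory layout.

```python
import re
from collections import defaultdict

# Assumed file-name pattern: <event>_<numeric id>_<pre|post>_disaster.<ext>
NAME_RE = re.compile(r"^(?P<event>.+)_(?P<id>\d+)_(?P<phase>pre|post)_disaster\.\w+$")


def phase_counts(filenames):
    """Return {event: {"pre": n, "post": m}} for names matching NAME_RE."""
    counts = defaultdict(lambda: {"pre": 0, "post": 0})
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        counts[match["event"]][match["phase"]] += 1
    return dict(counts)


# Hypothetical file names, for illustration only.
example_files = [
    "example-fire_00000001_pre_disaster.png",
    "example-fire_00000001_post_disaster.png",
    "example-fire_00000002_post_disaster.png",
    "example-flood_00000001_pre_disaster.png",
    "notes.txt",
]

for event, phases in sorted(phase_counts(example_files).items()):
    print(event, phases)
```

In practice, `filenames` could come from `os.listdir` on the image directory.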

<div>
<img src="attachment:image.png" width="500"/>
</div>
<center>Damage classification count. [Source: xBD Paper]</center>

The final diagram shows the damage classification count. The distribution is heavily skewed towards the No Damage class, which has 313,033 polygons, about eight times as many as the other classes. A handful of annotations are also marked as unclassified.
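One way to quantify this skew is to compute each class's share of the polygons and a set of inverse-frequency weights. The sketch below is illustrative only: the No Damage count comes from the text, but the counts for the other three classes are placeholders derived from the "eight times" statement, not figures from the paper, and the class names are assumed.

```python
NO_DAMAGE_COUNT = 313_033  # from the text above

# Placeholder counts: each remaining class is set to one eighth of the
# No Damage count. These are NOT real dataset figures.
PLACEHOLDER_COUNT = NO_DAMAGE_COUNT // 8

class_counts = {
    "no-damage": NO_DAMAGE_COUNT,
    "minor-damage": PLACEHOLDER_COUNT,
    "major-damage": PLACEHOLDER_COUNT,
    "destroyed": PLACEHOLDER_COUNT,
}


def class_shares(counts):
    """Fraction of all polygons that falls into each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weight each class by total / (number of classes * class count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


shares = class_shares(class_counts)
weights = inverse_frequency_weights(class_counts)
for name in class_counts:
    print(f"{name:13s} share={shares[name]:.3f} weight={weights[name]:.3f}")
```

Rare classes receive larger weights than the dominant class, which is one common starting point for compensating for imbalance. Replace the placeholder counts with the real per-class totals before drawing any conclusions.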