In [None]:
%matplotlib inline

In [None]:
import sys
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopy.distance

from mpl_toolkits.basemap import Basemap

<a href="http://www.norcalblogs.com/postscripts/2019/03/31/jack-lee-art-ww2-bomberlightning-storm/"><img src="https://i.ibb.co/st3QL54/head-img.png" align="center" alt="Jack Lee Art – WW2 Bomber/Lightning Storm"/></a>
###### [Jack Lee Art – WW2 Bomber/Lightning Storm](http://www.norcalblogs.com/postscripts/2019/03/31/jack-lee-art-ww2-bomberlightning-storm/)

# WW2 Aerial bombing operations vs. Weather
### Author: Georgi Stoyanov
### July 2019

## Abstract

This document describes my data-processing methodology, analysis and research approach.<br><br>
The observations show the relationship between air attacks during the Second World War and the meteorological conditions under which they were carried out.<br><br>
The research also shows that the weather data may not be complete enough for this purpose, and that future work may require fuller meteorological data to confirm the results.

## Introduction

In this analysis we walk through methods for reading, cleaning and filling in missing data when working with data sources,<br>
as well as ways to merge the data we need from different data sources.<br><br>
We will see whether there is any dependence, and of what kind, between air strikes during the Second World War and the meteorological conditions under which they were carried out.<br><br>
For this purpose, we will use the following two (three :)) data sources from [Kaggle](https://www.kaggle.com):<br>
[Weather Conditions in World War Two](https://www.kaggle.com/smid80/weatherww2) (2 data sources in .csv format)<br>
[Aerial Bombing Operations in World War II](https://www.kaggle.com/usaf/world-war-ii) (1 data source in .csv format)

## Read datasets

Let's first read the data, fix column names, set correct indexes and see if we've read it properly.

#### Weather station locations

In [None]:
weather_station_locations = pd.read_csv("../input/weatherww2/Weather Station Locations.csv", index_col = 0, header=0,
    names=["wban", "name", "state_country_id", "str_latitude", "str_longitude", "elev", "latitude", "longitude"])
weather_station_locations.shape

In [None]:
weather_station_locations.sample(5)

#### Summary of weather

In [None]:
summary_of_weather = pd.read_csv("../input/weatherww2/Summary of Weather.csv", index_col = 0, header=0,
    names=["sta", "date", "precip", "wind_gust_spd", "max_temp", "min_temp", "mean_temp", "snowfall",
           "poor_weather", "yr", "mo", "da", "prcp", "dr", "spd", "max", "min", "mea", "snf", "snd",
           "ft", "fb", "fti", "ith", "pgt", "tshdsbrsgf", "sd3", "rhx", "rhn", "rvg", "wte"], low_memory=False)
summary_of_weather.shape

In [None]:
summary_of_weather.sample(5)

#### Aerial bombing operations

In [None]:
aerial_bombing_operations = pd.read_csv("../input/world-war-ii/operations.csv", index_col = 0, header=0,
    names=["mission_id", "mission_date", "theater_of_operations", "country", "air_force", "unit_id",
           "aircraft_series", "callsign", "mission_type", "takeoff_base", "takeoff_location", "takeoff_latitude",
           "takeoff_longitude", "target_id", "target_country", "target_city", "target_type", "target_industry",
           "target_priority", "target_latitude", "target_longitude", "altitude_hundreds_of_feet", "airborne_aircraft",
           "attacking_aircraft", "bombing_aircraft", "aircraft_returned", "aircraft_failed", "aircraft_damaged",
           "aircraft_lost", "high_explosives", "high_explosives_type", "high_explosives_weight_pounds",
           "high_explosives_weight_tons", "incendiary_devices", "incendiary_devices_type",
           "incendiary_devices_weight_pounds", "incendiary_devices_weight_tons", "fragmentation_devices",
           "fragmentation_devices_type", "fragmentation_devices_weight_pounds", "fragmentation_devices_weight_tons",
           "total_weight_pounds", "total_weight_tons", "time_over_target", "bomb_damage_assessment", "source_id"], low_memory=False)
aerial_bombing_operations.shape

In [None]:
aerial_bombing_operations.sample(5)

## Inspect, fill missing data and remove unnecessary columns

The next step is to inspect, clean up and complete the data.

#### Weather station locations

In [None]:
weather_station_locations.info()

We see that the data is loaded correctly. What remains is to remove unnecessary columns: for our research we need only the coordinates of the weather stations.

In [None]:
weather_station_locations = weather_station_locations[["latitude", "longitude"]]

assert weather_station_locations.shape == (161, 2)

Now we have to check whether there are incorrect coordinates. Valid latitudes lie between -90 and 90 and valid longitudes between -180 and 180, so we keep only the rows within these ranges.
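
As a minimal sketch of the same range check, the four comparisons can also be written with `Series.between`, which is inclusive on both ends by default. The `stations` frame and its station ids below are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical station table; the last two rows carry impossible coordinates.
stations = pd.DataFrame(
    {"latitude": [40.2, -9.4, 95.0, 10.0],
     "longitude": [18.1, 147.2, 20.0, -200.0]},
    index=[34111, 50801, 99998, 99999],
)

# Keep only rows whose latitude and longitude both fall inside the valid ranges.
valid = stations.latitude.between(-90, 90) & stations.longitude.between(-180, 180)
clean_stations = stations[valid]

print(clean_stations.index.tolist())
```

Rows with missing coordinates are also dropped by this mask, because `between` evaluates to False for NaN.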

In [None]:
weather_station_locations = weather_station_locations[((weather_station_locations.latitude <= 90) &
                                (weather_station_locations.latitude >= -90)) &
                                ((weather_station_locations.longitude <= 180) &
                                (weather_station_locations.longitude >= -180))]

assert weather_station_locations.shape == (161, 2)

#### Summary of weather

In [None]:
summary_of_weather.info()

Let's first remove unnecessary columns. As we can see, many of them contain no information or too little of it. Others, according to the dataset description, repeat the same information in different units of measurement. There are also some that will not be needed in the research.

In [None]:
summary_of_weather = summary_of_weather[["date", "precip", "snowfall", "poor_weather"]]
summary_of_weather.info()

The next step is to fix the date column, which currently has the object type. In the data samples shown earlier, we can see that the date format is "year-month-day".

In [None]:
summary_of_weather.date =  pd.to_datetime(summary_of_weather.date, format = "%Y-%m-%d")
summary_of_weather.info()

The next problem is the data in the precip column: the rainfall information must be of float type. We need to find the values that do not fit that type.

In [None]:
summary_of_weather.precip[pd.to_numeric(summary_of_weather.precip, errors='coerce').isnull()].unique()

We found the problem, and fortunately it is a single value. We can now decide how to fix it. The first options that come to mind are to remove the records with that value or to fill them with the column average. To someone outside the meteorological community, as I am, the value looks like a mistake, whether human or system. But if we look up what T means in rainfall measurements, it turns out that [T (Trace) has a definition](https://www.thoughtco.com/what-is-trace-of-precipitation-3444238). Let's remember to always do the research before we take action!

_"Trace amounts of precipitation are abbreviated by the capital letter "T", often placed in parenthesis (T).<br>
If you must convert a trace to a numerical amount, it would equal 0.00."_
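
The detect-then-convert pattern can be tried in isolation first. This is a small sketch on a made-up `precip` series, not on the real column: `pd.to_numeric(..., errors="coerce")` turns anything non-numeric into NaN, which exposes the offending values, and the trace marker is then mapped to zero as the quoted definition suggests.

```python
import pandas as pd

# Hypothetical readings as they might arrive from a CSV: strings, with "T" for trace.
precip = pd.Series(["0", "1.016", "T", "25.4", "T"])

# Values that cannot be parsed as numbers become NaN under errors="coerce".
numeric = pd.to_numeric(precip, errors="coerce")
bad_values = precip[numeric.isnull()].unique().tolist()

# Treat a trace amount as 0.0 and convert the whole series to float.
precip_clean = precip.replace("T", "0").astype(float)

print(bad_values)
print(precip_clean.tolist())
```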

In [None]:
summary_of_weather.loc[(summary_of_weather.precip == "T"), "precip"] = 0.00
summary_of_weather.precip = summary_of_weather.precip.astype(float)
summary_of_weather.info()

Let's check the problem with the snowfall column, which should also be a float type.

In [None]:
summary_of_weather.snowfall.unique()

We see two values that do not fit the type: `#VALUE!` and missing values. They may come from a human or system error, or from the way the station records its data. In any case, we can assume that if there had been snowfall, it would have been recorded correctly, so we replace the problem values with 0.

In [None]:
summary_of_weather.loc[(summary_of_weather.snowfall == "#VALUE!"), "snowfall"] = 0.00
summary_of_weather.snowfall = summary_of_weather.snowfall.fillna(0.00)
summary_of_weather.snowfall = summary_of_weather.snowfall.astype(float)
summary_of_weather.info()

The next column is poor_weather, which should be of type int.

In [None]:
summary_of_weather.poor_weather.unique()

In [None]:
summary_of_weather.groupby("poor_weather")["poor_weather"].count()

Here the problem is serious: the data is very bad! From the column descriptions in the dataset, we assume that the column is meant to hold information about more than one event (each position acting as a boolean flag). Unfortunately, the data is inconsistent and incomplete. Our approach is to count the events that occurred (the number of 1s in the string) and to store the resulting count (here from 0 to 4) in the column.
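
To make the counting rule concrete, here is a small sketch on invented values that mimic the kinds of entries such a column can contain (flag strings, a bare number, a missing value). The `raw` series is hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw flags: strings of 0/1 markers, a bare number and a missing value.
raw = pd.Series(["1", "1  1", "101000", 0, np.nan, "1 1 1 1"])

# Convert every entry to text and count how many "1" characters it contains.
event_count = raw.apply(lambda value: str(value).count("1")).astype("int64")

print(event_count.tolist())
```

Missing values become the text "nan", which contains no "1", so they end up as 0 events. Note that the rule counts characters, so a numeric entry such as 11.0 would be counted as two events.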

In [None]:
summary_of_weather.poor_weather = summary_of_weather.poor_weather.apply(lambda x: str(x).count("1"))
summary_of_weather.poor_weather = summary_of_weather.poor_weather.astype("int64")
summary_of_weather.poor_weather.unique()

In [None]:
summary_of_weather.precip[summary_of_weather.precip > 0].describe(percentiles = [.5, .7, .9, .98]),\
summary_of_weather.snowfall[summary_of_weather.snowfall > 0].describe(percentiles = [.5, .7, .9, .98])

In [None]:
summary_of_weather.groupby("poor_weather")["poor_weather"].count()

For the purpose of our study this column, which unfortunately is the most confusing one, is the most important! Because of this, we must fill in missing values and adjust existing ones as objectively as possible. Since the allowed values run from 0 to 4, we can derive a poor_weather level from the precip and snowfall columns and raise the existing value wherever the derived level is higher. We apply this step by step, with thresholds that follow the distribution percentiles computed above.
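
The same "raise, never lower" rule can be sketched more compactly with `pd.cut` and `np.maximum`. This is an illustrative alternative on a made-up `weather` frame, using the precipitation thresholds from the cells below (4, 9, 27, 63); it is not the code the notebook actually runs.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: precipitation plus an already counted poor_weather value.
weather = pd.DataFrame({
    "precip": [0.0, 5.0, 12.0, 40.0, 70.0, 70.0],
    "poor_weather": [0, 0, 3, 1, 0, 4],
})

# Right-closed bins: (-inf, 4] -> 0, (4, 9] -> 1, (9, 27] -> 2, (27, 63] -> 3, (63, inf) -> 4.
edges = [-np.inf, 4, 9, 27, 63, np.inf]
precip_level = pd.cut(weather.precip, bins=edges, labels=False).astype(int)

# Keep whichever is higher: the existing value or the level derived from precipitation.
weather["poor_weather"] = np.maximum(weather.poor_weather, precip_level)

print(weather.poor_weather.tolist())
```

The sketch assumes precip has no missing values; with NaN present, `pd.cut` would return NaN for those rows and they would need a `fillna(0)` before the integer conversion.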

In [None]:
mask = ((summary_of_weather.precip > 4) & (summary_of_weather.precip <= 9) & (summary_of_weather.poor_weather < 1))
summary_of_weather.loc[mask, "poor_weather"] = 1
mask = ((summary_of_weather.precip > 9) & (summary_of_weather.precip <= 27) & (summary_of_weather.poor_weather < 2))
summary_of_weather.loc[mask, "poor_weather"] = 2
mask = ((summary_of_weather.precip > 27) & (summary_of_weather.precip <= 63) & (summary_of_weather.poor_weather < 3))
summary_of_weather.loc[mask, "poor_weather"] = 3
mask = ((summary_of_weather.precip > 63) & (summary_of_weather.poor_weather < 4))
summary_of_weather.loc[mask, "poor_weather"] = 4

In [None]:
summary_of_weather.groupby("poor_weather")["poor_weather"].count()

In [None]:
mask = ((summary_of_weather.snowfall > 8) & (summary_of_weather.snowfall <= 14) & (summary_of_weather.poor_weather < 1))
summary_of_weather.loc[mask, "poor_weather"] = 1
mask = ((summary_of_weather.snowfall > 14) & (summary_of_weather.snowfall <= 30) & (summary_of_weather.poor_weather < 2))
summary_of_weather.loc[mask, "poor_weather"] = 2
mask = ((summary_of_weather.snowfall > 30) & (summary_of_weather.snowfall <= 61) & (summary_of_weather.poor_weather < 3))
summary_of_weather.loc[mask, "poor_weather"] = 3
mask = ((summary_of_weather.snowfall > 61) & (summary_of_weather.poor_weather < 4))
summary_of_weather.loc[mask, "poor_weather"] = 4

In [None]:
summary_of_weather.groupby("poor_weather")["poor_weather"].count()

#### Aerial bombing operations

In [None]:
aerial_bombing_operations.info()

First we remove the columns that we do not need or that carry insignificant information.

In [None]:
aerial_bombing_operations = aerial_bombing_operations[["mission_date", "target_latitude",
    "target_longitude", "aircraft_returned", "aircraft_failed", "aircraft_damaged", "aircraft_lost"]]
aerial_bombing_operations.shape

We now have to remove the records whose missing data cannot be recovered. In this case, that is the location data.

In [None]:
aerial_bombing_operations = aerial_bombing_operations.dropna(subset=["target_latitude", "target_longitude"])
aerial_bombing_operations.info()

Let's validate the available location data as we did earlier.

In [None]:
aerial_bombing_operations = aerial_bombing_operations[((aerial_bombing_operations.target_latitude <= 90) &\
                                   (aerial_bombing_operations.target_latitude >= -90)) & \
                                         ((aerial_bombing_operations.target_longitude <= 180) &\
                                   (aerial_bombing_operations.target_longitude >= -180))]
aerial_bombing_operations.shape

The next step is to do the standard date column conversion operation.

In [None]:
aerial_bombing_operations.mission_date =  pd.to_datetime(aerial_bombing_operations.mission_date, format = "%m/%d/%Y")
aerial_bombing_operations.info()

It seems that the data is now cleaned, filled and ready for the next preparation step.

## Analyze data, generate needed columns and join dataframes

#### Analyze Summary of weather

By analyzing the distribution of poor_weather values, we can see whether the data is close to what we would expect from real-world weather observations.

In [None]:
plt.hist(summary_of_weather.poor_weather, bins = 5)
plt.title("Weather conditions distribution")
plt.xlabel("Poor weather value")
plt.ylabel("Count")
plt.show()

We see, at least in my view, a plausible distribution of poor weather values.<br>
A little beyond the scope of the research, but of interest to me, is the distribution of attacks by year.

In [None]:
plt.hist(aerial_bombing_operations.mission_date.dt.year, bins = 7)
plt.title("Operations by year distribution")
plt.xlabel("Year")
plt.ylabel("Count")
plt.show()

For the next step, visualizing the geographic locations of attacks and weather stations, we need a function. This saves us from repeating code and improves readability.

In [None]:
def drawgeolocations(longitude, latitude, loc_size, loc_color, label):
    figure = plt.figure(figsize = (15, 12))
    current_axis = figure.add_subplot(111)
    m = Basemap(projection = "merc", llcrnrlat = -80, urcrnrlat = 80,
    llcrnrlon = -180, urcrnrlon = 180)
    
    locations_lon, locations_lat = m(longitude.tolist(), latitude.tolist())
    
    m.drawcoastlines(ax = current_axis)
    m.fillcontinents(color = "coral", lake_color = "aqua")
    m.drawparallels(np.arange(-90, 91, 30))
    m.drawmeridians(np.arange(-180, 181, 60))
    m.drawmapboundary(fill_color = "aqua")
    m.scatter(locations_lon, locations_lat, color = loc_color, s = loc_size, zorder = 2)
    plt.xlabel(label)
    plt.show()

We are now ready to visualize geographic locations.

In [None]:
drawgeolocations(weather_station_locations.longitude, weather_station_locations.latitude, 
                 30, "green", "Location of weather stations")

In [None]:
drawgeolocations(aerial_bombing_operations.target_longitude, aerial_bombing_operations.target_latitude, 
                 30, "red", "Locations of air attacks")

From the visualization of attacks and stations, we can see that some attack areas have no nearby weather station. This means we may end up working with weather data that does not match the location of the attacks. Before we tackle this problem, let's link the weather observations to specific geographic locations through the stations that recorded them.
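
The link between observations and stations relies on an index join: both frames are indexed by the station id, so `DataFrame.join` can attach the coordinates to every observation of that station. A toy sketch with invented ids and values:

```python
import pandas as pd

# Hypothetical observations, indexed by station id (several rows per station).
weather = pd.DataFrame(
    {"date": ["1942-07-01", "1942-07-02", "1942-07-01"], "poor_weather": [0, 2, 1]},
    index=pd.Index([10001, 10001, 10002], name="sta"),
)

# Hypothetical station coordinates, indexed by the same ids (one row per station).
stations = pd.DataFrame(
    {"latitude": [40.2, -9.4], "longitude": [18.1, 147.2]},
    index=pd.Index([10001, 10002], name="wban"),
)

# join() matches on the index and is a left join by default:
# every observation is kept and receives its station's coordinates.
weather_with_loc = weather.join(stations)

print(weather_with_loc)
```

Because the join is a left join, an observation whose station id is absent from the station table would be kept with NaN coordinates rather than dropped.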

In [None]:
summary_of_weather_with_loc = summary_of_weather.join(weather_station_locations)
summary_of_weather_with_loc.info()

The next step is to find the nearest station for each attack. This gives us the closest available weather record for each attack.<br>
To do this, we need a function that measures the distance from the attack point to each station and returns the id of the nearest station and the distance to it.

In [None]:
def get_closest_weatherstation(lat, lon, weather_stations):
    # Linear scan over all stations, keeping the smallest distance seen so far.
    min_distance = sys.maxsize
    min_station_index = -1
    # Iterate over the stations passed in as an argument, not over the global dataframe.
    for index, station in weather_stations.iterrows():
        distance = geopy.distance.distance((lat, lon), (station["latitude"], station["longitude"]))
        if min_distance > distance:
            min_distance = distance
            min_station_index = index
            
    return min_station_index, min_distance
    
assert get_closest_weatherstation(40.23333333, 18.08333333, weather_station_locations) == (34111, 0)

Applying the function to every operation is very slow: it takes more than 2 hours! For this reason, the result is stored in a separate dataset, operation_to_sta_mapping.csv.
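
Most of that time goes into the Python-level loop over every attack and station pair. One possible speed-up, sketched here as an assumption and not used to produce the stored mapping, is to compute all distances at once with a NumPy haversine formula. The function name `closest_station_haversine` and the sample coordinates are hypothetical, and the haversine formula treats the Earth as a sphere, so its distances can differ slightly from the geodesic ones geopy returns.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def closest_station_haversine(target_lat, target_lon, station_lat, station_lon):
    """Return (position of the nearest station, distance in km) for every target."""
    # Targets become a column, stations a row, so broadcasting yields a full distance matrix.
    lat1 = np.radians(np.asarray(target_lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(target_lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(station_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(station_lon, dtype=float))[None, :]

    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    distances = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

    nearest = distances.argmin(axis=1)
    return nearest, distances[np.arange(len(nearest)), nearest]

# Hypothetical stations and targets, for illustration only.
station_lat = np.array([40.23333333, -9.43, 51.5])
station_lon = np.array([18.08333333, 147.22, 0.0])
nearest, km = closest_station_haversine([40.23333333, 52.0], [18.08333333, 1.0],
                                        station_lat, station_lon)
print(nearest, km)
```

The function returns positions within the station arrays, so with the real data they would still have to be mapped back to station ids, for example through the dataframe index. For a very large number of attacks the distance matrix may not fit in memory, in which case the targets can be processed in chunks.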

In [None]:
# Very slow operation - over 2 hours. Result is saved to ...data/operation_to_sta_mapping.csv

#lambdafunc = lambda x: get_closest_weatherstation(x["target_latitude"], x["target_longitude"], weather_station_locations)
#operation_to_sta_mapping = aerial_bombing_operations.apply(lambdafunc, axis=1)
#csv_operation_to_sta_mapping_df = pd.DataFrame(operation_to_sta_mapping, columns = ["tuple"])["tuple"].apply(pd.Series)
#csv_operation_to_sta_mapping_df.columns = ["sta", "sta_distance"]
#csv_operation_to_sta_mapping_df.sta_distance = csv_operation_to_sta_mapping_df.sta_distance.astype(str).str[:-3].astype(float)
#csv_operation_to_sta_mapping_df.to_csv("data/operation_to_sta_mapping.csv")

operation_to_sta_mapping = pd.read_csv("../input/ww2-aerial-operation-to-closest-ww2-weather-statio/operation_to_sta_mapping.csv", index_col = 0, header=0)
operation_to_sta_mapping.head(5)

The final step of preparing the research data follows. We must link each attack with the weather condition observed by the nearest station and the distance to that station.
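
Conceptually, this link is a merge on the pair (station id, date). A toy sketch with invented rows shows the idea; the notebook itself performs the merge through indexes, so the `on=` form below is just an equivalent illustration.

```python
import pandas as pd

# Hypothetical operations, each already mapped to its nearest station.
operations = pd.DataFrame({
    "mission_id": [1, 2, 3],
    "sta": [10001, 10001, 10002],
    "date": pd.to_datetime(["1943-08-15", "1943-08-16", "1943-08-15"]),
})

# Hypothetical daily weather records per station.
weather = pd.DataFrame({
    "sta": [10001, 10002],
    "date": pd.to_datetime(["1943-08-15", "1943-08-15"]),
    "poor_weather": [0, 3],
})

# Inner merge: only operations with a weather record for that station and day survive.
matched = operations.merge(weather, on=["sta", "date"], how="inner")

print(matched)
```

Mission 2 drops out here because its station has no record for that day. The same happens in the real data, since `merge` defaults to an inner join, so the merged frame can be smaller than the list of operations.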

In [None]:
operations_with_weather_sta = aerial_bombing_operations.join(operation_to_sta_mapping)
operations_with_weather_sta.head(5)

In [None]:
operations_with_weather_sta = operations_with_weather_sta.reset_index().set_index(["sta", "mission_date"])
operations_with_weather_sta.index.names = ["sta", "date"]
operations_with_weather_sta.head(5)

In [None]:
summary_of_weather = summary_of_weather.reset_index().set_index(["sta", "date"])
summary_of_weather.head(5)

In [None]:
operations_with_weather = operations_with_weather_sta.merge(summary_of_weather, left_index=True, right_on=["sta", "date"])
operations_with_weather = operations_with_weather.reset_index().set_index("mission_id")
operations_with_weather.head(5)

## Research

### What is the distribution of attacks vs. weather condition? Is there a direct dependence?

For the purpose of the research, we need a function that plots the distribution of attacks by weather condition at different levels of accuracy, where accuracy is the distance from the attack to the nearest station that recorded the weather.

In [None]:
def plot_operations_weather_distr_by_sta_distance(operations, distance):
    operations_by_sta_distance = operations[operations.sta_distance < distance]
    operations_by_weather = operations_by_sta_distance.groupby("poor_weather")["poor_weather"].count()
    operations_by_weather.plot.bar()
    plt.title("Operations by weather condition with distance \n to closest weather station < " + str(distance))
    plt.xlabel("Poor weather value")
    plt.ylabel("Operations count")
    plt.show()

Let's use the function at different distances and see what the results are.

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 2)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 10)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 50)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 100)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 500)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 1000)

In [None]:
plot_operations_weather_distr_by_sta_distance(operations_with_weather, 10000)

We can clearly see that weather conditions have a direct influence on the attacks: they are mainly conducted in good weather.<br>
Let's visualize the attacks for each value of the weather condition (poor_weather).

### Locations of attacks by weather condition

In [None]:
worst_weather_operations = operations_with_weather[operations_with_weather.poor_weather == 4]
drawgeolocations(worst_weather_operations.target_longitude, worst_weather_operations.target_latitude, 30, "blue", 
                 "Locations of air attacks poor_weather = 4")

An interesting result! Attacks are concentrated in two areas: India (Himalayas) and Papua New Guinea. These regions are among those with the worst and most volatile weather. When organizing attacks there, there is no choice: the weather is very bad almost all year round!<br>Let's see the locations for the next level of poor weather.

In [None]:
worst_weather_operations = operations_with_weather[operations_with_weather.poor_weather == 3]
drawgeolocations(worst_weather_operations.target_longitude, worst_weather_operations.target_latitude, 30, "blue",
                 "Locations of air attacks poor_weather = 3")

Almost the same results! The difference is a new region of attacks (Germany), where one would normally be able to choose the meteorological conditions for an attack. But this region was of enormous significance in the context of the war, so attacks were apparently carried out despite the bad conditions!<br><br>
The next visualizations no longer show such prominent concentrations. This indicates that attacks take place in good weather, except against targets that are very significant and need to be struck regardless of the conditions, or targets in regions with bad weather all year round, where there is no option other than a bad-weather attack!

In [None]:
worst_weather_operations = operations_with_weather[operations_with_weather.poor_weather == 2]
drawgeolocations(worst_weather_operations.target_longitude, worst_weather_operations.target_latitude, 30, "blue",
                 "Locations of air attacks poor_weather = 2")

In [None]:
worst_weather_operations = operations_with_weather[operations_with_weather.poor_weather == 1]
drawgeolocations(worst_weather_operations.target_longitude, worst_weather_operations.target_latitude, 30, "blue",
                 "Locations of air attacks poor_weather = 1")

In [None]:
worst_weather_operations = operations_with_weather[operations_with_weather.poor_weather == 0]
drawgeolocations(worst_weather_operations.target_longitude, worst_weather_operations.target_latitude, 30, "blue",
                 "Locations of air attacks poor_weather = 0")

## Conclusion

From this study we found that air attacks during World War II, and not only then, were planned and conducted as close as possible to ideal weather conditions.
Exceptions are targets in regions with very poor year-round climatic conditions, or regions of such significance in the context of the war that they justified the risk of bad conditions.
Last but not least, we have seen how complicated the process of reading, cleaning, completing and transforming the necessary data is!

## Future work

The results would be strengthened by a more complete and consistent dataset of weather conditions. The dataset used here is too incomplete and inconsistent!
Also, a more complete dataset of attacks, with more information about lost aircraft, damaged aircraft, attack success rates, etc., would give us room for more research!

## References
[Kaggle](https://www.kaggle.com)<br>
    [Weather Conditions in World War Two](https://www.kaggle.com/smid80/weatherww2)<br>
    [Aerial Bombing Operations in World War II](https://www.kaggle.com/usaf/world-war-ii)<br>
    [What Is a "Trace" of Precipitation?](https://www.thoughtco.com/what-is-trace-of-precipitation-3444238)<br>
    [Probabilistic Parameterizations of Visibility Using Observations of Rain Precipitation Rate, Relative Humidity, and Visibility](https://journals.ametsoc.org/doi/pdf/10.1175/2009JAMC1927.1)<br>
    [Jack Lee Art – WW2 Bomber/Lightning Storm](http://www.norcalblogs.com/postscripts/2019/03/31/jack-lee-art-ww2-bomberlightning-storm/)<br>
    [The Hump](https://en.wikipedia.org/wiki/The_Hump)<br>
    [Climate - Papua New Guinea](https://www.climatestotravel.com/climate/papua-new-guinea)<br>
    [Mm per year: Countries Compared](https://www.nationmaster.com/country-info/stats/Geography/Average-rainfall-in-depth/Mm-per-year)<br>