<h2>Weather Conditions in World War Two</h2>

<h3>Content</h3><br>
The dataset contains information on <b>weather conditions</b> recorded each day at weather stations around the world. It includes precipitation, snowfall, temperatures, wind speed, and whether the day included thunderstorms or other poor weather conditions.

<h4>Import Required Libraries</h4>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
%matplotlib inline

<h4>Loading The Dataset</h4>

In [None]:
climate = pd.read_csv("/kaggle/input/world-war-2-weather-data/Weather.csv" , low_memory=False)

In [None]:
climate.head(2)

In [None]:
climate.shape
#119040 rows , 31 columns

In [None]:
climate.info()

In [None]:
missing = climate.isnull().sum()
missing_percentage = (missing /len(climate))*100
pd.DataFrame({"Missing Values" : missing , "Missing Percentage" : missing_percentage})

<h4>Cleaning The Dataset</h4>

In [None]:
# Columns not used in this analysis
to_drop = ['Precip', 'STA', 'Date', 'WindGustSpd', 'Snowfall',
           'PoorWeather', 'PRCP', 'DR', 'SPD', 'SNF', 'SND', 'FT', 'FB', 'FTI', 'ITH', 'PGT',
           'TSHDSBRSGF', 'SD3', 'RHX', 'RHN', 'RVG', 'WTE']
climate = climate.drop(columns=to_drop)

In [None]:
climate.head(5)

In [None]:
# Remove rows with a missing value in any of the MAX, MIN, or MEA columns
climate = climate.dropna(subset=['MAX', 'MIN', 'MEA'])

<h4>Data Visualization</h4>

In [None]:
px.histogram(data_frame = climate , x = 'MeanTemp' , nbins = 31)

Mean temperatures in the **22.5-27.4** range had the highest count across the years 1940-1945.
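
The fullest bin can also be read off numerically instead of from the plot. The following is a minimal sketch on made-up temperatures standing in for `climate['MeanTemp']`. `modal_bin` is a hypothetical helper, and `np.histogram` uses equal-width bins that may not match the edges Plotly chooses for `nbins = 31`.

```python
import numpy as np

def modal_bin(values, bins):
    """Return (left_edge, right_edge, count) of the fullest histogram bin."""
    counts, edges = np.histogram(values, bins=bins)
    i = int(np.argmax(counts))  # index of the bin with the most values
    return float(edges[i]), float(edges[i + 1]), int(counts[i])

# Made-up temperatures, for illustration only (not taken from the dataset)
sample = np.array([10.0, 21.0, 23.0, 24.0, 24.5, 25.0, 26.0, 30.0])
left, right, count = modal_bin(sample, bins=4)
print(left, right, count)
```

On the real data, `modal_bin(climate['MeanTemp'].dropna(), bins=31)` would be the corresponding call.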

In [None]:
linemax = climate.groupby(['YR'])['MaxTemp'].mean().reset_index(name = 'Average')
px.line(linemax , x = 'YR' , y = 'Average')

The year 1942 had the lowest average maximum temperature, **25.8**.

In [None]:
linemin = climate.groupby(['YR'])['MinTemp'].mean().reset_index(name = 'Average')
px.line(linemin , x = 'YR' , y = 'Average')

The year 1945 had the lowest average minimum temperature, **17.6**.
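
Readings like the one above can be confirmed without inspecting the chart, since `idxmin` on the grouped means returns the year directly. This is a small sketch on made-up rows; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up rows standing in for the 'YR' and 'MinTemp' columns
toy = pd.DataFrame({
    "YR": [40, 40, 41, 41, 42, 42],
    "MinTemp": [18.0, 20.0, 17.0, 18.0, 19.0, 21.0],
})

yearly = toy.groupby("YR")["MinTemp"].mean()  # one average per year
lowest_year = yearly.idxmin()                 # index label (year) of the smallest average
lowest_value = yearly.min()
print(lowest_year, lowest_value)
```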

In [None]:
linemonth = climate.groupby(['MO'])['MaxTemp'].mean().reset_index(name = 'Average')
px.line(linemonth , x = 'MO' , y = 'Average')

The months **June-August** had the highest average MaxTemp.

In [None]:
sns.heatmap(climate.corr() , annot = True )

<h4>Creating X and y</h4>

In [None]:
y = climate['MaxTemp']
X = climate.drop(columns=['MaxTemp'])
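
Because `X` keeps every column except `MaxTemp`, it may be worth checking whether any remaining feature is a near copy of the target. Such a column would make the errors reported later look better than a genuine prediction could be. The sketch below uses a made-up frame in which a hypothetical column `MaxF` is the target converted to Fahrenheit. Whether the dataset's `MAX` column plays that role is an assumption to verify against the correlation heatmap. `near_duplicates_of_target` is a hypothetical helper.

```python
import pandas as pd

def near_duplicates_of_target(df, target, threshold=0.99):
    """List columns whose absolute Pearson correlation with the target exceeds threshold."""
    corr = df.corr()[target].drop(target).abs()
    return sorted(corr[corr > threshold].index)

# Made-up values, for illustration only
toy = pd.DataFrame({
    "MaxTemp": [20.0, 25.0, 30.0, 22.0, 28.0],
    "MinTemp": [12.0, 14.0, 25.0, 18.0, 15.0],
})
toy["MaxF"] = toy["MaxTemp"] * 9 / 5 + 32  # hypothetical unit-converted copy of the target

suspects = near_duplicates_of_target(toy, "MaxTemp")
print(suspects)
```

Any column this flags on the real data could be dropped from `X` before modelling, if the goal is to predict `MaxTemp` from independent measurements.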

<h4>Creating the Train and Test Split Data</h4>

In [None]:
X_train , X_test , y_train , y_test = train_test_split(X , y , train_size = 0.8 , test_size = 0.2 , random_state = 101)

<h4>Choosing the Best Model</h4>

In [None]:
def get_mae(model, X_t=X_train, y_t=y_train, X_te=X_test, y_te=y_test):
    # Fit on the training split, then score on the held-out test split
    model.fit(X_t, y_t)
    preds = model.predict(X_te)
    return mean_absolute_error(y_te, preds)

In [None]:
model_1 = LinearRegression()
model_2 = DecisionTreeRegressor()
model_3 = RandomForestRegressor()
models = [model_1, model_2, model_3]

In [None]:
# Lower MAE is better; get_mae fits each model in place
for i, m in enumerate(models):
    print(f"MAE Score_{i} {get_mae(m)}")
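
MAE values are easier to judge next to a naive baseline. The sketch below is an addition, not part of the original analysis: it scores scikit-learn's `DummyRegressor`, which always predicts the training mean, on made-up data. A model is only useful if its MAE is clearly below this figure.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up data, for illustration only
X_tr = np.array([[1.0], [2.0], [3.0]])
y_tr = np.array([10.0, 20.0, 30.0])
X_te = np.array([[4.0], [5.0]])
y_te = np.array([18.0, 26.0])

# The baseline ignores the features and always predicts mean(y_tr) = 20.0
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
print(baseline_mae)
```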

<h4>Predicting the MaxTemp</h4>

In [None]:
# model_1 (LinearRegression) was already fitted on the training split inside get_mae
preds = model_1.predict(X_test)
plt.scatter(y_test , preds)

<h4>R<sup>2</sup> Score</h4>

In [None]:
from sklearn.metrics import r2_score
r2_score(y_test,preds)
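
`r2_score` reports the coefficient of determination, the share of the target's variance that the predictions explain, and not a classification-style accuracy. The worked sketch below, on made-up numbers, computes it by hand as 1 - SS_res / SS_tot and compares the result with scikit-learn's value.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```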