In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import missingno as msno
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

pd.set_option('display.max_columns',None)

# <font color = 'sky-blue'><span style='font-family:Georgia'> <b>Will it rain?</b> </span></font>

<img src="https://www.abc.net.au/reslib/201106/r788210_6839335.jpg" width=900 height=300 />

### Importing the dataset

In [None]:
aus = pd.read_csv('/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv')

aus.head()

In [None]:
print(f'Dataset has {aus.shape[0]} rows and {aus.shape[1]} columns.')

## About the Dataset
 - **Date** : Date of the observation
 - **Location** : Places in Australia
 - **MinTemp** *(in degree Celsius)* : Minimum temperature in the 24 hours to 9am. Sometimes only known to the nearest whole degree.
 - **MaxTemp** *(in degree Celsius)* : Maximum temperature in the 24 hours from 9am. Sometimes only known to the nearest whole degree.
 - **Rainfall** *(in millimetres)* : Precipitation (rainfall) in the 24 hours to 9am.
 - **Evaporation** *(in millimetres)* : Represents evaporation in the 24 hours to 9am.
 - **Sunshine** *(hours)* : Bright sunshine in the 24 hours to midnight.
 - **WindGustDir** *(16 compass points)* : Direction of strongest gust in the 24 hours to midnight.
 - **WindGustSpeed** *(in kilometres per hour)* : Speed of strongest wind gust in the 24 hours to midnight.
 - **WindDir9am** *(in compass points)* : Wind direction averaged over 10 minutes prior to 9 am.
 - **WindDir3pm** *(in compass points)* : Wind direction averaged over 10 minutes prior to 3 pm.
 - **WindSpeed9am** *(kilometres per hour)*: Wind speed averaged over 10 minutes prior to 9 am.
 - **WindSpeed3pm** *(kilometres per hour)* : Wind speed averaged over 10 minutes prior to 3 pm.
 - **Humidity9am** *(in percent)* : Relative humidity at 9 am.
 - **Humidity3pm** *(in percent)* : Relative humidity at 3 pm.
 - **Pressure9am** *(hectopascals)* : Atmospheric pressure reduced to mean sea level at 9 am.
 - **Pressure3pm** *(hectopascals)* : Atmospheric pressure reduced to mean sea level at 3 pm.
 - **Cloud9am** *(in eighths)* : Fraction of sky obscured by cloud at 9 am.
 - **Cloud3pm** *(in eighths)*: Fraction of sky obscured by cloud at 3 pm.
 - **Temp9am** *(in degrees Celsius)* : Temperature at 9 am.
 - **Temp3pm** *(in degrees Celsius)* : Temperature at 3 pm.
 - **RainToday** : Whether there has been any rain on that specific day.
 - **RainTomorrow** : Target variable to predict: will it rain the next day, Yes or No? This column is Yes if the rain for that day was *1mm* or more.
 

#### Let's get an idea of the missing values in the dataset.

In [None]:
msno.matrix(aus)
plt.show()

So, apart from the *Date* and *Location* features, all the others have missing values.
 * In features like **Rainfall, Evaporation, Sunshine**, a missing value need not mean that somebody forgot to enter it; there could be days with no rainfall, so no value was recorded, and similarly for Sunshine, Evaporation and so on.
 * In features like **MinTemp, MaxTemp, WindGustDir** and so on, missing values need to be replaced by a suitable value (the median or the mode, depending on the feature), as there can't be a day with *NULL* as its maximum temperature.
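The matrix plot gives a visual impression; the share of missing values per column can also be put into numbers. Below is a minimal sketch on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset); the same expression should apply to `aus`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `aus`; the values are made up for illustration only.
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
    "MinTemp": [13.4, np.nan, 17.9, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 8.1],
})

# isna() marks missing cells; the column-wise mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the columns that need the most attention at the top.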

In [None]:
aus.info()

Right, we have 22 features plus the target, **RainTomorrow**.<br>
### Let's dig further to find out how many are numerical and categorical.

#### Let me quickly convert Date from object to datetime format

In [None]:
# pd.to_datetime parses the hyphen-separated dates directly, so no string replacement is needed
aus['Date'] = pd.to_datetime(aus['Date'])
aus['Date'].head()
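If the date strings are known to follow one layout, passing an explicit `format` makes the parsing stricter. The sketch below assumes ISO-style `YYYY-MM-DD` strings; the sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample in the assumed YYYY-MM-DD layout.
dates = pd.Series(["2008-12-01", "2008-12-02", "2009-01-15"])

# An explicit format raises an error on rows that do not match it.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# The .dt accessor then exposes year, month and day for later grouping.
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```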

In [None]:
num = [i for i in aus.columns if aus[i].dtypes != 'O']
cat = [i for i in aus.columns if i not in num]

In [None]:
print('Categorical Features: ', cat,'\n', sep='')
print('Numerical Features: ', num, sep=' ')
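An alternative way to split the columns is `DataFrame.select_dtypes`, which also keeps the datetime column out of the numerical list. This is a sketch on a made-up frame (`toy`, `num_cols`, `date_cols` and `cat_cols` are hypothetical names); it may be more robust than comparing against `'O'` on pandas versions that store text in a dedicated string dtype.

```python
import pandas as pd

# Toy frame with one column of each kind; the values are made up.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2008-12-01", "2008-12-02"]),
    "Location": ["Albury", "Albury"],
    "MinTemp": [13.4, 7.4],
    "RainToday": ["No", "Yes"],
})

num_cols = toy.select_dtypes(include="number").columns.tolist()
date_cols = toy.select_dtypes(include="datetime").columns.tolist()
# Everything that is neither numeric nor datetime is treated as categorical.
cat_cols = toy.select_dtypes(exclude=["number", "datetime"]).columns.tolist()

print(num_cols, date_cols, cat_cols)
```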

# <font color = sky-blue> <span style='font-family:Georgia'> <b> Exploratory Data Analysis </b> </span></font>

#### So, here we go

#### Numerical Features:

In [None]:
print(num,end=' ')

In [None]:
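import warnings
warnings.filterwarnings("ignore")  # silence the many library warnings these plots emit

fig, axes = plt.subplots(4, 4, figsize=(16, 16))

# num[0] is Date, so plot the 16 remaining numerical features
for ax, col in zip(axes.flat, num[1:]):
    sns.histplot(x=aus[col], kde=False, ax=ax)  # histplot replaces the deprecated distplot
    ax.set_title(col)
    ax.set_xlabel('')
plt.show()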

In [None]:
plt.subplots(4,4,figsize=(18,18))

for h,i in enumerate(num[1:]):
    plt.subplot(4,4,h+1)
    sns.boxplot(y=aus[i],color='greenyellow')
    plt.title(i)
    plt.ylabel('')
plt.show()

**From the above plots we have observed that**:

* Temperatures in Australia range from about -8 to 48 °C, with the average minimum temperature around 12 °C and the average maximum around 23-24 °C.
* On most days it doesn't rain in Australia, but when it does, rainfall can go up to about 350 mm.
* Since most days are sunny, evaporation is to be expected; it is around 8-10 mm on average.
* The median value of sunshine in a day is 8 hours.
* The median value of maximum wind gust speed is about 40 km/h.
* Generally, wind speed in the afternoon (around 3 pm) is greater than in the morning (9 am).
* Humidity is higher in the morning than in the afternoon.
* No specific difference was noticed in cloud cover, pressure or temperature.
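Observations read off histograms and boxplots can be backed with numbers. The sketch below computes a median and the upper boxplot fence (third quartile plus 1.5 times the interquartile range, the default whisker rule in seaborn boxplots) for a made-up rainfall series; the values are assumptions for illustration only.

```python
import pandas as pd

# Made-up daily rainfall values in mm, skewed like a typical rainfall column.
rainfall = pd.Series([0.0, 0.0, 0.0, 0.0, 0.2, 0.6, 1.0, 2.4, 12.0, 80.0])

q1, q3 = rainfall.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # points above this are drawn as outliers

# Share of observations that lie beyond the upper fence.
outlier_share = (rainfall > upper_fence).mean()
print(rainfall.median(), upper_fence, outlier_share)
```

For a heavily skewed feature such as rainfall, the median and the outlier share tend to describe the data better than the mean alone.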

#### Categorical Features

In [None]:
print(cat,end=' ')

In [None]:
px.pie(data_frame=aus,names='RainToday')

In [None]:
plt.figure(figsize=(12,16))

aus.Location.value_counts()[::-1].plot.barh()
plt.show()

In [None]:
rn = aus.copy()
rn.head(3)

In [None]:
rn['RainToday'] = rn['RainToday'].fillna(rn.RainToday.mode()[0])  # assign back; inplace on a selected column is unreliable
rn['WindGustDir'] = rn['WindGustDir'].fillna(rn.WindGustDir.mode()[0])
rn['RainTomorrow'] = rn['RainTomorrow'].fillna(rn.RainTomorrow.mode()[0])
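The three mode fills above follow one pattern, which can be wrapped in a small helper. `fill_with_mode` is a hypothetical name introduced here for illustration, and the toy values are made up; note that filling the target column this way is a simplification for plotting, not a modelling recommendation.

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    out = frame.copy()
    for col in columns:
        # mode() returns a Series; [0] picks the most frequent value
        out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy frame with one gap per column.
toy = pd.DataFrame({
    "RainToday": ["No", "No", np.nan, "Yes"],
    "WindGustDir": ["W", np.nan, "W", "SE"],
})
filled = fill_with_mode(toy, ["RainToday", "WindGustDir"])
print(filled)
```

Returning a copy leaves the input frame untouched, which keeps the original missing-value pattern available for later checks.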

In [None]:
px.histogram(x=rn.WindGustDir,color=rn['RainTomorrow'],title="Direction of wind gust and Rain")

In [None]:
px.histogram(x=rn.WindDir3pm, color=rn['RainTomorrow'], title="Direction of wind at 3pm and Rain")

#### Observations
 - Generally the strongest gusts come from the west (a wind direction names where the wind blows from).
 - In the morning the wind mostly comes from the north, and in the afternoon it generally comes from the south-east.
 - The most samples have been taken from Canberra and Sydney.
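Statements about the "most common" direction or location can be checked with `value_counts`. A minimal sketch on a made-up series follows (`shares` and `top_dir` are hypothetical names); the same call should work on `rn['WindGustDir']` or `rn['Location']`.

```python
import pandas as pd

# Made-up gust directions for illustration.
gust_dir = pd.Series(["W", "W", "SE", "N", "W", "SE"], name="WindGustDir")

# normalize=True returns shares instead of raw counts, sorted from most to least frequent.
shares = gust_dir.value_counts(normalize=True)
top_dir = shares.index[0]
print(top_dir, shares[top_dir])
```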

In [None]:
px.scatter(data_frame=rn,x='MinTemp',y='MaxTemp',color='RainToday')

If the max temp is below 30-35 °C and the min temp is below 25 °C, chances are that it will rain.

In [None]:
px.scatter(x='Rainfall',y='MaxTemp',data_frame=rn)

If the max temperature is between 20 and 30 °C, it rains, and sometimes it rains cats and dogs.

In [None]:
plt.figure(figsize=(8,4))
rn.groupby('Location')['Rainfall'].median().sort_values(ascending=False)[:10][::-1].plot.barh()
plt.show()

So, places like Dartmoor, Walpole, MountGambier, Portland and NorfolkIsland usually witness more rainfall than other locations in Australia.

In [None]:
px.scatter(data_frame=rn,x='MaxTemp',y='Evaporation')

Not strongly correlated, but still a healthy relationship: as the temperature rises, evaporation also starts to increase.

In [None]:
print(f"We have {rn.WindGustSpeed.isna().sum()} null values.")
rn.groupby('RainToday')['WindGustSpeed'].median()

As per the data, the median gust speed is generally higher on rainy days than on dry days, and this sounds plausible too.

In [None]:
px.histogram(data_frame=rn,x='Temp3pm',color='RainTomorrow', title ='Temperature at 3pm vs. Rain Tomorrow')

If the temperature at 3 pm is between 15 and 25 °C, then the chances of rain the next day increase.

In [None]:
px.histogram(data_frame=rn,x='Cloud3pm',color='RainTomorrow',title ='Cloud at 3pm vs. Rain Tomorrow')

If there is cloud cover at 3 pm, the chances of rain the next day increase; when the value reaches 8 (a fully overcast sky), there is about a 55% chance of rain the very next day.

In [None]:
plt.figure(figsize=(14,5))
sns.countplot(x=rn['Date'].dt.year,color='skyblue')  # pass the series as x; newer seaborn treats a positional argument as data
plt.show()

So, most of the available data is from 2009 to 2016.<br>
#### Now, let's see how much it rained in these years.

In [None]:
plt.figure(figsize=(15,5))
u = sns.lineplot(x=rn['Date'].dt.year,y=rn['RainToday'].eq('Yes').astype(int))  # 1 = rain, 0 = no rain, so the line is the share of rainy days
u.set(ylim=(0,0.5),xticks=[i for i in range (2007,2018)])
plt.show()

The value is about 0.2, which is a share of rainy days rather than millimetres, and it is much higher in 2007; but as complete data for that year is not available, I suppose we can ignore that.<br>
#### Now let's check whether it rains on some specific days of the month.

In [None]:
plt.figure(figsize=(15,5))
u = sns.lineplot(x=rn['Date'].dt.day,y=rn['RainToday'].eq('Yes').astype(int))  # share of rainy days per day of the month
u.set(ylim=(0,0.28),xticks=[i for i in range(0,32)])
plt.show()

No, it doesn't. No specific day of the month shows a significantly higher share of rain.
#### Now let's do the same for months.

In [None]:
plt.figure(figsize=(15,5))
u = sns.lineplot(x=rn['Date'].dt.month_name(),y=rn['RainToday'].eq('Yes').astype(int),marker="o")  # share of rainy days per month
u.set(ylim=(0,0.28))
plt.show()

Here we have a distinct observation: the months of June, July and August (the Australian winter) witness a healthy amount of rain.
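The monthly pattern can also be tabulated directly, which avoids any ambiguity about the order of month names on a plot axis. This is a sketch on made-up rows (`rainy` and `monthly_share` are hypothetical names); grouping by the month number keeps the calendar order.

```python
import pandas as pd

# Made-up observations spread over two months and two years.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2010-01-05", "2010-01-20", "2010-07-03",
        "2010-07-04", "2011-07-09", "2011-01-11",
    ]),
    "RainToday": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

# True/False flag per day; its mean within a group is the share of rainy days.
rainy = toy["RainToday"].eq("Yes")
monthly_share = rainy.groupby(toy["Date"].dt.month).mean()
print(monthly_share)
```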

In [None]:
pd.crosstab(rn['RainToday'],rn['RainTomorrow'])

In [None]:
px.histogram(rn , x = 'RainToday' , title = 'Rain Today vs Rain Tomorrow',color = 'RainTomorrow')

So, if it doesn't rain today, then it will rarely rain tomorrow. But if it rains today, then 46 out of 100 times it has rained the next day.
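A figure like "46 out of 100" is a row-wise proportion, which `pd.crosstab` can return directly through `normalize='index'`. The sketch below uses made-up rows, so its numbers are not those of the dataset; `transition` is a hypothetical name.

```python
import pandas as pd

# Made-up pairs of (RainToday, RainTomorrow).
toy = pd.DataFrame({
    "RainToday":    ["No", "No", "No", "No", "Yes", "Yes"],
    "RainTomorrow": ["No", "No", "No", "Yes", "Yes", "No"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
transition = pd.crosstab(toy["RainToday"], toy["RainTomorrow"], normalize="index")
print(transition)
```

Each row then reads as "given today's state, how often did tomorrow turn out dry or rainy".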

In [None]:
px.histogram(rn , x = 'Humidity3pm' , title ='Humidity at 3pm vs. Rain Tomorrow' ,color ='RainTomorrow')

The higher the relative humidity at 3 pm, the higher the chances of rain the next day.<br> Once humidity crosses the 80% mark, the chance of rainfall is more than 50%.
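A threshold statement such as "above 80%" can be checked by binning the feature and computing the rain rate per bin. A minimal sketch with `pd.cut` on made-up values follows; the bin edges and the names `band` and `rain_rate` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up humidity readings and next-day outcomes.
toy = pd.DataFrame({
    "Humidity3pm": [35, 45, 55, 65, 85, 90, 95, 82],
    "RainTomorrow": ["No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Right-closed bins: (0, 40], (40, 60], (60, 80], (80, 100].
band = pd.cut(toy["Humidity3pm"], bins=[0, 40, 60, 80, 100])

# Mean of the True/False flag per bin is the share of days followed by rain.
rain_rate = toy["RainTomorrow"].eq("Yes").groupby(band, observed=True).mean()
print(rain_rate)
```

The same pattern should carry over to other numerical features, for example wind speed or temperature at 3 pm.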

In [None]:
px.histogram(rn , x = 'WindSpeed3pm' , title ='Wind Speed at 3pm vs. Rain Tomorrow' ,color ='RainTomorrow')

For wind speed at 3 pm, the lower the value, the higher the chances of rain the next day; next-day rain is mostly seen when the wind speed is in the 10-30 km/h range.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(rn.select_dtypes('number').corr(),annot=True,cmap='winter')  # correlate numeric columns only; recent pandas raises on text columns
plt.show()
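The heatmap does not include the target because it is stored as text. One way to see how features relate to it is to encode Yes/No as 1/0 and rank the correlations. The sketch below uses made-up values, and the names `encoded` and `corr_with_target` are hypothetical.

```python
import pandas as pd

# Made-up rows; Location is text and should be left out of the correlation.
toy = pd.DataFrame({
    "Humidity3pm": [30, 40, 50, 60, 70, 80],
    "Sunshine": [11, 10, 8, 7, 4, 2],
    "Location": ["A", "A", "B", "B", "C", "C"],
    "RainTomorrow": ["No", "No", "No", "Yes", "Yes", "Yes"],
})

# Encode the target as 0/1 so it takes part in the correlation matrix.
encoded = toy.assign(RainTomorrow=toy["RainTomorrow"].map({"No": 0, "Yes": 1}))

corr_with_target = (
    encoded.select_dtypes("number").corr()["RainTomorrow"]
    .drop("RainTomorrow")                    # drop the self-correlation
    .sort_values(key=abs, ascending=False)   # strongest relationships first
)
print(corr_with_target)
```

Pearson correlation against a 0/1 target is only a rough screening tool, so the ranking is best read as a hint rather than a measure of predictive power.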

#### So, that's it for now. If you like my work, do upvote, and if I need to work on certain areas, please comment.