# Rain in Aussie: Exploratory Data Analysis

![](https://i.imgur.com/N8aIuRK.jpg)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
raw_df = pd.read_csv('/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv')

In [None]:
raw_df.head()

The dataset contains more than 145,000 rows and 23 columns, with a mix of date, numeric, and categorical columns.

Let's check the data types and missing values in the various columns.

In [None]:
raw_df.info()

It might be a good idea to discard the rows where the value of `RainTomorrow` or `RainToday` is missing to make our analysis and modeling simpler (since one of them is the target variable, and the other is likely to be very closely related to the target variable). 

In [None]:
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)
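
Before dropping rows, it can help to see how many would be lost. The following is a minimal sketch on a hypothetical toy frame (`toy_df`, with made-up values standing in for `raw_df`): it counts missing labels and then drops the incomplete rows without mutating the original frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for raw_df; the values are made up.
toy_df = pd.DataFrame({
    'Location': ['Albury', 'Albury', 'Sydney', 'Sydney', 'Perth'],
    'Rainfall': [0.0, 1.2, np.nan, 5.0, 0.0],
    'RainToday': ['No', 'Yes', np.nan, 'Yes', 'No'],
    'RainTomorrow': ['No', 'Yes', 'Yes', np.nan, 'No'],
})

# Count missing values in the two label columns before dropping anything.
missing_counts = toy_df[['RainToday', 'RainTomorrow']].isna().sum()
print(missing_counts)

# Keep only rows where both labels are present; toy_df itself is unchanged.
clean_df = toy_df.dropna(subset=['RainToday', 'RainTomorrow'])
print(len(toy_df), '->', len(clean_df), 'rows')
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison.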

## Exploratory Data Analysis and Visualization

It's always a good idea to explore the distributions of the various columns and see how they relate to the target column. Let's explore and visualize the data using the `Plotly`, `Matplotlib` and `Seaborn` libraries.

In [None]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'  # fully transparent figure background

In [None]:
px.histogram(raw_df, x='Location', 
             title='Location vs. Rainy Days', 
             color='RainToday')

> Observations:
> 1. The data is approximately uniformly distributed across locations.
> 2. Nhil, Darwin, and Uluru report the least rain.
> 3. Watsonia receives the most rain among the 49 locations.
> 4. `Location` appears to be a factor in whether it will rain tomorrow.
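
Because the histogram shows raw counts, locations with more rows can look rainier than they are. One way to compare locations on equal footing is the share of rainy days per location. This is a sketch on a hypothetical toy frame with made-up values, not the notebook's actual numbers.

```python
import pandas as pd

# Hypothetical toy data; the location names are borrowed from the dataset,
# but the values are invented for illustration.
toy_df = pd.DataFrame({
    'Location': ['Albury', 'Albury', 'Albury', 'Uluru', 'Uluru', 'Watsonia', 'Watsonia'],
    'RainToday': ['Yes', 'No', 'No', 'No', 'No', 'Yes', 'Yes'],
})

# Fraction of days with rain, per location, from wettest to driest.
rainy_share = (
    toy_df['RainToday'].eq('Yes')
    .groupby(toy_df['Location'])
    .mean()
    .sort_values(ascending=False)
)
print(rainy_share)
```

In the notebook, the same expression would apply with `raw_df` in place of `toy_df`.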

In [None]:
px.histogram(raw_df, 
             x='Temp3pm', 
             title='Temperature at 3 pm vs. Rain Tomorrow', 
             color='RainTomorrow')

> Observations:
> 1. `Temp3pm` is roughly normally distributed.
> 2. When the temperature is lower, the chance of rain tomorrow is noticeably higher.
> 3. Temperature at 3 PM appears to be a useful signal for predicting whether it will rain tomorrow.

In [None]:
px.histogram(raw_df, 
             x='RainTomorrow', 
             color='RainToday', 
             title='Rain Tomorrow vs. Rain Today')

> Observations:
> 1. The classes are imbalanced: the two outcomes do not have an equal number of observations.
> 2. If it didn't rain today, there is a good chance that it will not rain tomorrow, so predicting no rain is easier than predicting rain.
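
Both points can be put into numbers. The sketch below uses a hypothetical toy frame with made-up labels and computes the class shares with `value_counts(normalize=True)` and the row-normalized cross-tabulation of `RainToday` against `RainTomorrow`.

```python
import pandas as pd

# Hypothetical toy labels; the values are invented for illustration.
toy_df = pd.DataFrame({
    'RainToday':    ['No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'RainTomorrow': ['No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes'],
})

# Share of each target class: a large gap signals class imbalance.
class_share = toy_df['RainTomorrow'].value_counts(normalize=True)
print(class_share)

# Each row sums to 1: P(RainTomorrow | RainToday).
transition = pd.crosstab(toy_df['RainToday'], toy_df['RainTomorrow'], normalize='index')
print(transition)
```

Reading across the `No` row of `transition` gives the estimated chance of a dry tomorrow after a dry today.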

In [None]:
px.scatter(raw_df.sample(2000), 
           title='Min Temp. vs Max Temp.',
           x='MinTemp', 
           y='MaxTemp', 
           color='RainToday')

> Observations:
> 1. Minimum and maximum temperature show a positive linear correlation.
> 2. The points at the center overlap, suggesting that when it rains today, the gap between minimum and maximum temperature is small.

In [None]:
px.scatter(raw_df.sample(2000), 
           title='Temp (3 pm) vs. Humidity (3 pm)',
           x='Temp3pm',
           y='Humidity3pm',
           color='RainTomorrow')

> Observations:
> 1. If the temperature today is low and humidity is high, it may rain tomorrow.
> 2. If the temperature today is high and humidity is low, it may not rain tomorrow.
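
A rough way to test this reading of the scatter plot is to split both variables at their medians and compare the rain rate in each of the four resulting groups. This is only a sketch: the toy frame and the band labels (`cool`/`warm`, `humid`/`dry`) are hypothetical, and the median split is an assumed simplification.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Temp3pm':      [12.0, 14.0, 15.0, 16.0, 27.0, 29.0, 31.0, 33.0],
    'Humidity3pm':  [85.0, 78.0, 90.0, 40.0, 30.0, 25.0, 70.0, 20.0],
    'RainTomorrow': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# Split each variable at its median into two coarse bands.
toy_df['temp_band'] = np.where(
    toy_df['Temp3pm'] < toy_df['Temp3pm'].median(), 'cool', 'warm')
toy_df['humidity_band'] = np.where(
    toy_df['Humidity3pm'] > toy_df['Humidity3pm'].median(), 'humid', 'dry')
toy_df['rain'] = toy_df['RainTomorrow'].eq('Yes')

# Fraction of rainy tomorrows in each temperature/humidity combination.
rain_rate = toy_df.groupby(['temp_band', 'humidity_band'])['rain'].mean()
print(rain_rate)
```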

In [None]:
sns.histplot(raw_df.Rainfall.sample(200), kde=True);  # distplot is deprecated in recent seaborn

> Observation:
> Rainfall is heavily right-skewed rather than Gaussian: most days record little or no rain, with a long tail of wet days.
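
The shape of a distribution can also be checked numerically rather than by eye. The sketch below uses made-up rainfall values (not taken from the dataset) to compare the mean with the median and to compute the sample skewness with `Series.skew()`. A mean well above the median together with a clearly positive skew would point to a right-skewed distribution rather than a symmetric one.

```python
import pandas as pd

# Hypothetical rainfall readings in mm, invented for illustration.
rainfall = pd.Series([0.0] * 14 + [0.2, 0.6, 1.0, 2.4, 8.0, 25.0], name='Rainfall')

mean_rain = rainfall.mean()
median_rain = rainfall.median()
skewness = rainfall.skew()          # > 0 means a longer right tail
zero_share = rainfall.eq(0).mean()  # fraction of completely dry days

print(f'mean={mean_rain:.2f}, median={median_rain:.2f}, '
      f'skew={skewness:.2f}, dry share={zero_share:.0%}')
```

On the notebook's data, the same calls would apply to `raw_df['Rainfall']`.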

In [None]:
px.scatter(raw_df.sample(2000), 
           title='Pressure (3 pm) vs. Pressure (9 am)',
           x='Pressure3pm',
           y='Pressure9am',
           color='RainTomorrow')

> Observations:
> 1. Pressure at 3 PM is positively correlated with pressure at 9 AM.
> 2. The two classes overlap heavily, but higher pressure at 9 AM and 3 PM tends to go with no rain tomorrow.

In [None]:
px.scatter(raw_df.sample(2000), 
           title='Temp (9 am) vs. Temp (3 pm)',
           x='Temp9am',
           y='Temp3pm',
           color='RainTomorrow')

> Observations:
> 1. Temperature at 9 AM and temperature at 3 PM are positively correlated.
> 2. Higher temperatures at both times tend to go with no rain tomorrow.

In [None]:
raw_df.select_dtypes('float64').columns.tolist()

In [None]:
px.histogram(raw_df, 
             x='Cloud3pm', 
             title='Cloud at 3 pm vs. Rain Tomorrow', 
             color='RainTomorrow')

> Observations:
> 1. More cloud cover at 3 PM indicates a fair chance of rain tomorrow.
> 2. For no rain, however, cloud cover makes it quite difficult to decide.

In [None]:
px.histogram(raw_df.sample(100), 
             x='Rainfall', 
             title='Rainfall vs. Rain Tomorrow', 
             color='RainTomorrow')

In [None]:
plt.figure(figsize=(16,12))
sns.heatmap(raw_df.corr(numeric_only=True), 
            square=True, 
            cmap='Blues', 
            annot=True, 
            fmt='.2f', 
            linecolor='white');
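
A heatmap over many columns can be hard to scan, so it may help to list the column pairs ranked by absolute correlation. The sketch below assumes a hypothetical toy frame with made-up values. It restricts the frame to numeric columns first, since correlation is not defined for text columns such as `Location`.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Location': ['A', 'B', 'C', 'D', 'E', 'F'],
    'MinTemp': [5.0, 8.0, 12.0, 15.0, 18.0, 21.0],
    'MaxTemp': [14.0, 18.0, 21.0, 26.0, 29.0, 33.0],
    'Humidity3pm': [80.0, 72.0, 60.0, 55.0, 41.0, 30.0],
})

# Pearson correlation over numeric columns only.
corr = toy_df.select_dtypes('number').corr()

# One entry per unordered column pair, strongest relationships first.
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
).sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top alongside strong positive ones.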

In [None]:
numerical_columns = raw_df.select_dtypes('float64').columns

In [None]:
numerical_columns

In [None]:
sns.pairplot(raw_df[numerical_columns], kind='scatter', diag_kind='hist');

> Observations:
> 1. A few pairs of variables show strong positive correlation, and some variables are roughly uniformly distributed.


# THANK YOU FOR STICKING AROUND :)