In [1]:
# code line to properly show HTML table
from IPython.core.display import HTML
table_css = 'table {align:left;display:block} '
HTML('<style>{}</style>'.format(table_css))

<span style="font-family:Cambria">

<span style="color:#c27767">
    
# Rain in Australia
</span>

Ever wondered if you should carry an umbrella tomorrow? With this dataset, you can predict next-day rain by training classification models on the target variable <code>RainTomorrow</code>.

This dataset comprises about 10 years of daily weather observations from numerous locations across Australia.

<code>RainTomorrow</code> is the target variable to predict. It answers the crucial question: <span style="color:#c27767">**will it rain the next day? (Yes or No).**</span>

---
**Data source:**  
<a href="https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package" title="Data source - Kaggle">Data source - Kaggle</a>  

The observations were gathered from a multitude of weather stations. Definitions have been adapted from the Bureau of Meteorology's Climate Data Online. Data source: Climate Data and Climate Data Online. Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

**Useful links:**  

<a href="http://www.bom.gov.au/climate/dwo/" title="Daily Weather Observations in Australia">Daily Weather Observations in Australia</a>  
<a href="http://www.bom.gov.au/climate/data/" title="Australia Climate Data">Australia Climate Data</a>  
<a href="http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml" title="Notes to accompany Daily Weather Observations">Notes to accompany Daily Weather Observations</a>  
<a href="http://www.bom.gov.au/climate/cdo/about/about-stats.shtml" title="About Climate Statistics">About Climate Statistics</a>  

</span>

<span style="font-family:Cambria">

<span style="color:#c27767">
    
## 1. Learn about Data Collection process and Problem Domain
</span>

First, I will examine the variables in the data set and consider any assumptions and issues within the data, whether or not they are highlighted in <a href="http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml" title="Notes to accompany Daily Weather Observations">Notes to accompany Daily Weather Observations</a>. I will also try to understand the problem domain and familiarise myself with meteorology.  

Following the recommendations of the <a href="https://vdsbook.com" title="Veridical Data Science">Veridical Data Science book</a>, I will try to answer the questions below:

1. What does each variable measure?
2. How was the data collected?
3. What are the observational units?  
4. Is the data relevant to my project?
5. What questions do I have, and what assumptions am I making?

These questions are answered in the following sections.  

---
</span>

<span style="font-family:Cambria">

<span style="color:#c27767">
    
### 1.1. Variables
</span>

The data definitions are given in <a href="http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml" title="Notes to accompany Daily Weather Observations">Notes to accompany Daily Weather Observations</a>. Instead of copying that table, I checked the data file I downloaded and wrote the column names exactly as they appear there. To make sure the original stays unchanged, I work on a copy of the file named *copy_weatherAUS.csv*. The table below was created with an online table generator: 
<a href="https://www.tablesgenerator.com/html_tables" title="Tables Generator">Tables Generator</a>.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-zapl{border-color:#963400;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-k1ns{border-color:#963400;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-zapl">Data type</th>
    <th class="tg-zapl">Column name</th>
    <th class="tg-zapl">Description</th>
    <th class="tg-zapl">Method</th>
    <th class="tg-zapl">Units</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-k1ns">Date</td>
    <td class="tg-k1ns">Date</td>
    <td class="tg-k1ns">Date in format yyyy-mm-dd.</td>
    <td class="tg-k1ns">-</td>
    <td class="tg-k1ns">-</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Location</td>
    <td class="tg-k1ns">Location</td>
    <td class="tg-k1ns">Name of the weather station in a certain location.</td>
    <td class="tg-k1ns">-</td>
    <td class="tg-k1ns">-</td>
  </tr>
  <tr>
    <td class="tg-k1ns" rowspan="2">Temperature</td>
    <td class="tg-k1ns">MinTemp</td>
    <td class="tg-k1ns">Minimum temperature in the 24 hours to 9 AM.</td>
    <td class="tg-k1ns" rowspan="2">To take temperature measurements, thermometers are placed inside an instrument enclosure known as a Stevenson screen. A Stevenson screen is basically a box with louvres that allow air to circulate around the thermometer inside while protecting it from outside elements like rain and direct sunlight. The outside is painted white to minimise heat absorption. This basic design has been around for about 150 years, and is used by most meteorological organisations around the world. Traditionally, trained observers would read the thermometer and send in the observations at least twice a day, normally at 9 am and 3 pm; these days, automatic thermometers send in the information electronically. Source: <a href="https://media.bom.gov.au/social/blog/916/ask-the-bureau-how-is-temperature-measured/" title="Ask the Bureau: How is temperature measured?">Ask the Bureau: How is temperature measured?</a></td>
    <td class="tg-k1ns" rowspan="2">Degrees Celsius</td>
  </tr>
  <tr>
    <td class="tg-k1ns">MaxTemp</td>
    <td class="tg-k1ns">Maximum temperature in the 24 hours from 9 AM.</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Rainfall</td>
    <td class="tg-k1ns">Rainfall</td>
    <td class="tg-k1ns">Precipitation (rain that falls to or condenses on the ground) in the 24 hours to 9 AM.</td>
    <td class="tg-k1ns">Mostly rain, but it can also fall as snow. There are both manual and automatic rain gauges. The former need to be emptied by someone, while the latter are used in automatic weather stations. Where snow is present, a snow gauge is used, which automatically melts the snow. Nominally, rainfall is observed at 9 AM, but at a number of stations the reading can cover 48 or 72 hours (or even longer) if it is a weekend or the observer is not present. These are known as accumulated observations. At the vast majority of rainfall sites, observations are taken by volunteers. Sources: <a href="http://www.bom.gov.au/climate/cdo/about/definitionsrain.shtml" title="Definitions for rainfall">Definitions for rainfall</a>, <a href="http://www.bom.gov.au/climate/cdo/about/about-rain-data.shtml" title="About rainfall">About rainfall</a>.</td>
    <td class="tg-k1ns">Millimetres</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Evaporation</td>
    <td class="tg-k1ns">Evaporation</td>
    <td class="tg-k1ns">Evaporation from a "Class A" evaporation pan in the 24 hours to 9 AM.</td>
    <td class="tg-k1ns">Evaporation is measured daily as the depth of water that evaporates from the pan (the cited description uses inches; this data set reports millimetres). The measurement day begins with the pan filled to exactly two inches (5 cm) from the pan top. At the end of 24 hours, the amount of water needed to refill the pan to exactly two inches from its top is measured. In short, evaporation is the amount of water that evaporates from an open pan called a Class A evaporation pan. The rate of evaporation depends on factors such as cloudiness, air temperature and wind speed. Areas in central Australia are very dry, and therefore have a high rate of evaporation. In contrast, coastal areas tend to have a lower evaporation rate as a result of their proximity to a large water source. Areas with low rainfall and low humidity tend to have a high evaporation rate, whilst areas with high rainfall and high humidity tend to have a low evaporation rate. Sources: <a href="http://www.bom.gov.au/watl/evaporation/" title="Evaporation: Average Monthly & Annual Evaporation">Evaporation: Average Monthly & Annual Evaporation</a>, <a href="http://www.bom.gov.au/climate/cdo/about/definitionsother.shtml" title="Climate statistics for Australian locations">Climate statistics for Australian locations</a>, <a href="https://en.wikipedia.org/wiki/Pan_evaporation#:~:text=Class%20A%20evaporation%20pan,-In%20the%20United&text=Evaporation%20is%20measured%20daily%20as,from%20its%20top%20is%20measured." title="Pan evaporation">Pan evaporation - Wikipedia</a>, <a href="http://www.bom.gov.au/climate/maps/averages/evaporation/" title="Average annual, monthly and seasonal evaporation">Average annual, monthly and seasonal evaporation</a></td>
    <td class="tg-k1ns">Millimetres</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Sunshine</td>
    <td class="tg-k1ns">Sunshine</td>
    <td class="tg-k1ns">Bright sunshine in the 24 hours to midnight.</td>
    <td class="tg-k1ns">Average number of hours of bright sunshine each day in a calendar month or year, calculated over the period of record. Hours of bright sunshine is measured from midnight to midnight. Within the Bureau of Meteorology network bright sunshine has generally been recorded with a Campbell-Stokes recorder. This device only measures the duration of “bright” sunshine, which is less than the amount of “visible” sunshine. For example, sunshine immediately after sunrise and just before sunset is visible, but would not be bright enough to register on the Campbell-Stokes recorder. Source: <a href="http://www.bom.gov.au/climate/cdo/about/definitionsother.shtml" title="Climate statistics for Australian locations">Climate statistics for Australian locations</a>.</td>
    <td class="tg-k1ns">Hours</td>
  </tr>
  <tr>
    <td class="tg-k1ns" rowspan="2">Wind Gust</td>
    <td class="tg-k1ns">WindGustDir</td>
    <td class="tg-k1ns">Direction of the strongest wind gust in the 24 hours to midnight.</td>
    <td class="tg-k1ns" rowspan="2">A gust is any sudden increase of wind speed of short duration; typically a 3 second time period is used. The maximum wind gust for a day is measured from midnight to midnight. If, for some reason, an observation is unable to be made, the next observation is recorded as an accumulation. Accumulated data can affect the Date of the Maximum Wind Gust, since the exact date of occurrence is unknown. Source: <a href="http://www.bom.gov.au/climate/cdo/about/definitionsother.shtml" title="Climate statistics for Australian locations">Climate statistics for Australian locations</a> </td>
    <td class="tg-k1ns">16 compass points</td>
  </tr>
  <tr>
    <td class="tg-k1ns">WindGustSpeed</td>
    <td class="tg-k1ns">Speed of strongest wind gust in the 24 hours to midnight.</td>
    <td class="tg-k1ns">Kilometres per hour</td>
  </tr>
  <tr>
    <td class="tg-k1ns" rowspan="6">9 AM measurements</td>
    <td class="tg-k1ns">Temp9am</td>
    <td class="tg-k1ns">Temperature at 9 AM.</td>
    <td class="tg-k1ns">Already explained above (see Temperature row).</td>
    <td class="tg-k1ns">Degrees Celsius</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Humidity9am</td>
    <td class="tg-k1ns">Relative humidity at 9 AM.</td>
    <td class="tg-k1ns">Relative humidity is the percentage ratio of vapour pressure to saturation vapour pressure. It is a commonly used indicator of the moisture in the air. Relative humidity (RH) is the amount of moisture in the air as a percentage of the amount the air can actually hold. Warmer air can hold more moisture than cooler air, which means that for a given amount of atmospheric moisture, RH will be lower if the air is warm than it would be if the air is cool. This can be seen by comparing the daily 9am maps (higher RH values) with the daily 3pm maps (lower RH values) for any month of the year. Sources: <a href="http://www.bom.gov.au/climate/maps/averages/relative-humidity/files/calc-rh.pdf" title="Calculation of Relative Humidity">Calculation of Relative Humidity</a>, <a href="http://www.bom.gov.au/climate/maps/averages/relative-humidity/" title="Average 9 am and 3 pm relative humidity">Average 9 am and 3 pm relative humidity</a>.</td>
    <td class="tg-k1ns">Percent</td>
  </tr>
  <tr>
    <td class="tg-k1ns">WindDir9am</td>
    <td class="tg-k1ns">Wind direction averaged over 10 minutes prior to 9 AM.</td>
    <td class="tg-k1ns" rowspan="2">Wind is one of the most highly variable meteorological elements, both in speed and direction. It is influenced by a wide range of factors, from large scale pressure patterns, to the time of day and the nature of the surrounding terrain. Because the wind is highly variable it is often studied by means of frequency analyses, provided here in the form of wind roses, rather than as simple averages. The wind direction is specified relative to true (geographic) north, and <strong>is the direction from which the wind is blowing</strong>. The direction can be specified either as the number of degrees clockwise from true north, or as one of the 8 or 16 compass points; as per the given metadata, 16 compass points are used. Wind speeds are 10-minute average wind speeds unless specifically labelled as gusts, in which case they are an almost instantaneous reading. <a href="http://www.bom.gov.au/climate/averages/wind/wind_rose.shtml" title="Wind Roses">Wind Roses</a> are used to visualise wind direction and speed.</td>
    <td class="tg-k1ns">Compass points</td>
  </tr>
  <tr>
    <td class="tg-k1ns">WindSpeed9am</td>
    <td class="tg-k1ns">Wind speed averaged over 10 minutes prior to 9 AM.</td>
    <td class="tg-k1ns">Kilometres per hour</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Cloud9am</td>
    <td class="tg-k1ns">Fraction of sky obscured by cloud at 9 AM.</td>
    <td class="tg-k1ns">The total cloud amount is measured visually by estimating the fraction (in eighths or oktas) of the dome of the sky covered by clouds. A completely clear sky is recorded as zero okta, while a totally overcast sky is recorded as 8 oktas. The presence of any trace of cloud in an otherwise blue sky is recorded as 1 okta, and similarly any trace of blue in an otherwise cloudy sky is recorded as 7 oktas. Areas of inland Australia have a lower moisture content in the air and therefore less cloud cover. Coastal areas have a higher moisture content therefore greater and more frequent cloud cover. Source: <a href="http://www.bom.gov.au/climate/maps/averages/cloud/" title="Average 9 am and 3 pm cloud">Average 9 am and 3 pm cloud</a>.</td>
    <td class="tg-k1ns">Eighths (oktas)</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Pressure9am</td>
    <td class="tg-k1ns">Atmospheric pressure reduced to mean sea level at 9 AM.</td>
    <td class="tg-k1ns">The mean sea-level pressure (MSLP) is the atmospheric pressure at mean sea level. This is the atmospheric pressure normally given in weather reports on radio, television, and newspapers or on the internet. Average sea-level pressure is 1,013.25 hPa. The lowest measurable sea-level pressure is found at the centres of tropical cyclones and tornadoes, with a record low of 870 hPa. The highest sea-level pressure on Earth occurs in Siberia, where the Siberian High often attains a sea-level pressure above 1,050 hPa. A mean sea level pressure chart shows the direct relationship between isobar spacing (pressure gradient) and orientation, and the strength and direction of surface winds. The general rule is that winds are strongest where the isobars are closest together. Thus the strongest winds are usually experienced near cold fronts, low pressure systems and in westerly airstreams south of the continent. Winds are normally light near high pressure systems where the isobars are widely spaced. Sources: <a href="https://en.wikipedia.org/wiki/Atmospheric_pressure" title="Atmospheric pressure - Wikipedia">Atmospheric pressure - Wikipedia</a>, <a href="http://www.bom.gov.au/australia/charts/Interpreting_MSLP.shtml" title="Interpreting the Mean Sea Level Pressure (MSLP) Analysis">Interpreting the Mean Sea Level Pressure (MSLP) Analysis</a>.</td>
    <td class="tg-k1ns">Hectopascals</td>
  </tr>
  <tr>
    <td class="tg-k1ns" rowspan="6">3 PM measurements</td>
    <td class="tg-k1ns">Temp3pm</td>
    <td class="tg-k1ns">Temperature at 3 PM.</td>
    <td class="tg-k1ns">Already explained above (see Temperature row).</td>
    <td class="tg-k1ns">Degrees Celsius</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Humidity3pm</td>
    <td class="tg-k1ns">Relative humidity at 3 PM.</td>
    <td class="tg-k1ns">Already explained above - please see 9 AM measurements - Humidity.</td>
    <td class="tg-k1ns">Percent</td>
  </tr>
  <tr>
    <td class="tg-k1ns">WindDir3pm</td>
    <td class="tg-k1ns">Wind direction averaged over 10 minutes prior to 3 PM.</td>
    <td class="tg-k1ns" rowspan="2">Already explained above - please see 9 AM measurements - Wind.</td>
    <td class="tg-k1ns">Compass points</td>
  </tr>
  <tr>
    <td class="tg-k1ns">WindSpeed3pm</td>
    <td class="tg-k1ns">Wind speed averaged over 10 minutes prior to 3 PM.</td>
    <td class="tg-k1ns">Kilometres per hour</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Cloud3pm</td>
    <td class="tg-k1ns">Fraction of sky obscured by cloud at 3 PM.</td>
    <td class="tg-k1ns">Already explained above - please see 9 AM measurements - Cloud.</td>
    <td class="tg-k1ns">Eighths (oktas)</td>
  </tr>
  <tr>
    <td class="tg-k1ns">Pressure3pm</td>
    <td class="tg-k1ns">Atmospheric pressure reduced to mean sea level at 3 PM.</td>
    <td class="tg-k1ns">Already explained above - please see 9 AM measurements - Pressure.</td>
    <td class="tg-k1ns">Hectopascals</td>
  </tr>
  <tr>
    <td class="tg-k1ns" rowspan="2">Rainfall classification</td>
    <td class="tg-k1ns">RainToday</td>
    <td class="tg-k1ns">Calculated field - feature.</td>
    <td class="tg-k1ns">Boolean: 1 if precipitation (mm) in the 24 hours to 9 AM exceeds 1 mm, otherwise 0.</td>
    <td class="tg-k1ns">Boolean</td>
  </tr>
  <tr>
    <td class="tg-k1ns">RainTomorrow</td>
    <td class="tg-k1ns">Calculated field - target.</td>
    <td class="tg-k1ns">Boolean: 1 if precipitation (mm) the next day exceeds 1 mm, otherwise 0. Used as the response (target) variable.</td>
    <td class="tg-k1ns">Boolean</td>
  </tr>
</tbody></table>
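
The column names in the table above can be checked against the file programmatically. The following is a minimal sketch, assuming pandas is available: `EXPECTED_COLUMNS` is transcribed from the table, `missing_and_unexpected` is a hypothetical helper name, and a header-only in-memory CSV stands in for *copy_weatherAUS.csv*.

```python
import io
import pandas as pd

# Column names transcribed from the table above.
EXPECTED_COLUMNS = [
    "Date", "Location", "MinTemp", "MaxTemp", "Rainfall", "Evaporation",
    "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir9am", "WindDir3pm",
    "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm",
    "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm", "Temp9am",
    "Temp3pm", "RainToday", "RainTomorrow",
]

def missing_and_unexpected(df, expected=EXPECTED_COLUMNS):
    """Return (expected columns absent from df, df columns not in expected)."""
    present = set(df.columns)
    missing = [c for c in expected if c not in present]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected

# Stand-in for pd.read_csv("copy_weatherAUS.csv"): a header-only CSV.
sample = pd.read_csv(io.StringIO(",".join(EXPECTED_COLUMNS) + "\n"))
missing, unexpected = missing_and_unexpected(sample)
print("missing:", missing, "unexpected:", unexpected)
```

Two empty lists mean the file and the table agree on the column names; the order of the columns is deliberately not checked.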

<span style="font-family:Cambria">
    
Just by briefly looking into the data set, using some quick filters in Excel and quick Google checks, I can add the below:

<strong style="color:#c27767">Date:</strong> The dataset has observations from 01/11/2007 till 25/06/2017.  

<strong style="color:#c27767">Locations:</strong> The observations are taken from 49 different locations. All names look unique; I quickly searched for a few places to make sure they exist (for example <code>PearceRAAF</code>). I also noticed that the location names are not "user friendly": <code>PearceRAAF</code> is actually RAAF Base Pearce, a military base. If I wanted to plot these locations on a map, I would need to rewrite them as actual place names, for example <code>BadgerysCreek</code> to <code>Badgerys Creek</code>. Such updates are minor.

<strong style="color:#c27767">Temperature:</strong> looking at the temperature fields, I did not see any extreme or impossible values. Maximum temperatures range from -4.8 to 48.1 °C and minimum temperatures from -8.5 to 33.9 °C. The 9 AM and 3 PM temperatures range from -7.2 to 40.2 °C and from -5.4 to 46.7 °C respectively. This is also plausible: the morning range sits lower than the afternoon range.
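
The quick Excel checks above (date span, number of locations, value ranges) can also be reproduced in pandas. This is only a sketch: `quick_summary` is a hypothetical helper, and the sample rows are invented for illustration rather than taken from the data set.

```python
import pandas as pd

def quick_summary(df, numeric_cols):
    """Date span, number of distinct locations, and min/max per numeric column."""
    dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")
    ranges = df[numeric_cols].agg(["min", "max"]).T  # one row per column
    return {
        "first_date": dates.min(),
        "last_date": dates.max(),
        "n_locations": df["Location"].nunique(),
        "ranges": ranges,
    }

# Illustrative rows only; the values are invented for the example.
toy = pd.DataFrame({
    "Date": ["2010-01-01", "2010-01-02", "2010-01-01"],
    "Location": ["Adelaide", "Adelaide", "PearceRAAF"],
    "MinTemp": [15.2, 17.0, 18.4],
    "MaxTemp": [29.5, 33.1, 35.0],
})
summary = quick_summary(toy, ["MinTemp", "MaxTemp"])
print(summary["first_date"], summary["last_date"], summary["n_locations"])
print(summary["ranges"])
```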

<strong style="color:#c27767">Rainfall:</strong> values are either 0 or continuous, up to 371 mm. There appear to be different rainfall classifications, probably depending on location. The 
<a href="https://community.wmo.int/en/activity-areas/aviation/hazards/precipitation" title="World Meteorological Organisation">World Meteorological Organisation</a> notes that:

<blockquote><em>While there is no agreed international definition regarding rainfall intensity, some use the following criteria: Heavy rain is defined as rates in excess of 4 mm per hour while heavy showers are defined as rates in excess of 10 mm per hour. Showers are further classified as being violent if the rate exceeds 50 mm per hour, although these are normally considered to be rates typical for tropical regions.</em></blockquote>

It looks like every country has its own classification, published on government web sites. Different measurement units are also used: mm/24 hours, mm/hour and even mm/year. This data set uses mm/24 hours. To get a basic feel for what the rainfall data looks like, I will combine a few sources into one rough-estimate table showing how much rain is classified as light, moderate and heavy. The sources are: 

* <a href="https://www.nchm.gov.bt/attachment/ckfinder/userfiles/files/Rainfall%20intensity%20classification.pdf" title="Rainfall Classification: Intensity of Rainfall in 24 Hours">The National Center for Hydrology and Meteorology (NCHM) (Bhutan)</a>  
* <a href="https://www.researchgate.net/figure/Classification-of-rainfall-based-on-intensity-for-Indian-rainfall-by-IMD-IMD-2018_tbl1_344979410" title="Classification of rainfall based on intensity for Indian rainfall by IMD (IMD, 2018)">Impact of rainfall on travel time and fuel usage for Greater Mumbai city (India)</a>  
* <a href="https://www.researchgate.net/figure/Classification-standards-of-rainfall-intensity-and-corre-sponding-records-of-rainfall_tbl2_272011912" title="Rainfall intensity division of Chinese Meteorology Department">Soil moisture response to rainfall in forestland and vegetable plot in Taihu Lake Basin, China (China)</a>  
* <a href="https://www.researchgate.net/figure/Classification-of-Rainfall-Intensity_tbl1_325580146" title="Classification of Rainfall Intensity">Wireless Sensor Network Design for Earthquake’s and Landslide’s Early Warnings (Indonesia)</a>  
* <a href="https://www.semanticscholar.org/paper/Rainfall-classification-for-flood-prediction-using-Chai-Wong/ab428c6b4385b84f1aad80cdad0ca09747aad041/figure/4" title="Rainfall Event Classification">Rainfall classification for flood prediction using meteorology data of Kuching, Sarawak, Malaysia (Malaysia)</a>  


**Intensity of Rainfall in 24 Hours**

| Term | Rainfall |
| ----------- | ----------- |
| **Light Rain** | 10 mm and less |
| **Moderate Rain** | 11 mm to 50 mm |
| **Heavy Rain** | 51 mm and more |
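
The rough table above can be applied to the `Rainfall` column with `pd.cut`. A minimal sketch: treating the bins as (0, 10], (10, 50] and (50, ∞) is my assumption, made so that fractional values such as 10.5 mm do not fall between classes; `classify_rainfall` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Bin edges follow the rough table above. Right-closed bins are an
# assumption: (0, 10] light, (10, 50] moderate, (50, inf) heavy.
BINS = [0.0, 10.0, 50.0, np.inf]
LABELS = ["Light Rain", "Moderate Rain", "Heavy Rain"]

def classify_rainfall(rain_mm):
    """Label 24-hour rainfall totals; 0 mm and missing values get no label."""
    return pd.cut(rain_mm, bins=BINS, labels=LABELS, right=True)

rain = pd.Series([0.0, 0.4, 10.0, 10.5, 50.0, 51.0, 371.0, np.nan])
classes = classify_rainfall(rain)
print(classes.tolist())
```

Dry days (0 mm) are left unlabelled on purpose, since the table only describes days with rain.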

<strong style="color:#c27767">Evaporation:</strong> while looking for more details about the "Class A" evaporation pan, I found out that several key weather variables influence the evaporation process, notably air temperature, relative humidity, wind speed and the net solar radiation absorbed by the body of water. All details are available on the web site <a href="https://www.environdata.com.au/class-a-evaporation-pan" title="Class A Evaporation Pan">Class A Evaporation Pan</a>:  

* If the **air temperature** is high, there is more energy present to convert liquid water to water vapour. 
* If the **relative humidity** is low, the air mass can more readily suspend more water vapour, hence the energy required to evaporate is less. 
* If the **incoming solar radiation & subsequent energy** imparted to the body of water (not reflected) is high, again more energy is provided into the system to aid in the transition to water vapour. 
* If the **wind speed** is high, then the air mass at the boundary layer is replaced with air not laden with water vapour (drier), which is hence more readily able to accept moisture.

<strong style="color:#c27767">Wind:</strong> wind direction is reported in sixteen different classes. The classes are visualised in <a href="https://www.researchgate.net/figure/Classification-of-wind-directions-in-a-four-sectors-b-eight-sectors-c-sixteen_fig1_221914541" title="Quick and Economic Spatial Assessment of Urban Air Quality ">this publication figure</a>, while more basic information about wind is available on this <a href="https://windy.app/blog/what-is-wind-direction.html" title="How to read wind direction">web site.</a> Wind direction is always determined by where the wind is blowing FROM, not where it is blowing towards. Each direction also corresponds to a range of degrees. Findings from both sites are presented in the table below.

From a brief look at the data, I was not able to spot any issues: all wind directions appear to be properly recorded, with no mistyped values or other accuracy problems. Wind gust speed is usually higher than the average wind speed and can reach fairly large values (from 6 km/h to 135 km/h). Wind speed categories are described in <a href="https://www.rmets.org/metmatters/beaufort-wind-scale" title="The Beaufort Wind Scale">The Beaufort Wind Scale</a>. There is one outlier at the Newcastle weather station: on 18/01/2017 the wind speed at 9 AM is recorded as 130 km/h, which is hurricane force; however, I was not able to find any hurricanes or tropical storms in this region on that date (as per <a href="https://en.wikipedia.org/wiki/2017–18_Australian_region_cyclone_season" title="2017–18 Australian region cyclone season">Wikipedia</a>). Apart from this value, wind speed in the morning and afternoon varies from 0 to 87 km/h, which is plausible.
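
A check like the one that surfaced the Newcastle reading can be scripted. This is a minimal sketch: the 118 km/h cutoff is the lower bound of Beaufort force 12 (hurricane force), `flag_extreme_wind` is a hypothetical helper, and the sample rows are invented apart from mirroring the 130 km/h reading described above.

```python
import pandas as pd

HURRICANE_FORCE_KMH = 118  # lower bound of Beaufort force 12

def flag_extreme_wind(df, cols=("WindSpeed9am", "WindSpeed3pm")):
    """Return rows where any 10-minute average wind speed reaches force 12."""
    # Missing readings compare as False, so they are never flagged.
    mask = (df[list(cols)] >= HURRICANE_FORCE_KMH).any(axis=1)
    return df.loc[mask]

# Invented rows; only the 130 km/h value mirrors the outlier described above.
toy = pd.DataFrame({
    "Date": ["2017-01-17", "2017-01-18", "2017-01-19"],
    "Location": ["Newcastle"] * 3,
    "WindSpeed9am": [9.0, 130.0, 7.0],
    "WindSpeed3pm": [15.0, None, 11.0],
})
flagged = flag_extreme_wind(toy)
print(flagged)
```

Flagged rows are candidates for manual review rather than automatic removal, since gusts of this strength do occur in the data.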

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top}
.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-fymr">Degree</th>
    <th class="tg-fymr">Description</th>
    <th class="tg-fymr">Code</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-0pky">348.75° – 11.25°</td>
    <td class="tg-0pky">north wind</td>
    <td class="tg-0pky">N</td>
  </tr>
  <tr>
    <td class="tg-0pky">11.25° – 33.75°</td>
    <td class="tg-0pky">north-northeast wind</td>
    <td class="tg-0pky">NNE</td>
  </tr>
  <tr>
    <td class="tg-0pky">33.75° – 56.25°</td>
    <td class="tg-0pky">northeast wind</td>
    <td class="tg-0pky">NE</td>
  </tr>
  <tr>
    <td class="tg-0pky">56.25° – 78.75°</td>
    <td class="tg-0pky">east-northeast wind</td>
    <td class="tg-0pky">ENE</td>
  </tr>
  <tr>
    <td class="tg-0pky">78.75° – 101.25°</td>
    <td class="tg-0pky">east wind</td>
    <td class="tg-0pky">E</td>
  </tr>
  <tr>
    <td class="tg-0pky">101.25° – 123.75°</td>
    <td class="tg-0pky">east-southeast wind</td>
    <td class="tg-0pky">ESE</td>
  </tr>
  <tr>
    <td class="tg-0pky">123.75° – 146.25°</td>
    <td class="tg-0pky">southeast wind</td>
    <td class="tg-0pky">SE</td>
  </tr>
  <tr>
    <td class="tg-0pky">146.25° – 168.75°</td>
    <td class="tg-0pky">south-southeast wind</td>
    <td class="tg-0pky">SSE</td>
  </tr>
  <tr>
    <td class="tg-0pky">168.75° – 191.25°</td>
    <td class="tg-0pky">south wind</td>
    <td class="tg-0pky">S</td>
  </tr>
  <tr>
    <td class="tg-0pky">191.25° – 213.75°</td>
    <td class="tg-0pky">south-southwest wind</td>
    <td class="tg-0pky">SSW</td>
  </tr>
  <tr>
    <td class="tg-0pky">213.75° – 236.25°</td>
    <td class="tg-0pky">southwest wind</td>
    <td class="tg-0pky">SW</td>
  </tr>
  <tr>
    <td class="tg-0pky">236.25° – 258.75°</td>
    <td class="tg-0pky">west-southwest wind</td>
    <td class="tg-0pky">WSW</td>
  </tr>
  <tr>
    <td class="tg-0pky">258.75° – 281.25°</td>
    <td class="tg-0pky">west wind</td>
    <td class="tg-0pky">W</td>
  </tr>
  <tr>
    <td class="tg-0pky">281.25° – 303.75°</td>
    <td class="tg-0pky">west-northwest wind</td>
    <td class="tg-0pky">WNW</td>
  </tr>
  <tr>
    <td class="tg-0pky">303.75° – 326.25°</td>
    <td class="tg-0pky">northwest wind</td>
    <td class="tg-0pky">NW</td>
  </tr>
  <tr>
    <td class="tg-0pky">326.25° – 348.75°</td>
    <td class="tg-0pky">north-northwest wind</td>
    <td class="tg-0pky">NNW</td>
  </tr>
</tbody></table>
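
The table above translates directly into a lookup between degrees and the sixteen compass codes. A minimal sketch with hypothetical helper names; assigning a bearing that falls exactly on a boundary (for example 11.25°) to the next sector clockwise is my assumption, as the table does not say which side is inclusive.

```python
COMPASS_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
              "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
SECTOR_WIDTH = 360 / len(COMPASS_16)  # 22.5 degrees per sector

def degrees_to_compass(deg):
    """Map a bearing in degrees to the 16-point code whose sector contains it."""
    # Shift by half a sector so that N covers 348.75 to 11.25 degrees.
    index = int(((deg % 360) + SECTOR_WIDTH / 2) // SECTOR_WIDTH) % 16
    return COMPASS_16[index]

def compass_to_degrees(code):
    """Return the central bearing (in degrees) of a 16-point compass code."""
    return COMPASS_16.index(code) * SECTOR_WIDTH

print(degrees_to_compass(350), degrees_to_compass(100), compass_to_degrees("SW"))
# prints: N E 225.0
```

Converting codes to central bearings is one way to give models a numeric view of `WindGustDir`, `WindDir9am` and `WindDir3pm`; because bearings wrap around at 360°, the raw number should not be treated as an ordinary linear scale.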

<strong style="color:#c27767">Humidity:</strong>  relative humidity (RH) is the ratio of how much water vapour is in the air to how much water vapour the air could potentially contain at a given temperature. It varies with the temperature of the air: colder air can contain less vapour, and water will tend to condense out of the air more at lower temperatures. So changing the temperature of air can change the relative humidity, even when the specific humidity remains constant. The humidity in the dataset varies from 0 to 100 %, and that looks accurate.
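
The temperature dependence described above can be illustrated numerically. This sketch uses a Magnus-type approximation for saturation vapour pressure with commonly quoted coefficients; that choice is my assumption and not necessarily the method the Bureau of Meteorology uses. The function names are hypothetical.

```python
import math

def saturation_vapour_pressure_hpa(temp_c):
    """Approximate saturation vapour pressure over water (Magnus form), in hPa."""
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))

def relative_humidity(vapour_pressure_hpa, temp_c):
    """RH in percent: actual vapour pressure over saturation vapour pressure."""
    return 100.0 * vapour_pressure_hpa / saturation_vapour_pressure_hpa(temp_c)

# Same moisture content (10 hPa vapour pressure) at two air temperatures.
for temp_c in (10.0, 25.0):
    print(temp_c, round(relative_humidity(10.0, temp_c), 1))
```

With the moisture content held fixed, the warmer case gives the lower relative humidity, which matches the 9 AM versus 3 PM pattern noted in the variables table.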

<strong style="color:#c27767">Pressure:</strong> air pressure is the force exerted by the weight of the column of air above the Earth’s surface. It depends on elevation and weather conditions. This and the details below come from the following sites: <a href="https://www.meteoswiss.admin.ch/weather/weather-and-climate-from-a-to-z/air-pressure.html" title="Air pressure">Air pressure</a>, <a href="https://www.maximum-inc.com/learning-center/what-is-atmospheric-pressure-and-how-is-it-measured/?srsltid=AfmBOopSyGYgMJ_3MO4H1bIFIkfsZqT4kZUkG0ipJ1DL4AlK7TrXT16E" title="What is Atmospheric Pressure and How is it Measured?">What is Atmospheric Pressure and How is it Measured?</a> and <a href="https://brainly.com/question/33961160" title="Brainly">What is considered high and low barometric pressure (mb)?</a>.

In general, a barometer can let you know if your immediate future will see clearing or stormy skies, or little change at all, based only on atmospheric pressure.

Here are a few examples of how to interpret barometric readings:

* When the air is dry, cool, and pleasant, the barometer reading rises.
* In general, a rising barometer means improving weather.
* In general, a falling barometer means worsening weather.
* When atmospheric pressure drops suddenly, this usually indicates that a storm is on its way.
* When atmospheric pressure remains steady, there will likely be no immediate change in the weather.

Generally, a barometric pressure of about 1013 mb (1 mb = 1 hPa) is considered average or normal at sea level. High barometric pressure, also known as a high-pressure system, is typically above 1013 mb. It is associated with clear skies, stable weather conditions, and cooler temperatures. On the other hand, low barometric pressure, also known as a low-pressure system, is typically below 1000 mb. It is associated with cloudy or stormy weather, as well as higher temperatures.
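
The rising/falling barometer rules above suggest a simple derived feature: the change between `Pressure9am` and `Pressure3pm`. This is a sketch under stated assumptions: `pressure_tendency` is a hypothetical helper, the 1 hPa "steady" band is an arbitrary choice, and the sample values are invented. Pressure also follows a regular daily cycle, so an afternoon drop on its own should not be read as a storm signal.

```python
import pandas as pd

def pressure_tendency(df, steady_band_hpa=1.0):
    """Label the 9 AM to 3 PM pressure change as rising, falling or steady."""
    change = df["Pressure3pm"] - df["Pressure9am"]
    labels = pd.Series("steady", index=df.index, dtype="object")
    labels[change > steady_band_hpa] = "rising"
    labels[change < -steady_band_hpa] = "falling"
    labels = labels.where(change.notna())  # leave missing readings unlabelled
    return change, labels

# Invented readings in hPa, for illustration only.
toy = pd.DataFrame({
    "Pressure9am": [1015.0, 1010.0, 1012.0, None],
    "Pressure3pm": [1010.0, 1014.0, 1012.5, 1000.0],
})
change, labels = pressure_tendency(toy)
print(pd.DataFrame({"change_hpa": change, "tendency": labels}))
```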

---

</span>

<span style="font-family:Cambria">

<span style="color:#c27767">

### 1.2. Data Collection
</span>


In the <a href="http://www.bom.gov.au/climate/cdo/about/about-rain-data.shtml" title="General information about historical observations">General information about historical observations</a> page I found that: 

<blockquote><em>Very few stations have a complete unbroken record of climate information. A station may have been closed, reopened, upgraded to a full weather station or downgraded to a rainfall only station during its existence causing breaks in the record for some or all elements. Some gaps may be for one element due to a damaged instrument, others may be for all elements due to the absence or illness of an observer, or perhaps the failure of an automatic weather station.</em></blockquote>

Due to a lack of time and resources, I cannot fully follow the guideline for understanding the problem domain and the underlying data, but I will do my best to understand the scientific measurements behind the recorded values. I understand that there may be inconsistencies when an observer was absent: if no data was recorded on a given day, the next recorded value *might* be an accumulated total. There is not much information about the volunteers and what exactly they do (quality control only, or full responsibility for the numbers?). If this project were my full-time job, I would certainly search for these details. For methods that the Australian Government Bureau of Meteorology does not fully explain, I consult Wikipedia for a quick explanation.
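
Breaks in a station's record, as described in the quote above, can be looked for directly in the data. This is a minimal sketch on made-up rows (not real observations); the <code>missing_days</code> helper is hypothetical, and only the column names <code>Date</code>, <code>Location</code> and <code>Rainfall</code> come from the dataset.

```python
import pandas as pd

# Made-up records with a deliberate two-day gap for Albury.
records = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02", "2008-12-05", "2008-12-01", "2008-12-02"],
    "Location": ["Albury", "Albury", "Albury", "Uluru", "Uluru"],
    "Rainfall": [0.6, 0.0, 1.0, 0.0, None],
})
records["Date"] = pd.to_datetime(records["Date"])


def missing_days(group):
    """Return the calendar days absent between a station's first and last record."""
    full_range = pd.date_range(group["Date"].min(), group["Date"].max(), freq="D")
    return full_range.difference(pd.DatetimeIndex(group["Date"]))


# Days with no row at all, per station.
gaps = {location: missing_days(group) for location, group in records.groupby("Location")}

# Rows that exist but have no rainfall value, per station.
missing_rainfall = records["Rainfall"].isna().groupby(records["Location"]).sum()

print(gaps["Albury"])
print(missing_rainfall)
```

The two checks are kept separate on purpose: a missing row and a row with a missing value are different kinds of gap, and a rainfall value that follows either one may be an accumulated total.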

</span>

<span style="font-family:Cambria">
<span style="color:#c27767">
    
### 1.3. Observational Units

</span>
    
Another important question is what the <strong style="color:#c27767"> observational units</strong> are. According to the <a href="https://vdsbook.com" title="Veridical Data Science">Veridical Data Science book</a>, observational units "are the entities for which the measurements are collected". They can be countries, people or years; sometimes they are combined (country *and* year). Since we have both a timeline (years and seasons) and a location (rainfall depends heavily on it), we could ask the main question in two ways: is it going to rain in January? Or is it going to rain if we are in Adelaide? In my case, neither question can be answered on its own, because they are related: is it going to rain tomorrow, given that it is January and we are in Adelaide? Therefore, the observational unit is most likely the combination of date and location.
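
If date and location together form the observational unit, each (<code>Date</code>, <code>Location</code>) pair should appear at most once. A minimal sketch of that check on made-up rows; on the real table the same two lines would be applied to <code>rain_data</code>, which I assume has these column names.

```python
import pandas as pd

# Made-up rows: two stations on the same day, and one station on two days.
obs = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-01", "2008-12-02"],
    "Location": ["Albury", "Uluru", "Albury"],
})

# Count rows whose (Date, Location) pair has already been seen.
duplicated_units = obs.duplicated(subset=["Date", "Location"]).sum()
is_unique_key = duplicated_units == 0

print(f"Duplicated (Date, Location) pairs: {duplicated_units}")
```

A non-zero count would mean that date and location alone do not identify an observation, and the choice of observational unit would need to be revisited.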

---
</span>

<span style="font-family:Cambria">

<span style="color:#c27767">
    
### 1.4. Data Relevance and Assumptions

</span>

After briefly looking at the data table and its variables, I believe that all of them might be relevant to the main project question. The assumptions that I have are:

* I am not sure how exactly the data was measured and collected - whether all of the stations have an automated system, or whether volunteer work is involved. There might be inaccuracies in the data if the rainfall (mm) was accumulated over several days and not observed daily, as would be expected.
* I also keep in mind that the weather sometimes changes much faster than every 24 hours - it can change completely within an hour, and such changes are not captured in the data (I am thinking about evaporation, wind direction, sunshine, clouds, maybe even pressure).
* There is always room for automatic system failures, shutdowns and database errors, and these will be present in this data as well.
* I assume the calculated boolean fields (<code>RainToday</code> and <code>RainTomorrow</code>) are presented without mistakes.
* I also thought about the main question here - will it rain tomorrow - and how my analysis answers it: if at least 1 mm of rain accumulates over 24 hours, then I say yes, bring an umbrella. But does 1 mm of rain per 24 hours actually require an umbrella? What if I am going out in the afternoon, but it rains only in the morning or at night? It feels like my analysis may not provide useful information.
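
The assumption that <code>RainToday</code> and <code>RainTomorrow</code> are free of mistakes can be spot-checked rather than taken on trust. The sketch below uses made-up rows and rests on two assumptions of mine: that <code>RainToday</code> is "Yes" when <code>Rainfall</code> exceeds 1 mm, and that <code>RainTomorrow</code> is the next day's <code>RainToday</code> at the same location. Neither rule is confirmed here.

```python
import pandas as pd

# Made-up rows for one station on three consecutive days.
weather = pd.DataFrame({
    "Date": pd.to_datetime(["2008-12-01", "2008-12-02", "2008-12-03"]),
    "Location": ["Albury"] * 3,
    "Rainfall": [0.6, 1.4, 0.0],
    "RainToday": ["No", "Yes", "No"],
    "RainTomorrow": ["Yes", "No", None],
})

THRESHOLD_MM = 1.0  # assumed cut-off for a "rainy" day

# Check 1: RainToday should agree with the rainfall threshold wherever both are known.
known = weather["Rainfall"].notna() & weather["RainToday"].notna()
expected_today = weather.loc[known, "Rainfall"].gt(THRESHOLD_MM).map({True: "Yes", False: "No"})
today_ok = (expected_today == weather.loc[known, "RainToday"]).all()

# Check 2: RainTomorrow should equal the following row's RainToday per location.
# This only holds for consecutive days; rows next to a date gap should be excluded first.
weather = weather.sort_values(["Location", "Date"])
next_day_flag = weather.groupby("Location")["RainToday"].shift(-1)
comparable = next_day_flag.notna() & weather["RainTomorrow"].notna()
tomorrow_ok = (next_day_flag[comparable] == weather.loc[comparable, "RainTomorrow"]).all()

print(f"RainToday consistent: {today_ok}, RainTomorrow consistent: {tomorrow_ok}")
```

If either check failed on the real table, the mismatching rows would be worth inspecting before training any model on <code>RainTomorrow</code>.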

---
</span>

# Data Preparation

In [2]:
!pip install --upgrade kagglehub --quiet
!pip install dill --quiet

In [1]:
import dill

In [3]:
import kagglehub

path = kagglehub.dataset_download("jsphyg/weather-dataset-rattle-package")

In [4]:
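import os

import pandas as pd

# Read the CSV from the folder returned by kagglehub (`path` from the previous cell)
# instead of a machine-specific absolute path.
rain_data = pd.read_csv(os.path.join(path, "weatherAUS.csv"))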

In [5]:
rain_data

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145455,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,SE,...,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,...,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,SE,...,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No
145458,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,SSE,...,51.0,24.0,1019.4,1016.5,3.0,2.0,15.1,26.0,No,No


In [6]:
rain_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Date           145460 non-null  object 
 1   Location       145460 non-null  object 
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object 
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object 
 10  WindDir3pm     141232 non-null  object 
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null

I will save all my variables using the dill package, so that the session can be restored later without re-running the cells above.

In [4]:
import dill

dill.dump_session()

In [2]:
dill.load_session()

In [3]:
rain_data

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145455,2017-06-21,Uluru,2.8,23.4,0.0,,,E,31.0,SE,...,51.0,24.0,1024.6,1020.3,,,10.1,22.4,No,No
145456,2017-06-22,Uluru,3.6,25.3,0.0,,,NNW,22.0,SE,...,56.0,21.0,1023.5,1019.1,,,10.9,24.5,No,No
145457,2017-06-23,Uluru,5.4,26.9,0.0,,,N,37.0,SE,...,53.0,24.0,1021.0,1016.8,,,12.5,26.1,No,No
145458,2017-06-24,Uluru,7.8,27.0,0.0,,,SE,28.0,SSE,...,51.0,24.0,1019.4,1016.5,3.0,2.0,15.1,26.0,No,No
