# Exercise - 1

The goal of this exercise is to calculate and plot seasonal weather anomalies to see how temperatures in each season have changed over the past 100+ years. The data is the daily temperature record from the Sodankyla weather station in northern Finland.


This exercise uses the `pandas` and `matplotlib` modules, so the first step is to import them. From `matplotlib` we only need the `pyplot` sub-package, so we import just that.

In [1]:
# Import the necessary modules

import pandas as pd
import matplotlib.pyplot as plt

## Task 1 - Reading, cleaning and preparing the data

To read the given [data](data/2315676.txt) from the **data** folder in the working directory, we use the `pd.read_csv` function of the pandas library.

In [2]:
fp = 'data/2315676.txt'  # fp means filepath

data = pd.read_csv(fp, na_values=-9999, skiprows=[1], delim_whitespace=True)

- The `pd.read_csv` function requires the path of the file to read, which we stored in **fp**.
- Missing values are coded as `-9999` in this data, so we pass that value through the `na_values` argument and pandas reads it as `NaN`.
- Inspecting the raw file shows that the line directly below the header contains only dashes (`-------`) and no data. The `skiprows=[1]` argument skips that line; the index is zero-based, so `1` refers to the second line of the file.
- The columns are separated by whitespace rather than commas or another delimiter, so we pass `delim_whitespace=True` to have the file parsed into a data frame (table format). pandas 2.2 and later deprecate this argument in favour of `sep=r'\s+'`, which has the same effect.
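The effect of these arguments can be checked on a small in-memory sample. The following is only a sketch: the sample text is made up to imitate the layout described above (it is not the exercise file), the name `demo` is hypothetical, and `sep=r"\s+"` is used in place of `delim_whitespace=True`.

```python
import io
import pandas as pd

# Hypothetical sample imitating the file layout: a header line,
# a line of dashes, then whitespace-separated records.
sample = """STATION DATE TAVG TMAX TMIN
------- ---- ---- ---- ----
FI000007501 19080101 -9999 2 -37
FI000007501 19080102 -9999 6 -26
FI000007501 20201004 43 47 37
"""

demo = pd.read_csv(
    io.StringIO(sample),
    sep=r"\s+",       # split on any run of whitespace
    skiprows=[1],     # skip line index 1, the dashes below the header
    na_values=-9999,  # read -9999 as missing (NaN)
)

print(demo)
print(demo["TAVG"].isna().sum())  # number of missing TAVG values
```

Without `skiprows=[1]` the dashes would be read as a data row, and every column would then be parsed as text instead of numbers.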

Now, inspect the data to understand its rows, columns and structure.

In [3]:
data

|  | STATION | STATION_NAME | DATE | TAVG | TMAX | TMIN |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 19080101 | NaN | 2.0 | -37.0 |
| 1 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 19080102 | NaN | 6.0 | -26.0 |
| 2 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 19080103 | NaN | 7.0 | -27.0 |
| 3 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 19080104 | NaN | -3.0 | -27.0 |
| 4 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 19080105 | NaN | 4.0 | -36.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 41060 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 20201003 | 47.0 | 51.0 | NaN |
| 41061 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 20201004 | 43.0 | 47.0 | 37.0 |
| 41062 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 20201005 | 42.0 | NaN | 37.0 |
| 41063 | GHCND:FI000007501 | SODANKYLA-AWS-FI | 20201006 | 45.0 | 46.0 | 43.0 |


We can see the data has **41065** records with **6** columns.

Use the `DataFrame.describe()` method to get summary statistics for the numeric columns.

In [4]:
data.describe()

|  | DATE | TAVG | TMAX | TMIN |
| --- | --- | --- | --- | --- |
| count | 41065.0 | 21222.0 | 40296.0 | 39119.0 |
| mean | 19639600.0 | 31.696211 | 39.034296 | 22.315985 |
| std | 325362.0 | 20.809623 | 20.905912 | 22.18709 |
| min | 19080100.0 | -53.0 | -47.0 | -57.0 |
| 25% | 19360300.0 | 19.0 | 26.0 | 9.0 |
| 50% | 19640610.0 | 33.0 | 38.0 | 27.0 |
| 75% | 19920720.0 | 48.0 | 55.0 | 39.0 |
| max | 20201010.0 | 78.0 | 90.0 | 67.0 |


From the `count` row above, we can see that `TAVG` has only **21222** non-missing values out of 41065 records, while `TMAX` (40296) and `TMIN` (39119) have far more. We can therefore use the maximum and minimum temperatures to estimate the average temperature for records where `TAVG` is missing.
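The same comparison can be made directly by counting missing values per column. The sketch below uses a hypothetical four-row frame named `demo` in place of the exercise's `data`; the variable names `missing` and `fillable` are also illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the exercise's `data`
demo = pd.DataFrame({
    "TAVG": [np.nan, np.nan, 43.0, 42.0],
    "TMAX": [2.0, 6.0, 47.0, np.nan],
    "TMIN": [-37.0, -26.0, 37.0, 37.0],
})

# Number of missing values in each temperature column
missing = demo[["TAVG", "TMAX", "TMIN"]].isna().sum()
print(missing)

# Rows where TAVG is missing but both TMAX and TMIN are present,
# i.e. rows for which an estimate can actually be computed
fillable = demo["TAVG"].isna() & demo["TMAX"].notna() & demo["TMIN"].notna()
print(fillable.sum())
```

Run against the real `data`, the second count would show how many of the missing `TAVG` values can be filled from `TMAX` and `TMIN`.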

We create a new function named `estimates` that calculates the estimated average temperature for a single record.

In [5]:
def estimates(row):
    # Use the midpoint of TMAX and TMIN only when TAVG is missing.
    # The result is still NaN if TMAX or TMIN is missing as well.
    if pd.isnull(row.TAVG):
        return (row.TMAX + row.TMIN) / 2
    else:
        return row.TAVG
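To see what the function returns, it can be called on individual records. In this sketch the function is repeated so the block runs on its own; the first two records copy values from the table shown earlier, and the third record is hypothetical.

```python
import numpy as np
import pandas as pd

def estimates(row):
    # Same logic as the function above: fall back to the TMAX/TMIN midpoint
    if pd.isnull(row.TAVG):
        return (row.TMAX + row.TMIN) / 2
    return row.TAVG

# Values from the first record of the data (1908-01-01): TAVG is missing
missing_tavg = pd.Series({"TAVG": np.nan, "TMAX": 2.0, "TMIN": -37.0})
# Values from the record for 2020-10-04: TAVG is present
has_tavg = pd.Series({"TAVG": 43.0, "TMAX": 47.0, "TMIN": 37.0})
# Hypothetical record with both TAVG and TMIN missing
no_estimate = pd.Series({"TAVG": np.nan, "TMAX": 5.0, "TMIN": np.nan})

print(estimates(missing_tavg))  # (2.0 + -37.0) / 2 = -17.5
print(estimates(has_tavg))      # TAVG is returned unchanged: 43.0
print(estimates(no_estimate))   # nan, because TMIN is missing too
```

The last case shows that some records remain without an estimate even after the function is applied.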

Now we use this function to calculate the estimated average temperatures and add them to our data frame `data` with the `apply` function of pandas.

We create a new column labelled `TAVG_EST` for the estimated average temperatures so that the original data stays unchanged.

In [6]:
data['TAVG_EST'] = data.apply(estimates, axis='columns')

The `apply` function takes two arguments here:

- the function to apply: `estimates`.
- the axis along which to apply it: `'columns'` passes each row to `estimates` as a Series indexed by the column labels, which is why the function can refer to `TAVG`, `TMAX` and `TMIN` by name.
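Row-wise `apply` calls a Python function once per record, which can be slow on tens of thousands of rows. As an alternative sketch (not the method used in this exercise), the same result can be obtained with the vectorized `Series.fillna`. The three-row frame `demo` and the names `by_apply` and `by_fillna` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for `data`
demo = pd.DataFrame({
    "TAVG": [np.nan, 43.0, np.nan],
    "TMAX": [2.0, 47.0, 6.0],
    "TMIN": [-37.0, 37.0, np.nan],
})

def estimates(row):
    if pd.isnull(row.TAVG):
        return (row.TMAX + row.TMIN) / 2
    return row.TAVG

# Row-wise: each row is handed to `estimates` as a Series
by_apply = demo.apply(estimates, axis="columns")

# Vectorized: compute the midpoint for every row at once,
# then use it only where TAVG is missing
by_fillna = demo["TAVG"].fillna((demo["TMAX"] + demo["TMIN"]) / 2)

# Raises an AssertionError if the two results differ
pd.testing.assert_series_equal(by_apply, by_fillna, check_names=False)
print(by_fillna)
```

Both approaches leave `NaN` in rows where `TAVG` is missing and `TMAX` or `TMIN` is missing as well.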