# CMSC320 - Final Project
## Polynomial Fitting - Daily Temperature of Major Cities
(https://www.kaggle.com/datasets/sudalairajkumar/daily-temperature-of-major-cities)

In [103]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio

In [104]:
# pio.templates.default = "simple_white"
np.random.seed(0)

## Step 1 - Loading and preprocessing the city temperature dataset

The preprocessing does the following:
* drop duplicated rows and rows with missing values
* drop rows with implausible values, such as temperature measurements below -15 degrees Celsius
* add a column called `DayOfYear`, so that the same time of year can be compared across different places. This column is the feature used for the polynomial fitting.
* keep the year in its own `Year` column, stored as a string so that plots treat it as a discrete category

In [105]:
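def load_data(filename):
    # Parse the Date column while reading the CSV.
    temp_df = pd.read_csv(filename, parse_dates=['Date'], dayfirst=True)
    # Drop duplicated rows and rows with missing values.
    temp_df = temp_df.drop_duplicates().dropna()
    # Drop implausible measurements (below -15 degrees Celsius); copy so that
    # adding columns below does not trigger SettingWithCopyWarning.
    temp_df = temp_df[temp_df["Temp"] >= -15].copy()
    # Day of the year (1-366): the feature used for the polynomial fitting.
    temp_df["DayOfYear"] = pd.to_datetime(temp_df["Date"], errors='coerce').dt.dayofyear
    # Store Year as a string so that plots treat it as a discrete category.
    temp_df["Year"] = temp_df["Year"].astype(str)
    return temp_df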

# Load city daily temperature dataset and preprocess data.
df = load_data("./City_Temperature.csv")

df

| | Country | City | Date | Year | Month | Day | Temp | DayOfYear |
|---|---|---|---|---|---|---|---|---|
| 0 | South Africa | Capetown | 1995-01-01 | 1995 | 1 | 1 | 19.333333 | 1 |
| 1 | South Africa | Capetown | 1995-01-02 | 1995 | 1 | 2 | 19.888889 | 2 |
| 2 | South Africa | Capetown | 1995-01-03 | 1995 | 1 | 3 | 19.388889 | 3 |
| 3 | South Africa | Capetown | 1995-01-04 | 1995 | 1 | 4 | 20.833333 | 4 |
| 4 | South Africa | Capetown | 1995-01-05 | 1995 | 1 | 5 | 21.444444 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32434 | Jordan | Amman | 2020-05-09 | 2020 | 5 | 9 | 17.555556 | 130 |
| 32435 | Jordan | Amman | 2020-05-10 | 2020 | 5 | 10 | 17.055556 | 131 |
| 32436 | Jordan | Amman | 2020-05-11 | 2020 | 5 | 11 | 20.666667 | 132 |
| 32437 | Jordan | Amman | 2020-05-12 | 2020 | 5 | 12 | 24.444444 | 133 |


## Step 2 - Explore the data for a specific country (Israel)
### Relation between daily temperature and `DayOfYear`:

Let us subset the dataset to contain samples only from Israel, so we can investigate how the average daily temperature (`Temp` column) changes as a function of `DayOfYear`.

In [106]:
df_israel = df[df["Country"] == "Israel"]
df_israel_avg = df_israel.groupby(["Year", "DayOfYear"], as_index=False)["Temp"].mean()
fig = px.scatter(df_israel_avg, x="DayOfYear", y="Temp", color="Year",
                 title="Figure (1)    Average daily temperature as a function of the DayOfYear")
fig.show()

Based on this plot, the data behaves very similarly across the different years: it has a wave-like shape, with the highest temperatures around day ~200 of the year.

Since the curve has three extreme points, we can assume that a polynomial of degree 3 or 4 might be suitable for this data.
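The degree guess rests on a general fact: the derivative of a degree-k polynomial has degree k - 1, so the polynomial has at most k - 1 interior turning points, and the number of visible extrema gives a rough lower bound on the degree. The sketch below checks this on a synthetic cosine-shaped year. The curve, the rescaling of the days, and the helper `count_critical_points` are assumptions made for illustration and are not part of the notebook's data or code.

```python
import numpy as np

# Synthetic stand-in for a yearly temperature curve (NOT the notebook's data):
# a single warm peak near day ~200 and cooler values at both ends of the year.
days = np.arange(1, 366)
temps = 20 + 8 * np.cos(2 * np.pi * (days - 200) / 365)

# Rescale days to roughly [-1, 1] so the least-squares fit stays well conditioned.
x = (days - 183) / 182


def count_critical_points(poly, lo, hi):
    """Count real roots of the derivative of `poly` strictly inside (lo, hi)."""
    roots = poly.deriv().roots
    real_roots = roots[np.isclose(roots.imag, 0)].real
    return int(np.sum((real_roots > lo) & (real_roots < hi)))


critical_points = {}
for k in range(1, 7):
    poly = np.poly1d(np.polyfit(x, temps, k))
    critical_points[k] = count_critical_points(poly, -1, 1)
    print(f"degree {k}: {critical_points[k]} critical point(s) inside the year")
```

Whatever the data, the count for degree k can never exceed k - 1, which is why a curve with several interior extrema cannot be captured by a very low degree.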

### The standard deviation of the daily temperatures for each month:

Now we will group the samples by `Month` and create a bar plot showing, for each month, the standard deviation of the daily temperatures.

In [107]:
df_israel_months = df_israel.groupby(["Month"], as_index=False)["Temp"].std()
fig = px.bar(df_israel_months, x='Month', y='Temp',
             labels={'Temp': 'std'},
             title="Figure (2)    Standard Deviation Of The Daily Temperatures Over Months")
fig.show()

Suppose we fit a polynomial model (with the correct degree) over data sampled uniformly at random from this dataset, and then use it to predict temperatures from random days across the year. 

Based on this graph, I would expect that this model won't predict equally well across all months.
In months with low variance (June [6] - September [9]), I would expect the model to perform better and to fit closer to reality. I assume that it will do worst in March and April (months 3 and 4), which have high variability.

This is under the assumption that the test set is generated from the same distribution as the training set.
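The expectation above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not on the notebook's dataset: the yearly curve, the noise levels (4.0 in months 3-4, 1.0 elsewhere), the degree 5, and all variable names are assumptions chosen only to show how the per-month test error tracks the per-month noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, NOT the notebook's dataset):
# the same smooth yearly curve for every sample, but noisier in months 3-4.
dates = pd.date_range("2001-01-01", "2001-12-31", freq="D")
toy = pd.DataFrame({"DayOfYear": np.tile(dates.dayofyear, 20),
                    "Month": np.tile(dates.month, 20)})
signal = 20 + 8 * np.cos(2 * np.pi * (toy["DayOfYear"] - 200) / 365)
noise_std = np.where(toy["Month"].isin([3, 4]), 4.0, 1.0)
toy["Temp"] = signal + rng.normal(0, noise_std)

# Fit one polynomial on a random ~75% of the rows and evaluate on the rest.
is_train = rng.random(len(toy)) < 0.75
x = ((toy["DayOfYear"] - 183) / 182).to_numpy()   # rescaled to roughly [-1, 1]
y = toy["Temp"].to_numpy()
poly = np.poly1d(np.polyfit(x[is_train], y[is_train], 5))

# Mean squared test error per month.
test = toy.loc[~is_train, ["Month"]].copy()
test["sq_err"] = (poly(x[~is_train]) - y[~is_train]) ** 2
mse_by_month = test.groupby("Month")["sq_err"].mean()
print(mse_by_month.round(2))
```

Even with a well-chosen degree, the error in the noisy months cannot drop below the noise variance there, so the per-month error roughly mirrors the squared standard deviation of each month.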

## Step 3 - Explore differences between countries

And now, back to the full dataset: we will group the samples according to `Country` and `Month`, and calculate the average and standard deviation of the temperature. 

We will plot a line plot of the average monthly temperature, with error bars, color coded by country.

In [108]:
# Named aggregation; renaming via a dict on a single column is not supported in pandas >= 1.0.
df_3 = df.groupby(["Country", "Month"])["Temp"].agg(avgTemp='mean', std='std').reset_index()
fig = px.line(df_3, x="Month", y="avgTemp", color="Country", error_y="std",
              labels={'avgTemp': 'Average Temperature'},
              title="Figure (3)    Average monthly temperature as a function of the Month")
fig.show()

Based on the graph above, not all countries share the same pattern of average monthly temperature as a function of the month.

According to this plot, we expect that a model fitted on Israel's data only will perform very well on Jordan, whereas it likely won't work on South Africa or on The Netherlands. This is because South Africa's trend is opposite to that of the other three countries (e.g. months 6-9 are relatively hot in Israel but are the cold period in South Africa), while The Netherlands' `avgTemp` is quite far from Israel's values. The Netherlands' distribution has a shape similar to Israel's, but is about 9 degrees lower at any time of the year. Thus, the model fitted for Israel could be reused for The Netherlands by simply adjusting the value of the intercept.
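The intercept adjustment mentioned above can be sketched as follows. The two curves are synthetic assumptions (country B is simply country A shifted down by 9 degrees), and the names `model_a`, `model_b`, and `offset` are hypothetical; this is not the notebook's data or fitted model.

```python
import numpy as np

# Synthetic curves (assumptions, NOT the notebook's data): country B has the same
# seasonal shape as country A but is 9 degrees colder all year.
days = np.arange(1, 366)
x = (days - 183) / 182                      # rescaled day of year
temp_a = 20 + 8 * np.cos(2 * np.pi * (days - 200) / 365)
temp_b = temp_a - 9

# Model fitted on country A only.
model_a = np.poly1d(np.polyfit(x, temp_a, 5))

# Estimate the offset as the mean residual on B and fold it into the intercept.
offset = np.mean(temp_b - model_a(x))
model_b = model_a + offset                  # poly1d + scalar shifts the constant term

mse_before = np.mean((model_a(x) - temp_b) ** 2)
mse_after = np.mean((model_b(x) - temp_b) ** 2)
print(f"offset={offset:.2f}, MSE before={mse_before:.2f}, after={mse_after:.2f}")
```

Estimating the offset needs at least some observations from the target country, so this is a cheap form of recalibration rather than a way to avoid using that country's data entirely.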

## Step 4 - Fitting model for different values of the degree hyperparameter 

Over the subset containing observations only from Israel we will do the following:
* Randomly split the dataset into a training set (75%) and test set (25%).
* For every value k ∈ [1,10], fit a polynomial model of degree k using the training set.
* Record the loss of the model over the test set.

Then we will create a bar plot showing the test error recorded for each value of k. 
This is in order to find which value of k best fits the data. 

In [109]:
# Random 75% / 25% train/test split of the Israel subset.
train_X, test_X, train_y, test_y = train_test_split(df_israel["DayOfYear"], df_israel["Temp"], test_size=0.25)
losses = {'k': [], 'test_error': [], 'test_error_rounded': []}
for k in range(1, 11):
    # Least-squares fit of a degree-k polynomial on the training set.
    z = np.poly1d(np.polyfit(train_X.values, train_y, k))
    pred_y = z(test_X.values)
    # Test loss: mean squared error (arguments are y_true, y_pred).
    error = mean_squared_error(test_y, pred_y)
    losses['k'].append(k)
    losses['test_error'].append(error)
    losses['test_error_rounded'].append(round(error, 2))
fig = px.bar(losses, x='k', y='test_error', text='test_error_rounded',
             title="Figure (4)    Test Error as a function of Polynomial Degree")
fig.show()

Based on this, I would choose k=5 as the value that best fits and describes the data (it gives the lowest error; above this value the model appears to overfit).
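A single random split can make the choice of k somewhat sensitive to which rows land in the test set. As a hedged alternative, the sketch below selects the degree by 5-fold cross-validation with a scikit-learn pipeline. It swaps in `PolynomialFeatures` plus `LinearRegression` for the `np.polyfit` call used in this notebook, and it runs on synthetic data; the curve, noise level, sample size, and the added `StandardScaler` step are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the Israel subset (an assumption, NOT the real data).
day_of_year = rng.integers(1, 366, size=2000)
temp = (20 + 8 * np.cos(2 * np.pi * (day_of_year - 200) / 365)
        + rng.normal(0, 2, size=2000))
X = day_of_year.reshape(-1, 1).astype(float)

cv_mse = {}
for k in range(1, 11):
    # Scaling first keeps the high powers of DayOfYear numerically manageable.
    model = make_pipeline(StandardScaler(),
                          PolynomialFeatures(degree=k, include_bias=False),
                          LinearRegression())
    scores = cross_val_score(model, X, temp, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[k] = -scores.mean()

best_k = min(cv_mse, key=cv_mse.get)
print({k: round(v, 2) for k, v in cv_mse.items()}, "best k:", best_k)
```

When several degrees give nearly identical cross-validated error, preferring the smallest of them is a common way to limit overfitting.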

## Step 5 - Evaluating the fitted model on different countries

Now we will fit a model over the entire subset of records from Israel, using the degree k=5 chosen above.

We will then create a bar plot showing the model's error over each of the other countries.

In [110]:
# Fit a degree-5 polynomial on all observations from Israel.
model = np.poly1d(np.polyfit(df_israel["DayOfYear"].values, df_israel["Temp"], 5))
countries = ["South Africa", "Jordan", "The Netherlands"]
losses = {'Country': [], 'test_error': []}
for c in countries:
    df_cur_country = df[df["Country"] == c]
    losses['Country'].append(c)
    # Mean squared error of the Israel-fitted model on this country's data.
    error = mean_squared_error(df_cur_country["Temp"], model(df_cur_country["DayOfYear"].values))
    losses['test_error'].append(error)
fig = px.bar(losses, x='Country', y='test_error',
             title="Figure (5)    Error of the model fitted on Israel (k=5) over other countries")
fig.show()

As we expected, the model fitted over the subset of observations from Israel performed best on Jordan, and in general it performed worse on data from the other countries. As we saw in Figure 3, the distribution of temperatures in Jordan resembles that of Israel, which is why the model performed best on Jordan out of the three countries.

The distributions of South Africa and The Netherlands were further from that of Israel, and therefore the fitted model performed poorly on them.

Although the distribution of the temperature data from The Netherlands has a very similar shape to that of Israel, and the distribution of the observations from South Africa is very different, the model performed better on South Africa. This is probably because, on average, the observations from Israel are closer to those of South Africa. Hence, although the model does not correctly mimic the distribution of observations from South Africa, the errors are still smaller than for the observations from The Netherlands.
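One way to see why a constant offset can cost more than a wrong shape is that the mean squared error splits exactly into the squared mean residual plus the variance of the residual. The sketch below illustrates this with three synthetic curves; the means, amplitudes, and the helper `mse_parts` are assumptions picked for illustration and do not reproduce the real countries' values.

```python
import numpy as np

days = np.arange(1, 366)
seasonal = 8 * np.cos(2 * np.pi * (days - 200) / 365)

# All three curves are synthetic assumptions, NOT the notebook's data.
model_pred = 20 + seasonal             # stands in for the Israel-fitted model
same_shape_colder = 11 + seasonal      # same shape, 9 degrees lower all year
opposite_phase = 17 - 0.4 * seasonal   # reversed seasons, closer mean, milder swings


def mse_parts(pred, actual):
    """Return (MSE, squared mean residual, variance of the residual)."""
    resid = pred - actual
    return np.mean(resid ** 2), np.mean(resid) ** 2, np.var(resid)


results = {}
for name, actual in [("same shape, colder", same_shape_colder),
                     ("opposite phase", opposite_phase)]:
    results[name] = mse_parts(model_pred, actual)
    mse, offset_sq, shape_var = results[name]
    print(f"{name}: MSE={mse:.1f} = offset^2 {offset_sq:.1f} "
          f"+ shape variance {shape_var:.1f}")
```

Under these assumed numbers the shifted curve's error comes entirely from the offset term, while the opposite-phase curve's error comes mostly from the shape term and still ends up smaller overall.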