## Apparent Temperature Prediction

Predicting the apparent temperature of a particular location from features derived from historical weather data

___

* Type of Machine Learning Method
    * Supervised Learning
        * Regression
            * Multiple Linear Regression (Least Squares Method)
                * Test Accuracy of 98%
                * RMSE loss of 0.988
___
* Dataset Used
    * [Weather Temperature Dataset](https://www.kaggle.com/budincsevity/szeged-weather "Gotten from Kaggle")
___

### Steps to Solve Problem
* Import Dataset
* Exploratory Data Analysis
* Feature Engineering
    * Data Cleaning
    * Missing Data Imputation
    * Feature Encoding
* Model Build
    * Train / Test Data split
    * Model Initiation and Fitting
    * Test predictions
* Model Performance
    * RMSE
    * R^2 score
        

### Import Dataset

We first import the libraries we need right now; we will import others as and when they are needed

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

Importing dataset

In [None]:
weather_df = pd.read_csv("../input/szeged-weather/weatherHistory.csv")

In [None]:
weather_df.head()

Checking a subset of the data

From the preview above, we can see that we have 11 features, with **Apparent Temperature** as our target variable

In [None]:
# Features
weather_df.columns

### Exploratory Data Analysis

From the information below, we can see that only **Precip Type** has missing values, which we will have to impute later

In [None]:
weather_df.info()

In [None]:
weather_df.isna().sum() 

Let's check the correlation of the features to the target below

In [None]:
plt.figure(figsize=[10,5])
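# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
sns.heatmap(weather_df.corr(numeric_only=True), annot=True)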

Now we will check the descriptive statistics of our numerical features

In [None]:
weather_df.describe()

From the statistics above, I suspect that Temperature, Humidity and Wind Speed contain outliers, and it seems that Loud Cover has no value other than zero
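One simple way to turn the outlier suspicion into a number is the interquartile range (IQR) rule. The sketch below is only an illustration on toy data; the helper name `iqr_outlier_count` and the factor `k = 1.5` are assumptions of mine, not something used elsewhere in this notebook.

```python
import pandas as pd

def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy data: five similar readings and one extreme value
toy = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
print(iqr_outlier_count(toy))
```

The same helper could be applied to any numeric column of `weather_df`, for example `weather_df["Humidity"]`, to see how many values fall outside the fences.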

Now let's explore our categorical features

In [None]:
category_features = [feature for feature in weather_df.columns if weather_df[feature].dtype == "object"]
category_features

I tried converting Formatted Date into a datetime object, but it produced a **TypeError**, so we will parse it manually

In [None]:
# weather_df["Formatted Date"].dt.year
# pd.DatetimeIndex(weather_df["Formatted Date"]).year
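One possible cause of the error is that the timestamps carry different UTC offsets (for example `+0200` in summer and `+0100` in winter), so pandas cannot give the column a single timezone. A hedged alternative, shown here on toy strings rather than on the real column, is to pass `utc=True` to `pd.to_datetime` so that everything is converted to UTC first. The second date string is made up for illustration.

```python
import pandas as pd

# Toy strings in the same format as "Formatted Date", with two different UTC offsets
dates = pd.Series([
    "2006-04-01 00:00:00.000 +0200",
    "2006-12-01 00:00:00.000 +0100",  # hypothetical winter timestamp
])

# utc=True converts every timestamp to UTC, so the result has one timezone-aware dtype
parsed = pd.to_datetime(dates, utc=True)

# The .dt accessor now works
print(parsed.dt.year.tolist())
```

Note that converting to UTC shifts each timestamp by its offset, so a sample recorded just after midnight on 1 January local time would land in the previous year. The manual string parsing below keeps the local year instead.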

In [None]:
# Function to parse year out of a string
def year(sample):
    return sample.split("-")[0]

In [None]:
# Testing Function
year("2006-04-01 00:00:00.000 +0200")

In [None]:
# Applying function to Formatted Date column and storing results in another column
weather_df["Year"] = weather_df["Formatted Date"].apply(year)

The Year column will be useful when splitting the data (assuming time series data); we will come back to this later

In [None]:
weather_df[["Formatted Date", "Year"]].sample(5)

Now on to the **Summary** feature; we will encode these values later

In [None]:
weather_df["Summary"].value_counts()

**Precip Type** feature

In [None]:
weather_df["Precip Type"].value_counts()

**Daily Summary** contains too many distinct values, so we will just drop this feature

This feature alone contains 214 unique values

In [None]:
weather_df["Daily Summary"].nunique()

In [None]:
weather_df["Daily Summary"].value_counts()

In [None]:
weather_df.drop("Daily Summary", axis=1, inplace=True)

### Feature Engineering
#### Missing Data Imputation

**Precip Type** is missing about 517 values. Since it is a categorical feature with only two values, "rain" or "snow", we will fill the missing values with the most frequently occurring value

In [None]:
weather_df.isna().sum() 

In [None]:
# Importing imputer
from sklearn.impute import SimpleImputer 

In [None]:
# Initializing Imputer and setting most frequent as strategy
imputer = SimpleImputer(strategy="most_frequent")

In [None]:
# Finding the mode and filling missing data with it
# fit_transform returns a 2D array of shape (n, 1); ravel() flattens it to fit a single column
weather_df["Precip Type"] = imputer.fit_transform(weather_df[["Precip Type"]]).ravel()

Now there are no more missing values

In [None]:
weather_df.isna().sum() 
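To make the imputation step concrete, here is a small sketch on toy data showing that `SimpleImputer(strategy="most_frequent")` fills the gaps with the mode, and that plain pandas `fillna` with `mode()` gives the same result. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing entries; "rain" is the most frequent value
toy = pd.DataFrame({"Precip Type": ["rain", "snow", np.nan, "rain", np.nan]})

# scikit-learn route, as used in this notebook
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy[["Precip Type"]]).ravel()

# pandas-only route that should give the same values
same = toy["Precip Type"].fillna(toy["Precip Type"].mode()[0])

print(list(filled))
print(list(same))
```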

#### Feature Encoding

In [None]:
# Importing OneHotEncoder to encode nominal categorical values
from sklearn.preprocessing import OneHotEncoder

In [None]:
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [None]:
# Fitting the encoder on the categorical columns
encoder.fit(weather_df[["Summary", "Precip Type"]])

In [None]:
# List of Encoded Categories
encoder.categories_

In [None]:
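# Prefixing each encoded value with its feature name
# get_feature_names_out replaces get_feature_names, which newer scikit-learn versions no longer provide
encoded_cols = list(encoder.get_feature_names_out(["Summary", "Precip Type"]))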
# print(encoded_cols)

In [None]:
# Adding encoded features to dataset
weather_df[encoded_cols] = encoder.transform(weather_df[["Summary", "Precip Type"]])

In [None]:
weather_df.head()
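As a small illustration of what the encoder does, and of what `handle_unknown='ignore'` means, the sketch below fits on a toy column and then transforms a category it has never seen. The value "hail" is hypothetical and does not appear in the dataset. The default sparse output is converted with `.toarray()` so the snippet does not depend on the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training column with the two categories seen in "Precip Type"
train = pd.DataFrame({"Precip Type": ["rain", "snow", "rain"]})
# New data containing one known and one unseen (hypothetical) category
new = pd.DataFrame({"Precip Type": ["snow", "hail"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow encoder.categories_, i.e. ["rain", "snow"]
encoded = encoder.transform(new).toarray()
print(encoded)
```

The unseen category is encoded as a row of zeros instead of raising an error, which is useful when the test data may contain values that were absent from the training data.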

### Model Build

#### Train / Test Data Split
We are going to split the data based on the year in which the samples were recorded. This is not the preferred method, but since we could not parse the dates with the datetime functions, we will use it here

In [None]:
## Counting values based on year recorded
weather_df["Year"].value_counts()

The Year feature is still a string, so we need to convert it into a numeric data type

In [None]:
weather_df["Year"].dtype

In [None]:
weather_df["Year"] = weather_df["Year"].astype("int64")

In [None]:
weather_df["Year"].dtype

We split the data by the year of recording: the training data gets all the samples from before 2016, and the test data contains only the samples from 2016

Train Data contains 87699 rows while Test data contains 8784 rows

In [None]:
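# .copy() gives independent DataFrames, so the in-place drops below do not act on a view of weather_df
train_df = weather_df[weather_df["Year"] < 2016].copy()
test_df = weather_df[weather_df["Year"] == 2016].copy()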
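A couple of quick sanity checks can confirm that a year-based split behaves as intended: every row ends up in exactly one of the two sets, and all training years come before the test year. The sketch below runs these checks on toy data; the column values are made up.

```python
import pandas as pd

# Toy frame standing in for weather_df
toy = pd.DataFrame({
    "Year": [2014, 2015, 2015, 2016, 2016],
    "Temperature (C)": [9.5, 12.1, 3.4, 18.0, 7.7],
})

toy_train = toy[toy["Year"] < 2016].copy()
toy_test = toy[toy["Year"] == 2016].copy()

# No row is lost or duplicated, and training data strictly precedes test data
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["Year"].max() < toy_test["Year"].min()
print(len(toy_train), len(toy_test))
```

Keeping the test set strictly later in time than the training set avoids evaluating the model on periods it has effectively already seen.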

Now let's drop the features we aren't using

In [None]:
train_df.drop(["Formatted Date", "Summary", "Precip Type", "Loud Cover","Year"], axis=1, inplace=True)
test_df.drop(["Formatted Date", "Summary", "Precip Type", "Loud Cover","Year"], axis=1, inplace=True)

Separating targets from features

In [None]:
X_train = train_df.drop("Apparent Temperature (C)", axis=1)
y_train = train_df["Apparent Temperature (C)"]

In [None]:
X_test = test_df.drop("Apparent Temperature (C)", axis=1)
y_test = test_df["Apparent Temperature (C)"]

#### Model Initiation and Fitting

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train, y_train)

In [None]:
# Storing Coefficients in a DataFrame
coefficient = pd.DataFrame({
    "Coef": model.coef_
},
index = X_train.columns
)

In [None]:
coefficient

In [None]:
# Intercept
model.intercept_
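To connect the fitted coefficients and intercept with the least squares method named at the top, the sketch below solves the same kind of problem directly with `np.linalg.lstsq` on synthetic data and compares the result with `LinearRegression`. The data and the true coefficients (3, 2, -1) are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Least squares by hand: prepend a column of ones so the first weight is the intercept
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# The same fit with scikit-learn
toy_model = LinearRegression().fit(X, y)

print(beta)                                 # intercept followed by the two coefficients
print(toy_model.intercept_, toy_model.coef_)
```

Both routes minimise the sum of squared residuals, so they should agree up to floating point error.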

#### Model Prediction

In [None]:
predictions = model.predict(X_test)

In [None]:
# Storing Targets and predictions to compare values
compare_df = pd.DataFrame({
    "Target" : y_test,
    "Prediction" : predictions
})

As we can see below, the predictions were off by a little bit. Let's check the performance of the model to see how much loss we encountered

In [None]:
compare_df.sample(10)

### Model Performance

In [None]:
# Importing Metrics
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

Calculating Root Mean Square Error (RMSE) to check loss

In [None]:
#RMSE
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print(f"The RMSE of our model is {rmse:.4f}")

We have an RMSE of **0.9886**, meaning the predictions are typically off by about 1 °C. That is quite small, so our model did quite well. Let us now check the R^2 score of our model
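For reference, the RMSE is just the square root of the mean of the squared errors, so it can be computed by hand with NumPy. The sketch below uses four made-up target and prediction values, not the notebook's results.

```python
import numpy as np

# Toy targets and predictions (invented values)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Squared errors are 0.25, 0, 0.25 and 1, so their mean is 0.375
toy_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"{toy_rmse:.4f}")
```

Because of the square root, the RMSE is expressed in the same unit as the target, which here is degrees Celsius.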

In [None]:
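# R2 Score
r2 = r2_score(y_test, predictions)
# Print r2 itself (not rmse) as a percentage
print(f"The r2 score of our model is {r2 * 100:.2f} %")

The R^2 score is the share of the variance in the apparent temperature that our model explains on the test data. It is a measure of fit rather than an accuracy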
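The R^2 score compares the model's squared error with that of a baseline that always predicts the mean of the targets. The sketch below computes it by hand on made-up values, which also shows that it is a different quantity from the RMSE.

```python
import numpy as np

# Toy targets and predictions (invented values)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

# Residual sum of squares: error of the model
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

toy_r2 = 1.0 - ss_res / ss_tot
print(f"{toy_r2 * 100:.2f} %")
```

A value of 1 means a perfect fit, 0 means the model is no better than predicting the mean, and negative values are possible for models that do worse than that baseline.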
___

Let's compare our predictions against our targets to see if we got a correlated trend

In [None]:
plt.figure(figsize=[12,8])
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=y_test, y=predictions)

We got a positively correlated trend, so our model did quite well