# szeged-temperature-prediction-linearregression
I used elastic net regression to predict the apparent temperature from the given dataset.



In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [None]:
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
%matplotlib inline

The settings above are the matplotlib defaults for the analysis that follows.

In [None]:
szeged_df = pd.read_csv('../input/szeged-weather/weatherHistory.csv')

In [None]:
szeged_df

In [None]:
szeged_df.columns

In [None]:
szeged_df_final = szeged_df.drop(['Formatted Date', 'Summary', 'Precip Type','Visibility (km)', 'Loud Cover','Daily Summary','Wind Bearing (degrees)'],axis = 1)

The columns above were dropped because they appear to have almost no effect on the apparent temperature.

## Exploratory Data Analysis


Let's visualize the correlation between apparent temperature and the other variables using a heatmap.

In [None]:
sns.heatmap(szeged_df_final.corr(), cmap='Reds', annot=True)
plt.title('Correlation Matrix');

We can see that apparent temperature is highly correlated with both temperature and humidity.

In [None]:
pip install plotly

In [None]:
import plotly.express as px

Because there are many observations, plotting all of them would be very time consuming. So we will take a random sample of 10000 rows and visualize those.

In [None]:
fig = px.scatter(szeged_df_final.sample(10000, random_state=42), 
                 x='Temperature (C)', 
                 y='Apparent Temperature (C)',  
                 opacity=0.5,  
                 title='Apparent Temperature vs Temperature')
fig.show()

The graph shows a strong positive correlation between apparent temperature and temperature, which is what we would expect.

In [None]:
fig = px.scatter(szeged_df_final.sample(10000, random_state=42), 
                 x='Humidity', 
                 y='Apparent Temperature (C)',  
                 opacity=0.5,  
                 title='Apparent Temperature vs Humidity')
fig.show()

Strangely, apparent temperature tends to decrease as humidity increases, which is worth discussing.
One might expect a higher humidity to come with a higher apparent temperature. A possible explanation is that Szeged receives a significant amount of rainfall per year, so higher humidity goes along with more rainfall and therefore lower temperatures. According to this article (https://en.climate-data.org/europe/hungary/szeged/szeged-648/), Szeged does receive a significant amount of rainfall, about 594 mm per year.


Because temperature decreases as humidity increases, there is a noticeable negative correlation between the two.


In [None]:
fig = px.scatter(szeged_df_final.sample(10000, random_state=42), 
                 x='Wind Speed (km/h)', 
                 y='Apparent Temperature (C)',  
                 opacity=0.5,  
                 title='Apparent Temperature vs Wind Speed (km/h)')
fig.show()

From a wind speed of 0 to about 5 km/h there appears to be a single cluster. As the speed increases, the points split around an apparent temperature of 10 degrees Celsius: above 10 degrees Celsius the apparent temperature tends to go up as the wind speed increases, and below 10 degrees Celsius it tends to drop. A possible reason is that stronger wind carrying warm air raises the apparent temperature, while stronger wind carrying cold air lowers it.


Overall, though, the graph shows little correlation between apparent temperature and wind speed.
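To see how two opposite trends can add up to almost no overall correlation, here is a minimal sketch on synthetic data. The numbers are made up for illustration and are not the Szeged measurements; the names `wind`, `warm`, and `apparent` are hypothetical.

```python
import numpy as np

# Synthetic, illustrative data only (not taken from the dataset)
rng = np.random.default_rng(0)
wind = rng.uniform(0, 40, 2000)        # wind speed in km/h
warm = rng.random(2000) < 0.5          # True for "warm" observations

# Warm group: apparent temperature rises with wind; cold group: it falls
apparent = np.where(warm, 20 + 0.2 * wind, 0 - 0.2 * wind) + rng.normal(0, 1, 2000)

overall = np.corrcoef(wind, apparent)[0, 1]
warm_corr = np.corrcoef(wind[warm], apparent[warm])[0, 1]
cold_corr = np.corrcoef(wind[~warm], apparent[~warm])[0, 1]

print(f"overall: {overall:.2f}, warm: {warm_corr:.2f}, cold: {cold_corr:.2f}")
```

Within each group the correlation is strong, but the two slopes cancel when the groups are pooled, so a single correlation coefficient can hide a split like the one in the scatter plot.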

## Removing Outliers


**Outliers!!** What are those? 
Outliers are data points that differ significantly from the other observations in a dataset. They can occur because of variability in measurement or because of mistakes made when recording the data. Definition collected from: https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8
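A common way to make "differs significantly" concrete is the 1.5 × IQR rule, which is also the rule the outlier-removal function later in this notebook is based on. The sketch below applies it to a handful of made-up, humidity-like values (the array `values` is hypothetical, not taken from the dataset).

```python
import numpy as np

# Made-up values for illustration; 0.2 sits far below the rest
values = np.array([0.2, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0])

q25, q75 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q75 - q25                              # interquartile range

# Points outside these fences are flagged as outliers
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)  # -> [0.2]
```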

In [None]:
fig = px.box(szeged_df_final.sample(10000, random_state=42), 
                 y='Humidity')
fig.show()

The boxplot shows there exists one outlier.

In [None]:
fig = px.box(szeged_df_final.sample(10000, random_state=42), 
                 y='Apparent Temperature (C)')
fig.show()

In [None]:
fig = px.box(szeged_df_final.sample(10000, random_state=42), 
                 y='Temperature (C)')
fig.show()

**You can get a very good idea about how to remove outliers and why it is necessary from: https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8**

In [None]:
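def remove_outliers(col):
    """Return a copy of col with values outside the 1.5 * IQR fences replaced by the median."""
    q25, q75 = col.quantile(.25), col.quantile(.75)
    intr_qr = q75 - q25

    # Fences of the 1.5 * IQR rule; the names avoid shadowing the built-in min/max
    lower = q25 - (1.5 * intr_qr)
    upper = q75 + (1.5 * intr_qr)

    # Series.mask replaces the entries where the condition is True
    return col.mask((col < lower) | (col > upper), col.median())

In [None]:
# Assign the cleaned columns back, since the function does not modify the dataframe in place
for name in ['Temperature (C)', 'Apparent Temperature (C)', 'Humidity']:
    szeged_df_final[name] = remove_outliers(szeged_df_final[name])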

In [None]:
y = szeged_df_final['Apparent Temperature (C)']
X = szeged_df_final[['Temperature (C)',
                     'Humidity']]

Here X is the **Feature dataframe** and y is the **Target column**.

In [None]:
X

In [None]:
# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

**Why choose ElasticNet and what is it?**
You can get a good insight from: https://machinelearningmastery.com/elastic-net-regression-in-python/
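In short, elastic net is linear regression with a penalty that mixes the L1 (lasso) and L2 (ridge) terms. As documented for scikit-learn's `ElasticNet`, the coefficients $w$ minimize the objective below, where $n$ is the number of samples, $\alpha$ is the `alpha` parameter, and $\rho$ is the `l1_ratio` parameter (the symbols $w$ and $\rho$ are notation chosen here for illustration):

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty is pure lasso, with $\rho = 0$ it is pure ridge, and a larger $\alpha$ shrinks the coefficients more strongly.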

## Scaling and Regression

**What is scaling and why is it necessary?**
Machine learning algorithms such as linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique benefit from scaled data. Having features on a similar scale helps gradient descent converge more quickly towards the minimum. Scaling also matters for elastic net in particular, because its penalty acts on the size of the coefficients, and those sizes depend on the units of each feature. Information collected from: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
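As a small sketch of what standardization does, the example below scales two made-up columns on very different scales (temperature-like and humidity-like; `X_demo` is a hypothetical array) and checks that `StandardScaler` matches the usual formula `(x - mean) / std` applied per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: first column in degrees, second column a 0-1 fraction
X_demo = np.array([[5.0, 0.90],
                   [15.0, 0.70],
                   [25.0, 0.50]])

X_scaled = StandardScaler().fit_transform(X_demo)

# Same result by hand: subtract the column mean, divide by the column std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, manual))  # True
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates just because of its units.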

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .3, random_state= 42)

In [None]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Why use scaler.fit_transform for Training data and scaler.transform for test data?**

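The scaler should learn its mean and standard deviation from the training data only, so that no information from the test set leaks into training; the test data is then transformed with those same training statistics. If you are interested, please read this article: https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe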
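A minimal sketch of the difference, using made-up one-column data (`train_demo` and `test_demo` are hypothetical names): the scaler is fitted on the training values and the test value is transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[0.0], [10.0], [20.0]])   # made-up training feature
test_demo = np.array([[30.0]])                   # made-up test feature

# fit() learns the statistics from the training data only
scaler_demo = StandardScaler().fit(train_demo)
print(scaler_demo.mean_, scaler_demo.scale_)     # training mean and std

# transform() reuses those statistics: (30 - 10) / std_of_train
test_scaled = scaler_demo.transform(test_demo)
print(test_scaled)
```

Calling `fit_transform` on the test data instead would recompute the mean and standard deviation from the test set, so the two sets would no longer be on the same scale.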

In [None]:
regressor = ElasticNet()
params = dict()

# values for alpha: 100 log-spaced values between 10^-5 and 10^5
params['alpha'] =  np.logspace(-5, 5, 100, endpoint=True)

# values for l1_ratio: 100 values from 0 to 0.99 in steps of 0.01
params['l1_ratio'] = np.arange(0, 1, 0.01)

In [None]:
rn_cv = RandomizedSearchCV(regressor, params, n_iter = 100, scoring=None, cv=10, verbose=0, refit=True)

**Why prefer RandomizedSearchCV over GridSearchCV?**

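**The number of observations in this dataset is very high, and the search space above contains 100 × 100 = 10,000 parameter combinations, so using GridSearchCV would be very time consuming.** RandomizedSearchCV fits only the `n_iter = 100` combinations it samples.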
You can read this article for better understanding : https://analyticsindiamag.com/guide-to-hyperparameters-tuning-using-gridsearchcv-and-randomizedsearchcv/#:~:text=The%20only%20difference%20between%20both,that%20increase%20the%20model%20generalizability.
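The difference in cost can be counted directly. The sketch below rebuilds the same search space (under the hypothetical name `space`) and compares how many candidates an exhaustive grid would try with how many a random search of 100 iterations samples, using scikit-learn's `ParameterGrid` and `ParameterSampler`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the notebook
space = {
    'alpha': np.logspace(-5, 5, 100),
    'l1_ratio': np.arange(0, 1, 0.01),
}

n_grid = len(ParameterGrid(space))                                        # every combination
n_random = len(list(ParameterSampler(space, n_iter=100, random_state=0)))  # sampled subset

print(n_grid, n_random)  # 10000 100
```

With 10-fold cross-validation, that is 100,000 model fits for the full grid against 1,000 for the random search.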

In [None]:
rn_cv.fit(X_train,y_train);

In [None]:
y_pred = rn_cv.predict(X_test)
r2 = rn_cv.score(X_test,y_test)
mse = mean_squared_error(y_test, y_pred)  # signature is (y_true, y_pred)

In [None]:
print(r2)

**What is r2 score?**

R-squared (R2) is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable or variables in a regression model. In other words, R-squared shows how well the data fit the regression model (the goodness of fit). Definition collected from: https://www.investopedia.com/terms/r/r-squared.asp. For details you can read the article.
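As a worked example, R2 and MSE can be computed by hand on a few made-up values (`y_true_demo` and `y_hat_demo` are hypothetical) and compared with scikit-learn's `r2_score` and `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_demo = np.array([10.0, 12.0, 14.0, 16.0])   # made-up true values
y_hat_demo = np.array([11.0, 12.0, 13.0, 16.0])    # made-up predictions

# Residual sum of squares: 1 + 0 + 1 + 0 = 2
ss_res = np.sum((y_true_demo - y_hat_demo) ** 2)
# Total sum of squares around the mean (13): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot            # 1 - 2/20 = 0.9
mse_manual = ss_res / len(y_true_demo)     # 2/4 = 0.5

print(r2_manual, r2_score(y_true_demo, y_hat_demo))
print(mse_manual, mean_squared_error(y_true_demo, y_hat_demo))
```

An R2 of 0.9 here means the predictions account for 90% of the variance in the true values, while the MSE is expressed in squared units of the target.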

In [None]:
print(mse)

In [None]:
print("Tuned ElasticNet parameters: {}".format(rn_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))

## Summary


In the above analysis I tried to predict the apparent temperature from the dataset using elastic net regression. The quality of the model was measured using the R2 score and the MSE (mean squared error). In the run reported here, the R2 score of the model was 0.98 and the MSE was 1.54.


## References


*   Rainfall data of Szeged : https://en.climate-data.org/europe/hungary/szeged/szeged-648/
*   About Outliers : https://medium.com/analytics-vidhya/how-to-remove-outliers-for-machine-learning-24620c4657e8
*   About Elastic net regression : https://machinelearningmastery.com/elastic-net-regression-in-python/
*   Grid search CV VS Random Search CV : https://analyticsindiamag.com/guide-to-hyperparameters-tuning-using-gridsearchcv-and-randomizedsearchcv/#:~:text=The%20only%20difference%20between%20both,that%20increase%20the%20model%20generalizability.
*   About R2 score : https://www.investopedia.com/terms/r/r-squared.asp
*   About Scaling : https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
*   fit_transform VS transform : https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe