In [2]:
%reload_ext nb_black

# Szeged*, Hungary Weather

This notebook is gonna walk through an analysis of weather data from Szeged*, Hungary.  The analysis will lead up to a linear regression model that predicts temperature.

<sub>*(according to every submitted pronunciation [here](https://forvo.com/word/szeged/), the city is pronounced kinda like 'sehged')</sub>

### But first!  Warm up 🥵

* Q: How does the ROC curve differ in binary and multi-class classification?
  * A: \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
  
* Bonus warm-up 🥵!

In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

# Gen data
n = 5000
y = np.random.choice([0, 1, 2], n)
x1 = np.random.normal(10, 5, n)
x2 = np.random.normal(5, 3, n)

# Shift xs by class to make more easily separable
x1[np.where(y == 0)] += 5
x1[np.where(y == 1)] -= 5
x2[np.where(y == 0)] += 5
x2[np.where(y == 2)] -= 5

df = pd.DataFrame({"x1": x2, "x2": x1, "y": y})
df.head()

|   | x1 | x2 | y |
|---|---|---|---|
| 0 | 2.9832 | 5.995096 | 2 |
| 1 | 9.197455 | 22.532109 | 0 |
| 2 | 3.585781 | 3.842656 | 2 |
| 3 | -5.037402 | 16.537421 | 2 |
| 4 | 14.81044 | 9.503859 | 0 |


* Plot `x1` by `x2` and color by `y`

* Perform a train/test split with 20% of the data in test set

* Fit a logistic regression model (use whatever hyperparameters you'd like)

* Score your model and report fitting issues (i.e. under/over)

* Display a confusion matrix and a classification report.
  * When classifying an actual class 0, what mistake is the model most likely to make?
  * What 2 classes are the hardest to separate? Does this make sense based on the scatter plot?
  * What class has the highest recall? What does that mean? Does this make sense based on the scatter plot?
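
If you get stuck, here is one possible pass through the split/fit/score/report steps. It's a sketch, not the answer key: it rebuilds a fresh copy of the warm-up data under the hypothetical name `warm_df` (so the numbers won't match the `df` above exactly), and `max_iter=1000`, `stratify`, and `random_state=42` are just my picks.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Rebuild warm-up style data so this cell runs on its own (hypothetical copy)
rng = np.random.default_rng(42)
n = 5000
labels = rng.choice([0, 1, 2], n)
a = rng.normal(10, 5, n)
b = rng.normal(5, 3, n)
a[labels == 0] += 5
a[labels == 1] -= 5
b[labels == 0] += 5
b[labels == 2] -= 5
warm_df = pd.DataFrame({"x1": a, "x2": b, "y": labels})

# 80/20 split, stratified so each class keeps its share in both sets
X_train, X_test, y_train, y_test = train_test_split(
    warm_df[["x1", "x2"]],
    warm_df["y"],
    test_size=0.2,
    random_state=42,
    stratify=warm_df["y"],
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Similar train and test accuracy suggests no serious overfitting
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")

# Rows are actual classes, columns are predicted classes
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

For the plotting step, `sns.scatterplot(data=warm_df, x="x1", y="x2", hue="y")` is one way to color the points by class.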

## General EDA

We'll start with loading the data and doing some intro EDA.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# CSV downloaded from https://www.kaggle.com/budincsevity/szeged-weather
data_url = "https://docs.google.com/spreadsheets/d/1VI1rDsUI7KTMUEyDgV8cc0gLdwYr1p_aCwyyp3Mz3_M/export?format=csv"
szeged = pd.read_csv(data_url)

Get to know the data.  Keep in mind, the end goal is to be able to predict temperature.

### Tangent start

*click here to jump to [Tangent end](#Tangent-end)*

My guess going in would be that a lack of precipitation would appear as an `NA` in the `Precip Type` column, but the share of `NA`s is suspiciously low for that.  I will concede I'm not familiar with Hungary's weather; maybe there really is precipitation there 99.5% of the time.

We could:
  * Drop them. It's a low percentage of our records, but maybe there's value to be had?
  * Look at the data's documentation (should probably start here... but we won't...)
  * Look at the `value_counts` of the `Precip Type` column
  * Look at the other column values when `Precip Type` is `NA`
  * Look at a `crosstab` of `Precip Type` and a column like `Summary`

Show the value counts of the `Precip Type` column; use an argument to avoid excluding NaN from this output

Show the head of the data when `Precip Type` is NaN

Look at a `crosstab` of `Precip Type` and a column like `Summary` (include NaNs)
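
The three checks above could look roughly like this. This is a sketch on a tiny made-up frame called `toy` (the values are invented; swap in `szeged` for the real thing). For the crosstab I relabel NaN as the string `"missing"` first, which is my own workaround so the NaNs reliably get their own row.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real `szeged` frame (values invented for illustration)
toy = pd.DataFrame(
    {
        "Summary": ["Overcast", "Clear", "Foggy", "Clear", "Clear", "Partly Cloudy"],
        "Precip Type": ["rain", "rain", "snow", np.nan, np.nan, "rain"],
    }
)

# 1. Value counts, keeping NaN as its own row
counts = toy["Precip Type"].value_counts(dropna=False)
print(counts)

# 2. Rows where `Precip Type` is missing
missing_rows = toy[toy["Precip Type"].isna()]
print(missing_rows.head())

# 3. Crosstab with NaN relabeled so it shows up as its own row
precip_filled = toy["Precip Type"].fillna("missing")
table = pd.crosstab(precip_filled, toy["Summary"])
print(table)
```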

A higher percentage of `NaN`s seems to be associated with clear weather than with rain or snow.  It feels safe to conclude there's some relationship between `NaN`s and lack of precipitation.  We can confirm this by checking the documentation.  If we check the [Kaggle page](https://www.kaggle.com/budincsevity/szeged-weather) this data was downloaded from, we see that it was originally collected from the [darksky.net](https://darksky.net/) API.  Conveniently, this API has [some documentation](https://darksky.net/dev/docs#data-point-object) on all the values it can return.  The excerpt below is copied from the documentation about our `Precip Type` column.  So we see that if the `precipIntensity` is zero, we are expected to have a `NaN`.

> **`precipType`** *optional*
>
> The type of precipitation occurring at the given time. If defined, this property will have one of the following values: `"rain"`, `"snow"`, or `"sleet"` (which refers to each of freezing rain, ice pellets, and “wintery mix”). (If `precipIntensity` is zero, then this property will not be defined. Additionally, due to the lack of data in our sources, historical `precipType` information is usually estimated, rather than observed.)

So how do we use this information? We just went through a lot of work investigating, so what? Well, now we know that doing a `dropna` would needlessly lose records from the dataset.  These NaNs are legit records, and we should keep them in the analysis by treating NaN as a legit category for the `Precip Type` column.
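
A quick sketch of the difference, again on an invented `toy` frame. The label `"none"` is just a placeholder I picked; any string that reads as "no precipitation" would do.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for the real data
toy = pd.DataFrame(
    {
        "Precip Type": ["rain", np.nan, "snow", np.nan],
        "Temperature (C)": [9.5, 21.0, -1.2, 18.3],
    }
)

# dropna throws away the dry records entirely
dropped = toy.dropna(subset=["Precip Type"])

# fillna keeps every record and turns NaN into its own category
kept = toy.assign(**{"Precip Type": toy["Precip Type"].fillna("none")})

print(len(dropped), "rows after dropna")
print(len(kept), "rows after fillna")
print(kept["Precip Type"].value_counts())
```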

### Tangent end

Okie doke. Let's get down to brass tacks.  We want to predict temperature.  For today, we'll just use the `Humidity` and `Visibility (km)` features.  We can start to focus our EDA on these culprits.

Subset the dataframe to only the `"Temperature (C)"`, `"Humidity"`, and `"Visibility (km)"` columns.

Create a pairplot/scatter matrix of all three remaining columns. Do we see some correlations (especially with our target)?

Create a heatmap of the correlations between these three columns.
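
Here's a rough sketch of the subset and correlation steps. It runs on synthetic data with made-up coefficients (the `weather` frame and its `extra` column are hypothetical), so the correlations it prints are not the real Szeged ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (coefficients are invented)
rng = np.random.default_rng(0)
n = 500
humidity = rng.uniform(0.2, 1.0, n)
visibility = rng.uniform(0.0, 16.0, n)
temperature = 35 - 30 * humidity + 0.4 * visibility + rng.normal(0, 5, n)

weather = pd.DataFrame(
    {
        "Temperature (C)": temperature,
        "Humidity": humidity,
        "Visibility (km)": visibility,
        "extra": rng.normal(size=n),  # a column we don't need
    }
)

# Keep only the target and the two features
cols = ["Temperature (C)", "Humidity", "Visibility (km)"]
subset = weather[cols]

# Pairwise Pearson correlations between the three columns
corr = subset.corr()
print(corr.round(2))
```

From here, `sns.pairplot(subset)` draws the scatter matrix and `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)` draws the heatmap.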

We seem to have some predictive power within these two inputs.  Looking at the scatterplots we can see trends, and a heatmap confirms some correlation.  `Humidity` is more tightly coupled with `Temperature (C)` than `Visibility (km)` is.

Perform a train test split; pick whatever parameters you want

Fit a linear regression model using `sklearn`

Score the model and report fitting issues (i.e. under/over).
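
One way the split/fit/score steps might go, sketched on synthetic data so it runs on its own. The `demo` frame, its coefficients, and the name `lin_model` are all made up for illustration; with the real data you'd build `X` and `y` from `szeged` instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data (coefficients are invented)
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame(
    {
        "Humidity": rng.uniform(0.2, 1.0, n),
        "Visibility (km)": rng.uniform(0.0, 16.0, n),
    }
)
demo["Temperature (C)"] = (
    35 - 30 * demo["Humidity"] + 0.4 * demo["Visibility (km)"] + rng.normal(0, 5, n)
)

X = demo[["Humidity", "Visibility (km)"]]
y = demo["Temperature (C)"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

# For regressors, `score` returns R^2 rather than accuracy
train_r2 = lin_model.score(X_train, y_train)
test_r2 = lin_model.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```

A train score far above the test score hints at overfitting; two low scores hint at underfitting.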

Note: these scores aren't "accuracy" like they were for logistic regression.  This metric is called $R^2$.  In some ways it can be treated the same as accuracy: the higher the better, and a perfect score is `1.0`.

My model's $R^2$ on the test set is `0.43`.  This number is often interpreted like:
* "the model explains 43% of the variation in temperature"
* "humidity and visibility explain 43% of the variation".

Math-wise, this number is asking the question: 'Did we predict better than guessing the mean?'.  For a deeper dive on $R^2$, check out the `understanding_r_squared.ipynb` notebook.
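
To make the "better than guessing the mean" idea concrete, here's a small sketch that computes $R^2$ by hand as one minus the ratio of the model's squared error to the squared error of always guessing the mean. The four values in `y_true` and `y_pred` are invented.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Squared error of our predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Squared error of always guessing the mean of the actuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))

# Guessing the mean for every record scores exactly zero
mean_guess = np.full_like(y_true, y_true.mean())
r2_mean_guess = r2_score(y_true, mean_guess)
print(r2_mean_guess)
```

A model that does worse than the mean guess ends up with a negative $R^2$.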

Our model's formula can be found in the `intercept_` and `coef_` attributes.  The trailing underscore is a convention in `sklearn` to mean the model's `fit` method will define them (i.e. our model can't have coefficients until the model is fit, so they're stored in a trailing `_` attribute).
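
A tiny sketch of that convention, using a throwaway model named `demo_model` and four invented points that lie on the line `y = 1 + 2x`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented points on the line y = 1 + 2x
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

demo_model = LinearRegression()

# Before fitting, the trailing-underscore attributes don't exist yet
has_coef_before = hasattr(demo_model, "coef_")
print(has_coef_before)

demo_model.fit(X_demo, y_demo)

# After fitting, they do
has_coef_after = hasattr(demo_model, "coef_")
print(has_coef_after)
print(demo_model.intercept_, demo_model.coef_)
```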

In [None]:
print(model.intercept_)
print(model.coef_)

Print out a string version of our linear regression model's formula
  * i.e. this might look like `Temp = 100 + (20) * Humidity + (2) * Visibility`
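
One possible way to build that string with an f-string. This sketch fits a throwaway `demo_model` on five invented, noise-free points, so the printed numbers are not the ones your real model will give.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

feature_names = ["Humidity", "Visibility"]

# Invented, noise-free data generated from Temp = 30 - 20 * Humidity + 0.5 * Visibility
X_demo = np.array(
    [[0.9, 5.0], [0.5, 10.0], [0.3, 15.0], [0.7, 8.0], [0.4, 12.0]]
)
y_demo = 30 - 20 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its feature name, then join the terms
terms = [
    f"({coef:.2f}) * {name}"
    for coef, name in zip(demo_model.coef_, feature_names)
]
formula = f"Temp = {demo_model.intercept_:.2f} + " + " + ".join(terms)
print(formula)
```

Zipping `coef_` with the feature names keeps the string correct if you add more predictors later, as long as the names are in the same order as the training columns.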


What does this formula tell us?  Does this make sense with the correlations/EDA we looked at?

Use the model's predict method to make predictions on the test set

Create a dataframe with the input features, `y_test`, and the predictions for the test set
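
A sketch of the predict and dataframe steps, on synthetic data again (the `demo` frame and its coefficients are invented). I'm naming the new columns `actual` and `predicted` since those are the names the 3d plot at the end of the notebook expects in `pred_df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather data (coefficients are invented)
rng = np.random.default_rng(1)
n = 500
demo = pd.DataFrame(
    {
        "Humidity": rng.uniform(0.2, 1.0, n),
        "Visibility (km)": rng.uniform(0.0, 16.0, n),
    }
)
demo["Temperature (C)"] = (
    35 - 30 * demo["Humidity"] + 0.4 * demo["Visibility (km)"] + rng.normal(0, 5, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    demo[["Humidity", "Visibility (km)"]],
    demo["Temperature (C)"],
    test_size=0.2,
    random_state=42,
)
lin_model = LinearRegression().fit(X_train, y_train)

# Copy the test features so we don't modify X_test itself
pred_df = X_test.copy()
pred_df["actual"] = y_test  # aligned on the shared index
pred_df["predicted"] = lin_model.predict(X_test)
print(pred_df.head())
```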

Let's look at a plot of our predictions vs our predictors.

Make a scatter plot with Humidity on the x axis and model predictions as the y axis 

Make a scatter plot with Visibility on the x axis and model predictions as the y axis

That's not very linear... Although linear regression is making 'linear combinations' of our variables, the output isn't a line when we have multiple predictors. Our current data has 3 dimensions (2 features and 1 target), so we'll need a visualization that can capture all 3 to fully make sense of it.  Color is a nice go-to way to cheat 3 dims into a 2d plot.

Make a scatter plot with Humidity on the x axis, model predictions as the y axis, and color by visibility

Make a scatter plot with visibility on the x axis, model predictions as the y axis, and color by Humidity
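
Here's a sketch of the first colored scatter using plain `matplotlib`. The frame `demo_pred_df` and the coefficients behind its `predicted` column are invented stand-ins for your own predictions dataframe.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented inputs standing in for the test-set features
rng = np.random.default_rng(2)
n = 300
demo_pred_df = pd.DataFrame(
    {
        "Humidity": rng.uniform(0.2, 1.0, n),
        "Visibility (km)": rng.uniform(0.0, 16.0, n),
    }
)

# Made-up coefficients standing in for a fitted model's intercept_ and coef_
demo_pred_df["predicted"] = (
    35 - 30 * demo_pred_df["Humidity"] + 0.4 * demo_pred_df["Visibility (km)"]
)

# Humidity vs prediction, with visibility mapped to color
fig, ax = plt.subplots()
points = ax.scatter(
    demo_pred_df["Humidity"],
    demo_pred_df["predicted"],
    c=demo_pred_df["Visibility (km)"],
    cmap="viridis",
)
fig.colorbar(points, ax=ax, label="Visibility (km)")
ax.set_xlabel("Humidity")
ax.set_ylabel("Predicted temperature (C)")
```

Points of the same color should line up along parallel lines: for a fixed visibility, the prediction really is a straight line in humidity. Swap the `x` and `c` columns to get the second plot.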

In the case of 3d, we can actually plot this directly.  Note, 3d plots do not always provide more insight than a series of 2d plots; make sure to evaluate your use case on whether or not it fits.  Here, I think 3d plots help drive home the point that we have a plane of predictions.  When we get above 3d this gets harder and harder to visualize.

In [None]:
figure = px.scatter_3d(
    data_frame=pred_df, x="Humidity", y="Visibility (km)", z="predicted"
)


figure.update_traces(name="Predicted", showlegend=True)
figure.add_scatter3d(
    x=pred_df["Humidity"],
    y=pred_df["Visibility (km)"],
    z=pred_df["actual"],
    opacity=0.2,
    mode="markers",
    name="Actuals",
)

figure.show()