<a href="https://colab.research.google.com/github/mfligiel/Models-for-MLOPS-Review/blob/main/Evidently_for_WeatherModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Weather Data

I am going to predict Chicago's daily maximum temperature from that of 5 nearby cities, using data from a weather API.  The model isn't very useful in itself, but it is good for showcasing model monitoring.

Here, I will pull in some June data, but with Phoenix substituted for Toronto.  That gives a rather different temperature distribution!

In [1]:
!pip install evidently

Collecting evidently
  Downloading https://files.pythonhosted.org/packages/8b/64/817e8fb176d8393eb2b49f5650957e7ddb11dc3f9d531deb9e26036f8553/evidently-0.1.19.dev0-py3-none-any.whl (15.2MB)
Collecting dataclasses
  Downloading https://files.pythonhosted.org/packages/26/2f/1095cdc2868052dd1e64520f7c0d5c8c550ad297e944e641dbf1ffbb9a5d/dataclasses-0.6-py3-none-any.whl
Installing collected packages: dataclasses, evidently
Successfully installed dataclasses-0.6 evidently-0.1.19.dev0


In [26]:
import requests
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
import evidently
from evidently.dashboard import Dashboard
from evidently.tabs import DataDriftTab, NumTargetDriftTab, RegressionPerformanceTab
from IPython.display import IFrame
import pickle

In [3]:

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


This should work!  I'll now find the IDs of 5 cities I will use to predict Chicago's weather:

Milwaukee\
Detroit\
Toronto\
St Louis\
Omaha, NE


I'll use this site to look it up: https://www.findmecity.com/

Milwaukee: 2451822\
Detroit: 2391585 \
Toronto: 4118\
St. Louis: 2486982\
Omaha, NE: 2465512

I'll switch Toronto's WOEID for that of Phoenix: 2471390

I'll run this once, and then comment it out for now.
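
As a rough sketch of what the request loop asks for, here is the URL pattern on its own, with no network calls. The helper names `build_url` and `request_urls` are mine (hypothetical), and the pattern is simply copied from the string concatenation in the cell below. Note that a Python `range(1, 15)` covers days 1 through 14.

```python
from datetime import date

# Endpoint prefix copied from the request string used in this notebook.
BASE_URL = "https://www.metaweather.com/api/location"


def build_url(woeid: str, day: date) -> str:
    """Build the per-day URL for one location (hypothetical helper)."""
    return f"{BASE_URL}/{woeid}/{day.year}/{day.month}/{day.day}/"


def request_urls(woeid: str, year: int, month: int, first_day: int, last_day: int) -> list:
    """List the URLs for an inclusive range of days in one month."""
    return [
        build_url(woeid, date(year, month, d))
        for d in range(first_day, last_day + 1)
    ]


# Phoenix's WOEID, standing in for Toronto, for June 1-14, 2021.
urls = request_urls("2471390", 2021, 6, 1, 14)
print(len(urls))
print(urls[0])
```

Building the list up front makes it easy to check the date range before spending a few seconds of sleep per request.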

In [4]:
# #dictionary of cities
# cities = {'Milwaukee':'2451822', 'Detroit':'2391585', 'Toronto':'2471390', 'St. Louis':'2486982', 'Omaha':'2465512', 'Chicago':'2379574'} #phoenix for toronto

# #empty list to enter these into:
# values = []

# #loop through cities
# for k, v in cities.items():
#   #loop through the months (just June here)
#   for mth in ['6']:
#     #just do days 1 through 14, it's not time series, I don't care
#     for day in range(1, 15):
#       #what to request
#       strng = 'https://www.metaweather.com/api/location/' + v +'/2021/' + mth + '/' +str(day) + '/'
#       if day == 1:
#         print(strng)
#       reqst = requests.get(strng)
#       #get the pieces
#       date = pd.to_datetime(pd.DataFrame(reqst.json()).max()['created']).date()
#       maxtemp = pd.DataFrame(reqst.json()).max()['max_temp']
#       values.append([k, date, maxtemp])
#       time.sleep(3)





In [5]:
#pd.DataFrame(values).to_csv('Test.csv')

In [6]:
#!ls

In [7]:
#!cp Test.csv gdrive/MyDrive

## Loading in this data (once saved), and the model

Now I can prep to predict, and see how different it really is.

In [8]:
#load in data
df = pd.read_csv(r'/content/gdrive/MyDrive/ModelMonitoringBlog/Test.csv')
#load in model (the with-block closes the file handle afterwards)
with open(r'/content/gdrive/MyDrive/ModelMonitoringBlog/weather_model.pkl', 'rb') as f:
  model = pickle.load(f)

In [9]:
#Now, to rename the columns
df.columns = ['drp', 'city', 'date', 'maxtemp']
df.drop('drp', axis=1, inplace=True)

In [10]:
#reshape the data
df = df.pivot(index='date', columns='city', values='maxtemp')

In [11]:
df.head()

| date | Chicago | Detroit | Milwaukee | Omaha | St. Louis | Toronto |
|---|---|---|---|---|---|---|
| 2021-06-02 | 23.7 | 25.79 | 23.15 | 26.06 | 29.23 | 40.12 |
| 2021-06-03 | 24.42 | 24.91 | 22.73 | 27.475 | 28.09 | 41.715 |
| 2021-06-04 | 28.055 | 26.21 | 28.65 | 29.965 | 28.675 | 41.96 |
| 2021-06-05 | 30.675 | 30.885 | 31.51 | 33.61 | 30.565 | 42.41 |
| 2021-06-06 | 31.375 | 31.79 | 31.8 | 33.165 | 33.33 | 42.07 |
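
For anyone unfamiliar with the reshape step, here is a small sketch of the long-to-wide pivot on a toy frame. The four records are copied from the first two rows of the table above (Chicago and Detroit only); the duplicate check is my own addition, since `pivot` raises a `ValueError` when a (date, city) pair appears twice.

```python
import pandas as pd

# Long-format records shaped like the saved CSV: one row per (city, date).
long_df = pd.DataFrame(
    [
        ["Chicago", "2021-06-02", 23.70],
        ["Detroit", "2021-06-02", 25.79],
        ["Chicago", "2021-06-03", 24.42],
        ["Detroit", "2021-06-03", 24.91],
    ],
    columns=["city", "date", "maxtemp"],
)

# Count repeated (date, city) pairs; pivot() cannot reshape if any exist.
duplicates = int(long_df.duplicated(subset=["date", "city"]).sum())

# One row per date, one column per city.
wide_df = long_df.pivot(index="date", columns="city", values="maxtemp")

print("duplicate (date, city) pairs:", duplicates)
print(wide_df)
```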


Let's make a prediction.

In [12]:
predictions = model.predict(df.drop('Chicago', axis=1))

In [13]:
df['Chicago'] - predictions

date
2021-06-02    0.252094
2021-06-03    1.663958
2021-06-04    1.463601
2021-06-05    0.412679
2021-06-06    0.194342
2021-06-07    1.332528
2021-06-08    0.825372
2021-06-09    0.262758
2021-06-10   -1.233267
2021-06-11   -0.465028
2021-06-12   -0.681110
2021-06-13   -0.408895
2021-06-14   -0.015240
2021-06-15    0.750277
Name: Chicago, dtype: float64
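
To make these differences easier to compare against other date ranges, they can be boiled down to a few summary numbers. This is only a sketch: the residuals are copied from the output above, and the choice of metrics (mean error, mean absolute error, root mean squared error) is mine rather than anything the notebook or Evidently computes at this point.

```python
import math

# Actual minus predicted Chicago max temperature, copied from the output above.
residuals = [
    0.252094, 1.663958, 1.463601, 0.412679, 0.194342, 1.332528, 0.825372,
    0.262758, -1.233267, -0.465028, -0.681110, -0.408895, -0.015240, 0.750277,
]

n = len(residuals)
mean_error = sum(residuals) / n                      # signed bias
mae = sum(abs(r) for r in residuals) / n             # mean absolute error
rmse = math.sqrt(sum(r * r for r in residuals) / n)  # root mean squared error

print(f"n={n}  mean error={mean_error:.3f}  MAE={mae:.3f}  RMSE={rmse:.3f}")
```

A positive mean error means the model under-predicts Chicago on average over these days.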

## Evidently

In order to generate reports, I will need to load in the old data:


In [14]:
#re creating the dictionary above 
cities = {'Milwaukee':'2451822', 'Detroit':'2391585', 'Toronto':'4118', 'St. Louis':'2486982', 'Omaha':'2465512', 'Chicago':'2379574'}

df_old = pd.DataFrame()

for i in cities.keys():
  if i == 'St. Louis':
    i = 'St_Louis'
  pth = "gdrive/MyDrive/ModelMonitoringBlog/" + i + ".csv"
  print(pth)
  to_append = pd.read_csv(pth)
  #print(to_append.head())
  if df_old.empty:
    df_old = to_append
    print(df_old.empty)
  else:
    df_old = pd.concat([df_old, to_append], ignore_index=True)
  


gdrive/MyDrive/ModelMonitoringBlog/Milwaukee.csv
False
gdrive/MyDrive/ModelMonitoringBlog/Detroit.csv
gdrive/MyDrive/ModelMonitoringBlog/Toronto.csv
gdrive/MyDrive/ModelMonitoringBlog/St_Louis.csv
gdrive/MyDrive/ModelMonitoringBlog/Omaha.csv
gdrive/MyDrive/ModelMonitoringBlog/Chicago.csv


In [15]:
#Now, to rename the columns
df_old.columns = ['drp', 'city', 'date', 'maxtemp']
df_old.drop('drp', axis=1, inplace=True)
df_old = df_old.pivot(index='date', columns='city', values='maxtemp')

In [16]:
weather_data_drift_report = Dashboard(tabs=[DataDriftTab])
weather_data_drift_report.calculate(df_old.drop('Chicago', axis=1), df.drop('Chicago', axis=1), column_mapping = None)
weather_data_drift_report.save("gdrive/MyDrive/ModelMonitoringBlog/reports/my_report_with_2_tabs.html")


Even though this code is almost directly from their example, I had to hunt for where CatTargetDriftTab lives (though I quickly remembered it isn't applicable to this data).  This does seem super easy!  But it does need a reference set of old data to compare against, so this wouldn't necessarily be great for an ongoing use case.

Also, I could not render the report directly in the Colab notebook, so I save it to HTML and load it from there.  (I wondered whether the operating system was the cause, but Colab runtimes run Linux, not Windows.)

In [17]:

IFrame(src="gdrive/MyDrive/ModelMonitoringBlog/reports/my_report_with_2_tabs.html", width=700, height=600)



It flags drift in every feature, which makes sense, as it is comparing June to other months.  Later I will take a sample from March-May of the previous year and predict on that.  However, the report only says whether drift is detected or not, and the charts leave some room for interpretation: it isn't always clear what the axes of each one are!

In [18]:
# #dictionary of cities
# cities = {'Milwaukee':'2451822', 'Detroit':'2391585', 'Toronto':'2471390', 'St. Louis':'2486982', 'Omaha':'2465512', 'Chicago':'2379574'} #phoenix for toronto

# #empty list to enter these into:
# values = []

# #loop through cities
# for k, v in cities.items():
#   #loop through 3 months
#   for mth in ['3','4','5']:
#     #just do days 1 through 9, it's not time series, I don't care
#     for day in range(1, 10):
#       #what to request
#       strng = 'https://www.metaweather.com/api/location/' + v +'/2020/' + mth + '/' +str(day) + '/'
#       if day == 1:
#         print(strng)
#       reqst = requests.get(strng)
#       #get the pieces
#       date = pd.to_datetime(pd.DataFrame(reqst.json()).max()['created']).date()
#       maxtemp = pd.DataFrame(reqst.json()).max()['max_temp']
#       values.append([k, date, maxtemp])
#       time.sleep(3)


# pd.DataFrame(values).to_csv('Weather2020.csv')


In [19]:
#!ls

In [20]:
#!cp Weather2020.csv gdrive/MyDrive/ModelMonitoringBlog/

Now I can load this in and compare it.

In [49]:
w2020 = pd.read_csv(r'/content/gdrive/MyDrive/ModelMonitoringBlog/Weather2020.csv')
#Now, to rename the columns
w2020.columns = ['drp', 'city', 'date', 'maxtemp']
w2020.drop('drp', axis=1, inplace=True)
w2020 = w2020.pivot(index='date', columns='city', values='maxtemp')

In [50]:
w2020

| date | Chicago | Detroit | Milwaukee | Omaha | St. Louis | Toronto |
|---|---|---|---|---|---|---|
| 2020-03-02 | 11.605 | 6.565 | 11.4 | 16.785 | 19.72 | 26.57 |
| 2020-03-03 | 8.58 | 11.25 | 8.015 | 14.06 | 20.145 | 22.89 |
| 2020-03-04 | 8.41 | 11.735 | 8.635 | 15.215 | 18.69 | 24.455 |
| 2020-03-05 | 9.31 | 11.615 | 9.545 | 16.97 | 18.04 | 24.73 |
| 2020-03-06 | 11.425 | 12.675 | 11.955 | 19.195 | 20.02 | 28.78 |
| 2020-03-07 | 8.46 | 10.09 | 12.34 | 20.55 | 18.97 | 29.85 |
| 2020-03-08 | 10.005 | 7.83 | 11.605 | 21.255 | 17.45 | 28.52 |
| 2020-03-09 | 16.635 | 15.34 | 15.61 | 21.76 | 21.1 | 25.51 |
| 2020-03-10 | 12.425 | 17.65 | 13.68 | 17.89 | 17.335 | 26.51 |
| 2020-04-02 | 11.28 | 17.28 | 14.07 | 22.92 | 20.78 | 31.07 |


In [51]:
predictions2020 = model.predict(w2020.drop('Chicago', axis=1))

In [52]:
w2020['Chicago'] - predictions2020

date
2020-03-02    2.516554
2020-03-03   -0.376788
2020-03-04   -0.795456
2020-03-05   -0.340018
2020-03-06   -0.153182
2020-03-07   -2.308167
2020-03-08    0.532318
2020-03-09    1.679587
2020-03-10   -1.313858
2020-04-02   -3.270832
2020-04-03    0.390888
2020-04-04    3.560140
2020-04-05    1.209856
2020-04-06   -0.845977
2020-04-07   -0.701304
2020-04-08    2.302799
2020-04-09   -2.187732
2020-04-10   -0.423680
2020-05-02   -0.688398
2020-05-03   -0.204281
2020-05-04   -0.852827
2020-05-05   -2.593346
2020-05-06    3.838542
2020-05-07    1.426937
2020-05-08    3.128059
2020-05-09    1.377549
2020-05-10    1.416158
Name: Chicago, dtype: float64

Let's look at this with evidently.

In [53]:
weather_data_drift_report_2 = Dashboard(tabs=[DataDriftTab])
weather_data_drift_report_2.calculate(df_old.drop('Chicago', axis=1), w2020.drop('Chicago', axis=1), column_mapping = None)
weather_data_drift_report_2.save("gdrive/MyDrive/ModelMonitoringBlog/reports/2020_weather_report.html")


I had to save this in a separate folder and then download it to open it, since these reports don't integrate well with the notebook.  Evidently seems to use a KS test to analyze drift - let's see what it does for:

- Target shift: is it detecting if we're seeing a difference in our output distribution?
- Concept shift: this is hard to detect, but could it help us tell if the fundamental concept of the model doesn't quite apply in the same way anymore?
- Class imbalance: not quite applicable for this model, but could be covered similarly to target shift.
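
Since the drift check appears to rest on a KS test, here is a rough pure-Python sketch of the two-sample KS statistic, to show what is being measured. This is my own illustration, not Evidently's implementation, and the function name `ks_statistic` is hypothetical. The two samples are the "Toronto" column values (really Phoenix) from the June 2021 and 2020 tables printed earlier in this notebook.

```python
def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observed value is enough to find the maximum.
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# "Toronto" column (Phoenix data) from the two tables shown earlier.
june_2021 = [40.12, 41.715, 41.96, 42.41, 42.07]
spring_2020 = [26.57, 22.89, 24.455, 24.73, 28.78, 29.85, 28.52, 25.51, 26.51, 31.07]

print("June 2021 vs 2020 sample:", ks_statistic(june_2021, spring_2020))
print("sample vs itself:", ks_statistic(june_2021, june_2021))
```

A statistic near 0 means the two samples have nearly the same distribution, and a statistic near 1 means they barely overlap. Turning the statistic into a "drift detected" verdict also needs a p-value and a threshold, which this sketch leaves out.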

I can directly look at target shift - let's see how it compares my predictions to past ones:

In [54]:
weather_target_drift_report = Dashboard(tabs=[NumTargetDriftTab])

weather_target_drift_report.calculate(df_old, predictions2020, column_mapping = None)


In [55]:
weather_target_drift_report.show()


This just shows 'loading.'

In [56]:
weather_target_drift_report.save("gdrive/MyDrive/ModelMonitoringBlog/reports/2020_weather_TARGET_report.html")

This didn't work.  Let me retry with the Chicago column inside a dataframe, and with their recommended column mapping.  There is not a ton of documentation here.

In [57]:
df_old.columns

Index(['Chicago', 'Detroit', 'Milwaukee', 'Omaha', 'St. Louis', 'Toronto'], dtype='object', name='city')

In [73]:
target = 'Chicago'
prediction = 'prediction'
numerical_features = ['Detroit', 'Milwaukee', 'Omaha', 'St. Louis', 'Toronto']
#no categorical features
#categorical_features = ['season', 'holiday', 'workingday']

#reference = raw_data.loc['2011-01-01 00:00:00':'2011-01-28 23:00:00']
#current = raw_data.loc['2011-01-29 00:00:00':'2011-02-28 23:00:00']

column_mapping = {}

column_mapping['target'] = target
column_mapping['prediction'] = prediction  #the column name, not the predictions array
#column_mapping['numerical_features'] = numerical_features
#column_mapping['categorical_features'] = categorical_features


Also, let me add the predictions to the test dataframe... and the reference data needs predictions too.
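
One thing to watch when re-running this kind of cell: once a `prediction` column exists, it must not be fed back into the model as a feature. Here is a sketch of a re-runnable helper. `with_predictions` and the stand-in `MeanOfFeatures` model are hypothetical names of mine, used only so the example runs without the pickled model.

```python
import pandas as pd


class MeanOfFeatures:
    """Hypothetical stand-in for the pickled model: predicts the row mean."""

    def predict(self, X):
        return X.mean(axis=1).to_numpy()


def with_predictions(frame, model, target="Chicago", prediction="prediction"):
    """Return a copy of `frame` with a freshly computed prediction column."""
    # errors="ignore" skips the prediction column when it does not exist yet,
    # so the same call works on the first run and on every re-run.
    features = frame.drop(columns=[target, prediction], errors="ignore")
    out = frame.copy()
    out[prediction] = model.predict(features)
    return out


demo = pd.DataFrame(
    {"Chicago": [23.70, 24.42], "Detroit": [25.79, 24.91], "Milwaukee": [23.15, 22.73]},
    index=["2021-06-02", "2021-06-03"],
)

once = with_predictions(demo, MeanOfFeatures())
twice = with_predictions(once, MeanOfFeatures())  # safe to call again
print(twice)
```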

In [77]:
w2020_preds = w2020.copy()
w2020_preds['prediction'] = predictions2020
#errors='ignore' lets this run whether or not the 'prediction' column already exists
df_old['prediction'] = model.predict(df_old.drop(columns=['Chicago', 'prediction'], errors='ignore'))


In [87]:
weather_target_drift_report = Dashboard(tabs=[NumTargetDriftTab])

weather_target_drift_report.calculate(df_old, w2020_preds, column_mapping=None)


It turns out the column mapping was the problem: passing `column_mapping=None` lets the report run.

In [88]:
weather_target_drift_report.show()

In [89]:
weather_target_drift_report.save("gdrive/MyDrive/ModelMonitoringBlog/reports/2020_weather_TARGET_report.html")

From here, I think we're good to write up the experience.  Pausing for now.