# Concept Drift Detection and Model Retraining

Using the ECOICOP product and category data, this notebook simulates concept drift, detects it, and retrains a model on the drifted data.
<br>
The flow diagram shows the overall process (numbers in brackets indicate the cell number, when run in order, where the code is found).

![flow%20%286%29.jpg](attachment:flow%20%286%29.jpg)

Import packages and data

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from alibi_detect.cd import MMDDrift, FETDrift, CVMDrift, KSDrift

In [2]:
# Load the English sheet; the rest of the notebook refers to this dataframe as t1
t1 = pd.read_excel('https://raw.githubusercontent.com/UNECE/ML_dataset/master/Stats%20Poland%20ECOICOP%20data.xlsx', sheet_name='English')
t1

Unnamed: 0,produkt,kategoria
0,Hejki - Emotes handmade lollipops with fruit f...,Confectionery products
1,100% Pur jus d orange orange juice with pulp ...,Fruit and vegetable juices
2,100% sucralose without sugar (sweeteners),Artificial sugar substitutes
3,100% peach fruit product sweetened with grape ...,"Jams, marmalades and honey"
4,100% blackcurrant fruit product sweetened with...,"Jams, marmalades and honey"
...,...,...
17094,Żywiec Zdrój with apple flavor Non-carbonated ...,Mineral or spring waters
17095,Żywiec Zdrój with strawberry flavor Non-carbon...,Mineral or spring waters
17096,Element - Gently sparkling spring water,Mineral or spring waters
17097,Element - Sparkling spring water,Mineral or spring waters


Create a new dataframe from the original, containing every product whose name includes "Pizza" (the match is case-sensitive), with its category changed to "Pizza"

In [3]:
changes = t1[t1['produkt'].str.contains('Pizza')].copy(deep=True).reset_index(drop=True)
changes['kategoria'] = 'Pizza'
changes
#maybe save old category for later use

Unnamed: 0,produkt,kategoria
0,American Style Ham and Mushroom Pizza 423 g,Pizza
1,American Style Pepperoni Pizza 404 g,Pizza
2,American Style Quattro Formaggi Pizza 395 g,Pizza
3,Pizza roll 75 g,Pizza
4,Capricciosa Pizza 384 g,Pizza
...,...,...
76,Value Pizza Speciale 1 kg,Pizza
77,Value Pizza with mushrooms 300 g,Pizza
78,Value Pizza with salami 300 g,Pizza
79,Value Pizza with ham 300 g,Pizza


Transform the product names into a bag of words to get word-count features (limited to 500 features)

In [4]:
# Raw string so the regex escapes (\w, \., \%) are passed to the regex engine unchanged
vectorizer = CountVectorizer(token_pattern=r'\w\w+|[1-9]\.[1-9]\%|[1-9]\,[1-9]\%|[1-9]\.[1-9]|[1-9]\,[1-9]|[1-9]\%', max_features=500)
vectorizer.fit(t1['produkt'])
# toarray() returns the dense count matrix directly (get_feature_names() was removed in scikit-learn 1.2)
X = vectorizer.transform(t1['produkt']).toarray()
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
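To see what the custom `token_pattern` keeps, the sketch below runs the same pattern on two made-up product names (the names are illustrative and not taken from the data set). `build_analyzer()` returns the function scikit-learn applies to each document: lowercasing followed by the regex tokenisation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same token pattern as the notebook, written as a raw string
pattern = r'\w\w+|[1-9]\.[1-9]\%|[1-9]\,[1-9]\%|[1-9]\.[1-9]|[1-9]\,[1-9]|[1-9]\%'
demo_vectorizer = CountVectorizer(token_pattern=pattern)
analyze = demo_vectorizer.build_analyzer()

# Hypothetical product names, used only to illustrate the tokenisation
tokens_a = analyze("Milk 3.2% 1 l")
tokens_b = analyze("Yoghurt 1,5% fat")
print(tokens_a)  # ['milk', '3.2%']
print(tokens_b)  # ['yoghurt', '1,5%', 'fat']
```

Single characters such as "1" and "l" are dropped, while one-digit percentages written with either a dot or a comma survive as single tokens.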

Replace each category name with an integer code so that the labels are numeric

In [5]:
Y = t1['kategoria']
Ys = np.unique(t1['kategoria'])
Ys = dict(zip(Ys, range(len(Ys))))
Y = Y.map(Ys).to_numpy()
print(t1['kategoria'],"\nBecomes\n",Y)

0              Confectionery products
1          Fruit and vegetable juices
2        Artificial sugar substitutes
3          Jams, marmalades and honey
4          Jams, marmalades and honey
                     ...             
17094        Mineral or spring waters
17095        Mineral or spring waters
17096        Mineral or spring waters
17097        Mineral or spring waters
17098        Mineral or spring waters
Name: kategoria, Length: 17099, dtype: object 
Becomes
 [10 30  0 ... 34 34 34]


Rows that share a category now share the same integer label
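The mapping can also be inverted to turn predicted codes back into category names. The following is a minimal sketch on a four-row stand-in series; the names `cats`, `name_to_code` and `code_to_name` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Small stand-in for t1['kategoria']
cats = pd.Series(["Pizza and quiche", "Confectionery products",
                  "Pizza and quiche", "Mineral or spring waters"])

# np.unique returns the distinct names in sorted order, so codes follow alphabetical order
name_to_code = {name: code for code, name in enumerate(np.unique(cats))}
codes = cats.map(name_to_code).to_numpy()

# Reverse lookup: integer code back to category name
code_to_name = {code: name for name, code in name_to_code.items()}
decoded = [code_to_name[c] for c in codes]

print(codes)  # [2 0 2 1]
```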

## Splitting Data into Training, Reference and Testing

![image.png](attachment:image.png)

The X and Y data are split into three roughly equal parts: Training data (33%), Reference data and Testing data (half of the remainder each)

In [6]:
Xtrain, Xref, Ytrain, Yref = train_test_split(X, Y, train_size=0.33, random_state=42)
Xref, Xtest, Yref, Ytest = train_test_split(Xref, Yref, train_size=0.5, random_state=42)
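As a quick check of the resulting proportions, the sketch below applies the same two calls to a placeholder array with 17099 entries, the length reported for the data above. The array contents are arbitrary; only the sizes matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the product table
rows = np.arange(17099)

# First split: 33% training, the rest held back
train, rest = train_test_split(rows, train_size=0.33, random_state=42)
# Second split: the held-back rows are halved into reference and test
ref, test = train_test_split(rest, train_size=0.5, random_state=42)

print(len(train), len(ref), len(test))  # 5642 5728 5729
```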

A Multinomial Naive Bayes classifier is trained using the X and Y training data
<br>
This is then used to predict labels for both the Training and Testing data

In [7]:
clf = MultinomialNB()
clf.partial_fit(Xtrain, Ytrain, classes=np.unique(Ytrain))

MultinomialNB()

In [8]:
train_predictions = clf.predict(Xtrain)
test_predictions = clf.predict(Xtest)

In [9]:
print("Model training accuracy: ", accuracy_score(Ytrain, train_predictions))
print("Model test accuracy: ", accuracy_score(Ytest, test_predictions))

Model training accuracy:  0.7913860333215172
Model test accuracy:  0.7500436376330948


Whilst higher accuracy would be ideal, the training accuracy (0.79) and test accuracy (0.75) are similar, so the model generalises reasonably well to new data

## Generate different forms of Concept Drift

Generate concept drift in different ways based on the Test data

Concept Drift 1:
<br>
Swap two numeric category labels. The random selection below is commented out, and the fixed labels 43 and 7 are swapped instead

![image.png](attachment:image.png)

In [10]:
#Randomise selection of categories
#k=np.random.choice(np.unique(Ytest), 2, replace=False)
#print(k)

In [11]:
XconSwap, YconSwap = Xtest.copy(), Ytest.copy()
idx1 = np.argwhere(Ytest==43)
idx2 = np.argwhere(Ytest==7)
YconSwap[idx1] = 7
YconSwap[idx2] = 43

In [12]:
print("Size of category 1 -", idx1.size,"\nSize of category 2 -", idx2.size)

Size of category 1 - 125 
Size of category 2 - 138
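The swap can be wrapped in a small helper so that any pair of labels, fixed or randomly drawn, can be exchanged. This is a sketch only: `swap_labels` is a hypothetical helper and the seed is an arbitrary choice, neither is part of the notebook.

```python
import numpy as np

def swap_labels(y, a, b):
    """Return a copy of y in which labels a and b are exchanged."""
    y_swapped = y.copy()
    # Both masks are computed on the original array, so no row is swapped twice
    y_swapped[y == a] = b
    y_swapped[y == b] = a
    return y_swapped

y_demo = np.array([0, 1, 2, 1, 0, 2])
fixed_swap = swap_labels(y_demo, 1, 2)
print(fixed_swap)  # [0 2 1 2 0 1]

# Random variant: draw two distinct labels with a seeded generator
rng = np.random.default_rng(0)
a, b = rng.choice(np.unique(y_demo), size=2, replace=False)
random_swap = swap_labels(y_demo, a, b)
```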


Concept Drift 2:
<br>
Append the rows that introduced the Pizza category to the test data, whilst keeping the original rows with their old category as well

In [13]:
changes = pd.concat([changes]*5, ignore_index=True)

In [14]:
# Reuse the fitted vectorizer; toarray() avoids get_feature_names(), which was removed in scikit-learn 1.2
XAdd = vectorizer.transform(changes['produkt']).toarray()

In [15]:
labels = np.amax(Y)+1  # first unused label, taken from the full label set in case the test split lacks the largest code
YAdd = changes['kategoria']
YCs = np.unique(changes['kategoria'])
YCs = dict(zip(YCs, range(labels,labels+YCs.size)))
YAdd = YAdd.map(YCs).to_numpy()

In [16]:
XconAdd, YconAdd = Xtest.copy(), Ytest.copy()
XconAdd = np.concatenate((XconAdd, XAdd), axis=0)
YconAdd = np.concatenate((YconAdd, YAdd), axis=0)

Concept Drift 3:
<br>
Update the test data by replacing the old "Pizza and quiche" category with a new "Pizza" category

In [17]:
p = Ys.get('Pizza and quiche')

In [18]:
idx3 = np.argwhere(Ytest==p)

In [19]:
XconUpd, YconUpd = Xtest.copy(), Ytest.copy()
YconUpd[idx3] = labels

Check the accuracy of the original model on each drifted data set

In [20]:
d1_predictions = clf.predict(XconSwap)
d2_predictions = clf.predict(XconAdd)
d3_predictions = clf.predict(XconUpd)

In [21]:
print("Model drift 1 accuracy: ", accuracy_score(YconSwap, d1_predictions))
print("Model drift 2 accuracy: ", accuracy_score(YconAdd, d2_predictions))
print("Model drift 3 accuracy: ", accuracy_score(YconUpd, d3_predictions))

Model drift 1 accuracy:  0.7168790364810612
Model drift 2 accuracy:  0.7005216824258232
Model drift 3 accuracy:  0.7444580205969629


## Drift Detection

Calculate binary losses for the reference data, test data and each of the different concept drift data sets. Each value is 1 where the model's prediction matches the label and 0 where it does not, so these values measure the model's performance on each data set

In [22]:
lossRef = (clf.predict(Xref) == Yref).astype(int)
lossTest = (clf.predict(Xtest) == Ytest).astype(int)
lossSwap = (clf.predict(XconSwap) == YconSwap).astype(int)
lossAdd = (clf.predict(XconAdd) == YconAdd).astype(int)
lossUpd = (clf.predict(XconUpd) == YconUpd).astype(int)
losses = {'No drift': lossTest, 'Concept swap drift': lossSwap, 'Concept add drift': lossAdd, 'Concept update drift': lossUpd}

Because the losses are binary, the FET (Fisher's Exact Test) detector is used. The detector is initialised with the reference data losses and with the "alternative" parameter set to "less", so drift is only flagged when the proportion of correct predictions falls, that is, when the model's performance drops

In [23]:
fetDetective = FETDrift(lossRef, p_val=0.05, alternative='less')
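The underlying test can be sketched with SciPy alone. The example below builds a 2x2 table of correct and incorrect predictions and runs a one-sided Fisher's exact test on it. It is an illustration of the idea under stated assumptions, not a copy of the alibi-detect internals, which may differ in detail; `fet_p_value`, `make_losses` and all the counts are hypothetical.

```python
import numpy as np
from scipy.stats import fisher_exact

def fet_p_value(ref, test):
    """One-sided Fisher's exact test: is the share of 1s lower in test than in ref?"""
    ref, test = np.asarray(ref), np.asarray(test)
    # Rows: test, reference. Columns: correct (1), incorrect (0).
    table = [[int(test.sum()), int(len(test) - test.sum())],
             [int(ref.sum()), int(len(ref) - ref.sum())]]
    _, p = fisher_exact(table, alternative='less')
    return p

def make_losses(n_correct, n_total):
    """Binary array with n_correct ones followed by zeros (illustrative data)."""
    return np.array([1] * n_correct + [0] * (n_total - n_correct))

ref = make_losses(750, 1000)                          # 75% accuracy on reference data
p_same = fet_p_value(ref, make_losses(750, 1000))     # same accuracy
p_drop = fet_p_value(ref, make_losses(650, 1000))     # 10-point drop, large sample
p_small = fet_p_value(ref, make_losses(13, 20))       # same drop, only 20 rows

print(p_same, p_drop, p_small)
```

With these made-up counts, only the large-sample drop falls below the 0.05 threshold: an identical accuracy is never flagged, and the same 10-point drop on 20 rows does not carry enough evidence.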

Applying the detector to the test data and to each concept drift data set shows which changes are detected.
<br>
No drift is found in the original test data.
<br>
Drift is found in the data where two labels were swapped. In testing, the p-value depended on how many rows carried each swapped label: larger categories gave stronger evidence of drift.
<br>
Drift is found when adding the new category, but only when the number of new entries is large enough. The initial addition of just 81 new rows did not change the losses enough for drift to be detected, which is why the rows are repeated five times above.
<br>
No drift is found when replacing the old category with a new one. Again, the change affects too few rows to be flagged as drift.

In [24]:
label = ['No!', 'Yes!']
for name, lossArr in losses.items():
    print('\n%s' % name)
    preds = fetDetective.predict(lossArr)
    print('Drift? {}'.format(label[preds['data']['is_drift']]))
    print('p-value: {}'.format(preds['data']['p_val'][0]))


No drift
Drift? No!
p-value: 0.706548772309148

Concept swap drift
Drift? Yes!
p-value: 0.00025772177063497335

Concept add drift
Drift? Yes!
p-value: 2.0238320084024257e-08

Concept update drift
Drift? No!
p-value: 0.44251259052393954


## Model Retraining

Retrain the classifier on the drifted data. A new MultinomialNB is created and fitted with partial_fit, so the drifted data is used as the Training data in place of the original training set

In [25]:
clf = MultinomialNB()
clf.partial_fit(XconSwap, YconSwap, classes=np.unique(YconSwap))

MultinomialNB()
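`partial_fit` can also update an existing model in place instead of starting from a new one. A minimal sketch on toy count data follows; all arrays are illustrative. One constraint to be aware of is that every class the model may ever see has to be declared in the first `partial_fit` call, including classes that only appear in later batches.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# First batch contains classes 0 and 1 only
X_first = np.array([[3, 0, 0], [0, 3, 0]])
y_first = np.array([0, 1])

# Second batch introduces class 2 (for example, a newly added category)
X_second = np.array([[0, 0, 3], [0, 0, 4]])
y_second = np.array([2, 2])

model = MultinomialNB()
# All classes, including the future class 2, are declared up front
model.partial_fit(X_first, y_first, classes=np.array([0, 1, 2]))
# Later call updates the existing counts; classes must not be changed here
model.partial_fit(X_second, y_second)

prediction = model.predict(np.array([[0, 0, 5]]))
print(prediction)  # [2]
```

Updating in place keeps what was learned from the earlier batches, whereas creating a new classifier, as done above, discards it.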

In [26]:
train_predictions = clf.predict(XconSwap)
test_predictions = clf.predict(Xtest)

In [27]:
print("Model drift training accuracy: ", accuracy_score(YconSwap, train_predictions))
print("Model test accuracy: ", accuracy_score(Ytest, test_predictions))

Model drift training accuracy:  0.7962995287135626
Model test accuracy:  0.7589457147844301


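Model accuracy on the drifted data is now 0.796, compared with 0.717 when using the original model, and test accuracy also improves slightly (0.759 versus 0.750). Note that XconSwap contains the same rows as Xtest and differs only in the two swapped labels, so the test figure is not a held-out score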

Run the detector again with the same reference data losses, this time using predictions from the retrained classifier

In [28]:
lossSwap2 = (clf.predict(XconSwap) == YconSwap).astype(int)

In [29]:
preds = fetDetective.predict(lossSwap2)
print('Drift? {}'.format(label[preds['data']['is_drift']]))
print('p-value: {}'.format(preds['data']['p_val'][0]))

Drift? No!
p-value: 0.9999999999471755
