# Validation of models


Our models find similarities among articles, but how can we test that the similarities make sense?
On the "AutismParentMagazine" website, each article is assigned to a **category**.
We can validate the similarities against the categories provided by the website.   

Let's define a function that tells whether two articles ($1$ and $2$) belong to the same category:     

\begin{eqnarray}
f(1,2) &=& \left\{ \begin{array}{ll}
1 & \mbox{if } 1 \mbox{ and } 2 \mbox{ are in the same category,} \\
0 & \mbox{otherwise.}
\end{array} \right. \nonumber
\end{eqnarray}

Now we can use $f$ to judge the pairs that the model reports as similar.

If the model finds that two articles ($i$, $j$) are similar and $f(i,j)=1$, they belong to the same category and hence are a good match; conversely, if $f(i,j)=0$, they may not be a good match.

We can then evaluate the model, for all $N$ pairs of similar articles $i$, $j$, as:   
\begin{eqnarray}
\mbox{score} = \frac{1}{N} \sum_{i,j} f(i,j) .
\end{eqnarray}

When every pair of articles that the model finds similar also belongs to the same category, the score is 1.   
The closer the score is to 1, the better the model agrees with the website's categories.
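
As a small illustration of the definitions above, here is a minimal sketch of $f$ and the score on toy data. The helper names (`same_category`, `score`), the category labels, and the list of similar pairs are all hypothetical and chosen only for this example; they are not part of the dataset or of the models evaluated below.

```python
def same_category(categories, i, j):
    """f(i, j): 1 if articles i and j share a category, 0 otherwise."""
    return 1 if categories[i] == categories[j] else 0


def score(similar_pairs, categories):
    """Average of f(i, j) over the N pairs reported as similar."""
    n_pairs = len(similar_pairs)
    return sum(same_category(categories, i, j) for i, j in similar_pairs) / n_pairs


# Hypothetical toy data: four articles and their categories
toy_categories = {0: "therapy", 1: "therapy", 2: "diet", 3: "diet"}
# Hypothetical model output: pairs of articles found to be similar
toy_pairs = [(0, 1), (2, 3), (0, 2), (1, 3)]

toy_score = score(toy_pairs, toy_categories)
print(toy_score)  # 2 of the 4 pairs share a category, so this prints 0.5
```

Here two of the four pairs fall in the same category, so the score is $2/4 = 0.5$.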



In [1]:
import pandas as pd
import os

In [2]:
from collections import defaultdict
from gensim import corpora, models, similarities


def get_model_score(ids, matsim, categories, num_of_predictions=3):
    """Return the fraction of predicted similar pairs that share a category."""
    model_score = 0
    for doc_id, doc in zip(ids, matsim.index):
        # Articles most similar to `doc`, as (article id, similarity) pairs
        sims = matsim[doc]
        for other_id, _ in sims:
            # Skip the article itself, then count same-category matches
            if doc_id != other_id and categories[doc_id] == categories[other_id]:
                model_score += 1
    # N assumes num_of_predictions matches per article, besides the article itself
    N = len(ids) * num_of_predictions
    return model_score / N


In [3]:
os.chdir('../data/')
# Input file with the tokenized posts
input_fname = "AutismParentMagazine-posts-tokens.csv"


# Get categories and ids from dataset
df = pd.read_csv(input_fname,index_col=0)
df.head(2)
categories=df['category']
ids=df.index

In [4]:
import pickle

# Load the LSI similarity index and score it
with open("lsi-matsim.save", "rb") as f:
    matsim = pickle.load(f)
model_score = get_model_score(ids, matsim, categories)
print("LSI model score {}".format(model_score))

# Load the LDA similarity index and score it
with open("lda-matsim.save", "rb") as f:
    matsim = pickle.load(f)
model_score = get_model_score(ids, matsim, categories)
print("LDA model score {}".format(model_score))


LSI model score 0.6851851851851852
LDA model score 0.24074074074074073
