<center><h1>Zero Shot Classification</h1></center>
<br>
<center>Zero-shot learning is a problem setup in machine learning, where at test time, a learner observes samples from classes that were not observed during training, and needs to predict the category they belong to. This problem is widely studied in computer vision, natural language processing and machine perception.</center>

<center><img src = https://miro.medium.com/max/576/1*7i5LhQ33_EdxMaPu3iteQg@2x.png></center>

<center><h4>I will be using the HuggingFace Transformers package to predict question tags for this StackOverflow dataset. I'm just a beginner with this, so please feel free to comment if I can do something better. As always, let's start with a meme.</h4></center>

<br>
<center><img src = https://miro.medium.com/max/450/1*SVWSZ-lOCcQNHk6r8apCbA.png></center>

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git

In [None]:
from transformers import pipeline # HuggingFace Transformers Package
import pandas as pd
from tqdm import tqdm

In [None]:
df = pd.read_csv('/kaggle/input/60k-stack-overflow-questions-with-quality-rate/data.csv')
df.head()

### The dataset is very straightforward. We are only interested in:
- Title
- Body
- Tags

## Initializing the classifier

The pipeline takes the task name and the device as input:
- device = -1 (CPU) [This will take at least 2 hours for 100 rows of this dataset]
- device = 0 (GPU) [So much faster]
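
If you don't want to hard-code the device, a minimal sketch like the one below picks it automatically. It assumes PyTorch is the backend (the name `device` is just my choice here) and falls back to CPU when PyTorch or a GPU is not available.

```python
# Pick the pipeline device automatically: 0 for the first GPU, -1 for CPU.
try:
    import torch
    device = 0 if torch.cuda.is_available() else -1
except ImportError:
    # Assumption: without PyTorch installed we simply fall back to CPU.
    device = -1

print("Using device:", device)
```

The resulting value can then be passed as the `device` argument when creating the pipeline.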

In [None]:
classifier = pipeline("zero-shot-classification", device=0)

## Minor Preprocessing
Each row stores its tags as a single string in which every tag is wrapped in `<>`, so we slice from index 1 to the second-to-last character and split on `><` to get a list of tags. Similarly, we slice the body to drop its first 3 and last 4 characters (the wrapping `<p>` and `</p>`). The body fields are too long to process within the 16 GB of RAM that Kaggle provides, so we will be classifying based on the title only for now. Stay tuned for updates.


In [None]:
df['Tags'] = df['Tags'].apply(lambda x: x[1:-1].split('><'))
df['Body'] = df['Body'].apply(lambda x: x[3:-4])
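
To make the slicing concrete, here is a small standalone sketch of the same two operations applied to plain strings. The helper names `parse_tags` and `strip_paragraph` and the sample values are made up for illustration; they are not part of the dataset.

```python
def parse_tags(raw: str) -> list:
    """Turn '<python><pandas>' into ['python', 'pandas']."""
    # Drop the leading '<' and trailing '>', then split on the '><' separators.
    return raw[1:-1].split('><')


def strip_paragraph(body: str) -> str:
    """Remove the first 3 and last 4 characters, e.g. a wrapping <p>...</p>."""
    return body[3:-4]


# Hypothetical sample values, for illustration only
print(parse_tags('<python><pandas><dataframe>'))  # ['python', 'pandas', 'dataframe']
print(strip_paragraph('<p>How do I merge two frames?</p>'))  # How do I merge two frames?
```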

In [None]:
# Extracting the unique labels from the first 200 rows

labels = []
for i in range(200):
    labels.extend(df.iloc[i,]['Tags'])
labels = set(labels)
all_labels = list(labels)

## Word cloud of the unique tags in the first 200 samples

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

cloud = ''
for i in all_labels:
    cloud += i + ' '

plt.subplots(figsize = (8,8))

wordcloud = WordCloud (
                    background_color = 'white',
                    width = 1024,
                    height = 1024
                        ).generate(cloud)
plt.imshow(wordcloud) # display the word cloud image
plt.axis('off') # hide the x and y axes
plt.savefig('Plotly-World_Cloud.png')
plt.show()
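
Since `all_labels` holds each tag only once, the cloud above does not reflect how often a tag occurs. If frequencies are what you are after, a rough sketch with `collections.Counter` could look like this. The `sample_tags` list is a hypothetical stand-in for the preprocessed `Tags` column.

```python
from collections import Counter

# Hypothetical tag lists standing in for df['Tags'] after preprocessing
sample_tags = [
    ['python', 'pandas'],
    ['javascript', 'html'],
    ['python', 'numpy'],
]

# Count every occurrence of a tag, not just the unique tags
tag_counts = Counter(tag for tags in sample_tags for tag in tags)
print(tag_counts.most_common(1))  # [('python', 2)]
```

A dictionary of counts like `tag_counts` can be given to `WordCloud.generate_from_frequencies` instead of building one long string.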

## Prediction

In [None]:
y_pred = []
y = []
for i in tqdm(range(200)):
    title = df.iloc[i]['Title']
    tags = df.iloc[i]['Tags']
    # multi_label=True scores every label independently (older releases called this multi_class)
    op = classifier(title, all_labels, multi_label=True)
    res_dict = {label: score for label, score in zip(op['labels'], op['scores'])}
    # Sort the labels in descending order of their score
    sorted_labels = sorted(res_dict, key=res_dict.get, reverse=True)
    categories = sorted_labels[:4]  # store only the best 4 predictions
    y.append(tags)
    y_pred.append(categories)
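
The selection step in the loop above (keep the four highest-scoring labels) can be tried out without loading the model. The sketch below uses a hypothetical `fake_result` dictionary with made-up scores, shaped like the `labels` and `scores` fields the classifier returns; `top_k_labels` is a helper name I made up.

```python
def top_k_labels(result: dict, k: int = 4) -> list:
    """Return the k labels with the highest scores from a pipeline-style result."""
    pairs = zip(result['labels'], result['scores'])
    # Sort (label, score) pairs by score, highest first
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked[:k]]


# Hypothetical output with made-up scores, for illustration only
fake_result = {
    'sequence': 'How to merge two dataframes?',
    'labels': ['html', 'pandas', 'css', 'python', 'java'],
    'scores': [0.12, 0.97, 0.08, 0.91, 0.05],
}
print(top_k_labels(fake_result, k=2))  # ['pandas', 'python']
```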

In [None]:
out = pd.DataFrame(list(zip(y, y_pred)), columns =['Labels', 'Predicted_Labels']) 
out.to_csv('output.csv')
out.head(10)

### From the sample above, the Zero Shot Learner appears to predict tags with respectable accuracy. Now let's measure the performance: we encode the labels as 0/1 vectors, compute the Hamming Loss with SKLearn, and then implement it from scratch.

In [None]:
cat_idx = {cat : i for i,cat in enumerate(all_labels)}  # Map each category to its index to encode the output for evaluation

In [None]:
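y_trueEncoded = []
y_predEncoded = []
# Distinct loop names keep the y and y_pred lists from being overwritten
for true_tags, pred_tags in zip(y, y_pred):
    # 1 at the index of every true tag, 0 elsewhere
    encTrue = [0] * len(all_labels)
    for cat in true_tags:
        idx = cat_idx[cat]
        encTrue[idx] = 1
    y_trueEncoded.append(encTrue)
    # 1 at the index of every predicted tag, 0 elsewhere
    encPred = [0] * len(all_labels)
    for cat in pred_tags:
        idx = cat_idx[cat]
        encPred[idx] = 1
    y_predEncoded.append(encPred)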
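
Here is the same encoding idea on a tiny example, as a sketch. The four labels and the helper name `multi_hot` are hypothetical and only serve to show what one encoded row looks like.

```python
def multi_hot(tags, label_index):
    """Encode a list of tags as a 0/1 vector over all known labels."""
    vec = [0] * len(label_index)
    for tag in tags:
        vec[label_index[tag]] = 1
    return vec


# Hypothetical label set, for illustration only
toy_labels = ['css', 'html', 'pandas', 'python']
toy_index = {label: i for i, label in enumerate(toy_labels)}

print(multi_hot(['python', 'pandas'], toy_index))  # [0, 0, 1, 1]
```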

## The Hamming loss is the fraction of labels that are incorrectly predicted.


[See the scikit-learn documentation for Hamming Loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html)
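
Written as a formula, with notation that is my own choice (N samples, K possible labels, y for the true 0/1 encoding and ŷ for the predicted one), the definition above reads:

$$
\text{Hamming loss} = \frac{1}{N K} \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}\left[\, y_{n,k} \neq \hat{y}_{n,k} \,\right]
$$

Each mismatched position counts once, whether it is a missed true tag or a wrongly predicted one. When K is large and most entries are 0 in both encodings, the loss tends to look small, so it is worth reading it alongside the actual predictions.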

In [None]:
from sklearn.metrics import hamming_loss
print('Hamming Loss =', hamming_loss(y_trueEncoded,y_predEncoded))

## Implementing Hamming loss from scratch

In [None]:
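match_rate = 0
# Distinct loop names keep the y list from being overwritten
for row_true, row_pred in zip(y_trueEncoded, y_predEncoded):
    # Count the label positions where prediction and truth agree
    matches = 0
    for i in range(len(row_true)):
        if row_true[i] == row_pred[i]:
            matches += 1
    match_rate += matches / len(row_true)
match_rate /= len(y_trueEncoded)
# The Hamming loss is the complement of the average match rate
print('Hamming Loss =', 1 - match_rate)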
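
As a quick sanity check, the same computation can be run on toy data small enough to verify by hand. This is only a sketch: `hamming_loss_scratch` is a name I made up and the two toy encodings are hypothetical. Here 2 of the 8 positions disagree, so the loss is 2/8.

```python
def hamming_loss_scratch(y_true, y_pred):
    """Fraction of label positions where the two encodings disagree."""
    mismatches = 0
    total = 0
    for row_true, row_pred in zip(y_true, y_pred):
        for t, p in zip(row_true, row_pred):
            mismatches += int(t != p)
            total += 1
    return mismatches / total


# Hypothetical toy encodings: 2 samples, 4 possible labels
toy_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
toy_pred = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(hamming_loss_scratch(toy_true, toy_pred))  # 0.25
```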

## References
- [1] [Hugging Face Github](https://github.com/huggingface/transformers)
- [2] [Zero Shot Classification Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html#zeroshotclassificationpipeline)
- [3] [Kernel By Aayush Jain](https://www.kaggle.com/foolofatook/zero-shot-classification-with-huggingface-pipeline)