# Tweet Sentiment Extraction


## 1. Introduction

* With all of the tweets circulating every second it is hard to tell whether the sentiment behind a specific tweet will impact a company, or a person's, brand for being viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times where decisions and reactions are created and updated in seconds. But, which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

* The goal of this competition is to extract the word or phrase that determines the sentiment of the whole tweet.

> Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.



### 1.1 About Tweet Sentiment Extraction Dataset

The data folder contains the following files, all in CSV format:

**Files**
* `train.csv` - the training set
* `test.csv` - the test set
* `sample_submission.csv` - a sample submission file in the correct format

**Columns**
* `textID` - unique ID for each piece of text
* `text` - the text of the tweet
* `sentiment` - the general sentiment of the tweet
* `selected_text` - [train only] the text that supports the tweet's sentiment

### 1.2 Competition metric:

The metric in this competition is the word-level Jaccard score.
Jaccard similarity, or intersection over union, is defined as the size of the intersection divided by the size of the union of two sets. Let's take the example of two sentences:

`Sentence 1: AI is our friend and it has been friendly`

`Sentence 2: AI and humans have always been friendly`

To calculate the Jaccard similarity in this example, we first perform lemmatization to reduce words to the same root word. In our case, “friend” and “friendly” both become “friend”, and “has” and “have” both become “has”. Note that the `jaccard` function used later in this kernel skips lemmatization and simply compares lowercased, whitespace-separated words. Drawing a Venn diagram of the two sentences we get:

![](https://miro.medium.com/max/926/1*NSK8ERXexyIZ_SRaxioFEg.png)

Please read [this](https://medium.com/@adriensieg/text-similarities-da019229c894) article for a better understanding.
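As a rough illustration, the snippet below computes the word-level Jaccard score for the two example sentences using only lowercasing and whitespace splitting, with no lemmatization. The helper name `word_jaccard` and the handling of two empty strings are assumptions made for this sketch.

```python
def word_jaccard(str1, str2):
    """Word-level Jaccard: |A & B| / |A | B| over lowercased, whitespace-split tokens."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption for this sketch: two empty strings count as identical
    return len(a & b) / len(a | b)


sentence_1 = "AI is our friend and it has been friendly"
sentence_2 = "AI and humans have always been friendly"

# Shared words: ai, and, been, friendly (4); the union has 12 distinct words.
score = word_jaccard(sentence_1, sentence_2)
print(round(score, 3))  # 0.333
```

Without lemmatization the score is 4/12 ≈ 0.33. With the lemmatization step described above, "friend"/"friendly" and "has"/"have" would also match, giving 5/10 = 0.5.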

### 1.3 Posing this problem as an NER problem

**What is NER?**

`Named-entity recognition (NER)` (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.


**A few important links to visit**

* I used `spacy` to train a custom NER model in which the only entity label is `selected_text`
* Please visit [this](https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718) to know more about `spacy`
* Please visit [this](https://spacy.io/usage/training) to read about training custom models using `spacy`

### 1.4 3 Models vs 1 Model

* Earlier I tried using a single model for prediction, but unfortunately the results were not convincing enough. This may be due to the many wrong labels in the dataset, or the model may not have been good enough to generalize.

* In this kernel I have used 3 models, one for each sentiment category.

* The trained models are available at this [link](https://www.kaggle.com/rohitsingh9990/tse-spacy-model)

* The models are still a baseline and can be improved further.

* With the current models you can get an LB score of `0.628`, which is quite decent given that the competition is still in its early stage.
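One way to picture the three-model setup is a lookup from sentiment to model. The sketch below uses plain Python functions as stand-ins for the trained spaCy models. The stubs and the `extract` helper are hypothetical and only illustrate the routing, not the actual prediction.

```python
def make_stub(tag):
    """Return a stand-in 'model' that just labels which model handled the text."""
    return lambda text: f"[{tag}] {text}"


# Hypothetical stand-ins; in the kernel these would be the three spaCy pipelines.
models = {
    "positive": make_stub("pos"),
    "negative": make_stub("neg"),
    "neutral": make_stub("neu"),
}


def extract(text, sentiment):
    """Route the tweet to the model trained for its sentiment."""
    return models[sentiment](text)


print(extract("what a great day", "positive"))  # [pos] what a great day
```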

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# !pip install chart_studio

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
from tqdm import tqdm
import os
import nltk
import spacy
import random
from spacy.util import compounding
from spacy.util import minibatch

import warnings
warnings.filterwarnings("ignore")


## 2. Reading Data

In [None]:
BASE_PATH = '../input/tweet-sentiment-extraction/'

train_df = pd.read_csv(BASE_PATH + 'train.csv')
test_df = pd.read_csv(BASE_PATH + 'test.csv')
submission_df = pd.read_csv(BASE_PATH + 'sample_submission.csv')

### 2.1 Dropping rows with missing values

In [None]:
train_df = train_df.dropna()

## 3. Training Model

* To train a model from scratch, pass `model=None`.
* To resume training of a saved model, pass any non-`None` value as `model`; the model is then loaded from `output_dir`.

In [None]:
def save_model(output_dir, nlp, new_model_name):
    """Save the trained pipeline under ../working/<output_dir>."""
    if output_dir is None:
        return  # nothing to save to
    output_dir = f'../working/{output_dir}'
    os.makedirs(output_dir, exist_ok=True)
    nlp.meta["name"] = new_model_name
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

In [None]:
# pass any non-None value as `model` to resume training from the model saved at output_dir

def train(train_data, output_dir, n_iter=20, model=None):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(output_dir)  # load existing spaCy model
        print("Loaded model '%s'" % output_dir)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")
    
    # add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # initialise weights for a new model, or keep the existing ones when resuming
        if model is None:
            nlp.begin_training()
        else:
            nlp.resume_training()

        for itn in tqdm(range(n_iter)):
            random.shuffle(train_data)
            # batch up the examples using spaCy's minibatch; batch size grows from 4 towards 500
            batches = minibatch(train_data, size=compounding(4.0, 500.0, 1.001))
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,   # dropout - make it harder to memorise data
                    losses=losses, 
                )
            
            print("Losses", losses)
    save_model(output_dir, nlp, 'st_ner')
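
To see what `compounding(4.0, 500.0, 1.001)` does to the batch size, here is a minimal pure-Python sketch that mimics the behaviour of spaCy's `compounding` and `minibatch` helpers as I understand them: start at 4, multiply by 1.001 after every batch, cap at 500, and truncate to an integer. The names `compounding_sketch` and `minibatch_sketch` are made up for this example and this is not spaCy's actual code.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def minibatch_sketch(items, size):
    """Split items into consecutive batches whose sizes come from the `size` generator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


data = list(range(2000))  # toy stand-in for train_data
schedule = compounding_sketch(4.0, 500.0, 1.001)
batch_sizes = [len(batch) for batch in minibatch_sketch(data, schedule)]
print(batch_sizes[0], max(batch_sizes), len(batch_sizes))
```

With a growth factor of 1.001 the batch size increases very slowly, so the early batches contain only 4 examples each.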

In [None]:
def get_model_out_path(sentiment):
    model_out_path = None
    if sentiment == 'positive':
        model_out_path = 'models/model_pos'
    elif sentiment == 'negative':
        model_out_path = 'models/model_neg'
    else:
        model_out_path = 'models/model_neu'
    return model_out_path
    

In [None]:
# build training data in spaCy's input format: (text, {"entities": [[start, end, label]]})

def get_training_data(sentiment):
    train_data = []
    for index, row in train_df.iterrows():
        if row.sentiment == sentiment:
            selected_text = row.selected_text
            text = row.text
            start = text.find(selected_text)
            end = start + len(selected_text)
            train_data.append((text, {"entities": [[start, end, 'selected_text']]}))
    return train_data
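
For a concrete picture of this input format, the sketch below converts one made-up tweet into a `(text, {"entities": [[start, end, label]]})` tuple. The example tweet and the `to_spacy_example` helper are hypothetical. Unlike `get_training_data`, the helper returns `None` when `selected_text` is not found in `text`, because `str.find` returns `-1` in that case and the offsets would be wrong.

```python
def to_spacy_example(text, selected_text, label="selected_text"):
    """Build one training tuple with character offsets for the selected span."""
    start = text.find(selected_text)
    if start == -1:
        return None  # span not present in the tweet; skip rather than emit bad offsets
    end = start + len(selected_text)
    return (text, {"entities": [[start, end, label]]})


example = to_spacy_example("I really love this song", "love this song")
print(example)
# ('I really love this song', {'entities': [[9, 23, 'selected_text']]})
```

Slicing the text with the stored offsets, `text[9:23]`, gives back `"love this song"`, which is a quick way to sanity-check the annotations before training.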

### 3.1 Training for `positive` sentiment

In [None]:
sentiment = 'positive'

train_data = get_training_data(sentiment)
model_path = get_model_out_path(sentiment)

# for demo purposes I am training the model for just 2 iterations, feel free to experiment.
train(train_data, model_path, n_iter=2, model=None)

### 3.2 Training for `negative` sentiment

In [None]:
sentiment = 'negative'

train_data = get_training_data(sentiment)
model_path = get_model_out_path(sentiment)

# for demo purposes I am training the model for just 2 iterations, feel free to experiment.
train(train_data, model_path, n_iter=2, model=None)

### 3.3 Training for `neutral` sentiment

In [None]:
sentiment = 'neutral'

train_data = get_training_data(sentiment)
model_path = get_model_out_path(sentiment)

# for demo purposes I am training the model for just 2 iterations, feel free to experiment.
train(train_data, model_path, n_iter=2, model=None)

## 4. Jaccard score on train data

In [None]:
TRAINED_MODELS_BASE_PATH = '../input/tse-spacy-model/models/'

In [None]:
def predict_entities(text, model):
    """Return the first predicted entity span, or the whole text if none is found."""
    doc = model(text)
    ent_array = []
    for ent in doc.ents:
        start = text.find(ent.text)
        end = start + len(ent.text)
        new_ent = [start, end, ent.label_]
        if new_ent not in ent_array:  # skip duplicate spans
            ent_array.append(new_ent)
    # use the first predicted entity; fall back to the whole tweet when there is none
    selected_text = text[ent_array[0][0]: ent_array[0][1]] if len(ent_array) > 0 else text
    return selected_text

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


if TRAINED_MODELS_BASE_PATH is not None:
    print("Loading models from", TRAINED_MODELS_BASE_PATH)
    model_pos = spacy.load(TRAINED_MODELS_BASE_PATH + 'model_pos')
    model_neg = spacy.load(TRAINED_MODELS_BASE_PATH + 'model_neg')
    model_neu = spacy.load(TRAINED_MODELS_BASE_PATH + 'model_neu')
        
    jaccard_score = 0
    for index, row in tqdm(train_df.iterrows(), total=train_df.shape[0]):
        text = row.text
        if row.sentiment == 'neutral':
            jaccard_score += jaccard(predict_entities(text, model_neu), row.selected_text)
        elif row.sentiment == 'positive':
            jaccard_score += jaccard(predict_entities(text, model_pos), row.selected_text)
        else:
            jaccard_score += jaccard(predict_entities(text, model_neg), row.selected_text) 
        
    print(f'Average Jaccard Score is {jaccard_score / train_df.shape[0]}') 

## 5. What's Next?

* For the submission part, please visit [here](https://www.kaggle.com/rohitsingh9990/spacy-inference)
* I am going to create a discussion thread for those who are trying to approach this problem as NER. I will share the link soon.
* I will try to update this kernel with more approaches to increase the LB score.
* Suggestions are most welcome.

Note: If you like my work, please, upvote ☺