In [7]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel, DistilBertForSequenceClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline


In [9]:
SAMPLE_ARTICLE = """Title: Exploring the Dynamic World of Technology Titans
In the ever-evolving landscape of technology, there are entities that have made an indelible mark, both positively and negatively, shaping our digital age. Apple, a true innovator, has consistently pushed the boundaries of design and functionality with its sleek devices, evoking awe and envy alike. However, its closed ecosystem has been criticized for limiting user choice and driving up costs.
Moving to the realm of social media, we encounter Facebook. Its revolutionary platform has connected the world, transcending borders and fostering relationships. Nevertheless, concerns surrounding user privacy and the spread of misinformation cast a shadow on its impact.
On the brighter side, Tesla, led by the enigmatic Elon Musk, is reshaping transportation with its electric vehicles. The company's commitment to sustainability is commendable, although occasional quality control issues have raised eyebrows.
In the realm of e-commerce, Amazon stands tall, providing unparalleled convenience. Its vast selection and efficient delivery have transformed the way we shop, but allegations of worker exploitation loom over its success.
Health technology finds a champion in Google. With its foray into healthcare data management, Google aims to revolutionize patient care. However, concerns over data privacy and monopolistic tendencies have triggered debates.
Finally, Microsoft, a software behemoth, continues to empower businesses and individuals with its suite of tools. The company's contributions to productivity are undeniable, yet past antitrust violations remind us of its market dominance.
In this intricate tapestry of technological giants, each entity's positive and negative aspects create a complex narrative. As consumers, it's essential to recognize both sides, celebrating innovation while advocating for accountability and ethical practices in the digital age.
"""

print(SAMPLE_ARTICLE)

Title: Exploring the Dynamic World of Technology Titans
In the ever-evolving landscape of technology, there are entities that have made an indelible mark, both positively and negatively, shaping our digital age. Apple, a true innovator, has consistently pushed the boundaries of design and functionality with its sleek devices, evoking awe and envy alike. However, its closed ecosystem has been criticized for limiting user choice and driving up costs.
Moving to the realm of social media, we encounter Facebook. Its revolutionary platform has connected the world, transcending borders and fostering relationships. Nevertheless, concerns surrounding user privacy and the spread of misinformation cast a shadow on its impact.
On the brighter side, Tesla, led by the enigmatic Elon Musk, is reshaping transportation with its electric vehicles. The company's commitment to sustainability is commendable, although occasional quality control issues have raised eyebrows.
In the realm of e-commerce, Amazon

In [10]:
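# Entity model
ner_tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
ner_model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp_ner = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer)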


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
ner_output = nlp_ner(SAMPLE_ARTICLE)

In [14]:
import pandas as pd

In [19]:
ner_output_df = pd.DataFrame(ner_output)
ner_output_df

Unnamed: 0,entity,score,index,word,start,end
0,B-MISC,0.858467,8,D,21,22
1,I-MISC,0.933929,10,World,29,34
2,I-MISC,0.970559,11,of,35,37
3,I-MISC,0.933663,12,Technology,38,48
4,B-ORG,0.998053,45,Apple,212,217
5,B-ORG,0.995951,101,Facebook,503,511
6,B-ORG,0.998016,145,Te,747,749
7,I-ORG,0.99349,146,##sla,749,752
8,B-ORG,0.994405,154,El,775,777
9,I-ORG,0.884524,155,##on,777,779


In [20]:
tokens = ner_output_df["word"].tolist()
tags = ner_output_df["entity"].tolist()

In [30]:
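def merge_ner_tokens(tokens, tags):
    """Merge BIO-tagged word pieces into whole entities (MISC entities are dropped)."""
    merged_entities = []
    current_entity = None

    for token, tag in zip(tokens, tags):
        # tags look like "B-ORG" / "I-ORG"; partition also tolerates a bare "O"
        prefix, _, label = tag.partition("-")

        if prefix == "B":
            # a new entity starts: flush the previous one
            if current_entity:
                merged_entities.append(current_entity)
            current_entity = {"text": token, "label": label}
        elif prefix == "I" and current_entity:
            # continuation of the current entity
            current_entity["text"] += " " + token
        else:
            if current_entity:
                merged_entities.append(current_entity)
            current_entity = None

    if current_entity:
        merged_entities.append(current_entity)

    # create dataframe (explicit columns keep this working when nothing was found)
    output = pd.DataFrame(merged_entities, columns=["text", "label"])
    # rejoin word pieces: "Te ##sla" -> "Tesla"
    output["text"] = output["text"].apply(lambda x: x.replace(" ##", ""))
    output = output[output["label"] != "MISC"]
    output = output.drop_duplicates().reset_index(drop=True)

    return output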

In [31]:
out = merge_ner_tokens(tokens, tags)
out

Unnamed: 0,text,label
0,Apple,ORG
1,Facebook,ORG
2,Tesla,ORG
3,Elon Musk,ORG
4,Amazon,ORG
5,Google,ORG
6,Microsoft,ORG
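
The merge above rejoins word pieces by deleting " ##" from the text. As an alternative, here is a minimal sketch that uses the `start` and `end` character offsets already present in the NER output. The function name `merge_by_offsets` is hypothetical, and the rows are copied from the table printed earlier (which is truncated after "##on", so the last entity comes out as "Elon" here).

```python
def merge_by_offsets(rows):
    """Merge BIO-tagged word pieces into entities using character offsets.

    `rows` is a list of dicts with the keys shown in the NER output table:
    "entity", "word", "start", "end".
    """
    entities = []
    current = None
    for row in rows:
        prefix, _, label = row["entity"].partition("-")
        piece = row["word"]
        if piece.startswith("##"):
            piece = piece[2:]
        if prefix == "I" and current is not None:
            # touching offsets mean the piece continues the same word
            sep = "" if row["start"] == current["end"] else " "
            current["text"] += sep + piece
            current["end"] = row["end"]
        else:
            # anything else closes the running entity
            if current is not None:
                entities.append(current)
            current = None
            if prefix == "B":
                current = {"text": piece, "label": label,
                           "start": row["start"], "end": row["end"]}
    if current is not None:
        entities.append(current)
    return entities


# rows taken from the NER output table (ORG rows only)
rows = [
    {"entity": "B-ORG", "word": "Apple", "start": 212, "end": 217},
    {"entity": "B-ORG", "word": "Facebook", "start": 503, "end": 511},
    {"entity": "B-ORG", "word": "Te", "start": 747, "end": 749},
    {"entity": "I-ORG", "word": "##sla", "start": 749, "end": 752},
    {"entity": "B-ORG", "word": "El", "start": 775, "end": 777},
    {"entity": "I-ORG", "word": "##on", "start": 777, "end": 779},
]
merged = merge_by_offsets(rows)
print([e["text"] for e in merged])  # ['Apple', 'Facebook', 'Tesla', 'Elon']
```

Keeping the offsets also lets each entity be mapped back to its position in the article, which the string-based merge discards.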


In [34]:
def predict_sentiment(text):
    # NOTE: despite its name, this step only extracts the entities;
    # sentiment labels are added later by extract_sentiment

    # extract output
    ner_output = nlp_ner(text)

    # format output
    ner_output_df = pd.DataFrame(ner_output)
    tokens = ner_output_df["word"].tolist()
    tags = ner_output_df["entity"].tolist()
    output = merge_ner_tokens(tokens, tags)

    return output

    



In [5]:
class PredictSentiment:

    def __init__(self):

        # NER model
        ner_tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
        ner_model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
        self.nlp_ner = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer)

        # QA model
        # a pipeline keeps the fine-tuned QA head (a bare DistilBertModel would drop it)
        self.question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

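        # Sentiment model
        # NOTE: "distilbert-base-uncased" has no fine-tuned classification head, so its
        # predictions are not meaningful until the model is trained on a sentiment task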
        self.pn_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
        self.pn_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

    def predict_entity(self, text):

        # extract output
        ner_output = self.nlp_ner(text)
    
        # format output
        ner_output_df = pd.DataFrame(ner_output)
        tokens = ner_output_df["word"].tolist()
        tags = ner_output_df["entity"].tolist()
        output = self.merge_ner_tokens(tokens, tags)
    
        return output

    @staticmethod
    def merge_ner_tokens(tokens, tags):
        
        merged_entities = []
        current_entity = None
    
        for token, tag in zip(tokens, tags):
            prefix, label = tag.split("-")
            
            if prefix == "B":
                if current_entity:
                    merged_entities.append(current_entity)
                current_entity = {"text": token, "label": label}
            elif prefix == "I" and current_entity:
                current_entity["text"] += " " + token
            else:
                if current_entity:
                    merged_entities.append(current_entity)
                current_entity = None
    
        if current_entity:
            merged_entities.append(current_entity)
    
        # create dataframe
        output = pd.DataFrame(merged_entities)
        output["text"] = output["text"].apply(lambda x : x.replace(" ##", ""))
        output = output[output["label"] != "MISC"]
        output = output.drop_duplicates().reset_index(drop=True)
    
        return output

        




# Scratch check of the QA pipeline output; `question_answerer` and `context` must be defined first
result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(
    f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}"
)

        



In [38]:
# ner output
ner_result_df = predict_sentiment(SAMPLE_ARTICLE)

ner_result_df

Unnamed: 0,text,label
0,Apple,ORG
1,Facebook,ORG
2,Tesla,ORG
3,Elon Musk,ORG
4,Amazon,ORG
5,Google,ORG
6,Microsoft,ORG


In [57]:
from transformers import pipeline
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

In [54]:
result = question_answerer(question="What do you think  about Apple", context=SAMPLE_ARTICLE)

In [56]:
print(result["answer"])

has consistently pushed the boundaries of design and functionality with its sleek devices


In [53]:
SAMPLE_ARTICLE

"Title: Exploring the Dynamic World of Technology Titans\nIn the ever-evolving landscape of technology, there are entities that have made an indelible mark, both positively and negatively, shaping our digital age. Apple, a true innovator, has consistently pushed the boundaries of design and functionality with its sleek devices, evoking awe and envy alike. However, its closed ecosystem has been criticized for limiting user choice and driving up costs.\nMoving to the realm of social media, we encounter Facebook. Its revolutionary platform has connected the world, transcending borders and fostering relationships. Nevertheless, concerns surrounding user privacy and the spread of misinformation cast a shadow on its impact.\nOn the brighter side, Tesla, led by the enigmatic Elon Musk, is reshaping transportation with its electric vehicles. The company's commitment to sustainability is commendable, although occasional quality control issues have raised eyebrows.\nIn the realm of e-commerce, A

In [61]:
def extract_entity_dependent_phrase(df, text):
    
    """
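    Use the question-answering pipeline to extract, for each entity,
    the phrase in the text that describes it.
    :param df: DataFrame with the entity names in a "text" column
    :param text: the article used as the QA context
    :return: the same DataFrame with an added "dep_phrase" column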
    """
    results = []
    
    for entity in df["text"].tolist():
        
        # create QA output
        result = question_answerer(question= f"What do you think  about {entity}",
                                   context=text)
        
        # predict from the QA model
        result_qa_answer = result["answer"]
        results.append(result_qa_answer)
    df["dep_phrase"] = results

    return df

In [62]:
qa_output = extract_entity_dependent_phrase(ner_result_df, text=SAMPLE_ARTICLE)

In [63]:
qa_output

Unnamed: 0,text,label,dep_phrase
0,Apple,ORG,has consistently pushed the boundaries of desi...
1,Facebook,ORG,revolutionary platform has connected the world...
2,Tesla,ORG,reshaping transportation with its electric veh...
3,Elon Musk,ORG,enigmatic
4,Amazon,ORG,"Amazon stands tall, providing unparalleled con..."
5,Google,ORG,Health technology finds a champion
6,Microsoft,ORG,a software behemoth
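
Some of the extracted phrases above are very short (for example "enigmatic"). The QA pipeline result also carries a `score`, so one option is to keep it next to each answer and drop weak ones. The sketch below assumes this; `extract_phrases_with_scores`, `min_score` and the stand-in `fake_qa` are hypothetical names, and the scores returned by the stand-in are placeholders, not model output.

```python
def extract_phrases_with_scores(entities, text, qa, min_score=0.0):
    """Ask `qa` about each entity and keep the answer together with its score.

    `qa` is any callable with the question-answering pipeline interface:
    qa(question=..., context=...) -> {"answer": str, "score": float, ...}.
    `min_score` is a hypothetical cut-off, not a value from the notebook.
    """
    rows = []
    for entity in entities:
        result = qa(question=f"What do you think about {entity}?", context=text)
        keep = result["score"] >= min_score
        rows.append({
            "text": entity,
            # answers below the cut-off are replaced by None
            "dep_phrase": result["answer"] if keep else None,
            "qa_score": result["score"],
        })
    return rows


# Stand-in for the real pipeline so the sketch runs without downloading a model;
# the answers and scores here are made up for illustration only.
def fake_qa(question, context):
    if "Apple" in question:
        return {"answer": "a true innovator", "score": 0.9}
    return {"answer": "enigmatic", "score": 0.1}


rows = extract_phrases_with_scores(
    ["Apple", "Elon Musk"], "some article text", fake_qa, min_score=0.5
)
for row in rows:
    print(row)
```

With the real pipeline, `qa` would be the `question_answerer` object, and `pd.DataFrame(rows)` gives a table in the same shape as the one above plus a `qa_score` column.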


In [66]:
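def extract_sentiment(df):
    """
    Classify each "dep_phrase" as Positive or Negative.
    Uses the global `tokenizer` and `model` from cell In [65]. That checkpoint's
    classifier head is newly initialized, so these labels are not reliable
    until the model is fine-tuned on a sentiment task.
    """
    result = []
    for _, row in df.iterrows():

        inputs = tokenizer(row["dep_phrase"], return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # one phrase per call, so argmax over the logits gives the class index
        predicted_class_id = logits.argmax().item()
        sentiment = model.config.id2label[predicted_class_id]
        # treating LABEL_1 as "Positive" is a convention the base checkpoint does not define
        sentiment = "Positive" if sentiment == "LABEL_1" else "Negative"
        result.append(sentiment)

    # create df
    df["sentiment"] = result

    return df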


        

In [67]:
sentiment_df = extract_sentiment(qa_output)

In [68]:
sentiment_df

Unnamed: 0,text,label,dep_phrase,sentiment
0,Apple,ORG,has consistently pushed the boundaries of desi...,Negative
1,Facebook,ORG,revolutionary platform has connected the world...,Positive
2,Tesla,ORG,reshaping transportation with its electric veh...,Negative
3,Elon Musk,ORG,enigmatic,Negative
4,Amazon,ORG,"Amazon stands tall, providing unparalleled con...",Negative
5,Google,ORG,Health technology finds a champion,Negative
6,Microsoft,ORG,a software behemoth,Negative


In [65]:

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello, my dog is ugly", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


'LABEL_0'
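
The warning above says the classification head of `distilbert-base-uncased` is newly initialized, so `LABEL_0` / `LABEL_1` carry no sentiment meaning yet. A minimal sketch of one way around this, assuming a checkpoint already fine-tuned for sentiment is acceptable: `distilbert-base-uncased-finetuned-sst-2-english` is a swapped-in model, not the one used in this notebook, and `label_phrases`, `load_sst2_classifier` and `fake_classifier` are hypothetical helper names.

```python
def label_phrases(phrases, classifier):
    """Return "Positive"/"Negative" for each phrase.

    `classifier` is any callable with the text-classification pipeline
    interface: classifier(list_of_str) -> [{"label": ..., "score": ...}, ...].
    """
    predictions = classifier(list(phrases))
    # the SST-2 checkpoint labels are "POSITIVE" / "NEGATIVE"
    return [p["label"].capitalize() for p in predictions]


def load_sst2_classifier():
    """Load a sentiment pipeline with a fine-tuned head (downloads the model)."""
    from transformers import pipeline
    return pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )


# Stand-in classifier so the sketch runs offline; its labels are placeholders,
# not predictions from a real model.
def fake_classifier(texts):
    return [
        {"label": "NEGATIVE" if "ugly" in t else "POSITIVE", "score": 1.0}
        for t in texts
    ]


print(label_phrases(["Hello, my dog is ugly", "a true innovator"], fake_classifier))
```

With the real model, `df["sentiment"] = label_phrases(df["dep_phrase"], load_sst2_classifier())` would fill the sentiment column in one batched call.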