# A complete profile for a sentiment analysis model

## Model ideas:
1. Use custom models for different categories (tech, food, books, ...), selected either automatically (using context classification) or manually by the client. *(different datasets applied)*
2. Run multiple models per dataset and derive weighted-average results.
3. Develop a layered classification **(*fast/slow* classification)**: use a confidence index to split the dataset into strong and weak groups, then analyse the weak group further with a RoBERTa model.
4. Aspect-based analysis **(attach sentiment to specific aspects rather than to a whole sentence/opinion)** and a word cloud **(for word frequencies)** to show insights from the reviews. (Amazon Comprehend model)
5. Use lemmatization, an opinion unit extractor, a subjectivity index, and multiclass classification (love, sad, angry, ...) for better accuracy and data enrichment.
6. Test a sentiment n-gram lexicon approach **(SO-CAL)**.
7. Use the client dataset to fine-tune the model. (Ideation phase)
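
Ideas 2 and 3 can be sketched in a few lines. This is only an illustration, not the notebook's implementation: `weighted_vote`, `cascade`, the weights, the 0.33 cut-offs, the 0.8 threshold, and the stand-in `fast_model`/`slow_model` callables are all hypothetical. The fast model is assumed to return a `(label, confidence)` pair.

```python
def weighted_vote(predictions, weights):
    """Combine per-model labels (-1, 0, 1) into one label via a weighted average."""
    total = sum(weights)
    score = sum(p * w for p, w in zip(predictions, weights)) / total
    # Hypothetical cut-offs: scores close to zero are treated as neutral.
    if score > 0.33:
        return 1
    if score < -0.33:
        return -1
    return 0


def cascade(texts, fast_model, slow_model, threshold=0.8):
    """Label each text with the fast model; re-run low-confidence texts on the slow model."""
    labels = []
    for text in texts:
        label, confidence = fast_model(text)
        if confidence < threshold:
            # Weak group: fall back to the slower, stronger model.
            label = slow_model(text)
        labels.append(label)
    return labels


# Stand-in models for demonstration only (not real classifiers).
def fast_model(text):
    return (1, 0.95) if "great" in text else (-1, 0.55)


def slow_model(text):
    return 1 if "not bad" in text else -1


print(weighted_vote([1, 1, -1], weights=[0.2, 0.3, 0.5]))
print(cascade(["great food", "not bad at all", "awful"], fast_model, slow_model))
```

In practice the fast model could be one of the lexicon-based libraries and the slow model the RoBERTa classifier, with the threshold tuned on a held-out sample.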

## Datasets used:
1. Twitter airline
2. IMDB
3. Yelp (preprocessing phase)
4. Amazon
5. Sentiment140 (Twitter)
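
The model functions in this notebook return -1, 0, or 1, so each dataset's labels need to be on the same scale before scoring. A minimal sketch, assuming string labels such as `positive`/`neutral`/`negative` or the 0/2/4 coding used by Sentiment140; the helper name `normalize_label` and the mapping table are hypothetical and should be checked against the actual CSV files.

```python
# Hypothetical mapping; adjust to the label coding of each CSV file.
LABEL_MAP = {
    "positive": 1, "pos": 1, "4": 1,
    "neutral": 0, "2": 0,
    "negative": -1, "neg": -1, "0": -1,
}


def normalize_label(raw):
    """Map a raw dataset label to -1, 0, or 1."""
    key = str(raw).strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unknown sentiment label: {raw!r}")
    return LABEL_MAP[key]


print([normalize_label(x) for x in ["Positive", "neutral", 0, 4]])
```

Note that the binary models (Flair, RoBERTa large) never return 0, so datasets with a neutral class will penalise them on those rows.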

## Implementation:

### Imports:

In [6]:
# Install the transformers library
!pip install transformers
!pip install vaderSentiment
!pip install flair


In [17]:
from textblob import TextBlob as tb
from textblob.sentiments import NaiveBayesAnalyzer
from textblob import Blobber
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from flair.data import Sentence
from flair.models import TextClassifier
from sklearn import metrics
import pandas as pd
from nltk import tokenize
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer
# only for colab
from google.colab import drive
drive.mount('/content/drive')

### Library implementations:

In [None]:
def metric(true,predict):
    analytics=[]
    #metrics.classification_report(true,predict)
    analytics.append(round(metrics.accuracy_score(true,predict),2))
    analytics.append(round(metrics.precision_score(true,predict,average='weighted'),2))
    analytics.append(round(metrics.recall_score(true,predict,average='weighted'),2))
    analytics.append(round(metrics.f1_score(true,predict,average='weighted'),2))
    return analytics
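
To make clear what `average='weighted'` does in `metric`, here is a plain-Python sketch of weighted precision: per-class precision, averaged with each class weighted by its number of true samples. The function name `weighted_precision` and the toy labels are made up for illustration; the notebook itself relies on scikit-learn.

```python
from collections import Counter


def weighted_precision(true, predict):
    """Per-class precision averaged with weights equal to each class's true count."""
    support = Counter(true)  # number of true samples per class
    total = len(true)
    score = 0.0
    for label, count in support.items():
        predicted_as_label = sum(1 for p in predict if p == label)
        correct = sum(1 for t, p in zip(true, predict) if t == p == label)
        # A class that is never predicted contributes a precision of 0.
        precision = correct / predicted_as_label if predicted_as_label else 0.0
        score += (count / total) * precision
    return score


true = [1, 1, 1, -1, -1, 0]
predict = [1, 1, -1, -1, 1, 0]
print(round(weighted_precision(true, predict), 2))  # 0.67
```

Because the weights follow class frequency, a dominant class (for example, negative tweets in an airline dataset) can mask poor results on rarer classes.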

In [4]:
class modelDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}
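
`modelDataset` only needs a dict of equal-length lists, which is the shape a tokenizer call returns. A small sketch with a hand-written stand-in for the tokenizer output (the ids below are made up, not real token ids):

```python
class modelDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}


# Stand-in for tokenizer output: one list entry per input text.
tokenized = {
    "input_ids": [[0, 100, 2], [0, 200, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

dataset = modelDataset(tokenized)
print(len(dataset))   # number of texts
print(dataset[1])     # all fields for the second text
```

Each item is a dict with one row per field, which is what `Trainer.predict` consumes when batching.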

In [None]:
def textblobPattern(text):
    sentiment=[]
    for sentence in text:
        sent=tb(sentence).polarity
        if sent>0:
            sentiment.append(1)
        elif sent<0:
            sentiment.append(-1)
        else:
            sentiment.append(0)
    return sentiment

In [None]:
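def textblobNB(text):
    sentiment=[]
    # build the Blobber once so the analyzer is not re-created per sentence
    tbnb = Blobber(analyzer=NaiveBayesAnalyzer())
    for sentence in text:
        ts=tbnb(sentence).sentiment
        # NaiveBayesAnalyzer classifies each text as 'pos' or 'neg' (no neutral class)
        sentiment.append(1 if ts.classification == 'pos' else -1)
    return sentiment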

In [None]:
def vader(text):
    sentiment=[]
    analyzer = SentimentIntensityAnalyzer()
    for sentence in text:
        vs=analyzer.polarity_scores(sentence)['compound']
        if (vs > 0.5):
            sentiment.append(1)
        elif (vs < -0.5):
            sentiment.append(-1)
        else:
            sentiment.append(0)
    return sentiment
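
The `vader` function treats compound scores between -0.5 and 0.5 as neutral. The VADER documentation suggests ±0.05 as typical cut-offs, so the threshold strongly affects how many texts end up neutral. A sketch of the difference, using made-up compound scores and a hypothetical helper `label_from_compound`:

```python
def label_from_compound(compound, threshold=0.5):
    """Map a compound score in [-1, 1] to -1, 0, or 1 using a symmetric threshold."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0


scores = [0.80, 0.30, -0.04, -0.60]  # made-up compound scores

print([label_from_compound(s) for s in scores])        # threshold 0.5, as in the notebook
print([label_from_compound(s, 0.05) for s in scores])  # threshold 0.05
```

With the wider band the mildly positive score 0.30 is labelled neutral, which matters on datasets that have no neutral class.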

In [None]:
def flair(text):
    classifier = TextClassifier.load('sentiment-fast')
    sentences = [Sentence(t) for t in text]
    # predict the whole list in one call so mini_batch_size actually batches
    classifier.predict(sentences,mini_batch_size=32)
    sentiment=[]
    for phrase in sentences:
        sentiment.append(1 if phrase.labels[0].value == 'POSITIVE' else -1)
    return sentiment

In [None]:
def RoBerta_large(text):
    tokenizer = AutoTokenizer.from_pretrained("siebert/sentiment-roberta-large-english")
    model = AutoModelForSequenceClassification.from_pretrained("siebert/sentiment-roberta-large-english")
    trainer = Trainer(model=model)
    tokenized_texts = tokenizer(text,truncation=True,padding=True)
    pred_dataset = modelDataset(tokenized_texts)
    predictions = trainer.predict(pred_dataset)
    preds = predictions.predictions.argmax(-1)
    return [-1 if x==0 else 1 for x in preds]
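
`RoBerta_large` keeps only the argmax of the logits. For the fast/slow idea a confidence value is also needed; one common choice is the softmax probability of the winning class. A plain-Python sketch with made-up logits (the helper names `softmax` and `label_with_confidence` are hypothetical, and the index-to-label mapping mirrors the function above):

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return (label, confidence), where index 0 means negative and index 1 positive."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return (-1 if best == 0 else 1), probs[best]


label, confidence = label_with_confidence([-1.2, 2.3])  # made-up logits for one text
print(label, round(confidence, 3))
```

Texts whose confidence falls below a chosen threshold would form the weak group.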

In [None]:
def roBerta_multitwitter(df):
    # under processing
    pass

In [None]:
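def bert_base(text):
    # this model predicts a 1-5 star rating (class indices 0-4)
    tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
    model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
    sentiment=[]
    for sentence in text:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        output = model(**inputs)
        stars = output.logits.argmax(-1).item() + 1
        # assumed mapping (not from the original notebook): 1-2 stars negative, 3 neutral, 4-5 positive
        if stars > 3:
            sentiment.append(1)
        elif stars < 3:
            sentiment.append(-1)
        else:
            sentiment.append(0)
    return sentiment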

### Model testing: (for each dataset)

In [None]:
def UsAirTw():
  file_name = "/content/drive/MyDrive/usAir_tweets.csv"
  text_column = "text"
  # drop rows with missing text first so labels and texts stay aligned
  df = pd.read_csv(file_name).dropna(subset=[text_column])
  true=df["sentiment"]
  pred_texts = df[text_column].astype('str').tolist()
  textblobM=metric(true,textblobPattern(pred_texts))
  vaderM=metric(true,vader(pred_texts))
  flairM=metric(true,flair(pred_texts))
  robertaM=metric(true,RoBerta_large(pred_texts))
  print(f"The textblob metrics:\n accuracy={textblobM[0]},precision={textblobM[1]},recall={textblobM[2]},f1 score ={textblobM[3]}")
  print(f"The Vader metrics:\n accuracy={vaderM[0]},precision={vaderM[1]},recall={vaderM[2]},f1 score ={vaderM[3]}")
  print(f"The flair metrics:\n accuracy={flairM[0]},precision={flairM[1]},recall={flairM[2]},f1 score ={flairM[3]}")
  print(f"The Roberta large metrics:\n accuracy={robertaM[0]},precision={robertaM[1]},recall={robertaM[2]},f1 score ={robertaM[3]}")

def Imdb():
  file_name = "/content/drive/MyDrive/imdb.csv"
  text_column = "review"
  # drop rows with missing text first so labels and texts stay aligned
  df = pd.read_csv(file_name).dropna(subset=[text_column])
  true=df["sentiment"]
  pred_texts = df[text_column].astype('str').tolist()
  textblobM=metric(true,textblobPattern(pred_texts))
  vaderM=metric(true,vader(pred_texts))
  flairM=metric(true,flair(pred_texts))
  robertaM=metric(true,RoBerta_large(pred_texts))
  print(f"The textblob metrics:\n accuracy={textblobM[0]},precision={textblobM[1]},recall={textblobM[2]},f1 score ={textblobM[3]}")
  print(f"The Vader metrics:\n accuracy={vaderM[0]},precision={vaderM[1]},recall={vaderM[2]},f1 score ={vaderM[3]}")
  print(f"The flair metrics:\n accuracy={flairM[0]},precision={flairM[1]},recall={flairM[2]},f1 score ={flairM[3]}")
  print(f"The Roberta large metrics:\n accuracy={robertaM[0]},precision={robertaM[1]},recall={robertaM[2]},f1 score ={robertaM[3]}")
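
The two test functions differ only in file name and text column, so the scoring loop could be shared. A sketch of such a helper with toy stand-in models and a simple accuracy function; `report`, `accuracy`, `always_positive`, and `keyword_model` are hypothetical names, and in the notebook the real model functions and `metric` would be passed in instead.

```python
def accuracy(true, predict):
    """Fraction of matching labels, rounded to two decimals."""
    return round(sum(t == p for t, p in zip(true, predict)) / len(true), 2)


def report(true, texts, models, metric_fn):
    """Run every model on the texts, print its score, and return all scores."""
    results = {}
    for name, model in models.items():
        results[name] = metric_fn(true, model(texts))
        print(f"The {name} metrics: {results[name]}")
    return results


# Toy stand-ins for the real model functions.
def always_positive(texts):
    return [1 for _ in texts]


def keyword_model(texts):
    return [-1 if "bad" in t else 1 for t in texts]


texts = ["good flight", "bad service", "bad food"]
true = [1, -1, -1]
results = report(true, texts, {"baseline": always_positive, "keyword": keyword_model}, accuracy)
```

Adding a dataset then only requires loading its texts and labels and calling the helper with the same dict of models.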


### Main:

In [None]:
UsAirTw()
Imdb()

## Results:

## Topics covered:
- Libraries: TextBlob, VADER, Flair
- Text operations: lemmatization, tokenization, vectorization, WordNet, tagging, n-grams
- Machine learning concepts: vector space model, k-means clustering, [Naive Bayes, k-NN, SVM] classifiers, decision tree, random forest, transformers (word2vec and Stanford's wordtree)
- Technologies: Jupyter Notebook, Google Colab
- Dataset handling: dataset preprocessing
- Sentiment analysis approaches
- Handling multiple deep learning models: RoBERTa, BERT, [GloVe, fastText, torchtext]

## References:
- https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair
- https://towardsdatascience.com/customer-churn-accuracy-a-4-6-increase-with-feature-engineering-29bcb1b1ee8f (REVIEW)
- https://www.analyticsvidhya.com/blog/2021/01/sentiment-analysis-vader-or-textblob/
- https://pythonprogramming.net/sentiment-analysis-python-textblob-vader/
- https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664
- https://medium.com/geekculture/what-nlp-library-you-should-use-for-your-sentimental-analysis-project-bef6b357a6db
- https://towardsdatascience.com/sentiment-analysis-comparing-3-common-approaches-naive-bayes-lstm-and-vader-ab561f834f89
****
* N-grams rule based model
- https://www.sciencedirect.com/science/article/pii/S095741741830143X
- https://github.com/sfu-discourse-lab/SO-CAL (to be reviewed)
- https://towardsdatascience.com/text-analysis-basics-in-python-443282942ec5
****
- https://towardsdatascience.com/text-classification-with-state-of-the-art-nlp-library-flair-b541d7add21f