# Sentiment Analysis Modelling

The goal of this notebook is to test the performance of the [CAMeL-BERT](https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment) model. We first iterate on the model pipeline, then evaluate it on the three datasets collected earlier.

In [14]:
import pandas as pd
import numpy as np

import torch
from transformers import pipeline

import warnings
warnings.filterwarnings('ignore')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Available device:", device)

Available device: cuda


## Load and preprocess the data

In [15]:
URL = "https://github.com/swarmsTeam/swarms-ai/raw/main/sentiment-analysis/data/"
CompanyReviews = pd.read_csv(URL + "CompanyReviews.csv", index_col=0)
RestaurantReviews = pd.read_csv(URL + "RestaurantReviewsSample.csv", index_col=0)
appReviews = pd.read_csv(URL + "appReviews.csv", index_col=0)

In [16]:
appReviews = appReviews.dropna()
appReviews.head()

Unnamed: 0,review_description,rating,company
0,سيئ جدا بعد الإصدار الجديد,-1,alahli_bank
1,ابلكيشن زباله بجد,-1,alahli_bank
2,سيئ التطبيق لايعمل,-1,alahli_bank
3,للأسف التطبيق للأسوأ كان جدا رائع وسهل وبسيط ا...,-1,alahli_bank
4,التحديث بطيئ جدا جدا عند الفتح,-1,alahli_bank


## Model pipeline

In [17]:
MODEL = 'CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment'
# Pass the detected device explicitly; without it the pipeline stays on CPU even when a GPU is available.
pipe = pipeline("text-classification", model=MODEL, device=device)


### Running the model on a few examples

In [18]:
examples = ["ده ايفنت مش نافع ومكنش فيه أي جزء كويس", "كان حاجة اخر ملل", "معقول", "يعني حاسس انه في العموم مقبول وعادي", "روعة بجد"]
pipe(examples)

[{'label': 'neutral', 'score': 0.7268695831298828},
 {'label': 'negative', 'score': 0.9548670053482056},
 {'label': 'neutral', 'score': 0.4663392901420593},
 {'label': 'negative', 'score': 0.6648894548416138},
 {'label': 'positive', 'score': 0.9908993244171143}]
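
The scores above range from about 0.47 to 0.99, so some predictions are much less certain than others. As a minimal sketch, low-confidence predictions could be flagged for manual review. The helper `low_confidence` and the 0.7 cut-off are assumptions made for illustration and are not part of the original notebook.

```python
# Predictions copied from the output above (scores rounded to 4 decimals).
predictions = [
    {"label": "neutral", "score": 0.7269},
    {"label": "negative", "score": 0.9549},
    {"label": "neutral", "score": 0.4663},
    {"label": "negative", "score": 0.6649},
    {"label": "positive", "score": 0.9909},
]

# Assumption: 0.7 is an arbitrary cut-off chosen for illustration only.
MIN_CONFIDENCE = 0.7

def low_confidence(preds, threshold=MIN_CONFIDENCE):
    """Return the positions of predictions whose score is below the threshold."""
    return [i for i, p in enumerate(preds) if p["score"] < threshold]

flagged = low_confidence(predictions)
print(flagged)
```

With this cut-off, the third and fourth examples ("معقول" and "يعني حاسس انه في العموم مقبول وعادي") are the ones flagged.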

In [19]:
def evaluate(df, comments):
  """Add a 'model_label' column: 1 (positive), -1 (negative), 0 (any other label)."""
  label_map = {'positive': 1, 'negative': -1}
  df['model_label'] = None
  for index, review in df.iterrows():
    # The pipeline returns a list with one prediction dict per input text.
    output = pipe(review[comments])[0]
    df.at[index, 'model_label'] = label_map.get(output['label'], 0)

In [20]:
# Copy the slice so that evaluate() does not write into a view of appReviews.
df = appReviews.iloc[-20:, :].copy()
evaluate(df, "review_description")

In [21]:
df[df['model_label'] != df['rating']]

Unnamed: 0,review_description,rating,company,model_label
67107,خيرها في غيرها . عدم الازدحام. الازعاج كان هنا...,0,hotels,-1
67109,شكلت مغامرات أدهم صبري جزء من مراهقتي المبكرة ...,0,hotels,-1
67111,أكثر ما يعجبني في ماركيز .. ولعه بالتفاصيل .. ...,0,hotels,1
67112,الرواية جيدة. للاسف شخصية البطلة كانت مستفزه ب...,0,hotels,-1
67113,مرضي. . المساج اسوء مساج في العالم وهذا محل ال...,0,hotels,-1
67114,بحب كتابات بلال ومن المتابعين ليها بإستمرار. و...,0,hotels,-1
67115,سلطنة عمان . الهدوء في المكان. .بعد الغرفه عن ...,0,hotels,-1
67116,تاب عامة مختصر مفيد غير ممل علي الأطلاق اشبه ب...,0,hotels,-1
67118,عندما اقرأ موضوع العذراء ينمو بداخلي الف سؤال ...,0,hotels,-1
67120,إسلوب الكتاب الأكاديمى كان بيصيبنى بالملل فى أ...,0,hotels,1


For most of the reviews listed above, the model predicts negative where the dataset rating is neutral. Judging from the visible text, the model's label appears to be the better fit in at least some of these cases; row 67113, for example, calls the massage the worst in the world.
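
To make the pattern in the mismatch table explicit, the (rating, model_label) pairs can be tallied. The following is a small sketch that uses only the ten rows displayed above, copied by hand; the variable names are illustrative.

```python
from collections import Counter

# Values copied from the mismatch table above (rows 67107 to 67120).
rating = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
model_label = [-1, -1, 1, -1, -1, -1, -1, -1, -1, 1]

# Count how often each (dataset rating, model label) pair occurs.
confusion = Counter(zip(rating, model_label))
for (true_label, predicted), count in sorted(confusion.items()):
    print(f"rating={true_label:>2}  model={predicted:>2}  count={count}")
```

For these rows, eight neutral ratings are predicted as negative and two as positive.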

## Work with the positive sample

In [22]:
RestaurantReviews.label.unique()

array(['Positive'], dtype=object)

In [23]:
# Copy the slice so that evaluate() does not write into a view of RestaurantReviews,
# and reuse evaluate() instead of repeating the labelling loop.
df = RestaurantReviews.iloc[:10, :].copy()
evaluate(df, "text")

In [24]:
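# 'label' holds strings ('Positive') while 'model_label' holds integers, so this
# comparison is True for every row and all 10 rows are listed below.
df[df['model_label'] != df['label']]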

Unnamed: 0,label,text,model_label
0,Positive,ممتاز نوعا ما . النظافة والموقع والتجهيز والشا...,1
1,Positive,أحد أسباب نجاح الإمارات أن كل شخص في هذه الدول...,1
2,Positive,هادفة .. وقوية. تنقلك من صخب شوارع القاهرة الى...,1
3,Positive,خلصنا .. مبدئيا اللي مستني ابهار زي الفيل الاز...,-1
4,Positive,ياسات جلوريا جزء لا يتجزأ من دبي . فندق متكامل...,1
5,Positive,أسلوب الكاتب رائع جدا و عميق جدا، قرأته عدة مر...,1
6,Positive,استثنائي. الهدوء في الجناح مع مسبح. عدم وجود ع...,0
7,Positive,الكتاب هو السيرة الذاتية للحداثة في المملكة بل...,1
8,Positive,من أجمل ما قرأت.. رواية تستحق القراءة فعلا..,1
9,Positive,بشكل عام جيده .. . التجاوب جيد جدا من قبل موظف...,1


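On these ten reviews, all labelled Positive in the dataset, the model again performs well: it predicts positive for eight, negative for one (row 3), and neutral for one (row 6).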
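
Because the dataset stores labels as strings and the model labels are integers, the two need to share one encoding before they can be compared. Below is a minimal sketch using the ten rows shown above; the `'Negative'` and `'Neutral'` keys in the mapping are assumptions, since only `'Positive'` occurs in this dataset.

```python
# Values copied from the table above (rows 0 to 9).
dataset_label = ["Positive"] * 10
model_label = [1, 1, 1, -1, 1, 1, 0, 1, 1, 1]

# Assumption: only 'Positive' appears in the data; the other keys are illustrative.
STRING_TO_RATING = {"Positive": 1, "Neutral": 0, "Negative": -1}

expected = [STRING_TO_RATING[label] for label in dataset_label]

# Row positions where the model disagrees with the dataset label.
mismatches = [i for i, (e, m) in enumerate(zip(expected, model_label)) if e != m]
accuracy = 1 - len(mismatches) / len(expected)

print("mismatched rows:", mismatches)
print("accuracy:", accuracy)
```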

## Final pipeline

For the EVNTO app, we want to pass the model the full data (event_id, comments, etc.) and get an overall rating for each event.

In [25]:
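def final_rating(data, comments, group):
  """Return a star rating per group, computed as 5 * mean per-review score."""
  data['score'] = None
  for index, review in data.iterrows():
    sentiment = pipe(review[comments])[0]
    s = sentiment['score']
    if sentiment['label'] == 'negative':
      # A confident negative prediction maps to a score close to 0.
      s = 1 - s
    else:
      # NOTE: this branch covers both 'neutral' and 'positive' labels, so
      # positive reviews also score 0.5 and the rating cannot exceed 2.5.
      s = 0.5
    data.at[index, 'score'] = s
  grouped = data.groupby(group).agg({"score": ["sum", "count"]})
  grouped.columns = ["total_score", "length"]
  # total_score / (length / 5) is the same as 5 * mean score per group.
  grouped['star_rating'] = grouped['total_score'] / (grouped['length'] / 5)

  return grouped['star_rating']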

In [27]:
# Copy the slice so that final_rating() does not write into a view of appReviews.
DATA = appReviews.iloc[9700:10000].copy()
final_rating(data=DATA, comments='review_description', group='company')

company,star_rating
talbat,2.16662
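
In `final_rating` above, the `else` branch assigns 0.5 to every non-negative label, positive included. If the intent is for positive reviews to raise the rating, one possible variant keeps the model confidence for positive labels. The sketch below is an assumption about that intent, not the notebook's method; `review_score`, `star_rating`, and the sample predictions are hypothetical.

```python
def review_score(label, confidence):
    """Map one prediction to a score in [0, 1] (hypothetical variant)."""
    if label == "positive":
        return confidence        # confident positive -> close to 1
    if label == "negative":
        return 1 - confidence    # confident negative -> close to 0
    return 0.5                   # neutral (or unknown) -> midpoint

def star_rating(predictions):
    """Average the per-review scores and scale them to a 0-5 star range."""
    scores = [review_score(p["label"], p["score"]) for p in predictions]
    return 5 * sum(scores) / len(scores)

# Made-up predictions for a single event, for illustration only.
sample = [
    {"label": "positive", "score": 0.9},
    {"label": "negative", "score": 0.8},
    {"label": "neutral", "score": 0.6},
]
print(round(star_rating(sample), 2))
```

With this mapping, an event with only confident positive reviews approaches 5 stars, whereas the original function caps out at 2.5.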


## Save the model

In [28]:
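# torch.save pickles the entire pipeline object; the target directory must already exist,
# and loading it later requires a compatible transformers installation.
torch.save(pipe, "/content/models/evnto-sentiment.pth")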

## Next Steps:
* Write cleaner code...
* Develop the API and the documentation...