# Sentiment analysis for reviews

In this project, we collected reviews from various locations using a Google Maps scraper. After cleaning the data, we translated the reviews from French to English and used BERT for sentiment analysis.

We chose BERT because it is a pre-trained Transformer language model that performs well on many natural language processing tasks, including sentiment analysis, once it has been fine-tuned. Using a checkpoint that is already fine-tuned on review data lets us predict sentiment without training a model ourselves.

Translating the reviews to English offers advantages such as access to a wider range of NLP resources and a broader audience. English is widely used in NLP, providing a rich ecosystem of tools and pre-trained models. Additionally, analyzing sentiment in English ensures our results can be easily understood and shared globally.

In the upcoming sections, we will implement sentiment analysis using BERT. This will help us gain valuable insights into customer sentiments towards different locations, enabling businesses to make data-driven decisions and enhance user experiences.

For further reference, you can explore our repository on sentiment analysis of Ryanair airline reviews using VADER: [**Sentiment Analysis of Ryanair Airline Reviews**](https://github.com/yasirech-chammakhy/Sentiment-Analysis-of-Ryanair-Airline-Reviews). It showcases VADER's implementation and provides insights into sentiment expressed in Ryanair reviews.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/all_cities_cleaned_english.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17206 entries, 0 to 17205
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   bank                     17206 non-null  object 
 1   categoryName             17205 non-null  object 
 2   city                     17206 non-null  object 
 3   totalScore               17206 non-null  float64
 4   rank                     17206 non-null  int64  
 5   cid                      17206 non-null  float64
 6   publishedAtDate          17206 non-null  object 
 7   reviewsCount             17206 non-null  int64  
 8   reviewsDistribution      17206 non-null  object 
 9   textTranslated           9257 non-null   object 
 10  reviewId                 17206 non-null  object 
 11  reviewerId               17206 non-null  float64
 12  reviewerNumberOfReviews  17206 non-null  float64
 13  stars                    17206 non-null  float64
 14  lat                   

## Clean the text

In [4]:
from utils import clean_review

Using region Rabat-Sale-Kenitra server backend.



In [5]:
# Cleaning the text in the textTranslated column
df['cleaned_text'] = df['textTranslated'].apply(clean_review)

In this step, we cleaned the text in the `textTranslated` column by removing special characters and digits, converting all characters to lowercase, tokenizing each review, removing stopwords, and lemmatizing each word. The result is stored in a new column, `cleaned_text`, which is the input to the sentiment analysis.
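The project's `clean_review` helper lives in `utils` and is not shown in this notebook. As a rough illustration of the steps described above, here is a minimal sketch using only the standard library. The function name `clean_review_sketch`, the tiny stopword list, and the crude plural-stripping rule that stands in for a real lemmatizer are all assumptions made for this example; the actual helper most likely relies on a full stopword list and a proper lemmatizer.

```python
import re

# Tiny illustrative stopword list (assumption); a real pipeline would use a full list.
STOPWORDS = {"a", "an", "the", "it", "s", "my", "for", "is", "and", "to", "of", "in"}


def naive_lemma(word):
    """Crude stand-in for a lemmatizer: strip a plural 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def clean_review_sketch(text):
    """Hypothetical re-creation of the cleaning steps, not the project's clean_review."""
    if not isinstance(text, str):
        return text  # leave missing values (NaN/None) untouched
    text = text.lower()                      # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # drop digits and special characters
    tokens = text.split()                    # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(naive_lemma(t) for t in tokens)


print(clean_review_sketch("Well served especially for young people. ."))
# well served especially young people
print(clean_review_sketch("It's my bank"))
# bank
```

Non-string inputs are returned unchanged so that reviews without text remain missing in `cleaned_text` and can be handled separately later.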

## Generating Sentiment Scores for Cleaned text

In [6]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification



The next cell initializes a BERT tokenizer and model for multilingual sentiment analysis. The `BertTokenizer` class splits the input text into subword tokens that the model can process. The `BertForSequenceClassification` class wraps a BERT model with a classification head; the checkpoint loaded here, `nlptown/bert-base-multilingual-uncased-sentiment`, was fine-tuned on product reviews to predict a rating from 1 to 5 stars. Calling `from_pretrained` loads the tokenizer and model weights from this checkpoint, which NLP Town publishes on the Hugging Face Hub.
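The classification head produces five logits, one per star rating, and the notebook turns them into a score by taking the index of the largest logit and adding 1. The sketch below shows that mapping in plain Python, without loading the model. The helper names `softmax` and `logits_to_stars` and the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_stars(logits):
    """Return (star rating from 1 to 5, probability of that rating)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return best + 1, probs[best]


# Made-up logits for a clearly positive review (index 0 = 1 star, index 4 = 5 stars).
example_logits = [-2.1, -1.0, 0.3, 1.8, 2.4]
stars, confidence = logits_to_stars(example_logits)
print(stars)  # 5
```

Keeping the probability alongside the rating is optional, but it can help flag reviews where the model is unsure between two neighbouring ratings.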

In [7]:
tokenizer = BertTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = BertForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

To handle reviews that have no text, we fall back to the 'stars' column: when the text is missing, the reviewer's own star rating is used as the sentiment score. This keeps every review in the analysis, and the rating uses the same 1-to-5 scale as the model's predictions.

In [8]:
def sentiment_bank(row, text_column):
    """
    Perform sentiment analysis on bank reviews.

    Parameters:
    - row: The row of the dataframe containing the review information.
    - text_column: The column name for sentiment analysis.

    Returns:
    - The predicted sentiment as an integer value.
    """

    cleaned_text = row[text_column]
    stars = row['stars']

    if pd.isna(cleaned_text) or cleaned_text == "nan":
        return stars
    elif isinstance(cleaned_text, str):
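        # Truncate by tokens (not characters) to BERT's 512-token limit
        tokens = tokenizer.encode(cleaned_text, return_tensors='pt',
                                  truncation=True, max_length=512)
        with torch.no_grad():  # inference only, no gradients needed
            result = model(tokens)
        # Logit index 0-4 corresponds to a rating of 1-5 stars
        return int(torch.argmax(result.logits)) + 1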
    else:
        return None

In [9]:
df['sentiment'] = df.apply(lambda row: sentiment_bank(row, 'cleaned_text'), axis=1)

In [10]:
df.tail()

Unnamed: 0,bank,categoryName,city,totalScore,rank,cid,publishedAtDate,reviewsCount,reviewsDistribution,textTranslated,reviewId,reviewerId,reviewerNumberOfReviews,stars,lat,lng,cleaned_text,sentiment
17201,CIH Bank,Banque,Tétouan,2.0,57,1.825584e+19,2017-11-18,41,"{'oneStar': 29, 'twoStar': 0, 'threeStar': 3, ...",It's my bank,ChdDSUhNMG9nS0VJQ0FnSURRM2EtT3N3RRAB,1.040698e+20,43.0,4.0,35.585468,-5.347991,bank,5.0
17202,CIH Bank,Banque,Tétouan,2.0,57,1.825584e+19,2017-07-13,41,"{'oneStar': 29, 'twoStar': 0, 'threeStar': 3, ...",,ChdDSUhNMG9nS0VJQ0FnSURRNHY2V3JnRRAB,1.013352e+20,14.0,5.0,35.585468,-5.347991,,5.0
17203,CIH Bank,Banque,Tétouan,2.0,57,1.825584e+19,2017-06-23,41,"{'oneStar': 29, 'twoStar': 0, 'threeStar': 3, ...",,ChZDSUhNMG9nS0VJQ0FnSURBc1l5OVNBEAE,1.08129e+20,0.0,1.0,35.585468,-5.347991,,1.0
17204,CIH Bank,Banque,Tétouan,2.0,57,1.825584e+19,2017-06-21,41,"{'oneStar': 29, 'twoStar': 0, 'threeStar': 3, ...",,ChdDSUhNMG9nS0VJQ0FnSUNBX1pDQW9RRRAB,1.104122e+20,1.0,1.0,35.585468,-5.347991,,1.0
17205,CIH Bank,Banque,Tétouan,2.0,57,1.825584e+19,2017-04-10,41,"{'oneStar': 29, 'twoStar': 0, 'threeStar': 3, ...",Well served especially for young people. .,ChdDSUhNMG9nS0VJQ0FnSUN3c3NXcXVBRRAB,1.070876e+20,374.0,3.0,35.585468,-5.347991,well served especially young people,5.0


In [11]:
# define a function to categorize the sentiment score
def sentiment_category(score):
    if score > 3:
        return 'positive'
    elif score < 3:
        return 'negative'
    else:
        return 'neutral'

df['sentiment_category'] = df['sentiment'].apply(sentiment_category)

In [12]:
# extract year, month, and day from the publishedAtDate column
df['publishedAtDate'] = pd.to_datetime(df['publishedAtDate'])
df['year'] = df['publishedAtDate'].dt.year
df['month'] = df['publishedAtDate'].dt.month
df['day'] = df['publishedAtDate'].dt.day

# drop publishedAtDate column
df.drop('publishedAtDate', axis=1, inplace=True)

In [16]:
# add a country column that will be useful in Power BI
df['country'] = 'Morocco'

To prepare our data for in-depth analysis and loading into PostgreSQL, we have made several enhancements. First, we introduced the 'sentiment_category' column, which groups the 1-to-5 sentiment scores into three labels: positive (above 3), negative (below 3), and neutral (exactly 3). These labels are coarser than the raw scores, but they make sentiment trends and patterns easier to summarize and compare.
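One detail of this categorization is worth noting: a missing score (NaN) fails both the `> 3` and `< 3` comparisons, so a three-way `if`/`elif`/`else` labels it 'neutral'. If that is not the intended behaviour, a variant along the following lines keeps missing scores missing. The name `sentiment_category_safe` is hypothetical, and this is only a sketch of one possible handling.

```python
import math


def sentiment_category_safe(score):
    """Map a 1-5 score to a label, returning None when the score is missing."""
    if score is None or (isinstance(score, float) and math.isnan(score)):
        return None  # keep missing scores out of the 'neutral' bucket
    if score > 3:
        return "positive"
    if score < 3:
        return "negative"
    return "neutral"


print(sentiment_category_safe(5.0))           # positive
print(sentiment_category_safe(3))             # neutral
print(sentiment_category_safe(float("nan")))  # None
```

It can be applied the same way as the notebook's function, for example with `df['sentiment'].apply(sentiment_category_safe)`.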

Furthermore, we have extracted the year, month, and day components from the 'publishedAtDate' column. By creating separate columns for these temporal elements, we can perform time-based analyses, such as sentiment trends over different periods, seasonal variations, or comparison between specific timeframes.
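As an illustration of the kind of time-based analysis these columns allow, the sketch below counts reviews per year and sentiment category and converts the counts to yearly shares. The four rows are invented for the example and are not taken from the dataset; only the column names follow the notebook.

```python
import pandas as pd

# Invented toy rows; only the column names match the notebook.
toy = pd.DataFrame({
    "publishedAtDate": ["2017-04-10", "2017-06-21", "2018-01-05", "2018-03-09"],
    "sentiment_category": ["positive", "negative", "positive", "positive"],
})

# Same extraction step as in the notebook, limited to the year.
toy["publishedAtDate"] = pd.to_datetime(toy["publishedAtDate"])
toy["year"] = toy["publishedAtDate"].dt.year

# Number of reviews per year and category (rows: year, columns: category).
counts = toy.groupby(["year", "sentiment_category"]).size().unstack(fill_value=0)

# Share of each category within a year, so years with different volumes are comparable.
share = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(share)
```

The same pattern works with `month` in place of `year` to look for seasonal variation.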

In [18]:
# Save the dataframe to a csv file
df.to_csv('../data/final_data.csv', index=False)