# Introduction To Sentiment Analysis

Sentiment analysis is the task of determining the opinion or feeling expressed about a subject from data such as text or images. It helps companies in their decision-making process. For instance, if public sentiment towards a product is poor, a company may modify the product or stop production altogether to avoid losses.

There are many sources of public sentiment, e.g. public interviews, opinion polls, and surveys. However, as more and more people join social media platforms, websites like Facebook and Twitter can also be parsed for public sentiment.

# Problem Definition

Given reviews of different apps, the task is to predict whether each review expresses positive, negative, or neutral sentiment about the app.

# Importing libraries

In [1]:
 pip install vaderSentiment

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install wordcloud

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install afinn

Collecting afinn
  Downloading afinn-0.1.tar.gz (52 kB)
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py): started
  Building wheel for afinn (setup.py): finished with status 'done'
  Created wheel for afinn: filename=afinn-0.1-py3-none-any.whl size=53447 sha256=a7b06fe45aea594ccda179deff96f50a11b27582a042c988efaf2619d577225a
  Stored in directory: c:\users\prashant\appdata\local\pip\cache\wheels\79\91\ee\8374d9bc8c6c0896a2db75afdfd63d43653902407a0e76cd94
Successfully built afinn
Installing collected packages: afinn
Successfully installed afinn-0.1
Note: you may need to restart the kernel to use updated packages.





In [1]:
import pandas as pd
import numpy as np
import re
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
from collections import Counter
nltk.download('punkt')
nltk.download('wordnet')
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from afinn import Afinn


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Prashant\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Prashant\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
df= pd.read_csv('reviews.csv')
df.head()

Unnamed: 0,reviewId,userName,userImage,content,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,appVersion,sortOrder,appId
0,0197c118-5c6f-4a7b-894c-970023d1a350,Mar Zur,https://play-lh.googleusercontent.com/a/ACg8oc...,I have the same recurring tasks to do every da...,11,4.16.6.2,22-07-2020 13:13,Our team will be happy to look into it for you...,23-07-2020 16:32,4.16.6.2,most_relevant,com.anydo
1,94868fb5-a21d-4ef9-ab85-81b2ed3d0785,Devin Rivera,https://play-lh.googleusercontent.com/a-/ALV-U...,"Instead of shopping around, I downloaded Any.d...",8,,08-12-2020 06:24,We are not aware of any issues with randomized...,10-12-2020 09:38,,most_relevant,com.anydo
2,825da34e-f65d-4ef3-991d-02d5291820d6,Heidi Kinsley,https://play-lh.googleusercontent.com/a/ACg8oc...,Why does every once in a while... out of the b...,6,5.11.1.2,09-07-2021 13:51,Sorry to hear that! It sounds like you might h...,11-07-2021 11:16,5.11.1.2,most_relevant,com.anydo
3,a49c2875-651a-4c33-b79c-5813780d659e,Daniel Keller,https://play-lh.googleusercontent.com/a/ACg8oc...,Terrible Update! This app used to be perfect f...,5,,16-11-2020 01:50,Please note that the tasks in your tasks view ...,17-11-2020 09:31,,most_relevant,com.anydo
4,9482c75e-2e63-46ab-8c94-47273dd6a829,A Google user,https://play-lh.googleusercontent.com/EGemoI2N...,This app is deceivingly terrible. There are so...,20,4.14.0.4,31-01-2019 16:19,"Hi Ryan, it sounds like you are describing our...",05-02-2019 11:52,4.14.0.4,most_relevant,com.anydo


# Converting the content column to lowercase

In [None]:
df["content"] = df["content"].str.lower()

In [None]:
print(df['content'])

# Removing links/urls

In [None]:
df['content'] = df['content'].fillna('')

In [None]:
def remove_links(text):
    # Drop any token that starts with "http" (covers http:// and https:// URLs)
    return re.sub(r'http\S+', '', text)

# Apply the function to the content column
df['content'] = df['content'].apply(remove_links)
print(df.head())
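
To see what the pattern does on a single string, here is a minimal sketch. The sample sentence is hypothetical and not taken from `reviews.csv`.

```python
import re

def remove_links(text):
    """Delete every run of non-space characters that starts with 'http'."""
    return re.sub(r'http\S+', '', text)

# Hypothetical sample review, used only for illustration
sample = "sync is broken, see https://example.com/help for details"
cleaned = remove_links(sample)
print(repr(cleaned))
```

The URL is removed but the spaces on both sides of it remain, so the result contains a double space. The whitespace step later in the notebook takes care of that.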

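# Removing newlines

In [None]:
# Replace each newline with a space so that words on adjacent lines are not fused together
df['content'] = df['content'].str.replace('\n', ' ', regex=False)
print(df.head())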

# Removing extra/white spaces

In [None]:
# Collapse runs of whitespace into a single space (joining with '' would glue all words together)
df['content'] = df['content'].apply(lambda x: ' '.join(x.split()))
print(df.head())
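
The separator passed to `join` decides whether words survive this step. The sketch below contrasts the two choices on a hypothetical string; `normalize_whitespace` is an illustrative helper name, not part of the notebook.

```python
def normalize_whitespace(text):
    """Collapse any run of spaces, tabs, or newlines into one space and trim the ends."""
    return ' '.join(text.split())

# Hypothetical sample with mixed whitespace
sample = "  great   app\tbut \n slow  "

collapsed = normalize_whitespace(sample)  # words stay separated
fused = ''.join(sample.split())           # an empty separator removes every word boundary

print(repr(collapsed))
print(repr(fused))
```

Lexicon-based scorers look up individual words, so the fused form would leave them with nothing to match.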

# Removing words containing numbers

In [None]:
def remove_words_with_numbers(text):
    return re.sub(r'\b\w*\d\w*\b', '', text)

# Apply the remove_words_with_numbers function to the content column
df['content'] = df['content'].apply(remove_words_with_numbers)

print(df.head())
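
A quick check of the pattern on hypothetical strings shows what it leaves behind. Version numbers such as the ones in the `appVersion` column are a useful edge case, because the dots are not word characters and so are not removed.

```python
import re

def remove_words_with_numbers(text):
    """Delete every word-like token that contains at least one digit."""
    return re.sub(r'\b\w*\d\w*\b', '', text)

# Hypothetical samples, used only for illustration
sentence = remove_words_with_numbers("updated to v5 on 2021 and lost 3 tasks")
version = remove_words_with_numbers("4.16.6.2")

print(repr(sentence))
print(repr(version))
```

Each removed token leaves its surrounding spaces or punctuation in place, which is one reason to run the whitespace and punctuation steps after this one rather than before.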

# Removing special characters

In [None]:
def remove_special_characters(text):
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Apply the function to the content column
df['content'] = df['content'].apply(remove_special_characters)
print(df['content'])
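
The character class above keeps only ASCII letters, digits, and whitespace. The following sketch, on a hypothetical review, shows that this also drops emoji and apostrophes and merges the two halves of a rating like "10/10".

```python
import re

def remove_special_characters(text):
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

# Hypothetical sample; "\U0001F44D" is the thumbs-up emoji
sample = "very good \U0001F44D!!! 10/10, can't complain"
cleaned = remove_special_characters(sample)
print(repr(cleaned))
```

Whether losing emoji and contractions is acceptable depends on the scorer used afterwards, so it is worth inspecting a few cleaned reviews before committing to this step.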

# Removal of stopwords

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # the stopword list must be downloaded once before use
", ".join(stopwords.words('english'))

In [None]:
# Build the stopword set once rather than on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Apply the function to the content column
df['content'] = df['content'].apply(remove_stopwords)
print(df['content'])
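
One thing to watch with stopword removal in a sentiment task is negation: NLTK's English list includes words such as "not" and "no". The sketch below uses a tiny hand-written stopword set (an assumption for illustration, not the NLTK list) and a hypothetical `keep` parameter to show the effect.

```python
def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stopwords from text, except for any words listed in keep."""
    effective = set(stop_words) - set(keep)
    return ' '.join(w for w in text.split() if w.lower() not in effective)

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOP_WORDS = {"this", "is", "not", "a", "the", "it"}

sample = "this is not a good app"
without_negation = remove_stopwords(sample, STOP_WORDS)
with_negation = remove_stopwords(sample, STOP_WORDS, keep={"not"})

print(without_negation)
print(with_negation)
```

Dropping "not" turns a negative review into something that reads as positive, so keeping negation words may be a reasonable design choice when the cleaned text is scored later.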

# Stemming

In [None]:
porter = PorterStemmer()

In [None]:
def stem_text(text):
    words = word_tokenize(text)
    stemmed_words = [porter.stem(word) for word in words]
    return ' '.join(stemmed_words)

# Apply the function to the content column
df['content'] = df['content'].apply(stem_text)
print(df['content'])

# Lemmatization

In [None]:
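import spacy  # requires: pip install spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    # Process the text with spaCy
    doc = nlp(text)
    # Extract the lemmatized tokens and join them into a sentence
    return ' '.join(token.lemma_ for token in doc)

# Apply the function to the content column
df['content'] = df['content'].apply(lemmatize_text)
print(df['content'])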

# Removing Punctuation

In [None]:
string.punctuation

In [None]:
def remove_punctuations(text):
    punctuations = string.punctuation
    return text.translate(str.maketrans('', '', punctuations))
    


In [None]:
print(df.columns)

In [None]:
def remove_punctuation_regex(text):
    if isinstance(text, str):
        # Replace punctuation with an empty string
        return re.sub(r'[^\w\s]', '', text)
    else:
        # Return the text as is if it's not a string
        return text

# Apply the function to the Content column
df['clean_text'] = df['content'].apply(remove_punctuation_regex)
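
The two punctuation removers defined in this section are not equivalent. A minimal sketch on a hypothetical string shows the difference: `string.punctuation` contains the underscore, while `\w` in a regex treats the underscore as a word character and keeps it. The function names below are illustrative.

```python
import re
import string

def strip_with_translate(text):
    """Remove every character listed in string.punctuation."""
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_with_regex(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample containing an underscore
sample = "any.do_app: great!!!"
translated = strip_with_translate(sample)
regexed = strip_with_regex(sample)

print(translated)
print(regexed)
```

For review text the difference is usually small, but it is worth settling on one of the two so that the `clean_text` column is produced consistently.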


In [None]:
df.head()

In [None]:
# Reload the raw dataset; the cleaned columns built above are discarded from here on
df = pd.read_csv('reviews.csv')


In [None]:
print(df.shape)

As we can see, there are a total of 16,787 rows and twelve columns.

In [None]:
print(df.shape)
df = df.head(16787)  # 16787 is the full row count, so no rows are dropped
print(df.shape)

Here we keep every row in the dataset for the analysis.

In [None]:
df.head()

In [None]:
df['content'].value_counts()

The next step is to sort the index values of the content column and plot a bar graph titled "Count of Reviews by Users".

# WordCloud

In [8]:
text = ' '.join(df['content'].dropna())

# Create a WordCloud object
wordcloud = WordCloud(width=800, height=400,max_words=200, background_color='white').generate(text)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

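ValueError: Only supported for TrueType fonts

This error is commonly reported when the installed wordcloud and Pillow versions are incompatible; upgrading both packages (pip install --upgrade wordcloud pillow) is a likely fix.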

# Frequency Table

In [None]:
import pandas as pd
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the dataset
df = pd.read_csv('reviews.csv')

# Join the 'content' column into one lowercase string (skipping missing values) and tokenize it
words = ' '.join(df['content'].dropna().astype(str)).lower()
tokens = word_tokenize(words)

# Filter out stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Get the frequency of each word
word_freq = Counter(filtered_tokens)

# Create a DataFrame from the frequency table
freq_table = pd.DataFrame(word_freq.items(), columns=['Word', 'Frequency'])

# Sort the DataFrame by frequency in descending order
freq_table = freq_table.sort_values(by='Frequency', ascending=False)

# Display the frequency table
print(freq_table)
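
The counting logic can be tried without the dataset or NLTK. The sketch below is a simplified stand-in under stated assumptions: the three reviews and the two-word stopword set are hypothetical, and a regular expression replaces `word_tokenize`.

```python
import re
from collections import Counter

# Hypothetical reviews and a tiny stopword set, used only for illustration
reviews = ["Great app", "Great for tasks", "App crashes on sync"]
stop_words = {"for", "on"}

# Lowercase each review and split it into alphanumeric tokens
tokens = []
for review in reviews:
    tokens.extend(re.findall(r"[a-z0-9]+", review.lower()))

# Drop stopwords, then count what is left
filtered = [t for t in tokens if t not in stop_words]
word_freq = Counter(filtered)

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = word_freq.most_common(2)
print(top_two)
```

`Counter.most_common` already returns the pairs sorted by count, so it can feed a top-N bar chart directly without building and sorting a DataFrame first.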

# Frequency for top 10 Words

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Load the dataset
df = pd.read_csv('reviews.csv')

# Join the 'content' column into one lowercase string (skipping missing values) and tokenize it
words = ' '.join(df['content'].dropna().astype(str)).lower()
tokens = word_tokenize(words)

# Filter out stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Get the frequency of each word
word_freq = Counter(filtered_tokens)

# Get the top 10 most frequent words
top_words = word_freq.most_common(10)

# Create a DataFrame from the top words
top_words_df = pd.DataFrame(top_words, columns=['Word', 'Frequency'])

# Plot the top 10 words
plt.figure(figsize=(10, 6))
plt.bar(top_words_df['Word'], top_words_df['Frequency'])
plt.title('Top 10 Most Frequent Words in Reviews')
plt.xlabel('Words')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

# Sentiment Analysis 

# Using VADER Library

In [9]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Load the dataset
df = pd.read_csv('reviews.csv')

# Initialize the VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to classify each review by its VADER compound score
def get_sentiment_score(review):
    sentiment = analyzer.polarity_scores(review)
    if sentiment['compound'] >= 0.05:
        return "Positive"
    elif sentiment['compound'] <= -0.05:
        return "Negative"
    else:
        return "Neutral"

# Apply the function to the 'content' column to get the sentiment label for each review
df['Sentiment'] = df['content'].apply(get_sentiment_score)

# Display the DataFrame with the added 'Sentiment' column
print(df[['content', 'Sentiment']])

                                                 content Sentiment
0      I have the same recurring tasks to do every da...  Negative
1      Instead of shopping around, I downloaded Any.d...  Negative
2      Why does every once in a while... out of the b...  Negative
3      Terrible Update! This app used to be perfect f...  Positive
4      This app is deceivingly terrible. There are so...  Positive
...                                                  ...       ...
16782                                      Excellent app  Positive
16783  I love it. Easy to use. Make my life organize....  Positive
16784  I love how I could make plans and check the ap...  Positive
16785                           Exactly what I needed!!!   Neutral
16786                                        Very good 👍  Positive

[16787 rows x 2 columns]
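
The labelling rule itself is independent of VADER and can be checked in isolation. In the sketch below, `label_from_compound` is a hypothetical helper name. The ±0.05 cut-offs are the ones used in the cell above, and the three compound scores are the ones this notebook prints for rows 0, 16785, and 16786 in the compound-score output further down.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a three-way sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Compound scores reported by the notebook for rows 0, 16785, and 16786
scores = [-0.6792, 0.0000, 0.4927]
labels = [label_from_compound(s) for s in scores]
print(labels)
```

Keeping the threshold as a parameter makes it easy to see how many reviews move in or out of the Neutral band when the cut-off changes.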


# Using AFINN Library

In [3]:
import pandas as pd
from afinn import Afinn

# Load the dataset
df = pd.read_csv('reviews.csv')

# Initialize the Afinn sentiment analyzer
afinn = Afinn()

# Function to calculate sentiment score
def calculate_sentiment(text):
    return afinn.score(text)

# Apply the function to the content column and create a new column for sentiment scores
df['sentiment_score'] = df['content'].apply(calculate_sentiment)

# Display the first few rows of the dataframe with the sentiment scores
print(df[['content', 'sentiment_score']].head())

                                             content  sentiment_score
0  I have the same recurring tasks to do every da...             -4.0
1  Instead of shopping around, I downloaded Any.d...              2.0
2  Why does every once in a while... out of the b...            -10.0
3  Terrible Update! This app used to be perfect f...              4.0
4  This app is deceivingly terrible. There are so...              4.0
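
AFINN assigns each listed word an integer valence between -5 and +5, and `afinn.score` returns the sum over the words it finds, so the result is an unbounded number rather than a label. The sketch below imitates that idea with a toy lexicon and adds a sign-based label. The lexicon values, the sample sentence, and both function names are assumptions for illustration and not the real AFINN word list.

```python
import re

# Toy lexicon with hypothetical valences; the real AFINN list is much larger
TOY_LEXICON = {"love": 3, "good": 3, "terrible": -3, "crash": -2}

def toy_score(text):
    """Sum the valence of every lexicon word found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return float(sum(TOY_LEXICON.get(w, 0) for w in words))

def label_from_score(score):
    """Label by sign: above zero is Positive, below zero is Negative."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

mixed = toy_score("Terrible update, but I love the good design")
print(mixed, label_from_score(mixed))
```

Because the score is a plain sum, a review that opens with a complaint can still come out positive if it contains more positive words overall, which is one way a review beginning with "Terrible" can end up with a positive score.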


# Rule-Based Lexicon Integration

In [15]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Load the dataset
df = pd.read_csv('reviews.csv')

# Create a SentimentIntensityAnalyzer object
analyzer = SentimentIntensityAnalyzer()

# Define a function to get the sentiment score of a sentence
def get_sentiment_score(sentence):
    sentiment_score = analyzer.polarity_scores(sentence)
    return sentiment_score['compound']

# Apply the function to the 'content' column
df['sentiment_score'] = df['content'].apply(get_sentiment_score)

# Display the DataFrame with the sentiment scores
print(df[['content', 'sentiment_score']])

                                                 content  sentiment_score
0      I have the same recurring tasks to do every da...          -0.6792
1      Instead of shopping around, I downloaded Any.d...          -0.7558
2      Why does every once in a while... out of the b...          -0.8847
3      Terrible Update! This app used to be perfect f...           0.7901
4      This app is deceivingly terrible. There are so...           0.3204
...                                                  ...              ...
16782                                      Excellent app           0.5719
16783  I love it. Easy to use. Make my life organize....           0.9607
16784  I love how I could make plans and check the ap...           0.8451
16785                           Exactly what I needed!!!           0.0000
16786                                        Very good 👍           0.4927

[16787 rows x 2 columns]
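
With scores from both lexicons available, a simple consistency check is to label each review with both and see where they disagree. The sketch below uses only the first five rows printed in this notebook (VADER compound scores and AFINN scores). The helper names and the choice of a zero cut-off for AFINN are assumptions.

```python
def vader_label(compound, threshold=0.05):
    """Three-way label from a VADER compound score."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

def afinn_label(score):
    """Three-way label from the sign of an AFINN score (assumed cut-off at zero)."""
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

# Scores for rows 0 to 4 as printed in the notebook outputs
vader_scores = [-0.6792, -0.7558, -0.8847, 0.7901, 0.3204]
afinn_scores = [-4.0, 2.0, -10.0, 4.0, 4.0]

pairs = [(vader_label(v), afinn_label(a)) for v, a in zip(vader_scores, afinn_scores)]
disagreements = [i for i, (v, a) in enumerate(pairs) if v != a]
agreement = 1 - len(disagreements) / len(pairs)

print(pairs)
print(disagreements, agreement)
```

Rows where the two lexicons disagree are good candidates for manual inspection, since neither score is a ground-truth label.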
