
CHATGPT REVIEWS ANALYSIS

In [None]:
from google.colab import files
uploaded = files.upload()

Saving chatgpt_reviews.csv to chatgpt_reviews.csv


In [None]:
import pandas as pd
import plotly.graph_objects as go
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from collections import Counter
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'

In [None]:
df = pd.read_csv('chatgpt_reviews.csv')
df.head()

Unnamed: 0,Review Id,Review,Ratings,Review Date
0,6fb93778-651a-4ad1-b5ed-67dd0bd35aac,good,5,2024-08-23 19:30:05
1,81caeefd-3a28-4601-a898-72897ac906f5,good,5,2024-08-23 19:28:18
2,452af49e-1d8b-4b68-b1ac-a94c64cb1dd5,nice app,5,2024-08-23 19:22:59
3,372a4096-ee6a-4b94-b046-cef0b646c965,"nice, ig",5,2024-08-23 19:20:50
4,b0d66a4b-9bde-4b7c-8b11-66ed6ccdd7da,"this is a great app, the bot is so accurate to...",5,2024-08-23 19:20:39


***DATA CLEANING***

In [None]:
df.isna().sum()

Unnamed: 0,0
Review Id,0
Review,6
Ratings,0
Review Date,0


Six reviews are null. Replace them with the placeholder 'NO REVIEWS' so that later text processing does not fail on missing values.

In [None]:
df['Review'] = df['Review'].fillna('NO REVIEWS')

In [None]:
df.isna().sum()

Unnamed: 0,0
Review Id,0
Review,0
Ratings,0
Review Date,0


Next, we add a sentiment label to each review, based on the polarity of its text.

In [None]:
from textblob import TextBlob

In [None]:
#function to determine sentiment polarity
def get_sentiment(review):
    #TextBlob polarity ranges from -1 (most negative) to 1 (most positive)
    sentiment = TextBlob(review).sentiment.polarity
    if sentiment > 0:
        return 'Positive'
    elif sentiment < 0:
        return 'Negative'
    else:
        return 'Neutral'

In [None]:
#applying the function
df['Sentiment'] = df['Review'].apply(get_sentiment)

In [None]:
df.head()

Unnamed: 0,Review Id,Review,Ratings,Review Date,Sentiment
0,6fb93778-651a-4ad1-b5ed-67dd0bd35aac,good,5,2024-08-23 19:30:05,Positive
1,81caeefd-3a28-4601-a898-72897ac906f5,good,5,2024-08-23 19:28:18,Positive
2,452af49e-1d8b-4b68-b1ac-a94c64cb1dd5,nice app,5,2024-08-23 19:22:59,Positive
3,372a4096-ee6a-4b94-b046-cef0b646c965,"nice, ig",5,2024-08-23 19:20:50,Positive
4,b0d66a4b-9bde-4b7c-8b11-66ed6ccdd7da,"this is a great app, the bot is so accurate to...",5,2024-08-23 19:20:39,Positive
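
The labelling rule above treats any polarity other than exactly 0 as Positive or Negative. The sketch below isolates that thresholding step and adds an optional neutral band; the helper `label_from_polarity`, its `neutral_band` parameter, and the scores are hypothetical, and a band of 0.0 reproduces the rule in `get_sentiment`.

```python
def label_from_polarity(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores whose absolute value is <= neutral_band are labelled 'Neutral'.
    neutral_band=0.0 gives the same rule as get_sentiment in the notebook.
    """
    if polarity > neutral_band:
        return 'Positive'
    if polarity < -neutral_band:
        return 'Negative'
    return 'Neutral'


# Made-up polarity scores, for illustration only
scores = [0.7, 0.05, 0.0, -0.02, -0.6]

strict = [label_from_polarity(s) for s in scores]
banded = [label_from_polarity(s, neutral_band=0.1) for s in scores]

print(strict)
print(banded)
```

With a band of 0.1, the weakly scored reviews move into Neutral, which may be worth trying if many short reviews receive near-zero polarity.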


Let's visualize the sentiments to see how they are distributed across the data.

In [None]:
sentiments = df['Sentiment'].value_counts()

In [None]:
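#map each label to its color so the bars stay correctly colored whatever order value_counts returns
color_map = {'Positive': 'green', 'Neutral': 'grey', 'Negative': 'red'}
fig = go.Figure(data = [go.Bar(
    x = sentiments.index,
    y = sentiments.values,
    marker_color = [color_map[label] for label in sentiments.index]
)])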
fig.update_layout(
    title = 'Sentiment Distribution of Reviews',
    xaxis_title = 'Sentiment',
    yaxis_title = 'Count'
)
fig.show()

We observe that most reviews are positive and very few are negative.
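
To put a number on "most", the counts can be turned into shares. This is a minimal sketch using `Counter` on made-up labels that stand in for `df['Sentiment']`; the figures it prints are not results from the dataset.

```python
from collections import Counter

# Hypothetical labels standing in for df['Sentiment']
labels = ['Positive'] * 6 + ['Neutral'] * 3 + ['Negative'] * 1

counts = Counter(labels)
total = sum(counts.values())

# Share of each sentiment label as a fraction of all reviews
shares = {label: count / total for label, count in counts.items()}

for label, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{label}: {share:.0%}')
```

On the real data, `df['Sentiment'].value_counts(normalize=True)` gives the same kind of breakdown directly.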

Let's find out what exactly users like about ChatGPT by analysing the reviews with positive sentiment.

To do this, we'll extract the most common two- and three-word phrases from the positive reviews.

In [None]:
#first we'll filter out the positive reviews
positive_reviews = df[df['Sentiment'] == 'Positive']

In [None]:
#use vectorizer to extract common phrases
vectorizer = CountVectorizer(stop_words = 'english', ngram_range = (2,3), max_features = 100)
X = vectorizer.fit_transform(positive_reviews['Review'])

In [None]:
#sum the counts of each phrase
phrase_counts = X.sum(axis = 0)
phrases = vectorizer.get_feature_names_out()
phrase_frequency = [(phrases[i], phrase_counts[0,i]) for i in range(len(phrases))]

#sort phrases by frequency
phrase_frequency = sorted(phrase_frequency, key = lambda x:x[1], reverse = True)

#turning it to a dataframe
phrase_df = pd.DataFrame(phrase_frequency, columns = ['Phrase', 'Frequency'])
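
To make the phrase extraction less of a black box, here is a simplified pure-Python sketch of counting 2- and 3-word phrases. It only roughly mirrors what `CountVectorizer` does with `stop_words = 'english'` and `ngram_range = (2,3)`: the stop word set, the helper `ngram_counts`, and the sample reviews are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Tiny stand-in stop word list; scikit-learn's 'english' list is much longer
STOP_WORDS = {'is', 'a', 'the', 'this', 'for', 'and', 'to', 'it', 'very'}


def ngram_counts(texts, n_min=2, n_max=3):
    """Count word n-grams of length n_min..n_max across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, keep tokens of two or more word characters, drop stop words
        tokens = [t for t in re.findall(r'\b\w\w+\b', text.lower())
                  if t not in STOP_WORDS]
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Made-up reviews, for illustration only
reviews = ['This is a great app', 'great app for students', 'very good app']
counts = ngram_counts(reviews)

print(counts.most_common(3))
```

Because stop words are dropped before the phrases are built, "app for students" contributes the phrase "app students", which explains some of the clipped-looking phrases in the chart.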

In [None]:
#visualize
fig = px.bar(
    phrase_df,
    x = 'Frequency',
    y = 'Phrase',
    orientation = 'h',
    title = 'Common Phrases in Positive Reviews',
    labels = {'Phrase': 'Phrase', 'Frequency': 'Frequency'},
    color_discrete_sequence = ['green'],
    width = 1000,
    height = 600
)

fig.update_layout(
    xaxis_title = 'Frequency',
    yaxis_title = 'Phrase',
    yaxis = {'categoryorder': 'total ascending'}
)

fig.show()

The visualization shows that users most often describe ChatGPT as a great app and an amazing app. They also call it a good AI, a good app for students, and user friendly.

Now let's find out what users don't like about ChatGPT.

To do this, we'll follow the same process as before.

In [None]:
negative_reviews = df[df['Sentiment'] == 'Negative']

In [None]:
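#refit the same vectorizer on the negative reviews (this replaces the vocabulary learned from the positive reviews)
X_neg = vectorizer.fit_transform(negative_reviews['Review'])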

In [None]:
#sum the counts of each phrase
negative_phrase_counts = X_neg.sum(axis = 0)
negative_phrases = vectorizer.get_feature_names_out()
negative_phrase_frequency = [(negative_phrases[i], negative_phrase_counts[0,i]) for i in range(len(negative_phrases))]

#sort phrases by frequency
negative_phrase_frequency = sorted(negative_phrase_frequency, key = lambda x:x[1], reverse = True)

#turning it to a dataframe
negative_phrase_df = pd.DataFrame(negative_phrase_frequency, columns = ['Phrase', 'Frequency'])

In [None]:
#visualize
fig = px.bar(
    negative_phrase_df,
    x = 'Frequency',
    y = 'Phrase',
    orientation = 'h',
    title = 'Common Phrases in Negative Reviews',
    labels = {'Phrase': 'Negative Phrase', 'Frequency': 'Frequency'},
    color_discrete_sequence = ['red'],
    width = 1000,
    height = 600
)

fig.update_layout(
    xaxis_title = 'Frequency',
    yaxis_title = 'Negative Phrase',
    yaxis = {'categoryorder': 'total ascending'}
)

fig.show()

We see that ChatGPT's negative reviews are mostly about errors, whether network errors or other kinds, and about ChatGPT giving wrong answers.

This shows dissatisfaction with its reliability.

***COMMON PROBLEMS FACED BY CHATGPT***

We want to group the common problems into the following categories:
1. Quality of responses and answers
2. App performance
3. User interface
4. General features

In [33]:
#we'll first group the phrases into the categories
#note: CountVectorizer lowercases text by default, so the uppercase keyword 'UI' never matches a phrase
problem_categories = {
    'Responses and Answers Quality': ['wrong answer', 'gives wrong', 'incorrect', 'inaccurate', 'wrong', 'bad response',
                                      'irrelevant', 'useless', 'poor'],
    'App Performance': ['bad', 'lag', 'freeze', 'crash', 'bug', 'loading', 'glitch'],
    'User Interface': ['poor', 'interface', 'UI', 'layout', 'difficult', 'confusing'],
    'General Features': ['network', 'feature missing', 'poor', 'not working', 'not available', 'poor network', 'no network']
}

#a dictionary to count the occurrences of each problem category
problem_counts = {key: 0 for key in problem_categories.keys()}

In [34]:
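#let's count: each phrase goes to the first category with a keyword it contains (substring match)
for phrase, count in negative_phrase_frequency:
  for category, keywords in problem_categories.items():
    if any(keyword in phrase for keyword in keywords):
      problem_counts[category] += count
      break  #stop at the first matching category so a phrase is counted only once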
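
The loop above uses plain substring matching, so a short keyword such as 'lag' also matches inside an unrelated word such as 'flag'. As one possible alternative (not the method used in this notebook), the sketch below matches keywords on word boundaries. The trimmed keyword lists, the helper `categorize`, and the phrase counts are hypothetical.

```python
import re

# Hypothetical, trimmed-down keyword lists for illustration
categories = {
    'Responses and Answers Quality': ['wrong', 'incorrect', 'useless'],
    'App Performance': ['lag', 'crash', 'bug'],
    'General Features': ['network', 'not working'],
}


def categorize(phrase, categories):
    """Return the first category whose keyword matches on word boundaries, else None."""
    for category, keywords in categories.items():
        for keyword in keywords:
            if re.search(r'\b' + re.escape(keyword) + r'\b', phrase):
                return category
    return None


# Made-up (phrase, count) pairs shaped like negative_phrase_frequency
phrase_frequency = [('wrong answer', 5), ('app crash', 3), ('red flag', 2), ('no network', 1)]

totals = {category: 0 for category in categories}
for phrase, count in phrase_frequency:
    category = categorize(phrase, categories)
    if category is not None:
        totals[category] += count

print(totals)
```

Here 'red flag' is left uncategorized, whereas the substring test `'lag' in 'red flag'` would have counted it under App Performance.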

In [35]:
problem_df = pd.DataFrame(list(problem_counts.items()), columns = ['Problem Category', 'Count'])
problem_df.head()

Unnamed: 0,Problem Category,Count
0,Responses and Answers Quality,759
1,App Performance,219
2,User Interface,0
3,General Features,35


In [38]:
#visualize
fig = px.bar(
    problem_df,
    x = 'Problem Category',
    y = 'Count',
    title = 'Common Problems Encountered by Chatgpt',
    labels = {'Problem Category': 'Problem Category', 'Count': 'Frequency'}
)

fig.update_layout(
    xaxis_title = 'Problem Category',
    yaxis_title = 'Frequency',
    xaxis = {'categoryorder': 'total descending'}
)

fig.show()

We see that the most common problems concern response and answer quality, which raises questions about how far users can rely on ChatGPT's responses. App performance problems are the second most common.

***SHOWING HOW THE REVIEWS CHANGED OR SHIFTED OVER TIME***

In [39]:
#convert the review date to datetime
df['Review Date'] = pd.to_datetime(df['Review Date'])

In [40]:
#aggregate sentiment counts by month
sentiment_overtime = df.groupby([df['Review Date'].dt.to_period('M'), 'Sentiment']).size().unstack(fill_value = 0)

#convert period back to datetime
sentiment_overtime.index = sentiment_overtime.index.to_timestamp()

In [42]:
#visualize
fig = go.Figure()

for sentiment in sentiment_overtime.columns:
  fig.add_trace(go.Scatter(
      x = sentiment_overtime.index,
      y = sentiment_overtime[sentiment],
      mode = 'lines',
      name = sentiment
  ))

#set the layout once, after all traces have been added
fig.update_layout(
    title = 'Sentiment Over Time',
    xaxis_title = 'Date',
    yaxis_title = 'Number of Reviews',
    legend_title = 'Sentiment',
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray')
)
fig.show()

We see that the number of positive reviews trends upward over time, peaking from February 2024 and dipping slightly by May 2024. The neutral and negative reviews follow a similar pattern.
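
When all three sentiment lines rise and fall together, the movement may reflect review volume rather than a change in opinion. One way to check is to look at each month's share of positive reviews instead of raw counts. The sketch below does this in plain Python on made-up rows; the dates and labels are hypothetical and not taken from the dataset.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Made-up (timestamp, sentiment) rows, for illustration only
rows = [
    ('2024-02-03 10:00:00', 'Positive'),
    ('2024-02-15 12:30:00', 'Negative'),
    ('2024-03-01 09:00:00', 'Positive'),
    ('2024-03-09 18:45:00', 'Positive'),
    ('2024-03-20 07:15:00', 'Neutral'),
    ('2024-03-28 21:00:00', 'Negative'),
]

# Count sentiments per calendar month (keyed as 'YYYY-MM')
monthly = defaultdict(Counter)
for stamp, sentiment in rows:
    month = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S').strftime('%Y-%m')
    monthly[month][sentiment] += 1

# Share of positive reviews within each month
positive_share = {
    month: counts['Positive'] / sum(counts.values())
    for month, counts in sorted(monthly.items())
}

print(positive_share)
```

In this toy data the positive count doubles from February to March while the positive share stays the same. On the notebook's data, `sentiment_overtime.div(sentiment_overtime.sum(axis=1), axis=0)` would give the monthly shares for all three labels.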