In [3]:
import pandas as pd
import plotly.express as px
from concurrent.futures import ThreadPoolExecutor as te
import logging
import os
import csv
import gc
# The helpers below are custom functions I wrote so that large datasets process in under 3 hours
from largedata_reader import parallel_read_csv
from largedata_reader import process_game_chunk
from largedata_reader import merge_scores


# Overview

This project focuses on analyzing and visualizing user sentiment from Steam game reviews. It processes raw review and game title data, cleans it, and applies keyword-based sentiment analysis to score reviews as positive or negative. The sentiment results are then displayed in interactive graphs to help users quickly understand the public perception of various games.

### Key Features:
- **Data Cleaning**: Extracts and preprocesses Steam reviews and game titles.
- **Sentiment Analysis**: Leverages tokenization and multithreading techniques to efficiently process reviews. Reviews are then graded as positive or negative based on sentiment.
- **Visualizations**: Presents sentiment percentages and trends through interactive bar graphs and charts.


Enjoy exploring the sentiment trends of your favorite Steam games!


<iframe src="https://giphy.com/embed/xT9C25UNTwfZuk85WP" width="480" height="346" style="" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/Giflytics-gif-jazminantoinette-giflytics-xT9C25UNTwfZuk85WP">via GIPHY</a></p>


In [4]:
# We are going to look at Steam reviews and apply sentiment analysis
# to them for games in the Steam store in 2020.

# Sentiment scores are cached in this file so they do not have to be recomputed on every run.
Cache_file = "game_scores_cache.csv"


game_titles = []

if __name__ == '__main__':
    # path to the CSV file for the dataset
    df_path = "c:/Users/stray/Desktop/portfolioDatasets/datasets/steam_reviews.csv"
    dataframe = parallel_read_csv(df_path)

    #validate the results
    print(dataframe.head())
    #converts from a polars object to a pandas object
    dataframe = dataframe.to_pandas()

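    # Collect unique game titles in order of first appearance.
    # unique() is vectorized and avoids scanning the list once per row.
    game_titles.extend(dataframe["app_name"].unique().tolist())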


shape: (5, 3)
┌──────────────────────────┬──────────┬─────────────────────────────────┐
│ app_name                 ┆ language ┆ review                          │
│ ---                      ┆ ---      ┆ ---                             │
│ str                      ┆ str      ┆ str                             │
╞══════════════════════════╪══════════╪═════════════════════════════════╡
│ The Witcher 3: Wild Hunt ┆ english  ┆ One of the best RPG's of all t… │
│ The Witcher 3: Wild Hunt ┆ english  ┆ good story, good graphics. lot… │
│ The Witcher 3: Wild Hunt ┆ english  ┆ dis gud,                        │
│ The Witcher 3: Wild Hunt ┆ english  ┆ favorite game of all time cant… │
│ The Witcher 3: Wild Hunt ┆ english  ┆ Why wouldn't you get this       │
└──────────────────────────┴──────────┴─────────────────────────────────┘


# Steam Reviews Sentiment Analysis (2020)

This section focuses on preparing the dataset for sentiment analysis of Steam reviews for games available in 2020.  

### Key Steps:
1. **Data Loading**: The dataset is loaded using multithreading to efficiently handle large volumes of data.
2. **Filtering**: The `parallel_read_csv` function (imported from `largedata_reader`) applies the following filters to ensure the dataset is relevant and clean:
   - Reviews must be written in English.
   - Reviews must not contain `NaN` values.
   - Reviews must have a minimum length of three characters.  

3. **Extracting Unique Game Titles**: A list of unique game titles is generated from the `app_name` column, which will be used in the next stage of the analysis.
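
The three filters listed above can be sketched in pandas as follows. This is only an illustration: the real filtering happens inside `parallel_read_csv`, whose source is not shown in this notebook, and the helper name `filter_reviews` and the sample rows are hypothetical.

```python
import pandas as pd


def filter_reviews(df: pd.DataFrame, min_length: int = 3) -> pd.DataFrame:
    """Keep English, non-null reviews with at least `min_length` characters.

    Hypothetical helper: assumes the columns `language` and `review`
    shown in the preview output above.
    """
    df = df.dropna(subset=["review"])  # drop NaN reviews
    is_english = df["language"] == "english"
    long_enough = df["review"].str.strip().str.len() >= min_length
    return df[is_english & long_enough].reset_index(drop=True)


# Made-up sample rows, one per filter case
sample = pd.DataFrame({
    "app_name": ["Game A", "Game A", "Game B", "Game B"],
    "language": ["english", "german", "english", "english"],
    "review": ["great game", "sehr gut", None, "ok"],
})

filtered = filter_reviews(sample)
unique_titles = filtered["app_name"].unique().tolist()
print(filtered)
print(unique_titles)
```

Only the first sample row passes all three filters, so only its title ends up in the list of unique game titles.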





In [5]:
# Put the paths to your word lists here.
# These lists are used to look through reviews and determine sentiment.
with open("./wordlist/positive_sentiment_words.txt", "r") as f:
    good_keywords = [line.strip() for line in f if line.strip()]  # skip blank lines

# load the negative word list
with open("./wordlist/negative_sentiment_words.txt", "r") as f:
    bad_keywords = [line.strip() for line in f if line.strip()]  # skip blank lines


# Sentiment Analysis Function Overview

This section focuses on the sentiment analysis function developed to process and score game reviews efficiently.  

### Process:
1. **Sentiment Scoring**:  
   - Each review is analyzed using predefined keyword lists for positive and negative sentiment.  
   - If a word in the review matches an entry in the keyword lists, it adds +1 to the corresponding sentiment score.  
   - Reviews without matches do not contribute to the sentiment score.  

2. **Optimization with Caching**:  
   - Sentiment scores are written to a cache file after processing, ensuring the analysis only needs to run once per dataset.  
   - On subsequent executions, the function checks for the existence of the cache file and loads the precomputed sentiment scores, saving time and resources.  

3. **Efficiency Enhancements**:  
   - The function processes game reviews using multithreading to handle large datasets more quickly.  
   - Tokenization is applied to break down each review into individual words, facilitating direct comparisons with the keyword lists.  

The function systematically iterates through each game in the `game_titles` list and scores the associated reviews for both positive and negative sentiment. This approach ensures efficient and reusable sentiment scoring, forming the foundation for further data analysis and visualization.
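
As a rough illustration of the scoring step described above, here is a minimal sketch of tokenizing one review and counting keyword matches. The actual work is done by `process_game_chunk`, whose source is not shown here, so the function name `score_review`, the regular expression used for tokenization, and the tiny word sets are all assumptions.

```python
import re


def score_review(review, good_words, bad_words):
    """Return (good_hits, bad_hits) for a single review.

    Hypothetical helper: lowercases the text, splits it into word tokens,
    and adds +1 for every token found in either keyword set.
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    good_hits = sum(1 for token in tokens if token in good_words)
    bad_hits = sum(1 for token in tokens if token in bad_words)
    return good_hits, bad_hits


# Sets give constant-time membership checks, as in analyze_review
good_words = {"good", "great", "fun"}
bad_words = {"bad", "boring", "buggy"}

result = score_review(
    "Good story, great graphics, but buggy at launch.", good_words, bad_words
)
no_match = score_review("dis gud", good_words, bad_words)
print(result)
print(no_match)
```

A review with no matching words, such as the second call, contributes nothing to either score.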


In [6]:
def analyze_review(dataframe, game_titles, good_keywords, bad_keywords):
    #check if cache file exists
    if os.path.exists(Cache_file):
        print("Cache file found, loading sentiment scores...........")
        return load_scores_from_csv(Cache_file)
    
    logging.info("No cache file found, starting sentiment analysis..........")
    #initialize the game scores
    game_scores = {title:{"Good":0, "Bad":0} for title in game_titles}
    chunk_size = 100000

    print(f"Chunk size set to {chunk_size} reviews per batch..")

    good_keywords_set = set(good_keywords)
    bad_keywords_set = set(bad_keywords)

    # process in chunks
    chunks = (dataframe[i:i+chunk_size] for i in range(0, len(dataframe), chunk_size))
    with te() as executor:
        for i,chunk_results in enumerate(
            executor.map(
                lambda chunk: process_game_chunk(chunk, good_keywords_set,
                                 bad_keywords_set, game_scores),chunks
            )
        ):
            print(f"Processed {i + 1} chunks of reviews")
            merge_scores(chunk_results, game_scores)
            print(f"Scores after chunk {i + 1}: {game_scores}")
            gc.collect() # free memory to make sure no overflows happen 

    print("Saving computed game scores to CSV cache file")
    save_scores_to_csv(game_scores, Cache_file)

    print("Sentiment analysis completed successfully.")
    return game_scores

def load_scores_from_csv(filepath):
    """Load sentiment scores from a CSV file."""
    scores = {}
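    # Read with the same encoding and newline handling used when the cache was written
    with open(filepath, mode="r", newline="", encoding="utf-8") as file: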
        reader = csv.DictReader(file)
        for row in reader:
            scores[row["Title"]] = {"Good": int(row["Good"]), "Bad": int(row["Bad"])}
    print("Successfully loaded game scores from cache.")
    return scores

def save_scores_to_csv(game_scores, filepath):
    """Save sentiment scores to a CSV file."""
    with open(filepath, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Good", "Bad"])
        for title, counts in game_scores.items():
            writer.writerow([title, counts["Good"], counts["Bad"]])
    # Report once after all rows are written, not once per game
    print(f"Game scores saved to cache file: {filepath}")

In [7]:
game_sentiment = analyze_review(dataframe,game_titles,good_keywords,bad_keywords)
games = list(game_sentiment.keys())
good_reviews = [game_sentiment[game]["Good"] for game in games]
bad_reviews = [game_sentiment[game]["Bad"] for game in games]

Cache file found, loading sentiment scores...........
Successfully loaded game scores from cache.


In [8]:
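# Reshape to long form: one row per (game, sentiment) pair
counts_df = pd.DataFrame(
    {"Game": games, "Good": good_reviews, "Bad": bad_reviews}
).melt(id_vars="Game", var_name="Sentiment", value_name="Review Count")

fig = px.bar(
    counts_df,
    x="Review Count",  # counts run along the horizontal axis
    y="Game",  # one group of bars per game
    color="Sentiment",
    orientation="h",
    title="Good vs. Bad Reviews per Game",
    barmode="group"  # Grouped bars for comparison
)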

# Add layout properties
fig.update_layout(
    bargap=0.1,  # Gap between bars
    width=1000,  # Set figure width
    height=2500,  # Set figure height
    yaxis={"categoryorder": "total ascending"}  # Games are on the y-axis; largest totals appear at the top
)

# Display the figure
fig.show()

# Improving Graph Readability and Exploring Sentiment Distribution

The previous graph provided a general overview of the sentiment trends, showing that there are more positive reviews than negative ones. However, due to the large number of game titles and the cluttered nature of the graph, it’s difficult to discern specific insights, such as identifying which game has the most favorable reviews.  

To improve the clarity of the data, we will shift to a different visualization approach, which will present the sentiment distribution in a more readable format. This allows us to better highlight the games with the best reviews and make the trends more apparent, potentially helping us discover a new game worth exploring.  

The initial graph was intentionally designed with a large dataset to emphasize the challenges of displaying such vast amounts of data. This serves as a stepping stone toward the next, more focused visualization, where we will more effectively present the key insights from the dataset, helping to paint a clearer picture of the sentiment landscape.

But wow, that is a lot of data for sure.  
<iframe src="https://giphy.com/embed/M33UV4NDvkTHa" width="480" height="278" style="" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/ever-flawless-meryl-M33UV4NDvkTHa">via GIPHY</a></p>

In [10]:
def calculate_percents(game_scores):
    """Compute a sentiment score between -1 and 1 for each game."""
    sentiment_results = {}

    for game, scores in game_scores.items():
        total_reviews = scores["Good"] + scores["Bad"]

        # Avoid dividing by zero for games with no keyword matches
        if total_reviews > 0:
            sentiment_score = round((scores["Good"] - scores["Bad"]) / total_reviews, 2)
        else:
            sentiment_score = 0

        sentiment_results[game] = {
            "Sentiment Score": sentiment_score,
            "Good": scores["Good"],
            "Bad": scores["Bad"],
        }
    return sentiment_results

# call calculate_percents and store the results
sentiment_scores = calculate_percents(game_sentiment)
# convert the scores to a list of dicts (one per game) for the graph
santiment_data = [
    {"Game": game, "Sentiment Percent": data["Sentiment Score"]}
    for game, data in sentiment_scores.items()
]


# Sentiment Calculation and Data Preparation

In this section, we calculate the sentiment scores for each game based on the number of positive and negative reviews.  

### Process:
- **Total Reviews Calculation**: For each game, the total number of reviews is calculated by adding the good (positive) and bad (negative) reviews together.
- **Sentiment Score Calculation**: The sentiment score for each game is determined by subtracting the number of bad reviews from the number of good reviews and then dividing the result by the total number of reviews.
 This gives a value between -1 and 1 that represents the overall sentiment for the game, with a positive score indicating more good reviews than bad. Games with no scored reviews are assigned a score of 0.

Once the sentiment scores are calculated, the data is prepared for visualization by converting it into a format that can be easily plotted in the next graph. This allows us to better visualize how each game is performing based on user feedback.
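
To make the arithmetic concrete, the score is (Good - Bad) / (Good + Bad). The short sketch below mirrors that calculation from `calculate_percents` on made-up counts; the helper name `sentiment_score` and the numbers are illustrative only.

```python
def sentiment_score(good: int, bad: int) -> float:
    """Return (good - bad) / (good + bad), or 0.0 when there are no counts."""
    total = good + bad
    if total == 0:
        return 0.0
    return round((good - bad) / total, 2)


mostly_good = sentiment_score(30, 10)  # (30 - 10) / 40
mostly_bad = sentiment_score(10, 30)   # (10 - 30) / 40
no_data = sentiment_score(0, 0)        # guarded against division by zero

print(mostly_good, mostly_bad, no_data)
```

A game with 30 good and 10 bad counts scores 0.5, the reverse scores -0.5, and a game with no counts stays at 0.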


In [11]:
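# TODO: values are not being passed to this chart properly; find out why
# (a bigger screen is needed to see which values are being passed)
# Create the graph for the sentiment scores.
# `games` is in the same order as the rows of santiment_data, so it can be used directly for x.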
fig1 = px.bar(
    santiment_data,
    x=games,
    y="Sentiment Percent",
    title="Game Sentiment Percentages",
    labels={"x": "Game Title"},  # x is passed as a list, so Plotly names it "x"
    color="Sentiment Percent",  # Adds color to the bars based on the score
    color_continuous_scale="RdYlGn"  # Green for positive, red for negative
)
fig1.update_layout(
    bargap=0.2,  # Gap between bars
    width=2300,  # Set figure width
    height=1000,  # Set figure height
    title=dict(font=dict(size=24)),  # Increase title font size
    xaxis=dict(title=dict(text="Game Title", font=dict(size=18))),  # Axis title font size (titlefont is deprecated)
    yaxis=dict(title=dict(text="Sentiment Percent", font=dict(size=18))),  # Axis title font size (titlefont is deprecated)
     #xaxis={"categoryorder": "total descending"}  # Sort by y-values in descending order
)

fig1.show()

# Visualizing Sentiment Percentages for Games

This section creates a bar graph to display the sentiment percentages for each game, derived from the sentiment scores calculated in the previous steps.  

### Graph Details:
- **Sentiment Calculation**: The sentiment for each game is represented as a score between -1 and 1, where values near 1 indicate mostly positive sentiment and values near -1 indicate mostly negative sentiment.  
  This score is the difference between the good and bad review counts divided by their total for each game.
  
- **Graph Format**: The bar graph is designed to make it easier to identify the sentiment for each game.  
  Each bar represents a game's sentiment score, and the color of the bar reflects the sentiment intensity—green for positive and red for negative.
  
- **Enhanced UI**: The graph’s size has been increased to allow for more readable labels and bars.  
  The games are presented in a format where the bars are wider and less compact, improving visibility.  
  Hovering over each bar reveals the game title and its respective sentiment score, making the graph more interactive and easier to interpret.  
  Additionally, you can drag to zoom in on sections of the graph for a closer view, allowing for a more detailed exploration of specific games.

By using this approach, the sentiment percentages are more clearly presented.  
This allows us to quickly identify which games have the best user feedback and which ones may need improvement.
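
For readers who prefer a ranked list to scanning the chart, the following is a small sketch of pulling the highest-scoring games out of a dictionary shaped like the output of `calculate_percents`. The helper name `top_games` and the example entries are hypothetical and not part of the notebook.

```python
def top_games(sentiment_scores, n=10):
    """Return the n (game, score) pairs with the highest sentiment score."""
    ranked = sorted(
        sentiment_scores.items(),
        key=lambda item: item[1]["Sentiment Score"],
        reverse=True,
    )
    return [(game, data["Sentiment Score"]) for game, data in ranked[:n]]


# Made-up entries in the same shape calculate_percents returns
example_scores = {
    "Game A": {"Sentiment Score": 0.5, "Good": 30, "Bad": 10},
    "Game B": {"Sentiment Score": -0.2, "Good": 4, "Bad": 6},
    "Game C": {"Sentiment Score": 0.8, "Good": 9, "Bad": 1},
}

best_two = top_games(example_scores, n=2)
print(best_two)
```

In the notebook this could be called as `top_games(sentiment_scores)`. Note that a game with very few counts can reach a high score, so the raw `Good` and `Bad` values are worth checking alongside the ranking.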




# Conclusion

Through this analysis, we've explored the sentiment trends across a variety of Steam games, using both positive and negative review data to calculate sentiment scores. The interactive visualizations provided allow for a deeper understanding of user feedback, highlighting which games resonate positively with players and which may need improvement.

By applying sentiment analysis and visualizing the results, we’ve been able to gain valuable insights that could guide future decisions, whether it’s selecting a game to play or understanding how different titles are perceived by their audience.

Thank you for exploring this sentiment analysis project! I hope you’ve gained a better understanding of how data can help uncover trends and guide decisions. Feel free to dive deeper into the data and discover even more insights.  

<iframe src="https://giphy.com/embed/B6G2MYBmtnGYU" width="480" height="298" style="" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/abandon-thread-dislike-B6G2MYBmtnGYU">via GIPHY</a></p>