<h1 align="center">Analyzing Steam Reviews: An NLP Journey</h1>
<h2 align="center">Implementing NERC, Sentiment, and Topic Analysis at the Sentence Level: Methods and Evaluations</h2>

**Text Mining for AI - Project Group 1**

**INTRODUCTION**

The gaming industry continues to expand rapidly and is a major contributor to the entertainment sector. Recent market estimates indicate that its revenue has reached all-time highs [1]. This growth also brings difficulties, especially in preserving player satisfaction as the range of available games widens. The challenge of meeting varied player expectations has highlighted the need for new approaches to improving games [2].

User reviews have become an essential channel of player feedback for addressing the issues facing the rapidly growing gaming industry. Studies have shown that reviews give developers a direct view of customers' experiences and preferences, yielding crucial insights [3,4,5]. This kind of immediate feedback is particularly common on digital platforms that aggregate user interactions and opinions. Among them, **Steam** (https://store.steampowered.com/about/) is one of the best-known video game marketplaces, with a large library and a vibrant community. Its large user base produces a diverse and comprehensive collection of reviews, making Steam an ideal environment for mining and analyzing player feedback.

With these factors in mind, our research aims to create a comprehensive analytical framework for mining and understanding user reviews on platforms such as **Steam**. This will help game producers improve their products to better suit players' varied needs and will also contribute to the larger conversation about how to improve customer satisfaction in the ever-changing gaming industry. Our ultimate objective is to narrow the gap between player expectations and game development in order to provide a more enjoyable and engaging gaming experience for all players.

**DESCRIPTION AND MOTIVATION OF APPROACHES**

Specifically, we apply three core NLP techniques:
- Named Entity Recognition and Classification (NERC)
- Sentiment analysis
- Topic analysis

NERC identifies key entities within text data, such as names of games, characters, or features, and assigns them to predefined categories [12]. This NLP technique is crucial for structuring unstructured data [11]. In our project, NERC serves to pinpoint specific game elements mentioned in user reviews, aiding in the contextual analysis of feedback.

Sentiment analysis assesses the emotional content of sentences and allows us to classify them as neutral, positive, or negative [12,13]. This step is essential to understanding the general attitude of players toward different components of the game.

In our project, topic analysis helps find recurrent patterns within user reviews by extracting themes or concepts from large amounts of text [14]. Using this method, we can identify common problems or remarks that might not be immediately apparent.

**DESCRIPTION OF THE DATASET** 

The "Steam Reviews International" dataset on Kaggle (https://www.kaggle.com/datasets/andrewmvd/steam-reviews) is a collection of ???k Steam reviews collected using the Steamworks API (228.3 MB). Each row of the dataset.csv file contains various information about one collected review.

The structure and content of the dataset are as follows:

| **Label** | **Content Description** | **Data Type** |
|----------|----------|----------|
| app_id | ID of the reviewed app | unique identifier |
| app_name | Name of the reviewed app | string |
| review_id | ID of the review | unique identifier |
| review_text | Text written by the reviewer | string |
| voted_up | Whether the reviewer recommends the game (True = yes, False = no) | boolean |
| votes_up | How many times the review was voted up | integer |
| timestamp_created / timestamp_updated | When the review was created and when it was last updated | datetime |
| votes_funny | How many times the review was voted as funny (similar to a like) | integer |
| weighted_vote_score | Relevance of the review, between 0 and 1 | float |
| comment_count | How many other users interacted with the review | integer |
| steam_purchase | Whether the reviewer purchased the game | boolean |
| received_for_free | Whether the reviewer received the game without purchasing it | boolean |
| written_during_early_access | Whether the review was written before public release | boolean |


**DATA CLEANING - PHASE 1**

Our data cleaning procedure started with the removal of duplicate entries to guarantee the originality of each review. We then standardized the text by converting all review content to lowercase, which simplifies later text analysis tasks. Subsequently, we removed any dataset entries that were missing important information, namely reviews without text or application names. We further de-cluttered the dataset by removing columns that had no impact on our project's goals, such as unnecessary timestamps and identifiers, keeping only the information relevant to the main analysis. To ensure consistency and to match the linguistic scope of our study, we kept only English-language reviews. The language column was then removed as well because its value had become implicit.

These more general edits completed the first cleaning phase and also left us with a substantially smaller dataset.
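
As a minimal, dependency-free sketch of how the order of these steps can affect the result, the toy function below applies them to a handful of made-up rows and records the row count after each step. The function name `clean_reviews`, the `lowercase_first` switch, and the sample rows are illustrative assumptions and not part of the pipeline itself, which uses pandas. On this toy input, removing duplicates before lowercasing leaves two reviews that differ only in capitalization.

```python
def clean_reviews(rows, lowercase_first=False):
    """Apply the Phase 1 steps to a list of dict rows and record row counts.

    Hypothetical helper for illustration only; the real pipeline uses pandas.
    """
    counts = {"start": len(rows)}

    def dedupe(items):
        # Keep the first occurrence of each review text (case-sensitive).
        seen, kept = set(), []
        for row in items:
            key = row.get("review_text")
            if key not in seen:
                seen.add(key)
                kept.append(row)
        return kept

    def lower(items):
        # Lowercase the review text, leaving non-string values untouched.
        return [
            {**row, "review_text": row["review_text"].lower()}
            if isinstance(row.get("review_text"), str) else row
            for row in items
        ]

    rows = dedupe(lower(rows)) if lowercase_first else lower(dedupe(rows))
    counts["after_dedupe_and_lowercase"] = len(rows)

    # Drop rows with a missing or empty review text or app name.
    rows = [r for r in rows if r.get("review_text") and r.get("app_name")]
    counts["after_dropping_empty"] = len(rows)

    # Keep only rows labelled as English.
    rows = [r for r in rows if r.get("language") == "english"]
    counts["after_english_filter"] = len(rows)
    return rows, counts


# Made-up sample rows (not taken from the dataset).
sample = [
    {"app_name": "Game A", "review_text": "Good game", "language": "english"},
    {"app_name": "Game A", "review_text": "good game", "language": "english"},
    {"app_name": "Game B", "review_text": "", "language": "english"},
    {"app_name": "Game C", "review_text": "Sehr gut", "language": "german"},
    {"app_name": "Game D", "review_text": "Good game", "language": "english"},
]

dedupe_first_rows, dedupe_first_counts = clean_reviews(sample)
lower_first_rows, lower_first_counts = clean_reviews(sample, lowercase_first=True)
print(dedupe_first_counts)
print(lower_first_counts)
```

Whether case-insensitive deduplication is desirable depends on the analysis; the sketch only makes the difference visible.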

In [16]:
import pandas as pd
import re

df = pd.read_csv('steam_reviews.csv')
# Remove duplicate reviews (identical review text), keeping the first occurrence
df = df.drop_duplicates(subset='review_text', keep='first')
# Convert all review text to lowercase
df['review_text'] = df['review_text'].str.lower()
# Remove rows whose review text or app name is NaN or an empty string
df = df[df['review_text'].notna() & (df['review_text'] != '')]
df = df[df['app_name'].notna() & (df['app_name'] != '')]
# Remove the app_id column
df = df.drop(columns=['app_id'])
# Keep only rows where the language is English
df = df[df['language'] == 'english']

# Drop columns that are not needed for the analysis
df = df.drop(columns=['timestamp_created', 'timestamp_updated','hidden_in_steam_china','steam_china_location','Unnamed: 0'])

# Only English reviews remain, so the language column is now redundant
df = df.drop(columns='language')

#display: 
#print(df.head(100))
#print(df.isna().sum().sum())


                                         app_name  review_id  \
0                                   The Cold Hand  138284361   
1                                   The Cold Hand  137869845   
3                            Unnatural Season Two  137882379   
4                            Unnatural Season Two  138491433   
6    My Pleasure - Season 2: Advanced Walkthrough  140706437   
..                                            ...        ...   
209                                 Dream of Echo  138046872   
210                                    Triad Ball  138865714   
211                          Let It Boom Playtest  141417569   
212                                Frozen shelter  141698855   
215                                  Just Survive  140520667   

                                           review_text  voted_up  votes_up  \
0    "yeah man, i'm making a game. it's gonna be a ...     False         3   
1    i like the part where you jump on enemies and ...      True         2 

**DATA CLEANING - PHASE 2**

In the second cleaning phase, we introduced a specialized function to further improve the quality of our text data. Some reviews contain ASCII art, that is, characters arranged in visual patterns that do not contribute to meaningful text analysis, so we developed a method to identify and remove these elements.

To handle this, we use the function `remove_ascii_art`. This function breaks each review into individual lines and checks each line for signs of ASCII art, defined here as a line that consists entirely of three or more characters that are neither word characters nor whitespace. Lines identified as ASCII art are removed, so that the remaining text is more likely to be meaningful.

In [19]:
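# Matches a line made up entirely of three or more characters that are
# neither word characters nor whitespace (for example "=====" or "~~~***")
ASCII_ART_PATTERN = re.compile(r'^[^\w\s]{3,}$')

def remove_ascii_art(text):
    """Drop the lines of a review that consist only of symbol characters."""
    lines = text.split('\n')
    filtered_lines = [line for line in lines if not ASCII_ART_PATTERN.search(line)]
    return '\n'.join(filtered_lines)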
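
To make the matching rule concrete, the following sketch runs the same regular expression as `remove_ascii_art` against a few made-up lines. It also defines `LOOSE_LINE`, a hypothetical looser variant that tolerates whitespace between symbols; this variant is an assumption added for comparison and is not the pattern used in our cleaning step.

```python
import re

# Same pattern as in remove_ascii_art: the whole line must consist of three
# or more characters that are neither word characters nor whitespace.
STRICT_LINE = re.compile(r'^[^\w\s]{3,}$')

# Hypothetical looser variant (an assumption, not the pattern used above):
# at least three symbol characters, with optional whitespace around them.
LOOSE_LINE = re.compile(r'^\s*(?:[^\w\s]\s*){3,}$')

# Made-up example lines (not taken from the dataset).
examples = [
    "=====",          # only symbols
    "~~~***~~~",      # only symbols
    "* * * * *",      # symbols separated by spaces
    "=====\r",        # symbols followed by a carriage return
    "!!",             # fewer than three symbols
    "great game!!!",  # ordinary text
]

for line in examples:
    strict = bool(STRICT_LINE.search(line))
    loose = bool(LOOSE_LINE.search(line))
    print(f"{line!r:18} strict={strict!s:5} loose={loose}")
```

The strict pattern leaves lines such as `* * * * *` in place because they contain whitespace, and it also skips symbol lines that end in a carriage return, which can occur when reviews use Windows line endings. Art that mixes letters or underscores with symbols is matched by neither pattern.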

In [27]:
#print(df.columns)
df['review_text'] = df['review_text'].apply(remove_ascii_art)

Index(['app_name', 'review_id', 'review_text', 'voted_up', 'votes_up',
       'votes_funny', 'weighted_vote_score', 'comment_count', 'steam_purchase',
       'received_for_free', 'written_during_early_access'],
      dtype='object')


**DATA CLEANING - FINAL PHASE**

In the later stages of our data cleaning process, despite our initial efforts to filter out non-English reviews, we encountered instances where reviews, although classified as English, still contained fragments of other languages.

To address this, we incorporated an additional layer of language verification using the `is_english` function, which employs the `langdetect` library. This function assesses each review's language content, ensuring its primary language aligns with our focus on English texts. We applied this function to each review, appending a new column `is_english` to our dataset, which flags whether a review is predominantly in English.

However, although we chose this library for its performance, we were aware that it can misclassify reviews, so we carried out a final manual review as the last step of the data cleaning process. This step involved a hands-on examination of the dataset to identify and remove any remaining reviews that erroneously passed through the earlier filters.

Now that we have obtained a cleaned dataset, comprising ??? entries, we are prepared to proceed with the application of the previously mentioned NLP techniques.

In [34]:
from langdetect import detect, LangDetectException

df = pd.read_csv('intermediate_dataset.csv', sep=',', encoding='utf-8',on_bad_lines="skip") 

def is_english(text):
    # First, check if 'text' is a string
    if not isinstance(text, str):
        return False  # If not a string, return False or handle differently
    
    try:
        # Detect the language of the text
        return detect(text) == 'en'
    except LangDetectException:
        # If the text is too short or has other issues, language detection may fail
        return False


# Apply the language detection function to the 'review_text' column
df['is_english'] = df['review_text'].apply(is_english)

# Filter the DataFrame to keep only the rows where 'is_english' is True
df_filtered = df[df['is_english']]

# Optionally, save the filtered DataFrame to a new CSV file
#df_filtered.to_csv('final_cleaned_dataset.csv', sep=';', encoding='utf-8', index=False)
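
One way to narrow down the manual review step is to sort reviews into three groups instead of two. The sketch below is a hedged illustration of that idea and not the procedure we followed: the function `triage_language`, the `threshold` and `min_chars` values, and the stand-in detector `fake_detect_langs` are all assumptions. The detector is passed in as a parameter so the example runs without `langdetect` installed; any callable that returns candidates with `lang` and `prob` attributes, such as `langdetect.detect_langs`, could be supplied instead.

```python
from collections import namedtuple


def triage_language(text, detect_langs, threshold=0.90, min_chars=20):
    """Label a review as 'english', 'other', or 'manual' (needs a human look).

    Hypothetical helper; threshold and min_chars are illustrative defaults.
    """
    if not isinstance(text, str) or not text.strip():
        return "other"
    if len(text.strip()) < min_chars:
        return "manual"  # very short texts are hard to classify reliably
    try:
        candidates = detect_langs(text)
    except Exception:
        # With langdetect, the error raised here would be LangDetectException.
        return "manual"
    if not candidates:
        return "manual"

    best = max(candidates, key=lambda c: c.prob)
    if best.lang == "en":
        return "english" if best.prob >= threshold else "manual"
    # English is present but not dominant: likely a mixed-language review.
    if any(c.lang == "en" for c in candidates):
        return "manual"
    return "other"


# Stand-in detector for the demo only; it is NOT a real language model.
Guess = namedtuple("Guess", ["lang", "prob"])


def fake_detect_langs(text):
    if "gut" in text:
        return [Guess("de", 0.99)]
    if "bueno" in text:
        return [Guess("en", 0.60), Guess("es", 0.40)]
    return [Guess("en", 0.99)]


# Made-up reviews (not taken from the dataset).
reviews = [
    "this game is really fun to play",
    "das spiel ist sehr gut gemacht",
    "great game pero muy bueno amigo",
    "gg",
    None,
]
labels = [triage_language(r, fake_detect_langs) for r in reviews]
print(labels)
```

When plugging in `langdetect`, note that its detection is non-deterministic by default; setting `DetectorFactory.seed = 0` before the first call makes repeated runs give the same labels.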


**References**

1. Wijman, T. (2024, February 8). Newzoo’s games market revenue estimates and forecasts by region and segment for 2023. Newzoo. Retrieved from https://newzoo.com/resources/blog/games-market-estimates-and-forecasts-2023

2. Chambers, C., Feng, W. C., Sahu, S., & Saha, D. (2005). Measurement-based characterization of a collection of online games. In Proceedings of the 5th ACM SIGCOMM Conference on Internet Measurement (pp. 1-1). USENIX Association.

3. Lin, D., Bezemer, C. P., Zou, Y., & Hassan, A. E. (2019). An empirical study of game reviews on the Steam platform. Empirical Software Engineering, 24, 170-207.

4. Vasa, R., Hoon, L., Mouzakis, K., & Noguchi, A. (2012). A preliminary analysis of mobile app user reviews. In Proceedings of the 24th Australian Computer-Human Interaction Conference (pp. 241-244). ACM.

5. Hoon, L., Vasa, R., Schneider, J. G., & Mouzakis, K. (2012). A preliminary analysis of vocabulary in mobile app user reviews. In Proceedings of the 24th Australian Computer-Human Interaction Conference (pp. 245-248). ACM.

6. Livingston, I. J., Nacke, L. E., & Mandryk, R. L. (2011, August). The impact of negative game reviews and user comments on player experience. In Proceedings of the 2011 ACM SIGGRAPH Symposium on Video Games (pp. 25-29).

7. Zagal, J. P., Ladd, A., & Johnson, T. (2009, April). Characterizing and understanding game reviews. In Proceedings of the 4th international Conference on Foundations of Digital Games (pp. 215-222).

8. Livingston, I. J., Mandryk, R. L., & Stanley, K. G. (2010). Critic-proofing: How using critic reviews and game genres can refine heuristic evaluations. In Proceedings of FuturePlay 2010.

9. Lin, D., Bezemer, C. P., Zou, Y., & Hassan, A. E. (2019). An empirical study of game reviews on the Steam platform. Empirical Software Engineering, 24, 170-207.

10. Busurkina, I., Karpenko, V., Tulubenskaya, E., & Bulygin, D. (2020, June). Game experience evaluation. a study of game reviews on the steam platform. In International conference on digital transformation and global society (pp. 117-127). Cham: Springer International Publishing.

11. Markov, I. (2024). Named entity recognition and classification [PDF document]. Text Mining for AI, Academic Year 2024. Retrieved from Canvas.

12. Maynard, D., Bontcheva, K., & Rout, D. (2016). NLP for software: Understanding the role of natural language processing in software development. In Natural Language Processing for Software Engineering (pp. 25-35, 73-86).

13. Markov, I. (2024). Sentiment analysis [PDF document]. Text Mining for AI, Vrije Universiteit, Academic Year 2024. Retrieved from Canvas. 

14. Markov, I. (2024). Topic Modelling and Text Classification [PDF document].  Text Mining for AI, Vrije Universiteit, Academic Year 2024. Retrieved from Canvas. 