# Project

The goal of this project is to perform comprehensive sentiment analysis and text classification on Steam game reviews to gain valuable insights into user sentiments and preferences. By leveraging natural language processing (NLP) techniques, the project aims to achieve the following objectives:

__Sentiment Analysis__: Analyze the sentiment expressed in Steam game reviews using both rule-based methods (VADER sentiment analysis) and machine learning models.

__Text Classification__: Build and evaluate a text classification model to categorize reviews into positive and negative sentiments based on user votes (upvotes or downvotes).

__User Behavior Exploration__: Utilize additional information such as playtime, number of games owned, and number of reviews written by users.

By achieving these objectives, the project aims to provide game developers, platform administrators, and gamers with actionable insights to enhance user experiences, understand popular game trends, and potentially improve game recommendations on the Steam platform.

# Data Engineering

Since I want to have up-to date data about the games, I decided to use the Steam API to get the data. The whole process of scraping the data is inspired by [this](https://andrew-muller.medium.com/scraping-steam-user-reviews-9a43f9e38c92) article. More detailed description on the functionality is inside the code.

In [337]:
# Import necessary libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Function to get Steam game IDs
# Parameters:
#   - n: Number of games to get IDs for (default is 100)
#   - filter_by: Type of filter to apply when searching for games (default is 'topsellers')
#     Options for filter_by include 'topsellers', 'new', 'releases', 'comingsoon', etc.
# Returns a list of game IDs
def get_n_appids(n=100, filter_by='topsellers'):
    # Initialize an empty list to store game IDs
    appids = []
    
    # Steam URL for game search with specified filter
    url = f'https://store.steampowered.com/search/?category1=998&filter={filter_by}&page='
    
    # Counter for the page number
    page = 0

    # Continue fetching game IDs until reaching the desired count (n)
    while page * 25 < n:
        # Increment page number
        page += 1
        
        # Send a request to the Steam URL with the current page number
        response = requests.get(url=url + str(page), headers={'User-Agent': 'Mozilla/5.0'})
        
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract game IDs from the search results and add them to the appids list
        for row in soup.find_all(class_='search_result_row'):
            appids.append(row['data-ds-appid'])

    # Return the first n game IDs
    return appids[:n]

In [338]:
# Function to get reviews for a specific app ID from the Steam store
# Parameters:
#   - appid: Steam application ID for a specific game
#   - params: Additional parameters for the API request (default is {'json': 1})
# Returns a dictionary containing review information in JSON format
def get_reviews(appid, params={'json': 1}):
    url = 'https://store.steampowered.com/appreviews/'
    response = requests.get(url=url + appid, params=params)
    return response.json()

# Function to get a specified number of reviews for a given app ID
# Parameters:
#   - appid: Steam application ID for a specific game
#   - n: Number of reviews to retrieve (default is 100)
# Returns a list of reviews, each represented as a dictionary
def get_n_reviews(appid, n=100):
    reviews = []
    cursor = '*'  # Initial cursor for pagination

    # Default parameters for the API request
    params = {
        'json': 1,
        'filter': 'all',
        'language': 'english',
        'day_range': 9223372036854775807,
        'review_type': 'all',
        'purchase_type': 'all'
    }

    # Continue fetching reviews until reaching the desired count (n)
    while n > 0:
        # Set the cursor and number of reviews per page in the request parameters
        params['cursor'] = cursor.encode()
        params['num_per_page'] = min(100, n)
        n -= 100

        # Make a request to get reviews for the current app ID and parameters
        response = get_reviews(appid, params)
        cursor = response['cursor']
        reviews += response['reviews']

        # Break if the number of retrieved reviews is less than the requested batch size (less than 100)
        if len(response['reviews']) < 100:
            break

    # Return the list of reviews
    return reviews

# Function to scrape Steam data by fetching reviews for a specified number of apps
# Parameters:
#   - n_apps: Number of apps to retrieve reviews for (default is 3)
#   - n_reviews: Number of reviews to retrieve for each app (default is 100)
# Returns a list of reviews for the specified number of apps
def scrape_steam_data(n_apps=3, n_reviews=100):
    reviews = []
    appids = get_n_appids(n_apps)

    # Iterate through app IDs and fetch reviews for each app
    for appid in appids:
        reviews += get_n_reviews(appid, n_reviews)

    # Return the aggregated list of reviews
    return reviews

In [339]:
# Import necessary library
from datetime import datetime

# Function to create a pandas DataFrame from a list of Steam reviews
# Parameters:
#   - reviews: List of Steam reviews, each represented as a dictionary
# Returns a DataFrame with relevant information extracted from the reviews
def create_dataframe(reviews):
    data = []

    # Iterate through each review in the list
    for review in reviews:
        # Extract relevant information from the review dictionary
        timestamp_created = review.get('timestamp_created', None)
        playtime_forever = review.get('author', {}).get('playtime_forever', None)

        # Extract fields from the 'author' column
        author_info = review.get('author', {})
        steamid = author_info.get('steamid', None)
        num_games_owned = author_info.get('num_games_owned', None)
        num_reviews = author_info.get('num_reviews', None)

        # Create a row data dictionary with extracted information
        row_data = {
            'review': review.get('review', None),
            'voted_up': review.get('voted_up', None),
            'steamid': steamid,
            'num_games_owned': num_games_owned,
            'num_reviews': num_reviews,
            'timestamp': convert_timestamp(timestamp_created),
            'playtime_formatted': convert_playtime(playtime_forever)
        }

        # Append the row data to the list
        data.append(row_data)

    # Create a DataFrame from the list of row data
    df = pd.DataFrame(data)
    return df

# Function to convert a timestamp to a formatted string
# Parameters:
#   - timestamp: Unix timestamp
# Returns a formatted string in the format 'YYYY-MM-DD HH:MM:SS'
def convert_timestamp(timestamp):
    if timestamp is not None:
        return datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S')
    return None

# Function to convert playtime in minutes to a formatted string
# Parameters:
#   - playtime: Playtime in minutes
# Returns a formatted string in the format 'Xh Ym'
def convert_playtime(playtime):
    if playtime is not None:
        hours, minutes = divmod(playtime, 60)
        return f'{int(hours)}h {int(minutes)}m'
    return None

# Function to create a DataFrame with game IDs repeated based on a specified repetition count
# Parameters:
#   - numbers: List of game IDs
#   - repetition_count: Number of times each game ID should be repeated
# Returns a DataFrame with a single column 'game_id'
def create_appid_df(numbers, repetition_count):
    data = [int(num) for num in numbers for i in range(1, repetition_count + 1)]
    df = pd.DataFrame(data, columns=['game_id'])
    return df

# Function to get the name of a game based on its Steam app ID
# Parameters:
#   - appid: Steam application ID for a specific game
# Returns the name of the game or None if the request was not successful
def get_game_name(appid):
    url = f'https://store.steampowered.com/api/appdetails/'
    params = {'appids': appid}
    response = requests.get(url, params=params)
    data = response.json()

    # Check if the request was successful and data is available
    if str(appid) in data and data[str(appid)]['success']:
        return data[str(appid)]['data']['name']
    else:
        return None

In [340]:
# Set the number of apps and reviews to retrieve
n_apps = 10
n_reviews = 10

# Get a list of Steam app IDs for the specified number of apps
numbers_list = get_n_appids(n_apps)
# Set the repetition count for creating the DataFrame
repetition_count = n_reviews
# Create a DataFrame with game IDs repeated based on the repetition count
result_dataframe = create_appid_df(numbers_list, repetition_count)

# Scrape data and create a DataFrame with reviews and additional information
reviews_data = scrape_steam_data(n_apps=n_apps, n_reviews=n_reviews)
# Concatenate the result DataFrame with the reviews DataFrame
steam_df = pd.concat([result_dataframe, create_dataframe(reviews_data)], axis=1)
# Get the game names based on the app IDs and add a new 'game_name' column
steam_df['game_name'] = steam_df['game_id'].apply(get_game_name)

# Reorder columns to make 'game_name' the first column
steam_df = steam_df[['game_name'] + [col for col in steam_df.columns if col != 'game_name']]

In [341]:
# Display the DataFrame
steam_df.head()

Unnamed: 0,game_name,game_id,review,voted_up,steamid,num_games_owned,num_reviews,timestamp,playtime_formatted
0,Counter-Strike 2,730,false game ban simulator,False,76561198991688065,1237,6,2019-11-04 10:17:10,6286h 33m
1,Counter-Strike 2,730,"After 8 years playing it, I didn't improve my ...",True,76561198110513339,3242,7,2020-12-10 22:07:43,2625h 59m
2,Counter-Strike 2,730,>see a guy\n>shoot him\n>miss every shot\n>he ...,True,76561199484699870,0,9,2023-03-15 06:31:58,8h 20m
3,Counter-Strike 2,730,This community is so nice i got a lot of tips ...,True,76561198158022300,265,36,2022-06-28 12:38:58,20h 54m
4,Counter-Strike 2,730,Your team in every random competitive game:\n\...,True,76561198388416030,1,1,2023-01-22 13:01:52,8092h 32m


In steam_df DataFrame, each row represents a Steam game review, and the columns capture various pieces of information related to each review. Here's a description of each column:

__game_name__: The name of the game to which the review belongs.

__review__: The text content of the user's review for the game.

__voted_up__: A boolean indicating whether the user voted up (True) or down (False) for the review. It represents the user's recommendation

__steamid__: The Steam ID of the user who wrote the review.

__num_games_owned__: The number of games owned by the user who left the review.

__num_reviews__: The total number of reviews written by the user.

__timestamp__: The timestamp when the review was created.

__playtime_formatted__: The playtime of the user for the associated game, represented in a formatted string (hours and minutes).

These columns provide a comprehensive set of information about each Steam game review, including details about the game, the reviewer, playtime, sentiment, and the actual review text. We can now use these columns to analyze user sentiments, explore user behavior, and gain insights into the popularity and reception of different games on the Steam platform.

# Analysis

## Sentiment Analysis and Text Classification

In this section, we perform sentiment analysis and text classification on Steam game reviews.
The dataset is represented as a DataFrame where each row corresponds to a review.

Sentiment Analysis using NLTK:
We use the NLTK library's SentimentIntensityAnalyzer to perform sentiment analysis on the reviews. The SentimentIntensityAnalyzer assigns a compound score to each review, and based on this score, we classify reviews into positive, negative, or neutral sentiments.

In [342]:
# Import necessary libraries for sentiment analysis and machine learning
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from nltk.sentiment import SentimentIntensityAnalyzer

# Download necessary resources for NLTK (VADER lexicon for sentiment analysis)
nltk.download('vader_lexicon')

# Function to perform sentiment analysis using NLTK's SentimentIntensityAnalyzer
# Parameters:
#   - review: Text of the review to be analyzed
# Returns a sentiment label ('positive', 'negative', or 'neutral') based on compound polarity score
def analyze_sentiment(review):
    # Create a SentimentIntensityAnalyzer instance from NLTK
    analyzer = SentimentIntensityAnalyzer()
    
    # Get the compound polarity score for the review
    compound_score = analyzer.polarity_scores(review)['compound']
    
    # Classify the sentiment based on the compound score
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Perform sentiment analysis using NLTK and add a 'sentiment' column to the DataFrame
steam_df['sentiment'] = steam_df['review'].apply(analyze_sentiment)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\sayfi\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [343]:
# Perform Sentiment Analysis using NLTK
steam_df['sentiment'] = steam_df['review'].apply(analyze_sentiment)

steam_df # sentiment column: The sentiment of the review, determined using sentiment analysis. It can be 'positive', 'negative', or 'neutral'.

Unnamed: 0,game_name,game_id,review,voted_up,steamid,num_games_owned,num_reviews,timestamp,playtime_formatted,sentiment
0,Counter-Strike 2,730,false game ban simulator,False,76561198991688065,1237,6,2019-11-04 10:17:10,6286h 33m,negative
1,Counter-Strike 2,730,"After 8 years playing it, I didn't improve my ...",True,76561198110513339,3242,7,2020-12-10 22:07:43,2625h 59m,positive
2,Counter-Strike 2,730,>see a guy\n>shoot him\n>miss every shot\n>he ...,True,76561199484699870,0,9,2023-03-15 06:31:58,8h 20m,neutral
3,Counter-Strike 2,730,This community is so nice i got a lot of tips ...,True,76561198158022300,265,36,2022-06-28 12:38:58,20h 54m,negative
4,Counter-Strike 2,730,Your team in every random competitive game:\n\...,True,76561198388416030,1,1,2023-01-22 13:01:52,8092h 32m,positive
...,...,...,...,...,...,...,...,...,...,...
95,Euro Truck Simulator 2,227300,Wish I wrote this review years ago. But being ...,True,76561198282098219,287,7,2019-11-27 14:52:29,2659h 3m,positive
96,Euro Truck Simulator 2,227300,Upon a winter's night quite cold\nI loaded up ...,True,76561198028681413,0,15,2014-02-14 06:14:11,25h 53m,positive
97,Euro Truck Simulator 2,227300,Turns out driving a truck through the night un...,True,76561198296475045,526,11,2021-12-14 19:30:24,99h 33m,positive
98,Euro Truck Simulator 2,227300,Much like my sentiments of Farming Simulator 2...,True,76561198038604011,914,232,2014-11-11 01:20:28,62h 0m,positive


In [344]:
# Convert 'predicted_sentiment' to boolean for comparison with 'voted_up'
steam_df['predicted_sentiment'] = steam_df['sentiment'].astype(bool)

# Evaluate the accuracy of sentiment analysis
accuracy = accuracy_score(steam_df['voted_up'], steam_df['predicted_sentiment'])
print(f"Accuracy of Sentiment Analysis: {accuracy:.2f}")

Accuracy of Sentiment Analysis: 0.82


In [345]:
steam_df['voted_up'].value_counts()

voted_up
True     82
False    18
Name: count, dtype: int64

An accuracy of 0.82 means that the sentiment analysis model correctly predicted the sentiment (positive or negative) of the reviews approximately 82% of the time when compared to the actual labels in the 'voted_up' column. A higher accuracy indicates better performance. An accuracy of 0.82 suggests that the sentiment analysis model is reasonably effective in distinguishing between positive and negative sentiments in the reviews. While accuracy is a common metric, it might not tell the whole story, especially in imbalanced datasets like the one we have (only 18% False voted up values).

It's often useful to look at other metrics like precision, recall, and F1-score, especially if the classes are imbalanced. The classification_report output provides more detailed information on these metrics for each sentiment class to understand the model's performance on positive and negative sentiments individually, especially if the dataset is imbalanced.

So let's look at the classification algortihm.

## Text Classification using Naive Bayes Classifier:
For text classification, we split the dataset into training and testing sets. We vectorize the text data using the CountVectorizer, and then train a Naive Bayes classifier. The trained classifier is used to predict the sentiment of reviews in the testing set. We evaluate the model's accuracy and present a classification report.

In [346]:
# Text Classification using Naive Bayes Classifier
def classify_sentiment(data):
    X_train, X_test, y_train, y_test = train_test_split(data['review'], data['voted_up'], test_size=0.2, random_state=42)

    # Vectorize the text data
    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)
    X_test_vectorized = vectorizer.transform(X_test)

    # Train a Naive Bayes classifier
    classifier = MultinomialNB()
    classifier.fit(X_train_vectorized, y_train)

    # Make predictions
    predictions = classifier.predict(X_test_vectorized)

    # Evaluate the model
    accuracy = accuracy_score(y_test, predictions)
    print(f"Accuracy: {accuracy:.2f}")
    print("\nClassification Report:\n", classification_report(y_test, predictions))

In [347]:
# Perform Text Classification using Naive Bayes Classifier
classify_sentiment(steam_df)

Accuracy: 0.80

Classification Report:
               precision    recall  f1-score   support

       False       1.00      0.33      0.50         6
        True       0.78      1.00      0.88        14

    accuracy                           0.80        20
   macro avg       0.89      0.67      0.69        20
weighted avg       0.84      0.80      0.76        20



The results from the text classification using the Naive Bayes classifier are quite promising. Let's interpret the key metrics from the classification report:

__Accuracy (0.8)__: This indicates that the Naive Bayes classifier correctly predicted the sentiment (True or False) 80% of the time on the test set.

__Precision__: Precision is the ratio of correctly predicted positive observations to the total predicted positives. In our case, precision for the 'True' class is 0.78, which means that when the model predicts a positive sentiment, it is correct about 78% of the time.

__Recall__: Recall, or sensitivity, is the ratio of correctly predicted positive observations to the all observations in actual class. For the 'True' class, recall is 1.00, indicating that the model is capturing all the positive instances in the dataset.

__F1-score__: The F1-score is the weighted average of precision and recall. It's a useful metric when dealing with imbalanced classes. For the 'True' class, the F1-score is 0.88, which is a strong balance between precision and recall.

__Support__: The number of actual occurrences of the class in the specified dataset. It can help to understand how well thr model performs on each class.

__Macro Avg and Weighted Avg__: These are the averages for precision, recall, and F1-score. Macro avg gives equal weight to all classes, while weighted avg considers the number of instances in each class.

In summary, an accuracy of 0.8 and high precision, recall, and F1-score for the 'True' class indicate that the Naive Bayes classifier is performing well in classifying positive sentiments. However, considering the support values, it's important to note that the dataset is relatively small, and performance metrics might vary with larger datasets.

# Summary:

__Data Collection__:

- Fetched Steam game reviews using the Steam API, focusing on popular games.
- Extracted relevant information such as reviews, votes, playtime, and user details.

__Data Preprocessing__:

- Cleaned and organized the data for analysis.
- Obtained a DataFrame with columns including review text, user details, voting information, and more.

__Sentiment Analysis__:

- Applied sentiment analysis using NLTK's SentimentIntensityAnalyzer.
- Achieved an accuracy of approximately 82%, classifying reviews into positive or negative sentiments.

__Text Classification (Naive Bayes)__:

- Performed text classification using a Naive Bayes classifier.
- Achieved an accuracy of 80%, with high precision, recall, and F1-score for positive sentiments.

# Conclusion:

__Sentiment Analysis Insights__:

- The sentiment analysis results provide a quick overview of the overall sentiments expressed in the reviews.
- A moderate accuracy of 82% indicates reasonable success in classifying positive and negative sentiments.

__Text Classification Insights__:

- The Naive Bayes classifier excelled in accurately predicting positive sentiments, achieving an accuracy of 80%.
- Precision, recall, and F1-score metrics indicate robust performance, particularly for the positive sentiment class.

__Considerations and Next Steps__:

- The project offers valuable insights into the sentiments associated with popular Steam games.
- Future enhancements could include exploring more sophisticated sentiment analysis models and classifiers, especially for larger datasets.
- Additionally, the project can be extended to include more advanced natural language processing (NLP) techniques, such as topic modeling or deep learning models, for deeper insights.

__Application__:

- The sentiment analysis and text classification models can be utilized to automatically categorize and understand user sentiments in Steam game reviews.
- Developers and stakeholders can use these insights to gauge user satisfaction, identify areas for improvement, and make informed decisions about game development and marketing strategies.

In conclusion, the project successfully leverages sentiment analysis and text classification to gain valuable insights into user sentiments in Steam game reviews. The high accuracy achieved by the Naive Bayes classifier demonstrates the effectiveness of the approach. This analysis can be a valuable tool for game developers and the gaming community to better understand player experiences and preferences.