# Data Cleaning

Our primary data source is Steam, one of the most popular digital distribution platforms for PC games. We used two APIs to gather game information: SteamSpy (https://steamspy.com/) and the Steam Web API (https://api.steampowered.com). SteamSpy provided data such as the number of owners, playtime, and user reviews, while the Steam API provided each game's price, release date, and genre.

We also used the Steam User, Item Data, and Meta-data dataset from https://cseweb.ucsd.edu/~jmcauley/datasets.html#steam_data. This dataset includes information on user behavior, such as game ownership, playtime, and review sentiment, as well as game metadata, such as genre, developer, and publisher.

By combining these sources of data, we were able to create a comprehensive dataset that includes information on thousands of games, as well as user behavior and preferences. This dataset serves as the foundation for our machine learning models, which will be used to make personalized game recommendations to users.

## Cleaning Items per User

In [2]:
import pandas as pd
import numpy as np
import ast
import json

In [1]:
with open('Data/australian_users_items.json') as f:  # forward slash works on Windows, macOS, and Linux
    lines = f.readlines()

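The raw file "australian_users_items.json" does not appear to be strict JSON: each line holds one user record written as a Python dictionary literal with single-quoted strings. We therefore parse every line with ast.literal_eval and serialize the resulting list of records to a JSON-formatted string, assigned to the variable stringConvert.

In [None]:
# Assumes one Python-literal dict per line; json.loads alone would reject the single quotes
stringConvert = json.dumps([ast.literal_eval(line) for line in lines])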

Parse the JSON string back into Python objects

In [None]:
JSONConvert = json.loads(stringConvert)

Save the JSON file

In [None]:
with open('Data/data.json', 'w') as json_file:  # same path that the preprocessing step reads from
    json.dump(JSONConvert, json_file)

Convert the parsed records to a pandas DataFrame

In [None]:
df = pd.DataFrame(JSONConvert)
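
As a small illustration of the parsing step, the sketch below runs on two made-up lines rather than the real file. It assumes the raw file stores one Python dictionary literal per line; the sample values and the names sample_lines, records, and sample_df are hypothetical.

```python
import ast
import pandas as pd

# Two made-up lines in the assumed on-disk format: one Python dict literal per line.
sample_lines = [
    "{'user_id': 'u1', 'items_count': 2, 'steam_id': '111', "
    "'items': [{'item_id': '10', 'item_name': 'Counter-Strike', 'playtime_forever': 6}, "
    "{'item_id': '20', 'item_name': 'Team Fortress Classic', 'playtime_forever': 0}]}\n",
    "{'user_id': 'u2', 'items_count': 0, 'steam_id': '222', 'items': []}\n",
]

# literal_eval only evaluates Python literals, so it does not execute arbitrary code.
records = [ast.literal_eval(line) for line in sample_lines if line.strip()]

# Each record becomes one row; the nested 'items' list stays in a single cell.
sample_df = pd.DataFrame(records)
print(sample_df[['user_id', 'items_count']])
```

Skipping blank lines with `line.strip()` guards against a trailing empty line at the end of the file.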

# Data Preprocessing

In this section, we will describe the data-preprocessing steps we took to create a user-item interactions DataFrame.

After cleaning the data, we extracted user IDs and game IDs from the dataset. We then created a user-item interactions DataFrame, with each row representing a particular user-item relationship. Specifically, we included all the games that each user owned, along with the corresponding playtime and review information.

To create this user-item interactions DataFrame, we first filtered the data to include only the games that were owned by at least one user. We then extracted the unique user IDs and game IDs from this filtered data.

Next, we created an empty DataFrame with columns for user ID, game ID, playtime, and review information. We then iterated over each user-game combination and, whenever the user owned the game, added a row containing the user ID, game ID, and the corresponding playtime and review information.

This resulted in a user-item interactions DataFrame that contains all the games that were owned by at least one user, along with the corresponding playtime and review information. This DataFrame serves as the input to our machine learning models, which will use this data to make personalized game recommendations to users based on their preferences and behavior.
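
The row-by-row construction described above can be sketched as follows. This is a minimal illustration on made-up records, not the exact code we ran: the field names user_id, item_id, and playtime_forever are assumptions about the raw records, and review information is left out because its layout is not shown in this section.

```python
import pandas as pd

# Hypothetical parsed records; field names are assumptions about the raw data.
users = [
    {'user_id': 'u1', 'items': [{'item_id': '10', 'playtime_forever': 6},
                                {'item_id': '20', 'playtime_forever': 0}]},
    {'user_id': 'u2', 'items': [{'item_id': '10', 'playtime_forever': 42}]},
    {'user_id': 'u3', 'items': []},  # owns nothing, so contributes no rows
]

# Collect one dict per owned game, then build the DataFrame once at the end.
rows = []
for user in users:
    for item in user['items']:
        rows.append({'user_id': user['user_id'],
                     'item_id': item['item_id'],
                     'playtime': item['playtime_forever']})

interactions = pd.DataFrame(rows, columns=['user_id', 'item_id', 'playtime'])
print(interactions)
```

Looping over each user's own item list visits only the owned pairs, which avoids checking every possible user-game combination. Appending to a plain list and constructing the DataFrame once is also much faster than growing a DataFrame row by row.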

## Data Loading

### Steam games data loading

Here we load the game data collected by our API-call program.

In [None]:
df = pd.read_csv('Data/all_games.csv')

### Items per User data loading

Now we load the data that was cleaned in the Data Cleaning part.

In [None]:
itemspuser = pd.read_json('Data/data.json')

## Feature Extraction

We create a new feature in the itemspuser DataFrame called item_ids. For each row, a list comprehension extracts the item_id from every element of the items list, and the resulting list of item IDs is assigned to the new column.

In [None]:
itemspuser['item_ids'] = [[item['item_id'] for item in items] for items in itemspuser['items']]

To make working with user IDs easier, we replaced the unique user steam_id with a new uid counter that starts at 0 and increases by 1 for each user. We also kept only the columns needed to build a user-item interactions matrix: uid and item_ids. This simplified table serves as the basis for our machine learning models to make personalized game recommendations.

In [None]:
itemspuser['uid'] = np.arange(len(itemspuser))

itemspuser = itemspuser[['uid', 'item_ids']]  # the column created above is 'item_ids', not 'item_id'

We used the pandas explode function to split the lists in the item_ids column into separate rows. In the resulting DataFrame, each row represents a single user-item interaction with one uid and one item ID. This step was necessary to prepare the data for training and testing our machine learning models.

In [None]:
itemspuser = itemspuser.explode('item_ids').reset_index(drop=True)
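
To make the effect of these three steps concrete, here is a minimal sketch on a toy frame. The data and the names toy and exploded are made up for illustration; the column names mirror the ones used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'steam_id': ['111', '222', '333'],
    'items': [
        [{'item_id': '10'}, {'item_id': '20'}],
        [{'item_id': '10'}],
        [],  # a user with an empty library
    ],
})

# Same steps as above: extract the IDs, assign a counter uid, keep two columns.
toy['item_ids'] = [[item['item_id'] for item in items] for items in toy['items']]
toy['uid'] = np.arange(len(toy))
toy = toy[['uid', 'item_ids']]

# explode turns each list element into its own row; an empty list becomes NaN.
exploded = toy.explode('item_ids').reset_index(drop=True)
print(exploded)
```

Note that the user with an empty library still produces one row, with a missing item ID. Such rows need to be dropped before the IDs can be converted to integers.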

To simplify our machine learning models, we added a binary column called "owned" to the user-item interactions DataFrame. Its value is 1 in every row, because each row represents a game that the user owns. We are only concerned with whether or not a user owns a game, not with the ratings or reviews they may have given it, so this single column lets us filter the data efficiently and focus on owned games in our models.

In [None]:
itemspuser['owned'] = 1
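
One way the "owned" column can later be turned into a user-item matrix is sketched below. This is an assumption about downstream use, not a step taken in this notebook; the toy data and the names pairs and matrix are hypothetical.

```python
import pandas as pd

# Toy long-format interactions: one row per (user, game) pair that exists.
pairs = pd.DataFrame({'uid': [0, 0, 1], 'item_id': [10, 20, 10]})
pairs['owned'] = 1

# Pivot to a wide matrix; pairs that never occur are filled with 0 (not owned).
matrix = pairs.pivot_table(index='uid', columns='item_id',
                           values='owned', aggfunc='max', fill_value=0)
print(matrix)
```

Because only owned pairs are stored in the long format, the zeros appear only at this stage. For a dataset with many users and games, a sparse matrix would likely be a better fit than a dense pivot.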

To extract game information such as genre for our machine learning models, we kept only the user-item relationships whose game is present in the games DataFrame "df". So that the two DataFrames could be merged on the game ID, we converted the game ID column in the user-item interactions DataFrame to an integer type and gave the game ID column in both DataFrames the same name, id.

In [None]:
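# Drop the NaN rows that explode leaves for empty libraries, then align the ID column with df
itemspuser = itemspuser.dropna(subset=['item_ids']).astype({'item_ids': int}).rename(columns={'item_ids': 'id'})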

In [None]:
df.rename(columns={'appid': 'id'}, inplace=True)

In [None]:
main = pd.merge(itemspuser, df, on='id')
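
The sketch below shows, on made-up data, why the type and name alignment matters and how many interactions the default inner merge drops. The frames interactions and games and the ID 999 are hypothetical; indicator=True is a standard pd.merge option used here only for inspection.

```python
import pandas as pd

interactions = pd.DataFrame({'uid': [0, 0, 1],
                             'item_id': ['10', '20', '999'],  # IDs arrive as strings
                             'owned': 1})
games = pd.DataFrame({'appid': [10, 20, 30],
                      'game_name': ['Counter-Strike', 'Team Fortress Classic', 'Day of Defeat']})

# Align dtype and column name so both frames share an integer 'id' key.
interactions = interactions.astype({'item_id': int}).rename(columns={'item_id': 'id'})
games = games.rename(columns={'appid': 'id'})

# A left merge with indicator=True reveals interactions that have no matching game.
checked = pd.merge(interactions, games, on='id', how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(f"{len(unmatched)} interaction(s) have no game metadata")

# The default inner merge keeps only interactions whose game exists in both frames.
main_toy = pd.merge(interactions, games, on='id')
print(main_toy)
```

Checking the unmatched count before the inner merge makes it clear how much data is lost when the games table does not cover every item ID.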

### Further Data Cleaning

#### Handling Missing values

In [None]:
main = main.dropna(axis=0, subset=['game_name'])
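
Before dropping rows, it can help to see how many values are actually missing. A minimal sketch on made-up data follows; toy_main, missing_per_column, and cleaned are hypothetical names.

```python
import numpy as np
import pandas as pd

toy_main = pd.DataFrame({
    'uid': [0, 0, 1],
    'id': [10, 20, 30],
    'game_name': ['Counter-Strike', np.nan, 'Day of Defeat'],
})

# Count missing values per column to see what dropna will remove.
missing_per_column = toy_main.isna().sum()
print(missing_per_column)

# Drop only the rows whose game_name is missing; other columns are not checked.
cleaned = toy_main.dropna(axis=0, subset=['game_name'])
print(len(toy_main) - len(cleaned), "row(s) dropped")
```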

### Save Data

In [None]:
main.to_csv('Data/main.csv', index=False)

In [None]:
dataForRec = main[['uid','id','owned']]

In [None]:
dataForRec.to_csv('recdata.csv', index=False)  # omit the row index, as with main.csv