# ITCS 3162 Data Mining Project 3:

# Steam Game Player Retention Based on Game Features, Reviews, and other Characteristics

### Shan Raheim

#### **Kaggle Link for the Dataset:**
#### https://www.kaggle.com/datasets/nikdavis/steam-store-games
#### **Reference for the Linear Regression Explanation:**
##### https://aws.amazon.com/what-is/linear-regression/#:~:text=Linear%20regression%20is%20a%20data,variable%20as%20a%20linear%20equation.

### Problem Introduction

This project focuses on predicting player retention for games in the Steam catalog. Player retention here refers to the number of players who return to a game after a given period of time. Because the dataset is a few years old, it may not include some of the games released in recent years. By looking at retention for games based on factors such as genre, how many people bought the game, and average playtime, people can judge whether other or future games are likely to be good and whether players actually enjoy them. If players keep coming back, it suggests the company that made the game did well, and players can possibly expect similar results from other games with similar characteristics. If a game's quality can be predicted from these factors, people would not have to spend money and time playing it just to find out whether it is good.

### Data Introduction

The dataset I used for this project is a Kaggle dataset called "Steam Store Games". It has already been cleaned, so I do not need to clean it during pre-processing. It has 18 columns, and the number of games is shown below once the shape is calculated from the data. As stated before, it only covers games released up to 2019. The data was gathered directly from the Steam Store and SteamSpy APIs. Some of the columns include game price, average playtime, genres, game categories, and counts of positive and negative ratings. The column "english" indicates whether the game has English language support.

### What is Linear Regression and How does it Work?

Linear regression is a data analysis method that predicts the value of unknown data by using related, known data values. The model is based on a linear equation with a dependent variable and an independent variable. It works by fitting a line between two variables, x and y, where x is the independent (or explanatory) variable and y is the dependent variable (the one you are trying to predict). In data science and machine learning, linear regression is used with large datasets to model relationships in the data. It is trained on labeled data that already exists in the dataset and is then used to predict unknown values; the regression line itself is fitted during training.
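
As a minimal illustration of the line-fitting idea described above, the sketch below fits a least-squares line to a handful of made-up (x, y) points in plain Python. The numbers are hypothetical and are not taken from the Steam dataset, and the helper name `fit_line` exists only for this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that follows y = 2x + 1 exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict y for an x value that was not in the data.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)  # prints 2.0 1.0 13.0
```

With real data the points will not fall exactly on a line, so the fitted line is the one that minimizes the squared vertical distances to the points. `LinearRegression` in scikit-learn performs the same kind of fit for one or more independent variables.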

In [50]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

In [26]:
data_filepath = "../ITSC3162/steam.csv"
data = pd.read_csv(data_filepath)
df = pd.DataFrame(data)

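print(df.shape)  # (number of games, number of columns); a bare df.shape is not displayed unless it is the last line of the cell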
df.head(10)

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99
5,60,Ricochet,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Valve Anti-Ch...,Action,Action;FPS;Multiplayer,0,2758,684,175,10,5000000-10000000,3.99
6,70,Half-Life,1998-11-08,1,Valve,Valve,windows;mac;linux,0,Single-player;Multi-player;Online Multi-Player...,Action,FPS;Classic;Action,0,27755,1100,1300,83,5000000-10000000,7.19
7,80,Counter-Strike: Condition Zero,2004-03-01,1,Valve,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,Action;FPS;Multiplayer,0,12120,1439,427,43,10000000-20000000,7.19
8,130,Half-Life: Blue Shift,2001-06-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player,Action,FPS;Action;Sci-fi,0,3822,420,361,205,5000000-10000000,3.99
9,220,Half-Life 2,2004-11-16,1,Valve,Valve,windows;mac;linux,0,Single-player;Steam Achievements;Steam Trading...,Action,FPS;Action;Sci-fi,33,67902,2419,691,402,10000000-20000000,7.19


### Data Pre-Processing and Understanding

Because my dataset is already clean, no cleaning is needed during pre-processing. However, columns such as appid, english, release_date, and other columns that are not relevant will be dropped to simplify the data, since that information is probably not needed to determine player retention. The remaining columns are more useful for understanding what could impact player retention and for making predictions from the selected features. Removing irrelevant information de-clutters the data and simplifies model building while improving clarity.

In [68]:
df = df.drop(columns=['appid', 'release_date', 'english', 'required_age', 'achievements'])  # drop returns a new DataFrame, so reassign it
df

Unnamed: 0,name,developer,publisher,platforms,categories,genres,steamspy_tags,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,Counter-Strike,Valve,Valve,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,124534,3339,17612,317,10000000-20000000,7.19
1,Team Fortress Classic,Valve,Valve,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,3318,633,277,62,5000000-10000000,3.99
2,Day of Defeat,Valve,Valve,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,3416,398,187,34,5000000-10000000,3.99
3,Deathmatch Classic,Valve,Valve,windows;mac;linux,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,1273,267,258,184,5000000-10000000,3.99
4,Half-Life: Opposing Force,Gearbox Software,Valve,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,5250,288,624,415,5000000-10000000,3.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27070,Room of Pandora,SHEN JIAWEI,SHEN JIAWEI,windows,Single-player;Steam Achievements,Adventure;Casual;Indie,Adventure;Indie;Casual,3,0,0,0,0-20000,2.09
27071,Cyber Gun,Semyon Maximov,BekkerDev Studio,windows,Single-player,Action;Adventure;Indie,Action;Indie;Adventure,8,1,0,0,0-20000,1.69
27072,Super Star Blast,EntwicklerX,EntwicklerX,windows,Single-player;Multi-player;Co-op;Shared/Split ...,Action;Casual;Indie,Action;Indie;Casual,0,1,0,0,0-20000,3.99
27073,New Yankee 7: Deer Hunters,Yustas Game Studio,Alawar Entertainment,windows;mac,Single-player;Steam Cloud,Adventure;Casual;Indie,Indie;Casual;Adventure,2,0,0,0,0-20000,5.19


#### Experiment 1: Regression Model Based on Average Playtime

In [70]:
X = df[['developer', 'publisher']]
Y = df['owners']

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = .25, random_state = 42)

In [73]:
reg = LinearRegression()
reg.fit(X_train, Y_train)
print(reg.score(X_test, Y_test))

ValueError: could not convert string to float: 'Blue Tea Games'
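
The `ValueError` above occurs because `LinearRegression` only accepts numeric inputs, while `developer` and `publisher` are text columns. The target `owners` is also stored as a text range such as `5000000-10000000`. One possible way around this, sketched below, is to one-hot encode the text columns with the `OneHotEncoder` and `ColumnTransformer` classes already imported at the top, and to convert each owners range to a number. Using the midpoint of the range is an assumption made for this sketch, not something the dataset specifies, and the helper name `owners_midpoint` is hypothetical. The small `sample` frame copies a few rows from the `df.head(10)` output so the example runs without `steam.csv`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# A few rows copied from the df.head(10) output shown earlier.
sample = pd.DataFrame({
    "developer": ["Valve", "Valve", "Gearbox Software",
                  "Valve", "Gearbox Software", "Valve"],
    "publisher": ["Valve"] * 6,
    "owners": ["10000000-20000000", "5000000-10000000", "5000000-10000000",
               "5000000-10000000", "5000000-10000000", "10000000-20000000"],
})


def owners_midpoint(owner_range: str) -> float:
    """Convert a range like '5000000-10000000' to its midpoint (an assumed stand-in for the true count)."""
    low, high = owner_range.split("-")
    return (int(low) + int(high)) / 2


X = sample[["developer", "publisher"]]
y = sample["owners"].map(owners_midpoint)

# One-hot encode the text columns; handle_unknown="ignore" keeps prediction
# from failing on developers or publishers that were not seen during fitting.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["developer", "publisher"])]
)

# Chain the encoding and the regression so both are applied together.
model = make_pipeline(encode, LinearRegression())
model.fit(X, y)
predictions = model.predict(X)
```

On the full dataset the same pipeline could be fitted on `X_train` and `Y_train` from `train_test_split` (after converting `owners`) and scored on the test split. With thousands of distinct developers and publishers the encoded matrix becomes very wide, so numeric columns such as `average_playtime` or the rating counts may be more practical features.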