# Mental Health in Gaming 

## Introduction
<p>In this project I explore whether there is any correlation between anxiety and gaming. I have struggled severely with anxiety in the past, and I also very much enjoy gaming, so this project is personally interesting to me and I am excited to see what the data has to say.</p>

## The Data
In this project I am using a dataset titled 'Online Gaming Anxiety Dataset.' It was posted on Kaggle by Divyansh Agrawal. <br>
**Link:** https://www.kaggle.com/datasets/divyansh22/online-gaming-anxiety-data <br>
The data consists of user-entered responses to the GAD-7 questionnaire, as well as responses to questions pertaining to gaming, e.g., how many hours a week one plays, why they play, and which gaming platform they use.
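
For readers unfamiliar with the GAD-7: it has seven items, each answered on a 0 to 3 scale, so the total score ranges from 0 to 21. The sketch below uses made-up responses (not rows from the dataset) and assumes that the dataset's `GAD_T` column is the plain sum of `GAD1` through `GAD7`; that assumption is worth checking against the real file.

```python
import pandas as pd

# Made-up GAD-7 responses for three hypothetical respondents (each item is 0-3).
toy = pd.DataFrame(
    {
        "GAD1": [0, 1, 3],
        "GAD2": [0, 2, 3],
        "GAD3": [1, 1, 2],
        "GAD4": [0, 1, 3],
        "GAD5": [0, 0, 2],
        "GAD6": [0, 2, 3],
        "GAD7": [0, 1, 2],
    }
)

gad_items = [f"GAD{i}" for i in range(1, 8)]

# Assumption: the total score is the row-wise sum of the seven items (range 0-21).
toy["GAD_T_check"] = toy[gad_items].sum(axis=1)
print(toy["GAD_T_check"].tolist())
```

On the real data, the same sum could be compared against `df['GAD_T']` to confirm how the total was built.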

In [319]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the Data

In [320]:
df = pd.read_csv('anxiety_gaming_data.csv', index_col = 'S. No.')

In [412]:
df.columns

Index(['Timestamp', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5', 'GAD6', 'GAD7',
       'GADE', 'SWL1', 'SWL2', 'SWL3', 'SWL4', 'SWL5', 'Game', 'Platform',
       'Hours', 'earnings', 'whyplay', 'League', 'highestleague', 'streams',
       'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 'SPIN6', 'SPIN7', 'SPIN8',
       'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13', 'SPIN14', 'SPIN15',
       'SPIN16', 'SPIN17', 'Narcissism', 'Gender', 'Age', 'Work', 'Degree',
       'Birthplace', 'Residence', 'Reference', 'Playstyle', 'accept', 'GAD_T',
       'SWL_T', 'SPIN_T', 'Residence_ISO3', 'Birthplace_ISO3'],
      dtype='object')

## The Plan
Looking at the different columns, I can see a few questions that I want to try to answer (in no particular order). <br>
- I am interested in seeing any correlation between a gamer's playstyle, e.g., online or singleplayer, and their total anxiety score.
- I am also interested in the correlation between a gamer's platform and their total life satisfaction score.

# Playstyle vs Anxiety Score

## Subsetting the Data and Relevant Cleaning
Because I am only concerned with the playstyle and anxiety score in this first test, I am going to subset the dataframe so that we only have to worry about those columns. <br>

In [322]:
from sklearn.neighbors import KNeighborsClassifier

In [359]:
playstyle_anxiety = df[['Playstyle', 'GAD_T']]

In [362]:
playstyle_anxiety.head()

S. No.,Playstyle,GAD_T
1,Singleplayer,1
2,Multiplayer - online - with strangers,8
3,Singleplayer,8
4,Multiplayer - online - with online acquaintanc...,0
5,Multiplayer - online - with strangers,14


In [325]:
print(playstyle_anxiety['Playstyle'].value_counts()[:4])

Multiplayer - online - with real life friends                    5564
Multiplayer - online - with strangers                            4134
Multiplayer - online - with online acquaintances or teammates    2652
Singleplayer                                                      762
Name: Playstyle, dtype: int64


In [326]:
# creating a dictionary so that we can differentiate between online and non-online gamers
dict = {'Multiplayer - online - with real life friends' : 1, 
        'Multiplayer - online - with strangers' : 1,
        'Multiplayer - online - with online acquaintances or teammates' : 1, 
        'Singleplayer' : 0}

In [363]:
# this will allow us to differentiate between online players and non-online players
playstyle_anxiety['Playstyle'] = playstyle_anxiety['Playstyle'].map(dict) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  playstyle_anxiety['Playstyle'] = playstyle_anxiety['Playstyle'].map(dict)


In [364]:
# map() has already turned every unmapped playstyle into NaN; casting to float keeps the column numeric so those rows can be dropped
playstyle_anxiety['Playstyle'] = playstyle_anxiety['Playstyle'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  playstyle_anxiety['Playstyle'] = playstyle_anxiety['Playstyle'].astype(float)


In [366]:
# dropping the rows we do not care about
# now we have a dataframe that has only playstyle (online vs singleplayer) and the total anxiety score
playstyle_anxiety = playstyle_anxiety.dropna()
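
The copy-of-a-slice warnings above (pandas' `SettingWithCopyWarning`) appear because `playstyle_anxiety` was created by slicing `df`. Here is a minimal sketch of the same cleaning steps on a few made-up rows, taking an explicit `.copy()` first so the assignments act on an independent frame. The `toy` frame and the name `playstyle_map` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up rows standing in for df[['Playstyle', 'GAD_T']]; the last playstyle is free text.
toy = pd.DataFrame(
    {
        "Playstyle": [
            "Singleplayer",
            "Multiplayer - online - with strangers",
            "Multiplayer - online - with real life friends",
            "a bit of everything",
        ],
        "GAD_T": [1, 8, 14, 5],
    }
)

# Hypothetical name; it avoids shadowing Python's built-in `dict`.
playstyle_map = {
    "Multiplayer - online - with real life friends": 1,
    "Multiplayer - online - with strangers": 1,
    "Multiplayer - online - with online acquaintances or teammates": 1,
    "Singleplayer": 0,
}

# .copy() gives an independent frame, so the assignment below does not warn.
subset = toy[["Playstyle", "GAD_T"]].copy()

# map() returns NaN for any playstyle that is not a key in the dictionary.
subset["Playstyle"] = subset["Playstyle"].map(playstyle_map)

# Drop the unmapped rows, then store the label as an integer.
subset = subset.dropna().astype({"Playstyle": int})
print(subset)
```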

## Choosing Logistic Regression Model
Because the label is binary (online vs. singleplayer), I believe a logistic regression model is a good fit for our data. <br>
We can use this model to estimate the probability that someone is an online or singleplayer gamer based on their anxiety score.
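
To make the idea concrete: a one-feature logistic regression models the probability of the positive class as a sigmoid of a linear function, `p = 1 / (1 + exp(-(b0 + b1 * x)))`. The sketch below fits scikit-learn's `LogisticRegression` on synthetic data (not the gaming dataset) and checks that the fitted `intercept_` and `coef_` reproduce `predict_proba` through that formula.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: the label is more likely to be 1 when x is small.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (rng.random(200) < 1 / (1 + np.exp(x[:, 0]))).astype(int)

model = LogisticRegression().fit(x, y)

# Probability of class 1 computed by hand from the fitted parameters.
logit = model.intercept_[0] + model.coef_[0, 0] * x[:, 0]
p_manual = 1 / (1 + np.exp(-logit))

# The same probability as reported by scikit-learn.
p_sklearn = model.predict_proba(x)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```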

In [367]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression

In [447]:
# Here I am assigning my feature and label and also making sure that they are the same size
X = playstyle_anxiety[['GAD_T']]
y = playstyle_anxiety[['Playstyle']]
print(len(X))
print(len(y))

13087
13087


In [455]:
# first we want to standardize our GAD_T data to zero mean and unit variance (this rescales the values; it does not make them normally distributed)
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X = X.reshape(-1, 1)
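
`StandardScaler` subtracts the column mean and divides by the column standard deviation. It recentres and rescales the values but leaves the shape of their distribution unchanged. A small sketch on made-up scores, showing that the transform matches the z-score formula:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up, right-skewed scores in the GAD-7 range (not taken from the dataset).
scores = np.array([[0.0], [1.0], [1.0], [2.0], [3.0], [5.0], [8.0], [14.0], [21.0]])

scaled = StandardScaler().fit_transform(scores)

# The same transform by hand: z = (x - mean) / std, using the population std (ddof=0).
manual = (scores - scores.mean(axis=0)) / scores.std(axis=0)

print(np.allclose(scaled, manual))   # the two computations agree
print(scaled.mean(), scaled.std())   # approximately 0 and 1
```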

In [468]:
# now we can split our data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 2)

In [469]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train.values.ravel())

LogisticRegression()

In [470]:
print(lr_model.coef_, lr_model.intercept_)

[[-0.1176688]] [2.79938456]


In [471]:
y_pred = lr_model.predict(X_test)

In [472]:
from sklearn.metrics import confusion_matrix
conmat = confusion_matrix(y_test, y_pred)
print(conmat)

[[   0   80]
 [   0 1229]]


In [473]:
# Print F1 score here:
from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))

0.9684791174152876


## Hmm.. There's a problem.
It seems that the model is predicting that every gamer is an online gamer. While this makes the accuracy of the model look very high (about 93.9% on the test split above, 1229 of 1309 correct), I do not believe that it is acceptable in this case. As we can see from our input data, 94.1% of gamers are online gamers, so we could have skipped the model entirely and simply guessed that everyone games online. <br>
It's possible that this indicates there is no correlation between the feature and the label. Or, considering that there are a ton of online labels and not as many singleplayer labels, maybe we just need better data. <br>
We could also try to include more than one feature.
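
The numbers behind this can be read straight off the confusion matrix printed above. The sketch below hard-codes that matrix (rows are true labels, columns are predictions, class 0 is singleplayer and class 1 is online) and compares the model's accuracy with an always-predict-online baseline. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
conmat = np.array([[0, 80],
                   [0, 1229]])

tn, fp, fn, tp = conmat.ravel()
total = conmat.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: predict "online" (class 1) for every test row.
baseline_accuracy = conmat[1].sum() / total

# Recall for the singleplayer class: the share of true singleplayer gamers that were found.
singleplayer_recall = tn / (tn + fp)

print(f"accuracy            {accuracy:.4f}")
print(f"F1 (online class)   {f1:.4f}")
print(f"baseline accuracy   {baseline_accuracy:.4f}")
print(f"singleplayer recall {singleplayer_recall:.4f}")
```

The accuracy equals the baseline and the singleplayer recall is zero, which is why the high F1 score for the online class says little about the model.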

In [474]:
playstyle_anxiety['Hours Played'] = df['Hours']

In [475]:
playstyle_anxiety = playstyle_anxiety.dropna()
playstyle_anxiety.isnull().values.any()

False

In [482]:
X = playstyle_anxiety[['GAD_T', 'Hours Played']]
y = playstyle_anxiety[['Playstyle']]
print(len(X))
print(len(y))

13087
13087


In [494]:
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

In [500]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 2)

In [501]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train.values.ravel())

LogisticRegression()

In [502]:
print(lr_model.coef_, lr_model.intercept_)

[[-0.11777425 -0.03967282]] [2.80034266]


In [503]:
y_pred = lr_model.predict(X_test)

In [504]:
conmat = confusion_matrix(y_test, y_pred)
print(conmat)

[[   0   80]
 [   0 1229]]


## Well, it may be time to move on.
It seems that adding an extra feature did nothing for the model: the confusion matrix is identical to the one from the single-feature model. I believe this is because the dataset contains so many more online gamers than singleplayer gamers.
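
One way to see this using only the coefficients printed above: the model predicts singleplayer only when the linear score `intercept + w1*z1 + w2*z2` drops below zero, where `z1` and `z2` are the standardized `GAD_T` and `Hours Played` values. The sketch below plugs in the printed numbers. Treating class 1 as online follows the mapping used earlier, and the helper `p_online` is a hypothetical name.

```python
import numpy as np

# Fitted parameters copied from the two-feature model output above.
coef = np.array([-0.11777425, -0.03967282])  # standardized GAD_T, Hours Played
intercept = 2.80034266


def p_online(z):
    """Predicted probability of class 1 (online) for standardized features z."""
    z = np.asarray(z, dtype=float)
    return 1 / (1 + np.exp(-(intercept + coef @ z)))


# An average gamer (both features at their mean, so z = 0).
print(p_online([0, 0]))

# A gamer three standard deviations above the mean on both features.
print(p_online([3, 3]))

# Standardized GAD_T needed to reach a 50/50 prediction at average hours played.
print(-intercept / coef[0])
```

With these parameters the 50/50 point sits roughly 24 standard deviations above the mean anxiety score. That is consistent with every test prediction coming out as online: the large intercept reflects the class imbalance, and the small coefficients barely move the score.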

# Gaming Platform vs Life Satisfaction