# The Data

```{figure} https://cdn.akamai.steamstatic.com/steam/apps/1938090/header.jpg?t=1668017465
---
align: center
---
```

Review data were collected for the title ['Call of Duty: Modern Warfare 2'](https://store.steampowered.com/app/1938090/Call_of_Duty_Modern_Warfare_II/), published by Activision.

At the time of access (2022-11-11), this title held a 'Mixed' review score based on 68,626 user reviews.

The Steam store uses a binary review classification system in which users either 'recommend' or 'do not recommend' a title. Many titles have severely skewed review classifications, which would produce an extremely unbalanced sample. The 'Mixed' score of this title indicates a more even split between the two classes.
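The balance of a binary label can be checked with a normalized frequency count. The following is a minimal sketch on made-up data: the `toy` frame and its values are illustrative assumptions, not the scraped reviews.

```python
import pandas as pd

# Toy labels (hypothetical, not the scraped data)
toy = pd.DataFrame(
    {"classification": ["Positive", "Negative", "Positive", "Positive",
                        "Negative", "Negative", "Positive", "Negative"]}
)

# Share of each class; values near 0.5 indicate a balanced sample
balance = toy["classification"].value_counts(normalize=True)
print(balance)

# Ratio of the larger class to the smaller one (1.0 = perfectly balanced)
imbalance_ratio = balance.max() / balance.min()
print(imbalance_ratio)
```

A ratio far above 1 would suggest resampling or class weighting before modelling; a ratio close to 1 means neither is likely to be needed.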

Reviews were scraped from the Steam store using the `steamreviews` package for Python {cite}`wok_2018`.

In [1]:
# api access
import steamreviews

# set parameters
request_params = dict()
request_params['language'] = 'english'
request_params['purchase_type'] = 'all'
app_id = 1938090

# store results as dictionary
review_dict, query_count = steamreviews.download_reviews_for_app_id(
    app_id, chosen_request_params=request_params
)


[appID = 1938090] expected #reviews = 43294
Number of queries 150 reached. Cooldown: 310 seconds
Number of queries 150 reached. Cooldown: 310 seconds


All available English-language reviews were scraped, forming an initial sample of 43,294 observations.
Three features were extracted:

- Review date
- Review text
- Review classification

An additional feature, `review_length`, the number of words in the review text, was calculated and added to the set.

In [39]:
import pandas as pd

review_id = [x for x in review_dict['reviews']]
date = [review_dict['reviews'][x]['timestamp_created'] for x in review_id]
review_text = [review_dict['reviews'][x]['review'] for x in review_id]
# 'voted_up' is the boolean recommendation flag (True = recommended)
voted_up = [review_dict['reviews'][x]['voted_up'] for x in review_id]


df = pd.DataFrame(list(zip(date,review_text,voted_up)),
                 columns=['date','review_text','classification'])

# calculate review text length, set as feature
df['review_length'] = df['review_text'].str.split().str.len().fillna(0)

df.head(10)

|   | date | review_text | classification | review_length |
|---|------|-------------|----------------|---------------|
| 0 | 1668190378 | lamo | True | 1 |
| 1 | 1668190323 | good | True | 1 |
| 2 | 1668190314 | Jimmy is playing. Jimmy likes it. | True | 6 |
| 3 | 1668190243 | The main thing I liked was the campaign mode f... | True | 23 |
| 4 | 1668190182 | This MW2 is a joke compared to the old school ... | False | 15 |
| 5 | 1668190101 | full of cheaters ugly looking graphics broken ... | False | 12 |
| 6 | 1668190059 | I have so far only played the campaign and I'v... | False | 42 |
| 7 | 1668190034 | it is the best cod i have played | True | 8 |
| 8 | 1668189695 | it sucks because it encourages violence and sp... | False | 9 |
| 9 | 1668189658 | Activision has still kept the formula for COD ... | True | 406 |


## Initial Clean-up

Prior to conducting any exploratory analysis, some basic cleaning was performed:

1. Replace boolean values for the `classification` variable with strings ('Positive', 'Negative')
2. Convert the Unix timestamp in `date` to a datetime (YYYY-MM-DD)

The resulting data frame is saved as a CSV file for use in subsequent stages of the analysis.

In [40]:
import numpy as np
from datetime import datetime

# replace boolean values with strings
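# (assign the result back; in-place replace on a selected column is unreliable in recent pandas)
df['classification'] = df['classification'].map({True: 'Positive', False: 'Negative'})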

# convert unix time stamp to datetime64
df['date'] = pd.to_datetime(df['date'], unit='s').dt.normalize()

df.to_csv('data/processed_review_data.csv',index=False)

df.head(10)

|   | date | review_text | classification | review_length |
|---|------|-------------|----------------|---------------|
| 0 | 2022-11-11 | lamo | Positive | 1 |
| 1 | 2022-11-11 | good | Positive | 1 |
| 2 | 2022-11-11 | Jimmy is playing. Jimmy likes it. | Positive | 6 |
| 3 | 2022-11-11 | The main thing I liked was the campaign mode f... | Positive | 23 |
| 4 | 2022-11-11 | This MW2 is a joke compared to the old school ... | Negative | 15 |
| 5 | 2022-11-11 | full of cheaters ugly looking graphics broken ... | Negative | 12 |
| 6 | 2022-11-11 | I have so far only played the campaign and I'v... | Negative | 42 |
| 7 | 2022-11-11 | it is the best cod i have played | Positive | 8 |
| 8 | 2022-11-11 | it sucks because it encourages violence and sp... | Negative | 9 |
| 9 | 2022-11-11 | Activision has still kept the formula for COD ... | Positive | 406 |
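
Because a CSV file stores no type information, the `date` column is read back as plain strings unless it is parsed on load. The following is a minimal sketch of reloading the data in a later stage. It assumes the column layout shown in the preview above and uses an in-memory stand-in (`io.StringIO`, with two sample rows copied from that preview) in place of the real file path, only to keep the example self-contained.

```python
import io
import pandas as pd

# Stand-in for 'data/processed_review_data.csv' (two rows from the preview)
csv_text = (
    "date,review_text,classification,review_length\n"
    "2022-11-11,good,Positive,1\n"
    "2022-11-11,it is the best cod i have played,Positive,8\n"
)

# parse_dates restores the datetime64 dtype of the date column
reloaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(reloaded.dtypes)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path passed to `to_csv`.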
