<a href="https://colab.research.google.com/github/susandong/w266_final_project_game_sentiment/blob/master/w266_Final_Project_Game_Review_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project: Game Review Sentiment Analysis Over Time
## Research Question:
* Can sentiment analysis scores from game reviews predict a video game's active user base over time?

## Dataset:
* Game reviews: Twitter/Reddit/Discord/Steam reviews
* Active user base: Steam

## Algorithm:
* Baseline (logistic regression)
* Contextual embedding models: ELMo (LSTM-based) / BERT (Transformer-based)


In [7]:
#Load libraries
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [33]:
""" Download data: review data is available for 4 games
Fall Guys (fg)
PlayerUnknown's Battlegrounds (pubg)
Dota 2 (dota2)
Counter-Strike: Global Offensive (csgo)

Review data has the following columns:
app: ID for the game
useful: how many users voted the review as useful
funny: how many users voted the review as funny
username: username of the person who wrote the review
games_owned: how many games the reviewer owns on Steam
num_reviews: how many reviews the reviewer has written on Steam
recommend: 1 for recommend (thumbs up), -1 for do not recommend (thumbs down)
hours_played: number of hours the reviewer played before writing the review
date: date review was written
text: text of the review
"""
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' is its replacement
fg_url = 'https://raw.githubusercontent.com/susandong/w266_final_project_game_sentiment/master/data/fallguys_reviews.csv'
fg_df = pd.read_csv(fg_url, on_bad_lines='skip')
csgo_url = 'https://raw.githubusercontent.com/susandong/w266_final_project_game_sentiment/master/data/csgo_reviews.csv'
csgo_df = pd.read_csv(csgo_url, on_bad_lines='skip')
#dota2_url = 'https://raw.githubusercontent.com/susandong/w266_final_project_game_sentiment/master/data/dota2_reviews.csv'
#dota2_df = pd.read_csv(dota2_url, on_bad_lines='skip')
pubg_url = 'https://raw.githubusercontent.com/susandong/w266_final_project_game_sentiment/master/data/pubg_reviews.csv'
pubg_df = pd.read_csv(pubg_url, on_bad_lines='skip')
#player_url = 'https://raw.githubusercontent.com/susandong/w266_final_project_game_sentiment/master/data/PlayerCountData.csv'
#player_df = pd.read_csv(player_url, on_bad_lines='skip')
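
The `recommend` and `date` columns described above are the natural inputs for a sentiment-over-time series. The following is a minimal sketch, not the project's actual pipeline. It assumes dates follow the `11 October, 2020` format seen in the review data, and it uses a hypothetical `toy_df` in place of the real review files. The names `label` and `monthly_positive_share` are illustrative only.

```python
import pandas as pd

# Hypothetical toy rows mimicking two columns of the review schema
toy_df = pd.DataFrame({
    "recommend": [1, -1, 1, 1],
    "date": ["11 October, 2020", "11 October, 2020",
             "12 October, 2020", "10 November, 2020"],
})

# Parse "day month, year" strings into timestamps
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%d %B, %Y")

# Map thumbs up / thumbs down (1 / -1) to a binary label (1 / 0)
toy_df["label"] = toy_df["recommend"].map({1: 1, -1: 0})

# Share of positive reviews per calendar month
monthly_positive_share = (
    toy_df.groupby(toy_df["date"].dt.to_period("M"))["label"].mean()
)
print(monthly_positive_share)
```

A monthly series like this could later be lined up against monthly player counts, assuming the player count data uses a comparable time index.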

In [63]:
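# Data Preprocessing
#!pip install unidecode
import unidecode
import re

#Convert accented characters to their closest ASCII equivalents
def remove_accents(text):
  #Leave non-string values (e.g. NaN for missing reviews) unchanged
  if not isinstance(text, str):
    return text
  return unidecode.unidecode(text)

#Remove standalone numbers (punctuation is kept)
def remove_nonletters(text):
  #Leave non-string values (e.g. NaN for missing reviews) unchanged
  if not isinstance(text, str):
    return text

  #Remove digits AND punctuation
  #text = re.sub('[^a-zA-Z]', ' ', text)

  #Remove just digits that are by themselves (raw string avoids invalid escape warnings)
  return re.sub(r'^\d+\s|\s\d+\s|\s\d+$', ' ', text)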

#Process text from dataframe in place. df = dataframe to clean, text = name of column with text
def process_text(df, text):
  #Create new column for cleaned text
  df['cleaned'] = df[text]

  #Lower case all text
  df['cleaned'] = df['cleaned'].str.lower()

  #Clean URLs (regex=True is required in pandas 2.0+, where str.replace defaults to literal matching)
  df['cleaned'] = df['cleaned'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)

  #Remove accents from text
  df['cleaned'] = df['cleaned'].apply(remove_accents)

  #Remove standalone numbers from text
  df['cleaned'] = df['cleaned'].apply(remove_nonletters)

  #Tokenize
  #df['cleaned'] = df['cleaned'].apply()
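
The standalone-digit pattern used in `remove_nonletters` consumes the whitespace on both sides of each number, so a number that directly follows another removed number can be missed. The sketch below reproduces that pattern and compares it with a lookaround-based alternative. Both function names are hypothetical, and the second function is a suggested variant, not what the notebook runs.

```python
import re

def remove_standalone_numbers(text):
    """Same pattern as the notebook: each match consumes the surrounding whitespace."""
    return re.sub(r'^\d+\s|\s\d+\s|\s\d+$', ' ', text)

def remove_standalone_numbers_lookaround(text):
    """Hypothetical variant: lookarounds test for whitespace without consuming it."""
    text = re.sub(r'(?<!\S)\d+(?!\S)', ' ', text)
    # Collapse the runs of spaces left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

single = remove_standalone_numbers("played fall guys for 100 hours now")
adjacent = remove_standalone_numbers("rounds 1 2 3 won")
adjacent_fixed = remove_standalone_numbers_lookaround("rounds 1 2 3 won")
mixed = remove_standalone_numbers_lookaround("csgo 2 beats dota2")

print(single)          # played fall guys for hours now
print(adjacent)        # rounds 2 won
print(adjacent_fixed)  # rounds won
print(mixed)           # csgo beats dota2
```

Digits attached to letters, such as `dota2`, are left alone by both versions, which matters for game names in this dataset.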

  




In [65]:
process_text(fg_df, 'text')
fg_df[:20]

Unnamed: 0,app,useful,funny,username,games_owned,num_reviews,recommend,hours_played,date,text,cleaned
0,1097150,0,0,7.65612E+16,51,16,1,17.1,"11 October, 2020",ow i fell:( thats a sad face btwincase you did...,ow i fell:( thats a sad face btwincase you did...
1,1097150,0,0,7.65612E+16,1,1,1,50.4,"11 October, 2020",yes,yes
2,1097150,0,0,7.65612E+16,64,3,-1,8.1,"11 October, 2020",This Game is not fun. If your looking for a ga...,this game is not fun. if your looking for a ga...
3,1097150,0,0,floolp,1,1,1,15.3,"11 October, 2020",Fun but VERY HARD game!this is a very fun game...,fun but very hard game!this is a very fun game...
4,1097150,0,0,7.65612E+16,6,1,1,34.9,"11 October, 2020",its fun,its fun
5,1097150,0,0,7.65612E+16,16,2,1,56.1,"11 October, 2020",sweet game,sweet game
6,1097150,0,0,7.65612E+16,39,1,1,5.9,"11 October, 2020","Good Fun Game , Nice to pick and play for a sh...","good fun game , nice to pick and play for a sh..."
7,1097150,0,0,Fnley,122,4,1,32.7,"11 October, 2020",very funn,very funn
8,1097150,0,0,7.65612E+16,3,1,1,100.9,"11 October, 2020",I have played Fall Guys for 100 hours now and ...,i have played fall guys for hours now and it's...
9,1097150,0,0,sf6133,29,4,1,145.2,"11 October, 2020",Thicc Beanz,thicc beanz


In [None]:
# Build model
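
One possible shape for the logistic regression baseline is a TF-IDF bag-of-words pipeline from scikit-learn. This is a sketch under stated assumptions, not the project's chosen model: the toy `texts` and `labels` are hypothetical stand-ins for the `cleaned` column and a 0/1 version of `recommend`, and TF-IDF features are an assumption, since the notebook does not specify a feature representation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for cleaned review text and binary labels
texts = [
    "sweet game", "its fun", "very fun game",
    "this game is not fun", "boring and broken", "not fun at all",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feeding a logistic regression classifier
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)

# Class probabilities for an unseen review, ordered as in baseline.classes_
probs = baseline.predict_proba(["fun game"])
print(baseline.classes_, probs)
```

The positive-class probability could serve as a per-review sentiment score to aggregate over time, assuming that is the intended use of the baseline.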

In [None]:
# Evaluate model
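
A minimal evaluation sketch, assuming a binary classifier with labels 1 (recommend) and 0 (do not recommend). The `y_true` and `y_pred` lists are hypothetical. Precision, recall, and F1 are reported alongside accuracy because most reviews in the sample output above are positive, and accuracy alone can look strong on skewed labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical gold labels and model predictions
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]

accuracy = accuracy_score(y_true, y_pred)

# Metrics for the positive (recommend) class only
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```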