# Social Computing/Social Gaming - Summer 2020

# Exercise Sheet 4: Sentiment Analysis in DotA

In this exercise we will work with data gathered from the popular MOBA *Defense of the Ancients 2*, or DotA 2 for short, developed by Valve in 2013. If you are unfamiliar with this game, the following short introduction should be enough to understand what the tasks ask of you:

In Dota 2, two teams of 5 players play against each other on a single map, each team trying to destroy the enemy base, also called the ancient. To do this, they try to kill each other and earn gold and experience by killing non-player characters called creeps, which gives them an advantage over the enemy. In every match, players first choose from a pool of 117 different heroes, which are roughly divided into 2 groups: carries, who start out weak and become much stronger once they have accumulated enough gold to buy the items they need, and supports, who protect the carries in the early stages of a match but tend to become less relevant in the later stages. Every team needs a balanced hero selection to have a chance of winning: too many carries put the team at a disadvantage early on, while too many supports may cause the team to struggle to win the game even after securing an early advantage.

Psychologically speaking, DotA - or any MOBA for that matter - is an experiment in successful team formation and cooperation, as 5 strangers meet for one match with the same goal, but usually different views on how to achieve it. Its real-world equivalent would be any mash-up of people forced to work in a group, the only difference being that real-life situations usually don't involve another group working against them.

Needless to say, the nature of the game does provoke negativity at times, and we want to try to predict it. More precisely, we want to find out whether we can infer negative player behavior from modelling the state of a game as a set of values.

The files provided for you contain information from 1,500 matches played during December 2016 and are split into 5 tables:

- chat.csv: contains information about what was said in the chat between teams, when it was said and which player said it. We need the 'key', 'time' and 'slot' columns, as we are only interested in which team the players belong to, not their identities.
- match.csv: contains information about the game results. We only need the 'radiant_win' column from it, which tells us which team won.
- players.csv: detailed statistics for every player. We need the 'kills' and 'deaths' columns to identify underperforming players.
- player_time.csv: among other things, the gold accumulated by every player for every minute of a match. We use it to calculate the difference in gold earned between the two teams.
- labels.xlsx : a sample of labeled chat used for the sentiment analysis

## Task 4.1: Your first sentiment analysis

Sentiment analysis, sometimes called opinion mining, is a method for deriving information from a text that allows it to be classified as neutral, positive or negative. In this exercise it is a supervised process, meaning that you need a small set of labeled data to train your machine learning model on before you can use it on another set of unlabeled data. Without the labeled set it would be difficult for your model to know what exactly makes a statement positive or negative. Some of its many practical applications are the analysis of customer reviews, social media comments or survey responses.

**Note:** We will use a random forest classifier in this task.

Your task is to train a model using the labeled data, then use that model to predict the sentiments of the whole chat. Let us start with the basics:

### a) Preparation

Import the labels and split them into two arrays: the chat itself and the labels.

A label is like a review of a single message:  
-1 = negative  
 0 = neutral   
 1 = positive
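
As a rough illustration of the split described above, here is a minimal sketch on an invented toy table. The column names `text` and `label` and all `toy_*` variables are assumptions made for this example; with the real file, `pd.read_excel` returns a comparable DataFrame whose actual column names you need to check yourself.

```python
import pandas as pd

# Toy stand-in for the labeled spreadsheet; the column names are assumptions.
toy_labels = pd.DataFrame({
    "text": ["gg wp", "report mid", "push top", "nice ult"],
    "label": [1, -1, 0, 1],
})

# .values turns each column into a NumPy array
toy_chat = toy_labels["text"].values
toy_sentiment = toy_labels["label"].values

print(toy_chat.shape, toy_sentiment.shape)
```

Both arrays keep the row order of the table, so the i-th message still belongs to the i-th label.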

In [None]:
#import libraries needed for sentiment analysis
import numpy as np 
import pandas as pd 
import re
import nltk 
import matplotlib.pyplot as plt

#TODO: Import "labels.xlsx" and split it into 2 arrays: chat and labels.


chat_data = chat_data[:450]
chat_labels = chat_labels[:450]

### b) How to train your model

In this step you will use the chat and labels to train your random forest classifier. In order to do so, create the random forest classifier, fit it and make a prediction on the test set.

After you are done, print the accuracy score and comment on it.

**Hints:**
- When creating the classifier, use `n_estimators=200, random_state=0` as arguments.
- The test set should be 20% of the whole set.
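
The workflow described above (vectorize, split, fit, predict, score) could look roughly like the following sketch. The messages, labels and all `toy_*` names are invented for illustration; this is not the solution for the real data, only the same sequence of scikit-learn calls on a tiny example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Invented toy messages and labels (-1 negative, 0 neutral, 1 positive)
toy_texts = ["gg wp", "well played team", "nice gank", "good job",
             "report this noob", "you are trash", "worst team ever", "uninstall noob",
             "push mid now", "go top", "roshan now", "buy wards"]
toy_y = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

# Turn the raw text into TF-IDF feature vectors
toy_X = TfidfVectorizer().fit_transform(toy_texts).toarray()

# Hold out 20% of the samples for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

toy_clf = RandomForestClassifier(n_estimators=200, random_state=0)
toy_clf.fit(toy_X_train, toy_y_train)
toy_pred = toy_clf.predict(toy_X_test)

toy_acc = accuracy_score(toy_y_test, toy_pred)
print(toy_acc)
```

With only a dozen messages the accuracy says very little; the point is the order of the steps.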

In [None]:
# There are some words which do not have any valuable classification information. 
# We will use 'stopwords' to get rid of them.
import nltk
nltk.download('stopwords')

In [None]:
import nltk
import sklearn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import preprocessing

chat_data = [str(item) for item in chat_data]
vectorizer = TfidfVectorizer(max_features=2500, min_df=3, max_df=0.8, stop_words=stopwords.words('english'))
chat_data = vectorizer.fit_transform(chat_data).toarray()

# TODO: Create the random forest classifier, fit it and make a prediction on the test set.


print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

**TODO: Write your observations here:**


### c) Prediction time:

Now you can use the model to predict the sentiments for the whole chat. Import the chat and predict the labels. You will need to use `vectorizer.transform().toarray()` on your data, but **DO NOT** use `fit()` anywhere! The vectorizer and the classifier are already fitted; fitting them again would discard everything they have learned.

**Hint:** chat.csv includes ALL chatlines, including those that have been used in the previous steps and are already labeled. Don't label them again.

**Note:** The chat table is massive. Labelling all of it may take a while.
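
To see why `transform()` must be used instead of `fit()` on new data, consider this small sketch with invented messages (all `toy_*` names are assumptions for the example). A fitted vectorizer maps new text onto the vocabulary it learned during fitting, so the new matrix has the same columns the classifier was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["gg wp", "report mid", "push top now"]
toy_new = ["gg report", "completely unseen words"]

toy_vec = TfidfVectorizer()
train_matrix = toy_vec.fit_transform(toy_train).toarray()  # learns the vocabulary
new_matrix = toy_vec.transform(toy_new).toarray()          # reuses that vocabulary

# Same number of columns, so a classifier fitted on train_matrix accepts new_matrix
print(train_matrix.shape, new_matrix.shape)
```

Words that were not seen during fitting are simply ignored, which is why the second new message becomes an all-zero row.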

In [None]:
chatData = pd.read_csv("chat.csv")
unlabeled = chatData.iloc[:,1].values
# TODO:



## Task 4.2: Linear regression

Linear regression is a technique that tries to find a linear relationship between a set of input variables $X$ and a dependent variable $y$. In mathematical terms:

$$y = \alpha + \beta X + \epsilon$$

where:
- $X$ is the predictive vector, containing the (predictive) variables
- $\alpha$ and $\beta$ are the model's parameters, where $\alpha$ is the intercept/bias, $\beta$ the coefficient vector containing coefficients for each predictive variable
- and $\epsilon$ the prediction error.

Note that the assumption made is that the relationship is linear. This is a special case of polynomial regression, where we would allow for e.g. squared relationships.
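
The formula above can be made concrete with a small numerical sketch. The data is synthetic and the values of `true_alpha` and `true_beta` are invented; with $\epsilon = 0$, ordinary least squares recovers the parameters up to floating-point precision.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 2))            # 50 samples, 2 predictive variables
true_alpha = 1.5
true_beta = np.array([2.0, -0.5])
toy_y = true_alpha + toy_X @ true_beta      # epsilon = 0 in this toy case

# Prepend a column of ones so the intercept is estimated together with beta
design = np.column_stack([np.ones(len(toy_X)), toy_X])
coef, *_ = np.linalg.lstsq(design, toy_y, rcond=None)

alpha_hat, beta_hat = coef[0], coef[1:]
print(alpha_hat, beta_hat)
```

`LinearRegression` from scikit-learn solves the same least-squares problem and exposes the results as `intercept_` and `coef_`.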

Our dependent variable is the negativity in the chat. Therefore we need to convert our labels into numbers first: we will use 0 for neutral, -1 for negative and +1 for positive sentiments. This is, of course, a simplification, as not all negative statements are equally negative. But we need to acknowledge that it is simply impossible to make an accurate distinction without knowing any context. And if we knew that, there would be no point in doing this regression.

In [None]:
# Convert the labels into values
sentiments = []
for i in chat_labels:
    if i == 'positive':
        sentiments.append(1)
    elif i == 'negative':
        sentiments.append(-1)
    else:
        sentiments.append(0)

In [None]:
for i in predictions:
    if i == 'positive':
        sentiments.append(1)
    elif i == 'negative':
        sentiments.append(-1)
    elif i == 'neutral':
        sentiments.append(0)

### a) Preparation:

1. First, we need to read the csv files and group them by `match_id`.

2. We will create a dataframe containing all relevant information and each row will represent one match. An empty dataframe has already been created for you with all the columns you need to fill. Here is an explanation of what you need to put into each column:

3. Create a list of tuples called `full_chatdata`, each tuple has the following structure: label, team.

4. Create a list called `goldData` containing the gold advantage for every timestamp of a match (usually every minute).

5. Create a list called `KDratios` of kill-death ratios for each player in a match, split into two parts, one for each team, called ratiosRadiant and ratiosDire.

6. Add an additional column called `radiant_win` that indicates, as a boolean, whether team radiant won.

**Hints:**

- you can find the labels in the chatData dataframe
- use the `slot` column to determine the team. 0 to 4 is for radiant, 5-9 is for dire
- There is a column in the match.csv file called `radiant_win` that displays true if team radiant won, false if not
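
The slot-to-team mapping and the kill/death ratio from the steps above can be sketched on invented data as follows. The helper names `team_of` and `kd_ratio` are hypothetical, and treating zero deaths as one death to avoid a division by zero is an assumption, not something the exercise prescribes.

```python
import pandas as pd

# Invented toy chat table with the columns used in this task
toy_chat = pd.DataFrame({
    "match_id": [0, 0, 0, 1],
    "slot":     [0, 7, 4, 9],
    "label":    [1, -1, 0, -1],
})

def team_of(slot):
    """Map a player slot to its team: 0-4 is radiant, 5-9 is dire."""
    return "radiant" if slot <= 4 else "dire"

# One list of (label, team) tuples per match
toy_tuples = [
    [(label, team_of(slot)) for label, slot in zip(group["label"], group["slot"])]
    for _, group in toy_chat.groupby("match_id")
]

def kd_ratio(kills, deaths):
    """Kill/death ratio; counting zero deaths as one death is an assumption."""
    return kills / max(deaths, 1)

print(toy_tuples)
print(kd_ratio(6, 3), kd_ratio(4, 0))
```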

In [None]:
import pandas as pd
import numpy as np

# 1. Read the csv files and group them by match id
chatData = pd.read_csv("chat.csv")
chatData = chatData.drop(['unit'],axis=1)
chatData['label'] = sentiments # We are assigning labels to the chat messages
chatData = chatData.groupby('match_id')
player_times = pd.read_csv("player_time.csv")
player_times = player_times.groupby('match_id')
match_info = pd.read_csv("match.csv")
radiant_win = match_info['radiant_win']
player_info = pd.read_csv("players.csv")
player_info = player_info[['match_id', 'kills','deaths']]
player_info = player_info.groupby('match_id')


# 2. Create the dataframe
dataframe = pd.DataFrame(columns = ['chatData','goldData', 'KDratios', 'radiant_win'])

# 3.
full_chatdata = []

for name, group in chatData:
    chat_data_line = []
    for index,row in group.iterrows():
        chat_tuple = []
        # TODO: Create a list of tuples called full_chatdata, each tuple has the following structure: label, team.
        # Hint 1: use the label column to determine the negativity/positivity of the message
        # Hint 2: use the 'slot' column to determine the team. 0 to 4 is for radiant, 5-9 is for dire.
        
        
# 4. Create a list containing the gold advantage
full_golddata =[]

for name,group in player_times:
    radiantAdv =[]
    for index, row in group.iterrows():
        radiantAdv.append((row['gold_t_0']+row['gold_t_1']+row['gold_t_2']+row['gold_t_3']+row['gold_t_4'])-
            (row['gold_t_128']+row['gold_t_129']+row['gold_t_130']+row['gold_t_131']+row['gold_t_132']))
        
    full_golddata.append(radiantAdv)


# 5.
full_playerinfo = []
for name, group in player_info:
    playerinfo = []
    for index, row in group.iterrows():
        killsdeaths = []
        killsdeaths.append(row['kills'])
        killsdeaths.append(row['deaths'])
        playerinfo.append(killsdeaths)
    full_playerinfo.append(playerinfo)

full_KDRatios = []

for row in full_playerinfo:
    KDRatios = []
    ratiosRadiant =[]
    ratiosDire = []
    for i,player in enumerate(row):
        # TODO: Create a list called [...] kill-death ratios for each player
        # Hint: For each game the kd ratios should look like the following: 
        # [[RadiantPlayer0KD, ... RadiantPlayer4KD],[DirePlayer0KD, ... DirePlayer4KD]] 
        
        
# We add the newly created columns to our dataframe       
dataframe['chatData'] = full_chatdata
dataframe['goldData'] = full_golddata
dataframe['radiant_win'] = radiant_win
dataframe['KDratios'] = full_KDRatios
        
dataframe.head(5)

### b) Per-match analysis:

As you may have noticed, the gold variables are gathered every minute, but the chat times are irregular. We could try to group the chat into 60 second timeframes that would correspond with the gold values, but this would be too tedious. Instead, we will simplify this by looking at the game as a whole:

1. Compute the average negativity for each team by iterating over the list of tuples you created in step 3 of Task 4.2 a).

2. Then, compute the average gold advantage for each match, and add a column for the gold advantage at the end of a match. The gold advantage at the end of a match is the last value of the list.

3. Create a new column for the difference in negativity between the two teams.

4. The kill/death ratios aren't very useful in the current format. Take the lowest K/D ratio from each team and create new columns for them. The reasoning behind this is that a low K/D ratio is a sign of underperformance of a player and players who do not perform on an acceptable level are usually harassed more often.
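
For a single match, the four aggregates listed above could be computed roughly like this. All values and `toy_*` names are invented, the helper `team_average` is hypothetical, and returning 0 for a team that wrote no messages is an assumption.

```python
from statistics import mean

# Invented data for one match
toy_match_chat = [(1, "radiant"), (-1, "dire"), (-1, "dire"), (0, "radiant")]
toy_gold = [0, 150, -200, 450]                      # gold advantage per minute
toy_kd = [[2.0, 0.5, 1.0, 3.0, 1.5],                # radiant K/D ratios
          [0.25, 1.0, 2.0, 0.75, 1.2]]              # dire K/D ratios

def team_average(chat, team):
    """Mean label of one team's messages; 0 if the team wrote nothing (an assumption)."""
    labels = [label for label, t in chat if t == team]
    return mean(labels) if labels else 0

avg_radiant = team_average(toy_match_chat, "radiant")
avg_dire = team_average(toy_match_chat, "dire")
negativity_diff = avg_radiant - avg_dire

gold_avg = mean(toy_gold)      # average gold advantage over the match
gold_end = toy_gold[-1]        # gold advantage at the end of the match

worst_radiant = min(toy_kd[0])
worst_dire = min(toy_kd[1])

print(avg_radiant, avg_dire, negativity_diff, gold_avg, gold_end, worst_radiant, worst_dire)
```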

In [None]:
from sklearn import svm
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression
from statistics import mean


# 1. Average negativity
radiantToxicity_full = []
direToxicity_full = []

for index, row in dataframe.iterrows():
    radiantToxicity = 0
    direToxicity = 0
    # These counters keep track of the number of messages each team wrote:
    radiantcounter = 0
    direcounter = 0
    for tuples in row['chatData']:
        # TODO: Calculate each team's toxicity by summing all labels of a match.
        # Hint: Don't forget to keep count of the number of messages written by each team.
        
        
# 2. Average gold
goldAverages = []
goldEnd = []
for index, row in dataframe.iterrows():
    # TODO: Compute the average gold advantage for each match, as well as the gold advantage at the end of the match.
    # Hint: The column goldData contains a list with the gold advantage per minute.
    
    
# 3. Difference in negativity
differences = []
# TODO: Compute the difference in negativity between the 2 teams.


# 4. K/D ratios
worstRadiant = []
worstDire = []
for index, row in dataframe.iterrows():
    # TODO: Take the lowest K/D ratio from each team and create new columns for them.

    
# We add the newly created columns to our dataframe       
dataframe['toxicityR'] = radiantToxicity_full
dataframe['toxicityD'] = direToxicity_full
dataframe['goldData'] = goldAverages
dataframe['goldEnd'] = goldEnd 
dataframe['diff'] = differences
dataframe['worstKDR'] = worstRadiant
dataframe['worstKDD'] = worstDire

dataframe.head(5)

### c) A warm-up regression

Before we test our hypothesis of whether or not the state of the game influences player behavior, we will perform a linear regression with only one input variable: The gold advantage. 

You have probably wondered why we just assume that the gold values would represent the state of a game, whether a team is losing or winning. So far, this has only been a theory, and we should test it, as it would not make sense to use it as a representation for the state of the game in the actual regression model, if it wasn't representative at all.

1. Once again, split your data into a train set and a test set, create a linear regression model, fit the data and print your score. Do this twice: your dependent variable should always be `radiant_win`, while your X should be the average gold advantage in the first run and the gold advantage at the end of the match in the second.

2. Discuss the score you obtained! What do the results mean for the explanatory power of the gold variables?
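
The following sketch shows the mechanics of such a single-variable regression on synthetic data. The numbers and the rule generating `toy_win` are invented and say nothing about the real matches. Two details are worth noting: scikit-learn expects a 2-D feature matrix, and `score()` returns the coefficient of determination $R^2$, not an accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_gold_end = rng.normal(0, 5000, size=200)
# Invented rule: the team that is ahead at the end usually wins
toy_win = (toy_gold_end + rng.normal(0, 2000, size=200)) > 0

# A single feature must be reshaped into a column: shape (n_samples, 1)
toy_X = toy_gold_end.reshape(-1, 1)
toy_y = toy_win.astype(int)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0)

toy_regr = LinearRegression().fit(toy_X_train, toy_y_train)

r2 = toy_regr.score(toy_X_test, toy_y_test)   # coefficient of determination R^2
print(r2)
```

An $R^2$ of 1 would mean the model explains all of the variance in the test labels, while a value near 0 means it does no better than always predicting the mean.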

In [None]:
from sklearn import linear_model

# TODO 1:
X = # TODO 
y = # TODO

# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 
  
regr = LinearRegression() 
  
regr.fit(X_train, y_train) 
print(regr.score(X_test, y_test))

In [None]:
y_pred = regr.predict(X_test) 
plt.scatter(X_test, y_test, color ='b') 
plt.plot(X_test, y_pred, color ='k') 
  
plt.show() 

**TODO: Write your observations here:**




### d) Testing our hypothesis

**1.** Finally, we can do our linear regression. This time, use the gold data and the kill/death ratios as input variables and the negativity scores as the dependent variable.

In [None]:
# TODO 1:
X = # TODO 
Y = # TODO
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)
print(regr.score(X,Y))

### e) Discussion

What is the score? What does that number mean? Discuss possible reasons for this result.  
**Hint:** Take a peek at the labels.xlsx file and look at some of the most common negative words. 

**TODO: Write your observations here:**


