### The data keeps on changing so I'd need to change my statistics too!

## Among Us

Dataset column description

- Game completed date
- Team: Crewmate/Imposter
- Outcome: Win/Loss
- Task Completed: number of tasks completed by crewmate
- All tasks completed: Yes/No
- Murdered: Yes/No
- Imposter Kills: number of kills by imposter
- Game Length: duration min:seconds
- Ejected: Yes/No, voted out
- Sabotages Fixed: number of sabotages fixed
- Time to complete all tasks: time taken to complete all tasks
- Rank Change: After playing 3 games, players are assigned a competitive rank that is affected by every game played
- Region/Game Code: Game code with server region




# **How to play Among Us:**

A game has at most 10 players, and each player is either a crewmate or an imposter. Crewmates try to finish their tasks and save the spaceship, while the imposters (2 per game) try to kill the crewmates. Whenever a dead body is discovered, everyone discusses and can vote to eject the player they find suspicious ("sus"). Crewmates win if they get both imposters ejected or if they finish all of their tasks. Imposters win once they are no longer outnumbered: in a full game that means 6 crewmates gone with both imposters still in the game, or 7 with only 1 imposter remaining. Imposters can also vent (use secret passageways to go from room to room) and sabotage the ship.
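
The imposter win condition can be written down as a small check. This is only a sketch of the rule as I understand it (imposters win once they are at least as many as the living crewmates); the function names are made up for illustration and are not part of the dataset or the game's API.

```python
def imposters_win(imposters_alive: int, crewmates_alive: int) -> bool:
    """Assumed parity rule: imposters win once they are not outnumbered."""
    return imposters_alive > 0 and imposters_alive >= crewmates_alive


def crewmates_to_remove(crewmates: int, imposters_alive: int) -> int:
    """Crewmates that must be killed or ejected before parity is reached."""
    return max(crewmates - imposters_alive, 0)


# A full lobby: 10 players, 2 of them imposters, so 8 crewmates
print(crewmates_to_remove(8, 2))  # both imposters still alive
print(crewmates_to_remove(8, 1))  # one imposter already ejected
```

Kills and ejections both count towards the removed crewmates here, which is why the Ejected column matters later on.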
    

In [None]:
# Import all required libraries

import glob
import pandas as pd
import numpy as np
import collections
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline



#importing all the required ML packages
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn import svm #support vector Machine
from sklearn.ensemble import RandomForestClassifier #Random Forest
from sklearn.neighbors import KNeighborsClassifier #KNN
from sklearn.naive_bayes import GaussianNB #Naive bayes
from sklearn.tree import DecisionTreeClassifier #Decision Tree
from sklearn.model_selection import train_test_split #training and testing data split
from sklearn import metrics #accuracy measure
from sklearn.preprocessing import LabelEncoder
import eli5
from eli5.sklearn import PermutationImportance

import warnings
warnings.filterwarnings('ignore')

### Merging all the data sets

In [None]:
path = "../input/among-us-dataset"
all_files = glob.glob(path + "/*.csv")

In [None]:
list_ = []

# Read every CSV and tag its rows with a per-file user ID (1, 2, 3, ...)
for user, filename in enumerate(all_files, start = 1):
    data = pd.read_csv(filename, index_col = None, header = 0)
    data["User ID"] = user
    list_.append(data)

In [None]:
data_merged = pd.concat(list_)
data_merged = data_merged.reset_index(drop = True)

In [None]:
# The Region/Game code column should be separated

data_merged[["Region", "Game Code"]] = data_merged["Region/Game Code"].str.split("/", expand = True)
data_merged = data_merged.drop(columns = ["Region/Game Code"])

In [None]:
data_merged.head()

### Cleaning the dataset - this I got from @ruchi798

In [None]:
# Collect the distinct timezone suffixes (last 3 characters of the timestamp)
timezones = data_merged["Game Completed Date"].str[-3:].unique().tolist()

In [None]:
timezones

In [None]:
# There's only 1 timezone, so we can drop this value

In [None]:
date_val = []
time_val = []

for index, row in data_merged.iterrows():
    st = data_merged["Game Completed Date"][index].split()
    
    date = st[0]
    
    t = st[2] + " " + st[3]
    time = pd.to_datetime(t).strftime("%H:%M:%S")
    
    date_val.append(date)
    time_val.append(time)
    
data_merged["Game Date"] = date_val
data_merged["Game Time"] = time_val

data_merged = data_merged.drop(columns = ["Game Completed Date"])
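
The same split can be done without looping over rows. This is a minimal sketch on two made-up timestamps; the exact layout of the string (date, a filler word, 12-hour time, am/pm, timezone) is my assumption based on which pieces the loop above picks out.

```python
import pandas as pd

# Made-up sample values in the assumed layout of "Game Completed Date"
raw = pd.Series([
    "12/13/2020 at 1:26:56 pm EST",
    "12/14/2020 at 11:05:09 am EST",
])

# Split on whitespace: column 0 is the date, 2 the clock time, 3 the am/pm marker
parts = raw.str.split(expand=True)
game_date = parts[0]

# Parse the 12-hour time and write it back out in 24-hour form
game_time = pd.to_datetime(
    parts[2] + " " + parts[3], format="%I:%M:%S %p"
).dt.strftime("%H:%M:%S")

print(game_date.tolist())
print(game_time.tolist())
```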

In [None]:
# It's easier to work with numbers, so convert the columns Game Length and Time to complete all tasks into minutes


def convert_minutes(col):
    # Characters 0-1 hold the minutes and characters 4-5 the seconds; '-' is left as it is
    for index, ts in data_merged[col].items():
        if ts == '-':
            continue
        minutes, seconds = int(ts[:2]), int(ts[4:6])
        # .loc writes into data_merged itself and avoids chained assignment
        data_merged.loc[index, col] = round((60 * minutes + seconds) / 60, 2)

convert_minutes('Game Length')
convert_minutes('Time to complete all tasks')
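
A vectorized alternative, as a rough sketch: it assumes the raw values look like `07m 04s` (my guess from the slicing above) and, unlike `convert_minutes`, it turns `-` into NaN so the column ends up numeric. `to_minutes` is a hypothetical helper, not something from the dataset.

```python
import pandas as pd


def to_minutes(series: pd.Series) -> pd.Series:
    """Convert strings like '07m 04s' to minutes; anything else becomes NaN."""
    # Two capture groups give a two-column frame: minutes and seconds
    parts = series.str.extract(r"(\d+)m\s*(\d+)s").astype(float)
    return (parts[0] + parts[1] / 60).round(2)


sample = pd.Series(["07m 04s", "16m 21s", "-"])  # made-up values
minutes = to_minutes(sample)
print(minutes)
```

Having NaN instead of `-` means later steps like `pd.qcut` or comparisons with `<=` work on the column without any extra casting.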

In [None]:
data_merged.head(2)

### Check if each game is unique - COME BACK TO THIS LATER

Are these players playing with the same people?  
Are some of these duplicate games from different user perspectives?

In [None]:
data_merged.groupby('Game Code')['User ID'].nunique().sort_values(ascending=False).head(67)
#data_merged.groupby('Game Code')['User ID'].agg(['min','max','count','nunique'])

In [None]:
# 65 game codes show up under more than one user, so those are the same games logged from different perspectives. The duplicate rows have to be dropped; counted manually, there are 151 of them

In [None]:

# maybe this will work?


g = data_merged.groupby("Game Code").size()

df = data_merged[~data_merged["Game Code"].isin(g[g<=2].index)]

df

In [None]:
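# This doesn't do what I wanted: the filter drops every game code that has 2 or fewer rows and keeps the rest, and it counts rows instead of distinct users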

In [None]:
#df.groupby('Game Code')['User ID'].nunique().sort_values(ascending=False).head(49)
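
One possible way to drop the duplicate perspectives, sketched on made-up rows. It assumes that two rows describe the same round when they share the game code and the completion date and time, which I have not verified against the real data; it also simply keeps whichever perspective comes first.

```python
import pandas as pd

# Made-up rows: game AAAA at 13:26:56 was logged by users 1 and 2
games = pd.DataFrame({
    "Game Code": ["AAAA", "AAAA", "AAAA", "BBBB"],
    "Game Date": ["12/13/2020", "12/13/2020", "12/13/2020", "12/14/2020"],
    "Game Time": ["13:26:56", "13:26:56", "13:40:10", "20:01:00"],
    "User ID":   [1, 2, 1, 3],
})

# Assumed identity of a round: same lobby code finishing at the same moment
round_key = ["Game Code", "Game Date", "Game Time"]
deduplicated = games.drop_duplicates(subset=round_key, keep="first")

print(len(games) - len(deduplicated), "duplicate rows dropped")
print(deduplicated)
```

Keying on the code alone would also throw away later rounds played in the same lobby, which is why the timestamp is part of the key.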

### Exploratory Data Analysis

In [None]:
data_merged.shape
# So there are 2227 rows (game records) to analyze

In [None]:
# Let's drop the columns we don't need to analyze for this section

data_dropped = data_merged.drop(['Rank Change', 'Game Code'], axis = 1)

In [None]:
data_dropped.head()

In [None]:
data_dropped.isnull().sum()

Observations:

- Imposters (449) didn't fix any sabotages. Kinda sus, since imposters sometimes do fix their own sabotages to throw the crewmates off the scent.


In [None]:
# Chances of you getting imposter or crewmate?
data_dropped['Team'].value_counts(normalize=True)

# You'll get crewmate more than imposter

There is a 79% chance of getting crewmate and a 21% chance of getting imposter.

### 1. Team

In [None]:
# Let's see how many times a person gets imposter and wins
pd.crosstab([data_dropped.Team], data_dropped.Outcome, margins = True).style.background_gradient(cmap = 'Reds')

In [None]:
pd.crosstab([data_dropped.Team], data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

In [None]:
fig = sns.countplot(x = 'Team', hue = 'Outcome', data = data_dropped)
fig.set_title('Imposters vs Crewmates')
plt.show()

#### Observations

1. As a crewmate, you have a 55.66% chance of winning the round
2. As an imposter, you have a 56.12% chance of winning the round
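
The row percentages behind these observations can also come straight from `pd.crosstab` through its `normalize` argument, without the `apply` and lambda. A small sketch on made-up rows (the real numbers are the ones listed above):

```python
import pandas as pd

# Made-up games: 4 as crewmate (3 wins), 2 as imposter (1 win)
toy = pd.DataFrame({
    "Team":    ["Crewmate", "Crewmate", "Crewmate", "Crewmate", "Imposter", "Imposter"],
    "Outcome": ["Win", "Win", "Win", "Loss", "Win", "Loss"],
})

# normalize="index" divides every row by its own total
win_pct = (pd.crosstab(toy["Team"], toy["Outcome"], normalize="index") * 100).round(2)
print(win_pct)
```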

### 2. Task Completed

In [None]:
pd.crosstab([data_dropped['Task Completed']], 
            data_dropped.Outcome, margins = True).style.background_gradient(cmap = 'Reds')

In [None]:
pd.crosstab([data_dropped['Task Completed']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

pd.crosstab([data_dropped['Task Completed'], data_dropped['All Tasks Completed']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

Crewmates lost many games even though all of their tasks were completed. Finishing every task should guarantee a win, so this suggests that the tasks of ghosts and ejected players were not recorded and therefore cannot affect the outcome shown here.

### 3. All Tasks Completed

In [None]:
# How many times crewmates won depending on all tasks completed

pd.crosstab([data_dropped['All Tasks Completed']], 
            data_dropped.Outcome, margins = True).style.background_gradient(cmap = 'Reds')

# Here (-) means the user was an imposter, you can check as the number of imposters (391) and (-) match

In [None]:
pd.crosstab([data_dropped['All Tasks Completed']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

In [None]:
fig = sns.countplot(x = 'All Tasks Completed', hue = 'Outcome', data = data_dropped)
fig.set_title('All Tasks Completed vs Outcome')
plt.show()

#### Observations

If we look at just the Yes row, there are still a lot of losses even though all tasks were completed. That suggests ghosts (players who were murdered, ejected or disconnected) did not complete their tasks, and that the game logs All Tasks Completed as Yes as soon as the living crewmates have finished theirs.



### 4. Imposter Kills

In [None]:
# We'll make a kill band later, but how many murders happen for the imposter to win

pd.crosstab([data_dropped['Imposter Kills']], 
            data_dropped.Outcome, margins = True).style.background_gradient(cmap = 'Reds')

In [None]:
pd.crosstab([data_dropped['Imposter Kills']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

#### Observations

- More kills means a higher win rate. Sadly, we can't see whether a kill was part of a double kill (both imposters killing a crewmate each at the same time).
- Imposters win automatically once only 2 imposters and 2 crewmates, or 1 imposter and 1 crewmate, are left. So the 1 game with 6 kills means both imposters managed to stay alive and killed 6 crewmates to win.
- I'll use this when analyzing only imposters.

### 5. Ejected

In [None]:
# Did the imposter win because the crewmates blamed each other and ejected themselves?

pd.crosstab([data_dropped['Ejected'], data_dropped['Team']], 
            data_dropped.Outcome, margins = True).style.background_gradient(cmap = 'Reds')


In [None]:
pd.crosstab([data_dropped['Ejected'], data_dropped['Team']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

#### Observations

- If you get ejected as an imposter, you still have an 8% chance of winning (keep in mind that there are 2 imposters per game, so the other one can still win it)

### 6. Game Length

In [None]:
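# Assign the result back; replace(..., inplace = True) on a column selection may not update data_dropped in newer pandas
data_dropped['Outcome'] = data_dropped['Outcome'].replace(['Loss', 'Win'], [0, 1])

# Game Length is still an object column after convert_minutes, so cast it before binning
data_dropped['Game Length'] = data_dropped['Game Length'].astype(float)
data_dropped['Game_Length'] = pd.qcut(data_dropped['Game Length'], 4)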
data_dropped.groupby(['Game_Length'])['Outcome'].mean().to_frame()

In [None]:
data_dropped['game_time'] = 0
data_dropped.loc[data_dropped['Game Length'] <= 6.28, 'game_time'] = 0
data_dropped.loc[(data_dropped['Game Length'] > 6.28) & (data_dropped['Game Length'] <= 10.47), 'game_time'] = 1
data_dropped.loc[(data_dropped['Game Length'] > 10.47) & (data_dropped['Game Length'] <= 14.27), 'game_time'] = 2
data_dropped.loc[(data_dropped['Game Length'] > 14.27) & (data_dropped['Game Length'] <= 29.0), 'game_time'] = 3
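
Since the data keeps changing, the hard-coded edges (6.28, 10.47, 14.27) will go stale. A sketch of how `pd.qcut` can hand back the quartile number directly with `labels=False`, shown on made-up game lengths rather than the real column:

```python
import pandas as pd

# Made-up game lengths in minutes
lengths = pd.Series([3.5, 5.0, 7.2, 9.9, 11.0, 13.4, 15.8, 22.1])

# labels=False returns the bin number (0-3) instead of the interval itself
game_time = pd.qcut(lengths, 4, labels=False)

print(game_time.tolist())
```

The bin numbers play the same role as the `game_time` column built by hand above, but the edges are recomputed every time the data is reloaded.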


In [None]:
data_dropped.groupby(['game_time', 'Outcome'])['Outcome'].count()
pd.crosstab([data_dropped['game_time']], 
            data_dropped.Outcome).apply(lambda r: round(r/r.sum() * 100, 2), axis=1)

#### Observations

You have a better chance of winning if the game ends within 6.28 minutes or lasts between 14.27 and 29.0 minutes.

So, the attributes I'd like to work with are:

- Team
- Ejected
- Game Length

to check my kernel

In [None]:
data_dropped['Team'] = data_dropped['Team'].replace(['Crewmate', 'Imposter'], [0, 1])
# Note the encoding: Ejected is 0 for Yes and 1 for No
data_dropped['Ejected'] = data_dropped['Ejected'].replace(['Yes', 'No'], [0, 1])

In [None]:
data_dropped

In [None]:
df = data_dropped.drop(['Task Completed', 'All Tasks Completed', 'Murdered', 'Sabotages Fixed',
                        'Time to complete all tasks', 'User ID', 'Region', 'Game Date', 'Game Time', 'Game_Length', 'Imposter Kills', 'Game Length' ], axis = 1)

In [None]:
df

In [None]:
sns.heatmap(df.corr(), annot = True, cmap = 'Reds', linewidths = 0.2, annot_kws = {'size':20})
fig = plt.gcf()
fig.set_size_inches(15,10)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.show()


## Data Analysis

1. Support Vector Machines
2. Logistic Regression
3. Decision Tree
4. K-Nearest Neighbours
5. Naive Bayes
6. Random Forest

In [None]:
# split the data set into training and testing dataset

label = df.Outcome
attributes = [c for c in df.columns if c in ["Team", "Ejected", "game_time"]]


train_x, test_x, train_y, test_y = train_test_split(df[attributes], label, test_size = 0.3, random_state = 2)

In [None]:
# KNN
# Before doing this algorithm it is important to check the accuracy for the KNN models using different values of n

a_index = list(range(1, 11))
knn_accuracy = []

for i in a_index:
    model_KNN = KNeighborsClassifier(n_neighbors = i)
    model_KNN.fit(train_x, train_y)
    prediction_KNN = model_KNN.predict(test_x)
    # accuracy_score expects (y_true, y_pred)
    knn_accuracy.append(metrics.accuracy_score(test_y, prediction_KNN))

# Series.append was removed in pandas 2.0, so the scores are collected in a list
plt.plot(a_index, knn_accuracy)
plt.xticks(a_index)
plt.show()
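
A single train/test split can make one value of n look better just by luck. As a rough sketch, the same comparison can be run with 5-fold cross-validation; the data here is synthetic (from `make_classification`), not the Among Us data, and the fold count is an arbitrary choice on my part.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the three attributes and the binary outcome
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=2)

cv_accuracy = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this number of neighbours
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(best_k, round(cv_accuracy[best_k], 3))
```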

In [None]:
classifiers=['Support Vector Machine', 'Logistic Regression','KNN','Decision Tree','Naive Bayes','Random Forest']

models=[svm.SVC(), LogisticRegression(), KNeighborsClassifier(n_neighbors = 10), 
        DecisionTreeClassifier(), GaussianNB(), RandomForestClassifier(n_estimators=100)]

accuracy = []

for model in models:
    model.fit(train_x, train_y)
    prediction = model.predict(test_x)
    # accuracy_score expects (y_true, y_pred)
    accuracy.append(metrics.accuracy_score(test_y, prediction))
    
table = pd.DataFrame({'Accuracy': accuracy}, index = classifiers)
table


# SVM seems to be the highest, but yeah sad kernel

# Will come back to this in the future!