Here are a few lines to clean up two inconsistencies in this data set: characters with only a single log entry, and characters recorded with more than one race.

In [None]:
import numpy as np
import pandas as pd

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
# date parsing function
parser = lambda x: pd.to_datetime(x, format='%m/%d/%y %H:%M:%S')

In [None]:
# load data
df = pd.read_csv('../input/wowah_data.csv', parse_dates=[' timestamp'], date_parser=parser)
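
Recent pandas releases deprecate the `date_parser` argument, so a version-independent alternative is to load the timestamp column as text and convert it afterwards. This is a minimal sketch on a made-up two-row sample (the values and the `sample`/`sample_df` names are invented for illustration). It assumes the same layout as above, where the header names carry a leading space.

```python
import io

import pandas as pd

# Made-up sample: only the header names mimic the real file (leading spaces)
sample = io.StringIO(
    "char, race, timestamp\n"
    "1,Orc,01/01/08 00:02:04\n"
    "2,Tauren,01/01/08 00:12:09\n"
)

# Load the column as plain text first, then convert with an explicit format
sample_df = pd.read_csv(sample)
sample_df[' timestamp'] = pd.to_datetime(
    sample_df[' timestamp'], format='%m/%d/%y %H:%M:%S'
)

print(sample_df.dtypes)
```

Passing an explicit `format` keeps the parsing as strict as the original `parser` lambda.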

### Avatars
Let's have a look at the avatars used by the players. Since a player can have multiple avatars, avatar data is only a proxy for actual player behavior.

In [None]:
# group logs by character
avatars = df.groupby('char')

# number of unique characters
len(avatars)

In [None]:
# count logs per character
log_number = avatars.count()

# number of characters with a single log
len(avatars.filter(lambda x: len(x) == 1))

It seems that a significant proportion of the avatars appear in only a single log, meaning they were seen on the server only once. It could be a series of avatars left stillborn, or a bunch of bots spawning only for a second to spam the whole server before getting banned. Either way, there is not much we can do with these.

In [None]:
# remove characters that have only a single log
df = avatars.filter(lambda x: len(x) > 1)

# number of remaining avatars
avatars = df.groupby('char')
len(avatars)
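
`groupby(...).filter` with a Python lambda calls the function once per character, which can be slow when there are many groups. A possible alternative is a boolean mask built with `transform('size')`. The sketch below runs on a toy frame (the `toy` data and variable names are invented for illustration), not on the real logs.

```python
import pandas as pd

# Toy logs: character 2 has a single log, characters 1 and 3 have several
toy = pd.DataFrame({
    'char': [1, 1, 2, 3, 3, 3],
    'level': [10, 11, 5, 60, 60, 61],
})

# For every row, the number of logs belonging to that row's character
logs_per_char = toy.groupby('char')['char'].transform('size')

# Keep only the rows of characters with more than one log
toy_clean = toy[logs_per_char > 1]

print(toy_clean['char'].nunique())
```

The mask keeps the original row order and index, so the result should match what `filter(lambda x: len(x) > 1)` returns on the same data.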

### Races
Many of you have already looked at the race/class combinations, so I'm not going to run the same analysis at length. But let's have a quick look.

In [None]:
# convert each array of races to a tuple so that value_counts can hash it
races = avatars[' race'].unique().apply(tuple).value_counts()
races.head(n=10)

Wait... Wat? So it seems that we have a few mixed-race avatars in the data set. How come? I see three possible explanations. (1) The data contains records from the avatar creation, including race and class swaps. (2) The players actually bought a race swap. This explanation is unlikely, since this feature was only implemented on October 27th, 2009. (3) There are inconsistencies in the original data set.

In [None]:
# let's look at this avatar
avatars.get_group(65856).sort_index()[' race'].value_counts()

I have no idea why this guy is a Tauren only 2/3 of the time, so I assume there are inconsistencies. Better to get rid of these avatars as well. If anyone has another explanation, I'll take it.

In [None]:
# clean characters with multiple races
df = df.groupby('char').filter(lambda x: len(x[' race'].unique()) == 1)
# tuples are hashable, so value_counts can count the race combinations
df.groupby('char')[' race'].unique().apply(tuple).value_counts().head(n=10)
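
The same cleaning step can also be written with `nunique`, which avoids building arrays of races per character. This is a minimal sketch on a toy frame (data and variable names invented for illustration): it keeps single-race characters and summarizes the race combinations as plain strings, which are always hashable.

```python
import pandas as pd

# Toy logs: character 2 is recorded with two different races
toy = pd.DataFrame({
    'char': [1, 1, 2, 2, 3],
    ' race': ['Orc', 'Orc', 'Tauren', 'Troll', 'Undead'],
})

# For every row, the number of distinct races of that row's character
races_per_char = toy.groupby('char')[' race'].transform('nunique')

# Keep only the rows of characters with exactly one race
toy_single_race = toy[races_per_char == 1]

# One label per character, e.g. 'Tauren + Troll', then count the labels
race_combos = (
    toy.groupby('char')[' race']
    .agg(lambda r: ' + '.join(sorted(r.unique())))
    .value_counts()
)

print(race_combos)
```

Counting `race_combos` before filtering gives a quick view of how many characters are affected by each mixed combination.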