# Determine the frequency and cause of contradicting highscores
**Contributors:** Victor Lin

**Achievement:** We found 778 duplicate scores (scores sharing the same player, beatmap, and mods), or ~0.0078% of the 10M scores present across all dumps. Duplicates can be safely ignored for the data cleaning process.

**Requirements:**

1. *exploration/sql_migration/random_dump_migration.ipynb*

## Introduction
Similar to the exploration of user crossover between random dumps, this notebook takes a closer look at the contradicting user highscores for the same beatmap and dump that were found in *random_dump_migration.ipynb*.

In [None]:
import sys
sys.path.append('../..')
from exploration.config import sql_inst, mongo_inst

In [None]:
osu_random_db = mongo_inst['osu_random_db']

## Contradicting Highscores Pipeline Filter
Define a pipeline that groups highscores by their (user_id, beatmap_id, enabled_mods) index, then keeps only the groups that contain more than one highscore (i.e., contradicting highscores).
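To make the group-then-filter logic concrete, here is a minimal pure-Python sketch of the same idea on a handful of made-up records. The `find_duplicate_keys` helper and the sample values are illustrative assumptions, not part of the notebook or the database.

```python
from collections import Counter

# Toy records standing in for documents in osu_scores_high (values are made up).
scores = [
    {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 500},
    {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 700},
    {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 16, 'score': 650},
    {'user_id': 2, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 900},
]


def find_duplicate_keys(records):
    """Count records per (user_id, beatmap_id, enabled_mods) and keep keys seen more than once."""
    # Equivalent of the $group stage: one counter entry per index triple.
    counts = Counter(
        (r['user_id'], r['beatmap_id'], r['enabled_mods']) for r in records
    )
    # Equivalent of the $match stage: keep only groups with count > 1.
    return {key: n for key, n in counts.items() if n > 1}


duplicate_keys = find_duplicate_keys(scores)
```

Only the first two records share all three fields, so they form the single contradicting group; the third differs in `enabled_mods` and the fourth in `user_id`.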

In [None]:
pipeline = [
    {
        '$group': {
            '_id': {
                'user_id': '$user_id',
                'beatmap_id': '$beatmap_id',
                'enabled_mods': '$enabled_mods'
            },
            'count': {
                '$sum': 1
            }
        }
    },
    {
        '$match': {
            'count': {
                '$gt': 1
            }
        }
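    }
]

# Materialize the cursor into a list so the results can be iterated more than once and measured with len()
duplicate_indecies = list(osu_random_db['osu_scores_high'].aggregate(pipeline, allowDiskUse=True))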

## Querying highscores from each multi-index
For the Aug - Jan dumps, each duplicated index appears to have exactly 2 contradicting highscores (never 3 or more).
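One way to check the "exactly 2 per index" observation is to tally the `count` field of the grouped results. The sketch below uses hypothetical sample groups shaped like the output of a `$group` stage with a `count` field; `group_size_histogram` is an illustrative helper. With real aggregate output, the results would need to be held in a list first, since a cursor can only be iterated once.

```python
from collections import Counter


def group_size_histogram(groups):
    """Map each group size to the number of duplicated indexes with that size."""
    return Counter(group['count'] for group in groups)


# Hypothetical grouped results (ids are made up for illustration).
sample_groups = [
    {'_id': {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 0}, 'count': 2},
    {'_id': {'user_id': 2, 'beatmap_id': 11, 'enabled_mods': 8}, 'count': 2},
    {'_id': {'user_id': 3, 'beatmap_id': 12, 'enabled_mods': 0}, 'count': 2},
]

histogram = group_size_histogram(sample_groups)
# True only when every duplicated index has exactly two highscores.
only_pairs = set(histogram) == {2}
```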

In [None]:
duplicate_highscores = []

for duplicate_index in duplicate_indecies:
    index = duplicate_index['_id']
    duplicate_highscores.extend(osu_random_db['osu_scores_high'].find(index, {'mlpp': 0}))
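
The loop above issues one `find` per duplicated index. As a possible alternative, the documents could be collected inside the aggregation itself. The sketch below only builds such a pipeline; `build_duplicate_docs_pipeline` is a hypothetical helper, and the approach has not been run against these dumps. It assumes a MongoDB version that supports `$replaceRoot`, and pushing whole documents into each group may be memory-hungry on large collections.

```python
def build_duplicate_docs_pipeline(key_fields=('user_id', 'beatmap_id', 'enabled_mods')):
    """Build a pipeline that returns the full documents of every duplicated index."""
    return [
        # Group by the index fields, counting members and keeping each whole document.
        {'$group': {
            '_id': {field: f'${field}' for field in key_fields},
            'count': {'$sum': 1},
            'docs': {'$push': '$$ROOT'},
        }},
        # Keep only groups with more than one highscore.
        {'$match': {'count': {'$gt': 1}}},
        # Flatten the groups back into one document per highscore.
        {'$unwind': '$docs'},
        {'$replaceRoot': {'newRoot': '$docs'}},
        # Drop the mlpp field, as in the find() projection.
        {'$project': {'mlpp': 0}},
    ]


single_pass_pipeline = build_duplicate_docs_pipeline()
```

If this were used, it would be passed to `aggregate` with `allowDiskUse=True`, just like the counting pipeline.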

## Investigating Highscores with DataFrame
Reorder the table columns so that the 3 indexed columns (beatmap_id, user_id, enabled_mods) come first.

In [None]:
import pandas as pd
df = pd.DataFrame(duplicate_highscores)

# Move the 3 index columns to the front and keep every other column in its original order
index_cols = ['beatmap_id', 'user_id', 'enabled_mods']
df = df[index_cols + [col for col in df.columns if col not in index_cols]]

## Tables for first 5 contradicting pairs of highscores
It appears that, in each pair, one highscore was submitted in 2011 or 2012 and the other in 2017. Perhaps score storage changed during this period in a way that occasionally missed duplicates.
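The year pattern can also be checked programmatically rather than by eye. This is a minimal sketch that assumes each highscore document carries a `datetime` under a field named `date`; that field name, the helper functions, and the sample pair are assumptions for illustration.

```python
from datetime import datetime


def submission_years(pair, date_field='date'):
    """Return the sorted submission years of the highscores in one contradicting pair."""
    return sorted(doc[date_field].year for doc in pair)


def year_gap(pair, date_field='date'):
    """Return the number of years between the earliest and latest submission in a pair."""
    years = submission_years(pair, date_field)
    return years[-1] - years[0]


# Hypothetical pair mirroring the pattern described above (values are made up).
sample_pair = [
    {'score_id': 2, 'date': datetime(2017, 6, 1)},
    {'score_id': 1, 'date': datetime(2012, 3, 14)},
]

years = submission_years(sample_pair)
gap = year_gap(sample_pair)
```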

In [None]:
from IPython.display import display

for i in range(5):
    print(f'\n\nContradicting highscores GROUP {i + 1}')
    display(df[2 * i: 2 * (i + 1)])

## Tables for last 5 contradicting pairs of highscores
The score _ids for the first 4 pairs are extremely close, and in the 3rd and 4th pairs the other columns are identical. This may be an unintentional double submission on the server side. Overall, contradicting highscores appear to be due to server-side errors, and we should simply choose the higher score.
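The "choose the higher score" rule could look roughly like the sketch below. It assumes the score value lives in a field named `score`; that name, the `keep_highest_scores` helper, and the sample records are illustrative assumptions rather than the project's actual cleaning code.

```python
def keep_highest_scores(records,
                        key_fields=('user_id', 'beatmap_id', 'enabled_mods'),
                        score_field='score'):
    """Keep one record per index, preferring the record with the highest score."""
    best = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        # Replace the stored record only when the new one has a strictly higher score.
        if key not in best or record[score_field] > best[key][score_field]:
            best[key] = record
    return list(best.values())


# Toy records: the first two contradict each other (values are made up).
sample_scores = [
    {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 500},
    {'user_id': 1, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 700},
    {'user_id': 2, 'beatmap_id': 10, 'enabled_mods': 0, 'score': 900},
]

resolved = keep_highest_scores(sample_scores)
```

When two records tie on score, such as the exact double submissions noted above, this keeps whichever was seen first.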

In [None]:
N = len(df) // 2  # number of contradicting groups; each group contributes exactly 2 rows
for i in range(N - 5, N):
    print(f'\n\nContradicting highscores GROUP {i + 1}')
    display(df[2 * i: 2 * (i + 1)])