<a id='intro'></a>
<div class="alert alert-block alert-warning"> 
<h1 align="center">Introduction</h1>
</div>

> Now that I have compiled a dataset of matches from the API, I will perform the necessary steps to compute and visualize the win rates and pick rates of each character.

> First, I will perform [Exploratory Data Analysis](#eda) to get a sense of the data I am working with.

> Next, I will compute the [Character Statistics](#char_stats) for each character by format and game mode.

> Finally, I will perform [Inferential Statistics](#inf) to test for statistically significant differences in win rates and pick rates between formats (casual vs. ranked) and game modes (2v2 vs. 3v3).

<a id='eda'></a>
<div class="alert alert-block alert-warning"> 
<h1 align="center">Exploratory Data Analysis</h1>
</div>

In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

<a id='eda1'></a>
### Check if the match data includes every character
#### Description:
> I need to make sure my dataset includes every character, since this is the most important part of the project.

#### Procedure:
> Using the mappings found in the **character_names** dataframe, I will pair the ID of each character to the corresponding name in the **match_df** dataframe, then ensure that the dataset includes data for each character.
1. Assign each row of the match data with the corresponding character name
2. Build a function to check for missing characters
3. Build a function to check for missing characters for all game modes
4. Find missing characters for each format and game mode combination
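
> As a minimal sketch of the idea behind steps 1 and 2, here is the same ID-to-name mapping and set difference on toy data. The IDs, names, and the `id_to_name` helper are made up for illustration; the real mapping lives in **character_names**.

```python
import pandas as pd

# Hypothetical toy data: these IDs and names are invented for illustration
character_names = pd.DataFrame({
    'characterID': [101, 102, 103],
    'name': ['Alpha', 'Beta', 'Gamma'],
})
match_df = pd.DataFrame({
    'character': [101, 101, 103],
    'win': [True, False, True],
})

# Map each character ID to its name
id_to_name = character_names.set_index('characterID')['name']
match_df['character'] = match_df['character'].map(id_to_name)

# Characters in the full roster that never appear in the match data
missing = sorted(set(character_names['name']) - set(match_df['character'].unique()))
print(missing)
```

> In this toy example only 'Beta' never appears in the match data, so it is the only character reported as missing.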

##### 1. Assign each row of the match data with the corresponding character name

In [None]:
# Load data
match_df = pd.read_csv('compiled_data/match_data', parse_dates=['date'])
character_names = pd.read_csv('compiled_data/character_names')

# Merge the data using the character ids
match_df = match_df.merge(character_names, left_on='character', right_on='characterID')

# Replace the 'character' column with the 'name' column
match_df['character'] = match_df['name']

# Drop unnecessary columns
match_df = match_df.drop(columns=['characterID', 'name'])

match_df.head()

##### 2. Build a function to check for missing characters

In [None]:
# Create an array of all the characters
all_characters = character_names.name.values

def missing_character(df, total_characters=all_characters):
    '''Find missing characters'''
    # Characters included in match data
    included_characters = df.character.unique()
    # Missing characters
    missing_characters = set(total_characters).difference(set(included_characters))
    
    return sorted(missing_characters)

##### 3. Build a function to check for missing characters for all game modes

In [None]:
def missing_game_mode(df, characters, format_str):
    '''Find missing characters for each game mode'''   
    # Initialize list of data
    df_2v2 = df.loc[df.game_mode == '2V2']
    df_3v3 = df.loc[df.game_mode == '3V3']
    data = [df, df_2v2, df_3v3]
    
    # Initialize list of game modes
    game_modes = ['overall', '2v2', '3v3']
    
    # Print the format
    print('Format : {}'.format(format_str))
    
    # Loop through each game mode
    for i in range(3):
        # Print the game mode
        print('Game mode: {}'.format(game_modes[i]))
        
        # Find missing characters
        missing = missing_character(data[i], characters)
        
        # Print the missing characters
        if missing:
            # Store all missing characters in a string
            missing_char_str = ', '.join(sorted(missing))
            # Print which characters are missing
            print('   Missing (amount): {}'.format(len(missing)))
            print('   Missing characters: {}'.format(missing_char_str))
        else:
            print('   Missing characters: None') 
    print('')

##### 4. Find missing characters for each format and game mode combination

In [None]:
'''All matches'''
missing_game_mode(match_df, all_characters, 'overall')

'''Casual matches'''
casual = match_df.loc[match_df.ranked == False]
missing_game_mode(casual, all_characters, 'casual')

'''Ranked matches'''
ranked = match_df.loc[match_df.ranked == True]
missing_game_mode(ranked, all_characters, 'ranked')

'''League matches'''
bronze = ranked.loc[ranked.league == 0]
silver = ranked.loc[ranked.league == 1]
gold = ranked.loc[ranked.league == 2]
platinum = ranked.loc[ranked.league == 3]
diamond = ranked.loc[ranked.league == 4]
champion = ranked.loc[ranked.league == 5]
grand_champ = ranked.loc[ranked.league == 6]

# Initialize a list of all leagues
league_list = [bronze, silver, gold, platinum, diamond, champion, grand_champ]
# Initialize a list of all league names
league_names = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
for i in range(7):
    league = league_list[i]
    league_name = league_names[i]
    missing_game_mode(league, all_characters, league_name)

> It seems the data is too sparse, and is missing a large number of characters in the higher leagues.

> I'll have to collect more data, and try again.

<a id='eda1'></a>
### Import newly collected data and check for missing characters
#### Description:
> I will import my new dataset and ensure that there are fewer missing characters.

#### Procedure:
> Using the mappings found in the **character_names** dataframe, I will pair the ID of each character to the corresponding name in the **match_df** dataframe, then ensure that the dataset includes data for each character.
1. Assign each row of the match data with the corresponding character name
2. Find missing characters for each format and game mode combination

##### 1. Assign each row of the match data with the corresponding character name

In [None]:
# Load data
match_df = pd.read_csv('compiled_data/match_data3', parse_dates=['date'])

# Merge the data using the character ids
match_df = match_df.merge(character_names, left_on='character', right_on='characterID')

# Replace the 'character' column with the 'name' column
match_df['character'] = match_df['name']

# Drop unnecessary columns
match_df = match_df.drop(columns=['characterID', 'name'])

match_df.head()

##### 2. Find missing characters for each format and game mode combination

In [None]:
'''All matches'''
missing_game_mode(match_df, all_characters, 'overall')

'''Casual matches'''
casual = match_df.loc[match_df.ranked == False]
missing_game_mode(casual, all_characters, 'casual')

'''Ranked matches'''
ranked = match_df.loc[match_df.ranked == True]
missing_game_mode(ranked, all_characters, 'ranked')

'''League matches'''
bronze = ranked.loc[ranked.league == 0]
silver = ranked.loc[ranked.league == 1]
gold = ranked.loc[ranked.league == 2]
platinum = ranked.loc[ranked.league == 3]
diamond = ranked.loc[ranked.league == 4]
champion = ranked.loc[ranked.league == 5]
grand_champ = ranked.loc[ranked.league == 6]

# Initialize a list of all leagues
league_list = [bronze, silver, gold, platinum, diamond, champion, grand_champ]
# Initialize a list of all league names
league_names = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
for i in range(7):
    league = league_list[i]
    league_name = league_names[i]
    missing_game_mode(league, all_characters, league_name)

> Great! Across all the format and game mode combinations, there is only one missing character, which is good enough to begin the analysis.

<a id='char_stats'></a>
<div class="alert alert-block alert-warning"> 
<h1 align="center">Character Statistics</h1>
</div>

<a id='char_stats1'></a>
### Win rates
#### Description:
> I will compute the win rates for each character by format and game mode.

#### Procedure:
> 
1. Build a function to compute the win rates
2. Compute the win rates for each format and game mode combination
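
> As a minimal sketch of the computation, assuming a boolean **win** column like the one in the match data: the win rate is the mean of **win** per character, and reindexing against the full roster leaves NaN for characters with no matches. The roster and rows below are made up, and `reindex` is a swapped-in way to add the missing characters, not the approach used in the function below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'Gamma' is in the roster but has no matches
roster = ['Alpha', 'Beta', 'Gamma']
toy = pd.DataFrame({
    'character': ['Alpha', 'Alpha', 'Beta', 'Beta', 'Beta', 'Beta'],
    'win': [True, False, True, True, True, False],
})

# The mean of a boolean column is the fraction of True values, i.e. the win rate
toy_win_rates = toy.groupby('character')['win'].mean()

# Reindexing against the full roster inserts NaN for characters with no matches
toy_win_rates = (
    toy_win_rates
    .reindex(pd.Index(roster, name='character'))
    .rename('win_rate')
    .reset_index()
)
print(toy_win_rates)
```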

##### 1. Build a function to compute the win rates

In [None]:
def win_rates(df, format_str, mode_str, check_missing=True):
    '''
    Returns the win rates of the input DataFrame
    '''
    # Find the win rates
    temp = df.groupby('character').win.sum() / df.groupby('character').win.count()

    # Rename 'win' column to 'win_rate'
    win_rates = pd.DataFrame(temp).reset_index().rename(columns={'win':'win_rate'})
    
    if check_missing:
        # Check for missing characters
        missing = missing_character(df)
        if missing:
            # Create a dataframe of missing characters
            df_missing = pd.DataFrame({'character':missing, 'win_rate':[np.nan]*len(missing)})
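            # Append missing characters (DataFrame.append was removed in pandas 2.0)
            win_rates = pd.concat([win_rates, df_missing], ignore_index=True)
            # Sort by 'character' column and keep the sorted result
            win_rates = win_rates.sort_values(by='character', ignore_index=True)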

    # Add 'format' and 'game_mode' column
    win_rates['format'] = format_str
    win_rates['game_mode'] = mode_str
    
    return win_rates

##### 2. Compute the win rates for each format and game mode combination

In [None]:
def game_mode_win_rates(df, format_str, check_missing=True):
    output = win_rates(df, format_str, 'overall', check_missing)
    game_modes = ['2V2', '3V3']
    for game_mode in game_modes:
        df_game_mode = df.loc[df.game_mode ==  game_mode]
        temp = win_rates(df_game_mode, format_str, game_mode, check_missing)
        output = pd.concat([output, temp], ignore_index=True)
    
    return output

'''Overall win rates'''
overall_win_rates = game_mode_win_rates(match_df, 'overall')

'''Casual win rates'''
casual_win_rates = game_mode_win_rates(casual, 'casual')

'''Ranked win rates'''
ranked_win_rates = game_mode_win_rates(ranked, 'ranked')

'''League win rates'''
# Initialize a list of all leagues
league_list = [bronze, silver, gold, platinum, diamond, champion, grand_champ]
# Initialize a list of all league names
league_names = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
# Initialize empty dataframe to store win rates for each league
league_win_rates = pd.DataFrame()
for i in range(7):
    league = league_list[i]
    league_name = league_names[i]
    temp = game_mode_win_rates(league, league_name)
    league_win_rates = pd.concat([league_win_rates, temp], ignore_index=True)

'''Combine all win rates'''
combined_win_rates = pd.concat([overall_win_rates, casual_win_rates, ranked_win_rates, league_win_rates], ignore_index=True)
combined_win_rates.to_csv('compiled_data/combined_win_rates', index=False)

<a id='char_stats2'></a>
### Pick rates
#### Description:
> I will compute the pick rates for each character by format and game mode.

#### Procedure:
> 
1. Build a function to compute the pick rates
2. Compute the pick rates for each format and game mode combination
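
> As a minimal sketch, the pick rate is each character's share of all rows in the data. The toy roster and rows below are made up, and `value_counts(normalize=True)` is a swapped-in shortcut for the count-divided-by-count computation in the function below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'Gamma' is in the roster but was never picked
roster = ['Alpha', 'Beta', 'Gamma']
toy = pd.DataFrame({'character': ['Alpha', 'Beta', 'Beta', 'Beta']})

# normalize=True divides each character's row count by the total row count
toy_pick_rates = toy['character'].value_counts(normalize=True)

# Add the missing characters as NaN, matching how the functions here treat them
toy_pick_rates = (
    toy_pick_rates
    .reindex(pd.Index(roster, name='character'))
    .rename('pick_rate')
    .reset_index()
)
print(toy_pick_rates)
```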

##### 1. Build a function to compute the pick rates

In [None]:
def pick_rates(df, format_str, mode_str, check_missing=True):
    '''
    Returns the pick rates of the input DataFrame
    '''
    # Find the overall pick rates
    temp = df.groupby('character').win.count() / df.win.count()

    # Rename 'win' column to 'pick_rate'
    pick_rates = pd.DataFrame(temp).reset_index().rename(columns={'win':'pick_rate'})
    
    if check_missing:
        # Check for missing characters
        missing = missing_character(df)
        if missing:
            # Create a dataframe of missing characters
            df_missing = pd.DataFrame({'character':missing, 'pick_rate':[np.nan]*len(missing)})
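            # Append missing characters (DataFrame.append was removed in pandas 2.0)
            pick_rates = pd.concat([pick_rates, df_missing], ignore_index=True)
            # Sort by 'character' column and keep the sorted result
            pick_rates = pick_rates.sort_values(by='character', ignore_index=True)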

    # Add 'format' and 'game_mode' column
    pick_rates['format'] = format_str
    pick_rates['game_mode'] = mode_str
    
    return pick_rates

##### 2. Compute the pick rates for each format and game mode combination

In [None]:
def game_mode_pick_rates(df, format_str, check_missing=True):
    output = pick_rates(df, format_str, 'overall', check_missing)
    game_modes = ['2V2', '3V3']
    for game_mode in game_modes:
        df_game_mode = df.loc[df.game_mode ==  game_mode]
        temp = pick_rates(df_game_mode, format_str, game_mode, check_missing)
        output = pd.concat([output, temp], ignore_index=True)
    
    return output

'''Overall pick rates'''
overall_pick_rates = game_mode_pick_rates(match_df, 'overall')

'''Casual pick rates'''
casual_pick_rates = game_mode_pick_rates(casual, 'casual')

'''Ranked pick rates'''
ranked_pick_rates = game_mode_pick_rates(ranked, 'ranked')

'''League pick rates'''
# Initialize a list of all leagues
league_list = [bronze, silver, gold, platinum, diamond, champion, grand_champ]
# Initialize a list of all league names
league_names = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
# Initialize empty dataframe to store pick rates for each league
league_pick_rates = pd.DataFrame()
for i in range(7):
    league = league_list[i]
    league_name = league_names[i]
    temp = game_mode_pick_rates(league, league_name)
    league_pick_rates = pd.concat([league_pick_rates, temp], ignore_index=True)

'''Combine all pick rates'''
combined_pick_rates = pd.concat([overall_pick_rates, casual_pick_rates, ranked_pick_rates, league_pick_rates], ignore_index=True)
combined_pick_rates.to_csv('compiled_data/combined_pick_rates', index=False)

<a id='char_stats3'></a>
### View statistics in a table
#### Description:
> I will display the statistics in a table to easily view the win rates and pick rates of each character by format and game mode.

#### Procedure:
> Create separate tables to view the win rates and pick rates, and then create a combined table to view both statistics.
1. Create a win rates table
2. Create a pick rates table
3. Create a combined table
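
> As a minimal sketch of the target layout, here is a toy version with one row per character and game mode, and columns grouped by statistic and then by format. The values and the `to_wide` helper are made up, and `pivot` with a list of index columns assumes pandas 1.1 or newer; the cells below build the same shape with `groupby` and `unstack` instead.

```python
import pandas as pd

# Hypothetical long-format results, shaped like the combined win/pick rate data
win_long = pd.DataFrame({
    'character': ['Alpha', 'Alpha', 'Beta', 'Beta'],
    'game_mode': ['2V2', '2V2', '2V2', '2V2'],
    'format': ['casual', 'ranked', 'casual', 'ranked'],
    'win_rate': [0.50, 0.55, 0.48, 0.52],
})
pick_long = win_long.drop(columns='win_rate').assign(pick_rate=[0.10, 0.12, 0.08, 0.07])

def to_wide(long_df, value):
    # One row per (character, game_mode), one column per format
    return long_df.pivot(index=['character', 'game_mode'], columns='format', values=value)

# keys= adds the outer 'stat' level above the 'format' level
toy_table = pd.concat(
    [to_wide(win_long, 'win_rate'), to_wide(pick_long, 'pick_rate')],
    axis=1, keys=['win_rate', 'pick_rate'], names=['stat', 'format'],
)
print(toy_table)
```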

##### 1. Create a win rates table

In [None]:
# Group by 'character', 'game_mode', and 'format' and get win rates
win_rates_table = combined_win_rates.groupby(['character', 'game_mode', 'format']).agg({'win_rate':'unique'})
# Get win rate values
win_rates_table = win_rates_table.win_rate.apply(lambda x: x[0])
# Convert to table for easy viewing
win_rates_table = win_rates_table.unstack(level=2)
# Rearrange columns
columns = ['overall', 'casual', 'ranked', 'bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
win_rates_table = win_rates_table.loc[:, columns]
# Add 'win_rate' level to the columns
win_rates_table.columns = pd.MultiIndex.from_product([['win_rate'], win_rates_table.columns], names=['stat', 'format'])

win_rates_table.head()

##### 2. Create a pick rates table

In [None]:
# Group by 'character', 'game_mode', and 'format' and get pick rates
pick_rates_table = combined_pick_rates.groupby(['character', 'game_mode', 'format']).agg({'pick_rate':'unique'})
# Get pick rate values
pick_rates_table = pick_rates_table.pick_rate.apply(lambda x: x[0])
# Convert to table for easy viewing
pick_rates_table = pick_rates_table.unstack(level=2)
# Rearrange columns
columns = ['overall', 'casual', 'ranked', 'bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
pick_rates_table = pick_rates_table.loc[:, columns]
# Add 'pick_rate' level to the columns
pick_rates_table.columns = pd.MultiIndex.from_product([['pick_rate'], pick_rates_table.columns], names=['stat', 'format'])

pick_rates_table.head()

##### 3. Create a combined table

In [None]:
statistics_table = win_rates_table.merge(pick_rates_table, left_index=True, right_index=True)
# Keep the index when saving: it holds the character and game mode of each row
statistics_table.to_csv('compiled_data/statistics_table')

statistics_table.head()

<a id='inf'></a>
<div class="alert alert-block alert-warning"> 
<h1 align="center">Inferential Statistics</h1>
</div>

<a id='inf1'></a>
### Create functions
#### Description:
> I will create functions to perform inferential statistics. The first is the **difference** function, which computes the difference in a character's statistic between two datasets that I'd like to compare. The second is the **draw_bs_pairs** function, which repeatedly creates bootstrap samples of the two datasets, runs the **difference** function on the samples, and stores the results.

#### Procedure:
> Create a **difference** function, and a **draw_bs_pairs** function.
1. Create **difference** function
  1. Find all entries of each dataset that includes the desired character
  2. Compute the desired statistic for each dataset
  3. Compute the difference between the two datasets
2. Create **draw_bs_pairs** function
  1. Concatenate the two datasets
  2. Sample the concatenated dataset with replacement
  3. Split the dataset to get two samples of appropriate size
  4. Compute and store the difference between the two samples
  5. Repeat
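
> The resampling steps above can be sketched on plain NumPy arrays. This is a simplified illustration, assuming made-up win/loss outcomes for a single character in two groups; the names `group1`, `group2`, and `rng` are hypothetical and the real functions work on the full match dataframes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical win/loss outcomes (1 = win) for one character in two groups
group1 = np.array([1, 1, 0, 1, 0, 1, 1, 0])
group2 = np.array([0, 1, 0, 0, 1, 0])

# Observed difference in win rate between the two groups
observed = group1.mean() - group2.mean()

# Pool the data, as if both groups came from the same distribution
pooled = np.concatenate([group1, group2])

replicates = np.empty(1000)
for i in range(replicates.size):
    # Resample the pooled data with replacement
    sample = rng.choice(pooled, size=pooled.size, replace=True)
    # Split into two samples with the original group sizes and store the difference
    replicates[i] = sample[:group1.size].mean() - sample[group1.size:].mean()

print(observed, replicates.mean())
```

> Because both samples are drawn from the pooled data, the replicates describe the differences I would expect to see if the two groups did not really differ.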

##### 1. Create **difference** function

In [None]:
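def difference(data1, data2, character, stat):
    '''Return the difference in a character's win or pick rate between two datasets'''
    character_data1 = data1.loc[data1.character == character]
    character_data2 = data2.loc[data2.character == character]
    if stat == 'win':
        # Win rate: wins divided by matches played with the character
        stat1 = character_data1.win.sum() / character_data1.win.count()
        stat2 = character_data2.win.sum() / character_data2.win.count()
    elif stat == 'pick':
        # Pick rate: rows for the character divided by all rows
        stat1 = character_data1.win.count() / data1.win.count()
        stat2 = character_data2.win.count() / data2.win.count()
    else:
        raise ValueError("stat must be 'win' or 'pick'")
    
    return stat1 - stat2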

##### 2. Create **draw_bs_pairs** function

In [None]:
def draw_bs_pairs(data1, data2, character, stat, size=1, func=difference):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)
    
    # Concatenate data
    concatenated = pd.concat([data1,data2], ignore_index=True)
    
    # Generate replicates
    for i in range(size):
        # Set random seed
        np.random.seed(seed=i)
        
        # Bootstrap indices
        indices = np.random.choice(concatenated.index, size=len(concatenated))
        indices1 = indices[:len(data1)]
        indices2 = indices[len(data1):]
        
        # Generate bootstrap samples
        bs1 = concatenated.iloc[indices1]
        bs2 = concatenated.iloc[indices2]
        bs_replicates[i] = func(bs1,bs2,character,stat)
    
    return bs_replicates

<a id='inf1'></a>
### Hypothesis tests
#### Description:
> I will perform several hypothesis tests to determine whether the statistics differ significantly between game modes and between leagues. I believe the ranked format is the most important because it is played by serious players who care about the game, so it should be no surprise that the game should be balanced around the ranked format. With that said, I will compare the statistics of ranked 2v2 matches to ranked 3v3 matches to test for a difference between game modes, and then compare the statistics of bronze league matches to champion league matches to test for a difference between leagues.

#### Procedure:
> For each comparison, use the **difference** function to compute the observed difference and the **draw_bs_pairs** function to generate replicates, then compute a p-value for each character.
1. Game modes
  1. Loop through each character
  2. Compare the statistics of ranked 2v2 matches to ranked 3v3 matches
  3. Compute the p-values for each character
2. Leagues
  1. Loop through each character
  2. Compare the statistics of bronze league matches to champion league matches
  3. Compute the p-values for each character
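
> The cells below compute a one-sided p-value and fold it onto the smaller tail. As a hedged sketch of an alternative, and not what those cells compute, a two-sided p-value can be taken as the fraction of replicates at least as extreme as the observed difference. The `two_sided_p_value` helper is hypothetical. It also returns NaN when the observed difference is NaN, which happens when a character has no matches in one of the datasets; a plain comparison against NaN is always False, so the p-value would otherwise come out as 0.

```python
import numpy as np

def two_sided_p_value(replicates, observed):
    """Fraction of replicates at least as extreme as the observed difference."""
    replicates = np.asarray(replicates, dtype=float)
    # Drop replicates where the statistic could not be computed
    valid = replicates[~np.isnan(replicates)]
    # Without an observed value or any valid replicate there is nothing to test
    if np.isnan(observed) or valid.size == 0:
        return np.nan
    return float(np.mean(np.abs(valid) >= abs(observed)))

# Toy replicates, made up for illustration
toy_replicates = [-0.2, -0.1, 0.0, 0.1, 0.2]
print(two_sided_p_value(toy_replicates, 0.15))
```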

##### 1. Game modes

In [None]:
data1 = ranked.loc[ranked.game_mode=='2V2']
data2 = ranked.loc[ranked.game_mode=='3V3']
count = 0
for character in sorted(all_characters):
    empirical_diff = difference(data1, data2, character, 'win')
    replicates = draw_bs_pairs(data1, data2, character, 'win', 1000)
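    # One-sided p-value: fraction of replicates at least as large as the observed difference
    p = np.sum(replicates >= empirical_diff) / len(replicates)
    # Fold onto the smaller tail so negative differences are treated the same way
    if p > 0.5:
        p = 1-p
    # Count the character as significant at the 0.151 threshold
    if p < 0.151:
        count += 1
    print('{} win rate p-value: {}'.format(character,p))

print('Number of characters with statistically significant p-values:', count)
print('Proportion of characters with statistically significant p-values:', count/27)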

In [None]:
data1 = ranked.loc[ranked.game_mode=='2V2']
data2 = ranked.loc[ranked.game_mode=='3V3']
count = 0
for character in sorted(all_characters):
    empirical_diff = difference(data1, data2, character, 'pick')
    replicates = draw_bs_pairs(data1, data2, character, 'pick', 1000)
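    # One-sided p-value: fraction of replicates at least as large as the observed difference
    p = np.sum(replicates >= empirical_diff) / len(replicates)
    # Fold onto the smaller tail so negative differences are treated the same way
    if p > 0.5:
        p = 1-p
    # Count the character as significant at the 0.151 threshold
    if p < 0.151:
        count += 1
    print('{} pick rate p-value: {}'.format(character,p))

print('Number of characters with statistically significant p-values:', count)
print('Proportion of characters with statistically significant p-values:', count/27)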

> 40% of the characters have significantly different win rates between the 2v2 and 3v3 game modes, and 78% have significantly different pick rates. Both proportions are high enough for me to conclude that the two game modes do differ, so game developers should take game mode into account when they implement balance changes.

##### 2. Leagues

In [None]:
data1 = ranked.loc[ranked.league==0]
data2 = ranked.loc[ranked.league==5]
count = 0
for character in sorted(all_characters):
    empirical_diff = difference(data1, data2, character, 'win')
    replicates = draw_bs_pairs(data1, data2, character, 'win', 1000)
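    # One-sided p-value: fraction of replicates at least as large as the observed difference
    p = np.sum(replicates >= empirical_diff) / len(replicates)
    # Fold onto the smaller tail so negative differences are treated the same way
    if p > 0.5:
        p = 1-p
    # Count the character as significant at the 0.151 threshold
    if p < 0.151:
        count += 1
    print('{} win rate p-value: {}'.format(character,p))

print('Number of characters with statistically significant p-values:', count)
print('Proportion of characters with statistically significant p-values:', count/27)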

In [None]:
data1 = ranked.loc[ranked.league==0]
data2 = ranked.loc[ranked.league==5]
count = 0
for character in sorted(all_characters):
    empirical_diff = difference(data1, data2, character, 'pick')
    replicates = draw_bs_pairs(data1, data2, character, 'pick', 1000)
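    # One-sided p-value: fraction of replicates at least as large as the observed difference
    p = np.sum(replicates >= empirical_diff) / len(replicates)
    # Fold onto the smaller tail so negative differences are treated the same way
    if p > 0.5:
        p = 1-p
    # Count the character as significant at the 0.151 threshold
    if p < 0.151:
        count += 1
    print('{} pick rate p-value: {}'.format(character,p))

print('Number of characters with statistically significant p-values:', count)
print('Proportion of characters with statistically significant p-values:', count/27)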

> 48% of the characters have significantly different win rates between bronze league and champion league, and 96% have significantly different pick rates. Both proportions are high enough for me to conclude that lower-level and higher-level players do differ, so game developers should take player level into account when they implement balance changes.

<a id='monitor'></a>
<div class="alert alert-block alert-warning"> 
<h1 align="center">Monitoring Changes</h1>
</div>

<a id='monitor1'></a>
### 
#### Description:
> 
#### Procedure:
> 
1. 

In [None]:
bakko = match_df.loc[match_df.character=='Bakko']
bakko_casual = bakko.loc[bakko.ranked == False]
bakko_ranked = bakko.loc[bakko.ranked == True]
bakko_bronze = bakko_ranked.loc[bakko_ranked.league == 0]
bakko_silver = bakko_ranked.loc[bakko_ranked.league == 1]
bakko_gold = bakko_ranked.loc[bakko_ranked.league == 2]
bakko_platinum = bakko_ranked.loc[bakko_ranked.league == 3]
bakko_diamond = bakko_ranked.loc[bakko_ranked.league == 4]
bakko_champion = bakko_ranked.loc[bakko_ranked.league == 5]
bakko_grand_champ = bakko_ranked.loc[bakko_ranked.league == 6]

'''Overall win rates'''
overall_win_rates = game_mode_win_rates(bakko, 'overall', False)

'''Casual win rates'''
casual_win_rates = game_mode_win_rates(bakko_casual, 'casual', False)

'''Ranked win rates'''
ranked_win_rates = game_mode_win_rates(bakko_ranked, 'ranked', False)

'''League win rates'''
# Initialize a list of all leagues
league_list = [bakko_bronze, bakko_silver, bakko_gold, bakko_platinum, bakko_diamond, bakko_champion, bakko_grand_champ]
# Initialize a list of all league names
league_names = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'champion', 'grand_champ']
# Initialize empty dataframe to store win rates for each league
league_win_rates = pd.DataFrame()
for i in range(7):
    league = league_list[i]
    league_name = league_names[i]
    temp = game_mode_win_rates(league, league_name, False)
    league_win_rates = pd.concat([league_win_rates, temp], ignore_index=True)

In [None]:
'''x_values = date
y_values = overall_win_rates.win_rate
legend = overall_win_rates.format + ' - ' + overall_win_rates.game_mode'''