In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [2]:
def split_data(data):
    """Split data based on whether or not it is associated
    with a valid in-game name, then drop empty rows.
    
    Args: 
        data (dataframe): contains in-game names and turnip prices

    Returns:
        valid_name_data (dataframe): rows with an in-game name,
            indexed by that name
        invalid_name_data (dataframe): rows without an in-game name
    """
    # True where an in-game name was provided
    valid_ign_mask = data['In-Game Name'].notna()
    valid_name_data = data[valid_ign_mask].copy()
    valid_name_data.set_index(['In-Game Name'], inplace=True)
    valid_name_data.dropna(how='all', inplace=True)

    # True where the in-game name is missing
    invalid_ign_mask = ~valid_ign_mask
    invalid_name_data = data[invalid_ign_mask].copy()
    invalid_name_data.reset_index(drop=True, inplace=True)
    invalid_name_data.drop('In-Game Name', axis=1, inplace=True)
    invalid_name_data.dropna(how='all', inplace=True)

    return valid_name_data, invalid_name_data

def convert_entry_to_float(entry):
    """Convert an entry to a float 
    
    Args:
        entry (str/float):
            entry to be converted
    
    Returns:
        converted_entry (float): 
            entry as a float, or np.nan if it cannot be parsed
    """
    
    try: 
        converted_entry = float(entry)
    except (TypeError, ValueError):
        # non-numeric strings (e.g. 'x') and None cannot be parsed
        converted_entry = np.nan
    return converted_entry

In [3]:
# load data, skipping the first row since it contains 
# a message as opposed to the column names
data = pd.read_csv('data/week1.csv', skiprows=[0])

In [4]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,Contact(Twitter/Discord),In-Game Name,Island,FC,Buy Price,Mon AM,Mon PM,Tue AM,Tue PM,...,Wed PM,Thu AM,Thu PM,Fri AM,Fri PM,Sat AM,Sat PM,Possible Pattern,Active Dodo Code,Other notes
0,,@KnightCarmine,Maddox,Knight,,102,43.0,40.0,36.0,32.0,...,139,118,146.0,148.0,142.0,61.0,57.0,Small >:(,,
1,,@semefake,Dev,Sootopolis,0308-4250-1245,93,54.0,51.0,46.0,135.0,...,146,135,142.0,45.0,38.0,,,,,
2,,@naniichanx,Levii,Montecki,,108,63.0,60.0,55.0,51.0,...,115,154,202.0,,,,,Small Spike,,tracking prices & probabilities on https://tur...


The goal of this exploration is to model the buying and selling prices of turnips in Animal Crossing: New Horizons. With that in mind, I begin by removing the columns that do not relate to turnip prices, while retaining the In-Game Name column, which I plan to use as a primary key in a MySQL database.

In [5]:
data = data.iloc[:, 2:18]
data.drop(['Island','FC'], axis=1, inplace=True)
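
Positional slicing with iloc depends on the column order of the spreadsheet staying fixed. As a possible alternative, the same selection can be written by label. This is only a sketch: the column names are taken from the output shown in this notebook, `toy` is an empty stand-in frame rather than the real data, and `price_columns` and `keep` are names introduced here for illustration.

```python
import pandas as pd

# Build the price column labels: 'Buy Price', 'Mon AM', 'Mon PM', ..., 'Sat PM'
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat']
price_columns = ['Buy Price'] + [f'{day} {half}' for day in days for half in ('AM', 'PM')]
keep = ['In-Game Name'] + price_columns

# Empty stand-in frame with a layout similar to the raw sheet (assumption)
toy = pd.DataFrame(
    columns=['Contact(Twitter/Discord)', 'In-Game Name', 'Island', 'FC']
    + price_columns
    + ['Possible Pattern']
)

# Select by label instead of by position
selected = toy[keep]
print(list(selected.columns))
```

Selecting by label raises a KeyError if an expected column is missing, which can make a change in the sheet's layout easier to notice than a silent shift in positions.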

I also split the data according to whether or not the player's in-game name was provided, since rows without a name will require a different key within the database.

In [6]:
valid_name_data, invalid_name_data = split_data(data)

After processing, a sample of the data looks fine at first glance, but inspecting the data types reveals an underlying issue: several columns have the object data type, which indicates the presence of mixed data where we expect to see only floats.

In [7]:
display(valid_name_data.head(3))
print(valid_name_data.dtypes)

In-Game Name,Buy Price,Mon AM,Mon PM,Tue AM,Tue PM,Wed AM,Wed PM,Thu AM,Thu PM,Fri AM,Fri PM,Sat AM,Sat PM
Maddox,102,43.0,40.0,36.0,32.0,28,139,118,146.0,148.0,142.0,61.0,57.0
Dev,93,54.0,51.0,46.0,135.0,121,146,135,142.0,45.0,38.0,,
Levii,108,63.0,60.0,55.0,51.0,138,115,154,202.0,,,,


Buy Price     object
Mon AM       float64
Mon PM       float64
Tue AM       float64
Tue PM       float64
Wed AM        object
Wed PM        object
Thu AM        object
Thu PM       float64
Fri AM       float64
Fri PM       float64
Sat AM       float64
Sat PM        object
dtype: object


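This issue appears in 'Buy Price', 'Wed AM', 'Wed PM', 'Thu AM' and 'Sat PM', where ill-formed entries (free-text notes, placeholders such as 'x', or stray punctuation) have caused the turnip prices in those columns to be stored as strings.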

In [8]:
display(valid_name_data.loc[['Tazz', 'Remi']])
data_types = valid_name_data['Buy Price'].apply(type).unique()
print("Buy Price Data Types:", data_types)

In-Game Name,Buy Price,Mon AM,Mon PM,Tue AM,Tue PM,Wed AM,Wed PM,Thu AM,Thu PM,Fri AM,Fri PM,Sat AM,Sat PM
Tazz,101,140.0,130.0,119.0,115.0,,65,forgot to check :(,,81.0,73.0,63.0,142!!
Remi,x,90.0,85.0,125.0,169.0,x,x,124,48.0,77.0,66.0,46.0,61


Buy Price Data Types: [<class 'str'> <class 'float'>]
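
To see exactly which entries are responsible, one option is to coerce each column with pd.to_numeric(errors='coerce') and list the values that were present but failed to parse. The following is a minimal sketch on a toy frame: `find_non_numeric` is a hypothetical helper that is not part of this notebook, and the toy values simply mirror the rows for Tazz and Remi shown above.

```python
import numpy as np
import pandas as pd

def find_non_numeric(df):
    """Return {column: [entries that cannot be parsed as numbers]}.

    Hypothetical helper for illustration only.
    """
    offenders = {}
    for col in df.columns:
        parsed = pd.to_numeric(df[col], errors='coerce')
        # present in the raw data but NaN after coercion -> unparseable entry
        bad = df.loc[df[col].notna() & parsed.isna(), col]
        if not bad.empty:
            offenders[col] = bad.tolist()
    return offenders

# Toy data mirroring the problem rows (not the real dataset)
toy = pd.DataFrame(
    {'Buy Price': ['101', 'x'],
     'Thu AM': ['forgot to check :(', '124'],
     'Thu PM': [np.nan, 48.0],
     'Sat PM': ['142!!', '61']},
    index=['Tazz', 'Remi'],
)
print(find_non_numeric(toy))
```

Entries that are genuinely missing (NaN in the raw data) are not reported, so the result contains only the values that need a decision, such as whether '142!!' should be recovered as 142 or discarded.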


Applying a function that wraps float() in try and except to every entry with applymap converts the entries into floats and replaces unparseable strings with NaNs, which solves the problem. It is important to note that this element-wise approach would be inefficient for larger datasets, where it may be better to target only the columns with issues as opposed to the entire dataframe.

In [9]:
valid_name_data = valid_name_data.applymap(convert_entry_to_float)
invalid_name_data = invalid_name_data.applymap(convert_entry_to_float)
print(valid_name_data.dtypes)

Buy Price    float64
Mon AM       float64
Mon PM       float64
Tue AM       float64
Tue PM       float64
Wed AM       float64
Wed PM       float64
Thu AM       float64
Thu PM       float64
Fri AM       float64
Fri PM       float64
Sat AM       float64
Sat PM       float64
dtype: object
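
The idea of targeting only the columns with issues can be sketched with pd.to_numeric(errors='coerce'). This is a swapped-in, vectorized alternative to the try and except approach, not what the notebook above uses, and its behavior may differ from float() on edge cases. `coerce_non_numeric_columns` is a hypothetical helper name and `toy` is made-up data that mirrors the problem rows. It also avoids applymap, which recent pandas releases (2.1 and later) deprecate in favor of DataFrame.map.

```python
import numpy as np
import pandas as pd

def coerce_non_numeric_columns(df):
    """Convert only the non-numeric columns to floats.

    Entries that cannot be parsed become NaN. Hypothetical helper
    for illustration only.
    """
    out = df.copy()
    # columns already stored as numbers are left untouched
    problem_cols = out.select_dtypes(exclude='number').columns
    for col in problem_cols:
        out[col] = pd.to_numeric(out[col], errors='coerce').astype(float)
    return out

# Toy data mirroring the problem rows (not the real dataset)
toy = pd.DataFrame(
    {'Buy Price': ['101', 'x'],
     'Mon AM': [140.0, 90.0],
     'Sat PM': ['142!!', '61']},
    index=['Tazz', 'Remi'],
)
clean = coerce_non_numeric_columns(toy)
print(clean.dtypes)
```

As with the try and except version, an entry such as '142!!' becomes NaN rather than 142, so any price hidden behind stray punctuation is lost unless it is cleaned up first.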
