This kernel is for cleaning & reformatting the various YGO dataframes, and combining them all together. Before I start doing analysis, I want as much data as I can get in one place. However, everything I have so far contains various problems, like missing entries, duplicate entries, and values in the wrong columns.

To start with, I will clean up the dataset found on Kaggle from https://www.kaggle.com/nalfmalf/yugioh-tcg. This appears to be the most up to date database, but has a variety of errors. After that, I will turn to my monster data extracted from YGOHub and YGOPrices, and then try to combine all 3 to fill any gaps.

The dataset found on Kaggle has a wide variety of errors and other problems, including unlabelled columns, duplicate entries, duplicates due to spelling errors, leakage between columns, columns that need splitting, and so on. Unfortunately, due to the size of the dataset, I can only correct for obvious errors, and cannot pick out entries with incorrect information.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
print(os.listdir("../input"))

In [None]:
#Load data
df=pd.read_csv('../input/yugioh-tcg/original/original/cards.csv',header=None,index_col=0)

In [None]:
#Quick inspection. There are no column names!
df.head()

# Fixing the Names
It turns out there are multiple duplicate entries for card names, so these need to be dropped. Thankfully, this is a simple enough problem, and leads to the removal of ~1800 entries.

In [None]:
#Check for duplicates.
#One card is in there 30 times!
df[1].value_counts().head()

In [None]:
#Remove duplicates & check how many
print(df.shape)
df=df.drop_duplicates()
print(df.shape)
#There were apparently ~1800 duplicate entries

In [None]:
#Check I've not missed anything
df[1].value_counts().head()
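
As a side note on what the cells above actually remove: `drop_duplicates()` with no arguments only drops rows that are identical in every column. The sketch below uses a made-up toy frame (not the real `cards.csv`) to show how rows that merely share a name can still be listed afterwards.

```python
import pandas as pd

# Toy data standing in for the unlabelled columns 1 (name) and 2 (passcode).
# All values are invented for illustration.
toy = pd.DataFrame({
    1: ["Card A", "Card A", "Card B", "Card B"],
    2: ["111", "111", "222", "223"],
})

# With no arguments, only rows identical in every column are removed.
exact = toy.drop_duplicates()

# Rows that share a name but differ elsewhere survive and can be listed.
same_name = exact[exact.duplicated(subset=[1], keep=False)]

print(len(exact))
print(same_name)
```

Anything that shows up in `same_name` differs in at least one other column, so it needs a manual decision rather than an automatic drop.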

# Fixing Passcodes

The second column refers to the passcodes written on the cards. This presents several problems, including string entries and duplicate passcodes.

In terms of string entries, 'None' refers to cards which do not have a passcode. These are relatively rare, but do exist. At the end, I'll want to convert these entries to NaN, so I can use this as a numerical column. It probably won't be useful for any statistical analysis, but I'm ready to be surprised! I'll need to convert the NaNs to a numerical value later, but that can wait until analysis time.

The second problem is the 17 entries with 'Monster' instead of passcode. As far as I can tell, their passcode entry was missing, so the entry for column 3 has ended up there instead. These likely do have passcodes, so those will need to be added back too. In these cases, I shifted the entries along by 1, to move other data into the correct column. However, since the rest of the data in these rows is a mess too, it became necessary to add an extra column to hold data from the last column. This will need to be fixed & rearranged later.

The problem of duplicate entries is due to spelling errors or name changes. As far as I can tell, often one entry was for a fan translation, and another for the official translation. In other cases, the passcode was simply entered wrong.

In [None]:
#Problem 2: Missing passcodes
#None is a legitimate entry.
#'Monster' seems to have been a mistake & the column was missed off
#Not sure why the passcodes alone are duplicated, so will check.
df[2].value_counts().head()

In [None]:
#'Monster' entries have the password missing, so want to correct that by shifting the rows
#A quick inspection showed some of these rows have plenty of other problems too
df[df[2]=='Monster'].head()

In [None]:
#Save the names, so I can add passcodes later
passcode_names=df[df[2]=='Monster'][1]

In [None]:
#Shift the rows along by 1 where the passcode is missing.
#The rest of the rows are also a big mess, with other data out of order.
#To avoid pushing it out of the dataframe, I added a dummy column to put it in for now
mask = df[2] == 'Monster'
df['Dummy']=""
c = [2,3,4,5,6,7,8,9,10,11,12,13,'Dummy']
#shift columns
df.loc[mask, c] = df.loc[mask, c].shift(1, axis=1)
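
To make the row-wise shift easier to follow, here is a minimal sketch on a made-up two-row frame. The column names are hypothetical stand-ins for the unlabelled columns, and every column holds strings, as in the real data.

```python
import pandas as pd

# Invented rows: the second one has lost its passcode, so every later
# value sits one column too far to the left.
toy = pd.DataFrame({
    "name": ["Card A", "Card B"],
    "passcode": ["12345678", "Monster"],
    "category": ["Monster", "DARK"],
    "attribute": ["DARK", "4"],
})

# Spare column to catch the value pushed out of the last real column.
toy["Dummy"] = ""

mask = toy["passcode"] == "Monster"
cols = ["passcode", "category", "attribute", "Dummy"]

# Shift only the broken rows one position to the right.
toy.loc[mask, cols] = toy.loc[mask, cols].shift(1, axis=1)

print(toy)
```

The shifted row ends up with a missing passcode, which is why the card names were saved beforehand so the passcodes can be filled in by hand.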

In [None]:
#Something to save me time / effort
for i in list(passcode_names):
    print('df.loc[df[1]==\"'+i+'\",2]')

In [None]:
df.loc[df[1]=="Performapal Card Gardna",2]='37256334'
df.loc[df[1]=="D/D/D Destiny King Zero Laplace",2]='21686473'
df.loc[df[1]=="Odd-Eyes Wing Dragon",2]='58074177'
df.loc[df[1]=="D/D/D Superdoom King Purplish Armageddon",2]='84569886'
df.loc[df[1]=="SPYRAL Sleeper",2]='00035699'
df.loc[df[1]=="Subterror Fiendess",2]='74762582'

#Tokens are not real cards, so drop
df=df[df[1]!="Ancient Gear Token"]

df.loc[df[1]=="SPYRAL GEAR - Last Resort",2]='37433748'
df.loc[df[1]=="Subterror Behemoth Phospheroglacier",2]='01151281'
df.loc[df[1]=="Subterror Behemoth Speleogeist",2]='47556396'
df.loc[df[1]=="Link Disciple",2]='32995276'

#These 2 seem to be prize cards
df.loc[df[1]=="Iron Knight of Revolution",2]=np.nan
df.loc[df[1]=="Sanctity of Dragon",2]=np.nan


df.loc[df[1]=="Hallohallo",2]='77994337'
df.loc[df[1]=="Mudragon of the Swamp",2]='54757758'

#Tokens are not real cards, so drop
df=df[df[1]!="Token"]


df.loc[df[1]=="Heavymetalfoes Electrumite",2]='24094258'
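
The same kind of manual fill could also be written as a name-to-passcode dictionary and a short loop. This is only a sketch on a toy frame; the two passcodes are copied from the cell above and `Other Card` is a made-up row.

```python
import pandas as pd

# Toy frame: column 1 is the name, column 2 the passcode.
toy = pd.DataFrame({
    1: ["Link Disciple", "Hallohallo", "Other Card"],
    2: [None, None, "12345678"],
})

# Name -> passcode pairs taken from the manual corrections above.
fixes = {
    "Link Disciple": "32995276",
    "Hallohallo": "77994337",
}

# Apply each correction to every row carrying that name.
for name, code in fixes.items():
    toy.loc[toy[1] == name, 2] = code

print(toy)
```

Keeping the corrections in one dictionary makes it easier to review them later, at the cost of separating them from the comments about tokens and prize cards.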

In [None]:
#Duplicate passcodes are due to name changes & spelling errors
#This seems to be the case for 22 cards.
#Some are actually just passcode errors.
#Unfortunately, this will therefore probably need to be fixed manually.
df[2].value_counts().head(22)
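
To decide which of the duplicated rows to keep, it can help to see the names side by side. A possible sketch on invented data follows; note that `duplicated` treats missing passcodes as equal to each other, so they are excluded first.

```python
import pandas as pd

# Invented rows: two names share a passcode, two prize cards have none.
toy = pd.DataFrame({
    1: ["Old Name", "New Name", "Unique Card", "Prize Card A", "Prize Card B"],
    2: ["11111111", "11111111", "22222222", None, None],
})

# Ignore rows without a passcode, then list every row whose passcode
# appears more than once (keep=False marks all copies, not just later ones).
has_code = toy[2].notna()
dupes = toy[has_code & toy.duplicated(subset=[2], keep=False)]

print(dupes.sort_values(2))
```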

In [None]:
#Corrections

#Name change & spelling error
df.drop(df[df[2]=='62279666'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='62279666'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='48152161'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='12097275'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='97273514'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='40854824'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='58374719'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='07969770'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='43464884'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='99674361'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='87475570'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='87259933'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='93236220'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='50548657'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='96150936'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='25163979'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='01735088'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='14469229'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='85763457'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='08192327'].iloc[[0]].index,inplace=True)
df.drop(df[df[2]=='33541430'].iloc[[0]].index,inplace=True)

#Passcode error
df.loc[df[1]=="Metaphys Ragnarok",2]='19476824'

In [None]:
#One more check
df[2].value_counts().head()
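
The later conversion of the passcode column to numbers, mentioned at the start of this section, might look roughly like the sketch below. It uses `pd.to_numeric` with `errors="coerce"` on a toy Series; this is an assumption about how the conversion could be done, not something applied in this kernel.

```python
import pandas as pd

# Toy passcodes: a normal one, the 'None' string, one with leading
# zeros, and a genuinely missing value.
passcodes = pd.Series(["37256334", "None", "00035699", None])

# errors="coerce" turns anything that is not a number into NaN.
as_number = pd.to_numeric(passcodes, errors="coerce")

print(as_number)
```

Two caveats: coercion silently hides any other stray strings, and leading zeros are lost once the values are numeric, so it may be worth keeping the string column as well.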

# Fixing Card Category

The third column indicates whether a card is a Monster, Spell or Trap. For my later analysis, I don't want the latter 2 categories, but someone else might find them useful. This seems to be correct, although the Spell/Trap categories are smaller than expected, so I suspect some are missing.

In [None]:
df[3].value_counts()

# Fixing Attribute

Yu-Gi-Oh! has 7 attributes, covering the 4 classical elements, Light, Dark & Divine. There do not appear to be any problems with this column.

In [None]:
df[4].value_counts()

# Fixing Level

Next up is a card's level. I will be curious to see how they decided to handle cards without levels, like Xyz or Links. I suspect Xyz rank was just translated to a level. For Link monsters, it seems this column was originally just left blank, so it has instead been filled by their type info. There is also the 'Number F0/S0' series of monsters, which do not have a printed Rank.

I will shift these columns along, and later think what the best replacement number is.

In [None]:
#Check for any problems
df[5].value_counts()

In [None]:
#Shift values
mask=~df[5].isin(['1','2','3','4','5','6','7','8','9','10','11','12',np.nan])
c = [5,6,7,8,9,10,11,12,13,'Dummy']
#shift columns
df.loc[mask, c] = df.loc[mask, c].shift(1, axis=1)

In [None]:
#Check everything is now correct
df[5].value_counts()
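
For reference, the mask used above can be built without typing out every level. This is a sketch on a toy Series, under the same assumption that valid levels are the strings '1' to '12' and that missing values should be left alone.

```python
import numpy as np
import pandas as pd

# Toy level column: two real levels, one row where type info has
# leaked in, and one missing value.
level = pd.Series(["4", "12", "Cyberse/Link/Effect", np.nan])

# Valid levels as strings, '1' through '12'.
valid_levels = [str(n) for n in range(1, 13)]

# Rows needing a shift: not a valid level and not missing.
needs_shift = ~(level.isin(valid_levels) | level.isna())

print(needs_shift.tolist())
```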

# Fixing Type Info

The 6th column contains text information about whether the card is Normal, Effect, Tuner etc. This will need to be split into separate columns.

The first entry appears to be their Type (Warrior etc), followed by further information. str.split offers a way to split these entries into new columns, resulting in 4 new columns, presumably because there are cards with up to 4 entries in the original.

I then removed the original column from the data, and reordered the columns. Further processing will be needed to turn things into the same form as the YGOHub data, but this can be done later.

A quick check of the 4 new columns I've produced does not give any unusual results, so will assume they are correct.

In [None]:
df[6].value_counts().head()

In [None]:
#Splitting the string into new columns
str_df = df[6].str.split('/',expand=True).add_prefix("Card Type ")
df=pd.concat([df, str_df], axis=1).replace({None:np.nan})
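
To show why the split produces 4 columns, here is a small sketch on made-up type strings. The number of new columns is set by the entry with the most '/'-separated parts, and shorter entries are padded with missing values.

```python
import pandas as pd

# Toy type info: two parts, four parts, and a missing entry.
type_info = pd.Series(["Warrior/Effect", "Dragon/Synchro/Tuner/Effect", None])

# expand=True returns one column per part; add_prefix labels them.
parts = type_info.str.split("/", expand=True).add_prefix("Card Type ")

print(parts)
```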

In [None]:
#Reorder columns & drop the original column
cols=[1,2,3,4,5,6,'Card Type 0', 'Card Type 1',
       'Card Type 2', 'Card Type 3',7,8,9,10,11,12,13,'Dummy']
df=df[cols]
df.drop(6,axis=1,inplace=True)

In [None]:
#Now all these columns need to be re-checked
#Nothing looks like an error
Type0=df['Card Type 0'].value_counts()
Type1=df['Card Type 1'].value_counts()
Type2=df['Card Type 2'].value_counts()
Type3=df['Card Type 3'].value_counts()

print(Type0.index)
print(Type1.index)
print(Type2.index)
print(Type3.index)

# Fixing Stats

Currently Attack / Defense are listed as X/Y in a single column, so I want to split them, as before. As above, I also removed the original column, and renamed them to Attack & Defense.

After this, all the entries need to be checked for unusual values or errors. I know that some stats are listed as ? for example.

For Attack, both ? and X000 will be converted to NaN for now, and dealt with later. They basically indicate special conditions.

It also turns out that pendulum scales are sometimes included in this column, which is wrong. This is made clear by the monsters with an unusually low Attack of <10. Scales above 10 do exist, going up to 13, but are not incorrectly labelled in this dataset.

A pendulum scale of zero also exists, and will need to be checked too. For now, these entries appear to be correct.

Similarly, Defense has to worry about ? and X000 entries.

In [None]:
df[7].value_counts().head()

In [None]:
#Split into two columns
str_df = df[7].str.split('/',expand=True).add_prefix("Stat ")
df=pd.concat([df, str_df], axis=1).replace({None:np.nan})

In [None]:
#Reorder & rename columns
cols=[1,2,3,4,5,'Card Type 0', 'Card Type 1',
       'Card Type 2', 'Card Type 3','Stat 0','Stat 1',8,9,10,11,12,13,'Dummy',7]
df=df[cols]
df.drop(7,axis=1,inplace=True)
df.rename(columns={'Stat 0':'Attack','Stat 1':'Defense'},inplace=True)

In [None]:
#Check the attack
df.Attack.value_counts().index

In [None]:
#Both ? and X000 exist as non-integer values
#These will be converted to NaN for now
df.loc[df['Attack']=="?",'Attack']=np.nan
df.loc[df['Attack']=="X000",'Attack']=np.nan
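
An alternative way to find and convert non-numeric stats is `pd.to_numeric` with `errors="coerce"`. The sketch below runs on a toy Series and is only an illustration; it would also coerce any other unexpected string, so listing the offenders first is a useful check.

```python
import pandas as pd

# Toy Attack values, still stored as strings after the split.
attack = pd.Series(["2500", "?", "X000", "0", None])

# Numeric version: anything unparseable becomes NaN.
attack_num = pd.to_numeric(attack, errors="coerce")

# Entries that were present but could not be parsed as numbers.
non_numeric = attack[attack.notna() & attack_num.isna()]

print(sorted(set(non_numeric)))
print(attack_num)
```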

In [None]:
#A small number of cards have their scales in the stat slot, which is wrong.
#This will need to be corrected.
#Thankfully, this data is (seemingly randomly) in one of the other columns

#Scale 1
df.loc[df[1]=="D/D/D Destiny King Zero Laplace",'Attack']=np.nan
df.loc[df[1]=="D/D/D Destiny King Zero Laplace",'Defense']='0'
df.loc[df[1]=="D/D/D Superdoom King Purplish Armageddon",'Attack']='3500'
df.loc[df[1]=="D/D/D Superdoom King Purplish Armageddon",'Defense']='3000'
df.loc[df[1]=="Yoko-Zuna Sumo Spirit",'Attack']='2400'
df.loc[df[1]=="Yoko-Zuna Sumo Spirit",'Defense']='1000'

#Scale 2
df.loc[df[1]=="Foucault's Cannon",'Attack']='2200'
df.loc[df[1]=="Foucault's Cannon",'Defense']='1200'
df.loc[df[1]=="Mandragon",'Attack']='2500'
df.loc[df[1]=="Mandragon",'Defense']='1000'
df.loc[df[1]=="Risebell the Summoner",'Attack']='800'
df.loc[df[1]=="Risebell the Summoner",'Defense']='800'
df.loc[df[1]=="Hallohallo",'Attack']='800'
df.loc[df[1]=="Hallohallo",'Defense']='600'

#Scale 3
df.loc[df[1]=="Dragon Horn Hunter",'Attack']='2300'
df.loc[df[1]=="Dragon Horn Hunter",'Defense']='1000'
df.loc[df[1]=="Magical Abductor",'Attack']='1700'
df.loc[df[1]=="Magical Abductor",'Defense']='1400'
df.loc[df[1]=="Samurai Cavalry of Reptier",'Attack']='1800'
df.loc[df[1]=="Samurai Cavalry of Reptier",'Defense']='1200'

#Scale 4
df.loc[df[1]=="Ghost Beef",'Attack']='2000'
df.loc[df[1]=="Ghost Beef",'Defense']='1000'
df.loc[df[1]=="Metrognome",'Attack']='1800'
df.loc[df[1]=="Metrognome",'Defense']='1600'
df.loc[df[1]=="Pandora's Jewelry Box",'Attack']='1500'
df.loc[df[1]=="Pandora's Jewelry Box",'Defense']='1500'

#Scale 5
df.loc[df[1]=="P.M. Captor",'Attack']='1800'
df.loc[df[1]=="P.M. Captor",'Defense']='0'
df.loc[df[1]=="Steel Cavalry of Dinon",'Attack']='1600'
df.loc[df[1]=="Steel Cavalry of Dinon",'Defense']='2600'

#Scale 7
df.loc[df[1]=="Dragong",'Attack']='500'
df.loc[df[1]=="Dragong",'Defense']='2100'
df.loc[df[1]=="Flash Knight",'Attack']='1800'
df.loc[df[1]=="Flash Knight",'Defense']='600'
df.loc[df[1]=="Lancephorhynchus",'Attack']='2500'
df.loc[df[1]=="Lancephorhynchus",'Defense']='800'
df.loc[df[1]=="Mild Turkey",'Attack']='1000'
df.loc[df[1]=="Mild Turkey",'Defense']='2000'
df.loc[df[1]=="Zany Zebra",'Attack']='0'
df.loc[df[1]=="Zany Zebra",'Defense']='2000'

#Scale 8
df.loc[df[1]=="Performapal Card Gardna",'Attack']='1000'
df.loc[df[1]=="Performapal Card Gardna",'Defense']='1000'

#Scale 9
df.loc[df[1]=="Kuro-Obi Karate Spirit",'Attack']='2400'
df.loc[df[1]=="Kuro-Obi Karate Spirit",'Defense']='1000'
df.loc[df[1]=="Kai-Den Kendo Spirit",'Attack']='2400'
df.loc[df[1]=="Kai-Den Kendo Spirit",'Defense']='1000'

#Scale 10
df.loc[df[1]=="Odd-Eyes Wing Dragon",'Attack']='3000'
df.loc[df[1]=="Odd-Eyes Wing Dragon",'Defense']='2500'


In [None]:
test_df=df.loc[df['Attack']=='0']
test_df.loc[test_df['Card Type 1']=='Pendulum']
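
The misplaced scales corrected by hand above were spotted through their unusually low Attack. A rough sketch of that check on invented rows is shown here; the threshold of 10 comes from the text, and the card names are made up.

```python
import pandas as pd

# Invented rows: a normal stat, a scale sitting in the Attack slot,
# and a legitimate zero.
toy = pd.DataFrame({
    "Name": ["Normal Card", "Scale In Stat Slot", "Zero Attack Card"],
    "Attack": ["1800", "7", "0"],
})

# Compare numerically; the stored values are still strings.
attack_num = pd.to_numeric(toy["Attack"], errors="coerce")

# Non-zero values below 10 are candidates for a misplaced scale.
suspects = toy[(attack_num > 0) & (attack_num < 10)]

print(suspects)
```

A value of zero is left out here because it can be either a real Attack or a scale of zero, which is why those rows are inspected separately.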

In [None]:
#Check defense column
df.Defense.value_counts().index

In [None]:
#Both ? and X000 exist as non-integer values
#These will be converted to NaN for now
df.loc[df['Defense']=="?",'Defense']=np.nan
df.loc[df['Defense']=="X000",'Defense']=np.nan

In [None]:
#Let's take a moment to rename some columns
df.rename(columns={1:'Name',2:'Passcode',3:'Category',4:'Attribute',5: 'Level',
                  'Card Type 0':'Type'},inplace=True)

In [None]:
df.head()

# Break to Output Data

The remaining columns tend to include text, such as flavour & effect text, but they are a big mess. Column 8 should be the main text, but sometimes includes Link Number, Xyz/Fusion materials and more. Since this seems like quite heavy work, I'm simply going to output the current results & go back later. For comparison with the YGOHub and YGOPrices data, I'll also split it off to just include the Monsters.

In [None]:
#Columns to write
header=['Name','Passcode','Category','Attribute','Level','Type','Card Type 1','Card Type 2',
        'Card Type 3','Attack','Defense']

In [None]:
#Full DF
df.to_csv('YGO_partial.csv',columns=header,index=False)

In [None]:
#Monsters only
df_monster=df[df['Category']=='Monster']

In [None]:
#Monsters only
df_monster.to_csv('YGO_Monster_partial.csv',columns=header,index=False)
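
One thing to watch when these files are read back in: passcodes such as '00035699' are written as text, but `pd.read_csv` will parse them as integers by default and drop the leading zeros. The sketch below shows this with an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Single-row toy frame with a passcode that starts with zeros.
toy = pd.DataFrame({"Name": ["SPYRAL Sleeper"], "Passcode": ["00035699"]})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the passcode is parsed as an integer.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Forcing a string dtype keeps the leading zeros.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"Passcode": str})

print(default_read.loc[0, "Passcode"])
print(string_read.loc[0, "Passcode"])
```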

# Comparing against YGOHub & YGOPrices data

Since the text columns in the Kaggle dataset are a mess, I'm going to turn back to the YGOHub data for now. Ultimately, I want to add missing entries from one to the other, so it's useful to know how many extra entries I can gain from the above set. In the raw form, the two dataframes only differ in rows by 1, but I can't guarantee the other 5948 match.

At first glance, they are mismatched by 43 and 44 entries respectively, but this is partly due to text-reading errors (for example the Celtic cards). Further refining by removing cases where the passcodes match narrows this list down to ~30, which includes names where '&' and '#' characters are missing in the YGOHub data.

In [None]:
YGO_df=pd.read_csv('../input/ygo-data/YGO_Cards_v2.csv',encoding = "ISO-8859-1")

In [None]:
YGO_P_df=pd.read_csv('../input/ygo-prices-data/YGO_Cards_v3.csv',encoding = "ISO-8859-1")

In [None]:
YGO_df.rename(columns={'Unnamed: 0':'Name'},inplace=True)

In [None]:
df_monster[~df_monster.Name.isin(YGO_df.Name.values)].Name

In [None]:
YGO_df[~YGO_df.Name.isin(df_monster.Name.values)].Name

In [None]:
Missing_monster_df1=df_monster[~df_monster.Name.isin(YGO_df.Name.values)]

In [None]:
Missing_monster_df1[~Missing_monster_df1.Passcode.isin(YGO_df.number.values)]

In [None]:
Missing_monster_df2=YGO_df[~YGO_df.Name.isin(df_monster.Name.values)]

In [None]:
len(Missing_monster_df2[~Missing_monster_df2.number.isin(df_monster.Passcode.values)])
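
The two-way comparison above could also be done in one step with an outer merge on passcode and `indicator=True`. This is a sketch on invented frames; in particular, it assumes the other file stores passcodes as integers, which I have not verified for the `number` column, so they are zero-padded to 8 characters before matching.

```python
import pandas as pd

# Invented stand-ins for the Kaggle monsters and the YGOHub data.
kaggle = pd.DataFrame({
    "Name": ["Card A", "Card B", "Card C"],
    "Passcode": ["00000001", "00000002", "00000003"],
})
hub = pd.DataFrame({
    "Name": ["Card A", "Card B (alt spelling)", "Card D"],
    "number": [1, 2, 4],
})

# Assumption: integer passcodes on one side, zero-padded strings on the other.
hub["Passcode"] = hub["number"].astype(str).str.zfill(8)

# Outer merge keeps unmatched rows from both sides and labels their origin.
both = kaggle.merge(hub, on="Passcode", how="outer",
                    suffixes=("_kaggle", "_hub"), indicator=True)

only_kaggle = both[both["_merge"] == "left_only"]
only_hub = both[both["_merge"] == "right_only"]

print(only_kaggle["Name_kaggle"].tolist())
print(only_hub["Name_hub"].tolist())
```

Matching on passcode rather than name means spelling differences such as the 'Card B' row no longer count as missing entries.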