In [401]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup

import urllib.request,urllib.parse, urllib.error



The following is a brief script I am working on to scrape SumoDB, select wrestlers based on their year of premiere, and then extract the match data for each wrestler above a certain rank. The goal is to assemble a series of entries in which each wrestler has a set of bashos (tournaments) under their name, along with the win/loss/injury stats. By extracting this info from the plain-text representation on SumoDB and converting it into a more user-friendly table, I can then begin doing some analysis.

As a test case, I will be extracting a single rikishi's match records; the page loaded below is Mokonami Sakae's. Note that the URL requests the text-only view of the site (the &t=1 parameter), as the "table" used in the default view is actually a set of images. It is a little easier to scrape the info this way.

In [905]:
url = input('Enter -')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'html.parser')

Enter -http://sumodb.sumogames.de/Rikishi.aspx?r=1136&t=1


Just parsing the HTML here gives us a rough structure of the page. There are two areas we want to focus on: the biographical info (age, weight, height) and the tournament info. Let's start by extracting the name (look for 'h2' tags).

In [906]:
soup


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head><meta content="Rek8qIcTIMRzIdPb0GEasAfYsajhqVKmwC9WwdCk3U0" name="google-site-verification"/><title>
	Mokonami Sakae Rikishi Information
</title><link href="website.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
window.onload = function()
{
    if (window.winOnLoad) window.winOnLoad();
}
window.onunload = function()
{
    if (window.winOnUnload) window.winOnUnload();
}
</script>
<script src="scripts/x_core.js" type="text/javascript"></script>
<script src="scripts/xselect.js" type="text/javascript"></script>
<!-- Add jQuery library -->
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.7/jquery.min.js" type="text/javascript"></script>
<!-- Add mousewheel plugin (this is optional) -->
<script src="scripts/jquery.mousewheel-3.0.6.pack.js" type="text/javascript"></script>
<!-- Add fancy

In [907]:
Name = soup.find_all('h2')[0].text
Name

'Mokonami Sakae'

The rest of our info is, unfortunately, stored in plain text. Here I did a bit of digging and found a nice snippet from https://medium.com/@chipk215/web-scraping-a-story-of-preformatted-text-df65486a8f15, which gives a quick overview of what is done below: you strip the text, and then find your start and end markers. Web scraping is still a bit new to me, so I figured it's worth putting that out there.

We are now going to take that block of text and set it up as a nice and messy list. Who doesn't love lists? Again, the features here are split in two. Note the 'Hatsu Dohyo' entry, which is the premiere date: essentially, when the rikishi first entered professional sumo.
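The strip-then-slice idea can be tried on a small string without touching the network. This is only a minimal sketch: `slice_section` and the sample text are names I made up for illustration, and the real page has more lines than this.

```python
def slice_section(text, start_marker, end_marker=None):
    """Return text from start_marker up to (not including) end_marker."""
    text = text.strip()
    start = text.find(start_marker)
    if start == -1:
        return ''  # start marker missing: nothing to extract
    end = text.find(end_marker, start) if end_marker else -1
    # str.find returns -1 when the marker is absent; keep the rest of the text then
    return text[start:] if end == -1 else text[start:end]


# Hypothetical sample, shaped like the <pre> text shown above
sample = (
    "\r\nsome header\r\n"
    "Highest Rank     Jonokuchi 14\r\n"
    "Hatsu Dohyo      1960.01\r\n"
    "\r\n"
    "1960.01 Mz                      0-0\r\n"
    "1960.03 Jk14e                   0-0-8\r\n"
)
section_lines = slice_section(sample, 'Highest Rank', '1960.03 ').splitlines()
```

Checking for -1 before slicing matters: using a failed `find` result directly as a slice end would quietly drop the last character of the text.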

In [908]:
page_text

'Highest Rank     Jonokuchi 14\r\nShusshin         -\r\nHeya             -\r\nShikona          Katoyama\r\nHatsu Dohyo      1960.01\r\nIntai            unknown\r\n\r\nCareer Record    0-0-8/0 (2 basho)\r\n  In Jonokuchi   0-0-8/0 (1 basho)\r\n  In Mae-zumo    1 basho\r\n\r\nKatoyama\r\n1960.01 Mz                      0-0\r\n1960.03 Jk14e                   0-0-8'

In [909]:
page_text[start_position:].splitlines()

['Highest Rank     Jonokuchi 14',
 'Shusshin         -',
 'Heya             -',
 'Shikona          Katoyama',
 'Hatsu Dohyo      1960.01',
 'Intai            unknown',
 '',
 'Career Record    0-0-8/0 (2 basho)',
 '  In Jonokuchi   0-0-8/0 (1 basho)',
 '  In Mae-zumo    1 basho',
 '',
 'Katoyama',
 '1960.01 Mz                      0-0',
 '1960.03 Jk14e                   0-0-8']

In [910]:
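pre = soup.find('pre')
start_section_text = 'Highest Rank' # first label of the biographical block
end_section_text = '2001.05 ' # date of next basho...

page_text = pre.text.strip()
start_position = page_text.find(start_section_text)
end_position = page_text.find(end_section_text) # computed for reference; not used in the slice below

# Keep everything from the 'Highest Rank' label to the end of the <pre> block
table_text = page_text[start_position:]

lines = table_text.splitlines()
lines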

['Highest Rank     Maegashira 6',
 'Real Name        GANBOLD Bazarsad - Ishikawa Sakae(2009.11.18)',
 'Birth Date       April 5, 1984',
 'Shusshin         Mongolia, Ulan-Bator - Mongolia, Zavkhan',
 'Height and Weight186.5 cm 152 kg',
 'Heya             Tatsunami',
 'Shikona          Mokonami Sakae',
 'Hatsu Dohyo      2001.03',
 'Intai            2011.05',
 '',
 'Career Record    341-304/645 (60 basho)',
 '  In Makuuchi    71-79/150 (10 basho)',
 '   As Maegashira 71-79/150 (10 basho)',
 '  In Juryo       144-141/285 (19 basho)',
 '  In Makushita   54-30/84 (12 basho)',
 '  In Sandanme    42-42/84 (12 basho)',
 '  In Jonidan     25-10/35 (5 basho)',
 '  In Jonokuchi   5-2/7 (1 basho)',
 '  In Mae-zumo    1 basho',
 '',
 'Mokonami Sakae',
 '2001.03 Mz                      0-0',
 '2001.05 Jk23w   *--OO-O-O--O*-- 5-2',
 '2001.07 Jd93e   -O-O-O-O-*-*--O 5-2',
 '2001.09 Jd51e   -OO--O-**--O--O 5-2',
 '2001.11 Jd10w   O--O-*O-O-*---* 4-3',
 '2002.01 Sd93e   O-O--*-O*-*-*-- 3-4',
 '2002.03 J

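As you might have noticed, sumo rankings are given in an upper case + number + lower case notation (for example Jk23w), denoting a rikishi's rank. Also note that bashos (tournaments) are held every other month, and that match records are stored here as the symbols *, O and -, which are losses, wins, and days without a bout respectively. A - can be a withdrawal (due to injury), but in the lower divisions, where a rikishi only fights 7 bouts over the 15 days, most of them are simply days off. The final column summarizes the tournament, e.g. 11-4 denotes 11 wins and 4 losses; a third number, as in 0-0-8, appears to count missed bouts.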
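To make the rank notation concrete, here is a small sketch that splits a rank string into its parts. The division abbreviations are read off the sample pages above; `parse_rank` and `DIVISIONS` are illustrative names of my own, and I am assuming the trailing e/w marks the east/west side of the ranking.

```python
import re

# Division abbreviations as they appear in the rank column of the text view
DIVISIONS = {
    'Mz': 'Mae-zumo', 'Jk': 'Jonokuchi', 'Jd': 'Jonidan', 'Sd': 'Sandanme',
    'Ms': 'Makushita', 'J': 'Juryo', 'M': 'Maegashira',
}

# One capital letter, an optional lower-case letter, an optional number, an optional side
RANK_RE = re.compile(r'^([A-Z][a-z]?)(\d*)([ew]?)')


def parse_rank(rank):
    """Split e.g. 'Jk23w' into (division, number, side)."""
    match = RANK_RE.match(rank)
    if match is None:
        raise ValueError('unrecognised rank: %r' % rank)
    abbrev, number, side = match.groups()
    division = DIVISIONS.get(abbrev, abbrev)  # unknown abbreviations pass through
    return division, (int(number) if number else None), (side or None)
```

Because the second letter is optional, 'J10w' (Juryo) and 'Jd9e' (Jonidan) are told apart without any special casing.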

In [912]:
Name

'Mokonami Sakae'

In [861]:
lines[::-1].index(' ')


ValueError: ' ' is not in list

In [936]:
for i in range(len(lines)):
    res = lines[i].find(Name.split()[0])
    if res == 0:
        print(i)
        break

20


In [938]:
lines[record_idx:]

['Mokonami Sakae',
 '2001.03 Mz                      0-0',
 '2001.05 Jk23w   *--OO-O-O--O*-- 5-2',
 '2001.07 Jd93e   -O-O-O-O-*-*--O 5-2',
 '2001.09 Jd51e   -OO--O-**--O--O 5-2',
 '2001.11 Jd10w   O--O-*O-O-*---* 4-3',
 '2002.01 Sd93e   O-O--*-O*-*-*-- 3-4',
 '2002.03 Jd9e    O-*--OO--O-O--* 5-2',
 '2002.05 Sd72e   O-O--*O--O-*--O 5-2',
 '2002.07 Sd41w   -**-*--O-*-*--* 1-6',
 '2002.09 Sd76w   -O-O-O*-O-*--O- 5-2',
 '2002.11 Sd47w   -O-*O--**--O-*- 3-4',
 '2003.01 Sd65e   -*-O*--O-**--O- 3-4',
 '2003.03 Sd82e   *--O-*-*-**---O 2-5',
 '2003.05 Jd12e   -O-O-O*--OO---O 6-1',
 '2003.07 Sd50e   -*O-*-*-O-O--O- 4-3',
 '2003.09 Sd37e   O--**-O--O*--*- 3-4',
 '2003.11 Sd52e   -*-O-*O-O-*--O- 4-3',
 '2004.01 Sd33w   *--O-O-O-OO---* 5-2',
 '2004.03 Sd8e    -O-O-*-O*--O--* 4-3',
 '2004.05 Ms56e   -OO-*--O*-*-O-- 4-3',
 '2004.07 Ms47e   -*-OO--O-O*---* 4-3',
 '2004.09 Ms39w   *-O--O-*-*O-O-- 4-3',
 '2004.11 Ms35e   -*-OO-*-*--O-*- 3-4',
 '2005.01 Ms43e   O-*--O-O-O-*--* 4-3',
 '2005.03 Ms35e   -**

In [916]:

record_idx = lines.index(Name)

In [884]:
x= 'Weight182'
x.split('t')

['Weigh', '182']

In [885]:
lines

['Highest Rank     Jonokuchi 14',
 'Shusshin         -',
 'Heya             -',
 'Shikona          Katoyama',
 'Hatsu Dohyo      1960.01',
 'Intai            unknown',
 '',
 'Career Record    0-0-8/0 (2 basho)',
 '  In Jonokuchi   0-0-8/0 (1 basho)',
 '  In Mae-zumo    1 basho',
 '',
 'Katoyama',
 '1960.01 Mz                      0-0',
 '1960.03 Jk14e                   0-0-8']

We can quickly get the biographical info by exploiting the structure of this page: the name repeats a few times, and we can use these repeats as handy landmarks to denote where the biographical info ends and the tournament records begin. This is really useful, since rikishi can earn a variety of prizes in a tournament and the career record (at the top) varies from rikishi to rikishi, so fixed line numbers are unreliable. By using the names, we can split this list into a few easier-to-manage snippets.
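Here is a rough sketch of the landmark idea on a shortened copy of the page. The helper names are mine, and the 17-character label column is an assumption based on the sample output above (it is what makes 'Height and Weight186.5' splittable), so treat it as a guess rather than a documented layout.

```python
LABEL_WIDTH = 17  # assumed width of the label column, judging by the sample output


def split_on_name(lines, name):
    """Split page lines into (bio_lines, basho_lines) at the first line starting with the shikona."""
    first_word = name.split()[0]
    for i, line in enumerate(lines):
        if line.startswith(first_word):
            return lines[:i], lines[i + 1:]
    return lines, []  # landmark not found: treat everything as bio


def bio_to_dict(bio_lines):
    """Turn 'Label   value' lines into a dict using the fixed-width label column."""
    bio = {}
    for line in bio_lines:
        if not line.strip():
            continue  # skip the blank separator lines
        bio[line[:LABEL_WIDTH].strip()] = line[LABEL_WIDTH:].strip()
    return bio


def sample_line(label, value):
    """Build a sample line with the label padded to the assumed column width."""
    return label.ljust(LABEL_WIDTH) + value


sample_lines = [
    sample_line('Highest Rank', 'Maegashira 6'),
    sample_line('Height and Weight', '186.5 cm 152 kg'),
    sample_line('Shikona', 'Mokonami Sakae'),
    sample_line('Hatsu Dohyo', '2001.03'),
    '',
    'Mokonami Sakae',
    '2001.03 Mz                      0-0',
    '2001.05 Jk23w   *--OO-O-O--O*-- 5-2',
]
bio_lines, basho_lines = split_on_name(sample_lines, 'Mokonami Sakae')
bio = bio_to_dict(bio_lines)
```

Looking fields up by label rather than by position means a page with a missing line (no height and weight, say) does not shift everything else.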

In [496]:
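def get_bio_info(first_snippet):
    # Premiere date comes from the 'Hatsu Dohyo' line, e.g. 'Hatsu Dohyo      2001.03'
    premiere_yr = next(l for l in first_snippet if l.startswith('Hatsu Dohyo')).split()[-1]
    current_height, current_weight = get_height(first_snippet) # get_height returns (height, weight)
    return premiere_yr, current_height, current_weight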
    

In [497]:

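def get_height(raw_input):
    # Locate the physical-stats line instead of assuming it is always at index 4;
    # sparse pages (e.g. Katoyama above) have no such line at all.
    phys = [l for l in raw_input if l.startswith('Height and Weight')]
    if not phys:
        return (None, None)
    phys_line = phys[0].split()
    weight = phys_line[-2] # weight in kg

    if phys_line[2] != 'Weight':
        # The label and the value got stuck together ('Weight186.5');
        # split on the final 't' to get the height in cm
        height = phys_line[2].split('t')[-1]
    else:
        height = phys_line[3]

    return (height, weight)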



In [923]:
matches = lines[record_idx+1:]

In [924]:
matches[1:2][0].split()

['2001.05', 'Jk23w', '*--OO-O-O--O*--', '5-2']

In [509]:
matches[1:]

['1970.05 Jk1w    -*-OO-O-*-O--O- 5-2',
 '1970.07 Jd47e   -O*--OO-O--*--* 4-3',
 '1970.09 Jd37e   -*-*-OO-O--**-- 3-4',
 'Takanobori Hiroshi#',
 '1970.11 Jd44e   O--O-O*-O--O--O 6-1',
 '1971.01 Sd70w   -**-*-O-*-*-O-- 2-5',
 '1971.03 Jd14e   -O-*-OO--O-**-- 4-3',
 '1971.05 Sd74e   *--*-O-*O--O-O- 4-3',
 '1971.07 Sd63e   -*O--**-*--O--O 3-4',
 '1971.09 Sd78w   O-O-*--*-**---* 2-5',
 '1971.11 Jd22e   -OO-O--OO-O-*-- 6-1',
 '1972.01 Sd59e   -*-O-O-*O--O-*- 4-3',
 '1972.03 Sd49e   -*-**-*-*-O--O- 2-5',
 '1972.05 Sd73e   *--*-O-O*-O---* 3-4',
 '1972.07 Jd2e    *-*-O--OO-O--O- 5-2',
 '1972.09 Sd51e   -OO-*-O--*-O-*- 4-3',
 '1972.11 Sd43e   -**-O--O*--OO-- 4-3',
 '1973.01 Sd31e   -O-OO-*--O-O*-- 5-2',
 '1973.03 Sd4w    -*-*O--*-O-*--* 2-5',
 '1973.05 Sd22w   *-O--*-O*-*---O 3-4',
 'Izumiya',
 '1973.07 Sd32e   O-O-*--O-*-*-*- 3-4',
 '1973.09 Sd44w   -O-O-*O-*-O---O 5-2',
 '1973.11 Sd17w   *-*--O-*O-O---* 3-4',
 '1974.01 Sd30e   -O-*-O*--*O-*-- 3-4',
 '1974.03 Sd41w   O--O-*-*O--O--* 4-3',
 '19

In [904]:
matches

['1960.01 Mz                      0-0',
 '1960.03 Jk14e                   0-0-8']

Now for the fun task: we can convert that messy-looking pre-formatted text table into something a bit more usable. We are going to ignore any honours (e.g. Yusho) for now. The next two functions extract a full record and then trim the record based on a rank cutoff.
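Before the list-based version, here is a small sketch of what parsing a single row could look like, with the final score split into numbers so it can be checked against the day-by-day record. `parse_basho_row` is a hypothetical helper, and reading the third score number as missed bouts is my assumption.

```python
def parse_basho_row(row):
    """Parse one basho row into a dict, or return None for lines that are not complete rows."""
    fields = row.split()
    if len(fields) < 4:
        return None  # shikona-change lines, or rows with no daily record
    date, rank, record, score = fields[:4]  # anything after the score is ignored
    counts = [int(n) for n in score.split('-')]
    return {
        'date': date,
        'rank': rank,
        'record': record,
        'wins': counts[0],
        'losses': counts[1],
        # assumed meaning of the optional third number: bouts missed
        'absences': counts[2] if len(counts) > 2 else 0,
    }


row = parse_basho_row('2001.05 Jk23w   *--OO-O-O--O*-- 5-2')
# The score column should agree with the symbols in the daily record
consistent = (row['record'].count('O') == row['wins']
              and row['record'].count('*') == row['losses'])
```

A consistency check like this is a cheap way to catch rows where the columns did not split the way we expected.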

In [925]:
matches

['2001.03 Mz                      0-0',
 '2001.05 Jk23w   *--OO-O-O--O*-- 5-2',
 '2001.07 Jd93e   -O-O-O-O-*-*--O 5-2',
 '2001.09 Jd51e   -OO--O-**--O--O 5-2',
 '2001.11 Jd10w   O--O-*O-O-*---* 4-3',
 '2002.01 Sd93e   O-O--*-O*-*-*-- 3-4',
 '2002.03 Jd9e    O-*--OO--O-O--* 5-2',
 '2002.05 Sd72e   O-O--*O--O-*--O 5-2',
 '2002.07 Sd41w   -**-*--O-*-*--* 1-6',
 '2002.09 Sd76w   -O-O-O*-O-*--O- 5-2',
 '2002.11 Sd47w   -O-*O--**--O-*- 3-4',
 '2003.01 Sd65e   -*-O*--O-**--O- 3-4',
 '2003.03 Sd82e   *--O-*-*-**---O 2-5',
 '2003.05 Jd12e   -O-O-O*--OO---O 6-1',
 '2003.07 Sd50e   -*O-*-*-O-O--O- 4-3',
 '2003.09 Sd37e   O--**-O--O*--*- 3-4',
 '2003.11 Sd52e   -*-O-*O-O-*--O- 4-3',
 '2004.01 Sd33w   *--O-O-O-OO---* 5-2',
 '2004.03 Sd8e    -O-O-*-O*--O--* 4-3',
 '2004.05 Ms56e   -OO-*--O*-*-O-- 4-3',
 '2004.07 Ms47e   -*-OO--O-O*---* 4-3',
 '2004.09 Ms39w   *-O--O-*-*O-O-- 4-3',
 '2004.11 Ms35e   -*-OO-*-*--O-*- 3-4',
 '2005.01 Ms43e   O-*--O-O-O-*--* 4-3',
 '2005.03 Ms35e   -**--O-*O--O-O- 4-3',


In [926]:
def extract_record(table_of_bashos):
    # Split each basho row into its four whitespace-separated fields:
    # date, rank, day-by-day record, and final score.
    date = []
    rank = []
    record = []
    final_score = []
    
    for entry in table_of_bashos:
        values = entry.split()
        if len(values)<4:
            # Shikona-change lines and rows without a daily record have fewer than four fields
            print('Name change/Missing Data')
        else:
            date.append(values[0])
            rank.append(values[1])
            record.append(values[2])
            final_score.append(values[3])
    return(date,rank,record,final_score)

date,rank,record,final_score = extract_record(matches[1:])

Name change/Missing Data


In [932]:
len(rank)

59

In [933]:
rank

['Jk23w',
 'Jd93e',
 'Jd51e',
 'Jd10w',
 'Sd93e',
 'Jd9e',
 'Sd72e',
 'Sd41w',
 'Sd76w',
 'Sd47w',
 'Sd65e',
 'Sd82e',
 'Jd12e',
 'Sd50e',
 'Sd37e',
 'Sd52e',
 'Sd33w',
 'Sd8e',
 'Ms56e',
 'Ms47e',
 'Ms39w',
 'Ms35e',
 'Ms43e',
 'Ms35e',
 'Ms30e',
 'Ms19w',
 'Ms8w',
 'Ms2e',
 'J10w',
 'J3w',
 'J1e',
 'J5w',
 'J12w',
 'J14e',
 'Ms4w',
 'Ms1e',
 'J13w',
 'J10e',
 'J7e',
 'J10e',
 'J7e',
 'J5w',
 'J6w',
 'J12e',
 'J7w',
 'J13w',
 'J12w',
 'J6e',
 'J1w',
 'M7w',
 'M9e',
 'M13w',
 'M11w',
 'M15w',
 'M11e',
 'M10e',
 'M6e',
 'M11e',
 'M12w']

In [901]:
len(matches[1:][0].split())

3

In [524]:
record

['-*-OO-O-*-O--O-',
 '-O*--OO-O--*--*',
 '-*-*-OO-O--**--',
 'O--O-O*-O--O--O',
 '-**-*-O-*-*-O--',
 '-O-*-OO--O-**--',
 '*--*-O-*O--O-O-',
 '-*O--**-*--O--O',
 'O-O-*--*-**---*',
 '-OO-O--OO-O-*--',
 '-*-O-O-*O--O-*-',
 '-*-**-*-*-O--O-',
 '*--*-O-O*-O---*',
 '*-*-O--OO-O--O-',
 '-OO-*-O--*-O-*-',
 '-**-O--O*--OO--',
 '-O-OO-*--O-O*--',
 '-*-*O--*-O-*--*',
 '*-O--*-O*-*---O',
 'O-O-*--O-*-*-*-',
 '-O-O-*O-*-O---O',
 '*-*--O-*O-O---*',
 '-O-*-O*--*O-*--',
 'O--O-*-*O--O--*',
 '-O-*-*-*-------',
 '---------------',
 '---------------',
 '---------------']

One motivation for trimming based on rank is that Juryo is the first division where rikishi compete in 15 bouts per basho, not 7. Together with Makuuchi above it, this is the top tier of sumo, and while the occasional top-ranked wrestler falls into Juryo, they rarely fall below it unless their career is practically over. We will use Juryo as our cutoff, and it is where I will consider a rikishi/wrestler to have finally reached "pro" status.
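One way to express the cutoff is a pattern that only matches ranks at Juryo or above. This is a sketch under an assumption: only J and M ranks appear in the pages shown here, and I am guessing that the higher ranks are abbreviated K, S, O and Y followed by a number. The function names are my own.

```python
import re

# Juryo is 'J' + number. The top-division abbreviations M, K, S, O, Y are an
# assumption; only 'J' and 'M' ranks appear in the sample output above.
SEKITORI_RE = re.compile(r'^[JMKSOY]\d')


def is_sekitori(rank):
    """True for ranks at Juryo or above, where a full 15-bout basho is fought."""
    return bool(SEKITORI_RE.match(rank))


def first_sekitori_index(ranks):
    """Index of the first Juryo-or-above entry, or None if there is none."""
    for i, rank in enumerate(ranks):
        if is_sekitori(rank):
            return i
    return None
```

Requiring a digit straight after the capital letter is what keeps 'Jd', 'Jk', 'Sd' and 'Ms' out, so no separate exclusion list is needed.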

In [746]:
rank

['Jk1w',
 'Jd47e',
 'Jd37e',
 'Jd44e',
 'Sd70w',
 'Jd14e',
 'Sd74e',
 'Sd63e',
 'Sd78w',
 'Jd22e',
 'Sd59e',
 'Sd49e',
 'Sd73e',
 'Jd2e',
 'Sd51e',
 'Sd43e',
 'Sd31e',
 'Sd4w',
 'Sd22w',
 'Sd32e',
 'Sd44w',
 'Sd17w',
 'Sd30e',
 'Sd41w',
 'Sd32e',
 'Sd56w',
 'Jd13e',
 'Jd61w']

In [770]:
entry

'Jd61w'

In [773]:
entry[1:2] != 'd'
entry[1:2] != 'k'

True

In [774]:
if entry[:1] =='J':
    print('first pass')
    if entry[1:2] != 'd' and entry[1:2] !='k':
        print('second pass')

first pass


In [775]:
row = 0
while row < len(rank):
    print(row,len(rank))
    entry = rank[row]
    if ((entry[:2] != 'Jd' and entry[:2] != 'Jk') and entry[:1] == 'J'):
        print(row, 'found Juryo')
        row +=1
    else:
        row +=1
        print(row)

0 28
1
1 28
2
2 28
3
3 28
4
4 28
5
5 28
6
6 28
7
7 28
8
8 28
9
9 28
10
10 28
11
11 28
12
12 28
13
13 28
14
14 28
15
15 28
16
16 28
17
17 28
18
18 28
19
19 28
20
20 28
21
21 28
22
22 28
23
23 28
24
24 28
25
25 28
26
26 28
27
27 28
28


In [789]:
True or False or True

True

In [847]:
test_list = [
 'Sd74e',
 'Sd63e',
 'Sd78w',
 'Jd22e','J12e','J5w','Jd12e','J5w']

filter_for_rank(test_list,'Juryo')

4

In [778]:
record
row = 0
entry = rank[row]

while (entry[:2] != 'Jd' and entry[:2] != 'Jk' and entry[:1] == 'J') and row < len(rank):
        # reached a juryo rank
        row += 1
        entry = rank[row]
        print(row)
        
row

0

In [590]:
def filter_rank(table_of_bashos, rank):
    # Finds where to cut the extracted table so it holds all bashos from a certain rank onward
    # table_of_bashos is the list of rank strings (e.g. 'Jd93e', 'J10w', 'M7w')
    # rank is a string representation of a rank: 'Juryo' or 'Maegashira'
    
    # Returns row, the index of the first basho at that rank
    if rank == 'Juryo':
        # Basically find the first 'Juryo' basho, so everything before it can be sliced off.
        row = 0
        entry = table_of_bashos[row]
        # Keep walking while the entry is NOT Juryo. Juryo is 'J' + number, but
        # Jonokuchi ('Jk') and Jonidan ('Jd') also start with 'J'.
        while not (entry[:1] == 'J' and entry[:2] != 'Jk' and entry[:2] != 'Jd'):
            
            print(entry,row, len(table_of_bashos))
            
            if row >= len(table_of_bashos)-1:
                break
            else:
                row += 1
                entry = table_of_bashos[row]

        return(row) # index where first Juryo basho is recorded (last index if there is none)
    
    elif rank == 'Maegashira':
        row = 0
        # Makushita != Maegashira, so we denote the latter as M(number) vs Ms(number)
        while row < len(table_of_bashos)-1 and not (table_of_bashos[row][:1] == 'M' and table_of_bashos[row][1:2].isdigit()):
            row += 1
            
        return(row) # index where first Maegashira basho is recorded (last index if there is none)
    else:
        print('Invalid rank, try again')


In [825]:
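def filter_for_rank(rank_list, rank_to_use):
    # Returns the index of the first Juryo entry in rank_list, or -2 if there is none

    row = -2 # sentinel: rank never reached (also returned for an unsupported rank_to_use)
    if rank_to_use =='Juryo':
        for i in range(len(rank_list)):
            res = rank_list[i].find('J')
            # Juryo is a bare 'J' + number; rule out the other ranks that contain 'J'
            if res != -1 and rank_list[i].find('Jd') == -1 and rank_list[i].find('Jk') == -1 and rank_list[i].find('Js') == -1:
                # found juryo
                row = i
                break
        
    return(row)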

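filter_rank just gives us a nice little index with which to trim the lists we've been working with. We use it to slice the previous lists, so that the next few steps involve only the ranks we are interested in. The later filter_for_rank does the same job, but returns -2 when the rikishi never reached the cutoff rank.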
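A small sketch of the trimming step, with the -2 "never reached the rank" value handled explicitly. `trim_from` is an illustrative helper of my own, not something the notebook defines.

```python
def trim_from(index, *columns):
    """Slice every column from index onward; return None for the -2 sentinel."""
    if index == -2:
        # Used directly as a slice start, -2 would silently keep the last two rows
        return None
    return tuple(col[index:] for col in columns)


dates = ['2005.01', '2005.03', '2005.05', '2005.07']
ranks = ['Ms2e', 'J10w', 'J3w', 'J1e']
trimmed = trim_from(1, dates, ranks)
```

Returning None for the sentinel forces the caller to decide what to do with rikishi who never reached the cutoff, instead of quietly passing two unrelated rows downstream.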

In [826]:
result = filter_for_rank(rank, 'Juryo')


    # IGNORE THIS RIKISHI


In [827]:
Jidx = filter_rank(rank,'Juryo')
# use Jidx to filter for Juryo only matches.

date_f,rank_f,record_f,final_score_f = date[Jidx:],rank[Jidx:],record[Jidx:],final_score[Jidx:]


Jk1w 0 28
Jd47e 1 28
Jd37e 2 28
Jd44e 3 28
Sd70w 4 28
Jd14e 5 28
Sd74e 6 28
Sd63e 7 28
Sd78w 8 28
Jd22e 9 28
Sd59e 10 28
Sd49e 11 28
Sd73e 12 28
Jd2e 13 28
Sd51e 14 28
Sd43e 15 28
Sd31e 16 28
Sd4w 17 28
Sd22w 18 28
Sd32e 19 28
Sd44w 20 28
Sd17w 21 28
Sd30e 22 28
Sd41w 23 28
Sd32e 24 28
Sd56w 25 28
Jd13e 26 28
Jd61w 27 28


In [741]:
def translate_record(record, concat):
    # Inputs: record is the string of wins, losses, and withdrawals in a given basho
    # We one-hot these symbols into three arrays of 0s and 1s, one slot per day
    
    wins = np.zeros(15)
    losses = np.zeros(15)
    withdrawals = np.zeros(15)
    
    for i in range(len(record)):
        if record[i] == 'O' or record[i] == '%':
            wins[i] = 1
        elif record[i] == '*' or record[i] =='#':
            losses[i] = 1
        elif record[i] == '-':
            withdrawals[i]= 1

    if concat:
        # Stick them all together and return one 45-element array
        return(np.concatenate((wins,losses,withdrawals)))
    else:
        # Return as three separate 15-element arrays
        return(wins,losses,withdrawals)


translate_record just takes the simple inputs I described above and converts them into either a 45-element array (useful for ML) or three 15-element arrays of wins, losses, and withdrawals. This is toggled with the "concat" parameter. With this said and done, we are now able to take the full list of bouts on a single HTML page and convert it into a usable array, one row per basho.
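As a sanity check on the 45-element layout (wins, then losses, then withdrawals, 15 days each), here is a sketch of the reverse direction. `decode_record` is a hypothetical helper; it uses plain indexing, so it should accept a list or a NumPy array alike.

```python
def decode_record(flat):
    """Invert the 45-element encoding: wins | losses | withdrawals, 15 days each."""
    if len(flat) != 45:
        raise ValueError('expected 45 values, got %d' % len(flat))
    wins, losses, absent = flat[0:15], flat[15:30], flat[30:45]
    symbols = []
    for w, l, a in zip(wins, losses, absent):
        if w:
            symbols.append('O')
        elif l:
            symbols.append('*')
        elif a:
            symbols.append('-')
        else:
            symbols.append('?')  # no flag set for this day
    return ''.join(symbols)


# Encode one record by hand in the same layout, then decode it again
record = '*--OO-O-O--O*--'
flat = ([int(c == 'O') for c in record]
        + [int(c == '*') for c in record]
        + [int(c == '-') for c in record])
round_trip = decode_record(flat)
```

The round trip is not perfectly lossless for real data, since the '%' and '#' variants are folded into 'O' and '*' by the encoder.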

To scale this up to the rikishi we are interested in, we can exploit SumoDB's ability to sort its rikishi list by 'Intai' (retirement date), and then filter on 'Hatsu Dohyo' (premiere date). This amounts to a nice little start date for all our rikishi.

In [742]:
record_arr =np.zeros((len(record_f),3,15)) # one 3x15 slab per basho
for i in range(len(record_f)):
    basho_record = record_f[i]
    record_arr[i,0,:],record_arr[i,1,:],record_arr[i,2,:]= translate_record(basho_record,concat = False)



In [None]:
 # To scrape the full table, first get reference list (will select based off of that...)

In [227]:
url_ref = 'http://sumodb.sumogames.de/Rikishi.aspx?shikona=&heya=-1&shusshin=-1&b=-1&high=-1&hd=-1&entry=-1&intai=-1&sort=7'
html_ref = urllib.request.urlopen(url_ref).read()
soup_ref = BeautifulSoup(html_ref,'html.parser')

In [231]:
table = soup_ref.find_all('table')
df = pd.read_html(str(table))[1]
df

Unnamed: 0,Shikona,Heya,Shusshin,Birth Date,Highest Rank,Hatsu Dohyo,Intai,Last Shikona
0,Akashi,-,Tochigi,1600,Yokozuna,0.00,0.0,Akashi
1,Araiwa,-,-,,Not in Kyokai,0.00,0.0,Araiwa
2,Ashinoura#,Nakamura,-,,Not in Kyokai,0.00,0.0,Ashinoura#
3,Ayagawa,-,Tochigi,1703,Yokozuna,0.00,0.0,Ayagawa
4,Chitosegawa,-,-,,Not in Kyokai,0.00,0.0,Chitosegawa
...,...,...,...,...,...,...,...,...
12660,Yutakanami,Tatsunami,Fukuoka,"January 2, 2001",Jonidan 51,2019.09,,Yutakanami
12661,Yutakasho,Sakaigawa,Kagoshima,"November 19, 1994",Makushita 39,2013.03,,Yutakasho
12662,Yutakayama,Tokitsukaze,Niigata,"September 22, 1993",Maegashira 1,2016.03,,Yutakayama
12663,Zendaisho,Takadagawa,Chiba,"October 14, 1987",Sandanme 85,2003.05,,Zendaisho


In [246]:
tags = soup_ref('a')
tag_list = []
for tag in tags:
    tag_list.append((tag.get('href',None)))
    
tag_list[8+some_starting_indx:some_end_idx]

['Rikishi.aspx?r=11893',
 'Rikishi.aspx?r=11791',
 'Rikishi.aspx?r=8351',
 'Rikishi.aspx?r=11894',
 'Rikishi.aspx?r=11799',
 'Rikishi.aspx?r=8343',
 'Rikishi.aspx?r=11802',
 'Rikishi.aspx?r=8331',
 'Rikishi.aspx?r=8346',
 'Rikishi.aspx?r=8354',
 'Rikishi.aspx?r=8353',
 'Rikishi.aspx?r=8334',
 'Rikishi.aspx?r=8338',
 'Rikishi.aspx?r=8339',
 'Rikishi.aspx?r=8332',
 'Rikishi.aspx?r=8333',
 'Rikishi.aspx?r=8344',
 'Rikishi.aspx?r=8345',
 'Rikishi.aspx?r=8347',
 'Rikishi.aspx?r=8349',
 'Rikishi.aspx?r=8352',
 'Rikishi.aspx?r=8355',
 'Rikishi.aspx?r=11795',
 'Rikishi.aspx?r=11797',
 'Rikishi.aspx?r=11798',
 'Rikishi.aspx?r=11800',
 'Rikishi.aspx?r=11801',
 'Rikishi.aspx?r=8350',
 'Rikishi.aspx?r=11895',
 'Rikishi.aspx?r=11796',
 'Rikishi.aspx?r=8336',
 'Rikishi.aspx?r=11792',
 'Rikishi.aspx?r=8335',
 'Rikishi.aspx?r=8340',
 'Rikishi.aspx?r=8348',
 'Rikishi.aspx?r=11793',
 'Rikishi.aspx?r=11794',
 'Rikishi.aspx?r=11803',
 'Rikishi.aspx?r=8341',
 'Rikishi.aspx?r=8342',
 'Rikishi.aspx?r=8356',


In [828]:
sorted_df =df.loc[df['Hatsu Dohyo']>=1960.01]

Shikonas = sorted_df['Shikona']

In [830]:
sorted_df

Unnamed: 0,Shikona,Heya,Shusshin,Birth Date,Highest Rank,Hatsu Dohyo,Intai,Last Shikona
4237,Katoyama,-,-,,Jonokuchi 14,1960.01,1960.03,Katoyama
4275,Murakami,-,-,,Jonokuchi 6,1960.03,1960.05,Murakami
4279,Ohirato,-,-,,Jonidan 108,1960.01,1960.05,Ohirato
4297,Kondo,-,-,,Jonokuchi 20,1960.05,1960.07,Kondo
4314,Higashida,-,-,,Jonidan 64,1960.01,1960.09,Higashida
...,...,...,...,...,...,...,...,...
12660,Yutakanami,Tatsunami,Fukuoka,"January 2, 2001",Jonidan 51,2019.09,,Yutakanami
12661,Yutakasho,Sakaigawa,Kagoshima,"November 19, 1994",Makushita 39,2013.03,,Yutakasho
12662,Yutakayama,Tokitsukaze,Niigata,"September 22, 1993",Maegashira 1,2016.03,,Yutakayama
12663,Zendaisho,Takadagawa,Chiba,"October 14, 1987",Sandanme 85,2003.05,,Zendaisho


In [831]:
xs =sorted_df['Intai'].isnull()

xs.iloc[0]

False

In [414]:
sorted_df['Intai'].iloc[2]

1962.01

In [389]:
sorted_df.shape

(6233, 8)

In [254]:
tag_list[-15:-12]

['Rikishi.aspx?r=12292', 'Rikishi.aspx?r=2924', 'Rikishi.aspx?r=12419']

In [None]:
URL_list = tag_list

In [391]:
URL_list = ['http://sumodb.sumogames.de/' +s+'&t=1' for s in tag_list]

In [392]:
r_upper = 8# full list of RIKISHI, SORTED!
r_lower = -12

In [393]:
full_URL_list = URL_list[r_upper:r_lower]

In [832]:
IDX=(sorted_df.index.to_numpy())
NIDX = np.zeros(np.size(IDX),dtype =int)
print(type(NIDX[0]))
for i in range(np.size(IDX)):
    NIDX[i] = IDX[i].item()




<class 'numpy.int64'>


In [833]:
type(IDX[0].item())

int

In [834]:
# Now we have a sorted URL list

URL_list_sort = np.asarray(full_URL_list)[IDX]

URL_LIST = URL_list_sort.tolist()



In [939]:
def Scrape_and_Save(URL,keyword,concat,indx):
    # Given a URL, rank to filter, and concat status, scrape a link!
    # indx is the row position of this rikishi in sorted_df.
    # Returns None when the rikishi should be skipped.
    htmly = urllib.request.urlopen(URL).read()
    soupy = BeautifulSoup(htmly,'html.parser')
    pre_s = soupy.find('pre')
    start_section_text = 'Highest Rank'
    
    Name = soupy.find_all('h2')[0].text

    page_text = pre_s.text.strip()
    start_position = page_text.find(start_section_text)

    # CHECK IF THERE IS A CUTOFF/INTAI
    if not sorted_df['Intai'].isnull().iloc[indx]:
        # Retired: the record is complete, so keep everything after the header
        table_text = page_text[start_position:]
    else:
        # Still active: cut the record off at a fixed basho
        end_section_text = '2020.11 ' # this space makes all the difference.
        end_position = page_text.find(end_section_text)
        if end_position == -1:
            # Marker absent; slicing with -1 would silently drop the last character
            end_position = len(page_text)
        table_text = page_text[start_position:end_position]

    # Strip down the page based on this header/footer split
    lines = table_text.splitlines()

    Height,Weight = get_height(lines) # get_height returns (height, weight)

    # The basho table starts after the first line that begins with the shikona
    First_Name = Name.split()[0]
    record_idx = None
    for i in range(len(lines)):
        if lines[i].find(First_Name) == 0:
            record_idx = i
            break
    if record_idx is None:
        return # no basho table found on this page

    print('STRUCTURE: ',lines)
    matches = lines[record_idx+1:]
    print('MATCHES: ',matches)
    
    date,rank,record,final_score = extract_record(matches[1:])
    Jidx = filter_for_rank(rank,keyword)
    
    if Jidx == -2:
        return # We do not want rikishi without a 15-bout record...

    # use Jidx to keep the bashos from the first Juryo appearance onward
    date_f,rank_f,record_f,final_score_f = date[Jidx:],rank[Jidx:],record[Jidx:],final_score[Jidx:]
    print(Jidx)
    if concat:
        record_arr = np.zeros((len(rank_f),45))
        for i in range(len(record_f)):
            record_arr[i,:] = translate_record(record_f[i],concat=True)
    else:
        record_arr = np.zeros((len(rank_f),3,15))
        for i in range(len(record_f)):
            record_arr[i,0,:],record_arr[i,1,:],record_arr[i,2,:] = \
            translate_record(record_f[i],concat=False)

    return date_f,rank_f,final_score_f,record_arr,Height,Weight,Name


New database:


ID | RIKISHI_NAME | BASHO | Win/Loss/With vectors (15,15,15) | Weight | Height | Rank | Cutoff_Rank |
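One possible way to pin this layout down in code is a small dataclass with one instance per rikishi per basho. Everything here is a sketch: the class name, the field names and the flattening order are my own choices, and the values in the example are placeholders rather than scraped data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def _fifteen_zeros():
    return [0] * 15


@dataclass
class BashoEntry:
    rikishi_id: int                 # the r=... value from the page URL
    rikishi_name: str
    basho: str                      # tournament date as 'YYYY.MM'
    rank: str                       # rank held at that basho, e.g. 'J12w'
    cutoff_rank: str                # rank used for filtering, e.g. 'Juryo'
    wins: List[int] = field(default_factory=_fifteen_zeros)
    losses: List[int] = field(default_factory=_fifteen_zeros)
    withdrawals: List[int] = field(default_factory=_fifteen_zeros)
    weight_kg: Optional[float] = None
    height_cm: Optional[float] = None

    def as_row(self):
        """Flatten to one table row: 7 scalar fields followed by the 45 day flags."""
        return ([self.rikishi_id, self.rikishi_name, self.basho, self.rank,
                 self.cutoff_rank, self.weight_kg, self.height_cm]
                + self.wins + self.losses + self.withdrawals)


# Placeholder values, purely to show the shape of a row
example = BashoEntry(0, 'Example Rikishi', '2000.01', 'J12w', 'Juryo')
example_row = example.as_row()
```

A list of such rows could be handed to pandas later on, with column names taken from the schema line above.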

In [865]:
wrestler_df = 

SyntaxError: invalid syntax (<ipython-input-865-38e619a08785>, line 1)

In [866]:
lines

['Highest Rank     Maegashira 1',
 'Real Name        YOSHITANE Hiromichi',
 'Birth Date       December 15, 1970',
 'Shusshin         Chiba-ken, Funabashi-shi',
 'Height and Weight183 cm 183 kg',
 'Heya             Tatsutagawa - Michinoku',
 'Shikona          Yoshitane Hiromichi - Shikishima Katsumori',
 'Hatsu Dohyo      1989.01',
 'Intai            2001.05',
 'KabuShikishima Katsumori - Tatsutagawa Katsumori - Fujigane Katsumori - Fujigane Shigeki - Nishikijima Sukemoto - Onogawa Sukemoto - Onogawa Hiromichi - Tanigawa Hiromichi - Ajigawa Hiromichi - Urakaze Hiromichi - Urakaze Tomimichi',
 '',
 'Career Record    416-418-38/832 (75 basho)',
 '  In Makuuchi    175-228-17/402 (28 basho), 2 Kinboshi',
 '   As Maegashira 175-228-17/402 (28 basho), 2 Kinboshi',
 '  In Juryo       128-114-13/241 (17 basho), 1 Yusho',
 '  In Makushita   53-38-8/91 (15 basho), 1 Yusho',
 '  In Sandanme    34-22/56 (8 basho)',
 '  In Jonidan     20-15/35 (5 basho)',
 '  In Jonokuchi   6-1/7 (1 basho)',
 '  In 

In [867]:
unique_ids = np.zeros(len(URL_LIST),dtype= int)
for i in range(len(URL_LIST)):
    unique_ids[i] = int(URL_LIST[i].split('=')[1].split('&')[0])
    
unique_ids

array([ 5921, 11151, 11328, ..., 12292,  2924, 12419])

In [868]:
item_id = int(URL_LIST[0].split('=')[1].split('&')[0])

item_id

5921
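Splitting on '=' and '&' works for these URLs, but the standard library can do the query-string parsing for us. This is a sketch of that alternative; `rikishi_id` and `text_view_url` are illustrative names, and the base URL is the one already used above.

```python
from urllib.parse import urlparse, parse_qs, urljoin

BASE_URL = 'http://sumodb.sumogames.de/'


def rikishi_id(url):
    """Return the integer value of the 'r' query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get('r')
    return int(values[0]) if values else None


def text_view_url(href):
    """Turn a relative link such as 'Rikishi.aspx?r=11893' into a text-view URL."""
    return urljoin(BASE_URL, href) + '&t=1'
```

Looking the parameter up by name means the ID is still found if the order of the query parameters ever changes, and a link without an r parameter gives None instead of an IndexError.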

In [940]:
# Scrape once and reuse the result, rather than requesting the same page twice
result = Scrape_and_Save('http://sumodb.sumogames.de/Rikishi.aspx?r=20&t=1','Juryo', True,5000)
if result is None:
    print('Not added, did not exceed rank')
else:
    Date,Rank,Score,Records,Ht,Wt,Nm = result

STRUCTURE:  ['Highest Rank     Maegashira 1', 'Real Name        YOSHITANE Hiromichi', 'Birth Date       December 15, 1970', 'Shusshin         Chiba-ken, Funabashi-shi', 'Height and Weight183 cm 183 kg', 'Heya             Tatsutagawa - Michinoku', 'Shikona          Yoshitane Hiromichi - Shikishima Katsumori', 'Hatsu Dohyo      1989.01', 'Intai            2001.05', 'KabuShikishima Katsumori - Tatsutagawa Katsumori - Fujigane Katsumori - Fujigane Shigeki - Nishikijima Sukemoto - Onogawa Sukemoto - Onogawa Hiromichi - Tanigawa Hiromichi - Ajigawa Hiromichi - Urakaze Hiromichi - Urakaze Tomimichi', '', 'Career Record    416-418-38/832 (75 basho)', '  In Makuuchi    175-228-17/402 (28 basho), 2 Kinboshi', '   As Maegashira 175-228-17/402 (28 basho), 2 Kinboshi', '  In Juryo       128-114-13/241 (17 basho), 1 Yusho', '  In Makushita   53-38-8/91 (15 basho), 1 Yusho', '  In Sandanme    34-22/56 (8 basho)', '  In Jonidan     20-15/35 (5 basho)', '  In Jonokuchi   6-1/7 (1 basho)', '  In Mae-zum

In [871]:
Rank

['J12w',
 'J8w',
 'J11w',
 'J3e',
 'J5w',
 'J3w',
 'M16e',
 'J4w',
 'J2w',
 'M16e',
 'J3e',
 'J1w',
 'J6e',
 'J3w',
 'J1w',
 'M15e',
 'M11e',
 'M14w',
 'M11w',
 'M15e',
 'M11w',
 'M7w',
 'M10w',
 'M7e',
 'M10w',
 'M7e',
 'M1w',
 'M6e',
 'M3w',
 'M7w',
 'M7w',
 'M8e',
 'M2w',
 'M11w',
 'M7w',
 'M4e',
 'M10w',
 'M5w',
 'M6w',
 'M10e',
 'M12e',
 'J5e',
 'J4e',
 'J5w',
 'Ms3w',
 'Ms43w']
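The rank codes above pack three things into one string: a division abbreviation, a number within that division, and an east/west side. Below is a minimal sketch of how they could be split apart for later analysis. The helper name `parse_rank` and the `DIVISIONS` table are my own additions, not part of the scraper, and the table only covers abbreviations that actually show up in this notebook's output; anything else is passed through unchanged.

```python
import re

# Division abbreviations seen in the scraped output of this notebook.
# This table is an assumption for illustration; unknown codes are kept as-is.
DIVISIONS = {
    'M': 'Maegashira',
    'J': 'Juryo',
    'Ms': 'Makushita',
    'Sd': 'Sandanme',
    'Jd': 'Jonidan',
    'Jk': 'Jonokuchi',
    'Mz': 'Mae-zumo',
    'Sj': 'Shinjo',
}

# One capital letter, an optional lowercase letter, optional digits, optional side.
RANK_PATTERN = re.compile(r'^([A-Z][a-z]?)(\d*)([ew]?)$')


def parse_rank(code):
    """Split a rank code such as 'J12w' into (division, number, side)."""
    match = RANK_PATTERN.match(code)
    if match is None:
        raise ValueError('Unrecognised rank code: %r' % code)
    prefix, number, side = match.groups()
    division = DIVISIONS.get(prefix, prefix)
    return (division, int(number) if number else None, side or None)


for code in ['J12w', 'M1w', 'Ms43w', 'Mz']:
    print(code, parse_rank(code))
```

Codes without a number or side (such as 'Mz') come back with `None` in those slots, which keeps them easy to filter out later.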

In [838]:
np.shape(Records[:,0])

(34,)

In [839]:
w_labels = ['Day %i win'%(i+1) for i in range(15)]
l_labels = ['Day %i loss'%(i+1) for i in range(15)]
wth_labels = ['Day %i withdrawal'%(i+1) for i in range(15)]

labels = w_labels+l_labels+wth_labels
labels

['Day 1 win',
 'Day 2 win',
 'Day 3 win',
 'Day 4 win',
 'Day 5 win',
 'Day 6 win',
 'Day 7 win',
 'Day 8 win',
 'Day 9 win',
 'Day 10 win',
 'Day 11 win',
 'Day 12 win',
 'Day 13 win',
 'Day 14 win',
 'Day 15 win',
 'Day 1 loss',
 'Day 2 loss',
 'Day 3 loss',
 'Day 4 loss',
 'Day 5 loss',
 'Day 6 loss',
 'Day 7 loss',
 'Day 8 loss',
 'Day 9 loss',
 'Day 10 loss',
 'Day 11 loss',
 'Day 12 loss',
 'Day 13 loss',
 'Day 14 loss',
 'Day 15 loss',
 'Day 1 withdrawal',
 'Day 2 withdrawal',
 'Day 3 withdrawal',
 'Day 4 withdrawal',
 'Day 5 withdrawal',
 'Day 6 withdrawal',
 'Day 7 withdrawal',
 'Day 8 withdrawal',
 'Day 9 withdrawal',
 'Day 10 withdrawal',
 'Day 11 withdrawal',
 'Day 12 withdrawal',
 'Day 13 withdrawal',
 'Day 14 withdrawal',
 'Day 15 withdrawal']

In [844]:
Score

['8-7',
 '6-9',
 '12-3',
 '7-8',
 '9-6',
 '9-6',
 '5-10',
 '9-6',
 '10-5',
 '7-8',
 '8-7',
 '5-10',
 '9-6',
 '8-7',
 '10-5',
 '10-5',
 '6-9',
 '8-7',
 '7-8',
 '9-6',
 '8-7',
 '6-9',
 '8-7',
 '6-9',
 '8-7',
 '8-7',
 '3-12',
 '8-7',
 '4-9-2',
 '0-0-15',
 '7-8',
 '9-6',
 '1-14',
 '8-7']
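The score strings come in two shapes here: 'W-L' and 'W-L-A', where the third number counts absences. A small sketch of turning them into integers follows. `parse_score` is a hypothetical helper, and it assumes only these two shapes occur; other formats on SumoDB would need extra handling.

```python
def parse_score(score):
    """Turn 'W-L' or 'W-L-A' into a (wins, losses, absences) tuple of ints."""
    parts = [int(p) for p in score.split('-')]
    if len(parts) == 2:
        parts.append(0)  # no third field means no absences
    if len(parts) != 3:
        raise ValueError('Unexpected score format: %r' % score)
    return tuple(parts)


# Three scores copied from the list above.
for s in ['8-7', '4-9-2', '0-0-15']:
    wins, losses, absences = parse_score(s)
    print(s, '->', wins, losses, absences, 'total days:', wins + losses + absences)
```

For the three examples, wins, losses and absences add up to 15, the length of a tournament in the upper divisions.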

In [845]:
Records[0]

array([0., 1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0.,
       0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
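Each row of Records is a 45-element vector. Going by the labels built earlier, the first 15 entries flag wins per day, the next 15 flag losses, and the last 15 flag withdrawals. As a sanity check, the sketch below collapses a row back into a score string. `record_to_score` is a hypothetical helper and the block layout is an assumption based on the order of `labels`.

```python
import numpy as np


def record_to_score(row, days=15):
    """Collapse a win/loss/withdrawal indicator row into a 'W-L' or 'W-L-A' string."""
    row = np.asarray(row, dtype=float)
    if row.size != 3 * days:
        raise ValueError('Expected %d entries, got %d' % (3 * days, row.size))
    wins = int(row[:days].sum())
    losses = int(row[days:2 * days].sum())
    absences = int(row[2 * days:].sum())
    if absences == 0:
        return '%d-%d' % (wins, losses)
    return '%d-%d-%d' % (wins, losses, absences)


# The row printed above, copied verbatim.
first_row = np.array([0., 1., 1., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 0.,
                      0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
                      0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
print(record_to_score(first_row))
```

For this row the result is '8-7', which appears to agree with the first entry of Score.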

In [840]:
N = 4
dat = pd.DataFrame(Records, columns=labels)

# DataFrame.append was removed in pandas 2.0, so build the N extra rows
# of ones as their own frame and concatenate once.
extra = pd.DataFrame(np.ones((N, len(labels)), dtype=int), columns=labels)
dat = pd.concat([dat, extra], ignore_index=True)
dat

Unnamed: 0,Day 1 win,Day 2 win,Day 3 win,Day 4 win,Day 5 win,Day 6 win,Day 7 win,Day 8 win,Day 9 win,Day 10 win,...,Day 6 withdrawal,Day 7 withdrawal,Day 8 withdrawal,Day 9 withdrawal,Day 10 withdrawal,Day 11 withdrawal,Day 12 withdrawal,Day 13 withdrawal,Day 14 withdrawal,Day 15 withdrawal
0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [709]:
record_frame = pd.DataFrame(Records, columns = labels)
record_frame

Unnamed: 0,Day 1 win,Day 2 win,Day 3 win,Day 4 win,Day 5 win,Day 6 win,Day 7 win,Day 8 win,Day 9 win,Day 10 win,...,Day 6 withdrawal,Day 7 withdrawal,Day 8 withdrawal,Day 9 withdrawal,Day 10 withdrawal,Day 11 withdrawal,Day 12 withdrawal,Day 13 withdrawal,Day 14 withdrawal,Day 15 withdrawal
0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0
2,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [659]:
np.column_stack((np.asarray(Rank),np.asarray(Date),np.asarray(Score),Records))

array([['Ms10e', '1993.07', '6-1', ..., '1.0', '1.0', '0.0'],
       ['Ms1e', '1993.09', '4-3', ..., '1.0', '1.0', '1.0'],
       ['J12w', '1993.11', '8-7', ..., '0.0', '0.0', '0.0'],
       ...,
       ['M8e', '1999.01', '9-6', ..., '0.0', '0.0', '0.0'],
       ['M2w', '1999.03', '1-14', ..., '0.0', '0.0', '0.0'],
       ['M11w', '1999.05', '8-7', ..., '0.0', '0.0', '0.0']], dtype='<U32')

In [717]:
dataframe = pd.DataFrame({
    'Rank': np.asarray(Rank),
    'Date': np.asarray(Date),
    'Score': np.asarray(Score),
    'Name': Name,
    'Height': Ht,
    'Weight': Wt,
})
dataframe.join(record_frame)

# This creates a single part of the giant database.
# To add a row:


Unnamed: 0,Rank,Date,Score,Name,Height,Weight,Day 1 win,Day 2 win,Day 3 win,Day 4 win,...,Day 6 withdrawal,Day 7 withdrawal,Day 8 withdrawal,Day 9 withdrawal,Day 10 withdrawal,Day 11 withdrawal,Day 12 withdrawal,Day 13 withdrawal,Day 14 withdrawal,Day 15 withdrawal
0,Ms10e,1993.07,6-1,Shikishima Katsumori,183,183,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0
1,Ms1e,1993.09,4-3,Shikishima Katsumori,183,183,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0
2,J12w,1993.11,8-7,Shikishima Katsumori,183,183,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,J8w,1994.01,6-9,Shikishima Katsumori,183,183,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,J11w,1994.03,12-3,Shikishima Katsumori,183,183,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,J3e,1994.05,7-8,Shikishima Katsumori,183,183,0.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,J5w,1994.07,9-6,Shikishima Katsumori,183,183,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,J3w,1994.09,9-6,Shikishima Katsumori,183,183,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M16e,1994.11,5-10,Shikishima Katsumori,183,183,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,J4w,1995.01,9-6,Shikishima Katsumori,183,183,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
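The table above covers a single wrestler. To grow it into the full database, one option is to build one such frame per wrestler and concatenate them at the end, rather than appending row by row. This is only a sketch: `wrestler_frame` is a hypothetical helper, and the values are a few rows copied from the scraped output in this notebook.

```python
import pandas as pd


def wrestler_frame(name, ranks, dates, scores):
    """Build one wrestler's block of rows; the scalar name is repeated on every row."""
    return pd.DataFrame({'Rank': ranks, 'Date': dates, 'Score': scores, 'Name': name})


frames = [
    wrestler_frame('Shikishima Katsumori',
                   ['Ms10e', 'Ms1e'], ['1993.07', '1993.09'], ['6-1', '4-3']),
    wrestler_frame('Shionohana',
                   ['Jk15e'], ['1960.07'], ['4-3']),
]

# ignore_index=True renumbers the rows 0..n-1 across all wrestlers.
database = pd.concat(frames, ignore_index=True)
print(database)
```

Concatenating once at the end avoids copying the whole table on every iteration, which matters once thousands of wrestlers are involved.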


In [947]:
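# Create a dataframe
import time

frames = []
for i in range(100):
    time.sleep(3)  # pause between requests to go easy on the server
    # Call the scraper once per URL and reuse the result
    # (calling it twice fetches and prints every page twice).
    result = Scrape_and_Save(URL_LIST[i], 'Juryo', True, i)
    if result is None:
        print('Not added, did not exceed rank')
    else:
        #Date,Rank,Score,Records,Ht,Wt,Nm =Scrape_and_Save('http://sumodb.sumogames.de/Rikishi.aspx?r=20&t=1','Juryo', True,5000)
        Date, Rank, Score, Records, Ht, Wt, Nm = result
        # Collect one frame per wrestler instead of overwriting `dataframe` each pass.
        frames.append(pd.DataFrame({'Rank': np.asarray(Rank), 'Date': np.asarray(Date), 'Score': np.asarray(Score), 'Name': Nm, 'Height': Ht, 'Weight': Wt}))

    if i % 500 == 0:
        print(URL_LIST[i], i)

# Combine everything collected so far into a single table.
if frames:
    dataframe = pd.concat(frames, ignore_index=True)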
    

STRUCTURE:  ['Highest Rank     Jonokuchi 14', 'Shusshin         -', 'Heya             -', 'Shikona          Katoyama', 'Hatsu Dohyo      1960.01', 'Intai            unknown', '', 'Career Record    0-0-8/0 (2 basho)', '  In Jonokuchi   0-0-8/0 (1 basho)', '  In Mae-zumo    1 basho', '', 'Katoyama', '1960.01 Mz                      0-0', '1960.03 Jk14e                   0-0-8']
MATCHES:  ['1960.01 Mz                      0-0', '1960.03 Jk14e                   0-0-8']
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonokuchi 14', 'Shusshin         -', 'Heya             -', 'Shikona          Katoyama', 'Hatsu Dohyo      1960.01', 'Intai            unknown', '', 'Career Record    0-0-8/0 (2 basho)', '  In Jonokuchi   0-0-8/0 (1 basho)', '  In Mae-zumo    1 basho', '', 'Katoyama', '1960.01 Mz                      0-0', '1960.03 Jk14e                   0-0-8']
MATCHES:  ['1960.01 Mz                      0-0', '1960.03 Jk14e                   0-0-8']
Name change/Missing Data
0
http:/

STRUCTURE:  ['Highest Rank     Jonidan 97', 'Shusshin         -', 'Heya             -', 'Shikona          Shuto', 'Hatsu Dohyo      1960.05', 'Intai            1960.09', '', 'Career Record    4-3-7/7 (3 basho)', '  In Jonidan     0-0-7/0 (1 basho)', '  In Jonokuchi   4-3/7 (1 basho)', '  In Mae-zumo    1 basho', '', 'Shuto', '1960.05 Mz                      0-0', '1960.07 Jk11w                   4-3', '1960.09 Jd97e                   0-0-7']
MATCHES:  ['1960.05 Mz                      0-0', '1960.07 Jk11w                   4-3', '1960.09 Jd97e                   0-0-7']
Name change/Missing Data
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonidan 78', 'Shusshin         -', 'Heya             -', 'Shikona          Kasuga', 'Hatsu Dohyo      1960.07', 'Intai            1960.11', '', 'Career Record    5-4-5/9 (3 basho)', '  In Jonidan     1-1-5/2 (1 basho)', '  In Jonokuchi   4-3/7 (1 basho)', '  In Mae-zumo    1 basho', '', 'Kasuga', '1960.07 Mz                      0-0', '196

STRUCTURE:  ['Highest Rank     Jonidan 77', 'Shusshin         -', 'Heya             -', 'Shikona          Sugawara', 'Hatsu Dohyo      1960.07', 'Intai            1960.11', '', 'Career Record    6-8/14 (3 basho)', '  In Jonidan     2-5/7 (1 basho)', '  In Jonokuchi   4-3/7 (1 basho)', '  In Mae-zumo    1 basho', '', 'Sugawara', '1960.07 Mz                      0-0', '1960.09 Jk24e                   4-3', '1960.11 Jd77e                   2-5']
MATCHES:  ['1960.07 Mz                      0-0', '1960.09 Jk24e                   4-3', '1960.11 Jd77e                   2-5']
Name change/Missing Data
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonidan 77', 'Shusshin         -', 'Heya             -', 'Shikona          Sugawara', 'Hatsu Dohyo      1960.07', 'Intai            1960.11', '', 'Career Record    6-8/14 (3 basho)', '  In Jonidan     2-5/7 (1 basho)', '  In Jonokuchi   4-3/7 (1 basho)', '  In Mae-zumo    1 basho', '', 'Sugawara', '1960.07 Mz                      0-0', '196

STRUCTURE:  ['Highest Rank     Jonokuchi 17', 'Shusshin         -', 'Heya             -', 'Shikona          Izawa', 'Hatsu Dohyo      1961.01', 'Intai            unknown', '', 'Career Record    0-0-7/0 (2 basho)', '  In Jonokuchi   0-0-7/0 (1 basho)', '  In Mae-zumo    1 basho', '', 'Izawa', '1961.01 Mz                      0-0', '1961.03 Jk17w                   0-0-7']
MATCHES:  ['1961.01 Mz                      0-0', '1961.03 Jk17w                   0-0-7']
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonokuchi 17', 'Shusshin         -', 'Heya             -', 'Shikona          Izawa', 'Hatsu Dohyo      1961.01', 'Intai            unknown', '', 'Career Record    0-0-7/0 (2 basho)', '  In Jonokuchi   0-0-7/0 (1 basho)', '  In Mae-zumo    1 basho', '', 'Izawa', '1961.01 Mz                      0-0', '1961.03 Jk17w                   0-0-7']
MATCHES:  ['1961.01 Mz                      0-0', '1961.03 Jk17w                   0-0-7']
Name change/Missing Data
0
STRUCTURE:  ['High

STRUCTURE:  ['Highest Rank     Jonidan 79', 'Shusshin         -', 'Heya             -', 'Shikona          Matsudaiwa', 'Hatsu Dohyo      1960.11', 'Intai            1961.05', '', 'Career Record    9-12/21 (4 basho)', '  In Jonidan     0-7/7 (1 basho)', '  In Jonokuchi   9-5/14 (2 basho)', '  In Mae-zumo    1 basho', '', 'Matsudaiwa', '1960.11 Mz                      0-0', '1961.01 Jk20e                   4-3', '1961.03 Jd79w                   0-7', '1961.05 Jk2e                    5-2']
MATCHES:  ['1960.11 Mz                      0-0', '1961.01 Jk20e                   4-3', '1961.03 Jd79w                   0-7', '1961.05 Jk2e                    5-2']
Name change/Missing Data
Name change/Missing Data
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonidan 79', 'Shusshin         -', 'Heya             -', 'Shikona          Matsudaiwa', 'Hatsu Dohyo      1960.11', 'Intai            1961.05', '', 'Career Record    9-12/21 (4 basho)', '  In Jonidan     0-7/7 (1 basho)', '  In Jonok

STRUCTURE:  ['Highest Rank     Jonidan 52', 'Shusshin         -', 'Heya             -', 'Shikona          Tachiisami', 'Hatsu Dohyo      1960.07', 'Intai            unknown', '', 'Career Record    11-20-14/31 (7 basho)', '  In Jonidan     6-15-7/21 (4 basho)', '  In Jonokuchi   4-3-7/7 (2 basho)', '  In Shinjo      1-2/3 (1 basho)', '', 'Tachiisami', '1960.07 Sj                      1-2', '1960.09 Jk17e                   4-3', '1960.11 Jd76e                   4-3', '1961.01 Jd52w                   1-6', '1961.03 Jd84w                   1-6', '1961.05 Jd96w                   0-0-7', '1961.07 Jk7w                    0-0-7']
MATCHES:  ['1960.07 Sj                      1-2', '1960.09 Jk17e                   4-3', '1960.11 Jd76e                   4-3', '1961.01 Jd52w                   1-6', '1961.03 Jd84w                   1-6', '1961.05 Jd96w                   0-0-7', '1961.07 Jk7w                    0-0-7']
Name change/Missing Data
Name change/Missing Data
Name change/Missing Data
Name ch

STRUCTURE:  ['Highest Rank     Jonidan 2', 'Shusshin         -', 'Heya             -', 'Shikona          Shionohana', 'Hatsu Dohyo      1960.05', 'Intai            1961.09', '', 'Career Record    24-25-7/49 (9 basho)', '  In Jonidan     20-22-7/42 (7 basho)', '  In Jonokuchi   4-3/7 (1 basho)', '  In Mae-zumo    1 basho', '', 'Shionohana', '1960.05 Mz                      0-0', '1960.07 Jk15e                   4-3', '1960.09 Jd99e                   5-2', '1960.11 Jd35w                   3-4', '1961.01 Jd45w                   2-5', '1961.03 Jd72w                   4-3', '1961.05 Jd36w                   4-3', '1961.07 Jd2e                    2-5', '1961.09 Jd19w                   0-0-7']
MATCHES:  ['1960.05 Mz                      0-0', '1960.07 Jk15e                   4-3', '1960.09 Jd99e                   5-2', '1960.11 Jd35w                   3-4', '1961.01 Jd45w                   2-5', '1961.03 Jd72w                   4-3', '1961.05 Jd36w                   4-3', '1961.07 Jd2e        

STRUCTURE:  ['Highest Rank     Jonidan 33', 'Shusshin         -', 'Heya             -', 'Shikona          Ifuji', 'Hatsu Dohyo      1960.09', 'Intai            1961.11', '', 'Career Record    17-18-14/35 (8 basho)', '  In Jonidan     5-9-14/14 (4 basho)', '  In Jonokuchi   12-9/21 (3 basho)', '  In Mae-zumo    1 basho', '', 'Ifuji', '1960.09 Mz                      0-0', '1960.11 Jk23e                   3-4', '1961.01 Jk11w                   4-3', '1961.03 Jd77e                   2-5', '1961.05 Jd88w                   3-4', '1961.07 Jd81e                   0-0-7', '1961.09 Jk19e                   5-2', '1961.11 Jd33w                   0-0-7']
MATCHES:  ['1960.09 Mz                      0-0', '1960.11 Jk23e                   3-4', '1961.01 Jk11w                   4-3', '1961.03 Jd77e                   2-5', '1961.05 Jd88w                   3-4', '1961.07 Jd81e                   0-0-7', '1961.09 Jk19e                   5-2', '1961.11 Jd33w                   0-0-7']
Name change/Missing Da

STRUCTURE:  ['Highest Rank     Jonidan 72', 'Shusshin         -', 'Heya             -', 'Shikona          Komatsu', 'Hatsu Dohyo      1961.05', 'Intai            unknown', '', 'Career Record    4-3-14/7 (4 basho)', '  In Jonidan     0-0-7/0 (1 basho)', '  In Jonokuchi   4-3-7/7 (2 basho)', '  In Mae-zumo    1 basho', '', 'Komatsu', '1961.05 Mz                      0-0', '1961.07 Jk22e                   4-3', '1961.09 Jd72w                   0-0-7', '1961.11 Jk16w                   0-0-7']
MATCHES:  ['1961.05 Mz                      0-0', '1961.07 Jk22e                   4-3', '1961.09 Jd72w                   0-0-7', '1961.11 Jk16w                   0-0-7']
Name change/Missing Data
Name change/Missing Data
Name change/Missing Data
0
STRUCTURE:  ['Highest Rank     Jonidan 72', 'Shusshin         -', 'Heya             -', 'Shikona          Komatsu', 'Hatsu Dohyo      1961.05', 'Intai            unknown', '', 'Career Record    4-3-14/7 (4 basho)', '  In Jonidan     0-0-7/0 (1 basho)', '  In

STRUCTURE:  ['Highest Rank     Jonidan 30', 'Shusshin         Nara-ken', 'Heya             Tatsunami', 'Shikona          Tachizakura Tadao', 'Hatsu Dohyo      1960.07', 'Intai            1961.11', '', 'Career Record    26-25-8/51 (9 basho)', '  In Jonidan     18-23-8/41 (7 basho)', '  In Jonokuchi   6-1/7 (1 basho)', '  In Shinjo      2-1/3 (1 basho)', '', 'Tachizakura Tadao', '1960.07 Sj                      2-1', '1960.09 Jk6w                    6-1', '1960.11 Jd49w                   2-5', '1961.01 Jd65w                   4-3', '1961.03 Jd30e                   2-5', '1961.05 Jd51w                   3-3-1', '1961.07 Jd65e                   4-3', '1961.09 Jd34e                   3-4', '1961.11 Jd49w                   0-0-7']
MATCHES:  ['1960.07 Sj                      2-1', '1960.09 Jk6w                    6-1', '1960.11 Jd49w                   2-5', '1961.01 Jd65w                   4-3', '1961.03 Jd30e                   2-5', '1961.05 Jd51w                   3-3-1', '1961.07 Jd65e    

STRUCTURE:  ['Highest Rank     Sandanme 78', 'Shusshin         -', 'Heya             -', 'Shikona          Nomiya - Ezonoyama', 'Hatsu Dohyo      1960.09', 'Intai            1962.01', '', 'Career Record    24-19-14/43 (9 basho)', '  In Sandanme    0-0-7/0 (1 basho)', '  In Jonidan     17-11-7/28 (5 basho)', '  In Jonokuchi   7-8/15 (2 basho)', '  In Mae-zumo    1 basho', '', 'Nomiya', '1960.09 Mz                      0-0', '1960.11 Jk26e                   2-6', '1961.01 Jk17e                   5-2', '1961.03 Jd59e                   3-4', 'Ezonoyama', '1961.05 Jd74w                   5-2', '1961.07 Jd15e                   3-4', '1961.09 Jd24w                   6-1', '1961.11 Sd78w                   0-0-7', '1962.01 Jd3e                    0-0-7']
MATCHES:  ['1961.05 Jd74w                   5-2', '1961.07 Jd15e                   3-4', '1961.09 Jd24w                   6-1', '1961.11 Sd78w                   0-0-7', '1962.01 Jd3e                    0-0-7']
Name change/Missing Data
Name chan

IndexError: list index out of range
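The run above stops at the first IndexError, which throws away the progress made so far. I have not tracked down the exact cause yet, but a defensive wrapper would at least let the loop continue and record which URLs failed. The sketch below uses a stand-in scraper (`fake_scraper`) because the real function is defined elsewhere; `scrape_all` is a hypothetical helper.

```python
def scrape_all(urls, scrape_one):
    """Call scrape_one on each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            outcome = scrape_one(url)
        except (IndexError, ValueError) as err:
            # Remember the failing URL so it can be inspected later.
            failures.append((url, repr(err)))
            continue
        if outcome is not None:  # None means the wrestler did not reach the rank cutoff
            results.append(outcome)
    return results, failures


def fake_scraper(url):
    """Stand-in for the real scraper, for illustration only."""
    if 'broken' in url:
        raise IndexError('list index out of range')
    if 'low-rank' in url:
        return None
    return {'url': url}


results, failures = scrape_all(['ok', 'broken', 'low-rank'], fake_scraper)
print(results)
print(failures)
```

The list of failures then doubles as a to-do list of pages whose layout the parser does not handle yet.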

In [945]:
Date

[]

In [874]:
URL_LIST[0]

'http://sumodb.sumogames.de/Rikishi.aspx?r=5921&t=1'
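Every entry in URL_LIST follows the same pattern: the Rikishi.aspx page with an `r` parameter for the wrestler ID and `t=1`, which I take to be the text-only switch mentioned at the top of the notebook. A small sketch of building and taking apart such URLs with the standard library follows; `rikishi_url` and `extract_id` are hypothetical helpers.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE = 'http://sumodb.sumogames.de/Rikishi.aspx'


def rikishi_url(rikishi_id, text_only=True):
    """Build the page URL for one wrestler ID."""
    params = {'r': rikishi_id}
    if text_only:
        params['t'] = 1  # assumed to select the text-only view
    return BASE + '?' + urlencode(params)


def extract_id(url):
    """Recover the wrestler ID from a URL built as above."""
    return int(parse_qs(urlparse(url).query)['r'][0])


url = rikishi_url(5921)
print(url)
print(extract_id(url))
```

Keeping the ID recoverable from the URL makes it easy to store it as a key alongside each wrestler's rows.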