Some dates have no entries in the joined match_result file. Here, we determine for which dates this is due to parsing errors and for which it is because no data exists. My initial round of scraping did not handle dates with no data appropriately (see 12/30/2014 for an example). We first try out a corrected parser on that date, and then run it on all dates with no matches in the combined match_result file.

In [1]:
import pandas as pd
from tennis_new.fetch.tennis_explorer.defs import ALL_MATCH_PATH, DATE_FORMAT

matches = pd.read_csv(ALL_MATCH_PATH)


In [2]:
from datetime import datetime, timedelta

missing_dates = []
min_date = datetime.strptime(matches['date'].min(), DATE_FORMAT)
max_date = datetime.strptime(matches['date'].max(), DATE_FORMAT)
cur_date = min_date
all_dates = set(matches['date'].unique())  # set gives constant-time membership checks
while cur_date <= max_date:
    if cur_date.strftime(DATE_FORMAT) not in all_dates:
        missing_dates.append(cur_date)
    cur_date += timedelta(days=1)
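
The same gap search can also be written with a pandas date range instead of an explicit loop. This is only a minimal sketch, assuming the dates are strings and that DATE_FORMAT is `%Y-%m-%d`; `find_missing_dates` is a hypothetical helper, not part of tennis_new.

```python
import pandas as pd


def find_missing_dates(date_strings, date_format="%Y-%m-%d"):
    """Return the calendar days between the min and max date that never appear."""
    # Parse the observed date strings into timestamps
    observed = pd.to_datetime(pd.Series(list(date_strings)), format=date_format)
    # Every calendar day between the first and last observed date
    full_range = pd.date_range(observed.min(), observed.max(), freq="D")
    # Days in the full range that have no observed entry
    missing = full_range.difference(pd.DatetimeIndex(observed.unique()))
    return missing.to_pydatetime().tolist()


# Example with a toy list of dates (not real match data)
gaps = find_missing_dates(["2014-12-29", "2014-12-31", "2015-01-02"])
print([d.strftime("%Y-%m-%d") for d in gaps])
```

The result is a list of `datetime` objects, so it could be passed to the same downstream cells as `missing_dates`.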

In [4]:
# Try a date known to have no data, to check that the corrected parser handles it instead of raising an error
from lxml import html
from tennis_new.fetch.tennis_explorer.matches.scraper import TennisExplorerParser
from tennis_new.fetch.tennis_explorer import helpers

te = TennisExplorerParser(datetime(2014, 12, 30))

In [5]:
te.process()
te.write_df()

In [7]:
from tennis_new.fetch.tennis_explorer.matches.scraper import TennisExplorerParser
from time import sleep

PAUSE_SECONDS = 2

# Retry every missing date; this also clears failures caused by transient connection errors
still_missing = []
for date in missing_dates:
    print("Trying %s..." % date.strftime(DATE_FORMAT))
    try:
        te = TennisExplorerParser(date)
        te.process()
        te.write_df()
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Still has errors...")
        still_missing.append(date)
    sleep(PAUSE_SECONDS)  # pause after every request, including failed ones

Trying 1997-01-12...
Trying 1997-01-13...
Trying 1997-01-14...
Trying 1997-01-16...
Trying 1997-01-25...
Trying 1997-02-08...
Trying 1997-03-10...
Trying 1997-03-17...
Trying 1997-03-18...
Trying 1997-03-31...
Trying 1997-04-06...
Trying 1997-05-01...
Trying 1997-05-05...
Trying 1997-05-12...
Trying 1997-07-13...
Trying 1997-08-25...
Trying 1997-08-31...
Trying 1997-09-08...
Trying 1997-09-15...
Trying 1997-09-22...
Trying 1997-09-29...
Trying 1997-10-06...
Trying 1997-10-13...
Trying 1997-10-19...
Trying 1997-11-03...
Trying 1997-11-10...
Trying 1997-11-17...
Trying 1997-11-23...
Trying 1997-11-24...
Trying 1997-12-01...
Trying 1997-12-07...
Trying 1997-12-08...
Trying 1997-12-14...
Trying 1997-12-15...
Trying 1997-12-16...
Trying 1997-12-17...
Trying 1997-12-18...
Trying 1997-12-19...
Trying 1997-12-20...
Trying 1997-12-21...
Trying 1997-12-22...
Trying 1997-12-23...
Trying 1997-12-24...
Trying 1997-12-25...
Trying 1997-12-26...
Trying 1997-12-27...
Trying 1997-12-28...
Trying 1997-1

Trying 2001-12-21...
Trying 2001-12-22...
Trying 2001-12-23...
Trying 2001-12-24...
Trying 2001-12-25...
Trying 2001-12-26...
Trying 2001-12-27...
Trying 2001-12-28...
Trying 2002-12-22...
Trying 2002-12-23...
Trying 2002-12-24...
Trying 2002-12-25...
Trying 2002-12-26...
Trying 2002-12-27...
Trying 2003-12-15...
Trying 2003-12-21...
Trying 2003-12-22...
Trying 2003-12-23...
Trying 2003-12-24...
Trying 2003-12-25...
Trying 2003-12-26...
Trying 2003-12-29...
Trying 2004-12-19...
Trying 2004-12-20...
Trying 2004-12-21...
Trying 2004-12-22...
Trying 2004-12-23...
Trying 2004-12-24...
Trying 2004-12-25...
Trying 2004-12-26...
Trying 2004-12-27...
Trying 2004-12-28...
Trying 2004-12-29...
Trying 2004-12-30...
Trying 2005-12-05...
Trying 2005-12-12...
Trying 2005-12-18...
Trying 2005-12-19...
Trying 2005-12-20...
Trying 2005-12-21...
Trying 2005-12-22...
Trying 2005-12-26...
Trying 2006-07-19...
Trying 2006-12-13...
Trying 2006-12-18...
Trying 2006-12-25...
Trying 2007-12-23...
Trying 2007-1
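
The retry loop above only records that a date failed, not why. A minimal sketch of a variant that also keeps the error message, so connection problems can be told apart from parsing problems afterwards; `retry_dates` and its `process_date` argument are hypothetical names, and the callable stands in for whatever work is done per date (here, constructing the parser, processing, and writing).

```python
from time import sleep


def retry_dates(dates, process_date, pause_seconds=2.0):
    """Call process_date on each date; return (date, error) pairs for failures."""
    failures = []
    for date in dates:
        try:
            process_date(date)
        except Exception as exc:
            # Keep the exception text so the cause can be inspected later
            failures.append((date, repr(exc)))
        finally:
            # Pause after every attempt, whether it succeeded or failed
            sleep(pause_seconds)
    return failures


# Example with a stand-in callable that fails on one input
def _demo(date):
    if date == "2016-12-09":
        raise ValueError("unexpected set count")


print(retry_dates(["2016-12-08", "2016-12-09"], _demo, pause_seconds=0))
```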

In [8]:
still_missing

[datetime.datetime(2016, 12, 9, 0, 0)]

Only one date still fails: 12/9/2016. On that date there is a match in which p2 won more sets than p1, an unusual "tie" case from the International Premier Tennis League.

#### After Fixing Match File

Many dates will still be missing, but they should now be recorded in the no_results_log.log file.

In [13]:
import pandas as pd
from tennis_new.fetch.tennis_explorer.defs import ALL_MATCH_PATH, DATE_FORMAT

matches = pd.read_csv(ALL_MATCH_PATH)


In [14]:
from datetime import datetime, timedelta

missing_dates = []
min_date = datetime.strptime(matches['date'].min(), DATE_FORMAT)
max_date = datetime.strptime(matches['date'].max(), DATE_FORMAT)
cur_date = min_date
all_dates = set(matches['date'].unique())  # set gives constant-time membership checks
while cur_date <= max_date:
    if cur_date.strftime(DATE_FORMAT) not in all_dates:
        missing_dates.append(cur_date)
    cur_date += timedelta(days=1)

In [28]:
log_file = TennisExplorerParser.NO_RESULTS_LOG
expected_missing_dates = pd.read_csv(log_file, header=None)[0].tolist()

In [29]:
[x for x in missing_dates if x.strftime(DATE_FORMAT) not in expected_missing_dates]

[datetime.datetime(2016, 12, 9, 0, 0)]

Good, the only unexpected missing case is the aforementioned one.
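
The comparison against the log can also be run in both directions. A minimal sketch, assuming the log yields one date string per entry in `%Y-%m-%d` form; `reconcile_missing` is a hypothetical helper, and the second check (logged dates that do have matches) is an extra sanity check not performed in this notebook.

```python
from datetime import datetime


def reconcile_missing(missing_dates, logged_dates, observed_dates,
                      date_format="%Y-%m-%d"):
    """Compare missing dates against the no-results log in both directions."""
    logged = set(logged_dates)
    observed = set(observed_dates)
    # Missing dates that the log does not explain
    unexpected = [d for d in missing_dates
                  if d.strftime(date_format) not in logged]
    # Logged "no results" dates that nevertheless have matches
    stale = sorted(logged & observed)
    return unexpected, stale


# Example with toy inputs (not real match data)
unexpected, stale = reconcile_missing(
    missing_dates=[datetime(2016, 12, 9), datetime(2014, 12, 30)],
    logged_dates=["2014-12-30", "2015-01-01"],
    observed_dates=["2015-01-01", "2015-01-02"],
)
print(unexpected, stale)
```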