## Data Inspiration
<p>I was watching a cricket match yesterday: India vs West Indies. Rohit and Virat were both well settled at the crease, with a 200-run partnership and counting. Virat was close to 150 when a graphic on TV stated that Kohli needed just another 154 runs to reach 10,000 runs and join the group of elite batsmen that includes Sachin, Ponting and Dravid. I was a little surprised to see him so close to the milestone so soon.</p>
<p>Then it struck me how much data there is in cricket: everything from the 4s and 6s a batsman hits in a match to the number of balls he has faced in his Test career.</p>
<p>It didn't take me long to find the perfect website for those stats. In India, cricket is a religion, after all. I landed on Cricinfo, searched for stats and found <a href='http://stats.espncricinfo.com/ci/content/records/335431.html'>this</a> webpage containing data on every Test series ever recorded or reported.</p>
<p>So I dug in, scraped the data and analysed every stat I could imagine.</p>

In [1]:
import requests

In [2]:
r = requests.get('http://stats.espncricinfo.com/ci/content/records/335431.html')
r.text[0:500]

'\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<!-- hostname: web001, edition-view: espncricinfo-en-in, country: unknown, cluster: www, created: 2018-10-23 14:44:24 -->\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://developers.facebook.com/schema/" >\n<head>\n <script type="text/javascript">var _sf_startpt=(new Date())'

In [3]:
from bs4 import BeautifulSoup  
soup = BeautifulSoup(r.text, 'html.parser')

In [4]:
stats = soup.find_all('tr', attrs={'class':'data1'})
print("The resulting object is of type {}".format(type(stats)))
print("Total number of results, i.e. total number of Test series played to date: {}".format(len(stats)))

The resulting object is of type <class 'bs4.element.ResultSet'>
Total number of results, i.e. total number of Test series played to date: 738


The whole data set is held in the <i>bs4.element.ResultSet</i>, which behaves like a list.
<p>Each Test series is one element of that list.</p>
<p>Below is the structure of the data in each element:</p>

In [5]:
stats[0]

<tr class="data1" data-days="685654">
<td class="left"><a class="data-link" href="/ci/engine/series/60260.html">England in Australia Test Series</a></td>
<td class="left" nowrap="nowrap">1876/77</td>
<td nowrap="nowrap"></td>
<td class="left" nowrap="nowrap"><a class="data-link" href="/ci/content/team/1126861.html">drawn</a></td>
<td nowrap="nowrap">1-1 (2)</td>
</tr>

### Data structure of each row of the data table
- The number 685654 in the data-days attribute. Every element carries such a number, and the values appear to be unique, so it must mean something.
- The path of the link to the page containing the Test series results.
- The title of the Test series, which tells us the home team and the away team.
- The year (or season) in which the Test series was played.
- A link inside the 4th td tag leading to the winning team's page. For drawn series this link returns a 404 (Page Not Found) error.
- The numerical result of the series.
- The total number of matches played in the series (given in brackets).

So the features (with their values from the first row) can be:
<p>
    <b>data-days</b> 685654<br>
    <b>Series link</b> /ci/engine/series/60260.html<br>
    <b>Teams playing</b> England in Australia Test Series<br>
    <b>Year</b> 1876/77<br>
    <b>Team Won</b> Drawn<br>
    <b>Result</b> 1-1 (first number corresponds to winner)<br>
    <b>Matches Played</b> 2<br>
</p>

<p>The DataFrame would look something like this:</p>
<table>
    <thead>
    <tr>
        <td><b>data-days</b></td>
        <td><b>Series link</b></td>
        <td><b>Teams playing</b></td>
        <td><b>Year</b></td>
        <td><b>Team Won</b></td>
        <td><b>Result</b></td>
        <td><b>Matches Played</b></td>
    </tr>
    </thead>
    <tr>
        <td>685654<br></td>
        <td>/ci/engine/series/60260.html<br></td>
        <td>England in Australia Test Series<br></td>
        <td>1876/77<br></td>
        <td>Drawn<br></td>
        <td>1-1<br></td>
        <td>2<br></td>
    </tr>
</table>

- We can further divide the teams playing into home team and away team. Some series have the series name embedded in this field, so we can add another column, <b>Series Name</b>, for it.
- We can divide the year into the series <b>starting year</b> and <b>ending year</b>.
- The result can be divided into <b>home wins</b> and <b>away wins</b>.
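
The year split described above could be sketched as follows. This is only a minimal sketch: the helper name `split_years` is hypothetical, and it assumes that a season label is either a single year (`2018`) or a split season (`1876/77`) that ends in the calendar year after it starts.

```python
import re

def split_years(years):
    """Split a season label such as '1876/77' or '2018' into (start_year, end_year)."""
    match = re.fullmatch(r'(\d{4})(?:/(\d{2}|\d{4}))?', years.strip())
    if match is None:
        raise ValueError('unexpected season label: {!r}'.format(years))
    start = int(match.group(1))
    if match.group(2) is None:
        # Single-year label: the series starts and ends in the same year.
        return start, start
    # Assumption: a split season always ends in the following calendar year,
    # which also covers a century rollover such as '1999/00'.
    return start, start + 1

print(split_years('1876/77'))  # (1876, 1877)
print(split_years('2018'))     # (2018, 2018)
```

Raising on an unexpected label, instead of returning a default, makes it easy to spot rows that do not follow the assumed pattern.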

<p>With the updated features, the DataFrame would look like this:</p>
<table>
    <thead>
    <tr>
        <td><b>datadaysId</b></td>
        <td><b>Link</b></td>
        <td><b>Name</b></td>
        <td><b>Home</b></td>
        <td><b>Away</b></td>
        <td><b>startYear</b></td>
        <td><b>endYear</b></td>
        <td><b>TeamWon</b></td>
        <td><b>HomeWins</b></td>
        <td><b>AwayWins</b></td>
        <td><b>Matches</b></td>
    </tr>
    </thead>
    <tr>
        <td>685654</td>
        <td>/ci/engine/series/60260.html</td>
        <td></td>
        <td>Australia</td>
        <td>England</td>
        <td>1876</td>
        <td>1877</td>
        <td>Drawn</td>
        <td>1</td>
        <td>1</td>
        <td>2</td>
    </tr>
</table>
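
The HomeWins, AwayWins and Matches columns could be derived from the raw result string roughly like this. It is a sketch under stated assumptions: `split_result` is a hypothetical helper, the result always has the form `W-L (N)` with the winner's tally first (as noted above), and the winner's name is spelled exactly like the home team's name when the home side won.

```python
import re

def split_result(result, winner, home):
    """Turn a result like '4-1 (5)' into (home_wins, away_wins, matches)."""
    match = re.fullmatch(r'(\d+)-(\d+) \((\d+)\)', result.strip())
    if match is None:
        raise ValueError('unexpected result format: {!r}'.format(result))
    first, second, matches = (int(group) for group in match.groups())
    if winner == home:
        # The first number belongs to the winner, which here is the home side.
        return first, second, matches
    # Away win, or a drawn series (where both numbers are equal anyway).
    return second, first, matches

print(split_result('1-1 (2)', 'drawn', 'Australia'))    # (1, 1, 2)
print(split_result('1-0 (1)', 'Australia', 'England'))  # (0, 1, 1)
```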

Now that we have the format of the features, let's extract the values from the bs4 ResultSet.

In [6]:
final_result = stats[-2]
final_result

<tr class="data1" data-days="737346">
<td class="left"><a class="data-link" href="/ci/engine/series/1157748.html">West Indies in India Test Series</a></td>
<td class="left" nowrap="nowrap">2018/19</td>
<td nowrap="nowrap"></td>
<td class="left" nowrap="nowrap"><a class="data-link" href="/ci/content/team/6.html">India</a></td>
<td nowrap="nowrap">2-0 (2)</td>
</tr>

This series is the inspiration for this data set, so the data is up to date.

In [8]:
records = []
for series in stats:
    cells = series.find_all('td')          # look up the row's <td> cells once
    datadaysId = int(series['data-days'])  # numeric attribute on the <tr> tag
    Link = series.find('a').get('href')    # first link in the row: the series page
    teams_text = cells[0].text
    years = cells[1].text
    winner = cells[3].text
    result = cells[4].text
    records.append((datadaysId, Link, teams_text, years, winner, result))

In [9]:
records[-3:]

[(737313,
  '/ci/engine/series/1119531.html',
  'Pataudi Trophy (India in England)',
  '2018',
  'England',
  '4-1 (5)'),
 (737346,
  '/ci/engine/series/1157748.html',
  'West Indies in India Test Series',
  '2018/19',
  'India',
  '2-0 (2)'),
 (737351,
  '/ci/engine/series/1157363.html',
  'Pakistan v Australia Test Series (in United Arab Emirates)',
  '2018/19',
  'Pakistan',
  '1-0 (2)')]

In [10]:
import pandas as pd
series = pd.DataFrame(records, columns=['datadaysID', 'link', 'teams_text', 'years', 'winner', 'result'])

In [17]:
series.head(5)

Unnamed: 0,datadaysID,link,teams_text,years,winner,result
0,685654,/ci/engine/series/60260.html,England in Australia Test Series,1876/77,drawn,1-1 (2)
1,686294,/ci/engine/series/60261.html,England in Australia Test Match,1878/79,Australia,1-0 (1)
2,686907,/ci/engine/series/60262.html,Australia in England Test Match,1880,England,1-0 (1)
3,687459,/ci/engine/series/60263.html,England in Australia Test Series,1881/82,Australia,2-0 (4)
4,687627,/ci/engine/series/60264.html,Australia in England Test Match,1882,Australia,1-0 (1)


In [20]:
series.tail(6)

Unnamed: 0,datadaysID,link,teams_text,years,winner,result
732,737236,/ci/engine/series/1135145.html,Sobers/Tissera Trophy (Sri Lanka in West Indies),2018,drawn,1-1 (3)
733,737254,/ci/engine/series/1146713.html,Bangladesh in West Indies Test Series,2018,West Indies,2-0 (2)
734,737263,/ci/engine/series/1142576.html,South Africa in Sri Lanka Test Series,2018,Sri Lanka,2-0 (2)
735,737313,/ci/engine/series/1119531.html,Pataudi Trophy (India in England),2018,England,4-1 (5)
736,737346,/ci/engine/series/1157748.html,West Indies in India Test Series,2018/19,India,2-0 (2)
737,737351,/ci/engine/series/1157363.html,Pakistan v Australia Test Series (in United Ar...,2018/19,Pakistan,1-0 (2)


## Feature Value distribution

Let's now see how the data is distributed across the whole DataFrame and whether any values fall outside the pattern we assumed for each feature.

### 1. datadaysID

In [24]:
print("Total number of datadaysID elements are {}, number of unique id elements are {}".format(len(series), series['datadaysID'].nunique()))

Total number of datadaysID elements are 738, number of unique id elements are 729


Nine datadaysID values are repeated. Let's check some of the most frequent datadaysID values.

In [15]:
# Top 3 most frequently occurring datadaysID values
series['datadaysID'].value_counts()[:3]

731538    2
733630    2
732576    2
Name: datadaysID, dtype: int64

We can see that some datadaysID values are not unique. There is no obvious pattern among rows that share a datadaysID either (as shown below). Since we don't even know what the field represents in cricketing terms, we can ignore this feature during analysis. But let's keep it in the DataFrame for now.

In [16]:
series[series['datadaysID'] == 731538]

Unnamed: 0,datadaysID,link,teams_text,years,winner,result
475,731538,/ci/engine/series/60737.html,Pakistan in Zimbabwe Test Series,2002/03,Pakistan,2-0 (2)
476,731538,/ci/engine/series/60736.html,Sri Lanka in South Africa Test Series,2002/03,South Africa,2-0 (2)
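
One hypothesis that could be tested here, purely as an assumption since the page does not document the attribute: the name data-days suggests a day count. The sketch below treats the value as a count of days that is 365 larger than Python's proleptic Gregorian ordinal (the convention used by MySQL's `TO_DAYS()`); the helper name `data_days_to_date` and the constant are hypothetical.

```python
from datetime import date

# Assumption: data-days counts days from "year 0", i.e. it is 365 larger
# than Python's date ordinal (where 0001-01-01 is day 1).
ASSUMED_OFFSET = 365

def data_days_to_date(data_days):
    """Decode a data-days value into a calendar date under the assumption above."""
    return date.fromordinal(data_days - ASSUMED_OFFSET)

for value in (685654, 731538, 737346):
    print(value, data_days_to_date(value))
# 685654 1877-04-04
# 731538 2002-11-19
# 737346 2018-10-14
```

Each decoded date falls inside the season listed for the corresponding rows (1876/77, 2002/03 and 2018/19). If the hypothesis holds, two series sharing a value would simply share a date, which is plausible for series played in parallel.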


### 2. link

In [26]:
print("Total number of links are {}, number of unique links are {}".format(len(series), series['link'].nunique()))

Total number of links are 738, number of unique links are 738


Each link points to an HTML page containing the scorecard summary of every match in that series. There are no duplicate links.
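
Since the links are site-relative paths, they would need to be resolved against the page they were scraped from before they can be requested. A minimal sketch using only the standard library follows; `series_url` and `series_id` are hypothetical helper names, and treating the number in the path as a series identifier is an assumption.

```python
import re
from urllib.parse import urljoin

# The records page the rows were scraped from.
BASE_URL = 'http://stats.espncricinfo.com/ci/content/records/335431.html'

def series_url(link):
    """Resolve a site-relative series link against the records page URL."""
    return urljoin(BASE_URL, link)

def series_id(link):
    """Extract the numeric part of a link like '/ci/engine/series/60260.html'."""
    match = re.search(r'/series/(\d+)\.html$', link)
    return int(match.group(1)) if match else None

print(series_url('/ci/engine/series/60260.html'))
# http://stats.espncricinfo.com/ci/engine/series/60260.html
print(series_id('/ci/engine/series/60260.html'))  # 60260
```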

### 3. teams_text

In [32]:
series['teams_text'].value_counts()[:5]

The Ashes (England in Australia)              35
The Ashes (Australia in England)              35
England in New Zealand Test Series            18
New Zealand in England Test Series            17
The Wisden Trophy (West Indies in England)    16
Name: teams_text, dtype: int64

The most frequent series is "The Ashes", with 35 series played in each of Australia and England.
<p>Some of the series have names. All the titles we checked are in the (Team A in Team B) format. <br>
    We can find the countries by splitting at <b style="color: green"><u>in</u></b> and checking whether the strings around it (country names or parts of them) are Test-playing nations.</p>
<p>We can now check whether every title yields exactly two names of Test-playing nations once we break up the string.</p>

In [34]:
test_playing_nations = ['India', 'England', 'Australia', 'South Africa', 'West Indies', 'New Zealand', 'Pakistan', 'Sri Lanka', 'Zimbabwe', 'Bangladesh', 'Ireland', 'Afghanistan']
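
A quick way to run the "exactly two nations" check, before any splitting at "in", is a plain substring test against the list above. This is a rough sketch: `nations_in` is a hypothetical helper, the list is repeated only to keep the block self-contained, and the match is case-sensitive.

```python
# Same list as in the cell above, repeated so the sketch runs on its own.
test_playing_nations = ['India', 'England', 'Australia', 'South Africa',
                        'West Indies', 'New Zealand', 'Pakistan', 'Sri Lanka',
                        'Zimbabwe', 'Bangladesh', 'Ireland', 'Afghanistan']

def nations_in(title):
    """Return the Test-playing nations mentioned in a title, in list order."""
    return [nation for nation in test_playing_nations if nation in title]

titles = [
    'The Ashes (England in Australia)',
    'West Indies in India Test Series',
    'Pakistan v Australia Test Series (in United Arab Emirates)',
]
for title in titles:
    found = nations_in(title)
    print(len(found), found)
```

The result follows the order of `test_playing_nations`, not the order in the title, so it only answers how many nations appear, not which one is touring.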

In [None]:
import re
import numpy as np

def split_teams(team):
    """Return [team before 'in', team after 'in'] for a 'Team A in Team B' title."""
    # Keep letters only, then split the title into words.
    words = re.sub(r'[^a-zA-Z]', ' ', team).split()
    for idx, word in enumerate(words):
        if word.lower() != 'in':
            continue
        result = []
        # Team before 'in' (the touring side): try one word, then two words.
        before = [' '.join(words[max(idx - n, 0):idx]) for n in (1, 2)]
        # Team after 'in' (the host side): try one word, then two words.
        after = [' '.join(words[idx + 1:idx + 1 + n]) for n in (1, 2)]
        for candidates in (before, after):
            # np.nan is a constant, not a function, so it is used without parentheses.
            result.append(next((c for c in candidates if c in test_playing_nations), np.nan))
        return result

split_teams('The Ashes (England in Australia)')

In [115]:
series['teams_text'].iloc[737]

'Pakistan v Australia Test Series (in United Arab Emirates)'
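
This title does not follow the (Team A in Team B) pattern: the two teams are joined by "v" and the venue is a third country. A hedged sketch of one way to handle it is below. The pattern is inferred from this single title, so both the regular expression and the helper name `split_neutral` are assumptions that would need checking against the other rows.

```python
import re

# Assumed shape: '<Team A> v <Team B> Test Series (in <Venue>)'
NEUTRAL_PATTERN = re.compile(
    r'^(?P<first>.+?) v (?P<second>.+?) Test (?:Series|Match) \(in (?P<venue>.+)\)$'
)

def split_neutral(title):
    """Return (first_team, second_team, venue) for a neutral-venue title, else None."""
    match = NEUTRAL_PATTERN.match(title)
    if match is None:
        return None
    return match.group('first'), match.group('second'), match.group('venue')

print(split_neutral('Pakistan v Australia Test Series (in United Arab Emirates)'))
# ('Pakistan', 'Australia', 'United Arab Emirates')
print(split_neutral('The Ashes (England in Australia)'))
# None
```

Which of the two sides should count as the home team in such a series is a modelling choice that the title alone does not settle.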

In [129]:
series.head()

Unnamed: 0,datadaysID,link,teams_text,years,winner,result,teams,Home,Away
0,685654,/ci/engine/series/60260.html,England in Australia Test Series,1876/77,drawn,1-1 (2),"[England, Australia]",England,Australia
1,686294,/ci/engine/series/60261.html,England in Australia Test Match,1878/79,Australia,1-0 (1),"[England, Australia]",England,Australia
2,686907,/ci/engine/series/60262.html,Australia in England Test Match,1880,England,1-0 (1),"[Australia, England]",Australia,England
3,687459,/ci/engine/series/60263.html,England in Australia Test Series,1881/82,Australia,2-0 (4),"[England, Australia]",England,Australia
4,687627,/ci/engine/series/60264.html,Australia in England Test Match,1882,Australia,1-0 (1),"[Australia, England]",Australia,England


In [130]:
# series.loc[series['Home'] == '']
series['Home'].isnull().sum()

0