# RQ2
#### Is there a home-field advantage? It is generally believed that there is an underlying home-field advantage in sport, i.e. a higher probability of winning for the home team. Let's check for this and see whether the outcome of the game (win, draw, lose) is correlated with the playing side (home or away). For 5 different Premier League teams, show the contingency table (outcome x side). Then perform an "overall" Chi-squared test in the following way: build a single contingency table that contains all the matches in which only one of the 5 previously selected teams is involved, to see whether there is a home-field advantage. State clearly the tested hypothesis and whether it is accepted or rejected.

In [1]:
#first of all we import the libraries that we need
import pandas as pd
from scipy.stats import chi2_contingency #for calculating the chi squared test

In [25]:
matches_England = pd.read_json('matches/matches_England.json', orient='columns') #read Premier League
#matches_England has many columns. One of them, teamsData, holds a nested table with one record per team
#for each match. We need it for this question, so we separate it from matches_England.
user_dict=matches_England['teamsData'].to_dict()
data=pd.DataFrame.from_dict({(i,j):user_dict[i][j]
                           for i in user_dict.keys() 
                           for j in user_dict[i].keys()},
                    orient='index')
#the formation variable is not important for us, so we delete it from our data.
del data['formation']

760
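
The flattening step above relies on a dictionary comprehension with tuple keys, which pandas turns into a two-level index. A minimal sketch on a hypothetical miniature of `user_dict` (the two matches, the string team keys and the reduced set of fields are assumptions made for illustration, not the real file contents):

```python
import pandas as pd

# Hypothetical miniature of matches_England['teamsData'].to_dict():
# outer key = match row, inner key = team id, value = that team's record.
user_dict = {
    0: {'1646': {'side': 'home', 'score': 1}, '1659': {'side': 'away', 'score': 2}},
    1: {'1627': {'side': 'away', 'score': 0}, '1628': {'side': 'home', 'score': 2}},
}

# Each (match, team) tuple becomes one row label, so the result has one row per team per match.
flat = pd.DataFrame.from_dict(
    {(match, team): record
     for match, teams in user_dict.items()
     for team, record in teams.items()},
    orient='index',
)
print(flat)
```

Because every match contributes exactly two rows, the flattened table has twice as many rows as there are matches.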

In [26]:
data.head(5)

Unnamed: 0,Unnamed: 1,scoreET,coachId,side,teamId,score,scoreP,hasFormation,scoreHT
0,1646,0,8880,home,1646,1,0,1,1
0,1659,0,8934,away,1659,2,0,1,0
1,1627,0,0,away,1627,0,0,1,0
1,1628,0,8357,home,1628,2,0,1,0
2,1609,0,7845,away,1609,1,0,1,1


In [3]:
#to calculate the final score we need the score at the end of the match (not counting penalties),
#the penalty score and the extra-time score, so we sum them together.
data['final score']=data['score']+data['scoreP']+data['scoreET']

In [4]:
data.head(5)

Unnamed: 0,Unnamed: 1,scoreET,coachId,side,teamId,score,scoreP,hasFormation,scoreHT,final score
0,1646,0,8880,home,1646,1,0,1,1,1
0,1659,0,8934,away,1659,2,0,1,0,2
1,1627,0,0,away,1627,0,0,1,0,0
1,1628,0,8357,home,1628,2,0,1,0,2
2,1609,0,7845,away,1609,1,0,1,1,1


In [5]:
#now our data is ready for finding the winner and loser of each match, but first we need to reset the
#index of our table so that the rows are numbered 0, 1, 2, ...
data.index=range(len(data['scoreET']))

In [6]:
data.head(5)

Unnamed: 0,scoreET,coachId,side,teamId,score,scoreP,hasFormation,scoreHT,final score
0,0,8880,home,1646,1,0,1,1,1
1,0,8934,away,1659,2,0,1,0,2
2,0,0,away,1627,0,0,1,0,0
3,0,8357,home,1628,2,0,1,0,2
4,0,7845,away,1609,1,0,1,1,1


In [7]:
#The following code compares the two teams of each match (two consecutive rows) and decides who is the
#winner and who is the loser. If the match was a draw, both rows are marked as such. The outcome is stored in a new column called result.
for i in range(0, len(data)-1, 2):
    first = data.loc[i,'final score']
    second = data.loc[i+1,'final score']
    if first > second:
        data.loc[i,'result']='win'
        data.loc[i+1,'result']='lose'
    elif first < second:
        data.loc[i,'result']='lose'
        data.loc[i+1,'result']='win'
    else:
        data.loc[i,'result']='draw'
        data.loc[i+1,'result']='draw'
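
The row-by-row loop can also be written without explicit indexing into the DataFrame. The sketch below is one possible alternative, not the approach used in this notebook; it runs on a small toy table (values are illustrative) and assumes, like the loop, that the two rows of a match are consecutive.

```python
import numpy as np
import pandas as pd

# Toy data in the same layout as `data`: two consecutive rows per match.
toy = pd.DataFrame({
    'side': ['home', 'away', 'away', 'home', 'home', 'away'],
    'final score': [1, 2, 0, 2, 1, 1],
})

# Reshape to one row per match: column 0 is the first team, column 1 the second.
scores = toy['final score'].to_numpy().reshape(-1, 2)
sign = np.sign(scores[:, 0] - scores[:, 1]).tolist()  # 1, 0 or -1 per match

labels = {1: 'win', 0: 'draw', -1: 'lose'}
first = [labels[s] for s in sign]    # outcome for the first row of each match
second = [labels[-s] for s in sign]  # mirrored outcome for the second row

# Interleave the two lists so they line up with the original row order.
toy['result'] = [r for pair in zip(first, second) for r in pair]
print(toy)
```

The reshape fails loudly if the number of rows is odd, which is a useful check that no match is missing one of its two teams.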

In [8]:
data.head(5)

Unnamed: 0,scoreET,coachId,side,teamId,score,scoreP,hasFormation,scoreHT,final score,result
0,0,8880,home,1646,1,0,1,1,1,lose
1,0,8934,away,1659,2,0,1,0,2,win
2,0,0,away,1627,0,0,1,0,0,lose
3,0,8357,home,1628,2,0,1,0,2,win
4,0,7845,away,1609,1,0,1,1,1,win


In [10]:
#we choose 5 teams and write their teamId values into a list, then build subdata1, which contains only
#the rows of those specific teams.
list1=[1646,1627,1673,1633,1611]
subdata1=(data.loc[data['teamId'].isin(list1)])

In [11]:
subdata1.head(5)

Unnamed: 0,scoreET,coachId,side,teamId,score,scoreP,hasFormation,scoreHT,final score,result
0,0,8880,home,1646,1,0,1,1,1,lose
2,0,0,away,1627,0,0,1,0,0,lose
5,0,18572,home,1673,0,0,1,0,0,lose
8,0,3295,home,1611,1,0,1,1,1,win
19,0,8235,home,1633,3,0,1,1,3,win


In [12]:
#this code builds a dictionary with the number of wins, losses and draws at home and away
#for each of those specific teams.
d = {}
for p in list1:
    d[p]={'home':[(len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'home') & (subdata1.result == 'win')])),
          (len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'home') & (subdata1.result == 'lose')])),
          (len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'home') & (subdata1.result == 'draw')]))],
          'away':[(len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'away') & (subdata1.result == 'win')])),
          (len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'away') & (subdata1.result == 'lose')])),
          (len(subdata1[(subdata1.teamId == p) & (subdata1.side == 'away') & (subdata1.result == 'draw')]))]}

In [13]:
d

{1646: {'home': [7, 7, 5], 'away': [7, 5, 7]},
 1627: {'home': [3, 7, 9], 'away': [3, 12, 4]},
 1673: {'home': [6, 8, 5], 'away': [3, 11, 5]},
 1633: {'home': [7, 6, 6], 'away': [3, 10, 6]},
 1611: {'home': [15, 2, 2], 'away': [10, 5, 4]}}

In [18]:
#build one dataframe per team. In other words, we separate each team of the dictionary d into its own dataframe
df1646=pd.DataFrame(data=d[1646],index=['wins','losses','draws'])
df1627=pd.DataFrame(data=d[1627],index=['wins','losses','draws'])
df1673=pd.DataFrame(data=d[1673],index=['wins','losses','draws'])
df1633=pd.DataFrame(data=d[1633],index=['wins','losses','draws'])
df1611=pd.DataFrame(data=d[1611],index=['wins','losses','draws'])

print (df1646.T)
print (df1627.T)
print (df1673.T)
print (df1633.T)
print (df1611.T)

      wins  losses  draws
home     7       7      5
away     7       5      7
      wins  losses  draws
home     3       7      9
away     3      12      4
      wins  losses  draws
home     6       8      5
away     3      11      5
      wins  losses  draws
home     7       6      6
away     3      10      6
      wins  losses  draws
home    15       2      2
away    10       5      4
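
The same kind of side x outcome table can be produced directly with `pd.crosstab`, without the intermediate dictionary. This is a sketch on toy rows (the values are illustrative, and the helper name `contingency` is hypothetical); it assumes a frame with the `teamId`, `side` and `result` columns used above.

```python
import pandas as pd

# Toy rows with the same columns as subdata1; the values are illustrative only.
toy = pd.DataFrame({
    'teamId': [1611, 1611, 1611, 1611, 1646, 1646],
    'side':   ['home', 'home', 'away', 'away', 'home', 'away'],
    'result': ['win', 'win', 'lose', 'draw', 'draw', 'win'],
})

outcomes = ['win', 'lose', 'draw']
sides = ['home', 'away']

def contingency(frame, team_id):
    """Count outcomes per side for one team (hypothetical helper)."""
    rows = frame[frame['teamId'] == team_id]
    table = pd.crosstab(rows['side'], rows['result'])
    # reindex fixes the row/column order and fills missing combinations with 0
    return table.reindex(index=sides, columns=outcomes, fill_value=0)

table_1611 = contingency(toy, 1611)
print(table_1611)
```

The `reindex` call matters for the Chi-squared test later on: without it, an outcome that never occurs for a team would be missing from the table instead of appearing as a zero count.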


In [31]:
#with the following code we find the chi-squared statistic and the p-value for team 1611 only
re=chi2_contingency(df1611)
print('the chi-squared is',re[0],'and the p-value is',re[1])

the chi-squared is 2.9523809523809526 and the p-value is 0.22850653346991975
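
The test above covers a single team, while the question asks for one overall table. A possible sketch of that step: the null hypothesis H0 is that the match outcome is independent of the playing side, the alternative H1 is that the two are associated. The counts are copied from the dictionary `d` printed earlier. Two assumptions are made here: the significance level of 0.05 is a conventional choice, not something stated in the question, and the per-team tables are simply summed, so any match between two of the selected teams is counted twice rather than excluded as the question requires.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Per-team counts copied from the dictionary d above, ordered [wins, losses, draws].
d = {1646: {'home': [7, 7, 5], 'away': [7, 5, 7]},
     1627: {'home': [3, 7, 9], 'away': [3, 12, 4]},
     1673: {'home': [6, 8, 5], 'away': [3, 11, 5]},
     1633: {'home': [7, 6, 6], 'away': [3, 10, 6]},
     1611: {'home': [15, 2, 2], 'away': [10, 5, 4]}}

# One side x outcome table per team, then an element-wise sum over the teams.
frames = [pd.DataFrame(counts, index=['wins', 'losses', 'draws']).T
          for counts in d.values()]
overall = sum(frames[1:], frames[0])
print(overall)

chi2, p_value, dof, expected = chi2_contingency(overall)
print('chi-squared:', chi2, 'p-value:', p_value, 'degrees of freedom:', dof)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print('Reject H0: outcome and side appear to be associated.')
else:
    print('Fail to reject H0: no significant association between outcome and side.')
```

To follow the question strictly, the matches in which two of the five teams face each other would have to be dropped from `subdata1` before counting, which requires a match identifier that the summed counts no longer carry.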
