# Challenge Set 9

## Part III: Soccer Data

*Introductory - Intermediate level SQL*

--

Please complete this exercise using sqlite3 and Jupyter notebook.

Download the [SQLite database](https://www.kaggle.com/hugomathien/soccer/downloads/soccer.zip) and load it in your notebook using the sqlite3 library.

1. Which team scored the most points when playing at home?  

2. Did this team also score the most points when playing away?  

3. How many matches resulted in a tie?  

4. How many players have Smith for their last name? How many have 'smith' anywhere in their name?

5. What was the median tie score? Use the value determined in the previous question for the number of tie games. *Hint:* PostgreSQL does not have a median function. Instead, think about the steps required to calculate a median and use the [`WITH`](https://www.postgresql.org/docs/8.4/static/queries-with.html) command to store stepwise results as a table and then operate on these results. 

6. What percentage of players prefer their left or right foot? *Hint:* Calculate either the right or left foot, whichever is easier based on how you set up the problem.

In [1]:
import sqlalchemy as db
import pandas as pd

In [2]:
from sqlalchemy import create_engine
from pprint import pprint
engine = create_engine('sqlite:///database.sqlite', echo=False)
connection = engine.connect()
metadata = db.MetaData()
engine.table_names()

['Country',
 'League',
 'Match',
 'Player',
 'Player_Attributes',
 'Team',
 'Team_Attributes',
 'sqlite_sequence']

In [3]:
result = connection.execute("SELECT * FROM League LIMIT 3;")
pprint(list(result))

[(1, 1, 'Belgium Jupiler League'),
 (1729, 1729, 'England Premier League'),
 (4769, 4769, 'France Ligue 1')]
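
The exercise asks for the sqlite3 library, while the cells above use SQLAlchemy. As a minimal sketch of the same inspection with sqlite3 alone, the block below lists the tables and reads a few rows. It runs against an in-memory stand-in so that it is self-contained; the `League` column names (`id`, `country_id`, `name`) are assumptions, and the two rows are copied from the output above.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained. For the exercise itself, pass the
# path of the downloaded file instead, e.g. sqlite3.connect("database.sqlite").
conn = sqlite3.connect(":memory:")

# Stand-in for the real League table (column names are an assumption).
conn.execute("CREATE TABLE League (id INTEGER, country_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO League VALUES (?, ?, ?)",
    [(1, 1, "Belgium Jupiler League"), (1729, 1729, "England Premier League")],
)

# sqlite_master holds one row per table in the database.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
leagues = conn.execute("SELECT * FROM League LIMIT 3").fetchall()

print(tables)
print(leagues)
conn.close()
```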


## Which team scored the most points when playing at home?

In [4]:
team = db.Table('Team', metadata, autoload=True, autoload_with=engine)
team.columns.keys()

['id', 'team_api_id', 'team_fifa_api_id', 'team_long_name', 'team_short_name']

In [5]:
match = db.Table('Match', metadata, autoload=True, autoload_with=engine)
print('# of match Table fields: ',len(match.columns.keys()))

# of match Table fields:  115


In [6]:
match.columns.keys()

['id',
 'country_id',
 'league_id',
 'season',
 'stage',
 'date',
 'match_api_id',
 'home_team_api_id',
 'away_team_api_id',
 'home_team_goal',
 'away_team_goal',
 'home_player_X1',
 'home_player_X2',
 'home_player_X3',
 'home_player_X4',
 'home_player_X5',
 'home_player_X6',
 'home_player_X7',
 'home_player_X8',
 'home_player_X9',
 'home_player_X10',
 'home_player_X11',
 'away_player_X1',
 'away_player_X2',
 'away_player_X3',
 'away_player_X4',
 'away_player_X5',
 'away_player_X6',
 'away_player_X7',
 'away_player_X8',
 'away_player_X9',
 'away_player_X10',
 'away_player_X11',
 'home_player_Y1',
 'home_player_Y2',
 'home_player_Y3',
 'home_player_Y4',
 'home_player_Y5',
 'home_player_Y6',
 'home_player_Y7',
 'home_player_Y8',
 'home_player_Y9',
 'home_player_Y10',
 'home_player_Y11',
 'away_player_Y1',
 'away_player_Y2',
 'away_player_Y3',
 'away_player_Y4',
 'away_player_Y5',
 'away_player_Y6',
 'away_player_Y7',
 'away_player_Y8',
 'away_player_Y9',
 'away_player_Y10',
 'away_player

In [7]:
#df.iloc[:,8:12]

In [8]:
query = 'SELECT home_team_api_id, \
                      max(home_team_goal) AS max_home_goals \
               FROM match \
               GROUP BY home_team_api_id ORDER BY max_home_goals DESC'
#pd.read_sql_query(query, connection)

In [9]:
query = 'SELECT team.team_long_name, \
                TBL.max_home_goals \
         FROM (SELECT home_team_api_id, \
                      max(home_team_goal) AS max_home_goals \
               FROM match \
               GROUP BY home_team_api_id) AS TBL \
         INNER JOIN team ON TBL.home_team_api_id=team.team_api_id \
         ORDER BY TBL.max_home_goals DESC'
pd.read_sql_query(query, connection)

| | team_long_name | max_home_goals |
|---|---|---|
| 0 | Real Madrid CF | 10 |
| 1 | PSV | 10 |
| 2 | Tottenham Hotspur | 9 |
| 3 | FC Bayern Munich | 9 |
| 4 | Celtic | 9 |
| ... | ... | ... |
| 294 | Birmingham City | 2 |
| 295 | Le Havre AC | 2 |
| 296 | Pescara | 2 |
| 297 | DSC Arminia Bielefeld | 2 |
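
The query above ranks teams by their highest-scoring single home match. If "most points" is read instead as total goals scored at home, `SUM` can replace `MAX`. The following is a minimal sketch of that alternative reading on a small, hypothetical in-memory dataset (the team names and scores are invented for illustration); the two readings can produce different winners.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (team_api_id INTEGER, team_long_name TEXT)")
conn.execute(
    "CREATE TABLE Match (home_team_api_id INTEGER, away_team_api_id INTEGER, "
    "home_team_goal INTEGER, away_team_goal INTEGER)"
)

# Hypothetical teams and results, for illustration only.
conn.executemany("INSERT INTO Team VALUES (?, ?)", [(1, "Team A"), (2, "Team B")])
conn.executemany(
    "INSERT INTO Match VALUES (?, ?, ?, ?)",
    [
        (1, 2, 5, 0),  # Team A at home: one high-scoring match
        (2, 1, 2, 1),  # Team B at home: three moderate matches
        (2, 1, 2, 2),
        (2, 1, 3, 0),
    ],
)

# Total home goals per team, highest first.
home_totals = conn.execute(
    """
    SELECT t.team_long_name, SUM(m.home_team_goal) AS total_home_goals
    FROM Match AS m
    INNER JOIN Team AS t ON m.home_team_api_id = t.team_api_id
    GROUP BY t.team_api_id, t.team_long_name
    ORDER BY total_home_goals DESC
    """
).fetchall()

print(home_totals)
conn.close()
```

In this toy data Team A has the best single match (5 goals), but Team B has the higher home total (7 goals).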


## Did this team also score the most points when playing away?  

In [10]:
query = 'SELECT team.team_long_name, \
                TBL.max_away_goals \
         FROM (SELECT away_team_api_id, \
                      max(away_team_goal) AS max_away_goals \
               FROM match \
               GROUP BY away_team_api_id) AS TBL \
         INNER JOIN team ON TBL.away_team_api_id=team.team_api_id \
         ORDER BY TBL.max_away_goals DESC'
pd.read_sql_query(query, connection)

| | team_long_name | max_away_goals |
|---|---|---|
| 0 | Paris Saint-Germain | 9 |
| 1 | Real Madrid CF | 8 |
| 2 | FC Barcelona | 8 |
| 3 | FC Bayern Munich | 8 |
| 4 | Club Brugge KV | 7 |
| ... | ... | ... |
| 294 | Royal Excel Mouscron | 2 |
| 295 | Amadora | 2 |
| 296 | AC Arles-Avignon | 2 |
| 297 | Carpi | 2 |


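No. Paris Saint-Germain had the highest single-match away total (9 goals). Real Madrid CF, which tied with PSV for the highest single-match home total (10 goals), came close with 8 away goals.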

## How many matches resulted in a tie?  
6,596 matches ended in a tie.

In [11]:
query = 'SELECT match_api_id, \
                ABS(home_team_goal - away_team_goal) As spread \
                FROM match'
#pd.read_sql_query(query, connection)

In [12]:
query = 'SELECT spreads.spread, count(spreads.match_api_id) As num_matches FROM \
         (SELECT match_api_id, \
                ABS(home_team_goal - away_team_goal) As spread \
                FROM match) As spreads \
         GROUP BY spreads.spread'
pd.read_sql_query(query, connection)

| | spread | num_matches |
|---|---|---|
| 0 | 0 | 6596 |
| 1 | 1 | 9598 |
| 2 | 2 | 5740 |
| 3 | 3 | 2523 |
| 4 | 4 | 1054 |
| 5 | 5 | 317 |
| 6 | 6 | 106 |
| 7 | 7 | 31 |
| 8 | 8 | 11 |
| 9 | 9 | 2 |


In [13]:
query = 'SELECT spreads.spread, count(spreads.match_api_id) As num_matches FROM \
         (SELECT match_api_id, \
                ABS(home_team_goal - away_team_goal) As spread \
                FROM match) As spreads \
         WHERE spreads.spread=0'
pd.read_sql_query(query, connection)

| | spread | num_matches |
|---|---|---|
| 0 | 0 | 6596 |
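
The tie count can also be obtained without computing the spread, by comparing the two goal columns directly. A minimal sketch on a hypothetical four-match in-memory table (the rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Match (match_api_id INTEGER, "
    "home_team_goal INTEGER, away_team_goal INTEGER)"
)

# Hypothetical results: matches 1 and 3 are ties.
conn.executemany(
    "INSERT INTO Match VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 2, 0), (3, 0, 0), (4, 1, 3)],
)

# A match is a tie when both sides scored the same number of goals.
num_ties = conn.execute(
    "SELECT COUNT(*) FROM Match WHERE home_team_goal = away_team_goal"
).fetchone()[0]

print(num_ties)
conn.close()
```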


## How many players have Smith for their last name? How many have 'smith' anywhere in their name?
14 players have Smith as their last name.  
17 players have 'smith' anywhere in their name.

In [14]:
player = db.Table('Player', metadata, autoload=True, autoload_with=engine)
player.columns.keys()

['id',
 'player_api_id',
 'player_name',
 'player_fifa_api_id',
 'birthday',
 'height',
 'weight']

In [15]:
query = 'SELECT player_name FROM player'
#pd.read_sql_query(query, connection)

In [16]:
query = "SELECT player_name FROM player \
         WHERE LOWER(player_name) LIKE '% smith'"
pd.read_sql_query(query, connection)

| | player_name |
|---|---|
| 0 | Adam Smith |
| 1 | Alan Smith |
| 2 | Brad Smith |
| 3 | Cameron Smith |
| 4 | Chris Smith |
| 5 | Daan Smith |
| 6 | David Smith |
| 7 | Gordon Smith |
| 8 | Graeme Smith |
| 9 | Graeme Smith |


In [17]:
query = "SELECT player_name FROM player \
         WHERE LOWER(player_name) LIKE '%smith%'"
pd.read_sql_query(query, connection)

| | player_name |
|---|---|
| 0 | Adam Smith |
| 1 | Alan Smith |
| 2 | Brad Smith |
| 3 | Cameron Smith |
| 4 | Chris Smith |
| 5 | Daan Smith |
| 6 | David Smith |
| 7 | Gary Naysmith |
| 8 | Gordon Smith |
| 9 | Graeme Smith |
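
Both counts can be returned by one query instead of reading them off the row listings. The sketch below relies on two SQLite behaviors: `LIKE` is case-insensitive for ASCII characters by default (so `LOWER()` is optional), and a comparison evaluates to 0 or 1, which `SUM` can add up. It runs on a small in-memory table; "Pat Jones" is a hypothetical name added as a non-match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Player (player_name TEXT)")
conn.executemany(
    "INSERT INTO Player VALUES (?)",
    [("Adam Smith",), ("Alan Smith",), ("Gary Naysmith",), ("Pat Jones",)],
)

# '% smith' requires a space before the final word "smith";
# '%smith%' matches the substring anywhere in the name.
last_name_count, anywhere_count = conn.execute(
    """
    SELECT SUM(player_name LIKE '% smith'),
           SUM(player_name LIKE '%smith%')
    FROM Player
    """
).fetchone()

print(last_name_count, anywhere_count)
conn.close()
```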


## What was the median tie score? 

The median tie score is 1-1.

Use the value determined in the previous question for the number of tie games. *Hint:* PostgreSQL does not have a median function. Instead, think about the steps required to calculate a median and use the [`WITH`](https://www.postgresql.org/docs/8.4/static/queries-with.html) command to store stepwise results as a table and then operate on these results. 

In [18]:
query = 'SELECT spreads.spread, count(spreads.match_api_id) As num_matches \
         FROM \
             (SELECT match_api_id, \
                     ABS(home_team_goal - away_team_goal) As spread \
              FROM match) As spreads \
         WHERE spreads.spread=0'
pd.read_sql_query(query, connection)

| | spread | num_matches |
|---|---|---|
| 0 | 0 | 6596 |


In [19]:
query = 'SELECT scores.home_team_goal as goals_for_both_sides, \
                count(match_api_id) As num_matches \
         FROM \
             (SELECT match_api_id, home_team_goal FROM match \
              WHERE ABS(home_team_goal - away_team_goal)=0) AS scores \
         GROUP BY scores.home_team_goal'
pd.read_sql_query(query, connection)

| | goals_for_both_sides | num_matches |
|---|---|---|
| 0 | 0 | 1978 |
| 1 | 1 | 3014 |
| 2 | 2 | 1310 |
| 3 | 3 | 264 |
| 4 | 4 | 27 |
| 5 | 5 | 2 |
| 6 | 6 | 1 |


Simple inspection shows that the median tie score is 1-1. The 0-0 group holds 1,978 of the 6,596 tied matches, which is less than half (3,298), while the 0-0 and 1-1 groups together hold 4,992, which is well above half. The 50th percentile therefore falls within the 1-1 group.
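
The same reasoning can be written as a query, following the hint about using `WITH` to store stepwise results. The sketch below rebuilds the tied matches in an in-memory table from the grouped counts shown above, accumulates a running total per score, and picks the first score whose running total reaches the halfway point. It returns the lower median when the row count is even; that choice, and the CTE names `ties`, `by_score`, and `cumulative`, are assumptions of this sketch.

```python
import sqlite3

# Goals per side -> number of tied matches, copied from the grouped output.
tie_counts = {0: 1978, 1: 3014, 2: 1310, 3: 264, 4: 27, 5: 2, 6: 1}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Match (home_team_goal INTEGER, away_team_goal INTEGER)")
for goals, n in tie_counts.items():
    conn.executemany("INSERT INTO Match VALUES (?, ?)", [(goals, goals)] * n)

median_goals = conn.execute(
    """
    WITH ties AS (          -- step 1: keep only tied matches
        SELECT home_team_goal AS goals
        FROM Match
        WHERE home_team_goal = away_team_goal
    ),
    by_score AS (           -- step 2: count matches per tie score
        SELECT goals, COUNT(*) AS n
        FROM ties
        GROUP BY goals
    ),
    cumulative AS (         -- step 3: running total up to each score
        SELECT a.goals AS goals, SUM(b.n) AS running_total
        FROM by_score AS a
        INNER JOIN by_score AS b ON b.goals <= a.goals
        GROUP BY a.goals
    )
    SELECT MIN(goals)       -- step 4: first score reaching the midpoint
    FROM cumulative
    WHERE running_total >= (SELECT (COUNT(*) + 1) / 2 FROM ties)
    """
).fetchone()[0]

print(f"{median_goals}-{median_goals}")
conn.close()
```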

## What percentage of players prefer their left or right foot? 
*Hint:* Calculate either the right or left foot, whichever is easier based on how you set up the problem.

In [20]:
attributes = db.Table('Player_Attributes', metadata, autoload=True, autoload_with=engine)
#attributes.columns.keys()

In [21]:
query = 'SELECT preferred_foot, \
                Count(*) as NumOfPlayers, \
                ROUND(100 * cast(Count(*) as float) / sum(count(*)) over (),1) as Percent \
         FROM "Player_Attributes" \
         group by preferred_foot \
         order by NumOfPlayers'
pd.read_sql_query(query, connection)

| | preferred_foot | NumOfPlayers | Percent |
|---|---|---|---|
| 0 | | 836 | 0.5 |
| 1 | left | 44733 | 24.3 |
| 2 | right | 138409 | 75.2 |
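
The query above counts rows in `Player_Attributes`. If that table stores several records per player (an assumption, since its columns are not listed here), the percentages describe records rather than players. A minimal sketch of a per-player version follows, assuming a `player_api_id` column like the one in `Player` and using hypothetical rows. It skips missing values and avoids window functions, so it also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Player_Attributes (player_api_id INTEGER, preferred_foot TEXT)"
)

# Hypothetical data: players 1 and 3 have repeated records.
conn.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?)",
    [
        (1, "right"), (1, "right"), (1, "right"),
        (2, "left"),
        (3, "right"), (3, "right"),
        (4, "right"),
    ],
)

# Count each player once per foot and divide by the number of distinct players.
foot_share = conn.execute(
    """
    SELECT preferred_foot,
           COUNT(DISTINCT player_api_id) AS num_players,
           ROUND(100.0 * COUNT(DISTINCT player_api_id) /
                 (SELECT COUNT(DISTINCT player_api_id)
                  FROM Player_Attributes
                  WHERE preferred_foot IS NOT NULL), 1) AS percent
    FROM Player_Attributes
    WHERE preferred_foot IS NOT NULL
    GROUP BY preferred_foot
    ORDER BY num_players
    """
).fetchall()

print(foot_share)
conn.close()
```

In this toy data the row-level share of right-footed records is 6 of 7, while the per-player share is 3 of 4. A player recorded with both feet over time would be counted in both groups, so this remains an approximation.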


In [22]:
connection.close()