# Contents 

1. Required libraries
2. General goal statistics
   1. Highest-scoring leagues
   2. Highest-scoring teams
   3. Teams conceding the most goals
   4. Goal difference (scored minus conceded)
3. The best free kickers
4. Penalties analysis
5. The best bombardirs (top scorers)
   1. Right foot
   2. Left foot
   3. Head
6. The most attempts
   1. By player
   2. By team
7. Fouls
   1. By team
   2. By player
   3. Time distribution
8. Yellow cards
   1. By team
   2. By player
   3. Time distribution
9. Red cards
   1. By team
   2. By player
   3. Time distribution
10. Top of Maradona's fans (players with the most handballs)
11. Offsides
12. Assists
13. Autogoals
14. Substitutions
15. Productive substitutions

## Required libraries
Let's start our journey by importing the required libraries and loading the data.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import pandas as pd
import numpy as np

events = pd.read_csv('../input/events.csv')
info = pd.read_csv('../input/ginf.csv')
events.info()
info.info()

## Goals statistics

First of all, let's create an additional dataframe with goal info. Then let's process our data.

In [None]:
goals = events[events.is_goal == 1]
goals.describe()

In [None]:
# Highest-scoring leagues
info['scored'] = info['fthg'] + info['ftag']
print('Overall scores')
info.groupby('country').scored.sum().sort_values(ascending=False).head()

That's good, but we must keep in mind that there are 20 teams in the EPL and only 18 in the 1st Bundesliga, so the leagues play different numbers of matches. A fairer measure is the mean number of goals per match.

In [None]:
print('Mean scores')
info.groupby('country').scored.mean().sort_values(ascending=False).head()

And now we have a different result. We can see that the EPL isn't as high-scoring as the 1st Bundesliga. So it goes...
Now let's build team statistics.

In [None]:
# Highest-scoring teams
home_goals = info.groupby('ht').fthg.sum()
away_goals = info.groupby('at').ftag.sum()
sum_goals = (home_goals + away_goals).sort_values(ascending=False)
sum_goals.head(20)

Now we have the same problem as in the previous case. But let's skip this "normalization" and look at the raw totals ("normalization" will be your homework).
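One possible take on that homework, as a rough sketch rather than the definitive answer: divide each team's goals by the number of games it played. The tiny `info` frame below is made-up data that only borrows the column names of `ginf.csv`; `goals_for`, `games` and `goals_per_game` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for ginf.csv: same column names, invented values.
info = pd.DataFrame({
    'ht':   ['A', 'B', 'A', 'C'],
    'at':   ['B', 'A', 'C', 'A'],
    'fthg': [2, 1, 3, 0],
    'ftag': [1, 1, 0, 2],
})

# Goals scored at home plus goals scored away.
# fill_value=0 keeps teams that appear on only one side.
goals_for = info.groupby('ht').fthg.sum().add(
    info.groupby('at').ftag.sum(), fill_value=0)

# Games played at home plus games played away.
# info['at'] is used because info.at is a DataFrame accessor, not the column.
games = info['ht'].value_counts().add(info['at'].value_counts(), fill_value=0)

goals_per_game = (goals_for / games).sort_values(ascending=False)
print(goals_per_game)
```

The same division works for conceded goals, which would give a per-game goal difference as well.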

In [None]:
# Teams conceding the most goals
home_missed = info.groupby('ht').ftag.sum()
away_missed = info.groupby('at').fthg.sum()
sum_missed = (home_missed + away_missed).sort_values(ascending=False)
sum_missed.head(20)

And now we are able to compare the difference between goals scored and goals conceded.

In [None]:
(sum_goals - sum_missed).sort_values(ascending=False).head(20)

I suppose it's a great parameter for comparison: the fewer games a team has played, the fewer goals it could score, but also the fewer it could concede. For me, this measure shows the "efficiency" of each team.

## The best free kickers
Now let's explore `events.csv`. Let's start with free kick goals

In [None]:
free_kicks = goals[goals.situation == 4]
best_kickers = free_kicks.groupby('player').player.count().sort_values(ascending=False)
best_kickers.head(20)

We could write it in one line

In [None]:
goals[goals.situation == 4].groupby('player').player.count().sort_values(ascending=False).head(20)

And we have the same result. 
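Another compact spelling, shown here only as a sketch on invented rows: `value_counts` already counts and sorts in descending order, so the `groupby`/`count`/`sort_values` chain can be shortened. The `goals` frame below is a toy stand-in and `free_kick_counts` is a name introduced for this example.

```python
import pandas as pd

# Toy stand-in for the goals frame; situation == 4 marks free kicks as above.
goals = pd.DataFrame({
    'player':    ['x', 'y', 'x', 'z', 'x', 'y'],
    'situation': [4, 4, 4, 1, 4, 1],
})

# Select the player column for free-kick goals, then count per player.
free_kick_counts = goals.loc[goals.situation == 4, 'player'].value_counts()
print(free_kick_counts.head(20))
```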

## Penalties analysis

Now let's look at the best penalty takers

In [None]:
penalties_goals = goals[goals.location == 14]
penalties_scored = penalties_goals.groupby('player').player.count().sort_values(ascending=False)
penalties_scored.head(20)

OK, here are the best penalty takers. But who are the worst?

In [None]:
non_goals = events[events.is_goal == 0]
penalties_non_goals = non_goals[non_goals.location == 14]
penalties_missed = penalties_non_goals.groupby('player').player.count().sort_values(ascending=False)
penalties_missed.head(20)

But it's a very poor measure, because different players had different numbers of attempts. Let's calculate the percentage of successful attempts

In [None]:
# Name the two columns while concatenating; both input Series are called 'player',
# and patching columns.values in place afterwards is fragile
penalties_stats = pd.concat([penalties_scored, penalties_missed], axis=1, keys=['goals', 'missed'])
penalties = penalties_stats.fillna(0)
penalties['total'] = penalties['goals'] + penalties['missed']
penalties['success'] = penalties['goals'] / penalties['total']
penalties['unsuccess'] = penalties['missed'] / penalties['total']
penalties.sort_values(by='success', ascending=False).head(10)
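The same table can also be built in one pass, which may be easier to follow. This is a sketch on invented rows, assuming (as in the cells above) that `location == 14` marks the penalty spot; `pens` and `penalty_table` are names introduced here.

```python
import pandas as pd

# Toy stand-in for events.csv with only the columns needed here.
events = pd.DataFrame({
    'player':   ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'location': [14, 14, 14, 14, 14, 14, 3],
    'is_goal':  [1, 1, 0, 1, 0, 1, 1],
})

pens = events[events.location == 14]

# is_goal is 0/1, so its sum is the goals and its count is the attempts.
penalty_table = pens.groupby('player').is_goal.agg(goals='sum', total='count')
penalty_table['missed'] = penalty_table['total'] - penalty_table['goals']
penalty_table['success'] = penalty_table['goals'] / penalty_table['total']
print(penalty_table.sort_values('success', ascending=False))
```

Because scored and missed attempts come from the same group, no `fillna` step is needed.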

Now we have the "Cantabria problem" (see my [other kernel](https://www.kaggle.com/mikhinandrei/some-statistics). In that dataset there was an interesting story with the Cantabria national team, which had played ONLY one game, won it, and so had the best win percentage).
To be fair, let's look only at players who scored at least 10 penalties.

In [None]:
penalties_best = penalties[penalties.goals >= 10].sort_values(by='success', ascending=False)
penalties_best.head(20)

That's good. We can see that neither CriRo nor Messi is the king of penalties.
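A possible variation on the cutoff, sketched on invented numbers: filter on total attempts instead of goals scored, so that a player with many attempts but few goals is not silently dropped. `MIN_ATTEMPTS` and `qualified` are hypothetical names, and the value 10 is just an example.

```python
import pandas as pd

# Made-up numbers using the same column names as the penalties table.
penalties = pd.DataFrame(
    {'goals': [1, 12, 9, 20], 'missed': [0, 1, 6, 5]},
    index=['one_shot', 'steady', 'shaky', 'busy'],
)
penalties['total'] = penalties['goals'] + penalties['missed']
penalties['success'] = penalties['goals'] / penalties['total']

MIN_ATTEMPTS = 10  # hypothetical cutoff, tune to taste

# 'shaky' has only 9 goals but 15 attempts, so it stays in the table.
qualified = penalties[penalties['total'] >= MIN_ATTEMPTS]
qualified = qualified.sort_values('success', ascending=False)
print(qualified)
```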

In [None]:
penalties_worst = penalties[penalties.goals >= 10].sort_values(by='unsuccess', ascending=False)
penalties_worst.head(20)

Antonio Di Natale has the worst penalties statistics. So it goes...

Let's draw our first plots

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
# First 20 rows of the table already sorted by success rate
best_20 = penalties_best[['success', 'unsuccess']].head(20)
index = np.arange(len(best_20))
plt.title('The best penalty takers')
plt.xticks(index, best_20.index.tolist(), rotation=30)
plt.bar(index, best_20['success'], color='green')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 8))
# First 20 rows of the table already sorted by miss rate
worst_20 = penalties_worst[['success', 'unsuccess']].head(20)
index = np.arange(len(worst_20))
plt.title('The worst penalty takers')
plt.xticks(index, worst_20.index.tolist(), rotation=30)
plt.bar(index, worst_20['unsuccess'], color='red')

Now let's explore which teams had the most penalty attempts, both for and against

In [None]:
all_penalties = pd.concat([penalties_goals, penalties_non_goals])
all_penalties.groupby('is_goal').count()

In [None]:
all_penalties.groupby('event_team').event_team.count().sort_values(ascending=False).head(20)

So, Barcelona takes the most penalties

In [None]:
all_penalties.groupby('opponent').event_team.count().sort_values(ascending=False).head(20)

Udinese is the roughest team in its own penalty area. Also we can see that many Italian teams are in this list. So it goes...

## Best bombardirs (top scorers)
I suppose Messi and CriRo are in the first two rows.

In [None]:
bombardirs = goals.groupby('player').player.count().sort_values(ascending=False)
bombardirs.head(20)

I was right. Now our data allows us to analyze which body parts were used for scoring.

In [None]:
# Right foot
right_foot_goals = goals[goals.bodypart == 1].groupby('player').player.count().sort_values(ascending=False)
right_foot_goals.head(20)

In [None]:
# Left foot
left_foot_goals = goals[goals.bodypart == 2].groupby('player').player.count().sort_values(ascending=False)
left_foot_goals.head(20)

In [None]:
# Head
head_goals = goals[goals.bodypart == 3].groupby('player').player.count().sort_values(ascending=False)
head_goals.head(20)

Now we are ready to explore each player's favourite body part

In [None]:
# All four Series are named 'player', so set the column names via keys
goals_distr = pd.concat(
    [right_foot_goals, left_foot_goals, head_goals, bombardirs],
    axis=1, keys=['rf', 'lf', 'head', 'overall'],
).fillna(0)
goals_distr = goals_distr.sort_values(by='overall', ascending=False)
goals_distr.head(20)

And as a share of each player's goals

In [None]:
goals_distr['rf'] /= goals_distr['overall']
goals_distr['lf'] /= goals_distr['overall']
goals_distr['head'] /= goals_distr['overall']
goals_distr.head(20)
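For comparison, the per-player shares can also be obtained directly with `pd.crosstab` and its `normalize='index'` option. The sketch below runs on invented rows and assumes the same body-part codes as the cells above (1 right foot, 2 left foot, 3 head); `shares` is a name introduced here.

```python
import pandas as pd

# Toy stand-in for the goals frame.
goals = pd.DataFrame({
    'player':   ['m', 'm', 'm', 'm', 'r', 'r'],
    'bodypart': [2, 2, 2, 1, 3, 1],
})

# Count goals per player and body part, then divide each row by its total.
shares = pd.crosstab(goals.player, goals.bodypart, normalize='index')
shares = shares.rename(columns={1: 'rf', 2: 'lf', 3: 'head'})
print(shares)
```

Each row sums to 1, so no separate division by an `overall` column is required.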

Now let's look at the players who prefer each body part

In [None]:
#Right foot
goals_distr = goals_distr[goals_distr.overall >= 20]
goals_distr = goals_distr.sort_values(by='rf', ascending=False)
goals_distr.head(10)

In [None]:
# Left foot
goals_distr = goals_distr[goals_distr.overall >= 20]
goals_distr = goals_distr.sort_values(by='lf', ascending=False)
goals_distr.head(10)

In [None]:
# Head
goals_distr = goals_distr[goals_distr.overall >= 20]
goals_distr = goals_distr.sort_values(by='head', ascending=False)
goals_distr.head(10)

It's obvious that Sergio Ramos is in first place

## Most attempts

As we know, the more attempts, the more goals.

In [None]:
# By player
attempts = events[events.event_type == 1]
attempts.groupby('player').player.count().sort_values(ascending=False).head(20)

In [None]:
# By team
attempts.groupby('event_team').player.count().sort_values(ascending=False).head(20)

Now let's analyze shots per goal

In [None]:
shots_per_goal_pl = attempts.groupby('player').player.count() / bombardirs
shots_per_goal_pl.sort_values().head(20)

I don't know a lot of these players... But I see a lot of goalkeepers here. I suppose each of them scored with his only attempt.
Let's look at teams

In [None]:
shots_per_goal_tm = attempts.groupby('event_team').player.count() / sum_goals
shots_per_goal_tm.sort_values().head(20)

Be afraid of Fulham. Their conversion rate was really good.
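To keep one-shot players out of a shots-per-goal ranking, one option is to take both shots and goals from the attempts themselves and require a minimum number of shots. This is only a sketch on invented rows; `MIN_SHOTS`, `stats` and the cutoff value are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the attempts frame: one lucky keeper, two regular shooters.
attempts = pd.DataFrame({
    'player':  ['keeper'] + ['striker'] * 10 + ['winger'] * 5,
    'is_goal': [1] + [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] + [0, 0, 1, 0, 0],
})

# Goals and shots come from the same rows, so they always refer to the same matches.
stats = attempts.groupby('player').is_goal.agg(goals='sum', shots='count')

MIN_SHOTS = 5  # hypothetical cutoff
stats = stats[(stats.shots >= MIN_SHOTS) & (stats.goals > 0)].copy()
stats['shots_per_goal'] = stats.shots / stats.goals
stats = stats.sort_values('shots_per_goal')
print(stats)
```

Grouping by `event_team` instead of `player` gives the team version of the same ratio.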

## Fouls

Let's completely analyze fouls

In [None]:
fouls = events[events.event_type == 3]

# By team
fouls.groupby('event_team').player.count().sort_values(ascending=False).head(20)

In [None]:
# By player
fouls.groupby('player').player.count().sort_values(ascending=False).head(20)

And time distribution

In [None]:
fig, ax = plt.subplots(1,1, figsize=(40, 20))
sns.set(font_scale=1)
time_distr = fouls.groupby('time').time.count()
# Use the actual minute as the x position so a missing minute cannot shift the bars
plt.bar(time_distr.index, time_distr)

Teams try to use tactical fouls at the end of each half.
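When reading such a chart it helps if every minute has a bar, even an empty one. A minimal sketch on invented rows: count events per minute, then `reindex` over the full range of minutes. `per_minute` is a name introduced here.

```python
import pandas as pd

# Toy stand-in for the fouls frame; minute 2 has no events on purpose.
fouls = pd.DataFrame({'time': [1, 1, 3, 45, 45, 45, 90]})

# Count per minute, ordered by minute rather than by frequency.
per_minute = fouls.time.value_counts().sort_index()

# Insert zero counts for minutes with no events at all.
per_minute = per_minute.reindex(range(0, fouls.time.max() + 1), fill_value=0)
print(per_minute.loc[[1, 2, 3, 45, 90]])
```

Passing `per_minute.index` and `per_minute.values` to `plt.bar` then puts each bar at its real minute.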

## Yellow cards

In [None]:
# (4 or 5) evaluates to 4 in Python, so use isin to match both card codes
y_cards = events[events.event_type.isin([4, 5])]

# By team
y_cards.groupby('event_team').player.count().sort_values(ascending=False).head(20)
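A short aside on matching several codes at once, assuming 4 and 5 are the two yellow-card codes as the filter in this section suggests. In plain Python, `4 or 5` is evaluated first and yields `4`, so comparing a column with it matches a single code; `isin` is the usual pandas way to match a set of values. The toy frame below is invented.

```python
import pandas as pd

# Toy stand-in for events.csv with a handful of event codes.
events = pd.DataFrame({'event_type': [1, 4, 5, 6, 4, 3]})

print(4 or 5)  # Python reduces this to 4 before pandas ever sees it

# Matches only code 4.
only_first = events[events.event_type == (4 or 5)]

# Matches codes 4 and 5.
both = events[events.event_type.isin([4, 5])]

print(len(only_first), len(both))
```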

In [None]:
# By player
y_cards.groupby('player').player.count().sort_values(ascending=False).head(20)

In [None]:
# Time distribution
fig, ax = plt.subplots(1,1, figsize=(40, 20))
sns.set(font_scale=1)
time_distr = y_cards.groupby('time').time.count()
# Use the actual minute as the x position so a missing minute cannot shift the bars
plt.bar(time_distr.index, time_distr, color='yellow')

The same picture as with fouls

## Red cards

In [None]:
r_cards = events[events.event_type == 6]

# By team
r_cards.groupby('event_team').player.count().sort_values(ascending=False).head(20)

In [None]:
# By player
r_cards.groupby('player').player.count().sort_values(ascending=False).head(20)

In [None]:
# Time distribution
fig, ax = plt.subplots(1,1, figsize=(40, 20))
sns.set(font_scale=1)
time_distr = r_cards.groupby('time').time.count()
# Use the actual minute as the x position so a missing minute cannot shift the bars
plt.bar(time_distr.index, time_distr, color='red')

Unlike fouls and yellow cards, red cards mostly come at the end of the match

## Top of Maradona's fans (players with the most handballs)
A kind of funny statistic

In [None]:
hands = events[events.event_type == 10]
hands.groupby('player').event_type.count().sort_values(ascending=False).head(20)

I don't know about successful attempts, but it seems that Helder Postiga is trying hard to repeat the "Hand of God"

## Offsides

In [None]:
offs = events[events.event_type == 9]

# By team
offs.groupby('event_team').event_type.count().sort_values(ascending=False).head(20)

In [None]:
# By player
offs.groupby('player').event_type.count().sort_values(ascending=False).head(20)

Antonio Di Natale was really trying to save his team, but very often he was in too much of a hurry...

## Assists

In [None]:
assists = goals[goals.event_type2 == 12]
assists.groupby('player2').event_type.count().sort_values(ascending=False).head(20)

As usual, Messi is on top...

## Autogoals

In [None]:
autogoals = events[events.event_type2 == 15]
autogoals.groupby('player').player.count().sort_values(ascending=False).head(20)

Harry Kane? LOL=)

## Substitutions

In [None]:
substs = events[events.event_type == 7]
substs.groupby('event_team').event_type.count().sort_values(ascending=False).head(20)

And again a lot of Italian teams... There's something interesting here...

In [None]:
substs.groupby('player_in').player_in.count().sort_values(ascending=False).head(20)

I call Adrian a king of substitutions... Outstanding result!

In [None]:
substs.groupby('player_out').player_out.count().sort_values(ascending=False).head(20)

I suppose Benzema deserves more time on the pitch. Look at the previous stats.

And distribution of substitutions

In [None]:
fig, ax = plt.subplots(1,1, figsize=(40, 20))
sns.set(font_scale=1)
time_distr = substs.groupby('time').time.count()
# Use the actual minute as the x position so a missing minute cannot shift the bars
plt.bar(time_distr.index, time_distr, color='green')

The 45th minute is substitution time

## Productive substitutions analysis

If you aren't tired, let's make a final, kind of "hard" statistic. Let's see which substitutions were the most productive.

In [None]:
# Rename player_in to player so the merge key matches the goals frame
substs_prepared = substs[['id_odsp', 'time', 'event_team', 'player_in']].rename(columns={'player_in': 'player'})
goals_prepared = goals[['id_odsp', 'time', 'event_team', 'player']]
res_substs = pd.merge(substs_prepared, goals_prepared, how='inner', on=['id_odsp', 'player'])
res_substs.head()

That was a simple merge, and it was the hardest part. But wait: look at the row with index 1. The goal time is earlier than the substitution time... Let's explore our data

In [None]:
substs[substs.id_odsp == 'UBZQ4smg/'].head()

It was a misprint (look at `player_in` and `player_out`).
What should we do?
It's hard to believe, but we may simply keep only the rows where the substitution time is not later than the goal time...

In [None]:
res_substs = res_substs[res_substs.time_x <= res_substs.time_y]
res_substs.head(20)

Now it's OK. So, let's explore our data

In [None]:
res_substs.groupby('player').player.count().sort_values(ascending=False).head(20)

So, Alvaro Morata really deserves to be the first-choice striker in his team.
Let's find the teams whose substitutions were the most productive

In [None]:
res_substs.groupby('event_team_x').player.count().sort_values(ascending=False).head(20)

Bayern is first, which means they have very good substitutes. But I don't see Barcelona, and that's strange
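The merge-and-filter idea used in this section can be sketched on toy data as follows. All rows are invented, and the `_sub`/`_goal` suffixes and the `minutes_to_goal` column are choices made for this example (the cells above keep the default `_x`/`_y` suffixes).

```python
import pandas as pd

# Toy substitutions: who came on, in which match, and when.
substs = pd.DataFrame({
    'id_odsp':    ['g1', 'g1', 'g2'],
    'time':       [60, 75, 46],
    'event_team': ['T1', 'T2', 'T1'],
    'player_in':  ['p1', 'p2', 'p3'],
})

# Toy goals: p2 and p3 each have a goal recorded before their substitution.
goals = pd.DataFrame({
    'id_odsp':    ['g1', 'g1', 'g2', 'g2'],
    'time':       [80, 70, 30, 88],
    'event_team': ['T1', 'T2', 'T1', 'T1'],
    'player':     ['p1', 'p2', 'p3', 'p3'],
})

# Align the key name, then pair every substitution with that player's goals in the same match.
subs = substs.rename(columns={'player_in': 'player'})
merged = subs.merge(goals, on=['id_odsp', 'player'], suffixes=('_sub', '_goal'))

# Keep only goals scored at or after the substitution minute.
merged = merged[merged.time_sub <= merged.time_goal].copy()
merged['minutes_to_goal'] = merged.time_goal - merged.time_sub
print(merged[['id_odsp', 'player', 'time_sub', 'time_goal', 'minutes_to_goal']])
```

The extra `minutes_to_goal` column is optional; it just shows how quickly a substitute scored.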

# Conclusion

If you are reading these lines, thank you for your attention. I understand that this kernel had a lot of identical actions, but I suppose that the result is the main goal of any work. We had a lot of routine, but now we have a lot of interesting statistics.

Thank you for your attention,
I hope you had a little fun.