# Europe Cup 2021 - Data Analysis
This notebook explores different approaches to predicting the ranking of football teams during the Europe Cup 2021.  

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# import the dataset with international Results
df_intStats = pd.read_csv('../input/international-football-results-from-1872-to-2017/results.csv')

## Exploratory Data Analysis
Let's take a first look at the CSV file.

In [None]:
df_intStats.shape

In [None]:
df_intStats.head(5)

### First Conclusion from the Exploratory Data Analysis
Nice. The dataset contains 42105 rows, each describing one football match. I will narrow my perspective a little to specify my goals and expectations:
* Because squads change continuously, which affects team performance and therefore the statistics, I will restrict the time range to 2017 through the present.
* The dataset includes football teams from around the world; I will filter out the teams that do not participate in the Europe Cup 2021.
* To understand the current performance of each team, I will create time-series charts.
    * For that, I have to turn the date column into the index column.
    


## Data Preparation
Before the data can be understood, it first needs to be prepared.

### Steps of Data Preparation
1. Change the DataFrame index to a timestamp index
2. Reduce the DataFrame to the time range 2017 to 2021
3. Reduce the DataFrame to the teams that participate in the Europe Cup 2021
4. Visualize the goal performance of the teams
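
Steps 1 to 3 can be sketched as one small function. This is only an illustration on made-up rows, not the real results file: the helper name `prepare_results` and the toy data are assumptions, and the column names are taken from the cells in this notebook.

```python
import pandas as pd

def prepare_results(df, teams, start, end):
    """Index by date, keep the date range, keep matches between listed teams."""
    out = df.copy()
    # Step 1: parse the date column and use it as a sorted index
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    out = out.set_index("date").sort_index()
    # Step 2: slice the time range (both bounds inclusive)
    out = out.loc[start:end]
    # Step 3: keep matches where both sides are in the team list
    both = out["home_team"].isin(teams) & out["away_team"].isin(teams)
    return out.loc[both]

# Made-up rows for illustration only, not real match results
toy = pd.DataFrame({
    "date": ["2016-05-01", "2018-03-10", "2019-09-05", "2020-10-11"],
    "home_team": ["Germany", "France", "Brazil", "Spain"],
    "away_team": ["France", "Italy", "Germany", "Germany"],
    "home_score": [1, 2, 0, 3],
    "away_score": [0, 2, 1, 1],
})

prepared = prepare_results(
    toy, ["Germany", "France", "Italy", "Spain"], "2017-01-01", "2021-06-13"
)
print(prepared)
```

The 2016 match falls outside the date range and the match involving Brazil is dropped by the team filter, so two rows remain.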


In [None]:
# 1) Set the index to the date column
df_intStats = df_intStats.set_index('date')
# Change the datatype of the index from object to datetime
# (assumes the CSV stores dates as YYYY-MM-DD, e.g. 1872-11-30)
df_intStats.index = pd.to_datetime(df_intStats.index, format='%Y-%m-%d')
# check datatype
df_intStats.index

In [None]:
# 2) Define the start and end timestamps and slice the date range
from datetime import datetime
start = datetime(day=1, month=1, year=2017)
end = datetime(day=13, month=6, year=2021)
df_intStats = df_intStats[start:end]


# test
# df_intStats.shape
df_intStats.index.min()

In [None]:
# 3) Reduce the DataFrame to the teams that participate in the Europe Cup 2021

# Create the list of teams (names must match the spelling used in the dataset exactly)
teams = ['Belgium','Denmark', 'Germany', 'England', 'Finland', 'France', 'Croatia', 'Italy', 'Netherlands', 'Austria', 'Poland', 'Portugal', 'Russia', 'Sweden', 'Switzerland', 'Spain', 'Czech Republic', 'Turkey', 'Ukraine', 'Wales', 'North Macedonia', 'Scotland', 'Slovakia', 'Hungary' ]

# Select only the matches in which both the home team and the away team are participants
df_filtered_teams = df_intStats.loc[df_intStats['home_team'].isin(teams) & df_intStats['away_team'].isin(teams)] 
df_filtered_teams.shape

# test the selection 
# df_filtered_teams['home_team'].value_counts()
# df_filtered_teams['home_team'].value_counts().sum()
# df_filtered_teams['away_team'].value_counts()


## Data Analysis I
Now the data in the DataFrame `df_filtered_teams` is in a good format for analysis. I will take the following steps:

1. Visualize the goals scored in home games (`home_score`) per team with a heatmap
2. Visualize the goals scored in away games (`away_score`) per team with a heatmap
3. Visualize the teams with the most goals in the selected time range of 4 years
4. Visualize each team's timeline of goals scored, home and away

Note: how can the relationship between teams be analysed via their scores against each other?
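
One possible way to look at the score relationship between teams is a home team by away team matrix of goals, which is also the shape a heatmap needs. The sketch below uses made-up rows and assumes the `home_team`, `away_team` and `home_score` columns from this notebook; `home_goal_matrix` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only, not real match results
toy = pd.DataFrame({
    "home_team": ["Germany", "Germany", "France", "Spain"],
    "away_team": ["France", "France", "Spain", "Germany"],
    "home_score": [2, 1, 0, 3],
    "away_score": [1, 1, 2, 0],
})

# Rows: home team, columns: opponent, cells: total goals the home team scored
home_goal_matrix = toy.pivot_table(
    index="home_team",
    columns="away_team",
    values="home_score",
    aggfunc="sum",
    fill_value=0,  # pairings that never occurred are shown as 0
)
print(home_goal_matrix)

# With seaborn imported as sns, sns.heatmap(home_goal_matrix, annot=True)
# would draw this matrix as a heatmap.
```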

In [None]:
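# 4) Visualize the goals scored by the teams
# Keep only the team and score columns
df_team_score = df_filtered_teams[['home_team', 'away_team', 'home_score', 'away_score']]

# Sum only the numeric score columns per home team
# Note: after grouping by home_team, 'away_score' holds the goals that visiting teams scored against that host
df_team_score = df_team_score.groupby('home_team')[['home_score', 'away_score']].sum()
df_team_score.head(30)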

In [None]:
# rcParams must be imported before it is first used
from matplotlib import rcParams

# figure size in inches
rcParams['figure.figsize'] = 13,4

df_team_score_home = df_team_score['home_score'].sort_values(ascending=False)
ax = df_team_score_home.plot.bar(color="red")

In [None]:
# figure size in inches
rcParams['figure.figsize'] = 13,4

df_team_score_away = df_team_score['away_score'].sort_values(ascending=False)
ax = df_team_score_away.plot.bar(color="green")

# Matches in which France is the home team
df_filtered_teams_france = df_filtered_teams[df_filtered_teams['home_team'].str.contains('France',case=False)]

## First Conclusion

The top 5 teams by `away_score` total over the last 4 years are: 
1. France 
1. Spain 
1. Belgium  
1. Germany
1. Russia

The top 5 teams by `home_score` total over the last 4 years are: 
1. Russia
1. Germany
1. Scotland
1. France 
1. Hungary
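
One caveat about the rankings above: `df_team_score` is grouped by `home_team`, so its `away_score` column sums the goals that visitors scored against each host. If the goals a team scored while playing away are wanted, grouping by `away_team` is one option. A minimal sketch on made-up rows; `conceded_at_home` and `scored_away` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only, not real match results
toy = pd.DataFrame({
    "home_team": ["Germany", "France", "Spain", "Germany"],
    "away_team": ["France", "Spain", "Germany", "Spain"],
    "home_score": [2, 0, 3, 1],
    "away_score": [1, 2, 0, 4],
})

# Grouped by host: goals the visitors scored against each host
conceded_at_home = toy.groupby("home_team")["away_score"].sum()

# Grouped by visitor: goals each team scored while playing away
scored_away = (
    toy.groupby("away_team")["away_score"].sum().sort_values(ascending=False)
)

print(conceded_at_home)
print(scored_away)
```

In the toy data, Germany scores nothing away but concedes five goals at home, so the two groupings rank it very differently.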

## Data Analysis II 
Let's go deeper and try to answer the following questions:
1. How did the goal counts of the top 5 teams develop over the last 4 years? (line plot)
1. Are there interesting peaks and dips in the performance?
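
Besides plotting every match, peaks and dips may be easier to spot in yearly totals. The following sketch assumes a date-indexed frame like `df_filtered_teams`; the rows are made up and `yearly` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only, not real match results
toy = pd.DataFrame(
    {
        "away_team": ["France", "France", "France", "Spain", "Spain"],
        "away_score": [1, 3, 0, 2, 2],
    },
    index=pd.to_datetime(
        ["2017-03-01", "2017-10-01", "2019-06-01", "2018-05-01", "2019-09-01"]
    ),
)
toy.index.name = "date"

# Total away goals per team and calendar year; years without a match become 0
yearly = (
    toy.assign(year=toy.index.year)
    .groupby(["away_team", "year"])["away_score"]
    .sum()
    .unstack(fill_value=0)
)
print(yearly)

# Year with the most away goals for each team
print(yearly.idxmax(axis=1))
```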

In [None]:
df_filtered_teams.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

teams_awayMax = ['France', 'Spain', 'Belgium', 'Germany','Russia']

# figure size in inches
rcParams['figure.figsize'] = 13,4

for team in teams_awayMax:
    # Matches in which this team played away
    df_team = df_filtered_teams[df_filtered_teams['away_team'] == team]
    # "Gradient": total away goals in the filtered range, divided by 2
    GradientScore = df_team['away_score'].sum() / 2
    sns.lineplot(data=df_team, x="date", y="away_score", hue='away_team', style='away_team',  markers=True, dashes=False)
    plt.xticks(rotation=75)
    plt.title('Top 5 teams performance in away games from 2017 to 2021')
    plt.show()
    print(f"Team: {team}")
    print(f"Gradient: {GradientScore}")

In [None]:
teams_homeMax = ['Russia', 'Germany', 'Scotland', 'France', 'Hungary']

# figure size in inches
rcParams['figure.figsize'] = 13,4

for team in teams_homeMax:
    # Matches in which this team played at home
    df_team = df_filtered_teams[df_filtered_teams['home_team'] == team]
    # "Gradient": total home goals in the filtered range, divided by 2
    GradientScore = df_team['home_score'].sum() / 2
    sns.lineplot(data=df_team, x="date", y="home_score", hue='home_team', style='home_team',  markers=True, dashes=False)
    plt.xticks(rotation=75)
    plt.title('Top 5 teams performance in home games from 2017 to 2021')
    plt.show()
    print(f"Team: {team}")
    print(f"Gradient: {GradientScore}")

## Second Conclusion
After plotting the performance of the top five teams in the categories "home game" and "away game", I select the top 2 with the highest Gradient factor in the last two years.
The Gradient is the sum of a team's goals divided by two, standing for the last two years (in the code: `sum() / 2`).  
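
Read literally, this definition is goals per year over a two-year window. The sketch below follows that reading, which is an assumption: the cells above sum over the whole filtered range before dividing by 2. The function name `gradient`, its parameters and the toy rows are hypothetical.

```python
import pandas as pd

def gradient(df, team, side, end, years=2):
    """Goals per year scored by `team` on `side` ('home' or 'away')
    in the `years` years up to and including `end`."""
    end = pd.Timestamp(end)
    start = end - pd.DateOffset(years=years)
    window = df[(df.index > start) & (df.index <= end)]
    goals = window.loc[window[f"{side}_team"] == team, f"{side}_score"].sum()
    return goals / years

# Made-up rows for illustration only, not real match results
toy = pd.DataFrame(
    {
        "home_team": ["France", "Spain", "France"],
        "away_team": ["Spain", "France", "Italy"],
        "home_score": [2, 1, 3],
        "away_score": [0, 4, 1],
    },
    index=pd.to_datetime(["2018-06-01", "2020-09-01", "2021-03-01"]),
)

home_gradient = gradient(toy, "France", "home", end="2021-06-13")
away_gradient = gradient(toy, "France", "away", end="2021-06-13")
print(home_gradient, away_gradient)
```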

This is for the **home team**:

1. Spain with a Gradient of 11.0
2. Belgium with a Gradient of 11.0

This is for the **away team**:
1. France with a Gradient of 20.5 
2. Russia and Germany with a Gradient of 12.5

So now I have a selection of the 5 best-performing teams among the teams with the most goals.  

## Third Analysis

## Machine Learning and Prediction
### How to predict