**Predicting Player Ratings**

In this project we predict the overall rating of soccer players from their attributes,
such as 'crossing' and 'finishing'.
The data comes from the European Soccer Database
(https://www.kaggle.com/hugomathien/soccer), which covers more than 25,000 matches and more than
10,000 players across European professional soccer seasons from 2008 to 2016.

**About the Dataset**

The ultimate soccer database for data analysis and machine learning.

The dataset comes in the form of an SQL database and contains statistics of about 25,000 football
matches from the top football league of 11 European countries. It covers seasons from 2008 to
2016 and contains match statistics (e.g. scores, corners, fouls) as well as the team formations,
with player names and a pair of coordinates to indicate their position on the pitch.

- +25,000 matches
- +10,000 players
- 11 European countries with their lead championship
- Seasons 2008 to 2016
- Players' and teams' attributes sourced from EA Sports' FIFA video game series, including the weekly updates
- Team line-up with squad formation (X, Y coordinates)
- Betting odds from up to 10 providers
- Detailed match events (goal types, possession, corner, cross, fouls, cards, etc.) for +10,000 matches

The dataset also has a set of about 35 statistics for each player, derived from EA Sports' FIFA video
games. These are not just the stats that ship with a new version of the game but also the weekly
updates. So, for instance, if a player has performed poorly over a period of time and his stats are
lowered in FIFA, you would normally see the same in the dataset.
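Because the data ships as an SQLite file, it can help to list the tables it contains before querying any of them. The sketch below uses a hypothetical helper, `list_tables`, and a throwaway in-memory database whose two table names are stand-ins for illustration; with the real file, the connection would point at `database.sqlite` instead.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# Throwaway in-memory database used only to demonstrate the helper.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Player (id INTEGER)")
demo.execute("CREATE TABLE Player_Attributes (id INTEGER)")
tables = list_tables(demo)
print(tables)
demo.close()
```

`sqlite_master` is SQLite's built-in catalogue table, so the same query should work on any SQLite file.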


**Importing Modules**

In [1]:
import sqlite3
import numpy as np
import pylab as pl
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
%matplotlib inline



**Data Pre-Processing**

In [2]:
conn = sqlite3.connect('database.sqlite') #Creating a Connection object that represents the database 
df = pd.read_sql_query("SELECT * FROM Player_Attributes", conn) #Reading SQL query into a DataFrame

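One common cause of a `DatabaseError` at this step is a missing file: `sqlite3.connect` creates a new, empty database when the path does not exist, so the later query fails with "no such table". A minimal sketch of a guard follows, using a hypothetical helper `connect_existing` and a deliberately non-existent file name.

```python
import os
import sqlite3

def connect_existing(db_path):
    """Open an SQLite file, raising early if the file is not there."""
    if not os.path.isfile(db_path):
        raise FileNotFoundError(f"SQLite file not found: {db_path}")
    return sqlite3.connect(db_path)

# The file name below is made up and assumed not to exist.
try:
    connect_existing("no_such_soccer_database_12345.sqlite")
    missing_detected = False
except FileNotFoundError:
    missing_detected = True
print(missing_detected)
```

In the notebook, `conn = connect_existing('database.sqlite')` could stand in for the plain `sqlite3.connect` call.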

**Data Exploration**

In [3]:
df.head() #Returns the first 5 rows of dataframe df


In [0]:
df.columns #Columns of dataframe df

In [0]:
df.describe() #The summary statistics of the dataframe df

In [0]:
df.shape #Return a tuple representing the dimensionality of the DataFrame df.

**Data Cleaning**

In [0]:
df.isnull().values.any() #Check for any NA’s in the dataframe.

In [0]:
df1 = df.dropna() #Drop the rows where at least one element is missing.

In [0]:
df1.shape #Return a tuple representing the dimensionality of the DataFrame df1.
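Before dropping rows, it can be useful to see which columns hold the missing values and how many rows `dropna()` will remove. The sketch below runs on a small made-up frame; the values are illustrative, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one missing value in each of two columns, on different rows.
toy = pd.DataFrame({
    "overall_rating": [67.0, np.nan, 74.0, 80.0],
    "crossing": [49.0, 55.0, np.nan, 70.0],
    "finishing": [44.0, 50.0, 60.0, 65.0],
})

missing_per_column = toy.isnull().sum()  # count of NaN values per column
rows_before = len(toy)
rows_after = len(toy.dropna())           # rows with no missing value at all
print(missing_per_column)
print(rows_before, rows_after)
```

The same two calls applied to `df` would show how much data the cleaning step discards.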

In [0]:
df1.columns #Columns of dataframe df1

In [0]:
df1 = df1.drop(["id", "player_fifa_api_id", "player_api_id"], axis = 1) #Dropping id, player_fifa_api_id and player_api_id columns

In [0]:
df1.columns #Columns of dataframe df1

In [0]:
clms = list(df1.columns[1:]) #Listing the columns of dataframe df1 starting from 2nd column
print(clms) #Printing clms list

In [0]:
len(clms) #Length of the clms list

**Data Visualization**

In [0]:
#Histogram Plot
fig, axes = plt.subplots(10, 4, figsize=(16, 12))
for i,ax in enumerate(axes.flat):
    if i < len(clms):
        ax.hist(df1[clms[i]])
        ax.set_title(clms[i])
plt.tight_layout()
plt.show()

In [0]:
#Scatter Plot
fig, axes = plt.subplots(10, 4, figsize=(16, 12))
for i,ax in enumerate(axes.flat):
    if i < len(clms)-1:
        ax.scatter(df1[clms[i+1]], df1[clms[0]])
        ax.set_title(clms[i+1])
plt.tight_layout()
plt.show()

In [0]:
plt.hist(df1[clms[0]]) #Histogram for the first column in clms, drawn on a fresh figure

In [0]:
plt.hist(df1["preferred_foot"]) #Histogram for preferred_foot

In [0]:
#Correlation Matrix
sns.set(style="white")
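df_corr = df1.select_dtypes(include=[np.number]) #Keep only the numeric columns for the correlation
corr = df_corr.corr() #Compute the correlation matrix
mask = np.zeros_like(corr, dtype=bool) #Generate a mask (all False, so no cell is hidden)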
f, ax = plt.subplots(figsize=(30, 10)) #Set up the matplotlib figure
cmap = sns.diverging_palette(220, 10, as_cmap=True) #Generate a custom diverging colormap
sns.heatmap(corr, mask=mask, cmap=cmap, square=True, linewidths=.5, ax=ax) #Draw the heatmap with the mask and correct aspect ratio

**Preparing Data for Linear Regression**

In [0]:
df1.loc[:, "new_date"] = pd.to_datetime(df1["date"]) #Creating a new column new_date with the date parsed as datetime

In [0]:
df1.loc[:, "day"] = df1["new_date"].dt.day #Creating a day column from the day values of the new_date column

In [0]:
df1.loc[:, "month"] = df1["new_date"].dt.month #Creating a month column from the month values of the new_date column

In [0]:
df1.loc[:, "year"] = df1["new_date"].dt.year #Creating a year column from the year values of the new_date column

In [0]:
df1["year"].unique() #Unique year values of the year column

In [0]:
cat_clms = ["preferred_foot", "attacking_work_rate", "defensive_work_rate", "year", "month", "day"] #Category column list

In [0]:
df1.head() #Returns the first 5 rows of dataframe df1

In [0]:
df1 =  df1.drop(["date", "new_date"], axis = 1) #Dropping the date and new_date columns

In [0]:
for clm in cat_clms:
    dummies = pd.get_dummies(df1[clm], prefix = clm)
    df1 = df1.join(dummies)
    df1 = df1.drop(clm, axis = 1)
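To make the effect of the loop concrete, here is the same join-and-drop pattern applied to one categorical column of a small made-up frame. The category values `"left"` and `"right"` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column.
toy = pd.DataFrame({
    "preferred_foot": ["right", "left", "right"],
    "crossing": [49, 55, 70],
})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["preferred_foot"], prefix="preferred_foot")
encoded = toy.join(dummies).drop("preferred_foot", axis=1)
print(list(encoded.columns))
```

Each category becomes its own indicator column, so a column with many distinct values (such as `day`) adds many features to the model.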

In [0]:
df1.columns #Columns of dataframe df1

In [0]:
df1.shape #Shape of the dataframe df1

**Train/Test Split**

In [0]:
#Splitting the dataset into two: target value and predictor values.
X = df1.drop('overall_rating', axis = 1) #All features except overall_rating ( predictor values )
Y = df1['overall_rating'] #overall_rating ( target value )

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 5)
print(X_train.shape) #Training data shape (predictor values) : 80%
print(X_test.shape) #Test data shape (predictor values) : 20%
print(Y_train.shape) #Training data shape (target values) : 80%
print(Y_test.shape) #Test data shape (target values) : 20%

**Creating and Training the Model**

In [0]:
lm = LinearRegression() #Creating an instance of LinearRegression
lm.fit(X_train, Y_train) #Fitting the created instance of the LinearRegression

In [0]:
#Printing intercept 
print(lm.intercept_)

In [0]:
#Printing coefficients
print(lm.coef_)
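A bare array of coefficients is hard to read when there are many features. One option is to label each coefficient with its column name, as in the sketch below. It fits a separate toy model on synthetic, noise-free data (the feature names and the true weights 0.3 and 0.5 are made up), so the recovered coefficients can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic predictors and a target that is an exact linear function of them.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "crossing": rng.uniform(0, 100, 50),
    "finishing": rng.uniform(0, 100, 50),
})
y_toy = 10 + 0.3 * X_toy["crossing"] + 0.5 * X_toy["finishing"]

model = LinearRegression().fit(X_toy, y_toy)

# Pair each coefficient with its feature name and sort by size.
coef_table = pd.Series(model.coef_, index=X_toy.columns).sort_values(ascending=False)
print(coef_table)
```

For the notebook's model, the equivalent would be `pd.Series(lm.coef_, index=X.columns)`.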

**Predicting overall_rating using Test Data**

In [0]:
Y_pred = lm.predict(X_test) #Calculating the prediction values

In [0]:
Y_pred.shape #Prediction shape from test data

In [0]:
#To visualize the differences between actual overall rating and predicted values, creating a scatter plot.
sns.set_style("whitegrid")
sns.set_context("poster")
plt.figure(figsize=(16,9))
plt.scatter(Y_test, Y_pred)
plt.xlabel("Overall Rating: $Y_i$")
plt.ylabel(r"Predicted Overall Rating: $\hat{Y}_i$")
plt.title(r"Overall Rating vs Predicted Overall Rating: $Y_i$ vs $\hat{Y}_i$")
plt.text(40,25, "Comparison between the actual Overall Rating and predicted Overall Rating.", ha='left')
plt.show()

In [0]:
sns.regplot(x=Y_test, y=Y_pred, fit_reg=True) #Plot Y_test and Y_pred for Linear Regression Model.

In [0]:
sns.regplot(x=lm.predict(X), y=df1['overall_rating'], data=df1, fit_reg=True) #Plot predicted and actual Overall Rating values.

**Model Evaluation Using Cross-Validation**

In [0]:
#Evaluating the model using 10-fold cross-validation
scores = cross_val_score(LinearRegression(), X, Y, scoring='neg_mean_squared_error', cv=10)
scores

In [0]:
np.sqrt(scores.mean() * -1)

In [0]:
print("The Root Mean Squared Error (RMSE) from cross-validation is " + str(np.sqrt(scores.mean() * -1)) + ". The results can be improved further through feature extraction and by rebuilding and retraining the model.")
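Note that `neg_mean_squared_error` returns negated MSE values, one per fold, which is why the sign is flipped before taking the square root. There are two ways to summarise them: the square root of the mean MSE (as done here) or the mean of the per-fold RMSEs, and they generally differ. The sketch below shows the difference with two made-up fold scores.

```python
import numpy as np

# Hypothetical negated MSE values for two folds (not real results).
toy_scores = np.array([-1.0, -4.0])

rmse_of_mean = np.sqrt(-toy_scores.mean())   # square root of the average MSE
mean_of_rmse = np.sqrt(-toy_scores).mean()   # average of the per-fold RMSEs
print(rmse_of_mean, mean_of_rmse)
```

Either summary is defensible as long as the same one is used when comparing models.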

**Evaluating the Model Using RMSE**

In [0]:
#Calculating Mean Squared Error
mse = mean_squared_error(Y_test, Y_pred) #Mean Squared Error: To check the level of error of a model
print(mse)

In [0]:
#Calculating Root Mean Squared Error
rmse = mse ** 0.5 #Square root of mse (Mean Squared Error)
print(rmse)
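As a sanity check on what these two numbers mean, RMSE can be computed by hand: square each prediction error, average the squares, and take the square root. The sketch below uses a hypothetical helper `rmse` and three made-up ratings.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Errors are -2, 3 and 0, so the squared errors are 4, 9 and 0.
example = rmse([70, 80, 65], [72, 77, 65])
print(example)
```

Because RMSE is in the same units as the target, it can be read directly as a typical error in rating points.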

In [0]:
print("The Root Mean Squared Error (RMSE) for the model is " + str(rmse) + ". The results can be improved further through feature extraction and by rebuilding and retraining the model.")