**This notebook uses Pandas and NumPy for data analysis, and Plotly, Matplotlib, and Seaborn for visualization. The `warnings` module is used to suppress warning output. Later cells build subplots, bar charts, histograms, and a scatter plot to explore patterns and trends in the data.**

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns


**This code reads a CSV file named 'data.csv' containing data related to Lionel Messi and Cristiano Ronaldo's club goals, then displays the first few rows of the dataframe.**

In [None]:
df=pd.read_csv('data.csv')
df.head()

**The `df.info()` method prints a concise summary of the dataframe: the number of entries, column names, non-null counts, data types, and memory usage.**

In [None]:
df.info()

**`df.describe(include='all')` generates descriptive statistics for all columns in a pandas DataFrame, including count, unique, top (most frequent), and frequency (top's frequency) for object (string) and categorical columns, and numeric statistics for numeric columns.**

In [None]:
print(df.describe(include='all'))

**The `df.isnull().sum()` function computes the sum of missing values for each column in the dataframe `df`, indicating how many null values exist in each column.**

In [None]:
df.isnull().sum()

**The code fills in any missing values (NaN) in the dataframe `df` with zeros (0), ensuring that the dataframe does not contain any null values.**

In [None]:
df=df.fillna(0)
df

**`Outlier detection` using `Z-score method`**

In [None]:
import pandas as pd
from scipy import stats

# Convert 'Minute' column to numeric
df['Minute'] = pd.to_numeric(df['Minute'], errors='coerce')

# Define a threshold for Z-score
threshold = 3

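# Calculate Z-score for the 'Minute' column, ignoring the NaN values left by the coercion above
# (with the default nan_policy='propagate', a single NaN would turn every Z-score into NaN)
z_scores = stats.zscore(df['Minute'], nan_policy='omit')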

# Identify outliers based on the threshold
outliers = df[(z_scores > threshold) | (z_scores < -threshold)]

# Print outliers
print("Outliers detected using Z-score method:")
print("=======================================")
print(outliers)

# Remove outliers from the dataset
cleaned_df = df.drop(outliers.index)

# Print cleaned dataset
print("Cleaned dataset after removing outliers:")
print(cleaned_df)
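
**As a rough illustration of the Z-score rule above, the sketch below computes the scores directly with pandas on a small toy series. The helper name `zscore_outlier_mask` and the toy values are assumptions made for this example and are not part of the dataset. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.**

```python
import numpy as np
import pandas as pd

def zscore_outlier_mask(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    mean = values.mean()        # pandas skips NaN by default
    std = values.std(ddof=0)    # population standard deviation
    z = (values - mean) / std
    return z.abs() > threshold  # comparisons against NaN evaluate to False

# Toy data (hypothetical): ordinary values, one extreme value, one missing value
toy = pd.Series([10.0] * 10 + [12.0] * 10 + [500.0, np.nan])
mask = zscore_outlier_mask(toy)
print(toy[mask])
```

**Because comparisons with NaN are False, rows with a missing value are never flagged; whether they should be kept or dropped is a separate decision.**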


**`Outlier detection` using `IQR method`**

In [None]:
# Calculate Q1 and Q3
Q1 = df['Minute'].quantile(0.25)
Q3 = df['Minute'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the outlier boundaries
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df['Minute'] < lower_bound) | (df['Minute'] > upper_bound)]

# Print outliers
print("Outliers detected using IQR method:")
print("===================================")
print(outliers)

# Remove outliers from the dataset
cleaned_df = df.drop(outliers.index)


# Print cleaned dataset
print("Cleaned dataset after removing outliers:")
print(cleaned_df)
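
**The IQR rule above can be wrapped in a small reusable helper. This is a minimal sketch on toy numbers; the function name `iqr_bounds` and the parameter `k` are hypothetical names chosen for the example.**

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k times the IQR beyond Q1 and Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical): nine ordinary values and one extreme value
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]
print(lower, upper)
print(flagged.tolist())
```

**Unlike the Z-score, the fences depend only on the quartiles, so a single extreme value does not inflate the threshold used to detect it.**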


**Calculating `mean`, `median`, and `mode` of the Minute column (the match minute in which each goal was scored)**

In [None]:
# Convert 'Minute' column to numeric type; values that cannot be parsed become NaN
df['Minute'] = pd.to_numeric(df['Minute'], errors='coerce')


mean_minute = df['Minute'].mean()
median_minute = df['Minute'].median()
mode_minute = df['Minute'].mode()[0]

print(f"Mean Minute: {mean_minute}")
print(f"Median Minute: {median_minute}")
print(f"Mode Minute: {mode_minute}")


**Finding measures of dispersion: `Standard deviation`, `Variance`, and `Range`**

In [None]:
# Measures of dispersion
std_minute = df['Minute'].std()
variance_minute = df['Minute'].var()
range_minute = df['Minute'].max() - df['Minute'].min()

print(f"Standard Deviation of Minute: {std_minute}")
print(f"Variance of Minute: {variance_minute}")
print(f"Range of Minute: {range_minute}")
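
**One detail worth keeping in mind when reading these numbers: pandas `std()` and `var()` default to the sample formulas (`ddof=1`), while NumPy's `np.std` defaults to the population formula (`ddof=0`). The toy values below are made up for illustration.**

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): mean is 5 and the squared deviations sum to 32
toy = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = toy.std()              # pandas default, divides by n - 1
population_std = toy.std(ddof=0)    # divides by n
numpy_std = np.std(toy.to_numpy())  # NumPy default, divides by n

print(sample_std, population_std, numpy_std)
```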

In [None]:
from sklearn.preprocessing import LabelEncoder

**Encoding categorical features for correlation analysis**

In [None]:
label_encoder = LabelEncoder()
encoded_df = df.copy()

for column in df.select_dtypes(include=['object']).columns:
    encoded_df[column] = label_encoder.fit_transform(df[column].astype(str))


**Finding Correlation matrix**

In [None]:
correlation_matrix = encoded_df.corr()
print(correlation_matrix)

**Visualize the Correlation Matrix**

In [None]:
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

**Group by Club and find the average minute in which goals were scored**

In [None]:
avg_minutes_by_club = df.groupby('Club')['Minute'].mean().reset_index()

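# Fill NaN values with the overall mean of the 'Minute' column
# (assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas)
overall_mean_minute = df['Minute'].mean()
avg_minutes_by_club['Minute'] = avg_minutes_by_club['Minute'].fillna(overall_mean_minute)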

print(avg_minutes_by_club)


**Group by Player and Competition to find total goals**

In [None]:
goals_by_player_competition = df.groupby(['Player', 'Competition']).size().reset_index(name='Total Goals')
print(goals_by_player_competition)
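
**A short sketch of why `size()` is used here to count goals: `size()` counts rows, whereas `count()` counts only non-null values in a column. The toy frame below is hypothetical and only mirrors the column names used in this notebook.**

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one of player A's goals has no parsable minute
toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B"],
    "Minute": [12.0, np.nan, 88.0, 45.0],
})

rows_per_player = toy.groupby("Player").size()              # counts every row
non_null_minutes = toy.groupby("Player")["Minute"].count()  # skips NaN

print(rows_per_player)
print(non_null_minutes)
```

**When each row is one goal, `size()` gives the goal total even if some columns contain missing values.**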

**Multiple `aggregation functions`**

In [None]:
# Convert 'Player' column to string if needed
df['Player'] = df['Player'].astype(str)

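# Select both players by name; 'Player' still holds names at this point because label encoding is applied to df only later
player = df[df['Player'].isin(['Cristiano Ronaldo', 'Lionel Messi'])]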

# Define aggregation functions for each column
aggregation = {
    'Minute': ['mean', 'sum', 'max'],
    'Goal_assist': 'count'
}

# Group by Player and perform multiple aggregations
agg_stats = player.groupby('Player').agg(aggregation).reset_index()

print(agg_stats)

**`HISTPLOT` of Distribution of Goals by Minute**

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['Minute'], bins=30, kde=True)
plt.title('Distribution of Goals by Minute')
plt.xlabel('Minute')
plt.ylabel('Frequency')
plt.show()

**`Count-plot` of Goals by Competition**

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(data=df, x='Competition', hue='Player')
plt.title('Goals by Competition')
plt.xlabel('Competition')
plt.ylabel('Number of Goals')
plt.xticks(rotation=45)
plt.show()

**The code `df['Playing_Position'].unique()` retrieves the unique values present in the 'Playing_Position' column of the dataframe `df`, showing all distinct positions played by the players in the dataset.**

In [None]:
df['Playing_Position'].unique()

In [None]:
cr=df[df['Player']=='Cristiano Ronaldo']
lm=df[df['Player']=='Lionel Messi']

In [None]:
print("ALL COMPETITIONS IN WHICH CRISTIANO RONALDO PLAYED :  ")
print()
cr['Competition'].unique()

In [None]:
print("ALL COMPETITIONS IN WHICH LIONEL MESSI PLAYED :  ")
print()
lm['Competition'].unique()

In [None]:
print('The Tournaments in which both CR7 and LM10 have participated:')
print()
tour=[]
for i in cr['Competition'].unique():
    for j in lm['Competition'].unique():
        if i==j:
            tour.append(i)
            print(i)
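
**The nested loop above finds the competitions common to both players. An equivalent and shorter way to express this is a set intersection, sketched here on a toy frame; the rows are illustrative and not taken from `data.csv`.**

```python
import pandas as pd

# Toy data (illustrative rows only)
toy = pd.DataFrame({
    "Player": ["Cristiano Ronaldo", "Cristiano Ronaldo", "Lionel Messi", "Lionel Messi"],
    "Competition": ["LaLiga", "Serie A", "LaLiga", "Ligue 1"],
})
cr_toy = toy[toy["Player"] == "Cristiano Ronaldo"]
lm_toy = toy[toy["Player"] == "Lionel Messi"]

# Competitions that appear for both players, sorted for stable output
common = sorted(set(cr_toy["Competition"]) & set(lm_toy["Competition"]))
print(common)
```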

**The code `cr['Season'].unique()` will retrieve the unique values present in the 'Season' column of the dataframe `cr`, showing all distinct seasons represented in the dataset.**

In [None]:
cr['Season'].unique()

**The code `lm['Season'].unique()` retrieves the unique values present in the 'Season' column of the dataframe `lm`, displaying all distinct seasons represented in the dataset associated with Lionel Messi's club goals.**

In [None]:
lm['Season'].unique()

In [None]:
# len() counts every goal row; cr['Minute'].count() would skip rows whose minute could not be parsed as a number
print('The number of goals scored in his entire club career by Cristiano Ronaldo:', len(cr))
print('The number of goals scored in his entire club career by Lionel Messi:', len(lm))

**SUBPLOT COMPARING GOALS BETWEEN BOTH PLAYERS. Pro Tip: Hover over the bars to see more details.**

In [None]:
# Create subplots with two columns for Cristiano Ronaldo and Lionel Messi
fig = make_subplots(rows=1, cols=2,
                    subplot_titles=('Cristiano Ronaldo', 
                                    'Lionel Messi'))

# Aggregate goals by season for Cristiano Ronaldo
cr_goals_by_season = cr.groupby('Season')['Type'].count()\
.reset_index(name='Goals')

# Sort Cristiano Ronaldo's goals by season in descending order
cr_goals_by_season = cr_goals_by_season.sort_values(by='Goals',ascending=False)

# Create a bar plot for Cristiano Ronaldo's goals
fig1 = px.bar(cr_goals_by_season, x='Season', y='Goals', color='Goals')

# Add Cristiano Ronaldo's bar plot to the first subplot
fig.add_trace(fig1.data[0], row=1, col=1)

# Aggregate goals by season for Lionel Messi
lm_goals_by_season = lm.groupby('Season')['Type'].count()\
.reset_index(name='Goals')

# Sort Lionel Messi's goals by season in descending order
lm_goals_by_season = lm_goals_by_season.sort_values(by='Goals',ascending=False)

# Create a bar plot for Lionel Messi's goals
fig2 = px.bar(lm_goals_by_season, x='Season', y='Goals', color='Goals')

# Add Lionel Messi's bar plot to the second subplot
fig.add_trace(fig2.data[0], row=1, col=2)

# Update layout with title and dimensions
fig.update_layout(height=500, width=1000, 
                  title_text='COMPARISON OF GOALS BETWEEN CRISTIANO RONALDO AND LIONEL MESSI')

# Display the plot
fig.show()


**The code creates a histogram showing the distribution of Cristiano Ronaldo's goals across different competitions. It's colored by club and includes hover information displaying competition and club details.**

In [None]:
px.histogram(cr,x='Competition',
             title="Goals per Competition by CR7",
             height=600,
             color='Club',
             hover_name='Club',
             hover_data=['Competition','Club'])

**The code creates a histogram showing the distribution of Lionel Messi's goals across different competitions. It's colored by club and includes hover information displaying competition and club details.**

In [None]:
px.histogram(lm,x='Competition',
             title="Goals per Competition by Lionel Messi",
             height=600,
             color='Club',
             hover_name='Club',
             hover_data=['Competition','Club'])

***Different Pie Chart Visuals of CR7 Data***

In [None]:
import plotly.express as px

# Selecting three columns for Cristiano Ronaldo (CR7)
cr_columns = ['Competition', 'Club', 'Playing_Position']
cr_selected = cr[cr_columns]

# Creating pie charts for CR7
for column in cr_selected.columns:
    fig = px.pie(cr_selected, names=column, title=f"Pie chart for {column} in CR7's dataset")
    fig.show()



***Different Pie Chart Visuals of LM10 Data***

In [None]:
# Selecting three columns for Lionel Messi
lm_columns = ['Competition', 'Club', 'Playing_Position']
lm_selected = lm[lm_columns]

# Creating pie charts for Lionel Messi
for column in lm_selected.columns:
    fig = px.pie(lm_selected, names=column, title=f"Pie chart for {column} in Lionel Messi's dataset")
    fig.show()

**HISTOGRAM OF CR7 RECORD AGAINST DIFFERENT TEAMS**

In [None]:
px.histogram(cr,x='Competition',
             title="Cristiano Ronaldo vs opponents",
             height=1000,
             color='Opponent',
             hover_name='Opponent',
             hover_data=['Opponent'])

**HISTOGRAM OF LM10 RECORD AGAINST DIFFERENT TEAMS**

In [None]:
px.histogram(lm,x='Competition',
             title="Lionel Messi vs opponents",
             height=1000,
             color='Opponent',
             hover_name='Opponent',
             hover_data=['Opponent'])

**This code creates a side-by-side comparison of Cristiano Ronaldo's and Lionel Messi's goals by playing position. It uses bar charts to show the number of goals each player scored in each position, placed in a single figure built with Plotly's `make_subplots`.**

In [None]:
# Create subplots with two columns for Cristiano Ronaldo and Lionel Messi
fig = make_subplots(rows=1, cols=2, 
                    subplot_titles=('Cristiano Ronaldo',
                                    'Lionel Messi'))

# Aggregate goals by playing position for Cristiano Ronaldo
# (drop the 0 placeholder introduced by fillna(0), if it is present)
cr_pos_goals = cr.groupby('Playing_Position')['Type'].count()\
.drop(0, errors='ignore').reset_index(name='Goals')

# Sort Cristiano Ronaldo's goals by playing position in descending order
cr_pos_goals = cr_pos_goals.sort_values(by='Goals',ascending=False)

# Create a bar plot for Cristiano Ronaldo's goals by playing position
fig1 = px.bar(cr_pos_goals, x='Playing_Position', y='Goals',
              color='Goals')

# Add Cristiano Ronaldo's bar plot to the first subplot
fig.add_trace(fig1.data[0], row=1, col=1)

# Aggregate goals by playing position for Lionel Messi
lm_pos_goals = lm.groupby('Playing_Position')['Type'].count()\
.reset_index(name='Goals')

# Sort Lionel Messi's goals by playing position in descending order
lm_pos_goals = lm_pos_goals.sort_values(by='Goals',ascending=False)

# Create a bar plot for Lionel Messi's goals by playing position
fig2 = px.bar(lm_pos_goals, x='Playing_Position', y='Goals',
              color='Goals')

# Add Lionel Messi's bar plot to the second subplot
fig.add_trace(fig2.data[0], row=1, col=2)

# Update layout with title and dimensions
fig.update_layout(height=500, width=1000, 
                  title_text='Comparison of Goals in different positions by Ronaldo and Messi')

# Display the plot
fig.show()


**This code will generate a scatter plot where the x-axis represents the minute of the match and the y-axis represents the goal type. Each point in the scatter plot represents a goal scored by either Cristiano Ronaldo or Lionel Messi, with different colors indicating the respective player.**

In [None]:


# Concatenate both players' dataframes
combined_df = pd.concat([cr, lm], keys=['Cristiano Ronaldo', 'Lionel Messi'])

# Create scatter plot
fig = px.scatter(combined_df, x='Minute', y='Type', color=combined_df.index.get_level_values(0),
                 title='Goals Scored by Minute', labels={'Type': 'Goal Type', 'Minute': 'Minute'})

# Show plot
fig.show()



**The `ttest_ind` function performs a two-sample (independent) t-test, which is used to determine whether the means of two independent samples are significantly different from each other.**
**The `chi2_contingency` function performs the chi-square test of independence, which is used to determine whether there is a significant association between two categorical variables.**

In [None]:
from scipy.stats import ttest_ind, chi2_contingency
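
**To show what a two-sample t-test call looks like, here is a minimal sketch on synthetic data. The two groups are randomly generated and are not drawn from the goals dataset; `equal_var=False` selects Welch's t-test, which is an assumption of this example and not a choice made elsewhere in this notebook. Missing values are dropped explicitly before testing.**

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic samples (hypothetical): two groups whose true means differ by 10
group_a = np.append(rng.normal(loc=50, scale=10, size=200), np.nan)
group_b = rng.normal(loc=60, scale=10, size=200)

# Drop missing values so they do not propagate into the statistic
group_a = group_a[~np.isnan(group_a)]

# Welch's t-test does not assume equal variances in the two groups
t_stat_toy, p_value_toy = ttest_ind(group_a, group_b, equal_var=False)
print("t statistic:", t_stat_toy)
print("p-value:", p_value_toy)
```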

**Hypothesis 1: Chi-square test of independence for association between competition and playing position**

In [None]:
# Goal minutes per position group; these two series are not used by the chi-square test below
forward_minutes = df[df['Playing_Position'] == 'Forward']['Minute']
midfielder_minutes = df[df['Playing_Position'] == 'Midfielder']['Minute']

# Create a contingency table
contingency_table = pd.crosstab(df['Competition'], df['Playing_Position'])

# Perform the chi-square test of independence
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)

print("Chi-square statistic:", chi2_statistic)

# Interpret the results
if p_value < 0.05:
    print("Reject Null Hypothesis: There is an association between playing position and competition.")
else:
    print("Fail to Reject Null Hypothesis: There is no association between playing position and competition.")

**Hypothesis 2: Chi-square test of independence for association between goal type and competition**

In [None]:
goal_type_competition_table = pd.crosstab(df['Type'], df['Competition'])

chi2_statistic, p_value, dof, expected = chi2_contingency(goal_type_competition_table)
print("\nHypothesis 2:")
print("Chi-square statistic:", chi2_statistic)
if p_value < 0.05:
    print("Reject Null Hypothesis. There is an association between goal type and competition.")
else:
    print("Fail to Reject Null Hypothesis. There is no association between goal type and competition")

**Hypothesis 3: Chi-square test of independence for association between Playing_Position and Type**

In [None]:
position_type_table = pd.crosstab(df['Playing_Position'], df['Type'])

chi2_statistic, p_value, dof, expected = chi2_contingency(position_type_table)
print("\nHypothesis 3:")
print("Chi-square statistic:", chi2_statistic)
if p_value < 0.05:
    print("Reject Null Hypothesis. There is an association between Playing_Position and Type.")
else:
    print("Fail to Reject Null Hypothesis. There is no association between Playing_Position and Type.")

**Quantifying the association between playing position and goal type using Cramér's V, which measures the strength of association between two categorical variables.**

In [None]:
# Create a contingency table
contingency_table = pd.crosstab(df['Playing_Position'], df['Type'])

# Perform the chi-square test of independence
chi2_statistic, p_value, dof, expected = chi2_contingency(contingency_table)

# Calculate Cramér's V
n = contingency_table.sum().sum()  # Total sample size
phi2 = chi2_statistic / n
r, k = contingency_table.shape
cramers_v = np.sqrt(phi2 / min(r-1, k-1))

print("Chi-square statistic:", chi2_statistic)
print("Cramér's V:", cramers_v)

# Interpretation of Cramér's V
if cramers_v < 0.1:
    association = "negligible"
elif cramers_v < 0.3:
    association = "weak"
elif cramers_v < 0.5:
    association = "moderate"
else:
    association = "strong"

print(f"Association between Playing_Position and Type is {association}.")
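
**The Cramér's V calculation above can be checked against two extreme cases: a table with perfect association should give 1 and a table with no association should give 0. The sketch below does this; the helper name `cramers_v_from_table` and both toy tables are assumptions made for the example. It passes `correction=False` because `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which would keep the perfect case slightly below 1.**

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_from_table(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of observed counts."""
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()  # total sample size
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, k - 1))))

# Toy tables (hypothetical counts)
perfect = pd.DataFrame([[50, 0], [0, 50]])        # rows fully determine columns
independent = pd.DataFrame([[25, 25], [25, 25]])  # rows tell nothing about columns

print(cramers_v_from_table(perfect))
print(cramers_v_from_table(independent))
```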


**The `train_test_split` function splits a dataset into training and testing sets. The `LabelEncoder` converts categorical labels into integer codes. `LogisticRegression` implements logistic regression for binary or multiclass classification, while the `RandomForestClassifier` uses an ensemble of decision trees for classification. The `classification_report` function generates a report with precision, recall, F1-score, and support for each class. The `accuracy_score` function computes the model's prediction accuracy.**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score


**Converting the relevant columns to strings**

In [None]:
columns_to_encode = ['Player', 'Season', 'Competition', 'Venue', 'Club', 'Opponent', 'Playing_Position', 'Type', 'Goal_assist']
for column in columns_to_encode:
    df[column] = df[column].astype(str)

**Encoding the categorical features**

In [None]:
label_encoder = LabelEncoder()
for column in columns_to_encode:
    df[column] = label_encoder.fit_transform(df[column])
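
**Reusing one `LabelEncoder` for every column, as above, keeps only the mapping of the last column fitted. If the integer codes need to be turned back into labels later, one option is to keep a separate encoder per column. This is a sketch on a toy frame; the `encoders` dictionary and the toy values are hypothetical.**

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Club": ["Real Madrid", "FC Barcelona", "Real Madrid"],
    "Venue": ["H", "A", "H"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for column in toy.columns:
    encoder = LabelEncoder()
    encoded[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# Codes follow the sorted order of the labels and can be mapped back
decoded_clubs = encoders["Club"].inverse_transform(encoded["Club"])
print(encoded)
print(list(decoded_clubs))
```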


**Defining features and target**

In [None]:

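# Exclude the target column 'Type' from the features; leaving it in would leak the answer to the models
feature_columns = [column for column in columns_to_encode if column != 'Type']
X = df[feature_columns]
y = df['Type']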


**Splitting the data into training and testing sets**

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
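
**When some classes are rare, a plain random split can leave the test set with very few examples of them. `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both parts. The sketch below uses synthetic labels with an 80/20 imbalance; it is an optional variation and not what the cell above does.**

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 80 samples of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Class counts in each part; both keep the 80/20 ratio
print(np.bincount(y_tr), np.bincount(y_te))
```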


**Applying Logistic Regression Model**

In [None]:

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)


**Evaluate Logistic Regression Model**

In [None]:

print("Logistic Regression Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print(classification_report(y_test, y_pred_log_reg))

**Applying Random Forest Classifier Model**

In [None]:

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

**Evaluate Random Forest Classifier Model**

In [None]:

print("Random Forest Classifier Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

**`SVC` from `sklearn.svm` is a Support Vector Classifier used for classification tasks. It finds the hyperplane that best separates the data into different classes.**

In [None]:
from sklearn.svm import SVC

**Initialize and train the Support Vector Machine (SVM) classifier**

In [None]:

svm_classifier = SVC()
svm_classifier.fit(X_train, y_train)


**Make predictions**

In [None]:

y_pred = svm_classifier.predict(X_test)

**Evaluate the model**

In [None]:

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

**The GradientBoostingClassifier from sklearn.ensemble builds a series of decision trees sequentially, each correcting errors of the previous ones, to improve classification accuracy.**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

**Initialize and train the Gradient Boosting Classifier**

In [None]:

gb_classifier = GradientBoostingClassifier()
gb_classifier.fit(X_train, y_train)


**Make predictions**

In [None]:
y_pred = gb_classifier.predict(X_test)

**Evaluate the model**

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))