# Data Cleaning: Top Scorer of Goals (Excel Version)

This notebook prepares the dataset **Top Goals.xlsx** for regression modeling.

## Steps
1. Install and import libraries
2. Load the dataset
3. Drop unwanted columns
4. Handle missing values
5. Handle duplicates
6. Save cleaned dataset
7. Display final cleaned dataset
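
Steps 3 to 5 can also be sketched as one reusable function. This is a minimal illustration and not the notebook's own code: `clean_frame` is a hypothetical helper, and the toy values below are invented and stand in for the real spreadsheet.

```python
import pandas as pd

def clean_frame(df, keep_columns):
    """Select columns, drop incomplete rows, then drop duplicate rows."""
    cleaned = df[keep_columns].dropna().drop_duplicates()
    # Renumber the rows so the index is contiguous after rows were removed.
    return cleaned.reset_index(drop=True)

# Toy data (made-up values), with one extra column that is not a feature.
toy = pd.DataFrame({
    "Goals": [10, 10, None, 7],
    "Age": [24, 24, 30, 28],
    "Club": ["A", "B", "C", "D"],
})

result = clean_frame(toy, ["Goals", "Age"])
print(result)
```

The order matters: because columns are selected first, rows that differ only in a dropped column (here `Club`) count as duplicates afterwards.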

In [None]:
# Install required libraries into the active kernel's environment
%pip install pandas openpyxl numpy

In [None]:
import pandas as pd
import numpy as np

## Load dataset

In [None]:
file_path = "Top Goals.xlsx"
df = pd.read_excel(file_path, engine="openpyxl")

# Display initial info
df.info()
df.head()

## Drop unwanted columns

We keep only the target column (`Goals`) and the ten feature columns listed in the next cell; every other column is dropped.

In [None]:
target = ['Goals']
features = [
    'Position','Age','Appearances','Goals_prev_season','Assists',
    'Penalty_Goals','Non-Penalty_Goals','Goals_per_90',
    'Big_6_Club_Feature','League_Goals_per_Match'
]

# Selecting by name raises a KeyError if any listed column is absent from the spreadsheet
keep_columns = target + features
df = df[keep_columns]
df.head()
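
If a column name in the spreadsheet differs from the list (for example through a typo or a renamed header), the selection fails with a bare `KeyError`. The sketch below shows one way to report exactly which names are missing before selecting. `select_columns` is a hypothetical helper and the toy frame uses invented values.

```python
import pandas as pd

def select_columns(df, wanted):
    """Return df restricted to `wanted`, naming any columns that are absent."""
    missing = [col for col in wanted if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in dataset: {missing}")
    return df[wanted].copy()

# Toy frame (made-up values) with one column that is not wanted.
toy = pd.DataFrame({"Goals": [12, 8], "Age": [25, 31], "Nationality": ["X", "Y"]})

subset = select_columns(toy, ["Goals", "Age"])

# Requesting a column the frame does not have produces a readable message.
try:
    select_columns(toy, ["Goals", "Assists"])
except KeyError as err:
    error_message = str(err)
    print(error_message)
```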

## Handle missing values

In [None]:
print("Missing values before handling:")
print(df.isnull().sum())

# Drop rows with any missing values
df = df.dropna()

print("\nMissing values after handling:")
print(df.isnull().sum())
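
Because `dropna()` removes a row when any single column is empty, it can help to check how much data that costs before committing to it. A minimal sketch on invented toy data, not on the real dataset:

```python
import pandas as pd

# Toy frame (made-up values) with gaps in two different columns.
toy = pd.DataFrame({
    "Goals": [10, None, 7, 3],
    "Assists": [4, 2, None, 1],
    "Age": [24, 30, 28, 22],
})

# Fraction of missing entries per column.
missing_share = toy.isnull().mean()

# Number of rows that survive dropping every incomplete row.
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_share)
print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows")
```

In this toy case each column is at most 25% empty, yet half of the rows are lost, because the gaps fall in different rows.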

## Handle duplicates

In [None]:
print(f"Duplicates before: {df.duplicated().sum()}")
df = df.drop_duplicates()
print(f"Duplicates after: {df.duplicated().sum()}")
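
Since only the target and feature columns remain at this point, two rows with identical statistics count as duplicates even if they originally described different players. A small sketch, on invented toy data, of how the affected rows could be inspected before they are removed:

```python
import pandas as pd

# Toy frame (made-up values): the first two rows are identical.
toy = pd.DataFrame({
    "Goals": [10, 10, 7],
    "Age": [24, 24, 28],
})

# keep=False flags every member of a duplicate group, not just the later copies.
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dupes)
print(deduped)
```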

## Save cleaned dataset

In [None]:
output_file = "Top_Goals_Cleaned.xlsx"
df.to_excel(output_file, index=False)
print(f"Cleaned dataset saved as: {output_file}")

## Final cleaned dataset overview

In [None]:
df.info()

df.head()
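
As an optional last step, the properties this notebook aims for (expected columns, no missing values, no duplicate rows) can be verified programmatically. The sketch below uses a hypothetical helper, `check_clean`, and invented toy frames.

```python
import pandas as pd

def check_clean(df, expected_columns):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(df.columns) != list(expected_columns):
        problems.append("unexpected columns")
    if df.isnull().any().any():
        problems.append("missing values remain")
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    return problems

# Toy frames (made-up values): one clean, one with a gap and a repeated row.
clean = pd.DataFrame({"Goals": [10, 7], "Age": [24, 28]})
dirty = pd.DataFrame({"Goals": [10, 10, None], "Age": [24, 24, 30]})

clean_problems = check_clean(clean, ["Goals", "Age"])
dirty_problems = check_clean(dirty, ["Goals", "Age"])
print(clean_problems)
print(dirty_problems)
```

In the notebook itself, the call would be `check_clean(df, keep_columns)`.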