# Match Winner Data Cleaning
---

## 1. Introduction
In this notebook, we will clean and preprocess the dataset `Match Winner.csv` to prepare it for a classification task that predicts match winners. The target variable is **`FullTimeResult`** (values: `H` = Home Win, `A` = Away Win, `D` = Draw).

We will:
1. Load and inspect the dataset.
2. Drop irrelevant columns.
3. Handle missing values.
4. Remove duplicate rows.
5. Save the cleaned dataset for future modeling.


## 2. Load the Data

In [7]:
import pandas as pd

# Load the dataset
file_path = "Match Winner.csv"
df = pd.read_csv(file_path)

# Display basic info
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9380 entries, 0 to 9379
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Season             9380 non-null   object
 1   MatchDate          9380 non-null   object
 2   HomeTeam           9380 non-null   object
 3   AwayTeam           9380 non-null   object
 4   FullTimeHomeGoals  9380 non-null   int64 
 5   FullTimeAwayGoals  9380 non-null   int64 
 6   FullTimeResult     9380 non-null   object
 7   HalfTimeHomeGoals  9380 non-null   int64 
 8   HalfTimeAwayGoals  9380 non-null   int64 
 9   HalfTimeResult     9380 non-null   object
 10  HomeShots          9380 non-null   int64 
 11  AwayShots          9380 non-null   int64 
 12  HomeShotsOnTarget  9380 non-null   int64 
 13  AwayShotsOnTarget  9380 non-null   int64 
 14  HomeCorners        9380 non-null   int64 
 15  AwayCorners        9380 non-null   int64 
 16  HomeFouls          9380 non-null   int64 


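| | Season | MatchDate | HomeTeam | AwayTeam | FullTimeHomeGoals | FullTimeAwayGoals | FullTimeResult | HalfTimeHomeGoals | HalfTimeAwayGoals | HalfTimeResult | ... | HomeShotsOnTarget | AwayShotsOnTarget | HomeCorners | AwayCorners | HomeFouls | AwayFouls | HomeYellowCards | AwayYellowCards | HomeRedCards | AwayRedCards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000/01 | 19-08-2000 | Charlton | Man City | 4 | 0 | H | 2 | 0 | H | ... | 14 | 4 | 6 | 6 | 13 | 12 | 1 | 2 | 0 | 0 |
| 1 | 2000/01 | 19-08-2000 | Chelsea | West Ham | 4 | 2 | H | 1 | 0 | H | ... | 10 | 5 | 7 | 7 | 19 | 14 | 1 | 2 | 0 | 0 |
| 2 | 2000/01 | 19-08-2000 | Coventry | Middlesbrough | 1 | 3 | A | 1 | 1 | D | ... | 3 | 9 | 8 | 4 | 15 | 21 | 5 | 3 | 1 | 0 |
| 3 | 2000/01 | 19-08-2000 | Derby | Southampton | 2 | 2 | D | 1 | 2 | A | ... | 4 | 6 | 5 | 8 | 11 | 13 | 1 | 1 | 0 | 0 |
| 4 | 2000/01 | 19-08-2000 | Leeds | Everton | 2 | 0 | H | 2 | 0 | H | ... | 8 | 6 | 6 | 4 | 21 | 20 | 1 | 3 | 0 | 0 |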
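Before going further, it can help to confirm that the target contains only the three expected labels. The following is a minimal sketch on a toy frame built from the five rows shown above; `sample` and `expected_labels` are illustrative names and not part of the original notebook.

```python
import pandas as pd

# Toy frame: team names and target for the five rows printed by df.head().
sample = pd.DataFrame({
    "HomeTeam": ["Charlton", "Chelsea", "Coventry", "Derby", "Leeds"],
    "AwayTeam": ["Man City", "West Ham", "Middlesbrough", "Southampton", "Everton"],
    "FullTimeResult": ["H", "H", "A", "D", "H"],
})

# Labels documented in the introduction: H = Home Win, A = Away Win, D = Draw.
expected_labels = {"H", "A", "D"}

# Any label outside H/A/D would point to a data-entry problem.
unexpected = set(sample["FullTimeResult"].unique()) - expected_labels

# Class counts give a first view of how balanced the target is.
label_counts = sample["FullTimeResult"].value_counts()

print("Unexpected labels:", unexpected)
print(label_counts)
```

On the full dataset the same two checks would run against `df["FullTimeResult"]`.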


## 3. Drop Irrelevant Columns
We will keep only the features specified:
- **HalfTimeHomeGoals**
- **HalfTimeAwayGoals**
- **HalfTimeResult**
- **HomeTeam**
- **AwayTeam**
- **FullTimeResult** (target)

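We will drop the rest. The full-time goal columns in particular determine `FullTimeResult` directly, so keeping them as features would leak the target.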

In [8]:
# Define columns to keep
cols_to_keep = [
    'HalfTimeHomeGoals',
    'HalfTimeAwayGoals',
    'HalfTimeResult',
    'HomeTeam',
    'AwayTeam',
    'FullTimeResult'
]

# Subset dataframe
df = df[cols_to_keep]

# Verify
print("Remaining columns:", df.columns.tolist())
df.head()

Remaining columns: ['HalfTimeHomeGoals', 'HalfTimeAwayGoals', 'HalfTimeResult', 'HomeTeam', 'AwayTeam', 'FullTimeResult']


| | HalfTimeHomeGoals | HalfTimeAwayGoals | HalfTimeResult | HomeTeam | AwayTeam | FullTimeResult |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | H | Charlton | Man City | H |
| 1 | 1 | 0 | H | Chelsea | West Ham | H |
| 2 | 1 | 1 | D | Coventry | Middlesbrough | A |
| 3 | 1 | 2 | A | Derby | Southampton | D |
| 4 | 2 | 0 | H | Leeds | Everton | H |
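An alternative to subsetting after loading is to read only the needed columns in the first place. The sketch below assumes this would be acceptable for the workflow; it uses the `usecols` parameter of `pd.read_csv` on a small in-memory CSV (the first ten columns of three rows shown earlier) rather than the real file.

```python
import io
import pandas as pd

# Toy CSV: first ten columns of the first three rows shown earlier.
# The real file has 22 columns.
csv_text = """Season,MatchDate,HomeTeam,AwayTeam,FullTimeHomeGoals,FullTimeAwayGoals,FullTimeResult,HalfTimeHomeGoals,HalfTimeAwayGoals,HalfTimeResult
2000/01,19-08-2000,Charlton,Man City,4,0,H,2,0,H
2000/01,19-08-2000,Chelsea,West Ham,4,2,H,1,0,H
2000/01,19-08-2000,Coventry,Middlesbrough,1,3,A,1,1,D
"""

cols_to_keep = [
    "HalfTimeHomeGoals",
    "HalfTimeAwayGoals",
    "HalfTimeResult",
    "HomeTeam",
    "AwayTeam",
    "FullTimeResult",
]

# usecols restricts parsing to the listed columns but keeps the file's
# column order, so index with cols_to_keep afterwards to get the intended order.
subset = pd.read_csv(io.StringIO(csv_text), usecols=cols_to_keep)[cols_to_keep]

print(subset.columns.tolist())
```

Reading fewer columns mainly saves memory on large files; for a dataset of this size either approach works.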


## 4. Handle Missing Values
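The `df.info()` output above already shows 9380 non-null values for each retained column, so this step is a safeguard. We will:
- Count missing values per column.
- Drop any rows that contain missing values.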

In [9]:
# Check missing values
df.isnull().sum()

# Drop rows with missing values (if any)
df = df.dropna()

# Verify again
df.isnull().sum()

HalfTimeHomeGoals    0
HalfTimeAwayGoals    0
HalfTimeResult       0
HomeTeam             0
AwayTeam             0
FullTimeResult       0
dtype: int64
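Because the real data has no missing values in these columns, the effect of `dropna()` is not visible in the output above. The following sketch shows the same check-then-drop pattern on a toy frame with one invented missing value, and also reports how many rows were removed; `toy`, `cleaned`, and `rows_dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame; the missing value in row 1 is invented for illustration
# (the real dataset has none in these columns).
toy = pd.DataFrame({
    "HalfTimeHomeGoals": [2, np.nan, 1],
    "HalfTimeAwayGoals": [0, 0, 1],
    "FullTimeResult": ["H", "H", "A"],
})

# Missing values per column, computed before dropping anything.
missing_per_column = toy.isnull().sum()

rows_before = len(toy)
cleaned = toy.dropna()  # removes every row with at least one missing value
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Logging the number of dropped rows makes it easy to notice if a later version of the file starts losing data at this step.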

## 5. Remove Duplicates
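Check for duplicate rows and remove them. Note that `Season` and `MatchDate` were dropped in the previous step, so duplicates are detected on the six retained columns only: two different matches between the same teams with the same half-time score and the same full-time result count as duplicates of each other.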

In [10]:
# Check duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Drop duplicates
df = df.drop_duplicates()

# Verify again
df.duplicated().sum()

Number of duplicate rows: 2704


np.int64(0)
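To tell repeated records apart from distinct matches that merely share feature values, one option is to check duplicates on the identifying columns before they are dropped. The sketch below assumes that `Season`, `MatchDate`, `HomeTeam`, and `AwayTeam` together identify a match; the second row of the toy frame is invented for illustration.

```python
import pandas as pd

# Toy frame. Row 0 comes from the output shown earlier; row 1 is an invented
# later meeting of the same teams with the same half-time and full-time result.
matches = pd.DataFrame({
    "Season": ["2000/01", "2001/02"],
    "MatchDate": ["19-08-2000", "18-08-2001"],
    "HomeTeam": ["Chelsea", "Chelsea"],
    "AwayTeam": ["West Ham", "West Ham"],
    "HalfTimeHomeGoals": [1, 1],
    "HalfTimeAwayGoals": [0, 0],
    "HalfTimeResult": ["H", "H"],
    "FullTimeResult": ["H", "H"],
})

# Assumption: these four columns identify a match uniquely.
key_cols = ["Season", "MatchDate", "HomeTeam", "AwayTeam"]
feature_cols = [
    "HalfTimeHomeGoals",
    "HalfTimeAwayGoals",
    "HalfTimeResult",
    "HomeTeam",
    "AwayTeam",
    "FullTimeResult",
]

# Repeated records of the same match.
true_duplicates = int(matches.duplicated(subset=key_cols).sum())

# Rows that look identical once the identifying columns are gone.
feature_duplicates = int(matches[feature_cols].duplicated().sum())

print("Duplicates by match key:", true_duplicates)
print("Duplicates by features only:", feature_duplicates)
```

If the key-based count is zero on the real data, the rows removed by `drop_duplicates()` are distinct matches, and whether to keep them depends on whether the model should see how often each outcome occurred.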

## 6. Final Dataset Overview

In [11]:
print("Final shape of dataset:", df.shape)
df.head()

Final shape of dataset: (6676, 6)


| | HalfTimeHomeGoals | HalfTimeAwayGoals | HalfTimeResult | HomeTeam | AwayTeam | FullTimeResult |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | H | Charlton | Man City | H |
| 1 | 1 | 0 | H | Chelsea | West Ham | H |
| 2 | 1 | 1 | D | Coventry | Middlesbrough | A |
| 3 | 1 | 2 | A | Derby | Southampton | D |
| 4 | 2 | 0 | H | Leeds | Everton | H |


## 7. Save Cleaned Data
We will save the cleaned dataset as `Match_Winner_Cleaned.csv` for use in the modeling phase.

In [12]:
# Save to new CSV file
df.to_csv("Match_Winner_Cleaned.csv", index=False)
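A quick way to confirm the file was written as intended is to read it back and compare it with the frame in memory. The following sketch does this with a two-row toy frame and a temporary directory so that it does not touch the real output file; `cleaned` and `reloaded` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# Toy frame: the first two rows of the final dataset shown above.
cleaned = pd.DataFrame({
    "HalfTimeHomeGoals": [2, 1],
    "HalfTimeAwayGoals": [0, 0],
    "HalfTimeResult": ["H", "H"],
    "HomeTeam": ["Charlton", "Chelsea"],
    "AwayTeam": ["Man City", "West Ham"],
    "FullTimeResult": ["H", "H"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Match_Winner_Cleaned.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if values, column order, or dtypes differ.
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK:", reloaded.shape)
```

On the real `df`, the index has gaps after `drop_duplicates()`, so the comparison would need `df.reset_index(drop=True)` on the in-memory side.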

---

### Summary
- Loaded the dataset.
- Dropped irrelevant columns.
- Checked for missing values (none were found in the retained columns).
- Removed 2704 duplicate rows, leaving 6676 rows and 6 columns.
- Saved the cleaned dataset for modeling.