# Encoding Match Winner dataset

This notebook covers the encoding steps for the cleaned match data.  
- **Data Source:** Cleaned CSV file (`Match_cleaned.csv`)  
- **Goal:** Encode the categorical features: match results with **Label Encoding** and team names with **One-Hot Encoding**.  

In [32]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

## Loading the cleaned dataset

In [33]:
df = pd.read_csv('../Match_cleaned.csv')

In [34]:
df.head()

Unnamed: 0,HomeTeam,AwayTeam,FullTimeResult,HalfTimeHomeGoals,HalfTimeAwayGoals,HalfTimeResult
0,Charlton,Man City,H,2,0,H
1,Chelsea,West Ham,H,1,0,H
2,Coventry,Middlesbrough,A,1,1,D
3,Derby,Southampton,D,1,2,A
4,Leeds,Everton,H,2,0,H


## Label Encoding

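We'll use Label Encoding for `FullTimeResult` and `HalfTimeResult`, since each takes one of three fixed values: H (home), A (away), or D (draw). `LabelEncoder` assigns integer codes in the sorted order of the labels, so the codes identify the category but do not rank it.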

In [35]:
# Copy df to preserve the original
df_encoded = df.copy()

# Fit a separate encoder per result column and keep it for later decoding
encoders = {}

for col in ['FullTimeResult', 'HalfTimeResult']:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    encoders[col] = le

    # Show the label -> code mapping for this column
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    print(f"{col} mapping:", mapping)

FullTimeResult mapping: {'A': np.int64(0), 'D': np.int64(1), 'H': np.int64(2)}
HalfTimeResult mapping: {'A': np.int64(0), 'D': np.int64(1), 'H': np.int64(2)}
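
The mapping printed above can also be reversed. The sketch below uses a small made-up array (not the real dataset) to show that `LabelEncoder` sorts its classes and that `inverse_transform` recovers the original labels; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy result labels standing in for a column such as FullTimeResult (assumed data)
results = np.array(["H", "A", "D", "H"])

encoder = LabelEncoder()
codes = encoder.fit_transform(results)

# classes_ is stored in sorted order, so codes follow A, D, H
mapping = {
    str(label): int(code)
    for label, code in zip(encoder.classes_, encoder.transform(encoder.classes_))
}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around is what makes this decoding possible later, for example when turning model predictions back into H, A, or D.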


## Preview label-encoded columns

Here we can see the numeric representation of match results after encoding.  

In [36]:
# Preview first 10 rows of the encoded columns
df_encoded[['FullTimeResult', 'HalfTimeResult']].head(10)

Unnamed: 0,FullTimeResult,HalfTimeResult
0,2,2
1,2,2
2,0,1
3,1,0
4,2,2
5,1,1
6,2,1
7,2,1
8,2,2
9,2,2


In [37]:
print("FullTimeResult unique values:", df_encoded['FullTimeResult'].unique())
print("HalfTimeResult unique values:", df_encoded['HalfTimeResult'].unique())

FullTimeResult unique values: [2 0 1]
HalfTimeResult unique values: [2 1 0]
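
The visual check above could also be written as an assertion, so that an unexpected code fails loudly. This is only a sketch on a hypothetical toy frame; `toy` and `expected_codes` are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the two encoded result columns
toy = pd.DataFrame({
    "FullTimeResult": [2, 2, 0, 1],
    "HalfTimeResult": [2, 2, 1, 0],
})

# Codes for A, D, H under the mapping printed earlier
expected_codes = {0, 1, 2}

for col in ["FullTimeResult", "HalfTimeResult"]:
    observed = set(toy[col].unique().tolist())
    unexpected = observed - expected_codes
    # Fail if any value falls outside the known codes
    assert not unexpected, f"{col} has unexpected codes: {unexpected}"

# Class balance of the encoded full-time result, ordered by code
value_counts = toy["FullTimeResult"].value_counts().sort_index()
print(value_counts.to_dict())
```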


## One-Hot Encoding

We'll apply One-Hot Encoding to the **team names** (`HomeTeam`, `AwayTeam`), since these are nominal categories with no natural order.  
This replaces each of the two columns with one binary column per team.  

In [38]:
# Show shape before encoding
print("Before one-hot encoding:", df_encoded.shape)

# Apply one-hot encoding
df_encoded = pd.get_dummies(df_encoded, columns=['HomeTeam', 'AwayTeam'])

# Show shape after encoding
print("After one-hot encoding:", df_encoded.shape)

Before one-hot encoding: (6676, 6)
After one-hot encoding: (6676, 96)
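
The change in column count follows from how `pd.get_dummies` works: each encoded column is dropped and replaced by one indicator column per distinct value. A minimal sketch on made-up rows (not the real dataset) shows the arithmetic; `expected_width` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with two team columns and one untouched column (assumed data)
toy = pd.DataFrame({
    "HomeTeam": ["Charlton", "Chelsea", "Leeds"],
    "AwayTeam": ["Man City", "West Ham", "Everton"],
    "FullTimeResult": [2, 2, 2],
})

encoded = pd.get_dummies(toy, columns=["HomeTeam", "AwayTeam"])

# Untouched columns plus one indicator per distinct value in each encoded column
expected_width = (
    (toy.shape[1] - 2)
    + toy["HomeTeam"].nunique()
    + toy["AwayTeam"].nunique()
)
print(encoded.shape, expected_width)

# dtype=int requests 0/1 indicators instead of the default indicator dtype
encoded_int = pd.get_dummies(toy, columns=["HomeTeam", "AwayTeam"], dtype=int)
print(encoded_int["HomeTeam_Chelsea"].tolist())
```

Applied to the shapes printed above, the same arithmetic gives 6 - 2 = 4 untouched columns plus 92 indicator columns, for 96 in total.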


## Preview encoded team columns

In [39]:
# First 5 rows of the HomeTeam one-hot columns
# (not rendered: a notebook cell only displays its last expression)
df_encoded.filter(like="HomeTeam").head()

# First 5 rows of the AwayTeam one-hot columns
df_encoded.filter(like="AwayTeam").head()

Unnamed: 0,AwayTeam_Arsenal,AwayTeam_Aston Villa,AwayTeam_Birmingham,AwayTeam_Blackburn,AwayTeam_Blackpool,AwayTeam_Bolton,AwayTeam_Bournemouth,AwayTeam_Bradford,AwayTeam_Brentford,AwayTeam_Brighton,...,AwayTeam_Southampton,AwayTeam_Stoke,AwayTeam_Sunderland,AwayTeam_Swansea,AwayTeam_Tottenham,AwayTeam_Watford,AwayTeam_West Brom,AwayTeam_West Ham,AwayTeam_Wigan,AwayTeam_Wolves
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
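
Because the full preview is too wide to inspect by eye, a programmatic check can help. The sketch below, on a made-up frame, verifies that every row has exactly one active indicator per team column and shows one way to recover the team name; all names are illustrative.

```python
import pandas as pd

# Toy matches standing in for the real data (assumed values)
toy = pd.DataFrame({
    "HomeTeam": ["Derby", "Leeds", "Derby"],
    "AwayTeam": ["Southampton", "Everton", "Leeds"],
})
encoded = pd.get_dummies(toy, columns=["HomeTeam", "AwayTeam"])

for prefix in ["HomeTeam_", "AwayTeam_"]:
    block = encoded.filter(like=prefix)
    # Each match has exactly one home team and one away team
    row_sums = block.astype(int).sum(axis=1)
    assert (row_sums == 1).all(), f"{prefix} columns are not one-hot"

# Recover the home team name from the position of the active indicator
home_cols = encoded.filter(like="HomeTeam_").astype(int)
recovered_home = home_cols.idxmax(axis=1).str.replace("HomeTeam_", "", regex=False)
print(recovered_home.tolist())
```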


## Saving the encoded dataset

In [40]:
df_encoded.to_csv('../Match_encoded.csv', index=False)
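
To illustrate why `index=False` matters here, the following sketch writes a small made-up frame to an in-memory buffer instead of a file and reads it back. The buffer and the toy data are assumptions for the example; the real notebook writes to `../Match_encoded.csv`.

```python
import io
import pandas as pd

# Toy encoded frame (assumed data)
toy = pd.DataFrame({
    "FullTimeResult": [2, 0],
    "HomeTeam_Charlton": [True, False],
    "HomeTeam_Chelsea": [False, True],
})

# With index=False the reloaded frame has the same columns as the original
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Without it, the index is written as an unnamed first column
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)
print(reloaded_with_index.columns.tolist())
```

Reading the file back once after saving is a cheap way to confirm that the shape and column names survived the round trip.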