**Summary:**

This notebook demonstrates a preprocessing step intended to improve a classification model: the data is clustered first, and the cluster label is added as an extra feature. The steps are as follows:

1. **Data Preparation:** 
   - The notebook begins by reading a dataset from a CSV file ("dataset_before_clustering.csv") using pandas.
   - Initial exploration of the dataset is performed by displaying the first few rows.

2. **Preprocessing:**
   - The "match_winner" column is shifted down by 1 (1.0 becomes 0.0 in the rows shown), presumably so that the class labels start at zero.
   - The "match_winner" column is stored separately, and the remaining columns, minus the leftover index column, are extracted into a new DataFrame ("df").
   - Unnecessary columns such as "season" and "match_id" are dropped from the DataFrame.

3. **Clustering:**
   - The KMeans algorithm from scikit-learn is used with 2 clusters, "k-means++" initialization, and 12 initializations (n_init=12).
   - The DataFrame ("df") is fitted to the KMeans model to obtain cluster labels.
   - The obtained labels are added as a new column ("label") to the DataFrame.

4. **Post-Processing:**
   - The original "match_winner" column is added back to the DataFrame from the stored variable.
   - The modified DataFrame is saved to a new CSV file ("dataset_after_clustering.csv") without including the index.

**Conclusion:**

The notebook shows a preprocessing step in which a KMeans cluster label is added as a feature before classification. The idea is that grouping similar match states may give a downstream classifier an extra signal for separating the classes. Whether this helps is not tested here: that requires training a model with and without the "label" column and comparing the results.
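
One way to run that comparison is sketched below. It is a minimal illustration, not part of the original notebook: the data is synthetic, the target rule (`required_rate < 1.5`) is hypothetical, and `LogisticRegression` is an arbitrary choice of classifier. KMeans is fitted on the training split only, and test rows receive their label through `predict`, so the test set does not influence the clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the match data (NOT the real dataset).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "runs_to_be_scored": rng.integers(1, 223, n).astype(float),
    "balls_remaining": rng.integers(1, 120, n).astype(float),
    "wickets_remaining": rng.integers(1, 11, n).astype(float),
})
# Hypothetical target: 1 when the required rate is below 1.5 runs per ball.
required_rate = X["runs_to_be_scored"] / X["balls_remaining"]
y = (required_rate < 1.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

def evaluate(train_features, test_features):
    """Fit a classifier on the training features and return test accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_features, y_train)
    return accuracy_score(y_test, clf.predict(test_features))

# Baseline: the original features only.
baseline_acc = evaluate(X_train, X_test)

# Variant: add the cluster label, learned from the training split only.
km = KMeans(n_clusters=2, init="k-means++", n_init=12, random_state=0).fit(X_train)
X_train_aug = X_train.assign(label=km.labels_)
X_test_aug = X_test.assign(label=km.predict(X_test))
clustered_acc = evaluate(X_train_aug, X_test_aug)

print(f"baseline accuracy:   {baseline_acc:.3f}")
print(f"with cluster label:  {clustered_acc:.3f}")
```

On real data, the two numbers should be compared across several splits or with cross-validation before drawing any conclusion.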

In [1]:
import pandas as pd
cust_df = pd.read_csv("dataset_before_clustering.csv")
cust_df.head()

,Unnamed: 0,season,match_id,runs_to_be_scored,balls_remaining,wickets_remaining,match_winner
0,124,2008,335982,222.0,119.0,10.0,1.0
1,125,2008,335982,221.0,119.0,10.0,1.0
2,126,2008,335982,221.0,118.0,10.0,1.0
3,127,2008,335982,220.0,117.0,10.0,1.0
4,128,2008,335982,219.0,116.0,10.0,1.0


In [6]:
# Shift the "match_winner" values down by 1 (1.0 becomes 0.0 in the rows shown)
cust_df["match_winner"] = cust_df["match_winner"] - 1

# Store the "match_winner" column in a separate variable
match_winner = cust_df["match_winner"]

# Keep the feature columns only: skip the first column (leftover index)
# and the last column ("match_winner"). .copy() makes df independent of cust_df.
df = cust_df.iloc[:, 1:-1].copy()

,season,match_id,runs_to_be_scored,balls_remaining,wickets_remaining
0,2008,335982,222.0,119.0,10.0
1,2008,335982,221.0,119.0,10.0
2,2008,335982,221.0,118.0,10.0
3,2008,335982,220.0,117.0,10.0
4,2008,335982,219.0,116.0,10.0


In [7]:
# Drop the "season" and "match_id" columns, which are identifiers rather than features
df = df.drop(columns=["season", "match_id"])

In [None]:
from sklearn.cluster import KMeans

# Define the number of clusters
cluster_num = 2

# Initialize KMeans with specified parameters
k_means = KMeans(init="k-means++", n_clusters=cluster_num, n_init=12)

# Fit KMeans to the data
k_means.fit(df)
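
KMeans starts from random initial centers, so which cluster ends up numbered 0 or 1 can differ between runs unless `random_state` is set. The sketch below is an illustration on synthetic data, not the notebook's dataset: the two blobs and the names `demo_df` and `demo_km` are made up. It shows how to pin the seed and inspect the fitted centers and cluster sizes.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Two well-separated synthetic groups standing in for early and late match states.
early = pd.DataFrame({
    "runs_to_be_scored": rng.normal(180, 15, 300),
    "balls_remaining": rng.normal(100, 8, 300),
    "wickets_remaining": rng.normal(9, 0.5, 300),
})
late = pd.DataFrame({
    "runs_to_be_scored": rng.normal(40, 10, 300),
    "balls_remaining": rng.normal(20, 5, 300),
    "wickets_remaining": rng.normal(4, 1, 300),
})
demo_df = pd.concat([early, late], ignore_index=True)

# Same settings as the notebook, plus a fixed seed for repeatable labels.
demo_km = KMeans(init="k-means++", n_clusters=2, n_init=12, random_state=0)
demo_km.fit(demo_df)

# One row per cluster center, in the original feature units.
centers = pd.DataFrame(demo_km.cluster_centers_, columns=demo_df.columns)
# Number of rows assigned to each cluster.
sizes = pd.Series(demo_km.labels_).value_counts().sort_index()

print(centers.round(1))
print(sizes)
```

Because KMeans uses Euclidean distance, the feature with the largest numeric range tends to dominate the result; looking at the centers makes it easier to see which features actually separate the clusters.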

In [10]:
# Assign cluster labels to a new column in the DataFrame
df["label"] = k_means.labels_

# Print the length of the labels array
print(len(k_means.labels_))

91637


In [13]:
# Add the "match_winner" column back to the DataFrame
df["match_winner"] = match_winner

# Display the first few rows of the DataFrame
df.head()

,runs_to_be_scored,balls_remaining,wickets_remaining,label,match_winner
0,222.0,119.0,10.0,0,0.0
1,221.0,119.0,10.0,0,0.0
2,221.0,118.0,10.0,0,0.0
3,220.0,117.0,10.0,0,0.0
4,219.0,116.0,10.0,0,0.0


In [14]:
# Save the modified DataFrame to a CSV file without including the index
df.to_csv("dataset_after_clustering.csv", index=False)
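
The `index=False` argument matters when the file is read back. The small round trip below uses an in-memory buffer and a made-up two-row frame (`demo`) rather than the real file. It shows that writing with the default `index=True` produces an extra unnamed column on reload, which is most likely how the `Unnamed: 0` column ended up in the input CSV.

```python
import io

import pandas as pd

# Tiny illustrative frame; the values are placeholders, not real results.
demo = pd.DataFrame({"runs_to_be_scored": [222.0, 221.0], "label": [0, 0]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```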