
# Predicting the Winner of Tennis Matches

### Introduction

The dataset contains detailed information about ATP tennis matches played between 2000 and 2025. Each record represents one match and includes the tournament name, the match date, the tournament series, the court surface, the round, the match format (the `Best of` column, i.e. the maximum number of sets), each player's name, ranking, ATP points and betting odds, the winner, and the score.

The goal of the project is to build models that predict the winner of ATP matches using several ML methods (logistic regression, Decision Tree, Random Forest and KMeans) as well as a neural network implemented with Keras. The accuracy, precision and F1 score of these models are also reported in order to compare their performance.

The dataset is available at https://www.kaggle.com/datasets/dissfya/atp-tennis-2000-2023daily-pull/data \\
or in the uploaded archive.

In [None]:
import kagglehub

# Download the dataset
path = kagglehub.dataset_download("dissfya/atp-tennis-2000-2023daily-pull")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/atp-tennis-2000-2023daily-pull


In [None]:
import os

# Check where the dataset was saved
print(os.listdir(path))

['atp_tennis.csv']


In [None]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys


In [None]:
# Imports required to run the project

import findspark
import pyspark
from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

# The star import shadows Python built-ins such as abs() and max() with
# Spark's column functions; the aggregation cells below rely on that.
from pyspark.sql.functions import *
from pyspark.sql.functions import (
    array, col, concat_ws, greatest, lower, max, sort_array, trim, when,
    pandas_udf, PandasUDFType,
)
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import (
    DecisionTreeClassifier, LogisticRegression, NaiveBayes, RandomForestClassifier,
)
from pyspark.ml.clustering import KMeans
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, ClusteringEvaluator,
    MulticlassClassificationEvaluator, RegressionEvaluator,
)
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

A Spark session is started and the CSV file containing the data is loaded.

In [None]:
spark = SparkSession.builder \
    .appName("ATP Tennis Project") \
    .getOrCreate()

In [None]:
csv_path = os.path.join(path, "atp_tennis.csv")

# Load the CSV file into a Spark DataFrame
df = spark.read.option("header", True).option("inferSchema", True).csv(csv_path)

# Show the schema and the first 5 rows
df.printSchema()
df.show(5)

root
 |-- Tournament: string (nullable = true)
 |-- Date: date (nullable = true)
 |-- Series: string (nullable = true)
 |-- Court: string (nullable = true)
 |-- Surface: string (nullable = true)
 |-- Round: string (nullable = true)
 |-- Best of: integer (nullable = true)
 |-- Player_1: string (nullable = true)
 |-- Player_2: string (nullable = true)
 |-- Winner: string (nullable = true)
 |-- Rank_1: integer (nullable = true)
 |-- Rank_2: integer (nullable = true)
 |-- Pts_1: integer (nullable = true)
 |-- Pts_2: integer (nullable = true)
 |-- Odd_1: double (nullable = true)
 |-- Odd_2: double (nullable = true)
 |-- Score: string (nullable = true)

+--------------------+----------+-------------+-------+-------+---------+-------+--------------+-------------+-----------+------+------+-----+-----+-----+-----+-----------+
|          Tournament|      Date|       Series|  Court|Surface|    Round|Best of|      Player_1|     Player_2|     Winner|Rank_1|Rank_2|Pts_1|Pts_2|Odd_1|Odd_2|      Score

In [None]:
# Show all distinct values of the Round feature
df.select("Round").distinct().show(100, truncate=False)

+-------------+
|Round        |
+-------------+
|1st Round    |
|Quarterfinals|
|Semifinals   |
|The Final    |
|4th Round    |
|Round Robin  |
|2nd Round    |
|3rd Round    |
+-------------+



### Data processing, preparation, and cleaning

In [None]:
# Remove duplicate rows
df = df.dropDuplicates()

In [None]:
# Lowercase the text in the Tournament, Series, Court, Surface, Round, Player_1, Player_2 and Winner columns and strip leading and trailing whitespace

for colname in ["Tournament", "Series", "Court", "Surface", "Round", "Player_1", "Player_2", "Winner"]:
    df = df.withColumn(colname, lower(trim(col(colname))))

### Data aggregation and grouping

In [None]:
# Show the number of matches played per surface
df.groupBy("Surface").count().orderBy("count", ascending=False).show()

+-------+-----+
|Surface|count|
+-------+-----+
|   hard|35345|
|   clay|21283|
|  grass| 7157|
| carpet| 1632|
+-------+-----+



In [None]:
# Show the number of matches played per surface, using Spark SQL

df.createOrReplaceTempView("matches")

spark.sql("""
  SELECT Surface, COUNT(*) as num_matches
  FROM matches
  GROUP BY Surface
  ORDER BY num_matches DESC
""").show()

+-------+-----------+
|Surface|num_matches|
+-------+-----------+
|   hard|      35345|
|   clay|      21283|
|  grass|       7157|
| carpet|       1632|
+-------+-----------+



In [None]:
# Show the number of matches won per player (top 10)
df.groupBy("Winner").count().orderBy("count", ascending=False).show(10)

+-----------+-----+
|     Winner|count|
+-----------+-----+
| federer r.| 1157|
|djokovic n.| 1047|
|   nadal r.| 1009|
|  ferrer d.|  678|
|  murray a.|  675|
| gasquet r.|  577|
| berdych t.|  576|
| roddick a.|  564|
| monfils g.|  553|
|wawrinka s.|  535|
+-----------+-----+
only showing top 10 rows



In [None]:
# Show the number of matches won per player, using Spark SQL
df.createOrReplaceTempView("matches")

spark.sql("""
  SELECT Winner, COUNT(*) as num_matches
  FROM matches
  GROUP BY Winner
  ORDER BY num_matches DESC
""").show()

+------------+-----------+
|      Winner|num_matches|
+------------+-----------+
|  federer r.|       1157|
| djokovic n.|       1047|
|    nadal r.|       1009|
|   ferrer d.|        678|
|   murray a.|        675|
|  gasquet r.|        577|
|  berdych t.|        576|
|  roddick a.|        564|
|  monfils g.|        553|
| wawrinka s.|        535|
|    cilic m.|        533|
| verdasco f.|        526|
|  robredo t.|        504|
|   hewitt l.|        494|
|    simon g.|        479|
|    lopez f.|        472|
|    isner j.|        466|
|davydenko n.|        461|
|  youzhny m.|        459|
|   zverev a.|        448|
+------------+-----------+
only showing top 20 rows



In [None]:
# Show the number of matches won by each player at each tournament (top 10)
df.groupBy("Tournament", "Winner").count().orderBy("count", ascending=False).show(10)

+-------------------+-----------+-----+
|         Tournament|     Winner|count|
+-------------------+-----------+-----+
|        french open|   nadal r.|  109|
|          wimbledon| federer r.|  103|
|    australian open| federer r.|  100|
|        french open|djokovic n.|   97|
|    australian open|djokovic n.|   97|
|          wimbledon|djokovic n.|   96|
|            us open| federer r.|   87|
|            us open|djokovic n.|   84|
|    australian open|   nadal r.|   74|
|monte carlo masters|   nadal r.|   73|
+-------------------+-----------+-----+
only showing top 10 rows



In [None]:
# Show the number of matches won by each player at each tournament, using Spark SQL

df.createOrReplaceTempView("matches")

spark.sql("""
  SELECT Tournament, Winner, COUNT(*) as num_matches
  FROM matches
  GROUP BY Tournament, Winner
  ORDER BY num_matches DESC
""").show()

+--------------------+-----------+-----------+
|          Tournament|     Winner|num_matches|
+--------------------+-----------+-----------+
|         french open|   nadal r.|        109|
|           wimbledon| federer r.|        103|
|     australian open| federer r.|        100|
|         french open|djokovic n.|         97|
|     australian open|djokovic n.|         97|
|           wimbledon|djokovic n.|         96|
|             us open| federer r.|         87|
|             us open|djokovic n.|         84|
|     australian open|   nadal r.|         74|
| monte carlo masters|   nadal r.|         73|
|         french open| federer r.|         72|
|internazionali bn...|djokovic n.|         66|
|    gerry weber open| federer r.|         63|
|             us open|   nadal r.|         63|
|           wimbledon|  murray a.|         61|
|         masters cup| federer r.|         59|
|internazionali bn...|   nadal r.|         57|
|           wimbledon|   nadal r.|         57|
|       swiss

In [None]:
# Show the pairs of players who have met most often
df_pairs = df.withColumn("pair", concat_ws(" vs ", sort_array(array("Player_1", "Player_2"))))
df_pairs.groupBy("pair").count().orderBy("count", ascending=False).show(10, truncate=False)

+----------------------------+-----+
|pair                        |count|
+----------------------------+-----+
|djokovic n. vs nadal r.     |54   |
|djokovic n. vs federer r.   |48   |
|federer r. vs nadal r.      |40   |
|djokovic n. vs murray a.    |34   |
|ferrer d. vs nadal r.       |30   |
|federer r. vs wawrinka s.   |26   |
|federer r. vs murray a.     |24   |
|federer r. vs roddick a.    |24   |
|del potro j.m. vs federer r.|24   |
|djokovic n. vs wawrinka s.  |23   |
+----------------------------+-----+
only showing top 10 rows
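
Sorting the two names before joining them makes the key independent of which player is listed first, so "A vs B" and "B vs A" count as the same pairing. A minimal pure-Python sketch of that idea follows; the helper `pair_key` and the sample rows are illustrative assumptions, not part of the notebook or the dataset.

```python
from collections import Counter


def pair_key(player_1: str, player_2: str) -> str:
    """Order-independent key for a match-up (hypothetical helper)."""
    return " vs ".join(sorted([player_1, player_2]))


# Illustrative rows only; not taken from the dataset.
matches = [
    ("nadal r.", "djokovic n."),
    ("djokovic n.", "nadal r."),
    ("federer r.", "nadal r."),
]

# Both orderings of the same two players land on the same key.
pair_counts = Counter(pair_key(p1, p2) for p1, p2 in matches)
print(pair_counts.most_common(1))
```

Without the sort, the first two rows would be counted as two different pairs.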



In [None]:
# Show the pairs of players who have met most often, using Spark SQL

df = df.withColumn("pair", concat_ws(" vs ", sort_array(array("Player_1", "Player_2"))))

df.createOrReplaceTempView("matches")

spark.sql("""
    SELECT pair, COUNT(*) AS num_matches
    FROM matches
    GROUP BY pair
    ORDER BY num_matches DESC
    LIMIT 10
""").show(truncate=False)

+----------------------------+-----------+
|pair                        |num_matches|
+----------------------------+-----------+
|djokovic n. vs nadal r.     |54         |
|djokovic n. vs federer r.   |48         |
|federer r. vs nadal r.      |40         |
|djokovic n. vs murray a.    |34         |
|ferrer d. vs nadal r.       |30         |
|federer r. vs wawrinka s.   |26         |
|federer r. vs murray a.     |24         |
|federer r. vs roddick a.    |24         |
|del potro j.m. vs federer r.|24         |
|federer r. vs hewitt l.     |23         |
+----------------------------+-----------+



In [None]:
# Show the total number of unique players across Player_1 and Player_2
df.select("Player_1").union(df.select("Player_2")).agg(countDistinct("Player_1").alias("unique_players")).show()

+--------------+
|unique_players|
+--------------+
|          1670|
+--------------+



In [None]:
# Show the total number of unique players across Player_1 and Player_2, using Spark SQL

df.createOrReplaceTempView("matches")

spark.sql("""
    SELECT COUNT(DISTINCT Player) AS unique_players
    FROM (
      SELECT Player_1 AS Player FROM matches
      UNION
      SELECT Player_2 AS Player FROM matches
    )
""").show()

+--------------+
|unique_players|
+--------------+
|          1670|
+--------------+



In [None]:
# Show the mean absolute difference between the two players' ATP points
df.withColumn("pts_diff", abs(df["Pts_1"] - df["Pts_2"])) \
  .agg(avg("pts_diff").alias("avg_pts_diff")).show()

+------------------+
|      avg_pts_diff|
+------------------+
|1022.3205894492257|
+------------------+



In [None]:
# Show the mean absolute difference between the two players' ATP points, using Spark SQL

df.createOrReplaceTempView("matches")

spark.sql("""
    SELECT AVG(ABS(Pts_1 - Pts_2)) AS avg_pts_diff
    FROM matches
""").show()

+------------------+
|      avg_pts_diff|
+------------------+
|1022.3205894492257|
+------------------+



In [None]:
# Show the highest ATP point total held by either player in any match

df_with_max_pts = df.withColumn("max_pts_in_match", greatest("Pts_1", "Pts_2"))
max_pts = df_with_max_pts.agg(max("max_pts_in_match").alias("max_points")).collect()[0]["max_points"]
print(f"Maximum ATP points in a match (player 1 or 2): {max_pts}")

Maximum ATP points in a match (player 1 or 2): 16950


In [None]:
# Show the highest ATP point total held by either player in any match, using Spark SQL

df.createOrReplaceTempView("matches")

spark.sql("""
    SELECT MAX(GREATEST(Pts_1, Pts_2)) as max_pts
    FROM matches
""").show()

+-------+
|max_pts|
+-------+
|  16950|
+-------+



### Data transformation

In [None]:
# Data transformation

# Map the Round values to integers
# (Round Robin is a group stage, so its code 8 is an arbitrary label rather than a later round)

df = df.withColumn("Round_num",
    when(df["Round"] == "1st round", 1)
    .when(df["Round"] == "2nd round", 2)
    .when(df["Round"] == "3rd round", 3)
    .when(df["Round"] == "4th round", 4)
    .when(df["Round"] == "quarterfinals", 5)
    .when(df["Round"] == "semifinals", 6)
    .when(df["Round"] == "the final", 7)
    .when(df["Round"] == "round robin", 8)
    .otherwise(None)
)

In [None]:
# Categorical columns
categorical_cols = ["Surface", "Series", "Court"]

# Numerical columns
numerical_cols = ["Rank_1", "Rank_2", "Best of", "Round_num"]

In [None]:
# StringIndexer converts the categorical columns to numeric indices
# The indexed columns are combined with the existing numerical columns in one list
# VectorAssembler merges all of these columns into a single vector column named features

indexers = [StringIndexer(inputCol=col, outputCol=col + "_idx") for col in categorical_cols]
feature_cols = [col + "_idx" for col in categorical_cols] + numerical_cols
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

In [None]:
# Add a label column to df: 1 if Player_1 won the match, 0 otherwise
df_1 = df.withColumn("label", when(df["Winner"] == df["Player_1"], 1).otherwise(0))

In [None]:
# Build the pipeline, fit it on the DataFrame df_1 and apply it
pipeline = Pipeline(stages=indexers + [assembler])
df_1 = pipeline.fit(df_1).transform(df_1)

In [None]:
# Split the data into train_data (used to train the models) and test_data (used to evaluate them)
train_data, test_data = df_1.randomSplit([0.8, 0.2], seed=42)

### ML methods

### Logistic Regression

Logistic regression is used to predict the probability that player 1 wins the match, based on factors such as the players' rankings, the court type, the round and other match characteristics.

I chose logistic regression because it is a simple and efficient algorithm for binary classification problems.
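
To make the idea concrete: logistic regression computes a weighted sum of the features and passes it through the sigmoid function, which yields a probability between 0 and 1 for the label 1 (here, a Player_1 win). The sketch below is a pure-Python illustration under stated assumptions: the two weights and the bias are made-up values, not the coefficients of the fitted Spark model, and only two features (`Rank_1`, `Rank_2`) are used.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of label 1 for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Made-up coefficients for (Rank_1, Rank_2); NOT the fitted model.
weights = [-0.01, 0.01]
bias = 0.0

# Player 1 ranked 10th, player 2 ranked 60th: z is about 0.5.
p_win = predict_proba([10, 60], weights, bias)
label = 1 if p_win >= 0.5 else 0
print(round(p_win, 4), label)
```

With equal rankings the weighted sum is zero and the sketch returns a probability of exactly 0.5, the decision boundary.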

In [None]:
# Logistic Regression
# The model is created, trained and used to make predictions on the test data

lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_data)
lr_preds = lr_model.transform(test_data)

### Random Forest

As with logistic regression, Random Forest is used to classify the outcome of a match as a win or a loss for player 1.

I chose Random Forest for its ability to handle data with many features efficiently.

In [None]:
# Random Forest
# The model is created, trained and used to make predictions on the test data

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
rf_model = rf.fit(train_data)
rf_preds = rf_model.transform(test_data)
rf_preds = rf_preds.withColumn("prediction", col("prediction").cast("double"))

### Decision Tree

As with the previous models, the goal is to predict the outcome of an ATP match.

I chose Decision Tree because training is fast and scales well even to large volumes of data.

In [None]:
# Decision Tree
# The model is created, trained and used to make predictions on the test data

dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5)
dt_model = dt.fit(train_data)
dt_preds = dt_model.transform(test_data)

### Comparing the accuracy, precision and F1 score of Logistic Regression, Decision Tree and Random Forest

In [None]:
# Print the accuracy, F1 score and precision of each model
# Note: in Spark, "f1" is the weighted F1 score, and "precisionByLabel" reports
# precision for a single label (metricLabel, which defaults to 0.0)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

print("Accuracy Logistic Regression:", evaluator.evaluate(lr_preds))
print("Accuracy Random Forest:", evaluator.evaluate(rf_preds))
print("Accuracy Decision Tree:", evaluator.evaluate(dt_preds))

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
print("F1 Score Logistic Regression:", evaluator.evaluate(lr_preds))
print("F1 Score Random Forest:", evaluator.evaluate(rf_preds))
print("F1 Score Decision Tree:", evaluator.evaluate(dt_preds))

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="precisionByLabel")
print("Precision Logistic Regression:", evaluator.evaluate(lr_preds))
print("Precision Random Forest:", evaluator.evaluate(rf_preds))
print("Precision Decision Tree:", evaluator.evaluate(dt_preds))

Accuracy Logistic Regression: 0.6554550375009666
Accuracy Random Forest: 0.6548364648573417
Accuracy Decision Tree: 0.6489600247429057
F1 Score Logistic Regression: 0.6554529487544276
F1 Score Random Forest: 0.654692005529655
F1 Score Decision Tree: 0.6476521184006685
Precision Logistic Regression: 0.6518734643734644
Precision Random Forest: 0.6588709677419354
Precision Decision Tree: 0.6667840789010215
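
For reference, the three metrics can be computed by hand from true and predicted labels. The pure-Python sketch below is an illustration, not Spark's implementation: the helper names and the eight sample labels are assumptions made for the example. It computes accuracy, per-label precision, and a class-frequency-weighted F1, which is how I understand Spark's `f1` metric to be defined.

```python
def precision(label, y_true, y_pred):
    """Share of examples predicted as `label` that truly are `label`."""
    hits = [t == label for t, p in zip(y_true, y_pred) if p == label]
    return sum(hits) / len(hits) if hits else 0.0


def recall(label, y_true, y_pred):
    """Share of true `label` examples that were predicted as `label`."""
    hits = [p == label for t, p in zip(y_true, y_pred) if t == label]
    return sum(hits) / len(hits) if hits else 0.0


def f1(label, y_true, y_pred):
    p, r = precision(label, y_true, y_pred), recall(label, y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


def weighted_f1(y_true, y_pred):
    # Per-class F1, weighted by how often each class occurs in y_true.
    total = sum(f1(c, y_true, y_pred) * y_true.count(c) for c in set(y_true))
    return total / len(y_true)


# Illustrative labels only; not taken from the dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision(0, y_true, y_pred), precision(1, y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```

In this sample the precision for label 0 and for label 1 differ, which is why it matters which label a per-label precision figure refers to.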


### KMeans

KMeans was applied to discover hidden structures and patterns in the data. The algorithm groups matches (and, indirectly, players) by their characteristics and offers a useful perspective on the data.

I chose KMeans because it is an efficient clustering algorithm, useful for exploring patterns in the data.
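
As a rough illustration of what the algorithm does, the sketch below implements the two alternating steps of k-means (assign each point to its nearest center, then move each center to the mean of its points) in pure Python. It is a simplified sketch, not Spark's implementation: the function names, the fixed starting centers and the six `(Rank_1, Rank_2)` points are assumptions made for the example.

```python
import math


def closest(point, centers):
    """Index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    return centers


# Illustrative (Rank_1, Rank_2) pairs; not taken from the dataset.
points = [(5, 8), (10, 12), (7, 9), (200, 60), (220, 75), (210, 66)]
centers = kmeans(points, centers=[(0, 0), (250, 50)])
labels = [closest(p, centers) for p in points]
print(centers, labels)
```

Because the distance is computed on raw values, features with a large numeric range (such as rankings) weigh more heavily than small category indices unless the features are scaled first.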

In [None]:
# KMeans

# StringIndexer converts the categorical columns to numeric indices
categorical_cols = ["Surface", "Series", "Court"]
indexers = [StringIndexer(inputCol=col, outputCol=col + "_idx") for col in categorical_cols]

# Numerical columns
numerical_cols = ["Rank_1", "Rank_2", "Best of", "Round_num"]

# Build the indexing pipeline and apply it
pipeline = Pipeline(stages=indexers)
df_2 = pipeline.fit(df).transform(df)

# The indexed columns are combined with the existing numerical columns in one list
# VectorAssembler merges all of these columns into a single vector column named features

feature_cols = [col + "_idx" for col in categorical_cols] + numerical_cols
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_2_features = assembler.transform(df_2)

# Apply KMeans with k=5
kmeans = KMeans(featuresCol="features", k=5)
k_model = kmeans.fit(df_2_features)

# Show the cluster assigned to the first 10 matches
clusters = k_model.transform(df_2_features)
clusters.select("Player_1", "Player_2", "features", "prediction").show(10)

+--------------+------------+--------------------+----------+
|      Player_1|    Player_2|            features|prediction|
+--------------+------------+--------------------+----------+
|    kroslak j.|  jonsson f.|[0.0,2.0,0.0,103....|         4|
| gustafsson m.|johansson t.|[0.0,2.0,0.0,60.0...|         1|
|     dupuis a.|rodriguez m.|[0.0,1.0,0.0,98.0...|         4|
|       haas t.| saulnier c.|[0.0,1.0,0.0,12.0...|         4|
|    sluiter r.| gaudenzi a.|[0.0,1.0,0.0,166....|         0|
|    ullyett k.|   tabara m.|[0.0,1.0,0.0,182....|         0|
|    russell m.|  armando h.|[1.0,2.0,0.0,165....|         4|
|      pavel a.| medvedev a.|[1.0,2.0,0.0,43.0...|         1|
|   medvedev a.|   norman m.|[1.0,1.0,0.0,20.0...|         1|
|di pasquale a.|  sluiter r.|[1.0,2.0,0.0,53.0...|         4|
+--------------+------------+--------------------+----------+
only showing top 10 rows



In [None]:
# Show how the data is distributed across the 5 clusters
clusters.groupBy("prediction").count().orderBy("prediction").show()

+----------+-----+
|prediction|count|
+----------+-----+
|         0| 5940|
|         1|49136|
|         2|  627|
|         3|  797|
|         4| 8917|
+----------+-----+



In [None]:
# Show the cluster centers
centers = k_model.clusterCenters()
for i, center in enumerate(centers):
    print(f"Cluster {i} center: {center}")

Cluster 0 center: [6.84361792e-01 1.63516484e+00 1.66187658e-01 2.28597633e+02
 6.88696534e+01 3.38038884e+00 1.66745562e+00]
Cluster 1 center: [ 0.60002445  2.22698836  0.18272777 50.08499582 46.05345098  3.38029059
  2.49401911]
Cluster 2 center: [6.76800e-01 1.78880e+00 1.40800e-01 7.83120e+02 8.85248e+01 3.23040e+00
 1.46240e+00]
Cluster 3 center: [6.700000e-01 1.608750e+00 1.662500e-01 8.847625e+01 7.033150e+02
 3.215000e+00 1.437500e+00]
Cluster 4 center: [6.70924034e-01 1.73000888e+00 1.64704576e-01 6.55980675e+01
 1.83620280e+02 3.38427366e+00 1.70457574e+00]


In [None]:
# Show how many matches on each surface fall into each cluster
clusters.groupBy("prediction", "Surface").count().orderBy("prediction").show()

# Show how many matches from each tournament series fall into each cluster
clusters.groupBy("prediction", "Series").count().orderBy("prediction").show()

# Show the players' average rankings for each cluster and tournament round
clusters.groupBy("prediction", "Round_num").avg("Rank_1", "Rank_2").orderBy("prediction").show()

+----------+-------+-----+
|prediction|Surface|count|
+----------+-------+-----+
|         0|   hard| 2938|
|         0| carpet|  113|
|         0|   clay| 2053|
|         0|  grass|  836|
|         1|   hard|27229|
|         1|   clay|15651|
|         1| carpet| 1315|
|         1|  grass| 4941|
|         2|  grass|   90|
|         2|   hard|  319|
|         2| carpet|   13|
|         2|   clay|  205|
|         3|  grass|   96|
|         3|   hard|  393|
|         3| carpet|   18|
|         3|   clay|  290|
|         4|  grass| 1194|
|         4|   hard| 4466|
|         4| carpet|  173|
|         4|   clay| 3084|
+----------+-------+-----+

+----------+------------------+-----+
|prediction|            Series|count|
+----------+------------------+-----+
|         0|           masters|  140|
|         0|        grand slam| 1126|
|         0|            atp500|  386|
|         0|            atp250| 2093|
|         0|     international| 1426|
|         0|international gold|  394|
|        

In [None]:
# Evaluate the clusters
evaluator = ClusteringEvaluator(predictionCol="prediction", featuresCol="features", metricName="silhouette")
silhouette = evaluator.evaluate(clusters)
print(f"Silhouette score: {silhouette:.4f}")

Silhouette score: 0.6921
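
The silhouette score compares, for each point, the mean distance to the other points in its own cluster (`a`) with the mean distance to the nearest other cluster (`b`), as `(b - a) / max(a, b)`; values close to 1 indicate compact, well-separated clusters. The sketch below is a direct pairwise version of that definition in pure Python, using squared Euclidean distance (which, to my knowledge, is the default distance measure of Spark's `ClusteringEvaluator`). Spark uses an optimized formulation, so treat this as an illustration; the function names and sample points are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(sq_dist(p, q))
        mean_dist = {c: sum(d) / len(d) for c, d in by_cluster.items()}
        a = mean_dist.pop(labels[i], None)  # cohesion: own cluster
        if a is None or not mean_dist:
            scores.append(0.0)  # singleton cluster, or only one cluster
            continue
        b = min(mean_dist.values())  # separation: nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative points only; two tight groups far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the groups
print(round(good, 4), round(bad, 4))
```

A labeling that follows the natural groups scores close to 1, while one that cuts across them scores below 0.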


In [None]:
# Show the cluster assignments
clusters.select("Player_1", "Player_2", "Surface", "Round_num", "Rank_1", "Rank_2", "prediction").show(20)

+--------------+-------------+-------+---------+------+------+----------+
|      Player_1|     Player_2|Surface|Round_num|Rank_1|Rank_2|prediction|
+--------------+-------------+-------+---------+------+------+----------+
|    kroslak j.|   jonsson f.|   hard|        1|   103|   127|         4|
| gustafsson m.| johansson t.|   hard|        2|    60|    43|         1|
|     dupuis a.| rodriguez m.|   hard|        1|    98|   119|         4|
|       haas t.|  saulnier c.|   hard|        1|    12|   171|         4|
|    sluiter r.|  gaudenzi a.|   hard|        1|   166|    88|         0|
|    ullyett k.|    tabara m.|   hard|        1|   182|   122|         0|
|    russell m.|   armando h.|   clay|        1|   165|   239|         4|
|      pavel a.|  medvedev a.|   clay|        5|    43|    21|         1|
|   medvedev a.|    norman m.|   clay|        4|    20|     3|         1|
|di pasquale a.|   sluiter r.|   clay|        2|    53|   142|         4|
| goellner m.k.| squillari f.|   clay|

In [None]:
# Add a label column to df: 1 if Player_1 won the match, 0 otherwise
df_3 = df.withColumn("label", when(df["Winner"] == df["Player_1"], 1).otherwise(0))

### Using a user-defined function

In [None]:
# Define a pandas UDF that computes the absolute difference between the two players' rankings
@pandas_udf("int")
def rank_diff(rank1: pd.Series, rank2: pd.Series) -> pd.Series:
    return (rank1 - rank2).abs()

In [None]:
# Add a new column containing the difference between the players' rankings
df_3 = df_3.withColumn("ranking_diff", rank_diff(df["Rank_1"], df["Rank_2"]))

In [None]:
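# Build the pipeline, fit it on the DataFrame df_3 and apply it
# Note: assembler still uses feature_cols as defined earlier, so ranking_diff is not part of the features vector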
pipeline = Pipeline(stages=indexers + [assembler])
df_3 = pipeline.fit(df_3).transform(df_3)

In [None]:
# Split the data into train_data (used to train the model) and test_data (used to evaluate it)
train_data, test_data = df_3.randomSplit([0.8, 0.2], seed=42)

### Hyperparameter optimization

I trained a logistic regression model and tested several hyperparameter combinations (with CrossValidator) to find the variant that gives the best results. I then applied that tuned model to the test data.
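
To get a sense of the cost of this search: a grid search tries every combination of the listed values, and k-fold cross-validation fits each combination once per fold. The sketch below enumerates the grid used in the next cell with `itertools.product`; the dictionary layout and variable names are illustrative and this is not how `ParamGridBuilder` is implemented.

```python
from itertools import product

# Values taken from the grid defined in the notebook cell below.
param_grid = {
    "regParam": [0.01, 0.1, 0.5],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [50, 100],
}
num_folds = 5

# One dict per combination, e.g. {"regParam": 0.01, "elasticNetParam": 0.0, "maxIter": 50}.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
fits_during_search = len(combinations) * num_folds
print(len(combinations), fits_during_search)
```

So the search evaluates 18 combinations with 90 model fits in total, before the best combination is refit on the full training set.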

In [None]:
# Tuned Logistic Regression

# Models are compared using the AUC metric
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")

# Initialize the model
lr_optimized = LogisticRegression(featuresCol="features", labelCol="label")

# Build the hyperparameter grid
paramGrid_lr = ParamGridBuilder() \
    .addGrid(lr_optimized.regParam, [0.01, 0.1, 0.5]) \
    .addGrid(lr_optimized.elasticNetParam, [0.0, 0.5, 1.0]) \
    .addGrid(lr_optimized.maxIter, [50, 100]) \
    .build()

# cv_lr tests every hyperparameter combination
cv_lr = CrossValidator(estimator=lr_optimized,
                       estimatorParamMaps=paramGrid_lr,
                       evaluator=evaluator,
                       numFolds=5)

# Train the model using cv_lr
cv_model_lr = cv_lr.fit(train_data)

# Print the best hyperparameters found (public getters instead of the private _java_obj)
print("Best regParam:", cv_model_lr.bestModel.getRegParam())
print("Best elasticNetParam:", cv_model_lr.bestModel.getElasticNetParam())
print("Best maxIter:", cv_model_lr.bestModel.getMaxIter())

Best regParam: 0.01
Best elasticNetParam: 0.5
Best maxIter: 100


In [None]:
# Apply the best logistic regression model found
lr_optimized_pred = cv_model_lr.transform(test_data)

In [None]:
# Show a sample of predictions
lr_optimized_pred.select("features", "label", "rawPrediction", "probability", "prediction").show(5)

+--------------------+-----+--------------------+--------------------+----------+
|            features|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[0.0,2.0,0.0,79.0...|    1|[-0.0648398370869...|[0.48379571750691...|       1.0|
|[0.0,2.0,0.0,112....|    1|[-0.5306186781115...|[0.37037260274845...|       1.0|
|[0.0,2.0,0.0,67.0...|    0|[-0.2999936512268...|[0.42555903519944...|       1.0|
|[0.0,2.0,0.0,79.0...|    0|[0.06588896552386...|[0.51646628464529...|       0.0|
|[0.0,2.0,0.0,88.0...|    0|[0.28624959623819...|[0.57107772508877...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows



In [None]:
# Print the accuracy, F1 score and precision of the model
predictionAndLabels = lr_optimized_pred.select("prediction", "label").rdd.map(tuple)

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

print("Accuracy Logistic Regression:", evaluator.evaluate(lr_optimized_pred))

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
print("F1 Score Logistic Regression:", evaluator.evaluate(lr_optimized_pred))

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="precisionByLabel")
print("Precision Logistic Regression:", evaluator.evaluate(lr_optimized_pred))

Accuracy Logistic Regression: 0.6552230727596072
F1 Score Logistic Regression: 0.655221168123625
Precision Logistic Regression: 0.6516664106896022


### Applying a DL method - Neural Network

I used a neural network to predict the winner of a tennis match from historical data.

I chose a neural network because this family of models can achieve high classification performance.
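
The network built further below is small: it takes the 7 columns in `feature_cols` (`Surface_idx`, `Series_idx`, `Court_idx`, `Rank_1`, `Rank_2`, `Best of`, `Round_num`) and passes them through Dense layers of 64, 32 and 1 units. As a quick sanity check, the sketch below counts its trainable parameters, using the fact that a Dense layer has one weight per input-output pair plus one bias per output. The helper `dense_params` is an illustrative name, not a Keras function.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out


# Input width followed by the three Dense layer widths.
layer_sizes = [7, 64, 32, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

These figures can be compared with what `model.summary()` reports once the model has been built.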

In [None]:
# Add a label column to df: 1 if Player_1 won the match, 0 otherwise
df_dl = df.withColumn("label", when(df["Winner"] == df["Player_1"], 1).otherwise(0))

In [None]:
# Build the pipeline, fit it on the DataFrame df_dl and apply it
pipeline = Pipeline(stages=indexers + [assembler])
df_dl = pipeline.fit(df_dl).transform(df_dl)

In [None]:
# Convert the PySpark DataFrame to a pandas DataFrame
pandas_df = df_dl.toPandas()

In [None]:
# Create the target variable used by the deep learning model
pandas_df['Player_1_Wins'] = (pandas_df['Winner'] == pandas_df['Player_1']).astype(int)

# X holds the numeric values of the columns listed in feature_cols
X = pandas_df[feature_cols].values

# y is the target variable, converted to an array for the model
y = pandas_df['Player_1_Wins'].values

In [None]:
# Split the data into a training set (used to train the model) and a test set (used to evaluate it)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Build a Keras Sequential model with 3 Dense layers
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),  # input size is the number of columns in X_train
    layers.Dense(64, activation='relu'),      # hidden layer with 64 neurons and ReLU activation
    layers.Dense(32, activation='relu'),      # second hidden layer with 32 neurons
    layers.Dense(1, activation='sigmoid')     # output layer with 1 neuron and sigmoid activation
])

# Compile the model with the Adam optimizer, binary_crossentropy loss and accuracy as the metric
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train the model for 10 epochs
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Compute and print the loss and accuracy on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

# Generate the predicted probabilities for each test example
predictions = model.predict(X_test)

Epoch 1/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.6214 - loss: 0.7573 - val_accuracy: 0.6108 - val_loss: 0.6663
Epoch 2/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.6217 - loss: 0.6884 - val_accuracy: 0.6603 - val_loss: 0.6194
Epoch 3/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.6317 - loss: 0.6565 - val_accuracy: 0.6504 - val_loss: 0.6444
Epoch 4/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 6s 2ms/step - accuracy: 0.6406 - loss: 0.6447 - val_accuracy: 0.6639 - val_loss: 0.6197
Epoch 5/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 5s 2ms/step - accuracy: 0.6436 - loss: 0.6343 - val_accuracy: 0.5843 - val_loss: 0.7305
Epoch 6/10
1472/1472 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.6429 - loss: 0.6362 - val_accuracy: 0.6559 - val_loss: 0.6237
Epoch 7/10
[1m1