# API Data

## Consuming API Data

To receive the data we collected from the API, we use a Kafka consumer:

In [138]:
from kafka import KafkaConsumer
import json
import pandas as pd

In [32]:
# Initialize Kafka Consumer
bootstrap_servers = 'kafka:9092'
topic_name = 'nba-API-data'
consumer = KafkaConsumer(topic_name, bootstrap_servers=bootstrap_servers, value_deserializer=lambda x: json.loads(x.decode('utf-8')))
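
The `value_deserializer` above turns each raw Kafka message (bytes) into a Python object. As a minimal sketch of what that lambda does on its own, the snippet below applies it to a hypothetical payload; the byte string is made up for illustration and only resembles one row of the season statistics.

```python
import json

# Same deserializer as passed to KafkaConsumer above: bytes -> str -> Python object
deserialize = lambda raw: json.loads(raw.decode('utf-8'))

# Hypothetical payload (not taken from the real topic)
raw_message = b'{"full_name": "MarShon Brooks", "season": 2011, "pts": 12.64}'

record = deserialize(raw_message)
print(type(record).__name__, record["season"])
```

Because every message arrives already decoded as a dictionary, the later steps can work with `message.value` directly.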

For easier data handling we collect the messages in a list and later build a pandas DataFrame from it:

In [33]:
message_list = []

In [None]:
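# Note: without consumer_timeout_ms the consumer iterates indefinitely,
# so this loop only ends when the cell is interrupted
for message in consumer:
    data = message.value
    message_list.append(data)
print("Exited")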

In [35]:
consumer.close()
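
The consuming loop above has no stopping condition of its own. One option in kafka-python is the `consumer_timeout_ms` constructor argument, which ends iteration once no message has arrived for that long. Another is to cap the number of messages, as in the sketch below. The helper `collect_messages` and the fake consumer are hypothetical names used only for illustration; the helper works on any iterable whose items have a `.value` attribute.

```python
from types import SimpleNamespace


def collect_messages(consumer, max_messages=None):
    """Collect message values until the iterable ends or max_messages (a positive int) is reached."""
    collected = []
    for message in consumer:
        collected.append(message.value)
        if max_messages is not None and len(collected) >= max_messages:
            break
    return collected


# Stand-in for a KafkaConsumer: a plain list of objects with a .value attribute
fake_consumer = [
    SimpleNamespace(value={"season": 2011}),
    SimpleNamespace(value={"season": 2012}),
    SimpleNamespace(value={"season": 2013}),
]

first_two = collect_messages(fake_consumer, max_messages=2)
everything = collect_messages(fake_consumer)
print(len(first_two), len(everything))
```

With a real consumer, the same helper could be called as `collect_messages(consumer, max_messages=...)` in place of the open-ended loop.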

In [36]:
pandas_df = pd.DataFrame(message_list)

In [37]:
pandas_df.shape

(3923, 24)
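
As a small illustration of how a list of dictionaries becomes a DataFrame, the sketch below uses three made-up records (placeholder names, not real rows). Each distinct key becomes a column, and a key missing from a record shows up as `NaN`.

```python
import pandas as pd

# Made-up records; the third one has no "pts" key
sample_messages = [
    {"full_name": "Player A", "season": 2011, "pts": 12.64},
    {"full_name": "Player B", "season": 2012, "pts": 5.4},
    {"full_name": "Player C", "season": 2013},
]

sample_df = pd.DataFrame(sample_messages)
print(sample_df.shape)
print(sample_df)
```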

In [38]:
pandas_df.head()

| | _id | player_id | full_name | season | ast | blk | dreb | fg3_pct | fg3a | fg3m | ... | fta | ftm | games_played | min | oreb | pf | pts | reb | stl | turnover |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'$oid': '6491cda1e79aadf30ea4eddd'} | 67 | MarShon Brooks | 2011 | 2.34 | 0.27 | 2.32 | 0.313 | 2.68 | 0.84 | ... | 2.64 | 2.02 | 56 | 29:26 | 1.25 | 2.07 | 12.64 | 3.57 | 0.93 | 2.11 |
| 1 | {'$oid': '6491cda1e79aadf30ea4edde'} | 67 | MarShon Brooks | 2012 | 1.04 | 0.22 | 0.99 | 0.273 | 0.75 | 0.21 | ... | 1.29 | 0.95 | 73 | 12:31 | 0.44 | 1.27 | 5.4 | 1.42 | 0.47 | 0.95 |
| 2 | {'$oid': '6491cda2e79aadf30ea4eddf'} | 67 | MarShon Brooks | 2013 | 0.76 | 0.12 | 1.33 | 0.52 | 0.76 | 0.39 | ... | 1.33 | 0.97 | 33 | 9:37 | 0.3 | 0.64 | 4.82 | 1.64 | 0.42 | 0.73 |
| 3 | {'$oid': '6491cda5e79aadf30ea4ede0'} | 71 | Lorenzo Brown | 2013 | 1.71 | 0.13 | 0.83 | 0.1 | 1.25 | 0.13 | ... | 0.54 | 0.38 | 24 | 9:17 | 0.33 | 0.79 | 2.67 | 1.17 | 0.54 | 0.67 |
| 4 | {'$oid': '6491cda8e79aadf30ea4ede1'} | 90 | Omri Casspi | 2011 | 1.02 | 0.32 | 2.54 | 0.315 | 2.58 | 0.82 | ... | 1.66 | 1.14 | 65 | 20:39 | 0.97 | 1.77 | 7.06 | 3.51 | 0.57 | 0.98 |


## Data Cleaning with Spark

We use Spark to read and clean the consumed data:

In [82]:
from pyspark.sql import SparkSession

We create a SparkSession and convert the pandas DataFrame into a Spark DataFrame, so that the cleaning steps run on Spark's DataFrame API, which is designed to scale to larger datasets:

In [83]:
spark = SparkSession.builder \
    .appName("DataCleaningSpark") \
    .getOrCreate()

In [94]:
# Convert Pandas DataFrame to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)

We first get an overview of the schema of the data:

In [95]:
spark_df.printSchema()

root
 |-- _id: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- player_id: long (nullable = true)
 |-- full_name: string (nullable = true)
 |-- season: long (nullable = true)
 |-- ast: double (nullable = true)
 |-- blk: double (nullable = true)
 |-- dreb: double (nullable = true)
 |-- fg3_pct: double (nullable = true)
 |-- fg3a: double (nullable = true)
 |-- fg3m: double (nullable = true)
 |-- fg_pct: double (nullable = true)
 |-- fga: double (nullable = true)
 |-- fgm: double (nullable = true)
 |-- ft_pct: double (nullable = true)
 |-- fta: double (nullable = true)
 |-- ftm: double (nullable = true)
 |-- games_played: long (nullable = true)
 |-- min: string (nullable = true)
 |-- oreb: double (nullable = true)
 |-- pf: double (nullable = true)
 |-- pts: double (nullable = true)
 |-- reb: double (nullable = true)
 |-- stl: double (nullable = true)
 |-- turnover: double (nullable = true)



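Then we drop the columns we do not need (`_id` and `player_id`, neither of which appears in the output below) and remove duplicate rows:

In [97]:
# Drop the MongoDB object id and the API player id
spark_df = spark_df.drop("_id", "player_id")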

In [98]:
spark_df = spark_df.dropDuplicates()
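
To see the effect of these two steps on a small scale, the sketch below repeats them with pandas on made-up rows. It is a pandas analogue for illustration only, not the Spark code itself; `drop(columns=...)` and `drop_duplicates()` play the roles of Spark's `drop` and `dropDuplicates`.

```python
import pandas as pd

# Made-up rows: the first two are identical season records
rows = pd.DataFrame({
    "player_id": [1, 1, 2],
    "full_name": ["Player A", "Player A", "Player B"],
    "season": [2011, 2011, 2012],
    "pts": [12.64, 12.64, 5.4],
})

# Remove the unneeded column, then keep one copy of each remaining row
cleaned = rows.drop(columns=["player_id"]).drop_duplicates().reset_index(drop=True)
print(cleaned)
```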

In [99]:
spark_df.show()

+----------------+------+----+----+----+-------+----+----+------+-----+----+------+----+----+------------+-----+----+----+-----+-----+----+--------+
|       full_name|season| ast| blk|dreb|fg3_pct|fg3a|fg3m|fg_pct|  fga| fgm|ft_pct| fta| ftm|games_played|  min|oreb|  pf|  pts|  reb| stl|turnover|
+----------------+------+----+----+----+-------+----+----+------+-----+----+------+----+----+------------+-----+----+----+-----+-----+----+--------+
|     DJ Stephens|  2013| 0.0| 0.0| 2.0|    0.0| 0.0| 0.0| 0.429|  3.5| 1.5|   1.0| 0.5| 0.5|           2| 7:30| 0.5| 0.0|  3.5|  2.5| 0.0|     0.0|
|   Brian Skinner|  2006|0.87|0.96|4.16|    0.0| 0.0| 0.0|  0.49| 3.78|1.85| 0.582|1.18|0.69|          67|22:39|1.58|2.91| 4.39| 5.75|0.27|    1.12|
|   Marcin Gortat|  2012|1.23|1.61|6.36|    0.0|0.05| 0.0| 0.521| 9.28|4.84| 0.652|2.26|1.48|          61|30:47| 2.1|2.08|11.15| 8.46|0.66|    1.62|
|      Mark Price|  1996|4.89|0.04|2.04|  0.396|4.04| 1.6| 0.447| 8.41|3.76| 0.906|2.44|2.21|          70|

The data looks clean now, but the `min` (minutes played) column is still a string in `MM:SS` format, so let's convert it into a numeric value (decimal minutes) for later analysis:

In [126]:
from pyspark.sql.functions import split, col, round as spark_round

In [127]:
# Split the 'min' column ("MM:SS") on ':' into an array column
split_cols = split(col('min'), ':')

In [128]:
# Extract the minutes part and the seconds part as integers
minutes = split_cols.getItem(0).cast('integer')
seconds = split_cols.getItem(1).cast('integer')

In [129]:
# Total minutes as a decimal: whole minutes plus the seconds converted to minutes
total_minutes = minutes + seconds / 60

In [130]:
# Round the total minutes to 1 decimal place
rounded_minutes = spark_round(total_minutes, 1)

In [131]:
# Overwrite the string column with the numeric value
spark_df = spark_df.withColumn('min', rounded_minutes)
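
The arithmetic behind this conversion can be checked without Spark. The helper below, `mmss_to_minutes`, is a hypothetical plain-Python sketch of the same idea: split on `:`, convert the seconds to a fraction of a minute, and round to one decimal place. It uses `Decimal` with half-up rounding so that values such as 22.65 are not affected by binary floating-point representation.

```python
from decimal import Decimal, ROUND_HALF_UP


def mmss_to_minutes(value):
    """Convert an 'MM:SS' string to decimal minutes, rounded half-up to 1 decimal place."""
    minutes_part, seconds_part = value.split(":")
    total_seconds = int(minutes_part) * 60 + int(seconds_part)
    minutes = Decimal(total_seconds) / Decimal(60)
    return float(minutes.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


for raw in ["7:30", "22:39", "30:47"]:
    print(raw, "->", mmss_to_minutes(raw))
```

For the three sample values this gives 7.5, 22.7 and 30.8, matching the `min` values shown for the same players in the Spark output.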

In [132]:
spark_df.show()

+----------------+------+----+----+----+-------+----+----+------+-----+----+------+----+----+------------+----+----+----+-----+-----+----+--------+
|       full_name|season| ast| blk|dreb|fg3_pct|fg3a|fg3m|fg_pct|  fga| fgm|ft_pct| fta| ftm|games_played| min|oreb|  pf|  pts|  reb| stl|turnover|
+----------------+------+----+----+----+-------+----+----+------+-----+----+------+----+----+------------+----+----+----+-----+-----+----+--------+
|     DJ Stephens|  2013| 0.0| 0.0| 2.0|    0.0| 0.0| 0.0| 0.429|  3.5| 1.5|   1.0| 0.5| 0.5|           2| 7.5| 0.5| 0.0|  3.5|  2.5| 0.0|     0.0|
|   Brian Skinner|  2006|0.87|0.96|4.16|    0.0| 0.0| 0.0|  0.49| 3.78|1.85| 0.582|1.18|0.69|          67|22.7|1.58|2.91| 4.39| 5.75|0.27|    1.12|
|   Marcin Gortat|  2012|1.23|1.61|6.36|    0.0|0.05| 0.0| 0.521| 9.28|4.84| 0.652|2.26|1.48|          61|30.8| 2.1|2.08|11.15| 8.46|0.66|    1.62|
|      Mark Price|  1996|4.89|0.04|2.04|  0.396|4.04| 1.6| 0.447| 8.41|3.76| 0.906|2.44|2.21|          70|26.8|0

## Uploading to MongoDB

Now that the data is cleaned, we convert the Spark DataFrame back into a pandas DataFrame and upload the data to the database:

In [133]:
import pymongo as mdb

We use the built-in `toPandas()` method:

In [134]:
mongo_df = spark_df.toPandas()

In [135]:
mongo_df

| | full_name | season | ast | blk | dreb | fg3_pct | fg3a | fg3m | fg_pct | fga | ... | fta | ftm | games_played | min | oreb | pf | pts | reb | stl | turnover |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DJ Stephens | 2013 | 0.00 | 0.00 | 2.00 | 0.000 | 0.00 | 0.00 | 0.429 | 3.50 | ... | 0.50 | 0.50 | 2 | 7.5 | 0.50 | 0.00 | 3.50 | 2.50 | 0.00 | 0.00 |
| 1 | Brian Skinner | 2006 | 0.87 | 0.96 | 4.16 | 0.000 | 0.00 | 0.00 | 0.490 | 3.78 | ... | 1.18 | 0.69 | 67 | 22.7 | 1.58 | 2.91 | 4.39 | 5.75 | 0.27 | 1.12 |
| 2 | Marcin Gortat | 2012 | 1.23 | 1.61 | 6.36 | 0.000 | 0.05 | 0.00 | 0.521 | 9.28 | ... | 2.26 | 1.48 | 61 | 30.8 | 2.10 | 2.08 | 11.15 | 8.46 | 0.66 | 1.62 |
| 3 | Mark Price | 1996 | 4.89 | 0.04 | 2.04 | 0.396 | 4.04 | 1.60 | 0.447 | 8.41 | ... | 2.44 | 2.21 | 70 | 26.8 | 0.51 | 1.43 | 11.33 | 2.56 | 0.96 | 2.30 |
| 4 | Nick Anderson | 1995 | 3.62 | 0.60 | 4.19 | 0.391 | 5.58 | 2.18 | 0.442 | 11.74 | ... | 3.12 | 2.16 | 77 | 35.3 | 1.19 | 1.75 | 14.73 | 5.39 | 1.57 | 1.83 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3894 | Henry Sims | 2013 | 1.13 | 0.43 | 2.89 | 0.000 | 0.02 | 0.00 | 0.474 | 5.96 | ... | 2.63 | 1.96 | 46 | 19.0 | 2.26 | 2.52 | 7.61 | 5.15 | 0.61 | 0.89 |
| 3895 | Jeff Green | 2012 | 1.58 | 0.84 | 3.25 | 0.385 | 2.25 | 0.86 | 0.467 | 9.95 | ... | 3.27 | 2.64 | 81 | 27.8 | 0.68 | 2.16 | 12.79 | 3.93 | 0.69 | 1.63 |
| 3896 | Ian Mahinmi | 2013 | 0.31 | 0.94 | 1.95 | 0.000 | 0.01 | 0.00 | 0.481 | 2.45 | ... | 1.88 | 1.17 | 77 | 16.3 | 1.39 | 2.69 | 3.53 | 3.34 | 0.53 | 0.75 |
| 3897 | Antonio McDyess | 1995 | 0.99 | 1.50 | 4.51 | 0.000 | 0.05 | 0.00 | 0.485 | 11.59 | ... | 3.20 | 2.18 | 76 | 30.0 | 3.01 | 3.29 | 13.42 | 7.53 | 0.71 | 2.03 |


In [139]:
spark.stop()

We initialise the MongoDB client and connect to the database:

In [136]:
# Initialize MongoDB client and database
client = mdb.MongoClient("mongodb://pt-n20.p4001.w3.cs.technikum-wien.at:4001")
db = client.nba_data
collection = db.season_stats_api

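Finally, we upload the cleaned data. For each record we look up the player and season, insert the record if it is new, update it if the stored document differs, and skip it otherwise: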

In [137]:
skipped_documents = []

for record in mongo_df.to_dict("records"):
    full_name = record["full_name"]
    season = record["season"]

    # Exclude MongoDB's generated _id so the stored document can be compared with the record
    existing_doc = collection.find_one({"full_name": full_name, "season": season}, {"_id": 0})

    if existing_doc is None:
        collection.insert_one(record)
        print(f"Inserted document for {full_name} - Season {season}")
    elif existing_doc != record:
        collection.update_one({"full_name": full_name, "season": season}, {"$set": record})
        print(f"Updated document for {full_name} - Season {season}")
    else:
        skipped_documents.append(record)
        print(f"Skipped document for {full_name} - Season {season}")

# Retry insertion/update for skipped documents
for record in skipped_documents:
    full_name = record["full_name"]
    season = record["season"]

    # Same lookup as above, again without the _id field
    existing_doc = collection.find_one({"full_name": full_name, "season": season}, {"_id": 0})

    if existing_doc is None:
        collection.insert_one(record)
        print(f"Inserted skipped document for {full_name} - Season {season}")
    elif existing_doc != record:
        collection.update_one({"full_name": full_name, "season": season}, {"$set": record})
        print(f"Updated skipped document for {full_name} - Season {season}")
    else:
        print(f"Skipped document already exists for {full_name} - Season {season}")

print("Upload complete")

Updated document for DJ Stephens - Season 2013
Updated document for Brian Skinner - Season 2006
Updated document for Marcin Gortat - Season 2012
Updated document for Mark Price - Season 1996
Updated document for Nick Anderson - Season 1995
Updated document for Dennis Scott - Season 1996
Updated document for Chris Webber - Season 1996
Updated document for Ed Stokes - Season 1997
Updated document for Trenton Hassell - Season 2007
Updated document for Boris Diaw - Season 2011
Updated document for Yaroslav Korolev - Season 2005
Updated document for Andrew Bogut - Season 2006
Updated document for Litterial Green - Season 1996
Updated document for Shawn Respert - Season 1995
Updated document for Elton Brand - Season 2006
Updated document for Manu Ginobili - Season 2012
Updated document for Tom Gugliotta - Season 1995
Updated document for LaPhonso Ellis - Season 1997
Updated document for Chauncey Billups - Season 2013
Updated document for Jarvis Hayes - Season 2007
Updated document for Amir J
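
The insert/update/skip decision in the upload loop can be isolated and tried without a database. The function `decide_action` below is a hypothetical sketch, not part of the notebook. It assumes the stored document may carry MongoDB's `_id` field, which the uploaded record does not have, and therefore ignores `_id` when comparing.

```python
def decide_action(existing_doc, record):
    """Return 'insert', 'update' or 'skip' for a record, given the stored document (or None)."""
    if existing_doc is None:
        return "insert"
    # Ignore the database-generated _id when comparing contents
    stored = {key: value for key, value in existing_doc.items() if key != "_id"}
    return "skip" if stored == record else "update"


# Made-up record and stored documents for illustration
record = {"full_name": "Player A", "season": 2011, "min": 29.4}
same_in_db = {"_id": "abc123", "full_name": "Player A", "season": 2011, "min": 29.4}
old_in_db = {"_id": "abc123", "full_name": "Player A", "season": 2011, "min": "29:26"}

print(decide_action(None, record))
print(decide_action(same_in_db, record))
print(decide_action(old_in_db, record))
```

If `_id` were left in the comparison, a stored document would never equal the record, so every existing player-season would be reported as updated.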