## Consume Web Scraping Data from Kafka with Spark

To consume the web scraping data, we use Spark to read directly from Kafka by subscribing to the corresponding topic.
Using the Spark Kafka package, we consume the data, clean it, and upload it to our MongoDB collection for the web scraping data.

### Required Imports 

In [13]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, round, split, udf
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from pymongo import MongoClient

## Create Spark Session

In [2]:
spark = (SparkSession
         .builder
         .appName('nbaConsumer')
         .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0")
         .getOrCreate())
sc = spark.sparkContext

## Create Schema for Spark Dataframe

In [3]:
schema = StructType([
    StructField("season", StringType(), True),
    StructField("player_name", StringType(), True),
    StructField("team_abbreviation", StringType(), True),
    StructField("age", StringType(), True),
    StructField("data", StructType([
        StructField("PLAYER_NAME", StringType(), True),
        StructField("TEAM_ABBREVIATION", StringType(), True),
        StructField("AGE", StringType(), True),
        StructField("GP", StringType(), True),
        StructField("W", StringType(), True),
        StructField("L", StringType(), True),
        StructField("MIN", StringType(), True),
        StructField("PTS", StringType(), True),
        StructField("FGM", StringType(), True),
        StructField("FGA", StringType(), True),
        StructField("FG_PCT", StringType(), True),
        StructField("FG3M", StringType(), True),
        StructField("FG3A", StringType(), True),
        StructField("FG3_PCT", StringType(), True),
        StructField("FTM", StringType(), True),
        StructField("FTA", StringType(), True),
        StructField("FT_PCT", StringType(), True),
        StructField("OREB", StringType(), True),
        StructField("DREB", StringType(), True),
        StructField("REB", StringType(), True),
        StructField("AST", StringType(), True),
        StructField("TOV", StringType(), True),
        StructField("STL", StringType(), True),
        StructField("BLK", StringType(), True),
        StructField("PF", StringType(), True),
        StructField("NBA_FANTASY_PTS", StringType(), True),
        StructField("DD2", StringType(), True),
        StructField("TD3", StringType(), True),
        StructField("PLUS_MINUS", StringType(), True)
    ]))
])

## Read Messages into Spark Dataframe 

In [4]:
df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "NBA-WEB-TOPIC") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
parsed_df = df.withColumn("parsed_value", from_json(col("value"), schema))
final_df = parsed_df.select("key", "parsed_value.*")
final_df = final_df.withColumnRenamed('season', 'SEASON')
final_df.show()

+----+------+-----------------+-----------------+---+--------------------+
| key|SEASON|      player_name|team_abbreviation|age|                data|
+----+------+-----------------+-----------------+---+--------------------+
|None|  1996|   Michael Jordan|              CHI| 34|{Michael Jordan, ...|
|None|  1996|      Karl Malone|              UTA| 33|{Karl Malone, UTA...|
|None|  1996|        Glen Rice|              CHH| 30|{Glen Rice, CHH, ...|
|None|  1996| Shaquille O'Neal|              LAL| 25|{Shaquille O'Neal...|
|None|  1996|   Mitch Richmond|              SAC| 32|{Mitch Richmond, ...|
|None|  1996| Latrell Sprewell|              GSW| 26|{Latrell Sprewell...|
|None|  1996|    Allen Iverson|              PHI| 22|{Allen Iverson, P...|
|None|  1996|  Hakeem Olajuwon|              HOU| 34|{Hakeem Olajuwon,...|
|None|  1996|    Patrick Ewing|              NYK| 34|{Patrick Ewing, N...|
|None|  1996|   LaPhonso Ellis|              DEN| 27|{LaPhonso Ellis, ...|
|None|  1996|     Kendall

## Data Cleaning

### Data Type Schema for Each Column in the Spark DataFrame

In [45]:
final_df.printSchema()

root
 |-- key: string (nullable = true)
 |-- SEASON: string (nullable = true)
 |-- player_name: string (nullable = true)
 |-- team_abbreviation: string (nullable = true)
 |-- age: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- PLAYER_NAME: string (nullable = true)
 |    |-- TEAM_ABBREVIATION: string (nullable = true)
 |    |-- AGE: string (nullable = true)
 |    |-- GP: string (nullable = true)
 |    |-- W: string (nullable = true)
 |    |-- L: string (nullable = true)
 |    |-- MIN: string (nullable = true)
 |    |-- PTS: string (nullable = true)
 |    |-- FGM: string (nullable = true)
 |    |-- FGA: string (nullable = true)
 |    |-- FG_PCT: string (nullable = true)
 |    |-- FG3M: string (nullable = true)
 |    |-- FG3A: string (nullable = true)
 |    |-- FG3_PCT: string (nullable = true)
 |    |-- FTM: string (nullable = true)
 |    |-- FTA: string (nullable = true)
 |    |-- FT_PCT: string (nullable = true)
 |    |-- OREB: string (nullable = true)
 |    |--

### Select data and define data types

In [7]:
cleaned_data = final_df.select(
    col('SEASON').cast('int'),
    final_df['data']['PLAYER_NAME'].alias('PLAYER_NAME'),
    final_df['data']['AGE'].cast('int').alias('AGE'),
    final_df['data']['W'].cast('int').alias('W'),
    final_df['data']['L'].cast('int').alias('L'),
    final_df['data']['DD2'].cast('double').alias('DD2'),
    final_df['data']['TD3'].cast('double').alias('TD3'),
    final_df['data']['PLUS_MINUS'].cast('double').alias('PLUS_MINUS')
)
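To make the parsing and casting steps above more concrete, the following plain-Python sketch mimics what happens to a single message: the JSON value is parsed, selected fields are pulled out of the nested `data` object, and the strings are converted to numbers. The sample message and the helper names (`to_int`, `to_float`, `flatten_message`) are hypothetical and only follow the field names of the schema; the values are made up. The helpers return `None` for unparseable input, which is meant to resemble how a Spark `cast` yields null under default (non-ANSI) settings.

```python
import json


def to_int(text):
    """Convert a string to int, returning None when it cannot be parsed."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def to_float(text):
    """Convert a string to float, returning None when it cannot be parsed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def flatten_message(value):
    """Parse one JSON message and lift selected nested stats to the top level."""
    msg = json.loads(value)
    stats = msg.get("data") or {}
    return {
        "SEASON": to_int(msg.get("season")),
        "PLAYER_NAME": stats.get("PLAYER_NAME"),
        "W": to_int(stats.get("W")),
        "L": to_int(stats.get("L")),
        "PLUS_MINUS": to_float(stats.get("PLUS_MINUS")),
    }


# Hypothetical message value: field names follow the schema, values are invented.
raw_value = json.dumps({
    "season": "1996",
    "player_name": "Example Player",
    "team_abbreviation": "XYZ",
    "age": "30",
    "data": {"PLAYER_NAME": "Example Player", "W": "50", "L": "32", "PLUS_MINUS": "4.5"},
})

flat = flatten_message(raw_value)
print(flat)
```
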

### Convert data into a dictionary format

In [8]:
records = cleaned_data.toPandas().to_dict('records')

In [9]:
print(records[0])

{'SEASON': 1996, 'PLAYER_NAME': 'Michael Jordan', 'AGE': 34, 'W': 69, 'L': 13, 'DD2': 9.0, 'TD3': 1.0, 'PLUS_MINUS': 10.0}
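One caveat with going through pandas: null values in numeric columns may come back as `NaN` rather than `None`, so a later `is None` check would not catch them. The sketch below shows one possible way to normalize the records before inserting them; `nan_to_none` is a hypothetical helper and the sample record is invented for illustration.

```python
import math


def nan_to_none(record):
    """Return a copy of the record with float NaN values replaced by None."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Invented record with a missing numeric value, as pandas might represent it.
sample = {"SEASON": 1996, "PLAYER_NAME": "Example Player", "AGE": float("nan")}
normalized = nan_to_none(sample)
print(normalized)
```

Applied to the whole list, this would be `records = [nan_to_none(r) for r in records]`.
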


### Create MongoDB Connection

In [6]:
client = MongoClient("mongodb://pt-n20.p4001.w3.cs.technikum-wien.at:4001")
db = client.nba_data  # database handle
collection = db.season_stats_web  # target collection for the web scraping data

### Iterate over the records and insert them into the MongoDB collection

In [12]:
for record in records:
    season = record.get('SEASON')
    player_name = record.get('PLAYER_NAME')
    if season is None or player_name is None:
        print('Skipping invalid record:', record)
        continue

    existing_doc = collection.find_one({
        'SEASON': season,
        'PLAYER_NAME': player_name,
    })
    if existing_doc is None:
        collection.insert_one(record)
        print(f"Inserted {player_name} - Season: {season}")
    else:
        print("Skipped")

Skipped
Skipped
...
Skipped
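
The loop above needs two round trips per record (`find_one`, then `insert_one`). As an alternative sketch, not the approach used in this notebook, the same "insert only if absent" behaviour can be expressed as a single upsert with `$setOnInsert`. The helper name `insert_if_absent` is hypothetical, and it assumes every record carries `SEASON` and `PLAYER_NAME`.

```python
def insert_if_absent(collection, record):
    """Insert the record unless a document with the same SEASON and PLAYER_NAME exists.

    Returns True if a new document was created, False if one already existed.
    """
    key = {"SEASON": record["SEASON"], "PLAYER_NAME": record["PLAYER_NAME"]}
    # $setOnInsert writes the fields only when the upsert creates a new document;
    # an existing document is left unchanged.
    result = collection.update_one(key, {"$setOnInsert": record}, upsert=True)
    return result.upserted_id is not None
```

With the `collection` object from the connection cell, the loop body would reduce to `inserted = insert_if_absent(collection, record)`. If several writers could run at the same time, a unique index on `SEASON` and `PLAYER_NAME` would additionally be needed to rule out duplicates.
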
