# NBA Analysis: A Comparison of Michael Jordan, Kobe Bryant and LeBron James at Their Prime

## Introduction

![LebronKobeMJ](./images/KobeMJLebron.png)

Our project addresses an age-old debate: who is the greatest basketball player of all time? By examining the individual statistics of Michael Jordan, Kobe Bryant and LeBron James in what are arguably their prime years, we compare their achievements on the basketball court. We also designed this data science project around standardised tools and multiple data sources, in line with Big Data requirements. The challenge is to combine these sources and derive valuable insight from the collected and processed data.

Our team consists of the following members:

- Maria Mirnic
- Kevin Xhunga
- Safwan Zullash

## Project Architecture and Environment

**Here should be the picture of the architecture**

### Architecture:

Our architecture follows a distributed and scalable approach to handle the data processing and analysis tasks. It consists of the following key components:

   - Data Sources: We utilize multiple data sources (REST-API, web data and CSV data) for more complete analysis.

   - Kafka: Kafka acts as our central message broker in our architecture. It collects data from the different sources using Kafka producers and stores it in Kafka topics. This allows for decoupling between data producers and consumers, enabling efficient data ingestion and processing.

   - Spark: Spark serves as the core processing framework in our architecture. It reads the data from Kafka using Kafka consumers and performs various data manipulation and analysis tasks using Spark DataFrames.
   
   - Pandas: We use Pandas for both data cleaning and visualisations. Pandas is one of the most popular Python libraries, and its extensibility and interoperability are key features. Spark and Pandas also offer similar data processing capabilities with a similar syntax that our team is accustomed to, which is a bonus.

   - Database: We use a MongoDB database to store the analyzed and transformed data. We chose a document-based NoSQL database because it is easy to work with, flexible and lightweight.

   - Jupyter Lab: Our project environment includes the Jupyter Lab server, which is provided by the FH. From here we execute the necessary code and maintain the project.

   - Git: We employ Git as our version control system to manage code, track changes, and enable collaborative development. It ensures that all project members have access to the latest code and can work concurrently on different project components.
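The hand-off from Kafka to Spark described in the component list can be sketched roughly as follows. This is a minimal illustration rather than the exact notebook code: the broker address and topic name are placeholders, and it assumes a SparkSession that was started with the `spark-sql-kafka` connector package available.

```python
def read_topic_as_strings(spark, topic, bootstrap_servers="localhost:9092"):
    """Batch-read a Kafka topic into a DataFrame with one JSON string per row.

    `spark` is an existing SparkSession. The topic name and broker address
    are placeholders, not the values used in the project.
    """
    return (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")
        .load()
        # Kafka delivers keys and values as binary, so cast the payload to text.
        .selectExpr("CAST(value AS STRING) AS json")
    )
```

The resulting `json` column can then be parsed into typed columns (for example with `pyspark.sql.functions.from_json` and an explicit schema) before any cleaning or analysis.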

## Data Sources

We used three different data sources:

- "Ball Don't Lie" API - [Link](https://www.balldontlie.io/home.html#introduction)
- CSV-Data from Kaggle - [Link](https://www.kaggle.com/datasets/drgilermo/nba-players-stats?select=Seasons_Stats.csv)
- NBA Stats Website - [Link](https://www.nba.com/stats/players/traditional?PerMode=PerGame)

API Data:

The API returned its data as JSON responses, structured like this:

![API Data](./images/APIData.png)

This was our main data source: the API is updated regularly and provided most of the data we wanted in a well-formatted structure that was easy to use and clean. We sent requests for all players who played during the prime years of the three players in focus and processed the responses.
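Collecting every page of a paginated JSON API can be sketched as below. This is an illustration under stated assumptions, not the project's request code: the `page` query parameter and the response shape `{"data": [...], "meta": {"next_page": ...}}` are assumed, and `base_url` is left as a parameter because the real endpoint should be taken from the API documentation.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, params, page):
    """Request one page of results and decode the JSON body.

    `base_url` and the `page` parameter are assumptions; check the API
    documentation for the real endpoint and paging scheme.
    """
    query = urllib.parse.urlencode({**params, "page": page}, doseq=True)
    with urllib.request.urlopen(f"{base_url}?{query}") as response:
        return json.load(response)


def collect_all(fetch, max_pages=1000):
    """Call `fetch(page)` until the response reports no further page.

    Assumes each response looks like
    {"data": [...], "meta": {"next_page": <int or None>}}.
    """
    records = []
    page = 1
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        body = fetch(page)
        records.extend(body.get("data", []))
        page = (body.get("meta") or {}).get("next_page")
        if not page:
            break
    return records
```

Passing the fetch function in as an argument keeps the paging logic testable without network access; in practice it would be something like `collect_all(lambda page: fetch_page(API_URL, query_params, page))`, where `API_URL` and `query_params` are hypothetical names.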

WebScraping:

Our second data source was the official website nba.com/stats. We gathered the data from this website using web scraping. The website looked like this:

![Website Data](./images/nbaStatsWebsite.png)


CSV Dataset:

Our third and last data source was a CSV file available on the popular data science website Kaggle. We picked the dataset with the most complete data, which also included some additional stats that were missing from the other two sources. The data inside the file looked like this:

![CSV Data](./images/CSVData.png)

## Kafka Integration

![Kafka](./images/Kafka.jpg)

In our project, we leveraged Kafka as a central data broker to streamline the handling of data from various sources. Specifically, we utilized Kafka to efficiently manage the API data and the data obtained through web scraping. By employing Kafka as the backbone of our data pipeline, we were able to establish a scalable and fault-tolerant architecture that facilitated the seamless flow of data throughout our project.

To organize the data flow, we created two Kafka topics: one for the API data and another for the web scraping data. The topics served as logical channels where the data producers could publish the relevant information. This allowed us to decouple the data sources from the consumers, enabling asynchronous processing and ensuring data availability for downstream tasks.

To push data to the Kafka topics, we implemented Kafka producers. These producers were responsible for fetching data from the respective sources and publishing it to the relevant topics in JSON format. For the API data, the producer requested the data from the API endpoints and pushed it to the corresponding Kafka topic. For web scraping, the producer retrieved the scraped data, saved it to CSV files, read those files back in and then published the rows to the designated topic. This approach enabled us to collect and consolidate the data from multiple sources in a standardized and efficient manner.

Here you can see the two Kafka Producers we wrote:

- API Data Producer - [Notebook Path]()
- WebScraping Data Producer - [Notebook Path]()
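The notebooks above contain the actual producers. As a rough illustration of the pattern only, the sketch below reads rows from a CSV file and publishes each one as a JSON message. It assumes the `kafka-python` package; the broker address, topic name and file name are placeholders.

```python
import csv
import json


def rows_from_csv(path):
    """Read a CSV file of scraped rows into a list of dicts (one per row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def publish_records(producer, topic, records):
    """Send each record to `topic` as UTF-8 encoded JSON; return the count.

    `producer` only needs `send(topic, value)` and `flush()`, which matches
    the interface of kafka-python's KafkaProducer.
    """
    count = 0
    for record in records:
        producer.send(topic, json.dumps(record).encode("utf-8"))
        count += 1
    producer.flush()  # block until buffered messages have been delivered
    return count


def make_producer(bootstrap_servers="localhost:9092"):
    """Create a real producer; assumes kafka-python and a reachable broker."""
    from kafka import KafkaProducer  # imported lazily so the helpers above work without it

    return KafkaProducer(bootstrap_servers=bootstrap_servers)
```

A call such as `publish_records(make_producer(), "nba_web", rows_from_csv("scraped_stats.csv"))` would then push one message per scraped row; the same `publish_records` helper works for records returned by the API.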

## Data Cleaning

Since we took the ETL approach, we cleaned our data before we uploaded it to our database.

The notebooks responsible for the individual data cleaning processes are linked here:

- API Data Cleaning - [Notebook Path]()
- Web Data Cleaning - [Notebook Path]()
- CSV Data Cleaning - [Notebook Path]()

![SparkPandas](./images/SparkPandas.png)

For the API data, we consumed the data via Spark directly from Kafka and read it into a Spark DataFrame. For the web scraping data, we read the data into a Pandas DataFrame and created a Spark DataFrame from it. For the static CSV data from Kaggle, we read the data, cleaned it and uploaded it to MongoDB.

We used different methods overall, but most commonly we relied on the versatile capabilities of Spark DataFrames, sometimes with Pandas as an intermediate step. Most of the work consisted of changing types, especially in the CSV data, and converting certain values into numeric form. One example is the statistic "Minutes Played", which mostly came as a string such as "30:26". We also checked for null values and duplicate data. All in all, this was comfortable to do with the power and flexibility of Spark.

One of the challenges we faced was figuring out which year or season the data actually referred to, because the datasets interpreted a value like "1995" differently: in one dataset it meant the season 1994-1995, in another it meant 1995-1996. In these cases we had to do some research and use our own domain knowledge to resolve the ambiguity.
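The two conversions mentioned above, "Minutes Played" strings and ambiguous season years, can be sketched as plain Python helpers. This is an illustrative sketch with hypothetical function names, not the exact code from the cleaning notebooks; which convention each dataset follows still has to be determined by hand.

```python
def minutes_to_float(value):
    """Convert a "MM:SS" string such as "30:26" to minutes as a float.

    Returns None for missing or malformed values so they show up as nulls.
    """
    if value is None:
        return None
    text = str(value).strip()
    if not text:
        return None
    minutes, _, seconds = text.partition(":")
    try:
        return int(minutes) + (int(seconds) / 60 if seconds else 0.0)
    except ValueError:
        return None


def season_label(year, year_is_end):
    """Return an unambiguous label such as "1994-1995" for a single-year value.

    `year_is_end` encodes the dataset's convention: True if 1995 means the
    season ending in 1995, False if it means the season starting in 1995.
    """
    start = year - 1 if year_is_end else year
    return f"{start}-{start + 1}"
```

Helpers like these can be applied to a Pandas column with `Series.map` or wrapped as a Spark UDF. Normalising every source to the same season label makes it possible to compare rows across the three datasets.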

## ETL Output

![ETL](./images/ETL.png)

After the data was extracted and transformed, we stored it in MongoDB, a popular NoSQL database. MongoDB provided us with flexibility and scalability, allowing us to easily handle and query large volumes of data. We created three separate collections within MongoDB to store the different types of data: one for the API data, one for the web scraping data, and another for the CSV data obtained from Kaggle.

![MongoDB](./images/mongoDB.png)
![MongoCollections](./images/MongoCollections.png)
![CleanData](./images/CleanData.png)

To interact with MongoDB, we utilized the PyMongo library, which provided us with a Python interface to connect to the MongoDB database and perform operations. PyMongo allowed us to seamlessly insert the transformed data into the respective collections.

![PyMongo](./images/PyMongo.jpg)
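Loading cleaned records into a collection with PyMongo can be sketched as follows. This is a minimal sketch, assuming the `pymongo` package and a reachable MongoDB instance; the connection URI, database name and collection name are placeholders rather than the project's actual values.

```python
def store_records(collection, records):
    """Insert cleaned records into a MongoDB collection; return the count.

    `collection` is expected to behave like a pymongo Collection. Records are
    copied first because insert_many adds an `_id` field to the documents it
    receives, and empty input is skipped because insert_many rejects an
    empty list.
    """
    documents = [dict(record) for record in records]
    if not documents:
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


def get_collection(uri, database, name):
    """Open a collection; assumes pymongo is installed (imported lazily)."""
    from pymongo import MongoClient

    return MongoClient(uri)[database][name]
```

For example, `store_records(get_collection("mongodb://localhost:27017", "nba", "api_data"), cleaned_rows)` would write a list of dicts called `cleaned_rows` (a hypothetical name) into one of the three collections.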

## Data Analysis with Spark

## Visualisations and Analysis

## Conclusion