# Architecture Structure

![Architecture](./images/BDENG-Project-Architecture.png)

## Data Sources

The following data sources are used in our project:

* CSV dataset on NBA players since 1950 - [Link](https://www.kaggle.com/datasets/drgilermo/nba-players-stats?select=Seasons_Stats.csv)
* NBA Stats website - [Link](https://www.nba.com/stats/players/traditional?PerMode=PerGame)
* NBA API "Ball Don't Lie" - [Link](https://www.balldontlie.io/home.html#introduction)

To process the data sources, we use Kafka as the message broker and Spark, which reads the processed dataset from Kafka. We work on a Jupyter Lab server, which provides Spark, Kafka, and the database instance we chose. GitHub serves as our version control platform, so that the project members can work on the project in parallel and stay up to date.

# NBA Analysis: A Comparison of Michael Jordan, Kobe Bryant and LeBron James at Their Prime

## Introduction

Our project focuses on the age-old discussion: Who is the greatest basketball player of all time? By examining the individual statistics of Michael Jordan, Kobe Bryant and LeBron James in their arguable prime years, we seek to gain insights and compare their achievements on the basketball court. Furthermore, we organised this data science project around standardised tools and multiple data sources. The challenge is to combine all of these and derive valuable insights from the collected and processed data.

Our team consists of the following members:

- Maria Mirnic
- Kevin Xhunga
- Safwan Zullash

## Data Sources

We used three different data sources:

- "Ball Don't Lie" API - [Link](https://www.balldontlie.io/home.html#introduction)
- CSV data from Kaggle - [Link](https://www.kaggle.com/datasets/drgilermo/nba-players-stats?select=Seasons_Stats.csv)
- NBA Stats website - [Link](https://www.nba.com/stats/players/traditional?PerMode=PerGame)

API Data:

The API returned its data as JSON responses, for example:

![API Data](./images/APIData.png)

This was our main data source: the API tracked and updated the data regularly and already provided most of the data we wanted in a well-formatted structure that was easy to use and clean. We sent requests for all players who played during the prime years of the three players in focus and processed the responses.
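
As a minimal sketch of how such responses could be collected, the snippet below walks through a paginated JSON API. The response shape (a `data` list plus a `meta.next_page` field) and the record fields are assumptions made for illustration, and `fetch_page` is a hypothetical callable standing in for the actual HTTP request, so the example runs without network access.

```python
from typing import Callable, Iterator

# Assumed response shape: {"data": [...], "meta": {"next_page": int or None}}.
Page = dict


def iter_records(fetch_page: Callable[[int], Page]) -> Iterator[dict]:
    """Yield every record from a paginated JSON API, page by page."""
    page = 1
    while page is not None:
        payload = fetch_page(page)
        yield from payload.get("data", [])
        # A missing or null next_page ends the loop.
        page = payload.get("meta", {}).get("next_page")


# Hypothetical stand-in for the HTTP call, returning two canned pages.
def fake_fetch(page: int) -> Page:
    pages = {
        1: {"data": [{"player": "Player A", "pts": 30}], "meta": {"next_page": 2}},
        2: {"data": [{"player": "Player B", "pts": 25}], "meta": {"next_page": None}},
    }
    return pages[page]


records = list(iter_records(fake_fetch))
```

In a real run, `fetch_page` could wrap an HTTP client call that passes the page number as a query parameter and returns the decoded JSON body.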

Web Scraping:

Our second data source was the official website nba.com/stats. We gathered the data from this website using web scraping. The website looked like this:

![Website Data](./images/nbaStatsWebsite.png)
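
The following sketch illustrates one way to turn a scraped stats table into records, using only the Python standard library. It assumes the table HTML has already been retrieved; the `SAMPLE_HTML` snippet, its column names and its values are made up for illustration and do not reflect the real page markup.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical markup, not the actual structure of nba.com/stats.
SAMPLE_HTML = """
<table>
  <tr><th>PLAYER</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>30.1</td></tr>
  <tr><td>Player B</td><td>27.4</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *body = parser.rows
table = [dict(zip(header, row)) for row in body]
```

Each entry of `table` maps a column header to the cell text, which keeps the scraped rows in the same dictionary form as the JSON records from the API.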


CSV Dataset:

Our third and last data source was a CSV file available on Kaggle, a popular data science platform. We picked the CSV dataset with the most complete data, which also contained some additional stats that were missing from the other two sources. The data inside the file looked like this:

![CSV Data](./images/CSVData.png)
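
A minimal sketch of how the CSV rows for one player and a range of seasons could be selected with the standard `csv` module. The column names (`Year`, `Player`, `PTS`), the sample rows and the year range are assumptions for illustration; the real file has more columns.

```python
import csv
import io

# Hypothetical excerpt; the real file contains many more columns and rows.
SAMPLE_CSV = """Year,Player,PTS
1995,Player A,2000
1996,Player A,2400
1996,Player B,1800
"""


def season_rows(lines, player, first_year, last_year):
    """Return the rows of one player within an inclusive range of seasons."""
    reader = csv.DictReader(lines)
    return [
        row
        for row in reader
        if row["Player"] == player
        # int(float(...)) also tolerates year values written like "1996.0".
        and first_year <= int(float(row["Year"])) <= last_year
    ]


rows = season_rows(io.StringIO(SAMPLE_CSV), "Player A", 1996, 1998)
```

For the actual dataset, an open file handle can be passed instead of the `io.StringIO` object.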

## Project Architecture, Infrastructure and Environment

![Architecture](./images/BDENG-Project-Architecture.png)

### Architecture:

Our architecture follows a distributed and scalable approach to handle the data processing and analysis tasks. It consists of the following key components:

   - Data Sources: We utilize multiple data sources for a more complete analysis.

   - Kafka: Kafka acts as a central message broker in our architecture. It collects data from the different sources using Kafka producers and stores it in Kafka topics. This allows for decoupling between data producers and consumers, enabling efficient data ingestion and processing.

   - Spark: Spark serves as the core processing framework in our architecture. It reads the data from Kafka using Kafka consumers and performs various data manipulation and analysis tasks using Spark DataFrames. Spark's distributed computing capabilities ensure scalability and efficient processing of large-scale NBA season data.

   - Database: We utilize a MongoDB database to store the analyzed and transformed data.

   - Jupyter Lab: Our project environment includes the Jupyter Lab Server, which is provided by the FH. 
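
The decoupling between producers and consumers described above can be sketched as follows. The snippet assumes records are exchanged as UTF-8 encoded JSON, which is an illustrative choice and not necessarily the project's format. `InMemoryProducer` is a hypothetical test double and the topic name is made up; a real client such as kafka-python's `KafkaProducer` exposes `send()` and `flush()` calls of the same shape.

```python
import json


def publish(producer, topic, records):
    """Send each record to a topic as UTF-8 encoded JSON, then flush."""
    for record in records:
        producer.send(topic, value=json.dumps(record).encode("utf-8"))
    producer.flush()


def decode(message_value: bytes) -> dict:
    """Inverse of the encoding used in publish(); runs on the consumer side."""
    return json.loads(message_value.decode("utf-8"))


class InMemoryProducer:
    """Hypothetical test double mimicking the send()/flush() calls used above."""

    def __init__(self):
        self.topics = {}

    def send(self, topic, value=None):
        self.topics.setdefault(topic, []).append(value)

    def flush(self):
        pass  # nothing is buffered in this in-memory stand-in


producer = InMemoryProducer()
publish(producer, "nba-api-stats", [{"player": "Player A", "pts": 30}])

# The consumer only needs the topic name and the message format.
consumed = [decode(value) for value in producer.topics["nba-api-stats"]]
```

Because the consumer depends only on the topic and the message format, each data source can get its own producer without any change to the processing side.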

### Infrastructure:

Our infrastructure is designed to support a multi-user environment and enable efficient development and collaboration. It comprises the following components:

   - Jupyter Lab Server: We leverage a Jupyter Lab server hosted by the FH (Fachhochschule) to ensure a shared computing environment for the project team. This server provides access to Spark, Kafka, and the selected database instance.

   - Docker: We utilize Docker containers to encapsulate our project's dependencies, including Spark, Kafka, and other required libraries and tools. Docker allows for easy deployment and reproducibility of the project environment across different systems.

   - Git: We employ Git as our version control system to manage code, track changes, and enable collaborative development. It ensures that all project members have access to the latest code and can work concurrently on different project components.

### Environment:

Our project environment encompasses the necessary tools, technologies, and configurations to support seamless development, execution, and analysis of the NBA season data. It includes the following aspects:

   - Python: We use Python as the primary language for data processing, analysis, and visualization. Python offers a wide range of libraries and packages, including Pandas, PySpark (the Python API for Spark), and visualization libraries like Matplotlib and Plotly.

   - Spark Configuration: We configure Spark to leverage the distributed computing capabilities and performance optimizations based on the available system resources.

   - Development Workflow: We follow an iterative development workflow, where each team member works on their assigned tasks, utilizes Git for version control, and collaborates through regular code reviews and discussions.

   - Documentation: We maintain Jupyter notebooks as our primary documentation format. Each step of the project, including data collection, cleaning, analysis, and visualization, is documented in separate notebooks to ensure clear and concise documentation of the project's progress.
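
To illustrate the Spark configuration item above, here is a minimal sketch that derives a few settings from the available resources. The concrete values (4 cores, 8 GB, two shuffle partitions per core) are assumptions for illustration, not the project's actual configuration. `build_session` imports PySpark lazily, so the settings can be inspected even where PySpark is not installed.

```python
def spark_settings(cores: int, driver_memory_gb: int) -> dict:
    """Derive a small set of Spark settings from the available resources."""
    return {
        "spark.master": f"local[{cores}]",
        "spark.driver.memory": f"{driver_memory_gb}g",
        # Assumed rule of thumb for this sketch: two shuffle partitions per core.
        "spark.sql.shuffle.partitions": str(cores * 2),
    }


def build_session(app_name: str, settings: dict):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


settings = spark_settings(cores=4, driver_memory_gb=8)
```

Inside a notebook, the session could then be created with `build_session("nba-analysis", settings)`, where the application name is again a hypothetical choice.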

By establishing this architecture, infrastructure, and environment, we ensure an efficient, scalable, and collaborative approach to conducting the NBA season analysis project. It provides a solid foundation for data engineering and analysis tasks, enabling us to explore and compare the performances of Kobe Bryant, Michael Jordan, and LeBron James during their prime seasons in the NBA.