# Criminal Soccer
**Team:** Michael Gangl, Sebastian Grünewald, Patrick Leitner

## Topic <TBR>
The goal of the project is to load and process several soccer-related datasets and to visualize the results. The project is designed and evaluated from a Big Data perspective. For this purpose, relational and NoSQL databases are used, as well as a MapReduce algorithm. Furthermore, all services and metadata are hosted centrally, which makes the setup multi-user capable.

## Question <TBR>
With our diagrams we want to analyze whether the foul statistics of the players correlate with the crime statistics of their countries of origin.
Additionally, some other interesting and relevant graphs about the datasets are shown.
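
As a minimal sketch of the kind of analysis meant here, the snippet below computes a Pearson correlation coefficient between the average number of fouls per player of a country and that country's homicide rate. The records are made-up placeholders, not results from the project, and the helper name `pearson` is hypothetical. The field names `nationality`, `fouls_committed` and `count_p_100k` follow the collections used in this notebook.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Placeholder records (NOT real data), shaped like the `players` and `countries` documents.
players = [
    {"nationality": "A", "fouls_committed": 4},
    {"nationality": "A", "fouls_committed": 6},
    {"nationality": "B", "fouls_committed": 10},
    {"nationality": "C", "fouls_committed": 15},
]
countries = [
    {"country": "A", "count_p_100k": 1.0},
    {"country": "B", "count_p_100k": 2.0},
    {"country": "C", "count_p_100k": 3.0},
]

# Average fouls per player, grouped by nationality.
fouls_by_country = defaultdict(list)
for p in players:
    fouls_by_country[p["nationality"]].append(p["fouls_committed"])
avg_fouls = {c: sum(v) / len(v) for c, v in fouls_by_country.items()}

# Keep only countries present in both datasets, then correlate.
rate = {c["country"]: c["count_p_100k"] for c in countries}
common = sorted(set(avg_fouls) & set(rate))
r = pearson([avg_fouls[c] for c in common], [rate[c] for c in common])
print(f"Pearson r over {len(common)} countries: {r:.3f}")
```

A coefficient close to zero would suggest no linear relationship. With only a few countries per group, any value should be read with caution.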

## Architecture <TBD>
[GitHub](https://github.com/sebi-gr/fh_bdeng_criminal_soccer)

MongoDB: mongodb://pt-n20.p4001.w3.cs.technikum-wien.at:4001<br>
MongoDB Database: `criminalSoccer`<br>
MongoDB Collections: `raw_players`, `players`, `countries`<br>

Kafka: <>
Spark: <>

<TBD: Architecture Diagram>

## Data Collection
The following scripts load the data from various sources and store it in MongoDB.
### Datasets
[Kaggle UEFA (.csv-files)](https://www.kaggle.com/datasets/azminetoushikwasi/ucl-202122-uefa-champions-league): This dataset contains all the player stats of UEFA Champions League season 2021-22.

[Wikipedia National Crime Stats](https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate): Wikipedia provides a table of intentional homicide rates per 100,000 inhabitants.

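[World Bank API](http://api.worldbank.org/v2/country/all/indicators/NY.GDP.MKTP.CD): The World Bank API provides the most recent gross domestic product value (indicator `NY.GDP.MKTP.CD`) per country.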

[Transfermarkt](https://www.transfermarkt.com/schnellsuche/keinergebnis/schnellsuche?query=): To complete the player stats, we query the Transfermarkt quick search by player name and scrape the result page.

### Technical Notice
**Make sure Docker is up and running!**
Due to past problems with the MongoDB instance hosted by the FH, we implemented a fallback that starts a local MongoDB instance in Docker.
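
Whether the Docker daemon is reachable can be checked before running the cells below. This is a minimal sketch and not part of the original notebook: the helper name `docker_is_running` is hypothetical, and it assumes that `docker info` exits with a non-zero status when the daemon cannot be reached.

```python
import subprocess

def docker_is_running(command=("docker", "info"), timeout=10):
    """Return True if the command exits with status 0, False otherwise."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Docker CLI not installed or not executable, or no answer in time.
        return False
    return result.returncode == 0

if not docker_is_running():
    print("Docker does not seem to be running; the local MongoDB fallback will not work.")
```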

### Steps
1. Prepare a local fallback for MongoDB in a Docker container, in case the hosted MongoDB instance is unavailable.
2. Download the UEFA dataset from Kaggle and store the data in MongoDB.
3. Call the World Bank API and publish the responses to a Kafka topic.
4. Query the Transfermarkt search and scrape the resulting web page for additional player information.
5. Scrape Wikipedia to get crime statistics per country and store them in MongoDB.
6. Convert the data to proper data types and clean up the combined dataset.

#### Kaggle .csv dataset:

In [1]:
import pandas as pd
import zipfile
import os
import requests

# Importing necessary packages for mongodb connectivity
try:
    from pymongo import MongoClient
    from pymongo.errors import ServerSelectionTimeoutError
except ImportError:
    !pip install pymongo[srv]
    from pymongo import MongoClient
    from pymongo.errors import ServerSelectionTimeoutError

# Importing config from config.py
from config import MONGO_HOST_REMOTE, MONGO_DB_REMOTE, MONGO_HOST_LOCAL, MONGO_DB_LOCAL

# Defining constants for kaggle files
UEFA_ZIP = "kaggle_players_zip.zip"
UEFA_UNZIPPED = "kaggle_files"
UEFA_FILES = ["key_stats.csv", "disciplinary.csv", "distributon.csv", "defending.csv"] # only some .csv-files are interesting for our purposes
UEFA_RAW_DATA = "raw_players"

# Defining constants for MongoDB connection
conn_str = MONGO_HOST_REMOTE
mongoDB = MONGO_DB_REMOTE

class MongoContext:
    """mongodb client context manager"""
    def __init__(self):
        self.conn_str = MONGO_HOST_REMOTE
        self.mongoDB = MONGO_DB_REMOTE
        self.client = None

    def __enter__(self):
        try:
            # Connect to the MongoDB server hosted by FH Technikum
            self.client = MongoClient(self.conn_str)
            self.client.server_info()  # raises if the server is unreachable
            return self.client
        # If the remote server is unreachable, fall back to a local Docker instance
        except ServerSelectionTimeoutError as err:
            print("Remote Error: " + str(err))
            os.system("docker pull mongo")
            os.system("docker run -d -p 27017:27017 mongo:latest")
            self.conn_str = MONGO_HOST_LOCAL
            self.mongoDB = MONGO_DB_LOCAL
            try:
                # Try to connect to the local Docker MongoDB instance
                self.client = MongoClient(self.conn_str)
                self.client.server_info()
                return self.client
            except ServerSelectionTimeoutError as errLocal:
                print("Local Error: " + str(errLocal))
                raise  # without a connection the with-block cannot run

    def __exit__(self, exception_type, exception_value, exception_traceback):
        if self.client is not None:
            self.client.close()
            self.client = None

def unpack_zip(src, dest):
    """takes files in zip folder from src and extracts them to dest"""
    with zipfile.ZipFile(src, 'r') as zip_ref:
        zip_ref.extractall(dest)

def csv_to_mongo(folder, files, map_key):
    """Fetching data from interesting files in csv folder"""
    # kill existing collection if it exists:
    with MongoContext() as client:
        db = client[mongoDB]
        collection = db[UEFA_RAW_DATA]
        collection.drop()

        for idx, file in enumerate(files):
            df = pd.read_csv(f"{folder}/{file}")
            data = df.to_dict(orient='records')
            if idx == 0:
                # Insert data into mongo
                collection.insert_many(data)
            else:
                for row in data:
                    # Insert data into mongo
                    query = {map_key:  row[map_key]}
                    new_values = {"$set": row}
                    collection.update_one(query, new_values)

def read_from_mongo():
    """reading from mongo database and printing the collection"""
    with MongoContext() as client:
        db = client[mongoDB]
        collection = db[UEFA_RAW_DATA]

        data = collection.find()
        for x in data:
            print("==========================================================================")
            print(x)

def collect_from_kaggle():

    # guard, in case data is already in database.
    with MongoContext() as client:
        db = client[mongoDB]
        if UEFA_RAW_DATA in db.list_collection_names():
            print(f"{UEFA_RAW_DATA} is already in database")
            return False

    unpack_zip(UEFA_ZIP, UEFA_UNZIPPED)
    csv_to_mongo(UEFA_UNZIPPED, UEFA_FILES, "player_name")
    read_from_mongo()

collect_from_kaggle()

{'_id': ObjectId('648eddf3583fa9e84fb89e73'), 'player_name': 'Courtois', 'club': 'Real Madrid', 'position': 'Goalkeeper', 'minutes_played': 1230, 'match_played': 13, 'goals': 0, 'assists': 0, 'distance_covered': '64.2', 'cross_accuracy': 0, 'cross_attempted': 0, 'cross_complted': 0, 'freekicks_taken': 27, 'pass_accuracy': 76.7, 'pass_attempted': 483, 'pass_completed': 365, 'serial': 447}
{'_id': ObjectId('648eddf3583fa9e84fb89e

### WorldBank API with Kafka
To analyze the player stats against the gross domestic product of the players' countries of origin, we fetch the corresponding data from the World Bank API and publish it to a Kafka topic with a Kafka producer.

In [38]:
import requests
from kafka import KafkaProducer
import json

# Configure Kafka producer
bootstrap_servers = 'kafka:9092'
topic = 'criminal_soccer_gdp'
producer = KafkaProducer(bootstrap_servers=bootstrap_servers,
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Fetch data from API
api_url = 'http://api.worldbank.org/v2/country/all/indicators/NY.GDP.MKTP.CD?format=json&mrnev=1&per_page=300'
response = requests.get(api_url)
data = response.json()
data = data[1]

# Split data into smaller chunks
chunk_size = 1
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Publish each chunk to Kafka topic
for chunk in chunks:
    producer.send(topic, value=chunk)
    producer.flush()


# Close Kafka producer
producer.close()
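
The `data[1]` indexing above reflects the shape of the World Bank v2 JSON response: a two-element list whose first element is paging metadata and whose second element is the list of records. The sketch below illustrates this with a hand-written sample payload. The GDP values and country names are placeholders, not real figures, and `extract_gdp` is a hypothetical helper.

```python
import json

# Hand-written sample in the shape of a World Bank v2 response (values are placeholders).
sample_response = json.loads("""
[
  {"page": 1, "pages": 1, "per_page": 300, "total": 2},
  [
    {"indicator": {"id": "NY.GDP.MKTP.CD", "value": "GDP (current US$)"},
     "country": {"id": "XA", "value": "Country A"},
     "countryiso3code": "XAA", "date": "2021", "value": 1000.0},
    {"indicator": {"id": "NY.GDP.MKTP.CD", "value": "GDP (current US$)"},
     "country": {"id": "XB", "value": "Country B"},
     "countryiso3code": "XBB", "date": "2021", "value": null}
  ]
]
""")

def extract_gdp(response):
    """Map country name -> GDP value, skipping records without a value."""
    metadata, records = response  # element 0: paging info, element 1: data rows
    return {
        rec["country"]["value"]: rec["value"]
        for rec in records
        if rec["value"] is not None
    }

gdp_by_country = extract_gdp(sample_response)
print(gdp_by_country)
```

Inspecting the `pages` entry of the metadata element would show whether `per_page=300` was enough to fetch every record in a single request.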


In [37]:
# Optional: delete the topic
#from kafka import KafkaAdminClient
#admin_client = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
#admin_client.delete_topics([topic])

DeleteTopicsResponse_v3(throttle_time_ms=0, topic_error_codes=[(topic='criminal_soccer_gdp', error_code=0)])

### Spark Data Consumer
After writing the messages to the Kafka topic, we read them with Spark Structured Streaming and save them to our MongoDB database.

In [1]:
%%file spark-consumer.py

from pyspark.sql import SparkSession
from config import MONGO_HOST_REMOTE, MONGO_DB_REMOTE
from pymongo import MongoClient
import json

KAFKA_TOPIC = "criminal_soccer_gdp"
KAFKA_SERVER = "kafka:9092"

# Initialize MongoDB client
mongo_client = MongoClient(MONGO_HOST_REMOTE)
mongo_collection = mongo_client[MONGO_DB_REMOTE]["world_bank_hdi"]

# Define a function to write DataFrame batches to MongoDB
def write_to_mongodb(batch_df, batch_id):
    documents = batch_df.selectExpr("CAST(value AS STRING)").collect()
    
    for row in documents:
        document = json.loads(row["value"])[0]
        mongo_collection.insert_one(document)

    
spark = SparkSession.builder.appName("CriminalSoccerStream").getOrCreate()

df = (spark.readStream                             # Get the DataStreamReader
  .format("kafka")                                 # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", KAFKA_SERVER) # Configure the Kafka server name and port
  .option("subscribe", KAFKA_TOPIC)                # Subscribe to the Kafka topic
  .option("startingOffsets", "earliest")           # Rewind stream to beginning when we restart notebook
  .option("maxOffsetsPerTrigger", 1000)            # Throttle Kafka's processing of the streams
  .load()                                          # Load the DataFrame
)

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .foreachBatch(write_to_mongodb) \
    .start()

spark.streams.awaitAnyTermination()

Overwriting spark-consumer.py
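
Because the stream is read from the earliest offset and no checkpoint location is configured, restarting the consumer can write the same records to MongoDB again. One possible remedy, shown here as a sketch and not as the project's implementation, is to upsert on a natural key instead of calling `insert_one`. The key fields (`countryiso3code` and `date`) are an assumption based on the World Bank record layout, and `InMemoryCollection` is a hypothetical stand-in so that the example runs without a database. With pymongo the equivalent call would be `collection.update_one(key, {"$set": document}, upsert=True)`.

```python
import json

class InMemoryCollection:
    """Tiny stand-in for a MongoDB collection (only equality filters and $set)."""
    def __init__(self):
        self.docs = []

    def update_one(self, flt, update, upsert=False):
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in flt.items()):
                doc.update(update["$set"])
                return
        if upsert:
            self.docs.append({**flt, **update["$set"]})

def upsert_message(collection, raw_value):
    """Parse one Kafka message value (a JSON list with one record) and upsert it."""
    document = json.loads(raw_value)[0]
    # Assumed natural key: one record per country and year.
    key = {
        "countryiso3code": document["countryiso3code"],
        "date": document["date"],
    }
    collection.update_one(key, {"$set": document}, upsert=True)

collection = InMemoryCollection()
message = json.dumps([{
    "countryiso3code": "XAA", "date": "2021",
    "indicator": {"id": "NY.GDP.MKTP.CD"}, "value": 1000.0,  # placeholder value
}])
upsert_message(collection, message)
upsert_message(collection, message)  # replaying the same message adds no duplicate
print(len(collection.docs))
```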


In [2]:
!spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 spark-consumer.py

:: loading settings :: url = jar:file:/usr/local/spark-3.3.2-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/jovyan/.ivy2/cache
The jars for the packages stored in: /home/jovyan/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-61c45bd5-75aa-4131-8a4e-dde42752c4f3;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.12;3.0.1 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.12;3.0.1 in central
	found org.apache.kafka#kafka-clients;2.4.1 in central
	found com.github.luben#zstd-jni;1.4.4-3 in central
	found org.lz4#lz4-java;1.7.1 in central
	found org.xerial.snappy#snappy-java;1.1.7.5 in central
	found org.slf4j#slf4j-api;1.7.30 in central
	found org.spark-project.spark#unused;1.0.0 in central
	found org.apache.commons#commons-pool2;2.6.2 in central
:: resolution report :: resolve 631ms :: artifacts dl 


KeyboardInterrupt



### Transfermarkt API and Crawler
<img src="imgs/Transfermarkt_logo.png" style="width:200px; height:auto"/>

To obtain each player's nationality, we query the Transfermarkt [website](https://transfermarkt.com). We then parse the HTML response to collect `full_name`, `nationality`, `icon` and `market_value`.

In [None]:
try:
    import parsel
except ImportError:
    !pip install parsel
    import parsel


def transfermarkt_spider(name):
    """queries transfermarkt.com and parses response table"""
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
    header = {"user-agent": ua}
    result = None
    with requests.Session() as session:
        url = "https://www.transfermarkt.com/schnellsuche/ergebnis/schnellsuche"
        req = session.get(url, params={"query": name}, headers=header) # API call
        response = parsel.Selector(req.text)
        # retrieve data from the first row of the result table
        row = response.xpath("//table[@class='items']/tbody/tr[1]")
        icon_url = row.xpath(".//table//img/@src").get()
        name = row.xpath(".//table//img/@title").get()
        national = row.xpath("./td[5]/img[1]/@alt").get()
        value = row.xpath("./td[6]/text()").get()
        # .get() returns None when nothing matches; only keep a result with a nationality
        if national:
            result = dict(icon=icon_url, full_name=name, nationality=national, market_value=value)
    return result

# Collect from Transfermarkt.com
with MongoContext() as client:
    # MongoDB connection
    db = client[mongoDB]
    collection = db[UEFA_RAW_DATA]
    # add the complementary data to the mongo documents that have no nationality yet
    mongo_rows = collection.find({ "nationality": { "$exists":False }})
    for mongo_player in mongo_rows:
        name = mongo_player["player_name"]
        print(name)
        transfer_data = transfermarkt_spider(name)
        if transfer_data:
            print(transfer_data)
            collection.update_one({"player_name": name}, {"$set": transfer_data})

### Fetch crime stats from Wikipedia:
Download the crime statistics (intentional homicide rate per 100,000 inhabitants) for all countries listed on Wikipedia.


In [17]:
import parsel


def crime_from_wiki():
    """scrape from wikipedia and yield results"""
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate"
    with requests.Session() as session:
        req = session.get(url)
        response = parsel.Selector(req.text)
        table = response.xpath("//table[contains(@class,'static-row-numbers')]")
        body = table.xpath("./tbody//tr")
        for row in body:
            country = row.xpath("./td[1]//a/text()").get()
            if country:
                country = country.strip("*")
                country = country.strip()
                count_p_100k = float(row.xpath("./td[4]/text()").get())
                yield {"country":country, "count_p_100k":count_p_100k}

# Collect from Wikipedia
with MongoContext() as client:
    db= client[mongoDB]
    collection = db["countries"]
    collection.drop()
    for country in crime_from_wiki():
        collection.insert_one(country)

    for x in collection.find():
        print(x)

{'_id': ObjectId('64526c681a4d0e5464aa727c'), 'country': 'Afghanistan', 'count_p_100k': 6.7}
{'_id': ObjectId('64526c681a4d0e5464aa727d'), 'country': 'Albania', 'count_p_100k': 2.1}
{'_id': ObjectId('64526c681a4d0e5464aa727e'), 'country': 'Algeria', 'count_p_100k': 1.3}
{'_id': ObjectId('64526c681a4d0e5464aa727f'), 'country': 'Andorra', 'count_p_100k': 2.6}
{'_id': ObjectId('64526c681a4d0e5464aa7280'), 'country': 'Angola', 'count_p_100k': 4.8}
{'_id': ObjectId('64526c681a4d0e5464aa7281'), 'country': 'Anguilla', 'count_p_100k': 28.3}
{'_id': ObjectId('64526c681a4d0e5464aa7282'), 'country': 'Antigua and Barbuda', 'count_p_100k': 9.2}
{'_id': ObjectId('64526c681a4d0e5464aa7283'), 'country': 'Argentina', 'count_p_100k': 5.3}
{'_id': ObjectId('64526c681a4d0e5464aa7284'), 'country': 'Armenia', 'count_p_100k': 1.8}
{'_id': ObjectId('64526c681a4d0e5464aa7285'), 'country': 'Aruba', 'count_p_100k': 1.9}
{'_id': ObjectId('64526c681a4d0e5464aa7286'), 'country': 'Australia', 'count_p_100k': 0.9}
{'

## Transform Data
In order to analyze the data, we apply type conversions to selected attributes.

In [None]:
TYPE_CONVERSIONS = {"minutes_played": "int",
                    'match_played': "int", 'goals': "int", 'assists': "int", 'distance_covered': 'float',
                    'fouls_committed': "int", 'fouls_suffered': "int", 'red': "int", 'yellow': "int",
                    'cross_accuracy': "int", 'cross_attempted': "int", 'cross_complted': "int",
                    'freekicks_taken': "int", 'pass_accuracy': "float", 'pass_attempted': "int",
                    'pass_completed': "int",
                    'balls_recoverd': "int",
                    'clearance_attempted': "int",
                    't_lost': "int",
                    't_won': "int",
                    'tackles': "int"
                    }

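def type_converter(item: dict, definitions: dict) -> dict:
    """Return a copy of item with every key listed in definitions cast to its target type.

    Values that cannot be converted are replaced by None.
    """
    new_item = dict()
    for k, v in item.items():
        if k in definitions:
            try:
                if definitions[k] == "int":
                    # go through float so that strings like "3.0" are accepted too
                    v = int(float(v))
                elif definitions[k] == "float":
                    v = float(v)
            except (TypeError, ValueError, OverflowError):
                v = None

        new_item[k] = v
    return new_item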

def transform_raw_data():
    with MongoContext() as client:
        db = client[mongoDB]
        raw = db[UEFA_RAW_DATA]
        collection = db["players"]
        collection.drop()
        for doc in raw.find():
            cleaned_item = type_converter(doc, TYPE_CONVERSIONS)
            collection.insert_one(cleaned_item)

transform_raw_data()
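
The conversion table above does not cover `market_value`, which Transfermarkt delivers as a string such as `'€60.00m'`. A minimal sketch of how it could be turned into a number is shown below. The helper `parse_market_value` is hypothetical, and the set of suffixes (`k`, `m`, `bn`) is an assumption about the format, so unmatched strings are returned as `None` rather than guessed.

```python
import re

# Multipliers for the suffixes assumed to appear in Transfermarkt market values.
_SUFFIXES = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}

def parse_market_value(text):
    """Convert a string like '€60.00m' to a float amount in euros, or None if it does not match."""
    if not isinstance(text, str):
        return None
    match = re.fullmatch(
        r"€\s*(\d+(?:\.\d+)?)\s*(k|m|bn)?", text.strip(), flags=re.IGNORECASE
    )
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number) * _SUFFIXES.get((suffix or "").lower(), 1)

print(parse_market_value("€60.00m"))
```

Such a helper could be called from `type_converter` for the `market_value` key, which would keep all conversions in one place.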

### Metadata in MongoDB
#### Collections
##### Players:
Example data entry from a soccer player:

```{'_id': ObjectId('643c1b35afd7d39c21e6dbf8'),'player_name': 'Courtois', 'club': 'Real Madrid', 'position': 'Goalkeeper', 'minutes_played': 1230, 'match_played': 13, 'goals': 0, 'assists': 0, 'distance_covered': 64.2, 'cross_accuracy': 0, 'cross_attempted': 0, 'cross_complted': 0, 'freekicks_taken': 27, 'pass_accuracy': 76.7, 'pass_attempted': 483, 'pass_completed': 365, 'serial': 447, 'full_name': 'Thibaut Courtois', 'icon': 'https://img.a.transfermarkt.technology/portrait/small/108390-1665067957.jpg?lm=1', 'market_value': '€60.00m', 'nationality': 'Belgium'}```

##### Countries:
Example data entry for the crime stats:

`{'_id': ObjectId('643c37c7afd7d39c21e6dfc0'), 'country': 'Afghanistan', 'count_p_100k': 6.7}`
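
To relate both collections, the `nationality` field of a player can be matched against the `country` field of the crime statistics. The aggregation pipeline below is a sketch of such a join, assuming that country names are spelled identically in both collections (which is not guaranteed, since they come from different sources). It only builds the pipeline; running it would require a live connection, for example `db["players"].aggregate(pipeline)`.

```python
# Aggregation pipeline: average fouls per nationality, joined with the homicide rate.
pipeline = [
    # Group players by nationality and average their committed fouls.
    {"$group": {
        "_id": "$nationality",
        "avg_fouls": {"$avg": "$fouls_committed"},
        "players": {"$sum": 1},
    }},
    # Attach the matching document from the `countries` collection.
    {"$lookup": {
        "from": "countries",
        "localField": "_id",
        "foreignField": "country",
        "as": "crime",
    }},
    # Drop nationalities without a matching country entry.
    {"$unwind": "$crime"},
    # Keep only the fields needed for plotting.
    {"$project": {
        "_id": 0,
        "country": "$_id",
        "avg_fouls": 1,
        "players": 1,
        "count_p_100k": "$crime.count_p_100k",
    }},
]

# List the stage operators in order as a quick sanity check.
print([next(iter(stage)) for stage in pipeline])
```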

## Data Cleaning
<>

## Data Analysis
<>

## Data Output
<>