# Criminal Soccer
**Team:** Michael Gangl, Sebastian Grünewald, Patrick Leitner

## Topic <TBR>
The goal of the project is to load and process several soccer datasets and visualize the results. The project is to be run and analyzed from a Big Data perspective. For this purpose, relational and NoSQL databases were used, as well as a MapReduce algorithm. Furthermore, all services and metadata are hosted, so the architecture is multi-user capable.

## Question <TBR>
With our diagrams we want to analyze whether the foul statistics of the players correlate with the crime statistics of their countries of origin.
Additionally, we show some other interesting and relevant graphs about the datasets.
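
One common way to quantify such a relationship is the Pearson correlation coefficient. The following is a minimal sketch of the arithmetic only: the function name `pearson` and the sample values are hypothetical and are not results of this project.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized, the factor cancels out)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, NOT project data
fouls = [1.0, 2.0, 3.0, 4.0]
homicide_rate = [2.0, 4.0, 6.0, 8.0]
r = pearson(fouls, homicide_rate)
```

A value close to 1 or -1 indicates a strong linear relationship, a value close to 0 indicates none; a correlation alone says nothing about causation.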

## Architecture <TBD>
GitHub: <>
Mongo: <>
Kafka: <>
Spark: <>

<TBD: Architecture Diagram>

## Data Collection
The following scripts load the data from various sources and store it in MongoDB.
### Datasets
[Kaggle UEFA (.csv-files)](https://www.kaggle.com/datasets/azminetoushikwasi/ucl-202122-uefa-champions-league): This dataset contains the player statistics of the 2021-22 UEFA Champions League season.

[World Bank API](TBD): <>

[Wikipedia National Crime Stats](https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate): Wikipedia provides a table with intentional homicide rates per 100,000 inhabitants.

### Technical Notice
**Make sure Docker is up and running!**
Because the hosted MongoDB instance from the FH has caused problems in the past, we implemented a fallback to a local MongoDB instance running in Docker.
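
The fallback idea can be sketched independently of MongoDB: try each candidate host in order and use the first one that answers. In this sketch `first_reachable`, `fake_probe` and the host strings are hypothetical; in the notebook the probe would correspond to calling `MongoClient(...).server_info()`.

```python
def first_reachable(hosts, probe):
    """Return the first host for which probe(host) does not raise.

    hosts: connection strings in order of preference (remote first).
    probe: callable that raises an exception if the host is unreachable.
    """
    errors = {}
    for host in hosts:
        try:
            probe(host)
            return host
        except Exception as err:  # the probe decides what "unreachable" means
            errors[host] = err
    raise ConnectionError(f"no host reachable: {errors}")


# Stand-in probe for illustration; assumption: only the local host answers
def fake_probe(host):
    if host != "mongodb://localhost:27017":
        raise TimeoutError(f"{host} timed out")

chosen = first_reachable(
    ["mongodb://remote.example:27017", "mongodb://localhost:27017"], fake_probe
)
print(chosen)  # mongodb://localhost:27017
```

Raising an error when no host answers keeps later steps from running without a database connection.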

### Steps
1. Prepare a local fallback for MongoDB in a Docker container, in case the hosted MongoDB instance is unavailable.
2. Download the UEFA dataset from Kaggle and store the data in MongoDB.
3. <>
4. Transform the data to proper data types and clean up the combined dataset.
5. Scrape Wikipedia for crime statistics per country and store them in MongoDB.

In [14]:
import pandas as pd
import zipfile
import os
import requests

# Importing necessary packages for mongodb connectivity
try:
    from pymongo import MongoClient
    from pymongo.errors import ServerSelectionTimeoutError
except ImportError:
    !pip install pymongo[srv]
    from pymongo import MongoClient
    from pymongo.errors import ServerSelectionTimeoutError

# Importing config from config.py
from config import MONGO_HOST_REMOTE, MONGO_DB_REMOTE, MONGO_HOST_LOCAL, MONGO_DB_LOCAL

# Defining constants for kaggle files
UEFA_ZIP = "kaggle_players_zip.zip"
UEFA_UNZIPPED = "kaggle_files"
UEFA_FILES = ["key_stats.csv", "disciplinary.csv", "distributon.csv", "defending.csv"] # only some .csv-files are interesting for our purposes
UEFA_RAW_DATA = "raw_players"

# Defining constants for MongoDB connection
conn_str = MONGO_HOST_REMOTE
mongoDB = MONGO_DB_REMOTE

class MongoContext:
    """mongodb client context manager with a fallback to a local docker instance"""
    def __init__(self):
        self.conn_str = MONGO_HOST_REMOTE
        self.mongoDB = MONGO_DB_REMOTE
    def __enter__(self):
        try:
            # Connection to Mongo Server from FH-Technikum
            self.client = MongoClient(self.conn_str)
            self.client.server_info()  # raises if the server is unreachable
            return self.client
        # If connection is not possible, start a local docker instance
        except ServerSelectionTimeoutError as err:
            print("Remote Error: " + str(err))
            os.system("docker pull mongo")
            os.system("docker run -d -p 27017:27017 mongo:latest")
            self.conn_str = MONGO_HOST_LOCAL
            # NOTE: callers index the client with the module-level mongoDB,
            # so this attribute alone does not change the database they use
            self.mongoDB = MONGO_DB_LOCAL
            try:
                # Trying to connect to the local docker Mongo database
                self.client = MongoClient(self.conn_str)
                self.client.server_info()
                return self.client
            except ServerSelectionTimeoutError as errLocal:
                print("Local Error: " + str(errLocal))
                raise  # without a connection the with-block cannot run

    def __exit__(self, exception_type, exception_value, exception_traceback):
        self.client.close()
        del self.client

def unpack_zip(src, dest):
    """takes files in zip folder from src and extracts them to dest"""
    with zipfile.ZipFile(src, 'r') as zip_ref:
        zip_ref.extractall(dest)

def csv_to_mongo(folder, files, map_key):
    """Fetching data from interesting files in csv folder"""
    # kill existing collection if it exists:
    with MongoContext() as client:
        db = client[mongoDB]
        collection = db[UEFA_RAW_DATA]
        collection.drop()

        for idx, file in enumerate(files):
            df = pd.read_csv(f"{folder}/{file}")
            data = df.to_dict(orient='records')
            if idx == 0:
                # Insert data into mongo
                collection.insert_many(data)
            else:
                for row in data:
                    # Merge the row into the existing document with the same key;
                    # without upsert, rows that match no document are skipped
                    query = {map_key: row[map_key]}
                    new_values = {"$set": row}
                    collection.update_one(query, new_values)

def read_from_mongo():
    """reading from mongo database and printing the collection"""
    with MongoContext() as client:
        db = client[mongoDB]
        collection = db[UEFA_RAW_DATA]

        data = collection.find()
        for x in data:
            print("==========================================================================")
            print(x)

def collect_from_kaggle():
    """unpack the kaggle zip and load it into mongo, unless the collection already exists"""
    # guard, in case data is already in database.
    with MongoContext() as client:
        db = client[mongoDB]
        if UEFA_RAW_DATA in db.list_collection_names():
            print(f"{UEFA_RAW_DATA} is already in database")
            return False

    unpack_zip(UEFA_ZIP, UEFA_UNZIPPED)
    csv_to_mongo(UEFA_UNZIPPED, UEFA_FILES, "player_name")
    read_from_mongo()

collect_from_kaggle()

{'_id': ObjectId('644b8ed8f7fb1e526a4dd719'), 'player_name': 'Courtois', 'club': 'Real Madrid', 'position': 'Goalkeeper', 'minutes_played': 1230, 'match_played': 13, 'goals': 0, 'assists': 0, 'distance_covered': '64.2', 'cross_accuracy': 0, 'cross_attempted': 0, 'cross_complted': 0, 'freekicks_taken': 27, 'pass_accuracy': 76.7, 'pass_attempted': 483, 'pass_completed': 365, 'serial': 447}
{'_id': ObjectId('644b8ed8f7fb1e526a4dd71a'), 'player_name': 'Vinícius Júnior', 'club': 'Real Madrid', 'position': 'Forward', 'minutes_played': 1199, 'match_played': 13, 'goals': 4, 'assists': 6, 'distance_covered': '133.0', 'fouls_committed': 13, 'fouls_suffered': 24, 'red': 1, 'serial': 121, 'yellow': 0, 'cross_accuracy': 31, 'cross_attempted': 19, 'cross_complted': 6, 'freekicks_taken': 0, 'pass_accuracy': 83.1, 'pass_attempted': 451, 'pass_completed': 377, 'balls_recoverd': 29, 'clearance_attempted': 0, 't_lost': 8, 't_won': 3, 'tackles': 11}
{'_id': ObjectId('644b8ed8f7fb1e526a4dd71b'), 'player_na

False
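
The merge performed by `csv_to_mongo` can be illustrated with plain dictionaries. This is a minimal sketch, not the notebook's code: `merge_by_key` is a hypothetical helper, the rows are invented, and the key is assumed to be unique within the first file. It mirrors the behaviour that the first file defines the set of documents and later files only add fields to documents that already exist, which presumably explains why some players in the output above lack the disciplinary fields.

```python
def merge_by_key(tables, key):
    """Merge lists of row dicts the way the insert/update loop does.

    The first table creates the records; rows of later tables only update
    records whose key already exists (no upsert), all others are skipped.
    """
    merged = {}
    for idx, rows in enumerate(tables):
        for row in rows:
            if idx == 0:
                merged[row[key]] = dict(row)
            elif row[key] in merged:
                merged[row[key]].update(row)
    return list(merged.values())


# Invented rows for illustration only
key_stats = [{"player_name": "A", "goals": 1}, {"player_name": "B", "goals": 0}]
disciplinary = [{"player_name": "A", "yellow": 2}, {"player_name": "C", "yellow": 1}]

result = merge_by_key([key_stats, disciplinary], "player_name")
print(result)
```

Player `B` keeps only the fields of the first file, and player `C`, who appears only in the second file, is dropped entirely. Joining on a name also merges different players who share the same `player_name`.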

### WorldBank API with Kafka
<>

### Transform Data
To analyze the data, we apply type conversions to selected attributes.

In [16]:
TYPE_CONVERSIONS = {"minutes_played": "int",
                    'match_played': "int", 'goals': "int", 'assists': "int", 'distance_covered': 'float',
                    'fouls_committed': "int", 'fouls_suffered': "int", 'red': "int", 'yellow': "int",
                    'cross_accuracy': "int", 'cross_attempted': "int", 'cross_complted': "int",
                    'freekicks_taken': "int", 'pass_accuracy': "float", 'pass_attempted': "int",
                    'pass_completed': "int",
                    'balls_recoverd': "int",
                    'clearance_attempted': "int",
                    't_lost': "int",
                    't_won': "int",
                    'tackles': "int"
                    }

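def type_converter(item: dict, definitions) -> dict:
    """returns a copy of item in which the values of all keys listed in
    definitions are cast to int or float (None if the cast fails)"""
    new_item = dict()
    for k, v in item.items():
        if k in definitions:
            if definitions[k] == "int":
                try:
                    float(v)
                    v = int(v)
                except (TypeError, ValueError, OverflowError):
                    # not convertible (e.g. "-", None or NaN)
                    v = None
            elif definitions[k] == "float":
                try:
                    v = float(v)
                except (TypeError, ValueError):
                    v = None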

        new_item[k] = v
    return new_item

def transform_raw_data():
    with MongoContext() as client:
        db = client[mongoDB]
        raw = db[UEFA_RAW_DATA]
        collection = db["players"]
        collection.drop()
        for doc in raw.find():
            cleaned_item = type_converter(doc, TYPE_CONVERSIONS)
            collection.insert_one(cleaned_item)

transform_raw_data()
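
The effect of the conversion can be seen in isolation. The sketch below is a simplified stand-in for `type_converter` (the helper name `to_number` is hypothetical): values that cannot be converted become `None` instead of raising an exception.

```python
def to_number(value, kind):
    """Convert value to int or float; return None if that is not possible."""
    caster = {"int": int, "float": float}[kind]
    try:
        return caster(value)
    except (TypeError, ValueError, OverflowError):
        return None

print(to_number("64.2", "float"))  # 64.2
print(to_number("13", "int"))      # 13
print(to_number("-", "int"))       # None
print(to_number(None, "float"))    # None
```

Storing `None` keeps the document structure intact, so later aggregations can filter out missing values explicitly.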

### Fetch crime stats from Wikipedia:
Download the crime statistics (intentional homicides per 100,000 inhabitants) for all countries in the world (0.0-10.0).


In [17]:
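import parsel  # HTML parsing with XPath selectors (pip install parsel)

def crime_from_wiki():
    """scrape the homicide rate per country from wikipedia and yield one dict per country"""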
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate"
    with requests.Session() as session:
        req = session.get(url)
        response = parsel.Selector(req.text)
        table = response.xpath("//table[contains(@class,'static-row-numbers')]")
        body = table.xpath("./tbody//tr")
        for row in body:
            country = row.xpath("./td[1]//a/text()").get()
            if country:
                country = country.strip("*")
                country = country.strip()
                count_p_100k = float(row.xpath("./td[4]/text()").get())
                yield {"country":country, "count_p_100k":count_p_100k}

# Collect from Wikipedia
with MongoContext() as client:
    db = client[mongoDB]
    collection = db["countries"]
    collection.drop()
    for country in crime_from_wiki():
        collection.insert_one(country)

    for x in collection.find():
        print(x)

{'_id': ObjectId('64526c681a4d0e5464aa727c'), 'country': 'Afghanistan', 'count_p_100k': 6.7}
{'_id': ObjectId('64526c681a4d0e5464aa727d'), 'country': 'Albania', 'count_p_100k': 2.1}
{'_id': ObjectId('64526c681a4d0e5464aa727e'), 'country': 'Algeria', 'count_p_100k': 1.3}
{'_id': ObjectId('64526c681a4d0e5464aa727f'), 'country': 'Andorra', 'count_p_100k': 2.6}
{'_id': ObjectId('64526c681a4d0e5464aa7280'), 'country': 'Angola', 'count_p_100k': 4.8}
{'_id': ObjectId('64526c681a4d0e5464aa7281'), 'country': 'Anguilla', 'count_p_100k': 28.3}
{'_id': ObjectId('64526c681a4d0e5464aa7282'), 'country': 'Antigua and Barbuda', 'count_p_100k': 9.2}
{'_id': ObjectId('64526c681a4d0e5464aa7283'), 'country': 'Argentina', 'count_p_100k': 5.3}
{'_id': ObjectId('64526c681a4d0e5464aa7284'), 'country': 'Armenia', 'count_p_100k': 1.8}
{'_id': ObjectId('64526c681a4d0e5464aa7285'), 'country': 'Aruba', 'count_p_100k': 1.9}
{'_id': ObjectId('64526c681a4d0e5464aa7286'), 'country': 'Australia', 'count_p_100k': 0.9}
{'
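
Cells in a scraped table can be empty or contain non-numeric markers, in which case a direct `float(...)` call raises an exception. A minimal sketch of a more defensive variant follows; `clean_country` and `parse_rate` are hypothetical helpers, not part of the notebook, and the handling of thousands separators is an assumption about how such values may be formatted.

```python
def clean_country(name):
    """Strip whitespace and footnote asterisks from a country name."""
    return name.strip().strip("*").strip()

def parse_rate(text):
    """Return the rate as float, or None if the cell is missing or not numeric."""
    if text is None:
        return None
    try:
        return float(text.strip().replace(",", ""))
    except ValueError:
        return None

print(clean_country("  Example *  "))  # Example
print(parse_rate("28.3\n"))            # 28.3
print(parse_rate(None))                # None
```

Rows for which `parse_rate` returns `None` could then be skipped or stored with a missing value, depending on what the later analysis expects.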

### Metadata in MongoDB
#### Collections
##### Players:
Example data entry from a soccer player:

```{'_id': ObjectId('643c1b35afd7d39c21e6dbf8'),'player_name': 'Courtois', 'club': 'Real Madrid', 'position': 'Goalkeeper', 'minutes_played': 1230, 'match_played': 13, 'goals': 0, 'assists': 0, 'distance_covered': 64.2, 'cross_accuracy': 0, 'cross_attempted': 0, 'cross_complted': 0, 'freekicks_taken': 27, 'pass_accuracy': 76.7, 'pass_attempted': 483, 'pass_completed': 365, 'serial': 447, 'full_name': 'Thibaut Courtois', 'icon': 'https://img.a.transfermarkt.technology/portrait/small/108390-1665067957.jpg?lm=1', 'market_value': '€60.00m', 'nationality': 'Belgium'}```

##### Countries:
Example data entry for the crime stats:

`{'_id': ObjectId('643c37c7afd7d39c21e6dfc0'), 'country': 'Afghanistan', 'count_p_100k': 6.7}`
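
Because player documents carry a `nationality` and country documents a `country`, the two collections can be joined on these fields. A minimal sketch with plain dictionaries follows, assuming the country names are spelled identically in both collections; `fouls_vs_crime` is a hypothetical helper and the sample documents are invented for illustration.

```python
from collections import defaultdict

def fouls_vs_crime(players, countries):
    """Sum fouls per nationality and pair them with the country's homicide rate."""
    rate_by_country = {c["country"]: c["count_p_100k"] for c in countries}
    fouls = defaultdict(int)
    for p in players:
        # players without foul data (missing or None) count as 0 here
        fouls[p.get("nationality")] += p.get("fouls_committed") or 0
    return {
        nation: {"fouls_committed": total, "count_p_100k": rate_by_country[nation]}
        for nation, total in fouls.items()
        if nation in rate_by_country  # nationalities without crime data are dropped
    }


# Invented documents for illustration only
players = [
    {"player_name": "P1", "nationality": "X", "fouls_committed": 3},
    {"player_name": "P2", "nationality": "X"},
    {"player_name": "P3", "nationality": "Y", "fouls_committed": None},
    {"player_name": "P4", "nationality": "Z", "fouls_committed": 2},
]
countries = [
    {"country": "X", "count_p_100k": 1.5},
    {"country": "Y", "count_p_100k": 0.5},
]
joined = fouls_vs_crime(players, countries)
print(joined)
```

Raw foul totals depend on how many players and minutes a country contributes, so a normalized measure may be more suitable for a comparison across countries.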

## Data Cleaning
<>

## Spark Data Consumer
<>


## Data Analysis with Spark
<>

## Data Output
<>