### Introduction to Machine Learning in Finance and Insurance (Spring 2025)
# Project 2 - Insurance Claim Prediction - Sandbox

In [2]:
# Import basic libraries
import numpy as np
import matplotlib.pyplot as plt

# Read a csv file using pandas

In [3]:
# Import libraries
# Pandas is a package used for data manipulation (e.g. dataframes, databases, etc)
import pandas as pd

In [4]:
# Load dataset from csv file into pandas dataframe object 
df = pd.read_csv('freMTPL2freq.csv', sep=';', decimal=',')

In [5]:
# Inspect the dataframe (for a large dataframe, pandas displays only the first and last rows)
df

Unnamed: 0,VehPower,VehAge,DrivAge,BonusMalus,VehBrand,VehGas,Density,Region,Exposure,ClaimNb
0,4,9,23,100,B6,Regular,7887,R31,0.760000,0
1,4,6,26,100,B6,Regular,2308,R31,0.740000,0
2,4,6,26,100,B6,Regular,2308,R31,0.110000,0
3,7,4,44,50,B6,Regular,37,R94,0.830000,0
4,5,2,29,90,B6,Regular,335,R91,0.690000,0
...,...,...,...,...,...,...,...,...,...,...
678002,6,10,27,118,B1,Diesel,1978,R31,0.120000,0
678003,4,9,34,76,B1,Regular,6681,R11,0.060000,0
678004,4,15,37,50,B1,Regular,1767,R31,0.060000,0
678005,4,15,69,50,B1,Regular,1541,R91,0.060000,0


# Pre-process dataset features

In [6]:
# Define the pre-processing function for VehAge
# Attention! This is just an example. For your project submission, you must modify this function according to the project instructions.

def pre_process_VehAge(x):
    
    if x >= 0 and x < 6:
        output = 0
    else:
        output = 1

    return output
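
The same binning rule can also be applied to the whole column at once instead of element by element. The snippet below is a minimal sketch on a hypothetical toy column (`veh_age` stands in for `df['VehAge']`); it only reproduces the example rule above and is not the rule required for the submission.

```python
import numpy as np
import pandas as pd


def pre_process_VehAge(x):
    # Same rule as the example above: 0 for ages in [0, 6), 1 otherwise
    if x >= 0 and x < 6:
        return 0
    return 1


# Hypothetical toy column standing in for df['VehAge']
veh_age = pd.Series([0, 2, 5, 6, 9, 15], name='VehAge')

# Element-wise version: the Python function is called once per row
binned_apply = veh_age.apply(pre_process_VehAge)

# Vectorized version: one NumPy expression over the whole column
binned_vec = pd.Series(
    np.where((veh_age >= 0) & (veh_age < 6), 0, 1),
    index=veh_age.index,
    name='VehAge',
)

print(binned_vec.tolist())  # [0, 0, 0, 1, 1, 1]
```

On a dataframe with several hundred thousand rows, the vectorized form is usually noticeably faster than `.apply` with a Python function.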

In [8]:
Exposure = df['Exposure']

# Transform discrete/continuous variables
VehPower = np.log(df['VehPower'])
DrivAge = np.log(df['DrivAge'])
BonusMalus = np.log(df['BonusMalus'])

# Apply pre-processing function to VehAge and one-hot encode it
VehAge = pd.get_dummies(df['VehAge'].apply(pre_process_VehAge))

# Re-assemble the dataset by concatenating the transformed features column-wise (axis=1)
X = np.float32(pd.concat([Exposure, VehPower, VehAge, DrivAge, BonusMalus], axis=1).values)
# Define the target labels (i.e. claim frequency)
y = np.float32(df['ClaimNb'].values/df['Exposure'].values)

# Attention! Since this is an example, we are keeping only some features of the original dataset. 
# For your final submission, modify the code accordingly.
# Attention! For the moment, we also keep `Exposure` as the first feature in the dataset, because we want to be able 
# to split it during the train-test split together with the rest of the dataset. We will then remove it from the dataset
# before training (see the function `get_train_test_split` below)
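
The course's `get_train_test_split` is not shown in this section. As a rough sketch of the idea described in the comment above (split the rows first, then separate the `Exposure` column from the features), one could write something like the following. The function name `split_and_pop_exposure`, the test fraction, the seed, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def split_and_pop_exposure(X, y, test_fraction=0.2, seed=0):
    """Shuffle rows, split into train/test, and separate column 0 (Exposure)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]

    # Column 0 holds Exposure; the remaining columns are the model features
    exp_train, exp_test = X[train_idx, 0], X[test_idx, 0]
    X_train, X_test = X[train_idx, 1:], X[test_idx, 1:]
    return X_train, X_test, y[train_idx], y[test_idx], exp_train, exp_test


# Hypothetical toy data: 10 rows, Exposure in column 0 plus 5 feature columns
rng = np.random.default_rng(42)
X_toy = rng.uniform(0.1, 1.0, size=(10, 6)).astype(np.float32)
y_toy = rng.poisson(0.1, size=10).astype(np.float32)

X_tr, X_te, y_tr, y_te, e_tr, e_te = split_and_pop_exposure(X_toy, y_toy)
print(X_tr.shape, X_te.shape, e_tr.shape, e_te.shape)
```

Because the exposure values are indexed with the same row permutation as the features and targets, they stay aligned with their rows after the split.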

A different way to get the same result a bit more directly:

In [17]:
df_ = df.copy()
for col in ['VehPower', 'DrivAge', 'BonusMalus']:
    df_[col] = np.log(df[col])
df_['VehAge'] = df['VehAge'].apply(pre_process_VehAge)
df_ = pd.get_dummies(df_, columns=['VehAge'])

X_ = df_[['Exposure', 'VehPower', 'VehAge_0', 'VehAge_1', 'DrivAge', 'BonusMalus']].values.astype(np.float32)
y_ = (df_['ClaimNb']/df_['Exposure']).values.astype(np.float32)

print(np.all(X==X_), np.all(y==y_))

True True


In [15]:
X.shape

(678007, 6)

In [18]:
ages = np.sort(pd.unique(df['DrivAge']))

In [19]:
df['DrivAge']

0         23
1         26
2         26
3         44
4         29
          ..
678002    27
678003    34
678004    37
678005    69
678006    18
Name: DrivAge, Length: 678007, dtype: int64

In [20]:
freqs_up = np.zeros(len(ages))
freqs_lo = np.zeros(len(ages))
means = np.zeros(len(ages))
for i_age, age in enumerate(ages):
    # Boolean mask selecting the policies whose driver has this age
    mask = (df['DrivAge'] == age).values
    n = mask.sum()
    # Standard error of the mean claim frequency for this age
    sem = np.std(y[mask]) / np.sqrt(n)
    means[i_age] = np.mean(y[mask])
    freqs_up[i_age] = means[i_age] + sem
    freqs_lo[i_age] = means[i_age] - sem
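
The loop above computes, for each age, the mean claim frequency plus and minus one standard error (standard deviation divided by the square root of the group size). The same quantities can be obtained with a pandas `groupby`. The snippet below is a sketch on hypothetical toy data (`toy` stands in for `df['DrivAge']` together with the frequency vector `y`); note that `np.std` uses `ddof=0`, so the sketch passes `ddof=0` explicitly because the pandas default is `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df['DrivAge'] and the frequency vector y
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'DrivAge': rng.integers(18, 23, size=200),
    'freq': rng.poisson(0.3, size=200).astype(np.float64),
})

# One pass over the data: mean, population std (ddof=0, as np.std uses) and count per age
stats = toy.groupby('DrivAge')['freq'].agg(
    mean='mean',
    std=lambda s: s.std(ddof=0),
    n='count',
)

# Standard error of the mean and the resulting band around the mean
sem = stats['std'] / np.sqrt(stats['n'])
band = pd.DataFrame({
    'lo': stats['mean'] - sem,
    'mean': stats['mean'],
    'up': stats['mean'] + sem,
})
print(band)
```

The index of `band` is sorted by age, so its columns can be passed directly to `plt.fill_between` and `plt.plot` in place of `freqs_lo`, `means`, and `freqs_up`.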

In [21]:
plt.fill_between(ages, np.log(freqs_lo), np.log(freqs_up), alpha=0.5, color='blue')
plt.plot(ages, np.log(means), color='blue', linewidth=0.5)
plt.ylabel('Claim frequency (log)')
plt.xlabel('Age')
plt.ylim(-3, 0.1)
plt.xlim(17, 93)
plt.savefig('U_shaped_freq.pdf', bbox_inches='tight')
plt.show()

# Git and GitHub: An overview

To understand Git and GitHub, it's crucial to distinguish between the two:

* **Git:**
    * Git is a version control system (VCS). This means it's software that tracks changes to files over time.
    * It allows you to record snapshots of your files, revert to previous versions, and manage different versions of your projects.
    * Git is primarily a command-line tool, though graphical user interfaces (GUIs) are available.
    * It's designed for collaborative work, enabling multiple people to work on the same project without overwriting each other's changes.
    * Essentially, Git operates locally on your computer.

* **GitHub:**
    * GitHub is a web-based platform that provides hosting for Git repositories.
    * It acts as a central location where developers can store, manage, and share their Git projects.
    * GitHub enhances Git's functionality with features like:
        * Collaboration tools (pull requests, code reviews).
        * Issue tracking for bug reporting and task management.
        * Project management tools.
        * A social networking aspect for developers.
    * GitHub facilitates remote collaboration, allowing teams to work together seamlessly, regardless of their location.
    * It is a cloud-based service.

In short:

* Git is the tool that handles version control.
* GitHub is the platform that hosts Git repositories and provides collaborative features.

They work together, with Git handling the version control and GitHub providing the online space for storing and collaborating on those version-controlled projects.


## Git Cheatsheet

This cheatsheet provides a quick reference for common Git commands.

**Configuration:**

* **Set your username for commits:**
    ```bash
    git config --global user.name "Your Name"
    ```
* **Set your email address for commits:**
    ```bash
    git config --global user.email "your.email@example.com"
    ```
* **Initialize a new Git repository in the current directory:**
    ```bash
    git init
    ```
* **Clone an existing repository:**
    ```bash
    git clone <repository_url> [local_directory_name]
    ```

**Basic Workflow:**

* **Check the status of your working directory and staging area:**
    ```bash
    git status
    ```
* **Add changes in the current directory to the staging area:**
    ```bash
    git add .
    ```
* **Add a specific file to the staging area:**
    ```bash
    git add <file_name>
    ```
* **Commit the staged changes with a message:**
    ```bash
    git commit -m "Your commit message here"
    ```
* **Commit directly, staging all tracked changes (not recommended for careful staging):**
    ```bash
    git commit -am "Your commit message here"
    ```
* **View your commit history:**
    ```bash
    git log
    ```
* **View a more concise commit history (one line per commit):**
    ```bash
    git log --oneline
    ```
* **View the changes in a specific commit:**
    ```bash
    git show <commit_hash>
    ```
* **View the changes between the working directory and the staging area:**
    ```bash
    git diff
    ```
* **View the changes between the staging area and the last commit:**
    ```bash
    git diff --staged
    ```
* **Stop tracking a file (remove it from the index but keep it in the working directory; the removal is recorded with the next commit):**
    ```bash
    git rm --cached <file_name>
    ```
* **Remove a file from both the staging area and the working directory:**
    ```bash
    git rm <file_name>
    ```
* **Rename a file:**
    ```bash
    git mv <old_file_name> <new_file_name>
    ```

**Branching and Merging:**

* **List all branches (local and remote):**
    ```bash
    git branch -a
    ```
* **List local branches:**
    ```bash
    git branch
    ```
* **Create a new branch:**
    ```bash
    git branch <new_branch_name>
    ```
* **Switch to an existing branch:**
    ```bash
    git checkout <branch_name>
    ```
* **Create and switch to a new branch in one command:**
    ```bash
    git checkout -b <new_branch_name>
    ```
* **Merge a branch into the currently checked-out branch:**
    ```bash
    git merge <branch_to_merge>
    ```
* **Delete a local branch (if it has been fully merged):**
    ```bash
    git branch -d <branch_to_delete>
    ```
* **Force delete a local branch (even if not fully merged - use with caution):**
    ```bash
    git branch -D <branch_to_delete>
    ```
* **Delete a remote branch:**
    ```bash
    git push origin --delete <remote_branch_name>
    ```

**Remote Repositories:**

* **Add a remote repository:**
    ```bash
    git remote add <remote_name> <repository_url>
    ```
    (Commonly `<remote_name>` is `origin`)
* **List configured remote repositories:**
    ```bash
    git remote -v
    ```
* **Fetch changes from a remote repository (without merging):**
    ```bash
    git fetch <remote_name>
    ```
* **Pull changes from a remote repository and merge them into the current branch:**
    ```bash
    git pull <remote_name> <branch_name>
    ```
    (Often `git pull origin main` or `git pull origin master`)
* **Push local commits to a remote repository:**
    ```bash
    git push <remote_name> <branch_name>
    ```
    (For the first push of a new local branch, you might need `git push -u origin <branch_name>`)

**Undoing Changes:**

* **Discard unstaged changes to a file in the working directory (restores the staged version, which equals the last commit if nothing is staged for that file):**
    ```bash
    git checkout -- <file_name>
    ```
* **Unstage changes that were added to the staging area:**
    ```bash
    git reset HEAD <file_name>
    ```
* **Go back to a previous commit (resets the staging area and working directory - use with caution, data loss possible):**
    ```bash
    git reset --hard <commit_hash>
    ```
* **Go back to a previous commit (moves HEAD only; the undone changes stay staged in the staging area):**
    ```bash
    git reset --soft <commit_hash>
    ```
* **Go back to a previous commit (resets the staging area; the undone changes stay unstaged in the working directory):**
    ```bash
    git reset --mixed <commit_hash>
    ```
    (This is the default behavior of `git reset <commit_hash>`)
* **Create a new commit that undoes the changes in a specific commit:**
    ```bash
    git revert <commit_hash>
    ```

**Stashing:**

* **Save uncommitted changes temporarily:**
    ```bash
    git stash
    ```
* **List all stashed changes:**
    ```bash
    git stash list
    ```
* **Apply the most recent stashed changes:**
    ```bash
    git stash apply
    ```
* **Apply a specific stashed change (e.g., stash@{1}):**
    ```bash
    git stash apply stash@{<stash_id>}
    ```
* **Apply and remove the most recent stashed changes:**
    ```bash
    git stash pop
    ```
* **Remove a specific stashed change:**
    ```bash
    git stash drop stash@{<stash_id>}
    ```
* **Remove all stashed changes:**
    ```bash
    git stash clear
    ```

**Tags:**

* **List all tags:**
    ```bash
    git tag
    ```
* **Create a lightweight tag at the current commit:**
    ```bash
    git tag <tag_name>
    ```
* **Create an annotated tag (recommended for releases):**
    ```bash
    git tag -a <tag_name> -m "Your tag message"
    ```
* **Push tags to the remote repository:**
    ```bash
    git push origin --tags
    ```
* **Push a specific tag:**
    ```bash
    git push origin <tag_name>
    ```
* **Checkout a specific tag (creates a detached HEAD):**
    ```bash
    git checkout <tag_name>
    ```
* **Delete a local tag:**
    ```bash
    git tag -d <tag_name>
    ```
* **Delete a remote tag:**
    ```bash
    git push origin --delete tag <tag_name>
    ```

**Rewriting History (Use with Caution):**

* **Amend the last commit message:**
    ```bash
    git commit --amend -m "New commit message"
    ```
* **Amend the last commit by adding staged changes:**
    ```bash
    git add <file_to_add>
    git commit --amend --no-edit
    ```
* **Interactive rebasing (for modifying commit history):**
    ```bash
    git rebase -i <base_branch>
    ```

**Help:**

* **Get help for a specific Git command:**
    ```bash
    git help <command>
    ```
    (e.g., `git help commit`)

This cheatsheet covers the most frequently used Git commands. For more detailed information, refer to the official Git documentation or use the `git help` command. Remember to use history-rewriting commands with caution, especially on shared repositories.

# Structuring Code in Large Projects: An Overview and Advice

Structuring code effectively in large projects is crucial for maintainability, scalability, collaboration, and overall project success. Poor organization leads to a "big ball of mud," making it difficult to understand, debug, refactor, and onboard new team members.

Here's an overview and advice on how to approach code structuring in large projects:

**Core Principles:**

* **Modularity:** Break down the system into independent, self-contained modules with well-defined responsibilities and interfaces. This reduces dependencies and makes it easier to work on individual parts.
* **Separation of Concerns (SoC):** Each part of the code should address a distinct concern or responsibility. This improves readability, testability, and reduces the impact of changes in one area on others.
* **Consistency:** Adhere to consistent naming conventions, coding styles, and directory structures throughout the project. This makes the codebase more predictable and easier to navigate.
* **Clarity and Readability:** Write code that is easy to understand. Use meaningful names, comments where necessary, and follow established coding standards.
* **Testability:** Structure your code in a way that makes it easy to write unit, integration, and end-to-end tests. Modular and loosely coupled code is inherently more testable.
* **Scalability:** Design the structure with future growth in mind. Consider how new features or components will be integrated.
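
To make the testability principle concrete, here is a minimal sketch of a unit test for a small, self-contained function. It assumes Python and the standard-library `unittest` module; the function is a compact version of the `pre_process_VehAge` example from the notebook above, and the test class name is hypothetical.

```python
import unittest


def pre_process_VehAge(x):
    """Map a vehicle age to 0 (age in [0, 6)) or 1 (otherwise)."""
    return 0 if 0 <= x < 6 else 1


class TestPreProcessVehAge(unittest.TestCase):
    def test_young_vehicle_maps_to_zero(self):
        for age in (0, 3, 5):
            self.assertEqual(pre_process_VehAge(age), 0)

    def test_old_vehicle_maps_to_one(self):
        for age in (6, 10, 100):
            self.assertEqual(pre_process_VehAge(age), 1)


# Run the tests programmatically (works in a script or a notebook cell)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPreProcessVehAge)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The function is easy to test because it has no hidden dependencies: it takes a value and returns a value, without reading files or global state.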

**Key Structural Elements and Advice:**

1.  **Directory Structure:** A well-organized directory structure is the foundation. Consider these common approaches and adapt them to your project's needs:

    * **By Feature/Domain:** Group files and folders based on the high-level features or business domains of the application (e.g., `user_management/`, `product_catalog/`, `order_processing/`). This makes it easy to locate code related to a specific area.
    * **By Layer/Tier:** Organize code based on architectural layers (e.g., `presentation/` (UI), `application/` (business logic), `domain/` (entities), `data_access/` (database interactions)). This enforces separation of concerns.
    * **Hybrid Approach:** Combine elements of both feature-based and layer-based structuring. For example, within a feature directory, you might have subdirectories for different layers.

    **Advice:**
    * Keep the top-level directory structure clear and concise.
    * Use descriptive and consistent naming for directories.
    * Avoid deeply nested directory structures, as they can become difficult to navigate.
    * Place related files together.
    * Include a clear `README.md` at the root level explaining the project structure and conventions.

2.  **Module and Package Organization:** Within your directories, organize code into logical modules or packages (depending on your programming language).

    **Advice:**
    * Each module/package should have a clear responsibility.
    * Define clear interfaces for how modules interact with each other.
    * Minimize dependencies between modules to promote reusability and reduce coupling.
    * Use visibility modifiers (e.g., `private`, `public`) to control access to internal components of a module.

3.  **File Naming Conventions:** Consistent and descriptive file naming is crucial for quickly identifying the purpose of a file.

    **Advice:**
    * Adopt a consistent naming convention (e.g., snake\_case, camelCase).
    * Use names that clearly indicate the file's content or role (e.g., `user_service.py`, `product_model.js`, `auth_controller.java`).
    * Consider using prefixes or suffixes to indicate the type of file (e.g., `_test.js`, `.interface.ts`).

4.  **Code Style and Formatting:** Enforce a consistent code style across the project.

    **Advice:**
    * Adopt a widely accepted style guide for your programming language (e.g., PEP 8 for Python, Google Style Guides).
    * Use linters and formatters (e.g., ESLint, Prettier, Flake8) to automate code style enforcement.
    * Configure your IDE to automatically apply code formatting.

5.  **Dependency Management:** Clearly define and manage project dependencies.

    **Advice:**
    * Use a dedicated dependency management tool (e.g., `pip` for Python, `npm` or `yarn` for JavaScript, Maven or Gradle for Java).
    * Keep dependencies up-to-date for security and bug fixes.
    * Be mindful of transitive dependencies and potential conflicts.
    * Use a `requirements.txt` or `package.json` file to specify dependencies and their versions.

6.  **Documentation:** Include documentation at various levels.

    **Advice:**
    * **Project-level:** `README.md` explaining the project, setup instructions, and high-level architecture.
    * **Module/Package-level:** Overview documentation within module directories.
    * **Code-level:** Docstrings for functions, classes, and modules explaining their purpose, parameters, and return values.
    * Consider using documentation generators (e.g., Sphinx, JSDoc) for larger projects.

**General Advice for Large Projects:**

* **Start Early:** Establish a good code structure from the beginning of the project. It's much harder to refactor a poorly structured codebase later on.
* **Iterate and Refine:** Code structure is not static. Be prepared to adapt and refactor as the project evolves and your understanding grows.
* **Team Agreement:** Discuss and agree on coding standards and structural conventions with your team. Consistency is key for collaboration.
* **Code Reviews:** Regularly conduct code reviews to ensure adherence to the agreed-upon structure and identify potential issues early.
* **Learn from Others:** Study the structures of well-established open-source projects in your domain for inspiration and best practices.
* **Consider Architectural Patterns:** For very large projects, consider adopting established architectural patterns like Microservices, a Modular Monolith, or a Layered Architecture to provide a high-level blueprint for your code organization.

By following these principles and advice, you can create a well-structured codebase that is easier to understand, maintain, scale, and collaborate on, ultimately contributing to the success of your large projects. Remember that the "best" structure depends on the specific needs and complexity of your project, so adapt these guidelines accordingly.
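
As an illustration of the directory-structure advice, the sketch below creates a small project skeleton in a temporary directory. The layout and every file name in it are hypothetical; it is one possible arrangement for a small machine-learning project, not a recommended standard.

```python
import tempfile
from pathlib import Path

# Hypothetical layout grouping code by role; all names are illustrative
LAYOUT = [
    'README.md',
    'requirements.txt',
    'src/data/loading.py',
    'src/features/preprocessing.py',
    'src/models/train.py',
    'tests/test_preprocessing.py',
]


def create_skeleton(root, layout):
    """Create empty files (and their parent directories) under `root`."""
    root = Path(root)
    for relative in layout:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    # Return the created files as sorted POSIX-style relative paths
    return sorted(p.relative_to(root).as_posix() for p in root.rglob('*') if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(tmp, LAYOUT)

print('\n'.join(created))
```

The nesting is kept to three levels, and the tests live in their own top-level directory, in line with the advice to avoid deep hierarchies and to keep related files together.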

Example repositories:

- https://github.com/FlorianKrach/PD-NJODE
- https://github.com/HeKrRuTe/OptStopRandNN
