# Bayesian Analysis of Math Exam Scores

## Problem Statement

You are provided with a dataset of math exam scores from Estonian primary schools. Your task is to build a Bayesian model to better understand the factors that influence exam performance.

## Data Description

The dataset includes the following variables:
- `year`: The year of the exam (2016-2023)
- `county`: The county where the student took the exam
- `domestic_background`: The student's domestic background (Estonian or Russian)
- `gender`: Student's gender (Male/Female)
- `avg_grade`: The student's average grade (from 1 to 5, continuous)
- `exam_score`: The exam score (0-50 points)

## Your Task

1. Load and explore the data to understand its structure. 
2. Perform exploratory data analysis to investigate the relationships between variables. 
3. Create a DAG that represents the causal relationships between variables based on your domain knowledge. This DAG will help inform the structure of your Bayesian model, the choice of priors, and the conditional dependencies.
4. Based on the DAG, build a Bayesian model to analyze the following relationships:
   - How does the county influence exam scores?
   - What is the impact of domestic background (Estonian vs Russian) on exam performance?
   - Are there gender differences in exam scores?
   - How have exam scores changed over the years (2016-2023)?
5. Visualize and interpret your results. Specifically, try to answer the following questions:
   - What is the posterior probability of an `exam_score` of at least 45 for a student whose `avg_grade` before the exam is below 2.5?
   - What is the posterior distribution of the difference in exam performance between students of Estonian and Russian domestic backgrounds?
   - How has the posterior distribution of exam performance changed over the years (2016-2023)?
   - What is the posterior distribution of the relationship between `avg_grade` and `exam_score` across different counties?
   - What are the posterior distributions of gender differences in exam performance across counties?
   - What is the posterior probability that a student with an `avg_grade` of 4.5 will score at least 40 points?
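
The probability questions above reduce to counting draws once posterior predictive samples of `exam_score` exist for the condition of interest (for example, for students with `avg_grade` below 2.5). The sketch below shows only that counting step. The helper `event_probability` and the array `fake_draws` are hypothetical names, and the draws are random stand-ins, not output from a fitted model.

```python
import numpy as np


def event_probability(score_draws, threshold):
    """Share of posterior predictive draws at or above `threshold`."""
    draws = np.asarray(score_draws, dtype=float)
    return float(np.mean(draws >= threshold))


# Stand-in draws with shape (chain, draw). These are NOT real model output;
# in practice they would come from the model's posterior predictive samples.
rng = np.random.default_rng(0)
fake_draws = rng.normal(loc=42.0, scale=4.0, size=(4, 1000))

p = event_probability(fake_draws, threshold=40.0)
print(f"P(exam_score >= 40) is roughly {p:.3f} for the stand-in draws")
```

Averaging the indicator over all chains and draws treats every draw as one equally weighted sample from the posterior predictive distribution, which is the usual Monte Carlo estimate of an event probability.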


#### From Margus & Tarmo:
*If you want to build a good model, draw the causal graph and then really plot the relationships, as they are not simple here. But you should start with the simple linear model and then just go down the rabbit hole and see how deep it goes. A keyword that might help is "heteroskedasticity".*
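
One lightweight way to write a causal graph down before drawing it is as a list of cause-to-effect edges, checked for cycles with the standard library's `graphlib`. The edges below are a hypothetical starting point for illustration, not the intended answer, and `causal_order` is a helper name introduced here.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical edges (cause, effect). These are assumptions to be revised
# with domain knowledge, not a statement of the true causal structure.
edges = [
    ("county", "avg_grade"),
    ("domestic_background", "avg_grade"),
    ("gender", "avg_grade"),
    ("county", "exam_score"),
    ("domestic_background", "exam_score"),
    ("gender", "exam_score"),
    ("year", "exam_score"),
    ("avg_grade", "exam_score"),
]


def causal_order(edge_list):
    """Return the nodes so that every cause comes before its effects.

    Raises graphlib.CycleError if the edges do not form a DAG.
    """
    parents = {}
    for cause, effect in edge_list:
        parents.setdefault(effect, set()).add(cause)
        parents.setdefault(cause, set())
    return list(TopologicalSorter(parents).static_order())


order = causal_order(edges)
print(order)
```

In this sketch `avg_grade` sits on the path from the background variables to `exam_score`, so whether to condition on it depends on which effect (total or direct) a given question asks about.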

## Data Loading

In [None]:
import pandas as pd
import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data. Work with a 2000-row random sample at first so that you
# don't have to wait 10 minutes every time you sample the model.
data = pd.read_csv('math_exam_scores.csv', sep=";").sample(2000)

# Display first few rows
data.head()
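
Before modeling, it may be worth confirming that the sample matches the ranges given in the Data Description. The sketch below counts out-of-range rows. `check_ranges` is a hypothetical helper, and `demo` is a small made-up frame used here in place of the real `data`.

```python
import pandas as pd


def check_ranges(df):
    """Count rows that fall outside the documented ranges (NaN counts as outside)."""
    return {
        "year": int((~df["year"].between(2016, 2023)).sum()),
        "avg_grade": int((~df["avg_grade"].between(1, 5)).sum()),
        "exam_score": int((~df["exam_score"].between(0, 50)).sum()),
    }


# Made-up rows for illustration only; the last row violates every range.
demo = pd.DataFrame(
    {
        "year": [2016, 2020, 2023, 2024],
        "avg_grade": [3.2, 4.8, 1.0, 5.5],
        "exam_score": [25, 48, 3, 51],
    }
)

violations = check_ranges(demo)
print(violations)
```

Calling `check_ranges(data)` on the loaded sample would apply the same check to the real columns.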

## Hints
It might come in handy to know that:
- The `exam_score` is a linear combination of the `primary influencing factor` and the `sigmoid` of a linear combination of some `other factors`. There is no noise in the `sigmoid` function.
- The `sigmoid` is shifted down by 0.5 so that its range is (-0.5, 0.5). Furthermore, the `sigmoid` is weighted by `primary influencing factor * (5 - primary influencing factor)` to reduce the influence of the other factors at extreme values of the exam score (near 0 and 50).
- The noise in `exam_score` is not constant: its variance is smaller near the extreme values (0 and 50).
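
To make the hints concrete, here is a minimal simulation of one process that is consistent with them. Several details are assumptions that the hints do not state: that the primary factor `g` lives on the 1 to 5 grade scale, the values of `slope`, `weight`, and `noise_scale`, and that the noise standard deviation is proportional to `g * (5 - g)`. The sketch does not clip scores to the 0-50 range.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def simulate_scores(g, z, rng, slope=10.0, weight=2.0, noise_scale=0.5):
    """Simulate scores from a primary factor `g` and a linear predictor `z`.

    All parameter values and the noise form are assumptions for illustration.
    """
    bell = g * (5.0 - g)  # zero at g = 0 and g = 5, largest at g = 2.5
    mean = slope * g + weight * bell * (sigmoid(z) - 0.5)
    sd = noise_scale * bell  # ASSUMPTION: noise shrinks with the same weight
    return mean, rng.normal(mean, sd)


rng = np.random.default_rng(1)
n = 20_000
g = rng.uniform(1.0, 5.0, size=n)  # stand-in for the primary factor
z = rng.normal(0.0, 1.0, size=n)   # stand-in for the other factors combined
mean, score = simulate_scores(g, z, rng)

# Spread of the noise within bins of g: it should shrink toward g = 5.
resid = score - mean
bin_sd = {}
for lo, hi in [(1, 2), (2, 3), (3, 4), (4, 5)]:
    in_bin = (g >= lo) & (g < hi)
    bin_sd[(lo, hi)] = float(resid[in_bin].std())
    print(f"g in [{lo}, {hi}): noise sd is about {bin_sd[(lo, hi)]:.2f}")
```

Plotting `score` against `g` for data simulated this way gives a band that narrows toward the top of the scale, which is the pattern a constant-variance linear model cannot reproduce.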

Good luck with your analysis!