
# DataGenerator.ipynb — ECM443 Research Proposal
**Project:** Does using ChatGPT or AI tools improve student performance at Exeter?

**Purpose:** Simulate a realistic, anonymised survey dataset; export to CSV for analysis.

**Design notes (aligning with lectures):**
- **Data Wrangling (Week 2):** clean, well-typed columns; band sensitive values to minimise risk.
- **Ethics & GDPR (Week 4):** minimal personal data, no direct identifiers, banded prior attainment, marks clipped to a plausible range, consent assumed in simulation.
- **Reproducibility (Week 1):** fixed random seed; code & README-style comments.
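
As a quick check of the fixed-seed point above, the sketch below reseeds NumPy's global generator and confirms that the same seed reproduces the same draws. The helper name `draw_sample` and the seed 43 are illustrative assumptions; the notebook itself only calls `np.random.seed(42)` once.

```python
import numpy as np

def draw_sample(seed, n=5):
    """Reseed NumPy's global generator, then draw n normal values."""
    np.random.seed(seed)
    return np.random.normal(14, 4, n)

first = draw_sample(42)
second = draw_sample(42)

# Same seed gives identical draws; a different seed gives different draws
same_seed_matches = np.array_equal(first, second)
different_seed_matches = np.array_equal(first, draw_sample(43))
print(same_seed_matches, different_seed_matches)
```

Because the seed is global, re-running cells out of order changes the draws; restarting the kernel and running all cells top to bottom is the safest way to regenerate the same CSV.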


In [3]:

# Imports & setup
import numpy as np
import pandas as pd

# Reproducibility
np.random.seed(42)




### Data Management and Ethics Justification
This notebook simulates a synthetic dataset to explore how AI tool usage (e.g., ChatGPT) might relate to student performance at the University of Exeter.

No real student data are collected: all records are randomly generated from plausible value ranges, in line with UK Data Service (UKDS) ethical guidance on data minimisation and anonymisation.

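**Why data are banded:** Prior attainment is grouped into four bands (A*–A, B, C or below, Other), and average marks are clipped to a plausible academic range (35–90%). Clipping limits extreme values but does not band the marks, which remain continuous in this simulation.  
In a real survey, coarse bands would reduce the risk of re-identification and support the GDPR and UKDS principle of *proportionate data collection*.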
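
The simulated marks are stored as continuous values. A minimal sketch of how they could be grouped into bands before sharing, assuming band edges at the usual UK degree classification boundaries (40, 50, 60, 70); the edges, labels, and sample marks are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Illustrative marks (not taken from the simulated dataset)
marks = pd.Series([38.2, 45.0, 56.5, 64.9, 71.7, 90.0])

# Assumed band edges; right=False makes each band include its lower edge
bins = [0, 40, 50, 60, 70, 101]
labels = ["<40", "40-49", "50-59", "60-69", "70+"]

mark_band = pd.cut(marks, bins=bins, labels=labels, right=False)
print(mark_band.tolist())
# ['<40', '40-49', '50-59', '60-69', '70+', '70+']
```

Banding trades precision for privacy, so the analysis notebook would need ordinal methods if it worked from the banded column alone.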


## Parameters and categorical domains
We keep categories compact to ensure the real survey would take ≤ 5 minutes.


In [5]:

# Sample size
N = 300  # increase to 400+ for more statistical power

# Categories
programmes = ["BSc", "BA", "MSc"]
years = [1, 2, 3, "PGT"]
ai_freq = ["Never", "Monthly", "Weekly", "2-3x/week", "Daily"]
prior_attainment = ["A*–A", "B", "C or below", "Other"]

# Helper mapping for AI intensity
ai_map = {"Never": 0, "Monthly": 1, "Weekly": 2, "2-3x/week": 3, "Daily": 4}



## Generate synthetic responses
We embed a **modest, non-linear** positive effect of AI usage on marks: the benefit rises up to "2-3x/week" use and tapers slightly for daily use.
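
To see the shape of this effect, the sketch below tabulates the quadratic used in the generating cell, `3.0 * level - 0.5 * level ** 2`, at each usage level. The function name `ai_effect` and the `effects` dictionary are introduced here for illustration only.

```python
# Same mapping as the notebook's ai_map
ai_map = {"Never": 0, "Monthly": 1, "Weekly": 2, "2-3x/week": 3, "Daily": 4}

def ai_effect(level):
    """Quadratic AI effect on marks, in percentage points."""
    return 3.0 * level - 0.5 * level ** 2

effects = {label: ai_effect(level) for label, level in ai_map.items()}
peak = max(effects, key=effects.get)

print(effects)
# {'Never': 0.0, 'Monthly': 2.5, 'Weekly': 4.0, '2-3x/week': 4.5, 'Daily': 4.0}
print(peak)
# 2-3x/week
```

The taper is mild: daily users still sit 4 points above non-users on average, so an analysis may need a quadratic term or categorical coding to detect the curvature against noise with a standard deviation of 5.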


In [7]:

# Base frame
df = pd.DataFrame({
    "Programme": np.random.choice(programmes, N),
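    # Year is drawn independently of Programme, so unrealistic pairs (e.g. MSc with Year 3) can occur.
    # The mixed int/str list also means NumPy returns every Year value as a string.
    "Year": np.random.choice(years, N),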
    "Prior_Attainment": np.random.choice(prior_attainment, N, p=[0.5, 0.3, 0.15, 0.05]),
    "Study_Hours": np.clip(np.random.normal(14, 4, N), 2, 35),
    "Attendance": np.clip(np.random.normal(80, 10, N), 40, 100),
    "Sleep_Quality": np.clip(np.random.normal(3.8, 0.8, N), 1, 5),
    "AI_Frequency": np.random.choice(ai_freq, N, p=[0.15, 0.10, 0.35, 0.25, 0.15]),
})

# Numerical encodings for modelling
df["AI_Level"] = df["AI_Frequency"].map(ai_map).astype(int)

# "True" data-generating process (DGP)
marks_base = 55.0
# Non-linear AI effect: rises to +4.5 marks at level 3, then tapers to +4.0 at level 4
ai_effect = 3.0 * df["AI_Level"] - 0.5 * (df["AI_Level"] ** 2)

# Other covariates (drawn independently of AI use, so they add variation but do not confound the AI effect here)
prior_bonus = (df["Prior_Attainment"] == "A*–A").astype(float) * 4.0
study_bonus = 0.20 * df["Study_Hours"]
attendance_bonus = 0.05 * df["Attendance"]
sleep_bonus = (df["Sleep_Quality"] >= 4).astype(float) * 1.0

# Compose average mark with noise
noise = np.random.normal(0, 5, N)
df["Average_Mark"] = (
    marks_base + ai_effect + prior_bonus + study_bonus + attendance_bonus + sleep_bonus + noise
)
df["Average_Mark"] = df["Average_Mark"].clip(35, 90)

# Derive coursework & exam marks from the average with different mean offsets and noise
# (coursework +2 on average, exam -1; no AI-specific sensitivity is modelled here)
df["Coursework_Mark"] = (df["Average_Mark"] + np.random.normal(2, 3, N)).clip(35, 95)
df["Exam_Mark"] = (df["Average_Mark"] + np.random.normal(-1, 4, N)).clip(30, 95)

# Optional: introduce small missingness (MCAR) for realism
mask = np.random.rand(N) < 0.03  # 3% missing in Average_Mark
df.loc[mask, "Average_Mark"] = np.nan



## Export CSV and quick preview


In [11]:

csv_path = "simulated_exeter_ai_study.csv"  # relative path

# Validate structure and completeness before export
df.info()  # shows column names, data types, and missing values

df.to_csv(csv_path, index=False)
csv_path, df.head()



('simulated_exeter_ai_study.csv',
   Programme Year Prior_Attainment  Study_Hours  Attendance  Sleep_Quality  \
 0       MSc    3             A*–A    15.986313   81.754211       3.851085   
 1       BSc  PGT       C or below    13.222994   93.885911       3.896472   
 2       MSc    1       C or below    13.175044   92.301613       3.484463   
 3       MSc    1                B     4.888198   88.360611       3.458171   
 4       BSc  PGT       C or below    17.189937   53.527471       3.223368   
 
   AI_Frequency  AI_Level  Average_Mark  Coursework_Mark  Exam_Mark  
 0       Weekly         2     56.486986        56.140179  52.875154  
 1       Weekly         2     71.655769        70.416697  75.509294  
 2        Daily         4     60.106087        64.762464  59.358804  
 3        Daily         4     65.459394        69.826796  69.700933  
 4        Daily         4     57.286002        59.523002  53.868035  )

### Data Validation Summary
- The dataset contains 300 synthetic student records.
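- All columns are complete except `Average_Mark`, where about 3% of values are deliberately set to missing (MCAR). Columns are correctly typed: `object` for categories, an integer type for `AI_Level`, and `float64` for the other numeric values.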
- The structure is intended to follow the FAIR data principles (findable, accessible, interoperable, reusable).
- This simulated dataset will be used for reproducible analysis in *DataAnalysis.ipynb*.
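
The checks summarised above could be scripted so they run again whenever the CSV is regenerated. The following is a minimal sketch under stated assumptions: the helper `summarise` and the tiny `demo` frame are hypothetical stand-ins, and the CSV is round-tripped in memory rather than read from `simulated_exeter_ai_study.csv`.

```python
import io

import numpy as np
import pandas as pd

def summarise(frame):
    """Return row count, per-column missing share, and dtypes as strings."""
    return {
        "rows": len(frame),
        "missing_share": frame.isna().mean().to_dict(),
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }

# Tiny stand-in frame (illustrative values, not the simulated data)
demo = pd.DataFrame({
    "AI_Frequency": ["Weekly", "Daily", "Never", "Weekly"],
    "AI_Level": [2, 4, 0, 2],
    "Average_Mark": [61.5, np.nan, 58.0, 66.2],
})

# Round-trip through CSV in memory to mimic export followed by re-import
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

report = summarise(reloaded)
print(report["rows"], report["missing_share"]["Average_Mark"])
# 4 0.25
```

Applied to the real export, the same report would be expected to show 300 rows and a missing share near 0.03 for `Average_Mark` only.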