# Business Understanding & Data Collection

**Problem statement**

Within the current study programme, students often struggle to choose a suitable elective module (vrije keuzemodule, VKM). The offering is large, the available information is scattered, and guidance in making the choice is limited. As a result, students sometimes make decisions that do not align well with their interests, values, or career goals.

The goal of this project is therefore to develop a Smart Study Coach: an AI application that supports students in choosing a suitable elective module. The application will analyze the student profile and make personalized recommendations based on its similarity to the available modules.

**Societal relevance**

Students' choices directly affect their motivation, study success, and well-being. By guiding students better, this application can contribute to:
1. Higher study engagement and motivation.
2. Fewer study delays and fewer poorly fitting choices.
3. A study path that better matches personal and professional goals.

The Smart Study Coach fits into the broader societal trend of responsible AI in education, in which technology is used to promote equal opportunities and personal development.

**Ethics and privacy (EU AI Act 2025 / GDPR)**

When developing an AI system for study advice, care must be taken to comply with the ethical and legal rules. The following principles, which derive directly or indirectly from the EU AI Act 2025 and the GDPR (Dutch: AVG), apply to our AI system:
1. Transparency & explainability: The recommendations of the Smart Study Coach must be understandable and explainable, so that students know why a module is suggested.
2. Privacy by Design: Students' personal data (such as interests or values) is used solely for the recommendation purpose and stored securely in line with the GDPR.
3. Data minimization: Only data that is strictly necessary for the model to function is collected.
4. Human oversight: The final choice always remains with the student; the AI serves as a supporting tool, not as a decision-maker.

# Smart Study Coach — Data Exploration, Cleaning & Recommendation

This notebook is organized in **three parts**:
1. **Exploration** — quick overview of the raw dataset, counts (unique tags, locations), and column error analysis.
2. **Cleaning & Preprocessing** — the cleaning pipeline with explanations and functions. Produces a cleaned CSV.
3. **NLP & Recommendation Engine** — text cleaning, TF-IDF vectorization, cosine similarity and a `recommend()` helper.

The notebook reads the dataset from `Uitgebreide_VKM_dataset.csv` and writes a cleaned file to `Uitgebreide_VKM_dataset_zonder_weird_data.csv`.


---
## Part 1 — Exploration
We begin by loading the dataset and computing three analyses:
- number of unique tags (module_tags)
- distribution of courses per `location`
- column-wise counts of empty / weird / ntb values (pre-cleaning)


In [19]:
import pandas as pd
import numpy as np
import re
from collections import Counter

# Load dataset
raw_path = "Uitgebreide_VKM_dataset.csv"
df_raw = pd.read_csv(raw_path)
print(f"Loaded dataset with {len(df_raw)} rows and {len(df_raw.columns)} columns")

# Quick peek
display(df_raw.head())


Loaded dataset with 211 rows and 20 columns


| | id | name | shortdescription | description | content | studycredit | location | contact_id | level | learningoutcomes | Rood | Groen | Blauw | Geel | module_tags | interests_match_score | popularity_score | estimated_difficulty | available_spots | start_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 159 | Kennismaking met Psychologie | Brein, gedragsbeinvloeding, ontwikkelingspsych... | In deze module leer je hoe je gedrag van jezel... | In deze module leer je hoe je gedrag van jezel... | 15 | Den Bosch | 58 | NLQF5 | A. Je beantwoordt vragen in een meerkeuze kenn... | 4.0 | 2.0 | 1.0 | 5.0 | ['brein', 'gedragsbeinvloeding', 'ontwikkeling... | 0.54 | 319 | 1 | 79 | 2025-12-24 |
| 1 | 160 | Learning and working abroad | Internationaal, persoonlijke ontwikkeling, ver... | Studenten kiezen binnen de (stam) van de oplei... | Studenten kiezen binnen de (stam) van de oplei... | 15 | Den Bosch | 58 | NLQF5 | De student toont professioneel gedrag conform ... | 5.0 | 3.0 | 1.0 | 1.0 | ['internationaal', 'persoonlijke', 'ontwikkeli... | 0.92 | 172 | 5 | 56 | 2025-12-20 |
| 2 | 161 | Proactieve zorgplanning | Proactieve zorgplanning, cocreatie, ziekenhuis | Het Jeroen Bosch ziekenhuis wil graag samen me... | Het Jeroen Bosch ziekenhuis wil graag samen me... | 15 | Den Bosch | 59 | NLQF5 | De student past pro actieve zorgplanning toe b... | | | | | ['proactieve', 'zorgplanning', 'cocreatie', 'z... | 0.78 | 217 | 5 | 55 | 2025-09-23 |
| 3 | 162 | Rouw en verlies | Rouw & verlies, palliatieve zorg & redeneren, ... | In deze module wordt stil gestaan bij rouw en ... | In deze module wordt stil gestaan bij rouw en ... | 30 | Den Bosch | 58 | NLQF6 | De student regisseert en voert (deels) zelfsta... | | | | | ['rouw', 'verlies', 'palliatieve', 'zorg', 're... | 0.69 | 454 | 1 | 54 | 2025-10-25 |
| 4 | 163 | Acuut complexe zorg | Acute zorg, complexiteit, ziekenhuis, revalidatie | In deze module kunnen studenten zich verdiepen... | In deze module kunnen studenten zich verdiepen... | 30 | Den Bosch | 58 | NLQF6 | De student regisseert en voert (deels) zelfsta... | | | | | ['acute', 'zorg', 'complexiteit', 'ziekenhuis'... | 0.4 | 178 | 5 | 38 | 2025-11-19 |


In [20]:
empty_values = ["", "nan", "none", "null", "[]"]
weird_values = [
    "nvt", "volgt", "ntb", "nader te bepalen", "nog niet bekend",
    "nadert te bepalen", "nog te formuleren", "tbd", "n.n.b.", "navragen", "['ntb']"
]

def is_empty(value):
    if value is None or (isinstance(value, float) and np.isnan(value)):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

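def is_weird(value):
    # Note: this is a substring check, so longer text that merely contains
    # one of the phrases (e.g. "volgt" inside a sentence) is flagged as well.
    if not isinstance(value, str):
        return False
    val = value.lower().strip()
    return any(w in val for w in weird_values)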

def is_ntb(value):
    return isinstance(value, str) and value.strip().lower() == "ntb"

def analyze_dataframe_simple(df_in):
    analysis = []
    for col in df_in.columns:
        total = len(df_in[col])
        empty_count = df_in[col].apply(is_empty).sum()
        weird_count = df_in[col].apply(is_weird).sum()
        ntb_count = df_in[col].apply(is_ntb).sum()
        general_error_count = empty_count + weird_count
        general_error_percent = round((general_error_count / total) * 100, 2)
        analysis.append({
            "column": col,
            "empty_values": int(empty_count),
            "empty_%": round((empty_count / total) * 100, 2),
            "weird_values": int(weird_count),
            "weird_%": round((weird_count / total) * 100, 2),
            "ntb": int(ntb_count),
            "ntb_%": round((ntb_count / total) * 100, 2),
            "general_error_total": int(general_error_count),
            "general_error_%": general_error_percent
        })
    # Unique tags analysis (module_tags column assumed)
    if 'module_tags' in df_in.columns:
        # Split tag strings by common separators and count unique tags
        tags_series = df_in['module_tags'].fillna('')
        tag_counter = Counter()
        for t in tags_series.astype(str):
            # consider comma, semicolon, pipe and slash as separators
            parts = re.split(r"[,;/\\|]+", t)
            for p in parts:
                p = p.strip().lower()
                if p and p not in empty_values and p not in weird_values:
                    tag_counter[p] += 1
        print(f"Found {len(tag_counter)} unique tags")
        top_tags = tag_counter.most_common(30)
        display(pd.DataFrame(top_tags, columns=['tag','count']))
    else:
        print("No 'module_tags' column found in the dataset")

    # Location distribution
    if 'location' in df_in.columns:
        loc_counts = df_in['location'].fillna('ntb').astype(str).str.lower().value_counts()
        display(loc_counts.head(50))
    else:
        print("No 'location' column found in the dataset")
    analysis_df = pd.DataFrame(analysis).sort_values(by="general_error_%", ascending=False)
    return analysis_df

print("Column analysis (pre-cleaning):")
col_analysis_pre = analyze_dataframe_simple(df_raw)
display(col_analysis_pre)




Column analysis (pre-cleaning):
Found 829 unique tags


| | tag | count |
|---|---|---|
| 0 | 'en' | 25 |
| 1 | 'in' | 13 |
| 2 | 'design' | 13 |
| 3 | 'ontwikkeling' | 11 |
| 4 | 'zorg' | 11 |
| 5 | 'welzijn' | 10 |
| 6 | 'data' | 10 |
| 7 | 'chemie' | 9 |
| 8 | 'persoonlijke' | 8 |
| 9 | 'smart' | 8 |


location
breda                   105
den bosch                55
breda en den bosch       28
den bosch en tilburg     15
tilburg                   8
Name: count, dtype: int64
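
The distribution above treats combined entries such as `breda en den bosch` as separate categories. If a per-campus count is more useful, one option is to split on the Dutch word "en" first. This is only a sketch: the helper name `count_campuses` and the sample list are illustrative, and it assumes " en " is the only separator in use.

```python
from collections import Counter

def count_campuses(locations):
    """Count modules per campus, splitting combined values like 'breda en den bosch'."""
    counter = Counter()
    for loc in locations:
        # ' en ' (Dutch for 'and') is assumed to be the only separator
        for campus in loc.lower().split(" en "):
            campus = campus.strip()
            if campus:
                counter[campus] += 1
    return counter

sample = ["Breda", "Den Bosch", "Breda en Den Bosch", "Den Bosch en Tilburg"]
print(count_campuses(sample))
```

Note that a module offered at two campuses is counted once for each, so the totals exceed the number of rows.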

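| | column | empty_values | empty_% | weird_values | weird_% | ntb | ntb_% | general_error_total | general_error_% |
|---|---|---|---|---|---|---|---|---|---|
| 12 | Blauw | 209 | 99.05 | 0 | 0.0 | 0 | 0.0 | 209 | 99.05 |
| 13 | Geel | 209 | 99.05 | 0 | 0.0 | 0 | 0.0 | 209 | 99.05 |
| 11 | Groen | 209 | 99.05 | 0 | 0.0 | 0 | 0.0 | 209 | 99.05 |
| 10 | Rood | 209 | 99.05 | 0 | 0.0 | 0 | 0.0 | 209 | 99.05 |
| 9 | learningoutcomes | 5 | 2.37 | 60 | 28.44 | 26 | 12.32 | 65 | 30.81 |
| 2 | shortdescription | 20 | 9.48 | 10 | 4.74 | 10 | 4.74 | 30 | 14.22 |
| 14 | module_tags | 0 | 0.0 | 10 | 4.74 | 0 | 0.0 | 10 | 4.74 |
| 3 | description | 0 | 0.0 | 8 | 3.79 | 2 | 0.95 | 8 | 3.79 |
| 4 | content | 0 | 0.0 | 8 | 3.79 | 2 | 0.95 | 8 | 3.79 |
| 1 | name | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |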
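
The tag counts in this output keep their quote characters (for example `'en'`), which suggests that `module_tags` stores stringified Python lists that the separator-based split does not fully unpack. A hedged alternative, assuming every cell looks like `"['a', 'b']"`, is to parse the cell with `ast.literal_eval`. The helper `parse_tags` is hypothetical and was not used to produce the outputs shown in this notebook.

```python
import ast
from collections import Counter

def parse_tags(cell):
    """Parse a stringified list such as "['zorg', 'welzijn']" into clean tags."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal: treat as no tags
    if not isinstance(parsed, (list, tuple)):
        return []
    return [str(t).strip().lower() for t in parsed if str(t).strip()]

# Invented sample cells, including a placeholder and an empty string
cells = ["['zorg', 'welzijn']", "['Zorg', 'data']", "ntb", ""]
tag_counter = Counter(tag for cell in cells for tag in parse_tags(cell))
print(tag_counter.most_common())
```

Parsing the list first would also keep multi-word tags intact, at the cost of treating malformed cells as empty.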


---
## Part 2 — Cleaning & Preprocessing
We perform the cleaning steps with explanations. The strategy is:
1. Drop irrelevant columns (colors) if present.
2. Normalize values to string, lowercase and trim.
3. Replace literal empty indicators with `ntb` (Dutch *nader te bepalen*, "to be determined").
4. Apply a safe regex replacement for known weird phrases (only if cell content exactly matches one of them).
5. Smart-fill `shortdescription` from `description` and `content` where available.
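
Step 4 depends on anchoring the pattern so that only whole-cell matches are replaced. The following sketch uses plain `re` with a shortened phrase list and invented sample cells to show how that differs from a substring check; `normalize_cell` is an illustrative name.

```python
import re

weird_phrases = ["nvt", "volgt", "ntb", "nader te bepalen"]  # shortened list
# The ^ and $ anchors force the whole cell to equal one of the phrases
exact_pattern = re.compile(
    r"^\s*(" + "|".join(re.escape(p) for p in weird_phrases) + r")\s*$"
)

def normalize_cell(value):
    """Return 'ntb' only when the entire cell is a placeholder phrase."""
    return "ntb" if exact_pattern.match(value) else value

cells = ["volgt", "  nader te bepalen ", "de student volgt een stage"]
print([normalize_cell(c) for c in cells])
# ['ntb', 'ntb', 'de student volgt een stage']
```

The third cell contains the word "volgt" but is left untouched, because the sentence as a whole is real content.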


In [None]:
df = df_raw.copy()
cols_to_drop = ["Rood", "Groen", "Blauw", "Geel"]
df = df.drop(columns=[c for c in cols_to_drop if c in df.columns])

# Convert to string (safe for TF-IDF later) and normalize
df = df.fillna('')
for col in df.columns:
    # Cast to string for consistent processing
    df[col] = df[col].astype(str)
    df[col] = df[col].str.lower().str.strip()

# Replace explicit empty-like strings with 'ntb'
for val in empty_values:
    df.replace(val, 'ntb', inplace=True)

# Safe regex for weird_values: only replace if the entire cell equals the weird phrase
safe_pattern = r'^\s*(' + '|'.join([re.escape(v) for v in weird_values]) + r')\s*$'
for col in df.columns:
    df[col] = df[col].replace(to_replace=safe_pattern, value='ntb', regex=True)

def fill_short_smart(row):
    short = row.get('shortdescription', 'ntb')
    if short and short != 'ntb':
        return short
    desc = row.get('description', 'ntb')
    content = row.get('content', 'ntb')
    valid_desc = desc and desc != 'ntb'
    valid_content = content and content != 'ntb'
    if valid_desc and valid_content:
        if desc == content:
            return desc
        return f"{desc} {content}"
    if valid_desc:
        return desc
    if valid_content:
        return content
    return 'ntb'

if 'shortdescription' in df.columns:
    print("Filling shortdescription using description/content where needed...")
    df['shortdescription'] = df.apply(fill_short_smart, axis=1)
else:
    print("No shortdescription column found; skipping smart fill.")

print('\nAnalysis after cleaning:')
col_analysis_post = analyze_dataframe_simple(df)
display(col_analysis_post)
# Save cleaned file
out_path = 'Uitgebreide_VKM_dataset_zonder_weird_data.csv'
df.to_csv(out_path, index=False)
print(f"Cleaned file written to: {out_path}")


Filling shortdescription using description/content where needed...

Analysis after cleaning:
Found 829 unique tags


| | tag | count |
|---|---|---|
| 0 | 'en' | 25 |
| 1 | 'in' | 13 |
| 2 | 'design' | 13 |
| 3 | 'ontwikkeling' | 11 |
| 4 | 'zorg' | 11 |
| 5 | 'welzijn' | 10 |
| 6 | 'data' | 10 |
| 7 | 'chemie' | 9 |
| 8 | 'persoonlijke' | 8 |
| 9 | 'smart' | 8 |


location
breda                   105
den bosch                55
breda en den bosch       28
den bosch en tilburg     15
tilburg                   8
Name: count, dtype: int64

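| | column | empty_values | empty_% | weird_values | weird_% | ntb | ntb_% | general_error_total | general_error_% |
|---|---|---|---|---|---|---|---|---|---|
| 9 | learningoutcomes | 0 | 0.0 | 65 | 30.81 | 62 | 29.38 | 65 | 30.81 |
| 10 | module_tags | 0 | 0.0 | 30 | 14.22 | 30 | 14.22 | 30 | 14.22 |
| 3 | description | 0 | 0.0 | 8 | 3.79 | 4 | 1.9 | 8 | 3.79 |
| 4 | content | 0 | 0.0 | 8 | 3.79 | 4 | 1.9 | 8 | 3.79 |
| 2 | shortdescription | 0 | 0.0 | 4 | 1.9 | 2 | 0.95 | 4 | 1.9 |
| 5 | studycredit | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 0 | id | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 1 | name | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 7 | contact_id | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 6 | location | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 | 0 | 0.0 |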


Cleaned file written to: Uitgebreide_VKM_dataset_zonder_weird_data.csv
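
Because `ntb` is itself one of the entries in `weird_values`, the `weird_values` column of the post-cleaning analysis also counts cells that were just set to the placeholder. A possible way to report the two separately is sketched below; `split_counts`, the shortened phrase list, and the sample column are all illustrative.

```python
weird_phrases = ["nvt", "volgt", "ntb", "nader te bepalen"]  # shortened list

def split_counts(values, placeholder="ntb"):
    """Return (placeholder_count, other_weird_count) for a list of cleaned cells."""
    placeholder_count = 0
    other_weird = 0
    for v in values:
        v = v.strip().lower()
        if v == placeholder:
            placeholder_count += 1
        elif any(p in v for p in weird_phrases):
            other_weird += 1  # substring hit inside longer text
    return placeholder_count, other_weird

column = ["ntb", "ntb", "het rooster volgt later", "je leert programmeren"]
print(split_counts(column))  # (2, 1)
```

The second number then isolates the cells that still contain a flagged phrase inside longer text, which are the ones worth inspecting by hand.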


---
## Part 3 — NLP & Recommendation Engine
We clean the text for NLP (stop-word removal, then Dutch stemming or English lemmatization), vectorize the combined text with TF-IDF, and compute the pairwise cosine similarity between modules.
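
To make the idea concrete, here is a hand-rolled sketch of TF-IDF weighting and cosine similarity on three invented module texts. It uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 normalization, which is my understanding of scikit-learn's defaults; tokenization is a plain whitespace split, so results will not match `TfidfVectorizer` exactly. The notebook itself relies on `TfidfVectorizer` and `cosine_similarity`, not on this code.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (their dot product)."""
    return sum(w * b[t] for t, w in a.items() if t in b)

docs = ["zorg welzijn psychologie", "zorg ziekenhuis revalidatie", "data design smart"]
vecs = tfidf_vectors(docs)
print(round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

The first pair shares the term `zorg` and therefore gets a positive score, while the second pair shares nothing and scores zero.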


In [22]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Ensure NLTK resources are available (the notebook will attempt to download if missing)
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
try:
    WordNetLemmatizer()
    nltk.data.find('corpora/wordnet')
except Exception:
    nltk.download('wordnet')

# Prepare stopwords
stop_words = set(stopwords.words('english')) | set(stopwords.words('dutch'))
lemmatizer_en = WordNetLemmatizer()
stemmer_nl = SnowballStemmer('dutch')

def detect_language(text):
    dutch_keywords = ["de","het","een","en","je","jij","wij","zijn","module","leren","opleiding"]
    english_keywords = ["the","a","an","and","is","are","course","learn"]
    text_low = text.lower()
    nl_score = sum(1 for w in dutch_keywords if w in text_low)
    en_score = sum(1 for w in english_keywords if w in text_low)
    return 'nl' if nl_score >= en_score else 'en'

def clean_text_nlp(text):
    if not isinstance(text, str) or text.strip() == '' or text.lower() in ['ntb','tbd','nader te bepalen']:
        return 'ntb'
    text = text.lower()
    text = re.sub(r"[^a-zA-Záéíóúàèçäëïöüñ\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    lang = detect_language(text)
    words = [w for w in text.split() if w not in stop_words]
    if lang == 'nl':
        words = [stemmer_nl.stem(w) for w in words]
    else:
        words = [lemmatizer_en.lemmatize(w) for w in words]
    return ' '.join(words) if words else 'ntb'

print('Applying NLP cleaning to shortdescription and description...')
df['shortdescription'] = df['shortdescription'].apply(clean_text_nlp) if 'shortdescription' in df.columns else ''
df['description'] = df['description'].apply(clean_text_nlp) if 'description' in df.columns else ''

for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.lower()

df['combined_text'] = (
    df.get('name', pd.Series(['']*len(df))).astype(str) + ' ' +
    df.get('shortdescription', pd.Series(['']*len(df))).astype(str) + ' ' +
    df.get('module_tags', pd.Series(['']*len(df))).astype(str) + ' ' +
    df.get('location', pd.Series(['']*len(df))).astype(str)
)

combined_stopwords = list(set(stopwords.words('dutch') + stopwords.words('english')))
vectorizer = TfidfVectorizer(stop_words=combined_stopwords, min_df=1)
matrix = vectorizer.fit_transform(df['combined_text'])
print(f"TF-IDF matrix shape: {matrix.shape}")
similarities = cosine_similarity(matrix)
similarity_df = pd.DataFrame(similarities, index=df.get('name', pd.Series(range(len(df)))), columns=df.get('name', pd.Series(range(len(df)))))

def recommend(module_name, similarity_df, top_n=5):
    if module_name not in similarity_df.index:
        print('Module not found. Showing top items from dataset index:')
        print(list(similarity_df.index)[:10])
        return []
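    # Drop the module itself explicitly instead of assuming it always sorts first
    recs = similarity_df.loc[module_name].drop(labels=module_name).sort_values(ascending=False).head(top_n)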
    results = [(name, float(score)) for name, score in recs.items()]
    print(f"Recommendations for '{module_name}':")
    for name, score in results:
        print(f"- {name} (score={score:.3f})")
    return results

# Example
example_name = df['name'].iloc[0] if 'name' in df.columns else None
if example_name is not None:
    recommend(example_name, similarity_df)
else:
    print('No name column found to demonstrate recommendations')


Applying NLP cleaning to shortdescription and description...
TF-IDF matrix shape: (211, 1615)
Recommendations for 'kennismaking met psychologie':
- minor forensisch onderzoek in de rechtbank- (if/ka) (score=0.099)
- de stem van je geweten. ga opzoek naar jouw moreel kompas. (score=0.039)
- tutorial club (score=0.027)
- business innovation (score=0.019)
- animatie / storytelling (score=0.019)


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Storm\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
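
`detect_language` in the cell above tests its keywords with substring matching, so a short keyword such as `a` or `de` also matches inside longer words. A possible token-based variant is sketched below. The name `detect_language_tokens` is hypothetical, the keyword lists are copied from the cell above, and this variant was not used to produce the recommendations shown.

```python
DUTCH_KEYWORDS = {"de", "het", "een", "en", "je", "jij", "wij", "zijn",
                  "module", "leren", "opleiding"}
ENGLISH_KEYWORDS = {"the", "a", "an", "and", "is", "are", "course", "learn"}

def detect_language_tokens(text):
    """Guess 'nl' or 'en' by counting whole-word keyword hits; ties go to 'nl'."""
    tokens = set(text.lower().split())
    nl_score = len(tokens & DUTCH_KEYWORDS)
    en_score = len(tokens & ENGLISH_KEYWORDS)
    return "nl" if nl_score >= en_score else "en"

print(detect_language_tokens("learn about data and design in this course"))  # en
print(detect_language_tokens("in deze module leer je over het brein"))       # nl
```

As in the original function, a text with no keyword hits at all falls back to Dutch.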


---
### Notes & Next steps
- The notebook saves a cleaned CSV as `Uitgebreide_VKM_dataset_zonder_weird_data.csv` in the working directory.
- Tuning the TF-IDF `min_df` parameter and the stop-word list may improve the recommendations.
- Plots (e.g. bar charts for top tags or locations) would fit naturally in the Exploration section.
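
To illustrate what raising `min_df` does, the sketch below counts in how many documents each term occurs and drops the terms below the threshold, which is how I understand an integer `min_df` in `TfidfVectorizer` to behave. The documents and the helper `vocabulary_for_min_df` are invented for illustration.

```python
from collections import Counter

def vocabulary_for_min_df(docs, min_df=1):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(term for term, freq in doc_freq.items() if freq >= min_df)

docs = ["zorg welzijn brein", "zorg ziekenhuis", "data design", "data zorg"]
for threshold in (1, 2):
    print(threshold, vocabulary_for_min_df(docs, threshold))
# 1 ['brein', 'data', 'design', 'welzijn', 'ziekenhuis', 'zorg']
# 2 ['data', 'zorg']
```

A higher threshold removes rare terms, which shrinks the vocabulary but can also discard the distinctive words that set a module apart.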
