# SC1003 FDAB Group 5 Mini Project Presentation


## 1. Overall interpretation of Task and Priorities

To fulfil our task of creating diverse teams, we interpreted the concept of 'diversity' to mean having a variety of CGPAs, genders and school affiliations. Such diversity is important as it allows us to create fair and inclusive teams that bring together people with different perspectives, experiences, and strengths. To achieve such fairness, each team should have members of differing academic abilities, genders and school backgrounds.

Our top priorities are CGPA and gender, followed by school affiliation. We feel that CGPA is the factor that will most strongly affect a team's performance in the mini project, so each team should have an even spread of stronger and weaker members. Gender diversity is equally important, so we ensure that each team has either 3 males and 2 females or 3 females and 2 males whenever possible.

### 1.1 Thought Process and Computational Thinking Concepts Used + flowchart

We decompose our problem according to the following flowchart:
```mermaid
graph TD
    A[Start] --> B[Read csv file to create a list of dictionaries representing each student];
    B --> C[For each tutorial group, create 5 CGPA bands separately for each gender];
    C --> D[Determine gender ratio target for each tutorial group];
    D --> E[Assign 1 person from each CGPA band to each group of 5, respecting the gender target];
    E --> F[End];
```

## 2. Creation of dictionary of students

### 2a. read_students(start_index=0, count=50, filepath="records.csv")
### Key considerations (Thought Process):
Since we do not know the number of students in advance, we simplify the problem by first writing one loop that handles the first 50 students.

1. Open the file using a `with` statement (context manager), specifying UTF-8 encoding to **handle special characters**.
2. Set `newline=''` so that the CSV file is handled properly across different operating systems.
3. Use `csv.reader()` to read the file line by line; reading **stops once the file runs out of lines**.
4. Skip the header to **prevent counting the header as a student**.
5. In `count_total_students`, `return sum(1 for row in reader)` generates an integer 1 each time the CSV reader goes through a row and sums all the 1s, which effectively counts all the rows.

#### The counting function `count_total_students` is:
- **memory efficient** (it does not load all the rows into system memory)
- **fast** (it makes a single pass over the file)
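
The counting idea in point 5 can be tried without the real file. The sketch below is only an illustration: `count_rows` is a hypothetical helper, and `SAMPLE_CSV` holds made-up rows (the `Student ID`, `Name` and `Gender` header names are assumptions, not taken from `records.csv`).

```python
import csv
import io

# Made-up, in-memory stand-in for records.csv (illustrative rows only).
SAMPLE_CSV = (
    "Tutorial Group,Student ID,School,Name,Gender,CGPA\n"
    "G-1,1001,School A,Student A,Male,4.10\n"
    "G-1,1002,School B,Student B,Female,3.95\n"
    "G-1,1003,School C,Student C,Female,4.30\n"
)


def count_rows(file_obj):
    """Count data rows (excluding the header) without storing them in a list."""
    reader = csv.reader(file_obj)
    next(reader, None)  # skip the header row; the default avoids an error on empty input
    # The generator yields one 1 per row, so only a running total is kept in memory.
    return sum(1 for _ in reader)


print(count_rows(io.StringIO(SAMPLE_CSV)))  # 3
```

Because the generator expression is consumed row by row, memory use stays roughly constant no matter how many rows the file has.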

In [None]:
import csv
import itertools
import os
import random
import sys


def read_students(start_index=0, count=50, filepath="records.csv"):
    """Read up to count students starting from start_index (after header).
    Returns a list of student dicts with keys: tutorial group, id, school, name, gender, gpa
    """
    students = []

    with open(filepath, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        headers = next(reader, None)  # skip header if present

        # skip start_index rows
        for _ in range(start_index):
            try:
                next(reader)
            except StopIteration:
                return students

        for _ in range(count):
            try:
                row = next(reader)
                
            except StopIteration:
                break

            try:
                gpa = float(row[5])
            except (ValueError, IndexError):
                # skip rows with a missing or non-numeric GPA
                continue

            student = {
                'tutorial group': row[0].strip(),
                'id': row[1].strip(),
                'school': row[2].strip(),
                'name': row[3].strip(),
                'gender': row[4].strip(),
                'gpa': gpa,
            }
            students.append(student)

    return students
    
read_students(start_index=0, count=50, filepath="records.csv")


In [None]:

def count_total_students(filepath="records.csv"):
    """Count total number of student records (excluding header)."""        
    with open(filepath, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        headers = next(reader, None)  # skip header
        return sum(1 for row in reader)
        
count_total_students(filepath="records.csv")

## 3. Program to determine Average GPA in each group 

### 3.1 The "GPA Band" Concept

To ensure that each group has a **similar average GPA** and a **balanced mix** of academically stronger and weaker students, we adopted a **GPA band** allocation strategy.
Specifically, the 50 students in each tutorial group are first **sorted** in descending order by GPA and then **divided** into five bands of ten students each.
When forming groups, one student is randomly selected from each band to create a five-member team.
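
To see why banding keeps team averages close, here is a minimal sketch on synthetic data. It is not the project's final function: `make_teams` is a hypothetical helper, the GPAs are randomly generated rather than read from `records.csv`, and it assumes the number of students is divisible by the number of bands.

```python
import random


def make_teams(students, num_bands=5, seed=None):
    """Split students into GPA bands and draw one member per band for each team.

    Assumes len(students) is divisible by num_bands; leftovers would be dropped.
    """
    rng = random.Random(seed)
    ranked = sorted(students, key=lambda s: s["gpa"], reverse=True)
    band_size = len(ranked) // num_bands
    bands = [ranked[i * band_size:(i + 1) * band_size] for i in range(num_bands)]
    for band in bands:
        rng.shuffle(band)  # randomise who is picked from each band
    # Team i takes the i-th student of every shuffled band.
    return [[band[i] for band in bands] for i in range(band_size)]


# Synthetic tutorial group: 50 students with made-up GPAs.
data_rng = random.Random(0)
students = [{"id": i, "gpa": round(data_rng.uniform(3.0, 5.0), 2)} for i in range(50)]

teams = make_teams(students, seed=1)
averages = [sum(s["gpa"] for s in team) / len(team) for team in teams]
print(f"{len(teams)} teams, average GPA from {min(averages):.2f} to {max(averages):.2f}")
```

Every team receives exactly one student from each band, so no team can consist only of the strongest or only of the weakest students.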

### 3.2 Simplified Pseudocode

The following pseudocode explains the simplified logic of the function.

In [None]:
FUNCTION form_groups(students, num_groups=10):
    IF students is empty:
        RETURN empty list

    # 1. Sort students by GPA descending
    sorted_students = sort students by GPA descending

    # 2. Split into 5 bands of equal size
    n = length of sorted_students
    band_size = n // 5    # integer division, e.g. 50 students -> bands of 10
    bands = split sorted_students into 5 bands of size band_size

    # 3. Shuffle each band to randomize selection
    FOR each band in bands:
        RANDOMLY shuffle(band)

    # 4. Initialize empty groups
    groups = list of num_groups empty lists

    # 5. Form groups by taking one student from each band
    FOR i from 0 to num_groups - 1:
        FOR each band in bands:
            student = pop a student from band
            add student to groups[i]

    RETURN groups

### 3.3 Actual Code

In [None]:
def form_groups(students, num_groups=10):
    if not students:
        return []
        
    students_sorted = sorted(students, key=lambda x: x['gpa'], reverse=True)

    n = len(students_sorted)
    band_size = max(1, n // 5)  # avoid a zero slice step when there are fewer than 5 students
    bands = [students_sorted[i:i + band_size] for i in range(0, n, band_size)]

    band_males = []
    band_females = []
    for band in bands:
        males = [s for s in band if s['gender'].strip().lower() in ('m', 'male')]
        females = [s for s in band if s['gender'].strip().lower() in ('f', 'female')]
        random.shuffle(males)
        random.shuffle(females)
        band_males.append(males)
        band_females.append(females)

    total_males = sum(len(m) for m in band_males)
    total_females = sum(len(f) for f in band_females)

As you can see, the function is not finished: so far it splits each of the 5 bands into separate male and female lists. The next speaker will explain in detail why we changed the original logic.
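
As a rough illustration of how a count such as `total_males` could be turned into a per-team gender target, the sketch below spreads the males as evenly as possible over the teams. It is an assumption-laden example, not the logic the team actually implemented: `male_targets` is a hypothetical helper, and teams are assumed to have 5 members each.

```python
def male_targets(total_males, num_groups=10):
    """Return how many males each team should aim for, spread as evenly as possible."""
    base, extra = divmod(total_males, num_groups)
    # The first `extra` teams take one additional male so that the totals add up.
    return [base + 1] * extra + [base] * (num_groups - extra)


TEAM_SIZE = 5  # assumed team size
targets = male_targets(total_males=25)
female_targets = [TEAM_SIZE - m for m in targets]

print(targets)         # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
print(female_targets)  # [2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
```

With 25 males and 25 females in a tutorial group of 50, every team ends up with a 3:2 or 2:3 split, which matches the stated goal; a more skewed tutorial group would produce targets outside that range.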

## 4. Gender diversity

## 5. Execution of all the code

In [None]:

if __name__ == "__main__":
    BATCH_SIZE = 50
    NUM_GROUPS = 10
    
    try:
        # Count total students first
        total_students = count_total_students()  # defined in Section 2 (number of students)
        if total_students == 0:
            print("No student records found in the file.")
            sys.exit(1)
            
        print(f"Total students in file: {total_students}")
        print(f"Processing in batches of {BATCH_SIZE} students...")
        print("=" * 60)
        
        batch_number = 1
        start_index = 0
        
        # Process file in batches until all students are processed
        while True:
            print(f"Reading batch {batch_number} (students {start_index + 1} to {start_index + BATCH_SIZE})...")
            students = read_students(start_index, BATCH_SIZE)
            
            if not students:
                print("No more students to process.")
                break
                
            print(f"Processing {len(students)} students in batch {batch_number}...")
            
            # Sort students for band creation
            students_sorted = sorted(students, key=lambda x: x['gpa'], reverse=True)
            n = len(students_sorted)
            band_size = max(1, n // 5)
            bands = [students_sorted[i:i + band_size] for i in range(0, n, band_size)]

            # Form groups for this batch
            groups = form_groups(students, num_groups=NUM_GROUPS)
            
            # Print groups for this batch
            print_groups(groups, students_sorted, bands, batch_number)
            
            # Update for next batch
            start_index += BATCH_SIZE
            batch_number += 1
            
            # Add separator between batches
            if start_index < total_students:
                print("=" * 60)
                print()
        
        print(f"Processing complete! Processed {batch_number - 1} batch(es) total.")
        
    except FileNotFoundError as e:
        print(e)
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred: {e}")
        sys.exit(1)
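
The batching idea used by the script (50 students at a time until the data runs out) can also be expressed as a small generator. This is only a sketch of the pattern on plain Python iterables: `iter_batches` is a hypothetical helper and is not part of the script above.

```python
from itertools import islice


def iter_batches(rows, batch_size=50):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # the iterable is exhausted
            return
        yield batch


# 120 stand-in "students" are split into batches of 50, 50 and 20.
for number, batch in enumerate(iter_batches(range(120), batch_size=50), start=1):
    print(f"batch {number}: {len(batch)} rows")
```

Applied to a `csv.reader`, a generator like this would read the file once from start to finish, whereas calling `read_students` with a growing `start_index` re-opens the file and skips the earlier rows for every batch.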



## 6) **Objective**: To create "diverse" groups based on three criteria: GPA, gender and school.

### **i) GPA Distribution:**
- The GPA spread shows a roughly **normal distribution** centered around **4.1**
- **Bands** were used to group students evenly by academic performance


### **ii) Gender Distribution:**
- Males and females are fairly balanced across the dataset
- This balance makes gender-diverse groups achievable when forming groups using the automated grouping algorithm


### **iii) School Distribution:**
- Students come from **18 different schools**
- Pairing students from different faculties creates opportunities for inter-school collaboration
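
Counts like these can be checked with a few lines of Python. The sketch below is illustrative only: `summarise` is a hypothetical helper, the `Gender` column name is an assumption (`School` matches the column used in the plotting code), and the rows are made up rather than taken from `records.csv`.

```python
from collections import Counter


def summarise(rows, gender_key="Gender", school_key="School"):
    """Count how many students fall under each gender and each school."""
    genders = Counter(row[gender_key].strip() for row in rows)
    schools = Counter(row[school_key].strip() for row in rows)
    return genders, schools


# Made-up rows in the shape that csv.DictReader would produce.
rows = [
    {"Gender": "Male", "School": "School A"},
    {"Gender": "Female", "School": "School B"},
    {"Gender": "Female", "School": "School A"},
    {"Gender": "Male", "School": "School C"},
]

genders, schools = summarise(rows)
print(dict(genders))  # {'Male': 2, 'Female': 2}
print(len(schools))   # 3
```

Passing the rows from `csv.DictReader` over the real file would give the actual gender split and the number of distinct schools.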



---

## **Summary and Remarks:**
**There is diversity in gender, school, and GPA in the dataset, therefore allowing fair team formation.**




In [None]:
import matplotlib.pyplot as plt
import csv

# Read data from records.csv file
tutorial_groups = []
gpas = []

with open('records.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        tutorial_groups.append(row['Tutorial Group'])
        gpas.append(float(row['CGPA']))  # Using CGPA column for GPA

# Organize data by tutorial group
group_data = {}
for i, group in enumerate(tutorial_groups):
    if group not in group_data:
        group_data[group] = []
    group_data[group].append(gpas[i])

# Calculate statistics for each tutorial group
sorted_groups = sorted(group_data.keys(), key=lambda x: int(x.split('-')[1]))
group_means = []
group_mins = []
group_maxs = []
all_student_gpas = []

# Prepare data for plotting
x_positions = []
y_gpas = []
group_labels = []

for i, group in enumerate(sorted_groups):
    group_gpas = group_data[group]
    
    # Add individual student points
    for gpa in group_gpas:
        x_positions.append(i)
        y_gpas.append(gpa)
        all_student_gpas.append(gpa)
    
    # Calculate group statistics
    group_means.append(sum(group_gpas) / len(group_gpas))
    group_mins.append(min(group_gpas))
    group_maxs.append(max(group_gpas))
    group_labels.append(group)

# Create the visualization
plt.figure(figsize=(20, 10))

# Plot individual student GPAs as opaque scatter points (alpha=1.0)
plt.scatter(x_positions, y_gpas, alpha=1.0, color='blue', s=30, label='Individual Students')

# Plot tutorial group average GPAs as red line
plt.plot(range(len(sorted_groups)), group_means, 'ro-', 
         linewidth=2, markersize=6, label='Group Average GPA')

# Customize the plot
plt.xlabel('Tutorial Groups', fontsize=15)
plt.ylabel('GPA', fontsize=15)
plt.title('GPA Distribution Across Tutorial Groups\n(Individual Students and Group Averages)', fontsize=14)

# Set x-axis labels to show every 10th tutorial group for readability
x_ticks = list(range(0, len(sorted_groups), 10))
x_labels = [sorted_groups[i] for i in x_ticks]
plt.xticks(x_ticks, x_labels, rotation=45)

# Set y-axis range from min_gpa-0.5 to 5
min_gpa = min(all_student_gpas)
plt.ylim(min_gpa - 0.5, 5.0)

# Add grid (the legend is drawn once, after the overall-average line is added)
plt.grid(True, alpha=0.3)

# Add horizontal line for overall average
overall_mean = sum(all_student_gpas) / len(all_student_gpas)
plt.axhline(y=overall_mean, color='blue', linestyle='--', 
            linewidth=2, alpha=1.0, label=f'Overall Average: {overall_mean:.2f}')

plt.legend()
plt.tight_layout()
plt.show()

# Print statistical summary
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
print(f"Total Students: {len(all_student_gpas)}")
print(f"Total Tutorial Groups: {len(sorted_groups)}")
print(f"Overall Average GPA: {overall_mean:.3f}")
print(f"Minimum GPA: {min(all_student_gpas):.3f}")
print(f"Maximum GPA: {max(all_student_gpas):.3f}")

# Print group statistics
print(f"\nTutorial Group GPA Statistics:")
print(f"Average group mean GPA: {sum(group_means)/len(group_means):.3f}")
print(f"Lowest group average: {min(group_means):.3f}")
print(f"Highest group average: {max(group_means):.3f}")

# Show some sample groups with their averages
print(f"\nSample Group Averages:")
for i in range(min(5, len(sorted_groups))):
    print(f"  {sorted_groups[i]}: {group_means[i]:.3f}")
if len(sorted_groups) > 5:
    print("  ...")
    for i in range(len(sorted_groups)-3, len(sorted_groups)):
        print(f"  {sorted_groups[i]}: {group_means[i]:.3f}")

In [None]:
import matplotlib.pyplot as plt
import csv
from collections import defaultdict

# Read data from records.csv file
tutorial_school_counts = defaultdict(lambda: defaultdict(int))

with open('records.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    for row in reader:
        tutorial_group = row['Tutorial Group']
        school = row['School']
        tutorial_school_counts[tutorial_group][school] += 1

# Get all unique schools and tutorial groups
all_schools = set()
all_tutorial_groups = sorted(tutorial_school_counts.keys(), key=lambda x: int(x.split('-')[1]))

for tutorial_group in tutorial_school_counts:
    all_schools.update(tutorial_school_counts[tutorial_group].keys())

all_schools = sorted(list(all_schools))

# Prepare data for stacked bar chart
tutorial_groups = []
school_data = {school: [] for school in all_schools}

for tutorial_group in all_tutorial_groups:
    tutorial_groups.append(tutorial_group)
    total_in_group = sum(tutorial_school_counts[tutorial_group].values())
    
    for school in all_schools:
        count = tutorial_school_counts[tutorial_group].get(school, 0)
        school_data[school].append(count)

# Create the stacked bar chart
plt.figure(figsize=(20, 10))

# Create stacked bars
bottom = [0] * len(tutorial_groups)
colors = plt.cm.Set3(range(len(all_schools)))

for i, school in enumerate(all_schools):
    plt.bar(tutorial_groups, school_data[school], bottom=bottom, 
            label=school, color=colors[i], alpha=0.8, edgecolor='grey')
    bottom = [bottom[j] + school_data[school][j] for j in range(len(tutorial_groups))]

# Customize the plot
plt.xlabel('Tutorial Groups', fontsize=12)
plt.ylabel('Number of Students', fontsize=12)
plt.title('Distribution of Schools Within Each Tutorial Group', fontsize=14)

# Rotate x-axis labels and show only every 5th label for readability
x_ticks = list(range(0, len(tutorial_groups), 5))
x_labels = [tutorial_groups[i] for i in x_ticks]
plt.xticks(x_ticks, x_labels, rotation=45)

# Add legend
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

# Add grid
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Print statistical summary
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
print(f"Total Tutorial Groups: {len(tutorial_groups)}")
print(f"Total Schools: {len(all_schools)}")
print(f"Schools: {', '.join(all_schools)}")

print(f"\nSchool Distribution Across Tutorial Groups:")
for school in all_schools:
    total_students = sum(school_data[school])
    groups_with_students = sum(1 for count in school_data[school] if count > 0)
    print(f"  {school}: {total_students} students across {groups_with_students} tutorial groups")

# Show sample distribution for first few tutorial groups
print(f"\nSample Distribution (First 5 Tutorial Groups):")
for i in range(min(5, len(tutorial_groups))):
    print(f"  {tutorial_groups[i]}:")
    for school in all_schools:
        count = school_data[school][i]
        if count > 0:
            print(f"    {school}: {count} students")