# Project: Predicting Student Academic Performance

## 1. Context and Motivation

Educational outcomes are influenced by a wide range of factors, including student background, family situation, lifestyle, and prior academic achievement. Understanding these relationships can help educators, parents, and policymakers identify risk factors for poor performance and design better support systems for students.

The Student Performance Dataset from the UCI Machine Learning Repository describes secondary school students in Portugal. It covers two subjects: Mathematics and Portuguese language. For each student, it records demographic attributes, social background, school-related factors, and academic grades.

My goal is to use this dataset to build predictive models for student performance, focusing on the final grade (G3) as the target variable.

## 2. Objective

* Primary Task: Formulate a regression problem where the goal is to predict the final grade (G3) of students based on prior grades, demographic features, family background, and lifestyle-related variables.

* Approach: Apply linear regression models with regularization (Lasso and Ridge regression) to understand how different features contribute to student outcomes while limiting overfitting.

* Potential Applications:
    - Early warning systems to flag students at risk of failing.
    - Insights into which factors most strongly affect academic success.
    - Basis for building educational dashboards or decision-support apps.
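
A minimal sketch of the approach above, assuming scikit-learn is available. It runs on a small synthetic table rather than the real data, and the column subset, the toy target formula, and the `alpha` values are illustrative assumptions, not tuned choices. Ridge shrinks all coefficients toward zero, while Lasso can set some of them exactly to zero.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real data (illustrative values only).
rng = np.random.default_rng(0)
n = 200
demo_df = pd.DataFrame({
    "G1": rng.integers(0, 21, n),
    "G2": rng.integers(0, 21, n),
    "studytime": rng.integers(1, 5, n),
    "higher": rng.choice(["yes", "no"], n),
})
# Toy target: mostly driven by G2, clipped to the 0-20 grade scale.
demo_df["G3"] = (0.2 * demo_df["G1"] + 0.8 * demo_df["G2"]
                 + rng.normal(0, 1, n)).clip(0, 20)

numeric_cols = ["G1", "G2", "studytime"]
categorical_cols = ["higher"]

def make_model(estimator):
    """Scale numeric columns, one-hot encode categorical ones, then fit."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", estimator)])

X = demo_df.drop(columns="G3")
y = demo_df["G3"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# alpha controls the regularization strength (values here are arbitrary).
models = {
    "ridge": make_model(Ridge(alpha=1.0)),
    "lasso": make_model(Lasso(alpha=0.1)),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out rows
    print(name, round(scores[name], 3))
```

Scaling the numeric columns before fitting matters here because both penalties depend on coefficient size, which in turn depends on each feature's scale.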

## 3. Dataset Overview  

Two CSV files are available:  
- `student-mat.csv` (Mathematics course)  
- `student-por.csv` (Portuguese language course)  

According to the dataset documentation, **382 students appear in both datasets**. They can be identified by matching background attributes that are identical across the two files.  

### Target Variable  
- **G3**: Final grade (0-20) --> *output variable for regression models*.  

### Predictor Variables  

1. **Demographics & Family Background**
   - `school`: Student's school (GP = Gabriel Pereira, MS = Mousinho da Silveira)  
   - `sex`: Student's sex (F = female, M = male)  
   - `age`: Student's age (15-22)  
   - `address`: Home address type (U = urban, R = rural)  
   - `famsize`: Family size (LE3 = 3 or fewer, GT3 = more than 3)  
   - `Pstatus`: Parents' cohabitation status (T = together, A = apart)  
   - `Medu`, `Fedu`: Mother's and father's education (0 = none to 4 = higher education)  
   - `Mjob`, `Fjob`: Parents' occupations (teacher, health, services, at_home, other)  
   - `guardian`: Primary guardian (mother, father, other)  

2. **School & Learning Conditions**
   - `reason`: Reason for choosing the school (home, reputation, course, other)  
   - `traveltime`: Home-to-school travel time (1 = <15 min, ..., 4 = >1 h)  
   - `studytime`: Weekly study time (1 = <2 h, ..., 4 = >10 h)  
   - `failures`: Past class failures (0-4)  
   - `schoolsup`: Extra educational support (yes/no)  
   - `famsup`: Family educational support (yes/no)  
   - `paid`: Extra paid classes (yes/no)  
   - `activities`: Extra-curricular activities (yes/no)  
   - `nursery`: Attended nursery school (yes/no)  
   - `higher`: Aspiration for higher education (yes/no)  
   - `internet`: Internet access at home (yes/no)  

3. **Social & Lifestyle Factors**
   - `romantic`: In a romantic relationship (yes/no)  
   - `famrel`: Family relationship quality (1 = very bad, 5 = excellent)  
   - `freetime`: Free time after school (1-5)  
   - `goout`: Going out with friends (1-5)  
   - `Dalc`: Workday alcohol consumption (1-5)  
   - `Walc`: Weekend alcohol consumption (1-5)  
   - `health`: Current health status (1-5)  
   - `absences`: Number of school absences (0-93)  

4. **Academic History**
   - `G1`: First period grade (0-20)  
   - `G2`: Second period grade (0-20)  
   - `G3`: Final grade (0-20) --> *Target variable* 
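
The list above mixes text-valued variables (binary `yes`/`no` flags and nominal categories such as `Mjob`) with integer-coded ones, and a linear model needs the text-valued ones converted to numbers. The sketch below shows one common way to do that with `pd.get_dummies`. The three sample rows are made up for illustration, and treating the integer-coded scales as plain numbers is an assumption rather than a decision taken in this notebook.

```python
import pandas as pd

# Tiny hand-made sample using a few of the columns described above
# (not real rows from the dataset).
sample_df = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "Mjob": ["teacher", "other", "health"],
    "internet": ["yes", "no", "yes"],
    "studytime": [2, 1, 4],
    "G3": [12, 8, 15],
})

# Every non-numeric column is treated as categorical.
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns. drop_first=True drops one level
# per variable so the dummy columns are not perfectly collinear.
encoded_df = pd.get_dummies(
    sample_df, columns=categorical_cols, drop_first=True, dtype=int
)
print(encoded_df.columns.tolist())
```

With `drop_first=True`, a binary variable such as `internet` becomes a single 0/1 column, and a nominal variable with k levels becomes k - 1 columns.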

## 4. Data Inspection

In [None]:
import pandas as pd

In [None]:
mat_df = pd.read_csv("data/student-mat.csv", sep=";")
por_df = pd.read_csv("data/student-por.csv", sep=";")

In [5]:
mat_df.head()

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       4         3      4     1     1       3         6   5   6   6
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...       5         3      3     1     1       3         4   5   5   6
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...       4         3      2     2     3       3        10   7   8  10
3     GP   F   15       U     GT3       T     4     2   health  services  ...       3         2      2     1     1       5         2  15  14  15
4     GP   F   16       U     GT3       T     3     3    other     other  ...       4         3      2     1     2       5         4   6  10  10


In [6]:
mat_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [8]:
mat_df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [15]:
mat_df.duplicated().sum()

np.int64(0)

In [9]:
por_df.head()

  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...       4         3      4     1     1       3         4   0  11  11
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...       5         3      3     1     1       3         2   9  11  11
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...       4         3      2     2     3       3         6  12  13  12
3     GP   F   15       U     GT3       T     4     2   health  services  ...       3         2      2     1     1       5         0  14  14  14
4     GP   F   16       U     GT3       T     3     3    other     other  ...       4         3      2     1     2       5         0  11  13  13


In [10]:
por_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher    

In [11]:
por_df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

In [16]:
por_df.duplicated().sum()

np.int64(0)

In [None]:
# Check column names are the same
print("Columns identical:", list(mat_df.columns) == list(por_df.columns))

Columns identical: True


In [18]:
# Define identifying columns (exclude grades and subject)
id_cols = [col for col in mat_df.columns if col not in ["G1", "G2", "G3", "subject"]]

In [19]:
# Find overlaps between datasets
duplicates = pd.merge(mat_df[id_cols], por_df[id_cols], on=id_cols, how="inner")
print("Number of overlapping students:", len(duplicates))

Number of overlapping students: 39
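
The count of 39 comes from matching on every non-grade column, including values that can differ by subject, such as `paid` and `absences`. The sketch below shows how the choice of key columns changes the result, using two made-up one-row tables. The `stable_cols` list is an assumption: to my knowledge it matches the columns used by the merge script shipped with the dataset, but it should be checked against that script before relying on it.

```python
import pandas as pd

# Assumption: background columns expected to be the same for a student in
# both subjects (believed to match the dataset's own merge script).
stable_cols = ["school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]

def count_overlap(left, right, keys):
    """Count row pairs that agree on every column in `keys`."""
    matched = pd.merge(left[keys], right[keys], on=keys, how="inner")
    return len(matched)

# Toy example: one student with the same background in both tables but
# different subject-specific values (made-up rows, not real data).
background = {"school": "GP", "sex": "F", "age": 16, "address": "U",
              "famsize": "GT3", "Pstatus": "T", "Medu": 3, "Fedu": 3,
              "Mjob": "other", "Fjob": "other", "reason": "home",
              "nursery": "yes", "internet": "yes"}
toy_mat = pd.DataFrame([{**background, "paid": "yes", "absences": 4}])
toy_por = pd.DataFrame([{**background, "paid": "no", "absences": 0}])

all_cols = list(toy_mat.columns)
strict_count = count_overlap(toy_mat, toy_por, all_cols)     # paid/absences differ
stable_count = count_overlap(toy_mat, toy_por, stable_cols)  # background only
print("Match on all columns:", strict_count)
print("Match on stable columns:", stable_count)
```

On the real files, `count_overlap(mat_df, por_df, stable_cols)` would be the analogous call. An inner merge counts matching row pairs, so if several students share the same key values the result can exceed the number of distinct students.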


### Observations
Both **student-mat.csv** (Math) and **student-por.csv** (Portuguese) contain the same 33 columns (32 predictors plus the target `G3`).  
No missing values, missing columns, or mismatched column names were found.  

For this project, I combine both datasets into a single dataset by stacking their rows and adding a `"subject"` column (see Section 5).  
This increases the sample size available to the regression model, and the `subject` column lets it capture subject-specific effects.  

The combined dataset contains **1044 rows** (395 from Math and 649 from Portuguese). Students enrolled in both courses contribute one row per subject, so this is a count of records rather than unique students.

I checked for students appearing in both the Math and Portuguese datasets.  
- A total of **39 students** match exactly on every non-grade column.  
- This strict match also requires subject-specific columns such as `absences` and `paid` to agree, so it finds far fewer students than the 382 overlaps noted in Section 3.  
- The matched students have identical background attributes (family, demographics, etc.) but different grades for each subject.  

For my regression objective, I decided to **keep these students in both datasets**.  
This is reasonable because Math and Portuguese represent distinct learning outcomes, and the same student may perform differently in each subject.  
As a result, no rows were dropped during cleaning.

## 5. Data Cleaning

In [21]:
# Add subject column
mat_df["subject"] = "math"
por_df["subject"] = "portuguese"

# Merge into one dataset
combined_df = pd.concat([mat_df, por_df], ignore_index=True)

print("Combined dataset shape:", combined_df.shape)
print(combined_df.head())

Combined dataset shape: (1044, 34)
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  freetime goout  Dalc  Walc  health absences  G1  G2  G3 subject  
0        3     4     1     1       3        6   5   6   6    math  
1        3     3     1     1       3        4   5   5   6    math  
2        3     2     2     3       3       10   7   8  10    math  
3        2     2     1     1       5        2  15  14  15    math  
4        3     2     1     2       5        4   6  10  10    math  

[5 rows x 34 columns]


In [None]:
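# Save into a new file (create the output folder first if it is missing)
from pathlib import Path
Path("data/cleaned").mkdir(parents=True, exist_ok=True)
combined_df.to_csv("data/cleaned/students_combined.csv", index=False)
print("Combined dataset saved as students_combined.csv")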

Combined dataset saved as students_combined.csv
