## Student Performance Analysis Indicator
### 1. EDA
### 2. Visualization

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, PrecisionRecallDisplay, RocCurveDisplay
import warnings 
warnings.filterwarnings('ignore') 

## Read Data

In [3]:
df_mat = pd.read_csv("data/student-mat.csv",sep=";")
df_por = pd.read_csv("data/student-por.csv",sep=";")

In [4]:
# Read top 5 records
df_mat.head()

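| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |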


In [5]:
df_por.head()

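| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |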


## Dataset Information

**Attributes for both `student-mat.csv` (Math course) and `student-por.csv` (Portuguese language course) datasets:**

1. **school** - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2. **sex** - student's sex (binary: "F" - female or "M" - male)
3. **age** - student's age (numeric: from 15 to 22)
4. **address** - student's home address type (binary: "U" - urban or "R" - rural)
5. **famsize** - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6. **Pstatus** - parent's cohabitation status (binary: "T" - living together or "A" - apart)
7. **Medu** - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. **Fedu** - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. **Mjob** - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g., administrative or police), "at_home", or "other")
10. **Fjob** - father's job (nominal: "teacher", "health" care related, civil "services" (e.g., administrative or police), "at_home", or "other")
11. **reason** - reason to choose this school (nominal: close to "home", school "reputation", "course" preference, or "other")
12. **guardian** - student's guardian (nominal: "mother", "father", or "other")
13. **traveltime** - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. **studytime** - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. **failures** - number of past class failures (numeric: n if 1 <= n < 3, else 4)
16. **schoolsup** - extra educational support (binary: yes or no)
17. **famsup** - family educational support (binary: yes or no)
18. **paid** - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. **activities** - extra-curricular activities (binary: yes or no)
20. **nursery** - attended nursery school (binary: yes or no)
21. **higher** - wants to take higher education (binary: yes or no)
22. **internet** - Internet access at home (binary: yes or no)
23. **romantic** - with a romantic relationship (binary: yes or no)
24. **famrel** - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. **freetime** - free time after school (numeric: from 1 - very low to 5 - very high)
26. **goout** - going out with friends (numeric: from 1 - very low to 5 - very high)
27. **Dalc** - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. **Walc** - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. **health** - current health status (numeric: from 1 - very bad to 5 - very good)
30. **absences** - number of school absences (numeric: from 0 to 93)

**Grades related to the course subject (Math or Portuguese):**

31. **G1** - first period grade (numeric: from 0 to 20)
32. **G2** - second period grade (numeric: from 0 to 20)
33. **G3** - final grade (numeric: from 0 to 20, output target)

**Additional Note:** There are several (382) students who belong to both datasets. These students can be identified by searching for identical attributes that characterize each student, as shown in the annexed R file.
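
The overlap described in the note can be checked with a merge on the identifying attributes. The sketch below uses two tiny hypothetical frames (`mat_toy`, `por_toy`) and a shortened, assumed key list (`id_cols`) instead of the real CSV files, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the two course tables.
mat_toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 16],
    "G3": [6, 10, 15],
})
por_toy = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [18, 16, 15],
    "G3": [11, 12, 14],
})

# Assumed identifying attributes; a real analysis would use a longer list.
id_cols = ["school", "sex", "age"]

# indicator=True adds a _merge column recording where each row came from.
merged = pd.merge(mat_toy, por_toy, on=id_cols, how="outer",
                  suffixes=("_mat", "_por"), indicator=True)
in_both = merged[merged["_merge"] == "both"]
print(len(in_both), "students appear in both toy tables")
```

The `_merge` column also makes it easy to separate students who took only one of the two courses (`left_only`, `right_only`).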


In [6]:
#dataset info
df_mat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [7]:
df_por.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher    

In [8]:

# Get the columns of each dataset
mat_columns = df_mat.columns.tolist()
por_columns = df_por.columns.tolist()

print("Columns in student-mat.csv:")
print(mat_columns)

print("\nColumns in student-por.csv:")
print(por_columns)

Columns in student-mat.csv:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']

Columns in student-por.csv:
['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']


## Merge Dataset

In [9]:
common_col = list(set(df_mat.columns).intersection(set(df_por.columns)))
print(common_col)

merge_columns = [
    'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
    'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures',
    'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet',
    'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'
]

#test = list(set(common_col).intersection(set(merge_columns)))

#print(test)
print(list(set(common_col) - set(merge_columns)))


['internet', 'famrel', 'freetime', 'G1', 'school', 'address', 'goout', 'famsize', 'romantic', 'age', 'sex', 'Fedu', 'paid', 'studytime', 'nursery', 'health', 'higher', 'guardian', 'schoolsup', 'activities', 'G3', 'Pstatus', 'reason', 'Dalc', 'Walc', 'Mjob', 'absences', 'G2', 'Medu', 'famsup', 'traveltime', 'Fjob', 'failures']
['G3', 'G1', 'G2']


In [17]:
# List all columns to be used for ID generation
all_columns = [
    'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
    'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures',
    'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet',
    'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'
]

# Ensure all columns are in both dataframes
for col in all_columns:
    if col not in df_mat.columns:
        print(col," not in mat")
    if col not in df_por.columns:
        print(col," not in por")

In [18]:
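# Cast the key columns to strings and strip leading/trailing spaces so the merge keys compare consistently
# (letter case is left unchanged; numeric keys such as age become object columns)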
key_columns = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet"]
#key_columns = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","Mjob","Fjob","reason","nursery","internet","studytime"]

for col in key_columns:
    df_mat[col] = df_mat[col].astype(str).str.strip()
    df_por[col] = df_por[col].astype(str).str.strip()

# Perform an outer merge
combined_df = pd.merge(df_mat, df_por, on=key_columns, how='outer', suffixes=('_mat', '_por'))

# Generate a unique student ID
combined_df['student_id'] = combined_df[key_columns].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

# Fill missing values for students who belong to only one dataset with -1
combined_df.fillna(-1, inplace=True)

# Print the number of unique student IDs
print(f"Total unique student IDs: {combined_df['student_id'].nunique()}")

# Save the combined dataset
#combined_df.to_csv('data/combined_student_data.csv', index=False)

Total unique student IDs: 662
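
An outer merge on attribute keys can produce more rows than unique IDs when the same key combination occurs more than once in either table, because every matching pair is emitted. The sketch below shows this on hypothetical toy frames (`left`, `right`, and `keys` are illustrative names, not part of the notebook) and uses the `validate` argument of `pd.merge` to detect the situation.

```python
import pandas as pd

# Hypothetical toy data: the key ("GP", "M") occurs twice in each table.
left = pd.DataFrame({"school": ["GP", "GP", "MS"],
                     "sex": ["M", "M", "F"],
                     "G3": [10, 12, 15]})
right = pd.DataFrame({"school": ["GP", "GP", "MS"],
                      "sex": ["M", "M", "F"],
                      "G3": [11, 13, 14]})
keys = ["school", "sex"]

# Every left/right pair sharing a key is emitted: 2 x 2 + 1 = 5 rows.
pairs = pd.merge(left, right, on=keys, how="outer", suffixes=("_mat", "_por"))
n_rows = len(pairs)
n_ids = pairs[keys].drop_duplicates().shape[0]
print(n_rows, "rows for", n_ids, "distinct keys")

# validate="one_to_one" makes pandas raise instead of silently fanning out.
try:
    pd.merge(left, right, on=keys, how="outer", validate="one_to_one")
    fan_out_detected = False
except pd.errors.MergeError:
    fan_out_detected = True
print("duplicate keys detected:", fan_out_detected)
```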


In [19]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 54 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   school          682 non-null    object 
 1   sex             682 non-null    object 
 2   age             682 non-null    object 
 3   address         682 non-null    object 
 4   famsize         682 non-null    object 
 5   Pstatus         682 non-null    object 
 6   Medu            682 non-null    object 
 7   Fedu            682 non-null    object 
 8   Mjob            682 non-null    object 
 9   Fjob            682 non-null    object 
 10  reason          682 non-null    object 
 11  guardian_mat    682 non-null    object 
 12  traveltime_mat  682 non-null    float64
 13  studytime_mat   682 non-null    float64
 14  failures_mat    682 non-null    float64
 15  schoolsup_mat   682 non-null    object 
 16  famsup_mat      682 non-null    object 
 17  paid_mat        682 non-null    obj

In [13]:
all_match = (combined_df['traveltime_mat'] == combined_df['traveltime_por']).all()
print(all_match)  # True if all elements match, otherwise False


False
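
One reason the check returns `False` is that students who appear in only one dataset carry the `-1` placeholder on one side. A minimal sketch, on a hypothetical toy frame, that restricts the comparison to rows with real values on both sides:

```python
import pandas as pd

# Hypothetical merged frame; -1 marks a value missing from one source table.
toy = pd.DataFrame({
    "traveltime_mat": [1.0, 2.0, -1.0, 3.0],
    "traveltime_por": [1.0, 2.0, 2.0, 1.0],
})

# Keep only rows where both sides carry a real value.
both_present = (toy["traveltime_mat"] != -1) & (toy["traveltime_por"] != -1)
agree = toy["traveltime_mat"] == toy["traveltime_por"]

n_conflicts = int((both_present & ~agree).sum())
print("all rows match:", bool(agree.all()))
print("rows in both tables that disagree:", n_conflicts)
```

Counting the genuine conflicts this way separates placeholder mismatches from students whose answers really differ between the two courses.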


In [14]:
merge_columns = [
    'school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu',
    'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures',
    'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet',
    'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences'
]

key_columns = ["school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]

# Columns that are shared by both files but not used as merge keys
non_key_columns = list(set(merge_columns) - set(key_columns))
print(non_key_columns, "\n", len(non_key_columns))


['paid', 'famrel', 'freetime', 'failures', 'goout', 'studytime', 'health', 'Dalc', 'higher', 'guardian', 'Walc', 'romantic', 'famsup', 'absences', 'schoolsup', 'traveltime', 'activities'] 
 17


In [15]:
# List of column prefixes to be merged
columns_to_merge = [
    'schoolsup', 'freetime', 'guardian', 'higher', 'Dalc', 'famsup', 'traveltime', 'studytime', 
    'Walc', 'health', 'activities', 'failures', 'famrel', 'paid', 'romantic', 'goout'
]
# Note: this is an alias, not a copy, so the edits below also modify combined_df
combined_df_new = combined_df
# Loop through the columns and merge _mat and _por versions
for col in columns_to_merge:
    mat_col = col + '_mat'
    por_col = col + '_por'
    
    # Create a new column that keeps the relevant data
    combined_df_new[col] = combined_df_new[mat_col].where(combined_df_new[mat_col] != -1, combined_df_new[por_col])

    # Drop the old _mat and _por columns
    combined_df_new.drop(columns=[mat_col, por_col], inplace=True)

# Get the shape (rows, columns) of the merged DataFrame
combined_df_new_size = combined_df_new.shape
combined_df_new_size

# Save the combined dataset
#combined_df_new.to_csv('data/combined_student_data1.csv', index=False)

(682, 38)
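
The loop above keeps the `_mat` value unless it is the `-1` placeholder. If the placeholder fill is skipped and missing values stay as `NaN`, the same coalescing can be written with `Series.combine_first`. This is a sketch on hypothetical toy data, not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with NaN where a student is absent from one table.
toy = pd.DataFrame({
    "studytime_mat": [2.0, np.nan, 3.0],
    "studytime_por": [2.0, 1.0, np.nan],
})

# Take the _mat value where present, otherwise fall back to _por.
toy["studytime"] = toy["studytime_mat"].combine_first(toy["studytime_por"])
toy = toy.drop(columns=["studytime_mat", "studytime_por"])
print(toy["studytime"].tolist())
```

Working with `NaN` avoids confusing the placeholder with a legitimate value and keeps string columns from mixing text with `-1`.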

In [16]:
combined_df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 682 entries, 0 to 681
Data columns (total 38 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   school        682 non-null    object 
 1   sex           682 non-null    object 
 2   age           682 non-null    object 
 3   address       682 non-null    object 
 4   famsize       682 non-null    object 
 5   Pstatus       682 non-null    object 
 6   Medu          682 non-null    object 
 7   Fedu          682 non-null    object 
 8   Mjob          682 non-null    object 
 9   Fjob          682 non-null    object 
 10  reason        682 non-null    object 
 11  nursery       682 non-null    object 
 12  internet      682 non-null    object 
 13  absences_mat  682 non-null    float64
 14  G1_mat        682 non-null    float64
 15  G2_mat        682 non-null    float64
 16  G3_mat        682 non-null    float64
 17  absences_por  682 non-null    float64
 18  G1_por        682 non-null    

In [16]:
for col in key_columns:
    mat_unique = set(df_mat[col])
    por_unique = set(df_por[col])
    if mat_unique != por_unique:
        print(f"Inconsistency found in column: {col}")


In [20]:

# Select the rows whose student_id occurs more than once
duplicate_student_ids = combined_df[combined_df.duplicated(subset='student_id', keep=False)]

# Group the duplicate entries by student_id and compare the rows within each group
duplicate_groups = duplicate_student_ids.groupby('student_id')

# Collect the groups whose rows are not all identical.
# (group.equals(group.iloc[0]) compares a DataFrame with a Series and is always False.)
differences = {}
for student_id, group in duplicate_groups:
    if len(group.drop_duplicates()) > 1:
        differences[student_id] = group

# Show every group that differs
for student_id, group in differences.items():
    print(f"Differences for student_id: {student_id}")
    print(group)
    print("\n")

Differences for student_id: GP_F_17_U_GT3_T_4_4_teacher_services_course_yes_yes
    school sex age address famsize Pstatus Medu Fedu     Mjob      Fjob  ...  \
339     GP   F  17       U     GT3       T    4    4  teacher  services  ...   
340     GP   F  17       U     GT3       T    4    4  teacher  services  ...   

    freetime_por goout_por  Dalc_por  Walc_por  health_por absences_por  \
339          3.0       1.0       1.0       4.0         5.0          2.0   
340          4.0       4.0       1.0       3.0         4.0          0.0   

    G1_por G2_por G3_por                                         student_id  
339   11.0   11.0   12.0  GP_F_17_U_GT3_T_4_4_teacher_services_course_ye...  
340   13.0   12.0   13.0  GP_F_17_U_GT3_T_4_4_teacher_services_course_ye...  

[2 rows x 54 columns]


Differences for student_id: GP_F_18_U_GT3_T_1_1_other_other_home_yes_yes
    school sex age address famsize Pstatus Medu Fedu   Mjob   Fjob  ...  \
293     GP   F  18       U     GT3       T    
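
Whether the rows sharing an ID actually differ can be checked by counting the distinct rows in each group. The sketch below uses a hypothetical toy frame; it also shows that `DataFrame.equals` returns `False` when given a single row (a `Series`), so that call cannot serve as the test.

```python
import pandas as pd

# Hypothetical toy frame: "a" has conflicting rows, "b" has exact duplicates.
toy = pd.DataFrame({
    "student_id": ["a", "a", "b", "b", "c"],
    "G3_por": [12.0, 13.0, 10.0, 10.0, 9.0],
})

# Rows whose ID occurs more than once.
dupes = toy[toy.duplicated(subset="student_id", keep=False)]

# A group is inconsistent when it holds more than one distinct row.
inconsistent = [
    sid for sid, grp in dupes.groupby("student_id")
    if len(grp.drop_duplicates()) > 1
]
print(inconsistent)

# Comparing a DataFrame with a Series is False even for identical rows.
grp_b = dupes[dupes["student_id"] == "b"]
equals_series = grp_b.equals(grp_b.iloc[0])
print(equals_series)
```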

In [18]:
# Select the rows whose student_id occurs more than once
duplicate_student_ids = combined_df[combined_df.duplicated(subset='student_id', keep=False)]

# Group by student_id and identify inconsistent rows
inconsistent_ids = []
duplicate_groups = duplicate_student_ids.groupby('student_id')
for student_id, group in duplicate_groups:
    # A group is inconsistent when it holds more than one distinct row.
    # (group.equals(group.iloc[0]) compares a DataFrame with a Series and is always False.)
    if len(group.drop_duplicates()) > 1:
        inconsistent_ids.append(student_id)

# Display the inconsistent student_ids for manual review
print(f"Inconsistent student_ids: {inconsistent_ids}")

# Optional: Display the rows corresponding to these inconsistent student_ids
for student_id in inconsistent_ids:
    print(f"\nReview data for student_id: {student_id}")
    display(duplicate_student_ids[duplicate_student_ids['student_id'] == student_id])


Inconsistent student_ids: ['GP_F_17_U_GT3_T_4_4_teacher_services_course_yes_yes', 'GP_F_18_U_GT3_T_1_1_other_other_home_yes_yes', 'GP_M_16_U_GT3_T_3_3_services_other_home_yes_yes', 'GP_M_16_U_GT3_T_4_4_services_services_course_yes_yes', 'GP_M_16_U_GT3_T_4_4_teacher_teacher_course_yes_yes', 'GP_M_17_U_GT3_T_2_1_other_other_home_yes_yes', 'GP_M_18_U_GT3_T_4_4_teacher_services_home_yes_yes', 'MS_F_16_R_GT3_T_2_2_other_other_course_yes_yes', 'MS_F_16_U_GT3_T_1_2_other_services_course_yes_no', 'MS_F_18_U_GT3_T_1_1_other_other_course_yes_no', 'MS_M_15_R_LE3_T_4_1_health_services_reputation_yes_yes']

Review data for student_id: GP_F_17_U_GT3_T_4_4_teacher_services_course_yes_yes


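| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 339 | GP | F | 17 | U | GT3 | T | 4 | 4 | teacher | services | ... | 3.0 | 3.0 | 4.0 | yes | 7.0 | 0.0 | 5.0 | yes | no | 4.0 |
| 340 | GP | F | 17 | U | GT3 | T | 4 | 4 | teacher | services | ... | 3.0 | 3.0 | 4.0 | yes | 7.0 | 0.0 | 5.0 | yes | no | 4.0 |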



Review data for student_id: GP_F_18_U_GT3_T_1_1_other_other_home_yes_yes


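| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 293 | GP | F | 18 | U | GT3 | T | 1 | 1 | other | other | ... | 2.0 | 1.0 | 4.0 | yes | 4.0 | 0.0 | 5.0 | no | no | 4.0 |
| 294 | GP | F | 18 | U | GT3 | T | 1 | 1 | other | other | ... | 2.0 | 1.0 | 4.0 | yes | 4.0 | 0.0 | 5.0 | no | no | 4.0 |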



Review data for student_id: GP_M_16_U_GT3_T_3_3_services_other_home_yes_yes


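| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 110 | GP | M | 16 | U | GT3 | T | 3 | 3 | services | other | ... | 3.0 | 1.0 | 5.0 | yes | 2.0 | 0.0 | 5.0 | no | no | 3.0 |
| 111 | GP | M | 16 | U | GT3 | T | 3 | 3 | services | other | ... | 3.0 | 1.0 | 5.0 | yes | 2.0 | 0.0 | 5.0 | no | no | 3.0 |
| 112 | GP | M | 16 | U | GT3 | T | 3 | 3 | services | other | ... | 2.0 | 2.0 | 3.0 | yes | 2.0 | 0.0 | 4.0 | yes | yes | 3.0 |
| 113 | GP | M | 16 | U | GT3 | T | 3 | 3 | services | other | ... | 2.0 | 2.0 | 3.0 | yes | 2.0 | 0.0 | 4.0 | yes | yes | 3.0 |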



Review data for student_id: GP_M_16_U_GT3_T_4_4_services_services_course_yes_yes


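| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 249 | GP | M | 16 | U | GT3 | T | 4 | 4 | services | services | ... | 1.0 | 2.0 | 5.0 | yes | 0.0 | 0.0 | 5.0 | no | no | 2.0 |
| 250 | GP | M | 16 | U | GT3 | T | 4 | 4 | services | services | ... | 1.0 | 2.0 | 5.0 | yes | 0.0 | 0.0 | 5.0 | no | no | 2.0 |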



Review data for student_id: GP_M_16_U_GT3_T_4_4_teacher_teacher_course_yes_yes


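| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 121 | GP | M | 16 | U | GT3 | T | 4 | 4 | teacher | teacher | ... | 2.0 | 2.0 | 5.0 | yes | 2.0 | 0.0 | 5.0 | no | no | 4.0 |
| 122 | GP | M | 16 | U | GT3 | T | 4 | 4 | teacher | teacher | ... | 2.0 | 2.0 | 5.0 | yes | 2.0 | 0.0 | 5.0 | no | no | 4.0 |
| 123 | GP | M | 16 | U | GT3 | T | 4 | 4 | teacher | teacher | ... | 1.0 | 1.0 | 5.0 | no | 0.0 | 0.0 | 3.0 | no | yes | 2.0 |
| 124 | GP | M | 16 | U | GT3 | T | 4 | 4 | teacher | teacher | ... | 1.0 | 1.0 | 5.0 | no | 0.0 | 0.0 | 3.0 | no | yes | 2.0 |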



Review data for student_id: GP_M_17_U_GT3_T_2_1_other_other_home_yes_yes


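| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78 | GP | M | 17 | U | GT3 | T | 2 | 1 | other | other | ... | 1.0 | 1.0 | 3.0 | yes | 2.0 | 3.0 | 4.0 | no | no | 1.0 |
| 79 | GP | M | 17 | U | GT3 | T | 2 | 1 | other | other | ... | 1.0 | 1.0 | 3.0 | yes | 2.0 | 3.0 | 4.0 | no | no | 1.0 |
| 80 | GP | M | 17 | U | GT3 | T | 2 | 1 | other | other | ... | 1.0 | 2.0 | 5.0 | no | 0.0 | 3.0 | 5.0 | no | no | 5.0 |
| 81 | GP | M | 17 | U | GT3 | T | 2 | 1 | other | other | ... | 1.0 | 2.0 | 5.0 | no | 0.0 | 3.0 | 5.0 | no | no | 5.0 |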



Review data for student_id: GP_M_18_U_GT3_T_4_4_teacher_services_home_yes_yes


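| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 284 | GP | M | 18 | U | GT3 | T | 4 | 4 | teacher | services | ... | 1.0 | 4.0 | 3.0 | yes | 22.0 | 0.0 | 3.0 | yes | no | 4.0 |
| 285 | GP | M | 18 | U | GT3 | T | 4 | 4 | teacher | services | ... | 1.0 | 4.0 | 3.0 | yes | 22.0 | 0.0 | 3.0 | yes | no | 4.0 |
| 286 | GP | M | 18 | U | GT3 | T | 4 | 4 | teacher | services | ... | 2.0 | 2.0 | 2.0 | yes | 0.0 | 1.0 | 4.0 | no | no | 3.0 |
| 287 | GP | M | 18 | U | GT3 | T | 4 | 4 | teacher | services | ... | 2.0 | 2.0 | 2.0 | yes | 0.0 | 1.0 | 4.0 | no | no | 3.0 |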



Review data for student_id: MS_F_16_R_GT3_T_2_2_other_other_course_yes_yes


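| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 497 | MS | F | 16 | R | GT3 | T | 2 | 2 | other | other | ... | 2.0 | 1.0 | 5.0 | yes | 0.0 | 0.0 | 4.0 | no | no | 4.0 |
| 498 | MS | F | 16 | R | GT3 | T | 2 | 2 | other | other | ... | 2.0 | 1.0 | 4.0 | no | 4.0 | 0.0 | 4.0 | no | no | 5.0 |
| 499 | MS | F | 16 | R | GT3 | T | 2 | 2 | other | other | ... | 2.0 | 2.0 | 1.0 | no | 1.0 | 0.0 | 3.0 | no | no | 5.0 |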



Review data for student_id: MS_F_16_U_GT3_T_1_2_other_services_course_yes_no


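| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | MS | F | 16 | U | GT3 | T | 1 | 2 | other | services | ... | 3.0 | 2.0 | 4.0 | no | 0.0 | 1.0 | 1.0 | no | no | 2.0 |
| 528 | MS | F | 16 | U | GT3 | T | 1 | 2 | other | services | ... | 3.0 | 2.0 | 4.0 | no | 3.0 | 1.0 | 1.0 | no | no | 2.0 |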



Review data for student_id: MS_F_18_U_GT3_T_1_1_other_other_course_yes_no


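| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 400 | MS | F | 18 | U | GT3 | T | 1 | 1 | other | other | ... | 2.0 | 1.0 | 5.0 | yes | 0.0 | 1.0 | 1.0 | no | no | 1.0 |
| 401 | MS | F | 18 | U | GT3 | T | 1 | 1 | other | other | ... | 2.0 | 1.0 | 5.0 | yes | 0.0 | 1.0 | 1.0 | no | no | 1.0 |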



Review data for student_id: MS_M_15_R_LE3_T_4_1_health_services_reputation_yes_yes


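| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | studytime | Walc | health | activities | absences | failures | famrel | paid | romantic | goout |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 517 | MS | M | 15 | R | LE3 | T | 4 | 1 | health | services | ... | 2.0 | 2.0 | 2.0 | yes | 0.0 | 0.0 | 5.0 | no | no | 4.0 |
| 518 | MS | M | 15 | R | LE3 | T | 4 | 1 | health | services | ... | 2.0 | 2.0 | 2.0 | yes | 7.0 | 0.0 | 5.0 | no | no | 4.0 |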


In [20]:
# Generate a student ID from the key columns (not from all columns)
# Note: these are aliases, so df_mat and df_por also gain the student_id column
df_mat_new = df_mat
df_por_new = df_por
def generate_student_id(df):
    df['student_id'] = df[key_columns].apply(lambda row: '_'.join(row.astype(str)), axis=1)
    return df

df_mat_new = generate_student_id(df_mat_new)
df_por_new = generate_student_id(df_por_new)

df_mat_new.to_csv('data/mat_new.csv', index=False)
df_por_new.to_csv('data/por_new.csv', index=False)
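
The row-wise `apply` used to build `student_id` can also be written by casting the key columns to strings and joining them with `agg`. This is a sketch on a hypothetical toy frame with an assumed, shortened key list (`key_cols`):

```python
import pandas as pd

# Hypothetical toy frame and key list.
toy = pd.DataFrame({
    "school": ["GP", "MS"],
    "sex": ["F", "M"],
    "age": [18, 15],
})
key_cols = ["school", "sex", "age"]

# Cast every key to str, then join the values of each row with "_".
toy["student_id"] = toy[key_cols].astype(str).agg("_".join, axis=1)
print(toy["student_id"].tolist())
```

Because the ID is built only from the key attributes, two different students who share all of them receive the same ID.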

Since we have found some discrepancies