# 1. Introduction
## 1.1 Primary Goals and Motivation
Excessive alcohol consumption among teenagers is a serious issue in many countries. Alcoholism and heavy drinking can be detrimental to teenagers’ physical and intellectual development. While students may drink alcohol to relieve stress or to fit in, too much alcohol can lead to other mental health issues and harm interpersonal relationships. Drinking is often assumed to correlate with academic performance, and some social and economic factors may also be related to a student’s grades. Poor grades may affect college admissions, and drinking may affect physical health (brain development). The primary goal of this project is to explore whether students’ habits, especially alcohol consumption, are good predictors of a student’s grades, and to build a model that predicts GPA from variables including alcohol intake, parents’ education level, living environment, and health condition. I am interested in the relationship between alcohol consumption and GPA. If drinking does not significantly predict academic success, what other social and economic factors should be considered when predicting grades?
## 1.2 Literature Review
Research on college students has found that binge drinking and intoxication correlate negatively with study hours, which reduces GPA. Meanwhile, an increase in drinking frequency alone does not seem to affect GPA significantly (Wolaver, 2001). Fewer studies focus specifically on the correlation between high school drinking and grades. As cited in Balsa, Giuliano, & French (2011), DeSimone and Wolaver (2005) looked specifically at high schoolers and found a negative association between high school drinking and GPA; their results also show that binge drinkers’ GPAs are on average 0.4 points lower for both genders.
The Pagnotta and Amran (2016) study uses the same data set as this project. They approach the data set from a secondary level using data mining and business intelligence; their model’s effectiveness suggests a correlation between alcohol intake and grades. Another article builds a model using a random tree method (Hariharan, Krithivasan, & Angel, 2016).
## 1.3 Data
To reach the goal of this project, I chose the student performance data set from the UCI Machine Learning Repository. It includes gender, social, economic, and alcohol intake information on Portuguese high school students in math and Portuguese language classes (Cortez & Silva, 2008). The attributes in the data set seem relevant and comprehensive enough for the model, which makes this an appropriate data set to use.


## Importing packages & Cleaning Data 

I am using the student performance dataset from the UCI Machine Learning Repository, which includes information on two groups of students, in math and Portuguese classes respectively. The two CSV files contain duplicated information (some students are enrolled in both classes), so I need to combine them and drop the duplicated rows.
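
The combine-and-deduplicate step can be sketched on two small made-up tables. The rows and the short `id_cols` list below are assumptions for illustration; the notebook itself identifies students with a longer list of columns.

```python
import pandas as pd

# Toy stand-ins for the two class files; all values are made up.
mat = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
    "G3": [10, 12, 14],
})
por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [15, 17, 18],
    "G3": [13, 15, 11],
})

# Hypothetical identifying columns (shorter than the list used in the notebook).
id_cols = ["school", "sex", "age"]

combined = pd.concat([mat, por], ignore_index=True)
# keep="first" retains the math-class row for students found in both files.
students = combined.drop_duplicates(subset=id_cols, keep="first")
n_students = len(students)
print(n_students)
```

Because the first occurrence is kept and the math table comes first in the concatenation, a student enrolled in both classes keeps the math-class grades. Whether that is the desired behavior depends on which course the analysis is meant to describe.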

## 2. The Student Performance Data Set
### 2.1 Background Information
The student performance dataset (named “student alcohol consumption” on Kaggle) records the responses to a questionnaire given to students from two public high schools in Portugal during the 2005-2006 school year. The questions relate to the students’ social, emotional, and demographic situation. 
In Portugal, most students study under the public education system. Unlike in the United States, students in Portugal complete 9 years of education before 3 years of high school. The school year is usually separated into trimesters. Students have two trimester scores (the G1 and G2 variables) and a final score (the G3 variable), and they are graded on a 20-point scale, where 0 is the lowest score and 20 is the highest (Cortez & Silva, 2008). The dataset was collected by Dr. Paulo Cortez, an information systems researcher at the Algoritmi R&D Centre, and Ms. Alice Silva, a secondary school teacher at Arcozelo, Portugal. Their intention was to collect information that may be reflected in the students’ grades, then use data mining (classification and regression) to build a prediction model. While the grades and number of absences were taken from report cards, the other attributes were retrieved by sending out a questionnaire with multiple-choice questions (the options were predefined). The questionnaire contained 34 questions and was answered by 788 students. However, 111 of the records were discarded due to identification issues. Some parts of the questionnaire, such as family income and access to computers, were also discarded because students gave mostly uniform answers to those questions. When the data was collected, public schools in Portugal still had poor information systems, so students mainly used paper in class and filled in physical copies of the questionnaire (Cortez & Silva, 2008). 
### 2.2 Principles of Measurement
My question is closely related to the dataset because the collectors of the data had goals similar to those of this project. The socioeconomic information shapes our understanding of the students’ living conditions from multiple aspects, which helps improve the prediction. Many attributes are potentially correlated with each other. 
However, I have some doubts about the precision of some attributes, such as weekly study time, family relationship quality, free time, and alcohol consumption. Though the study is anonymous, some students may still give untrue or inaccurate answers. For example, weekly study time (coded 1 to 4, from under 2 hours to more than 10 hours) is hard to measure, and the responses to the “free time,” “going out with friends,” and “health” attributes (all rated from 1 to 5, “very low” to “very high”) are rather subjective. Alcohol consumption is also measured on a similarly imprecise scale; students who drink more alcohol may rate their level of consumption differently, depending on their definition of “very high (level of consumption).”
### 2.3 Ethical Considerations
The analysis of the dataset does not seem to harm the represented groups. Instead, it may help educators build a better class environment to improve students’ performance. 
However, the population represented in the dataset limits the scope of the model and our conclusions. Because all students come from two Portuguese public high schools, the data set lacks diversity, and we cannot extrapolate our findings to high school students in other countries or to students in private high schools.

### 2.4 Variables
The data set has 33 variables and is split into two CSV files, one for each class. During preprocessing, I merge the two files and drop the repeated rows. The variables are listed as follows:
* school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
* sex - student's sex (binary: 'F' - female or 'M' - male)
* age - student's age (numeric: from 15 to 22)
* address - student's home address type (binary: 'U' - urban or 'R' - rural)
* famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
* Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
* Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
* Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
* Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other')
* Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other')
* reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
* guardian - student's guardian (nominal: 'mother', 'father' or 'other')
* traveltime - home to school travel time (numeric: 1 – < 15 min., 2 – 15 to 30 min., 3 – 30 min. to 1 hour or 4 – > 1 hour)
* studytime - weekly study time (numeric: 1 – < 2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours or 4 – > 10 hours)
* failures - number of past class failures (numeric: n if 1<=n<3, else 4)
* schoolsup - extra educational support (binary: yes or no)
* famsup - family educational support (binary: yes or no)
* paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
* activities - extra-curricular activities (binary: yes or no)
* nursery - attended nursery school (binary: yes or no)
* higher - wants to take higher education (binary: yes or no)
* internet - Internet access at home (binary: yes or no)
* romantic - with a romantic relationship (binary: yes or no)
* famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
* freetime - free time after school (numeric: from 1 - very low to 5 - very high)
* goout - going out with friends (numeric: from 1 - very low to 5 - very high)
* Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
* Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
* health - current health status (numeric: from 1 - very bad to 5 - very good)
* absences - number of school absences (numeric: from 0 to 93)
* These grades are related with the course subject, Math or Portuguese:
* G1 - first period grade (numeric: from 0 to 20)
* G2 - second period grade (numeric: from 0 to 20)
* G3 - final grade (numeric: from 0 to 20, output target)
<br />

### 2.5 Method
Since I am interested in finding the factors that correlate with GPA and in building a prediction model, I will try dimension reduction with principal component analysis (PCA), as well as a random forest classifier and a random forest regressor, both of which are built from decision trees.

In [None]:
import csv
import numpy as np
import pandas as pd
import altair as alt
pd.set_option('display.max_columns', 100)

In [None]:
mat_student = pd.read_csv(r'../input/student-alcohol-consumption/student-mat.csv')
por_student = pd.read_csv(r'../input/student-alcohol-consumption/student-por.csv')
comb_data = pd.concat([mat_student,por_student])
print("Existence of Null value:", comb_data.isnull().values.any())
print(comb_data.columns)

In [None]:
data = comb_data.drop_duplicates(["school","sex","age","address","famsize","Pstatus","Medu",
                           "Fedu","Mjob","Fjob","reason","nursery","internet"])
#attributes shown in the annex R file that identify each student
len(data) # There are 662 students in total
data

## Exploratory Data Analysis
First, I used a heatmap to find the correlations between all numeric variables. There are few strong or interesting correlations between the variables. Students’ final grades are strongly related to their grades in the first and second trimesters. After the grades comes the number of classes failed, which is reasonable because failing a class directly affects grades. Weekend and workday alcohol intake seem to correlate differently with the other variables. For instance, workday alcohol consumption has a 0.4 correlation with the “going out with friends” variable, while weekend alcohol consumption has only 0.25. However, it is important to note that the relationships involving the dozens of categorical variables are not shown in the heatmap. Thus, I move on to create distribution plots with Altair. 
<br /> The distributions of final grades are similar for the two sexes; the curve for female students has higher counts because the total number of female students is higher. From Figure 2, one can see that the number of male students who report a very high level of consumption during the workday is more than 3 times that of female students. The alcohol consumption distribution for male students is also more uniform than that for female students.


![image.png](attachment:image.png)
![image.png](attachment:image.png)

On the social side, going out with friends seems to have some positive correlation with workday alcohol consumption (Figure 3). Specifically, the plot shows a distinct difference between the counts of students at low and high drinking levels; many of those who answered 5 to both questions are male. 
From Figure 4, however, we can get a fuller picture of the relationship between alcohol intake and grades. Interestingly, final grades correlate more with workday alcohol intake than with weekend intake. The range of scores among students at the “very high” drinking level is narrower; no one in that group gets very high grades.  

![image.png](attachment:image.png)

In [None]:
#A plot that shows the correlation between all variables
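# Restrict to numeric columns; recent pandas versions raise an error on string columns
corr = data.select_dtypes('number').corr()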
cor_df = corr.reset_index().melt(id_vars = 'index')
cor_df['value'] = np.round(cor_df['value'],2)
base = alt.Chart(cor_df).encode(
    x='index:O',
    y='variable:O'
    )
label = base.mark_text().encode(
    text = 'value',
    color=alt.condition(
        abs(alt.datum.value) > 0.5, 
        alt.value('white'),
        alt.value('black')
    )
    )
cor = base.mark_rect().encode(
    color=alt.Color('value:Q',scale=alt.Scale(scheme='lightmulti'))
)

alt.layer(cor + label).properties(width = 500, height = 500)


* Not surprisingly, workday alcohol intake correlates with weekend alcohol intake (0.62), and the parents' education levels are correlated as well (0.64). 
* The grades of the first and second trimesters also correlate closely with the final grade.
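
The heatmap is drawn from a correlation matrix reshaped into long format. A minimal sketch of that reshaping on made-up rows follows; the column names mirror the data set, but the values are invented, so the resulting coefficients say nothing about the real data.

```python
import pandas as pd

# Made-up rows; column names follow the data set, values do not.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F"],
    "Dalc": [1, 2, 1, 4, 2],
    "Walc": [1, 3, 2, 5, 3],
    "G3": [14, 11, 15, 8, 12],
})

# Keep numeric columns only, so string columns such as "sex" are left out.
corr = toy.select_dtypes("number").corr()

# Long format: one row per (index, variable) pair, one heatmap cell each.
corr_long = corr.reset_index().melt(id_vars="index")
corr_long["value"] = corr_long["value"].round(2)
print(corr_long)
```

The categorical columns drop out at the `select_dtypes` step, which is why their relationships never appear in the heatmap.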

### Distribution
First, we look at the distributions of several variables, including final grade and alcohol consumption.

In [None]:
grade_gender = alt.Chart(data).mark_area(opacity = 0.5).encode(
    x = alt.X('G3:Q', bin = alt.Bin(maxbins=20),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Final Grade, 0 to 20 points')),
    y = alt.Y('count()',stack = None),
    color = alt.Color('sex',scale=alt.Scale(scheme='set1'))
).properties(
    title = "Distribution of Final Grades by Gender"
)


alc_gender = alt.Chart(data).mark_line(opacity = 0.5,point = True).encode(
    x = alt.X('Walc:Q', bin = alt.Bin(maxbins=10),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Weekend Alcohol Consumption (1 = very low, 5 = very high)')),
    y = alt.Y('count()',stack = None),
    color = alt.Color('sex',scale=alt.Scale(scheme='set1'))
).properties(
    title = "Distribution of Weekend Alcohol Consumption by Gender"
)

alt.vconcat(grade_gender, alc_gender).configure_title(
    fontSize = 18
).configure_axis(
    labelFontSize = 13,
    titleFontSize = 16
).configure_legend(
    titleFontSize = 14,
    labelFontSize = 13)


In [None]:
alt.Chart(data).mark_bar(opacity = 0.5).encode(
    x = alt.X('Walc:Q', bin = alt.Bin(maxbins=10),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Weekend Alcohol Consumption (1 = very low, 5 = very high)')),
    y = alt.Y('count()',stack = None),
    color = alt.Color('higher',scale=alt.Scale(scheme='set1'))
).properties(
    title = "Distribution of Weekend Alcohol Consumption by Plan for Higher Education"
)

In [None]:
walc_grade = alt.Chart(data).mark_line().encode(
    x = alt.X('G3:Q', bin = alt.Bin(maxbins=10),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Final Grade, 0 to 20 points')),
    y = alt.Y('count()',stack = None),
    color = alt.Color('Walc:O',scale=alt.Scale(scheme='set1'))
).properties(
    width = 220,height = 220,
    title = "Distribution of grade by Weekend Alcohol Consumption"
)
dalc_grade = alt.Chart(data).mark_line().encode(
    x = alt.X('G3:Q', bin = alt.Bin(maxbins=10),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Final Grade, 0 to 20 points')),
    y = alt.Y('count()',stack = None),
    color = alt.Color('Dalc:O',scale=alt.Scale(scheme='set1'))
).properties(
    width = 220,height = 220,
    title = "Distribution of Grade by Weekday Alcohol Consumption"
)
alt.hconcat(walc_grade, dalc_grade)


In [None]:
# a = alt.Chart(data).mark_circle(opacity = 0.5).encode(
#     x = alt.X('Walc', bin = alt.Bin(maxbins=10),scale = alt.Scale(zero = False),axis = alt.Axis(title = 'Final Grade, 0 to 20 points')),
#     y = alt.Y('G3:Q',stack = None),
#     color = alt.Color('Walc:O',scale=alt.Scale(scheme='set1'))
# ).properties(
#     width = 220,height = 220,
#     title = "Distribution of grade by Weekend Alcohol Consumption"
# )
wa = alt.Chart(data).transform_density(
    'G3',
    as_=['G3', 'density'],
    groupby=['Walc']
).mark_area(orient='horizontal').encode(
    y='G3:Q',
    color= alt.Color('Walc:N',scale=alt.Scale(scheme='lightmulti')),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    column=alt.Column(
        'Walc:N',
        header=alt.Header(
            title = 'Weekend Alcohol Consumption (1 = "very low", 5 = "very high")',
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        )
    )).properties(
    width=100
)


da = alt.Chart(data).transform_density(
    'G3',
    as_=['G3', 'density'],
    groupby=['Dalc']
).mark_area(orient='horizontal').encode(
    y='G3:Q',
    color= alt.Color('Dalc:N',scale=alt.Scale(scheme='lightmulti')),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    column=alt.Column(
        'Dalc:N',
        header=alt.Header(
            title = 'Normal Weekday Alcohol Consumption (1 = "very low", 5 = "very high")',
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        )
    )).properties(
    width=100
)

(wa & da).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

In [None]:
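# Mean and median of every numeric column by sex; numeric_only skips string columns
data_sex = pd.DataFrame(pd.concat([data.groupby('sex').mean(numeric_only=True), data.groupby('sex').median(numeric_only=True)]).T)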
data_sex.columns = ['F_mean','M_mean','F_med','M_med']
data_sex
# alt.Chart(data).transform_density( 'Walc',
# as_=['Walc', 'density'], ).mark_line().encode(
#     x="Walc",
#     y='density:Q',
#     color = alt.Color('sex')
# ).properties(
# title = "Density Distribution of Shots Attempted" ).configure_title(
# fontSize = 18 ).configure_axis(
#     labelFontSize = 13,
#     titleFontSize = 15)

From this we can see that
* Female students tend to report studying more than male students. 
* On weekends, male students drink more on average than female students (male Walc = 2.7, female Walc = 1.92).
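
A per-group summary like the one these observations are read from can be sketched as follows. The rows are made up, so the numbers do not reproduce the values quoted above.

```python
import pandas as pd

# Made-up rows; column names follow the data set, values do not.
toy = pd.DataFrame({
    "sex": ["F", "F", "M", "M"],
    "address": ["U", "R", "U", "R"],
    "studytime": [3, 2, 1, 2],
    "Walc": [1, 2, 3, 4],
})

# Select the numeric columns explicitly, then compute mean and median per sex.
summary = toy.groupby("sex")[["studytime", "Walc"]].agg(["mean", "median"])
print(summary)
```

Selecting the columns before aggregating keeps string columns such as `address` out of the mean and median calculation.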

## Modeling
### Principal Component Analysis
To perform PCA, we need to convert the original data to numeric form and drop some of the columns, such as school, guardian, and Fjob/Mjob.
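
The conversion and decomposition can be sketched end to end on a small made-up table. This is an illustration under assumed data, using plain NumPy for the standardization and the singular value decomposition; the column choice is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names follow the data set, values do not.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "address": ["U", "U", "R", "R", "U", "R"],
    "studytime": [2, 1, 3, 1, 4, 2],
    "Walc": [1, 4, 2, 5, 1, 3],
    "G3": [14, 9, 15, 8, 17, 11],
})

# One indicator column per category level; numeric columns pass through.
numeric = pd.get_dummies(toy).astype(float)

# Standardize each column to mean 0 and unit (population) variance.
z = (numeric - numeric.mean()) / numeric.std(ddof=0)

# SVD of the standardized matrix; squared singular values are
# proportional to the variance captured by each principal component.
u, d, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
scores = u @ np.diag(d)        # principal component scores
pve = d**2 / np.sum(d**2)      # proportion of variance explained
print(np.round(np.cumsum(pve), 3))
```

Note that the two indicator columns of a binary variable (for example `sex_F` and `sex_M`) are perfectly collinear, so the trailing components carry almost no variance. Passing `drop_first=True` to `pd.get_dummies` is one way to avoid that redundancy.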


In [None]:
# Work on a copy so that the original frame stays intact
pcadata = data.copy()
# pcadata = data.replace({'GT3':1, 'LE3':0,
#                         'U':1,'R':0,
#                        'A':0,'T':1,
#                        'no':0,'yes':1,
#                        'M':0,'F':1})
# pcadata.drop('school',inplace = True, axis = 1)

pcadata.drop(['school','Mjob','Fjob','reason','guardian',
              'nursery','Medu','Fedu','famrel','traveltime'],inplace = True, axis = 1)
pcadata.drop(['G1','G2'],inplace = True, axis = 1)
pcadata = pd.get_dummies(pcadata)
pcadata.shape

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pcadata.shape #46 columns
sc = StandardScaler(with_mean=True, with_std=True)
pcadata = sc.fit_transform(pcadata)
# pca = PCA(26)
# pca = pca.fit(pcadata)
u, d, vt = np.linalg.svd(pcadata, full_matrices=False)
pc = u@np.diag(d)
pcs_df = pd.DataFrame(pc)
pve = pd.DataFrame(d**2/np.sum(d**2),columns=["Percent of Variance Explained"]).reset_index()
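# Share of variance explained by the first 16 principal components
pve["Percent of Variance Explained"].iloc[:16].sum()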

In [None]:
scree = alt.Chart(pve).mark_bar(width=10).encode(
    x = alt.X('index:O', title = 'PCs'),
    y='Percent of Variance Explained:Q'
).properties(
    width = 350,
    title = 'Scree Plot'
)

pve_cumsum = pd.DataFrame(np.cumsum(d**2/np.sum(d**2)),columns=["Percent of Variance Explained"]).reset_index()
cumulative = alt.Chart(pve_cumsum).mark_line().encode(
    x = alt.X('index:O', title = 'PCs'),
    y = 'Percent of Variance Explained'
).properties(
    width = 300,
    title = 'Cumulative Explained Variance')
scree | cumulative

### Finding GPA Predictor using Random Forest (With and Without G1 and G2)

Next, I try another approach to the goal: the random forest method.
I split the data into training and testing sets, then apply RandomForestRegressor().
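
Once a regressor is fit, its test error is commonly summarized as the mean squared error: the average of the squared residuals. The sketch below uses made-up numbers and shows why each residual has to be squared before the values are combined.

```python
import numpy as np

# Made-up true and predicted grades, for illustration only.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_pred = np.array([11.0, 11.0, 9.0, 14.0])

residuals = y_pred - y_true    # [ 1., -1.,  1., -1.]

# Mean squared error: square each residual first, then average.
mse = np.mean(residuals**2)

# Squaring the summed residuals instead lets opposite-sign errors cancel.
squared_sum = np.sum(residuals)**2 / len(y_true)

print(mse, squared_sum)
```

Here every prediction is off by one point, so the mean squared error is 1.0, while the squared sum reports 0.0 because the positive and negative errors cancel.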

In [None]:
data.head()

In [None]:
from sklearn.model_selection import train_test_split

#Encode the binary categorical values as 0/1
onehot = data.replace({'GT3':1, 'LE3':0,
                        'U':1,'R':0,
                       'A':0,'T':1,
                       'no':0,'yes':1,
                       'M':0,'F':1})

onehot.drop(['school','Mjob','Fjob','reason','guardian',
              'nursery','Medu','Fedu','famrel','traveltime'],inplace = True, axis = 1)
rfdf = onehot.copy()
# rfdf = pd.get_dummies(data)
rfdf['G3'] = rfdf['G3'].astype(int)
X=rfdf.drop('G3',axis = 1)
y=rfdf['G3'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
y_test.shape

In [None]:
#Import the random forest regressor
from sklearn.ensemble import RandomForestRegressor

#Create a random forest regressor object with 1000 trees
reg=RandomForestRegressor(n_estimators=1000)

#fit the dataset into the object
reg.fit(X_train, y_train)

In [None]:
pred_gpa = reg.predict(X_test)
pred_gpa
# Mean squared error on the test set: average of the squared residuals
error = np.mean((pred_gpa - y_test)**2)
error
# pd.DataFrame(pred_gpa,y_test)

In [None]:
#Create a feature importance table, which gives the contribution of each variable to the GPA prediction.
imp = pd.DataFrame(reg.feature_importances_, index=X.columns, columns=["Feature Importance"])
imp = imp.sort_values('Feature Importance', ascending = False)
imp.T

From this table we can see that G2 contributes the most to the model. 
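
The section title mentions fitting the model with and without G1 and G2. A minimal sketch of that comparison on synthetic data follows. The relationships between the columns are assumptions built into the fake data, and `holdout_mse` is a hypothetical helper, so the printed errors illustrate the procedure rather than any result on the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 300

# Synthetic data: G3 tracks G2 closely and depends only weakly on Walc.
toy = pd.DataFrame({
    "Walc": rng.integers(1, 6, n),
    "studytime": rng.integers(1, 5, n),
    "G1": rng.integers(5, 19, n),
})
toy["G2"] = toy["G1"] + rng.integers(-1, 2, n)
toy["G3"] = toy["G2"] - 0.3 * toy["Walc"] + rng.normal(0, 0.5, n)

def holdout_mse(features):
    """Fit a forest on the given columns and return its test-set MSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[features], toy["G3"], test_size=0.3, random_state=0)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_tr, y_tr)
    return float(np.mean((model.predict(X_te) - y_te) ** 2))

mse_with = holdout_mse(["Walc", "studytime", "G1", "G2"])
mse_without = holdout_mse(["Walc", "studytime"])
print(mse_with, mse_without)
```

Comparing the two errors shows how much of the predictive power comes from the earlier grades and how much is left for the habit and background variables once those grades are removed.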

### Finding Alcohol Consumption Predictor with Random Forest
#### Weekend Alcohol Consumption

In [None]:
X1=rfdf.drop({'Walc','Dalc'},axis = 1)
y1=rfdf['Walc'] 
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.3)

#Create a random forest regressor object
reg1 = RandomForestRegressor(n_estimators=1300)

#fit the dataset into the object
reg1.fit(X_train1, y_train1)

In [None]:
prd_walc = reg1.predict(X_test1)
# Mean squared error on the test set: average of the squared residuals
error_alc = np.mean((prd_walc - y_test1)**2)
error_alc

In [None]:
imp_alc = pd.DataFrame(reg1.feature_importances_, index=X1.columns, columns=["Feature Importance"])
imp_alc = imp_alc.sort_values('Feature Importance', ascending = False)
imp_alc.head()
print("error:", error_alc)
print(imp_alc.head())

In [None]:
rfdf.shape

#### Workday Alcohol Consumption

In [None]:
X2=rfdf.drop({'Walc','Dalc'},axis = 1)
y2=rfdf['Dalc'] 
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, test_size=0.3)

#Create a random forest regressor object
reg2 = RandomForestRegressor(n_estimators=1000)

#fit the dataset into the object
reg2.fit(X_train2, y_train2)

prd_dalc = reg2.predict(X_test2)
# Mean squared error on the test set: average of the squared residuals
error_dalc = np.mean((prd_dalc - y_test2)**2)
print("error:", error_dalc)
imp_dalc = pd.DataFrame(reg2.feature_importances_, index=X2.columns, columns=["Feature Importance"])
imp_dalc = imp_dalc.sort_values('Feature Importance', ascending = False)
print(imp_dalc.head())

In [None]:
imp_dalc.head()