# Data Understanding

The data for this data science project on factors influencing teenage alcoholism was sourced from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/student%2Bperformance#). It was donated to the site by Prof. Paulo Cortez of the University of Minho. His original work on the dataset, "Using Data Mining to Predict Secondary School Student Performance", can be found [here](http://www3.dsi.uminho.pt/pcortez/student.pdf).

The data set contains information on various attributes of each student taking Portuguese language classes at either of two higher secondary schools, the Gabriel Pereira School and the Mousinho da Silveira School. It covers 649 students and 33 attributes. A list of all the features with descriptions can be found in the [Readme](https://github.com/Yeshi341/Student_Alcohol_Consumption/blob/master/Readme.md) section of the GitHub page for this project. The features are also described one by one in the [EDA](EDA.ipynb) notebook, where EDA was performed on each variable.

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report,confusion_matrix

In [3]:
df = pd.read_csv('preprocessing_file.csv')

In [7]:
pd.set_option("display.max_columns",None)
df.head()

Unnamed: 0,school,sex,age,address,Pstatus,paid,activities,nursery,internet,romantic,absences,alc,stability,academic_support,idle,grade_avg,delinquency,Medu_1,Medu_2,Medu_3,Medu_4,Fedu_1,Fedu_2,Fedu_3,Fedu_4,Mjob_2,Mjob_3,Mjob_4,Mjob_5,Fjob_2,Fjob_3,Fjob_4,Fjob_5,reason_2,reason_3,reason_4,guardian_2,guardian_3,traveltime_2,traveltime_3,traveltime_4,studytime_2,studytime_3,studytime_4,failures_1,failures_2,failures_3,health_2,health_3,health_4,health_5
0,1,1,18,0,0,0,0,1,0,0,4,0,4,2,12,7.33,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0
1,1,1,17,0,1,0,0,0,1,0,2,0,5,2,9,10.33,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0
2,1,1,15,0,1,0,0,1,1,0,6,0,0,2,6,12.33,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0
3,1,1,15,0,1,0,1,1,1,1,0,0,3,2,4,14.0,0,0,0,0,1,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1
4,1,1,16,0,1,0,0,1,0,0,0,0,4,2,6,12.33,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1


In [8]:
df.dtypes

school                int64
sex                   int64
age                   int64
address               int64
Pstatus               int64
paid                  int64
activities            int64
nursery               int64
internet              int64
romantic              int64
absences              int64
alc                   int64
stability             int64
academic_support      int64
idle                  int64
grade_avg           float64
delinquency           int64
Medu_1                int64
Medu_2                int64
Medu_3                int64
Medu_4                int64
Fedu_1                int64
Fedu_2                int64
Fedu_3                int64
Fedu_4                int64
Mjob_2                int64
Mjob_3                int64
Mjob_4                int64
Mjob_5                int64
Fjob_2                int64
Fjob_3                int64
Fjob_4                int64
Fjob_5                int64
reason_2              int64
reason_3              int64
reason_4            

In [9]:
df.shape

(649, 51)

### Train Test Split

In [24]:
X = df.drop(columns=['grade_avg'])  # features: every column except the target
y = df['grade_avg']                 # target: grade_avg
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=150, test_size=0.2)

print("Training set - Features: ", X_train.shape, "Target: ", y_train.shape)
print("Test set - Features: ", X_test.shape, "Target: ", y_test.shape)

Training set - Features:  (519, 50) Target:  (519,)
Test set - Features:  (130, 50) Target:  (130,)
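The 519/130 split follows from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the training set takes the remaining rows. A minimal sketch on a placeholder array with the same number of rows as `df` (the array contents and the `*_demo` names are made up; only the row count matters):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 649  # same row count as df
test_size = 0.2

# Placeholder features and target; the values are arbitrary.
X_demo = np.zeros((n_samples, 3))
y_demo = np.zeros(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=test_size, random_state=150
)

expected_test = math.ceil(test_size * n_samples)  # 129.8 rounds up to 130
expected_train = n_samples - expected_test        # the remaining 519 rows

print(X_tr.shape[0], X_te.shape[0])
```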


**Scaling**

In [25]:
scaler= MinMaxScaler()  
scaler.fit(X_train)

X_train_scale = scaler.transform(X_train)  
X_test_scale = scaler.transform(X_test)
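Fitting the scaler on the training rows only keeps information from the test set out of the preprocessing step. One consequence is that scaled test values are not guaranteed to lie in [0, 1]. A small sketch with made-up numbers (the `*_demo` names are illustrative and not part of this notebook):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train_demo = np.array([[15.0], [16.0], [18.0]])
test_demo = np.array([[17.0], [19.0]])

demo_scaler = MinMaxScaler()
demo_scaler.fit(train_demo)  # learns min=15 and max=18 from the training rows only

train_scaled = demo_scaler.transform(train_demo)
test_scaled = demo_scaler.transform(test_demo)

print(train_scaled.ravel())  # training values span 0 to 1
print(test_scaled.ravel())   # 19 is above the training max, so it maps above 1
```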

# Feature Selection

### Select Kbest 10

In [26]:

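# Score every feature against the target and keep the 10 highest-scoring ones.
# Note: f_classif is an ANOVA F-test that treats each distinct y value as a class.
selector = SelectKBest(f_classif, k=10)
selector.fit(X_train, y_train)

# Boolean mask from the fitted selector -> names of the kept columns
selected_columns = X_train.columns[selector.get_support()]

# Apply the same column subset to both splits
X_train_kb10 = X_train[selected_columns]
X_test_kb10 = X_test[selected_columns]
print(X_train_kb10.shape, X_test_kb10.shape)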


(519, 10) (130, 10)
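`f_classif` is designed for a categorical target; with a continuous target such as `grade_avg` it treats every distinct value as its own class. scikit-learn also provides `f_regression`, which scores each feature by its univariate linear relationship with a continuous target. The sketch below shows that alternative on synthetic data; the column names and data are hypothetical, and this is not the scoring function that produced the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Synthetic stand-in for X_train / y_train: 6 features, 2 of them informative.
X_demo = pd.DataFrame(
    rng.normal(size=(200, 6)),
    columns=[f"feat_{i}" for i in range(6)],  # hypothetical column names
)
y_demo = (
    3.0 * X_demo["feat_0"]
    - 2.0 * X_demo["feat_3"]
    + rng.normal(scale=0.1, size=200)
)

# Keep the 2 features with the strongest univariate linear relationship to y.
reg_selector = SelectKBest(score_func=f_regression, k=2)
reg_selector.fit(X_demo, y_demo)

selected_demo = list(X_demo.columns[reg_selector.get_support()])
print(selected_demo)
```

Swapping the scoring function would change which columns are selected, so the error values reported later would need to be recomputed.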


### Select Kbest 15

In [27]:

selector = SelectKBest(f_classif, k=15) 
selector.fit(X_train, y_train)

selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

X_train_kb15 = X_train[selected_columns]
X_test_kb15 = X_test[selected_columns]

### Select Kbest 20

In [28]:

selector = SelectKBest(f_classif, k=20) 
selector.fit(X_train, y_train)

selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

X_train_kb20 = X_train[selected_columns]
X_test_kb20 = X_test[selected_columns]

### Select Kbest 25

In [29]:

selector = SelectKBest(f_classif, k=25)
selector.fit(X_train, y_train)

selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

X_train_kb25 = X_train[selected_columns]
X_test_kb25 = X_test[selected_columns]

### Select Kbest 30

In [30]:

selector = SelectKBest(f_classif, k=30)
selector.fit(X_train, y_train)

selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

X_train_kb30 = X_train[selected_columns]
X_test_kb30 = X_test[selected_columns]

### Select Kbest 35

In [31]:

selector = SelectKBest(f_classif, k=35) 
selector.fit(X_train, y_train)

selected_columns = X_train.columns[selector.get_support()]
removed_columns = X_train.columns[~selector.get_support()]

X_train_kb35 = X_train[selected_columns]
X_test_kb35 = X_test[selected_columns]
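The six Select K Best cells above differ only in the value of `k`, so the same subsets could be built in one loop. The sketch below uses synthetic stand-ins for `X_train`, `X_test` and `y_train` so that it runs on its own; `kbest_sets` and the `*_demo` names are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train, X_test and y_train (values are arbitrary).
cols = [f"feat_{i}" for i in range(40)]
X_train_demo = pd.DataFrame(rng.normal(size=(200, 40)), columns=cols)
X_test_demo = pd.DataFrame(rng.normal(size=(50, 40)), columns=cols)
y_train_demo = rng.integers(0, 5, size=200)  # a target with a few repeated values

kbest_sets = {}  # hypothetical container: k -> (train subset, test subset)
for k in (10, 15, 20, 25, 30, 35):
    selector = SelectKBest(f_classif, k=k)
    selector.fit(X_train_demo, y_train_demo)  # fit on the training split only
    keep = X_train_demo.columns[selector.get_support()]
    kbest_sets[k] = (X_train_demo[keep], X_test_demo[keep])

for k, (tr, te) in kbest_sets.items():
    print(k, tr.shape, te.shape)
```

Keeping the subsets in a dictionary keyed by `k` avoids overwriting `selector` and `selected_columns` from cell to cell.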

# Models

Our target variable in this project tells whether a student is a heavy alcohol drinker or not. Our main concern is that we do not want to predict that a student is not a heavy drinker when they actually are, so we are interested in minimizing False Negatives. Correctly identifying that a student has a problem allows us to allocate help or resources appropriately and to improve conditions for that student so as to minimize any drinking problem.

Thus, our focus will be on the recall (sensitivity) score, which tells us the proportion of actual positives identified correctly, given by TP / (TP + FN). The higher this score, the better.
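To make the recall definition concrete, here is a small worked example with made-up labels (1 = heavy drinker, 0 = not a heavy drinker). It computes TP / (TP + FN) by hand from the confusion matrix and compares the result with scikit-learn's `recall_score`; the `*_demo` names are illustrative only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 = heavy drinker, 0 = not a heavy drinker.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
manual_recall = tp / (tp + fn)  # 3 true positives, 1 false negative

library_recall = recall_score(y_true_demo, y_pred_demo)
print(manual_recall, library_recall)
```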

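We also looked at the Accuracy and F1 scores as extra metrics for comparing model performance. The linear regression models in this notebook predict the continuous `grade_avg` column, so they are compared on mean absolute error (MAE) and mean squared error (MSE), where lower is better.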

### 1. Linear Model
**Running a model without any class imbalance resolution on the original features with no transformations or scaling**

In [32]:
lr1 = LinearRegression()

lr1.fit(X_train, y_train)
p = lr1.predict(X_test)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))


MAE : 1.8904532911737668
MSE : 6.619305947373856
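For reference, the two error metrics printed above can be reproduced by hand. The sketch below uses made-up values (the `*_demo` arrays are not from this data set) and checks the manual calculation against `sklearn.metrics`:

```python
import numpy as np
from sklearn import metrics

# Made-up true values and predictions.
y_true_demo = np.array([10.0, 12.0, 14.0, 8.0])
y_pred_demo = np.array([11.0, 12.0, 11.0, 10.0])

errors = y_true_demo - y_pred_demo    # [-1, 0, 3, -2]
manual_mae = np.mean(np.abs(errors))  # (1 + 0 + 3 + 2) / 4 = 1.5
manual_mse = np.mean(errors ** 2)     # (1 + 0 + 9 + 4) / 4 = 3.5

print(manual_mae, metrics.mean_absolute_error(y_true_demo, y_pred_demo))
print(manual_mse, metrics.mean_squared_error(y_true_demo, y_pred_demo))
```

Because MSE squares each error, the single error of 3 contributes 9 of the 14 squared units, so MSE is more sensitive to large misses than MAE.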


In [33]:
results = {}

results['1.linear_original_features'] = (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193)}

### 2. Linear Regression with X_train_scale (Scaled Train Set)

In [34]:
lrscaled = LinearRegression()
lrscaled.fit(X_train_scale, y_train)
p = lrscaled.predict(X_test_scale)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))



MAE : 1.8904532911737661
MSE : 6.619305947373853


In [35]:
results['2.lr_scaled'] = (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results


{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193)}
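The scaled and unscaled models agree up to floating-point rounding. This is expected for ordinary least squares with an intercept: shifting and rescaling each feature changes the coefficients but not the predictions, provided the features are not perfectly collinear. A minimal sketch on synthetic data (all names and values below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales.
col_scales = np.array([1.0, 10.0, 100.0, 0.1])
X_tr = rng.normal(size=(100, 4)) * col_scales
X_te = rng.normal(size=(20, 4)) * col_scales
y_tr = X_tr @ np.array([0.5, -0.2, 0.01, 3.0]) + rng.normal(size=100)

# Model on the raw features
preds_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)

# Model on min-max scaled features (scaler fitted on the training rows only)
mm = MinMaxScaler().fit(X_tr)
preds_scaled = LinearRegression().fit(mm.transform(X_tr), y_tr).predict(mm.transform(X_te))

same_predictions = np.allclose(preds_raw, preds_scaled)
print(same_predictions)
```

Scaling matters for models that are sensitive to feature magnitude, such as regularized or distance-based models, but not for plain `LinearRegression`.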

### 3. Linear Regression with K best 10

In [36]:
kb10_lr = LinearRegression()

kb10_lr.fit(X_train_kb10, y_train)
p = kb10_lr.predict(X_test_kb10)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))




MAE : 1.8585921421933038
MSE : 6.034359976188658


In [37]:
results['3.lr_kb10'] =  (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344)}

### 4. Linear Regression with K best 15

In [38]:
kb15_lr = LinearRegression()

kb15_lr.fit(X_train_kb15, y_train)
p = kb15_lr.predict(X_test_kb15)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))


MAE : 1.7920546784091607
MSE : 5.506881045361295


In [39]:
results['4.lr_kb15'] = (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344),
 '4.lr_kb15': (1.7921, 5.5069)}

### 5. Linear Regression with K best 20

In [40]:
kb20_lr = LinearRegression()

kb20_lr.fit(X_train_kb20, y_train)
p = kb20_lr.predict(X_test_kb20)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))

MAE : 1.8059461464244662
MSE : 5.831062116589294


In [41]:
results['5.lr_kb20'] =  (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344),
 '4.lr_kb15': (1.7921, 5.5069),
 '5.lr_kb20': (1.8059, 5.8311)}

### 6. Linear Regression with K best 25

In [42]:
kb25_lr = LinearRegression()

kb25_lr.fit(X_train_kb25, y_train)
p = kb25_lr.predict(X_test_kb25)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))

MAE : 1.8189869499930544
MSE : 5.727453909697653


In [43]:
results['6.lr_kb25'] = (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344),
 '4.lr_kb15': (1.7921, 5.5069),
 '5.lr_kb20': (1.8059, 5.8311),
 '6.lr_kb25': (1.819, 5.7275)}

### 7. Linear Regression with K best 30

In [44]:
kb30_lr = LinearRegression()

kb30_lr.fit(X_train_kb30, y_train)
p = kb30_lr.predict(X_test_kb30)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))

MAE : 1.8772387211796646
MSE : 6.188421396564386


In [45]:
results['7.lr_kb30'] =  (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344),
 '4.lr_kb15': (1.7921, 5.5069),
 '5.lr_kb20': (1.8059, 5.8311),
 '6.lr_kb25': (1.819, 5.7275),
 '7.lr_kb30': (1.8772, 6.1884)}

### 8. Linear Regression with K best 35

In [46]:
kb35_lr = LinearRegression()

kb35_lr.fit(X_train_kb35, y_train)
p = kb35_lr.predict(X_test_kb35)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))

MAE : 1.8510871528279271
MSE : 6.11453353994273


In [47]:
results['8.lr_kb35'] =  (round(metrics.mean_absolute_error(y_test, p),4), 
                            round(metrics.mean_squared_error(y_test, p),4), 
                          )
results

{'1.linear_original_features': (1.8905, 6.6193),
 '2.lr_scaled': (1.8905, 6.6193),
 '3.lr_kb10': (1.8586, 6.0344),
 '4.lr_kb15': (1.7921, 5.5069),
 '5.lr_kb20': (1.8059, 5.8311),
 '6.lr_kb25': (1.819, 5.7275),
 '7.lr_kb30': (1.8772, 6.1884),
 '8.lr_kb35': (1.8511, 6.1145)}
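The dictionary of (MAE, MSE) tuples is easier to compare as a table. The sketch below copies the values printed above into a DataFrame and looks up the model with the lowest error on each metric; `results_copy`, `results_df`, `best_by_mae` and `best_by_mse` are illustrative names, not part of this notebook.

```python
import pandas as pd

# (MAE, MSE) pairs copied from the results dictionary printed above.
results_copy = {
    '1.linear_original_features': (1.8905, 6.6193),
    '2.lr_scaled': (1.8905, 6.6193),
    '3.lr_kb10': (1.8586, 6.0344),
    '4.lr_kb15': (1.7921, 5.5069),
    '5.lr_kb20': (1.8059, 5.8311),
    '6.lr_kb25': (1.819, 5.7275),
    '7.lr_kb30': (1.8772, 6.1884),
    '8.lr_kb35': (1.8511, 6.1145),
}

results_df = pd.DataFrame.from_dict(
    results_copy, orient="index", columns=["MAE", "MSE"]
)

best_by_mae = results_df["MAE"].idxmin()  # model with the lowest MAE
best_by_mse = results_df["MSE"].idxmin()  # model with the lowest MSE

print(results_df.sort_values("MAE"))
print(best_by_mae, best_by_mse)
```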

**Final model**
The lowest MAE and MSE come from `4.lr_kb15`, so we use the 15 features chosen by Select K Best for the final linear regression model.

In [49]:
kb15_lr = LinearRegression()

kb15_lr.fit(X_train_kb15, y_train)
p = kb15_lr.predict(X_test_kb15)

print('MAE :',metrics.mean_absolute_error(y_test,p))
print('MSE :',metrics.mean_squared_error(y_test,p))

MAE : 1.7920546784091607
MSE : 5.506881045361295
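MSE is expressed in squared units of `grade_avg`. Taking its square root gives the root mean squared error (RMSE), which is in the same units as `grade_avg` and the MAE. A short calculation using the values printed above (the variable names are illustrative):

```python
import math

final_mae = 1.7920546784091607  # MAE printed above
final_mse = 5.506881045361295   # MSE printed above

final_rmse = math.sqrt(final_mse)  # back in the units of grade_avg
print(round(final_rmse, 2))
```

The RMSE of about 2.35 is larger than the MAE of about 1.79, which indicates that some individual errors are noticeably larger than the typical one.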
