http://archive.ics.uci.edu/ml/datasets/Student+Performance#

### All needed imports

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 80

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVC # SVM model with kernels
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error


import warnings
warnings.filterwarnings('ignore')

### Loading and Exploring Data

There are two files of student performance in two subjects: math and Portuguese (Portugal is the country the dataset is from). Important notice: the description (referred to later as DESCR) says that "there are several (382) students that belong to both datasets". Since the Portuguese data set is considerably larger than the math one (649 versus 395 students), I will use the former.

In [None]:
student_por = pd.read_csv('/kaggle/input/student-performance-data-set/student-por.csv')
student_por.head()

In [None]:
student_por.describe()

In [None]:
# check missing values in variables

student_por.isnull().sum()

In [None]:
student_por.isnull().any()

In [None]:
student_por.info()

#### I know from DESCR that *G1* and *G2* are the first- and second-period grades, so they precede the final grade and correlate a great deal with our target variable *G3*. For that reason I won't be making another column with the average of these three

#### After inspecting the dataset description I'm curious how the *health* and *absences* values correlate. Perhaps I could make one feature out of them. Before looking for correlation I tried normalizing these features, because their ranges differ a lot.

### UPD: normalizing the values didn't help. Pearson's correlation coefficient does not change when a feature is shifted or rescaled by a positive constant, so the result is the same with or without normalization. Nonetheless, I leave the code in the cell below as a reminder to myself

In [None]:
copied = student_por.copy()

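def mean_normalization(col):
    # mean normalization using each column's own mean and range
    return (col - col.mean()) / (col.max() - col.min())

copied['absences'] = mean_normalization(copied['absences'])
copied['health'] = mean_normalization(copied['health'])

# correlation is only defined for the numeric columns
corr_matrix = copied.select_dtypes(include='number').corr()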

corr_matrix["absences"].sort_values(ascending=False)

In [None]:
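# keep only numeric columns; recent pandas versions raise an error on string columns
corr_matrix = student_por.select_dtypes(include='number').corr()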

corr_matrix["absences"].sort_values(ascending=False)
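The two cells above return the same correlations because Pearson's r is invariant to shifting a variable or multiplying it by a positive constant. The following is a minimal sketch on synthetic data; the random arrays and the helper name `mean_normalize` are assumptions made for illustration, not values from the student data.

```python
import numpy as np

# Synthetic stand-ins for two columns with very different ranges (illustrative only)
rng = np.random.default_rng(0)
absences = rng.integers(0, 33, size=200).astype(float)
health = rng.integers(1, 6, size=200).astype(float)

def mean_normalize(x):
    # Shift by the mean and divide by the range: a positive affine transformation
    return (x - x.mean()) / (x.max() - x.min())

r_raw = np.corrcoef(absences, health)[0, 1]
r_norm = np.corrcoef(mean_normalize(absences), mean_normalize(health))[0, 1]

# The two coefficients agree up to floating-point error
print(np.isclose(r_raw, r_norm))
```

Scaling still matters for gradient-based models such as SGDRegressor; it just has no effect on the correlation matrix.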

### A little bit about correlation. 

<p>Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the *corr()* method.
    
The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation. Finally, coefficients close to 0 mean that there is no linear correlation. </p>

<p>The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”)</p>
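To illustrate the point about nonlinear relationships, here is a small sketch with made-up data: `y_quad` depends entirely on `x`, yet its Pearson coefficient is essentially zero because the relationship is symmetric rather than linear. The variable names are chosen for this example only.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # symmetric around zero
y_linear = 2 * x + 1          # perfectly linear in x
y_quad = x ** 2               # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quad = np.corrcoef(x, y_quad)[0, 1]

print(round(r_linear, 6))     # very close to 1
print(round(abs(r_quad), 6))  # very close to 0
```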

#### Let’s look at how much each *numerical* attribute correlates with the *G3* value:

In [None]:
corr_matrix["G3"].sort_values(ascending=False)

#### Apparently, *G3* has correlation not only with *G1* and *G2* but also with *studytime*, *failures*, *Dalc*, *Walc*, *traveltime*, *freetime*, *age*, *Medu* (mother's education) and *Fedu* (father's education)
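Sorting with `sort_values(ascending=False)` places features with a strong negative coefficient at the bottom of the list, where they are easy to overlook. One possible alternative is to rank by absolute value. The sketch below uses a tiny invented frame (`toy`), not the real student data.

```python
import pandas as pd

# Toy data invented for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "G3":        [8, 10, 11, 13, 15, 17],
    "studytime": [1, 2, 1, 2, 3, 4],
    "failures":  [3, 2, 1, 1, 0, 0],
})

corr_with_g3 = toy.corr()["G3"].drop("G3")

# Order by absolute value so strong negative correlations are not pushed to the bottom
order = corr_with_g3.abs().sort_values(ascending=False).index
ranked = corr_with_g3.reindex(order)
print(ranked)
```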

#### Another way to check for correlation between attributes is to use the pandas *scatter_matrix()* function, which plots every numerical attribute against every other numerical attribute. Since there are 16 numerical attributes, we would get 16x16 = 256 plots, which would not fit on a page—so let’s just focus on a few promising attributes that seem most correlated with *G3*

In [None]:
from pandas.plotting import scatter_matrix

# I don't take G2 and G1 into account, because they are an obvious choice
attributes = ["G3", "studytime", "Fedu", "failures", "Dalc", "Walc"] 

scatter_matrix(student_por[attributes], figsize=(16, 12))

## Choosing features. The goal is to predict *G3*

#### And yet another way to check numeric data for correlations

In [None]:
import seaborn as sns

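# keep only numeric columns; recent pandas versions raise an error on string columns
corr_matrix = student_por.select_dtypes(include='number').corr()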

plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap="Blues")
plt.title('Correlation Heatmap', fontsize=20)

#### Judging by this heatmap and by the previous correlation matrices, *studytime, failures, Dalc, Walc, traveltime, freetime, age, Medu and Fedu* might really have an impact on *G1-G3*

### Let's now analyze categorical variables

In [None]:
#comparing sex with G3
sns.boxplot(x="sex", y="G3", data=student_por)

In [None]:
#comparing school with G3
sns.boxplot(x="school", y="G3", data=student_por)

In [None]:
#comparing address with G3
sns.boxplot(x="address", y="G3", data=student_por)

In [None]:
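#comparing parents' jobs with G3 (one subplot per parent, so the two plots don't overlap)
fig, (ax_m, ax_f) = plt.subplots(1, 2, figsize=(16, 6))
sns.boxplot(x="Mjob", y="G3", data=student_por, ax=ax_m)
sns.boxplot(x="Fjob", y="G3", data=student_por, ax=ax_f)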

In [None]:
#comparing famsize with G3
sns.boxplot(x="famsize", y="G3", data=student_por)

In [None]:
#comparing Pstatus with G3
sns.boxplot(x="Pstatus", y="G3", data=student_por)

In [None]:
#comparing reason with G3
sns.boxplot(x="reason", y="G3", data=student_por)

In [None]:
#comparing guardian with G3
sns.boxplot(x="guardian", y="G3", data=student_por)

In [None]:
#comparing schoolsup with G3
sns.boxplot(x="schoolsup", y="G3", data=student_por)

In [None]:
#comparing famsup with G3
sns.boxplot(x="famsup", y="G3", data=student_por)

In [None]:
#comparing paid with G3
sns.boxplot(x="paid", y="G3", data=student_por)

In [None]:
#comparing activities with G3
sns.boxplot(x="activities", y="G3", data=student_por)

In [None]:
#comparing nursery with G3
sns.boxplot(x="nursery", y="G3", data=student_por)

In [None]:
#comparing higher with G3
sns.boxplot(x="higher", y="G3", data=student_por)

In [None]:
#comparing internet with G3
sns.boxplot(x="internet", y="G3", data=student_por)

In [None]:
#comparing romantic with G3
sns.boxplot(x="romantic", y="G3", data=student_por)
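The boxplots above are read mainly through their medians and box heights (interquartile ranges). As a numeric complement, the same quantities can be computed with `groupby`. This is a sketch on an invented frame (`toy`); on the real data one would pass `student_por` and the column of interest instead.

```python
import pandas as pd

# Invented values for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "higher": ["yes", "yes", "yes", "no", "no", "no"],
    "G3":     [12, 14, 16, 8, 9, 10],
})

grouped = toy.groupby("higher")["G3"]
medians = grouped.median()                        # the line inside each box
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)  # the height of each box

print(medians)
print(iqr)
```

A clear gap between group medians relative to the box heights is what suggests a categorical feature is worth keeping.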

### After examining the boxplots, I've come to the conclusion that the following numerical and categorical features have an impact on *G3*: 


<ul>
    <li>Numerical: studytime, failures, Dalc, Walc, traveltime, freetime, Medu and Fedu, G1, G2</li>
    <li>Categorical: sex, school, address, Mjob, Fjob, reason, guardian, schoolsup, higher, internet</li>
</ul>

<p>See dataset description for info about each feature </p>

In [None]:
# making dataframe I'm gonna work with + target G3

features_chosen = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime',  'Medu', 'Fedu', 
                   'sex', 'school', 'address', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                   'higher', 'internet', 'G1', 'G2', 'G3']

student_reduced = student_por[features_chosen].copy()

student_reduced

### I have given it a lot of thought and here is what I'm thinking.
#### The point of this notebook is to predict *G3* by selecting the best model and the best features for that. We are visualizing and analysing features such as *traveltime* from home to school, possible drinking problems, romantic relationships, family status and so on; basically, we are thinking about the things that influence our grades. Based on that, it would have been better to get rid of *G1* and *G2*, since these are the grades for the first and second periods of the year respectively and, just like *G3*, they reflect the chosen features. Instead of having three grades, we could make one mean *G* out of them.

<code>
    # mean
    student_reduced["G"]=(student_reduced["G1"]+student_reduced["G2"]+student_reduced["G3"])/3
    # dropping initial grades and leaving mean 
    student_reduced.drop(['G1', 'G2', 'G3'], axis=1, inplace=True)
</code>

#### But for now, I will leave them be

#### Another quick way to get a feel of the type of data we are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). 

In [None]:
student_reduced.hist(bins=20, figsize=(20,15))
plt.show()
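To make the histogram description concrete, the counts behind the bars can be computed directly with `np.histogram`. The grades below are invented for illustration; they are not rows from the dataset.

```python
import numpy as np

# Ten made-up final grades on the 0-20 scale
grades = np.array([0, 8, 10, 10, 11, 12, 12, 13, 15, 18])

# Four equal-width bins over the grade range: [0,5), [5,10), [10,15), [15,20]
counts, edges = np.histogram(grades, bins=4, range=(0, 20))

print(edges)   # bin boundaries (horizontal axis)
print(counts)  # number of instances per bin (vertical axis)
```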

#### Looking at the data we can see string-valued features. 

##### They are not arbitrary texts: there is a limited number of possible values, each of which represents a category. So these attributes are categorical attributes. Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s OneHotEncoder class, which is a standard choice for nominal categorical variables. For numerical values I will use StandardScaler. I will combine these two transformers in one preprocessing step.

<p>As far as I know, all but the last estimator must be transformers (i.e., they must have a fit_transform() method)</p>

<code>
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
features_cat = ['sex','school','address','Mjob','Fjob','reason','schoolsup','guardian','higher','internet']
features_num = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime', 'Medu', 'Fedu']

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), features_num), 
    ("encoder", OneHotEncoder(), features_cat),
])

X_train_prepared = full_pipeline.fit_transform(X_train)
</code>

### UPD: instead of this pipeline I thought of a better way to transform my features. Anyway, for the sake of my experiments, I am leaving the pipeline in the code block above.

### The *get_dummies()* function from pandas turns every value of every categorical feature into a column of its own and assigns 1 (True) to instances that have this value and 0 (False) to instances that do not. Only the categorical features passed in `columns` are affected

In [None]:
features_cat = ['sex','school','address','Mjob','Fjob','reason','schoolsup','guardian','higher','internet']

student_reduced_cat = pd.get_dummies(student_reduced, columns = features_cat)
student_reduced_cat

In [None]:
student_reduced_cat.columns
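To see what the encoding does on a small scale, here is a sketch with a three-row invented frame (`toy`). The new columns are named `<feature>_<value>`, and the untouched columns keep their place at the front.

```python
import pandas as pd

# Invented rows for illustration; not taken from student-por.csv
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "G3":  [11, 9, 14],
})

encoded = pd.get_dummies(toy, columns=["sex"])

print(list(encoded.columns))
# Cast to int so the output looks the same whether pandas stores the dummies as bool or uint8
print(encoded["sex_F"].astype(int).tolist())
```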

#### Predictor and target variables

In [None]:
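# drop(columns=...) replaces the positional axis argument, which recent pandas versions reject
X = student_reduced_cat.drop(columns=['G3']).to_numpy(dtype=float)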
y = np.array(student_reduced_cat['G3'])  

#### Scaling the features (StandardScaler is applied to every column of X here, including the dummy columns)

In [None]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [None]:
X.shape

#### Before looking at the data any further, I need to create a test set, put it aside, and never look at it. (c) Aurélien Géron

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)

In [None]:
X_train.shape, X_test.shape
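A variation worth considering, sketched here on synthetic data rather than the student data: fit the scaler on the training rows only and reuse it for the test rows, so that test-set statistics never influence the preprocessing. The `_demo` names and the random arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
y_demo = rng.normal(size=100)

# Split first, then learn the scaling parameters from the training part alone
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.24, random_state=42
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))   # approximately zero per column
```

With this ordering the test rows are scaled with the training mean and standard deviation, so their column means are close to zero but not exactly zero.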

#### I guess we have a sufficient number of instances in the dataset for each stratum, so there is no need for *stratified sampling*

### Selecting and Training the Model

#### I'll try Linear Regression with regularization

In [None]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2") # L2 penalty, i.e. Ridge-style regularization

sgd_reg.fit(X_train, y_train)

In [None]:
r2 = sgd_reg.score(X_test, y_test)  # for a regressor, score() returns R^2, not accuracy
r2

### An R² score of 0.86 is really good. 
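For a scikit-learn regressor, `score` returns the coefficient of determination R², defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four invented values, to show what a number such as 0.86 measures.

```python
import numpy as np

# Invented targets and predictions for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 17.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the targets.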

#### But perhaps the model underfits or overfits.

#### There are a few ways to find that out:
<ul>
    <li>Learning curves - these are plots of the model’s performance on the training set and the validation set as a function of the training set size </li>
    <li>Cross-validation - if a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting.</li>
</ul>

### Learning Curves

In [None]:
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2) 
    train_errors, val_errors = [], []
    
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val) 
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict)) 
        val_errors.append(mean_squared_error(y_val, y_val_predict))
        
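    # plot the RMSE on both sets against the number of training instances
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train") 
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()  # show the "train" / "val" labels defined above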

#### Estimating SGDRegressor's generalization performance

In [None]:
sgd_reg_curves = SGDRegressor(penalty='l2') 

plot_learning_curves(sgd_reg_curves, X, y)
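Scikit-Learn also ships a `learning_curve` helper in `sklearn.model_selection` that evaluates a handful of training-set sizes with cross-validation instead of refitting for every single size. The sketch below runs it on synthetic data; the `_demo` arrays, the coefficients, and the choice of five sizes are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve

# Synthetic linear data with noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

sizes, train_scores, val_scores = learning_curve(
    SGDRegressor(penalty="l2", random_state=0),
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),  # five training-set sizes
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negative MSE: flip the sign, average over folds, take the root
train_rmse = np.sqrt(-train_scores.mean(axis=1))
val_rmse = np.sqrt(-val_scores.mean(axis=1))

print(sizes)
print(train_rmse.round(3))
print(val_rmse.round(3))
```

Plotting `train_rmse` and `val_rmse` against `sizes` gives the same kind of picture as the hand-written function, with far fewer model fits.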

### From what I can understand, looking at the Learning curve, the model is fine

### Better Evaluation Using Cross-Validation

#### Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the following code computes -scores before calculating the square root.

In [None]:
scores = cross_val_score(sgd_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10) 

sgd_reg_scores = np.sqrt(-scores)

In [None]:
sgd_reg_scores
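The sign flip is easy to verify on a few made-up numbers. The array below stands in for what `cross_val_score` returns with `scoring="neg_mean_squared_error"`; the values are invented.

```python
import numpy as np

# Invented negative-MSE scores, one per fold
neg_mse_scores = np.array([-4.0, -9.0, -1.0])

# Negate to recover the MSE, then take the square root to get the RMSE per fold
rmse_scores = np.sqrt(-neg_mse_scores)

print(rmse_scores)
```

The RMSE is in the same units as the target, so it can be read directly as a typical error in grade points.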

#### Let's look at the results

In [None]:
def display_scores(scores):
    print('Scores:', scores)
    print('Std.  :', scores.std())
    print('Mean  :', scores.mean())
    
display_scores(sgd_reg_scores)

## So, in conclusion, linear regression with L2 regularization (trained here with SGDRegressor) works fine