# **Assignment \#2**: Machine Learning MC886/MO444
University of Campinas (UNICAMP), Institute of Computing (IC)

Prof. Sandra Avila, 2022s2



In [1]:
# TODO: RA & Name 
print('229062: ' + 'Gustavo L F Walbon')

229062: Gustavo L F Walbon


## Objective

Explore **linear regression** and **logistic regression** alternatives and come up with the best possible model for the problems, avoiding overfitting. In particular, predict the performance of students from public schools in the state of São Paulo based on socioeconomic data from SARESP (School Performance Assessment System of the State of São Paulo, or Sistema de Avaliação de Rendimento Escolar do Estado de São Paulo) 2021.

### Dataset

These data were aggregated from the [Open Data Platform of the Secretary of Education of the State of São Paulo](https://dados.educacao.sp.gov.br/) (*Portal de Dados Abertos da Secretaria da Educação do Estado de São Paulo*). The dataset is based on two data sources: the [SARESP questionnaire](https://dados.educacao.sp.gov.br/dataset/question%C3%A1rios-saresp) and the [SARESP test](https://dados.educacao.sp.gov.br/dataset/profici%C3%AAncia-do-sistema-de-avalia%C3%A7%C3%A3o-de-rendimento-escolar-do-estado-de-s%C3%A3o-paulo-saresp-por), conducted in 2021 with students from the 5th and 9th years of Primary School and the 3rd year of High School. The questionnaire comprises 63 socioeconomic questions and is available at this [link](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing)); the test is composed of Portuguese, Mathematics, and Natural Sciences questions.


**Data Dictionary**:

- **CD_ALUNO**: Student ID;

- **CODESC**: School ID;

- **NOMESC**: School Name;

- **RegiaoMetropolitana**: Metropolitan region;

- **DE**: Name of the Education Board;

- **CODMUN**: City ID;

- **MUN**: City name;

- **SERIE_ANO**: School year;

- **TURMA**: Class;

- **TP_SEXO**: Sex (Female/Male);

- **DT_NASCIMENTO**: Birth date;

- **PERIODO**: Period of study (morning, afternoon, evening);

- **Tem_Nec**: Whether student has any special needs (1 = yes, 0 = no);

- **NEC_ESP_1** - **NEC_ESP_5**: Student disabilities;

- **Tipo_PROVA**: Exam type (A = Enlarged, B = Braille, C = Common);

- **QN**: Student's answer to question N (N = 1, ..., 63); see the questions in the [questionnaire](https://dados.educacao.sp.gov.br/sites/default/files/Saresp_Quest_2021_Perguntas_Alunos.pdf) ([English version](https://docs.google.com/document/d/1GUax3wwYxA43d3iNOiyCRImeCHgx8vUJrHlSzzYIXA4/edit?usp=sharing));

- **porc_ACERT_lp**: Percentage of correct answers in the Portuguese test;

- **porc_ACERT_MAT**: Percentage of correct answers in the Mathematics test;

- **porc_ACERT_CIE**: Percentage of correct answers in the Natural Sciences test;

- **nivel_profic_lp**: Proficiency level in the Portuguese test;

- **nivel_profic_mat**: Proficiency level in the Mathematics test;

- **nivel_profic_cie**:  Proficiency level in the Natural Sciences test.


---



You must respect the following training/test split:
- SARESP_train.csv
- SARESP_test.csv

## Linear Regression

This part of the assignment aims to predict students' performance on the Portuguese, Mathematics, and Natural Sciences tests (target values: `porc_ACERT_lp`, `porc_ACERT_MAT`, and `porc_ACERT_CIE`) based on their socioeconomic data. At this point, you must **drop the columns `nivel_profic_lp`, `nivel_profic_mat`** and **`nivel_profic_cie`**.
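
As a minimal sketch of the column handling described above, the snippet below drops the three proficiency columns and separates the regression targets. The two-row frame and its placeholder level labels are hypothetical stand-ins for `SARESP_train.csv`, not real data.

```python
import pandas as pd

# Hypothetical two-row stand-in for SARESP_train.csv; all values are made up.
toy = pd.DataFrame({
    "Q1": ["A", "B"],
    "porc_ACERT_lp": [55.0, 70.0],
    "porc_ACERT_MAT": [40.0, 65.0],
    "porc_ACERT_CIE": [50.0, 60.0],
    "nivel_profic_lp": ["level_1", "level_2"],   # placeholder labels
    "nivel_profic_mat": ["level_1", "level_2"],  # placeholder labels
    "nivel_profic_cie": ["level_1", "level_2"],  # placeholder labels
})

# The proficiency levels are the classification targets, so they must not
# leak into the regression features.
proficiency_columns = ["nivel_profic_lp", "nivel_profic_mat", "nivel_profic_cie"]
regression_frame = toy.drop(columns=proficiency_columns)

# The three percentage columns are the regression targets.
target_columns = ["porc_ACERT_lp", "porc_ACERT_MAT", "porc_ACERT_CIE"]
regression_targets = regression_frame[target_columns]

print(list(regression_frame.columns))
```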

### Activities

1. (3.5 points) Perform Linear Regression. You should implement your solution and compare it with ```sklearn.linear_model.SGDRegressor``` (linear model fitted by minimizing a regularized empirical loss with SGD, http://scikit-learn.org). Keep in mind that friends don't let friends use testing data for training :-)

Note: Before we start an ML project, we always conduct a brief exploratory analysis :D 

Some factors to consider: Are there any outliers? Are there missing values? How will you handle categorical variables? Are there any features with low correlation with the target variables? What happens if you drop them?
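
The questions above can be checked with a few pandas calls. This is only a sketch on a hypothetical six-row frame (the values are invented); on the real data the same calls would run on the frame loaded from `SARESP_train.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-in for SARESP_train.csv (values are made up).
toy = pd.DataFrame({
    "Q1": ["A", "B", "A", "C", "B", "A"],
    "Q2": ["A", "A", "B", "B", None, "A"],
    "porc_ACERT_MAT": [40.0, 55.0, 35.0, 70.0, 60.0, 45.0],
})

# Missing values: count the empty cells in every column.
missing_per_column = toy.isna().sum()

# Categorical "outliers": list the distinct answers of each question so that
# unexpected categories become visible.
answer_counts = {col: toy[col].value_counts(dropna=False) for col in ["Q1", "Q2"]}

# Correlation of every one-hot feature with one target; features whose
# correlation is close to zero are candidates for dropping.
dummies = pd.get_dummies(toy[["Q1", "Q2"]], dtype=float)
correlations = dummies.corrwith(toy["porc_ACERT_MAT"]).sort_values()

print(missing_per_column)
print(correlations)
```

Dropping weakly correlated features is a heuristic; comparing the validation error with and without them is the more reliable check.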


#### Answers
- **Outliers**: Every answer is one of a fixed set of options (letters A to E, depending on what the question measures), so the questionnaire features contain no out-of-range values.
- **Missing values**: None; every questionnaire appears to be fully answered.
- **Categorical variables**: They will be one-hot encoded with scikit-learn's `OneHotEncoder`, which treats the answers as non-ordinal values.
- **Target variables**:


In [69]:
# TODO: Load and preprocess your dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
import seaborn as sns

%matplotlib inline
pd.set_option('display.max_columns', None)

df   = pd.read_csv("SARESP_train.csv", low_memory=False)

X_data   = df[[col for col in df if col[0] == "Q"]] 

Y_target = df[ [col for col in df if col[:11] == "porc_ACERT_" ] ]

# Preprocess the data:
# one-hot encode the non-ordinal answers with get_dummies
for q in range(1,64):
    X_data = pd.get_dummies(data = X_data, columns=[f"Q{q}"], prefix=[f"Q{q}"], drop_first=True)
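
The same encoding has to be reproduced on `SARESP_test.csv` with exactly the same columns, otherwise a model fitted on the training features cannot be applied. One possible approach, sketched here on hypothetical toy answers, is to encode the test answers without `drop_first` and then reindex them to the training columns.

```python
import pandas as pd

# Hypothetical answers; the real frames come from SARESP_train.csv / SARESP_test.csv.
train_answers = pd.DataFrame({"Q1": ["A", "B", "C"], "Q2": ["A", "B", "A"]})
test_answers = pd.DataFrame({"Q1": ["B", "A"], "Q2": ["A", "A"]})

# Training encoding, dropping the first category of each question.
train_encoded = pd.get_dummies(train_answers, drop_first=True, dtype=int)

# Encode the test set WITHOUT drop_first: the "first" category present in the
# test set may differ from the one dropped in training.
test_encoded = pd.get_dummies(test_answers, dtype=int)

# Keep exactly the training columns, in the same order; categories never seen
# in the test set become all-zero columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_encoded)
```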


In [70]:
X_data.head()

Unnamed: 0,Q1_B,Q1_C,Q1_D,Q2_B,Q2_C,Q2_D,Q2_E,Q3_B,Q3_C,Q3_D,Q3_E,Q4_B,Q4_C,Q4_D,Q4_E,Q5_B,Q5_C,Q5_D,Q5_E,Q6_B,Q6_C,Q6_D,Q6_E,Q7_B,Q7_C,Q7_D,Q7_E,Q8_B,Q8_C,Q8_D,Q8_E,Q9_B,Q9_C,Q10_B,Q10_C,Q11_B,Q11_C,Q12_B,Q12_C,Q13_B,Q13_C,Q14_B,Q14_C,Q15_B,Q15_C,Q16_B,Q16_C,Q17_B,Q17_C,Q18_B,Q18_C,Q19_B,Q19_C,Q20_B,Q20_C,Q21_B,Q21_C,Q22_B,Q23_B,Q23_C,Q23_D,Q24_B,Q25_B,Q25_C,Q25_D,Q26_B,Q26_C,Q26_D,Q27_B,Q27_C,Q28_B,Q28_C,Q29_B,Q29_C,Q30_B,Q30_C,Q31_B,Q31_C,Q32_B,Q32_C,Q32_D,Q33_B,Q33_C,Q33_D,Q34_B,Q35_B,Q35_C,Q36_B,Q36_C,Q37_B,Q37_C,Q38_B,Q38_C,Q39_B,Q39_C,Q40_B,Q40_C,Q41_B,Q41_C,Q42_B,Q42_C,Q42_D,Q42_E,Q43_B,Q43_C,Q43_D,Q44_B,Q44_C,Q44_D,Q45_B,Q45_C,Q45_D,Q46_B,Q46_C,Q46_D,Q47_B,Q47_C,Q47_D,Q48_B,Q48_C,Q48_D,Q49_B,Q49_C,Q49_D,Q50_B,Q50_C,Q50_D,Q50_E,Q51_B,Q51_C,Q51_D,Q51_E,Q52_B,Q52_C,Q52_D,Q52_E,Q53_B,Q53_C,Q53_D,Q53_E,Q54_B,Q54_C,Q54_D,Q54_E,Q55_B,Q55_C,Q55_D,Q55_E,Q56_B,Q56_C,Q56_D,Q56_E,Q57_B,Q57_C,Q57_D,Q57_E,Q58_B,Q58_C,Q59_B,Q59_C,Q60_B,Q60_C,Q60_D,Q61_B,Q61_C,Q61_D,Q62_B,Q62_C,Q62_D,Q63_B,Q63_C,Q63_D
0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,1,0,0,1,0,0,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,1,0,1,0,0,0,0,0,0,1
1,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,1,1,0,1,0,0,0,0,1,0,0,0,1,0,1,0,0,1,0,1,0,1,1,0,1,0,0,0,1,0,0,0,0,1,0,0,1,0,1,1,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,1,0,0
2,1,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,1,0,0,1,1,0,0,1,0,1,0,1,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,0,1,0,1,0,0,1,0
3,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,1,0,1,0,0,1,1,0,1,0,0,0,1,0,0,0,1,1,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,1,0,0,0,1,0,1,0,0,0,1,0
4,1,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,0,1,1,0,1,0,1,0,0,1,0,0,0,0,1,0,1,1,0,0,1,0,1,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1


In [64]:
# TODO: Linear Regression. Implement your solution. You cannot use scikit-learn, Keras/TensorFlow, or PyTorch libraries.

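class myLinearRegression():
    '''Linear regression fitted with batch gradient descent on the MSE (L2) loss.'''

    def __init__(self,
                 loss                 = None,
                 min_tolerance: float = 1e-4,
                 max_iter: int        = 100_000,
                 learning_rate: float = 1e-3,
                 bias: bool           = True):
        '''Initialize the myLinearRegression to find the weights'''
        # loss_cost is not defined yet when the signature is evaluated, so the default is None.
        self.loss          = loss if loss is not None else self.loss_cost
        self.min_tolerance = min_tolerance
        self.max_iter      = int(max_iter)
        self.learning_rate = learning_rate
        self.bias          = bias
        self.weights       = None
        self.loss_history  = []

    @staticmethod
    def loss_cost(target, value):
        '''Mean squared error (L2 loss) between the targets and the predictions.'''
        # L2 loss example https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#mse-l2
        return np.mean((target - value) ** 2)

    @staticmethod
    def gradient(target, value, array):
        '''Gradient of the loss with respect to the weights, as a column vector.'''
        # The constant factor 2 of the MSE derivative is absorbed into the learning rate.
        return np.mean((value - target) * array, axis=0).reshape(-1, 1)

    def _with_bias(self, input):
        '''Convert the input to a float array and prepend a column of ones if bias is enabled.'''
        inputs = np.asarray(input, dtype=float)
        if self.bias:
            inputs = np.hstack([np.ones((inputs.shape[0], 1)), inputs])
        return inputs

    def fit(self, input, output):
        '''Fit the weights with batch gradient descent.'''
        inputs  = self._with_bias(input)
        outputs = np.asarray(output, dtype=float).reshape(-1, 1)
        self.weights      = np.zeros((inputs.shape[1], 1))
        self.loss_history = []
        for _ in range(self.max_iter):
            predictions = inputs @ self.weights
            self.loss_history.append(self.loss(outputs, predictions))
            # Stop once the loss changes by less than min_tolerance between iterations.
            if len(self.loss_history) > 1 and abs(self.loss_history[-2] - self.loss_history[-1]) < self.min_tolerance:
                break
            self.weights -= self.learning_rate * self.gradient(outputs, predictions, inputs)
        return self

    def predict(self, input):
        '''Predict the target values for new inputs.'''
        return (self._with_bias(input) @ self.weights).ravel()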
    


In [73]:
# TODO: Linear Regression. You can use scikit-learn libraries.
from sklearn.linear_model import SGDRegressor, LinearRegression

sklearn_MAT = SGDRegressor(alpha = 0, learning_rate = 'constant', eta0 = 1e-3, ) #replicate same parameters as in custom
sklearn_MAT.fit(X_data, Y_target.porc_ACERT_MAT)

sklearn_CIE = SGDRegressor(alpha = 0, learning_rate = 'constant', eta0 = 1e-3, ) #replicate same parameters as in custom
sklearn_CIE.fit(X_data, Y_target.porc_ACERT_CIE)

sklearn_LP = SGDRegressor(alpha = 0, learning_rate = 'constant', eta0 = 1e-3, ) #replicate same parameters as in custom
sklearn_LP.fit(X_data, Y_target.porc_ACERT_lp)


> What are the conclusions? (1-2 paragraphs)




2. (1 point) Use different Gradient Descent (GD) learning rates when optimizing. Compare the GD-based solutions with the Normal Equation. What are the conclusions?


In [None]:
# TODO: Gradient Descent (GD) with 3 different learning rates. You can use scikit-learn libraries.
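
One way to set up this comparison is sketched below with NumPy only. The Normal Equation gives the closed-form least-squares solution $w = (X^\top X)^{-1} X^\top y$; here it is solved with `np.linalg.lstsq` for numerical stability. The synthetic data, the three learning rates, and the epoch count are assumptions chosen for illustration, not values taken from the assignment.

```python
import numpy as np

# Synthetic data (illustrative only): a linear signal plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the bias column

# Normal Equation: least-squares solution of Xb @ w = y.
w_normal, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def gradient_descent(features, targets, learning_rate, epochs=500):
    """Batch gradient descent on the mean squared error."""
    w = np.zeros(features.shape[1])
    for _ in range(epochs):
        grad = 2.0 / len(targets) * features.T @ (features @ w - targets)
        w -= learning_rate * grad
    return w

# Largest absolute weight difference between GD and the closed-form solution.
gaps = {}
for lr in (1e-3, 1e-2, 1e-1):
    w_gd = gradient_descent(Xb, y, lr)
    gaps[lr] = float(np.max(np.abs(w_gd - w_normal)))
    print(f"learning rate {lr:g}: max |w_gd - w_normal| = {gaps[lr]:.2e}")
```

On this toy problem the smallest learning rate has not converged after 500 epochs, while the largest one matches the closed-form weights almost exactly. A learning rate that is too large would diverge instead, so the ordering depends on the data.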


3. (0.75 point) Sometimes, we need a more complex function to make good predictions. Devise and evaluate a Polynomial Linear Regression model. 


In [None]:
# TODO: Complex model. You can use scikit-learn libraries.
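
A possible starting point is a scikit-learn pipeline that expands the features with `PolynomialFeatures` before a linear fit. For 0/1 one-hot features the squared terms equal the features themselves, so this sketch assumes `interaction_only=True` and keeps only the pairwise products. The binary data and the interaction effect are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic binary features with one interaction effect (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 4)).astype(float)
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + 8 * X[:, 0] * X[:, 1] + rng.normal(size=300)

X_train, X_val = X[:200], X[200:]
y_train, y_val = y[:200], y[200:]

# Baseline: plain linear model on the raw features.
linear = LinearRegression().fit(X_train, y_train)

# Degree-2 model restricted to pairwise interaction terms.
poly = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

mse_linear = mean_squared_error(y_val, linear.predict(X_val))
mse_poly = mean_squared_error(y_val, poly.predict(X_val))
print(f"validation MSE, linear: {mse_linear:.2f}, with interactions: {mse_poly:.2f}")
```

With roughly 170 one-hot columns, the number of pairwise terms grows to the order of ten thousand, so on the real data some feature selection or regularization would likely be needed first.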

 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

 


4. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model. 

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.
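
A minimal sketch of such a plot, assuming batch gradient descent on the MSE and a simple hold-out validation split. The synthetic data, learning rate, and epoch count are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic regression data; purely illustrative, not SARESP values.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 + X @ np.array([1.0, -1.5, 0.5]) + rng.normal(scale=0.3, size=300)
Xb = np.hstack([np.ones((300, 1)), X])  # prepend the bias column
X_train, X_val = Xb[:240], Xb[240:]
y_train, y_val = y[:240], y[240:]

def mse(features, targets, weights):
    """Mean squared error of a linear model."""
    return float(np.mean((features @ weights - targets) ** 2))

weights = np.zeros(X_train.shape[1])
learning_rate = 0.05
train_costs, val_costs = [], []
for epoch in range(100):
    # Record both costs before the update of this epoch.
    train_costs.append(mse(X_train, y_train, weights))
    val_costs.append(mse(X_val, y_val, weights))
    gradient = 2.0 / len(y_train) * X_train.T @ (X_train @ weights - y_train)
    weights -= learning_rate * gradient

fig, ax = plt.subplots()
ax.plot(train_costs, label="training")
ax.plot(val_costs, label="validation")
ax.set_xlabel("epoch")
ax.set_ylabel("MSE")
ax.set_yscale("log")
ax.legend()
```

A validation curve that flattens or rises while the training curve keeps falling is the usual sign of overfitting; two curves that plateau together at a high cost suggest underfitting.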

In [None]:
 > What are the conclusions? What are the actions after such analyses? (1-2 paragraphs)

5. (0.25 point) Pick **your best model**, based on your validation set, and predict the target values for the test set.
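
The selection step could look like the sketch below: part of the training data is held out for validation, each candidate is scored on it, and only the winner is applied to the test features. The candidates (three `SGDRegressor` learning rates), the split size, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the encoded training and test features.
rng = np.random.default_rng(0)
X_train_full = rng.normal(size=(400, 3))
y_train_full = 50 + X_train_full @ np.array([5.0, -3.0, 2.0]) + rng.normal(size=400)
X_test = rng.normal(size=(100, 3))

# The validation set is carved out of the training data; the test set is untouched.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=0
)

# Hypothetical candidates: the same model with three constant learning rates.
candidates = {
    f"eta0={eta0:g}": SGDRegressor(alpha=0, learning_rate="constant", eta0=eta0, random_state=0)
    for eta0 in (1e-4, 1e-3, 1e-2)
}

validation_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    validation_mse[name] = mean_squared_error(y_val, model.predict(X_val))

# Pick the candidate with the lowest validation error and predict the test set once.
best_name = min(validation_mse, key=validation_mse.get)
test_predictions = candidates[best_name].predict(X_test)
print(best_name, validation_mse[best_name])
```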

## Logistic Regression

Now, this part of the assignment aims to predict students' proficiency level in Portuguese, Mathematics, and Natural Sciences (target values: `nivel_profic_lp`, `nivel_profic_mat`, and `nivel_profic_cie`) based on their socioeconomic data. You therefore have to **drop the columns `porc_ACERT_lp`, `porc_ACERT_MAT`** and **`porc_ACERT_CIE`**.

### Activities

1. (2.75 points) Perform Multinomial Logistic Regression (_i.e._, softmax regression). It is a generalization of Logistic Regression to the case where we want to handle multiple classes. Try different combinations of features, dropping the ones less correlated to the target variables.

In [None]:
# TODO: Multinomial Logistic Regression. You can use scikit-learn libraries.
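
As a hedged starting point, scikit-learn's `LogisticRegression` can be fitted directly on a multiclass target; with its default `lbfgs` solver, recent versions fit a multinomial (softmax) model in that case. The three synthetic classes below are integer placeholders, not the actual proficiency levels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Three well-separated synthetic classes (integer ids are placeholders).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(size=(300, 2))

X_train, X_val = X[:240], X[240:]
y_train, y_val = labels[:240], labels[240:]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Softmax output: one probability per class, each row summing to 1.
probabilities = model.predict_proba(X_val)
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```

To try different feature combinations, the same fit can be repeated on column subsets of the encoded questionnaire and compared on the validation accuracy.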

> What are the conclusions? (1-2 paragraphs)


2. (0.5 point) Plot the cost function vs. number of epochs in the training/validation set and analyze the model. 

In [None]:
# TODO: Plot the cost function vs. number of iterations in the training set.
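
To record a cost per epoch, the softmax model can also be trained by hand. The sketch below uses batch gradient descent on the cross-entropy loss with NumPy; the synthetic classes, learning rate, and epoch count are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Three synthetic classes (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(size=(300, 2))
Xb = np.hstack([np.ones((300, 1)), X])  # prepend the bias column
Y = np.eye(3)[labels]                   # one-hot targets

X_train, X_val = Xb[:240], Xb[240:]
Y_train, Y_val = Y[:240], Y[240:]

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probabilities, targets):
    """Mean cross-entropy between predicted probabilities and one-hot targets."""
    return float(-np.mean(np.sum(targets * np.log(probabilities + 1e-12), axis=1)))

weights = np.zeros((X_train.shape[1], 3))
learning_rate = 0.1
train_costs, val_costs = [], []
for epoch in range(200):
    train_probabilities = softmax(X_train @ weights)
    train_costs.append(cross_entropy(train_probabilities, Y_train))
    val_costs.append(cross_entropy(softmax(X_val @ weights), Y_val))
    # Gradient of the mean cross-entropy with respect to the weights.
    weights -= learning_rate * X_train.T @ (train_probabilities - Y_train) / len(Y_train)

fig, ax = plt.subplots()
ax.plot(train_costs, label="training")
ax.plot(val_costs, label="validation")
ax.set_xlabel("epoch")
ax.set_ylabel("cross-entropy")
ax.legend()
```

With zero initial weights every class gets probability 1/3, so both curves start at ln 3 ≈ 1.10, which is a quick sanity check for the implementation.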

> What are the conclusions? (1-2 paragraphs)


3. (0.75 point) Pick **your best model** and plot the confusion matrix in the **test set**. 


In [None]:
# TODO: Plot the confusion matrix. You can use scikit-learn, seaborn, matplotlib libraries.
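
A small sketch of the computation and the plot, using scikit-learn's `confusion_matrix` and `ConfusionMatrixDisplay`. The label vectors are hand-made placeholders; in the assignment they would be the true and predicted proficiency levels of the test set.

```python
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Hand-made placeholder labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
class_ids = [0, 1, 2]

# Rows are the true classes, columns are the predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=class_ids)

# Recall per class: correct predictions divided by the row total.
recall_per_class = matrix.diagonal() / matrix.sum(axis=1)

display = ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=class_ids)
display.plot(cmap="Blues")
print(matrix)
print(recall_per_class)
```

Reading the rows shows which true levels are confused with which predictions; with imbalanced classes, per-class recall is more informative than the overall accuracy.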

> What are the conclusions? (1-2 paragraphs)


## Deadline

Monday, September 19, 11:59 pm. 

Penalty policy for late submission: You are not encouraged to submit your assignment after the due date. However, if you do, your grade will be penalized as follows:
- September 20, 11:59 pm : grade * 0.75
- September 21, 11:59 pm : grade * 0.5
- September 22, 11:59 pm : grade * 0.25


## Submission

On Google Classroom, submit your Jupyter Notebook (in Portuguese or English).

**This activity is NOT individual, it must be done in pairs (two-person group).**