<h1><center>Lab 8: ML Overview: Supervised Learning Algorithms</center></h1>
    
<img src="https://files.realpython.com/media/NLP-for-Beginners-Pythons-Natural-Language-Toolkit-NLTK_Watermarked.16a787c1e9c6.jpg" width="400">


```Created by Jinnie Shin (jinnie.shin@coe.ufl.edu)```\
```Date: ```

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQmNf86oJnfhpkPA9LnrFnAbfwF2VywPYpB_w&usqp=CAU" align="left" width="70" height="70" align="left">

 ### Required Packages or Dependencies

In [1]:
#!pip install { }  # run this in case you hit the `package not available` error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## **REVIEW**: Dataset

> We will use the coh-metrix indices introduced in Week 6, `features.xlsx`

In [3]:
data= pd.read_excel('features.xlsx')

############################### MINI TASKS ####################################
# Q1. The total number of rows?

# Q2. How many coh-metrix features?
# (excluding, `TextID`, `domain1_score`, `domain2_score`, `essay_id`, and `essay_set`)

###############################################################################

X = data.drop(columns=['TextID','domain1_score', 'domain2_score', 'essay_id', 'essay_set'])
y = data.domain1_score
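
> One possible approach to the mini tasks is sketched below on a toy DataFrame, since `features.xlsx` is not reproduced here. The feature names `feat_a` and `feat_b` and all values are hypothetical; only the five excluded column names come from this lab.

```python
import pandas as pd

# Toy stand-in for features.xlsx; feat_a / feat_b are hypothetical feature names.
toy = pd.DataFrame({
    'TextID': ['t1', 't2', 't3'],
    'essay_id': [1, 2, 3],
    'essay_set': [1, 1, 1],
    'domain1_score': [8, 10, 7],
    'domain2_score': [3, 4, 2],
    'feat_a': [0.1, 0.4, 0.3],
    'feat_b': [12.0, 15.5, 11.2],
})

non_features = ['TextID', 'domain1_score', 'domain2_score', 'essay_id', 'essay_set']
toy_X = toy.drop(columns=non_features)

n_rows = toy.shape[0]        # Q1: number of rows
n_features = toy_X.shape[1]  # Q2: number of remaining feature columns
print(n_rows, n_features)
```

> The same two expressions should carry over to `data` and `X` once the real file is loaded.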


## 1. Regression and Classification Problems

> Our task is to predict the `domain1_score` column using the given coh-metrix features.
> We will use two algorithms, linear and logistic regression, as our main prediction and classification models. Before we construct the algorithms next week, we will look at how the model weights are learned using **gradient descent**.

### 1.1 Gradient Descent
<img src="https://miro.medium.com/proxy/1*fBxEzbzP1KkqR7PTexJZdw.png" width="250">

> The objective of the learning algorithm is to find the values of the parameters (`w` and `b`) that minimize the overall loss (squared error loss) of the model. \
> Let's solve this regression problem: `y = 4.0 + 3.0*x0 + 1.0*x1 + 3.0*x2 + 0.5*x3 + 1.5*x4`

In [5]:
## Define number of samples
num_samples = 20

x0 = 3.0 + np.random.standard_normal(num_samples)
x1 = 1.0 + np.random.standard_normal(num_samples)
x2 = -8.0 + np.random.standard_normal(num_samples)
x3 = -2.0 + np.random.standard_normal(num_samples)
x4 = 0.5 + np.random.standard_normal(num_samples)
y = 4.0 + 3.0 * x0 + 1.0 * x1 + 3.0 * x2 + 0.5 * x3 + 1.5 * x4 + np.random.standard_normal(num_samples)

X = np.column_stack((x0, x1, x2, x3, x4))
Y = y
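
> The cell above draws fresh random numbers on every run, so the numbers printed later in this lab will differ from run to run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed value and the names `X_seeded` and `Y_seeded` are illustrative choices, not part of the lab:

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value 42 is an arbitrary choice
num_samples = 20

# Feature means follow the cell above; each feature gets unit-variance noise.
means = np.array([3.0, 1.0, -8.0, -2.0, 0.5])
X_seeded = means + rng.standard_normal((num_samples, 5))

# True parameters of the regression problem stated above.
true_b = 4.0
true_w = np.array([3.0, 1.0, 3.0, 0.5, 1.5])
Y_seeded = true_b + X_seeded @ true_w + rng.standard_normal(num_samples)

print(X_seeded.shape, Y_seeded.shape)
```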

#### 1.1.1 Batch Gradient Descent (BGD)
> The partial derivatives of the squared loss with respect to `b` and `w` in linear regression are:
<img src="https://eli.thegreenplace.net/images/math/aef02f077919896478d0456619f934dcc5809142.png" width="250">
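
> Written out, and assuming the loss is the mean squared error over the `N` samples (an assumption that matches the `2/num_sample` factor in the `BGD` code), the loss, its partial derivatives, and the update with learning rate `alpha` would read:

```latex
L(b, w) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (b + w \cdot x_i) \bigr)^2

\frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl( y_i - (b + w \cdot x_i) \bigr)
\qquad
\frac{\partial L}{\partial w_j} = -\frac{2}{N} \sum_{i=1}^{N} x_{ij} \bigl( y_i - (b + w \cdot x_i) \bigr)

b \leftarrow b - \alpha \frac{\partial L}{\partial b}
\qquad
w_j \leftarrow w_j - \alpha \frac{\partial L}{\partial w_j}
```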


In [6]:
def BGD(X, Y, b, w, alpha=0.005): # alpha is a learning rate, we will set it as 0.005 for now

    num_feat = X.shape[1]

    num_sample = X.shape[0] # This indicates the total number of data points (rows)

    b_grad = 0 # gradient of the loss with respect to the intercept b

    w_grad = np.zeros(num_feat) # gradient of the loss with respect to each weight

    for i in range(num_sample): # BGD accumulates `b_grad` and `w_grad`
                                # over all N samples before updating
        y = Y[i] # one sample, y
        x = X[i] # one sample, x
        b_grad += -(2./float(num_sample)) * (y - (b + w.dot(x)))

        for j in range(num_feat):
            x_ij = x[j]
            w_grad[j] += -(2./float(num_sample)) * x_ij * (y - (b + w.dot(x)))

    b_new = b - alpha * b_grad
    w_new = np.array([w[i] - alpha * w_grad[i] for i in range(num_feat)])
    return b_new, w_new
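
> The explicit loops in `BGD` make each term of the gradient visible. As a sketch only, the same single update step can be written with NumPy array operations; the function name `bgd_step_vectorized` and the toy arrays are illustrative and not part of the lab:

```python
import numpy as np

def bgd_step_vectorized(X, Y, b, w, alpha=0.005):
    """One batch gradient descent step for linear regression with squared loss."""
    num_sample = X.shape[0]
    residual = Y - (b + X @ w)                        # y_i - (b + w.x_i) for every row
    b_grad = -(2.0 / num_sample) * residual.sum()
    w_grad = -(2.0 / num_sample) * (X.T @ residual)   # one entry per feature
    return b - alpha * b_grad, w - alpha * w_grad

# Tiny hand-checkable example: one feature, two samples, starting from zero.
X_demo = np.array([[1.0], [2.0]])
Y_demo = np.array([2.0, 4.0])
b1, w1 = bgd_step_vectorized(X_demo, Y_demo, 0.0, np.zeros(1), alpha=0.1)
print(b1, w1)
```

> With both parameters starting at zero, the residuals are `[2, 4]`, so the gradients are `-6` for `b` and `-10` for `w`, and one step with `alpha=0.1` moves them to `0.6` and `1.0`.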

In [7]:
def BGD_train(X, Y, alpha=0.005):
    b = 0
    w = np.zeros(X.shape[1])
    print('===== Start Training ====')
    for i in range(10000):
        b_new, w_new = BGD(X, Y, b, w, alpha=alpha)
        b = b_new
        w = w_new
        if i % 1000 == 0:
            print('{}: b = {}, w = {}'.format(i, np.round(b_new, 2), np.round(w_new, 2)))

    print('final: b = {}, w = {}'.format(np.round(b, 2), np.round(w, 2)))
    return b, w

> *Let's explore!*

In [8]:
BGD_train(X, Y)

===== Start Training ====
0: b = -0.09, w = [-0.31 -0.11  0.8   0.2  -0.02]
1000: b = 0.22, w = [2.73 0.9  2.38 0.51 1.17]
2000: b = 0.53, w = [2.75 0.89 2.43 0.49 1.18]
3000: b = 0.82, w = [2.74 0.89 2.46 0.5  1.17]
4000: b = 1.08, w = [2.73 0.89 2.48 0.5  1.17]
5000: b = 1.31, w = [2.72 0.89 2.5  0.51 1.16]
6000: b = 1.52, w = [2.72 0.89 2.53 0.51 1.16]
7000: b = 1.72, w = [2.71 0.88 2.54 0.52 1.16]
8000: b = 1.89, w = [2.71 0.88 2.56 0.52 1.15]
9000: b = 2.05, w = [2.7  0.88 2.58 0.52 1.15]
final: b = 2.2, w = [2.7  0.88 2.59 0.53 1.15]


(2.199463971499102,
 array([2.69550846, 0.87949332, 2.59251754, 0.52785271, 1.1455164 ]))

#### 1.1.2 Stochastic Gradient Descent (SGD)
> SGD shuffles the data and updates the parameters using one data point at a time, instead of the gradient over the whole sample.

In [9]:
def SGD(x, y, b, w, num_feat, num_sample, alpha=0.005):

    b_grad = -(2./float(num_sample)) * (y - (b + w.dot(x)))
    w_grad = np.zeros(num_feat)

    for i in range(num_feat):
        w_grad[i] += -(2./float(num_sample)) * x[i] * (y - (b + w.dot(x)))

    b_new = b - alpha * b_grad
    w_new = np.array([w[i] - alpha * w_grad[i] for i in range(num_feat)])
    return b_new, w_new

In [19]:
def SGD_train(X, Y, alpha =0.005):

    import random

    b = 0
    w = np.zeros(X.shape[1])

    num_sample = X.shape[0]
    num_feat = X.shape[1]

    for i in range(5000):
        indices = list(range(num_sample))
        random.shuffle(indices)

        for j in indices:
            b_new, w_new = SGD(X[j], Y[j], b, w, num_feat, num_sample, alpha=alpha)
            b = b_new
            w = w_new

        if i % 1000 == 0:
            print('{}: b = {}, w = {}'.format(i, np.round(b_new, 2), np.round(w_new, 2)))

    print('final: b = {}, w = {}'.format(np.round(b,2), np.round(w, 2)))


> *Let's explore!*

In [20]:
SGD_train(X, Y)

0: b = -0.06, w = [-0.2  -0.07  0.54  0.14 -0.01]
1000: b = 0.23, w = [2.73 0.9  2.38 0.51 1.17]
2000: b = 0.54, w = [2.75 0.9  2.42 0.49 1.18]
3000: b = 0.83, w = [2.74 0.89 2.45 0.5  1.17]
4000: b = 1.09, w = [2.73 0.89 2.48 0.51 1.17]
final: b = 1.33, w = [2.72 0.89 2.51 0.51 1.16]


<img src="https://i.pinimg.com/736x/2e/aa/7d/2eaa7d5021ca7c3c98bc93b98b9646fe.jpg" align="left" width="70" height="70" align="left">

 ## Task 1: Training & Testing data
>  Q1. In order to analyze large datasets efficiently, we will use the `scikit-learn` package to implement regression models.
>> **Step 1**: Install the package with `!pip install scikit-learn` \
>> **Step 2**: Import the model with `from sklearn.linear_model import LinearRegression` \
>> **Step 3**: Instantiate the model with `lr = LinearRegression()` \
>> **Step 4**: Fit the model using `lr.fit({input}, {output})`, then check the intercept and the coefficients using `lr.intercept_` and `lr.coef_`

> More information about the package is available at: https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

> Q2. Compare the results with the `b` and `w` values found by BGD and SGD above.

In [17]:
################################### YOUR CODE HERE #############################
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

model = lr.fit(X, Y)

print(model.intercept_)
print(model.coef_)

###############################################################################

3.660912679523795
[2.64898059 0.86619805 2.73710623 0.55868311 1.11434095]
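
> One way to read the comparison: `LinearRegression` fits ordinary least squares, which is the minimum that gradient descent moves toward step by step. A minimal NumPy sketch of a direct least-squares solve on noise-free toy data follows; the seed, sample size, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.standard_normal((50, 5))
true_b, true_w = 4.0, np.array([3.0, 1.0, 3.0, 0.5, 1.5])
Y_toy = true_b + X_toy @ true_w   # no noise, so recovery should be near exact

# Prepend a column of ones so the intercept is estimated with the weights.
A = np.column_stack((np.ones(len(X_toy)), X_toy))
solution, *_ = np.linalg.lstsq(A, Y_toy, rcond=None)
b_hat, w_hat = solution[0], solution[1:]
print(np.round(b_hat, 2), np.round(w_hat, 2))
```

> In the runs above, the BGD intercept (2.2) is still well below the scikit-learn intercept (3.66) after 10,000 iterations, which suggests that the gradient descent run had not fully converged.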


<img src="https://i.pinimg.com/736x/2e/aa/7d/2eaa7d5021ca7c3c98bc93b98b9646fe.jpg" align="left" width="70" height="70" align="left">

 ## Task 2: Training & Testing data using `Linear Regression`
>  Q3. Let's use the `data` and fit a `linear regression` model (DV = `domain1_score`).

> Q4. Evaluate the R2 score using `r2_score` (`from sklearn.metrics import r2_score`).

In [24]:
# data= pd.read_excel('./data/features.xlsx')
data = data.dropna()
X = data.drop(columns=['TextID','domain1_score', 'domain2_score', 'essay_id', 'essay_set'])
Y = data.domain1_score

################################### YOUR CODE HERE #############################
model = lr.fit(X, Y)

print(model.intercept_)
print(model.coef_)

from sklearn.metrics import r2_score
y_hat = lr.predict(X)
r2_score(Y, y_hat)
###############################################################################

196.6205276062531
[ 1.42873156e-10  7.68276083e-03  7.18749660e-04  7.68276019e-03
  1.96876480e-11  2.16594319e-01 -9.55663468e-03  4.98671734e+00
  3.80727051e-02  5.78376159e-01  2.12737779e-01  7.96779905e+00
 -3.55662585e-03  1.26331551e+01 -9.06477545e-04  1.25298388e+01
  2.65720482e-03 -1.57279764e+01 -1.55238790e-03 -8.54094317e+00
  2.08599378e-04  3.29594691e+00  4.47423888e-03 -8.89213182e+00
 -2.19739583e-04  3.26256983e+00  7.63296159e-04  6.79577697e+00
  1.09928495e+01 -4.84666736e-01  5.59817983e-02  1.53950963e+01
  1.00861040e+01  3.79960845e+01 -1.65538687e-01  5.04322039e+01
  3.25834855e-02 -1.60702898e-01  5.85594450e-02  1.21502522e+01
  5.25479095e-01  1.64709895e+01 -7.37637325e-01 -2.00530842e-11
 -3.25838176e-11  2.55977132e+01 -2.06234307e+00 -4.60397197e+00
 -2.90330236e+00 -2.83009806e-02 -5.40364816e-02 -4.58945821e-02
  2.16618775e-01  2.31341813e-01 -5.86442131e-01  1.17778186e-01
 -4.92383253e-03 -7.70509322e-01 -3.60038887e-03 -6.87745232e-03
 -1.230

0.6044758644232586
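
> To make the metric concrete, R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A small hand-computable sketch of that definition; the helper name `r2_manual` and the toy values are illustrative:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.0, 2.0, 3.0, 5.0]
print(r2_manual(y_true_demo, y_pred_demo))
```

> Here `SS_res = 1` and `SS_tot = 5`, giving an R2 of 0.8. Note that the score of about 0.60 reported above is computed on the same rows used to fit the model, so it describes in-sample fit.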

<img src="https://i.pinimg.com/736x/2e/aa/7d/2eaa7d5021ca7c3c98bc93b98b9646fe.jpg" align="left" width="70" height="70" align="left">

 ## Task 3: Training & Testing data with `Logistic Regression`
>  Q5. Let's use the `data` and fit a `logistic regression` model (DV = `domain1_score`). (Hint: You should create a ____ output.) Use `from sklearn.linear_model import LogisticRegression`.

> Q6. Evaluate the accuracy using `accuracy_score` (`from sklearn.metrics import accuracy_score`).

In [None]:
data= pd.read_excel('./data/features.xlsx')
X = data.drop(columns=['TextID','domain1_score', 'domain2_score', 'essay_id', 'essay_set'])
y = data.domain1_score

################################### YOUR CODE HERE #############################
from sklearn.linear_model import LogisticRegression



###############################################################################
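
> As a hedged illustration of the workflow, and not the expected answer, the sketch below fits a logistic regression on synthetic data. It assumes one possible reading of the hint: the continuous score is turned into a 0/1 label with a median split. The data, the seed, and the names `X_syn`, `score_syn`, and `y_bin` are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_syn = rng.standard_normal((200, 3))
score_syn = 2.0 * X_syn[:, 0] - X_syn[:, 1] + rng.normal(scale=0.1, size=200)

# Assumption: convert the continuous score into a 0/1 label with a median split.
y_bin = (score_syn > np.median(score_syn)).astype(int)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_syn, y_bin)

# Accuracy on the same rows used for fitting (in-sample).
acc = accuracy_score(y_bin, clf.predict(X_syn))
print(round(acc, 3))
```

> Other ways of defining the output classes are possible, and the choice of threshold changes what the accuracy means.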