
### Data Layer

Consider the grading scheme of this course. Four assignments together account for 30%, a project and a presentation account for 20% each, and a final exam accounts for 30%.
Consider a (fictitious) set of grade cuts without any subdivision: F (<50), D (50\~60), C (60\~75), B (75\~90), and A (90\~100). Assign an integer from 0 to 4 to each grade, with 0 being F and 4 being A.
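
The mapping from component marks to an integer grade can be sketched as follows. This is a minimal illustration, not the notebook's own generator: the names `WEIGHTS`, `CUTS`, `LETTERS`, and `grade_index`, as well as the sample marks, are assumptions made for this example.

```python
import numpy as np

# Component weights: four assignments (7.5% each), project, presentation, exam.
WEIGHTS = np.array([0.075, 0.075, 0.075, 0.075, 0.2, 0.2, 0.3])
# Lower edges of the D, C, B and A ranges.
CUTS = np.array([50, 60, 75, 90])
LETTERS = ["F", "D", "C", "B", "A"]


def grade_index(marks):
    """Return the integer grade (0 = F ... 4 = A) for seven component marks."""
    weighted = np.dot(WEIGHTS, marks)
    # side="right" counts the cut lines that are <= weighted, so a mark that
    # lies exactly on a cut line falls into the higher range.
    return int(np.searchsorted(CUTS, weighted, side="right"))


marks = [80, 70, 90, 60, 75, 85, 70]   # hypothetical student, weighted mark 75.5
idx = grade_index(marks)
print(idx, LETTERS[idx])               # prints: 3 B
```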

In [1]:
import numpy as np
import pandas as pd
from pandas import DataFrame
import csv

In [2]:
letter_grade=["F","D","C","B","A"]
weight=[0.075, 0.075, 0.075, 0.075, 0.2, 0.2, 0.3]
np.random.seed(0)  #fix the seed (the value 0 is arbitrary) so the generated data are reproducible
with open('grade_data.csv', mode='w', newline='') as sample_file:
    grade_writer = csv.writer(sample_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    grade_writer.writerow(['HW1','HW2','HW3','HW4','Project','Presentation','Exam','Grade'])
    for i in range(1000):
        mark_temp=np.random.normal(60, 60, 7)
        #prevent marks from going out of the range [0,100]
        mark_temp=[max(p,0) for p in mark_temp]
        mark_temp=[min(p,100) for p in mark_temp]
        mark_weighted=np.dot(weight,mark_temp)
        #count how many cut lines the weighted mark reaches (0 is F, 4 is A)
        grade=0
        for cut in (50, 60, 75, 90):
            if mark_weighted>=cut:
                grade+=1
        mark_temp.append(grade)
        grade_writer.writerow(mark_temp)

In [3]:
df = pd.read_csv('grade_data.csv')        #same relative path the file was written to
print(df)

            HW1         HW2         HW3         HW4     Project  Presentation  \
0    100.000000  100.000000   30.954987    6.821568   13.376814      0.000000   
1     49.797146   46.227928   56.773286    0.000000   32.709680     74.314485   
2     20.366042    7.352175   87.481686   61.355763  100.000000      0.000000   
3     50.744332   66.416793   10.791449   86.776878   30.667904    100.000000   
4     10.557723   94.261697    0.000000   84.479583   58.941993      0.000000   
..          ...         ...         ...         ...         ...           ...   
995   94.179794    0.000000   55.672433  100.000000   26.116161     41.534595   
996    0.000000    0.000000  100.000000   14.663096  100.000000    100.000000   
997   32.590867   33.482063  100.000000   77.136192   67.067924     97.539828   
998  100.000000    0.000000   17.654241   73.715375   77.781429      0.000000   
999    0.000000   25.242776    0.000000   49.191349   28.109211     59.636227   

           Exam  Grade  
0 

### Code Layer

To represent the hypothesis, we define the <span style="border-bottom: dashed">sigmoid function</span>:
$$
h_{\theta}(x)=\frac{1}{1+e^{-\theta^Tx}}
$$
Note that the components of $x$ may themselves be non-linear functions of the raw inputs, so $\theta^Tx$ can be non-linear in those inputs.
<br> The <span style="border-bottom: dashed">cost function</span> in logistic regression is defined as:
$$
J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right],
$$
and still, we can use gradient descent to minimize it.
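
As an illustration of that last remark, here is a minimal sketch of plain batch gradient descent on this cost function. The learning rate `alpha`, the iteration count `n_iter`, and the toy data are assumptions chosen for this example. The class defined later in this notebook hands the minimization to SciPy's `fmin_tnc` instead.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(theta, X, y):
    # Average cross-entropy over all samples
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def gradient_descent(X, y, alpha=0.1, n_iter=2000):
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iter):
        # Gradient of the cost: (1/m) * X^T (h - y)
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= alpha * grad
    return theta


# Toy data that are not perfectly separable: the constant x_0 = 1 plus one feature
X = np.array([[1, 0.5], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0], [1, 4.0]])
y = np.array([0, 0, 1, 0, 1, 1])

theta = gradient_descent(X, y)
print("cost at theta = 0:", cost(np.zeros(2), X, y))
print("cost after descent:", cost(theta, X, y))
```

The toy labels overlap on purpose: with perfectly separable data the cost has no finite minimizer, and the parameters keep growing as the iterations continue.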
    
<br>In addition, we use the <span style="border-bottom: dashed">accuracy function</span> to see how well the algorithm works. It is the fraction of samples whose predicted class matches the real class:
$$
\mathrm{Accuracy}=\frac{1}{m}\sum_{i=1}^{m}\delta_{\mathrm{predicted}^{(i)},\,\mathrm{real}^{(i)}}
$$
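
A short sketch of this definition, using made-up labels; the function name `accuracy` and the example arrays are assumptions for illustration only.

```python
import numpy as np


def accuracy(predicted, real):
    """Fraction of samples whose predicted label equals the real label."""
    predicted = np.asarray(predicted)
    real = np.asarray(real)
    # The element-wise comparison plays the role of the Kronecker delta
    return np.mean(predicted == real)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 3 of 4 labels agree -> 0.75
```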

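The hyperplane in the $N$-dimensional feature space that satisfies $z=\theta^{T}x=0$ is called the decision boundary.  Ideally, it separates the two classes of samples.

$z\gt0$ implies $h_\theta\gt0.5$, so the sample is more likely to belong to class $y=1$; conversely, $z\lt0$ implies that the sample is more likely to belong to class $y=0$.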

 <img title="decision boundary" src="https://github.com/LeoisWTT/PHYS3151-Machine-Learning-in-Physics-2023/blob/main/logistic-regression/Decision_boundary.png?raw=1" width="800" /> <br>
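
The link between the sign of $z$ and the predicted class can be checked numerically. In this sketch the parameter vector `theta` and the three sample points are hypothetical values picked so that the boundary is the line $x_1+x_2=3$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


theta = np.array([-3.0, 1.0, 1.0])   # hypothetical parameters: boundary x1 + x2 = 3
X = np.array([
    [1.0, 1.0, 1.0],   # z = -1, below the boundary
    [1.0, 2.0, 1.0],   # z =  0, exactly on the boundary
    [1.0, 2.0, 2.0],   # z = +1, above the boundary
])

z = X @ theta
h = sigmoid(z)
predicted = (h >= 0.5).astype(int)   # same 0.5 threshold convention as in the text

print(z)          # [-1.  0.  1.]
print(predicted)  # [0 1 1]
```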

In [None]:
import numpy as np
from scipy.optimize import fmin_tnc


class LogisticRegressionUsingGD:

    # Activation function used to map any real value into the interval (0, 1)
    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    # Computes the weighted sum of inputs, as in linear regression
    @staticmethod
    def net_input(theta, x):
        return np.dot(x, theta)

    # Calculates the probability that an instance belongs to class y=1
    def probability(self, theta, x):
        return self.sigmoid(self.net_input(theta, x))

    # Computes the cost function averaged over all the training samples
    def cost_function(self, theta, x, y):
        m = x.shape[0]
        log_likelihoods = [
            np.log(self.probability(theta, x[i])) if y[i] == 1
            else np.log(1 - self.probability(theta, x[i]))
            for i in range(m)
        ]
        return -(1 / m) * np.sum(log_likelihoods)

    # Computes the gradient of the cost function at the point theta
    def gradient(self, theta, x, y):
        m = x.shape[0]
        return (1 / m) * np.dot(x.T, self.sigmoid(self.net_input(theta, x)) - y)

    # Minimizes the cost with SciPy's truncated Newton (TNC) optimizer,
    # supplying the analytic gradient; the optimal parameters are stored in w_
    def fit(self, x, y, theta):
        opt_weights = fmin_tnc(func=self.cost_function, x0=theta, fprime=self.gradient, args=(x, y.flatten()))
        self.w_ = opt_weights[0]
        return self

    # Returns the predicted probability of class y=1 for each sample
    def predict(self, x):
        theta = self.w_[:, np.newaxis]
        return self.probability(theta, x)

    # Returns the percentage of samples whose thresholded prediction matches the actual class
    def accuracy(self, x, actual_classes, probab_threshold=0.5):
        predicted_classes = (self.predict(x) >= probab_threshold).astype(int)
        predicted_classes = predicted_classes.flatten()
        accuracy = np.mean(predicted_classes == actual_classes)
        return accuracy * 100

Now, we pick the features and outcome out of the initial data.

In [None]:
data = df
X = data.iloc[:, :7]        #features
grade = data.iloc[:, 7]         #outcome
X = np.c_[np.ones((X.shape[0], 1)), X]    #we need x_0

Label each sample according to whether it is in or above the B range: assign 1 to samples in or above the B range and 0 to those below it. This divides the samples into two classes on which we can perform logistic regression.

As a result, we should be able to find the boundary between grade B and C.

 <img title="grade" src="https://github.com/LeoisWTT/PHYS3151-Machine-Learning-in-Physics-2023/blob/main/logistic-regression/grade.png?raw=1" width="800" /> <br>

In [None]:
grade_here=3
y = [g>=grade_here for g in grade]                 #divide samples by whether they are in or above the B(3) range
y = np.array(y)
theta = np.zeros((X.shape[1], 1))

Implement the Logistic Regression algorithm.

In [None]:
model1 = LogisticRegressionUsingGD()
model1.fit(X, y, theta)
accuracy = model1.accuracy(X, y.flatten())
parameters = model1.w_
print("The accuracy of the model is {}".format(accuracy))
print("The model parameters got by Gradient descent:")
print(parameters)

The accuracy of the model is 100.0
The model parameters got by Gradient descent:
[-1.04302438e+03  1.03197740e+00  1.06828718e+00  1.03913938e+00
  1.04336035e+00  2.77918227e+00  2.80358328e+00  4.14899099e+00]



We normalize the parameters ($\theta$) so that the weights (excluding the bias term) sum to 1, as the true grading weights do.  After this normalization, the 7 weights match those we used to determine the grade of each sample, and the bias term equals the negative of the cut line.
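
The reasoning behind this normalization can be written out explicitly. Here $w_i$ and $c$ are symbols introduced for this sketch to denote the true grading weights and the cut line, and $S=\sum_{j=1}^{7}\theta_j$. The decision boundary fixes $\theta$ only up to an overall positive factor, so dividing by $S$ does not move it:

$$
\theta_0+\sum_{i=1}^{7}\theta_i x_i=0
\quad\Longleftrightarrow\quad
\sum_{i=1}^{7}\frac{\theta_i}{S}\,x_i=-\frac{\theta_0}{S}
$$

Comparing with the grading rule $\sum_{i=1}^{7}w_i x_i=c$, where $\sum_{i=1}^{7}w_i=1$, suggests the identification $w_i\approx\theta_i/S$ and $c\approx-\theta_0/S$, provided the fitted boundary is close to the true one.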

In [None]:
parameters_normalized=parameters/sum(parameters[1:])
print("The cut line for", letter_grade[grade_here], "range is {}".format(-parameters_normalized[0]))
print("The weights of the 7 components are {}".format(parameters_normalized[1:]))

The cut line for B range is 74.95941787282493
The weights of the 7 components are [0.0741655  0.07677499 0.07468021 0.07498356 0.19973252 0.20148615
 0.29817707]


We can repeat the procedure, dividing the samples at the C range instead, and reach the same weights; only the bias term (the cut line) differs.

In [None]:
grade_here=2
y = [g>=grade_here for g in grade]                 #divide samples by whether they are in or above the C(2) range
y = np.array(y)
theta = np.zeros((X.shape[1], 1))

In [None]:
model2 = LogisticRegressionUsingGD()
model2.fit(X, y, theta)
accuracy = model2.accuracy(X, y.flatten())
parameters = model2.w_
print("The accuracy of the model is {}".format(accuracy))
print("The model parameters got by Gradient descent:")
print(parameters)


The accuracy of the model is 100.0
The model parameters got by Gradient descent:
[-1320.86557045     1.65564797     1.66119478     1.53581502
     1.56276266     4.46217408     4.4774591      6.65694482]


In [None]:
parameters_normalized=parameters/sum(parameters[1:])
print("The cut line for", letter_grade[grade_here], "range is {}".format(-parameters_normalized[0]))
print("The weights of the 7 components are {}".format(parameters_normalized[1:]))

The cut line for C range is 60.00661753760554
The weights of the 7 components are [0.0752157  0.07546769 0.06977172 0.07099595 0.20271554 0.20340993
 0.30242346]


Consider the case where we do not know all of the relevant inputs. Is it still possible to recover the result from the remaining data?

In the following case we ignore the first two assignments, which take up 15% combined.

In [None]:
X2 = data.iloc[:, 2:7]        #features (excluding first 2 columns)
grade = data.iloc[:, 7]         #outcome
X2 = np.c_[np.ones((X2.shape[0], 1)), X2]    #we need x_0

In [None]:
grade_here=3
y = [g>=grade_here for g in grade]                 #divide samples by whether they are in or above the B(3) range
y = np.array(y)
theta = np.zeros((X2.shape[1], 1))

In [None]:
model3 = LogisticRegressionUsingGD()
model3.fit(X2, y, theta)
accuracy = model3.accuracy(X2, y.flatten())
parameters = model3.w_
print("The accuracy of the model is {}".format(accuracy))
print("The model parameters got by Gradient descent:")
print(parameters)

The accuracy of the model is 95.39999999999999
The model parameters got by Gradient descent:
[-31.04156189   0.03454377   0.03704007   0.09284211   0.09181337
   0.14214254]


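This time we normalize the weights to sum to only 0.85, because the five remaining components account for 85% of the total, and we divide the bias term by 0.85 so that the cut line is still expressed out of 100 points.

Note that we can still recover the weights of the remaining components, although the model is slightly less accurate (about 95.4%, compared with 100% when all seven components are used).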

In [None]:
parameters_normalized=parameters/sum(parameters[1:])*0.85
print("The cut line for", letter_grade[grade_here], "range is {}".format(-parameters_normalized[0]/0.85))
print("The weights of the remaining 5 components are {}".format(parameters_normalized[1:]))

The cut line for B range is 77.91911508824822
The weights of the remaining 5 components are [0.07370366 0.07902985 0.19809083 0.19589588 0.30327978]
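
As a quick check, the recovered weights printed above can be compared with the true weights of the five remaining components. The array `recovered` is copied from that output; the variable names are assumptions for this sketch.

```python
import numpy as np

# True weights of HW3, HW4, Project, Presentation, Exam
true_weights = np.array([0.075, 0.075, 0.2, 0.2, 0.3])
# Normalized weights copied from the printed output above
recovered = np.array([0.07370366, 0.07902985, 0.19809083, 0.19589588, 0.30327978])

abs_error = np.abs(recovered - true_weights)
print("sum of recovered weights:", recovered.sum())
print("largest absolute deviation:", abs_error.max())
```

For these printed values, the largest deviation is below 0.005, so each recovered weight is within half a percentage point of the true one.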
