try using GaussianNB() #1

Closed
RAWKING opened this issue Jun 29, 2019 · 4 comments
RAWKING commented Jun 29, 2019

I have re-run your model using GaussianNB() and the results are good.
Mean Model Accuracy : 0.9178075396825397
Accuracy Score : 0.9240506329113924
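For reference, swapping in GaussianNB is a one-line model change in scikit-learn. A minimal sketch of what that re-run might look like — using a synthetic stand-in dataset, since the repository's actual loading code isn't shown in this thread (the real features were reportedly ['failures', 'G2']):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the student data; two features, mirroring the
# two-feature setup ('failures', 'G2') described in the thread.
X, y = make_classification(n_samples=395, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

model = GaussianNB()
# "Mean Model Accuracy" reads like an averaged cross-validation score.
mean_acc = cross_val_score(model, X, y, cv=5).mean()

# "Accuracy Score" reads like accuracy on a held-out test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
print(f"Mean Model Accuracy : {mean_acc}")
print(f"Accuracy Score : {acc}")
```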


RAWKING commented Jun 29, 2019

Confusion Matrix:
[[23 2]
[ 4 50]]
Precision : 0.9615384615384616
Recall : 0.9259259259259259
F1 Score : 0.9433962264150944

and "failures" and "G2" are the two features I obtained after debugging your code.
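Those metrics follow directly from the confusion matrix above. Assuming scikit-learn's layout (rows = actual, columns = predicted, ordered [fail, pass]) and "pass" as the positive class, a quick check:

```python
import numpy as np

# Confusion matrix reported above; assumed layout:
# rows = actual [fail, pass], columns = predicted [fail, pass].
cm = np.array([[23,  2],
               [ 4, 50]])

tp = cm[1, 1]   # actual pass, predicted pass
fp = cm[0, 1]   # actual fail, predicted pass
fn = cm[1, 0]   # actual pass, predicted fail

precision = tp / (tp + fp)                          # 50/52 ≈ 0.9615
recall = tp / (tp + fn)                             # 50/54 ≈ 0.9259
f1 = 2 * precision * recall / (precision + recall)  # 100/106 ≈ 0.9434
print(precision, recall, f1)
```

The computed values match the numbers reported in the comment, which supports this reading of the matrix.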

RAWKING closed this as completed Jun 29, 2019

RAWKING commented Jun 29, 2019

This is the RESULT

Student Performance Prediction

Model Accuracy Knowing G1 & G2 Scores

['failures', 'G2']
Mean Model Accuracy : 0.9178075396825397
Accuracy Score : 0.9240506329113924

Confusion Matrix:
[[23 2]
[ 4 50]]
Precision : 0.9615384615384616
Recall : 0.9259259259259259
F1 Score : 0.9433962264150944

Model Accuracy Knowing Only G1 Score

['failures', 'G1']
Mean Model Accuracy : 0.8259424603174603
Accuracy Score : 0.8734177215189873

Confusion Matrix:
[[23 2]
[ 8 46]]
Precision : 0.9583333333333334
Recall : 0.8518518518518519
F1 Score : 0.9019607843137256

Model Accuracy Without Knowing Scores

['failures', 'absences']
Mean Model Accuracy : 0.7120039682539683
Accuracy Score : 0.7468354430379747

Confusion Matrix:
[[ 7 18]
[ 2 52]]
Precision : 0.7428571428571429
Recall : 0.9629629629629629
F1 Score : 0.8387096774193549

RAWKING reopened this Jun 29, 2019
sachanganesh (Owner) commented

Hey Rishab, I'm glad you were able to improve on my scores!

I can't recall exactly why I decided against using the GaussianNB classifier, but I have a general idea. It should be noted that because the associated publication emphasized the effectiveness of Naive Bayes, it was the first model I tried.

I haven't found the time to dig deep, but I was able to recreate your results by simply switching the models. I noticed that the GaussianNB approach yields a high False Pass Rate (FPR) of 0.72 and a very low False Fail Rate (FFR) of 0.04. This is obscured in your output, since these metrics are complementary to the ones you've chosen to use (precision and recall).

This implies that the GaussianNB model tends to predict a "pass" rating for students who eventually fail at an alarmingly high rate (FPR). On the other hand, it greatly minimizes the rate at which students are predicted to fail when they indeed pass (FFR).
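Under the same matrix convention as above (rows = actual [fail, pass], columns = predicted), both rates can be read straight off the confusion matrix. For instance, the quoted 0.72 / 0.04 figures are consistent with the ['failures', 'absences'] run reported earlier:

```python
import numpy as np

# Confusion matrix from the "without knowing scores" run;
# assumed layout: rows = actual [fail, pass], columns = predicted.
cm = np.array([[ 7, 18],
               [ 2, 52]])

# False Pass Rate: share of actually-failing students predicted to pass.
fpr = cm[0, 1] / cm[0].sum()   # 18 / 25 = 0.72
# False Fail Rate: share of actually-passing students predicted to fail.
ffr = cm[1, 0] / cm[1].sum()   # 2 / 54 ≈ 0.037
print(f"FPR = {fpr:.2f}, FFR = {ffr:.2f}")
```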

In practice, I imagine a teacher would appreciate the fact that the model has a low FFR; their time isn't wasted looking after students that would eventually pass anyways. However, a high FPR could imply that the model isn't very helpful in tagging the students that need attention.


My code, which uses a LinearSVC, has the following averaged results. This was pulled from the README.

Features Considered   G1 & G2   G1 & School   School & Absences
Paper Accuracy        0.919     0.838         0.706
My Model Accuracy     0.9165    0.8285        0.6847
False Pass Rate       0.096     0.12          0.544
False Fail Rate       0.074     0.1481        0.2185
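The FPR/FFR tradeoff between the two models can be inspected side by side. A sketch on synthetic stand-in data (the real pipeline and features live in the repository and aren't reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in; label 1 = pass, 0 = fail, roughly 70% passing.
X, y = make_classification(n_samples=395, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.3, 0.7], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (GaussianNB(), LinearSVC(max_iter=10000)):
    cm = confusion_matrix(y_te, model.fit(X_tr, y_tr).predict(X_te),
                          labels=[0, 1])
    fpr = cm[0, 1] / cm[0].sum()   # actual fails predicted to pass
    ffr = cm[1, 0] / cm[1].sum()   # actual passes predicted to fail
    print(f"{type(model).__name__}: FPR={fpr:.3f} FFR={ffr:.3f}")
```

On the real dataset, the same loop would surface the FPR/FFR tradeoff the table above summarizes.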

This approach compromises on the FFR and instead improves the FPR, meaning the model does a somewhat better job than the GaussianNB model of not misclassifying students who eventually fail. As expected, there is a tradeoff, which we see in the higher FFR, though it's still quite low.

In practice, this might mean teachers cast a slightly wider net of "failing" students due to the slightly higher FFR. However, it also means fewer students are incorrectly predicted to pass when they actually need academic support and attention.


The point of this project can be summed up from the following quote from the publication:

As a direct outcome of this research, more efficient student prediction tools can be developed, improving the quality of education and enhancing school resource management.

This is open to interpretation, but I think the point of classifying passing/failing students is to identify the students who will fail and give them help early on. Teachers can better focus on the students that need help, thus enhancing resource management. That's why it's important to leave the G1 and G2 grade reports out: we can then predict which students will fail without grade information and give them help early on.

I don't think focusing on students who pass is particularly valuable, so I'm not sure if precision and recall are the best metrics under that framework. If there is class imbalance (which I'm not sure is the case here, it's been a while), then high accuracy alone may be misleading as well.
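One quick sanity check on the imbalance point: with 25 actual fails and 54 actual passes in the test split reported above, a trivial model that always predicts "pass" already looks decent on accuracy alone, which is why accuracy by itself can mislead:

```python
import numpy as np

# Test split from the results above: 25 actual fails (0), 54 actual passes (1).
y_true = np.array([0] * 25 + [1] * 54)
always_pass = np.ones_like(y_true)

# Accuracy of the majority-class baseline.
majority_acc = (always_pass == y_true).mean()
print(majority_acc)   # 54/79 ≈ 0.68
```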

By maintaining a low FPR, a monitoring system can be sure that it's not letting any troubled students slip through the cracks. Giving students attention when they don't need it and will pass regardless is a lesser issue in my opinion, as long as it doesn't happen too much and we can maintain a reasonably low FFR.

It's a matter of opinion which type of error we're more comfortable with (Type I or Type II). But I prefer a low FPR, which is reflected in the LinearSVC approach. In addition, the FFR is only slightly higher than the GaussianNB model's, and I believe it's still at a reasonable value. Overall, with the focus on identifying failing students, the LinearSVC seems to be the better model.


Please remember that I haven't touched this code-base in almost 2 years, so I'm not as involved as I used to be. Perhaps there are implementation-level details that may need to be addressed as well. I welcome any further discussion about what I've said above, but I don't intend to update the codebase unless there's a generous improvement in the tracked metrics or heinous code mistakes.

In case you have any final thoughts, I'll refrain from closing the issue right away.

sachanganesh (Owner) commented

It's been more than 10 days since my reply, so I'm going to go ahead and close the issue. Hope my response was helpful.
