# Analyzing Best Studying Strategies

Author: David Wang

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

Welcome! The purpose of this project is to see whether GPA is correlated with several lifestyle factors, from study time and sleep to going out. I will compare these relationships to look for the most effective way to study in college while maintaining a high GPA. Student life matters, especially in college, so fitting in a healthy lifestyle is a must.

## Importing Data

We import the data so that we can analyze how GPA relates to hours studied per week and hours slept per night.

In [1]:
import pandas as pd
import altair as alt
import numpy as np

In [2]:
df = pd.read_csv("gpa.csv")
df.head()

Unnamed: 0,gpa,studyweek,sleepnight,out,gender
0,3.89,50,6.0,3.0,female
1,3.9,15,6.0,1.0,female
2,3.75,15,7.0,1.0,female
3,3.6,10,6.0,4.0,male
4,4.0,25,7.0,3.0,female


In [3]:
df.describe()

Unnamed: 0,gpa,studyweek,sleepnight,out
count,55.0,55.0,55.0,55.0
mean,3.600073,19.145455,7.063636,2.109091
std,0.335618,12.3864,1.032143,1.003194
min,2.9,2.0,5.0,0.0
25%,3.4,10.0,6.0,1.25
50%,3.65,15.0,7.0,2.0
75%,3.825,26.5,8.0,3.0
max,4.67,50.0,9.0,4.0


The summary above covers GPA, study hours per week (`studyweek`), and sleep per night (`sleepnight`). The `out` column captures time spent outside of class.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   gpa         55 non-null     float64
 1   studyweek   55 non-null     int64  
 2   sleepnight  55 non-null     float64
 3   out         55 non-null     float64
 4   gender      55 non-null     object 
dtypes: float64(3), int64(1), object(1)
memory usage: 2.3+ KB


## Cleaning Data

To reduce variability when comparing trends, we drop every row whose `gender` is "male": males and females have different biological processes that may influence the data, and there are only a few male students in the dataset (12 of 55). We also drop the `out` column because it only covers the social aspect, while we only care about time studied and time slept.

In [5]:
df = df[df['gender'] != 'male']
df = df.drop('out', axis=1)
df

Unnamed: 0,gpa,studyweek,sleepnight,gender
0,3.89,50,6.0,female
1,3.9,15,6.0,female
2,3.75,15,7.0,female
4,4.0,25,7.0,female
6,3.25,15,6.0,female
7,3.925,10,8.0,female
8,3.428,12,8.0,female
10,3.9,10,8.0,female
11,2.9,30,6.0,female
12,3.925,30,7.0,female


In [6]:
df.describe() # to find the average gpa

Unnamed: 0,gpa,studyweek,sleepnight
count,43.0,43.0,43.0
mean,3.611256,20.093023,7.011628
std,0.312429,12.495359,1.026369
min,2.9,4.0,5.0
25%,3.4,10.0,6.0
50%,3.7,15.0,7.0
75%,3.87,29.0,8.0
max,4.0,50.0,9.0


## Visualizing Data

### Context: 

Everyone currently in school has probably asked whether there is a correlation between amount of sleep, study hours, and GPA. I would like to see whether there is a sweet spot of sleep and study hours that maximizes GPA. We will plot GPA on the y-axis against study hours per week in one chart and sleep per night in the other.


In [7]:
scatter_plot = alt.Chart(df).mark_point(filled = True).encode(
    x = "studyweek",
    y = "gpa",
    tooltip = ['gpa', 'studyweek'],  # both fields are numeric, so no nominal (:N) type is needed
)

chart = alt.Chart(df).mark_bar().encode(
    x = 'sleepnight', 
    y = 'gpa', 
    tooltip = ['gpa', 'sleepnight']
)

scatter_plot | chart

Looking at the charts, there seems to be no clear correlation between GPA and either study time or sleep for female students.
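
A visual impression like this can be backed up with Pearson correlation coefficients. The sketch below is an illustration only: instead of loading `gpa.csv`, it rebuilds the ten female rows displayed in the cleaned table above, so its numbers will differ from those of the full 43-row dataset. The names `sample` and `corr_with_gpa` are hypothetical.

```python
import pandas as pd

# Ten rows copied from the cleaned table shown earlier (not the full dataset).
sample = pd.DataFrame({
    "gpa": [3.89, 3.9, 3.75, 4.0, 3.25, 3.925, 3.428, 3.9, 2.9, 3.925],
    "studyweek": [50, 15, 15, 25, 15, 10, 12, 10, 30, 30],
    "sleepnight": [6.0, 6.0, 7.0, 7.0, 6.0, 8.0, 8.0, 8.0, 6.0, 7.0],
})

# Pearson correlation of each predictor with GPA.
# Values near 0 indicate a weak linear relationship.
corr_with_gpa = sample.corr()["gpa"].drop("gpa")
print(corr_with_gpa)
```

Running the same two lines on the full `df` (after selecting the numeric columns) would give the coefficients for the whole filtered dataset.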

## Linear Regression

We will use linear regression to see whether there is a potential relationship between hours studied and GPA.

In [8]:
from sklearn.linear_model import LinearRegression

X = df['studyweek'].values.reshape(-1, 1) # To change into 2D array
y = df['gpa']

reg = LinearRegression().fit(X, y)

m = reg.coef_[0]
intercept = reg.intercept_

print("Slope: ", m)
print("Intercept: ", intercept)

Slope:  0.0021365922164140487
Intercept:  3.5683252168608894


In [9]:
regression = alt.Chart(pd.DataFrame({"x": [0, 50], "y": [intercept, intercept + m * 50]})).mark_line().encode(
    x='x',
    y='y',
    color=alt.value('red')
)

combined = (scatter_plot + regression)

combined

From the chart above, we can see a positive relationship between hours studied per week and GPA. However, the slope is very small (about 0.002 GPA points per additional hour), so the trend is barely noticeable. We will see if there is another way to look at the data.
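
To put the slope in perspective, the arithmetic below multiplies the fitted slope printed above by the full 0 to 50 hour range used for the regression line. It is only a back-of-the-envelope check, and the variable names are illustrative.

```python
# Values copied from the printed regression output above.
slope = 0.0021365922164140487
intercept = 3.5683252168608894

# Predicted GPA at 0 and at 50 study hours per week.
gpa_at_0 = intercept
gpa_at_50 = intercept + slope * 50

# Total predicted change across the whole range of study hours.
delta = gpa_at_50 - gpa_at_0
print(f"Predicted GPA change over 50 hours: {delta:.3f}")
```

The predicted difference is roughly 0.11 GPA points, which is small next to the GPA standard deviation of about 0.31 reported by `df.describe()`.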

## Logistic Regression

Logistic regression can help predict whether a given amount of studying corresponds to a high GPA.

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [11]:
X = df['studyweek'].values.reshape(-1, 1)
y = df['gpa'] >= 3.7

I chose 3.7 as the cutoff because it is the median GPA of the filtered data, so it splits this skewed dataset into two roughly equal groups. This is more useful than a cutoff such as 2.0, which no student falls below.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Report", report)

Accuracy: 0.4444444444444444
Report               precision    recall  f1-score   support

       False       0.43      0.75      0.55         4
        True       0.50      0.20      0.29         5

    accuracy                           0.44         9
   macro avg       0.46      0.47      0.42         9
weighted avg       0.47      0.44      0.40         9



Looking at the results, the accuracy is about `44.44%`. Accuracy is the proportion of test samples that the model classified correctly.

The accuracy is quite low in this case. One possible reason is the imbalanced dataset: most GPAs are clustered at the high end, and the only feature is study hours per week.

Because this is a two-class problem, random guessing would be right about 50% of the time, so 44% is no better than chance.
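
One way to judge an accuracy figure is to compare it with a majority-class baseline, that is, always predicting the most common label in the test set. The sketch below uses only the support counts from the classification report above (4 `False`, 5 `True`); the variable names are illustrative.

```python
# Support counts copied from the classification report above.
support = {False: 4, True: 5}
n_test = sum(support.values())

# Accuracy reported by the logistic regression model (4 of 9 correct).
model_accuracy = 4 / n_test

# Accuracy of a baseline that always predicts the most common class.
baseline_accuracy = max(support.values()) / n_test

print(f"model: {model_accuracy:.3f}, majority baseline: {baseline_accuracy:.3f}")
```

With only 9 test samples, a single prediction moves the accuracy by about 11 percentage points, so these figures should be read with caution.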

## K-Nearest Neighbors

Since GPA is continuous rather than categorical, we add another column that bins it into categories.

In [13]:
gpa_bin = [0, 2.5, 3.5, 4.0]
gpa_labels = ["Low GPA", "Medium GPA", "High GPA"]

df['gpa_classification'] = pd.cut(df['gpa'], bins = gpa_bin, labels = gpa_labels)

df.head()

Unnamed: 0,gpa,studyweek,sleepnight,gender,gpa_classification
0,3.89,50,6.0,female,High GPA
1,3.9,15,6.0,female,High GPA
2,3.75,15,7.0,female,High GPA
4,4.0,25,7.0,female,High GPA
6,3.25,15,6.0,female,Medium GPA
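
To see how the bins behave, the sketch below applies the same `pd.cut` call to the ten GPA values displayed in the cleaned table earlier (an illustrative subset, not the full dataset) and counts how many fall into each category. The name `counts` is hypothetical.

```python
import pandas as pd

# GPA values copied from the ten rows of the cleaned table shown earlier.
gpa = pd.Series([3.89, 3.9, 3.75, 4.0, 3.25, 3.925, 3.428, 3.9, 2.9, 3.925])

gpa_bin = [0, 2.5, 3.5, 4.0]
gpa_labels = ["Low GPA", "Medium GPA", "High GPA"]

# pd.cut uses right-inclusive intervals by default:
# (0, 2.5], (2.5, 3.5], (3.5, 4.0].
labels = pd.cut(gpa, bins=gpa_bin, labels=gpa_labels)

# Count how many values fall into each category.
counts = labels.value_counts()
print(counts)
```

Running `df['gpa_classification'].value_counts()` on the full dataset would show the class balance that the classifier has to work with.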


In [14]:
from sklearn.model_selection import train_test_split

X = df["sleepnight"].values.reshape(-1, 1)
y = df['gpa_classification']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 40)

In [15]:
from sklearn.neighbors import KNeighborsClassifier

k = 4
knn = KNeighborsClassifier(n_neighbors = k)
knn.fit(X_train, y_train)

In [16]:
y_pred = knn.predict(X_test)

In [17]:
from sklearn.metrics import accuracy_score, confusion_matrix

accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Confusion Matrix:\n", confusion)

Accuracy:  0.4444444444444444
Confusion Matrix:
 [[4 0]
 [5 0]]


Looking at the accuracy, we see that it is pretty low at around `44.44%`, the same as the result found using logistic regression. I believe the reason for this error is the skewness of the data toward the upper GPA bracket, even though students report a variety of sleep and study patterns.

Although three categories are defined, no student in the filtered data has a GPA at or below 2.5 (the minimum is 2.9), so only two classes actually occur. Against a 50% chance level for two classes, 44.44% is not better than guessing at random.

Looking at the confusion matrix, and treating the first class (first row) as the positive class, we can interpret the following:

1) True Positives: `4`
2) True Negatives: `0`
3) False Positives: `5`
4) False Negatives: `0`

The second column of the matrix is all zeros, so the model predicted the same class for every test sample: four of those predictions were right and five were false positives. Sensitivity is `1` since there are no false negatives, which means the model identifies all positive instances, but only because it never predicts the other class. I feel the algorithm may be limited by a flawed dataset, but it points to a good area of exploration.
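
The counts listed above can be turned into the usual summary metrics. This is a minimal sketch that copies the matrix from the output and assumes, as in the list above, that the first class is the positive one; scikit-learn puts true classes in rows and predicted classes in columns.

```python
# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
confusion = [[4, 0],
             [5, 0]]

# Assumption: the first class is treated as "positive".
(tp, fn), (fp, tn) = confusion

sensitivity = tp / (tp + fn)  # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found
precision = tp / (tp + fp)    # share of positive predictions that were right

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, precision={precision:.2f}")
```

A sensitivity of 1 paired with a specificity of 0 is the signature of a classifier that always predicts the same class.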

## Statsmodels

In [18]:
import statsmodels.api as sm

In [19]:
X = df[['studyweek', 'sleepnight']]
y = df['gpa']

model = sm.OLS(y, sm.add_constant(X)).fit()

In [20]:
model.summary()

0,1,2,3
Dep. Variable:,gpa,R-squared:,0.066
Model:,OLS,Adj. R-squared:,0.019
Method:,Least Squares,F-statistic:,1.417
Date:,"Wed, 27 Sep 2023",Prob (F-statistic):,0.254
Time:,21:01:45,Log-Likelihood:,-9.0111
No. Observations:,43,AIC:,24.02
Df Residuals:,40,BIC:,29.31
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9640,0.391,7.579,0.000,2.174,3.754
studyweek,0.0045,0.004,1.101,0.277,-0.004,0.013
sleepnight,0.0794,0.050,1.588,0.120,-0.022,0.180

0,1,2,3
Omnibus:,3.062,Durbin-Watson:,2.417
Prob(Omnibus):,0.216,Jarque-Bera (JB):,2.698
Skew:,-0.521,Prob(JB):,0.26
Kurtosis:,2.352,Cond. No.,203.0


Let's analyze the results in the `model summary`.

The R-squared value is `0.066`, which indicates a very poor fit: study hours and sleep together explain only about 6.6% of the variation in GPA.

The p-value of the F-statistic is `0.254`, which is relatively high. This means the result is not statistically significant, and there is no strong evidence of a pattern. The individual p-values for `studyweek` (`0.277`) and `sleepnight` (`0.120`) are also above the usual 0.05 threshold.

Overall, the summary above gives no evidence of a linear relationship between GPA and either study hours per week or sleep per night.
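
As a quick sanity check on the summary table, the adjusted R-squared can be recomputed by hand from the reported R-squared, the number of observations, and the number of predictors. The sketch below uses only values copied from the summary; the variable names are illustrative.

```python
# Values copied from the OLS summary above.
r_squared = 0.066
n_obs = 43         # No. Observations
n_predictors = 2   # Df Model (studyweek and sleepnight)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
print(f"Adjusted R-squared: {adj_r_squared:.3f}")
```

The result agrees with the `0.019` in the summary, which shows that almost none of the variation in GPA remains explained once the two predictors are accounted for.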

## Summary

In summary, the goal of the project was to see whether there is any relationship between studying, sleeping, and GPA. However, the analysis was unsuccessful at finding one: the data is skewed and shows little to no relationship. Additionally, neither the logistic regression model nor the K-nearest neighbors classifier did better than guessing, each reaching about 44% accuracy on the test set.

## References

* What is the source of your dataset(s)?

Kaggle: https://www.kaggle.com/datasets/joebeachcapital/duke-students-gpa 

* List any other references that you found helpful.

Code from the Math 10 Summer notes

Statsmodels:
https://www.statsmodels.org/stable/index.html

K-Nearest Neighbors:
https://www.ibm.com/topics/knn#:~:text=The%20k%2Dnearest%20neighbors%20algorithm%2C%20also%20known%20as%20KNN%20or,of%20an%20individual%20data%20point

ChatGPT (For ideas and guidance)
