## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

Our project analyzes student academic performance. We looked specifically at classroom behaviors that may affect performance, such as raising hands, viewing resources, and reading announcements. We wanted to identify which of these behaviors are associated with better or worse overall academic performance. Students differ in how comfortable they are raising their hands or joining class discussions, so it is valuable to see how these behaviors relate to performance. In our own experience, some students rarely engage in class yet still earn high grades, while for others, engaging with the teacher and classmates seems to improve their grades. The effect varies from student to student, but this project gives us an overall view of the general trend, if one exists, between these behaviors and academic performance.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

Our problem statement is: using this dataset, can we use a logistic regression model to predict a student’s academic performance category (‘middle’ or ‘high’) from their engagement? This is a prediction problem because the goal is to predict the value of one variable, academic performance, from input variables such as raising hands, viewing announcements, and other behaviors. Academic performance is recorded as middle or high, where middle corresponds to a grade of 70-89 points and high to a grade of 90-100 points. The model should capture the relationship between the input and output variables well enough to make accurate predictions of academic performance for new, unseen students based on their behavior metrics.
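
As a rough sketch of what this prediction looks like, a logistic model with $p$ engagement predictors $x_1, \dots, x_p$ can be written as below. The symbols $\beta_0, \dots, \beta_p$ are generic placeholders for the fitted coefficients, not values from our analysis.

$$
P(\text{Class} = \text{high} \mid x_1, \dots, x_p) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
$$

A student is then predicted to be in the high category when this probability exceeds a chosen cutoff (0.5 is a common default).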

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

We used a dataset from Kaggle (https://www.kaggle.com/datasets/aljarah/xAPI-Edu-Data). It is an educational dataset collected from a learning management system (LMS) called Kalboard 360. This LMS was designed to facilitate learning through the use of technology, and it gives school administrators a better understanding of how their students engage with classroom material. The data were collected with a learner activity tracker called the Experience API (xAPI), which tracks learning progress and student actions such as reading articles or watching educational videos. This tool helps determine which behaviors are involved in the learning experience. The data fall into three major groups: demographic information, educational background, and behavioral features.

The demographic information includes gender, place of birth, and whether the student was primarily raised by the mother or the father. Educational background includes variables such as course topic and grade level. Behavioral features include raising hands, participating in class discussion, and viewing announcements.

## Stakeholders
Who cares? If you are successful, what difference will it make to them?

The main stakeholders are educators and school administrators, parents and guardians, and policymakers. This analysis can help educators and administrators identify how to better support their students: teachers can adjust their teaching styles or encourage certain behaviors if they know which behaviors are associated with better academic performance. Parents and guardians can gain a better understanding of how their child is performing in the classroom and identify specific behaviors to encourage or discourage. Policymakers can enact better policies to raise the quality of education. If they know which behaviors make up a good learning environment, they will know which resources or technology to invest in and can allocate more school funding to them.

## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level, you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

Our dataset had no missing (NA) values. Our first step was to drop the columns that were not relevant to our question, that is, columns unrelated to how students behave and perform during and after class. The dataset covers many grade levels, but we wanted to focus on middle school, so we subset to grades seven and eight for model development and held out grade six, the other middle school grade, as test data. Looking at the distribution of topics, we saw that English appeared only twice, so it was of little use, especially once dummy variables were introduced. The response variable originally had three classes: High, Medium, and Low. We dropped the Low category because it had the fewest observations, and chose to focus on students who could raise their grade from Medium to High and on how they might do so. We recoded Medium (M) as 0 and High (H) as 1 to form a binary response. Finally, we converted the categorical columns into dummy variables and cleaned the column names to make subset selection and checks for interaction terms easier.
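
The preparation steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up table, not our actual notebook code: the column names `GradeID` and `Topic`, the level labels (`G-07`, `Under-7`, and so on), the decision to drop the English rows outright, and the helper name `prepare` are all assumptions made for the example.

```python
import pandas as pd

# Tiny made-up frame standing in for the Kaggle CSV (values are illustrative only).
raw = pd.DataFrame({
    "GradeID": ["G-07", "G-08", "G-06", "G-07", "G-08"],
    "Topic": ["Math", "English", "Math", "Science", "Math"],
    "gender": ["M", "F", "F", "M", "F"],
    "raisedhands": [10, 80, 55, 70, 20],
    "StudentAbsenceDays": ["Above-7", "Under-7", "Under-7", "Under-7", "Above-7"],
    "Class": ["L", "H", "M", "H", "M"],
})


def prepare(df):
    """Hypothetical helper mirroring the cleaning steps described in the text."""
    out = df[df["GradeID"].isin(["G-07", "G-08"])]   # keep grades seven and eight
    out = out[out["Topic"] != "English"]             # assumed: rare topic excluded
    out = out[out["Class"] != "L"].copy()            # drop the Low category
    out["Class"] = out["Class"].map({"M": 0, "H": 1})  # binary response
    # One dummy column per categorical predictor; the first level is the reference.
    out = pd.get_dummies(
        out, columns=["gender", "StudentAbsenceDays"], drop_first=True, dtype=int
    )
    # Clean column names so they are easy to use in model formulas.
    out.columns = [c.replace("-", "_") for c in out.columns]
    return out.reset_index(drop=True)


clean = prepare(raw)
print(clean)
```

With `drop_first=True`, the alphabetically first level of each categorical column (here `F` and `Above-7`) becomes the reference level, so the remaining dummy columns are `gender_M` and `StudentAbsenceDays_Under_7`.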

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

We used the variance inflation factor (VIF) to check for multicollinearity. Among our continuous variables we found no problems, so we could proceed. We already expected all of these variables to be useful, and the VIF check confirmed that we could keep them. We also used a heatmap to visualize the correlations and found that VisITedResources and raisedhands are relatively correlated, which suggested a possible interaction term between them.
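
For reference, the VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. The sketch below computes it directly with NumPy on simulated data; the function name `vif` and the simulated columns are assumptions for illustration and are not taken from our analysis.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (rows are observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()            # R^2 of column j on the rest
        factors.append(1.0 / (1.0 - r2))
    return factors


# Simulated example: column c is almost a copy of column a, column b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
values = vif(np.column_stack([a, b, c]))
print([round(v, 2) for v in values])
```

A VIF near 1 indicates little collinearity, while values above roughly 5 to 10 are a common rule-of-thumb warning sign.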

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere). If yes, then how did you build upon those solutions, what did you do differently? Is your model better as compared to those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

We used a logistic regression model to predict 'Class' (the student's performance category). We prioritized classification accuracy and the true positive rate (TPR, or recall) because we considered them good overall measures of how well our models predict. Before developing our models, we expected the main issue to be removing statistically insignificant predictors, and we did encounter it. Our first model gave decent results: one insignificant predictor and a high TPR, but also a somewhat high false positive rate (FPR). When we ran the model on the test data, the FPR decreased. We did not use code from the dataset's Kaggle page or from other code repositories.
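
To make the metrics concrete, the sketch below computes accuracy, TPR, and FPR from true and predicted labels (1 = High, 0 = Medium). The function name `classification_metrics` and the example labels are made up for illustration and are not our model's output.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, TPR (recall) and FPR for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # High predicted as High
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # High predicted as Medium
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # Medium predicted as High
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # Medium predicted as Medium
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Made-up labels for eight students.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.75, 'tpr': 0.75, 'fpr': 0.25}
```

A high TPR means few High students are missed, while a high FPR means many Medium students are wrongly flagged as High, which is why we tracked both.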

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

For our first model, we used six predictors that we thought could be influential in deciding 'Class': `gender + raisedhands + VisITedResources + AnnouncementsView + Discussion + StudentAbsenceDays`. One of these predictors, AnnouncementsView, was clearly not statistically significant, so we removed it for our second model (five predictors). However, the performance of the two models, judged by their confusion matrices, was similar on both train and test data. For our third model, we wanted to look at variable interactions, specifically how gender might interact with class participation, so we used these predictors: `gender*Discussion + gender*raisedhands + VisITedResources + StudentAbsenceDays`. This time, a couple of the predictors were insignificant, and performance on train and test data was similar to that of the second model. Lastly, we used forward selection to help us optimize our model.
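
Forward selection, in general terms, starts from an empty model and repeatedly adds the single predictor that improves a chosen score the most, stopping when no addition helps. The sketch below shows only that loop. The `score` function is a hypothetical stand-in with made-up numbers; in a real analysis it would refit the logistic model on each candidate set and return something like a negative AIC or a validation accuracy.

```python
def forward_select(candidates, score, min_gain=0.0):
    """Greedy forward selection: add the best-scoring predictor until no gain."""
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every model obtained by adding one more predictor.
        trial = {c: score(selected + [c]) for c in remaining}
        pick = max(trial, key=trial.get)
        if trial[pick] - best <= min_gain:
            break                       # no candidate improves the score
        selected.append(pick)
        remaining.remove(pick)
        best = trial[pick]
    return selected, best


# Hypothetical score: made-up "usefulness" per predictor minus a size penalty.
# These numbers are purely illustrative and are NOT results from our models.
toy_gain = {
    "raisedhands": 3.0,
    "VisITedResources": 2.0,
    "StudentAbsenceDays": 4.0,
    "AnnouncementsView": 0.1,
}


def toy_score(predictors):
    return sum(toy_gain[p] for p in predictors) - 0.5 * len(predictors)


chosen, final_score = forward_select(toy_gain, toy_score)
print(chosen, final_score)
```

In this toy run the weakest candidate is left out because adding it lowers the penalized score, which is the same kind of decision forward selection makes with a real model-fit criterion.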

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

Based on the logistic regression results, we can draw the following conclusions:

1) The intercept is statistically significant. It represents the baseline log-odds of the High category for the dependent variable, "Class," when all predictors are at zero or at their reference levels.

2) The predictor variable "StudentAbsenceDays" is statistically significant, with a coefficient of 2.5501. This suggests that students who have missed fewer than seven days of school are more likely to be in the High performance category.

3) The predictor variable "raisedhands" is statistically significant, with a coefficient of 0.0376. This suggests that students who participate more in class by raising their hands are more likely to be in the High performance category.

4) The predictor variable "VisITedResources" is also statistically significant, with a coefficient of 0.0488. This suggests that students who visit the online resources provided by the school more frequently are more likely to be in the High performance category.

5) The interaction term "gender[T.M]:Discussion" is marginally significant, with a coefficient of 0.0358 and a p-value of 0.050. This suggests that participating more in class discussions is associated with a slightly larger increase in the odds of High performance for male students than for female students.
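
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The sketch below does this for the coefficients quoted above, assuming "StudentAbsenceDays" is coded as an indicator for fewer than seven absences. The helper name `odds_multiplier` is made up for the example.

```python
import math

# Coefficients as reported in the conclusions above (log-odds scale).
coefs = {
    "StudentAbsenceDays": 2.5501,  # assumed: indicator for fewer than 7 absences
    "raisedhands": 0.0376,
    "VisITedResources": 0.0488,
}

# Odds ratio for a one-unit increase in each predictor, others held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}


def odds_multiplier(beta, delta):
    """Factor by which the odds of High change when a predictor rises by delta."""
    return math.exp(beta * delta)


for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

# Example: effect of ten additional hand raises on the odds of High.
ten_raises = odds_multiplier(coefs["raisedhands"], 10)
print(f"10 more hand raises: odds x {ten_raises:.2f}")
```

Read this way, each additional hand raise multiplies the odds of High performance by about 1.04, and ten additional hand raises multiply them by roughly 1.46, holding the other predictors fixed.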

Based on these conclusions, we can make the following recommendations for stakeholders:

1) Encourage students to attend school regularly, as students who miss fewer than seven days of school are more likely to be in the High performance category.

2) Encourage student participation in class discussions and activities, as students who raise their hands more frequently are more likely to be in the High performance category.

3) Provide students with easy access to online resources, as students who visit these resources more frequently are more likely to be in the High performance category.

4) Consider providing additional support for female students to encourage their participation in class discussions and activities, as discussion participation is associated with a slightly larger gain for male students than for female students.

It is important to note that there are limitations to this model. For example, the model only includes a limited number of predictor variables and interactions, and there may be other factors that contribute to student success in the class. Additionally, the model is based on data collected up until September 2021, and it may not accurately reflect current trends or patterns.

If stakeholders want to continue using this model in the future, it may be necessary to update the model with more recent data and include additional predictor variables and interactions. The model could potentially be used for the next academic year, but it may need to be updated again after that to ensure accuracy.


## GitHub and individual contribution {-}

Put the **Github link** for the project repository.

https://github.com/wchen952/STAT-303-2-Project

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Marcelo Barillas</td>
    <td>Data cleaning and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Phillip Meng</td>
    <td>Approach and Developing Model</td>
    <td>Ran the different logistic regression models.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Jason Jiang</td>
    <td>Data quality check </td>
    <td>Checked for NA values, checked correlations, and visualized pair plots.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Chun-Li</td>
    <td>Variable selection and addressing overfitting</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*