<img src="https://brand.umich.edu/assets/brand/style-guide/logo-guidelines/U-M_Logo-Horizontal-Hex.png" alt="Drawing" style="width: 300px;" align="left"/><br>
    
## Week 2: Building prediction models of student success (20pts)

Building prediction models of student success is one of the most prominent applications of data science in education. Early detection of at-risk students helps universities design timely interventions and provide targeted support to those who are struggling. We will use the same OULAD dataset that you worked on in week 1.

**Overview of the dataset**

The dataset contains the information about 22 courses, 32,593 students, their assessment results, and logs of their interactions with the Virtual Learning Environment (e.g., Moodle) represented by daily summaries of student clicks (10,655,280 entries). 

**Reference**

Kuzilek, J., Hlosta, M., & Zdrahal, Z. (2017). Open university learning analytics dataset. Scientific data, 4, 170171. https://www.nature.com/articles/sdata2017171

# Open University Learning Analytics (OULAD) dataset

## Data scheme
![](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fsdata.2017.171/MediaObjects/41597_2017_Article_BFsdata2017171_Fig2_HTML.jpg)
## Course timeline
![](https://media.springernature.com/lw685/springer-static/image/art%3A10.1038%2Fsdata.2017.171/MediaObjects/41597_2017_Article_BFsdata2017171_Fig1_HTML.jpg)
## Relational database
* A module is a course
* A presentation is a semester (e.g., 2019J = Fall 2019, 2019B = Winter 2019)
* vle = virtual learning environment
![](https://analyse.kmi.open.ac.uk/resources/images/model.png)

---

## We only use data from BBB and FFF in 2013J and 2014J in this week's assignment

## A. Prepare the data (2pts)

Load the data 'assets/studentInfo.csv'   
Write a function that returns a pd.DataFrame of shape (6254, 39)

* Create a new column ['outcome'], where outcome == 1 if the student achieved a pass or a distinction, and outcome == 0 if the student failed
* One-Hot Encode categorical features. Use dummies for multi-class features.
* Do not encode ['id_student','code_module','code_presentation']
* Make sure you only use data from BBB and FFF in 2013J and 2014J

The final dataframe should consist of the following columns:

 'code_module',  
 'code_presentation',  
 'id_student',  
 'num_of_prev_attempts',  
 'studied_credits',  
 'outcome',    
 'gender_F',  
 'region_East Anglian Region',  
 'region_East Midlands Region',  
 'region_Ireland',  
 'region_London Region',  
 'region_North Region',  
 'region_North Western Region',  
 'region_Scotland',  
 'region_South East Region',  
 'region_South Region',  
 'region_South West Region',  
 'region_Wales',  
 'region_West Midlands Region',  
 'region_Yorkshire Region',  
 'highest_education_A Level or Equivalent',  
 'highest_education_HE Qualification',  
 'highest_education_Lower Than A Level',  
 'highest_education_No Formal quals',  
 'highest_education_Post Graduate Qualification',  
 'imd_band_0-10%',  
 'imd_band_10-20',  
 'imd_band_20-30%',  
 'imd_band_30-40%',  
 'imd_band_40-50%',  
 'imd_band_50-60%',  
 'imd_band_60-70%',  
 'imd_band_70-80%',  
 'imd_band_80-90%',  
 'imd_band_90-100%',  
 'age_band_0-35',  
 'age_band_35-55',  
 'age_band_55<=',  
 'disability_Y'  
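
The encoding steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the four rows are made up, the `final_result` column name and its `Pass`/`Fail`/`Distinction` values are assumed to match studentInfo.csv, and the module and presentation filter is left out.

```python
import pandas as pd

# Toy stand-in for studentInfo.csv (values are made up for illustration).
toy = pd.DataFrame({
    "id_student": [1, 2, 3, 4],
    "gender": ["F", "M", "F", "M"],
    "age_band": ["0-35", "35-55", "55<=", "0-35"],
    "final_result": ["Pass", "Fail", "Distinction", "Fail"],
})

# Binary target: 1 for Pass or Distinction, 0 for the remaining rows
# (only Fail appears in this toy data).
toy["outcome"] = toy["final_result"].isin(["Pass", "Distinction"]).astype(int)

# Binary feature: keep a single indicator column (gender_F).
toy["gender_F"] = (toy["gender"] == "F").astype(int)

# Multi-class feature: one dummy column per category.
age_dummies = pd.get_dummies(toy["age_band"], prefix="age_band", dtype=int)

encoded = pd.concat(
    [toy[["id_student", "outcome", "gender_F"]], age_dummies], axis=1
)
print(encoded.columns.tolist())
```

Identifier columns such as `id_student` are passed through untouched, which mirrors the instruction not to encode them.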
  

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def answer_a():
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return result

In [None]:
# Cell for autograder

# Check data frame shape 
assert answer_a().shape == (6254, 39), "Your pandas data frame should have 39 columns and 6254 rows"


## B. Feature engineering: assessments (3pts)

For each unique combination of ['code_module','code_presentation','id_student'], create 3 new features: TMA1, TMA2, TMA3.
* TMA1 is the weighted score of the 1st TMA, ranked by date
* TMA2 is the weighted score of the 2nd TMA, ranked by date
* TMA3 is the weighted score of the 3rd TMA, ranked by date

Note:
* Weighted_score = weight * score /100
* If a student did not submit a TMA, then their score == 0. However, since students who did not submit a TMA are not recorded in studentAssessment.csv, you will need to think of a clever way to capture that information. Hint: You can use studentInfo.csv as a reference point.
* Make sure you only use data from BBB and FFF in 2013J and 2014J

The final dataframe should have a shape of (6254, 6) and it should consist of the following columns:
* 'code_module' 
* 'code_presentation'
* 'id_student'
* 'TMA1'
* 'TMA2'
* 'TMA3'
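
The hint about using studentInfo.csv as a reference point can be sketched as a left merge. This is a minimal illustration, not the expected solution: the `roster` and `submissions` tables are hypothetical, the data is made up, and a single module and presentation is assumed so that ranking by date does not need to be grouped.

```python
import pandas as pd

# Hypothetical roster (stand-in for studentInfo.csv) and submissions
# (stand-in for studentAssessment.csv joined with assessment weights).
roster = pd.DataFrame({"id_student": [1, 2, 3]})
submissions = pd.DataFrame({
    "id_student": [1, 1, 2],
    "date": [19, 54, 19],          # assessment date
    "weight": [10.0, 20.0, 10.0],
    "score": [80.0, 50.0, 90.0],
})

# Weighted score as defined in the note: weight * score / 100.
submissions["weighted_score"] = submissions["weight"] * submissions["score"] / 100

# Rank assessments by date (1 = earliest) and name them TMA1, TMA2, ...
rank = submissions["date"].rank(method="dense").astype(int)
submissions["tma"] = "TMA" + rank.astype(str)

# One row per student, one column per TMA.
wide = submissions.pivot_table(
    index="id_student", columns="tma", values="weighted_score", aggfunc="sum"
).reset_index()

# Left-merge onto the roster so students with no submissions are kept,
# then fill their missing scores with 0.
tma = roster.merge(wide, on="id_student", how="left").fillna(0)
print(tma)
```

Student 3 never appears in `submissions`, yet the left merge keeps the row and `fillna(0)` records the missing scores as zeros.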

In [None]:
def answer_b():
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return result

In [None]:
# Cell for autograder

# Data frame shape 
assert answer_b().shape == (6254, 6), "Your data frame should have 6 columns and 6254 rows"

# Missing scores as 0
assert answer_b()['TMA1'].isnull().sum() < 1, "If the students did not submit their TMA, then their weighted score == 0"
assert answer_b()['TMA2'].isnull().sum() < 1, "If the students did not submit their TMA, then their weighted score == 0"
assert answer_b()['TMA3'].isnull().sum() < 1, "If the students did not submit their TMA, then their weighted score == 0"


---

## C. Feature engineering: VLE activities (3pts)

Write a function that summarizes the number of clicks on each course section during days [0, 100] for each student. It should return a pd.DataFrame of shape (6254, 12) with the following identifier columns:
* 'code_module'
* 'code_presentation'
* 'id_student'

It should also contain one column with the sum of clicks for each of the following activity types:
* 'forumng'
* 'homepage'
* 'oucontent'
* 'resource'
* 'glossary'
* 'oucollaborate'
* 'quiz'
* 'subpage'
* 'url'

**Note**: 
* Missing data should be replaced by 0, which means that the students did not click on that course section
* Make sure you only use data from BBB and FFF in 2013J and 2014J
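
One way to build such a summary is a pivot table. The sketch below uses a made-up click log and assumes the columns `activity_type`, `date`, and `sum_click` are available after joining the VLE tables. It omits the module and presentation filter and the merge back onto the full student list.

```python
import pandas as pd

# Hypothetical click log (stand-in for studentVle.csv joined with vle.csv).
clicks = pd.DataFrame({
    "id_student": [1, 1, 1, 2, 2],
    "activity_type": ["homepage", "quiz", "homepage", "forumng", "homepage"],
    "date": [0, 50, 120, 10, 100],
    "sum_click": [3, 2, 5, 4, 1],
})

# Keep only days 0 through 100, inclusive on both ends.
window = clicks[clicks["date"].between(0, 100)]

# Total clicks per student and activity type; absent combinations become 0.
wide = window.pivot_table(
    index="id_student",
    columns="activity_type",
    values="sum_click",
    aggfunc="sum",
    fill_value=0,
).reset_index()
print(wide)
```

Students with no clicks at all in the window would not appear in `wide`, so a left merge onto the full student list followed by `fillna(0)` would still be needed.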

In [None]:
def answer_c():
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return result

In [None]:
# Cell for autograder
ans_c = answer_c()

# Data frame shape 
assert ans_c.shape == (6254, 12), "Your data frame should have 12 columns and 6254 rows"


## D. Feature extraction: PCA (3pts)

Many of the features from VLE activities are highly correlated (e.g., students who click on homepage are also likely to click on oucontent). One way to reduce the number of highly correlated features is to perform a Principal Component Analysis (PCA).

Write a function that:
* Performs a PCA on the four VLE features in answer_c(): forumng, homepage, oucontent, resource. Make sure to standardize the features before running the PCA.
* Selects the number of components k such that the **cumulative explained variance ratio > 0.8**
* Returns a pd.DataFrame that consists of 'code_module', 'code_presentation', 'id_student', and the principal components that you chose. For example, if you choose k=2, the columns should be PC1, PC2. If you choose k=3, the columns should be PC1, PC2, PC3
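
Choosing k from the cumulative explained variance ratio can be sketched as follows. The data here is synthetic, and the sketch assumes the smallest k that crosses the 0.8 threshold is the one intended.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, correlated features (made up for illustration).
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(200, 1)) for _ in range(4)])

# Standardize first so each feature has zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Fit a full PCA and find the smallest k whose cumulative ratio exceeds 0.8.
cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
k = int(np.argmax(cumulative > 0.8)) + 1

# Refit with k components and project the data.
components = PCA(n_components=k).fit_transform(X_std)
print(k, components.shape)
```

`np.argmax` on a boolean array returns the index of the first `True`, which is why adding 1 gives the component count.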

In [None]:
def answer_d():
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    df = answer_c()
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return result

In [None]:
# Cell for autograder


### The subsequent questions depend on the previous output, so submit the assignment at this point to confirm that your output is correct before moving on

## E. Train-test split (3pts)

Write a function that
* Combines all the features from answer_a(), answer_b(), and answer_d() into a single dataframe
* Performs feature scaling on the merged data using StandardScaler
* Splits the data into a training set and a test set, using stratified sampling because the data is imbalanced
* Returns X_train, X_test, y_train, y_test as np.arrays

Note: 
* Whenever appropriate, use random_state=42
* 'code_module','code_presentation','id_student' should be excluded in the feature sets
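
A stratified split on imbalanced data might look like the sketch below. The data is synthetic, and `test_size=0.25` is an assumption that happens to match the scikit-learn default. The scaling step comes before the split only because that is the order of the bullets above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: 80 positives and 20 negatives.
rng = np.random.default_rng(42)
X = rng.normal(loc=50, scale=10, size=(100, 3))
y = np.array([1] * 80 + [0] * 20)

# Scale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())
```

With stratification, both splits keep the 80/20 class ratio of the full data instead of leaving it to chance.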

In [None]:
def answer_e():
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    # Call data from the previous answers
    a = answer_a()
    b = answer_b()
    d = answer_d()
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return X_train, X_test, y_train, y_test


In [None]:
# Cell for autograder
X_train, X_test, y_train, y_test = answer_e()

# Data length
assert X_train.shape == (4690, 43), "There should be 4690 data points and 43 features in the train set"
assert X_test.shape == (1564, 43), "There should be 1564 data points and 43 features in the test set"

# Feature normalization
assert (X_train < 200).all(), "You should perform feature scaling after splitting the data"


## F. Apply classification algorithms (3pts)

Write a function that applies four different classification algorithms using the training and testing sets obtained from answer_e():
* Logistic Regression (random_state=42)
* Random Forest (random_state=42)
* Support Vector Machine (random_state=42)
* K-Nearest Neighbour (n_neighbors=5)

Return a pd.DataFrame of shape (4, 4) with the following columns: ['Accuracy', 'Recall', 'Precision', 'Model']

**Resources**: You can find more details about the pros and cons of each algorithm [here](https://towardsdatascience.com/comparative-study-on-classic-machine-learning-algorithms-24f9ff6ab222)
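
Collecting metrics from several classifiers into one table can be sketched with a loop over a dictionary of models. The example below uses synthetic data from `make_classification` and only two of the four algorithms, so its numbers say nothing about the OULAD results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (not the OULAD features).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
}

# Fit each model on the training set and score it on the test set.
rows = []
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "Accuracy": accuracy_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Model": name,
    })

result = pd.DataFrame(rows, columns=["Accuracy", "Recall", "Precision", "Model"])
print(result)
```

Adding another algorithm only requires one more entry in `models`.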

In [None]:
def answer_f():
    from sklearn.metrics import recall_score, precision_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    
    # Call data from the previous answers
    X_train, X_test, y_train, y_test = answer_e()
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return result


In [None]:
# Cell for autograder
ans_f = answer_f()

# Data frame shape 
assert ans_f.shape == (4,4), "Your data frame should have 4 rows and 4 columns"

# Check accuracy of LR 
actual =  ans_f.loc[ans_f['Model']=='Logistic Regression']['Accuracy'].values 
desired = 0.862532
np.testing.assert_almost_equal(actual, desired, decimal=4, err_msg='The accuracy of Logistic Regression is not correct', verbose=True)


In [None]:
# Run to plot your result 
df=answer_f().melt('Model', var_name='Metrics', value_name='Value')
sns.barplot(x="Metrics", y="Value", hue="Model", data=df)
plt.legend(bbox_to_anchor=(1, 1), loc=2) 

## G. Model evaluation (3pts)

Write a function that applies 2 different classification algorithms using 10-fold [stratified cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html):

* Logistic regression
* Random forest

For each algorithm, return the mean and the standard deviation of the roc_auc score across folds.
The final output should be a pd.DataFrame with the following columns: 'mean_auc_score', 'std_auc_score', 'model'

Note: 
* While running the [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html), you can speed up the processing time by setting the number of jobs to run in parallel (e.g., n_jobs = 2) to make use of multi-core processing. See more in the documentation
* Make sure you use random_state=42 in setting up the StratifiedKFold, LogisticRegression, and RandomForestClassifier
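
The cross-validation setup can be sketched on synthetic data as shown below. One assumption to flag: `shuffle=True` is passed to `StratifiedKFold` because recent scikit-learn versions raise an error when `random_state` is set without shuffling. Whether the autograder expects shuffling is not stated here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (not the OULAD features).
X, y = make_classification(n_samples=300, n_features=5, random_state=42)

# 10 stratified folds; shuffle=True is what makes random_state take effect.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# One ROC AUC value per fold.
scores = cross_val_score(
    LogisticRegression(random_state=42), X, y, cv=cv, scoring="roc_auc"
)

result = pd.DataFrame([{
    "mean_auc_score": scores.mean(),
    "std_auc_score": scores.std(),
    "model": "Logistic Regression",
}])
print(result)
```

`cross_val_score` returns one score per fold, so the mean and standard deviation summarize how stable the model is across the 10 splits.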


In [None]:
def answer_g():
    
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    
    from sklearn.model_selection import StratifiedKFold
    from sklearn.model_selection import cross_val_score
    
    # Call data from the previous answers
    X_train, X_test, y_train, y_test = answer_e()
    X = np.concatenate((X_train,X_test), axis=0)
    y = np.concatenate((y_train,y_test), axis=0)

    # YOUR CODE HERE
    raise NotImplementedError()

    return result

In [None]:
# Cell for autograder
ans_g = answer_g()

# Data frame shape 
assert ans_g.shape == (2,3), "Your data frame should have 2 rows and 3 columns"

# Check mean_auc_score of LR
actual =  ans_g.loc[ans_g['model']=='Logistic Regression']['mean_auc_score'].values 
desired = 0.89835
np.testing.assert_almost_equal(actual, desired, decimal=4, err_msg='The mean AUC score of Logistic Regression is not correct', verbose=True)

