---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource._

---

# Assignment 4 - Predicting and understanding viewer engagement with educational videos 

With the accelerating popularity of online educational experiences, the role of online lectures and other educational video continues to increase in scope and importance. Open access educational repositories such as <a href="http://videolectures.net/">videolectures.net</a>, as well as Massive Open Online Courses (MOOCs) on platforms like Coursera, have made many thousands of lectures and tutorials accessible to millions of people around the world. Yet this impressive volume of content has also created a challenge: how to find, filter, and match these videos with learners. This assignment gives you an example of how machine learning can be used to address part of that challenge.

## About the prediction problem

One critical property of a video is engagement: how interesting or "engaging" it is for viewers, so that they decide to keep watching. Engagement is critical for learning, whether the instruction is coming from a video or any other source. There are many ways to define engagement with video, but one common approach is to estimate it by measuring how much of the video a user watches. If the video is not interesting and does not engage a viewer, they will typically abandon it quickly, e.g. only watch 5 or 10% of the total. 

A first step towards providing the best-matching educational content is to understand which features of educational material make it engaging for learners in general. This is where predictive modeling can be applied, via supervised machine learning. For this assignment, your task is to predict how engaging an educational video is likely to be for viewers, based on a set of features extracted from the video's transcript, audio track, hosting site, and other sources.

We chose this prediction problem for several reasons:

* It combines a variety of features derived from a rich set of resources connected to the original data;
* The manageable dataset size means the dataset and supervised models for it can be easily explored on a wide variety of computing platforms;
* Predicting popularity or engagement for a media item, especially combined with understanding which features contribute to its success with viewers, is a fun problem but also a practical representative application of machine learning in a number of business and educational sectors.


## About the dataset

We extracted training and test datasets of educational video features from the VLE Dataset put together by researcher Sahan Bulathwela at University College London. 

We provide you with two data files for use in training and validating your models: train.csv and test.csv. Each row in these two files corresponds to a single educational video, and includes information about diverse properties of the video content as described further below. The target variable is `engagement` which was defined as True if the median percentage of the video watched across all viewers was at least 30%, and False otherwise.
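
To make the label definition concrete, here is a minimal sketch of how such a target could be derived from per-viewer watch data. The `views` table and its column names are hypothetical (the assignment files only contain the final `engagement` label), and the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical per-viewer watch fractions (0.0 to 1.0) for two videos.
# These values are invented; they are not part of train.csv or test.csv.
views = pd.DataFrame({
    "video_id": [1, 1, 1, 2, 2, 2],
    "fraction_watched": [0.10, 0.45, 0.60, 0.05, 0.08, 0.90],
})

# Median fraction watched per video, then the 30% cutoff described above.
median_watched = views.groupby("video_id")["fraction_watched"].median()
engagement = median_watched >= 0.30

print(engagement)
```

Because the median is used, a single viewer who watches 90% of video 2 does not make it "engaging" when most viewers leave early.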

Note: Any extra variables that may be included in the training set are simply for your interest if you want an additional source of data for visualization, or to enable unsupervised and semi-supervised approaches. However, they are not included in the test set and thus cannot be used for prediction. **Only the data already included in your Coursera directory can be used for training the model for this assignment.**

For this final assignment, you will bring together what you've learned across all four weeks of this course by exploring different prediction models for this new dataset. We also encourage you to apply what you've learned about model selection: tune hyperparameters using training/validation splits of the training data to optimize the model and further increase its performance. Beyond a basic evaluation of model accuracy, we've also provided a utility function to visualize which features contribute most and least to overall model performance.
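
The utility function mentioned above is not reproduced in this notebook. As a rough stand-in, the sketch below ranks features by the impurity-based importances of a random forest. The data is synthetic, the label is constructed to depend on a single column, and only the column names are borrowed from the data field list; none of this reflects the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; only the column names come from the assignment.
rng = np.random.default_rng(0)
feature_names = ["title_word_count", "document_entropy", "freshness", "easiness"]
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)

# Synthetic label that depends only on document_entropy (an assumption for the demo).
y = (X["document_entropy"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature was used more for splits.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances are quick to compute but can favor high-cardinality features, so they are best read as a first look rather than a final ranking.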

**File descriptions** 
    assets/train.csv - the training set (Use only this data for training your model!)
    assets/test.csv - the test set
<br>

**Data fields**

train.csv & test.csv:

    title_word_count - the number of words in the title of the video.
    
    document_entropy - a score indicating how varied the topics covered in the video are, based on the transcript. Videos with smaller entropy scores tend to be more cohesive and more focused on a single topic.
    
    freshness - The number of days elapsed between 01/01/1970 and the lecture published date. Videos that are more recent will have higher freshness values.
    
    easiness - A text difficulty measure applied to the transcript. A lower score indicates more complex language used by the presenter.
    
    fraction_stopword_presence - A stopword is a very common word like 'the' or 'and'. This feature computes the fraction of all words that are stopwords in the video lecture transcript.
    
    speaker_speed - The average speaking rate in words per minute of the presenter in the video.
    
    silent_period_rate - The fraction of time in the lecture video that is silence (no speaking).
    
train.csv only:
    
    engagement - Target label for training. True if learners watched a substantial portion of the video (see description), or False otherwise.
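
Since `freshness` is described as a day count from 01/01/1970, it can be turned back into a calendar date when exploring the data. The sketch below assumes the values are plain day offsets from the Unix epoch; the three sample values are taken from rows of train.csv displayed in this notebook.

```python
import pandas as pd

# Freshness values from rows of train.csv displayed in this notebook.
freshness = pd.Series([16310, 15410, 15680], name="freshness")

# Interpret each value as a number of days since 1970-01-01.
published = pd.to_datetime(freshness, unit="D")

print(published.dt.date.tolist())
```

The model itself can use the raw integer; the conversion is only useful for plots and sanity checks.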
    

## Evaluation

Your predictions will be given as the probability that the corresponding video will be engaging to learners.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC). 

Your grade will be based on the AUC score computed for your classifier. A model with an AUC (area under ROC curve) of at least 0.8 passes this assignment, and over 0.85 will receive full points.
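
As a reminder of what AUC measures, the toy example below computes it two ways: with scikit-learn's `roc_auc_score`, and as the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are made up, and `pairwise_auc` is a helper written for this illustration only.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (invented for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def pairwise_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


auc_sklearn = roc_auc_score(y_true, y_score)
auc_pairs = pairwise_auc(y_true, y_score)
print(auc_sklearn, auc_pairs)
```

Because AUC depends only on how the scores rank the videos, it is unaffected by the choice of a decision threshold.
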
___

For this assignment, create a function that trains a model to predict significant learner engagement with a video using `assets/train.csv`. Using this model, return a Pandas Series object of length 2309 whose values are the probabilities that each corresponding video from `assets/test.csv` will be engaging (according to a model learned from the 'engagement' label in the training set), and whose index is taken from the `id` field.

Example:

    id
       9240    0.401958
       9241    0.105928
       9242    0.018572
                 ...
       9243    0.208567
       9244    0.818759
       9245    0.018528
             ...
       Name: engagement, dtype: float32
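
A Series in this shape can be assembled as sketched below. The ids and probabilities here are placeholders; in a real solution they would come from the `id` column of the test set and from the model's `predict_proba` output.

```python
import numpy as np
import pandas as pd

# Placeholder ids and probabilities (not real model output).
ids = pd.Series([9240, 9241, 9242], name="id")
probs = np.array([0.40, 0.11, 0.02])

# The index carries the video id; the Series name matches the target label.
result = pd.Series(probs, index=pd.Index(ids, name="id"), name="engagement")
print(result)
```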
       
### Hints

* Make sure your code is working before submitting it to the autograder.

* Print out and check your result to see whether there is anything weird (e.g., all probabilities are the same).

* Generally the total runtime should be less than 10 mins. 

* Try to avoid global variables. If you have other functions besides engagement_model, you should move those functions inside the scope of engagement_model.

* Be sure to first check the pinned threads in Week 4's discussion forum if you run into a problem you can't figure out.
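
The "print out and check your result" hint can be partly automated. The helper below is a hypothetical sketch (`check_predictions` is not part of the assignment) that flags a few common problems before submitting.

```python
import pandas as pd


def check_predictions(pred, expected_len):
    """Return a list of problems found in a prediction Series (empty if none)."""
    if not isinstance(pred, pd.Series):
        return ["result is not a pandas Series"]
    problems = []
    if len(pred) != expected_len:
        problems.append("unexpected length: {}".format(len(pred)))
    if pred.isna().any():
        problems.append("contains missing values")
    if ((pred < 0) | (pred > 1)).any():
        problems.append("values outside [0, 1]")
    if pred.nunique() <= 1:
        problems.append("all probabilities are the same")
    if not pred.index.is_unique:
        problems.append("duplicate ids in the index")
    return problems


# Two tiny made-up examples: one plausible, one constant.
good = pd.Series([0.1, 0.8, 0.4], index=[9240, 9241, 9242], name="engagement")
flat = pd.Series([0.5, 0.5, 0.5], index=[9240, 9241, 9242], name="engagement")

print(check_predictions(good, 3))
print(check_predictions(flat, 3))
```

For the actual submission the expected length would be 2309, as stated in the task description.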

### Extensions

* If this prediction task motivates you to explore further, you can find more details here on the original VLE dataset and others related to video engagement: https://github.com/sahanbull/VLE-Dataset



### Steps for solving the assignment (my plan)

1. **Data preprocessing**
    * Load the data: load the training and test datasets (train.csv and test.csv) using pandas.
    * Handle missing values: check for missing values in the data and decide how to handle them (imputing, filling, or dropping them).
    * Feature scaling/normalization: scale continuous features to improve model performance (especially important for models such as SVMs and logistic regression).
    * Feature encoding: any categorical features would need to be encoded (though based on the description, all features appear to be numeric).

2. **Model training**
    * Select models: try multiple supervised machine learning models. Common choices for this kind of task include:
        * Logistic Regression
        * Random Forest Classifier
        * Gradient Boosting (e.g., XGBoost, LightGBM)
        * Support Vector Machine (SVM)
        * Neural Networks (if feasible)
    * Train a model: train a classifier on the train.csv dataset using the features provided (excluding the target column `engagement`).

3. **Hyperparameter tuning (optional)**
    * Search: use techniques such as GridSearchCV or RandomizedSearchCV to tune the model's hyperparameters and improve its performance.
    * Optimization: fine-tune hyperparameters such as the number of trees (for Random Forests or Gradient Boosting), learning rate, max depth, etc.

4. **Model evaluation**
    * AUC metric: the evaluation metric for this assignment is the Area Under the ROC Curve (AUC), so models should be compared on this metric.
    * Cross-validation: use cross-validation on the training set to get a reliable estimate of model performance and to detect overfitting.

5. **Model prediction on test data**
    * Generate predictions: once the best model is selected, use it to predict engagement probabilities for the videos in the test.csv dataset.
    * Return results: format the output as a Pandas Series indexed by the `id` field, with the predicted probabilities of engagement as values.
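
The preprocessing, training, and evaluation steps of this plan can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `make_classification` stands in for train.csv, and the scaler is placed inside a `Pipeline` so that it is re-fit on each training fold during cross-validation rather than on the whole dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data (not the real dataset).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Scaling and the classifier are chained so both are fit per fold.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# 5-fold cross-validated AUC, the same metric used for grading.
cv_auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(cv_auc)
print("Mean AUC: {:.4f}".format(cv_auc.mean()))
```

Tree-based models are largely insensitive to feature scaling, so the scaler matters mainly if the classifier is later swapped for logistic regression, an SVM, or a neural network.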

In [1]:
import warnings
warnings.filterwarnings("ignore")

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)   # Do not change this value: required to be compatible with solutions generated by the autograder.

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from matplotlib.colors import ListedColormap
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_recall_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [3]:
df_train = pd.read_csv('assets/train.csv') # the training set has an extra column, "engagement", which is the target
df_train.head(10)

|  | id | title_word_count | document_entropy | freshness | easiness | fraction_stopword_presence | normalization_rate | speaker_speed | silent_period_rate | engagement |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 7.753995 | 16310 | 75.583936 | 0.553664 | 0.034049 | 2.997753 | 0.0 | True |
| 1 | 2 | 6 | 8.305269 | 15410 | 86.870523 | 0.584498 | 0.018763 | 2.635789 | 0.0 | False |
| 2 | 3 | 3 | 7.965583 | 15680 | 81.915968 | 0.605685 | 0.03072 | 2.538095 | 0.0 | False |
| 3 | 4 | 9 | 8.142877 | 15610 | 80.148937 | 0.593664 | 0.016873 | 2.259055 | 0.0 | False |
| 4 | 5 | 9 | 8.16125 | 14920 | 76.907549 | 0.581637 | 0.023412 | 2.42 | 0.0 | False |
| 5 | 6 | 10 | 8.182952 | 16180 | 76.684133 | 0.57529 | 0.023649 | 2.244809 | 0.0 | False |
| 6 | 7 | 10 | 8.101635 | 12760 | 85.303173 | 0.600232 | 0.018423 | 2.458497 | 0.196126 | False |
| 7 | 8 | 9 | 7.733064 | 11460 | 97.57219 | 0.687275 | 0.008956 | 1.992327 | 0.289208 | False |
| 8 | 9 | 7 | 8.219794 | 15070 | 87.008975 | 0.600454 | 0.018557 | 1.715436 | 0.0 | False |
| 9 | 10 | 10 | 7.714182 | 14840 | 88.650478 | 0.6179 | 0.018933 | 2.210577 | 0.0 | False |


In [4]:
df_test = pd.read_csv('assets/test.csv')
df_test.head(10)

|  | id | title_word_count | document_entropy | freshness | easiness | fraction_stopword_presence | normalization_rate | speaker_speed | silent_period_rate |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 9240 | 6 | 8.548351 | 14140 | 89.827395 | 0.64081 | 0.017945 | 2.262723 | 0.0 |
| 1 | 9241 | 8 | 7.73011 | 14600 | 82.446667 | 0.606738 | 0.027708 | 2.690351 | 0.0 |
| 2 | 9242 | 3 | 8.200887 | 16980 | 88.821542 | 0.621089 | 0.009857 | 3.116071 | 0.0 |
| 3 | 9243 | 5 | 6.377299 | 16260 | 86.87466 | 0.6 | 0.004348 | 2.8375 | 0.017994 |
| 4 | 9244 | 18 | 7.75653 | 14030 | 88.872277 | 0.616105 | 0.03324 | 1.354839 | 0.0 |
| 5 | 9245 | 13 | 7.941503 | 14750 | 83.918688 | 0.624874 | 0.013074 | 2.381301 | 0.233458 |
| 6 | 9246 | 20 | 7.676826 | 14750 | 78.83557 | 0.603597 | 0.035575 | 1.731507 | 0.249336 |
| 7 | 9247 | 8 | 8.046349 | 14070 | 85.396646 | 0.635028 | 0.021921 | 3.080714 | 0.168078 |
| 8 | 9248 | 13 | 6.398993 | 14850 | 97.891938 | 0.728659 | 0.02439 | 1.106897 | 0.0 |
| 9 | 9249 | 9 | 8.310242 | 14200 | 79.556191 | 0.590176 | 0.025293 | 1.889474 | 0.292243 |


In [9]:
def engagement_model():
    # YOUR CODE HERE
 
    # Step 1: Load the data
    train_data = pd.read_csv('assets/train.csv')
    test_data = pd.read_csv('assets/test.csv')

    # Step 2: Preprocessing
    X_train = train_data.drop(columns = ['engagement']) # Features: every column except the target
    y_train = train_data['engagement'] # Target column
    
    X_test = test_data 
    # test.csv has no `engagement` column, so there are no true labels (y_test) to score against:
    # the ROC curve and AUC of the test predictions cannot be computed in this notebook.
    # Model quality is therefore estimated from the training data only (see the cross-validation below).
    # Note: `id` remains in X_train and X_test as a feature. It is only a row identifier,
    # so dropping it from both sets is worth trying.

    # Feature Scaling
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)

    # Step 3: Define the base model (Random Forest Classifier as an example).
    # GridSearchCV in Step 4 clones and fits this estimator itself, so no separate fit is needed here.
    RF_clf = RandomForestClassifier(n_estimators=100, random_state=999)
    
    # Step 4: Hyperparameter Tuning (Optional)
    # Perform grid search or random search for hyperparameter tuning if needed.
    grid_values = {'n_estimators': [100, 200], 'max_depth': [2, 5, 10]}
    grid_search = GridSearchCV(estimator = RF_clf, param_grid = grid_values, cv = 5, scoring = 'roc_auc', n_jobs = -1)
    grid_search.fit(X_train_scaled, y_train)
    
    print('Grid best parameter (max. AUC): ', grid_search.best_params_)
    print('Grid best score (AUC): ', grid_search.best_score_)

    
    # Step 5: Use the best estimator from GridSearchCV (already refit on the full training set)
    best_model = grid_search.best_estimator_
    print("Best Estimator (Model): {}".format(best_model))

    # Step 6: Evaluate on the training data.
    # Metrics such as accuracy, precision, recall, and F1 should normally be computed on held-out data,
    # but test.csv has no labels (no y_test). The values below are measured on data the model has
    # already seen, so they are optimistic; the cross-validated AUC further down is more reliable.
    y_train_pred = best_model.predict_proba(X_train_scaled)[:, 1]

    # Lower the decision threshold from the default 0.5 to 0.3 to raise recall.
    # Initially (before lowering the threshold) the training recall was only 63.8%, meaning 36.2% of the
    # truly engaging videos were predicted as not engaging (false negatives).
    threshold = 0.3
    y_train_pred_lower_threshold = (y_train_pred >= threshold).astype(int)  # Classify based on new threshold

    # Accuracy at the default 0.5 threshold. best_model.score(X, y) is equivalent to
    # accuracy_score(y_train, best_model.predict(X_train_scaled)); it does not use the lowered threshold.
    train_accuracy = best_model.score(X_train_scaled, y_train)

    # Precision, recall, and F1 at the lowered threshold. Each compares the thresholded
    # predictions with the true labels in y_train.
    precision = precision_score(y_train, y_train_pred_lower_threshold) # Precision: proportion of true positives among positive predictions
    recall = recall_score(y_train, y_train_pred_lower_threshold)  # Recall: proportion of true positives among actual positives
    f1 = f1_score(y_train, y_train_pred_lower_threshold) # F1-score: harmonic mean of precision and recall
    # Accuracy  = (TP + TN) / (TP + TN + FP + FN)
    # Precision = TP / (TP + FP)
    # Recall    = TP / (TP + FN)  Also known as sensitivity, or True Positive Rate
    # F1        = 2 * Precision * Recall / (Precision + Recall)

    # Performance metric of the best model using training data. 
    print("Training Accuracy of the Best Model: {}".format(train_accuracy))
    print("Training Precision of the Best Model: {}".format(precision))
    print("Training Recall of the Best Model: {}".format(recall))
    print("Training F1 Score of the Best Model: {}".format(f1))
    
    # Cross-validation AUC scores of best model using training data. Unlike the above performance metrics, cross_val_score is used on the training data to assess the model's performance during training and help with model selection.
    cv_scores = cross_val_score(best_model, X_train_scaled, y_train, cv = 5, scoring = 'roc_auc', n_jobs = -1)
    print("Cross-validation AUC scores: {}".format(cv_scores))
    print("Mean AUC score: {:.4f}".format(cv_scores.mean()))
    
    # Step 7: Predictions on the test set
    X_test_scaled = scaler.transform(X_test)
    # Predict probabilities of engagement
    predictions_proba = best_model.predict_proba(X_test_scaled)[:, 1]  # We want the probability for the positive class (engaged)
     
    # Step 8: Format the output
    results = pd.Series(predictions_proba, index = test_data['id'], name = 'engagement')
    
    return results
    
engagement_model()

Grid best parameter (max. AUC):  {'max_depth': 10, 'n_estimators': 100}
Grid best score (AUC):  0.8649993504665725
Best Estimator (Model): RandomForestClassifier(max_depth=10, random_state=999)
Training Accuracy of the Best Model: 0.9629830068189198
Training Precision of the Best Model: 0.8707653701380176
Training Recall of the Best Model: 0.7736900780379041
Training F1 Score of the Best Model: 0.8193624557260921
Cross-validation AUC scores: [0.89870104 0.87771783 0.88862297 0.89351667 0.76643825]
Mean AUC score: 0.8650


id
9240     0.042783
9241     0.068065
9242     0.270520
9243     0.866969
9244     0.131426
           ...   
11544    0.043442
11545    0.008248
11546    0.009410
11547    0.837368
11548    0.026534
Name: engagement, Length: 2309, dtype: float64
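
The accuracy, precision, recall, and F1 values printed above are computed on the same data the model was trained on. One way to get a less optimistic view, sketched here on synthetic data rather than the real files, is to hold out a validation split and measure AUC and recall there. The sketch also shows the effect of lowering the decision threshold from 0.5 to 0.3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled training data (not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)

# Keep 25% of the labelled rows aside; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
model.fit(X_fit, y_fit)

# AUC uses the raw probabilities, so it does not depend on a threshold.
val_proba = model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_proba)

# Recall does depend on the threshold: a lower cutoff can only keep or add positives.
recall_at = {
    t: recall_score(y_val, (val_proba >= t).astype(int)) for t in (0.5, 0.3)
}

print("Validation AUC: {:.4f}".format(val_auc))
print("Validation recall by threshold:", recall_at)
```

Lowering the threshold never reduces recall, but it usually costs precision, so the cutoff is worth choosing on held-out data rather than on the training set.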

In [10]:
stu_ans = engagement_model()
assert isinstance(stu_ans, pd.Series), "Your function should return a pd.Series. "
assert len(stu_ans) == 2309, "Your series is of incorrect length: expected 2309 "
assert np.issubdtype(stu_ans.index.dtype, np.integer), "Your answer pd.Series should have an index of integer type representing video id."

Grid best parameter (max. AUC):  {'max_depth': 10, 'n_estimators': 100}
Grid best score (AUC):  0.8649993504665725
Best Estimator (Model): RandomForestClassifier(max_depth=10, random_state=999)
Training Accuracy of the Best Model: 0.9629830068189198
Training Precision of the Best Model: 0.8707653701380176
Training Recall of the Best Model: 0.7736900780379041
Training F1 Score of the Best Model: 0.8193624557260921
Cross-validation AUC scores: [0.89870104 0.87771783 0.88862297 0.89351667 0.76643825]
Mean AUC score: 0.8650
