## Introduction

A machine learning model is trained to predict whether Coursera users will continue their subscriptions. The dataset is a sample of subscriptions that were initiated in 2021, each snapshotted at a particular date before the subscription was cancelled.


Regardless of the reason, Coursera has a vested interest in understanding how likely each individual learner is to retain their subscription, so that resources can be allocated appropriately to support learners across the various stages of their learning journeys.

## Understanding the Datasets

### Train vs. Test
In this competition, you’ll gain access to two datasets that are samples of past specialization subscriptions that contain information about the learner, the specialization, and the learner's activity in the subscription thus far. One dataset is titled `train.csv` and the other is titled `test.csv`.

`train.csv` contains 70% of the overall sample (509,837 subscriptions to be exact) and importantly, will reveal whether or not the subscription was continued into the next month (the “ground truth”).

The `test.csv` dataset contains the exact same information about the remaining segment of the overall sample (217,921 subscriptions to be exact), but does not disclose the “ground truth” for each subscription. It’s your job to predict this outcome!

Using the patterns you find in the `train.csv` data, predict whether the subscriptions in `test.csv` will be continued for another month, or not.

### Dataset descriptions
Both `train.csv` and `test.csv` contain one row for each unique specialization subscription. For each subscription, a single observation (`subscription_id`) is included as of a particular date (`observation_dt`) during which the subscription was active. This date was chosen at random from all the dates during which the subscription was active. In some instances it is soon after the subscription was initiated; in other instances, it is several months after the subscription was initiated and after several previous payments were made. Therefore, your model will have to be able to adapt to different stages of the subscription.

In addition to those identifier columns, the `train.csv` dataset also contains the target label for the task, a binary column `is_retained`.

Besides that column, both datasets have an identical set of features that can be used to train the model to make predictions. Descriptions of each feature are shown below. 

In [1]:
import pandas as pd
data_descriptions = pd.read_csv('data_descriptions.csv')
pd.set_option('display.max_colwidth', None)
data_descriptions

Unnamed: 0,Column_name,Column_type,Data_type,Description
0,subscription_id,Identifier,character,Unique identifier of each subscription
1,observation_dt,Identifier,date,The date on which the subscription was observed to calculate the features in the dataset. It was chosen at random amongst all the dates between the start of the subscription and the end of the subscription (before cancellation)
2,is_retained,Target,Integer,"TRAINING SET ONLY! 0 = the learner cancelled their subscription before next payment, 1 = the learner made an additional payment in this subscription"
3,specialization_id,Feature - Specialization Info,character,Unique identifier of a specialization (each subscription gives a learner access to a particular specialization)
4,cnt_courses_in_specialization,Feature - Specialization Info,integer,number of courses in the specialization
5,specialization_domain,Feature - Specialization Info,character,"primary domain of the specialization (Computer Science, Data Science, etc.)"
6,is_professional_certificate,Feature - Specialization Info,boolean,"BOOLEAN for whether the specialization is a ""professional certicate"" (a special type of specialization that awards completers with an industry-sponsored credential)"
7,is_gateway_certificate,Feature - Specialization Info,boolean,"BOOLEAN for whether the specialization is a ""gateway certificate"" (a special type of specialization geared towards learners starting in a new field)"
8,learner_days_since_registration,Feature - Learner Info,integer,Days from coursera registration date to the date on which the observation is made
9,learner_country_group,Feature - Learner Info,character,"the region of the world that the learner is from (United States, East Asia, etc.)"


In [54]:
# Import required packages

# Data packages
import pandas as pd
import numpy as np
import tensorflow as tf

# Machine Learning / Classification packages
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [55]:
# Import any other packages you may want to use
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU
from tensorflow.keras.activations import linear, relu, sigmoid


## Load the Data

Let's start by loading the dataset `train.csv` into a dataframe `train_df`, and `test.csv` into a dataframe `test_df` and display the shape of the dataframes.

In [30]:
train_df = pd.read_csv("train.csv")
train_df.shape


(413955, 37)

In [31]:
# test_df = pd.read_csv("test.csv")
# test_df.shape

In [32]:
train_df.head()

Unnamed: 0,subscription_id,observation_dt,is_retained,specialization_id,cnt_courses_in_specialization,specialization_domain,is_professional_certificate,is_gateway_certificate,learner_days_since_registration,learner_country_group,...,cnt_enrollments_completed_during_payment_period,cnt_enrollments_active_during_payment_period,cnt_items_completed_during_payment_period,cnt_graded_items_completed_during_payment_period,is_active_capstone_during_pay_period,sum_hours_learning_before_payment_period,sum_hours_learning_during_payment_period,cnt_days_active_before_payment_period,cnt_days_active_during_payment_period,cnt_days_since_last_activity
0,--rKikbGEeyQHQqIvaM5IQ,2022-05-04,1.0,kr43OcbTEeqeNBKhfgCLyw,8.0,Data Science,True,True,2321.0,Northern Europe,...,0.0,0.0,0.0,0.0,False,73.783333,0.0,68.0,0.0,20.0
1,-0XGzEq2EeyimBISGRuNeQ,2021-11-30,0.0,Q0Fc_Yl0EeqdTApgQ4tM7Q,6.0,Data Science,True,False,612.0,Northern Europe,...,0.0,0.0,0.0,0.0,False,0.85,0.0,7.0,2.0,0.0
2,-1P9kOb6EeuRugq1Liq62w,2021-08-13,0.0,9kmimrDIEeqxzQqieMm42w,6.0,Business,True,True,27.0,Australia and New Zealand,...,0.0,1.0,12.0,2.0,False,1.833333,2.983333,2.0,1.0,18.0
3,-2ifTJZbEeuIuRKpAhovaw,2021-08-03,1.0,7lHCSlFIEeeffRIHljDI_g,5.0,Information Technology,True,True,120.0,United States,...,0.0,2.0,83.0,9.0,False,18.45,7.1,18.0,4.0,3.0
4,-5YKZbchEeufeAq6C_fAOw,2021-06-04,0.0,kr43OcbTEeqeNBKhfgCLyw,8.0,Data Science,True,True,1228.0,India,...,1.0,1.0,61.0,4.0,False,29.566667,15.25,18.0,8.0,1.0


In [33]:
# Get 60% of the dataset as the training set. Put the remaining 40% in a temporary data frame: temp
# Split the 40% subset above into two: one half for cross validation and the other for the test set
# train, temp = train_test_split(train_df, test_size=0.4, random_state = 1234)
# validation, test = train_test_split(temp, test_size = 0.5, random_state = 1234)

# del temp
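
The 60/20/20 split described in the comments above can be sketched on a toy dataframe as follows. This is only an illustration: `toy_df` and its `feature` column are made up, and the `stratify` argument is an assumption that is not part of the original cell (it keeps the share of retained subscriptions similar across the splits).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_df; the real data has many more columns.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "is_retained": rng.integers(0, 2, size=1000),
})

# 60% training, 40% temporary hold-out (stratified on the label: an assumption).
train, temp = train_test_split(
    toy_df, test_size=0.4, random_state=1234, stratify=toy_df["is_retained"]
)
# Split the hold-out in half: 20% validation, 20% test.
validation, test = train_test_split(
    temp, test_size=0.5, random_state=1234, stratify=temp["is_retained"]
)

print(len(train), len(validation), len(test))  # 600 200 200
```

When the splits are written to disk, `to_csv` stores the row index as an extra column by default; passing `index=False` avoids carrying that column into the features when the files are read back.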


In [45]:
# train.to_csv("training.csv")
# validation.to_csv("validation.csv")
# test.to_csv("test.csv")

In [52]:
training_data = pd.read_csv("training.csv")
validation_data = pd.read_csv("validation.csv")
test_data = pd.read_csv("test.csv")


In [51]:
training_data.shape

(248373, 38)

### Preprocessing

In [56]:

def pre_processing(path, frac_features, frac_features_names, drop_features):
    """Load a CSV, drop incomplete rows, and derive fraction features.

    path: path to the CSV file
    frac_features: names of the count columns to convert to fractions
    frac_features_names: names of the resulting fraction columns
    drop_features: names of the columns to drop (categorical, identifier, raw counts)
    """
    df = pd.read_csv(path)
    df = df.dropna()

    total_courses_count = "cnt_courses_in_specialization"

    # express each enrollment count as a fraction of the courses in the specialization
    for feature, name in zip(frac_features, frac_features_names):
        df[name] = df[feature] / df[total_courses_count]

    df = df.drop(columns = drop_features)

    return df


def scaling(df, binary_features, is_train):
    """Standardize the non-binary columns and return a NumPy array.

    Relies on the global `scaler`, which is fitted when is_train is True
    and reused unchanged otherwise. Binary columns are cast to int and
    placed first in the output.
    """
    binary_cols = df[binary_features].values.astype(int)
    # keep the column order deterministic so train, validation and test line up
    cols_to_scale = [col for col in df.columns if col not in binary_features]

    if is_train:
        cols_scaled = scaler.fit_transform(df[cols_to_scale])
    else:
        cols_scaled = scaler.transform(df[cols_to_scale])

    scaled_df = np.concatenate((binary_cols, cols_scaled), axis = 1)

    return scaled_df
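
The `is_train` switch exists so that the scaling statistics come from the training data only and are then reused for the other splits. A minimal sketch of that pattern, using made-up arrays (`X_train_toy`, `X_val_toy`) and a separate `toy_scaler` so it does not interfere with the notebook's own `scaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features for a training and a validation split.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val_toy = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training data.
X_train_scaled = toy_scaler.fit_transform(X_train_toy)
# transform reuses those training statistics; nothing is learned from the validation data.
X_val_scaled = toy_scaler.transform(X_val_toy)

print(toy_scaler.mean_)   # per-column training means
print(X_val_scaled)       # validation row expressed in training units
```

Fitting the scaler again on the validation or test data would leak information about those splits into the preprocessing and make the splits incomparable.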


    

In [57]:
# convert integers of course enrollment to fractions
frac_features = ["cnt_enrollments_started_before_payment_period", 
                                    "cnt_enrollments_completed_before_payment_period", 
                                    "cnt_enrollments_active_before_payment_period", 
                                    "cnt_enrollments_started_during_payment_period", 
                                    "cnt_enrollments_completed_during_payment_period", 
                                    "cnt_enrollments_active_during_payment_period",]

frac_features_names = ["frc_enrollments_started_before_payment_period", 
                       "frc_enrollments_completed_before_payment_period", 
                       "frc_enrollments_active_before_payment_period", 
                       "frc_enrollments_started_during_payment_period", 
                       "frc_enrollments_completed_during_payment_period", 
                       "frc_enrollments_active_during_payment_period"]

# drop the categorical features
drop_features = ["specialization_id", "specialization_domain", "learner_country_group", 
                                    "learner_gender", "observation_dt", "subscription_id", 
                                    "learner_cnt_other_transactions_past", 
                                    "cnt_enrollments_started_before_payment_period", 
                                    "cnt_enrollments_completed_before_payment_period", 
                                    "cnt_enrollments_active_before_payment_period", 
                                    "cnt_enrollments_started_during_payment_period", 
                                    "cnt_enrollments_completed_during_payment_period", 
                                    "cnt_enrollments_active_during_payment_period", 
                                    "cnt_courses_in_specialization",]

binary_features = ["is_professional_certificate", "is_gateway_certificate", 
                   "is_subscription_started_with_free_trial", "is_active_capstone_during_pay_period",]

target_col = ["is_retained"]
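
To make the count-to-fraction conversion concrete, here is a small sketch on a toy dataframe. The column names come from the lists above; the values are made up for illustration.

```python
import pandas as pd

# Three hypothetical subscriptions.
toy = pd.DataFrame({
    "cnt_courses_in_specialization": [8.0, 6.0, 5.0],
    "cnt_enrollments_completed_before_payment_period": [2.0, 0.0, 5.0],
})

# Fraction of the specialization's courses completed before the payment period.
toy["frc_enrollments_completed_before_payment_period"] = (
    toy["cnt_enrollments_completed_before_payment_period"]
    / toy["cnt_courses_in_specialization"]
)

print(toy["frc_enrollments_completed_before_payment_period"].tolist())  # [0.25, 0.0, 1.0]
```

Expressing progress as a fraction puts a learner who finished 2 of 8 courses and one who finished 5 of 5 on a comparable scale, which the raw counts do not.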

In [58]:
# preprocessing training set
path_train = "training.csv"
train_data = pre_processing(path_train, frac_features, frac_features_names, drop_features)

y_train = train_data["is_retained"].values
train_data = train_data.drop(columns = "is_retained")

# scaling training data
scaler = StandardScaler()
X_train = scaling(train_data, binary_features, True)


In [61]:
# preprocessing validation set
path_val = "validation.csv"
val_data = pre_processing(path_val, frac_features, frac_features_names, drop_features)

y_val = val_data["is_retained"].values
val_data = val_data.drop(columns = "is_retained")

# scaling validation data
X_val = scaling(val_data, binary_features, False)

In [64]:
# preprocessing test set
path_test = "test.csv"
test_df = pd.read_csv(path_test)
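# drop the same incomplete rows that pre_processing drops, so ids and predictions stay aligned
test_ids = test_df.dropna()[['subscription_id']]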
test_data = pre_processing(path_test, frac_features, frac_features_names, drop_features)

y_test = test_data["is_retained"].values
test_data = test_data.drop(columns = "is_retained")

# scaling test data
X_test = scaling(test_data, binary_features, False)



### Model

In [76]:
model = Sequential(
    [
        Dense(256, input_shape = (X_train.shape[1],), activation = "relu"),
        Dense(128, activation = "relu"),
        Dense(64, activation = "relu"),
        Dense(1, activation = "sigmoid")
    ]
)


In [77]:
model.compile(
    loss = tf.keras.losses.BinaryCrossentropy(),
    optimizer = tf.keras.optimizers.Adam(0.001),
    metrics = [
        tf.keras.metrics.BinaryAccuracy(),
        tf.keras.metrics.Precision(),
        tf.keras.metrics.Recall(),
    ],
)

model.fit(
    X_train,y_train,
    validation_data = (X_val, y_val),
    epochs = 100,
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor = "val_loss",
            mode = "min",
            patience = 2,
            verbose = 0,
            restore_best_weights = True,
        ),
    ],
)
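
The early-stopping rule configured above can be illustrated with a simplified pure-Python sketch. It only mimics the patience logic (it ignores options such as `min_delta`), the helper name `early_stopping_epoch` is hypothetical, and the validation losses are made up, since the notebook output does not show them.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran through every epoch without triggering the rule.
    return len(val_losses), best_epoch


# Hypothetical losses: the best value is at epoch 3, followed by two epochs without improvement.
hypothetical_val_losses = [0.60, 0.55, 0.53, 0.54, 0.535, 0.52]
print(early_stopping_epoch(hypothetical_val_losses, patience=2))  # (5, 3)
```

With `patience = 2`, a run that stops after epoch 5 would be consistent with the best validation loss occurring at epoch 3; `restore_best_weights = True` then rolls the model back to the weights from that best epoch.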

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100


<keras.callbacks.History at 0x13a4bba00>

## Make predictions (required)

Remember you should create a dataframe named `prediction_df` with exactly 217,921 entries plus a header row, predicting the likelihood of retention for subscriptions in `test_df`. Your submission will throw an error if you have extra columns (beyond `subscription_id` and `predicted_probability`) or extra rows.

The file should have exactly 2 columns:
- `subscription_id` (sorted in any order)
- `predicted_probability` (contains your numeric predicted probabilities between 0 and 1, e.g. from `estimator.predict_proba(X)[:, 1]`)

The names of the dataframe and its columns are critical for our autograding, so please make sure to use the exact naming conventions of `prediction_df` with column names `subscription_id` and `predicted_probability`!
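
The format requirements above can be checked before submitting with a small helper such as the one sketched below. `check_submission` is a hypothetical function written for illustration; the actual autograder's checks are not documented here, so treat this as an approximation of the stated rules.

```python
import pandas as pd


def check_submission(df, expected_rows):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    # Exactly the two required columns, in any order.
    if sorted(df.columns) != ["predicted_probability", "subscription_id"]:
        problems.append("unexpected columns")
        return problems
    if len(df) != expected_rows:
        problems.append("wrong row count")
    if df["subscription_id"].duplicated().any():
        problems.append("duplicate ids")
    # between() is inclusive on both ends; NaN values fail the check.
    if not df["predicted_probability"].between(0, 1).all():
        problems.append("probability out of range")
    return problems


toy_prediction_df = pd.DataFrame({
    "subscription_id": ["a", "b", "c"],
    "predicted_probability": [0.1, 0.5, 0.9],
})
print(check_submission(toy_prediction_df, expected_rows=3))  # []
```

For the real submission, the call would be `check_submission(prediction_df, expected_rows=217921)`.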

### Example prediction submission:

The code below is a very naive prediction method that simply predicts retention using a Dummy Classifier. This is used as just an example showing the submission format required. Please change/alter/delete this code below and create your own improved prediction methods for generating `prediction_df`.

**PLEASE CHANGE CODE BELOW TO IMPLEMENT YOUR OWN PREDICTIONS**

In [80]:
predictions = model.predict(X_test).flatten()



In [81]:
# Combine predictions with the subscription_id column into a dataframe

prediction_df = pd.DataFrame({'subscription_id': test_ids.values[:, 0], 
                             'predicted_probability': predictions})

In [83]:
prediction_df.head(10)

Unnamed: 0,subscription_id,predicted_probability
0,cmruPNNlEeu5xhIWn-rHGQ,0.42754
1,sxQIF1Q_EeyCJRL_5xXcrw,0.673845
2,8dAgJoKnEeuyFgrvXo0Zqw,0.782838
3,_i3b2spOEeuktA5pX5p5sQ,0.513716
4,UmAgAiquEeyEkBIiq6qFSQ,0.781797
5,QidbzStWEeyFfhIE_bgb3w,0.471584
6,fUT_lYj5EeuT6Q4G3AaHkQ,0.384341
7,GlZ8D5VbEeuQdQonodlz7Q,0.760556
8,NcRNmzzkEey-bw4pL2HaOQ,0.556199
9,zc1hoba_EeufeAq6C_fAOw,0.55826


In [91]:
# ROC AUC on the held-out test split

roc_auc_score(y_test, predictions)

0.7527995542295384
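
One way to read this score: the ROC AUC is the probability that a randomly chosen retained subscription receives a higher predicted probability than a randomly chosen cancelled one. The sketch below computes it directly from that definition on four made-up examples. `pairwise_auc` is a hypothetical helper for illustration, not how `roc_auc_score` is implemented, and its quadratic cost makes it unsuitable for the full dataset.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (positive, negative) pairs are ordered correctly.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a score of about 0.75 means roughly three out of four such pairs are ordered correctly.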