# Project Summary
This code trains a Long Short-Term Memory (LSTM) neural network to predict the next user action based on the user's history of interactions. The input data is taken from a CSV file and consists of a set of user interactions with a learning platform, where each interaction is associated with a user ID, a timestamp, and an action performed by the user.

The code pre-processes the input data by extracting features from the timestamp column, one-hot encoding the categorical columns, and normalizing the non-categorical columns. The target variable 'action' is also one-hot encoded. The preprocessed data is then split into training and testing sets, and an LSTM model is trained on the training set to predict the next user action.

The model is evaluated on the test set and the accuracy is printed to the console. Additionally, the code provides a function to predict the next n actions for a given user, as well as a function to predict the next n actions for each unique user in the dataset.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.utils import to_categorical


# Reading the Input Data
The code reads the input data from a CSV file using the pd.read_csv function. The dt column is converted to a datetime format, and we drop the columns which we decided not to include in the model. I decided to remove all the 'id' columns except for the user_id because there are directly correlated to other attributes. eg. (step and step_id represent the same information).

In [3]:
# Read the CSV file into a pandas dataframe
df = pd.read_csv('internship_assignment.csv')

# Convert the 'dt' column to a datetime format
df['dt'] = pd.to_datetime(df['dt'])

# Extract features from 'dt' column
df['day_of_week'] = df['dt'].dt.dayofweek
df['hour_of_day'] = df['dt'].dt.hour


# Drop unnecessary columns
df = df.drop(columns=['dt', 'selected_track_id', 'selected_project_id', 'step_id'])

# Sort the data by 'user_id_hashed' columns in ascending order
df = df.sort_values(['user_id_hashed'], ascending=True)




# Feature Engineering & One-Hot Encoding

The code extracts features from the dt column using the dt.dayofweek and dt.hour functions. These features are added to the input data as new columns.The code one-hot encodes the categorical columns in the input data using the OneHotEncoder class from Scikit-learn. The encoder is fitted on the entire dataset and transforms the categorical columns in one step. The encoded data is then concatenated with the non-categorical columns and the user_id_hashed column.

In [4]:
# Define a list of the categorical column names
categorical_columns = ['user_id_hashed', 'learning_goal', 'selected_project', 'topic', 'project', 'project_difficulty', 'step', 'step_difficulty']

# Define an instance of the OneHotEncoder class for encoding categorical variables
encoder = OneHotEncoder(handle_unknown='ignore')

# Fit and transform the categorical columns in one step
encoded_categorical_data = encoder.fit_transform(df[categorical_columns])

# Combine the non-categorical columns and 'user_id_hashed' with the encoded categorical data
non_categorical_columns = ['day_of_week', 'hour_of_day']
encoded_data = pd.concat([df[non_categorical_columns], df['user_id_hashed'], pd.DataFrame(encoded_categorical_data.toarray())], axis=1)

# One-hot encode the target variable 'action'
target_encoder = OneHotEncoder(handle_unknown='ignore')
encoded_target = target_encoder.fit_transform(df[['action']])




# Data Splitting & Preprocessing
The code splits the preprocessed data into training and testing sets using the train_test_split function from Scikit-learn. The non-categorical columns are normalized using the MinMaxScaler function from Scikit-learn.

The target variables are converted to one-hot encoded format using the to_categorical function from Keras.

In [None]:
# Normalize the non-categorical columns using MinMaxScaler
scaler = MinMaxScaler()
encoded_data[['day_of_week', 'hour_of_day']] = scaler.fit_transform(encoded_data[['day_of_week', 'hour_of_day']])

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(encoded_data, encoded_target, test_size=0.2, random_state=42)

# Convert the sparse matrix to a dense NumPy array and get the index of the non-zero element
y_train = y_train.toarray().argmax(axis=1)
y_test = y_test.toarray().argmax(axis=1)

# Convert the target variables to one-hot encoded format
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

n_features = X_train.shape[1]
n_actions = y_train.shape[1]


# Model Creation & Training
The code creates an LSTM model using the Keras Sequential class. The model consists of three LSTM layers with increasing numbers of hidden units, followed by a dense output layer with a softmax activation function. The model is compiled using the Adam optimizer and categorical cross-entropy loss function.

The input data is reshaped to be in a 3D format (samples, timesteps, features), as required by the LSTM model. The model is trained on the training data using the fit function. The training history is stored in the history variable.

There is an error under the first part of training because I decided to interrupt the process early. The accuracy results were getting good, and I didn't want to run the model for an hour.

In [13]:
# Increase the number of hidden units and LSTM layers
model = Sequential()
model.add(LSTM(256, activation='relu', input_shape=(1, n_features), return_sequences=True))
model.add(LSTM(128, activation='relu', return_sequences=True))
model.add(LSTM(64, activation='relu'))
model.add(Dense(n_actions, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Reshape the input data to be 3D, as required by the LSTM model (samples, timesteps, features)
X_train_reshaped = X_train.values.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test_reshaped = X_test.values.reshape((X_test.shape[0], 1, X_test.shape[1]))

# Train the model
history = model.fit(X_train_reshaped, y_train, epochs=50, validation_data=(X_test_reshaped, y_test), verbose=1)  # Increase epochs to 50



Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50

KeyboardInterrupt: 

# Model Evaluation
The code evaluates the trained model on the test data using the evaluate function. The test accuracy is printed to the console.

In [20]:
# Evaluate the model on the test data
loss, accuracy = model.evaluate(X_test_reshaped, y_test)
print(f"Test accuracy: {accuracy * 100:.2f}%")

def predict_next_actions(model, user_data, n_actions):
    padding_length = model.input_shape[2] - user_data.shape[1]
    user_data_padded = np.pad(user_data.values, ((0, 0), (padding_length, 0)), 'constant')
    user_data_reshaped = user_data_padded.reshape((1, 1, user_data_padded.shape[1]))
    predictions = []

    for _ in range(n_actions):
        prediction = model.predict(user_data_reshaped)
        predictions.append(prediction)
        prediction = prediction.reshape(1, 1, -1)  # Add an extra dimension to the prediction array
        user_data_reshaped = np.concatenate((user_data_reshaped[:, :, :-prediction.shape[2]], prediction), axis=2)

    return np.array(predictions)



Test accuracy: 63.39%


# Testing Prediction Function for a Single User
The code defines a function predict_next_actions to predict the next n actions for a given user. This function takes the trained model, user data, and the number of actions to predict as inputs. It pads the user data to the same shape as the input shape of the LSTM model, and then iteratively predicts the next action by concatenating the prediction to the end of the input data. The predicted actions are returned in encoded format.

In [27]:
user_id = 782347178622731989
user_data = encoded_data[encoded_data['user_id_hashed'] == user_id].iloc[-1:].drop(columns=['user_id_hashed'])

# Predict the next 5 actions for the given user
predicted_actions_encoded = predict_next_actions(model, user_data, 10)

# Decode the predicted actions
predicted_actions = target_encoder.inverse_transform(predicted_actions_encoded.reshape(predicted_actions_encoded.shape[0], -1))
print(f"Next 5 actions for user {user_id}: {predicted_actions}")



Next 5 actions for user 782347178622731989: [['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']]


# Prediction Function for Each Unique User
The code defines a function predict_next_actions_for_each_user to predict the next n actions for each unique user

In [28]:
# Function to predict the next n actions for a given user data
def predict_next_actions_for_each_user(model, encoded_data, n_actions=5):
    unique_users = encoded_data['user_id_hashed'].unique()
    predictions = {}

    for user_id in unique_users:
        user_data = encoded_data[encoded_data['user_id_hashed'] == user_id].iloc[-1:].drop(columns=['user_id_hashed'])
        predicted_actions_encoded = predict_next_actions(model, user_data, n_actions)
        predicted_actions = target_encoder.inverse_transform(predicted_actions_encoded.reshape(predicted_actions_encoded.shape[0], -1))
        predictions[user_id] = predicted_actions

    return predictions

# Predict the next 5 actions for each unique user
next_actions_for_each_user = predict_next_actions_for_each_user(model, encoded_data, 5)

# Print the predictions
for user_id, actions in next_actions_for_each_user.items():
    print(f"Next 5 actions for user {user_id}: {actions}")


Next 5 actions for user 782347178622731989: [['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']]
Next 5 actions for user 2911199505101500553: [['completed_submission']
 ['completed_submission']
 ['completed_submission']
 ['completed_submission']
 ['completed_submission']]
Next 5 actions for user 3016752473480896665: [['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']
 ['codeChallengeSolved']]
Next 5 actions for user 3709264465427925985: [['completed_submission']
 ['completed_submission']
 ['completed_submission']
 ['completed_submission']
 ['completed_submission']]
Next 5 actions for user 4534275443679530850: [['completed_step']
 ['failed_submission']
 ['completed_step']
 ['failed_submission']
 ['completed_step']]
Next 5 actions for user 6005434371571979741: [['failed_submission']
 ['failed_submission']
 ['failed_submission']
 ['failed_submission']
 ['failed_submissi