Here's an introductory LightGBM model for predicting correct answers.
Loading the training data with pandas can run into memory issues, so it is worth trying the datatable library, which is known for its speed, its support for big data and its lower memory usage. For more information, see the datatable documentation: https://datatable.readthedocs.io/en/latest/start/quick-start.html

In [None]:
!pip install ../input/python-datatable/datatable-0.11.0-cp37-cp37m-manylinux2010_x86_64.whl

In [None]:
import pandas as pd
import datatable as dt
import lightgbm as lgb
from matplotlib import pyplot as plt

In [None]:
features =  ['user_id', 'content_id', 'answered_correctly', 'prior_question_elapsed_time', 'prior_question_had_explanation']
train_df = dt.fread('../input/riiid-test-answer-prediction/train.csv').to_pandas()
train_df = train_df[features]

In [None]:
train_df.head()

# Preprocessing
Let's clean the training dataset and add some engineered features to it.

In [None]:
train_df['answered_correctly'].unique()

As you can see, the 'answered_correctly' column contains some undesired values (-1). They add noise to the data and to the model's predictions, so it's a good habit to get rid of all empty and inappropriate values in the columns that we need.

In [None]:
#Eliminate rows with -1 values in the target
train_df = train_df[train_df['answered_correctly'] != -1].reset_index(drop=True)
#Replace null values with FALSE
train_df.fillna(False, inplace=True)

train_df['user_id'] = train_df['user_id'].astype('int32')
train_df['content_id'] = train_df['content_id'].astype('int16')
train_df['answered_correctly'] = train_df['answered_correctly'].astype('int8')
train_df['prior_question_elapsed_time'] = train_df['prior_question_elapsed_time'].astype('float32')
train_df['prior_question_had_explanation'] = train_df['prior_question_had_explanation'].astype('bool')
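
The casts above are there to save memory. As a rough illustration, here is a minimal sketch on a made-up frame (the `toy` data and its size are assumptions, not part of the competition data) that compares memory usage before and after downcasting:

```python
import pandas as pd

# Hypothetical toy frame; integer columns default to 64-bit here
toy = pd.DataFrame({
    'user_id': range(1000),
    'answered_correctly': [1, 0] * 500,
})

before = toy.memory_usage(deep=True).sum()

# Same kind of downcasting as in the cell above
toy['user_id'] = toy['user_id'].astype('int32')
toy['answered_correctly'] = toy['answered_correctly'].astype('int8')

after = toy.memory_usage(deep=True).sum()
print(f'before: {before} bytes, after: {after} bytes')
```

The saving only holds as long as the values fit in the smaller type, so it is worth checking the column ranges before casting.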

In [None]:
train_df['user'] = train_df.groupby('user_id')['answered_correctly'].shift()
#Calculate the running ratio of correct answers among the answers the user gave before the current row
cumulated = train_df.groupby('user_id')['user'].agg(['cumsum', 'cumcount'])
train_df['user_correctness'] = cumulated['cumsum'] / cumulated['cumcount']
train_df.drop(columns=['user'], inplace=True)
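
To see why the shift matters, here is a small sketch of the same idea on a made-up frame (the `toy` data is an assumption for illustration). It uses the `cumsum()` and `cumcount()` group methods directly, so each row only sees the answers that came before it:

```python
import pandas as pd

# Hypothetical toy data: two users, answers in chronological order
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'answered_correctly': [1, 0, 1, 0, 1],
})

# Shift within each user so row i only knows about answers 0..i-1
prev = toy.groupby('user_id')['answered_correctly'].shift()
grouped_prev = prev.groupby(toy['user_id'])

prior_correct = grouped_prev.cumsum()    # correct answers before this row
prior_count = grouped_prev.cumcount()    # number of answers before this row

# The first row of each user has no history, so the ratio is NaN there
toy['user_correctness'] = prior_correct / prior_count
print(toy)
```

Without the shift, the current answer would leak into its own feature. The NaN on each user's first row is left as-is, since LightGBM can handle missing values.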

In [None]:
user_agg = train_df.groupby('user_id')['answered_correctly'].agg(['sum', 'count'])
content_agg = train_df.groupby('content_id')['answered_correctly'].agg(['sum', 'count'])

In [None]:
train_df = train_df.groupby('user_id').tail(60).reset_index(drop=True)

In [None]:
questions_df = pd.read_csv('../input/riiid-test-answer-prediction/questions.csv')[['question_id', 'part']]
questions_df['question_id'] = questions_df['question_id'].astype('int16')
questions_df['part'] = questions_df['part'].astype('int8')

train_df = pd.merge(train_df, questions_df, left_on='content_id', right_on='question_id', how='left')
train_df.drop(columns=['question_id'], inplace=True)

In [None]:
train_df['content_count'] = train_df['content_id'].map(content_agg['count']).astype('int32')
train_df['content_id'] = train_df['content_id'].map(content_agg['sum'] / content_agg['count'])
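
The cell above replaces each `content_id` with the mean correctness of that question. Here is a minimal sketch of this mapping on made-up data (the `toy` frame and the `content_mean` column name are assumptions; the notebook overwrites `content_id` instead):

```python
import pandas as pd

# Hypothetical toy data: question 10 answered 3 times, question 20 twice
toy = pd.DataFrame({
    'content_id': [10, 10, 10, 20, 20],
    'answered_correctly': [1, 1, 0, 0, 1],
})

agg = toy.groupby('content_id')['answered_correctly'].agg(['sum', 'count'])

# How often each question was answered
toy['content_count'] = toy['content_id'].map(agg['count'])
# Share of correct answers per question
toy['content_mean'] = toy['content_id'].map(agg['sum'] / agg['count'])
print(toy)
```

Note that these statistics are computed over all rows, so each row's own answer is included in its question's mean.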

In [None]:
valid_df = train_df.groupby('user_id').tail(15)
train_df.drop(valid_df.index, inplace=True)
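
Here is a small sketch of how this split behaves, on made-up data with a hypothetical `n_valid = 2` instead of 15:

```python
import pandas as pd

# Hypothetical toy data: user 1 has 3 rows, user 2 has 2 rows, user 3 has 1 row
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'step':    [0, 1, 2, 0, 1, 0],
})

n_valid = 2  # assumption for the example; the notebook uses 15

# The last n_valid rows of every user go to validation
toy_valid = toy.groupby('user_id').tail(n_valid)
# Everything else stays in the training set
toy_train = toy.drop(toy_valid.index)

print(toy_train)
print(toy_valid)
```

Users with `n_valid` rows or fewer (users 2 and 3 here) end up entirely in the validation set, which is worth keeping in mind when reading the validation score.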

In [None]:
train_df.head()

# The LightGBM model and training process

In [None]:
#Defining the features to consider after feature engineering
features = [
    'content_id',
    'prior_question_elapsed_time',
    'prior_question_had_explanation',
    'user_correctness',
    'part',
    'content_count'
]

target = 'answered_correctly'

In [None]:
#Defining LightGBM parameters
params = {
    'objective': 'binary',
    # 'tree_method': 'hist',  # XGBoost parameter, not used by LightGBM
    'seed': 42,
    'metric': 'auc',
    'learning_rate': 0.05,
    'max_bin': 800,
    'num_leaves': 100
}

In [None]:
tr_data = lgb.Dataset(train_df[features], label=train_df[target])
va_data = lgb.Dataset(valid_df[features], label=valid_df[target])

#Training of the model
model = lgb.train(
    params, 
    tr_data, 
    num_boost_round=10000,
    valid_sets=[tr_data, va_data], 
    early_stopping_rounds=50,
    verbose_eval=50
)

#If you want to save the model
# model.save_model(f'model.txt')

To see how valuable the selected features are, it is useful to plot their importance for predicting the correctness of each user's answer.

In [None]:
lgb.plot_importance(model, importance_type='gain')
plt.show()

# Testing the model via the riiideducation library

In [None]:
import riiideducation

In [None]:
env = riiideducation.make_env()
iter_test = env.iter_test()
prior_test_df = None

In [None]:
for (test_df, sample_prediction_df) in iter_test:
    if prior_test_df is not None:
        prior_test_df[target] = eval(test_df['prior_group_answers_correct'].iloc[0])
        prior_test_df = prior_test_df[prior_test_df[target] != -1].reset_index(drop=True)
        
        user_ids = prior_test_df['user_id'].values
        content_ids = prior_test_df['content_id'].values
        targets = prior_test_df[target].values
        
        for user_id, content_id, answered_correctly in zip(user_ids, content_ids, targets):
            if user_id in user_agg.index:
                user_agg.loc[user_id, 'sum'] += answered_correctly
                user_agg.loc[user_id, 'count'] += 1
            else:
                user_agg.loc[user_id] = [answered_correctly, 1]
            
            if content_id in content_agg.index:
                content_agg.loc[content_id, 'sum'] += answered_correctly
                content_agg.loc[content_id, 'count'] += 1
            else:
                content_agg.loc[content_id] = [answered_correctly, 1]
                
    prior_test_df = test_df.copy()
    
    test_df = pd.merge(test_df, questions_df, left_on='content_id', right_on='question_id', how='left')
    
    test_df['prior_question_had_explanation'] = test_df['prior_question_had_explanation'].fillna(False).astype('bool')    
    
    test_df['user_correctness'] = test_df['user_id'].map(user_agg['sum'] / user_agg['count'])
    
    test_df['content_count'] = test_df['content_id'].map(content_agg['count']).fillna(1)
    test_df['content_id'] = test_df['content_id'].map(content_agg['sum'] / content_agg['count']).fillna(0.7)
      
    test_df['answered_correctly'] = model.predict(test_df[features])
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
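
The loop above updates the running statistics row by row with `.loc`, which can be slow on large test groups. Here is a sketch of the same bookkeeping with plain dictionaries. The helper names `update_stats` and `correctness` are hypothetical, and this is an alternative way to store the counts, not what the notebook does:

```python
from collections import defaultdict

# key -> [sum of correct answers, number of answers]
user_stats = defaultdict(lambda: [0, 0])

def update_stats(stats, keys, outcomes):
    """Add the outcomes of the previous group to the running statistics."""
    for key, outcome in zip(keys, outcomes):
        if outcome == -1:  # lectures have no answer, skip them
            continue
        stats[key][0] += outcome
        stats[key][1] += 1

def correctness(stats, key, default=float('nan')):
    """Return the share of correct answers, or `default` for an unseen key."""
    if key not in stats:  # membership test does not create an entry
        return default
    total, count = stats[key]
    return total / count

# Example: user 1 answered twice, user 2 once, user 3 only watched a lecture
update_stats(user_stats, [1, 1, 2, 3], [1, 0, 1, -1])
print(correctness(user_stats, 1))
print(correctness(user_stats, 3, default=0.5))
```

The same structure could be kept for questions, with the values looked up per row when building the features for each test group.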

# Conclusion 

The model gives promising results, but because the dataset is so large it was not possible to train it with more features from the provided datasets (train, questions and lectures). It is also trained on only part of the data (the last 60 interactions of each user), not all of it.
It would probably be beneficial to use more hardware resources (especially RAM) in order to get better performance.