# Stepik ML contest

## Description
This project is part of the Stepik course "Introduction to Data Science and Machine Learning". The task is to analyze user activity in another Stepik course and predict whether a user will complete that course, based on their activity during the first two days of study. A user is considered to have completed the course if they have more than 40 correctly solved 'steps'.
We are given two datasets:
- events_train - data about user actions on 'steps'. Brief description of the dataset:
  - step_id
  - user_id
  - timestamp - action time
  - action - one of four possible actions:
    - discovered
    - viewed
    - started_attempt
    - passed
- submissions_train - data about user submissions for the 'steps'. Brief description of the dataset:
  - step_id
  - timestamp - submission time
  - submission_status:
    - correct
    - wrong
  - user_id

In [764]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
submission_data_test = pd.read_csv('submission_data_test.csv')
events_data_test = pd.read_csv('events_data_test.csv')
submissions_data_train = pd.read_csv('submissions_data_train.csv')
events_data_train = pd.read_csv('event_data_train.csv')

# Loading test and train datasets

In [None]:
finished_users = submissions_data_train.query('submission_status == "correct"').groupby('user_id', as_index=False).agg({'timestamp': 'count'}).rename(columns={'timestamp' :'correct_count'})
finished_users['has_finished'] = finished_users.correct_count > 40

# Counting correct submissions per user and flagging users with more than 40 correct submissions as finished

In [780]:
events_data_for_finished_users = events_data_train.merge(finished_users, how='inner', on='user_id')[['user_id', 'timestamp']]
events_data_for_finished_users['day'] = pd.to_datetime(events_data_for_finished_users['timestamp'], unit='s').dt.date
events_data_for_finished_users = events_data_for_finished_users.drop_duplicates(subset=['user_id', 'day'])
events_time_dif_for_finished_users = pd.Series(np.concatenate(events_data_for_finished_users.groupby('user_id')['timestamp'].apply(list).apply(np.diff).values, axis=0))
events_time_dif_for_finished_users = events_time_dif_for_finished_users / (24 * 60 * 60)
days_for_user_to_be_gone = events_time_dif_for_finished_users.quantile(0.9)

# Calculating the gaps between active days for the users in finished_users and taking the 0.9 quantile of those gaps. A user who has been absent for longer than this is treated as gone.
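The threshold is a quantile of the gaps between a user's active days. The following is a minimal sketch of that idea on made-up data; `toy_events`, `daily`, `gaps_in_days` and `threshold_days` are hypothetical names, and it uses `groupby(...).diff()` in place of the `np.diff` approach in the cell above.

```python
import pandas as pd

DAY = 24 * 60 * 60  # seconds in a day

# Toy event log (made-up data): one row per action, Unix timestamps in seconds.
toy_events = pd.DataFrame({
    'user_id':   [1, 1, 1, 1, 2, 2, 2],
    'timestamp': [0, 100, 1 * DAY, 4 * DAY, 0, 2 * DAY, 12 * DAY],
})

# Keep the first event of each user on each calendar day,
# so that gaps are measured between active days, not between single actions.
toy_events['day'] = pd.to_datetime(toy_events['timestamp'], unit='s').dt.date
daily = toy_events.sort_values('timestamp').drop_duplicates(subset=['user_id', 'day'])

# Gap between consecutive active days of the same user, converted to days.
gaps_in_days = daily.groupby('user_id')['timestamp'].diff().dropna() / DAY

# 90% of the observed gaps are no longer than this value.
threshold_days = gaps_in_days.quantile(0.9)
```

With these toy gaps of 1, 2, 3 and 10 days, the 0.9 quantile falls between the two largest values, so a single long break does not set the threshold on its own.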

In [781]:
last_time_stamp = events_data_train['timestamp'].max()
gone_users = events_data_train.groupby('user_id', as_index=False).agg({'timestamp': 'max'})
gone_users['is_gone'] = last_time_stamp - days_for_user_to_be_gone * 24 * 60 * 60 > gone_users['timestamp']
gone_users = gone_users.drop('timestamp', axis=1)

# Selecting gone users

In [782]:
# Work on copies so that the original frames do not gain a helper column
first_two_days_actions = events_data_train.copy()
first_two_days_submissions = submissions_data_train.copy()
# Timestamp of each user's first event / first submission
first_two_days_actions['first_timestamp'] = first_two_days_actions.groupby('user_id')['timestamp'].transform('min')
first_two_days_submissions['first_timestamp'] = first_two_days_submissions.groupby('user_id')['timestamp'].transform('min')
# Keep only the rows that fall within two days of that first timestamp
first_two_days_actions = first_two_days_actions[first_two_days_actions['timestamp'] <= first_two_days_actions['first_timestamp'] + 2 * 24 * 60 * 60]
first_two_days_submissions = first_two_days_submissions[first_two_days_submissions['timestamp'] <= first_two_days_submissions['first_timestamp'] + 2 * 24 * 60 * 60]
# Per-user counts of each action type and each submission status
actions_pivot_table = first_two_days_actions.pivot_table(index='user_id', columns='action', values='step_id', aggfunc='count', fill_value=0).reset_index()
submissions_pivot_table = first_two_days_submissions.pivot_table(index='user_id', columns='submission_status', values='step_id', aggfunc='count', fill_value=0).reset_index()
submissions_pivot_table['total_sub_count'] = submissions_pivot_table.correct + submissions_pivot_table.wrong
submissions_pivot_table['correct_ratio'] = submissions_pivot_table.correct / submissions_pivot_table.total_sub_count
submissions_pivot_table = submissions_pivot_table.drop(['correct', 'wrong'], axis=1)

# Selecting each user's actions during their first two days and calculating features for prediction.
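The two-day window is measured from each user's own first timestamp, not from a shared calendar date. A minimal sketch on made-up data, assuming the same column names as the events dataset (`toy_events`, `windowed` and `action_counts` are hypothetical names):

```python
import pandas as pd

DAY = 24 * 60 * 60  # seconds in a day

# Toy events (made-up data): user 1 starts at t=1000, user 2 at t=5000.
toy_events = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'step_id':   [10, 11, 12, 10, 11],
    'timestamp': [1000, 1000 + DAY, 1000 + 3 * DAY, 5000, 5000 + 2 * DAY],
    'action':    ['viewed', 'passed', 'viewed', 'viewed', 'viewed'],
})

# Copy first so the source frame keeps its original columns.
windowed = toy_events.copy()
windowed['first_timestamp'] = windowed.groupby('user_id')['timestamp'].transform('min')

# Keep rows no later than two days after the user's first event (boundary included).
windowed = windowed[windowed['timestamp'] <= windowed['first_timestamp'] + 2 * DAY]

# One row per user, one column per action type, values are action counts.
action_counts = windowed.pivot_table(
    index='user_id', columns='action', values='step_id',
    aggfunc='count', fill_value=0,
).reset_index()
```

Here user 1's third event is three days after their first one and is dropped, while user 2's event exactly on the two-day boundary is kept because the comparison is `<=`.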

In [None]:
final_df = submissions_pivot_table.merge(actions_pivot_table, how='outer', on='user_id').fillna(0)
final_df = final_df.merge(gone_users, how='outer', on='user_id')
final_df = final_df.merge(finished_users, how='outer', on='user_id')
final_df = final_df.fillna({'correct_count': 0, 'has_finished': False})
final_df = final_df[(final_df.is_gone == True) | (final_df.has_finished == True)]
X_train = final_df.drop(['is_gone', 'has_finished', 'user_id', 'correct_count'], axis=1)
y_train = final_df.has_finished

# Merging final dataset

In [None]:
actions_test_pivot_table = events_data_test.pivot_table(index='user_id', columns='action', values='step_id', aggfunc='count', fill_value=0).reset_index()
submissions_test_pivot_table = submission_data_test.pivot_table(index='user_id', columns='submission_status', values='step_id', aggfunc='count', fill_value=0).reset_index()
submissions_test_pivot_table['total_sub_count'] = submissions_test_pivot_table.correct + submissions_test_pivot_table.wrong
submissions_test_pivot_table['correct_ratio'] = submissions_test_pivot_table.correct / submissions_test_pivot_table.total_sub_count
submissions_test_pivot_table = submissions_test_pivot_table.drop(['correct', 'wrong'], axis=1)
final_test_df = submissions_test_pivot_table.merge(actions_test_pivot_table, how='outer', on='user_id').fillna(0)
X_test = final_test_df.drop('user_id', axis=1)

# Merging test datasets

In [None]:
rfc = RandomForestClassifier()
params = {'n_estimators': range(10, 201, 10), 'max_depth': range(1, 10), 'min_samples_split': range(2, 10), 'min_samples_leaf': range(1, 10)}  # an integer min_samples_split must be at least 2
gridscv = RandomizedSearchCV(rfc, params, cv=5, n_jobs=-1)
gridscv.fit(X_train, y_train)
best_estimator = gridscv.best_estimator_

# Searching for the best RandomForestClassifier
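A randomized search samples a fixed number of parameter combinations instead of trying them all. The sketch below runs one on synthetic data from `make_classification`. The values of `n_iter`, `cv`, `random_state` and `scoring='roc_auc'` are assumptions chosen for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (made up for this example).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

param_distributions = {
    'n_estimators': range(10, 51, 10),
    'max_depth': range(1, 6),
    'min_samples_split': range(2, 10),  # an integer value must be at least 2
    'min_samples_leaf': range(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,            # number of sampled parameter combinations
    scoring='roc_auc',   # rank candidates by ROC AUC instead of accuracy
    cv=3,
    random_state=0,      # makes the sampled combinations reproducible
)
search.fit(X_toy, y_toy)

best_params = search.best_params_
best_score = search.best_score_  # mean cross-validated ROC AUC of the best candidate
```

Setting `scoring` explicitly can be useful when the final metric is ROC AUC, because a classifier's default score is accuracy.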

In [783]:
predictions = best_estimator.predict_proba(X_test)
# Column 1 of predict_proba is the probability of the positive class (has_finished == True);
# it is saved under the column name 'is_gone'
pred_df = pd.DataFrame({'user_id': final_test_df['user_id'], 'is_gone': predictions[:, 1]})
pred_df.to_csv('result.csv', index=False)

# Predicting and saving the results
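`predict_proba` returns one column per class, in the order given by the model's `classes_` attribute. The sketch below shows how the positive-class column can be located and scored with ROC AUC. It uses synthetic data and scores the training set, so it only illustrates the mechanics and says nothing about this project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data (made up): the label depends only on the sign of the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy[:, 0] > 0

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# One probability column per class, ordered as in model.classes_.
proba = model.predict_proba(X_toy)
positive_column = list(model.classes_).index(True)
positive_proba = proba[:, positive_column]

# ROC AUC takes the true labels and the positive-class scores.
train_auc = roc_auc_score(y_toy, positive_proba)
```

Looking the column up through `classes_` avoids relying on the positive class always being column 1.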

## Result
![AUC ROC score](AUC_ROC.)
