Competition Link: https://datahack.analyticsvidhya.com/contest/practice-problem-recommendation-engine

Recommendation Engine

Recommending the questions that a programmer should solve given his/her current expertise is a big challenge for Online Judge Platforms but is an essential task to keep a programmer engaged on their platform.

In this practice problem, you are given the data of programmers and questions that they have previously solved along with the time that they took to solve that particular question.

As a data scientist, your task is to build a model that can predict the time taken to solve a problem given the user's current status.

This model will help online judges to decide the next level of questions to recommend to a user.

In [None]:
#Necessary imports
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import time

In [None]:
user_df = pd.read_csv(r'../input/recommendation-engine/user_data.csv')
problem_df = pd.read_csv(r'../input/recommendation-engine/problem_data.csv')
train_submussion_df = pd.read_csv(r'../input/recommendation-engine/train_submissions.csv')
test_submussion_df = pd.read_csv(r'../input/recommendation-engine/test_submissions_NeDLEvX.csv')

In [None]:
#Let's look at sample rows from each data frame, starting with the user data
user_df.head()

In [None]:
#description of user data
user_df.describe()

In [None]:
#count of null values
user_df.isna().sum()

In [None]:
#percentage of null values
user_df.isnull().mean()

In [None]:
#plotting submission counts
sns.distplot(user_df["submission_count"])

In [None]:
#deciles of submission counts
user_df['submission_count'].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])

In [None]:
#creating bins for submission counts
submission_count_bins = pd.qcut(user_df["submission_count"], 4,labels = False)

#creating the new column for quantiled submission count
user_df["submission_count_bins"] = submission_count_bins.values

In [None]:
#Let's have a look at the distribution for submission count bins
sns.distplot(user_df["submission_count_bins"])

In [None]:
#plotting problem solved counts
sns.distplot(user_df["problem_solved"])

In [None]:
#quintile boundaries of problem solved counts
user_df['problem_solved'].quantile([.2, .4,.6, .8])

In [None]:
#quantiling the problem solved counts
problem_solved_bins = pd.qcut(user_df["problem_solved"], 5,labels = False)

#creating the new column for quantiled problem solved counts
user_df["problem_solved_bins"] = problem_solved_bins.values

In [None]:
#let's look at the distribution of the problem solved bins
sns.distplot(user_df["problem_solved_bins"])

In [None]:
#let's have a look at the new column for problem solved bins
user_df.head()

In [None]:
#the problem solved bins and submission count bins look identical, so let's check whether they really are
user_df['submission_count_bins'].equals(user_df['problem_solved_bins'])

submission_count_bins and problem_solved_bins are not identical, so we can proceed with the other columns.
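
`equals` only returns a single True/False. To see how closely two binned columns track each other, a row-wise agreement rate and a contingency table are more informative. The sketch below uses small made-up bin labels (not the competition data); note that the two columns use a different number of bins (4 versus 5), so equal labels are only a rough indication of overlap.

```python
import pandas as pd

# Hypothetical bin labels for six users (not taken from the competition data)
bins = pd.DataFrame({
    "submission_count_bins": [0, 1, 2, 3, 3, 1],
    "problem_solved_bins":   [0, 1, 2, 4, 3, 2],
})

# Share of rows where both columns hold the same label
agreement = (bins["submission_count_bins"] == bins["problem_solved_bins"]).mean()

# Contingency table: rows are submission bins, columns are problem-solved bins
table = pd.crosstab(bins["submission_count_bins"], bins["problem_solved_bins"])

print(f"agreement: {agreement:.2f}")
print(table)
```

A table with most of its mass on the diagonal would suggest the two features carry largely the same information.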

In [None]:
#define success rate as a column
user_df['success_rate'] = user_df['problem_solved']/user_df['submission_count']*100

In [None]:
#Let's look at the distribution of the contribution
sns.distplot(user_df["contribution"], kde=False, rug=True)

In [None]:
#this is quite skewed with 2530 values as 0
user_df["contribution"].value_counts(normalize = True).head(10)

We will have to remove the contribution field, as 70% of its values are 0.

In [None]:
#Now let's look at the number of countries or values
user_df["country"].unique().shape

The strategy for imputing the null values is based on how often each country occurs in the rest of the data.
For example, India occurs in 25.6% of the non-null rows and Bangladesh in 13.6%, and so on. We will use these ratios as sampling probabilities to fill in the missing values.
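
The idea can be sketched on toy data as follows. This is only an illustration under assumptions: the country values are made up, and the seeded generator (`rng`) is an addition that makes the random draws reproducible.

```python
import numpy as np
import pandas as pd

# Toy column with missing values (hypothetical data, not the competition file)
countries = pd.Series(
    ["India", "India", "Bangladesh", None, "India", None, "Russia", "Bangladesh"]
)

# Observed frequencies among the non-null entries; they sum to 1
ratios = countries.value_counts(normalize=True)

# Seeded generator so the imputation is reproducible (an assumption of this sketch)
rng = np.random.default_rng(42)

# Draw one replacement per missing row, weighted by the observed frequencies
missing = countries.isna()
draws = rng.choice(ratios.index.to_numpy(), size=missing.sum(), p=ratios.to_numpy())

imputed = countries.copy()
imputed[missing] = draws
print(imputed.value_counts(normalize=True))
```

Sampling this way roughly preserves the observed country distribution, whereas filling every gap with the most frequent country would inflate that one level.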

In [None]:
#Getting all the ratios
country_data = (user_df["country"].value_counts()/user_df["country"].count())

In [None]:
#imputing missing values
user_df["country"]= user_df["country"].fillna(pd.Series(np.random.choice(country_data.index,p=country_data.values, size=len(user_df))))

Country is a categorical feature with 79 levels. I would like to reduce it to 10 levels, so I will keep the 9 most frequent countries and group the rest under "Other".

In [None]:
country_list = user_df['country'].value_counts().index[:9]
user_df['country_new'] = np.where(user_df['country'].isin(country_list), user_df['country'], 'Other')

In [None]:
#Now let's look at the countries distribution
sns.countplot(x=user_df["country_new"])

In [None]:
#value counts of new field
user_df["country_new"].value_counts()

In [None]:
#plotting follower_count
sns.distplot(user_df["follower_count"])

In [None]:
user_df.loc[user_df["follower_count"]==0].shape

In [None]:
#quantiling the follower_count
user_df['follower_count'].quantile([.2, .4, .6, .8])

In [None]:
#creating bins for follower counts
follower_count_bins = pd.qcut(user_df["follower_count"], 5,labels = False)
#creating the new column for quantiled follower count
user_df["follower_count_bins"] = follower_count_bins.values

In [None]:
#Let's have a look at the new distribution
sns.distplot(user_df["follower_count_bins"])

In [None]:
#let's find the age of the user in the platform in months
user_df["age_in_platform"] = (user_df["last_online_time_seconds"] - user_df["registration_time_seconds"])/(24*3600*30)

In [None]:
sns.distplot(user_df["age_in_platform"])

In [None]:
#plotting max_rating
sns.distplot(user_df["max_rating"])

In [None]:
#creating bins for max_rating counts
max_rating_bins = pd.qcut(user_df["max_rating"], 4,labels = False)

In [None]:
#creating the new column for quantiled max_rating count
user_df["max_rating_bins"] = max_rating_bins.values

In [None]:
#plotting max_rating counts
sns.distplot(user_df["max_rating_bins"])

In [None]:
#plotting rating
sns.distplot(user_df["rating"])

In [None]:
#Now let's look at the unique ranks
user_df["rank"].unique()

In [None]:
#Now let's look at the rank distribution
sns.countplot(x=user_df["rank"])

In [None]:
#Percentage distribution of rank
rank_share = user_df["rank"].value_counts(normalize = True)
sns.barplot(x=rank_share.index, y=rank_share.values)

In [None]:
#percentage distribution of rank. It looks good to go
user_df["rank"].value_counts(normalize = True)

In [None]:
user_df.columns

In [None]:
#70% values are 0, so we can drop this field
user_df.drop(columns = ["contribution"],axis = 1, inplace = True)

In [None]:
#drop country as we have a new field for country with 'Other'
user_df.drop(columns = ["country"],axis = 1, inplace = True)

In [None]:
#years elapsed since registration (relative to the current time)
user_df["registration_time"] = (time.time()-user_df["registration_time_seconds"])/(3600*24*365)

In [None]:
#years elapsed since the user was last online (relative to the current time)
user_df["last_online_time"] = (time.time()-user_df["last_online_time_seconds"])/(3600*24*365)

In [None]:
#drop last_online_time_seconds and registration_time_seconds as we have new fields for them
user_df.drop(columns = ["last_online_time_seconds","registration_time_seconds"],axis = 1, inplace = True)

In [None]:
#change values of country_new using a label encoder
labelencoder = LabelEncoder()
user_df['country_new'] = labelencoder.fit_transform(user_df['country_new'])

In [None]:
#change values of rank to numeric
rank_dict = {'beginner':0, 'intermediate':1, 'advanced':2, 'expert':3}
user_df["rank"] = user_df["rank"].apply(lambda x: rank_dict[x])

In [None]:
user_df.head()

Let's look at the problem data now

In [None]:
#lets look at the sample data for problem data.
problem_df.head()

In [None]:
#let's look at the null values and the shape of the problem data
print(problem_df.shape)
print(problem_df.isna().sum())
print(problem_df.isna().mean())

In [None]:
#let's look at the distribution of the level type
problem_df.level_type.value_counts()

In [None]:
#let's look at the distribution of the level type
problem_df.level_type.value_counts(normalize = True)

In [None]:
#I will fill up the values based on the ratio of distribution
#Getting all the ratios
level_type_data = (problem_df["level_type"].value_counts()/problem_df["level_type"].count())

#imputing missing values
problem_df["level_type_new"]= problem_df["level_type"].fillna(pd.Series(np.random.choice(level_type_data.index,p=level_type_data.values, size=len(problem_df))))

In [None]:
#Now I will have to label the level_type_new field
level_type_dict = {'A':0, 'B':1, 'C':2, 'D':3,'E':4,'F':5,'G':6,'H':7,'I':8,'J':9,'K':10,'L':11,'M':12,'N':13}
problem_df["level_type_new"] = problem_df["level_type_new"].apply(lambda x: level_type_dict[x])

In [None]:
print(problem_df["points"].mean())
print(problem_df["points"].mode())
print(problem_df["points"].median())

In [None]:
#imputing missing points values
problem_df["points"]= problem_df["points"].fillna(problem_df["points"].mean())
#alternative that is not used here: fill the values based on the ratio of distribution
#Getting all the ratios
#points_data = (problem_df["points"].value_counts()/problem_df["points"].count())

#imputing missing values for points
#problem_df["points"]= problem_df["points"].fillna(pd.Series(np.random.choice(points_data.index,p=points_data.values, size=len(problem_df))))

In [None]:
#I will remove level_type as there is a new field for that. tags should be removed as they have more than 50% null values
problem_df.drop(columns = ["level_type","tags"],axis = 1, inplace = True)

In [None]:
problem_df.head()

In [None]:
train_submussion_df.head()

In [None]:
#let's look at the distribution of the attempts range
sns.distplot(train_submussion_df["attempts_range"])

In [None]:
train_submussion_df["attempts_range"].value_counts(normalize = True)

In [None]:
#merge train submission and user data
train_df = pd.merge(train_submussion_df,user_df,how = 'left',on = "user_id")
test_df = pd.merge(test_submussion_df,user_df,how = 'left',on = "user_id")

In [None]:
#merge train data and problem data
train_df = pd.merge(train_df,problem_df,how = 'left',on = "problem_id")
test_df = pd.merge(test_df,problem_df,how = 'left',on = "problem_id")

In [None]:
#create ID field for train data, ID already there for test data
train_df["ID"] = train_df["user_id"] + train_df["problem_id"]

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
#user_id count - number of times user is appearing
train_df['user_id_count'] = train_df.groupby('user_id')['user_id'].transform('count')
#for the test data, look the value up by user_id instead of assigning by row position
test_df['user_id_count'] = test_df['user_id'].map(train_df.groupby('user_id')['user_id'].count())

In [None]:
#problem_id count - number of times problem is appearing
train_df['problem_id_count'] = train_df.groupby('problem_id')['problem_id'].transform('count')
test_df['problem_id_count'] = test_df['problem_id'].map(train_df.groupby('problem_id')['problem_id'].count())

In [None]:
#user id min attempts
train_df['user_id_min_attempts'] = train_df.groupby('user_id')['attempts_range'].transform('min')
test_df['user_id_min_attempts'] = test_df['user_id'].map(train_df.groupby('user_id')['attempts_range'].min())

In [None]:
#user id max attempts
train_df['user_id_max_attempts'] = train_df.groupby('user_id')['attempts_range'].transform('max')
test_df['user_id_max_attempts'] = test_df['user_id'].map(train_df.groupby('user_id')['attempts_range'].max())

In [None]:
#user id mean attempts
train_df['user_id_mean_attempts'] = train_df.groupby('user_id')['attempts_range'].transform('mean')
test_df['user_id_mean_attempts'] = test_df['user_id'].map(train_df.groupby('user_id')['attempts_range'].mean())

In [None]:
#problem id min attempts
train_df['problem_id_min_attempts'] = train_df.groupby('problem_id')['attempts_range'].transform('min')
test_df['problem_id_min_attempts'] = test_df['problem_id'].map(train_df.groupby('problem_id')['attempts_range'].min())

In [None]:
#problem id max attempts
train_df['problem_id_max_attempts'] = train_df.groupby('problem_id')['attempts_range'].transform('max')
test_df['problem_id_max_attempts'] = test_df['problem_id'].map(train_df.groupby('problem_id')['attempts_range'].max())

In [None]:
#problem id mean attempts
train_df['problem_id_mean_attempts'] = train_df.groupby('problem_id')['attempts_range'].transform('mean')
test_df['problem_id_mean_attempts'] = test_df['problem_id'].map(train_df.groupby('problem_id')['attempts_range'].mean())

In [None]:
#user id min level
train_df['user_id_min_level'] = train_df.groupby('user_id')['level_type_new'].transform('min')
test_df['user_id_min_level'] = test_df['user_id'].map(train_df.groupby('user_id')['level_type_new'].min())

In [None]:
#user id max level
train_df['user_id_max_level'] = train_df.groupby('user_id')['level_type_new'].transform('max')
test_df['user_id_max_level'] = test_df['user_id'].map(train_df.groupby('user_id')['level_type_new'].max())

In [None]:
#user id mean level
train_df['user_id_mean_level'] = train_df.groupby('user_id')['level_type_new'].transform('mean')
test_df['user_id_mean_level'] = test_df['user_id'].map(train_df.groupby('user_id')['level_type_new'].mean())

In [None]:
train_df['country_percent'] = train_df.groupby('country_new')['country_new'].transform('count')/len(train_df)
#share of training rows per country, looked up by each test row's country
test_df['country_percent'] = test_df['country_new'].map(train_df['country_new'].value_counts(normalize=True))
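
When per-user or per-problem statistics from the training data are carried over to the test data, they need to be matched on the key (user_id or problem_id) rather than on row position, because the two frames have different rows in a different order. A minimal sketch on toy frames (the values and the names `train`, `test`, `user_stats` and `test_features` are hypothetical):

```python
import pandas as pd

# Toy frames standing in for train_df and test_df (hypothetical values)
train = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "attempts_range": [1, 2, 1, 3, 1, 2],
})
test = pd.DataFrame({"user_id": ["u3", "u1", "u4"]})

# One row per user, with statistics computed on the training data only
user_stats = train.groupby("user_id")["attempts_range"].agg(
    user_id_count="count",
    user_id_mean_attempts="mean",
).reset_index()

# A left merge matches on user_id, so the row order of test does not matter
test_features = test.merge(user_stats, on="user_id", how="left")

# Users never seen in training (u4 here) get NaN; filling the count with 0 is one option
test_features["user_id_count"] = test_features["user_id_count"].fillna(0).astype(int)
print(test_features)
```

How to treat users or problems that appear only in the test data (here the mean stays NaN) is a modelling choice; tree-based models such as XGBoost can handle the missing values directly.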

In [None]:
print(train_df.columns.shape)
print(test_df.columns.shape)

In [None]:
#define X
X = train_df.drop(columns=['user_id','problem_id','ID','attempts_range'],axis=1)

In [None]:
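#define y; encode the classes as 0..k-1 because recent XGBoost versions reject other label sets
y = LabelEncoder().fit_transform(train_df["attempts_range"])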

In [None]:
#split training data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [None]:
#xgb baseline model
xgbC = XGBClassifier(n_estimators= 300)

In [None]:
xgbC.fit(X_train,y_train)

In [None]:
y_test_pred = xgbC.predict(X_test)

In [None]:
accuracy_score(y_test, y_test_pred)

In [None]:
f1_score(y_test, y_test_pred, average='weighted')
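
Accuracy and weighted F1 are easier to judge against a trivial baseline, especially when one class dominates the target. The sketch below is an illustration on synthetic labels, not part of the original analysis; it uses scikit-learn's `DummyClassifier` to score a model that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced labels (hypothetical, not the competition data): 70% class 1
y_true = np.array([1] * 70 + [2] * 20 + [3] * 10)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 silences warnings for classes that are never predicted
baseline_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
print(baseline_accuracy, baseline_f1)
```

Fitting the same baseline on X_train and y_train would give reference scores that the XGBoost results can be compared against.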

In [None]:
#use random forest classifier
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth=5, min_samples_leaf=100)

In [None]:
#fit the model
RF.fit(X_train,y_train)

In [None]:
y_test_pred = RF.predict(X_test)

In [None]:
accuracy_score(y_test, y_test_pred)

In [None]:
f1_score(y_test, y_test_pred, average='weighted')