# Exploratory Data Analysis using Dask

Hi Kagglers, this is the first notebook I'm sharing on Kaggle. I've used Dask DataFrames to keep memory usage low. This notebook is just for EDA. Let me know if it helped you in any way, or if you have suggestions for better code, presentation, etc. Thanks!

In [None]:
import pandas as pd
import numpy as np

import dask
import dask.dataframe as dd

import matplotlib.pyplot as plt
import seaborn as sns

from itertools import chain

## Data loading
Load a sample of the data using pandas

In [None]:
path_append = '/kaggle/input/riiid-test-answer-prediction/'
train_data = pd.read_csv(path_append + 'train.csv', nrows=1000)

In [None]:
train_data.info()

In [None]:
train_data.head()
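
Since the goal here is to keep memory usage low, it can help to check how much the column dtypes matter. This is a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not read from `train.csv`), assuming the flag-like columns fit into `int8`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for two small-integer columns of train.csv
n_rows = 100_000
toy = pd.DataFrame({
    "content_type_id": np.zeros(n_rows, dtype="int64"),
    "answered_correctly": np.ones(n_rows, dtype="int64"),
})

# Memory used by the default 64-bit integers
bytes_before = toy.memory_usage(index=False).sum()

# Downcast to 8-bit integers (assumes the values fit in the int8 range)
compact = toy.astype({"content_type_id": "int8", "answered_correctly": "int8"})
bytes_after = compact.memory_usage(index=False).sum()

print(bytes_before, bytes_after)
```

The same kind of mapping can be passed through the `dtype=` argument of `read_csv`, so the narrower types are used from the start instead of being converted afterwards.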

## Data loading using dask

In [None]:
# Load data using dask

train_data_dd = dd.read_csv(path_append + "train.csv", low_memory=False) # Lazy evaluation - doesn't actually load until .compute() is called

Calling `dask.compute()` with multiple objects allows for shared computation steps, e.g., file loading, and reduces overall runtime

In [None]:
# Compute all three unique counts in a single dask.compute() call
num_content_ids, num_task_ids, num_content_types = dask.compute(
    train_data_dd['content_id'].nunique(),
    train_data_dd['task_container_id'].nunique(),
    train_data_dd['content_type_id'].nunique())
print("Unique content IDs: {}".format(num_content_ids))
print("Unique task container IDs: {}".format(num_task_ids))
print("Number of content types: {}".format(num_content_types))

In [None]:
# How many users in total?

print("Total number of users: {}".format(train_data_dd['user_id'].nunique().compute()))

In [None]:
# What proportion of all answers are correct?

print("Overall answer correctness rate: {:.2f}%".format(train_data_dd[train_data_dd['content_type_id']==0]['answered_correctly'].mean().compute() * 100))
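
The filter on `content_type_id==0` matters because lecture rows are not answers. A minimal sketch on made-up rows, assuming lectures carry `answered_correctly = -1` as described in the competition's data description:

```python
import pandas as pd

# Hypothetical rows: three questions and one lecture (content_type_id == 1)
toy = pd.DataFrame({
    "content_type_id": [0, 0, 0, 1],
    "answered_correctly": [1, 0, 1, -1],
})

# Averaging over every row lets the -1 placeholder drag the rate down
naive_rate = toy["answered_correctly"].mean()

# Averaging over questions only gives the actual correctness rate
questions_only = toy[toy["content_type_id"] == 0]
question_rate = questions_only["answered_correctly"].mean()

print(naive_rate, question_rate)
```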

## Analysis of questions

In [None]:
# Aggregating at the content (question) level
content_df = (train_data_dd.query("content_type_id==0")
              .groupby('content_id')
              .agg({'user_id': 'count',
                    'answered_correctly': 'mean'})
              .compute())

### Load questions data

In [None]:
questions_df = pd.read_csv(path_append + 'questions.csv')
questions_df.info()

In [None]:
questions_df.head()

In [None]:
# How many unique 'parts'?
print("# unique parts: {}".format(questions_df['part'].nunique()))

In [None]:
# Merge the questions dataframe with the aggregated content dataframe 
content_df = (content_df.reset_index().rename(columns={"content_id": "question_id"})
              .merge(questions_df, how='left', on=['question_id']))
# Round before casting: count * mean is a float and can land just below the true integer
content_df['num_answered_correctly'] = (content_df['user_id'] * content_df['answered_correctly']).round().astype(int)
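
The number of correct answers is reconstructed here as attempts × mean, which is a floating-point product. Casting it straight to `int` truncates, so a value that lands slightly below a whole number can come out one too low. A small sketch on made-up attempts (hypothetical data) that compares truncation, rounding, and a direct `sum`:

```python
import pandas as pd

# Hypothetical attempts: question 7 has 1 correct out of 49, question 8 has 2 out of 3
attempts = pd.DataFrame({
    "question_id": [7] * 49 + [8] * 3,
    "answered_correctly": [1] + [0] * 48 + [1, 1, 0],
})

per_q = attempts.groupby("question_id")["answered_correctly"].agg(["count", "mean", "sum"])

# Reconstruct the number of correct answers from count * mean
product = per_q["count"] * per_q["mean"]
truncated = product.astype(int)        # drops the fractional part
rounded = product.round().astype(int)  # snaps to the nearest integer

print(pd.DataFrame({"exact": per_q["sum"], "truncated": truncated, "rounded": rounded}))
```

Aggregating with `'sum'` directly avoids the round trip through floats altogether, at the cost of one more column in the aggregation.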

In [None]:
content_df.head()

### Are all questions attempted equally often?

In [None]:
plt.plot(content_df['user_id'].sort_values(ascending=False).values)
plt.yscale("log")
plt.title("Questions vs number of attempts")
plt.xlabel("Questions")
plt.ylabel("Number of attempts")
plt.show()

### Are some questions harder than others?

In [None]:
content_df[content_df['user_id']>100]['answered_correctly'].hist(bins=100)
plt.title("Distribution of answer correctness rate by question")
plt.show()

### Distribution of time elapsed on previous question

In [None]:
train_data_dd['prior_question_elapsed_time'].compute().hist(bins=100)
plt.title("Distribution of time elapsed on previous question")
plt.show()

### How well does time elapsed on previous question predict answer correctness?

In [None]:
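# Note: map_partitions runs pd.qcut on each partition separately, so the decile
# edges are partition-local rather than global; treat the plot as approximate.
prior_question_qcut = train_data_dd['prior_question_elapsed_time'].map_partitions(
    pd.qcut, 10, labels=False, meta=train_data_dd['prior_question_elapsed_time'])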
pqet_df = (train_data_dd.groupby(prior_question_qcut).agg({'answered_correctly': 'mean'}).compute()
           .reset_index().rename(columns={'prior_question_elapsed_time': 'prior_question_elapsed_time_decile'}))
sns.barplot(x='prior_question_elapsed_time_decile', y='answered_correctly', data=pqet_df, hue=None)
plt.show()
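
One caveat with binning through `map_partitions`: the function is applied to each partition on its own, so `pd.qcut` picks its bin edges per partition. The sketch below simulates two partitions in plain pandas (the values are made up and deliberately extreme) and compares that with cutting on quantile edges taken over the full column:

```python
import pandas as pd

# Two hypothetical "partitions" with very different value ranges
part_a = pd.Series(range(0, 100), dtype="float64")
part_b = pd.Series(range(1000, 1100), dtype="float64")

# Per-partition deciles: each partition gets its own 0..9 labels
local_a = pd.qcut(part_a, 10, labels=False)
local_b = pd.qcut(part_b, 10, labels=False)

# Global deciles: compute the edges once over all values, then cut with them
full = pd.concat([part_a, part_b], ignore_index=True)
edges = full.quantile([i / 10 for i in range(11)]).to_numpy()
global_bins = pd.cut(full, bins=edges, labels=False, include_lowest=True)

# The value 1000 is the smallest in its partition but sits in the upper half overall
print(local_b.iloc[0], global_bins.iloc[100])
```

With real data the partitions are usually far more alike than this, so the effect may be small, but computing the edges once and reusing them in every partition removes the ambiguity.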

### Prior question had explanation - does this predict answer correctness?

In [None]:
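# Restrict to questions so lecture rows (answered_correctly == -1) cannot skew the mean
(train_data_dd.query("content_type_id==0")
 .groupby('prior_question_had_explanation')
 .agg({'user_id': 'count', 'answered_correctly': 'mean'})
 .compute())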

### Does the section of the test (column: 'part') predict answer correctness?

In [None]:
part_agg = content_df.groupby('part', as_index=False).agg({'user_id': 'sum', 'num_answered_correctly': 'sum'})
part_agg['prop_correct'] = part_agg['num_answered_correctly'] / part_agg['user_id']
part_agg
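
Summing attempts and correct answers per part gives a pooled rate, which is not the same as averaging the per-question rates. A tiny sketch with made-up numbers (one popular easy question and one rarely attempted hard one) shows the gap:

```python
import pandas as pd

# Hypothetical questions in the same part; 'user_id' holds the attempt count, as in content_df
toy = pd.DataFrame({
    "part": [1, 1],
    "user_id": [100, 10],
    "num_answered_correctly": [90, 1],
})
toy["answered_correctly"] = toy["num_answered_correctly"] / toy["user_id"]

# Pooled rate: total correct answers over total attempts
pooled = toy.groupby("part").agg({"user_id": "sum", "num_answered_correctly": "sum"})
pooled_rate = (pooled["num_answered_correctly"] / pooled["user_id"]).loc[1]

# Unweighted rate: plain mean of the per-question rates
unweighted_rate = toy.groupby("part")["answered_correctly"].mean().loc[1]

print(pooled_rate, unweighted_rate)
```

The pooled version answers "how often is an attempt in this part correct?", while the unweighted one answers "how hard is a typical question in this part?".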

In [None]:
sns.barplot(x='part', y='prop_correct', data=part_agg, hue=None)
plt.show()

### Analysis of tags

In [None]:
# Convert the 'tags' string to a list
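# Assign the result back: inplace fillna on a selected column may not update content_df in newer pandas
content_df['tags'] = content_df['tags'].fillna('')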
content_df['tags_list'] = content_df['tags'].apply(lambda x: [int(t) for t in x.split()])

In [None]:
tags_df = content_df.apply(lambda x: [(t, x['user_id'], x['num_answered_correctly']) for t in x['tags_list']],axis=1).values
tags_df = chain.from_iterable(tags_df)
# The second field is the attempt count per question (user_id count), summed per tag below
tags_df = pd.DataFrame(tags_df, columns=['tag', 'num_attempts', 'num_answered_correctly'])
tags_df = tags_df.groupby('tag', as_index=False).sum()
tags_df['prop_correct'] = tags_df['num_answered_correctly'] / tags_df['num_attempts']
tags_df
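
As an alternative to building tuples and chaining them, pandas' `explode` can turn a list column into one row per tag. This is a sketch on a made-up frame (the question IDs, tags, and counts are hypothetical) that should give the same kind of per-tag table:

```python
import pandas as pd

# Hypothetical questions; 'user_id' holds the attempt count, as in content_df
toy = pd.DataFrame({
    "question_id": [0, 1, 2],
    "tags": ["51 131", "131", None],
    "user_id": [10, 4, 6],
    "num_answered_correctly": [7, 1, 3],
})

# Split the space-separated tag string into a list (missing tags become an empty list)
toy["tags_list"] = toy["tags"].fillna("").str.split()

# One row per (question, tag); questions without tags turn into NaN and are dropped
exploded = (toy.explode("tags_list")
            .dropna(subset=["tags_list"])
            .assign(tag=lambda d: d["tags_list"].astype(int)))

by_tag = exploded.groupby("tag", as_index=False)[["user_id", "num_answered_correctly"]].sum()
by_tag["prop_correct"] = by_tag["num_answered_correctly"] / by_tag["user_id"]
print(by_tag)
```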

In [None]:
tags_df['prop_correct'].hist(bins=20)
plt.title("Distribution of correctness rate by tag")
plt.show()

## Analysis of users

In [None]:
user_df = (train_data_dd.query("content_type_id==0")
           .groupby('user_id')
           .agg({'user_answer': 'count', 'answered_correctly': 'mean', 'timestamp': 'max'})
           .rename(columns={'user_answer': 'num_questions_answered', 'timestamp': 'total_time_spent'})).compute()

In [None]:
user_df['total_time_spent_mins'] = user_df['total_time_spent'] / 60000.0

In [None]:
user_df.head()

### Distribution of number of questions answered by each user

In [None]:
user_df[user_df['num_questions_answered'] < 2000]['num_questions_answered'].hist(bins=100)
plt.title("Distribution of # questions answered by users")
plt.show()

### Distribution of total time spent

In [None]:
user_df['total_time_spent_mins'].hist(bins=100)
plt.title("Distribution of total time spent by users (minutes)")
plt.show()

In [None]:
# Closer look at users whose timestamps exceed one million minutes (long-time users)

outlier_user_ids = train_data_dd[train_data_dd['timestamp'] > 1e6 * 60000][['user_id']].drop_duplicates()
outlier_user_df = outlier_user_ids.merge(train_data_dd, how='inner', on='user_id').compute()

In [None]:
outlier_user_df

In [None]:
sample_user = outlier_user_df[outlier_user_df['user_id']==np.random.choice(outlier_user_df['user_id'].unique())].copy()

In [None]:
sample_user['timestamp_mins'] = sample_user['timestamp'] / 60000
sample_user['timestamp_hrs'] = sample_user['timestamp_mins'] / 60

In [None]:
plt.plot(sample_user['timestamp_hrs'].values)
plt.show()

In [None]:
sample_user.head(50)

### Distribution of user ability - are some users correct more often than others?

In [None]:
user_df[user_df['num_questions_answered']>10]['answered_correctly'].hist(bins=50)
plt.title("Distribution of answer correctness by users")
plt.show()

## Analysis of lectures

In [None]:
lectures_df = pd.read_csv(path_append + 'lectures.csv')
lectures_df.info()

In [None]:
lectures_df.head()

In [None]:
lectures_df['type_of'].value_counts()

In [None]:
lectures_df['part'].value_counts()

To do: does viewing a lecture affect performance on the next set of questions? (Feeling too lazy to do it now...)

# That's all, folks!