### Dataset Description:

The dataset provides three tables.<br>
***Log_Problem.csv*** records 16,217,311 problem attempt logs from 72,758 students over one year, from 2018/08/01 to 2019/07/31.<br>
An exercise is the basic unit of learning: it relates to a certain concept and consists of several problems.<br>
***Info_Content.csv*** describes the metadata of the exercises.<br>
***Info_UserData.csv*** describes the metadata of the selected students in Junyi Academy.<br>

### About this notebook:

This notebook introduces you to the Junyi Academy Dataset.<br>
We hope that these tasks can help you better understand the dataset and enable you to discover more interesting findings.<br>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 6)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

print('The following files are in the dataset:')
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df_InfoUser = pd.read_csv('/kaggle/input/learning-activity-public-dataset-by-junyi-academy/Info_UserData.csv')
df_LogProblem = pd.read_csv('/kaggle/input/learning-activity-public-dataset-by-junyi-academy/Log_Problem.csv')
df_InfoContent = pd.read_csv('/kaggle/input/learning-activity-public-dataset-by-junyi-academy/Info_Content.csv')

In [None]:
df_InfoUser.head(5)

In [None]:
df_LogProblem.head(5)

In [None]:
df_InfoContent.head(5)

# EDA : Info_UserData

### Student Count for each gender

In [None]:
print('Total number of users:', len(df_InfoUser))

In [None]:
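# Fill missing values only in the gender column so that other columns keep their dtype
df_InfoUser['gender'] = df_InfoUser['gender'].fillna('-1')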
count_each_gender = df_InfoUser['gender'].value_counts()
count_each_gender

In [None]:
plt.title('Number of users per gender')
plt.bar(count_each_gender.index, count_each_gender.values)
plt.show()

The majority of students have not set their gender on the platform.

### Distribution of energy points for students in elementary school

In [None]:
# Select students from grade 1 to grade 6
df_elem = df_InfoUser[(df_InfoUser['user_grade'] > 0) & (df_InfoUser['user_grade'] < 7)]
df_elem.describe()

We have 47867 students from grade 1 to grade 6, and the student with the most energy points has 4047528 of them.

In [None]:
plt.plot(df_elem['points'].sort_values().reset_index(drop=True))

plt.title('Distribution of energy points for students in elementary school', fontsize=16)
plt.xlim((1, 50000))
plt.xlabel('Student sorted by energy point count', fontsize=10)
plt.ylim((0, 4100000))
plt.ylabel('Energy points', fontsize=10)
plt.grid()

plt.show()

The plot shows large differences among students. <br>
Some are power users of the platform, while others are much less active.

## What is [energy point](https://help.junyiacademy.org/home/badge_learning/)?

Energy points are like experience points in games. <br>
Similar to game-based learning, users will try to learn to earn more energy points. <br>
Users earn energy points on Junyi Academy by completing exercises, watching videos, and receiving badges. <br>

The rules for energy points are as follows:
1. A user earns 750 * (effective watching time / video length) energy points after watching a video. (The effective watching time for a 10-minute video played at 2X speed is only 5 minutes.)
2. A user earns a base of 75 energy points after completing an exercise at <font color='red'>level</font> 0.
3. This can increase to at most 225 for fast answers or repeated correct attempts.
4. The points earned decrease to as few as 5 as the user's <font color='red'>level</font> in that exercise increases, which encourages the user to practice other exercises.
5. Users also earn points from badges, which encourage students to keep learning and to complete certain targets.
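
Rule 1 can be sketched as a small helper. This is only an illustration of the formula above, not Junyi Academy's actual implementation: the function name, its parameters, and the cap at full points are assumptions made for this example.

```python
def video_energy_points(video_length_sec, watched_video_sec,
                        playback_speed=1.0, full_points=750):
    """Sketch of rule 1: points = 750 * (effective watching time / video length)."""
    # Assumption: effective watching time is the video content covered divided by
    # the playback speed, matching the example of a 10-minute video at 2X speed
    # counting as 5 minutes.
    effective_sec = watched_video_sec / playback_speed
    # Assumption: the ratio is capped at 1 so rewatching cannot exceed full points.
    ratio = min(effective_sec / video_length_sec, 1.0)
    return full_points * ratio


print(video_energy_points(600, 600))       # whole 10-minute video at 1X
print(video_energy_points(600, 600, 2.0))  # whole 10-minute video at 2X
```

Under these assumptions, the full video at 1X gives 750 points and the same video at 2X gives 375.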

![能量點數說明:習題](https://help.junyiacademy.org/wp-content/uploads/2018/04/learn_4.png)
![能量點數說明:影片](https://help.junyiacademy.org/wp-content/uploads/2018/04/learn_5.png)

## What is <font color='red'>level</font> ([Proficiency mechanism](https://help.junyiacademy.org/home/badge_learning/))?

"*Proficiency mechanism*" allows students to convert short-term memory into long-term memory through appropriate and repeated review. <br>
It also helps teachers and parents to use the least amount of time to confirm whether the children's learning is proficient. <br>

There are five possible levels: every user starts at level 0 and can progress to level 4, which we consider Proficient for that exercise. <br>
The rules for each exercise are as follows:

1. Every user starts at level 0.
<br>
<br>
2. To reach level 1, the user has to answer 5 of the most recent 6 problems in the exercise correctly.
<br>
<br>
3. To reach the next level, the user has to wait some hours (longer at higher levels) and then answer 2 problems from the exercise. For example, suppose the user is at level 2 and gets a chance to answer 2 problems:<br>
    *  If <font color='red'>both are correct</font>, the user is <font color='red'>upgraded</font> to level 3.<br>
    *  If <font color='red'>both are incorrect</font>, the user is <font color='red'>downgraded</font> to level 1.<br>
    *  If one is correct and the other is incorrect, the level is <font color='red'>unchanged</font> and the user is prompted to try the challenge again.
<br>
<br>
4. The procedure to upgrade or downgrade is the same for the other levels, except that users are not downgraded from level 1 or after reaching level 4 `Proficient`.
<br>
<br>
5. After reaching <font color='blue'>level 1</font>, the user must wait <font color='blue'>6 hours</font> before attempting to level up to level 2.
<br>
<br>
6. After reaching <font color='blue'>level 2</font>, the user must wait <font color='blue'>16 hours</font> before attempting to level up to level 3.<br>
<br>
7. After reaching <font color='blue'>level 3</font>, the user must wait <font color='blue'>40 hours</font> before attempting to level up to level 4, which is the final level and is considered `Proficient` for that exercise.
<br>
<br>
![精熟機制說明](https://help.junyiacademy.org/wp-content/uploads/2018/04/learn_6.png)
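
The two-problem challenge in rules 3 to 7 can be read as a small state machine. The sketch below is one possible reading of those rules, not the platform's code: `WAIT_HOURS`, `MAX_LEVEL`, and `next_level` are hypothetical names introduced for this example.

```python
# Waiting time (hours) before the next level-up challenge, taken from rules 5 to 7.
WAIT_HOURS = {1: 6, 2: 16, 3: 40}
MAX_LEVEL = 4  # "Proficient"


def next_level(level, first_correct, second_correct):
    """Return the level after a two-problem challenge (rules 3 and 4)."""
    if level < 1 or level >= MAX_LEVEL:
        # Level 0 is handled by the 5-of-6 rule instead, and level 4 is final.
        return level
    n_correct = int(first_correct) + int(second_correct)
    if n_correct == 2:
        return level + 1      # both correct: upgrade
    if n_correct == 0 and level > 1:
        return level - 1      # both incorrect: downgrade, but never below level 1
    return level              # mixed result, or both incorrect at level 1: unchanged


print(next_level(2, True, True))    # upgrade from level 2
print(next_level(2, False, False))  # downgrade from level 2
print(next_level(2, True, False))   # unchanged
```

For the level-2 example in rule 3, this prints 3, 1, and 2.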

## Does the Proficiency mechanism affect users?

In [None]:
# Pick one exercise: the one attempted in the second row of the log
df_LogProblem_first_ucid = df_LogProblem[df_LogProblem['ucid'] == df_LogProblem['ucid'].iloc[1]]

In [None]:
# Calculate number of problems done by each user
df_pcnt = df_LogProblem_first_ucid.groupby('uuid').size().reset_index(name='problem_cnt')
df_pcnt = df_pcnt.sort_values(by=['problem_cnt'])
df_pcnt = df_pcnt.reset_index()

# Sort and plot
pcnt_distribution = df_pcnt['problem_cnt'].value_counts()
pcnt_distribution = pcnt_distribution.sort_index()

plt.bar(pcnt_distribution.index, pcnt_distribution.values)

plt.title('Distribution of problem attempts for students in this exercise', fontsize=16)
plt.xlabel("Number of problems done in this exercise", fontsize=10)
plt.ylabel("User count", fontsize=10)
plt.xlim((0, 25))

plt.show()

The plot shows a peak: most users do 5 or 6 problems in the exercise. <br>
This is mainly due to the proficiency mechanism, since users usually want to upgrade to level 1 and then move on to the next exercise.
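
To see why the peak sits at 5 or 6 attempts, the level-1 rule ("5 correct among the most recent 6 problems") can be simulated on a list of answers. This is a minimal sketch; `attempts_to_level_1` is a hypothetical helper, and it assumes that level 1 can already be reached after 5 attempts when all of them are correct.

```python
def attempts_to_level_1(answers, window=6, needed=5):
    """Return the 1-based attempt at which level 1 is reached, or None if never.

    `answers` is a sequence of booleans, one per attempt, in chronological order.
    """
    for i in range(len(answers)):
        # Look at the most recent `window` answers up to and including attempt i.
        recent = answers[max(0, i - window + 1): i + 1]
        if sum(recent) >= needed:
            return i + 1
    return None


print(attempts_to_level_1([True] * 5))                             # no mistakes
print(attempts_to_level_1([True, False, True, True, True, True]))  # one mistake
```

With no mistakes the user reaches level 1 on attempt 5, and with a single mistake on attempt 6, which matches the peak in the plot.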

### Top 5 cities of all users

In [None]:
print(df_InfoUser['user_city'].value_counts().head(5))
#TOP 5 : Taipei, New Taipei, Taichung, Taoyuan, Kaohsiung

The city names refer to cities in Taiwan.

![Taiwan Map](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Taiwan_ROC_political_divisions_labeled.svg/1200px-Taiwan_ROC_political_divisions_labeled.svg.png)

# EDA : Log_Problem

## What do "<font color='red'>problem_number</font>" and "<font color='red'>exercise_problem_repeat_session</font>" mean?

In [None]:
# Let's pick a user and an exercise and observe the learning process!
learning_path = df_LogProblem[(df_LogProblem['uuid'] == "AAITw26FaJFdy0VfpYXlUhEpJnYcjEucad09AXqKmUE=") &
                              (df_LogProblem['ucid'] == "FDFKlshYbN4rO93MtgimwfpEoKerSWp1RFhoSKWXHsY=")]

In [None]:
#sort by problem_number
learning_path = learning_path.sort_values(by=['problem_number']).reset_index()
learning_path = learning_path[['timestamp_TW', 'upid', 'problem_number', 'exercise_problem_repeat_session', 'is_correct']]
learning_path

Observing the table, we can conclude that:
1. "problem_number" is <font color='red'>the order of the problem</font> within the exercise for the user.
2. "exercise_problem_repeat_session" is <font color='red'>how many times the user has encountered this problem</font>.
3. The timestamp is rounded to the nearest 15-minute interval to preserve privacy in the dataset.
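
The rounding in point 3 matters when computing gaps between attempts. The sketch below uses made-up timestamps and pandas' `Series.dt.round` to show the effect. How the dataset was actually rounded is not documented here, so treat this as an assumption.

```python
import pandas as pd

# Hypothetical raw attempt times (not taken from the dataset).
raw = pd.Series(pd.to_datetime([
    "2018-09-01 10:07:00",
    "2018-09-01 10:08:00",
    "2018-09-01 10:22:00",
]))

# Round each timestamp to the nearest 15-minute mark.
rounded = raw.dt.round("15min")
print(rounded.dt.strftime("%H:%M").tolist())
```

Here two attempts made one minute apart (10:07 and 10:08) land in different intervals, while two attempts 14 minutes apart (10:08 and 10:22) receive the same timestamp. Time differences between consecutive rows are therefore only accurate to about 15 minutes.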

### How long does it take a user to finish a problem?

In [None]:
uuidgb = df_LogProblem.groupby('uuid')

In [None]:
problem_cnt = uuidgb['uuid'].count()
total_time = uuidgb['total_sec_taken'].sum()
mean_time_taken = total_time / problem_cnt
print("The mean of mean_time_taken", mean_time_taken.mean())
print("The std of mean_time_taken", mean_time_taken.std())

In [None]:
plt.plot(np.sort(mean_time_taken))

plt.title('Mean time for a user to finish a problem',fontsize=16)
plt.xlabel('Student sorted by average time taken', fontsize=10)
plt.ylabel('time (sec)', fontsize=10)
plt.ylim((0, 200))

plt.grid()
plt.show()

In [None]:
# There are definitely outliers in the time recorded
mean_time_taken[mean_time_taken > 1000]

### Calculate the correct rate for each user

In [None]:
correct_count = uuidgb['is_correct'].sum()
correct_rate = correct_count / problem_cnt
print(f"mean : {correct_rate.mean()}\n std : {correct_rate.std()}\n min : {correct_rate.min()}\n max : {correct_rate.max()}")

In [None]:
plt.plot(np.sort(correct_rate))

plt.title('Distribution of correct rate', fontsize=16)
plt.xlabel('Student sorted by correct rate', fontsize=10)
plt.ylabel('correct rate', fontsize=10)

plt.grid()
plt.show()

### For each exercise, how many math problems were done by elementary students from 2018-09-01 to 2018-09-30?

In [None]:
# Let's look at the exercises for elementary school students
ucid_chosen = df_InfoContent[df_InfoContent['learning_stage']=='elementary']['ucid']

In [None]:
filter_time = (df_LogProblem['timestamp_TW'] >= "2018-09-01") & (df_LogProblem['timestamp_TW'] < "2018-10-01")
filter_ucid = (df_LogProblem['ucid'].isin(ucid_chosen))
df_LogProblem_elem = df_LogProblem[filter_ucid & filter_time].groupby(['ucid']).size().reset_index(name='counts')
df_LogProblem_elem

In [None]:
plt.plot(np.sort(df_LogProblem_elem['counts']))
plt.axhline(df_LogProblem_elem['counts'].mean(), color = 'r',linestyle = '--')

plt.title('Number of problem attempts for each exercise during 2018/09', fontsize=16)

plt.xlabel('Exercise', fontsize=10)
plt.xlim((1, 741))
plt.ylabel('Number of problem attempts', fontsize=10)
plt.ylim((0, 16000))

plt.grid()
plt.show()

Some exercises are much more popular than others. <br>
It would be interesting to look into seasonal changes in exercise usage as teachers guide students through the school year.
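
One possible starting point for the seasonality question is to count attempts per exercise and month. The sketch below runs on a tiny made-up table that stands in for `df_LogProblem`. It assumes that `timestamp_TW` is a string starting with `YYYY-MM`, which is consistent with the string comparisons used in the filter above.

```python
import pandas as pd

# Toy stand-in for df_LogProblem; the values are made up for illustration.
toy_log = pd.DataFrame({
    "timestamp_TW": ["2018-09-03 10:00:00 UTC", "2018-09-20 09:15:00 UTC",
                     "2018-10-02 14:30:00 UTC", "2018-10-05 08:00:00 UTC",
                     "2018-10-09 19:45:00 UTC"],
    "ucid": ["ex_a", "ex_a", "ex_a", "ex_b", "ex_b"],
})

# Assumption: the first 7 characters of the timestamp string give the month.
toy_log["month"] = toy_log["timestamp_TW"].str[:7]

# One row per exercise, one column per month, values are attempt counts.
monthly = toy_log.groupby(["ucid", "month"]).size().unstack(fill_value=0)
print(monthly)
```

Replacing `toy_log` with the real log gives a table in which each row is an exercise's usage over the school year.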

# EDA : Info_Content

## The relation between <font color='red'>levels</font>, <font color='red'>exercise</font> and <font color='red'>problem</font>.
![level relation](https://i.imgur.com/fi8rgqK.jpg)

### How many exercises are in each learning stage?

In [None]:
df_InfoContent['learning_stage'].value_counts()

### The distribution of difficulty of exercises.

In [None]:
diff_count = df_InfoContent['difficulty'].value_counts()

In [None]:
plt.bar(x=diff_count.index, height=diff_count)

plt.ylim((0, 900))
plt.title('Distribution of Difficulty',fontsize=16)
plt.xlabel('Difficulty', fontsize=10)
plt.ylabel('Exercise count', fontsize=10)

plt.show()

### The correct rate for each exercise difficulty

In [None]:
df_Problem_Content = df_LogProblem[['ucid', 'is_correct']].merge(df_InfoContent[['ucid', 'difficulty']], how='inner', on='ucid')

# We remove the content with difficulty unset for now
df_Problem_Content = df_Problem_Content[df_Problem_Content['difficulty'] != 'unset']

df_Problem_Content = df_Problem_Content.groupby(['difficulty', 'is_correct']).size().unstack(level=-1)
df_Problem_Content['correct_rate'] = df_Problem_Content[True] / (df_Problem_Content[True] + df_Problem_Content[False])

df_Problem_Content.sort_values(by=['correct_rate'], ascending=False)

Different exercises will also have different distributions of correct rate among users!

## What percentage of users continue to review the exercise over the next few weeks?

In [None]:
# Pick an exercise
df_userreturn = df_LogProblem[df_LogProblem['ucid'] == "CPI+5YCeEmhqdk6znJeii6jJUNl1QWGEvwCUJ6uLflg="][['timestamp_TW','uuid']].sort_values(by=['timestamp_TW'])

In [None]:
def GetSurvivalRate(df_week1, df_week2):
    # Percentage of users active in the first frame who are also active in the second
    user_w1 = set(df_week1['uuid'])
    user_w2 = set(df_week2['uuid'])
    if not user_w1:
        # Avoid dividing by zero when the reference week has no users
        return 0.0
    SurvivalRate = round(len(user_w1 & user_w2) / len(user_w1) * 100, 3)
    print(f"{SurvivalRate}% of the first-week users are still active in this week")
    return SurvivalRate

In [None]:
df_week0 = df_userreturn[(df_userreturn['timestamp_TW'] >= "2018-09-02") & (df_userreturn['timestamp_TW'] < "2018-09-09")]
df_week1 = df_userreturn[(df_userreturn['timestamp_TW'] >= "2018-09-09") & (df_userreturn['timestamp_TW'] < "2018-09-16")]
df_week2 = df_userreturn[(df_userreturn['timestamp_TW'] >= "2018-09-16") & (df_userreturn['timestamp_TW'] < "2018-09-23")]
df_week3 = df_userreturn[(df_userreturn['timestamp_TW'] >= "2018-09-23") & (df_userreturn['timestamp_TW'] < "2018-09-30")]
df_week4 = df_userreturn[(df_userreturn['timestamp_TW'] >= "2018-09-30") & (df_userreturn['timestamp_TW'] < "2018-10-07")]

SR_list = [GetSurvivalRate(df_week0, df_week0),
           GetSurvivalRate(df_week0, df_week1),
           GetSurvivalRate(df_week0, df_week2),
           GetSurvivalRate(df_week0, df_week3),
           GetSurvivalRate(df_week0, df_week4)]

In [None]:
plt.plot(SR_list)

plt.xticks(np.arange(0, 4.1, 1))
plt.yticks(np.arange(0, 101, 10))
plt.title('Survival Rate of Week 0 users in the next few weeks', fontsize=16)
plt.xlabel('Weeks after week 0', fontsize=10)
plt.ylabel('Survival Rate (%)', fontsize=10)

for x, y in zip(range(5), SR_list): 
    plt.text(x, y, str(round(y, 2)))

plt.grid()
plt.show()

# Your turn now!

There are lots of open-ended questions left to answer.

Please refer to the dataset description for some ideas to start with!