# MOOC Academy

<img src="./images/mooc.jpg" width=600 height=600/>

# Table of Contents
* [MOOC Academy](#MOOC-Academy)
	* [1.1 Overview and Motivation](#1.1-Overview-and-Motivation)
	* [1.2 Related Works](#1.2-Related-Works)
	* [1.3 Data](#1.3-Data)
		* [1.3.1 Overview](#1.3.1-Overview)
		* [1.3.2 Dropout definition](#1.3.2-Dropout-definition)
	* [1.4 Questions](#1.4-Questions)
	* [1.5 Data Management](#1.5-Data-Management)
		* [1.5.1 Create course dates dataframe](#1.5.1-Create-course-dates-dataframe)
		* [1.5.2 Load task duration data](#1.5.2-Load-task-duration-data)
		* [1.5.3 Load video data](#1.5.3-Load-video-data)
		* [1.5.4 Load forum events data](#1.5.4-Load-forum-events-data)
		* [1.5.5 Load problems data](#1.5.5-Load-problems-data)
		* [1.5.6 Merge tables](#1.5.6-Merge-tables)
		* [1.5.7 Convert numeric data types](#1.5.7-Convert-numeric-data-types)
		* [1.5.8 Merge course dates with activity data](#1.5.8-Merge-course-dates-with-activity-data)
		* [1.5.9 Determine dropouts](#1.5.9-Determine-dropouts)
		* [1.5.10 Clean up 'user info' masterfile](#1.5.10-Clean-up-'user-info'-masterfile)
		* [1.5.11 Determine users to be excluded](#1.5.11-Determine-users-to-be-excluded)
		* [1.5.12 Merge user details with dropout status and tasks dataframe](#1.5.12-Merge-user-details-with-dropout-status-and-tasks-dataframe)
		* [1.5.13 Save files](#1.5.13-Save-files)

## 1.1 Overview and Motivation

**Massive open online courses (MOOCs)**, an amalgamation of technology and pedagogy, are changing the face of traditional classroom education by offering unprecedented avenues and resources for learning from the world’s best institutions and professors, free of cost, without any prior knowledge, and at one’s own pace. They offer the dream of free, quality education, accessible to anyone, anywhere, transcending cultural and demographic backgrounds and boundaries, learning styles and abilities. Rightly so, MOOCs have received wide publicity and many institutions have invested considerable effort in developing, promoting and delivering such courses. With the proliferation of online learning platforms such as Coursera, edX, and Udacity, thousands of participants from all over the world enroll in MOOCs.
 
However, there are still many questions about the effectiveness of MOOCs. In the absence of a traditional classroom setup and personal communication between teacher and student, it is important to understand the class dynamics from course logs and user interaction data. This helps the instructor design the course better and serve the needs of the students. One of the key challenges is the rate of attrition. Although many students now have the opportunity to learn from the best teachers and institutions around the world, the completion rate for most online courses is still very low. The goal of this project is to model student dropout as a function of interaction with various course components.

## 1.2 Related Works

The task of predicting attrition rates has been the topic of several studies which have primarily relied on click-stream data and forum discussion.
* Kloft et al. (2014b) observed that students’ forum activity in the first two weeks can reasonably predict the likelihood of users dropping out.
* Using a sentiment analysis approach, Wen et al. (2014c) observed a correlation between the sentiment students express in forum posts and their chance of dropping out of the course.

Even though a discussion forum provides a rich source of insight into a student's behavior and likelihood of dropping out, only a very small percentage of students (5-10%) actually participate in it (Ammeuypornsakul, 2014a). As an alternative, student interaction with the course material, such as viewing video lectures, participating in quizzes, attempting problem sets, and grades, could also provide information on student behavior.
* Several studies examine this interaction: (Guo and Reinecke, 2014) looks at the navigation behavior of various demographic groups; (Kizilcec et al., 2013) seeks to understand how students engage with the course; (Ramesh et al., 2014) attempts to understand student disengagement and learning patterns with the aim of minimizing dropout; and (Stephens-Martinez et al., 2014) seeks to model users' motivations by mining clickstream data (obtained from Ammeuypornsakul, 2014a).
* Studies such as (Kloft et al., 2014b) and (Ammeuypornsakul, 2014a) look at understanding dropout rate via clickstream data and using machine learning methods.


## 1.3 Data

### 1.3.1 Overview
As part of this project we wanted to predict the likelihood of a user dropping out using information from forum posts and clickstream information, such as interaction with the various course components and quiz attempts/submissions.

We are using the edX dataset obtained from MIT Institutional Research, the MIT office that oversees MITx/edX data. This dataset is richer and more granular than the publicly available edX data. Prior to accessing and using this data, every group member completed training approved by MIT's Committee on the Use of Humans as Experimental Subjects (COUHES) to certify that we are trained in handling HIPAA data properly.

The data were transferred to us as text files organized by the MOOC classes offered. The exact data we received are not publicly available, but can be obtained by any other researcher who follows MIT Institutional Research’s procedures. The final data size was about 150GB, covering 11 completed classes on edX. We were given log files of metadata and text files of SQL query output tracking various student activities related to the forum, problems and videos.

Due to the size of the data, and to focus our analysis, we picked [‘MITx: 8.MechCx Advanced Introductory Classical Mechanics’](https://courses.edx.org/courses/MITx/8.MechCx/1T2015/syllabus/) as a representative course. This course ran from January 8, 2015 to May 10, 2015, with about 13,000 unique users within that timeline. Based on the data and user activity, it seemed that the course remained open to students even past the last day, up until October 2015. The class consisted of 16 required units with associated video lectures and problem sets, as well as forum activity. The problem sets consisted of checkpoint problems, homework problems, labs, quizzes and finals. However, our dataset did not provide that level of granularity, and all the problem sets were aggregated together. Participants who earned a minimum of 60% of total course credit obtained a certificate of mastery, at no cost, issued by edX under the name of MITx.

We received the data through MIT Dropbox and transferred it to an Amazon EC2 server. This let us access and modify the data collaboratively, and use a large amount of computational power to load large files quickly. We also set up port forwarding so that we could run the IPython kernel on our EC2 instance, giving us access to the RAM and CPU power that an EC2 instance can provide while using the same flexible interface we have been using for this class.

### 1.3.2 Dropout definition
An early challenge in this project was to define the term ‘dropout’. Based on literature review, and evaluating the data at hand, we went through various iterations in trying to define the dropout metric. 
One potential option was to define a dropout as someone who did not meet their own pre-defined course objectives which could vary from person to person. Due to the unstandardized way the user objective data was collected, lack of post-course evaluation, and the ambiguity of quantifying ‘success’ using this metric we did not consider it. 

Another option was to define a dropout as someone who did not obtain a passing grade of 60% or a completion certificate. However, this did not account for the fact that MOOCs are inherently different from a regular classroom: a user who scored below the passing grade but stayed through the length of the course cannot be considered a ‘dropout’.

For the purposes of this analysis, we ended up defining a ‘dropout’ as **someone who signed up but did not complete the course**. This meant that a person who was not a dropout had to have been **active on the course website for at least 1 hour within the last month before the course officially ended (i.e. April 9 to May 10, 2015)**. This 1 hour threshold was introduced to filter out users who simply logged in without participating in the course. The threshold was chosen by looking at the distribution of the time per day spent by active users in the last month of the course.
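
To make the rule concrete, here is a minimal sketch of the labeling on invented data. The usernames, the `seconds_active` column and the `label_dropouts` helper are assumptions made for illustration only; the 31-day window and the one-hour threshold mirror the values used in section 1.5.9.

```python
import datetime
import pandas as pd

# Toy per-day activity log; usernames and times are invented for illustration.
activity = pd.DataFrame({
    'username': ['alice', 'alice', 'bob', 'carol'],
    'date': pd.to_datetime(['2015-02-01', '2015-04-20', '2015-04-25', '2015-01-15']),
    'seconds_active': [7200.0, 4000.0, 1200.0, 9000.0],
})

course_end = datetime.datetime(2015, 5, 10)
window_start = course_end - datetime.timedelta(days=31)  # start of the "last month"
threshold_seconds = 60 * 60                              # the 1 hour threshold

def label_dropouts(df, window_start, window_end, threshold):
    """Return a Series indexed by username: 1 = dropout, 0 = not a dropout."""
    in_window = (df['date'] >= window_start) & (df['date'] <= window_end)
    # Time outside the window counts as zero, so every user still gets a row.
    late_time = df['seconds_active'].where(in_window, 0.0)
    total = late_time.groupby(df['username']).sum()
    return (total < threshold).astype(int)

dropout = label_dropouts(activity, window_start, course_end, threshold_seconds)
```

In this toy log only `alice` spends more than an hour inside the window, so she is the only user labeled 0. `carol` was very active early on but is still a dropout under this definition.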

## 1.4 Questions
Some of our initial questions were:
* Why do MOOCs have such a large attrition rate?
* What factors affect dropout?
* Could we predict dropouts through student engagement and activity within the first two weeks of the course?

## 1.5 Data Management

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

import datetime
import dateutil

Our dataset came from various text files, so we spent a lot of time deciphering the columns within them and understanding how they related to one another. The text files we received also contained personal identification information for students and staff, such as usernames and email addresses. We wrote scripts to hash the usernames and user ids to anonymize them, and dropped the email column. After anonymizing the data, we cleaned it by filtering out staff members, rows that were missing key information, and extreme outliers that may have been automated accounts (e.g. one participant had daily activity close to 24 hours on more than one day). We then merged the various tables and used the combined dataset to determine the dropouts. This Data Management section shows how we started from separate files, identified the relationships among them, and created a final master file merging many individual files. This file is used later for the EDA as well as the modeling.
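
The anonymization scripts themselves are not part of this notebook. As a rough illustration of the idea only, the sketch below hashes the identifying fields with a salted SHA-256 and drops the email field. The salt value, the truncation to 16 characters, and the `anonymize` and `anonymize_row` helpers are all assumptions and may differ from what was actually used.

```python
import hashlib

# Hypothetical secret salt; in practice it would be kept out of version control.
SALT = b'replace-with-a-secret-value'

def anonymize(identifier, salt=SALT):
    """Map a username or user id to a stable, hard-to-reverse hex token."""
    digest = hashlib.sha256(salt + str(identifier).encode('utf-8'))
    return digest.hexdigest()[:16]

def anonymize_row(row):
    """Hash the identifying fields of one record and drop the email field."""
    cleaned = {k: v for k, v in row.items() if k != 'email'}
    cleaned['username'] = anonymize(row['username'])
    cleaned['user_id'] = anonymize(row['user_id'])
    return cleaned

# Invented example record
record = {'user_id': 42, 'username': 'jdoe', 'email': 'jdoe@example.com', 'is_staff': 0}
safe = anonymize_row(record)
```

Because the same input always maps to the same token, the hashed `username` can still serve as a join key across the separate files.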

## 1.5.1 Create course dates dataframe

Many text files had a timestamp column, but the timestamps were event-based. To merge the files by date later on, we created this course dates dataframe, with one row per calendar day of the course.

In [2]:
course_start = datetime.datetime(2015, 1, 8)
course_end = datetime.datetime(2015, 5, 10)
datelist = pd.date_range(start=course_start, end=course_end, normalize=True).tolist()
date_tuples = []
for i, class_date in enumerate(datelist):
    day = i + 1 # Adding 1 to ensure the first day is 1
    week = ((class_date - course_start).days // 7) + 1 # Floor division keeps the week an integer; adding 1 ensures the first week is 1
    date_tuples.append((day, week, class_date))
date_df = pd.DataFrame(date_tuples)
date_df.columns = ['course_day', 'course_week', 'date']
del date_tuples
date_df.head(8)

| | course_day | course_week | date |
|---|---|---|---|
| 0 | 1 | 1 | 2015-01-08 |
| 1 | 2 | 1 | 2015-01-09 |
| 2 | 3 | 1 | 2015-01-10 |
| 3 | 4 | 1 | 2015-01-11 |
| 4 | 5 | 1 | 2015-01-12 |
| 5 | 6 | 1 | 2015-01-13 |
| 6 | 7 | 1 | 2015-01-14 |
| 7 | 8 | 2 | 2015-01-15 |
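
The same table can also be built without an explicit loop. The following is a sketch of an equivalent vectorized construction, not the code that was run for this project; `date_df_alt` and `days_elapsed` are hypothetical names chosen for this example.

```python
import pandas as pd

course_start = pd.Timestamp('2015-01-08')
course_end = pd.Timestamp('2015-05-10')

dates = pd.date_range(start=course_start, end=course_end, freq='D')
days_elapsed = (dates - course_start).days   # 0 on the first day of the course

date_df_alt = pd.DataFrame({
    'course_day': days_elapsed + 1,          # first day is day 1
    'course_week': days_elapsed // 7 + 1,    # floor division keeps weeks as integers
    'date': dates,
})
```

The floor division matters here: with true division the week numbers would come out as fractions, which would make `course_week` unusable as a grouping key.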


## 1.5.2 Load task duration data

In [3]:
df_task = pd.read_csv('../data/MITx__8_MechCx__1T2015_latest_time_on_task_data.txt', sep='\t', parse_dates=['date'], 
                  date_parser=dateutil.parser.parse, usecols=['date', 'username', 'total_time_30', 'total_video_time_30', 
                                                              'total_problem_time_30', 'total_forum_time_30', 
                                                              'total_text_time_30'])
print 'No. of rows =', len(df_task)
print 'No. of unique users =', len(df_task.username.unique())
#df_task.head()

No. of rows = 66537
No. of unique users = 8246


## 1.5.3 Load video data

In [4]:
# File containing one row for every video a user views each day.
df_video = pd.read_csv('../data/MITx__8_MechCx__1T2015_latest_video_stats_day_data.txt', sep='\t', 
                  parse_dates=['date'], date_parser=dateutil.parser.parse, usecols=['date', 'username', 'video_id'])
# Sum up the no. of videos per user per day
df_video = df_video.groupby(['date', 'username']).count()
df_video.rename(columns={'video_id': 'nvideo'}, inplace=True)
df_video.reset_index(inplace=True)
#df_video.head()

## 1.5.4 Load forum events data

### load all forum events data

In [5]:
# File containing one row for every forum action a user performs on each day.
df_forum_event = pd.read_csv('../data/MITx__8_MechCx__1T2015_latest_forum_events_data.txt', sep='\t', 
                             parse_dates=['time'], date_parser=dateutil.parser.parse, 
                             usecols=['time', 'username', 'forum_action'])
df_forum_event['date'] = df_forum_event.time.apply(lambda x : pd.to_datetime(x).date())
df_forum_event['date'] = df_forum_event.date.astype('datetime64[D]')
df_forum_event.drop(['time'], axis=1, inplace=True)
#df_forum_event.head()

In [6]:
# Sum of the forum activities per user per day
df_forum = df_forum_event.groupby(['date', 'username']).count()
df_forum.rename(columns={'forum_action': 'nforum_activity'}, inplace=True)
df_forum.reset_index(inplace=True)
#df_forum.head()

### load forum events subset

The full events data was filtered to obtain a subset of relevant events (e.g. comments, upvotes, downvotes) that we hope to analyze in detail later.

In [7]:
# File containing one row for every forum action a user performs on each day.
df_forum_event_subset = pd.read_csv('../data/MITx__8_MechCx__1T2015_latest_forum_events_data_subset.txt', sep='\t', 
                             parse_dates=['time'], date_parser=dateutil.parser.parse, 
                             usecols=['time', 'username', 'forum_action'])
df_forum_event_subset['date'] = df_forum_event_subset.time.apply(lambda x : pd.to_datetime(x).date())
df_forum_event_subset['date'] = df_forum_event_subset.date.astype('datetime64[D]')
df_forum_event_subset.drop(['time'], axis=1, inplace=True)
#df_forum_event_subset.head()

In [8]:
df_forum_event_subset['count'] = 1
print 'No. of rows before aggregating =', len(df_forum_event_subset)
df_forum_event_subset = df_forum_event_subset.groupby(['username', 'forum_action', 'date']).sum()
df_forum_event_subset.reset_index(inplace=True)
print 'No. of rows after aggregating =', len(df_forum_event_subset)
#df_forum_event_subset.head()

No. of rows before aggregating = 133018
No. of rows after aggregating = 29585


## 1.5.5 Load problems data

In [9]:
# File containing a row for each problem a user attempts daily
df_problems_events = pd.read_csv('../data/MITx__8_MechCx__1T2015_latest_problem_check_data.txt', sep='\t', 
                             parse_dates=['time'], date_parser=dateutil.parser.parse, 
                             usecols=['time', 'username', 'module_id', 'attempts', 'success', 'grade'])
df_problems_events['date'] = df_problems_events.time.apply(lambda x : x.date())
df_problems_events['date'] = df_problems_events.date.astype('datetime64[D]')
df_problems_events['correct'] = 0
mask = (df_problems_events.success == 'correct')
df_problems_events.loc[mask, 'correct'] = 1
df_problems_events.drop(['time', 'success', 'grade'], axis=1, inplace=True)
#df_problems_events.head()

In [10]:
# Aggregate the users' daily problem attempts in order to end up with one row per user per day
df_problems = df_problems_events.groupby(['date', 'username']).agg(['count', 'sum'])
df_problems.reset_index(inplace=True)
#df_problems.head()

In [11]:
df_problems.columns = df_problems.columns.get_level_values(0)
df_problems.columns = ['date', 'username', 'nmodule_id', 'module_id', 'nproblem_attempts_rows', 'nproblem_attempts', 
                       'nproblems', 'nproblems_correct']
df_problems.drop(['nmodule_id', 'module_id', 'nproblem_attempts_rows'], axis=1, inplace=True)
#df_problems.head()

In [12]:
print 'passed validation test:', len(df_problems[df_problems.nproblem_attempts < df_problems.nproblems]) == 0
print 'passed validation test:', len(df_problems[df_problems.nproblem_attempts < df_problems.nproblems_correct]) == 0
print 'passed validation test:', len(df_problems[df_problems.nproblems < df_problems.nproblems_correct]) == 0

passed validation test: True
passed validation test: True
passed validation test: True


## 1.5.6 Merge tables

In [13]:
df_all = df_task.merge(df_video, how='outer', on=['date', 'username'])\
            .merge(df_forum, how='outer', on=['date', 'username'])\
            .merge(df_problems, how='outer', on=['date', 'username'])
#df_all.head()
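
An outer join keeps every `(date, username)` pair that appears in any of the tables, so pairs that are missing from the task table end up with `NaN` in the task-time columns. The toy example below, with invented rows, shows how the `indicator` option of `merge` can be used to see where each row came from.

```python
import pandas as pd

# Invented rows standing in for the task-time and video tables
task = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-08', '2015-01-09']),
    'username': ['u1', 'u1'],
    'total_time_30': [600.0, 300.0],
})
video = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-09', '2015-01-10']),
    'username': ['u1', 'u2'],
    'nvideo': [3, 1],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = task.merge(video, how='outer', on=['date', 'username'], indicator=True)

# Rows that exist only in the video table have no task time at all
only_video = merged[merged['_merge'] == 'right_only']
```

Rows like the one in `only_video` are the ones removed in section 1.5.7, where rows with a missing `total_time_30` are filtered out.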

## 1.5.7 Convert numeric data types

In [14]:
df_all.dtypes
df_all[df_all.columns] = df_all[df_all.columns].replace('null', np.NaN)

numeric_fields = ['total_time_30', 'total_video_time_30', 'total_problem_time_30', 
                  'total_forum_time_30', 'total_text_time_30', 'nvideo', 
                  'nforum_activity', 'nproblem_attempts', 'nproblems', 'nproblems_correct']
df_all[numeric_fields] = df_all[numeric_fields].astype(float)
print 'No. of rows b4 filtering =', len(df_all)
df_all = df_all[~np.isnan(df_all.total_time_30)]
print 'No. of rows after filtering =', len(df_all)
#df_all.head()

No. of rows b4 filtering = 68459
No. of rows after filtering = 62506
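
An alternative way to handle the `'null'` strings, shown here only as a sketch on invented data, is `pd.to_numeric` with `errors='coerce'`, which converts any unparseable value to `NaN` in a single step. The `raw` and `clean` frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows mixing numeric strings, numbers, NaN and the 'null' marker
raw = pd.DataFrame({
    'username': ['u1', 'u2', 'u3'],
    'total_time_30': ['120.5', 'null', '30'],
    'nvideo': [2, np.nan, 'null'],
})

numeric_fields = ['total_time_30', 'nvideo']
clean = raw.copy()
for col in numeric_fields:
    # errors='coerce' turns anything unparseable, including 'null', into NaN
    clean[col] = pd.to_numeric(clean[col], errors='coerce')

# Keep only rows that have a task time, as in the cell above
clean = clean[clean['total_time_30'].notna()]
```

The trade-off is that coercion also hides unexpected garbage values, whereas the explicit `replace('null', ...)` followed by `astype(float)` raises an error on anything that is neither numeric nor `'null'`.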




## 1.5.8 Merge course dates with activity data

In [15]:
# Inner join limits the activities to those that happened during the course
df_merged = pd.merge(left=date_df, right=df_all, left_on='date', right_on='date', how='inner');
print '# of rows b4 merge =', len(df_all)
print '# of unique users b4 merge =', len(df_all.username.unique())
print '# of rows after merge =', len(df_merged)
print '# of unique users after merge =', len(df_merged.username.unique())

# of rows b4 merge = 62506
# of unique users b4 merge = 7831
# of rows after merge = 55530
# of unique users after merge = 7295


## 1.5.9 Determine dropouts
A user counts as a non-dropout only if they were 'active' for a total of at least one hour within the last month of the class. Every other user is labeled a dropout.

In [16]:
# .copy() gives an independent frame, so the assignment in the next cell does not raise SettingWithCopyWarning
df_tasktime = df_task[['date', 'username', 'total_time_30']].copy()
#df_tasktime.head()

In [17]:
df_tasktime['total_time_30'] = df_tasktime['total_time_30'].replace('null', 0)
#df_tasktime.head()


In [18]:
print 'Number of rows in df_tasktime b4 filtering =', df_tasktime.shape[0]
#Filter for only those users who interacted during the course period
df_tasktime = df_tasktime[(df_tasktime.date >= course_start) & (df_tasktime.date <= course_end)]
print 'Number of rows in df_tasktime after filtering =', df_tasktime.shape[0]

Number of rows in df_tasktime b4 filtering = 66537
Number of rows in df_tasktime after filtering = 58783


In [19]:
df_task_last_month = df_tasktime.copy()
df_task_last_month.rename(columns={'total_time_30':'total_time_seconds'}, inplace=True)
df_task_last_month['total_time_seconds'] = df_task_last_month['total_time_seconds'].astype(float)

# Set zero time for all times earlier than the last month of the class
dropout_start_date = course_end - datetime.timedelta(days = 31)
print 'Dropout start date =', dropout_start_date
df_task_last_month.loc[(df_task_last_month.date < dropout_start_date), 'total_time_seconds'] = 0

df_task_last_month = df_task_last_month.groupby(['username']).sum()
df_task_last_month.reset_index(inplace=True)
df_task_last_month['dropout'] = 1
df_task_last_month.loc[df_task_last_month.total_time_seconds >= 60*60, 'dropout'] = 0
print '% of dropouts =', 100.0 * len(df_task_last_month[df_task_last_month.dropout == 1]) / len(df_task_last_month)
print '# of rows =', len(df_task_last_month)
print '# of users =', len(df_task_last_month.username.unique())

#df_task_last_month.head()

Dropout start date = 2015-04-09 00:00:00
% of dropouts = 90.9727836163
# of rows = 7422
# of users = 7422


## 1.5.10 Clean up 'user info' masterfile

In [20]:
df_user_combo=pd.read_csv("../data/MITx__8_MechCx__1T2015_latest_user_info_combo_data_unique_usernames.txt",sep='\t', 
                     parse_dates=["enrollment_created"], date_parser=dateutil.parser.parse, 
                    usecols=["user_id","username","is_staff", "enrollment_created", "enrollment_is_active"])
print df_user_combo.shape
print 'unique user_id=', len(df_user_combo.user_id.unique())
print 'unique username=', len(df_user_combo.username.unique())
#df_user_combo.head()

(13207, 5)
unique user_id= 13207
unique username= 13207


In [21]:
# Replace 'null' strings with NaN, then cast the flags to int (the cast fails if any NaN remains)
df_user_combo[["is_staff","enrollment_is_active"]] = \
    df_user_combo[["is_staff","enrollment_is_active"]].replace('null', np.NaN).astype(int)
# print 'null is_staff=', len(df_combo[np.isnan(df_combo.is_staff)])
# print 'null enrollment_is_active=', len(df_combo[np.isnan(df_combo.enrollment_is_active)])
df_user_combo.dtypes

user_id                         object
username                        object
is_staff                         int64
enrollment_created      datetime64[ns]
enrollment_is_active             int64
dtype: object

## 1.5.11 Determine users to be excluded

### staff members

In [22]:
invalid_usernames = set()

In [23]:
staff_usernames = df_user_combo[df_user_combo.is_staff==1]['username'].unique()
print 'No. of unique staff members=', len(staff_usernames)
invalid_usernames |= set(staff_usernames)
print 'Total no. of invalid usernames =', len(invalid_usernames)

No. of unique staff members= 23
Total no. of invalid usernames = 23


### users who performed tasks but are missing from masterfile

In [24]:
users_not_in_masterfile = np.setdiff1d(df_merged.username.unique(), df_user_combo.username)
print 'No. of users missing from users masterfile =', len(users_not_in_masterfile)
invalid_usernames |= set(users_not_in_masterfile)
print 'Total no. of invalid usernames =', len(invalid_usernames)

No. of users missing from users masterfile = 17
Total no. of invalid usernames = 40


### users with excessive total task time

In [25]:
# > 24 hours i.e. 86400 seconds
excessive_time = df_merged[(df_merged.total_time_30 > 60*60*24)]['username'].unique()
print 'No. of users with excessive total task time =', len(excessive_time) 
invalid_usernames |= set(excessive_time)
print 'Total no. of invalid usernames =', len(invalid_usernames)

No. of users with excessive total task time = 1
Total no. of invalid usernames = 41


### users that joined the course late
We exclude all users who joined the course after the day we specify as our prediction cutoff date. In other words, we use only the data available from the start of the course until the prediction cutoff date to predict dropouts, so users who enrolled after that date have no data in the prediction window.

In [26]:
prediction_cutoff_days = 14

In [27]:
enrolment_cutoff_date = course_start + datetime.timedelta(days=prediction_cutoff_days)
late_reg = df_user_combo[df_user_combo.enrollment_created > enrolment_cutoff_date]['username'].unique()
print 'No. of users that enrolled late =', len(late_reg) 
invalid_usernames |= set(late_reg)
print 'Total no. of invalid usernames =', len(invalid_usernames)

No. of users that enrolled late = 3038
Total no. of invalid usernames = 3068
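
Note that the running total grows by fewer than 3038 (41 + 3038 = 3079, but the total is 3068), because some late enrollees had already been excluded for another reason. The in-place set union `|=` only adds usernames that are not yet in the set, as this small sketch with invented usernames shows.

```python
# Invented usernames for illustration
invalid_usernames = {'staff_a', 'bot_b'}      # already excluded for other reasons
late_reg = ['late_c', 'late_d', 'staff_a']    # staff_a also enrolled late

before = len(invalid_usernames)
invalid_usernames |= set(late_reg)            # in-place union, duplicates collapse
added = len(invalid_usernames) - before       # counts only the new usernames
```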


## 1.5.12 Merge user details with dropout status and tasks dataframe

### create dataframe with user identifiers and dropout status

In [28]:
df_user_dropout = df_user_combo[['user_id', 'username']].merge(df_task_last_month[['username', 'dropout']], 
                                                                    how='inner', on=['username'])
print len(df_user_dropout)
#df_user_dropout.head()

7405


### merge user identifiers and dropout status with tasks dataframe

In [29]:
df_merged2 = df_user_dropout.merge(df_merged, how='inner', on=['username'])
print '# of valid unique users in task dataframe after merging with dropout status: ', len(df_merged2.username.unique())

# of valid unique users in task dataframe after merging with dropout status:  7278


### exclude invalid users

In [30]:
df_merged_mask = ~df_merged2.username.isin(invalid_usernames)
df_merged_final = df_merged2[df_merged_mask]
print '# of valid unique users in task dataframe after removing invalid users: ', len(df_merged_final.username.unique())

# of valid unique users in task dataframe after removing invalid users:  4956


In [31]:
#df_merged_final.head()

## 1.5.13 Save files

### save user tasks data frame to text file

In [32]:
df_merged_final.to_csv('data/usertasks.txt', sep='\t')

### save course dates to text file

In [33]:
date_df.to_csv('data/course_dates.txt', sep='\t')

### save global variables to json

In [34]:
global_vars = {}
global_vars['course_start'] = course_start.strftime('%Y-%m-%d')
global_vars['course_end'] = course_end.strftime('%Y-%m-%d')
global_vars['dropout_start_date'] = dropout_start_date.strftime('%Y-%m-%d')
global_vars['enrolment_cutoff_date'] = enrolment_cutoff_date.strftime('%Y-%m-%d')
global_vars['prediction_cutoff_days'] = prediction_cutoff_days
print global_vars

{'enrolment_cutoff_date': '2015-01-22', 'course_end': '2015-05-10', 'prediction_cutoff_days': 14, 'dropout_start_date': '2015-04-09', 'course_start': '2015-01-08'}


In [35]:
import json
with open("data/global_vars.json", "w") as global_vars_file:
    json.dump(global_vars, global_vars_file)

In [36]:
# # Read global vars from json into dict
# with open("data/global_vars.json", "r") as f1:
#     dictx = json.load(f1)
#     print dictx
#     print type(dictx)
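
Since the dates are stored as `'%Y-%m-%d'` strings, a later notebook has to convert them back before comparing them with timestamps. A minimal sketch of such a loader follows, written for Python 3; `load_global_vars` is a hypothetical helper, and it assumes that every string value in the file is a date.

```python
import datetime
import json

def load_global_vars(text):
    """Parse the saved JSON and turn the '%Y-%m-%d' strings back into datetimes."""
    raw = json.loads(text)
    parsed = {}
    for key, value in raw.items():
        if isinstance(value, str):
            parsed[key] = datetime.datetime.strptime(value, '%Y-%m-%d')
        else:
            parsed[key] = value  # e.g. prediction_cutoff_days stays an int
    return parsed

# A JSON string standing in for the contents of global_vars.json
saved = json.dumps({'course_start': '2015-01-08',
                    'course_end': '2015-05-10',
                    'prediction_cutoff_days': 14})
restored = load_global_vars(saved)
```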

### save forum events data subset to text file

In [37]:
print '# of rows before filtering =', len(df_forum_event_subset)
df_forum_event_subset = df_forum_event_subset[(df_forum_event_subset.date >= course_start) & 
                                              (df_forum_event_subset.date <= course_end)]
print '# of rows after filtering invalid dates =', len(df_forum_event_subset)
df_forum_event_subset = df_forum_event_subset[~df_forum_event_subset.username.isin(invalid_usernames)]
print '# of rows after filtering invalid users =', len(df_forum_event_subset)
df_forum_event_subset.to_csv('data/forum_event_subset.txt', sep='\t')

# of rows before filtering = 29585
# of rows after filtering invalid dates = 27713
# of rows after filtering invalid users = 26112
