## OULAD: Open University Learning Analytics Dataset

A dataset containing demographic information about students, the courses they attended, and their final result in each course.

### Agenda

a. More about dataset<br>
b. Schema involved<br>
c. EDA exercise on OULAD dataset<br>
d. Folder structuring<br>
e. Conclusion<br>

### a. More about dataset

This page introduces the anonymised Open University Learning Analytics Dataset (OULAD). It contains data about courses, students and their interactions with the Virtual Learning Environment (VLE) for seven selected courses (called modules). Course presentations start in February and October and are marked by “B” and “J” respectively. The dataset consists of tables connected by unique identifiers. All tables are stored in CSV format.

Kuzilek J., Hlosta M., Zdrahal Z. Open University Learning Analytics dataset. Sci. Data 4:170171, doi: 10.1038/sdata.2017.171 (2017).
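
The "B" and "J" markers can be decoded with a small helper. This is only a sketch: it assumes presentation codes look like `2013J` (a four-digit year followed by the letter), and `parse_presentation` and `START_MONTH` are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Hypothetical lookup: B = February start, J = October start (as described above).
START_MONTH = {"B": "February", "J": "October"}

def parse_presentation(code):
    """Split an assumed 'YYYYL' presentation code into (year, start month)."""
    year, letter = int(code[:4]), code[4]
    return year, START_MONTH[letter]

# Made-up sample codes, not read from the dataset.
presentations = pd.Series(["2013B", "2013J", "2014B", "2014J"])
parsed = presentations.map(parse_presentation)
print(parsed.tolist())
```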

### b. Schema Involved

In [None]:
from IPython.display import Image
Image("../input/oulad-schema/model.png")

The studentInfo table is a data frame with 32593 rows and 12 variables:

code_module
Name of course, for which student registered

code_presentation
Name of semester, for which student registered

id_student
Unique integer identifying each student

gender
Student's gender

region
UK region, in which student lives

highest_education
Highest education student achieved before taking course

imd_band
Index of Multiple Deprivation percentile band (see https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015); students with an imd_band lower than 20 come from the most deprived regions

age_band
Age band of student

num_of_prev_attempts
Number of the student's previous attempts at the selected course

studied_credits
Total credits the student is studying at the Open University during the period of the course

disability
Whether the student has declared a disability of any type (logical)

final_result
Student final result in the course

Region values
East Anglian Region

Scotland

North Western Region

South East Region

West Midlands Region

Wales

North Region

South Region

Ireland

South West Region

East Midlands Region

Yorkshire Region

London Region

See https://en.wikipedia.org/wiki/Regions_of_England for explanation.

Highest education values
HE Qualification - awarded after one year full-time study at the university or higher education institution

A Level or Equivalent - secondary school leaving qualification

Lower Than A Level - did not complete secondary school

Post Graduate Qualification - roughly equivalent to a Master's degree

No Formal quals - no previous formal education

Final result values
Pass - passed the course

Withdrawn - withdrew from the course before its official end

Fail - failed the course after taking final exam

Distinction - passed course with outstanding results

Source
https://analyse.kmi.open.ac.uk/open_dataset
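
The value lists above can be turned into typed columns before plotting. The sketch below runs on a made-up three-row frame. The low-to-high ordering in `EDUCATION_ORDER` and the `completed` flag (Pass or Distinction) are assumptions for illustration, not definitions from the dataset.

```python
import pandas as pd

# Assumed ordering from lowest to highest; the labels are the ones listed above.
EDUCATION_ORDER = [
    "No Formal quals",
    "Lower Than A Level",
    "A Level or Equivalent",
    "HE Qualification",
    "Post Graduate Qualification",
]

# Toy rows, not taken from studentInfo.csv.
toy = pd.DataFrame({
    "highest_education": ["HE Qualification", "No Formal quals", "A Level or Equivalent"],
    "final_result": ["Pass", "Withdrawn", "Distinction"],
})

# An ordered categorical sorts by education level instead of alphabetically.
toy["highest_education"] = pd.Categorical(
    toy["highest_education"], categories=EDUCATION_ORDER, ordered=True
)
# Hypothetical binary outcome: the student finished the course successfully.
toy["completed"] = toy["final_result"].isin(["Pass", "Distinction"])

toy_sorted = toy.sort_values("highest_education")
print(toy_sorted)
```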

### c. EDA exercise on OULAD dataset

In [None]:
#Data loading/transformation
import numpy as np
import pandas as pd

In [None]:
#visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
#function to display basic info for a given dataframe
def show_basic_info(df):
    print("========================================================================================================")
    print("HEAD:")
    print(df.head(3))
    print("--------------------------------------------------------------------------------------------------------")
    print("SHAPE:")
    print(df.shape)
    print("--------------------------------------------------------------------------------------------------------")
    print("INFO:")
    df.info()    #info() prints its report directly and returns None, so it is not wrapped in print()
    print("--------------------------------------------------------------------------------------------------------")
    print("DESCRIBE:")
    print(df.describe())
    print("--------------------------------------------------------------------------------------------------------")
    print("========================================================================================================")

    

In [None]:
assessments_df = pd.read_csv('../input/ouladdata/assessments.csv')
#assessments_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/assessments.csv')
show_basic_info(assessments_df)

In [None]:
courses_df = pd.read_csv('../input/ouladdata/courses.csv')
#courses_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/courses.csv')
show_basic_info(courses_df)

In [None]:
studentAssessment_df = pd.read_csv('../input/ouladdata/studentAssessment.csv')
#studentAssessment_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/studentAssessment.csv')
show_basic_info(studentAssessment_df)

In [None]:
studentInfo_df = pd.read_csv('../input/ouladdata/studentInfo.csv')
#studentInfo_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/studentInfo.csv')
show_basic_info(studentInfo_df)

In [None]:
studentRegistration_df = pd.read_csv('../input/ouladdata/studentRegistration.csv')
#studentRegistration_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/studentRegistration.csv')
show_basic_info(studentRegistration_df)

In [None]:
vle_df = pd.read_csv('../input/ouladdata/vle.csv')
#vle_df = pd.read_csv('https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/vle.csv')
show_basic_info(vle_df)

In [None]:
#Checking gender distribution
sns.countplot(x='gender', data=studentInfo_df);    #this shows that the student data is almost equally distributed on gender
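
To put a number on how balanced a plot like this is, the category shares can be printed next to it. A minimal sketch on a made-up `gender` series (the real column would come from `studentInfo_df`):

```python
import pandas as pd

# Made-up values standing in for studentInfo_df.gender.
gender = pd.Series(["M", "F", "M", "F", "M"], name="gender")

counts = gender.value_counts()                 # absolute counts per category
shares = gender.value_counts(normalize=True)   # fraction of rows per category
print(counts)
print(shares)
```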

In [None]:
#Now let's try the same on age
studentInfo_df[['id_student', 'age_band']].groupby(by='age_band').count().plot.bar();    #this shows majority of students fall in age band of 0-35

In [None]:
#Now let's try the same on region
studentInfo_df[['id_student', 'region']].groupby(by='region').count().plot.bar();

In [None]:
# What if we want a multi-dimensional visualization?
# pandas provides this through crosstab
# crosstab: computes a simple cross-tabulation of two (or more) factors. By default it computes a frequency table of the factors
# unless an array of values and an aggregation function are passed.

In [None]:
pd.crosstab(studentInfo_df.region, studentInfo_df.age_band).plot.barh(stacked = True);
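
A small sketch of what `crosstab` returns, on toy rows rather than the real table. `normalize="index"` turns each row into shares, which can be easier to compare across regions of very different size.

```python
import pandas as pd

# Toy rows; the column names mirror studentInfo, the values are invented.
toy = pd.DataFrame({
    "region": ["Wales", "Wales", "Scotland", "Scotland", "Scotland", "Ireland"],
    "age_band": ["0-35", "35-55", "0-35", "0-35", "55<=", "0-35"],
})

counts = pd.crosstab(toy.region, toy.age_band)                      # frequency table
shares = pd.crosstab(toy.region, toy.age_band, normalize="index")   # each row sums to 1
print(counts)
print(shares)
```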

In [None]:
# How do we visualize continuous variables and outliers?
# pandas and seaborn both provide a boxplot for this.
# boxplot: the box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the
# distribution of data based on the five-number summary:
# minimum, first quartile, median, third quartile, and maximum.

In [None]:
studentInfo_df.drop(['id_student', 'num_of_prev_attempts'], axis=1).boxplot(by = 'region')
plt.xticks(rotation = 90)    #without this, x-labels overlap

In [None]:
# plotting boxplot using seaborn
sns.boxplot(x = 'region', y = 'studied_credits', data=studentInfo_df)
plt.xticks(rotation = 90)
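
The numbers behind a box plot can also be computed by hand. This sketch uses invented credit values and assumes the common 1.5 × IQR whisker rule for flagging outliers.

```python
import numpy as np

# Invented values standing in for studied_credits.
credits = np.array([30, 60, 60, 60, 90, 120, 120, 240, 600])

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(credits, [0, 25, 50, 75, 100])

# Assumed outlier rule: points beyond 1.5 * IQR from the box edges.
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = credits[(credits < lower) | (credits > upper)]

print(minimum, q1, median, q3, maximum)
print(outliers)
```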

In [None]:
# selecting a subset of cols which are of importance to us and grouping them by student id and aggregating them using median
studentPerformance_df = studentInfo_df[['id_student', 'num_of_prev_attempts', 'studied_credits']].groupby('id_student').median()

In [None]:
studentPerformance_df.head()

In [None]:
# Notice above that groupby turned id_student into the index. We reset the index so that id_student becomes a regular column again
studentPerformance_df = studentPerformance_df.reset_index()
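
A toy illustration of what happens to the index here, with invented rows. `as_index=False` is an alternative that skips the separate `reset_index` step.

```python
import pandas as pd

# Invented rows; student 11 appears twice, as a student on two courses would.
toy = pd.DataFrame({
    "id_student": [11, 11, 42, 7],
    "num_of_prev_attempts": [0, 1, 0, 2],
    "studied_credits": [60, 120, 30, 90],
})

grouped = toy.groupby("id_student").median()   # id_student becomes the index
flat = grouped.reset_index()                   # id_student is a column again
same = toy.groupby("id_student", as_index=False).median()  # one-step alternative

print(grouped.index.tolist())
print(flat)
```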

In [None]:
studentPerformance_df.head()

In [None]:
studentPerformance_df.num_of_prev_attempts.unique()

In [None]:
sns.countplot(x='num_of_prev_attempts', data=studentPerformance_df);

In [None]:
# the above countplot shows that most students are on their first attempt (0 previous attempts)

In [None]:
studentProfile_df = studentInfo_df[['id_student', 'gender', 'region', 'highest_education', 'imd_band', 'age_band']].drop_duplicates()


In [None]:
show_basic_info(studentProfile_df)

In [None]:
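# keep one row per student, then count students in each age band
# (counting age_band per id_student would only give the number of records per student)
studentAges_df = studentInfo_df[['id_student', 'age_band']].drop_duplicates('id_student')
studentAges_df.age_band.value_counts().sort_index().plot.bar();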

In [None]:
# majority of the students fall in age band of 0-35

In [None]:
sns.countplot(x='code_module', data=studentInfo_df)

In [None]:
# courses BBB and FFF are the most popular (they have the most student records)

In [None]:
pd.crosstab(studentInfo_df.code_module, studentInfo_df.code_presentation).plot.barh(stacked = True);

In [None]:
# 'B' is for presentations starting in Feb and 'J' is for presentations starting in Oct.
# course 'CCC' only has presentations in 2014.
# course 'AAA' has a very low student count compared to the other courses

In [None]:
studentInfo_df.head(2)

In [None]:
sns.pairplot(data=studentInfo_df[["code_module","num_of_prev_attempts"]],hue="code_module", dropna=True, height=5);    #'size' was renamed to 'height' in newer seaborn versions

In [None]:
studentModuleLengths_df = studentInfo_df.merge(courses_df, on = ['code_module', 'code_presentation'], how='left')
studentModuleLengths_df = studentModuleLengths_df[['id_student', 'module_presentation_length']].groupby('id_student').median()
studentModuleLengths_df = studentModuleLengths_df.reset_index()

In [None]:
show_basic_info(studentModuleLengths_df)

In [None]:
sns.countplot(studentModuleLengths_df.module_presentation_length);

In [None]:
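# unregistered = 1 when an unregistration date exists
studentRegistration_df['unregistered'] = np.where(pd.isnull(studentRegistration_df.date_unregistration), 0, 1)
# registered = 1 when the student stayed registered (no unregistration date); assumed intent, the original assigned the same flag twice
studentRegistration_df['registered'] = np.where(pd.isnull(studentRegistration_df.date_unregistration), 1, 0)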

In [None]:
studentRegistration_df['register_days'] = (np.where(pd.isnull(studentRegistration_df.date_registration), 0, 
                                          studentRegistration_df.date_registration)).astype(int)
studentRegistration_df['unregister_days'] = (np.where(pd.isnull(studentRegistration_df.date_unregistration), 0, 
                                            studentRegistration_df.date_unregistration)).astype(int)
studentRegDays_df = studentRegistration_df[['id_student', 'register_days', 
                                   'unregister_days']].groupby(['id_student']).mean()
studentRegDays_df = studentRegDays_df.reset_index()
studentRegDays_df.head()

In [None]:
studentInterest_df = studentRegistration_df[['id_student', 'registered', 'unregistered']].groupby(['id_student']).sum()
studentInterest_df = studentInterest_df.reset_index()

In [None]:
show_basic_info(studentInterest_df)

In [None]:
studentInterest_df[['registered', 'unregistered']].boxplot();


In [None]:
studentInterest_df.unregistered.hist();

In [None]:
studentAssessment_df['score'] = (np.where(pd.isnull(studentAssessment_df.score), 0, studentAssessment_df.score)).astype(int)

In [None]:
studentAssessment_df['assessment_mean'] = studentAssessment_df['score'].groupby(studentAssessment_df['id_assessment']) \
.transform('mean')

In [None]:
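# note: despite its name, score_std is the score relative to the assessment mean (score / mean), not a standard deviation
studentAssessment_df['score_std'] = studentAssessment_df.score/studentAssessment_df.assessment_mean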
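
For comparison, here is a sketch of the ratio-to-mean normalisation next to a conventional z-score on invented scores. The z-score is shown only as an alternative and is not what the notebook computes; `ratio_to_mean` and `z_score` are hypothetical column names.

```python
import pandas as pd

# Invented scores for two assessments.
toy = pd.DataFrame({
    "id_assessment": [1, 1, 1, 2, 2],
    "id_student": [10, 11, 12, 10, 11],
    "score": [40, 60, 80, 90, 70],
})

by_assessment = toy.groupby("id_assessment")["score"]
group_mean = by_assessment.transform("mean")   # mean score of each row's assessment
group_std = by_assessment.transform("std")     # sample standard deviation per assessment

toy["ratio_to_mean"] = toy["score"] / group_mean           # same idea as score_std above
toy["z_score"] = (toy["score"] - group_mean) / group_std   # alternative scaling
print(toy)
```

A ratio of 1.0 means the student scored exactly the assessment average; the z-score additionally accounts for how spread out the scores of that assessment are.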


In [None]:
studentScoring_df = studentAssessment_df[['id_student', 
                                          'score_std']].groupby(['id_student']).median()
studentScoring_df = studentScoring_df.reset_index()
studentScoring_df.info()

In [None]:
studentScoring_df.score_std.hist();

In [None]:
# splitting big file ~400+ MB into smaller chunks

#for i,chunk in enumerate(pd.read_csv('../data/raw/studentVle.csv', chunksize=1500000)):
#    chunk.to_csv('../data/raw/studentVle_{}.csv'.format(i))
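
When only an aggregate is needed, the chunks do not have to be written back to disk at all. The sketch below reads a tiny in-memory CSV in chunks and accumulates total clicks per student. The CSV text is invented and merely mimics a few studentVle columns.

```python
import io
import pandas as pd

# Invented CSV content standing in for the large studentVle.csv file.
csv_text = (
    "id_student,id_site,sum_click\n"
    "1,100,3\n1,101,2\n2,100,5\n2,102,1\n3,101,4\n"
)

totals = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Aggregate each chunk on its own, then fold it into the running totals.
    part = chunk.groupby("id_student")["sum_click"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

totals = totals.astype(int)
print(totals)
```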

In [None]:
# load the eight chunk files (studentVle_0.csv ... studentVle_7.csv) and stack them into one dataframe
# alternative source: https://raw.githubusercontent.com/vjcalling/OULAD-data-analysis-EDA-/master/data/raw/studentVle_{i}.csv
studentVle_parts = [pd.read_csv('../input/ouladdata/studentVle_{}.csv'.format(i)) for i in range(8)]
studentVle_df = pd.concat(studentVle_parts, ignore_index=True)

In [None]:
studentVle_df.shape

In [None]:
show_basic_info(studentVle_df)

In [None]:
studentVle_df = studentVle_df.merge(vle_df, on = 'id_site', how = 'left')

In [None]:
sns.countplot(x='activity_type', data=studentVle_df)
plt.xticks(rotation = 90)

In [None]:
studentInteractivity_df = studentVle_df[['id_student', 
                                     'activity_type', 'sum_click']].groupby(['id_student', 'activity_type']).mean()
studentInteractivity_df = studentInteractivity_df.reset_index()
studentInteractivity_df.head()

In [None]:
# pivoting will help us reduce multiple rows per student to one single row with multiple columns 
# After this we can visualize the columns with missing data

In [None]:
import missingno as msno

In [None]:
studentInteractivity_df = studentInteractivity_df.pivot(index='id_student', 
                                                    columns='activity_type', values='sum_click')
studentInteractivity_df = studentInteractivity_df.reset_index()
msno.matrix(studentInteractivity_df)
studentInteractivity_df = studentInteractivity_df.fillna(0)
studentInteractivity_df.info()
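
The same long-to-wide step on a toy frame makes the source of the missing values visible: a student who never used an activity type has no row for it, so the pivot leaves a NaN there. All values below are invented.

```python
import pandas as pd

# Invented long-format rows: one row per (student, activity type).
long_df = pd.DataFrame({
    "id_student": [1, 1, 2, 3],
    "activity_type": ["forumng", "quiz", "quiz", "homepage"],
    "sum_click": [2.0, 5.0, 1.0, 4.0],
})

# One row per student, one column per activity type.
wide_df = long_df.pivot(index="id_student", columns="activity_type", values="sum_click")
missing_per_column = wide_df.isna().sum()   # students with no activity of each type

# Treat "never used" as zero clicks and restore id_student as a column.
wide_df = wide_df.fillna(0).reset_index()
print(missing_per_column)
print(wide_df)
```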

In [None]:
studentInteractivity_df = studentInteractivity_df[['id_student', 'forumng', 'homepage', 'oucollaborate',
       'oucontent', 'ouwiki', 'page', 'questionnaire', 'quiz',
       'resource', 'subpage', 'url']]

In [None]:
dataset = studentPerformance_df.merge(studentModuleLengths_df, 
                                    on = 'id_student', how='left')
dataset = dataset.merge(studentInterest_df, 
                                    on = 'id_student', how='left')
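# use the per-student averages (studentRegDays_df); studentRegistration_df has several rows per student and would duplicate rows
dataset = dataset.merge(studentRegDays_df[['id_student', 'register_days']], 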
                                    on = 'id_student', how='left')
dataset = dataset.merge(studentScoring_df, 
                                    on = 'id_student', how='left')
dataset = dataset.merge(studentInteractivity_df, 
                                    on = 'id_student', how='left')
dataset.info()
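
A left merge keeps one row per student only if the right-hand table has unique keys. The toy example below shows the row count growing when it does not. `validate="one_to_one"` is one way to make pandas raise an error instead of silently duplicating rows. Frame names and values are invented.

```python
import pandas as pd

students = pd.DataFrame({"id_student": [1, 2, 3], "studied_credits": [60, 120, 30]})
# Student 1 has two registrations, student 3 has none.
registrations = pd.DataFrame({
    "id_student": [1, 1, 2],
    "register_days": [-30, -10, -5],
})

# Merging against repeated keys duplicates student 1.
duplicated = students.merge(registrations, on="id_student", how="left")

# Aggregating to one row per student first keeps the merge one-to-one.
per_student = registrations.groupby("id_student", as_index=False).mean()
clean = students.merge(per_student, on="id_student", how="left", validate="one_to_one")

print(len(students), len(duplicated), len(clean))
```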

In [None]:
msno.bar(dataset)
dataset = dataset.fillna(0)

In [None]:
plt.matshow(dataset.corr());

In [None]:
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import k_means
from sklearn.metrics import silhouette_score

In [None]:
# Scale the features to a comparable range; RobustScaler centres on the median and scales by the IQR, so outliers have less influence
sc = RobustScaler()
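
The imports above suggest a scale, reduce, cluster, evaluate workflow. A minimal sketch on synthetic data is shown below. The two blobs, the choice of 2 components and 2 clusters, and all variable names are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import k_means
from sklearn.metrics import silhouette_score

# Synthetic data: two well-separated groups of 50 points in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(50, 4)),
    rng.normal(10, 1, size=(50, 4)),
])

X_scaled = RobustScaler().fit_transform(X)           # median/IQR scaling per feature
X_2d = PCA(n_components=2).fit_transform(X_scaled)   # project onto 2 principal components

# k_means returns the cluster centres, a label per row, and the inertia.
centers, labels, inertia = k_means(X_2d, n_clusters=2, n_init=10, random_state=0)

# Silhouette score lies in [-1, 1]; higher means better-separated clusters.
score = silhouette_score(X_2d, labels)
print(X_2d.shape, score)
```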


In [None]:
dataset.score_std.plot.box()