Some notes on what I have found so far:

The variable major has a very large number of categories, and a good percentage of them have a frequency of less than 10. Some categories are really the same major but differ in letter case or use abbreviations. The case differences are easy to deal with. Does anyone have an idea for how to handle the abbreviations?
Also, is it useful to group all low-frequency categories into a single category? I have tried this, but together with the missing values in the variable it leaves the distribution skewed.
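
One possible way to approach the case, abbreviation and low-frequency questions is sketched below. The `ABBREVIATIONS` lookup and the example majors are made up for illustration; a real lookup would have to be built by hand from the actual values. Missing values are deliberately left as missing instead of being pooled into the rare bucket, so the two effects can be looked at separately.

```python
import pandas as pd

# Hypothetical abbreviation lookup; these entries are illustrative,
# not taken from the data.
ABBREVIATIONS = {
    "cs": "computer science",
    "it": "information technology",
    "psych": "psychology",
}

def clean_major(major, min_count=10):
    """Normalise case/whitespace, expand known abbreviations, pool rare values."""
    cleaned = major.str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
    cleaned = cleaned.replace(ABBREVIATIONS)   # exact-match replacement only
    counts = cleaned.value_counts()            # missing values are not counted
    rare = cleaned.map(counts) < min_count     # False for missing values
    return cleaned.mask(rare, "other")

# Made-up example values
majors = pd.Series(["Psychology", "psychology ", "PSYCH", "CS",
                    "Computer  Science", "basket weaving", None])
print(clean_major(majors, min_count=2))
```

Spelling variants such as "computer sciece" would still end up in the rare bucket with this approach; fuzzy matching would be a separate step.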

The target class could be computed using the DASS guide:
https://www.psytoolkit.org/survey-library/depression-anxiety-stress-dass.html

Severities of depression, anxiety and stress are categorized as:
-	0 - Normal
-	1 - Mild
-	2 - Moderate
-	3 - Severe
-	4 - Extremely severe

Meaning | Depression | Anxiety | Stress
:---|:---|:---|:---
Normal | 0-9 | 0-7 | 0-14
Mild | 10-13 | 8-9 | 15-18
Moderate | 14-20 | 10-14 | 19-25
Severe | 21-27 | 15-19 | 26-33
Extremely severe | 28+ | 20+ | 34+
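
The cut-offs in the table can be turned into a small lookup, sketched below. It assumes the three subscale totals have already been computed; these notes do not say which questions belong to which subscale, so that step is left out, and the example scores are made up.

```python
import numpy as np
import pandas as pd

# Lower bound of each level above "Normal", copied from the table above.
CUTOFFS = {
    "Depression": [10, 14, 21, 28],
    "Anxiety": [8, 10, 15, 20],
    "Stress": [15, 19, 26, 34],
}

def severity(scores, scale):
    """Map subscale totals to severity codes 0 (Normal) .. 4 (Extremely severe)."""
    return np.digitize(scores, CUTOFFS[scale])

# Made-up subscale totals for three respondents
scores = pd.DataFrame({
    "Depression": [9, 10, 28],
    "Anxiety": [7, 14, 20],
    "Stress": [14, 25, 40],
})
for scale in CUTOFFS:
    scores[scale + "_severity"] = severity(scores[scale], scale)
print(scores)
```

The `describe()` output further down shows the answers coded 1 to 4. If the guide scores each item from 0 to 3, the answers would need to be shifted by one before summing.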

Some more detail on the data:
- There are 42 questions in the survey. 
- Q1A holds the answer to Question 1. 
- Q1E holds the elapsed time in milliseconds taken to answer that question. 
- Q1I holds that question's position in the survey.
- introelapse is the time spent on the introduction/landing page (in seconds).
- testelapse is the time spent on all the DASS questions (should be equivalent to the elapsed time of all the individual questions combined).
- surveyelapse is the time spent answering the rest of the demographic and survey questions.
- TIPI1 through TIPI10 hold the answers to the Ten Item Personality Inventory as administered (see Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A Very Brief Measure of the Big Five Personality Domains. Journal of Research in Personality, 37, 504-528.):

TIPI number | What it means
:---|:---
TIPI1 | Extraverted, enthusiastic.
TIPI2 | Critical, quarrelsome.
TIPI3 | Dependable, self-disciplined.
TIPI4 | Anxious, easily upset.
TIPI5 | Open to new experiences, complex.
TIPI6 | Reserved, quiet.
TIPI7 | Sympathetic, warm.
TIPI8 | Disorganized, careless.
TIPI9 | Calm, emotionally stable.
TIPI10 | Conventional, uncreative.
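
The Q1A / Q1E / Q1I naming pattern described earlier makes it easy to select all columns of one kind. The sketch below uses it to check the claim that testelapse matches the sum of the per-question times. It assumes testelapse is in seconds, which these notes do not state, and it uses made-up values for two questions only.

```python
import pandas as pd

def question_columns(suffix, n_questions=42):
    """Column names for one kind of field: 'A' answer, 'E' elapsed, 'I' position."""
    return [f"Q{i}{suffix}" for i in range(1, n_questions + 1)]

def summed_elapsed_seconds(df, n_questions=42):
    """Sum the per-question elapsed times (milliseconds) and convert to seconds."""
    return df[question_columns("E", n_questions)].sum(axis=1) / 1000.0

# Made-up data with only two questions
toy = pd.DataFrame({
    "Q1E": [3000, 1500],
    "Q2E": [2000, 2500],
    "testelapse": [5, 10],   # assumed to be in seconds
})
gap = toy["testelapse"] - summed_elapsed_seconds(toy, n_questions=2)
print(gap)
```

A large gap for a respondent would suggest either a different unit or time spent outside the individual questions.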

The following items were presented as a check-list and subjects were instructed "In the grid below, check all the words whose definitions you are sure you know":

Valid Check | Word
:-----|:---
VCL1 | boat
VCL2 | incoherent
VCL3 | pallid
VCL4 | robot
VCL5 | audible
VCL6 | cuivocal
VCL7 | paucity
VCL8 | epistemology
VCL9 | florted
VCL10 | decide
VCL11 | pastiche
VCL12 | verdid
VCL13 | abysmal
VCL14 | lucid
VCL15 | betray
VCL16 | funny

A value of 1 means checked, 0 means unchecked. The words at VCL6, VCL9, and VCL12 are not real words and can be used as a validity check.
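
A minimal sketch of that validity check, using made-up rows: a respondent is flagged if any of the three made-up words is ticked. Whether flagged rows should be dropped or just marked is a separate decision.

```python
import pandas as pd

FAKE_WORDS = ["VCL6", "VCL9", "VCL12"]  # cuivocal, florted, verdid

def flag_invalid(df):
    """True for respondents who claimed to know at least one made-up word."""
    return df[FAKE_WORDS].eq(1).any(axis=1)

# Made-up answers for three respondents
toy = pd.DataFrame({
    "VCL6": [0, 1, 0],
    "VCL9": [0, 0, 0],
    "VCL12": [0, 0, 1],
})
toy["failed_validity"] = flag_invalid(toy)
print(toy)
```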

Field | Question and answer codes
:---|:---
education | "How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban | "What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender | "What is your gender?", 1=Male, 2=Female, 3=Other
engnat | "Is English your native language?", 1=Yes, 2=No
age | "How many years old are you?"
hand | "What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion | "What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation | "What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race | "What is your race?", 10=Asian, 20=Arab, 30=Black, 40=Indigenous Australian, 50=Native American, 60=White, 70=Other
voted | "Have you voted in a national election in the past year?", 1=Yes, 2=No
married | "What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize | "Including you, how many children did your mother have?"
major | "If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"
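
The coded fields can be mapped back to their labels with plain dictionaries, as in the sketch below for two of the fields. The `describe()` output further down shows a minimum of 0 for several of these fields, a code the table does not list. Treating it as "no answer" is an assumption here; any unlisted code simply becomes missing.

```python
import pandas as pd

# Labels copied from the table above, for two of the fields only.
CODES = {
    "hand": {1: "Right", 2: "Left", 3: "Both"},
    "married": {1: "Never married", 2: "Currently married", 3: "Previously married"},
}

def decode(df, codes=CODES):
    """Return a copy with coded columns replaced by labels; unlisted codes become NaN."""
    out = df.copy()
    for column, mapping in codes.items():
        out[column] = df[column].map(mapping)
    return out

# Made-up rows, including the unlisted code 0
toy = pd.DataFrame({"hand": [1, 2, 0], "married": [1, 0, 3]})
print(decode(toy))
```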

The following values were derived from technical information:
- country - ISO country code of where the user connected from
- screensize - 1=device with small screen (phone, etc), 2=device with big screen (laptop, desktop, etc)
- uniquenetworklocation - 1=only one survey from the user's specific network in the dataset, 2=multiple surveys submitted from the network of this user (2 does not necessarily imply duplicate records for an individual, as it could be different students at a single school or different members of the same household; and even with 1 there could still be duplicate records from a single individual, e.g. if they took it once on their wifi and once on their phone)
- source - how the user found the test, 1=from the front page of the site hosting the survey, 2=from google, 0=other or unknown

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../../during_class_work/data.csv', sep='\t')  # tab-separated file
df.head()

Unnamed: 0,Q1A,Q1I,Q1E,Q2A,Q2I,Q2E,Q3A,Q3I,Q3E,Q4A,...,screensize,uniquenetworklocation,hand,religion,orientation,race,voted,married,familysize,major
0,4,28,3890,4,25,2122,2,16,1944,4,...,1,1,1,12,1,10,2,1,2,
1,4,2,8118,1,36,2890,2,35,4777,3,...,2,1,2,7,0,70,2,1,4,
2,3,7,5784,1,33,4373,4,41,3242,1,...,2,1,1,4,3,60,1,1,3,
3,2,23,5081,3,11,6837,2,37,5521,1,...,2,1,2,4,5,70,2,1,5,biology
4,2,36,3215,2,13,7731,3,5,4156,4,...,2,2,3,10,1,10,2,1,4,Psychology


In [3]:
df.describe()

Unnamed: 0,Q1A,Q1I,Q1E,Q2A,Q2I,Q2E,Q3A,Q3I,Q3E,Q4A,...,age,screensize,uniquenetworklocation,hand,religion,orientation,race,voted,married,familysize
count,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,...,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0,39775.0
mean,2.619485,21.555977,6970.591,2.172269,21.24807,5332.376,2.226097,21.583004,7426.446,1.95017,...,23.612168,1.274519,1.200025,1.13516,7.555852,1.642992,31.312885,1.705795,1.159547,3.51027
std,1.032117,12.133621,86705.13,1.111563,12.125288,26513.61,1.038526,12.115637,158702.4,1.042218,...,21.581722,0.446277,0.400024,0.4003,3.554395,1.351362,25.871272,0.473388,0.445882,2.141518
min,1.0,1.0,180.0,1.0,1.0,176.0,1.0,1.0,-10814.0,1.0,...,13.0,1.0,1.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0
25%,2.0,11.0,2664.0,1.0,11.0,2477.0,1.0,11.0,2857.0,1.0,...,18.0,1.0,1.0,1.0,4.0,1.0,10.0,1.0,1.0,2.0
50%,3.0,22.0,3609.0,2.0,21.0,3511.0,2.0,22.0,3898.0,2.0,...,21.0,1.0,1.0,1.0,10.0,1.0,10.0,2.0,1.0,3.0
75%,4.0,32.0,5358.0,3.0,32.0,5216.0,3.0,32.0,5766.0,3.0,...,25.0,2.0,1.0,1.0,10.0,2.0,60.0,2.0,1.0,4.0
max,4.0,42.0,12102280.0,4.0,42.0,2161057.0,4.0,42.0,28582690.0,4.0,...,1998.0,2.0,2.0,3.0,12.0,5.0,70.0,2.0,3.0,133.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39775 entries, 0 to 39774
Columns: 172 entries, Q1A to major
dtypes: int64(170), object(2)
memory usage: 52.2+ MB


In [5]:
df.info  # note: without parentheses this only displays the bound method, it does not call info()

<bound method DataFrame.info of        Q1A  Q1I    Q1E  Q2A  Q2I   Q2E  Q3A  Q3I    Q3E  Q4A  ...  screensize  \
0        4   28   3890    4   25  2122    2   16   1944    4  ...           1   
1        4    2   8118    1   36  2890    2   35   4777    3  ...           2   
2        3    7   5784    1   33  4373    4   41   3242    1  ...           2   
3        2   23   5081    3   11  6837    2   37   5521    1  ...           2   
4        2   36   3215    2   13  7731    3    5   4156    4  ...           2   
...    ...  ...    ...  ...  ...   ...  ...  ...    ...  ...  ...         ...   
39770    2   31   3287    1    5  2216    3   29   3895    2  ...           2   
39771    3   14   4792    4   41  2604    3   15   2668    4  ...           1   
39772    2    1  25147    1    4  4555    2   14   3388    1  ...           2   
39773    3   36   4286    1   34  2736    2   10   5968    2  ...           2   
39774    2   28  32251    1   22  3317    2    4  11734    1  ...           1

In [6]:
df.columns

Index(['Q1A', 'Q1I', 'Q1E', 'Q2A', 'Q2I', 'Q2E', 'Q3A', 'Q3I', 'Q3E', 'Q4A',
       ...
       'screensize', 'uniquenetworklocation', 'hand', 'religion',
       'orientation', 'race', 'voted', 'married', 'familysize', 'major'],
      dtype='object', length=172)

In [7]:
df['country'].value_counts()

MY    21605
US     8207
GB     1180
CA      978
ID      884
      ...  
UZ        1
AM        1
AF        1
IM        1
VC        1
Name: country, Length: 145, dtype: int64

In [8]:
df['major'].value_counts()

English                         1163
Psychology                      1127
Accounting                       786
Business                         756
Engineering                      751
                                ... 
creative media                     1
PPE                                1
international communication        1
Dentistry or food technology       1
computer sciece                    1
Name: major, Length: 4647, dtype: int64

In [14]:
# There is no 'Q2_SurveyPosition' column (it raised a KeyError); Q2I holds question 2's position in the survey
df['Q2I'].value_counts()