##### I have some questions for you that I need answered before the board meeting Thursday afternoon. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?


In [149]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import seaborn as sns
import env
import joint_prepare

In [98]:
url = f'mysql+pymysql://{env.user}:{env.password}@{env.host}/curriculum_logs'
query = '''
Select * from logs
left join cohorts on logs.cohort_id = cohorts.id
ORDER BY date ASC, time ASC;
'''
df = pd.read_sql(query, url)

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        900223 non-null  object 
 1   time        900223 non-null  object 
 2   path        900222 non-null  object 
 3   user_id     900223 non-null  int64  
 4   cohort_id   847330 non-null  float64
 5   ip          900223 non-null  object 
 6   id          847330 non-null  float64
 7   name        847330 non-null  object 
 8   slack       847330 non-null  object 
 9   start_date  847330 non-null  object 
 10  end_date    847330 non-null  object 
 11  created_at  847330 non-null  object 
 12  updated_at  847330 non-null  object 
 13  deleted_at  0 non-null       object 
 14  program_id  847330 non-null  float64
dtypes: float64(3), int64(1), object(11)
memory usage: 103.0+ MB


In [100]:
df.isnull().sum()

date               0
time               0
path               1
user_id            0
cohort_id      52893
ip                 0
id             52893
name           52893
slack          52893
start_date     52893
end_date       52893
created_at     52893
updated_at     52893
deleted_at    900223
program_id     52893
dtype: int64

In [101]:
df.user_id.value_counts()

11     17913
64     16347
53     12329
314     7783
1       7404
       ...  
66         1
163        1
918        1
212        1
952        1
Name: user_id, Length: 981, dtype: int64

In [102]:
(df.isnull().sum()/df.shape[0]*100)[:] # Percentage of nulls in each column

date            0.000000
time            0.000000
path            0.000111
user_id         0.000000
cohort_id       5.875544
ip              0.000000
id              5.875544
name            5.875544
slack           5.875544
start_date      5.875544
end_date        5.875544
created_at      5.875544
updated_at      5.875544
deleted_at    100.000000
program_id      5.875544
dtype: float64

In [103]:
dfnull = df[df.program_id.isnull()]

In [104]:
df[df.updated_at.isnull()].user_id.value_counts()

354    2965
736    2358
363    2248
716    2136
368    2085
       ... 
644       6
663       4
62        4
89        3
176       3
Name: user_id, Length: 78, dtype: int64

- looks like there are about 78 unique user_ids with the chunk of null values, could be admin/non-student

In [105]:
df[df.path.isnull()]

Unnamed: 0,date,time,path,user_id,cohort_id,ip,id,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
506305,2020-04-08,09:25:18,,586,55.0,72.177.240.51,55.0,Curie,#curie,2020-02-03,2020-07-07,2020-02-03 19:31:51,2020-02-03 19:31:51,,3.0


In [106]:
df[df.created_at == df.updated_at]

Unnamed: 0,date,time,path,user_id,cohort_id,ip,id,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,8.0,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,22.0,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900218,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,28.0,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,,2.0
900219,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,28.0,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,,2.0
900220,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,28.0,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,,2.0
900221,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,28.0,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,,2.0


In [107]:
df.updated_at.value_counts()

2018-12-06 17:04:19    84031
2019-07-15 16:57:21    40730
2019-01-20 23:18:57    38096
2020-09-21 18:06:27    37109
2020-01-13 21:17:08    36902
2018-05-25 22:25:57    35636
2020-03-23 17:52:16    33844
2020-07-29 18:41:13    33568
2019-09-16 13:07:04    32888
2020-07-13 18:32:19    32015
2018-01-08 13:59:10    30926
2020-05-26 19:22:44    29855
2019-05-28 18:41:05    29356
2018-03-05 14:22:11    28534
2019-11-04 18:27:07    28033
2018-09-17 19:09:51    27749
2019-08-20 14:38:55    26538
2018-07-23 15:02:25    25586
2019-03-18 20:35:06    25359
2020-11-02 20:43:58    23691
2020-02-03 19:31:51    21582
2018-11-05 15:26:37    20743
2020-09-30 15:54:46    17713
2020-12-07 16:58:43    16623
2021-01-20 21:31:11    16397
2016-06-14 19:52:26    14775
2020-12-07 15:20:18    14715
2016-07-18 19:06:27     9587
2021-03-15 18:18:20     8562
2017-09-27 20:22:41     7444
2021-03-15 19:57:09     7276
2017-02-06 17:49:10     4954
2017-03-28 00:33:12     2158
2021-04-12 18:07:21     1672
2017-06-05 20:

In [108]:
df.program_id.value_counts()

2.0    713365
3.0    103412
1.0     30548
4.0         5
Name: program_id, dtype: int64

In [109]:
df[df.name== "Staff"].user_id.value_counts()

11     15178
64     12530
428     5819
1       5787
248     5027
314     4617
53      4132
545     3528
211     3162
581     2961
546     2585
514     2073
315     2042
404     1668
816     1527
742     1507
480     1256
146     1216
521     1088
430      981
744      651
951      583
893      402
572      390
37       374
502      357
618      318
397      305
630      253
41       204
257      160
308      151
513      132
854      131
312      131
738      128
653      117
953       85
539       84
40        66
620       58
370       54
813       49
855       47
745       46
894       29
148       26
461       11
980        3
652        1
592        1
Name: user_id, dtype: int64

#### Cleaning the Data
- following the merging of the tables there are some simple cleaning steps we can take. 
- deleted_at column is 100% null, can be removed. 
- updated_at and created_at are duplicate columns, updated_at can be removed. - - Slack col is duplicate of name, drop it. id and cohort_id are duplicates, drop id

In [110]:
def clean_curriculum(df):
    #drop unneeded columns
    df = df.drop(['deleted_at'], axis=1)
    df = df.drop(['updated_at'], axis=1)
    df = df.drop(['slack'], axis=1)
    df = df.drop(['id'], axis=1)
    
    #
    
    
    #drop remaining nulls
    df = df.dropna()
    
    return df

In [111]:
df = clean_curriculum(df)

In [112]:
df.head(2)

Unnamed: 0,date,time,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0


In [113]:
df[(df.name == "Hampton") & (df.ip == '97.105.19.61')]

Unnamed: 0,date,time,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
58,2018-01-26,10:40:15,javascript-i/functions,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
...,...,...,...,...,...,...,...,...,...,...,...
85410,2018-07-13,09:13:18,javascript-ii/map-filter-reduce,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
85411,2018-07-13,09:13:25,appendix/angular/models,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
85412,2018-07-13,09:13:48,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0
85413,2018-07-13,09:13:52,toc,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,1.0


In [114]:
dfnull.path.describe()

count     52893
unique     1112
top           /
freq       4459
Name: path, dtype: object

In [115]:
#Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [116]:
df[df.path != '/'].path.value_counts()

javascript-i                                                    18203
toc                                                             17591
search/search_index.json                                        17534
java-iii                                                        13166
html-css                                                        13127
                                                                ...  
content/examples/javascript/primitive-types.html                    1
content/examples/javascript/conditionals.html                       1
2-storytelling/1-overview/www.qlik.com                              1
syntax-types-and-variables                                          1
appendix/professional-development/post-interview-review-form        1
Name: path, Length: 2223, dtype: int64

In [144]:
def parse_path(path):
    parts = path.split("/")
    output = {}
    if len(parts) == "/":
        output['primary_topic'] = 'None'
        output['subtopic'] = 'None'
        output['tertiary'] = 'None'  
    elif len(parts) == 1:
        output['primary_topic'] = parts[0]
        output['subtopic'] = 'None'
        output['tertiary'] = 'None'
    elif len(parts) == 2:
        output['primary_topic'] = parts[0]
        output['subtopic'] = parts[1]
        output['tertiary'] = 'None'
    else: 
        output['primary_topic'] = parts[0]
        output['subtopic'] = parts[1]
        output['tertiary'] = parts[2]
    return pd.Series(output)

In [145]:
parsed_path = df.path.apply(parse_path)

In [146]:
parsed_path

Unnamed: 0,primary_topic,subtopic,tertiary
0,,,
1,java-ii,,
2,java-ii,object-oriented-programming,
3,slides,object_oriented_programming,
4,javascript-i,conditionals,
...,...,...,...
900218,jquery,personal-site,
900219,jquery,mapbox-api,
900220,jquery,ajax,weather-map
900221,anomaly-detection,discrete-probabilistic-methods,


In [143]:
parsed_path.tertiary.value_counts()

None                                               629814
working-with-data-types-operators-and-variables      7330
flexbox                                              6867
bootstrap-grid-system                                6422
bootstrap-introduction                               5964
                                                    ...  
2-Overview                                              1
Selecting_a_hypothesis_test.svg                         1
sw-planning.md                                          1
constructors-destructors.html                           1
post-interview-review-form                              1
Name: tertiary, Length: 325, dtype: int64

In [148]:
parsed_path['primary_topic'].unique()

array(['', 'java-ii', 'slides', 'javascript-i', 'mkdocs', 'git', 'spring',
       'appendix', 'index.html', 'java-i', 'html-css', 'examples',
       'javascript', 'mysql', 'content', 'jquery', 'java',
       'javascript-ii', 'teams', 'java-iii', 'prework', 'asdf', 'css',
       'single-page.html', 'home', 'assets', 'forms', 'css-i',
       'alumni-tech-survey-2018', 'alumni-tech-survey-2018.html', 'es6',
       'introduction-to-java', 'strings', 'methods', 'introduction',
       'elements', 'file-io', 'css-ii', 'functions',
       'javascript-with-html', 'conditionals', 'bom-and-dom', 'mvc',
       'students', 'fundamentals', 'setup', 'group-by',
       'finish-the-adlister', 'essential-methods', 'uploads', 'ajax',
       'student', 'hfdgafdja', 'php', '.git', '.gitignore', 'toc',
       'wp-admin', 'wp-login', 'registerUser', 'search', 'pre-work',
       'learn-to-code', 'capstone-workbook', 'jsp-and-jstl', 'html',
       'handouts', 'javascript-functions', 'login', 'quize', 'cohorts'

In [150]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,900213,900214,900215,900216,900217,900218,900219,900220,900221,900222
date,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,2018-01-26,...,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21,2021-04-21
time,09:55:03,09:56:02,09:56:05,09:56:06,09:56:24,09:56:41,09:56:46,09:56:48,09:56:59,09:58:26,...,16:38:14,16:41:29,16:41:31,16:41:49,16:41:51,16:41:51,16:42:02,16:42:09,16:44:37,16:44:39
path,/,java-ii,java-ii/object-oriented-programming,slides/object_oriented_programming,javascript-i/conditionals,javascript-i/loops,javascript-i/conditionals,javascript-i/functions,javascript-i/loops,javascript-i/functions,...,java-iii/servlets,javascript-i,javascript-ii,jquery,javascript-i/bom-and-dom/dom,jquery/personal-site,jquery/mapbox-api,jquery/ajax/weather-map,anomaly-detection/discrete-probabilistic-methods,jquery/mapbox-api
user_id,1,1,1,1,2,2,3,3,2,4,...,834,64,64,64,875,64,64,64,744,64
cohort_id,8.0,8.0,8.0,8.0,22.0,22.0,22.0,22.0,22.0,22.0,...,134.0,28.0,28.0,28.0,135.0,28.0,28.0,28.0,28.0,28.0
ip,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,97.105.19.61,...,67.11.50.23,71.150.217.33,71.150.217.33,71.150.217.33,24.242.150.231,71.150.217.33,71.150.217.33,71.150.217.33,24.160.137.86,71.150.217.33
name,Hampton,Hampton,Hampton,Hampton,Teddy,Teddy,Teddy,Teddy,Teddy,Teddy,...,Luna,Staff,Staff,Staff,Marco,Staff,Staff,Staff,Staff,Staff
start_date,2015-09-22,2015-09-22,2015-09-22,2015-09-22,2018-01-08,2018-01-08,2018-01-08,2018-01-08,2018-01-08,2018-01-08,...,2020-12-07,2014-02-04,2014-02-04,2014-02-04,2021-01-25,2014-02-04,2014-02-04,2014-02-04,2014-02-04,2014-02-04
end_date,2016-02-06,2016-02-06,2016-02-06,2016-02-06,2018-05-17,2018-05-17,2018-05-17,2018-05-17,2018-05-17,2018-05-17,...,2021-06-08,2014-02-04,2014-02-04,2014-02-04,2021-07-19,2014-02-04,2014-02-04,2014-02-04,2014-02-04,2014-02-04
created_at,2016-06-14 19:52:26,2016-06-14 19:52:26,2016-06-14 19:52:26,2016-06-14 19:52:26,2018-01-08 13:59:10,2018-01-08 13:59:10,2018-01-08 13:59:10,2018-01-08 13:59:10,2018-01-08 13:59:10,2018-01-08 13:59:10,...,2020-12-07 16:58:43,2018-12-06 17:04:19,2018-12-06 17:04:19,2018-12-06 17:04:19,2021-01-20 21:31:11,2018-12-06 17:04:19,2018-12-06 17:04:19,2018-12-06 17:04:19,2018-12-06 17:04:19,2018-12-06 17:04:19
