## Lab 7. Build a content-based recommender system for educational courses

### Deadline

⏰ Thursday, May 30, 2019, 23:59.

### Task

Using the available data from the eclass.cc portal, build content-based recommendations for educational courses.

#### Input data

The following input data is provided:
* A dataset describing all courses. Take it from HDFS at `/labs/lab07data/DO_record_per_line.json`
* The ids of the courses that need recommendations (listed in your [Personal Account](http://lk.newprolab.com/lab/laba07))

#### Courses to make recommendations for

| id | lang | name |
|---|---|---|
| 74 | en | The Dynamic Earth: A Course for Educators |
| 11821 | en | Real Estate Investing II: Financing Your Property |
| 23115 | es | Cómo estructurar y redactar tu tesis de investigación |
| 21704 | es | Excel |
| 1256 | ru | Visual Basic .NET |
| 21404 | ru | Современные стратегии реализации дошкольного образования |

#### Output data

For each course id, return the top 10 courses most similar to it. Recommended courses must be in the same language as the course they are recommended for.

To select recommendations, use the TF\*IDF measure, and rank candidates by the cosine of the angle between the TF\*IDF vectors of the courses.
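
As a rough illustration of the ranking metric, the sketch below computes TF\*IDF weights and cosine similarity in plain Python. The function names, the toy token lists, and the smoothed IDF formula are assumptions made for the example; the task statement does not fix a particular IDF variant.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, one common variant (an assumption for this sketch).
    idf = {t: math.log((n_docs + 1) / (c + 1)) + 1 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up token lists standing in for tokenized course descriptions.
docs = [
    ["excel", "spreadsheet", "formulas"],
    ["excel", "spreadsheet", "charts"],
    ["real", "estate", "investing"],
]
vectors = tfidf_vectors(docs)
sim_close = cosine(vectors[0], vectors[1])  # shares two terms
sim_far = cosine(vectors[0], vectors[2])    # shares no terms
```

Documents that share no terms get a similarity of 0, and a document compared with itself gets 1, so sorting by descending cosine puts the closest descriptions first.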

Compute TF\*IDF over the course descriptions. When extracting words from a description, a word is any sequence of Latin or Cyrillic letters or digits; punctuation and other symbols are ignored.

The following code can be used to find words:
```
import re

# `string` is a course description; a word is a run of 2+ word characters
regex = re.compile(r'[\w\d]{2,}', re.U)
words = regex.findall(string.lower())
```
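
To see what this pattern actually keeps, here is a small usage example on a made-up string (the `tokenize` helper and the sample text are illustrative only). Note that `\w` also matches the underscore, and that single-character tokens are dropped by the `{2,}` quantifier.

```python
import re

WORD_RE = re.compile(r'[\w\d]{2,}', re.U)


def tokenize(text):
    """Lowercase the text and return runs of two or more word characters."""
    return WORD_RE.findall(text.lower())


sample = "Excel 2016, часть 2: VBA_macros!"
tokens = tokenize(sample)
print(tokens)  # → ['excel', '2016', 'часть', 'vba_macros']
```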

The output format is JSON with the following structure:

```
{
  "123": [5372, 16663, 23114, 13079, 13084, ...],
  "456": [...],
  "789": [...],
  "123456": [...],
  "456789": [...],
  "987654": [...]
}
```

The keys are the ids of the courses for which recommendations are built. The value for each key is an array of recommended course ids, sorted by descending metric. When metric values are equal, courses are sorted lexicographically by name.

A very rare situation is also possible (mostly with Russian-language courses) in which two duplicates of the same course with different ids both land in the recommendations. Such duplicates are few relative to the number of courses, but it is still recommended to sort in this order: by metric (descending) => by name (lexicographically ascending) => by id (ascending).
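
The three-level ordering can be expressed as a single sort key. The sketch below is a minimal illustration; `rank_recommendations` is a hypothetical helper, and the ids, names, and scores are invented for the example.

```python
def rank_recommendations(candidates, top_n=10):
    """candidates: iterable of (course_id, name, score) tuples.

    Orders by score descending, then name ascending, then id ascending.
    """
    ordered = sorted(candidates, key=lambda c: (-c[2], c[1], c[0]))
    return [course_id for course_id, _name, _score in ordered[:top_n]]


# Invented candidates: two duplicates ("Course B") tie on score and name.
candidates = [
    (104, "Course B", 0.15),
    (102, "Course B", 0.15),
    (101, "Course A", 0.16),
    (103, "Course C", 0.15),
]
ranked = rank_recommendations(candidates)
print(ranked)  # → [101, 102, 104, 103]
```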

**When computing TF with `HashingTF`, the number of features used was 10000. That is: `tf = HashingTF(10000)`.**
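
A hashing-based TF maps every token to one of a fixed number of buckets, so no vocabulary has to be built. A minimal sketch of the idea follows; `hashed_tf` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash, so bucket indices will not match those produced by the `HashingTF` class referenced in the note.

```python
import zlib

NUM_FEATURES = 10000  # the feature count stated in the note


def hashed_tf(tokens, num_features=NUM_FEATURES):
    """Return a sparse term-frequency vector as {bucket_index: count}."""
    vector = {}
    for token in tokens:
        # Stand-in hash (assumption): the real HashingTF uses its own function.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vector[bucket] = vector.get(bucket, 0) + 1
    return vector


tf_vector = hashed_tf(["excel", "курс", "excel"])
```

Different tokens may collide in the same bucket; a larger feature count makes such collisions rarer at the cost of longer vectors.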

### Checking

The check is performed against the recommendations of the current recommender system on eclass.cc. To pass the lab, for every course that needs recommendations, the overlap between your recommended courses and the current system's results must be **at least 20%.**

Place the file in your home directory under the name `lab07.json`. The check is run from your [Personal Account](http://lk.newprolab.com/lab/laba07). In the checker, the value shown for each course contains the ids and the overlap share for that specific course.
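
The exact formula used by the checker is not given here. Assuming the overlap is the share of recommended ids that also appear in the reference list, it could be estimated as in the sketch below (`overlap_share` and the sample ids are hypothetical).

```python
def overlap_share(recommended, reference):
    """Share of recommended ids that also occur in the reference list."""
    if not recommended:
        return 0.0
    return len(set(recommended) & set(reference)) / len(recommended)


# Invented lists: three of the ten recommended ids match the reference.
recommended = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
reference = [1, 2, 3, 11, 12, 13, 14, 15, 16, 17]
share = overlap_share(recommended, reference)
passes = share >= 0.2  # the 20% threshold from the task
```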


## Solution

In [47]:
import numpy as np
import pandas as pd
import re
import json
from pandas.io.json import json_normalize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### Load data

In [1]:
!hadoop fs -copyToLocal /labs/lab07data/DO_record_per_line.json ~/

In [27]:
# Load data which contains json strings separated by newline
data = pd.read_csv("DO_record_per_line.json", header=None, sep="\n")
data.head()

,0
0,"{""lang"": ""en"", ""name"": ""Accounting Cycle: The ..."
1,"{""lang"": ""en"", ""name"": ""American Counter Terro..."
2,"{""lang"": ""fr"", ""name"": ""Arithm\u00e9tique: en ..."
3,"{""lang"": ""en"", ""name"": ""Becoming a Dynamic Edu..."
4,"{""lang"": ""en"", ""name"": ""Bioethics"", ""cat"": ""2/..."
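
Since each line of the file is a standalone JSON object, pandas can also parse it directly with `read_json(..., lines=True)`, which avoids relying on `read_csv` with `sep="\n"` (a combination that newer pandas releases may reject). The sketch below uses two made-up records in an in-memory buffer; with the real file, the path `DO_record_per_line.json` would be passed instead.

```python
import io

import pandas as pd

# Two invented records in the same one-object-per-line layout.
sample = io.StringIO(
    '{"lang": "en", "name": "Course A", "cat": "11/law", '
    '"provider": "P", "id": 4, "desc": "First description"}\n'
    '{"lang": "ru", "name": "Course B", "cat": "", '
    '"provider": "P", "id": 5, "desc": "Second description"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON document.
df_alt = pd.read_json(sample, lines=True).set_index("id")
```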


In [31]:
# Review the 1st line with a json string
data.loc[0, 0]

'{"lang": "en", "name": "Accounting Cycle: The Foundation of Business Measurement and Reporting", "cat": "3/business_management|6/economics_finance", "provider": "Canvas Network", "id": 4, "desc": "This course introduces the basic financial statements used by most businesses, as well as the essential tools used to prepare them. This course will serve as a resource to help business students succeed in their upcoming university-level accounting classes, and as a refresher for upper division accounting students who are struggling to recall elementary concepts essential to more advanced accounting topics. Business owners will also benefit from this class by gaining essential skills necessary to organize and manage information pertinent to operating their business. At the conclusion of the class, students will understand the balance sheet, income statement, and cash flow statement. They will be able to differentiate between cash basis and accrual basis techniques, and know when each is appr

In [33]:
# Normalize data to create a dataframe
df = json_normalize(data[0].apply(json.loads))
df.head()

,cat,desc,id,lang,name,provider
0,3/business_management|6/economics_finance,This course introduces the basic financial sta...,4,en,Accounting Cycle: The Foundation of Business M...,Canvas Network
1,11/law,This online course will introduce you to Ameri...,5,en,American Counter Terrorism Law,Canvas Network
2,5/computer_science|15/mathematics_statistics_a...,This course is taught in French Vous voulez co...,6,fr,Arithmétique: en route pour la cryptographie,Canvas Network
3,14/social_sciences,We live in a digitally connected world. The wa...,7,en,Becoming a Dynamic Educator,Canvas Network
4,2/biology_life_sciences,This self-paced course is designed to show tha...,8,en,Bioethics,Canvas Network


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28153 entries, 0 to 28152
Data columns (total 6 columns):
cat         28153 non-null object
desc        28153 non-null object
id          28153 non-null int64
lang        28153 non-null object
name        28153 non-null object
provider    28153 non-null object
dtypes: int64(1), object(5)
memory usage: 1.3+ MB


In [48]:
# Set index as id
df = df.set_index("id")
df.head()

id,cat,desc,lang,name,provider
4,3/business_management|6/economics_finance,This course introduces the basic financial sta...,en,Accounting Cycle: The Foundation of Business M...,Canvas Network
5,11/law,This online course will introduce you to Ameri...,en,American Counter Terrorism Law,Canvas Network
6,5/computer_science|15/mathematics_statistics_a...,This course is taught in French Vous voulez co...,fr,Arithmétique: en route pour la cryptographie,Canvas Network
7,14/social_sciences,We live in a digitally connected world. The wa...,en,Becoming a Dynamic Educator,Canvas Network
8,2/biology_life_sciences,This self-paced course is designed to show tha...,en,Bioethics,Canvas Network


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28153 entries, 4 to 28317
Data columns (total 5 columns):
cat         28153 non-null object
desc        28153 non-null object
lang        28153 non-null object
name        28153 non-null object
provider    28153 non-null object
dtypes: object(5)
memory usage: 1.3+ MB


### Review courses

In [116]:
# Id of courses to make recommendations
idxs = [74, 11821, 1256, 21704, 21404, 23115]

In [117]:
# Look at the courses that need recommendations: the top 10 most similar courses in the same language
df.query("index in @idxs")

id,cat,desc,lang,name,provider
74,7/energy_earth_sciences|9/humanities|14/social...,How and why is the Earth constantly changing? ...,en,The Dynamic Earth: A Course for Educators,Coursera
1256,5/computer_science,"Этот курс с помощью пошаговых упражнений, прим...",ru,Visual Basic .NET,Intuit
11821,6/economics_finance,Discover the tools professional investors use ...,en,Real Estate Investing II: Financing Your Property,ed2go
21404,,Что входит в современную стратегию развития до...,ru,Современные стратегии реализации дошкольного о...,Universarium
21704,5/computer_science,En este curso aprenderás las herramientas más ...,es,Excel,edX
23115,,Aprende a estructurar tu tesis y redactar cad...,es,Cómo estructurar y redactar tu tesis de invest...,Udemy


### Split courses in required languages into different dataframes
#### English

In [53]:
# English courses
df_en = df.query("lang=='en'")
df_en.head()

id,cat,desc,lang,name,provider
4,3/business_management|6/economics_finance,This course introduces the basic financial sta...,en,Accounting Cycle: The Foundation of Business M...,Canvas Network
5,11/law,This online course will introduce you to Ameri...,en,American Counter Terrorism Law,Canvas Network
7,14/social_sciences,We live in a digitally connected world. The wa...,en,Becoming a Dynamic Educator,Canvas Network
8,2/biology_life_sciences,This self-paced course is designed to show tha...,en,Bioethics,Canvas Network
9,9/humanities|15/mathematics_statistics_and_dat...,This game-based course provides prospective st...,en,"College Foundations: Reading, Writing, and Math",Canvas Network


In [54]:
df_en.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24553 entries, 4 to 28317
Data columns (total 5 columns):
cat         24553 non-null object
desc        24553 non-null object
lang        24553 non-null object
name        24553 non-null object
provider    24553 non-null object
dtypes: object(5)
memory usage: 1.1+ MB


In [55]:
vect_en = TfidfVectorizer(max_features=10000)

In [58]:
# Make bag of English words
%time X_en = vect_en.fit_transform(df_en.desc).toarray()
X_en.shape

(24553, 10000)

In [60]:
# Calculate similarity between English courses
%time cos_en = cosine_similarity(X_en)
cos_en.shape

CPU times: user 4min 49s, sys: 1min 44s, total: 6min 34s
Wall time: 22.3 s


(24553, 24553)
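
The full English similarity matrix above has 24553 × 24553 entries, roughly 4.8 GB at float64, although only two of its rows are used later. A lighter variant, sketched below under stated assumptions, keeps the TF\*IDF matrix sparse and computes similarities only for the target rows. It also passes the regex from the task statement as `token_pattern`. The descriptions and ids are invented stand-ins for `df_en.desc` and `df_en.index`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented stand-ins for the course descriptions and their ids.
descriptions = [
    "earth science for educators",
    "dynamic earth and plate tectonics",
    "real estate investing basics",
]
course_ids = [1, 2, 3]
target_ids = [1]

# token_pattern reuses the word regex given in the task statement.
vect = TfidfVectorizer(max_features=10000, token_pattern=r'[\w\d]{2,}')
X = vect.fit_transform(descriptions)  # stays sparse: no .toarray()

# Compare only the target rows against all courses.
target_rows = [course_ids.index(i) for i in target_ids]
sims = cosine_similarity(X[target_rows], X)  # shape: (n_targets, n_courses)
```

With this layout the result has one row per target course, so memory grows with the number of targets rather than with the square of the catalogue size.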

In [64]:
# Save similarity to a DF with original courses id as index/columns
%time df_cos_en = pd.DataFrame(cos_en, index=df_en.index, columns=df_en.index)
df_cos_en.head()

id,4,5,7,8,9,10,11,12,13,14,...,28306,28307,28309,28310,28311,28312,28313,28314,28315,28317
4,1.0,0.136894,0.139506,0.158058,0.149196,0.09731,0.153044,0.118953,0.054054,0.120793,...,0.105373,0.137813,0.065188,0.097138,0.156658,0.057127,0.07732,0.094241,0.137029,0.171833
5,0.136894,1.0,0.183893,0.1631,0.085941,0.091353,0.127471,0.113357,0.066714,0.13796,...,0.122002,0.141799,0.121862,0.128902,0.19962,0.087309,0.126698,0.109233,0.162364,0.215739
7,0.139506,0.183893,1.0,0.142997,0.111101,0.175351,0.144408,0.137336,0.071633,0.132505,...,0.113431,0.181714,0.113134,0.10396,0.191766,0.079853,0.12859,0.100633,0.179663,0.255795
8,0.158058,0.1631,0.142997,1.0,0.129671,0.10991,0.118275,0.094672,0.0813,0.127326,...,0.102216,0.121183,0.071505,0.10795,0.172319,0.072124,0.100012,0.099122,0.159613,0.172922
9,0.149196,0.085941,0.111101,0.129671,1.0,0.053981,0.08199,0.087311,0.191186,0.115658,...,0.066103,0.109611,0.042187,0.065393,0.111199,0.044974,0.057033,0.058841,0.111989,0.159206


In [105]:
# recommendations for course with id=74
r74 = df_cos_en.loc[74].nlargest(11)[1:11]
r74

id
22656    0.606124
89       0.578718
1676     0.552499
9684     0.550615
7635     0.550475
4279     0.541743
5784     0.530061
75       0.509401
7612     0.467865
238      0.463531
Name: 74, dtype: float64

In [120]:
# recommendations for course with id=11821
r11821 = df_cos_en.loc[11821].nlargest(11)[1:11]
r11821

id
26350    0.284575
9762     0.283353
14380    0.272636
22293    0.269856
24807    0.261708
24183    0.257700
18426    0.254538
7483     0.253510
11964    0.249346
1902     0.244060
Name: 11821, dtype: float64

#### Russian

In [80]:
df_ru = df.query("lang=='ru'")
df_ru.head()

id,cat,desc,lang,name,provider
46,5/computer_science,Часть 1. Продвинутые структуры данных\r\nПриор...,ru,Дополнительные главы алгоритмов,Computer Science Center
47,5/computer_science,Splay-дерево и декартово дерево\r\nХеширование...,ru,Алгоритмы и структуры данных 2,Computer Science Center
48,5/computer_science,Курс посвящён теоретическим и практическим асп...,ru,Технологии хранения и обработки больших объёмо...,Computer Science Center
49,2/biology_life_sciences|5/computer_science,Биоинформатика — это быстро растущий раздел co...,ru,Алгоритмы в биоинформатике,Computer Science Center
50,5/computer_science|15/mathematics_statistics_a...,Курс знакомит со сложностью вероятностных вычи...,ru,Сложность вычислений и основы криптографии,Computer Science Center


In [81]:
df_ru.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1231 entries, 46 to 28290
Data columns (total 5 columns):
cat         1231 non-null object
desc        1231 non-null object
lang        1231 non-null object
name        1231 non-null object
provider    1231 non-null object
dtypes: object(5)
memory usage: 57.7+ KB


In [82]:
vect_ru = TfidfVectorizer(max_features=10000)

In [83]:
# Make bag of Russian words
%time X_ru = vect_ru.fit_transform(df_ru.desc).toarray()
X_ru.shape

CPU times: user 140 ms, sys: 40 ms, total: 180 ms
Wall time: 177 ms


(1231, 10000)

In [84]:
# Calculate similarity between Russian courses
%time cos_ru = cosine_similarity(X_ru)
cos_ru.shape

CPU times: user 2.73 s, sys: 328 ms, total: 3.06 s
Wall time: 136 ms


(1231, 1231)

In [85]:
# Save similarity to a DF with original courses id as index/columns
%time df_cos_ru = pd.DataFrame(cos_ru, index=df_ru.index, columns=df_ru.index)
df_cos_ru.head()

CPU times: user 0 ns, sys: 4 ms, total: 4 ms
Wall time: 432 µs


id,46,47,48,49,50,51,52,53,54,55,...,27383,27534,27858,27941,28005,28074,28075,28212,28245,28290
46,1.0,0.100534,0.010461,0.026975,0.00776,0.028751,0.020654,0.003199,0.003894,0.0,...,0.0,0.016438,0.0,0.0055,0.0,0.010876,0.002805,0.0,0.005185,0.017072
47,0.100534,1.0,0.00402,0.035679,0.064843,0.0,0.0123,0.001628,0.0,0.045873,...,0.0,0.005803,0.0,0.022195,0.0,0.006054,0.002111,0.0,0.005442,0.0
48,0.010461,0.00402,1.0,0.071906,0.054878,0.018257,0.037081,0.017578,0.022498,0.046588,...,0.037665,0.0081,0.043466,0.010801,0.032783,0.038692,0.028087,0.007181,0.015059,0.027777
49,0.026975,0.035679,0.071906,1.0,0.028187,0.023986,0.012038,0.017116,0.026118,0.023255,...,0.0,0.007638,0.0,0.007597,0.010873,0.040925,0.028749,0.0,0.013107,0.0
50,0.00776,0.064843,0.054878,0.028187,1.0,0.023168,0.049882,0.055441,0.031268,0.00611,...,0.007918,0.003531,0.005439,0.008299,0.0,0.065364,0.037342,0.028002,0.003312,0.067152


In [107]:
# recommendations for course with id=1256
r1256 = df_cos_ru.loc[1256].nlargest(11)[1:11]
r1256

id
20292    1.000000
1285     0.163715
1011     0.157126
819      0.152944
20307    0.152944
1060     0.139839
1348     0.134279
1369     0.130401
1228     0.127044
960      0.124576
Name: 1256, dtype: float64
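
The `nlargest(11)[1:11]` idiom assumes the course itself always comes first and leaves ties in index order. When an exact duplicate also scores 1.0, as here, or when several courses share a score, that can differ from the ordering the task asks for. A hedged alternative is sketched below: `top_similar` is a hypothetical helper that drops the target by id and then sorts by score, name, and id. The toy data is invented.

```python
import pandas as pd


def top_similar(sim_row, names, target_id, n=10):
    """sim_row: similarities indexed by course id; names: course names by id."""
    ranked = (
        pd.DataFrame({"score": sim_row, "name": names.reindex(sim_row.index)})
        .drop(index=target_id)  # exclude the course itself explicitly
        .rename_axis("id")
        .reset_index()
        # score descending, then name ascending, then id ascending
        .sort_values(["score", "name", "id"], ascending=[False, True, True])
    )
    return ranked["id"].head(n).tolist()


# Invented data: course 3 duplicates target 1 and precedes it in the index.
sim_row = pd.Series({3: 1.0, 1: 1.0, 2: 0.5, 4: 0.5})
names = pd.Series({1: "A", 2: "Zeta", 3: "A", 4: "Alpha"})
result = top_similar(sim_row, names, target_id=1, n=3)
print(result)  # → [3, 4, 2]
```

In the notebook this would be called as `top_similar(df_cos_ru.loc[1256], df_ru["name"], 1256)`. Ties are detected by exact float equality, so rounding the scores first may be needed if near-equal values should count as equal.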

In [108]:
# recommendations for course with id=21404
r21404 = df_cos_ru.loc[21404].nlargest(11)[1:11]
r21404

id
21403    0.273785
1052     0.139004
21042    0.128938
1288     0.127693
20368    0.127693
992      0.122095
1057     0.108524
1349     0.104433
8298     0.089629
21088    0.087661
Name: 21404, dtype: float64

#### Spanish

In [95]:
df_es = df.query("lang=='es'")
df_es.head()

id,cat,desc,lang,name,provider
59,,A través de diferentes actividades de campo el...,es,El ABC del emprendimiento esbelto,Coursera
124,2/biology_life_sciences|9/humanities,Aprenderemos cómo podemos usar el pensamiento ...,es,Pensamiento Científico,Coursera
160,8/engineering_technology|9/humanities|14/socia...,¡Claro que todos podemos potenciar nuestra cre...,es,Ser más creativos,Coursera
166,7/energy_earth_sciences|9/humanities|13/physic...,Este curso provee al estudiante con conceptos ...,es,Conceptos y Herramientas para la Física Univer...,Coursera
196,9/humanities,Este curso introduce a los estudiantes de grad...,es,Egiptología (Egyptology),Coursera


In [96]:
df_es.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1374 entries, 59 to 28316
Data columns (total 5 columns):
cat         1374 non-null object
desc        1374 non-null object
lang        1374 non-null object
name        1374 non-null object
provider    1374 non-null object
dtypes: object(5)
memory usage: 64.4+ KB


In [97]:
vect_es = TfidfVectorizer(max_features=10000)

In [98]:
# Make bag of Spanish words
%time X_es = vect_es.fit_transform(df_es.desc).toarray()
X_es.shape

CPU times: user 640 ms, sys: 56 ms, total: 696 ms
Wall time: 695 ms


(1374, 10000)

In [99]:
# Calculate similarity between Spanish courses
%time cos_es = cosine_similarity(X_es)
cos_es.shape

CPU times: user 2.84 s, sys: 344 ms, total: 3.18 s
Wall time: 134 ms


(1374, 1374)

In [100]:
# Save similarity to a DF with original courses id as index/columns
%time df_cos_es = pd.DataFrame(cos_es, index=df_es.index, columns=df_es.index)
df_cos_es.head()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 313 µs


id,59,124,160,166,196,198,252,272,273,386,...,28003,28060,28104,28196,28197,28256,28293,28305,28308,28316
59,1.0,0.176484,0.253005,0.620437,0.346153,0.555848,0.131815,0.516265,0.523597,0.288639,...,0.021902,0.04628,0.012937,0.008466,0.076062,0.034706,0.611635,0.186674,0.305578,0.512929
124,0.176484,1.0,0.326396,0.154654,0.271194,0.213047,0.103031,0.139288,0.05882,0.2072,...,0.045815,0.034386,0.014053,0.014444,0.052614,0.057494,0.062809,0.074058,0.070951,0.110973
160,0.253005,0.326396,1.0,0.183561,0.285323,0.275872,0.136033,0.207617,0.096552,0.258254,...,0.020012,0.04215,0.020907,0.02451,0.062677,0.036572,0.092382,0.105128,0.100419,0.144274
166,0.620437,0.154654,0.183561,1.0,0.265303,0.494196,0.094684,0.560361,0.628555,0.275363,...,0.009847,0.034417,0.017605,0.01293,0.035282,0.022621,0.672668,0.174325,0.322656,0.48579
196,0.346153,0.271194,0.285323,0.265303,1.0,0.342594,0.166987,0.249826,0.185296,0.362461,...,0.0213,0.049835,0.013936,0.006694,0.078546,0.038244,0.21705,0.122296,0.155739,0.24265


In [109]:
# recommendations for course with id=21704
r21704 = df_cos_es.loc[21704].nlargest(11)[1:11]
r21704

id
12247    0.421412
5687     0.417660
12863    0.411200
18813    0.400701
23506    0.391341
5558     0.387094
17964    0.383548
12660    0.383405
9563     0.381669
11575    0.375659
Name: 21704, dtype: float64

In [110]:
# recommendations for course with id=23115
r23115 = df_cos_es.loc[23115].nlargest(11)[1:11]
r23115

id
6863     0.677029
19967    0.599954
20053    0.555117
20277    0.464146
20215    0.431298
20251    0.287065
23629    0.240561
17838    0.240361
4714     0.238875
6864     0.238516
Name: 23115, dtype: float64

### Save to JSON

In [121]:
d = {"74": r74.index.tolist(),
     "11821": r11821.index.tolist(),
     "1256": r1256.index.tolist(),
     "21404": r21404.index.tolist(),
     "21704": r21704.index.tolist(),
     "23115": r23115.index.tolist()}
d

{'74': [22656, 89, 1676, 9684, 7635, 4279, 5784, 75, 7612, 238],
 '11821': [26350, 9762, 14380, 22293, 24807, 24183, 18426, 7483, 11964, 1902],
 '1256': [20292, 1285, 1011, 819, 20307, 1060, 1348, 1369, 1228, 960],
 '21404': [21403, 1052, 21042, 1288, 20368, 992, 1057, 1349, 8298, 21088],
 '21704': [12247, 5687, 12863, 18813, 23506, 5558, 17964, 12660, 9563, 11575],
 '23115': [6863, 19967, 20053, 20277, 20215, 20251, 23629, 17838, 4714, 6864]}

In [122]:
# Preview json string
json.dumps(d)

'{"74": [22656, 89, 1676, 9684, 7635, 4279, 5784, 75, 7612, 238], "11821": [26350, 9762, 14380, 22293, 24807, 24183, 18426, 7483, 11964, 1902], "1256": [20292, 1285, 1011, 819, 20307, 1060, 1348, 1369, 1228, 960], "21404": [21403, 1052, 21042, 1288, 20368, 992, 1057, 1349, 8298, 21088], "21704": [12247, 5687, 12863, 18813, 23506, 5558, 17964, 12660, 9563, 11575], "23115": [6863, 19967, 20053, 20277, 20215, 20251, 23629, 17838, 4714, 6864]}'

In [123]:
# Save the dictionary to a json file
with open('../../lab07.json', 'w') as f:
    json.dump(d, f)
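
Before submitting, a quick sanity check of the dictionary can catch format slips. The helper below is a hypothetical sketch that only verifies the shape described in the task (string keys, ten unique ids per course, no course recommending itself). It says nothing about recommendation quality.

```python
def validate_submission(payload, expected_ids, top_n=10):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for course_id in expected_ids:
        recs = payload.get(str(course_id))
        if recs is None:
            problems.append(f"missing key {course_id}")
            continue
        if len(recs) != top_n:
            problems.append(f"{course_id}: expected {top_n} ids, got {len(recs)}")
        if len(set(recs)) != len(recs):
            problems.append(f"{course_id}: duplicate ids")
        if course_id in recs:
            problems.append(f"{course_id}: recommends itself")
    return problems


# Invented payloads: one well-formed, one with three separate problems.
good = {"1": list(range(2, 12))}
bad = {"1": [1, 2, 2]}
good_problems = validate_submission(good, [1])
bad_problems = validate_submission(bad, [1])
```

In the notebook the call would be `validate_submission(d, idxs)`.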

In [124]:
# Save a second copy of the dictionary under a different file name
with open('../../lab07s.json', 'w') as f:
    json.dump(d, f)