# Data formats (1)

Materials:
* Makrushin S.V. "Lecture 4: Data formats"
* https://docs.python.org/3/library/json.html
* https://docs.python.org/3/library/pickle.html
* https://www.crummy.com/software/BeautifulSoup/bs4/doc.ru/bs4ru.html
* Wes McKinney. Python for Data Analysis

## Problems to work through together

1. Print all email addresses contained in the address book `addres-book.json`

In [1]:
import json

with open('data/addres-book.json', 'r', encoding='utf-8') as f:
    ab_json = json.load(f)

In [2]:
emails = [i['email'] for i in ab_json]
print(emails)

['faina@mail.ru', 'robert@mail.ru']


2. Print the phone numbers contained in the address book `addres-book.json`

In [3]:
phones = [j['phone'] for i in ab_json for j in i['phones']]
print(phones)

['232-19-55', '+7 (916) 232-19-55', '111-19-55', '+7 (916) 445-19-55']
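Both comprehensions above assume that every entry has an `email` key and a `phones` list. The sketch below shows one way to tolerate missing fields. The sample data and the helper name `collect_contacts` are made up for illustration; only the keys `email`, `phones` and `phone` come from the code above.

```python
import json

# Hypothetical sample shaped like the address book used above (an assumption):
# each entry may have an "email" and a list of "phones" dictionaries.
SAMPLE = '''
[
  {"name": "A", "email": "a@example.com",
   "phones": [{"type": "work", "phone": "111-11-11"},
              {"type": "mobile", "phone": "222-22-22"}]},
  {"name": "B", "phones": []}
]
'''

def collect_contacts(entries):
    """Return (emails, phones), skipping entries that lack a field."""
    emails = [e["email"] for e in entries if "email" in e]
    phones = [p["phone"] for e in entries for p in e.get("phones", [])]
    return emails, phones

emails, phones = collect_contacts(json.loads(SAMPLE))
print(emails)
print(phones)
```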


3. Using the data in `addres-book-q.xml`, build a list of dictionaries with each person's phone numbers.

In [4]:
# pip install beautifulsoup4
# pip install lxml

from bs4 import BeautifulSoup

with open('data/addres-book-q.xml', 'r', encoding='utf-8') as f:
    ab_xml = BeautifulSoup(f, features="xml")

In [5]:
name_phones = dict(
    [(address.find('name').contents[0],
      [phone.contents[0]
       for phones in address.find_all('phones')
       for phone in phones.find_all('phone')])
     for country in ab_xml.address_book.find_all('country')
     for address in country.find_all('address')])
name_phones

{'Aicha Barki': ['+ (213) 6150 4015', '+ (213) 2173 5247'],
 'Francisco Domingos': ['+ (244-2) 325 023', '+ (244-2) 325 023'],
 'Maria Luisa': ['+ (244) 4232 2836'],
 'Abraao Chanda': ['+ (244-2) 325 023', '+ (244-2) 325 023'],
 'Beatriz Busaniche': ['+ (54-11) 4784 1159'],
 'Francesca Beddie': ['+ (61-2) 6274 9500', '+ (61-2) 6274 9513'],
 'Graham John Smith': ['+ (61-3) 9807 4702']}
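The task asks for a list of dictionaries, while the cell above builds one dictionary keyed by name, so two people with the same name would collapse into a single entry. A minimal sketch of the list form follows. It uses the standard-library `xml.etree.ElementTree` instead of BeautifulSoup, and the XML fragment is a made-up sample: the tag names mirror the code above, but the exact layout of the real file is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real file may contain more fields.
SAMPLE_XML = """
<address_book>
  <country name="x">
    <address>
      <name>Ann</name>
      <phones><phone>111</phone><phone>222</phone></phones>
    </address>
    <address>
      <name>Bob</name>
      <phones><phone>333</phone></phones>
    </address>
  </country>
</address_book>
"""

def phones_by_person(xml_text):
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the country level needs no explicit loop.
    return [
        {"name": address.findtext("name"),
         "phones": [phone.text for phone in address.iter("phone")]}
        for address in root.iter("address")
    ]

people = phones_by_person(SAMPLE_XML)
print(people)
```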

## Lab 4

### JSON

1.1 Read the file `contributors_sample.json`. Using the `json` module, convert its contents into the corresponding Python objects. Print the information about the first 3 users.

In [6]:
with open('data/contributors_sample.json', 'r', encoding='utf-8') as f:
    cs_json = json.load(f)

In [7]:
print(cs_json[:3])

[{'username': 'uhebert', 'name': 'Lindsey Nguyen', 'sex': 'F', 'address': '01261 Cameron Spring\nTaylorfurt, AK 97791', 'mail': 'jsalazar@gmail.com', 'jobs': ['Energy engineer', 'Engineer, site', 'Environmental health practitioner', 'Biomedical scientist', 'Jewellery designer'], 'id': 35193}, {'username': 'vickitaylor', 'name': 'Cheryl Lewis', 'sex': 'F', 'address': '66992 Welch Brooks\nMarshallshire, ID 56004', 'mail': 'bhudson@gmail.com', 'jobs': ['Music therapist', 'Volunteer coordinator', 'Designer, interior/spatial'], 'id': 91970}, {'username': 'sheilaadams', 'name': 'Julia Allen', 'sex': 'F', 'address': 'Unit 1632 Box 2971\nDPO AE 23297', 'mail': 'darren44@yahoo.com', 'jobs': ['Management consultant', 'Engineer, structural', 'Lecturer, higher education', 'Theatre manager', 'Designer, textile'], 'id': 1848091}]


1.2 Print the unique email domains found in people's email addresses

In [8]:
emails = [i['mail'] for i in cs_json]
unique = {email.split('@')[1] for email in emails}
print(unique)

{'yahoo.com', 'gmail.com', 'hotmail.com'}
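`email.split('@')[1]` raises `IndexError` for a string without `@` and treats `Gmail.com` and `gmail.com` as different domains. A slightly more defensive sketch, with the hypothetical helper `email_domains` and made-up addresses:

```python
def email_domains(addresses):
    """Return the set of lower-cased domains; entries without '@' are skipped."""
    domains = set()
    for address in addresses:
        # rpartition splits on the last '@'; sep is '' when there is none.
        _, sep, domain = address.rpartition("@")
        if sep and domain:
            domains.add(domain.lower())
    return domains

result = email_domains(["a@Gmail.com", "b@gmail.com", "c@yahoo.com", "broken"])
print(result)
```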


1.3 Write a function that finds a person by `username` and prints information about them. If no user with the given `username` exists, raise a `ValueError`

In [9]:
def getPersonalData(people: list, username: str):
    # `people` is the list loaded from the JSON file; the name avoids shadowing the `json` module.
    search = [person for person in people if person['username'] == username]
    if not search:
        raise ValueError('username is not found in json')
    person = search[0]
    print("Name:", person['name'], "\nSex:", person['sex'], "\nAddress:", person['address'].replace('\n', ', '))
    print("E-mail:", person['mail'], "\nJobs:", person['jobs'])

In [10]:
getPersonalData(cs_json, 'uhebert')

Name: Lindsey Nguyen 
Sex: F 
Address: 01261 Cameron Spring, Taylorfurt, AK 97791
E-mail: jsalazar@gmail.com 
Jobs: ['Energy engineer', 'Engineer, site', 'Environmental health practitioner', 'Biomedical scientist', 'Jewellery designer']


In [11]:
getPersonalData(cs_json, 'aboba')

ValueError: username is not found in json
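The function above rescans the whole list on every call. When many lookups are expected, one option is to build a dictionary keyed by `username` once. This is a sketch with hypothetical helper names (`build_index`, `find_person`) and made-up records, not part of the assignment.

```python
def build_index(people):
    """Map username -> record so repeated lookups avoid rescanning the list."""
    return {person["username"]: person for person in people}

def find_person(index, username):
    try:
        return index[username]
    except KeyError:
        # Re-raise as ValueError, as the task statement requires.
        raise ValueError(f"username {username!r} is not found") from None

# Hypothetical records with the same keys as the dataset above.
people = [{"username": "u1", "name": "Ann"}, {"username": "u2", "name": "Bob"}]
index = build_index(people)
print(find_person(index, "u2")["name"])
```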

1.4 Count how many men and women are present in this dataset.

In [12]:
def countPeoples(people: list):
    # `people` is the list loaded from the JSON file; the name avoids shadowing the `json` module.
    males = [person for person in people if person['sex'] == 'M']
    females = [person for person in people if person['sex'] == 'F']
    print('Total:', len(people), '\nMales in dataset:', len(males), '\nFemales in dataset:', len(females))

In [13]:
countPeoples(cs_json)

Total: 4200 
Males in dataset: 2064 
Females in dataset: 2136
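The same tally can be done in a single pass with `collections.Counter`, which also reveals any unexpected values in the `sex` field. A small sketch on made-up records:

```python
from collections import Counter

def count_by_sex(people):
    """Count records per value of the 'sex' field in one pass."""
    return Counter(person["sex"] for person in people)

# Hypothetical records; only the 'sex' key matters here.
sample = [{"sex": "F"}, {"sex": "M"}, {"sex": "F"}]
counts = count_by_sex(sample)
print(counts["M"], counts["F"])
```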


1.5 Create a `pd.DataFrame` named `contributors` with the columns `id`, `username` and `sex`.

In [14]:
import pandas as pd

def fromJsonToDataFrame(people: list) -> pd.DataFrame:
    # Build the frame directly from the records; `id` stays numeric instead of
    # passing through a string array and being cast back to int64.
    dataFrame = pd.DataFrame(people, columns=['id', 'username', 'sex'])
    # Rename `id` so that it matches the key column of `recipes` used in task 1.6.
    return dataFrame.rename(columns={'id': 'contributor_id'})

In [15]:
contributors = fromJsonToDataFrame(cs_json)
contributors.head(10)

| | contributor_id | username | sex |
|---|---|---|---|
| 0 | 35193 | uhebert | F |
| 1 | 91970 | vickitaylor | F |
| 2 | 1848091 | sheilaadams | F |
| 3 | 50969 | nicole82 | F |
| 4 | 676820 | jean67 | M |
| 5 | 64918 | james67 | F |
| 6 | 113941 | woodmarissa | M |
| 7 | 398160 | sampsontammy | M |
| 8 | 35635 | jonathan18 | M |
| 9 | 718054 | michael53 | M |


1.6 Load the data from `recipes_sample.csv` (__Lab 2__) into the table `recipes`. Merge `recipes` with the `contributors` table, keeping the rows for which the JSON file has no information about the person. For how many people is the information missing?

In [16]:
recipes = pd.read_csv('data/recipes_sample.csv')
recipes.head()

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN |


In [17]:
res = pd.merge(recipes, contributors, how='left', on='contributor_id')
res

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients | username | sex |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 | uhebert | F |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN | vickitaylor | F |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 | NaN | NaN |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN | NaN | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | zurie s holey rustic olive and cheddar bread | 267661 | 80 | 200862 | 2007-11-25 | 16.0 | this is based on a french recipe but i changed... | 10.0 | ana38 | F |
| 29996 | zwetschgenkuchen bavarian plum cake | 386977 | 240 | 177443 | 2009-08-24 | NaN | this is a traditional fresh plum cake, thought... | 11.0 | douglas33 | F |
| 29997 | zwiebelkuchen southwest german onion cake | 103312 | 75 | 161745 | 2004-11-03 | NaN | this is a traditional late summer early fall s... | NaN | NaN | NaN |
| 29998 | zydeco soup | 486161 | 60 | 227978 | 2012-08-29 | NaN | this is a delicious soup that i originally fou... | NaN | jessica22 | M |


In [18]:
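# This counts rows of the merged table (recipes) whose contributor is absent from the JSON file.
# Distinct people would be res.loc[res['username'].isna(), 'contributor_id'].nunique().
not_found = res['username'].isna().sum()
print('No information found for', not_found, 'rows')

No information found for 15059 rows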
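The number above is a count of recipe rows, and one contributor can own several recipes, so the number of distinct people may be smaller. The sketch below shows the difference on a tiny made-up pair of tables. It relies on the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column, and the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of `recipes` and `contributors`.
recipes_demo = pd.DataFrame({"id": [1, 2, 3, 4], "contributor_id": [10, 20, 20, 30]})
contributors_demo = pd.DataFrame({"contributor_id": [10], "username": ["ann"]})

merged = recipes_demo.merge(contributors_demo, how="left", on="contributor_id", indicator=True)
# "left_only" marks recipe rows with no matching contributor record.
missing = merged[merged["_merge"] == "left_only"]

rows_without_info = len(missing)
people_without_info = missing["contributor_id"].nunique()
print(rows_without_info, people_without_info)
```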


### pickle

2.1 Based on the file `contributors_sample.json`, build a dictionary of the following form:
```
{
    job title: [list of usernames of people who have held this job]
}
```

In [19]:
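def getJobDict(people: list) -> dict:
    # For every job seen in the data, collect the unique usernames that list it.
    # `people` is the list loaded from the JSON file; the name avoids shadowing the `json` module.
    return dict([
        (job, list(set([
            person['username'] for person in people if job in person['jobs']
        ])))
        for person in people for job in person['jobs']
    ])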

In [20]:
jobDict = getJobDict(cs_json)
#print(jobDict)
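The comprehension above rescans the whole dataset for every (person, job) pair, which grows quadratically with the number of records. A single-pass alternative with `collections.defaultdict` is sketched below. The helper name `jobs_to_usernames` and the sample records are made up, and sorting the usernames is an extra choice that makes the output deterministic.

```python
from collections import defaultdict

def jobs_to_usernames(people):
    """Group usernames by job in one pass over the records."""
    by_job = defaultdict(set)
    for person in people:
        for job in person["jobs"]:
            by_job[job].add(person["username"])
    # Sorted lists are deterministic and can be serialized to JSON (sets cannot).
    return {job: sorted(names) for job, names in by_job.items()}

# Hypothetical records with the same keys as the dataset above.
sample = [
    {"username": "u1", "jobs": ["Engineer", "Designer"]},
    {"username": "u2", "jobs": ["Engineer"]},
]
result = jobs_to_usernames(sample)
print(result)
```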

2.2 Save the result to `job_people.pickle` and `job_people.json` using the pickle and JSON formats respectively. Compare the sizes of the resulting files. When saving to JSON, pass the `indent` argument.

In [21]:
import pickle

with open('result/job_people.pickle', 'wb') as f:
    pickle.dump(jobDict, f)

In [22]:
with open('result/job_people.json', 'w', encoding='utf-8') as f:
    json.dump(jobDict, f, indent=2)

In [23]:
import os

def printFileSize(filepath: str):
    # os.path.abspath resolves the path against the current directory on any OS,
    # so no manual backslash handling is needed.
    fullpath = os.path.abspath(filepath)
    print(fullpath)
    byteSize = os.path.getsize(fullpath)
    print('File size:', byteSize, 'bytes,', byteSize / 1024, 'KiB\n')

In [24]:
printFileSize('result/job_people.pickle')
printFileSize('result/job_people.json')

E:\Документы и программы\Тексты\Последний сем\Системы искусственного интеллекта\Семинары\ЛР03\result\job_people.pickle
File size: 132041 bytes, 128.9462890625 KiB

E:\Документы и программы\Тексты\Последний сем\Системы искусственного интеллекта\Семинары\ЛР03\result\job_people.json
File size: 337095 bytes, 329.1943359375 KiB



337095 / 132041 ≈ 2.553, so the indented JSON file is about 2.55 times larger than the pickle file.
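Part of that gap comes from the whitespace that `indent` adds. The sketch below compares serialized sizes in memory on a made-up dictionary, so the absolute numbers will not match the files above. Only the comparison between the two JSON variants is guaranteed to go one way.

```python
import json
import pickle

# Hypothetical data with a shape similar to the job dictionary.
data = {f"job {i}": [f"user{j}" for j in range(20)] for i in range(50)}

sizes = {
    "pickle": len(pickle.dumps(data)),
    "json, compact": len(json.dumps(data, separators=(",", ":")).encode("utf-8")),
    "json, indent=2": len(json.dumps(data, indent=2).encode("utf-8")),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```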

2.3 Read the file `job_people.pickle` and demonstrate that the data was read correctly.

In [25]:
with open('result/job_people.pickle', 'rb') as f:
    result = pickle.load(f)

# The loaded object should be equal to the dictionary that was saved.
assert result == jobDict
#print(result)

### XML

3.1 Using the data in `steps_sample.xml`, build a dictionary of the steps for each recipe of the form `{recipe_id: ["step1", "step2"]}`. Save this dictionary to the file `steps_sample.json`

In [26]:
with open('data/steps_sample.xml', 'r', encoding='utf-8') as f:
    steps_xml = BeautifulSoup(f, features="xml")

In [27]:
def convertXmlToJson(xml: BeautifulSoup) -> dict:
    return dict([
        (int(recipe.id.contents[0]), [step.contents[0] for step in recipe.steps.find_all('step')])
        for recipe in xml.recipes.find_all('recipe')])

In [28]:
steps_json = convertXmlToJson(steps_xml)
#steps_json

In [29]:
with open('result/steps_sample.json', 'w', encoding='utf-8') as f:
    json.dump(steps_json, f, indent=2)
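One detail to keep in mind when reading this file back: JSON object keys are always strings, so the integer recipe ids are written as strings and have to be converted again after loading. A minimal illustration with a made-up one-entry dictionary:

```python
import json

steps = {44123: ["step one", "step two"]}  # hypothetical content

restored = json.loads(json.dumps(steps))
# The integer key comes back as the string "44123".
print(list(restored))

# Convert the keys back to int to recover the original dictionary.
restored_int_keys = {int(key): value for key, value in restored.items()}
```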

3.2 Using the data in `steps_sample.xml`, build a dictionary of the following form: `number_of_steps_in_recipe: [list_of_recipe_ids]`

In [30]:
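# recipe id -> number of steps (reused in task 3.4)
id_count = dict(
    [(int(recipe.id.contents[0]), len(recipe.steps.find_all('step')))
     for recipe in steps_xml.recipes.find_all('recipe')])

def stepsCountDict(xml: BeautifulSoup) -> dict:
    # Group recipe ids by step count in one pass, reading from the `xml` argument
    # instead of the global `id_count`.
    result = {}
    for recipe in xml.recipes.find_all('recipe'):
        count = len(recipe.steps.find_all('step'))
        result.setdefault(count, []).append(int(recipe.id.contents[0]))
    return result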

In [31]:
steps = stepsCountDict(steps_xml)
#steps

3.3 Get the list of recipes whose steps contain information about time (hours or minutes). To select the matching recipes, look at the attributes of the corresponding tags.

In [32]:
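def hasInfoAboutTime(xml: BeautifulSoup) -> list:
    # Only the `has_minutes` attribute is checked here; if the file marks hours
    # with a separate attribute, that attribute needs the same check.
    return list(set(
        [int(recipe.id.contents[0])
         for recipe in xml.recipes.find_all('recipe')
         for step in recipe.steps.find_all('step')
         if step.has_attr('has_minutes')]))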

In [33]:
hasInfo = hasInfoAboutTime(steps_xml)
hasInfo[:10]

[524289, 131082, 131087, 131090, 262166, 131096, 131107, 262188, 48, 262207]
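The task mentions both hours and minutes, while the function above tests a single attribute. The sketch below shows how several attributes could be checked at once. It uses the standard-library `xml.etree.ElementTree` on a made-up fragment, and the attribute name `has_hours` is an assumption: only `has_minutes` appears in the code above, so the real attribute names should be confirmed against the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the tag names follow the code above.
SAMPLE_XML = """
<recipes>
  <recipe><id>1</id><steps>
    <step has_minutes="1">bake for 20 minutes</step>
    <step>serve</step>
  </steps></recipe>
  <recipe><id>2</id><steps><step>mix</step></steps></recipe>
  <recipe><id>3</id><steps>
    <step has_hours="1">chill for 2 hours</step>
  </steps></recipe>
</recipes>
"""

TIME_ATTRIBUTES = ("has_minutes", "has_hours")  # "has_hours" is an assumed name

def recipes_with_time(xml_text):
    root = ET.fromstring(xml_text)
    # any() stops at the first matching step, so each id appears at most once
    # and the document order is preserved.
    return [
        int(recipe.findtext("id"))
        for recipe in root.iter("recipe")
        if any(attr in step.attrib
               for step in recipe.iter("step")
               for attr in TIME_ATTRIBUTES)
    ]

ids = recipes_with_time(SAMPLE_XML)
print(ids)
```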

3.4 Load the data from `recipes_sample.csv` (__Lab 2__) into the table `recipes`. For rows with missing values in the `n_steps` column, fill this column in from `steps_sample.xml`. Leave rows where `n_steps` is already filled unchanged.

In [34]:
recipes.head(10)

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | NaN | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | NaN | my children and their friends ask for my homem... | NaN |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | NaN | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | NaN | my sister-in-law made these for us at a family... | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN |
| 5 | mennonite corn fritters | 44045 | 15 | 41706 | 2002-10-25 | NaN | ok - my heritage has been revealed. :) these a... | NaN |
| 6 | open sesame noodles | 107229 | 28 | 173674 | 2004-12-30 | 8.0 | this is a very versatile and widely enjoyed pa... | 12.0 |
| 7 | say what banana sandwich | 95926 | 5 | 118163 | 2004-07-20 | 4.0 | you just have to try it to believe it. | NaN |
| 8 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | 12.0 | this is the recipe that we use at my school ca... | 11.0 |
| 9 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | 6.0 | since there are already 411 recipes for brocco... | NaN |


In [35]:
def getFilledRecipes(recipes: pd.DataFrame) -> pd.DataFrame:
    filled_recipes = recipes.copy()
    # Look up the step count by recipe id and use it only where n_steps is missing.
    filled_recipes['n_steps'] = filled_recipes['n_steps'].fillna(filled_recipes['id'].map(id_count))
    return filled_recipes

In [36]:
filled_recipes = getFilledRecipes(recipes)
filled_recipes.head(10)

| | name | id | minutes | contributor_id | submitted | n_steps | description | n_ingredients |
|---|---|---|---|---|---|---|---|---|
| 0 | george s at the cove black bean soup | 44123 | 90 | 35193 | 2002-10-25 | 11.0 | an original recipe created by chef scott meska... | 18.0 |
| 1 | healthy for them yogurt popsicles | 67664 | 10 | 91970 | 2003-07-26 | 3.0 | my children and their friends ask for my homem... | NaN |
| 2 | i can t believe it s spinach | 38798 | 30 | 1533 | 2002-08-29 | 5.0 | these were so go, it surprised even me. | 8.0 |
| 3 | italian gut busters | 35173 | 45 | 22724 | 2002-07-27 | 7.0 | my sister-in-law made these for us at a family... | NaN |
| 4 | love is in the air beef fondue sauces | 84797 | 25 | 4470 | 2004-02-23 | 4.0 | i think a fondue is a very romantic casual din... | NaN |
| 5 | mennonite corn fritters | 44045 | 15 | 41706 | 2002-10-25 | 6.0 | ok - my heritage has been revealed. :) these a... | NaN |
| 6 | open sesame noodles | 107229 | 28 | 173674 | 2004-12-30 | 8.0 | this is a very versatile and widely enjoyed pa... | 12.0 |
| 7 | say what banana sandwich | 95926 | 5 | 118163 | 2004-07-20 | 4.0 | you just have to try it to believe it. | NaN |
| 8 | 1 in canada chocolate chip cookies | 453467 | 45 | 1848091 | 2011-04-11 | 12.0 | this is the recipe that we use at my school ca... | 11.0 |
| 9 | 412 broccoli casserole | 306168 | 40 | 50969 | 2008-05-30 | 6.0 | since there are already 411 recipes for brocco... | NaN |


3.5 Check whether the `n_steps` column contains missing values. If not, convert it to an integer type and save the result to `recipes_sample_with_filled_nsteps.csv`

In [37]:
nan = filled_recipes['n_steps'].isna().sum()
print("NaN values in n_steps column:", nan)

NaN values in n_steps column: 0


In [38]:
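if nan == 0:
    # No gaps remain, so the float column can be converted to integers safely.
    filled_recipes['n_steps'] = filled_recipes['n_steps'].astype('int64')
    filled_recipes.to_csv('result/recipes_sample_with_filled_nsteps.csv')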
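The check matters because a column holding `NaN` is stored as float, and `astype('int64')` raises an error while any `NaN` remains. The sketch below walks through fill, check and convert on a tiny made-up table. The names `demo` and `step_counts` are hypothetical stand-ins for `recipes` and `id_count`.

```python
import pandas as pd

# Hypothetical miniature table: recipe 2 has no step count yet.
demo = pd.DataFrame({"id": [1, 2, 3], "n_steps": [4.0, None, 7.0]})
step_counts = {2: 5}  # recipe id -> number of steps, as extracted from the XML

# Fill only the missing values, looking them up by recipe id.
demo["n_steps"] = demo["n_steps"].fillna(demo["id"].map(step_counts))

# Convert to integers only when no gaps are left.
if demo["n_steps"].notna().all():
    demo["n_steps"] = demo["n_steps"].astype("int64")

print(demo)
```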