## A First Look at MongoDB

In [None]:
"""
    Your task is to successfully run the exercise to see how pymongo works
    and how easy it is to start using it.
    You don't actually have to change anything in this exercise,
    but you can change the city name in the add_city function if you like.

    Your code will be run against a MongoDB instance that we have provided.
    If you want to run this code locally on your machine,
    you have to install MongoDB (see Instructor comments for link to installation information)
    and uncomment the get_db() call in the main block.
"""

def add_city(db):
    # Changes to this function will be reflected in the output. 
    # All other functions are for local use only.
    # Try changing the name of the city to be inserted
    # insert_one (PyMongo 3.0+) replaces Collection.insert, which PyMongo 4 removed.
    db.cities.insert_one({"name": "Chicago"})
    
def get_city(db):
    return db.cities.find_one()

def get_db():
    # For local use
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    # 'examples' here is the database name. It will be created if it does not exist.
    db = client.examples
    return db

if __name__ == "__main__":
    # For local use
    # db = get_db() # uncomment this line if you want to run this locally
    add_city(db)
    print(get_city(db))
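
To see the call pattern of `add_city` and `get_city` without a running server, the sketch below swaps in a tiny in-memory stand-in for the database. `FakeCollection` and `FakeDatabase` are hypothetical names invented for this illustration and are not part of pymongo; they imitate only the two collection methods used here, `insert_one` and `find_one`, and use an integer counter where MongoDB would generate an `ObjectId`.

```python
import itertools


class FakeCollection(object):
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self):
        self._docs = []
        self._ids = itertools.count(1)

    def insert_one(self, document):
        # pymongo adds an "_id" key when the document has none; imitate that
        # with a simple counter instead of a real ObjectId.
        document.setdefault("_id", next(self._ids))
        self._docs.append(document)

    def find_one(self):
        # Return the first stored document, or None when nothing is stored.
        return self._docs[0] if self._docs else None


class FakeDatabase(object):
    """Hypothetical database object exposing a single 'cities' collection."""

    def __init__(self):
        self.cities = FakeCollection()


def add_city(db):
    db.cities.insert_one({"name": "Chicago"})


def get_city(db):
    return db.cities.find_one()


db = FakeDatabase()
add_city(db)
city = get_city(db)
print(city)
```

Because the functions only rely on `db.cities` offering these two methods, the same `add_city` and `get_city` should work unchanged once `db` comes from a real `MongoClient`.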

# Problem Set
## Preparing the Data

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
    In this problem set you work with another type of infobox data, audit it,
    clean it, come up with a data model, insert it into MongoDB and then run some
    queries against your database. The set contains data about Arachnid class
    animals.

    Your task in this exercise is to parse the file, process only the fields that
    are listed in the FIELDS dictionary as keys, and return a list of dictionaries
    of cleaned values. 

    The following things should be done:
    - rename the dictionary keys according to the mapping in the FIELDS dictionary
    - trim the redundant description in parentheses from the 'rdf-schema#label'
      field, like "(spider)"
    - if 'name' is "NULL" or contains non-alphanumeric characters, set it to the
      same value as 'label'.
    - if a value of a field is "NULL", convert it to None
    - if there is a value in 'synonym', it should be converted to an array (list)
      by stripping the "{}" characters and splitting the string on "|". The rest of
      the cleanup is up to you, e.g. removing "*" prefixes etc. If there is only a
      single synonym, the value should still be formatted as a list.
    - strip leading and trailing whitespace from all fields, if there is any
    - the output structure should be as follows:

    [ { 'label': 'Argiope',
        'uri': 'http://dbpedia.org/resource/Argiope_(spider)',
        'description': 'The genus Argiope includes rather large and spectacular spiders that often ...',
        'name': 'Argiope',
        'synonym': ["One", "Two"],
        'classification': {
                          'family': 'Orb-weaver spider',
                          'class': 'Arachnid',
                          'phylum': 'Arthropod',
                          'order': 'Spider',
                          'kingdom': 'Animal',
                          'genus': None
                          }
      },
      { 'label': ... , }, ...
    ]

      * Note that the value associated with the classification key is a dictionary
        with taxonomic labels.

"""
import codecs
import csv
import json
import pprint
import re

DATAFILE = 'arachnid.csv'
FIELDS ={'rdf-schema#label': 'label',
         'URI': 'uri',
         'rdf-schema#comment': 'description',
         'synonym': 'synonym',
         'name': 'name',
         'family_label': 'family',
         'class_label': 'class',
         'phylum_label': 'phylum',
         'order_label': 'order',
         'kingdom_label': 'kingdom',
         'genus_label': 'genus'}


def process_file(filename, fields):

    process_fields = fields.keys()
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        # Skip the three extra rows that follow the header row.
        for _ in range(3):
            next(reader)

        for line in reader:
            data_dict = {}
            # Taxonomic ranks are grouped under a nested 'classification' dict.
            classification = {'kingdom': '',
                              'family': '',
                              'order': '',
                              'phylum': '',
                              'genus': '',
                              'class': ''}
            
            # Clean the label up front: 'name' may fall back to it, and the
            # loop below should not depend on the iteration order of the dict.
            label = re.sub(r'\(\w+\)', '', line['rdf-schema#label']).strip()
            if label == 'NULL':
                label = None

            for field in process_fields:

                if field == 'rdf-schema#label':
                    # Parenthesised qualifiers such as "(spider)" were removed above.
                    data_dict[fields[field]] = label

                elif field == 'name':
                    # If 'name' is "NULL" or contains non-alphanumeric
                    # characters, set it to the same value as 'label'.
                    name = line[field]
                    if name == 'NULL' or re.search(r'\W', name):
                        name = label
                    data_dict[fields[field]] = name

                elif field == 'synonym':
                    # A non-NULL 'synonym' is converted to a list.
                    synonym = line[field]
                    if synonym == 'NULL':
                        synonym = None
                    else:
                        synonym = parse_array(synonym)
                    data_dict[fields[field]] = synonym

                else:
                    field_value = line[field]
                    # Convert "NULL" to None; otherwise strip surrounding whitespace.
                    if field_value == 'NULL':
                        field_value = None
                    elif isinstance(field_value, str):
                        field_value = field_value.strip()

                    if fields[field] in classification:
                        classification[fields[field]] = field_value
                        data_dict['classification'] = classification
                    else:
                        data_dict[fields[field]] = field_value

            data.append(data_dict)
                    
            
    return data

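def parse_array(v):
    # "{a|b}" becomes ['a', 'b']; any other value is wrapped in a one-item list.
    # startswith/endswith also cope with an empty string, unlike v[0] and v[-1].
    if v.startswith("{") and v.endswith("}"):
        v_array = v.lstrip("{").rstrip("}").split("|")
        return [i.strip() for i in v_array]
    return [v]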


def test():
    data = process_file(DATAFILE, FIELDS)
    print("Your first entry:")
    pprint.pprint(data[0])
    first_entry = {
        "synonym": None, 
        "name": "Argiope", 
        "classification": {
            "kingdom": "Animal", 
            "family": "Orb-weaver spider", 
            "order": "Spider", 
            "phylum": "Arthropod", 
            "genus": None, 
            "class": "Arachnid"
        }, 
        "uri": "http://dbpedia.org/resource/Argiope_(spider)", 
        "label": "Argiope", 
        "description": "The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced."
    }

    assert len(data) == 76
    assert data[0] == first_entry
    assert data[17]["name"] == "Ogdenia"
    assert data[48]["label"] == "Hydrachnidiae"
    assert data[14]["synonym"] == ["Cyrene Peckham & Peckham"]

if __name__ == "__main__":
    test()
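
The cleaning rules from the task description can be tried on a single row without the real data file. The sketch below is a simplified illustration, not the exercise solution: the three-column CSV, the `meta` placeholder rows, and the `clean_row` helper are all made up for this example, on the assumption that the real file also has three non-data rows after its header.

```python
import csv
import io
import re

# Hypothetical miniature of the input: a header row, three placeholder rows
# standing in for the rows the exercise skips, then one data row.
SAMPLE = io.StringIO(
    "rdf-schema#label,name,synonym\n"
    "meta,meta,meta\n"
    "meta,meta,meta\n"
    "meta,meta,meta\n"
    "Argiope (spider),NULL,{Alpha| Beta}\n"
)


def clean_row(row):
    # Drop a parenthesised qualifier such as "(spider)" from the label.
    label = re.sub(r"\(\w+\)", "", row["rdf-schema#label"]).strip()

    # Fall back to the label when the name is NULL or not purely alphanumeric.
    name = row["name"].strip()
    if name == "NULL" or re.search(r"\W", name):
        name = label

    # Turn "{a|b}" into a list; NULL becomes None.
    synonym = row["synonym"].strip()
    if synonym == "NULL":
        synonym = None
    else:
        synonym = [s.strip() for s in synonym.strip("{}").split("|")]

    return {"label": label, "name": name, "synonym": synonym}


reader = csv.DictReader(SAMPLE)
for _ in range(3):
    next(reader)  # skip the three placeholder rows
rows = [clean_row(r) for r in reader]
print(rows)
```

`csv.DictReader` consumes the header row itself, so the three `next(reader)` calls discard only the rows between the header and the first record.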