# Building a database for crime reports

In this project, we will build a database for storing data about crimes that occurred in Boston. The dataset is available in the file "boston.csv". The following diagram gives a high-level overview of what we want to achieve:

<img src='https://dq-content.s3.amazonaws.com/250/goal.png'>

## Creating the crime database

We will start by creating a database for storing our crime data as well as a schema for containing the tables. Since the `crime_db` database does not exist yet, we will create it while connected to the existing `dq` database.

In [1]:
import psycopg2
conn = psycopg2.connect(dbname="dq", user="dq")

# set autocommit to True because CREATE DATABASE cannot run inside a transaction block
conn.autocommit = True
cur = conn.cursor()

# create the crime_db database
cur.execute("CREATE DATABASE crime_db;")
conn.close()

In [2]:
# now that the crime_db database exists, we can connect to it
conn = psycopg2.connect(dbname="crime_db", user="dq")
conn.autocommit = True
cur = conn.cursor()

# create the crimes schema
cur.execute("CREATE SCHEMA crimes;")
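
Rerunning the two cells above fails once `crime_db` and `crimes` already exist. A minimal sketch of a rerun-safe variant follows. `ensure_crime_db` and `ensure_crimes_schema` are hypothetical helper names that are not part of the original notebook, and each assumes an autocommit psycopg2 cursor like the ones created above. PostgreSQL has no `CREATE DATABASE IF NOT EXISTS`, so the sketch looks the database up in the `pg_database` catalog first.

```python
def ensure_crime_db(cur):
    """Create crime_db unless it already exists; return True if it was created.

    `cur` is assumed to be an autocommit cursor connected to the dq database.
    """
    # pg_database has one row per database in the cluster.
    cur.execute("SELECT 1 FROM pg_database WHERE datname = %s;", ("crime_db",))
    if cur.fetchone() is not None:
        return False
    cur.execute("CREATE DATABASE crime_db;")
    return True


def ensure_crimes_schema(cur):
    """Create the crimes schema if it is missing (cursor connected to crime_db)."""
    cur.execute("CREATE SCHEMA IF NOT EXISTS crimes;")


# Intended use, with the cursors from the cells above:
#   ensure_crime_db(cur)        # cursor on the dq database
#   ensure_crimes_schema(cur)   # cursor on the crime_db database
```

Both functions can be called any number of times without raising an "already exists" error.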

## Column names and data sample

We now have a database and a schema, so we are ready to start creating tables. Before we do that, let's gather some information about our crime dataset so that we can more easily select the right datatypes to use in our table.

In [22]:
import csv
with open('boston.csv') as file:
    reader = csv.reader(file)
    col_headers = next(reader)
    first_row = next(reader)
print('The column headers for the dataset are as follows:\n')
print(col_headers)
print(first_row)

The column headers for the dataset are as follows:

['incident_number', 'offense_code', 'description', 'date', 'day_of_the_week', 'lat', 'long']
['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']


In [13]:
import pandas as pd
def get_col_value_set(filename, col_index):
    # return the distinct values found in the column at position col_index
    df = pd.read_csv(filename)
    value_counts = df.iloc[:,col_index].value_counts()
    return list(value_counts.index)

In [14]:
filename = 'boston.csv'
value_counts = {}
for i, col in enumerate(col_headers):
    values = get_col_value_set(filename, i)
    value_counts[col] = len(values)
print('The unique value counts for each column are as follows')
value_counts

The unique value counts for each column are as follows


{'date': 1177,
 'day_of_the_week': 7,
 'description': 239,
 'incident_number': 298329,
 'lat': 18177,
 'long': 18177,
 'offense_code': 219}

With this function we can compute the number of distinct values for each column. Columns with a low number of distinct values tend to be good candidates for enumerated datatypes. 

Another important aspect is the length of the longest value in any column containing textual data. There are two textual columns in the dataset, namely the **description** and **day_of_the_week** columns. However, **day_of_the_week** contains only 7 values, one for each day. We can tell without any computation that the longest of them is *Wednesday* (9 characters).

In [20]:
desc_index = 2
values = get_col_value_set(filename, desc_index)
val_lengths = []
for value in values:
    val_lengths.append(len(value))
max_length = max(val_lengths)
print('The maximum length of a value in the Description field is {} characters'.format(max_length))

The maximum length of a value in the Description field is 58 characters
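
The two profiling steps above reread `boston.csv` with pandas once per column. As a sketch of an alternative, both numbers (distinct values per column and length of the longest value) can be gathered in a single pass with the standard `csv` module. `profile_columns` is a hypothetical helper, and the sample below is illustrative only: its first data row is the one printed earlier, the other two are invented to exercise the function.

```python
import csv
import io


def profile_columns(csv_file):
    """Return {column: (distinct_count, max_length)} from one pass over csv_file."""
    reader = csv.DictReader(csv_file)
    distinct = {name: set() for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            distinct[name].add(value)
    return {
        name: (len(values), max((len(v) for v in values), default=0))
        for name, values in distinct.items()
    }


# Illustrative sample: only the first data row comes from the real file.
sample = io.StringIO(
    "incident_number,offense_code,description,date,day_of_the_week,lat,long\n"
    "1,619,LARCENY ALL OTHERS,2018-09-02,Sunday,42.35779134,-71.13937053\n"
    "2,1402,VANDALISM,2018-08-21,Tuesday,42.30682138,-71.06030035\n"
    "3,619,LARCENY ALL OTHERS,2018-09-03,Monday,42.35779134,-71.13937053\n"
)
profile = profile_columns(sample)
print(profile["description"])       # (distinct values, longest value length)
print(profile["day_of_the_week"])
```

For the real file, the same function could be called as `profile_columns(open('boston.csv', newline=''))`. Unlike `value_counts`, which skips missing values by default, this sketch counts an empty field as a value of its own.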


## Creating the table

Here are some of the design considerations for the data types: 

- We use an enumerated datatype named `weekday` for the `day_of_the_week` column since there are only seven possible values.
- Since the `description` has at most 58 characters, we decided to use the datatype VARCHAR(100) for representing it. This leaves some margin without making the limit needlessly large.
- For the `incident_number` we have decided to use the type INTEGER and set it as the primary key. The same datatype will also be used to represent the `offense_code`.
- The `lat` and `long` columns seem to need quite a lot of precision, so we will use the DECIMAL type.

Based on the result of printing `first_row` and the considerations above, the final data types we will be using are as follows:

| column | data | dtype |
| :- | :-: | :-: |
| incident_number | integer number | INTEGER |
| offense_code | integer number | INTEGER |
| description | string | VARCHAR(100) |
| date | date | DATE |
| day_of_the_week | string | ENUM (`weekday`) |
| lat | decimal number | DECIMAL |
| long | decimal number | DECIMAL |



In [23]:
# create the enumerated datatype for representing the weekday
cur.execute("""
    CREATE TYPE weekday AS ENUM ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday');
""")

# create the table
cur.execute("""
    CREATE TABLE crimes.boston_crimes (
        incident_number INTEGER PRIMARY KEY,
        offense_code INTEGER,
        description VARCHAR(100),
        date DATE,
        day_of_the_week weekday,
        lat DECIMAL,
        long DECIMAL
    );
""")

## Loading the data

Now that we have created the table, we can load the data into it.
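
Before running the load, the type choices can be sanity-checked in plain Python. This sketch converts the `first_row` printed earlier into the Python counterpart of each column type. `WEEKDAYS` mirrors the labels of the `weekday` enum and `parse_row` is a hypothetical helper, not part of the original notebook.

```python
from datetime import date
from decimal import Decimal

# Same labels and order as the weekday enum; index 0 is Monday,
# which matches the numbering used by date.weekday().
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def parse_row(row):
    """Convert one CSV row (a list of 7 strings) to typed values.

    Raises an exception if a field does not fit its column type.
    """
    incident_number, offense_code, description, day, weekday, lat, long_ = row
    if weekday not in WEEKDAYS:
        raise ValueError("unknown weekday: {!r}".format(weekday))
    if len(description) > 100:
        raise ValueError("description exceeds VARCHAR(100)")
    return (
        int(incident_number),      # INTEGER
        int(offense_code),         # INTEGER
        description,               # VARCHAR(100)
        date.fromisoformat(day),   # DATE
        weekday,                   # weekday enum
        Decimal(lat),              # DECIMAL
        Decimal(long_),            # DECIMAL
    )


first_row = ['1', '619', 'LARCENY ALL OTHERS', '2018-09-02', 'Sunday', '42.35779134', '-71.13937053']
parsed = parse_row(first_row)
# The stored weekday should agree with the one implied by the date.
print(parsed[4] == WEEKDAYS[parsed[3].weekday()])
```

Applying `parse_row` to every row of the file would surface values that do not fit the table definition before `COPY` reports them.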

In [25]:
# load the data from boston.csv into the table boston_crimes that is in the crimes schema
with open(filename) as f:
    cur.copy_expert('COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;', f)
    
# print the number of rows to ensure that they were loaded
cur.execute("SELECT * FROM crimes.boston_crimes")
print('Number of data rows loaded: ', len(cur.fetchall()))

IntegrityError: duplicate key value violates unique constraint "boston_crimes_pkey"
DETAIL:  Key (incident_number)=(1) already exists.
CONTEXT:  COPY boston_crimes, line 2
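
The `IntegrityError` above is raised on the first data row of the file (`line 2`), which means a row with `incident_number` 1 was already in the table, as happens when the load cell is run a second time. A minimal sketch of a rerun-safe load follows. `reload_boston_crimes` is a hypothetical helper; it assumes an open psycopg2 cursor and that discarding the table's current contents is acceptable.

```python
def reload_boston_crimes(cur, csv_file):
    """Empty crimes.boston_crimes, bulk-load csv_file, and return the row count."""
    # TRUNCATE removes rows left by a previous run, so COPY cannot
    # collide with primary keys that are already stored.
    cur.execute("TRUNCATE TABLE crimes.boston_crimes;")
    cur.copy_expert("COPY crimes.boston_crimes FROM STDIN WITH CSV HEADER;", csv_file)
    # COUNT(*) lets the server count the rows instead of fetching them all.
    cur.execute("SELECT COUNT(*) FROM crimes.boston_crimes;")
    return cur.fetchone()[0]


# Intended use, with the cursor and filename from the cells above:
#   with open(filename) as f:
#       print('Number of data rows loaded: ', reload_boston_crimes(cur, f))
```

The design choice here is to make the whole load repeatable rather than to skip rows that already exist, which keeps the table an exact copy of the file.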


## Managing user privileges

We revoke all privileges of the `public` group on the `public` schema so that users do not inherit privileges on that schema, such as the ability to create tables in it.

We also need to revoke all privileges on the newly created `crime_db` database. Because privileges are not granted by default unless specified otherwise, doing this means that we do not need to revoke anything when we later create users and groups.

In [26]:
cur.execute("REVOKE ALL ON SCHEMA public FROM public;")
cur.execute("REVOKE ALL ON DATABASE crime_db FROM public;")

### Creating a readonly group

We create a `readonly` group with NOLOGIN because it is a group and not a user. We grant the group the ability to connect to the `crime_db` database and the ability to use the `crimes` schema.

Then we deal with table privileges by granting SELECT. We also add one statement beyond the basic grants. It changes the privileges that the `readonly` group receives by default on new tables created in the `crimes` schema. As we mentioned, privileges are not granted by default. With this change, any user in the `readonly` group can issue SELECT commands on such tables.

In [27]:
cur.execute("CREATE GROUP readonly NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readonly;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readonly;")
cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA crimes TO readonly;")
# grant SELECT by default on tables created in the crimes schema from now on
# (applies to tables created by the role that runs this statement)
cur.execute("ALTER DEFAULT PRIVILEGES IN SCHEMA crimes GRANT SELECT ON TABLES TO readonly;")

### Creating a read-write group

Similar to the `readonly` group, we create a `readwrite` group with NOLOGIN. We give additional table privileges by granting SELECT, INSERT, UPDATE and DELETE.

In [28]:
cur.execute("CREATE GROUP readwrite NOLOGIN;")
cur.execute("GRANT CONNECT ON DATABASE crime_db TO readwrite;")
cur.execute("GRANT USAGE ON SCHEMA crimes TO readwrite;")
cur.execute("GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA crimes TO readwrite;")
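
Both group cells follow the same four-statement pattern: create the group, allow it to connect, allow it to use the schema, and grant table privileges. As an illustration, that pattern can be captured in a small function that builds the statements. `group_statements` is a hypothetical helper that is not part of the original notebook, and its output would still be passed to `cur.execute` one statement at a time.

```python
def group_statements(group, table_privileges, database="crime_db", schema="crimes"):
    """Return the SQL statements that set up a NOLOGIN group.

    Names are inserted into the SQL text as-is, so this is only suitable
    for trusted, hard-coded identifiers like the ones in this project.
    """
    privileges = ", ".join(table_privileges)
    return [
        "CREATE GROUP {} NOLOGIN;".format(group),
        "GRANT CONNECT ON DATABASE {} TO {};".format(database, group),
        "GRANT USAGE ON SCHEMA {} TO {};".format(schema, group),
        "GRANT {} ON ALL TABLES IN SCHEMA {} TO {};".format(privileges, schema, group),
    ]


for statement in group_statements("readwrite", ["SELECT", "INSERT", "UPDATE", "DELETE"]):
    print(statement)
```

Printing the statements before executing them makes it easy to compare the two groups side by side.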

### Creating test users

We will now test these newly created groups with a couple of test users:

- user: **data_analyst**, password: *secret1*, group: `readonly`.
- user: **data_scientist**, password: *secret2*, group: `readwrite`.

In [29]:
cur.execute("CREATE USER data_analyst WITH PASSWORD 'secret1';")
cur.execute("GRANT readonly TO data_analyst;")

cur.execute("CREATE USER data_scientist WITH PASSWORD 'secret2';")
cur.execute("GRANT readwrite TO data_scientist;")

## Testing

We will now test the database setup using SQL queries on the `pg_roles` view and the `information_schema.table_privileges` view.

In `pg_roles` we will check database-related privileges. For that we will look at the following columns:

- `rolname`: The name of the user / group that the privilege refers to.
- `rolsuper`: Whether this user / group is a superuser. It should be False for every user / group that we have created.
- `rolcreaterole`: Whether the user / group can create users, groups or roles. It should be False for every user / group that we have created.
- `rolcreatedb`: Whether the user / group can create databases. It should be False for every user / group that we have created.
- `rolcanlogin`: Whether the user / group can log in. It should be True for the users and False for the groups that we have created.

In `information_schema.table_privileges` we will check privileges related to SQL queries on tables. We will list the privileges of each group that we have created.

In [31]:
# close the old connection to test with a brand new connection
conn.close()

conn = psycopg2.connect(dbname="crime_db", user="dq")
cur = conn.cursor()
# check users and groups
cur.execute("""
    SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin FROM pg_roles
    WHERE rolname IN ('readonly', 'readwrite', 'data_analyst', 'data_scientist');
""")
for user in cur:
    print(user)
print()
# check privileges
cur.execute("""
    SELECT grantee, privilege_type
    FROM information_schema.table_privileges
    WHERE grantee IN ('readonly', 'readwrite');
""")
for privilege in cur:
    print(privilege)
conn.close()

('readonly', False, False, False, False)
('readwrite', False, False, False, False)
('data_analyst', False, False, False, True)
('data_scientist', False, False, False, True)

('readonly', 'SELECT')
('readwrite', 'INSERT')
('readwrite', 'SELECT')
('readwrite', 'UPDATE')
('readwrite', 'DELETE')
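
Reading the printed tuples by eye works for four roles, but the same check can also be written as assertions. The sketch below feeds the rows printed above into two hypothetical helpers, `login_flags` and `privileges_by_grantee`. In practice the rows would come from `cur.fetchall()` after each query.

```python
from collections import defaultdict


def login_flags(role_rows):
    """Map rolname -> rolcanlogin from
    (rolname, rolsuper, rolcreaterole, rolcreatedb, rolcanlogin) rows."""
    return {row[0]: row[4] for row in role_rows}


def privileges_by_grantee(privilege_rows):
    """Map grantee -> set of privilege types from (grantee, privilege_type) rows."""
    result = defaultdict(set)
    for grantee, privilege in privilege_rows:
        result[grantee].add(privilege)
    return dict(result)


# Rows copied from the output above.
role_rows = [
    ('readonly', False, False, False, False),
    ('readwrite', False, False, False, False),
    ('data_analyst', False, False, False, True),
    ('data_scientist', False, False, False, True),
]
privilege_rows = [
    ('readonly', 'SELECT'),
    ('readwrite', 'INSERT'),
    ('readwrite', 'SELECT'),
    ('readwrite', 'UPDATE'),
    ('readwrite', 'DELETE'),
]

# No role may be a superuser or create roles or databases.
assert not any(any(row[1:4]) for row in role_rows)
# Only the two users can log in.
assert login_flags(role_rows) == {
    'readonly': False, 'readwrite': False,
    'data_analyst': True, 'data_scientist': True,
}
assert privileges_by_grantee(privilege_rows) == {
    'readonly': {'SELECT'},
    'readwrite': {'SELECT', 'INSERT', 'UPDATE', 'DELETE'},
}
print('all checks passed')
```

Comparing sets rather than lists keeps the check independent of the order in which PostgreSQL returns the privilege rows.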
