The attached CSV files contain a simulated dataset for a legal services program. The core services of this program include a) providing two types of legal services to a particular population in immigration custody, and b) providing legal representation to a subset of the people seen in (a).  

You have been provided two tables with the following information:  

* **Demographics** – identifying information on every person who received services from the program within a given timeframe 

    * Columns include: id_number, first_name, last_name, gender, birth, nationality  

    * The ID number (`id_number`) is the primary key of this table and is referenced as a foreign key by the event log table.  

* **Event Log** – represents all services performed in the program.  

    * Columns include: id_number, provider, region, facility, event_category, event_type, event_reason, event_mode, and event_date.  

    * There are two “event categories” – in-custody events and representation events. You will find that in-custody services and legal representation will have different information contained in the “event type,” “event reason,” and “event mode” fields. The “event date” field will always represent the date service was rendered. 

    * In-custody events represent the two types of legal services performed in custody. There are two “event types” for in-custody events – TYPE 1 and TYPE 2 (stored as `TYPE1` and `TYPE2` in the data).  

        * TYPE 1 can happen in the individual or group setting (represented by the “event mode” field).  

        * TYPE 2 always happens in the individual setting. 

    * Representation events track the opening and closure of legal case representation by the program’s legal service providers. 

        * A case is opened with a “representation initiated” in the event type field. 

        * A case is closed with a “representation ended” in the event type field. 

        * Representation can be initiated in custody or after release. This is marked by “IN CUSTODY” or “RELEASED” in the “event reason” field, respectively.  

        * Representation can happen directly through one of three service modes: A, B, or C. This is denoted in the “event mode” field.  

        * Cases can be closed for a variety of reasons. The reason for closure is entered under the “event reason” field.  


With the given tables, please provide code for the following:

**1. SQL:** Create a reference table that summarizes each client’s case initiation information and current case status. Clients who received services in custody but never had representation may be omitted from this reference table. (est. 20-40 minutes) 

* The table should have the following columns: ID number, date first initiated, first initiated reason, first initiated provider, most recent provider, first initiated mode, most recent mode, case status, close date, close reason 

* You’ll derive the following columns from the first representation event for each client: date first initiated, first initiated provider, first initiated mode, first initiated reason 

* You’ll derive the following columns from the most recent representation event for each client: most recent provider, most recent mode, current case status, close date (if applicable), close reason (if applicable).  

    * A case status is “Open” if the most recent representation event is a “representation initiated” event.  

    * A case status is “Closed” if the most recent representation event is a “representation ended” event.  

    * A case may be opened and closed multiple times (i.e. a person’s event log may have a “case closed” event in their history, but have a current open case).

**2. Python:** Please provide a Python function that calculates each of the following:

* Given a dataframe containing the demographics data, return the number of unique nationalities. (est. 5-10 minutes) 

* Given a dataframe of the event log data and a date, return how many cases were open on that date. Please state any assumptions you’ve made as part of your answer. (est. 20-30 minutes) 

In [4]:
import pandas as pd
import sqlite3

In [2]:
event_log = pd.read_csv('https://github.com/datalaker/assets/files/10905599/acj_event_log.csv')
event_log.head()

Unnamed: 0,id_number,provider,region,facility,event_category,event_type,event_reason,event_mode,event_date
0,888000491,DDD,b,L,in-custody event,TYPE1,,group,2020-05-08
1,888000491,DDD,b,L,in-custody event,TYPE2,,individual,2020-05-08
2,888001056,GGG,d,K,in-custody event,TYPE1,,group,2020-04-18
3,888001328,GGG,d,H,in-custody event,TYPE1,,group,2020-03-13
4,888001328,GGG,d,H,in-custody event,TYPE2,,individual,2020-03-13


In [9]:
demographics = pd.read_csv('https://github.com/datalaker/assets/files/10905598/acj_demographics.csv')
demographics.head()

Unnamed: 0,id_number,first_name,last_name,gender,birth,nationality
0,888350661,First,Last,Other,2015-01-08,Haiti
1,888198842,First,Last,Male,2013-05-30,Armenia
2,888391282,First,Last,Female,2005-07-20,Uzbekistan
3,888336161,First,Last,Female,2018-10-20,India
4,888908064,First,Last,Male,2012-10-10,Bolivia


## SQL

In [5]:
#connect to a database
conn = sqlite3.connect("event_log.db") #if the db does not exist, this creates a db file in the current directory

#store your table in the database
event_log.to_sql('event_log', conn, if_exists='replace')  # replace the table if this cell is re-run

2065

In [6]:
%load_ext sql
%sql sqlite:///event_log.db

In [8]:
%%sql
SELECT *
from event_log
GROUP BY id_number
LIMIT 5

 * sqlite:///event_log.db
Done.


index,id_number,provider,region,facility,event_category,event_type,event_reason,event_mode,event_date
0,888000491,DDD,b,L,in-custody event,TYPE1,,group,2020-05-08
2,888001056,GGG,d,K,in-custody event,TYPE1,,group,2020-04-18
3,888001328,GGG,d,H,in-custody event,TYPE1,,group,2020-03-13
5,888001365,III,b,F,in-custody event,TYPE1,,group,2020-04-29
6,888002920,CCC,b,X,in-custody event,TYPE1,,group,2020-04-04
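
The query above only previews the event log. One way the reference table from task 1 could be built is sketched below, run through Python's `sqlite3` module against a small made-up in-memory table rather than the real `event_log.db`. It is a sketch under assumptions, not a verified answer: it assumes the closing event type is spelled `representation ended`, that no client has two representation events on the same date, and that the "first representation event" is the earliest `representation initiated` row. The table name `case_reference`, the snake_case column names, and the sample rows (including the close reason `WITHDRAWN`) are hypothetical.

```python
import sqlite3

# Hypothetical in-memory copy of the event log with only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE event_log (
        id_number INTEGER, provider TEXT, event_type TEXT,
        event_reason TEXT, event_mode TEXT, event_date TEXT
    )
""")
demo.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "AAA", "TYPE1", None, "group", "2020-01-01"),
        (1, "AAA", "representation initiated", "IN CUSTODY", "A", "2020-01-10"),
        (1, "AAA", "representation ended", "WITHDRAWN", "A", "2020-02-01"),
        (1, "BBB", "representation initiated", "RELEASED", "B", "2020-03-01"),
        (2, "CCC", "representation initiated", "IN CUSTODY", "C", "2020-01-15"),
        (2, "CCC", "representation ended", "WITHDRAWN", "C", "2020-04-01"),
        (3, "DDD", "TYPE2", None, "individual", "2020-02-02"),
    ],
)

demo.execute("""
    CREATE TABLE case_reference AS
    WITH rep AS (              -- representation events only
        SELECT * FROM event_log
        WHERE event_type IN ('representation initiated', 'representation ended')
    ),
    first_init AS (            -- earliest initiation per client
        SELECT r.* FROM rep r
        WHERE r.event_type = 'representation initiated'
          AND r.event_date = (SELECT MIN(event_date) FROM rep
                              WHERE id_number = r.id_number
                                AND event_type = 'representation initiated')
    ),
    latest AS (                -- most recent representation event per client
        SELECT r.* FROM rep r
        WHERE r.event_date = (SELECT MAX(event_date) FROM rep
                              WHERE id_number = r.id_number)
    )
    SELECT f.id_number,
           f.event_date   AS date_first_initiated,
           f.event_reason AS first_initiated_reason,
           f.provider     AS first_initiated_provider,
           l.provider     AS most_recent_provider,
           f.event_mode   AS first_initiated_mode,
           l.event_mode   AS most_recent_mode,
           CASE WHEN l.event_type = 'representation initiated'
                THEN 'Open' ELSE 'Closed' END AS case_status,
           CASE WHEN l.event_type = 'representation ended'
                THEN l.event_date END AS close_date,
           CASE WHEN l.event_type = 'representation ended'
                THEN l.event_reason END AS close_reason
    FROM first_init f
    JOIN latest l ON l.id_number = f.id_number
""")

rows = demo.execute("SELECT * FROM case_reference ORDER BY id_number").fetchall()
for row in rows:
    print(row)
```

In this sample, client 1 is reported as Open because a second initiation follows the earlier closure, client 2 is Closed, and client 3 (in-custody services only) is omitted. Same-day representation events would produce duplicate rows under this approach, so a tie-breaker would be needed if the real data contains them.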


## Python

In [10]:
demographics['nationality'].nunique()

17
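
Since the task asks for a function, the expression above could be wrapped as follows. This is a minimal sketch: the function name `num_unique_nationalities` and the sample rows are made up for illustration, and it assumes missing nationalities should not be counted (the default behaviour of `nunique`).

```python
import pandas as pd

def num_unique_nationalities(df):
    """Return the number of distinct non-missing values in the `nationality` column."""
    return int(df['nationality'].nunique())

# Hypothetical sample: three distinct nationalities plus one missing value.
sample_demographics = pd.DataFrame(
    {'nationality': ['Haiti', 'India', 'Haiti', None, 'Bolivia']}
)
print(num_unique_nationalities(sample_demographics))
```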

In [11]:
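def num_open_cases(df, date):
    """Count 'representation initiated' events logged on `date`."""
    # event_date is read from the CSV as a string, so compare against str(date)
    on_date = df.loc[df['event_date'] == str(date)]
    initiated = on_date.loc[on_date['event_type'] == 'representation initiated']
    return initiated.shape[0]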

In [12]:
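# assumption: a case counts as open on a date if a 'representation initiated' event is logged on that exact date,
# following the brief: 'A case is opened with a “representation initiated” in the event type field'.
# Cases initiated earlier and not yet closed are not counted.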
num_open_cases(event_log, '2020-05-12')

1
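
The count above covers only cases initiated on the given date. A broader reading of "open on that date" would also include cases initiated earlier and not yet closed. A sketch of that reading follows, with these assumptions stated plainly: each client has at most one case at a time, the closing event type is spelled `representation ended`, a case closed on the given date counts as closed, and same-day events are applied in the order they appear in the log. The function name `num_open_cases_as_of` and the sample rows are hypothetical.

```python
import pandas as pd

def num_open_cases_as_of(df, date):
    """Count clients whose latest representation event on or before `date` is an initiation."""
    rep_types = ['representation initiated', 'representation ended']
    rep = df[df['event_type'].isin(rep_types)].copy()
    rep['event_date'] = pd.to_datetime(rep['event_date'])
    # Keep only events that had happened by the given date (inclusive).
    rep = rep[rep['event_date'] <= pd.Timestamp(date)]
    # Mergesort is stable, so same-day events keep their order from the log.
    rep = rep.sort_values('event_date', kind='mergesort')
    latest = rep.groupby('id_number')['event_type'].last()
    return int((latest == 'representation initiated').sum())

# Hypothetical sample: client 1 opens, closes, and re-opens; client 2 opens and closes;
# client 3 only has an in-custody event.
sample_events = pd.DataFrame({
    'id_number': [1, 1, 1, 2, 2, 3],
    'event_type': ['representation initiated', 'representation ended',
                   'representation initiated', 'representation initiated',
                   'representation ended', 'TYPE1'],
    'event_date': ['2020-01-10', '2020-02-01', '2020-03-01',
                   '2020-01-15', '2020-04-01', '2020-02-02'],
})
print(num_open_cases_as_of(sample_events, '2020-02-15'))
```

On the sample, only client 2 has an open case on 2020-02-15, because client 1's first case ended on 2020-02-01 and was not re-opened until 2020-03-01.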


## Cleanup

In [13]:
!rm -rf event_log.db