# Technical Annex 1: Sample design for legal practitioner surveys
This annex documents all steps of the sampling process to ensure complete reproducibility. The notebook is ordered chronologically and introduces the relevant code as it is required by the sampling process. All code runs on Python 3.8; required modules are listed below. For further information on the broader research methodology of this project, please consult the methodology document. For questions, please contact Peter Naderer (peter.naderer@outlook.com, +49 160 65 68 571).

## Overview

The survey is designed as a statistically representative instrument for lawyers and judges. It is designed as a phone-based instrument, executed by employees of the Bar Association of Kyrgyzstan. The survey will be administered online through *LimeSurvey*, accessible [here](https://ohchr.limequery.net/index.php/admin). The instrument has an estimated response time of 7 minutes, based on a preliminary test run in September 2020 among employees of the UN Human Rights Office. The final questionnaire can be accessed [here](github.link). The survey is available in English, Russian and Kyrgyz.

## Sampling
### Lawyers

#### Overview
The survey will rely on a simple random sample without replacement (SRS) of all registered lawyers in Kyrgyzstan. The research interest is the body of lawyers in Kyrgyzstan as a whole rather than any subset of it, so stratification is not needed. The *sampling frame* of the survey is a data-cleaned version of an official registry of lawyers maintained by the Ministry of Justice at this [address](http://minjust.gov.kg/upload/files/2017-07-21/9d165566b2ce38dc4a9d4d2b9b147cd6.docx).

#### Expected sampling issues
Note that the Ministry does not maintain the registry accurately: some lawyers on the list may no longer be active, may have been disbarred or may be deceased. We do not know at this point how outdated the registry really is. There is also no reliable and comprehensive way to verify the records in the registry, short of contacting each lawyer individually. The Bar Association of Kyrgyzstan does not have its own database of lawyers that could be used for a general cross-reference or as an alternative sampling frame. The sample therefore has to be drawn from a potentially defective data source.

This raises a particular problem for sampling. Any sample is likely to contain units that are not eligible to participate (if disbarred) or cannot participate (if deceased). (NB: inactive lawyers are eligible to participate.) In effect, the interviewer creates a nonresponse by contacting such a sample unit and marking the respondent as ineligible. To maintain the sample size, these quasi-nonrespondents will be compensated for by drawing an initial sample that is larger than the calculated sample size. We argue that this is admissible because units that are nonrespondents due to disbarment or death are *missing completely at random*: they are nonrespondents for reasons unrelated to the survey, and their opinion and experience on discrimination cases is independent of their nonresponse. Nonresponse from respondents who are correctly placed within the sampling frame will be treated differently and addressed by way of callbacks.

#### Sample size calculation

The sample is drawn from the above-mentioned cleaned registry of lawyers. The data cleaning process can be reviewed and reproduced [here](Data-cleaning-code). The dataset can be accessed by loading the following pickle file:

In [6]:
import pickle
import pandas as pd
import datetime
import numpy as np

with open('lawyers-2020-10-29.pkl', 'rb') as f:
    df = pickle.load(f)

Due to data quality issues at the source, a workaround is required for sampling. Several records are split across rows in the dataset, resulting in duplicate entries. These surplus rows were removed for sampling and should be referred to only if contact details are missing from the original record. The command in the next code box keeps only rows with a non-missing `index`, which drops the surplus rows; the length of the resulting dataset gives the population size.

In [7]:
df = df[df['index'].notna()]
N = len(df)
N

2405

`df` now represents our final dataset for sampling, i.e. our population, with N = 2405 records. For the purpose of our research, we are looking for an intermediate level of precision only. We want to understand, in broad terms, how the legal community responds to cases relating to discrimination. We are therefore content with a **confidence level** of **90%** and a **margin of error (ME)** of **0.05**. In estimating the sample size, we follow Lohr (2019), using the following equation and determining the size using only the ratio $S/\bar{y}_{U}$, the CV for a sample of size 1:

$n = \frac{z^{2}_{\alpha/2}S^{2}}{(r\bar{y}_{U})^{2} + \frac{z^{2}_{\alpha/2}S^{2}}{N}}$

where $S$ is the population standard deviation, $\bar{y}_{U}$ the population mean, $r$ the relative precision, $N$ the population size and $z_{\alpha/2}$ the standard normal quantile for the chosen confidence level.

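Dropping the finite population term in the denominator, using the most conservative variance for a proportion, $S^{2} \approx p(1-p)$ with $p = 1/2$, and inserting the absolute margin of error of 0.05 for the precision term leads to the following calculation and result: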

In [8]:
n_0 = ((1.645**2)*(1/2)*(1-1/2))/(0.05)**2
n_0

270.60249999999996

Finally, given the limited size of the population, it would be possible to adjust for the finite population, but that correction (`n = n_0/(1+(n_0/N))`) would reduce the sample size by only about 30 respondents and is therefore not necessary. While a 95% CI and a 0.03 ME can be achieved with a sample size higher than 550, this does not seem necessary for our purpose either. The final sample size to be achieved is therefore 271. However, as mentioned, we cannot reliably estimate the level of registry deprecation. The arithmetically derived sample size will therefore be increased to account for the exclusion of ineligible items in the sampling frame. Based on previous experience, the Bar Association assumes that about 10% to 20% of records may be unavailable. We take the upper bound and assume that 20% of the sample's records are deprecated, which corresponds to 54 records (20% of 271).
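The sample size calculation and the finite population correction discussed above can be sketched as two small helpers. The function names `srs_sample_size` and `apply_fpc` are illustrative and not part of the original notebook; the z-value of 1.645, the margin of 0.05, p = 1/2 and N = 2405 are taken from the text above.

```python
import math


def srs_sample_size(z: float, margin: float, p: float = 0.5) -> float:
    """Sample size for a proportion under SRS, ignoring the finite population."""
    return (z ** 2) * p * (1 - p) / margin ** 2


def apply_fpc(n_0: float, population: int) -> float:
    """Finite population correction: n = n_0 / (1 + n_0 / N)."""
    return n_0 / (1 + n_0 / population)


N = 2405                                     # population size from the cleaned registry
n_0 = srs_sample_size(z=1.645, margin=0.05)  # uncorrected sample size
n_fpc = apply_fpc(n_0, N)                    # sample size after the correction

print(math.ceil(n_0))          # uncorrected sample size, rounded up
print(math.ceil(n_fpc))        # corrected sample size, rounded up
print(round(n_0 - n_fpc, 1))   # respondents saved by applying the correction
```

Rounding up rather than to the nearest integer keeps the achieved precision at or above the target.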

The **final sample size** is set at **271**. The number of registry items drawn is ***325***, that is, 271 plus 54 additional items to replace items that should not be in the sampling frame.
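The inflation of the draw can be reproduced with a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the second calculation (dividing by the expected eligible share) is shown only as a common alternative for comparison, not as the approach taken in this annex.

```python
import math

target_n = 271           # completed interviews required
deprecation_rate = 0.20  # assumed share of deprecated registry records

# Approach used in this annex: add 20% of the target on top of the target.
extra = round(target_n * deprecation_rate)
n_drawn = target_n + extra
print(extra, n_drawn)

# Alternative for comparison (an assumption, not used here): divide by the
# expected eligible share so that the expected number of eligible units
# in the draw equals the target.
n_drawn_alt = math.ceil(target_n / (1 - deprecation_rate))
print(n_drawn_alt)
```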

#### Drawing the sample

The sample is drawn using `pandas` and its built-in method `.sample()`. The method is seeded with the pseudo-random number generator `RandomState()` of `numpy` to ensure deterministic results.



In [10]:
from numpy.random import RandomState, random

prng = RandomState(146215877) # Value obtained with random()
pd.options.display.max_rows = 10
sample = df.sample(n=325, random_state=prng)
sample

Unnamed: 0,index,license_no,full_name,address,phone,decree,workplace,comment
2504,2384.,4098,Вишнякова\tЕлена Геннадьевна,"Г.Бишкек, ул. К.Акиева, 57-10",0555 515-535,Пр. № 133 от 20.09.2016г. Смена фамилии,,
736,684.,2855,Кадыркулов Кочконбай Анарбаевич,"Таласская обл., с.Бакай-Ата, ул.Жумаке, 14","т: 03457 31943, 0779461781",пр. 56 от 18.07.2011,,
1879,1783.,2959,Бапанова Садия Дуулатовна,"г.Бишкек, ул.Совесткая, 99/25","т: 386377, 0550661552",пр. 9 от 31.01.2012 г.,,
131,127.,3248,Акматбеков Кубандык Акматбекович,"г. Бишкек, ул. Маликова, №62",0778- 19 30 93 0558- 19 30 93,Пр. № 27 от 02.04.13,Ст.9,Пр. № 24 от 11.03.15 г. приостановлено
630,582.,830,Жумагазиев Зарыл,"Бишкек, ул. Малдыбаева, № 215, кв.39",0559- 00 90 39 0770- 48 97 64 0543- 05 53 83,пр № 34 от 06.09.02,,
...,...,...,...,...,...,...,...,...
1183,1113.,980,Низамов Рустам Мамырович,"Бишкек, 12-30-33","466252, 299638",от 03.04.03 № 26,"ЧП""Дильноза""",
518,474.,*501,Джолдошев Руслан Кенешевич,"Бишкек, 7м-р,12-56",47-24-60,от04.12.2000 №34,КСФК КД гл.юрист,ст. 9
1756,1666.,3699,Шарипова Гульбарчин,,,,,
1140,1073.,2537,Мурзабеков Сатыбалды Михайлович,"иссык-Кульская обл.. Г. Чолпон- Ата, ул. Озер...","03943 43327, 0543 077166",пр. № 50 от 24.07.09г.,,
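The reproducibility of the draw can be checked on a toy frame. The sketch below is illustrative only: `toy` is a hypothetical stand-in for the registry, and only the seed value is taken from the cell above.

```python
import pandas as pd
from numpy.random import RandomState

# Hypothetical stand-in for the registry (the real frame has 2405 rows).
toy = pd.DataFrame({"license_no": range(1000, 1020)})

seed = 146215877  # same seed as used for the actual draw

# A fresh RandomState with the same seed reproduces the same draw.
first = toy.sample(n=5, random_state=RandomState(seed))
second = toy.sample(n=5, random_state=RandomState(seed))

print(first.index.equals(second.index))  # identical rows in identical order
print(first.index.is_unique)             # no row drawn twice (without replacement)
```

Note that reusing one `RandomState` object for a second call to `.sample()` advances its internal state, so the second draw would differ. Re-creating the generator from the seed, as above, is what makes the result repeatable.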


### Costing

The survey will run through three phases: first contact, first callback and second callback. Each phase has an estimated response rate, which has a bearing on costing. Work time is split into call time (i.e. time required to call and conduct the survey) and administrative time (i.e. time required for pre- and post-processing such as amending the database). The expected runtime of one survey is 8 minutes; an unsuccessful call is budgeted at 2 minutes of call time. It is estimated that 20% of records are deprecated. Administrative time is 2 minutes for completed surveys and 3 minutes for non-responses.


| Sample units           | Type     | Response rate | Call time (min per unit) | Admin time (min per unit) | Total (min) |
|------------------------|----------|---------------|--------------------------|---------------------------|-------------|
| **First phase**        |          |               |                          |                           |             |
| 325                    | All      | 0.5           | 0                        | 0                         |             |
| 163                    | Resp     | 1             | 8                        | 2                         | 1630        |
| 162                    | Non-Resp | 0             | 2                        | 3                         | 810         |
| **Callback - Round 1** |          |               |                          |                           |             |
| 162                    | All      | 0.5           | 0                        | 0                         |             |
| 81                     | Resp     | 1             | 8                        | 2                         | 810         |
| 81                     | Non-Resp | 0             | 2                        | 3                         | 405         |
| **Callback - Round 2** |          |               |                          |                           |             |
| 81                     | All      | 0.2           | 0                        | 0                         |             |
| 16                     | Resp     | 1             | 8                        | 2                         | 160         |
| 65                     | Non-Resp | 0             | 2                        | 3                         | 325         |
| **Total**              |          |               |                          |                           | 4140        |

In addition, an **inception workshop** is required to explain survey rules and regulations, as well as technical aspects of survey execution, to the interviewers. The inception workshop is expected to last 60 minutes. Assuming three interviewers take part in the workshop, this adds 180 minutes.

The **total workload** is therefore **4320 minutes** (4140 + 180) or **72 hours**. Assuming three interviewers, that is 24 hours per person. Assuming an hourly rate of **21 USD**, the final amount will be approximately **1500 USD** (72 × 21 = 1512).
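The workload can be recomputed directly from the rows of the costing table and the workshop assumptions. This is a minimal sketch with illustrative variable names; all figures are taken from the table and the paragraphs above.

```python
# Respondents and non-respondents per phase, as listed in the costing table.
phases = {
    "first contact": {"resp": 163, "nonresp": 162},
    "callback 1": {"resp": 81, "nonresp": 81},
    "callback 2": {"resp": 16, "nonresp": 65},
}

RESP_MINUTES = 8 + 2     # call time + admin time per completed survey
NONRESP_MINUTES = 2 + 3  # call time + admin time per non-response

survey_minutes = sum(
    p["resp"] * RESP_MINUTES + p["nonresp"] * NONRESP_MINUTES
    for p in phases.values()
)

workshop_minutes = 3 * 60  # three interviewers, 60 minutes each
total_minutes = survey_minutes + workshop_minutes
total_hours = total_minutes / 60

hourly_rate_usd = 21
cost_usd = total_hours * hourly_rate_usd

print(survey_minutes, total_minutes, total_hours, cost_usd)
```

Keeping the per-unit minutes as named constants makes it easy to rerun the estimate if the expected survey runtime or the response rates change.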

### Judges

<div class="alert alert-block alert-danger"><b>Info:</b> Information on the survey for judges will be added at a later stage.</div>

## Code repository

### Data cleaning code

```
from datetime import date
import pickle
import urllib.request

from docx import Document
import numpy as np
import pandas as pd

url = 'http://minjust.gov.kg/upload/files/2017-07-21/9d165566b2ce38dc4a9d4d2b9b147cd6.docx'

# urlretrieve returns a (local_path, headers) tuple; Document needs the path only
path, _ = urllib.request.urlretrieve(url)
document = Document(path)

# Read every table in the document into a list of rows (lists of cell texts)
tables = []
for table in document.tables:
    rows = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            if cell.text:
                rows[i][j] = cell.text
    tables.append(rows)

# Flatten the per-table row lists into a single list of rows
flat_list = []
for sublist in tables:
    for item in sublist:
        flat_list.append(item)

df = pd.DataFrame(flat_list)

### Preliminary cleaning

# Row 1 holds the original headers (kept for reference); the data starts at row 2
new_headers = df.iloc[1]
df = df[2:]
df.columns = ['index', 'license_no', 'full_name', 'address', 'phone', 'decree', 'workplace', 'comment', 'empty1', 'empty2']
df = df.drop(columns=['empty1', 'empty2'])

# Replace line breaks inside cells with spaces and turn blank cells into NaN
df = df.replace(r'\n', ' ', regex=True)
df = df.replace(r'^\s*$', np.nan, regex=True)
# Drop rows that have neither an index nor a name
df = df.dropna(subset=['index', 'full_name'], how='all')

# Save the cleaned registry with today's date in the file name
with open('lawyers-{}.pkl'.format(date.today()), 'wb') as f:
    pickle.dump(df, f)
```
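The effect of the cleaning steps can be illustrated on a small, made-up frame. The data in `toy` is hypothetical and only mimics the kinds of rows described in this annex: a regular record, a surplus row from a split record (name but no index), a record with a line break inside a cell, and an empty row.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; not taken from the registry.
toy = pd.DataFrame(
    {
        "index": ["1.", "", "2.", ""],
        "full_name": ["Name A", "Name A", "Name\nB", "   "],
        "phone": ["111", "222", "333", ""],
    }
)

# Same steps as in the cleaning code above.
cleaned = toy.replace(r"\n", " ", regex=True)
cleaned = cleaned.replace(r"^\s*$", np.nan, regex=True)
cleaned = cleaned.dropna(subset=["index", "full_name"], how="all")

# Same filter as in the sampling section: keep rows with an index only.
population = cleaned[cleaned["index"].notna()]

print(list(cleaned.index))     # the empty row is gone, the surplus row remains
print(list(population.index))  # the surplus row is removed as well
```

The surplus row survives the cleaning step because it still carries a name, which is why the additional `notna()` filter on `index` is applied before sampling.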