# Codebook  
**Authors:** Lauren Baker  
Documentation of DaanMatch's existing data files, including location, owner, "version", source, etc.

In [1]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [2]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('my-bucket')

# Districts-10-.csv
## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [Name](#4.1)
    * [Value](#4.2)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: Unknown.    
Source: https://daanmatchdatafiles.s3-us-west-1.amazonaws.com/DaanMatch_DataFiles/Districts-10-.csv  
Type: csv  
Last Modified: May 29, 2021, 19:53:10 (UTC-07:00)  
Size: 676.0 B

In [3]:
path = "s3://daanmatchdatafiles/DaanMatch_DataFiles/Districts-10-.csv"
districts_10 = pd.read_csv(path)
districts_10

Unnamed: 0,KeyColumn,Name,Value
0,203,Pashchim Champaran,203
1,204,Purba Champaran,204
2,205,Sheohar,205
3,206,Sitamarhi,206
4,207,Madhubani,207
5,208,Supaul,208
6,209,Araria,209
7,210,Kishanganj,210
8,211,Purnia,211
9,212,Katihar,212


**What's in this dataset?** <a class="anchor" id="2"></a>

In [4]:
print("Shape:", districts_10.shape)
print("Rows:", districts_10.shape[0])
print("Columns:", districts_10.shape[1])
print("Each row is a district in the Bihar state in India.")

Shape: (38, 3)
Rows: 38
Columns: 3
Each row is a district in the Bihar state in India.


**Codebook** <a class="anchor" id="3"></a>

In [6]:
districts_10_columns = list(districts_10.columns)
districts_10_description = ["Same as the Value column.",
                            "Name of the district in Bihar. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the 38 districts in the state of Bihar.",
                            "This column has no inherent meaning; it simply numbers the districts. There are 739 total districts in India, so the value is the district's number among all districts in the country."]
districts_10_dtypes = list(districts_10.dtypes)

data = {"Column Name": districts_10_columns, "Description": districts_10_description, "Type": districts_10_dtypes}
districts_10_codebook = pd.DataFrame(data)
districts_10_codebook.style.set_properties(subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column Name,Description,Type
0,KeyColumn,Same as the Value column.,int64
1,Name,"Name of the district in Bihar. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the 38 districts in the state of Bihar.",object
2,Value,"This column has no inherent meaning; it simply numbers the districts. There are 739 total districts in India, so the value is the district's number among all districts in the country.",int64
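
The codebook describes `KeyColumn` as the same as `Value`, and that can be checked row by row. This is a minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical stand-in holding the first three rows shown earlier) in place of `districts_10`, which the notebook reads from S3:

```python
import pandas as pd

# Hypothetical stand-in for districts_10: the first three rows shown earlier.
sample = pd.DataFrame({
    "KeyColumn": [203, 204, 205],
    "Name": ["Pashchim Champaran", "Purba Champaran", "Sheohar"],
    "Value": [203, 204, 205],
})

# True only if every row holds the same number in both columns.
columns_match = bool((sample["KeyColumn"] == sample["Value"]).all())
print("KeyColumn equals Value:", columns_match)
```

Running the same comparison on the full `districts_10` frame would confirm the description for all 38 rows.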


**Missing values** <a class="anchor" id="3.1"></a>

In [7]:
districts_10.isnull().sum()

KeyColumn    0
Name         0
Value        0
dtype: int64

No column contains empty cells. Bihar has 38 districts and this dataset has 38 rows, so no districts are missing either.
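
The reasoning above combines two separate checks: `isnull()` finds empty cells, while comparing the row count with the expected number of districts finds rows that are absent altogether. A sketch of both checks together, assuming a hypothetical helper `completeness_report` and a made-up three-row sample with one blank name (not the real data):

```python
import pandas as pd

EXPECTED_BIHAR_DISTRICTS = 38  # figure stated in the text above


def completeness_report(df, expected_rows):
    """Count empty cells and rows that are absent relative to expected_rows."""
    return {
        "null_cells": int(df.isnull().sum().sum()),
        "rows": len(df),
        "rows_missing": max(expected_rows - len(df), 0),
    }


# Hypothetical sample for illustration: one name deliberately left blank.
sample = pd.DataFrame({
    "KeyColumn": [203, 204, 205],
    "Name": ["Pashchim Champaran", None, "Sheohar"],
    "Value": [203, 204, 205],
})

report = completeness_report(sample, EXPECTED_BIHAR_DISTRICTS)
print(report)
```

On the sample, the report flags one empty cell and 35 absent rows. For the dataset described above, both counts would be zero.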

**Summary statistics** <a class="anchor" id="3.2"></a>

In [8]:
districts_10.describe()

Unnamed: 0,KeyColumn,Value
count,38.0,38.0
mean,221.5,221.5
std,11.113055,11.113055
min,203.0,203.0
25%,212.25,212.25
50%,221.5,221.5
75%,230.75,230.75
max,240.0,240.0


## Columns
<a class="anchor" id="4"></a>

### Name
<a class="anchor" id="4.1"></a>
Name of the district in the state of Bihar, India. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the districts in Bihar.

In [9]:
column = districts_10["Name"]
column

0     Pashchim Champaran
1        Purba Champaran
2                Sheohar
3              Sitamarhi
4              Madhubani
5                 Supaul
6                 Araria
7             Kishanganj
8                 Purnia
9                Katihar
10             Madhepura
11               Saharsa
12             Darbhanga
13           Muzaffarpur
14             Gopalganj
15                 Siwan
16                 Saran
17              Vaishali
18            Samastipur
19             Begusarai
20              Khagaria
21             Bhagalpur
22                 Banka
23                Munger
24            Lakhisarai
25            Sheikhpura
26               Nalanda
27                 Patna
28               Bhojpur
29                 Buxar
30       Kaimur (Bhabua)
31                Rohtas
32            Aurangabad
33                  Gaya
34                Nawada
35                 Jamui
36             Jehanabad
37                 Arwal
Name: Name, dtype: object

In [10]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = Counter(column)
duplicates = {key: value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if duplicates:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 38
Duplicates: {}
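
The duplicate check above is repeated for each column, so it could be wrapped in a small helper. This is one possible sketch using `value_counts` instead of `Counter`. The function name `find_duplicates` and the sample series, which repeats a name on purpose, are hypothetical and not part of the notebook:

```python
import pandas as pd


def find_duplicates(column):
    """Map each value that occurs more than once to its number of occurrences."""
    counts = column.value_counts()
    return counts[counts > 1].to_dict()


# Made-up sample with one repeated name, for illustration only.
names = pd.Series(["Patna", "Gaya", "Patna"])
duplicates = find_duplicates(names)
print("Duplicates:", duplicates)
```

An empty dictionary, as in the output above, means every value in the column is unique.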


### Value
<a class="anchor" id="4.2"></a>
This column has no inherent meaning; it simply numbers the districts. There are 739 total districts in India, so the value is the district's number among all districts in the country.

In [11]:
column = districts_10["Value"]
column

0     203
1     204
2     205
3     206
4     207
5     208
6     209
7     210
8     211
9     212
10    213
11    214
12    215
13    216
14    217
15    218
16    219
17    220
18    221
19    222
20    223
21    224
22    225
23    226
24    227
25    228
26    229
27    230
28    231
29    232
30    233
31    234
32    235
33    236
34    237
35    238
36    239
37    240
Name: Value, dtype: int64

In [12]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = Counter(column)
duplicates = {key: value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if duplicates:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 38
Duplicates: {}
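
Because `Value` is described as a running count of districts, it is also worth confirming that the numbers form an unbroken run. A minimal sketch, assuming a hypothetical helper `is_consecutive` and rebuilding the values 203 through 240 listed above rather than reading the file:

```python
import pandas as pd


def is_consecutive(values):
    """True if the sorted integers increase by exactly 1, with no gaps or repeats."""
    ordered = pd.Series(list(values)).sort_values().reset_index(drop=True)
    return bool((ordered.diff().dropna() == 1).all())


bihar_values = range(203, 241)  # the 38 values listed above
run_is_unbroken = is_consecutive(bihar_values)
print("Consecutive:", run_is_unbroken)

# A made-up list with a gap (205 missing) fails the check.
print("Consecutive:", is_consecutive([203, 204, 206]))
```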
