# Codebook  
**Authors:** Lauren Baker  
This notebook documents DaanMatch's existing data files, recording each file's location, owner, "version", source, and related details.

In [2]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [3]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('my-bucket')

# Districts-20-.csv
## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [Name](#4.1)
    * [Value](#4.2)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: Unknown.    
Source: https://daanmatchdatafiles.s3-us-west-1.amazonaws.com/DaanMatch_DataFiles/Districts-20-.csv  
Type: csv  
Last Modified: May 29, 2021, 19:54:19 (UTC-07:00)  
Size: 420.0 B

In [6]:
# Reading an s3:// path directly with pandas requires the optional s3fs package.
path = "s3://daanmatchdatafiles/DaanMatch_DataFiles/Districts-20-.csv"
districts_20 = pd.read_csv(path)
districts_20

Unnamed: 0,KeyColumn,Name,Value
0,346,Garhwa,346
1,347,Chatra,347
2,348,Kodarma,348
3,349,Giridih,349
4,350,Deoghar,350
5,351,Godda,351
6,352,Sahibganj,352
7,353,Pakaur,353
8,354,Dhanbad,354
9,355,Bokaro,355


**What's in this dataset?** <a class="anchor" id="2"></a>

In [11]:
print("Shape:", districts_20.shape)
print("Rows:", districts_20.shape[0])
print("Columns:", districts_20.shape[1])
print("Each row is a district in the state of Jharkhand, India.")

Shape: (24, 3)
Rows: 24
Columns: 3
Each row is a district in the state of Jharkhand, India.


**Codebook** <a class="anchor" id="3"></a>

In [20]:
districts_20_columns = [column for column in districts_20.columns]
districts_20_description = ["Same as the Value column.",
                            "Name of a district in Jharkhand. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the 24 districts in the state of Jharkhand.",
                            "This column carries no meaning beyond numbering the districts. There are 739 districts in India in total, so each value is the district's number among all districts."]
districts_20_dtypes = [dtype for dtype in districts_20.dtypes]

data = {"Column Name": districts_20_columns, "Description": districts_20_description, "Type": districts_20_dtypes}
districts_20_codebook = pd.DataFrame(data)
districts_20_codebook.style.set_properties(subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column Name,Description,Type
0,KeyColumn,Same as the Value column.,int64
1,Name,"Name of a district in Jharkhand. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the 24 districts in the state of Jharkhand.",object
2,Value,"This column carries no meaning beyond numbering the districts. There are 739 districts in India in total, so each value is the district's number among all districts.",int64
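
The codebook describes `KeyColumn` as the same as `Value`. A minimal sketch of how that claim could be checked is shown here. It uses a small stand-in frame (`sample`, a name introduced only for this illustration) built from the first three rows displayed earlier; in the notebook, `districts_20` would take its place.

```python
import pandas as pd

# Stand-in frame built from the first rows shown earlier in this notebook;
# in the notebook itself, use districts_20 instead of this sample.
sample = pd.DataFrame({
    "KeyColumn": [346, 347, 348],
    "Name": ["Garhwa", "Chatra", "Kodarma"],
    "Value": [346, 347, 348],
})

# True only if every row has identical KeyColumn and Value entries.
key_matches_value = bool((sample["KeyColumn"] == sample["Value"]).all())
print("KeyColumn identical to Value:", key_matches_value)
```

If the check holds on the full data, one of the two columns is redundant and could be dropped in downstream processing.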


**Missing values** <a class="anchor" id="3.1"></a>

In [14]:
districts_20.isnull().sum()

KeyColumn    0
Name         0
Value        0
dtype: int64

No column contains null cells. Jharkhand has 24 districts and this dataset has 24 rows, so no district appears to be missing either.
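
The two checks described above (no null cells, and a row count that matches the expected number of districts) can be combined in one helper. This is only a sketch: `completeness_report`, `EXPECTED_DISTRICTS`, and the `demo` frame are hypothetical names introduced for illustration, and `demo` is a placeholder rather than the real file.

```python
import pandas as pd

EXPECTED_DISTRICTS = 24  # district count for Jharkhand, as stated in this notebook


def completeness_report(df, expected_rows):
    """Return the total number of null cells and whether the row count matches."""
    return {
        "missing_cells": int(df.isnull().sum().sum()),
        "row_count_ok": len(df) == expected_rows,
    }


# Hypothetical stand-in with 24 rows; replace with districts_20 in the notebook.
demo = pd.DataFrame({
    "Name": [f"District {i}" for i in range(24)],
    "Value": list(range(346, 370)),
})
report = completeness_report(demo, EXPECTED_DISTRICTS)
print(report)
```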

**Summary statistics** <a class="anchor" id="3.2"></a>

In [15]:
districts_20.describe()

Unnamed: 0,KeyColumn,Value
count,24.0,24.0
mean,357.5,357.5
std,7.071068,7.071068
min,346.0,346.0
25%,351.75,351.75
50%,357.5,357.5
75%,363.25,363.25
max,369.0,369.0
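
As a sanity check on these statistics: the Value column listed later in this notebook is a run of $n = 24$ consecutive integers from 346 to 369. For such a run the mean is the midpoint of the range, and the sample standard deviation (which is what `describe()` reports) depends only on $n$:

$$\bar{x} = \frac{346 + 369}{2} = 357.5, \qquad s = \sqrt{\frac{n(n+1)}{12}} = \sqrt{\frac{24 \cdot 25}{12}} = \sqrt{50} \approx 7.071068$$

Both values agree with the table, which is consistent with the column being a plain running count.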


## Columns
<a class="anchor" id="4"></a>

### Name
<a class="anchor" id="4.1"></a>
Name of a district in the state of Jharkhand, India. India has 28 states and 8 union territories, each of which is divided into districts. This column holds the names of the districts in Jharkhand.

In [16]:
column = districts_20["Name"]
column

0              Garhwa
1              Chatra
2             Kodarma
3             Giridih
4             Deoghar
5               Godda
6           Sahibganj
7              Pakaur
8             Dhanbad
9              Bokaro
10          Lohardaga
11    Purbi Singhbhum
12             Palamu
13            Latehar
14          Hazaribag
15            Ramgarh
16              Dumka
17            Jamtara
18             Ranchi
19             Khunti
20              Gumla
21            Simdega
22    Pachim Sionghum
23          Saraikela
Name: Name, dtype: object

In [17]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if len(duplicates) > 0:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 24
Duplicates: {}
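
The same duplicate check can be written with pandas alone. The sketch below is an alternative to the `Counter` approach, not a replacement for it; `find_duplicates` and both example Series are hypothetical names and inputs used only for illustration.

```python
import pandas as pd


def find_duplicates(column):
    """Return a dict mapping each repeated value to its number of occurrences."""
    counts = column.value_counts()
    return counts[counts > 1].to_dict()


# A few district names from this dataset: no repeats expected.
names = pd.Series(["Garhwa", "Chatra", "Kodarma"], name="Name")
print("Duplicates:", find_duplicates(names))

# Illustrative input with a repeat, to show what a non-empty result looks like.
with_repeat = pd.Series(["Garhwa", "Chatra", "Garhwa"])
print("Duplicates:", find_duplicates(with_repeat))
```

Note that both approaches compare values exactly, so names that differ only in spelling, case, or whitespace would not be reported as duplicates.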


### Value
<a class="anchor" id="4.2"></a>
This column carries no meaning beyond numbering the districts. There are 739 districts in India in total, so each value is the district's number among all districts.

In [18]:
column = districts_20["Value"]
column

0     346
1     347
2     348
3     349
4     350
5     351
6     352
7     353
8     354
9     355
10    356
11    357
12    358
13    359
14    360
15    361
16    362
17    363
18    364
19    365
20    366
21    367
22    368
23    369
Name: Value, dtype: int64

In [19]:
print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if len(duplicates) > 0:
    print("No. of duplicates:", len(duplicates))

No. of unique values: 24
Duplicates: {}
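
Because Value is described as a running count of districts, it may also be worth confirming that the values form an unbroken sequence. A minimal sketch follows, assuming the 346 to 369 range shown in the output above; `is_consecutive_run` is a hypothetical helper, and `values` stands in for `districts_20["Value"]`.

```python
import pandas as pd


def is_consecutive_run(column):
    """True if the sorted values increase by exactly 1 from one row to the next."""
    ordered = column.sort_values()
    return bool(ordered.diff().dropna().eq(1).all())


# Stand-in for districts_20["Value"], using the range shown in the output above.
values = pd.Series(range(346, 370), name="Value")
print("Consecutive:", is_consecutive_run(values))
print("Range:", values.min(), "to", values.max())
```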
