# Codebook  
**Authors:** Patrick Guo  
Documents DaanMatch's existing data files, with information about each file's location, owner, "version", source, etc.

In [1]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [2]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('my-bucket')

# 42621 Final_Data_ngodarpan.gov.in

## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [Name](#4.1)
    * [ngo url](#4.2)
    * [Mobile](#4.3)
    * [UniqueID](#4.4)
    * [Off phone1](#4.5)
    * [Email](#4.6)
    * [Major Activities1](#4.7)
    * [operational states db](#4.8)
    * [issues working db](#4.9)
    * [operational district db](#4.10)
    * [reg name](#4.11)
    * [fcrano](#4.12)
    * [nr regNo](#4.13)
    * [nr add](#4.14)
    * [nr orgName](#4.15)
    * [ngo reg date](#4.16)
    * [nr actName](#4.17)
    * [nr city](#4.18)
    * [TypeDescription](#4.19)
    * [StateName](#4.20)
    * [status](#4.21)
    * [president name](#4.22)
    * [president email](#4.23)
    * [president mobile](#4.24)
    * [Chairman name](#4.25)
    * [Chairman email](#4.26)
    * [Chairman mobile](#4.27)
    * [Secretary name](#4.28)
    * [Secretary email](#4.29)
    * [Secretary mobile](#4.30)
    * [Asisstant Secretary name](#4.31)
    * [Asisstant Secretary email](#4.32)
    * [Asisstant Secretary mobile](#4.33)
    * [Board Member name](#4.34)
    * [Board Member email](#4.35)
    * [Board Member mobile](#4.36)
    * [Vice Chairman name](#4.37)
    * [Vice Chairman email](#4.38)
    * [Vice Chairman mobile](#4.39)
    * [Member name](#4.40)
    * [Member email](#4.41)
    * [Member mobile](#4.42)

In [3]:
# Lists out the column names in TOC format
def toc_maker(dataset):
    counter = 1
    for column in dataset.columns:
        # The leading "#" makes each entry a link to the matching anchor id
        print("* [" + column + "](#4." + str(counter) + ")")
        counter +=1

In [4]:
#toc_maker(Final_Data_ngodarpan)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: NGO Darpan  
Source: ngodarpan.gov.in   
Type: xlsx  
Last Modified: June 1, 2021, 17:06:30 (UTC-07:00)  
Size: 49.7 MB

In [None]:
path = "s3://daanmatchdatafiles/Darpan21FCRA/42621 Final_Data_ngodarpan.gov.in.xlsx"
xl = pd.ExcelFile(path)
print(xl.sheet_names)
Final_Data_ngodarpan = xl.parse('ngodarpan.gov.in')
Final_Data_ngodarpan.head()

**What's in this dataset?**  <a class="anchor" id="2"></a>  

In [None]:
dataset = Final_Data_ngodarpan
print("Shape:", dataset.shape)
print("Rows:", dataset.shape[0])
print("Columns:", dataset.shape[1])
print("Each row is an NGO.")

**Codebook** <a class="anchor" id="3"></a>

In [None]:
dataset_columns = [column for column in dataset.columns]
dataset_desc = ["Name of NGO",
               "URL for NGO",
               "Mobile phone",
               "Unique ID of VO/NGO",
               "Telephone/Alternate number",
               "Email address",
               "Description of major activities",
               "List of states they operate in",
               "List of issues they are working on",
               "List of districts they operate in",
               "Name of registrar",
               "FCRA number",
               "Registration number",
               "Address",
               "Name of NGO",
               "Registration date",
               "Name of Act",
               "City of NGO",
               ]
dataset_desc = dataset_desc + ["N/A"] * (len(dataset_columns) - len(dataset_desc))
dataset_dtypes = [dtype for dtype in dataset.dtypes]

data = {"Column Name": dataset_columns, "Description": dataset_desc, "Type": dataset_dtypes}
dataset_codebook = pd.DataFrame(data)
dataset_codebook

**Missing values** <a class="anchor" id="3.1"></a>

In [None]:
Final_Data_ngodarpan.isnull().sum()
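
Raw counts of missing values can be hard to compare across 42 columns, so it may help to show them next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values and the `toy` and `summary` names are illustrative assumptions, not part of the dataset); the same expressions should apply to `Final_Data_ngodarpan`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", None, "D"],
    "Email": [None, None, "x@example.org", None],
})

# Count of missing values per column, plus the percentage of rows affected.
summary = pd.DataFrame({
    "Missing": toy.isnull().sum(),
    "Percent": (toy.isnull().mean() * 100).round(1),
}).sort_values("Missing", ascending=False)

print(summary)
```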

**Summary statistics** <a class="anchor" id="3.2"></a>

None. All qualitative features.

## Columns
<a class="anchor" id="4"></a>

### Name
<a class="anchor" id="4.1"></a>
Name of NGO.  
No. of unique values: 109682  
No. of duplicates: 1548  

In [None]:
column = dataset["Name"]
column

In [None]:
# Number of empty strings/missing values
print("Invalid:", sum(column == " ") + sum(column.isnull()))
print("No. of unique values:", len(column.unique()))
# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:[value] for key, value in counter.items() if value > 1}
print("No. of duplicates:", len(duplicates))
table = pd.DataFrame.from_dict(duplicates)
table = table.melt(var_name="Duplicate Names", value_name="Count").sort_values(by=["Count"], ascending=False).reset_index(drop=True)
table

In [None]:
# Example
dataset[column == table.iloc[0,0]].head()

Rows that share the same ```Name``` are not necessarily duplicate rows.
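
One way to check this is to compare rows that repeat a name against rows that repeat every column. The sketch below uses a made-up toy frame with only two of the dataset's columns (`Name` and `UniqueID`); the values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["Asha Trust", "Asha Trust", "Seva Samiti", "Seva Samiti"],
    "UniqueID": ["ID-001", "ID-002", "ID-003", "ID-003"],
})

# Rows whose Name appears more than once (keep=False marks every occurrence).
same_name = toy["Name"].duplicated(keep=False)

# Rows that match another row in every column.
same_row = toy.duplicated(keep=False)

print("Rows sharing a name:", int(same_name.sum()))
print("Fully duplicated rows:", int(same_row.sum()))
```

In the toy frame all four rows share a name with another row, but only the two "Seva Samiti" rows are true duplicates. Running `dataset.duplicated(keep=False)` on the real data would show whether any rows repeat in full.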

### ngo url
<a class="anchor" id="4.2"></a>
URL for NGO.  
No. of unique values: 24253  
No. of duplicate values: 202  
Many NGOs appear to have misunderstood this field when filling in their information and pasted the NGO Darpan URL instead of the URL of their own website, if they have one. These entries are the first 13 in the duplicates table below, so a large number of URLs are invalid.  
Additionally, a large number of URLs cannot be reached.

In [None]:
column = dataset["ngo url"]
column

In [None]:
# Number of empty strings/missing values
print("Invalid:", sum(column == " ") + sum(column.isnull()))

print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:[value] for key, value in counter.items() if value > 1}
print("No. of Duplicates:", len(duplicates))

table = pd.DataFrame.from_dict(duplicates)
table = table.melt(var_name="Duplicate URLs", value_name="Count").sort_values(by=["Count"], ascending=False).reset_index(drop=True)
table

Skipping the first 13 entries of the table leaves the remaining duplicated URLs:

In [None]:
table.iloc[13:]

In [None]:
# Example
dataset[column == table.iloc[13,0]].head()

Duplicate ```ngo url``` values do not necessarily mean duplicate rows.
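
To estimate how many entries point at the NGO Darpan portal rather than at an NGO's own website, the URLs could be classified by host name. This is a minimal sketch using only the standard library; `is_portal_url`, `PORTAL_HOST`, and the sample values are hypothetical names and data introduced here, and the rule that any `ngodarpan.gov.in` host counts as a portal URL is an assumption.

```python
from urllib.parse import urlparse

# Assumption: a URL counts as a portal URL when its host is ngodarpan.gov.in
# or a subdomain of it.
PORTAL_HOST = "ngodarpan.gov.in"

def is_portal_url(value):
    """Return True if value looks like a link to the NGO Darpan portal."""
    # Missing values (None, NaN) and blank strings are not portal URLs.
    if not isinstance(value, str) or not value.strip():
        return False
    text = value.strip()
    # urlparse only fills in the host when a scheme or leading // is present.
    if "://" not in text:
        text = "//" + text
    host = (urlparse(text).hostname or "").lower()
    return host == PORTAL_HOST or host.endswith("." + PORTAL_HOST)

# Invented sample values for illustration.
samples = [
    "https://ngodarpan.gov.in/index.php/home",
    "www.ngodarpan.gov.in",
    "http://example.org",
    None,
]
flags = [is_portal_url(s) for s in samples]
print(flags)
```

Applied to the notebook's data, `dataset["ngo url"].map(is_portal_url).sum()` would count the matching rows. This checks only the host name; it does not test whether a URL can be reached.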

### Mobile
<a class="anchor" id="4.3"></a>
Mobile number.  
Incorrect dtype.  
No. of unique values: 24253  
No. of duplicate values: 202  

In [None]:
column = dataset["Mobile"]
column
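
The note above flags the dtype as incorrect but does not say how the values are stored. Assuming the numbers were read as floats or as strings with mixed formatting, one option is to normalize every value to a string of digits. The helper `normalize_mobile` below is a hypothetical sketch, not part of the original notebook.

```python
import math

def normalize_mobile(value):
    """Return the digits of a phone number as a string, or None if missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        # A float such as 9876543210.0 should lose its trailing ".0".
        if value.is_integer():
            value = int(value)
    # Keep digits only; spaces, dashes, and "+" signs are dropped.
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits or None

# Invented sample values for illustration.
print(normalize_mobile(9876543210.0))
print(normalize_mobile(" 98765-43210"))
print(normalize_mobile(float("nan")))
```

With pandas, `dataset["Mobile"].map(normalize_mobile)` would apply the same rule to the whole column. Country-code digits are kept as they appear, so values may still need a length check afterwards.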