# Codebook  
**Authors:** Patrick Guo  
Documents DaanMatch's existing data files, with information about each file's location, owner, "version", source, etc.

In [1]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [2]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('my-bucket')

# 42621 Final_Data_ngodarpan.gov.in

## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [Name](#4.1)
    * [ngo url](#4.2)
    * [Mobile](#4.3)
    * [UniqueID](#4.4)
    * [Off phone1](#4.5)
    * [Email](#4.6)
    * [Major Activities1](#4.7)
    * [operational states db](#4.8)
    * [issues working db](#4.9)
    * [operational district db](#4.10)
    * [reg name](#4.11)
    * [fcrano](#4.12)
    * [nr regNo](#4.13)
    * [nr add](#4.14)
    * [nr orgName](#4.15)
    * [ngo reg date](#4.16)
    * [nr actName](#4.17)
    * [nr city](#4.18)
    * [TypeDescription](#4.19)
    * [StateName](#4.20)
    * [status](#4.21)
    * [president name](#4.22)
    * [president email](#4.23)
    * [president mobile](#4.24)
    * [Chairman name](#4.25)
    * [Chairman email](#4.26)
    * [Chairman mobile](#4.27)
    * [Secretary name](#4.28)
    * [Secretary email](#4.29)
    * [Secretary mobile](#4.30)
    * [Asisstant Secretary name](#4.31)
    * [Asisstant Secretary email](#4.32)
    * [Asisstant Secretary mobile](#4.33)
    * [Board Member name](#4.34)
    * [Board Member email](#4.35)
    * [Board Member mobile](#4.36)
    * [Vice Chairman name](#4.37)
    * [Vice Chairman email](#4.38)
    * [Vice Chairman mobile](#4.39)
    * [Member name](#4.40)
    * [Member email](#4.41)
    * [Member mobile](#4.42)

In [74]:
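# Prints the column names as nested TOC entries that link to the "4.n" anchors
def toc_maker(dataset):
    for counter, column in enumerate(dataset.columns, start=1):
        # The leading "#" makes each link target an in-page anchor
        print("    * [" + column + "](#4." + str(counter) + ")")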

In [1]:
#toc_maker(Final_Data_ngodarpan)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: NGO Darpan  
Source: ngodarpan.gov.in   
Type: xlsx  
Last Modified: June 1, 2021, 17:06:30 (UTC-07:00)  
Size: 49.7 MB
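
The notebook does not show how the last-modified time and size were obtained. One possible way, assuming they come from the S3 object's metadata, is a `head_object` call. The helper name `describe_s3_object` and the choice of decimal megabytes are assumptions made for this sketch; the bucket and key in the commented usage are taken from the path used to load the file.

```python
def describe_s3_object(client, bucket, key):
    """Return last-modified time and size for one S3 object.

    `client` is expected to behave like boto3.client("s3").
    """
    response = client.head_object(Bucket=bucket, Key=key)
    size_bytes = response["ContentLength"]
    return {
        "last_modified": response["LastModified"],
        "size_bytes": size_bytes,
        # Decimal megabytes (1 MB = 1,000,000 bytes) is an assumption
        "size_mb": round(size_bytes / 1_000_000, 1),
    }

# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# info = describe_s3_object(
#     boto3.client("s3"),
#     "daanmatchdatafiles",
#     "Darpan21FCRA/42621 Final_Data_ngodarpan.gov.in.xlsx",
# )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object when no AWS access is available.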

In [7]:
path = "s3://daanmatchdatafiles/Darpan21FCRA/42621 Final_Data_ngodarpan.gov.in.xlsx"
xl = pd.ExcelFile(path)
print(xl.sheet_names)
Final_Data_ngodarpan = xl.parse('ngodarpan.gov.in')
Final_Data_ngodarpan.head()

['ngodarpan.gov.in']


Unnamed: 0,Name,ngo url,Mobile,UniqueID,Off phone1,Email,Major Activities1,operational states db,issues working db,operational district db,...,Asisstant Secretary mobile,Board Member name,Board Member email,Board Member mobile,Vice Chairman name,Vice Chairman email,Vice Chairman mobile,Member name,Member email,Member mobile
0,PRAYAS,,9778080000.0,OR/2009/0010000,06858-223440,director_prayas@yahoo.com,1.63 Nos. of SHGs formed,"ORISSA,","Agriculture,Children,Civic Issues,Disaster Man...","ORISSA->Nabarangapur ,",...,,,,,,,,,,
1,PONDICHERRYWOMENSCONFERENCE,,9443253000.0,PY/2016/0100001,0413-2213238,surebe33@gmail.com,Working for Women and Children Obtaining Loan ...,"PUDUCHERRY,","Women's Development & Empowerment,Children,","PUDUCHERRY->Puducherry,",...,,,,,,,,,,
2,SHABRI SAMAJ SEWA SAMITI,http://ssssamitibhind.org,7828394000.0,MP/2016/0100003,0751-1234689,ssssamitibhind@gmail.com,more than one thousand leadership development ...,"MADHYA PRADESH,","Animal Husbandry, Dairying & Fisheries,Agricul...","MADHYA PRADESH->Anuppur, Ashoknagar, Balaghat,...",...,,,,,,,,ALOK,ssssamitibhind@gmail.com,7828498000.0
3,ANAND GANGA SAMAJIK SIKSHA SAMITI,,9450678000.0,UP/2016/0100004,05566-281059,lovelyraivijendra@gmail.com,OUR ORGANISATION HAVE PLANTED MORE THAN 2 LAKH...,"UTTAR PRADESH,","Agriculture,Environment & Forests,Health & Fam...","UTTAR PRADESH->Deoria, Gorakhpur, Sant Kabir N...",...,,,,,,,,,,
4,Himaliyan Gram Vikas Samiti,,9412037000.0,UA/2016/0100009,05964-213271,hgvs1990@gmail.com,Facilitated formation and strengthening of 65C...,"UTTARAKHAND,","Animal Husbandry, Dairying & Fisheries,Agricul...","UTTARAKHAND->Almora , Bageshwar, Champawat, Pi...",...,,Krishna Nand,hgvsgan@yahoo.co.in,7500720000.0,Leela Dhar Joshi,hgvs.jleeladhar.lj@gmail.com,8057816000.0,,,


**What's in this dataset?**  <a class="anchor" id="2"></a>  

In [8]:
dataset = Final_Data_ngodarpan
print("Shape:", dataset.shape)
print("Rows:", dataset.shape[0])
print("Columns:", dataset.shape[1])
print("Each row is an NGO.")

Shape: (111929, 42)
Rows: 111929
Columns: 42
Each row is an NGO.


**Codebook** <a class="anchor" id="3"></a>

In [2]:
dataset_columns = [column for column in dataset.columns]
dataset_desc = ["Name of NGO",
               "URL for NGO",
               "Mobile phone",
               "Unique ID of VO/NGO",
               "Telephone/Alternate number",
               "Email address",
               "Description of major activities",
               "List of states they operate in",
               "List of issues they are working on",
               "List of districts they operate in",
               "Name of registrar",
               "FCRA number",
               "Registration number",
               "Address",
               "Name of NGO",
               "Registration date",
               "Name of Act",
               "City of NGO",
               ]
dataset_desc = dataset_desc + ["N/A"] * (len(dataset_columns) - len(dataset_desc))
dataset_dtypes = [dtype for dtype in dataset.dtypes]

data = {"Column Name": dataset_columns, "Description": dataset_desc, "Type": dataset_dtypes}
dataset_codebook = pd.DataFrame(data)
dataset_codebook

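
The codebook cell pairs descriptions with columns by position and pads the remainder with "N/A", so a reordered or inserted column would silently shift every description. A minimal alternative sketch keys the descriptions by column name instead. The function `build_codebook` and the toy inputs are hypothetical, and the dtype strings in the example are illustrative.

```python
# Descriptions keyed by column name (a subset, for illustration)
descriptions = {
    "Name": "Name of NGO",
    "ngo url": "URL for NGO",
    "Mobile": "Mobile phone",
}

def build_codebook(columns, dtypes, descriptions):
    """Return one codebook row per column; unknown columns get "N/A"."""
    return [
        {
            "Column Name": column,
            "Description": descriptions.get(column, "N/A"),
            "Type": str(dtype),
        }
        for column, dtype in zip(columns, dtypes)
    ]

rows = build_codebook(
    ["Name", "Mobile", "status"],
    ["object", "float64", "float64"],  # illustrative dtypes
    descriptions,
)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to get the same tabular view as the original cell.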

**Missing values** <a class="anchor" id="3.1"></a>

In [10]:
Final_Data_ngodarpan.isnull().sum()

Name                               0
ngo url                        86142
Mobile                            32
UniqueID                           0
Off phone1                     95402
Email                              0
Major Activities1              27311
operational states db          23039
issues working db              22637
operational district db        23039
reg name                           0
fcrano                         89869
nr regNo                           3
nr add                             0
nr orgName                         0
ngo reg date                       0
nr actName                      1316
nr city                          214
TypeDescription                    0
StateName                          0
status                        111929
president name                 52520
president email                52520
president mobile               52520
Chairman name                  82126
Chairman email                 82132
Chairman mobile                82137
...
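
Raw counts are easier to compare as a share of all rows; for example, `status` is missing in all 111929 rows. The sketch below computes both on a toy frame. The helper name `missing_summary` and the toy values are illustrative, not part of the original notebook.

```python
import pandas as pd

def missing_summary(df):
    """Count missing values per column and express them as a percentage of rows."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy data standing in for Final_Data_ngodarpan
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "ngo url": [None, "http://a.org", None, None],
    "status": [None, None, None, None],
})
summary = missing_summary(toy)
print(summary)
```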

**Summary statistics** <a class="anchor" id="3.2"></a>

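None: all features are qualitative. Phone-number columns such as `Mobile` are stored as numbers but are identifiers, not quantities.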

## Columns
<a class="anchor" id="4"></a>

### Name
<a class="anchor" id="4.1"></a>
Name of NGO.  
No. of unique values: 109682  
No. of duplicates: 1548  

In [59]:
column = dataset["Name"]
column

0                                                    PRAYAS
1                               PONDICHERRYWOMENSCONFERENCE
2                                  SHABRI SAMAJ SEWA SAMITI
3                         ANAND GANGA SAMAJIK SIKSHA SAMITI
4                               Himaliyan Gram Vikas Samiti
                                ...                        
111924                            Hariom Samaj Vikas Samiti
111925    narmadanchal naya jeevan jan kalyaan seva sami...
111926                    Mathura Prasad Gramodyog Sansthan
111927                   Shree Swaminarayan Education Trust
111928                                      Srijan Sansthan
Name: Name, Length: 111929, dtype: object

In [64]:
# Number of empty strings/missing values
print("Invalid:", sum(column == " ") + sum(column.isnull()))
print("No. of unique values:", len(column.unique()))
# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:[value] for key, value in counter.items() if value > 1}
print("No. of duplicates:", len(duplicates))
table = pd.DataFrame.from_dict(duplicates)
table = table.melt(var_name="Duplicate Names", value_name="Count").sort_values(by=["Count"], ascending=False).reset_index(drop=True)
table

In [56]:
# Example
dataset[column == table.iloc[0,0]].head()

Unnamed: 0,Name,ngo url,Mobile,UniqueID,Off phone1,Email,Major Activities1,operational states db,issues working db,operational district db,...,Asisstant Secretary mobile,Board Member name,Board Member email,Board Member mobile,Vice Chairman name,Vice Chairman email,Vice Chairman mobile,Member name,Member email,Member mobile
23563,CATHOLIC CHURCH,,9427830000.0,GJ/2017/0168939,,CATHOLICCHURCHVYARA@GMAIL.COM,PRIEST MANTENANCE,"GUJARAT,","Children,Education & Literacy,Health & Family ...","GUJARAT->Tapi,",...,,,,,,,,,,
26416,CATHOLIC CHURCH,,9998133000.0,GJ/2017/0172648,,ccvadtal@gmail.com,RELIGIOUS AND SOCIAL,"GUJARAT,","Education & Literacy,Any Other,","GUJARAT->Anand ,",...,,,,,,,,SELVIN CRUZ,ccvadtal@gmail.com,9763823000.0
26884,CATHOLIC CHURCH,,9426880000.0,GJ/2017/0173225,,catholicchurchzaroli@gmail.com,THE OBJECT OF THE TRUST ARE RELIGIOUS EDUCATIO...,"GUJARAT,","Education & Literacy,Any Other,","GUJARAT->Valsad,",...,,,,,,,,VASAVA KANTILAL HIMATSING,kantivasava1959@gmail.com,9904656000.0
27081,CATHOLIC CHURCH,,9426513000.0,GJ/2017/0173470,,frsjraj@gmail.com,THE OBJECT OF THE TRUST ARE RELIGIOUS AND CHAR...,"GUJARAT,","Education & Literacy,Any Other,","GUJARAT->Anand ,",...,,,,,,,,ARASAKUMAR DEVASAGAYAM RAYAPPAN,arayappan@gmail.com,9426513000.0
27999,CATHOLIC CHURCH,,9426389000.0,GJ/2017/0174578,,frarulhmt@gmail.com,Trust has mainly involved in village animation...,"GUJARAT,","Education & Literacy,","GUJARAT->Arvalli,",...,,,,,,,,Kamjibhai Nemaji Dund,KAMJIDUND@GMAIL.COM,9979210000.0


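Rows that share the same `Name` are not necessarily duplicate rows: the example above shows distinct NGOs, each with its own `UniqueID`, that are all named CATHOLIC CHURCH.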
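
One way to check this systematically, sketched on a toy frame built from values in the previews above, is to compare rows that repeat a `Name` with rows that repeat in every column, and to count distinct `UniqueID` values per repeated name. The variable names are illustrative.

```python
import pandas as pd

# Toy data: two NGOs share a name but have different IDs
toy = pd.DataFrame({
    "Name": ["CATHOLIC CHURCH", "CATHOLIC CHURCH", "PRAYAS"],
    "UniqueID": ["GJ/2017/0168939", "GJ/2017/0172648", "OR/2009/0010000"],
})

# Rows whose Name appears more than once
same_name = toy[toy.duplicated(subset="Name", keep=False)]

# Rows that are identical in every column
full_duplicates = toy[toy.duplicated(keep=False)]

# Number of distinct IDs behind each repeated name
ids_per_name = same_name.groupby("Name")["UniqueID"].nunique()

print(len(same_name), len(full_duplicates))
print(ids_per_name)
```

A repeated name with more than one distinct `UniqueID` points to separate organisations rather than a duplicated record.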

### ngo url
<a class="anchor" id="4.2"></a>
URL for NGO.  
No. of unique values: 24253  
No. of duplicate values: 202   
Many NGOs appear to have been confused when filling in the form and pasted an NGO Darpan URL instead of the URL of their own website, if they have one. Together with missing values and the bare prefix `http://`, such entries make up the first 13 rows of the duplicates table, so a large number of URLs are invalid.  
Additionally, a large number of URLs cannot be reached.

In [62]:
column = dataset["ngo url"]
column

0                                  NaN
1                                  NaN
2            http://ssssamitibhind.org
3                                  NaN
4                                  NaN
                      ...             
111924                         http://
111925                             NaN
111926         http://mathuravikas1977
111927                             NaN
111928    http:/www.srijansansthan.com
Name: ngo url, Length: 111929, dtype: object

In [65]:
# Number of empty strings/missing values
print("Invalid:", sum(column == " ") + sum(column.isnull()))

print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:[value] for key, value in counter.items() if value > 1}
print("No. of Duplicates:", len(duplicates))

table = pd.DataFrame.from_dict(duplicates)
table = table.melt(var_name="Duplicate URLs", value_name="Count").sort_values(by=["Count"], ascending=False).reset_index(drop=True)
table

Invalid: 86142
No. of unique values: 24253
No. of Duplicates: 202


Unnamed: 0,Duplicate URLs,Count
0,,86142
1,http://,859
2,https://ngodarpan.gov.in/index.php/ngo/primaryngo,159
3,https://ngodarpan.gov.in,57
4,http://ngodarpan.gov.in/index.php/ngo/primaryngo,56
...,...,...
197,http://www.saraswatigoi.com,2
198,http://www.msmhc.org,2
199,http://www.poddarinstitute.org,2
200,http://www.svnycindia.org/,2


The first 13 entries of the table above are invalid (missing values, the bare prefix `http://`, or NGO Darpan pages pasted in place of the NGO's own website). The remaining entries are duplicated URLs that may point to real websites, although many cannot be reached.

In [70]:
table.iloc[13:]

Unnamed: 0,Duplicate URLs,Count
13,http://www.jss.nic.in,8
14,http://www.srigurudev.org,8
15,http://www.durbar.org,7
16,http://jss.nic.in,7
17,http://www.ngo.india.gov.in,6
...,...,...
197,http://www.saraswatigoi.com,2
198,http://www.msmhc.org,2
199,http://www.poddarinstitute.org,2
200,http://www.svnycindia.org/,2


In [71]:
# Example
dataset[column == table.iloc[13,0]].head()

Unnamed: 0,Name,ngo url,Mobile,UniqueID,Off phone1,Email,Major Activities1,operational states db,issues working db,operational district db,...,Asisstant Secretary mobile,Board Member name,Board Member email,Board Member mobile,Vice Chairman name,Vice Chairman email,Vice Chairman mobile,Member name,Member email,Member mobile
1532,STATE RESOURCE CENTER DISPUR ASSAM,http://www.jss.nic.in,9435328000.0,AS/2016/0104590,0361-2388990,srcdispur@yahoo.com,The critical focus of SRC Dispur is giving tec...,"ASSAM,","Education & Literacy,Human Rights,Vocational T...","ASSAM->Barpeta, Dhubri, Hailakandi, Tinsukia,",...,,,,,,,,,,
1603,jan shikshan sansthan raigad,http://www.jss.nic.in,99229990000.0,MH/2016/0104734,02141-227932,jssrgd@gmail.com,"14. Jan Shikshan Sansthan Raigad ,Maharashtra ...","MAHARASHTRA,","Vocational Training,","MAHARASHTRA->Raigarh,",...,,Advt,jssrgd@gmail.com,9372212000.0,Mrs,jssrgd@gmail.com,9766168000.0,,,
1611,JAN SHIKSHAN SANSHTHAN CHANDAULI,http://www.jss.nic.in,8924071000.0,UP/2016/0104747,05412-260625,jsschandauli1@gmail.com,JAN SHIKSHAN SANSHTHAN IS ORGNISING VOCATIONAL...,"UTTAR PRADESH,","Vocational Training,","UTTAR PRADESH->Chandauli, Chandauli,",...,,,,,,,,,,
1660,JAN SHIKSHAN SANSTHAN GUNA,http://www.jss.nic.in,9425633000.0,MP/2016/0104852,07542-252375,jssguna@gmail.com,Jan Shikshan Sansthan is working of Vocational...,"MADHYA PRADESH,","Civic Issues,Drinking Water,Education & Litera...","MADHYA PRADESH->Guna,",...,,,,,Vishnu Sharma,yuvacollege@ymail.com,9893453000.0,Manisha Pandey,parthpandey@gmail.com,9926540000.0
1702,JAN SHIKSHAN SANSTHAN BASTI,http://www.jss.nic.in,9415174000.0,UP/2016/0104945,05542-209210,jssbasti44@yahoo.in,We are conducting the activities by govt of In...,"UTTAR PRADESH,","Education & Literacy,","UTTAR PRADESH->Basti,",...,,,,,,,,,,


Rows that share the same `ngo url` are not necessarily duplicate rows: the example above shows distinct organisations that list the same website.
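
The kinds of invalid entries seen in this column (missing values, a bare `http://`, NGO Darpan pages, and malformed addresses such as `http:/www.srijansansthan.com`) can be flagged without any network access. The sketch below uses only the standard library; the function name `classify_url` and the category labels are assumptions made for illustration, and a "candidate" label does not mean the site is reachable.

```python
import math
from urllib.parse import urlparse

def classify_url(value):
    """Label a raw `ngo url` value; the categories are illustrative."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "missing"
    try:
        parsed = urlparse(str(value).strip())
        host = (parsed.hostname or "").lower()
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced brackets
        return "malformed"
    if parsed.scheme not in ("http", "https") or not host:
        return "malformed"
    if host == "ngodarpan.gov.in" or host.endswith(".ngodarpan.gov.in"):
        return "ngo darpan"
    if "." not in host:
        # A host without a dot, such as "mathuravikas1977", is not a domain
        return "malformed"
    return "candidate"

samples = [
    None,
    "http://",
    "https://ngodarpan.gov.in/index.php/ngo/primaryngo",
    "http://mathuravikas1977",
    "http:/www.srijansansthan.com",
    "http://ssssamitibhind.org",
]
for sample in samples:
    print(repr(sample), "->", classify_url(sample))
```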

### Mobile
<a class="anchor" id="4.3"></a>
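Mobile number.  
Incorrect dtype: the column is stored as `float64`, although phone numbers are identifiers and would be better stored as strings.  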

In [77]:
column = dataset["Mobile"]
column

0         9.778080e+09
1         9.443253e+09
2         7.828394e+09
3         9.450678e+09
4         9.412037e+09
              ...     
111924    9.335113e+09
111925    9.425042e+09
111926    9.452122e+09
111927    9.727767e+09
111928    9.667617e+09
Name: Mobile, Length: 111929, dtype: float64
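
Because `Mobile` is stored as `float64`, the numbers are shown in scientific notation. One possible conversion, sketched on toy values taken from the preview above, casts to pandas' nullable integer type and then to strings. This assumes every non-missing value is a whole number; the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for dataset["Mobile"], including one missing value
mobile = pd.Series([9778080000.0, 9443253000.0, float("nan")], name="Mobile")

# Int64 (capital I) is the nullable integer dtype, so missing values survive the cast
mobile_text = mobile.astype("Int64").astype("string")
print(mobile_text)
```

Going through the nullable integer type first avoids the trailing `.0` that a direct float-to-string cast would leave on each number.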