# Codebook  
**Authors:** Lauren Baker  
This notebook documents DaanMatch's existing data files, recording each file's location, owner, "version", source, and related details.

In [1]:
import boto3
import numpy as np 
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import statistics

In [2]:
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('daanmatchdatafiles')

# Dadra_Nagar_Haveli_2016.xlsx

## TOC:
* [About this dataset](#1)
* [What's in this dataset](#2)
* [Codebook](#3)
    * [Missing values](#3.1)
    * [Summary statistics](#3.2)
* [Columns](#4)
    * [CIN](#4.1)
    * [COMPANY_NAME](#4.2)
    * [COMPANY_CLASS](#4.3)
    * [COMPANY_CATEGORY](#4.4)
    * [COMPANY_SUBCAT](#4.5)
    * [COMPANY_STATUS](#4.6)
    * [DATE_OF_REGISTRATION](#4.7)
    * [REGISTERED_STATE](#4.8)
    * [Authorized Capital  (Rs.)](#4.9)
    * [PAIDUP_CAPITAL (Rs.)](#4.10)
    * [PRINCIPAL_BUSINESS_ACTIVITY_CODE](#4.11)
    * [REGISTERED_OFFICE_ADDRESS](#4.12)
    * [EMAIL_ID](#4.13)
    * [LATEST ANNUAL REPORT FILING FY END DATE](#4.14)
    * [LATEST BALANCE SHEET FILING FY END DATE](#4.15)

**About this dataset**  <a class="anchor" id="1"></a>  
Data provided by: Unknown.  
Source: https://daanmatchdatafiles.s3.us-west-1.amazonaws.com/DaanMatch_DataFiles/Dadar_Nagar_Haveli_2016.xlsx   
Type: xlsx  
Last Modified: May 29, 2021, 22:52:46 (UTC-04:00)  
Size: 79.5 KB

In [3]:
path = "s3://daanmatchdatafiles/DaanMatch_DataFiles/Dadar_Nagar_Haveli_2016.xlsx"
Dadra_Nagar_Haveli_2016 = pd.ExcelFile(path)
print(Dadra_Nagar_Haveli_2016.sheet_names)

['Sheet3']


In [4]:
Dadra_Nagar_Haveli_2016 = Dadra_Nagar_Haveli_2016.parse('Sheet3')
Dadra_Nagar_Haveli_2016.head()

Unnamed: 0,CIN,COMPANY_NAME,COMPANY_CLASS,COMPANY_CATEGORY,COMPANY_SUBCAT,COMPANY_STATUS,DATE_OF_REGISTRATION,REGISTERED_STATE,Authorized Capital (Rs.),PAIDUP_CAPITAL (RS.),PRINCIPAL_BUSINESS_ACTIVITY_CODE,REGISTERED_OFFICE_ADDRESS,EMAIL_ID,LATEST ANNUAL REPORT FILING FY END DATE,LATEST BALANCE SHEET FILING FY END DATE
0,U51109DN2010PTC005504,KAPOOR MERCANTILE PRIVATE LIMITED,Private,Company limited by Shares,Non-govt company,ACTIVE,2010-02-17,Dadra and Nagar Haveli,30000000,540000,51109,"PLOT NO 17, SURVEY NO 121/P,SILVASSA INDUSTRIA...",rawal_bhv@yahoo.co.in,2015-03-31,2015-03-31
1,U17291DN2013PTC005501,AKSHAT FREIGHT CARRIERS PRIVATE LIMITED,Private,Company limited by Shares,Non-govt company,ACTIVE,2013-02-20,Dadra and Nagar Haveli,100000000,48000000,17291,"PLOT NO 65/B, PIPARIA INDUSTRIAL ESTATE,PIPARI...",rawal_bhv@yahoo.co.in,2015-03-31,2015-03-31
2,U99999DN1995PLC000093,AEC SSANGYONG LTD,Public,Company limited by Shares,Non-govt company,"INACTIVE UNDER SECTION 455 OF CA,2013",1995-08-01,Dadra and Nagar Haveli,500000,0,99999,"SURVEY NO.210&ONS,VILLAGE MORKHAL,PO.SILVASSA ...",,0,0
3,U99999DN1991PLC000037,SILVASSA STANDARD TWIST AND TILES PVT.LTD.,Private,Company limited by Shares,Non-govt company,"INACTIVE UNDER SECTION 455 OF CA,2013",1991-02-11,Dadra and Nagar Haveli,500000,0,99999,"40-INDUS. ESTATE, PIPARIASILVASSADADAR N. HAVA...",,0,0
4,U99999DN1990PLC000033,SILWINES PVT. LTD.,Private,Company limited by Shares,Non-govt company,"INACTIVE UNDER SECTION 455 OF CA,2013",1990-04-06,Dadra and Nagar Haveli,100000,0,99999,C/O. SILPHAR LABORATORIES P.B.NO. 33SILVASSA (...,,0,0


**What's in this dataset?** <a class="anchor" id="2"></a>

In [5]:
print("Shape:", Dadra_Nagar_Haveli_2016.shape)
print("Rows:", Dadra_Nagar_Haveli_2016.shape[0])
print("Columns:", Dadra_Nagar_Haveli_2016.shape[1])
print("Each row is a company.")

Shape: (472, 15)
Rows: 472
Columns: 15
Each row is a company.


**Codebook** <a class="anchor" id="3"></a>

In [6]:
Dadra_Nagar_Haveli_2016_columns = [column for column in Dadra_Nagar_Haveli_2016.columns]
Dadra_Nagar_Haveli_2016_description = ["Corporate Identification Number in India (CIN) is a 21-character alphanumeric code issued to companies incorporated within India on being registered with the Registrar of Companies (RoC).",
                                           "Name of Company.",
                                           "Class of Company: Private or Public.",
                                           "Category of Company: Limited by Shares, Limited by Guarantee, Unlimited Company.",
                                           "Subcategory of Company: Non-govt, Union Govt, State Govt, Subsidiary of Foreign Company, Guarantee and Association Company.",
                                           "Status of Company.",
                                           "Timestamp of date of registration: YYYY-MM-DD HH:MM:SS.",
                                           "State of registration.",
                                           "Authorized capital in rupees (Rs.).",
                                           "Paid up capital in rupees (Rs.).",
                                           "Principal Business code that classifies the main type of product/service sold.",
                                           "Address of registered office.",
                                           "Company email.",
                                           "Latest annual report filing fiscal year end date: YYYY-MM-DD.",
                                           "Latest balance sheet filing fiscal year end date: YYYY-MM-DD."]
Dadra_Nagar_Haveli_2016_dtypes = [dtype for dtype in Dadra_Nagar_Haveli_2016.dtypes]

data = {"Column Name": Dadra_Nagar_Haveli_2016_columns, "Description": Dadra_Nagar_Haveli_2016_description, "Type": Dadra_Nagar_Haveli_2016_dtypes}
Dadra_Nagar_Haveli_2016_codebook = pd.DataFrame(data)
Dadra_Nagar_Haveli_2016_codebook.style.set_properties(subset=['Description'], **{'width': '600px'})

Unnamed: 0,Column Name,Description,Type
0,CIN,Corporate Identification Number in India (CIN) is a 21-character alphanumeric code issued to companies incorporated within India on being registered with the Registrar of Companies (RoC).,object
1,COMPANY_NAME,Name of Company.,object
2,COMPANY_CLASS,Class of Company: Private or Public.,object
3,COMPANY_CATEGORY,"Category of Company: Limited by Shares, Limited by Guarantee, Unlimited Company.",object
4,COMPANY_SUBCAT,"Subcategory of Company: Non-govt, Union Govt, State Govt, Subsidiary of Foreign Company, Guarantee and Association Company.",object
5,COMPANY_STATUS,Status of Company.,object
6,DATE_OF_REGISTRATION,Timestamp of date of registration: YYYY-MM-DD HH:MM:SS.,object
7,REGISTERED_STATE,State of registration.,object
8,Authorized Capital (Rs.),Authorized capital in rupees (Rs.).,int64
9,PAIDUP_CAPITAL (RS.),Paid up capital in rupees (Rs.).,int64
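
The codebook above pairs a hand-written list of descriptions with the DataFrame's columns, so the two lists must stay the same length and in the same order. A minimal sketch of a reusable helper that checks this before building the table is given below. The function name `build_codebook` and the two-column toy frame are hypothetical and only stand in for the real sheet.

```python
import pandas as pd

def build_codebook(df, descriptions):
    """Build a codebook table; fail early if descriptions and columns differ in length."""
    if len(descriptions) != len(df.columns):
        raise ValueError(
            f"{len(descriptions)} descriptions given for {len(df.columns)} columns"
        )
    return pd.DataFrame({
        "Column Name": list(df.columns),
        "Description": list(descriptions),
        "Type": [str(dtype) for dtype in df.dtypes],
    })

# Toy frame standing in for the real sheet (hypothetical, one row only)
toy = pd.DataFrame({
    "CIN": ["U51109DN2010PTC005504"],
    "Authorized Capital (Rs.)": [30000000],
})
codebook = build_codebook(
    toy,
    ["Corporate Identification Number.", "Authorized capital in rupees (Rs.)."],
)
print(codebook)
```

Storing the dtypes as strings keeps the table easy to export, which may or may not matter for this notebook.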


**Missing values** <a class="anchor" id="3.1"></a>

In [7]:
Dadra_Nagar_Haveli_2016.isnull().sum()

CIN                                        0
COMPANY_NAME                               0
COMPANY_CLASS                              0
COMPANY_CATEGORY                           0
COMPANY_SUBCAT                             0
COMPANY_STATUS                             0
DATE_OF_REGISTRATION                       0
REGISTERED_STATE                           0
Authorized Capital  (Rs.)                  0
 PAIDUP_CAPITAL (RS.)                      0
PRINCIPAL_BUSINESS_ACTIVITY_CODE           0
REGISTERED_OFFICE_ADDRESS                  0
EMAIL_ID                                   0
LATEST ANNUAL REPORT FILING FY END DATE    0
LATEST BALANCE SHEET FILING FY END DATE    0
dtype: int64
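
`isnull()` reports zero missing values in every column, yet the preview of the first rows shows blank `EMAIL_ID` cells and `0` in the two filing-date columns. A rough sketch for counting such placeholders is given below. It assumes that blank strings and `0` stand for "not available" in those columns, which the source does not confirm. The helper name `count_placeholders` and the toy data are hypothetical. The capital columns are deliberately left out, since `0` can be a genuine amount there.

```python
import pandas as pd

def count_placeholders(df, columns, placeholders=("", "0")):
    """Count cells per column that are NaN or, once stripped, equal a placeholder."""
    counts = {}
    for name in columns:
        # Compare on stripped text so "  " and the integer 0 are both caught
        text = df[name].astype(str).str.strip()
        is_placeholder = df[name].isnull() | text.isin(list(placeholders))
        counts[name] = int(is_placeholder.sum())
    return pd.Series(counts)

# Toy data mimicking the pattern seen in the preview (values are illustrative)
toy = pd.DataFrame({
    "EMAIL_ID": ["rawal_bhv@yahoo.co.in", "  ", " "],
    "LATEST ANNUAL REPORT FILING FY END DATE": ["2015-03-31", 0, 0],
})
print(toy.isnull().sum())  # reports no missing values
print(count_placeholders(toy, toy.columns))
```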

**Summary statistics** <a class="anchor" id="3.2"></a>

In [8]:
Dadra_Nagar_Haveli_2016.describe()

Unnamed: 0,Authorized Capital (Rs.),PAIDUP_CAPITAL (RS.),PRINCIPAL_BUSINESS_ACTIVITY_CODE
count,472.0,472.0,472.0
mean,258230500.0,126049200.0,43200.557203
std,2294642000.0,1118912000.0,24076.121321
min,0.0,0.0,1111.0
25%,100000.0,100000.0,24230.75
50%,650000.0,100000.0,45200.0
75%,12125000.0,6295700.0,56458.0
max,40000000000.0,16980840000.0,99999.0
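
`describe()` treats `PRINCIPAL_BUSINESS_ACTIVITY_CODE` as a quantity, but it is a classification code, so its mean and standard deviation carry little meaning. The capital columns are also strongly skewed (the mean is far above the median), which makes the quartiles more informative than the mean. A possible way to separate the two kinds of column is sketched below. The helper `summarize` is hypothetical, and the toy values are taken from the five preview rows only.

```python
import pandas as pd

def summarize(df, code_columns):
    """Describe numeric columns, and tabulate code columns as categories instead."""
    numeric = df.drop(columns=list(code_columns)).describe()
    codes = {name: df[name].astype(str).value_counts() for name in code_columns}
    return numeric, codes

# Toy frame built from the five preview rows (not the full 472-row sheet)
toy = pd.DataFrame({
    "Authorized Capital (Rs.)": [30000000, 100000000, 500000, 500000, 100000],
    "PRINCIPAL_BUSINESS_ACTIVITY_CODE": [51109, 17291, 99999, 99999, 99999],
})
numeric, codes = summarize(toy, ["PRINCIPAL_BUSINESS_ACTIVITY_CODE"])
print(numeric)
print(codes["PRINCIPAL_BUSINESS_ACTIVITY_CODE"])
```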


## Columns
<a class="anchor" id="4"></a>

### CIN
<a class="anchor" id="4.1"></a>
Corporate Identification Number in India (CIN) is a 21-character alphanumeric code issued to companies incorporated within India on being registered with the Registrar of Companies (RoC).

In [9]:
column = Dadra_Nagar_Haveli_2016["CIN"]
column

0      U51109DN2010PTC005504
1      U17291DN2013PTC005501
2      U99999DN1995PLC000093
3      U99999DN1991PLC000037
4      U99999DN1990PLC000033
               ...          
467    U27209DN2003PTC000173
468    U27203DN2012PTC000388
469    U27203DN2008PTC000267
470    U17299DN2016PTC005498
471    U65191DH1888PLC000008
Name: CIN, Length: 472, dtype: object

In [10]:
# Check if all rows have 21 digits
CIN_length = [len(CIN) for CIN in column]
print("Rows without 21 digits:", sum([length != 21 for length in CIN_length]))

print("No. of unique values:", len(column.unique()))

# Check for duplicates
counter = dict(Counter(column))
duplicates = { key:value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)

Rows without 21 digits: 0
No. of unique values: 472
Duplicates: {}
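
The length check confirms every CIN has 21 characters, but it does not look at the structure of the code. The sketch below goes one step further. It assumes the commonly described CIN layout (listing status `L`/`U`, 5-digit industry code, 2-letter state code, 4-digit year, 3-letter ownership code, 6-digit registration number) and assumes `DN` is the state code expected for this file, as the sample values suggest. The names `CIN_PATTERN`, `parse_cin`, and `flag_cins` are hypothetical.

```python
import re

# Assumed CIN layout: 1 + 5 + 2 + 4 + 3 + 6 = 21 characters
CIN_PATTERN = re.compile(
    r"^(?P<listing>[LU])(?P<industry>\d{5})(?P<state>[A-Z]{2})"
    r"(?P<year>\d{4})(?P<ownership>[A-Z]{3})(?P<number>\d{6})$"
)

def parse_cin(cin):
    """Split a CIN into its components, or return None if it does not fit the layout."""
    match = CIN_PATTERN.match(cin.strip())
    return match.groupdict() if match else None

def flag_cins(cins, expected_state="DN"):
    """Return CINs that fail to parse or carry an unexpected state code."""
    flagged = []
    for cin in cins:
        parts = parse_cin(cin)
        if parts is None or parts["state"] != expected_state:
            flagged.append(cin)
    return flagged

# Three CINs taken from the column output above
sample = ["U51109DN2010PTC005504", "U99999DN1995PLC000093", "U65191DH1888PLC000008"]
print(flag_cins(sample))  # ['U65191DH1888PLC000008']
```

The flagged value is the last row of the column (index 471), whose state code is `DH` rather than `DN`.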


### COMPANY_NAME 
<a class="anchor" id="4.2"></a>
Name of Company.  
Names contain leading and trailing whitespace, so the strings need to be stripped before they are compared.

In [11]:
column = Dadra_Nagar_Haveli_2016["COMPANY_NAME"]
column

0                KAPOOR MERCANTILE PRIVATE LIMITED   
1          AKSHAT FREIGHT CARRIERS PRIVATE LIMITED   
2                                AEC SSANGYONG LTD   
3        SILVASSA STANDARD TWIST AND TILES PVT.LTD.  
4                               SILWINES PVT. LTD.   
                            ...                      
467    DHARMANANDAN STEEL AND METAL PRIVATE LIMITED  
468                  JAIN ALUFOILS PRIVATE LIMITED   
469             GLACIER ALLUMINIUM PRIVATE LIMITED   
470              UNOVEL INDUSTRIES PRIVATE LIMITED   
471                            XYZ Company pvt Ltd   
Name: COMPANY_NAME, Length: 472, dtype: object

In [12]:
# Strip strings
stripped_name = column.str.strip()
print("Invalid names:", sum(stripped_name.isnull()))

print("No. of unique values:", len(column.unique()))

# Check for duplicates (computed on the raw, unstripped names, as is the unique count above)
counter = dict(Counter(column))
duplicates = { key:value for key, value in counter.items() if value > 1}
print("Duplicates:", duplicates)
if len(duplicates) > 0:
    print("No. of duplicates:", len(duplicates))

Invalid names: 0
No. of unique values: 471
Duplicates: {'SAI AGRO PESTICIDES COMPANY PVT.LTD.   ': 2}
No. of duplicates: 1


In [14]:
Dadra_Nagar_Haveli_2016[Dadra_Nagar_Haveli_2016["COMPANY_NAME"].isin(duplicates)]

Unnamed: 0,CIN,COMPANY_NAME,COMPANY_CLASS,COMPANY_CATEGORY,COMPANY_SUBCAT,COMPANY_STATUS,DATE_OF_REGISTRATION,REGISTERED_STATE,Authorized Capital (Rs.),PAIDUP_CAPITAL (RS.),PRINCIPAL_BUSINESS_ACTIVITY_CODE,REGISTERED_OFFICE_ADDRESS,EMAIL_ID,LATEST ANNUAL REPORT FILING FY END DATE,LATEST BALANCE SHEET FILING FY END DATE
201,U25199DN1974PTC006789,SAI AGRO PESTICIDES COMPANY PVT.LTD.,Private,Company limited by Shares,Non-govt company,"INACTIVE UNDER SECTION 455 OF CA,2013",1974-04-30,Dadra and Nagar Haveli,200000,0,25199,"SILVASSA,VIA-VAPI,DADRA-NAGAR HAVELI. SILVAS...",,0,0
413,U99999DN1974PTC000002,SAI AGRO PESTICIDES COMPANY PVT.LTD.,Private,Company limited by Shares,Non-govt company,"INACTIVE UNDER SECTION 455 OF CA,2013",1974-04-30,Dadra and Nagar Haveli,200000,1000,99999,"SILVASSA, VIA VAPI,DADRA NAGAR HAVELI.SILVASSA...",,0,0


Duplicates in ```COMPANY_NAME``` do not mean the rows are duplicates: the two rows above share a name but have different CINs and different paid-up capital.
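
The duplicate check above runs on the raw names, so two names that differ only in surrounding whitespace would not be matched. A minimal sketch of the same check on cleaned names is given below. The helper `duplicated_names` is hypothetical, and the leading space in the second toy name is made up to show why stripping matters. The result on the real sheet may differ from the output recorded above.

```python
import pandas as pd

def duplicated_names(df, name_column="COMPANY_NAME"):
    """Return rows whose name repeats after trimming and collapsing whitespace."""
    cleaned = df[name_column].str.strip().str.replace(r"\s+", " ", regex=True)
    mask = cleaned.duplicated(keep=False)  # mark every member of a duplicate group
    return df.loc[mask].assign(**{name_column: cleaned[mask]})

# Toy frame: CINs and names from the output above, whitespace varied for illustration
toy = pd.DataFrame({
    "CIN": ["U25199DN1974PTC006789", "U99999DN1974PTC000002", "U51109DN2010PTC005504"],
    "COMPANY_NAME": [
        "SAI AGRO PESTICIDES COMPANY PVT.LTD.   ",
        " SAI AGRO PESTICIDES COMPANY PVT.LTD.",
        "KAPOOR MERCANTILE PRIVATE LIMITED   ",
    ],
})
dupes = duplicated_names(toy)
print(dupes)
```

Comparing the `CIN` values of the returned rows is a quick way to tell a repeated name from a repeated record.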