# Codebook

Author: [Patrick Guo](https://github.com/shpatrickguo)

## Closed_during_the_month_(Registeration_Closure).xls
**Data provided by:** NGO Darpan<br>
**Source:** s3://daanmatchdatafiles/Closed_during_the_month_(Registeration_Closure).xls<br>
**Type:** xlsx<br>
**Last Modified:** October 27, 2021, 16:44:55 (UTC-07:00)<br>
**Size:** 520.5 KB

```Closed_during_the_month_(Registeration_Closure).xls```, loaded as ```reg_closed_df```, lists the companies struck off or closed during August 2014. It contains the following columns:
- ```S.No```
- ```CIN```
- ```COMPANY_NAME```
- ```CLASS```
- ```COMPANY_STATUS```
- ```TYPE```
- ```DATE_OF_REGISTRATION```
- ```LISTED```
- ```COMPANY_INDICATOR```
- ```REGISTERED_STATE```
- ```ROC_CODE```
- ```INDUSTRIAL_CLASIFICATION```
- ```DESCRIPTION```

In [10]:
reg_closed_df.columns

Index(['S.No', 'CIN', 'COMPANY_NAME', 'CLASS', 'COMPANY_STATUS', 'TYPE',
       'DATE_OF_REGISTRATION', 'LISTED', 'COMPANY_INDICATOR',
       'REGISTERED_STATE', 'ROC_CODE', 'INDUSTRIAL_CLASIFICATION',
       'DESCRIPTION'],
      dtype='object', name=4)

# Import Libraries

In [1]:
import boto3
import io
import string
import requests

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno

# Load Data

In [2]:
client = boto3.client('s3')
resource = boto3.resource('s3')

In [3]:
obj = client.get_object(Bucket='daanmatchdatafiles', Key='Closed_during_the_month_(Registeration_Closure).xls')
df = pd.read_excel(io.BytesIO(obj['Body'].read()))

# Registration Closure Dataset

In [4]:
# The first 4 rows are blank and the 5th row holds the column names
reg_closed_df = df.copy()
# Use the 5th row (position 4) as the column names
reg_closed_df.columns = reg_closed_df.iloc[4]
# Drop the first 5 rows and reset the index
reg_closed_df = reg_closed_df.iloc[5:, :].reset_index(drop=True)
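
The header-promotion step above can be sketched on a small stand-in frame. This is a minimal illustration, not the notebook's actual code: `raw` and `promote_header` are hypothetical names, and the toy sheet has only two filler rows instead of the real file's layout.

```python
import pandas as pd

# Toy stand-in for the raw sheet (assumption): filler rows, a header row, then data.
raw = pd.DataFrame([
    [None, None],
    [None, None],
    ["S.No", "CIN"],
    [1, "U36100MH2013PTC247886"],
    [2, "U70100KL2013PTC034127"],
])


def promote_header(frame, header_pos):
    """Use the row at position header_pos as column names and keep the rows below it."""
    out = frame.iloc[header_pos + 1:].copy()
    # .tolist() passes plain labels, so the row's own label does not become the columns' name.
    out.columns = frame.iloc[header_pos].tolist()
    return out.reset_index(drop=True)


clean = promote_header(raw, 2)
print(clean)
```

Assigning the row as a Series, as the cell above does, carries the row label over as the name of the column index. That is why `name=4` appears in the `reg_closed_df.columns` output and a stray `4` shows up in the printed tables; converting to a list first avoids it.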

In [5]:
reg_closed_df.head()

| 4 | S.No | CIN | COMPANY_NAME | CLASS | COMPANY_STATUS | TYPE | DATE_OF_REGISTRATION | LISTED | COMPANY_INDICATOR | REGISTERED_STATE | ROC_CODE | INDUSTRIAL_CLASIFICATION | DESCRIPTION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | U36100MH2013PTC247886 | NEW VIVEK JEWELLERS PRIVATE LIMITED | Private | STRIKE OFF | Nongovernment | 2013-09-05 00:00:00 | Unlisted | Indian Company | Maharashtra | RoC-Mumbai | 36100 | Manufacturing (Others) |
| 1 | 2 | U70100KL2013PTC034127 | PANAMANNA PROPERTY PRIVATE LIMITED | Private | STRIKE OFF | Nongovernment | 2013-05-17 00:00:00 | Unlisted | Indian Company | Kerala | ROC-Ernakulam | 70100 | Real Estate and Renting |
| 2 | 3 | U17120GJ2013PTC076589 | EMBCOTT TEXTILES PRIVATE LIMITED | Private | STRIKE OFF | Nongovernment | 2013-08-26 00:00:00 | Unlisted | Indian Company | Gujarat | RoC-Ahmedabad | 17120 | Manufacturing (Textiles) |
| 3 | 4 | U52609HR2013PLC049291 | SRS MODERN RETAIL LIMITED | Public | STRIKE OFF | Nongovernment | 2013-05-24 00:00:00 | Unlisted | Indian Company | Haryana | RoC-Delhi | 52609 | Trading |
| 4 | 5 | U52500TZ2013PTC019549 | KARAKORAM SYSTEMS AND SOLUTIONS PRIVATE LIMITED | Private | STRIKE OFF | Nongovernment | 2013-05-30 00:00:00 | Unlisted | Indian Company | Tamil Nadu | RoC-Coimbatore | 52500 | Trading |
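
In the rows shown, each `CIN` appears to embed values that also occur in other columns: the digits after the first letter match `INDUSTRIAL_CLASIFICATION`, and the four-digit year matches `DATE_OF_REGISTRATION`. The sketch below splits a `CIN` under the assumption that it follows the usual 21-character layout (listing flag, industry code, state code, year, company type, registration number). `CIN_PATTERN` and `parse_cin` are hypothetical names, and the layout has not been checked against all 1,850 rows.

```python
import re

# Assumed layout: 1-letter listing flag, 5-digit industry code, 2-letter state code,
# 4-digit year, 3-letter company type, 6-digit registration number.
CIN_PATTERN = re.compile(
    r"^(?P<listing>[LU])(?P<industry>\d{5})(?P<state>[A-Z]{2})"
    r"(?P<year>\d{4})(?P<kind>[A-Z]{3})(?P<number>\d{6})$"
)


def parse_cin(cin):
    """Return the CIN components as a dict, or None if the value does not fit the assumed layout."""
    match = CIN_PATTERN.match(cin)
    return match.groupdict() if match else None


parts = parse_cin("U36100MH2013PTC247886")
print(parts)
```

A check like this could be used to flag rows where the embedded industry code or year disagrees with the dedicated columns.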


In [6]:
# Examine the structure of the dataframe
reg_closed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1850 entries, 0 to 1849
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   S.No                      1850 non-null   object
 1   CIN                       1850 non-null   object
 2   COMPANY_NAME              1850 non-null   object
 3   CLASS                     1850 non-null   object
 4   COMPANY_STATUS            1850 non-null   object
 5   TYPE                      1850 non-null   object
 6   DATE_OF_REGISTRATION      1850 non-null   object
 7   LISTED                    1848 non-null   object
 8   COMPANY_INDICATOR         1850 non-null   object
 9   REGISTERED_STATE          1850 non-null   object
 10  ROC_CODE                  1850 non-null   object
 11  INDUSTRIAL_CLASIFICATION  1850 non-null   object
 12  DESCRIPTION               1809 non-null   object
dtypes: object(13)
memory usage: 188.0+ KB
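
Every column is reported as `object`, so dates and serial numbers are not yet typed. A minimal sketch of how they could be converted follows. It uses a toy frame with values copied from the `head()` output, and `toy` and `typed` are hypothetical names. Keeping `INDUSTRIAL_CLASIFICATION` as text is a judgment call made here because it is a code rather than a quantity.

```python
import pandas as pd

# Toy frame mimicking the object-typed columns (values taken from the head() output).
toy = pd.DataFrame(
    {
        "S.No": [1, 2],
        "DATE_OF_REGISTRATION": ["2013-09-05 00:00:00", "2013-05-17 00:00:00"],
        "INDUSTRIAL_CLASIFICATION": ["36100", "70100"],
    },
    dtype="object",
)

typed = toy.copy()
# Serial numbers become integers and registration dates become datetimes.
typed["S.No"] = pd.to_numeric(typed["S.No"])
typed["DATE_OF_REGISTRATION"] = pd.to_datetime(typed["DATE_OF_REGISTRATION"])

print(typed.dtypes)
```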


In [7]:
# Examine the descriptive statistics of the dataframe
reg_closed_df.describe()

| 4 | S.No | CIN | COMPANY_NAME | CLASS | COMPANY_STATUS | TYPE | DATE_OF_REGISTRATION | LISTED | COMPANY_INDICATOR | REGISTERED_STATE | ROC_CODE | INDUSTRIAL_CLASIFICATION | DESCRIPTION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1850 | 1850 | 1850 | 1850 | 1850 | 1850 | 1850 | 1848 | 1850 | 1850 | 1850 | 1850 | 1809 |
| unique | 1850 | 1850 | 1850 | 2 | 4 | 1 | 1452 | 2 | 1 | 22 | 22 | 445 | 19 |
| top | 1 | U36100MH2013PTC247886 | NEW VIVEK JEWELLERS PRIVATE LIMITED | Private | STRIKE OFF | Nongovernment | 2008-03-07 00:00:00 | Unlisted | Indian Company | Maharashtra | RoC-Delhi | 72200 | Business Services |
| freq | 1 | 1 | 1 | 1755 | 1631 | 1850 | 8 | 1846 | 1850 | 443 | 381 | 80 | 440 |


In [8]:
# Identify the nullity of the dataframe
missing_values_hist = reg_closed_df.isna().sum()
print('Total Missing Values:\n', missing_values_hist)

Total Missing Values:
 4
S.No                         0
CIN                          0
COMPANY_NAME                 0
CLASS                        0
COMPANY_STATUS               0
TYPE                         0
DATE_OF_REGISTRATION         0
LISTED                       2
COMPANY_INDICATOR            0
REGISTERED_STATE             0
ROC_CODE                     0
INDUSTRIAL_CLASIFICATION     0
DESCRIPTION                 41
dtype: int64


In [9]:
# Identify the percentage of nullity in the dataframe for each column
missing_values_hist_perc = reg_closed_df.isnull().mean() * 100
print('Percentage of Missing Values:\n', missing_values_hist_perc)

Percentage of Missing Values:
 4
S.No                        0.000000
CIN                         0.000000
COMPANY_NAME                0.000000
CLASS                       0.000000
COMPANY_STATUS              0.000000
TYPE                        0.000000
DATE_OF_REGISTRATION        0.000000
LISTED                      0.108108
COMPANY_INDICATOR           0.000000
REGISTERED_STATE            0.000000
ROC_CODE                    0.000000
INDUSTRIAL_CLASIFICATION    0.000000
DESCRIPTION                 2.216216
dtype: float64


## Observations
- ```LISTED``` is missing in 2 of 1,850 rows (0.108%).
- ```DESCRIPTION``` is missing in 41 of 1,850 rows (2.216%).
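
These percentages follow from the missing-value counts and the row count reported earlier (2 and 41 missing values out of 1,850 rows). A short check of the arithmetic, where `missing_pct` is a hypothetical helper name:

```python
def missing_pct(missing, total):
    """Percentage of rows that have a missing value."""
    return missing / total * 100


total_rows = 1850  # row count from reg_closed_df.info()
listed_pct = missing_pct(2, total_rows)        # LISTED has 2 missing values
description_pct = missing_pct(41, total_rows)  # DESCRIPTION has 41 missing values

print(f"LISTED: {listed_pct:.3f}%")            # LISTED: 0.108%
print(f"DESCRIPTION: {description_pct:.3f}%")  # DESCRIPTION: 2.216%
```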