# Cleaning the Data
_Author_: https://github.com/raffysantayana

## The Source of Data
The data was gathered with the `requests` library; see `01_GatheringTheData.ipynb` for the steps taken to gather the data cleaned in this notebook. That notebook builds the URL for each company's submissions document using the format `https://data.sec.gov/submissions/CIK{cik:010d}.json`, where `{cik:010d}` is the CIK (Central Index Key) of the ticker being queried, left-padded with zeroes to 10 digits. The returned JSON is then parsed to retrieve metadata for that ticker, which is used to build further URLs pointing to each 10-Q report for that ticker.
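
As a minimal sketch of the URL format described above (the helper name `build_submissions_url` is hypothetical and not taken from the gathering notebook), the `:010d` format spec left-pads an integer CIK with zeroes to 10 digits:

```python
def build_submissions_url(cik: int) -> str:
    """Return the SEC submissions URL for an integer CIK (hypothetical helper)."""
    # `:010d` formats the integer as decimal, zero-padded to a width of 10.
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


# CIK 1127371 is the one that appears in the `source` values for CVCY in this notebook.
print(build_submissions_url(1127371))
```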

## Initial Imports

In [1]:
from datetime import datetime
import pandas as pd
import os

## Dropping Unnecessary Rows

In [2]:
df_10qs = pd.read_csv('../data/all_10qs.csv')
# Drop the saved index column; it is just the default 0 through N sequence
df_10qs.drop('Unnamed: 0', axis=1, inplace=True)
df_10qs.head()


Unnamed: 0,ticker,exchange,accession_number,filing_date,report_date,acceptance_datetime,act,form,file_number,film_number,...,size,primary_document,is_XBRL,is_inline_XBRL,primary_doc_description,source,has_multi_ticker,has_multi_exchange,all_submissions_line_number,report_url
0,CVCY,Nasdaq,0001628280-24-023222,2024-05-14,2024-03-31,2024-05-14T16:19:29.000Z,34.0,10-Q,000-31977,24944655.0,...,12967977,cvcy-20240331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
1,CVCY,Nasdaq,0001127371-23-000152,2023-11-02,2023-09-30,2023-11-02T14:38:44.000Z,34.0,10-Q,000-31977,231371534.0,...,12864003,cvcy-20230930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
2,CVCY,Nasdaq,0001127371-23-000128,2023-08-03,2023-06-30,2023-08-03T16:56:00.000Z,34.0,10-Q,000-31977,231140967.0,...,12459205,cvcy-20230630.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
3,CVCY,Nasdaq,0001127371-23-000072,2023-05-15,2023-03-31,2023-05-12T18:46:16.000Z,34.0,10-Q,000-31977,23917689.0,...,12111169,cvcy-20230331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
4,CVCY,Nasdaq,0001127371-22-000167,2022-11-02,2022-09-30,2022-11-02T13:29:11.000Z,34.0,10-Q,000-31977,221353223.0,...,13954236,cvcy-20220930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...


In [3]:
df_10qs.shape

(177246, 21)

In [4]:
df_10qs.drop_duplicates().shape

(177117, 21)

The duplicate rows were likely created when the scrape stopped because the machine went to sleep, and I restarted it a few indices before the last written index.

In [5]:
f"{df_10qs.shape[0] - df_10qs.drop_duplicates().shape[0]} duplicate rows to be dropped"

'129 duplicate rows to be dropped'

Dropping the duplicate rows and resetting the index.

In [6]:
df_10qs.drop_duplicates(inplace=True)
df_10qs.reset_index(drop=True, inplace=True)
df_10qs.shape

(177117, 21)
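
To illustrate the restart scenario described above, here is a toy sketch (the frame, its values, and the `overlap` variable are made up for illustration): re-writing the last few rows after a restart yields exact duplicates, which `drop_duplicates` removes.

```python
import pandas as pd

# Toy stand-in for the scraped rows; values are illustrative only.
first_run = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "accession_number": ["a-1", "a-2", "b-1", "b-2"],
})

# Simulate restarting the scrape two rows before the last written row.
overlap = 2
new_rows = pd.DataFrame({"ticker": ["CCC"], "accession_number": ["c-1"]})
second_run = pd.concat([first_run.tail(overlap), new_rows], ignore_index=True)

# The combined output contains the overlapping rows twice.
combined = pd.concat([first_run, second_run], ignore_index=True)
n_duplicates = combined.shape[0] - combined.drop_duplicates().shape[0]

# Drop the exact duplicates and rebuild a clean 0..N index.
deduped = combined.drop_duplicates().reset_index(drop=True)
print(n_duplicates, deduped.shape)
```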

`accession_number` and `primary_document` are both required to build the report URL, so I will drop any rows that have a null value in either column.

In [7]:
df_10qs.dropna(subset=['accession_number', 'primary_document'], inplace=False).shape

(173402, 21)

In [8]:
f"{df_10qs.shape[0] - df_10qs.dropna(subset=['accession_number', 'primary_document'], inplace=False).shape[0]} rows with null values for accession_number and/or primary_document columns"

'3715 rows with null values for accession_number and/or primary_document columns'

In [9]:
df_10qs.dropna(subset=['accession_number', 'primary_document'], inplace=True)
df_10qs.reset_index(drop=True, inplace=True)
df_10qs.shape

(173402, 21)
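
The reason both columns are required can be seen from the shape of the `report_url` values in this dataset, e.g. `https://www.sec.gov/Archives/edgar/data/1122904/000112290419000245/ntgr2019q3-10q.htm`. The sketch below assumes the pattern is the CIK without leading zeroes, then the accession number with its dashes removed, then the primary document. The helper `build_report_url` is hypothetical; the actual construction lives in `01_GatheringTheData.ipynb`.

```python
def build_report_url(cik: int, accession_number: str, primary_document: str) -> str:
    """Assemble a report URL from its parts (hypothetical helper, inferred pattern)."""
    # The archive path uses the accession number without its dashes.
    accession_path = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik}/{accession_path}/{primary_document}"
    )


# Parts taken from the first CVCY row (CIK read from its `source` URL).
url = build_report_url(1127371, "0001628280-24-023222", "cvcy-20240331.htm")
print(url)
```

If either part is missing, the path cannot be completed, which is why those rows are dropped rather than imputed.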

## Data Types
Now let us spend some time checking that each column in our dataframe has an appropriate data type.

In [10]:
df_10qs.head()

Unnamed: 0,ticker,exchange,accession_number,filing_date,report_date,acceptance_datetime,act,form,file_number,film_number,...,size,primary_document,is_XBRL,is_inline_XBRL,primary_doc_description,source,has_multi_ticker,has_multi_exchange,all_submissions_line_number,report_url
0,CVCY,Nasdaq,0001628280-24-023222,2024-05-14,2024-03-31,2024-05-14T16:19:29.000Z,34.0,10-Q,000-31977,24944655.0,...,12967977,cvcy-20240331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
1,CVCY,Nasdaq,0001127371-23-000152,2023-11-02,2023-09-30,2023-11-02T14:38:44.000Z,34.0,10-Q,000-31977,231371534.0,...,12864003,cvcy-20230930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
2,CVCY,Nasdaq,0001127371-23-000128,2023-08-03,2023-06-30,2023-08-03T16:56:00.000Z,34.0,10-Q,000-31977,231140967.0,...,12459205,cvcy-20230630.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
3,CVCY,Nasdaq,0001127371-23-000072,2023-05-15,2023-03-31,2023-05-12T18:46:16.000Z,34.0,10-Q,000-31977,23917689.0,...,12111169,cvcy-20230331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...
4,CVCY,Nasdaq,0001127371-22-000167,2022-11-02,2022-09-30,2022-11-02T13:29:11.000Z,34.0,10-Q,000-31977,221353223.0,...,13954236,cvcy-20220930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,https://www.sec.gov/Archives/edgar/data/112737...


In [11]:
df_10qs.dtypes

ticker                          object
exchange                        object
accession_number                object
filing_date                     object
report_date                     object
acceptance_datetime             object
act                            float64
form                            object
file_number                     object
film_number                     object
items                          float64
size                             int64
primary_document                object
is_XBRL                          int64
is_inline_XBRL                   int64
primary_doc_description         object
source                          object
has_multi_ticker                 int64
has_multi_exchange               int64
all_submissions_line_number      int64
report_url                      object
dtype: object

### Dates

In [12]:
def column_as_datetime(df, column_name):
    df[column_name] = pd.to_datetime(df[column_name])

In [13]:
column_as_datetime(df_10qs, 'filing_date')
column_as_datetime(df_10qs, 'report_date')
column_as_datetime(df_10qs, 'acceptance_datetime')
df_10qs.dtypes

ticker                                      object
exchange                                    object
accession_number                            object
filing_date                         datetime64[ns]
report_date                         datetime64[ns]
acceptance_datetime            datetime64[ns, UTC]
act                                        float64
form                                        object
file_number                                 object
film_number                                 object
items                                      float64
size                                         int64
primary_document                            object
is_XBRL                                      int64
is_inline_XBRL                               int64
primary_doc_description                     object
source                                      object
has_multi_ticker                             int64
has_multi_exchange                           int64
all_submissions_line_number    
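
Note that `acceptance_datetime` ends up timezone-aware while the two date columns do not. A small sketch, using sample strings in the same formats as the rows shown earlier, suggests why: the trailing `Z` marks the timestamp as UTC, whereas plain `YYYY-MM-DD` strings carry no timezone.

```python
import pandas as pd

# Sample strings in the same formats as `filing_date` and `acceptance_datetime`.
dates = pd.to_datetime(pd.Series(["2024-05-14", "2023-11-02"]))
stamps = pd.to_datetime(
    pd.Series(["2024-05-14T16:19:29.000Z", "2023-11-02T14:38:44.000Z"])
)

print(dates.dt.tz)   # no timezone for plain dates
print(stamps.dt.tz)  # UTC for strings ending in "Z"

# If naive timestamps are preferred downstream, the timezone can be dropped;
# the values keep their UTC wall-clock time.
naive_stamps = stamps.dt.tz_localize(None)
```

Mixing timezone-aware and naive columns in comparisons or subtractions raises an error in pandas, so it may be worth deciding on one convention before doing date arithmetic across these columns.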

### Numeric Data

In [14]:
df_10qs[['act', 'items', 'size', 'is_XBRL', 'is_inline_XBRL', 'has_multi_ticker', 'has_multi_exchange', 'all_submissions_line_number']].head()

Unnamed: 0,act,items,size,is_XBRL,is_inline_XBRL,has_multi_ticker,has_multi_exchange,all_submissions_line_number
0,34.0,,12967977,1,1,0,0,1
1,34.0,,12864003,1,1,0,0,1
2,34.0,,12459205,1,1,0,0,1
3,34.0,,12111169,1,1,0,0,1
4,34.0,,13954236,1,1,0,0,1


#### ACT Column

In [15]:
df_10qs['act'].value_counts(dropna=False)

act
34.0    168916
NaN       4486
Name: count, dtype: int64

Whether a report has a null `act` value or an `act` value of 34 may itself carry useful information, so I will keep this column.

#### Items Column

In [16]:
df_10qs['items'].value_counts(dropna=False)

items
NaN    173402
Name: count, dtype: int64

Since the `items` column is entirely null, we can drop it.

In [17]:
df_10qs.drop(columns=['items'], inplace=True)
df_10qs.shape

(173402, 20)

#### Size Column

In [18]:
df_10qs['size'].value_counts(dropna=False)

size
54816451    6
57148214    6
71624934    6
76165528    6
65073230    6
           ..
13206832    1
12587393    1
11076350    1
14728325    1
131161      1
Name: count, Length: 158251, dtype: int64

#### Is_XBRL Column

In [19]:
df_10qs['is_XBRL'].value_counts(dropna=False)

is_XBRL
1    143795
0     29607
Name: count, dtype: int64

#### Is_Inline_XBRL Column

In [20]:
df_10qs['is_inline_XBRL'].value_counts(dropna=False)

is_inline_XBRL
0    110087
1     63315
Name: count, dtype: int64

#### Has_Multi_Ticker Column

In [21]:
df_10qs['has_multi_ticker'].value_counts(dropna=False)

has_multi_ticker
0    152596
1     20806
Name: count, dtype: int64

#### Has_Multi_Exchange Column

In [22]:
df_10qs['has_multi_exchange'].value_counts(dropna=False)

has_multi_exchange
0    152596
1     20806
Name: count, dtype: int64

#### All_Submissions_Line_Number Column

In [23]:
df_10qs['all_submissions_line_number'].value_counts(dropna=False)

all_submissions_line_number
1952    74
7394    74
6296    74
4595    73
2286    73
        ..
2939     1
1472     1
4844     1
5861     1
1923     1
Name: count, Length: 5978, dtype: int64

### Other Data Types

In [24]:
df_10qs.dtypes

ticker                                      object
exchange                                    object
accession_number                            object
filing_date                         datetime64[ns]
report_date                         datetime64[ns]
acceptance_datetime            datetime64[ns, UTC]
act                                        float64
form                                        object
file_number                                 object
film_number                                 object
size                                         int64
primary_document                            object
is_XBRL                                      int64
is_inline_XBRL                               int64
primary_doc_description                     object
source                                      object
has_multi_ticker                             int64
has_multi_exchange                           int64
all_submissions_line_number                  int64
report_url                     

In [25]:
df_10qs[['ticker', 'exchange', 'accession_number', 'form', 'file_number', 'film_number',
         'primary_document', 'primary_doc_description', 'source', 'report_url']].head()

Unnamed: 0,ticker,exchange,accession_number,form,file_number,film_number,primary_document,primary_doc_description,source,report_url
0,CVCY,Nasdaq,0001628280-24-023222,10-Q,000-31977,24944655.0,cvcy-20240331.htm,10-Q,https://data.sec.gov/submissions/CIK0001127371...,https://www.sec.gov/Archives/edgar/data/112737...
1,CVCY,Nasdaq,0001127371-23-000152,10-Q,000-31977,231371534.0,cvcy-20230930.htm,10-Q,https://data.sec.gov/submissions/CIK0001127371...,https://www.sec.gov/Archives/edgar/data/112737...
2,CVCY,Nasdaq,0001127371-23-000128,10-Q,000-31977,231140967.0,cvcy-20230630.htm,10-Q,https://data.sec.gov/submissions/CIK0001127371...,https://www.sec.gov/Archives/edgar/data/112737...
3,CVCY,Nasdaq,0001127371-23-000072,10-Q,000-31977,23917689.0,cvcy-20230331.htm,10-Q,https://data.sec.gov/submissions/CIK0001127371...,https://www.sec.gov/Archives/edgar/data/112737...
4,CVCY,Nasdaq,0001127371-22-000167,10-Q,000-31977,221353223.0,cvcy-20220930.htm,10-Q,https://data.sec.gov/submissions/CIK0001127371...,https://www.sec.gov/Archives/edgar/data/112737...


#### Ticker Column

In [26]:
df_10qs['ticker'].value_counts(dropna=False)

ticker
NaN      1040
SNFCA     146
OCC       144
HCKT      144
ANIK      144
         ... 
PGY         1
MREO        1
LICY        1
WINS        1
ALTM        1
Name: count, Length: 5436, dtype: int64

#### Exchange Column

In [27]:
df_10qs['exchange'].value_counts(dropna=False)

exchange
Nasdaq    80578
NYSE      55244
OTC       36439
NaN        1088
CBOE         53
Name: count, dtype: int64

#### Accession Number Column

In [28]:
df_10qs['accession_number'].value_counts(dropna=False)

accession_number
0000065984-22-000129    6
0000065984-20-000283    6
0000065984-22-000064    6
0000065984-18-000132    6
0000065984-18-000192    6
                       ..
0001564590-19-030902    1
0001564590-19-018177    1
0001564590-18-028647    1
0001564590-18-020705    1
0001164150-08-000035    1
Name: count, Length: 159188, dtype: int64

#### Form Column

In [29]:
df_10qs['form'].value_counts(dropna=False)

form
10-Q    173402
Name: count, dtype: int64

Even though every value in this column is `10-Q`, I am keeping it as a sanity check that the data I am working with are 10-Q reports.

#### File Number Column

In [30]:
df_10qs['file_number'].value_counts(dropna=False)

file_number
000-09341     146
000-12919     144
000-17077     144
001-07731     144
000-12895     144
             ... 
001-41915       1
001-41880       1
333-265883      1
001-42023       1
000-55791       1
Name: count, Length: 6569, dtype: int64

#### Film Number Column

In [31]:
df_10qs['film_number'].value_counts(dropna=False)

film_number
NaN            57
81008772.0      2
20947002.0      2
24523514.0      2
231318411.0     2
               ..
211403281       1
211167481       1
21919684        1
201307006       1
8829716.0       1
Name: count, Length: 161865, dtype: int64
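
The counts above mix values such as `81008772.0` and `211403281`, which is consistent with the `object` dtype reported for `film_number`. Assuming the column holds whole numbers stored inconsistently as floats, integers, or strings (an assumption, not verified here), one possible normalization is sketched below on a toy series:

```python
import pandas as pd

# Toy series mimicking the mixed representations seen in `film_number`.
film_number = pd.Series(["81008772.0", 211403281, 24523514.0, None], dtype="object")

# Coerce everything to numbers (unparseable values become NaN), then use
# pandas' nullable integer dtype so missing values are kept as <NA>.
film_number_clean = pd.to_numeric(film_number, errors="coerce").astype("Int64")
print(film_number_clean.tolist())
```

If film numbers can carry leading zeroes or non-numeric characters, keeping the column as strings would be the safer choice.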

#### Primary Document

In [32]:
df_10qs['primary_document'].value_counts(dropna=False)

primary_document
form10-q.htm             7313
form10q.htm              4307
d10q.htm                 3957
0001.txt                  460
mainbody.htm              279
                         ... 
idcc-q39302015.htm          1
idcc-q13312016.htm          1
idcc-q26302016.htm          1
idcc-q39302016.htm          1
udhi-10q_03312008.txt       1
Name: count, Length: 133773, dtype: int64

#### Primary Doc Description

In [33]:
df_10qs['primary_doc_description'].value_counts(dropna=False)

primary_doc_description
10-Q                                                 83473
FORM 10-Q                                            38853
NaN                                                  16427
QUARTERLY REPORT                                     11213
QUARTERLY REPORT PURSUANT TO SECTIONS 13 OR 15(D)      653
                                                     ...  
WILLAMETTE VALLEY VINEYARDS, INC. 10Q #16513             1
WILLAMETTE VALLEY VINEYARDS, INC. 09/30/2014 10-Q        1
WILLAMETTE VALLEY VINEYARDS, INC. 06/30/2014 10-Q        1
WILLAMETTE VALLEY VINEYARDS, INC. 03/31/2014 10-Q        1
FORM 10-Q Q1-2018                                        1
Name: count, Length: 13031, dtype: int64

#### Source Column

In [34]:
df_10qs['source'].value_counts(dropna=False)

source
https://data.sec.gov/submissions/CIK0000318673.json    146
https://data.sec.gov/submissions/CIK0000032621.json    144
https://data.sec.gov/submissions/CIK0000006207.json    144
https://data.sec.gov/submissions/CIK0000758743.json    144
https://data.sec.gov/submissions/CIK0000718332.json    144
                                                      ... 
https://data.sec.gov/submissions/CIK0001948884.json      1
https://data.sec.gov/submissions/CIK0001714562.json      1
https://data.sec.gov/submissions/CIK0001901203.json      1
https://data.sec.gov/submissions/CIK0001982518.json      1
https://data.sec.gov/submissions/CIK0001855781.json      1
Name: count, Length: 5524, dtype: int64

#### Report URL Column

In [35]:
df_10qs['report_url'].value_counts(dropna=False)

report_url
https://www.sec.gov/Archives/edgar/data/1122904/000112290419000245/ntgr2019q3-10q.htm       2
https://www.sec.gov/Archives/edgar/data/916907/000092708907000051/somo10q123106.htm         2
https://www.sec.gov/Archives/edgar/data/916907/000092708910000273/smbc-10q093010.htm        2
https://www.sec.gov/Archives/edgar/data/916907/000092708910000060/smb-10q033110.htm         2
https://www.sec.gov/Archives/edgar/data/916907/000092708910000024/smbi-10q123109.htm        2
                                                                                           ..
https://www.sec.gov/Archives/edgar/data/8042/000156276213000225/c420-20130630x10q.htm       1
https://www.sec.gov/Archives/edgar/data/8042/000156276213000123/c420-20130331x10q.htm       1
https://www.sec.gov/Archives/edgar/data/8042/000156276213000039/c420-20121231x10q.htm       1
https://www.sec.gov/Archives/edgar/data/8042/000119312512343290/d342052d10q.htm             1
https://www.sec.gov/Archives/edgar/data/1138586/0

### File Extension Count
It might be helpful to see the distribution of file types for error handling later. If many files of one type fail in the same way, a general fix can be applied per file type rather than hot-fixing individual errors.

In [36]:
# Take the extension of each report URL (e.g. '.htm') and lowercase it
df_10qs['file_extension'] = df_10qs['report_url'].map(lambda url: os.path.splitext(url)[1].lower())
df_10qs['file_extension'].value_counts(dropna=False)

file_extension
.htm      167685
.txt        5697
.html         14
.paper         5
.xfd           1
Name: count, dtype: int64
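
One way to act on this distribution is to separate the common file types from the rare ones so the latter can be inspected by hand. The sketch below runs on a toy frame; the `EXPECTED_EXTENSIONS` set and the sample URLs are assumptions for illustration, not part of the original pipeline.

```python
import os

import pandas as pd

# Hypothetical set of extensions we expect a generic parser to handle.
EXPECTED_EXTENSIONS = {".htm", ".html", ".txt"}

# Toy URLs; only the extensions matter here.
toy = pd.DataFrame({"report_url": [
    "https://example.com/a/form10-q.htm",
    "https://example.com/b/0001.TXT",
    "https://example.com/c/filing.paper",
]})

# Same extraction as above: take the extension and lowercase it.
toy["file_extension"] = toy["report_url"].map(
    lambda url: os.path.splitext(url)[1].lower()
)

# Split rows into those with an expected extension and the rest.
is_expected = toy["file_extension"].isin(EXPECTED_EXTENSIONS)
common, rare = toy[is_expected], toy[~is_expected]
print(len(common), len(rare))
```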

Swapping the positions of `file_extension` and `report_url` so that `report_url` stays in the last column.

In [37]:
columns = df_10qs.columns.tolist()
a, b = columns.index('file_extension'), columns.index('report_url')
columns[a], columns[b] = columns[b], columns[a]
df_10qs = df_10qs[columns]
df_10qs.head()

Unnamed: 0,ticker,exchange,accession_number,filing_date,report_date,acceptance_datetime,act,form,file_number,film_number,...,primary_document,is_XBRL,is_inline_XBRL,primary_doc_description,source,has_multi_ticker,has_multi_exchange,all_submissions_line_number,file_extension,report_url
0,CVCY,Nasdaq,0001628280-24-023222,2024-05-14,2024-03-31,2024-05-14 16:19:29+00:00,34.0,10-Q,000-31977,24944655.0,...,cvcy-20240331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,.htm,https://www.sec.gov/Archives/edgar/data/112737...
1,CVCY,Nasdaq,0001127371-23-000152,2023-11-02,2023-09-30,2023-11-02 14:38:44+00:00,34.0,10-Q,000-31977,231371534.0,...,cvcy-20230930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,.htm,https://www.sec.gov/Archives/edgar/data/112737...
2,CVCY,Nasdaq,0001127371-23-000128,2023-08-03,2023-06-30,2023-08-03 16:56:00+00:00,34.0,10-Q,000-31977,231140967.0,...,cvcy-20230630.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,.htm,https://www.sec.gov/Archives/edgar/data/112737...
3,CVCY,Nasdaq,0001127371-23-000072,2023-05-15,2023-03-31,2023-05-12 18:46:16+00:00,34.0,10-Q,000-31977,23917689.0,...,cvcy-20230331.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,.htm,https://www.sec.gov/Archives/edgar/data/112737...
4,CVCY,Nasdaq,0001127371-22-000167,2022-11-02,2022-09-30,2022-11-02 13:29:11+00:00,34.0,10-Q,000-31977,221353223.0,...,cvcy-20220930.htm,1,1,10-Q,https://data.sec.gov/submissions/CIK0001127371...,0,0,1,.htm,https://www.sec.gov/Archives/edgar/data/112737...
