### Case Study on 10-K Filings - Business, Risk Factors and Management Discussion

Background: 10-K filings are the annual reports that the SEC requires from public companies. The form follows a strict format.
For this case study we are interested in the following sections:

Item 1
“Business” requires a description of the company’s business, including its main products and services, what subsidiaries it owns, and what markets it operates in. This section may also include information about recent events, competition the company faces, regulations that apply to it, labor issues, special operating costs, or seasonal factors. This is a good place to start to understand how the company operates.

Item 1A
“Risk Factors” includes information about the most significant risks that apply to the company or to its securities. Companies generally list the risk factors in order of their importance.
In practice, this section focuses on the risks themselves, not how the company addresses those risks. Some risks may be true for the entire economy, some may apply only to the company’s industry sector or geographic region, and some may be unique to the company.

Item 7
“Management’s Discussion and Analysis of Financial Condition and Results of Operations” gives the company’s perspective on the business results of the past financial year. This section, known as the MD&A for short, allows company management to tell its story in its own words.



### Load the data
We import some standard packages and load the filings, together with their extracted sections, from a CSV file.

In [1]:
import pandas as pd
import numpy as np

In [11]:
link = 'https://storage.googleapis.com/iig-ds-test-data/all_filings_and_sections.csv'
df = pd.read_csv(link)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2420 entries, 0 to 2419
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Unnamed: 0           2420 non-null   int64 
 1   ticker               2419 non-null   object
 2   companyName          2420 non-null   object
 3   formType             2420 non-null   object
 4   description          2420 non-null   object
 5   filedAt              2420 non-null   object
 6   linkToFilingDetails  2420 non-null   object
 7   Section1             2404 non-null   object
 8   Section1A            2395 non-null   object
 9   Section7             2389 non-null   object
dtypes: int64(1), object(9)
memory usage: 189.2+ KB


Let's have a look at the data. But first we parse the dates and drop the 'Unnamed: 0' column.

In [13]:
df['filedAt'] = pd.to_datetime(df['filedAt'], infer_datetime_format=True)
df = df.drop(columns='Unnamed: 0')
df.head(5)

| | ticker | companyName | formType | description | filedAt | linkToFilingDetails | Section1 | Section1A | Section7 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AIZ | ASSURANT, INC. | 10-K | Form 10-K - Annual report [Section 13 and 15(d... | 2023-02-17 16:12:13-05:00 | https://www.sec.gov/Archives/edgar/data/126723... | Item 1. Business \n\nAssurant, Inc. was incor... | Item 1A. Risk Factors \n\nCertain factors may... | Item 7. Management&#8217;s Discussion and Ana... |
| 1 | AIZ | ASSURANT, INC. | 10-K | Form 10-K - Annual report [Section 13 and 15(d... | 2022-02-22 16:24:39-05:00 | https://www.sec.gov/Archives/edgar/data/126723... | Item 1. Business \n\nAssurant, Inc. was incor... | Item 1A. Risk Factors \n\nCertain factors may... | Item 7. Management&#8217;s Discussion and Ana... |
| 2 | AIZ | ASSURANT, INC. | 10-K | Form 10-K - Annual report [Section 13 and 15(d... | 2021-02-19 16:44:57-05:00 | https://www.sec.gov/Archives/edgar/data/126723... | Item 1. Business \n\nAssurant, Inc. was incor... | Item 1A. Risk Factors \n\nCertain factors may... | Item 7. Management&#8217;s Discussion and Ana... |
| 3 | AIZ | ASSURANT, INC. | 10-K | Form 10-K - Annual report [Section 13 and 15(d... | 2020-02-19 17:13:43-05:00 | https://www.sec.gov/ix?doc=/Archives/edgar/dat... | Item 1. Business \n\nAssurant, Inc. was incor... | Item 1A. Risk Factors \n\nCertain factors may... | Item 7. Management&#8217;s Discussion and Ana... |
| 4 | AIZ | ASSURANT INC | 10-K | Form 10-K - Annual report [Section 13 and 15(d... | 2019-02-22 16:48:45-05:00 | https://www.sec.gov/ix?doc=/Archives/edgar/dat... | Item 1. Business \n\nAssurant, Inc. was incor... | Item 1A. Risk Factors \n\nCertain factors may... | Item 7. Management&#8217;s Discussion and Ana... |
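
The `filedAt` strings carry a UTC offset (`-05:00` in the rows shown above). If the full column mixes offsets, for example `-04:00` for filings made during daylight saving time (an assumption, since only five rows are visible here), converting everything to UTC gives a single timezone-aware dtype and keeps later date arithmetic simple. A minimal sketch on made-up values, not taken from the data set:

```python
import pandas as pd

# Hypothetical sample that mimics the `filedAt` format; the second
# offset (-04:00) is an assumption used only for illustration.
sample = pd.Series([
    "2023-02-17 16:12:13-05:00",
    "2022-08-05 16:24:39-04:00",
])

# utc=True converts every timestamp to UTC, so the result has one
# timezone-aware datetime dtype even when the offsets differ.
parsed = pd.to_datetime(sample, utc=True)

print(parsed.dtype)
print(parsed.dt.year.tolist())
```

Whether this is needed depends on the offsets actually present in the column; with a single offset the plain call used in the notebook is enough.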


Let's have a look at the number of filings per ticker.

In [8]:
ticker_counts = df.ticker.value_counts()
ticker_counts.describe()

count    502.000000
mean       4.818725
std        0.610134
min        1.000000
25%        5.000000
50%        5.000000
75%        5.000000
max        9.000000
Name: ticker, dtype: float64

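As expected, most tickers have five filings (the 25th, 50th and 75th percentiles are all 5), which is consistent with 4-5 years of annual filings. A few tickers deviate, with as few as 1 and as many as 9 filings. Let's look at the range of dates.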

In [30]:
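# Use a list (not a set) so the order of the 'min' and 'max' columns is deterministic
dates_range = df.groupby('ticker')['filedAt'].agg(['min', 'max'])
# Time between the first and last filing of each ticker, in years
dates_range['range_years'] = (dates_range['max'] - dates_range['min']).dt.days / 365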

In [29]:
dates_range[['range_years']].describe()

| | range_years |
|---|---|
| count | 502.0 |
| mean | 3.793822 |
| std | 0.56301 |
| min | 0.0 |
| 25% | 3.980822 |
| 50% | 3.989041 |
| 75% | 4.005479 |
| max | 4.065753 |


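No ticker spans more than about 4.1 years between its first and last filing, yet some tickers have up to 9 filings. We therefore suspect that there are companies with repeated filings. We can filter for these later. Let's look now at the actual text data.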
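
One possible way to surface repeated filings is to count filings per ticker and calendar year of `filedAt` and keep the groups with more than one row. The sketch below runs on a small made-up frame with the same column names. Grouping by calendar year is an assumption, since a company could legitimately file twice in one calendar year, so the result is a list of candidates to inspect, not a list of confirmed duplicates.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the case-study frame.
toy = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "filedAt": pd.to_datetime([
        "2021-02-19", "2022-02-22", "2022-03-01", "2021-02-10", "2022-02-11",
    ]),
})

# Count filings per (ticker, filing year).
per_year = (
    toy.assign(year=toy["filedAt"].dt.year)
       .groupby(["ticker", "year"])
       .size()
       .rename("n_filings")
)

# Keep only the ticker-year pairs that occur more than once.
repeated = per_year[per_year > 1]
print(repeated)
```
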
### 1. Business description

In [37]:
example_sect1 = df.loc[35, 'Section1']
example_sect1a = df.loc[35, 'Section1A']
example_sect7 = df.loc[35, 'Section7']

print(f'{len(example_sect1)} characters for section 1')
print(f'{len(example_sect1a)} characters for section 1A')
print(f'{len(example_sect7)} characters for section 7')

62027 characters for section 1
51410 characters for section 1A
148246 characters for section 7


The sections of this filing are long. Across all filings, the section lengths (in characters) are distributed as follows:


In [43]:
sections = [f'Section{s}' for s in ['1', '1A', '7']]
# Summary statistics of the character length of each section column
lens_summary = {k: df.loc[:, k].astype('str').apply(len).describe() for k in sections}
for k, v in lens_summary.items():
    print(v)
    print('\n')

count      2420.000000
mean      55073.604545
std       37290.827366
min           3.000000
25%       31010.500000
50%       47889.500000
75%       69790.250000
max      533690.000000
Name: Section1, dtype: float64


count      2420.000000
mean      68644.264463
std       41021.402465
min           3.000000
25%       43639.000000
50%       64877.500000
75%       85113.750000
max      606309.000000
Name: Section1A, dtype: float64


count    2.420000e+03
mean     9.364633e+04
std      6.605767e+04
min      3.000000e+00
25%      5.760825e+04
50%      8.332800e+04
75%      1.138970e+05
max      1.018864e+06
Name: Section7, dtype: float64




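1. The 25th percentile of every section is above 30k characters, well beyond 10k characters (around 2000 words).
2. The minimum length is 3 characters in every section. The `count` is 2420 for each section even though `df.info()` reports missing values, so this is most likely the string `nan` produced by casting missing sections with `astype('str')`, not a genuine 3-character section.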
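
The effect of missing values on the minimum length can be checked on a toy column. The sketch below uses made-up strings; `str(x)` stands in for the string cast used in the notebook, and `.str.len()` is shown as one alternative that leaves missing values as `NaN` so that `describe()` ignores them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing section.
toy = pd.Series(
    ["Item 1. Business ...", np.nan, "Item 1. Business overview"],
    dtype="object",
)

# Casting each value with str() turns the missing value into 'nan',
# which is counted as a 3-character section.
lens_cast = toy.map(lambda x: len(str(x)))

# .str.len() keeps the missing value as NaN, so it is excluded
# from count, min and the other summary statistics.
lens_safe = toy.str.len()

print(lens_cast.min())
print(lens_safe.count(), lens_safe.min())
```

With the second variant the `count` row of `describe()` would match the non-null counts reported by `df.info()`.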


In the [next](topic_modelling_langchain.ipynb) notebook we look at potential ways of detecting topics and how they trend over time.