# Transform ASIF data by constructing financial variables and save the output to S3

# Objective(s)

## Business needs 

Transform (creating financial variables) ASIF data using Athena and save output to S3 + Glue. 

## Description

### Objective 

Construct the financial ratio variables by aggregating the data (no longer at the firm level).

The asif_financial_ratio table is aggregated at the following levels:

* year
* city
* industry: 2 digits, compared with 4 digits in the parent task

**Construction variables**

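Rescale output, fa_net, and employment, then construct the following ratios. If possible, compute them at the:

* industry level
* city-industry level
* city-industry-year level

1. Working capital: cuasset - total current liabilities (c95)
2. Working capital requirement: inventories (c81) + accounts receivable (c80) - accounts payable (c96)
3. Current ratio: cuasset / total current liabilities (c95)
4. Cash ratio: ((of which: short-term investments (c79) + accounts receivable (c80) + inventories (c81)) - cuasset) / total current liabilities (c95)
5. Debt to total assets:
   1. (total current liabilities (c95) + total long-term liabilities (c97)) / total assets 318 (c93)
   2. total liabilities (c98) / total assets 318 (c93)
6. Return on assets: (total annual operating revenue (c64) - (cost of main business (c108) + operating expenses (c113) + administrative expenses (c114) + property insurance premiums (c116) + labor and unemployment insurance premiums (c118) + financial expenses (c124) + total wages payable for the year (wage))) / total assets 318 (c93)
7. Asset turnover ratio: total annual operating revenue (c64) / (Δ total assets 318 (c93) / 2)
8. R&D intensity: rdfee / total annual operating revenue (c64)
9. Inventory to sales: inventories (c81) / total annual operating revenue (c64)
10. Asset tangibility: total fixed assets (c85) - intangible assets (c91)
11. Accounts payable to total assets: (Δ accounts payable (c96)) / (Δ total assets 318 (c93))
12. Sales / assets
    - Update: create sales over assets using the Andersen method, i.e. without the asset growth rate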
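
A few of the ratios above can be sketched with pandas. This is a minimal illustration, assuming a firm-level DataFrame whose columns carry the ASIF codes used in the list; the two rows are invented numbers, not ASIF data.

```python
import pandas as pd

# Toy firm-level rows; the values are invented purely for illustration.
df = pd.DataFrame(
    {
        "cuasset": [500.0, 300.0],  # current assets
        "c80": [100.0, 50.0],       # accounts receivable
        "c81": [150.0, 80.0],       # inventories
        "c96": [60.0, 40.0],        # accounts payable
        "c95": [250.0, 200.0],      # total current liabilities
        "c98": [400.0, 350.0],      # total liabilities
        "c93": [1000.0, 700.0],     # total assets
        "c64": [1200.0, 400.0],     # total annual operating revenue
    }
)

ratios = df.assign(
    working_capital=lambda x: x["cuasset"] - x["c95"],
    working_capital_requirement=lambda x: x["c81"] + x["c80"] - x["c96"],
    current_ratio=lambda x: x["cuasset"] / x["c95"],
    debt_to_asset=lambda x: x["c98"] / x["c93"],
    inventory_to_sales=lambda x: x["c81"] / x["c64"],
)
print(ratios[["working_capital", "current_ratio", "debt_to_asset"]])
```

The same expressions carry over to aggregated data by summing each column first and dividing afterwards.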

**Steps** 

We will clean the table by doing the following steps:

1. Compute the financial ratio by aggregating the data

**Caution**

* Make sure there are no duplicates when merging ratios from different levels

**Target**

* The file is saved in S3: 
  * bucket: datalake-datascience 
  * path: DATA/ECON/FIRM_SURVEY/ASIF_CHINA/TRANSFORMED/FINANCIAL_RATIO 
* Glue data catalog should be updated
  * database: firms_survey 
  * table prefix: asif_city_industry 
    * table name (prefix + last folder S3 path): asif_city_industry_financial_ratio 
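
The naming rule (prefix plus last folder of the S3 path) can be sketched in a few lines. The variable names are assumptions for illustration, and lower-casing the folder name is inferred from the table name shown above.

```python
from pathlib import PurePosixPath

s3_path = "DATA/ECON/FIRM_SURVEY/ASIF_CHINA/TRANSFORMED/FINANCIAL_RATIO"
table_prefix = "asif_city_industry"

# Last folder of the S3 key, lower-cased to match the Glue table name
last_folder = PurePosixPath(s3_path).name.lower()
table_name = f"{table_prefix}_{last_folder}"
print(table_name)  # asif_city_industry_financial_ratio
```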

# Metadata

* Key: spr04tlko02392a
* Parent key (for update parent):  
* Notebook US Parent (i.e the one to update): 
* https://github.com/thomaspernet/Financial_dependency_pollution/blob/master/01_data_preprocessing/02_transform_tables/00_asif_financial_ratio.md
* Epic: Epic 2
* US: US 1
* Date Begin: 11/23/2020
* Duration Task: 1
* Description: Transform (creating financial variables) ASIF data using Athena and save output to S3 + Glue. 
* Step type: Transform table
* Status: Active
* Source URL: Create Task and Epics
* Task type: Jupyter Notebook
* Users: Thomas Pernet
* Watchers: Thomas Pernet
* User Account: https://468786073381.signin.aws.amazon.com/console
* Estimated Log points: 10
* Task tag: #athena,#glue,#crawler,#financial-ratio
* Toggl Tag: #data-transformation
* current nb commits: 
* Meetings:  
* Presentation:  
* Email Information:  
  * thread: Number of threads: 0 (default 0, to avoid displaying email)

# Input Cloud Storage [AWS/GCP]

## Table/file

* Origin: 
  * Athena
* Name: 
  * asif_firms_prepared
* Github: 
  * https://github.com/thomaspernet/Financial_dependency_pollution/blob/master/01_data_preprocessing/01_prepare_tables/00_prepare_asif.md

# Destination Output/Delivery

## Table/file

* Origin: 
  * S3
  * Athena
* Name:
  * DATA/ECON/FIRM_SURVEY/ASIF_CHINA/TRANSFORMED/FINANCIAL_RATIO
  * asif_city_industry_financial_ratio
* GitHub:
  * https://github.com/thomaspernet/Financial_dependency_pollution/blob/master/01_data_preprocessing/02_transform_tables/00_asif_financial_ratio.md
* URL: 
  * datalake-datascience/DATA/ECON/FIRM_SURVEY/ASIF_CHINA/TRANSFORMED/FINANCIAL_RATIO

# Knowledge

## List of candidates

* [List of financial ratios that can be computed with ASIF panel data](https://roamresearch.com/#/app/thomas_db/page/PS3o9Z3VA)

In [None]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_s3 import service_s3
from awsPy.aws_glue import service_glue
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import os, shutil, json

import plotly.express as px
import matplotlib.pyplot as plt

path = os.getcwd()
parent_path = str(Path(path).parent.parent)


name_credential = 'financial_dep_SO2_accessKeys.csv'
region = 'eu-west-2'
bucket = 'datalake-london'
path_cred = "{0}/creds/{1}".format(parent_path, name_credential)

In [None]:
con = aws_connector.aws_instantiate(credential=path_cred, region=region)
client = con.client_boto()
s3 = service_s3.connect_S3(client=client, bucket=bucket, verbose=True)
glue = service_glue.connect_glue(client=client)

In [None]:
pandas_setting = True
if pandas_setting:
    cm = sns.light_palette("green", as_cmap=True)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_colwidth', None)

# Prepare query 

Write the query and save the resulting CSV back to the S3 bucket `datalake-datascience`.

# Steps

Detailed computation:

| Origin                  | Variable                    | Construction                                                                                  | Roam               |
|-------------------------|-----------------------------|-----------------------------------------------------------------------------------------------|--------------------|
| Balance sheet variables | current asset               | c80 + c81 + c82 + c79                                                                         | #current-asset     |
| Balance sheet variables | intangible                  | c91 + c92                                                                                     | #intangible-asset  |
| Balance sheet variables | tangible                    | tofixed - cudepre                                                                             | #tangible-asset    |
| Balance sheet variables | total net non current       | tofixed - cudepre + (c91 + c92)                                                               | #net-fixed-asset   |
| Balance sheet variables | error                       | (c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)) - (c95 + c97 + c99)                 |                    |
| Balance sheet variables | total_liabilities           | if error > 0 then allocate error to liabilities, else c98 + c99                               | #total-liabilities |
| Balance sheet variables | total_asset                 | if error < 0 then allocate error to asset, else c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92) |          |
| Financial metric        | cashflow                    | (c131 - c134) + cudepre                                                                       | #cashflow          |
| Financial metric        | current_ratio               | (c80 + c81 + c82 + c79) / c95                                                                 | #current-ratio     |
| Financial metric        | quick ratio                 | (c80 + c81 + c82 + c79 - c80 - c81) / c95                                                     | #quick-ratio       |
| Financial metric        | liabilities_tot_asset       | c98 / total_asset                                                                             | #leverage          |
| Financial metric        | sales_tot_asset             | sales / total_asset                                                                           | #sales-over-asset  |
| Financial metric        | investment_tot_asset        | c84 / total_asset                                                                             |                    |
| Financial metric        | rd_tot_asset                | rdfee / total_asset                                                                           |                    |
| Financial metric        | asset_tangibility_tot_asset | tangible / total_asset                                                                        | #collateral        |
| Financial metric        | cashflow_tot_asset          | cashflow / total_asset                                                                        |                    |
| Financial metric        | cashflow_to_tangible        | cashflow / tangible                                                                           |                    |
| Financial metric        | return_to_sale              | c131 / sales                                                                                  | #return-on-sales   |
| Financial metric        | coverage_ratio              | c131 / c125                                                                                   | #coverage-ratio    |
| Financial metric        | liquidity                   | (cuasset - c95) / total_asset                                                                 | #liquidity         |
| Other variables         | labor_productivity          | sales / employ                                                                                |                    |
| Other variables         | labor_capital               | employ / tangible                                                                             |                    |
| Other variables         | age                         | year - setup                                                                                  |                    |
| Other variables         | export_to_sale              | export / sales                                                                                |                    |


## Example step by step



In [None]:
DatabaseName = 'firms_survey'
s3_output_example = 'SQL_OUTPUT_ATHENA'

1. Count the number of digits in the industry code

We want to keep only the first two digits.

In [None]:
query = """
SELECT len, COUNT(len) as CNT
FROM (
SELECT length(cic) AS len
FROM asif_firms_prepared 
)
GROUP BY len
ORDER BY CNT
"""
output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_1'
                )
output

Count the code lengths after keeping the first 1 or 2 digits

In [None]:
query = """
WITH test AS (
SELECT
CASE 
WHEN LENGTH(cic) = 4 THEN substr(cic,1, 2) 
ELSE substr(cic,1, 1) END AS indu_2
FROM asif_firms_prepared 
)

SELECT len, COUNT(len) as CNT
FROM (
SELECT length(indu_2) AS len
FROM test 
)
GROUP BY len
ORDER BY CNT
"""
output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_1'
                )
output
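
The CASE expression can be mirrored in plain Python. The sketch below follows the zero-padded variant used in the later aggregation queries (`concat('0', substr(cic, 1, 1))`), so that every industry code ends up with two characters. The function name `two_digit_industry` is an assumption made for illustration.

```python
def two_digit_industry(cic: str) -> str:
    """Return the 2-digit industry code for a CIC code stored as text."""
    if len(cic) == 4:
        # 4-character codes: the first two characters are the industry
        return cic[:2]
    # Shorter codes are treated as having lost a leading zero: pad it back
    return "0" + cic[:1]


print(two_digit_industry("1310"))  # 13
print(two_digit_industry("610"))   # 06
```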

2. Add a consistent city code

Duplicates in `china_city_code_normalised` must be removed because the same code can appear with different Chinese names, as for Chongqing.
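
The same de-duplication can be sketched in pandas. The two frames below are invented stand-ins for the lookup and firm tables, and the `city_name` column is a hypothetical name. Only `extra_code`, `geocode4_corr`, `firm` and `citycode` come from the queries in this notebook.

```python
import pandas as pd

# Hypothetical extract of the lookup: one code listed under two spellings
lookup = pd.DataFrame(
    {
        "extra_code": ["5001", "5001", "1101"],
        "city_name": ["Chongqing", "Chongqing City", "Beijing"],
        "geocode4_corr": ["5001", "5001", "1101"],
    }
)
firms = pd.DataFrame(
    {"firm": ["a", "b", "c"], "citycode": ["5001", "1101", "5001"]}
)

# Keep one row per (extra_code, geocode4_corr), like the GROUP BY in SQL
no_dup_citycode = lookup[["extra_code", "geocode4_corr"]].drop_duplicates()

# validate="m:1" raises MergeError if the lookup key is still duplicated
merged = firms.merge(
    no_dup_citycode,
    left_on="citycode",
    right_on="extra_code",
    how="inner",
    validate="m:1",
)
print(len(firms), len(merged))
```

If the row count after the merge differs from the firm table, either codes are missing from the lookup or duplicates slipped through.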

In [None]:
query = """
SELECT *
FROM chinese_lookup.china_city_code_normalised 
WHERE extra_code = '5001'
"""
output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_1'
                )
output

In [None]:
query = """
WITH test AS (
SELECT firm, year, citycode, geocode4_corr, CASE 
WHEN LENGTH(cic) = 4 THEN substr(cic,1, 2) 
ELSE substr(cic,1, 1) END AS indu_2
  FROM firms_survey.asif_firms_prepared 
INNER JOIN 
  (
  SELECT extra_code, geocode4_corr
  FROM chinese_lookup.china_city_code_normalised 
  GROUP BY extra_code, geocode4_corr
  ) as no_dup_citycode
ON asif_firms_prepared.citycode = no_dup_citycode.extra_code
  )
  SELECT CNT, COUNT(*) AS CNT_dup
  FROM(
  SELECT firm, year, geocode4_corr, indu_2, COUNT(*) AS CNT
  FROM test
  GROUP BY firm, year, geocode4_corr, indu_2
    )
    GROUP BY CNT
"""
output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_1'
                )
output

Make sure the output is the same before and after using the consistent city code.

ASIF fails to satisfy the basic accounting equation, under which the left-hand side (total assets) should equal the right-hand side (total liabilities plus equity).

In [None]:
query = """
WITH test AS (
SELECT *, CASE 
WHEN LENGTH(cic) = 4 THEN substr(cic,1, 2)
ELSE substr(cic,1, 1) END AS indu_2,
c98 + c99 as total_asset
FROM firms_survey.asif_firms_prepared 
  WHERE year in (-- '2004', '2005', 
    '2006') 
)
SELECT 
  year, 
  indu_2, 
  
  SUM(c98) as total_liabilities,
  SUM(c99) as equity,
  SUM(toasset) as left_part,
  SUM(c98) + SUM(c99) as right_part,
  SUM(toasset) - (SUM(c98) + SUM(c99)) as diff,
  
  CASE 
  WHEN (SUM(c98) + SUM(c99)) - SUM(toasset) < 0 THEN 
  CAST( (SUM(c98) + SUM(c99) - SUM(toasset)) AS DECIMAL(16, 5)) / SUM(toasset)
  WHEN (SUM(c98) + SUM(c99)) - SUM(toasset) > 0 THEN  
  CAST( (SUM(c98) + SUM(c99) - SUM(toasset)) AS DECIMAL(16, 5)) / (SUM(c98) + SUM(c99))
  END AS pct_missing,
  
  CASE 
  WHEN SUM(toasset) - (SUM(c98) + SUM(c99)) < 0 THEN 
  SUM(toasset) + ABS(SUM(toasset) - (SUM(c98) + SUM(c99))) - (SUM(c98) + SUM(c99))
  
  WHEN SUM(toasset) - (SUM(c98) + SUM(c99)) > 0 THEN 
  SUM(toasset) - (SUM(c98) + SUM(c99) + SUM(toasset) - (SUM(c98) + SUM(c99)) )
  
  ELSE (SUM(c98) + SUM(c99)) - SUM(toasset) END AS diff_adjusted
  
  
  FROM test
  GROUP BY year, indu_2
  ORDER BY year DESC, indu_2
"""

output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_3'
                )
(
    output
    #.style
    #.format('{0:,.2f}')
)

3. Compute the ratios by industry

The ratios are averaged over the years 2002 to 2006. As in Fan, compute each ratio directly at the industry level, then take the average.

- Computed using the Chinese data
    - The ExtFin measure based on Chinese data is calculated at the 2-digit Chinese Industrial Classification (CIC) level
    - The data are available for 2004–2006 in the NBSC database. We calculate the aggregate rather than the median external finance dependence at the 2-digit industry level, because the median firm in the Chinese database often has no capital expenditure
    - In our sample, approximately 68.1% of firms have zero capital expenditure
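
A small numeric sketch shows why the aggregate is preferred to the median when most firms report zero capital expenditure. The figures are invented, and the ratio used here (capital expenditure over total assets) only stands in for the measure discussed above.

```python
from statistics import median

# Invented (capital expenditure, total assets) pairs for one industry;
# most firms report zero capital expenditure.
firms = [(0.0, 100.0), (0.0, 200.0), (0.0, 150.0), (30.0, 300.0), (20.0, 250.0)]

# Median of firm-level ratios: driven to zero by the zero-capex firms
median_ratio = median(capex / assets for capex, assets in firms)

# Aggregate ratio: sum numerator and denominator first, then divide
total_capex = sum(capex for capex, _ in firms)
total_assets = sum(assets for _, assets in firms)
aggregate_ratio = total_capex / total_assets if total_assets else None

print(median_ratio, aggregate_ratio)
```

The guard on `total_assets` plays the same role as `NULLIF(..., 0)` in the Athena queries: an empty denominator yields a missing value instead of an error.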

4. Apply Generally Accepted Accounting Principles: discard observations that violate any of the following criteria
   
    - (1) the key financial variables (such as total assets, net value of fixed assets, sales, gross value of industrial output) cannot be missing
    - (2) the number of employees hired by a firm must not be less than 10
    - (3) the total assets must be higher than the liquid assets
    - (4) the total assets must be larger than the total fixed assets
    - (5) the total assets must be larger than the net value of the fixed assets
    - (6) a firm’s identification number cannot be missing and must be unique
    - (7) the established time must be valid (e.g., the opening month cannot be later than December or earlier than January)
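
Criteria (1) to (5) can be expressed as a row-level filter. This is only a sketch: `net_fixed` is a hypothetical column name for the net value of fixed assets, the records are invented, and criteria (6) and (7) are left out because they need the identifier and setup-date fields.

```python
REQUIRED = ("toasset", "cuasset", "tofixed", "net_fixed", "sales", "output", "employ")


def passes_gaap_checks(firm: dict) -> bool:
    """Apply criteria (1)-(5) to one firm-year record."""
    # (1) key financial variables cannot be missing
    if any(firm.get(name) is None for name in REQUIRED):
        return False
    return (
        firm["employ"] >= 10                     # (2) at least 10 employees
        and firm["toasset"] > firm["cuasset"]    # (3) total > liquid assets
        and firm["toasset"] > firm["tofixed"]    # (4) total > total fixed assets
        and firm["toasset"] > firm["net_fixed"]  # (5) total > net fixed assets
    )


valid = {"toasset": 1000, "cuasset": 400, "tofixed": 500, "net_fixed": 450,
         "sales": 900, "output": 950, "employ": 25}
records = [
    valid,
    {**valid, "employ": 5},       # too few employees
    {**valid, "sales": None},     # missing key variable
    {**valid, "cuasset": 1200},   # liquid assets exceed total assets
]
kept = [firm for firm in records if passes_gaap_checks(firm)]
print(len(kept))  # 1
```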

![](https://cdn.corporatefinanceinstitute.com/assets/A-Balance-Sheet.png)

To satisfy the equation, we compute the left-hand side and the right-hand side. If the equation is not satisfied, we add the difference to either the left or the right part according to the following rules:

- if total asset (toasset) - (total liabilities (c98) + total equity (c99)) < 0, add the difference to total asset (left part)
- if total asset (toasset) - (total liabilities (c98) + total equity (c99)) > 0, add the difference to total liabilities and equity (right part)
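
The two rules can be written as a small function. This is a sketch of the allocation logic only, with a hypothetical function name; the query below applies the same idea to the reconstructed asset and liability totals.

```python
def balance(toasset: float, c98: float, c99: float) -> tuple[float, float]:
    """Return (left, right) after allocating the discrepancy."""
    right = c98 + c99
    diff = toasset - right
    if diff < 0:
        # Assets fall short: add the gap to the left-hand side
        return toasset + abs(diff), right
    if diff > 0:
        # Liabilities plus equity fall short: add the gap to the right-hand side
        return toasset, right + diff
    return toasset, right


print(balance(100.0, 60.0, 50.0))  # (110.0, 110.0)
print(balance(120.0, 60.0, 50.0))  # (120.0, 120.0)
```

After the adjustment both sides are equal by construction, so the remaining difference is always zero.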

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
query= """
WITH test AS (
  SELECT 
    *, 
    CASE WHEN LENGTH(cic) = 4 THEN substr(cic, 1, 2) ELSE concat(
      '0', 
      substr(cic, 1, 1)
    ) END AS indu_2, 
    c80 + c81 + c82 + c79 as current_asset, 
    c91 + c92 AS intangible, 
    tofixed - cudepre  AS tangible, 
    tofixed - cudepre + (c91 + c92) AS net_non_current, 
    (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) AS error, 
    c95 + c97 as total_liabilities, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) > 0 THEN (c95 + c97 + c99) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (c95 + c97 + c99) END AS total_right, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) < 0 THEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) END AS total_asset, 
    (c131 - c134) + cudepre as cashflow 
  FROM 
    firms_survey.asif_firms_prepared 
    INNER JOIN (
      SELECT 
        extra_code, 
        geocode4_corr, 
        province_en 
      FROM 
        chinese_lookup.china_city_code_normalised 
      GROUP BY 
        extra_code, 
        province_en, 
        geocode4_corr
    ) as no_dup_citycode ON asif_firms_prepared.citycode = no_dup_citycode.extra_code 
  WHERE 
    c95 > 0 -- current liabilities
    AND c97 > 0 -- long term liabilities
    AND c98 > 0 -- total liabilities
    AND c99 > 0 -- equity
    AND c80 + c81 + c82 + c79 > 0 
    AND tofixed > 0 
    AND output > 0 
    and employ > 0
) 
SELECT 
  * 
FROM 
  (
    WITH ratio AS (
      SELECT 
        year, 
        -- cic, 
        indu_2, 
        -- geocode4_corr, 
        -- province_en, 
        CAST(
          SUM(output) AS DECIMAL(16, 5)
        ) AS output, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) AS sales, 
        CAST(
          SUM(employ) AS DECIMAL(16, 5)
        ) AS employment, 
        CAST(
          SUM(captal) AS DECIMAL(16, 5)
        ) AS capital, 
        SUM(current_asset) AS current_asset, 
        SUM(tofixed) AS tofixed, 
        SUM(error) AS error, 
        SUM(total_liabilities) AS total_liabilities, 
        SUM(total_asset) AS total_asset, 
        SUM(total_right) AS total_right, 
        SUM(intangible) AS intangible, 
        SUM(tangible) AS tangible, 
        SUM(net_non_current) AS net_non_current, 
        SUM(cashflow) AS cashflow, 
        CAST(
          SUM(c80 + c81 + c82 + c79) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS current_ratio, 
        CAST(
          SUM(c80 + c81 + c82 + c79 - c80 - c81) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS quick_ratio, 
        CAST(
          SUM(c98) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liabilities_tot_asset, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS sales_tot_asset, 
        CAST(
          SUM(c84) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS investment_tot_asset, 
        CAST(
          SUM(rdfee) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS rd_tot_asset, 
        CAST(
          SUM(tangible) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) asset_tangibility_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_to_tangible, 
        -- update
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(sales) AS DECIMAL(16, 5)
          ), 
          0
        ) AS return_to_sale, 
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c125) AS DECIMAL(16, 5)
          ), 
          0
        ) AS coverage_ratio, 
        CAST(
          SUM(current_asset - c95) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liquidity 
      FROM 
        test 
      WHERE 
        year in (
          '2000','2001', '2002', '2003', '2004', '2005', 
          '2006', '2007'
        ) 
        AND total_asset > 0 
        AND tangible > 0 
      GROUP BY 
        --province_en, 
        --geocode4_corr, 
        -- cic,
        indu_2, 
        year 
      
    ) 
    SELECT 
      year, 
      indu_2, 
      --geocode4_corr, 
      --province_en, 
      output, 
      sales, 
      employment, 
      capital, 
      current_asset, 
      tofixed, 
      error, 
      total_liabilities, 
      total_asset, 
      total_right, 
      intangible, 
      tangible, 
      net_non_current, 
      cashflow, 
      current_ratio,
      LAG(current_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_current_ratio,
      quick_ratio, 
      LAG(quick_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_quick_ratio,
      liabilities_tot_asset, 
      LAG(liabilities_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_liabilities_tot_asset,
      sales_tot_asset,
      LAG(sales_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_sales_tot_asset,
      investment_tot_asset, 
      rd_tot_asset, 
      asset_tangibility_tot_asset, 
      cashflow_tot_asset,
      LAG(cashflow_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_tot_asset,
      cashflow_to_tangible, 
      LAG(cashflow_to_tangible, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_to_tangible,
      return_to_sale, 
      LAG(return_to_sale, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_return_to_sale,
      coverage_ratio, 
      liquidity 
    FROM 
      ratio
    LIMIT 
        10
  )
"""
output = s3.run_query(
                    query=query,
                    database=DatabaseName,
                    s3_output=s3_output_example,
    filename = 'example_1'
                )
output

In [None]:
query= """
WITH test AS (
  SELECT 
    *, 
    CASE WHEN LENGTH(cic) = 4 THEN substr(cic, 1, 2) ELSE concat(
      '0', 
      substr(cic, 1, 1)
    ) END AS indu_2, 
    c80 + c81 + c82 + c79 as current_asset, 
    c91 + c92 AS intangible, 
    tofixed - cudepre  AS tangible, 
    tofixed - cudepre + (c91 + c92) AS net_non_current, 
    (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) AS error, 
    c95 + c97 as total_liabilities, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) > 0 THEN (c95 + c97 + c99) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (c95 + c97 + c99) END AS total_right, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) < 0 THEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) END AS total_asset, 
    (c131 - c134) + cudepre as cashflow 
  FROM 
    firms_survey.asif_firms_prepared 
    INNER JOIN (
      SELECT 
        extra_code, 
        geocode4_corr, 
        province_en 
      FROM 
        chinese_lookup.china_city_code_normalised 
      GROUP BY 
        extra_code, 
        province_en, 
        geocode4_corr
    ) as no_dup_citycode ON asif_firms_prepared.citycode = no_dup_citycode.extra_code 
  WHERE 
    c95 > 0 -- current liabilities
    AND c97 > 0 -- long term liabilities
    AND c98 > 0 -- total liabilities
    AND c99 > 0 -- equity
    AND c80 + c81 + c82 + c79 > 0 
    AND tofixed > 0 
    AND output > 0 
    and employ > 0
) 
SELECT 
  * 
FROM 
  (
    WITH ratio AS (
      SELECT 
        year, 
        -- cic, 
        indu_2, 
        -- geocode4_corr, 
        -- province_en, 
        CAST(
          SUM(output) AS DECIMAL(16, 5)
        ) AS output, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) AS sales, 
        CAST(
          SUM(employ) AS DECIMAL(16, 5)
        ) AS employment, 
        CAST(
          SUM(captal) AS DECIMAL(16, 5)
        ) AS capital, 
        SUM(current_asset) AS current_asset, 
        SUM(tofixed) AS tofixed, 
        SUM(error) AS error, 
        SUM(total_liabilities) AS total_liabilities, 
        SUM(total_asset) AS total_asset, 
        SUM(total_right) AS total_right, 
        SUM(intangible) AS intangible, 
        SUM(tangible) AS tangible, 
        SUM(net_non_current) AS net_non_current, 
        SUM(cashflow) AS cashflow, 
        CAST(
          SUM(c80 + c81 + c82 + c79) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS current_ratio, 
        CAST(
          SUM(c80 + c81 + c82 + c79 - c80 - c81) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS quick_ratio, 
        CAST(
          SUM(c98) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liabilities_tot_asset, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS sales_tot_asset, 
        CAST(
          SUM(c84) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS investment_tot_asset, 
        CAST(
          SUM(rdfee) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS rd_tot_asset, 
        CAST(
          SUM(tangible) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) asset_tangibility_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_to_tangible, 
        -- update
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(sales) AS DECIMAL(16, 5)
          ), 
          0
        ) AS return_to_sale, 
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c125) AS DECIMAL(16, 5)
          ), 
          0
        ) AS coverage_ratio, 
        CAST(
          SUM(current_asset - c95) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liquidity 
      FROM 
        test 
      WHERE 
        year in (
          '2000','2001', '2002', '2003', '2004', '2005', 
          '2006', '2007'
        ) 
        AND total_asset > 0 
        AND tangible > 0 
      GROUP BY 
        --province_en, 
        --geocode4_corr, 
        -- cic,
        indu_2, 
        year 
      
    ) 
    SELECT 
      year, 
      indu_2, 
      --geocode4_corr, 
      --province_en, 
      output, 
      sales, 
      employment, 
      capital, 
      current_asset, 
      tofixed, 
      error, 
      total_liabilities, 
      total_asset, 
      total_right, 
      intangible, 
      tangible, 
      net_non_current, 
      cashflow, 
      current_ratio,
      LAG(current_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_current_ratio,
      quick_ratio, 
      LAG(quick_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_quick_ratio,
      liabilities_tot_asset, 
      LAG(liabilities_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_liabilities_tot_asset,
      sales_tot_asset,
      LAG(sales_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_sales_tot_asset,
      investment_tot_asset, 
      rd_tot_asset, 
      asset_tangibility_tot_asset, 
      cashflow_tot_asset,
      LAG(cashflow_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_tot_asset,
      cashflow_to_tangible, 
      LAG(cashflow_to_tangible, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_to_tangible,
      return_to_sale, 
      LAG(return_to_sale, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_return_to_sale,
      coverage_ratio, 
      liquidity 
    FROM 
      ratio
  )
"""
output = s3.run_query(
    query=query,
    database=DatabaseName,
    s3_output=s3_output_example,
    filename='example_3'
)
(
    output
    #.style
    #.format('{0:,.2f}')
)
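
The aggregation pattern in the query above (a ratio of sums guarded by `NULLIF`, then a one-period `LAG` within each industry) can be sketched in plain Python. This is only an illustration on made-up numbers; `safe_ratio` and `add_lag` are hypothetical helper names, not part of the notebook or of its `s3` helper.

```python
from collections import defaultdict


def safe_ratio(numerator, denominator):
    """Mirror `x / NULLIF(y, 0)`: return None instead of dividing by zero."""
    if numerator is None or denominator in (0, None):
        return None
    return numerator / denominator


def add_lag(rows, group_key, order_key, value_key):
    """Mirror `LAG(value, 1) OVER (PARTITION BY group ORDER BY order)`."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row[group_key]].append(row)
    out = []
    for group in by_group.values():
        previous = None
        for row in sorted(group, key=lambda r: r[order_key]):
            out.append({**row, "lag_" + value_key: previous})
            previous = row[value_key]
    return out


# Made-up industry-year totals (not ASIF data).
totals = [
    {"indu_2": "13", "year": "2001", "current_asset": 300.0, "c95": 150.0},
    {"indu_2": "13", "year": "2000", "current_asset": 200.0, "c95": 100.0},
    {"indu_2": "26", "year": "2000", "current_asset": 50.0, "c95": 0.0},
]
ratios = [
    {**t, "current_ratio": safe_ratio(t["current_asset"], t["c95"])}
    for t in totals
]
lagged = add_lag(ratios, "indu_2", "year", "current_ratio")
```

Like `LAG`, the sketch takes the previous available row within the industry, which is not necessarily the previous calendar year if a year is missing.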

The table below shows the rank of the variables, grouped into three groups:

1. Current asset
    - pct_receivable_curasset
    - pct_non_cash_over_curasset
2. Current Liabilities
    - working_capital_i
    - current_ratio_i
    - quick_ratio_i
    - cash_ratio_i
3. Total asset
    - liabilities_assets_i
    - return_on_asset_i
    - sales_assets_i
    - asset_tangibility_i
    - account_paybable_to_asset_i

In [None]:
list(output.columns)

In [None]:
(
    output
    .loc[lambda x: x['year'].isin(['2005'])]
    .reindex(columns = 
    ['indu_2',
     'cashflow',
 'current_ratio',
 'quick_ratio',
 'liabilities_tot_asset',
 'sales_tot_asset',
 'investment_tot_asset',
 'rd_tot_asset',
 'asset_tangibility_tot_asset',
 'cashflow_tot_asset',
 'cashflow_to_tangible',
 'return_to_sale',
 'coverage_ratio',
 'liquidity'
    ]
     )
    .sort_values(by = ['current_ratio'])
    .set_index(['indu_2'])
    .assign(
        rank_cr = lambda x: x['current_ratio'].rank().astype('int64'),
        rank_qr = lambda x: x['quick_ratio'].rank().astype('int64'),
        
        rank_l = lambda x: x['liabilities_tot_asset'].rank().astype('int64'),
        rank_s = lambda x: x['sales_tot_asset'].rank().astype('int64'),
        rank_att = lambda x: x['asset_tangibility_tot_asset'].rank().astype('int64'),
        rank_cta = lambda x: x['cashflow_tot_asset'].rank().astype('int64'),
        
        rank_rs = lambda x: x['return_to_sale'].rank().astype('int64'),
        rank_cra = lambda x: x['coverage_ratio'].rank().astype('int64'),
        rank_li= lambda x: x['liquidity'].rank().astype('int64')
    )
    .style
    #.background_gradient(cmap=sns.light_palette("green", as_cmap=True), subset = ["rank_rc",'rank_pct_non_cash'])
    #.background_gradient(cmap=sns.light_palette("blue", as_cmap=True), subset = ["rank_w",'rank_c','rank_q', 'rank_cash'])
    #.background_gradient(cmap=sns.light_palette("orange", as_cmap=True), subset = ["rank_li",'rank_re', 'rank_sa', 'rank_at', 'rank_ap'])
)

In [None]:
fig = px.parallel_coordinates(
    (output
    .loc[lambda x: x['year'].isin(['2005'])][['indu_2',
            'cashflow',
            'current_ratio',
            'quick_ratio',
            'liabilities_tot_asset',
            'sales_tot_asset',
            'investment_tot_asset',
            'rd_tot_asset',
            'asset_tangibility_tot_asset',
            'cashflow_tot_asset',
            'cashflow_to_tangible',
            'return_to_sale',
            'coverage_ratio',
            'liquidity']]
    ).rank(),
    labels={
        "cashflow": "cashflow",
        "current_ratio": "current_ratio",
        "quick_ratio": "quick_ratio",
        "liabilities_tot_asset": "liabilities_tot_asset",
        "sales_tot_asset": "sales_tot_asset",
        "investment_tot_asset": "investment_tot_asset",
        "rd_tot_asset": "rd_tot_asset",
        "asset_tangibility_tot_asset": "asset_tangibility_tot_asset",
        "cashflow_tot_asset": "cashflow_tot_asset",
        "cashflow_to_tangible": "cashflow_to_tangible",
        "return_to_sale": "return_to_sale",
        "coverage_ratio": "coverage_ratio",
        "liquidity": "liquidity"
    },
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=2,
)
fig

In [None]:
sns.set_theme(style="white")

# Compute the correlation matrix
corr = (
    output
    .loc[lambda x: x['year'].isin(['2005'])]
    [['indu_2',
            'cashflow',
            'current_ratio',
            'quick_ratio',
            'liabilities_tot_asset',
            'sales_tot_asset',
            'investment_tot_asset',
            'rd_tot_asset',
            'asset_tangibility_tot_asset',
            'cashflow_tot_asset',
            'cashflow_to_tangible',
            'return_to_sale',
            'coverage_ratio',
            'liquidity']]
).set_index('indu_2').rank().corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

# Table `asif_city_industry_financial_ratio`


Since the table to create contains missing values, please add the following at the top of the query:

```
CREATE TABLE database.table_name WITH (format = 'PARQUET') AS
```
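
As a minimal sketch of this pattern, the CTAS header can be prepended to any `SELECT` statement with a small helper. `as_ctas` is a hypothetical name used for illustration only; the notebook itself writes the header directly inside the query string.

```python
def as_ctas(select_sql, database, table, fmt="PARQUET"):
    """Wrap a SELECT statement in an Athena CREATE TABLE AS header."""
    header = "CREATE TABLE {0}.{1} WITH (format = '{2}') AS".format(
        database, table, fmt
    )
    return header + "\n" + select_sql.strip()


# Trivial SELECT used as a placeholder for the real query.
query = as_ctas(
    "SELECT 1 AS x",
    database="firms_survey",
    table="asif_industry_financial_ratio_industry",
)
print(query)
```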


Choose a location in S3 to save the output. It is recommended to save it in the `datalake-datascience` bucket. Locate an appropriate folder in the bucket, and make sure all outputs have the same format.

First, we need to delete the table (if it exists)

In [None]:
table_name = 'asif_industry_financial_ratio_industry'
s3_output = 'DATA/ECON/FIRM_SURVEY/ASIF_CHINA/TRANSFORMED/FINANCIAL_RATIO/INDUSTRY'

In [None]:
try:
    response = glue.delete_table(
        database=DatabaseName,
        table=table_name
    )
    print(response)
except Exception as e:
    print(e)

Clean up the folder containing the previous output files. Be careful, this erases all files inside the folder.

In [None]:
s3.remove_all_bucket(path_remove = s3_output)

In [None]:
%%time
query = """
CREATE TABLE {0}.{1} WITH (format = 'PARQUET') AS

WITH test AS (
  SELECT 
    *, 
    CASE WHEN LENGTH(cic) = 4 THEN substr(cic, 1, 2) ELSE concat(
      '0', 
      substr(cic, 1, 1)
    ) END AS indu_2, 
    c80 + c81 + c82 + c79 as current_asset, 
    c91 + c92 AS intangible, 
    tofixed - cudepre  AS tangible, 
    tofixed - cudepre + (c91 + c92) AS net_non_current, 
    (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) AS error, 
    c95 + c97 as total_liabilities, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) > 0 THEN (c95 + c97 + c99) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (c95 + c97 + c99) END AS total_right, 
    CASE WHEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) - (c95 + c97 + c99) < 0 THEN (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) + ABS(
      (
        c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
      ) - (c95 + c97 + c99)
    ) ELSE (
      c80 + c81 + c82 + c79 + tofixed - cudepre + (c91 + c92)
    ) END AS total_asset, 
    (c131 - c134) + cudepre as cashflow 
  FROM 
    firms_survey.asif_firms_prepared 
    INNER JOIN (
      SELECT 
        extra_code, 
        geocode4_corr, 
        province_en 
      FROM 
        chinese_lookup.china_city_code_normalised 
      GROUP BY 
        extra_code, 
        province_en, 
        geocode4_corr
    ) as no_dup_citycode ON asif_firms_prepared.citycode = no_dup_citycode.extra_code 
  WHERE 
    c95 > 0 -- current liabilities
    AND c97 > 0 -- long term liabilities
    AND c98 > 0 -- total liabilities
    AND c99 > 0 -- equity
    AND c80 + c81 + c82 + c79 > 0 
    AND tofixed > 0 
    AND output > 0 
    and employ > 0
) 
SELECT 
  * 
FROM 
  (
    WITH ratio AS (
      SELECT 
        year, 
        -- cic, 
        indu_2, 
        -- geocode4_corr, 
        -- province_en, 
        CAST(
          SUM(output) AS DECIMAL(16, 5)
        ) AS output, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) AS sales, 
        CAST(
          SUM(employ) AS DECIMAL(16, 5)
        ) AS employment, 
        CAST(
          SUM(captal) AS DECIMAL(16, 5)
        ) AS capital, 
        SUM(current_asset) AS current_asset, 
        SUM(tofixed) AS tofixed, 
        SUM(error) AS error, 
        SUM(total_liabilities) AS total_liabilities, 
        SUM(total_asset) AS total_asset, 
        SUM(total_right) AS total_right, 
        SUM(intangible) AS intangible, 
        SUM(tangible) AS tangible, 
        SUM(net_non_current) AS net_non_current, 
        SUM(cashflow) AS cashflow, 
        CAST(
          SUM(c80 + c81 + c82 + c79) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS current_ratio, 
        CAST(
          SUM(c80 + c81 + c82 + c79 - c80 - c81) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c95) AS DECIMAL(16, 5)
          ), 
          0
        ) AS quick_ratio, 
        CAST(
          SUM(c98) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liabilities_tot_asset, 
        CAST(
          SUM(sales) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS sales_tot_asset, 
        CAST(
          SUM(c84) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS investment_tot_asset, 
        CAST(
          SUM(rdfee) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS rd_tot_asset, 
        CAST(
          SUM(tangible) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS asset_tangibility_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(total_asset) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_tot_asset, 
        CAST(
          SUM(cashflow) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS cashflow_to_tangible, 
        -- update
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(sales) AS DECIMAL(16, 5)
          ), 
          0
        ) AS return_to_sale, 
        CAST(
          SUM(c131) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(c125) AS DECIMAL(16, 5)
          ), 
          0
        ) AS coverage_ratio, 
        CAST(
          SUM(current_asset - c95) AS DECIMAL(16, 5)
        ) / NULLIF(
          CAST(
            SUM(tangible) AS DECIMAL(16, 5)
          ), 
          0
        ) AS liquidity 
      FROM 
        test 
      WHERE 
        year in (
          '2000','2001', '2002', '2003', '2004', '2005', 
          '2006', '2007'
        ) 
        AND total_asset > 0 
        AND tangible > 0 
      GROUP BY 
        --province_en, 
        --geocode4_corr, 
        -- cic,
        indu_2, 
        year 
      
    ) 
    SELECT 
      year, 
      indu_2, 
      --geocode4_corr, 
      --province_en, 
      output, 
      sales, 
      employment, 
      capital, 
      current_asset, 
      tofixed, 
      error, 
      total_liabilities, 
      total_asset, 
      total_right, 
      intangible, 
      tangible, 
      net_non_current, 
      cashflow, 
      current_ratio,
      LAG(current_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_current_ratio,
      quick_ratio, 
      LAG(quick_ratio, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_quick_ratio,
      liabilities_tot_asset, 
      LAG(liabilities_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_liabilities_tot_asset,
      sales_tot_asset,
      LAG(sales_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_sales_tot_asset,
      investment_tot_asset, 
      rd_tot_asset, 
      asset_tangibility_tot_asset, 
      cashflow_tot_asset,
      LAG(cashflow_tot_asset, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_tot_asset,
      cashflow_to_tangible, 
      LAG(cashflow_to_tangible, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_cashflow_to_tangible,
      return_to_sale, 
      LAG(return_to_sale, 1) OVER (
        PARTITION BY indu_2 
        ORDER BY 
          year
      ) as lag_return_to_sale,
      coverage_ratio, 
      liquidity 
    FROM 
      ratio
  )
""".format(DatabaseName, table_name)
output = s3.run_query(
    query=query,
    database=DatabaseName,
    s3_output=s3_output,
)
output
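
The `error`, `total_right` and `total_asset` expressions in the `test` CTE above amount to topping up the smaller side of the balance sheet so that both sides match. A sketch in Python with made-up numbers follows; `adjust_balance_sheet` is a hypothetical helper written for illustration, not part of the pipeline.

```python
def adjust_balance_sheet(asset_side, right_side):
    """Return (error, total_asset, total_right) following the SQL CASE logic.

    asset_side: current assets + net fixed assets + intangibles
    right_side: c95 + c97 + c99 (liabilities plus equity)
    """
    error = asset_side - right_side
    # The smaller side is increased by |error|, so both sides end up equal.
    total_right = right_side + abs(error) if error > 0 else right_side
    total_asset = asset_side + abs(error) if error < 0 else asset_side
    return error, total_asset, total_right


print(adjust_balance_sheet(120, 100))  # asset side larger
print(adjust_balance_sheet(90, 100))   # right side larger
```

Under this logic both adjusted totals equal the larger of the two raw sides, which is why `error` is kept as a separate column.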

In [None]:
query_ = """
SELECT COUNT(*) AS CNT
FROM {}.{} 
""".format(DatabaseName, table_name)
output = s3.run_query(
    query=query_,
    database=DatabaseName,
    s3_output=s3_output_example,
    filename='count_{}'.format(table_name)
)
output

In [None]:
query_ = """
SELECT len, COUNT(len) as CNT
FROM (
SELECT length(indu_2) AS len
FROM {}.{} 
)
GROUP BY len
ORDER BY CNT
""".format(DatabaseName, table_name)
output = s3.run_query(
    query=query_,
    database=DatabaseName,
    s3_output=s3_output_example,
    filename='example_1'
)
output
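
The length check above should return a single row with `len = 2`, because the `CASE` expression that builds `indu_2` always yields two characters. The same rule can be mirrored in Python as a sketch; `two_digit_industry` is a hypothetical name, and the sample codes are made up.

```python
def two_digit_industry(cic):
    """Mirror the SQL CASE that builds `indu_2`.

    4-character codes keep their first two characters; any other length
    is treated as a code whose leading zero was dropped.
    """
    if len(cic) == 4:
        return cic[:2]
    return "0" + cic[:1]


for code in ["1310", "610"]:
    print(code, "->", two_digit_industry(code))
```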

# Validate query

This step is mandatory to validate the query in the ETL. If you are not sure about the quality of the query, go to the next step.

To validate the query, please fill in the JSON below. Don't forget to change the schema so that the crawler can use it.

1. Change the schema if needed. It is highly recommended to add comments to the fields
2. Add a partition key:
    - Indicate whether the table has grouping columns so that the parser can count duplicates
3. Provide a description -> detail the steps 

1. Change the schema

Bear in mind that CSV SerDe (OpenCSVSerDe) does not support empty fields in columns defined as a numeric data type. All columns with missing values should be saved as string. 

In [None]:
glue.get_table_information(
    database=DatabaseName,
    table=table_name)['Table']['StorageDescriptor']['Columns']

In [None]:
schema = [{'Name': 'year', 'Type': 'string', 'Comment': ''},
 {'Name': 'indu_2', 'Type': 'string', 'Comment': '2 digits industry name'},
 {'Name': 'output', 'Type': 'decimal(16,5)', 'Comment': 'Output'},
 {'Name': 'sales', 'Type': 'decimal(16,5)', 'Comment': ''},
 {'Name': 'employment', 'Type': 'decimal(16,5)', 'Comment': 'employment'},
 {'Name': 'capital', 'Type': 'decimal(16,5)', 'Comment': 'capital'},
 {'Name': 'current_asset', 'Type': 'bigint', 'Comment': 'current asset'},
 {'Name': 'tofixed', 'Type': 'bigint', 'Comment': 'total fixed asset'},
 {'Name': 'error', 'Type': 'bigint', 'Comment': 'difference between cuasset+tofixed and total liabilities +equity. Error makes the balance sheet equation right'},
 {'Name': 'total_liabilities', 'Type': 'bigint', 'Comment': 'total adjusted liabilities'},
 {'Name': 'total_asset', 'Type': 'bigint', 'Comment': 'total adjusted asset'},
 {'Name': 'total_right', 'Type': 'bigint', 'Comment': 'Adjusted right part balance sheet'},
 {'Name': 'intangible', 'Type': 'bigint', 'Comment': 'intangible asset measured as the sum of intangibles variables'},
 {'Name': 'tangible', 'Type': 'bigint', 'Comment': 'tangible asset measured as the difference between total fixed asset minus intangible asset'},
 {'Name': 'net_non_current', 'Type': 'bigint', 'Comment': 'total net non current asset'},
 {'Name': 'cashflow', 'Type': 'bigint', 'Comment': 'cash flow'},
 {'Name': 'current_ratio', 'Type': 'decimal(21,5)', 'Comment': 'current ratio cuasset/流动负债合计 (c95)'},
 {'Name': 'lag_current_ratio', 'Type': 'decimal(21,5)', 'Comment': 'lag value of current ratio'},
 {'Name': 'quick_ratio', 'Type': 'decimal(21,5)', 'Comment': 'quick ratio (cuasset-存货 (c81) ) / 流动负债合计 (c95)'},
 {'Name': 'lag_quick_ratio', 'Type': 'decimal(21,5)', 'Comment': 'lag value of quick ratio'},
 {'Name': 'liabilities_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'liabilities to total asset'},
 {'Name': 'lag_liabilities_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'lag liabilities to total asset'},
 {'Name': 'sales_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'sales to total asset'},
 {'Name': 'lag_sales_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'lag sales to total asset'},
 {'Name': 'investment_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'investment to total asset'},
 {'Name': 'rd_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'rd to total asset'},
 {'Name': 'asset_tangibility_tot_asset',
  'Type': 'decimal(21,5)',
  'Comment': 'asset tangibility to total asset'},
 {'Name': 'cashflow_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'cashflow to total asset'},
 {'Name': 'lag_cashflow_tot_asset', 'Type': 'decimal(21,5)', 'Comment': 'lag cashflow to total asset'},
 {'Name': 'cashflow_to_tangible', 'Type': 'decimal(21,5)', 'Comment': 'cashflow to tangible asset'},
 {'Name': 'lag_cashflow_to_tangible', 'Type': 'decimal(21,5)', 'Comment': 'lag cashflow to tangible asset'},
 {'Name': 'return_to_sale', 'Type': 'decimal(21,5)', 'Comment': 'return to sale '},
 {'Name': 'lag_return_to_sale', 'Type': 'decimal(21,5)', 'Comment': 'lag value of return to sale'},
 {'Name': 'coverage_ratio', 'Type': 'decimal(21,5)', 'Comment': 'net income(c131) /total interest payments'},
 {'Name': 'liquidity', 'Type': 'decimal(21,5)', 'Comment': 'current assets-current liabilities/total assets'}]
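
Because comments are recommended for every field, a quick check can list the fields that still lack one before the schema is pushed to Glue. The helper below is a sketch; `missing_comments` is a hypothetical name and `toy_schema` is a shortened stand-in for the real `schema` list.

```python
def missing_comments(schema):
    """Return the names of schema fields whose comment is empty."""
    required = {"Name", "Type", "Comment"}
    names = []
    for field in schema:
        absent = required - set(field)
        if absent:
            raise KeyError(
                "field {} lacks keys {}".format(field, sorted(absent))
            )
        if not field["Comment"].strip():
            names.append(field["Name"])
    return names


toy_schema = [
    {"Name": "year", "Type": "string", "Comment": ""},
    {"Name": "indu_2", "Type": "string", "Comment": "2 digits industry name"},
]
print(missing_comments(toy_schema))
```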

2. Provide a description

In [None]:
description = """
Compute the financial ratio by industry
"""

3. Provide metadata

- DatabaseName:
- TablePrefix:
- input: 
- filename: name of the notebook or Python script (to be specified)
- Task ID: from Coda
- index_final_table: a list indicating whether the current table is used to prepare the final table(s). If more than one, pass the index. Start at 0
- if_final: a boolean indicating whether the current table is the final table -> the one used to train the model

In [None]:
import re

In [None]:
name_json = 'parameters_ETL_Financial_dependency_pollution.json'
path_json = os.path.join(str(Path(path).parent.parent), 'utils',name_json)

In [None]:
with open(path_json) as json_file:
    parameters = json.load(json_file)

In [None]:
partition_keys = ["indu_2"]
notebookname =  "00_asif_financial_ratio.ipynb"
index_final_table = []
if_final = 'False'

Add Github URL

In [None]:
github_url = os.path.join(
    "https://github.com/",
    parameters['GLOBAL']['GITHUB']['owner'],
    parameters['GLOBAL']['GITHUB']['repo_name'],
    "blob/master",
    re.sub(parameters['GLOBAL']['GITHUB']['repo_name'],
           '', re.sub(
               r".*(?={})".format(parameters['GLOBAL']['GITHUB']['repo_name'])
               , '', path))[1:],
    # Escape the dot and anchor at the end so only the extension is replaced
    re.sub(r'\.ipynb$', '.md', notebookname)
)
github_url

Grab the input name from query

In [None]:
list_input = []
tables = glue.get_tables(full_output = False)
regex_matches = re.findall(r'(?=\.).*?(?=\s)|(?=\.\").*?(?=\")', query)
for i in regex_matches:
    cleaning = i.lstrip().rstrip().replace('.', '').replace('"', '')
    if cleaning in tables and cleaning != table_name:
        list_input.append(cleaning)
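
To see what the regular expression above captures, here is a self-contained sketch on a toy query. It reuses the same pattern; `find_inputs` is a hypothetical wrapper, and `known` stands in for the table names returned by `glue.get_tables`, whose exact output format is an assumption here.

```python
import re

# Same pattern as the notebook cell: text from a dot up to the next whitespace.
PATTERN = r'(?=\.).*?(?=\s)|(?=\.\").*?(?=\")'


def find_inputs(query, known_tables, table_name):
    """Return known tables referenced as `database.table` in the query."""
    inputs = []
    for match in re.findall(PATTERN, query):
        candidate = match.strip().replace(".", "").replace('"', "")
        if candidate in known_tables and candidate != table_name:
            inputs.append(candidate)
    return inputs


toy_query = """
SELECT *
FROM firms_survey.asif_firms_prepared
INNER JOIN chinese_lookup.china_city_code_normalised
ON citycode = extra_code
"""
known = {"asif_firms_prepared", "china_city_code_normalised", "other_table"}
print(find_inputs(toy_query, known, "asif_industry_financial_ratio_industry"))
```

The pattern relies on each table name being followed by whitespace, so a table name placed at the very end of the query string without a trailing newline would be missed.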

In [None]:
json_etl = {
    'description': description,
    'query': query,
    'schema': schema,
    'partition_keys': partition_keys,
    'metadata': {
        'DatabaseName': DatabaseName,
        'TableName': table_name,
        'input': list_input,
        'target_S3URI': os.path.join('s3://', bucket, s3_output),
        'from_athena': 'True',
        'filename': notebookname,
        'index_final_table' : index_final_table,
        'if_final': if_final,
        'github_url':github_url
    }
}
json_etl['metadata']

Remove the table name from the current file (if it exists)

In [None]:
index_to_remove = next(
                (
                    index
                    for (index, d) in enumerate(parameters['TABLES']['TRANSFORMATION']['STEPS'])
                    if d['metadata']['TableName'] == table_name
                ),
                None,
            )
if index_to_remove is not None:
    parameters['TABLES']['TRANSFORMATION']['STEPS'].pop(index_to_remove)

In [None]:
parameters['TABLES']['TRANSFORMATION']['STEPS'].append(json_etl)
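
The remove-then-append sequence above keeps the parameter file idempotent: rerunning the notebook replaces the step instead of duplicating it. The same idea can be sketched as a single function; `upsert_step` is a hypothetical name, and unlike the `next(...)` lookup it drops every step with a matching `TableName`, not only the first.

```python
def upsert_step(steps, new_step):
    """Replace the step with the same TableName, or append it if absent."""
    name = new_step["metadata"]["TableName"]
    kept = [s for s in steps if s["metadata"]["TableName"] != name]
    kept.append(new_step)
    return kept


# Toy steps, not the real parameter file.
steps = [
    {"metadata": {"TableName": "table_a"}, "description": "old"},
    {"metadata": {"TableName": "table_b"}, "description": "other"},
]
steps = upsert_step(
    steps, {"metadata": {"TableName": "table_a"}, "description": "new"}
)
print([s["metadata"]["TableName"] for s in steps])
```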

In [None]:
print("Currently, the ETL has {} tables".format(len(parameters['TABLES']['TRANSFORMATION']['STEPS'])))

Save JSON

In [None]:
with open(path_json, "w") as json_file:
    json.dump(parameters, json_file)

# Create or update the data catalog

The query output is saved in S3 (bucket `datalake-datascience`), but the table is not yet available in the Data Catalog. Use the function `create_table_glue` to generate the table and update the catalog.

A few parameters are required:

- name_crawler: Name of the crawler
- Role: Role that temporarily grants access to the service
- DatabaseName: Name of the database to create the table
- TablePrefix: Prefix of the table. Full name of the table will be `TablePrefix` + folder name

To update the schema, please use the following structure

```
schema = [
    {
        "Name": "VAR1",
        "Type": "",
        "Comment": ""
    },
    {
        "Name": "VAR2",
        "Type": "",
        "Comment": ""
    }
]
```

In [None]:
glue.update_schema_table(
    database = DatabaseName,
    table = table_name,
    schema= schema)

## Check Duplicates

One of the most important steps when creating a table is to check whether it contains duplicates. The cell below checks that the table generated above is free of duplicates. The code uses the JSON file to create the query that is run in Athena. 

You are required to define the group(s) that Athena will use to count the duplicates. For instance, if your table is grouped by COL1 and COL2 (which need to be string or varchar), pass the list ['COL1', 'COL2'] 
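
The actual duplicate query lives in the JSON file under `ANALYSIS` / `COUNT_DUPLICATES` and is not shown here. As a local illustration of the idea only, the sketch below counts key combinations that appear more than once; `count_duplicates` is a hypothetical helper and the rows are made up.

```python
from collections import Counter


def count_duplicates(rows, keys):
    """Return {group: count} for key combinations appearing more than once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {group: n for group, n in counts.items() if n > 1}


rows = [
    {"indu_2": "13", "year": "2000"},
    {"indu_2": "13", "year": "2000"},  # duplicated industry-year
    {"indu_2": "13", "year": "2001"},
]
print(count_duplicates(rows, ["indu_2", "year"]))
```

Since the table built above is grouped by `indu_2` and `year`, both keys are needed in the check; grouping by `indu_2` alone would flag every industry observed in several years.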

In [None]:
#partition_keys = ["geocode4_corr", "indu_2", "year"]
#with open(os.path.join(str(Path(path).parent), 'parameters_ETL_Financial_dependency_pollution.json')) as json_file:
#    parameters = json.load(json_file)

In [None]:
### COUNT DUPLICATES
#if len(partition_keys) > 0:
#    groups = ' , '.join(partition_keys)
#    query_duplicates = parameters["ANALYSIS"]['COUNT_DUPLICATES']['query'].format(
#                                DatabaseName,table_name,groups
#                                )
#    dup = s3.run_query(
#                                query=query_duplicates,
#                                database=DatabaseName,
#                                s3_output="SQL_OUTPUT_ATHENA",
#                                filename="duplicates_{}".format(table_name))
#    display(dup)


# Analytics

In this part, we provide basic summary statistics. Since we have created the tables, we can parse the schema in Glue and use our JSON file to automatically generate the analysis.

The cells below execute the job in the key `ANALYSIS`. You need to change the `primary_key` and `secondary_key`.

For a full analysis of the table, please use the following Lambda function. Be patient, it can take between 5 and 30 minutes. The time varies according to the number of columns in your dataset.

Use the function as follows:

- `output_prefix`:  s3://datalake-datascience/ANALYTICS/OUTPUT/TABLE_NAME/
- `region`: region where the table is stored
- `bucket`: Name of the bucket
- `DatabaseName`: Name of the database
- `table_name`: Name of the table
- `group`: variables name to group to count the duplicates
- `keys`: primary and secondary grouping variables (only one variable each for now)
    - format: 'A,B'
- `proba`: Chi-square analysis probability
- `y_var`: Continuous target variables

Check the job processing in Sagemaker: https://eu-west-3.console.aws.amazon.com/sagemaker/home?region=eu-west-3#/processing-jobs

The notebook is available: https://s3.console.aws.amazon.com/s3/buckets/datalake-datascience?region=eu-west-3&prefix=ANALYTICS/OUTPUT/&showversions=false

Please, download the notebook on your local machine, and convert it to HTML:

```
cd "/Users/thomas/Downloads/Notebook"
aws s3 cp s3://datalake-datascience/ANALYTICS/OUTPUT/asif_unzip_data_csv/Template_analysis_from_lambda-2020-11-22-08-12-20.ipynb .

## convert to HTML without code
jupyter nbconvert --no-input --to html Template_analysis_from_lambda-2020-11-22-08-12-20.ipynb
## convert to HTML with code
jupyter nbconvert --to html Template_analysis_from_lambda-2020-11-22-08-12-20.ipynb
```

Then upload the HTML to: https://s3.console.aws.amazon.com/s3/buckets/datalake-datascience?region=eu-west-3&prefix=ANALYTICS/HTML_OUTPUT/

Add a new folder with the table name in upper case

In [None]:
#import boto3

#key, secret_ = con.load_credential()
#client_lambda = boto3.client(
#    'lambda',
#    aws_access_key_id=key,
#    aws_secret_access_key=secret_,
#    region_name = region)

In [None]:
#primary_key = 'year'
#secondary_key = 'indu_2'
#y_var = 'working_capital_cit'

In [None]:
#payload = {
#    "input_path": "s3://datalake-datascience/ANALYTICS/TEMPLATE_NOTEBOOKS/template_analysis_from_lambda.ipynb",
#    "output_prefix": "s3://datalake-datascience/ANALYTICS/OUTPUT/{}/".format(table_name.upper()),
#    "parameters": {
#        "region": "{}".format(region),
#        "bucket": "{}".format(bucket),
#        "DatabaseName": "{}".format(DatabaseName),
#        "table_name": "{}".format(table_name),
#        "group": "{}".format(','.join(partition_keys)),
#        "keys": "{},{}".format(primary_key,secondary_key),
#        "y_var": "{}".format(y_var),
#        "threshold":0.5
#    },
#}
#payload

In [None]:
#response = client_lambda.invoke(
#    FunctionName='RunNotebook',
#    InvocationType='RequestResponse',
#    LogType='Tail',
#    Payload=json.dumps(payload),
#)
#response

# Generate report

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os, time, shutil, urllib, ipykernel, json
from pathlib import Path
from notebook import notebookapp
import sys
sys.path.append(os.path.join(parent_path, 'utils'))
import make_toc
import create_schema

In [None]:
def create_report(extension = "html", keep_code = False, notebookname = None):
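    """
    Create a report from the current notebook and save it in the
    `Reports` folder under the current working directory.

    1. Extract the current notebook name
    2. Convert the notebook
    3. Move the newly created report

    Args:
    extension: string. Can be "html", "pdf", "md"
    keep_code: boolean. If False, code cells are hidden in the report
    notebookname: string. Used when the name cannot be retrieved
        from the running notebook server
    """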
    
    ### Get notebook name
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[0].split('.')[0]

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            notebookname = sessions[0]['name']
        except Exception:
            # Keep the notebookname passed as argument
            pass
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    if keep_code:
        os.system('jupyter nbconvert --to {} {}'.format(
    extension,notebookname))
    else:
        os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [None]:
create_report(extension = "html", keep_code = True, notebookname =notebookname)

Create or update ETL

In [None]:
create_schema.create_schema(path_json, path_save_image = os.path.join(parent_path, 'utils'))

In [None]:
### Update TOC in Github
for p in [parent_path,
          str(Path(path).parent),
          os.path.join(str(Path(path).parent), "00_download_data_from"),
          os.path.join(str(Path(path).parent.parent), "02_data_analysis"),
          os.path.join(str(Path(path).parent.parent), "02_data_analysis", "00_statistical_exploration"),
          os.path.join(str(Path(path).parent.parent), "02_data_analysis", "01_model_estimation"),
         ]:
    try:
        os.remove(os.path.join(p, 'README.md'))
    except OSError:
        # No README to remove in this folder
        pass
    path_parameter = os.path.join(parent_path,'utils', name_json)
    md_lines =  make_toc.create_index(cwd = p, path_parameter = path_parameter)
    md_out_fn = os.path.join(p,'README.md')
    
    if p == parent_path:
    
        make_toc.replace_index(md_out_fn, md_lines, Header = os.path.basename(p).replace('_', ' '), add_description = True, path_parameter = path_parameter)
    else:
        make_toc.replace_index(md_out_fn, md_lines, Header = os.path.basename(p).replace('_', ' '), add_description = False)