## Introduction
Some columns are the sum or the difference of other columns. We would like to verify this hypothesis.
The parents studied are:
*     Total des produits d’exploitation
*     Total des charges d’exploitation
*     Total des produits financiers
*     Total des charges financières
*     Total des produits exceptionnels
*     Total des charges exceptionnelles

For instance, we would like to check whether 'Total des charges financières' is equal to the sum of:
* 'Dotations financières sur amortissements et provisions',
* 'Intérêts et charges assimilées',
* 'Différences négatives de change',
* 'Charges nettes sur cessions de valeurs mobilières de placement'

![](https://i.ibb.co/f2n18qV/Capture-d-cran-du-2021-07-17-16-46-37.png)
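
The hypothesis can be illustrated on a toy example before applying it to the real data. This is only a sketch: the column names (`child_a`, `parent_total`, ...) and the values are invented for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (invented values): one parent total and its four children.
toy = pd.DataFrame({
    "child_a": [10.0, 5.0, 0.0],
    "child_b": [20.0, 5.0, 1.0],
    "child_c": [0.0, 0.0, 2.0],
    "child_d": [5.0, 0.0, 3.0],
    "parent_total": [35.0, 10.0, 7.0],
})
children = ["child_a", "child_b", "child_c", "child_d"]

# Row-wise check: does the parent match the sum of its children?
toy["sum_children"] = toy[children].sum(axis=1)
toy["is_consistent"] = np.isclose(toy["parent_total"], toy["sum_children"])
print(toy[["parent_total", "sum_children", "is_consistent"]])
```

`np.isclose` is used instead of `==` so that small floating-point differences are not reported as inconsistencies.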

In [None]:
# Import libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt # graph
import seaborn as sns # advanced graph
import math

In [None]:
# Import data
data = pd.read_csv('/kaggle/input/financial-data-of-french-compagnies/data_kaggle.csv', nrows=10000)
data = data.drop('Unnamed: 0', axis=1)
col_info = pd.read_csv('/kaggle/input/financial-data-of-french-compagnies/Profit and loss - Onthology.csv')

In [None]:
# Function
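def obtention_log10(val):
  """ Signed order of magnitude of a number: 100 000 becomes 5 and -100 becomes -2 """
  # NumPy scalars are accepted too (np.int64 is not a subclass of int)
  if isinstance(val, (int, float, np.integer, np.floating)):
    # NaN never compares equal to itself, so `val == np.nan` is always False
    if pd.isna(val): return np.nan
    signe, nb = np.sign(val), np.absolute(val)
    if nb < 1: return 0  # magnitudes below 1 are treated as zero
    else : return math.log10(nb) * signe
  else: return val  # non-numeric values are returned unchanged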

def division_colonnes(num, den):
  """ Divide num by den: returns 0 if num is zero and NaN if den is zero """
  if (num == 0): # zero numerator
    return 0
  elif den != 0:  # guard against division by zero
    return num / den
  else :
    return np.nan

def filtre(df, col, motif):
  """ Boolean mask of the rows whose column `col` contains `motif` (case-insensitive) """
  return df[col].str.contains(motif, case=False, na=False)

## Creation of data by percentage according to parents
We build a DataFrame in which each column is expressed as a fraction of its parent.

When the parent is missing, the whole row is dropped. Missing values therefore need to be handled, in particular for the parents.
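
As a rough sketch, the child/parent ratio can also be computed column-wise rather than row by row. The helper name `ratio_to_parent` and the toy values are assumptions made for this illustration; the conventions (0 when the child is zero, NaN when the parent is zero or missing) are meant to mirror `division_colonnes`.

```python
import numpy as np
import pandas as pd

def ratio_to_parent(child: pd.Series, parent: pd.Series) -> pd.Series:
    """Share of `child` in `parent` (hypothetical helper).

    0 when the child is 0, NaN when the parent is 0 or missing.
    """
    safe_parent = parent.where(parent != 0)  # 0 -> NaN, so no division by zero
    ratio = child / safe_parent
    return ratio.mask(child == 0, 0.0)       # a zero child always gives 0

# Toy data (invented values)
toy = pd.DataFrame({
    "child": [50.0, 0.0, 30.0, np.nan],
    "parent": [100.0, 0.0, 0.0, 200.0],
})
toy["share"] = ratio_to_parent(toy["child"], toy["parent"])
print(toy)
```

Rows where the share is NaN are the ones that drop out of later statistics, which is why the missing parents matter.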

In [None]:
# Lowest level (children)
colonnes_enfants = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['FJ', 'FM', 'FN', 'FO', 'FP', 'FQ',                     # Operating income
                          'FS', 'FT', 'FU', 'FV', 'FW', 'FX', 'FY', 'FZ', 'GE',   # Operating expenses
                          'GJ', 'GK', 'GL', 'GM', 'GN', 'GO',                     # Financial income
                          'GQ', 'GR', 'GS', 'GT',                                 # Financial expenses
                          'HA', 'HB', 'HC',                                       # Exceptional income
                          'HE', 'HF', 'HG'])                                      # Exceptional expenses
                   ,'Columns of the profit and loss (FR)']

# Parent level
colonne_parent = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['FR', 'GF',         # Operating result
                          'GP', 'GU',         # Financial result
                          'HD', 'HH']),       # Exceptional result
                          'Columns of the profit and loss (FR)']

# Grandparent level
colonne_grand_parent = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GG', 'GV','HI']),'Columns of the profit and loss (FR)']

# Great-grandparent level
colonne_arrière_grand_parent = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['HN', 'GG', 'GV', 'GH', 'GI', 'HI', 'HJ', 'HK']),'Columns of the profit and loss (FR)']

In [None]:
colonne_parent

In [None]:
col_info

In [None]:
# Creation of a DataFrame by percentage: child / parent
data_percent = pd.DataFrame()
# Operating income
colonnes_produits_exploitation = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['FR', 'FJ', 'FM', 'FN', 'FO', 'FP', 'FQ']),:]
for col in list(colonnes_produits_exploitation['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des produits d’exploitation']), axis=1)

# Operating expenses
colonnes_charges_exploitation = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GF', 'FS', 'FT', 'FU', 'FV', 'FW', 'FX', 'FY', 'FZ', 'GE']),:]
for col in list(colonnes_charges_exploitation['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des charges d’exploitation']), axis=1)

# Financial income
colonnes_produits_financiers = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GP', 'GJ', 'GK', 'GL', 'GM', 'GN', 'GO']),:]
for col in list(colonnes_produits_financiers['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des produits financiers']), axis=1)

# Financial expenses
colonnes_charges_financières = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GU', 'GQ', 'GR', 'GS', 'GT']),:]
for col in list(colonnes_charges_financières['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des charges financières']), axis=1)

# Exceptional income
colonnes_produits_exceptionnels = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['HD', 'HA', 'HB', 'HC']),:]
for col in list(colonnes_produits_exceptionnels['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des produits exceptionnels']), axis=1)

# Exceptional expenses
colonnes_charges_exceptionnelles = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['HH', 'HE', 'HF', 'HG']),:]
for col in list(colonnes_charges_exceptionnelles['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Total des charges exceptionnelles']), axis=1)


data_percent.describe()

In [None]:
# Creation of a DataFrame by percentage: Difference (parents) = Grand parent 
# Operating result
colonnes_résultat_exploitation = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GG', 'FR', 'GF']),:]
for col in list(colonnes_résultat_exploitation['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Résultat d\'exploitation']), axis=1)

# Financial result
colonnes_résultat_financier = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GV', 'GP', 'GU']),:]
for col in list(colonnes_résultat_financier['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Résultat financier']), axis=1)

# Exceptional result
colonnes_résultat_exceptionnel = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['HI', 'HD', 'HH']),:]
for col in list(colonnes_résultat_exceptionnel['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Résultat exceptionnel']), axis=1)

In [None]:
# Creation of a DataFrame by percentage: Sum (grandparents) = Great-grandparent 
# Current result before taxes
colonnes_résultat_avant_impots = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['GW', 'GG', 'GV', 'GH', 'GI']),:]
for col in list(colonnes_résultat_avant_impots['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Résultat en cours avant impôts']), axis=1)

# Profit or loss (total income - total expenses)
colonnes_bénefices = col_info.loc[col_info['Liasse (Id)']\
                   .isin(['HN', 'GW', 'HI', 'HJ', 'HK']),:]
for col in list(colonnes_bénefices['Columns of the profit and loss (FR)']):
  data_percent[col] = data.apply(lambda x: division_colonnes(x[col], x['Bénéfices ou perte (Total des produits ‐ Total des charges)']), axis=1)

In [None]:
# Apply filters to remove outliers
data_percent['index'] = data_percent.index
data['index'] = data.index

# Application of a filter: children over 200% of their parent are removed, i.e. 72 lines 
index_anomalies_enfant = data_percent.loc[pd.DataFrame(abs(data_percent.loc[:,colonnes_enfants]) > 2).sum(axis=1) > 0,'index']

data_percent.drop(list(index_anomalies_enfant), axis=0).describe()

## Production of histograms on the distribution of children in relation to parents


In [None]:
#data_percent_less_than_2.hist(bins=20, figsize=(25, 15))
def hist_enfants(col_famille, value_range=(-0.2, 1.1)):
  """ Plot overlaid histograms of the share of each child in its parent """
  plt.figure(figsize=(7,7))
  # identification of family columns
  df_col_hist = col_famille['Columns of the profit and loss (FR)']
  # identification of the parent (first column whose name contains 'Total')
  col_parent = col_famille.loc[filtre(col_famille, 'Columns of the profit and loss (FR)', 'Total'),'Columns of the profit and loss (FR)'].to_numpy()[0]
  # one histogram per child; `value_range` avoids shadowing the built-in `range`
  for col in df_col_hist.loc[df_col_hist!=col_parent]:
    plt.hist(data_percent[col], range=value_range, alpha=0.5, label=col, bins=40, density=True,stacked=True)
  plt.legend(loc=2)
  plt.title('Histograms of distribution by percentage of children in \'{}\' '.format(col_parent))
  plt.show()

hist_enfants(colonnes_produits_exploitation)
hist_enfants(colonnes_charges_exploitation)
hist_enfants(colonnes_produits_financiers)
hist_enfants(colonnes_charges_financières)
hist_enfants(colonnes_produits_exceptionnels)
hist_enfants(colonnes_charges_exceptionnelles)

We can observe that some parents are generally equal to a single one of their children:
* Chiffre d'affaires net = Total des produits d’exploitation
* Autres intérêts et produits assimilés = Total des produits financiers
* Intérêts et charges assimilées = Total des charges financières
* The total of exceptional income is often equal to one of its children (exceptional income on management operations, on capital operations, or transfers of charges)
* The total of exceptional expenses is likewise often equal to a single one of its children

Total operating expenses, by contrast, are spread fairly evenly across their children.

The children appear to fall between 0 and 100% of their parent, which is a good sign for the reliability of the database.
Most children also have the same sign as their parent.
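
These visual impressions can be quantified. The sketch below uses an invented Series of shares standing in for one column of `data_percent`; a negative share means the child and its parent have opposite signs, and a share above 1 means the child exceeds its parent.

```python
import pandas as pd

# Toy shares (invented values) standing in for one column of data_percent.
shares = pd.Series([0.0, 0.25, 1.0, 1.4, -0.1, None], dtype="float64")

valid = shares.dropna()
within_unit = valid.between(0, 1)  # inclusive on both ends by default
opposite_sign = valid < 0          # child and parent have opposite signs

print(f"{within_unit.mean():.0%} of the non-missing shares lie in [0, 1]")
print(f"{opposite_sign.sum()} share(s) have a sign opposite to the parent")
```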

## Check that the sum of the children is equal to their parent
We compute the difference between the parent and the sum of its children, then compare it to zero.
![](https://i.ibb.co/Sn50b5d/Capture-d-cran-du-2021-07-17-17-01-17.png)
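
The comparison to zero is done on a signed logarithmic scale, so that differences of very different magnitudes fit on one histogram. A minimal vectorized sketch of that transform is given here; `signed_log10` is a hypothetical name and the values are invented, with the same convention as `obtention_log10` (magnitudes below 1 count as zero).

```python
import numpy as np
import pandas as pd

def signed_log10(values: pd.Series) -> pd.Series:
    """Signed order of magnitude: 100000 -> 5, -100 -> -2, |x| < 1 -> 0 (hypothetical helper)."""
    magnitude = values.abs()
    # clip(lower=1) maps every |x| < 1 to log10(1) = 0 and avoids log10(0)
    return np.sign(values) * np.log10(magnitude.clip(lower=1))

# Toy residuals (invented values): parent minus the sum of its children
residual = pd.Series([0.0, 0.4, 100000.0, -100.0, np.nan])
print(signed_log10(residual).tolist())
```

A row is considered consistent when its transformed residual is close to 0; missing residuals stay missing.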

In [None]:
def hist_zero(df_op, op_add, op_sub, title='',print_hist=True, min=0,max=1):
    """ Add the op_add columns and subtract the op_sub columns of df_op,
     then compare the result to zero on a signed log10 scale with a histogram.
     Returns (mask of the rows whose log value lies in [min, max], raw difference).
     Note: missing values of df_op are replaced by 0 in place.
     """
    df_tot = pd.Series(index=df_op.index, data=0.0)
    df_op.fillna(0,inplace=True)  # in place: later cells rely on NaN-free data

    for col in op_add:
        df_tot += df_op.loc[:,col]
    for col in op_sub:
        df_tot -= df_op.loc[:,col]

    # Series.apply is used because DataFrame.applymap is deprecated since pandas 2.1
    df_tot_log = df_tot.apply(obtention_log10).to_frame()
    if print_hist :
        df_tot_log.hist(bins=9)
        plt.title('LOGARITHMIC histogram of the zero position of \'{}\''.format(title))

    return ((df_tot_log >= min) & (df_tot_log <= max)).iloc[:,0], df_tot
min, max=-1,1

In [None]:
def obtention_col(df):
    """ Return the ';'-joined column names of a family whose 'Calcul' is missing (the non-computed columns, i.e. the children) """
    return ';'.join(list(df.loc[df.loc[:,'Calcul'].isna(), 'Columns of the profit and loss (FR)']))

df_add_sub = pd.DataFrame(columns=['operation to check', 'op_add', 'op_sub'], index=np.arange(0,11))
df_add_sub.loc[0,:] = 'profit or losses (total product - total charges)', ';'.join(['Résultat en cours avant impôts', 'Résultat exceptionnel']), ';'.join(['Bénéfices ou perte (Total des produits ‐ Total des charges)','Participation des salariés aux résultats de l’entreprise','Impôts sur les bénéfices'])
# Current result before taxes = operating result + financial result + attributed profit - loss borne
df_add_sub.loc[1,:] = 'result in progress before taxes', ';'.join(['Résultat en cours avant impôts','Perte supportée ou bénéfice transféré']), ';'.join(['Résultat d\'exploitation','Résultat financier','Bénéfice attribué ou perte transférée'])
df_add_sub.loc[2,:] = 'operating result', 'Total des produits d’exploitation', ';'.join(['Résultat d\'exploitation','Total des charges d’exploitation'])
df_add_sub.loc[3,:] = 'financial result', 'Total des produits financiers', ';'.join(['Résultat financier','Total des charges financières'])
df_add_sub.loc[4,:] = 'exceptional result', 'Total des produits exceptionnels', ';'.join(['Résultat exceptionnel','Total des charges exceptionnelles'])

df_add_sub.loc[5,:] = 'total of exploitation product', 'Total des produits d’exploitation', obtention_col(df=colonnes_produits_exploitation)
df_add_sub.loc[6,:] = 'total financial income', 'Total des produits financiers', obtention_col(df=colonnes_produits_financiers)
df_add_sub.loc[7,:] = 'total exceptional products', 'Total des produits exceptionnels', obtention_col(df=colonnes_produits_exceptionnels)

df_add_sub.loc[8,:] = 'total of exploitation charges', 'Total des charges d’exploitation', obtention_col(df=colonnes_charges_exploitation)
df_add_sub.loc[9,:] = 'total financial charges/load', 'Total des charges financières', obtention_col(df=colonnes_charges_financières)
df_add_sub.loc[10,:] = 'total exceptional charges', 'Total des charges exceptionnelles', obtention_col(df=colonnes_charges_exceptionnelles)

df_add_sub

In [None]:
for i in range(0,11):
    mask, df_zero = hist_zero(df_op=data, op_add=df_add_sub.iloc[i,1].split(';'), op_sub=df_add_sub.iloc[i,2].split(';'), title=df_add_sub.iloc[i,0],min=min,max=max)

We can observe that most parents are equal to the sum of their children, which is good news.
Some differences lie above or below zero and need to be corrected.

## Correction of parents equal to zero and not equal to the sum of their children
We correct the parents that are equal to zero but differ from the sum of their children: each of these parents takes the sum of its children as its value.
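
The rule can be sketched on a toy example with a vectorized `np.where`. The column names and values are invented for illustration, and the extra `was_corrected` flag is an assumption of this sketch, added to keep the correction auditable.

```python
import numpy as np
import pandas as pd

# Toy data (invented values): the second row has a zero parent but non-zero children.
toy = pd.DataFrame({
    "child_a": [10.0, 4.0, 0.0],
    "child_b": [5.0, 6.0, 0.0],
    "parent_total": [15.0, 0.0, 0.0],
})
children = ["child_a", "child_b"]

sum_children = toy[children].sum(axis=1)
# Keep the reported parent unless it is zero; then fall back to the children.
toy["parent_corrected"] = np.where(toy["parent_total"] != 0,
                                   toy["parent_total"], sum_children)
# Flag the rows that were actually changed.
toy["was_corrected"] = toy["parent_corrected"] != toy["parent_total"]
print(toy)
```

Only the second row is modified: the third row has a zero parent, but its children also sum to zero, so there is nothing to correct.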

In [None]:
# Correction of null parents
# We replace null parents with the sum of their children
# We go through all the families 
for col_famille in [colonnes_produits_exploitation,
                    colonnes_charges_exploitation,
                    colonnes_produits_financiers,
                    colonnes_charges_financières,
                    colonnes_produits_exceptionnels,
                    colonnes_charges_exceptionnelles] :
    # We identify the parent
    parent = col_famille.loc[filtre(col_famille, 'Columns of the profit and loss (FR)', 'Total'),'Columns of the profit and loss (FR)'].to_numpy()[0]
    # We identify the children (compare against the full name: parent[0] is only its first character)
    enfants = list(col_famille.loc[col_famille.loc[:,'Columns of the profit and loss (FR)'] != parent,'Columns of the profit and loss (FR)'])
    print(parent, enfants, '\n')
    # We compute the difference between the parent and the sum of the children (no histogram here)
    mask, df_zero = hist_zero(df_op=data, op_add=[parent], op_sub=enfants, title=parent, print_hist=False, min=-30,max=-0.5)
    data['comparaison zéro'] = df_zero
    data['somme des enfants'] = data.loc[:,enfants].sum(axis=1)
    # On the lines where the parent is zero, we assign it the sum of the children
    data.loc[:,'correction '+parent] = data.apply(lambda row: row[parent] if int(row[parent]) != 0 else row['somme des enfants'], axis=1)

![](https://i.ibb.co/7GHNSk9/Capture-d-cran-du-2021-07-17-17-09-35.png)

In [None]:
# Verification
for col_famille in [colonnes_produits_exploitation,
                    colonnes_charges_exploitation,
                    colonnes_produits_financiers,
                    colonnes_charges_financières,
                    colonnes_produits_exceptionnels,
                    colonnes_charges_exceptionnelles] :
    parent = col_famille.loc[filtre(col_famille, 'Columns of the profit and loss (FR)', 'Total'),'Columns of the profit and loss (FR)'].to_numpy()[0]
    enfants = list(col_famille.loc[col_famille.loc[:,'Columns of the profit and loss (FR)'] != parent,'Columns of the profit and loss (FR)'])
    mask, df_zero = hist_zero(df_op=data, op_add=[parent], op_sub=enfants, title=parent,min=-30,max=-0.5)
    mask, df_zero = hist_zero(df_op=data, op_add=['correction '+parent], op_sub=enfants, title='correction '+parent,min=-30,max=-0.5)

In [None]:
obtention_col(df=colonnes_charges_exploitation)

We have been able to correct some parents. The main remaining issues now come from the right of the graph, which shows companies where the sum of the children is above their parent. Further investigation needs to be done.
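
One possible starting point for that investigation is to list the rows where the children add up to more than the parent, largest excess first. This is a sketch on invented data; the column names and the `tolerance` threshold are hypothetical and would need to be adapted to the real families.

```python
import pandas as pd

# Toy data (invented values)
toy = pd.DataFrame({
    "child_a": [10.0, 8.0, 3.0],
    "child_b": [5.0, 7.0, 3.0],
    "parent_total": [15.0, 10.0, 9.0],
})
children = ["child_a", "child_b"]

# Positive excess: the children add up to more than the reported parent.
toy["excess"] = toy[children].sum(axis=1) - toy["parent_total"]
tolerance = 1.0  # hypothetical threshold, e.g. to ignore rounding to the unit
suspects = toy[toy["excess"] > tolerance].sort_values("excess", ascending=False)
print(suspects)
```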