The aim of this notebook is to answer [this task](https://www.kaggle.com/mpwolke/cusersmarildownloadsfuneralscsv/tasks?taskId=194), following a discussion in [this notebook](https://www.kaggle.com/mpwolke/public-funerals/).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import plotly.graph_objs as go

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('../input/cusersmarildownloadsfuneralscsv/funerals.csv', delimiter=';', encoding = "ISO-8859-1")

In [None]:
df.describe()

In this notebook, we will focus on the "cost_recovered" and "cost_of_funeral" columns. Let's first take a look at the values in "cost_recovered".

In [None]:
set(df["cost_recovered"])

Here, we can see that the £ symbol is placed in front of the prices, and that there is also a "Pending" value. As a result, the values are interpreted as strings, whereas we want them to be interpreted as floats.
First, we have to decide what to do with the "Pending" values. I suggest not replacing them with 0, because there is also a "£0.00" value whose meaning is different. The best option is probably to treat them separately.

In [None]:
pending = df[df['cost_recovered'] == 'Pending']
not_pending = df[df['cost_recovered'] != 'Pending'].copy()   # copy, so that later column assignments do not act on a view of df
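
To illustrate why "Pending" is kept apart rather than replaced by 0, here is a minimal sketch on made-up values (they are not taken from funerals.csv). It assumes the same string format as the column above and shows how the mean shifts when pending entries are counted as zeros.

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from funerals.csv
toy = pd.Series(['£100.00', 'Pending', '£0.00', 'Pending'])

# Strip the currency symbol; 'Pending' cannot be parsed and becomes NaN
as_missing = pd.to_numeric(toy.str.replace('£', '', regex=False), errors='coerce')

# The alternative we want to avoid: count 'Pending' as zero pounds recovered
as_zero = as_missing.fillna(0)

print(as_missing.mean())  # NaN entries are skipped: (100 + 0) / 2
print(as_zero.mean())     # pending rows pull the mean down: 100 / 4
```

In this toy case, a genuine "£0.00" and a "Pending" become indistinguishable once both are stored as 0, which is the reason for splitting the rows instead.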

In [None]:
fig = go.Figure(data=[go.Pie(labels=['Pending', 'Not Pending'], values=[len(pending), len(not_pending)])])
fig.show()
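
The shares displayed by the pie chart can also be computed directly, without reading them off the figure. A small sketch, assuming a made-up column with the same "Pending" marker (the numbers below are not those of the real data):

```python
import pandas as pd

# Made-up column for illustration only; not the real data
toy = pd.DataFrame({'cost_recovered': ['Pending', '£10.00', 'Pending', '£0.00']})

# True where the amount is still pending, False otherwise
is_pending = toy['cost_recovered'] == 'Pending'

# The mean of a boolean column is the fraction of True values
pending_share = is_pending.mean() * 100
print(f"Pending: {pending_share:.1f} %, Not Pending: {100 - pending_share:.1f} %")
```

On the real DataFrame, the same expression would be applied to `df['cost_recovered']`.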

Now we know the split between pending and non-pending values: pending values make up 55.9 % of the total. We will now focus on the other 44.1 %, whose amount is known. First, let's convert the prices to float values.

In [None]:
float_val = []                            # This list will contain the cleaned values
for idx, row in not_pending.iterrows():   # We loop over the not_pending DataFrame

    # There are potentially two characters to delete: the £ symbol, and the comma
    # between thousands and hundreds. We access the value with row['cost_recovered'],
    # then replace those characters with nothing, which is the same as deleting them.
    # The results are appended to our list, which will become the new "cost_recovered" column.

    float_val.append(row['cost_recovered'].replace('£','').replace(',',''))

not_pending = not_pending.drop('cost_recovered', axis=1)    # We delete the old cost_recovered column
# The cleaned values are still strings, so we convert them to float before storing them
# (errors='coerce' turns anything that cannot be parsed into NaN)
not_pending['cost_recovered'] = pd.to_numeric(float_val, errors='coerce')

In [None]:
not_pending.head(5)

It looks like our "cost_recovered" column is now filled with float values :-)
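
Note that `head()` alone does not reveal the column type, since the string '0.00' and a float print almost the same way. A hedged sketch of an explicit check, on made-up values:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up values for illustration only
cleaned_strings = pd.Series(['1200.50', '0.00'])              # symbols removed, still text
converted = pd.to_numeric(cleaned_strings, errors='coerce')   # actual floats

print(is_numeric_dtype(cleaned_strings))  # False: the values are still strings
print(is_numeric_dtype(converted))        # True
print(converted.dtype)
```

On the real data, the equivalent check would be `is_numeric_dtype(not_pending['cost_recovered'])`.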

We can now do exactly the same thing for cost_of_funeral (we could have done both at the same time, but for the sake of clarity, we chose to go step by step).

In [None]:
float_fun = []                            # This list will contain the cleaned values
for idx, row in not_pending.iterrows():   # We loop over the not_pending DataFrame

    # There are potentially two characters to delete: the £ symbol, and the comma
    # between thousands and hundreds. We access the value with row['cost_of_funeral'],
    # then replace those characters with nothing, which is the same as deleting them.
    # The results are appended to our list, which will become the new "cost_of_funeral" column.

    float_fun.append(row['cost_of_funeral'].replace('£','').replace(',',''))

not_pending = not_pending.drop('cost_of_funeral', axis=1)    # We delete the old cost_of_funeral column
# The cleaned values are still strings, so we convert them to float before storing them
# (errors='coerce' turns anything that cannot be parsed into NaN)
not_pending['cost_of_funeral'] = pd.to_numeric(float_fun, errors='coerce')

In [None]:
not_pending.head(5)

Looks good :-)
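
As mentioned, both columns could be handled in one go. Here is a minimal sketch of that idea on made-up rows; `clean_price` is a hypothetical helper name introduced for this example, and the vectorised `.str.replace` calls are swapped in for the explicit loop used above.

```python
import pandas as pd

def clean_price(column):
    """Hypothetical helper: turn strings like '£1,234.50' into floats."""
    stripped = column.str.replace('£', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(stripped, errors='coerce')  # unparseable text becomes NaN

# Made-up rows for illustration only; not taken from funerals.csv
toy = pd.DataFrame({
    'cost_of_funeral': ['£1,234.50', '£800.00'],
    'cost_recovered': ['£1,234.50', '£0.00'],
})

# Apply the same cleaning to each price column
for name in ['cost_of_funeral', 'cost_recovered']:
    toy[name] = clean_price(toy[name])

print(toy.dtypes)
```

Unlike the drop-and-reassign approach, this keeps the columns in their original position.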

So, let's draw a scatter plot of the two columns now!

In [None]:
px.scatter(not_pending, x='cost_of_funeral',y='cost_recovered')

On this plot, we see that there is one funeral whose cost was significantly higher than the others. It is the isolated point at the far right of the graph. The cost of this funeral was entirely recovered.

On the other hand, funerals whose cost was not recovered at all tend to be cheaper.
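
These visual impressions could be checked with a couple of summary numbers. The sketch below uses made-up numeric values (not results from funerals.csv) and a hypothetical `recovery_ratio` column to compare the typical cost of funerals with no recovery against the others.

```python
import pandas as pd

# Made-up numeric values for illustration only; not results from funerals.csv
toy = pd.DataFrame({
    'cost_of_funeral': [500.0, 700.0, 1500.0, 2500.0],
    'cost_recovered':  [0.0,   0.0,   1500.0, 1000.0],
})

# Fraction of each funeral's cost that was recovered (0 = nothing, 1 = everything)
toy['recovery_ratio'] = toy['cost_recovered'] / toy['cost_of_funeral']

# Compare the median cost of funerals with no recovery against the rest
nothing_recovered = toy['cost_recovered'] == 0
print(toy.loc[nothing_recovered, 'cost_of_funeral'].median())   # 600.0
print(toy.loc[~nothing_recovered, 'cost_of_funeral'].median())  # 2000.0
```

On the real data, the same comparison would run on `not_pending` once both columns are numeric.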

This is the end of this tutorial. I hope you enjoyed it!