<h1 style='background:#DBAAFF; border:0; color:black'><center>Introduction</center></h1>

For most of the population in Romania, salary earnings are the main source of income. This dataset tracks the evolution of gross and net (after-tax) income in Romania between 1991 and 2020. The data source is the Romanian National Institute of Statistics. The data is provided monthly.


A few technical details need to be specified.

Starting in 1990, the post-communist economic depression, the collapse of international markets for post-communist countries, and internal political turmoil, combined with rampant corruption in the privatization of the public sector, resulted in an accelerating devaluation of the national currency, the `Leu` (`ROL`); in other words, aggressive inflation. In 2005, the National Bank of Romania redenominated the currency: 10,000 `ROL` became 1 `new Romanian Leu` (`RON`).
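To compare amounts across the 2005 redenomination, values expressed in old lei have to be divided by 10,000. The following is a minimal sketch, assuming a hypothetical helper `rol_to_ron` and a made-up amount. Whether the pre-2005 values in this dataset are stored in `ROL` or were already converted to `RON` is not stated here, so this should be checked before applying any conversion.

```python
# Redenomination rate: 10,000 old lei (ROL) = 1 new leu (RON).
ROL_PER_RON = 10_000

def rol_to_ron(amount_rol):
    """Convert an amount in old lei (ROL) to new lei (RON). Hypothetical helper."""
    return amount_rol / ROL_PER_RON

# Illustrative amount only, not taken from the dataset.
example_rol = 7_500_000
example_ron = rol_to_ron(example_rol)
print(f"{example_rol} ROL = {example_ron} RON")
```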

Taxation also evolved over this period, with a few landmarks worth detailing here.

Until 2005, taxation was progressive. The state had difficulty collecting taxes, and the only solution it found was to increase them.

From 2005, a flat income tax of 16% applied. It applied to the gross revenue, but besides the income tax, other contributions (National Health Insurance, National Retirement Plan) were levied on the gross revenue as well as on the net revenue. In total, 7-8 taxes applied to the gross income (partly directly, partly through the net income), with the actual total tax amounting to ~45-50% on average. Between 2005 and 2010 we can therefore see a tax of ~23.5% to 27.5% (increasing); this is the part of the total tax deducted from the gross earnings. From 2010 to 2018, the tax percent is almost unchanged.

In 2018, a new fiscal reform was passed. Since then, all taxes apply to gross income, and the average tax decreased from a total of ~45% of gross earnings to ~40% of gross earnings.



<h1 style='background:#12AAFF; border:0; color:black'><center>Analysis preparation</center></h1>


We load the packages needed for data analysis.
We then load the data.

In [None]:
import numpy as np
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
net_df = pd.read_csv("/kaggle/input/romania-average-monthly-earnings/net_monthly_average_earnings.csv")
gross_df = pd.read_csv("/kaggle/input/romania-average-monthly-earnings/gross_monthly_average_earnings.csv")

<h1 style='background:#12AAFF; border:0; color:black'><center>Data profiling</center></h1>


We analyze the input data to check data quality, missing data, and data limits.

We start by looking at a few sample rows from the net and gross datasets.

In [None]:
net_df.head()

In [None]:
gross_df.head()

We check the data types and missing values.

In [None]:
net_df.info()

In [None]:
gross_df.info()
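`info()` reports non-null counts per column. As a complement, missing values and repeated periods can be counted explicitly. This is a sketch on a made-up toy frame (`toy_df` is hypothetical and only mimics the `Year`, `Month`, `Earnings` layout); the same calls could be applied to `net_df` and `gross_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for net_df; the values are made up.
toy_df = pd.DataFrame({
    "Year": [1991, 1991, 1991],
    "Month": ["January", "February", "March"],
    "Earnings": [100.0, np.nan, 120.0],
})

# Number of missing values per column.
missing_per_column = toy_df.isna().sum()

# Number of repeated (Year, Month) pairs; monthly data should have none.
duplicate_periods = toy_df.duplicated(subset=["Year", "Month"]).sum()

print(missing_per_column)
print(f"Duplicate periods: {duplicate_periods}")
```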

We rename the `Earnings` column to `Net` and `Gross` in the two datasets, respectively, before merging them on `Year` and `Month`.

In [None]:
net_df.columns = ['Year', 'Month', 'Net']
gross_df.columns = ['Year', 'Month', 'Gross']
data_df = net_df.merge(gross_df, on=['Year', 'Month'])
net_df.shape, gross_df.shape, data_df.shape
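Comparing the three shapes shows whether the inner merge dropped rows, but not which ones. One way to list unmatched periods, sketched here on made-up toy frames (`toy_net` and `toy_gross` are hypothetical), is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Toy frames; the real ones would be net_df and gross_df after renaming.
toy_net = pd.DataFrame({
    "Year": [1991, 1991],
    "Month": ["January", "February"],
    "Net": [80.0, 85.0],
})
toy_gross = pd.DataFrame({
    "Year": [1991, 1991],
    "Month": ["January", "March"],
    "Gross": [100.0, 110.0],
})

# Outer merge keeps every row from both sides; `_merge` records its origin.
check_df = toy_net.merge(toy_gross, on=["Year", "Month"], how="outer", indicator=True)

# Rows present in only one of the two frames.
unmatched = check_df[check_df["_merge"] != "both"]
print(unmatched[["Year", "Month", "_merge"]])
```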

We also define a helper that computes a `Date` value from `Year` and `Month`.

In [None]:
def get_date(year, month):
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
       'August', 'September', 'October', 'November', 'December']
    return datetime.datetime(year, months.index(month) + 1, 1)
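As an alternative to a row-wise helper, the dates can be built in one vectorized step. This is a sketch on a made-up toy frame, assuming month names are spelled in English exactly as in the list above; `MONTH_NUMBER` and `toy_df` are hypothetical names. It relies on `pd.to_datetime` assembling dates from columns named `year`, `month`, and `day`.

```python
import pandas as pd

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
# Map each month name to its number (1..12).
MONTH_NUMBER = {name: i + 1 for i, name in enumerate(MONTHS)}

# Toy frame with the same Year/Month layout; values are made up.
toy_df = pd.DataFrame({"Year": [1991, 2005], "Month": ["March", "July"]})

# Assemble first-of-month dates from year, month, and day columns.
parts = pd.DataFrame({
    "year": toy_df["Year"],
    "month": toy_df["Month"].map(MONTH_NUMBER),
    "day": 1,
})
toy_df["Date"] = pd.to_datetime(parts)
print(toy_df)
```

A month name that is not in the mapping becomes `NaN` in the `month` column, which makes `pd.to_datetime` raise by default instead of silently producing a wrong date.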

We then create the `Date` column and calculate the tax (gross minus net) and the tax percent (relative to gross earnings).

In [None]:
data_df['Date'] = data_df.apply(lambda x: get_date(x.Year, x.Month), axis=1)
data_df['Tax'] = data_df['Gross'] - data_df['Net']
data_df['Tax_Percent'] = np.round(data_df['Tax'] / data_df['Gross'] * 100,1)

Let's also order the `data_df` by `Date`.

In [None]:
data_df = data_df.sort_values(by=['Date'])

In [None]:
data_df.head()

In [None]:
data_df.tail()

Let's also check the distribution of values in the resulting data frame.

In [None]:
data_df.describe()

In [None]:
print(f"Min date: {min(data_df.Date)}")
print(f"Max date: {max(data_df.Date)}")

We observe that there are no values for the last 3 months of 2020 (most probably they had not been collected and processed by November 2020, when the current version of the dataset was issued).
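Gaps like this can also be listed programmatically by comparing the observed dates with a complete monthly range. The sketch below uses made-up toy dates (`toy_dates` is hypothetical and stands in for `data_df['Date']`) and an assumed end of coverage chosen for illustration.

```python
import pandas as pd

# Toy dates standing in for data_df['Date']; they stop in September 2020.
toy_dates = pd.Series(pd.to_datetime(["2020-07-01", "2020-08-01", "2020-09-01"]))

# Assumed end of the expected coverage, for illustration only.
expected_end = pd.Timestamp(2020, 12, 1)

# Every month start ("MS") between the first observed date and the expected end.
expected = pd.date_range(start=toy_dates.min(), end=expected_end, freq="MS")

# Months in the expected range that have no observation.
missing_months = expected.difference(pd.DatetimeIndex(toy_dates))
print(missing_months.strftime("%Y-%m").tolist())
```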

<h1 style='background:#12AAFF; border:0; color:black'><center>Gross and Net Earning Evolution</center></h1>

We define a few functions to visualize how the values evolve over time.

In [None]:
def plot_time_variation(df, y='Net', title="", size=1, is_log=False):
    """Plot a single column `y` of `df` against `Date`."""
    f, ax = plt.subplots(1, 1, figsize=(4*size, 2*size))
    sns.lineplot(x="Date", y=y, data=df, ax=ax)
    plt.xticks(rotation=90)
    if is_log:
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.title(title)
    plt.show()

def plot_time_variations(df, y1='Net', y2='Gross', title="", size=1, is_log=False):
    """Plot two columns `y1` and `y2` of `df` against `Date` on the same axes."""
    f, ax = plt.subplots(1, 1, figsize=(4*size, 2*size))
    sns.lineplot(x="Date", y=y1, data=df, label=y1, ax=ax)
    sns.lineplot(x="Date", y=y2, data=df, label=y2, ax=ax)
    plt.xticks(rotation=90)
    if is_log:
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.title(title)
    plt.show()

We will visualize the variation of the values using a logarithmic scale (for the entire period, where inflation makes the values span several orders of magnitude) and a linear scale (for the shorter sub-periods).
We show average earnings (gross and net), then gross earnings and tax, over the entire period and split between 1991-2004 and 2005-2020.
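The reason a logarithmic scale suits a high-inflation period can be checked numerically: a series growing by a constant percentage each month advances by a constant step in log space, so it appears as a straight line. This is a sketch with a hypothetical 10% monthly growth rate, not a figure from the dataset.

```python
import numpy as np

# Hypothetical series growing 10% per month over two years (not dataset values).
months = np.arange(24)
values = 100.0 * 1.10 ** months

# On a linear scale the month-to-month steps keep growing...
linear_steps = np.diff(values)

# ...while on a log scale every step equals log10(1.10).
log_steps = np.diff(np.log10(values))

print(linear_steps[0], linear_steps[-1])
print(np.allclose(log_steps, np.log10(1.10)))
```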

In [None]:
plot_time_variations(data_df, y1='Gross', y2='Net', title='Average salary earnings variation (1991-2020)', size=4, is_log=True)

In [None]:
plot_time_variations(data_df, y1='Gross', y2='Tax', title='Average gross salary earnings and tax variation (1991-2020)', size=4, is_log=True)

In [None]:
plot_time_variations(data_df.loc[data_df.Date<datetime.datetime(2005, 1, 1)], y1='Gross', y2='Net',
                                title='Average salary earnings variation (1991-2004)', size=4, is_log=False)

In [None]:
plot_time_variations(data_df.loc[data_df.Date<datetime.datetime(2005, 1, 1)], y1='Gross', y2='Tax',
                                title='Average gross salary earnings and tax variation (1991-2004)', size=4, is_log=False)

In [None]:
plot_time_variation(data_df.loc[data_df.Date<datetime.datetime(2005, 1, 1)], y='Tax_Percent',
                                title='Average tax percent variation (1991-2004)', size=4, is_log=False)

In [None]:
plot_time_variations(data_df.loc[data_df.Date>=datetime.datetime(2005, 1, 1)], y1='Gross', y2='Net',
                                title='Average salary earnings variation (2005-2020)', size=4, is_log=False)

In [None]:
plot_time_variations(data_df.loc[data_df.Date>=datetime.datetime(2005, 1, 1)], y1='Gross', y2='Tax',
                                title='Average gross salary earnings and tax variation (2005-2020)', size=4, is_log=False)

In [None]:
plot_time_variation(data_df.loc[data_df.Date>=datetime.datetime(2005, 1, 1)], y='Tax_Percent',
                                title='Average tax percent variation (2005-2020)', size=4, is_log=False)