# Data analysis

**Goal**: Analyse the differences in hotel costs and per diems of employees of the commission

1. Import and explore the data we scraped in the previous exercise.
2. Clean this data.
3. Combine it with the [inflation index from Eurostat](https://ec.europa.eu/eurostat/databrowser/bookmark/f6a583fa-f744-4590-aa95-173aaa6ea3f1?lang=en) (grab the [direct link to the csv](https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/prc_hicp_midx/1.0/M.I05.CP11.*?c[geo]=BE,BG,CZ,DK,DE,EE,IE,EL,ES,FR,HR,IT,CY,LV,LT,LU,HU,MT,NL,AT,PL,PT,RO,SI,SK,FI,SE,IS,NO,CH,UK,ME,MK&compress=false&format=csvdata&formatVersion=2.0&c[TIME_PERIOD]=ge:2004-01+le:2023-10&lang=en&labels=name)).
4. Analyse the data

## 1. Import and explore the scraped data

The first step in a data analysis is to take a close look at the data:
- What columns are there?
- How many rows?
- What is in the columns?

Pandas has helpful methods to do this.

```python
df.info() # information on the column names, non-null counts and dtypes
df.sample(3) # returns 3 random sample rows from the data
df.head(3) # returns the first 3 rows of the data
len(df) # returns the number of rows
df.columns # all the column names (as a pandas Index)
```
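
As a minimal sketch, these methods can be tried on a tiny hand-made DataFrame before running them on the scraped file. The name `toy` and its three rows are only illustrative.

```python
import pandas as pd

# a tiny stand-in for the scraped data (hypothetical name `toy`)
toy = pd.DataFrame({
    "Destination": ["Luxembourg", "Greece", "Portugal"],
    "Hotel ceiling": ["148", "112", "120"],   # still text, as in the raw file
    "Date": [20180101, 20220701, 20070501],
})

toy.info()                 # column names, non-null counts and dtypes
print(toy.head(2))         # the first 2 rows
print(len(toy))            # number of rows
print(list(toy.columns))   # column names converted to a plain Python list
```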

In [145]:
import pandas as pd

# suppress the SettingWithCopyWarning that pandas shows when we modify a column of a filtered DataFrame
pd.options.mode.chained_assignment = None

In [108]:
df = pd.read_csv("data/regulation_data.csv")
df.sample(3)

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date
183,Luxembourg,148,98,20180101
35,Greece,112,82,20220701
525,Portugal,120,84,20070501


In [109]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Destination      636 non-null    object
 1   Hotel ceiling    636 non-null    object
 2   Daily allowance  636 non-null    object
 3   Date             636 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 20.0+ KB


### `df.info()` explained

![](img/info.png)

### Looking closely at the columns

To isolate a single column and look at it more closely, we use the following syntax:

```python
df["column"]
```

This returns a `Series` (a single column) rather than a `DataFrame`. It is important to know which of the two you are working with, as each has its own methods.


```python
# for categorical variables
df["column"].unique() # all the unique values of the column
df["column"].value_counts() # how often does a value occur

# for numeric variables
df.hist(column='column') # makes a histogram
df.describe() # descriptive statistics for all numeric variables
df["column"].describe() # descriptive statistics for a single column
```
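
A small sketch of these column methods on made-up values (the name `toy` is hypothetical), showing what each call returns:

```python
import pandas as pd

toy = pd.DataFrame({
    "Destination": ["Greece", "Greece", "Portugal"],
    "Hotel ceiling": [112.0, 140.0, 120.0],
})

counts = toy["Destination"].value_counts()   # how often each country occurs
countries = toy["Destination"].unique()      # the distinct values of the column
stats = toy["Hotel ceiling"].describe()      # count, mean, std, min, quartiles, max

print(counts)
print(countries)
print(stats)
```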

In [110]:
df["Destination"].value_counts()

Destination
Belgium            23
Latvia             23
United Kingdom     23
Sweden             23
Finland            23
Slovenia           23
Portugal           23
Poland             23
Austria            23
Netherlands        23
Malta              23
Hungary            23
Lithuania          23
Luxembourg         23
Cyprus             23
Germany            23
Italy              23
Czech Republic     23
France             23
Spain              23
Greece             23
Ireland            23
Denmark            23
Estonia            23
Romania            19
Bulgaria           19
Slovakia           14
Destination        11
Slovak Republic     9
Croatia             9
Destinations        3
Name: count, dtype: int64

This counts how often each value occurs in the column. Note that the words *Destination* and *Destinations* appear as values: these are remnants of the table headers, and we don't want them. *Slovakia* and *Slovak Republic* also both appear, so we need to merge them into one value.

## 2. Cleaning the data

### Filtering the DataFrame

```python
df.query() # filters the DataFrame
```

- [see more example usages of `.query()`](https://github.com/zufanka/2023-GUN_MIJ/blob/main/resources/query_example_usage.md)

In [146]:
# return only the rows that do not contain "Destinations" or "Destination" in the 'Destination' column
df2 = df.query("~Destination.str.contains('Destinations?')")

* `~` = negation: keep the rows where the condition is *not* true
* `str.contains()` = pandas [str.contains()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html)
* `'Destinations?'` = the `?` means that the preceding character is optional, so the pattern matches both "Destinations" and "Destination". This syntax comes from [regular expressions](https://regexr.com/)
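
The same filter can also be written with a boolean mask instead of `.query()`, which makes the role of `~` easier to see. A minimal sketch on made-up rows (the names `toy`, `is_header` and `cleaned` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Destination": ["Belgium", "Destination", "Destinations", "Slovak Republic"],
})

# True for rows whose Destination matches the regular expression "Destinations?"
is_header = toy["Destination"].str.contains("Destinations?")

# ~ flips True and False, so only the non-header rows are kept
cleaned = toy[~is_header]
print(cleaned)
```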

We also need to replace the "Slovak Republic" with "Slovakia".

In [147]:
df2.loc[:,"Destination"] = df2["Destination"].str.replace("Slovak Republic", "Slovakia")

# checking if the changes took place
df2["Destination"].value_counts()

Destination
Belgium           23
Latvia            23
Sweden            23
Finland           23
Slovakia          23
Slovenia          23
Portugal          23
Poland            23
Austria           23
Netherlands       23
Malta             23
Hungary           23
Luxembourg        23
Lithuania         23
Cyprus            23
Italy             23
France            23
Spain             23
Greece            23
Ireland           23
Estonia           23
Germany           23
Denmark           23
Czech Republic    23
United Kingdom    23
Bulgaria          19
Romania           19
Croatia            9
Name: count, dtype: int64

Next we need to change the `dtypes` of the columns:
- `Hotel ceiling` and `Daily allowance` to `float64` instead of `object`
- `Date` to `datetime64` instead of `int64`

In [148]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Destination      622 non-null    object
 1   Hotel ceiling    622 non-null    object
 2   Daily allowance  622 non-null    object
 3   Date             622 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 24.3+ KB


Use the following function to change the `dtype` from `object` to `int64` or `float64`:

```python
pd.to_numeric() # change text to numbers
```

In [149]:
# pd.to_numeric() expects a decimal point, so we first replace the decimal comma with a point using .str.replace()
df2["Hotel ceiling"] = pd.to_numeric(df2["Hotel ceiling"].str.replace(",","."))
df2["Daily allowance"] = pd.to_numeric(df2["Daily allowance"].str.replace(",","."))

Do you see the error below? It means that you have executed the cell twice: the columns are already numbers, and the `.str` accessor only works on text columns, so the replace step fails.
![attribute_error](img/attribute_error.png)
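
One way to make such a cell safe to re-run is to convert the column to text first, so that `.str` always has strings to work on. This is only a sketch of that idea on made-up values; `to_number` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def to_number(column):
    """Convert a column with decimal commas to floats; safe to call repeatedly."""
    # astype(str) turns already-converted floats back into text such as "97.03",
    # which contains no comma, so the replace step simply does nothing
    return pd.to_numeric(column.astype(str).str.replace(",", "."))

prices = pd.Series(["97,03", "148", "139,66"])
once = to_number(prices)
twice = to_number(once)   # running the conversion again does not raise an error
print(twice)
```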

In [152]:
# checking if the two columns have the correct dtype
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Destination      622 non-null    object 
 1   Hotel ceiling    622 non-null    float64
 2   Daily allowance  622 non-null    float64
 3   Date             622 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 24.3+ KB


In [153]:
# Diving into descriptive statistics
df2.describe()

Unnamed: 0,Hotel ceiling,Daily allowance,Date
count,622.0,622.0,622.0
mean,138.182781,85.243424,20137500.0
std,23.498804,17.272073,59253.05
min,97.03,50.0,20040500.0
25%,117.0,72.0,20090100.0
50%,139.66,86.89,20140500.0
75%,150.0,97.0,20190100.0
max,210.0,125.0,20230100.0


Often we want to group the rows by the values of one column and calculate a `sum`, `mean` or another statistic per group.
For this we use:

```python
df.groupby("city").mean() # groups by the column "city" and returns the average values for all numeric columns
```

- [see more example usage of `.groupby()`](https://github.com/zufanka/2023-GUN_MIJ/blob/main/resources/groupby_example_usage.md)
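
A minimal sketch of a grouped mean on made-up values (the name `toy` is hypothetical). Selecting the columns explicitly before calling `.mean()` avoids averaging columns where a mean has no meaning, such as a date stored as an integer.

```python
import pandas as pd

toy = pd.DataFrame({
    "Destination": ["Greece", "Greece", "Portugal", "Portugal"],
    "Hotel ceiling": [112.0, 140.0, 120.0, 100.0],
    "Daily allowance": [82.0, 82.0, 84.0, 80.0],
})

# one row per Destination, holding the mean of each selected numeric column
means = toy.groupby("Destination")[["Hotel ceiling", "Daily allowance"]].mean()
print(means.sort_values("Hotel ceiling", ascending=False))
```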

In [154]:
# average value of all the numeric values grouped by the country, sorted from largest to smallest on the 'Hotel ceiling' column
df2.groupby("Destination").mean().sort_values("Hotel ceiling", ascending=False)

Unnamed: 0_level_0,Hotel ceiling,Daily allowance,Date
Destination,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
United Kingdom,184.916957,108.55087,20135530.0
Sweden,168.187391,104.292609,20135530.0
Netherlands,163.446957,94.990435,20135530.0
Denmark,158.748261,117.873913,20135530.0
France,154.861304,94.814783,20135530.0
Romania,153.894737,56.736842,20152930.0
Bulgaria,152.894737,57.526316,20152930.0
Ireland,152.128696,102.557391,20135530.0
Czech Republic,145.478261,70.434783,20135530.0
Poland,142.130435,68.478261,20135530.0


The calculation above is misleading, however: the data spans many years, and because of inflation a euro in 2004 was worth more than a euro in 2023. We therefore need to adjust the data for inflation first. For that we will pull the [Inflation index from Eurostat](https://ec.europa.eu/eurostat/databrowser/bookmark/f6a583fa-f744-4590-aa95-173aaa6ea3f1?lang=en), filtered on the 'Restaurants and Hotels' category.

To match the two datasets by date, the `Date` column first has to become a real date. We use the following function to change the `dtype` to `datetime64`:

```python
pd.to_datetime() # change text or number to a date format
```

We have to specify the format our date is in so that `pandas` can read it correctly. The format is written with `strftime` codes: each `%` followed by a letter stands for one component of the date or time. For example:

- `%Y-%m-%d` matches a date written as '2023-12-13'
- `%B %d, %Y` matches a date written as 'December 13, 2023'
- `%H:%M:%S` matches a time written as '15:30:45'

See also the [strftime reference cheatsheet](https://strftime.org/)
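
A short sketch of both directions on made-up values: parsing integers such as `20180101` into dates, and formatting dates back into text. The integers are converted to text first here as a cautious assumption, so that the format string is applied to `"20180101"`; the names `raw` and `dates` are illustrative.

```python
import pandas as pd

raw = pd.Series([20180101, 20220701, 20070501])

# parse: text in the layout year-month-day without separators becomes a date
dates = pd.to_datetime(raw.astype(str), format="%Y%m%d")

# format: strftime goes the other way, from a date back to text in a chosen layout
months = dates.dt.strftime("%Y-%m")
print(dates)
print(months)
```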

In [162]:
df2['Date'] = pd.to_datetime(df2['Date'], format="%Y%m%d")

In [164]:
# check if the format is correct
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Destination      622 non-null    object        
 1   Hotel ceiling    622 non-null    float64       
 2   Daily allowance  622 non-null    float64       
 3   Date             622 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 24.3+ KB


## 3. Importing the inflation dataset
You can pass the [direct link to the csv](https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/prc_hicp_midx/1.0/M.I05.CP11.*?c[geo]=BE,BG,CZ,DK,DE,EE,IE,EL,ES,FR,HR,IT,CY,LV,LT,LU,HU,MT,NL,AT,PL,PT,RO,SI,SK,FI,SE,IS,NO,CH,UK,ME,MK&compress=false&format=csvdata&formatVersion=2.0&c[TIME_PERIOD]=ge:2004-01+le:2023-10&lang=en&labels=name) straight to `pd.read_csv()`, or save the file next to your notebook (here in the `data` folder) and read it from there.

In [243]:
df_inflation = pd.read_csv("data/inflation_data.csv")

# .T transposes the data so we can read it better
df_inflation.sample(3).T

Unnamed: 0,6725,7445,1976
STRUCTURE,dataflow,dataflow,dataflow
STRUCTURE_ID,ESTAT:PRC_HICP_MIDX(1.0),ESTAT:PRC_HICP_MIDX(1.0),ESTAT:PRC_HICP_MIDX(1.0)
STRUCTURE_NAME,HICP - monthly data (index),HICP - monthly data (index),HICP - monthly data (index)
freq,M,M,M
Time frequency,Monthly,Monthly,Monthly
unit,I05,I05,I05
Unit of measure,"Index, 2005=100","Index, 2005=100","Index, 2005=100"
coicop,CP11,CP11,CP11
Classification of individual consumption by purpose (COICOP),Restaurants and hotels,Restaurants and hotels,Restaurants and hotels
geo,SE,UK,EE


In [244]:
df_inflation.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7559 entries, 0 to 7558
Data columns (total 17 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   STRUCTURE                                                     7559 non-null   object 
 1   STRUCTURE_ID                                                  7559 non-null   object 
 2   STRUCTURE_NAME                                                7559 non-null   object 
 3   freq                                                          7559 non-null   object 
 4   Time frequency                                                7559 non-null   object 
 5   unit                                                          7559 non-null   object 
 6   Unit of measure                                               7559 non-null   object 
 7   coicop                                                        7559 no

In [245]:
df_inflation.columns

Index(['STRUCTURE', 'STRUCTURE_ID', 'STRUCTURE_NAME', 'freq', 'Time frequency',
       'unit', 'Unit of measure', 'coicop',
       'Classification of individual consumption by purpose (COICOP)', 'geo',
       'Geopolitical entity (reporting)', 'TIME_PERIOD', 'Time', 'OBS_VALUE',
       'Observation value', 'OBS_FLAG', 'Observation status (Flag)'],
      dtype='object')

In [247]:
df_inflation["STRUCTURE"].unique()

array(['dataflow'], dtype=object)

In [254]:
# collect every column that holds only one distinct value: such a column carries no information
delete_these = []

for column in df_inflation.columns:
    if len(df_inflation[column].unique()) == 1:
        delete_these.append(column)

In [255]:
delete_these

['STRUCTURE',
 'STRUCTURE_ID',
 'STRUCTURE_NAME',
 'freq',
 'Time frequency',
 'unit',
 'Unit of measure',
 'coicop',
 'Classification of individual consumption by purpose (COICOP)',
 'Time',
 'Observation value']

In [257]:
df_inflation = df_inflation.drop(delete_these, axis=1)
#df_inflation.drop(delete_these, axis=1, inplace=True)
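
The same idea can be sketched without a loop using `DataFrame.nunique()`. The data below is made up and the names `toy`, `constant_columns` and `trimmed` are hypothetical. Passing `dropna=False` counts a missing value as a value as well, so a completely empty column is also reported as having one unique value, which matches what `len(... .unique()) == 1` does.

```python
import pandas as pd

toy = pd.DataFrame({
    "freq": ["M", "M", "M"],            # same value in every row
    "geo": ["BE", "EE", "LU"],
    "Time": [None, None, None],         # completely empty column
    "OBS_VALUE": [145.94, 98.83, 163.23],
})

# number of distinct values per column, counting missing values too
n_unique = toy.nunique(dropna=False)
constant_columns = n_unique[n_unique == 1].index.tolist()

trimmed = toy.drop(columns=constant_columns)
print(constant_columns)
print(trimmed.columns.tolist())
```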

In [258]:
df_inflation.sample(3)

Unnamed: 0,geo,Geopolitical entity (reporting),TIME_PERIOD,OBS_VALUE,OBS_FLAG,Observation status (Flag)
431,BE,Belgium,2020-02,145.94,,
1900,EE,Estonia,2004-08,98.83,,
4739,LU,Luxembourg,2023-01,163.23,,


In [265]:
df_inflation = df_inflation.rename(columns={
    "Geopolitical entity (reporting)" : "Country",
    "OBS_VALUE" : "inflation_index"
})

In [294]:
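# Eurostat writes "Czechia" while the scraped data uses "Czech Republic", so we align the spelling before merging
df_inflation["Country"] = df_inflation["Country"].str.replace("Czechia", "Czech Republic")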

In [266]:
df_inflation\
.query('`Observation status (Flag)`.notnull()')\
.groupby("Country")['Observation status (Flag)']\
.count()

Country
Austria             10
Belgium              8
Bulgaria             1
Croatia              3
Cyprus               1
Czechia              1
Denmark              6
France              10
Germany              8
Greece               4
Hungary              4
Ireland              9
Italy                3
Lithuania            2
Luxembourg           5
Netherlands          2
North Macedonia    227
Poland               5
Portugal             2
Romania              5
Slovakia             2
Slovenia             3
Spain                2
Switzerland          5
United Kingdom       5
Name: Observation status (Flag), dtype: int64

In [267]:
df_inflation["Country"].value_counts()

Country
Austria            238
Belgium            238
Slovakia           238
Slovenia           238
Sweden             238
Romania            238
Portugal           238
Poland             238
Norway             238
Netherlands        238
Malta              238
Latvia             238
Luxembourg         238
Lithuania          238
Italy              238
Iceland            238
Ireland            238
Hungary            238
Croatia            238
France             238
Finland            238
Spain              238
Greece             238
Estonia            238
Denmark            238
Germany            238
Czechia            238
Cyprus             238
Bulgaria           238
North Macedonia    227
Switzerland        227
United Kingdom     203
Name: count, dtype: int64

In [268]:
df2.sample()

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short
127,Luxembourg,148.0,98.0,2020-01-01,2020-01


In [269]:
df_inflation.sample()

Unnamed: 0,geo,Country,TIME_PERIOD,inflation_index,OBS_FLAG,Observation status (Flag)
4860,LV,Latvia,2013-04,142.97,,


In [271]:
df2["date_short"] = df2["Date"].dt.strftime("%Y-%m")
df2.sample()

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short
276,Slovakia,125.0,80.0,2016-01-01,2016-01


![](https://learnsql.de/blog/wie-man-sql-joins-lernt/2.png)

In [284]:
left = pd.merge(
    df2,
    df_inflation,
    left_on = ["Destination", "date_short"],
    right_on = ["Country", "TIME_PERIOD"],
    how = "left"
)

right = pd.merge(
    df2,
    df_inflation,
    left_on = ["Destination", "date_short"],
    right_on = ["Country", "TIME_PERIOD"],
    how = "right"
)

inner = pd.merge(
    df2,
    df_inflation,
    left_on = ["Destination", "date_short"],
    right_on = ["Country", "TIME_PERIOD"],
    how="inner"
)

outer = pd.merge(
    df2,
    df_inflation,
    left_on = ["Destination", "date_short"],
    right_on = ["Country", "TIME_PERIOD"],
    how="outer"
)

In [290]:
len(df2)

622

In [285]:
left.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 622 entries, 0 to 621
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Destination                622 non-null    object        
 1   Hotel ceiling              622 non-null    float64       
 2   Daily allowance            622 non-null    float64       
 3   Date                       622 non-null    datetime64[ns]
 4   date_short                 622 non-null    object        
 5   geo                        595 non-null    object        
 6   Country                    595 non-null    object        
 7   TIME_PERIOD                595 non-null    object        
 8   inflation_index            595 non-null    float64       
 9   OBS_FLAG                   9 non-null      object        
 10  Observation status (Flag)  9 non-null      object        
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 53.6+ KB


In [292]:
len(df_inflation)

7559

In [286]:
right.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7559 entries, 0 to 7558
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Destination                595 non-null    object        
 1   Hotel ceiling              595 non-null    float64       
 2   Daily allowance            595 non-null    float64       
 3   Date                       595 non-null    datetime64[ns]
 4   date_short                 595 non-null    object        
 5   geo                        7559 non-null   object        
 6   Country                    7559 non-null   object        
 7   TIME_PERIOD                7559 non-null   object        
 8   inflation_index            7559 non-null   float64       
 9   OBS_FLAG                   333 non-null    object        
 10  Observation status (Flag)  333 non-null    object        
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 649.7+ KB


In [287]:
inner.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595 entries, 0 to 594
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Destination                595 non-null    object        
 1   Hotel ceiling              595 non-null    float64       
 2   Daily allowance            595 non-null    float64       
 3   Date                       595 non-null    datetime64[ns]
 4   date_short                 595 non-null    object        
 5   geo                        595 non-null    object        
 6   Country                    595 non-null    object        
 7   TIME_PERIOD                595 non-null    object        
 8   inflation_index            595 non-null    float64       
 9   OBS_FLAG                   9 non-null      object        
 10  Observation status (Flag)  9 non-null      object        
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 51.3+ KB


In [288]:
outer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7586 entries, 0 to 7585
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Destination                622 non-null    object        
 1   Hotel ceiling              622 non-null    float64       
 2   Daily allowance            622 non-null    float64       
 3   Date                       622 non-null    datetime64[ns]
 4   date_short                 622 non-null    object        
 5   geo                        7559 non-null   object        
 6   Country                    7559 non-null   object        
 7   TIME_PERIOD                7559 non-null   object        
 8   inflation_index            7559 non-null   float64       
 9   OBS_FLAG                   333 non-null    object        
 10  Observation status (Flag)  333 non-null    object        
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 652.1+ KB
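
The row counts above follow from how many keys match. 595 of the 622 rows in `df2` find a partner in `df_inflation`, so the inner join has 595 rows, the left join keeps all 622, the right join keeps all 7559 inflation rows, and the outer join has 7559 + 27 = 7586 rows. A minimal sketch of how pandas can label where each row came from, using the `indicator` parameter of `pd.merge` on made-up data (the names `allowances` and `inflation` and all values are illustrative):

```python
import pandas as pd

allowances = pd.DataFrame({
    "Destination": ["Belgium", "United Kingdom"],
    "date_short": ["2022-01", "2022-01"],
    "Hotel ceiling": [148.0, 209.0],
})
inflation = pd.DataFrame({
    "Country": ["Belgium", "Norway"],
    "TIME_PERIOD": ["2022-01", "2022-01"],
    "inflation_index": [150.0, 160.0],   # made-up index values
})

outer = pd.merge(
    allowances,
    inflation,
    left_on=["Destination", "date_short"],
    right_on=["Country", "TIME_PERIOD"],
    how="outer",
    indicator=True,   # adds a `_merge` column: both, left_only or right_only
)
print(outer[["Destination", "Country", "_merge"]])
```

Filtering on `_merge == "left_only"` then lists the rows that found no inflation value.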


In [295]:
df_adjusted = pd.merge(
    df2,
    df_inflation,
    left_on = ["Destination", "date_short"],
    right_on = ["Country", "TIME_PERIOD"],
    how = "left"
)

In [296]:
df_adjusted.sample(3)

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short,geo,Country,TIME_PERIOD,inflation_index,OBS_FLAG,Observation status (Flag)
254,Czech Republic,155.0,75.0,2016-01-01,2016-01,CZ,Czech Republic,2016-01,126.3,,
135,Slovenia,117.0,84.0,2020-01-01,2020-01,SI,Slovenia,2020-01,149.68,,
337,Germany,115.0,93.0,2014-01-01,2014-01,DE,Germany,2014-01,121.2,,


In [297]:
df_adjusted.query("inflation_index.isnull()")

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short,geo,Country,TIME_PERIOD,inflation_index,OBS_FLAG,Observation status (Flag)
27,United Kingdom,209.0,125.0,2023-01-01,2023-01,,,,,,
55,United Kingdom,209.0,125.0,2022-07-01,2022-07,,,,,,
83,United Kingdom,209.0,125.0,2022-01-01,2022-01,,,,,,
111,United Kingdom,209.0,125.0,2021-01-01,2021-01,,,,,,


In [301]:
df_adjusted["hotel_adj"] = df_adjusted["Hotel ceiling"] * (df_adjusted["inflation_index"] / 100)
df_adjusted["allowance_adj"] = df_adjusted["Daily allowance"] * (df_adjusted["inflation_index"] / 100)
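
Note that the cell above multiplies by the index. A common convention for a price index with base year 2005 = 100 is to divide a nominal amount by `index / 100`, which expresses it in 2005 prices. The sketch below shows that variant for comparison on a single made-up row; the column name `hotel_2005_prices` is hypothetical, and the rest of this notebook keeps using `hotel_adj` as computed above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Hotel ceiling": [130.0],
    "inflation_index": [124.91],   # index with 2005 = 100
})

# dividing by index/100 expresses the nominal amount in 2005 prices
toy["hotel_2005_prices"] = toy["Hotel ceiling"] / (toy["inflation_index"] / 100)
print(toy)
```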

In [303]:
df_adjusted.drop(["TIME_PERIOD", "Country", "OBS_FLAG", "Observation status (Flag)"], 
                axis=1,
                inplace=True)

## 4. Analysis

In [305]:
df_adjusted.sample()

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short,geo,inflation_index,hotel_adj,allowance_adj
324,Austria,130.0,95.0,2014-05-01,2014-05,AT,124.91,162.383,118.6645


In [310]:
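# average of the adjusted values per country
averages = df_adjusted.groupby("Destination")[["hotel_adj", "allowance_adj"]].mean().sort_values("hotel_adj", ascending=False)
# the column is called "difference", but it holds the ratio of hotel ceiling to daily allowance
averages["difference"] = averages["hotel_adj"] / averages["allowance_adj"]
averages.sort_values("difference", ascending=False)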

Unnamed: 0_level_0,hotel_adj,allowance_adj,difference
Destination,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Romania,231.846358,87.560137,2.647853
Bulgaria,266.502521,101.662689,2.621439
Hungary,208.052283,101.044443,2.059018
Czech Republic,185.72713,92.193478,2.014537
Poland,186.762565,93.193,2.004041
Latvia,198.248235,105.070278,1.886816
Lithuania,187.01207,107.786878,1.735017
Netherlands,207.232979,121.167651,1.7103
United Kingdom,223.438994,130.754214,1.708847
France,188.149029,114.284925,1.646315


In [307]:
#df_adjusted.groupby("Country")[["Hotel ceiling", "Daily allowance", "hotel_adj","expenses_adj"]].mean().sort_values("expenses_adj", ascending=False)

In [None]:
pip install altair

In [314]:
import altair as alt

See the [Altair documentation](https://altair-viz.github.io/).

In [320]:
alt.Chart(averages.reset_index()).mark_bar().encode(
    x = "hotel_adj",
    y = alt.Y("Destination", sort="-x")
)

In [331]:
alt.Chart(averages.reset_index().drop("difference", axis=1).melt("Destination")).mark_bar().encode(
    x = "value",
    y = "variable",
    row = "Destination",
    color = "variable"
)

In [341]:
selection = alt.selection_point(fields=['Destination'], bind='legend')

alt.Chart(df_adjusted).mark_line().encode(
    x = "Date",
    y = "hotel_adj",
    color = "Destination",
    tooltip = ["Destination", "Date"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(
    selection
).interactive()

In [342]:
selection = alt.selection_point(fields=['Destination'], bind='legend')

alt.Chart(df_adjusted).mark_line().encode(
    x = "Date",
    y = "Hotel ceiling",
    color = "Destination",
    tooltip = ["Destination", "Date"],
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2))
).add_params(
    selection
).interactive()

In [376]:
df_maxmin = pd.concat([df2.query("Date == Date.max()").set_index("Destination"),
           df2.query("Date == Date.min()").set_index("Destination")],
         axis=1)["Hotel ceiling"]

df_maxmin.columns = ["max", "min"]
df_maxmin.head()
df_maxmin.loc["Bulgaria", "min"] = 169
df_maxmin.loc["Romania", "min"] = 170
df_maxmin.loc["Croatia", "min"] = 110
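
Countries with fewer rows in the data (Bulgaria, Romania and Croatia) presumably have no entry at the overall earliest date, which would leave their `min` empty before it is filled in by hand above. As an alternative sketch, not the approach used in this notebook, the first and last ceiling per destination can be taken after sorting by date, so that each country gets its own earliest value. The data and the name `first_last` are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "Destination": ["Belgium", "Belgium", "Croatia", "Croatia"],
    "Date": pd.to_datetime(["2004-05-01", "2023-01-01", "2016-09-10", "2023-01-01"]),
    "Hotel ceiling": [140.0, 148.0, 110.0, 120.0],
})

# sort by date, then take the earliest and latest ceiling within each destination
first_last = (
    toy.sort_values("Date")
    .groupby("Destination")["Hotel ceiling"]
    .agg(["first", "last"])
)
first_last["rising"] = first_last["last"] > first_last["first"]
print(first_last)
```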

In [377]:
df2.query("Destination.isin(['Croatia','Bulgaria','Romania']) and Date.dt.year == 2016")

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short
225,Bulgaria,135.0,57.0,2016-09-10,2016-09
234,Croatia,110.0,75.0,2016-09-10,2016-09
246,Romania,136.0,62.0,2016-09-10,2016-09
254,Bulgaria,169.0,58.0,2016-01-01,2016-01
274,Romania,170.0,52.0,2016-01-01,2016-01


In [386]:
# a country is "rising" when its latest ceiling is higher than its earliest one
df_maxmin["rising"] = df_maxmin["max"] > df_maxmin["min"]
rising = df_maxmin.query("rising == True").index

In [400]:
df_adjusted.sample()

Unnamed: 0,Destination,Hotel ceiling,Daily allowance,Date,date_short,geo,inflation_index,hotel_adj,allowance_adj
146,Ireland,159.0,108.0,2019-01-01,2019-01,IE,123.0,195.57,132.84


In [401]:
#selection = alt.selection_point(fields=['Destination'], bind='legend')

variable = "hotel_adj"

min_year = alt.Chart(df_adjusted.query("Destination.isin(@rising)")).mark_line().encode(
    x = "Date",
    y = variable,
    color = "Destination",
    tooltip = ["Destination", "Date"],
)
max_year = alt.Chart(df_adjusted.query("~Destination.isin(@rising)")).mark_line().encode(
    x = "Date",
    y = variable,
    color = "Destination",
    tooltip = ["Destination", "Date"],
   
)

alt.hconcat(
    min_year, max_year
).resolve_scale(
    color='independent'
)
