# Data analysis

**Goal**: Analyse the differences in hotel costs and per diems of employees of the commission

1. Import and explore the data we scraped in the previous exercise.
2. Clean this data.
3. Combine it with the [inflation index from Eurostat](https://ec.europa.eu/eurostat/databrowser/bookmark/f6a583fa-f744-4590-aa95-173aaa6ea3f1?lang=en) (grab the [direct link to the csv](https://ec.europa.eu/eurostat/api/dissemination/sdmx/3.0/data/dataflow/ESTAT/prc_hicp_midx/1.0/M.I05.CP11.*?c[geo]=BE,BG,CZ,DK,DE,EE,IE,EL,ES,FR,HR,IT,CY,LV,LT,LU,HU,MT,NL,AT,PL,PT,RO,SI,SK,FI,SE,IS,NO,CH,UK,ME,MK&compress=false&format=csvdata&formatVersion=2.0&c[TIME_PERIOD]=ge:2004-01+le:2023-10&lang=en&labels=name)).
4. Analyse the data

##  1. Import and explore the scraped data

The first step in a data analysis is to take a close look at the data:
- What columns are there?
- How many rows?
- What is in the columns?

Pandas has helpful methods to do this.

```python
df.info() # column names, non-null counts and data types
df.sample(3) # returns 3 random rows from the data
df.head(3) # returns the first 3 rows of the data
len(df) # returns the number of rows
df.columns # all the column names (as an Index)
```
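
As a minimal sketch, the methods above can be tried on a small hand-made DataFrame. The values below are made up for illustration and are not the scraped data; only the column names follow the dataset.

```python
import pandas as pd

# Toy data for illustration only; not the scraped dataset
toy = pd.DataFrame(
    {
        "Destination": ["Belgium", "Greece", "Sweden", "Destination"],
        "Hotel ceiling": ["140", "112", "187", "Hotel ceiling"],
        "Date": [20220101, 20220101, 20220101, 20220101],
    }
)

toy.info()                                   # prints column names, non-null counts and dtypes
n_rows = len(toy)                            # number of rows
column_names = list(toy.columns)             # .columns is an Index; list() gives a plain list
first_rows = toy.head(2)                     # first 2 rows
random_rows = toy.sample(2, random_state=0)  # 2 random rows; random_state makes it repeatable

print(n_rows, column_names)
```

Note that the stray header row in the toy data forces the whole `Hotel ceiling` column to be stored as text, which is one likely reason a numeric-looking column can show up with a non-numeric dtype in `df.info()`.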

In [69]:
import pandas as pd

In [70]:
df = pd.read_csv("regulation_data.csv")
df.sample(3)

| | Destination | Hotel ceiling | Daily allowance | Date |
|---|---|---|---|---|
| 63 | Greece | 112 | 82 | 20220101 |
| 407 | Luxembourg | 145 | 92 | 20100101 |
| 578 | Portugal | 68,91 | 124,89 | 20060101 |


In [71]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 636 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Destination      636 non-null    object
 1   Hotel ceiling    636 non-null    object
 2   Daily allowance  636 non-null    object
 3   Date             636 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 20.0+ KB


### `df.info()` explained

![](img/info.png)

### Looking closely at the columns

To isolate a column and look at it more closely, use the following syntax:

```python
df["column"]
```

This returns a `Series` rather than a `DataFrame`. It's important to know which data type you are working with, because each has its own methods.


```python
# for categorical variables
df["column"].unique() # all the unique values of the column
df["column"].value_counts() # how often does a value occur

# for numeric variables
df.hist(column='column') # makes a histogram
df.describe() # descriptive statistics for all numeric variables
df["column"].describe() # descriptive statistics for a single column
```
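
A short sketch of these column methods on toy data (values made up for illustration). The histogram call is left out here because it additionally needs matplotlib.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "Destination": ["Belgium", "Belgium", "Greece", "Sweden"],
        "Hotel ceiling": [140.0, 135.0, 112.0, 187.0],
    }
)

destinations = toy["Destination"]        # selecting one column gives a Series
unique_values = destinations.unique()    # distinct values, in order of appearance
counts = destinations.value_counts()     # value -> number of occurrences
stats = toy["Hotel ceiling"].describe()  # count, mean, std, min, quartiles, max

print(type(destinations).__name__)
print(counts)
print(stats)
```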

In [72]:
df["Destination"].value_counts()

Destination
Belgium            23
Latvia             23
United Kingdom     23
Sweden             23
Finland            23
Slovenia           23
Portugal           23
Poland             23
Austria            23
Netherlands        23
Malta              23
Hungary            23
Lithuania          23
Luxembourg         23
Cyprus             23
Germany            23
Italy              23
Czech Republic     23
France             23
Spain              23
Greece             23
Ireland            23
Denmark            23
Estonia            23
Romania            19
Bulgaria           19
Slovakia           14
Destination        11
Slovak Republic     9
Croatia             9
Destinations        3
Name: count, dtype: int64

This counts how often each value occurs in the column. Note that the words *Destination* and *Destinations* appear as values. These are remnants of the table headers and need to be removed. *Slovakia* and *Slovak Republic* also both occur; they refer to the same country and need to be merged into one value.

### Filtering the DataFrame

```python
df.query() # filters the DataFrame
```

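See the pandas documentation for [`DataFrame.query()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html).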

In [73]:
df2 = df.query("~Destination.str.contains('Destinations?')")

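* `~` negates the condition: keep every row *except* those that match
* `str.contains()` checks whether each value matches a pattern, see pandas [str.contains()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html)
* `'Destinations?'` is a regular expression: the `?` makes the preceding `s` optional, so it matches both *Destination* and *Destinations*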
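
The three pieces can be seen separately in the sketch below, which uses a toy DataFrame (made-up values). Passing `engine="python"` to `.query()` is a cautious assumption on my part: depending on the pandas version and on whether numexpr is installed, the default engine may work just as well.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {"Destination": ["Belgium", "Destination", "Greece", "Destinations"]}
)

# Boolean mask: True where the value matches the regular expression
is_header = toy["Destination"].str.contains("Destinations?")

# ~ flips the mask, so only rows that do NOT match are kept
filtered = toy[~is_header]

# The same filter written as a query string
filtered_query = toy.query(
    "~Destination.str.contains('Destinations?')", engine="python"
)

print(filtered)
```

Both forms keep the original row labels (here 0 and 2), which is why the index of a filtered DataFrame can have gaps.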

In [74]:
df2["Destination"] = df2["Destination"].replace({"Slovak Republic" : "Slovakia"})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2["Destination"] = df2["Destination"].replace({"Slovak Republic" : "Slovakia"})
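
The message above is pandas' `SettingWithCopyWarning`. It appears because `df2` was created by filtering `df`, and pandas cannot tell whether the assignment is meant to change only `df2` or also `df`. One common way to avoid it, sketched here on toy data (made-up values, hypothetical names `toy` and `toy2`), is to take an explicit `.copy()` right after filtering.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {"Destination": ["Slovak Republic", "Destination", "Slovakia", "Greece"]}
)

# .copy() makes the filtered result an independent DataFrame,
# so later assignments clearly modify toy2 and not toy
toy2 = toy[toy["Destination"] != "Destination"].copy()

# Merge the two spellings into one value
toy2["Destination"] = toy2["Destination"].replace({"Slovak Republic": "Slovakia"})

print(toy2)
```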


In [75]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Destination      622 non-null    object
 1   Hotel ceiling    622 non-null    object
 2   Daily allowance  622 non-null    object
 3   Date             622 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 24.3+ KB


In [76]:
df2["Hotel ceiling"] = pd.to_numeric(df2["Hotel ceiling"].str.replace(",","."))
df2["Daily allowance"] = pd.to_numeric(df2["Daily allowance"].str.replace(",","."))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2["Hotel ceiling"] = pd.to_numeric(df2["Hotel ceiling"].str.replace(",","."))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2["Daily allowance"] = pd.to_numeric(df2["Daily allowance"].str.replace(",","."))
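
The conversion in the cell above swaps the decimal comma for a decimal point and then parses the text as numbers. A minimal sketch on a toy Series (made-up values); the `errors="coerce"` variant is an optional extra that is not used in the notebook.

```python
import pandas as pd

# Toy values written with a decimal comma, stored as text
raw = pd.Series(["112", "68,91", "124,89"])

# Swap the decimal comma for a point, then parse the text as numbers;
# regex=False treats "," as a literal character
as_numbers = pd.to_numeric(raw.str.replace(",", ".", regex=False))

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
with_bad_value = pd.to_numeric(pd.Series(["145", "n/a"]), errors="coerce")

print(as_numbers)
print(with_bad_value)
```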


In [77]:
df2.describe()

| | Hotel ceiling | Daily allowance | Date |
|---|---|---|---|
| count | 622.0 | 622.0 | 622.0 |
| mean | 130.785225 | 92.640981 | 20137500.0 |
| std | 30.967412 | 23.799554 | 59253.05 |
| min | 50.0 | 52.0 | 20040500.0 |
| 25% | 115.0 | 74.0 | 20090100.0 |
| 50% | 135.0 | 92.0 | 20140500.0 |
| 75% | 150.0 | 102.0 | 20190100.0 |
| max | 209.0 | 210.0 | 20230100.0 |


In [78]:
df2.groupby("Destination").mean().sort_values("Hotel ceiling", ascending=False)

| Destination | Hotel ceiling | Daily allowance | Date |
|---|---|---|---|
| United Kingdom | 176.811739 | 116.656087 | 20135530.0 |
| Sweden | 161.814348 | 110.665652 | 20135530.0 |
| Netherlands | 156.468696 | 101.968696 | 20135530.0 |
| Romania | 153.894737 | 56.736842 | 20152930.0 |
| Bulgaria | 152.894737 | 57.526316 | 20152930.0 |
| France | 151.64087 | 98.035217 | 20135530.0 |
| Denmark | 151.395652 | 125.226522 | 20135530.0 |
| Ireland | 144.513913 | 110.172174 | 20135530.0 |
| Luxembourg | 137.956522 | 96.293913 | 20135530.0 |
| Belgium | 135.833913 | 99.184348 | 20135530.0 |


In [79]:
df2.query("Destination == 'Sweden'")

| | Destination | Hotel ceiling | Daily allowance | Date |
|---|---|---|---|---|
| 26 | Sweden | 187.0 | 117.0 | 20230101 |
| 54 | Sweden | 187.0 | 117.0 | 20220701 |
| 82 | Sweden | 187.0 | 117.0 | 20220101 |
| 110 | Sweden | 187.0 | 117.0 | 20210101 |
| 138 | Sweden | 187.0 | 117.0 | 20200101 |
| 166 | Sweden | 187.0 | 117.0 | 20190101 |
| 194 | Sweden | 187.0 | 117.0 | 20180101 |
| 222 | Sweden | 187.0 | 117.0 | 20170101 |
| 250 | Sweden | 187.0 | 117.0 | 20160910 |
| 278 | Sweden | 160.0 | 97.0 | 20160101 |


In [81]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Destination      622 non-null    object 
 1   Hotel ceiling    622 non-null    float64
 2   Daily allowance  622 non-null    float64
 3   Date             622 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 24.3+ KB


In [82]:
df2['Date'] = pd.to_datetime(df2['Date'], format="%Y%m%d")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df2['Date'] = pd.to_datetime(df2['Date'], format="%Y%m%d")
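
The `format="%Y%m%d"` argument tells pandas that each value is a four-digit year followed by a two-digit month and a two-digit day. A small sketch on toy values; converting to text first with `.astype(str)` is an extra precaution in this sketch and is not part of the notebook cell.

```python
import pandas as pd

# Dates stored as integers in YYYYMMDD form (toy values)
raw_dates = pd.Series([20220101, 20160910])

# Convert to text first, then parse using the explicit format string
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

# Datetime columns expose parts of the date through the .dt accessor
years = parsed.dt.year

print(parsed)
print(years)
```

Once the column is a real datetime, it can be sorted, filtered by date range, or grouped by year without any string handling.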


In [84]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 622 entries, 0 to 635
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Destination      622 non-null    object        
 1   Hotel ceiling    622 non-null    float64       
 2   Daily allowance  622 non-null    float64       
 3   Date             622 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 24.3+ KB


In [90]:
df_inflation = pd.read_csv("inflation_data.csv")
df_inflation.sample(3).T

| | 5780 | 6081 | 814 |
|---|---|---|---|
| STRUCTURE | dataflow | dataflow | dataflow |
| STRUCTURE_ID | ESTAT:PRC_HICP_MIDX(1.0) | ESTAT:PRC_HICP_MIDX(1.0) | ESTAT:PRC_HICP_MIDX(1.0) |
| STRUCTURE_NAME | HICP - monthly data (index) | HICP - monthly data (index) | HICP - monthly data (index) |
| freq | M | M | M |
| Time frequency | Monthly | Monthly | Monthly |
| unit | I05 | I05 | I05 |
| Unit of measure | Index, 2005=100 | Index, 2005=100 | Index, 2005=100 |
| coicop | CP11 | CP11 | CP11 |
| Classification of individual consumption by purpose (COICOP) | Restaurants and hotels | Restaurants and hotels | Restaurants and hotels |
| geo | NO | PL | CH |


In [95]:
for column in df_inflation.columns:
    display(df_inflation[column].unique())
#df_inflation["Time frequency"].unique()

array(['dataflow'], dtype=object)

array(['ESTAT:PRC_HICP_MIDX(1.0)'], dtype=object)

array(['HICP - monthly data (index)'], dtype=object)

array(['M'], dtype=object)

array(['Monthly'], dtype=object)

array(['I05'], dtype=object)

array(['Index, 2005=100'], dtype=object)

array(['CP11'], dtype=object)

array(['Restaurants and hotels'], dtype=object)

array(['AT', 'BE', 'BG', 'CH', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL', 'ES',
       'FI', 'FR', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV', 'MK',
       'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK', 'UK'],
      dtype=object)

array(['Austria', 'Belgium', 'Bulgaria', 'Switzerland', 'Cyprus',
       'Czechia', 'Germany', 'Denmark', 'Estonia', 'Greece', 'Spain',
       'Finland', 'France', 'Croatia', 'Hungary', 'Ireland', 'Iceland',
       'Italy', 'Lithuania', 'Luxembourg', 'Latvia', 'North Macedonia',
       'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania',
       'Sweden', 'Slovenia', 'Slovakia', 'United Kingdom'], dtype=object)

array(['2004-01', '2004-02', '2004-03', '2004-04', '2004-05', '2004-06',
       '2004-07', '2004-08', '2004-09', '2004-10', '2004-11', '2004-12',
       '2005-01', '2005-02', '2005-03', '2005-04', '2005-05', '2005-06',
       '2005-07', '2005-08', '2005-09', '2005-10', '2005-11', '2005-12',
       '2006-01', '2006-02', '2006-03', '2006-04', '2006-05', '2006-06',
       '2006-07', '2006-08', '2006-09', '2006-10', '2006-11', '2006-12',
       '2007-01', '2007-02', '2007-03', '2007-04', '2007-05', '2007-06',
       '2007-07', '2007-08', '2007-09', '2007-10', '2007-11', '2007-12',
       '2008-01', '2008-02', '2008-03', '2008-04', '2008-05', '2008-06',
       '2008-07', '2008-08', '2008-09', '2008-10', '2008-11', '2008-12',
       '2009-01', '2009-02', '2009-03', '2009-04', '2009-05', '2009-06',
       '2009-07', '2009-08', '2009-09', '2009-10', '2009-11', '2009-12',
       '2010-01', '2010-02', '2010-03', '2010-04', '2010-05', '2010-06',
       '2010-07', '2010-08', '2010-09', '2010-10', 

array([nan])

array([ 96.99,  97.86,  97.64, ..., 151.8 , 150.8 , 146.7 ])

array([nan])

array([nan, 'u', 'd', 'du'], dtype=object)

array([nan, 'low reliability', 'definition differs (see metadata)',
       'definition differs (see metadata), low reliability'], dtype=object)
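
The output of the loop shows that many columns hold a single value, or only `nan`, in every row. One possible next step, sketched here on a toy frame, is to keep only the columns that actually vary. The column names `period`, `value` and `empty` are hypothetical stand-ins, and all values are made up for illustration.

```python
import pandas as pd

# Toy frame mimicking the shape of the Eurostat export (values made up)
toy = pd.DataFrame(
    {
        "freq": ["M", "M", "M"],
        "unit": ["I05", "I05", "I05"],
        "geo": ["AT", "BE", "BE"],
        "period": ["2004-01", "2004-01", "2004-02"],  # hypothetical column name
        "value": [96.99, 97.86, 97.64],               # hypothetical column name
        "empty": [float("nan")] * 3,                  # hypothetical all-NaN column
    }
)

# Number of distinct values per column; dropna=False counts NaN as a value too
distinct = toy.nunique(dropna=False)

# Keep only the columns that differ between rows
informative = toy.loc[:, distinct > 1]

print(distinct)
print(informative)
```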