<img src="images/Logo_UCLL_ENG_RGB.png" style="background-color:white;" />

# Data Analytics & Machine learning

Lecturers: Aimée Lynn Backiel, Daan Nijs, and Kenric Borgelioen

Academic year 2024-2025

## Lab 3: Data analytics with Pandas

### Lecture outline

1. Recap last week
2. Introduction to the case
3. Data exploration using Pandas
   1. Univariate analysis
   2. Bivariate analysis

### Recap of last lecture(s)

#### Lab 1

1. We ensured we had a valid Python installation.
2. We learnt what a virtual environment is:
   * Isolated Python executable and packages.
   * We created a virtual environment.
3. Absolute path vs relative path recap.
4. Recap of data structures in Python

#### Lab 2
1. Installed Pandas
2. Learnt how to read data
3. Learnt how to calculate mean, mode, median etc.
4. Basic exploration of the 4 variables

### The case

Ada Turing Travelogue, or Ada as everyone calls her, just started working part-time at her parents' travel agency. She has a keen interest in, and understanding of, everything related to applied computer science, ranging from server & system management to full-stack software development. Through Database Foundations she already understands how to query data, and Programming 1 and 2 covered the essentials of the Python programming language. She has recently decided to start learning about data analytics & machine learning as well.

She uses her skills to connect to the travel agency's database, where she finds many normalized tables. Ada recalls what she learnt in Database Foundations and performs all the correct joins. Afterwards she saves the data in the `data/` folder.


She finds the following dataset:

| Column Name          | Description                                                                                       |
| -------------------- | ------------------------------------------------------------------------------------------------- |
| SalesID              | Unique identifier for each sale.                                                                  |
| Age                  | Age of the traveler.                                                                              |
| Country              | Country of origin of the traveler.                                                                |
| Membership_Status    | Membership level of the traveler in the booking system; could be 'standard', 'silver', or 'gold'. |
| Previous_Purchases   | Number of previous bookings made by the traveler.                                                 |
| Destination          | Travel destination chosen by the traveler.                                                        |
| Stay_length          | Duration of stay at the destination.                                                              |
| Guests               | Number of guests traveling (including the primary traveler).                                             |
| Travel_month         | Month in which the travel is scheduled.                                                           |
| Months_before_travel | Number of months prior to travel that the booking was made.                                       |
| Earlybird_discount   | Boolean flag indicating whether the traveler received an early bird discount.                     |
| Package_Type         | Type of travel package chosen by the traveler.                                                    |
| Cost                 | Calculated cost of the travel package.                                                            |
| Margin               | The cost (for the traveler) minus what the travel agency pays.                                    |
| Additional_Services_Cost | The amount spent on additional services (towels, car rentals, room service, ...) bought during the trip. |


### Helping Ada explore the dataset

The main goal for the remainder of this lab is to explore the data. We will specifically take four columns:

* `cost`
* `age`
* `stay_length`
* `destination`

Our goal is to find interesting relationships between them.

As was covered in the book and lecture, there are two main data types in analytics: categorical and continuous data. Identifying the type of each variable is a crucial first step in your analysis because it determines which methods make sense for your data.
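As a rough first check of which columns are numeric and which are not, you can ask Pandas how it stored each column. The sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from Ada's dataset). Keep in mind that the stored type is only a hint: a column of ID numbers is numeric to Pandas but is not a quantity you would average.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "destination": ["Paris", "Tokyo", "Paris", "Rome"],
    "age": [34, 51, 27, 45],
    "cost": [1200.0, 3400.5, 980.0, 1500.25],
})

# dtypes shows how Pandas stored each column
print(toy.dtypes)

# Split the column names into numeric and non-numeric ones
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(categorical_columns)
```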


**The goal is primarily to find out what influences the cost of the stay.**

### Introduction to Pandas

#### Reading and exploring data

In [None]:
import pandas as pd # by convention
pd.options.display.float_format = '{:.2f}'.format

In [None]:
travel_dataset = pd.read_csv("data/lab_3_dataset.csv")

One of the first things you typically do with a dataset is print out the first few rows. 

In [None]:
travel_dataset.head()

Accessing the columns is equally trivial.

In [None]:
travel_dataset.columns

In [None]:
travel_dataset[["country", "stay_length", "age", "cost"]]


To get multiple columns at once we need to pass in a list of column names.

In [None]:
columns = ["country", "stay_length", "age", "cost"]
travel_dataset_subset = travel_dataset[columns]
travel_dataset_subset

#### Data exploration: univariate

We will continue the exploration of our country, destination, stay_length, age and cost variables.

1. We will start with a univariate analysis, which means we will explore one (uni) variable (variate) at a time.
2. Later on we will move to bivariate analyses, which look at two (bi) variables at once.
3. We will round it off with methods that can handle multivariate analysis.

Recall that 

1. **Categorical variables** represent categories or labels (e.g., colors, genders). 
2. **Numeric variables** represent quantities and can be ordered or measured (e.g., age, height). 
3. There is a special case called **ordinal variables**: categories with a meaningful order (e.g., clothes sizes: small, medium and large).
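Pandas can be told about the order of an ordinal variable. The sketch below is one way to do that with `pd.Categorical`, using the clothes sizes from the example above (the data itself is made up). Once the order is known, operations such as `min`, `max` and sorting follow that order instead of the alphabet.

```python
import pandas as pd

# Hypothetical observations of an ordinal variable
sizes = pd.Series(["medium", "small", "large", "small"])

# Declare the categories and their order explicitly
ordered_sizes = pd.Categorical(
    sizes,
    categories=["small", "medium", "large"],
    ordered=True,
)

# min and max now respect the declared order
print(ordered_sizes.min(), ordered_sizes.max())

# Sorting also follows small < medium < large rather than alphabetical order
sorted_sizes = pd.Series(ordered_sizes).sort_values().tolist()
print(sorted_sizes)
```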

#### Summarizing numeric data

💻📊💡 TIP: the names of the functions are intuitive. For instance `dataframe[column].min()` gets the minimum of that column. It is equivalent to `select min(column) from dataframe` in SQL. Knowing SQL makes it easy to translate back and forth.

💻📊💡 TIP: become good friends with <a href=./Pandas_Cheat_Sheet.pdf>the cheat sheet</a> and the documentation.

In [None]:
travel_dataset["age"].mean()

In [None]:
travel_dataset["age"].median()

In [None]:
travel_dataset["age"].min()

In [None]:
travel_dataset["age"].max()

In [None]:
travel_dataset["age"].std()

###### Cost

In [None]:
travel_dataset["cost"].mean()

In [None]:
travel_dataset["cost"].min()

In [None]:
travel_dataset["cost"].max()

In [None]:
travel_dataset["cost"].median()

In [None]:
travel_dataset["cost"].std()

###### Stay length

In [None]:
travel_dataset["stay_length"].mean()

In [None]:
travel_dataset["stay_length"].min()

In [None]:
travel_dataset["stay_length"].max()

In [None]:
travel_dataset["stay_length"].median()

In [None]:
travel_dataset["stay_length"].std()

##### 💻📊💡 TIP: there are better ways to do this

In [None]:
# Good
travel_dataset[["age","stay_length", "cost"]].min()

In [None]:
import numpy as np

In [None]:
# Better
# Notice how we pass a list of function names to aggregate.
# Names as strings avoid the warning recent Pandas versions give for np.mean and friends.
travel_dataset[["age","stay_length", "cost"]].aggregate(["mean", "max", "median", "min"])

In [None]:
# Best
travel_dataset[["age","stay_length", "cost"]].describe()

##### ❓ Does anything strike you as odd? 

##### Summarizing categorical data

💻📊💡 TIP: `dataframe[column].value_counts()` is a very powerful method. It is equivalent to `select column, count(column) from dataframe group by column`. If you forget `value_counts()` exists you can get there using your SQL knowledge. `dataframe.groupby("column").count()` also gets you very close.

In [None]:
travel_dataset["country"].value_counts()

In [None]:
travel_dataset["destination"].value_counts()
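To make the link with `group by` concrete, here is a small sketch on made-up data comparing `value_counts()` with a `groupby`. It uses `.size()`, which counts rows per group; `.count()` instead counts the non-missing values in each of the other columns, which is why it only gets you "very close".

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"destination": ["Paris", "Tokyo", "Paris", "Rome", "Paris"]})

# value_counts sorts by frequency, most common first
counts = toy["destination"].value_counts()

# groupby + size gives the same numbers, sorted by category name
group_sizes = toy.groupby("destination").size()

print(counts)
print(group_sizes)
```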

#### Data exploration: bivariate

##### ❓ What pairs of variables do you think are interesting to look at?


##### ❓ What methods can you use to do this?

##### ❓ Carry out these analyses

💻📊💡 TIP: Things like  `dataframe[["column1", "column2"]].groupby("column1").agg(["min", "max"])` are valid Pandas.

 In SQL this would be `select column1, min(column2), max(column2) from dataframe group by column1`.

💻📊💡 TIP: `dataframe[columns].groupby(col1).describe()` is possible

💻📊💡 TIP: <a href=https://pandas.pydata.org/docs/reference/api/pandas.crosstab.html>Look at</a> `pd.crosstab()`
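The two tips above can be tried out on a small made-up table first. In this sketch the column `package_type` and all values are hypothetical: `groupby(...).agg(...)` summarizes a numeric column per category, and `pd.crosstab` counts how often each combination of two categorical columns occurs.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up
toy = pd.DataFrame({
    "destination": ["Paris", "Tokyo", "Paris", "Tokyo", "Rome"],
    "package_type": ["basic", "luxury", "luxury", "luxury", "basic"],
    "cost": [900.0, 4000.0, 2100.0, 3800.0, 1100.0],
})

# Numeric vs categorical: minimum and maximum cost per destination
cost_by_destination = toy.groupby("destination")["cost"].agg(["min", "max"])
print(cost_by_destination)

# Categorical vs categorical: how many bookings per combination
package_table = pd.crosstab(toy["destination"], toy["package_type"])
print(package_table)
```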

##### ❓ Are there combinations you find suspicious, if so which?

##### ❓ How would you deal with this?

####  Subsetting and cleaning data

Ada gets confirmation from her contacts that the data has some issues. The booking system broke down and produced large negative values. Additionally, the currency converter for the latest destination broke down.

We will help her rectify these mistakes.

#####  Removing data: boolean indexing

Generally it's a good idea to not alter your original dataset but filter it in a copy. The way Pandas does this is by filtering with a boolean mask. We will demonstrate step by step how to do this.

The data we want gone are the rows where the cost equals `-1000000.00`. We can check for each row if that's the case. 

Pandas uses a technique called "broadcasting": if you combine values of different shapes, for instance a whole column and a single number, Pandas will try to expand one of them so the shapes match.

This means you don't need to explicitly turn the value into an array of the same size.

<center>
<img src="https://numpy.org/doc/stable/_images/broadcasting_1.png" style="background-color:white;">
</center>
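As a small illustration of broadcasting, the sketch below compares a made-up column against a single number and against an explicitly expanded array of the same length. Both give the same boolean mask, which is why the expansion never has to be written out by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical cost values, including one error value
cost = pd.Series([1200.0, -1000000.0, 850.0])

# Comparing with a scalar: the scalar is broadcast to every row
mask_scalar = cost != -1000000.0

# The same comparison with the scalar expanded by hand
mask_explicit = cost != np.full(len(cost), -1000000.0)

print(mask_scalar.tolist())
print(mask_scalar.equals(mask_explicit))
```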

In [None]:
non_errors = travel_dataset["cost"] != -1000000.00
non_errors

Pandas allows you to give a boolean array as an index to filter out rows. The rows where the boolean array (also known as a mask) is `True` are kept. 

In [None]:
travel_dataset[non_errors] 

In [None]:
len(travel_dataset) - len(travel_dataset[non_errors])

In [None]:
_dataset_cleaned = travel_dataset[non_errors]

##### ❓ Make a mask for the people that traveled to Tokyo. Call the variable tokyo.

#####  Updating rows the right way

We can now grab the rows where the destination is Tokyo as follows

In [None]:
_dataset_cleaned[tokyo]

This gives us a filtered dataset we can now grab the cost column from

In [None]:
_dataset_cleaned[tokyo]["cost"]

Now that we have the rows, we can convert the currency by dividing by 158. You will likely attempt to do the following:

`_dataset_cleaned[tokyo]["cost"] /= 158`

In [None]:
_dataset_cleaned[tokyo]["cost"] /= 1

This gives us the same error as we encountered in the previous session. The way to get around this is by using `df.loc`.

`df.loc` can be quite confusing. It is an object you can slice by using square brackets `[]`. The first element selects rows and refers to the **index**: you can pass index labels or a boolean mask. The index is typically shown as the column on the far left.

In [None]:
_dataset_cleaned.index # Each dataframe has an index

In [None]:
tokyo.index # Each series has an index too

In [None]:
_dataset_cleaned.loc[tokyo] # Loc filters the dataset where Tokyo is true

We can also give columns or a list of columns to `.loc`

In [None]:
_dataset_cleaned.loc[tokyo, "cost"]
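Because `.loc` works with index labels rather than positions, the difference becomes visible once rows have been filtered out. The following sketch on made-up data (all names and values are hypothetical) shows that the labels survive filtering, how `.loc` differs from the position-based `.iloc`, and how a mask and a column name combine.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "destination": ["Paris", "Tokyo", "Rome", "Tokyo"],
    "cost": [900.0, 316000.0, 1100.0, 474000.0],
})

# Filtering keeps the original index labels: 1, 2 and 3 remain
filtered = toy[toy["cost"] > 1000]
print(filtered.index.tolist())

# .loc looks up by label, .iloc by position
row_by_label = filtered.loc[2]      # the row labelled 2
row_by_position = filtered.iloc[0]  # the first row, which has label 1

# A boolean mask for the rows plus a column name
is_expensive = filtered["cost"] > 100000
expensive_costs = filtered.loc[is_expensive, "cost"]
print(expensive_costs.tolist())
```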

It is a good habit to make a copy before you change your data. If you don't do this and make a mistake you need to rerun your entire script. For a small dataset like this it's not a big problem, but as you scale it can become a bottleneck.

In [None]:
cleaned_dataset = _dataset_cleaned.copy()

In [None]:

cleaned_dataset.loc[tokyo, "cost"] /= 158
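
To see what the copy buys you, here is a minimal sketch on made-up data (the names `original`, `working_copy` and `needs_conversion` are hypothetical). The update through `.loc` changes only the copy, so the starting point stays available if the conversion turns out to be wrong.

```python
import pandas as pd

# Hypothetical toy data: the second cost is in the wrong currency
original = pd.DataFrame({
    "destination": ["Paris", "Tokyo"],
    "cost": [900.0, 316000.0],
})

# Work on an independent copy
working_copy = original.copy()

# Select the rows to fix and update them in place with .loc
needs_conversion = working_copy["cost"] > 100000
working_copy.loc[needs_conversion, "cost"] /= 158

print(original["cost"].tolist())      # unchanged
print(working_copy["cost"].tolist())  # second value converted
```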

In [None]:
cleaned_dataset

We will briefly look at the impact of what we did on our analysis

In [None]:
cleaned_dataset[["destination", "cost"]].groupby("destination").agg(["min", "max", "median", "mean", "std"])

In [None]:
cleaned_dataset[["age", "cost", "stay_length"]].corr()

In [None]:
cleaned_dataset[["cost", "destination", "age", "stay_length"]].groupby("destination").mean()

##### ❓ What changed?  

##### ❓ How did the outliers impact our analysis?


##### ❓ How do you interpret the correlations?


<center>
Anscombe's quartet: four datasets with an equal correlation.
<br>
<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/1200px-Anscombe%27s_quartet_3.svg.png" style="width:50%">
</center>
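
The point of the quartet can be checked numerically. The sketch below types in the first two of Anscombe's four datasets (they share the same x values) and computes the correlation for each. The first is a noisy straight line and the second a smooth curve, yet both correlations come out at roughly 0.82, so a correlation on its own says little about the shape of a relationship.

```python
import pandas as pd

# The first two datasets of Anscombe's quartet
anscombe = pd.DataFrame({
    "x": [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
})

# Pearson correlation of x with each y
r_linear = anscombe["x"].corr(anscombe["y1"])
r_curved = anscombe["x"].corr(anscombe["y2"])

print(round(r_linear, 2), round(r_curved, 2))
```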

### Introduction to plotting with Matplotlib, Seaborn and Plotly

We have helped Ada so far to gain insights into her data by wrangling it into shape and making tables to summarize data. Now, to further enhance our understanding and visualize the patterns, trends, and potential anomalies, we will be plotting the data. 

Making data visual simplifies complex datasets and also makes it more intuitive for stakeholders to grasp key takeaways. By transitioning from tabular summaries to graphical plots, we can also communicate more effectively.

#### Matplotlib

<center>
<img src="https://matplotlib.org/stable/_images/sphx_glr_logos2_003.png" style="background-color:white">
</center>

The name matplotlib comes from matrix plotting library. Its interface was modelled on the plotting commands of MATLAB. First released in 2003, it is by now an older library with some quirks, but it is still important to know the basics of Matplotlib since other Python plotting libraries, such as Seaborn, build on top of it.

In [None]:
# uncomment to install
# %pip install matplotlib 

In [None]:
import matplotlib.pyplot as plt # convention

##### Plotting univariate data


The table below is a summary of the different types of plots for **numeric data**.

| Plot Type          | Description                                           | When to Use                                                      |
|--------------------|-------------------------------------------------------|------------------------------------------------------------------|
| **Histogram**      | Displays the distribution of a single continuous variable by dividing the data into bins and showing the frequency of observations in each bin. | To visualize the distribution of a variable, especially to identify its central tendency (mean), spread (standard deviation), and skewness (are low or high values more common).  |
| **Box Plot (or Whisker Plot)** | Shows the distribution of a variable using quartiles and displays potential outliers. | To get a summary of a variable's distribution in terms of its median, quartiles, and possible outliers. Useful when comparing the distribution across categories. |
| **Density Plot (or Kernel Density Plot)** | Provides a smoothed version of a histogram. | To visualize the distribution of a variable in a continuous manner. Particularly useful when comparing the distributions of multiple variables on the same plot. |
| **Violin Plot**    | Combines aspects of box plots and density plots.       | To visualize both the distribution and summary statistics of a variable. Especially useful when comparing across different categories. |


The syntax for plotting is generally `plt.<plotType>(x, y)`. 

In [None]:
plt.boxplot(cleaned_dataset["cost"]); # Matplotlib prints things while plotting, the semicolon can suppress it.

A boxplot provides a comprehensive view of a dataset's distribution, offering more detailed insights than typical tables. The central line within the box represents the median, splitting the data into its lower and upper halves. The box itself is framed by two lines: the lower boundary represents the 25th percentile (or Q1), meaning 25% of the data lies below this value, and the upper boundary denotes the 75th percentile (or Q3), indicating that 75% of the data is below this point.

The range between Q1 and Q3 is known as the interquartile range (IQR). Beyond the box, the plot extends 'whiskers'. By default, Matplotlib draws each whisker out to the most extreme data point that still lies within `1.5 * IQR` of the box, above Q3 and below Q1, which gives a range for typical data points. Any data outside these whiskers can be considered outliers.
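
The numbers behind a boxplot can also be computed by hand. The following sketch does so for a small made-up cost column (the values are hypothetical) using `quantile`, and flags the points that fall outside the `1.5 * IQR` limits.

```python
import pandas as pd

# Hypothetical cost values with one suspiciously large booking
cost = pd.Series([800, 950, 1000, 1100, 1200, 1300, 1500, 5000], dtype=float)

q1 = cost.quantile(0.25)  # 25th percentile, lower edge of the box
q3 = cost.quantile(0.75)  # 75th percentile, upper edge of the box
iqr = q3 - q1

# Limits beyond which points are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = cost[(cost < lower_limit) | (cost > upper_limit)]

print(q1, q3, iqr)
print(lower_limit, upper_limit)
print(outliers.tolist())
```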

##### ❓ What other plots would make sense? Make them.

For **categorical data**, categories can often serve as a basis for comparison in other plots, like boxplots. This means you can use a single category to differentiate data within such plots. You can also produce the same type of plot multiple times, once for each category, to analyze patterns within individual categories.



| Plot Type     | Description                                          | When to Use                                         |
|---------------|------------------------------------------------------|-----------------------------------------------------|
| **Count Plot**| Represents the frequency or count of each category.  | To see how often each category appears in the data. |