# **Session 2**



# **Class content for Session 2**

1.   Discussion of the homework
2.   Extraction of specific rows (observations) and columns (variables).
3.   Visualization of *two* quantitative variables


We again use the ski data set.

In [None]:
# Loading packages and data
import pandas as pd
import seaborn as sns

df_ski = pd.read_csv('https://www.dropbox.com/scl/fi/pgfw6h0x9pxbr7c01jg6k/SkiData-FRG.csv?rlkey=yupn270yn4s4xnigoeasv084e&dl=1')
df_ski

# **Part 1 -- Extracting observations and variables (Boolean masks)**

We first explore how to select the observations that meet specific criteria. This is a basic tool for manipulating data, and we will use it in particular to improve the scatterplots we introduced last time.

## Conditional statements

* Symbols to compare numbers:

==, !=, >, >=, <, <=
* Equality == can be used to compare 'anything' (any type of objects: vectors, tables, matrices, datasets, etc.)
* The outcome is a Boolean value: **False** or **True**



In [None]:
# Outcome can be either 'True' or 'False'
(5 < 6)

In [None]:
# Compare to:
(5 == 6)

In [None]:
# Tests may be combined
#   & stands for 'and'
#   | stands for 'or'

print('(2<3) & (3*4 <11) -->', (2<3) & (3*4 <11))
print('(2<3) | (3*4 <11) -->', (2<3) | (3*4 <11))
print('(1==2) == (2==3)  -->',(1==2) == (2==3))
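
The parentheses around each test matter, because `&` and `|` are evaluated before the comparison symbols. The sketch below illustrates this on a small pandas series and on plain numbers; the series `s` and its values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy values, only for illustration
s = pd.Series([1, 4, 7, 10])

# Each comparison is wrapped in parentheses before combining with & or |
between = (s > 2) & (s < 9)
outside = (s <= 2) | (s >= 9)

print(between.tolist())  # [False, True, True, False]
print(outside.tolist())  # [True, False, False, True]

# With plain numbers, dropping the parentheses silently changes the meaning:
# 2 < 3 & 4 < 5 is read as 2 < (3 & 4) < 5, and 3 & 4 equals 0
with_parentheses = (2 < 3) & (4 < 5)
without_parentheses = 2 < 3 & 4 < 5
print(with_parentheses, without_parentheses)  # True False
```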


In [None]:
1 == True, 0 == False

In [None]:
# A useful trick: converting True -> 1, False -> 0
True + 0, False + 0

## Extracting rows

In [None]:
# Stations from Vosges only:
row_mask = (df_ski['Mountain'] == 'Vosges')
row_mask

`row_mask` is a pandas Series with the same index as the original data set and a Boolean value (True/False) in each entry.

In [None]:
# The following command returns the observations of df_ski for which the entry of row_mask is True:
df_ski[row_mask]

In [None]:
# You may also often encounter the following more direct way of coding the same thing:
df_ski[df_ski['Mountain'] == 'Vosges']

In [None]:
# Let's play a bit with the row_mask:
# We may force the conversion of False/True into numbers 0/1 as follows:
row_mask+0

In [None]:
# Note that the following command raises an error: the 0/1 values are read as column labels, not as a Boolean mask
df_ski[row_mask+0]

In [None]:
# How many stations in the Vosges?
row_mask.sum()
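
Since True counts as 1 and False as 0, a mask can be summed to count observations and averaged to get a proportion. Here is a minimal sketch on a hypothetical toy data set (`df_toy` and its values are made up for the illustration, they are not the ski data):

```python
import pandas as pd

# Hypothetical toy data set (not the ski data)
df_toy = pd.DataFrame({
    'Station': ['A', 'B', 'C', 'D', 'E'],
    'Mountain': ['Vosges', 'Jura', 'Vosges', 'Alpes du Nord', 'Jura'],
})

mask = (df_toy['Mountain'] == 'Vosges')

n_vosges = mask.sum()        # number of True entries
share_vosges = mask.mean()   # proportion of True entries
not_vosges = df_toy[~mask]   # ~ negates the mask: all other stations

# A 0/1 series can be turned back into a Boolean mask with astype(bool)
mask_again = (mask + 0).astype(bool)

print(n_vosges, share_vosges)          # 2 0.4
print(not_vosges['Station'].tolist())  # ['B', 'D', 'E']
print(mask_again.equals(mask))         # True
```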

## Exercise 1: ski dataset, continued

1. Define the variable **ElevationGain** as the difference between **AltitudeTop** and **AltitudeDown**, and display all stations with an elevation gain smaller than 150 meters.

2. Consider stations with 10 or fewer slopes. How many such stations are in the data set? Produce a histogram of the prices they charge. Which is the outlier station?

## Extracting rows and columns.

In [None]:
# We have already seen how to extract specific columns, by specifying a list (between brackets) of the relevant variables
# For instance,


df_ski[['Mountain','AltitudeTop']]

To extract *simultaneously* rows and columns, we use the following approach.

* Assume that we are given a mask on the rows `row_mask`
* Assume that we are given a mask on the columns `column_mask`

We can select the corresponding observations (rows) and variables (columns) using the following command:

`df_ski.loc[row_mask,column_mask]`
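
As a minimal sketch of this command, the example below uses a hypothetical toy data set (`df_toy` is made up for the illustration). It assumes the column mask may be given either as a list of column names or as a Boolean list with one entry per column; both forms are accepted by `.loc`.

```python
import pandas as pd

# Hypothetical toy data set (not the ski data)
df_toy = pd.DataFrame({
    'Station': ['A', 'B', 'C', 'D'],
    'Mountain': ['Vosges', 'Jura', 'Vosges', 'Pyrénées'],
    'Price': [20.0, 25.0, 18.0, 35.0],
})

row_mask = (df_toy['Price'] < 30)

# Columns given as a list of names
column_mask = ['Station', 'Price']
subset = df_toy.loc[row_mask, column_mask]

# Columns given as a Boolean list, one entry per column
bool_column_mask = [True, False, True]
same_subset = df_toy.loc[row_mask, bool_column_mask]

print(subset)
print(subset.equals(same_subset))  # True
```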

To illustrate, we create a new variable `Range` that groups the mountains into only three categories: Alpes, Pyrénées, and the older (and lower) mountains.

* `Range = Alpes` **if** `Mountain = Alpes du Nord` or `Mountain = Alpes du Sud`
* `Range = Other` **if** `Mountain = Massif Central`, `Mountain = Vosges` or `Mountain = Jura`
* `Range = Pyrénées` **if** `Mountain = Pyrénées`

To do this, we first create the new column with all values set to the default *Other*, then change this value to Pyrénées or Alpes wherever relevant.


In [None]:
df_ski['Range'] = 'Other'

# For stations in the Pyrénées, we change this variable to Pyrénées
row_mask = (df_ski['Mountain'] == 'Pyrénées')
df_ski.loc[row_mask,'Range'] = 'Pyrénées'

# For stations in the Alpes, we change it to Alpes
row_mask = (df_ski['Mountain'] == 'Alpes du Nord') | (df_ski['Mountain'] == 'Alpes du Sud')
df_ski.loc[row_mask,'Range'] = 'Alpes'
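
The same recoding can also be written with a dictionary, which may be easier to read when there are many categories. This is only an alternative sketch, not the approach used in this session; `df_toy` and `range_of` are hypothetical names introduced for the illustration.

```python
import pandas as pd

# Hypothetical toy data set reusing the mountain names of the ski data
df_toy = pd.DataFrame({
    'Mountain': ['Alpes du Nord', 'Vosges', 'Pyrénées', 'Alpes du Sud', 'Jura'],
})

# Only the non-default categories are listed
range_of = {
    'Alpes du Nord': 'Alpes',
    'Alpes du Sud': 'Alpes',
    'Pyrénées': 'Pyrénées',
}

# map() returns NaN for mountains missing from the dictionary;
# fillna() then assigns them the default value 'Other'
df_toy['Range'] = df_toy['Mountain'].map(range_of).fillna('Other')

print(df_toy['Range'].tolist())
# ['Alpes', 'Other', 'Pyrénées', 'Alpes', 'Other']
```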

In [None]:
# Outcome:
df_ski

In [None]:
# We do not see much, let's sample stations at random:
df_ski.sample(10)

# **Part 2: Improving scatterplots**

Last time, we produced a scatterplot to visualize the relation between the price and the number of slopes (see next cell). We noticed that the presence of outliers makes this relation difficult to see.

In [None]:
df_ski.plot(x='Slopes', y='Price', kind='scatter')

We now apply the tools we have just introduced to improve this graph.

### Method 1: suppressing outliers in a scatterplot

In [None]:
# Seaborn syntax:
sns.scatterplot(data=df_ski[df_ski['Slopes']<200], x='Slopes', y='Price')

### Method 2: Transforming the variables

The above scatterplot points to a clear association between the number of slopes and the price, but not to a linear relation.

In such cases, one may try and transform existing variables in the hope that simpler relations appear. Here, a logarithm looks suitable:

In [None]:
# We import the numpy library for scientific computations
# The logarithm function is np.log
import numpy as np

np.log(2)

In [None]:
df_ski['LnSlopes']= np.log(df_ski['Slopes'])
df_ski

In [None]:
# Now the association looks linear!
sns.scatterplot(data=df_ski[df_ski['Slopes']<200], x='LnSlopes', y='Price')
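
To see why a logarithm can straighten such a relation, here is a small sketch on hypothetical data, constructed so that the price is exactly a linear function of the logarithm of the number of slopes. The numbers in `df_toy` are made up; only the comparison of the two correlations matters.

```python
import numpy as np
import pandas as pd

# Hypothetical data built so that Price = 10 + 5 * ln(Slopes) exactly
df_toy = pd.DataFrame({'Slopes': [5, 10, 20, 40, 80, 160]})
df_toy['Price'] = 10 + 5 * np.log(df_toy['Slopes'])
df_toy['LnSlopes'] = np.log(df_toy['Slopes'])

# Linear correlation of Price with the raw and the transformed variable
corr_raw = df_toy['Slopes'].corr(df_toy['Price'])
corr_log = df_toy['LnSlopes'].corr(df_toy['Price'])

print(round(corr_raw, 3), round(corr_log, 3))
```

The correlation with `LnSlopes` equals 1 up to rounding, while the correlation with `Slopes` is lower. Note that `np.log` is the natural logarithm and returns `-inf` (with a warning) for a value of 0, so this transformation only suits strictly positive variables.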

### Method 3: Coloring points by a categorical variable

The above plot shows a relation between the logarithm of the number of slopes and the price. How does this relation depend on the mountain of the station?

In [None]:
sns.scatterplot(data=df_ski[df_ski['Slopes']<200], x='LnSlopes', y='Price',hue = 'Mountain')

In [None]:
# There are simply too many colors in the previous plot.
# To improve upon it, we instead use the mountain range (variable Range)
sns.scatterplot(data=df_ski[df_ski['Slopes']<200], x='LnSlopes', y='Price', hue='Range')
# This is much more readable than when using all the different mountains!

## Exercise 2

1. Is **AltitudeDown** influential for the **Price**? Answer based on a well-chosen picture.

2. Do some mountain ranges offer better deals in terms of prices for elevation gain?

# **Part 3: Single quantitative variable by a categorical variable**

Scatterplots with hue are a great tool to visualize the relation between *two* quantitative variables, conditional on *one* categorical variable.

Here, we address the visualization of *one* quantitative variable, conditional on a categorical one. The chief objective is to assess visually the extent to which the distribution of the quantitative variable depends on the value of the categorical one. That is, to figure out the conditional distribution of the quantitative variable, given the categorical one.

We use the versatile *catplot* tool from seaborn.

#### Conditional boxplots

In [None]:
sns.catplot(data = df_ski, x = 'Range', y ='Price', kind = 'box')

#### Conditional barplots (with standard errors)

In [None]:
# Mean plots: heights = sample means; error bars = 95% confidence intervals for the mean
# (seaborn's default computes them by bootstrap; they are approximately mean +/- 1.96 * standard error,
# where standard error = standard deviation / sqrt(sample size))
sns.catplot(data=df_ski, x='Range', y='Price', kind='bar')
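
The quantities behind such a plot can also be computed by hand. The sketch below uses hypothetical toy prices (`df_toy` is made up) and the normal approximation mean ± 1.96 × standard error; since seaborn's default bars come from a bootstrap, they need not match these values exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices (not the ski data)
df_toy = pd.DataFrame({
    'Range': ['Alpes', 'Alpes', 'Alpes', 'Alpes', 'Other', 'Other', 'Other', 'Other'],
    'Price': [30.0, 34.0, 38.0, 42.0, 18.0, 20.0, 22.0, 24.0],
})

# Sample mean, sample standard deviation and sample size per category
stats = df_toy.groupby('Range')['Price'].agg(['mean', 'std', 'count'])

# Standard error of the mean, and normal-approximation 95% interval
stats['se'] = stats['std'] / np.sqrt(stats['count'])
stats['lower'] = stats['mean'] - 1.96 * stats['se']
stats['upper'] = stats['mean'] + 1.96 * stats['se']

print(stats.round(2))
```

In this toy example the two intervals do not overlap, which is the kind of pattern one reads as a significant difference in average prices.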

How to draw conclusions from such a graph:

<font color='blue'>We read significant differences in average prices between resorts in the 'Other' category vs. 'Alpes' and 'Pyrénées', and no significant difference in average prices between 'Alpes' and 'Pyrénées'.</font>

## Exercise 3

1. Illustrate how **AltitudeTop** varies across mountain ranges.

2. Do some mountain ranges offer better deals in terms of prices for elevation gain? Answer using a barplot.

## Numerical statistics (advanced topic)

Commands are of the following form, featuring **groupby**:

In [None]:
df_ski.groupby('Range')['Price'].describe()
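
To make the structure of this output concrete, here is a minimal sketch on hypothetical toy prices (`df_toy` is made up for the illustration). The quartiles reported by `describe()` are the numerical counterpart of what a conditional boxplot draws.

```python
import pandas as pd

# Hypothetical toy prices (not the ski data)
df_toy = pd.DataFrame({
    'Range': ['Alpes', 'Alpes', 'Alpes', 'Other', 'Other', 'Other'],
    'Price': [30.0, 36.0, 45.0, 18.0, 20.0, 25.0],
})

summary = df_toy.groupby('Range')['Price'].describe()

# describe() returns one row per category and one column per statistic
print(summary.columns.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Individual statistics are also available directly
medians = df_toy.groupby('Range')['Price'].median()
print(medians.to_dict())  # {'Alpes': 36.0, 'Other': 20.0}
```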

# **Wrap up/Summary of main commands**

Today, we
- extracted subsets of rows and/or columns from a data set,
    - based on logical conditions, defining masks:

    `df[row_mask]`

    `df.loc[row_mask, column_mask]`

    - and applied all this to zoom in on scatter plots (removing outliers);
- displayed pairs of quantitative variables by a categorical variable:

    `sns.scatterplot(data = df, x = 'variable_name', y = 'variable_name', hue = 'variable_name')`

- saw how to graphically (and numerically) analyze a quantitative variable by a categorical variable, with box plots

    `sns.catplot(data = df, x = 'variable_name', y = 'variable_name', kind = 'box')`

    and barplots:

    `sns.catplot(data = df, x = 'variable_name', y = 'variable_name',kind= 'bar')`

- Don't forget the **homework 2** for next time!
