# Probability - Basic Concepts and Random Variables




**The LasVegas.csv file, sourced from the UCI repository (https://archive-beta.ics.uci.edu/dataset/397/las+vegas+strip), contains TripAdvisor reviews written by customers of 21 hotels in Las Vegas. We will use this dataset for basic exercises on probabilities and random variables. The dataset is clean and ready for analysis: it has no missing values and the columns are already in a suitable format.**

In [1]:
# Libraries
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("data/LasVegas.csv", delimiter = ";")

In [3]:
# First rows
df.head()

Unnamed: 0,User country,Nr. reviews,Nr. hotel reviews,Helpful votes,Score,Period of stay,Traveler type,Pool,Gym,Tennis court,Spa,Casino,Free internet,Hotel name,Hotel stars,Nr. rooms,User continent,Member years,Review month,Review weekday
0,USA,11,4,13,5,Dec-Feb,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,9,January,Thursday
1,USA,119,21,75,3,Dec-Feb,Business,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,3,January,Friday
2,USA,36,9,25,5,Mar-May,Families,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,2,February,Saturday
3,UK,14,7,14,4,Mar-May,Friends,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,Europe,6,February,Friday
4,Canada,5,5,2,4,Mar-May,Solo,NO,YES,NO,NO,YES,YES,Circus Circus Hotel & Casino Las Vegas,3,3773,North America,7,March,Tuesday


In [4]:
# Columns
df.columns

Index(['User country', 'Nr. reviews', 'Nr. hotel reviews', 'Helpful votes',
       'Score', 'Period of stay', 'Traveler type', 'Pool', 'Gym',
       'Tennis court', 'Spa', 'Casino', 'Free internet', 'Hotel name',
       'Hotel stars', 'Nr. rooms', 'User continent', 'Member years',
       'Review month', 'Review weekday'],
      dtype='object')

## Exercise 1

### 1.1 Contingency table
**The "Traveler type" variable indicates the type of traveler, classified as Business, Couples, Families, Friends, or Solo (depending on whether they stayed at the hotel for business, as a couple, with family, with friends, or alone). The "Period of stay" variable indicates in which quarter the trip was made. We will create a contingency table between the "Traveler type" and "Period of stay" variables.**

In [5]:
contingency_table = pd.crosstab(df['Traveler type'], df['Period of stay'])

contingency_table

Period of stay,Dec-Feb,Jun-Aug,Mar-May,Sep-Nov
Traveler type,,,,
Business,24,10,20,20
Couples,51,50,54,59
Families,27,37,24,22
Friends,15,21,24,22
Solo,7,8,6,3
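
The probabilities in the next section are all ratios of cells to row, column, or grand totals, so it can help to see those totals next to the counts. The following is a minimal sketch that re-enters the counts from the table above by hand (the name `counts` is only illustrative); on the original DataFrame, `pd.crosstab(..., margins=True)` should give the same totals directly.

```python
import pandas as pd

# Counts copied by hand from the contingency table above
# (rows: traveler type, columns: period of stay).
counts = pd.DataFrame(
    {
        "Dec-Feb": [24, 51, 27, 15, 7],
        "Jun-Aug": [10, 50, 37, 21, 8],
        "Mar-May": [20, 54, 24, 24, 6],
        "Sep-Nov": [20, 59, 22, 22, 3],
    },
    index=["Business", "Couples", "Families", "Friends", "Solo"],
)

# Append row totals, then column totals (the last cell is the grand total).
with_totals = counts.copy()
with_totals["Total"] = counts.sum(axis=1)
with_totals.loc["Total"] = with_totals.sum(axis=0)

print(with_totals)
```

The bottom-right cell is the number of reviews in the dataset, which is the denominator of every joint probability computed below.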


### 1.2 Basic Probability

**If we choose an individual from the database at random, what is the probability that it corresponds to a customer who has stayed alone (Solo) and between June and August?**

![figure1](mathfigures/figure1.png)

In [6]:
filtered_df = df[(df['Traveler type'] == 'Solo') & (df['Period of stay'] == 'Jun-Aug')]

probability = len(filtered_df) / len(df)

print(f"The probability is: {probability:.6f}")

The probability is: 0.015873


**Given that a customer is a business traveler, what is the probability that they stayed between December and February?**

![figure2](mathfigures/figure2.png)

In [7]:
filtered_business_df = df[(df['Traveler type'] == 'Business') & (df['Period of stay'] == 'Dec-Feb')]

business_probability = len(filtered_business_df) / len(df[df['Traveler type'] == 'Business'])

print(f"The probability is: {business_probability:.6f}")

The probability is: 0.324324


**What is the probability that a customer who stayed between March and May traveled as a couple?**

![figure3](mathfigures/figure3.png)

In [8]:
filtered_couples_df = df[(df['Traveler type'] == 'Couples') & (df['Period of stay'] == 'Mar-May')]

couples_probability = len(filtered_couples_df) / len(df[df['Period of stay']== 'Mar-May'])

print(f"The probability is: {couples_probability:.6f}")

The probability is: 0.421875
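
The three answers above differ only in their denominator: the joint probability divides by the total number of reviews, while the two conditional probabilities divide by a row total or a column total. As a cross-check, here is a sketch that recomputes all three from the contingency table, with the counts re-entered by hand (variable names are illustrative, not part of the original notebook).

```python
import pandas as pd

# Counts copied by hand from the contingency table in section 1.1.
counts = pd.DataFrame(
    {
        "Dec-Feb": [24, 51, 27, 15, 7],
        "Jun-Aug": [10, 50, 37, 21, 8],
        "Mar-May": [20, 54, 24, 24, 6],
        "Sep-Nov": [20, 59, 22, 22, 3],
    },
    index=["Business", "Couples", "Families", "Friends", "Solo"],
)

total = counts.to_numpy().sum()

# Joint probabilities: P(type and period)
joint = counts / total
# Conditional on the row: P(period | type)
given_type = counts.div(counts.sum(axis=1), axis=0)
# Conditional on the column: P(type | period)
given_period = counts.div(counts.sum(axis=0), axis=1)

p_solo_and_jun_aug = joint.loc["Solo", "Jun-Aug"]
p_dec_feb_given_business = given_type.loc["Business", "Dec-Feb"]
p_couples_given_mar_may = given_period.loc["Couples", "Mar-May"]

print(f"P(Solo and Jun-Aug)   = {p_solo_and_jun_aug:.6f}")
print(f"P(Dec-Feb | Business) = {p_dec_feb_given_business:.6f}")
print(f"P(Couples | Mar-May)  = {p_couples_given_mar_may:.6f}")
```

On the raw DataFrame, `pd.crosstab` with `normalize="all"`, `normalize="index"`, or `normalize="columns"` should produce the same three tables without the manual division.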


## Exercise 2

### 2.1 Random Variables

**If we choose a customer from the database at random, what is the probability that they traveled between September and November?**

![figure4](mathfigures/figure4.png)

In [9]:
filtered_period_df = df[df['Period of stay'] == 'Sep-Nov']

period_probability = len(filtered_period_df) / len(df)

print(f"The probability is: {period_probability:.2f}")

The probability is: 0.25


**Suppose we randomly select 5 customers from the database, with replacement, and consider the random variable X that counts how many of the 5 traveled between September and November.**

- **What distribution does this variable follow? What are its parameters?**

The variable indicating the number of customers, among the 5, who traveled between September and November follows a binomial distribution.

The reasoning is that we are performing a series of independent trials (the selection of 5 customers, with replacement) with two possible outcomes: the customer traveled between September and November or did not. Each trial has the same probability of success (traveling between September and November).

Parameters of the binomial distribution:

- n = number of trials = 5 (since we are choosing 5 customers)
- p = probability of success in a single trial = probability that a customer traveled between September and November.

To get *p*, we compute the proportion of customers in the dataset who traveled between September and November (reusing `filtered_period_df` from the previous cell):

In [10]:
p = len(filtered_period_df) / len(df)

print(f"The probability p (success in a single trial) is: {p:.4f}")

The probability p (success in a single trial) is: 0.2500
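
With both parameters fixed, the whole distribution can be tabulated at once. The sketch below uses `scipy.stats.binom`, which the original notebook does not use, and hard-codes `p = 0.25` from the output above; it is meant only as an illustration of what "binomial with n = 5 and p = 0.25" looks like.

```python
from scipy.stats import binom

n = 5
p = 0.25  # proportion of Sep-Nov stays, taken from the output above

# Frozen distribution object for X ~ Binomial(n, p)
X = binom(n, p)

# Probability of each possible number of successes, 0 through 5
pmf = {k: X.pmf(k) for k in range(n + 1)}
for k, prob in pmf.items():
    print(f"P(X = {k}) = {prob:.5f}")

# Expected value n*p and variance n*p*(1-p)
print(f"mean = {X.mean():.4f}, variance = {X.var():.4f}")
```

The six probabilities sum to 1, and the entries for k = 0 and k = 3 are the quantities asked for in the questions that follow.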


**What is the probability that none of the selected customers traveled between September and November?**

For the binomial distribution, the probability of observing exactly \( k \) successes in \( n \) trials is:
\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Where:
- \( n \) = number of trials (5 in this case)
- \( k \) = number of successes (0 in this case, since we want the probability that none of the customers traveled in that period)
- \( p \) = probability of success in a single trial (probability that a customer traveled between September and November)

For \( k = 0 \) we have \( \binom{n}{0} = 1 \) and \( p^0 = 1 \), so the formula simplifies to:
\[ P(X = 0) = (1-p)^n \]




![figure5](mathfigures/figure5.png)

In [11]:
# P(X = 0) = (1 - p)^n, with p computed in the previous cell
n = 5

probability_none_traveled = (1 - p)**n
print(f"The probability that none of the selected customers traveled between September and November is: {probability_none_traveled:.5f}")

The probability that none of the selected customers traveled between September and November is: 0.23730


**What is the probability that exactly 3 out of the 5 selected customers traveled between September and November?**

To compute the probability that exactly 3 out of the 5 selected customers traveled between September and November, we use the binomial probability formula \( P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \) with \( k = 3 \).

Given:
- \( n \) = number of trials (5 in this case)
- \( k \) = number of successes (3 in this case)
- \( p \) = probability of success in a single trial (probability that a customer traveled between September and November)

![figure6](mathfigures/figure6.png)

In [12]:
import math

n = 5
k = 3

combination = math.comb(n, k)
probability_three_traveled = combination * (p**k) * ((1 - p)**(n-k))
print(f"The probability that exactly 3 out of 5 customers traveled between September and November is: {probability_three_traveled:.5f}")


The probability that exactly 3 out of 5 customers traveled between September and November is: 0.08789
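
The two binomial results can also be checked empirically. The sketch below is a Monte Carlo approximation, not part of the original notebook: it assumes `p = 0.25` from the earlier output and uses NumPy's `Generator.binomial` to simulate many repetitions of "draw 5 customers with replacement and count the Sep-Nov travelers". The number of repetitions and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

n, p = 5, 0.25
n_experiments = 200_000  # arbitrary; larger values give tighter estimates

# Each entry is the number of Sep-Nov travelers in one sample of 5 customers.
draws = rng.binomial(n, p, size=n_experiments)

# Relative frequencies approximate the exact binomial probabilities.
est_none = np.mean(draws == 0)
est_three = np.mean(draws == 3)

print(f"Estimated P(X = 0): {est_none:.4f} (exact: {(1 - p) ** n:.4f})")
print(f"Estimated P(X = 3): {est_three:.4f} (exact: {10 * p**3 * (1 - p)**2:.4f})")
```

The estimates should land close to the exact values, with the gap shrinking as `n_experiments` grows.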


## Exercise 3

### 3.1 More distributions

**Assume that the age of the customers of one of these hotels follows a normal distribution with a mean of 52 years and a standard deviation of 11 years. If we select a customer at random, what is the probability that they are older than 40 years?**

In [13]:
from scipy.stats import norm

mean_age = 52
std_dev = 11

probability_older_than_40 = 1 - norm.cdf(40, mean_age, std_dev)
print(f"The probability that a randomly selected customer is older than 40 years is: {probability_older_than_40:.4f}")

The probability that a randomly selected customer is older than 40 years is: 0.8623


**Let's determine the probability that a randomly selected customer is younger than 63 years using the normal distribution:**

In [14]:
probability_younger_than_63 = norm.cdf(63, mean_age, std_dev)
print(f"The probability that a randomly selected customer is younger than 63 years is: {probability_younger_than_63:.4f}")

The probability that a randomly selected customer is younger than 63 years is: 0.8413
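
Both results can be read off the standard normal distribution after converting each age to a z-score, z = (x - mean) / standard deviation. The sketch below illustrates that step with the same mean and standard deviation; it uses `norm.sf` (the survival function, 1 - CDF) in place of the explicit subtraction, which is a substitution of mine and not the notebook's approach.

```python
from scipy.stats import norm

mean_age = 52
std_dev = 11

# Standardize the two ages of interest.
z_40 = (40 - mean_age) / std_dev  # roughly 1.09 standard deviations below the mean
z_63 = (63 - mean_age) / std_dev  # exactly 1 standard deviation above the mean

# Upper tail for "older than 40", lower tail for "younger than 63",
# both evaluated on the standard normal distribution.
p_older_40 = norm.sf(z_40)
p_younger_63 = norm.cdf(z_63)

print(f"z(40) = {z_40:.4f}, P(age > 40) = {p_older_40:.4f}")
print(f"z(63) = {z_63:.4f}, P(age < 63) = {p_younger_63:.4f}")
```

Since 63 is exactly one standard deviation above the mean, the second probability is the familiar value of the standard normal CDF at 1.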


**Find an age such that 25% of the hotel's customers are younger than that value and 75% of the customers are older.**


The *norm.ppf()* function (percent point function, the inverse of the CDF) returns the value at a given quantile of a normal distribution. Here it gives the age below which 25% of the hotel's customers fall.

In [15]:
mean_age = 52
std_dev = 11

age_25th_percentile = norm.ppf(0.25, mean_age, std_dev)
print(f"The age such that 25% of the hotel's customers are younger than that value is: {age_25th_percentile:.2f} years")

The age such that 25% of the hotel's customers are younger than that value is: 44.58 years
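
Because `norm.ppf` is the inverse of `norm.cdf`, the result can be sanity-checked by feeding it back into the CDF. The sketch below does that, and also recomputes the same age from the standard normal quantile as mean + z * standard deviation; the variable names are illustrative.

```python
from scipy.stats import norm

mean_age = 52
std_dev = 11

# First quartile of the age distribution
q1 = norm.ppf(0.25, loc=mean_age, scale=std_dev)

# Round trip: the CDF evaluated at q1 should return 0.25
round_trip = norm.cdf(q1, loc=mean_age, scale=std_dev)

# Same quantile built from the standard normal z-value
z_25 = norm.ppf(0.25)
q1_via_z = mean_age + z_25 * std_dev

print(f"q1 = {q1:.2f}, cdf(q1) = {round_trip:.2f}, via z-score = {q1_via_z:.2f}")
```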
