# Selection Bias

## Remarks
- **Bias** is a systematic error produced by the **measurement** or **sampling process**.
- When the data used for analysis or modelling is **not representative** of the population, the results are **skewed** or **biased**.
- A model trained on **non-representative** data may perform poorly in the real world.
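
In statistical terms, "systematic error" can be written down as follows. The symbols are introduced here for illustration: $\hat{\theta}$ is an estimate computed from a sample (for example, the sample mean height) and $\theta$ is the true population value.

$$
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
$$

An estimator is unbiased when this difference is zero. Under selection bias, the sampling process shifts $\mathbb{E}[\hat{\theta}]$ away from $\theta$, so collecting more data in the same way does not close the gap.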

## Implementation

Let's implement an example of selection bias, and then train a model on non-representative data.

**Context**: <br>
A survey was conducted to estimate the average height of the students
at a high school in Berlin. <br>
We calculate the average height of the students
and simulate <br>
a selection bias by surveying only students who are on the basketball team. <br>
For this scenario, we assume that these students are tall.

In [74]:
import random

import numpy as np

#. Define population
population = np.random.normal(165, scale=10, size=1500)

#. Sample students
num_students = 200
sample_students = random.choices(population, k=num_students)

#. Biased sample: only students who are on the basketball team.
biased_sample = np.random.normal(loc=195, scale=3, size=num_students)

#. Calculate avg. heights
#. (the mean of the random sample stands in for the population mean)
popu_avg_height = round(np.mean(sample_students), 2)
sample_avg_height = round(np.mean(biased_sample), 2)

print(f"True Population Mean Height: { popu_avg_height } cm")
print(f"Biased Sample Mean Height: { sample_avg_height } cm")

True Population Mean Height: 165.11 cm
Biased Sample Mean Height: 194.8 cm


**Remarks**

- This is a simple example that shows that we must be careful when sampling data.
- Make sure to identify possible groups (i.e. strata) in the data and include samples from each group. This keeps the sample representative of the population.
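
The idea of sampling from each stratum can be sketched as follows. This is a minimal illustration, not part of the original survey: the group sizes (1,400 regular students, 100 basketball players), the seed, and the helper `stratified_sample` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population made of two strata (sizes are assumptions).
regular = rng.normal(loc=165, scale=10, size=1400)
players = rng.normal(loc=195, scale=3, size=100)
population = np.concatenate([regular, players])


def stratified_sample(strata, total_size, rng):
    """Draw from each stratum in proportion to its share of the population."""
    pop_size = sum(len(s) for s in strata)
    parts = []
    for stratum in strata:
        k = round(total_size * len(stratum) / pop_size)
        parts.append(rng.choice(stratum, size=k, replace=False))
    return np.concatenate(parts)


sample = stratified_sample([regular, players], total_size=300, rng=rng)

true_mean = population.mean()
stratified_mean = sample.mean()
biased_mean = players.mean()  # survey restricted to the basketball team

print(f"Population mean:        {true_mean:.2f} cm")
print(f"Stratified sample mean: {stratified_mean:.2f} cm")
print(f"Biased sample mean:     {biased_mean:.2f} cm")
```

With proportional allocation, each group contributes to the sample in the same share it has in the population, so the stratified mean should land close to the population mean, while the team-only mean stays far above it.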

**Context**: <br>
The goal is to predict house prices, but we are going to use a biased training dataset. <br>
In this case, we train the model only with houses from urban areas. However, we run <br>
predictions on houses from rural areas. This shows how non-representative training data can <br>
affect a machine learning model.

In [89]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

np.random.seed(42)

#. Houses in the urban area
num_houses = 200
urban_data = pd.DataFrame({
    "Sqm": np.random.randint(80, 300, num_houses),
    "Price": np.random.randint(100000, 500000, num_houses)
})

#. Houses in the rural area
num_houses = 30 # Test set
rural_data = pd.DataFrame({
    "Sqm": np.random.randint(120, 280, num_houses),
    "Price": np.random.randint(50000, 200000, num_houses)
})

#. Training data
X_train = urban_data[["Sqm"]]
y_train = urban_data["Price"]

#. Test data
X_test = rural_data[["Sqm"]]
y_test = rural_data["Price"]

#. Build model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

#. Predict house prices
price_pred = lr_model.predict(X_test)

#. Evaluate predictions
mse = mean_squared_error(y_test, price_pred)

print(f"Mean Squared Error (MSE) on Rural data: { round(mse, 2) }")

Mean Squared Error (MSE) on Rural data: 30180720960.17


**Remarks**

- As expected, the model failed to generalize to rural data due to selection bias in the training dataset.
- The high prediction error in this example (an MSE of about 3.0e10, i.e. an RMSE of roughly 174,000 in price units) shows what can happen when the training data is not representative of the data the model is applied to.
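
To separate the effect of the biased training set from the quality of the model itself, it can help to score the same kind of model on data it is representative of. The sketch below is an illustration under stated assumptions, not a rerun of the cell above: the generator `make_houses`, the linear price relationship, and all coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)


def make_houses(n, price_per_sqm, base_price, rng):
    """Hypothetical generator: price grows linearly with size, plus noise."""
    sqm = rng.integers(80, 300, size=n).reshape(-1, 1)
    noise = rng.normal(0, 20_000, size=n)
    price = base_price + price_per_sqm * sqm.ravel() + noise
    return sqm, price


# Assumed markets: urban houses are far more expensive per square metre.
X_urban, y_urban = make_houses(200, 1500, 100_000, rng)
X_rural, y_rural = make_houses(200, 500, 30_000, rng)

# Hold out the last 30 rural houses as the test set.
X_rural_train, y_rural_train = X_rural[:170], y_rural[:170]
X_rural_test, y_rural_test = X_rural[170:], y_rural[170:]

# Biased setup: train on urban houses only, predict rural houses.
biased_model = LinearRegression().fit(X_urban, y_urban)
mse_biased = mean_squared_error(y_rural_test, biased_model.predict(X_rural_test))

# Representative setup: train on houses from the market being predicted.
fair_model = LinearRegression().fit(X_rural_train, y_rural_train)
mse_fair = mean_squared_error(y_rural_test, fair_model.predict(X_rural_test))

print(f"MSE on rural test, trained on urban data: {mse_biased:,.0f}")
print(f"MSE on rural test, trained on rural data: {mse_fair:,.0f}")
```

The same algorithm is used in both setups. Only the training data changes, so the gap between the two errors can be attributed to the mismatch between training and target data.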

**Other examples**

- A company wants to conduct a survey about its products, but it only considers responses from the contact form on the company's website. It does not include responses sent by letter or email.

**How to avoid selection bias?**

- Use **stratified sampling** so that each group appears in the sample in the same proportion as in the population.
- Draw a **fair random sample** from the whole population.
- Handle missing data and collect diverse data.
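
For model training, one way to apply the stratified-sampling advice is the `stratify` argument of scikit-learn's `train_test_split`. The sketch below assumes a hypothetical dataset with an `Area` column and an 80/20 urban/rural mix. Neither appears in the examples above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Hypothetical dataset: 400 urban and 100 rural houses.
houses = pd.DataFrame({
    "Sqm": rng.integers(80, 300, size=500),
    "Area": np.repeat(["urban", "rural"], [400, 100]),
})

# stratify keeps the urban/rural proportions the same in both splits.
train, test = train_test_split(
    houses, test_size=0.2, stratify=houses["Area"], random_state=7
)

train_share = train["Area"].value_counts(normalize=True)
test_share = test["Area"].value_counts(normalize=True)

print(train_share)
print(test_share)
```

Both splits should show the same urban/rural shares as the full dataset, so neither the training set nor the test set over-represents one area.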