In [6]:
%load_ext jupyter_ai_magics

The jupyter_ai_magics extension is already loaded. To reload it, use:
  %reload_ext jupyter_ai_magics


In [7]:
from datascience import *
import numpy as np

# needed for plotting with datascience module
%matplotlib inline
import matplotlib.pyplot as plots
#plots.style.use('fivethirtyeight')

from IPython.display import display, Image

# Sampling, Simulations and Probability

Statistics is a fundamental discipline that plays a pivotal role in the field of data science. It provides the necessary tools and methodologies for collecting, analyzing, interpreting, and presenting data. In the era of big data, where vast amounts of information are collected and stored every second, the application of statistical principles is crucial for making informed decisions based on this data.

Statistics is often considered the foundation of machine learning algorithms. A strong grounding in statistics will not only make us better data scientists but also sharpen our intuition in everyday life. As you learn more about statistics and how it is used in machine learning, you may come to realise that machine learning is, in large part, glorified statistics.

<img src="data/statistics_meme.jpeg" alt="Statistics meme" width="40%" height="40%">

In this notebook we will cover the basics of sampling and simulation in statistics, both of which are essential for advanced data analysis.

# Sampling

Statistics is the science of inference. To infer a value, you can work from the entire population or from a fraction of it, called a sample.

The reasons for sampling are:
- In some cases it may be impossible to collect data on the entire population
- It saves time and money

As you may already know, sampling is not appropriate in every setting. Examples include electing a national representative (where a sample of voters would vote in place of the whole electorate) and collecting a country's census data.

In this lab we will look at sample statistics as estimators of population parameters.

How information is represented in statistics:
- Population parameters are denoted with Greek letters.
- Sample estimates are denoted with Latin (English) letters.

![image.png](data/parameters.png)
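To make the estimator idea concrete, here is a minimal sketch that compares population parameters with the corresponding sample statistics. It uses only the Python standard library and a made-up population of 10,000 values (not the wine data); the names `mu`, `sigma`, `x_bar`, and `s` are chosen here to mirror the usual Greek and Latin notation and are assumptions of this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up measurements (not the wine data)
population = [random.gauss(5.6, 0.8) for _ in range(10_000)]

# Population parameters (Greek letters: mu, sigma)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population standard deviation

# Sample statistics (Latin letters: x_bar, s) from a simple random sample
sample = random.sample(population, 100)  # 100 values, without replacement
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

print(f"mean: population mu    = {mu:.3f}, sample x_bar = {x_bar:.3f}")
print(f"SD:   population sigma = {sigma:.3f}, sample s     = {s:.3f}")
```

The sample values will generally be close to, but not exactly equal to, the population values; how close depends on the sample size and on chance.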

# Wine Quality Control

Welcome to our exploration of the fascinating world of wine quality analysis! We're going to delve into a dataset from Kaggle that contains detailed information about various wines, including their chemical properties like acidity, sugar content, alcohol level, and more, as well as a quality rating given by wine experts. This dataset is not just a collection of numbers and facts; it's a doorway into understanding how different factors contribute to the taste and quality of wine.

Now, you might wonder why we can't just test every bottle to ensure top quality in a wine manufacturing plant. The answer lies in practicality and efficiency. Imagine a plant producing thousands of bottles per day - testing each one would be time-consuming and costly. This is where the concept of sampling comes in. By carefully selecting a representative sample of bottles for quality testing, we can draw reliable conclusions about the overall quality of the wine production without having to test every single bottle. Sampling is a powerful tool in quality control, allowing us to maintain high standards efficiently and effectively. Through this dataset and our analysis, we'll uncover the secrets of what makes a great wine and learn how sampling helps maintain this greatness across batches.

In [8]:
wine_quality = Table.read_table("data/winequality-red.csv")
wine_quality.show(5)

fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5


Now as a data scientist, your workflow is the same as usual. What are some interesting insights you could gain from this data?

**Q1.** Before we continue with the lab, write down three potential pieces of information we could uncover by analysing this data. Remember, the best insights are usually well hidden and require a good amount of analysis!

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-3.5-turbo

**Q1.1** Let's quickly analyze some key columns in our dataset using sampling. Obtain a random sample of 100 wines from the dataset. What is the average quality rating of the sampled wines? Does this average differ significantly from the overall average quality rating in the full dataset?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-3.5-turbo

### Workflow

Enter Workflow Here.

**Q1.2** Using your random sample, identify the three chemical properties that correlate most strongly with wine quality. You must include visualizations that support your conclusions.

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-3.5-turbo

### Workflow

Enter Workflow Here.

**Q1.3** Convenience sampling is a non-random strategy in which the person sampling decides which items to include, typically whatever best suits their purpose. Pick one of the chemical properties you identified in the previous question. Can you choose 100 wines such that you maximize the correlation between your chemical property and wine quality?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-3.5-turbo

### Workflow

Enter Workflow Here.

**Q2** Pretend you are a wine quality tester. Your job is to identify the best wines given the following information. A high-quality wine can be identified by the following factors:

1. High **quality** (7 or above)
2. Low **pH** value leading to good acidity (3.26 and below)
3. Just the right amount of total sulfur dioxide (Between 35 and 50)

These three factors will help you identify the high-quality wines in this data. We will use probability to see exactly how rare a high-quality red wine is. We will also use sampling and simulation to see the law of large numbers in action.
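When every row is equally likely to be picked, the probability of an event is the fraction of rows that satisfy it. The sketch below illustrates that idea on eight made-up wines (the values and the `proportion` helper are hypothetical, not taken from the dataset), using plain Python rather than the `datascience` table methods.

```python
# Hypothetical mini-dataset: (quality, pH, total sulfur dioxide) for 8 made-up wines
wines = [
    (7, 3.20, 40),
    (5, 3.51, 34),
    (8, 3.10, 60),
    (6, 3.16, 45),
    (7, 3.30, 38),
    (5, 3.26, 54),
    (7, 3.05, 36),
    (4, 3.40, 20),
]

def proportion(rows, condition):
    """Fraction of rows for which condition(row) is True."""
    return sum(1 for row in rows if condition(row)) / len(rows)

def high_quality(w):
    return w[0] >= 7          # condition 1

def low_ph(w):
    return w[1] <= 3.26       # condition 2

def right_so2(w):
    return 35 <= w[2] <= 50   # condition 3

p1 = proportion(wines, high_quality)
p12 = proportion(wines, lambda w: high_quality(w) and low_ph(w))
p123 = proportion(wines, lambda w: high_quality(w) and low_ph(w) and right_so2(w))

print(p1, p12, p123)  # prints 0.5 0.375 0.25
```

Each added condition can only keep or shrink the set of matching rows, so the joint probability never exceeds the probability of any single condition.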

**Q2.1** What is the probability of getting a wine that satisfies condition 1?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q2.2** What is the probability of getting a wine that satisfies condition 1 and condition 2?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q2.3** What is the probability of getting a wine that satisfies all three conditions?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q3** The Law of Averages (also known as the law of large numbers) states the following:

If a chance experiment is repeated independently and under identical conditions, then, in the long run, the proportion of times that an event occurs gets closer and closer to the theoretical probability of the event.

The law above implies that if the chance experiment is repeated a large number of times then the proportion of times that an event occurs is very likely to be close to the theoretical probability of the event.
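The statement above can be illustrated with a small simulation. This is a minimal sketch using a fair six-sided die rather than the wine data: the event is "roll a six", whose theoretical probability is 1/6, and the repetition counts are arbitrary choices for the example.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

p_true = 1 / 6  # theoretical probability of rolling a six

# Observed proportion of sixes for an increasing number of independent rolls
results = {}
for n in (10, 100, 1_000, 10_000, 100_000):
    hits = sum(1 for _ in range(n) if random.randint(1, 6) == 6)
    results[n] = hits / n
    print(f"n = {n:>7}: proportion = {results[n]:.4f}, "
          f"error = {abs(results[n] - p_true):.4f}")
```

The error does not have to shrink at every step, since each run is random, but for large numbers of rolls the observed proportion is very likely to sit close to 1/6.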

**Q3.1** Let's observe the law of large numbers in action. Simulate the probability in Q2.2 by sampling 50 wines randomly from the dataset. Run the simulation 10, 100, 1,000, and 10,000 times. Plot a histogram of the probability values you observed during each simulation process and compute the average probability from each simulation.

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q3.2** How does the value compare to the actual probability computed in Q2.2? Talk about the trend observed as you increased the number of simulations. Feel free to run it more times with different simulation values if you are having a tough time noticing the trend.

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q3.3** Now let's take it up a notch. Simulate the probability in Q2.2 again by sampling **500** wines randomly from the dataset. Run the simulation 10, 100, 1,000, and 10,000 times. Plot a histogram of the probability values you observed during each simulation process and compute the average probability from each simulation.

In [9]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

### Workflow

Enter Workflow Here.

**Q3.4** Now what changes do you observe? Does the simulated probability converge to the actual probability faster? Slower? Is it more or less accurate?

In [None]:
#SOLUTION

### Prompt Here

In [None]:
%%ai openai-chat:gpt-4

Congratulations, you're done with Lab 5! Be sure to:
- **Keep all your prompts**.
- **Save and Checkpoint** from the `File` menu.
- **Ensure every cell has been run (it has a number, e.g. [34], beside the cell)**.
- Submit to Gradescope!