<a href="https://colab.research.google.com/github/sundaybest3/Spring2024/blob/main/Seminar/Seminar01A.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🌱 Seminar 01
---


1. Introduction:
+ 1.1 Why statistics?
+ 1.2 **Steps of statistical approach (understanding probability distribution)**
+ 1.3 Types of data
+ 1.4 Software

2. Descriptive statistics overview: Coding

# Distribution? Probability distribution

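Parametric vs. non-parametric: parametric methods assume the data follow a specific probability distribution (e.g., the normal distribution), whereas non-parametric methods make no such assumption.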

## [1] Normal Distribution

- The normal distribution, also known as the Gaussian distribution, is one of the most important probability distributions. It is symmetric and describes many natural phenomena, such as the heights of people, test scores, etc.
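
As a quick numerical check of this symmetric bell shape, the sketch below uses `scipy.stats.norm` to compute how much probability lies within 1, 2, and 3 standard deviations of the mean (often called the 68-95-99.7 rule). It assumes a standard normal distribution (mean 0, standard deviation 1); the variable name `within` is only illustrative.

```python
from scipy.stats import norm

# Probability that a standard normal value falls within k standard deviations of the mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"P(|Z| < {k}) = {prob:.4f}")
```

Because the distribution is symmetric, the probability left over is split equally between the two tails.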

In [None]:
#@markdown Normal distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate data for a normal distribution
data = np.random.normal(loc=0, scale=1, size=1000)

# Plot the histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='g')

# Plot the PDF on top of the histogram
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, 0, 1)
plt.plot(x, p, 'k', linewidth=2)
title = r"Normal Distribution with $\mu$ = 0, $\sigma$ = 1"  # raw string so \m and \s are not treated as escape sequences
plt.title(title)
plt.show()


## [2] Uniform Distribution
- The uniform distribution has equal probability for all values in its range. It's often used to model situations where each outcome is equally likely.
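
To make "equal probability for all values" concrete, here is a minimal sketch for a uniform distribution on the range [-1, 1]. Note that `scipy.stats.uniform` is parameterized by `loc` (the lower bound) and `scale` (the width of the range), not by the two endpoints; the names `a`, `b`, and `dist` are illustrative.

```python
from scipy.stats import uniform

# Uniform on [a, b]; scipy parameterizes it as loc=a, scale=b-a
a, b = -1, 1
dist = uniform(loc=a, scale=b - a)

density = dist.pdf(0.0)                        # constant height 1 / (b - a) inside the range
prob_interval = dist.cdf(0.5) - dist.cdf(0.0)  # P(0 < X < 0.5) = 0.5 / (b - a)

print(density, prob_interval, dist.mean(), dist.var())
```

The probability of any interval inside the range depends only on its length, which is what "equally likely" means for a continuous variable.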

In [None]:
#@markdown Uniform distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform

# Generate data for a uniform distribution
data = np.random.uniform(low=-1, high=1, size=1000)

# Plot the histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='b')

# Plot the PDF on top of the histogram
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = uniform.pdf(x, -1, 2)
plt.plot(x, p, 'k', linewidth=2)
plt.title("Uniform Distribution")
plt.show()


## [3] Exponential Distribution
- The exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
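
The link to a Poisson process can be checked with a small simulation. This sketch assumes an average rate of 2 events per unit time (an arbitrary choice), draws exponential waiting times with mean `1 / rate`, and then counts how many events fall into each unit-length time window. The seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # assumed average number of events per unit time

# Waiting times between consecutive events: exponential with mean 1 / rate
gaps = rng.exponential(scale=1 / rate, size=100_000)
arrival_times = np.cumsum(gaps)

# Count events in each complete unit-length time window
n_windows = int(arrival_times[-1])
counts = np.bincount(arrival_times.astype(int), minlength=n_windows)[:n_windows]

print(f"mean gap: {gaps.mean():.3f} (theory {1 / rate})")
print(f"mean events per window: {counts.mean():.3f} (theory {rate})")
```

The average waiting time should be close to `1 / rate`, and the average number of events per window should be close to `rate`.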

In [None]:
#@markdown Exponential distribution
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import expon

# Generate data for an exponential distribution
data = np.random.exponential(scale=1, size=1000)

# Plot the histogram
plt.hist(data, bins=30, density=True, alpha=0.6, color='r')

# Plot the PDF on top of the histogram
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = expon.pdf(x, 0, 1)
plt.plot(x, p, 'k', linewidth=2)
plt.title("Exponential Distribution")
plt.show()


## [4] Binomial Distribution
+ The binomial distribution models the number of successes in a fixed number of independent trials of a binary experiment. It is parameterized by n (the number of trials) and p (the probability of success on each trial).
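
The role of n and p can be seen by computing the probabilities by hand. The sketch below evaluates the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k) for 10 fair coin flips and compares it with `scipy.stats.binom.pmf`; the names `manual_pmf` and `scipy_pmf` are illustrative.

```python
from math import comb
from scipy.stats import binom

n, p = 10, 0.5  # number of trials, probability of success on each trial

# P(X = k) = C(n, k) * p**k * (1 - p)**(n - k)
manual_pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
scipy_pmf = binom.pmf(list(range(n + 1)), n, p)

print(f"P(X = 5) = {manual_pmf[5]:.4f}")
print(f"sum of all probabilities = {sum(manual_pmf):.4f}")
```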

In [None]:
#@markdown Binomial distribution: e.g., coin flip
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# Generate data for a binomial distribution
n, p = 10, 0.5  # number of trials, probability of success on each trial
data = np.random.binomial(n, p, size=1000)

# Plot the histogram with one bin per integer outcome (0..n) so bar heights are comparable to the PMF
plt.hist(data, bins=np.arange(-0.5, n + 1.5, 1), density=True, alpha=0.6, color='y')

# Calculate the binomial PMF and plot it (stored as pmf so that p is not overwritten)
x = np.arange(0, n+1)
pmf = binom.pmf(x, n, p)
plt.plot(x, pmf, 'bo', ms=8, label='binom pmf')
plt.vlines(x, 0, pmf, colors='b', lw=5, alpha=0.5)
plt.legend()
plt.title("Binomial Distribution")
plt.show()


## Example data

In [None]:
# Install Faker to generate fake names for the sample data
!pip install faker

In [None]:
#@markdown Generate sample data: TOEIC scores before and after my classes
import pandas as pd
import numpy as np
import random
from faker import Faker

# Initialize Faker to generate names
fake = Faker()

# Set seeds for reproducibility (Faker keeps its own random generator, so it needs its own seed)
np.random.seed(0)
random.seed(0)
Faker.seed(0)

# Generate data
N = 30  # number of students
significant_improvers = 5  # Students with significant improvement
names = [fake.first_name() for _ in range(N)]
toeic_scores_before = np.random.normal(loc=500, scale=100, size=N).astype(int)

# Adjust scores for students with significant improvement
for i in range(significant_improvers):
    toeic_scores_before[i] = np.random.randint(200, 300)

# Create 'after' scores for the significant improvers
toeic_scores_after = np.empty(N, dtype=int)
toeic_scores_after[:significant_improvers] = np.random.randint(980, 990, size=significant_improvers)

# The rest of the students
normal_part = np.random.normal(loc=510, scale=100, size=(N - significant_improvers)).astype(int)
uniform_part_indices = np.random.choice(range(significant_improvers, N), size=(N - significant_improvers)//2, replace=False)
normal_part_indices = set(range(significant_improvers, N)) - set(uniform_part_indices)

# Assigning the rest of the 'after' scores
toeic_scores_after[list(normal_part_indices)] = normal_part[list(normal_part_indices) - np.array(significant_improvers)]
toeic_scores_after[uniform_part_indices] = np.random.uniform(low=450, high=550, size=(N - significant_improvers)//2).astype(int)

# Create a DataFrame
df = pd.DataFrame({
    'Student_Name': names,
    'TOEIC_Score_Before': toeic_scores_before,
    'TOEIC_Score_After': toeic_scores_after
})

print(df)


In [None]:
df.to_csv("toeic.csv", index=False)

In [None]:
#@markdown Paired t-test: before and after
from scipy.stats import ttest_rel

# Perform a paired t-test
t_statistic, p_value = ttest_rel(df['TOEIC_Score_Before'], df['TOEIC_Score_After'])

print(f'Paired t-test statistic: {t_statistic}')
print(f'Paired t-test p-value: {p_value}')

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis, suggesting there is a statistically significant difference in the scores before and after the classes.")
else:
    print("We do not reject the null hypothesis, suggesting there is not a statistically significant difference in the scores before and after the classes.")


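The Shapiro-Wilk test checks whether a sample is consistent with a normal distribution. Its null hypothesis is that the data are normally distributed, so a small p-value (e.g., below 0.05) is evidence against normality. It is particularly useful for checking the assumptions of parametric tests that require normality. For a paired t-test, that assumption applies to the paired differences (after minus before) rather than to each set of scores separately.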

In [None]:
#@markdown Distribution test: Shapiro normality test
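from scipy.stats import shapiro

# Perform Shapiro-Wilk test for normality on each set of scores
shapiro_before = shapiro(df['TOEIC_Score_Before'])
shapiro_after = shapiro(df['TOEIC_Score_After'])
# The paired t-test assumes normality of the paired differences, so test those as well
shapiro_diff = shapiro(df['TOEIC_Score_After'] - df['TOEIC_Score_Before'])

print("Shapiro-Wilk Test:")
print(f"Before Classes: Statistic={shapiro_before[0]}, p-value={shapiro_before[1]}")
print(f"After Classes: Statistic={shapiro_after[0]}, p-value={shapiro_after[1]}")
print(f"Differences (After - Before): Statistic={shapiro_diff[0]}, p-value={shapiro_diff[1]}")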

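Non-parametric test: when the normality assumption is in doubt, the Wilcoxon signed-rank test is a non-parametric alternative to the paired t-test.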

In [None]:
#@markdown Wilcoxon Signed-Rank Test
from scipy.stats import wilcoxon

# Perform Wilcoxon signed-rank test
stat, p = wilcoxon(df['TOEIC_Score_Before'], df['TOEIC_Score_After'])

print(f'Wilcoxon signed-rank test statistic: {stat}')
print(f'P-value: {p}')

# Interpretation
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis - suggest a significant difference between the two conditions.")
else:
    print("Fail to reject the null hypothesis - suggest no significant difference between the two conditions.")


---
## **1.3 Types of data**

1. Categorical:
  + Nominal
  + Ordinal

2. Numerical
  + Interval
  + Ratio

### [1] Nominal data

+ Sample data description: A research team studied the relationship between the colors of cars in a parking lot (nominal data) and the satisfaction levels of the cars' owners (ordinal data).

> **Color:** Red, Blue, Green, White, Black, Yellow

> **Satisfaction:**
> + 1 = Very Dissatisfied
> + 2 = Dissatisfied
> + 3 = Neutral
> + 4 = Satisfied
> + 5 = Very Satisfied
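
One way to keep the difference between nominal and ordinal data explicit in pandas is to store colors as an unordered category and satisfaction as an ordered one. This is a minimal sketch on a hypothetical four-row sample, not the dataset generated in this notebook; `sample`, `labels`, and `Satisfaction_Label` are illustrative names.

```python
import pandas as pd

# Hypothetical mini-sample using the coding scheme above
sample = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red"],
    "Satisfaction": [4, 3, 5, 2],
})

labels = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]
code_to_label = dict(zip(range(1, 6), labels))

# Nominal: categories without any order
sample["Color"] = sample["Color"].astype("category")

# Ordinal: categories with a meaningful order (1 < 2 < ... < 5)
sample["Satisfaction_Label"] = pd.Categorical(
    sample["Satisfaction"].map(code_to_label), categories=labels, ordered=True
)

print(sample)
print(sample["Satisfaction_Label"].min())  # order-aware operations work on ordinal data
```

Asking for the minimum of `Color` would raise an error, because unordered categories have no "smallest" value.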

In [None]:
#@markdown Generate the data (df1 = data01.csv)
import pandas as pd

# Create the nominal data DataFrame
nominal_data = pd.DataFrame({
    'CarID': range(1, 101),
    'Color': ['Red', 'Blue', 'Green', 'Red', 'White', 'Blue', 'Green', 'Black', 'White', 'Yellow'] * 10
})

# Create the ordinal data DataFrame
ordinal_data = pd.DataFrame({
    'CustomerID': range(1, 101),
    'Satisfaction': [4, 3, 5, 2, 4, 3, 4, 1, 5, 2, 4, 3, 4, 2, 5, 4, 2, 3, 5, 1,
                     4, 3, 4, 2, 5, 4, 2, 3, 5, 1, 4, 3, 4, 2, 5, 4, 2, 3, 5, 1,
                     4, 3, 4, 2, 5, 4, 2, 3, 5, 1, 4, 3, 4, 2, 5, 4, 2, 3, 5, 1,
                     4, 3, 4, 2, 5, 4, 2, 3, 5, 1, 4, 3, 4, 2, 5, 4, 2, 3, 5, 1,
                     4, 3, 4, 2, 5, 4, 2, 3, 5, 1, 4, 3, 4, 2, 5, 4, 2, 3, 5, 1]
})

# Combine the two DataFrames on a common key (for example, CarID and CustomerID)
combined_data = pd.merge(nominal_data, ordinal_data, left_on='CarID', right_on='CustomerID')

# Drop the redundant key (CustomerID)
combined_data = combined_data.drop(columns=['CustomerID'])

# Save the combined dataset to a CSV file
combined_data.to_csv("data01.csv", index=False)



In [None]:
df1 = pd.read_csv("data01.csv")

df1.tail()

### [2] Numeric data

+ Sample data description: Collect data on monthly household electricity consumption (in kWh, ratio data), the number of occupants (a count, ratio data), and daily indoor temperature (in Celsius, interval data). Investigate how household size affects energy usage.

> + **Area**: Urban, Rural
> + **Electricity**: in kWh (ratio data)
> + **Occupants**: integer (ratio data)
> + **Daily indoor Temperature**: in Celsius (interval data)

+ [Related article](https://www.treehugger.com/urban-or-rural-which-is-more-energy-efficient-4863586)
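
The practical difference between ratio and interval data is whether ratios are meaningful. The sketch below uses made-up values (they are not taken from the dataset) to contrast electricity in kWh, which has a true zero, with temperature in Celsius, whose zero is an arbitrary reference point.

```python
# Ratio data (kWh): zero means "no electricity used", so ratios are meaningful
kwh_a, kwh_b = 300.0, 600.0
print(kwh_b / kwh_a)  # household B used twice as much as household A

# Interval data (Celsius): zero is an arbitrary reference point
temp_a_c, temp_b_c = 10.0, 20.0
naive_ratio = temp_b_c / temp_a_c  # 2.0, but 20 °C is not "twice as hot" as 10 °C

# Converting to Kelvin (an absolute scale) shows the actual ratio
kelvin_ratio = (temp_b_c + 273.15) / (temp_a_c + 273.15)
print(naive_ratio, round(kelvin_ratio, 3))

# Differences stay meaningful on an interval scale
print(temp_b_c - temp_a_c)  # a 10-degree difference
```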

In [None]:
#@markdown Generate the data (df2 = data02.csv)
import pandas as pd
import random

# Set seed for reproducibility
random.seed(0)

# Create a list of areas (50 urban and 50 rural)
areas = ["Urban"] * 50 + ["Rural"] * 50

# Generate random occupants data
occupants = [random.randint(1, 5) for _ in range(100)]  # Random values between 1 and 5 occupants (inclusive)

# Generate electricity consumption data with a tendency for rural areas to use more electricity
electricity = []

for area, occupant in zip(areas, occupants):
    if area == "Urban":
        # Generate electricity consumption for urban areas (lower range)
        consumption = random.uniform(200, 400) + 50 * occupant  # Base of 200-400 kWh plus 50 kWh per occupant
    else:
        # Generate electricity consumption for rural areas (higher range)
        consumption = random.uniform(300, 600) + 75 * occupant  # Base of 300-600 kWh plus 75 kWh per occupant

    electricity.append(consumption)

# Generate daily temperature data in Celsius with a positive correlation to occupants
daily_temperature = [20 + 1.5 * occupant + random.uniform(-2, 2) for occupant in occupants]
daily_temperature_rounded = [round(temp, 1) for temp in daily_temperature]

# Create a DataFrame
data = pd.DataFrame({'Area': areas, 'Electricity': electricity, 'Occupants': occupants, 'Daily Temperature (°C)': daily_temperature_rounded})

# Save the DataFrame to 'data02.csv' file
data.to_csv('data02.csv', index=False)


In [None]:
df2 = pd.read_csv('data02.csv')  # same relative path that was used when saving the file
df2

---
# 🌀 **2. Descriptive statistics**

Summarizing the data


### [1] Descriptive stat for categorical data
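> 🔵 data.describe() # By default, this summarizes numerical columns only

Here, `data` is either df1 or df2. Note that for df1, describe() treats CarID and Satisfaction as ordinary numbers, although CarID is only an identifier and Satisfaction is ordinal, so their means and standard deviations should be read with caution.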
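
For a categorical column, `describe()` reports a different set of statistics. The sketch below uses a hypothetical five-row sample (not df1) and calls `describe()` on the text column directly, which returns the number of values, the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical mini-sample with one categorical and one numeric column
sample = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "White"],
    "Satisfaction": [4, 3, 5, 2, 4],
})

# Default on a DataFrame: numeric columns only (count, mean, std, min, quartiles, max)
print(sample.describe())

# On a text column: count, unique, top (most frequent value), freq (its frequency)
color_summary = sample["Color"].describe()
print(color_summary)
```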

In [None]:
df1.describe()

In [None]:
df2.describe()

For categorical data: Count data

> 🔵 variable = df['Color'].value_counts() # This is for count data

In [None]:
# Count the occurrences of each color
color_counts = df1['Color'].value_counts()
color_counts

_Note:_ The `dtype: int64` shown at the end of the output means that the counts are stored as 64-bit integers.
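
Counts are often easier to compare as proportions. This minimal sketch, on a hypothetical list of colors rather than df1, shows `value_counts(normalize=True)` for relative frequencies and `mode()` for the most frequent category.

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "White", "Blue", "Red"], name="Color")

counts = colors.value_counts()                     # absolute frequencies (integers)
proportions = colors.value_counts(normalize=True)  # relative frequencies (floats summing to 1)

print(counts)
print(proportions.round(2))
print("Mode:", colors.mode()[0])
```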

In [None]:
# Count the occurrences of each satisfaction level, ordered by level (1 to 5) rather than by frequency
satisfaction_counts = df1['Satisfaction'].value_counts().sort_index()
print(satisfaction_counts)

### [2] Descriptive stat for Numerical data



In [None]:
df2.describe()

e.g., Describe by Area

> data.groupby('Area').describe()
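
When the full `describe()` table is more than needed, one option is to select a single column and name the statistics explicitly with `agg`. The sketch below uses a hypothetical six-row sample that borrows the column names of df2; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as df2
sample = pd.DataFrame({
    "Area": ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural"],
    "Electricity": [350.0, 420.0, 390.0, 610.0, 700.0, 640.0],
    "Occupants": [1, 3, 2, 2, 5, 3],
})

# Pick one column and only the statistics of interest
summary = sample.groupby("Area")["Electricity"].agg(["mean", "median", "std"])
print(summary.round(1))
```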

In [None]:
summary_by_area = df2.groupby('Area').describe()

# Display the summary statistics
print(summary_by_area)
