
**Step 1: Generate Sample Data**

We start by drawing 1,000 values from a normal distribution with a mean of 45,000 and a standard deviation of 4,500. These values serve as the income sample for the exercises that follow.

In [None]:
import numpy as np

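# Parameters of the normal distribution the sample is drawn from
population_mean = 45000
sample_std_dev = 4500  # standard deviation of the generating distribution
sample_size = 1000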

# Generate sample data
np.random.seed(42)  # for reproducibility
sample_data = np.random.normal(population_mean, sample_std_dev, sample_size)

# Display the first 10 values of the generated sample data
print("Sample Data (first 10 values):", sample_data[:10])


Sample Data (first 10 values): [47235.21368855 44377.81064473 47914.59842145 51853.63435384
 43946.30981374 43946.38369373 52106.45766978 48453.45628119
 42887.36526329 47441.52019614]
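
As a quick sanity check, the sample's mean and standard deviation can be compared with the parameters used to generate it. The sketch below is an illustration rather than part of the original exercise; it rebuilds the sample with `np.random.RandomState(42)` (an assumption made so the global random state used by later cells is left untouched), and the names `sample`, `sample_sd`, and `standard_error` are chosen here for the example.

```python
import numpy as np

# Rebuild the Step 1 sample with a private generator so the
# global random state used by later cells is not disturbed.
rng = np.random.RandomState(42)
sample = rng.normal(45000, 4500, 1000)

sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)          # n - 1 denominator (sample estimate)
standard_error = 4500 / np.sqrt(1000)   # expected spread of the sample mean

print(f"Sample mean: {sample_mean:.2f} (generating mean: 45000)")
print(f"Sample standard deviation: {sample_sd:.2f} (generating value: 4500)")
print(f"Standard error of the mean: {standard_error:.2f}")
```

The sample mean should land within a few standard errors of 45,000; a much larger gap would suggest a mistake in the generation step.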


**Step 2: Exercise 1 - Calculate Confidence Interval**

In [2]:
import scipy.stats as stats

# Sample mean and margin of error for a 95% confidence interval
# (z-interval: uses the known standard deviation of 4,500 from Step 1)
sample_mean = np.mean(sample_data)
z_value = stats.norm.ppf(1 - (1 - 0.95) / 2)
margin_of_error = z_value * (sample_std_dev / np.sqrt(sample_size))

confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print(f"Exercise 1 - 95% Confidence Interval for average income: {confidence_interval}")


Exercise 1 - 95% Confidence Interval for average income: (44808.086486663415, 45365.90201573753)
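
The interval above treats the standard deviation as known. When it has to be estimated from the data, a t-based interval is a common alternative. The following is a minimal sketch of that variant, not the notebook's original method; it rebuilds the Step 1 sample with `np.random.RandomState(42)` so it runs on its own, and the names `sample`, `sample_sd`, and `t_interval` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Rebuild the Step 1 sample with a private generator
rng = np.random.RandomState(42)
sample = rng.normal(45000, 4500, 1000)

n = sample.size
sample_mean = sample.mean()
sample_sd = sample.std(ddof=1)            # estimated from the data
standard_error = sample_sd / np.sqrt(n)

# Critical value from the t distribution with n - 1 degrees of freedom
t_value = stats.t.ppf(1 - (1 - 0.95) / 2, df=n - 1)
t_interval = (sample_mean - t_value * standard_error,
              sample_mean + t_value * standard_error)

print(f"95% t-based confidence interval: {t_interval}")
```

With 1,000 observations the t critical value is very close to the z value, so the two intervals should differ only slightly.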


**Step 3: Exercise 2 - Hypothesis Testing**

In [6]:
from scipy.stats import ttest_1samp

# Test value for the population mean
population_mean_to_test = 47000

# One-sample t-test (two-sided by default: H0 is "mean == 47,000")
t_statistic, p_value = ttest_1samp(sample_data, population_mean_to_test)

# Significance level
alpha = 0.05

print(f"\nExercise 2 - T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
# The test is two-sided, so a small p-value only shows a difference;
# the sign of the t-statistic gives the direction (negative: sample mean below the test value).
if p_value < alpha:
    print("Reject the null hypothesis: The average income differs significantly from $47,000.")
else:
    print("Fail to reject the null hypothesis: No significant evidence that the average income differs from $47,000.")



Exercise 2 - T-statistic: -13.728569533902661
P-value: 2.0312021008999865e-39
Reject the null hypothesis: The average income differs significantly from $47,000.
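
`ttest_1samp` runs a two-sided test by default, so its p-value says nothing about direction on its own. If the question is directional (for example, whether the average income is above $47,000), a one-sided test is one option. The sketch below assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument; the sample is rebuilt with `np.random.RandomState(42)` so the cell runs independently, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Rebuild the Step 1 sample with a private generator
rng = np.random.RandomState(42)
sample = rng.normal(45000, 4500, 1000)

# H0: mean == 47000 vs. H1: mean > 47000
t_greater, p_greater = ttest_1samp(sample, 47000, alternative="greater")

# H0: mean == 47000 vs. H1: mean < 47000
t_less, p_less = ttest_1samp(sample, 47000, alternative="less")

print(f"T-statistic: {t_greater}")
print(f"P-value (H1: mean > 47,000): {p_greater}")
print(f"P-value (H1: mean < 47,000): {p_less}")
```

Because the t-statistic is negative, the "greater" alternative should give a p-value close to 1, while the "less" alternative should give a very small one.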


**Generate Additional Sample Data**

In [8]:
# Generating data for three cities
city_a_income = np.random.normal(45000, 5000, 1000)
city_b_income = np.random.normal(47000, 5000, 1000)
city_c_income = np.random.normal(46000, 5000, 1000)

# Display first 5 values of each city's income data
print("City A Income (first 5 values):", city_a_income[:5])
print("City B Income (first 5 values):", city_b_income[:5])
print("City C Income (first 5 values):", city_c_income[:5])


City A Income (first 5 values): [51996.77718293 49623.16841456 45298.1518496  41765.31611147
 48491.11656807]
City B Income (first 5 values): [43624.10862513 46277.40664642 43037.900395   45460.1923518
 37531.92666523]
City C Income (first 5 values): [36460.96221061 41698.0749461  43931.97233289 55438.4382867
 48782.76562267]


**Independent Two-Sample t-test between City A and City B**

In [10]:
from scipy.stats import ttest_ind

# Independent two-sample t-test (pooled variance: equal_var=True is the default)
t_statistic_ab, p_value_ab = ttest_ind(city_a_income, city_b_income)

print(f"\nT-statistic (City A vs City B): {t_statistic_ab}")
print(f"P-value: {p_value_ab}")
if p_value_ab < alpha:
    print("Reject the null hypothesis: There is a significant difference in average income between City A and City B.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average income between City A and City B.")



T-statistic (City A vs City B): -7.562787299722131
P-value: 5.978616169664526e-14
Reject the null hypothesis: There is a significant difference in average income between City A and City B.
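
The test above pools the two variances. Two common companions are Welch's t-test, which drops the equal-variance assumption, and Cohen's d, which expresses the difference in standard-deviation units. The sketch below illustrates both; it is an addition, not part of the original exercise. It regenerates the city data from `np.random.RandomState(42)` on the assumption that the notebook cells were run once, in order, so the exact values may differ from the outputs shown above.

```python
import numpy as np
from scipy.stats import ttest_ind

# Regenerate the data (assumes the cells were run once, in order)
rng = np.random.RandomState(42)
_ = rng.normal(45000, 4500, 1000)           # skip the Step 1 draws
city_a = rng.normal(45000, 5000, 1000)
city_b = rng.normal(47000, 5000, 1000)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(city_a, city_b, equal_var=False)

# Cohen's d for two groups of equal size
pooled_sd = np.sqrt((city_a.var(ddof=1) + city_b.var(ddof=1)) / 2)
cohens_d = (city_a.mean() - city_b.mean()) / pooled_sd

print(f"Welch t-statistic (City A vs City B): {t_welch}")
print(f"P-value: {p_welch}")
print(f"Cohen's d: {cohens_d}")
```

With equal group sizes and similar variances, Welch's result should be nearly identical to the pooled test; the effect size helps judge whether a statistically significant difference is also practically large.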


**ANOVA among City A, City B, and City C**

In [11]:
from scipy.stats import f_oneway

# One-way ANOVA
# The F-statistic is the ratio of the variance between the group means to the
# variance within the groups. A larger F means more variability between the
# groups than within them; the p-value determines whether it is significant.
f_statistic_abc, p_value_abc = f_oneway(city_a_income, city_b_income, city_c_income)
print(f"\nF-statistic (ANOVA across City A, City B, and City C): {f_statistic_abc}")
print(f"P-value: {p_value_abc}")
if p_value_abc < alpha:
    print("Reject the null hypothesis: There is a significant difference in average income among the cities.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average income among the cities.")



F-statistic (ANOVA across City A, City B, and City C): 28.975830573510713
P-value: 3.4361863347647236e-13
Reject the null hypothesis: There is a significant difference in average income among the cities.
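
To make the variance ratio behind the F-statistic concrete, it can be computed by hand and compared with `f_oneway`. The sketch below is illustrative; it regenerates the three city samples from `np.random.RandomState(42)` on the assumption that the notebook cells were run once, in order, and the names `ss_between`, `ss_within`, `f_manual`, and `p_manual` are introduced here for the example.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

# Regenerate the data (assumes the cells were run once, in order)
rng = np.random.RandomState(42)
_ = rng.normal(45000, 4500, 1000)           # skip the Step 1 draws
groups = [rng.normal(45000, 5000, 1000),    # City A
          rng.normal(47000, 5000, 1000),    # City B
          rng.normal(46000, 5000, 1000)]    # City C

k = len(groups)                              # number of groups
n_total = sum(g.size for g in groups)        # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and their ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Upper-tail probability of the F distribution
p_manual = f_dist.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual F: {f_manual}, SciPy F: {f_scipy}")
print(f"Manual p: {p_manual}, SciPy p: {p_scipy}")
```

The manual and SciPy values should agree up to floating-point error. Note that a significant F only indicates that at least one group mean differs; it does not say which one.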
