# HW 3 (20 points)

Submit your homework as an .ipynb file. Use the format 'HW3_LastName_FirstName.ipynb'. Work without comments or markdown describing what you have done will not be graded. Follow the code of conduct.

## Problem 1: Hypothesis testing (10 points)

A chemical reactor in a refinery has two temperature sensors that the operator can observe. The first sensor is in the room where the reactor is located; the second is inside the reactor itself. Is there a statistically significant difference between the readings of the two sensors? The temperatures from both sensors are given below:

| Time points | Room sensor | Reactor sensor |
|---|---|---|
| 1 | 19 | 22 |
| 2 | 19 | 16 |
| 3 | 36 | 39 |
| 4 | 42 | 42 |
| 5 | 41 | 46 |
| 6 | 28 | 25 |
| 7 | 44 | 42 |
| 8 | 51 | 48 |
| 9 | 50 | 50 |
| 10 | 42 | 41 |
| 11 | 25 | 27 |
| 12 | 35 | 32 |
| 13 | 40 | 41 |
| 14 | 52 | 51 |
| 15 | 74 | 73 |



* What is the hypothesis?
* What is the test statistic?
* What is the p-value?
* What is the conclusion?


### (a) What is the hypothesis?

**Null Hypothesis (H0):** There is no significant difference between the temperature readings from the room sensor and the reactor sensor. That is, the mean difference between paired measurements is zero.

$H_0: \mu_d = 0$

**Alternative Hypothesis (HA):** There is a significant difference between the temperature readings.

$H_A: \mu_d \neq 0$

where $\mu_d$ is the mean difference between paired measurements (room sensor minus reactor sensor). Because both sensors are read at the same time points, the samples are paired, so a two-sided paired t-test is used.



### (b) What is the test statistic?


In [1]:
import numpy as np
from scipy import stats

# Temperature readings
room_sensor = np.array([19, 19, 36, 42, 41, 28, 44, 51, 50, 42, 25, 35, 40, 52, 74])
reactor_sensor = np.array([22, 16, 39, 42, 46, 25, 42, 48, 50, 41, 27, 32, 41, 51, 73])

# Compute differences
differences = room_sensor - reactor_sensor

# Calculate mean and standard deviation of differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)  # Sample standard deviation
n = len(differences)

# Compute t-statistic
t_stat = mean_diff / (std_diff / np.sqrt(n))

# Print results
print(f"Mean difference: {mean_diff:.2f}")
print(f"Standard deviation of differences: {std_diff:.2f}")
print(f"Sample size (n): {n}")
print(f"T-statistic: {t_stat:.3f}")



Mean difference: 0.20
Standard deviation of differences: 2.54
Sample size (n): 15
T-statistic: 0.305
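
For reference, the quantity computed in the cell above is the paired t statistic. Plugging in the printed (rounded) values gives the same result, with $n - 1 = 14$ degrees of freedom:

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{0.20}{2.54 / \sqrt{15}} \approx 0.305, \qquad \mathrm{df} = n - 1 = 14
$$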


### (c) What is the p-value?

In [2]:
# Perform paired t-test
t_stat, p_value = stats.ttest_rel(room_sensor, reactor_sensor)

# Print results
print(f"P-value: {p_value:.3f}")

P-value: 0.765
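
As a cross-check, the p-value can also be obtained by hand from the t statistic and Student's t distribution with n − 1 degrees of freedom. The sketch below assumes the same two arrays as above; the names `t_manual` and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

# Same readings as in the cells above
room_sensor = np.array([19, 19, 36, 42, 41, 28, 44, 51, 50, 42, 25, 35, 40, 52, 74])
reactor_sensor = np.array([22, 16, 39, 42, 46, 25, 42, 48, 50, 41, 27, 32, 41, 51, 73])

differences = room_sensor - reactor_sensor
n = len(differences)
df = n - 1  # degrees of freedom for a paired t-test

# Paired t statistic: mean difference divided by its standard error
t_manual = differences.mean() / (differences.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: P(|T| >= |t|) under Student's t with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Cross-check against SciPy's paired t-test
t_scipy, p_scipy = stats.ttest_rel(room_sensor, reactor_sensor)

print(f"manual: t = {t_manual:.3f}, p = {p_manual:.3f}")
print(f"scipy:  t = {t_scipy:.3f}, p = {p_scipy:.3f}")
```

Both routes should agree to floating-point precision, since `stats.ttest_rel` performs the same calculation on the paired differences.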


### (d) What is the conclusion?

At a common significance level of $\alpha = 0.05$, we compare the p-value with $\alpha$:

Since p-value (0.765) > 0.05, we fail to reject the null hypothesis.
Conclusion: There is no statistically significant difference between the temperatures recorded by the room sensor and the reactor sensor. The observed differences could be due to random variation.
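
A confidence interval for the mean difference gives the same conclusion in a different form: if the 95% interval contains zero, the null hypothesis is not rejected at α = 0.05. The following is a minimal sketch under the same assumptions as the paired t-test; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Same readings as in the cells above
room_sensor = np.array([19, 19, 36, 42, 41, 28, 44, 51, 50, 42, 25, 35, 40, 52, 74])
reactor_sensor = np.array([22, 16, 39, 42, 46, 25, 42, 48, 50, 41, 27, 32, 41, 51, 73])

differences = room_sensor - reactor_sensor
n = len(differences)
alpha = 0.05

mean_diff = differences.mean()
se = differences.std(ddof=1) / np.sqrt(n)      # standard error of the mean difference
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)  # two-sided critical value

ci_low = mean_diff - t_crit * se
ci_high = mean_diff + t_crit * se

print(f"95% CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
print("Contains zero:", ci_low < 0 < ci_high)
```

Note that failing to reject the null hypothesis does not prove the two sensors read identically; it only means these 15 pairs do not provide evidence of a systematic offset.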

## Problem 2: Data Science on your own research data (10 points)

Choose any research data that you have used or any data that is relevant to your research.
* Describe the data set. Is the data set continuous or categorical or a combination of both?
* How can supervised or unsupervised machine learning enhance your conclusions/understanding of the dataset?
* Apply linear regression to the dataset. Report and discuss the results.

### Source of dataset:
https://www.kaggle.com/datasets/burakhmmtgl/energy-molecule
(I do not have my own data now.)

### (a) Describe the data set. Is the data set continuous or categorical or a combination of both?
**The data combine both categorical and continuous variables.**

The dataset contains 1277 columns. The first 1275 columns are entries in the Coulomb matrix that act as molecular features (continuous). The 1276th column is the PubChem ID from which the molecular structures are obtained (a categorical identifier). The 1277th column is the atomization energy calculated by simulations using the Quantum ESPRESSO package (continuous).

In the csv file, the first column (X1) is the data index and unused.

### (b) How can supervised or unsupervised machine learning enhance your conclusions/understanding of the dataset?
Using supervised learning, we can model the relationship between the Coulomb-matrix features and the molecular energy, leading to better predictions for new molecules.

**Supervised Learning Approaches**

Regression (Linear Regression, Random Forest, Neural Networks)
- Predict the atomization energy from the molecular structure encoded in the Coulomb matrix.
- Other targets, such as the scalar coupling constant, which is crucial in understanding molecular interactions, could be predicted the same way but would require labels that this dataset does not contain.

Classification (Decision Trees, SVM, Logistic Regression)
- Categorize molecules based on their interaction types.
- Identify high-energy vs. low-energy molecules.

**Unsupervised Learning Approaches**

Clustering (K-Means, DBSCAN)
- Group molecules with similar energy levels.
- Identify hidden patterns in molecular interactions.

Dimensionality Reduction (PCA, t-SNE)
- Reduce complexity in molecular feature space.
- Improve visualization and feature selection.
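
To make the unsupervised ideas above concrete, here is a minimal sketch that combines dimensionality reduction and clustering. It runs on synthetic data from `make_blobs` as a hypothetical stand-in for the molecular feature matrix, not on the actual dataset, and the choice of three clusters is an assumption made for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the feature matrix: 300 samples, 20 features
X_demo, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=0)

# Standardize, then project onto 2 principal components for visualization
X_demo_scaled = StandardScaler().fit_transform(X_demo)
X_demo_2d = PCA(n_components=2).fit_transform(X_demo_scaled)

# Group the projected points into 3 clusters (assumed number, for illustration)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo_2d)

print("Cluster sizes:", np.bincount(labels))
```

On the real data, the cluster labels could then be compared against the atomization energy to see whether structurally similar molecules also have similar energies.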

### (c) Apply linear regression to the dataset. Report and discuss the results

- Load the data and preprocess it

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset
data = pd.read_csv("molecule.csv")  # Ensure the file is downloaded

# Drop the first column. In the csv file, the first column (X1) is the data index and unused.
data = data.drop(data.columns[0], axis=1)

# Drop rows with missing values
data = data.dropna()

# Separate features (X) and target (Y)
# NOTE: the slice 0:1274 selects columns 0..1273, i.e. 1274 of the 1275 Coulomb-matrix entries;
# use 0:1275 to include all of them.
X = data.iloc[:, 0:1274]  # Coulomb-matrix entries used as molecular features
Id = data.iloc[:, 1275]   # PubChem ID of the source molecular structure (not used as a feature)
Y = data.iloc[:, 1276]    # Atomization energy from Quantum ESPRESSO simulations (target)



- Since 1275 features are too many for a direct fit, we standardize them and then reduce the dimensionality with PCA

In [4]:
# Scale the features before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA - reduce the dimensions to a smaller number
pca = PCA(n_components=5)  # 5 components can capture > 90% variance
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio to understand how much information each component retains
print(f'Total explained variance by first 5 components: {sum(pca.explained_variance_ratio_):.2f}')

Total explained variance by first 5 components: 0.91
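
Rather than fixing the number of components in advance, one option is to inspect the cumulative explained variance and pick the smallest number that reaches a target such as 90%. The sketch below uses synthetic low-rank data as a hypothetical stand-in for the scaled feature matrix; `X_demo` and `n_needed` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in: 200 samples, 50 correlated features driven by 3 latent factors
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(200, 50))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"Components needed for 90% variance: {n_needed}")

# scikit-learn can make the same choice directly when n_components is a fraction
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_demo_scaled)
print(f"PCA(n_components=0.90) kept {pca_90.n_components_} components")
```

Plotting `cumulative` against the component index gives the usual elbow-style picture for justifying the cut-off.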


- Split the data and train the Linear Regression Model

In [5]:
# Split the PCA-transformed data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X_pca, Y, test_size=0.2, random_state=42)

# Initialize the linear regression model
model = LinearRegression()

# Train the model with the PCA-transformed training data
model.fit(X_train, Y_train)
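
In the cells above, the scaler and PCA are fitted on the full dataset before the train/test split, which lets some information from the test rows influence the preprocessing. A common way to avoid this is to wrap all steps in a scikit-learn pipeline and fit it on the training split only. The following is a sketch on synthetic data, a hypothetical stand-in for the molecule features and energies, not a re-run of the reported results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical stand-in data: 40 correlated features driven by 5 latent factors
latent = rng.normal(size=(500, 5))
X_demo = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(500, 40))
y_demo = latent @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=500)

# Split first, so the test rows never influence scaling or PCA
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling, PCA and regression are all fitted on the training split only
pipe = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_demo = pipe.score(X_te, y_te)  # R² on the held-out split
print(f"Test R-squared (synthetic demo): {r2_demo:.2f}")
```

The same pipeline object can be passed to `cross_val_score` to get a less split-dependent estimate of R².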

- Evaluate the Model 

In [6]:
# Predict on the test set
Y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y_test, Y_pred)

# Calculate R-squared (R²)
r2 = r2_score(Y_test, Y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

# Print out Y statistics to compare with the results
print(f'Y_min: {Y.min()}, Y_max: {Y.max()}')
print(f'Y_mean: {Y.mean()}, Y_std: {Y.std()}')


Mean Squared Error: 2.41
R-squared: 0.84
Y_min: -23.214579159999914, Y_max: -0.8177906099999959
Y_mean: -11.878812676082875, Y_std: 3.769399861191056


### Discussion


- MSE / (Y_max - Y_min)^2 ≈ 0.48%: the MSE is only about 0.48% of the squared range of Y, so the error is small compared with the total spread of Y values.

- MSE / (Y_std)^2 ≈ 17%: the MSE is about 17% of the variance of Y, so the model's errors are much smaller than the natural spread of Y. Equivalently, the RMSE is √2.41 ≈ 1.55, compared with Y_std ≈ 3.77.

- With R² = 0.84, the model explains 84% of the variance in Y on the test set, which is strong for a linear model on five components. This suggests that the five principal components retain most of the information relevant to the energy and that linear regression is a reasonable first model.
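
The ratios quoted in this discussion can be reproduced from the printed output with a few lines of arithmetic. The values below are copied from the output above; the variable names are illustrative.

```python
import math

# Values copied from the printed output above
mse = 2.41
y_min, y_max = -23.214579159999914, -0.8177906099999959
y_std = 3.769399861191056

rmse = math.sqrt(mse)                            # error in the same units as Y
mse_over_range_sq = mse / (y_max - y_min) ** 2   # MSE relative to the squared range of Y
mse_over_var = mse / y_std ** 2                  # MSE relative to the variance of Y

print(f"RMSE: {rmse:.2f}")
print(f"MSE / range^2: {mse_over_range_sq:.2%}")
print(f"MSE / variance: {mse_over_var:.0%}")
```

The last ratio is close to 1 − R², but not identical: R² is computed against the variance of the test split, whereas `y_std` here is the standard deviation of the full target column.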
   