# Glass Transition Temperature Prediction
This notebook demonstrates the process of data preprocessing, model training, evaluation, and saving the best model for predicting the glass transition temperature (Tg) from molecular data.

In [None]:
import pandas as pd
# Load the dataset
imputed_df = pd.read_csv('train.csv', encoding='ascii')
# Show the first few rows
imputed_df.head()


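The dataset contains molecular features and the target variable Tg. We select four features (FFV, Tc, Density, Rg) as inputs and check both the features and the target for missing values, since several of the estimators trained below (Linear Regression, SVR, K-Nearest Neighbors) raise an error on NaN inputs.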

In [None]:
# Extract features and target
X = imputed_df[['FFV', 'Tc', 'Density', 'Rg']]
Y = imputed_df['Tg']
# Check for missing values
X.isnull().sum(), Y.isnull().sum()
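
If the check above reports missing values, they need to be handled before training. The sketch below shows one possible approach: drop rows without a target and median-impute the features. The toy DataFrame, its values, and the names `toy_df`, `X_clean`, and `y_clean` are made up for illustration; the median strategy is an assumption, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for imputed_df; the values are invented for illustration only
toy_df = pd.DataFrame({
    'FFV': [0.35, np.nan, 0.40, 0.38],
    'Tc': [0.20, 0.25, np.nan, 0.22],
    'Density': [1.10, 1.05, 0.98, 1.20],
    'Rg': [15.0, 16.5, 14.2, np.nan],
    'Tg': [120.0, 95.0, np.nan, 140.0],
})
feature_cols = ['FFV', 'Tc', 'Density', 'Rg']

# Rows without a target cannot be used for supervised training
labeled = toy_df.dropna(subset=['Tg'])

# Fill the remaining gaps in the features with each column's median
imputer = SimpleImputer(strategy='median')
X_clean = pd.DataFrame(
    imputer.fit_transform(labeled[feature_cols]),
    columns=feature_cols,
    index=labeled.index,
)
y_clean = labeled['Tg']

print(X_clean.isnull().sum().sum())  # no missing values remain
```

In a full workflow the imputer, like the scaler, would normally be fitted on the training split only and then applied to the test split.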


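We standardize the features with StandardScaler so that each has zero mean and unit variance. This matters most for scale-sensitive models such as Support Vector Regression and K-Nearest Neighbors; tree-based models are largely unaffected. The data is split first and the scaler is fitted on the training split only, so that no test-set statistics leak into preprocessing.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split first so the scaler never sees test-set statistics
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)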
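
Scaling statistics should come from training data only. One way to enforce this automatically is to bundle the scaler and the estimator in a scikit-learn pipeline, which re-fits the scaler inside every cross-validation fold. The sketch below is a minimal illustration on synthetic data from `make_regression`, not on the Tg dataset; the names `X_demo`, `y_demo`, and `pipeline` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the four features and Tg (not the real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

# The scaler is re-fitted on the training folds inside each CV iteration
pipeline = make_pipeline(StandardScaler(), SVR())

# scikit-learn maximizes scores, so MSE is exposed as its negative
neg_mse = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error')
cv_mse = -neg_mse
print(f"Mean CV MSE: {cv_mse.mean():.2f} (+/- {cv_mse.std():.2f})")
```

A fitted pipeline can also be saved as a single object, which removes the need to keep a separate scaler file in sync with the model.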


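We train five regressors with scikit-learn's default hyperparameters: Random Forest, Gradient Boosting, Linear Regression, Support Vector Regression, and K-Nearest Neighbors. Each is scored by Mean Squared Error (MSE) on the held-out 20% test set, where lower is better, and the model with the lowest MSE is saved. Because the same test set is used to pick the winner, its reported MSE is a slightly optimistic estimate of performance on new data.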

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'Linear Regression': LinearRegression(),
    'Support Vector Regression': SVR(),
    'K-Nearest Neighbors': KNeighborsRegressor()
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = mse
    print(f"{name} Test MSE: {mse}")

# Identify the best model
best_model_name = min(results, key=results.get)
print(f"Best model: {best_model_name} with MSE: {results[best_model_name]}")

# Save the best model under a model-agnostic name (the winner may not be the Random Forest)
import joblib
joblib.dump(models[best_model_name], 'best_model.pkl')
joblib.dump(scaler, 'scaler.pkl')
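
MSE is expressed in squared units of Tg, which makes it hard to read directly. As a small illustration, the sketch below also computes the root mean squared error (RMSE, in the same units as Tg) and the coefficient of determination (R²). The three target values and predictions are made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([100.0, 150.0, 200.0])
y_pred_demo = np.array([110.0, 140.0, 205.0])

# Errors are 10, -10, 5, so the squared errors are 100, 100, 25
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # (100 + 100 + 25) / 3 = 75
rmse_demo = np.sqrt(mse_demo)                            # about 8.66, in the units of Tg
r2_demo = r2_score(y_true_demo, y_pred_demo)             # 1 - 225 / 5000 = 0.955

print(f"MSE: {mse_demo:.2f}, RMSE: {rmse_demo:.2f}, R2: {r2_demo:.3f}")
```

The same three calls could be applied to `y_test` and `y_pred` inside the training loop to report all three metrics per model.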


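The Random Forest's feature_importances_ attribute gives impurity-based importances, normalized to sum to 1, which indicate how much each feature contributes to the splits the forest uses to predict Tg. These values are derived from the training data, so they are best read as a rough ranking rather than a measure of predictive value on unseen data.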

In [None]:
import matplotlib.pyplot as plt

importances = models['Random Forest'].feature_importances_
# Take the names from the feature frame so they always match the training columns
feature_names = list(X.columns)

plt.figure(figsize=(8, 6))
plt.barh(feature_names, importances)
plt.xlabel('Feature Importance')
plt.title('Feature Importance from Random Forest')
plt.gca().invert_yaxis()
plt.show()
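
Impurity-based importances can be complemented by permutation importance, which measures how much a held-out score drops when one feature's values are shuffled. The sketch below uses synthetic data in which the target depends strongly on the first column and weakly on the second. The data, the generic feature names, and the variable names are all hypothetical, and the result says nothing about the real Tg features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

demo_names = ['feat_0', 'feat_1', 'feat_2', 'feat_3']
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
# Synthetic target: strong dependence on column 0, weak on column 1, none on the rest
y_demo = 5.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the score drop
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)

# Print features from most to least important
for idx in perm.importances_mean.argsort()[::-1]:
    print(f"{demo_names[idx]}: {perm.importances_mean[idx]:.3f} +/- {perm.importances_std[idx]:.3f}")
```

Because the score drop is measured on held-out rows, this view reflects what the model actually relies on when predicting unseen samples.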


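The Random Forest model and the fitted scaler are saved for download. No cell in this notebook writes submission.csv, so the file list below keeps only the files that actually exist on disk.

In [None]:
import os
import joblib
joblib.dump(models['Random Forest'], 'rf_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# List the files available for download, skipping any that were never written
files = [f for f in ['rf_model.pkl', 'scaler.pkl', 'submission.csv'] if os.path.exists(f)]
files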
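
To use the saved artifacts later, the model and scaler are loaded back with `joblib.load`, and new samples are passed through the scaler before prediction. The sketch below is a self-contained round trip on synthetic data written to a temporary directory, so it does not touch the files saved above; the demo model, data, and variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic training data with four features (not the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

demo_scaler = StandardScaler().fit(X_demo)
demo_model = RandomForestRegressor(n_estimators=10, random_state=42)
demo_model.fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'rf_model.pkl')
    scaler_path = os.path.join(tmp_dir, 'scaler.pkl')
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)

    # Reload both artifacts, as a separate inference script would
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New samples must go through the same scaler before prediction
new_samples = rng.normal(size=(3, 4))
loaded_pred = loaded_model.predict(loaded_scaler.transform(new_samples))
original_pred = demo_model.predict(demo_scaler.transform(new_samples))
print(np.allclose(loaded_pred, original_pred))
```

New samples must have the same columns in the same order as the training features (FFV, Tc, Density, Rg in this notebook), since the scaler and model identify features by position.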