## Section 1: Goal & Imports

**Explanation**  
We aim to predict the critical temperature (TC) of materials using machine learning. Steps:
1. Expand features using matminer (Magpie descriptors).
2. Split data into train/test and perform EDA.
3. Compare Random Forest performance with and without Magpie features using cross-validation.

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.composition import ElementProperty
plt.rcParams['figure.dpi'] = 120
plt.rcParams['axes.grid'] = False
print('Imports OK')


**Questions**
- Why do we need feature expansion for material properties?

## Section 2: Load Dataset
**Explanation** 
- We load a dataset of materials with their critical temperature (TC).
- Important: Make sure `csv_path` points to the file. The assertion in the cell below fails with a clear message if it does not.

In [None]:
csv_path = Path('..') / 'data' / 'DS1.csv'
assert csv_path.exists(), f"Could not find {csv_path}. Adjust the path if you moved the file."

# Load data
df0 = pd.read_csv(csv_path)

**Task**  
- Check the first 5 rows of df0. What columns are present?

## Section 3: Feature Expansion with Matminer
**Explanation**  
- StrToComposition converts chemical formula strings into composition objects.
- ElementProperty adds Magpie descriptors (statistics of elemental properties).
- The added descriptors give the model chemistry-aware inputs, which may improve predictions. Section 7 tests whether they actually do.
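
To make "statistics of elemental properties" concrete, here is a minimal hand-rolled sketch. It is not matminer's implementation: the tiny `ATOMIC_NUMBER` table and the helpers `parse_formula` and `composition_stats` are hypothetical names introduced only for illustration, and the parser handles flat formulas without parentheses only.

```python
import re

# Hypothetical lookup table for illustration; Magpie covers many more properties.
ATOMIC_NUMBER = {"H": 1, "O": 8, "Cu": 29, "Y": 39, "Ba": 56}

def parse_formula(formula):
    """Return {element: amount} for a flat formula such as 'YBa2Cu3O7'."""
    amounts = {}
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(count) if count else 1.0)
    return amounts

def composition_stats(formula, table=ATOMIC_NUMBER):
    """Composition-weighted mean plus min, max, and range of one property."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    values = [table[el] for el in amounts]
    mean = sum(table[el] * n / total for el, n in amounts.items())
    return {"mean": mean, "min": min(values), "max": max(values),
            "range": max(values) - min(values)}

stats = composition_stats("YBa2Cu3O7")
print(stats)
```

Each (property, statistic) pair becomes one numeric column, so a single formula string turns into many features.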

In [None]:
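df = df0.copy()
# Parse the formula strings in 'Name' into a new 'composition' column
df = StrToComposition().featurize_dataframe(df, 'Name')
# Magpie preset: statistics of elemental properties for each composition
ep = ElementProperty.from_preset(preset_name='magpie', impute_nan=True)
df = ep.featurize_dataframe(df, col_id='composition')
print('After Magpie expansion:', df.shape)
df.head(2)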


**Questions**  
- How many features were added after Magpie expansion?
- Why might imputation (impute_nan=True) be necessary?

## Section 4: Train/Test Split
**Explanation**  
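- We build two feature sets: the original numeric features, and the original features plus the Magpie descriptors.
- Rule: Use only the training split for EDA and model fitting. Leave the test split untouched until the final evaluation.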
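
The key detail in the split is that both feature tables must contain exactly the same rows. Below is a minimal sketch of that idea in plain Python, assuming hypothetical toy tables keyed by row id. The notebook itself uses `train_test_split` and then reuses the resulting index, so the rows it selects differ from this toy shuffle.

```python
import random

def split_ids(ids, test_size=0.30, seed=123):
    """Shuffle row ids once and cut them into disjoint train/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy tables keyed by row id: one original feature, three expanded ones.
orig = {i: [i * 1.0] for i in range(10)}
expanded = {i: [i * 1.0, i * 2.0, i * 3.0] for i in range(10)}

train_ids, test_ids = split_ids(orig.keys())

# Reuse the same ids for both tables so the rows stay aligned.
orig_train = [orig[i] for i in train_ids]
expanded_train = [expanded[i] for i in train_ids]
print(len(train_ids), len(test_ids))
```

Splitting each table independently would risk comparing models trained on different rows, which would make the later comparison unfair.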

In [None]:
y_all = df['TC']
# Numeric columns only: original + Magpie features vs. original features alone
X_mag_all = df.drop(columns=['TC']).select_dtypes(include=[np.number])
X_orig_all = df0.drop(columns=['TC']).select_dtypes(include=[np.number])
# Drop rows with a missing target
mask = y_all.notna()
y_all = y_all.loc[mask]
X_mag_all = X_mag_all.loc[mask]
X_orig_all = X_orig_all.loc[mask]
# Split once, then reuse the resulting index so both feature sets share the same rows
X_orig_train, X_orig_test, y_train, y_test = train_test_split(
    X_orig_all, y_all, test_size=0.30, random_state=123, shuffle=True
)
X_mag_train = X_mag_all.loc[X_orig_train.index]
X_mag_test  = X_mag_all.loc[X_orig_test.index]
X_orig_train.shape, X_mag_train.shape


**Task**  
- Print shapes of X_orig_train and X_mag_train. Which has more features? Why?

## Section 5: EDA on Training Data
**Explanation**  
- Use describe() to understand TC distribution.
- Histogram shows skew and range.
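
As a rough illustration of what the summary statistics capture, the sketch below computes a five-number summary by hand with the standard library. The values in `toy` are hypothetical and not taken from the dataset. The `inclusive` method interpolates linearly between data points, which should agree with the quartiles that `describe()` reports.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical toy values with one large outlier, not taken from the dataset.
toy = [1, 2, 3, 4, 100]
summary = five_number_summary(toy)
mean = statistics.mean(toy)

# A mean well above the median is a quick hint of right skew.
right_skew_hint = mean > summary[2]
print(summary, mean, right_skew_hint)
```

Comparing the mean with the median is only a quick heuristic. The histogram remains the more reliable check.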

In [None]:
display(pd.DataFrame({'TC_train': y_train}).describe())
plt.figure(figsize=(6.2, 4.0))
plt.hist(y_train, bins=30, color='tab:blue', alpha=0.85, edgecolor='white')
plt.xlabel('TC [train]'); plt.ylabel('count')
plt.title('Histogram: TC (train)')
plt.tight_layout(); plt.show()


**Questions**
- What does the five-number summary tell you about TC?
- Is the TC distribution skewed?

## Section 6: Correlation Analysis
**Explanation**  
- Find the features in the Magpie-augmented set that correlate most strongly with TC.
- This hints at which features might be most predictive, but Pearson correlation only measures linear association.
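
For reference, here is a minimal sketch of the Pearson coefficient that `corrwith` computes by default, written in plain Python on hypothetical toy data. The second example shows why a low |r| does not rule out a strong non-linear relationship.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data.
x = [-2, -1, 0, 1, 2]
r_linear = pearson_r(x, [2 * v + 1 for v in x])   # perfectly linear relationship
r_square = pearson_r(x, [v ** 2 for v in x])      # strong but non-linear relationship
print(r_linear, r_square)
```

Tree-based models such as Random Forest can exploit non-linear relationships, so a feature that ranks low here may still matter to the model.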

In [None]:
# Keep only non-constant columns (correlation is undefined for constant features)
nonconst_mask = X_mag_train.nunique(dropna=False) > 1
Xc = X_mag_train.loc[:, nonconst_mask]
corr_series = Xc.corrwith(y_train)
corr_abs = corr_series.abs().sort_values(ascending=False)
topk = 20
top_feats = corr_abs.head(topk).index.tolist()
top_corr = corr_series.loc[top_feats].sort_values(key=lambda s: s.abs(), ascending=True)
plt.figure(figsize=(8.0, 5.0))
colors = ['tab:red' if v<0 else 'tab:green' for v in top_corr.values]
plt.barh(top_corr.index, top_corr.values, color=colors)
plt.axvline(0, color='k', lw=1)
plt.xlabel('Pearson r (feature, TC) [train]')
plt.title(f'Top {topk} features by |correlation| with TC (train)')
plt.tight_layout(); plt.show()


**Task**  
- Which feature has the strongest positive correlation with TC?

## Section 7: Model Comparison with Cross-Validation
**Explanation**  
Compare Random Forest models trained on the original vs. the Magpie-augmented features using 5-fold cross-validation (CV).  
Metric: R² (higher is better).
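
The two ingredients of this comparison can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not scikit-learn's code: `kfold_indices` is a hypothetical helper that builds contiguous, unshuffled folds, whereas the notebook shuffles with `KFold(shuffle=True)`.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs over contiguous, unshuffled folds."""
    fold_sizes = [n_samples // n_splits + (i < n_samples % n_splits)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

folds = list(kfold_indices(10, n_splits=5))
y = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r2(y, y)              # predictions equal the targets
r2_baseline = r2(y, [2.5] * 4)     # always predicting the mean of y
print(r2_perfect, r2_baseline)
```

Every sample lands in the validation fold exactly once, and a model that does no better than predicting the mean scores an R² of zero.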

In [None]:
rf_orig = RandomForestRegressor(random_state=123)
rf_mag = RandomForestRegressor(random_state=123)
cv = KFold(n_splits=5, shuffle=True, random_state=123)
scores_orig = cross_val_score(rf_orig, X_orig_train, y_train, scoring='r2', cv=cv)
scores_mag  = cross_val_score(rf_mag,  X_mag_train,  y_train, scoring='r2', cv=cv)
print('R^2 (orig) folds:', np.round(scores_orig, 4))
print('R^2 (mag ) folds:', np.round(scores_mag,  4))
print('Mean±Std R^2 (orig): {:.4f} ± {:.4f}'.format(scores_orig.mean(), scores_orig.std()))
print('Mean±Std R^2 (mag ): {:.4f} ± {:.4f}'.format(scores_mag.mean(),  scores_mag.std()))

**Questions**
- Which feature set performs better? Why might that be?

## Optional: Test Evaluation & Feature Importance
**Explanation**  
Evaluate both models on the held-out test set and inspect feature importance.

In [None]:
# cross_val_score fits clones, so fit the estimators themselves on the full training split
rf_orig.fit(X_orig_train, y_train)
rf_mag.fit(X_mag_train, y_train)
y_pred_orig = rf_orig.predict(X_orig_test)
y_pred_mag  = rf_mag.predict(X_mag_test)
def report(y_true, y_pred, label):
    r2  = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f'{label} - R^2: {r2:.4f}, MAE: {mae:.3f}')

report(y_test, y_pred_orig, 'TEST RF (original)')
report(y_test, y_pred_mag,  'TEST RF (+Magpie)')


**Task**  
- Which model generalizes better to test data?
- Why is MAE useful alongside R²?

#### Create a feature importance plot
`RandomForestRegressor` exposes `feature_importances_`, which reflects how much each feature reduces impurity, averaged across the trees.
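
To give a feel for "impurity reduction", the sketch below scores a single split by how much it lowers the variance of the targets, which is the impurity used by squared-error regression trees. It is a simplified illustration on hypothetical toy targets, not scikit-learn's implementation, which, as far as I know, also weights each node by the share of samples reaching it, sums the decreases per feature, and normalizes the result.

```python
def variance(values):
    """Population variance, the impurity measure for squared-error trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted_children

# Hypothetical toy targets at one tree node.
parent = [1.0, 1.0, 5.0, 5.0]
good_split = impurity_decrease(parent, [1.0, 1.0], [5.0, 5.0])  # separates the targets
poor_split = impurity_decrease(parent, [1.0, 5.0], [1.0, 5.0])  # separates nothing
print(good_split, poor_split)
```

A feature that is repeatedly chosen for splits like `good_split` accumulates a high importance, while one that only yields splits like `poor_split` stays near zero.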

In [None]:
# Column names of the original feature matrix used to fit rf_orig
feat_names_orig = X_orig_train.columns

# Importance values
imp_orig = rf_orig.feature_importances_

# Build a table and plot top-20
fi_orig = (pd.DataFrame({'feature': feat_names_orig, 'importance': imp_orig})
             .sort_values('importance', ascending=False)
             .head(20))

plt.figure(figsize=(7.5, 5.0))
plt.barh(fi_orig['feature'][::-1], fi_orig['importance'][::-1], color='tab:gray')
plt.xlabel('Feature importance (impurity-based)')
plt.title('Random Forest Feature Importances — Original features (top-20)')
plt.tight_layout(); plt.show()

fi_orig.head(20)
