# Dielectric Properties of Materials: A Materials Design Analysis

# Introduction

Dielectric materials are crucial to modern electronics, with applications ranging from energy-storage capacitors to tunable microwave devices. However, designing and discovering optimal dielectric materials is challenging because of the complex relationship among atomic structure, chemistry, and electronic behavior. With the power of informatics, we can decipher this relationship and design efficient dielectric materials tailored to various applications.

This notebook presents a roadmap for leveraging data to accelerate the discovery of novel dielectric materials by combining:

**Computational datasets**: 1,056 materials with DFT-calculated dielectric properties (Petousis et al., Sci. Data 2017), including the electronic contribution (e_electronic) and the total response (e_total, electronic + ionic).

**Machine learning**: Unsupervised clustering (k-means) and feature analysis to identify hidden structure-property relationships.

**Domain knowledge**: Crystal symmetry (space_group), bond lengths (avg_bond), and band gap (band_gap) as key descriptors.


**Key Questions:**

- How do features (e.g., bond lengths, band_gap, space_group, volume, structure) correlate with dielectric constants?

- Can we identify subgroups (e.g., ferroelectrics) with exceptional properties?

- What design rules emerge for high-κ or low-loss materials?

**Why This Matters:**

**For Researchers**: Identifies candidate materials for high-performance dielectrics.

**For Industry**: Accelerates the design of capacitors, insulators, and optoelectronic devices.

**For Methodology**: Demonstrates how informatics bridges DFT calculations and experimental discovery.

**Workflow**

This project analyzes a dataset of materials properties to identify potential ferroelectric materials with high dielectric constants.

 1. Data Loading and Preprocessing:
   - Loaded the 'dielectric_constant' dataset using matminer.
   - Explored the data's structure using .info() and .describe() and removed outliers based on quantile analysis.

 2. Exploratory Data Analysis (EDA):
   - Visualized the distribution of key features (nsites, volume, band gap, etc.) using boxplots and histograms.
   - Analyzed the relationship between features and the target variable ('pot_ferroelectric').

 3. Feature Engineering:
   - Created a new 'combined_feature' by combining several features ('poly_electronic', 'poly_total', and 'n').
   - Performed feature scaling using MinMaxScaler.
   - Trained a RandomForestRegressor model to determine feature importance.
   - Identified the most influential features for predicting 'pot_ferroelectric'.

 4. Model Selection and Evaluation:
   - Applied GridSearchCV to optimize hyperparameters for Logistic Regression, Decision Tree, and KNN classifiers.
   - Compared model performance based on accuracy and classification reports.

 5. Clustering Analysis:
   - Added bond length features to the dataset.
   - Performed K-Means clustering based on multiple features and observed cluster characteristics.
   - Visualized clusters based on different properties like bond length, band gap, etc.
   - Provided cluster statistics showing average values of important features for each cluster.

 6. Material Identification and Design Rules:
   - Identified potential high-κ candidates based on cluster assignments, poly_total, and band gap.
   - Presented design rules and potential applications based on identified clusters.
   - Created a combined table including cluster statistics, design rules and applications.

**About Dataset**

The dielectric_constant dataset includes 1,056 inorganic materials with dielectric properties computed using DFPT-PBE. It supports materials discovery in dielectrics and optics. Each entry contains:

**material_id**: Unique Materials Project ID

**formula**: Chemical formula of the material

**band_gap**: Band gap (indicator of material conductivity)

**n**: Refractive index

**e_electronic**: Electronic contribution to the dielectric tensor

**e_total**: Total dielectric tensor (electronic + ionic)

**poly_electronic**: Average eigenvalue of e_electronic

**poly_total**: Average eigenvalue of e_total

**nsites**: Number of atoms in the unit cell

**volume**: Volume of the unit cell (Å³)

**space_group**: Crystallographic space group (integer)

**pot_ferroelectric**: Indicates potential ferroelectricity

**structure**: Crystal structure of the material (one pymatgen Structure object per entry)

**cif, poscar, meta**: Optional structure/metadata fields

**Reference**: Petousis et al., Scientific Data, 2017

**Data Source**: Figshare

**File Type**: JSON (gzipped)

SHA256: 8eb24812148732786cd7c657eccfc6b5ee66533429c2cfbcc4f0059c0295e8b6
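
To make the poly_electronic and poly_total definitions concrete, the following minimal sketch averages the eigenvalues of a 3×3 dielectric tensor with NumPy. The tensor values are made up for illustration, and the relation n ≈ √(poly_electronic) is the usual optical approximation, assumed here rather than taken from the dataset documentation.

```python
import numpy as np

# Hypothetical electronic dielectric tensor (symmetric 3x3), values for illustration only
e_electronic = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [0.0, 0.0, 9.0],
])

# Average eigenvalue of the tensor, analogous to the poly_electronic column
eigenvalues = np.linalg.eigvalsh(e_electronic)
poly_electronic = eigenvalues.mean()

# Refractive index from the electronic response (assumed relation n = sqrt(eps))
n = np.sqrt(poly_electronic)

print(f"poly_electronic = {poly_electronic:.3f}, n = {n:.3f}")
```

The same averaging applied to e_total would give the analogue of poly_total.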

In [None]:
# Import the core dependencies

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score

In [None]:
# Silence library warnings by replacing warnings.warn with a no-op
def warn(*args, **kwargs):
    pass

import warnings
warnings.warn = warn

# 1. Data Loading and Preprocessing

In [None]:
!pip install matminer
! pip install pymatgen
from pymatgen.core import Structure
from matminer.datasets import get_available_datasets

In [None]:
%env MATMINER_DATA data
from matminer.datasets import load_dataset
df = load_dataset("dielectric_constant")
df.head()

In [None]:
from matminer.datasets import get_all_dataset_info

print(get_all_dataset_info("dielectric_constant"))

In [None]:
# Let's get a bird's-eye view of our dataset

df.info()

In [None]:
df.describe()

The dataset looks complete, with no null values. Data types span floats, integers, objects, and a boolean (pot_ferroelectric). The dataset also includes structure metadata (cif, poscar) for materials science applications. Let's explore it further.

# 2. Exploratory Analysis

Before fitting machine learning models, the data needs to be cleaned and scaled. We will use box plots to spot potential outliers and remove them, then scale the data with MinMax scaling. We only check the numerical columns needed for our analysis.

In [None]:
num_col = ['nsites', 'volume', 'band_gap', 'poly_total', 'space_group']

plt.figure(figsize=(12, 8)) 
for i, col in enumerate(num_col):
    if col in df.columns:
        plt.subplot(3, 2, i + 1) 
        sns.boxplot(x=df[col])
        plt.title(f'Box Plot of {col}')
        plt.xlabel(col) 
plt.tight_layout() 
plt.show()

The box plots reveal outliers in nsites (>15), volume (extreme highs), band gap (~8.32 eV), poly_total (max 277.78 vs. mean 14.78), and space_group (rare symmetries). Because the values are computed rather than measured, these may be calculation artifacts, unique materials, or meaningful extremes that require verification.

The following code will remove outliers using the IQR method (1.5×IQR range) from specified columns while preserving the original DataFrame. The non-destructive approach ensures data integrity throughout the analysis.
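
To make the 1.5×IQR rule concrete before applying it to the dataset, here is a small worked example on made-up numbers. It uses NumPy's default linear-interpolation percentiles, which is assumed to match the default behavior of pandas `quantile`.

```python
import numpy as np

# Toy data with one obvious outlier (values are illustrative only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values inside the fences
kept = values[(values >= lower_bound) & (values <= upper_bound)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_bound}, {upper_bound}), kept={kept.tolist()}")
```

Here the value 100 falls above the upper fence and is dropped, while the other nine values are kept.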

In [None]:
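def remove_outliers(df, num_col):
    """Drop rows outside the 1.5×IQR fences of each listed column; the input df is not modified."""
    df_cleaned = df.copy()
    for col in num_col:
        if col in df.columns:
            # Fences are computed from the original (unfiltered) column
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            # Filters accumulate: a row must pass the fences of every column
            df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)]
    return df_cleaned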


num_col = ['nsites', 'volume', 'band_gap', 'poly_total', 'space_group']
df_cleaned = remove_outliers(df, num_col)

# Plot the box plots again to visualize the result
plt.figure(figsize=(12, 8))  
for i, col in enumerate(num_col):
    if col in df_cleaned.columns:
        plt.subplot(3, 2, i + 1)  
        sns.boxplot(x=df_cleaned[col])
        plt.title(f'Box Plot of {col}')
        plt.xlabel(col)  
plt.tight_layout() 
plt.show()

Most, though not all, of the outliers have been removed, which is sufficient for the next step.
Before moving on to feature engineering, we need to understand the relationship between the features and our target variable (pot_ferroelectric).

In [None]:
pot_ferro_counts = df['pot_ferroelectric'].value_counts()
print(pot_ferro_counts)


plt.figure(figsize=(15, 10))

plt.subplot(3, 2, 1)
sns.histplot(x='nsites', hue='pot_ferroelectric', data=df_cleaned, kde=True)
plt.title('nsites vs pot_ferroelectric')

plt.subplot(3, 2, 2)
sns.histplot(x='volume', hue='pot_ferroelectric', data=df_cleaned, kde=True)
plt.title('volume vs pot_ferroelectric')

plt.subplot(3, 2, 3)
sns.histplot(x='poly_total', hue='pot_ferroelectric', data=df_cleaned, kde=True)
plt.title('poly_total vs pot_ferroelectric')

plt.subplot(3, 2, 4)
sns.histplot(x='band_gap', hue='pot_ferroelectric', data=df_cleaned, kde=True)
plt.title('band_gap vs pot_ferroelectric')

plt.subplot(3, 2, 5)
sns.histplot(x = 'space_group', hue = 'pot_ferroelectric', data = df_cleaned, kde = True)
plt.title('space_group vs pot_ferroelectric')

plt.tight_layout()
plt.show()

For nsites, ferroelectric materials show a preference for intermediate site counts (4-8), while non-ferroelectrics span a wider range. 

The poly_total distribution indicates that ferroelectrics concentrate at higher dielectric constant values (15-25), in contrast to the non-ferroelectric peak near 10.

Volume displays a bimodal distribution for ferroelectrics at ~10 and ~20 units, whereas non-ferroelectrics dominate mid-range volumes.

Band gap analysis shows ferroelectrics clustering at lower values (1-3 eV), while non-ferroelectrics extend to higher gaps (3-6 eV). These trends suggest ferroelectricity correlates with moderate nsites, higher poly_total, specific volume ranges, and smaller band gaps, highlighting key discriminative features for classification.

Ferroelectric materials cluster strongly in specific space groups (150-200), while non-ferroelectrics are distributed more uniformly, indicating that space group is a key predictor of ferroelectric behavior.

# 3. Feature Engineering and Scaling

We want to build a classification model that predicts whether a material is ferroelectric based on our features. Before that, we will examine how the features correlate with the target variable (pot_ferroelectric) and with each other.

First, though, we need to do some scaling and label encoding.

In [None]:
# First, cast the continuous features to int64 (note: this truncates the decimal part)
for col in ['volume', 'band_gap', 'n', 'poly_electronic', 'poly_total']:
    df_cleaned[col] = df_cleaned[col].astype(np.int64)
df_cleaned.head(5)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_cleaned = df_cleaned.drop(['material_id', 'formula', 'structure', 'cif', 'meta', 'poscar', 'e_electronic', 'e_total'], axis=1)
df_cleaned['pot_ferroelectric'] = le.fit_transform(df_cleaned['pot_ferroelectric'])

In [None]:
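from sklearn.preprocessing import MinMaxScaler

# Scale the feature columns to [0, 1] and store the result; the encoded target stays as 0/1 labels
scaler = MinMaxScaler()
feature_cols = df_cleaned.columns.drop('pot_ferroelectric')
df_cleaned[feature_cols] = scaler.fit_transform(df_cleaned[feature_cols])
df_cleaned.head()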

Now that the data has been scaled and encoded, we can examine the correlations using a heat map and a pair plot.
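
For reference, min-max scaling maps each column to the range [0, 1] via x' = (x - min) / (max - min). A minimal NumPy sketch on made-up values (not taken from the dataset):

```python
import numpy as np

# Hypothetical feature column (e.g., unit-cell volumes), for illustration only
x = np.array([50.0, 100.0, 250.0])

# Min-max scaling: x' = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.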

In [None]:
correlation_matrix = df_cleaned.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Dielectric Constant Dataset')
plt.show()

In [None]:
sns.pairplot(df_cleaned)

The heat map and pair plot show a linear correlation among poly_electronic, poly_total, and n. Let's analyze the importance of these features further and see what else we can learn.

In [None]:
X = df_cleaned.drop('pot_ferroelectric', axis=1)
y = df_cleaned['pot_ferroelectric']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)


importances = rf_model.feature_importances_


feature_importances = pd.DataFrame({'feature': X.columns, 'importance': importances})
feature_importances = feature_importances.sort_values('importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importances)
plt.title('Feature Importance')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

feature_importances

We can observe that nsites, space_group, and volume contribute the most to predicting whether a material is ferroelectric.

On the other hand, poly_electronic, poly_total, and n contribute the least, but they are positively correlated with each other, so we can combine these three into a single feature.

In [None]:
# Combine the three correlated features into one, using the cleaned data throughout

df_cleaned['combined_feature'] = df_cleaned[['poly_electronic', 'poly_total', 'n']].sum(axis=1)

X = df_cleaned.drop(['pot_ferroelectric', 'poly_electronic', 'poly_total', 'n'], axis=1)
y = df_cleaned['pot_ferroelectric']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Get and plot feature importances
importances = rf_model.feature_importances_
feature_importances = pd.DataFrame({'feature': X.columns, 'importance': importances})
feature_importances = feature_importances.sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importances)
plt.title('Feature Importance with Combined Feature')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

feature_importances

# 4. Model Selection and Evaluation

We have now combined poly_electronic, poly_total, and n into a single feature. With feature engineering complete, we can proceed to the classification models.

Three classification models (Logistic Regression, KNN, and Decision Tree) will be used to assess whether a material is ferroelectric based on our features. GridSearchCV will select the best hyperparameters for each model.
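
One optional refinement, sketched here on synthetic data rather than the notebook's dataset, is to wrap the scaler and the classifier in a scikit-learn `Pipeline` so that the scaler is refit inside each cross-validation fold. The names `X_demo`, `y_demo`, and the step names are illustrative assumptions, not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the real notebook would use its own X and y
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Scaling lives inside the pipeline, so each CV fold fits the scaler on its training part only
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as "<step>__<parameter>"
param_grid = {"knn__n_neighbors": [3, 5, 7], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

# Accuracy of the refit best pipeline on the held-out split
test_accuracy = search.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping preprocessing inside the pipeline avoids leaking information from validation folds into the scaler.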

In [None]:

X = df_cleaned[['nsites', 'space_group', 'volume', 'combined_feature', 'band_gap']]
y = df_cleaned['pot_ferroelectric']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


models = {
    'Logistic Regression': (LogisticRegression(solver='liblinear', max_iter=1000), {
        'C': [0.1, 1, 10]
    }),
    'Decision Tree': (DecisionTreeClassifier(), {
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }),
    'KNN': (KNeighborsClassifier(), {
        'n_neighbors': [3, 5, 7],
        'weights': ['uniform', 'distance']
    })
}


for model_name, (model, param_grid) in models.items():
    grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    print(f"Best parameters for {model_name}: {grid_search.best_params_}")
    print(f"Best score for {model_name}: {grid_search.best_score_}")

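    y_pred = grid_search.predict(X_test)
    print(f"Classification Report for {model_name}:\n{classification_report(y_test, y_pred)}")
    # ROC-AUC is computed from predicted probabilities of the positive class, not hard labels
    y_proba = grid_search.predict_proba(X_test)[:, 1]
    print(f"ROC-AUC for {model_name}: {roc_auc_score(y_test, y_proba)}")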

**Classification Report: Results and Suggestions**

**Performance Summary**

Logistic Regression: Severe imbalance in classification; only predicts class 1 well. Model underperforms for class 0 — possibly due to data imbalance or poor linear separability.

Decision Trees: Balanced and strong performance on both classes. This is the best model overall in terms of generalization and class-wise metrics.

KNN: Decent for class 1, poor on class 0. Slight improvement over Logistic Regression but not as good as Decision Tree.
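
When one class dominates, plain accuracy can hide poor performance on the minority class. The toy sketch below, using made-up labels rather than the notebook's predictions, contrasts accuracy with balanced accuracy (the mean of the per-class recalls).

```python
import numpy as np

# Made-up labels: class 1 is the majority, class 0 the minority
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0])

# Plain accuracy: fraction of all predictions that are correct
accuracy = np.mean(y_true == y_pred)

# Recall per class: fraction of each true class that was predicted correctly
recall_0 = np.mean(y_pred[y_true == 0] == 0)
recall_1 = np.mean(y_pred[y_true == 1] == 1)

# Balanced accuracy weights both classes equally
balanced_accuracy = (recall_0 + recall_1) / 2

print(f"accuracy={accuracy:.3f}, recall_0={recall_0:.2f}, "
      f"recall_1={recall_1:.2f}, balanced_accuracy={balanced_accuracy:.2f}")
```

In this toy case accuracy is about 0.83, yet only half of the class 0 samples are recovered, which balanced accuracy (0.75) makes visible.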



# 5. Cluster Analysis

The structure column of our data contains a lot of information, such as bond lengths and coordination numbers. In our cluster analysis, we will extract bond lengths from the structure, group the data using them, and analyze how bond length affects different properties of dielectric materials.

In [None]:
df.head(3)

In [None]:
def get_bond_features(structure):
    bond_matrix = structure.distance_matrix  
    mask = bond_matrix > 0  
    bonds = bond_matrix[mask]

    return {
        'min_bond': np.min(bonds),
        'avg_bond': np.mean(bonds),
        'max_bond': np.max(bonds),
        'bond_std': np.std(bonds),
    }


bond_features = df['structure'].apply(get_bond_features).apply(pd.Series)
df = pd.concat([df, bond_features], axis=1)
df.head(5)

The get_bond_features function takes a pymatgen Structure object and reads its pairwise distance matrix, which holds the distances between all pairs of sites in the structure. A mask excludes the zero self-distances on the diagonal. From the remaining distances it computes four statistics: the minimum bond length (min_bond), average bond length (avg_bond), maximum bond length (max_bond), and the standard deviation of bond lengths (bond_std), returned as a dictionary. Note that these are all interatomic distances, not only nearest-neighbor bonds, so the features are a proxy for bond length.
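
The same statistics can be illustrated without pymatgen on a tiny, non-periodic toy geometry. The coordinates below are hypothetical, and the distance matrix is built by hand with NumPy rather than read from a Structure object.

```python
import numpy as np

# Three hypothetical atomic positions in Å (a 3-4-5 right triangle, no periodicity)
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, 4.0, 0.0],
])

# Pairwise distance matrix: entry (i, j) is the distance between atoms i and j
diff = coords[:, None, :] - coords[None, :, :]
distance_matrix = np.linalg.norm(diff, axis=-1)

# Keep each pair once by taking the upper triangle above the diagonal
pair_distances = distance_matrix[np.triu_indices(len(coords), k=1)]

features = {
    "min_bond": pair_distances.min(),
    "avg_bond": pair_distances.mean(),
    "max_bond": pair_distances.max(),
    "bond_std": pair_distances.std(),
}
print(features)
```

Masking with `> 0` instead of the upper triangle counts every pair twice, which leaves these four statistics unchanged.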

There are min_bond, max_bond, avg_bond and bond_std features, but we will use avg_bond for our clustering purpose. 

In [None]:

X = df[['poly_total', 'avg_bond', 'band_gap', 'volume', 'space_group', 'nsites', 'n', 'pot_ferroelectric']].dropna()

kmeans = KMeans(n_clusters=4, random_state=42)
# Align labels by index so any rows dropped by dropna() do not break the assignment
df['cluster'] = pd.Series(kmeans.fit_predict(X), index=X.index)
df['cluster'].head(5)


Now that the data has been clustered, let's extract some insights from it.
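
K-means relies on Euclidean distances, so features with large numeric ranges (such as volume) can dominate features with small ranges (such as band_gap) when they are clustered unscaled. The following sketch uses synthetic data, not the notebook's dataset, to compare clustering with and without standardization; all names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: two groups that differ only in a small-scale feature
group = np.repeat([0, 1], 100)
small_feature = np.where(group == 0, 1.0, 5.0) + rng.normal(0, 0.1, 200)  # band-gap-like scale
large_feature = rng.uniform(100, 400, 200)                                # volume-like scale
X_demo = np.column_stack([small_feature, large_feature])

def agreement(labels, truth):
    """Fraction of points matching the true groups, allowing for swapped cluster labels."""
    match = np.mean(labels == truth)
    return max(match, 1 - match)

# Cluster the raw features, then the standardized features
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X_demo)
)

agreement_raw = agreement(raw_labels, group)
agreement_scaled = agreement(scaled_labels, group)
print(f"raw: {agreement_raw:.2f}, scaled: {agreement_scaled:.2f}")
```

If the cluster assignments in this notebook turn out to track volume closely, standardizing the features before calling KMeans would be a reasonable variation to try.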

# 6. Materials Selection and Design Rules 

The clusters we have generated contain key information that can be leveraged to find the best materials for specific applications.

In [None]:
print(df.groupby('cluster')[['poly_total', 'avg_bond', 'band_gap', 'nsites', 'n', 'volume', 'space_group']].mean())

In [None]:


cluster_stats = pd.DataFrame({
    'Cluster': [0, 1, 2, 3],
    'Avg Dielectric (poly_total)': [13.51, 15.73, 13.54, 14.28],
    'Avg Bond Length (Å)': [3.12, 2.92, 4.08, 3.70],
    'Avg Band Gap (eV)': [1.99, 2.36, 1.78, 1.74],
    'Avg Volume (Å³)': [180.21, 101.55, 398.16, 238.37],
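    # Cluster means of space_group rounded to a space group number; symbols follow the International Tables
    'Space Group': ['55 (Pbam)', '178 (P6₁22)', '86 (P4₂/n)', '184 (P6cc)']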
})

# Design rules and applications
design_rules = pd.DataFrame({
    'Cluster': [0, 1, 2, 3],
    'Material Class': [
        'Dense oxides (e.g., perovskites)',
        'High-polarizability materials',
        'Layered/porous materials',
        'Narrow-gap semiconductors'
    ],
    'Design Rules': [
        'Short bonds + small unit cells → Moderate κ',
        'Very short bonds + low nsites → High κ',
        'Large volumes → Low κ',
        'Low band gap → Optoelectronic potential'
    ],
    'Applications': [
        'Capacitors, insulators',
        'High-κ dielectrics',
        'Thermal insulators',
        'Sensors, photovoltaics'
    ]
})

# Merge tables
final_table = pd.merge(cluster_stats, design_rules, on='Cluster')

final_table

**Insights**

To design high-κ dielectric materials, we can prioritize cluster 1, with short bonds and small nsites.

For low-κ insulators, we can explore cluster 2, with large volumes.

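Cluster 0 has a mean space group number near 55 and cluster 1 near 178; cluster 1 also contains the most highly polarizable materials. Because these are averages of a categorical label, they indicate only a rough symmetry range rather than a dominant space group.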

All clusters have n ≈ 2.4–2.5, suggesting no strong optical design trade-offs between clusters.

Exception: Cluster 3 has slightly lower n (2.36) — useful for anti-reflective coatings.

Cluster 2 (large volume) has lower κ despite long bonds. This may indicate porosity: High void fraction reduces ionic polarization.
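
Because space_group is a categorical label, its cluster mean can land on a value that no material in the cluster actually has. As a hedged alternative, the sketch below reports the most frequent space group per cluster next to the means; the toy table is made up and the column names in `summary` are illustrative.

```python
import pandas as pd

# Toy stand-in for the clustered DataFrame (values are illustrative only)
toy = pd.DataFrame({
    "cluster":     [0, 0, 0, 1, 1, 1],
    "poly_total":  [12.0, 14.0, 13.0, 15.0, 16.0, 17.0],
    "space_group": [62, 62, 14, 186, 221, 186],
})

summary = toy.groupby("cluster").agg(
    mean_poly_total=("poly_total", "mean"),
    mean_space_group=("space_group", "mean"),
    # Most frequent space group in each cluster (first one if there is a tie)
    modal_space_group=("space_group", lambda s: s.mode().iloc[0]),
)
print(summary)
```

In the toy data, cluster 0 has a mean space group of 46, which none of its members has, while the mode (62) names a space group that is actually present.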

In [None]:
# Similarly, we can prioritize high-κ materials from cluster 1.

high_k_mat = df[
    (df['cluster'] == 1) & 
    (df['poly_total'] > 15) & 
    (df['band_gap'] > 2.0)  
][['formula', 'material_id', 'space_group']]
print(high_k_mat)



# **Discussion**

The insights generated from our dielectric data can be used in several ways.

Cluster 1 (high-κ dielectrics, poly_total=15.7) exhibited the shortest bonds (avg_bond=2.92 Å) and small unit cells (volume=101 Å³), aligning with known high-polarizability materials such as ferroelectrics (mean space group number ≈ 178). This suggests that compressing bond lengths while maintaining low nsites enhances ionic polarization.

Cluster 2 (large-volume materials, volume=398 Å³) showed moderate κ despite long bonds (4.08 Å), likely due to porosity or weak interlayer interactions, making them candidates for low-κ insulators.

Clusters 0 and 3 revealed trade-offs: Cluster 0 combined moderate κ (13.5) with dense packing (space group 55), while Cluster 3’s low band gap (1.74 eV) hints at optoelectronic potential.

The sole purpose of this work was to show how materials informatics and an ML-accelerated approach could revolutionize materials discovery and design. Going further, we could analyze compositions, learn latent representations with an autoencoder, design new materials, and much more.



**Prepared by:** *Usman Raza*  
**Date:** April 24, 2025  
**Project:** Dielectric Properties of Materials: A Materials Design Analysis  
**Contact:** uraza2119@gmail.com