I used __"Sine matrix"__ representation of crystal structure, which is introduced in this [comprehensive article](https://arxiv.org/abs/1503.07406).  
I also refered to the [wonderful Kernel](https://www.kaggle.com/tonyyy/how-to-get-atomic-coordinates) by Tony Y.

In [1]:
# calculation and plot
import numpy as np
import numpy.linalg as LA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# data processing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# predictor
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Reading table data

In [34]:
# train
df_train = pd.read_csv('../input/train.csv')
df_train['dataset'] = 'train'
# test
df_test = pd.read_csv('../input/test.csv')
df_test['dataset'] = 'test'
test_len = len(df_test)
# merge train and test
df = pd.concat([df_train, df_test], axis=0, ignore_index=True, sort=False)
df_len = len(df)

In [35]:
df.head()

In [36]:
df.tail()

# Getting "Sine matrix"

## Calculation methods

### Standardized (0 to 1) coordinates
$\mathbf{R}=\left(\begin{matrix}X\\Y\\Z\end{matrix}\right)$
: position vector of an atom in the cell (Angstrom)  
$\mathbf{a}_i, (i = 1,2,3)$ : lattice vector (column vector)  

If you consider lattice vectors as basis vectors, you can use ratio values to them: $\mathbf{r} = (x, y, z)^T$ as a new coordinate, instead of $\mathbf{R}$.

I mean, with $A = (\mathbf{a}_1, \mathbf{a}_2, \mathbf{a}_3)$,

\begin{aligned}
\mathbf{R} &= x\mathbf{a}_1 + y\mathbf{a}_2 + z\mathbf{a}_3 = A\mathbf{r}\\
\mathbf{r} &= A^{-1}\mathbf{R}
\end{aligned}

### Periodicity
[This article](https://arxiv.org/abs/1503.07406) introduces a simple way to take account of periodicity of crystal structure.  

The idea is using alternative coodinate $\mathbf{r}'$, instead of $\mathbf{r}$.

\begin{equation}
\mathbf{r}' =  \sin^2\mathbf{r} = (\sin^2x, \sin^2y, \sin^2z)^T
\end{equation}

As you can see from the definition, $x' = \sin^2x$ becomes the maximum at $x=0.5$ and decreases after it,
because the section of $x = 0.5 \to 1$ corresponds to the section of $x = 0 \to -0.5$

### Sine matrix (modified Coulomb matrix)
Refer to [the article above](https://arxiv.org/abs/1503.07406) and [the article about Coulomb matrix](https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.108.058301).

Now, we re-convert $\mathbf{r}'$ to Angstrom scale again.  
\begin{equation}
\mathbf{R}' = A\mathbf{r}'
\end{equation}

And each component of __Sine matrix__ $M$ is defined as below.  
\begin{equation}
M_{ij} =\begin{cases}
            0.5Z_i^{2.4} \quad (i=j)\\
            \frac{Z_i Z_j}{|\mathbf{R}'_{ij}|} \quad (i\neq j)
        \end{cases}
\end{equation}
Where  $Z_i$ is the atomic number of the $i$ th atom,
and $\mathbf{R}'_{ij}$ is $\mathbf{R}'$ vector between $i$ th and $j$ th atoms.

### Eigenspectrum
Sine matrix data can't be used directly as a feature vector,
we use the __eigenspectrum(sorted eigenvalues)__ of the matrix instead.

### PCA
__PCA(principal component analysis)__ of the eigenspectrum can increase accuracy.

## Function to process .xyz files

I greatly refered to this [wonderful Kernel](https://www.kaggle.com/tonyyy/how-to-get-atomic-coordinates) by Tony Y.  
Each element of list `xyz`, `lattice` is $\mathbf{R}^T$ and $\mathbf{a}_i^T$, respectively.  

In [5]:
def get_xyz(filename):    
    row = []
    xyz = []
    lattice = []
    
    with open(filename) as f:
        for line in f.readlines():
            row = line.split()
            if row[0] == 'atom':
                xyz.append((np.array(row[1:4], dtype=np.float), row[4]))
            elif row[0] == 'lattice_vector':
                lattice.append(np.array(row[1:4], dtype=np.float))
    
    return xyz, lattice

## Function to get "Sine matrix"

In [6]:
def get_sine_matrix(xyz, lattice):
    
    n_atom = len(xyz) # number of atoms in the cell
    
    distance_matrix = np.ones((n_atom, n_atom))
    A = np.transpose(lattice) # A = (a_1, a_2, a_3), defined as above
    B = LA.inv(A) # inverse matrix of A
    
    # matrix of distance
    for i in range(n_atom):
        for j in range(i):
            r_ij = np.dot(B, xyz[i][0] - xyz[j][0])
            sin_sq_r = (np.sin(np.pi * r_ij))**2
            distance = LA.norm(np.dot(A, sin_sq_r))
            distance_matrix[i, j], distance_matrix[j, i] = distance, distance
    # Note that diagonal components remain 1
    
    # matrix of charge by charge
    labels = np.transpose(xyz)[1] # element symbol labels
    labels = labels.reshape(-1,1)
    for at, charge in zip(['O', 'Al', 'Ga', 'In'], [8, 13, 31, 49]): # convert symbols into electric charges
        labels = np.where(labels==at, charge, labels)
    charge_matrix = np.dot(labels, np.transpose(labels)).astype(np.float)
    charge_matrix -= np.diag(np.diag(charge_matrix)) # let diagonal components zero
    charge_matrix += np.diag(0.5 * labels**2.4).astype(float) # from the definition
    
    # sine matrix
    sine_matrix = charge_matrix / distance_matrix
    
    return sine_matrix

## Function to get eigenspectrum of a matrix

In [7]:
def get_eigenspectrum(matrix):
    spectrum = LA.eigvalsh(matrix)
    spectrum = np.sort(spectrum)[::-1]
    
    return spectrum

## Processiong all data

In [8]:
spectrum_list = []

for index in range(df_len):
    
    dataset_label = df.dataset.values[index]
    row_id = df.id.values[index]
    filename = "../input/{}/{}/geometry.xyz".format(dataset_label, row_id)
    
    # file processing
    xyz, lattice = get_xyz(filename)
    # sine matrix
    sine_matrix = get_sine_matrix(xyz, lattice)
    # eigen spectrum
    spectrum = get_eigenspectrum(sine_matrix)
    
    spectrum_list.append(spectrum)

# Visualizing the eigenspectrum
Examples:`spectrum_list[0:5]`

In [9]:
fig, axs = plt.subplots(1,5, figsize=(15, 6))
for i in range(5):
    ax = axs[i]
    plot_data = spectrum_list[i]
    ax.plot(range(len(plot_data)), plot_data)
    ax.hlines(0, 0, 80, colors='r')
plt.show()

# Shaping data
## Making eigenspectrum  dataframe

In [37]:
spectrum_df = pd.DataFrame(spectrum_list).astype(np.float)
spectrum_df = spectrum_df.fillna(0)

## PCA

In [38]:
# standard scaling
ss = StandardScaler()
spectrum_std_df = pd.DataFrame(ss.fit_transform(spectrum_df.values))
# PCA
pca = PCA(n_components=80)
spectrum_pca_df = pd.DataFrame(pca.fit_transform(spectrum_std_df.values))

spectrum_pca_df.head()

In [39]:
plt.scatter(x=spectrum_pca_df.loc[:100,0], y=spectrum_pca_df.loc[:100,1])
plt.show()

## Making new features from table data
Tendency in "lattice vector length" or "lattice angle" clearly differs by "number_of_total_atoms" and "spacegroup".

### One-Hot encoding by "number of atoms" and "spacegroup"

In [40]:
df.number_of_total_atoms = df.number_of_total_atoms.astype('int')
df['group_natoms'] = df.spacegroup.astype('str') + '_' + df.number_of_total_atoms.astype('str')

In [41]:
sns.lmplot(x='lattice_vector_1_ang', y='bandgap_energy_ev', hue='group_natoms', data=df, fit_reg=False)
plt.show()

In [42]:
df = df.join(pd.get_dummies(df.group_natoms))
df.drop(['group_natoms'], axis=1, inplace=True)

## Removing columns

__Remind that we added an extra "dataset" column.__
Now we have to drop it.

In [43]:
df.columns

In [44]:
df.drop(['dataset'], axis=1, inplace=True)
df.drop(['id', 'spacegroup', 'number_of_total_atoms'], axis=1, inplace=True)

In [45]:
df.columns

## Merging

In [46]:
df_new = pd.concat([spectrum_pca_df, df], axis=1)
df_new.head()

## Outliers

It may be better to remove duplicate rows, as the organizer suggested in [this posting](  
https://www.kaggle.com/c/nomad2018-predict-transparent-conductors/discussion/47998).

In [47]:
df_new.drop([395, 126, 1215, 1886, 2075, 353, 308, 2154, 531, 1379, 2319, 2337, 2370, 2333], axis=0, inplace=True)

In [48]:
df_new.shape

## Making train and test data array

In [49]:
df_len = len(df_new)
train_len = df_len - test_len
# X
X_train = df_new.drop(['formation_energy_ev_natom', 'bandgap_energy_ev'], axis=1)[:train_len].values
X_test = df_new.drop(['formation_energy_ev_natom', 'bandgap_energy_ev'], axis=1)[train_len:].values
# y
y_formation = df_new['formation_energy_ev_natom'][:train_len].values
y_bandgap = df_new['bandgap_energy_ev'][:train_len].values

# Fitting by XGBRegressor
## Fitting "formation"

In [50]:
xgb_formation = xgb.XGBRegressor()
parameters = {
    'max_depth': [2, 3, 4],
    'n_estimators' : [100, 200, 300],
             }

cv_formation = GridSearchCV(xgb_formation, param_grid=parameters, cv=4, verbose=1)
cv_formation.fit(X_train, y_formation)

In [51]:
cv_formation.best_params_

## Fitting "bandgap"

In [52]:
xgb_bandgap = xgb.XGBRegressor()
parameters = {
    'max_depth': [2, 3, 4],
    'n_estimators' : [100, 200, 300],
             }

cv_bandgap = GridSearchCV(xgb_bandgap, param_grid=parameters, cv=4, verbose=1)
cv_bandgap.fit(X_train, y_bandgap)

In [53]:
cv_bandgap.best_params_

## Feature importances

In [54]:
def plot_features(estimator, features):
    importances = estimator.feature_importances_
    plt.figure(figsize=(10, 15))
    plt.barh(range(len(importances)), importances , align='center')
    plt.yticks(np.arange(len(features)), features)
    plt.show()

In [55]:
features = df_new.drop(['formation_energy_ev_natom', 'bandgap_energy_ev'], axis=1).columns

plot_features(cv_formation.best_estimator_, features)
plot_features(cv_bandgap.best_estimator_, features)

# Submission

In [26]:
formation_pred = cv_formation.predict(X_test)
bandgap_pred = cv_bandgap.predict(X_test)

In [28]:
submission = pd.DataFrame(np.arange(1, test_len + 1), columns=['id'])
submission['formation_energy_ev_natom'] = formation_pred
submission['bandgap_energy_ev'] = bandgap_pred
submission.shape

In [29]:
submission.head()

In [None]:
submission.to_csv('submission.csv', index=False)

# Scores

As a late submission, the prediction reached...

|Private Score|Public Score|
|---|---|
|0.0666|0.0499|