# Edex 1: Determine the space group of a structure from the atomic pair distribution function using Convolutional Neural Networks

## Problem Statement and Motivations

In this edex, we use a 1D convolutional neural network (CNN) model to solve a materials science problem.

#### Goal: 
Predict the space group of a crystal structure given a calculated or measured atomic pair distribution function (PDF) from that structure. We will train on around 40,000 PDFs calculated from structures in 8 of the most common space groups.

#### Motivation: 
Materials scientists are interested in studying structure-property relationships. The ML model would allow us to quickly and easily obtain a list of the most likely space groups, which can be used for subsequent structural modeling. It allows us to narrow down the range of possible space groups at an early stage of the research. It also saves a lot of time compared to searching structural databases manually.

#### Relevant materials science background
- Crystalline materials consist of a repeating arrangement of atoms with long-range translational symmetry. The geometric symmetries of crystals are described by space groups. Each space group contains a set of geometric symmetry operations that map a crystal structure back onto itself. As the name suggests, this set of operations forms a group in the mathematical sense. In 3D, there are 230 space groups, and any crystal structure is described by one and only one of them.
- We are interested in determining the structure of crystalline materials because it is crucial for understanding the materials' properties. Pair distribution function (PDF) analysis is a powerful method for solving crystal structures. The PDF is a 1D spatial function that describes the distribution of distances between pairs of particles contained within a given volume. It can be calculated from a structure or obtained experimentally from powder diffraction data. However, it is not a simple task to deduce the space group from PDF data, because the PDF does not directly reveal the overall symmetry or the unit cell of the material.
- We know that symmetry information must exist in the PDF, but we do not yet have a theory for identifying the space group from the PDF. An ML model is therefore a promising way to learn the predictive relationship between PDF and space group. We also have a considerable amount of data for training.

#### Original Papers

This edex is a simplified version adapted from the papers below:

Liu, C. H., Tao, Y., Hsu, D., Du, Q., & Billinge, S. J. (2019). Using a machine learning approach to determine the space group of a structure from the atomic pair distribution function. Acta Crystallographica Section A: Foundations and Advances, 75(4), 633-643.

Lan, L., Liu, C. H., Du, Q., & Billinge, S. J. (2022). Robustness test of the spacegroupMining model for determining space groups from atomic pair distribution function data. Journal of Applied Crystallography, 55(3).

## Frame the problem in machine learning

Recall that machine learning can be roughly divided into supervised and unsupervised learning. Supervised learning uses labeled datasets for training, while unsupervised learning analyzes and clusters unlabeled datasets. Supervised learning can be further categorized into classification problems and regression problems. Classification problems assign inputs to specific categories. Regression problems aim to understand the relationship between input and output variables, so a regression model predicts numerical values based on the input data.

In this problem, we will train on around 40,000 PDFs, each calculated from a structure in one of 8 of the most common space groups. Because the target is one of a fixed set of categories, this is a classification task. The output is an array of length 8 in which each entry corresponds to the probability of the structure being in one of the space groups. The probabilities in the array sum to 1.

In summary:
- input: PDF data (1D array)
- output: the probabilities of the crystal being in each spacegroup candidate (1D array of length 8)
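
As a rough illustration of what such an output looks like, the sketch below turns 8 made-up raw scores into probabilities with a softmax function (the activation used in the final layer later in this edex). The scores and the helper name `softmax` are assumptions for illustration only.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up raw scores for the 8 candidate space groups (illustration only)
scores = np.array([0.1, 2.0, 0.3, -1.0, 0.0, 0.5, 1.2, -0.4])
probs = softmax(scores)

print(probs.round(3))          # 8 probabilities, one per candidate
print(probs.sum())             # sums to 1 up to floating-point error
print(int(np.argmax(probs)))   # index of the most likely candidate
```

The predicted class is simply the index with the largest probability; the full array is still useful because it ranks the remaining candidates.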

### Import libraries

First, import the necessary libraries. If any of these packages are not installed in your local environment, install them using pip or conda in the terminal.

You might need to import more libraries or functions as you work through the problems. 

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
import tensorflow as tf
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model
from tensorflow import random
from tensorflow.keras import regularizers
from tensorflow.keras.layers import Input, Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Activation
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import LearningRateScheduler, EarlyStopping

## Data preprocessing

First, read the data from the CSV files:
* input: "x.csv"
* label: "y.csv"

You should have 2 numpy arrays of the following sizes:
* X.shape = (50550, 209)
* y.shape = (50550,)

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
X = 
y = 



We need to further reshape `X` to shape (50550, 209, 1); that is, each sample needs to be of shape (steps, input_dim). This is a requirement for 1D convolutional layers in Keras.
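
To see what adding a trailing axis does, here is a small sketch on a stand-in array. The array `toy` (5 samples of zeros) is an assumption used only to keep the example self-contained; the real data has 50550 rows.

```python
import numpy as np

# Stand-in for the real data: 5 samples, 209 points each (illustration only)
toy = np.zeros((5, 209))

# Three equivalent ways to add a trailing "channel" axis of size 1
toy_a = toy[..., np.newaxis]
toy_b = np.expand_dims(toy, axis=-1)
toy_c = toy.reshape(toy.shape[0], toy.shape[1], 1)

print(toy.shape, "->", toy_a.shape)
```

All three variants hold the same values; only the shape changes.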

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
X = 




Now split the data into training and testing sets using `train_test_split` from `sklearn.model_selection`. Specify the following arguments:
* `test_size=0.2` sets the testing data to 20% of all the data.
* `random_state` is an optional argument that makes the split the same each time (reproducible).
* `shuffle=True` shuffles the data before the split.
* `stratify=y` ensures that each target category (the 8 space groups) has the same proportional representation in the testing data as in the whole dataset.
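
To make the idea behind `stratify` concrete, the following NumPy-only sketch splits each class separately so that the class proportions are preserved. It is an illustration of the concept, not a replacement for `train_test_split`, whose exact allocation may differ; the helper name `stratified_split_indices` and the toy labels are hypothetical.

```python
import numpy as np

def stratified_split_indices(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), sampling test_size from each class."""
    rng = np.random.default_rng(seed)   # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # shuffle within the class
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

# Toy, imbalanced labels (illustration only): 50% / 30% / 20%
toy_labels = np.array([2] * 50 + [14] * 30 + [225] * 20)
train_idx, test_idx = stratified_split_indices(toy_labels)

for cls in np.unique(toy_labels):
    frac = np.mean(toy_labels[test_idx] == cls)
    print(cls, "fraction in test set:", frac)
```

With these toy labels, the test set keeps the same 50% / 30% / 20% class mix as the full array.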

#### <font color='RED'>YOUR SOLUTION:</font> 

We can check the shape of these arrays:

In [None]:
print("X_train: ",X_train.shape)
print("y_train: ",y_train.shape)
print("X_test: ",X_test.shape)
print("y_test: ",y_test.shape)

As shown, there are 40440 data points in the training data and 10110 data points in the testing data. Each input is a PDF: a vector of shape (209, 1). Each target is an integer: the space group number of the structure.

#### One-hot encoding

For classification tasks, the target labels need to be "one-hot encoded".

For example, if `target = 14`, then it is converted into `np.array([0,1,0,0,0,0,0,0])`.

To do this, we need to assign an index to each of the 8 space group numbers. The index of each space group number is given by its position in the numpy array below.

In [None]:
SG_ORDER_CNN = np.array([2,14,15,62,139,194,225,227])

To one-hot encode the labels, we need 2 steps:
1. For each target, obtain its index in the sequence `SG_ORDER_CNN`. For example, [14,15,227,2,14....] should be converted into [1,2,7,0,1....].
2. Use `tf.one_hot` to transform each index into an array of length 8 in which every entry is 0 except for a 1 at the index position. For example, the index 7 should be transformed into [0,0,0,0,0,0,0,1]. Don't forget that Python indexing starts at 0.
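
The two steps can be sketched in plain NumPy on a handful of labels. This is an illustrative equivalent, not the `tf.one_hot` call the exercise asks for; it assumes that `SG_ORDER_CNN` is sorted and that every label appears in it, which is what makes `np.searchsorted` valid here.

```python
import numpy as np

SG_ORDER_CNN = np.array([2, 14, 15, 62, 139, 194, 225, 227])

# A few example space group labels (illustration only)
labels = np.array([14, 15, 227, 2, 14])

# Step 1: position of each label in SG_ORDER_CNN
# (works because SG_ORDER_CNN is sorted and contains every label)
indices = np.searchsorted(SG_ORDER_CNN, labels)

# Step 2: pick rows of the identity matrix, giving one row per label
one_hot = np.eye(len(SG_ORDER_CNN), dtype=np.float32)[indices]

print(indices)      # [1 2 7 0 1]
print(one_hot[2])   # [0. 0. 0. 0. 0. 0. 0. 1.]
```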

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
y_train_one_hot =
y_test_one_hot =




First, we can try some conventional ML models. Try whichever you like from what you have learned so far; one that was used in the original work was logistic regression.

#### <font color='RED'>YOUR SOLUTION:</font> 

## Convolutional Neural networks

Convolutional neural networks (CNNs) are a special type of artificial neural network (ANN) that uses the convolution operation in place of general matrix multiplication in at least one layer. Unlike a multilayer perceptron, a CNN uses convolution to exploit local spatial information in the input. For example, a CNN can take into account that nearby pixels encode related information. Furthermore, CNNs build in a degree of translational invariance: for instance, a dog in the left corner of an image is recognized in much the same way as a dog in the right corner. As a result, CNNs perform very well in many applications such as image and signal recognition and processing.

CNNs contain 3 types of layers 

1. Convolutional layer
    * A number of filters of fixed size are defined, and each filter contains trainable weights. The filter sweeps across the input feature layer in fixed-size steps; the step size is called the "stride". The receptive field is the part of the input feature space that overlaps the filter at each position as the filter traverses the feature layer. The output is computed by taking the dot product of the filter with each receptive field.
    * A filter is also known as a kernel; the kernel size refers to the size of the filter; the number of filters used in a layer determines the number of output channels.


2. Pooling layer
    * Pooling downsamples the feature space. This acts as a form of regularization and helps reduce overfitting.
    * There are 2 types of pooling layers: average pooling and max pooling.
    * Similarly to a convolutional layer, a window sweeps across the input and computes the maximum or average value of each receptive field.


3. Fully-connected layer
    * A regular neural network layer. Fully-connected layers are used in the last few layers to map to the final outputs.

The use of filters not only allows CNNs to identify patterns and possible translational symmetry, but also greatly reduces the number of parameters in the model. This reduces the chance of overfitting, since the trainable weights in each filter are reused as the filter traverses the feature space.
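
A minimal NumPy sketch of a single-filter 1D convolution followed by max pooling may help make the sliding-window picture concrete. It is a simplified illustration (one input channel, one filter, zero padding, no bias or activation) rather than a reproduction of the Keras layers; the function names `conv1d_same` and `max_pool1d` are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel, stride=1):
    """Slide `kernel` over the 1D signal `x`, zero-padding the ends."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = (k - 1) - pad_left
    padded = np.pad(x, (pad_left, pad_right))
    # Each output value is the dot product of the kernel with one receptive field
    return np.array([np.dot(padded[i:i + k], kernel)
                     for i in range(0, len(x), stride)])

def max_pool1d(x, pool_size=2):
    """Keep the maximum of each non-overlapping window of length pool_size."""
    n_out = (len(x) - pool_size) // pool_size + 1
    return np.array([x[i * pool_size:(i + 1) * pool_size].max()
                     for i in range(n_out)])

signal = np.sin(np.linspace(0, 10, 209))   # stand-in for one PDF
kernel = np.ones(32) / 32                  # one filter; in a CNN its weights are learned

features = conv1d_same(signal, kernel)
pooled = max_pool1d(features)
print(signal.shape, features.shape, pooled.shape)   # (209,) (209,) (104,)
```

The same 32 kernel weights are reused at every position of the signal, which is where the parameter saving comes from.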

### Model Architecture

The performance of a machine learning model depends on the architecture of the model as well as a number of hyperparameters such as the learning rate, loss function, and activation function. However, there isn't a general method for determining the optimal architecture and set of hyperparameters. To find a good model for the scientific problem at hand, we usually first refer to the scientific literature for work that solves similar problems and adopt a similar architecture. Then, we can perform hyperparameter tuning. One can use basic methods such as grid search, or resort to well-established hyperparameter tuning frameworks such as [optuna](https://optuna.org/).

The architecture in this edex is the same one used in the original paper (Liu et al.). This model is already optimized with hyperparameter tuning methods. 

Define the constants that are used for defining the model.

In the cell below, we define the 1D CNN model. 

First, we specify the model to be a `keras.Sequential` model. A [Sequential model](https://keras.io/guides/sequential_model/) is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor, which is the case in our problem. Calling `model=Sequential([layer0,layer1,layer2...])` automatically connects the output of layer0 to input of layer1, output of layer1 to input of layer2, etc. There are other ways to define the model without using the sequential method, but in those cases, the output and inputs usually need to be explicitly linked.


Then, we add the layers in sequence. The first 2 are given to you:
1. Input 
2. Normalize the input 
...

You need to add the rest of the layers:

3. `Conv1D` with 256 channels and kernel size of 32. 
    * include relu activation function
    * padding='same'
    * kernel_regularizer = regularizers.l2(l2_lambda)
    
4. `BatchNormalization`: use the same arguments as the given one

5. `Conv1D` with 64 channels and kernel size of 32. 
    * relu activation
    * padding='same'
    * kernel_regularizer = regularizers.l2(l2_lambda)
    
6. `BatchNormalization`: same arguments as before

7. `MaxPooling1D`: default setting

8. `Dropout` with a dropout rate of 0.5

9. `Flatten`

10. `Dense` with output dimension = 128
    * relu activation 
    * kernel_regularizer = regularizers.l2(l2_lambda)
    
11. `BatchNormalization`: same arguments as before

12. `Dropout` with a dropout rate of 0.5

13. `Dense` with output dimension = 8
    * softmax activation 
    * kernel_regularizer = regularizers.l2(l2_lambda)
    
Finally, don't forget to return the model
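
As a sanity check on the finished model, the expected layer sizes can be worked out by hand. The sketch below does this arithmetic under stated assumptions: `MaxPooling1D` uses its default `pool_size=2` with no padding, each `BatchNormalization` layer holds 4 values per channel (2 trainable, 2 non-trainable), and the helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # one weight per (kernel position, input channel, filter) plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, units):
    # one weight per (input, unit) pair plus one bias per unit
    return in_features * units + units

def batchnorm_params(channels):
    # scale, offset, moving mean, and moving variance for each channel
    return 4 * channels

steps = 209
pooled_steps = (steps - 2) // 2 + 1   # length after pooling (assumes pool_size=2)
flat = pooled_steps * 64              # Flatten: steps x channels

counts = {
    "batchnorm_input": batchnorm_params(1),
    "conv1d_256": conv1d_params(32, 1, 256),
    "batchnorm_256": batchnorm_params(256),
    "conv1d_64": conv1d_params(32, 256, 64),
    "batchnorm_64": batchnorm_params(64),
    "dense_128": dense_params(flat, 128),
    "batchnorm_128": batchnorm_params(128),
    "dense_8": dense_params(128, 8),
}

for name, n in counts.items():
    print(f"{name:>16}: {n}")
print("total:", sum(counts.values()))
```

If `model.summary()` reports different numbers for your model, comparing layer by layer against this table is a quick way to spot a wrong kernel size, channel count, or missing layer.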

In [None]:
input_shape = (209,1)
l2_lambda = 1e-5 # regularization parameter
num_classes = 8 

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
def CNN_classifier(input_shape=input_shape, num_classes=num_classes):  
    
    # Define the sequential model:
    model = Sequential()
    
    # two given layers:
    model.add(Input(shape=input_shape))
    model.add(BatchNormalization(epsilon=1e-06, momentum=0.9, weights=None))
    
    
    # Add the rest of the layers
    
    
    
    
    
    
    
    
    
    
    
    


    return model

Instantiate the model:

In [None]:
model = CNN_classifier()

Run the following to see Model summary:

In [None]:
model.summary()

## Training

(Optional) A random seed can be set for reproducibility.

In [None]:
from numpy.random import seed
seed(0)
random.set_seed(0)

### Training weights for Imbalanced Datasets

A classification dataset in which the classes have different proportions is called imbalanced.

An imbalanced dataset can be problematic: if there is significantly more training data in one class, the model will spend most of its time optimizing for that class and will not learn enough from samples of the other classes.

Our data is imbalanced. The amount of data in each class is recorded in `size_sg.csv`. We can use a pie chart to visualize the proportion of data in each class.

In [None]:
size_sg = pd.read_csv('size_sg.csv', header=None)

In [None]:
labels = size_sg[0].to_numpy()
sizes = size_sg[1].to_numpy()
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.show()

To counteract the unwanted effects of this imbalance, we introduce training weights.

Each class is assigned a weight equal to the inverse of its proportion (total data points / class data points). We need to encode these weights as a dictionary for use in the training process: {0: ....., 1: ..... .....}
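
The weighting rule can be illustrated on made-up class counts. The numbers and the names `toy_sizes` and `toy_weights` below are hypothetical; the real counts come from `size_sg.csv`.

```python
import numpy as np

# Hypothetical class counts for 4 classes (illustration only)
toy_sizes = np.array([1000, 4000, 2000, 3000])

total = toy_sizes.sum()
# weight = total data points / class data points; rare classes get larger weights
toy_weights = {i: float(total / n) for i, n in enumerate(toy_sizes)}

print(toy_weights)
```

Note that the dictionary keys are class indices (0, 1, 2, ...), not space group numbers, so for the real data the order of the counts has to match the order used for the one-hot encoding.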

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
weights = {}





### Callbacks

Callbacks are used to control the training of a model. Callbacks can help us prevent overfitting, visualize our training progress, save checkpoints, etc.

In this edex, we include 2 callbacks: [`Early stopping`](https://keras.io/api/callbacks/early_stopping/) and learning rate scheduling. These are optional, but including them can improve model performance.
* A learning rate scheduler automatically reduces the learning rate after a certain number of epochs. For some problems, it can improve performance and accelerate training.
* Early stopping is a form of regularization used to avoid overfitting. It stops training once the model stops improving on the validation data.
    * patience: the number of epochs with no improvement after which training is stopped.
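
The role of `patience` can be sketched with a few lines of plain Python. This is a simplified imitation of the idea, not the Keras implementation; the function name `stopping_epoch` and the validation scores are made up.

```python
def stopping_epoch(val_scores, patience=3, min_delta=0.0):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best + min_delta:   # counts as an improvement
            best = score
            wait = 0
        else:                          # no improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation accuracies, one per epoch
val_scores = [0.60, 0.70, 0.72, 0.71, 0.72, 0.70, 0.69]

print(stopping_epoch(val_scores, patience=3))    # 5
print(stopping_epoch(val_scores, patience=10))   # None
```

A larger `patience` tolerates longer plateaus before stopping, and a larger `min_delta` makes it harder for an epoch to count as an improvement.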

We need to define a function that takes the epoch as an argument and returns the corresponding learning rate:
* epoch 0-40: lr = 5e-4
* epoch 40-60: lr = 5e-5
* epoch >60: lr = 5e-6
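
A piecewise-constant schedule of this kind can be written generically. The sketch below uses deliberately different, made-up boundaries and rates so that it illustrates the pattern without being the exercise solution; the name `step_schedule` and its default values are hypothetical.

```python
import bisect

def step_schedule(epoch, boundaries=(10, 20), rates=(1e-3, 1e-4, 1e-5)):
    """Return rates[i], where i is the number of boundaries that are <= epoch."""
    return rates[bisect.bisect_right(boundaries, epoch)]

for epoch in (0, 9, 10, 19, 20, 50):
    print(epoch, step_schedule(epoch))
```

With this convention, an epoch that falls exactly on a boundary already uses the lower rate; decide which side each boundary belongs to when writing your own function.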

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
def lr_schedule(epoch):
    
    
    
    # Your code
    
    
    
    
    
    print('Learning rate: ', lr)
    return lr

In [None]:
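lr_schedule_callback = LearningRateScheduler(lr_schedule, verbose=1)
# `monitor` must match a key logged during training; with metrics=['accuracy'],
# TensorFlow 2 logs the validation accuracy as 'val_accuracy'.
# `min_delta` is the smallest change in the monitored value that counts as an improvement.
earlystopping_callback = EarlyStopping(monitor='val_accuracy',verbose=1,min_delta=0.5,patience=10,baseline=None)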

callbacks = [lr_schedule_callback, earlystopping_callback]

### Training Hyperparameters

`Model.compile` defines the loss function, metrics, and optimizer.
* Categorical cross-entropy is used as the loss function. It is a common choice for multi-class classification tasks.
* A metric is a function that lets humans judge the performance of the model. Unlike the loss function, it is not used for training.
    * Top-K categorical accuracy is the percentage of records for which the true class (the non-zero entry of the one-hot target) is among the top K predictions.

Read the documentation for `tf.keras.Model.compile` to compile the model.
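
To make the loss and the top-K metric concrete, here is a small NumPy sketch evaluated on three made-up predictions over 3 classes. The function names and data are illustrative assumptions; in the notebook itself these quantities are computed by Keras.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def top_k_accuracy(y_true, y_pred, k=2):
    """Fraction of samples whose true class is among the k largest predictions."""
    true_idx = np.argmax(y_true, axis=1)
    top_k = np.argsort(y_pred, axis=1)[:, -k:]   # indices of the k largest per row
    hits = [t in row for t, row in zip(true_idx, top_k)]
    return float(np.mean(hits))

# Made-up one-hot targets and predicted probabilities
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_pred = np.array([[0.6, 0.3, 0.1],
                   [0.5, 0.4, 0.1],
                   [0.5, 0.3, 0.2]])

print(categorical_cross_entropy(y_true, y_pred))
print(top_k_accuracy(y_true, y_pred, k=1))   # plain accuracy: 1 of 3 correct
print(top_k_accuracy(y_true, y_pred, k=2))   # 2 of 3 have the true class in the top 2
```

The second sample shows why top-2 accuracy is informative here: the true class is not the first guess, but it is still on the short list of candidates.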
* optimizer: adam
* loss function: categorical cross entropy loss
* metrics: include 2: accuracy and Top K categorical accuracy with k=2

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
# Your code





### Training

#### Validation 

A validation dataset is a sample of data that is not used directly to train the model, but is used to evaluate the model as it is being trained. It provides a more unbiased evaluation because the model is not directly trained on the validation dataset.

However, the validation data is different from the testing data. The final model never sees the testing data, whereas we choose the final model based on its performance on the validation dataset.

Here the validation data is set to be 20% of the training data.

#### Batch training

Batch training means that the model takes one gradient descent step after considering a batch of data points rather than the whole dataset. This greatly accelerates training.
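
A little arithmetic shows how the batch size sets the number of gradient steps per epoch. The sketch assumes the 40440 training samples from the split earlier in this edex, with 20% of them held out for validation; the helper name `steps_per_epoch` is hypothetical.

```python
import math

n_train = 40440                  # training samples after the train/test split
n_val = n_train * 20 // 100      # 20% held out for validation
n_fit = n_train - n_val          # samples actually used for gradient steps

def steps_per_epoch(n_samples, batch_size):
    """Number of gradient steps needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

for batch_size in (16, 32, 128):
    print(batch_size, steps_per_epoch(n_fit, batch_size))
```

Each step with a larger batch averages the gradient over more samples, so an epoch takes fewer steps.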

In [None]:
epochs = 100

history = model.fit(X_train, y_train_one_hot,
                    epochs=epochs, batch_size=32,
                    callbacks=callbacks,
                    class_weight=weights,
                    validation_split = 0.20) 

#### What happens when I increase or decrease the batch size? What's its effect on training?

#### <font color='RED'>YOUR SOLUTION:</font> 

Type your answer here.

### Save model and training history

In [None]:
model.save('my_model.h5')
history_dict = history.history
pd.DataFrame(history_dict).to_json('history.json')

## Evaluation

Finally, we evaluate the model's performance on the testing data.

Read the documentation for `tf.keras.Model.evaluate` to compute the model's performance on the testing data. We want both the loss and the accuracy of the classification.

#### <font color='RED'>YOUR SOLUTION:</font> 

In [None]:
# Your code



