## Calculate descriptors and draw heatmap
This tutorial shows how to calculate compositional and structural descriptors by using `xenonpy.descriptor.Compositions` and `xenonpy.descriptor.Structures`.

We will use some sample data in this tutorial. If you don't have it, please see https://github.com/yoshida-lab/XenonPy/blob/master/samples/build_sample_data.ipynb.

### useful tools

In [1]:
%run tools.ipynb

### load sample data 

In [2]:
from xenonpy.datatools import preset

samples = preset.mp_samples
samples.head()
samples.info()

Unnamed: 0,band_gap,composition,density,e_above_hull,efermi,elements,final_energy_per_atom,formation_energy_per_atom,pretty_formula,structure,volume
mp-1008807,0.0,"{'Rb': 1.0, 'Cu': 1.0, 'O': 1.0}",4.784634,0.996372,1.100617,"[Cu, O, Rb]",-3.302762,-0.186408,RbCuO,"[[-3.05935361 -3.05935361 -3.05935361] Rb, [0....",57.268924
mp-1009640,0.0,"{'Pr': 1.0, 'N': 1.0}",8.145777,0.759393,5.213442,"[N, Pr]",-7.082624,-0.714336,PrN,"[[0. 0. 0.] Pr, [1.57925232 1.57925232 1.58276...",31.579717
mp-1016825,0.7745,"{'Hf': 1.0, 'Mg': 1.0, 'O': 3.0}",6.165888,0.58955,2.42457,"[Hf, Mg, O]",-7.911723,-3.06006,HfMgO3,"[[2.03622802 2.03622802 2.03622802] Hf, [0. 0....",67.541269
mp-1017582,0.0,"{'La': 1.0, 'Pt': 3.0, 'C': 1.0}",14.284261,0.523635,8.160496,"[C, La, Pt]",-6.684482,-0.215712,LaPt3C,"[[0. 0. 0.] La, [0. 2.20339716 2.20339...",85.579224
mp-1021511,1.5186,"{'Cd': 1.0, 'S': 1.0}",2.582691,0.25286,-2.12118,"[Cd, S]",-2.909105,-0.719529,CdS,"[[2.12807605 1.22864286 2.67990375] Cd, [-2.46...",92.890725


<class 'pandas.core.frame.DataFrame'>
Index: 933 entries, mp-1008807 to mvc-4727
Data columns (total 11 columns):
band_gap                     933 non-null float64
composition                  933 non-null object
density                      933 non-null float64
e_above_hull                 933 non-null float64
efermi                       796 non-null float64
elements                     933 non-null object
final_energy_per_atom        933 non-null float64
formation_energy_per_atom    933 non-null float64
pretty_formula               933 non-null object
structure                    933 non-null object
volume                       933 non-null float64
dtypes: float64(7), object(4)
memory usage: 87.5+ KB


### calculate descriptors from composition

In [3]:
from xenonpy.descriptor import Compositions

cal = Compositions()
cal

Compositions:
  |- composition:
  |  |- WeightedAvgFeature
  |  |- WeightedSumFeature
  |  |- WeightedVarFeature
  |  |- MaxFeature
  |  |- MinFeature

Printing the ``cal`` object shows its structure.
Here, ``cal`` has one featurizer group called **composition** that contains 5 featurizers.

To use this calculator, pass an iterable of compound compositions to the ``cal.transform`` or ``cal.fit_transform`` method.
Each item should be a ``pymatgen.Structure`` or a dict such as *{'H': 2, 'O': 1}*.

With the sample data, descriptors can be calculated like this:

In [4]:
descriptor = cal.transform(samples['composition'])
descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,min:num_s_valence,min:period,min:specific_heat,min:thermal_conductivity,min:vdw_radius,min:vdw_radius_alvarez,min:vdw_radius_mm3,min:vdw_radius_uff,min:sound_velocity,min:Polarizability
mp-1008807,24.666667,174.06714,209.333333,25.666667,55.004267,1297.063333,72.86868,1646.9,139.333333,128.333333,...,1.0,2.0,0.36,0.02658,152.0,150.0,182.0,349.5,317.5,0.802
mp-1009640,33.0,137.0,232.5,19.05,77.45733,1931.2,43.182441,1892.85,137.0,123.5,...,2.0,2.0,0.192,0.02583,155.0,166.0,193.0,360.6,333.6,1.1
mp-1016825,21.6,153.120852,203.4,13.92,50.1584,1420.714,76.663625,343.82,102.8,96.0,...,2.0,2.0,0.146,0.02658,152.0,150.0,182.0,302.1,317.5,0.802
mp-1017582,59.4,139.0,232.8,11.02,147.233694,4226.0,150.2,1037.58,137.6,124.8,...,1.0,2.0,0.133,13.0,170.0,177.0,204.0,275.4,2475.0,1.67
mp-1021511,32.0,140.5,226.0,14.3,72.237,877.912,24.85,272.5,124.5,119.5,...,2.0,3.0,0.232,0.205,180.0,189.0,215.0,284.8,2310.0,2.9


<class 'pandas.core.frame.DataFrame'>
Index: 933 entries, mp-1008807 to mvc-4727
Columns: 290 entries, ave:atomic_number to min:Polarizability
dtypes: float64(290)
memory usage: 2.1+ MB
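
To make the ``ave:`` and ``min:`` columns concrete, here is a minimal sketch that recomputes two values from the first row by hand. It assumes the weighted average is weighted by atomic fraction, which agrees with the ``ave:atomic_number`` values printed above. ``ELEMENT_PROPS``, ``weighted_average``, and ``minimum`` are hypothetical names for this illustration and are not part of XenonPy.

```python
# Toy property table for three elements only. XenonPy uses its own, much
# larger element table; this dict is a hypothetical stand-in.
ELEMENT_PROPS = {
    'Rb': {'atomic_number': 37, 'period': 5},
    'Cu': {'atomic_number': 29, 'period': 4},
    'O': {'atomic_number': 8, 'period': 2},
}


def weighted_average(composition, prop):
    """Average of an element property, weighted by atomic fraction."""
    total = sum(composition.values())
    return sum(amount / total * ELEMENT_PROPS[el][prop]
               for el, amount in composition.items())


def minimum(composition, prop):
    """Smallest value of an element property among the elements present."""
    return float(min(ELEMENT_PROPS[el][prop] for el in composition))


rbcuo = {'Rb': 1.0, 'Cu': 1.0, 'O': 1.0}  # composition of mp-1008807
print(round(weighted_average(rbcuo, 'atomic_number'), 6))  # 24.666667
print(minimum(rbcuo, 'period'))  # 2.0
```

Both numbers match the ``ave:atomic_number`` and ``min:period`` entries for mp-1008807 in the table above.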


There is also a shorter way to do the same thing. Recall the name of the featurizer group: **composition**.
If the input is a ``pandas.DataFrame``, the calculator tries to read the columns whose names match its featurizer groups.
In our case, ``samples`` has a column named **composition**, so there is no need to extract that column before passing the input to the calculator's methods. Like this:

In [5]:
descriptor = cal.transform(samples)
descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,min:num_s_valence,min:period,min:specific_heat,min:thermal_conductivity,min:vdw_radius,min:vdw_radius_alvarez,min:vdw_radius_mm3,min:vdw_radius_uff,min:sound_velocity,min:Polarizability
mp-1008807,24.666667,174.06714,209.333333,25.666667,55.004267,1297.063333,72.86868,1646.9,139.333333,128.333333,...,1.0,2.0,0.36,0.02658,152.0,150.0,182.0,349.5,317.5,0.802
mp-1009640,33.0,137.0,232.5,19.05,77.45733,1931.2,43.182441,1892.85,137.0,123.5,...,2.0,2.0,0.192,0.02583,155.0,166.0,193.0,360.6,333.6,1.1
mp-1016825,21.6,153.120852,203.4,13.92,50.1584,1420.714,76.663625,343.82,102.8,96.0,...,2.0,2.0,0.146,0.02658,152.0,150.0,182.0,302.1,317.5,0.802
mp-1017582,59.4,139.0,232.8,11.02,147.233694,4226.0,150.2,1037.58,137.6,124.8,...,1.0,2.0,0.133,13.0,170.0,177.0,204.0,275.4,2475.0,1.67
mp-1021511,32.0,140.5,226.0,14.3,72.237,877.912,24.85,272.5,124.5,119.5,...,2.0,3.0,0.232,0.205,180.0,189.0,215.0,284.8,2310.0,2.9


<class 'pandas.core.frame.DataFrame'>
Index: 933 entries, mp-1008807 to mvc-4727
Columns: 290 entries, ave:atomic_number to min:Polarizability
dtypes: float64(290)
memory usage: 2.1+ MB
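
The column lookup described above can be pictured roughly as follows. This is only a sketch of the idea, not XenonPy's implementation: a plain dict of columns stands in for the ``pandas.DataFrame``, and ``pick_inputs`` is a hypothetical helper.

```python
def pick_inputs(data, group_names):
    """Return {group name: column values} for every group found in `data`.

    `data` is a plain dict of columns standing in for a pandas.DataFrame.
    Groups without a matching column are skipped.
    """
    return {name: data[name] for name in group_names if name in data}


# Two columns borrowed from the sample table shown earlier.
samples_like = {
    'composition': [{'Rb': 1.0, 'Cu': 1.0, 'O': 1.0}, {'Pr': 1.0, 'N': 1.0}],
    'density': [4.784634, 8.145777],
}

inputs = pick_inputs(samples_like, group_names=['composition'])
print(list(inputs))  # ['composition']
```

Only the **composition** column is handed to the featurizers; unrelated columns such as ``density`` are ignored.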


Sometimes we don't want to use all of these featurizers. For example, to calculate only **WeightedAvgFeature** and **WeightedSumFeature**, use the ``featurizers`` parameter.

In [6]:
descriptor = cal.transform(samples, featurizers=['WeightedAvgFeature', 'WeightedSumFeature'])
descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,sum:num_s_valence,sum:period,sum:specific_heat,sum:thermal_conductivity,sum:vdw_radius,sum:vdw_radius_alvarez,sum:vdw_radius_mm3,sum:vdw_radius_uff,sum:sound_velocity,sum:Polarizability
mp-1008807,,,,,,,,,,,...,,,,,,,,,,
mp-1009640,,,,,,,,,,,...,,,,,,,,,,
mp-1016825,,,,,,,,,,,...,,,,,,,,,,
mp-1017582,,,,,,,,,,,...,,,,,,,,,,
mp-1021511,,,,,,,,,,,...,,,,,,,,,,


<class 'pandas.core.frame.DataFrame'>
Index: 933 entries, mp-1008807 to mvc-4727
Columns: 116 entries, ave:atomic_number to sum:Polarizability
dtypes: float64(116)
memory usage: 852.8+ KB
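
The drop from 290 to 116 columns follows from simple arithmetic, sketched below under the assumption that every featurizer contributes one column per element property. The ``ave``, ``sum``, and ``min`` prefixes are taken from the outputs above; ``var`` and ``max`` are assumed by analogy, and ``column_names`` is a hypothetical helper.

```python
# Column prefix per featurizer. 'ave', 'sum' and 'min' appear in the printed
# tables; 'var' and 'max' are assumptions made by analogy.
PREFIXES = {
    'WeightedAvgFeature': 'ave',
    'WeightedSumFeature': 'sum',
    'WeightedVarFeature': 'var',
    'MaxFeature': 'max',
    'MinFeature': 'min',
}


def column_names(featurizers, properties):
    """Build '<prefix>:<property>' labels for the selected featurizers."""
    return [f'{PREFIXES[name]}:{prop}'
            for name in featurizers for prop in properties]


properties = ['atomic_number', 'atomic_radius']  # first two property names
selected = ['WeightedAvgFeature', 'WeightedSumFeature']
print(column_names(selected, properties))
# ['ave:atomic_number', 'ave:atomic_radius', 'sum:atomic_number', 'sum:atomic_radius']

n_properties = 290 // len(PREFIXES)  # 58 columns per featurizer
print(n_properties * len(selected))  # 116
```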


### calculate structural descriptors from structure

Like the ``Compositions`` calculator, ``Structures`` accepts ``pymatgen.Structure`` objects as input and returns the calculated results as a ``pandas.DataFrame``.
``samples`` also contains structure information, which we can use to calculate structural descriptors.

In [7]:
from xenonpy.descriptor import Structures

cal = Structures()
cal

Structures:
  |- structure:
  |  |- RadialDistributionFunction
  |  |- ObitalFieldMatrix

In [9]:
descriptor = cal.transform(samples)
descriptor.head(5)
descriptor.info()

Unnamed: 0,0.1,0.2,0.30000000000000004,0.4,0.5,0.6000000000000001,0.7000000000000001,0.8,0.9,1.0,...,f14_f5,f14_f6,f14_f7,f14_f8,f14_f9,f14_f10,f14_f11,f14_f12,f14_f13,f14_f14
mp-1008807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-1009640,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-1016825,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
mp-1017582,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.3851
mp-1021511,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<class 'pandas.core.frame.DataFrame'>
Index: 933 entries, mp-1008807 to mvc-4727
Columns: 1224 entries, 0.1 to f14_f14
dtypes: float64(1224)
memory usage: 8.7+ MB
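
The column labels in this table can look odd at first, so here is a short sketch of where they may come from. It rests on two assumptions that the output does not state: the leading numeric labels are bin positions computed as ``i * 0.1``, and the trailing labels pair 32 orbital names (``s1``-``s2``, ``p1``-``p6``, ``d1``-``d10``, ``f1``-``f14``). All variable names are hypothetical.

```python
# Assumed bin positions: 0.1, 0.2, ... computed as i * 0.1. Floating-point
# rounding explains labels such as 0.30000000000000004.
rdf_labels = [str(i * 0.1) for i in range(1, 11)]
print(rdf_labels[2])  # 0.30000000000000004

# Assumed orbital names: s1-s2, p1-p6, d1-d10, f1-f14 (32 in total).
shells = (('s', 2), ('p', 6), ('d', 10), ('f', 14))
orbitals = [f'{shell}{i}' for shell, n in shells for i in range(1, n + 1)]

# One column per ordered pair of orbitals.
ofm_labels = [f'{a}_{b}' for a in orbitals for b in orbitals]
print(len(ofm_labels))  # 1024
print(ofm_labels[-1])  # f14_f14
print(1224 - len(ofm_labels))  # 200
```

Under these assumptions, the 1224 columns would split into 200 numeric bins and 1024 orbital-pair entries. This is inferred from the labels and the column count rather than read from the library.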


### the Base classes

The descriptor calculator system has many more details than this tutorial covers.
Until that documentation is written, see https://github.com/yoshida-lab/XenonPy/blob/master/samples/build_your_own_descriptor_calculator.ipynb to get an idea of how it works.

In [10]:
from xenonpy.descriptor import Compositions, ObitalFieldMatrix

print('Base class of Compositions:', Compositions.__bases__)
print('Base class of ObitalFieldMatrix:', ObitalFieldMatrix.__bases__)

Base class of Compositions: (<class 'xenonpy.descriptor.base.BaseDescriptor'>,)
Base class of ObitalFieldMatrix: (<class 'xenonpy.descriptor.base.BaseFeaturizer'>,)
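
The split between a calculator base class and a featurizer base class can be illustrated with a toy hierarchy. Every class below is a hypothetical stand-in written for this sketch. None of it is XenonPy code, and the real base classes may work differently.

```python
class ToyBaseFeaturizer:
    """Hypothetical featurizer base: maps each entry to a feature value."""

    def transform(self, entries):
        # Subclasses supply `featurize` for a single entry.
        return [self.featurize(entry) for entry in entries]


class ToyBaseDescriptor:
    """Hypothetical calculator base: groups featurizers under named groups."""


class ElementCount(ToyBaseFeaturizer):
    """Counts the distinct elements in a composition dict."""

    def featurize(self, composition):
        return len(composition)


class ToyCompositions(ToyBaseDescriptor):
    def __init__(self):
        # One featurizer group named 'composition'.
        self.composition = [ElementCount()]


print(ElementCount.__bases__ == (ToyBaseFeaturizer,))  # True
print(ToyCompositions.__bases__ == (ToyBaseDescriptor,))  # True
print(ElementCount().transform([{'H': 2, 'O': 1}]))  # [2]
```

As in the output above, ``__bases__`` reports the direct parent classes, which is a quick way to tell a calculator from a single featurizer.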
