### Exercise 12


In this exercise, we use the `xenonpy.descriptor.Compositions` class to calculate descriptors from chemical formulas.

A variety of descriptor calculators are provided under `xenonpy.descriptor`. For details on how these modules are organized and used, see the sample notebooks [calculate_descriptors](https://nbviewer.org/github/yoshida-lab/XenonPy/blob/master/samples/calculate_descriptors.ipynb) and [custom_descriptor_calculator](https://nbviewer.org/github/yoshida-lab/XenonPy/blob/master/samples/custom_descriptor_calculator.ipynb).

This exercise requires a set of chemical formulas. Prepare the sample data by following [retrieve_materials_project](https://nbviewer.org/github/yoshida-lab/XenonPy/blob/master/MI_Book/retrieve_materials_project.ipynb).

#### Load the sample data

In [1]:
%run common_setting.ipynb

from xenonpy.datatools import preset

samples = preset.mp_samples

samples.head()
samples.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, mp-1013558 to mp-998778
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   band_gap                   2000 non-null   float64
 1   composition                2000 non-null   object 
 2   density                    2000 non-null   float64
 3   e_above_hull               2000 non-null   int64  
 4   efermi                     1992 non-null   float64
 5   elements                   2000 non-null   object 
 6   final_energy_per_atom      2000 non-null   float64
 7   formation_energy_per_atom  2000 non-null   float64
 8   pretty_formula             2000 non-null   object 
 9   structure                  2000 non-null   object 
 10  volume                     2000 non-null   float64
dtypes: float64(6), int64(1), object(4)
memory usage: 187.5+ KB


#### Calculate compositional descriptors

In [2]:
from xenonpy.descriptor import Compositions

cal = Compositions()
Compositions?

Init signature:
Compositions(
    *,
    elemental_info: Union[pandas.core.frame.DataFrame, NoneType] = None,
    n_jobs: int = -1,
    featurizers: Union[str, List[str]] = 'classic',
    on_errors: str = 'nan',
)
Docstring:      Calculate elemental descriptors from compound's composition.
Init docstring:
Parameters
---

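The `transform` method of the `cal` object calculates the compositional descriptors. The input is a list of chemical compositions, each of which can be given in one of two data types:

1. A Python `dict`. For example, `H2O` is written as `{'H': 2, 'O': 1}`.
2. A `pymatgen.core.Composition` object. For example, `H2O` is converted with `Composition("H2O")`.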
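
As a small illustration of the `dict` form, the sketch below converts a simple formula string into an element-to-amount mapping. `formula_to_dict` is a hypothetical helper written for this example, not part of XenonPy or pymatgen, and it assumes a flat formula with no parentheses or charges.

```python
import re

# Hypothetical helper for illustration only (not a XenonPy or pymatgen API).
# Handles flat formulas such as "H2O" or "Ca3BiP"; no parentheses or charges.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def formula_to_dict(formula: str) -> dict:
    """Return a mapping from element symbol to amount, e.g. {'H': 2.0, 'O': 1.0}."""
    composition = {}
    for element, amount in _TOKEN.findall(formula):
        # A missing amount means one atom of that element.
        composition[element] = composition.get(element, 0.0) + float(amount or 1)
    return composition


print(formula_to_dict("H2O"))
print(formula_to_dict("Ca3BiP"))
```

For anything beyond flat formulas, `pymatgen.core.Composition` is the more robust choice, because it parses the formula itself.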

From here on, we calculate descriptors using the `composition` information of the compounds in the sample data.

In [3]:
samples.composition

mp-1013558                     {'Ca': 3.0, 'Bi': 1.0, 'P': 1.0}
mp-1018025                     {'Lu': 1.0, 'Ag': 1.0, 'O': 2.0}
mp-1019105                     {'K': 2.0, 'Mg': 2.0, 'Bi': 2.0}
mp-1020108    {'Li': 2.0, 'Ca': 2.0, 'Mg': 2.0, 'Si': 2.0, '...
mp-1029828                    {'Rb': 8.0, 'Cr': 8.0, 'N': 16.0}
                                    ...                        
mp-976578                                {'Nd': 6.0, 'H': 18.0}
mp-9856                       {'Cs': 4.0, 'Zr': 2.0, 'Se': 6.0}
mp-985829                                 {'Hf': 1.0, 'S': 2.0}
mp-989541           {'Rb': 2.0, 'Na': 1.0, 'Sb': 1.0, 'F': 6.0}
mp-998778                    {'Rb': 4.0, 'Sn': 4.0, 'Br': 12.0}
Name: composition, Length: 2000, dtype: object

In [4]:
descriptor = cal.transform(samples['composition'])

descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,min:num_s_valence,min:period,min:specific_heat,min:thermal_conductivity,min:vdw_radius,min:vdw_radius_alvarez,min:vdw_radius_mm3,min:vdw_radius_uff,min:sound_velocity,min:Polarizability
mp-1013558,31.6,177.8,256.6,25.6,72.037632,1541.4,18.6,1478.0,156.6,155.0,...,2.0,3.0,0.124,0.236,180.0,190.0,222.0,339.9,1790.0,3.63
mp-1018025,33.5,152.85071,209.25,14.025,78.70825,1583.345,75.05302,598.6,116.0,104.0,...,1.0,2.0,0.155,0.02658,152.0,150.0,182.0,314.8,317.5,0.802
mp-1019105,38.0,188.333333,241.333333,26.866667,90.794567,1431.0,26.366667,1684.0,164.0,162.0,...,1.0,3.0,0.124,8.0,173.0,251.0,243.0,302.1,1790.0,7.4
mp-1020108,10.0,131.428571,214.142857,17.285714,20.204143,1014.05,49.299236,664.871429,109.857143,110.285714,...,1.0,2.0,0.653,0.02583,155.0,166.0,193.0,245.1,333.6,1.1
mp-1029828,18.75,140.5,207.75,24.4325,41.369475,1015.2,69.307441,1355.1,125.25,118.5,...,1.0,2.0,0.36,0.02583,155.0,166.0,193.0,302.3,333.6,1.1


<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, mp-1013558 to mp-998778
Columns: 290 entries, ave:atomic_number to min:Polarizability
dtypes: float64(290)
memory usage: 4.5+ MB
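
The first `ave:` and `min:` values in the output above can be checked by hand. The sketch below assumes that `ave:` is the composition-weighted mean of an elemental property and that `min:` is the minimum of that property over the elements present. This is inferred from the column names and the printed values, not taken from the XenonPy source. The small lookup tables hold standard atomic numbers and periods for the elements of the first three samples only.

```python
# Standard periodic-table values for the elements in the first three samples.
ATOMIC_NUMBER = {"Ca": 20, "Bi": 83, "P": 15, "Lu": 71, "Ag": 47, "O": 8, "K": 19, "Mg": 12}
PERIOD = {"Ca": 4, "Bi": 6, "P": 3, "Lu": 6, "Ag": 5, "O": 2, "K": 4, "Mg": 3}


def weighted_average(composition, prop):
    """Mean of an elemental property, weighted by the amount of each element."""
    total = sum(composition.values())
    return sum(n * prop[el] for el, n in composition.items()) / total


def min_pooling(composition, prop):
    """Smallest value of an elemental property among the elements present."""
    return min(prop[el] for el in composition)


mp_1013558 = {"Ca": 3.0, "Bi": 1.0, "P": 1.0}
mp_1018025 = {"Lu": 1.0, "Ag": 1.0, "O": 2.0}

print(weighted_average(mp_1013558, ATOMIC_NUMBER))  # 31.6, as in ave:atomic_number
print(weighted_average(mp_1018025, ATOMIC_NUMBER))  # 33.5, as in ave:atomic_number
print(min_pooling(mp_1013558, PERIOD))              # 3, as in min:period (shown as 3.0)
```

For `mp-1013558` the weighted mean is (3 × 20 + 83 + 15) / 5 = 31.6, which agrees with the first row of the table.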


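The `Compositions` class is designed so that, when the input is a `pandas.DataFrame`, the column named `composition` is used automatically. In other words, the descriptors can be calculated without extracting `composition` from the sample data beforehand.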

In [5]:
descriptor = cal.transform(samples)
descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,min:num_s_valence,min:period,min:specific_heat,min:thermal_conductivity,min:vdw_radius,min:vdw_radius_alvarez,min:vdw_radius_mm3,min:vdw_radius_uff,min:sound_velocity,min:Polarizability
mp-1013558,31.6,177.8,256.6,25.6,72.037632,1541.4,18.6,1478.0,156.6,155.0,...,2.0,3.0,0.124,0.236,180.0,190.0,222.0,339.9,1790.0,3.63
mp-1018025,33.5,152.85071,209.25,14.025,78.70825,1583.345,75.05302,598.6,116.0,104.0,...,1.0,2.0,0.155,0.02658,152.0,150.0,182.0,314.8,317.5,0.802
mp-1019105,38.0,188.333333,241.333333,26.866667,90.794567,1431.0,26.366667,1684.0,164.0,162.0,...,1.0,3.0,0.124,8.0,173.0,251.0,243.0,302.1,1790.0,7.4
mp-1020108,10.0,131.428571,214.142857,17.285714,20.204143,1014.05,49.299236,664.871429,109.857143,110.285714,...,1.0,2.0,0.653,0.02583,155.0,166.0,193.0,245.1,333.6,1.1
mp-1029828,18.75,140.5,207.75,24.4325,41.369475,1015.2,69.307441,1355.1,125.25,118.5,...,1.0,2.0,0.36,0.02583,155.0,166.0,193.0,302.3,333.6,1.1


<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, mp-1013558 to mp-998778
Columns: 290 entries, ave:atomic_number to min:Polarizability
dtypes: float64(290)
memory usage: 4.5+ MB


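The descriptor groups to calculate can also be specified. `cal.all_featurizers` lists the available descriptor groups.

Passing a subset of these names as `featurizers`, as in the second cell below, calculates only those groups, here `WeightedAverage` and `WeightedSum`.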

In [12]:
cal.all_featurizers

['GeometricMean',
 'Counting',
 'WeightedAverage',
 'MinPooling',
 'MaxPooling',
 'WeightedSum',
 'WeightedVariance',
 'HarmonicMean']

In [16]:
cal = Compositions(featurizers=['WeightedAverage', 'WeightedSum'])

descriptor = cal.transform(samples)
descriptor.head(5)
descriptor.info()

Unnamed: 0,ave:atomic_number,ave:atomic_radius,ave:atomic_radius_rahm,ave:atomic_volume,ave:atomic_weight,ave:boiling_point,ave:bulk_modulus,ave:c6_gb,ave:covalent_radius_cordero,ave:covalent_radius_pyykko,...,sum:num_s_valence,sum:period,sum:specific_heat,sum:thermal_conductivity,sum:vdw_radius,sum:vdw_radius_alvarez,sum:vdw_radius_mm3,sum:vdw_radius_uff,sum:sound_velocity,sum:Polarizability
mp-1013558,31.6,177.8,256.6,25.6,72.037632,1541.4,18.6,1478.0,156.6,155.0,...,10.0,21.0,2.84,608.236,1080.0,1230.0,1331.0,1871.4,17270.74287,86.03
mp-1018025,33.5,152.85071,209.25,14.025,78.70825,1583.345,75.05302,598.6,116.0,104.0,...,7.0,15.0,2.550373,446.05316,739.0,827.0,872.0,1378.8,7354.860704,30.284
mp-1019105,38.0,188.333333,241.333333,26.866667,90.794567,1431.0,26.366667,1684.0,164.0,162.0,...,10.0,26.0,3.804,536.0,1310.0,1556.0,1636.0,2240.6,16784.0,122.12
mp-1020108,10.0,131.428571,214.142857,17.285714,20.204143,1014.05,49.299236,664.871429,109.857143,110.285714,...,26.0,36.0,16.377686,1190.15498,2522.0,2884.0,3174.0,4829.2,35225.6,137.52
mp-1029828,18.75,140.5,207.75,24.4325,41.369475,1015.2,69.307441,1355.1,125.25,118.5,...,48.0,104.0,19.151163,1216.41328,6552.0,7184.0,7488.0,11565.6,63257.6,488.32


<class 'pandas.core.frame.DataFrame'>
Index: 2000 entries, mp-1013558 to mp-998778
Columns: 116 entries, ave:atomic_number to sum:Polarizability
dtypes: float64(116)
memory usage: 1.8+ MB
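
The `sum:` columns can be checked in the same way as the averages. The sketch below assumes that `sum:` is the elemental property summed over all atoms in the composition, which is inferred from the printed values rather than from the XenonPy source. The `PERIOD` table holds standard periodic-table periods for the elements of the first three samples only.

```python
# Standard periodic-table periods for the elements in the first three samples.
PERIOD = {"Ca": 4, "Bi": 6, "P": 3, "Lu": 6, "Ag": 5, "O": 2, "K": 4, "Mg": 3}


def weighted_sum(composition, prop):
    """Elemental property summed over every atom in the composition."""
    return sum(n * prop[el] for el, n in composition.items())


samples_head = {
    "mp-1013558": {"Ca": 3.0, "Bi": 1.0, "P": 1.0},
    "mp-1018025": {"Lu": 1.0, "Ag": 1.0, "O": 2.0},
    "mp-1019105": {"K": 2.0, "Mg": 2.0, "Bi": 2.0},
}

# Compare with the sum:period column above (21.0, 15.0, 26.0).
for material_id, composition in samples_head.items():
    print(material_id, weighted_sum(composition, PERIOD))
```

The column counts are also informative: two groups give 116 columns, or 58 per group, so the 290 columns produced by the default `'classic'` setting would correspond to five groups if every group contributes the same number of columns.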
